wbOmnibus Extender - RegEx

Started by JTaylor, May 26, 2024, 07:46:03 AM


JTaylor

I have added an snRegEx() function to the Extender.  Let me know if you find any problems.  I know very little about RegEx and used it more writing this function than I have used it all the times before, combined.

I have also split the Extenders out into their own individual DLLs.  The combined one is still there, but it will probably not see much new functionality (I did add snRegEx()).  It was becoming extremely difficult to add new functionality due to conflicts, and the immense number of functions and constants was starting to become a problem as well.  There are benefits to both options, but I didn't feel like I had much choice on this one.

http://www.jtdata.com/anonymous/wbOmnibus.zip

Jim

cssyphus

Thanks Jim - I'll be looking at this with great interest. Perfect timing for a challenging project.

JD

JTaylor

I will be very interested in any feedback or suggestions you have to offer.

Jim

spl

Jim, not sure if this is a suggestion or a tangent. Your function is solely for 'replace'. One interest I have with Regex is validating various date formats, i.e. mm-dd-yyyy, mm/dd/yyyy, dd-mm-yyyy.....  The snippet below is pretty useless, but given a regex source [could be a list, an array, a map] specific to dates [but could be anything] and an input source, could a regex function be created to validate the input against the regex source? I'm thinking about a switch, select, or case schema.... As you can see from the snippet, the oReg .NET variable is specific only to pattern, although all dates in the list are valid against one of the patterns. It could be modified to work, but with very ugly coding and little practical value outside of the snippet.
pattern = '^\d{4}-\d{2}-\d{2}'
pattern1 = '^(3[01]|[12][0-9]|0?[1-9])(\/|-)(1[0-2]|0?[1-9])\2([0-9]{2})?[0-9]{2}'
pattern2 = '(0[1-9]|1[0-2])(\/|-)(0[1-9]|[12][0-9]|3[01])(\/|-)(19|20)\d{2}'
pattern3 = '(0[1-9]|[12][0-9]|3[01])(\/|-)(0[1-9]|1[0-2])(\/|-)(19|20)\d{2}'
pattern4 = '^\d{1}/\d{2}/\d{4}'
pattern5 = '^\d{2}.\d{2}.\d{4}'
pattern6 = '^(January|February|March|April|May|June|July|August|September|October|November|December)\s\d{1,2},\s\d{4}'

ObjectClrOption( 'useany', 'System')
oReg = ObjectClrNew('System.Text.RegularExpressions.Regex',pattern)
oReg.CacheSize = ObjectType("ui2",30)

dateStrings = '2024-05-24;5/24/2024;05/24/2024;05-24-2024;24/5/2024;05.24.2004;May 24, 2024'

For i = 1 to ItemCount(dateStrings,";")
   dt = ItemExtract(i,dateStrings,";")
   m = oReg.IsMatch( dt )
   If m==0
      Display(2,"%dt%","DOES NOT MATCH ":pattern)
   Else
      Display(2,"%dt%","MATCHES ":pattern)
   Endif
Next

oReg = 0
Exit
Stan - formerly stanl [ex-Pundit]

JTaylor

I couldn't really think of anything to add apart from the Find and Replace options, other than Count.   Anything I could think of could be accomplished with those two functions.  Of course, I don't know much about RegEx so could be missing something obvious.

In regards to your question, assuming I understand, are you wanting to submit the entire list and all the expression options and then check against all the possible Expressions and Dates?  Currently, you could easily loop through and check them.   If it found a match it would return that date and if not, it would return a blank.  It would look very much like your code (I didn't test):

; assumes dateStrings is a ";"-delimited list and patternStrings is @CR-delimited
For i = 1 to ItemCount(dateStrings,";")
  dt = ItemExtract(i,dateStrings,";")
  For j = 1 to ItemCount(patternStrings,@CR)
    ex = ItemExtract(j,patternStrings,@CR)
    m = snRegEx(dt,ex)
    If m == ""
       Display(2,"%dt%","DOES NOT MATCH ":ex)
    Else
       Display(2,"%dt%","MATCHES ":ex)
    Endif
  Next
Next

Am I following or missing the point? 

Jim

JTaylor

Finished it and Tested it:


ps = '^\d{4}-\d{2}-\d{2}':@CR
ps = ps:'^(3[01]|[12][0-9]|0?[1-9])(\/|-)(1[0-2]|0?[1-9])\2([0-9]{2})?[0-9]{2}':@CR
ps = ps:'(0[1-9]|1[0-2])(\/|-)(0[1-9]|[12][0-9]|3[01])(\/|-)(19|20)\d{2}':@CR
ps = ps:'(0[1-9]|[12][0-9]|3[01])(\/|-)(0[1-9]|1[0-2])(\/|-)(19|20)\d{2}':@CR
ps = ps:'^\d{1}/\d{2}/\d{4}':@CR
ps = ps:'^\d{2}.\d{2}.\d{4}':@CR
ps = ps:'^(January|February|March|April|May|June|July|August|September|October|November|December)\s\d{1,2},\s\d{4}'

dateStrings = '2024-05-24;5/24/2024;05/24/2024;05-24-2024;24/5/2024;05.24.2004;May 24, 2024'

nm = "DID NOT MATCH: ":@LF
dm = "DID MATCH: ":@LF
For i = 1 to ItemCount(dateStrings,";")
  dt = ItemExtract(i,dateStrings,";")
  For j = 1 to ItemCount(ps,@CR)
    ex = ItemExtract(j,ps,@CR)
    m = snRegEx(dt,ex)
    If m == ""
       nm = nm:"P%j% - D%i% - ":dt:@LF
    Else
       dm = dm:"P%j% - D%i% - ":dt:@LF
    Endif
  Next
Next
Message("DM",dm)
Message("NM",nm)

spl

I gave the documentation for the regex function only a brief look. So snRegEx(dt,ex) is equivalent to snRegEx(dt,ex,"F"), i.e. a basic match; the function is really cool. I have to work a little more on the regex for dates: change the month day, year pattern to accept both full and abbreviated month names, and include datetime stamps, GMT datetimes...

Then set up the regex as either JSON or a map, and hopefully come up with a switch construct with a final default of "Not a Date" to avoid looping through the regex as a list. Your function removes having to play with .NET or ObjectCreate("VBScript.RegExp"). The goal would be to validate a date string candidate from a SQL query, CSV, REST query, etc.

and not just dates... more a regex lookup library for IP addresses, email address, phone # [ad infinitum]
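
A minimal sketch of such a lookup, assuming two parallel @TAB-delimited lists stand in for the map and that snRegEx(input,pattern) returns an empty string on no match, as in the loops earlier in this thread. The names, the candidate value, and the deliberately loose patterns are made up for illustration:

```winbatch
AddExtender("wbOmnibus.dll")

; Parallel @TAB-delimited lists: item i of names labels item i of patterns
; (hypothetical entries; the IPv4 pattern does not range-check each octet)
names    = "ISO date":@TAB:"US date":@TAB:"IPv4 address"
patterns = "^\d{4}-\d{2}-\d{2}$":@TAB:"^\d{1,2}/\d{1,2}/\d{4}$":@TAB:"^(\d{1,3}\.){3}\d{1,3}$"

candidate = "192.168.1.10"
result = "Not recognized"   ; default when no pattern matches

For i = 1 To ItemCount(patterns, @TAB)
   If snRegEx(candidate, ItemExtract(i, patterns, @TAB)) <> ""
      result = ItemExtract(i, names, @TAB)
      Break
   EndIf
Next

Message(candidate, result)
```

@TAB is used as the delimiter because it is unlikely to appear inside a pattern, which sidesteps the special-character clashes that commas or semicolons can cause.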
Stan - formerly stanl [ex-Pundit]

JTaylor

Correct.  I made "Find" the default action.

To expand completely it would be equivalent to

        snRegEx(dt,ex,"F","",@TAB,-1)

Basic Match/Find, returning a @TAB delimited list of all Matches.
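
A minimal sketch of walking that @TAB-delimited result, assuming the six-argument call behaves as described above. The argument list is copied from this post and not checked against the extender documentation; the sample string and pattern are made up for illustration:

```winbatch
AddExtender("wbOmnibus.dll")

; Hypothetical input containing two ISO-style dates
str = "Start 2024-05-24, end 2024-06-01"
ex  = "\d{4}-\d{2}-\d{2}"

; Fully expanded form of snRegEx(str,ex), as given above
matches = snRegEx(str, ex, "F", "", @TAB, -1)

out = ""
If matches <> ""
   ; Walk the @TAB-delimited list of matches
   For k = 1 To ItemCount(matches, @TAB)
      out = out:k:": ":ItemExtract(k, matches, @TAB):@LF
   Next
Else
   out = "No matches"
EndIf

Message("Matches", out)
```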

Jim

spl

Well....
Maps are a bust - too many special characters get in the way.
Switch/Select asks for an integer value to select from, so pretty useless.

Below I did create a list and tried a simple For...Next loop, but there is something wrong with the code. If date were set to 2024-05-15 the loop would catch it, but 5/15/2024 fails, even though I added an extra step to prove it should have worked. AAARRRGGGHHH!!!
AddExtender("wbOmnibus.dll")

datepatterns = $"^\d{4}-\d{2}-\d{2};
^(3[01]|[12][0-9]|0?[1-9])(\/|-)(1[0-2]|0?[1-9])\2([0-9]{2})?[0-9]{2}$;
(0[1-9]|1[0-2])(\/|-)(0[1-9]|[12][0-9]|3[01])(\/|-)(19|20)\d{2};
(0[1-9]|[12][0-9]|3[01])(\/|-)(0[1-9]|1[0-2])(\/|-)(19|20)\d{2};
^\d{1}/\d{2}/\d{4}$;
^\d{2}.\d{2}.\d{4}$;
^(January|February|March|April|May|June|July|August|September|October|November|December)\s\d{1,2},\s\d{4}$;
^(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s\d{1,2},\s\d{4}$;
^\d{1}-\d{2}-\d{4}$$"

cnt = ItemCount(datepatterns,";")
Display(2,"pattern count",cnt)

date = "5/15/2024"  ;should validate with pattern # 5                   

retval =""
For i = 1 to cnt
   ex = ItemExtract(i,datepatterns,";")
   m = snRegEx(date,ex)
   If m<>""
      retval = date:" is valid with regex":@LF:ex
      Break
   Endif
Next
If retval=="" Then retval=date:" not a valid date format"

Message("Results",retval)  ;should show as not valid

ex = ItemExtract(5,datepatterns,";")
pattern = '^\d{1}/\d{2}/\d{4}$'
m = snRegEx(date,pattern)
Message(date:" now valid",m:@LF:ex:@LF:pattern)  ;but valid if pattern is hard-coded

Exit
Stan - formerly stanl [ex-Pundit]

JTaylor

Yep.  Something is wrong.   I will see if I can get it sorted out shortly.

Jim

JTaylor

Something weird is going on in the script, but I can't see it.

If you do this:      ex = StrTrim(ItemExtract(j,ps,";"))

it will work.  At least it did for me.   

Without the Trim there is a blank space being introduced at the beginning of each Expression but I cannot figure out why.   I did a replace for @CRLF and it didn't help (which needs to be done anyway).  If I join all the lines in one row, it works.

Maybe there is a bug with the $" line continuation feature?  Tony, thoughts?

Upside for me is that the Function is Functioning and the StrTrim() easily overcomes the problem, assuming you don't need a blank space at the beginning or end of the expression  :-)

Jim

JTaylor

I just verified.  The $" feature is adding a blank space to the end of every line.  After StrReplace() on @CRLF to remove that the blank space will appear at the beginning of the expression since we break on the ";". 

Jim

spl

The ^.......$ are anchors, normally wouldn't affect a variable sent as a single date, but would be needed in cases where validating a date in a larger string. But even removing them the code failed and failed as well with StrTrim(). Threw in the towel and went back to a concatenated list so the code below worked with anchors. Probably best approach is to store the regex as .csv and load as an array.
AddExtender("wbOmnibus.dll")

datepatterns = "^\d{4}-\d{2}-\d{2}$;"
datepatterns = datepatterns:"^(3[01]|[12][0-9]|0?[1-9])(\/|-)(1[0-2]|0?[1-9])\2([0-9]{2})?[0-9]{2}$;"
datepatterns = datepatterns:"^(0[1-9]|1[0-2])(\/|-)(0[1-9]|[12][0-9]|3[01])(\/|-)(19|20)\d{2}$;"
datepatterns = datepatterns:"^(0[1-9]|[12][0-9]|3[01])(\/|-)(0[1-9]|1[0-2])(\/|-)(19|20)\d{2}$;"
datepatterns = datepatterns:"^\d{1}/\d{2}/\d{4}$;"
datepatterns = datepatterns:"^\d{2}.\d{2}.\d{4}$;"
datepatterns = datepatterns:"^(January|February|March|April|May|June|July|August|September|October|November|December)\s\d{1,2},\s\d{4}$;"
datepatterns = datepatterns:"^(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s\d{1,2},\s\d{4}$;"
datepatterns = datepatterns:"^\d{1}-\d{2}-\d{4}$"

cnt = ItemCount(datepatterns,";")
date = "Jan 15, 2024"                     

retval =""
For i = 1 to cnt
  ex = ItemExtract(i,datepatterns,";")
  m = snRegEx(date,ex)
  If m<>""
      retval = date:"^ is valid with regex":@LF:ex
      Break
  Endif
Next
If retval=="" Then retval=date:"^ not a valid date format"

Message("Results",retval)

Exit
Stan - formerly stanl [ex-Pundit]

spl

Now it seems I have to review the use of anchors. If I keep them on, finding a date in a larger string will fail, but with them removed the date is found [for a given pattern] within a string. I'll have to see if anchors have failures on rubular???
AddExtender("wbOmnibus.dll")

datepatterns = "\d{4}-\d{2}-\d{2};"
datepatterns = datepatterns:"(3[01]|[12][0-9]|0?[1-9])(\/|-)(1[0-2]|0?[1-9])\2([0-9]{2})?[0-9]{2};"
datepatterns = datepatterns:"(0[1-9]|1[0-2])(\/|-)(0[1-9]|[12][0-9]|3[01])(\/|-)(19|20)\d{2};"
datepatterns = datepatterns:"(0[1-9]|[12][0-9]|3[01])(\/|-)(0[1-9]|1[0-2])(\/|-)(19|20)\d{2};"
datepatterns = datepatterns:"\d{1}/\d{2}/\d{4};"
datepatterns = datepatterns:"\d{2}.\d{2}.\d{4};"
datepatterns = datepatterns:"(January|February|March|April|May|June|July|August|September|October|November|December)\s\d{1,2},\s\d{4};"
datepatterns = datepatterns:"(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s\d{1,2},\s\d{4};"
datepatterns = datepatterns:"\d{1}-\d{2}-\d{4}"

cnt = ItemCount(datepatterns,";")
str = "Today is May 28, 2024 and it is sunny."  ;find date in string                
;str = "5/28/2024"   ;find specific date
retval =""
For i = 1 to cnt
   ex = ItemExtract(i,datepatterns,";")
   m = snRegEx(str,ex)
   If m<>""
      retval = str:" has a date":@LF:m
      Break
   Endif
Next
If retval=="" Then retval=str:@LF:"Does Not Contain a Date"

Message("Results",retval)
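
The anchor behaviour can be checked side by side. This is a minimal sketch assuming snRegEx(input,pattern) returns the matched text or an empty string, as in the scripts above; the variable names are illustrative. Under standard regex semantics `^...$` requires the whole input to be the date, so the anchored form would be expected to find nothing in a sentence, while the unanchored form would be expected to find the embedded date:

```winbatch
AddExtender("wbOmnibus.dll")

str = "Today is May 28, 2024 and it is sunny."

; One core pattern, used with and without anchors
core  = "(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s\d{1,2},\s\d{4}"
whole = "^":core:"$"   ; validate: the entire string must be a date

m1 = snRegEx(str, whole)   ; anchored
m2 = snRegEx(str, core)    ; unanchored

Message("Anchors", "Anchored: [":m1:"]":@LF:"Unanchored: [":m2:"]")
```

Keeping the stored patterns unanchored and adding `^` and `$` only when validating a standalone value lets one pattern list serve both jobs.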
Stan - formerly stanl [ex-Pundit]

JTaylor

Does this work for you?   I am not sure where things are going wrong for you.  As far as I know I am using your data and it works.



ps = $"^\d{4}-\d{2}-\d{2};
^(3[01]|[12][0-9]|0?[1-9])(\/|-)(1[0-2]|0?[1-9])\2([0-9]{2})?[0-9]{2}$;
(0[1-9]|1[0-2])(\/|-)(0[1-9]|[12][0-9]|3[01])(\/|-)(19|20)\d{2};
(0[1-9]|[12][0-9]|3[01])(\/|-)(0[1-9]|1[0-2])(\/|-)(19|20)\d{2};
^\d{1}/\d{2}/\d{4}$;
^\d{2}.\d{2}.\d{4}$;
^(January|February|March|April|May|June|July|August|September|October|November|December)\s\d{1,2},\s\d{4}$;
^(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s\d{1,2},\s\d{4}$;
^\d{1}-\d{2}-\d{4}$$"

ps=StrReplace(StrReplace(ps,@CR,""),@LF,"")

dateStrings = '2024-05-24;5/24/2024;05/24/2024;05-24-2024;24/5/2024;05.24.2004;May 24, 2024'

nm = "DID NOT MATCH: ":@LF
dm = "DID MATCH: ":@LF
For i = 1 to ItemCount(dateStrings,";")
  dt = ItemExtract(i,dateStrings,";")
  For j = 1 to ItemCount(ps,";")
    ex = StrTrim(ItemExtract(j,ps,";"))
    m = snRegEx(dt,ex)
;message(dt,ex:@CRLF:@CRLF:"Result: ":m)
    If m == ""
       nm = nm:"P%j% - D%i% - ":dt:@LF
    Else
       dm = dm:"P%j% - D%i% - ":dt:@LF
    Endif
  Next
Next
Message("DM",dm)
Message("NM",nm)

spl

It probably does, but I have moved beyond hard-coding in the script [stupid code, my fault to begin with] to a data-driven array.

working on expressions for GMT times and DateTime stamps:

2024-05-28T12:30:00Z
\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[1-2]\d|3[0-1])T(?:[0-1]\d|2[0-3]):[0-5]\d:[0-5]\dZ

2024:05:28:08:31:46
\d{4}:\d{2}:\d{2}:\d{2}:\d{2}:\d{2}

Outside my tangents working through a complete date regex, your function works super-fast. Regex is not so simple when it comes to leading 0's in dates.  So both 5/9/2024 and 05/09/2024 can be matched by either
(3[01]|[12][0-9]|0?[1-9])(\/|-)(1[0-2]|0?[1-9])\2([0-9]{2})?[0-9]{2}

or

\d{1,2}/\d{1,2}/\d{4}

Regardless, the next step after determining a string/sub-string is a valid date is to convert into either a date or timestamp value for analysis or manipulation.
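
A minimal sketch of that conversion for one format only, assuming the string has already been validated as m/d/yyyy or mm/dd/yyyy in US month-first order. The variable names are illustrative, and the time portion is simply set to midnight:

```winbatch
; Assumes dt was already validated as m/d/yyyy or mm/dd/yyyy (month first)
dt = "5/9/2024"

; Split on "/" and left-pad month and day to two digits
mm = StrFixLeft(ItemExtract(1, dt, "/"), "0", 2)
dd = StrFixLeft(ItemExtract(2, dt, "/"), "0", 2)
yy = ItemExtract(3, dt, "/")

; Build the YYYY:MM:DD:HH:MM:SS form used by WinBatch's time functions
ymd = yy:":":mm:":":dd:":00:00:00"

Message("WinBatch datetime", ymd)
```

Each validated format would need its own split, since the regex alone cannot tell whether 5/9/2024 is month-first or day-first.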
Stan - formerly stanl [ex-Pundit]

JTaylor

Okay.   Just let me know if you find something that doesn't work right in the function.

Jim

spl

I'll beat the dead horse one more time. I don't believe WB can initialize a blank array, and I went with fileread() to insert array elements as other array functions didn't appear viable [and I stand to be corrected]. But after the initial set up your function worked perfectly and extremely fast.

And maybe off-base here but can the ArrayToStr() be made to write out the elements with @CRLF?
filename = "c:\temp\datepatterns.txt"
If ! FileExist(filename) Then Terminate(@TRUE,"File Not Found",filename)
AddExtender("wbOmnibus.dll")
dateStrings = '2024-05-24;5/24/2024;05/24/2024;05-24-2024;24/5/2024;05.24.2004;May 24, 2024;January 1, 2024'
array=ArrDimension(1)
ArrInitialize(array, "")
h= FileOpen(filename,"READ")
i=1
While @TRUE ; Loop till break do us end
   line = FileRead(h)
   If line == "*EOF*" Then Break
   ArrayInsert(array,i,1,line)
EndWhile
FileClose(h)

;have to remove initial blank element from array               
rowcount = ArrInfo(array,1)
colcount = ArrInfo(array,2)
Display(2,"Initial Array","Rows:":rowcount:@LF:"Columns:":colcount)
ArrayRemove( array, 0) 
rowcount = ArrInfo(array,1)
colcount = ArrInfo(array,2)
Display(2,"Adjusted Array","Rows:":rowcount:@LF:"Columns:":colcount)
;

retval = ""
For i = 1 to ItemCount(dateStrings,";")
   dt = ItemExtract(i,dateStrings,";")
   found=""
   For element = 0 To rowcount-1     
       m = snRegEx(dt,array[element])
       If m<>""
         found = dt:" valid ":array[element]:@LF
         retval = retval:found
         Break
      Endif
   Next
   If found=="" Then retval = retval:dt:" not valid ":@LF
Next
   Message("Pattern Matches",retval)
exit
Stan - formerly stanl [ex-Pundit]

JTaylor

ArrInitialize (array, value)?


Jim

JTaylor

Glad to hear it.   I was actually surprised at how fast the function worked as well and always a bonus when it works correctly.

Jim

JTaylor

Not sure why the ArrayToStr() question wouldn't work.  Should be easy to test.

jim

JTaylor

Why not use ArrayFileGet()?

spl

Quote from: JTaylor on May 30, 2024, 09:39:10 AM
Why not use ArrayFileGet()?

Yep. Missed that.
filename = "c:\temp\datepatterns.txt"
If ! FileExist(filename) Then Terminate(@TRUE,"File Not Found",filename)
AddExtender("wbOmnibus.dll")
array = ArrayFileGet (filename , "" , 0)
rowcount = ArrInfo(array,6)
dateStrings = '2024-05-24;5/24/2024;05/24/2024;05-24-2024;24/5/2024;05.24.2004;May 24, 2024;January 1, 2024'

retval = ""
For i = 1 to ItemCount(dateStrings,";")
   dt = ItemExtract(i,dateStrings,";")
   found=""
   For element = 0 To rowcount-1     
       m = snRegEx(dt,array[element])
       If m<>""
         found = dt:" valid ":array[element]:@LF
         retval = retval:found
         Break
      Endif
   Next
   If found=="" Then retval = retval:dt:" not valid ":@LF
Next
   Message("Pattern Matches",retval)
exit
Stan - formerly stanl [ex-Pundit]