Char Replacement In A String

Started by KeithW, August 21, 2018, 09:00:28 PM


KeithW

Greetings,

I was looking through the String ("Str") functions for a good equivalent to Mid(), where I have a long master control string, might extract (StrSub) 5 characters out of the master, and then return 5 characters to the same position in the master string. The count coming out and the count going back will never differ, and I do not need any padding on either the master or the substring.

I did not want to have to bust the master into pieces, as I know where the patch points are....
Some languages allow a Mid()-type function on either side of the "=" to determine whether it is a get or a put.

None of the Str functions really looked like an easy, straightforward patching function. Case is not an issue either.

What would you use?

Regards,
Keith

td

Your description of your task is open to several interpretations because of the reference to splitting a string on either side of an equal sign, and what is "easy and straightforward" is more a matter of preconception.

For the most part, WIL string functions do not alter the input string. They alter and return a copy of the original per your instructions, so you don't have to worry about restoring a "master" string: it is never altered unless you assign the altered string back to the variable containing the "master" string. If you know the position and length of a substring, and the contents of the substring are not unique, use the StrSub function to build a new string. For example,

Code (winbatch) Select
; Creates a string without the character '3'
strOrig = '12345'
strNew  = StrSub(strOrig, 1, 2):StrSub(strOrig, 4, -1)
Message('Example', 'Original: ':strOrig:@crlf:'New: ':strNew)


If you don't know the location, but the substring is known and unique within the string, use StrReplace to replace the targeted substring with an empty string ("").

Code (winbatch) Select
strNew = StrReplace(strOrig, '3', '')


In WIL you can use the Item* functions to remove parts of a string based on a single character delimiter.
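For instance, ItemRemove can drop a single delimited item; a minimal sketch, assuming a comma-delimited list and a known item number:

Code (winbatch) Select
; Remove the third comma-delimited item from a list
strList = 'alpha,beta,gamma,delta'
strNew  = ItemRemove(3, strList, ',')
Message('Example', 'Original: ':strList:@crlf:'New: ':strNew)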
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

KeithW

The data string (Master) contains about 42 fields in CSV encoding...
26 of the columns are numeric on each record.
The generating program represents negative numbers with a trailing minus sign.
The receiving program does not understand trailing negatives and crashes, thinking character data is in a numeric field.

For a master string of 100 chars, say positions 81-84 are "123-" (strings in reality are more like 1500-2000 chars).
I simply want to replace positions 81-84 with "-123".

There are some date formatting issues (4 columns) as well; however, the fixing works exactly like the negative fix above.
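In other words, something like this minimal sketch (strMaster is a stand-in name; the column numbers are illustrative):

Code (winbatch) Select
; Rebuild the line from the pieces around the field
strField = StrSub(strMaster, 81, 4)  ; e.g. "123-"
If StrSub(strField, 4, 1) == "-"
   strField = "-" : StrSub(strField, 1, 3)
EndIf
strMaster = StrSub(strMaster, 1, 80) : strField : StrSub(strMaster, 85, -1)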

Keith

td

If your data string is from a file, take a look at the "ArrayFileGetCsv" and "ArrayFilePutCsv" functions. If your data is not from a file, then you may find "ItemExtractCsv" and "ItemCountCsv" useful. If you end up using the latter, the simplest approach might be to loop through the fields, appending them to a new string and modifying them as necessary.

A simple example of the second approach:

Code (winbatch) Select
; Does not handle embedded quotes.
strCsv = '"My","dog","has","fleas"'
nMax = ItemCountCsv(strCsv, 0)
strOut = ''
for i = 1 to nMax
   strField = ItemExtractCsv(i, strCsv, 0)
   if strField == 'dog' then strOut := '"cat"'
   else strOut := '"':strField:'"'
   if i < nMax then strOut := ','
next
next
Message('Example', 'Original: ':strCsv:@crlf:'New: ':strOut)
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

KeithW

Well just to add insult to injury....

The initial file is about 1.2 million lines and expected to grow up to about 2 million lines in the next couple months.

Second, there are embedded quotes, at least in the headers, and I was told that both commas and quotes can appear
in the file. I have not run across any in the small subset of records that I am testing with, so I cannot say whether
they are correct about commas and quotes in the data fields.

If the data line has to be rebuilt for each field substitution (30 fields * 1.2 million lines, plus the 12 other fields that
are carried from the input line to the output line), I am guessing it will take a substantial amount of compute time to
process each file.

Fields are variable length, including null, so the line has to be processed comma to comma/EOL.

td

If you are using the latest version of 64-bit WinBatch, size may not be an issue with the array functions, and embedded quotes can be handled with a little additional coding. The array functions are faster, but you can also get around the size and performance issues by loading only the part of the file you need (assuming you don't need the entire file), using FileOpen and friends with the Item* functions.

On the other hand, preconceptions may be a roadblock more difficult to overcome.
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

notds

If it's CSV, you can break it all down using unusual delimiters, even if you have quotes and whatnot, then put it back together later.

I've crunched data in the past this way pretty easily by using StrReplace and list functions with simple loops.

Code (winbatch) Select

yourdata = FileGet( "c:\someplace\some.csv" )

yourdata = StrReplace( yourdata, @CRLF, "Ü" )
yourdata = StrReplace( yourdata, @CR, "Ü" )
yourdata = StrReplace( yourdata, @LF, "Ü" ) ; Create an odd delimiter for each line and avoid confusion with line endings across platforms

yourdata = StrReplace( yourdata, ",", "¿" ) ; use if comma delimited
;yourdata = StrReplace( yourdata, @TAB, "¿" ) ; use if tab delimited

numlines = ItemCount( yourdata, "Ü" )
totalitems = ItemCount( yourdata, "¿" )

For x = 1 to numlines ; ItemExtract is 1-based

   currline = ItemExtract( x, yourdata, "Ü" )
   currlineitems = ItemCount( currline, "¿" )

   For y = 1 to currlineitems

      curritem = ItemExtract( y, currline, "¿" )

      ; Do some stuff to each item if you need an if statement per item
      Display( 1, "x=":x:" y=":y, "Data = ":curritem )

   Next

Next



KeithW

Unfortunately, I do need the entire file, all 1.2 million lines !!
All lines have the date issues (4 columns)....
Any line (up to 26 columns) can have the trailing negative issue....

As this is the first run, I am collecting the length
of each column to determine the database field sizing needs...

I have an array currently set up that collects some stats
;**********************************************
;  ARRAY Properties
;  stats[0,0] I/P File Name
;  stats[0,1] I/P Total Lines
;  stats[0,2] No of Header Fields in first row
;  ---------------------------------------------
;  stats[1,0] Field Header Name
;  stats[1,1] Width of Header Name Field
;  stats[1,2] Max Data Field Width in Column
;  ---------------------------------------------
;  Above Repeats for MAXDIM -1 Columns in file
;  To handle more columns simply increase MAXDIM
;***********************************************
stats = ArrDimension(maxdim, 3)


And while I have the subroutines to actually do the fixing (just not the replacing back into the O/P line), merely
reading the line, analyzing each column, and writing the line "as-is", WITHOUT actually fixing anything, is taking
"way too long" to rip against the entire file. The run time will be far too long!

I have the compiler, will I pick up any appreciable performance by a compiled version?

Also, I have not had much luck getting aStatusBar to behave nicely. I am currently driving it as a percentage of
the file's bytes: while I do know the line count of this file, I did not want to have to modify the code for each run,
so I set it to the file size and the number of I/P bytes read.


------ current code follows --------
;Read through file line by line, parse fields within each line.
;Debug(1) ; Turn On Debugger
;***************************
; Load Appropriate Extender
AddExtender("wilx44i.dll",0,"wilx64i.dll") ; xGetElapsed
AddExtender("wwsop34i.dll",0,"wwsop64i.dll") ; aStatusBar
;***************************
;*********
infile  = "D:\CLC_Test\AR_ATB_JUNE_BFS1.csv"                                       ; 10,000 data lines
outfile = "D:\CLC_Test\AR_ATB_JUNE_BFS1-Fixed.csv"
logfile = "D:\CLC_Test\AR_ATB_JUNE_BFS1-Log.txt"
;*********
;infile  = "D:\CLC_Test\TestFile.csv"                                                         ; 50 data lines
;outfile = "D:\CLC_Test\TestFile-Fixed.csv"
;logfile = "D:\CLC_Test\TestFile-Log.txt"
;trxfile = "D:\CLC_Test\TestFile-Trx.txt"

;************
; VARIABLES
;************
EOLN     = 2 ; 0=None, 1=CR/LF, 2=LF, 3=TAB ; IntControl(53....)
fpos     = 0 ; I/P file processed byte count
fsize    = 0 ; I/P file byte count
icount   = 0 ; lines read
ocount   = 0 ; lines written
lcount   = 0 ; Log File Elements Written
maxdim   = 51 ; data array dimensions 0-Based, 0= controls, 1-n equal fields
cdim     = 0 ; current field dimension being processed
fldpos   = 0 ; field position within line
colval   = 0 ; math variable for column position
line     = "" ; data line read
field    = "" ; data field being analyzed/fixed
lnlen    = 0 ; Data "line" length
cbeg     = 0 ; Col starting "field" in "line"
cend     = 0 ; Col ending "field" in "line"
fldlen   = 0 ; Length of "field"
fldnotxt = "" ; fldno as a character field for use with
fldnolen = 3 ; number of chars in substr field numbers - MUST MATCH FIX FIELD STRINGS !!

; Fields to be fixed - surround with commas (,) 3 digits, leading "0" if less than "99" !!!
fixdates = ",010,011,013,016," ; fixing dates in "yyyymmdd" to "mmddyyyy" !!!

fixneg   = ",017,018,019,020,021,022,023,024,025,026,027,028,029,030," ; flip trailing negatives
fixneg   =  fixneg: "031,032,033,034,035,036,037,038,039,040,041,042," ; additional columns...

;**********************************************
;  ARRAY Properties
;  stats[0,0] I/P File Name
;  stats[0,1] I/P Total Lines
;  stats[0,2] No of Header Fields in first row
;  ---------------------------------------------
;  stats[1,0] Field Header Name
;  stats[1,1] Width of Header Name Field
;  stats[1,2] Max Data Field Width in Column
;  ---------------------------------------------
;  Above Repeats for MAXDIM -1 Columns in file
;  To handle more columns simply increase MAXDIM
;***********************************************
stats = ArrDimension(maxdim, 3)
     
;***********************************************
; BEGIN PROCESSING
time1 = GetExactTime()
IntControl(53, EOLN, 0,0,0) ; Set EOL for FileWrite Command
fsize = FileSize(infile, 1) ; 0=float, 1=Huge
ErrorMode(@OFF)
FileDelete(outfile) ; Remove existing O/P file if it exists
;FileDelete(trxfile) ; Remove existing Trx file if it exists
TimeDelay(1) ; Allow For File Deletion
ErrorMode(@Cancel)
inhandle  = FileOpen(infile, "READ") ; Open I/P File
outhandle = FileOpen(outfile, "WRITE") ; Open O/P File
;trxhandle = FileOpen(trxfile, "WRITE") ; Open Trx File

aStatusBar(0,"File Progress","Initializing",fsize,fpos)  ; Setup StatBar

; Begin Read Loop
While @TRUE ; Loop till "break" do us end
line = FileRead(inhandle)
If line == "*EOF*" Then Break
icount = icount +1
Lnlen = StrLen(line) ; Length of Data Line Read

; ;*****************************
; ; TEST CODE
; ;*****************************
If icount > 10001 ; Testing 2=Header Only, >2 Includes Data
; Debug(1)
Goto PGMEXIT
EndIf
;***************************************
; commas=""
; lstpos=1
; ;Debug(1)
; fldpos = StrScan(line, ",", lstpos, @FWDSCAN) ; find next comma
; WHILE fldpos > 0
; commas=commas :fldpos :", "
; lstpos=fldpos+1
;    fldpos = StrScan(line, ",", lstpos, @FWDSCAN) ; find next comma
; ENDWHILE
; message("Seperators For Line %icount% - (%LnLen%)",commas)
; ENDWHILE
; EXIT
;***********End Test Code***********

;*****************
; MAIN LOGIC
;*****************
;Debug(1)
GoSub Get_Fields ; Start Line Parsing
GoSub StatsLog
;*****************
; WRITE OUTPUT
;*****************
FileWrite(outhandle, line)
ocount = ocount +1
fpos=fpos+StrLen(line)
if icount mod 10000 == 0 ; StatusBox update every 10,000 lines (byte count rarely lands on an exact multiple)
aStatusBar(1,"File Progress","Curr_Pos: %fpos%",fsize,fpos)
Endif
EndWhile

:PGMEXIT
FileClose(inhandle)
FileClose(outhandle)
;FileClose(trxhandle)

gosub StatsLog ; Write Out Stats Array Logfile.

time2 = GetExactTime()
rc = xGetElapsed(time2, time1)

aStatusBar(2,"File Progress","Current Position",fsize,fpos)
Message("Processing Complete","Lines Read:     %icount%%@CRLF%Lines Written: %ocount%%@CRLF%Elapsed: %rc%%@CRLF%----------  H: M: S. hh")

EXIT ; END OF PROGRAM

;*************
; SUBROUTINES
;*************
:Get_Fields
;**************; Variables
cbeg=0 ; first char in field
cend=0 ; last char in field
fldlen=0 ; length of field
fldpos=0 ; position in line
lstpos=1 ; start of line OR last position in line
fldno=0 ; working on fieldno
fldnotxt="" ; fldno as a character field for use with
flag=@true ; Main While Loop Control

WHILE Flag ; Main Loop - Top
fldpos = StrScan(line, ",", lstpos, @FWDSCAN) ; find next comma
fldno = fldno +1 ; set field no
fldlen=0 ; Clear FldLen

colval = fldpos -lstpos
If colval == 0  && lnlen > lstpos
lstpos = fldpos+1 ; Empty Field
TrxLog = "L%icount%: F%fldno% = %Fldlen% Char(s)"
;FileWrite(trxhandle, TrxLog) ; TRXLOG - Debug
Continue
EndIf

colval = fldpos -lstpos
If fldpos == 0  && lstpos <= lnlen
fldpos = lnlen+1 ; Last Field On Line
EndIf

cbeg=lstpos
cend=fldpos-1

field = StrSub(line,cbeg,cend-cbeg+1) ; Extract Field
fldlen=StrLen(field) ; Set Length Of Field
fldnotxt = StrFixCharsL(fldno,"0",fldnolen) ; Make 3 char FldNo's

col = StrIndexNC(fixdates, fldnotxt, 1, @FWDSCAN) ; Date Fields
If col > 0 && icount > 1
;gosub DateField ; Flip A Date field
; restore fixed field
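; One possible splice back into the line (sketch only), assuming cbeg/cend
; still index this field in "line" and the fixed field keeps its length:
;    line = StrSub(line, 1, cbeg-1) : field : StrSub(line, cend+1, -1)
; The same splice would follow the NegField call below.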
EndIf

col = StrIndexNC(fixneg, fldnotxt, 1, @FWDSCAN) ; Trailing Negative Fields
If col > 0 && icount > 1
;gosub NegField ; Flip A Trailing Negative
; restore fixed field
EndIf

;**********************************
;*   UPDATE ARRAY
;**********************************
If iCount == 1 ; Header Line
Stats[fldno,0] = field ; Store Field Header
Stats[fldno,1] = fldlen ; Length of Field Header
Stats[fldno,2] = 0 ; Widest Field For Header Item
Stats[0,2] = fldno ; Update Max Fields in Headers
ELSE ; ALL Data Line past Header
   If fldlen > Stats[fldno,2]
   Stats[fldno,2] = fldlen ; Update Data Field Width if bigger
EndIf
EndIF
TrxLog = "L%icount%: F%fldno% = %Fldlen% Char(s)"
;FileWrite(trxhandle, TrxLog) ; TRXLOG - Debug

lstpos = fldpos +1 ; Step Scan Pointer
;message("lstpos -vs- lnlen for Field", "%lstpos%  vs  %lnlen%  -  %fldno%")

If lstpos >= lnlen
    Break
EndIf

ENDWHILE ; Main Loop - Bottom
RETURN

;***************************************************************
; FIX & FLIP Routines
;***************************************************************
; Flip Dates in "yyyymmdd" to "mmddyyyy" format
:DateField
field = StrSub(field,5,4) : StrSub(field,1,4) ; mmdd : yyyy
Return
;***************************************************************
; Flip Trailing Negative Signs to the Front of the field
:NegField
if StrSub(field,fldlen,1) == "-" ; check the last char of the field itself
   field = "-" : StrSub(field,1,fldlen-1)
EndIf
Return
;***************************************************************
;Write Stats to Log File
:StatsLog
;Debug(1)
ErrorMode(@OFF) ; Shut off Error Trapping
FileDelete(logfile) ; Remove existing Log file if it exists
TimeDelay(1) ; Allow For File Deletion
ErrorMode(@CANCEL) ; Restore Normal Error Handling

IntControl(53, 1, 0,0,0) ; Set EOL for FileWrite Command To CR/LF
Stats[0,0]=infile ; Set I/P File (w/ Path)
Stats[0,1]=fsize ; Set I/P File Length in Bytes

ArrayFilePutCSV(logfile, Stats, "|",@TRUE,2)     ; Write out 2 Dim Array, no quote processing
Return

td

As previously mentioned, the fastest way to process a CSV file is to use ArrayFileGetCSV.  If you use the latest version of 64-bit WinBatch on a system with sufficient virtual memory, you should be able to process a file up to about 2GB in size.   The other advantage to using ArrayFileGetCSV is that you don't need to parse the individual fields and reconstruct each record.  You just need to know which fields to check and optionally modify.
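A rough sketch of that approach (the flag value, column positions, and header assumption are guesses to be checked against the help file and the actual data):

Code (winbatch) Select
; Sketch only - load the CSV, fix trailing minus signs in place, write it back out.
arrData = ArrayFileGetCSV(infile, 0)       ; 0 = default flags (assumption)
nRows = ArrInfo(arrData, 1)                ; number of rows in dimension 1
For r = 1 To nRows-1                       ; row 0 assumed to be the header
   For c = 16 To 41                        ; 0-based numeric columns (illustrative)
      sField = arrData[r, c]
      nLen = StrLen(sField)
      If nLen > 1
         If StrSub(sField, nLen, 1) == "-"
            arrData[r, c] = "-" : StrSub(sField, 1, nLen-1)
         EndIf
      EndIf
   Next
Next
ArrayFilePutCSV(outfile, arrData, ",")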

That said, any time you iterate through a million records, it is going to take time. There is no magic pixie dust that will change that.
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

td

I take that back. The maximum file size that ArrayFileGetCSV can handle is closer to ~350MB. While the array itself can hold up to about 2GB, you will run out of string space before that because the fields will be stored as strings.
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

KeithW

The I/P is 385MB in pure CSV format... no extra White Space.

I compiled the script I posted and ran it on the first 10,000 lines of the main file... it took 2 hr 53 min to consume that.
That would make the whole file take longer than a week to process... not going to fly unless what I am doing is "so wrong" that
I would get more than a 100-to-1 improvement with properly optimized code.

Keith

td

Compiling a script does not improve performance to any great degree. 3 hours to process 10,000 records seems a bit excessive. I didn't spend a lot of time looking at your script, but that kind of timing suggests inefficiencies. It could also be that the system you are running the script on is not providing a lot of resources to the script. I have a WIL script on my workstation that parses many thousands of lines in C and C++ files, collecting bits of information, formatting them, and then writing them to another C++ source code file. It runs to completion in less than 5 seconds. Of course, my workstation is not your typical desktop workstation either.

However, I somehow doubt that any amount of optimization or experimentation will get you what you want.  You will need to find that magical pixie dust someplace else.
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

KeithW

I appreciate the time you have spent. I used WB a bit more several years ago but only get to pull it out a couple of times a year now due to where my workload has gone. I expected my attempt to be far less than optimal, but the difference is not something that works for me, and the time to figure it out isn't really there either.

Luckily for me, I did find the "pixie dust" you spoke of, with the help of a friend, allowing me to do the conversion at warp speed:
CSV to CSV in ~12 minutes, and finally CSV to SQLite3 in ~15 minutes... saving my bacon this time around and introducing me
to a tool designed specifically for this kind of work, as it knew about the dates and trailing minus signs, all built in.

Thanx again for trying to assist.
Regards,
Keith

td

I was thinking about suggesting SQLite or Excel but forgot to for some reason.  Glad you found a solution.
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

KeithW

File was too big for Excel and I was hoping not to split it...
The tool I found did (in 15 mins) generate a SQLite3 database; that is the way I ended up going when all was said and done.

stanl

Quote from: KeithW on August 24, 2018, 08:45:16 AM
File was too big for Excel and I was hoping not to split it...
The tool I found did (in 15 mins) generate a SQLite3 database; that is the way I ended up going when all was said and done.

Excel Power Query can easily break the million-row limit. I was unsure whether you wanted to go from .csv --> database or .csv --> converted .csv. I would have suggested an ADO fabricated recordset and GetString(), but SQLite is probably the fastest possibility.
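For reference, a bare-bones sketch of the fabricated-recordset idea (the field names and values are illustrative; the ADO constants are written as literal numbers):

Code (winbatch) Select
; Build a connection-less ADO recordset, add a row, then dump it as text with GetString.
rs = ObjectCreate("ADODB.Recordset")
rs.Fields.Append("Account", 200, 50)   ; 200 = adVarChar, width 50
rs.Fields.Append("Amount",  200, 50)
rs.Open()
rs.AddNew()
rs.Fields("Account").Value = "12345"
rs.Fields("Amount").Value  = "-123"
rs.Update()
rs.MoveFirst()
strOut = rs.GetString(2, -1, ",", @CRLF, "")   ; 2 = adClipString, -1 = all rows
rs.Close()
ObjectClose(rs)
Message("ADO GetString", strOut)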

td

A straightforward task like this is well within the capabilities of SQLite and the price is right.
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

KeithW

Quote from: stanl on August 24, 2018, 09:41:41 AM
Quote from: KeithW on August 24, 2018, 08:45:16 AM
File was too big for Excel and I was hoping not to split it...
The tool I found did (in 15 mins) generate a SQLite3 database; that is the way I ended up going when all was said and done.

Excel Power Query can easily break the million-row limit. I was unsure whether you wanted to go from .csv --> database or .csv --> converted .csv. I would have suggested an ADO fabricated recordset and GetString(), but SQLite is probably the fastest possibility.

Have no idea what Excel Power Query is... just know that Excel 2013 could not load the whole file. 
In any case the file is now SQLite and accessible that way.

notds

Code (winbatch) Select


WHILE Flag ; Main Loop - Top
fldpos = StrScan(line, ",", lstpos, @FWDSCAN) ; find next comma
;FileWrite(trxhandle, TrxLog) ; TRXLOG - Debug
field = StrSub(line,cbeg,cend-cbeg+1) ; Extract Field





The lines above interest me. While you can accomplish the same thing with your code, sometimes using different methods and functions can speed up your script. Some functions are slower than others.

I've found arrays to be slow at times.

A good example here is the StrScan function. You're asking WinBatch to search character by character for something... it's kind of slow. Using the ItemExtract function is much quicker.

Also, how often does the script write to the file? You could shave off time if it's doing a write operation on every loop cycle. Not sure if it is; I didn't look at your code that deeply.
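A rough sketch of both ideas, assuming no embedded commas inside quoted fields and that IntControl(53, 0, 0, 0, 0) is set so FileWrite does not append its own line ending (the 5000-line flush interval is arbitrary):

Code (winbatch) Select
; Sketch only: ItemExtract instead of StrScan, plus buffered output.
; "buffer" is assumed to start as an empty string before the read loop.
nFields = ItemCount(line, ",")
newline = ""
For f = 1 To nFields
   field = ItemExtract(f, line, ",")
   ; ... apply the date / trailing-minus fixes to "field" here ...
   If f == 1
      newline = field
   Else
      newline = newline : "," : field
   EndIf
Next
buffer = buffer : newline : @CRLF
If icount mod 5000 == 0       ; flush every 5000 lines instead of every line
   FileWrite(outhandle, buffer)
   buffer = ""
EndIf
; (any leftover buffer still needs one final FileWrite after the read loop)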



td

"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

KeithW

Thanx, I will give it a look.

Keith

stanl

Quote from: KeithW on August 24, 2018, 10:49:42 AM
Have no idea what Excel Power Query is... just know that Excel 2013 could not load the whole file. 
In any case the file is now SQLite and accessible that way.


It comes ready to go with 2016 and is an add-on for 2013. Microsoft has another product called Power BI, and you get a bit of it for free. I've worked with both, as well as Tableau, and their sample file is 10 million rows.


https://blogs.technet.microsoft.com/josebda/2017/02/12/loading-csvtext-files-with-more-than-a-million-rows-into-excel/


td

Someone still has to foot the bill for Office, but that is beside the point. The blog post is about as on-topic as it gets. Thanks.
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

KeithW

Stan,

With Power Query ... can you edit (modify) the data  OR  only generate Pivot Tables off the data being read?

Keith

stanl

Quote from: KeithW on August 27, 2018, 06:50:31 PM
Stan,

With Power Query ... can you edit (modify) the data  OR  only generate Pivot Tables off the data being read?

Keith


At the risk of taking this thread entirely off topic... in 2016, Power Query has been renamed Get & Transform. It has a query language, DAX, and can probably do what you have in mind, but with a bit of a learning curve. As you already have a solution, you should probably keep Power Query as a 'nice to know'. My direct work with it has been transforming CSV data into a Power Map visualization for product installations in select counties in North Carolina.


Sorry I cannot be of more help, but for the learning curve:


https://stackoverflow.com/questions/39632894/whats-the-difference-between-dax-and-power-query-or-m