WinBatch® Technical Support Forum

All Things WinBatch => WinBatch => Topic started by: hdsouza on January 21, 2018, 06:19:01 AM

Title: Script execution speed
Post by: hdsouza on January 21, 2018, 06:19:01 AM
I have been using the traditional approaches to read a file and filter out the required lines (for example, filarray_LCExtract = ArrayFileGet(File_LCExtract), Handle_PreAnalysed = FileOpen(File_PreAnalysed, "READ"), etc.), and they have worked fine on normal-sized files.

But I have a CSV file with 400,000 lines, and each line has 20 values to extract and filter on.
This makes processing extremely long... hours. So I attempted using PowerShell. More specifically, I used System.IO.File.ReadLines() with PowerShell. It returns all the lines of a file but lets you begin iterating over the lines immediately, which means it does not have to store the entire contents in memory (https://msdn.microsoft.com/en-us/library/dd383503.aspx). This approach managed to get a total processing time of 9-12 seconds (for 100,000 lines).

I could call the PowerShell script from WinBatch, but I was thinking: can I do something similar inside WinBatch itself?

Here is the full Powershell script for reference:
Code (powershell) Select

$File_NotesDownload = "C:\temp\NotesDownload.csv"
$File_NotesGood = "C:\temp\NotesGood.txt"
if (Test-Path $File_NotesGood) { remove-item $File_NotesGood }
$Start = Get-Date

$r = [IO.File]::OpenText($File_NotesDownload)
while ($r.Peek() -ge 0)
    {
    $Note = $r.ReadLine()
    $Note = ($Note -replace '"','').ToLower()
    $Loan_Id = ($Note -split ',')[0].trim()
    $Loan_Id = $Loan_Id -as [int]
    if (($Loan_Id -is [int]) -eq $false) {Continue}

    $Loan_Status     = ($Note -split ',')[5].trim()
    # $Note is already lowercased above, so a plain Contains() test is enough
    if (-not $Loan_Status.Contains("current")) {Continue}

    $NeverLate       = ($Note -split ',')[13].trim()
    if ($NeverLate -ne "true") {Continue}

    # cast to a numeric type so -gt/-ne compare numbers rather than strings
    $Ask_price       = [float]($Note -split ',')[6].trim()
    if ($Ask_price -gt 20) {Continue}

    $Loan_Maturity   = [int]($Note -split ',')[15].trim()
    if ($Loan_Maturity -ne 36) {Continue}

    $Markup_Discount = [float]($Note -split ',')[7].trim()
    if ($Markup_Discount -gt -0.01) {Continue}

    $NotesList += $Note + "`r`n"
    }
$NotesList | out-file $File_NotesGood
$r.Dispose()
$End = Get-Date
$TotalTime = New-Timespan -Start $Start -End $End
$TotalTimeSecs = $TotalTime.TotalSeconds   # TotalSeconds, not Seconds (the component)
$TotalTimeSecs


Title: Re: Script execution speed
Post by: td on January 22, 2018, 07:02:43 AM
Reading files in blocks into memory instead of all at once is nothing new.  The concept has been around at least since computers first had disk drives.  It has little impact on performance on modern computers.  In fact, it is just as likely as not to slow things down a bit as it is to make processing proceed more quickly.  Also, the WinBatch FileOpen related functions read large files in blocks, just like the FCL stream class you are using in your PowerShell script.  So "store the entire contents in memory" is not the issue.

WinBatch for and while loops tend to be a bit slow.  This is imperceptible for loops that only iterate a few thousand times but can be noticeable when the iterations approach hundreds of thousands.  However, the difference you describe here is too large to attribute to just WinBatch loop slowness.  It suggests that there is something wrong with the way your WinBatch script is written.
Title: Re: Script execution speed
Post by: hdsouza on January 22, 2018, 03:22:38 PM
Nope, it's a simple For/While loop. See my previous message.
We are missing the point though. I am not saying that WinBatch is normally slow, but when you have 400,000 lines and each line has 20 values to filter on, that's when it's slow.

This is just one of the several options I have already tried
Code (winbatch) Select

filarray_LCExtract = ArrayFileGet(File_LCExtract)
Max_LCExtract = ArrInfo(filarray_LCExtract,1)
For AA_LCExtract = 1 to Max_LCExtract
      Line_LCExtract = strtrim(filarray_LCExtract[AA_LCExtract - 1])
      Line_LCExtract = strreplace(Line_LCExtract, '"', '')
      Loan_id              = strlower(strtrim(Itemextract(1, Line_LCExtract, ";")))
      Note_id              = strlower(strtrim(Itemextract(2, Line_LCExtract, ";")))
      Order_id             = strlower(strtrim(Itemextract(3, Line_LCExtract, ";")))
      Loan_Status          = strlower(strtrim(Itemextract(4, Line_LCExtract, ";")))
      Ask_price            = strlower(strtrim(Itemextract(5, Line_LCExtract, ";")))
     ;.. More filters and then followed by evaluation
Next


All I am asking is: is there anything similar, or can I call something like System.IO.File.ReadLines(), as in my previous post?
Thanks
Title: Re: Script execution speed
Post by: kdmoyers on January 23, 2018, 04:07:23 AM
it would be interesting to know how much time that spends doing the ArrayFileGet and how much time doing the loop.
-K
Title: Re: Script execution speed
Post by: hdsouza on January 23, 2018, 04:43:27 AM
I ran ArrayFileGet (as well as a Handle = FileOpen(File_name, "READ")) on just a fraction of the data, and after 30 minutes or so I terminated the job.
With the PS command I mentioned earlier, the 100,000-line file was done in 9 seconds.
Even using the regular file read operations in PowerShell ($Notes = Get-Content $File_NotesDownload) would have taken 20 minutes for 100,000 lines.

It's not just the number of lines (or the file size) but also the number of values that have to be extracted and evaluated in each line, which adds to the processing time.
Title: Re: Script execution speed
Post by: stanl on January 23, 2018, 04:53:55 AM
Seems like a choice has to be made. WB can run the PS script through the CLR, but I'm not sure of your validation steps.  For the record, I wrote a generic CSV parser in WB 2013 which reads parameters from a cfg file, opens a given CSV, and places select field names into Access tables.  Currently using it for data downloaded from MicroStrategy - and processing over 1 million rows daily.  The largest block is between 150,000 and 250,000 rows.  The WB script uses FileOpen(), FileRead() and ADO for recordset inserts.  Even 250,000 rows takes around 5 minutes, which is an acceptable coffee break.
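That FileOpen()/FileRead()/ADO pattern would look roughly like the sketch below. To be clear, this is only an illustration: the file, provider string, table and column names are hypothetical placeholders, and it uses Connection.Execute where the actual script uses ADO recordset inserts:

Code (winbatch) Select

; Sketch: a FileOpen/FileRead loop feeding ADO inserts (all names hypothetical).
File_Csv = "C:\temp\notes.csv"
cn = CreateObject("ADODB.Connection")
cn.Open("Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\temp\notes.accdb")
fh = FileOpen(File_Csv, "READ")
While @TRUE
   line = FileRead(fh)
   If line == "*EOF*" Then Break
   loan_id   = StrTrim(ItemExtract(1, line, ","))
   ask_price = StrTrim(ItemExtract(7, line, ","))
   cn.Execute("INSERT INTO Notes (LoanId, AskPrice) VALUES (":loan_id:",":ask_price:")")
EndWhile
FileClose(fh)
cn.Close()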

It is perhaps futile to have a WB versus PS debate - and I should know, having entered a few in the last couple of years.  Just .02
Title: Re: Script execution speed
Post by: hdsouza on January 23, 2018, 05:07:00 AM
Stan, you are right.
My intention was not to have a debate. All I was trying to find out was "Can System.IO.File.ReadLines() somehow be used from inside WinBatch?" If that's not possible, then I could just call the PS script from WB.
Title: Re: Script execution speed
Post by: td on January 23, 2018, 08:13:10 AM
Quote from: hdsouza on January 22, 2018, 03:22:38 PM
Nope, it's a simple For/While loop. See my previous message.
We are missing the point though. I am not saying that WinBatch is normally slow, but when you have 400,000 lines and each line has 20 values to filter on, that's when it's slow.

A couple of points need to be clarified.  WinBatch's implementation of looping structures tends to be a bit slower because it is interpreted and cannot be optimized the same way native or intermediate JIT-compiled code can be.  That only partially explains why your WinBatch script takes longer.

A second reason why your script takes longer is that you are using ItemExtract.  Each time you call it, it scans the entire line from the beginning.  You are scanning each complete line 7 times - counting the initial StrTrim and StrReplace.

I did some file load time benchmarking using the System.IO.File class in a WinBatch script, using the class's ReadLines and ReadLine methods.  Of course, the ReadLine method appears to be the fastest because the initial call only reads a small part of the file into process memory.  This is a bit misleading because the time taken to read the file is then spread out over the execution of the script as it processes the entire file, instead of being spent entirely at the beginning.

The WIL FileGet function is much slower than ReadLines, and the WIL BinaryRead function is about the same speed as the dotNet ReadLines method.  BinaryRead can consume a 90,000-record (85 MB) file in about .03 seconds on a modern system.  FileGet takes about 4 seconds.

Finally, there is nothing magical about System.IO.File.ReadLines.  It is likely splitting the task of reading the file across a couple of threads, which can have performance advantages under certain conditions but can also add critical-section overhead.  The fastest implementation of file parsing in pure WinBatch is to use the Binary functions for fast loading into memory and fewer passes over the data.  The speed is not likely to match your PowerShell script, but it may (or may not) be acceptable.
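For the curious, the dotNet route can also be taken from inside WinBatch via WIL's CLR hosting. A minimal sketch of the idea follows (assumptions: the .NET 4 CLR is installed, an instantiable StreamReader stands in for the static File class, and the file name is a placeholder):

Code (winbatch) Select

ObjectClrOption("version", "v4.0.30319")   ; host the .NET 4 runtime
objReader = ObjectClrNew("System.IO.StreamReader", "C:\temp\NotesDownload.csv")
While 1
   If objReader.Peek() < 0 Then Break      ; Peek returns -1 at end of file
   strLine = objReader.ReadLine()
   ; ... filter and parse strLine here ...
EndWhile
objReader.Close()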
Title: Re: Script execution speed
Post by: kdmoyers on January 23, 2018, 08:45:13 AM
can you allocate a binary buffer big enough to hold the whole thing?

if so, then a few lines of code could replace all those semicolons with commas, which might render the file readable by ArrayFileGetCSV.  That would probably not be any faster than ArrayFileGet, but it would make all the ItemExtracts unneeded.

or, skip the ArrayFileGet entirely and process via the BinaryTag functions.  By specifying LF and CR as start and end tags, a BinaryTag loop can scan a text file line-by-line.  It will take plenty of time to do 400K lines though; not sure if the net effect would be faster.
Title: Re: Script execution speed
Post by: JTaylor on January 23, 2018, 09:07:17 AM
I realize this isn't an answer to your question, but if you're looking for ways to optimize the WinBatch loop... you might consider using Arrayize() on the extracted line rather than the ItemExtract()s.  You would eliminate that part, and you could also eliminate the variable assignments.  For clarity you could assign the column indexes to variable names at the beginning:

      Loan_id              = 0
      Note_id              = 1
      Order_id             = 2
      Loan_Status       = 3
      Ask_price            = 4

and then in the loop

     If array[ask_price] > 45 Then....
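Put together, that suggestion might look something like this sketch (the column positions and the filter value are assumed purely for illustration):

Code (winbatch) Select

; Name the column positions once, outside the loop.
Loan_id     = 0
Note_id     = 1
Order_id    = 2
Loan_Status = 3
Ask_price   = 4

filarray = ArrayFileGet(File_LCExtract)
Max = ArrInfo(filarray, 1)
For i = 1 To Max
   line  = StrReplace(StrTrim(filarray[i-1]), '"', '')
   field = Arrayize(line, ";")            ; one pass instead of repeated ItemExtracts
   If field[Ask_price] > 45 Then Continue
   ; ... more filters and evaluation via field[Loan_Status], etc. ...
Next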


Also, I remember reading somewhere that the use of Continue can be a big slowdown as well.  No idea if that is still the case, or if I just dreamed it and it never was, but maybe Tony can comment on that issue.  If it is, maybe adjust the code so it falls through naturally.

Jim
Title: Re: Script execution speed
Post by: kdmoyers on January 23, 2018, 09:31:23 AM
Also, ugly tricks like the following can eliminate a big set of ItemExtracts, depending on your data:

Code (winbatch) Select
  a = "abc;big boy;123;querty;wxyz;87654;ax;ewrt;hhyyrf;big deal;1234"
  ; protect embedded spaces, turn the delimiters into spaces, then let
  ; ParseData split the whole line into param1..paramN in a single call
  parsedata(strreplace(strreplace(a,' ','~'),';',' '))
  message(param2, param10)   ; "big~boy" and "big~deal"
Title: Re: Script execution speed
Post by: JTaylor on January 23, 2018, 09:34:51 AM
Also meant to suggest that if you try ArrayFileGetCSV() (you shouldn't need to replace semicolons unless I have missed something), then rather than looping through the array, just use ArraySort(), find the break point, and then ArrayRedim() to remove the rows you don't want.  Sort on the next criterion, rinse and repeat.  Also, if you want to get fancy, rather than looping through the array looking for the breakpoint, just jump halfway (or whatever makes sense), check the value, and then jump back or forward as needed by a certain increment - see the sketch below.
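That "jump halfway" idea is a plain binary search. A sketch as a UDF, assuming the array is sorted ascending on a numeric column (names are placeholders):

Code (winbatch) Select

#DefineFunction FindBreak(arr, col, limit)
   ; Return the index of the first row whose value in column col
   ; exceeds limit; rows 0 .. FindBreak()-1 pass the filter.
   lo = 0
   hi = ArrInfo(arr, 1)
   While lo < hi
      mid = (lo + hi) / 2        ; WIL integer division
      If arr[mid, col] > limit Then
         hi = mid
      Else
         lo = mid + 1
      EndIf
   EndWhile
   Return lo
#EndFunction

; e.g. keep rows 0 .. FindBreak(arr, ask_price, 25) - 1, then ArrayRedim the rest away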

Jim
Title: Re: Script execution speed
Post by: td on January 23, 2018, 10:12:13 AM
Quote from: kdmoyers on January 23, 2018, 08:45:13 AM
can you allocate a binary buffer big enough to hold the whole thing?

You *might* be able to load up to about a 500MB or even larger file using BinaryRead.  It all depends on how much memory the script is using for other purposes.   
Title: Re: Script execution speed
Post by: td on January 23, 2018, 10:28:50 AM
Got around to using a more reliable bench test and corrected the times in the post above.

The results of comparing the ReadLines method (iterating its return object in a WIL ForEach loop) to FileOpen and FileRead in a WIL While loop are interesting.  The ReadLines loop processed 90,000 lines in about 1.2 seconds, while the FileRead loop took about 2.1 seconds.  The difference is likely attributable to the fact that FileRead does a lot of normalization checking on the file contents.
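For reference, the FileRead side of that comparison is the standard WIL idiom - something like this sketch (file name assumed):

Code (winbatch) Select

fh = FileOpen("C:\temp\NotesDownload.csv", "READ")
count = 0
While @TRUE
   strLine = FileRead(fh)
   If strLine == "*EOF*" Then Break   ; FileRead returns the literal string *EOF* at end of file
   count = count + 1
EndWhile
FileClose(fh)
Message("Lines read", count)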
Title: Re: Script execution speed
Post by: kdmoyers on January 23, 2018, 11:23:42 AM
Just for grins, I tried the BinaryTag loop on a 1-million-line log file I have lying around.
113 seconds, including doing ParseData on every line. Your file might be larger though.

Code (winbatch) Select

  f = "G:\COMMON\LOG\BOELOG.LOG" ; 81Mb size, 1 million lines

debugdata("","dbg:":"start")

  fs = filesize(f)          ; size of file
  bb=binaryalloc(fs+1)      ; reserve space
  binarypokestr(bb,0,@lf)   ; lead with a LF
  binaryreadEX(bb,1,f,0,fs) ;   followed by the file itself

  xx = 0 ; count chars
  yy = 0 ; count lines

debugdata("","dbg:":"loop") ; one second to read in 81Mb

  struc = binarytaginit(bb,@lf,@cr)
  while 1
    struc = binarytagfind(struc)
    if struc == "" then break ; all done
    line = binarytagextr(struc,0)

    parsedata(strreplace(strreplace(line,' ','~'),';',' '))

    xx = xx + strlen(line)
    yy = yy + 1

  endwhile

debugdata("","dbg:":"end") ; 112 seconds to scan 1M lines

  message(yy,xx)

exit
Title: Re: Script execution speed
Post by: td on January 23, 2018, 01:55:59 PM
Out of an abundance of curiosity, and having more time than good sense, I created a bench test using something similar to the OP's posted WinBatch script.  I used an old test text file with 6,166,632 lines.  For the test, I used a FileOpen handle with FileRead, ItemExtracts, StrTrims and StrReplaces inside the loop.  The script took almost exactly 9 mins to run.  That works out to about 11,419 lines per second, if my arithmetic can be trusted.
Title: Re: Script execution speed
Post by: hdsouza on January 23, 2018, 04:29:07 PM
Thanks Kd, appreciate the script.
Just two issues:
1) If I use struc = binarytaginit(bb_NotesDownload,@lf,@cr), it breaks out of the loop at the start.
If I use struc = binarytaginit(bb_NotesDownload,@lf,@lf), it skips every alternate line. (I believe the LF has already been used to extract the earlier line.) So it ends up doing half the lines.
2) Time taken for half the lines in the file = 24 seconds for 50,000 lines.
I tried to make this WBT code similar to the PS code. So, comparing apples to apples, compare that to 9 seconds for 100,000 lines.

I have saved the data file to https://we.tl/xdjgPq58t1

Code (winbatch) Select

File_LCExtract = "c:\temp\File_LCExtract.txt"
Handle_LCExtract = FileOpen(File_LCExtract,"write")
TimeStart_Test1 = TimeYmdHms()

File_NotesDownload = "C:\temp\SecondaryMarketAllNotes.csv"
fs_NotesDownload = filesize(File_NotesDownload)
bb_NotesDownload = binaryalloc(fs_NotesDownload+1)
binarypokestr(bb_NotesDownload,0,@lf)   ; lead with a LF
binaryreadEX(bb_NotesDownload,1,File_NotesDownload,0,fs_NotesDownload)

struc = binarytaginit(bb_NotesDownload,@lf,@lf)
;struc = binarytaginit(bb_NotesDownload,@lf,@cr)
while 1
    struc = binarytagfind(struc)
    if struc == "" then break ; all done
    Line_LCDownload = binarytagextr(struc,0)
    Line_LCDownload = strreplace(strreplace(Line_LCDownload, '"', ''), ',', ';')
    parsedata(strreplace(strreplace(Line_LCDownload,' ','#~#'),';',' '))
    Is_Number = IsNumber(param1)
    if Is_Number == 0 then continue

    Loan_Status          = strreplace(param6, '#~#', " ")
    if Loan_Status != "current" then Continue

    NeverLate            = param14
    if NeverLate != "true" then Continue

    Ask_price            = param7
    if Ask_price > 25 then Continue

    Loan_Maturity        = param16
    if Loan_Maturity != 36 then Continue

    Markup_Discount      = param8
    if Markup_Discount > -0.01 then Continue

    FileWrite (Handle_LCExtract, Line_LCDownload)
endwhile
BinaryFree(bb_NotesDownload)
fileclose (Handle_LCExtract)

TimeEnd_Test1 = TimeYmdHms() ; only for testing
TimeDiff_Test1 = TimeDiff( TimeEnd_Test1, TimeStart_Test1) ; only for testing
Message("Time Diff", TimeDiff_Test1)


I removed all the assignments (e.g., NeverLate = param14) and the time dropped from 24 secs to 21 secs.
Title: Re: Script execution speed
Post by: JTaylor on January 23, 2018, 07:33:31 PM
Your file does not have Carriage Returns... only Line Feeds.

Jim
Title: Re: Script execution speed
Post by: hdsouza on January 23, 2018, 07:43:56 PM
Quote from: JTaylor on January 23, 2018, 07:33:31 PM
Your file does not have Carriage Returns... only Line Feeds.

Jim
That's correct - that's the way this CSV file comes. I cannot change the source, as it's downloaded from a site.
I could add a CR to the end of each line, but that increases the processing time.
Title: Re: Script execution speed
Post by: JTaylor on January 23, 2018, 07:46:06 PM
A BinaryReplace() would probably be fast.  Just make sure you increase the buffer size as needed.

Jim
Title: Re: Script execution speed
Post by: hdsouza on January 23, 2018, 08:57:07 PM
Quote from: JTaylor on January 23, 2018, 07:46:06 PM
A BinaryReplace() would probably be fast.  Just make sure you increase the buffer size as needed.

Jim
That was a good one. Thanks Jim.
Got all the lines in now, but the time taken is still long --- 44 secs for 100,000 lines.
That includes the time taken by the first binary buffer operation --- 6 secs.

Code (winbatch) Select

File_LCExtract = "c:\temp\File_LCExtract.txt"
Handle_LCExtract = FileOpen(File_LCExtract,"write")
TimeStart_Test1 = TimeYmdHms()
;------------------------------------------
; adding a CR to the end of each line.. and removing unnecessary chars
File_NotesDownload = "C:\temp\SecondaryMarketAllNotes.csv"
File_NotesDownMod = "C:\temp\SecondaryMarketMod.txt"
Filedelete(File_NotesDownMod)

fs_NotesDownload = filesize(File_NotesDownload)
bb_NotesDownload = binaryalloc(fs_NotesDownload + 100000) ; extra room for the CRs BinaryReplace will add
BinaryRead( bb_NotesDownload, File_NotesDownload )

BinaryReplace( bb_NotesDownload, @lf, @crlf ,0)
BinaryReplace( bb_NotesDownload, '","', ';' ,0)
BinaryReplace( bb_NotesDownload, '"', '' ,0)

BinaryWrite(bb_NotesDownload, File_NotesDownMod)
BinaryFree(bb_NotesDownload)
;------------------------------------------

fs_NotesDownMod = filesize(File_NotesDownMod)
bb_NotesDownMod = binaryalloc(fs_NotesDownMod+1)
binarypokestr(bb_NotesDownMod,0,@lf)   ; lead with a LF
binaryreadEX(bb_NotesDownMod,1,File_NotesDownMod,0,fs_NotesDownMod)

struc = binarytaginit(bb_NotesDownMod,@lf,@cr)
while 1
    struc = binarytagfind(struc)
    if struc == "" then break ; all done
    Line_LCDownload = strlower(binarytagextr(struc,0))
    parsedata(strreplace(strreplace(Line_LCDownload,' ','#~#'),';',' '))

    if IsNumber(param1) == 0 then continue

    Loan_Status          = strreplace(param6, '#~#', " ")
    if Loan_Status != "current" then Continue
    if param14 != "true" then Continue
    if param7 > 25 then Continue
    if param16 != 36 then Continue
    if param8 > -0.01 then Continue

    FileWrite (Handle_LCExtract, Line_LCDownload)
endwhile
BinaryFree(bb_NotesDownMod)
fileclose (Handle_LCExtract)

TimeEnd_Test1 = TimeYmdHms() ; only for testing
TimeDiff_Test1 = TimeDiff( TimeEnd_Test1, TimeStart_Test1) ; only for testing
Message("Time Diff", TimeDiff_Test1)
Title: Re: Script execution speed
Post by: td on January 23, 2018, 09:04:43 PM
You don't need to modify the file to use BinaryTagInit.  For example:

Code (winbatch) Select
hBuf = BinaryAlloc(FileSize('C:\Temp\SecondaryMarketAllNotes.csv'))
BinaryRead(hBuf,'C:\Temp\SecondaryMarketAllNotes.csv')

;; Strip the column headers
Struct=BinaryTaginit(hBuf,'"', @Lf)
Struct=BinaryTagFind(Struct)

While 1
   Struct=BinaryTagFind(Struct)
   if Struct=="" then break
   strLine=binarytagextr(struct,0)
   parsedata(strreplace(strreplace(strline,' ','~'),';',' '))   
Endwhile


Using the above with your parsing produced a 22-second execution time on my system.  I think that could be reduced a bit, but since your only concern seems to be the speed of execution, you should stick with your PowerShell script. As mentioned previously, it is compiled before execution - at least as of v3.
Title: Re: Script execution speed
Post by: JTaylor on January 23, 2018, 09:44:08 PM
Yeah... I think I can beat 44 seconds (It's on Kirby  ;) ) if I can figure out why ArraySort() never ends on the third sort, though I'm not sure I will hit 9 seconds, although I may come close.  I thought going from terminating after 30 minutes down to 44 seconds was a viable option if you wanted to keep it native WinBatch, but if 9 seconds or better is the only acceptable solution then, as Tony says, you probably just want to go that route and we can stop hijacking your thread with other suggestions :)

Always like to help, but this also interested me as I do a HUGE amount of data processing and am always looking for ways to improve things, so thanks for the posting even though your thread has been somewhat hijacked.

Jim
Title: Re: Script execution speed
Post by: kdmoyers on January 24, 2018, 04:41:05 AM
<< I think I can beat 44 seconds (It's on Kirby >>
Go for it man, I'm out of ideas. :-)
Title: Re: Script execution speed
Post by: td on January 24, 2018, 08:36:17 AM
Quote from: JTaylor on January 23, 2018, 09:44:08 PM
Yeah... I think I can beat 44 seconds (It's on Kirby  ;) ) if I can figure out why ArraySort() never ends on the third sort, though I'm not sure I will hit 9 seconds, although I may come close.  I thought going from terminating after 30 minutes down to 44 seconds was a viable option if you wanted to keep it native WinBatch, but if 9 seconds or better is the only acceptable solution then, as Tony says, you probably just want to go that route and we can stop hijacking your thread with other suggestions :)

Always like to help, but this also interested me as I do a HUGE amount of data processing and am always looking for ways to improve things, so thanks for the posting even though your thread has been somewhat hijacked.

Jim

I find this type of algorithm analysis discussion endlessly fascinating.  I always learn something, and WinBatch enhancements have been known to come out of them.  For example, this discussion has provided some rationale for migrating the WIL interpreter to a JIT compiler model someday.

It will be interesting to see if anyone comes up with a faster algorithm.  I have gotten it down to about 20 seconds on my systems using the binary buffer + ParseData approach and the OP's dataset, but I suspect that more could be done.
Title: Re: Script execution speed
Post by: td on January 24, 2018, 08:54:19 AM
I take back the 20 seconds.  Using Arrayize and a few other stupid programmer tricks, I was able to get it down to a little under 10 seconds.  Of course, your results may vary depending on your hardware and a few other variables.
Title: Re: Script execution speed
Post by: JTaylor on January 24, 2018, 09:04:29 AM
Cool...thought Arrayize() might be a good option.

Is it expected for multiple ArraySort()s to get progressively slower and then crash WinBatch?

The first two are fast, as expected.  The third is MUCH slower, and the fourth eventually crashes WinBatch or gives an error about not being able to access the array contents.

Jim
Title: Re: Script execution speed
Post by: td on January 24, 2018, 09:10:22 AM
The error is usually the result of passing in a zero-element array but can be caused by other untoward events.  If that is not the case and you have a simple example that reproduces the problem, we will look into it a bit.
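If the zero-element case is a possibility, a guard along these lines avoids the error (a sketch; arr and col are placeholders):

Code (winbatch) Select

If ArrInfo(arr, 1) > 0 Then ArraySort(arr, @STRINGSORT, col, 1)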
Title: Re: Script execution speed
Post by: JTaylor on January 24, 2018, 09:23:32 AM
Here is what I was attempting.  Sounds like you have the data file; I just renamed it.  Thanks.

Jim

Code (winbatch) Select


File_LCExtract = ".\File_LCExtract.txt"
TimeStart_Test1 = TimeYmdHms()

File_NotesDownload = ".\td.csv"

#DefineFunction For_Loop(arr, filter, operator_x)

  rows = ArrInfo(arr,1)-1
  For x = rows to 1 by -1
    If arr[x,filter] %operator_x% Then
      If rows != x Then ArrayRedim(arr,x+1,-1,-1,-1,-1)
      x = 0
    EndIf
  Next
  Message(Filter:" ":ArrInfo(arr,1), ArrInfo(arr,2))

#EndFunction

arr = ArrayFileGetCSV(File_NotesDownload,0)
;Message(ArrInfo(arr,1), ArrInfo(arr,2))

ask_price       = 6
loan_status     = 5
neverlate       = 13
loan_maturity   = 15
markup_discount = 7

;   Is_Number = IsNumber(param1)
;   if Is_Number == 0 then continue

    ArraySort(arr, @LOGICALSORT, ask_price, 1)
    For_Loop(arr, ask_price, "<= 25")

    ArraySort(arr, @LOGICALSORT, markup_discount, 1)
    For_Loop(arr, markup_discount, "<= -0.01")

    ArraySort(arr, @STRINGSORT, loan_status, 1)
    For_Loop(arr, loan_status, '== "current"')

    ArraySort(arr, @STRINGSORT|@DESCENDING, loan_status, 1)
    For_Loop(arr, loan_status, '== "current"')

    ArraySort(arr, @STRINGSORT|@DESCENDING, neverlate, 1)
    For_Loop(arr, neverlate, "== 'true'")

    ArraySort(arr, @LOGICALSORT, loan_maturity, 1)
    For_Loop(arr, loan_maturity, "<= 36")

    ArraySort(arr, @LOGICALSORT|@DESCENDING, loan_maturity, 1)
    For_Loop(arr, loan_maturity, "== 36")


    ArrayFilePutCSV(File_LCExtract, arr)

TimeEnd_Test1 = TimeYmdHms() ; only for testing
TimeDiff_Test1 = TimeDiff( TimeEnd_Test1, TimeStart_Test1) ; only for testing
Message("Time Diff", TimeDiff_Test1)


Title: Re: Script execution speed
Post by: JTaylor on January 24, 2018, 09:38:28 AM
It is error #1637.  The behavior described above is very consistent.  I have even changed the order of the columns being sorted, but it doesn't seem to make a difference.

Jim
Title: Re: Script execution speed
Post by: td on January 24, 2018, 11:56:21 AM
I reduced the script to this line:

ArraySort(arr, @STRINGSORT|@DESCENDING, loan_status, 1)

to reproduce the problem.  The combination of the dataset size and the repetitive nature of the data in the column is more than the sort algorithm can handle.  The function is consuming all the process stack space.  The only fix is to completely rewrite the function using a different and slower algorithm.  Needless to say, this is not going to be a 'quick' fix.

Thanks for reporting the problem.
Title: Re: Script execution speed
Post by: td on January 24, 2018, 01:34:00 PM
Things are not as dire as originally thought, but MSFT's Spectre variant 1 bug fix makes debugging native code incredibly slow sometimes.

The problem only occurs for sorts in descending order, so the fix will be a lot simpler than stated above.
Title: Re: Script execution speed
Post by: hdsouza on January 24, 2018, 03:26:25 PM
Quote from: td on January 24, 2018, 08:36:17 AM
I find this type of algorithm analysis discussion endlessly fascinating.  I always learn something, and WinBatch enhancements have been known to come out of them.  For example, this discussion has provided some rationale for migrating the WIL interpreter to a JIT compiler model someday.

It will be interesting to see if anyone comes up with a faster algorithm.  I have gotten it down to about 20 seconds on my systems using the binary buffer + ParseData approach and the OP's dataset, but I suspect that more could be done.

Yes, execution speed is always an interesting topic. I have several loops which can use some of the ideas discussed, especially the params extraction.  I tend to like a status bar - I know it's slow, but I like to see what the script is doing.  Thanks all for your ideas.
Title: Re: Script execution speed
Post by: stanl on January 25, 2018, 03:35:33 AM
Having kept away from using large arrays, it was interesting to see the algorithms in operation. Unfortunately, my use of WB is to support linked or PowerPivot data into Excel templates. Some of these connection updates to MicroStrategy, SQL Server, or Oracle can take 20-30 minutes - but everyone is on the same page with that [querying billing systems for over 21 million customers].

But a final question: could the original PowerShell script the OP posted be converted to CLR code? And if so, how would it perform?
Title: Re: Script execution speed
Post by: JTaylor on January 25, 2018, 05:57:44 PM
Guess I should move on... the best I could do was 17 seconds :-(

Jim
Title: Re: Script execution speed
Post by: kdmoyers on January 26, 2018, 07:46:57 AM
Before we leave off, I'd like to request that larger brains than mine consider this performance request:
Code (winbatch) Select

;Presently, the BinaryTag loop works like this:

    struc = BinaryTagInit(bb,@lf,@cr)
    while 1
      struc = BinaryTagFind(struc)
      if struc == "" then break
 
    endwhile

;But it seems like it would be faster if it worked like this:

    struc = BinaryTagInit(bb,@lf,@cr)
    while BinaryTagFind(struc)
 
    endwhile

;I don't know if that's possible.
Title: Re: Script execution speed
Post by: JTaylor on January 26, 2018, 07:57:48 AM
...or maybe...

    ForEach struc in BinaryTagInit(bb,@lf,@cr,@TRUE)
        .....
    Next

The last parameter would be an optional one indicating whether it is a Collection or not, for backwards compatibility.

Jim
Title: Re: Script execution speed
Post by: kdmoyers on January 26, 2018, 10:57:18 AM
<< ForEach struc in ... >>

Oooohh, fancy!
Title: Re: Script execution speed
Post by: td on January 26, 2018, 01:49:47 PM
Either approach would certainly look prettier, but the performance gain would likely be negligible.  Pretty syntax has its place, but it also often hides ugly secrets...
Title: Re: Script execution speed
Post by: td on January 26, 2018, 01:53:38 PM
One possible compromise would be to allow assignment in while statements.

While struc = BinaryTagFind(struc)
   whatever
Endwhile

But that could be a bit of a challenge to implement.
Title: Re: Script execution speed
Post by: JTaylor on January 26, 2018, 07:44:41 PM
Figured there wouldn't be any gains on performance, but having a Collection would, I assume, open up some potentially useful possibilities in addition to being pretty.  In any event, I do hope you find time to implement the prior suggestions on the BinaryTag front.

Jim


Quote from: td on January 26, 2018, 01:49:47 PM
Either approach would certainly look prettier, but the performance gain would likely be negligible.  Pretty syntax has its place, but it also often hides ugly secrets...
Title: Re: Script execution speed
Post by: JTaylor on January 27, 2018, 10:08:40 AM
Guess I should clarify... these changes would be nice, but I meant the ones requested some time back in a different thread.

Jim
Title: Re: Script execution speed
Post by: hdsouza on January 28, 2018, 09:03:23 AM
There is a reference to .NET and WinBatch at http://techsupt.winbatch.com/webcgi/webbatch.exe?techsupt/nftechsupt.web+WinBatch/OLE~COM~ADO~CDO~ADSI~LDAP+WinBatch~and~Dot~NET.txt but it's not clear whether it should be working or not... or maybe I am not reading it correctly. I mention .NET as the initial PS code was .NET code in PS.

Also, while the PS code is still faster than WBT, if the input file is large and a lot of data is being written to the output file, that will slow things down. So should anyone try to use the code, here is why:
As we add more lines to $NotesList, the script takes longer and longer. I believe .NET is duplicating the entire array in memory, adding the new item, and deleting the old copy. It is explained here, along with a solution, but for now I cannot get it to work: https://blogs.technet.microsoft.com/ashleymcglone/2017/07/12/slow-code-top-5-ways-to-make-your-powershell-scripts-run-faster/
I went a different route, splitting $NotesList into smaller parts and then combining the parts together at the end with a FOR loop... although I am thinking the earlier solution may be better.

I'm not trying to pollute the WinBatch forum with PS code, but I noticed interest in this post for a quicker solution. So if someone should use the initial code, this may get better speeds (at least for now :) )

Code (powershell) Select

$File_Path = "C:\Temp\"
$File_NotesDownload = $File_Path  + "SecondaryMarketAllNotes.csv"
$File_NotesEval     = $File_Path  + "SecNotesEval.txt"
if (Test-Path $File_NotesEval)     { remove-item $File_NotesEval }

$Start = Get-Date
$r = [IO.File]::OpenText($File_NotesDownload)
$Count_EvalNotes = 0
$CountOutFile = 0
while ($r.Peek() -ge 0)
    {
    $Note = $r.ReadLine()
    $Note = ($Note -replace '"','').ToLower()
    $Loan_Id = ($Note -split ',')[0].trim()
    $Loan_Id = $Loan_Id -as [int]
    if (($Loan_Id -is [int])  -eq $false) {Continue}

    $NeverLate         = ($Note -split ',')[13].trim()
    if (($Loan_StatusValue -eq "current") -and ($NeverLate -ne "true")) {Continue}

    $Ask_price       = [float]($Note -split ',')[6].trim()
    if ($Ask_price -gt 25) {Continue}

    # several of the filters have been removed ......
    # several of the filters have been removed ......

    $Count_EvalNotes = $Count_EvalNotes + 1
    $ModCount_EvalNotes = $Count_EvalNotes/200
    $index = "$ModCount_EvalNotes".IndexOf(".")

    if (($index -eq -1) -or ($Count_EvalNotes -eq 1))
       {
       $CountOutFile = $CountOutFile + 1
       New-Variable -Name "List_$CountOutFile"
       }

    $NotesList = $Note + "`r`n"
    $val=Get-Variable -Name "List_$CountOutFile" -ValueOnly
    Set-Variable -Name "List_$CountOutFile" -value $val$NotesList
    }
for ($i=1; $i -le $CountOutFile; $i++)
    {
    $val=Get-Variable -Name "List_$i" -ValueOnly
    $val  >> $File_NotesEval
    }

$r.Dispose()
$End = Get-Date
$TotalTime = New-Timespan -Start $Start -End $End
$TotalTimeSecs = $TotalTime.TotalSeconds
$TotalTimeSecs

Title: Re: Script execution speed
Post by: stanl on January 28, 2018, 01:27:01 PM
Like I posted way back in the thread - make your choice. The pyrotechnics over arrays and processing speed are something that has always happened on this board, and for the betterment of WB and society. I have never been concerned with the speed of getting data, only with validation and error handling...   :o
Title: Re: Script execution speed
Post by: kdmoyers on January 30, 2018, 05:43:30 AM
Quote from: td on January 26, 2018, 01:53:38 PM
One possible compromise would be to allow assignment in while statements.

While struc = BinaryTagFind(struc)
   whatever
Endwhile

But that could be a bit of a challenge to implement.

Hmmm...

Maybe if BinaryTagFind took a pointer? It would operate directly on that parameter, modifying it, and it could return a simple true/false. So the code becomes:

Code (winbatch) Select
While BinaryTagFind(&struc)
   whatever
Endwhile


I dunno.  Probably not very important in the grand scheme! (smile)
Title: Re: Script execution speed
Post by: td on January 30, 2018, 06:26:33 AM
Ideas are always appreciated, so keep them coming, but pointers and references are expensive in WinBatch, so it certainly wouldn't be a performance enhancer.  It would be quite the opposite.  There are several other issues that I won't go into.
Title: Re: Script execution speed
Post by: kdmoyers on January 30, 2018, 08:56:59 AM
Quote from: td on January 30, 2018, 06:26:33 AM
pointers and references are expensive in WinBatch so it certainly wouldn't be a performance enhancer
Yup, you're right of course. I use them all the time, but I hardly ever do speed-sensitive code, so I had forgotten they are not quick.
Title: Re: Script execution speed
Post by: kdmoyers on January 30, 2018, 12:44:45 PM
I just realized I can write it myself.  Cool - it works! But you are right -- it's slower, not faster.
Code (winbatch) Select
#definefunction BinaryTagFindEx(pStructure)
  *pStructure = BinaryTagFind(*pStructure)
  return *pStructure != ""
#endfunction
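And a quick usage sketch, with bb holding the file as in the earlier examples:

Code (winbatch) Select

struc = BinaryTagInit(bb, @lf, @cr)
While BinaryTagFindEx(&struc)
   line = BinaryTagExtr(struc, 0)
   ; whatever
EndWhile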
Title: Re: Script execution speed
Post by: td on January 30, 2018, 01:36:11 PM
You have the UDF overhead too, but the reason references take more time than plain variables is that the interpreter has to parse the pointer and make two trips to the variable instead of one.  In the vast majority of scripts this overhead is insignificant, but like everything else, it adds up in a tight loop.