At work I started receiving text files larger than 2 GB, and since the parsing was going into Access tables, that is a no-go. I found a PowerShell script that was able to break up a 2.1 GB file into smaller chunks in 32 seconds. Below I have set out code to replicate the PS in WB's CLR hosting. I won't be able to test since I need to use a compiled exe, so to save the back and forth I just want opinions on the code:
;Winbatch 2018B - CLR Stream Reader
;=================================================================================
file = dirscript():"myfile.txt"
rootname = dirscript():"split"
lines = 100000
filecount=1
ObjectClrOption("useany","System")
oRdr = ObjectClrNew('System.IO.StreamReader')
oRdr.OpenText(file)
While ! oRdr.EndOfStream
   linecount=0
   oWrite = oRdr.CreateText(rootname:"_":filecount:".txt")
   While linecount < lines & ! oRdr.EndOfStream
      oWrite.Writeline(oRdr,Readline())
   EndWhile
   oWrite.Dispose()
   filecount=filecount+1
EndWhile
oWrite.Dispose()
oRdr.Dispose()
oReader=0
Exit
It would appear that both filecount and linecount need incrementing but only one is. What is Readline()?
64-bit WinBatch binary buffers can be as large as 2147483647 bytes so using binary buffers might be a faster alternative.
I cannot get past
oRdr.OpenText(file)
Says it is an unknown name. Tried several things but to no avail.
Jim
Quote from: td on April 08, 2019, 07:34:43 AM
It would appear that both filecount and linecount need incrementing but only one is. What is Readline()?
64-bit WinBatch binary buffers can be as large as 2147483647 bytes so using binary buffers might be a faster alternative.
Linecount increment was my bad - and it should have been oRdr.ReadLine() (bad eyes on the comma).
Files will get to 4 GB. Don't binary buffers require loading the entire file first, then parsing? I thought 32 secs was pretty fast.
No on the Binary. I am stripping out some code from my XML_Splitter app as an example. I have split 40gb files with it.
Jim
Quote from: JTaylor on April 08, 2019, 07:53:35 AM
I cannot get past
oRdr.OpenText(file)
Says it is an unknown name. Tried several things but to no avail.
Jim
Yeah. The actual PS line is: $reader = [io.file]::OpenText($filename)
I often find the MSFT docs for classes confusing. Maybe you have to refer to a text type prior to calling OpenText().
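In .NET, OpenText() is a static method of System.IO.File rather than a member of a StreamReader instance, which is presumably why the CLR reports an unknown name. The most direct WinBatch equivalent is probably to skip OpenText() and hand the path straight to the StreamReader constructor:
oRdr = ObjectClrNew('System.IO.StreamReader', file)   ; the constructor opens the file itself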
Not as short as yours but doesn't care about file size. Does a 1.6 gig file in 10 seconds when split into 50mb chunks.
You may see things that seem odd such as root_node_offset or item_node. That applied to my XML splitting. I just set root_node_offset to zero and left it there as I didn't want to mess things up by trying to remove it. The item_node I set to @LF to find the end of a line rather than an end of a node.
Again, this was a quick strip job so check the data and make sure it isn't creeping in a bad way.
Jim
Quote from: stanl on April 08, 2019, 08:00:34 AM
Files will get to 4 GB. Don't binary buffers require loading the entire file first, then parsing? I thought 32 secs was pretty fast.
I was assuming that your 32 secs was a Powershell timing. WinBatch will likely not be that fast because you are using WIL loop structures which tend to be a bit slower. With binary buffers, you wouldn't necessarily need to scan every line in the file and thus significantly reduce the number of loop iterations. It does require a bit more coding.
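For example, something roughly along these lines - an untested sketch rather than a drop-in solution, with a placeholder input name and chunk size, and with the Binary* parameter lists worth double-checking against the WIL help (for source files past the 2 GB mark the offset arithmetic would also need the huge-math extender):
; Untested sketch: read the source in ~50 MB slabs, back each slab up to the last @LF
; so no line is split across files, then write the slab out as the next piece.
chunk     = 50 * 1024 * 1024
infile    = dirscript():"myfile.txt"
rootname  = dirscript():"split"
hBuf      = BinaryAlloc(chunk)
offset    = 0                         ; current read position in the source file
filecount = 1
While @TRUE
   bytes = BinaryReadEx(hBuf, 0, infile, offset, chunk)
   If bytes == 0 Then Break           ; nothing left to read
   If bytes == chunk                  ; full slab - trim back to the last line break
      cut = BinaryIndexEx(hBuf, bytes-1, @LF, @BACKSCAN, @TRUE)
      If cut >= 0 Then bytes = cut+1  ; keep the @LF, push the partial line into the next slab
   EndIf
   BinaryEodSet(hBuf, bytes)          ; only write the trimmed portion
   BinaryWrite(hBuf, rootname:"_":filecount:".txt")
   offset = offset + bytes
   filecount = filecount + 1
EndWhile
BinaryFree(hBuf)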
Quote from: td on April 08, 2019, 08:38:07 AM
I was assuming that your 32 secs was a Powershell timing.
Yes, it used System.Diagnostics.Stopwatch
Completely off topic but sometimes it's necessary to time performance with times too small to detect reliably using milliseconds. In such cases something like the following is useful.
AddExtender("wwhug44i.dll",0,"wwhug64i.dll")   ; huge-math extender: 64-bit integer arithmetic on the counter values
hStart = BinaryAlloc(8)                        ; 8-byte buffers to receive the 64-bit counters
hFinish = BinaryAlloc(8)
hKernel32 = DllLoad('kernel32.dll')
DllCall(hKernel32, LONG:'QueryPerformanceCounter',lpbinary:hStart)
;;; Do something useful
DllCall(hKernel32, LONG:'QueryPerformanceCounter',lpbinary:hFinish)
start_time=BinaryPeek8(hStart, 0)
finish_time=BinaryPeek8(hFinish, 0)
Elapse_time = huge_Subtract(finish_time,start_time)   ; elapsed high-resolution ticks
hFreq = BinaryAlloc(8)
DllCall(hKernel32, LONG:'QueryPerformanceFrequency',lpbinary:hFreq)
Freq_time=BinaryPeek8(hFreq, 0)                       ; ticks per second
BinaryFree(hFreq)
Seconds = huge_divide(Elapse_time,Freq_time)          ; ticks / frequency = seconds
BinaryFree(hStart)
BinaryFree(hFinish)
DllFree(hKernel32)
Again, this was fun. The code below worked on a sample file. If [as in the past] I worked in an environment where everything could be done in WB, this thread would not exist. But I have to produce both PS scripts and WB exes. I prefer the latter because there is so much more I can do from experience and familiar functions; getting the same out of PS would be a real learning curve. That being said...
;Winbatch 2018B - CLR Stream Reader
;=================================================================================
path = dirscript()
file = dirscript():"net\products.csv"
rootname = dirscript():"split"
lines = 50
filecount=1
ObjectClrOption("useany","System")
oRdr = ObjectClrNew('System.IO.StreamReader',file)
While ! oRdr.EndOfStream
   linecount=0
   outfile = rootname:"_":filecount:".txt"
   oWrite = ObjectClrNew('System.IO.StreamWriter',outfile)
   While linecount < lines & ! oRdr.EndOfStream   ; copy up to 'lines' lines into this chunk
      oWrite.WriteLine(oRdr.ReadLine())
      linecount=linecount+1
   EndWhile
   oWrite.Dispose()
   filecount=filecount+1
EndWhile
oWrite.Dispose()
oRdr.Dispose()
oRdr=0
Exit
So THAT is how you do it. Was trying similar things but couldn't quite get the right combination. Thanks for posting the solution.
Jim
Quote from: JTaylor on April 08, 2019, 12:27:26 PM
So THAT is how you do it. Was trying similar things but couldn't quite get the right combination. Thanks for posting the solution.
Jim
I prefer Bombay Gin to MSFT any day :D
Belgian Dark Strong Ale would be my preference.
I am more of a Mr. Pibb guy myself.
Jim
This adds a stopwatch. The Elapsed property (a TimeSpan structure) was giving an interop not-compatible error, but ElapsedMilliseconds appears to work.
;Winbatch 2018B - CLR Stream Reader
;=================================================================================
path = dirscript()
file = dirscript():"net\products.csv"
rootname = dirscript():"split"
lines = 50
filecount=1
ObjectClrOption("useany","System")
oRdr = ObjectClrNew('System.IO.StreamReader',file)
oTime = ObjectClrNew('System.Diagnostics.Stopwatch')
oTime.Start()
While ! oRdr.EndOfStream
   linecount=0
   outfile = rootname:"_":filecount:".txt"
   oWrite = ObjectClrNew('System.IO.StreamWriter',outfile)
   While linecount < lines & ! oRdr.EndOfStream
      oWrite.WriteLine(oRdr.ReadLine())
      linecount=linecount+1
   EndWhile
   oWrite.Dispose()
   filecount=filecount+1
EndWhile
oWrite.Dispose()
oRdr.Dispose()
oTime.Stop()
Message('Process Time',oTime.ElapsedMilliseconds/1000:" Seconds")
oTime=0
oRdr=0
Exit
Working from memory so I may be way off target but I believe the CLR expects the TimeSpan structure to be hosted on the process stack of the procedure using it. MSFT doesn't provide an acceptable mechanism for doing that in COM-based, machine-instruction code like WinBatch. We periodically look for a clever way around this in the CLR and FCL source code but the only solutions so far have undesirable side effects.
I didn't try it, but ElapsedTicks / Frequency is supposed to be an alternative. Then I read a post where the author said Environment.TickCount gives better results than Stopwatch. But my script doesn't need real precision, as a couple of seconds here or there won't matter. A final thought: compiling the script as 64-bit for use on a 64-bit OS vs. 32-bit - any major performance gain?
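For what it's worth, the ticks route would presumably be something like the line below in the same script. It is an untested guess: ElapsedTicks is an instance property, but Frequency is a static member of Stopwatch, so it is only an assumption that it resolves through the oTime instance under WinBatch's CLR hosting; if it doesn't, the QueryPerformanceFrequency value from the earlier snippet is the same number.
seconds = oTime.ElapsedTicks / oTime.Frequency   ; untested - Frequency is static and may not resolve on the instance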
WinBatch can run noticeably faster as a 64-bit process but it all depends. The reason WinBatch runs faster in 64-bit is because of the __fastcall calling convention used by 64-bit Windows and associated machine code compilers. If a script causes a lot of deep diving into the WIL LL(2) parser, the use of __fastcall makes the script up to 25% faster. If the script is mostly shallow, the performance gain can be minimal or none.
It is very difficult to predict if a script will gain from 64-bit. This is particularly true when a lot of the processing is being done by an external module like the CLR. The best advice is to rigorously test the script both ways.