CLR StreamReader

Started by stanl, April 08, 2019, 05:49:57 AM


stanl

At work I started receiving text files larger than 2 gig, and since the parsing was to Access tables, that is a no-go. I found a PowerShell script that was able to break up a 2.1 gig file into smaller chunks in 32 seconds. Below I have set out code to replicate the PS in WB's CLR. I won't be able to test since I need to use a compiled exe. To save the back and forth, I just want opinions on the code:
Code (WINBATCH)


;Winbatch 2018B - CLR Stream Reader
;=================================================================================
file = dirscript():"myfile.txt"
rootname = dirscript():"split"


lines = 100000
filecount=1
ObjectClrOption("useany","System")
oRdr = ObjectClrNew('System.IO.StreamReader')
oRdr.OpenText(file)


While ! oRdr.EndOfStream
   linecount=0
   oWrite = oRdr.CreateText(rootname:"_":filecount:".txt")
While linecount < lines & ! oRdr.EndOfStream
   oWrite.Writeline(oRdr,Readline())
EndWhile
oWrite.Dispose()
filecount=filecount+1
EndWhile


oWrite.Dispose()
oRdr.Dispose()
oReader=0
Exit

td

It would appear that both filecount and linecount  need incrementing but only one is.  What is Readline()?

64-bit WinBatch binary buffers can be as large as 2147483647 bytes so using binary buffers might be a faster alternative.
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

JTaylor

I cannot get past

    oRdr.OpenText(file)

Says it is an unknown name.   Tried several things but to no avail.

Jim

stanl

Quote from: td on April 08, 2019, 07:34:43 AM
It would appear that both filecount and linecount  need incrementing but only one is.  What is Readline()?

64-bit WinBatch binary buffers can be as large as 2147483647 bytes so using binary buffers might be a faster alternative.


The linecount increment was my bad.

And the Readline() call should have been oRdr.ReadLine() (bad eyes on the comma).


Files will get to 4 gig. Don't binary buffers require loading the entire file first, then parsing? I thought 32 secs was pretty fast.

JTaylor

No on the Binary.  I am stripping out some code from my XML_Splitter app as an example.  I have split 40gb files with it.

Jim

stanl

Quote from: JTaylor on April 08, 2019, 07:53:35 AM
I cannot get past

    oRdr.OpenText(file)

Says it is an unknown name.   Tried several things but to no avail.

Jim


Yeah. The actual PS line is:  $reader = [io.file]::OpenText($filename)


I often find the MSFT docs for classes confusing. Maybe you have to refer to a text type prior to calling OpenText().

JTaylor

Not as short as yours but doesn't care about file size.  Does a 1.6 gig file in 10 seconds when split into 50mb chunks. 

You may see things that seem odd such as root_node_offset or item_node.  That applied to my XML splitting.  I just set root_node_offset to zero and left it there as I didn't want to mess things up by trying to remove it.  The item_node I set to @LF to find the end of a line rather than an end of a node.

Again, this was a quick strip job so check the data and make sure it isn't creeping in a bad way.
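
For anyone following along, the chunking idea described above can be sketched roughly as follows. This is not the XML_Splitter code; it is a minimal Python illustration (Python only because it is easy to run anywhere), and the names split_on_line_boundaries, write_piece, and chunk_bytes are made up for the example. It reads a fixed-size block, cuts it at the last line feed, and carries the leftover partial line into the next block, so there is no per-line loop.

```python
def write_piece(out_prefix, index, data):
    """Write one output piece and return its path (hypothetical helper)."""
    path = "%s_%d.txt" % (out_prefix, index)
    with open(path, "wb") as out:
        out.write(data)
    return path


def split_on_line_boundaries(src_path, out_prefix, chunk_bytes=50 * 1024 * 1024):
    """Split src_path into pieces of roughly chunk_bytes, each ending on a line feed."""
    out_paths = []
    carry = b""  # partial line left over from the previous block
    with open(src_path, "rb") as src:
        while True:
            block = src.read(chunk_bytes)
            if not block:
                break
            data = carry + block
            cut = data.rfind(b"\n")  # last complete line in this block
            if cut == -1:
                # No line end seen yet: keep accumulating until one shows up.
                carry = data
                continue
            out_paths.append(
                write_piece(out_prefix, len(out_paths) + 1, data[:cut + 1])
            )
            carry = data[cut + 1:]
    if carry:
        # Final line had no trailing line feed; write it as the last piece.
        out_paths.append(write_piece(out_prefix, len(out_paths) + 1, carry))
    return out_paths
```

Only one block plus the carried partial line is held in memory at a time, so the source file size should not matter much. As with any quick sketch, compare the joined pieces against the original before trusting it.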

Jim

td

Quote from: stanl on April 08, 2019, 08:00:34 AM
Files will get to 4 gig. Don't binary buffers require loading the entire file first, then parsing? I thought 32 secs was pretty fast.

I was assuming that your 32 secs was a Powershell timing. WinBatch will likely not be that fast because you are using WIL loop structures which tend to be a bit slower. With binary buffers, you wouldn't necessarily need to scan every line in the file and thus significantly reduce the number of loop iterations. It does require a bit more coding.

stanl

Quote from: td on April 08, 2019, 08:38:07 AM
I was assuming that your 32 secs was a Powershell timing. 


Yes, it used System.Diagnostics.Stopwatch

td

Completely off topic but sometimes it's necessary to time performance with times too small to detect reliably using milliseconds.  In such cases something like the following is useful.

Code (WINBATCH)
AddExtender("wwhug44i.dll",0,"wwhug64i.dll")
hStart   = BinaryAlloc(8)
hFinish   = BinaryAlloc(8)
hKernel32 = DllLoad('kernel32.dll')

DllCall(hKernel32, LONG:'QueryPerformanceCounter',lpbinary:hStart)

;;; Do something useful

DllCall(hKernel32, LONG:'QueryPerformanceCounter',lpbinary:hFinish)
start_time=BinaryPeek8(hStart, 0)
finish_time=BinaryPeek8(hFinish, 0)
Elapse_time = huge_Subtract(finish_time,start_time)

hFreq   = BinaryAlloc(8)
DllCall(hKernel32, LONG:'QueryPerformanceFrequency',lpbinary:hFreq)
Freq_time=BinaryPeek8(hFreq, 0)
BinaryFree(hFreq)

Seconds = huge_divide(Elapse_time,Freq_time)
BinaryFree(hStart)
BinaryFree(hFinish)
DllFree(hKernel32)
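
The arithmetic in the snippet above is just (finish count - start count) / counts per second. A minimal Python sketch of that conversion follows; the function name and the sample numbers are illustrative assumptions, not anything taken from the WinBatch extender.

```python
def ticks_to_seconds(start_ticks, finish_ticks, ticks_per_second):
    """Convert two raw counter readings into elapsed seconds."""
    if ticks_per_second <= 0:
        raise ValueError("ticks_per_second must be positive")
    return (finish_ticks - start_ticks) / ticks_per_second


if __name__ == "__main__":
    # Hypothetical readings from a counter running at 10,000,000 ticks per second.
    print(ticks_to_seconds(1_000_000, 1_250_000, 10_000_000))
```

The same division applies to any start/finish pair of tick counts, as long as the frequency is read from the same counter.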

stanl

Again. This was fun. The code below worked on a sample file. If [as in the past] I worked in an environment where everything could be done in WB this thread would  not exist. But I have to produce both PS scripts as well as WB exes. I prefer the latter because there is so much more I can do from experience and familiar functions that would really be a PS learning curve. That being said...


Code (WINBATCH)


;Winbatch 2018B - CLR Stream Reader
;=================================================================================
path = dirscript()
file = path:"net\products.csv"
rootname = path:"split"

lines = 50      ; maximum number of lines per output file
filecount = 1   ; numeric suffix of the current output file
ObjectClrOption("useany","System")
oRdr = ObjectClrNew('System.IO.StreamReader',file)

While ! oRdr.EndOfStream
   linecount = 0
   outfile = rootname:"_":filecount:".txt"
   oWrite = ObjectClrNew('System.IO.StreamWriter',outfile)
   ; copy up to 'lines' lines into the current output file
   While linecount < lines & ! oRdr.EndOfStream
      oWrite.WriteLine(oRdr.ReadLine())
      linecount = linecount+1
   EndWhile
   oWrite.Dispose()   ; flush and close this chunk
   filecount = filecount+1
EndWhile

; each writer is already disposed inside the loop, so only the reader is left
oRdr.Dispose()
oWrite = 0
oRdr = 0
Exit



JTaylor

So THAT is how you do it.  Was trying similar things but couldn't quite get the right combination.  Thanks for posting the solution.

Jim

stanl

Quote from: JTaylor on April 08, 2019, 12:27:26 PM
So THAT is how you do it.  Was trying similar things but couldn't quite get the right combination.  Thanks for posting the solution.

Jim


I prefer Bombay Gin to MSFT anyday :D

td

Belgian Dark Strong Ale would be my preference.

JTaylor

I am more of a Mr. Pibb guy myself.

Jim

stanl

This adds a stopwatch. There is a TimeSpan structure [Elapsed] that was giving an interop non-compatible error, but the ElapsedMilliseconds appears to work.


Code (WINBATCH)


;Winbatch 2018B - CLR Stream Reader
;=================================================================================
path = dirscript()
file = path:"net\products.csv"
rootname = path:"split"

lines = 50      ; maximum number of lines per output file
filecount = 1   ; numeric suffix of the current output file
ObjectClrOption("useany","System")
oRdr = ObjectClrNew('System.IO.StreamReader',file)
oTime = ObjectClrNew('System.Diagnostics.Stopwatch')
oTime.Start()
While ! oRdr.EndOfStream
   linecount = 0
   outfile = rootname:"_":filecount:".txt"
   oWrite = ObjectClrNew('System.IO.StreamWriter',outfile)
   ; copy up to 'lines' lines into the current output file
   While linecount < lines & ! oRdr.EndOfStream
      oWrite.WriteLine(oRdr.ReadLine())
      linecount = linecount+1
   EndWhile
   oWrite.Dispose()   ; flush and close this chunk
   filecount = filecount+1
EndWhile

; each writer is already disposed inside the loop, so only the reader is left
oRdr.Dispose()
oTime.Stop()
Message('Process Time',oTime.ElapsedMilliseconds/1000:" Seconds")
oTime = 0
oWrite = 0
oRdr = 0
Exit



td

Working from memory so I may be way off target but I believe the CLR expects the TimeSpan structure to be hosted on the process stack of the procedure using it. MSFT doesn't provide an acceptable mechanism for doing that in COM-based, machine-instruction code like WinBatch. We periodically look for a clever way around this in the CLR and FCL source code but the only solutions so far have undesirable side effects.

stanl

I didn't try it, but ElapsedTicks / Frequency is supposed to be an alternative. Then I read a post where the author said Environment.TickCount gives better results than Stopwatch. But my script doesn't need real precision, as a couple of seconds here or there won't matter. A final thought: compiling the script as 64-bit for use on a 64-bit OS vs. 32-bit: any major performance gain?

td

WinBatch can run noticeably faster as a 64-bit process but it all depends.   The reason WinBatch runs faster in 64-bit is because of the __fastcall calling convention used by 64-bit Windows and associated machine code compilers.  If a script causes a lot of deep diving into the WIL LL(2) parser,  the use of __fastcall makes the script up to 25% faster.  If the script is mostly shallow, the performance gain can be minimal or none.

It is very difficult to predict if a script will gain from 64-bit. This is particularly true when a lot of the processing is being done by an external module like the CLR. The best advice is to rigorously test the script both ways.