At work I started receiving text files larger than 2 GB, and since the parsing was going into Access tables, that is a no-go. I found a PowerShell script that was able to break up a 2.1 GB file into smaller chunks in 32 seconds. Below I have set out code to replicate the PS in WB's CLR hosting. I won't be able to test since I need to use a compiled exe, so to save the back and forth I just want opinions on the code:
;Winbatch 2018B - CLR Stream Reader
;=================================================================================
file = dirscript():"myfile.txt"
rootname = dirscript():"split"
lines = 100000
filecount=1
ObjectClrOption("useany","System")
oRdr = ObjectClrNew('System.IO.StreamReader')
oRdr.OpenText(file)
While ! oRdr.EndOfStream
   linecount=0
   oWrite = oRdr.CreateText(rootname:"_":filecount:".txt")
   While linecount < lines & ! oRdr.EndOfStream
      oWrite.Writeline(oRdr,Readline())
   EndWhile
   oWrite.Dispose()
   filecount=filecount+1
EndWhile
oWrite.Dispose()
oRdr.Dispose()
oReader=0
Exit
It would appear that both filecount and linecount need incrementing but only one is. What is Readline()?
64-bit WinBatch binary buffers can be as large as 2147483647 bytes so using binary buffers might be a faster alternative.
I cannot get past
oRdr.OpenText(file)
Says it is an unknown name. Tried several things but to no avail.
Jim
Quote from: td on April 08, 2019, 07:34:43 AM
It would appear that both filecount and linecount need incrementing but only one is. What is Readline()?
64-bit WinBatch binary buffers can be as large as 2147483647 bytes so using binary buffers might be a faster alternative.
Linecount increment was my bad - and it should have been oRdr.ReadLine() (bad eyes on the comma).
Files will get to 4 GB. Don't binary buffers require loading the entire file first, then parsing? I thought 32 secs was pretty fast.
No on the Binary. I am stripping out some code from my XML_Splitter app as an example. I have split 40gb files with it.
Jim
Quote from: JTaylor on April 08, 2019, 07:53:35 AM
I cannot get past
oRdr.OpenText(file)
Says it is an unknown name. Tried several things but to no avail.
Jim
Yeah. The actual PS line is: $reader = [io.file]::OpenText($filename)
I often find the MSFT docs for classes confusing. Maybe you have to refer to a text type prior to calling OpenText().
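In .NET, OpenText() is a static method of System.IO.File rather than a member of a StreamReader instance, which is presumably why the CLR reports an unknown name. The most direct WinBatch equivalent is probably to skip OpenText() and hand the path straight to the StreamReader constructor:
oRdr = ObjectClrNew('System.IO.StreamReader', file)   ; the constructor opens the file itself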
Not as short as yours but doesn't care about file size. Does a 1.6 gig file in 10 seconds when split into 50mb chunks.
You may see things that seem odd such as root_node_offset or item_node. That applied to my XML splitting. I just set root_node_offset to zero and left it there as I didn't want to mess things up by trying to remove it. The item_node I set to @LF to find the end of a line rather than an end of a node.
Again, this was a quick strip job so check the data and make sure it isn't creeping in a bad way.
Jim
Quote from: stanl on April 08, 2019, 08:00:34 AM
Files will get to 4 GB. Don't binary buffers require loading the entire file first, then parsing? I thought 32 secs was pretty fast.
I was assuming that your 32 secs was a Powershell timing. WinBatch will likely not be that fast because you are using WIL loop structures which tend to be a bit slower. With binary buffers, you wouldn't necessarily need to scan every line in the file and thus significantly reduce the number of loop iterations. It does require a bit more coding.
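For example, something roughly along these lines - an untested sketch rather than a drop-in solution, with a placeholder input name and chunk size, and with the Binary* parameter lists worth double-checking against the WIL help (for source files past the 2 GB mark the offset arithmetic would also need the huge-math extender):
; Untested sketch: read the source in ~50 MB slabs, back each slab up to the last @LF
; so no line is split across files, then write the slab out as the next piece.
chunk     = 50 * 1024 * 1024
infile    = dirscript():"myfile.txt"
rootname  = dirscript():"split"
hBuf      = BinaryAlloc(chunk)
offset    = 0                         ; current read position in the source file
filecount = 1
While @TRUE
   bytes = BinaryReadEx(hBuf, 0, infile, offset, chunk)
   If bytes == 0 Then Break           ; nothing left to read
   If bytes == chunk                  ; full slab - trim back to the last line break
      cut = BinaryIndexEx(hBuf, bytes-1, @LF, @BACKSCAN, @TRUE)
      If cut >= 0 Then bytes = cut+1  ; keep the @LF, push the partial line into the next slab
   EndIf
   BinaryEodSet(hBuf, bytes)          ; only write the trimmed portion
   BinaryWrite(hBuf, rootname:"_":filecount:".txt")
   offset = offset + bytes
   filecount = filecount + 1
EndWhile
BinaryFree(hBuf)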
Quote from: td on April 08, 2019, 08:38:07 AM
I was assuming that your 32 secs was a Powershell timing.
Yes, it used System.Diagnostics.Stopwatch
Completely off topic but sometimes it's necessary to time performance with times too small to detect reliably using milliseconds. In such cases something like the following is useful.
AddExtender("wwhug44i.dll",0,"wwhug64i.dll")   ; huge-math extender: 64-bit integer arithmetic on the counter values
hStart = BinaryAlloc(8)                        ; 8-byte buffers to receive the 64-bit counters
hFinish = BinaryAlloc(8)
hKernel32 = DllLoad('kernel32.dll')
DllCall(hKernel32, LONG:'QueryPerformanceCounter',lpbinary:hStart)
;;; Do something useful
DllCall(hKernel32, LONG:'QueryPerformanceCounter',lpbinary:hFinish)
start_time=BinaryPeek8(hStart, 0)
finish_time=BinaryPeek8(hFinish, 0)
Elapse_time = huge_Subtract(finish_time,start_time)   ; elapsed high-resolution ticks
hFreq = BinaryAlloc(8)
DllCall(hKernel32, LONG:'QueryPerformanceFrequency',lpbinary:hFreq)
Freq_time=BinaryPeek8(hFreq, 0)                       ; ticks per second
BinaryFree(hFreq)
Seconds = huge_divide(Elapse_time,Freq_time)          ; ticks / frequency = seconds
BinaryFree(hStart)
BinaryFree(hFinish)
DllFree(hKernel32)
Again, this was fun. The code below worked on a sample file. If [as in the past] I worked in an environment where everything could be done in WB, this thread would not exist. But I have to produce both PS scripts and WB exes. I prefer the latter because there is so much more I can do from experience and familiar functions; getting the same out of PS would be a real learning curve. That being said...
;Winbatch 2018B - CLR Stream Reader
;=================================================================================
path = dirscript()
file = dirscript():"net\products.csv"
rootname = dirscript():"split"
lines = 50
filecount=1
ObjectClrOption("useany","System")
oRdr = ObjectClrNew('System.IO.StreamReader',file)
While ! oRdr.EndOfStream
   linecount=0
   outfile = rootname:"_":filecount:".txt"
   oWrite = ObjectClrNew('System.IO.StreamWriter',outfile)
   While linecount < lines & ! oRdr.EndOfStream   ; copy up to 'lines' lines into this chunk
      oWrite.WriteLine(oRdr.ReadLine())
      linecount=linecount+1
   EndWhile
   oWrite.Dispose()
   filecount=filecount+1
EndWhile
oWrite.Dispose()
oRdr.Dispose()
oRdr=0
Exit
So THAT is how you do it. Was trying similar things but couldn't quite get the right combination. Thanks for posting the solution.
Jim
Quote from: JTaylor on April 08, 2019, 12:27:26 PM
So THAT is how you do it. Was trying similar things but couldn't quite get the right combination. Thanks for posting the solution.
Jim
I prefer Bombay Gin to MSFT any day :D
Belgian Dark Strong Ale would be my preference.
I am more of a Mr. Pibb guy myself.
Jim
This adds a stopwatch. The Elapsed property (a TimeSpan structure) was giving an interop not-compatible error, but ElapsedMilliseconds appears to work.
;Winbatch 2018B - CLR Stream Reader
;=================================================================================
path = dirscript()
file = dirscript():"net\products.csv"
rootname = dirscript():"split"
lines = 50
filecount=1
ObjectClrOption("useany","System")
oRdr = ObjectClrNew('System.IO.StreamReader',file)
oTime = ObjectClrNew('System.Diagnostics.Stopwatch')
oTime.Start()
While ! oRdr.EndOfStream
   linecount=0
   outfile = rootname:"_":filecount:".txt"
   oWrite = ObjectClrNew('System.IO.StreamWriter',outfile)
   While linecount < lines & ! oRdr.EndOfStream
      oWrite.WriteLine(oRdr.ReadLine())
      linecount=linecount+1
   EndWhile
   oWrite.Dispose()
   filecount=filecount+1
EndWhile
oWrite.Dispose()
oRdr.Dispose()
oTime.Stop()
Message('Process Time',oTime.ElapsedMilliseconds/1000:" Seconds")
oTime=0
oRdr=0
Exit
Working from memory so I may be way off target but I believe the CLR expects the TimeSpan structure to be hosted on the process stack of the procedure using it. MSFT doesn't provide an acceptable mechanism for doing that in COM-based, machine-instruction code like WinBatch. We periodically look for a clever way around this in the CLR and FCL source code but the only solutions so far have undesirable side effects.
I didn't try it, but ElapsedTicks / Frequency is supposed to be an alternative. Then I read a post where the author said Environment.TickCount gives better results than Stopwatch. But my script doesn't need real precision, as a couple of seconds here or there won't matter. A final thought: compiling the script as 64-bit for use on a 64-bit OS vs. 32-bit - any major performance gain?
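For what it's worth, the ticks route would presumably be something like the line below in the same script. It is an untested guess: ElapsedTicks is an instance property, but Frequency is a static member of Stopwatch, so it is only an assumption that it resolves through the oTime instance under WinBatch's CLR hosting; if it doesn't, the QueryPerformanceFrequency value from the earlier snippet is the same number.
seconds = oTime.ElapsedTicks / oTime.Frequency   ; untested - Frequency is static and may not resolve on the instance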
WinBatch can run noticeably faster as a 64-bit process but it all depends. The reason WinBatch runs faster in 64-bit is because of the __fastcall calling convention used by 64-bit Windows and associated machine code compilers. If a script causes a lot of deep diving into the WIL LL(2) parser, the use of __fastcall makes the script up to 25% faster. If the script is mostly shallow, the performance gain can be minimal or none.
It is very difficult to predict if a script will gain from 64-bit. This is particularly true when a lot of the processing is being done by an external module like the CLR. The best advice is to rigorously test the script both ways.