Print Page - BinaryReplace weirdness

Title: BinaryReplace weirdness
Post by: jmburton2001 on June 17, 2021, 12:51:40 PM

I have a tab delimited input file. I'm trying to replace the tabs with -> quote comma quote <- and it replaces the tabs with a vertical rectangle and a quote in both Notepad and Excel.

Image link (https://i.imgur.com/XNdqp8j.png)

This is the script I'm using.

Code (winbatch) Select


OldDelim = @TAB
NewDelim = '","'
CsvRaw = StrCat (HomeLoc, LogViewRaw)
CsvOut = StrCat (HomeLoc, "NewRaw.csv")
fs = FileSize(CsvRaw)
binbuf = binaryalloc( fs+1000 )
ret = BinaryRead( binbuf, CsvRaw )
num = BinaryReplace( binbuf, OldDelim, NewDelim ,0)
; Message( "Number of '%OldDelim%' strings replaced", num )
BinaryWrite( binbuf, CsvOut )

Thanks in advance for your guidance and insights! :)

Title: Re: BinaryReplace weirdness
Post by: td on June 17, 2021, 02:04:58 PM

Since you did not provide your raw data file I created one of my own from an existing csv file laying around on the workstation. Using this file BinaryReplace worked perfectly to replace the tab characters with the quote-coma-quote character sequence. No "weirdness." One cause of your problem could be that your csv file is not encoded in a multibyte compatible code page. It could be UFT-8 or something. But that is just a very speculative guess based on the fact that the byte value of C2 is the lead byte of some 2 byte UTF-8 characters and your replacement string has the byte sequence 222C22. Another possibility is that what you think is a tab character may be more than just a tab character in the file and that could through off the interpretation of bytes when you only replace part of the sequence.

Title: Re: BinaryReplace weirdness
Post by: jmburton2001 on June 17, 2021, 05:30:48 PM

Sorry about that!

I'm using Nirsoft FullEventLogView (https://www.nirsoft.net/utils/full_event_log_view.html) to extract the recording history from the Media Center event log. I have it create a tab delimited text file using the "/stab RawRecordings.txt" command line switch. He doesn't mention the exact technical file format.

Code (winbatch) Select

HomeLoc = DirScript()
LogViewRaw = "RawRecordings.txt"

Please find attached the RawRecordings.txt file that it creates (truncated and zipped because the forum wouldn't accept the entire file). I'll be curious to see if this issue is reproducible on your system.

Even though Nirsoft has a /scomma (comma delimited) command line option, I need it to produce the tab delimited version because there are commas in the raw data fields and I'm using the tabs as delimiters in other parts of my script. I know I'm so far down a rabbit hole on this (apparently simple) problem I should... Ugh! :P

Title: Re: BinaryReplace weirdness
Post by: stanl on June 18, 2021, 02:38:38 AM

Loaded into Notepad++. Used replace \t with "," and no problem. Also loads into Excel with Power Query, or Access. What is the ultimate goal for the data?

Title: Re: BinaryReplace weirdness
Post by: td on June 18, 2021, 07:35:55 AM

The file is Unicode (UTF-16). It has a BOM (Byte Order Mark) at its beginning. If you use Browser.exe that is part of your WinBatch distribution and is a FileMenu menu option or Notepad.exe, it becomes obvious. That means you need to specify Unicode characters as the search and the replace characters.

For example:

Code (winbatch) Select

strFile = 'C:\Temp\RawRecordings.txt'
strOut  = 'C:\temp\dump.txt'

strContentW = FileGetW(strFile)
strReplacedW = StrReplace(strContentW, @Tab, '","')
FilePut(strOut, strReplacedW )
Run('Notepad.exe', strOut)
exit

Note that WinBatch automagically detects that the string to search is Unicode and converts the target and replacement strings to Unicode for you.

Title: Re: BinaryReplace weirdness
Post by: jmburton2001 on June 18, 2021, 08:54:00 AM

Quote from: stanl on June 18, 2021, 02:38:38 AMWhat is the ultimate goal for the data?

To replace the tabs and create a CSV file that has the data fields surrounded by quotes AND create a CSV file that doesn't have quotes.

EDIT: I think I could make a pass to remove all the offending commas and then run another pass to replace the tabs with commas. Then I wouldn't need to "quote" the data fields?

Quote from: td on June 18, 2021, 07:35:55 AM
The file is Unicode (UTF-16). It has a BOM (Byte Order Make) at its beginning. If you use Browser.exe that is part of your WinBatch distribution and is a FileMenu menu option or Notepad.exe, it becomes obvious. That means you need to specify Unicode characters as the search and the replace characters.

I have no idea what this means so I'll have to study it. To quote Corporal Walter "Radar" O'Reilly, "Ahhhh... Bach!"

Thank you for pointing me to new (to me) tools and hopefully a new understanding! ;)

Quote from: td on June 18, 2021, 07:35:55 AM
Note that WinBatch automagically detects that the string to search is Unicode and converts the target and replacement strings to Unicode for you.

Nice! Winbatch is awesome. I like automagic (which it must have been doing for me all these years)!

Title: Re: BinaryReplace weirdness
Post by: stanl on June 18, 2021, 10:54:34 AM

My bad. Formatting the data was not my question.... I was asking more

going to a database table?
go to Excel with formatted columns
something else?

Title: Re: BinaryReplace weirdness
Post by: stanl on June 18, 2021, 12:22:04 PM

Quote from: stanl on June 18, 2021, 10:54:34 AM
My bad. Formatting the data was not my question.... I was asking more

going to a database table?
go to Excel with formatted columns
something else?

and I must really be missing something here, as I have used Nirsoft outputs before...

Code (Winbatch) Select



h = FileOpen("c:\temp\RawRecordings.txt", "READ")
h1 = FileOpen("c:\temp\RawRecordings_1.txt", "WRITE")
While @True
   line = FileRead(h)
   If line == "*EOF*" Then Break
   If Strlen(StrTrim(line)) > 0 
	   line = StrReplace(line,@TAB,'","')
	   FileWrite(h1,line)
   Endif
Endwhile

Title: Re: BinaryReplace weirdness
Post by: jmburton2001 on June 18, 2021, 12:27:21 PM

It was just for S&G since I don't really have a need for it. That file is created on the fly at startup and used internally by my script and then deleted on exit.

I had a user ask if they could open the raw file in Excel and I said, "Sure. Just open it as a tab delimited file." but that was beyond their scope. So I thought I'd place a means to convert that file into a "true" csv and then present it to the user to do with as they wish. It's for their convenience.

As far as formatting, the only thing I "might" do is break out the date and time. Currently the date and time would come out as one cell that includes everything in the "Date Time AM.000" field.

As I mentioned previously, I could probably make a pass to remove all the offending commas and then run another pass to replace the tabs with commas. BOOM! CSV file. I don't really care what the user does after that. ;)

Title: Re: BinaryReplace weirdness
Post by: jmburton2001 on June 18, 2021, 12:30:09 PM

Quote from: stanl on June 18, 2021, 12:22:04 PM
and I must really be missing something here, as I have used Nirsoft outputs before...

Me too, on many occasions. That's why this has me flummoxed.

Title: Re: BinaryReplace weirdness
Post by: td on June 18, 2021, 01:33:45 PM

Quote from: stanl on June 18, 2021, 12:22:04 PM

and I must really be missing something here, as I have used Nirsoft outputs before...
Code (Winbatch) Select Expand
h = FileOpen("c:\temp\RawRecordings.txt", "READ") h1 = FileOpen("c:\temp\RawRecordings_1.txt", "WRITE") While @True line = FileRead(h) If line == "*EOF*" Then Break If Strlen(StrTrim(line)) > 0 line = StrReplace(line,@TAB,'","') FileWrite(h1,line) Endif Endwhile

I would imagine that Nirsoft utilities can produce files with different character encoding depending on the input and the utility's purpose. Much like WinBatch. From the FileRead help topic, "FileRead can be used to read files containing Unicode data, and will automatically determine the correct file type (ANSI or Unicode), regardless of the "Unicode" parameter in FileOpen, and will convert the return string if necessary." (My emphasis.)

Title: Re: BinaryReplace weirdness
Post by: jmburton2001 on June 18, 2021, 06:42:50 PM

I'm beginning to believe this is a "one off". I've exported many, many, many Nirsoft output files over the years and they've always been consumed and processed by Winbatch without fail. From my perspective they've always acted like a normal text file so this one's just weird.

Title: Re: BinaryReplace weirdness
Post by: stanl on June 19, 2021, 05:03:54 AM

I know this is beating another dead horse, but the change can also be made with Powershell one-liner:

Code Select



$file = "C:\temp\RawRecordings.txt"
(Get-Content $file).Replace("`t",'","') | Set-Content $file

I have used similar to this to change large [up to 1 gig] files to change tab => ^. This saves time iterating line over line with FileRead().

I'm sure the ability in WB 2021C to access more memory might make it interesting to explore replacements with a single load/save function.

Title: Re: BinaryReplace weirdness
Post by: jmburton2001 on June 19, 2021, 07:31:59 AM

Quote from: stanl on June 19, 2021, 05:03:54 AM
I know this is beating another dead horse...

You gotta admit though, it's a fun and easy diversion. ;)

I was already parsing this "export" file through file reads, arrays, and maps with no ill effects. I also replaced the tabs with dashes and semi-colons but when I tried '","' that's when things went south.

Just curious, were you able to recreate my anomalies?

Title: Re: BinaryReplace weirdness
Post by: td on June 19, 2021, 08:25:27 AM

This topic has highlighted a missing bit of WinBatch functionality. Namely, WinBatch needs a version of the BinaryReplace function that supports Unicode binary buffers. Might see it in a coming release alone with potentially some other changes to binary buffer support.

Title: Re: BinaryReplace weirdness
Post by: jmburton2001 on June 19, 2021, 01:18:58 PM

Quote from: td on June 18, 2021, 07:35:55 AM
The file is Unicode (UTF-16). It has a BOM (Byte Order Mark) at its beginning. If you use Browser.exe that is part of your WinBatch distribution and is a FileMenu menu option or Notepad.exe, it becomes obvious.

Thank you for your guidance! It's now becoming clearer. I've searched the file (right click) properties and couldn't find anything about the encoding. I opened it in Browser.exe and (if you know what you're looking for) it is obvious.

How would I know by using Notepad.exe?

A picture's worth a thousand words. (https://i.imgur.com/Y0Ao69n.png)

Title: Re: BinaryReplace weirdness
Post by: td on June 20, 2021, 08:14:28 AM

If you open a text file in Notepad.exe the encoding is listed on the right-hand side of Notepad's status bar. It will show something like UTF-8 or UTF-16 for Unicode text files.

Title: Re: BinaryReplace weirdness
Post by: td on June 20, 2021, 09:30:28 AM

Since we are fond of pictures here is an image of the Notepad status bar. The "LE" means little-endian which is the byte order of the two-byte words that make up UTF-16.

Title: Re: BinaryReplace weirdness
Post by: stanl on June 20, 2021, 10:09:15 AM

Going to really beat this horse. Tony, given how Notepad displays the attachment as UTF-16 LE... all transformations I performed to replace tab with "," .... Notepad then shows as ANSI.

So, the question is not so much on parsing the input but using current Binary Buffers where Jim discovered an issue. And, as he noted may only be a 1-off not a real problem.

Title: Re: BinaryReplace weirdness
Post by: td on June 20, 2021, 03:10:26 PM

It is unclear to me what you are referring to. As mentioned earlier in this topic BinaryReplace has a problem. In that same post, I also mentioned that we would be addressing BinaryReplace in a future release. As to Jim, I have no idea which issue you are alluding to. If you want Unicode output, you can use FilePutW. Note the "W" in the name. That means that the file the function creates will be encoded in Unicode regardless of the encoding of the string you pass to the function.

The file output for the above image was produced by changing the FilePut to FilePutW in the script I originally posted.

Title: Re: BinaryReplace weirdness
Post by: td on June 20, 2021, 03:25:37 PM

Should also mention that all enhancement requests are duly noted and logged so if it is mentioned it has been added to the list.

Title: Re: BinaryReplace weirdness
Post by: jmburton2001 on June 21, 2021, 05:11:14 AM

Quote from: td on June 20, 2021, 09:30:28 AM
Since we are fond of pictures...

The only reason I posted a picture is that it can more easily show what I see on my system. It's much easier than trying to communicate what I see with words (which could be misconstrued).

With that in mind, here's what I see in Notepad.

(https://i.imgur.com/OH2wsOM.png)

The information you receive in Notepad is different than what I see. I should also note that I'm still using Win 7 Pro x64 on this machine and don't have any intention to update it because it's my WMC based HTPC.

Title: Re: BinaryReplace weirdness
Post by: jmburton2001 on June 21, 2021, 05:27:27 AM

I should also add that I'm glad this has provided some insight that will lead to some changes, even though that was never my intent. From my perspective this was an anomaly that was satisfied by the pointers both you and Stan previously provided.

Thank you both for sharing your wisdom and insights to this novice tinkerer. :)

Title: Re: BinaryReplace weirdness
Post by: ChuckC on June 21, 2021, 05:53:44 AM

In this thread I saw several mentions of unfamiliarity with Unicode.

https://en.wikipedia.org/wiki/Unicode

Refer to the sections on "Code planes and blocks" and "Mappings and encodings".

In a nutshell, Unicode is standard regarding a unified way a handling characters and other visible "glyphs", along with meta characters related to punctuation and forming ligatures, etc.... Every visible glyph and every invisible meta character is represented by a unique "code point" composed of a maximum of 21 bits.

[Transfer] Encodings are the mechanism by which code points are encoded into one or more "code units" for storage and transfer. Windows natively uses UTF-16, where characters from the BMP [Basic Multi-lingual Plane] which have code point values that do not exceed 16 bits are represented by 16 bit wide characters. Characters from outside of the BMP which require 17+ bits are represented by "surrogate pairs" consisting of a pair of 16 bit wide characters. "Endianness" is a factor with UTF-16 where the 16 bit wide character values are subject to machine architecture integer representations that are "big endian" or "little endian". Windows on x86/x64 architecture CPUs use "little endian", and so UTF-16 encoding on Windows is qualified as being "UTF-16LE". UTF-8 is by definition a multi-byte encoding scheme that commonly uses 1 to 3 bytes to encode Unicode characters from the BMP, and additional bytes for characters outside the BMP. UTF-8 also preserves binary compatibility with US ASCII for values 0 thru 127, and similarly maintains a high degree of compatibility with certain ANSI code pages w/respect to their code points with values 0 thru 127.

As it pertains to a binary buffer containing text from a file where the text is UTF-16LE encoded Unicode and the characters in question all fall within the range of letters, numbers & punctuation in the US ASCII range, when viewed in some sort of hex editor you'll see sequences of "00 xx 00 yy 00 zz", where "xx", "yy" and "zz" are the hex representations of US ASCII characters. For example, "00 31 00 32 00 33" would be the text "123".

Title: Re: BinaryReplace weirdness
Post by: jmburton2001 on June 21, 2021, 07:28:18 AM

Quote from: ChuckC on June 21, 2021, 05:53:44 AM
In this thread I saw several mentions of unfamiliarity with Unicode.

It's like teaching a monkey complex trigonometry when all he really wants is a banana. In this case... I'm the monkey. ;)

Title: Re: BinaryReplace weirdness
Post by: td on June 21, 2021, 07:31:06 AM

Quote from: jmburton2001 on June 21, 2021, 05:11:14 AM
Quote from: td on June 20, 2021, 09:30:28 AM
Since we are fond of pictures...

The only reason I posted a picture is that it can more easily show what I see on my system. It's much easier than trying to communicate what I see with words (which could be misconstrued).

With that in mind, here's what I see in Notepad.

(https://i.imgur.com/OH2wsOM.png)

The information you receive in Notepad is different than what I see. I should also note that I'm still using Win 7 Pro x64 on this machine and don't have any intention to update it because it's my WMC based HTPC.

And Windows 7 is the reason why you are not seeing file encoding. Windows 10's Notepad includes encoding information. Most users have abandon Windows 7 because it is no longer updated and considered vulnerable so I didn't mention that I was referring to Windows 10 Notepad. I have no idea what you are referring to with the phrase "WMC based HTPC" but it doesn't really matter in this context. I should have mentioned that you need Windows 10.

As for picture's I usually don't use them on this forum as they just take up disk space and like words they can be uninformative. Besides, I don't mind the challenge of practicing my very limited writing skills. Who knows? I might start writing readable prose someday.

Title: Re: BinaryReplace weirdness
Post by: td on June 21, 2021, 07:57:21 AM

Quote from: ChuckC on June 21, 2021, 05:53:44 AM
In this thread I saw several mentions of unfamiliarity with Unicode.

https://en.wikipedia.org/wiki/Unicode

Refer to the sections on "Code planes and blocks" and "Mappings and encodings".

In a nutshell, Unicode is standard regarding a unified way a handling characters and other visible "glyphs", along with meta characters related to punctuation and forming ligatures, etc.... Every visible glyph and every invisible meta character is represented by a unique "code point" composed of a maximum of 21 bits.

[Transfer] Encodings are the mechanism by which code points are encoded into one or more "code units" for storage and transfer. Windows natively uses UTF-16, where characters from the BMP [Basic Multi-lingual Plane] which have code point values that do not exceed 16 bits are represented by 16 bit wide characters. Characters from outside of the BMP which require 17+ bits are represented by "surrogate pairs" consisting of a pair of 16 bit wide characters. "Endianness" is a factor with UTF-16 where the 16 bit wide character values are subject to machine architecture integer representations that are "big endian" or "little endian". Windows on x86/x64 architecture CPUs use "little endian", and so UTF-16 encoding on Windows is qualified as being "UTF-16LE". UTF-8 is by definition a multi-byte encoding scheme that commonly uses 1 to 3 bytes to encode Unicode characters from the BMP, and additional bytes for characters outside the BMP. UTF-8 also preserves binary compatibility with US ASCII for values 0 thru 127, and similarly maintains a high degree of compatibility with certain ANSI code pages w/respect to their code points with values 0 thru 127.

As it pertains to a binary buffer containing text from a file where the text is UTF-16LE encoded Unicode and the characters in question all fall within the range of letters, numbers & punctuation in the US ASCII range, when viewed in some sort of hex editor you'll see sequences of "00 xx 00 yy 00 zz", where "xx", "yy" and "zz" are the hex representations of US ASCII characters. For example, "00 31 00 32 00 33" would be the text "123".

And yes most WinBatch text handling functions handle code planes beyond the basic plane.

Title: Re: BinaryReplace weirdness
Post by: jmburton2001 on June 22, 2021, 06:38:51 AM

All I wanted was a banana... :-[

Edit: WMC HTPC = Windows Media Center based Home Theater PC (which isn't available in Win10)

Title: Re: BinaryReplace weirdness
Post by: td on June 22, 2021, 06:48:54 AM

Quote from: jmburton2001 on June 22, 2021, 06:38:51 AM

Edit: WMC HTPC = Windows Media Center based Home Theater PC (which isn't available in Win10)

Completely forgot about that Microsoft gem.

Title: Re: BinaryReplace weirdness
Post by: jmburton2001 on June 22, 2021, 08:23:57 AM

Quote from: td on June 22, 2021, 06:48:54 AM
Completely forgot about that Microsoft gem.

Yeah, most people have moved on from it but there are still a surprising number using it. It's actually the reason this post was created. To extract the recording history from the Media Center event log.

I have two other Win10 PC's I use for "modern" stuff but the money this Win7 HTPC has saved me over the years makes me determined to keep it as long as technologically feasible.

A quick calculation shows that my < $1,000 investment in equipment and infrastructure has saved me close to $20,000 to date (and increasing monthly). Soooo... writing utilities for it with Winbatch is still in my future! ;)

Just so you know... I really appreciate the tips, tricks, pointers, and supplemental education that you and the other users provide on these forums. All of you are my heroes and you have my undying respect and gratitude! ;D

WinBatch® Technical Support Forum

All Things WinBatch => WinBatch => Topic started by: jmburton2001 on June 17, 2021, 12:51:40 PM