BinaryReplace weirdness

Started by jmburton2001, June 17, 2021, 12:51:40 PM

Previous topic - Next topic

jmburton2001

I have a tab delimited input file. I'm trying to replace the tabs with -> quote comma quote <- and it replaces the tabs with a vertical rectangle and a quote in both Notepad and Excel.

Image link

This is the script I'm using.

Code (winbatch) Select

OldDelim = @TAB
NewDelim = '","'
CsvRaw = StrCat (HomeLoc, LogViewRaw)
CsvOut = StrCat (HomeLoc, "NewRaw.csv")
fs = FileSize(CsvRaw)
binbuf = binaryalloc( fs+1000 )
ret = BinaryRead( binbuf, CsvRaw )
num = BinaryReplace( binbuf, OldDelim, NewDelim ,0)
; Message( "Number of '%OldDelim%' strings replaced", num )
BinaryWrite( binbuf, CsvOut )


Thanks in advance for your guidance and insights!  :)

td

Since you did not provide your raw data file I created one of my own from an existing csv file laying around on the workstation. Using this file BinaryReplace worked perfectly to replace the tab characters with the quote-coma-quote character sequence.  No "weirdness." One cause of your problem could be that your csv file is not encoded in a multibyte compatible code page. It could be UFT-8 or something. But that is just a very speculative guess based on the fact that the byte value of C2 is the lead byte of some 2 byte UTF-8 characters and your replacement string has the byte sequence 222C22. Another possibility is that what you think is a tab character may be more than just a tab character in the file and that could through off the interpretation of bytes when you only replace part of the sequence.
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

jmburton2001

Sorry about that!

I'm using Nirsoft FullEventLogView to extract the recording history from the Media Center event log. I have it create a tab delimited text file using the "/stab RawRecordings.txt" command line switch. He doesn't mention the exact technical file format.

Code (winbatch) Select
HomeLoc = DirScript()
LogViewRaw = "RawRecordings.txt"


Please find attached the RawRecordings.txt file that it creates (truncated and zipped because the forum wouldn't accept the entire file). I'll be curious to see if this issue is reproducible on your system.

Even though Nirsoft has a /scomma (comma delimited) command line option, I need it to produce the tab delimited version because there are commas in the raw data fields and I'm using the tabs as delimiters in other parts of my script. I know I'm so far down a rabbit hole on this (apparently simple) problem I should... Ugh!  :P

stanl

Loaded into Notepad++.  Used replace \t with "," and no problem. Also loads into Excel with Power Query, or Access. What is the ultimate goal for the data?

td

The file is Unicode (UTF-16). It has a BOM (Byte Order Mark) at its beginning. If you use Browser.exe that is part of your WinBatch distribution and is a FileMenu menu option or Notepad.exe, it becomes obvious. That means you need to specify Unicode characters as the search and the replace characters.

For example:
Code (winbatch) Select
strFile = 'C:\Temp\RawRecordings.txt'
strOut  = 'C:\temp\dump.txt'

strContentW = FileGetW(strFile)
strReplacedW = StrReplace(strContentW, @Tab, '","')
FilePut(strOut, strReplacedW )
Run('Notepad.exe', strOut)
exit


Note that WinBatch automagically detects that the string to search is Unicode and converts the target and replacement strings to Unicode for you.
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

jmburton2001

Quote from: stanl on June 18, 2021, 02:38:38 AMWhat is the ultimate goal for the data?
To replace the tabs and create a CSV file that has the data fields surrounded by quotes AND create a CSV file that doesn't have quotes.

EDIT: I think I could make a pass to remove all the offending commas and then run another pass to replace the tabs with commas. Then I wouldn't need to "quote" the data fields?

Quote from: td on June 18, 2021, 07:35:55 AM
The file is Unicode (UTF-16). It has a BOM (Byte Order Make) at its beginning. If you use Browser.exe that is part of your WinBatch distribution and is a FileMenu menu option or Notepad.exe, it becomes obvious. That means you need to specify Unicode characters as the search and the replace characters.

I have no idea what this means so I'll have to study it. To quote Corporal Walter "Radar" O'Reilly, "Ahhhh... Bach!"

Thank you for pointing me to new (to me) tools and hopefully a new understanding!  ;)

Quote from: td on June 18, 2021, 07:35:55 AM
Note that WinBatch automagically detects that the string to search is Unicode and converts the target and replacement strings to Unicode for you.

Nice! Winbatch is awesome. I like automagic (which it must have been doing for me all these years)!

stanl

My bad. Formatting the data was not my question....  I was asking more

       
  • going to a database table?
  • go to Excel with formatted columns
  • something else?

stanl

Quote from: stanl on June 18, 2021, 10:54:34 AM
My bad. Formatting the data was not my question....  I was asking more

       
  • going to a database table?
  • go to Excel with formatted columns
  • something else?


and I must really be missing something here, as I have used Nirsoft outputs before...
Code (Winbatch) Select


h = FileOpen("c:\temp\RawRecordings.txt", "READ")
h1 = FileOpen("c:\temp\RawRecordings_1.txt", "WRITE")
While @True
   line = FileRead(h)
   If line == "*EOF*" Then Break
   If Strlen(StrTrim(line)) > 0
   line = StrReplace(line,@TAB,'","')
   FileWrite(h1,line)
   Endif
Endwhile

jmburton2001

It was just for S&G since I don't really have a need for it. That file is created on the fly at startup and used internally by my script and then deleted on exit.

I had a user ask if they could open the raw file in Excel and I said, "Sure. Just open it as a tab delimited file." but that was beyond their scope. So I thought I'd place a means to convert that file into a "true" csv and then present it to the user to do with as they wish. It's for their convenience.

As far as formatting, the only thing I "might" do is break out the date and time. Currently the date and time would come out as one cell that includes everything in the "Date Time AM.000" field.

As I mentioned previously, I could probably make a pass to remove all the offending commas and then run another pass to replace the tabs with commas. BOOM! CSV file. I don't really care what the user does after that.  ;)

jmburton2001

Quote from: stanl on June 18, 2021, 12:22:04 PM
and I must really be missing something here, as I have used Nirsoft outputs before...

Me too, on many occasions. That's why this has me flummoxed.

td

Quote from: stanl on June 18, 2021, 12:22:04 PM

and I must really be missing something here, as I have used Nirsoft outputs before...
Code (Winbatch) Select


h = FileOpen("c:\temp\RawRecordings.txt", "READ")
h1 = FileOpen("c:\temp\RawRecordings_1.txt", "WRITE")
While @True
   line = FileRead(h)
   If line == "*EOF*" Then Break
   If Strlen(StrTrim(line)) > 0
   line = StrReplace(line,@TAB,'","')
   FileWrite(h1,line)
   Endif
Endwhile


I would imagine that Nirsoft utilities can produce files with different character encoding depending on the input and the utility's purpose. Much like WinBatch. From the FileRead help topic, "FileRead can be used to read files containing Unicode data, and will automatically determine the correct file type (ANSI or Unicode), regardless of the "Unicode" parameter in FileOpen, and will convert the return string if necessary." (My emphasis.)

"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

jmburton2001

I'm beginning to believe this is a "one off". I've exported many, many, many Nirsoft output files over the years and they've always been consumed and processed by Winbatch without fail. From my perspective they've always acted like a normal text file so this one's just weird.

stanl

I know this is beating another dead horse, but the change can also be made with  Powershell one-liner:


$file = "C:\temp\RawRecordings.txt"
(Get-Content $file).Replace("`t",'","') | Set-Content $file



I have used similar to this to change large [up to 1 gig] files to change tab => ^.  This saves time iterating line over line with FileRead().


I'm sure the ability in WB 2021C to access more memory might make it interesting to explore replacements with a single load/save function.

jmburton2001

Quote from: stanl on June 19, 2021, 05:03:54 AM
I know this is beating another dead horse...

You gotta admit though, it's a fun and easy diversion.  ;)

I was already parsing this "export" file through file reads, arrays, and maps with no ill effects. I also replaced the tabs with dashes and semi-colons but when I tried '","' that's when things went south.

Just curious, were you able to recreate my anomalies?

td

This topic has highlighted a missing bit of WinBatch functionality. Namely, WinBatch needs a version of the BinaryReplace function that supports Unicode binary buffers. Might see it in a coming release alone with potentially some other changes to binary buffer support.
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

jmburton2001

Quote from: td on June 18, 2021, 07:35:55 AM
The file is Unicode (UTF-16). It has a BOM (Byte Order Mark) at its beginning. If you use Browser.exe that is part of your WinBatch distribution and is a FileMenu menu option or Notepad.exe, it becomes obvious.

Thank you for your guidance! It's now becoming clearer. I've searched the file (right click) properties and couldn't find anything about the encoding. I opened it in Browser.exe and (if you know what you're looking for) it is obvious.

How would I know by using Notepad.exe?

A picture's worth a thousand words.

td

If you open a text file in Notepad.exe the encoding is listed on the right-hand side of Notepad's status bar.  It will show something like UTF-8 or UTF-16 for Unicode text files.
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

td

Since we are fond of pictures here is an image of the Notepad status bar. The "LE" means little-endian which is the byte order of the two-byte words that make up UTF-16.
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

stanl

Going to really beat this horse. Tony, given how Notepad displays the attachment as UTF-16 LE... all transformations I performed to replace tab with "," .... Notepad then shows as ANSI. 


So, the question is not so much on parsing the input but using current Binary Buffers where Jim discovered an issue. And, as he noted may only be a 1-off not a real problem.

td

It is unclear to me what you are referring to. As mentioned earlier in this topic BinaryReplace has a problem. In that same post, I also mentioned that we would be addressing BinaryReplace in a future release. As to Jim, I have no idea which issue you are alluding to. If you want Unicode output, you can use FilePutW. Note the "W" in the name. That means that the file the function creates will be encoded in Unicode regardless of the encoding of the string you pass to the function. 

The file output for the above image was produced by changing the FilePut to FilePutW in the script I originally posted.
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

td

Should also mention that all enhancement requests are duly noted and logged so if it is mentioned it has been added to the list.
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

jmburton2001

Quote from: td on June 20, 2021, 09:30:28 AM
Since we are fond of pictures...

The only reason I posted a picture is that it can more easily show what I see on my system. It's much easier than trying to communicate what I see with words (which could be misconstrued).

With that in mind, here's what I see in Notepad.



The information you receive in Notepad is different than what I see. I should also note that I'm still using Win 7 Pro x64 on this machine and don't have any intention to update it because it's my WMC based HTPC.

jmburton2001

I should also add that I'm glad this has provided some insight that will lead to some changes, even though that was never my intent. From my perspective this was an anomaly that was satisfied by the pointers both you and Stan previously provided.

Thank you both for sharing your wisdom and insights to this novice tinkerer.  :)


ChuckC

In this thread I saw several mentions of unfamiliarity with Unicode.

https://en.wikipedia.org/wiki/Unicode

Refer to the sections on "Code planes and blocks" and "Mappings and encodings".

In a nutshell, Unicode is standard regarding a unified way a handling characters and other visible "glyphs", along with meta characters related to punctuation and forming ligatures, etc....  Every visible glyph and every invisible meta character is represented by a unique "code point" composed of a maximum of 21 bits.

[Transfer] Encodings are the mechanism by which code points are encoded into one or more "code units" for storage and transfer.  Windows natively uses UTF-16, where characters from the BMP [Basic Multi-lingual Plane] which have code point values that do not exceed 16 bits are represented by 16 bit wide characters.  Characters from outside of the BMP which require 17+ bits are represented by "surrogate pairs" consisting of a pair of 16 bit wide characters.  "Endianness" is a factor with UTF-16 where the 16 bit wide character values are subject to machine architecture integer representations that are "big endian" or "little endian".  Windows on x86/x64 architecture CPUs use "little endian", and so UTF-16 encoding on Windows is qualified as being "UTF-16LE".  UTF-8 is by definition a multi-byte encoding scheme that commonly uses 1 to 3 bytes to encode Unicode characters from the BMP, and additional bytes for characters outside the BMP.  UTF-8 also preserves binary compatibility with US ASCII for values 0 thru 127, and similarly maintains a high degree of compatibility with certain ANSI code pages w/respect to their code points with values 0 thru 127.

As it pertains to a binary buffer containing text from a file where the text is UTF-16LE encoded Unicode and the characters in question all fall within the range of letters, numbers & punctuation in the US ASCII range, when viewed in some sort of hex editor you'll see sequences of "00 xx 00 yy 00 zz", where "xx", "yy" and "zz" are the hex representations of US ASCII characters.  For example, "00 31 00 32 00 33" would be the text "123".


jmburton2001

Quote from: ChuckC on June 21, 2021, 05:53:44 AM
In this thread I saw several mentions of unfamiliarity with Unicode.

It's like teaching a monkey complex trigonometry when all he really wants is a banana. In this case... I'm the monkey.  ;)

td

Quote from: jmburton2001 on June 21, 2021, 05:11:14 AM
Quote from: td on June 20, 2021, 09:30:28 AM
Since we are fond of pictures...

The only reason I posted a picture is that it can more easily show what I see on my system. It's much easier than trying to communicate what I see with words (which could be misconstrued).

With that in mind, here's what I see in Notepad.



The information you receive in Notepad is different than what I see. I should also note that I'm still using Win 7 Pro x64 on this machine and don't have any intention to update it because it's my WMC based HTPC.

And Windows 7 is the reason why you are not seeing file encoding. Windows 10's Notepad includes encoding information. Most users have abandon Windows 7 because it is no longer updated and considered vulnerable so I didn't mention that I was referring to Windows 10 Notepad. I have no idea what you are referring to with the phrase "WMC based HTPC" but it doesn't really matter in this context.  I should have mentioned that you need Windows 10.

As for picture's I usually don't use them on this forum as they just take up disk space and like words they can be uninformative. Besides, I don't mind the challenge of practicing my very limited writing skills. Who knows? I might start writing readable prose someday.
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

td

Quote from: ChuckC on June 21, 2021, 05:53:44 AM
In this thread I saw several mentions of unfamiliarity with Unicode.

https://en.wikipedia.org/wiki/Unicode

Refer to the sections on "Code planes and blocks" and "Mappings and encodings".

In a nutshell, Unicode is standard regarding a unified way a handling characters and other visible "glyphs", along with meta characters related to punctuation and forming ligatures, etc....  Every visible glyph and every invisible meta character is represented by a unique "code point" composed of a maximum of 21 bits.

[Transfer] Encodings are the mechanism by which code points are encoded into one or more "code units" for storage and transfer.  Windows natively uses UTF-16, where characters from the BMP [Basic Multi-lingual Plane] which have code point values that do not exceed 16 bits are represented by 16 bit wide characters.  Characters from outside of the BMP which require 17+ bits are represented by "surrogate pairs" consisting of a pair of 16 bit wide characters.  "Endianness" is a factor with UTF-16 where the 16 bit wide character values are subject to machine architecture integer representations that are "big endian" or "little endian".  Windows on x86/x64 architecture CPUs use "little endian", and so UTF-16 encoding on Windows is qualified as being "UTF-16LE".  UTF-8 is by definition a multi-byte encoding scheme that commonly uses 1 to 3 bytes to encode Unicode characters from the BMP, and additional bytes for characters outside the BMP.  UTF-8 also preserves binary compatibility with US ASCII for values 0 thru 127, and similarly maintains a high degree of compatibility with certain ANSI code pages w/respect to their code points with values 0 thru 127.

As it pertains to a binary buffer containing text from a file where the text is UTF-16LE encoded Unicode and the characters in question all fall within the range of letters, numbers & punctuation in the US ASCII range, when viewed in some sort of hex editor you'll see sequences of "00 xx 00 yy 00 zz", where "xx", "yy" and "zz" are the hex representations of US ASCII characters.  For example, "00 31 00 32 00 33" would be the text "123".

And yes most WinBatch text handling functions handle code planes beyond the basic plane.
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

jmburton2001

All I wanted was a banana...  :-[


Edit: WMC HTPC = Windows Media Center based Home Theater PC (which isn't available in Win10)

td

Quote from: jmburton2001 on June 22, 2021, 06:38:51 AM

Edit: WMC HTPC = Windows Media Center based Home Theater PC (which isn't available in Win10)

Completely forgot about that Microsoft gem.
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

jmburton2001

Quote from: td on June 22, 2021, 06:48:54 AM
Completely forgot about that Microsoft gem.

Yeah, most people have moved on from it but there are still a surprising number using it. It's actually the reason this post was created. To extract the recording history from the Media Center event log.

I have two other Win10 PC's I use for "modern" stuff but the money this Win7 HTPC has saved me over the years makes me determined to keep it as long as technologically feasible.

A quick calculation shows that my < $1,000 investment in equipment and infrastructure has saved me close to $20,000 to date (and increasing monthly). Soooo... writing utilities for it with Winbatch is still in my future!  ;)

Just so you know... I really appreciate the tips, tricks, pointers, and supplemental education that you and the other users provide on these forums. All of you are my heroes and you have my undying respect and gratitude!  ;D