Ascii replace not working

MW4 · March 19, 2014, 12:33:53 PM

I have this code that I am using to replace a registration mark with an empty string "".
Instead, it is replacing the registration mark with Â.

This text:
Bluetooth® Wireless Technology
ends up as this text:
BluetoothÂ Wireless Technology
I want it to end up like this:
Bluetooth Wireless Technology

Any ideas??

Code (winbatch):


str = Num2Char (174)
rep = ""
 
infile="c:\Flee.txt"
outfile="c:\Flee_Fixed.txt"

fs = FileSize( infile )
binbuf = binaryalloc( fs+100 )
ret = BinaryRead( binbuf, infile )
num = BinaryReplace( binbuf, str, rep , 0)
BinaryWrite( binbuf, outfile )
binbuf=BinaryFree(binbuf)

Deana · March 19, 2014, 01:56:35 PM

The code you posted will work fine with a true ANSI file. However, I suspect you are working with a Unicode (UTF-16) or UTF-8 encoded file. Try opening the file in Notepad and make sure that you choose ANSI encoding when you save it. Then test the code again and see if that resolves your issue.
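
To see why a stray character can be left behind, here is a minimal sketch in Python (used only for illustration; it is not WinBatch). It assumes the file is UTF-8, where the registration mark is stored as the two bytes C2 AE. The helper name `strip_byte` is hypothetical, and the sample text is the one from the first post.

```python
# Illustration only: mimics a binary replace of the single byte 174 (0xAE).
def strip_byte(data: bytes, value: int) -> bytes:
    """Remove every occurrence of one byte value from the data."""
    return data.replace(bytes([value]), b"")

# "\u00ae" is the registration mark; UTF-8 stores it as the bytes C2 AE.
utf8_bytes = "Bluetooth\u00ae Wireless Technology".encode("utf-8")

# Removing only 0xAE leaves the lead byte 0xC2 in the data.
leftover = strip_byte(utf8_bytes, 174)

# An ANSI (Windows-1252) viewer shows the orphaned 0xC2 as "\u00c2".
leftover_as_ansi = leftover.decode("cp1252")
print(leftover_as_ansi)

# Removing the full two-byte sequence gives the intended text.
fixed = utf8_bytes.replace(b"\xc2\xae", b"").decode("utf-8")
print(fixed)
```

If the file really is ANSI, the mark is the single byte AE and a one-byte replace is enough, which matches the behavior described above.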

Deana · March 19, 2014, 02:04:47 PM

If you are actually dealing with a Unicode file, your code could look something like this:

Code (winbatch):

; Unicode sample
str = ChrStringToUnicode( Num2Char(174) )
rep = ChrStringToUnicode( "" )
 
infile="c:\temp\data\unicode.txt"
outfile="c:\temp\data\_unicode.txt"

strU = FileGetW(infile)
data = StrReplace( strU, str, rep )
FilePutW( outfile, data )
Exit

MW4 · March 19, 2014, 02:22:44 PM

Still does the same thing

Deana · March 19, 2014, 02:45:22 PM

Did you try my suggestion to open and save your file in Notepad?

Post the input file so we can see what might be going on.

Also, I recommend using DebugTrace. Simply add DebugTrace(@on,"trace.txt") to the beginning of the script and inside any UDF, run the script until it errors or completes, then inspect the resulting trace file for clues to the problem. Feel free to post the trace file here (after removing any private info) if you need further assistance.

td · March 20, 2014, 08:16:29 AM

Quote from: MW4 on March 19, 2014, 02:22:44 PM
Still does the same thing

FileGetW will only load a file as Unicode if it contains a BOM for UTF-16. If the file does not contain a BOM at the beginning, or it has a BOM for UTF-8, FileGetW will treat the file as ANSI and attempt to convert it to Unicode UTF-16. Generally, UTF-8 files do not have a BOM because byte order has no meaning in UTF-8; a BOM there only serves to indicate that the file is UTF-8.

There are several ways to determine the character encoding of a file. One fairly simple method is to load the file into a hex file viewer like the WinBatch Browser utility and look for a BOM and for indicators of UTF-16 and UTF-8 encoding. The BOM is the first two or three bytes of a file and will contain the hex values FF FE or FE FF for UTF-16 and EF BB BF for UTF-8.

If a file does not contain a BOM, you can still get a good idea of the encoding by looking at the hex values of the text. If roughly every other byte is 00, the file is likely UTF-16. If you see hex values C2, C3, C4, etc., mixed in with regular ANSI values, the file is likely UTF-8. For example, the registration mark is stored as C2 AE in UTF-8 but as the single byte AE in ANSI.
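
The checks described above can be sketched in a few lines of Python (used only for illustration; it is not WinBatch). The function name `sniff_encoding`, its return labels, and the 25% threshold for 00 bytes are all assumptions made for this example. It is a heuristic and can guess wrong on short or unusual files.

```python
def sniff_encoding(data: bytes) -> str:
    """Guess a file's encoding from its BOM or, failing that, its byte patterns."""
    # A BOM, when present, is the most reliable indicator.
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8 (BOM)"
    if data.startswith(b"\xff\xfe"):
        return "utf-16-le (BOM)"
    if data.startswith(b"\xfe\xff"):
        return "utf-16-be (BOM)"
    # No BOM: a large share of 00 bytes suggests UTF-16 (threshold is an assumption).
    if data and data.count(0) * 4 >= len(data):
        return "utf-16 (guess)"
    # Bytes of 0x80 and above that decode cleanly as UTF-8 suggest UTF-8.
    if any(b >= 0x80 for b in data):
        try:
            data.decode("utf-8")
            return "utf-8 (guess)"
        except UnicodeDecodeError:
            return "ansi (guess)"
    # Only 7-bit values: the bytes are the same in ANSI and UTF-8.
    return "ascii"

sample = "Bluetooth\u00ae Wireless Technology"
print(sniff_encoding(sample.encode("utf-8")))
print(sniff_encoding(sample.encode("cp1252")))
```

Reading the raw bytes of the file and passing them to a function like this gives a quick first answer before choosing which replace technique to use.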

If it turns out that your file is UTF-8 (which is likely, based on the evidence presented), the following Tech. DB article demonstrates one technique for handling UTF-8 in WinBatch:

http://techsupt.winbatch.com/webcgi/webbatch.exe?techsupt/tsleft.web+winbatch/Strings+Convert~To~UTF-8.txt

[Edit] Another way to handle UTF-8 is to set the WinBatch code page to UTF-8 using the ChrSetCodePage function before calling FileGetW. If the file does not have a BOM, the function will assume that the file is UTF-8 and convert it to UTF-16. Make sure to set the code page back to the default after calling FileGetW. If you don't, the subsequent call to StrReplace will not work.
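
The general idea behind this approach (decode the UTF-8 bytes into real characters, replace the character, then write the text back out) can be sketched in Python, used here only for illustration. The function name `strip_registered_mark` and the temporary file names are hypothetical, and the sketch assumes the input is valid UTF-8 and that UTF-8 output is wanted.

```python
import tempfile
from pathlib import Path

REGISTERED = "\u00ae"  # the registration mark as a single Unicode character

def strip_registered_mark(src: Path, dst: Path) -> int:
    """Decode src as UTF-8, drop every registration mark, write dst as UTF-8.

    Returns the number of marks removed.
    """
    text = src.read_bytes().decode("utf-8")
    count = text.count(REGISTERED)
    dst.write_bytes(text.replace(REGISTERED, "").encode("utf-8"))
    return count

# Small demonstration using a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "in.txt"
    dst = Path(tmp) / "out.txt"
    src.write_bytes("0,Bluetooth\u00ae Wireless Technology\r\n".encode("utf-8"))
    removed = strip_registered_mark(src, dst)
    result = dst.read_bytes()

print(removed, result)
```

Because the replace happens on decoded characters rather than raw bytes, both bytes of the mark disappear together and no orphaned C2 is left behind.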

MW4 · March 24, 2014, 09:58:29 AM

OK,
As always you were right...
Saved as ANSI and it worked.

How can I force an ANSI save within WinBatch to start my process?

Deana · March 24, 2014, 10:48:05 AM

Actually I recommend dealing with the data in the format that it was received. Please see our previous posts about ways to deal with differently encoded files.

MW4 · March 24, 2014, 11:03:10 AM

OK, I'm super confused.

I use this in my script to strip the registration mark, which it doesn't do because the file is UTF-8:

str = Num2Char (174)
rep = ""

infile="c:\Flee.txt"
outfile="c:\Flee_Fixed.txt"

fs = FileSize( infile )
binbuf = binaryalloc( fs+100 )
ret = BinaryRead( binbuf, infile )
num = BinaryReplace( binbuf, str, rep , 0)
BinaryWrite( binbuf, outfile )
binbuf=BinaryFree(binbuf)

All I want to do is strip out the registration mark

Are you suggesting this? If so where would that go?

; Convert UTF-8 to ANSI.

strUTF8_ = "Â® â‚¬ Ã¾Ã¦Ã± HÃ¶llÃ«" ; "Â® â‚¬ Ã¾Ã¦Ã± HÃ¶llÃ«"
strHex = ChrStringToHex (strUTF8_) ; "C2AE20E282AC20C3BEC3A6C3B12048C3B66C6CC3AB"
intVarType = VarType (strUTF8_) ; 2 string

strUTF16LE_ = ChrStringToUnicode (strUTF8_) ; "® € þæñ Höllë"
strHex = ChrUnicodeToHex (strUTF16LE_) ; "AE002000AC202000FE00E600F10020004800F6006C006C00EB00"
intVarType = VarType (strUTF16LE_) ; 128 LPWSTR or "Unicode"

ChrSetCodepage (0) ; 0 ANSI code page

strANSI_ = ChrUnicodeToString (strUTF16LE_) ; "® € þæñ Höllë"
strHex = ChrStringToHex (strANSI_) ; "AE208020FEE6F12048F66C6CEB"
intVarType = VarType (strANSI_) ; 2 string

MW4 · March 24, 2014, 11:14:18 AM

I pull the original file from a vendor's FTP server using:

iFtpGet(conhandle,KiaFileName,slocBoxFile,0,@ASCII, @TRUE)

Can the codepage be set here?
Should that be binary instead of ASCII?

Deana · March 24, 2014, 12:07:30 PM

Since the file is UTF-8, the code might look something like this:

Code (winbatch):

str = Num2Char (174)
rep = ""
utf8file = 'C:\TEMP\Data\utf8.txt'
intCP = ChrSetCodepage (65001); Translate using UTF-8
data = StrSub( FileGetW( utf8file ), 2, -1 ) ; Ignore BOM
ChrSetCodepage (intCP)
newdata = StrReplace( data, str, rep )
Pause(data,newdata)
Exit

MW4 · March 24, 2014, 01:09:54 PM

So is this my best course then?
Are there any issues with handling it like this?

Code (winbatch):

str = Num2Char (174)
rep = ""

utf8file = 'c:\FleetinvCheckit.txt'
outfile="c:\FleetinvCheckit_Fixed_xx.txt"

intCP = ChrSetCodepage (65001); Translate using UTF-8
data = StrSub( FileGetW( utf8file ), 2, -1 ) ; Ignore BOM
ChrSetCodepage (intCP)
newdata = StrReplace( data, str, rep )

handle = FileOpen(outfile, "WRITE")
FileWrite(handle, newdata)
FileClose(handle)

Deana · March 24, 2014, 01:10:30 PM

Looks good.

MW4 · March 24, 2014, 01:47:18 PM

Ugh...
It chops off the first character of the file which throws off the first record.
First character is a number 0

Only the first record, the other lines are perfect

MW4 · March 24, 2014, 02:07:28 PM

Is it because of this?

Make sure to set the code page back to the default after calling FileGetW. If you don't, the subsequent call to StrReplace will not work.

td · March 24, 2014, 02:09:01 PM

As mentioned last week, UTF-8 encoded files often do not have a BOM. Notepad.exe puts a three-byte BOM in UTF-8 encoded files, but that is the exception and not the rule.

MW4 · March 24, 2014, 02:14:55 PM

So how do I get around that?
In here : StrSub( FileGetW( utf8file ), 2, -1 )

StrSub( FileGetW( utf8file ), 1, -1 ) ??

Deana · March 24, 2014, 02:18:04 PM

Remove the StrSub altogether.
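
If the same script may receive files both with and without a BOM, one option is to skip the BOM only when it is actually present. The sketch below shows that idea in Python (used only for illustration; it is not WinBatch). The function name `decode_utf8_optional_bom` is hypothetical, and the sample record is made up to start with the digit 0, as in the file described above.

```python
import codecs

def decode_utf8_optional_bom(data: bytes) -> str:
    """Decode UTF-8 bytes, dropping a leading BOM only when one is there."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data.decode("utf-8")

record = "0,Bluetooth\u00ae Wireless Technology".encode("utf-8")

# Without a BOM, nothing is skipped, so the leading "0" survives.
no_bom = decode_utf8_optional_bom(record)

# With a BOM, only the three BOM bytes are skipped.
with_bom = decode_utf8_optional_bom(codecs.BOM_UTF8 + record)

# Unconditionally dropping the first character loses data when there is no BOM.
chopped = record.decode("utf-8")[1:]

print(no_bom[0], with_bom[0], chopped[0])
```

An unconditional skip of the first character is only safe when every input file is known to start with a BOM.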
