Cleaning up a Unicode (?) string.

Started by snowsnowsnow, July 30, 2017, 06:38:47 PM

snowsnowsnow

I have this little problem, that takes the form of "It works fine under XP, but fails under 7".

I have a long-standing program that uses some kind of Windows system-y call (in WB) and gets back a string.  The details of what this string is or where it comes from need not concern us.

Anyway, in XP, the string is what it should be and it works fine.  But under 7, it is basically the right string, but has a bunch of (sort of) question marks interspersed in it.  I'm guessing this is a Unicode thing, but I don't know much about that sort of thing, so it is only a guess.

But the worst part is that there doesn't seem to be any obvious way to clean it up.  I spent a fair amount of time today messing around with this and came up with a brute force method that works, but I'm wondering if there is a better/cleaner method.  The problem is that although the funny characters look like question marks, they don't compare equal to "?".  Further, although Char2Num(theFunnyChar) is 63, theFunnyChar is not equal to Num2Char(63).  The upshot of this is that the only way to detect theFunnyChar is to test if Char2Num() == 63.

Anyway, the following works, but I'm wondering if there is something better/cleaner.

Code (winbatch)

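; Returns s with every character whose Char2Num() value is 63 removed.
; Note that this also drops genuine question marks.
#DefineFunction udfClean(s)
l = StrLen(s)
res = ""
FOR I = 1 TO l
    t = StrSub(s,I,1)   ; current character
    IF Char2Num(t) != 63 THEN res = StrCat(res,t)
NEXT
Return res
#EndFunction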


stanl

If you confirm you are receiving the string as Unicode you can search for 'codepage' in the Tech DB where there are several examples of converting from Unicode. But if what you posted works I'm not sure any of those examples would be cleaner/better.

td

WinBatch has fully supported UTF-16 Unicode strings for close to the last ten years.  If you are seeing "?" characters, it is because at some point you are doing something that converts a Unicode string to a Windows ANSI string.  The obvious solution is to stop doing whatever it is that causes the conversion.
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

snowsnowsnow

Quote from: td on July 31, 2017, 06:33:30 AM
WinBatch has fully supported UTF-16 Unicode strings for close to the last ten years.
Is there some way to turn that off?  Thanks.

Quote
If you are seeing "?" characters, it is because at some point you are doing something that converts a Unicode string to a Windows ANSI string.  The obvious solution is to stop doing whatever it is that causes the conversion.

This part doesn't make any sense to me.

Isn't there a function to convert Unicode to ANSI?

td

Recent versions of WinBatch are Unicode/ANSI agnostic.  99% of the time you are using Unicode strings without even realizing it.  I am not sure what you want to turn off, but if you mean Unicode support, there is no reason to turn it off.  If you are getting a string with question marks in it, you have a Unicode source that is producing a Unicode string that you are somehow converting to ANSI.  The best approach is to leave Unicode as Unicode.

And yes, assuming you have a new enough version of WinBatch, there are functions to convert Unicode strings to ANSI and ANSI strings to Unicode.  But if you convert a Unicode string with characters that cannot be represented in your current code page, you will still get question marks.  Check out the Chr* functions in the consolidated WIL help file.
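
For readers who want to see the substitution outside WinBatch, here is a minimal Python sketch of the same mechanism. It assumes a Western European code page (cp1252) stands in for "your current code page", and the helper name `to_ansi` is hypothetical, not a WinBatch or WIL function.

```python
def to_ansi(text, codepage="cp1252"):
    """Encode text to a single-byte code page.

    Characters the code page cannot represent are replaced with '?',
    which is the behavior described in the post above.
    """
    return text.encode(codepage, errors="replace")


# Hebrew letters shin, lamed, vav, final mem: none exist in cp1252.
hebrew = "\u05e9\u05dc\u05d5\u05dd"

ascii_bytes = to_ansi("hello")   # plain ASCII survives unchanged
hebrew_bytes = to_ansi(hebrew)   # every letter is replaced by '?'
print(ascii_bytes, hebrew_bytes)
```

Once the conversion has happened, the original characters cannot be recovered from the result, which is why leaving Unicode as Unicode is the safer route.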

td

I assumed you had already thought of the possibility but I will mention it anyway.  It could be that you are not dealing with Unicode characters at all.  You could be somehow dumping some garbage into a WIL string whose values do not represent displayable ANSI characters.

snowsnowsnow

Quote from: td on July 31, 2017, 07:18:03 PM
I assumed you had already thought of the possibility but I will mention it anyway.  It could be that you are not dealing with Unicode characters at all.  You could be somehow dumping some garbage into a WIL string whose values do not represent displayable ANSI characters.

I think the important points are:

1) It looks like, given my constraints, the solution I came up with is about as good as it gets.

2) As you know, I'm not running anything resembling a "reasonably up-to-date version of WinBatch", which is probably the root of the problem.

3) Although it is always possible, I don't think it is just an ordinary programming error, which is what your text above suggests as a possibility.  The fact that it displays as a question mark, but doesn't compare equal to a question mark, even though Char2Num() returns 63: all of this suggests that something wacky is going on, which is consistent with the Unicode theory.  I don't think it is possible for ordinary programming to produce a character for which Char2Num() would return 63, but comparing to "?" would return False.  At least not in a normal (i.e., US ANSI) locale.  Also keep in mind that the existing code works fine in XP, but fails in 7; I assume this is because 7 does Unicode.

Anyway, it looks like I'll be sticking with the code I have.

td

You are making several incorrect assumptions.  For example, XP is very much a Unicode operating system, and it is certainly quite possible to place non-ANSI character set characters in a WIL string.  Have done it many times.  The fact that the character does not compare equal to a '?' simply suggests that you are comparing against something that looks like a double-byte or two-byte character.  One possible reason for the difference in results between Num2Char, Char2Num and single-character string comparisons is that your system's default code page is not the same as the current user's code page.  Unlikely, but it does happen.

And 'recent enough' meant a version released in the last 9 or 10 years.  It may, however, be the case that your version does not have one of several enhancements that have been made to extend multi-byte character support and that could impact character string handling.

JTaylor

Probably about to say something stupid as I am no expert on this stuff but, assuming codepage is different than the ANSI/Unicode discussion, is the codepage stuff for XP different than Win7?    Knowing what data you are retrieving, and how, might help solve the problem as well.   If you are simply extracting the data and displaying it and it comes up corrupted with no other steps in between then the source of the data and/or codepage may be relevant.

Jim

td

Not useful but...  Given the Unicode hypothesis and a Latin code page, this demonstrates the previously described behavior:

Code (winbatch)
strUnicode = ChrHexToUnicode("e905dc05d505dd05")  ; Hebrew equivalent of English 'hello'.
c1 = Num2Char(63) == StrSub(strUnicode,1,1)  ; False - because first character is the Hebrew letter Shin
c2 = StrSub(strUnicode,1,1) == '?'           ; False - because the first character is still the Hebrew letter Shin.
c3 = Char2Num(StrSub(strUnicode,1,1)) == 63  ; True - Char2Num converts to Latin and the Hebrew letter Shin has no representation
                                                ; in Latin script so it gets the '?' treatment.
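
The same three comparisons can be mirrored in Python as a rough sketch, again assuming cp1252 as the Latin code page. The helpers `ansi_code` and `strip_unmappable` are hypothetical names for illustration only. The second one shows one way a cleanup could drop only the characters that have no representation in the code page while keeping genuine question marks, which a plain "code equals 63" test cannot distinguish.

```python
def ansi_code(ch, codepage="cp1252"):
    """Return the byte value ch maps to in the code page ('?' = 63 if unmappable)."""
    return ch.encode(codepage, errors="replace")[0]


def strip_unmappable(text, codepage="cp1252"):
    """Drop characters the code page cannot represent; keep everything else."""
    kept = []
    for ch in text:
        try:
            ch.encode(codepage)      # raises if ch has no mapping
        except UnicodeEncodeError:
            continue                 # skip the unmappable character
        kept.append(ch)
    return "".join(kept)


shin = "\u05e9"                      # Hebrew letter shin

c1 = chr(63) == shin                 # False: '?' is not shin
c2 = shin == "?"                     # False: same comparison, other way round
c3 = ansi_code(shin) == 63           # True: shin has no cp1252 mapping

cleaned = strip_unmappable("a\u05e9b?c")   # literal '?' is preserved
print(c1, c2, c3, cleaned)
```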
