SDK - Unicode - WinBatch - SQLite Extender

Started by JTaylor, September 28, 2020, 04:55:48 PM

JTaylor

Seeking some general wisdom on Unicode stuff so I know when I am done with the question, in the context of my SQLite Extender but could apply on other stuff as well.

I created a list of titles in my database in various languages.    I can return data correctly to an Array and to a delimited list.    If I use something like ArrayFilePutCSV() the Japanese is corrupted but the French is okay, even though I can display any of the titles correctly, directly from the Array.   For the delimited list, if I use FilePut(), languages such as French are okay but Japanese is corrupted.  If I use FilePutW() it is all good.

Is this a valid point to be at and can I now leave it up to the user to handle what happens with the data I have provided or do I need to do more?

Thank you.

Jim

ChuckC

Unicode defines a multi-language character set that can be encoded in various ways.  When persisting Unicode text to a file or a database field, it is the encoding of the characters that usually trips you up.  Every application that makes use of the Unicode text needs to know how you have it encoded.

On Windows, the native encoding is UTF-16, which uses 1 or 2 16-bit integer values depending on the particular code point [character], with the 2-word form being what's called a "surrogate pair".  However, when it comes to serializing Unicode text, the most common encoding is UTF-8, which is a multi-byte encoding scheme with a variable number of bytes (1 to 4) per character depending on the particular code point [character].  Text limited to plain ASCII characters will look the same whether it is ANSI text or UTF-8 encoded Unicode text.  Accented Latin-1 characters such as é take 2 bytes in UTF-8, so they no longer match their single-byte ANSI form, and languages based on pictographs will definitely be using sequences of 3 to 4 bytes per character; both will be corrupted if the UTF-8 bytes are interpreted as ANSI text in any particular code page.
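
To make the byte counts concrete, here is a small Python sketch (Python is used only for illustration; it is not WinBatch code, and the helper name `encoded_sizes` is made up for this example). It reports how many bytes one character occupies in UTF-8 and in UTF-16 without a byte order mark.

```python
def encoded_sizes(text):
    """Return the byte length of `text` in UTF-8 and in UTF-16 (little-endian, no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }

# ASCII letter, accented Latin letter, Japanese kanji, and an emoji
# that lies outside the Basic Multilingual Plane (needs a surrogate pair).
for sample in ["A", "é", "日", "😀"]:
    print(repr(sample), encoded_sizes(sample))
```

The emoji is the only sample that needs two 16-bit values in UTF-16, which is the surrogate pair case; the kanji fits in one 16-bit value but takes 3 bytes in UTF-8.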

You will need to assess what each consumer of your Unicode text is expecting the encoding to be and go from there....

JTaylor

So in the stated context what would you recommend?  Have I provided what most would expect?

Jim

td

All WinBatch functions support multibyte and/or Unicode (UTF-16) text.  Many but not all functions accepting text have both a Unicode (UTF-16) and a multibyte version. Some functions even support both by converting all string parameters to Unicode when the function finds any Unicode parameters.  It all depends on the function's purpose.  At some point, almost all text passed to Win32 functions is converted to UTF-16 because that is Windows native text.  Win32 does have functions for handling multibyte text, and WinBatch uses them extensively in non-Unicode functions that do any string manipulation requiring string parsing or splitting.  ArrayFilePutCSV happens to be one of the few string handling functions in WinBatch that does not have a Unicode version.  If an array element contains Unicode (UTF-16) text, the function converts it to multibyte using the current codepage, which is the current thread's codepage or the codepage specified in the script.

That is what WinBatch does to meet user expectations.
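
The effect of converting Unicode text to a single code page can be imitated outside WinBatch. The Python sketch below is not WinBatch's implementation; it assumes the active code page is Windows-1252 and that characters with no mapping are replaced by "?". The helper name `to_code_page` and the sample titles are made up for this example.

```python
def to_code_page(text, code_page="cp1252"):
    """Convert Unicode text to bytes in one code page.

    Characters the code page cannot represent are replaced with b"?",
    which is an assumption about how the conversion handles them.
    """
    return text.encode(code_page, errors="replace")

titles = ["Les Misérables", "吾輩は猫である"]
for title in titles:
    raw = to_code_page(title)
    # Decode with the same code page to see what a reader of the file would get.
    print(raw.decode("cp1252"))
```

Under these assumptions the French title survives the round trip because é exists in Windows-1252, while every Japanese character is lost, which matches the behavior reported for ArrayFilePutCSV() and FilePut().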
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

JTaylor

So it sounds like what I am seeing would be expected behavior.   Guess I will assume I am good for now and make sure all the functions are acting the same and make some notes in the Help.   If nothing else I can at least be consistent.   Thank you both.

Jim

td

Note that ArrayFilePutCSV works just fine with Japanese on a computer with the default codepage set to "ANSI/OEM Japanese; Japanese (Shift-JIS)" or with that codepage set in the script.  However, you cannot mix Shift-JIS characters with high bit ANSI characters because the high bit is used to indicate a Japanese double-byte character.

[edit] The second sentence should read "you cannot mix Shift-JIS characters with high bit ANSI Latin characters because the high bit is used to indicate Japanese single and double-byte characters."
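
The clash between Shift-JIS and high bit Latin characters can be seen with a short Python sketch. It uses Python's built-in `shift_jis` and `cp1252` codecs as stand-ins for the Windows code pages, which is an assumption (the Windows Japanese code page differs from plain Shift-JIS in some details), and `can_encode` is a made-up helper name.

```python
def can_encode(text, codec):
    """Return True if every character in `text` can be represented in `codec`."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

print(can_encode("日本語", "shift_jis"))      # Japanese fits in Shift-JIS
print(can_encode("café", "shift_jis"))        # é has no Shift-JIS mapping
print(can_encode("café 日本語", "cp1252"))    # Japanese has no Windows-1252 mapping

# The same high bit byte means different things in each code page.
high_bit_byte = b"\xb1"
print(high_bit_byte.decode("cp1252"))      # plus-minus sign in Windows-1252
print(high_bit_byte.decode("shift_jis"))   # half-width katakana in Shift-JIS
```

A file that needs both kinds of characters has to use a Unicode encoding such as UTF-16 or UTF-8, since neither code page can hold the full set.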

"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

JTaylor