FileRead() & Unicode

JTaylor · October 18, 2022, 01:35:38 PM

I think I am missing something obvious here...

If, in a file, I have

Réné François

and I do the following, should I expect the message() function to display it correctly? The file is encoded as UTF-8, according to Notepad++. If I remove @TRUE from FileOpen, it still displays incorrectly.

Code (winbatch)

update_file = "luc.txt"
or = FileOpen(update_file, "READ", @TRUE)
line_read = FileRead(or)
Message("LINE", line_read)
FileClose(or)

Thanks.

Jim

JTaylor · October 18, 2022, 01:39:14 PM

I just tried starting with an ANSI file encoding and it works correctly. How can I read it with UTF-8 encoding? I know I am going to feel like an idiot once you tell me.

Jim

bottomleypotts · October 18, 2022, 03:31:30 PM

Here is how I did it.

Code (winbatch)

#DefineFunction UTF8toANSI(strUTF8)
   intCP = ChrSetCodepage(65001)          ; Translate using UTF-8
   strUni = ChrStringToUnicode(strUTF8)   ; Converts the UTF-8 bytes to a Unicode (UTF-16) string
   intCP = ChrSetCodepage(intCP)          ; Back to the original code page
   Return ChrUnicodeToString(strUni)      ; Converts the Unicode string to ANSI
#EndFunction

update_file = "luc.txt"
or = FileOpen(update_file, "READ", @FALSE)
line_read = FileRead(or)
Message("LINE", UTF8toANSI(line_read))
FileClose(or)

td · October 18, 2022, 10:55:05 PM

On Windows and in WinBatch, the term Unicode refers to UTF-16, which is the native Windows character encoding. UTF-8 is not the native Windows character encoding. Also, UTF-8 does not equal ANSI. The two agree only on the 7-bit ASCII range, so an accented character such as é is a single byte (0xE9) in Windows-1252 but two bytes (0xC3 0xA9) in UTF-8.

JTaylor · October 19, 2022, 05:21:39 AM

I knew UTF-8 and ANSI were not the same. Good to know on the Unicode clarification. Now, back to the original question, is there a simple way to read and use the data or does it require jumping through the hoops as outlined in BP's post?

Jim

stanl · October 19, 2022, 05:33:49 AM

Quote from: JTaylor on October 19, 2022, 05:21:39 AM
I knew UTF-8 and ANSI were not the same. Good to know on the Unicode clarification. Now, back to the original question, is there a simple way to read and use the data or does it require jumping through the hoops as outlined in BP's post?

Jim

Doubt it. Just look over the 'Quiz for Duck' thread ;D

td · October 19, 2022, 07:41:24 AM

Quote from: JTaylor on October 19, 2022, 05:21:39 AM
I knew UTF-8 and ANSI were not the same. Good to know on the Unicode clarification. Now, back to the original question, is there a simple way to read and use the data or does it require jumping through the hoops as outlined in BP's post?

Jim

Your original post left it unclear what was and wasn't understood, so I thought it best to cover the basics.

BP's post hints at the solution. On Windows, UTF-8 is treated as a code page, except that you can't use it as part of a locale (I haven't tried it, so I'm not sure about that last part). If you don't like that, you will have to take it up with MSFT. So with WinBatch, you have to briefly change the code page to convert UTF-8 to UTF-16. Once you get to UTF-16, you can use almost any WIL function on your text. For example,

Code (winbatch)

; @true sets file read to convert the contents to Unicode (UTF-16).
hFile = FileOpen("c:\temp\utf-8Test.txt","read",@true)  

; Set the code page to UTF-8.
nPrevCP = ChrSetCodepage(65001)
strUni = FileRead(hFile)
FileClose(hFile)

; Set the code page back to (in my case) Windows-1252,
; which is also referred to as ANSI, but that is a
; relatively harmless misnomer.
ChrSetCodepage(nPrevCP) 

Message("UTF-8 Contents as UTF-16", strUni)
exit

You don't need to call ChrSetCodepage before every call to FileRead. Just make sure you avoid any processing that may perform ANSI to Unicode conversion before you set the code page back to the default.
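As a rough illustration of that point, here is a minimal, untested sketch that sets the code page once, reads every line, and restores the code page only after the last read. The file path is the same placeholder as in the example above, the variable names are made up for the illustration, and it assumes FileRead signals end of file with the usual "*EOF*" marker.

```winbatch
; Sketch only: placeholder path, hypothetical variable names.
strFile = "c:\temp\utf-8Test.txt"

; @TRUE asks FileRead to return Unicode (UTF-16) strings.
hFile = FileOpen(strFile, "READ", @TRUE)

; Switch to UTF-8 once, before the first read.
nPrevCP = ChrSetCodepage(65001)

strAll = ""
While @TRUE
   strLine = FileRead(hFile)
   ; The literals used inside the loop are ASCII-only, so they convert
   ; the same way under UTF-8 as under the default code page.
   If strLine == "*EOF*" Then Break
   strAll = strAll : strLine : @CRLF
EndWhile

FileClose(hFile)

; Restore the original code page only after all reads are done.
ChrSetCodepage(nPrevCP)

Message("UTF-8 file as UTF-16", strAll)
```

Keeping the loop free of non-ASCII ANSI literals is what makes it safe to leave the code page at 65001 for the whole read.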

JTaylor · October 19, 2022, 08:18:36 AM

Excellent. Thank you.

Jim

td · October 19, 2022, 08:55:44 AM

One other caveat. The FileOpen function checks for a UTF-16 BOM but does not recognize UTF-8 BOMs, which may or may not exist in UTF-8 encoded files. If the BOM does exist, the first FileRead will contain what appears to be a leading space or empty string, which is actually the BOM. If for some reason you want to remove it, you can try the following:

Code (winbatch)

; Only needed on the first call to FileRead.
if StrSub(strUni, 1, 1) == "" then strUni = StrSub(strUni, 2, -1) ; Test is for an empty string and not a space.

Obviously, a little more is needed to check for the first read if desired.
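One way to handle the first-read check is a simple flag. The sketch below is untested and rests on assumptions that are not from the posts above: that ChrUnicodeToHex returns four hex digits per character, and, because the byte order of those digits is not confirmed here, that accepting either FFFE or FEFF is good enough. The flag name bFirst and the file path are placeholders.

```winbatch
; Sketch only: strip a UTF-8 BOM (decoded as U+FEFF) from the first line.
hFile = FileOpen("c:\temp\utf-8Test.txt", "READ", @TRUE)
nPrevCP = ChrSetCodepage(65001)

bFirst = @TRUE   ; hypothetical flag: true until the first line is handled
While @TRUE
   strUni = FileRead(hFile)
   If strUni == "*EOF*" Then Break

   If bFirst
      bFirst = @FALSE
      ; Assumption: four hex digits per character, byte order unverified,
      ; so both possible spellings of the BOM are accepted.
      strHex = StrUpper(ChrUnicodeToHex(StrSub(strUni, 1, 1)))
      If strHex == "FFFE" || strHex == "FEFF" Then strUni = StrSub(strUni, 2, -1)
   EndIf

   ; ... use strUni here ...
EndWhile

FileClose(hFile)
ChrSetCodepage(nPrevCP)
```

Comparing the hex form avoids typing an invisible character into the script, at the cost of one extra function call on the first line only.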

JTaylor · October 19, 2022, 07:21:53 PM

Apologies if I missed this in the help file, but does SendKey() not handle Unicode data? Making sure the file was encoded as UTF-16 solved the initial problem (thanks), but then the next step sends the data to the currently focused window. The application has an option to use the clipboard instead and that works, but I would like either option to work, if possible. What is interesting is that a lot of people have used this app for over 20 years and this is the first time someone has reported this problem.

Thanks.

Jim

td · October 20, 2022, 08:22:40 AM

SendKey converts Unicode text to the current user's current code page. It communicates with the keyboard device driver, so it must conform to the expectations of that driver at a low level. Windows abstractions span the space between the current locale and the driver to some extent, but far from completely. Unicode characters outside the range of the current code page may map to special keystroke combinations that vary from locale to locale, depending on the type of keyboard and driver installed on the system. I am not sure a Unicode version of SendKey is practical since WinBatch is used worldwide. This is particularly true once you consider keyboards not based on the keyboard layout for Latin-script code pages.

[edit] I think there may be a way to send Unicode characters using the existing ABI after all. More research is required, so no promises.

JTaylor · October 20, 2022, 10:15:54 AM

Okay. This is being used on a computer using a French Canadian version of Windows or at least configured for French. Don't have exact details. I haven't heard back but hope having them use a UTF-16 encoding and using the Clipboard option will solve their problem. Thanks again.

Jim

td · October 20, 2022, 12:54:16 PM

Interesting. This seems like a reasonable request and could benefit a large number of users. Although it does not help much now, hopefully, there will be something in the next release to address this request.

JTaylor · October 20, 2022, 05:12:50 PM

Most Excellent. Thank you very much!!!

Jim

td · October 20, 2022, 10:36:04 PM

We are going to give it a try but there is no guarantee of success. A few restrictions need to be worked around, and the functionality may be more limited for Unicode which is not a good thing.

td · October 21, 2022, 06:02:25 AM

If your clipboard solution is rejected, you could try using SendMessageW to send WM_KEYDOWN or WM_KEYUP messages to pass Unicode characters to the foreground process's message queue. You would need SendMessageW's wParam set to VK_PACKET for Unicode. Never tried it, so I have no idea what will happen. Just passing it along in case you get desperate or are just bored and need a rabbit hole to go down.

JTaylor · October 21, 2022, 01:47:53 PM

The Clipboard and UTF-16 file route solved the problem for now.

Will be interested to see how the other turns out. Thanks again.

Jim

JTaylor · October 21, 2022, 01:48:51 PM

...and also, thanks B.P. Not sure I said that earlier.

Jim

chrislegarth · October 17, 2023, 04:46:15 PM

Thank you for the excellent information in the post above. I need to read Spanish, and maybe other languages, out of a text file and display it in a dialog, but was running into the same issue that JTaylor was experiencing. The help and information in this forum and tech database are invaluable.

Thanks!
