Japanese files with mixed wide Shift-JIS and ANSI

Started by guuzendesu, April 24, 2025, 10:14:38 PM


guuzendesu

Is there some way to use WIL to convert an entire file of this type, with double-byte characters, to all single byte characters?

td

You cannot convert a file containing double-byte characters to one with single-byte characters only, because some double-byte characters have no representation in ANSI. You can, however, convert a file from double-byte to UTF-16 LE (Windows Unicode). All you need to do is read the file using FileGet, then make the conversion to Unicode using the function ChrStringToUnicode:

https://docs.winbatch.com/mergedProjects/WindowsInterfaceLanguage/html/WILAK_C__016.htm
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

guuzendesu

Quote from: td on April 25, 2025, 08:42:36 AM
That is because there are no representations in ANSI for some double-byte characters.

I don't intend to "narrow" the kanji, only the Latin characters that are double-byte. Is there a way?

td

What you now state you want is not a file conversion. It is stripping a file of some characters.

For the most part, Latin characters have the same representation in Shift-JIS as in regular ANSI. Also, WinBatch is double-byte aware as long as your OS is configured to support double-byte code pages. Regardless, the only issue with stripping your file is the lead bytes, which can precede trail bytes of 0x40 and above, a range that overlaps the ASCII letters. If you convert to Unicode, characters outside the ANSI range are easily detected using WinBatch string functions. You could also load your file into a binary buffer and manually detect lead bytes, each of which indicates that the following byte is not a Latin character.
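
The binary-buffer route might look roughly like the following. This is an untested sketch: the file path is a placeholder, the lead-byte ranges (0x81-0x9F and 0xE0-0xFC, written in decimal) are the ones usually given for code page 932, and it assumes a non-empty file. Single-byte half-width katakana (0xA1-0xDF) are dropped along with the double-byte characters.

```winbatch
; Sketch only: strip double-byte characters by scanning raw Shift-JIS bytes.
TestFile = 'C:\temp\shift-jis.txt'   ; placeholder path
Size = FileSize(TestFile)
bin = BinaryAlloc(Size)              ; assumes the file is not empty
BinaryRead(bin, TestFile)
ansi = ''
i = 0
While i < Size
   b = BinaryPeek(bin, i)
   ; Assumed lead-byte ranges: 129-159 (0x81-0x9F) and 224-252 (0xE0-0xFC).
   If (b >= 129 && b <= 159) || (b >= 224 && b <= 252)
      ; Skip the lead byte and the trail byte that follows it.
      i = i + 2
   Else
      ; Keep 7-bit characters only; bytes 161-223 are half-width katakana.
      If b < 128 Then ansi := Num2Char(b)
      i = i + 1
   EndIf
EndWhile
BinaryFree(bin)
Message('Double Bytes Removed', ansi)
```

Because the trail byte is skipped together with its lead byte, a trail byte that happens to fall in the ASCII letter range is never mistaken for a Latin character.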

Wikipedia has a chart that shows how Shift-JIS represents single and double-byte characters.

https://en.wikipedia.org/wiki/Shift_JIS

td

You may also need to use the function ChrSetCodePage with codepage 932 to convert to Unicode. It all depends.

td

Here is a simple and almost completely untested example - all bugs are provided at no extra charge.

TestFile = 'C:\temp\shift-jis.txt'	
Size = FileSize(TestFile)
Stuff = FileGet(TestFile)
DefPage = ChrSetCodepage(932)
UniStuff = ChrStringToUnicode(Stuff)
ChrSetCodepage(DefPage)
Size *= 2

bin = BinaryAlloc(Size+2)
BinaryPokeStrW(bin, 0, UniStuff)
ansi = ''
for i = 0 to Size by 2
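   ; Windows Unicode is LE, so the high byte of each 16-bit character is at i+1.
   ; A non-zero high byte means the character is above U+00FF, so skip it.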
   if BinaryPeek(bin, i+1) then continue
   else ansi := Num2Char(BinaryPeek(bin, i))
next
 
message('Double Bytes Removed', ansi)
exit

td

A slightly different untested version.

TestFile = 'C:\temp\shift-jis.txt'    
Size = FileSize(TestFile)
Stuff = FileGet(TestFile)
DefPage = ChrSetCodepage(932)
UniStuff = ChrStringToUnicode(Stuff)
ChrSetCodepage(DefPage)
Size *= 2

bin = BinaryAlloc(Size+2)
BinaryPokeStrW(bin, 0, UniStuff)
ansi = ''
for i = 0 to Size by 2
   ; Windows Unicode is LE, so the high byte of each 16-bit character is at i+1.
   if BinaryPeek(bin, i+1) then continue
   CodeP = BinaryPeek(bin, i)
   ; Keep printable ASCII only (space through tilde); this also drops CR, LF and tabs.
   if CodeP > 31 && CodeP < 127 then ansi := Num2Char(CodeP)
next
 
message('Double Bytes Removed', ansi)
exit
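
If the goal is to narrow the full-width Latin characters rather than drop them, something along these lines may be closer. It is an untested sketch that relies on two assumptions: that Char2Num and Num2Char work with 16-bit character codes on Unicode strings, and that an ANSI string can be concatenated onto a Unicode one. The arithmetic itself comes from Unicode, where the full-width forms U+FF01 through U+FF5E sit exactly 0xFEE0 above ASCII 0x21 through 0x7E, and U+3000 is the full-width space. The variable names narrow, ch and cp are made up for illustration.

```winbatch
; Sketch only: narrow full-width Latin characters and keep everything else.
TestFile = 'C:\temp\shift-jis.txt'   ; placeholder path
Stuff = FileGet(TestFile)
DefPage = ChrSetCodepage(932)
UniStuff = ChrStringToUnicode(Stuff)
ChrSetCodepage(DefPage)

narrow = ''
for i = 1 to StrLen(UniStuff)
   ch = StrSub(UniStuff, i, 1)
   ; Assumption: Char2Num returns the 16-bit code for a Unicode character.
   cp = Char2Num(ch)
   ; 65281-65374 is U+FF01-U+FF5E; subtracting 65248 (0xFEE0) gives ASCII.
   if cp >= 65281 && cp <= 65374 then ch = Num2Char(cp - 65248)
   ; 12288 is U+3000, the full-width (ideographic) space.
   if cp == 12288 then ch = ' '
   narrow = narrow : ch
next

message('Full-width Latin narrowed', narrow)
exit
```

The result is still a Unicode string with the kanji intact; converting it back to Shift-JIS, if that is needed, is a separate step.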
