Japanese files with mixed wide Shift-JIS and ANSI

guuzendesu · April 24, 2025, 10:14:38 PM

Is there some way to use WIL to convert an entire file of this type, with double-byte characters, to all single byte characters?

td · April 25, 2025, 08:42:36 AM

You cannot convert a file containing double-byte characters to one with single-byte characters only, because some double-byte characters have no representation in ANSI. You can, however, convert a file from double-byte to UTF-16 LE (Windows Unicode). All you need to do is read the file using FileGet, then convert the result to Unicode using the function ChrStringToUnicode:

https://docs.winbatch.com/mergedProjects/WindowsInterfaceLanguage/html/WILAK_C__016.htm

guuzendesu · April 25, 2025, 06:06:53 PM

Quote from: td on April 25, 2025, 08:42:36 AM
That is because there are no representations in ANSI for some double-byte characters.

I don't intend to "narrow" the kanji, only the Latin characters that are double-byte. Is there a way?

td · April 28, 2025, 08:04:05 AM

What you now state you want is not a file conversion. It is stripping a file of some characters.

For the most part, Latin characters have the same representation in Shift-JIS as in regular ANSI. Also, WinBatch is double-byte aware as long as your OS is configured to support double-byte code points. Regardless, the only issue with stripping your file is the lead bytes, which can be followed by trail bytes of 0x40 and above that look like ordinary single-byte characters. If you convert to Unicode, characters outside the ANSI range are easily detected using WinBatch string functions. You could also load your file into a binary buffer and manually detect lead bytes, each of which indicates that the following byte is not a Latin character.

Wikipedia has a chart that shows how Shift-JIS represents single and double-byte characters.

https://en.wikipedia.org/wiki/Shift_JIS
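
The binary-buffer approach described above could be sketched roughly as follows. This is an untested illustration, not a definitive implementation. The lead-byte ranges (0x81-0x9F and 0xE0-0xFC, written in decimal in the code) are taken from the Shift-JIS chart rather than from anything stated in this thread, the file path is a placeholder, and the file is assumed to be non-empty.

```winbatch
; Sketch: strip Shift-JIS double-byte characters by detecting lead bytes.
TestFile = 'C:\temp\shift-jis.txt'   ; placeholder path
Size = FileSize(TestFile)
bin = BinaryAlloc(Size)              ; assumes the file is not empty
BinaryRead(bin, TestFile)

ansi = ''
i = 0
while i < Size
   b = BinaryPeek(bin, i)
   ; Assumed Shift-JIS lead-byte ranges: 129-159 (0x81-0x9F) and 224-252 (0xE0-0xFC).
   if (b >= 129 && b <= 159) || (b >= 224 && b <= 252)
      ; Skip the lead byte and the trail byte that follows it.
      i = i + 2
   else
      ; Keep 7-bit bytes only; other high bytes (such as half-width katakana) are dropped.
      if b < 128 then ansi = StrCat(ansi, Num2Char(b))
      i = i + 1
   endif
endwhile
BinaryFree(bin)

Message('Double Bytes Removed', ansi)
```

Because a lead byte always causes two bytes to be skipped, a trail byte that happens to fall in the ASCII range is never mistaken for a Latin character.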

td · April 28, 2025, 10:17:23 AM

You may also need to use the function ChrSetCodePage with codepage 932 before converting to Unicode. It depends on whether your system's default code page is already 932.

td · April 28, 2025, 02:58:56 PM

Here is a simple and almost completely untested example - all bugs are provided at no extra charge.

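TestFile = 'C:\temp\shift-jis.txt'
Stuff = FileGet(TestFile)
; Interpret the bytes as Shift-JIS (code page 932) during the conversion.
DefPage = ChrSetCodepage(932)
UniStuff = ChrStringToUnicode(Stuff)
ChrSetCodepage(DefPage)
; Two bytes per converted character, which can be less than twice the file size.
Size = StrLen(UniStuff) * 2

bin = BinaryAlloc(Size+2)
BinaryPokeStrW(bin, 0, UniStuff)
ansi = ''
for i = 0 to Size-2 by 2
   ; Windows Unicode is LE, so the high byte is at i+1.
   ; A nonzero high byte means the character is outside the single-byte range.
   if BinaryPeek(bin, i+1) then continue
   else ansi := Num2Char(BinaryPeek(bin, i))
next
BinaryFree(bin)

message('Double Bytes Removed', ansi)
exit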

td · April 28, 2025, 03:51:42 PM

A slightly different untested version.

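TestFile = 'C:\temp\shift-jis.txt'
Stuff = FileGet(TestFile)
; Interpret the bytes as Shift-JIS (code page 932) during the conversion.
DefPage = ChrSetCodepage(932)
UniStuff = ChrStringToUnicode(Stuff)
ChrSetCodepage(DefPage)
; Two bytes per converted character, which can be less than twice the file size.
Size = StrLen(UniStuff) * 2

bin = BinaryAlloc(Size+2)
BinaryPokeStrW(bin, 0, UniStuff)
ansi = ''
for i = 0 to Size-2 by 2
   ; Windows Unicode is LE, so the high byte is at i+1.
   if BinaryPeek(bin, i+1) then continue
   CodeP = BinaryPeek(bin, i)
   ; Keep printable ASCII only (space through tilde).
   ; Note that this also drops tabs and line breaks.
   if CodeP > 31 && CodeP < 127 then ansi := Num2Char(CodeP)
next
BinaryFree(bin)

message('Double Bytes Removed', ansi)
exit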
