Text encoding determination

Started by stanl, January 23, 2019, 03:49:47 AM


stanl

I need to come up with a UDF to determine the encoding of a text file that will be loaded into a cloud server. I think the WB function ChrStringToHex() is the best way to determine the BOM. I found the following in PowerShell, so I am wondering whether the parsing could be duplicated in WB using that function:




# $bom is assumed to already hold the first four bytes of the file.
$enc = [Text.Encoding]::ASCII
if ($bom[0] -eq 0x2b -and $bom[1] -eq 0x2f -and $bom[2] -eq 0x76)
    { $enc = [Text.Encoding]::UTF7 }
if ($bom[0] -eq 0xff -and $bom[1] -eq 0xfe)
    { $enc = [Text.Encoding]::Unicode }
if ($bom[0] -eq 0xfe -and $bom[1] -eq 0xff)
    { $enc = [Text.Encoding]::BigEndianUnicode }
# Note: 00 00 FE FF is the big-endian UTF-32 BOM, while [Text.Encoding]::UTF32 is little-endian.
if ($bom[0] -eq 0x00 -and $bom[1] -eq 0x00 -and $bom[2] -eq 0xfe -and $bom[3] -eq 0xff)
    { $enc = [Text.Encoding]::UTF32 }
if ($bom[0] -eq 0xef -and $bom[1] -eq 0xbb -and $bom[2] -eq 0xbf)
    { $enc = [Text.Encoding]::UTF8 }
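
The snippet above assumes $bom has already been filled with the first bytes of the file. As a rough sketch of how that step might look, the function below reads up to four bytes and returns an encoding name using the same BOM table. Get-BomEncodingName is a hypothetical name chosen for illustration, and the "ASCII" fallback simply mirrors the snippet: a file without a BOM is not necessarily ASCII.

```powershell
# Get-BomEncodingName is a hypothetical helper name used only for this sketch.
function Get-BomEncodingName {
    param([Parameter(Mandatory = $true)][string]$Path)

    # Resolve to a full path first: .NET file APIs do not follow PowerShell's current location.
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath

    # Read at most the first four bytes of the file.
    $bom = New-Object byte[] 4
    $stream = [System.IO.File]::OpenRead($fullPath)
    try     { $count = $stream.Read($bom, 0, 4) }
    finally { $stream.Dispose() }

    # Hex string of the bytes actually read, e.g. "EFBBBF41"; empty for an empty file.
    $hex = [System.BitConverter]::ToString($bom, 0, $count) -replace '-', ''

    # Check the 4-byte BOM first, then the 3-byte and 2-byte ones.
    if ($hex.StartsWith('0000FEFF')) { return 'UTF32' }
    if ($hex.StartsWith('EFBBBF'))   { return 'UTF8' }
    if ($hex.StartsWith('2B2F76'))   { return 'UTF7' }
    if ($hex.StartsWith('FFFE'))     { return 'Unicode' }
    if ($hex.StartsWith('FEFF'))     { return 'BigEndianUnicode' }

    # No recognized BOM: same default as the snippet above.
    return 'ASCII'
}
```

It could be called as Get-BomEncodingName -Path 'c:\temp\UTF-8-Test.txt'. Comparing a hex string rather than individual bytes keeps the table close to what a WB version using a hex-conversion function would look like.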

td

There are more approaches than I can imagine. Here is one way that may be appropriate, depending on the specific requirements:

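; Returns an encoding name based on the BOM found at the start of a binary buffer.
; _hBin = binary buffer handle, _nCnt = number of bytes read into the buffer.
#DefineFunction GetEncoding (_hBin, _nCnt)
   ; The longest BOM checked here is 4 bytes, so never inspect more than that.
   if _nCnt > 4 then nByteMax = 4
   else nByteMax = _nCnt
   strHex = BinaryPeekHex(_hBin, 0, nByteMax)
   ; Default when no BOM is recognized.
   strRet = "ASCII"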
   switch nByteMax
      case 4
         ; On a mismatch there is no break, so execution falls through to the shorter BOM checks.
         if strHex == "0000FEFF" then strRet = "UTF32"
         then break
      case 3
         strHex = StrSub(strHex, 1, 6)
         if strHex == "2B2F76" then strRet = "UTF7"
         then break
         if strHex == "EFBBBF" then strRet = "UTF8"
         then break
      case 2
         strHex = StrSub(strHex, 1, 4)
         if strHex == "FFFE" then strRet = "Unicode"
         then break
         if strHex == "FEFF" then strRet = "BigEndianUnicode"
   endswitch

   return strRet
#EndFunction
         
strFile = "c:\temp\Unicode-Test.txt"
;strFile = "c:\temp\UTF-8-Test.txt"
hBin = BinaryAlloc(FileSize(strFile))
nBytes = BinaryReadEx(hBin, 0, strFile, 0, 4)

strEncoding = GetEncoding(hBin, nBytes)
BinaryFree(hBin)

Message( strFile:' Encoding', strEncoding)
exit

stanl