WinBatch® Technical Support Forum

All Things WinBatch => WinBatch => Topic started by: billfrog on February 22, 2017, 03:12:08 AM

Title: Removing Diacritics
Post by: billfrog on February 22, 2017, 03:12:08 AM
Is there a simple way to remove diacritics from a string?
For instance, I am parsing a text file line by line that contains accented characters:
é
á

Is there a function (either built in or user defined) that strips their accents, returning a standard "e" or "a"?
Title: Re: Removing Diacritics
Post by: JTaylor on February 22, 2017, 05:52:49 AM
Just do a StrReplace().

Jim
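
A minimal sketch of that suggestion, assuming the text is single-byte ANSI and the script file is saved in the same code page. The variable name and sample text are made up for illustration:

```winbatch
; Sketch only: strip two specific accents from one line with StrReplace.
; "line" and its contents are hypothetical sample data.
line = "café olá"
line = StrReplace(line, "é", "e")   ; one call per accented character
line = StrReplace(line, "á", "a")
Message("Result", line)
```

Each accented character needs its own call, so this approach is practical only when a handful of characters have to be handled.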
Title: Re: Removing Diacritics
Post by: Russell_Williams on February 22, 2017, 06:07:20 AM
StrReplace would work, but you would need several dozen calls to cover the alphabet if your text has a lot of characters like that.
You can use binary tables and get them all in one change, like this:

http://techsupt.winbatch.com/webcgi/webbatch.exe?techsupt/nftechsupt.web+WinBatch/Strings+Convert~Special~Characters.txt

Title: Re: Removing Diacritics
Post by: billfrog on February 22, 2017, 08:33:52 AM
I've actually already tried that script. It converts things like
Ebène Quartet
to
EbA'ne Quartet

Any other suggestions?
Title: Re: Removing Diacritics
Post by: td on February 22, 2017, 11:09:40 AM
I tried the Tech Database script on your text "Ebène Quartet" and it converted the "è" character to "e" (hex value E8 to hex value 65) as expected. If you are getting a different result, it could be because the file is encoded in something other than standard ANSI characters. For example, it could be in a DBCS encoding or UTF-8. So first you will need to figure out whether character encoding is the issue. If it is, then you may need to go back to the first suggestion and use StrReplace. Whether you need StrReplace will depend on whether the character set used has a mix of single-byte and multi-byte character codes.
Title: Re: Removing Diacritics
Post by: billfrog on February 23, 2017, 07:22:10 AM
Well, it's an XML file, and it is declared as UTF-8. So I guess I need a way to convert it from UTF-8 to ANSI, then I can use the binary buffer script to replace diacritic characters.

Title: Re: Removing Diacritics
Post by: snowsnowsnow on February 23, 2017, 10:49:00 AM
This is the sort of task that you will never get right if you try to do it yourself.  That is, if you try to do it by just doing a bunch of search-and-replaces.  The problem is that these formats are too complex and they keep changing (i.e., being "enhanced") day by day.

The right solution is to get a program that does this conversion for you, and let other people be responsible for tracking the minutiae.  And the tool you want is "iconv".

Now if you are working under Unix/Linux, you can get this for free.  If you're on Windows, you may have to hunt around (but it should be pretty easy to get it running under Cygwin).

Anyway, that's what I would do.
Title: Re: Removing Diacritics
Post by: td on February 23, 2017, 01:33:41 PM
Quote from: billfrog on February 23, 2017, 07:22:10 AM
Well, it's an XML file, and it is declared as UTF-8. So I guess I need a way to convert it from UTF-8 to ANSI, then I can use the binary buffer script to replace diacritic characters.

If you have UTF-8, all you need is a couple more lines added to the Tech Database script to get the job done. First you convert from UTF-8 to UTF-16, then to 8-bit ANSI:

Code (winbatch)
BinaryConvert( BinBuf, 0, 3, 65001, 0 )
BinaryConvert( BinBuf, 3, 0, 0, 0 )



Note that there may be a 3-byte BOM at the beginning of the file. You can tell that this is the case by looking for three strange characters at the beginning of your output file.
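
One way to check for the BOM up front, rather than by eye, is to test the first three bytes of the buffer before the two BinaryConvert calls. This is a sketch only: it assumes BinBuf already holds the raw, unconverted file contents, and the names srcSize, hasBom, and Trimmed are made up for illustration.

```winbatch
; Sketch: detect and drop a UTF-8 BOM (bytes EF BB BF) before converting.
; Assumes BinBuf already holds the raw file contents read with BinaryRead.
srcSize = BinaryEodGet(BinBuf)
hasBom  = @FALSE
If srcSize > 3
   If BinaryPeek(BinBuf, 0) == 239 && BinaryPeek(BinBuf, 1) == 187 && BinaryPeek(BinBuf, 2) == 191
      hasBom = @TRUE
   EndIf
EndIf

If hasBom
   ; Copy everything after the BOM into a fresh buffer sized for the UTF-16 step.
   Trimmed = BinaryAlloc(srcSize * 2)
   BinaryCopy(Trimmed, 0, BinBuf, 3, srcSize - 3)
   BinaryEodSet(Trimmed, srcSize - 3)
   BinaryFree(BinBuf)
   BinBuf = Trimmed
EndIf
```

Copying into a second buffer is just one option; the point is that the BOM is removed while the data is still UTF-8, so it never reaches the output file.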

Title: Re: Removing Diacritics
Post by: td on February 23, 2017, 01:39:47 PM
Quote from: snowsnowsnow on February 23, 2017, 10:49:00 AM
This is the sort of task that you will never get right if you try to do it yourself.  That is, if you try to do it by just doing a bunch of search-and-replaces.  The problem is that these formats are too complex and they keep changing (i.e., being "enhanced") day by day.

The right solution is to get a program that does this conversion for you, and let other people be responsible for tracking the minutiae.  And the tool you want is "iconv".

Now if you are working under Unix/Linux, you can get this for free.  If you're on Windows, you may have to hunt around (but it should be pretty easy to get it running under Cygwin).

Anyway, that's what I would do.

UTF-8 has been around for a long time in computer years. It was first sanctioned around 1993. The one change, according to Wikipedia, was that in 2003 it was given the same restrictions that UTF-16 has. According to several sources, almost 90% of all websites are encoded in UTF-8.
Title: Re: Removing Diacritics
Post by: billfrog on February 27, 2017, 05:11:52 AM
This worked perfectly for my purpose. Thank you very much.



Quote from: td on February 23, 2017, 01:33:41 PM
Quote from: billfrog on February 23, 2017, 07:22:10 AM
Well, it's an XML file, and it is declared as UTF-8. So I guess I need a way to convert it from UTF-8 to ANSI, then I can use the binary buffer script to replace diacritic characters.

If you have UTF-8, all you need is a couple more lines added to the Tech Database script to get the job done. First you convert from UTF-8 to UTF-16, then to 8-bit ANSI:

Code (winbatch)
BinaryConvert( BinBuf, 0, 3, 65001, 0 )
BinaryConvert( BinBuf, 3, 0, 0, 0 )



Note that there may be a 3-byte BOM at the beginning of the file. You can tell that this is the case by looking for three strange characters at the beginning of your output file.
Title: Re: Removing Diacritics
Post by: td on February 27, 2017, 07:33:52 AM
I should have mentioned this in my previous response.  Make sure you allocate a binary buffer that is large enough to handle the conversion from UTF-8 to UTF-16.  Usually ~2x file size is sufficient.
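
Putting the pieces from this thread together, the whole conversion might look roughly like the sketch below. The file paths and the names inFile, outFile, and fSize are placeholders, the diacritic replacement itself is left to the Tech Database script, and an empty input file is not handled.

```winbatch
; Sketch: read a UTF-8 file, convert it to 8-bit ANSI, and write the result.
; inFile and outFile are hypothetical paths.
inFile  = "C:\Temp\input.xml"
outFile = "C:\Temp\output.txt"

fSize  = FileSize(inFile)
BinBuf = BinaryAlloc(fSize * 2)          ; ~2x file size leaves room for UTF-16
BinaryRead(BinBuf, inFile)

BinaryConvert(BinBuf, 0, 3, 65001, 0)    ; UTF-8 -> UTF-16
BinaryConvert(BinBuf, 3, 0, 0, 0)        ; UTF-16 -> 8-bit ANSI

; ... run the Tech Database diacritic replacement on BinBuf here ...

BinaryWrite(BinBuf, outFile)
BinaryFree(BinBuf)
```

Allocating the buffer before the read matters because BinaryConvert works in place, so the buffer must already be large enough for the intermediate UTF-16 data.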