Removing Diacritics

Started by billfrog, February 22, 2017, 03:12:08 AM


billfrog

Is there a simple way to remove diacritics from a string?
For instance, I am parsing a text file line by line that contains accented characters:
é
á

Is there a function (either built-in or user-defined) that strips the accents, returning a plain "e" or "a"?

JTaylor


Russell_Williams

StrReplace would work, but you would need several dozen calls to cover the alphabet if your text has a lot of characters like that.
You can use binary tables and handle them all in one pass, like this:

http://techsupt.winbatch.com/webcgi/webbatch.exe?techsupt/nftechsupt.web+WinBatch/Strings+Convert~Special~Characters.txt
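
For reference, the brute-force StrReplace version would look something like this. Just a sketch, and the character list is only a small sample; you would need one line per accented character you care about:

Code (winbatch) Select
line = 'café olé'   ; sample text with accents
line = StrReplace( line, 'é', 'e' )
line = StrReplace( line, 'á', 'a' )
; ...one StrReplace per accented character you need to cover...
Message( 'Result', line )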


billfrog

I've actually already tried that script. It converts things like
Ebène Quartet
to
EbA'ne Quartet

Any other suggestions?

td

I tried the Tech Database script on your text "Ebène Quartet" and it converted the "è" character to "e" (hex value E8 to hex value 65) as expected. If you are getting a different result, it could be because the file is encoded in something other than standard ANSI characters; for example, it could be DBCS or UTF-8. So first you will need to figure out whether character encoding is the issue. If it is, you may need to go back to the first suggestion and use StrReplace. Whether you need StrReplace will depend on whether the character set used has a mix of single-byte and multi-byte character codes.
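
As a rough first check (the file name below is just a placeholder), you can peek at the first few bytes of the file; a UTF-8 byte order mark, for example, is the three bytes EF BB BF:

Code (winbatch) Select
file = 'C:\temp\input.xml'   ; placeholder path
bb = BinaryAlloc( Max( FileSize( file ), 4 ) )
BinaryRead( bb, file )
; 239 187 191 is hex EF BB BF, the UTF-8 byte order mark
If BinaryPeek( bb, 0 ) == 239 && BinaryPeek( bb, 1 ) == 187 && BinaryPeek( bb, 2 ) == 191
   Message( 'Encoding check', 'The file starts with a UTF-8 BOM.' )
EndIf
BinaryFree( bb )

Keep in mind that a UTF-8 file is not required to carry a BOM, so this check can come back empty even for UTF-8 data.
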
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

billfrog

Well, it's an XML file, and it is declared as UTF-8. So I guess I need a way to convert it from UTF-8 to ANSI; then I can use the binary buffer script to replace the diacritic characters.


snowsnowsnow

This is the sort of task that you will never get right if you try to do it yourself.  That is, if you try to do it by just doing a bunch of search-and-replaces.  The problem is that these formats are too complex and they keep changing (i.e., being "enhanced") day by day.

The right solution is to get a program that does this conversion for you, and let other people be responsible for tracking the minutiae.  And the tool you want is "iconv".

Now if you are working under Unix/Linux, you can get this for free.  If you're on Windows, you may have to hunt around (but it should be pretty easy to get it running under Cygwin).
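
For what it is worth, here is a rough sketch of driving iconv from a WinBatch script. It assumes iconv.exe is reachable on the PATH (e.g. from a Cygwin install), the file names are placeholders, and //TRANSLIT asks iconv to approximate characters that have no plain ASCII equivalent:

Code (winbatch) Select
; Redirection is a shell feature, so run iconv through cmd /c.
RunWait( 'cmd.exe', '/c iconv -f UTF-8 -t ASCII//TRANSLIT input.xml > output.xml' )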

Anyway, that's what I would do.

td

Quote from: billfrog on February 23, 2017, 07:22:10 AM
Well, it's an XML file, and it is declared as UTF-8. So I guess I need a way to convert it from UTF-8 to ANSI; then I can use the binary buffer script to replace the diacritic characters.

If you have UTF-8, all you need is a couple more lines added to the Tech Database script to get the job done. First convert from UTF-8 to UTF-16, then to 8-bit ANSI:

Code (winbatch) Select
BinaryConvert( BinBuf, 0, 3, 65001, 0 )   ; UTF-8 (code page 65001) -> UTF-16
BinaryConvert( BinBuf, 3, 0, 0, 0 )       ; UTF-16 -> 8-bit ANSI (system default code page)
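
In context the whole thing might look roughly like this; the file names are placeholders, and the diacritic replacement itself is still the Tech Database script's job:

Code (winbatch) Select
file = 'C:\temp\input.xml'   ; placeholder path
; Allocate roughly twice the file size so the UTF-16 stage has room to grow.
BinBuf = BinaryAlloc( FileSize( file ) * 2 + 16 )
BinaryRead( BinBuf, file )
BinaryConvert( BinBuf, 0, 3, 65001, 0 )   ; UTF-8 (code page 65001) -> UTF-16
BinaryConvert( BinBuf, 3, 0, 0, 0 )       ; UTF-16 -> 8-bit ANSI
; ... run the Tech Database diacritic replacement against BinBuf here ...
BinaryWrite( BinBuf, 'C:\temp\output.xml' )   ; placeholder path
BinaryFree( BinBuf )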



Note that there may be a 3-byte BOM (the bytes EF BB BF) at the beginning of the file. You can tell that this is the case by looking for three strange characters at the beginning of your output file.
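
If the BOM does turn up, one way to drop it before converting might be something like this (again just a sketch; it assumes the buffer handle is BinBuf as above):

Code (winbatch) Select
bom = Num2Char( 239 ) : Num2Char( 187 ) : Num2Char( 191 )   ; hex EF BB BF
; Only bother when the marker is sitting at the very start of the buffer.
If BinaryPeek( BinBuf, 0 ) == 239 && BinaryPeek( BinBuf, 1 ) == 187 && BinaryPeek( BinBuf, 2 ) == 191
   BinaryReplace( BinBuf, bom, '', @TRUE )   ; removes every occurrence; the BOM bytes should not appear elsewhere in practice
EndIf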

"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

td

Quote from: snowsnowsnow on February 23, 2017, 10:49:00 AM
This is the sort of task that you will never get right if you try to do it yourself.  That is, if you try to do it by just doing a bunch of search-and-replaces.  The problem is that these formats are too complex and they keep changing (i.e., being "enhanced") day by day.

The right solution is to get a program that does this conversion for you, and let other people be responsible for tracking the minutiae.  And the tool you want is "iconv".

Now if you are working under Unix/Linux, you can get this for free.  If you're on Windows, you may have to hunt around (but it should be pretty easy to get it running under Cygwin).

Anyway, that's what I would do.

UTF-8 has been around for a long time in computer years. It was first sanctioned around 1993. The one change since then, according to Wikipedia, is that in 2003 it was restricted to the same code point range that UTF-16 has. According to several sources, almost 90% of all websites are encoded in UTF-8.
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

billfrog

This worked perfectly for my purpose. Thank you very much.



Quote from: td on February 23, 2017, 01:33:41 PM
Quote from: billfrog on February 23, 2017, 07:22:10 AM
Well, it's an XML file, and it is declared as UTF-8. So I guess I need a way to convert it from UTF-8 to ANSI; then I can use the binary buffer script to replace the diacritic characters.

If you have UTF-8, all you need is a couple more lines added to the Tech Database script to get the job done. First convert from UTF-8 to UTF-16, then to 8-bit ANSI:

Code (winbatch) Select
BinaryConvert( BinBuf, 0, 3, 65001, 0 )   ; UTF-8 (code page 65001) -> UTF-16
BinaryConvert( BinBuf, 3, 0, 0, 0 )       ; UTF-16 -> 8-bit ANSI (system default code page)



Note that there may be a 3-byte BOM (the bytes EF BB BF) at the beginning of the file. You can tell that this is the case by looking for three strange characters at the beginning of your output file.

td

I should have mentioned this in my previous response.  Make sure you allocate a binary buffer that is large enough to handle the conversion from UTF-8 to UTF-16.  Usually ~2x file size is sufficient.
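
In other words (the file name is a placeholder), something along these lines leaves enough headroom:

Code (winbatch) Select
file = 'C:\temp\input.xml'
BinBuf = BinaryAlloc( FileSize( file ) * 2 + 16 )
BinaryRead( BinBuf, file )
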
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade