| Author |
Smarter Charset Conversion
|
Scott Selikoff
Saloon Keeper
Joined: Oct 23, 2005
Posts: 3652
|
|
I have text that comes through as Latin ("ISO-8859-1") and I want to simplify the encoding to ASCII ("US-ASCII"). Both are 8-bit encoding, the first supports all 256 characters (including accented letters for multiple languages) while the latter supports the first 128 characters, primarily English text.
I found one way to do it using the String constructor, namely:
The process converts characters 128-255 to the "?" character. Is there a smarter conversion available in the Java APIs? For example, it would be useful if the multiple "e" vowels with accents (upper and lower case) were converted to their non-accented counter-parts, rather than a "?". Is there anything like that available in Java?
Alternatively, I could write a parser that reads the characters one at a time and uses a look-up table for each letter, but it seems like re-inventing the wheel to me, so I thought I'd checked here first.
|
My Blog: Down Home Country Coding with Scott Selikoff
|
 |
Mike Simmons
Ranch Hand
Joined: Mar 05, 2008
Posts: 2778
|
|
You might want to read this previous discussion. Short version: try java.text.Normalizer. Obligatory disclaimer: I haven't used it myself. Nor do I know anyone who has. As far as I know. But it seems like it's designed for what you're doing, so I say give it a try. And then please tell us if it works. Good luck...
|
 |
 |
|
|
subject: Smarter Charset Conversion
|
|
|