
Smarter Charset Conversion

Scott Selikoff
Saloon Keeper

Joined: Oct 23, 2005
Posts: 3749

I have text that comes through as Latin-1 ("ISO-8859-1") and I want to simplify the encoding to ASCII ("US-ASCII"). ISO-8859-1 is an 8-bit encoding that covers all 256 code points (including accented letters for several languages), while US-ASCII is a 7-bit encoding that covers only the first 128, primarily English text.

I found one way to do it using the String constructor, namely:

The process converts characters 128-255 to the "?" character. Is there a smarter conversion available in the Java APIs? For example, it would be useful if the various accented "e" vowels (upper and lower case) were converted to their non-accented counterparts rather than to a "?". Is there anything like that available in Java?
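
The original snippet is not preserved in this copy of the thread. As a rough sketch of the kind of conversion described, assuming it round-trips the text through US-ASCII bytes (the class and method names here are hypothetical), the "?" substitution can be reproduced like this:

```java
import java.nio.charset.StandardCharsets;

public class AsciiRoundTrip {
    // Hypothetical helper: encode to US-ASCII bytes, then decode them again.
    // String.getBytes(Charset) replaces every unmappable character with the
    // charset's replacement byte, which is '?' for US-ASCII.
    static String toAscii(String text) {
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        return new String(ascii, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // \u00e9 is e-acute, \u00c9 is E-acute
        System.out.println(toAscii("Caf\u00e9 \u00c9lan")); // prints "Caf? ?lan"
    }
}
```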

Alternatively, I could write a parser that reads the characters one at a time and uses a look-up table for each letter, but that seems like re-inventing the wheel to me, so I thought I'd check here first.
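
For comparison, the look-up table fallback might look roughly like the sketch below. The table contents are illustrative and far from complete, and the class and method names are made up for this example.

```java
import java.util.HashMap;
import java.util.Map;

public class LookupFold {
    // Illustrative, deliberately incomplete table of Latin-1 letters and
    // their ASCII stand-ins. A real table would cover 0x80-0xFF as needed.
    private static final Map<Character, String> TABLE = new HashMap<>();
    static {
        TABLE.put('\u00e8', "e");  // e grave
        TABLE.put('\u00e9', "e");  // e acute
        TABLE.put('\u00ea', "e");  // e circumflex
        TABLE.put('\u00eb', "e");  // e diaeresis
        TABLE.put('\u00c8', "E");  // E grave
        TABLE.put('\u00c9', "E");  // E acute
        TABLE.put('\u00ca', "E");  // E circumflex
        TABLE.put('\u00cb', "E");  // E diaeresis
        TABLE.put('\u00df', "ss"); // sharp s, mapped to two letters
    }

    // Walk the text one char at a time: ASCII passes through, anything else
    // is looked up, and unknown characters fall back to "?".
    static String fold(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 128) {
                out.append(c);
            } else {
                out.append(TABLE.getOrDefault(c, "?"));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(fold("Caf\u00e9 Stra\u00dfe")); // prints "Cafe Strasse"
    }
}
```

The one thing a table gives you that a generic API may not is control over letters with no obvious base form, at the cost of maintaining the table yourself.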

Mike Simmons
Ranch Hand

Joined: Mar 05, 2008
Posts: 3028
You might want to read this previous discussion. Short version: try java.text.Normalizer. Obligatory disclaimer: I haven't used it myself. Nor do I know anyone who has. As far as I know. But it seems like it's designed for what you're doing, so I say give it a try. And then please tell us if it works. Good luck...
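
A minimal sketch of what the java.text.Normalizer suggestion could look like, untested against the original poster's data. The idea is to decompose each accented letter into a base letter plus combining marks (form NFD) and then delete the marks. The class and method names are hypothetical; Normalizer.normalize and Normalizer.Form.NFD are the standard API (Java 6 and later).

```java
import java.text.Normalizer;

public class AccentStripper {
    // Hypothetical helper: NFD splits e-acute into 'e' + U+0301 (combining
    // acute accent); the regex then removes every combining mark (\p{M}).
    static String stripAccents(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        // \u00e9 is e-acute, \u00c9 is E-acute
        System.out.println(stripAccents("Caf\u00e9 \u00c9lan")); // prints "Cafe Elan"
    }
}
```

Note that letters with no canonical decomposition, such as ß, Æ and Ø, pass through unchanged, so they would still become "?" in a later US-ASCII conversion unless they are handled separately.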