Smarter Charset Conversion

Scott Selikoff

I have text that comes through as Latin-1 ("ISO-8859-1") and I want to simplify the encoding to ASCII ("US-ASCII"). ISO-8859-1 is an 8-bit encoding that defines all 256 byte values (including accented letters for several Western European languages), while US-ASCII is a 7-bit encoding that covers only the first 128 of those characters, primarily English text.

I found one way to do it using the String constructor, namely:



The process converts characters 128-255 to the "?" character. Is there a smarter conversion available in the Java APIs? For example, it would be useful if the various accented "e" vowels (upper and lower case) were converted to their non-accented counterparts, rather than to a "?". Is there anything like that available in Java?
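
For illustration only, here is a minimal sketch of one way to observe the "?" substitution described above. It is not necessarily the snippet from this post: the class name LossyAsciiDemo and the method toAsciiLossy are hypothetical, and it assumes the text is already held in a Java String and is round-tripped through String.getBytes and the String constructor.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative, not from the original post.
final class LossyAsciiDemo {

    // Encodes the text as US-ASCII and decodes it again.
    // String.getBytes(Charset) replaces every character the charset cannot
    // represent with the charset's default replacement byte, which is '?' for US-ASCII.
    static String toAsciiLossy(String text) {
        byte[] asciiBytes = text.getBytes(StandardCharsets.US_ASCII);
        return new String(asciiBytes, StandardCharsets.US_ASCII);
    }
}
```

With this sketch, every accented letter in the input collapses to the same "?", so the original letter cannot be recovered from the output.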

Alternatively, I could write a parser that reads the characters one at a time and uses a look-up table for each letter, but that seems like re-inventing the wheel to me, so I thought I'd check here first.
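
A rough sketch of what that look-up table fallback might look like, for comparison. Everything here is an assumption made for illustration: the class name AsciiLookup, the method transliterate, and the handful of table entries, which are nowhere near a complete mapping for ISO-8859-1.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical look-up table approach; the table below is deliberately tiny.
final class AsciiLookup {

    private static final Map<Character, String> TABLE = new HashMap<>();

    static {
        TABLE.put('\u00c9', "E");   // E with acute
        TABLE.put('\u00e8', "e");   // e with grave
        TABLE.put('\u00e9', "e");   // e with acute
        TABLE.put('\u00ea', "e");   // e with circumflex
        TABLE.put('\u00eb', "e");   // e with diaeresis
        TABLE.put('\u00df', "ss");  // sharp s, mapped to two letters
        TABLE.put('\u00c6', "AE");  // AE ligature
    }

    // Reads the text one character at a time: ASCII passes through,
    // mapped characters are replaced, anything else becomes "?".
    static String transliterate(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 128) {
                out.append(c);
            } else {
                out.append(TABLE.getOrDefault(c, "?"));
            }
        }
        return out.toString();
    }
}
```

The one thing a hand-written table buys you is control over multi-letter replacements such as the sharp s, at the cost of maintaining the table yourself.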


Mike Simmons
You might want to read this previous discussion. Short version: try java.text.Normalizer. Obligatory disclaimer: I haven't used it myself. Nor do I know anyone who has. As far as I know. But it seems like it's designed for what you're doing, so I say give it a try. And then please tell us if it works. Good luck...
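
For anyone who wants to try it, here is a minimal sketch of how java.text.Normalizer could be applied to this problem. It is untested against real-world data. The class name AccentStripper and the method toAscii are hypothetical, and the choice to fall back to "?" for anything that still is not ASCII is an assumption made to match the behaviour described in the question. The idea is to decompose each accented letter into a base letter plus combining marks (form NFD), then drop the marks.

```java
import java.text.Normalizer;

// Hypothetical helper; the names are illustrative only.
final class AccentStripper {

    static String toAscii(String input) {
        // NFD splits a character such as U+00E9 into 'e' followed by U+0301 (combining acute).
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);

        // Remove the combining marks (Unicode category M), leaving the base letters.
        String withoutMarks = decomposed.replaceAll("\\p{M}+", "");

        // Assumption: anything still outside ASCII falls back to '?'.
        return withoutMarks.replaceAll("[^\\p{ASCII}]", "?");
    }
}
```

Under these assumptions, AccentStripper.toAscii("caf\u00e9") gives "cafe". Characters that have no canonical decomposition, such as the sharp s (U+00DF) or the o with stroke (U+00F8), are not simplified by normalization and still end up as "?", so a small look-up table may be needed for those.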
 