Validating UTF-8 encoding using CharsetDecode

Jeff Fiedler
Greenhorn

Joined: Feb 09, 2007
Posts: 4
Background: One of our Oracle DBAs is reporting funny characters in some of our VARCHAR2 (CHAR 500) fields that may suffer from "lossy conversion". The database is configured for AL32UTF8. He is detecting these problems using the Oracle "csscan" utility with FROMCHAR=AL32UTF8 and TOCHAR=UTF8.

To begin investigating this problem and to get familiar with character encodings, I was looking for a way to hand Java a byte array and check that its content is a valid sequence of UTF-8 characters. After a lot of web searches, I came across CharsetDecoder and created the program below. The values -17, -65 and -67 are the signed byte representations of the funny characters in one of the database fields. My hope was that, with these placed in a byte array, the CharsetDecoder would throw an exception when it encountered them. I expected the exception to be related to an unmappable character. Instead, the program runs without throwing an exception. The resulting output is a "replacement character" according to the referenced website. That website also reports that the input (0xEFBFBD) is undefined for Unicode.

I am using Java 1.6_17. Please let me know what is wrong with this attempt at validating UTF-8.
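The original program is not preserved in this copy of the thread. What follows is only a minimal sketch of the kind of check described above, not the poster's code: the class name Utf8Check and the methods decodeStrict and isValidUtf8 are hypothetical names chosen for illustration. It uses a CharsetDecoder with both error actions set to REPORT, so malformed input raises a CharacterCodingException instead of being silently replaced.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper class; the names are illustrative only.
class Utf8Check {

    // Decodes the bytes as UTF-8 and throws instead of substituting
    // a replacement character when the input cannot be decoded.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    // Returns true when the whole array is a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            decodeStrict(bytes);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // The three bytes from the database field: 0xEF 0xBF 0xBD.
        byte[] fromDatabase = { -17, -65, -67 };
        String decoded = decodeStrict(fromDatabase);
        System.out.printf("U+%04X%n", (int) decoded.charAt(0)); // prints U+FFFD

        // A truncated sequence (last byte missing) is rejected.
        byte[] truncated = { -17, -65 };
        System.out.println(isValidUtf8(truncated)); // prints false
    }
}
```

With this sketch the three database bytes decode without an exception, which matches the behaviour reported in the post, while a genuinely malformed sequence is rejected.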




Results




Unicode Lookup Tool
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18570
    

What makes you think there's anything wrong with it? My hand calculations based on http://en.wikipedia.org/wiki/UTF-8 produce the same results as your code.

It does look to me sort of like something used a faulty decoding to produce the U+FFFD character (which is often the output of charset decoding when it doesn't know what to do), and then encoded that (correctly) into the UTF-8 string you have there.
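The hand calculation can be reproduced in a few lines. This is a sketch under stated assumptions: the class Utf8HandDecode and the method decodeThreeByte are hypothetical names, and the method only applies the three-byte UTF-8 layout (1110xxxx 10xxxxxx 10xxxxxx). It does not check for overlong forms or surrogates the way a full decoder would.

```java
// Hypothetical illustration of decoding one three-byte UTF-8 sequence by hand.
class Utf8HandDecode {

    // Extracts the code point from a three-byte sequence of the form
    // 1110xxxx 10xxxxxx 10xxxxxx. Throws if the bit patterns do not match.
    static int decodeThreeByte(byte b0, byte b1, byte b2) {
        boolean leadOk = (b0 & 0xF0) == 0xE0;
        boolean cont1Ok = (b1 & 0xC0) == 0x80;
        boolean cont2Ok = (b2 & 0xC0) == 0x80;
        if (!leadOk || !cont1Ok || !cont2Ok) {
            throw new IllegalArgumentException("not a three-byte UTF-8 sequence");
        }
        // 4 payload bits from the lead byte, 6 from each continuation byte.
        return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F);
    }

    public static void main(String[] args) {
        // -17, -65, -67 as signed bytes are 0xEF, 0xBF, 0xBD.
        int codePoint = decodeThreeByte((byte) -17, (byte) -65, (byte) -67);
        System.out.printf("U+%04X%n", codePoint); // prints U+FFFD
    }
}
```

The payload bits are 0xF, 0x3F and 0x3D, which combine to 0xFFFD. So the bytes are a well-formed encoding of the replacement character U+FFFD, and a strict decoder has nothing to report.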
Jeff Fiedler
Greenhorn

Joined: Feb 09, 2007
Posts: 4
What I thought was wrong was not getting an exception from the decoder in my sample program.

In my sample program, the first three bytes of test data that should be UTF-8 are 0xEFBFBD. On the website that I referenced, a look-up of 0xEFBFBD appears to show an undefined Unicode character (code point?). On the website that you reference, this byte sequence appears to be a valid 3-byte sequence for UTF-8.

So what I was expecting was for the CharsetDecoder to throw an UnmappableCharacterException on these first 3 bytes. Note that the sample code sets the handling of unmappable characters to REPORT, which I thought would cause it to throw the exception. Instead, it makes the first character of the results the "replacement character" (0xFFFD) and continues to decode.
 