Background: One of our Oracle DBAs is reporting funny characters in some of our VARCHAR2 (CHAR 500) fields that may suffer from "lossy conversion". The database is configured for AL32UTF8. He is detecting these problems using the Oracle "csscan" utility with FROMCHAR=AL32UTF8 and TOCHAR=UTF8.
To begin investigating this problem and to get familiar with character encodings, I was trying to find a way to give Java a byte array and check that its content is a valid sequence of UTF8 characters. After a lot of web searches, I came across CharsetDecoder and created the program below. The "-17, -65 and -67" values are the signed byte representations of the funny characters in one of the database fields. My hope was that, once they were placed into a byte array, the CharsetDecoder would throw an exception when it encountered them. I was expecting an exception related to an unmappable character. Instead, the program runs without throwing an exception. The resulting output is a "replacement character" according to the referenced website. This website also reports that the input (0xEFBFBD) is undefined for Unicode.
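For readers following along, here is a minimal sketch of the kind of strict check described above. It is not the original program; the class name `Utf8Check` and the method `isValidUtf8` are made up for illustration. It only uses `Charset.forName`, so it should compile on Java 1.6.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper, not the poster's code.
class Utf8Check {

    // Returns true if the whole array decodes as UTF-8 without errors.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // With REPORT, malformed input surfaces as an exception here.
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] fromDatabase = { -17, -65, -67 };  // EF BF BD
        byte[] truncated = { -17, -65 };          // EF BF, last byte missing
        byte[] stray = { (byte) 0xFF };           // 0xFF never occurs in UTF-8
        System.out.println(isValidUtf8(fromDatabase)); // prints true
        System.out.println(isValidUtf8(truncated));    // prints false
        System.out.println(isValidUtf8(stray));        // prints false
    }
}
```

The three bytes from the database pass because they are a well-formed three-byte sequence; only structurally broken input makes the decoder report an error.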
I am using Java 1.6_17. Please let me know what is wrong with this attempt at validating UTF8.
It looks to me as if something used a faulty decoding that produced the U+FFFD character (which is often the output of charset decoding when it doesn't know what to do), and then encoded that (correctly) into the UTF-8 bytes you have there.
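A small sketch of how that could happen, under the assumption that some step decoded non-UTF-8 bytes leniently. The class `ReplacementRoundTrip` and the sample byte 0x92 are hypothetical; nothing in the thread says which bytes were originally in the field.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

// Hypothetical illustration of a lossy decode followed by a correct encode.
class ReplacementRoundTrip {

    static final Charset UTF8 = Charset.forName("UTF-8");

    // Decodes leniently (errors become U+FFFD), then encodes back to UTF-8.
    static byte[] lenientRoundTrip(byte[] raw) {
        String decoded = new String(raw, UTF8); // no exception on bad input
        return decoded.getBytes(UTF8);
    }

    public static void main(String[] args) {
        // Assumed input: a single byte 0x92, which is not valid UTF-8 by itself.
        byte[] bad = { (byte) 0x92 };
        System.out.println(Arrays.toString(lenientRoundTrip(bad)));
        // prints [-17, -65, -67]
    }
}
```

The output is the same "-17, -65, -67" triple, which is why the stored bytes are valid UTF-8 even though the original character is gone.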
What I thought was wrong was not getting an exception from the decoder in my sample program.
In my sample program, the first three bytes of test data that should be UTF8 are 0xEFBFBD. According to the website that I referenced, 0xEFBFBD appears to be an undefined Unicode character (code point?). According to the website that you referenced, this byte sequence appears to be a valid 3-byte sequence for UTF8.
So what I was expecting was for the CharsetDecoder to throw an UnmappableCharacterException on these first 3 bytes. Note that the sample code sets the handling of unmappable characters to REPORT, which I thought would cause it to throw the exception. Instead, it makes the first character of the result the "replacement character" (0xFFFD) and continues to decode.
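One possible way to flag such fields, sketched under the assumption that any U+FFFD in the decoded text counts as suspicious: decode strictly, then scan the result for the replacement character. The class `ReplacementScan` and the method `countReplacementChars` are hypothetical names, and the sample bytes are made up apart from the first three.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper: strict decode, then look for U+FFFD in the text.
class ReplacementScan {

    // Throws CharacterCodingException if the bytes are not well-formed UTF-8;
    // otherwise returns how many U+FFFD characters the decoded text contains.
    static int countReplacementChars(byte[] utf8) throws CharacterCodingException {
        String text = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(utf8))
                .toString();
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\uFFFD') {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws CharacterCodingException {
        byte[] field = { -17, -65, -67, 'a', 'b' }; // EF BF BD followed by "ab"
        System.out.println(countReplacementChars(field)); // prints 1
    }
}
```

The decoder stays silent here because the three bytes decode to a single, defined character, U+FFFD, so there is nothing malformed or unmappable for REPORT to act on. The check for lost data has to happen on the decoded text instead.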
subject: Validating UTF-8 encoding using CharsetDecode