Background: One of our Oracle DBAs is reporting funny characters in some of our VARCHAR2(500 CHAR) fields that may suffer from "lossy conversion". The database is configured for AL32UTF8. He is detecting these problems with the Oracle "csscan" utility, run with FROMCHAR=AL32UTF8 and TOCHAR=UTF8.
To begin investigating this problem and to get familiar with character encodings, I was trying to find a way to give Java a byte array and check that its content is a valid sequence of UTF-8 characters. After a lot of web searches, I came across CharsetDecoder and created the program below. The values -17, -65 and -67 are the signed byte representations of the funny characters in one of the database fields. My hope was that, once I placed them into a byte array, the CharsetDecoder would throw an exception when it encountered them. I expected an exception related to an unmappable character. Instead, the program runs without throwing an exception. According to the referenced website, the resulting output is a "replacement character". The website also reports that the input (0xEFBFBD) is undefined for Unicode.
I am using Java 1.6_17. Please let me know what is wrong with this attempt at validating UTF-8.
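For illustration, a minimal sketch of the kind of check described above, not the original program. It assumes a strict decoder obtained from `Charset.forName("UTF-8")` and configured with `CodingErrorAction.REPORT`; the class name `Utf8Check` and the method `isValidUtf8` are hypothetical. It avoids anything newer than Java 6.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper class, for illustration only.
class Utf8Check {

    // Builds a decoder that reports errors instead of silently replacing bad input.
    static CharsetDecoder strictDecoder() {
        return Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
    }

    // Returns true if the bytes are a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            strictDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws CharacterCodingException {
        // The three bytes from the database field: 0xEF 0xBF 0xBD.
        byte[] fromDb = {-17, -65, -67};
        System.out.println(isValidUtf8(fromDb)); // prints "true"

        // Show which code point those bytes decode to.
        CharBuffer chars = strictDecoder().decode(ByteBuffer.wrap(fromDb));
        System.out.printf("U+%04X%n", (int) chars.charAt(0)); // prints "U+FFFD"

        // For contrast: the same sequence with its last byte cut off.
        byte[] truncated = {-17, -65};
        System.out.println(isValidUtf8(truncated)); // prints "false"
    }
}
```

In this sketch the three bytes are accepted because 0xEF 0xBF 0xBD is itself the well-formed UTF-8 encoding of U+FFFD, the replacement character. A strict decoder therefore has nothing to report for them, while a truncated sequence is rejected.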
Results
Unicode Lookup Tool