Validating UTF-8 encoding using CharsetDecoder

 
Greenhorn
Posts: 4
Background: One of our Oracle DBAs is reporting funny characters in some of our VARCHAR2(500 CHAR) columns that may suffer from "lossy conversion". The database is configured for AL32UTF8. He is detecting these problems with the Oracle "csscan" utility, run with FROMCHAR=AL32UTF8 and TOCHAR=UTF8.

To begin investigating this problem and to get familiar with character encodings, I was trying to find a way to give Java a byte array and check that its content is a valid sequence of UTF-8 characters. After a lot of web searching, I came across CharsetDecoder and created the program below. The "-17, -65 and -67" values are the signed byte representations of the funny characters in one of the database fields. My hope was that by placing them into a byte array, the CharsetDecoder would throw an exception when it encountered them. The exception I was expecting would be related to an unmappable character. The program runs without throwing an exception. The resulting output is a "replacement character" according to the referenced website. That website also reports that the input (0xEFBFBD) is undefined for Unicode.

I am using Java 1.6.0_17. Please let me know what is wrong with this attempt at validating UTF-8.
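
The listing below is a minimal sketch of the kind of program described above, not the original code (the class name and output formatting are my own): it puts the signed bytes -17, -65, -67 into a byte array and hands them to a UTF-8 CharsetDecoder with both error actions set to REPORT.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class Utf8ValidationTest {
    public static void main(String[] args) throws Exception {
        // -17, -65, -67 are the signed byte values of 0xEF, 0xBF, 0xBD,
        // the bytes found in the suspect database field.
        byte[] bytes = { -17, -65, -67 };

        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
        decoder.onMalformedInput(CodingErrorAction.REPORT);
        decoder.onUnmappableCharacter(CodingErrorAction.REPORT);

        // With REPORT set, decode() throws a CharacterCodingException if the
        // input is rejected; in the scenario described above it completes normally.
        CharBuffer result = decoder.decode(ByteBuffer.wrap(bytes));

        for (int i = 0; i < result.length(); i++) {
            System.out.printf("char %d = U+%04X%n", i, (int) result.charAt(i));
        }
        // Behaviour reported in the post: no exception, and the single decoded
        // character is U+FFFD (the replacement character).
    }
}
```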




Results

Unicode Lookup Tool
Marshal
Posts: 28298
What makes you think there's anything wrong with it? My hand calculations based on http://en.wikipedia.org/wiki/UTF-8 produce the same results as your code.

It does look to me sort of like something used a faulty decoding to produce the U+FFFD character (which is often the output of charset decoding when it doesn't know what to do), and then encoded that (correctly) into the UTF-8 string you have there.
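
Put another way, 0xEF 0xBF 0xBD is exactly the byte sequence a UTF-8 encoder produces for U+FFFD, so a strict decoder sees a perfectly well-formed character. A small round-trip illustration (my own example, not code from the thread):

```java
import java.util.Arrays;

public class ReplacementCharRoundTrip {
    public static void main(String[] args) throws Exception {
        // U+FFFD (the replacement character) encodes to EF BF BD in UTF-8 ...
        byte[] encoded = "\uFFFD".getBytes("UTF-8");
        System.out.println(Arrays.toString(encoded));            // [-17, -65, -67]

        // ... and those same three bytes decode straight back to U+FFFD,
        // so there is nothing malformed or unmappable to report.
        String decoded = new String(new byte[] { -17, -65, -67 }, "UTF-8");
        System.out.printf("U+%04X%n", (int) decoded.charAt(0));  // U+FFFD
    }
}
```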
 
Jeff Fiedler
Greenhorn
Posts: 4
What I thought was wrong was not getting an exception from the decoder in my sample program.

In my sample program, the first three bytes of test data that should be UTF-8 are 0xEF 0xBF 0xBD. On the website that I referenced, a look-up of 0xEFBFBD appears to be an undefined Unicode character (code point?). On the website that you referenced, this byte sequence appears to be a valid 3-byte sequence for UTF-8.

So what I was expecting was for the CharsetDecoder to throw an UnmappableCharacterException on these first 3 bytes. Note that the sample code sets the handling of unmappable characters to REPORT, which I thought would cause it to throw that exception. Instead, it makes the first character of the result the "replacement character" (U+FFFD) and continues to decode.
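
A note on when REPORT actually fires (my own sketch, not code from the thread): the decoder throws only when the bytes are genuinely invalid UTF-8, and invalid byte sequences surface as a MalformedInputException rather than an UnmappableCharacterException. Because 0xEF 0xBF 0xBD is a well-formed sequence that simply decodes to U+FFFD, there is nothing to report; a truncated sequence, by contrast, is rejected:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class ReportActionDemo {
    public static void main(String[] args) {
        // Well-formed input: EF BF BD is the valid 3-byte encoding of U+FFFD,
        // so REPORT has nothing to flag and no exception is thrown.
        tryDecode(new byte[] { -17, -65, -67 });

        // Genuinely broken input: EF BF is a truncated 3-byte sequence, which
        // REPORT rejects with a MalformedInputException.
        tryDecode(new byte[] { -17, -65 });
    }

    private static void tryDecode(byte[] bytes) {
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            String s = decoder.decode(ByteBuffer.wrap(bytes)).toString();
            System.out.println("decoded OK, " + s.length() + " char(s)");
        } catch (CharacterCodingException e) {
            System.out.println("rejected: " + e);
        }
    }
}
```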
 