JavaRanch » Java Forums » Databases » JDBC and Relational Databases

MySQL and reading non-English characters in UTF-8

R Young

Joined: May 05, 2010
Posts: 3
Hi all. The topic is pretty straightforward, but I've been staring at it for quite some time.

I'm trying to retrieve non-English characters from a MySQL database in UTF-8. In my example below, I use "à" (U+00E0: Latin Small Letter A With Grave), but I've also tried a Japanese character (UTF-8 bytes in hex: E7 A7 98). I'm new to Java/JDBC, so I may be missing something basic, but I've searched for quite a while and haven't found anything.

I've made sure that my database was created in UTF-8 and that my connection uses UTF-8. I've tried enabling all the JDBC connection string options (see commented line below), and it makes no difference. I've also tried System.setProperty("file.encoding", "UTF-8").

The odd thing is that somehow the code below works when I run it from a JUnit test. (the actual results match the expected ones)
Any help would be appreciated. Thanks in advance.

  • SQL to create data

  • Java code

  • Environment Details

    OS: Windows XP SP3
    MySQL: Server version: 5.1.40-community MySQL Community Server (GPL)
    Java: java version "1.6.0_20"
    JDBC: 5.1.12
    JUnit: 4_4.5 v20090824
    IDE: Eclipse 20090920-1017 (not that it should matter)

    Paul Clapham

    Joined: Oct 14, 2005
    Posts: 19659

    I don't understand why you think there's a problem. You get a string from the database and it contains the "à" character. If you convert it to bytes using your system's default encoding then you get 1 byte which is 224. This is in fact how "à" is encoded in ISO-8859-1 so it's quite likely that is your system's default encoding. And if you convert it to bytes using UTF-8 then you get 2 bytes which are the byte representation of "à" in UTF-8. This is all perfectly normal and nothing to do with JDBC at all.
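
To make the byte counts in this reply concrete, here is a minimal sketch that encodes "à" in both charsets. The class and method names are hypothetical and not from the thread. It uses Charset.forName rather than StandardCharsets (which arrived in Java 7) so that it should also compile on the Java 6 listed in the first post.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class GraveAccentBytes {
    // Encodes text in the named charset and returns each byte as an
    // unsigned value (0-255). Hypothetical helper for illustration only.
    public static int[] unsignedBytes(String text, String charsetName) {
        byte[] raw = text.getBytes(Charset.forName(charsetName));
        int[] out = new int[raw.length];
        for (int i = 0; i < raw.length; i++) {
            out[i] = raw[i] & 0xFF; // Java bytes are signed; mask to 0-255
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "\u00E0"; // "à", written as an escape so the source file encoding does not matter
        System.out.println(Arrays.toString(unsignedBytes(s, "ISO-8859-1"))); // prints [224]
        System.out.println(Arrays.toString(unsignedBytes(s, "UTF-8")));      // prints [195, 160]
    }
}
```

The String itself is the same in both cases; only the charset chosen when converting it to bytes differs.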
    R Young

    Joined: May 05, 2010
    Posts: 3
    Okay. Thanks for the quick reply.

    I think I understand your answer. However, I think that "à" was not such a good example.
    Let me change the example to a Japanese character.

    When I run the code below, I get a string of Length=1 and a value of 63. Instead, I'm expecting a String of Length=3 with values E7 A7 98.
    I'm assuming it's 63 (the "?" character) because ISO-8859-1 has no mapping for the E7 A7 98 character and encodes it as 63.

    I assumed that JDBC was responsible for this conversion. You mention that it depends on my system's default encoding. Is this a setting in the Java VM? (I've tried -Dfile.encoding=utf-8.)

    In the end, I need to be able to read the string as E7 A7 98. The 63 loses data.
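
The 63 described above can be reproduced without a database. The sketch below is an illustration under assumptions, not the poster's code: the class and helper names are hypothetical, and the character is taken to be U+79D8, which is what the UTF-8 bytes E7 A7 98 decode to. It shows that the String holds one char, and that the byte count depends only on the charset used in getBytes.

```java
import java.nio.charset.Charset;

public class UnmappableCharDemo {
    // Returns the encoded bytes of text as space-separated upper-case hex.
    // Hypothetical helper for illustration only.
    public static String hex(String text, String charsetName) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(Charset.forName(charsetName))) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u79D8"; // the character whose UTF-8 encoding is E7 A7 98
        System.out.println(s.length());           // prints 1 (one UTF-16 char)
        System.out.println(hex(s, "ISO-8859-1")); // prints 3F (63, '?': no mapping in this charset)
        System.out.println(hex(s, "UTF-8"));      // prints E7 A7 98
    }
}
```

Since the UTF-8 call still produces E7 A7 98, the character is intact in the String; the loss only happens when encoding to a charset that cannot represent it.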
    R Young

    Joined: May 05, 2010
    Posts: 3
    Okay. After thinking further, I see what you're saying. To get the UTF-8 data, I need to request it explicitly by calling name.getBytes("UTF8"). I can't rely on name.getBytes().

    You're right. It has nothing to do with JDBC. Please feel free to move this to the appropriate forum.

    Thanks again.

    reason for mod1: clarified logic
    reason for mod2: answered own question: Charset.defaultCharset()

    A final question: Is there a way I can change the default system encoding to UTF-8? Or detect what the default system encoding is? I'm planning to write code like below, but I wouldn't want to convert if the encoding is not ISO-8859-1.
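
For the detection part of this question, a minimal sketch using Charset.defaultCharset() (the call named in the edit note above) might look like the following. The class and method names are hypothetical. The second helper shows the alternative of always naming the charset, in which case the default does not need to be checked at all.

```java
import java.nio.charset.Charset;

public class DefaultCharsetCheck {
    // True when the JVM's default charset is UTF-8.
    public static boolean defaultIsUtf8() {
        return Charset.defaultCharset().equals(Charset.forName("UTF-8"));
    }

    // Encodes with an explicit charset, so the result is the same
    // whatever the platform default happens to be.
    public static byte[] utf8Bytes(String text) {
        return text.getBytes(Charset.forName("UTF-8"));
    }

    public static void main(String[] args) {
        System.out.println("Default charset: " + Charset.defaultCharset().name());
        System.out.println("Default is UTF-8: " + defaultIsUtf8());
        System.out.println("UTF-8 byte count for U+00E0: " + utf8Bytes("\u00E0").length); // prints 2
    }
}
```

On changing the default: as far as I know, the JVM reads file.encoding once at startup, so calling System.setProperty afterwards generally has no effect on getBytes(). Treat that as an assumption to verify on your own JVM; naming the charset explicitly avoids the question.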

    One other detail: I'm using this string to pass to a JNA function, and it seems to care about whether or not the string it gets is UTF-8 or ISO-8859-1.
    In other words, if I pass it a string where name.getBytes()[0] = -32 it fails. If name.getBytes()[0] = -61 it works.
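
The -32 and -61 values are the signed forms of the bytes 224 (ISO-8859-1) and 195 (first byte of UTF-8) for "à". A possible way to avoid depending on the default encoding is to hand the native side an explicitly encoded byte array. The sketch below assumes, without confirmation from the thread, that the native function accepts a NUL-terminated UTF-8 C string; the class and method names are hypothetical and no JNA API is used here.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class NativeUtf8 {
    // Encodes text as UTF-8 and appends a trailing 0 byte, the usual
    // layout of a C string. Assumption: the native side wants UTF-8.
    public static byte[] toCString(String text) {
        byte[] utf8 = text.getBytes(Charset.forName("UTF-8"));
        return Arrays.copyOf(utf8, utf8.length + 1); // copyOf pads the extra slot with 0
    }

    public static void main(String[] args) {
        String name = "\u00E0"; // "à"
        System.out.println(name.getBytes(Charset.forName("ISO-8859-1"))[0]); // prints -32
        System.out.println(Arrays.toString(toCString(name)));                // prints [-61, -96, 0]
    }
}
```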
