
MySQL and reading non-English characters in UTF-8

R Young
Greenhorn

Joined: May 05, 2010
Posts: 3
Hi all. The topic is pretty straightforward, but I've been staring at it for quite some time.

I'm trying to retrieve non-English characters from a MySQL database in UTF-8. In my example below, I use "à" (U+00E0: Latin Small Letter A With Grave), but I've also tried random Japanese characters (hex E7 A7 98). I'm new to Java/JDBC, so I may be missing something basic, but I've searched for quite a while and haven't found anything.

I've made sure that my database was created in UTF-8 and that my connection uses UTF-8. I've tried enabling all the JDBC connection string options (see commented line below) and it makes no difference. I've also tried System.setProperty("file.encoding", "UTF-8");.

The odd thing is that somehow the code below works when I run it from a JUnit test. (the actual results match the expected ones)
Any help would be appreciated. Thanks in advance.

  • SQL to create data



  • Java code


  • Environment Details

    OS: Windows XP SP3
    MySQL: Server version: 5.1.40-community MySQL Community Server (GPL)
    Java: java version "1.6.0_20"
    JDBC: 5.1.12
    JUnit: 4_4.5 v20090824
    IDE: Eclipse 20090920-1017 (not that it should matter)

    Paul Clapham
    Bartender

    Joined: Oct 14, 2005
    Posts: 18643
        

    I don't understand why you think there's a problem. You get a string from the database and it contains the "à" character. If you convert it to bytes using your system's default encoding then you get 1 byte which is 224. This is in fact how "à" is encoded in ISO-8859-1 so it's quite likely that is your system's default encoding. And if you convert it to bytes using UTF-8 then you get 2 bytes which are the byte representation of "à" in UTF-8. This is all perfectly normal and nothing to do with JDBC at all.
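
To make the byte counts above concrete, here is a minimal sketch (not the original poster's code, which is missing from this page). The class and method names are made up for illustration, and it assumes Java 7+ for StandardCharsets; on Java 6, Charset.forName("UTF-8") should work the same way.

```java
import java.nio.charset.StandardCharsets;

class GraveAccentBytes {
    // "\u00E0" is "à"; the escape avoids depending on the source file's encoding.
    static final String A_GRAVE = "\u00E0";

    // Encode explicitly as ISO-8859-1 (what a Latin-1 default would do).
    static byte[] latin1Bytes(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Encode explicitly as UTF-8, independent of the platform default.
    static byte[] utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Format bytes as unsigned two-digit hex values separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(A_GRAVE.length());           // 1
        System.out.println(hex(latin1Bytes(A_GRAVE)));  // E0
        System.out.println(hex(utf8Bytes(A_GRAVE)));    // C3 A0
    }
}
```

The single ISO-8859-1 byte E0 is 224 unsigned, or -32 as a signed Java byte; the first UTF-8 byte C3 is -61 as a signed byte.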
    R Young
    Greenhorn

    Joined: May 05, 2010
    Posts: 3
    Okay. Thanks for the quick reply.

    I think I understand your answer. However, I think that "à" was not such a good example.
    Let me change the example to a Japanese character.



    When I run the code below, I get a string of Length=1 and value of 63. Instead I'm expecting a String of Length=3 with values E7 A7 98.
    I'm assuming it's 63 because ISO-8859-1 doesn't know anything about the E7 A7 98 character and decodes it as 63.

    I assumed that JDBC was responsible for this conversion. You mention that it is dependent on my system's default encoding. Is this a setting in the Java VM? (I've tried -Dfile.encoding=utf-8)

    In the end, I need to be able to read the string as E7 A7 98. The 63 loses data.
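
A hedged sketch of what is probably happening here, assuming the character is U+79D8 (the code point whose UTF-8 encoding is E7 A7 98). The String itself has length 1 either way; the three bytes only appear once it is encoded as UTF-8. ISO-8859-1 cannot represent the character, so getBytes substitutes "?", which is byte 63. Class and method names are illustrative only.

```java
import java.nio.charset.StandardCharsets;

class KanjiBytes {
    // Assumed test character: U+79D8, whose UTF-8 encoding is E7 A7 98.
    static final String KANJI = "\u79D8";

    // ISO-8859-1 has no mapping for this character, so the encoder
    // substitutes its replacement byte, '?' (63).
    static byte[] latin1Bytes(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    // UTF-8 can represent every Unicode character.
    static byte[] utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Format bytes as unsigned two-digit hex values separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(KANJI.length());              // 1
        System.out.println(latin1Bytes(KANJI)[0]);       // 63
        System.out.println(hex(utf8Bytes(KANJI)));       // E7 A7 98
    }
}
```

If this reading is right, no data is lost inside the String; the loss happens only when it is encoded with a charset that lacks the character.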
    R Young
    Greenhorn

    Joined: May 05, 2010
    Posts: 3
    Okay. After thinking further, I see what you're saying. In order for me to get the UTF-8 data, I need to request the UTF-8 data explicitly: by doing name.getBytes("UTF8"). I can't rely on doing name.getBytes().

    You're right. It has nothing to do with JDBC. Please feel free to move to appropriate forum.

    Thanks again.

    reason for mod1: clarified logic
    reason for mod2: answered own question: Charset.defaultCharset()

    A final question: Is there a way I can change the default system encoding to UTF-8? Or detect what the default system encoding is? I'm planning to write code like below, but I wouldn't want to convert if the encoding is not ISO-8859-1.
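
On detecting the default encoding, a minimal sketch using Charset.defaultCharset() (mentioned in the edit note above). This is not the poster's planned code, which is missing from this page; the class and helper names are invented for illustration. As far as I know, the default charset is fixed when the JVM starts, so calling System.setProperty("file.encoding", ...) afterwards is generally not picked up; and on Java 18 and later the default is UTF-8 regardless of platform.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DefaultCharsetCheck {
    // Hypothetical helper: true if the given charset is UTF-8.
    static boolean isUtf8(Charset cs) {
        return StandardCharsets.UTF_8.equals(cs);
    }

    public static void main(String[] args) {
        // The charset that String.getBytes() with no argument will use.
        Charset def = Charset.defaultCharset();
        System.out.println("Default charset: " + def.name());
        System.out.println("Is UTF-8: " + isUtf8(def));
    }
}
```

In practice it may be simpler to skip the check and always pass the charset explicitly to getBytes, since that gives the same bytes whatever the default is.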

    One other detail: I'm using this string to pass to a JNA function, and it seems to care about whether or not the string it gets is UTF-8 or ISO-8859-1.
    In other words, if I pass it a string where name.getBytes()[0] = -32 it fails. If name.getBytes()[0] = -61 it works.
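
A hedged sketch for the native-call case, using only the JDK. It assumes the native function wants a NUL-terminated UTF-8 byte sequence; the thread only tells us that a first byte of -61 works and -32 fails, so the terminator is a guess. The class and method names are hypothetical, and no JNA API is used here.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class NativeUtf8 {
    // Encode as UTF-8 and append one trailing zero byte.
    // Arrays.copyOf pads the extra slot with 0.
    static byte[] toUtf8CString(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return Arrays.copyOf(utf8, utf8.length + 1);
    }

    public static void main(String[] args) {
        // "\u00E0" is "à".
        byte[] bytes = toUtf8CString("\u00E0");
        System.out.println(Arrays.toString(bytes));  // [-61, -96, 0]
    }
}
```

Passing a byte array built this way, rather than relying on name.getBytes(), would make the result independent of the platform default encoding.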



     