MySQL and reading non-English characters in UTF-8

 
Greenhorn
Posts: 3
Hi all. The topic is pretty straightforward, but I've been staring at it for quite some time.

I'm trying to retrieve non-English characters from a MySQL database in UTF-8. In my example below, I use "à" (U+00E0: Latin Small Letter A With Grave), but I've also tried random Japanese characters (hex E7 A7 98). I'm new to Java/JDBC, so I may be missing something basic, but I've searched for quite a while and haven't found anything.

I've made sure that my database was created in UTF-8 and that my connection uses UTF-8. I've tried enabling all the JDBC connection string options (see commented line below) and it makes no difference. I've also tried System.setProperty("file.encoding", "UTF-8");.
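A note on the System.setProperty("file.encoding", ...) attempt mentioned above: the JDK documents the default charset as being determined during virtual-machine startup, so changing the property at runtime is not expected to change what Charset.defaultCharset() returns. The sketch below checks that on whatever JVM runs it. The class name DefaultCharsetCheck and its helper method are made up for illustration and are not part of the original post.

```java
import java.nio.charset.Charset;

public class DefaultCharsetCheck {
    // Hypothetical helper: returns true if changing file.encoding at runtime
    // leaves Charset.defaultCharset() unchanged.
    static boolean unchangedAfterSetProperty() {
        Charset before = Charset.defaultCharset();
        String oldProperty = System.getProperty("file.encoding");
        // Pick a value that differs from the current default.
        String other = before.name().equalsIgnoreCase("UTF-8") ? "ISO-8859-1" : "UTF-8";
        try {
            System.setProperty("file.encoding", other);
            return before.equals(Charset.defaultCharset());
        } finally {
            // Restore the original property so the experiment has no side effects.
            if (oldProperty != null) {
                System.setProperty("file.encoding", oldProperty);
            } else {
                System.clearProperty("file.encoding");
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("Default charset: " + Charset.defaultCharset());
        System.out.println("Unchanged after setProperty: " + unchangedAfterSetProperty());
    }
}
```

If the default really needs to differ, passing -Dfile.encoding=... on the command line at startup is the usual approach, though encoding explicitly in code avoids depending on it at all.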

The odd thing is that somehow the code below works when I run it from a JUnit test. (the actual results match the expected ones)
Any help would be appreciated. Thanks in advance.

  • SQL to create data



  • Java code


  • Environment Details

    OS: Windows XP SP3
    MySQL: Server version 5.1.40-community MySQL Community Server (GPL)
    Java: java version "1.6.0_20"
    JDBC: 5.1.12
    JUnit: 4_4.5 v20090824
    IDE: Eclipse 20090920-1017 (not that it should matter)
     
    Marshal
    Posts: 28288
    I don't understand why you think there's a problem. You get a string from the database and it contains the "à" character. If you convert it to bytes using your system's default encoding then you get 1 byte which is 224. This is in fact how "à" is encoded in ISO-8859-1 so it's quite likely that is your system's default encoding. And if you convert it to bytes using UTF-8 then you get 2 bytes which are the byte representation of "à" in UTF-8. This is all perfectly normal and nothing to do with JDBC at all.
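The byte counts described in this reply can be reproduced without a database. The following is a minimal sketch, assuming Java 7 or later for StandardCharsets (on the Java 6 setup from the question, Charset.forName("UTF-8") would be the equivalent). The class and method names are illustrative only.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AGraveBytes {
    // Encode with ISO-8859-1: one byte per character for U+0000..U+00FF.
    static byte[] latin1(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Encode with UTF-8: U+00E0 needs two bytes.
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+00E0 written as an escape so the source file's own encoding does not matter.
        String name = "\u00E0";
        // Java bytes are signed, so 224 (0xE0) prints as -32.
        System.out.println(Arrays.toString(latin1(name))); // [-32]
        // 0xC3 0xA0 print as -61 and -96.
        System.out.println(Arrays.toString(utf8(name)));   // [-61, -96]
    }
}
```

The same String yields different byte arrays depending only on the charset passed to getBytes, which is why the result says nothing about what JDBC returned.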
     
    R Young
    Greenhorn
    Posts: 3
    Okay. Thanks for the quick reply.

    I think I understand your answer. However, I think that "à" was not such a good example.
    Let me change the example to a Japanese character.



    When I run the code below, I get a string of Length=1 and value of 63. Instead I'm expecting a String of Length=3 with values E7 A7 98.
    I'm assuming it's 63 because ISO-8859-1 can't represent the E7 A7 98 character and substitutes 63 (the code for "?").

    I assumed that JDBC was responsible for this conversion. You mention that it depends on my system's default encoding. Is this a setting in the Java VM? (I've tried -Dfile.encoding=utf-8)

    In the end, I need to be able to read the string as E7 A7 98. The 63 loses data.
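The 63 versus E7 A7 98 behaviour can be seen in isolation with a short sketch. It assumes the character in question is U+79D8, which is the code point whose UTF-8 encoding is E7 A7 98, and it uses ISO-8859-1 explicitly to stand in for the poster's presumed system default. The class name KanjiBytes and the hex helper are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class KanjiBytes {
    // Render bytes as space-separated uppercase hex, e.g. "E7 A7 98".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so negative bytes print as two hex digits.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+79D8 as an escape, so the source file's encoding is irrelevant.
        String name = "\u79D8";
        // The String holds one char; "3 bytes" only exists after encoding.
        System.out.println(name.length()); // 1
        // ISO-8859-1 cannot represent the character and substitutes '?' (0x3F = 63).
        System.out.println(hex(name.getBytes(StandardCharsets.ISO_8859_1))); // 3F
        // UTF-8 encodes it as three bytes.
        System.out.println(hex(name.getBytes(StandardCharsets.UTF_8))); // E7 A7 98
    }
}
```

The String itself is intact in both cases. The data loss happens only at the moment it is encoded with a charset that has no mapping for the character.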
     
    R Young
    Greenhorn
    Posts: 3
    Okay. After thinking further, I see what you're saying. In order for me to get the UTF-8 data, I need to request the UTF-8 data explicitly: by doing name.getBytes("UTF8"). I can't rely on doing name.getBytes().

    You're right. It has nothing to do with JDBC. Please feel free to move this to the appropriate forum.

    Thanks again.

    reason for mod1: clarified logic
    reason for mod2: answered own question: Charset.defaultCharset()

    A final question: Is there a way I can change the default system encoding to UTF-8? Or detect what the default system encoding is? I'm planning to write code like below, but I wouldn't want to convert if the encoding is not ISO-8859-1.

    One other detail: I'm using this string to pass to a JNA function, and it seems to care about whether or not the string it gets is UTF-8 or ISO-8859-1.
    In other words, if I pass it a string where name.getBytes()[0] = -32 it fails. If name.getBytes()[0] = -61 it works.
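One way to sidestep the default encoding entirely is to report it for diagnostics but always encode explicitly before handing bytes to native code. The sketch below is only an illustration under assumptions: the helper toUtf8CString is hypothetical, the trailing NUL byte assumes the native function expects a C-style char* string, and no JNA call is shown because the actual function signature is not given in the thread.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class NativeStringBytes {
    // Hypothetical helper: UTF-8 bytes followed by a single NUL terminator.
    // The terminator is an assumption about what the native side expects.
    static byte[] toUtf8CString(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        // copyOf pads the extra slot with 0, which serves as the terminator.
        return Arrays.copyOf(utf8, utf8.length + 1);
    }

    public static void main(String[] args) {
        // Detecting the default is possible, but the encoding below does not depend on it.
        System.out.println("Default charset: " + Charset.defaultCharset());

        byte[] cString = toUtf8CString("\u00E0");
        // First byte is 0xC3, i.e. -61 as a signed byte, on every platform.
        System.out.println(cString[0]); // -61
    }
}
```

Because the charset is named explicitly, there is no need to branch on whether the default happens to be ISO-8859-1.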



     