Hi all. The topic is pretty straightforward, but I've been staring at it for quite some time.
I'm trying to retrieve non-English characters from a MySQL database in UTF-8. In my example below, I use "à" (U+00E0: Latin Small Letter A With Grave) but I've also tried with random Japanese characters (Hex=E7A798). I'm new to Java/JDBC, so I may be missing something basic, but I've searched for quite a while and haven't found anything.
I've made sure that my database was created in UTF-8, and that my connection is in UTF8. I've tried enabling all jdbc connection string options (see commented line below) and it makes no difference. I've also tried System.setProperty("file.encoding", "UTF-8");.
The odd thing is that somehow the code below works when I run it from a JUnit test (the actual results match the expected ones).
Any help would be appreciated. Thanks in advance.
SQL to create data
OS: Windows XP SP3
MySQL: Server version: 5.1.40-community MySQL Community Server (GPL)
Java: java version "1.6.0_20"
JUnit: 4_4.5 v20090824
IDE: Eclipse 20090920-1017 (not that it should matter)
I don't understand why you think there's a problem. You get a string from the database and it contains the "à" character. If you convert it to bytes using your system's default encoding then you get 1 byte which is 224. This is in fact how "à" is encoded in ISO-8859-1 so it's quite likely that is your system's default encoding. And if you convert it to bytes using UTF-8 then you get 2 bytes which are the byte representation of "à" in UTF-8. This is all perfectly normal and nothing to do with JDBC at all.
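To see this concretely, here is a minimal sketch that encodes the same one-character string with both charsets. The class and method names are made up for illustration, and the string is hard-coded rather than read from the database.

```java
import java.nio.charset.Charset;

class AGraveBytes {
    // Encode a string with the named charset (name is assumed to be valid).
    static byte[] encode(String s, String charsetName) {
        return s.getBytes(Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // a-grave (U+00E0), written as an escape so the source file encoding doesn't matter
        String name = "\u00E0";

        byte[] latin1 = encode(name, "ISO-8859-1");
        byte[] utf8 = encode(name, "UTF-8");

        // One byte in ISO-8859-1: 224 (0xE0)
        System.out.println(latin1.length + " byte: " + (latin1[0] & 0xFF));
        // Two bytes in UTF-8: 195 160 (0xC3 0xA0)
        System.out.println(utf8.length + " bytes: " + (utf8[0] & 0xFF) + " " + (utf8[1] & 0xFF));
    }
}
```

The String itself is the same in both cases; only the bytes produced by the chosen charset differ.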
Okay. Thanks for the quick reply.
I think I understand your answer. However, I think that "à" was not such a good example.
Let me change the example to a Japanese character.
When I run the code below, name.getBytes() gives me 1 byte with a value of 63. Instead I'm expecting 3 bytes with values E7 A7 98.
I'm assuming it's 63 because ISO-8859-1 doesn't know anything about the E7 A7 98 character and encodes it as 63 (the "?" character).
I assumed that JDBC was responsible for this conversion. You mention that it is dependent on my system's default encoding. Is this a setting in the Java VM? (I've tried -Dfile.encoding=utf-8)
In the end, I need to be able to read the string as E7 A7 98. The 63 loses data.
Okay. After thinking further, I see what you're saying. In order for me to get the UTF-8 data, I need to request the UTF-8 data explicitly: by doing name.getBytes("UTF8"). I can't rely on doing name.getBytes().
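For the Japanese character, a quick sketch of the difference looks like this. I'm assuming the character behind E7 A7 98 is U+79D8, and I use ISO-8859-1 explicitly to stand in for the platform default; the class and helper names are just for illustration.

```java
import java.nio.charset.Charset;

class KanjiBytes {
    // Render bytes as space-separated uppercase hex, e.g. "E7 A7 98".
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String name = "\u79D8"; // the character whose UTF-8 form is E7 A7 98

        // The String holds one char; nothing has been lost yet.
        System.out.println("chars: " + name.length()); // chars: 1

        // ISO-8859-1 cannot represent it, so the encoder substitutes '?' (0x3F = 63).
        byte[] latin1 = name.getBytes(Charset.forName("ISO-8859-1"));
        System.out.println("ISO-8859-1: " + toHex(latin1)); // ISO-8859-1: 3F

        // Asking for UTF-8 explicitly gives the three expected bytes.
        byte[] utf8 = name.getBytes(Charset.forName("UTF-8"));
        System.out.println("UTF-8: " + toHex(utf8)); // UTF-8: E7 A7 98
    }
}
```

So the data is only lost at the point where the string is encoded with a charset that can't represent it, not when it is read from the database.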
You're right. It has nothing to do with JDBC. Please feel free to move to appropriate forum.
reason for mod1: clarified logic
reason for mod2: answered own question: Charset.defaultCharset()
A final question: Is there a way I can change the default system encoding to UTF-8? Or detect what the default system encoding is? I'm planning to write code like below, but I wouldn't want to convert if the encoding is not ISO-8859-1.
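A minimal sketch of the detection side, using Charset.defaultCharset(). The class and method names are hypothetical. As far as I know the default is fixed when the JVM starts, so -Dfile.encoding=UTF-8 on the command line is the usual way to change it, and calling System.setProperty afterwards generally has no effect. Passing the charset explicitly sidesteps the question entirely.

```java
import java.nio.charset.Charset;

class DefaultCharsetCheck {
    private static final Charset UTF8 = Charset.forName("UTF-8");

    // True when the JVM's default charset is already UTF-8.
    static boolean defaultIsUtf8() {
        return Charset.defaultCharset().equals(UTF8);
    }

    // Always produces UTF-8 bytes, whatever the platform default is.
    static byte[] utf8Bytes(String s) {
        return s.getBytes(UTF8);
    }

    public static void main(String[] args) {
        System.out.println("default charset: " + Charset.defaultCharset().name());
        System.out.println("default is UTF-8: " + defaultIsUtf8());
        System.out.println("UTF-8 byte count for a-grave: " + utf8Bytes("\u00E0").length);
    }
}
```

With utf8Bytes there is no need to branch on the default encoding at all, since the result is the same on every machine.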
One other detail: I'm using this string to pass to a JNA function, and it seems to care about whether or not the string it gets is UTF-8 or ISO-8859-1.
In other words, if I pass it a string where the first byte of name.getBytes() is -32 (0xE0, "à" in ISO-8859-1), it fails. If the first byte is -61 (0xC3, the start of "à" in UTF-8), it works.
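One way to take the default encoding out of the picture is to hand the native side the bytes yourself. The sketch below assumes the native function takes a NUL-terminated UTF-8 char* and that the JNA interface declares that parameter as byte[]; both are assumptions about code not shown here, and the class and method names are hypothetical. JNA also has a jna.encoding system property that controls how it converts Java Strings, so check the documentation for the JNA version in use.

```java
import java.nio.charset.Charset;

class NativeStringBytes {
    // Encode as UTF-8 and append a trailing 0 byte, as a C-style string expects.
    static byte[] toNulTerminatedUtf8(String s) {
        byte[] encoded = s.getBytes(Charset.forName("UTF-8"));
        byte[] out = new byte[encoded.length + 1]; // new arrays are zero-filled, so the last byte is the terminator
        System.arraycopy(encoded, 0, out, 0, encoded.length);
        return out;
    }

    public static void main(String[] args) {
        String name = "\u00E0"; // a-grave

        // Java bytes are signed, which is where the negative values come from.
        byte latin1First = name.getBytes(Charset.forName("ISO-8859-1"))[0];
        byte utf8First = toNulTerminatedUtf8(name)[0];

        System.out.println("ISO-8859-1 first byte: " + latin1First); // -32 (0xE0)
        System.out.println("UTF-8 first byte: " + utf8First);        // -61 (0xC3)
    }
}
```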