
Interpreting UTF-8 in a Java program

 
Roger F. Gay
Ranch Hand
Posts: 408
My program is receiving an integer array from a browser application that's interpreted as UTF-8 (example in code). I can echo my resulting string ("theString" shown in the code below) back to the browser and everything's fine. But it's not fine in the Java program. The input string is "Hällo". But it prints out from the Java program as "Hõllo".



 
Paul Clapham
Sheriff
Posts: 20171
I'm going to assume that your assertions about the UTF-8 encodings for those characters are correct. Let's go on and see what your code does:

First you create an array of bytes which represents text encoded in UTF-8.

Then you create a string from that array, assuming it was encoded in your system's default charset (let's call that X). This is a bad idea.

Then you create an array of bytes from that string, encoding it according to charset X. You might think you get your original array of bytes back, but that might not be the case if charset X doesn't know how to deal with everything in the original array.

Then you create a string from that array, assuming it was encoded in UTF-8. This would be a good idea if you hadn't done all of the other mucking about.

And finally you write it to the console, which uses some other encoding to represent characters.

So all in all there's a bunch of code which is at best useless and possibly damaging. The code at lines 14 and 15 should be replaced by this:

As for displaying it on the console to see if your code works, that's a crapshoot too. Use a Unicode-aware tool like a Swing GUI to do a proper test.
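The round trip described above can be sketched as follows. This is only an illustration, not the code from the original post: the class and method names are made up, the byte values are assumed to be the UTF-8 encoding of "Hällo", and ISO-8859-1 and US-ASCII stand in for "charset X".

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class RoundTripDemo {

    // Mimics the sequence described above: decode with charset X,
    // re-encode with X, then decode the result as UTF-8.
    static String roundTrip(byte[] utf8Bytes, Charset x) {
        String wrong = new String(utf8Bytes, x);
        byte[] reencoded = wrong.getBytes(x);
        return new String(reencoded, StandardCharsets.UTF_8);
    }

    // The direct route: decode the bytes exactly once, as UTF-8.
    static String direct(byte[] utf8Bytes) {
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Assumed input: "Hällo" encoded as UTF-8 (ä is 0xC3 0xA4).
        byte[] utf8Bytes = {72, (byte) 195, (byte) 164, 108, 108, 111};
        String expected = "H\u00e4llo";

        // Booleans are printed so the check does not depend on the console.
        System.out.println(direct(utf8Bytes).equals(expected));
        // ISO-8859-1 maps every byte to a char, so the round trip survives.
        System.out.println(roundTrip(utf8Bytes, StandardCharsets.ISO_8859_1).equals(expected));
        // US-ASCII cannot represent bytes above 127, so the text is damaged.
        System.out.println(roundTrip(utf8Bytes, StandardCharsets.US_ASCII).equals(expected));
    }
}
```

Whether the detour through charset X is harmless therefore depends entirely on which charset X happens to be, which is why decoding once as UTF-8 is the safer route.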
 
Rob Spoor
Sheriff
Posts: 20372
Paul, that isn't going to compile because there is no constructor that takes both an int[] and an encoding. The simple solution: convert the int[] to a byte[]. That can be done in two ways:
1) create a copy:

2) declare the array as a byte[] from the start:
Note the two casts that are required because the values fall outside the byte range.
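A minimal sketch of both approaches, assuming the incoming values are 72, 195, 164, 108, 108, 111 (the UTF-8 encoding of "Hällo"). The class and method names are illustrative only, and `StandardCharsets` assumes Java 7 or later; on older versions the charset name "UTF-8" would be passed as a string instead.

```java
import java.nio.charset.StandardCharsets;

final class IntArrayToString {

    // Option 1: copy each int into a byte[]; the cast keeps the low 8 bits.
    static byte[] toBytes(int[] values) {
        byte[] bytes = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            bytes[i] = (byte) values[i];
        }
        return bytes;
    }

    // Decode the copied bytes once, as UTF-8.
    static String decodeUtf8(int[] values) {
        return new String(toBytes(values), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        int[] received = {72, 195, 164, 108, 108, 111};
        String fromCopy = decodeUtf8(received);

        // Option 2: declare a byte[] directly. 195 and 164 are outside the
        // byte range (-128..127), so those two literals need a cast.
        byte[] declared = {72, (byte) 195, (byte) 164, 108, 108, 111};
        String fromBytes = new String(declared, StandardCharsets.UTF_8);

        System.out.println(fromCopy.equals(fromBytes)); // true
        System.out.println(fromCopy.length());          // 5
    }
}
```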

Edit: that will still display "Hõllo". The problem here isn't in Java but in the system console - that's quite limited in the characters it can display, and apparently "ä" is one of those characters it can't display. Using JOptionPane.showMessageDialog(null, theString) will show the correct String.
Your original code actually will display "Hällo" when using JOptionPane (at least on my system), but the encoding problem Paul described is still potentially there.
 
Roger F. Gay
Ranch Hand
Posts: 408
I cannot set up the initial int array differently. It comes in that way. I tried the second method, but the result did not change.

 
Paul Clapham
Sheriff
Posts: 20171
Roger F. Gay wrote: I cannot set up the initial int array differently. It comes in that way. I tried the second method, but the result did not change.


And Rob explained why that was. So don't use the console for testing, it's an inadequate tool. Your revised code looks fine, though.
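If the console is the only tool at hand, one workaround (a sketch, not something proposed in this thread) is to print the string's code points as ASCII escapes, which any console can display. The class and method names here are hypothetical.

```java
final class CodePointDump {

    // Renders each code point as U+XXXX so the output is pure ASCII.
    static String dump(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints: U+0048 U+00E4 U+006C U+006C U+006F
        System.out.println(dump("H\u00e4llo"));
    }
}
```

Seeing U+00E4 in the output confirms that the string holds "ä" internally, whatever the console makes of the character itself.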
 
Roger F. Gay
Ranch Hand
Posts: 408
I was paying attention. I promise. I just came back to report that the following bit of code confirmed that the text is properly interpreted in the Java environment. I was worried for a sec. that I still had a problem if the input was used in file output, but I checked. I can specify character encoding on PrintWriter (for example), so I'm guessing that won't be a problem. I do use console output all day long however, while developing. I guess - if there's no solution for that part - I'll just stick to good old English for my functional testing.

Thanks for your time and help guys.
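For the file-output case mentioned above, a minimal sketch of a PrintWriter with an explicit encoding might look like this. It is not the code from the post: the class name, method names, and the temporary file are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class Utf8FileOutput {

    // Writes text through a PrintWriter whose encoding is fixed to UTF-8,
    // so the result does not depend on the platform default charset.
    static void write(Path file, String text) throws IOException {
        try (PrintWriter out = new PrintWriter(
                new OutputStreamWriter(Files.newOutputStream(file), StandardCharsets.UTF_8))) {
            out.print(text);
        }
    }

    // Writes to a temporary file and returns the raw bytes that landed on disk.
    static byte[] writeAndReadBack(String text) throws IOException {
        Path file = Files.createTempFile("utf8-demo", ".txt");
        try {
            write(file, text);
            return Files.readAllBytes(file);
        } finally {
            Files.deleteIfExists(file);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = writeAndReadBack("H\u00e4llo");
        System.out.println(bytes.length); // 6: "ä" takes two bytes in UTF-8
    }
}
```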

 
Roger F. Gay
Ranch Hand
Posts: 408
Interesting side-note. I'm testing on Linux now. I hadn't put a lot of effort into System.out.println for internationalization; so Chinese characters (for example) just produced strange (not Chinese) characters when printed to a command window (Windows). In Linux, they actually print out in Chinese. First time I've seen that in a command window.
 
Campbell Ritchie
Sheriff
Posts: 47229
That is because the shell which underlies the Linux terminal is much better, and has a better repertoire of characters available.
I think the default on Windows® is not UTF-8, but ISO-8859-1, but I may be mistaken about that.

I also think this is too difficult for "beginning", so I shall move it.
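Rather than guessing at the platform default, the JVM can be asked directly. The sketch below uses a made-up class name and prints whatever the running system reports, so no particular output is claimed. Note that this is the JVM's default charset, which is not necessarily the encoding the console window uses to draw characters.

```java
import java.nio.charset.Charset;

final class DefaultCharsetCheck {

    // Reports the charset the JVM uses when none is specified explicitly.
    static String describe() {
        return "default charset: " + Charset.defaultCharset().name()
                + ", file.encoding: " + System.getProperty("file.encoding");
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```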
 
Roger F. Gay
Ranch Hand
Posts: 408
Campbell Ritchie wrote:That is because the shell which underlies the Linux terminal is much better, and has a better repertoire of characters available.
I think the default on Windows® is not UTF-8, but ISO-8859-1, but I may be mistaken about that.

I also think this is too difficult for "beginning", so I shall move it.


Thanks. That comment about too difficult for "beginning" gives me confidence. I keep thinking that whenever I'm faced with a bits and bytes problem, it's something I'd know if my degree was in computer science.

My guess is that Windows doesn't use ISO-8859-1. That covers Scandinavian languages (öäå, etc.) and I include those in my default demo sample. They're just as confused in the console as Chinese.
 
Campbell Ritchie
Sheriff
Posts: 47229
The command prompt on Windows® hardly supports more characters and fonts now than it did on Win98. And I think Win98 was actually a posh front-end on top of DOS, but am by no means sure about that. It can't even support £, printing ú instead.
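One plausible explanation for both "£ becomes ú" and "Hällo becomes Hõllo", offered as an assumption rather than anything confirmed in this thread: the bytes are written in a Latin-1 style encoding while the console interprets them with the older DOS code page 850. The sketch below simulates that mismatch. ISO-8859-1 stands in for the Windows default, and the "IBM850" charset is assumed to be available in the running JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ConsoleMismatch {

    // Encodes text the way the program writes it, then decodes those bytes
    // the way the console would interpret them.
    static String asShown(String text, Charset written, Charset console) {
        return new String(text.getBytes(written), console);
    }

    public static void main(String[] args) {
        Charset written = StandardCharsets.ISO_8859_1; // assumption: program side
        Charset console = Charset.forName("IBM850");   // assumption: console side

        // ä is byte 0xE4 in ISO-8859-1; code page 850 shows 0xE4 as õ.
        System.out.println(asShown("H\u00e4llo", written, console).equals("H\u00f5llo"));
        // £ is byte 0xA3 in ISO-8859-1; code page 850 shows 0xA3 as ú.
        System.out.println(asShown("\u00a3", written, console).equals("\u00fa"));
    }
}
```

If this is what is happening, the string inside the Java program is correct and only its rendering in the command window is wrong, which matches what Rob observed with JOptionPane.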
 
Roger F. Gay
Ranch Hand
Posts: 408
Enjoy the fruits of your labors... thank you. Click here for a demo of internationalization.
 
Roger F. Gay
Ranch Hand
Posts: 408
I note on my end that even though the article states that support only exists in Google Chrome dev and Firefox Beta, some brave souls are trying to connect with other browsers anyway. Just FYI, I use Chromium dev and Firefox beta for everyday use. It's not a scary experience.
 
Campbell Ritchie
Sheriff
Posts: 47229
What's "well done" in Swedish?
 
Roger F. Gay
Ranch Hand
Posts: 408
Campbell Ritchie wrote: What's "well done" in Swedish?


Bra gjort! .... thanks.
 