
Interpreting UTF-8 in a Java program

Roger F. Gay
Ranch Hand

Joined: Feb 16, 2007
Posts: 396
My program is receiving an integer array from a browser application that's interpreted as UTF-8 (example in code). I can echo my resulting string ("theString" shown in the code below) back to the browser and everything's fine. But it's not fine in the Java program. The input string is "Hällo". But it prints out from the Java program as "Hõllo".




Correlation does not prove causality.
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18716
    

I'm going to assume that your assertions about the UTF-8 encodings for those characters are correct. Let's go on and see what your code does:

First you create an array of bytes which represents text encoded in UTF-8.

Then you create a string from that array, assuming it was encoded in your system's default charset (let's call that X). This is a bad idea.

Then you create an array of bytes from that string, encoding it according to charset X. You might think you get your original array of bytes back, but that might not be the case if charset X doesn't know how to deal with everything in the original array.

Then you create a string from that array, assuming it was encoded in UTF-8. This would be a good idea if you hadn't done all of the other mucking about.

And finally you write it to the console, which uses some other encoding to represent characters.

So all in all there's a bunch of code which is at best useless and possibly damaging. The code at lines 14 and 15 should be replaced by this:

As for displaying it on the console to see if your code works, that's a crapshoot too. Use a Unicode-aware tool like a Swing GUI to do a proper test.
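A minimal sketch of the difference between decoding once and taking the detour described above. The byte values are an assumption (the UTF-8 encoding of "Hällo"), since the original listing is not reproduced here. US-ASCII stands in for "charset X" so that the damage is the same on every machine, and the class and method names are illustrative only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative only: class and method names are not from the original post.
final class DecodeDemo {

    // Assumed input: the UTF-8 bytes of "Hällo" ('ä' is 0xC3 0xA4).
    static final byte[] UTF8_BYTES = {
        0x48, (byte) 0xC3, (byte) 0xA4, 0x6C, 0x6C, 0x6F
    };

    // Decode once, naming the charset the bytes were actually written in.
    static String decodeDirectly(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8);
    }

    // The detour: decode with some other charset X, re-encode with X,
    // and only then decode the result as UTF-8.
    static String decodeViaDetour(byte[] utf8, Charset x) {
        String wrong = new String(utf8, x);
        byte[] roundTripped = wrong.getBytes(x);
        return new String(roundTripped, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String direct = decodeDirectly(UTF8_BYTES);
        String detour = decodeViaDetour(UTF8_BYTES, StandardCharsets.US_ASCII);
        // Compare in code rather than trusting how the console renders the text.
        System.out.println("direct ok: " + direct.equals("H\u00e4llo"));
        System.out.println("detour ok: " + detour.equals("H\u00e4llo"));
    }
}
```

With US-ASCII as X, the two non-ASCII bytes cannot survive the round trip. With a charset that maps every byte value, such as ISO-8859-1, the round trip happens to be lossless, which is why the detour can appear to work on some systems.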
Rob Spoor
Sheriff

Joined: Oct 27, 2005
Posts: 19726
    

Paul, that isn't going to compile because there is no constructor that takes both an int[] and an encoding. The simple solution: convert the int[] to a byte[]. That can be done in two ways:
1) create a copy:

2) declare the array as a byte[] from the start:
Note the two casts that are required because the values fall outside the byte range.
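The two options can be sketched roughly as follows. The array contents are an assumption (the UTF-8 encoding of "Hällo" with values in the range 0..255), and the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: names are not from the original thread.
final class IntArrayToString {

    // Option 1: copy each int into a byte[]; the cast keeps the low 8 bits.
    static byte[] toBytes(int[] values) {
        byte[] bytes = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            bytes[i] = (byte) values[i];
        }
        return bytes;
    }

    // Decode the copied bytes with an explicit charset.
    static String decodeUtf8(int[] values) {
        return new String(toBytes(values), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Assumed input: values 0..255 as delivered by the browser.
        int[] received = {0x48, 0xC3, 0xA4, 0x6C, 0x6C, 0x6F};

        // Option 2: a byte[] from the start; 0xC3 and 0xA4 exceed
        // Byte.MAX_VALUE (127), hence the two casts.
        byte[] literal = {0x48, (byte) 0xC3, (byte) 0xA4, 0x6C, 0x6C, 0x6F};

        String fromInts = decodeUtf8(received);
        String fromLiteral = new String(literal, StandardCharsets.UTF_8);
        System.out.println("same result: " + fromInts.equals(fromLiteral));
    }
}
```

When the data arrives as an int[] and cannot be changed at the source, option 1 is the one that applies.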

Edit: that will still display "Hõllo". The problem here isn't in Java but in the system console - that's quite limited in the characters it can display, and apparently "ä" is one of those characters it can't display. Using JOptionPane.showMessageDialog(null, theString) will show the correct String.
Your original code will actually display "Hällo" when using JOptionPane (at least on my system), but the encoding problem Paul described is still potentially there.
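Another way to check the string without depending on what the console can draw, offered here as an alternative that nobody in this thread proposed, is to print its code points in plain ASCII. The class and method names below are hypothetical.

```java
// Illustrative only: a console-independent way to inspect a String.
final class CodePointDump {

    // Renders each code point as U+XXXX, separated by spaces.
    static String describe(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int codePoint = s.codePointAt(i);
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", codePoint));
            // Advance by one or two chars, depending on surrogate pairs.
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "H\u00e4llo" is "Hällo" written with an escape, so the source
        // file's own encoding cannot interfere.
        System.out.println(describe("H\u00e4llo"));
    }
}
```

If the second entry reads U+00E4, the string holds "ä" correctly and any odd character on screen comes from the display side.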


SCJP 1.4 - SCJP 6 - SCWCD 5 - OCEEJBD 6
Roger F. Gay
Ranch Hand

Joined: Feb 16, 2007
Posts: 396
I cannot set up the initial int array differently. It comes in that way. I tried the second method, but the result did not change.

Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18716
    

Roger F. Gay wrote: I cannot set up the initial int array differently. It comes in that way. I tried the second method, but the result did not change.


And Rob explained why that was. So don't use the console for testing; it's an inadequate tool. Your revised code looks fine, though.
Roger F. Gay
Ranch Hand

Joined: Feb 16, 2007
Posts: 396
I was paying attention, I promise. I just came back to report that the following bit of code confirmed that the text is properly interpreted in the Java environment. I was worried for a sec that I still had a problem if the input was used in file output, but I checked: I can specify the character encoding on PrintWriter (for example), so I'm guessing that won't be a problem. I do use console output all day long while developing, however. I guess, if there's no solution for that part, I'll just stick to good old English for my functional testing.
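A rough sketch of what specifying the encoding on PrintWriter can look like. The temporary file and the class and method names are hypothetical; only the PrintWriter(File, String) constructor is taken from the standard library.

```java
import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Illustrative only: names and the scratch file are assumptions.
final class Utf8FileOutput {

    // Writes text as UTF-8, regardless of the platform default charset.
    static void writeUtf8(File file, String text) throws IOException {
        try (PrintWriter writer = new PrintWriter(file, "UTF-8")) {
            writer.print(text);
        }
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("hallo", ".txt");
        file.deleteOnExit();

        writeUtf8(file, "H\u00e4llo");

        // Read the raw bytes back and decode them with the same charset.
        byte[] written = Files.readAllBytes(file.toPath());
        String readBack = new String(written, StandardCharsets.UTF_8);
        System.out.println("round trip ok: " + readBack.equals("H\u00e4llo"));
    }
}
```

Whatever reads the file later has to name the same encoding, just as the reading side does here.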

Thanks for your time and help guys.

Roger F. Gay
Ranch Hand

Joined: Feb 16, 2007
Posts: 396
Interesting side note. I'm testing on Linux now. I hadn't put a lot of effort into System.out.println for internationalization, so Chinese characters (for example) just produced strange (not Chinese) characters when printed to a command window on Windows. On Linux, they actually print out in Chinese. It's the first time I've seen that in a command window.
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 39478
    
That is because the shell which underlies the Linux terminal is much better, and has a better repertoire of characters available.
I think the default on Windows® is not UTF-8, but ISO-8859-1, but I may be mistaken about that.

I also think this is too difficult for "beginning", so I shall move it.
Roger F. Gay
Ranch Hand

Joined: Feb 16, 2007
Posts: 396
Campbell Ritchie wrote: That is because the shell which underlies the Linux terminal is much better, and has a better repertoire of characters available.
I think the default on Windows® is not UTF-8, but ISO-8859-1, but I may be mistaken about that.

I also think this is too difficult for "beginning", so I shall move it.


Thanks. That comment about too difficult for "beginning" gives me confidence. I keep thinking that whenever I'm faced with a bits and bytes problem, it's something I'd know if my degree was in computer science.

My guess is that Windows doesn't use ISO-8859-1. That covers Scandinavian languages (öäå, etc.) and I include those in my default demo sample. They're just as confused in the console as Chinese.
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 39478
    
The command prompt on Windows® hardly supports more characters and fonts now than it did on Win98. And I think Win98 was actually a posh front-end on top of DOS, but am by no means sure about that. It can't even support £, printing ú instead.
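One possible explanation for these particular substitutions, offered as an assumption that nobody in this thread has confirmed, is that the JVM writes bytes in windows-1252 while the command prompt reads them as code page 850. The sketch below simulates that mismatch; the class and method names are hypothetical, and the IBM850 charset may be missing from a stripped-down runtime.

```java
import java.nio.charset.Charset;

// Illustrative only: simulates a writer/reader charset mismatch.
final class CodePageMismatch {

    // Encode with the charset the program writes in, then decode with
    // the charset the console is assumed to read in.
    static String asSeenBy(String text, Charset written, Charset displayed) {
        return new String(text.getBytes(written), displayed);
    }

    public static void main(String[] args) {
        // Assumptions: JVM default windows-1252, console code page 850.
        if (!Charset.isSupported("windows-1252") || !Charset.isSupported("IBM850")) {
            System.out.println("Required charsets are not available in this runtime.");
            return;
        }
        Charset jvm = Charset.forName("windows-1252");
        Charset console = Charset.forName("IBM850");

        // "H\u00e4llo \u00a3" is "Hällo £" written with escapes.
        String seen = asSeenBy("H\u00e4llo \u00a3", jvm, console);

        // Compare with the substitutions reported in this thread:
        // U+00F5 (o with tilde) for "ä" and U+00FA (u with acute) for "£".
        System.out.println("matches the report: " + seen.equals("H\u00f5llo \u00fa"));
    }
}
```

If the comparison holds, the text inside the program is intact and only the byte-to-glyph mapping at the console differs.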
Roger F. Gay
Ranch Hand

Joined: Feb 16, 2007
Posts: 396
Enjoy the fruits of your labors ... thank you. Click here for a demo of internationalization.
Roger F. Gay
Ranch Hand

Joined: Feb 16, 2007
Posts: 396
I note on my end that even though the article states that support only exists in Google Chrome dev and Firefox Beta, some brave souls are trying to connect with other browsers anyway. Just FYI, I use Chromium dev and Firefox beta for everyday use. It's not a scary experience.
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 39478
    
What's "well done" in Swedish?
Roger F. Gay
Ranch Hand

Joined: Feb 16, 2007
Posts: 396
Campbell Ritchie wrote: What's "well done" in Swedish?


Bra gjort! .... thanks.
 