java String UTF8

Edward Chen
Ranch Hand

Joined: Dec 23, 2003
Posts: 758


My goal is to take a String and convert it to UTF-8.
1. The above way is wrong. See the comment.
2. I can't set my own default locale.
3. Before we convert it to UTF-8, we should know the string's original encoding. But how can I find that out?

Thanks
Francis Shillitoe
Greenhorn

Joined: Aug 30, 2002
Posts: 22
Strings in Java are always held as Unicode text in UTF-16 (early Java versions used UCS-2, the fixed-width predecessor of UTF-16). When you ask how you can determine the encoding of a String, I assume you mean some series of bytes in a file. Unfortunately, there is no reliable way to determine this from the bytes alone; you have to know the character encoding that was used to turn the characters into bytes. To get non-ASCII characters into a String in a Java source file you can use \u escapes. A character set is simply a mapping between numbers and characters (e.g. Unicode). A character encoding is a mapping between those numbers and sequences of bytes (e.g. UTF-8, UTF-16).

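import java.io.UnsupportedEncodingException;

// The Unicode escapes below spell "Hello"
String myString = "\u0048\u0065\u006C\u006C\u006F World";
System.out.println(myString);
byte[] myBytes = null;

try
{
    // Encode the String's characters as UTF-8 bytes
    myBytes = myString.getBytes("UTF-8");
} catch (UnsupportedEncodingException e)
{
    // Should not happen: every Java platform is required to support UTF-8
    e.printStackTrace();
    System.exit(-1);
}

// Print each byte as a signed decimal value
for (int i = 0; i < myBytes.length; i++) {
    System.out.println(myBytes[i]);
}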
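
To illustrate why you have to know the encoding in order to read bytes back, here is a minimal sketch. The class and method names (RoundTripDemo, decodeUtf8, decodeLatin1) are made up for this example, and it assumes Java 7 or later for StandardCharsets. It encodes a String as UTF-8 and then decodes the same bytes twice, once with the right charset and once with the wrong one.

```java
import java.nio.charset.StandardCharsets;

class RoundTripDemo {
    // Decode the bytes on the assumption that they are UTF-8.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Decode the same bytes on the (wrong) assumption that they are ISO-8859-1.
    static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9"; // "cafe" with an acute accent on the e
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        System.out.println(utf8.length);                          // 5
        System.out.println(decodeUtf8(utf8).equals(original));    // true
        System.out.println(decodeLatin1(utf8).equals(original));  // false
    }
}
```

The bytes themselves carry no label saying which charset produced them, so both decodes succeed without an error; only one of them gives back the original text.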


Francis


http://www.shillitoe.com
Edward Chen
Ranch Hand

Joined: Dec 23, 2003
Posts: 758
Thanks.

If I have a string like

String aa = new String(" \u67e5\u770b\u5168\u90e8");
System.out.println(aa);

Sometimes the system outputs the UTF code, and sometimes it outputs the real Chinese characters. It looks weird. Why?

2. Is the UTF coding the same on every system? No matter what OS or what locale, should a Chinese character always have the same UTF code? Is this concept correct?

3. Are Unicode and UTF-8 different concepts? In my understanding, UTF-8 is a kind of Unicode. Is that right?

Thanks.
Francis Shillitoe
Greenhorn

Joined: Aug 30, 2002
Posts: 22
UTF-8 is not Unicode itself; it is one way of encoding Unicode. See:

http://www.cl.cam.ac.uk/%7Emgk25/unicode.html#unicode

for a good explanation of the differences.
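
As a rough sketch of the difference, the following takes the first character from the string in the question and prints its Unicode code point, which is the same on every OS and locale, followed by the bytes that two different encodings produce for it. The class name CodePointDemo and the toHex helper are made up for this example, and it assumes Java 7 or later for StandardCharsets.

```java
import java.nio.charset.StandardCharsets;

class CodePointDemo {
    // Format bytes as space-separated two-digit uppercase hex values.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u67e5";

        // The Unicode code point: one number per character, independent of platform.
        System.out.println("U+" + Integer.toHexString(s.codePointAt(0)).toUpperCase()); // U+67E5

        // The same code point written out as bytes by two different encodings.
        System.out.println(toHex(s.getBytes(StandardCharsets.UTF_8)));    // E6 9F A5
        System.out.println(toHex(s.getBytes(StandardCharsets.UTF_16BE))); // 67 E5
    }
}
```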

If you are finding that on one system your program works correctly and outputs Chinese characters, and on another it does not (maybe it prints empty squares or question marks), this is very often a font issue; question marks in particular can also mean that the console's output encoding cannot represent the characters. You need to have a Unicode font installed (such as the Microsoft Arial Unicode font available on an MS Office CD) to see the full range of characters in a UTF-8 encoded file.
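
Apart from the font, System.out encodes its output with the platform's default charset, so the result can differ between machines for that reason too. One thing that may be worth trying is to wrap the stream in a PrintStream with an explicit encoding. This is only a sketch: the class name Utf8ConsoleDemo and the utf8Printer helper are made up for this example, and it only helps if the terminal itself is set up to interpret UTF-8 and has a suitable font.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class Utf8ConsoleDemo {
    // Wrap any output stream so that text written to it is encoded as UTF-8.
    static PrintStream utf8Printer(OutputStream target) {
        try {
            return new PrintStream(target, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PrintStream out = utf8Printer(System.out);
        // Written as UTF-8 bytes regardless of the platform default charset.
        out.println("\u67e5\u770b\u5168\u90e8");
    }
}
```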

All these sorts of issues are covered under the subject of Internationalization (I18N). This is a good site on the subject:

http://www.joconner.com/javai18n/

regards,

Francis