java String UTF8

 
Ranch Hand
Posts: 798


My goal is to get a String and convert it to UTF-8.
1. The above way is wrong. See the comment.
2. I can't set my own default locale.
3. Before we convert it to UTF-8, we should know the string's original encoding. But how can I find that out?

Thanks
 
Greenhorn
Posts: 22
Strings in Java are always sequences of Unicode UTF-16 code units (older descriptions say UCS-2, but UCS-2 is only the fixed-width 16-bit subset of UTF-16 and cannot represent supplementary characters). When you ask how you can determine the encoding of a String, I assume you mean some series of bytes in a file. Unfortunately, there is no reliable way to determine this from the bytes alone; you have to know the character encoding that was used to encode the characters into bytes. To get non-ASCII characters into a String in a Java source file you can use \u escapes. Character sets are simply mappings between a number and a character (e.g. Unicode). Character encodings are mappings between this number and a sequence of bytes (e.g. UTF-8, UTF-16).

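import java.io.UnsupportedEncodingException;

public class Utf8Bytes {
    public static void main(String[] args) {
        // The first five characters are written as Unicode escapes and spell "Hello".
        String myString = "\u0048\u0065\u006C\u006C\u006F World";
        System.out.println(myString);
        byte[] myBytes = null;

        try {
            // Encode the String's characters into UTF-8 bytes.
            myBytes = myString.getBytes("UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform must support UTF-8, so this should not happen.
            e.printStackTrace();
            System.exit(-1);
        }

        // Print each encoded byte as a signed decimal value.
        for (int i = 0; i < myBytes.length; i++) {
            System.out.println(myBytes[i]);
        }
    }
}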
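To illustrate why the encoding cannot be recovered from the bytes alone, here is a minimal sketch (the class name DecodeDemo and the helper decode are made up for this example). It decodes the same two bytes once as UTF-8 and once as ISO-8859-1. Both calls succeed without any error, they just produce different text.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeDemo {
    // Hypothetical helper: decode raw bytes using the charset the caller claims they are in.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) becomes the two bytes 0xC3 0xA9 in UTF-8.
        byte[] bytes = "\u00e9".getBytes(StandardCharsets.UTF_8);

        String asUtf8 = decode(bytes, StandardCharsets.UTF_8);
        String asLatin1 = decode(bytes, StandardCharsets.ISO_8859_1);

        // Same bytes, two valid but different interpretations.
        System.out.println(asUtf8.length());   // 1 character
        System.out.println(asLatin1.length()); // 2 characters
    }
}
```

Nothing in the byte array says which interpretation is the intended one, which is why the encoding has to come from somewhere else (a file header, a protocol field, or a convention).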


Francis
 
Edward Chen
Ranch Hand
Posts: 798
Thanks.

If I have a string like:

String aa = new String(" \u67e5\u770b\u5168\u90e8");
System.out.println(aa);

Sometimes the system prints the UTF codes, and sometimes it prints the real Chinese characters. That looks weird. Why?

2. Is the UTF coding the same on every system? No matter what the OS or locale is, a given Chinese character should always have the same UTF code. Is this understanding correct?

3. Are Unicode and UTF-8 different concepts? In my understanding, UTF-8 is a kind of Unicode. Is that right?

Thanks.
 
Francis Shillitoe
Greenhorn
Posts: 22
UTF-8 is not Unicode; it is a way of encoding Unicode. See:

http://www.cl.cam.ac.uk/%7Emgk25/unicode.html#unicode

for a good explanation of the differences.
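As a rough sketch of that difference (the class name EncodingDemo and the helper hex are invented for this example), the code below takes the first character from the string above, prints its Unicode code point, and then prints the bytes it turns into under two different encodings. The code point is the same on every OS and locale; only the byte sequence depends on which encoding you pick.

```java
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    // Hypothetical helper: format bytes as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String s = "\u67e5";

        // The Unicode code point: a number, independent of any encoding.
        System.out.println(String.format("U+%04X", s.codePointAt(0))); // U+67E5

        // The same code point as bytes under two different encodings.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));    // E6 9F A5
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE))); // 67 E5
    }
}
```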

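If you are finding that your program correctly prints Chinese characters on one system but not on another, the cause is almost certainly the output environment rather than the String itself. Empty squares usually mean the display font has no glyph for the character, so you need a font with broad Unicode coverage installed (such as the Microsoft Arial Unicode font available on an MS Office CD) to see the full range of characters. Question marks usually mean the characters could not be represented in the encoding Java used when writing to the console, which by default depends on the platform and locale.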
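One way to take the platform default out of the picture is to wrap System.out in a PrintStream with an explicit encoding. This is only a sketch: the class name Utf8Console and the helpers printStream and printed are invented here, and it assumes the terminal itself interprets its input as UTF-8 and has a font with the needed glyphs.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class Utf8Console {
    // Hypothetical helper: a PrintStream that encodes text with the named charset.
    static PrintStream printStream(OutputStream target, String charsetName) {
        try {
            return new PrintStream(target, true, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("Charset not available: " + charsetName, e);
        }
    }

    // Hypothetical helper: capture the bytes a PrintStream would write for the given text.
    static byte[] printed(String text, String charsetName) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintStream out = printStream(buffer, charsetName);
        out.print(text);
        out.flush();
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        String aa = "\u67e5\u770b\u5168\u90e8";

        // US-ASCII cannot represent these characters, so each one is written as '?'.
        System.out.println(printed(aa, "US-ASCII").length); // 4
        // UTF-8 can represent them, using three bytes per character here.
        System.out.println(printed(aa, "UTF-8").length);    // 12

        // Write to the real console as UTF-8 instead of the platform default.
        printStream(System.out, "UTF-8").println(aa);
    }
}
```

If the last line still shows squares, the bytes are most likely correct and the remaining problem is the font or the terminal's own encoding setting.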

All these sorts of issues are covered under the subject of Internationalization (I18N). This is a good site on the subject:

http://www.joconner.com/javai18n/

regards,

Francis