Bytes displaying for chinese characters

 
Ranch Hand
Posts: 170
Encoding is UTF-8


When I run this, the output is 9. The Greek letters display properly, but the number of bytes looks wrong, since each Greek letter takes two bytes.

Encoding is CP1252



When I run this, the output is 18. Here the String is unreadable, but the number of bytes is correct.

I have read that

Every Chinese Character is represented by a two byte code.


from this web page

Encoding is UTF-8


When I run this, the output is 4.

Shouldn't the output be 8 rather than 4, since each Chinese character is 2 bytes?
Why is the output 4?
I expected the number of bytes to be correct when the encoding is correct.

Thanks ..
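
The snippets from this post did not survive, so what follows is only a guess at the situation, not the original code. If the program printed `String.length()`, the numbers above would be counts of Java `char` values (UTF-16 code units) rather than bytes, which would explain 4 for four Chinese characters. The class name, method names and sample string below are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class LengthVersusBytes {

    // Number of char values (UTF-16 code units) in the String; this is not a byte count.
    static int charCount(String s) {
        return s.length();
    }

    // Number of bytes the String occupies once encoded as UTF-8.
    static int utf8ByteCount(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Four CJK characters from the Basic Multilingual Plane, written as
        // Unicode escapes so the source file encoding does not matter.
        String chinese = "\u4e2d\u570b\u64cd\u63a7";

        System.out.println(charCount(chinese));     // prints 4
        System.out.println(utf8ByteCount(chinese)); // prints 12
    }
}
```

The byte count only exists relative to an encoding; `length()` never reports it.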
 
Ranch Hand
Posts: 378
Sorry I couldn't help you out with this. I tried your code and get the same results.

Have you been able to figure out what's happening?
 
Rancher
Posts: 5012

getBytes()
Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.


Have you tried giving the charset?
public byte[] getBytes(Charset charset)
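
A minimal sketch of that suggestion, passing the charset explicitly so the result no longer depends on the platform default. The sample string and the class name are illustrative assumptions. Note that Java's `UTF-16` charset writes a two-byte byte-order mark when encoding, so its count is two higher than that of `UTF-16BE`.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetBytes {

    // Encodes the String with the given charset and returns the number of bytes produced.
    static int byteCount(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // Six CJK characters from the Basic Multilingual Plane (Unicode escapes
        // keep the example independent of the source file encoding).
        String chinese = "\u4e2d\u570b\u64cd\u63a7\u532f\u7387";

        System.out.println(chinese.length());                               // prints 6
        System.out.println(byteCount(chinese, StandardCharsets.UTF_8));     // prints 18
        System.out.println(byteCount(chinese, StandardCharsets.UTF_16));    // prints 14 (2-byte BOM + 12)
        System.out.println(byteCount(chinese, StandardCharsets.UTF_16BE));  // prints 12
    }
}
```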
 
vanlalhmangaiha khiangte
Ranch Hand
Posts: 170
Yes, I tried it.


The output is 6, 18 and 14.


For UTF-8:
1. Two bytes are needed for Latin letters with diacritics and for characters from the Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac and Thaana alphabets (Unicode range U+0080 to U+07FF).
2. Three bytes are needed for the rest of the Basic Multilingual Plane (which contains virtually all characters in common use).
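
The ranges quoted above can be checked with a small sketch like the following. The class and method names are made up for illustration, and the four sample code points are just one pick from each range.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Widths {

    // Returns how many bytes UTF-8 needs for a single Unicode code point.
    static int utf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length(0x0041));  // Latin 'A', below U+0080: prints 1
        System.out.println(utf8Length(0x03B1));  // Greek alpha, U+0080..U+07FF: prints 2
        System.out.println(utf8Length(0x4E2D));  // CJK ideograph, rest of the BMP: prints 3
        System.out.println(utf8Length(0x20000)); // supplementary CJK ideograph: prints 4
    }
}
```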



According to this


This gives the output 18, which is correct.

So why is each Chinese character said to take two bytes, when getBytes on a String gives three bytes per character? This is really confusing to me.
I have been reading about UTF-8, UTF-16 and ISO 8859-1, but it is confusing me more and more.

So how many bytes is a Chinese character?

Thanks...
 
Ranch Hand
Posts: 142


So how many bytes is a Chinese character?



That depends on the encoding you use. UTF-8 uses 3 bytes to encode a Chinese character from the Basic Multilingual Plane; UTF-16 uses two bytes. String.getBytes() uses the platform's default charset, which may well be some variant of ISO-8859, that is, an encoding that uses one byte per character. This will mess up your String. Try out

Depending on the platform's default charset, this is pretty likely NOT to reproduce your original Chinese characters.
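
The snippet from this reply was lost, so here is a sketch of the same idea rather than the original code. It uses ISO-8859-1 explicitly as a stand-in for a one-byte platform default; that choice, the class name and the sample string are assumptions for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class RoundTrip {

    // Encodes the String to bytes and decodes those bytes again with the same charset.
    static String roundTrip(String s, Charset charset) {
        byte[] bytes = s.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String chinese = "\u4e2d\u570b"; // two CJK characters from the BMP

        // ISO-8859-1 has no mapping for these characters, so getBytes substitutes
        // the charset's replacement byte ('?') for each of them.
        System.out.println(roundTrip(chinese, StandardCharsets.ISO_8859_1));             // prints ??

        // UTF-8 can represent every Unicode character, so nothing is lost.
        System.out.println(roundTrip(chinese, StandardCharsets.UTF_8).equals(chinese));  // prints true
    }
}
```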
 
vanlalhmangaiha khiangte
Ranch Hand
Posts: 170
Yes, I tried what you asked.

Depending on the platform's default charset, this is pretty likely NOT to reproduce your original Chinese characters.



How do I print the Chinese characters then?
Can you show me how to print them? I also tried printing to a file, but without success; only ??? comes out.


Using the specific character sets UTF-8 and UTF-16, I did a small comparison:


The above gives true.

My encoding is UTF-8

Using the default character set:


The above gives false.


getBytes()
Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.

String(byte[] bytes)
Constructs a new String by decoding the specified array of bytes using the platform's default charset.



Why does it give false here?
It gets the bytes using the platform's default charset and builds the String using the platform's default charset.
Shouldn't both Strings be the same?


Also, when I change the encoding to Cp1252,
the same code gives true.


This gives true

So for UTF-8 it shows false, but for Cp1252 it shows true. Why this difference?

Thanks ...
 
Guido Sautter
Ranch Hand
Posts: 142
Your test with the special characters is successful because the characters you use are in the system's default code page ... but Chinese characters are not. And therefore, they are messed up when converted to bytes using the system's default code page ...
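
Whether a given charset can represent a String can be checked up front with `CharsetEncoder.canEncode`. This is a sketch only: ISO-8859-1 stands in for "the system's default code page", and the class name and sample strings are assumptions for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CodePageCheck {

    // True if every character of the String has a mapping in the given charset.
    static boolean canEncode(String s, Charset charset) {
        return charset.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        String accented = "caf\u00e9";   // 'e' with acute accent, part of ISO-8859-1
        String chinese = "\u4e2d\u570b"; // CJK characters, not part of ISO-8859-1

        System.out.println(canEncode(accented, StandardCharsets.ISO_8859_1)); // prints true
        System.out.println(canEncode(chinese, StandardCharsets.ISO_8859_1));  // prints false
        System.out.println(canEncode(chinese, StandardCharsets.UTF_8));       // prints true
    }
}
```

When `canEncode` returns false, a `getBytes` and `new String` round trip through that charset cannot give back the original text.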

I recommend you do some Googling and reading on code pages, character encodings, and the like. This will explain in a lot more detail what's behind the behavior we're discussing here.
 
Greenhorn
Posts: 1
String s1 ="Инqaтернет-портал Русской службы Би-Би-Си 中國操控匯率 美應立法";
System.out.println("s1: " +s1);

I am getting ??? as output. What should I do?

thanks
amee
 
Sheriff
Posts: 22796
Don't use System.out. The console (from Windows or Linux) usually cannot handle Unicode properly, and any character it cannot handle will be printed as ?.
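
One common workaround is to wrap the output stream in a writer with an explicit charset, so the bytes that leave the program are well-defined UTF-8. The sketch below makes no promise about what a particular console displays; that still depends on the terminal's own encoding and font. The class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;

public class Utf8Printing {

    // Writes the String through a UTF-8 writer and returns the bytes that were produced.
    static byte[] printAsUtf8(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintWriter out = new PrintWriter(new OutputStreamWriter(buffer, StandardCharsets.UTF_8), true);
        out.print(s);
        out.flush();
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u570b"; // two CJK characters from the BMP

        // Same idea applied to the console: the bytes are UTF-8, but whether the
        // glyphs appear depends on the terminal.
        PrintWriter console = new PrintWriter(new OutputStreamWriter(System.out, StandardCharsets.UTF_8), true);
        console.println(text);

        System.out.println(printAsUtf8(text).length); // prints 6
    }
}
```

The same `OutputStreamWriter` approach applies to files: wrap a `FileOutputStream` instead of `System.out` and open the file in an editor that reads UTF-8.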
 
Gamini Sirisena
Ranch Hand
Posts: 378
I hope it's OK to continue this thread.

I am seeing the characters sent by amee as follows.



My question is: what is the font that is displayed as boxes? It seems that each box shows the Unicode code point of the character that should be displayed. Any idea what this font is?
 
Marshal
Posts: 79637
The left half is Cyrillic, as written in Russia and parts of Eastern Europe.

The right half suggests you are getting individual char values; some Chinese, Japanese and Korean (CJK) characters, namely those outside the Basic Multilingual Plane, require two "char"s to make up their "code point." Don't know what to do about it. Sorry.
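
For reference, a small sketch of how Java counts `char` values versus code points. The class name is made up, and the two sample characters are assumptions chosen to show one CJK character inside the Basic Multilingual Plane and one outside it.

```java
public class CharsVersusCodePoints {

    // Number of char values (UTF-16 code units) in the String.
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points in the String.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String bmp = "\u4e2d";                                     // U+4E2D, inside the BMP
        String supplementary = new String(Character.toChars(0x20000)); // U+20000, outside the BMP

        System.out.println(charCount(bmp) + " " + codePointCount(bmp));                     // prints 1 1
        System.out.println(charCount(supplementary) + " " + codePointCount(supplementary)); // prints 2 1
    }
}
```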
 
Marshal
Posts: 28288

Originally posted by Gamini Sirisena:
Any idea about what this font is?

When you say you are "seeing" the characters I assume you must be using some piece of software to project them onto your eyeballs. I see that kind of output in Firefox when the font it's using is incapable of rendering a character. Instead of just displaying an empty rectangular box as a fallback, it displays a rectangular box with the Unicode value of the character inside.
 
Gamini Sirisena
Ranch Hand
Posts: 378
Yes, it's Firefox. OK, so it's not a font but a Firefox-specific way of handling characters that it cannot render. It's nice.
Thanks for clearing this up.