Bytes displaying for chinese characters

vanlalhmangaiha khiangte
Ranch Hand

Joined: Sep 11, 2006
Posts: 169
Encoding is UTF-8


When I run this, the output is 9. The Greek letters come out properly, but the number of bytes looks wrong: Greek letters are two bytes each.

Encoding is CP1252



When I run this, the output is 18. Here the String is not readable, but the number of bytes is correct.

I have read that
Every Chinese Character is represented by a two byte code.

from this web page

Encoding is UTF-8


When I run this, the output is 4.

Shouldn't the output be 8, not 4, since each Chinese character is 2 bytes?
Why is the output coming out as 4?
I expected that with the correct encoding, the number of bytes would be correct.

Thanks ..
Gamini Sirisena
Ranch Hand

Joined: Aug 05, 2008
Posts: 347
Sorry I couldn't help you out with this. I tried your code and get the same results.

Have you been able to figure out what's happening?
Norm Radder
Ranch Hand

Joined: Aug 10, 2005
Posts: 685
getBytes()
Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.

Have you tried giving the charset?
public byte[] getBytes(Charset charset)
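
As a rough illustration of passing the charset explicitly, the sketch below counts the bytes of one string under several encodings. The class name, the helper byteCount and the six-character sample string are made up for this example and are not the original poster's code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ByteCountDemo {
    // Number of bytes the text occupies when encoded with the given charset.
    static int byteCount(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // Hypothetical sample: six Chinese characters, written as Unicode escapes
        // so the encoding of the source file cannot affect the result.
        String chinese = "\u4e2d\u570b\u64cd\u63a7\u532f\u7387";

        // 18: three bytes per character
        System.out.println(byteCount(chinese, StandardCharsets.UTF_8));
        // 14: two bytes per character plus a 2-byte byte-order mark
        System.out.println(byteCount(chinese, StandardCharsets.UTF_16));
        // 12: two bytes per character, no byte-order mark
        System.out.println(byteCount(chinese, StandardCharsets.UTF_16BE));
        // 6: each unmappable character is replaced by a single '?' byte
        System.out.println(byteCount(chinese, StandardCharsets.ISO_8859_1));
    }
}
```

Because every call names its charset, the results should not change with the platform default, unlike the no-argument getBytes().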
vanlalhmangaiha khiangte
Ranch Hand

Joined: Sep 11, 2006
Posts: 169
Yes,
I tried it.


The output is 6, 18 and 14.


For UTF-8:
1. Two bytes are needed for Latin letters with diacritics and for characters from Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac and Thaana alphabets (Unicode range U+0080 to U+07FF).
2. Three bytes are needed for the rest of the Basic Multilingual Plane (which contains virtually all characters in common use).
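
The ranges quoted above can be checked with a small sketch like this one. The class name, the helper utf8Length and the sample characters are chosen only for illustration. The last sample lies outside the Basic Multilingual Plane, a case the quoted rules do not cover; UTF-8 uses four bytes there.

```java
import java.nio.charset.StandardCharsets;

class Utf8LengthDemo {
    // Number of bytes the text occupies in UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));            // 1: ASCII, U+0041
        System.out.println(utf8Length("\u00e9"));       // 2: Latin letter with diacritic, U+00E9
        System.out.println(utf8Length("\u03b1"));       // 2: Greek alpha, U+03B1
        System.out.println(utf8Length("\u4e2d"));       // 3: Chinese character, U+4E2D
        System.out.println(utf8Length("\ud840\udc00")); // 4: U+20000, outside the BMP
    }
}
```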


According to this


Which gives the output 18, which is correct.

So why is it that each Chinese character is said to be two bytes, but when we call getBytes on the String each character takes 3 bytes? This is really confusing to me.
I was reading all the UTF-8, UTF-16 and ISO 8859-1 material, but it is confusing me more and more.

So how many bytes is a Chinese character?

Thanks...
Guido Sautter
Ranch Hand

Joined: Dec 22, 2004
Posts: 142

So how many bytes is a Chinese character?


That depends on the encoding you use. UTF-8 uses 3 bytes to encode a typical Chinese character; UTF-16 uses two bytes. String.getBytes() uses the platform's default charset, which may well be some variant of ISO-8859, that is, an encoding that uses one byte per character. This will mess up your String. Try this out:

Depending on the platform's default charset, this is pretty likely to NOT produce your original Chinese characters.
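
A minimal sketch of that kind of round trip follows, with the charset named explicitly so the outcome does not depend on the machine. The class name, the helper survivesRoundTrip and the sample strings are assumptions for illustration, not the code from the post.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RoundTripDemo {
    // Encodes the text to bytes and decodes it again with the same charset,
    // then reports whether the original text came back unchanged.
    static boolean survivesRoundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        String decoded = new String(bytes, charset);
        return decoded.equals(text);
    }

    public static void main(String[] args) {
        String chinese = "\u4e2d\u6587"; // two Chinese characters
        String latin = "caf\u00e9";      // Latin text with one accented letter

        System.out.println(survivesRoundTrip(chinese, StandardCharsets.UTF_8));      // true
        System.out.println(survivesRoundTrip(chinese, StandardCharsets.ISO_8859_1)); // false: characters became '?'
        System.out.println(survivesRoundTrip(latin, StandardCharsets.ISO_8859_1));   // true
    }
}
```

With a single-byte charset such as ISO-8859-1 the Chinese characters are replaced during encoding, so decoding cannot bring them back.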
vanlalhmangaiha khiangte
Ranch Hand

Joined: Sep 11, 2006
Posts: 169
Yes, I tried what you asked.

Depending on the platform's default charset, this is pretty likely to NOT produce your original Chinese characters.


How do I print the Chinese characters then?
Can you show me how to print them? I tried printing to a file as well, but without success; only ??? comes out.


Using the specific character sets UTF-8 and UTF-16, I did a small comparison:


The above gives true.

My encoding is UTF-8

Using default character set


The above gives false.


getBytes()
Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.

String(byte[] bytes)
Constructs a new String by decoding the specified array of bytes using the platform's default charset.


Why is it giving false here?
It gets the bytes using the platform's default charset and forms the String using the platform's default charset.
Shouldn't both Strings be the same?


Also, when I change the encoding to Cp1252,
the same code gives true.


This gives true

So for UTF-8 it shows false, but for Cp1252 it shows true. Why this difference?

Thanks ...
Guido Sautter
Ranch Hand

Joined: Dec 22, 2004
Posts: 142
Your test with the special characters is successful because the characters you use are in the system's default code page ... but Chinese characters are not. And therefore, they are messed up when converted to bytes using the system's default code page ...
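
One way to see in advance whether a charset can hold a given string is to ask its encoder. The sketch below is illustrative only: the class name, the helper canRepresent and the sample strings are assumptions, and "windows-1252" is the canonical Java name for the Cp1252 code page mentioned earlier in the thread.

```java
import java.nio.charset.Charset;

class CanEncodeDemo {
    // True if every character of the text can be encoded in the named charset.
    static boolean canRepresent(String text, String charsetName) {
        return Charset.forName(charsetName).newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String accented = "\u00e9\u00fc\u00f1"; // accented Latin letters
        String chinese = "\u4e2d\u6587";        // two Chinese characters

        System.out.println(canRepresent(accented, "windows-1252")); // true
        System.out.println(canRepresent(chinese, "windows-1252"));  // false
        System.out.println(canRepresent(chinese, "UTF-8"));         // true
    }
}
```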

I recommend you do some Googling and reading on code pages, character encodings, and the like. This will explain in a lot more detail what's behind the behavior we're discussing here.
amee shah
Greenhorn

Joined: Nov 20, 2008
Posts: 1
String s1 ="Инqaтернет-портал Русской службы Би-Би-Си 中國操控匯率 美應立法";
System.out.println("s1: " +s1);

I am getting ???. What should I do?

thanks
amee
Rob Spoor
Sheriff

Joined: Oct 27, 2005
Posts: 19538
    

Don't use System.out. The console (from Windows or Linux) usually cannot handle Unicode properly, and any character it cannot handle will be printed as ?.
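
One common alternative, sketched here under assumptions, is to write the text to a file with an explicit UTF-8 encoding and open that file in an editor that understands UTF-8. The class name, the helper writeAndReadBack and the temporary file are hypothetical; the demo reads the file back only to confirm that nothing was lost, and deletes it afterwards.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class Utf8FileDemo {
    // Writes the text to a temporary file as UTF-8, reads it back as UTF-8,
    // and returns what was read. The temporary file is removed afterwards.
    static String writeAndReadBack(String text) {
        try {
            Path file = Files.createTempFile("unicode-demo", ".txt");
            try {
                Files.write(file, text.getBytes(StandardCharsets.UTF_8));
                return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Cyrillic and Chinese sample text, written as escapes.
        String text = "\u0420\u0443\u0441\u0441\u043a\u043e\u0439 \u4e2d\u570b";
        // Only an ASCII boolean goes to the console, so the console encoding does not matter.
        System.out.println(text.equals(writeAndReadBack(text))); // true
    }
}
```

In real use you would write to a path of your own choosing instead of a temporary file.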


Gamini Sirisena
Ranch Hand

Joined: Aug 05, 2008
Posts: 347
I hope it's OK to continue this thread.

I am seeing the characters sent by amee as follows.



My question is: what is the font that is being displayed as boxes? It seems that each box shows the Unicode code point of the character that should be displayed. Any idea what this font is?
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 36453
    
The left half is Cyrillic, as written in Russia and parts of Eastern Europe.

The right half suggests you are getting individual char values; some Chinese, Japanese and Korean (CJK) characters, namely those outside the Basic Multilingual Plane, require two "char"s to make up their "code point", although the common ones fit in a single char. Don't know what to do about it. Sorry.
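
A quick way to check how many "char"s a given character needs is to compare String.length() with codePointCount(). In this sketch (class name, helper names and sample strings are illustrative), two common Chinese characters take one char each, while U+20000 from a CJK extension block takes two.

```java
class CodePointDemo {
    // Number of 16-bit char values in the text.
    static int charCount(String text) {
        return text.length();
    }

    // Number of Unicode code points in the text.
    static int codePointCount(String text) {
        return text.codePointCount(0, text.length());
    }

    public static void main(String[] args) {
        String common = "\u4e2d\u570b"; // two common Chinese characters, both in the BMP
        String rare = "\ud840\udc00";   // U+20000, stored as a surrogate pair

        System.out.println(charCount(common) + " chars, " + codePointCount(common) + " code points"); // 2 chars, 2 code points
        System.out.println(charCount(rare) + " chars, " + codePointCount(rare) + " code points");     // 2 chars, 1 code points
    }
}
```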
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18113
    

Originally posted by Gamini Sirisena:
Any idea about what this font is?
When you say you are "seeing" the characters I assume you must be using some piece of software to project them onto your eyeballs. I see that kind of output in Firefox when the font it's using is incapable of rendering a character. Instead of just displaying an empty rectangular box as a fallback, it displays a rectangular box with the Unicode value of the character inside.
Gamini Sirisena
Ranch Hand

Joined: Aug 05, 2008
Posts: 347
Yes, it's Firefox. OK, so it's not a font but a Firefox-specific way of handling characters that it cannot render. It's nice.
Thanks for clearing this up.
 