Java language and Unicode

Fisher Daniel
Ranch Hand

Joined: Sep 14, 2001
Posts: 582
Dear all,
This is about the Unicode character set used in Java programs.
I am reading The Java Language Specification and have summarized the content of section 3.1, Unicode.

According to the Java Language Specification, the Java programming language uses the UTF-16 encoding.
UTF-16 is one of the Unicode standard's encodings and can represent the complete range of Unicode characters.
Characters whose code points are greater than U+FFFF are called supplementary characters and those are represented as pairs of 16-bit code units, the first from the high-surrogates range (U+D800 to U+DBFF), the second from the low-surrogates range (U+DC00 to U+DFFF). For characters in the range U+0000 to U+FFFF, the values of code points and UTF-16 code units are the same.

I am confused about 'pairs of 16-bit code units'. What are they? Could you give me some examples, please?
What is the difference between 'code points' and 'code units'? Do they mean the same thing?

Correct my understanding if I am wrong...

thanks
daniel
Damanjit Kaur
Ranch Hand

Joined: Oct 18, 2004
Posts: 346
UTF-16 encodes characters as 16-bit values. Every Unicode character is assigned a number, and that number is its code point. Code points are conventionally written in hexadecimal, such as U+0000 or U+0041 (you can refer to the Unicode code charts for all characters and their code points). For characters in the range U+0000 to U+FFFF, the 16-bit value that UTF-16 stores is exactly the code point.
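
To make the U+0000 to U+FFFF case concrete, here is a minimal sketch. The class and method names (BmpCodePointDemo, fitsInOneCodeUnit) are made up for illustration; Character.charCount and String.codePointAt are standard library methods. It prints each 16-bit char of a string next to the code point at that position, and for these characters the two values match.

```java
class BmpCodePointDemo {
    // Hypothetical helper: true when the code point is encoded
    // by a single 16-bit char in UTF-16.
    static boolean fitsInOneCodeUnit(int codePoint) {
        return Character.charCount(codePoint) == 1;
    }

    public static void main(String[] args) {
        // Letter A (U+0041) followed by the euro sign (U+20AC).
        String s = "A\u20AC";
        for (int i = 0; i < s.length(); i++) {
            char unit = s.charAt(i);          // the stored 16-bit value
            int codePoint = s.codePointAt(i); // the character's number
            System.out.printf("unit=0x%04X codePoint=U+%04X%n", (int) unit, codePoint);
        }
        // prints:
        // unit=0x0041 codePoint=U+0041
        // unit=0x20AC codePoint=U+20AC
    }
}
```

Both characters are below U+FFFF, so each one takes one char and the numbers are identical.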

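A single 16-bit value can take only 65536 different values, but Unicode code points go up to U+10FFFF. Each 16-bit value in UTF-16 is called a code unit. A character whose code point is greater than U+FFFF does not fit in one code unit, so UTF-16 uses a pair of 16-bit code units (a surrogate pair) to represent it. So the code point is the number that identifies the character, and the code units are the 16-bit pieces that encode it: one code unit for characters up to U+FFFF, two for supplementary characters.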
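
Here is an example of such a pair, as a rough sketch. SurrogatePairDemo and toSurrogates are hypothetical names; the arithmetic is the standard UTF-16 rule (subtract 0x10000, put the top 10 bits in the high surrogate and the bottom 10 bits in the low surrogate). In real code Character.toChars does this for you.

```java
class SurrogatePairDemo {
    // Hypothetical helper: manual UTF-16 encoding of a supplementary
    // code point (one greater than U+FFFF).
    static char[] toSurrogates(int codePoint) {
        int offset = codePoint - 0x10000;               // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        int clef = 0x1D11E; // MUSICAL SYMBOL G CLEF, a supplementary character
        char[] manual = toSurrogates(clef);
        String s = new String(Character.toChars(clef)); // library equivalent

        System.out.printf("U+%X -> 0x%04X 0x%04X%n", clef, (int) manual[0], (int) manual[1]);
        // prints: U+1D11E -> 0xD834 0xDD1E
        System.out.println(s.length());                      // 2 code units
        System.out.println(s.codePointCount(0, s.length())); // 1 code point
    }
}
```

Note that String.length() counts code units, not characters, so this one-character string reports a length of 2.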