Java language and Unicode

Ranch Hand

Posts: 582

posted 19 years ago

Dear all,
This is about the Unicode character set used in Java programs.
I am reading The Java Language Specification and have summarized the content of section 3.1, Unicode, below.

According to the Java Language Specification, the Java programming language represents text as sequences of 16-bit code units, using the UTF-16 encoding.
UTF-16 is the encoding the Unicode standard defines to represent the complete range of characters using only 16-bit units.
Characters whose code points are greater than U+FFFF are called supplementary characters, and they are represented as pairs of 16-bit code units, the first from the high-surrogates range (U+D800 to U+DBFF), the second from the low-surrogates range (U+DC00 to U+DFFF). For characters in the range U+0000 to U+FFFF, the values of code points and UTF-16 code units are the same.

I am confused about 'pairs of 16-bit code units'. What are they? Could you give me some examples, please?
What is the difference between 'code points' and 'code units'? Do they mean the same thing?

Correct my understanding if I am wrong...

thanks
daniel

Damanjit Kaur

Ranch Hand

Posts: 346

posted 19 years ago

UTF-16 encodes characters, symbols and so on as 16-bit values. Every character in Unicode is assigned a number, normally written in hexadecimal (you can refer to the Unicode code charts for all characters and their numbers). The first character is U+0000, and the last one that fits in 16 bits is U+FFFF. This number is the code point.
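
To make the code point idea concrete, here is a small sketch. The class name CodePointBasics and its two helper methods are names I made up for illustration; codePointAt, charAt and length are the standard java.lang.String methods. It shows that for a character at or below U+FFFF, the code point and the single 16-bit char value are the same number.

```java
class CodePointBasics {
    // Returns the code point of the first character in s.
    static int firstCodePoint(String s) {
        return s.codePointAt(0);
    }

    // Formats a code point in the usual U+XXXX notation (hypothetical helper).
    static String toUPlus(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        String s = "A";
        System.out.println(toUPlus(firstCodePoint(s)));              // U+0041
        System.out.println(s.length());                              // 1 (one 16-bit char)
        System.out.println((int) s.charAt(0) == firstCodePoint(s));  // true
    }
}
```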

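A 16-bit value can take 65,536 different values, so a character whose code point is above U+FFFF cannot be represented by a single 16-bit value and needs more bits. In that case UTF-16 uses a pair of 16-bit values (a surrogate pair) to represent the character. Each individual 16-bit value is called a code unit. So a code point is the number assigned to a character, while a code unit is one 16-bit piece of its UTF-16 encoding: characters up to U+FFFF take one code unit, and supplementary characters take two.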
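
Since an example was asked for, here is a rough sketch using U+1D11E (MUSICAL SYMBOL G CLEF), a character above U+FFFF. The class SurrogatePairDemo and the method toSurrogates are hypothetical names for illustration only; Character.toChars and String.codePointCount are standard library calls. The hand-written split follows the usual UTF-16 arithmetic: subtract 0x10000, then put the top 10 bits into the high surrogate and the bottom 10 bits into the low surrogate.

```java
class SurrogatePairDemo {
    // Splits a supplementary code point (above U+FFFF) into its two
    // UTF-16 code units by hand. Assumes the input is in U+10000..U+10FFFF.
    static char[] toSurrogates(int codePoint) {
        int offset = codePoint - 0x10000;                // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));    // top 10 bits
        char low  = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        int clef = 0x1D11E; // one code point
        String s = new String(Character.toChars(clef));

        System.out.println(s.length());                       // 2 (code units)
        System.out.println(s.codePointCount(0, s.length()));  // 1 (code point)

        char[] pair = toSurrogates(clef);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]); // D834 DD1E
    }
}
```

So the string holds one character (one code point) but two char values (two code units), which is exactly the 'pair of 16-bit code units' from the specification.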