What is a Unicode code unit and a Unicode code point?
In the Java SE API documentation, Unicode code point is used for character values in the range between U+0000 and U+10FFFF, and Unicode code unit is used for 16-bit char values that are code units of the UTF-16 encoding.
The above is from the API specification for the class Character. In this description, is "Unicode code point" used to indicate characters like "A", "B", "C"?
"Unicode code unit" is used to indicate 16-bit char values. Does that also mean characters like "A", "B", "C"? A char value also denotes a character, doesn't it?
I was led to the Character class API documentation by the description of length() in the String class. I want to understand what a Unicode code unit is. The length returned by length() is equal to the number of code units. What is a code unit? Is it a character?
length
public int length()
Returns the length of this string. The length is equal to the number of Unicode code units in the string.
As far as Java was concerned, a "char" was a Unicode character (or code point) up to Java 1.4. But Java 5 adopted Unicode 4.0, which includes supplementary characters beyond U+FFFF, so one Java char is no longer necessarily the same as a Unicode code point. Luckily, the code points that take up more than one char are rarely used, but you still need to be aware of them. In particular, String.length() may not return the number of characters (code points) in a string. John O'Conner and Tom White blogged about this.
A Unicode code unit is the fixed-size group of bits that a particular Unicode encoding uses as its basic unit. For example, a UTF-8 code unit is 8 bits, a UTF-16 code unit is 16 bits, and a UTF-32 code unit is 32 bits. To represent a character (i.e. a code point, which is a unique integer assigned to each character), one or more code units may be required, depending on the encoding.
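To make the "one or more code units per code point" idea concrete, here is a small sketch that counts code units for the same character in the three encodings. The class and method names are just made up for illustration, and it assumes well-formed strings (no unpaired surrogates) and Java 7 or later for StandardCharsets.

```java
import java.nio.charset.StandardCharsets;

class CodeUnitCounts {
    // UTF-8 code units are 8 bits, so the byte count is the code unit count.
    static int utf8Units(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Java strings are UTF-16 internally, so length() is the UTF-16 code unit count.
    static int utf16Units(String s) {
        return s.length();
    }

    // UTF-32 uses exactly one 32-bit code unit per code point.
    static int utf32Units(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String[] samples = {
            "A",                                     // U+0041
            "\u00E9",                                // U+00E9, e with acute accent
            "\u20AC",                                // U+20AC, euro sign
            new String(Character.toChars(0x1D11E))   // U+1D11E, musical G clef (outside the BMP)
        };
        for (String s : samples) {
            System.out.printf("U+%04X: UTF-8=%d, UTF-16=%d, UTF-32=%d%n",
                    s.codePointAt(0), utf8Units(s), utf16Units(s), utf32Units(s));
        }
    }
}
```

For these four samples the UTF-8 counts should come out as 1, 2, 3 and 4, while UTF-16 needs 1, 1, 1 and 2, and UTF-32 always needs 1.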
Java uses UTF-16, which means the code unit size is 16 bits. Unicode has over 1 million code points (0x10FFFF + 1 = 1,114,112). 16 bits can represent only 0xFFFF + 1 = 65,536 of them. This range is called the BMP (Basic Multilingual Plane). It contains nearly all the commonly used characters in the world, and some more.
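The arithmetic can be checked against the constants in java.lang.Character. This is only a quick sketch; the class and method names are hypothetical, the constants are the real ones from the JDK.

```java
class CodePointRanges {
    // All code points from U+0000 to U+10FFFF inclusive.
    static int totalCodePoints() {
        return Character.MAX_CODE_POINT + 1;
    }

    // Everything a single 16-bit char can hold: U+0000 to U+FFFF (the BMP).
    static int bmpCodePoints() {
        return Character.MAX_VALUE + 1; // the char is promoted to int before the addition
    }

    // Code points that do not fit into one char.
    static int supplementaryCodePoints() {
        return totalCodePoints() - bmpCodePoints();
    }

    public static void main(String[] args) {
        System.out.println("total:         " + totalCodePoints());         // 1114112
        System.out.println("BMP:           " + bmpCodePoints());           // 65536
        System.out.println("supplementary: " + supplementaryCodePoints()); // 1048576
    }
}
```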
So, to represent code points outside the BMP, the UTF-16 encoding specifies surrogate pairs. For this, two special ranges are defined within the BMP: high surrogates (U+D800 to U+DBFF) and low surrogates (U+DC00 to U+DFFF). In UTF-16, any character outside the BMP is represented by two 16-bit code units, a high surrogate followed by a low surrogate. (In fact, surrogates are used only by UTF-16.) Now it should be clear that certain characters require two code units in UTF-16.
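If it helps, the surrogate pair calculation can be written out by hand. This is a sketch of the standard UTF-16 arithmetic, not the JDK's own source; the class and method names are invented for the example. In real code you would simply call Character.toChars() and Character.toCodePoint().

```java
class SurrogatePairs {
    // Split a supplementary code point (U+10000..U+10FFFF) into a high and a low surrogate.
    static char[] encode(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point: " + codePoint);
        }
        int offset = codePoint - 0x10000;               // a 20-bit value
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] {high, low};
    }

    // Combine a high and a low surrogate back into the code point.
    static int decode(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        char[] pair = encode(0x1D11E); // musical G clef
        System.out.printf("U+1D11E -> %04X %04X%n", (int) pair[0], (int) pair[1]); // D834 DD1E
        System.out.printf("decoded -> U+%X%n", decode(pair[0], pair[1]));          // U+1D11E
    }
}
```

Each surrogate carries 10 bits, which is why the two ranges together cover exactly the 1,048,576 code points above U+FFFF.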
So counting 16-bit code units will not always yield the correct number of characters. String.length() returns the number of code units in the String, which is larger than the number of code points whenever the String contains a surrogate pair.
Since Java 1.5 you can use String.codePointCount(int beginIndex, int endIndex) to count the characters (code points) in a string. It counts a surrogate pair as one character.
[ November 27, 2008: Message edited by: Gamini Sirisena ]
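A short sketch comparing length() with codePointCount(), plus one way to walk a string code point by code point. The class name, helper names and sample string are my own; the String and Character methods are the real Java 5 ones.

```java
class LengthVsCodePoints {
    // Number of UTF-16 code units (what String.length() reports).
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of code points; a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // "A", then U+1D11E (outside the BMP, two chars), then "B"
        String s = "A" + new String(Character.toChars(0x1D11E)) + "B";

        System.out.println("length():         " + codeUnits(s));  // 4
        System.out.println("codePointCount(): " + codePoints(s)); // 3

        // Step through the string by code point rather than by char.
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            System.out.printf("index %d: U+%04X%n", i, cp);
            i += Character.charCount(cp); // 1 for BMP code points, 2 for supplementary ones
        }
    }
}
```

For plain BMP text such as "ABC" the two counts are the same, which is why the difference is easy to miss.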