
What is a Unicode code unit and a Unicode code point?

Varuna Seneviratna
Ranch Hand

Joined: Jan 15, 2007
Posts: 167
In the Java SE API documentation, Unicode code point is used for character values in the range between U+0000 and U+10FFFF, and Unicode code unit is used for 16-bit char values that are code units of the UTF-16 encoding


The above is from the API specification for the Character class. In this description, is "Unicode code point" used to indicate characters like "A", "B", "C"?

"Unicode code unit" is used to indicate 16-bit char values. Does that also mean characters like "A", "B", "C"? A char value also denotes a character, doesn't it?

I was led to the Character class API documentation by the description of length() in the String class. I want to understand what a Unicode code unit is. The length returned by length() is equal to the number of code units, so what is a code unit? Is it a character?

length
public int length()
Returns the length of this string. The length is equal to the number of Unicode code units in the string.

Specified by:
length in interface CharSequence
Returns:


Varuna


Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42612
You might want to start by reading "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" for some background information.

Up to Java 1.4, a Java "char" was effectively the same thing as a Unicode character (or code point). Java 5 then adopted a newer version of Unicode (4.0), which includes code points beyond U+FFFF, and now one Java char is no longer necessarily the same as a Unicode code point. Luckily, the code points that take up more than one char are rarely used, but you still need to be aware of them. In particular, String.length() may not return the number of characters in a string, because it counts char values rather than code points. John O'Conner and Tom White blogged about this.


Gamini Sirisena
Ranch Hand

Joined: Aug 05, 2008
Posts: 375
Allow me to add a few more inputs..

A Unicode code unit is the fixed-size group of bits that a particular Unicode encoding uses as its basic building block.
For example, UTF-8 has a code unit size of 8 bits, UTF-16 has 16,
and UTF-32 has 32.
To represent a character (i.e. a code point, which is a unique integer assigned
to each character), one or more code units may be
required, depending on the encoding.
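
To make the "one or more code units" point concrete, here is a small sketch that counts the code units the same character needs in each encoding. The class and method names are made up for this illustration. The UTF-32 count is simply taken as the number of code points, since UTF-32 always uses one code unit per code point.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, written only for this illustration.
class CodeUnitCounts {
    // Number of 8-bit code units (bytes) needed to encode the text in UTF-8.
    static int utf8Units(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of 16-bit code units in UTF-16; this is what String.length() reports.
    static int utf16Units(String s) {
        return s.length();
    }

    // Number of 32-bit code units in UTF-32: always one per code point.
    static int utf32Units(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String[] samples = {
            "A",            // U+0041 LATIN CAPITAL LETTER A
            "\u20AC",       // U+20AC EURO SIGN
            "\uD834\uDD1E"  // U+1D11E MUSICAL SYMBOL G CLEF (outside the BMP)
        };
        for (String s : samples) {
            System.out.printf("U+%04X: UTF-8=%d, UTF-16=%d, UTF-32=%d%n",
                    s.codePointAt(0), utf8Units(s), utf16Units(s), utf32Units(s));
        }
    }
}
```

Each sample is a single code point, yet the G clef takes four code units in UTF-8 and two in UTF-16.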

Java uses UTF-16, which means the code unit size is 16 bits.
Unicode has over 1 million code points (10FFFF+1 in hex, i.e. 1,114,112).
16 bits can represent only FFFF+1 (65,536) code points.
(This range is called the BMP, the Basic Multilingual Plane.
It contains all the commonly used characters in the world and some more.)
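
The arithmetic above can be checked against the constants in java.lang.Character. This is only a sketch; the class and method names are invented for the example, and Character.isBmpCodePoint needs Java 7 or later.

```java
// Hypothetical helper class, written only for this illustration.
class CodePointRange {
    // Total number of Unicode code points: U+0000 through U+10FFFF.
    static int totalCodePoints() {
        return Character.MAX_CODE_POINT + 1;
    }

    // Number of distinct values a single 16-bit char can hold.
    static int sixteenBitValues() {
        return Character.MAX_VALUE + 1;
    }

    // True if the code point lies in the BMP and so fits in one char.
    static boolean fitsInOneChar(int codePoint) {
        return Character.isBmpCodePoint(codePoint);
    }

    public static void main(String[] args) {
        System.out.println(totalCodePoints());       // 1114112
        System.out.println(sixteenBitValues());      // 65536
        System.out.println(fitsInOneChar(0x20AC));   // true
        System.out.println(fitsInOneChar(0x1D11E));  // false
    }
}
```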

So to represent code points outside the BMP, the UTF-16 encoding specifies
surrogate pairs. For this, two special ranges are defined within the BMP: high surrogates (U+D800 to U+DBFF) and low surrogates (U+DC00 to U+DFFF).
In UTF-16, any character outside the BMP is represented by two 16-bit code units,
a high surrogate followed by a low surrogate.
(In fact, surrogates are used only by UTF-16.)
Now it should be clear that certain characters may require two code units in UTF-16.
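
For anyone curious how a surrogate pair is actually computed, here is a rough sketch that does the UTF-16 arithmetic by hand and compares it with Character.toChars and Character.toCodePoint from the standard library. The class name and the encode/decode methods are made up for the example; in real code you would just call the library methods.

```java
import java.util.Arrays;

// Hypothetical helper class, written only for this illustration.
class SurrogatePairs {
    // Splits a code point above U+FFFF into a high and a low surrogate.
    static char[] encode(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;               // leaves a 20-bit value
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    // Recombines a surrogate pair into the original code point.
    static int decode(char high, char low) {
        return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
    }

    public static void main(String[] args) {
        int gClef = 0x1D11E; // MUSICAL SYMBOL G CLEF
        char[] pair = encode(gClef);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]); // D834 DD1E
        // The hand-rolled result should agree with the library.
        System.out.println(Arrays.equals(pair, Character.toChars(gClef)));
        System.out.println(decode(pair[0], pair[1]) == Character.toCodePoint(pair[0], pair[1]));
    }
}
```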

So counting 16-bit code units will not always yield the correct number of characters.
String.length() returns the number of code units in the String.

Since 1.5 you can use codePointCount(int beginIndex, int endIndex) to get
the number of characters (code points) in a range of the String.
It counts a surrogate pair as one character.
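
A quick sketch of the difference between the two counts, plus one way to walk a String code point by code point using codePointAt and Character.charCount. The class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, written only for this illustration.
class CharacterCount {
    // Number of 16-bit code units, as reported by String.length().
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of code points; a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Collects the code points, stepping over both halves of a surrogate pair.
    static List<Integer> toCodePoints(String s) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            result.add(cp);
            i += Character.charCount(cp); // 1 for BMP, 2 for supplementary
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "a\uD834\uDD1Eb"; // 'a', U+1D11E (G clef), 'b'
        System.out.println(codeUnits(s));  // 4
        System.out.println(codePoints(s)); // 3
        for (int cp : toCodePoints(s)) {
            System.out.printf("U+%04X%n", cp);
        }
    }
}
```

For text that stays inside the BMP the two counts are the same, so the difference only shows up once a supplementary character appears.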
[ November 27, 2008: Message edited by: Gamini Sirisena ]
 