
What is a Unicode code unit and a Unicode code point?

 
Ranch Hand
Posts: 213

In the Java SE API documentation, Unicode code point is used for character values in the range between U+0000 and U+10FFFF, and Unicode code unit is used for 16-bit char values that are code units of the UTF-16 encoding.



The above is from the API specification of the Character class. In this description, is "Unicode code point" used to mean characters like "A", "B", "C"?

"Unicode code unit" is used for 16-bit char values. Does that also mean characters like "A", "B", "C"? A char value also denotes a character, doesn't it?

I was led to the Character class API documentation by the description of length() in the String class. I want to understand what a Unicode code unit is. The length returned by length() is equal to the number of code units, so what is a code unit? Is it a character?

length
public int length()
Returns the length of this string. The length is equal to the number of Unicode code units in the string.

Specified by:
length in interface CharSequence
Returns:



Varuna
 
Rancher
Posts: 43081
You might want to start by reading The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) for some background information.

As far as Java was concerned, a "char" was a Unicode character (or code point) up to Java 1.4. But then Java 5 moved to a newer version of Unicode that includes characters beyond U+FFFF, and now one Java char is no longer necessarily the same as a Unicode code point. Luckily, the code points that take up more than one char are rarely used, but you still need to be aware of them. In particular, String.length() may not return the number of characters in a string, because it counts chars (code units), not code points. John O'Conner and Tom White blogged about this.
 
Ranch Hand
Posts: 378
Allow me to add a few more points.

A Unicode code unit is the fixed-size group of bits that a particular Unicode encoding works in.
For example, UTF-8 has a code unit size of 8 bits, UTF-16 has 16,
and UTF-32 has 32.
To represent a character (i.e. a code point, which is a unique integer assigned
to each character), one or more code units may be
required, depending on the encoding.
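
To make the "one or more code units" point concrete, here is a small sketch that encodes three sample characters and counts the code units each encoding needs. The class name, the helper method and the sample characters are my own choices for illustration, and it assumes your JVM provides the "UTF-32BE" charset (it is not one of the charsets every Java platform is required to support, though common JDKs ship it).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CodeUnitCount {
    // Assumption: the runtime ships a UTF-32BE charset (common JDKs do).
    static final Charset UTF_32BE = Charset.forName("UTF-32BE");

    // Encodes the text and divides the byte count by the code unit size in bytes.
    static int codeUnits(String text, Charset charset, int bytesPerUnit) {
        return text.getBytes(charset).length / bytesPerUnit;
    }

    public static void main(String[] args) {
        String[] samples = {
            "A",                                    // U+0041
            "\u20AC",                               // U+20AC EURO SIGN
            new String(Character.toChars(0x1D11E))  // U+1D11E MUSICAL SYMBOL G CLEF
        };
        for (String s : samples) {
            // The big-endian charsets are used because they do not prepend a byte order mark.
            System.out.printf("U+%04X  UTF-8: %d  UTF-16: %d  UTF-32: %d%n",
                    s.codePointAt(0),
                    codeUnits(s, StandardCharsets.UTF_8, 1),
                    codeUnits(s, StandardCharsets.UTF_16BE, 2),
                    codeUnits(s, UTF_32BE, 4));
        }
    }
}
```

"A" should need one code unit in every encoding, the euro sign three in UTF-8 but one in UTF-16 and UTF-32, and the G clef four in UTF-8, two in UTF-16 and one in UTF-32.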

Java uses UTF-16, which means the code unit size is 16 bits.
Unicode has over 1 million code points (10FFFF+1 in hex).
16 bits can represent only FFFF+1 code points.
(This range is called the BMP (Basic Multilingual Plane).
It contains all the commonly used characters in the world and some more.)
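
If you want to check those numbers yourself, the constants on java.lang.Character give them directly. This is only a quick sketch; the class and method names are made up for the example.

```java
class CodePointRanges {
    // All code points from U+0000 to U+10FFFF inclusive.
    static long totalCodePoints() {
        return Character.MAX_CODE_POINT + 1L;
    }

    // Code points that fit in one 16-bit code unit (the BMP): U+0000 to U+FFFF.
    static long bmpCodePoints() {
        return Character.MAX_VALUE + 1L;
    }

    public static void main(String[] args) {
        System.out.println("All code points: " + totalCodePoints());  // 0x10FFFF + 1
        System.out.println("BMP code points: " + bmpCodePoints());    // 0xFFFF + 1
        // The first code point outside the BMP is U+10000.
        System.out.println(Character.isSupplementaryCodePoint(0xFFFF));   // false
        System.out.println(Character.isSupplementaryCodePoint(0x10000));  // true
    }
}
```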

So to represent code points outside the BMP, the UTF-16 encoding specifies
surrogate pairs. For this, two special ranges are reserved within the BMP:
high surrogates (U+D800 to U+DBFF) and low surrogates (U+DC00 to U+DFFF).
In UTF-16, any character outside the BMP is represented by two 16-bit code units,
one from each of these ranges.
(In fact, surrogates are defined only for UTF-16.)
Now it should be clear that certain characters may require two code units in UTF-16.
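
Here is a rough sketch of how a code point outside the BMP turns into a surrogate pair. The toSurrogates helper and the class name are hypothetical, written only to show the arithmetic; in real code you would simply call Character.toChars, which the example uses as a cross-check.

```java
import java.util.Arrays;

class SurrogatePairDemo {
    // Splits a supplementary code point into its UTF-16 high and low surrogates.
    static char[] toSurrogates(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not outside the BMP: " + codePoint);
        }
        int offset = codePoint - 0x10000;               // leaves a 20-bit value
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        int clef = 0x1D11E;  // MUSICAL SYMBOL G CLEF, chosen as a sample
        char[] pair = toSurrogates(clef);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]);
        // The library methods should agree with the manual calculation.
        System.out.println(Arrays.equals(pair, Character.toChars(clef)));
        System.out.println(Character.isHighSurrogate(pair[0]));
        System.out.println(Character.isLowSurrogate(pair[1]));
        System.out.println(Character.toCodePoint(pair[0], pair[1]) == clef);
    }
}
```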

So counting 16-bit code units will not always yield the correct number of characters.
String.length() returns the number of code units in the String.

Since 1.5 you can use codePointCount(int beginIndex, int endIndex) to get
the number of characters (code points).
It counts a surrogate pair as one character.
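
A short example of the difference, assuming a test string I made up ("A" followed by U+1D11E); the class and helper names are only for illustration.

```java
class LengthVsCodePoints {
    // Number of code points in the whole string.
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // "A" is one char; U+1D11E needs a surrogate pair, i.e. two chars.
        String s = "A" + new String(Character.toChars(0x1D11E));

        System.out.println(s.length());          // counts code units
        System.out.println(codePointLength(s));  // counts code points

        // Walk the string one code point at a time instead of one char at a time.
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            System.out.printf("U+%04X%n", cp);
            i += Character.charCount(cp);        // advance by 1 or 2 chars
        }
    }
}
```

For this string length() gives 3 while codePointCount gives 2.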
[ November 27, 2008: Message edited by: Gamini Sirisena ]