


Unicode characters

Peter Chase
Ranch Hand

Joined: Oct 30, 2001
Posts: 1970
The Java char type is a Unicode character, and Java Strings are internally made up of chars.

I thought that Java char had a size of 16 bits, but if that's true then either (a) Java cannot represent all Unicode characters, or (b) some/all Java chars are bigger than 16 bits.

I read that Java uses UTF-16 encoding for its chars. But how can that work?

Consider a big array of Java characters, bigText.


Say I now want to find bigText[nIndex]. One would normally expect array indexing to be really fast: the JVM would get a pointer to the start of the array, add nIndex*sizeof(arrayElement) to it and, hey presto, there's a pointer to the right element. But UTF-16 doesn't have fixed-size elements, so how can the JVM avoid iterating from the start of the array?


Betty Rubble? Well, I would go with Betty... but I'd be thinking of Wilma.
Keith Lynn
Ranch Hand

Joined: Feb 07, 2005
Posts: 2367
A char's size is fixed at 2 bytes, so accessing a char array should be no different from accessing any other array of primitives.
Peter Chase
Ranch Hand

Joined: Oct 30, 2001
Posts: 1970
It is simply impossible to represent all Unicode characters in 2 bytes (16 bits). Therefore, if true, what you say implies (a) Java can't represent all Unicode characters, and (b) Java does not use UTF-16, even though unicode.org says it does.

Or did I miss something?
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18541
    

It's actually more complicated than that. As usual for this type of question, John O'Conner has your answer:

http://weblogs.java.net/blog/joconner/archive/2004/04/unicode_40_supp.html
Keith Lynn
Ranch Hand

Joined: Feb 07, 2005
Posts: 2367
This is what I found that Sun says about this.

UTF-16 uses sequences of one or two unsigned 16-bit code units to encode Unicode code points. Values U+0000 to U+FFFF are encoded in one 16-bit unit with the same value. Supplementary characters are encoded in two code units, the first from the high-surrogates range (U+D800 to U+DBFF), the second from the low-surrogates range (U+DC00 to U+DFFF). This may seem similar in concept to multi-byte encodings, but there is an important difference: The values U+D800 to U+DFFF are reserved for use in UTF-16; no characters are assigned to them as code points. This means, software can tell for each individual code unit in a string whether it represents a one-unit character or whether it is the first or second unit of a two-unit character. This is a significant improvement over some traditional multi-byte character encodings, where the byte value 0x41 could mean the letter "A" or be the second byte of a two-byte character.

http://java.sun.com/j2se/corejava/intl/reference/faqs/index.html#utf-16

If I'm reading this correctly, it says that a character that can't be represented in 2 bytes is represented in UTF-16 as two 2-byte code units. Both units come from ranges reserved for this purpose (the first from the high-surrogate range, the second from the low-surrogate range), which tells the String to consider the two units together to determine the character.
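
To make the surrogate mechanism concrete, here is a minimal sketch of the arithmetic the FAQ describes. The class name SurrogateDemo and the helper toSurrogates are made up for illustration; Character.toChars, Character.isHighSurrogate and Character.isLowSurrogate are the standard library calls (Java 5 and later), and the example code point U+1D11E (MUSICAL SYMBOL G CLEF) is my choice, not something from the FAQ.

```java
public class SurrogateDemo {
    // Hypothetical helper: splits a supplementary code point (above U+FFFF)
    // into its UTF-16 high and low surrogate by hand.
    static char[] toSurrogates(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;               // leaves a 20-bit value
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        int clef = 0x1D11E; // MUSICAL SYMBOL G CLEF, outside the 16-bit range
        char[] manual = toSurrogates(clef);
        char[] library = Character.toChars(clef); // the library's own encoding

        System.out.printf("manual:  %04X %04X%n", (int) manual[0], (int) manual[1]);
        System.out.printf("library: %04X %04X%n", (int) library[0], (int) library[1]);

        // Each code unit identifies its own role, as the FAQ says.
        System.out.println(Character.isHighSurrogate(manual[0])); // true
        System.out.println(Character.isLowSurrogate(manual[1]));  // true

        // A character in the 16-bit range still needs only one code unit.
        System.out.println(Character.toChars('A').length);        // 1
    }
}
```

Both lines should print D834 DD1E, so the hand-rolled arithmetic and the library agree for this code point.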
Peter Chase
Ranch Hand

Joined: Oct 30, 2001
Posts: 1970
OK, well I have learnt a certain amount, especially from that blog about Java and Unicode, but no-one has answered my main question.

If you remember, I was asking how Java could perform character index based operations rapidly, now that characters do not have a uniform size.

As Java char no longer represents a character, but instead represents a "code unit" (very often a whole character, but maybe only half a character), presumably the subscript on a char array now refers not to characters but to code units. So myCharArray[234] no longer gives you the 234th character in the array, but instead gives you the 234th code unit in the array. This may or may not be at the 234th character position and may or may not be a whole character. In other words, direct use of char has become pretty pointless, for those writing fully international applications.
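
A small sketch of that point, assuming a test string of my own choosing ("a", then U+1D11E written as a surrogate pair, then "b"). The class name CodeUnitIndexDemo is made up; String.codePointCount and Character.isHighSurrogate are standard methods from Java 5 onwards. It shows that the subscript counts code units, so indexing stays constant-time but can land on half a character.

```java
public class CodeUnitIndexDemo {
    // "a", then U+1D11E (a surrogate pair: D834 DD1E), then "b"
    static final String TEXT = "a\uD834\uDD1Eb";

    public static void main(String[] args) {
        char[] units = TEXT.toCharArray();

        // Three characters as a reader would count them, but four code units.
        System.out.println(units.length);                           // 4
        System.out.println(TEXT.codePointCount(0, TEXT.length()));  // 3

        // Plain array indexing: fast, but units[1] is only the first half
        // of the second character.
        System.out.println(Character.isHighSurrogate(units[1]));    // true

        // The third character, "b", sits at code unit index 3, not 2.
        System.out.println(units[3] == 'b');                        // true
    }
}
```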

Presumably, the latest Java has some way of getting the Nth character from a char array, String or StringBuffer. That operation would be quite complicated to implement (seems to me like it would have to read all characters from the zeroth to the Nth), and would have to return something bigger than a char, probably an int.
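
For what it's worth, Java 5 does appear to offer this on String: offsetByCodePoints converts a count of code points into a code unit index, and codePointAt returns the code point there as an int. The sketch below combines them; the class name NthCodePoint and the helper codePointAtIndex are made up for illustration, and the test string is the same assumption as before. As far as I can tell, finding the offset still means walking the code units from the start, so it is not a constant-time lookup.

```java
public class NthCodePoint {
    // Hypothetical helper: returns the n-th (0-based) code point of s as an int.
    // Throws IndexOutOfBoundsException if n is not a valid code point index.
    static int codePointAtIndex(String s, int n) {
        // Translate "n code points from the start" into a code unit index.
        int unitIndex = s.offsetByCodePoints(0, n);
        // Read the whole code point there, combining a surrogate pair if present.
        return s.codePointAt(unitIndex);
    }

    public static void main(String[] args) {
        String text = "a\uD834\uDD1Eb"; // "a", U+1D11E, "b"
        int count = text.codePointCount(0, text.length());
        for (int i = 0; i < count; i++) {
            System.out.printf("U+%04X%n", codePointAtIndex(text, i));
        }
    }
}
```

The return type is int because a char cannot hold a value above U+FFFF.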
 
 