The Java char type holds a Unicode character, and Java Strings are internally made up of chars.
I thought that Java char had a size of 16 bits, but if that's true then either (a) Java cannot represent all Unicode characters, or (b) some/all Java chars are bigger than 16 bits.
I read that Java uses UTF-16 encoding for its chars. But how can that work?
Consider a big array of Java characters, char[] bigText.
Say I now want to find bigText[nIndex]. One would normally expect array indexing to be really fast: the JVM would get a pointer to the start of the array, add nIndex*sizeof(arrayElement) to it and, hey presto, there's a pointer to the right element. But UTF-16 doesn't have fixed-size elements, so how can the JVM avoid iterating from the start of the array?
A char's size is fixed at 2 bytes, so access should be no different from any other array of primitives.
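A quick sketch confirming the point (class and variable names here are mine, not from the thread): a char is always a fixed 16-bit code unit, so indexing a char[] is plain constant-time arithmetic.

```java
public class CharSize {
    public static void main(String[] args) {
        // Character.SIZE is the number of bits in a char: always 16
        System.out.println(Character.SIZE);  // prints 16

        // Because every element is the same fixed size, indexing a char[]
        // works like indexing any other primitive array
        char[] bigText = new char[] {'f', 'r', 'e', 'd'};
        System.out.println(bigText[2]);      // prints e
    }
}
```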
Joined: Oct 30, 2001
It is simply impossible to represent all Unicode characters in 2 bytes (16 bits) each. Therefore, if that's true, what you say implies either (a) Java can't represent all Unicode characters, or (b) Java does not use UTF-16, even though unicode.org says it does.
UTF-16 uses sequences of one or two unsigned 16-bit code units to encode Unicode code points. Values U+0000 to U+FFFF are encoded in one 16-bit unit with the same value. Supplementary characters are encoded in two code units, the first from the high-surrogates range (U+D800 to U+DBFF), the second from the low-surrogates range (U+DC00 to U+DFFF). This may seem similar in concept to multi-byte encodings, but there is an important difference: The values U+D800 to U+DFFF are reserved for use in UTF-16; no characters are assigned to them as code points. This means, software can tell for each individual code unit in a string whether it represents a one-unit character or whether it is the first or second unit of a two-unit character. This is a significant improvement over some traditional multi-byte character encodings, where the byte value 0x41 could mean the letter "A" or be the second byte of a two-byte character.
If I'm reading this correctly, it says that if a character can't be represented in one 16-bit UTF-16 code unit, it is represented as two 16-bit units, with each unit drawn from a reserved (surrogate) range of values so that software can tell both units must be combined to determine the character.
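To see the surrogate mechanism in action (this example is mine, not from the original posts), Character.toChars splits a supplementary code point into its high/low surrogate pair:

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        int codePoint = 0x1D11E;                  // MUSICAL SYMBOL G CLEF, outside the 16-bit range
        char[] units = Character.toChars(codePoint);

        System.out.println(units.length);                        // 2 : one character, two code units
        System.out.println(Character.isHighSurrogate(units[0])); // true (U+D800..U+DBFF)
        System.out.println(Character.isLowSurrogate(units[1]));  // true (U+DC00..U+DFFF)

        // The pair combines back into the original code point
        System.out.println(Character.toCodePoint(units[0], units[1]) == codePoint); // true
    }
}
```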
OK, well I have learnt a certain amount, especially from that blog about Java and Unicode, but no one has answered my main question.
If you remember, I was asking how Java could perform character index based operations rapidly, now that characters do not have a uniform size.
As a Java char no longer represents a character, but instead a "code unit" (very often a whole character, but possibly only half of one), presumably the subscript on a char array now refers not to characters but to code units. So myCharArray[234] no longer gives you the 234th character in the array, but instead the 234th code unit. That may or may not be at the 234th character position, and may or may not be a whole character. In other words, direct use of char has become pretty pointless for those writing fully international applications.
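A quick illustration of that point (example mine): with a supplementary character in a String, both length() and charAt() count and return code units, not characters.

```java
public class CodeUnitIndexing {
    public static void main(String[] args) {
        // 'a', then MUSICAL SYMBOL G CLEF as a surrogate pair, then 'b'
        String s = "a\uD834\uDD1Eb";

        System.out.println(s.length());        // 4 code units, though only 3 characters
        System.out.println((int) s.charAt(1)); // 55348 (0xD834): half a character, not a character
        System.out.println(s.charAt(3));       // 'b' is at code-unit index 3, not character index 2
    }
}
```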
Presumably, the latest Java has some way of getting the Nth character from a char array, String or StringBuffer. That operation would be quite complicated to implement (it seems to me it would have to scan all the code units from the zeroth to the Nth), and would have to return something bigger than a char, probably an int.
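As it happens, Java 5 added exactly that: code-point methods on String that return an int. And the suspicion about cost is right: offsetByCodePoints has to walk from the start, skipping surrogate pairs, so finding the Nth character is linear, not constant-time. A sketch (the helper name nthCodePoint is mine):

```java
public class NthCodePoint {
    // Returns the n-th code point (0-based) of s as an int, since a
    // supplementary character won't fit in a char
    static int nthCodePoint(String s, int n) {
        // offsetByCodePoints scans from index 0, stepping over surrogate
        // pairs, so this is O(n) rather than O(1)
        int index = s.offsetByCodePoints(0, n);
        return s.codePointAt(index);
    }

    public static void main(String[] args) {
        String s = "a\uD834\uDD1Eb";  // three characters, four code units

        System.out.println(Integer.toHexString(nthCodePoint(s, 1))); // 1d11e
        System.out.println((char) nthCodePoint(s, 2));               // b
        System.out.println(s.codePointCount(0, s.length()));         // 3
    }
}
```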