JavaRanch » Java Forums » Java » Beginning Java

unicode query

Ajay Kumar Rana

Joined: Feb 27, 2008
Posts: 13

The question is:

Which of the following are not valid character constants? [8]
Select any two.
(a)char c = '\u00001' ;
(b)char c = '\101';
(c)char c = 65;
(d)char c = '\1001' ;

The answers are (a) and (d). The thing is that I am not clear on the Unicode notation and this type of character constant. Please suggest a link where I can read and understand the explanation of these answers, and also about Unicode.
Kaydell Leavitt
Ranch Hand

Joined: Nov 18, 2006
Posts: 689

The word "Unicode" means literally, "one" "code".

I believe the history of character encoding most relevant to modern computers begins with the teletype.

A standard code was agreed upon called ASCII:

American Standard Code for Information Interchange (ASCII)

ASCII was only a seven-bit code. When computers standardized on the eight-bit byte, the first seven bits carried ASCII, which left one bit over, and computer companies didn't want to waste it — so different manufacturers used that eighth bit in different ways.

This difference led to having many different "code pages".

Unicode was proposed as a new standard to replace ASCII and *all* of the many code pages that existed for all of the symbols in all of the languages of the world. Originally, Unicode was a 16-bit standard which yielded 64K characters in the character set which was thought to be big enough to encode everything.

Java supported the early Unicode standard from the beginning, so in Java a char is 16 bits.

In Java, a char holds a 16-bit unsigned integer, while a short holds a 16-bit signed integer. A char variable can be assigned any constant int value from 0 to 65535, but converting a char variable to a short (or vice versa) requires an explicit cast.

char c1 = 65;       // Legal: an int constant in range; decimal 65 is 'A'.
char c2 = '\u0042'; // Legal: a Unicode escape; hex 42 is decimal 66, a capital 'B'.
char c3 = 'C';      // Legal: a plain char literal.
char c4 = '\u00067'; // NOT legal: a Unicode escape takes exactly four hex digits, so this reads as '\u0006' followed by '7' — two characters in one char literal.

char c5 = '\101';   // Legal: an octal escape; octal 101 is decimal 65, a capital 'A'. Octal escapes range from '\0' to '\377'.
char c6 = '\u0041'; // The Unicode-escape equivalent of c5 (hex 41 = decimal 65 = 'A').

char c7 = '\1001';  // NOT legal: the octal escape stops at '\100', leaving '1' as a second character in the literal.
char c8 = '\u1001'; // Legal: a proper four-digit Unicode escape.
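For anyone who wants to check the legal forms, here's a minimal runnable sketch (the class name is my own choice) that prints the characters those literals stand for:

```java
public class CharLiterals {
    public static void main(String[] args) {
        char a = 65;        // int constant in range: decimal 65 is 'A'
        char b = '\u0042';  // Unicode escape: hex 42 is decimal 66, 'B'
        char c = 'C';       // plain char literal
        char d = '\101';    // octal escape: octal 101 is decimal 65, 'A'
        System.out.println("" + a + b + c + d); // prints ABCA
    }
}
```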

Unicode turned out not to be as simple as having a single encoding for all characters. For example, developers didn't want to spend 16 bits per character transmitting text over the Internet when 8-bit encodings take half the space. So variations of Unicode exist, most notably UTF-8, which encodes ASCII characters in a single byte but uses longer sequences for the rest.
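You can see the variable-length behavior for yourself. This little sketch (class name mine) asks for the UTF-8 bytes of three different characters:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
    public static void main(String[] args) {
        // An ASCII character takes one byte in UTF-8...
        byte[] ascii = "A".getBytes(StandardCharsets.UTF_8);
        // ...an accented Latin letter (U+00E9) takes two...
        byte[] accented = "é".getBytes(StandardCharsets.UTF_8);
        // ...and the euro sign (U+20AC) takes three.
        byte[] euro = "€".getBytes(StandardCharsets.UTF_8);
        System.out.println(ascii.length + " " + accented.length + " " + euro.length); // prints 1 2 3
    }
}
```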



In my opinion, Unicode is good because it does simplify things over the older standards where you could only use one code-page at a time and were therefore limited to 256 chars. Unicode 1.0 through 3.0 are much better, allowing up to 64K characters to be encoded in a 16-bit char.

Unicode versions after 3.0 broke the 16-bit limit, but this is only an issue for relatively rare characters, such as additional ideographs for Asian languages, that don't fit into Unicode 3.0.

You can use the char type for Unicode 1.0 through Unicode 3.0.

If you want to go beyond the 16-bit limit of char, you use an int, which is 32 bits; Unicode code points currently need at most 21 of them.
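A sketch of the int-based approach, using the musical G clef symbol (U+1D11E) as an example of a code point that doesn't fit in a char:

```java
public class CodePoints {
    public static void main(String[] args) {
        int clef = 0x1D11E; // G clef: too big for a 16-bit char
        String s = new String(Character.toChars(clef)); // stored as two chars (a surrogate pair)
        System.out.println(Character.charCount(clef));  // prints 2
        System.out.println(s.codePointAt(0) == clef);   // prints true
    }
}
```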

So Unicode literally means "one code", but it is not just one code. It is better than having innumerable code pages that you could only work with one at a time, but Unicode itself can be encoded in a few different ways (such as UTF-8, UTF-16, and UTF-32).
[ February 28, 2008: Message edited by: Kaydell Leavitt ]
Ulf Dittmer

Joined: Mar 22, 2005
Posts: 42965
If you're new to the world of Unicode you might start by reading The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) as an introduction.

And if you're not confused enough yet, have a read of this post and this post, both of which explain what happens if one character isn't the same as one char. That's advanced stuff, but it's good to keep in the back of your head that this can happen.
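One quick way to see the one-character-isn't-one-char effect yourself, using a string holding a single character outside the 16-bit range (U+1D11E, written here as its surrogate pair):

```java
public class OneCharacterTwoChars {
    public static void main(String[] args) {
        String s = "\uD834\uDD1E"; // one character, U+1D11E, stored as two chars
        System.out.println(s.length());                      // prints 2 (chars)
        System.out.println(s.codePointCount(0, s.length())); // prints 1 (character)
    }
}
```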
[ February 28, 2008: Message edited by: Ulf Dittmer ]