Java language and Unicode

 
Ranch Hand
Posts: 582
Dear all,
This is about the Unicode character set used in Java programs.
I am reading The Java Language Specification, and below is my summary of section 3.1, Unicode.

According to the Java Language Specification, the Java programming language represents text using the UTF-16 encoding.
UTF-16 is a Unicode encoding that can represent the complete range of characters.
Characters whose code points are greater than U+FFFF are called supplementary characters, and they are represented as pairs of 16-bit code units: the first from the high-surrogates range (U+D800 to U+DBFF), the second from the low-surrogates range (U+DC00 to U+DFFF). For characters in the range U+0000 to U+FFFF, the values of code points and UTF-16 code units are the same.

I am confused about 'pairs of 16-bit code units'. What are they? Could you give me some examples, please?
What is the difference between 'code points' and 'code units'? Do they mean the same thing?

Please correct my understanding if I am wrong.

thanks
daniel
 
Ranch Hand
Posts: 346


UTF-16 stores text as a sequence of 16-bit values, and each of those 16-bit values is called a code unit. A code point is something different: it is the number that Unicode assigns to a character, written in hexadecimal from U+0000 up to U+10FFFF (you can look up the Unicode code charts for all characters and their code points). For the characters from U+0000 to U+FFFF, a single code unit is enough, and its value is the same as the code point.
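
To see this for a character in the 0000 to FFFF range, here is a small sketch using the euro sign (U+20AC). The class name CodeUnitBasics and the helper unitsNeeded are names I made up for illustration; the String and Character methods are the standard ones.

```java
class CodeUnitBasics {
    // How many 16-bit code units (Java chars) UTF-16 needs for one code point.
    static int unitsNeeded(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        String euro = "\u20AC"; // the euro sign, code point U+20AC

        // One char in the String: a single 16-bit code unit.
        System.out.println(euro.length());                        // 1

        // The code unit and the code point have the same value.
        System.out.printf("unit  = %04X%n", (int) euro.charAt(0)); // unit  = 20AC
        System.out.printf("point = %04X%n", euro.codePointAt(0));  // point = 20AC

        System.out.println(unitsNeeded(0x20AC));                  // 1
    }
}
```

So for these characters you can treat a Java char and a code point as the same number.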

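A single 16-bit code unit can hold only 65,536 different values, so a character whose code point is above U+FFFF does not fit in one unit. In that case UTF-16 uses two 16-bit code units, called a surrogate pair, to represent that one code point: the first unit comes from the high-surrogates range (U+D800 to U+DBFF) and the second from the low-surrogates range (U+DC00 to U+DFFF). For example, U+1D11E (the musical G clef) is stored as the two code units D834 DD1E. So the two terms do not mean the same thing: a code point identifies a character, while a code unit is the 16-bit piece used to store it.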
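
Since you asked for an example of a pair of 16-bit code units, here is a rough sketch with U+1D11E (the musical G clef), a supplementary character. The class SurrogatePairDemo and the two helper methods are my own illustrative names; they just spell out the usual UTF-16 arithmetic (subtract 0x10000, then split the remaining 20 bits into two halves of 10 bits each). In real code you would normally let Character.toChars do this for you.

```java
class SurrogatePairDemo {
    // High (first) surrogate: 0xD800 plus the top 10 bits of (codePoint - 0x10000).
    static char highSurrogate(int codePoint) {
        return (char) (0xD800 + ((codePoint - 0x10000) >> 10));
    }

    // Low (second) surrogate: 0xDC00 plus the bottom 10 bits of (codePoint - 0x10000).
    static char lowSurrogate(int codePoint) {
        return (char) (0xDC00 + ((codePoint - 0x10000) & 0x3FF));
    }

    public static void main(String[] args) {
        int clef = 0x1D11E; // one code point, above U+FFFF
        String s = new String(Character.toChars(clef));

        System.out.println(s.length());                      // 2 (code units)
        System.out.println(s.codePointCount(0, s.length())); // 1 (code point)

        // The two code units stored in the String.
        System.out.printf("%04X %04X%n", (int) s.charAt(0), (int) s.charAt(1)); // D834 DD1E

        // The same pair computed by hand.
        System.out.printf("%04X %04X%n",
                (int) highSurrogate(clef), (int) lowSurrogate(clef));           // D834 DD1E
    }
}
```

Note that length() counts code units, not characters, which is why it reports 2 here; use codePointCount when you need the number of code points.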
 