Confused over Unicode

 
Peter Chase
Ranch Hand
Posts: 1970
I have become confused about Unicode, UTF-16 encoding, Java Strings and Java chars. I thought my code was wrong, then I thought it wasn't, then I wasn't sure.

The following code is supposed to convert any Java String into UTF-16, where (for reasons specific to my project) each 16-bit value is a Java short, not the more usual Java char.
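(The code listing did not survive in this copy of the thread. As a placeholder, here is a minimal sketch of what such a method might look like; the class and method names are illustrative, not from the original post.)

```java
// Hypothetical sketch, not the original poster's code: produce one Java
// short per UTF-16 code unit, by reading the string with charAt().
public class Utf16Shorts {
    public static short[] stringToShorts(String s) {
        short[] out = new short[s.length()];    // length() counts UTF-16 code units
        for (int i = 0; i < s.length(); i++) {
            out[i] = (short) s.charAt(i);       // each char is one 16-bit code unit
        }
        return out;
    }
}
```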

 
Ranch Hand
Posts: 130
Peter,

I have changed your method to the following (sorry, the code is a little messy and I got a little lazy with catching the UnsupportedEncodingException that can be thrown by the getBytes method, so I'm just throwing the Exception - but hopefully you get the picture):
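(The listing itself is missing from this copy of the thread; the following is my reconstruction of the approach described, not the original code. It encodes the string to bytes, then widens each byte to a short, which yields two shorts per 16-bit UTF-16 code unit.)

```java
import java.io.UnsupportedEncodingException;

// Sketch of the byte-widening approach described above: encode with
// getBytes("UTF-16"), then copy each byte into its own short.
public class BytesToShorts {
    public static short[] toShortArray(String s) throws UnsupportedEncodingException {
        byte[] bytes = s.getBytes("UTF-16");    // Java's "UTF-16" = big-endian with a BOM
        short[] out = new short[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            out[i] = bytes[i];                  // sign-extends: byte 0xFE becomes -2
        }
        return out;
    }
}
```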



For example, if you pass the string "ABCD", the method will return the following array of shorts:
-2
-1
0
65
0
66
0
67
0
68

The -2 and -1 values represent the 0xFEFF byte order mark, which marks the output as big-endian UTF-16. If you don't want the byte order mark, change the encoding name in the getBytes method to "UTF-16BE".

Is this the sort of thing you were looking for?

Regards,
JD
 
Peter Chase
Ranch Hand
Posts: 1970
Thanks for replying, but I'm fairly sure that's not what I need.

Your method converts the string to bytes and then puts each byte into its own short, doesn't it? So you end up with two shorts for each 16-bit UTF-16 code unit.

What I want is an array of shorts where each 16-bit short represents one UTF-16 code unit.

My code achieves that in all the cases I've actually seen, but I am wondering about cases where UTF-16 does not translate a single Unicode character into exactly one Java char.
 
author and iconoclast
Posts: 24207
I believe that what you've done is correct. String.length() has been redefined as the number of 16-bit char values it takes to represent the String in UTF-16; in the cases you're worried about, this number is larger than the number of code points ("characters") in the string, which you can get from codePointCount(). charAt() returns the 16-bit value at the given index, which might be one of a pair of surrogates. Your code is doing the right thing: the two members of a surrogate pair will be stored in separate, adjacent shorts.
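For instance (my example, not from the original post), U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP and is encoded in UTF-16 as the surrogate pair D834 DD1E, which shows the length()/codePointCount() distinction directly:

```java
// Demonstrates that a single supplementary code point occupies two
// chars (one surrogate pair) in a Java String.
public class SurrogateDemo {
    public static void main(String[] args) {
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println(clef.length());                          // 2: two 16-bit code units
        System.out.println(clef.codePointCount(0, clef.length()));  // 1: one code point
        System.out.printf("%04X %04X%n",
                (int) clef.charAt(0), (int) clef.charAt(1));        // D834 DD1E
    }
}
```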
 
Rancher
Posts: 43081
The http://faq.javaranch.com/java/JavaIoFaq points to two blog entries about Unicode characters outside the BMP; they also include some code. Maybe that's what you're looking for?
 