Unicode question

 
Todd Vakulskas
Greenhorn
Posts: 2
I am having some trouble understanding Unicode as it relates to Java. I have written a test method to gain some understanding, and it has confused me further.

The method I wrote was meant to show how to produce a hex representation of a UTF-16 encoded string.

Here is an excerpt from the code I wrote:



When I stop this in my Eclipse debugger, I get the following byte array back:
byte[4] = [-2, -1, -32, 19]


Given its UTF-16 encoding, I would have expected only two bytes back for 0x2013 (en dash). What am I misunderstanding, or what is wrong with my code?

Any help would be appreciated
Thanks
 
Ralph Cook
Ranch Hand
Posts: 479
I don't know the answer to your question; however, if I use the same code to encode 'A', I also get 4 bytes, so I don't think it's anything special about the character or the method call. I haven't quite penetrated the density that is the Wikipedia article on UTF-16, but the article says a character can take either one or two 16-bit values. I guess for UTF-16 the call may always return two.

I'm curious, but not enough to stay up and research it tonight, but I thought I'd throw this in in case you had not thought to give it a regular ol' ASCII character...

rc
 
Jesper de Jong
Java Cowboy
Saloon Keeper
Posts: 15150
Let's have a look at those bytes in hex:

[-2, -1, -32, 19] = [FE, FF, E0, 13]
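For anyone wondering how the signed values shown by the debugger map to hex: Java's byte type is signed, so masking each value with 0xFF gives the unsigned value before formatting. A minimal sketch, assuming a hypothetical helper named toHex (not part of the original poster's code):

```java
public class SignedBytesToHex {

    // Hypothetical helper: formats each byte as two uppercase hex digits.
    // The "& 0xFF" mask turns a signed byte (-128..127) into 0..255.
    public static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The values as originally reported from the debugger
        byte[] reported = {-2, -1, -32, 19};
        System.out.println(toHex(reported)); // prints FE FF E0 13
    }
}
```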

The first two bytes, FE FF, are the byte order mark. It's a special code in Unicode that indicates whether the bytes are stored in big-endian or little-endian order (see the Wikipedia page for details).

The E0 13 is the actual code for the character. But I don't understand why it's E0 13 instead of 20 13... Are you sure that the third byte isn't 32 instead of -32?

If you do this with, for example, the character 'A', you get FE FF 00 41, which is the byte order mark FE FF plus the code 00 41 for the character 'A'.
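The byte order mark can be seen by encoding the same string with different charsets. This is only a sketch, since the original poster's code is not shown here; the class name and the toHex helper are made up for illustration. It relies on the standard "UTF-16" charset writing a big-endian byte order mark when encoding, while "UTF-16BE" and "UTF-16LE" write none:

```java
import java.nio.charset.StandardCharsets;

public class Utf16BomDemo {

    // Hypothetical helper: two uppercase hex digits per byte, space separated.
    public static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String enDash = "\u2013"; // en dash, U+2013

        // "UTF-16" prepends the byte order mark FE FF when encoding
        System.out.println(toHex(enDash.getBytes(StandardCharsets.UTF_16)));   // FE FF 20 13

        // With the byte order named explicitly, no mark is written
        System.out.println(toHex(enDash.getBytes(StandardCharsets.UTF_16BE))); // 20 13
        System.out.println(toHex(enDash.getBytes(StandardCharsets.UTF_16LE))); // 13 20

        // Plain ASCII behaves the same way
        System.out.println(toHex("A".getBytes(StandardCharsets.UTF_16)));      // FE FF 00 41
    }
}
```

So if you want just the two bytes for the character, asking for UTF-16BE (or UTF-16LE) instead of UTF-16 is one way to get them.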
 
Campbell Ritchie
Sheriff
Posts: 47288
. . . and welcome to the Ranch.
I have added code tags, which you should always use, to your post, and you can see how much better it looks. I also corrected a spelling error.
 
Rob Spoor
Sheriff
Posts: 20386
It is indeed 32, not -32.
 
Todd Vakulskas
Greenhorn
Posts: 2
I apologize; it should have been 32, as you've already figured out. After reading Jesper's post about the byte order mark, things started making sense. I added more characters to the string and then realized it truly is just a header for the string, and that each additional char I added only added two bytes. I want to thank everyone who contributed to this post. What a wonderful, helpful forum.
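The "two-byte header plus two bytes per char" observation can be checked with a short loop. This is a sketch with made-up names (Utf16Length, encodedLength), not the code from the first post, and it only uses characters that fit in a single 16-bit unit:

```java
import java.nio.charset.StandardCharsets;

public class Utf16Length {

    // Hypothetical helper: number of bytes produced by the "UTF-16" charset.
    public static int encodedLength(String s) {
        return s.getBytes(StandardCharsets.UTF_16).length;
    }

    public static void main(String[] args) {
        String[] samples = {"\u2013", "\u2013a", "\u2013ab"};
        for (String s : samples) {
            // Expect 2 bytes for the byte order mark plus 2 bytes per char
            System.out.println(s.length() + " char(s) -> " + encodedLength(s) + " bytes");
        }
        // prints 4, 6 and 8 bytes for 1, 2 and 3 chars
    }
}
```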

Thanks!
 