
Unicode question

Todd Vakulskas
Greenhorn

Joined: Jan 10, 2011
Posts: 2
I am having some trouble understanding Unicode as it relates to Java. I have written a test method to gain some understanding, and it has confused me further.

The method that I wrote was meant to help me understand how to produce a hex representation of a UTF-16 encoded string.

Here is an excerpt from the code I wrote:



When I stop this in my Eclipse debugger, I get the following byte array back:
byte[4] = [-2, -1, -32, 19]


Given that it's UTF-16 encoding, I would have expected only two bytes back for 0x2013 (en dash). What am I misunderstanding, or what is wrong with my code?

Any help would be appreciated.
Thanks
Ralph Cook
Ranch Hand

Joined: May 29, 2005
Posts: 479
I don't know the answer to your question; however, if I use the same code to encode 'A', I also get 4 bytes, so I don't think it's anything special about the character or the method call. I haven't quite penetrated the density that is the Wikipedia article on UTF-16, but the article says it can return either one or two 16-bit values. I guess for UTF-16 it may always return 2.

I'm curious, though not enough to stay up and research it tonight, but I thought I'd throw this in, in case you had not thought to try it with a regular ol' ASCII character...
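
For reference, the "one or two 16-bit values" in the Wikipedia article refers to code units per character, not to the extra bytes seen here. A minimal sketch, assuming Java 7 or later for StandardCharsets (the class and method names are made up for illustration): characters up to U+FFFF, such as 'A' or the en dash, take one 16-bit unit, while characters above U+FFFF take two (a surrogate pair). UTF-16BE is used so that no byte order mark is added to the count.

```java
import java.nio.charset.StandardCharsets;

class CodeUnitDemo {
    // Number of 16-bit code units (Java chars) needed for one Unicode code point.
    static int codeUnits(int codePoint) {
        return Character.toChars(codePoint).length;
    }

    // Number of bytes for one code point in UTF-16BE, which writes no byte order mark.
    static int utf16BeBytes(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0x2013, 0x1F600}; // 'A', en dash, a character above U+FFFF
        for (int cp : samples) {
            System.out.println(String.format("U+%04X: %d code unit(s), %d byte(s)",
                    cp, codeUnits(cp), utf16BeBytes(cp)));
        }
    }
}
```

So 'A' and the en dash each need a single 16-bit unit; only the third sample needs two.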

rc
Jesper de Jong
Java Cowboy
Saloon Keeper

Joined: Aug 16, 2005
Posts: 14111
    

Let's have a look at those bytes in hex:

[-2, -1, -32, 19] = [FE, FF, E0, 13]

The first two bytes, FE FF, are the byte order mark. It's a special code in Unicode that indicates if the bytes are stored in big-endian or little-endian order (see the Wikipedia page for details).

The E0 13 is the actual code for the character. But I don't understand why it's E0 13 instead of 20 13... Are you sure that the third byte isn't 32 instead of -32?
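
The conversion from the debugger's signed bytes to hex can be sketched as follows. This is only an illustration (the class and method names are invented here): Java bytes are signed, so masking with 0xFF recovers the unsigned value before formatting.

```java
class SignedByteHex {
    // Formats each byte as two uppercase hex digits, treating it as unsigned.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // b & 0xFF maps -128..127 onto 0..255, e.g. -32 becomes 224 (0xE0).
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex(new byte[] {-2, -1, -32, 19})); // FE FF E0 13
        System.out.println(toHex(new byte[] {-2, -1, 32, 19}));  // FE FF 20 13
    }
}
```

A third byte of -32 gives E0, while 32 gives the 20 you would expect for U+2013.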

If you do this with, for example, the character 'A', you get FE FF 00 41, which is the byte order mark FE FF plus the code 00 41 for the character 'A'.
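
A small sketch to reproduce this, assuming the original code did something along the lines of getBytes with the "UTF-16" charset (the original excerpt is not shown, so this is a guess; the class and method names are invented, and StandardCharsets needs Java 7 or later):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class BomDemo {
    // Encodes with Java's "UTF-16" charset, which writes a big-endian
    // byte order mark (FE FF) before the character data.
    static byte[] encodeUtf16(String s) {
        return s.getBytes(StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        // 'A' is U+0041: BOM FE FF, then 00 41.
        System.out.println(Arrays.toString(encodeUtf16("A")));      // [-2, -1, 0, 65]
        // The en dash is U+2013: BOM FE FF, then 20 13.
        System.out.println(Arrays.toString(encodeUtf16("\u2013"))); // [-2, -1, 32, 19]
    }
}
```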


Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 38363
    
. . . and welcome to the Ranch
I have added code tags, which you should always use, to your post, and you can see how much better it looks. I also corrected a spelling error.
Rob Spoor
Sheriff

Joined: Oct 27, 2005
Posts: 19670
    

It is indeed 32, not -32.


Todd Vakulskas
Greenhorn

Joined: Jan 10, 2011
Posts: 2
I apologize; it should have been 32, as you've already figured out. After reading Jesper's post regarding the byte order mark, things started making sense. I added more characters to the string and then realized it truly is just a header for the string, and that each additional char I added only added two bytes. I want to thank everyone who contributed to this post. What a wonderful, helpful forum.
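
For anyone who finds this later, here is roughly the experiment, as a sketch rather than my exact code (the class and method names are just for illustration, and StandardCharsets assumes Java 7 or later). The "UTF-16" charset adds the two-byte mark once per string, and "UTF-16BE" leaves it out entirely.

```java
import java.nio.charset.StandardCharsets;

class Utf16Lengths {
    // Byte count with the "UTF-16" charset: a 2-byte byte order mark plus the data.
    static int withBom(String s) {
        return s.getBytes(StandardCharsets.UTF_16).length;
    }

    // Byte count with "UTF-16BE": same byte order, but no byte order mark.
    static int withoutBom(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int n = 1; n <= 4; n++) {
            sb.append('\u2013'); // add one more en dash each time around
            String s = sb.toString();
            System.out.println(n + " char(s): UTF-16 = " + withBom(s)
                    + " bytes, UTF-16BE = " + withoutBom(s) + " bytes");
        }
    }
}
```

The first column grows as 4, 6, 8, 10 and the second as 2, 4, 6, 8, so the mark costs two bytes once and every en dash costs two more.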

Thanks!
 