[J2ME] From Unicode to UTF-8

Adriano Bellavita
Ranch Hand

Joined: Mar 11, 2010
Posts: 37
Hi all,

I have to convert a Unicode String to its UTF-8 encoding.

I'm working with emoticons so:

this is my input:

U+1F600 (or \uD83D\uDE03, chars associated with it)

this should be the output

f0 9f 98 80

How can I get this?

Ty and BR,

Adriano
Jesper de Jong
Java Cowboy
Saloon Keeper

Joined: Aug 16, 2005
Posts: 14278
    

Something like this:
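
A minimal sketch of this idea, assuming Java SE rather than MIDP, and using the hypothetical names `Utf8HexDemo` and `toUtf8Hex`. It is not necessarily the exact code originally posted: it encodes the string with getBytes("UTF-8") and prints each byte as two hex digits.

```java
import java.io.UnsupportedEncodingException;

final class Utf8HexDemo {
    // Encodes the text as UTF-8 and formats each byte as two lowercase hex digits.
    static String toUtf8Hex(String text) {
        byte[] bytes;
        try {
            bytes = text.getBytes("UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a required charset on Java SE, so this is not expected.
            throw new IllegalStateException("UTF-8 not supported");
        }
        StringBuilder hex = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) {
                hex.append(' ');
            }
            // Mask with 0xFF so negative byte values print as 80..ff.
            hex.append(String.format("%02x", bytes[i] & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        // The surrogate pair from the question (high D83D, low DE03).
        System.out.println(toUtf8Hex("\uD83D\uDE03"));
    }
}
```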

By the way, that gives me f0 9f 98 83, not f0 9f 98 80.


Adriano Bellavita
Ranch Hand

Joined: Mar 11, 2010
Posts: 37
It doesn't work...

If I try this solution, I wonder about 2-byte chars: each char of the "Hello world" String is stored in 2 bytes.

In my case, my String is "😀": an emoticon!

To better understand what I'm trying to do, I'll make an example:

We can easily convert a String using the getBytes method when the Unicode value of every char of the String lies between 0x0000 and 0xFFFF.

The "😀" Unicode value overflows that range: to encode it as chars, we need 2 chars (not one, so more than 2 bytes), as we can see here:

http://www.utf8-chartable.de/unicode-utf8-table.pl

The "😀" representation is: 0x1F600 (unicode: so something like 0001|F600???) and f0 9f 98 80 (hex)

So I have to represent a single character ("😀") as if it were composed of three (or four???) bytes...

How can I do this?

Adriano Bellavita
Ranch Hand

Joined: Mar 11, 2010
Posts: 37
Jesper de Jong wrote: Something like this:

By the way, that gives me f0 9f 98 83, not f0 9f 98 80.


Wow.... Give me a moment....

Ok, you use getBytes("UTF-8")...

But then what do you do?

How could you obtain f0 9f 98 83?

If I print the byte array, the "for" returns:

-19
-96
-67
-19
-72
-125

........
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18712
    

I don't know how you could get that. You can never get more than four UTF-8 bytes for a Unicode character. When I run that code, the bytes in the resulting array are -16, -97, -104, -125. But that's the decimal representation, assuming the byte value is signed. The hexadecimal representation of those bytes is F0, 9F, 98, 83.
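
To make the signed-versus-hex point concrete, here is a small sketch (the class name `SignedByteHex` and the method `toHex` are hypothetical). Java bytes are signed, so masking with 0xFF recovers the unsigned value before formatting it as hex.

```java
final class SignedByteHex {
    // Converts one signed Java byte to a two-digit uppercase hex string.
    static String toHex(byte b) {
        int unsigned = b & 0xFF; // -16 becomes 240, -125 becomes 131, and so on
        String hex = Integer.toHexString(unsigned).toUpperCase();
        return unsigned < 0x10 ? "0" + hex : hex;
    }

    public static void main(String[] args) {
        byte[] bytes = { -16, -97, -104, -125 };
        for (int i = 0; i < bytes.length; i++) {
            System.out.println(bytes[i] + " -> " + toHex(bytes[i]));
        }
    }
}
```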
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18712
    

... Well, that's interesting. When I take the six bytes you say you got, and convert them to a String assuming they were UTF-8, I do actually get "\uD83D\uDE03". Here's the code I wrote:
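
The posted snippet is not reproduced here. As a rough illustration only, the sketch below decodes the six reported bytes by hand instead of relying on new String(bytes, "UTF-8"), because the result of that call may differ between Java versions (stricter decoders can reject encoded surrogates). It assumes the bytes are two three-byte groups of the form 1110xxxx 10xxxxxx 10xxxxxx; the names `SurrogateBytesDemo` and `decodeThreeByteGroups` are hypothetical.

```java
final class SurrogateBytesDemo {
    // Decodes consecutive three-byte groups (1110xxxx 10xxxxxx 10xxxxxx)
    // into 16-bit chars. No validation is done; this is only an illustration.
    static String decodeThreeByteGroups(byte[] bytes) {
        StringBuffer out = new StringBuffer();
        for (int i = 0; i + 2 < bytes.length; i += 3) {
            int c = ((bytes[i] & 0x0F) << 12)
                  | ((bytes[i + 1] & 0x3F) << 6)
                  | (bytes[i + 2] & 0x3F);
            out.append((char) c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The six bytes reported earlier in the thread.
        byte[] reported = { -19, -96, -67, -19, -72, -125 };
        String s = decodeThreeByteGroups(reported);
        for (int i = 0; i < s.length(); i++) {
            System.out.println(Integer.toHexString(s.charAt(i)));
        }
    }
}
```

Each three-byte group decodes to one half of the surrogate pair, which suggests the six bytes come from encoding each UTF-16 char separately rather than encoding the combined code point.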



I'm using Java 7. I recall seeing something in some JVM change report about fixing code to use canonical UTF-8, but don't remember when that was. What version of Java are you using?

And just in case we are on the wrong track here, why do you have to convert a String to the hexadecimal representation of its UTF-8 encoding?
Adriano Bellavita
Ranch Hand

Joined: Mar 11, 2010
Posts: 37
Hi,

I'm using Java 1.4, MID profile.

I only want to obtain what this table shows:

http://www.utf8-chartable.de/unicode-utf8-table.pl

If you go to the "U+1F600 ... U+1F64F - Emoticons" section, you'll see that the block starts at code point U+1F600 and ends at U+1F64F.

So I want each Unicode entry to be converted into its corresponding UTF-8 bytes.

My start point is the Unicode code point (or chars representation), not the String.

My end point is its hexadecimal representation.

TY in advance,

Adriano

Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18712
    

Okay, you're using Java 1.4, which means that you have to use the UTF-16 encoding of the character (as you did) rather than using the character directly, which Java 5 allows you to do.

At any rate it seems that you are generating something which appears to be a UTF-8 version of that character in some way, at least it converts back to the character via new String(bytearray, "UTF-8"). However I still think you need to explain your original problem, rather than trying to discuss a (possibly) failed solution to that unknown problem.
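
For what it's worth, here is a hedged sketch of doing the conversion by hand, which avoids depending on how a particular runtime's getBytes("UTF-8") treats surrogate pairs. It sticks to language features available in Java 1.4 (StringBuffer, no generics) and follows the standard UTF-8 bit layout; the names `ManualUtf8`, `toCodePoint`, `encode` and `toHex` are hypothetical, and it has not been tested on a MIDP device.

```java
final class ManualUtf8 {
    // Combines a UTF-16 surrogate pair into a single Unicode code point.
    static int toCodePoint(char high, char low) {
        return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
    }

    // Encodes one code point as 1 to 4 UTF-8 bytes.
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("Not a Unicode scalar value");
        }
        if (cp < 0x80) {
            return new byte[] { (byte) cp };
        }
        if (cp < 0x800) {
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
        if (cp < 0x10000) {
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
        return new byte[] {
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F)) };
    }

    // Formats bytes as space-separated two-digit lowercase hex.
    static String toHex(byte[] bytes) {
        StringBuffer sb = new StringBuffer();
        for (int i = 0; i < bytes.length; i++) {
            int v = bytes[i] & 0xFF;
            if (i > 0) sb.append(' ');
            if (v < 0x10) sb.append('0');
            sb.append(Integer.toHexString(v));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex(encode(0x1F600)));
        System.out.println(toHex(encode(toCodePoint('\uD83D', '\uDE03'))));
    }
}
```

Starting from the code point (the table's first column) gives the hex bytes of its third column directly; starting from a pair of chars needs the extra `toCodePoint` step first.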
Adriano Bellavita
Ranch Hand

Joined: Mar 11, 2010
Posts: 37
TY for your reply.

Let's take a look at the table shown at this URL:

unicode-utf8

I must obtain the result of the third column, starting from the value of the first one.

That's my problem...
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18712
    

Let me be more clear, then. The problem I am asking about is the problem to which "I must obtain the result of the third column, starting from the value of the first one" is your idea of a solution. There may be better ways of solving that unknown problem, but we can't know until we know what that problem is.
 
 