string UTF8

abalfazl hossein
Ranch Hand

Joined: Sep 06, 2007
Posts: 635


Output
-39
-123

-39
-120
-39
-124
-40


mim is http://www.fileformat.info/info/unicode/char/645/index.htm

11011001:10000101


Could someone explain how -39 becomes 11011001?
Ralph Cook
Ranch Hand

Joined: May 29, 2005
Posts: 479
Integers on modern binary computers represent negative numbers in "2's complement"; you create a 2's complement by inverting all the bits and adding one.

So 39 (decimal) is 27 (hex), which is 0010 0111 in binary.
Invert all the bits to get 1101 1000
and add one to get 1101 1001

So 11011001 represents -39 using standard 2's complement binary representation.
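
The invert-and-add-one steps above can be checked with a small sketch. The class and method names (TwosComplementDemo, negate8, bits8) are made up for illustration and are not from the thread; the mask with 0xFF is only there to keep the result to 8 bits.

```java
class TwosComplementDemo {
    // Negate an 8-bit value by inverting its bits and adding one,
    // then keep only the low 8 bits of the result.
    static int negate8(int value) {
        return (~value + 1) & 0xFF;
    }

    // Render the low 8 bits of a value as a zero-padded binary string.
    static String bits8(int value) {
        String s = Integer.toBinaryString(value & 0xFF);
        return String.format("%8s", s).replace(' ', '0');
    }

    public static void main(String[] args) {
        System.out.println(bits8(39));           // 00100111
        System.out.println(bits8(~39));          // 11011000 (all bits inverted)
        System.out.println(bits8(negate8(39)));  // 11011001 (plus one)
        System.out.println((byte) negate8(39));  // -39 when read as a signed byte
    }
}
```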

rc
abalfazl hossein
Ranch Hand

Joined: Sep 06, 2007
Posts: 635


11111111111111111111111111011001
11111111111111111111111110000101
11111111111111111111111111011001
11111111111111111111111110001000
11111111111111111111111111011001
11111111111111111111111110000100
11111111111111111111111111011000
11111111111111111111111110100111
11111111111111111111111111011001
11111111111111111111111110000110
11111111111111111111111111011000
11111111111111111111111110100111



Does it mean that u0645=>11111111111111111111111111011001

Does this mean every Unicode character occupies four bytes in memory? Or, depending on the character, might it take one, two, or more bytes?



UTF-8 uses two bytes for this character:

http://www.fileformat.info/info/unicode/char/645/index.htm

UTF-8 (binary) 11011001:10000101
Rob Spoor
Sheriff

Joined: Oct 27, 2005
Posts: 19693
    

abalfazl hossein wrote:Does it mean that u0645=>11111111111111111111111111011001

When cast to an int, yes.

Does this mean every Unicode character occupies four bytes in memory? Or, depending on the character, might it take one, two, or more bytes?

Nope, just two. That's how char is defined. When encoded it may take up only one, but the char data type is always two bytes.

You're seeing four because you're not printing chars. You're trying to print bytes, but because you pass these to Integer.toBinaryString they get widened to ints.
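
If the goal is to see just the 8 bits of each byte, one common approach is to mask the value with 0xFF before handing it to Integer.toBinaryString. The sketch below assumes the bytes come from encoding "\u0645" as UTF-8; the class name ByteBitsDemo and the helper toBits are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class ByteBitsDemo {
    // Mask with 0xFF so the widened int keeps only the byte's own 8 bits,
    // then left-pad with zeros to a fixed width of 8.
    static String toBits(byte b) {
        String s = Integer.toBinaryString(b & 0xFF);
        return String.format("%8s", s).replace(' ', '0');
    }

    public static void main(String[] args) {
        byte[] myBytes = "\u0645".getBytes(StandardCharsets.UTF_8);
        for (byte b : myBytes) {
            // Unmasked: 32 digits because of the widening to int.
            System.out.println(b + " unmasked: " + Integer.toBinaryString(b));
            // Masked: only the 8 bits of the byte itself.
            System.out.println(b + " masked:   " + toBits(b));
        }
    }
}
```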


abalfazl hossein
Ranch Hand

Joined: Sep 06, 2007
Posts: 635
Can't UTF-8 use 4 bytes?
Jesper de Jong
Java Cowboy
Saloon Keeper

Joined: Aug 16, 2005
Posts: 14146
    

The UTF-8 encoding is a variable-length encoding; characters take up between one and four bytes.
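
A quick way to see the variable length is to encode a few sample characters and count the bytes. This is only a sketch: the class name Utf8LengthDemo, the helper utf8Length, and the sample characters other than \u0645 are chosen for illustration and do not come from the thread.

```java
import java.nio.charset.StandardCharsets;

class Utf8LengthDemo {
    // Number of bytes the string occupies once encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));             // 1 (U+0041, ASCII)
        System.out.println(utf8Length("\u0645"));        // 2 (U+0645, Arabic letter meem)
        System.out.println(utf8Length("\u20AC"));        // 3 (U+20AC, euro sign)
        System.out.println(utf8Length("\uD83D\uDE00"));  // 4 (U+1F600, outside the basic plane)
    }
}
```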

abalfazl hossein wrote:Does it mean that u0645=>11111111111111111111111111011001

Does this mean every Unicode character occupies four bytes in memory? Or, depending on the character, might it take one, two, or more bytes?

No.

Note that \u0645 is not a UTF-8 code; it's the Unicode code point U+0645, written as a Java char escape. In UTF-8, a character is encoded as bytes whose values can look completely different from the code point number.

Apparently \u0645 is encoded in UTF-8 with two bytes: -39, -123, which have bit patterns: 11011001, 10000101. Note that these are not the same as the two bytes of the code point value (0x06 and 0x45), because UTF-8 spreads the code point's bits over bytes that also carry marker bits (110xxxxx for the first byte of a two-byte sequence, 10xxxxxx for the second).

When you convert an 8-bit byte containing -39 (11011001) to a 32-bit int, you'll get 11111111111111111111111111011001 which is also -39, but in 32 bits instead of 8 bits.

So the 11111111111111111111111111011001 is just the first byte of \u0645 in UTF-8 encoding, converted to a 32-bit int.
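
To make the relationship between the code point 0x0645 and the bytes -39, -123 concrete, here is a minimal sketch of the two-byte UTF-8 form (110xxxxx 10xxxxxx), which covers code points U+0080 to U+07FF. The class name Utf8TwoByteDemo and the method encodeTwoByte are hypothetical; in real code you would simply call String.getBytes, which the sketch uses as a cross-check.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8TwoByteDemo {
    // Encode a code point in the range U+0080..U+07FF as two UTF-8 bytes.
    // First byte:  110 followed by the top 5 of the 11 payload bits.
    // Second byte: 10 followed by the low 6 payload bits.
    static byte[] encodeTwoByte(int codePoint) {
        if (codePoint < 0x80 || codePoint > 0x7FF) {
            throw new IllegalArgumentException("not in the two-byte UTF-8 range");
        }
        byte first = (byte) (0xC0 | (codePoint >> 6));
        byte second = (byte) (0x80 | (codePoint & 0x3F));
        return new byte[] { first, second };
    }

    public static void main(String[] args) {
        byte[] manual = encodeTwoByte(0x0645);
        byte[] library = "\u0645".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(manual));         // [-39, -123]
        System.out.println(Arrays.equals(manual, library));  // true
    }
}
```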

Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41823
    
UTF-8 as originally designed could use up to 6 bytes per code point, but since RFC 3629 it is limited to at most 4 bytes, which is enough for every Unicode code point (up to U+10FFFF). In memory, Java uses UTF-16, which uses 2 bytes per code unit (and thus maps nicely to the char type) ... until you consider Unicode code points beyond the basic plane, which do not fit into 16 bits. The JavaIoFaq links to a couple of articles on that subject, and you should read http://www.joelonsoftware.com/articles/Unicode.html.
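
As a rough illustration of code points beyond the basic plane, the sketch below compares the number of chars (UTF-16 code units) with the number of code points in a string. The class name SurrogateDemo, the helper fromCodePoint, and the sample code point U+1F600 are picked for illustration only.

```java
class SurrogateDemo {
    // Build a one-code-point string; code points above U+FFFF become two chars.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String meem = fromCodePoint(0x0645);    // inside the basic plane
        String emoji = fromCodePoint(0x1F600);  // outside the basic plane

        System.out.println(meem.length());                            // 1 char
        System.out.println(emoji.length());                           // 2 chars (a surrogate pair)
        System.out.println(emoji.codePointCount(0, emoji.length()));  // 1 code point
        System.out.println(Character.isHighSurrogate(emoji.charAt(0)));  // true
    }
}
```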


Jesper de Jong
Java Cowboy
Saloon Keeper

Joined: Aug 16, 2005
Posts: 14146
    

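You're right, Ulf, that sequences of up to 6 bytes existed: the Wikipedia page mentions 1 to 4 bytes in the intro, but later on says it can be up to 6 bytes. The intro matches the current definition (RFC 3629 caps UTF-8 at 4 bytes); the 6-byte form comes from the original design.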
abalfazl hossein
Ranch Hand

Joined: Sep 06, 2007
Posts: 635


In this line, myBytes[i] is type cast to int. The last byte of the integer is used to store the char. Right?
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41823
    
abalfazl hossein wrote:The last byte of the integer is used to store the char. Right?

No. Please read the article I linked to.
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 38793
    
I think this is too difficult for "beginning", so I shall move this thread.
abalfazl hossein
Ranch Hand

Joined: Sep 06, 2007
Posts: 635


myBytes[i] is converted to int, because toBinaryString accepts an int argument.

Is that a type cast?
Jesper de Jong
Java Cowboy
Saloon Keeper

Joined: Aug 16, 2005
Posts: 14146
    

The byte myBytes[i] is implicitly converted to an int (so, from 8 bits to 32 bits, with a widening primitive conversion) because toBinaryString() takes an int and not a byte. This is done by sign extension, which means that the extra bits on the left are filled with the leftmost bit (the sign bit) of the original byte.

For example: 11011001 -> leftmost bit is a 1, so when this is converted to a 32-bit int you get 11111111 11111111 11111111 11011001
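
Sign extension can be observed directly by widening one negative and one positive byte. This is a small sketch with hypothetical names (SignExtensionDemo, widen, widenUnsigned); note that widen needs no cast because a byte-to-int conversion is implicit, whereas going the other way does need one.

```java
class SignExtensionDemo {
    // Implicit widening primitive conversion: no cast is written or needed.
    static int widen(byte b) {
        return b;
    }

    // Masking discards the sign-extended upper bits, giving the value 0..255.
    static int widenUnsigned(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        byte negative = (byte) 0xD9;  // narrowing int to byte requires a cast; value is -39
        byte positive = 0x27;         // 39, sign bit is 0

        // Sign bit 1: the upper 24 bits are filled with ones.
        System.out.println(Integer.toBinaryString(widen(negative)));  // 11111111111111111111111111011001
        // Sign bit 0: the upper bits are zeros, which toBinaryString omits.
        System.out.println(Integer.toBinaryString(widen(positive)));  // 100111
        System.out.println(widenUnsigned(negative));                  // 217
    }
}
```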

But what do you mean with:
abalfazl hossein wrote:The last byte of the integer is used to store the char.

This line of code doesn't do anything with a char.
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 38793
    
abalfazl hossein wrote: . . . Is that a type cast?
No.
 