The UTF-8 encoding is a variable-length encoding; characters take up between one and four bytes.
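You can see this for yourself by asking Java for the UTF-8 bytes of a few characters. A minimal sketch (class name is just for illustration, and it assumes Java 7+ for StandardCharsets):

import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
    public static void main(String[] args) {
        // Each string holds a single code point; its UTF-8 length varies.
        String[] samples = { "A", "\u00E9", "\u0645", "\u20AC", "\uD83D\uDE00" };
        for (String s : samples) {
            byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
            System.out.println("U+" + Integer.toHexString(s.codePointAt(0))
                    + " -> " + utf8.length + " byte(s) in UTF-8");
        }
    }
}

This prints 1 byte for A, 2 for \u00E9 and \u0645, 3 for the euro sign \u20AC, and 4 for the emoji U+1F600.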
abalfazl hossein wrote: Does it mean that \u0645 => 11111111111111111111111111011001? Does this mean every Unicode character occupies four bytes in memory, or does the number of bytes vary by character, taking one, two, or more?
Note that \u0645 is not UTF-8; it's a Unicode escape for the code point U+0645, which fits in two bytes. In UTF-8, the same character may be encoded as a byte sequence with completely different values from the two-byte code point.
Apparently \u0645 is encoded in UTF-8 as two bytes, -39 and -123, which have the bit patterns 11011001 and 10000101. Note that these are not the same as the two bytes of the code point itself (0x06 and 0x45), because UTF-8 is a different encoding from the raw 16-bit code point value.
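If you want to verify those byte values yourself, here is a short sketch (again assuming Java 7+ for StandardCharsets; the class name is made up):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class MeemBytes {
    public static void main(String[] args) {
        // Encode the single character \u0645 to UTF-8 and dump the raw bytes.
        byte[] utf8 = "\u0645".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(utf8)); // prints [-39, -123]
    }
}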
When you convert an 8-bit byte containing -39 (11011001) to a 32-bit int, you'll get 11111111111111111111111111011001 which is also -39, but in 32 bits instead of 8 bits.
So 11111111111111111111111111011001 is just the first byte of \u0645's UTF-8 encoding, converted to a 32-bit int.
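Here is that conversion in isolation (class name is just for illustration), so you can see exactly where the 32-bit pattern comes from:

public class SignExtension {
    public static void main(String[] args) {
        byte b = (byte) 0xD9; // -39, bit pattern 11011001
        // toBinaryString takes an int, so b is widened with sign extension:
        System.out.println(Integer.toBinaryString(b));
        // prints 11111111111111111111111111011001
    }
}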
The original UTF-8 design allowed up to 6 bytes per code point, but the current standard (RFC 3629) limits it to 4. In memory, though, Java uses UTF-16, which uses 2 bytes per code unit (and thus maps nicely to the char type) ... until you consider Unicode code points beyond the Basic Multilingual Plane, which do not fit into 16 bits and take two chars (a surrogate pair). The JavaIoFaq links to a couple of articles on that subject, and you should read http://www.joelonsoftware.com/articles/Unicode.html.
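To see the surrogate-pair behavior concretely, here is a small sketch using a code point outside the BMP (U+1F600, an emoji; the class name is made up):

public class BeyondBmp {
    public static void main(String[] args) {
        String s = "\uD83D\uDE00"; // U+1F600, stored as a surrogate pair
        System.out.println(s.length());                      // 2 chars
        System.out.println(s.codePointCount(0, s.length())); // 1 code point
        System.out.println(Integer.toHexString(s.codePointAt(0))); // 1f600
    }
}

So one code point can occupy two chars, which is why String.length() counts UTF-16 code units rather than characters.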
The byte myBytes[i] is implicitly converted to an int (so, from 8 bits to 32 bits, with a widening primitive conversion) because toBinaryString() takes an int and not a byte. This is done by sign extension, which means that the extra bits on the left are filled with the leftmost bit (the sign bit) of the original byte.
For example: 11011001 -> leftmost bit is a 1, so when this is converted to a 32-bit int you get 11111111 11111111 11111111 11011001
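If what you actually want to print is just the 8 bits of the byte, a common idiom (not the only way) is to mask with & 0xFF, which zeroes the upper 24 bits after the widening:

public class MaskDemo {
    public static void main(String[] args) {
        byte b = (byte) 0xD9; // -39
        System.out.println(Integer.toBinaryString(b));        // 11111111111111111111111111011001
        System.out.println(Integer.toBinaryString(b & 0xFF)); // 11011001
    }
}

The mask works because b is first widened to an int (with sign extension), and the & 0xFF then keeps only the low byte.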
But what do you mean by this:
abalfazl hossein wrote: The last byte in the integer is used to save the char.
This line of code doesn't do anything with a char; it only reads a byte and widens it to an int.