
something weird

 
Vin Kris
Ranch Hand
Posts: 154
Based on two previous posts I read, I had a question on the basics, and this is what I came up with to verify it. Honestly, I'm confused. I'd appreciate it if somebody whacked me back into reality.
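The code block for the Test class didn't survive in this post, so here is a hedged reconstruction consistent with the output below. The original presumably relied on the platform's default charset; "US-ASCII" is hard-coded here so the lossy behavior is reproducible, and the class and variable names are assumptions.

```java
public class Test {
    public static void main(String[] args) throws Exception {
        String s1 = "\u03a9\u03bb\u03c0";         // omega, lambda, pi
        byte[] bytes = s1.getBytes("US-ASCII");   // unmappable chars become '?' (63)
        String s2 = new String(bytes, "US-ASCII");

        System.out.println(s1 + " " + s2);        // a DOS console shows ??? ???
        System.out.println(s1.equals(s2) ? "equal" : "unequal");   // unequal
        for (int i = 0; i < bytes.length; i++)
            System.out.print(bytes[i] + " ");     // 63 63 63
        System.out.println();
        System.out.print("s1 chars - ");
        for (int i = 0; i < s1.length(); i++)
            System.out.print(s1.charAt(i) + " ");
        System.out.println();
        System.out.print("s2 chars - ");
        for (int i = 0; i < s2.length(); i++)
            System.out.print(s2.charAt(i) + " ");
        System.out.println();
    }
}
```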

The output is
c:\jcp> java Test
??? ???
unequal
63 63 63
s1 chars - ? ? ?
s2 chars - ? ? ?
"unequal" !!! why? equals() compares chars, and both print as '?'.
63 ?? why?
63 is the ASCII value of '?', which is what gets printed when you try to print the string.
Omega isn't in ASCII at all; 234 (0xEA) is its value in the DOS code page (CP437), and its Unicode value is \u03A9.
[ October 10, 2002: Message edited by: Vin Kris ]
 
Dan Chisholm
Ranch Hand
Posts: 1865
I think that the API Specification for String.getBytes() provides the answer.

getBytes
public byte[] getBytes()
Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.
The behavior of this method when this string cannot be encoded in the default charset is unspecified. The CharsetEncoder class should be used when more control over the encoding process is required.
 
Vin Kris
Ranch Hand
Posts: 154
yeah... I did go there... but got lost after StringCoding.encode(). Wasn't worth pulling my hair out over every step.
The equals() method compares the two strings character by character and returns false on the first mismatch. Here both strings print their chars as '?', and the byte value of '?' is 63 in both cases, yet the comparison returns false.
[ October 10, 2002: Message edited by: Vin Kris ]
 
Jose Botella
Ranch Hand
Posts: 2120
Change the print statements to cast to (int)c[i] and you will see this output:

s1 chars - 937 955 960
s2 chars - 63 63 63

Thus the two strings are not equal. s2 really holds question marks; the ??? printed for s1 is only because DOS cannot display the characters 937, 955 and 960.
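Jose's cast can be checked with a small sketch. The charset is hard-coded to "US-ASCII" as an assumption; the original code used the platform default.

```java
public class CharValues {
    public static void main(String[] args) throws Exception {
        String s1 = "\u03a9\u03bb\u03c0";
        String s2 = new String(s1.getBytes("US-ASCII"), "US-ASCII");
        for (int i = 0; i < s1.length(); i++)
            System.out.print((int) s1.charAt(i) + " ");   // 937 955 960
        System.out.println();
        for (int i = 0; i < s2.length(); i++)
            System.out.print((int) s2.charAt(i) + " ");   // 63 63 63
        System.out.println();
    }
}
```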
 
Vin Kris
Ranch Hand
Posts: 154
Thanks a bunch Jose, I sure missed it there.
But I have another question here.
A char takes 2 bytes to represent itself, yet a String of a single char yields only 1 byte from getBytes(). So is a single byte interpreted differently in the case of strings? Is it an unsigned byte that is finally mapped to a character set maintained by the Java libraries?
Because even though omega is 234 in the DOS code page, \u03A9 (937) is its Unicode value. Casting 937 to byte gives -87, whereas getBytes() for this character returns 63. I guess a lot of stuff happens inside with encode() and decode(). Can you give a short and simple explanation, something that doesn't make me go through the StringCoding class? Please. Thanks.
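The difference between a numeric narrowing cast and a charset encode can be seen side by side. This is a sketch with assumed names; "US-ASCII" stands in for the platform default charset.

```java
public class CastVsEncode {
    public static void main(String[] args) throws Exception {
        char omega = '\u03a9';            // Unicode code point 937
        byte narrowed = (byte) omega;     // cast keeps only the low 8 bits: 0xA9 = -87 signed
        byte encoded = String.valueOf(omega)
                             .getBytes("US-ASCII")[0];   // encoder substitutes '?' = 63
        System.out.println(narrowed + " " + encoded);    // -87 63
    }
}
```

The cast is pure arithmetic truncation; the encoder instead consults the charset's mapping table and falls back to a replacement character when the char has no mapping.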
 
Jose Botella
Ranch Hand
Posts: 2120
Strings in Java are made of 2-byte Unicode characters.
However, String.getBytes() maps those Unicode characters onto a given charset. Read here about the charsets supported in Java. Depending on the target charset, the length of the resulting byte array will vary.
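That varying length is easy to observe for a single character. A quick sketch using the charset names Java ships with:

```java
public class CharsetLengths {
    public static void main(String[] args) throws Exception {
        String omega = "\u03a9";   // one char in the String, two bytes internally
        System.out.println(omega.getBytes("US-ASCII").length);  // 1  (a single '?')
        System.out.println(omega.getBytes("UTF-8").length);     // 2  (0xCE 0xA9)
        System.out.println(omega.getBytes("UTF-16").length);    // 4  (2-byte BOM + 2 bytes)
        System.out.println(omega.getBytes("UTF-16BE").length);  // 2  (no BOM)
    }
}
```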
 
Ron Newman
Ranch Hand
Posts: 1056
Why is the UTF-16 encoding of a single-character string 4 bytes long rather than 2?
 
Vin Kris
Ranch Hand
Posts: 154
Read here about the charsets supported in Java

phew !!! this is gonna take some time to fathom.
 
Shishio San
Ranch Hand
Posts: 223

...
The UTF-16 charsets use sixteen-bit quantities and are therefore sensitive to byte order. In these encodings the byte order of a stream may be indicated by an initial byte-order mark represented by the Unicode character '\uFEFF'. Byte-order marks are handled as follows:
When decoding, the UTF-16BE and UTF-16LE charsets ignore byte-order marks; when encoding, they do not write byte-order marks.
When decoding, the UTF-16 charset interprets a byte-order mark to indicate the byte order of the stream but defaults to big-endian if there is no byte-order mark; when encoding, it uses big-endian byte order and writes a big-endian byte-order mark.
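This quoted behavior answers Ron's question: Java's "UTF-16" encoder writes a big-endian byte-order mark first, so a one-character string encodes to four bytes. A small check, dumping the bytes in hex:

```java
public class BomDemo {
    public static void main(String[] args) throws Exception {
        byte[] b = "A".getBytes("UTF-16");
        for (int i = 0; i < b.length; i++)
            System.out.printf("%02X ", b[i] & 0xFF);   // FE FF 00 41
        System.out.println();
    }
}
```

The first two bytes are the BOM '\uFEFF' in big-endian order; "UTF-16BE" or "UTF-16LE" would skip it and produce only two bytes.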
 
Jose Botella
Ranch Hand
Posts: 2120
Searching IBM developerWorks for Unicode turns up many articles. From one of them, talking about UTF-16:

All of the most common characters in use for all modern writing systems are already represented with 2 bytes. Characters in surrogate space take 4 bytes, but as a proportion of all world text they will always be very rare.
 