something weird

Vin Kris
Ranch Hand

Joined: Jun 17, 2002
Posts: 154
Based on two previous posts that I read, I had a question on basics and this is what I came up with to verify - honestly I'm confused. Would appreciate if somebody whacked me into reality again.

The output is
c:\jcp> java Test
??? ???
unequal
63 63 63
s1 chars - ? ? ?
s2 chars - ? ? ?
"unequal" !!! ??? why? compares chars - both say '?'
63 ?? why?
63 is the ASCII value for '?' which is what gets printed when you try to print the string.
The extended-ASCII value of omega (DOS code page 437) is 234, or 0xEA.
[ October 10, 2002: Message edited by: Vin Kris ]
Dan Chisholm
Ranch Hand

Joined: Jul 02, 2002
Posts: 1865
I think that the API Specification for String.getBytes() provides the answer.

getBytes
public byte[] getBytes()
Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.
The behavior of this method when this string cannot be encoded in the default charset is unspecified. The CharsetEncoder class should be used when more control over the encoding process is required.
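
To illustrate the CharsetEncoder remark, here is a minimal sketch, not taken from the thread. The class name StrictEncodeDemo and the helper encodesCleanly are made up for illustration, US-ASCII stands in for whatever the platform default happens to be, and StandardCharsets assumes Java 7 or later. A freshly created encoder reports unmappable characters instead of silently substituting '?'.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original thread.
public class StrictEncodeDemo {

    // Returns true only if every character of s can be encoded in US-ASCII.
    static boolean encodesCleanly(String s) {
        // A new encoder's default action for unmappable characters is REPORT,
        // so encode() throws instead of writing a replacement byte.
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder();
        try {
            encoder.encode(CharBuffer.wrap(s));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(encodesCleanly("abc"));                // true
        System.out.println(encodesCleanly("\u03a9\u03bb\u03c0")); // false
    }
}
```

With this kind of check the program can decide for itself what to do with characters the target charset cannot hold, rather than relying on unspecified behavior.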


Dan Chisholm
SCJP 1.4
Try my mock exam: http://www.danchisholm.net/
Vin Kris
Ranch Hand

Joined: Jun 17, 2002
Posts: 154
Yeah, I did go there, but got lost after StringCoding.encode(). It wasn't worth pulling my hair out over every step.
The equals() method compares the 2 strings by each character and returns false when there is a mismatch. Here, both the strings return chars '?' and the byte value of the '?' in both the cases is 63 and yet the comparison returns false.
[ October 10, 2002: Message edited by: Vin Kris ]
Jose Botella
Ranch Hand

Joined: Jul 03, 2001
Posts: 2120
Change the print statements to print (int)c[i] and you will see this output:

s1 chars - 937 955 960
s2 chars - 63 63 63

Thus both strings are not equal.
s2 really holds question marks. s1 prints as ??? only because the DOS console cannot display the characters 937 955 960 (omega, lambda, pi).
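
A minimal sketch of the effect described above. It is not the original Test class: the class name QuestionMarkDemo and the helper roundTripAscii are hypothetical, and US-ASCII is used explicitly as a stand-in for a default charset that cannot represent Greek letters (StandardCharsets assumes Java 7 or later).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not the code from the first post.
public class QuestionMarkDemo {

    // Encodes s to US-ASCII bytes and decodes it again.
    // Characters US-ASCII cannot represent come back as '?' (63).
    static String roundTripAscii(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    // Joins the numeric values of the chars in s, separated by spaces.
    static String charCodes(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (sb.length() > 0) sb.append(' ');
            sb.append((int) c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s1 = "\u03a9\u03bb\u03c0";   // omega, lambda, pi
        String s2 = roundTripAscii(s1);     // three real question marks
        System.out.println(s1.equals(s2) ? "equal" : "unequal"); // unequal
        System.out.println("s1 chars - " + charCodes(s1));       // 937 955 960
        System.out.println("s2 chars - " + charCodes(s2));       // 63 63 63
    }
}
```

Printing the numeric values sidesteps the console entirely, which is why the difference between the two strings becomes visible.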


SCJP2. Please Indent your code using UBB Code
Vin Kris
Ranch Hand

Joined: Jun 17, 2002
Posts: 154
Thanks a bunch Jose, I sure missed it there.
But I have another question here -
A character consumes 2 bytes to represent itself. But a String of a single char gives only 1 byte with the method getBytes(). So, is a single byte interpreted in a different way in case of strings? Is it an unsigned byte that is used and finally mapped to a char set maintained by the java lib?
I ask because, even though the omega character has an extended-ASCII (code page 437) value of 234, \u03a9 (937) also represents omega. Casting 937 to byte gives -87, whereas getBytes() for this character returns 63. I guess a lot of stuff happens inside encode() and decode(). Can you give a short and simple explanation, something that doesn't make me go through the StringCoding class? Thanks.
Jose Botella
Ranch Hand

Joined: Jul 03, 2001
Posts: 2120
Strings in Java are made of 2-byte Unicode characters.
However, String.getBytes() maps these Unicode characters to a given charset. Read here about the charsets supported in Java. Depending on the target charset, the length of the resulting byte array will vary.
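
A small sketch of how the byte count depends on the target charset. The class name ByteLengthDemo and the helper encodedLength are hypothetical, and the charsets are picked only as examples (StandardCharsets assumes Java 7 or later).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration only.
public class ByteLengthDemo {

    // Number of bytes produced when s is encoded in the given charset.
    static int encodedLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String omega = "\u03a9"; // one char, value 937

        // US-ASCII cannot represent omega, so it is replaced by '?' (63).
        System.out.println(omega.getBytes(StandardCharsets.US_ASCII)[0]);         // 63
        System.out.println(encodedLength(omega, StandardCharsets.US_ASCII));      // 1
        System.out.println(encodedLength(omega, StandardCharsets.UTF_8));         // 2
        System.out.println(encodedLength(omega, StandardCharsets.UTF_16BE));      // 2

        // A plain cast is not an encoding: it just keeps the low 8 bits.
        System.out.println((byte) 937);                                           // -87
    }
}
```

The cast and getBytes() give different answers because the cast truncates the 16-bit value, while getBytes() looks the character up in the target charset and substitutes '?' when it is not there.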
Ron Newman
Ranch Hand

Joined: Jun 06, 2002
Posts: 1056
Why is the UTF-16 encoding of a single-character string 4 bytes long rather than 2?


Ron Newman - SCJP 1.2 (100%, 7 August 2002)
Vin Kris
Ranch Hand

Joined: Jun 17, 2002
Posts: 154
Read here about the charsets supported in Java

phew !!! this is gonna take some time to fathom.
Shishio San
Ranch Hand

Joined: Aug 29, 2002
Posts: 223

...
The UTF-16 charsets use sixteen-bit quantities and are therefore sensitive to byte order. In these encodings the byte order of a stream may be indicated by an initial byte-order mark represented by the Unicode character '\uFEFF'. Byte-order marks are handled as follows:
When decoding, the UTF-16BE and UTF-16LE charsets ignore byte-order marks; when encoding, they do not write byte-order marks.
When decoding, the UTF-16 charset interprets a byte-order mark to indicate the byte order of the stream but defaults to big-endian if there is no byte-order mark; when encoding, it uses big-endian byte order and writes a big-endian byte-order mark.
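
The byte-order-mark rules quoted above can be seen directly with a short sketch. The class name BomDemo and the helper hex are made up for this example, and StandardCharsets assumes Java 7 or later.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration only.
public class BomDemo {

    // Encodes s in the given charset and renders the bytes as spaced hex pairs.
    static String hex(String s, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(cs)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF)); // mask to avoid sign extension
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String omega = "\u03a9";
        // "UTF-16" writes a big-endian byte-order mark (FE FF) first.
        System.out.println(hex(omega, StandardCharsets.UTF_16));   // FE FF 03 A9
        // The BE and LE variants write no byte-order mark.
        System.out.println(hex(omega, StandardCharsets.UTF_16BE)); // 03 A9
        System.out.println(hex(omega, StandardCharsets.UTF_16LE)); // A9 03
    }
}
```

So the two extra bytes in the 4-byte result are the byte-order mark, not part of the character itself.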


Whatever doesn't kill us ...
Is probably circling back for another try.
SCJP 1.4
Jose Botella
Ranch Hand

Joined: Jul 03, 2001
Posts: 2120
Searching IBM's developer site for Unicode turns up many articles.
Here is a quote from one of them, talking about UTF-16:

All of the most common characters in use for all modern writing systems are already represented with 2 bytes. Characters in surrogate space take 4 bytes, but as a proportion of all world text they will always be very rare.
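
A short sketch of the 2-byte versus 4-byte point in that quote. The class name SurrogateDemo and the helper utf16Bytes are hypothetical, the musical G clef (U+1D11E) is just one example of a character outside the 16-bit range, and Character.toChars assumes Java 5 or later (StandardCharsets assumes Java 7 or later).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration only.
public class SurrogateDemo {

    // Bytes needed for one code point in UTF-16 (big-endian, no byte-order mark).
    static int utf16Bytes(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        System.out.println(utf16Bytes(0x03A9));  // 2: omega fits in one 16-bit unit
        System.out.println(utf16Bytes(0x1D11E)); // 4: needs a surrogate pair

        String clef = new String(Character.toChars(0x1D11E));
        System.out.println(clef.length());                          // 2 chars
        System.out.println(clef.codePointCount(0, clef.length()));  // 1 character
    }
}
```

Note that length() counts 16-bit chars, so a character in surrogate space counts as two.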
 
 