B&S: Yes... yet more 7 bit/8 bit US ASCII questions

Michal Charemza
Ranch Hand

Joined: Jul 13, 2004
Posts: 86
Hi,

My encoding in the instructions is specified to be 8 bit US ASCII, but (as I have already written in another thread, as have just a few others) in the Charset API it says that in Java US ASCII is 7 bit.

I've done a search on this forum, and more than one person has said (correct me if I'm wrong) that Java will ignore the 8th bit in 7-bit ASCII mode.

I have a few questions which I am unable to work out or find answers to (please direct me to a thread if there is one... I did search but admittedly I did skim the threads):

  • Does specifying 8 bit US ASCII mean that the 8th bit may be used? According to www.asciitable.com, the codes with the 8th bit set include accented letters and other extended characters, and to me it doesn't seem unreasonable that some may be used.
  • Do people think it is acceptable to use a 7-bit encoding when there may be accented characters that would end up unrecognisable with the 8th bit ignored? As "Bodgitt and Scarper" does seem to be based in an English-speaking area, the number of accented characters will probably be small (if any).


  • As an example, I worked out that è, with code 138, would become "NL line feed" (whatever that is), with code 10, if the 8th bit is ignored (I just subtracted 128... is that right?). Others could become other non-printing characters. What effect would this have? When converting to String, what would they become?

    Michal
    [ September 03, 2004: Message edited by: Michal Charemza ]
    Marlene Miller
    Ranch Hand

    Joined: Mar 05, 2003
    Posts: 1391

    The 8-bit value 0xe9 is not in the "US-ASCII" character set. When I specify "US-ASCII", the 8-bit value 0xe9 is converted to the Unicode character \ufffd.

    The String API says "The behavior of this constructor when the given bytes are not valid in the given charset is unspecified."

    (System.out uses the default encoding for Windows cp1250. I think that's why 0xfffd is displayed as '?')
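A minimal sketch of the experiment described above, for anyone who wants to repeat it. The class and method names are made up for illustration, and `StandardCharsets` assumes Java 7 or later. The `String(byte[], Charset)` overload used here is documented to substitute the charset's replacement string for invalid bytes, which is not necessarily true of the older overload whose Javadoc is quoted above. The last line shows what literally clearing the 8th bit of 138 would give, for comparison.

```java
import java.nio.charset.StandardCharsets;

class AsciiHighByteDemo {
    // Decodes a single byte value with the US-ASCII charset via the String constructor.
    static String decodeAscii(int byteValue) {
        return new String(new byte[] {(byte) byteValue}, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String decoded = decodeAscii(0xe9);
        // Print the resulting code point rather than the character itself,
        // so the console encoding cannot hide what happened.
        System.out.printf("0xe9 as US-ASCII -> U+%04X%n", (int) decoded.charAt(0));

        // Clearing the 8th bit by hand is a different operation: 138 & 0x7f is 10.
        System.out.println("138 with the 8th bit cleared = " + (138 & 0x7f));
    }
}
```

So the decoder does not strip the high bit; it replaces the whole byte with U+FFFD.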

    ----

    http://www.coderanch.com/t/182088/java-developer-SCJD/certification/Charset-bit-US-ASCII-BETA

    Got word back from Sun about this. Max Habibi was right: there was a typo in the instructions and it should read "7-bit US ASCII" instead of "8-bit US ASCII".

    posted October 11, 2002 10:20 AM
    Two years ago this was considered a typo.

    ----

    http://www.coderanch.com/t/184181/java-developer-SCJD/certification/NX-Bodgitt-Scarper-data-file

    A very important thread! Look for Andrew's response.
    [ September 04, 2004: Message edited by: Marlene Miller ]
    Michal Charemza
    Ranch Hand

    Joined: Jul 13, 2004
    Posts: 86
    Thanks Marlene for your reply,


    Got word back from Sun about this. Max Habibi was right: there was a typo in the instructions and it should read "7-bit US ASCII" instead of "8-bit US ASCII".

    posted October 11, 2002 10:20 AM
    Two years ago this was considered a typo.


    If it was considered a typo two years ago, does that mean it's a typo now?

    Michal
    Michal Charemza
    Ranch Hand

    Joined: Jul 13, 2004
    Posts: 86
    Originally posted by Marlene Miller:

    The String API says "The behavior of this constructor when the given bytes are not valid in the given charset is unspecified."


    Also, does "unspecified" mean it can do something horrible, like throw a RuntimeException? In your example, it showed a "?", but (and I know I'm going to extremes now) does it mean it can try to bring up a window, sound an alarm bell, and show you big red animated flashing letters saying "OOOPS A DAISY... UNSUPPORTED CHARACTER USED"? I know this seems like madness, but when it says "unspecified", does this give free rein to a Java implementation to do anything it likes?

    Looking at my instructions, it doesn't say that it must run identically on all Java implementations, just that it must run on one of them. Still, I thought it was the point of Java to run the same on all implementations.
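One way to avoid depending on unspecified behaviour is to decode through a `CharsetDecoder` configured to report errors, so that invalid bytes always produce a checked exception. This is only a sketch under that assumption; the class and method names are illustrative and not part of the assignment.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictDecodeDemo {
    // Decodes bytes as US-ASCII, throwing instead of silently substituting characters.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.US_ASCII.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) {
        try {
            System.out.println(decodeStrict(new byte[] {'H', 'i'}));
            // The second call contains a byte with the 8th bit set and is rejected.
            System.out.println(decodeStrict(new byte[] {'H', (byte) 0xe9}));
        } catch (CharacterCodingException e) {
            System.out.println("Rejected: " + e);
        }
    }
}
```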

    Michal
    Richard Jackson
    Ranch Hand

    Joined: Jun 25, 2003
    Posts: 128
    Hi Marlene,

    Nice to see your post.

    In another, similar post of mine, "writeUTF() and writeBytes()", we just discussed the character encoding problem.

    According to Max's and Andrew's previous posts, which of the "US-ASCII" and "UTF-8" charsets should we use?

    I ask because the same sentence appears in our instructions file:
    All text values, and all fields (which are text only), contain only 8 bit characters, null terminated if less than the maximum length for the field. The character encoding is 8 bit US ASCII.
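To make the quoted format concrete, here is a rough sketch of turning one fixed-length field into a String, stopping at the first null byte. The field width, the sample data and all names are hypothetical, and the charset is left as a parameter because which one to use is exactly what this thread is debating.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class FieldReaderDemo {
    // Converts a fixed-length field to a String, stopping at the first
    // null byte if the value is shorter than the field.
    static String readField(byte[] field, Charset charset) {
        int length = 0;
        while (length < field.length && field[length] != 0) {
            length++;
        }
        return new String(field, 0, length, charset);
    }

    public static void main(String[] args) {
        byte[] field = {'D', 'o', 'g', 0, 0, 0, 0, 0}; // hypothetical 8-byte field
        System.out.println("[" + readField(field, StandardCharsets.US_ASCII) + "]");
    }
}
```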


    I hope you or others can reply.


    Regards, Richard
    Marlene Miller
    Ranch Hand

    Joined: Mar 05, 2003
    Posts: 1391
    Hi Michal and Richard,

    >> If it was considered a typo two years ago, does that mean it's a typo now?

    No. For us, it's only a suggestion, a possibility.

    >> Also, does "unspecified" mean it can do something horrible, like throw a RuntimeException? [...] I know this seems like madness, but when it says "unspecified", does this give free rein to a Java implementation to do anything it likes?

    Yes. Probably not something horrible, or the product wouldn't sell. But I would never have guessed 0xfffd.

    >> Still, I thought it was the point of Java to run the same on all implementations.

    The JLS describes the semantics of the language in detail. All implementations must do what the JLS says to do. Otherwise, they can do what they want to do. (I suppose they should also do what the APIs say to do, but those API Javadocs have errors.)

    >> According to Max and Andrew' previous posts, we should use which between "US-ASCII" and "UTF-8" charsets?

    I don't know.

    *If* the data in the file always has the high-order bit of a byte set to 0, either "US-ASCII" or "UTF-8" can be used with String to correctly convert between 8-bit bytes and 16-bit Unicode characters.

    Andrew's point about UTF-8 using all eight bits is convincing, isn't it?

    *If* the data in the file has the high-order bit set to 1, neither "US-ASCII" nor "UTF-8" can be relied on to convert characters correctly.

    *If* the data in the file has the high-order bit set to 1, the encoding is not UTF-8, because UTF-8 would use 16 bits, not 8 bits.

    *If* I were going to handle 8-bit extended US ASCII, the user would set the name of the desired character set in the System properties file, because I don't know the name of the character set.

    I printed a list of the character sets supported on my system. I don't see 8-bit extended US-ASCII. Is that a hint?
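For reference, a list like this can be printed with `Charset.availableCharsets()`. The sketch below is one possible way to do it; the class and helper names are illustrative.

```java
import java.nio.charset.Charset;
import java.util.Set;

class ListCharsetsDemo {
    // Canonical names of every charset this JVM knows about.
    static Set<String> supportedNames() {
        return Charset.availableCharsets().keySet();
    }

    public static void main(String[] args) {
        for (String name : supportedNames()) {
            System.out.println(name);
        }
        // The two candidates discussed in this thread.
        System.out.println("US-ASCII supported: " + Charset.isSupported("US-ASCII"));
        System.out.println("ISO-8859-1 supported: " + Charset.isSupported("ISO-8859-1"));
    }
}
```

The full list varies between JVMs and platforms, so it is worth running this on the machine in question rather than relying on someone else's output.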

    [ September 04, 2004: Message edited by: Marlene Miller ]
    Michal Charemza
    Ranch Hand

    Joined: Jul 13, 2004
    Posts: 86
    Originally posted by Marlene Miller:

    *If* the data in the file has the high-order bit set to 1, neither "US-ASCII" nor "UTF-8" can be relied on to convert characters correctly.

    *If* the data in the file has the high-order bit set to 1, the encoding is not UTF-8, because UTF-8 would use 16 bits, not 8 bits.


    Yes. I think I agree with these points.

    Also, as I mentioned in this thread (well, more of a post than a thread really), UTF-8 may write characters that take up 16 bits, and thus make it definitely violate the format of the data file.

    So I don't think that the decision is between UTF-8 and US-ASCII, it's between ISO-8859-1 and US-ASCII, which leads me to my next point...

    Originally posted by Marlene Miller:

    *If* I were going to handle 8-bit extended US ASCII, the user would set the name of the desired character set in the System properties file, because I don't know the name of the character set.


    I think ISO-8859-1 is the extended US ASCII set. It contains US ASCII, and uses the 8th bit for extended characters (such as letters with accents).
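A quick way to check the "one byte per character, all 8 bits used" property is to round-trip every possible byte value through ISO-8859-1. This is a sketch with illustrative names; it says nothing about whether ISO-8859-1 matches the table on www.asciitable.com, only that each byte maps to exactly one character and back.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Latin1RoundTripDemo {
    // Returns true if every byte value 0..255 decodes to the char with the
    // same numeric value and encodes back to the original byte.
    static boolean roundTripsAllBytes() {
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        String decoded = new String(all, StandardCharsets.ISO_8859_1);
        for (int i = 0; i < 256; i++) {
            if (decoded.charAt(i) != i) {
                return false;
            }
        }
        return Arrays.equals(all, decoded.getBytes(StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) {
        System.out.println("ISO-8859-1 round-trips all 256 byte values: " + roundTripsAllBytes());
    }
}
```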

    Michal
    Philippe Maquet
    Bartender

    Joined: Jun 02, 2003
    Posts: 1872
    Hi Marlene and Michal,

    Richard:
    >> According to Max and Andrew' previous posts, we should use which between "US-ASCII" and "UTF-8" charsets?

    Marlene:
    I don't know.

    *If* the data in the file always has the high-order bit of a byte set to 0, either "US-ASCII" or "UTF-8" can be used with String to correctly convert between 8-bit bytes and 16-bit Unicode characters.

    Andrew's point about UTF-8 using all eight bits is convincing, isn't it?

    For once, some point of Andrew that didn't convince me...

    As UTF-8 encodes characters either on one byte or two bytes depending on the value of the character to be encoded, using UTF-8 is just a good way of taking the risk of very easily corrupting the file:
  • Put any character - falling outside the range of characters UTF-8 encodes on one byte - in some String field value at full length (i.e. 10 characters if the field's length is 10)
  • Save it in the file

  • Guaranteed result: you either corrupt the next field's value or the first field's value of the next record, but in any case you corrupt the file.
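The scenario in the steps above can be checked with a few lines. This is a sketch only: the 10-byte field width and the sample value are made up, and the names are illustrative.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class FieldOverflowDemo {
    static final int FIELD_LENGTH = 10; // hypothetical field width in bytes

    // Number of bytes the value occupies once encoded with the given charset.
    static int encodedLength(String value, Charset charset) {
        return value.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String value = "Caf\u00e9 Royal"; // exactly 10 characters, one of them accented
        System.out.println("characters: " + value.length());
        System.out.println("ISO-8859-1 bytes: " + encodedLength(value, StandardCharsets.ISO_8859_1));
        System.out.println("UTF-8 bytes: " + encodedLength(value, StandardCharsets.UTF_8));
        System.out.println("fits in field as UTF-8: "
                + (encodedLength(value, StandardCharsets.UTF_8) <= FIELD_LENGTH));
    }
}
```

The UTF-8 form needs 11 bytes for 10 characters, so writing it unchecked into a 10-byte field would spill into whatever follows.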

    I'd personally stay safely with the "US-ASCII" encoding scheme stated in the instructions.

    Regards,

    Phil.
    [ September 05, 2004: Message edited by: Philippe Maquet ]
    Andrew Monkhouse
    author and jackaroo
    Marshal Commander

    Joined: Mar 28, 2003
    Posts: 11279
        

    Hi everyone,

    Originally posted by Michal Charemza:
    If it was considered a typo two years ago, does that mean it's a typo now?


    There are some parts of the assignment that Sun have deliberately left either vague or contradictory (and, when called on it, they have acknowledged that these items are deliberately obfuscated). Sun want to have some areas where candidates can differentiate themselves from other candidates, and where you can show that you can think of issues and resolve them (in other words, you can show that you are a developer and not just a programmer).

    So whether this is still a typo (which I doubt) or a deliberate point of confusion is pretty much irrelevant - it is still something you are going to have to make up your own mind about, and code (and document) accordingly.

    Originally posted by Philippe Maquet:
    For once, some point of Andrew that didn't convince me...

    As UTF-8 encodes characters either on one byte or two bytes depending on the value of the character to be encoded, using UTF-8 is just a good way of taking the risk of very easily corrupting the file:
  • Put any character - falling outside the range of characters UTF-8 encodes on one byte - in some String field value at full length (i.e. 10 characters if the field's length is 10)
  • Save it in the file

  • Guaranteed result: you either corrupt the next field's value or the first field's value of the next record, but in any case you corrupt the file.


    Nope - UTF-8 is 8 bit only. If it was UTF-16 you would be correct: you could be using 1 or 2 bytes and run the risk of corrupting your data. But UTF-8 will always be 8 bits or 1 byte.

    Regards, Andrew


    Marlene Miller
    Ranch Hand

    Joined: Mar 05, 2003
    Posts: 1391
    Thank you Phil and Andrew.

    >> I think ISO-8859-1 is the extended US ASCII set. It contains US ASCII, and uses the 8th bit for extended characters (such as with accents)

    Hi Michal,

    Compare www.asciitable.com with http://www.unicode.org/charts/PDF/U0000.pdf and http://www.unicode.org/charts/PDF/U0080.pdf. The characters with values between 129 and 255 are not the same. However I think ISO-8859-1 and Unicode are the same for characters between 129 and 255. There appear to be (at least) two ways to "extend" the 7-bit ASCII character set.

    ----
    A test to understand UTF-8 and UTF-16.

    > java Test m
    m 6d US-ASCII: 6d
    m 6d ISO-8859-1: 6d
    m 6d UTF-8: 6d
    m 6d UTF-16: fffffffe ffffffff 0 6d

    > java Test é
    é e9 US-ASCII: 3f
    é e9 ISO-8859-1: ffffffe9
    é e9 UTF-8: ffffffc3 ffffffa9
    é e9 UTF-16: fffffffe ffffffff 0 ffffffe9

    > java Test ø
    ø f8 US-ASCII: 3f
    ø f8 ISO-8859-1: fffffff8
    ø f8 UTF-8: ffffffc3 ffffffb8
    ø f8 UTF-16: fffffffe ffffffff 0 fffffff8

    (The byte value is converted to an int in toHexString(). The high-order bit of the byte is extended. The value of the byte before the conversion is only the right most two hex digits.)
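The source of the test program was not posted, so the following is only a guess at its shape: encode the argument in each charset and print the bytes in hex. The class and method names are invented. Unlike the output above, this version masks each byte with 0xff before formatting, which avoids the sign extension and prints at most two hex digits per byte.

```java
import java.nio.charset.Charset;

class EncodingTest {
    // Formats bytes as space-separated hex; masking with 0xff drops the sign extension.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(Integer.toHexString(b & 0xff));
        }
        return sb.toString();
    }

    // Encodes the string with the named charset and returns the bytes in hex.
    static String encodeHex(String s, String charsetName) {
        return hex(s.getBytes(Charset.forName(charsetName)));
    }

    public static void main(String[] args) {
        String s = args.length > 0 ? args[0] : "\u00e9";
        for (String name : new String[] {"US-ASCII", "ISO-8859-1", "UTF-8", "UTF-16"}) {
            System.out.println(s + " " + Integer.toHexString(s.charAt(0)) + " "
                    + name + ": " + encodeHex(s, name));
        }
    }
}
```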

    Marlene
    [ September 05, 2004: Message edited by: Marlene Miller ]
    Michal Charemza
    Ranch Hand

    Joined: Jul 13, 2004
    Posts: 86
    Originally posted by Andrew Monkhouse:

    But UTF-8 will always be 8 bits or 1 byte.


    Are you sure? The UTF-8 and Unicode FAQ seems to say otherwise. There's a table about 1/4 of the way down the page that gives Unicode values and the corresponding UTF-8 byte sequences. If the 8th bit is set, it seems to say it will be a multi-byte character (I think someone else said this in another thread somewhere as well).

    Am I confused?

    Michal
    Philippe Maquet
    Bartender

    Joined: Jun 02, 2003
    Posts: 1872
    Hi Andrew,

    Nope - UTF-8 is 8 bit only. If it was UTF-16 you would be correct: you could be using 1 or 2 bytes and run the risk of corrupting your data. But UTF-8 will always be 8 bits or 1 byte.

    You may believe that, OK, but that's incorrect for both the encoding schemes you mention: UTF-8 may use a varying number of octets, while UTF-16 is a fixed length encoding scheme which uses two bytes for any character.

    Here is the link you provided yourself in this thread: UTF-8 RFC. And from that link, you provide the following excerpt:
    By the way: UTF-8 "uses all bits of an octet, but has the quality of preserving the full US-ASCII range: US-ASCII characters are encoded in one octet having the normal US-ASCII value, and any octet with such a value can only stand for an US-ASCII character, and nothing else."

    Now if you look two paragraphs further in the same document, you read:
    UTF-8 encodes UCS-2 or UCS-4 characters as a varying number of
    octets, where the number of octets, and the value of each, depend on the integer value assigned to the character in ISO/IEC 10646. This
    transformation format has the following characteristics (all values are in hexadecimal)
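The "varying number of octets" is easy to see by encoding a few sample characters. This is a small sketch with illustrative names. One caveat worth adding to the discussion: it also shows that UTF-16 is not strictly fixed-length either, since characters outside the Basic Multilingual Plane take four bytes (a surrogate pair).

```java
import java.nio.charset.StandardCharsets;

class EncodedLengthDemo {
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // UTF_16BE is used so that no byte-order mark is added to the count.
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        // U+006D (m), U+00E9 (e acute), U+20AC (euro sign), U+1F600 (an emoji, as a surrogate pair)
        String[] samples = {"m", "\u00e9", "\u20ac", "\uD83D\uDE00"};
        for (String s : samples) {
            System.out.printf("U+%04X: UTF-8 = %d byte(s), UTF-16 = %d byte(s)%n",
                    s.codePointAt(0), utf8Length(s), utf16Length(s));
        }
    }
}
```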


    Best regards,

    Phil.
    [ September 06, 2004: Message edited by: Philippe Maquet ]
    Andrew Monkhouse
    author and jackaroo
    Marshal Commander

    Joined: Mar 28, 2003
    Posts: 11279
        

    Hi everyone,

    Sorry - I was wrong, and Phil (and others) are correct. UTF-8 can be multi byte.

    Sorry for any confusion.

    Regards, Andrew
    peter wooster
    Ranch Hand

    Joined: Jun 13, 2004
    Posts: 1033
    Originally posted by Andrew Monkhouse:
    Hi everyone,

    Sorry - I was wrong, and Phil (and others) are correct. UTF-8 can be multi byte.

    Sorry for any confusion.

    Regards, Andrew


    I got really tired of this 7 vs 8 bit discussion and sent the following message to Sun:

    "I have a question about the specifications for the URLyBird project version 1.2.3. The Data file format states "The character encoding is 8 bit US ASCII". This is not a standard encoding supported by Java. I have considered using either charset "US-ASCII" which is 7 bit or "ISO-8859-1" which is 8 bit. There has been a lot of discussion on the Javaranch forum about this, but a lot of confusion remains. It has been suggested that this is a typo and should have read "7 bit US ASCII".

    /thank you"

    and got the following, very short reply, that should end this discussion.

    "Use ISO-8859-1"
    Marlene Miller
    Ranch Hand

    Joined: Mar 05, 2003
    Posts: 1391
    Thank you Peter.

    So, the ideal programmer would realize that 8-bit US ASCII means the customer wants some 8-bit encoding scheme that includes or is like 7-bit US ASCII. The ISO Latin Alphabet No. 1 is one such encoding scheme and the one guaranteed to be supported by every implementation of the Java platform. Hard-coding is preferred to a configuration parameter.
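As a rough illustration of the hard-coded approach, writing a field could look something like the sketch below. The names and the field width are hypothetical, and the null padding follows the format sentence quoted earlier in the thread. Note that characters outside ISO-8859-1 are silently encoded as '?' by `getBytes`, so a stricter version might want to reject them.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class FieldWriterDemo {
    // Hard-coded rather than configurable, following the reply from Sun.
    static final Charset DATA_CHARSET = StandardCharsets.ISO_8859_1;

    // Encodes a value into a fixed-length field, padded with null bytes.
    static byte[] writeField(String value, int fieldLength) {
        byte[] encoded = value.getBytes(DATA_CHARSET);
        if (encoded.length > fieldLength) {
            throw new IllegalArgumentException(
                    "Value does not fit in " + fieldLength + " bytes");
        }
        return Arrays.copyOf(encoded, fieldLength); // copyOf pads with zero bytes
    }

    public static void main(String[] args) {
        byte[] field = writeField("Caf\u00e9", 6); // hypothetical 6-byte field
        System.out.println(Arrays.toString(field));
    }
}
```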

    Or maybe the person at Sun who answered Peter has a European relative.
    [ September 10, 2004: Message edited by: Marlene Miller ]
     