I have to apply several string functions (i.e., indexOf, endsWith, etc.) to an input stream that is coming in as ISO-8895-1. Any suggestions? (i.e., is there a reference such as a "yellow card" for developers that can be used to find out how to represent the string "abc", for example, as 8895-1 programmatically?). I couldn't readily find anything. Thanks.
Hello! I'm not sure whether it helps. There is a method in the Java String class to get the String's bytes in a specific encoding:
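The method referred to is presumably String.getBytes(String charsetName), together with the matching constructor for going the other way. A minimal sketch of how it might be used (class and variable names are mine):

```java
import java.io.UnsupportedEncodingException;

public class GetBytesDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String s = "abc";
        // Encode the String into bytes using a named charset;
        // ISO-8859-1 uses one byte per character.
        byte[] latin1 = s.getBytes("ISO-8859-1");
        System.out.println(latin1.length); // 3

        // Round-trip back to a String using the matching constructor.
        String back = new String(latin1, "ISO-8859-1");
        System.out.println(back.equals(s)); // true
    }
}
```

Both the method and the constructor throw UnsupportedEncodingException if the named charset isn't available in that JVM.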
As charsetName, you can use a number of charset names; which ones are available depends on your concrete JDK implementation. Consult the documentation (i18n). The string "ISO-8859-1", for example, is valid on any JVM implementation according to the specification. I hope it helps. Greetings from Hamburg, Stefan
Author of German LDAP-Book
Committer at Apache Directory Project
Are you sure you don't mean ISO 8859-1 (instead of 8895-1)? Unicode is a superset of 8859-1, so the numeric codes representing "abc" in 8859-1 are the same as the numeric codes representing "abc" in Unicode, except that (traditional 16-bit) Unicode uses two bytes per character, whereas 8859-1 uses one byte per character. 8859-1 is also known as Latin-1. For Unicode values, see http://www.unicode.org/charts/. The codes 0x20-0x7E and 0xA0-0xFF comprise 8859-1. These can be found in the Basic Latin and Latin-1 Supplement pages. For more information on 8859-1, see The ISO Latin 1 character repertoire - a description with usage notes.
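That superset relationship is easy to check in code: for any character in the Latin-1 range, the unsigned byte value of its 8859-1 encoding equals its Unicode code point. A small sketch (the sample string is mine):

```java
public class Latin1Unicode {
    public static void main(String[] args) throws Exception {
        // \u00e9 is 'é' (U+00E9), which lies within the Latin-1 range.
        String s = "abc\u00e9";
        byte[] bytes = s.getBytes("ISO-8859-1"); // one byte per character

        for (int i = 0; i < s.length(); i++) {
            // Masking with 0xFF gives the unsigned byte value, which
            // matches the Unicode code point of the same character.
            System.out.println((bytes[i] & 0xFF) + " == " + (int) s.charAt(i));
        }
    }
}
```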
In addition to the getBytes(String) method in String (and the matching new String(byte[], String) constructor), don't forget InputStreamReader and OutputStreamWriter, whose constructors allow you to specify the charset:
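A sketch of both in use, writing and reading through an in-memory buffer so the example is self-contained (the buffer and sample text are mine):

```java
import java.io.*;

public class StreamCharsetDemo {
    public static void main(String[] args) throws IOException {
        // Write "abc" plus a Latin-1 accented character, naming the
        // charset explicitly instead of relying on the platform default.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        Writer out = new OutputStreamWriter(buf, "ISO-8859-1");
        out.write("abc\u00e9");
        out.close();

        // Read it back, again naming the charset explicitly.
        BufferedReader in = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(buf.toByteArray()), "ISO-8859-1"));
        System.out.println(in.readLine());
        in.close();
    }
}
```

The same pattern works with FileInputStream/FileOutputStream in place of the byte-array streams; FileReader and FileWriter, by contrast, always use the platform default encoding.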
"I'm not back." - Bill Harding, Twister
Thanks Stefan and Jim. I've worked out my problem, but your suggestions might have been another approach. And John, you're right, I meant ISO-8859-1. I was using the MultipartRequest servlet and did not notice it now has a new constructor that includes an encoding parameter.

However, strangely enough, when reading the file upon receipt I still had to set the encoding on an InputStreamReader to UTF-16, rather than using a FileReader, since it still processed the file as Cp1252 (i.e., ISO-8859-1). This works fine with my string functions. What I don't understand is: if ISO-8859-1 is 8 bits, then why would setting the encoding to UTF-8 on the InputStreamReader, which I did initially, throw a MalformedInputException? I'm using JDK 1.3 and I couldn't even find a reference to that class.

John, I think we share the same misconception about ISO-8859-1. It may be only 8 bits, but it still requires 2 bytes (perhaps the high-order bit is reserved). At least this is what happens in my container. It also appears that Sun has introduced a new package, java.nio, in 1.4 that would facilitate the approach suggested by Stefan and Jim.
ISO 8859-1 specifies a one-byte-per-character encoding. If setting the encoding on the InputStreamReader to UTF-16 allowed you to read the data correctly, then the data is encoded as UTF-16, not as ISO-8859-1. As you said, UTF-16 uses two bytes per character. The numeric values of the characters common to both are the same, but the number of bits used for encoding is different.

Here is why setting the encoding to UTF-8 will produce a MalformedInputException if the data is not really UTF-8. UTF-8 is a variable-width encoding that encodes some (16-bit) Unicode values using 8 bits, some using 16 bits, and some using 24 bits. If you try to read data that is encoded as ISO-8859-1 (or UTF-16) as though it were UTF-8, you will typically get a MalformedInputException if the data includes non-ASCII characters (that is, if the data includes bytes with the upper bit set), because any byte with a value of 128-255 will be interpreted as part of the multi-byte representation of a single character. If you look at how UTF-8 encodes characters, you will see why this occurs. For a description of UTF-8, see Markus Kuhn's What is UTF-8.

sun.io.MalformedInputException is a subclass of java.io.CharConversionException. It is one of a number of implementation-specific subclasses of java.io.IOException.

Note that it is easy to create UTF-16 files on Windows NT, 2000, and XP, which are Unicode-oriented (but still support Cp1252), while Windows 95, 98, and, I think, ME were Cp1252-oriented (or some other language-dependent code page instead of Cp1252). UTF-8 is more frequently used for networking, since it can represent the same characters as UTF-16 but usually uses fewer bytes to do so. UTF-8 is less frequently used in files or RAM, since having a variable number of bytes per character makes the data impractical for random access.
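This failure mode is easy to reproduce. The sketch below uses the public java.nio.charset API (introduced in 1.4; the sun.io class mentioned above is the JDK 1.3-era internal equivalent): a byte like 0xE9, which is a complete character in ISO-8859-1, announces a three-byte sequence in UTF-8, so decoding it in isolation is rejected as malformed. The sample text is mine:

```java
import java.nio.ByteBuffer;
import java.nio.charset.*;

public class Utf8Mismatch {
    public static void main(String[] args) throws Exception {
        // "café" in ISO-8859-1 is the bytes 63 61 66 E9. The lone 0xE9
        // is an illegal UTF-8 sequence: its bit pattern (1110 1001)
        // promises two continuation bytes that never arrive.
        byte[] latin1 = "caf\u00e9".getBytes("ISO-8859-1");

        CharsetDecoder utf8 = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT);
        try {
            utf8.decode(ByteBuffer.wrap(latin1));
        } catch (MalformedInputException e) {
            System.out.println("malformed sequence, length "
                    + e.getInputLength());
        }
    }
}
```

InputStreamReader typically reports the same condition when configured with a charset that doesn't match the bytes, which is what happened in the original question.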
John, thanks for the clarification. I was obviously pretty confused this morning. Also, I finally had time to check your links. The references were great--exactly what I was looking for. Thanks again.
A few more comments:

ISO-8859-1 (like Cp1252) is definitely one byte per character. If that's not what you're seeing, something else is going on. Maybe it's not really ISO-8859-1, or maybe it is, but you're misinterpreting something else you saw. What makes you think your data takes more than one byte?

The InputStreamReader and getBytes() methods have been in Java a while; they don't require 1.4. I'm guessing you're thinking of java.nio.charset.Charset, which expands on the character-encoding concept and allows you to use it in a number of new ways. But this has been possible since Java 1.1 at least.

Cp1252 is not quite the same thing as ISO-8859-1. Byte values of 0x80-0x9F are undefined in ISO-8859-1 but are defined in Cp1252 - see here. If your text uses any of these, it should really be treated as Cp1252, not ISO-8859-1. Cp1252 may not be on the list of encodings that are guaranteed to be there, but it's pretty darn ubiquitous, so give it a try. (It's on any JDK shipped by Sun nowadays, even the Unix versions.)

"If setting the encoding on the InputStreamReader to UTF-16 allowed you to read the data correctly, then the data is encoded as UTF-16, not as ISO-8859-1." Ummm, maybe. It sounds to me like Elizabeth may have switched to UTF-16 just to avoid the exceptions she was getting when using UTF-8. If that's the case, it's possible there's another encoding that will provide a more accurate translation (like, say, Cp1252). If you just look at the basic Roman-alphabet characters, most encodings are indistinguishable, after all. Although I would think that if something were treated as UTF-16 when it isn't really, that would result in a bunch of completely erroneous characters in the actual result: no errors thrown, but the output would most likely be completely and obviously wrong. Nonetheless, I wouldn't be too sure it's really UTF-16 at this point; too many things seem uncertain.
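The 0x80-0x9F difference is easy to demonstrate: the same byte decodes to a curly quotation mark under Cp1252 but to a C1 control character under ISO-8859-1. A small sketch (the byte values are my choice):

```java
import java.nio.charset.Charset;

public class Cp1252VsLatin1 {
    public static void main(String[] args) {
        // 0x93 and 0x94 are "smart quotes" in Cp1252, but fall in the
        // range that ISO-8859-1 maps to C1 control codes.
        byte[] data = {(byte) 0x93, 'h', 'i', (byte) 0x94};

        String asCp1252 = new String(data, Charset.forName("windows-1252"));
        String asLatin1 = new String(data, Charset.forName("ISO-8859-1"));

        System.out.println((int) asCp1252.charAt(0)); // 8220 (U+201C, left double quote)
        System.out.println((int) asLatin1.charAt(0)); // 147  (U+0093, a control code)
    }
}
```

So if decoded text shows control characters (or blanks) where quotes, dashes, or the Euro sign should be, that is a strong hint the data is really Cp1252.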