
Weird Encoding

Tapan Parikh

Joined: Jun 28, 2001
Posts: 26
My data is stored in the database in a strange encoding - ISCII, an 8-bit encoding for Indian languages. Basically this code uses the first 7 bits for normal ASCII codes, and then uses the 8th bit for special Indian characters.
I'm using JDBC to extract this data, and then either (1) storing it to an XML file, or (2) just passing it to another JDBC query to INSERT (into another DB).
In case (1), I get this exception when trying to output to file using OutputStreamWriter... I have tried outputting using a number of encodings - 'UTF8', 'Unicode', 'ASCII' - but not UTF-16, because it doesn't seem to be supported in JDK 1.2 (although I will upgrade to 1.3 if someone thinks that will help somehow). Will I have to dump to a ByteArrayOutputStream first?
Illegal or non-writable character: U+fffe
at xml.EchoHandler.fatal(
at xml.EchoHandler.characters(, Compiled Code)
at JDBCSAXParser.generateSAXEventForColumn(, Compiled Code)
at JDBCSAXParser.generateSAXEventsForRow(, Compiled Code)
at JDBCSAXParser.parse(, Compiled Code)
at test.main(

In case (2), all non-ASCII characters (i.e. true 8-bit characters) in the source get converted to '?' question marks when I form an SQL query string from them.
What's the deal? Can I use some other encoding out there to get around these snafus?
(BTW, for another two cents, putting me down a nickel today - CDAC and ISCII suck. I wish the Indian govt and CDAC would get their act together and get behind Unicode.)
Best, Tap
Omar IRAQI Houssaini
Ranch Hand

Joined: Jul 06, 2001
Posts: 54
Currently I am dealing with Arabic strings, so I have become a little bit familiar with these encoding stories.
The best thing I have learnt from my experience is that, in any Java code, if you want to benefit from the JFCs - especially java.lang.String, java.util.StringTokenizer and many others - and if you want to avoid trouble, then you must deal with strings decoded into Java's internal Unicode representation.
Let us consider a string that was written by an E.T. on an E.T. platform using an E.T. encoding. So now this string is represented as a stream of bytes, and this representation respects the E.T. encoding. Now, to construct a Unicode string from this stream of bytes, it is mandatory that the JVM knows how to convert from the E.T. encoding to Unicode and back again.
Now let us assume that this requirement is fulfilled.
You will do:
try {
    String unicodeString = new String(eTBytes, "E.T.");
} catch (UnsupportedEncodingException ex) {
    System.out.println("sorry, the E.T. encoding is not supported");
}
As soon as you have your unicodeString you can forget about the E.T. encoding, and deal with your string as if you were dealing with "hello world". Now you can insert it into any database, or any XML or WML or whatever type of file you want.
The encoding conversion can also be performed by OutputStreamWriter and InputStreamReader.
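For instance, a minimal sketch of that round trip, using ISO-8859-1 as a stand-in single-byte encoding (since not every JDK ships an ISCII converter):

```java
public class RoundTrip {
    public static void main(String[] args) throws Exception {
        // bytes in a legacy single-byte encoding (ISO-8859-1 here,
        // standing in for whatever "E.T." encoding the JDK knows about)
        byte[] legacyBytes = { 0x48, 0x69, (byte) 0xE9 };   // "Hié" in ISO-8859-1

        // decode legacy bytes -> proper Unicode String
        String unicode = new String(legacyBytes, "ISO-8859-1");

        // from here on it's an ordinary String; re-encode however you like
        byte[] utf8Bytes = unicode.getBytes("UTF-8");
        System.out.println(unicode + " -> " + utf8Bytes.length + " UTF-8 bytes");
    }
}
```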
Now the bad news: I have checked for you whether the Indian encoding (ISCII) is supported by JDK 1.2.2, and I didn't find any reference to it. You can check yourself if you want; here are the supported encodings:
8859_1 ISO 8859-1
8859_2 ISO 8859-2
8859_3 ISO 8859-3
8859_4 ISO 8859-4
8859_5 ISO 8859-5
8859_6 ISO 8859-6
8859_7 ISO 8859-7
8859_8 ISO 8859-8
8859_9 ISO 8859-9
Big5 Big5, Traditional Chinese
CNS11643 CNS 11643, Traditional Chinese
Cp037 USA, Canada(Bilingual, French), Netherlands,
Portugal, Brazil, Australia
Cp1006 IBM AIX Pakistan (Urdu)
Cp1025 IBM Multilingual Cyrillic: Bulgaria, Bosnia,
Herzegovina, Macedonia(FYR)
Cp1026 IBM Latin-5, Turkey
Cp1046 IBM Open Edition US EBCDIC
Cp1097 IBM Iran(Farsi)/Persian
Cp1098 IBM Iran(Farsi)/Persian (PC)
Cp1112 IBM Latvia, Lithuania
Cp1122 IBM Estonia
Cp1123 IBM Ukraine
Cp1124 IBM AIX Ukraine
Cp1125 IBM Ukraine (PC)
Cp1250 Windows Eastern European
Cp1251 Windows Cyrillic
Cp1252 Windows Latin-1
Cp1253 Windows Greek
Cp1254 Windows Turkish
Cp1255 Windows Hebrew
Cp1256 Windows Arabic
Cp1257 Windows Baltic
Cp1258 Windows Vietnamese
Cp1381 IBM OS/2, DOS People's Republic of China (PRC)
Cp1383 IBM AIX People's Republic of China (PRC)
Cp273 IBM Austria, Germany
Cp277 IBM Denmark, Norway
Cp278 IBM Finland, Sweden
Cp280 IBM Italy
Cp284 IBM Catalan/Spain, Spanish Latin America
Cp285 IBM United Kingdom, Ireland
Cp297 IBM France
Cp33722 IBM-eucJP - Japanese (superset of 5050)
Cp420 IBM Arabic
Cp424 IBM Hebrew
Cp437 MS-DOS United States, Australia, New Zealand,
South Africa
Cp500 EBCDIC 500V1
Cp737 PC Greek
Cp775 PC Baltic
Cp838 IBM Thailand extended SBCS
Cp850 MS-DOS Latin-1
Cp852 MS-DOS Latin-2
Cp855 IBM Cyrillic
Cp857 IBM Turkish
Cp860 MS-DOS Portuguese
Cp861 MS-DOS Icelandic
Cp862 PC Hebrew
Cp863 MS-DOS Canadian French
Cp864 PC Arabic
Cp865 MS-DOS Nordic
Cp866 MS-DOS Russian
Cp868 MS-DOS Pakistan
Cp869 IBM Modern Greek
Cp870 IBM Multilingual Latin-2
Cp871 IBM Iceland
Cp874 IBM Thai
Cp875 IBM Greek
Cp918 IBM Pakistan(Urdu)
Cp921 IBM Latvia, Lithuania (AIX, DOS)
Cp922 IBM Estonia (AIX, DOS)
Cp930 Japanese Katakana-Kanji mixed with 4370 UDC,
superset of 5026
Cp933 Korean Mixed with 1880 UDC, superset of 5029
Cp935 Simplified Chinese Host mixed with 1880 UDC,
superset of 5031
Cp937 Traditional Chinese Host mixed with 6204 UDC,
superset of 5033
Cp939 Japanese Latin Kanji mixed with 4370 UDC,
superset of 5035
Cp942 Japanese (OS/2) superset of 932
Cp948 OS/2 Chinese (Taiwan) superset of 938
Cp949 PC Korean
Cp950 PC Chinese (Hong Kong, Taiwan)
Cp964 AIX Chinese (Taiwan)
Cp970 AIX Korean
EUCJIS JIS, EUC Encoding, Japanese
GB2312 GB2312, EUC encoding, Simplified Chinese
GBK GBK, Simplified Chinese
ISO2022CN ISO 2022 CN, Chinese
ISO2022CN_CNS CNS 11643 in ISO-2022-CN form, T. Chinese
ISO2022CN_GB GB 2312 in ISO-2022-CN form, S. Chinese
ISO2022KR ISO 2022 KR, Korean
JIS JIS, Japanese
JIS0208 JIS 0208, Japanese
KOI8_R KOI8-R, Russian
KSC5601 KS C 5601, Korean
MS874 Windows Thai
MacArabic Macintosh Arabic
MacCentralEurope Macintosh Latin-2
MacCroatian Macintosh Croatian
MacCyrillic Macintosh Cyrillic
MacDingbat Macintosh Dingbat
MacGreek Macintosh Greek
MacHebrew Macintosh Hebrew
MacIceland Macintosh Iceland
MacRoman Macintosh Roman
MacRomania Macintosh Romania
MacSymbol Macintosh Symbol
MacThai Macintosh Thai
MacTurkish Macintosh Turkish
MacUkraine Macintosh Ukraine
SJIS Shift-JIS, Japanese

Omar IRAQI Houssaini
Tapan Parikh

Joined: Jun 28, 2001
Posts: 26

So if the JDK doesn't support it, I'm screwed?
I'd just like to get the data in (from an XML file and from a SQL result) and out (back to an XML file or to a SQL query) of the application - I'm not particularly doing any string processing, except the XML parsing.
The first 7 bits are ASCII, so ASCII characters show up fine; it's only when the last bit is on that there is trouble.
Why do I have this feeling in the pit of my stomach that I'm really screwed here...
Jim Yingst

Joined: Jan 30, 2000
Posts: 18671
> So if the JDK doesn't support it, I'm screwed?
Well, if no one supports it, you're screwed. It doesn't have to be Sun though - other companies make products - often free products - for extending Java's capabilities. I know that when I was extracting Shift-JIS (Japanese) text out of an Oracle database, I ended up using a jar file from Oracle to provide the required encoding. (There are apparently a few different dialects of Shift-JIS; Sun's offerings weren't exactly what I needed.) You could try searching/asking around for ISCII/Unicode converters. If nothing else, you could probably write one yourself if you have documentation for what each of the codes is supposed to mean. Look for the equivalent Unicode characters here, starting with the Devanagari scripts. It sounds like you have at most 256 values that need to be mapped, of which the first 128 are identical in ASCII, ISCII, and Unicode. You can surely create a lookup table, or a big switch statement.
However, it appears that Sun does offer what you need - in Java SDK 1.4beta. You will need to add the file jdk1.4/jre/lib/i18n.jar to your classpath. Then you can perform necessary conversions using code such as
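something like this, using the "ISCII91" charset name that ships in the 1.4 i18n.jar (the byte values here are just sample data):

```java
public class IsciiDemo {
    public static void main(String[] args) throws Exception {
        // raw bytes as stored in the database, ISCII-encoded (sample values)
        byte[] isciiBytes = { 0x54, 0x61, 0x70, (byte) 0xCC };

        // decode ISCII bytes into a proper Unicode String
        // (requires the ISCII91 converter from i18n.jar on JDK 1.4)
        String unicode = new String(isciiBytes, "ISCII91");

        // and back again, e.g. just before writing to an ISCII file
        byte[] backAgain = unicode.getBytes("ISCII91");
        System.out.println(unicode.length() + " chars round-tripped to "
            + backAgain.length + " bytes");
    }
}
```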


It is really a good idea to make sure that if you're treating data as Java Strings, they have been properly decoded into Unicode. Otherwise they're just gibberish as far as Java is concerned, and will do strange things when you least expect it. Like if one of the characters just happens to get decoded as a double or single quote, which can confuse a SQL statement if you're not prepared for it. Or worse, if it looks like a control-Z or other control character, which is simply forbidden in an XML file.
Note that you need to be aware of the encoding used at every step in the process, and specify it where possible. Never trust a constructor like FileReader() or FileWriter(), which uses the system default encoding - often not what you need. Instead work an InputStreamReader or OutputStreamWriter into the process, which allow you to specify the encoding. Or use the String(byte[], String) constructor, or the String getBytes(String) method. That way you can choose to read in something encoded as ISCII, and write it elsewhere using UTF-8, or vice versa.
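A quick sketch of that advice - writing and reading a file while naming the encoding at both ends, instead of trusting the platform default:

```java

public class ExplicitEncodings {
    public static void main(String[] args) throws Exception {
        File f = File.createTempFile("enc-demo", ".txt");

        // write the file as UTF-8, naming the encoding explicitly
        OutputStreamWriter out =
            new OutputStreamWriter(new FileOutputStream(f), "UTF-8");
        out.write("caf\u00E9");
        out.close();

        // read it back, again naming the encoding rather than trusting the default
        InputStreamReader in =
            new InputStreamReader(new FileInputStream(f), "UTF-8");
        char[] buf = new char[16];
        int n =;
        in.close();

        System.out.println(new String(buf, 0, n));   // prints "café"
        f.delete();
    }
}
```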

"I'm not back." - Bill Harding, Twister
Tapan Parikh

Joined: Jun 28, 2001
Posts: 26
Seems like I'm really screwed. Using ResultSet.getBytes(index) results in the following exception with this encoding:
java.sql.SQLException: [JRun][SQLServer JDBC Driver]The hexadecimal string is invalid.
at allaire.jrun.db.jdbc.base.BaseExceptions.getException(Unknown Source)
at allaire.jrun.db.jdbc.base.BaseData.stringToBytes(Unknown Source)
at allaire.jrun.db.jdbc.base.BaseData.convertToByteArray(Unknown Source)
at allaire.jrun.db.jdbc.base.BaseData.convert(Unknown Source)
at allaire.jrun.db.jdbc.base.BaseData.getData(Unknown Source)
at allaire.jrun.db.jdbc.base.BaseResultSet.getBytes(Unknown Source)
at JDBCSAXParser.generateSAXEventForColumn(
at JDBCSAXParser.generateSAXEventsForRow(
at JDBCSAXParser.parse(
at test.main(
I can't use rs.getString() either, because that gives me question marks (aka '?', 0x3F) where all the funky characters should be...
This seems like a major screw.
Jim Yingst

Joined: Jan 30, 2000
Posts: 18671
Hmmm... actually, another possibility is that getString(int) is returning the correct string, in Unicode, but you just aren't able to display it correctly. Try this method on the String returned from the ResultSet:
public static void showCharValues(String input) {
    int length = input.length();
    for (int i = 0; i < length; i++) {
        char c = input.charAt(i);
        System.out.println("Character " + i + " = "
            + Integer.toHexString(c) + " (\'" + c + "\')");
    }
}
Compare the hexadecimal values you see with this method to the Unicode character maps, and see if the numbers mean what they're supposed to mean. If so, then you can concentrate on figuring out how to display them properly. If not, then you look at the docs for the database and find out how the %$#@ they encoded the data in the first place. I have more followup questions in either case, but let's find out which case it is first.
[This message has been edited by Jim Yingst (edited July 18, 2001).]
Tapan Parikh

Joined: Jun 28, 2001
Posts: 26

OK, there might be (a little) light through the trees... The characters are getting into the program OK (I think). If I use your showCharValues at any point in the program, even after I've done some string manipulation and concatenating, it prints out the correct funky hex codes:
Character 0 = 27 (''')
Character 1 = 54 ('T')
Character 2 = 61 ('a')
Character 3 = 70 ('p')
Character 4 = 61 ('a')
Character 5 = 6e ('n')
Character 6 = 27 (''')
Character 0 = 27 (''')
Character 1 = ffcc ('?')
Character 2 = ffe3 ('?')
Character 3 = ffcf ('?')
Character 4 = ffda ('?')
Character 5 = 20 (' ')
Character 6 = ffcb ('?')
Character 7 = ffda ('?')
Character 8 = ffcf ('?')
Character 9 = ffc2 ('?')
Character 10 = 20 (' ')
Character 11 = ffcc ('?')
Character 12 = ffd8 ('?')
Character 13 = ffda ('?')
Character 14 = ffc1 ('?')
Character 15 = 20 (' ')
Character 16 = ffcc ('?')
Character 17 = ffd8 ('?')
Character 18 = ffe3 ('?')
Character 19 = ffd5 ('?')
Character 20 = 20 (' ')
Character 21 = 27 (''')
But then whenever I try to output - either to another DB through a SQL query using stmt.executeUpdate(...), or to an XML file - all the weird characters get converted to '?', ASCII code 0x3F.
I see this when I view the hex codes for the XML file (using UltraEdit), or if I try to re-extract from the other DB.
This occurs even if I construct my OutputStreamWriter to output to an XML file using ISCII91 encoding (which is dodgy in the first place because I don't know any XML parser that supports ISCII91 - btw, does JDK 1.4 beta have a built-in XMLReader impl? Maybe that's better off in the XML section...)
rgds, tap
Tapan Parikh

Joined: Jun 28, 2001
Posts: 26

No, that doesn't even seem to be right; those hex codes in the ff range don't seem to be valid Unicode encodings for Gujarati... I'm not sure what they are... I'll have to find an ISCII91 encoding chart to see if there are some valid 8-bit ISCII codes in there, but I don't know why they are all prefaced with ff...
Jim Yingst

Joined: Jan 30, 2000
Posts: 18671
I assume you've looked at this chart in particular, and don't recognize any of the characters as valid for Gujarati? This table seems to have an assortment of characters from different alphabets; I have no idea what its purpose is - but if you don't recognize the characters for the above hex codes, then forget it. Between the database and the JDBC driver, something is unable to properly form Java Strings. As a guess, the reason FF keeps appearing as a prefix might be that something somewhere is doing a byte-to-char conversion - anything with a leading bit of 1 gets interpreted as negative, and sign extension has the effect of prepending an FF. OK, try this:
public static String fixBrokenEncoding(String input)
        throws UnsupportedEncodingException {
    int length = input.length();
    byte[] bytes = new byte[length];
    for (int i = 0; i < length; i++) {
        char c = input.charAt(i);
        byte b = (byte) c; // get rid of first 8 bits; they're meaningless
        bytes[i] = b;
        System.out.println("Character " + i + " = "
            + Integer.toHexString(c) + " (\'" + c
            + "\'\tconverted to " + Integer.toHexString(b)
            + " (\'" + (char) b + "\')");
    }
    String result = new String(bytes, "ISCII");
    System.out.println("Result: " + result);
    return result;
}
Tapan Parikh

Joined: Jun 28, 2001
Posts: 26

No, those chars in the ff range have nothing to do with Gujarati...
I think you're on to something with the sign extension part though... Here's the output from your code (I fixed it so the byte array gets populated properly):
Character 0 = 54 ('T')converted to 54 ('T')
Character 1 = 61 ('a')converted to 61 ('a')
Character 2 = 70 ('p')converted to 70 ('p')
Character 3 = 61 ('a')converted to 61 ('a')
Character 4 = 6e ('n')converted to 6e ('n')
Result: Tapan
Character 0 = ffcc ('?')converted to ffffffcc ('?')
Character 1 = ffe3 ('?')converted to ffffffe3 ('?')
Character 2 = ffcf ('?')converted to ffffffcf ('?')
Character 3 = ffda ('?')converted to ffffffda ('?')
Character 4 = 20 (' ')converted to 20 (' ')
Character 5 = ffcb ('?')converted to ffffffcb ('?')
Character 6 = ffda ('?')converted to ffffffda ('?')
Character 7 = ffcf ('?')converted to ffffffcf ('?')
Character 8 = ffc2 ('?')converted to ffffffc2 ('?')
Character 9 = 20 (' ')converted to 20 (' ')
Character 10 = ffcc ('?')converted to ffffffcc ('?')
Character 11 = ffd8 ('?')converted to ffffffd8 ('?')
Character 12 = ffda ('?')converted to ffffffda ('?')
Character 13 = ffc1 ('?')converted to ffffffc1 ('?')
Character 14 = 20 (' ')converted to 20 (' ')
Character 15 = ffcc ('?')converted to ffffffcc ('?')
Character 16 = ffd8 ('?')converted to ffffffd8 ('?')
Character 17 = ffe3 ('?')converted to ffffffe3 ('?')
Character 18 = ffd5 ('?')converted to ffffffd5 ('?')
Character 19 = 20 (' ')converted to 20 (' ')
Result: ?? ?? ?? ??
I'll try to get my hands on an ISCII91 encoding chart to see if I can map these values to ISCII somehow...
(It doesn't seem like the encoding chart is available publicly, and this is the *official* char encoding of the Indian nation...)
Thanks for taking some interest in this, btw...
Jim Yingst

Joined: Jan 30, 2000
Posts: 18671
Oops - I got burned by another sign extension - this time in Integer.toHexString(b), which converts b to int before interpreting it as a hex string. You can replace it with Integer.toHexString(b & 0x000000ff). Or you can just ignore the FF's yourself; I was just trying to make the output prettier.
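A tiny demo of both sign extensions at work, using one of the byte values from the output above:

```java
public class SignExtensionDemo {
    public static void main(String[] args) {
        byte b = (byte) 0xCC;   // one of the "high" ISCII bytes from the thread

        // widening byte -> int sign-extends, so the hex string grows FF's
        System.out.println(Integer.toHexString(b));          // ffffffcc

        // masking with 0xFF keeps just the low 8 bits
        System.out.println(Integer.toHexString(b & 0xFF));   // cc

        // byte -> char goes through int too, yielding the ffcc seen earlier
        char c = (char) b;
        System.out.println(Integer.toHexString(c));          // ffcc
    }
}
```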
Tapan Parikh

Joined: Jun 28, 2001
Posts: 26

Does anyone know a good web site / discussion group / mailing list dealing with international (esp. Indian) fonts / encodings / Unicode / OTF etc.?
Tapan Parikh

Joined: Jun 28, 2001
Posts: 26

OK, we have made some real progress here. It seems if you omit the leading ff (from the sign extension) the rest is a valid ISCII code, and not only that, but in fact the code for the characters I had originally typed.
So now the question remains - how do I read this into a String variable without getting this sign extension? How can I output to a file without getting an exception for an unwritable character? How do I write this value to a DB through a SQL expression executed through JDBC? Hmmm....
Jim Yingst

Joined: Jan 30, 2000
Posts: 18671
Ummm... I already gave code to do that. See the line:
String result = new String(bytes, "ISCII");
However, this will only work if you have JDK 1.4. Otherwise, your options are:
(1) Examine the process by which you obtain the String from the database. If your "String" is still in ISCII, it's meaningless gibberish as far as Java is concerned. This means either the database is being used incorrectly, or the JDBC driver is defective, or you're not extracting the string correctly. Are you using the ResultSet getString() method, or something else? In any event you may be able to fix your extraction method, or replace the driver with a better one.
(2) Write an ISCII-to-Unicode decoder yourself. Loop through each character, look up the correct Unicode value for each one, and append the replacement value to a StringBuffer. It sounds like char values 0-127 are the same in both Unicode and ISCII; you just need to map values 128-255. Tedious, but far from impossible.
(3) Abandon the idea that you will ever have a true String representation of this data. You can still output values to files using ISCII encoding (since that's all you know.) Treat the value as an array of bytes, and use a FileOutputStream to write the bytes, rather than a FileWriter to write characters (since the latter will try to assume that your characters are Unicode characters.) You can declare the XML file to be encoded in ISCII. You may have to put additional effort into finding an XML parser which understands ISCII, but I'm sure they exist. If it's enough to be able to simply output the value to a file, then this will work.
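A minimal sketch of option (2). The two high-range table entries here are made-up placeholders just to show the shape - the real mappings have to come from the ISCII-91 spec:

```java
import java.util.Arrays;

public class IsciiDecoder {
    // index = ISCII byte value - 128; value = corresponding Unicode char
    private static final char[] HIGH = new char[128];
    static {
        Arrays.fill(HIGH, '\uFFFD');      // unmapped -> Unicode replacement char
        HIGH[0xCC - 128] = '\u0AAA';      // placeholder entry, not the real mapping
        HIGH[0xE3 - 128] = '\u0ABE';      // placeholder entry, not the real mapping
    }

    public static String decode(byte[] iscii) {
        StringBuffer sb = new StringBuffer(iscii.length);
        for (int i = 0; i < iscii.length; i++) {
            int b = iscii[i] & 0xFF;      // mask off any sign extension
            sb.append(b < 128 ? (char) b : HIGH[b - 128]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // low bytes pass through as ASCII; high bytes go via the table
        System.out.println(decode(new byte[] { 'T', 'a', 'p', (byte) 0xCC }));
    }
}
```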
Tapan Parikh

Joined: Jun 28, 2001
Posts: 26

OK, I've solved my problems (thanks to Jim), with a somewhat dodgy class. Here it is if anyone wants to see it... It requires the JDK to support ISCII91, so for now it only works with JDK 1.4beta...
I'm somewhat skeptical about handling strings in a non-Unicode representation - fooling Java, so to speak. But I'm not sure what else I can do, since I would like to store data in the DB in ISCII format, JDBC's executeUpdate function requires a String, and rs.getBytes() throws an exception. But I guess it will be OK, since I don't handle the strings in nonstandard formats for very long (only when I've just gotten them from the DB and when I'm about to spit them out to a SQL call)...
Right now I miss C. I wouldn't have felt nearly so dodgy fooling around with char*'s...
What do you guys think?
import*;

public class ISCIIConvertor {

    // takes a String and converts it into a nonstandard ISCII representation
    // (to send to stmt.executeUpdate())
    public static String convertToIscii(String value)
            throws IOException, UnsupportedEncodingException {
        ByteArrayOutputStream bytestream = new ByteArrayOutputStream(value.length());
        OutputStreamWriter writer = new OutputStreamWriter(bytestream, "ISCII91");
        writer.write(value, 0, value.length());
        writer.flush();
        return bytestream.toString();
    }

    // takes a mucked up, sign-extended ISCII string and converts it to Unicode
    public static String fixISCIIToUnicode(String input)
            throws UnsupportedEncodingException {
        int length = input.length();
        byte[] bytes = new byte[length];
        for (int i = 0; i < length; i++) {
            char c = input.charAt(i);
            bytes[i] = (byte) c; // get rid of first 8 bits
        }
        return new String(bytes, "ISCII91");
    }

    public static void showCharValues(String input) {
        int length = input.length();
        for (int i = 0; i < length; i++) {
            char c = input.charAt(i);
            System.out.println("Character " + i + " = "
                + Integer.toHexString(c) + " (\'" + c + "\')");
        }
    }
}
Jim Yingst

Joined: Jan 30, 2000
Posts: 18671
What DB are you using? What data type is the ISCII string stored as in the database? And what method are you using to read the value from the ResultSet in Java - getString()?
I'm thinking that if the database isn't intended to support ISCII directly, it might be better to store the string as some binary-type format rather than as a string. That way getBytes() should work, and we skip over the String-that-isn't-a-real-String. Just a thought...
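A sketch of that idea, with a hypothetical table `words (id INT, txt VARBINARY(255))` - using PreparedStatement with setBytes/getBytes, so no String conversion ever touches the ISCII data:

```java
import java.sql.*;

public class BinaryRoundTrip {

    // store the raw ISCII bytes; setBytes bypasses all charset handling
    static void store(Connection con, int id, byte[] isciiBytes)
            throws SQLException {
        PreparedStatement ps =
            con.prepareStatement("INSERT INTO words (id, txt) VALUES (?, ?)");
        ps.setInt(1, id);
        ps.setBytes(2, isciiBytes);
        ps.executeUpdate();
        ps.close();
    }

    // read the bytes back untouched; decode them (to ISCII) only when needed
    static byte[] load(Connection con, int id) throws SQLException {
        PreparedStatement ps =
            con.prepareStatement("SELECT txt FROM words WHERE id = ?");
        ps.setInt(1, id);
        ResultSet rs = ps.executeQuery();
        byte[] bytes = ? rs.getBytes(1) : null;
        ps.close();
        return bytes;
    }
}
```

This also sidesteps the quoting problem Jim mentioned earlier: with placeholders there is no SQL string for a stray quote character to confuse.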
Tapan Parikh

Joined: Jun 28, 2001
Posts: 26

I'm working with SQL Server 7.0 (using Merant JDBC drivers) and Access 97 (using the JDBC-ODBC Bridge). Basically my app is a data transfer program, transferring to and fro using XML...
Data is stored in the DB as either char, varchar, or text - basically, text-type fields...
Either way, the extracting-from-the-DB part isn't the part I'm most worried about. As long as we get something reasonable there, we can fix the encoding. The part I'm more worried about is sending an ISCII string to stmt.executeUpdate(). If the JDBC driver does some string manipulation I could be screwed...
mou haj
Ranch Hand

Joined: Sep 12, 2001
Posts: 81
Can anybody help me? I have a katakana file (Japanese)... I have to take out some part and display it in the browser, but I'm not getting it... can anybody tell me? ( :O
The code goes like this:
import javax.servlet.*;
import javax.servlet.http.*;
import*;
import java.util.*;

public class showFile3 extends HttpServlet {

    public void init(ServletConfig config) throws ServletException {
    }//end of init

    public void doGet(HttpServletRequest req, HttpServletResponse res)
            throws ServletException, IOException {
    }//end of doGet

    public void doPost(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        try {
            PrintWriter out = new PrintWriter(response.getWriter());
            String encodingType = "JISAutoDetect";
            RandomAccessFile jpTextFile =
                new RandomAccessFile("d:\\filehandling\\kana.txt", "r");
            byte[] fileBytes = new byte[20];
  , 0, 15);
            String fileString = new String(fileBytes, encodingType);
            out.println(" <META HTTP-EQUIV='Content-Type'"
                + " Content='text/html; charset=x-sjis'></head><body>");
        } catch (Exception e) {
            System.err.println("File input error");
        }
    }//post ends
}//class ends