Unable to read Arabic data

Nikhil Bansal
Ranch Hand

Joined: Jan 24, 2005
Posts: 60
Hi All,

I have a file that contains data in EBCDIC format. The data is in English as well as Arabic. My task is to convert this data to UTF-8.

When I read the data from the input file, I can get the corresponding EBCDIC hex values, and by mapping them to the ASCII hex values the conversion for English works.

The problem is with Arabic. For example, there are hex values like 064E and 064F for Arabic characters. When I send them as output, I get junk characters like ?

Please, if anyone has some sample code, post it here. It would be a great help.

Thanks in advance

Nikhil


ban$al
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
My understanding is that there are many EBCDIC encodings possible, and you would need to know exactly which EBCDIC encoding is used here. For Arabic, a common choice is apparently Cp420. This is supported in Java - on older JDK versions you may need to include the file charsets.jar in your classpath. Try something like this:
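
A minimal sketch of that idea (not necessarily the code originally posted here): check that the JVM actually ships the charset, then decode the raw bytes with it. The class and method names `EbcdicDecode` and `decode` are made up for illustration, and `Cp420` is only an assumption about the file's real code page.

```java
import java.nio.charset.Charset;

class EbcdicDecode {
    // Decode raw bytes using the named charset. Fails early with a clear
    // message if this JVM does not provide the charset.
    static String decode(byte[] raw, String charsetName) {
        if (!Charset.isSupported(charsetName)) {
            throw new IllegalArgumentException("Charset not available: " + charsetName);
        }
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // "Cp420" is an assumption about the input; substitute the real code page.
        System.out.println("Cp420 supported: " + Charset.isSupported("Cp420"));
    }
}
```

If the check prints `false`, the extended charsets are probably missing from the runtime, which matches the note about `charsets.jar` above.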

If the encoding is something other than Cp420, you may or may not have to find additional encoding support somewhere. You may find this documentation useful.


"I'm not back." - Bill Harding, Twister
Nikhil Bansal
Ranch Hand

Joined: Jan 24, 2005
Posts: 60
Hi,

Following is my code. Here I read the bytes of data, specify that they are in Cp420 (EBCDIC Arabic) format, and then write them to the output file in UTF-8.

However, there seems to be a problem. Some junk characters are written to the file, especially where the hex value is alphanumeric, for example 8D or 8C. If the hex value is purely numeric, the output is correct.

What am I doing wrong in the code?

Also I need to insert a carriage return after every bytes of data read.

Please help me, guys.

Nikhil

import java.io.*;

public class ReadBinaryData {

    public static void main(String args[]) {
        try {
            File file = new File("D:\\MYDATA.DATASETS");
            InputStream is = new FileInputStream(file);

            File outfile = new File("D:\\testHexFile.txt");
            FileOutputStream fout = new FileOutputStream(outfile);

            String s = null;
            long length = file.length();

            if (length > Integer.MAX_VALUE) {
                System.out.println("File is too large");
                System.exit(0);
            }

            byte[] bytes = new byte[(int) length];

            // Read in the bytes
            int offset = 0;
            int numRead = 0;
            while (offset < bytes.length
                    && (numRead = is.read(bytes, offset, bytes.length - offset)) >= 0) {
                offset += numRead;
            }

            // Ensure all the bytes have been read in
            if (offset < bytes.length) {
                throw new IOException("Could not completely read file " + file.getName());
            }

            // Close the input stream, then decode the bytes as Cp420
            is.close();
            s = new String(bytes, "cp420");

            // Re-encode the decoded text as UTF-8 and write it out
            byte[] output = s.getBytes("UTF-8");

            fout.write(output);
            fout.close();
        } catch (Exception e) {
            System.out.println("Exception e" + e.toString());
        }
    } // End of main

} // End of class
Jesper de Jong
Java Cowboy
Saloon Keeper

Joined: Aug 16, 2005
Posts: 14278
    

Originally posted by Nikhil Bansal:
The problem is with Arabic. For example, there are hex values like 064E and 064F for Arabic characters. When I send them as output, I get junk characters like ?


If you are sure that those codes are the correct Unicode codes for the characters, then the problem is not in the EBCDIC to Unicode conversion.

Of course, you need a font that contains those Unicode characters; otherwise you can't display them. Where is your output going, to a Unicode text file? What software are you using to view the output? Are you using a font that contains the Arabic characters?
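
One way to take fonts and viewers out of the picture is to look at the bytes themselves. The following is a minimal sketch (the `Utf8Hex` class is made up for illustration) that prints the UTF-8 encoding of a string as hex. In a correctly written UTF-8 file, U+064E and U+064F should appear as the byte pairs D9 8E and D9 8F.

```java
import java.nio.charset.StandardCharsets;

class Utf8Hex {
    // Return the UTF-8 encoding of a string as space-separated uppercase hex bytes.
    static String utf8Hex(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex("\u064E")); // U+064E ARABIC FATHA -> D9 8E
        System.out.println(utf8Hex("\u064F")); // U+064F ARABIC DAMMA -> D9 8F
    }
}
```

If a hex dump of the output file shows those byte pairs, the conversion worked and the remaining problem is on the display side.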


Nikhil Bansal
Ranch Hand

Joined: Jan 24, 2005
Posts: 60
Hi Jesper,

I am writing the output to a text file with the encoding specified as UTF-8. I am viewing the file in Notepad and also in Microsoft Word. My OS is Windows XP.

I am viewing the Arabic output with the font Arabic Transparent.

Can you please go through the code? Let me know if I am missing something or doing something wrong.

Regards

Nikhil
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18709
    

You have two steps in your little piece of code there. The first reads the bytes and attempts to convert them to chars using the CP420 charset, and the second converts those chars to bytes using the UTF-8 charset and writes them out.

Personally I would have used an InputStreamReader that specified CP420 and an OutputStreamWriter that specified UTF-8 rather than the low-level byte-fiddling that you have there. But that shouldn't matter, because it should end up with the same result.
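
A rough sketch of that reader/writer approach, for comparison. The class name `StreamConvert` and the command-line handling are made up for illustration, and `Cp420` as the default is an assumption about the input file.

```java
import java.io.*;

class StreamConvert {
    // Copy characters from 'in' (decoded with sourceCharset) to 'out' (encoded as UTF-8).
    static void convert(InputStream in, String sourceCharset, OutputStream out) throws IOException {
        Reader reader = new BufferedReader(new InputStreamReader(in, sourceCharset));
        Writer writer = new BufferedWriter(new OutputStreamWriter(out, "UTF-8"));
        char[] buffer = new char[4096];
        int n;
        while ((n = reader.read(buffer)) != -1) {
            writer.write(buffer, 0, n);
        }
        // Push any buffered characters through to the underlying stream.
        writer.flush();
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: java StreamConvert <input> <output> [charset]");
            return;
        }
        // Cp420 is only an assumed default; pass the real code page as the third argument.
        String charset = args.length > 2 ? args[2] : "Cp420";
        try (InputStream in = new FileInputStream(args[0]);
             OutputStream out = new FileOutputStream(args[1])) {
            convert(in, charset, out);
        }
    }
}
```

Unlike the byte-array version, this does not need to hold the whole file in memory, so the file-size check goes away.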

The problem is that you have "?" appearing in the final result where it should not appear. And this always means an encoding or decoding failure. So, which of the two steps is producing these ? characters?
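
To narrow that down, you could inspect the intermediate String before it is re-encoded. The sketch below (class and method names are made up) counts U+FFFD characters, which is what Java's String decoding normally substitutes for bytes it cannot map. On the encoding side, UTF-8 can represent every Unicode character except unpaired surrogates, so a count above zero here points at the decoding step.

```java
import java.nio.charset.Charset;

class FindBadDecodes {
    // Count characters the decoder substituted because the input bytes were
    // malformed or had no mapping in the given charset (shown as U+FFFD).
    static int countReplacements(byte[] raw, String charsetName) {
        String decoded = new String(raw, Charset.forName(charsetName));
        int count = 0;
        for (int i = 0; i < decoded.length(); i++) {
            if (decoded.charAt(i) == '\uFFFD') {
                count++;
            }
        }
        return count;
    }
}
```

Calling `countReplacements(bytes, "Cp420")` on the file contents and getting zero would suggest the decode is clean and the junk comes from how the output is being viewed.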
Hesham Gneady
Ranch Hand

Joined: Feb 26, 2007
Posts: 66
Hello Nikhil,

I've tried your piece of code. You had just one problem: you chose the wrong encoding for Arabic.
Just replace the "cp420" encoding with "Cp1256" and, insha'Allah, it will work fine.
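
Which of the two names is right depends on how the file was produced, so it may help to decode a small sample both ways and see which one reads as Arabic. A minimal sketch, with made-up names, assuming both charsets are available in your JVM:

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

class TryCharsets {
    // Decode the same bytes with each candidate charset, keeping the order given.
    // Charset.forName throws if a name is not supported by this JVM.
    static Map<String, String> decodeWithEach(byte[] sample, String... charsetNames) {
        Map<String, String> results = new LinkedHashMap<>();
        for (String name : charsetNames) {
            results.put(name, new String(sample, Charset.forName(name)));
        }
        return results;
    }
}
```

For example, `decodeWithEach(firstBytes, "Cp420", "Cp1256")` on the first few dozen bytes of the file gives two strings to compare side by side.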

Best regards ,


Hesham
Nitesh Kant
Bartender

Joined: Feb 25, 2007
Posts: 1638

Please don't wake the zombies (see DontWakeTheZombies).


Hesham Gneady
Ranch Hand

Joined: Feb 26, 2007
Posts: 66
I've already checked the date of the post, and I know it's old. But I was facing the same problem, so I took some time to find a solution.
I know others will Google their way to this post, so I didn't want them to spend as long as I did fixing it.

Just wanted to make the Ranch post better
 