
Reading foreign characters (say, multi-byte characters: Japanese, Turkish, etc.) from a file in Java

Kashish Durgiya
Greenhorn

Joined: Nov 22, 2011
Posts: 5
Hi,

I have a text file in which I have stored some keywords that I want to import into my Java application. I am wrapping a ByteArrayInputStream in an InputStreamReader to read the data. English characters were read fine. The problem is that the file also contains some non-English characters, and instead of these characters my Java application shows a '?' on a black background. I investigated and found that I can set the character encoding for InputStreamReader (for which the default character encoding is UTF-8), so I set the encoding to 'windows-1250' (as listed at http://docs.oracle.com/javase/1.5.0/docs/guide/intl/encoding.doc.html). After this, some foreign characters (such as German) were read correctly, but for others (Turkish, Japanese, etc.) the problem persisted. Please suggest what I can do to read characters of all languages in my application.



Thanks in advance,

Regards,
Kashish.


P.S. I looked (a little) on the Ranch for similar problems before posting this myself; links (on the Ranch itself) to anything I might have missed would also be helpful.
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 38363
    
Where are you displaying these characters? Some terminals have a restricted font set which they can display. The Windows®/DOS command line is particularly bad in this respect. Try displaying it on something like JOptionPane.
Kashish Durgiya
Greenhorn

Joined: Nov 22, 2011
Posts: 5
Hi Campbell,

Thanks for the early reply. I am currently displaying the characters with System.out.println on the console of the Eclipse IDE, running on Ubuntu. The file is a text file, but I created it in MS Excel, saved it in Tab Delimited format (as my application requires), and then copied it to Ubuntu. After that, I have to read the file in Java and display the contents on a web page. My main concern is that even when I print the contents right after reading the file, they are shown as '??'. In fact, even when I inspect the contents while debugging the app, I see only '??'.
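One way to take the console out of the equation is to print through a PrintStream whose encoding is set explicitly instead of relying on the platform default. The sketch below is only an illustration: the class name Utf8Console and its helper methods are made up for this example, and it assumes the Eclipse console itself is set to UTF-8 (as far as I know this is under Run Configurations > Common > Encoding).

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

// Hypothetical helper, not part of the original poster's application.
class Utf8Console {

    // Wrap any OutputStream in an auto-flushing PrintStream that encodes text as UTF-8.
    static PrintStream utf8(OutputStream out) {
        try {
            return new PrintStream(out, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    // Return the exact bytes that would be sent to the console for the given text.
    static byte[] render(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintStream stream = utf8(buffer);
        stream.print(text);
        stream.flush();
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file independent of its own encoding.
        utf8(System.out).println("Turkish: \u011F\u0131\u015F  Japanese: \u65E5\u672C\u8A9E");
    }
}
```

If the text still shows up as '??' when printed this way on a UTF-8 console, the characters were most likely already lost before the println call, either while reading the file or in the file itself.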
Jeff Verdegan
Bartender

Joined: Jan 03, 2004
Posts: 6109
    

Kashish Durgiya wrote: So, instead of these characters, a '?' on a black background shows up in my Java application.


Only if you're displaying them in a Java GUI element. If you're looking at them on a command console or in a text editor, then Java has nothing to do with how they're displayed. In any case, that question mark means either that the encoding used by the displaying console or editor doesn't know about that character, or that the font doesn't have a glyph for that character.
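For what it's worth, Java itself can also produce question marks during conversion, independent of any console or font. A small sketch of the two usual cases (the class QuestionMarkDemo is invented for this illustration): encoding a character into a charset that has no code for it yields a plain '?', and decoding bytes that are not valid in the chosen charset yields U+FFFD, the replacement character, which many fonts draw as a question mark in a dark diamond.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration only.
class QuestionMarkDemo {

    // Encode text into the given charset and decode it again with the same charset.
    // Characters the charset cannot represent come back as '?'.
    static String roundTrip(String text, Charset cs) {
        return new String(text.getBytes(cs), cs);
    }

    // Decode raw bytes with the given charset. Byte sequences that are not valid
    // in that charset come back as U+FFFD, the replacement character.
    static String decode(byte[] raw, Charset cs) {
        return new String(raw, cs);
    }

    public static void main(String[] args) {
        Charset cp1250 = Charset.forName("windows-1250");

        // German u-umlaut exists in windows-1250, so it survives the round trip.
        System.out.println(roundTrip("\u00FC", cp1250));

        // A Japanese kanji has no windows-1250 code, so it is replaced on encoding.
        System.out.println(roundTrip("\u65E5", cp1250));

        // The single byte 0xFC is u-umlaut in windows-1250 but is not valid UTF-8.
        byte[] umlautInCp1250 = { (byte) 0xFC };
        String wronglyDecoded = decode(umlautInCp1250, StandardCharsets.UTF_8);
        System.out.println(wronglyDecoded.indexOf('\uFFFD') >= 0);
    }
}
```

Checking in the debugger whether the char value is 0x3F ('?') or 0xFFFD can hint at which of the two cases applies.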

Kashish Durgiya wrote: I investigated and found that I can set the character encoding for InputStreamReader (for which the default character encoding is UTF-8), so I set the encoding to 'windows-1250' (as listed at http://docs.oracle.com/javase/1.5.0/docs/guide/intl/encoding.doc.html). After this, some foreign characters (such as German) were read correctly, but for others (Turkish, Japanese, etc.) the problem persisted. Please suggest what I can do to read characters of all languages in my application.


You have to know ahead of time what encoding was used to write the file, and use that same encoding to read it.
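As a rough sketch of what "use that same encoding to read it" looks like in code, the example below passes the charset explicitly to InputStreamReader. The class and method names are invented for illustration, and UTF-8 is only an assumed encoding for the sample data; substitute whatever encoding the file was really written in. Note that when no charset is passed, InputStreamReader falls back to the platform default, which is UTF-8 on many Linux systems and on Java 18 and later but not everywhere, so it is safer to name it.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical reader class for illustration only.
class ExplicitCharsetReader {

    // Read all lines from the stream, decoding bytes with the charset the file was written in.
    static List<String> readLines(InputStream in, Charset writtenWith) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, writtenWith))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Sample tab-delimited data, assumed here to have been written as UTF-8.
        byte[] fileBytes = "kap\u0131\t\u65E5\u672C\u8A9E\n".getBytes(StandardCharsets.UTF_8);

        // Same charset for reading as for writing: the text comes back intact.
        List<String> good = readLines(new ByteArrayInputStream(fileBytes), StandardCharsets.UTF_8);
        System.out.println(good.get(0).equals("kap\u0131\t\u65E5\u672C\u8A9E"));

        // A different charset for reading: the same bytes decode to different characters.
        List<String> bad = readLines(new ByteArrayInputStream(fileBytes), Charset.forName("windows-1250"));
        System.out.println(bad.get(0).equals(good.get(0)));
    }
}
```

No single-byte encoding such as windows-1250 covers German, Turkish and Japanese at once, so a file holding all of them would have to be written in a Unicode encoding such as UTF-8 or UTF-16 in the first place.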

Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 38363
    
What happens when you open that tab-delimited file with gedit? Does it ask for an encoding?
Kashish Durgiya
Greenhorn

Joined: Nov 22, 2011
Posts: 5
Jeff,
Jeff Verdegan wrote: In any case, that question mark means either the encoding that the displaying console or editor is using doesn't know about that character, or the font doesn't have a glyph for that character

But even when I debug my application, the character doesn't appear; the question mark appears instead. So am I missing something (some concept) there?

And can you give me some help with your suggestion about knowing ahead of time which encoding was used to write the file? How do I find that out?


Campbell,
gedit shows a pop-up saying that the file (say, "abc.txt") is an executable text file, and asks whether I want to run "abc.txt" or display its contents.
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 38363
    
Click display contents, and see what comes up. Do the characters display correctly? Does it ask for an encoding, or for a line-end style (CR-LF on Windows®/DOS, CR on very old Macs, LF on Unix/Linux/newer Macs)?
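The same two questions (which encoding, which line end) can be roughly checked from Java by looking at the first bytes of the file. The sketch below uses an invented class, FileSniffer, and makes two assumptions: a byte-order mark is only a hint, since many files have none, and the line-end check only makes sense for ASCII-compatible encodings such as UTF-8 or windows-1250, not for UTF-16.

```java
// Hypothetical helper for illustration only.
class FileSniffer {

    // Return the encoding suggested by a leading byte-order mark, or null if there is none.
    // UTF-32 marks are deliberately ignored to keep the sketch short.
    static String bomCharset(byte[] head) {
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        return null;
    }

    // Report the style of the first line end found: "CRLF", "CR", "LF", or "none".
    // Assumes an ASCII-compatible encoding.
    static String lineEnding(byte[] content) {
        for (int i = 0; i < content.length; i++) {
            if (content[i] == '\r') {
                boolean followedByLf = i + 1 < content.length && content[i + 1] == '\n';
                return followedByLf ? "CRLF" : "CR";
            }
            if (content[i] == '\n') {
                return "LF";
            }
        }
        return "none";
    }

    public static void main(String[] args) {
        byte[] sample = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a', '\t', 'b', '\r', '\n' };
        System.out.println(bomCharset(sample));
        System.out.println(lineEnding(sample));
    }
}
```

A null result from bomCharset does not prove anything about the encoding; it only means the file has to be identified some other way, for example from the settings of the program that wrote it.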
Kashish Durgiya
Greenhorn

Joined: Nov 22, 2011
Posts: 5
Campbell,

Thanks for the suggestion. I displayed the contents in gedit, and gedit itself showed '??' instead of the actual characters. I investigated further and found that the file I created stores the characters wrongly; this happens for double-byte (or multi-byte) characters when a file is saved as text (tab-delimited format) from MS Excel. So I am now investigating how to get MS Excel to store those characters properly in the text file (tab-delimited format) in the first place. If you can help me with that, that'd be great.

Thanks!
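A possible direction, offered as an assumption rather than a confirmed fix: as far as I know, Excel also has a "Unicode Text (*.txt)" save format that writes tab-delimited text as UTF-16 with a byte-order mark, which would keep the Japanese and Turkish characters. If the file is saved that way, the Java side could decode it roughly as below. The class name ExcelUnicodeText is made up for this sketch, and it is worth confirming the actual encoding of the exported file before relying on it.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser for illustration only; assumes tab-delimited UTF-16 input with a BOM.
class ExcelUnicodeText {

    // Decode the whole file as UTF-16 and split it into rows and tab-separated cells.
    // Java's "UTF-16" decoder uses the byte-order mark to pick the byte order
    // and falls back to big-endian when there is no mark.
    static List<String[]> parse(byte[] fileBytes) {
        String text = new String(fileBytes, StandardCharsets.UTF_16);
        List<String[]> rows = new ArrayList<>();
        for (String line : text.split("\r?\n")) {
            if (!line.isEmpty()) {
                // The limit of -1 keeps trailing empty cells.
                rows.add(line.split("\t", -1));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Build a small little-endian sample with a byte-order mark in front.
        byte[] body = "\u65E5\u672C\u8A9E\tkap\u0131\r\n".getBytes(StandardCharsets.UTF_16LE);
        byte[] file = new byte[body.length + 2];
        file[0] = (byte) 0xFF;
        file[1] = (byte) 0xFE;
        System.arraycopy(body, 0, file, 2, body.length);

        for (String[] row : parse(file)) {
            System.out.println(row.length + " cells");
        }
    }
}
```

If the exported file already contains literal '?' characters (byte 0x3F) where the foreign text should be, no choice of charset on the Java side can bring the original characters back; the export itself has to change.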
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 38363
    
I don’t know about Excel, I am afraid. Sorry.
 