
Reading foreign characters (multi-byte characters such as Japanese, Turkish, etc.) from a file in Java

 
Kashish Durgiya
Greenhorn
Posts: 5
Hi,

I have a text file in which I have stored some keywords that I want to import into my Java application. I am using a ByteArrayInputStream wrapped in an InputStreamReader to read the data. This worked fine for English characters. The problem is that the file also contains some non-English characters, and instead of those characters my Java application shows '?' on a black background.

I investigated and found that I can set the character encoding for InputStreamReader (which otherwise uses the platform default charset; I had assumed that was UTF-8), so I set the encoding to 'windows-1250' (as listed at http://docs.oracle.com/javase/1.5.0/docs/guide/intl/encoding.doc.html). After doing this, some foreign characters (such as German) were read correctly, but for others (Turkish, Japanese, etc.) the problem persisted. Please suggest what I can do so that my application can read characters from all languages.



Thanks in advance,

Regards,
Kashish.


P.S. I looked (a little) on the ranch for similar problems before posting this. Any links (on the ranch itself) to something I might have missed would also be helpful.
 
Campbell Ritchie
Sheriff
Posts: 48968
Where are you displaying these characters? Some terminals have a restricted font set which they can display. The Windows®/DOS command line is particularly bad in this respect. Try displaying it on something like JOptionPane.
 
Kashish Durgiya
Greenhorn
Posts: 5
Hi Campbell,

Thanks for the quick reply. I am currently displaying the characters with System.out.println, on the console of the Eclipse IDE, and I am working on Ubuntu. The file is a text file, but I created it in MS Excel, saved it in Tab Delimited format (as my application requires) and then copied it to Ubuntu. After that, I have to read the file in Java and display the contents on a web page. My main concern is that even when I print the contents right after reading them from the file, they are shown as '??'. In fact, even when I inspect the contents while debugging the app, I see only '??'.
 
Jeff Verdegan
Bartender
Posts: 6109
Kashish Durgiya wrote: So, instead of these characters, '?' in black background are coming in my Java application.


Only if you're displaying them in a Java GUI element. If you're looking at them on a command console or in a text editor, then Java has nothing to do with how they're displayed. In any case, that question mark means either that the encoding used by the displaying console or editor doesn't know about that character, or that the font doesn't have a glyph for it.

Kashish Durgiya wrote: I investigated and found that I can set the character encoding for InputStreamReader, so I set the encoding to 'windows-1250' (as listed at http://docs.oracle.com/javase/1.5.0/docs/guide/intl/encoding.doc.html). After doing this, some foreign characters (such as German) were read correctly, but for others (Turkish, Japanese, etc.) the problem persisted. Please suggest what I can do so that my application can read characters from all languages.


You have to know ahead of time what encoding was used to write the file, and use that same encoding to read it.
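A minimal sketch of that advice, assuming the bytes are known to be UTF-8. The class name `ExplicitCharsetDemo`, the `decode` helper and the sample text are made up for illustration; the only point is that the charset handed to `InputStreamReader` has to match the one used when the bytes were written.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitCharsetDemo {

    // Decodes raw bytes with an explicitly named charset instead of the platform default.
    static String decode(byte[] data, Charset charset) {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(data), charset)) {
            char[] buffer = new char[1024];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample (German, Turkish, Japanese), written with escapes so that
        // the encoding of this source file does not matter.
        String original = "Gr\u00fc\u00dfe\t\u0130stanbul\t\u65e5\u672c\u8a9e";
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);

        // Same charset for writing and reading: the text survives.
        String right = decode(utf8Bytes, StandardCharsets.UTF_8);
        // Different charset for reading: the text comes back garbled.
        String wrong = decode(utf8Bytes, StandardCharsets.ISO_8859_1);

        System.out.println(original.equals(right)); // true
        System.out.println(original.equals(wrong)); // false
    }
}
```

Passing the charset explicitly also keeps the result independent of the machine's default, which may differ between the Windows box that wrote the file and the Ubuntu box that reads it.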

 
Campbell Ritchie
Sheriff
Posts: 48968
What happens when you open that tab-delimited file with gedit? Does it ask for an encoding?
 
Kashish Durgiya
Greenhorn
Posts: 5
Jeff,
Jeff Verdegan wrote: In any case, that question mark means either the encoding that the displaying console or editor is using doesn't know about that character, or the font doesn't have a glyph for that character.

But even when I debug my application, the character does not appear; I see the question mark instead. Am I missing some concept there?

Also, can you help with your suggestion about knowing ahead of time which encoding was used to write the file? How do I find that out?


Campbell,
gedit shows a pop-up saying that the file (say, "abc.txt") is an executable text file, and asks whether I want to run "abc.txt", or display its contents?
 
Campbell Ritchie
Sheriff
Posts: 48968
Click display contents, and see what comes up. Do the characters display correctly? Does it ask for an encoding, or for a line-end style (CR-LF on Windows®/DOS, CR on very old Macs, LF on Unix/Linux/newer Macs)?
 
Kashish Durgiya
Greenhorn
Posts: 5
Campbell,

Thanks for the suggestion. I displayed the contents in gedit, and gedit itself showed '??' instead of the actual characters. I investigated further and found that the file I created stores the characters wrongly. This happens with double-byte (or multi-byte) characters when MS Excel saves a file as text (tab-delimited format). So I am now investigating how to store those characters properly in the tab-delimited text file from MS Excel in the first place. If you can help me with that, that'd be great.
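To illustrate why no reader-side charset can repair such a file, here is a small sketch. It assumes the export step used a single-byte encoding; ISO-8859-1 stands in for whatever Excel actually used, and the class and method names are hypothetical. Characters the target encoding cannot represent are replaced by a literal '?' byte (0x3F) when the text is written, so the original characters are no longer in the file.

```java
import java.nio.charset.StandardCharsets;

class LossyExportDemo {

    // Counts literal '?' bytes (0x3F) in the data.
    static int countQuestionMarkBytes(byte[] data) {
        int count = 0;
        for (byte b : data) {
            if (b == 0x3F) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Three Japanese characters, written with escapes to avoid source-encoding issues.
        String japanese = "\u65e5\u672c\u8a9e";

        // Assumption: a single-byte encoding stands in for the lossy export.
        // String.getBytes substitutes the charset's replacement byte for unmappable characters.
        byte[] exported = japanese.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(countQuestionMarkBytes(exported)); // 3

        // Whatever charset the reader picks, the bytes now decode to question marks.
        System.out.println(new String(exported, StandardCharsets.UTF_8)); // ???
    }
}
```

A hex dump of the real file would confirm the same thing: if the bytes at those positions are 0x3F, the loss happened at export time, not in the Java code.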

Thanks!
 
Campbell Ritchie
Sheriff
Posts: 48968
I don't know about Excel, I am afraid. Sorry.
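On the Java side, a sketch of what reading could look like once the file is re-exported in an encoding that can hold every character. This assumes a tab-delimited file saved as UTF-16 with a byte-order mark (Excel's "Unicode Text" option is commonly reported to produce this, but that is an assumption, not something confirmed in this thread). The class name, helper names and sample rows are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class Utf16TabFileDemo {

    // Builds sample file content: a little-endian BOM (FF FE) followed by UTF-16LE text.
    static byte[] withLittleEndianBom(String text) {
        byte[] body = text.getBytes(StandardCharsets.UTF_16LE);
        byte[] data = new byte[body.length + 2];
        data[0] = (byte) 0xFF;
        data[1] = (byte) 0xFE;
        System.arraycopy(body, 0, data, 2, body.length);
        return data;
    }

    // Decodes UTF-16 bytes (the BOM selects the byte order) and splits each line on tabs.
    static List<String[]> parse(byte[] data) {
        String text = new String(data, StandardCharsets.UTF_16);
        List<String[]> rows = new ArrayList<>();
        for (String line : text.split("\\r?\\n")) {
            if (!line.isEmpty()) {
                rows.add(line.split("\t"));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Two hypothetical rows of keywords with Windows line endings.
        String content = "Gr\u00fc\u00dfe\t\u0130stanbul\r\n\u65e5\u672c\u8a9e\tkeyword\r\n";
        for (String[] row : parse(withLittleEndianBom(content))) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

In a real application the bytes would come from the file (for example via `Files.readAllBytes`) rather than from `withLittleEndianBom`, which exists here only to make the sketch self-contained.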
 