InputStreamReader is not properly reading some characters in Linux.

Ganni Kal
Greenhorn

Joined: Jan 17, 2008
Posts: 16
Hi

I have a file that contains the text TEST NAÏVE SUBJECT. I wrote a Java program to read this file on Red Hat Linux.
The Java code that reads the file is similar to the following:

File inputFile = new File(fileName.toString());
FileInputStream in = new FileInputStream(inputFile);
// No charset is specified, so InputStreamReader uses the platform default encoding.
LineNumberReader lnr = new LineNumberReader(new InputStreamReader(in));
String streamInput = null;
while ((streamInput = lnr.readLine()) != null) {
    System.out.println(streamInput);
}

The output of the program is:

TEST NA?VE SUBJECT

(observe that the character Ï is not read properly.)

My guess is that the Java program is reading the input file with an encoding different from the one the file was actually written in. If so, what is the solution to this problem?

Here I list the details that I observed on the Linux server:

The env variable LANG=C
Running the file command on that input file displays:

ISO-8859 text, with CRLF line terminators


The same program reads the characters properly when I execute it on a different Linux server with the same locale (LANG=C) settings.
But on that server the file command reports the encoding as UTF-8 Unicode English text, with very long lines, with CRLF line terminators.

Thank you!

Regards
Ganni
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18570
    

Just use the InputStreamReader constructor which accepts an encoding name. It looks like you know the actual encoding of the document, so that shouldn't be a problem.
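
For example, a minimal sketch (the class name and the file name input.txt are placeholders; the charset name is passed explicitly so the read no longer depends on the platform default):

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.LineNumberReader;

public class ReadWithCharset {
    public static void main(String[] args) throws IOException {
        // Pass the charset name explicitly instead of relying on the JVM default.
        LineNumberReader lnr = new LineNumberReader(
                new InputStreamReader(new FileInputStream("input.txt"), "ISO-8859-1"));
        String line;
        while ((line = lnr.readLine()) != null) {
            System.out.println(line);
        }
        lnr.close();
    }
}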
Ganni Kal
Greenhorn

Joined: Jan 17, 2008
Posts: 16

Thanks for your suggestion.

I do not want to specify the encoding name in the InputStreamReader constructor, because characters like Ï are an exceptional case. I do not even know how they were entered into the file (I did not create the file). Also, I cannot handle every such character that might come from a different encoding.

I just want the read operation to use the default encoding set by the OS or JVM.
This works fine on a different Linux server (with the same Java code). Mainly, I want to understand how a file with the same characters can be read properly on one Linux server but not on the other, when both servers have the same locale settings.

I know this may be a question about the Linux OS or the JVM, but I cannot find the answer anywhere, including the LinuxQuestions forum.

Any more ideas?

Thanks
Ganni
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18570
    

Well, I predict that if you print out Java's default file encoding on the two systems which work differently, you're going to print out different values.

And if you use the wrong encoding to read a file, you're going to have errors like that. But it sounds like you don't consider that a problem?
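
A quick way to check, as a minimal sketch (run it on each server with the same user and environment as the failing program):

import java.nio.charset.Charset;

public class ShowDefaultEncoding {
    public static void main(String[] args) {
        // Both values reflect the JVM's default charset, which is normally
        // derived from the OS locale (e.g. LANG) at JVM startup.
        System.out.println("file.encoding = " + System.getProperty("file.encoding"));
        System.out.println("defaultCharset = " + Charset.defaultCharset());
    }
}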
Ganni Kal
Greenhorn

Joined: Jan 17, 2008
Posts: 16


I executed the following code to list all the charsets supported by the JVM on both RH servers.
There is no difference in the output.

// Lists every charset this JVM supports.
Map<String, Charset> charSetMap = Charset.availableCharsets();
Iterator<String> itr1 = charSetMap.keySet().iterator();
while (itr1.hasNext()) {
    String key = itr1.next();
    System.out.println(key + " - " + charSetMap.get(key));
}




Regards
Ganni
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18570
    

Ganni Kal wrote:I executed the following code to list all the charsets supported by the JVM on both RH servers.
There is no difference in the output.


Okay. But that isn't useful information.

Paul Clapham wrote:Well, I predict that if you print out Java's default file encoding on the two systems which work differently, you're going to print out different values.


You didn't try that yet.
Rob Spoor
Sheriff

Joined: Oct 27, 2005
Posts: 19697
    

Ganni Kal wrote:(observe that the character Ï is not read properly.)

Are you sure? Are you sure it isn't simply a printing problem rather than a reading problem?

Most terminals, including Windows' CMD.EXE, simply cannot handle anything outside plain old ASCII. Try using JOptionPane.showMessageDialog to show the message (if you have an Xorg session running, that is).
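
For example, a minimal sketch (the class name and the hard-coded string are placeholders standing in for a line read from the file; it needs a running display to show the dialog):

import javax.swing.JOptionPane;

public class ShowLineInDialog {
    public static void main(String[] args) {
        // A Swing dialog is not limited by the terminal's charset,
        // so it can display the Ï character correctly.
        JOptionPane.showMessageDialog(null, "TEST NA\u00CFVE SUBJECT");
    }
}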


SCJP 1.4 - SCJP 6 - SCWCD 5 - OCEEJBD 6
How To Ask Questions How To Answer Questions
Ganni Kal
Greenhorn

Joined: Jan 17, 2008
Posts: 16
Rob Spoor wrote:
Ganni Kal wrote:(observe that the character Ï is not read properly.)

Are you sure? Are you sure it isn't simply a printing problem rather than a reading problem?

Most terminals, including Windows' CMD.EXE, simply cannot handle anything outside plain old ASCII. Try using JOptionPane.showMessageDialog to show the message (if you have an Xorg session running, that is).



Actually, I use a Swing-based UI. The actual application code stores the characters in a database; that data is then read back from the DB and displayed in a JTextField.
I tried printing the characters to the console only for debugging purposes.

The characters are not read properly in either the console or Swing.


Thanks
Ganni
Ganni Kal
Greenhorn

Joined: Jan 17, 2008
Posts: 16

Hi

I tried reading the same file, passing the charset name ISO-8859-1 as an argument to the InputStreamReader constructor.
Now the non-ASCII characters are read and printed as they are.

LineNumberReader lnr = new LineNumberReader(new InputStreamReader(in, Charset.forName("ISO-8859-1")));

I understand that the input file is in the ISO-8859 encoding, but Java is trying to read the file using its default encoding, ANSI.
I use WebSphere's JVM to compile and run the program.

Will WebSphere affect the JVM's default encoding?
How can I change the JVM's default file encoding?


Regards
Ganni
Ganni Kal
Greenhorn

Joined: Jan 17, 2008
Posts: 16

The same program reads the characters properly when I execute it on a different Linux server with the same locale (LANG=C) settings.
But on that server the file command reports the encoding as UTF-8 Unicode English text, with very long lines, with CRLF line terminators.


Comparing the settings of the two servers (both Red Hat Enterprise Linux AS release 4), the patch update versions differ:
one server is at Nahant Update 5 and the other at Nahant Update 7.

Will this difference cause this problem?
 