Unable be to read Special character (OS specific)

Ashutosh Devbrat
Greenhorn

Joined: Jun 24, 2008
Posts: 2
Hi,

I am using the XMLReader and IPTCEntry classes to read XML files and images, and then storing the required data in the database. The data is extracted properly without any issue on Windows, but surprisingly, on Fedora Core and CentOS the application is not able to read the special characters.

The copyright symbol, i.e. '©', is extracted as '?' on CentOS and as '���' on Fedora Core 4.

The code is fine, because I am deploying the same WAR on all the different machines.
Java is a platform-independent language, so it should fetch the same symbol irrespective of the operating system.

Along with the ©, the curly double quotes (as copied from MS Word) are replaced by '?' on CentOS.

Is this because of some configuration issue that leaves CentOS unable to read those characters, or is it a Java bug?

Any help is most welcome.
Ernest Friedman-Hill
author and iconoclast
Marshal

Joined: Jul 08, 2003
Posts: 24187

Hi,

Welcome to JavaRanch!

There are a couple of places where things could be falling down, but none of them are Java bugs. First, an XML file should specify an encoding in the XML header which states how the characters are represented in the file. Make sure that encoding matches the actual content of the XML.
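
A minimal sketch of the first point, assuming the XML is available as raw bytes. When a SAX XMLReader is handed a byte stream, it takes the encoding from the XML declaration; wrapping the stream in a Reader that uses the platform default charset would bypass that. The class and method names (XmlEncodingDemo, extractText) are made up for illustration and are not from the original poster's code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class XmlEncodingDemo {

    /** Parses raw XML bytes and returns all character data found in the document. */
    static String extractText(byte[] xmlBytes) {
        final StringBuilder text = new StringBuilder();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    text.append(ch, start, length);
                }
            });
            // A byte stream (not a Reader) lets the parser honor the declared encoding.
            reader.parse(new InputSource(new ByteArrayInputStream(xmlBytes)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String utf8Xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><note>\u00A9 2008</note>";
        String latin1Xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><note>\u00A9 2008</note>";

        // Each document is encoded to bytes with the charset its header declares.
        String fromUtf8 = extractText(utf8Xml.getBytes(StandardCharsets.UTF_8));
        String fromLatin1 = extractText(latin1Xml.getBytes(StandardCharsets.ISO_8859_1));

        System.out.println(fromUtf8.equals(fromLatin1));            // true
        System.out.println(Integer.toHexString(fromUtf8.charAt(0))); // a9
    }
}
```

Both documents yield the same character (U+00A9) because the declared encoding matches the actual bytes in each case, which is the condition described above.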

Second, note that even if Java has read the proper character, when you try to print out your special characters, the terminal or GUI window may not know how to represent them. Again, this can either be an encoding issue -- the terminal's default encoding doesn't contain representations for the characters -- or something more primitive, like a terminal that only displays ASCII.
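
To illustrate how an encoding mismatch on output can produce these symptoms, here is a small sketch. It assumes nothing about the poster's setup; CharsetMismatchDemo and roundTrip are hypothetical names. It encodes a string with one charset and decodes the bytes with another, the way a mismatched writer and viewer would.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetMismatchDemo {

    /** Encodes text with one charset, then decodes the resulting bytes with another. */
    static String roundTrip(String text, Charset writeAs, Charset readAs) {
        return new String(text.getBytes(writeAs), readAs);
    }

    public static void main(String[] args) {
        String copyright = "\u00A9";

        // The charset the JVM falls back to when none is given; it varies by machine and locale.
        System.out.println("default charset: " + Charset.defaultCharset());

        // US-ASCII cannot represent the character, so the encoder substitutes '?'.
        System.out.println(roundTrip(copyright, StandardCharsets.US_ASCII, StandardCharsets.US_ASCII)
                .equals("?")); // true

        // UTF-8 bytes (c2 a9) viewed as ISO-8859-1 turn into two characters.
        System.out.println(roundTrip(copyright, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1)
                .length()); // 2

        // Matching charsets on both sides preserve the character.
        System.out.println(roundTrip(copyright, StandardCharsets.UTF_8, StandardCharsets.UTF_8)
                .equals(copyright)); // true
    }
}
```

Comparing the "default charset" line across the Windows and Linux machines is one quick way to see whether they differ in this respect.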

You might test your Linux terminal's capabilities by just using "cat" or "less" to display the XML file itself, looking for those special characters. Alternatively, you could capture the Java output in a file, and edit that file with an editor that lets you see the actual bytes in the file, checking to see that in fact Java is emitting the right ones.


Ashutosh Devbrat
Greenhorn

Joined: Jun 24, 2008
Posts: 2
Hi Ernest,

Thanks for the reply.
I have already tried those things. My XML files are UTF-8 encoded.
Apart from that, the character encoding of my XML reader is also UTF-8.
I am printing the output only for trial purposes. In the application,
the data, once extracted, goes directly into the database (MySQL). The database is capable of storing and displaying special characters.

Even after all these things, the issue remains.

Even if I try to print a special character using a simple System.out.println("©"); it appears as '?' on the Linux machine.
Now the question remains whether catalina.out (the Tomcat log) is able to display special characters or not. I tried opening it in different editors and even tried changing the encoding schemes, but the output remained '?'.

So all I can conclude from this is that Java was not able to extract/read the © when hosted in the CentOS environment, which is surprising considering Java is platform independent.

I hope I have made my point clear.

Once again, thanks for showing interest.

Thanks
Ernest Friedman-Hill
author and iconoclast
Marshal

Joined: Jul 08, 2003
Posts: 24187

No, not really; all you've shown is that your terminal doesn't know how to display the copyright character. Since you're on Linux, you have the 'od' program ("octal dump") which can show you the actual bytes in a file. Process your XML so that the "bad" character goes from Java directly to a file, being sure that you open the file with an encoding that can handle the character (or that your platform default can handle it.) Then examine the file with a command like

od -t x1 yourfile | less

This will show your file as pages of hexadecimal bytes; you can look for the proper bytes for the copyright character. It would help if the output file is as short as possible, of course! The od output looks like



Each number on the right is one byte from the file, as a hexadecimal number.
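
The same check can be sketched from the Java side. This is only an illustration, not part of the original advice: HexDumpDemo and toHex are hypothetical names, and the temporary file stands in for whatever file the application writes. The string is written with an explicit UTF-8 encoding and the bytes are then printed in hexadecimal, much as od -t x1 would show them.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class HexDumpDemo {

    /** Formats bytes as space-separated two-digit hexadecimal values. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Write the character with an explicit encoding rather than the platform default.
        Path out = Files.createTempFile("copyright", ".txt");
        Files.write(out, "\u00A9 2008\n".getBytes(StandardCharsets.UTF_8));

        // Read the raw bytes back and show them in hex.
        System.out.println(toHex(Files.readAllBytes(out))); // c2 a9 20 32 30 30 38 0a

        Files.delete(out);
    }
}
```

In UTF-8 the copyright character is the two bytes c2 a9; in ISO-8859-1 it is the single byte a9. A 3f in that position is a literal '?', which would mean the character was already lost before it reached the file.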
 