I am using the XMLReader and IPTCEntry classes to read XML files and images, and then storing the required data in the database. The data is extracted properly on Windows, but surprisingly, on Fedora Core and CentOS the special characters are not read correctly.
The copyright symbol, i.e. '©', is extracted as '?' on CentOS and as '���' on Fedora Core 4.
The code itself should be fine, because I am deploying the same WAR on all the different machines. Java is a platform-independent language, so it should fetch the same symbol irrespective of the operating system.
Along with the ©, the curly double quote (as copied from MS Word), '“', is also replaced by '?' on CentOS.
Is it because of some settings issue that leaves CentOS unable to read those characters, or is it a Java bug?
There are a couple of places where things could be falling down, but none of them are Java bugs. First, an XML file should specify an encoding in the XML header which states how the characters are represented in the file. Make sure that encoding matches the actual content of the XML.
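As a rough illustration of that first point, here is a minimal sketch (the class name, method name, and the tiny sample document are made up for the example, not taken from your code). It hands the parser raw bytes, so the parser decodes them using the encoding named in the XML declaration instead of the platform default:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo class; not part of the original application.
class XmlEncodingCheck {

    // Parses XML from raw bytes so the parser itself decodes them
    // according to the encoding named in the XML declaration.
    static String readRootText(byte[] xmlBytes) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new ByteArrayInputStream(xmlBytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Sample document: declared as UTF-8 and actually encoded as UTF-8.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><notice>\u00A9</notice>";
        byte[] utf8 = xml.getBytes(StandardCharsets.UTF_8);

        String text = readRootText(utf8);
        // Print the code point rather than the character, so the result
        // does not depend on what the terminal can display.
        System.out.println("code point: U+" + String.format("%04X", (int) text.charAt(0)));
        // prints "code point: U+00A9"
    }
}
```

If the same file is instead wrapped in a Reader that uses the platform default encoding, the declaration is bypassed and the result can differ from machine to machine.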
Second, note that even if Java has read the proper character, when you try to print out your special characters, the terminal or GUI window may not know how to represent them. Again, this can either be an encoding issue -- the terminal's default encoding doesn't contain representations for the characters -- or something more primitive, like a terminal that only displays ASCII.
You might test your Linux terminal's capabilities by just using "cat" or "less" to display the XML file itself, looking for those special characters. Alternatively, you could capture the Java output in a file, and edit that file with an editor that lets you see the actual bytes in the file, checking to see that in fact Java is emitting the right ones.
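To see how a correctly read character can still come out as '?', here is a small sketch (the class and method names are invented for illustration). It encodes the copyright sign in a few charsets and prints the resulting bytes in hex; a charset that has no representation for the character substitutes its replacement byte, which for US-ASCII is '?' (0x3f):

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the original application.
class ConsoleEncodingCheck {

    // Encodes the text in the given charset and returns the bytes as hex.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String copyright = "\u00A9";

        // System.out encodes with the platform default unless told otherwise.
        System.out.println("default charset: " + Charset.defaultCharset());

        System.out.println("US-ASCII:   " + hex(copyright, StandardCharsets.US_ASCII));   // 3f
        System.out.println("ISO-8859-1: " + hex(copyright, StandardCharsets.ISO_8859_1)); // a9
        System.out.println("UTF-8:      " + hex(copyright, StandardCharsets.UTF_8));      // c2 a9

        // Wrapping System.out with an explicit encoding removes the dependence
        // on the platform default; whether the glyph is then drawn correctly
        // still depends on the terminal or viewer.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println(copyright);
    }
}
```

On many Linux installations the default charset follows the locale settings, so comparing the first line of output between the Windows and Linux machines is a quick way to see whether the two JVMs are even using the same default.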
Thanks for the reply. I have already tried those things. My XML files are UTF-8 encoded. Apart from that, the character encoding of my XML reader is also UTF-8. I am printing the output only for trial purposes. In the application, the data, once extracted, goes directly into the database (MySQL). The database is capable of storing and displaying special characters.
Even after all these things the issue remains.
Even if I try to print any special character using a simple System.out.println("©"); it appears as '?' on the Linux machine. Now the question remains whether catalina.out (the Tomcat log) is able to display special characters or not. I tried opening it in different editors, and even tried changing the encoding schemes, but the output remained '?'.
So all I can conclude from this is that Java was not able to extract/read the © when hosted in the CentOS environment, which is surprising considering Java is platform independent.
I hope I have made my point clear.
Once again, thanks for showing interest.
No, not really; all you've shown is that your terminal doesn't know how to display the copyright character. Since you're on Linux, you have the 'od' program ("octal dump") which can show you the actual bytes in a file. Process your XML so that the "bad" character goes from Java directly to a file, being sure that you open the file with an encoding that can handle the character (or that your platform default can handle it.) Then examine the file with a command like
od -t x1 yourfile | less
This will show your file as pages of hexadecimal bytes; you can look for the proper bytes for the copyright character. It would help if the output file is as short as possible, of course! The od output looks like
Each number on the right is one byte from the file, as a hexadecimal number.
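If you would rather do the whole check from Java, something along these lines should work as a sketch (the class name, method name, and the temp-file name are arbitrary choices for the example). It writes a short string to a file with an explicit UTF-8 encoding, then reads the raw bytes back and prints them in hex, much as od would:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class; not part of the original application.
class FileBytesCheck {

    // Writes the text to a temporary file as UTF-8, then returns the
    // file's raw bytes as space-separated hexadecimal numbers.
    static String writeAndDump(String text) {
        try {
            Path file = Files.createTempFile("encoding-check", ".txt");
            try (Writer out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
                out.write(text);
            }
            StringBuilder sb = new StringBuilder();
            for (byte b : Files.readAllBytes(file)) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(String.format("%02x", b & 0xff));
            }
            Files.delete(file);
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // 'A', the copyright sign, and a newline.
        System.out.println(writeAndDump("A\u00A9\n")); // prints "41 c2 a9 0a"
    }
}
```

In UTF-8 the copyright sign is the two bytes c2 a9. If the string extracted from your XML produces those bytes here, Java read the character correctly and the '?' is being introduced later, when the text is encoded for the console or the log.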