Problem with reading non-standard characters from XML using SAX

Konstantinos Vasileiou
Greenhorn

Joined: Jul 20, 2009
Posts: 16
I am using SAX to deserialise some objects from an XML input. The problem is that a name that is read, André Gonçalves, cannot be read correctly by the parser (or this is the point where I identify the problem at least.)
In fact, when I print the output either to the console, or to a GUI text component, it appears like this: Marcos Andr� Gon�alves

Can I do something to correct this? It is really annoying...
Rob Spoor
Sheriff

Joined: Oct 27, 2005
Posts: 19723

Do you get the same problem (or other problems) when you open the file in Internet Explorer? Perhaps the encoding is simply incorrect.


Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 39478
Please check what happens if you give the output to a Java object. Try javax.swing.JOptionPane.showMessageDialog(null, "André Gonçalves"); and see what happens. The Windows console is bad at displaying non-ASCII characters.
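A console-independent way to check what a String really contains is to print its Unicode code points, since that output is plain ASCII and displays the same everywhere. The sketch below is only an illustration: the class name CodePointDump and the method dump are made up for this example, and it assumes Java 8 or later for String.codePoints().

```java
public class CodePointDump {

    // Returns the code points of s as ASCII text, e.g. "U+0041 U+006E".
    // (Hypothetical helper, not part of any library.)
    static String dump(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", cp));
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // \u00e9 is e-acute; the escape keeps this source file pure ASCII.
        System.out.println(dump("Andr\u00e9"));
    }
}
```

If the dump of the parsed name shows U+00E9 and U+00E7, the String is intact and only the display is at fault. If it shows U+FFFD, the characters were already lost before the String reached the output.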
Konstantinos Vasileiou
Greenhorn

Joined: Jul 20, 2009
Posts: 16
Rob Prime wrote: Do you get the same problem (or other problems) when you open the file in Internet Explorer? Perhaps the encoding is simply incorrect.

The encoding is set to ISO-8859-1 and the XML is displayed correctly from Firefox.
Konstantinos Vasileiou
Greenhorn

Joined: Jul 20, 2009
Posts: 16
Campbell Ritchie wrote: Please check what happens if you give the output to a Java object. Try javax.swing.JOptionPane.showMessageDialog(null, "André Gonçalves"); and see what happens. The Windows console is bad at displaying non-ASCII characters.


I tested it and it appears perfectly well!
By the way, I am using Ubuntu 9.04, so it is not related to the Windows console.

I really believe it has to do with the SAX parser that deserialises the entities... Is there some option I should have set to do it? It cannot be that difficult but still it is a very annoying little bug!
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18708

SAX parsers work perfectly well with all Unicode characters.

However your problem description is now confusing. It appears that you tested by outputting data from SAX to your console from XML, and had a problem there. Then you tested displaying a constant value into a GUI component, and that worked successfully. I don't see the test where you output data from SAX to a GUI component, and so it's still possible that your console is not a good testing tool for non-ASCII characters.

It's also possible that you are doing something like passing a Reader with the wrong encoding to the SAX parser, but you haven't posted any code so that's just speculation too.
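To illustrate the Reader speculation, here is a minimal sketch, not the original poster's code. It parses the same ISO-8859-1 bytes twice: once as a byte stream, so the parser can honour the encoding declaration, and once through a Reader that wrongly decodes them as UTF-8. With a character stream the parser disregards the encoding declaration, so the damage done by the Reader is not undone. The class SaxEncodingDemo, the TextCollector handler and the sample document are all assumptions made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxEncodingDemo {

    // Hypothetical handler that simply collects all character data.
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    // A tiny sample document, encoded as ISO-8859-1 as its declaration says.
    static byte[] sampleXml() {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                + "<name>Andr\u00e9 Gon\u00e7alves</name>";
        return xml.getBytes(StandardCharsets.ISO_8859_1);
    }

    static String parse(InputSource source) {
        try {
            TextCollector handler = new TextCollector();
            SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
            return handler.text.toString();
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
    }

    // Byte stream: the parser reads the encoding declaration itself.
    static String parseFromBytes() {
        return parse(new InputSource(new ByteArrayInputStream(sampleXml())));
    }

    // Reader with the wrong charset: the bytes are mis-decoded before
    // the parser ever sees them.
    static String parseFromWrongReader() {
        Reader reader = new InputStreamReader(
                new ByteArrayInputStream(sampleXml()), StandardCharsets.UTF_8);
        return parse(new InputSource(reader));
    }
}
```

Here parseFromBytes() should return the name intact, while parseFromWrongReader() should return it with U+FFFD replacement characters in place of the accented letters. If the real code builds a Reader (a FileReader, for instance, uses the platform default encoding), passing the File or InputStream directly is usually the safer choice.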
Konstantinos Vasileiou
Greenhorn

Joined: Jul 20, 2009
Posts: 16
Paul Clapham wrote: SAX parsers work perfectly well with all Unicode characters.

However your problem description is now confusing. It appears that you tested by outputting data from SAX to your console from XML, and had a problem there. Then you tested displaying a constant value into a GUI component, and that worked successfully. I don't see the test where you output data from SAX to a GUI component, and so it's still possible that your console is not a good testing tool for non-ASCII characters.

It's also possible that you are doing something like passing a Reader with the wrong encoding to the SAX parser, but you haven't posted any code so that's just speculation too.

First, I did not output the problematic data from SAX to the console - I created my objects first, using a SAX parser, and then printed the suspicious field of the object under discussion - and it appeared as I describe above. Outputting the data from the object to a TextArea still has the same problem for this name!

Hmmm. Maybe some code will be clarifying.
I have an XML file that contains data about some objects of my system. I deserialise the file into object instances with a class that uses the SAX parser. I use a CharArrayWriter for reading from the XML

and then


If the problem is not there, then the initialisation might be wrong?...
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18708

Konstantinos Vasileiou wrote: Hmmm. Maybe some code will be clarifying.


Yes. But the clarifying code would be the code where you pass a File or an InputStream or something like that into the parser.
Konstantinos Vasileiou
Greenhorn

Joined: Jul 20, 2009
Posts: 16
Paul Clapham wrote:
Konstantinos Vasileiou wrote: Hmmm. Maybe some code will be clarifying.


Yes. But the clarifying code would be the code where you pass a File or an InputStream or something like that into the parser.

Sorry.... Here it is:


I tested some more things and probably it is not a mistake of the SAX parser after all: I print the list of names the parsing module returns and the name is printed correctly. If I also print the String "André Gonçalves" to the GUI text components, it appears correctly as well. For some reason, the String loses the extra encoding information somewhere in the process? Is that possible?
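On the last question: a Java String carries no encoding information at all. It is just a sequence of UTF-16 code units. Characters can only be lost where bytes are converted to chars or back (reading a file, writing to a stream, calling getBytes() or new String(bytes)) with mismatched charsets. The sketch below shows that effect in isolation; the class CharsetMismatchDemo and its method are hypothetical names for this example, and nothing here is taken from the code in this thread.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {

    // The name from this thread, written with escapes so the source is ASCII.
    static final String NAME = "Andr\u00e9 Gon\u00e7alves";

    // Encodes NAME as ISO-8859-1 bytes, then decodes those bytes with cs.
    static String decodeLatin1BytesAs(Charset cs) {
        byte[] bytes = NAME.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, cs);
    }

    public static void main(String[] args) {
        // Matching charsets: the round trip is lossless.
        String ok = decodeLatin1BytesAs(StandardCharsets.ISO_8859_1);
        System.out.println(ok.equals(NAME));

        // Mismatched charsets: the bytes 0xE9 and 0xE7 are not valid UTF-8
        // here, so the decoder substitutes U+FFFD for each of them.
        String broken = decodeLatin1BytesAs(StandardCharsets.UTF_8);
        System.out.println(broken.indexOf('\uFFFD') >= 0);
    }
}
```

So if the list returned by the parsing module prints correctly, a likely place to look is any step after parsing where the name is turned into bytes and back, such as a file, a socket or a serialisation layer.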
 