How to avoid special characters while reading xml through java

balamurugan velliambalam
Greenhorn

Joined: Jul 13, 2010
Posts: 20
In my project we read a lot of XML files. In some places those files contain special characters, which I tried to remove using code like the following:
catch (SAXParseException e)
{
    // Report where the parser gave up, then rethrow
    System.out.println("Public ID:" + e.getPublicId());
    System.out.println("System ID:" + e.getSystemId());
    System.out.println("Line NO:" + e.getLineNumber());
    System.out.println("Column NO:" + e.getColumnNumber());
    System.out.println("Error MSG:" + e.getMessage());
    e.printStackTrace();
    throw e;
}
But it only throws errors for problems such as unbalanced opening < and closing > and related XML errors, not for special characters. Please help me remove the special characters from the XML using Java.
Lester Burnham
Rancher

Joined: Oct 14, 2008
Posts: 1337
What is a "special character" according to your definition? That code handles exceptions, not any particular content that gets parsed. Where did you put it in your parsing code?
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18135
    

If you mean to say you have XML which is not well-formed because whoever created it didn't escape "<" and ">" in text nodes correctly, the answer is you can't fix that using Java. It's the responsibility of whoever creates an XML document to ensure that it is well-formed, and it's one of the specific design features of XML that parsers are not required to fix up bad data.
balamurugan velliambalam
Greenhorn

Joined: Jul 13, 2010
Posts: 20
Hello Lester Burnham,
The code above only throws exceptions for XML structure errors, not for special characters such as the inverted question mark (¿¿¿¿), the inverted exclamation mark, etc. I put the catch block after the one for IOException. If anything above is unclear, please reply.
balamurugan velliambalam
Greenhorn

Joined: Jul 13, 2010
Posts: 20
Hello Paul Clapham,

The XML is well-formed, but the documents we receive from an external source contain special characters like the ones on the line marked with an arrow below:

<?xml version="1.0"?>
<note>
<to>Tove</to>¿¿¿¿¿¿¿ ---> no SAXParseException is thrown for this line, so how do I find and remove these characters from my XML?
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
Lester Burnham
Rancher

Joined: Oct 14, 2008
Posts: 1337
Why do you expect to get exceptions for those characters? XML can contain just about any character - that's not cause for any exceptions. If those characters are not supposed to be there, talk to the producer of the XML to fix that.
balamurugan velliambalam
Greenhorn

Joined: Jul 13, 2010
Posts: 20
Lester Burnham wrote:Why do you expect to get exceptions for those characters? XML can contain just about any character - that's not cause for any exceptions. If those characters are not supposed to be there, talk to the producer of the XML to fix that.


But the producer of the XML is our client, and we are not supposed to have it fixed on their side, so please suggest an alternative way to find those special characters in the XML file.
Peter Taucher
Ranch Hand

Joined: Nov 18, 2006
Posts: 174
Lester Burnham wrote:XML can contain just about any character - that's not cause for any exceptions.

I don't fully agree with that. The parser usually is designed to read a defined encoding. In the provided example no encoding is specified. If I try to parse it as 'UTF-8' I'll get an exception and cannot read the document. If I try to parse it as 'ISO-8859-1' (because it is ANSI encoded) there's no problem at all reading the file. So specifying an xml encoding is always a good idea.


And if the document needs to be 'UTF-8' decoded, but it isn't, then you'll have to talk to the document author and tell him to provide documents with the correct encoding...
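The difference between the two encodings can be sketched as follows. This is only an illustration, assuming the JDK's built-in SAX parser and a byte-stream input; the class name `EncodingCheck` and the method `parsesAs` are made up for this example. It feeds the sample document from this thread, encoded as ISO-8859-1, to the parser twice, once per encoding.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class; the names are illustrative, not from the thread.
class EncodingCheck {

    /** Returns true if the bytes parse as XML when the parser is told to use the given encoding. */
    static boolean parsesAs(byte[] xml, String encoding) {
        try {
            InputSource source = new InputSource(new ByteArrayInputStream(xml));
            // The encoding set here is honoured because the source is a byte stream, not a Reader.
            source.setEncoding(encoding);
            SAXParserFactory.newInstance().newSAXParser().parse(source, new DefaultHandler());
            return true;
        } catch (Exception e) {
            // Covers SAXException, IOException (e.g. malformed byte sequences) and configuration errors.
            return false;
        }
    }

    public static void main(String[] args) {
        // Sample document with two inverted question marks (U+00BF),
        // encoded as ISO-8859-1 and carrying no encoding declaration.
        byte[] xml = ("<?xml version=\"1.0\"?><note><to>Tove</to>\u00BF\u00BF"
                + "<from>Jani</from></note>").getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("ISO-8859-1: " + parsesAs(xml, "ISO-8859-1"));
        System.out.println("UTF-8:      " + parsesAs(xml, "UTF-8"));
    }
}
```

If the producer cannot add an encoding declaration, setting the encoding on the `InputSource` is one way to tell the parser what the bytes really are, provided you actually know the encoding.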


Censorship is the younger of two shameful sisters, the older one bears the name inquisition.
-- Johann Nepomuk Nestroy
Lester Burnham
Rancher

Joined: Oct 14, 2008
Posts: 1337
balamurugan velliambalam wrote: But the producer of the XML is our client, and we are not supposed to have it fixed on their side, so please suggest an alternative way to find those special characters in the XML file.

Even a client has to adhere to the predefined rules on how data is to be delivered. Either the data is in the format it's supposed to be in (in which case you'll have to deal with it) or it isn't, in which case the producer needs to fix it. But you still haven't told us what a special character is according to your definition. Any non-ASCII character? That would be easy to detect and remove in the SAX characters method.
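A minimal sketch of that idea, assuming "special" really does mean "any non-ASCII character" (which the original poster has not confirmed). The class name `AsciiOnlyHandler` and its fields are invented for this example; only `DefaultHandler.characters` is the standard SAX callback.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects element text while dropping non-ASCII characters.
class AsciiOnlyHandler extends DefaultHandler {
    final StringBuilder text = new StringBuilder(); // text content with non-ASCII removed
    int removed = 0;                                // number of UTF-16 code units dropped

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser may deliver text in several chunks; each chunk is filtered the same way.
        for (int i = start; i < start + length; i++) {
            if (ch[i] < 128) {
                text.append(ch[i]);
            } else {
                removed++;
            }
        }
    }

    /** Parses the given XML string and returns the handler holding the filtered text. */
    static AsciiOnlyHandler parse(String xml) throws Exception {
        AsciiOnlyHandler handler = new AsciiOnlyHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        AsciiOnlyHandler h = parse("<note><to>Tove</to>\u00BF\u00BF<from>Jani</from></note>");
        System.out.println("Filtered text: " + h.text);
        System.out.println("Characters removed: " + h.removed);
    }
}
```

This only works once the document has been decoded successfully, so it does not help if the parser already fails on the encoding.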

Peter Taucher wrote:
Lester Burnham wrote:XML can contain just about any character - that's not cause for any exceptions.

I don't fully agree with that. The parser usually is designed to read a defined encoding. In the provided example no encoding is specified. If I try to parse it as 'UTF-8' I'll get an exception and cannot read the document. If I try to parse it as 'ISO-8859-1' (because it is ANSI encoded) there's no problem at all reading the file. So specifying an xml encoding is always a good idea.

Agreed. I was assuming that the document is valid according to its stated encoding (or UTF-8 in the case of no encoding and no BOM). If it's not then that's something to be fixed by the producer. But we don't know what those characters really are (that information likely got lost somewhere along the way from the original XML file to this web page), or what makes them "special", so this is mostly conjecture.
David Newton
Author
Rancher

Joined: Sep 29, 2008
Posts: 12617

We get lots of external files; we have a process that removes garbage from them before the XML processing itself. Sometimes this causes its own set of issues, but we can't rely on file producers to do the right thing, so we assume the risk of causing a different sort of issue ourselves. Just another option.
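The post does not say what that process removes, so the following is only a guess at one common kind of "garbage": characters that the XML 1.0 specification does not allow at all (most control characters, unpaired surrogates, U+FFFE and U+FFFF). The class `XmlSanitizer` and its methods are hypothetical names for this sketch.

```java
// Hypothetical pre-processing step: drop characters that XML 1.0 forbids.
class XmlSanitizer {

    /** True if the code point matches the Char production of the XML 1.0 specification. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the input with every code point that is not a legal XML 1.0 character removed. */
    static String stripInvalid(String input) {
        StringBuilder out = new StringBuilder(input.length());
        // codePoints() yields unpaired surrogates as their own value, so they are filtered out too.
        input.codePoints()
                .filter(XmlSanitizer::isXml10Char)
                .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        String dirty = "<to>Tove\u0001</to>\u0008";
        System.out.println(stripInvalid(dirty));
    }
}
```

Note that characters such as ¿ are perfectly legal XML and pass through this filter untouched; removing them would need an application-specific rule.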
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12681
    
A frequent bane of XML documents is MS Word "smart punctuation" characters, which are illegal Unicode.
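One way such characters are sometimes handled is to map them to plain ASCII before further processing. The sketch below assumes the text has already been decoded with the right charset (Word output pasted into files is often windows-1252), so the curly quotes, dashes and ellipsis arrive as their normal code points. The class `SmartPunctuation` and the choice of replacements are assumptions for illustration.

```java
// Hypothetical helper: replace common Word "smart punctuation" with ASCII look-alikes.
class SmartPunctuation {

    static String toAscii(String s) {
        return s.replace('\u2018', '\'')   // left single quote
                .replace('\u2019', '\'')   // right single quote / apostrophe
                .replace('\u201C', '"')    // left double quote
                .replace('\u201D', '"')    // right double quote
                .replace("\u2013", "-")    // en dash
                .replace("\u2014", "--")   // em dash
                .replace("\u2026", "...");  // horizontal ellipsis
    }

    public static void main(String[] args) {
        System.out.println(toAscii("\u201CDon\u2019t forget me this weekend\u2026\u201D"));
    }
}
```

If the bytes were decoded with the wrong charset in the first place, the characters are already damaged and a mapping like this will not find them.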

Bill


Java Resources at www.wbrogden.com
David Newton
Author
Rancher

Joined: Sep 29, 2008
Posts: 12617

No doubt :(

We also have onsite QA that cut-and-paste XML payloads from spec docs (Word) into a test page then complain that it doesn't pass validation :/
Jimmy Clark
Ranch Hand

Joined: Apr 16, 2008
Posts: 2187
But the producer of the XML is our client, and we are not supposed to have it fixed on their side, so please suggest an alternative way to find those special characters in the XML file.


1. Write a Korn Shell or Perl script that will read file, remove unwanted character(s), and create new clean file.

2. Pass new clean file to Java-based data processing application.
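The same two steps can also be done in Java instead of Korn Shell or Perl. The sketch below is one possible shape, not a tested production tool: the class `CleanFileStep`, its method names and the `unwanted` parameter are invented here, and it assumes you know the charset of the incoming file and that the file either has no encoding declaration or declares UTF-8 (the clean copy is written as UTF-8).

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical two-step pipeline: clean the file first, then hand it to the parser.
class CleanFileStep {

    /** Step 1: read the source in its charset, drop unwanted characters, write a UTF-8 copy. */
    static void clean(Path source, Path target, Charset sourceCharset, String unwanted)
            throws IOException {
        String content = new String(Files.readAllBytes(source), sourceCharset);
        StringBuilder cleaned = new StringBuilder(content.length());
        for (char c : content.toCharArray()) {
            if (unwanted.indexOf(c) < 0) { // keep everything not listed as unwanted
                cleaned.append(c);
            }
        }
        Files.write(target, cleaned.toString().getBytes(StandardCharsets.UTF_8));
    }

    /** Step 2: pass the clean file to the XML processing (here just a no-op SAX parse). */
    static void parse(Path cleanFile) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
                .parse(cleanFile.toFile(), new DefaultHandler());
    }

    public static void main(String[] args) throws Exception {
        // Usage: java CleanFileStep dirty.xml clean.xml
        Path source = Paths.get(args[0]);
        Path target = Paths.get(args[1]);
        clean(source, target, StandardCharsets.ISO_8859_1, "\u00BF\u00A1"); // inverted ? and !
        parse(target);
    }
}
```

Keeping the original file untouched and writing a separate clean copy makes it easy to compare the two when the cleaning step removes something it should not have.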
David Newton
Author
Rancher

Joined: Sep 29, 2008
Posts: 12617

David Newton wrote: We get lots of external files; we have a process that removes garbage from them before the XML processing itself. Sometimes this causes its own set of issues, but we can't rely on file producers to do the right thing, so we assume the risk of causing a different sort of issue ourselves. Just another option.
 