SAX parsing of xml file Tracing invalid characters

 
Abhi Venu
Ranch Hand
Posts: 73
Hi,
I have a program that reads an XML file using a SAX parser. The XML file is validated against a schema.
During this process I get errors such as: Byte "195" is not a member of the (7-bit) ASCII character set.
I want to pinpoint the exact character that is invalid and the line number of that character.
The encoding I am using is ASCII. Using another encoding might make these errors disappear, but I don't want to do that.
I want to track down the characters that break the rules.

The approach I tried was to take the SAXParseException that is generated and read its line number, column number and so on. But the lines it reports don't actually contain any problem.

    public void fatalError(SAXParseException exception) throws SAXException {
        validationError = true;
        saxParseException = exception;
        String message = "Fatal Error: " + getParseExceptionInfo(exception);
        // Report the column as well as the line held in the message.
        writeToErrorFile(message + " Column=" + exception.getColumnNumber());
    }


Can anyone suggest how this can be done? The relevant parts of my code are posted here.

    // Imports needed by the fragments below
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.xml.sax.SAXException;
    import org.xml.sax.SAXParseException;
    import org.xml.sax.helpers.DefaultHandler;

    // Build a namespace-aware, schema-validating parser
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true);
    factory.setValidating(true);
    factory.setAttribute(
            "http://java.sun.com/xml/jaxp/properties/schemaLanguage",
            "http://www.w3.org/2001/XMLSchema");
    factory.setAttribute(
            "http://java.sun.com/xml/jaxp/properties/schemaSource",
            SchemaUrl);
    DocumentBuilder builder = factory.newDocumentBuilder();
    Validator handler = new Validator();
    builder.setErrorHandler(handler);
    builder.parse(XmlDocumentUrl);


    private class Validator extends DefaultHandler {
        public boolean validationError = false;

        public SAXParseException saxParseException = null;

        // Recoverable errors, e.g. schema validation failures
        public void error(SAXParseException exception) throws SAXException {
            validationError = true;
            saxParseException = exception;
            String message = "Error: " + getParseExceptionInfo(exception);
            writeToErrorFile(message);
        }

        // Non-recoverable errors, e.g. a byte the declared encoding cannot represent
        public void fatalError(SAXParseException exception) throws SAXException {
            validationError = true;
            saxParseException = exception;
            String message = "Fatal Error: " + getParseExceptionInfo(exception);
            // Report the column as well as the line held in the message.
            writeToErrorFile(message + " Column=" + exception.getColumnNumber());
        }

        public void warning(SAXParseException exception) throws SAXException {
            // Warnings are ignored.
        }

        // Formats the location and message of a parse exception
        private String getParseExceptionInfo(SAXParseException spe) {
            String systemId = spe.getSystemId();
            if (systemId == null) {
                systemId = "null";
            }
            return "URI=" + systemId
                    + " Line=" + spe.getLineNumber()
                    + ": " + spe.getMessage();
        }
    }

<ProductDescription>PUé CAS~</ProductDescription> In my XML file I have an element like this at line 20, and the é in it causes the problem.

Can anyone suggest a feasible solution for this?


 
William Brogden
Author and all-around good cowpoke
Rancher
Posts: 13058
1. Please use Code tags for code presentation.
2. Every time I have used code to get line and column numbers it has worked. How far off is the reported line number? If it points to a line before the error you know about, perhaps there is an earlier bad character.
3. If the source file has ever been near Microsoft Word you may have "smart punctuation", which looks reasonable when you edit the file but is in fact invalid in the encoding the parser expects.
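
If you want to check for an earlier bad character independently of the parser, one option is to scan the raw bytes of the file yourself. What follows is only a sketch under some assumptions: the names NonAsciiScanner and findNonAscii are made up for illustration, columns are counted in bytes rather than characters, and only '\n' is treated as a line break.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class NonAsciiScanner {

    // Returns {line, column, unsignedByteValue} for every byte outside 7-bit ASCII.
    // Lines and columns are 1-based; columns count bytes, not characters.
    static List<int[]> findNonAscii(byte[] data) {
        List<int[]> hits = new ArrayList<>();
        int line = 1;
        int column = 0;
        for (byte b : data) {
            int value = b & 0xFF;   // Java bytes are signed, so mask to 0..255
            if (value == '\n') {
                line++;
                column = 0;
                continue;
            }
            column++;
            if (value > 127) {
                hits.add(new int[] {line, column, value});
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // With no argument, fall back to a small hypothetical sample document.
        byte[] data = args.length > 0
                ? Files.readAllBytes(Paths.get(args[0]))
                : "<a>\n<b>PU\u00e9</b>\n</a>".getBytes(StandardCharsets.UTF_8);
        for (int[] hit : findNonAscii(data)) {
            System.out.printf("line %d, column %d: byte %d%n", hit[0], hit[1], hit[2]);
        }
    }
}
```

Run against the real file, the first line it prints should be the first place an ASCII-only parser can trip, which you can then compare with the line the SAXParseException reports.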

Bill
 
Paul Clapham
Sheriff
Posts: 20966
Abhi Venu wrote:
<ProductDescription>PUé CAS~</ProductDescription> In my XML file I have an element like this at line 20, and the é in it causes the problem.

Can anyone suggest a feasible solution for this?

Yes. Ask whoever produced the document to have it declare its encoding correctly. (And make sure that your code doesn't convert the bytes of the document to chars.) It appears from what you posted that the document was encoded in UTF-8, but whatever you used to display it is assuming some other encoding.
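
As a rough illustration of handing the parser bytes instead of chars: the sketch below gives DocumentBuilder an InputStream, so the parser itself decodes the bytes using the encoding named in the XML declaration. The class name, the method names and the sample document are assumptions made for the example, not taken from the original code.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class DeclaredEncodingDemo {

    // Parses raw bytes and returns the text content of the root element.
    // The parser decodes the bytes itself, guided by the XML declaration.
    static String rootText(byte[] xmlBytes) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new ByteArrayInputStream(xmlBytes));
        return doc.getDocumentElement().getTextContent();
    }

    // Hypothetical sample: the declaration says UTF-8 and the bytes really are UTF-8.
    static byte[] sampleUtf8() {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<ProductDescription>PU\u00e9 CAS~</ProductDescription>";
        return xml.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(rootText(sampleUtf8()));
    }
}
```

If the declaration named US-ASCII while the bytes stayed the same, I would expect the parse to fail on byte 195, much like the error in the original post.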

Trying to locate the "invalid" characters precisely is most likely a waste of time in this case. I think you just have an encoding problem.
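
To see why this looks like an encoding problem rather than a damaged file, it helps to check what é becomes in UTF-8. The sketch below (class and method names are hypothetical) encodes the character and also shows what a viewer displays when it wrongly reads those bytes as ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class EncodingDemo {

    // Unsigned values of the bytes produced by encoding the text as UTF-8.
    static int[] utf8Bytes(String text) {
        byte[] raw = text.getBytes(StandardCharsets.UTF_8);
        int[] values = new int[raw.length];
        for (int i = 0; i < raw.length; i++) {
            values[i] = raw[i] & 0xFF;   // mask because Java bytes are signed
        }
        return values;
    }

    // What the text looks like if its UTF-8 bytes are wrongly decoded as ISO-8859-1.
    static String misreadAsLatin1(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(utf8Bytes("\u00e9")));   // [195, 169]
        System.out.println(misreadAsLatin1("\u00e9"));
    }
}
```

The first byte, 195, is the one named in the error message: a parser told to expect ASCII rejects it because every ASCII byte is below 128.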
 