java.io.UTFDataFormatException: Invalid byte 1 of 1-byte UTF-8 sequence in JDOM SaxBuilder

Jack Bush
Ranch Hand

Joined: Oct 20, 2006
Posts: 235
Hi All,

I am unable to read an XML file; parsing fails with the following error:

at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
at org.apache.xerces.impl.XMLEntityScanner.skipChar(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at org.jdom.input.SAXBuilder.build(SAXBuilder.java:489)
at org.jdom.input.SAXBuilder.build(SAXBuilder.java:928)
at XMLProject.main(generateXML.java:45)

The header of state.xml is as follows:
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE html (View Source for full doctype...)>
- <html xmlns="http://www.w3.org/1999/xhtml" xmlns:html="http://www.w3.org/1999/xhtml">



Any assistance would be much appreciated.
Thanks,
Jack
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18541
    

The document is encoded in UTF-8, but you are reading it with a Reader that doesn't use UTF-8. It uses some other encoding; which one doesn't really matter. Don't use a Reader at all. Pass an InputStream to the parser and let it deal with the encoding issues.
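To illustrate the difference, here is a minimal sketch using the JDK's built-in DocumentBuilder rather than JDOM, so it runs without extra libraries. The class name, method names, and sample document are hypothetical. It parses the same UTF-8 bytes twice: once from an InputStream, where the parser works out the encoding itself, and once through a Reader that decodes with the wrong charset.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of the original program.
class EncodingDemo {
    // A tiny document whose bytes really are UTF-8, as its declaration says.
    static byte[] sampleBytes() {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><name>caf\u00e9</name>";
        return xml.getBytes(StandardCharsets.UTF_8);
    }

    // Byte stream: the parser reads the XML declaration and decodes the bytes itself.
    static String parseFromStream(byte[] bytes) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try (InputStream in = new ByteArrayInputStream(bytes)) {
            return builder.parse(in).getDocumentElement().getTextContent();
        }
    }

    // Character stream: the bytes are already decoded (here with the wrong
    // charset) before the parser sees them, so it cannot correct the mistake.
    static String parseFromReader(byte[] bytes) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.ISO_8859_1)) {
            return builder.parse(new InputSource(reader)).getDocumentElement().getTextContent();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("stream: " + parseFromStream(sampleBytes()));
        System.out.println("reader: " + parseFromReader(sampleBytes()));
    }
}
```

A FileReader behaves like the second method: it decodes with a charset chosen outside the parser (the platform default, unless one is given explicitly), so the encoding declared in the document is never consulted.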
Jack Bush
Ranch Hand

Joined: Oct 20, 2006
Posts: 235
Hi Paul,

After replacing the FileReader with an InputStream as suggested, the following code can finally read state.xml and transform it to state.html, but only when there is an Internet connection:

[code]19. SAXBuilder stateBuilder = new SAXBuilder("org.ccil.cowan.tagsoup.Parser", false);
20. stateBuilder.setValidation(false);
21. FileInputStream stateIS = new FileInputStream("E:\\state.xml");
22. BufferedInputStream stateBIS = new BufferedInputStream(stateIS);
23. Document stateOriginaljdomDocument = stateBuilder.build(stateBIS);
24. TransformerFactory stateFactory = TransformerFactory.newInstance();
25. Transformer stateTransformer = stateFactory.newTransformer(new StreamSource("E:\\stateStyleSheet.xsl"));
26. JDOMSource stateSource = new JDOMSource(stateOriginaljdomDocument);
27. JDOMResult stateResult = new JDOMResult();
28. stateTransformer.transform(stateSource, stateResult);
......[/code]
[u]Offline[/u]

[color=red][b]javax.xml.transform.TransformerException: org.jdom.JDOMException: DTD parsing error: www.w3.org
at org.apache.xalan.transformer.TransformerImpl.fatalError(TransformerImpl.java:738)
at org.apache.xalan.transformer.TransformerImpl.transform(TransformerImpl.java:712)
at org.apache.xalan.transformer.TransformerImpl.transform(TransformerImpl.java:1126)
at org.apache.xalan.transformer.TransformerImpl.transform(TransformerImpl.java:1104)
at XMLProject.main(generateXML.java:28)
Caused by: org.jdom.JDOMException: DTD parsing error: www.w3.org
at org.jdom.transform.JDOMSource$DocumentReader.parse(JDOMSource.java:525)
at org.apache.xml.dtm.ref.DTMManagerDefault.getDTM(DTMManagerDefault.java:478)
at org.apache.xalan.transformer.TransformerImpl.transform(TransformerImpl.java:655)
... 3 more[/b][/color]
It appears that the transformation process is trying to validate against the DTD even though I turned validation off during parsing. Can you confirm whether the validation attempt occurs during the parsing step or the transformation step? And how can I prevent it?

Many thanks again for your valuable advice,

Jack
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18541
    

No, it isn't trying to validate against the DTD. It is trying to read the DTD. You seem to be under the impression that DTDs are just for validation, but they perform several other functions. Entity definitions, for one. If the DTD contains entity definitions, you want them applied to your document, or it won't parse correctly. That's why you can't tell the parser to ignore the DTD entirely.

You can use an XML Catalog to tell the parser to look for a local copy of the DTD, or you can use an EntityResolver which you attach to your parser.
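As a rough sketch of the EntityResolver route, the example below again uses the JDK's DocumentBuilder so it is self-contained. The class name, the sample document, and the one-line stand-in DTD are all hypothetical; in practice the resolver would return an InputSource for a local copy of the real XHTML DTD. JDOM's SAXBuilder also has a setEntityResolver method that accepts the same kind of resolver.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of the original program.
class LocalDtdDemo {
    static final String XHTML_DTD_URL = "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd";

    // Stand-in for a local copy of the DTD; it defines only the one entity used below.
    static final String STAND_IN_DTD = "<!ENTITY nbsp \"&#160;\">";

    // Records which system IDs the parser asked for, to show the resolver was consulted.
    static final List<String> requested = new ArrayList<>();

    // Serve the DTD locally instead of letting the parser fetch it from www.w3.org.
    static final EntityResolver LOCAL_RESOLVER = (publicId, systemId) -> {
        requested.add(systemId);
        if (XHTML_DTD_URL.equals(systemId)) {
            return new InputSource(new StringReader(STAND_IN_DTD));
        }
        return null; // anything else falls back to the parser's default resolution
    };

    static byte[] sampleBytes() {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" \"" + XHTML_DTD_URL + "\">\n"
                + "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body><p>a&nbsp;b</p></body></html>";
        return xml.getBytes(StandardCharsets.UTF_8);
    }

    // Non-validating parse; the DTD is still read so that &nbsp; can be expanded.
    static String parseWithLocalDtd(byte[] bytes) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        builder.setEntityResolver(LOCAL_RESOLVER);
        return builder.parse(new ByteArrayInputStream(bytes)).getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parseWithLocalDtd(sampleBytes()));
        System.out.println("resolver was asked for: " + requested);
    }
}
```

Here validation is off, yet the parser still asks for the DTD because it needs the definition of `&nbsp;`. With the resolver in place, that request is answered locally and no network access is needed.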
Jack Bush
Ranch Hand

Joined: Oct 20, 2006
Posts: 235
Hi Paul,

Thanks for your suggestion.

Looks like I have a bit of homework to do.

I will try out both the XML Catalog and the EntityResolver to see which is more suitable for me.

Cheers,

Jack
 