
Parsing RSS2.0 feeds using XML Pull Parser

Monu Tripathi
Rancher

Joined: Oct 12, 2008
Posts: 1369
    
    1

I am trying to parse an RSS 2.0 feed, fetched from a remote server, on my Android device using the XML Pull Parser.

I am getting invalid token exceptions after a few items have been parsed:
Error parsing document. (position:line -1, column -1) caused by: org.apache.harmony.xml.ExpatParser$ParseException: At line 158, column 25: not well-formed (invalid token)

Strangely, when I download the feed XML to the device, bundle it as an application asset and then run the same code, everything works fine.
If I request XML validation with parser.setProperty(XmlPullParser.FEATURE_VALIDATION, true), parsing fails immediately. Eventually, I am going to ask the providers of the service to validate the XML at their end.

Could character encoding be the problem here, given that I can parse the feed when I read it locally?

Thanks.
P.S: have also asked this question here

Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18988
    
    8

You are getting this XML in response to an HTTP request? Then it's possible that the encoding declared in the XML is different than the charset of the response. The rule in this case is that the charset of the response should be used by the parser, rather than the encoding declared by the XML.

However you're passing an InputStream to the parser, so the parser has no way to find out what was the charset of the response. Try passing the URL of your HTTP request instead and let the parser deal with the response directly. Or alternatively, get the charset from the response and construct a Reader which uses that charset.
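
The second suggestion (take the charset from the response and build a Reader) could be sketched roughly as below. This is only an illustration: the class ResponseCharset and its methods are hypothetical names, not part of the pull parser or the Android SDK, and falling back to UTF-8 when the header names no charset is an assumption.

```java
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of any parser API.
final class ResponseCharset {
    private ResponseCharset() {}

    /** Returns the charset named in a Content-Type value, or the fallback if none is usable. */
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String param : contentType.split(";")) {
            String p = param.trim();
            // Header parameter names are case-insensitive.
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback; // malformed or unsupported charset name
                }
            }
        }
        return fallback;
    }

    /** Wraps the response body in a Reader that decodes with the response charset. */
    static Reader readerFor(InputStream body, String contentType) {
        // Assumption: UTF-8 when the header does not name a charset.
        return new InputStreamReader(body, fromContentType(contentType, StandardCharsets.UTF_8));
    }
}
```

With an HttpURLConnection, the header value would come from getContentType(), and the resulting Reader can be handed to the parser's setInput(Reader) overload instead of the raw stream.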
Monu Tripathi

Thanks Paul; will try this out
Monu Tripathi

Okay I tried this out and here's the update ...
Paul Clapham wrote: You are getting this XML in response to an HTTP request? Then it's possible that the encoding declared in the XML is different than the charset of the response. The rule in this case is that the charset of the response should be used by the parser, rather than the encoding declared by the XML.

Yes. This XML is an HTTP response. The charset of the response is utf-8 (as set in the header Content-Type: text/xml;charset=utf-8). When I open the link in the browser and save it as XML, the XML declaration also shows the encoding as "utf-8":
<?xml version="1.0" encoding="utf-8" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">


Paul Clapham wrote: However you're passing an InputStream to the parser, so the parser has no way to find out what was the charset of the response. Try passing the URL of your HTTP request instead and let the parser deal with the response directly. Or alternatively, get the charset from the response and construct a Reader which uses that charset.

According to the javadocs of the pull parser bundled with the Android SDK, when I call setInput on the parser instance, the parser tries to determine the encoding based on certain conditions:
public abstract void setInput (InputStream inputStream, String inputEncoding)

Since: API Level 1
Sets the input stream the parser is going to process. This call resets the parser state and sets the event type to the initial value START_DOCUMENT.
NOTE: If an input encoding string is passed, it MUST be used. Otherwise, if inputEncoding is null, the parser SHOULD try to determine input encoding following XML 1.0 specification (see below). If encoding detection is supported then following feature http://xmlpull.org/v1/doc/features.html#detect-encoding MUST be true and otherwise it must be false
Parameters
inputStream contains a raw byte input stream of possibly unknown encoding (when inputEncoding is null).
inputEncoding if not null it MUST be used as encoding for inputStream

I tried setting the encoding explicitly to "utf-8", but this still doesn't work; I get the same exceptions.

When I looked at the HTTP traffic using a sniffer (Charles Proxy) and tried to view the response XML, the tool told me that there is an invalid Unicode character in a CDATA section, so it cannot parse the XML to fill the view.
[Failed to parse data: org.xml.sax.SAXParseException: An invalid XML character(Unicode 0x12) was found in CDATA section.]


Maybe I should try creating a Reader with the appropriate charset (utf-8) and pass that to the parser?
Monu Tripathi

I should also mention that when I open the feed XML in the Opera browser, I get a parse exception: illegal Unicode (0x12) character. Safari does not have any issues; its character encoding is set to Default, which I believe is Western (ISO Latin-1).

EDIT: It seems Safari removes those characters, because I don't see accents in the text.
Monu Tripathi

Okay. Unicode 0x12 is not a valid XML character, whatever the encoding of the file (according to the set of allowed characters in the XML 1.0 recommendation). Maybe I should just escape or drop the offending characters in the document? Or should I just ask the providers of the service to fix this at their end?
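
Dropping the offending characters before parsing could look something like the sketch below. It follows the Char production of XML 1.0 (tab, line feed, carriage return, and the ranges 0x20-0xD7FF, 0xE000-0xFFFD, 0x10000-0x10FFFF); the class name XmlCharFilter is made up for this example, and it assumes the whole feed has already been decoded into a String.

```java
// Hypothetical helper for illustration; not part of any parser API.
final class XmlCharFilter {
    private XmlCharFilter() {}

    /** True if the code point is allowed by the XML 1.0 Char production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the text with every code point that XML 1.0 forbids removed. */
    static String strip(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            // Advance by one or two chars depending on the code point.
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

The cleaned string can then be wrapped in a java.io.StringReader and passed to the parser. This silently loses data, so it is a stopgap at best; getting the feed fixed at the source remains the cleaner option.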
sudhin philip
Greenhorn

Joined: May 09, 2011
Posts: 1
Hi,

I think you are facing this issue because the XML is not well formed. Here is a workaround: clean up the XML before parsing it by replacing each '&' in it with "&amp;". This may fix your issue, as this kind of error can arise when a bare '&' appears in the XML data you receive.

For example, if the string xmlString holds your XML data, then

xmlString = xmlString.replaceAll("&", "&amp;");

will give you an error-free XML string for parsing.

Hope this saves you at least a few minutes.
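
One caveat with a blanket replacement is that it also rewrites ampersands that already start an entity or character reference (for example "&amp;" would become "&amp;amp;"). A possible variant that leaves those alone is sketched below; the class name AmpersandEscaper and the exact pattern are assumptions made for this example, and it does not treat CDATA sections specially.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any parser API.
final class AmpersandEscaper {
    // Matches '&' unless it begins a named entity (&name;), a decimal
    // reference (&#38;) or a hexadecimal reference (&#x26;).
    private static final Pattern BARE_AMP = Pattern.compile(
            "&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    private AmpersandEscaper() {}

    /** Escapes only the ampersands that are not already part of a reference. */
    static String escapeBareAmpersands(String xml) {
        return BARE_AMP.matcher(xml).replaceAll("&amp;");
    }
}
```

Note that this only helps with stray ampersands; it would not remove an invalid control character such as the 0x12 reported earlier in the thread.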

Regards,
Sudhin Philip.
 