get xml's cdata using saxparser

tasos georgiou
Greenhorn

Joined: Nov 30, 2012
Posts: 8
Hi. I'm building an RSS reader and I want to get the text content of a CDATA section. I've managed to get everything contained inside the CDATA, but I'm only interested in the text, without, for example, links, image src attributes, or greater-than/less-than signs. I've tried to do that to some extent using regex, but that gets complex. Is there another way to do this, or is regex the only solution?
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18880
    

Hi Tasos, welcome to the Ranch!

Am I correct in guessing that what you get out of the CDATA is some kind of HTML data? And that it isn't necessarily well-formed HTML?

If so, then regex isn't going to be very useful in extracting the text and discarding the markup. Regular expressions don't work well with languages with recursive grammars like HTML. So what I suggest is that you should get an HTML parser and parse the contents of the string. Then extract only the text nodes from the parsed HTML and discard everything else.
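As a rough illustration of that suggestion, here is a minimal sketch that uses the HTML parser bundled with the JDK (javax.swing.text.html) to keep only the text. The class name HtmlTextExtractor and the sample markup are made up for this example, and it assumes the CDATA string has already been collected by the SAX handler. The bundled parser only understands fairly old HTML, so a dedicated HTML parsing library may cope better with messy feeds.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of any library: keeps text nodes, drops markup.
final class HtmlTextExtractor {

    /** Returns only the text of an HTML fragment, discarding tags and attributes. */
    static String extractText(String html) throws IOException {
        final StringBuilder text = new StringBuilder();

        // The parser calls handleText once for each run of text it finds;
        // tags, attributes and comments go to other callbacks, which we ignore.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(data);
            }
        };

        // The third argument tells the parser to ignore any charset declaration.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return text.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        // Sample input standing in for the string read from the CDATA section.
        String cdata = "<p>Big news <a href=\"http://example.com/a\">read more</a></p>"
                + "<img src=\"http://example.com/pic.png\">";
        System.out.println(extractText(cdata));
    }
}
```

The link target and the image source never reach handleText, so they drop out without any pattern matching.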

tasos georgiou
Greenhorn

Joined: Nov 30, 2012
Posts: 8
I've managed to do it with regex for now. Besides, my CDATA comes from an RSS feed and there are only a couple of links and pictures. Thanks for your help.
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18880
    

Yes, if you're only getting simple and predictable data then you can make a regex work. But later if you find the data is not as predictable, or it is more complex, you may find that you can't make a working regex any more.
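To make that concrete, here is a small sketch of the kind of tag-stripping regex that works on simple input and breaks on slightly less predictable input. The pattern and the class name RegexTagStripper are only an illustration, not the regex actually used in this thread.

```java
import java.util.regex.Pattern;

// Hypothetical example of a naive tag-stripping regex and where it fails.
final class RegexTagStripper {

    // Matches "<", then anything that is not ">", then ">".
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // Predictable markup: the regex does what you want.
        String simple = "Story <a href=\"http://example.com\">link</a> text";
        System.out.println(stripTags(simple));   // prints: Story link text

        // A ">" inside an attribute value ends the match too early,
        // so part of the tag is left behind in the output.
        String tricky = "<img alt=\"a > b\" src=\"p.png\">Caption";
        System.out.println(stripTags(tricky));   // prints:  b" src="p.png">Caption
    }
}
```

You can keep patching the pattern for each new case, but every patch makes it harder to read, which is the point at which a parser becomes the easier option.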
 