
Get plain text content from HTML document?

Chinh Tran Nam
Ranch Hand

Joined: Nov 08, 2004
Posts: 35
Hi All,

I'm looking for sample code that removes all HTML tags from an HTML document and returns only the plain-text content. The code should also replace line-breaking tags such as <br> with "\n".
Please help.

Thanks in advance.
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41530
    
If you want to control very precisely how the HTML is converted, you could use a library that reads HTML and gives you a DOM tree. NekoHTML and JTidy are two such libraries.
Alternatively, you could use regular expressions to find the tags (anything between angle brackets) and replace them.
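For the regular-expression route, a minimal sketch might look like the following. The class name `HtmlToText` and the method `strip` are made up for illustration, and the patterns are assumptions about what "tag" should mean here. It drops whole <script> and <style> elements, turns <br> into a newline, and deletes any remaining tags. It does not decode entities such as &amp;amp; and will be fooled by a ">" inside an attribute value, so a real parser is the safer choice for messy input.

```java
import java.util.regex.Pattern;

public class HtmlToText {
    // Whole <script> or <style> elements, bodies included.
    // (?i) = case-insensitive, (?s) = '.' also matches line breaks.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");

    // <br>, <br/> and <br />, in any letter case.
    private static final Pattern LINE_BREAK = Pattern.compile("(?i)<br\\s*/?>");

    // Any remaining tag: '<', then anything that is not '>', then '>'.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    /** Hypothetical helper: returns the input with tags removed and <br> as "\n". */
    public static String strip(String html) {
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll("");
        text = LINE_BREAK.matcher(text).replaceAll("\n");
        return ANY_TAG.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("<p>Hello<br>world</p>"));
    }
}
```

The order of the three replacements matters: the line-break pattern has to run before the catch-all tag pattern, otherwise the <br> tags would be deleted instead of converted.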


Stan James
(instanceof Sidekick)
Ranch Hand

Joined: Jan 29, 2003
Posts: 8791
This short article about Visitor Pattern has a reference to the Quiotix HTML parser. The visitor would be a neat way to go through all the nodes in the HTML DOM and write out text or newlines. I just have a bias against the complexity in walking most DOMs.
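To illustrate the visitor idea, here is a rough sketch. It does not use the Quiotix parser's API; the `Node`, `TextNode`, `ElementNode` and `Visitor` types below are hypothetical stand-ins for whatever node classes a real parser provides. The visitor appends text nodes to a buffer, writes a newline for each br element, and skips style and script elements entirely.

```java
import java.util.Arrays;
import java.util.List;

public class VisitorSketch {
    // Hypothetical visitor interface: one method per node type.
    interface Visitor {
        void visitText(TextNode node);
        void visitElement(ElementNode node);
    }

    // Hypothetical node type; a real parser would supply its own.
    interface Node {
        void accept(Visitor v);
    }

    static final class TextNode implements Node {
        final String text;
        TextNode(String text) { this.text = text; }
        public void accept(Visitor v) { v.visitText(this); }
    }

    static final class ElementNode implements Node {
        final String name;
        final List<Node> children;
        ElementNode(String name, Node... children) {
            this.name = name;
            this.children = Arrays.asList(children);
        }
        public void accept(Visitor v) { v.visitElement(this); }
    }

    // Collects plain text while walking the tree.
    static final class PlainTextVisitor implements Visitor {
        private final StringBuilder out = new StringBuilder();

        public void visitText(TextNode node) {
            out.append(node.text);
        }

        public void visitElement(ElementNode node) {
            if (node.name.equalsIgnoreCase("br")) {
                out.append('\n');          // line break becomes a newline
                return;
            }
            if (node.name.equalsIgnoreCase("style")
                    || node.name.equalsIgnoreCase("script")) {
                return;                    // not visible text, skip the subtree
            }
            for (Node child : node.children) {
                child.accept(this);        // recurse into ordinary elements
            }
        }

        String result() { return out.toString(); }
    }

    static String toPlainText(Node root) {
        PlainTextVisitor visitor = new PlainTextVisitor();
        root.accept(visitor);
        return visitor.result();
    }

    public static void main(String[] args) {
        Node doc = new ElementNode("p",
                new TextNode("Hello"), new ElementNode("br"), new TextNode("world"));
        System.out.println(toPlainText(doc));
    }
}
```

The tree-walking logic lives in one place (the visitor), so changing how a tag is rendered means editing a single method rather than the node classes.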


Chinh Tran Nam
Ranch Hand

Joined: Nov 08, 2004
Posts: 35
Thanks All,

I found this library on SourceForge. It works fairly well; however, it still has a problem parsing repeated tags (e.g., more than one <style> block in an HTML document).

http://htmlparser.sourceforge.net/
 