Get plain text content from HTML document?

Chinh Tran Nam
Ranch Hand

Joined: Nov 08, 2004
Posts: 35
Hi All,

I'm looking for sample code that removes all HTML tags from an HTML document and returns only the plain-text content. The code should also replace <br> and similar tags with "\n".
Please help.

Thanks in advance.
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 39537
If you want to control very precisely how the HTML is converted, you could use a library that reads HTML and gives you a DOM tree. NekoHTML and JTidy are two such libraries.
Alternatively, you could use regular expressions to search and replace angle brackets.
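
To make the regular-expression route concrete, here is a minimal sketch using only java.util.regex. The class name RegexHtmlStripper and the method toPlainText are made up for illustration, and treating <p> as a line break is my assumption about what the original poster wants. Regular expressions cannot cope with every HTML document (comments, a ">" inside an attribute value, unusual entities), so take this as a starting point for simple, well-behaved input rather than a general solution.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not taken from any library.
class RegexHtmlStripper {
    // <script> and <style> elements together with their contents.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");
    // <br>, <br/>, <p ...> and </p> are turned into line breaks (assumption).
    private static final Pattern LINE_BREAK =
            Pattern.compile("(?i)<br\\s*/?>|</?p\\b[^>]*>");
    // Whatever tags are left after the two steps above.
    private static final Pattern ANY_TAG = Pattern.compile("(?s)<[^>]+>");

    static String toPlainText(String html) {
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll("");
        text = LINE_BREAK.matcher(text).replaceAll("\n");
        text = ANY_TAG.matcher(text).replaceAll("");
        // Decode a few common entities; &amp; goes last so that
        // "&amp;lt;" ends up as "&lt;" and not "<".
        text = text.replace("&nbsp;", " ")
                   .replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&amp;", "&");
        return text.trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><style>p { color: red; }</style></head>"
                + "<body>Hello<br>world <b>again</b></body></html>";
        System.out.println(toPlainText(html));
    }
}
```

Removing script and style blocks before the generic tag pass matters, because otherwise their contents would survive as "text". The non-greedy .*? is what lets several separate <style> blocks be removed one by one instead of everything between the first opening and the last closing tag.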


Stan James
(instanceof Sidekick)
Ranch Hand

Joined: Jan 29, 2003
Posts: 8791
This short article about the Visitor pattern has a reference to the Quiotix HTML parser. A visitor would be a neat way to go through all the nodes in the HTML DOM and write out text or newlines. I just have a bias against the complexity of walking most DOMs.
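
Roughly what I have in mind, as a sketch only. The node classes below (Node, TextNode, ElementNode) are a made-up miniature tree, not the Quiotix parser's actual API, which I am not reproducing here. A real parser would build the tree for you; the visitor part is the point.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical miniature node model, for illustration only.
class VisitorSketch {
    interface Node { void accept(Visitor v); }

    interface Visitor {
        void visitText(TextNode node);
        void visitElement(ElementNode node);
    }

    static class TextNode implements Node {
        final String text;
        TextNode(String text) { this.text = text; }
        public void accept(Visitor v) { v.visitText(this); }
    }

    static class ElementNode implements Node {
        final String name;
        final List<Node> children = new ArrayList<>();
        ElementNode(String name, Node... kids) {
            this.name = name;
            for (Node kid : kids) children.add(kid);
        }
        public void accept(Visitor v) { v.visitElement(this); }
    }

    // Collects visible text and emits "\n" for <br> and after each <p>.
    static class PlainTextVisitor implements Visitor {
        private final StringBuilder out = new StringBuilder();

        public void visitText(TextNode node) { out.append(node.text); }

        public void visitElement(ElementNode node) {
            String name = node.name.toLowerCase();
            // Skip elements whose content is not meant to be displayed.
            if (name.equals("script") || name.equals("style")) return;
            if (name.equals("br")) { out.append('\n'); return; }
            for (Node child : node.children) child.accept(this);
            if (name.equals("p")) out.append('\n');
        }

        String result() { return out.toString(); }
    }

    static String toPlainText(Node root) {
        PlainTextVisitor visitor = new PlainTextVisitor();
        root.accept(visitor);
        return visitor.result();
    }

    public static void main(String[] args) {
        Node body = new ElementNode("body",
                new ElementNode("style", new TextNode("p { color: red; }")),
                new TextNode("Hello"),
                new ElementNode("br"),
                new TextNode("world"));
        System.out.println(toPlainText(body));
    }
}
```

The nice part is that all decisions about which tags become newlines and which are dropped sit in one small class, so changing the output rules does not touch the tree-walking code.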


Chinh Tran Nam
Ranch Hand

Joined: Nov 08, 2004
Posts: 35
Thanks All,

I found this library on SourceForge. It works fairly well; however, there is still a problem parsing duplicate tags (e.g., more than one <style> block in an HTML document).

http://htmlparser.sourceforge.net/
 
I agree. Here's the link: http://aspose.com/file-tools
 