Get plain text content from HTML document?

 
Chinh Tran Nam
Ranch Hand
Posts: 35
Hi All,

I'm looking for sample code that removes all HTML tags from an HTML document and returns only the plain-text content. The code should also replace <br> and similar line-breaking tags with "\n".
Please help.

Thanks in advance.
 
Ulf Dittmer
Rancher
Posts: 42967
If you want precise control over how the HTML is converted, you could use a library that reads HTML and gives you a DOM tree. NekoHTML and JTidy are two such libraries.
Alternatively, you could use regular expressions to find the tags (everything between angle brackets) and replace them.
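
To make the regular-expression route concrete, here is a minimal sketch using only java.util.regex. The class name HtmlToText and the method toPlainText are made up for illustration, and treating </p> as a line break is an assumption about what "similar tags" should mean. It drops every <script> and <style> block (however many there are), turns <br> and </p> into "\n", and then strips whatever tags remain. It does not decode entities such as &amp;, and regular expressions can be fooled by unusual markup, so treat it as a starting point rather than a robust converter.

```java
import java.util.regex.Pattern;

public class HtmlToText {

    // Whole <script>...</script> and <style>...</style> blocks, including their content.
    // (?i) = case-insensitive, (?s) = '.' also matches newlines; .*? is reluctant,
    // so each block is matched separately even when several occur in one document.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");

    // Tags that should become a newline: <br>, <br/>, <br /> and closing </p>.
    private static final Pattern LINE_BREAK =
            Pattern.compile("(?i)<br\\s*/?>|</p\\s*>");

    // Any remaining tag: '<', then one or more non-'>' characters, then '>'.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    /** Returns the text of the given HTML with tags removed and line breaks kept. */
    public static String toPlainText(String html) {
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll("");
        text = LINE_BREAK.matcher(text).replaceAll("\n");
        text = ANY_TAG.matcher(text).replaceAll("");
        return text.trim();
    }
}
```

Calling HtmlToText.toPlainText("<p>Hello<br>World</p>") returns the two words separated by a newline. The order of the three steps matters: style and script content has to go before the generic tag removal, otherwise the CSS or JavaScript text would be left behind.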
 
Stan James
(instanceof Sidekick)
Ranch Hand
Posts: 8791
This short article about the Visitor pattern has a reference to the Quiotix HTML parser. A visitor would be a neat way to go through all the nodes in the HTML DOM and write out text or newlines. I just have a bias against the complexity of walking most DOMs.
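
As a rough illustration of the visitor idea, here is a self-contained sketch. It does not use the Quiotix parser or any real library: the node classes, the NodeVisitor interface, and the hand-built tree in demo() are all hypothetical stand-ins for whatever a parser would give you. The visitor appends text nodes, emits "\n" when it leaves a br or p element (an assumption about which tags should break lines), and ignores everything inside style or script elements.

```java
import java.util.ArrayList;
import java.util.List;

public class TextExtractingVisitorDemo {

    // Hypothetical visitor interface; not the API of any real HTML parser.
    interface NodeVisitor {
        void visitText(TextNode node);
        void enterElement(ElementNode node);
        void leaveElement(ElementNode node);
    }

    interface Node {
        void accept(NodeVisitor visitor);
    }

    static final class TextNode implements Node {
        final String text;
        TextNode(String text) { this.text = text; }
        public void accept(NodeVisitor visitor) { visitor.visitText(this); }
    }

    static final class ElementNode implements Node {
        final String name;
        final List<Node> children = new ArrayList<>();
        ElementNode(String name, Node... kids) {
            this.name = name;
            for (Node kid : kids) children.add(kid);
        }
        public void accept(NodeVisitor visitor) {
            visitor.enterElement(this);
            for (Node child : children) child.accept(visitor); // the tree drives the walk
            visitor.leaveElement(this);
        }
    }

    /** Collects plain text; the traversal logic lives in the nodes, not here. */
    static final class PlainTextVisitor implements NodeVisitor {
        private final StringBuilder out = new StringBuilder();
        private int skipDepth = 0; // > 0 while inside <style> or <script>

        private static boolean isSkipped(ElementNode n) {
            return n.name.equals("style") || n.name.equals("script");
        }
        public void visitText(TextNode n) {
            if (skipDepth == 0) out.append(n.text);
        }
        public void enterElement(ElementNode n) {
            if (isSkipped(n)) skipDepth++;
        }
        public void leaveElement(ElementNode n) {
            if (isSkipped(n)) {
                skipDepth--;
            } else if (skipDepth == 0 && (n.name.equals("br") || n.name.equals("p"))) {
                out.append('\n');
            }
        }
        String text() { return out.toString(); }
    }

    /** Builds a tiny tree by hand (a real parser would produce it) and extracts its text. */
    public static String demo() {
        Node doc = new ElementNode("body",
                new ElementNode("style", new TextNode("a{}")),
                new ElementNode("p",
                        new TextNode("Hello"), new ElementNode("br"), new TextNode("World")),
                new ElementNode("style", new TextNode("b{}")));
        PlainTextVisitor visitor = new PlainTextVisitor();
        doc.accept(visitor);
        return visitor.text();
    }
}
```

The point of the pattern is that the output rules (which tags break lines, which subtrees to skip) sit in one small class, while the nodes themselves handle the recursion.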
 
Chinh Tran Nam
Ranch Hand
Posts: 35
Thanks All,

I found this library on SourceForge. It works fairly well; however, there is still a problem parsing duplicate tags (e.g. more than one <style> block in an HTML document).

http://htmlparser.sourceforge.net/
 