Get plain text content from HTML document?

 
Chinh Tran Nam
Ranch Hand
Posts: 35
Hi All,

I'm looking for sample code that removes all HTML tags from an HTML document and returns only the plain-text content. The code should also replace <br> and similar tags with "\n".
Please help.

Thanks in advance.
 
Ulf Dittmer
Rancher
Posts: 42966
If you want to control very precisely how the HTML is converted, you could use a library that reads HTML and gives you a DOM tree. NekoHTML and JTidy are two such libraries.
Alternatively, you could use regular expressions to find the tags (anything between angle brackets) and replace them.
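For what it's worth, here is a minimal sketch of the regular-expression route using only java.util.regex. The class name RegexHtmlStripper and the method toPlainText are made up for illustration, and two choices are assumptions on my part rather than anything stated in the thread: a closing </p> is treated as a line break alongside <br>, and <script>/<style> blocks are dropped together with their contents. Regular expressions do not really parse HTML, so treat this as a rough filter for reasonably well-formed markup.

```java
import java.util.regex.Pattern;

// Hypothetical helper: strips tags with regular expressions only.
final class RegexHtmlStripper {

    // Whole <script>...</script> and <style>...</style> blocks, contents included.
    // (?i) = case-insensitive, (?s) = '.' also matches line breaks.
    private static final Pattern NON_TEXT_BLOCKS =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");

    // Tags that should become a newline: <br>, <br/>, <br /> and </p>.
    // Treating </p> as a line break is an assumption of this sketch.
    private static final Pattern LINE_BREAK_TAGS =
            Pattern.compile("(?i)<br\\s*/?>|</p\\s*>");

    // Any remaining tag: '<', then one or more non-'>' characters, then '>'.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    static String toPlainText(String html) {
        // Order matters: drop non-text blocks first, then turn line-breaking
        // tags into newlines, and only then remove every other tag.
        String text = NON_TEXT_BLOCKS.matcher(html).replaceAll("");
        text = LINE_BREAK_TAGS.matcher(text).replaceAll("\n");
        return ANY_TAG.matcher(text).replaceAll("");
    }
}
```

Calling RegexHtmlStripper.toPlainText("<p>Hello<br>World</p>") returns "Hello\nWorld\n". Character entities such as &amp; are left untouched, and a '>' inside an attribute value will confuse the tag pattern, which is the usual argument for a real parser.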
 
Stan James
(instanceof Sidekick)
Ranch Hand
Posts: 8791
This short article about the Visitor pattern has a reference to the Quiotix HTML parser. A visitor would be a neat way to go through all the nodes in the HTML DOM and write out text or newlines. I just have a bias against the complexity of walking most DOMs.
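To give a feel for the "visit each piece and write out text or newlines" idea without pulling in an extra library, here is a rough sketch that swaps in the HTML parser bundled with the JDK (javax.swing.text.html) in place of the Quiotix parser. It is callback-based rather than a true Visitor over a DOM, but the shape is similar: the parser calls back for each text run and tag. The class name CallbackTextExtractor is made up, treating </p> as a line break is my assumption, and I have not checked how this parser reports <script> or <style> contents, so the sketch does nothing special for them.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects text via the JDK's callback-based HTML parser.
final class CallbackTextExtractor extends HTMLEditorKit.ParserCallback {

    private final StringBuilder out = new StringBuilder();

    @Override
    public void handleText(char[] data, int pos) {
        // Called for each run of character data between tags.
        out.append(data);
    }

    @Override
    public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        // Empty elements such as <br> arrive here.
        if (tag == HTML.Tag.BR) {
            out.append('\n');
        }
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos) {
        // Assumption of this sketch: end a paragraph with a newline.
        if (tag == HTML.Tag.P) {
            out.append('\n');
        }
    }

    static String toPlainText(String html) {
        CallbackTextExtractor callback = new CallbackTextExtractor();
        try {
            // 'true' tells the parser to ignore any charset declared in the document.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback.out.toString();
    }
}
```

Usage is a single call, for example CallbackTextExtractor.toPlainText(html). Compared with a regex filter, the parser takes care of tag boundaries and malformed markup, at the cost of less control over whitespace.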
 
Chinh Tran Nam
Ranch Hand
Posts: 35
Thanks All,

I found this library on SourceForge. It works fairly well; however, there is still a problem parsing duplicate tags (e.g., more than one <style> block in an HTML document).

http://htmlparser.sourceforge.net/
 