
Converting HTML into text

 
johnathan keats
Greenhorn
Posts: 7
Hi,

I wrote a script which gets a webpage and dumps the entire thing into a file.

Is there any way to remove all the HTML and formatting so I'm left with just the text?

Also, how do I extract the URLs in the file?

Thank you in advance
 
Dave Klein
author
Ranch Hand
Posts: 77
Parsing HTML, unless it's extremely simple HTML, is tricky business. You're probably best off using a library for that, such as the JDK's HTMLEditorKit (in javax.swing.text.html). If you do a Google search for "HTMLEditorKit extract text from html", you'll find some examples.
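
To give a rough idea of what that looks like, here is a minimal sketch using the Swing HTML parser that ships with the JDK. The class name HtmlTextExtractor and the method extractText are made up for this example, and joining the text chunks with single spaces is just one reasonable choice. It only collects what the parser reports as text, so treat it as a starting point rather than a complete solution.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, named for this example only.
public class HtmlTextExtractor {

    // Returns the text content of the given HTML, with tags dropped.
    public static String extractText(String html) {
        final StringBuilder text = new StringBuilder();

        // The parser calls handleText once for each run of text it finds.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                text.append(data).append(' ');
            }
        };

        try (Reader reader = new StringReader(html)) {
            // true = ignore any charset declared inside the document
            new ParserDelegator().parse(reader, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        // Collapse runs of whitespace so the chunks read as one line.
        return text.toString().replaceAll("\\s+", " ").trim();
    }
}
```

You would read the dumped file into a String (or pass a FileReader to parse directly) and call HtmlTextExtractor.extractText on it.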

As for identifying URLs in a file, you can use a regex for that, though it can get ugly too. Here's a JavaRanch thread with an example: http://www.coderanch.com/t/382015/Java-General/java/regex-find-url
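
For a sense of the regex approach, here is a small sketch. It is not the code from the linked thread. The class name UrlFinder and the pattern are assumptions for illustration: the pattern only matches absolute http and https URLs and stops at whitespace, quotes, and angle brackets, so it will miss relative links and may pick up trailing punctuation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named for this example only.
public class UrlFinder {

    // Deliberately simple pattern: "http://" or "https://" followed by
    // anything that is not whitespace, a quote, or an angle bracket.
    private static final Pattern URL =
            Pattern.compile("https?://[^\\s\"'<>]+", Pattern.CASE_INSENSITIVE);

    // Returns every match in the order it appears in the text.
    public static List<String> findUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = URL.matcher(text);
        while (matcher.find()) {
            urls.add(matcher.group());
        }
        return urls;
    }
}
```

Since the pattern stops at quotes, it can be run over the raw HTML file and will pull URLs out of href and src attributes as well as plain text.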

Have fun,
Dave
 