
Converting HTML into text

johnathan keats

Joined: Nov 19, 2009
Posts: 7

I wrote a script which gets a webpage and dumps the entire thing into a file.

Is there any way to remove all the HTML and formatting stuff so I'm left with just the text?

Also, how do I extract the URLs in the file?

Thank you in advance

John Keats is a poet, NOT my real name!
Dave Klein
Ranch Hand

Joined: Aug 29, 2007
Posts: 77
Parsing HTML, unless it's extremely simple HTML, is tricky business. You're probably best off using a Java library for that, like HTMLEditorKit (javax.swing.text.html.HTMLEditorKit, which ships with the JDK). If you do a Google search for "HtmlEditorKit extract text from html", you'll come up with some examples.
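
For what it's worth, here is a minimal sketch of the HTMLEditorKit approach, assuming the page has already been read into a String. The class name HtmlToText and the method extractText are made up for illustration. It feeds the HTML to the Swing parser (ParserDelegator) and collects whatever the parser reports as text. It is plain Java, so it should be callable from a Groovy script as well. Treat it as a starting point, not a complete solution: real-world pages may need extra handling for things like whitespace and entities.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class; the name is not from any library.
public class HtmlToText {

    // Returns the text chunks found in the given HTML, joined by single spaces.
    public static String extractText(String html) {
        final StringBuilder text = new StringBuilder();

        // The parser calls handleText once for each run of character data
        // it finds between tags; tags themselves are never passed here.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(data);
            }
        };

        try (StringReader reader = new StringReader(html)) {
            // The third argument tells the parser to ignore any charset
            // declared in the document, since we already have decoded text.
            new ParserDelegator().parse(reader, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }
}
```

Calling HtmlToText.extractText(pageContents) on the downloaded page should give you the visible text without the markup.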

As for identifying URLs in a file, you can use a regex for that, though it can get ugly too. Here's a JavaRanch thread with an example:

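Separately from that thread, here is a rough sketch of the regex idea using java.util.regex. The class name UrlExtractor and the pattern itself are my own assumptions, not anything official: the pattern only looks for absolute http/https URLs and stops at whitespace, quotes, or angle brackets. It will miss relative links and can pick up trailing punctuation, which is part of why this gets ugly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name and the pattern are illustrative only.
public class UrlExtractor {

    // Matches "http://" or "https://" followed by any run of characters
    // that are not whitespace, quotes, or angle brackets.
    private static final Pattern URL_PATTERN =
            Pattern.compile("https?://[^\\s\"'<>]+");

    // Returns every match of URL_PATTERN in the text, in order of appearance.
    public static List<String> extractUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = URL_PATTERN.matcher(text);
        while (matcher.find()) {
            urls.add(matcher.group());
        }
        return urls;
    }
}
```

Run it over the raw HTML rather than the stripped text, since the URLs mostly live inside href attributes that disappear once the markup is removed.
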
Have fun,

Author of Grails: A Quick-Start Guide