Converting HTML into text

johnathan keats

Joined: Nov 19, 2009
Posts: 7

I wrote a script which gets a webpage and dumps the entire thing into a file.

Is there any way to remove all the HTML and formatting so that I'm left with just the text?

Also, how do I extract the URLs in the file?

Thank you in advance

John Keats is a poet, NOT my real name!
Dave Klein
Ranch Hand

Joined: Aug 29, 2007
Posts: 77
Parsing HTML, unless it's extremely simple HTML, is tricky business. You're probably best off using a Java library, like HTMLEditorKit (in javax.swing.text.html), for that. If you do a Google search for "HTMLEditorKit extract text from html", you'll come up with some examples.
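
For what it's worth, here is a minimal sketch of the HTMLEditorKit approach, not a definitive implementation. It feeds the page to the Swing HTML parser and collects only the text callbacks. The class name HtmlToText and the method extractText are made up for illustration, and I'm assuming the page is already in a String. Since Groovy can call Java classes directly, the same calls should work from a Groovy script.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class; the name is an assumption for this example.
class HtmlToText {

    // Returns the text content of the given HTML with the tags dropped.
    static String extractText(String html) {
        StringBuilder text = new StringBuilder();

        // The parser calls handleText once for each run of text it finds
        // between tags; tags themselves go to other callbacks we ignore.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                text.append(data).append(' ');
            }
        };

        try (Reader reader = new StringReader(html)) {
            // true = ignore any charset declared inside the document,
            // because we are already reading decoded characters.
            new ParserDelegator().parse(reader, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        // Collapse runs of whitespace left over from joining the chunks.
        return text.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>";
        System.out.println(extractText(html));
    }
}
```

The Swing parser is lenient with sloppy markup, but how it treats things like scripts and styles is worth checking against your own pages before relying on it.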

As for identifying URLs in a file, you can use a regex for that, though it can get ugly too. Here's a JavaRanch thread with an example: http://www.coderanch.com/t/382015/Java-General/java/regex-find-url
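
As a rough illustration of the regex route (this is my own simplified pattern, not the one from the linked thread), the sketch below picks out anything that starts with http:// or https:// and runs until whitespace, a quote, or an angle bracket. The names UrlFinder and findUrls are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is an assumption for this example.
class UrlFinder {

    // Deliberately simple pattern (an assumption, not a full URL grammar):
    // the scheme, then every character up to whitespace, a quote, or < >.
    private static final Pattern URL = Pattern.compile("https?://[^\\s\"'<>]+");

    // Returns every absolute http/https URL found in the text, in order.
    static List<String> findUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = URL.matcher(text);
        while (matcher.find()) {
            urls.add(matcher.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String page = "<a href=\"http://example.com/a\">x</a> see https://example.org/b?q=1 now";
        for (String url : findUrls(page)) {
            System.out.println(url);
        }
    }
}
```

This misses relative links such as href="/about" and will swallow trailing punctuation like a full stop at the end of a sentence, which is part of why the regex route gets ugly.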

Have fun,

Author of Grails: A Quick-Start Guide