| Author |
Converting HTML into text
|
johnathan keats
Greenhorn
Joined: Nov 19, 2009
Posts: 4
|
|
Hi,
I wrote a script which gets a webpage and dumps the entire thing into a file.
Is there any way to remove all the HTML and formatting so I'm left with just the text?
Also, how do I extract the URLs in the file?
Thank you in advance
|
John Keats is a poet, NOT my real name!
|
 |
Dave Klein
author
Ranch Hand
Joined: Aug 29, 2007
Posts: 77
|
|
Parsing HTML, unless it's extremely simple HTML, is tricky business. You're probably best off using a Java library for that, such as HTMLEditorKit (it ships with Swing, in javax.swing.text.html). If you do a Google search for "HTMLEditorKit extract text from html", you'll come up with some examples.
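To give a rough idea of what those examples tend to look like, here is a minimal sketch using the Swing HTML parser. The class name HtmlToText and the method extractText are made up for illustration, and joining text runs with a single space is an assumption on my part; real pages (scripts, styles, entities, odd whitespace) may well need more handling than this.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, not part of any library.
public class HtmlToText {

    /** Returns the text the parser reports for the given HTML, with the tags dropped. */
    public static String extractText(String html) {
        final StringBuilder text = new StringBuilder();

        // The parser calls handleText once for each run of text it finds.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                if (text.length() > 0) {
                    text.append(' '); // assumption: separate runs with one space
                }
                text.append(data);
            }
        };

        try (Reader reader = new StringReader(html)) {
            // true = ignore any charset declared in the document itself
            new ParserDelegator().parse(reader, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>";
        System.out.println(extractText(html));
    }
}
```

To use it on the file your script already writes, read the file into a String (or hand the parser a FileReader instead of the StringReader) and pass it through extractText.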
As for identifying URLs in a file, you can use a regex for that, though it can get ugly too. Here's a JavaRanch thread with an example: http://www.coderanch.com/t/382015/Java-General/java/regex-find-url
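For a quick feel of the regex approach, here is a small sketch. It is not the pattern from the linked thread; the class name UrlFinder, the method findUrls, and the pattern itself are my own assumptions, and the pattern is deliberately loose: it only finds absolute http/https URLs and stops at whitespace, quotes, or angle brackets.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
public class UrlFinder {

    // "http://" or "https://" followed by characters that are not
    // whitespace, quotes, or angle brackets. An illustrative pattern only.
    private static final Pattern URL = Pattern.compile("https?://[^\\s\"'<>]+");

    /** Returns every substring of the input that matches the URL pattern, in order. */
    public static List<String> findUrls(String input) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = URL.matcher(input);
        while (matcher.find()) {
            urls.add(matcher.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String page = "<a href=\"http://www.coderanch.com/forums\">forums</a> "
                + "and also https://example.com/page in plain text";
        for (String url : findUrls(page)) {
            System.out.println(url);
        }
    }
}
```

This is where the "ugly" part shows up: relative links such as href="/about" are missed, and a URL at the end of a sentence will pick up the trailing period. If you only care about links in anchor tags, reading the href attributes through an HTML parser is likely to be more reliable than a regex.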
Have fun,
Dave
|
Author of Grails: A Quick-Start Guide
|