I wrote a script which gets a webpage and dumps the entire thing into a file.
Is there any way to remove all the HTML and formatting stuff so I'm left with just the text?
Also, how do I extract the URLs in the file?
Thank you in advance
John Keats is a poet, NOT my real name!
posted 6 years ago
Parsing HTML, unless it's extremely simple HTML, is tricky business. You're probably best off using a Java library for that, like HTMLEditorKit (javax.swing.text.html.HTMLEditorKit, which ships with the JDK). If you do a Google search for "HTMLEditorKit extract text from html", you'll come up with some examples.
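
To give a rough idea of what those examples tend to look like, here is a minimal sketch, not a definitive implementation. It assumes the page is already available as a String, and it uses the Swing parser (ParserDelegator) with an HTMLEditorKit.ParserCallback. The class name HtmlTextExtractor and its methods are made up for illustration. It only collects the href values of <a> tags, so URLs in other places (img src, for instance) would need extra handling, and the Swing parser is dated and may stumble on modern markup.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the text content and the <a href> values of a page.
class HtmlTextExtractor extends HTMLEditorKit.ParserCallback {
    private final StringBuilder text = new StringBuilder();
    private final List<String> links = new ArrayList<>();

    // Called by the parser for each run of text found between tags.
    @Override
    public void handleText(char[] data, int pos) {
        if (text.length() > 0) {
            text.append(' ');
        }
        text.append(data);
    }

    // Called for each opening tag; only anchors are of interest here.
    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.A) {
            Object href = attrs.getAttribute(HTML.Attribute.HREF);
            if (href != null) {
                links.add(href.toString());
            }
        }
    }

    String getText() {
        return text.toString();
    }

    List<String> getLinks() {
        return links;
    }

    // Parses an HTML string and returns the populated extractor.
    static HtmlTextExtractor parse(String html) {
        HtmlTextExtractor extractor = new HtmlTextExtractor();
        try {
            // The last argument (true) tells the parser to ignore charset directives.
            new ParserDelegator().parse(new StringReader(html), extractor, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return extractor;
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Hello <a href=\"http://example.com\">world</a></p></body></html>";
        HtmlTextExtractor result = parse(html);
        System.out.println(result.getText());
        System.out.println(result.getLinks());
    }
}
```

To use it on the file your script already writes, read the file into a String first (for example with Files.readString on Java 11 or later) and pass that to parse. Relative links come back exactly as written in the page, so you would still need to resolve them against the page's address yourself.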