Converting HTML into text

johnathan keats
Greenhorn

Joined: Nov 19, 2009
Posts: 4
Hi,

I wrote a script which gets a webpage and dumps the entire thing into a file.

Is there any way to remove all the HTML and formatting so that I'm left with just the text?

Also, how do I extract the URLs in the file?

Thank you in advance


John Keats is a poet, NOT my real name!
Dave Klein
author
Ranch Hand

Joined: Aug 29, 2007
Posts: 77
Parsing HTML, unless it's extremely simple HTML, is a tricky business. You're probably best off using a Java library, like HTMLEditorKit, for that. If you do a Google search for "HTMLEditorKit extract text from html", you'll come up with some examples.
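
For illustration, here is a minimal sketch of that approach using the Swing HTML parser that ships with the JDK (javax.swing.text.html). The class name HtmlToText and the method extractText are made up for this example, and joining text fragments with a single space is a simplifying assumption; real pages may need more care around scripts, styles and whitespace. Since Groovy can call Java classes directly, the same calls should work from a Groovy script.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, named for this example only.
class HtmlToText {

    // Parses the HTML and returns the text fragments the parser reports,
    // joined with single spaces (an assumption made to keep the sketch short).
    static String extractText(String html) throws IOException {
        final StringBuilder text = new StringBuilder();

        // The parser calls handleText once per run of text between tags.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(data);
            }
        };

        try (Reader reader = new StringReader(html)) {
            // The third argument (ignoreCharSet = true) tells the parser not to
            // fail on a charset declared inside the document.
            new ParserDelegator().parse(reader, callback, true);
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>";
        System.out.println(extractText(html));
    }
}
```

To use it on the file from the original script, read the file into a String (or pass a FileReader to the parser instead of a StringReader) and write the result back out.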

As for identifying URLs in a file, you can use a regex for that, though it can get ugly too. Here's a JavaRanch thread with an example: http://www.coderanch.com/t/382015/Java-General/java/regex-find-url
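
As a rough sketch of the regex idea (not the code from the linked thread), the following uses java.util.regex to collect absolute http and https URLs. The class name UrlFinder, the method findUrls and the pattern itself are assumptions for this example. The pattern is deliberately simple: it misses relative links and will include trailing punctuation such as a full stop at the end of a sentence.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named for this example only.
class UrlFinder {

    // Matches "http://" or "https://" followed by any run of characters that
    // are not whitespace, quotes or angle brackets. A simplification, not a
    // full URL grammar.
    private static final Pattern URL = Pattern.compile("https?://[^\\s\"'<>]+");

    // Returns every match in the order it appears in the text.
    static List<String> findUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = URL.matcher(text);
        while (matcher.find()) {
            urls.add(matcher.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String sample = "<a href=\"http://example.com/a\">x</a> see https://example.org/b?c=1 now";
        for (String url : findUrls(sample)) {
            System.out.println(url);
        }
    }
}
```

Running the regex over the raw HTML (before the tags are stripped) picks up URLs inside href attributes as well as those in the visible text.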

Have fun,
Dave


Author of Grails: A Quick-Start Guide
 
 