I've lately been using HtmlUnit 2.8 for web scraping. For scraping, I need only HTML pages, not PDFs, MP3s, RARs, etc.
The code I've been using so far is:
The main problem is that the entire target page is loaded into memory before its content type is checked. If the URL points to a 1 MB PDF, the whole 1 MB is downloaded, and only then is it reported as application/pdf. This eats up memory and takes too much time as well. I've done some digging in the API but found nothing promising. Is there an alternative solution?
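
To illustrate the kind of pre-check I'm after, here is a rough sketch that uses plain java.net.HttpURLConnection instead of HtmlUnit. It sends a HEAD request, so only the response headers are transferred, and then looks at the Content-Type header before deciding whether the page is worth fetching. The class name ContentTypeProbe, the method names, the 5-second timeouts, and the example URL are all made up for illustration. It also assumes the server answers HEAD requests and reports an honest Content-Type, which will not always be true.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Locale;

class ContentTypeProbe {

    /** Returns true if the Content-Type header value denotes an HTML or XHTML page. */
    static boolean isHtml(String contentType) {
        if (contentType == null) {
            return false; // header missing: treat as "not HTML" in this sketch
        }
        // Drop parameters such as "; charset=UTF-8" before comparing.
        String mime = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return mime.equals("text/html") || mime.equals("application/xhtml+xml");
    }

    /** Sends a HEAD request and returns the Content-Type header, or null if absent. */
    static String headContentType(String url) throws IOException {
        // Assumes an http/https URL; other schemes would fail the cast below.
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        try {
            conn.setRequestMethod("HEAD"); // headers only, no response body
            conn.setConnectTimeout(5000);  // hypothetical timeout values
            conn.setReadTimeout(5000);
            conn.connect();
            return conn.getContentType();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder URL; pass a real one as the first argument.
        String url = args.length > 0 ? args[0] : "http://example.com/";
        String type = headContentType(url);
        System.out.println(url + " -> " + type
                + (isHtml(type) ? " (fetch with HtmlUnit)" : " (skip)"));
    }
}
```

The idea would be to call headContentType for each URL first and hand only the ones where isHtml returns true to HtmlUnit. The cost is one extra round trip per URL, and I would still like to know whether HtmlUnit itself can do something equivalent.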