I've lately been using HtmlUnit 2.8 for web scraping. For scraping, I need only HTML pages, not PDFs, MP3s, RARs, etc.
The code I've been using so far is:
The main problem is that the entire target page is loaded into memory before its content type is checked. If the URL points to a 1 MB PDF, the whole 1 MB is downloaded and only then reported as having the application/pdf content type. This eats up memory and takes too much time as well. I've done some digging in the API but found nothing promising. Is there a way to find out the content type before loading the page?
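One way to approach this, outside of HtmlUnit itself, might be to send an HTTP HEAD request first and look at the Content-Type header, so that the body is never downloaded for non-HTML targets. The sketch below uses only the JDK's HttpURLConnection. The class name ContentTypeProbe, the helpers isHtml and probeContentType, the 5-second timeouts, and the example URL are all assumptions made for illustration; none of them come from HtmlUnit or from the code in the question.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.Locale;

// Hypothetical helper class, not part of HtmlUnit.
public class ContentTypeProbe {

    /** Returns true if a Content-Type header value denotes an HTML or XHTML page. */
    public static boolean isHtml(String contentType) {
        if (contentType == null) {
            return false; // server sent no Content-Type header
        }
        // Drop parameters such as "; charset=UTF-8" and normalise case.
        String mime = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return mime.equals("text/html") || mime.equals("application/xhtml+xml");
    }

    /** Sends a HEAD request and returns the Content-Type header, or null if absent. */
    public static String probeContentType(String url) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        try {
            conn.setRequestMethod("HEAD"); // headers only, no response body
            conn.setConnectTimeout(5000);  // assumed timeouts, tune as needed
            conn.setReadTimeout(5000);
            return conn.getContentType();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder URL; pass the real target as the first argument.
        String url = args.length > 0 ? args[0] : "http://example.com/";
        String type = probeContentType(url);
        if (isHtml(type)) {
            System.out.println("HTML, worth loading with HtmlUnit: " + url);
        } else {
            System.out.println("Skipping " + url + " (" + type + ")");
        }
    }
}
```

This costs one extra round trip per URL, and some servers reject HEAD or answer it with a different Content-Type than they would for GET, so it is probably safest to treat a failed or missing probe as "unknown" and fall back to the existing check after loading.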
I’ve looked at a lot of different solutions, and in my humble opinion Aspose is the way to go. Here’s the link: http://aspose.com
subject: HtmlUnit - Finding target page content type before loading