HtmlUnit - Finding target page content type before loading

Vinoth Kumar Kannan
Ranch Hand

Joined: Aug 19, 2009
Posts: 276

Hello All,
I've lately been using HtmlUnit 2.8 for web scraping. For scraping, I need only HTML pages, not PDFs, MP3s, RARs, etc.
The code I've been using so far is:

The main problem is that the entire target page is loaded into memory before its content type is checked. If the URL points to a 1 MB PDF, the whole 1 MB is downloaded, and only then is it reported as application/pdf. This eats up my memory and takes too much time as well. I've done some digging into the API but found nothing promising. Is there an alternative solution?
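
One possible workaround, sketched here with the plain JDK rather than HtmlUnit's own API, is to send an HTTP HEAD request first: the server then returns only the response headers, so the Content-Type can be inspected without downloading the body. The class and method names below (ContentTypeProbe, isHtml, fetchContentType) are hypothetical, and the sketch assumes the target server answers HEAD requests with the same headers it would send for GET, which is not guaranteed.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Locale;

// Hypothetical helper, not part of HtmlUnit: probes a URL's content type
// with a HEAD request before the page is handed to the scraper.
final class ContentTypeProbe {

    /** Returns true when a Content-Type header value denotes an HTML or XHTML document. */
    static boolean isHtml(String contentType) {
        if (contentType == null) {
            return false; // the server sent no Content-Type header
        }
        // Drop parameters such as "; charset=UTF-8" and normalise case.
        String mime = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return mime.equals("text/html") || mime.equals("application/xhtml+xml");
    }

    /**
     * Sends a HEAD request and returns the raw Content-Type header, or null if absent.
     * Assumes an http/https URL; other schemes would fail the cast below.
     */
    static String fetchContentType(URL url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try {
            conn.setRequestMethod("HEAD"); // ask for headers only, no body
            conn.setConnectTimeout(5000);  // timeouts in milliseconds, chosen arbitrarily
            conn.setReadTimeout(5000);
            return conn.getContentType();
        } finally {
            conn.disconnect();
        }
    }
}
```

With a helper like this, the scraper would call fetchContentType(url) first and pass the URL on to the WebClient only when isHtml(...) returns true. Some servers reject HEAD or report a different type than they do for GET, so the existing check after loading is still worth keeping as a fallback.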


OCPJP 6
 