
HtmlUnit - Finding target page content type before loading

 
Vinoth Kumar Kannan
Ranch Hand
Posts: 276
Hello All,
I've lately been using HtmlUnit 2.8 for web scraping. I only need HTML pages, not PDFs, MP3s, RARs, etc.
The code I've been using so far is:

The main problem with this approach is that the entire target page is loaded into memory before its content type is checked. If the URL points to a 1 MB PDF, the whole 1 MB is downloaded and only then is it reported as application/pdf. This eats up memory and takes too much time as well. I've done some digging in the API but found nothing promising. Is there an alternative solution?
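
To make the goal concrete, here is a minimal sketch of the kind of check I am after, written with the JDK's plain HttpURLConnection rather than HtmlUnit. It sends an HTTP HEAD request, which asks the server for headers only, and reads the Content-Type before anything is downloaded. The class name ContentTypeProbe, the method names, the 5-second timeouts and the example URL are all made up for illustration. It also assumes the server answers HEAD requests sensibly, which not every server does.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.Locale;

public class ContentTypeProbe {

    /** Returns true if a Content-Type header value denotes an HTML page. */
    public static boolean isHtml(String contentType) {
        if (contentType == null) {
            return false; // no Content-Type header: treat as non-HTML (an assumption)
        }
        // Drop parameters such as "; charset=UTF-8" before comparing.
        String mime = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return mime.equals("text/html") || mime.equals("application/xhtml+xml");
    }

    /** Sends a HEAD request and returns the raw Content-Type header, or null. */
    public static String probeContentType(String url) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        try {
            conn.setRequestMethod("HEAD"); // headers only, no response body
            conn.setConnectTimeout(5000);  // illustrative timeouts, in milliseconds
            conn.setReadTimeout(5000);
            return conn.getContentType();  // triggers the request
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical default URL; pass a real one as the first argument.
        String url = args.length > 0 ? args[0] : "http://example.com/";
        String type = probeContentType(url);
        System.out.println(url + " -> " + type
                + (isHtml(type) ? " (load it)" : " (skip it)"));
    }
}
```

The idea would be to call probeContentType(url) first and hand the URL to the WebClient only when isHtml(...) returns true. The cost is one extra round trip per URL, and a server that rejects HEAD would need a fallback to the normal GET.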
 