
trouble retrieving html of all web sites

 
mj zammit
Ranch Hand
Posts: 49
Hey

I am using low-level sockets to send a GET request to a web server, retrieve the HTML content, and place it in a file.
This works for some websites, for example http://www.naturenet.com.
But for others, such as http://www.google.com or Yahoo's website,
the response says either that the object has moved or that it doesn't exist.

Here is the code:


Any suggestions will be greatly appreciated.
 
Nitesh Kant
Bartender
Posts: 1638
IntelliJ IDE Java MySQL Database
HTTP is a complex protocol, which is why browsers exist: they make HTTP requests and interpret the responses.
If you use Firefox with a plugin like HttpFox (google for it), you can see what goes on "behind the scenes" whenever an HTTP request is made. In many cases a single request turns into a conversation between browser and server that spans multiple request/response pairs. For example, in your case, when you request http://www.google.com the server returns HTTP status code 302, which tells the client to query another URL instead: http://www.google.co.in.
Your browser does all this behind the scenes for you.

If you plan to write browser-like code, that is a big exercise, and it may be worth looking at Apache HttpClient (google for details) before writing something of your own.
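
As a rough illustration of letting a library deal with the 302 instead of raw sockets, here is a minimal sketch. It uses the JDK's own HttpURLConnection rather than Apache HttpClient, purely to stay dependency-free. The class name RedirectAwareFetch, the User-Agent value, the timeouts, and the UTF-8 decoding are all assumptions made for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
final class RedirectAwareFetch {

    // Prepares (but does not yet send) a GET request that follows redirects.
    static HttpURLConnection open(String url) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestMethod("GET");
        // Following redirects is already the default; set here to make it explicit.
        conn.setInstanceFollowRedirects(true);
        // Placeholder value; some sites respond differently without a User-Agent.
        conn.setRequestProperty("User-Agent", "Mozilla/5.0 (example)");
        conn.setConnectTimeout(5000); // assumed timeouts, in milliseconds
        conn.setReadTimeout(5000);
        return conn;
    }

    // Sends the request and returns the body, decoded as UTF-8 (an assumption;
    // a real client would honour the charset in the Content-Type header).
    static String fetch(String url) throws IOException {
        HttpURLConnection conn = open(url);
        try (InputStream in = conn.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        String url = args.length > 0 ? args[0] : "http://www.google.com";
        System.out.println(fetch(url));
    }
}
```

One caveat: HttpURLConnection does not follow a redirect that switches protocol (for example from http to https), so such a response would still need to be handled by hand.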
 
Joe Ess
Bartender
Posts: 9297
Linux Mac OS X Windows
First, Java sockets aren't "low-level". They abstract away most of the details of dealing with socket connections.
I don't think HTTP is that difficult a protocol. If your code works with one site and not another, the problem may be that the site is looking for a user-agent header to prevent robots from scraping the site. Have a look at this topic.
 
Nitesh Kant
Bartender
Posts: 1638
IntelliJ IDE Java MySQL Database
Originally posted by Joe Ess:
the problem may be that the site is looking for a user-agent header to prevent robots from scraping the site.


Hey Joe,

I think in this case, the server is returning a 302 which means a redirect to another location.
 
mj zammit
Ranch Hand
Posts: 49
I think so too...
Would HTTPClient take care of redirections of the url?
 
mj zammit
Ranch Hand
Posts: 49
I realized that a 302 or 301 response includes the new location of the URL (in its Location header). So what I could do is extract that URL and request it instead. But at the end of the day I would just be reinventing the wheel that HttpClient already provides...
Thanks for both your comments; they have helped me greatly.
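
The idea of pulling the new URL out of a 301/302 response can be sketched roughly as follows. This is only an illustration: the class and method names (RedirectParser, redirectTarget) are made up for the example, and it assumes the whole response header has already been read from the socket into a String.

```java
import java.util.Optional;

// Hypothetical helper class, not part of any library.
final class RedirectParser {

    // Returns the Location header value if the raw response is a 301 or 302,
    // otherwise an empty Optional.
    static Optional<String> redirectTarget(String rawResponse) {
        // Accept both \r\n and bare \n line endings.
        String[] lines = rawResponse.split("\r?\n");
        // Status line looks like: HTTP/1.1 302 Found
        String[] status = lines[0].trim().split("\\s+");
        if (status.length < 2) {
            return Optional.empty();
        }
        if (!status[1].equals("301") && !status[1].equals("302")) {
            return Optional.empty();
        }
        for (int i = 1; i < lines.length; i++) {
            String line = lines[i];
            if (line.isEmpty()) {
                break; // blank line marks the end of the headers
            }
            int colon = line.indexOf(':');
            // Header names are case-insensitive.
            if (colon > 0 && line.substring(0, colon).trim().equalsIgnoreCase("Location")) {
                return Optional.of(line.substring(colon + 1).trim());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String response = "HTTP/1.1 302 Found\r\n"
                + "Location: http://www.google.co.in/\r\n"
                + "\r\n";
        System.out.println(redirectTarget(response).orElse("(no redirect)"));
    }
}
```

The returned value may be a relative URL on some servers, in which case it would have to be resolved against the original request URL before sending the next GET.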
 
Nitesh Kant
Bartender
Posts: 1638
IntelliJ IDE Java MySQL Database
mj:
Would HTTPClient take care of redirections of the url?

This link talks a little about how to handle redirects in HttpClient.
 
Chris Blades
Greenhorn
Posts: 7
The above suggestions about following the 302 redirection response are correct.

My suggestion, though, is to reformat your HTTP requests like so:

end each line with \r\n

end the request with \r\n\r\n

so the code would be something like this (pseudo):

write("GET /index.html HTTP/1.1\r\n"); // end each line with \r\n
write("Host: www.example.com\r\n"); // HTTP/1.1 requires a Host header; this host is a placeholder
write("User-Agent: mozilla\r\n\r\n"); // end the request with \r\n\r\n

This is what the HTTP specification requires, and some servers complain when a request doesn't follow it (not many, but it's annoying when it happens).

However, be aware that it's not uncommon for servers to use a bare \n in responses, so they may terminate their HTTP header with \n\n. I use a regular expression that matches either \r\n\r\n or \n\n.
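
A minimal Java sketch of both points, the CRLF-terminated request and the tolerant search for the end of the response header, might look like this. The class and method names (HttpFraming, buildGet, bodyStart) are invented for the example, and the Connection: close header is an added assumption so that the server closes the socket once the response is complete.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
final class HttpFraming {
    private static final String CRLF = "\r\n";

    // Matches the blank line that ends the header block, whether the server
    // used \r\n\r\n or only \n\n.
    private static final Pattern HEADER_END = Pattern.compile("\r\n\r\n|\n\n");

    // Builds a GET request in which every line ends with \r\n and the request
    // as a whole ends with an empty line.
    static String buildGet(String host, String path) {
        return "GET " + path + " HTTP/1.1" + CRLF
                + "Host: " + host + CRLF               // mandatory in HTTP/1.1
                + "User-Agent: mozilla" + CRLF
                + "Connection: close" + CRLF           // assumption, see lead-in
                + CRLF;
    }

    // Returns the index of the first character after the header terminator,
    // or -1 if the terminator has not been received yet.
    static int bodyStart(String rawResponse) {
        Matcher m = HEADER_END.matcher(rawResponse);
        return m.find() ? m.end() : -1;
    }

    public static void main(String[] args) {
        System.out.print(buildGet("www.example.com", "/index.html"));
        String response = "HTTP/1.1 200 OK\nContent-Type: text/html\n\n<html></html>";
        System.out.println(response.substring(bodyStart(response)));
    }
}
```

The string returned by buildGet would be written to the socket's output stream as bytes, and bodyStart would be applied to whatever has been read back so far.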
 
mj zammit
Ranch Hand
Posts: 49
Thanks for all your replies.
Chris, will your suggestion help with redirections?
 