I am using low-level sockets to send a GET message to a Web server, retrieve the HTML content, and place it into a file. This works for some websites, for example http://www.naturenet.com. But for other websites, such as http://www.google.com or Yahoo's website, the response tells me either that the object has moved or that it doesn't exist.
HTTP is a complex protocol, which is why software such as browsers exists to make HTTP requests and interpret the responses. If you use Firefox with a plugin like HttpFox (google for it), it will show you what goes on "behind the scenes" whenever an HTTP request is made. In many cases a single HTTP request turns into a conversation between browser and server that spans multiple request-response pairs. For example, in your case, when you type http://www.google.com the server returns HTTP status code 302, which tells the client to query another URL instead, such as http://www.google.co.in. Your browser does all this behind the scenes for you.
If you plan to write browser-like code, it is a big exercise, and it may be worth looking at Apache HttpClient (google for details) before writing something of your own.
First, Java sockets aren't "low-level". They abstract away most of the details of dealing with socket connections. I also don't think HTTP is that difficult a protocol. If your code works with one site and not another, the problem may be that the site is looking for a User-Agent header to prevent robots from scraping the site. Have a look at this topic.
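To illustrate the User-Agent suggestion, here is a minimal sketch of how the request text could be assembled before it is written to the socket. The class name GetRequestSketch, the method buildGetRequest, and the agent string "ExampleFetcher/0.1" are made up for this example and are not from the original poster's code. Whether a given site actually checks the header is an assumption you would have to test.

```java
final class GetRequestSketch {

    // Builds the text of an HTTP/1.1 GET request.
    // HTTP/1.1 requires a Host header; User-Agent is optional in the protocol,
    // but some servers may respond differently when it is missing.
    static String buildGetRequest(String host, String path, String userAgent) {
        return "GET " + path + " HTTP/1.1\r\n"
             + "Host: " + host + "\r\n"
             + "User-Agent: " + userAgent + "\r\n"   // hypothetical agent name, pick your own
             + "Accept: text/html\r\n"
             + "Connection: close\r\n"               // ask the server to close when done
             + "\r\n";                               // blank line ends the header section
    }

    public static void main(String[] args) {
        // Prints the request instead of sending it, so the sketch needs no network.
        System.out.print(buildGetRequest("www.naturenet.com", "/", "ExampleFetcher/0.1"));
    }
}
```

The resulting string can be written to the socket's output stream as bytes. "Connection: close" is used here so that reading until end of stream is enough to get the whole response.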
Originally posted by Joe Ess: the problem may be that the site is looking for a user-agent header to prevent robots from scraping the site.
I think in this case the server is returning a 302, which means a redirect to another location.
Joined: Nov 16, 2008
I think so too... Would HttpClient take care of URL redirects?
I realized that a 301 or 302 response gives me the new location of the URL in its Location header. So what I could do is extract that URL and request it instead. But at the end of the day I would just be reinventing the wheel that HttpClient already provides... Thanks for both your comments, they have helped me greatly.
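For anyone who wants to try the manual approach anyway, the idea of pulling the new URL out of a 301 or 302 response can be sketched roughly as follows. This assumes the raw response has already been read from the socket into a String. The class and method names (RedirectSketch, statusCode, redirectTarget) are made up for the example, and it ignores cases a real client has to handle, such as relative Location values, other redirect codes, and redirect loops.

```java
final class RedirectSketch {

    // Returns the numeric status code from the status line,
    // e.g. 302 for "HTTP/1.1 302 Found".
    static int statusCode(String rawResponse) {
        String statusLine = rawResponse.split("\r\n", 2)[0];
        return Integer.parseInt(statusLine.split(" ")[1]);
    }

    // Returns the Location header value for a 301 or 302 response,
    // or null if the response is not such a redirect or has no Location header.
    static String redirectTarget(String rawResponse) {
        int code = statusCode(rawResponse);
        if (code != 301 && code != 302) {
            return null;
        }
        // The header section ends at the first blank line.
        String head = rawResponse.split("\r\n\r\n", 2)[0];
        for (String line : head.split("\r\n")) {
            int colon = line.indexOf(':');
            // Header names are case-insensitive.
            if (colon > 0 && line.substring(0, colon).trim().equalsIgnoreCase("Location")) {
                return line.substring(colon + 1).trim();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Hand-written sample response, not captured from a real server.
        String sample = "HTTP/1.1 302 Found\r\n"
                      + "Location: http://www.google.co.in/\r\n"
                      + "Content-Length: 0\r\n"
                      + "\r\n";
        System.out.println(redirectTarget(sample));
    }
}
```

If redirectTarget returns a non-null URL, the client would open a new connection to that host and repeat the GET, ideally with a cap on the number of redirects it follows.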