| Author |
trouble retrieving html of all web sites
|
mj zammit
Ranch Hand
Joined: Nov 16, 2008
Posts: 49
|
|
Hey, I am using low-level sockets to send a GET request to a web server, retrieve the HTML content, and place it into a file. This works for some websites, for example http://www.naturenet.com. But for other websites, such as http://www.google.com or Yahoo's website, the response says either that the object has moved or that it doesn't exist. Here is the code: Any suggestions will be greatly appreciated.
|
 |
Nitesh Kant
Bartender
Joined: Feb 25, 2007
Posts: 1638
|
|
HTTP is a complex protocol, which is why browsers exist to make HTTP requests and interpret the responses. If you use Firefox with a plugin like HttpFox (google for it), it will show you what goes on "behind the scenes" whenever an HTTP request is made. In many cases a single HTTP request turns into a conversation between browser and server that spans multiple request-response pairs. For example, in your case, when you type http://www.google.com the server returns HTTP status code 302, which tells you to query another URL instead, http://www.google.co.in. Your browser does all this behind the scenes for you. If you plan to write browser-like code, that is a big exercise, and it may be worth looking at Apache HttpClient (google for details) before writing something of your own.
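
To see the redirect described above without a browser plugin, one option is the JDK's own HttpURLConnection with automatic redirect following switched off. This is only a sketch: the class name RedirectProbe, the helper prepare, and the "Mozilla/5.0" User-Agent value are made up for illustration, and what status and Location a given site actually returns will vary.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

class RedirectProbe {
    /**
     * Builds a GET connection that will NOT follow redirects automatically,
     * so a 301/302 status and its Location header stay visible to the caller.
     * Nothing is sent over the network until the response is read.
     */
    static HttpURLConnection prepare(String url) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setInstanceFollowRedirects(false);
            conn.setRequestMethod("GET");
            // Hypothetical User-Agent value; some sites respond differently without one.
            conn.setRequestProperty("User-Agent", "Mozilla/5.0");
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        HttpURLConnection conn = prepare("http://www.google.com/");
        int status = conn.getResponseCode();              // this call performs the request
        String location = conn.getHeaderField("Location"); // null if the header is absent
        System.out.println(status + " -> " + location);
        conn.disconnect();
    }
}
```

Leaving setInstanceFollowRedirects at its default of true would hide the intermediate response, which is convenient in practice but less useful for seeing what the server really sent.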
|
|
 |
Joe Ess
Bartender
Joined: Oct 29, 2001
Posts: 8291
|
|
First, Java sockets aren't "low-level". They abstract away most of the details of dealing with socket connections. I don't think HTTP is that difficult of a protocol. If your code works with one site and not another, the problem may be that the site is looking for a user-agent header to prevent robots from scraping the site. Have a look at this topic.
|
"blabbing like a narcissistic fool with a superiority complex" ~ N.A.
|
 |
Nitesh Kant
Bartender
Joined: Feb 25, 2007
Posts: 1638
|
|
Originally posted by Joe Ess: the problem may be that the site is looking for a user-agent header to prevent robots from scraping the site.
Hey Joe, I think in this case, the server is returning a 302 which means a redirect to another location.
|
 |
mj zammit
Ranch Hand
Joined: Nov 16, 2008
Posts: 49
|
|
I think so too... Would HTTPClient take care of redirections of the url?
|
 |
mj zammit
Ranch Hand
Joined: Nov 16, 2008
Posts: 49
|
|
I realized that a 301 or 302 response gives me the new URL in its Location header. So what I could do is extract that URL and request it instead. But at the end of the day I would just be reinventing the wheel that HttpClient already provides... Thanks to both of you for your comments; they have helped me greatly.
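
For anyone who does want to try this by hand on a raw socket response, here is a rough sketch of pulling the status code and the Location header out of the response head. The class and method names (RedirectParser, statusCode, location) are invented for this example, and it assumes the head has already been read into a String.

```java
class RedirectParser {
    /** Returns the numeric status from a status line such as "HTTP/1.1 302 Found", or -1 if malformed. */
    static int statusCode(String responseHead) {
        String[] lines = responseHead.split("\r?\n");
        String[] parts = lines[0].split(" ");
        if (parts.length < 2) {
            return -1;
        }
        try {
            return Integer.parseInt(parts[1]);
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    /** Returns the value of the Location header (name matched case-insensitively), or null if absent. */
    static String location(String responseHead) {
        String[] lines = responseHead.split("\r?\n");
        for (int i = 1; i < lines.length; i++) {
            String line = lines[i];
            if (line.isEmpty()) {
                break; // blank line marks the end of the headers
            }
            int colon = line.indexOf(':');
            if (colon > 0 && line.substring(0, colon).trim().equalsIgnoreCase("Location")) {
                return line.substring(colon + 1).trim();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String head = "HTTP/1.1 302 Found\r\nLocation: http://www.google.co.in/\r\n\r\n";
        int status = statusCode(head);
        if (status == 301 || status == 302) {
            System.out.println("Redirected to: " + location(head));
        }
    }
}
```

A real client would then open a new connection to the returned URL and repeat the request, ideally with a cap on the number of hops so that a redirect loop cannot run forever.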
|
 |
Nitesh Kant
Bartender
Joined: Feb 25, 2007
Posts: 1638
|
|
mj: Would HTTPClient take care of redirections of the url?
This link talks a little about how to handle redirects in HttpClient.
|
 |
Chris Blades
Greenhorn
Joined: Dec 23, 2008
Posts: 7
|
|
The suggestions above about following the 302 redirect are correct. My suggestion, though, is to reformat your HTTP requests like so: end each line with \r\n (CRLF) and end the request with \r\n\r\n. So the code would be something like this (pseudo):
write("GET /index.html HTTP/1.1\r\n"); // end each line with \r\n
write("Host: www.google.com\r\n"); // HTTP/1.1 requires a Host header
write("User-Agent: mozilla\r\n\r\n"); // end the request with \r\n\r\n
This is what the HTTP specification requires, and some servers complain if you don't do it (not many, but it's annoying when it happens). However, be aware that it's not uncommon for servers to use a bare \n in responses, so they may use \n\n to terminate their HTTP headers. I use a regular expression to match either \r\n\r\n or \n\n.
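
In Java, the request framing and the lenient header-terminator match might look roughly like the sketch below. It uses \r\n (CRLF) line endings as HTTP specifies; the class name HttpFraming, the method names, and the "Connection: close" header are assumptions added for the example, not something from the original poster's code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HttpFraming {
    // Matches the blank line that ends the headers: strict CRLF CRLF, or the lenient bare LF LF.
    static final Pattern HEADER_END = Pattern.compile("\r\n\r\n|\n\n");

    /** Builds a minimal HTTP/1.1 GET request; every line ends in CRLF and an empty line ends the request. */
    static String buildGet(String host, String path, String userAgent) {
        return "GET " + path + " HTTP/1.1\r\n"
                + "Host: " + host + "\r\n"            // required by HTTP/1.1
                + "User-Agent: " + userAgent + "\r\n"
                + "Connection: close\r\n"             // assumption: ask the server to close after one response
                + "\r\n";
    }

    /** Returns the index where the body starts, or -1 if the header terminator has not arrived yet. */
    static int bodyStart(String response) {
        Matcher m = HEADER_END.matcher(response);
        return m.find() ? m.end() : -1;
    }

    public static void main(String[] args) {
        System.out.print(buildGet("www.google.com", "/index.html", "mozilla"));
        String response = "HTTP/1.1 200 OK\nContent-Type: text/html\n\n<html></html>";
        System.out.println(response.substring(bodyStart(response)));
    }
}
```

The request string would be written to the socket's output stream as bytes, for example with getBytes(StandardCharsets.US_ASCII), rather than through println, whose line separator depends on the platform.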
|
 |
mj zammit
Ranch Hand
Joined: Nov 16, 2008
Posts: 49
|
|
Thanks for all your replies. Chris, will your suggestion help with redirections?
|
 |