
trouble retrieving html of all web sites

mj zammit
Ranch Hand

Joined: Nov 16, 2008
Posts: 49
Hey

I am using low-level sockets to send a GET request to a web server, retrieve the HTML content, and place it into a file.
This works for some websites, for example http://www.naturenet.com
But for other websites, such as http://www.google.com or Yahoo's website,
the response says either that the object has moved or that it doesn't exist.

Here is the code:


Any suggestions will be greatly appreciated.
Nitesh Kant
Bartender

Joined: Feb 25, 2007
Posts: 1638

HTTP is a complex protocol, which is why software such as browsers exists to make HTTP requests and interpret the responses.
If you use Firefox with a plugin like HttpFox (google for it), it will show you what goes on "behind the scenes" whenever an HTTP request is made. In many cases one HTTP request turns into a conversation between browser and server that spans multiple request-response pairs. E.g., in your case, when you type http://www.google.com, the server returns HTTP status code 302, which tells the client to query another URL instead: http://www.google.co.in.
Your browser does all this behind the scenes for you.

If you plan to write browser-like code, that is a big exercise, and it may be worth looking at Apache HTTPClient (google for details) before writing something of your own.


Joe Ess
Bartender

Joined: Oct 29, 2001
Posts: 8927
    
    9

First, Java sockets aren't "low-level". They abstract away most of the details of dealing with socket connections.
I don't think HTTP is that difficult a protocol. If your code works with one site and not another, the problem may be that the site checks for a User-Agent header to keep robots from scraping it. Have a look at this topic.


Nitesh Kant
Bartender

Joined: Feb 25, 2007
Posts: 1638

Originally posted by Joe Ess:
the problem may be that the site is looking for a user-agent header to prevent robots from scraping the site.


Hey Joe,

I think in this case the server is returning a 302, which means a redirect to another location.
mj zammit
Ranch Hand

Joined: Nov 16, 2008
Posts: 49
I think so too...
Would HTTPClient take care of redirections of the url?
mj zammit
Ranch Hand

Joined: Nov 16, 2008
Posts: 49
I realized that a 301 or 302 response gives me the new location of the URL in its Location header. So what I could do is extract that URL and request it instead. But at the end of the day I would just be reinventing the wheel that HTTPClient already provides...
Thanks for both your comments, they have helped me greatly.
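
For what it's worth, the extract-the-URL idea above can be sketched in a few lines of plain Java. This is only a rough illustration, not how HTTPClient does it: the class name RedirectSketch and its two helper methods are made up for the example, and it assumes the response head has already been read from the socket into a String. Header names are compared case-insensitively, since HTTP allows any casing.

```java
import java.util.Optional;

public class RedirectSketch {

    /** Returns the numeric status code from a status line such as "HTTP/1.1 302 Found". */
    public static int statusCode(String responseHead) {
        // The status line is everything up to the first line break.
        String statusLine = responseHead.split("\r?\n", 2)[0];
        // Format: HTTP-version SP status-code SP reason-phrase
        String[] parts = statusLine.split(" ", 3);
        return Integer.parseInt(parts[1]);
    }

    /** Returns the value of the Location header, if the response head contains one. */
    public static Optional<String> location(String responseHead) {
        String[] lines = responseHead.split("\r?\n");
        // Start at 1 to skip the status line.
        for (int i = 1; i < lines.length; i++) {
            String line = lines[i];
            if (line.isEmpty()) {
                break; // a blank line marks the end of the headers
            }
            int colon = line.indexOf(':');
            if (colon > 0 && line.substring(0, colon).trim().equalsIgnoreCase("Location")) {
                return Optional.of(line.substring(colon + 1).trim());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Hand-written sample response, modelled on the redirect described in this thread.
        String head = "HTTP/1.1 302 Found\r\n"
                + "Location: http://www.google.co.in/\r\n"
                + "Content-Length: 0\r\n"
                + "\r\n";
        int code = statusCode(head);
        if (code == 301 || code == 302) {
            System.out.println("Redirect to: " + location(head).orElse("(no Location header)"));
        }
    }
}
```

A real client would also have to cope with relative Location values and cap the number of redirects it follows, so that two pages redirecting to each other don't loop forever.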
Nitesh Kant
Bartender

Joined: Feb 25, 2007
Posts: 1638

mj:
Would HTTPClient take care of redirections of the url?

This link talks a little about how to handle redirects in HTTPClient.
Chris Blades
Greenhorn

Joined: Dec 23, 2008
Posts: 7
The above suggestions about following the 302 redirection response are correct.

My suggestion, though, is to reformat your HTTP requests like so:

end each line with \r\n

end the request with \r\n\r\n

so the code would be something like this (pseudo):

write("GET /index.html HTTP/1.1\r\n"); // request path starts with "/"; end each line with \r\n
write("Host: www.naturenet.com\r\n"); // HTTP/1.1 requires a Host header
write("User-Agent: mozilla\r\n\r\n"); // end the request with \r\n\r\n

This is what the HTTP specification requires, and some servers complain when a request doesn't follow it (not many, but it's annoying when it happens).

However, be aware that it's not uncommon for servers to use a bare \n in responses, so they may terminate their HTTP header with \n\n. I use a regular expression to match either \r\n\r\n or \n\n.
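
In case it helps, here is a minimal Java sketch of both halves of that advice: building a request whose lines end in CRLF (written "\r\n" in Java), and finding where the response headers end with a regular expression that accepts either CRLF CRLF or a bare LF LF. The class name HttpFraming and its method names are invented for the example, and the "Connection: close" header is my own assumption, added so that a simple client can read until the server closes the socket.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HttpFraming {

    private static final String CRLF = "\r\n";

    // Matches the blank line that ends the headers, tolerating bare-LF servers.
    private static final Pattern HEADER_END = Pattern.compile("\r\n\r\n|\n\n");

    /** Builds a GET request in which every line ends with CRLF and an empty line ends the request. */
    public static String buildGet(String host, String path) {
        return "GET " + path + " HTTP/1.1" + CRLF
                + "Host: " + host + CRLF           // required by HTTP/1.1
                + "User-Agent: mozilla" + CRLF
                + "Connection: close" + CRLF       // assumption: ask the server to close when done
                + CRLF;                            // empty line terminates the request
    }

    /** Returns the index of the first body character, or -1 if no header terminator was found. */
    public static int bodyStart(String response) {
        Matcher m = HEADER_END.matcher(response);
        return m.find() ? m.end() : -1;
    }

    public static void main(String[] args) {
        System.out.print(buildGet("www.naturenet.com", "/index.html"));

        // Hand-written sample response that uses bare \n line endings.
        String response = "HTTP/1.1 200 OK\nContent-Type: text/html\n\n<html></html>";
        System.out.println(response.substring(bodyStart(response)));
    }
}
```

The string returned by buildGet can be written to the socket's output stream as bytes; the status line and headers are everything before bodyStart.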
mj zammit
Ranch Hand

Joined: Nov 16, 2008
Posts: 49
Thanks for all your replies.
Chris, will your suggestion help with redirections?