
how to read content of a site?

ankur rathi
Ranch Hand

Joined: Oct 11, 2004
Posts: 3830
I found the following code to read the content (HTML) of any page on any site:



But it throws java.net.UnknownHostException. So I tried with the IP address instead (uncomment the first line and comment out the second), but now it throws java.net.ConnectException (Connection timed out).

When I "ping" this IP it gives 'request timed out', so that was expected.
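
The two exceptions point at different stages: UnknownHostException means the host name could not be resolved, while ConnectException means an address was available but the TCP connection was not established. A minimal sketch that separates the two cases, assuming a hypothetical helper class ConnectivityCheck and a 5-second timeout (neither comes from the original post):

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.UnknownHostException;

// Hypothetical helper, for illustration only.
class ConnectivityCheck {

    /** Returns "DNS_FAILED", "CONNECT_FAILED" or "OK" for the given endpoint. */
    static String diagnose(String host, int port, int timeoutMillis) {
        InetAddress address;
        try {
            // Name resolution: this is the step that throws UnknownHostException.
            address = InetAddress.getByName(host);
        } catch (UnknownHostException e) {
            return "DNS_FAILED";
        }
        try (Socket socket = new Socket()) {
            // Direct TCP connect with an explicit timeout; refusals and
            // timeouts both surface as IOException subclasses.
            socket.connect(new InetSocketAddress(address, port), timeoutMillis);
            return "OK";
        } catch (IOException e) {
            return "CONNECT_FAILED";
        }
    }

    public static void main(String[] args) {
        // Host taken from the thread; port 80 and the timeout are assumptions.
        System.out.println(diagnose("www.google.co.in", 80, 5000));
    }
}
```

On a network that only allows outbound traffic through a proxy, a direct check like this is expected to fail even though the browser works.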

Any clue on how to read the content?

Thanks.
Joe Ess
Bartender

Joined: Oct 29, 2001
Posts: 8877
    

Step 1: Connect to the internet. Your problem isn't reading the content. Can you load the URL in your browser? Do you have a proxy between you and the internet? Once that's resolved, move to:
Step 2: Getting text from a URL
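
For step 2, a minimal sketch of reading text from a URL with the standard library. The class name PageReader is hypothetical, and UTF-8 is an assumption; a real page may declare a different charset:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
class PageReader {

    /** Reads everything the URL returns and gives it back as one string. */
    static String readText(URL url) throws IOException {
        StringBuilder text = new StringBuilder();
        // UTF-8 is assumed here; the stream is closed by try-with-resources.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                text.append(line).append('\n');
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // URL taken from the thread; any reachable page works.
        System.out.print(readText(URI.create("http://www.google.co.in/").toURL()));
    }
}
```

This only works once step 1 is resolved, since url.openStream() makes a direct connection unless a proxy has been configured for the JVM.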


"blabbing like a narcissistic fool with a superiority complex" ~ N.A.
ankur rathi
Thanks Joe. I got your point.


Can you load the URL in your browser?


I can.


Do you have a proxy between you and the internet?


Yes, there is one.

When I open 'LAN Settings...' in IE, I don't see any proxy server listed, but the second check box in the first group (Use automatic configuration script) is checked, and it points to a PAC file.

I read up on PAC files. At a minimum, a PAC file defines a JS function, FindProxyForURL(url, host), which presumably works out the proxy server name, port, etc.

But how do I find out the proxy server name and port?

Once I get these I can use this code:
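
One way to pass a known proxy host and port to a single connection is java.net.Proxy. This is a minimal sketch, not the snippet originally posted; the class name ProxiedConnection, the placeholder host proxy.example.invalid, port 8080 and the 5-second timeout are all assumptions:

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URI;
import java.net.URL;

// Hypothetical helper, for illustration only.
class ProxiedConnection {

    /** Describes an HTTP proxy by host name (or IP) and port. */
    static Proxy httpProxy(String host, int port) {
        return new Proxy(Proxy.Type.HTTP, new InetSocketAddress(host, port));
    }

    /** Prepares a connection that will go through the proxy; nothing is sent yet. */
    static HttpURLConnection open(URL url, Proxy proxy, int timeoutMillis) throws IOException {
        // The cast assumes an http or https URL.
        HttpURLConnection conn = (HttpURLConnection) url.openConnection(proxy);
        conn.setConnectTimeout(timeoutMillis);
        conn.setReadTimeout(timeoutMillis);
        return conn;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder proxy address; substitute the real host and port.
        Proxy proxy = httpProxy("proxy.example.invalid", 8080);
        HttpURLConnection conn =
                open(URI.create("http://www.google.co.in/").toURL(), proxy, 5000);
        // The request is actually sent here.
        System.out.println(conn.getResponseCode());
    }
}
```

The JVM-wide alternative is to set the http.proxyHost and http.proxyPort system properties before the first connection is opened.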



Thanks.
ankur rathi
Oops, I can access that file directly.

Some very dirty code is written in that function (FindProxyForURL), and it returns loads of "proxy server IP & port" values based on various conditions. I took one of the IP and port pairs returned somewhere in this file and set it like this:



Now it gives this exception:



ankur rathi
Okay. One of the IPs worked.

But I get some weird HTML that has no relation to the HTML I was supposed to get (the HTML for google.co.in). Why so?

Thanks.
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41621
    
But I get some weird HTML that has no relation to the HTML I was supposed to get (the HTML for google.co.in).


It would help if you told us what you are getting.


Joe Ess

My advice is to contact your network support people, tell them what you are trying to do and get the proxy information from them. We'd just be guessing without knowing your network topology.
ankur rathi
Originally posted by Joe Ess:
My advice is to contact your network support people, tell them what you are trying to do and get the proxy information from them.


That doesn't seem like a feasible option, as it's not for any official project.

I actually wanted to make a proxy site (I got a clue from your post here: http://www.coderanch.com/t/132832/gc/proxy-sites-work).

Let me see if I can host this code on some other server that is not behind a proxy (as I won't have that information)...

Thanks buddy.
[ March 05, 2008: Message edited by: ankur rathi ]
ankur rathi
Originally posted by ankur rathi:
Okay. One of the IPs worked.

But I get some weird HTML that has no relation to the HTML I was supposed to get (the HTML for google.co.in). Why so?

Thanks.


The IP which worked is actually for an internal site.

The pac file has code something like:

if (shExpMatch(url, 'http://internal/*')) {
    return "PROXY 10.20.30.50:80";
}
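
The string returned by FindProxyForURL is a semicolon-separated list of entries such as "PROXY host:port" or "DIRECT". A minimal sketch of turning such a string into java.net.Proxy objects, assuming a hypothetical helper class PacResult; it does not evaluate the PAC script itself and skips entry types other than PROXY and DIRECT:

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
class PacResult {

    /** Parses e.g. "PROXY 10.20.30.50:80; DIRECT" into Proxy objects, in order. */
    static List<Proxy> parse(String pacResult) {
        List<Proxy> proxies = new ArrayList<>();
        for (String entry : pacResult.split(";")) {
            String[] parts = entry.trim().split("\\s+");
            if (parts.length == 1 && parts[0].equalsIgnoreCase("DIRECT")) {
                proxies.add(Proxy.NO_PROXY);
            } else if (parts.length == 2 && parts[0].equalsIgnoreCase("PROXY")) {
                int colon = parts[1].lastIndexOf(':');
                if (colon < 0) {
                    continue; // no port given; skipped in this sketch
                }
                String host = parts[1].substring(0, colon);
                // Throws NumberFormatException if the port is not a number.
                int port = Integer.parseInt(parts[1].substring(colon + 1));
                // createUnresolved avoids a DNS lookup while parsing.
                proxies.add(new Proxy(Proxy.Type.HTTP,
                        InetSocketAddress.createUnresolved(host, port)));
            }
            // Other entry types (for example SOCKS) and empty entries are ignored.
        }
        return proxies;
    }

    public static void main(String[] args) {
        System.out.println(parse("PROXY 10.20.30.50:80; DIRECT"));
    }
}
```

Which entry applies to a given URL still depends on the conditions inside the PAC function, so picking an arbitrary entry from the file can select the wrong proxy.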

So if I use this IP (10.20.30.50) in the above program, it returns the HTML content of the internal site. One question here: how are my requested URLs getting changed (from Google to this internal site)? Internal site URLs should pass through this proxy, but can this proxy change any URL into an internal site URL?

As for my original question, I still get a "500 error" if I use the correct proxy:
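
With HttpURLConnection, a 4xx or 5xx status makes getInputStream() throw an IOException, which hides whatever page the proxy or server sent back. A minimal sketch that returns the status code and the body in both cases, assuming a hypothetical helper class ResponseDump and a UTF-8 body (neither comes from the original post):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
class ResponseDump {

    /** Returns "<status code>\n<body>" for any HTTP response, including 4xx/5xx. */
    static String fetch(HttpURLConnection conn) throws IOException {
        int status = conn.getResponseCode();
        // For error statuses the body is on the error stream, not the input stream.
        InputStream body = status >= 400 ? conn.getErrorStream() : conn.getInputStream();
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        if (body != null) { // getErrorStream() returns null when no body was sent
            try (InputStream in = body) {
                byte[] chunk = new byte[4096];
                int n;
                while ((n = in.read(chunk)) != -1) {
                    buffer.write(chunk, 0, n);
                }
            }
        }
        // UTF-8 is an assumption about the response encoding.
        return status + "\n" + new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // URL taken from the thread; add a Proxy to openConnection(...) if needed.
        HttpURLConnection conn = (HttpURLConnection)
                URI.create("http://www.google.co.in/").toURL().openConnection();
        System.out.println(fetch(conn));
    }
}
```

The body of a proxy-generated error page often states why the request was rejected, which is the information asked for earlier in the thread.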

Ulf Dittmer
The proxy can certainly change request and response at its discretion.

Maybe the network admins don't want anyone to access certain sites? Can you access that site with a web browser?
ankur rathi
Originally posted by Ulf Dittmer:
Can you access that site with a web browser?


Yes.
 
 