GET html contents from a web server

mj zammit
Ranch Hand

Joined: Nov 16, 2008
Posts: 49
Hi,
I am building a program to fetch the HTML of a page from an HTTP website.
The code is below:



Unfortunately I only receive half of the HTML.
Why is that?

When I use the URL class's getContent() I get all of the HTML, but I need to use sockets. Can someone please point out where my error is?
mj zammit
Ranch Hand

Joined: Nov 16, 2008
Posts: 49
Why am I only receiving half the data from the server?
Any suggestions will be greatly appreciated.
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18541
    

You don't get the whole result because you don't read the whole result. Instead you stop reading earlier than that because of this:

By the way, that readLine() method is deprecated. The API documentation has some suggestions about what you should be using instead.

Also, you said this:
"When I use the URL class's getContent() I get all of the HTML, but I need to use sockets."

That doesn't quite make sense to me, as the URL class does use sockets. So if you use that, you are using sockets.
mj zammit
Ranch Hand

Joined: Nov 16, 2008
Posts: 49
Hi Paul,

Thanks for the reply.
What I meant when I said that I want to use sockets and not URL is that I want to use low-level sockets.
I have followed your suggestions, but I am still only getting half of the HTML.
This is the code:


This is the output I am getting on the console:
Line 1: <html>
Line 2: <head>
Line 3: <meta NAME="GENERATOR" Content="Microsoft FrontPage 12.0">
Line 4: <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
Line 5: <title>Nature Net</title>
Line 6: <link REL="stylesheet" HREF="styles/style.css" TYPE="text/css">
Line 7: <script src="include/i_javascript.js" type="text/javascript"></script>
Line 8:
Line 9: <style type="text/css">
Line 10: .style1 {
Line 11: text-align: center;
Line 12: }
Line 13: .style2 {
Line 14: border-width: 0px;
Line 15: }
Line 16: .style5 {
Line 17: color: #E4761F;
Line 18: }
Line 19: </style>
Line 20:
Line 21: </head>
Line 22: <body leftmargin="0" topmargin="0" bgcolor="#FFFFFF">
Line 23: <table border="0" cellpadding="0" cellspacing="0" width="780">
Line 24: <tr>
Line 25: <td width="195" bgcolor="#4346D3" align="center" valign="middle">
Line 26: <img src="images/naturenetlogo2.gif" width="92" height="92" align="middle"></td>
Line 27: <td width="585">
Line 28: <table border="0" cellpadding="0" cellspacing="0">
Line 29: <tr>
Line 30: <td width="443" height="138" bgcolor="#84C55F" align="center" valign="center">
Line 31: <img src="images/headertitle.gif" alt="Naturenet The Environmental Learning Network" width="409" height="96">
Line 32: </td>
Line 33: <td width="142" bgcolor="#84C55F">
Line 34: <img src="images/headerpic1.gif" id="rightuppergraphic" alt="" width="142" height="140">
Line 35: </td>
Line 36: </tr>
Line 37: <tr>
Line 38: <td colspan="2" height="22" bgcolor="#FBE590" align="right"><a href="contact.html" class="navlink">
Line 39: contact us</a> |
Line 40: <a href="sitemap.html" class="navlink">sitemap</a>  </td>
Line 41: </tr>
Line 42: </table>
Line 43: </td>
Line 44: </tr>
Line 45: <tr>
Line 46: <td width="195" height="6" bgcolor="#FBE590"></td><td bgcolor="#84C55F"></td>
Line 47: </tr>
Line 48: <tr>
Line 49: <td width="195" height="500" bgcolor="#FBE590" valign="top">
Line 50: <table border="0" cellpadding="0" cellspacing="0"><tr><td bgcolor="#FBE590" width="5"></td>
Line 51: <td bgcolor="#FBE590">
Line 52: <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
That is everything that was read from the DataInputStream.

The console also showed the following:
The request header : GET /styles/style.css HTTP/1.0
User-Agent:Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.4) Gecko/2008102920 Firefox/3.0.4
Referer:http://127.0.0.1:8080/?getURL=www.naturenet.com
Accept: text/css,*/*;q=0.1
Host: 127.0.0.1:8080
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Cookie: __qca=1223192728-64253405-47139338; __utma=96992031.3524520648312145400.1227907344.1227907344.1227907344.1; __utmz=96992031.1227907344.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)

Error getting page java.net.MalformedURLException: no protocol:

What does this mean? Why is it showing me this request header?
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41034
    
You're assuming that there's nothing more to read if available() returns 0; that's not the case: AvailableDoesntDoWhatYouThinkItDoes. Use read() instead, but be aware of ReadDoesntDoWhatYouThinkItDoes.
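
To illustrate the point about available(), here is a minimal sketch of a read loop that stops only when read() returns -1. It assumes the server closes the connection once the response is complete (as it normally does for an HTTP/1.0 request or with "Connection: close"); on a kept-alive connection the loop would block waiting for more data. The class and method names are made up for this example and are not from the code posted in this thread.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not taken from the original poster's code.
final class ReadUntilEof {

    /** Reads every byte from the stream until read() reports end of stream (-1). */
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        // read() may deliver fewer bytes than the buffer holds, and available()
        // may be 0 while more data is still in transit; only -1 means "finished".
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }
}
```

With a socket this would be called as ReadUntilEof.readAll(socket.getInputStream()) after the request has been written and flushed.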


mj zammit
Ranch Hand

Joined: Nov 16, 2008
Posts: 49
The pages you linked to do everything in bytes, but what I want in the end is the HTML of any URL as text, so that I can manipulate it on my web server before outputting the results. I cannot do that with bytes, right?
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41034
    
You can't convert the bytes to strings until you know which encoding they're in, and you won't know that until you've inspected the META tag that specifies it.
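
A rough sketch of that idea, under some assumptions: the page bytes are first decoded as ISO-8859-1 (which maps every byte to a character, so nothing is lost), a charset=... declaration is searched for, and the bytes are then decoded again with the declared charset. The class name, the regular expression, and the ISO-8859-1 fallback are choices made for this example only. It also ignores any charset given in the HTTP Content-Type response header, which a fuller version would want to check as well.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; a real HTML parser is more robust.
final class CharsetSniffer {

    // Matches e.g. charset=iso-8859-1 or charset="utf-8" inside a META tag.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9][A-Za-z0-9._:-]*)",
            Pattern.CASE_INSENSITIVE);

    /** Decodes the page using the charset it declares, or ISO-8859-1 if none is found. */
    static String decode(byte[] page) {
        // Provisional decoding: every byte becomes exactly one char.
        String provisional = new String(page, StandardCharsets.ISO_8859_1);
        Charset charset = StandardCharsets.ISO_8859_1; // assumed fallback
        Matcher m = CHARSET.matcher(provisional);
        if (m.find() && Charset.isSupported(m.group(1))) {
            charset = Charset.forName(m.group(1));
        }
        return new String(page, charset);
    }
}
```

For the page in this thread the META tag declares iso-8859-1, so the second decoding step would give the same text as the provisional one.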

If this were my project, I'd use a library like jWebUnit (https://sourceforge.net/projects/jwebunit), which lets you retrieve (and work with) web pages on a much higher level.
mj zammit
Ranch Hand

Joined: Nov 16, 2008
Posts: 49
I tried it this way, but I am still getting only half the HTML:


Thank you for your patience. I am new to networking and would really like to manage this with low-level sockets.

I am working with low-level sockets because, as I understand it, the URL class can only make GET and POST requests. Is this true? Can it use other HTTP methods such as DELETE?
mj zammit
Ranch Hand

Joined: Nov 16, 2008
Posts: 49
When you say I have to inspect the META tag, does that mean finding the charset value?
mj zammit
Ranch Hand

Joined: Nov 16, 2008
Posts: 49
I tried to encode using UTF-8, but if I am not mistaken this is for text/html content.
The code is as follows:


But I am still getting only half the HTML.
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41034
    
but i am still getting only half the html

Read the ReadDoesntDoWhatYouThinkItDoes page I linked to; it explains the problem.
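
For anyone who wants to see the whole thing over a raw socket, here is one possible sketch. It is not the original poster's code: the class name, method name, and parameters are invented for this example. It sends an HTTP/1.0 request so that the server closes the connection after the response, which lets the read loop end on -1 instead of guessing with available(). The returned bytes still include the status line and headers, which would have to be split off before decoding the body.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical example class; names and structure are assumptions.
final class RawHttpGet {

    /** Sends an HTTP/1.0 GET and returns the raw response: status line, headers and body. */
    static byte[] fetch(String host, int port, String path) throws IOException {
        try (Socket socket = new Socket(host, port)) {
            // Each header line ends with CRLF; an empty line ends the request.
            String request = "GET " + path + " HTTP/1.0\r\n"
                    + "Host: " + host + "\r\n"
                    + "\r\n";
            OutputStream out = socket.getOutputStream();
            out.write(request.getBytes(StandardCharsets.US_ASCII));
            out.flush();

            // Keep reading until the server closes the connection (read() == -1).
            InputStream in = socket.getInputStream();
            ByteArrayOutputStream response = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                response.write(buffer, 0, n);
            }
            return response.toByteArray();
        }
    }
}
```

A call such as RawHttpGet.fetch("www.naturenet.com", 80, "/") would be the equivalent of the request discussed in this thread.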
 