
GET html contents from a web server

 
Ranch Hand
Posts: 49
Hi,
I am building a program to fetch the HTML of a page from an HTTP web server.
The code is below:



Unfortunately, I only receive half of the HTML.
Why is that?

When I use the URL class's getContent() I get all of the HTML, but I need to use sockets. Can someone please point out where my error is?
 
mj zammit
Ranch Hand
Posts: 49
Why am I only receiving half the data from the server?
Any suggestions would be greatly appreciated.
 
Marshal
Posts: 28177
You don't get the whole result because you don't read the whole result. Instead you stop reading earlier than that because of this:

By the way, that readLine() method is deprecated. The API documentation has some suggestions about what you should be using instead.
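
As a rough illustration of the usual replacement, and assuming the deprecated method in question is DataInputStream.readLine(), the sketch below reads lines through a BufferedReader wrapped around an InputStreamReader. The class name LineReaderSketch and the helper readAllLines are made up for this example, and ISO-8859-1 is only assumed as the encoding. The loop ends when readLine() returns null (end of stream), not when it sees an empty line.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not taken from the original post.
class LineReaderSketch {

    // Reads lines until end of stream; readLine() returns null only at EOF,
    // so blank lines in the middle of the page do not stop the loop.
    static List<String> readAllLines(InputStream in) throws IOException {
        List<String> lines = new ArrayList<>();
        // Assumption: the bytes are ISO-8859-1 text.
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.ISO_8859_1));
        String line;
        while ((line = reader.readLine()) != null) {
            lines.add(line);
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for socket.getInputStream(); no network is needed here.
        byte[] sample = "<html>\r\n\r\n<body>\r\n</body>\r\n</html>\r\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        List<String> lines = readAllLines(new ByteArrayInputStream(sample));
        for (int i = 0; i < lines.size(); i++) {
            System.out.println("Line " + (i + 1) + ": " + lines.get(i));
        }
    }
}
```

With a real connection, the ByteArrayInputStream would be replaced by the socket's input stream.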

Also, you said this:

When i use the URL class getcontent I get all the html, but i need to use sockets.


That doesn't quite make sense to me, as the URL class does use sockets. So if you use that, you are using sockets.
 
mj zammit
Ranch Hand
Posts: 49
Hi Paul,

Thanks for the reply.
What I meant when I said that I want to use sockets rather than URL is that I want to work with low-level sockets directly.
I have followed your suggestions, but I am still only getting half of the HTML.
This is the code:


This is the output I am getting on the console:
Line 1: <html>
Line 2: <head>
Line 3: <meta NAME="GENERATOR" Content="Microsoft FrontPage 12.0">
Line 4: <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
Line 5: <title>Nature Net</title>
Line 6: <link REL="stylesheet" HREF="styles/style.css" TYPE="text/css">
Line 7: <script src="include/i_javascript.js" type="text/javascript"></script>
Line 8:
Line 9: <style type="text/css">
Line 10: .style1 {
Line 11: text-align: center;
Line 12: }
Line 13: .style2 {
Line 14: border-width: 0px;
Line 15: }
Line 16: .style5 {
Line 17: color: #E4761F;
Line 18: }
Line 19: </style>
Line 20:
Line 21: </head>
Line 22: <body leftmargin="0" topmargin="0" bgcolor="#FFFFFF">
Line 23: <table border="0" cellpadding="0" cellspacing="0" width="780">
Line 24: <tr>
Line 25: <td width="195" bgcolor="#4346D3" align="center" valign="middle">
Line 26: <img src="images/naturenetlogo2.gif" width="92" height="92" align="middle"></td>
Line 27: <td width="585">
Line 28: <table border="0" cellpadding="0" cellspacing="0">
Line 29: <tr>
Line 30: <td width="443" height="138" bgcolor="#84C55F" align="center" valign="center">
Line 31: <img src="images/headertitle.gif" alt="Naturenet The Environmental Learning Network" width="409" height="96">
Line 32: </td>
Line 33: <td width="142" bgcolor="#84C55F">
Line 34: <img src="images/headerpic1.gif" id="rightuppergraphic" alt="" width="142" height="140">
Line 35: </td>
Line 36: </tr>
Line 37: <tr>
Line 38: <td colspan="2" height="22" bgcolor="#FBE590" align="right"><a href="contact.html" class="navlink">
Line 39: contact us</a> |
Line 40: <a href="sitemap.html" class="navlink">sitemap</a>  </td>
Line 41: </tr>
Line 42: </table>
Line 43: </td>
Line 44: </tr>
Line 45: <tr>
Line 46: <td width="195" height="6" bgcolor="#FBE590"></td><td bgcolor="#84C55F"></td>
Line 47: </tr>
Line 48: <tr>
Line 49: <td width="195" height="500" bgcolor="#FBE590" valign="top">
Line 50: <table border="0" cellpadding="0" cellspacing="0"><tr><td bgcolor="#FBE590" width="5"></td>
Line 51: <td bgcolor="#FBE590">
Line 52: <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
That is all the content found in the DataInputStream.

I also had the following contents in the console:
The request header : GET /styles/style.css HTTP/1.0
User-Agent:Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.4) Gecko/2008102920 Firefox/3.0.4
Referer:http://127.0.0.1:8080/?getURL=www.naturenet.com
Accept: text/css,*/*;q=0.1
Host: 127.0.0.1:8080
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Cookie: __qca=1223192728-64253405-47139338; __utma=96992031.3524520648312145400.1227907344.1227907344.1227907344.1; __utmz=96992031.1227907344.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)

Error getting page java.net.MalformedURLException: no protocol:

What does this mean? Why is it throwing me this request header?
 
Rancher
Posts: 43081
You're assuming that there's nothing more to read if available() returns 0; that's not the case: AvailableDoesntDoWhatYouThinkItDoes. Use read() instead, but be aware of ReadDoesntDoWhatYouThinkItDoes.
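
A minimal sketch of that advice, under the assumption that the server closes the connection when the response is complete (as is typical for an HTTP/1.0 request without keep-alive): keep calling read() until it returns -1, and never treat available() == 0 as the end. The names ReadToEndSketch, readToEnd and TricklingStream are invented for this example; TricklingStream only simulates a slow network by handing out a few bytes per call.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper class, not taken from the original post.
class ReadToEndSketch {

    // Copies everything from the stream until read() reports end of stream (-1).
    // Each read() may return fewer bytes than the buffer holds, so only the
    // first n bytes of the buffer are written on each pass.
    static byte[] readToEnd(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    // Simulated slow source: at most 3 bytes per read, and available() is
    // always 0 even though more data is still to come.
    static class TricklingStream extends InputStream {
        private final InputStream source;

        TricklingStream(byte[] data) {
            this.source = new ByteArrayInputStream(data);
        }

        @Override
        public int read() throws IOException {
            return source.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            return source.read(b, off, Math.min(len, 3));
        }

        @Override
        public int available() {
            return 0;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] page = "<html><body>whole page</body></html>".getBytes("ISO-8859-1");
        byte[] received = readToEnd(new TricklingStream(page));
        System.out.println(received.length + " of " + page.length + " bytes received");
    }
}
```

A loop guarded by available() > 0 would stop immediately on this stream, which is the same symptom as a page that arrives only in part. If the connection is kept alive, read() will not return -1 at the end of the response, and the Content-Length header would have to be used instead.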
 
mj zammit
Ranch Hand
Posts: 49
The pages you linked to do everything in bytes, but what I ultimately want is the HTML of any URL, so that I can manipulate it on my web server before outputting the results. I can't do that with bytes, right?
 
Ulf Dittmer
Rancher
Posts: 43081
You can't convert the bytes to strings until you know which encoding they're in, and you won't know that until you've inspected the META tag that specifies it.
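
One way this could look, as a sketch only: collect the raw bytes first, decode a prefix as ISO-8859-1 (which maps every byte to a character, so nothing is lost while searching), look for a charset=... declaration, and then decode the whole body with the charset found. MetaCharsetSketch, sniffCharset and decode are hypothetical names, the 2048-byte probe size and the ISO-8859-1 fallback are assumptions, and the regular expression is a simplification rather than a full HTML parser.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not taken from the original post.
class MetaCharsetSketch {

    // Simplified pattern: matches charset=NAME inside a META tag such as
    // <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9_.:-]+)", Pattern.CASE_INSENSITIVE);

    // Looks for a charset declaration in the first 2048 bytes (an assumed limit).
    static Charset sniffCharset(byte[] body, Charset fallback) {
        int probeLength = Math.min(body.length, 2048);
        // ISO-8859-1 decodes any byte sequence without errors, so it is safe for probing.
        String probe = new String(body, 0, probeLength, StandardCharsets.ISO_8859_1);
        Matcher m = CHARSET.matcher(probe);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: use the fallback.
                return fallback;
            }
        }
        return fallback;
    }

    // Decodes the full body with the sniffed charset, falling back to ISO-8859-1.
    static String decode(byte[] body) {
        return new String(body, sniffCharset(body, StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) {
        String html = "<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=utf-8\"></head><body>caf\u00e9</body></html>";
        byte[] raw = html.getBytes(StandardCharsets.UTF_8);
        System.out.println(sniffCharset(raw, StandardCharsets.ISO_8859_1));
        System.out.println(decode(raw).equals(html));
    }
}
```

Once the body is a String, it can be manipulated as text before being sent on.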

If this were my project, I'd use a library like https://sourceforge.net/projects/jwebunit, which lets you retrieve (and work with) web pages on a much higher level.
 
mj zammit
Ranch Hand
Posts: 49
I tried it this way, but I am still getting only half of the HTML:


Thank you for your patience. I am new to networking and would really like to get this working with low-level sockets.

I am working with low-level sockets because, as far as I know, the URL class can only make POST and GET requests. Is this true? Can it do other HTTP methods, such as DELETE?
 
mj zammit
Ranch Hand
Posts: 49
When you say I have to inspect the META tag, do you mean to find the charset value?
 
mj zammit
Ranch Hand
Posts: 49
I tried using UTF-8 as the encoding, but if I am not mistaken that is for text/html content.
The code is as follows:


But I am still getting only half of the HTML.
 
Ulf Dittmer
Rancher
Posts: 43081

but i am still getting only half the html


Read the ReadDoesntDoWhatYouThinkItDoes page I linked to; it explains the problem.
 