getting an html page

 
stephen dimitrov
Greenhorn
Posts: 16
not a true servlets question but...

I'm trying to write a simple program that saves an html page given its url. However, what I'm retrieving is not the same html that the browser (in my case Mozilla) uses. To see what I mean:
- run the following code
- open the embedded link (http://www.ranchhouseinn.com/ranch.html) in a browser and then select "save page as..." and save it to c:/good.html (or whatever you wish)
- compare that file with c:/copytest.html that was generated by the code.

So my questions are:
- Why are these files different?
- How can I get the html in good.html using Java?

Thanks

>>>

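import java.io.*;
import java.net.*;

public class copyTest {

    public static void main(String[] args) {
        try {
            URL url = new URL("http://www.ranchhouseinn.com/ranch.html");
            URLConnection connection = url.openConnection();

            // Copy raw bytes so the page's character encoding is preserved
            // (a FileWriter would re-encode each byte as a character).
            // try-with-resources closes both streams even if the copy fails.
            try (InputStream is = connection.getInputStream();
                 OutputStream fs = new FileOutputStream("c:/copytest.html")) {
                byte[] buffer = new byte[8192];
                int read;
                while ((read = is.read(buffer)) != -1) {
                    fs.write(buffer, 0, read);
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}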
 
stephen dimitrov
Greenhorn
Posts: 16
Bad example, here's a better one:

- run the following code
- open the embedded link (http://www.hikinglasvegas.com/peaks_of_the_sierra.htm) in a browser and then select "save page as..." and save it to c:/good.html (or whatever you wish)
- open both c:/copytest.html and c:/good.html IN NOTEPAD
- search for 'Mallory' in both files

In good.html, you'll see there's a fully qualified url for this link:
- href="http://www.hikinglasvegas.com/Mt_Malloryl_Photo_pg.htm"
In copytest.html, you'll see it's been shortened:
- href="Mt_Malloryl_Photo_pg.htm"

This is my problem. I'm trying to parse out individual URLs from the document, but when I fetch the page with the code the links come back shortened. Any ideas?

>>>

New code:

import java.io.*;
import java.net.*;

public class copyTest {

    public static void main(String[] args) {
        try {
            URL url = new URL("http://www.hikinglasvegas.com/peaks_of_the_sierra.htm");
            URLConnection connection = url.openConnection();

            // Copy raw bytes so the page's character encoding is preserved
            // (a FileWriter would re-encode each byte as a character).
            // try-with-resources closes both streams even if the copy fails.
            try (InputStream is = connection.getInputStream();
                 OutputStream fs = new FileOutputStream("c:/copytest.html")) {
                byte[] buffer = new byte[8192];
                int read;
                while ((read = is.read(buffer)) != -1) {
                    fs.write(buffer, 0, read);
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
 
Ben Souther
Sheriff
Posts: 13411
When you use "Save As" the browser is probably converting all the links to absolute so the page will work from your local machine.

I just ran your program and compared its output with the HTML shown by right-clicking and choosing "view source", and all the links were the same.


It shouldn't be difficult to add the base url when you do your parsing.
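
As a rough sketch of what adding the base URL could look like, java.net.URI can resolve a relative href against the address of the page it came from. The class name LinkResolver and the method resolve below are made up for illustration, and the sketch assumes the page has no <base href> tag and that the hrefs are already valid URIs (no unescaped spaces, for instance).

```java
import java.net.URI;

// Hypothetical helper, not part of the original program.
class LinkResolver {

    // Resolve an href taken from a page against the URL of that page.
    // Relative hrefs become absolute; absolute hrefs come back unchanged.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String page = "http://www.hikinglasvegas.com/peaks_of_the_sierra.htm";

        // Prints http://www.hikinglasvegas.com/Mt_Malloryl_Photo_pg.htm
        System.out.println(resolve(page, "Mt_Malloryl_Photo_pg.htm"));
    }
}
```

You would call resolve on each href as you parse it out of copytest.html, passing the same URL you gave to openConnection().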
 