getting an html page

stephen dimitrov
Greenhorn

Joined: Feb 28, 2005
Posts: 16
not a true servlets question but...

I'm trying to write a simple program that saves an html page given its url. However, what I'm retrieving is not the same html that the browser (in my case Mozilla) uses. To see what I mean:
- run the following code
- open the embedded link (http://www.ranchhouseinn.com/ranch.html) in a browser and then select "save page as..." and save it to c:/good.html (or whatever you wish)
- compare that file with c:/copytest.html that was generated by the code.

So my questions are:
- Why are these files different?
- How can I get the html in good.html using java?

Thanks

>>>

import java.io.*;
import java.net.*;

public class copyTest {

    public static void main(String[] args) {
        try {
            URL url = new URL("http://www.ranchhouseinn.com/ranch.html");
            URLConnection connection = url.openConnection();

            // Copy the response byte for byte. A FileWriter would re-encode each
            // byte as a character in the platform charset, which can alter
            // non-ASCII content. try-with-resources closes both streams.
            try (InputStream is = connection.getInputStream();
                 OutputStream fs = new FileOutputStream("c:/copytest.html")) {
                int read;
                while ((read = is.read()) != -1) {
                    fs.write(read);
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
stephen dimitrov
Greenhorn

Joined: Feb 28, 2005
Posts: 16
Bad example, here's a better one:

- run the following code
- open the embedded link (http://www.hikinglasvegas.com/peaks_of_the_sierra.htm) in a browser and then select "save page as..." and save it to c:/good.html (or whatever you wish)
- open both c:/copytest.html and c:/good.html IN NOTEPAD
- search for 'Mallory' in both files

In good.html, you'll see there's a fully qualified url for this link:
- href="http://www.hikinglasvegas.com/Mt_Malloryl_Photo_pg.htm"
In copytest.html, you'll see it's been shortened:
- href="Mt_Malloryl_Photo_pg.htm"

This is my problem. I'm trying to parse the individual URLs out of the document, but when I fetch the page in code the links come back as relative URLs. Any ideas?

>>>

New code:

import java.io.*;
import java.net.*;

public class copyTest {

    public static void main(String[] args) {
        try {
            URL url = new URL("http://www.hikinglasvegas.com/peaks_of_the_sierra.htm");
            URLConnection connection = url.openConnection();

            // Copy the response byte for byte. A FileWriter would re-encode each
            // byte as a character in the platform charset, which can alter
            // non-ASCII content. try-with-resources closes both streams.
            try (InputStream is = connection.getInputStream();
                 OutputStream fs = new FileOutputStream("c:/copytest.html")) {
                int read;
                while ((read = is.read()) != -1) {
                    fs.write(read);
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Ben Souther
Sheriff

Joined: Dec 11, 2004
Posts: 13410

When you use "Save As" the browser is probably converting all the links to absolute so the page will work from your local machine.

I just ran your program and compared its output with the HTML shown by right-clicking and choosing "View Source", and all the links were the same.


It shouldn't be difficult to add the base url when you do your parsing.
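
As a rough sketch of what adding the base URL could look like: the standard java.net.URI class can resolve a relative href against the URL the page was fetched from. The class and method names below (LinkResolver, resolve, absoluteLinks) are made up for illustration, and the regular expression is a deliberately naive assumption that only matches double-quoted href attributes. It also ignores any <base href> tag the page might declare, and URI.create will throw on hrefs that contain spaces or other illegal characters, so a real HTML parser would be the safer choice.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original program.
class LinkResolver {

    // Naive pattern: double-quoted href attributes only (an assumption).
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Resolves one href against the URL of the page it came from.
    // Absolute hrefs are returned unchanged.
    public static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    // Finds every href in the HTML text and returns it in absolute form.
    public static List<String> absoluteLinks(String pageUrl, String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(resolve(pageUrl, m.group(1)));
        }
        return links;
    }

    public static void main(String[] args) {
        String pageUrl = "http://www.hikinglasvegas.com/peaks_of_the_sierra.htm";
        String html = "<a href=\"Mt_Malloryl_Photo_pg.htm\">Mt. Mallory</a>";
        for (String link : absoluteLinks(pageUrl, html)) {
            // prints http://www.hikinglasvegas.com/Mt_Malloryl_Photo_pg.htm
            System.out.println(link);
        }
    }
}
```

In the posted program, the page URL passed to resolve would be the same string given to new URL(...), and the html argument would be the text read from the connection.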

