
Problem getting html of a WebPage

Rohan Deshmkh
Ranch Hand

Joined: Aug 31, 2012
Posts: 127
I want to know what I am doing wrong; I don't want to use alternative classes. This is my code:


The output that I get is many seemingly random integer values, one per line, with -1 at the end.
I am not sure what InputStream s = u.openStream(); and size=s.read(); do.
I want to know how to print the HTML of a web page without using any other classes.
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42612
The value is certainly not random. As the relevant javadocs explain, it's the next byte of data. For a web page, that's probably the ASCII code of a character of text (or UTF-8, ISO_8859 or whatever the page is encoded in).
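
To make that concrete, here is a minimal sketch of what a loop around read() sees. It assumes an in-memory ByteArrayInputStream as a stand-in for u.openStream(), and the class and method names are made up for illustration. Each call returns one byte as an int from 0 to 255, and -1 once the stream is exhausted.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class SingleByteReadDemo {

    // Collects every value returned by read(), including the final -1
    // that signals the end of the stream.
    static List<Integer> readRawValues(InputStream in) {
        List<Integer> values = new ArrayList<>();
        try {
            int value;
            do {
                value = in.read();
                values.add(value);
            } while (value != -1);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return values;
    }

    public static void main(String[] args) {
        // Assumption: a small in-memory stream stands in for u.openStream().
        InputStream s = new ByteArrayInputStream("Hi".getBytes(StandardCharsets.US_ASCII));
        for (int value : readRawValues(s)) {
            if (value == -1) {
                System.out.println(value + " (end of stream)");
            } else {
                // The cast is only meaningful here because the data is ASCII.
                System.out.println(value + " -> '" + (char) value + "'");
            }
        }
    }
}
```

For the two-character input this prints 72, 105 and then -1, which matches the pattern of "integers followed by -1" described in the question.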

If you expect text to be returned (and not binary data), wrap the InputStream into a BufferedReader, and process the output line by line. That would provide a more human-readable representation of the content.
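
A rough sketch of that wrapping follows. The in-memory stream again stands in for u.openStream(), the names are hypothetical, and UTF-8 is an assumed encoding rather than something known about any particular page.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class LineReadDemo {

    // Wraps the byte stream in a Reader (bytes -> chars) and then in a
    // BufferedReader (adds buffering and readLine()).
    static List<String> readLines(InputStream in) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) { // UTF-8 is an assumption
            String line;
            while ((line = reader.readLine()) != null) { // null means end of stream
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Assumption: in real code this would be the stream from u.openStream().
        String page = "<html>\n<body>Hi</body>\n</html>";
        InputStream s = new ByteArrayInputStream(page.getBytes(StandardCharsets.UTF_8));
        for (String line : readLines(s)) {
            System.out.println(line);
        }
    }
}
```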


Rohan Deshmkh
Ranch Hand

Joined: Aug 31, 2012
Posts: 127
Hey, thanks for the suggestion. But I have the following questions:
1) Why should we wrap the InputStream in a BufferedReader?
2) I did what you said and got the correct output as expected, but the code given in the book does not use BufferedReader; it uses a byte array instead, which I am not able to understand.

Here is the code given in the book:


I don't understand why the byte array is used, or in which variable the HTML content is actually stored.
As you suggested, I wrapped the stream in a BufferedReader and got the correct output, but I want to understand how the above example works.
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42612
There are any number of ways to read an InputStream, and none is the best in all given circumstances. I prefer the BufferedReader approach because then I don't have to deal with byte arrays and creating String objects myself, but it only works if you're certain that you're reading text (which is what a web page is, but web content in general can also be binary, in which case you can't use Readers).
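
For comparison, a byte-array read loop commonly looks something like the sketch below. This is not the book's listing, only an illustration under assumptions: the names are hypothetical, the stream is an in-memory stand-in, and UTF-8 is assumed for the final conversion. The array is a reusable scratch buffer; the content ends up in whatever the chunks are copied into.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ByteArrayReadDemo {

    // Reads the stream in chunks of up to bufferSize bytes (must be positive),
    // collects all bytes, and converts them to a String in one step at the end.
    static String readAll(InputStream in, int bufferSize) {
        ByteArrayOutputStream collected = new ByteArrayOutputStream();
        byte[] buffer = new byte[bufferSize]; // scratch space, overwritten on each read
        try {
            int count;
            // read(buffer) returns how many bytes it stored, or -1 at end of stream.
            while ((count = in.read(buffer)) != -1) {
                collected.write(buffer, 0, count); // copy only the bytes actually read
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Converting once at the end avoids splitting a multi-byte character
        // across two chunks. UTF-8 is an assumption about the content.
        return new String(collected.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        InputStream s = new ByteArrayInputStream(
                "<html><body>Hi</body></html>".getBytes(StandardCharsets.UTF_8));
        System.out.println(readAll(s, 8));
    }
}
```

The manual copying and String construction here is the work a Reader does for you.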

Note that, to be entirely correct, you would also have to handle the character encoding the web page is in. The code above assumes that it's compatible with the platform default encoding of the machine where the code runs - which is often a correct assumption, but definitely not always.
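
A small sketch of why the encoding matters: the same bytes decoded with two different charsets give different text. The class name, method name and sample word are chosen purely for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetDemo {

    // Decodes the given bytes through an InputStreamReader using an explicit charset.
    static String decode(byte[] bytes, Charset charset) {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            int c;
            while ((c = reader.read()) != -1) { // here read() returns a char value, not a byte
                text.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // "caf\u00e9" is "cafe" with an accented e; in UTF-8 that letter takes two bytes.
        byte[] utf8Bytes = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(utf8Bytes, StandardCharsets.UTF_8));      // correct charset
        System.out.println(decode(utf8Bytes, StandardCharsets.ISO_8859_1)); // wrong charset: garbled
    }
}
```

Decoding with the wrong charset does not fail; it silently produces different characters, which is why relying on the platform default can go unnoticed until a page contains non-ASCII text.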
Rohan Deshmkh
Ranch Hand

Joined: Aug 31, 2012
Posts: 127
OK, I understood about InputStream, but would you mind telling me how the code that I posted above works? I am not able to understand it.
After this statement is executed: InputStream s = u.openStream();
does s now contain all the HTML content that we want? It may be in another format, so the only thing we have to do is convert it into textual format (using BufferedReader or another method) and then print it. Am I right?
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 8223
Rohan Deshmkh wrote:I don't understand why the byte array is used, or in which variable the HTML content is actually stored.
As you suggested, I wrapped the stream in a BufferedReader and got the correct output, but I want to understand how the above example works.

Basically: because it's doing a String conversion; however, it looks rather tortuous to me, and not as good as Ulf's suggestion.

Simply put, all Files and Streams contain binary data that can be read byte by byte. Only some of those Streams contain TEXT, and text needs to be converted. This is because a Java char is a TWO-byte primitive and streamed text (particularly ASCII text, but other forms too) often contains each character in one byte. The foundation classes provide a Reader which is specifically designed for converting text streams to Java characters, and there may be quite a lot going on behind the scenes that you don't see. Adding buffering (ie, with a BufferedReader) makes I/O more efficient, and also allows you to read in "lines" of data (which is the normal way of breaking up text) as Strings.
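
As a quick illustration of the size mismatch, this sketch compares the width of a Java char with the number of bytes a few characters occupy once encoded. The class and method names are invented for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharVersusByteDemo {

    // Number of bytes the text occupies when encoded with the given charset.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // A Java char is always two bytes wide in memory.
        System.out.println("bytes per Java char: " + Character.BYTES);
        // In a stream, the same character may take one byte or several.
        System.out.println("'A' in US-ASCII: " + encodedLength("A", StandardCharsets.US_ASCII));
        System.out.println("e-acute in UTF-8: " + encodedLength("\u00e9", StandardCharsets.UTF_8));
        System.out.println("euro sign in UTF-8: " + encodedLength("\u20ac", StandardCharsets.UTF_8));
    }
}
```

Because the byte count per character varies with the encoding, the conversion cannot be a simple cast in general, which is the job a Reader takes on.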

HIH

Winston


Rohan Deshmkh
Ranch Hand

Joined: Aug 31, 2012
Posts: 127
OK, thanks. I will use the BufferedReader approach from now on.
 