JavaRanch » Java Forums » Java » Java in General

Quickest way to read in a Web page

Craig Sullivan
Greenhorn

Joined: Nov 21, 2002
Posts: 4
I'm using the following code to read in the contents of a Web page:

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

// Open a connection to the page
URL SECURL = new URL(formURL);
URLConnection SECCon = SECURL.openConnection();

// Wrap the raw byte stream in a buffered character reader
InputStream input = SECCon.getInputStream();
InputStreamReader IReader = new InputStreamReader(input);
BufferedReader BReader = new BufferedReader(IReader);

String BString;

while ((BString = BReader.readLine()) != null) {
    /* processing code here */
}

Is there a faster way to read the web page? This code segment seems to take longer than I would like.

Thanks.
Tim West
Ranch Hand

Joined: Mar 15, 2004
Posts: 539
I can't speak for what's fastest, but here's the code I use:
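A minimal sketch of this approach - reading lines through a BufferedReader and accumulating them in a StringBuffer - with illustrative class, method, and URL names:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;

public class PageReader {
    // Accumulate the page in a StringBuffer rather than by String
    // concatenation, which copies the whole string on every append.
    static String readAll(Reader in) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        StringBuffer page = new StringBuffer();
        String line;
        while ((line = reader.readLine()) != null) {
            page.append(line).append('\n');
        }
        return page.toString();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical URL - substitute the page you want to fetch
        Reader in = new InputStreamReader(new URL("http://example.com/").openStream());
        try {
            System.out.println(readAll(in));
        } finally {
            in.close();
        }
    }
}
```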



In terms of efficiency, one important thing is to make sure you're using StringBuffers not Strings (as String concatenation is expensive). However, it may be that the speed of your network is sufficiently slow that the concatenation isn't what's causing the slowness.

BTW, this code is hackish ... I've only used it when playing around. I haven't put much time into writing it particularly prettily. In particular, the while loop is nasty and C-like. I would rewrite it in production code...but you get the idea



--Tim
[ June 27, 2004: Message edited by: Tim West ]
Ko Ko Naing
Ranch Hand

Joined: Jun 08, 2002
Posts: 3178
Tim West's explanation is reasonable... Using StringBuffer is better than using String, at least when processing long strings of characters...

Craig Sullivan, what do you mean by "faster way to read the web page"? Do you mean which readers or input streams should be used in your code? Could you provide more info about your code, so that we can help you much more than you can imagine?


Co-author of SCMAD Exam Guide, Author of JMADPlus
SCJP1.2, CCNA, SCWCD1.4, SCBCD1.3, SCMAD1.0, SCJA1.0, SCJP6.0
Craig Sullivan
Greenhorn

Joined: Nov 21, 2002
Posts: 4
I want to know the fastest way to get the content of the page from the server to my Java client.
Tim West
Ranch Hand

Joined: Mar 15, 2004
Posts: 539
Well, you're limited by two things:

  • The speed of the connection between the remote server and your local box.
  • The speed of your Java code.

For the latter, implement a decent solution that uses a BufferedReader and StringBuffers, not Strings. I'm not aware of anything else that will significantly increase your code speed in this situation. If there is anything, I'm sure someone else will point it out soon.

Then, unless your connection is really fast, I'd say it's highly likely that your connection, not the code, is the performance bottleneck. So, upgrade your internet (or intranet) facilities.

In any case, it should be relatively simple to profile your code to work out which methods are taking the most time. Then you can decide where to optimise next.
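A crude way to see where the time goes, without a full profiler, is to time the read loop itself. A sketch, with a StringReader standing in for the network stream:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class TimeRead {
    // Read the whole stream into a StringBuffer
    static String slurp(Reader r) throws IOException {
        BufferedReader in = new BufferedReader(r);
        StringBuffer sb = new StringBuffer();
        String line;
        while ((line = in.readLine()) != null) {
            sb.append(line).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Time the read; against a real page you would pass an
        // InputStreamReader over the URL connection instead.
        long start = System.currentTimeMillis();
        String page = slurp(new StringReader("one\ntwo\n"));
        long elapsed = System.currentTimeMillis() - start;
        System.out.println("read " + page.length() + " chars in " + elapsed + " ms");
    }
}
```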


--Tim
Craig Sullivan
Greenhorn

Joined: Nov 21, 2002
Posts: 4
My main concern is with reducing round trips to and from the server. Is there a particular method of downloading the data from the server that will reduce round trips?

For example, does BufferedReader.readLine() use more round trips than BufferedReader.read()? I tried changing the buffer size, but the largest download I could get was 2555 bytes. Is the maximum buffer size dictated by HTTP, or is there some parameter within the JDK that I can change?
Tim West
Ranch Hand

Joined: Mar 15, 2004
Posts: 539
Hmm. I'm not qualified to give a definitive answer at this point, but I can offer some more thoughts.

Firstly, the size of any given packet (at the lowest level) is determined by your Maximum Transmission Unit, or MTU. This is an OS-level concern, and something Java has no control over. For Ethernet NICs it's generally around 1500 bytes (at least, it is for me).

This is the maximum packet size. It includes all the HTTP/TCP/IP headers, checksums, and whatever else the different layers of the network stack put in, so you don't get a huge amount of data in an individual packet. I'm not familiar enough with the various protocols to be sure, but I think any network connection always involves round trips of a sort - the TCP three-way handshake to start, then the process of acknowledging each packet from the source as more data arrives. Do you want to reduce this sort of round trip, or have I missed something?

However, all this is transparent to a Java app. As far as Java is concerned, you get a byte stream (URL.openStream() returns an InputStream) and read happily away.

I think from a Java point of view, all you can do is use a larger buffer in the BufferedReader. Then you avoid the possibility that the buffer fills and the connection has to stall. That said, I'd guess most OSs buffer network connections themselves, but that is complete speculation.
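For what it's worth, the buffer size can be set explicitly through the second argument of the BufferedReader constructor (the default is 8192 chars). A sketch, with illustrative URL and size:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;

public class BigBufferFetch {
    // Wrap any reader in a BufferedReader with an explicit buffer size
    // (in chars); the default used by the one-argument constructor is 8192.
    static BufferedReader buffered(Reader in, int size) {
        return new BufferedReader(in, size);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical URL - substitute the page you actually want
        Reader raw = new InputStreamReader(new URL("http://example.com/").openStream());
        BufferedReader in = buffered(raw, 64 * 1024); // ask for a 64 KB buffer
        String line;
        while ((line = in.readLine()) != null) {
            // process the line
        }
        in.close();
    }
}
```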

Anyway, there are some random thoughts that may or may not help.

I'm curious, though - what do you mean the largest download you could get was 2555 bytes? Is that one packet, or the total download size?

Dunno whether I helped or not jus' then, but there ya go.


--Tim
Craig Sullivan
Greenhorn

Joined: Nov 21, 2002
Posts: 4
Here's the deal as far as I know: TCP has a flexible (sliding) window. It will send more or fewer packets at one time without an ACK from the client, depending on the speed and stability of the network.

After I created a BufferedReader, I called available() and got back 2555, or some such. This tells me that I can only read in 2555 bytes at one time.

When I use BufferedReader.readLine() to read an 8 MB web page, it takes 25 seconds. When I use my web browser, it takes 5 seconds. Somewhere - I don't know where - the amount I can read in at one time is being limited to 2555 bytes. I don't believe my TCP/IP stack is limiting my download; I believe there is some parameter in Java that is limiting the number of bytes I can download at one time to 2555. If the download size were bigger, not as many ACKs would be sent from my client, and the download would be faster.

I may need to dig into the JDK to see what's going on.
[ July 02, 2004: Message edited by: Craig Sullivan ]
Tim West
Ranch Hand

Joined: Mar 15, 2004
Posts: 539
Hmm, this is out of my depth now :-)

To confirm your ideas on packets, you might like to use Ethereal (or something similar) to see if there are any obvious differences between the way Java is doing TCP/IP as compared to your web browser.

Also, did you play with the size of the buffer in the BufferedReader? Make it 8 MB and see if you get speeds comparable with the browser.

Anyway, what I'm writing now is speculation more than well-founded advice, so take it at your peril. Would be interesting to know what the cause of all this is, though.

Whoa, just a thought after all this - should we be using BufferedInputStream, not BufferedReader? I would think we want buffering as "close to the network" as possible. Erm, maybe someone else can comment. I dunno what the relative merits of a BufferedReader vs. a BufferedInputStream are (I mean, besides the obvious).
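A sketch of the BufferedInputStream variant - buffering the raw bytes and only converting to characters once, at the end. Names and the URL are illustrative, and the final String conversion assumes the platform default charset:

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

public class ByteFetch {
    // Read every byte from the stream, buffering at the byte level
    static byte[] readAllBytes(InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf, 0, buf.length)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] page = readAllBytes(new URL("http://example.com/").openStream());
        System.out.println(new String(page)); // convert to chars once, at the end
    }
}
```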


--Tim
[ July 01, 2004: Message edited by: Tim West ]
Stan James
(instanceof Sidekick)
Ranch Hand

Joined: Jan 29, 2003
Posts: 8791
Your code will likely be much faster than the Internet. I have a little program that downloads files and shows the bytes per second after every 1k bytes. I can run one thread or five, and the BPS is the same for each. My code is not the bottleneck. If I had a need for 50 threads, it might be.


A good question is never answered. It is not a bolt to be tightened into place but a seed to be planted and to bear more seed toward the hope of greening the landscape of the idea. - John Ciardi
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12788
BufferedReader.readLine()
That has a huge overhead - converting a byte stream to characters, building a line, and finally converting to a String.
For speed:
1. Never convert to characters - stay with bytes.
2. Start with a monstrous byte[] and read directly into it - probably with the read(buf, offset, length) method, where length is the result of calling available().
Or you might use the ServletInputStream readLine(buf, off, length) method, which will return -1 at EOF and will let you count lines.
Bill
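A sketch of Bill's byte-only approach, reading straight into one large array with read(buf, offset, length). The URL and buffer size are illustrative; note this version loops on read() rather than sizing each read with available(), since available() only reports what is ready right now:

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

public class RawRead {
    // Fill buf directly from the stream with read(buf, off, len),
    // avoiding any character conversion; returns the byte count read.
    static int readInto(InputStream in, byte[] buf) throws IOException {
        int off = 0;
        int n;
        while (off < buf.length
                && (n = in.read(buf, off, buf.length - off)) != -1) {
            off += n;
        }
        return off;
    }

    public static void main(String[] args) throws IOException {
        byte[] buf = new byte[1 << 20]; // a "monstrous" 1 MB buffer
        InputStream in = new URL("http://example.com/").openStream();
        try {
            System.out.println("read " + readInto(in, buf) + " bytes");
        } finally {
            in.close();
        }
    }
}
```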
Tim West
Ranch Hand

Joined: Mar 15, 2004
Posts: 539
Hmm, so based on William's post, using a BufferedInputStream over a BufferedReader is definitely a good thing - you get the advantages of buffering without the overhead of character/String conversion.



--Tim
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12788
Right - but I think that reading directly from the ServletInputStream would be best. Remember, the operating system TCP/IP stack already has a buffer to hold a packet (or maybe more than one) - there is no need to introduce another buffer; just grab the bytes as they become available.
Bill
     