Hi All,
I am starting to write a small crawler and wanted to get some advice on a design issue.
More specifically, the crawler is supposed to visit weather sites and extract weather data. The basic idea is to fetch a weather site's main page, extract the list of cities from it, and then for each city fetch its page and extract that specific city's weather data.
I am interested in making the crawler as asynchronous as possible, meaning it should never wait on a blocking function. The basic design is as follows:
Have a thread pool of workers.
Each worker handles an async task which never blocks.
First task: "Download main site page, and put second task in queue".
Second task: "If page has been downloaded, parse the list of cities and for each city put third task in queue. If page has not been downloaded yet, return to queue".
Third task: "Download the city's page, and put fourth task in queue".
Fourth task: "If city's page has been downloaded, parse its weather data and put it in a data structure. If the page hasn't been downloaded yet, return to queue."
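To make the four tasks concrete, here is a rough sketch of what I have in mind. It is only an illustration under assumptions: startDownload is a hypothetical stand-in that fakes the download with a CompletableFuture (no real network), the class and method names are made up, and the pending counter is just one way to tell when the queue has drained.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class CrawlerSketch {
    private final ExecutorService pool;
    private final Map<String, String> weatherByCity = new ConcurrentHashMap<>();
    private final AtomicInteger pending = new AtomicInteger();      // tasks queued or running
    private final CountDownLatch finished = new CountDownLatch(1);

    public CrawlerSketch(int workers) {
        pool = Executors.newFixedThreadPool(workers);
    }

    // Hypothetical stand-in for a non-blocking download; returns canned data.
    static CompletableFuture<String> startDownload(String url) {
        if (url.equals("main")) {
            return CompletableFuture.supplyAsync(() -> "london,paris,tokyo");
        }
        return CompletableFuture.supplyAsync(() -> "weather for " + url);
    }

    // Queues a task and tracks it so we know when all work is done.
    private void submit(Runnable task) {
        pending.incrementAndGet();
        pool.execute(() -> {
            task.run();
            if (pending.decrementAndGet() == 0) finished.countDown();
        });
    }

    // First task: start the main page download, then queue the second task.
    private void fetchMainPage() {
        CompletableFuture<String> page = startDownload("main");
        submit(() -> parseCities(page));
    }

    // Second task: parse the city list if the page is ready, otherwise return to queue.
    private void parseCities(CompletableFuture<String> page) {
        if (!page.isDone()) { submit(() -> parseCities(page)); return; }
        for (String city : page.join().split(",")) {
            submit(() -> fetchCityPage(city));
        }
    }

    // Third task: start the city page download, then queue the fourth task.
    private void fetchCityPage(String city) {
        CompletableFuture<String> page = startDownload(city);
        submit(() -> parseWeather(city, page));
    }

    // Fourth task: store the weather data if the page is ready, otherwise return to queue.
    private void parseWeather(String city, CompletableFuture<String> page) {
        if (!page.isDone()) { submit(() -> parseWeather(city, page)); return; }
        weatherByCity.put(city, page.join());
    }

    // Runs the whole pipeline; only this caller thread waits, the workers never do.
    public Map<String, String> run() {
        submit(this::fetchMainPage);
        try {
            finished.await(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        pool.shutdown();
        return weatherByCity;
    }

    public static void main(String[] args) {
        System.out.println(new CrawlerSketch(4).run());
    }
}
```

One thing I notice when writing it down: returning a not-ready task to the queue is effectively polling, so a worker can spin on the same task while the download is in flight. Attaching a callback to the future (for example with thenAccept) might be an alternative to re-queueing.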
Of course some failure and timeout mechanisms should be implemented, but they aren't relevant yet.
This design should give maximum CPU utilization with as little waiting as possible.
I thought of using the Java NIO package, with a SocketChannel and a Selector that will tell me when the page is ready. But what is happening under the SocketChannel's hood? Where is the downloading mechanism actually carried out?
If the HTTP call is carried out somewhere under the OS's responsibility, everything is fine. The JVM is free for the next task.
But if the JVM itself divides the HTTP request into TCP packets and handles the entire flow in the TCP layer, things are much more complicated. In order to achieve better utilization I would have to handle it myself, including dividing the request into packets, carrying out the negotiation part, sending packets, receiving ACKs, receiving data and sending ACKs, rebuilding packets and closing the connection.
So the question is: how exactly does the JVM work here? Is it a good idea to treat the NIO flow, which works above the TCP layer, as a black box, or should I look at it in finer resolution?
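For reference, this is roughly the SocketChannel and Selector flow I mean. It is a sketch under assumptions, not a finished design: the class name, fetch, startLocalServer and demo are all made up for illustration, the local server is only a stand-in for a weather site so the example runs without a network, and a real crawler would register many channels on one selector instead of one.

```java
import java.io.BufferedReader;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;

public class NioFetchSketch {
    // Fetches one page over a non-blocking channel; the selector reports readiness.
    public static String fetch(String host, int port, String path) throws IOException {
        try (Selector selector = Selector.open();
             SocketChannel channel = SocketChannel.open()) {
            channel.configureBlocking(false);
            // connect() may complete immediately for local connections.
            boolean connected = channel.connect(new InetSocketAddress(host, port));
            channel.register(selector,
                    connected ? SelectionKey.OP_WRITE : SelectionKey.OP_CONNECT);

            ByteBuffer request = StandardCharsets.US_ASCII.encode(
                    "GET " + path + " HTTP/1.1\r\nHost: " + host
                    + "\r\nConnection: close\r\n\r\n");
            ByteBuffer buffer = ByteBuffer.allocate(8192);
            ByteArrayOutputStream response = new ByteArrayOutputStream();

            while (true) {
                selector.select(); // the only place this thread waits
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = it.next();
                    it.remove();
                    if (key.isConnectable() && channel.finishConnect()) {
                        key.interestOps(SelectionKey.OP_WRITE);
                    } else if (key.isWritable()) {
                        channel.write(request);
                        if (!request.hasRemaining()) key.interestOps(SelectionKey.OP_READ);
                    } else if (key.isReadable()) {
                        buffer.clear();
                        int n = channel.read(buffer);
                        if (n < 0) { // peer closed the connection: response is complete
                            return new String(response.toByteArray(), StandardCharsets.UTF_8);
                        }
                        response.write(buffer.array(), 0, n);
                    }
                }
            }
        }
    }

    // Hypothetical stand-in for a weather site: serves one canned response on loopback.
    public static int startLocalServer(String body) throws IOException {
        ServerSocket server = new ServerSocket(0, 1, InetAddress.getByName("127.0.0.1"));
        Thread t = new Thread(() -> {
            try (ServerSocket s = server; Socket client = s.accept()) {
                BufferedReader in = new BufferedReader(new InputStreamReader(
                        client.getInputStream(), StandardCharsets.US_ASCII));
                String line;
                while ((line = in.readLine()) != null && !line.isEmpty()) { } // skip request headers
                client.getOutputStream().write(
                        ("HTTP/1.1 200 OK\r\nConnection: close\r\n\r\n" + body)
                                .getBytes(StandardCharsets.US_ASCII));
            } catch (IOException ignored) { }
        });
        t.setDaemon(true);
        t.start();
        return server.getLocalPort();
    }

    // Runs the fetch against the local stand-in server.
    public static String demo() {
        try {
            return fetch("127.0.0.1", startLocalServer("sunny"), "/");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

At this level the API only exposes a byte stream: the code writes the HTTP request bytes itself and reads the response bytes back, and there is no call for sending individual packets or ACKs.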
Thanks,
Guy