I have a project which requires (I think) a servlet that runs forever (sort of). It needs to go out and fetch the LAST DATE MODIFIED from the web pages of our intranet site, say at 12 midnight every night, and see whether any pages are older than 30 days. Is that at all possible, and if it is, how would it impact the performance of the web server (WebSphere 3.5) with other applications on it? And how could it be implemented? Thanks in advance.
You want a separate class (not a servlet) with its own Thread to do this. You can use a servlet to get it started, to check on its status, and maybe to pick up the report it generates. To keep the impact on the server minimal, just give the Thread the lowest priority (Thread.MIN_PRIORITY). Obviously this utility class should be designed with a "Singleton" pattern. Bill
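To make the suggestion above concrete, here is a minimal sketch of a singleton with its own low-priority thread. The class name PageAgeChecker, its status strings, and the start/getStatus/awaitCompletion methods are all made up for illustration, and the actual page scan is left as a placeholder. It also assumes a newer JDK than the one that ships with WebSphere 3.5.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical background checker: one shared instance, one low-priority thread. */
final class PageAgeChecker {
    private static final PageAgeChecker INSTANCE = new PageAgeChecker();

    // Status string that a servlet could read and display.
    private final AtomicReference<String> status = new AtomicReference<>("idle");
    private Thread worker; // guarded by "this"

    private PageAgeChecker() {}

    static PageAgeChecker getInstance() {
        return INSTANCE;
    }

    /** Starts a run unless one is already in progress; returns true if a thread was started. */
    synchronized boolean start() {
        if (worker != null && worker.isAlive()) {
            return false;
        }
        worker = new Thread(this::doWork, "page-age-checker");
        worker.setPriority(Thread.MIN_PRIORITY); // lowest priority, as suggested above
        worker.setDaemon(true);                  // do not block server shutdown
        status.set("running");
        worker.start();
        return true;
    }

    String getStatus() {
        return status.get();
    }

    /** Waits up to the given time for the current run; returns true if no run is left active. */
    boolean awaitCompletion(long millis) {
        Thread t;
        synchronized (this) {
            t = worker;
        }
        if (t == null) {
            return true;
        }
        try {
            t.join(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !t.isAlive();
    }

    private void doWork() {
        // Placeholder: the real page scan and report generation would go here.
        status.set("finished");
    }
}
```

A servlet's init() or doGet() could call PageAgeChecker.getInstance().start() and show getStatus() in its response. Keep in mind that thread priority is only a hint to the scheduler, so how much it helps depends on the JVM and operating system.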
I think what you are describing is really a special case of a more general "web portal" or "smart cache". Seen in that light, there are a few important aspects of the problem that you should also consider.

The first important issue is scaling. You mention that your scheduled process should run at midnight, so it's not much use if it takes 12 hours to run! To decide how to fetch the last-modified dates, consider how many pages you have to examine, how your software knows which URLs to look at, how long each one might take to return the information, and what to do if any of the pages are unavailable or slow.

Do you plan to keep a separate list of URLs to query? If so, how will that list be derived: manually, or via some sort of "web-spider"? If you don't have such a list, then your little "last modified" engine will also be responsible for fetching and parsing all the pages and extracting the links to the others. This could really slow it down and add dangerous complexity.

However you design your fetching process, the majority of its time will probably be spent waiting for remote servers to respond. To speed up the overall process, consider running multiple "fetch" threads at once. Several threads waiting don't take much more CPU horsepower than one thread waiting. Running parallel threads really helps if any of the URLs are unavailable or slow: waiting for a few potentially long timeouts can cause big problems in a single-threaded solution, whereas losing one thread for a bit while the others keep working is only a minor issue.

I'm always wary of the Singleton pattern, as it can be very limiting if used carelessly. I would probably consider a more flexible solution (some sort of "fetcher factory", or a pool of worker threads maybe).
And don't forget to make sure that whatever Collection class you are using to gather the results is thread-safe enough to allow multiple fetchers to populate it in parallel.
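As a rough illustration of the worker-pool idea with a thread-safe result collection, here is a sketch that assumes you already have a list of URLs and a modern JDK (java.util.concurrent did not exist in the WebSphere 3.5 era). The names StalePageScanner, findStale and httpLastModified, the 5-second timeouts, and the 10-minute overall limit are all invented for the example. Pages whose date cannot be determined are simply skipped here. A real tool would probably report them separately.

```java
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.ToLongFunction;

/** Hypothetical scanner: several fetch threads fill one thread-safe queue. */
final class StalePageScanner {
    static final long MAX_AGE_MILLIS = TimeUnit.DAYS.toMillis(30);

    /** HEAD request with timeouts; returns 0 if the date is unknown or the page is unreachable. */
    static long httpLastModified(String url) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setRequestMethod("HEAD");   // headers only, no page body
            conn.setConnectTimeout(5_000);   // assumed limits so one slow
            conn.setReadTimeout(5_000);      // server cannot stall a worker
            try {
                return conn.getLastModified(); // 0 when the header is absent
            } finally {
                conn.disconnect();
            }
        } catch (Exception e) {
            return 0L;
        }
    }

    /** Returns the URLs whose last-modified time is more than 30 days before nowMillis. */
    static Queue<String> findStale(List<String> urls, ToLongFunction<String> lookup,
                                   long nowMillis, int threads) {
        Queue<String> stale = new ConcurrentLinkedQueue<>(); // safe for parallel adds
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String url : urls) {
            pool.execute(() -> {
                long modified = lookup.applyAsLong(url);
                if (modified > 0 && nowMillis - modified > MAX_AGE_MILLIS) {
                    stale.add(url);
                }
            });
        }
        pool.shutdown(); // accept no new tasks, let queued ones finish
        try {
            if (!pool.awaitTermination(10, TimeUnit.MINUTES)) {
                pool.shutdownNow();
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
        }
        return stale;
    }
}
```

A nightly run might call StalePageScanner.findStale(urls, StalePageScanner::httpLastModified, System.currentTimeMillis(), 8). Passing the lookup in as a parameter keeps the age check testable without touching the network, and the thread count can be tuned to how slow the slowest servers are.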