
Invite your threading advice

 
Julia Reynolds
Ranch Hand
Posts: 123
I've written a Java webscraper. For each lookup key (currently 500-1000 at a time) it uses the Apache commons httpclient to make a connection and retrieve the page. Then I parse the html for the desired data and pop it into a hashtable, using the original key as the hash key.

After all data is fetched, I iterate through it and update the database with the fetched values.

Here's my question: Should my scraper utilize threads to increase performance? If so, what are the benefits/pitfalls of threading here?

Generally, what indicates that threading should be implemented, especially when writing a utility program such as this one?

Thanks for your time,

Julia
 
(instanceof Sidekick)
Posts: 8791
Parallel threads can speed up a process that spends time waiting for something outside the JVM to happen. Retrieving a web page is a good example. Your thread executes zero instructions while waiting for a response over the network, so another thread has a good opportunity to run. At the opposite end, a process that is CPU bound, maybe doing some deep math, gains little from having more threads than CPU cores, because the CPU just doesn't have time to run anything else.

So with that in mind, how many threads? Good question! You could try adding more and more until you saturate the CPU or the network, then back off a bit so you're not making the JVM work so hard on thread management that it can't do your real work. The whole 1,000 would probably be a Bad Thing.
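One common rule of thumb for picking a first number to try (it is not from this thread, and it only gives a starting point for the add-and-back-off tuning described above) is cores * (1 + wait time / compute time). A minimal sketch follows; the class and method names are made up for illustration, and the two timings are values you would have to measure for your own pages:

```java
final class PoolSizing {

    /**
     * Suggests a starting pool size: cores * (1 + wait / compute).
     * waitMillis is the time a task spends blocked (for example on the network),
     * computeMillis is the time it spends actually using the CPU.
     */
    static int suggestedThreads(int cores, double waitMillis, double computeMillis) {
        if (cores < 1 || waitMillis < 0 || computeMillis <= 0) {
            throw new IllegalArgumentException("cores >= 1, wait >= 0, compute > 0 required");
        }
        double size = cores * (1.0 + waitMillis / computeMillis);
        // Clamp to a sane int range; never suggest fewer than one thread.
        return (int) Math.max(1, Math.min(Integer.MAX_VALUE, Math.round(size)));
    }
}
```

For example, `PoolSizing.suggestedThreads(Runtime.getRuntime().availableProcessors(), 450, 50)` would suggest ten threads per core for a task that waits 450 ms on the network and parses for 50 ms. Treat the result as a first guess and adjust it by measuring.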

How do you control the number of threads and know when they're done? The number is easy. If you're in JDK5 look at thread pooling with the Executor framework in java.util.concurrent (ExecutorService and the Executors factory class). In earlier JDKs get another thread pool, maybe from Apache Commons. Both are pretty easy to use. You can put your 1,000 requests into a queue as Commands and know you are done when the queue is empty. Hmm, not quite, you'd only know when all the commands have been picked up and started, not finished. Any other ideas on how to know when you're done? The number of values in the map equals the number of commands? That might hang forever if one command threw an exception and never put a result in the map.
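One possible answer to the "how do you know when you're done" question is to keep the Future that the pool returns for each command and wait on every one of them. A failed command then surfaces as an ExecutionException instead of a missing map entry. This is only a sketch under some assumptions: it needs Java 8 or later for the lambda, and `PooledScraper`, `Outcome`, `scrapeAll` and the `fetchAndParse` function are hypothetical names standing in for the real HttpClient and parsing code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class PooledScraper {

    /** Parsed values and failures, both keyed by the original lookup key. */
    static final class Outcome {
        final Map<String, String> values = new LinkedHashMap<>();
        final Map<String, Throwable> failures = new LinkedHashMap<>();
    }

    /**
     * Runs fetchAndParse once per key on a fixed-size pool and returns only
     * after every task has either produced a value or failed.
     */
    static Outcome scrapeAll(List<String> keys, int threads,
                             Function<String, String> fetchAndParse)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit everything up front; the pool queues what it cannot run yet.
            List<Future<String>> futures = new ArrayList<>();
            for (String key : keys) {
                Callable<String> task = () -> fetchAndParse.apply(key);
                futures.add(pool.submit(task));
            }
            // Only this thread touches the maps, so they need no locking.
            Outcome outcome = new Outcome();
            for (int i = 0; i < keys.size(); i++) {
                try {
                    // get() blocks until that particular task has finished.
                    outcome.values.put(keys.get(i), futures.get(i).get());
                } catch (ExecutionException e) {
                    // The task threw; record why instead of waiting forever.
                    outcome.failures.put(keys.get(i), e.getCause());
                }
            }
            return outcome;
        } finally {
            pool.shutdown();
        }
    }
}
```

When `scrapeAll` returns, `values.size() + failures.size()` equals the number of keys, so the database update can start without guessing whether work is still in flight.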
 
Greenhorn
Posts: 3
Hi Julia,
Why don't you think of doing something like this?
As soon as you have parsed a page and found the required data, write it to a logical queue that is a synchronized resource. Another thread can read the queue and update the database. I think you will achieve good performance with this kind of system.

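import java.util.concurrent.BlockingQueue;

// Producer: reads one page, extracts the data, and queues it for the database.
class PageReader {
    private final BlockingQueue<String[]> queue;

    PageReader(BlockingQueue<String[]> queue) { // constructor
        this.queue = queue;
    }

    String readPage(String key) {
        // read and parse whatever
        return "";
    }

    String findData(String html) {
        // find your stuff
        return html;
    }

    void writeToQueue(String key, String data) throws InterruptedException {
        // write your stuff to the queue; a BlockingQueue is one thread-safe choice
        queue.put(new String[] {key, data});
    }
}

// Consumer: runs on its own thread and takes results off the queue.
class DBUpdate implements Runnable {
    private final BlockingQueue<String[]> queue;

    DBUpdate(BlockingQueue<String[]> queue) {
        this.queue = queue;
    }

    String[] checkForQUpdate() throws InterruptedException {
        // read the queue and find updates; blocks until one is available
        return queue.take();
    }

    void writeToDb(String[] update) {
        // write your stuff to db
    }

    public void run() {
        try {
            while (true) {
                writeToDb(checkForQUpdate());
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // interrupted: stop updating
        }
    }
}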

Have a thread service coordinate the two activities. The synchronized resource they share is the queue.
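A rough sketch of how such a service might tie the readers and the database writer together, including a way for the writer to learn that no more results are coming. Everything here is an assumption made for illustration: the names `QueuePipeline`, `run` and `fetchAndParse`, the use of a sentinel row as the stop signal, and the in-memory list that stands in for the database. It assumes Java 8 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

final class QueuePipeline {

    // Sentinel telling the writer that no more rows are coming. It is
    // compared by identity, so a real row can never be mistaken for it.
    private static final String[] DONE = new String[0];

    /**
     * Fetches every key on a pool of reader threads while a single writer
     * thread drains the queue. Returns the {key, value} rows the writer saw.
     */
    static List<String[]> run(List<String> keys, int readers,
                              Function<String, String> fetchAndParse)
            throws InterruptedException {
        BlockingQueue<String[]> queue = new LinkedBlockingQueue<>();
        List<String[]> written = new ArrayList<>(); // stand-in for the database

        Thread writer = new Thread(() -> {
            try {
                for (String[] row = queue.take(); row != DONE; row = queue.take()) {
                    written.add(row); // a real writer would issue an UPDATE here
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "db-writer");
        writer.start();

        ExecutorService pool = Executors.newFixedThreadPool(readers);
        for (String key : keys) {
            // A key whose fetch throws never reaches the queue and is skipped.
            pool.execute(() -> queue.add(new String[] {key, fetchAndParse.apply(key)}));
        }
        pool.shutdown();                          // accept no new work
        pool.awaitTermination(1, TimeUnit.HOURS); // wait for the readers to finish
        queue.put(DONE);                          // now the writer may stop
        writer.join();                            // makes 'written' safe to read here
        return written;
    }
}
```

The database work overlaps with the fetching instead of waiting for the last page, and only one thread ever touches the database connection. The one-hour timeout is arbitrary; a real program would check the boolean that `awaitTermination` returns.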
 
Julia Reynolds
Ranch Hand
Posts: 123
Thanks Stan and Mr. Bird for the good advice. I'll get to work on version two of my web scraper.

Julia
 