Granny's Programming Pearls
"inside of every large program is a small program struggling to get out"
JavaRanch.com/granny.jsp
The moose likes Threads and Synchronization and the fly likes Dividing a large list of items between threads Big Moose Saloon
  Search | Java FAQ | Recent Topics | Flagged Topics | Hot Topics | Zero Replies
Register / Login


Win a copy of Murach's Java Servlets and JSP this week in the Servlets forum!
JavaRanch » Java Forums » Java » Threads and Synchronization
Bookmark "Dividing a large list of items between threads" Watch "Dividing a large list of items between threads" New topic
Author

Dividing a large list of items between threads

Mike Curwen
Ranch Hand

Joined: Feb 20, 2001
Posts: 3695

I'm looking for advice on the best approach to a problem like the following. I have an object that is responsible for retrieving rows from a database and doing quite a bit of processing for each entry. It also does further queries for each row, does a bit of templating and merging, potentially generating an email. Right now, the object just takes the selection criteria in the form of a key; work on all the rows that have 'this' value in the status field.



What if my rowcount is in the thousands? Right now it's just hundreds, and a single thread takes an acceptable amount of time. Eventually, I'm going to click a button, and it will take all night. So I want to multithread this...

My questions is.. where do the threads go?

1)
Should I produce a 'Manager' object? Move the overall query from the existing class into this manager object, turn it into a SELECT COUNT(*) rather than SELECT *, have it divide by the 'maxThreads', and then instantiate 'maxThreads' instances of my (now slightly modified) processing object. Each modified object would then work on its own subset of rows. The only change in the existing processor object (I think) is the method signature would be two long ids, rather than one and the query would change from SELECT * where status=id to SELECT * where row_id between first_id AND last_id. Otherwise, all the code remains the same.

2)
Forget the 'Manager' and simply implement threads within the existing class?

Is there a best practice here? Is there a better idea?



hmm... I just thought better of this. While currently, my rows are pretty much always in a contiguous block, this is not a guarantee. And the way MySql does LIMIT is not safe either:
If you use LIMIT # with ORDER BY, MySQL will end the sorting as soon as it has found the first # lines instead of sorting the whole table.


So I think I need to retrieve all the PK's and stuff them into a List. Then send sublists into the processor object. Yes?
[ December 31, 2003: Message edited by: Mike Curwen ]
David Weitzman
Ranch Hand

Joined: Jul 27, 2001
Posts: 1365
In an ideal world you would be able to isolate the blocking operations and just do those in a thread pool, but alas it sounds like your code will be doing a lot of networking and other blocking operations all the way through.
I vote #2.
When I'm converting single-threaded code to multi-threaded code, I like to start out as simple as possible and keep my code looking approximately the same.

I figure fewer changes decrease the chances of introducing bugs into already working code. No counting or clever allocating necessary. The database is probably also optimized for lots of accesses to the same region. But either way, it's the simplest thing that could possibly work and it's a good way to go about doing things if you're using Test Driven Development.
[ December 31, 2003: Message edited by: David Weitzman ]
Stan James
(instanceof Sidekick)
Ranch Hand

Joined: Jan 29, 2003
Posts: 8791
I like Dave's idea. To describe it in one sentence, I'd use workflow words and say: Create a single queue of work that needs done and any number of threads that pull the "next" piece of work from the queue. If you need to generalize the queue to hold different types of work, consider putting "command" objects in the queue instead of simple data structures. Sounds like fun!


A good question is never answered. It is not a bolt to be tightened into place but a seed to be planted and to bear more seed toward the hope of greening the landscape of the idea. John Ciardi
Mike Curwen
Ranch Hand

Joined: Feb 20, 2001
Posts: 3695

Ok, so I've got my head around that first idea. (don't split the tasks, let the threads 'naturally' split them).

So something like:

I had thought to use Doug Lea's package, specifically PooledExecutor. I'd do something like:
So this would allow 'natural' splitting of the large task between a number of threads. But the problem is in the pool.execute(FOO).

All the PooledExecutor gives me is a bunch of 'dumb threads', that need to be told what to do (given a command object, FOO). Except my particular FOO object contains a number of non-lightweight components, and I'd want to re-use them. The things like: manager object, manipulator object, template merger, etc.

So what I really want is a 'smart' PooledExecutor or perhaps I should say PooledCommand ?

From Doug Lea's package, the FJTask and FJTaskRunner seem a closer fit. In particular, the concept of one task thread 'stealing' work from another threads' queue is great. But then doesn't this lead me back to packaging out the large task into discreet amounts, separate queues for separate threads... something I might try to avoid, if I could?

Am I stretching myself in the wrong directions?
[ January 05, 2004: Message edited by: Mike Curwen ]
David Weitzman
Ranch Hand

Joined: Jul 27, 2001
Posts: 1365
This pattern will be a little more complex to implement because, as you're noticed, you can't use local variables to hold objects for reuse in each thread.
You can, however, use ThreadLocal variables to accomplish the same thing. If you use setThreadFactory() to assign your PooledExecutor a custom ThreadFactory, you can wrap the passed Runnable inside another Runnable that builds the desired objects and stores them in ThreadLocals. You can also rely on the ThreaLocal initialValue.
However this doesn't look like a textbook case where a thread pool is necessary. Work doesn't need to be queued as it's found -- there's a constant stream of it. Only real advantage I see to a thread pool is that you don't have to worry about uncaught RuntimeExceptions here or there. Typically people use thread pools when they'd like to use an infinate number of threads but can't actually spare that many (i.e. thread per connection to a server). Here you're using a fixed number -- the maximum the pool will support. There shouldn't been any need to set keep alive time or a minimum pool size. Just launching a fixed number of threads would work equally well.
[ January 05, 2004: Message edited by: David Weitzman ]
[ January 05, 2004: Message edited by: David Weitzman ]
Stan James
(instanceof Sidekick)
Ranch Hand

Joined: Jan 29, 2003
Posts: 8791
Interesting. I was thinking of casting the control the other way. So the runnable object would say:

and the main program would say:

Does that make sense? This eliminates the thread pool idea, but does not support keeping idle threads around between batches of commands. When you empty the queue, the threads all go away.
 
It is sorta covered in the JavaRanch Style Guide.
 
subject: Dividing a large list of items between threads
 
Similar Threads
Update based on Row id
how to select 2nd,3rd or the Nth highest salary from a table in desc or ascen order?
Design advice on how to implement preferences and language support
Display 20 records per page use MySQL(urgent)
Design Mental Roadblock