This week's book giveaway is in the Cloud/Virtualizaton forum.
We're giving away four copies of Mesos in Action and have Roger Ignazio on-line!
See this thread for details.
Win a copy of Mesos in Action this week in the Cloud/Virtualizaton forum!
  • Post Reply
  • Bookmark Topic Watch Topic
  • New Topic

Dividing a large list of items between threads

 
Mike Curwen
Ranch Hand
Posts: 3695
IntelliJ IDE Java Ubuntu
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I'm looking for advice on the best approach to a problem like the following. I have an object that is responsible for retrieving rows from a database and doing quite a bit of processing for each entry. It also does further queries for each row, does a bit of templating and merging, potentially generating an email. Right now, the object just takes the selection criteria in the form of a key; work on all the rows that have 'this' value in the status field.



What if my rowcount is in the thousands? Right now it's just hundreds, and a single thread takes an acceptable amount of time. Eventually, I'm going to click a button, and it will take all night. So I want to multithread this...

My questions is.. where do the threads go?

1)
Should I produce a 'Manager' object? Move the overall query from the existing class into this manager object, turn it into a SELECT COUNT(*) rather than SELECT *, have it divide by the 'maxThreads', and then instantiate 'maxThreads' instances of my (now slightly modified) processing object. Each modified object would then work on its own subset of rows. The only change in the existing processor object (I think) is the method signature would be two long ids, rather than one and the query would change from SELECT * where status=id to SELECT * where row_id between first_id AND last_id. Otherwise, all the code remains the same.

2)
Forget the 'Manager' and simply implement threads within the existing class?

Is there a best practice here? Is there a better idea?



hmm... I just thought better of this. While currently, my rows are pretty much always in a contiguous block, this is not a guarantee. And the way MySql does LIMIT is not safe either:
If you use LIMIT # with ORDER BY, MySQL will end the sorting as soon as it has found the first # lines instead of sorting the whole table.


So I think I need to retrieve all the PK's and stuff them into a List. Then send sublists into the processor object. Yes?
[ December 31, 2003: Message edited by: Mike Curwen ]
 
David Weitzman
Ranch Hand
Posts: 1365
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
In an ideal world you would be able to isolate the blocking operations and just do those in a thread pool, but alas it sounds like your code will be doing a lot of networking and other blocking operations all the way through.
I vote #2.
When I'm converting single-threaded code to multi-threaded code, I like to start out as simple as possible and keep my code looking approximately the same.

I figure fewer changes decrease the chances of introducing bugs into already working code. No counting or clever allocating necessary. The database is probably also optimized for lots of accesses to the same region. But either way, it's the simplest thing that could possibly work and it's a good way to go about doing things if you're using Test Driven Development.
[ December 31, 2003: Message edited by: David Weitzman ]
 
Stan James
(instanceof Sidekick)
Ranch Hand
Posts: 8791
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I like Dave's idea. To describe it in one sentence, I'd use workflow words and say: Create a single queue of work that needs done and any number of threads that pull the "next" piece of work from the queue. If you need to generalize the queue to hold different types of work, consider putting "command" objects in the queue instead of simple data structures. Sounds like fun!
 
Mike Curwen
Ranch Hand
Posts: 3695
IntelliJ IDE Java Ubuntu
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Ok, so I've got my head around that first idea. (don't split the tasks, let the threads 'naturally' split them).

So something like:

I had thought to use Doug Lea's package, specifically PooledExecutor. I'd do something like:
So this would allow 'natural' splitting of the large task between a number of threads. But the problem is in the pool.execute(FOO).

All the PooledExecutor gives me is a bunch of 'dumb threads', that need to be told what to do (given a command object, FOO). Except my particular FOO object contains a number of non-lightweight components, and I'd want to re-use them. The things like: manager object, manipulator object, template merger, etc.

So what I really want is a 'smart' PooledExecutor or perhaps I should say PooledCommand ?

From Doug Lea's package, the FJTask and FJTaskRunner seem a closer fit. In particular, the concept of one task thread 'stealing' work from another threads' queue is great. But then doesn't this lead me back to packaging out the large task into discreet amounts, separate queues for separate threads... something I might try to avoid, if I could?

Am I stretching myself in the wrong directions?
[ January 05, 2004: Message edited by: Mike Curwen ]
 
David Weitzman
Ranch Hand
Posts: 1365
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
This pattern will be a little more complex to implement because, as you're noticed, you can't use local variables to hold objects for reuse in each thread.
You can, however, use ThreadLocal variables to accomplish the same thing. If you use setThreadFactory() to assign your PooledExecutor a custom ThreadFactory, you can wrap the passed Runnable inside another Runnable that builds the desired objects and stores them in ThreadLocals. You can also rely on the ThreaLocal initialValue.
However this doesn't look like a textbook case where a thread pool is necessary. Work doesn't need to be queued as it's found -- there's a constant stream of it. Only real advantage I see to a thread pool is that you don't have to worry about uncaught RuntimeExceptions here or there. Typically people use thread pools when they'd like to use an infinate number of threads but can't actually spare that many (i.e. thread per connection to a server). Here you're using a fixed number -- the maximum the pool will support. There shouldn't been any need to set keep alive time or a minimum pool size. Just launching a fixed number of threads would work equally well.
[ January 05, 2004: Message edited by: David Weitzman ]
[ January 05, 2004: Message edited by: David Weitzman ]
 
Stan James
(instanceof Sidekick)
Ranch Hand
Posts: 8791
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Interesting. I was thinking of casting the control the other way. So the runnable object would say:

and the main program would say:

Does that make sense? This eliminates the thread pool idea, but does not support keeping idle threads around between batches of commands. When you empty the queue, the threads all go away.
 
  • Post Reply
  • Bookmark Topic Watch Topic
  • New Topic