Recently I have been asked an interview question on how to do a somewhat elaborated multithreaded search. (Due to NDA I can't say too much about it though)
So what I did was divide the tasks into chunks and try and load balance them before throwing them to a executor service.
I was asked to optimize it a little further. He gave me a hint that "when you are reading from a disk, the thread is doing what" I answered "waiting", but then I wasn't able to go any further based on that hint.
I am pretty sure I am missing something... Help? (The interview is over but I am still wondering)
read data from disk
read data from disk
While the read operation is happening, your code is just stuck waiting. It's not executing. It's just waiting for the next batch of data to come down the wire.
And while you're processing the data, the I/O system is doing nothing. It's just waiting for the next request to read or write something.
Imagine you have to walk from your desk to the mail room, carry a box of papers back to your desk, and then sign each paper. It takes you 5 minutes to walk to and from the mail room to transport a box of papers, and 5 minutes to sign all the papers in the box. You can't transport while you're signing and you can't sign while you're transporting.
Wouldn't it be nice if there were two of you--one to just go back and forth carrying boxes and stacking them on your desk and one to sign all the papers in a box? The papers would be moving in a steady stream, never just sitting motionless while you sign, and the documents would constantly be getting signed, never just waiting for more to arrive at your desk.
Of course, if the time to transport a box isn't the same as the time to process the papers in it, there will be some down time for one worker or the other, but the basic idea is that now at least some of the time there can be both transporting and signing happening simultaneously, whereas when there's just one of you, it's impossible.
Joined: Aug 25, 2006
Does it mean I need to allocate more threads than number of cores? That's it?
(If that's the case it's really unforgivable for me to miss it... I just need to pass higher value to the executor service?)
Peter Hsu wrote:Does it mean I need to allocate more threads than number of cores? That's it?
Not necessarily more threads than cores, no. The point is to have one or more threads to do the fetching and one or more to do the processing. How many of each depends on many factors.
If your problem is heavily I/O bound, then adding more threads for processing won't really help. One each might be enough. But if that I/O comes from multiple different sources--different disks on different controllers, or some from files and some from network, or from multiple network sites--then you might want several I/O threads. And if you can keep enough I/O threads busy that your one processing thread can't keep up, then you might add processing thread up to (but not more than) the number of cores, until it balances out.
Again though, the main point is to have separate threads for I/O vs. processing.