Can someone tell me the best approach to splitting a large file into chunks? I was thinking that multithreading would give better performance. This is the approach I have in mind:
Algo: 1. Read the size of the file.
2. Create 'n' threads such that size/n = 250MB.
3. Start all the threads.
4. Each thread reads the content between (IndexOfThread * n) and ((IndexOfThread + 1) * n) and writes it to
a separate file, probably with a naming scheme that assigns an incremental file name at the end.
The algorithm looks OK, but multi-threading will probably not help you a lot for this purpose. The file you are splitting is on one hard disk, and that hard disk can only read from one place at a time. If you have n threads all trying to read from the same large file, they will have to wait for each other anyway.
It would be an interesting experiment to write it the single-threaded way and the multi-threaded way, and compare the results.
In general I agree with Jesper, but it depends on the filesystem. If, indeed, you have a single drive involved (as for most home users), then multithreading will probably be no help at all. But if your filesystem uses data striping (more common in enterprise environments), multiple threads may indeed be beneficial. If you don't know which is the case, you can either learn more about your filesystem, or simply try both ways and measure to see which is faster. If multi-threading does not give you a big, obvious advantage in speed, it's probably not worth the bother.
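For the "try both ways and measure" suggestion, a minimal timing sketch might look like the following. The class name `SplitTimer` and the two placeholder tasks are made up for illustration; plug in your own single-threaded and multi-threaded splitters. Keep in mind that the operating system caches recently read files, so the second run can look faster just because the data is already in memory. Use a fresh copy of the input, or repeat the runs in both orders.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper for comparing two splitting strategies by wall-clock time.
final class SplitTimer {

    /** Runs the task once and returns the elapsed wall-clock time in milliseconds. */
    static long timeMillis(Runnable task) {
        long startNanos = System.nanoTime();
        task.run();
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
    }

    public static void main(String[] args) {
        // Placeholders (assumptions): replace the bodies with calls to your own code.
        Runnable singleThreaded = () -> { /* run the single-threaded splitter here */ };
        Runnable multiThreaded = () -> { /* run the multi-threaded splitter here */ };

        System.out.println("single-threaded: " + timeMillis(singleThreaded) + " ms");
        System.out.println("multi-threaded:  " + timeMillis(multiThreaded) + " ms");
    }
}
```

A single run per variant is only a rough indication; several runs on a file much larger than available RAM give a more trustworthy comparison.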
Vijay Kumar koganti
Ranch Hand
Joined: Jan 23, 2006
Posts: 53
Jesper Young wrote:The algorithm looks OK, but multi-threading will probably not help you a lot for this purpose. The file you are splitting is on one hard disk, and that hard disk can only read from one place at a time. If you have n threads all trying to read from the same large file, they will have to wait for each other anyway.
It would be an interesting experiment to write it the single-threaded way and the multi-threaded way, and compare the results.
What if the file has already been read into memory before the threads start processing? In that case I believe they don't have to wait, do they?
regards,
vijay
Mike Simmons
Ranch Hand
Joined: Mar 05, 2008
Posts: 2816
Well, assuming (as Jesper did) that there's just one disc drive involved... wouldn't each thread then need to wait to write its own data? Is it better to flood the drive with multiple requests to write to different files (given that it can only write to one location in one file at a time), or is it better to just make one request at a time?
Vijay Kumar koganti
Hi all,
Just a small change to the algorithm. I noticed the mistake while trying to implement it.
Correction: the start offset is not (Index * n) but (Index * (size/n)), and the same goes for the end point.
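The corrected offset calculation can be sketched as below. This is only an illustration: the class name `ChunkRanges` and the file size in `main` are made up, and `chunkSize` plays the role of size/n (250 MB in the original plan). The last chunk is simply shorter when the file size is not an exact multiple of the chunk size.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: computes the byte range that each chunk covers.
final class ChunkRanges {

    /** Returns {start, endExclusive} pairs that together cover [0, fileSize). */
    static List<long[]> ranges(long fileSize, long chunkSize) {
        if (fileSize < 0 || chunkSize <= 0) {
            throw new IllegalArgumentException("fileSize must be >= 0 and chunkSize > 0");
        }
        List<long[]> result = new ArrayList<>();
        // Chunk i starts at i * chunkSize; the final chunk is clipped to the file size.
        for (long start = 0; start < fileSize; start += chunkSize) {
            long end = Math.min(start + chunkSize, fileSize);
            result.add(new long[] {start, end});
        }
        return result;
    }

    public static void main(String[] args) {
        long chunkSize = 250L * 1024 * 1024;   // 250 MB, as in the original plan
        long fileSize = 3 * chunkSize + 10;    // made-up size for illustration
        for (long[] r : ranges(fileSize, chunkSize)) {
            System.out.println(r[0] + " .. " + r[1]);
        }
    }
}
```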
regards,
vijay
Vijay Kumar koganti
Mike Simmons wrote:Well, assuming (as Jesper did) that there's just one disc drive involved... wouldn't each thread then need to wait to write its own data?
I don't see any problem with threads writing their own data, as each would be writing to a separate file, though the files might be on the same hard disk.
Besides, I was wondering about Jesper's comment that threads have to wait to read from the same file on the hard disk. Take Windows, for example: if you copy the same file to multiple locations at the same time, wouldn't it take a little less time than copying it to each location one by one? My point is that if the threads have to wait for one another, then the cumulative time would be equivalent to copying the file to multiple locations one by one (not counting the manual delay between copies). Can you please shed more light on this? I am really confused.
regards,
vijay
Mike Simmons
Vijay Kumar koganti wrote:I don't see any problem with threads writing their own data, as each would be writing to a separate file, though the files might be on the same hard disk.
In terms of performance, I also don't see any problem. But neither do I see any advantage. Attempts at concurrency yield little benefit in this case, I think.
In terms of complexity, there may be a problem, because many programmers can't write good concurrent Java code. Maybe you're an exception to this statement, and maybe you can be confident that everyone else who maintains the code is also exceptional in this regard. If that's the case, great. If not... um, remind me, what was the benefit you were hoping to gain from multiple threads? As I understand it, we're just talking about the case with a single disc drive, right? It can only do one thing at a time. Do multiple threads offer any improvement on this? If so, how?
On a normal desktop computer, you (usually) have a single hard disk. That hard disk can only do one thing at a time, for example read or write a block of data.
If you ask the hard disk to read from multiple places in a file at the same time, or ask it to write multiple blocks of data at the same time, as you are planning to do with your multi-threaded copy program, it will perform those tasks sequentially. For example, first it reads block 1 for thread 1, then it reads block 2 for thread 2, and so on, while threads 3 to 10 wait until the hard disk is ready to read their block of data.
It doesn't make much sense to write a multi-threaded program like that, because if the hard disk can do only one thing at a time, all the threads have to wait for one another until the hard disk is done executing their commands.
It doesn't matter if you copy the data to one output file or multiple output files. All those files are still on the same hard disk, which can only do one thing at a time. The total time for writing all the data would be the same, whether you use multiple threads or a single thread.
See the attached diagram. Yellow is the hard disk reading a block of data, red is the hard disk writing a block of data. Reading and writing the data takes the same time in total whether you have multiple threads or not.
Vijay Kumar koganti
Thanks a lot, Jesper, that's very helpful info. So in that case it doesn't make sense to implement my multithreaded algorithm; instead I'll go with one thread that does the reading and writing sequentially.
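For reference, a single-threaded splitter along these lines could be sketched as follows. This is one possible approach, not code from the thread: it uses `FileChannel.transferTo` from java.nio, and the class name `SequentialSplitter` and the `.partN` naming scheme are assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Hypothetical single-threaded splitter: writes one chunk file after another.
final class SequentialSplitter {

    /** Splits source into files named name.part0, name.part1, ... in outDir; returns the part count. */
    static int split(Path source, Path outDir, long chunkSize) throws IOException {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        Files.createDirectories(outDir);
        int parts = 0;
        try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ)) {
            long size = in.size();
            for (long start = 0; start < size; start += chunkSize) {
                long end = Math.min(start + chunkSize, size);
                Path part = outDir.resolve(source.getFileName() + ".part" + parts);
                try (FileChannel out = FileChannel.open(part, StandardOpenOption.CREATE,
                        StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
                    long pos = start;
                    // transferTo may copy fewer bytes than requested, so loop until the chunk is done.
                    while (pos < end) {
                        long copied = in.transferTo(pos, end - pos, out);
                        if (copied <= 0) {
                            throw new IOException("Could not read past offset " + pos);
                        }
                        pos += copied;
                    }
                }
                parts++;
            }
        }
        return parts;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: SequentialSplitter <source-file> <output-dir>");
            return;
        }
        long chunkSize = 250L * 1024 * 1024; // 250 MB, as in the original plan
        int parts = split(Paths.get(args[0]), Paths.get(args[1]), chunkSize);
        System.out.println("Wrote " + parts + " part(s)");
    }
}
```

Because the chunks are read front to back and written one file at a time, the disk sees a mostly sequential access pattern, which is the behaviour the replies above argue for.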