I want to implement distributed sorting using Map reduce in Java. From whatever I know, Merge sort will look better for distributed sorting. Are there are any other algorithms that I could use? Please guide.
Note that using Mapreduce will involve two separate phases and thus two algorithm choices.
Obviously the phase which collects and merges separate sort jobs will have to be a merge but the separate sort jobs will operate just like any other sort with the same considerations.
Joined: Feb 27, 2010
I have some questions popping out.
I am not able to find what is the exact key over here. Out of Map function, if I need to find separate keys, it is not possible. Because if I have large amount of integers to be sorted, then key should be the same for everything for me to merge it together later at reduce phase.
Shall I do something like.. I will go through the input data set and picks few keys randomly. I will make all the integers lesser than a key to get associated with that key. I will sort each group of a key using Map function. Later I will just merge everything in the reduce phase depending on the sorting order of keys. Will this work out well?
I dont know how to use Mapreduce efficiently for large data set. Can you help me out.
Author and all-around good cowpoke
Joined: Mar 22, 2000
I don't understand the function of "keys" in your example - sounds like an extra step for no gain.
I'm not an expert on Mapreduce but if this was my problem I would cut the data set into roughly equal portions, one for each sorting processor/thread/whatever, then merge by looking at the "top" element of each of the returned sorted arrays, take the lowest (or highest), remove it and repeat till all sources are exhausted.