aspose file tools*
The moose likes Distributed Java and the fly likes Sorting using Mapreduce Big Moose Saloon
  Search | Java FAQ | Recent Topics | Flagged Topics | Hot Topics | Zero Replies
Register / Login


Win a copy of Spring in Action this week in the Spring forum!
JavaRanch » Java Forums » Java » Distributed Java
Bookmark "Sorting using Mapreduce" Watch "Sorting using Mapreduce" New topic
Author

Sorting using Mapreduce

Madhumitha Baskaran
Ranch Hand

Joined: Feb 27, 2010
Posts: 66
Hi,

I want to implement distributed sorting using Map reduce in Java. From whatever I know, Merge sort will look better for distributed sorting. Are there are any other algorithms that I could use? Please guide.

Thanks,
Madhu
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12806
    
    5
Note that using Mapreduce will involve two separate phases and thus two algorithm choices.

Obviously the phase which collects and merges separate sort jobs will have to be a merge but the separate sort jobs will operate just like any other sort with the same considerations.

Bill
Madhumitha Baskaran
Ranch Hand

Joined: Feb 27, 2010
Posts: 66
Thanks.

I have some questions popping out.

I am not able to find what is the exact key over here. Out of Map function, if I need to find separate keys, it is not possible. Because if I have large amount of integers to be sorted, then key should be the same for everything for me to merge it together later at reduce phase.

Shall I do something like.. I will go through the input data set and picks few keys randomly. I will make all the integers lesser than a key to get associated with that key. I will sort each group of a key using Map function. Later I will just merge everything in the reduce phase depending on the sorting order of keys. Will this work out well?

I dont know how to use Mapreduce efficiently for large data set. Can you help me out.
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12806
    
    5
I don't understand the function of "keys" in your example - sounds like an extra step for no gain.

I'm not an expert on Mapreduce but if this was my problem I would cut the data set into roughly equal portions, one for each sorting processor/thread/whatever, then merge by looking at the "top" element of each of the returned sorted arrays, take the lowest (or highest), remove it and repeat till all sources are exhausted.

Bill
 
I agree. Here's the link: http://aspose.com/file-tools
 
subject: Sorting using Mapreduce