This week's book giveaway is in the OO, Patterns, UML and Refactoring forum. We're giving away four copies of Refactoring for Software Design Smells: Managing Technical Debt and have Girish Suryanarayana, Ganesh Samarthyam & Tushar Sharma on-line! See this thread for details.

In discussing this problem I looked up some of the possible solutions, and discovered a "median of medians" algorithm that intrigued me.

It's basis is to divide the dataset into groups of 5 and calculate the median of each group, so my first thought was: why 5? Since the next logical set for a median would be 11, I wonder if this is a throwback to the days when machines only had 8 registers available? If anybody has any other ideas, I'd love to know. And is there any advantage of using 5 over 7 as the base?

However, that aside, it got me thinking; because it's a problem that (oddly enough) I've never had to deal with:
What is the most efficient way of determining the median of specifically a group of 5 values?

I've been mulling it over for a few hours and think I'm pretty close to the best (mine involves at most 6 comparisons); but I'm interested in other suggestions, because I've been known to miss the obvious. I'm also finding it darn difficult to prove to myself that my solution works empirically.

Winston

Isn't it funny how there's always time and money enough to do it WRONG?
Articles by Winston can be found here

Winston Gutkowski wrote:In discussing this problem I looked up some of the possible solutions, and discovered a "median of medians" algorithm that intrigued me.

It's basis is to divide the dataset into groups of 5 and calculate the median of each group, so my first thought was: why 5? Since the next logical set for a median would be 11, I wonder if this is a throwback to the days when machines only had 8 registers available? If anybody has any other ideas, I'd love to know. And is there any advantage of using 5 over 7 as the base?

My first question too is why only groups of 5? Why not 3 or 7 or 9 or 11 or whatever? What is the reason for having 5 elements in a group?

Winston Gutkowski wrote:However, that aside, it got me thinking; because it's a problem that (oddly enough) I've never had to deal with:
What is the most efficient way of determining the median of specifically a group of 5 values?

I've been mulling it over for a few hours and think I'm pretty close to the best (mine involves at most 6 comparisons); but I'm interested in other suggestions, because I've been known to miss the obvious. I'm also finding it darn difficult to prove to myself that my solution works empirically.

Winston

As far as I can remember from my school days, a median is the value that lies midway between a list of sorted values. So, each sub group needs to be sorted, its median found and then a continuous group of such medians is to be formed. Then the median of these medians is to be found. This would be the pivot element. And you are attempting to do that in the least possible time. Am I correct? What is the objective of doing so much work just to find a "median of medians" which would only serve to be the "pivot" for the partition algorithm later? In other words, what I am asking is what does this algorithm achieve over and above the one that I implemented in my code?

Mansukhdeep Thind wrote:And you are attempting to do that in the least possible time. Am I correct?

Yup, and since the base group is 5, it makes sense to start with that problem; and it's not as easy as it looks.

What is the objective of doing so much work just to find a "median of medians" which would only serve to be the "pivot" for the partition algorithm later?

Because partition() does a lot of work itself, so it makes sense to make it as effective as possible. If you're unlucky enough to choose your pivots badly, the same element can get swapped many times by Quicksort (hence the worst case time of O(n^2)). The "median of medians, grouped by 5" is also extremely fast, since at each stage of recursion you're reducing by a power of 5 rather than the usual 2.

Mansukhdeep Thind wrote:what I am asking is what does this algorithm achieve over and above the one that I implemented in my code?

Well, according to the Wiki page, it improves worst-case time to O(n*log(n)) - although they also say that, in practice, it generally doesn't make much difference.

I'm also darn sure that 5 wasn't chosen by accident, and that it probably gives the best "bang for the buck" in terms of efficiency; but exactly why I honestly don't know.

Paul Mrozik wrote:You are, in fact, asking about the most efficient way to sort five elements, right? As the median will always be the third value in a sorted set of 5.

Possibly, but I suspect not. Quickselect, which finds the kth element of n, only needs to do a partial sort (the "partition" method we've been talking about), since you already know which partition the result should be in. The problem is that the first partition of 5 requires 4 comparisons and at least 2 swaps. I'm trying to work out if there's a better alternative than Quickselect for specifically 5 items.

Just a fun little problem that I've never really thought about before.