This week's book giveaway is in the Big Data forum. We're giving away four copies of Elasticsearch in Action and have Radu Gheorghe & Matthew Lee Hinman on-line! See this thread for details.

Problem: given an input in.txt , there are a few hundred thousand integers in that file , separated by " ". The file in.txt only contains the integers and nothing else, and no integer appears twice.

The numbers in in.txt are partially ordered as groups of 5 , meaning that the first 5 are sorted n0<n1<n2<n3<n4 , the next 5 are already sorted n5<n6<n7<n8<n9 and so on.
So for whatever i we have n(5*i) < n(5*i + 1) < n(5*i + 2) < n(5*i + 3) < n(5*i + 4) , provided that 5*i + 4 is less than the total number of integers in the file.

Produce the file out.txt containing the numbers from in.txt sorted in ascending order.

---------------------------------------------

Given this problem , a simple approach would be read all numbers in some collection , the sort it with the Java already implemented sort() function , then write the result in out.txt.

I am sure you can come up with a faster approach than this. I invite you to do so , and after you post here a custom sorting method , I will compare its efficiency with the default java .sort() one.

If you have a sorted array, you can find the place for the new element to insert to with binary search.

You begin with the first sorted group, and insert the elements of the second group one after one.
After every insertion, you remember the place where the element went, as this piece of information narrows
where the next element's place can be.

Whether this is better than lumping together all the elements and sort them as such - I do not know.

Ivan Jozsef Balazs wrote:Whether this is better than lumping together all the elements and sort them as such - I do not know.

I'd say it isn't, at least for large number of elements. This is effectively insertsort. And there are other algorithms - notably quicksort and heapsort - which are preferred over insertsort. For small arrays the insertion sort is very efficient, though, and some quicksort implementations use insertsort to sort small portions of the array.

Ivan Jozsef Balazs
Rancher

Joined: May 22, 2012
Posts: 948

5

posted

0

Martin Vajsar wrote:

Ivan Jozsef Balazs wrote:Whether this is better than lumping together all the elements and sort them as such - I do not know.

I'd say it isn't,

This was my guess but I was too lazy to think on it further.

Myke Enriq
Ranch Hand

Joined: Feb 13, 2012
Posts: 115

posted

0

I am thinking about a modified merge sort. Assume input has size N a power of 5.

You split the input in 5 sub arrays , sort those then merge them in O(n). This is the formula for a recurrent function.

Anyhow , this gets an O ( Nlog(base 5)N ) complexity.

----------------------------------------------

Now I wonder , what if the starting sub arrays of size 5 are merged one by one in O(n) into starting sub arrays of size 10 ?
Then the whole algorithm gets to be O(N log(base 10) N).

----------------------------------------------

Then mix the now starting sub arrays of size 10 one by one into starting sub arrays of size 20. Now the complexity is even smaller.

How do you optimize this thing ?

Meaning the final algorithm has 2 stages:
step 2: starting with the already sorted sub arrays of size 5 , you merge them one by one , and so you get starting sub arrays of double the size.
Do this x times and you get to start with already started sub arrays of size 5 * 2^x. The cost of step 1 is x * n

step 2: starting with already sorted sub arrays of size 5 * 2^x = S , use a modified merge sort , with the recurrent formula:
- split input into S parts
- sort each of those S parts
- merge the S parts

Cost of step 2 is S*N log(base S)N .

adding the costs of the 2 steps we need to find a minimum.

Thus find the minimum for: x* N + [5 * 2^x] * N log(base 5*2^N) N.
This function is positive for N > 0 , f(x,0) = x;

Please help me find its minimum , meaning the Xmin where f(Xmin, n) <= f(x , n) for every n.

Myke Enriq wrote:I am thinking about a modified merge sort. Assume input has size N a power of 5.

You split the input in 5 sub arrays , sort those then merge them in O(n). This is the formula for a recurrent function.

Anyhow , this gets an O ( Nlog(base 5)N ) complexity.

BTW, it doesn't matter what base is used for the log ... in big-O notation, "O (log N)" where log is in any base is equal to "O(log N)" where log is in any other base. It is just a quirk of the math.

True... In big-O notation, O(log base 2 N) is equivalent to O(log base 57 N).

For this problem, I wouldn't be surprised if we can't do any better than O(N log N), but I don't think that big-O captures everything we're concerned about. We're not just concerned about our algorithm being O(N log N), but also what the actual multiplier is. The simple solution of just reading all the elements into one big collection and doing a merge sort (or maybe a quicksort) on them is O(N log N). Surely we can come up with a better algorithm that takes advantage of the "sorted by fives" precondition and may still be O(N log N) but runs in, say, one fourth the time as the simple solution.

Myke's modified merge sort idea from yesterday is one candidate. Any other ideas? In particular, do we have any implementations to actually test?

Ryan McGuire wrote:but I don't think that big-O captures everything we're concerned about.

Was it specified somewhere what are we actually concerned about?

We're not just concerned about our algorithm being O(N log N), but also what the actual multiplier is.

No, we aren't. The actual multiplier doesn't give any additional information. As has already mentioned recently elsewhere in this forum, the "big-O" just describes how the runtime of the algorithm changes when the size of its input changes. And really it is just an approximation. You cannot use it in any way to compute that runtime, since lower-order factors of the complexity are simply omitted.

Any other ideas? In particular, do we have any implementations to actually test?

The answer to similar proposals around here usually is: show us your idea or implementation first.

Myke's idea actually sounds pretty good to me. If I understand it right, it's actually a merge sort that starts at the size of sublists equal to 5 instead of 1, so it shaves off some of the time of the original merge sort.

Ryan McGuire
Ranch Hand

Joined: Feb 18, 2005
Posts: 1020

4

posted

0

Martin Vajsar wrote:

Ryan McGuire wrote:but I don't think that big-O captures everything we're concerned about.

Was it specified somewhere what are we actually concerned about?

We're not just concerned about our algorithm being O(N log N), but also what the actual multiplier is.

No, we aren't. The actual multiplier doesn't give any additional information. As has already mentioned recently elsewhere in this forum, the "big-O" just describes how the runtime of the algorithm changes when the size of its input changes. And really it is just an approximation. You cannot use it in any way to compute that runtime, since lower-order factors of the complexity are simply omitted.

Any other ideas? In particular, do we have any implementations to actually test?

The answer to similar proposals around here usually is: show us your idea or implementation first.

Myke's idea actually sounds pretty good to me. If I understand it right, it's actually a merge sort that starts at the size of sublists equal to 5 instead of 1, so it shaves off some of the time of the original merge sort.

I guess I don't understand your distinction. On the one hand you say that no, we aren't concerned with the multiplier. On the other, you say that Myke's modified merge sort, which is O(N log N), shaves some time off the original merge sort, which is also O(N log N). Since both algorithms run in time that is proportional to N log N (at least for large N), isn't it exactly the comparison of the N log N multipliers for those two algorithms that makes Myke's better than the standard?

> Was it specified somewhere what are we actually concerned about?
Yes. The first sentence of the last line of the original post: "I am sure you can come up with a faster approach than this."

We are looking at the overall speed of any candidate algorithms. I would say that an algorithm that can sort a few hundred thousand numbers in less time than the simple-minded java sort, even if both have the same big-O expression, is what we're looking for.

> The answer to similar proposals around here usually is: show us your idea or implementation first.
But of course, and I've contributed to quite a few previous challenges: Sorting fifteen apples when you can do a three way comparison, solving Sudokus, identifying the one name (or other string) that shows up in a list just once while others are repeated, etc. I have an idea for an algorithm for this challenge as well but haven't had time to actually put pen to paper, so to speak.

It just seemed that in this case we were going off on a tangent about how log(base 5) N is proportional to log(base 10) N, and I wanted us to get back on the topic of actual algorithms that might be faster than the simple-minded one without dismissing candidates just because they execute in O(N log N) time.

Ryan McGuire wrote:
Myke's modified merge sort idea from yesterday is one candidate. Any other ideas? In particular, do we have any implementations to actually test?

Martin Vajsar wrote:
Myke's idea actually sounds pretty good to me. If I understand it right, it's actually a merge sort that starts at the size of sublists equal to 5 instead of 1, so it shaves off some of the time of the original merge sort.

Sorry, I can't make the same conclusion -- how is Myke's idea "pretty good"? Especially since part of the premise is ignoring portions big-O notation?

Ryan McGuire wrote:
It just seemed that in this case we were going off on a tangent about how log(base 5) N is proportional to log(base 10) N, and I wanted us to get back on the topic of actual algorithms that might be faster than the simple-minded one without dismissing candidates just because they execute in O(N log N) time.

There is a reason why constants are ignored. There is a reason why lower order components of the expressions are ignored too. The number of elements are calculated to approach infinity.

Let's look at Myke's modifications. He wants to do a bunch of extra passes to merge the sub-components -- these new extra passes are, of course, ignored by big-O notation. He also wants to break each recursive step into a large number of components to do recursion instead of the standard two recursive calls at each step. This is, of course, ignored by big-O notation as N is approaching infinity, and we are assuming that K (the number of components) is not related to N.

Now, I agree that ignoring the constant is kinda "simple-minded", but ignoring one rule of big-O notation, but assuming that the others still applies is not really good either. If you are arguing that N should not be infinity, then you have to consider the extra passes, and that it may no longer be order NlogN due to the large number of components with each recursive step.

Henry

Ryan McGuire
Ranch Hand

Joined: Feb 18, 2005
Posts: 1020

4

posted

0

Myke, have you had a chance to actually code either the simple-minded sort or your modified one to see how long either one takes on a sample of the size stated in the original post: "a few hundred thousand"? I'd be interested in seeing how well the rubber hits the road for either version.

Henry Wong wrote:
There is a reason why constants are ignored. There is a reason why lower order components of the expressions are ignored too. The number of elements are calculated to approach infinity.
...
If you are arguing that N should not be infinity, then you have to consider the extra passes, and that it may no longer be order NlogN due to the large number of components with each recursive step.

I am indeed arguing that N should not be infinity but rather a few hundred thousand as stated in the OP. Let's say max(N) = 999,999. Therefore we do have to consider the the extra passes. The fact that the time taken for the extra passes, while lower order than NlogN, may still be significant with the given stated range for N makes big-O analysis an important first step, but certainly not the last.

However, even if we do let N approach infinity, between two algorithms that are both O(NlogN), I would want the one that had the smaller coefficient for the highest order term. Again, big-O is important but not 100% definitive.

Personally, I think that in regards to efficiency and/or performance (and I am not talking about big-O notation here), the simpler the enhancement, the better. Extra passes are not good in this regard. Doing something K times, instead of two times, is not good in this regard. etc.

How is this as a simple enhancement? The standard merge sort splits the data set in half. It then, recursively sort the two halves, and then merges them. When the merge sort is asked to sort a data set of one, it just returns it as sorted.

The enhancement is simply to have a modified merge sort that understands the grouping of five. When it splits the data set in half, it splits it at a boundary of a grouping (the two halves are multiples of five). When the merge sort is asked to sort a data set of five, it just returns it as sorted. There are no extra passes that needs to be done. There isn't an extra loop to process K number of sub data sets. etc.

The mergesort Henry described (with segment size of 5 instead of 1) was, I believe, where Myke started off himself (and this is the idea I though "good"):

Myke Enriq wrote:Now I wonder , what if the starting sub arrays of size 5 are merged one by one in O(n) into starting sub arrays of size 10 ?

But I too believe the other proposed modification - merging more than two segments at once - makes things worse. My reasoning is that taken to the extreme - merging as many segments as there are elements in the array in one step - would effectively be a selectsort or insertsort. As these are O(n^2), I guess that merging more and more segments in one step pushes the complexity from O(N log(N)) towards O(n^2).

Myke Enriq
Ranch Hand

Joined: Feb 13, 2012
Posts: 115

posted

0

Guys, the O(n) notation is a CURSE.

I do not care , and have ever cared EVER , if an algorithm is O(whatever). I only care about the costs , as 2 algorithms of the same O() could have very different costs.

Judging algorithms in terms of O() is a huge handicap to any programmer. The only thing that ever matters is the actual cost , not how that cost behaves when the input data goes to infinity (lol).

Having said that , it makes a huge difference to me if an algorithm has the cost N*log(base 2)N or the cost N*log(base 5)N

Myke Enriq wrote:Guys, the O(n) notation is a CURSE.

I do not care , and have ever cared EVER , if an algorithm is O(whatever). I only care about the costs , as 2 algorithms of the same O() could have very different costs.

Judging algorithms in terms of O() is a huge handicap to any programmer. The only thing that ever matters is the actual cost , not how that cost behaves when the input data goes to infinity (lol).

Uh oh.

It is irrelevant that your application has a low cost of handling 100 records, if it will eventually handle millions of records and its complexity is O(n^2) (or worse).

If you always, always measure the performance of your application at the biggest possible dataset and don't allow for any growth in the amount of data, then yes, you may forget about complexity and concentrate just on the costs. However, in all projects I've ever worked on, the amount of data increased with time. New records are added to the database, new transactions are realized, new facilities are build or bought. New clients are acquired (hopefully). If you don't care about complexity, you actually don't care about how your application will perform when the size of its data doubles. That would be a hard sell to many customers.

Most of the time, you need to care about both - the complexity and the actual execution time. You'll estimate the complexity to be sure your application response time won't go through the roof once real data start flowing in, and you use the profiler to find spots where you could bring the execution time down most effectively.

Having said that , it makes a huge difference to me if an algorithm has the cost N*log(base 2)N or the cost N*log(base 5)N

It's probably not possible to compute a cost of an algorithm. An algorithm may be implemented in many different ways and compiled and run on many different platforms with different characteristics, and I'm quite sure that it is not possible to describe the cost of an algorithm by just one formula that will match all implementations, compilers and platforms. (The big-O is different: it is an approximation, therefore it is equally valid for all platforms, compilers and implementations, provided that the implementation is correct).

I also believe that your estimates of the costs of the algorithms in this particular case are not correct. If I'm not mistaken, you assume that merging two lists of certain size will take the same time as merging three (or more) lists of the same size. That is not true (finding a minimum of n numbers takes n-1 comparison, and you're not accounting for that). We've already discussed this and both Henry and I have told why we think that merging two lists at once is better strategy than merging more lists at once.

So, if you really want to compare the costs, you need to do it correctly. Much easier, in my opinion, would be to implement the two algorithms and just measure the time they take (using inputs of different size for good control of the complexity).

I've realized that you do care about complexity, Myke, but that you're trying to compute it even more precise (the cost). I've edited my previous post to demote the text that is really not relevant. Sorry about that.

Myke Enriq wrote:Guys, the O(n) notation is a CURSE.

I do not care , and have ever cared EVER , if an algorithm is O(whatever). I only care about the costs , as 2 algorithms of the same O() could have very different costs.

Judging algorithms in terms of O() is a huge handicap to any programmer. The only thing that ever matters is the actual cost , not how that cost behaves when the input data goes to infinity (lol).

Having said that , it makes a huge difference to me if an algorithm has the cost N*log(base 2)N or the cost N*log(base 5)N

Sorry for the late response, but better late than ..... First, O(n) notation is not a CURSE -- it is merely a tool. And programmers are smart and reasonable people (well, except for a couple that I met over the years, but I digress ). There isn't a bunch of developers stuck in the corner somewhere wondering why their programs are not fast enough because it look good on paper based with big O notation...

The issue here is that you are trying to have it two ways. You don't want to use big O notation with N going to infinity, but you want to use the math. You want to use the math developed for big O notation to represent cost. This is not how it is done. There are just too many variables when it comes to performance to do it exact on paper. And using the math that makes many assumptions, but ignoring those assumptions, won't work.

There is no such a thing as "COST (N Log N)". And hence, there is no such as thing as comparing it that way.

With performance / cost comparison, you need to actually test it. You need to write a testing framework, that can test different values of N -- and doing so without the testing framework affecting the results too much. This testing framework needs to produce results with mean, min, max, standard deviations, etc., as numbers without an indication of accuracy is not useful. etc.

This is not exactly sorting but should get the job done faster if you have enough memory.

Create an int array of length Integer.MAX_INT.

Now, start reading the input file stream and for each integer, put a 1 in the array at that index.

After processing the file, the indexes in the array where the value is 1 will give you the sorted list.

If memory is limited, this could be modified to use a smaller array with Node objects as array elements. Each input integer can be normalized to fall between 0 and array length and appended to the Node.

The second step will involve sorting the values in Node if it has multiple values. It is possible to have input values such that all of them are bucketed in the same index.

Yea, that can be faster, or not. The problem is the last step - you have to iterate through the array and look for all the elements that have been set. That operation is O(R), where R is the size of the range of int values you have to check. It can be considerably larger than N, especially considering N is supposed to be just a few hundred thousand in this problem. However, if N is a reasonably large fraction of R, then the time required to scan all elements of the array is not such a deterrent, and this technique can be pretty fast.

Note that you can also do this with a boolean[] array or a java.util.BitSet. The BitSet uses much less memory, one bit per element, whereas using a boolean[] will probably (on standard JDKs, last time I checked) will probably use just as much memory as an int[] does.