# Collections.sort(List, Comparator)??

eric sato

Ranch Hand

Posts: 37

John Smith

Ranch Hand

Posts: 2937

posted 12 years ago

Since you are using this particular method that takes a comparator as a parameter, then you could simply increment a counter in the compare() implementation of your comparator. That would be an experimental way of doing it. You can also do it theoretically: the javadocs tell you that the sorting algorithm is a modified mergesort which offers guaranteed nlog(n) performance. So, if you have 100 elements in your list, your worst case number of invocations of compare() would be 100 * log(100).

**Is there any formula to calculate [the worse case] the number of times accessing the method c.compare when Collections.sort(List list, Comparator c) is used?**

Since you are using this particular method that takes a comparator as a parameter, then you could simply increment a counter in the compare() implementation of your comparator. That would be an experimental way of doing it. You can also do it theoretically: the javadocs tell you that the sorting algorithm is a modified mergesort which offers guaranteed nlog(n) performance. So, if you have 100 elements in your list, your worst case number of invocations of compare() would be 100 * log(100).

eric sato

Ranch Hand

Posts: 37

posted 12 years ago

hi,

Yes i read that javadoc too. I tested on list with size 8 [all in strings] and do sorting according each of the string length.

The max counter hits is 17.

Thus the formula for n log(n), 8 log 8 = 7.2247 . it sounds like not really accurate.

Perhaps is this n log(n) is just a average running times.

Pls consult and thanks again.

Sato

Yes i read that javadoc too. I tested on list with size 8 [all in strings] and do sorting according each of the string length.

The max counter hits is 17.

Thus the formula for n log(n), 8 log 8 = 7.2247 . it sounds like not really accurate.

Perhaps is this n log(n) is just a average running times.

Pls consult and thanks again.

Sato

If you can't change the direction the the wind, change your sail.

Stan James

(instanceof Sidekick)

Ranch Hand

Ranch Hand

Posts: 8791

posted 12 years ago
A good question is never answered. It is not a bolt to be tightened into place but a seed to be planted and to bear more seed toward the hope of greening the landscape of the idea. John Ciardi

I think the "guarantee" is worst case, guarantees this or better. How would we test the worst case? Different sort algorithms have different weaknesses ... I recall that one (quicksort?) which is usually relatively quick bogs down horribly if the input is already sorted. Data makes so much difference that we had mainframe sort products that sampled the input to decide what algorithm to use.

Jim Yingst

Wanderer

Sheriff

Sheriff

Posts: 18671

posted 12 years ago

Mmmm, I'd beware of interpreting performance guarantees this way. Often when they say something like "n * log(n) performance" they mean that performance will be

A * (n + B) * (log(n) + C)

where A, B, C are constants. B and C will be insignificant if n is large enough, so they're often ignored. A is not so insignificant, but it's generaly ignored too, since we're usually more interested in the

Note also that log(n) generally means logarithm base 2 of n. Though this doesn't really matter much in most cases, because log base 2 or n = log(n) / log(2) (using whatever other type of log you want to use, base 10 or natural log) so log base 2 is

As Stan noted, we also need to pay attention to whether a formula is for worst-case behavior, or typical. In this case the API doesn't really say, though I believe it's both, for a mergesort.

Anyway, all this was just to say that if you want an exact formula for the number of comparisons, you can't necessarily trust the API guarantees, because the language they use is vague enough to hide numerous factors.

Now it turn out that in this case, the formula they provide it pretty accurate, and A = 1 (or pretty close) once you calculate log base 2 correctly. Here's a test program:

Note that if you comment out the shuffle(), you get much better performance. And if you use add(0, new Integer(i)) to create the list in reverse order initially, performance is still pretty good. These are two common cases which could have yeilded worst-case performance for some algorithms, but give good performance here.

So - the formula n * log(n) is in fact pretty accurate here. But beware of taking such formulas too literally in the future - there are often other issues being hidden.

*proprortional to*n * log(n) (or whatever other formula they provide). Note that they don't generally specify the proportionaility constant. And note that the formula they provide may well omit some terms which are deemed less significant. E.g. a more precise formula might be something likeA * (n + B) * (log(n) + C)

where A, B, C are constants. B and C will be insignificant if n is large enough, so they're often ignored. A is not so insignificant, but it's generaly ignored too, since we're usually more interested in the

*shape*of the graph than in its actual values. (Is it linear? Quadratic? Exponential?)Note also that log(n) generally means logarithm base 2 of n. Though this doesn't really matter much in most cases, because log base 2 or n = log(n) / log(2) (using whatever other type of log you want to use, base 10 or natural log) so log base 2 is

*proportional*to log base 10, which means the proportionality constant could just be included in the factor A which we're generally ignoring anyway.As Stan noted, we also need to pay attention to whether a formula is for worst-case behavior, or typical. In this case the API doesn't really say, though I believe it's both, for a mergesort.

Anyway, all this was just to say that if you want an exact formula for the number of comparisons, you can't necessarily trust the API guarantees, because the language they use is vague enough to hide numerous factors.

Now it turn out that in this case, the formula they provide it pretty accurate, and A = 1 (or pretty close) once you calculate log base 2 correctly. Here's a test program:

Note that if you comment out the shuffle(), you get much better performance. And if you use add(0, new Integer(i)) to create the list in reverse order initially, performance is still pretty good. These are two common cases which could have yeilded worst-case performance for some algorithms, but give good performance here.

So - the formula n * log(n) is in fact pretty accurate here. But beware of taking such formulas too literally in the future - there are often other issues being hidden.

"I'm not back." - Bill Harding, *Twister*

Herb Schildt

Author

Ranch Hand

Ranch Hand

Posts: 253

6

posted 12 years ago

I want to echo what Jim is saying about the meaning of n log(n) in the documentation describing Collections.sort(). In this context, n is the number of elements being sorted. As Jim states, it means that Collections.sort() runs proportional to n log(n). Thus, n log(n) describes in a general way how the runtime performance of a sort is affected when n increases. (Or, as Jim states, it describes the shape of the curve.) Of course, the main ingredients of sorting are comparisons and exchanges, so how often these occur dictiates the overall performance of the sort.

Different sorting algorithms have different performance curves. Some sorts, such as the Bubble sort, are very poor, with performance proportional to n-squared. This means that the Bubble sort's performance is proportional to the square of the number of elements being sorted. This makes the Bubble sort unacceptable for sorting anything but very small sets of data.

By contrast, the best general-purpose sorts run proportional to n log(n). n log(n) is a much smaller value than n-squared, especially for large n values. Thus, when the API documentation describes Collections.sort() as "guaranteed n log(n) performance", it means its runtime is proportional to n log(n). This puts it the "best-practices" category of generalized sorts.

Different sorting algorithms have different performance curves. Some sorts, such as the Bubble sort, are very poor, with performance proportional to n-squared. This means that the Bubble sort's performance is proportional to the square of the number of elements being sorted. This makes the Bubble sort unacceptable for sorting anything but very small sets of data.

By contrast, the best general-purpose sorts run proportional to n log(n). n log(n) is a much smaller value than n-squared, especially for large n values. Thus, when the API documentation describes Collections.sort() as "guaranteed n log(n) performance", it means its runtime is proportional to n log(n). This puts it the "best-practices" category of generalized sorts.

For my latest books on Java, including *Introducing JavaFX 8 Programming*, see HerbSchildt.com

John Smith

Ranch Hand

Posts: 2937

posted 12 years ago

It's been a while since I took my "Algorithms" course, so I looked it up in my Algorithms in Java book. Among other things, it states these two:

Mergesort sorts [...] N elements in time propotional to N log N Mergesort requires [...] N lg N comparisons to sort [...] N elements (

Now, since the original poster question was about

Anyway, I added another loop to Jim's code to see how the number of compare() invocations scales with the size of the list, and got some very persistent results for the lists with the following structure:

10000, 1, 99999, 2, 99998, 3, ...

Here are the results:

Elements Actual Expected Ratio

-------- ------ -------- -----

100000 1233791 1660964 0.742816198760518

110000 1367803 1842185 0.7424891509035242

120000 1502639 2024720 0.7421462071380728

130000 1639211 2208459 0.7422417289476446

140000 1772751 2393309 0.7407111606788924

150000 1909659 2579190 0.7404102332803668

160000 2049471 2766033 0.7409420904730387

170000 2189003 2953779 0.7410853744876023

180000 2331231 3142374 0.7418691917289426

190000 2473755 3331771 0.7424743686198316

200000 2617583 3521928 0.7432244297661377

The "expected" column is size * Math.log(size) / Math.log(2) (i.e. the log is base 2, as Jim suggested). The ratio is still well below 1, which tells me that my worst case is not bad enough.

[ October 07, 2003: Message edited by: Eugene Kononov ]

*note the difference between log and lg*).

Now, since the original poster question was about

*the number of times the compare() method is invoked in the worst case scenario*, I would still maintain that the answer is NlgN (well, I did say that it was NlogN, but I accept Jim's comment about the base of the logarithm). I experimented a bit with Jim's code to prove it, but I had a hard time coming up with the worst case scenario: mergesort is a "divide and concur" approach, so sorting the list in the opposite order is not the worst case scenario.Anyway, I added another loop to Jim's code to see how the number of compare() invocations scales with the size of the list, and got some very persistent results for the lists with the following structure:

10000, 1, 99999, 2, 99998, 3, ...

Here are the results:

Elements Actual Expected Ratio

-------- ------ -------- -----

100000 1233791 1660964 0.742816198760518

110000 1367803 1842185 0.7424891509035242

120000 1502639 2024720 0.7421462071380728

130000 1639211 2208459 0.7422417289476446

140000 1772751 2393309 0.7407111606788924

150000 1909659 2579190 0.7404102332803668

160000 2049471 2766033 0.7409420904730387

170000 2189003 2953779 0.7410853744876023

180000 2331231 3142374 0.7418691917289426

190000 2473755 3331771 0.7424743686198316

200000 2617583 3521928 0.7432244297661377

The "expected" column is size * Math.log(size) / Math.log(2) (i.e. the log is base 2, as Jim suggested). The ratio is still well below 1, which tells me that my worst case is not bad enough.

[ October 07, 2003: Message edited by: Eugene Kononov ]

Jim Yingst

Wanderer

Sheriff

Sheriff

Posts: 18671

posted 12 years ago

I think that's probably correct; I just objected to accepting the API's word for it, given that the API wording was inherently vague.

Particularly given that you can use Collections.shuffle() and easily get results around .95 for N > 100000. Your test cases are just too orderly, it seems.

**Now, since the original poster question was about the number of times the compare() method is invoked in the worst case scenario, I would still maintain that the answer is NlgN**

I think that's probably correct; I just objected to accepting the API's word for it, given that the API wording was inherently vague.

**The ratio is still well below 1, which tells me that my worst case is not bad enough.**

Particularly given that you can use Collections.shuffle() and easily get results around .95 for N > 100000. Your test cases are just too orderly, it seems.

"I'm not back." - Bill Harding, *Twister*