Win a copy of Testing JavaScript Applications this week in the HTML Pages with CSS and JavaScript forum!
  • Post Reply Bookmark Topic Watch Topic
  • New Topic
programming forums Java Mobile Certification Databases Caching Books Engineering Micro Controllers OS Languages Paradigms IDEs Build Tools Frameworks Application Servers Open Source This Site Careers Other all forums
this forum made possible by our volunteer staff, including ...
Marshals:
  • Campbell Ritchie
  • Bear Bibeault
  • Ron McLeod
  • Jeanne Boyarsky
  • Paul Clapham
Sheriffs:
  • Tim Cooke
  • Liutauras Vilda
  • Junilu Lacar
Saloon Keepers:
  • Tim Moores
  • Stephan van Hulst
  • Tim Holloway
  • fred rosenberger
  • salvin francis
Bartenders:
  • Piet Souris
  • Frits Walraven
  • Carey Brown

How to count how many duplicates are in an ArrayList?

 
Marshal
Posts: 69889
278
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Those solutions with what Junilu calls the arrow anti‑pattern also have nested loops, so those solutions will run in quadratic time. You won't notice with ten items in the List but when you get to a big list (e.g. 1000000 elements) you will notice slower execution.

I forgot to say with my solution that you need to pass the words as command line arguments

campbell@campbellsComputer:~/java$ java DuplicateFinder zebra dog cat dog zebra zebra cat zebra horse cat zebra horse cat dog
4

Alert observers will doubtless have noticed the opportunities I left for refactoring in my sugestion.

Please somebody suggest a technique using Streams.
 
Bartender
Posts: 4007
156
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
OP indicated that he had never heard of Sets, so I am very doubtful if Streams would
do any good. But here is a possible way:

create a Map<String, List<String>>, with the first element being
one of the unique words in the inputlist, the second element being
all the words equal to the key. You can do this with the use of Collectors.groupingBy.
Then do a filter, based on the size of the value , and finally a mapping
from the key, value pair to the key.

That's two lines, I guess, and maybe one line if you can combine them.

And maybe there are better ways. But I have found that using some
old non-java 8 methods, are very much more readable.

Try it. As you know, we do not give complete solutions here

Greetz,
Piet

Edit: have a look at this topic:

https://coderanch.com/t/646296/java/java/Venkat
 
Ranch Hand
Posts: 789
Python C++ Linux
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
You have to compare every element to every other element and I don't think there's a way to do that that isn't quadratic. The nested loops just make it obvious what's happening.
 
Bartender
Posts: 7208
65
Eclipse IDE Firefox Browser MySQL Database VI Editor Java Windows
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator

Guillermo Ishi wrote:You have to compare every element to every other element and I don't think there's a way to do that that isn't quadratic. The nested loops just make it obvious what's happening.


No, it's not. See my example which does not have nested loops.
 
Campbell Ritchie
Marshal
Posts: 69889
278
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Nor does my example. Nor does Piet Souris' example. They will all run in linear time.

By the way, there is a thread in another forum of ours about a similar problem, only with letters rather than words.
 
Piet Souris
Bartender
Posts: 4007
156
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Edit: this is in reply to Carey's post.

Yours could be quadratic, depending on the sort method.
But normally it should be O(nlog(n) + n).

Here is an O(n) implementation (I think, it is the OldSchool method)


There's also a stream version, it looks ugly, someone to make a
more elegant version?

Greetz,
Piet
 
Sheriff
Posts: 15815
264
Mac Android IntelliJ IDE Eclipse IDE Spring Debian Java Ubuntu Linux
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator

Guillermo Ishi wrote:You have to compare every element to every other element and I don't think there's a way to do that that isn't quadratic. The nested loops just make it obvious what's happening.


The deep nesting and arrow code anti-pattern actually make it more difficult to see what's happening in your solution.

The problem can be solved with just two passes through data structures involved: first pass goes through the original list and collects/records distinct elements and counts in another data structure (the histogram), and the second pass will go through the histogram and pick out the values that have more than one instance recorded. Using Collection.frequency() actually results in more passes through the main list, I think. N + 1 times maybe?


Of course, if you really wanted to make this readable, you'd refactor by creating a Composed Method:


Line 2 in the code above tells you exactly what's going on in the solution. Java 8 can make this even simpler and straightforward but we'll leave that as an exercise.
 
Campbell Ritchie
Marshal
Posts: 69889
278
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator

Piet Souris wrote: . . . O(nlog(n) + n). . . .

Is there such a thing as O(nlogn + n)? Shouldn't it simply be O(nlogn)?
 
Guillermo Ishi
Ranch Hand
Posts: 789
Python C++ Linux
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator

Junilu Lacar wrote:
The deep nesting and arrow code anti-pattern actually make it more difficult to see what's happening in your solution.


It's not bad enough for it to be hard to understand. It depends on lots of things. Even where jaywalking is enforced, sometimes it's the most rational thing to do...
 
Piet Souris
Bartender
Posts: 4007
156
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
@Junilu
I find your second method less readable than the first one. Also
the use of very long names does not increase the clarity.

But you are our expert in matters like these, and so it could
just be me. Therefore I am very curious as to what other people
think.

@Campbell
First of all: if O(nlog(n) + n) == O(nlog(n)), then my notation is correct!
But I admit: I'm not sure. In one of Coursera's courses about Algorithmic Thinking
I found this a pretty nasty subject. I remember an exercise where the counting
lead to O(n^2 + m), according to the official answer. However, m was anyting
between n and n^2, so I thought: worst case, the whole thing is O(n^2).
But, as said, that was an incorrect answer.

Greetz,
Piet
 
Guillermo Ishi
Ranch Hand
Posts: 789
Python C++ Linux
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator

Carey Brown wrote:

Guillermo Ishi wrote:You have to compare every element to every other element and I don't think there's a way to do that that isn't quadratic. The nested loops just make it obvious what's happening.


No, it's not. See my example which does not have nested loops.



That's interesting code and I will look at it more later. I noticed you start with a sort, which is often a good thing to do and is considered a freebie. The problem is my intuition tells me that loops or not, you ultimately have to compare every element to every other element. The fastest you can do that is N^2/2, which is what I'm doing. It's still considered quadratic though since you throw out the coefficients.
 
Saloon Keeper
Posts: 12165
258
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
In this case I wouldn't use a sort because sorting usually runs in linearithmic time, and this problem can be solved in linear time.

Guillermo, the fault in your reasoning lies in that you don't have to compare every element to every other element. You only have to compare every element to the first element that's equal to it. Since there is a constant number of first elements (namely 1), you only have to consider each element a constant number of times. The upper bound of the growth rate of the function is linear.
 
Junilu Lacar
Sheriff
Posts: 15815
264
Mac Android IntelliJ IDE Eclipse IDE Spring Debian Java Ubuntu Linux
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Just had a look at Carey's solution and it works, too, except that it modifies the list passed in by sorting it. Easy enough to fix though. The one thing about that solution though is that it's more difficult to refactor for clarity and there's that post pre-increment of count in the middle of it that's critical to the algorithm working correctly. It does the job in just one pass though if you disregard how many passes had to be made in sorting the list. If you consider the sorting, then it all depends on how big the list is and what sorting algorithm gets used. But that's premature optimization. The bigger concern would be clarity.

Edit: I guess Campbell and I had pretty much the same idea.
 
Stephan van Hulst
Saloon Keeper
Posts: 12165
258
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Here's another functional solution. I agree that it's not very readable, but it may be educational:
 
Junilu Lacar
Sheriff
Posts: 15815
264
Mac Android IntelliJ IDE Eclipse IDE Spring Debian Java Ubuntu Linux
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator

Stephan van Hulst wrote:Here's another functional solution. I agree that it's not very readable, but it may be educational


Indeed. Right now, I'm thinking that the clear winner in clarity between the two equivalent methods below is the latter, even after my best effort at formatting the former and adding static imports for groupingBy and summingInt:

But then again I'm still not very conversant with the Streaming API in Java 8. Maybe in a few months time, I'll be more comfortable reading the former...

Thanks for sharing!
 
Junilu Lacar
Sheriff
Posts: 15815
264
Mac Android IntelliJ IDE Eclipse IDE Spring Debian Java Ubuntu Linux
  • Likes 1
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator

Piet Souris wrote:@Junilu
I find your second method less readable than the first one. Also
the use of very long names does not increase the clarity.


That's just my style; you can't please everyone but I like being able to read it out loud and actually hear what the code is doing. I will read that out loud as "The count duplicates method takes a list of values and returns how many items have duplicates in the histogram of values" which then might lead me to rename the method to howManyHaveDuplicatesIn(histogramOf(values)). If enough people complain about the long name, I work with them to find a shorter yet still descriptive name that captures the intent just as well. This one may be a bit on the long side but it's still within reason for me.

The real experts like Grady Booch, Kent Beck, Uncle Bob, and others favor using longer rather than shorter names that express the intent well. Finding the right balance between expressiveness and terseness is almost an art but I tend to err on the side of expressiveness and longer names. YMMV.
 
Carey Brown
Bartender
Posts: 7208
65
Eclipse IDE Firefox Browser MySQL Database VI Editor Java Windows
  • Likes 2
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
This has been fun and enlightening. Had to do some benchmarks though.
Given 10 million strings in the list, I got...

Ran it a bunch of times and the results were consistent.

 
Junilu Lacar
Sheriff
Posts: 15815
264
Mac Android IntelliJ IDE Eclipse IDE Spring Debian Java Ubuntu Linux
  • Likes 1
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Interesting that the streams version seems to be the slowest. Of course, it might be different if it were allowed to do concurrent processing although I don't really understand how that works or if it will with this particular kind of processing. I think algorithm #1 takes a big hit from having to sort 10M items though. I ran the program on my Mac, putting algorithm #1 last in the execution and got pretty much the same results:

testing...
#2 10.5446 seconds
#3 3.9009 seconds
#1 9.0904 seconds

Thanks for sharing, Carey!
 
Junilu Lacar
Sheriff
Posts: 15815
264
Mac Android IntelliJ IDE Eclipse IDE Spring Debian Java Ubuntu Linux
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
And I just noticed a bug in your test code, Carey. Here's the fix:


The problem with the version you have is that the list is sorted in place so when you pass an unreferenced ArrayList to the Collections.sort(), the sorted list just goes away and you're back to dealing with unsorted list you passed in. I noticed the problem when I added the duplicate count in the status messages. Algorithm #1 was giving back 0 dupes before the fix:

testing...
#2 6114 dupes 10.2771 seconds
#3 6114 dupes 3.1961 seconds
#1 6114 dupes 9.8686 seconds
DONE.
 
Junilu Lacar
Sheriff
Posts: 15815
264
Mac Android IntelliJ IDE Eclipse IDE Spring Debian Java Ubuntu Linux
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
The results of your test bolster the argument against premature optimization and for favoring clean code first. It's actually a pleasant surprise to see that the cleanest version (IMO) comes out consistently faster as well. Thanks again, Carey. I agree, it has been fun and enlightening.
 
Junilu Lacar
Sheriff
Posts: 15815
264
Mac Android IntelliJ IDE Eclipse IDE Spring Debian Java Ubuntu Linux
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Out of curiosity, I added the pre-Composed Method refactoring version I had as algorithm #4:


Below are the results I got with several runs. I even switched the order in which the algorithms were run to see if that would affect the results. From the sample runs that I have it does seem like the order changes the results a little bit. Not sure what the deal is there but it's even more surprising to see that in 8 out of 10 runs, #3 is faster than #4. Also, Algorithm #1 didn't seem to do well when run last either, coming in the slowest in most of the runs.

testing...
#2 6326 dupes 7.7816 seconds
#3 6326 dupes 3.4466 seconds
#4 6326 dupes 6.7617 seconds
#1 6326 dupes 9.1336 seconds
DONE.

testing...
#2 6179 dupes 9.5098 seconds
#3 6179 dupes 3.5047 seconds
#4 6179 dupes 3.4234 seconds
#1 6179 dupes 11.1353 seconds
DONE.

testing...
#2 6163 dupes 10.7968 seconds
#3 6163 dupes 3.5547 seconds
#4 6163 dupes 4.7490 seconds
#1 6163 dupes 9.5733 seconds
DONE.

testing...
#2 6236 dupes 9.7591 seconds
#3 6236 dupes 4.2966 seconds
#4 6236 dupes 5.1835 seconds
#1 6236 dupes 9.5680 seconds
DONE.

testing...
#4 6146 dupes 7.5050 seconds
#2 6146 dupes 6.1299 seconds
#3 6146 dupes 3.0032 seconds
#1 6146 dupes 9.1021 seconds
DONE.

testing...
#4 6229 dupes 7.4504 seconds
#2 6229 dupes 10.8367 seconds
#3 6229 dupes 3.0185 seconds
#1 6229 dupes 8.9324 seconds
DONE.

testing...
#3 6170 dupes 4.2541 seconds
#4 6170 dupes 6.7610 seconds
#2 6170 dupes 7.9032 seconds
#1 6170 dupes 11.1621 seconds
DONE.

testing...
#3 6300 dupes 4.4871 seconds
#4 6300 dupes 3.3503 seconds
#2 6300 dupes 6.4435 seconds
#1 6300 dupes 9.5885 seconds
DONE.

testing...
#3 6197 dupes 4.3536 seconds
#4 6197 dupes 5.4288 seconds
#2 6197 dupes 7.1514 seconds
#1 6197 dupes 9.3876 seconds
DONE.

testing...
#3 6010 dupes 4.4333 seconds
#4 6010 dupes 5.4191 seconds
#2 6010 dupes 7.4480 seconds
#1 6010 dupes 9.5592 seconds
DONE.

 
Stephan van Hulst
Saloon Keeper
Posts: 12165
258
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I think in general Streams are slower than conventional procedural algorithms, because they probably use a lot more object creation. Every operation you perform on a Stream will create an object representing the operation. There's probably more overhead, but because of the declarative nature of Streams, we don't know (or care) about them.

If I can write a readable algorithm using functional programming, I will. Only if I can prove that the algorithm causes the overall application to not be responsive enough, will I change it to use faster methods.
 
What is that? Is that a mongol hoarde? Can we fend them off with this tiny ad?
Building a Better World in your Backyard by Paul Wheaton and Shawn Klassen-Koop
https://coderanch.com/wiki/718759/books/Building-World-Backyard-Paul-Wheaton
    Bookmark Topic Watch Topic
  • New Topic