This week's book giveaway is in the Design forum.We're giving away four copies of Design for the Mind and have Victor S. Yocco on-line!See this thread for details.
Win a copy of Design for the Mind this week in the Design forum!

Removing duplicates from a String array

Michael Ebner
Greenhorn
Posts: 3
Hello there,

As part of an assignment I am supposed to populate a String array with words from a text file. Then I am supposed to go through the array and make sure it has no duplicates. If it has duplicates I want to count the duplicates in another , equally large array. This is what I've got so far, but there's a problem with my logic here, as the new array still has duplicates.

As far as I know, a Set would make sure there are no duplicates, but I am unfortunately only allowed to use arrays (not even sure I should be using ArrayList, but I figure it is OK since I don't know how many words the file I will be reading has). Any tips?

Tony Docherty
Bartender
Posts: 2952
59
Why are you incrementing 'i' inside the loop?
Also why are you initializing the wordCount array to 50000 elements. Why not wait until you know how many words there in the file are before initializing the array.

Tony Docherty
Bartender
Posts: 2952
59
And welcome to the ranch.

Piet Souris
Rancher
Posts: 1213
25
hi Michael,

apart from what Tony wrote, your first problem is in the lines 28 and further.

If words[0] is different from words[1], but equal to words[2], then when i = 0 and j = 1,
you consider words[0] to be unique.

The easiest way to deal with this is to first sort the array 'words'. After that, you need only one
loop to fulfill the assignment. Another way is to add the line

boolean unique = true;

at line 25,5 and then set it to false if equality is found in the j-loop. At the end of this j loop,
check the boolean. Note however, that you need your j loop to go from 0 to the last word (why? And watch out for i == j!).

And lastly, you could initially read all lines of the file, to determine how many words there are.
That would eliminate the use of ArrayList, the use of which I think is dubious, given that
you are not allowed to use Sets.

Greetings,
Piet

Winston Gutkowski
Bartender
Posts: 10243
58
Michael Ebner wrote:Any tips?

Yes: StopCoding (←click).

Think about the assignment, and the steps that you're likely to need to do to achieve it, rather than just writing code and hoping it'll work.

Like Tony, I'm also dubious about using ArrayList if the assignment said you're supposed to use arrays, so, just add a "count words" step.

As to sorting (which will make things much simpler for determining your duplicates): there are two possible alternatives.
1. Plough all your words into your array and then sort them.
2. Insert each word into its correct place as you read.

The first is likely to be much quicker, but the second is very simple, and ensures that you discover duplicates while you're adding. If you want to get fancy, you could even employ a binary search, since you know that everything you're searching for will be sorted; but to be honest, I'm not sure it will save you that much time.

Winston

Michael Ebner
Greenhorn
Posts: 3
Thanks for all the helpful replies! Apparently I am not supposed to be sorting the words I get from the text file either. I have been trying to incorporate some of your advice, but right now my code is a mess. I have tried to comment on my thinking process where necessary.

I think I am going to take Winston Gutkowski's tip and bring out pen and paper, as I don't see the problem clearly right now.

Winston Gutkowski
Bartender
Posts: 10243
58
Michael Ebner wrote:I think I am going to take Winston Gutkowski's tip and bring out pen and paper, as I don't see the problem clearly right now.

I've broken yours up this time, but for future reference, please remember:
80 characters max.
(the SSCCE page actually recommends 62)
And that includes string literals AND comments AND long method calls.

Oh, and you're usually better off using spaces for indentation, rather than TABs.

Thanks.

Winston

Winston Gutkowski
Bartender
Posts: 10243
58
Michael Ebner wrote:Apparently I am not supposed to be sorting the words I get from the text file either.

Hunh? That seems OTT to me, but maybe your tutors have their reasons. Maybe they just want you to get familiar with the whole business of loops.

However, it actually simplifies things for you quite a bit, providing you design properly.

Winston

Tyson Lindner
Ranch Hand
Posts: 211
Array's are initialized by the size of the array, not what the largest index is. So if there are x words in the file, you would create an array with " new String[x]" NOT "new String[x-1]" even though the last index is x - 1 instead of x.

I suggest just creating a method to count duplicates, and for each word in the unique array invoke that method and print both the word and the result. Creating a separate int array seems unnecessary.

Winston Gutkowski
Bartender
Posts: 10243
58
Tyson Lindner wrote:I suggest just creating a method to count duplicates, and for each word in the unique array invoke that method and print both the word and the result. Creating a separate int array seems unnecessary.

Hmmm, still seems to me that the best (and easiest) way to count dups is when you're reading the words in - especially if you're not allowed to sort - and then a "count" array makes absolute sense.

Michael Ebner wrote:I have been trying to incorporate some of your advice, but right now my code is a mess.

Because you're still trying to do everything at once and put all your code in one place. Think about splitting up this problem into tasks.

And further to what I said to Tyson: Have a look at the List.indexOf() method, and see if you can work out how something like that might help you.

Winston

Tony Docherty
Bartender
Posts: 2952
59
Winston Gutkowski wrote:
Hmmm, still seems to me that the best (and easiest) way to count dups is when you're reading the words in - especially if you're not allowed to sort - and then a "count" array makes absolute sense.

Agreed, especially as the requirement is to create an array without duplicates.

With Winston's suggested approach you can count the duplicates as you read in each word and only add a word if it is not currently in the array.

Tyson Lindner
Ranch Hand
Posts: 211
Winston Gutkowski wrote:
Tyson Lindner wrote:I suggest just creating a method to count duplicates, and for each word in the unique array invoke that method and print both the word and the result. Creating a separate int array seems unnecessary.

Hmmm, still seems to me that the best (and easiest) way to count dups is when you're reading the words in - especially if you're not allowed to sort - and then a "count" array makes absolute sense.

I mean its perfectly fine, I just prefer not to use parallel arrays when I don't have to so I don't have to worry about them not matching up properly.

Tony Docherty
Bartender
Posts: 2952
59
Tyson Lindner wrote:I mean its perfectly fine, I just prefer not to use parallel arrays when I don't have to so I don't have to worry about them not matching up properly.

I agree that the use of parallel arrays is generally a poor design but for a beginners project which seems to be aimed at teaching about arrays I would say it is an acceptable solution.

ibrahim yener
Ranch Hand
Posts: 188
Why not just use Map instead of Arraylist

Campbell Ritchie
Sheriff
Posts: 48652
56
Rather than parallel arrays, create a WordCount class. Then you can use that for counting, but you will have to check equality with the word field rather than the whole object. You can have a WordCount[].
Note that these solutions where you iterate the array checking whether the word has already been found will run in O(n²) complexity, so performance will be slow.

Piet Souris
Rancher
Posts: 1213
25

In all the well meant advises for better, best and better-than-best methods, we completely forgot Michael!

Why not first discuss what is right and what is wrong with his second listing?

And finally: Michael wrote:

I think I am going to take Winston Gutkowski's tip and bring out pen and paper, as I don't see the problem clearly right now.

Why not wait for his findings first?

Greetz,
Piet

Tyson Lindner
Ranch Hand
Posts: 211
Why not first discuss what is right and what is wrong with his second listing?
Piet

I already mentioned he's setting up his arrays incorrectly, but after looking it over again it doesn't look like he's even using his third int array. So he's counting up the total number of duplicates but not how often each one of those duplicates occurs. There are also several spots where his code could be simplified. I suggest a complete overhaul.

To others: he's not allowed to use a map or sorted list and so I highly doubt he would be allowed to create his own class. As per performance, cycling through each word among the unique words is certainly slower than using a parallel array from a proportional perspective, but I think it can still be done within a couple seconds for any reasonable input. IMO its worth the cost of not having to deal with a messy parallel int array.

Paul Clapham
Sheriff
Posts: 20966
31
As Tyson implies, don't forget that this is a beginner exercise. Beginners are routinely taught techniques that we advanced programmers wouldn't be caught dead using. They learn basic programming but once they get out of school they have to unlearn much of that to become Java programmers.

Anyway what I mean here is that Michael is probably limited to writing code involving arrays and not much else.

Tyson Lindner
Ranch Hand
Posts: 211
It could be counting the number of duplicates for each word wasn't even necessary to begin with and we were all confusing the OP. In that case the number of duplicates is obviously just the number of words in the file minus the size of the unique array.

Michael Ebner
Greenhorn
Posts: 3
Hello again, and thanks for all the help. Unfortunately I have not been able to work much on this assignment, but I have given it a bit more thought. First I would like to clarify that I have to just use arrays, and I need one String array for the words, another for the occurrences of the words, and an int to count unique words. Also, I am allowed to set both arrays to 5000, as the unique words won't exceed that.

Agh, I am at a loss. What I would like is if each word in the text file was checked with each word of the array. If it did not match, the word was added to the String array at the first index, and the corresponding index of the int array was incremented by one. If it did match, only the corresponding index of the int array was incremented. I tried to do what I had in mind with the code below, but I am not even close. Here I check if the first word in the text file is in the array, which it is not (the array is filled with ""), then I add the first word to all of the positions in the array. That could work, so long as the next word was checked with the first word, and if it did not match, it was added to the second index (and out). But as far as I understand, right now I am only overwriting every position of the array with the last word.

Any tips on how I could get closer to what I would like to do? Is there a way to make sure that each new, unique word is saved to the first available position in the String array, increment the corresponding position in the int array, and if a match is found, just increment the int array?

Winston Gutkowski
Bartender
Posts: 10243
58
Michael Ebner wrote:Agh, I am at a loss. What I would like is if each word in the text file was checked with each word of the array. If it did not match, the word was added to the String array at the first index, and the corresponding index of the int array was incremented by one.

OK, chill out. What "you would like" is not the issue here. The question is: What do you need to do?

It sounds to me like you've launched yourself down one path, and (at least at the moment) can't see any other way to solve the problem.

So: once again. StopCoding.

Forget efficiency for the moment, and forget files. If need be, you can just read every word in your file straight into a String[5000] and start from there. The file is NOT the issue, and I suspect it's distracting you.

Imagine that someone has 5,000 wooden blocks with words printed on them. They may be unique, they may not be.

Someone hands you a block, and your job is to put it in one of two piles: "unique" or "duplicate". For each block, you are allowed to check any or all blocks in either of the piles you have made to work out what to do with the one you were just given. What do you do?

That is the problem: What do you do with each block you're given? Don't worry about efficiency; just come up with a solution that works.

Now: repeat 5,000 times.

And when you can describe that solution on paper, put it into Java. And test it thoroughly.

Then - and ONLY then - think about how you might make it faster.

HIH

Winston

Tyson Lindner
Ranch Hand
Posts: 211
Michael Ebner wrote:First I would like to clarify that I have to just use arrays, and I need one String array for the words, another for the occurrences of the words, and an int to count unique words. Also, I am allowed to set both arrays to 5000, as the unique words won't exceed that.

Still not sure if you're understanding this correctly, just to clarify:
You need one string array for ALL the words (including duplicates). This array will just mirror the text file so you don't have to use the text file anymore.

Your second string array includes only unique words. By unique, I mean each word in this array will not be repeated in this array, but it may be repeated in the original array.

Your third array is the int array that parallels your second array. So each value in your int array corresponds to the number of occurrences of the corresponding word (same exact index) in the unique word array, that there are in the original array (which is the same as the original file).

Agh, I am at a loss. What I would like is if each word in the text file (first array) was checked with each word of the unique array. If it did not match, the word was added to the unique String array at the first open index, and the corresponding index of the int array was incremented by one. If it did match, only the corresponding index of the int array was incremented.

Yes, that's the right idea.

I tried to do what I had in mind with the code below, but I am not even close. Here I check if the first word in the text file is in the array, which it is not (the array is filled with ""), then I add the first word to all of the positions in the array. That could work, so long as the next word was checked with the first word, and if it did not match, it was added to the second index (and out). But as far as I understand, right now I am only overwriting every position of the array with the last word.

First you have to check if the value in the unique array is null, which is NOT the same as "". Empty array spots are initialized to null. If it is null, you can just put the current word there, and then break the loop. If its not null, then you can compare. Yes its obviously bad to overwrite the previous word in the unique array loop, you should never be doing that. You should either be adding the word to an empty spot, incrementing the int value of the word there, or just be continuing along with the loop.