aspose file tools*
The moose likes Java in General and the fly likes Similar String grouping in Java Big Moose Saloon
  Search | Java FAQ | Recent Topics | Flagged Topics | Hot Topics | Zero Replies
Register / Login


Win a copy of Spring in Action this week in the Spring forum!
JavaRanch » Java Forums » Java » Java in General
Bookmark "Similar String grouping in Java" Watch "Similar String grouping in Java" New topic
Author

Similar String grouping in Java

Ankur Luthra
Ranch Hand

Joined: Dec 13, 2010
Posts: 36
Hi,
I am working on a problem wherein I need to Group similar strings in an Array.
i.e If my String array is say "AXPZ","BXT","PZXA","XAZP","XBT","TBX","ABCD"
then I would group AXPZ,PZXA and XAZP in one group.
BXT,XBT and TBX in 2nd group
ABCD in 3rd group.
A group consists of String having same characters (order does not matter)
1)What would be the best approach for the problem.
2) What would be the final DataStructure representing groups and Similar Strings.
3)What should be the approach for this problem.

If anyone has worked on similar problem do provide some comments or ways to approach it.

Regards,
Ankur Luthra
Jeff Verdegan
Bartender

Joined: Jan 03, 2004
Posts: 6109
    
    6

Ankur Luthra wrote:
1)What would be the best approach for the problem.


"Best" depends on your specific requirements, constraints, and use cases. It's likely that there isn't one approach that can be definitely said to be "best".


2) What would be the final DataStructure representing groups and Similar Strings.
3)What should be the approach for this problem.


One idea that comes to mind--that may or may not meet your requirements--is to define a class that stores the original version of each string and the sorted version. Then, to determine if two strings are "similar", you just look at their sorted versions. For example, AXPZ,PZXA and XAZP will all sort to APXZ.

Another approach--that may or may not work for you--is to use a BitSet for each string and turn on the bits corresponding to the letters that are present in that String. If two BitSets have the same bits turned on, then the corresponding Strings contain the same letters.

I leave it to you to consider the relative merits and demerits of these approaches in the context of your requirements.

Ankur Luthra
Ranch Hand

Joined: Dec 13, 2010
Posts: 36
I was able to do it in following way.Any comments to improve performance if the Array size increases?
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 8008
    
  22

Jeff Verdegan wrote:Another approach--that may or may not work for you--is to use a BitSet for each string and turn on the bits corresponding to the letters that are present in that String. If two BitSets have the same bits turned on, then the corresponding Strings contain the same letters.

I fear I see a fatal flaw in that approach, kemosabe: wouldn't "AB" and "AAB" produce the same BitSet?

Winston


Isn't it funny how there's always time and money enough to do it WRONG?
Articles by Winston can be found here
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 8008
    
  22

Ankur Luthra wrote:I was able to do it in following way. Any comments to improve performance if the Array size increases?

Well, I notice you've defined a HashSet, but don't appear to use it. What about a HashSet of sorted strings?

Better yet might be to create a HashMap, where each key is a sorted string, and each value is a List of all the strings that contain the same letters.

Winston
Jeff Verdegan
Bartender

Joined: Jan 03, 2004
Posts: 6109
    
    6

Winston Gutkowski wrote:
Jeff Verdegan wrote:Another approach--that may or may not work for you--is to use a BitSet for each string and turn on the bits corresponding to the letters that are present in that String. If two BitSets have the same bits turned on, then the corresponding Strings contain the same letters.

I fear I see a fatal flaw in that approach, kemosabe: wouldn't "AB" and "AAB" produce the same BitSet?

Winston


Part of what I meant when I told the OP, "I leave it to you to consider the relative merits and demerits of these approaches in the context of your requirements," was defining whether he could have duplicated letters and what effect it would have on using one approach or the other--and, for that matter, if AB and AAB should be considered a "match" by his rules.
Jeff Verdegan
Bartender

Joined: Jan 03, 2004
Posts: 6109
    
    6

Ankur Luthra wrote:I was able to do it in following way.Any comments to improve performance if the Array size increases?


Not regarding performance in particular, just general comments:

1. Your ModifiedString class should not have setters. Just take the single, unsorted String as a constructor arg, and set the original and sorted variables right there.

2. It looks like there's way too much going on there. An array, and 3 Lists?

3. Don't use get(i) for iteration. Use an Iterator, whether explicit, or implied via foreach loop.

4. Two of your ArrayLists are unparameterized. You should use generics there.

5. Unless you know your variable specifically needs to be an ArrayList, just declare it as List rather than ArrayList.
Ankur Luthra
Ranch Hand

Joined: Dec 13, 2010
Posts: 36
Jeff Verdegan wrote:
Ankur Luthra wrote:I was able to do it in following way.Any comments to improve performance if the Array size increases?


Not regarding performance in particular, just general comments:

1. Your ModifiedString class should not have setters. Just take the single, unsorted String as a constructor arg, and set the original and sorted variables right there.

2. It looks like there's way too much going on there. An array, and 3 Lists?

3. Don't use get(i) for iteration. Use an Iterator, whether explicit, or implied via foreach loop.

4. Two of your ArrayLists are unparameterized. You should use generics there.

5. Unless you know your variable specifically needs to be an ArrayList, just declare it as List rather than ArrayList.

Thank you Jeff.I was focusing mainly on getting the work done so skipped these points.Thanks for the Inputs
Ankur Luthra
Ranch Hand

Joined: Dec 13, 2010
Posts: 36

Well, I notice you've defined a HashSet, but don't appear to use it. What about a HashSet of sorted strings?

Can you please tell me where can I fit in HashSet in this.Are you talking about the final Data Structure i.e. finalList in my case or taking making al HashSet?
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 8008
    
  22

Jeff Verdegan wrote:Part of what I meant when I told the OP, "I leave it to you to consider the relative merits and demerits of these approaches in the context of your requirements,"

Oh, OK. My profuse apologies.

Winston
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 8008
    
  22

Ankur Luthra wrote: Can you please tell me where can I fit in HashSet in this.Are you talking about the final Data Structure i.e. finalList in my case or taking making al HashSet?

Well, there are several ways of doing what you want; but the main advantage of a HashSet / HashMap in this case is speed, because it can do a "contains" check in O(1) time. Most other structures will require either O(n) or O(log(n))) time.

Assuming you have something like a HashMap<String, List<String>>, my logic would go something like this :However, as Jeff pointed out, if you're only interested in grouping strings that contain the same letters, regardless of count; you may need another method.

HIH

Winston
Jeff Verdegan
Bartender

Joined: Jan 03, 2004
Posts: 6109
    
    6

Winston Gutkowski wrote:
Jeff Verdegan wrote:Part of what I meant when I told the OP, "I leave it to you to consider the relative merits and demerits of these approaches in the context of your requirements,"

Oh, OK. My profuse apologies.

Winston


None necessary, of course.
 
Don't get me started about those stupid light bulbs.
 
subject: Similar String grouping in Java