Meaningless Drivel is fun!*
The moose likes Beginning Java and the fly likes Tough Question: Detect Partial Duplicates Big Moose Saloon
  Search | Java FAQ | Recent Topics | Flagged Topics | Hot Topics | Zero Replies
Register / Login
JavaRanch » Java Forums » Java » Beginning Java
Bookmark "Tough Question: Detect Partial Duplicates" Watch "Tough Question: Detect Partial Duplicates" New topic
Author

Tough Question: Detect Partial Duplicates

Justin Filmer
Greenhorn

Joined: Jul 04, 2011
Posts: 27
Let's say I have two Strings...


What I would like to do is detect what percentage of words match between the two... I would LOVE to do sequence detection (http://59.108.48.12/proceedings/sigir/sigir2010/docs/p675.pdf), but I think that's too difficult.

This method is severely flawed, but at least it'll give me a start. Let's first find the words that are similar between the two:

11 words match
s1 has 18 words
s2 has 15 words
Use smaller of the two...
% match = 11/15 = 74% match

Remove common words (my,is,I,to,and). Resulting words...
name, Justin, like, code, have, fun
% match = 6/15 = 40% match

Is there already a Java function that could give me some percentage of partial duplicates? If not, how can I write an algorithm that can compare two strings and figure out what percentage they have in common?

Here's my pseudocodo with the approach that I don't like...


I would really appreciate better ideas or code snippets... Thanks!
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18541
    
    8

I was about to post the standard procedure, which would be to put the words from the two sentences into two sets, then find their intersection.

But then I noticed you said you didn't like that. Why not?
Justin Filmer
Greenhorn

Joined: Jul 04, 2011
Posts: 27
Paul Clapham wrote:I was about to post the standard procedure, which would be to put the words from the two sentences into two sets, then find their intersection.

But then I noticed you said you didn't like that. Why not?


Please go ahead and post it - it'll definitely help - if not for this, then for something else.
The reason I don't like it is because my reasoning is somewhat flawed. If we're looking for partial duplicates between sentences or paragraphs (strings in general), we should be also looking at phrases and ORDER of phrases within the strings, not just the words. The algorigthm I proposed doesn't take this into account at all.

By my previous reasoning, these two sentences would be considered to have a 66% match, when in my mind they should have a 0% match:


It would be way better and much favorable in my mind to compare bunches of two or three word phrases between the two strings to see how many matches show up. However, that seems like a problem that is out of the scope of "beginning java". If that is indeed easily possible, please let me know how! :)
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18541
    
    8

Well, I did post it. And you posted it too. The pseudocode, that is. That's what I was going to post too. (Except lines 3 and 4 of your pseudocode are unnecessary.)

But I think you have to stop now and figure out your requirements before you start looking for algorithms. And no, you don't really have your requirements firmed up yet. You provided an example where word order was significant. Here's another one:

I don't know what you would make of that, but at any rate it might help if you started generating all kinds of pairs and decided what the match was and also why. After a while you should be able to put pen to paper and write down your rules.
Justin Filmer
Greenhorn

Joined: Jul 04, 2011
Posts: 27
Good point Paul. Thumbs up. I posted the pseudocode to show that I am thinking about how to code it - I just don't have the Java experience to actually turn it into real code.

Anyway, I've been thinking about it and think that this kind of algorithm may not be bad:


Would you be able to help me code this? I'm thinking maybe using subStrings or something would help here.
 
I agree. Here's the link: http://aspose.com/file-tools
 
subject: Tough Question: Detect Partial Duplicates