Hello everyone, can anyone teach me an algorithm that calculates the similarity of two texts? I need to write a class that compares two texts and returns a similarity score, for example 1 if the two texts are exactly the same.
Regards, Jackson [ July 29, 2008: Message edited by: Cairo Jackson ]
I just did a Google search for "measure text similarity" and got all sorts of interesting stuff.
Computer analysis of text has a looooong history so this is a big topic. (People trying to prove that Shakespeare did or didn't write X, etc.)
Word frequency analysis is a good start and simple to program. Gather statistics on average word size at the same time. Common misspellings may be found this way. I'm not sure what good Metaphone would be, since it would tend to conceal misspellings. The standard Java library collection classes such as TreeMap are what you need to gather this information.
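A minimal sketch of gathering those word statistics with a TreeMap. The class name, method names and the tokenizing rule (split on anything that is not a letter, digit or apostrophe) are assumptions made for illustration:

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class WordStats {
    /** Counts how often each lower-cased word occurs; TreeMap keeps the words sorted. */
    static Map<String, Integer> wordFrequencies(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        // Assumed tokenizing rule: anything that is not a letter, digit or apostrophe separates words.
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}']+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Average word length over all occurrences, or 0 for text without words. */
    static double averageWordLength(Map<String, Integer> counts) {
        long letters = 0;
        long words = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            letters += (long) e.getKey().length() * e.getValue();
            words += e.getValue();
        }
        return words == 0 ? 0.0 : (double) letters / words;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = wordFrequencies("The cat sat. The cat ran!");
        System.out.println(counts);
        System.out.println(averageWordLength(counts));
    }
}
```

Because the map is sorted, two frequency tables can be walked side by side, and rare entries that sit next to a common word are easy to spot as possible misspellings.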
You might also look at simple measures of sentence structure like sentence length (word count), use of commas, etc. Don't just gather averages; a frequency histogram would be more informative.
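As a rough sketch of such a histogram, the code below maps sentence length in words to the number of sentences of that length. It assumes, naively, that a sentence ends at '.', '!' or '?'; the class and method names are made up for the example:

```java
import java.util.Map;
import java.util.TreeMap;

class SentenceHistogram {
    /** Maps sentence length (in words) to the number of sentences with that length. */
    static Map<Integer, Integer> sentenceLengthHistogram(String text) {
        Map<Integer, Integer> histogram = new TreeMap<>();
        // Naive assumption: '.', '!' and '?' always end a sentence (abbreviations will be miscounted).
        for (String sentence : text.split("[.!?]+")) {
            String trimmed = sentence.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            int words = trimmed.split("\\s+").length;
            histogram.merge(words, 1, Integer::sum);
        }
        return histogram;
    }

    public static void main(String[] args) {
        System.out.println(sentenceLengthHistogram("I came. I saw. I conquered it all!"));
    }
}
```

Two texts can then be compared by the shape of their histograms rather than by a single average.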
This is an area I find fascinating so please keep us up to date on what you come up with.
Bill [ July 29, 2008: Message edited by: William Brogden ]
I seem to remember that zip files were being considered for some of this stuff, something about documents by the same person having a similar compression rate due to the frequency of words in that author's vernacular.
Not sure I am convinced, but it did sound interesting.
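One related, compression-based measure is the normalized compression distance, NCD(a, b) = (C(ab) - min(C(a), C(b))) / max(C(a), C(b)), where C is the compressed size. This may not be exactly the technique remembered above; the sketch below simply assumes DEFLATE via java.util.zip.Deflater as the compressor, with illustrative class and method names. Values near 0 suggest similar texts, values near 1 unrelated ones:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

class CompressionDistance {
    /** Returns the number of bytes DEFLATE produces for the given data. */
    static int compressedSize(byte[] data) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(data);
        deflater.finish();
        byte[] buffer = new byte[4096];
        int size = 0;
        // Only the output length matters, so the buffer is reused and its contents discarded.
        while (!deflater.finished()) {
            size += deflater.deflate(buffer);
        }
        deflater.end();
        return size;
    }

    /** Normalized compression distance between two texts. */
    static double ncd(String a, String b) {
        int ca = compressedSize(a.getBytes(StandardCharsets.UTF_8));
        int cb = compressedSize(b.getBytes(StandardCharsets.UTF_8));
        int cab = compressedSize((a + b).getBytes(StandardCharsets.UTF_8));
        return (double) (cab - Math.min(ca, cb)) / Math.max(ca, cb);
    }

    public static void main(String[] args) {
        String a = "the quick brown fox jumps over the lazy dog and runs far away from the farm ";
        String b = "1234567890 qwertyuiop zxcvbnm ASDFGHJKL 0987654321 POIUYTREWQ MNBVCXZ lkjhgfdsa";
        System.out.println(ncd(a, a));
        System.out.println(ncd(a, b));
    }
}
```

On very short texts the compressor's fixed overhead dominates, so the measure is only meaningful for longer documents.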
Well, I have tried n-gram, LCS (longest common subsequence) and Levenshtein, and I found Levenshtein the easiest. In fact, each of them has its own weaknesses.
Try the code below. It returns a value between 0 and 1: 1 if the two texts are exactly the same, and 0 if they are totally different.
As I said, it still has its weaknesses. If you are interested in this topic too, please come and work together on a better solution.
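A minimal sketch of a Levenshtein-based score with that 0-to-1 range. It is an assumption about the approach, not necessarily the code originally posted; the class and method names are illustrative, and the score is taken as 1 minus the edit distance divided by the length of the longer text:

```java
class LevenshteinSimilarity {
    /** Classic dynamic-programming edit distance, keeping only two rows of the table. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning an empty prefix of a into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into an empty string takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Returns 1.0 for identical texts and 0.0 when every position has to change. */
    static double similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 1.0; // two empty texts are treated as identical
        }
        return 1.0 - (double) distance(a, b) / longest;
    }

    public static void main(String[] args) {
        System.out.println(similarity("kitten", "sitting"));
        System.out.println(similarity("abc", "abc"));
    }
}
```

Time grows with the product of the two text lengths, which is one of the weaknesses to keep in mind for long documents.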
[ July 30, 2008: Message edited by: Cairo Jackson ] [ July 31, 2008: Message edited by: Ulf Dittmer ]
I’ve looked at a lot of different solutions, and in my humble opinion Aspose is the way to go. Here’s the link: http://aspose.com