Compare two text

Cairo Jackson

Joined: Jan 18, 2007
Posts: 14
Hello everyone,
Can anyone point me to an algorithm that calculates the similarity of two texts? I need to write a class that compares two texts and returns a similarity score, for example 1 if the two contents are exactly the same.

Thank you.

[ July 29, 2008: Message edited by: Cairo Jackson ]
Costa Lamona

Joined: Sep 17, 2006
Posts: 29

You can use the String.equals or String.equalsIgnoreCase methods. You can also take a look at their implementations.

If you are in college or high school and this is homework, it is a good idea to use Google, so you get familiar with finding code that way (don't rely on a single result).

I think comparing strings char by char is a fine implementation.

Cairo Jackson

Joined: Jan 18, 2007
Posts: 14
Mm... no. What I mean is: take two text files as input, compare them, and report their "similarity", that is, how similar the two contents are. For example, I want to detect whether plagiarism has happened.
Ulf Dittmer

Joined: Mar 22, 2005
Posts: 42965
There are commercial online services that help with this, but if you want to roll your own, you might start by investigating algorithms like Double Metaphone and the Damerau-Levenshtein edit distance.

A Java implementation of Double Metaphone is part of the Apache Commons Codec library.
William Brogden
Author and all-around good cowpoke

Joined: Mar 22, 2000
Posts: 13037
I just did a Google search for "measure text similarity" and got all sorts of interesting stuff.

Computer analysis of text has a looooong history so this is a big topic. (People trying to prove that Shakespeare did or didn't write X, etc.)

Word frequency analysis is a good start and simple to program. Gather statistics on average word size at the same time. Common misspellings may be found this way. I'm not sure what good Metaphone would be, since it would tend to conceal misspellings. The standard Java library collection classes such as TreeMap are what you need to gather this information.
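To make the word-frequency idea concrete, here is a minimal sketch. The class name WordStats, its method names, and the tokenizing rule (split on anything that is not a letter, digit or apostrophe) are assumptions for illustration, not something from this thread.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class; names are illustrative only.
final class WordStats {

    /** Counts how often each word occurs; TreeMap keeps the words sorted. */
    static Map<String, Integer> wordFrequencies(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        // Assumption: a "word" is a run of letters, digits or apostrophes.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}']+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Average word length in characters, weighted by how often each word occurs. */
    static double averageWordLength(Map<String, Integer> counts) {
        long chars = 0;
        long words = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            chars += (long) e.getKey().length() * e.getValue();
            words += e.getValue();
        }
        return words == 0 ? 0.0 : (double) chars / words;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = wordFrequencies("The cat sat. The dog sat too.");
        System.out.println(counts);
        System.out.println(averageWordLength(counts));
    }
}
```

Two texts can then be compared by looking at how their frequency tables differ; how to turn that difference into a single score is left open here.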

You might also look at simple measures of sentence structure like sentence length (word count), use of commas, etc. Don't just gather averages; a frequency histogram would be more informative.
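A sentence-length histogram could be gathered along these lines. This is only a sketch: the class name SentenceHistogram is made up, and the sentence splitter is deliberately naive (it breaks on '.', '!' and '?', so abbreviations like "e.g." will be miscounted).

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class; names are illustrative only.
final class SentenceHistogram {

    /** Maps sentence length (in words) to the number of sentences of that length. */
    static Map<Integer, Integer> sentenceLengthHistogram(String text) {
        Map<Integer, Integer> histogram = new TreeMap<>();
        // Assumption: sentences end at one or more of '.', '!' or '?'.
        for (String sentence : text.split("[.!?]+")) {
            String trimmed = sentence.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            int words = trimmed.split("\\s+").length;
            histogram.merge(words, 1, Integer::sum);
        }
        return histogram;
    }

    public static void main(String[] args) {
        System.out.println(sentenceLengthHistogram("I came. I saw. I conquered it all!"));
    }
}
```

The same pattern (a TreeMap keyed by the measured value) works for commas per sentence or any other simple count.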

This is an area I find fascinating so please keep us up to date on what you come up with.

[ July 29, 2008: Message edited by: William Brogden ]
Gavin Tranter
Ranch Hand

Joined: Jan 01, 2007
Posts: 333
I seem to remember that zip files were being considered for some of this stuff, something about documents by the same person having a similar compression rate due to the frequency of words in that author's vernacular.

Not sure I am convinced, but it did sound interesting.
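One way to experiment with the compression idea is the normalized compression distance (NCD). To be clear, this is a swapped-in technique and not necessarily the method recalled above. The sketch below assumes DEFLATE via java.util.zip.Deflater as the compressor; the class and method names are made up.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Hypothetical helper class; names are illustrative only.
final class CompressionDistance {

    /** Size in bytes of the DEFLATE-compressed input (zlib wrapper included). */
    static int compressedSize(byte[] data) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        try {
            deflater.setInput(data);
            deflater.finish();
            byte[] buffer = new byte[8192];
            int total = 0;
            while (!deflater.finished()) {
                total += deflater.deflate(buffer);
            }
            return total;
        } finally {
            deflater.end(); // release native resources
        }
    }

    /**
     * NCD(a, b) = (C(ab) - min(C(a), C(b))) / max(C(a), C(b)),
     * where C is the compressed size. Lower means more similar.
     */
    static double ncd(String a, String b) {
        int ca = compressedSize(a.getBytes(StandardCharsets.UTF_8));
        int cb = compressedSize(b.getBytes(StandardCharsets.UTF_8));
        int cab = compressedSize((a + b).getBytes(StandardCharsets.UTF_8));
        return (double) (cab - Math.min(ca, cb)) / Math.max(ca, cb);
    }
}
```

Because of compressor overhead the result is only approximately in the range 0 to 1, and it is unreliable for very short inputs. DEFLATE also only looks back 32 KB, so this sketch is not suitable for large files as written.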
Cairo Jackson

Joined: Jan 18, 2007
Posts: 14
Well, I have tried n-grams, LCS (longest common subsequence) and Levenshtein, and I found Levenshtein the easiest. In fact, each of them has its own weaknesses.

Try the code below. It returns a value between 0 and 1: 1 if the two texts are exactly the same, and 0 if they are totally different.
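The listing that was posted at this point is not preserved, so what follows is not the original code. It is a minimal sketch of a Levenshtein-based similarity that behaves as described (1 for identical texts, 0 for totally different ones). The class name, the method names and the choice to normalize by the longer input's length are assumptions.

```java
// Hypothetical class; not the original poster's code.
final class LevenshteinSimilarity {

    /** Levenshtein edit distance, computed with two rolling rows. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Similarity in [0, 1]; assumption: normalize by the longer input's length. */
    static double similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 1.0; // two empty texts are treated as identical
        }
        return 1.0 - (double) distance(a, b) / longest;
    }
}
```

Running time grows with the product of the two lengths, so for whole files it may be more practical to compare sequences of words or lines rather than single characters.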

As I said, it still has its weaknesses. If you are interested in this topic too, please join in and let's work together on a better solution.

Thank you.

[ July 30, 2008: Message edited by: Cairo Jackson ]
[ July 31, 2008: Message edited by: Ulf Dittmer ]