Trying to use string similarity match algorithms not very effective any ideas how to make better

steve labar
Ranch Hand

Joined: Sep 10, 2008
Posts: 55
I'd like to thank you all in advance for any help you can give me to get a better match on my data. Here is the current situation:

I have names and URLs that are entered into two different applications. Because of this, the names are abbreviated and entered slightly differently from person to person. My job is to map the names in application 2 to the names in application 1.

Here is a small sample of the ones I'm trying to match.
Application 1 | Application 2

AITSMM Technology | AITSMMTechnologyInc
CareGivingmark Rx, Inc. (CVS CareGivingmark Corporation) | Caremark Rx, Inc.
BrinkmanJones Financial Corporation | BrinkmanJonesFinancial
Citysearch.com | CitySearch
(etrade) E*TRADE Financial Corp. | etrade
eLiftIT (First American) | eLiftIT
First American Equity Services (ELS) (formerly Lenders Advantage) | First American Equity Loan Services
Open Technology Solutions, LLC (OTS LLC) | OTS LLC

I'm looking for the best metric or hybrid to help me out.

Right now I loop over the data with a threshold that starts at 1.0: I call the algorithm and test whether there is exactly one match. If not, I decrement the threshold by 0.05 and try again until I get a single match. I return nothing if the threshold drops below 0.60. I am currently using MongeElkan and it is not doing a very good job.
I have these other metrics from a Java library I found online:
http://www.dcs.shef.ac.uk/~sam/stringmetrics.html

==================
AbstractStringMetric metric1 = new Levenshtein();
AbstractStringMetric metric2 = new CosineSimilarity();
AbstractStringMetric metric3 = new EuclideanDistance();
AbstractStringMetric metric4 = new MongeElkan();
AbstractStringMetric metric5 = new DiceSimilarity();
AbstractStringMetric metric6 = new JaroWinkler();
AbstractStringMetric metric7 = new Jaro();
AbstractStringMetric metric8 = new MatchingCoefficient();
AbstractStringMetric metric9 = new NeedlemanWunch();
AbstractStringMetric metric10 = new OverlapCoefficient();
AbstractStringMetric metric11 = new QGramsDistance();
AbstractStringMetric metric12 = new SmithWatermanGotoh();
AbstractStringMetric metric13 = new SmithWatermanGotohWindowedAffine();
AbstractStringMetric metric14 = new Soundex();
=================

These three have been the most accurate in my tests so far:
AbstractStringMetric metric12 = new SmithWatermanGotoh();
AbstractStringMetric metric13 = new SmithWatermanGotohWindowedAffine();
AbstractStringMetric metric14 = new Soundex();

However, there are times when Soundex picks up matches much better than the other two. I'd like to combine them somehow, or maybe use them differently?

As an example, for the target "WebMethods Inc." Soundex finds "BMO" (accuracy: 0.6999999999999997).

However, SmithWatermanGotohWindowedAffine finds "WEBMETHODS, INC", the correct match.
Any ideas?

I also have to match URLs. Examples may look like:
www.anachrentronic.com NA
http://www.caremartestingk.com/wps/portal/client
irequest.pharminsurancceacare.com (test)
https://us.youtrade.com
mirena.thera.com
CVS Vendor CRM App
http://www.myislandking.com
http://www.usfilter.com -dropped

Any help on how to be more effective with my string matches would be great. Also, is that decrementing loop a bad idea? There are times when it gets multiple matches, and it needs to pick the match itself; I cannot hand-check each match.
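
For reference, the threshold loop described in the post could be sketched roughly as below. This is only an illustration under assumptions: `ThresholdMatcher`, `findSingleMatch` and `normalizedLevenshtein` are made-up names, and the hand-written edit-distance metric is a stand-in for whichever library metric is actually plugged in. The sketch counts the threshold in integer steps rather than repeatedly subtracting 0.05, since a value like 0.6999999999999997 may be accumulated rounding from that subtraction.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.ToDoubleBiFunction;

final class ThresholdMatcher {

    /** Stand-in metric (an assumption): 1 - editDistance / maxLength, ignoring case. */
    static double normalizedLevenshtein(String a, String b) {
        String s = a.toLowerCase(Locale.ROOT);
        String t = b.toLowerCase(Locale.ROOT);
        int maxLen = Math.max(s.length(), t.length());
        if (maxLen == 0) {
            return 1.0;
        }
        int[] prev = new int[t.length() + 1];
        int[] curr = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= s.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return 1.0 - (double) prev[t.length()] / maxLen;
    }

    /**
     * Lowers the threshold from 1.00 to 0.60 in steps of 0.05 and returns the
     * candidate if exactly one scores at or above the current threshold.
     * Returns null when nothing qualifies or when the match is ambiguous.
     */
    static String findSingleMatch(String target, List<String> candidates,
                                  ToDoubleBiFunction<String, String> metric) {
        for (int step = 100; step >= 60; step -= 5) {
            double threshold = step / 100.0;
            List<String> hits = new ArrayList<>();
            for (String candidate : candidates) {
                if (metric.applyAsDouble(target, candidate) >= threshold) {
                    hits.add(candidate);
                }
            }
            if (hits.size() == 1) {
                return hits.get(0);
            }
            if (hits.size() > 1) {
                // Lowering the threshold can only add hits, so this stays ambiguous.
                return null;
            }
        }
        return null;
    }
}
```

Called as `ThresholdMatcher.findSingleMatch("Citysearch.com", names, ThresholdMatcher::normalizedLevenshtein)`. One thing the sketch makes visible: once two candidates clear the same threshold, further decrements cannot produce a single match, so the loop amounts to "take the best score if it is at least 0.60 and nothing else falls in the same 0.05 band".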
Martijn Verburg
author
Bartender

Joined: Jun 24, 2003
Posts: 3274

The only thing I can suggest here is that you run X algorithms over each data set and give your algorithms a weighting. So alg 1 might have a weighting of .9, alg 2 a weighting of .7 etc.
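
One possible reading of that suggestion is a weighted average of the individual scores. The sketch below is an assumption about how it could look, not a tested recipe: `WeightedScorer` is a made-up name, and the two toy metrics (`prefixRatio`, `containment`) are placeholders for real metrics such as the ones listed in the first post. The weights 0.9 and 0.7 are just the example figures from above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.ToDoubleBiFunction;

final class WeightedScorer {

    /** Pairs a metric with its weight. */
    private static final class Weighted {
        final ToDoubleBiFunction<String, String> metric;
        final double weight;

        Weighted(ToDoubleBiFunction<String, String> metric, double weight) {
            this.metric = metric;
            this.weight = weight;
        }
    }

    private final List<Weighted> metrics = new ArrayList<>();

    WeightedScorer add(ToDoubleBiFunction<String, String> metric, double weight) {
        metrics.add(new Weighted(metric, weight));
        return this;
    }

    /** Weighted average of all registered metrics; 0.0 if none are registered. */
    double score(String a, String b) {
        double weightedSum = 0.0;
        double totalWeight = 0.0;
        for (Weighted w : metrics) {
            weightedSum += w.weight * w.metric.applyAsDouble(a, b);
            totalWeight += w.weight;
        }
        return totalWeight == 0.0 ? 0.0 : weightedSum / totalWeight;
    }

    /** Toy metric (placeholder): shared prefix length over the longer length, ignoring case. */
    static double prefixRatio(String a, String b) {
        String s = a.toLowerCase(Locale.ROOT);
        String t = b.toLowerCase(Locale.ROOT);
        int maxLen = Math.max(s.length(), t.length());
        if (maxLen == 0) {
            return 1.0;
        }
        int i = 0;
        while (i < Math.min(s.length(), t.length()) && s.charAt(i) == t.charAt(i)) {
            i++;
        }
        return (double) i / maxLen;
    }

    /** Toy metric (placeholder): 1.0 if either string contains the other, ignoring case. */
    static double containment(String a, String b) {
        String s = a.toLowerCase(Locale.ROOT);
        String t = b.toLowerCase(Locale.ROOT);
        return (s.contains(t) || t.contains(s)) ? 1.0 : 0.0;
    }
}
```

Usage would be along the lines of `new WeightedScorer().add(WeightedScorer::prefixRatio, 0.9).add(WeightedScorer::containment, 0.7).score(name1, name2)`. Dividing by the total weight keeps the combined score in the 0 to 1 range, so it can still be compared against a cut-off like 0.60.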


Cheers, Martijn
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41066
If Soundex does a credible, if not perfect, job, then check out Metaphone (or DoubleMetaphone for non-English texts). It is a definite improvement on Soundex; e.g., it still works when the two strings do not start with the same letter.
Implementations are part of Apache Commons Codec, http://commons.apache.org/codec/ (maybe the RefinedSoundex in that library also helps).
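
To make the starting-letter point concrete, here is a small hand-written version of the classic Soundex rules. It is a simplified sketch for illustration only (the class name `SoundexDemo` is made up, and this is not the Commons Codec or SimMetrics implementation); "Karemark" is a hypothetical misspelling of a name from the sample data.

```java
import java.util.Locale;

final class SoundexDemo {

    // Digit for each letter A..Z; '0' marks vowels and other letters that get no digit.
    private static final String CODES = "01230120022455012623010202";

    /** Classic four-character Soundex code; returns "" if the input has no letters A-Z. */
    static String soundex(String word) {
        StringBuilder out = new StringBuilder();
        char last = '0';
        for (char c : word.toUpperCase(Locale.ROOT).toCharArray()) {
            if (c < 'A' || c > 'Z') {
                continue; // skip digits, spaces and punctuation
            }
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                out.append(c); // the first letter is kept verbatim
                last = code;
            } else {
                if (code != '0' && code != last) {
                    out.append(code);
                }
                if (c != 'H' && c != 'W') {
                    last = code; // H and W do not separate equal codes, vowels do
                }
            }
            if (out.length() == 4) {
                break;
            }
        }
        while (out.length() > 0 && out.length() < 4) {
            out.append('0'); // pad short codes with zeros
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(soundex("Caremark")); // C656
        System.out.println(soundex("Karemark")); // K656
    }
}
```

Because the first letter is copied into the code unchanged, the two spellings differ in their first character even though the remaining digits agree, which is the limitation mentioned above.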

