I'm working on a project that involves OCR input and matching parts of that input to existing data. What I need to match comes in three forms: a name, an address, or a date. I'm looking for help finding the best techniques for this.
Currently I'm getting fairly decent matching. I've used the following techniques for the name:
1) Comparing them directly using equalsIgnoreCase().
2) Dropping punctuation (including any whitespace greater than one character) then doing #1.
3) Dropping vowels then doing #1.
4) Doing #2, then doing #3, then doing #1.
5) Doing #2, generating a hash value, then comparing the value.
6) Doing #2, then comparing characters and character pairs and returning true if a specified % of them match.
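For illustration, steps 2, 3 and 6 above could be sketched roughly as follows. This is not the code from the project: the class and method names are hypothetical, "whitespace greater than one character" is read as collapsing runs of whitespace to a single space, and the character-pair score is a Dice coefficient over bigrams, which is one common way to turn "a specified % of pairs match" into a number.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class NameMatcher {

    // Step 2 (assumed reading): upper-case, drop punctuation, collapse whitespace.
    // Note: this keeps only A-Z, 0-9 and whitespace, so accented letters are dropped.
    static String normalize(String s) {
        return s.toUpperCase(Locale.ROOT)
                .replaceAll("[^A-Z0-9\\s]", "")
                .replaceAll("\\s+", " ")
                .trim();
    }

    // Step 3: drop vowels from an already normalized string.
    static String dropVowels(String s) {
        return s.replaceAll("[AEIOU]", "");
    }

    // Count each adjacent character pair in the string.
    static Map<String, Integer> bigrams(String s) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 1 < s.length(); i++) {
            counts.merge(s.substring(i, i + 2), 1, Integer::sum);
        }
        return counts;
    }

    // Step 6: Dice coefficient over character pairs, from 0.0 to 1.0.
    static double similarity(String a, String b) {
        String x = normalize(a);
        String y = normalize(b);
        if (x.equals(y)) {
            return 1.0;
        }
        int total = Math.max(0, x.length() - 1) + Math.max(0, y.length() - 1);
        if (total == 0) {
            return 0.0; // too short to have any pairs, and not equal
        }
        Map<String, Integer> bx = bigrams(x);
        Map<String, Integer> by = bigrams(y);
        int shared = 0;
        for (Map.Entry<String, Integer> e : bx.entrySet()) {
            shared += Math.min(e.getValue(), by.getOrDefault(e.getKey(), 0));
        }
        return 2.0 * shared / total;
    }
}
```

A caller would pick a cut-off, for example treating `NameMatcher.similarity(ocrName, dbName) >= 0.6` as a match. That threshold is a guess and would need tuning against real data.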
I've made the address a weighted match. If the street address and either the zip code OR the state + city match then the address is considered a match. This obviously requires a match on the street address which is the most difficult part. To do that I'm doing the following:
1) Comparing them directly via equalsIgnoreCase().
2) Removing punctuation and spaces and doing #1.
3) Replacing common words and variants of those words (PO, P O, P.O., P. O., POST OFFICE, all become just "PO") then doing #1.
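As a rough sketch of the street normalization and the weighted rule described above (street AND (zip OR city + state)), something like the following might work. The class, the `Address` holder and the variant table are all hypothetical; a real table of word variants would be far longer.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class AddressMatcher {

    // Hypothetical variant table; only a few illustrative entries.
    private static final Map<String, String> VARIANTS = new HashMap<>();
    static {
        VARIANTS.put("BOULEVARD", "BLVD");
        VARIANTS.put("BVD", "BLVD");
        VARIANTS.put("STREET", "ST");
        VARIANTS.put("FIRST", "1ST");
    }

    // Upper-case, strip punctuation, collapse whitespace, then map word variants.
    static String canonicalStreet(String raw) {
        String s = raw.toUpperCase(Locale.ROOT)
                .replaceAll("[^A-Z0-9\\s]", "")  // "P.O." becomes "PO", "12`34" becomes "1234"
                .replaceAll("\\s+", " ")
                .trim()
                .replaceAll("\\b(P O|POST OFFICE)\\b", "PO");
        StringBuilder out = new StringBuilder();
        for (String token : s.split(" ")) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(VARIANTS.getOrDefault(token, token));
        }
        return out.toString();
    }

    // Minimal value holder, assumed for this example only.
    static final class Address {
        final String street, city, state, zip;

        Address(String street, String city, String state, String zip) {
            this.street = street;
            this.city = city;
            this.state = state;
            this.zip = zip;
        }
    }

    // The street must match, plus either the zip code or the city and state.
    static boolean matches(Address a, Address b) {
        boolean street = canonicalStreet(a.street).equals(canonicalStreet(b.street));
        boolean zip = a.zip.equals(b.zip);
        boolean cityState = a.city.equalsIgnoreCase(b.city)
                && a.state.equalsIgnoreCase(b.state);
        return street && (zip || cityState);
    }
}
```

Normalizing both sides to one canonical string keeps the comparison itself a plain `equals`, so a fuzzier comparison could be swapped in later without touching the variant table.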
The date matching is junk: if it's not the exact date (in the right format, with a small margin for cleanup such as removing anything that's not a digit), it won't match. Right now, on the subjects that are matched on name and address, I'm getting around 80% matched. That should still be better, I think. The group that I have to match on name and date does poorly, though: a little shy of 50%, and the biggest holdup seems to be the date.
Any advice? Please? Good books that really get into this? Good algorithms or techniques for matching things like addresses or names? Suggestions?
Could you give some (10-20) examples of the different formats the dates come in?
I'm currently trying to write a program to parse NOAA weather report texts and plot the conditions on a chart. Some of the data comes in varying formats and is a challenge to parse. I enjoy writing code and thought that a finite state machine approach would work for the day of week header. So far it has.
Here are some sample days/times from the NOAA weather texts: today and tonight thu and thu night tonight mon night today through fri this afternoon and tonight this afternoon tonight wed night and thru
Do you have any budget for commercial address standardization software? We use Trillium and there are others out there. Among other things, it fixes up abbreviations and terms like "blvd" and "cir", and validates the address against a USPO list of every deliverable address in the country. It tells you stuff like "That's an apartment building ... you should have an apartment number". If it's all good, it gives you back a 9-digit zip. End of sales pitch. [ September 15, 2005: Message edited by: Stan James ]
A good question is never answered. It is not a bolt to be tightened into place but a seed to be planted and to bear more seed toward the hope of greening the landscape of the idea. John Ciardi
Norm: The date should always come in the format of YYYYMMDD. The issue is that, being OCR data, it's not uncommon to see something like 0032113 which was supposed to be 20032113 but the OCR didn't get the 2. Or another example would be 19?808?7 instead of 19580827. I believe that in both these cases the dates should be considered a match. So for example I may have
"JOHN", "PAUL", "R" and "19?808?7" in my OCR input. I compare that against an existing database (it's not literally a database but the point is I have it on file somewhere) and find a JOHN, PAUL R. and I match the name. So far so good. Now I have to look at the date. In my database I have 19580827 and the file says 19?808?7. I should be able to figure out that it's close enough, but I don't know how to go about that for dates.
As you can see there is no problem parsing the data. The problem is with taking the OCR input and matching it against existing data since the OCR input is inherently bad but close enough that with proper techniques/algorithms it could be matched.
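One possible way to decide that an OCR'd date is "close enough" is an edit distance against the stored YYYYMMDD string, with unreadable characters treated as wildcards. The sketch below is only that: the class name, the method names and both thresholds are assumptions, and a fuzzy date check like this is only safe as a confirmation on top of a name match, since two genuinely different dates can also be one digit apart.

```java
class DateMatcher {

    // Assumed limits; both would need tuning against real OCR output.
    static final int MAX_UNREADABLE = 2;
    static final int MAX_EDITS = 1;

    // Levenshtein distance between the OCR text and the reference date.
    // A non-digit in the OCR text (such as '?') is treated as an unreadable
    // character that may stand for any digit, so substituting it is free.
    static int distance(String ocr, String reference) {
        int[] prev = new int[reference.length() + 1];
        int[] curr = new int[reference.length() + 1];
        for (int j = 0; j <= reference.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= ocr.length(); i++) {
            curr[0] = i;
            char c = ocr.charAt(i - 1);
            for (int j = 1; j <= reference.length(); j++) {
                boolean free = c == reference.charAt(j - 1) || !Character.isDigit(c);
                int substitute = prev[j - 1] + (free ? 0 : 1);
                int delete = prev[j] + 1;      // extra character in the OCR text
                int insert = curr[j - 1] + 1;  // character the OCR dropped
                curr[j] = Math.min(substitute, Math.min(delete, insert));
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[reference.length()];
    }

    // Number of characters the OCR could not read as digits.
    static int unreadable(String ocr) {
        int count = 0;
        for (int i = 0; i < ocr.length(); i++) {
            if (!Character.isDigit(ocr.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    static boolean closeEnough(String ocr, String reference) {
        return unreadable(ocr) <= MAX_UNREADABLE
                && distance(ocr, reference) <= MAX_EDITS;
    }
}
```

With these limits, `closeEnough("19?808?7", "19580827")` and `closeEnough("0032113", "20032113")` both accept, while an unrelated date such as 20011231 is rejected against 19580827.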
Stan: No, I can't rely on something like that. I honestly don't care if the address is a valid address, I simply care that it matches what I already have in the database. If I have "1234 1st Washington Boulevard" in my database and I get OCR input that looks like " 12`34 first washington bvd." it should still match. In fact, with my existing code that actually would match since I pull out punctuation, and white space as well as replace common words like blvd, bvd, etc. with a standard version of that word before comparing. That code, however, is very poor and something I just did a few hours ago in about 10 minutes. I'll even show it, despite my embarrassment at how poor it is!
Even that crap got me about 7% more matches.
Anyway... suggestions, help, books, tutorials, places to find good discussion, criticism of code/techniques etc. is all welcome.