| Author |
Close words in vocabulary
|
William Brogden
Author and all-around good cowpoke
Rancher
Joined: Mar 22, 2000
Posts: 11862
|
|
If you search for a word that Lucene does not have in the index, is there some way to get a list of words that are "close" in some sense? Alphabetically or phonetically, for example. When I was working with full-text searching in legal documents, being able to find "close" words was very important, given the variations in spelling of people's names, for example. Bill
|
 |
Erik Hatcher
Author
Ranch Hand
Joined: Jun 11, 2002
Posts: 111
|
|
Originally posted by William Brogden: If you search for a word that Lucene does not have in the index, is there some way to get a list of words that are "close" in some sense? Alphabetically or phonetically, for example. When I was working with full-text searching in legal documents, being able to find "close" words was very important, given the variations in spelling of people's names, for example. Bill
Here are two different types of live examples:
http://www.lucenebook.com/search?query=stemming - look at the highlighted words and compare them to the query expression. I'm using the Snowball stemmer (part of the Lucene Sandbox) to accomplish this.
http://www.lucenebook.com/search?query=eric%7E - this one uses a FuzzyQuery, which relies on the Levenshtein distance algorithm to find words that are close enough. (For future reference, I spell my name with a "k"!)
There are other techniques that can be employed for seeing through transliterations and misspellings. In fact, Bob Carpenter contributed a wonderful case study to Chapter 10 describing this in detail using his LingPipe project. This brings up another great selling point of the book: the Case Studies chapter, which covers Nutch, Searchblox, Michaels.com, TheServerSide, jGuru, and Alias-i (LingPipe). Read it to see how Lucene is leveraged in some heavy-duty systems. I learned a lot by reading what they contributed, that's for sure!
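To make the "close enough" idea concrete, here is a minimal sketch of Levenshtein distance used to pick near matches out of a word list. It is the textbook algorithm, not Lucene's FuzzyQuery code, and the class name `CloseWords`, the method names, and the sample vocabulary are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CloseWords {
    /** Number of single-character inserts, deletes, or substitutions needed to turn a into b. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j inserts
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletes
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows instead of allocating a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Returns the vocabulary words within maxDistance edits of the query, in vocabulary order. */
    static List<String> closeWords(String query, List<String> vocabulary, int maxDistance) {
        List<String> hits = new ArrayList<>();
        for (String word : vocabulary) {
            if (levenshtein(query, word) <= maxDistance) {
                hits.add(word);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical vocabulary; a real one would come from the index's term list.
        List<String> vocabulary = Arrays.asList("erik", "erica", "lucene");
        System.out.println(closeWords("eric", vocabulary, 1));
    }
}
```

Scanning every term like this is fine for a small vocabulary but gets slow on a large index, so treat it as an illustration of the distance measure rather than a search strategy.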
|
Co-author of Lucene in Action
|
 |
Otis Gospodnetic
Author
Greenhorn
Joined: Dec 30, 2004
Posts: 23
|
|
There is another relevant solution: synonym injection via the Analyzer. Here is some context: http://www.lucenebook.com/search?query=synonym. The code that comes with the book includes a synonym engine. Otis
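As a rough picture of what synonym injection does, the sketch below expands a list of terms so that each synonym sits at the same position as the original word (a position increment of 0), which is how Lucene lets a query for either word match. It uses plain Java collections rather than Lucene's Analyzer classes, and it is not the synonym engine from the book; `SynonymInjectionSketch`, `Token`, and `inject` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SynonymInjectionSketch {
    /** A term plus its position increment; 0 means "same position as the previous term". */
    static final class Token {
        final String text;
        final int positionIncrement;

        Token(String text, int positionIncrement) {
            this.text = text;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            return text + "(+" + positionIncrement + ")";
        }
    }

    /** Emits each term at a new position, followed by its synonyms stacked on that same position. */
    static List<Token> inject(List<String> terms, Map<String, List<String>> synonyms) {
        List<Token> out = new ArrayList<>();
        for (String term : terms) {
            out.add(new Token(term, 1));
            List<String> extras = synonyms.getOrDefault(term, Collections.<String>emptyList());
            for (String synonym : extras) {
                out.add(new Token(synonym, 0));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical synonym table; a real one might be loaded from a thesaurus.
        Map<String, List<String>> synonyms = new HashMap<>();
        synonyms.put("quick", Arrays.asList("fast", "speedy"));
        System.out.println(inject(Arrays.asList("quick", "fox"), synonyms));
    }
}
```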
|
Lucene in Action: http://www.manning.com/lucene
|
 |
William Brogden
Author and all-around good cowpoke
Rancher
Joined: Mar 22, 2000
Posts: 11862
|
|
|
So if I understand you, there is no phonetic "sounds like" mechanism right now, but it looks like it would be easy to add one. The Jakarta Commons Codec toolkit has some implementations of phonetic coding, including Metaphone, which I have used in the legal document searcher. Of course, "sounds like" is different for different languages, and probably even regional dialects within languages.
|
 |
Erik Hatcher
Author
Ranch Hand
Joined: Jun 11, 2002
Posts: 111
|
|
Originally posted by William Brogden: So if I understand you, there is no phonetic "sounds like" mechanism right now, but it looks like it would be easy to add one. The Jakarta Commons Codec toolkit has some implementations of phonetic coding, including Metaphone, which I have used in the legal document searcher. Of course, "sounds like" is different for different languages, and probably even regional dialects within languages.
In fact, Metaphone from Jakarta Commons Codec is an example I wrote about in the Analysis chapter! Yes, it is very easy to integrate into an analyzer. Check out the book's source code (the lia.analysis package), which is freely available, to see for yourself.
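For anyone who wants to see the "sounds like" idea without pulling in a library, here is a minimal sketch of phonetic coding. It uses American Soundex, a simpler scheme than Metaphone, swapped in only so the example stays self-contained; it is not the book's analyzer and not the Commons Codec implementation. The class name `SoundexSketch` is made up, and the code assumes plain ASCII English input.

```java
class SoundexSketch {
    // Soundex digit for each letter a..z; '0' marks vowels and other uncoded letters.
    private static final String CODES = "01230120022455012623010202";

    /** Returns the four-character American Soundex code, or "0000" if the word has no ASCII letters. */
    static String soundex(String word) {
        StringBuilder out = new StringBuilder();
        char last = '0';
        for (char ch : word.toLowerCase().toCharArray()) {
            if (ch < 'a' || ch > 'z') {
                continue; // ignore anything that is not an ASCII letter
            }
            char code = CODES.charAt(ch - 'a');
            if (out.length() == 0) {
                out.append(Character.toUpperCase(ch)); // the first letter is kept as-is
                last = code;
                continue;
            }
            if (ch == 'h' || ch == 'w') {
                continue; // h and w do not separate letters that share a code
            }
            if (code != '0' && code != last) {
                out.append(code);
            }
            last = code; // a vowel resets 'last', so a repeated code after it is kept
            if (out.length() == 4) {
                break;
            }
        }
        while (out.length() < 4) {
            out.append('0'); // pad short codes with zeros
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Names that are spelled differently but sound alike share a code.
        System.out.println(soundex("Smith") + " " + soundex("Smyth"));
        System.out.println(soundex("Robert") + " " + soundex("Rupert"));
    }
}
```

In an index, the idea would be to store the phonetic code alongside (or instead of) each term and apply the same coding to the query word, so that variant spellings of a name land on the same term.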
|
 |