Close words in vocabulary

William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 11862
If you search for a word that Lucene does not have in the index, is there some way to get a list of words that are "close" in some sense, alphabetically or phonetically for example? When I was working with full-text searching in legal documents, being able to find "close" words was very important, given the variations in the spelling of people's names.
Bill
Erik Hatcher
Author
Ranch Hand

Joined: Jun 11, 2002
Posts: 111
Here are two different types of live examples:

http://www.lucenebook.com/search?query=stemming - look at the highlighted words and compare them to the query expression. I'm using the Snowball stemmer (part of the Lucene Sandbox) to accomplish this.

http://www.lucenebook.com/search?query=eric%7E - this one uses a FuzzyQuery, which relies on the Levenshtein distance algorithm to find words that are close enough. (For future reference, I spell my name with a "k"!)
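
For readers who have not met Levenshtein distance before, here is a minimal sketch of the measure in plain Java. It is not Lucene's FuzzyQuery code; the class and method names are made up for illustration, and the similarity score is just one simple way to normalize the distance, not necessarily the formula Lucene applies.

```java
/** Illustrative sketch only: classic dynamic-programming Levenshtein (edit) distance. */
final class EditDistanceSketch {

    /** Minimum number of single-character insertions, deletions, or substitutions turning a into b. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // building the first j characters of b from "" takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // reducing the first i characters of a to "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows instead of allocating a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Hypothetical normalization to [0, 1]: 1 minus the distance over the longer length. */
    static double similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 1.0 : 1.0 - (double) distance(a, b) / longest;
    }

    public static void main(String[] args) {
        System.out.println(distance("eric", "erik"));   // 1 (one substitution)
        System.out.println(similarity("eric", "erik")); // 0.75
    }
}
```

A fuzzy search then amounts to keeping the indexed terms whose similarity to the query word is above some threshold.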

There are other techniques that can be employed for seeing through transliterations and misspellings. In fact, Bob Carpenter contributed a wonderful case study to Chapter 10 describing this in detail using his LingPipe project.

This brings up another great selling point of the book: the Case Studies chapter. It has case studies of Nutch, Searchblox, Michaels.com, TheServerSide, jGuru, and Alias-i (LingPipe). Read it to see how Lucene is leveraged in some heavy-duty systems. I learned a lot by reading what they contributed, that's for sure!


Co-author of Lucene in Action
Otis Gospodnetic
Author
Greenhorn

Joined: Dec 30, 2004
Posts: 23
There is another relevant solution: synonym injection via the Analyzer. Here is some context: http://www.lucenebook.com/search?query=synonym

The code that comes with the book includes a synonym engine.
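
To make the idea concrete, here is a minimal sketch in plain Java of what synonym injection does during analysis. It is not the synonym engine from the book's code and it does not use Lucene's API; the class, the Token type, and the synonym map are all hypothetical. The point it illustrates is that each synonym is emitted at the same position as the original word (a position increment of 0), so a search for any of the variants can match.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only: injecting synonyms into a token stream at the same position. */
final class SynonymInjectionSketch {

    /** A token and its position increment; 0 means "same position as the previous token". */
    static final class Token {
        final String text;
        final int positionIncrement;

        Token(String text, int positionIncrement) {
            this.text = text;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            return text + "(+" + positionIncrement + ")";
        }
    }

    /** Emits every word at a new position, followed by its synonyms stacked on that position. */
    static List<Token> inject(List<String> words, Map<String, List<String>> synonyms) {
        List<Token> out = new ArrayList<>();
        for (String word : words) {
            out.add(new Token(word, 1));
            List<String> alternatives = synonyms.get(word);
            if (alternatives == null) {
                alternatives = Collections.emptyList();
            }
            for (String alternative : alternatives) {
                out.add(new Token(alternative, 0));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> synonyms = new HashMap<>();
        synonyms.put("quick", Arrays.asList("fast", "speedy"));
        System.out.println(inject(Arrays.asList("quick", "fox"), synonyms));
        // prints [quick(+1), fast(+0), speedy(+0), fox(+1)]
    }
}
```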

Otis


Lucene in Action: http://www.manning.com/lucene
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 11862
So if I understand you, there is no phonetic "sounds like" mechanism right now, but it looks like it would be easy to add one. The Jakarta Commons Codec toolkit has some implementations of phonetic coding, including Metaphone, which I have used in the legal document searcher. Of course, "sounds like" is different for different languages, and probably even for regional dialects within a language.
Erik Hatcher
Author
Ranch Hand

Joined: Jun 11, 2002
Posts: 111
In fact, Metaphone from Jakarta Commons Codec is an example I wrote about in the Analysis chapter! Yes, it is very easy to integrate into an analyzer. Check out the book's source code (lia.analysis package), which is freely available, to see for yourself.
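
As a rough illustration of the "sounds like" idea, here is a self-contained sketch of a phonetic encoder in plain Java. Note that it implements American Soundex, a simpler code than Metaphone, chosen only because it fits in a few lines; it is not the Commons Codec implementation and not the analyzer from the book, and the class and method names are made up. The idea carries over: index the phonetic code of each term, encode the query word the same way, and spelling variants that share a code will match.

```java
/** Illustrative sketch only: American Soundex, a simple English-oriented phonetic code. */
final class SoundexSketch {

    // Code for each letter A..Z: digits for consonant groups, '0' for vowels and Y, '-' for H and W.
    private static final String CODES = "0123012-02245501262301-202";

    /** Returns the 4-character Soundex code, or "" if the word contains no ASCII letters. */
    static String encode(String word) {
        StringBuilder out = new StringBuilder(4);
        char last = '0';
        for (int i = 0; i < word.length() && out.length() < 4; i++) {
            char c = Character.toUpperCase(word.charAt(i));
            if (c < 'A' || c > 'Z') {
                continue; // ignore anything that is not an ASCII letter
            }
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                out.append(c); // the first letter is kept as-is
                last = code;
            } else if (code != '-') {
                // H and W are skipped entirely, so they do not separate equal codes.
                if (code != '0' && code != last) {
                    out.append(code);
                }
                last = code; // a vowel resets the run, allowing a repeated code after it
            }
        }
        if (out.length() == 0) {
            return "";
        }
        while (out.length() < 4) {
            out.append('0'); // pad short codes with zeros
        }
        return out.toString();
    }

    /** Two words "sound alike" in this sketch when their codes are equal and non-empty. */
    static boolean soundsAlike(String a, String b) {
        String code = encode(a);
        return !code.isEmpty() && code.equals(encode(b));
    }

    public static void main(String[] args) {
        System.out.println(encode("Robert"));            // R163
        System.out.println(encode("Rupert"));            // R163
        System.out.println(soundsAlike("Erik", "Eric")); // true
    }
}
```

Because the letter groupings are tuned to English pronunciation, a different language or dialect would need a different encoder.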
 
 