Erik Hatcher

Author
since Jun 11, 2002

Recent posts by Erik Hatcher

Also, Lucene's scoring/ranking mechanisms make it stand out well above what you can get from a general RDBMS.
This is where Solr really shines... it has built-in replication, so you can have a master indexing server and any number of load-balanced replicas.

Jeanne Boyarsky wrote:This week, we're delighted to have Michael McCandless, Erik Hatcher, and Otis Gospodnetic helping to answer questions about the new book Lucene in Action. See table of contents and sample chapters one and three online.

The promotion starts Tuesday, August 3rd 2010 and will end on Friday, August 6th 2010.

We'll be selecting four random posters in this forum to win a free copy of the book provided by the publisher, Manning.

Please see the Book Promotion page to ensure your best chances at winning!

Posts in this welcome thread are not eligible for the drawing.



Thanks! I'll be tuning in a bit today, then away for a few days, and then back. Now off to catch up on the posts already made today.
Thanks to JavaRanch for hosting Otis and me and allowing us to talk about a topic we love, Lucene!

I regret that I cannot continually tune in to this forum as I'm already over-extended in e-mail forums. I'm very active on the lucene-user e-mail list and am always an e-mail click away (from the lucenebook.com site). Tune in to the site to stay up-to-date with the fun things Otis and I are doing with the site itself and book updates, errata, and announcements.

Erik

Originally posted by Siripa Siangklom:
Is there a PHP port of the Lucene search engine?



Not to my knowledge... but you can call Java from PHP, right? That would be the ideal way to integrate it.

Originally posted by Gian Franco Casula:
If I understood correctly Lucene
can perform searches on file systems.
Is it possible to integrate this
functionality with database searches
as well, in a way making it independent
of where a file is stored?



Where the text comes from is of no concern whatsoever to Lucene. If you have text, Lucene can index it, and search it. How you tie what is indexed back to your domain is your concern, not Lucene's.

Originally posted by William Brogden:
So if I understand you, there is no phonetic "sounds like" mechanism right now, but it looks like it would be easy to add one. The Jakarta Commons Codec toolkit has some implementations of phonetic coding - including Metaphone - which I have used in the legal document searcher. Of course, "sounds like" is different for different languages, and probably even regional dialects within languages.



In fact, Metaphone from Jakarta Commons Codec is an example I wrote about in the Analysis chapter! Yes, it is very easy to integrate into an analyzer. Check out the book's source code (the lia.analysis package), freely available here, to see for yourself.
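The book's example wires Commons Codec's Metaphone into an analyzer. As a self-contained illustration of the phonetic-coding idea - not the Metaphone algorithm itself, and not Lucene API code - here is a simplified Soundex encoder of the kind a custom token filter could apply to each token, so that "sounds-alike" words index to the same term:

```java
// Simplified Soundex encoder -- a self-contained stand-in for the
// Metaphone encoder from Jakarta Commons Codec used in the book.
// Words that sound alike collapse to the same 4-character code.
public class SimpleSoundex {
    // Digit for each letter a..z (0 = vowel/ignored).
    private static final String CODES = "01230120022455012623010202";

    public static String encode(String word) {
        if (word == null || word.isEmpty()) return "";
        String w = word.toUpperCase();
        StringBuilder out = new StringBuilder();
        out.append(w.charAt(0));            // keep the first letter
        char last = code(w.charAt(0));
        for (int i = 1; i < w.length() && out.length() < 4; i++) {
            char c = code(w.charAt(i));
            // Skip vowels and collapse adjacent duplicate codes.
            if (c != '0' && c != last) out.append(c);
            last = c;
        }
        while (out.length() < 4) out.append('0');  // pad to 4 chars
        return out.toString();
    }

    private static char code(char c) {
        return (c >= 'A' && c <= 'Z') ? CODES.charAt(c - 'A') : '0';
    }

    public static void main(String[] args) {
        System.out.println(SimpleSoundex.encode("Robert")); // R163
        // "eric" and "erik" encode identically, so a phonetic
        // analyzer would match either spelling:
        System.out.println(SimpleSoundex.encode("eric"));   // E620
        System.out.println(SimpleSoundex.encode("erik"));   // E620
    }
}
```

In an analyzer, a token filter would emit this code (instead of, or alongside, the original token) at both index and query time, so the same transformation applies to both sides of the match.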

Originally posted by Erik Hatcher:

There are many strange things that can occur between parsing and analyzing the query expression that are best avoided by forming queries directly with the API whenever possible - even if that means working with an analyzer directly as QueryParser does under the covers.



This can be a bit confusing, so bear with me. Lucene includes a parser to turn text expressions like "this AND that" into a query. You can qualify the field to search using syntax like "field1:value OR field2:value" and so on. The query parser syntax is detailed here. When the expression is parsed, the analyzer is also run on each piece of the expression. (We won't go into the analysis details here; it's a whole chapter's worth!)

A Lucene Query can be constructed through the API, bypassing the parsing/analysis steps.

You can aggregate parsed expressions with API-created queries using BooleanQuery.

How you actually build queries in your system depends on what you want to do, but I cringe when I see code that string-concatenates clauses in expression syntax to be parsed, when it could be done more rigorously, and with less room for error, using other techniques.
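The contrast can be sketched without Lucene itself. In the toy model below, the class names echo Lucene's TermQuery and BooleanQuery, but this is a self-contained illustration, not the Lucene API: clauses are combined as objects, so user-supplied values are never run through a parser and cannot collide with query syntax.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of composing a boolean query through objects rather than
// concatenating query-syntax strings. Names echo Lucene's TermQuery
// and BooleanQuery, but this is a sketch, not the Lucene API.
public class QueryBuilding {

    public static class TermQuery {
        final String field, term;
        public TermQuery(String field, String term) {
            this.field = field;
            this.term = term;
        }
        @Override public String toString() { return field + ":" + term; }
    }

    public static class BooleanQuery {
        private final List<String> clauses = new ArrayList<>();
        public void addRequired(TermQuery q)   { clauses.add("+" + q); }
        public void addOptional(TermQuery q)   { clauses.add(q.toString()); }
        public void addProhibited(TermQuery q) { clauses.add("-" + q); }
        @Override public String toString() { return String.join(" ", clauses); }
    }

    public static void main(String[] args) {
        BooleanQuery q = new BooleanQuery();
        q.addRequired(new TermQuery("title", "lucene"));
        // "AND" here is just data, never parsed, so it cannot be
        // mistaken for a boolean operator in an expression string.
        q.addOptional(new TermQuery("author", "AND"));
        System.out.println(q); // +title:lucene author:AND
    }
}
```

With real Lucene, the same shape applies: build TermQuery objects and add them to a BooleanQuery, and reserve QueryParser for expressions a human actually typed.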

Originally posted by William Brogden:
If you try for a search using a word that Lucene does not have in the index, is there some way you can get a list of words that are "close" in some sense? Alphabetic or phonetic for example. When I was working with full text searching in legal documents, being able to find "close" words was very important, given variations in spelling of people's names for example.
Bill



Here are two different types of live examples:

http://www.lucenebook.com/search?query=stemming - look at the highlighted words and compare them to the query expression. I'm using the Snowball stemmer (part of the Lucene Sandbox) to accomplish this.

http://www.lucenebook.com/search?query=eric%7E - this one is using a FuzzyQuery, which uses the Levenshtein distance algorithm, to find words close enough. (for future reference, I spell my name with a "k"!)
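The Levenshtein edit distance that FuzzyQuery relies on counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one term into another. A self-contained sketch of the standard dynamic-programming computation (this is the algorithm, not Lucene's internal code):

```java
// Levenshtein (edit) distance -- the measure FuzzyQuery uses to decide
// which indexed terms are "close enough" to the query term.
public class Levenshtein {
    public static int distance(String a, String b) {
        // Classic two-row dynamic programming over the edit matrix.
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1,  // insertion
                                            prev[j] + 1),     // deletion
                                   prev[j - 1] + cost);       // substitution
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        // One substitution turns "eric" into "erik", so the query
        // eric~ happily finds documents mentioning Erik.
        System.out.println(Levenshtein.distance("eric", "erik")); // 1
    }
}
```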

There are other techniques that can be employed for seeing through transliterations and misspellings. In fact, Bob Carpenter contributed a wonderful case study to Chapter 10 describing this in detail using his LingPipe project.

This brings up another great selling point of the book: the Case Studies chapter. It has case studies of Nutch, SearchBlox, Michaels.com, TheServerSide, jGuru, and Alias-i (LingPipe). Read it to see how Lucene is leveraged in some heavy-duty systems - I learned a lot by reading what they contributed, that's for sure!

Originally posted by Manmohan Singh:

Which comparison do you want? Software may be compared in terms of space, time, and cost. As you know, it's open source with a GPL license, hence it's free. Between space and time, which comparison are you interested in?



Correction - Lucene is licensed using the Apache Software License, not GPL. Big difference for many!

Originally posted by Ali Pope:
This article seems interesting, opening up a new direction for object persistence.

article

--
./pope



This is certainly a clever way to use Lucene. My main critique of the implementation details is the use of QueryParser for non-human-entered queries. There are many strange things that can occur between parsing and analyzing the query expression that are best avoided by forming queries directly with the API whenever possible - even if that means working with an analyzer directly as QueryParser does under the covers.

Originally posted by Arjun Shastry:

IndexSearcher in Lucene accepts a query and returns a Hits object. As stated in one tutorial, Lucene is an IR library rather than a search engine. Does the implementor need to construct a cache/crawler for even faster searching/indexing?
Also, how are the results returned? As per the tutorials on the net, it uses a score for a page (a Document in general); how different is this in comparison with Google's PageRank? To my knowledge, PageRank calculates the score not only from the frequency of accessing the page but also from the backlinks (the total pages pointing to that page). How is the score of a Document calculated in Lucene?
Does Hit stand for Hypertext Induced Topic Selection, the algorithm used to rank documents?



I call Lucene a "search engine" because it's a convenient and recognizable term. Technically it is an API that has no user interface, no crawler, and no parsers. To me, it is the "engine", whereas Google is a search "application". Semantics and word games aside, it is not necessary to implement caching around Lucene. The Hits object itself has some built-in caching for the most recently accessed (or soon-to-be-accessed) documents.

Hits from Lucene are ordered by score, a sophisticated calculation that puts documents more relevant to the query at the top and less relevant documents below.
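At its core, Lucene's score is a tf-idf calculation: a term counts for more when it is frequent within a document (tf) but rare across the collection (idf). Lucene's full formula adds boosts, length norms, and a coordination factor; the sketch below keeps only that tf-idf core, with formulas shaped like the ones Lucene uses, as an illustration rather than Lucene's actual code:

```java
// Sketch of the tf-idf core of Lucene-style relevance scoring.
// Lucene's real formula adds boosts, length norms, and a coordination
// factor on top of this.
public class TfIdf {
    // tf: dampen raw term frequency so ten occurrences aren't
    // worth ten times one occurrence.
    static double tf(int freqInDoc) { return Math.sqrt(freqInDoc); }

    // idf: rare terms score higher. numDocs is the collection size,
    // docFreq how many documents contain the term.
    static double idf(int numDocs, int docFreq) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    static double score(int freqInDoc, int numDocs, int docFreq) {
        return tf(freqInDoc) * idf(numDocs, docFreq);
    }

    public static void main(String[] args) {
        // A term in only 10 of 1000 docs outscores one in 900 of
        // 1000 docs, at equal within-document frequency.
        System.out.println(TfIdf.score(2, 1000, 10));
        System.out.println(TfIdf.score(2, 1000, 900));
    }
}
```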

Google's PageRank is comparable to how Nutch, a system built around Lucene, ranks its documents. It does lots of Lucene trickery to weight documents in a PageRank-like fashion. Most of us, however, are not building web crawlers, where PageRank works decently. On intranets and in other domains of use, the built-in Lucene scoring mechanism works amazingly well.

I have never heard that acronym for HIT, and I do not think it applies to Lucene's concept of a Hit. A "hit" is synonymous with "match".

Originally posted by Pradeep Bhat:
Is Lucene faster than other search techniques? If yes, how? Thanks



Lucene is FAST!

What other techniques do you want it compared to? Lucene uses an inverted index, with algorithms, storage, and data structures designed by a search-engine expert. Doug Cutting was instrumental in building the Excite search engine in its heyday, worked for Apple building the V-Twin engine, has published numerous papers, and is named on several patents related to indexing and searching techniques. Check them out to learn more about the "how".
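The inverted index is the key idea: instead of scanning every document for a term, map each term to the documents that contain it, so lookup is a single map access. A toy version (vastly simpler than Lucene's on-disk structures, but the same concept):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

// A toy inverted index: each term maps to the sorted set of document
// ids that contain it. Lucene's index is the same idea, with
// frequencies, positions, and compressed on-disk storage added.
public class InvertedIndex {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    public void add(int docId, String text) {
        // Crude tokenization: lowercase and split on non-word characters.
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) continue;
            postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
        }
    }

    public Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(),
                                     Collections.emptySortedSet());
    }

    public static void main(String[] args) {
        InvertedIndex idx = new InvertedIndex();
        idx.add(1, "Lucene in Action");
        idx.add(2, "Ant in Action");
        System.out.println(idx.search("action")); // [1, 2]
        System.out.println(idx.search("lucene")); // [1]
    }
}
```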

Originally posted by Ali Pope:
Erik, this is something very interesting. Still, I check out many ASF projects through CVS. Does svn support this? (I didn't read about this until now.)



You can see what repositories are in Subversion here: http://svn.apache.org/repos/asf/.

Many have converted to svn, many have not yet, but the plan is to get them all converted in the near future.