This is what I have been able to find about scalability:
Does it scale?
Generally speaking, for systems with light to moderate traffic with reasonably simple queries on datasets up to 100,000 documents, our current impression is that Lucene should be adequate. We have seen reports of Lucene performing well on a 300,000 document dataset, and we have run queries on 800,000 document sets. Simple queries still performed reasonably.
If your datasets are routinely in the 100,000 document range, or if you will ever be searching more than 1 million records, you should investigate performance carefully.
If you require an average of more than 10 queries per second, we encourage you to at least do some performance testing before making decisions. This holds true for commercial vendors as well. Lucene does support some amount of threading.
Lucene does not do as well for systems with highly volatile data. When source data changes, the Lucene indices must be updated to reflect the new terms present in the modified content. For each "update", Lucene requires a pair of transactions, a "delete" followed by an "add", and the "add" will only be visible to newly opened search sessions. This can cause search synchronization and/or latency issues if not properly handled.
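To make the update behaviour above concrete, here is a minimal sketch in plain Java. It is a toy model, not Lucene's API: the names `ToyIndex`, `Session`, `update` and `openSession` are invented for illustration. It only mimics the two points in the quoted text, that an update is a delete plus an add, and that a session opened before the update keeps seeing the old content.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: hypothetical classes, NOT Lucene's API.
class ToyIndex {
    private final Map<String, String> docs = new HashMap<>();

    public void add(String id, String text) { docs.put(id, text); }

    public void delete(String id) { docs.remove(id); }

    // An "update" is expressed as a delete followed by an add.
    public void update(String id, String text) {
        delete(id);
        add(id, text);
    }

    // Opening a session copies the current state (a point-in-time snapshot),
    // so later changes are invisible to this session.
    public Session openSession() { return new Session(new HashMap<>(docs)); }

    static final class Session {
        private final Map<String, String> snapshot;

        Session(Map<String, String> snapshot) { this.snapshot = snapshot; }

        // True if the document exists in this snapshot and contains the term.
        public boolean contains(String id, String term) {
            String text = snapshot.get(id);
            return text != null && text.contains(term);
        }
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add("doc1", "old content");
        Session before = index.openSession();   // opened before the update
        index.update("doc1", "new content");
        Session after = index.openSession();    // opened after the update
        System.out.println(before.contains("doc1", "new")); // false
        System.out.println(after.contains("doc1", "new"));  // true
    }
}
```

In a real application the practical consequence is the same: searchers opened before a change have to be reopened before they see it.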
Find all the articles that contain Kepler within five words of Galileo:
"Galileo AND Kepler"~5
Just a quick correction on this. Because you were using StandardAnalyzer, the word "AND" was stripped and you got lucky this worked as desired. But really the word "AND" should not be in phrase queries like this. Erik (co-author of the upcoming Lucene in Action book by Manning)
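For anyone following along, the corrected form of the query would simply be "Galileo Kepler"~5, with no operator inside the quotes. As a rough illustration of what "within five words" means, here is a small plain-Java sketch. It does not use Lucene, the class and method names are made up, and Lucene's own slop calculation is more subtle than a simple position difference, so treat it as an approximation only.

```java
import java.util.Locale;

// Hypothetical helper, not part of Lucene: checks whether two terms occur
// within maxDistance token positions of each other, in either order.
class ProximityDemo {
    public static boolean within(String text, String a, String b, int maxDistance) {
        // Very naive tokenization: lowercase and split on non-word characters.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("\\W+");
        String first = a.toLowerCase(Locale.ROOT);
        String second = b.toLowerCase(Locale.ROOT);
        for (int i = 0; i < tokens.length; i++) {
            if (!tokens[i].equals(first)) continue;
            for (int j = 0; j < tokens.length; j++) {
                if (tokens[j].equals(second) && Math.abs(i - j) <= maxDistance) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(within("Galileo corresponded with Kepler", "Galileo", "Kepler", 5));
        System.out.println(within("Galileo one two three four five six seven Kepler", "Galileo", "Kepler", 5));
    }
}
```

Note that a literal word such as AND inside the quotes would count as just another term to match, unless the analyzer happens to strip it as a stop word, which is the lucky case described above.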
Two things... First you are correct. The technical term for what I did is "cut and paste error". Second, I want that book when it comes out!!! For anyone coming in late, the article has been fixed so don't go looking for my error. But what did you think of the article otherwise? [ April 08, 2004: Message edited by: Thomas Paul ]
Originally posted by Thomas Paul: But what did you think of the article otherwise?
Good intro article. I'm a bit biased though, as I've written some "competing" articles at java.net. You covered all the basics nicely. There are, of course, lots of interesting details that are tough to cover in an intro article. For example, do you really want to use a Date for the "date" field? Dates have lots of quirks in terms of how they are indexed and how to search on them. I generally use YYYYMMDD strings instead. Your description of an Analyzer is perhaps a bit misleading. It processes individual fields, not a Document. It is nice to see so much info on Lucene propagating. It is a true gem. Erik p.s. shameless plug - I'm also speaking on Lucene at JavaOne this year.
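The YYYYMMDD suggestion is easy to sketch with the standard library. The idea, as I read it, is that fixed-width date strings sort lexicographically in the same order as the dates themselves, so a plain string range comparison behaves like a date range. The class and method names below are made up for illustration, and nothing here touches Lucene itself.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: turns dates into yyyyMMdd keys suitable for storing
// as a plain string field, and compares them as strings.
class DateKeys {
    // BASIC_ISO_DATE formats a LocalDate as yyyyMMdd, e.g. 20040408.
    public static String toKey(LocalDate date) {
        return date.format(DateTimeFormatter.BASIC_ISO_DATE);
    }

    // Inclusive range check using ordinary string comparison.
    public static boolean inRange(String key, String from, String to) {
        return key.compareTo(from) >= 0 && key.compareTo(to) <= 0;
    }

    public static void main(String[] args) {
        String key = toKey(LocalDate.of(2004, 4, 8));
        System.out.println(key);                                  // 20040408
        System.out.println(inRange(key, "20040101", "20041231")); // true
    }
}
```

The string comparison only works because every key has the same width, so zero-padded months and days matter.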
The article is very nice, but as per my understanding Lucene can search only in text files. How can I make Lucene search in PDFs and Word documents? I need these details badly, please help me.
Huynh, welcome to JavaRanch. If you have questions, please start a new thread, and don't tack on to an old existing one.
As was suggested above, you can teach Lucene how to index additional sources of information by creating an implementation of org.apache.lucene.store.Directory, but I'm not sure how well that would work for a DB. What kinds of searches do you envision that SQL can't handle? Or are you trying to integrate various sources of information?