
Journal Article - The Lucene Search Engine - Adding search to your applications

Dirk Schreckmann
Sheriff

Joined: Dec 10, 2001
Posts: 7023
The just-released April 2004 edition of The JavaRanch Journal includes an article by Thomas Paul, "The Lucene Search Engine - Adding search to your applications".
Please use this thread to comment on and discuss the article.


Peter Daly
Greenhorn

Joined: Jan 04, 2002
Posts: 17
How well does this scale? Is it reasonable to try and use this for a 1 million document index for instance?
Thomas Paul
mister krabs
Ranch Hand

Joined: May 05, 2000
Posts: 13974
This is what I have been able to find about scalability:
Does it scale?

Generally speaking, for systems with light to moderate traffic with reasonably simple queries on datasets up to 100,000 documents, our current impression is that Lucene should be adequate. We have seen reports of Lucene performing well on a 300,000 document dataset, and we have run queries on 800,000 document sets. Simple queries still performed reasonably.

If your datasets are routinely in the 100,000 document range, or if you will ever be searching more than 1 million records, you should investigate performance carefully.

If you require an average of more than 10 queries per second, we encourage you to at least do some performance testing before making decisions. This holds true for commercial vendors as well. Lucene does support some amount of threading.

Lucene does not do as well for systems with highly volatile data. When source data changes, the Lucene indices must be updated to reflect the new terms present in the modified content. For each "update", Lucene requires a pair of "delete" and an "add" transactions; and the "add" will only be visible to newly opened search sessions. This can cause search synchronization and/or latency issues if not properly handled.


http://www.ideaeng.com/pub/entsrch/issue03/article03.html#land7


Associate Instructor - Hofstra University
Erik Hatcher
Author
Ranch Hand

Joined: Jun 11, 2002
Posts: 111
Find all the articles that contain Kepler within five words of Galileo:

"Galileo AND Kepler"~5

Just a quick correction on this. Because you were using StandardAnalyzer, the word "AND" was stripped and you got lucky this worked as desired. But really the word "AND" should not be in phrase queries like this.
Erik (co-author of the upcoming Lucene in Action book by Manning)
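To see why the stray "AND" went unnoticed, here is a minimal sketch of a stop-word-stripping analyzer. It is a toy stand-in written with plain JDK classes, not Lucene's StandardAnalyzer; the class name `ToyAnalyzer` and its short stop list are assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy illustration only: NOT Lucene's StandardAnalyzer.
final class ToyAnalyzer {
    // Hypothetical, abbreviated stop list chosen for this example.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "of", "or", "the", "to"));

    // Split on non-alphanumeric characters, lowercase, and drop stop words.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Both phrases reduce to the same two terms in this toy analyzer.
        System.out.println(analyze("Galileo AND Kepler"));
        System.out.println(analyze("Galileo Kepler"));
    }
}
```

With an analyzer that behaves like this, both phrases end up as the same two terms, which is why the query happened to work. The phrase query to actually write is "Galileo Kepler"~5.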
Thomas Paul
mister krabs
Ranch Hand

Joined: May 05, 2000
Posts: 13974
Two things... First you are correct. The technical term for what I did is "cut and paste error". Second, I want that book when it comes out!!!
For anyone coming in late, the article has been fixed so don't go looking for my error.
But what did you think of the article otherwise?
[ April 08, 2004: Message edited by: Thomas Paul ]
Erik Hatcher
Author
Ranch Hand

Joined: Jun 11, 2002
Posts: 111
Originally posted by Thomas Paul:
But what did you think of the article otherwise?

Good intro article. I'm a bit biased though, as I've written some "competing" articles at java.net. You covered all the basics nicely.
There are, of course, lots of interesting details that are tough to cover in an intro article. For example, do you really want to use a Date for the "date" field? Dates have lots of quirks in terms of how they are indexed and how to search on them. I generally use YYYYMMDD strings instead.
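A minimal sketch of the YYYYMMDD idea, using only the JDK (java.time, so Java 8 or later). The class and method names are hypothetical, and how the resulting string is stored in a Lucene field is left out on purpose.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: turns dates into sortable yyyyMMdd strings and back.
final class DateKeys {
    // BASIC_ISO_DATE formats a LocalDate as yyyyMMdd with no separators.
    private static final DateTimeFormatter YYYYMMDD = DateTimeFormatter.BASIC_ISO_DATE;

    static String toKey(LocalDate date) {
        return date.format(YYYYMMDD);
    }

    static LocalDate fromKey(String key) {
        return LocalDate.parse(key, YYYYMMDD);
    }

    public static void main(String[] args) {
        String a = toKey(LocalDate.of(2004, 4, 8));
        String b = toKey(LocalDate.of(2004, 11, 9));
        // Fixed-width, zero-padded keys compare as strings in date order.
        System.out.println(a + " sorts before " + b + ": " + (a.compareTo(b) < 0));
    }
}
```

Because every key has the same width and is zero-padded, plain string comparison gives chronological order, which is what makes such strings convenient for exact-day matches and range-style comparisons.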
Your description of an Analyzer is perhaps a bit misleading. It processes individual fields, not a Document.
It is nice to see so much info on Lucene propagating. It is a true gem.
Erik
p.s. shameless plug - I'm also speaking on Lucene at JavaOne this year.
santhoshkumar samala
Ranch Hand

Joined: Nov 12, 2003
Posts: 156

The article is very nice, but as I understand it, Lucene can search only text files. How can I make Lucene search PDFs and Word documents? I need these details badly; please help me.


santhosh
SCJP, SCWCD
Lasse Koskela
author
Sheriff

Joined: Jan 23, 2002
Posts: 11962
Originally posted by santhoshkumar samala:
The article is very nice, but as I understand it, Lucene can search only text files. How can I make Lucene search PDFs and Word documents? I need these details badly; please help me.
Take a look at these two FAQs from the JGuru Lucene FAQ page:
How can I index PDF documents?
How can I index Word documents?

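(in short, you'll probably need to extract the text from the .pdf or .doc file with a separate parsing library and hand that text to Lucene as document fields; Lucene itself does not parse these formats, and org.apache.lucene.store.Directory is only the abstraction for where the index is stored)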


Author of Test Driven (2007) and Effective Unit Testing (2013)
Dharmanand Singh
Greenhorn

Joined: Oct 27, 2004
Posts: 13
You can refer to one of the Lucene examples, http://dharmanand.tarundua.net/lucene_eg.war, which parses PDF documents as well as text documents. It does not handle Word documents, but you can definitely find a suitable library for that as well. You can read details about this example at http://dharmanand.tarundua.net/
[ November 09, 2004: Message edited by: Dharmanand Singh ]
HUYNH hong Phuong
Greenhorn

Joined: Aug 02, 2006
Posts: 1
Hello everyone, I have a question about Lucene: is Lucene capable of indexing information in a database? I really need these details. Thanks a lot.
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42047
Huynh, welcome to JavaRanch. If you have questions, please start a new thread, and don't tack on to an old existing one.

Lucene indexes whatever text your code hands it, so you could read the rows yourself and add them as documents (org.apache.lucene.store.Directory only controls where the index is stored, not what gets indexed), but I'm not sure how well that would work for a DB. What kinds of searches do you envision that SQL can't handle? Or are you trying to integrate various sources of information?

