A Question on Lucene

Joe Harry
Ranch Hand

Joined: Sep 26, 2006
Posts: 9383

Guys,

I have a question on using Lucene to search and serve HTML content for my web app. My general question is: how do I read the HTML documents and index their content so that they are searchable? Are there any good references other than the demo app that comes with the Lucene download?


SCJP 1.4, SCWCD 1.4 - Hints for you, Certified Scrum Master
Did a rm -R / to find out that I lost my entire Linux installation!
Joe Harry
Ranch Hand

Joined: Sep 26, 2006
Posts: 9383

Do Apache Solr and Lucene complement each other? What is the difference between the two?
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41874
Reading through the websites of both Solr and Lucene, they don't sound similar. If this is for the project you mentioned elsewhere, then Lucene is almost certainly the proper choice.

With respect to HTML, I think Lucene comes with an example you should be able to adapt. If you're serious about it then you really should work through "Lucene in Action"; it'll save you much time and effort.


Joe Harry
Ranch Hand

Joined: Sep 26, 2006
Posts: 9383

Yes, I'm planning to give the community project I'm working on Lucene-powered search so that users can search for articles. I'm using the Lucene demo and building on top of it. But there are certain things I would like to customize and certain things I need to understand. Lucene in Action looks promising; I will give it a try.
Joe Harry
Ranch Hand

Joined: Sep 26, 2006
Posts: 9383

Well, Lucene in Action says that Solr is a crawler.
Joe Harry
Ranch Hand

Joined: Sep 26, 2006
Posts: 9383

Got hold of Tika for content extraction and it really made my life easier.
John Jai
Bartender

Joined: May 31, 2011
Posts: 1776
If Joe's still around:
Correct my understanding - you used Apache Tika to convert the HTML files into indexable text, and then indexed that text so it could be searched with Lucene?
R Hoefer
Greenhorn

Joined: Oct 27, 2011
Posts: 10
I'm not Joe, but that's starting on the right track.
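
To make the extract-then-index idea concrete, here is a rough sketch that uses only the JDK. It is not Tika or Lucene code: the regex-based extractText is a crude stand-in for what Tika would do, and the HashMap-based index is a toy stand-in for a Lucene index. The class and method names (ExtractThenIndex, extractText, addDocument, search) are made up for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class ExtractThenIndex {

    // Toy inverted index: term -> ids of the documents that contain it.
    // Hypothetical stand-in for a real Lucene index.
    public static final Map<String, Set<String>> index = new HashMap<>();

    // Step 1: pull plain text out of HTML.
    // Crude regex stripping, standing in for a real extractor such as Tika;
    // it will not cope with malformed or unusual markup.
    public static String extractText(String html) {
        return html
            .replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ") // drop script/style bodies
            .replaceAll("(?s)<[^>]*>", " ")                          // drop remaining tags
            .replaceAll("\\s+", " ")                                 // collapse whitespace
            .trim();
    }

    // Step 2: split the extracted text into lowercase terms and index them.
    public static void addDocument(String docId, String html) {
        String text = extractText(html).toLowerCase(Locale.ROOT);
        for (String term : text.split("[^\\p{L}\\p{Nd}]+")) {
            if (!term.isEmpty()) {
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Step 3: look up a single term (case-insensitive).
    public static Set<String> search(String term) {
        return index.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    public static void main(String[] args) {
        addDocument("a.html",
            "<html><body><h1>Lucene basics</h1><p>Indexing HTML</p></body></html>");
        addDocument("b.html",
            "<html><head><style>p{color:red}</style></head>"
            + "<body><p>Tika extracts text</p></body></html>");

        System.out.println(search("lucene")); // only a.html mentions it
        System.out.println(search("color"));  // CSS was stripped, so no hits
    }
}
```

In a real setup the extraction step would be handled by Tika and the add and search steps by Lucene's own indexing and query classes, which also bring analyzers, ranking and on-disk storage that this toy version leaves out.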


John Jai
Bartender

Joined: May 31, 2011
Posts: 1776
Thanks Hoefer!
R Hoefer
Greenhorn

Joined: Oct 27, 2011
Posts: 10
Since I'm thinking of it, have you heard of Luke? http://code.google.com/p/luke/

Nice tool to manage Lucene databases in a GUI. I wish someone had told me about it when I started messing with Lucene.
 