
Create meta cards of documents using POI. Lucene or reg expressions

Aaron Williams
Greenhorn

Joined: Mar 09, 2011
Posts: 2

All,

Thanks in advance. I am indexing documents from multiple data sources. I am creating a meta card for each document and storing it in an Oracle DB. I only store the meta card and a link to the document, not the document itself.

I started using POI and PDFBox to read Word, Excel, PowerPoint, etc.

If I want to create structured, intelligible phrases and summaries from, let us say, an expense report, would you recommend using Lucene or regular expressions? I've considered creating a library class that maps keywords to phrases and just allowing it to grow. I know there has to be a more powerful and efficient way to do this than regular expressions.

So, back to the expense report example: I want to find words that match Mr. or Mrs., Unilever, 2012 conference, etc., and store those in the meta card.
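
To make the keyword-library idea concrete, here is a minimal sketch using only java.util.regex. The class name MetaCardKeywords, the labels ("person", "company", "event"), and the patterns themselves are made up for illustration. It assumes the document text has already been extracted to a String by POI or PDFBox.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: maps a meta card label to a regex and collects the matches.
class MetaCardKeywords {

    // The "library" of keywords; new entries can be added as it grows.
    private static final Map<String, Pattern> LIBRARY = new LinkedHashMap<>();
    static {
        // Honorific followed by a capitalized surname, e.g. "Mr. Smith" or "Mrs Jones".
        LIBRARY.put("person", Pattern.compile("\\b(?:Mr|Mrs)\\.?\\s+[A-Z][a-z]+"));
        // A fixed company name.
        LIBRARY.put("company", Pattern.compile("\\bUnilever\\b"));
        // A four-digit year followed by the word "conference".
        LIBRARY.put("event", Pattern.compile("\\b(?:19|20)\\d{2}\\s+conference\\b",
                Pattern.CASE_INSENSITIVE));
    }

    // Returns label -> matched phrases, skipping labels with no matches.
    static Map<String, List<String>> extract(String text) {
        Map<String, List<String>> card = new LinkedHashMap<>();
        for (Map.Entry<String, Pattern> entry : LIBRARY.entrySet()) {
            Matcher matcher = entry.getValue().matcher(text);
            List<String> hits = new ArrayList<>();
            while (matcher.find()) {
                hits.add(matcher.group());
            }
            if (!hits.isEmpty()) {
                card.put(entry.getKey(), hits);
            }
        }
        return card;
    }

    public static void main(String[] args) {
        String text = "Mr. Smith and Mrs Jones of Unilever attended the 2012 conference.";
        // prints {person=[Mr. Smith, Mrs Jones], company=[Unilever], event=[2012 conference]}
        System.out.println(extract(text));
    }
}
```

This works for a handful of fixed terms, but every new name or phrase needs its own pattern, which is the scaling problem I am worried about.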

Thanks,
AD
Tim Moores
Rancher

Joined: Sep 21, 2011
Posts: 2408
Instead of handling all those document formats yourself, you may want to look into the Apache Tika project. It has all that built in (it wraps POI and PDFBox) and began as a Lucene subproject, so the two work well together. For semantic text handling I definitely recommend Lucene.
 
 