Thanks in advance. I am indexing documents from multiple data sources. I am creating a metacard for each document and storing it in an Oracle DB. I only store the metacard and a link to the document, not the document itself.
I started using POI and PDFBox to read Word, Excel, PowerPoint, PDF, etc.
If I want to create structured, intelligible phrases and summaries from, let us say, an expense report, would you recommend using Lucene or regular expressions? I've been considering creating a library class that maps keywords to phrases and just allowing it to grow. I know there has to be a more powerful and efficient way to do this than regular expressions.
So, back to the expense report example. I want to find words that match Mr. or Mrs., Unilever, 2012 conference, etc. and store those in the metacard.
posted 4 years ago
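Instead of handling all those document formats yourself, you may want to look into the Apache Tika project. It has all that built in (it uses POI and PDFBox under the hood) and is commonly used to feed extracted text into Lucene. For analyzing and searching the extracted text I definitely recommend Lucene over hand-written regular expressions.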
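For comparison, here is a minimal sketch of the growing keyword-library idea from the question, using only java.util.regex. It is a baseline to measure Lucene against, not a recommendation. The class name MetaCardExtractor, the labels, and the three patterns are all hypothetical and made up for the expense report example. It assumes the plain text has already been extracted from the document.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of POI, PDFBox, Tika, or Lucene.
class MetaCardExtractor {
    // label -> compiled pattern; LinkedHashMap keeps rules in insertion order
    private final Map<String, Pattern> rules = new LinkedHashMap<>();

    // The "library" grows by registering more rules here.
    void addRule(String label, String regex) {
        rules.put(label, Pattern.compile(regex));
    }

    // Returns, per label, every match found in the text in order of appearance.
    // Labels with no match are left out of the result.
    Map<String, List<String>> extract(String text) {
        Map<String, List<String>> hits = new LinkedHashMap<>();
        for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
            Matcher m = rule.getValue().matcher(text);
            List<String> found = new ArrayList<>();
            while (m.find()) {
                found.add(m.group());
            }
            if (!found.isEmpty()) {
                hits.put(rule.getKey(), found);
            }
        }
        return hits;
    }

    // Assumed starter rules for the expense report example in the question.
    static MetaCardExtractor expenseReportDefaults() {
        MetaCardExtractor x = new MetaCardExtractor();
        x.addRule("person", "\\b(?:Mr|Mrs)\\.?\\s+[A-Z][a-z]+");
        x.addRule("company", "\\bUnilever\\b");
        x.addRule("event", "\\b(?:19|20)\\d{2}\\s+conference\\b");
        return x;
    }
}
```

Calling MetaCardExtractor.expenseReportDefaults().extract(text) on the extracted text gives a map you could write straight into the metacard columns. The weakness is visible right away: every new title, company, or event wording needs another hand-written pattern, which is the kind of maintenance that a proper analysis and indexing library is meant to reduce.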