Thanks in advance. I am indexing documents via multiple data sources. I am creating meta cards for each document and storing them in an Oracle DB. I only store the meta card and a link to the document, not the document itself.
I started using Apache POI and PDFBox to read Word, Excel, PowerPoint, etc.
If I want to create structured, intelligible phrases and summaries from, let us say, an expense report, would you recommend using Lucene or regular expressions? I've been considering creating a library class that maps keywords to phrases and just allowing it to grow. I know there has to be a more powerful and efficient way to do this than regular expressions.
So, back to the expense report example: I want to find words that match Mr. or Mrs., Unilever, 2012 conference, etc., and store those in the meta card.
Instead of handling all those document formats yourself, you may want to look into the Apache Tika project. It has all that built in (it wraps parsers such as POI and PDFBox behind a single API) and integrates well with Lucene. For semantic text handling I definitely recommend Lucene.
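To make the "growing library of keywords" idea from the question concrete, here is a minimal sketch of a rule table that maps meta card fields to patterns. It uses plain java.util.regex rather than Lucene, purely as a baseline to compare against, and it assumes the document text has already been extracted to a String. The class name MetaCardExtractor, the field names (person, organization, event) and the patterns themselves are all hypothetical, chosen only to match the expense report example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: maps meta card field names to regex rules.
// New rules can be added to RULES as the keyword library grows.
class MetaCardExtractor {

    // Insertion-ordered so the meta card fields come out in a stable order.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();

    static {
        // Honorific followed by a capitalized surname, e.g. "Mr. Smith" or "Mrs Jones".
        RULES.put("person",
                Pattern.compile("\\b(?:Mr|Mrs|Ms|Dr)\\.?\\s+[A-Z][a-z]+"));
        // Known organization names (illustrative list with a single entry).
        RULES.put("organization",
                Pattern.compile("\\b(?:Unilever)\\b"));
        // A four-digit year followed by the word "conference", e.g. "2012 conference".
        RULES.put("event",
                Pattern.compile("\\b(?:19|20)\\d{2}\\s+conference\\b",
                        Pattern.CASE_INSENSITIVE));
    }

    // Returns every match for every rule, keyed by field name.
    // Fields with no matches map to an empty list.
    static Map<String, List<String>> extract(String text) {
        Map<String, List<String>> card = new LinkedHashMap<>();
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            List<String> hits = new ArrayList<>();
            Matcher m = rule.getValue().matcher(text);
            while (m.find()) {
                hits.add(m.group());
            }
            card.put(rule.getKey(), hits);
        }
        return card;
    }
}
```

Calling MetaCardExtractor.extract(text) on the extracted text gives you a map you can write straight into the meta card columns. The weakness is the one the question already suspects: every new name, company or event needs another hand-written rule, which is where tokenizing and indexing the text with Lucene starts to pay off.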