
Searching PDF file taking Longer time

I am developing a web application that searches for text in a set of PDF files (about 150 PDFs, each with at least 800 pages). I am using the iText JARs to read the PDFs. I loop over every file and every page in each file, extract the page's content, and search the extracted content for the text. This takes a very long time to return results and hurts the performance of the application.

import java.util.Vector;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

// Returns the names of the PDF files in the given category whose text matches searchKey.
public Vector<String> searchForText(String searchKey, String Category) {
    Pattern expression = Pattern.compile(searchKey, Pattern.CASE_INSENSITIVE);

    Vector<String> files = searchForFiles(Category); // generating a list of all pdf files in that category
    Vector<String> filesFound = new Vector<String>();
    for (String file : files) {
        try {
            PdfReader reader = new PdfReader(baseDir + Category + "\\" + file);
            try {
                for (int i = 1; i <= reader.getNumberOfPages(); i++) {
                    // Extract each page's text once and reuse it for printing and matching.
                    String fileContents = PdfTextExtractor.getTextFromPage(reader, i);
                    System.out.println(fileContents);
                    Matcher matcher = expression.matcher(fileContents);
                    if (matcher.find()) {
                        filesFound.add(file);
                        break; // one match is enough; avoids adding the same file repeatedly
                    }
                }
            } finally {
                reader.close();
            }
        }
        catch (Exception e) {
        }
    }
    return filesFound;
}
Can anyone advise on a better way of doing this?
I assume that the PDFs do not change a whole lot during runtime? If so, consider using Lucene to index and search the files.
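
To illustrate why indexing helps, here is a minimal sketch of the underlying idea: extract the text once, build a word-to-files map, and answer each search from that map instead of re-reading the PDFs. This is not Lucene and not its API; `PdfTextIndex`, `addPage`, and `search` are hypothetical names, and the sketch assumes the page text has already been extracted (for example with `PdfTextExtractor.getTextFromPage`). It only handles single-word, case-insensitive lookups, which is far less than what Lucene offers.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy inverted index (hypothetical, not Lucene): maps each lower-cased word to the files containing it. */
class PdfTextIndex {
    private final Map<String, Set<String>> filesByWord = new HashMap<>();

    /** Index the text of one page; call once per page when the PDFs are loaded or changed. */
    void addPage(String fileName, String pageText) {
        // Split on anything that is not a letter or a digit.
        for (String word : pageText.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!word.isEmpty()) {
                filesByWord.computeIfAbsent(word, w -> new TreeSet<>()).add(fileName);
            }
        }
    }

    /** Look up a single word, case-insensitively, without opening any PDF. */
    Set<String> search(String word) {
        Set<String> hits = filesByWord.get(word.toLowerCase(Locale.ROOT));
        return hits == null ? Collections.<String>emptySet() : Collections.unmodifiableSet(hits);
    }

    public static void main(String[] args) {
        PdfTextIndex index = new PdfTextIndex();
        // In the real application these strings would come from the PDF pages,
        // extracted once up front rather than on every search.
        index.addPage("manual.pdf", "Installing the pump");
        index.addPage("report.pdf", "Pump failure statistics");
        System.out.println(index.search("PUMP"));
    }
}
```

The expensive step (text extraction) moves out of the search path, so each query becomes a map lookup. A real index such as Lucene adds persistence, phrase queries, and ranking on top of the same principle.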

By the way, this is a bad idea:


catch (Exception e) {
}


You need to get in the habit of handling exceptions properly. At least write something to a log file so you know what happened.
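
As a rough sketch of what "write something to a log" could look like, the loop below logs each failure with `java.util.logging` and keeps going with the remaining files. `LoggedSearch`, `FileMatcher`, and the `failedFiles` list are hypothetical names introduced for illustration; `FileMatcher` stands in for the per-file PDF search, and a different logging library would work just as well.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

class LoggedSearch {
    private static final Logger LOG = Logger.getLogger(LoggedSearch.class.getName());

    /** Hypothetical stand-in for the per-file PDF search, which may throw. */
    interface FileMatcher {
        boolean matches(String file) throws Exception;
    }

    /** Returns the matching files; files that fail are logged and recorded in failedFiles. */
    static List<String> search(List<String> files, FileMatcher matcher, List<String> failedFiles) {
        List<String> found = new ArrayList<>();
        for (String file : files) {
            try {
                if (matcher.matches(file)) {
                    found.add(file);
                }
            } catch (Exception e) {
                // Log the file name and the full stack trace instead of swallowing the error.
                LOG.log(Level.WARNING, "Could not search " + file, e);
                failedFiles.add(file);
            }
        }
        return found;
    }
}
```

With this shape, one corrupt or unreadable PDF no longer disappears silently: the log shows which file failed and why, and the caller can report the skipped files alongside the results.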
Tim, thanks for your response. I have explored Lucene and PDFBox a little online, but I do not seem to be making any headway. Could you please provide some code that does this indexing and searching on the PDFs? Thanks.
If you're serious about using Lucene then you really need a copy of "Lucene in Action"; it is a very helpful book that will pay itself back in no time by teaching you stuff about Lucene that you won't find online.