Searching PDF file taking Longer time

Moses Oyebade
Greenhorn

Joined: Nov 09, 2011
Posts: 2

I am developing a web application that searches for text in a set of PDF files (about 150 PDFs, each containing at least 800 pages). I am using the iText jars to access the contents of the PDFs. I loop over each file and every page in the file, extract the page's content, and search the extracted content for the text. This takes a very long time to return output, and it degrades the performance of the application.

public Vector<String> searchForText(String searchKey, String Category) {
    Pattern expression = Pattern.compile(searchKey, Pattern.CASE_INSENSITIVE);

    Vector<String> files = searchForFiles(Category); // generating a list of all pdf files in that category
    Vector<String> filesFound = new Vector<String>();
    for (String file : files) {
        try {
            Document document = new Document();
            document.open();
            com.itextpdf.text.pdf.PdfReader reader = new com.itextpdf.text.pdf.PdfReader(baseDir + Category + "\\" + file);

            for (int i = 1; i <= reader.getNumberOfPages(); i++) {
                // extract each page once and reuse the text for printing and matching
                String fileContents = PdfTextExtractor.getTextFromPage(reader, i);
                System.out.println(fileContents);
                Matcher matcher = expression.matcher(fileContents);
                if (matcher.find()) {
                    filesFound.add(file);
                    break; // one match is enough; avoids adding the same file once per matching page
                }
            }
            reader.close();
        }
        catch (Exception e) {
        }
    }
    return filesFound;
}

Can anyone advise on a better way of doing this?
Tim Moores
Rancher

Joined: Sep 21, 2011
Posts: 2408
I assume that the PDFs do not change a whole lot during runtime? If so, consider using Lucene to index and search the files.
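To illustrate why indexing helps, here is a minimal sketch of the index-once, search-many idea using only the JDK. It is a plain in-memory inverted index, not Lucene and not its API; the class name PageIndex and its methods are made up for this example. It assumes the page text has already been extracted (for instance by the iText call in the question) and that queries are single whole words rather than regular expressions.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative in-memory inverted index; hypothetical names, NOT the Lucene API.
class PageIndex {
    // term -> names of the files whose text contains that term
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Call once per page of already-extracted text, at indexing time.
    void addPage(String file, String pageText) {
        // Lower-case the text and split on anything that is not a letter or digit.
        for (String term : pageText.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(file);
            }
        }
    }

    // Case-insensitive single-term lookup; no PDF is opened at query time.
    Set<String> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    public static void main(String[] args) {
        PageIndex index = new PageIndex();
        index.addPage("a.pdf", "Searching PDF files with Lucene");
        index.addPage("b.pdf", "Handling exceptions properly");
        System.out.println(index.search("lucene"));
    }
}
```

The expensive text extraction happens once, when the index is built; each search is then a map lookup. A real search library adds what this sketch leaves out, such as on-disk storage, phrase queries, and ranking.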

By the way, this is a bad idea:

catch (Exception e) {
}

You need to get in the habit of handling exceptions properly. At least write something to a log file so you know what happened.
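As a rough sketch of the "at least log it" advice, the loop below records the failing file and the exception with java.util.logging and then moves on to the next file. The names LoggedSearch, FileTask and filesMatching are hypothetical, and FileTask merely stands in for the per-file PDF work so that the example runs without iText.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper showing a logged catch block instead of an empty one.
class LoggedSearch {
    private static final Logger LOG = Logger.getLogger(LoggedSearch.class.getName());

    // Stand-in for the per-file PDF work; may throw, as a PDF reader can.
    interface FileTask {
        boolean matches(String file) throws Exception;
    }

    static List<String> filesMatching(List<String> files, FileTask task) {
        List<String> found = new ArrayList<>();
        for (String file : files) {
            try {
                if (task.matches(file)) {
                    found.add(file);
                }
            } catch (Exception e) {
                // Log which file failed, with the stack trace, then continue with the next file.
                LOG.log(Level.WARNING, "Could not search " + file, e);
            }
        }
        return found;
    }
}
```

With this in place, a corrupt or unreadable PDF shows up in the log instead of silently dropping out of the results.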
Moses Oyebade
Greenhorn

Joined: Nov 09, 2011
Posts: 2

Tim, thanks for your response. I have done a little exploration of Lucene and PDFBox online but do not seem to be making any headway. Please provide me with code that can do this indexing and searching on the PDFs. Thanks...
Tim Moores
Rancher

Joined: Sep 21, 2011
Posts: 2408
If you're serious about using Lucene then you really need a copy of "Lucene in Action"; it is a very helpful book that will pay itself back in no time by teaching you stuff about Lucene that you won't find online.
 