
Searching PDF files takes a long time

 
Moses Oyebade
Greenhorn
Posts: 2
Oracle
I am developing a web application that searches for text in a set of PDF files (about 150 PDFs, each with at least 800 pages). I am using the iText jars to read the PDFs. I loop over every file and every page in each file, extract the page's text, and search that text for the search key. This takes a very long time to return results and degrades the performance of the application.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

// Returns the names of the PDF files in the category whose text matches searchKey.
public List<String> searchForText(String searchKey, String category) {
    Pattern expression = Pattern.compile(searchKey, Pattern.CASE_INSENSITIVE);

    List<String> files = searchForFiles(category); // all PDF files in that category
    List<String> filesFound = new ArrayList<String>();
    for (String file : files) {
        com.itextpdf.text.pdf.PdfReader reader = null;
        try {
            reader = new com.itextpdf.text.pdf.PdfReader(baseDir + category + "\\" + file);

            for (int i = 1; i <= reader.getNumberOfPages(); i++) {
                // Extract each page once; the extra extraction for println is gone.
                String pageText = PdfTextExtractor.getTextFromPage(reader, i);
                if (expression.matcher(pageText).find()) {
                    filesFound.add(file);
                    break; // one hit is enough, so skip the remaining pages
                }
            }
        }
        catch (Exception e) {
        }
        finally {
            if (reader != null) {
                reader.close();
            }
        }
    }
    return filesFound;
}
Can anyone advise on a better way of doing this?
 
Tim Moores
Bartender
Posts: 2790
I assume that the PDFs do not change a whole lot during runtime? If so, consider using Lucene to index and search the files.
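To illustrate why an index helps, here is a minimal sketch of the idea in plain Java. It is not Lucene and does not use Lucene's API; the class name PageIndex and its methods are hypothetical. Each page's text is tokenized once when the file is added, so a later search is a map lookup instead of a pass over every page of every PDF. It only handles single-word, case-insensitive lookups and keeps everything in memory.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical in-memory inverted index: word -> names of files containing it. */
final class PageIndex {
    private final Map<String, Set<String>> filesByWord = new HashMap<>();

    /** Index one page of already extracted text; call once per page when a PDF is added. */
    void addPage(String fileName, String pageText) {
        // Split on anything that is not a letter or digit, ignoring case.
        for (String word : pageText.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!word.isEmpty()) {
                filesByWord.computeIfAbsent(word, k -> new TreeSet<>()).add(fileName);
            }
        }
    }

    /** Names of the files that contain the given single word, case-insensitively. */
    Set<String> search(String word) {
        Set<String> hits = filesByWord.get(word.toLowerCase(Locale.ROOT));
        return hits == null ? new TreeSet<>() : new TreeSet<>(hits);
    }
}
```

The extraction loop would run once per PDF to call addPage, and again only when a file changes. A real index such as Lucene adds phrase queries, ranking, and on-disk storage on top of this basic structure.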

By the way, this is a bad idea:

catch (Exception e) {
}

You need to get in the habit of handling exceptions properly. At least write something to a log file so you know what happened.
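As a minimal sketch of that advice, assuming java.util.logging is acceptable in the application: the class LoggedSearch and the FileCheck interface below are hypothetical stand-ins for the per-file search. A failure on one file is logged with the file name and the exception, and the loop carries on with the remaining files.

```java
import java.io.IOException;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

final class LoggedSearch {
    private static final Logger LOG = Logger.getLogger(LoggedSearch.class.getName());

    /** Hypothetical per-file check that may fail, for example on a corrupt PDF. */
    interface FileCheck {
        boolean matches(String file) throws IOException;
    }

    /** Counts matching files; a failure is logged and that file is skipped. */
    static int countMatches(List<String> files, FileCheck check) {
        int count = 0;
        for (String file : files) {
            try {
                if (check.matches(file)) {
                    count++;
                }
            } catch (IOException e) {
                // Record which file failed and why, then carry on with the rest.
                LOG.log(Level.WARNING, "Could not search " + file, e);
            }
        }
        return count;
    }
}
```

Catching the narrower IOException rather than Exception also keeps programming errors such as a NullPointerException from being hidden.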
 
Moses Oyebade
Greenhorn
Posts: 2
Oracle
Tim, thanks for your response. I have explored Lucene and PDFBox a little online but do not seem to be making any headway. Could you please provide some code that does this indexing and searching on the PDFs? Thanks.
 
Tim Moores
Bartender
Posts: 2790
If you're serious about using Lucene then you really need a copy of "Lucene in Action"; it is a very helpful book that will pay for itself in no time by teaching you things about Lucene that you won't find online.
 