program to read and extract data from pdf file

pavithra murthy
Ranch Hand

Joined: Feb 06, 2009
Posts: 56
Dear all,
Thanks a lot, Ulm, for the help you gave me. I used the PDFBox jar file, and with the program below I can now print the full text of a PDF to my command prompt.

The next step is that I need to extract/decompress only one particular string from that text. The data is also encrypted, so I need to decrypt it first and then extract only that particular string ...

The prerequisite for this approach is:
1. Set the classpath to include pdf-0.7.3.jar and fontbox-0.1.0.jar.

I am not able to find a function for this. Could you please help me out with the program?
=========================================
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.pdfbox.pdfparser.PDFParser;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

public class boxpd {

    // Returns all text found in the PDF, or null if the file cannot be read.
    public final String getContent(final File f) {
        PDDocument pdfDocument = null;
        FileInputStream fis = null;
        String contents = null;
        try {
            System.out.println("Getting contents from PDF: " + f.getName());
            fis = new FileInputStream(f);
            PDFParser parser = new PDFParser(fis);
            parser.parse();
            pdfDocument = parser.getPDDocument();
            PDFTextStripper stripper = new PDFTextStripper();
            contents = stripper.getText(pdfDocument);
        } catch (IOException e) {
            System.out.println("Error: Can't read file " + f.getName() + ": " + e.getMessage());
        } finally {
            // Release the document and the stream even if parsing failed.
            try { if (pdfDocument != null) pdfDocument.close(); } catch (IOException ignored) { }
            try { if (fis != null) fis.close(); } catch (IOException ignored) { }
        }
        return contents;
    }

    public static void main(String[] s) {
        boxpd box = new boxpd();
        File f = new File("D:\\Exportbegleitdokument.PDF"); // some pdf file example
        String str = box.getContent(f);
        System.out.println("PDF Contents: " + str);
    }
}
====================================
Awaiting your earliest reply.
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41068
    
The next step is that I need to extract/decompress only one particular string from that text. The data is also encrypted, so I need to decrypt it first and then extract only that particular string

I'm confused about what you're trying to do. Extracting text (it sounds as if you've done that already)? Decompressing text (whatever that means)? Decrypting text (text in a PDF isn't encrypted - the whole PDF may be)? So, TellTheDetails.


pavithra murthy
Ranch Hand

Joined: Feb 06, 2009
Posts: 56
yes ulm
I am able to get all the text of the PDF on the command prompt.

Now I have to extract one particular string, or maybe more depending on the requirement, into an ordinary text document.

For example:
sample.pdf is my PDF file, and it contains the text "javaranch" at some location (it is currently displayed on the command prompt).

Now I should be able to extract that string "javaranch" into an ordinary text file/document.

I searched the PDFBox API, in PDFTextStripper, for a function that writes a particular word to a text document, but was not able to find one.

awaiting reply
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41068
    
yes ulm

If that is supposed to be my name, then please take a moment to check how it is spelled correctly. If it's supposed to be something else, then I don't know what it means.

Now I should be able to extract that string "javaranch" into an ordinary text file

What exactly does it mean to extract a string that you already know from a text? You said you were successful in getting all the text of the PDF; what would be the result of extracting the text "JavaRanch" from it? Maybe you can elaborate on what "the requirement" is.

awaiting reply

I'd advise avoiding comments like this; they sound impatient.
Siva Masilamani
Ranch Hand

Joined: Sep 19, 2008
Posts: 385
If I understand your question correctly, you can use regular expressions in Java to pull the content you want out of the document text.

The classes in java.util.regex (Pattern and Matcher) may help in such a case.

Before using them, convert the entire document into a String, which your getContent method already does, and then apply the regular expression to that String.
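
To make the idea concrete, here is a rough sketch of the regex approach, assuming Java 7 or later. The class name StringExtractor, the methods findAll and writeLines, the sample text, and the output file name extracted.txt are all made up for illustration; none of this comes from PDFBox. The String passed in stands in for whatever getContent returns.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of PDFBox.
class StringExtractor {

    // Returns every substring of text that matches regex, in order of appearance.
    public static List<String> findAll(String text, String regex) {
        List<String> hits = new ArrayList<String>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    // Writes one match per line to a plain text file (UTF-8).
    public static void writeLines(List<String> lines, Path target) throws IOException {
        Files.write(target, lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the String returned by getContent(f).
        String contents = "Some header text\nVisit javaranch for Java help\nfooter";
        // Pattern.quote treats the search word as a literal, not as a regex.
        List<String> hits = findAll(contents, Pattern.quote("javaranch"));
        writeLines(hits, Paths.get("extracted.txt")); // hypothetical output file name
        System.out.println("Matches written: " + hits.size());
    }
}
```

If the string you need is not a fixed word but follows a pattern (an ID, a date, an amount), pass that pattern to findAll instead of a quoted literal.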


 