
program to read and extract data from pdf file

 
pavithra murthy
Ranch Hand
Posts: 56
Dear all,
Thanks a lot, Ulm, for the help you provided. I used the PDFBox jar file, and with the program below I can now print the full text of the PDF to my command prompt.

The next step is to extract/decompress only a particular string from that text. The data is also encrypted, so I need to decrypt it first and then extract only that particular string.

The prerequisite for this approach is:
1. Setting the classpath to include pdf-0.7.3.jar and fontbox-0.1.0.jar.

I am not able to find the function for this. Could you please help me with the program?
=========================================
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.pdfbox.pdfparser.PDFParser;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

public class boxpd {

    // Returns the text of the given PDF, or null if it could not be read.
    public final String getContent(final File f) {
        PDDocument pdfDocument = null;
        FileInputStream fis = null;
        String contents = null;
        try {
            System.out.println("Getting contents from PDF: " + f.getName());
            fis = new FileInputStream(f);
            PDFParser parser = new PDFParser(fis);
            parser.parse();
            pdfDocument = parser.getPDDocument();
            PDFTextStripper stripper = new PDFTextStripper();
            contents = stripper.getText(pdfDocument);
        } catch (IOException e) {
            System.out.println("Error: Can't read file: " + f.getName());
        } finally {
            // Release the document and the stream even if parsing failed.
            try {
                if (pdfDocument != null) {
                    pdfDocument.close();
                }
            } catch (IOException e) {
                System.out.println("Error: Can't close document: " + f.getName());
            }
            try {
                if (fis != null) {
                    fis.close();
                }
            } catch (IOException e) {
                System.out.println("Error: Can't close file: " + f.getName());
            }
        }
        return contents;
    }

    public static void main(String[] s) {
        boxpd box = new boxpd();
        File f = new File("D:\\Exportbegleitdokument.PDF"); // some pdf file example
        String str = box.getContent(f);
        System.out.println("PDF Contents: " + str);
    }
}
====================================
Awaiting your earliest reply.
 
Ulf Dittmer
Rancher
Posts: 42967
The next step is to extract/decompress only a particular string from that text. The data is also encrypted, so I need to decrypt it first and then extract only that particular string.

I'm confused on what you're trying to do. Extracting text (it sounds as if you've done that already)? Decompressing text (whatever that means)? Decrypting text (text in a PDF isn't encrypted - the whole PDF may be)? So, TellTheDetails.
 
pavithra murthy
Ranch Hand
Posts: 56
yes ulm
I am able to get all the text of the PDF on the command prompt.

Now I have to extract one particular string, or maybe more depending on the requirement, into an ordinary text document.

For example:
sample.pdf is my PDF file, and it has the data "javaranch" at some location in the PDF (currently it is displayed on the command prompt).

Now I should be able to extract that string "javaranch" into an ordinary text file/document.

I searched the PDFBox API, in PDFTextStripper, for a function that writes a particular word to a text document, but was not able to find one.

awaiting reply
 
Ulf Dittmer
Rancher
Posts: 42967
yes ulm

If that is supposed to be my name, then please take a moment to check how it is spelled correctly. If it's supposed to be something else, then I don't know what it means.

Now I should be able to extract that string "javaranch" into an ordinary text file

What exactly does it mean to extract a string that you already know from a text? You said you were successful in getting all the text of the PDF; what would be the result of extracting the text "JavaRanch" from it? Maybe you can elaborate on what "the requirement" is.

awaiting reply

I'd advise to avoid comments like this; it sounds impatient.
 
Siva Masilamani
Ranch Hand
Posts: 385
If I understand your question correctly, you can use regular expressions in Java to parse the content of the document.

The classes in java.util.regex may help you in such a case.

But before using them, convert the entire document into a string, which you already did in the main method, and then apply the regular expression classes to that string.
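
A minimal sketch of that regular-expression approach, assuming the PDF text is already available as a String (for example, the value returned by getContent(f) in the program above). The class name StringExtractor and its methods findAll and writeMatches are hypothetical names chosen for illustration; they are not part of PDFBox. The search word is treated as literal text and matched without regard to case, which is an assumption about the requirement.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: it searches text that has
// already been extracted from the PDF.
class StringExtractor {

    // Returns every occurrence of the literal word, ignoring case.
    static List<String> findAll(String text, String word) {
        // Pattern.quote makes the word match literally, not as a regex.
        Pattern pattern = Pattern.compile(Pattern.quote(word), Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(text);
        List<String> matches = new ArrayList<>();
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    // Writes one match per line to a plain text file, encoded as UTF-8.
    static void writeMatches(List<String> matches, Path target) throws IOException {
        Files.write(target, matches, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the string returned by getContent(f).
        String contents = "Welcome to JavaRanch. Ask questions at javaranch!";
        List<String> matches = findAll(contents, "javaranch");
        Path out = Files.createTempFile("matches", ".txt");
        writeMatches(matches, out);
        System.out.println(matches.size() + " match(es) written to " + out);
    }
}
```

To look for a pattern rather than a fixed word, the Pattern.quote call could be dropped and a regular expression passed in directly; which pattern to use depends on the actual requirement, which the thread does not state.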
 