Extract Text from PDF

roshan sinha
Greenhorn

Joined: Aug 28, 2013
Posts: 12
I extracted text from a PDF using PDFBox,

but the formatting and alignment of the original are not preserved in the extracted text.
How can I extract the text from a PDF with the same formatting and alignment?
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41124
Text is just that - text. It does not include formatting or layout information. It is notoriously hard to extract that information from PDFs; I'm not aware of any free tool that can do that. If you can spend lots of time on this, check out the PDF-Renderer project. It can render PDFs in Swing, so obviously it has code that knows how to handle layout and styling.

It sounds as if what you actually want is to convert the PDF to some other file format?
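
To make the layout problem a bit more concrete, here is a minimal sketch of the one part that is fairly tractable: approximating column alignment in plain text once you have the position of each piece of text. It assumes you can already obtain text runs with x/y page coordinates (PDFBox, for instance, reports per-glyph positions through its TextPosition objects). The Fragment class, the render method, and the charWidth and lineHeight parameters are hypothetical names invented for this illustration and are not part of any library. It does nothing for fonts, styling, or rotated text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not a PDFBox API: maps positioned text runs onto a
// fixed-width character grid so that columns line up in plain text.
class LayoutSketch {

    /** A run of text and the page position of its left edge (points, y grows downward). */
    static final class Fragment {
        final double x;
        final double y;
        final String text;

        Fragment(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /**
     * charWidth and lineHeight are assumed average glyph width and line
     * spacing in points; real documents need these estimated per page.
     */
    static String render(List<Fragment> fragments, double charWidth, double lineHeight) {
        // Bucket fragments into rows by rounding y to the nearest line index.
        TreeMap<Long, List<Fragment>> rows = new TreeMap<>();
        for (Fragment f : fragments) {
            long row = Math.round(f.y / lineHeight);
            rows.computeIfAbsent(row, k -> new ArrayList<>()).add(f);
        }

        StringBuilder out = new StringBuilder();
        Long previousRow = null;
        for (Map.Entry<Long, List<Fragment>> entry : rows.entrySet()) {
            // Emit an empty line for every row index skipped since the last row.
            if (previousRow != null) {
                for (long r = previousRow + 1; r < entry.getKey(); r++) {
                    out.append('\n');
                }
            }
            previousRow = entry.getKey();

            // Within a row, place fragments left to right.
            List<Fragment> row = entry.getValue();
            row.sort(Comparator.comparingDouble((Fragment f) -> f.x));
            StringBuilder line = new StringBuilder();
            for (Fragment f : row) {
                int column = (int) Math.round(f.x / charWidth);
                // If earlier text already reaches this column, separate with one space.
                if (line.length() > 0 && line.length() >= column) {
                    line.append(' ');
                }
                // Otherwise pad with spaces up to the target column.
                while (line.length() < column) {
                    line.append(' ');
                }
                line.append(f.text);
            }
            out.append(line).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Deliberately out of order: render() sorts by position, not input order.
        List<Fragment> fragments = Arrays.asList(
                new Fragment(60, 12, "3"),
                new Fragment(0, 0, "Name"),
                new Fragment(0, 12, "Apple"),
                new Fragment(60, 0, "Qty"));
        System.out.print(render(fragments, 6.0, 12.0));
    }
}
```

With the sample data in main, "Qty" and "3" both start at character column 10, so the two rows line up as a small table. The hard part in practice is everything this sketch assumes away: proportional fonts, varying font sizes, and deciding which glyphs belong to the same run.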


sudheer yathagiri kumar
Ranch Hand

Joined: Mar 22, 2011
Posts: 35
roshan sinha wrote: I extracted text from a PDF using PDFBox,

but the formatting and alignment of the original are not preserved in the extracted text.
How can I extract the text from a PDF with the same formatting and alignment?


Maybe Apache Tika is one possible solution as well; moreover, PDFBox is embedded in Tika.
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41124
Sudheer, as I pointed out to you elsewhere, Apache Tika does nothing with respect to text extraction for PDFs beyond what PDFBox does. Please don't confuse others by suggesting that it can do things that it can't do.
 