Extract Text from PDF

roshan sinha
Greenhorn

Joined: Aug 28, 2013
Posts: 12
I extracted text from a PDF using PDFBox, but the formatting and alignment of the original document are missing from the extracted text.

How can I extract the text from a PDF with the same formatting and alignment?
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42277
    
Text is just that - text. It does not include formatting or layout information. It is notoriously hard to extract that information from PDFs; I'm not aware of any free tool that can do that. If you can spend lots of time on this, check out the PDF-Renderer project. It can render PDFs in Swing, so obviously it has code that knows how to handle layout and styling.
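
To illustrate why layout has to be reconstructed rather than simply read out, here is a minimal Java sketch. It assumes you already have text runs together with their page coordinates (PDFBox, for instance, reports glyph positions through its TextPosition class). The Fragment class, the render method, and the character width and line height values are hypothetical names and numbers chosen for this example; they are not part of PDFBox or any other library. The sketch only approximates alignment on a fixed-width character grid and ignores fonts, styling, and blank lines.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class LayoutSketch {
    /** Hypothetical positioned text run: x/y in points, origin at the top-left. */
    static final class Fragment {
        final double x;
        final double y;
        final String text;

        Fragment(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /**
     * Places each fragment on a character grid so that columns roughly line up.
     * charWidth and lineHeight are assumed average values, in points.
     */
    static String render(List<Fragment> fragments, double charWidth, double lineHeight) {
        // Group fragments into rows by rounding y to the nearest line index.
        TreeMap<Long, List<Fragment>> rows = new TreeMap<>();
        for (Fragment f : fragments) {
            long row = Math.round(f.y / lineHeight);
            rows.computeIfAbsent(row, k -> new ArrayList<>()).add(f);
        }

        StringBuilder out = new StringBuilder();
        for (List<Fragment> row : rows.values()) {
            // Within a row, process fragments from left to right.
            row.sort((Fragment a, Fragment b) -> Double.compare(a.x, b.x));
            StringBuilder line = new StringBuilder();
            for (Fragment f : row) {
                int col = (int) Math.round(f.x / charWidth);
                // Keep at least one space between runs that would otherwise collide.
                if (line.length() > 0 && col <= line.length()) {
                    col = line.length() + 1;
                }
                while (line.length() < col) {
                    line.append(' ');
                }
                line.append(f.text);
            }
            out.append(line).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Fragment> page = new ArrayList<>();
        page.add(new Fragment(60, 0, "Qty"));
        page.add(new Fragment(0, 0, "Name"));
        page.add(new Fragment(0, 12, "Apple"));
        page.add(new Fragment(60, 12, "3"));
        // Assumes 6pt-wide characters and 12pt line spacing.
        System.out.print(render(page, 6.0, 12.0));
    }
}
```

Even this toy version has to guess at a character width and line height, which is part of why real PDFs with proportional fonts, multiple columns, or rotated text are so hard to reproduce faithfully.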

It sounds as if what you actually want is to convert the PDF to some other file format?


sudheer yathagiri kumar
Ranch Hand

Joined: Mar 22, 2011
Posts: 35
roshan sinha wrote: I extracted text from a PDF using PDFBox, but the formatting and alignment of the original document are missing from the extracted text.

How can I extract the text from a PDF with the same formatting and alignment?


Maybe Apache Tika is one possible solution; moreover, PDFBox is embedded in Tika.
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42277
    
Sudheer, as I pointed out to you elsewhere, Apache Tika does nothing with respect to text extraction for PDFs beyond what PDFBox does. Please don't confuse others by suggesting that it can do things that it can't do.
 