
Extract text from PDF

 
roshan sinha
Greenhorn
Posts: 13
I extracted text from a PDF using PDFBox, but the formatting and alignment of the text are missing from the extracted output.

How can I extract the text from the PDF with the same formatting and alignment?
 
Ulf Dittmer
Rancher
Posts: 42967
Text is just that - text. It does not include formatting or layout information. It is notoriously hard to extract that information from PDFs; I'm not aware of any free tool that can do that. If you can spend lots of time on this, check out the PDF-Renderer project. It can render PDFs in Swing, so obviously it has code that knows how to handle layout and styling.

It sounds as if what you actually want is to convert the PDF to some other file format?
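
To illustrate why plain extraction loses alignment, here is a rough sketch of what rebuilding layout would involve. It assumes you can get each text run together with its x/y position on the page (as far as I know, PDFBox passes such positions to PDFTextStripper subclasses as TextPosition objects). The Fragment class, the LayoutSketch name, and the CHAR_WIDTH and LINE_HEIGHT constants are all made up for this example; real PDFs use variable-width fonts, so treat this as an approximation and not a working converter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch: place positioned text fragments on a fixed-width character grid.
class LayoutSketch {

    // Hypothetical holder for one text run and its page position (in points).
    static final class Fragment {
        final double x;    // distance from the left edge of the page
        final double y;    // distance from the top edge of the page
        final String text;

        Fragment(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    // Assumed average glyph width and line height; real fonts vary.
    static final double CHAR_WIDTH = 6.0;
    static final double LINE_HEIGHT = 12.0;

    static String render(List<Fragment> fragments) {
        // Group fragments by row index; TreeMap keeps rows sorted top to bottom.
        Map<Integer, List<Fragment>> rows = new TreeMap<>();
        for (Fragment f : fragments) {
            int row = (int) Math.round(f.y / LINE_HEIGHT);
            rows.computeIfAbsent(row, k -> new ArrayList<>()).add(f);
        }

        StringBuilder out = new StringBuilder();
        for (List<Fragment> row : rows.values()) {
            // Within a row, order fragments left to right.
            row.sort((a, b) -> Double.compare(a.x, b.x));
            StringBuilder line = new StringBuilder();
            for (Fragment f : row) {
                int col = (int) Math.round(f.x / CHAR_WIDTH);
                // Pad with spaces up to the fragment's column. If the previous
                // fragment already ran past this column, the text is simply appended.
                while (line.length() < col) {
                    line.append(' ');
                }
                line.append(f.text);
            }
            out.append(line).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Fragments arrive in arbitrary order, as they may in a PDF content stream.
        List<Fragment> page = Arrays.asList(
            new Fragment(120, 12, "Price"),
            new Fragment(0, 12, "Item"),
            new Fragment(0, 24, "Widget"),
            new Fragment(120, 24, "9.99"));
        System.out.print(render(page));
    }
}
```

The point of the sketch is that column alignment only comes back once positions are part of the input; a plain string of extracted text no longer carries them. Font styles, blank lines between rows, and proportional spacing would all need additional handling.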
 
sudheer yathagiri kumar
Ranch Hand
Posts: 35
roshan sinha wrote: I extracted text from a PDF using PDFBox, but the formatting and alignment of the text are missing from the extracted output.

How can I extract the text from the PDF with the same formatting and alignment?


Apache Tika may be one possible solution; moreover, PDFBox is embedded in Tika.
 
Ulf Dittmer
Rancher
Posts: 42967
Sudheer- As I pointed out to you elsewhere, Apache Tika does nothing with respect to text extraction for PDFs beyond what PDFBox does. Please don't confuse others by suggesting that it can do things that it can't do.
 