
Reading a table in a PDF file?

Hesham Gneady
Ranch Hand

Joined: Feb 26, 2007
Posts: 66
Hello,

Is there any Java library that can help me read a table in a PDF file?
I tried the PDFBox library, but I guess it doesn't allow this.
I need to read a table in the PDF, grab the data in each cell, and then use that data.

Any help?

Thanks,
Hesham


Hesham
Venkata Kumar
Ranch Hand

Joined: Apr 16, 2008
Posts: 110

See this link: http://schmidt.devlib.org/java/libraries-pdf.html


SCJP 5.0, SCWCD 5, preparing for SCDJWS
Hesham Gneady
Ranch Hand

Joined: Feb 26, 2007
Posts: 66
Thanks for the help. I've checked most of those libraries.
Most of them can extract text from PDF files, but I don't see any that can read a table and extract the data from each cell.
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41634
    
I don't think there's a Java library that can do this. Something like JPedal will give you all the text of the PDF, but not cell by cell.
[ November 17, 2008: Message edited by: Ulf Dittmer ]
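
If a text extractor happens to preserve column gaps as runs of spaces, the plain text can sometimes be split back into cells. The following is only a minimal sketch under that assumption: the class `ColumnSplitSketch`, its methods, and the sample text are hypothetical, nothing here is part of JPedal or any other library's API, and the approach breaks down as soon as a cell is empty or a cell's own text contains double spaces.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: splits layout-preserved text into rows and cells. */
final class ColumnSplitSketch {

    /** Splits one line wherever two or more whitespace characters occur in a row. */
    static List<String> splitCells(String line) {
        // Single spaces are kept, so "Blue pen" stays one cell.
        return Arrays.asList(line.trim().split("\\s{2,}"));
    }

    /** Splits the extracted text into lines, skipping blank ones, then into cells. */
    static List<List<String>> parse(String text) {
        List<List<String>> rows = new ArrayList<>();
        for (String line : text.split("\\R")) {
            if (!line.trim().isEmpty()) {
                rows.add(splitCells(line));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Stand-in for text returned by an extractor; not real library output.
        String text = "Item        Unit price\nBlue pen    1.50\n";
        System.out.println(parse(text)); // prints [[Item, Unit price], [Blue pen, 1.50]]
    }
}
```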

Ping & DNS - my free Android networking tools app
Charles Lyons
Author
Ranch Hand

Joined: Mar 27, 2003
Posts: 836
Does the PDF file format actually have a concept of tables? It's much like PostScript, so I'd imagine it only holds layout information for (a) text, (b) vector graphics (including the lines around table cells), and (c) bitmap graphics (such as inserted images). Most PDFs aren't directly editable either, for the very reason that they don't contain nearly as much information as an original DTP/word processor/spreadsheet document. PDFs are designed for uniformly displaying a document, not for allowing non-human content analysis; at least that's what I understand. So I think you'll struggle to extract anything other than text, lines/shapes and graphics.
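
Since only positioned text is available, one workaround is to guess the grid from the coordinates. The sketch below assumes you can already obtain each text fragment together with its x/y position from some extraction library; the `Fragment` type, the `toRows` method, and the tolerance value are all hypothetical and not taken from any real API. It also assumes y grows downward, which is not how raw PDF coordinates work, so the sort order may need to be reversed.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch: rebuilds rows of cells from positioned text fragments. */
final class TableGridSketch {

    /** A piece of text and the page position where it starts (hypothetical type). */
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Groups fragments whose y values lie within yTolerance of the first
     * fragment of a row, then orders each row left to right by x.
     */
    static List<List<String>> toRows(List<Fragment> fragments, double yTolerance) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        // Assumption: smaller y means higher on the page.
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y));

        List<List<Fragment>> rows = new ArrayList<>();
        double rowY = 0;
        for (Fragment f : sorted) {
            if (rows.isEmpty() || Math.abs(f.y - rowY) > yTolerance) {
                rows.add(new ArrayList<>()); // start a new row
                rowY = f.y;
            }
            rows.get(rows.size() - 1).add(f);
        }

        List<List<String>> result = new ArrayList<>();
        for (List<Fragment> row : rows) {
            row.sort(Comparator.comparingDouble((Fragment f) -> f.x));
            List<String> cells = new ArrayList<>();
            for (Fragment f : row) {
                cells.add(f.text);
            }
            result.add(cells);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up fragments standing in for real extractor output.
        List<Fragment> fragments = new ArrayList<>();
        fragments.add(new Fragment("Price", 200, 100.2));
        fragments.add(new Fragment("Item", 50, 100));
        fragments.add(new Fragment("Pen", 50, 120));
        fragments.add(new Fragment("1.50", 200, 119.8));
        System.out.println(toRows(fragments, 2.0)); // prints [[Item, Price], [Pen, 1.50]]
    }
}
```

This only recovers rows and left-to-right order. Deciding which column a fragment belongs to when some cells are empty, or when text wraps inside a cell, would need extra heuristics such as matching x positions against the header row.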


Charles Lyons (SCJP 1.4, April 2003; SCJP 5, Dec 2006; SCWCD 1.4b, April 2004)
Author of OCEJWCD Study Companion for Oracle Exam 1Z0-899 (ISBN 0955160340)
Hesham Gneady
Ranch Hand

Joined: Feb 26, 2007
Posts: 66
I guess you're right, Charles. I first thought a PDF file's structure might be like an XML file's structure (or something like that), so that I could detect tables, images, and so on.
But I guess I was mistaken.

This means what I want to do is impossible.

Thanks for the help.
Rob Spoor
Sheriff

Joined: Oct 27, 2005
Posts: 19685
    

Well, impossible is maybe a bit harsh, but it's definitely not easy.


SCJP 1.4 - SCJP 6 - SCWCD 5 - OCEEJBD 6
How To Ask Questions How To Answer Questions
sunil dm
Greenhorn

Joined: Jul 04, 2006
Posts: 4
Hi,

I am facing a similar issue. I used PDFBox and iText without much luck. Did you come across any solution for this?
fred rosenberger
lowercase baba
Bartender

Joined: Oct 02, 2003
Posts: 11257
    

Sunil, you realize that this thread has not been touched in over two years, right?


There are only two hard things in computer science: cache invalidation, naming things, and off-by-one errors
sunil dm
Greenhorn

Joined: Jul 04, 2006
Posts: 4
Hi fred,

Yes, I see that, but I didn't find a more suitable post to check against. So, coming to the point: has anything come up in these two years that makes this simpler?
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18541
    

You can read the official PDF specification here (PDF). You'll see that it hardly uses the word "table" at all and certainly never in the context of a rectangular grid containing independent cells.

So. PDF doesn't have tables. So if you're trying to get data out of a "table" then you're going down the wrong track. You need to find out how the data is actually organized... but this was all covered in the posts from two years ago.
 
Don't get me started about those stupid light bulbs.
 