Is there any Java library that can help me read a table in a PDF file? I tried the PDFBox library, but I guess it doesn't allow this. I need to read a table in the PDF, grab the data in each cell, and then use that data.
Does the PDF file format actually have a concept of tables? It's much like PostScript, so I'd imagine it only holds layout information for (a) text, (b) vector graphics (including the lines around table cells), and (c) bitmap graphics (such as inserted images). Most PDFs aren't directly editable either, for the very reason that they don't contain nearly as much information as an original DTP/word processor/spreadsheet document. PDFs are designed for uniformly displaying a document, not for allowing non-human content analysis; at least that's what I understand. So I think you'll struggle to extract anything other than text, lines/shapes and graphics.
Charles Lyons (SCJP 1.4, April 2003; SCJP 5, Dec 2006; SCWCD 1.4b, April 2004)
Author of OCEJWCD Study Companion for Oracle Exam 1Z0-899 (ISBN 0955160340)
I guess you're right, Charles. I first thought a PDF file's structure might be like an XML file's structure (or something like that), so that I could detect tables, images, and so on. But I guess I was mistaken.
You can read the official PDF specification here (PDF). You'll see that it hardly uses the word "table" at all and certainly never in the context of a rectangular grid containing independent cells.
So PDF doesn't have tables, and if you're trying to get data out of a "table" then you're going down the wrong track. You need to find out how the data is actually organized... but this was all covered in the posts from two years ago.
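To make "find out how the data is actually organized" a bit more concrete: since a PDF only records where each piece of text is drawn, one common workaround is to collect the text fragments together with their coordinates and rebuild the rows yourself. Below is a minimal sketch of that idea in plain Java, not a definitive implementation. The Fragment class, the groupIntoRows method and the tolerance value are all hypothetical names invented for this example, and it assumes you have already obtained the positions from your PDF library (with PDFBox, subclassing PDFTextStripper and collecting the TextPosition objects is one way to do that). It also assumes y grows down the page and that every cell in a row shares roughly the same baseline.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TableRows {

    /** Hypothetical holder: one piece of text and the coordinates where it was drawn. */
    public static final class Fragment {
        final float x;
        final float y;
        final String text;

        public Fragment(float x, float y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /**
     * Groups fragments into rows. A fragment joins the current row when its y
     * differs from the row's first fragment by at most yTolerance; otherwise it
     * starts a new row. Cells in each row are then ordered left to right by x.
     * Assumption: y increases down the page, so rows come out top to bottom.
     */
    public static List<List<String>> groupIntoRows(List<Fragment> fragments, float yTolerance) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y));

        List<List<Fragment>> rows = new ArrayList<>();
        List<Fragment> current = new ArrayList<>();
        for (Fragment f : sorted) {
            if (!current.isEmpty() && Math.abs(f.y - current.get(0).y) > yTolerance) {
                rows.add(current);
                current = new ArrayList<>();
            }
            current.add(f);
        }
        if (!current.isEmpty()) {
            rows.add(current);
        }

        List<List<String>> result = new ArrayList<>();
        for (List<Fragment> row : rows) {
            row.sort(Comparator.comparingDouble((Fragment f) -> f.x));
            List<String> cells = new ArrayList<>();
            for (Fragment f : row) {
                cells.add(f.text);
            }
            result.add(cells);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up coordinates standing in for positions read from a PDF page.
        List<Fragment> fragments = new ArrayList<>();
        fragments.add(new Fragment(200f, 100.4f, "Price"));
        fragments.add(new Fragment(50f, 100f, "Item"));
        fragments.add(new Fragment(50f, 120f, "Widget"));
        fragments.add(new Fragment(200f, 119.8f, "9.99"));
        System.out.println(groupIntoRows(fragments, 2.0f)); // [[Item, Price], [Widget, 9.99]]
    }
}
```

This only works when the layout is regular. Cells that wrap onto several lines, merged cells, or columns that need to be detected from x positions would all need extra handling, and the right tolerance depends on the font size in your particular document.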