I am working on a project where I need to convert PDF to XML and XSLT.
I am able to extract the text from a PDF, but I am not able to read the layout and formatting information of the text and paragraphs. That is, I want to read the font size, font name, style, color and other formatting attributes of each piece of text or paragraph.
I have tried iText and PDFBox but have not been able to arrive at a solution.
Any help in this regard is highly appreciated.
Solution Spider
Ulf Dittmer
Marshal
Joined: Mar 22, 2005
Posts: 35241
iText and PDFBox can't do that.
The http://java.net/projects/pdf-renderer/ library can display PDFs, so it includes code that extracts layout information from them; you can try to find the bits and pieces of its source that are of interest to you.
Thanks for the help. I am going through it and will update you soon.
Lokesh Tank
Greenhorn
Joined: May 08, 2010
Posts: 18
The PDF-renderer project is pretty big, and analyzing it is taking a considerable amount of time (cont..).
Is there any other lightweight library (jar) available that would achieve the same result in less time?
Ulf Dittmer
Marshal
Joined: Mar 22, 2005
Posts: 35241
No non-commercial ones.
Lokesh Tank
Greenhorn
Joined: May 08, 2010
Posts: 18
Thanks. I have found a lightweight library: the JPOD PDF library. It is very small and provides everything I wanted to extract from the PDF.
Lance Wellspring
Greenhorn
Joined: Feb 06, 2012
Posts: 1
I am trying to do the same thing. Could you share your code, or at least provide an example of how to get started?
Thanks for your time.
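As a possible starting point for the XML side of this task, here is a minimal sketch of how formatting information might be written out once it has been extracted. It does not read a PDF and does not use JPOD, iText, PDFBox or PDF-renderer; the TextRun class, its fields, and the element and attribute names (paragraph, run, font, size, bold, color) are all hypothetical and chosen only for illustration. It relies solely on the standard javax.xml.stream API.

```java
import java.io.StringWriter;
import java.util.Arrays;
import java.util.List;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

public class FormattedTextToXml {

    /** Hypothetical holder for one run of text plus the formatting read from the PDF. */
    static final class TextRun {
        final String text;
        final String fontName;
        final float fontSize;
        final boolean bold;
        final String colorHex;

        TextRun(String text, String fontName, float fontSize, boolean bold, String colorHex) {
            this.text = text;
            this.fontName = fontName;
            this.fontSize = fontSize;
            this.bold = bold;
            this.colorHex = colorHex;
        }
    }

    /** Serializes the runs as one paragraph element; the writer escapes special characters. */
    static String toXml(List<TextRun> runs) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            xml.writeStartElement("paragraph");
            for (TextRun run : runs) {
                xml.writeStartElement("run");
                xml.writeAttribute("font", run.fontName);
                xml.writeAttribute("size", String.valueOf(run.fontSize));
                xml.writeAttribute("bold", String.valueOf(run.bold));
                xml.writeAttribute("color", run.colorHex);
                xml.writeCharacters(run.text);
                xml.writeEndElement(); // closes run
            }
            xml.writeEndElement(); // closes paragraph
            xml.flush();
            xml.close();
        } catch (XMLStreamException e) {
            // Wrapped so callers do not have to handle a checked exception.
            throw new IllegalStateException("Could not write XML", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Sample values only; in a real program they would come from the PDF library.
        List<TextRun> runs = Arrays.asList(
                new TextRun("Hello ", "Helvetica", 12f, false, "#000000"),
                new TextRun("world", "Helvetica-Bold", 12f, true, "#FF0000"));
        System.out.println(toXml(runs));
    }
}
```

Whichever PDF library ends up supplying the font name, size, style and color, keeping those values in a small intermediate class like this keeps the XML output independent of the extraction code, and an XSLT stylesheet can then be written against the resulting element names.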