I have a requirement to convert a PDF document to HTML5. I do not want to use any available tool to achieve this; I want to write my own code. Being a Java developer, I started with iText, but I found that iText just extracts the text from the PDF and does not keep the formatting and layout of the PDF.
Can someone please guide me on which API I should use to achieve this? Below are my high-level requirements.
1-Extract the text from the PDF without losing the formatting and layout.
2-Extract the images, if any.
3-Retain the same formatting in the newly converted HTML5 page as on the PDF page.
I'm confused - you do not want to use any available tool (why? PDF is hugely complicated, do you really want to write all that code yourself?), but you considered using iText? There's a disconnect that you need to resolve for us before we can usefully recommend an approach.
AFAIK there is no free tool to convert PDF to anything that keeps the formatting. You can use the PDFRenderer project as a basis - it can display PDFs in Swing, so obviously it knows what to do with the formatting information.
Thanks Ulf, and sorry for the confusion. What I meant is that I do not want to use any paid software. I am looking for an open source Java API. I wrote a program using iText, but it just extracts the text from the PDF.
As I said, I'm unaware of any free tool that extracts layout information from PDFs. If you are prepared to put a lot of work into it, you can go the route I suggested with the PDFRenderer source code.
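To give an idea of what the HTML side of that work could look like, here is a minimal sketch. It does not parse a PDF at all; it assumes you have already obtained each run of text together with its position and font size from whatever parsing code you end up with. The class `PdfToHtmlSketch`, the `TextRun` holder and the `toHtml` method are hypothetical names made up for this illustration, not part of iText or PDFRenderer. The one real PDF detail it relies on is that default PDF page coordinates are measured in points from the bottom-left corner, whereas CSS positions are measured from the top-left, so the y value has to be flipped. Using the font size as the distance from baseline to top of the text box is only a rough approximation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class PdfToHtmlSketch {

    /** Hypothetical holder for one run of text and where it sits on the PDF page. */
    static final class TextRun {
        final String text;
        final double x;        // left edge, in points from the left of the page
        final double baseline; // text baseline, in points from the BOTTOM of the page
        final double fontSize; // in points

        TextRun(String text, double x, double baseline, double fontSize) {
            this.text = text;
            this.x = x;
            this.baseline = baseline;
            this.fontSize = fontSize;
        }
    }

    /** Escapes the characters that would otherwise be read as HTML markup. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Emits one absolutely positioned span per text run inside a page-sized div. */
    static String toHtml(List<TextRun> runs, double pageWidth, double pageHeight) {
        StringBuilder sb = new StringBuilder();
        sb.append("<!DOCTYPE html>\n<html><head><meta charset=\"utf-8\"></head><body>\n");
        sb.append(String.format(Locale.ROOT,
                "<div style=\"position:relative;width:%.1fpt;height:%.1fpt\">\n",
                pageWidth, pageHeight));
        for (TextRun r : runs) {
            // Flip the y axis: PDF counts up from the bottom, CSS counts down from the top.
            // Treating the font size as the ascent is an approximation.
            double top = pageHeight - r.baseline - r.fontSize;
            sb.append(String.format(Locale.ROOT,
                    "<span style=\"position:absolute;left:%.1fpt;top:%.1fpt;"
                            + "font-size:%.1fpt;white-space:pre\">%s</span>\n",
                    r.x, top, r.fontSize, escape(r.text)));
        }
        sb.append("</div>\n</body></html>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up sample data for a US Letter page (612 x 792 points).
        List<TextRun> runs = new ArrayList<>();
        runs.add(new TextRun("Invoice <draft>", 72, 720, 18));
        runs.add(new TextRun("Total: 42 & change", 72, 690, 12));
        System.out.print(toHtml(runs, 612, 792));
    }
}
```

Images could in principle be placed the same way with an absolutely positioned `<img>` element once you have extracted them and know their bounding box. Fonts, rotated text and vector graphics are where most of the real effort would go.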