How can I convert a PDF file to an XML file

Amit Yadav

Joined: Aug 09, 2007
Posts: 8
I want to convert a PDF file into an XML file. The PDF may contain any kind of content, such as tables and text. Can anyone point me to source code or any other information on how to do this?
Joe Ess

Joined: Oct 29, 2001
Posts: 9189

PDF is, by design, not an easy format to manipulate. It is intended to be a finished product rather than an editable format (like RTF, DOC, HTML and so on). Our AccessingFileFormats FAQ lists the options available for working with it.

Peter Chase
Ranch Hand

Joined: Oct 30, 2001
Posts: 1970
A PDF is a description of how to render a document on a page. Things like "draw a vertical line here", "write 'foo bar baz' here in Courier". It does not contain any information about the format or organisation of the stuff it is rendering. You won't be able to tell that you're looking at a table, or a list of bullet points, or a paragraph, or anything like that.

The PDF format does contain information on a page-by-page basis. Therefore, page breaks are the one piece of format/organisation information that you can find.

If you want anything more than a raw stream of completely unformatted, disorganised text, one stream per page, you are out of luck. Recovering the original structure is virtually impossible.
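
To make the "raw text, one stream per page" result concrete, here is a minimal sketch of the XML you could realistically produce. It assumes the text of each page has already been extracted as a String by one of the PDF libraries from the FAQ; that step is not shown. The class name PagesToXml and the element names document and page are made up for this illustration and are not any standard.

```java
import java.io.StringWriter;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper: wraps already-extracted page text in a minimal XML document.
class PagesToXml {

    // pageTexts holds one String per PDF page, in page order (assumed input).
    static String toXml(List<String> pageTexts) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .newDocument();
        Element root = doc.createElement("document");
        doc.appendChild(root);

        int number = 1;
        for (String text : pageTexts) {
            // The page boundary is the only structure we can record.
            Element page = doc.createElement("page");
            page.setAttribute("number", Integer.toString(number++));
            page.setTextContent(text); // special characters are escaped on output
            root.appendChild(page);
        }

        // Serialize the DOM tree to a String without the XML declaration.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in strings for text that a PDF library would have extracted.
        System.out.println(toXml(Arrays.asList("Name Qty Price", "Widget 2 < 3.50")));
    }
}
```

Note that the table on the sample "page" comes out as a flat line of words. Nothing in the output says it was ever a table, which is exactly the limitation described above. Extracted text can also contain control characters that are not legal in XML, so real input would probably need filtering first.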
