PDF to XML Conversion using Apache Tika

sudheer yathagiri kumar
Ranch Hand

Joined: Mar 22, 2011
Posts: 35
Dear All,
I have to convert PDF files to XML using Apache Tika. Is this the right choice (PDFBox is embedded in it)?
Can you give sample source code and links related to that?
The actual requirement is that our PDFs contain tabular data, and I want to extract that data.

Thanks in advance.
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42951
I'm not sure what Apache Tika would have to do with this. You can extract the text of a PDF using PDFBox, but it's generally very hard to get at the formatting information in PDFs, so you will likely not be able to distinguish easily which text is in tables in the PDF, and which text isn't.
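
To illustrate why the table part is hard: a text extractor hands back strings with page coordinates, and deciding which strings form a table is left to heuristics. Below is a minimal sketch of one such heuristic, grouping text fragments into rows by their vertical position. It assumes the coordinates have already been obtained (for example from a subclass of PDFBox's `PDFTextStripper`, which sees `TextPosition` objects) and that y grows down the page. `TableRowGrouper`, `Fragment`, and the tolerance value are hypothetical names chosen for illustration; they are not part of PDFBox or Tika.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for illustration; not part of PDFBox or Tika.
final class TableRowGrouper {

    /** A piece of text plus the page position where it was drawn. */
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Groups fragments into rows. A fragment joins the current row when its y
     * differs from the row's first fragment by at most yTolerance; otherwise
     * it starts a new row. Each row is then ordered left to right by x.
     */
    static List<List<String>> groupIntoRows(List<Fragment> fragments, double yTolerance) {
        List<Fragment> byY = new ArrayList<>(fragments);
        byY.sort(Comparator.comparingDouble(f -> f.y));

        List<List<Fragment>> rows = new ArrayList<>();
        double rowY = 0;
        for (Fragment f : byY) {
            if (rows.isEmpty() || Math.abs(f.y - rowY) > yTolerance) {
                rows.add(new ArrayList<>());
                rowY = f.y;
            }
            rows.get(rows.size() - 1).add(f);
        }

        List<List<String>> result = new ArrayList<>();
        for (List<Fragment> row : rows) {
            row.sort(Comparator.comparingDouble(f -> f.x));
            List<String> cells = new ArrayList<>();
            for (Fragment f : row) {
                cells.add(f.text);
            }
            result.add(cells);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample positions, deliberately out of reading order.
        List<Fragment> fragments = Arrays.asList(
                new Fragment("Apple", 50, 120.3),
                new Fragment("Price", 200, 100.0),
                new Fragment("Item", 50, 100.5),
                new Fragment("1.20", 200, 120.0));
        System.out.println(groupIntoRows(fragments, 2.0));
    }
}
```

With the made-up sample above this prints `[[Item, Price], [Apple, 1.20]]`. Real PDFs add wrapped cell text, merged cells, and ordinary paragraphs that happen to line up, so a grouping like this is only a starting point.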

If you have LOTS of time available, then my advice is the same as I gave here.

Otherwise, my advice is to give up on the idea.
sudheer yathagiri kumar
Ranch Hand

Joined: Mar 22, 2011
Posts: 35
Dear Experts,

Actually, my requirement is to convert PDF table data to XML format using Apache Tika.
Can anyone help?
Is it possible to overwrite JARs in Java?
If yes, how can I call the static and private methods from my Java class?

Thanks in advance.
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42951
Yes, I think we understood that from your original question. But the question remains: why do you think Tika would be involved? Do you know what Tika is and does? Other than that, I stand by my previous post, and predict that you will end up not doing this due to its complexity.
sudheer yathagiri kumar
Ranch Hand

Joined: Mar 22, 2011
Posts: 35
Ulf Dittmer wrote: Yes, I think we understood that from your original question. But the question remains: why do you think Tika would be involved? Do you know what Tika is and does? Other than that, I stand by my previous post, and predict that you will end up not doing this due to its complexity.


I downloaded the PDF-Renderer project and ran the code. It shows a Swing UI that asks for a file name and then only displays the PDF file, nothing more than that.
My actual requirement has nothing to do with a Swing UI or styling; it is simply the extraction of data.

I used this link: https://java.net/projects/pdf-renderer/downloads
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42951
You misunderstood what I was suggesting. I'm aware that PDF-Renderer displays a PDF in a Swing GUI. What I meant was that, since PDF-Renderer can display PDFs that have tables, its code obviously knows how to extract information in tables. So you could check out what exactly that code does, and adapt that code to your purposes. This involves significant digging into that code, and will probably take a few days to accomplish. But it's the only way I could see how to use free/open source code to accomplish your objective.
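
For the XML half of the task, here is a minimal sketch of writing already-extracted table cells as XML with the JDK's own StAX writer (`javax.xml.stream`). It assumes the rows and cells have somehow been recovered from the PDF, which is the hard part discussed in this thread. The class name `TableXmlWriter` and the element names `table`, `row`, and `cell` are hypothetical choices for illustration, not a format defined by Tika, PDFBox, or PDF-Renderer.

```java
import java.io.StringWriter;
import java.util.Arrays;
import java.util.List;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper for illustration; element names are assumptions.
final class TableXmlWriter {

    /** Serializes rows of cell text as <table><row><cell>...</cell></row></table>. */
    static String toXml(List<List<String>> rows) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            xml.writeStartElement("table");
            for (List<String> row : rows) {
                xml.writeStartElement("row");
                for (String cell : row) {
                    xml.writeStartElement("cell");
                    xml.writeCharacters(cell); // escapes &, < and > in the cell text
                    xml.writeEndElement();
                }
                xml.writeEndElement();
            }
            xml.writeEndElement();
            xml.flush();
            xml.close();
        } catch (XMLStreamException e) {
            // Wrapped so callers do not need to handle a checked exception.
            throw new IllegalStateException("Could not write table XML", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up sample rows standing in for data extracted from a PDF.
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("Item", "Price"),
                Arrays.asList("A & B", "1.20"));
        System.out.println(toXml(rows));
    }
}
```

Using a stream writer rather than string concatenation means special characters in cell text are escaped automatically.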
 