PDF to XML Conversion using Apache Tika

sudheer yathagiri kumar
Ranch Hand

Joined: Mar 22, 2011
Posts: 35
Dear All,
I need to convert PDF files to XML using Apache Tika (which embeds PDFBox). Is this the right choice?
Can you give sample source code and links related to that?
The actual requirement is that the PDFs contain tabular data, and I want to extract that data.

Thanks in advance.
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41149
I'm not sure what Apache Tika would have to do with this. You can extract the text of a PDF using PDFBox, but it's generally very hard to get at the formatting information in PDFs, so you will likely not be able to distinguish easily which text is in tables in the PDF, and which text isn't.
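
A minimal sketch of why this is hard, assuming the page text has already been extracted into plain lines (for example with PDFBox's PDFTextStripper). The class TableLinesToXml, its method names, and the <table>/<row>/<cell> element names are hypothetical, and the rule "two or more whitespace characters separate columns" is only a guess about how table rows look after extraction; it is not something PDFBox or Tika provides.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: turns already-extracted text lines into a simple XML table. */
class TableLinesToXml {

    /** Escapes the characters that are special in XML text and attributes. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }

    /** Heuristic (an assumption): a run of 2+ whitespace characters separates columns. */
    static List<String> splitCells(String line) {
        List<String> cells = new ArrayList<>();
        for (String cell : line.trim().split("\\s{2,}")) {
            if (!cell.isEmpty()) {
                cells.add(cell);
            }
        }
        return cells;
    }

    /** Emits one <row> per line that splits into at least two cells. */
    static String toXml(List<String> lines) {
        StringBuilder xml = new StringBuilder("<table>\n");
        for (String line : lines) {
            List<String> cells = splitCells(line);
            if (cells.size() < 2) {
                continue; // treated as ordinary text, not a table row
            }
            xml.append("  <row>");
            for (String cell : cells) {
                xml.append("<cell>").append(escape(cell)).append("</cell>");
            }
            xml.append("</row>\n");
        }
        return xml.append("</table>\n").toString();
    }

    public static void main(String[] args) {
        // Made-up sample lines standing in for extracted PDF text.
        List<String> lines = Arrays.asList(
                "Quarterly report",
                "Item      Qty   Price",
                "Nuts & bolts   12   3.50");
        System.out.print(toXml(lines));
    }
}
```

The heuristic breaks as soon as a cell contains a double space, a column is empty, or ordinary prose happens to contain wide gaps, which is exactly the ambiguity described above: the extracted text carries no reliable marker of what was a table.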

If you have LOTS of time available, then my advice is the same as I gave here.

Otherwise, my advice is to give up on the idea.


sudheer yathagiri kumar
Ranch Hand

Joined: Mar 22, 2011
Posts: 35
Dear Experts,

Actually, my requirement is to convert PDF table data to XML format using Apache Tika.
Can anyone help?
Is it possible to overwrite JARs in Java?
If yes, how can I call the static and private methods from my Java class?

Thanks in advance.
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41149
Yes, I think we understood that from your original question. But the question remains: why do you think Tika would be involved? Do you know what Tika is and does? Other than that, I stand by my previous post, and predict that you will end up not doing this due to its complexity.
sudheer yathagiri kumar
Ranch Hand

Joined: Mar 22, 2011
Posts: 35
Ulf Dittmer wrote: Yes, I think we understood that from your original question. But the question remains: why do you think Tika would be involved? Do you know what Tika is and does? Other than that, I stand by my previous post, and predict that you will end up not doing this due to its complexity.


I downloaded the PDF-Renderer project and ran the code. It shows a Swing UI that asks for a file name and then only displays the PDF file, nothing more.
My actual requirement is not a Swing UI or styling; it is simply extraction of data,
and I do not see any data extraction there.

I used this link: https://java.net/projects/pdf-renderer/downloads
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41149
You misunderstood what I was suggesting. I'm aware that PDF-Renderer displays a PDF in a Swing GUI. What I meant was that, since PDF-Renderer can display PDFs that have tables, its code obviously knows how to extract the information in tables. So you could check out what exactly that code does, and adapt it to your purposes. This involves significant digging into that code, and will probably take a few days to accomplish. But it's the only way I can see to use free/open source code to accomplish your objective.
 