The project looks very promising, but it is still in its early stages. I know of commercial products that can do this, but are you looking to do something specific with this XML? Is text extraction from the PDF OK? PDFBox seems to support that. Or do you need some sort of meaningful hierarchical data?