I'm helping to write an application that needs to read in a Word document. (The text of the document will be processed by some language-processing software; I'm working on the
frontend for the project.) At the moment, I can read Word documents with the .doc extension using the Apache POI library (POIFSFileSystem, HWPFDocument and WordExtractor).
Now I want to be able to read .docx files. I've tried using XWPFDocument and XWPFWordExtractor. I pass OPCPackage.create(filename) as an argument to XWPFDocument, but
it's not working. The code compiles, but when I run it, it throws an org.apache.xmlbeans.XmlException. I thought I had set the classpath for the relevant jar files.
I'm using Apache POI 3.5 beta6. If anyone can shed some light on this, that would be great!
Instead of OPCPackage.create, try POIXMLDocument.openPackage. OPCPackage.create makes a new, empty package rather than opening an existing file. Here's sample code that shows the XWPFWordExtractor in action.
Note that the change notes for the trunk code (post-beta 6) list various improvements in the XWPF extractor. So you may want to grab the latest source from the repository and use that to build the jar files.
Cheers for that. Just after I posted my problem, I got it working last night. I implemented mine a little differently.
I don't know if it's the proper way to do it, but it reads in the file perfectly. I'll have a go at trying the code you linked me to (no harm in knowing two ways). I also kept getting ClassNotFoundExceptions. I put the jar files it was looking for (such as xmlbeans and dom4j) on the classpath and it worked then.
P.S. If anyone needs any help reading .doc or .docx files, I'll be happy to post code here.
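For anyone else chasing ClassNotFoundExceptions, here's a rough sketch of a check you can run before touching any document. It only uses the JDK and asks the class loader whether one representative class from each jar can be found. The class name ClasspathCheck and the method isOnClasspath are made up for this example, and the list of classes to probe is my assumption about which jars matter for a POI 3.5-era setup, so adjust it to whatever your stack trace complains about.

```java
class ClasspathCheck {

    // Hypothetical helper: returns true if the named class can be located
    // by the same class loader that loaded this class.
    static boolean isOnClasspath(String className) {
        try {
            // 'false' means: locate the class but do not run its static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // The class itself, or something it depends on, is missing.
            return false;
        }
    }

    public static void main(String[] args) {
        // One representative class per jar (an assumption, not an official list).
        String[] required = {
            "org.apache.poi.xwpf.usermodel.XWPFDocument", // POI OOXML jar
            "org.apache.xmlbeans.XmlObject",              // xmlbeans jar
            "org.dom4j.Document"                          // dom4j jar
        };
        for (String name : required) {
            System.out.println((isOnClasspath(name) ? "found:   " : "MISSING: ") + name);
        }
    }
}
```

Run it with the same classpath you use for your application. Any line marked MISSING points at a jar that still needs to be added.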
Could you please tell me how to read a .docx file in POI? When I try to read a .docx file using XWPF, it throws this exception:
Exception in thread "main" org.apache.poi.openxml4j.exceptions.InvalidFormatException: Package should contain a content type part [M1.13]
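One thing worth checking with that error: a .docx file is a ZIP package that must contain an entry named [Content_Types].xml at its root, and this exception generally means POI could not find one. A common cause is a file that is really an old binary .doc saved with a .docx extension, or a file that is not a ZIP archive at all, though I can't tell from the message alone whether that applies here. Below is a minimal sketch of such a check using only the JDK. The class name DocxPackageCheck and the method hasContentTypesPart are made up for this example and are not part of POI.

```java
import java.io.File;
import java.io.IOException;
import java.util.zip.ZipFile;

class DocxPackageCheck {

    // Hypothetical helper: returns true if the file opens as a ZIP archive
    // and has a root-level [Content_Types].xml entry.
    static boolean hasContentTypesPart(File file) {
        try (ZipFile zip = new ZipFile(file)) {
            return zip.getEntry("[Content_Types].xml") != null;
        } catch (IOException e) {
            // Missing file, unreadable file, or not a ZIP archive at all.
            return false;
        }
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: java DocxPackageCheck <file>");
            return;
        }
        File file = new File(args[0]);
        if (hasContentTypesPart(file)) {
            System.out.println(file + " has a [Content_Types].xml part");
        } else {
            System.out.println(file + " is not a ZIP package with a [Content_Types].xml part");
        }
    }
}
```

If the check fails for your file, try opening it in Word and saving it again as .docx, or read it with the HWPF classes if it turns out to be a .doc.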