
Problems reading in .docx files in java

D Slevin

Joined: Jul 29, 2009
Posts: 6

I'm helping to write an application that needs to read in a Word document (the text of the document will be processed by some language-processing software; I'm working on the
front end for the project). At the moment, I can read Word documents with the file extension .doc using the Apache POI library (POIFSFileSystem, HWPFDocument and WordExtractor).
Now I want to be able to read .docx files. I've tried using XWPFDocument and XWPFWordExtractor. I pass OPCPackage.create(filename) as the argument to XWPFDocument, but
it's not working. The code compiles, but when I run it, it throws an org.apache.xmlbeans.XmlException. I thought I had set the classpath for the relevant jar files.
I'm using Apache POI 3.5 beta6. If anyone can shed some light on this, that would be great!
Ulf Dittmer

Joined: Mar 22, 2005
Posts: 42958
Welcome to JavaRanch.

Instead of OPCPackage.create, try POIXMLDocument.openPackage. Here's sample code that shows the XWPFWordExtractor in action.

Note that the change notes for the trunk code (post-beta 6) list various improvements in the XWPF extractor. So you may want to grab the latest source from the repository and use that to build the jar files.
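
When POI itself is the suspect, it can help to confirm that the file is readable at all without it. The following is a minimal JDK-only sketch, not POI's extractor and not the linked sample: it relies on a .docx being a ZIP archive whose main text sits in word/document.xml inside w:t elements. The class name DocxTextSketch and the method extractText are made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: pulls plain text out of a .docx using only the JDK.
class DocxTextSketch {
    // Namespace of the WordprocessingML elements (w:p, w:r, w:t).
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    static String extractText(String path) throws Exception {
        try (ZipFile zip = new ZipFile(path)) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IOException("No word/document.xml in " + path);
            }
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            try (InputStream in = zip.getInputStream(entry)) {
                Document xml = factory.newDocumentBuilder().parse(in);
                StringBuilder text = new StringBuilder();
                // One output line per w:p element. Paragraphs nested inside
                // other paragraphs (text boxes) would be emitted twice; a
                // real extractor has to handle that.
                NodeList paragraphs = xml.getElementsByTagNameNS(W_NS, "p");
                for (int i = 0; i < paragraphs.getLength(); i++) {
                    Element paragraph = (Element) paragraphs.item(i);
                    NodeList runs = paragraph.getElementsByTagNameNS(W_NS, "t");
                    for (int j = 0; j < runs.getLength(); j++) {
                        text.append(runs.item(j).getTextContent());
                    }
                    text.append('\n');
                }
                return text.toString();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java DocxTextSketch <file.docx>");
            return;
        }
        System.out.print(extractText(args[0]));
    }
}
```

If this prints the document text but the POI code still fails, the problem is more likely in how the package is opened or in the classpath than in the file. Headers, footers, footnotes and tab characters are ignored here, so it is a diagnostic aid rather than a replacement for the POI extractor.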
D Slevin

Joined: Jul 29, 2009
Posts: 6
Cheers for that. Just after I posted my problem, I got it working last night. I implemented mine a little differently.

I don't know if it's the proper way to do it, but it reads in the file perfectly. I'll have a go at the code you linked me to (no harm in knowing two ways). I also kept getting ClassNotFoundExceptions. Once I put the jar files it was looking for (such as xmlbeans and dom4j) on the classpath, it worked.

Again, thanks.

PS: If anyone needs any help reading .doc or .docx files, I'll be happy to post code here.
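
For anyone chasing the same ClassNotFoundExceptions, a quick way to see which jars are missing is to probe for one class from each before touching any document. This is only a sketch: the helper name ClasspathCheck is invented, and the three class names are picked as representatives of the poi-ooxml, xmlbeans and dom4j jars, which may not be the complete set your POI version needs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: reports which of the given classes cannot be loaded.
class ClasspathCheck {
    static List<String> missing(String... classNames) {
        List<String> notFound = new ArrayList<>();
        for (String name : classNames) {
            try {
                // 'false' = look the class up without running its static initializers.
                Class.forName(name, false, ClasspathCheck.class.getClassLoader());
            } catch (ClassNotFoundException | LinkageError e) {
                notFound.add(name);
            }
        }
        return notFound;
    }

    public static void main(String[] args) {
        List<String> notFound = missing(
                "org.apache.poi.xwpf.usermodel.XWPFDocument", // poi-ooxml jar
                "org.apache.xmlbeans.XmlObject",              // xmlbeans jar
                "org.dom4j.Document");                        // dom4j jar
        if (notFound.isEmpty()) {
            System.out.println("All probed classes are on the classpath.");
        } else {
            System.out.println("Not on the classpath: " + notFound);
        }
    }
}
```

Run it with the same classpath as the application; anything it lists points to a jar that still has to be added.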
Megha Ad

Joined: Nov 05, 2009
Posts: 1
We are facing a similar problem using POI to read Word 2007 documents. Can you please tell me where you got version 3.5?
Please share the sample code as well.
Thanks and regards
Ulf Dittmer

Joined: Mar 22, 2005
Posts: 42958
Megha Ad wrote: can you please tell me from where you get 3.5 version?

Searching for "download apache poi" should find it real quick.
Jeya Sri

Joined: May 27, 2012
Posts: 1
Could you please tell me how to read a .docx file with POI? When I try to read a .docx file using XWPF, it throws this exception:

Exception in thread "main" org.apache.poi.openxml4j.exceptions.InvalidFormatException: Package should contain a content type part [M1.13]
at org.apache.poi.openxml4j.opc.ZipPackage.getPartsImpl(ZipPackage.java:148)
at org.apache.poi.openxml4j.opc.OPCPackage.getParts(OPCPackage.java:623)
at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:209)
at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:186)
at org.apache.poi.openxml4j.opc.OPCPackage.openOrCreate(OPCPackage.java:248)
at view.Document_XWPF_Sample.main(Document_XWPF_Sample.java:28)

Please let me know as soon as possible.
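
The "[M1.13]" message says the package has no content type part. As far as I know, that usually means the file is not a real .docx, for example an old binary .doc renamed to .docx, an empty file, or a truncated download. A .docx is a ZIP archive that must contain an entry named [Content_Types].xml, so this can be checked with the JDK alone. The sketch below is an assumption-laden diagnostic, not part of POI; OoxmlPackageCheck and its constants are invented names.

```java
import java.io.IOException;
import java.util.zip.ZipException;
import java.util.zip.ZipFile;

// Hypothetical diagnostic: does this file look like a Word OOXML package?
class OoxmlPackageCheck {
    static final String OK = "looks like a .docx package";
    static final String NOT_ZIP = "not a ZIP archive (old .doc, empty or corrupt file?)";
    static final String NO_CONTENT_TYPES = "ZIP archive without [Content_Types].xml";
    static final String NO_DOCUMENT = "OOXML package without word/document.xml";

    static String diagnose(String path) throws IOException {
        try (ZipFile zip = new ZipFile(path)) {
            if (zip.getEntry("[Content_Types].xml") == null) {
                return NO_CONTENT_TYPES;
            }
            if (zip.getEntry("word/document.xml") == null) {
                return NO_DOCUMENT; // e.g. an .xlsx or .pptx instead
            }
            return OK;
        } catch (ZipException e) {
            // The JDK could not read the file as a ZIP archive at all.
            return NOT_ZIP;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java OoxmlPackageCheck <file.docx>");
            return;
        }
        System.out.println(args[0] + ": " + diagnose(args[0]));
    }
}
```

A missing file still surfaces as an ordinary IOException, which is worth ruling out first since the stack trace above goes through openOrCreate.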
alaky alakiyea

Joined: Dec 17, 2012
Posts: 5
I have a problem parsing a zip file with the Tika parser.
When parsing the zip file I get this error:

java.lang.InternalError: jzentry == 0, jzfile = 139750727169136, total = 235, name = /tmp/apache-tika-8076182698055047262.tmp, i = 176, message = null
at java.util.zip.ZipFile$2.nextElement(ZipFile.java:322)
at java.util.zip.ZipFile$2.nextElement(ZipFile.java:304)
at org.apache.poi.openxml4j.opc.ZipPackage.getPartsImpl(ZipPackage.java:158)
at org.apache.poi.openxml4j.opc.OPCPackage.getParts(OPCPackage.java:615)
at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:208)
at org.apache.tika.parser.pkg.ZipContainerDetector.detectOfficeOpenXML(ZipContainerDetector.java:118)
at org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:74)
at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61)
at org.apache.tika.Tika.detect(Tika.java:134)
at org.apache.tika.Tika.detect(Tika.java:181)
at org.apache.tika.Tika.detect(Tika.java:228)
at java.lang.Thread.run(Thread.java:619)

Please respond to my error.