Problems reading in .docx files in java

D Slevin
Greenhorn

Joined: Jul 29, 2009
Posts: 6
Hi,

I'm helping to write an application that needs to read in a Word document (the text of the document will be processed by some language-processing software; I'm working on the frontend for the project). At the moment I can read Word documents with the file extension .doc using the Apache POI library (POIFSFileSystem, HWPFDocument and WordExtractor).
Now I want to be able to read .docx files as well. I've tried using XWPFDocument and XWPFWordExtractor. I pass OPCPackage.create(filename) as the argument to XWPFDocument, but it's not working. The code compiles, but when I run it, it throws an org.apache.xmlbeans.XmlException. I thought I had set the classpath for the relevant jar files.
I'm using Apache POI 3.5 beta6. If anyone can shed some light on this, that would be great!
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 39547
Welcome to JavaRanch.

Instead of OPCPackage.create, try POIXMLDocument.openPackage. Here's sample code that shows the XWPFWordExtractor in action.

Note that the change notes for the trunk code (post-beta 6) list various improvements in the XWPF extractor. So you may want to grab the latest source from the repository and use that to build the jar files.
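For a quick check that a .docx file is readable at all, independent of which POI jars are on the classpath, here is a minimal sketch that uses only JDK classes. It does not use the POI classes named above: it assumes the file is an ordinary zip package whose main text lives in word/document.xml, and it collects the w:t text runs of each w:p paragraph. The names PlainDocxText and extractText are made up for this illustration. Headers, footers and footnotes live in other parts of the package and are ignored.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class PlainDocxText {
    // WordprocessingML namespace used by the w:p and w:t elements
    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the body text of a .docx file, one line per paragraph. */
    static String extractText(File docx) throws Exception {
        try (ZipFile zip = new ZipFile(docx)) {
            // Assumption: the main document part is at its usual location
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IOException("no word/document.xml in " + docx);
            }
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            try (InputStream in = zip.getInputStream(entry)) {
                Document xml = factory.newDocumentBuilder().parse(in);
                StringBuilder text = new StringBuilder();
                // Every w:p is visited, so paragraphs nested inside other
                // paragraphs (e.g. text boxes) would appear twice
                NodeList paragraphs = xml.getElementsByTagNameNS(W_NS, "p");
                for (int i = 0; i < paragraphs.getLength(); i++) {
                    Element paragraph = (Element) paragraphs.item(i);
                    NodeList runs = paragraph.getElementsByTagNameNS(W_NS, "t");
                    for (int j = 0; j < runs.getLength(); j++) {
                        text.append(runs.item(j).getTextContent());
                    }
                    text.append('\n');
                }
                return text.toString();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java PlainDocxText file.docx");
            return;
        }
        System.out.print(extractText(new File(args[0])));
    }
}
```

If this prints the expected text but the POI-based code still fails, the file itself is probably fine and the problem is more likely in how the package is opened or in the classpath.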


D Slevin
Greenhorn

Joined: Jul 29, 2009
Posts: 6
Hi,
Cheers for that. I got it working last night, just after I posted my problem. I implemented mine a little differently.



I don't know if it's the proper way to do it, but it reads in the file perfectly. I'll have a go at trying the code you linked me to (no harm in knowing two ways). I also kept getting ClassNotFoundExceptions. I put the jar files it was looking for (such as xmlbeans and dom4j) in the classpath and it worked then.
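To track down ClassNotFoundExceptions like these before running the real code, a small check along the following lines may help. It is only a sketch: ClasspathCheck and missingClasses are hypothetical names, and the probe list is an assumption (one representative class per jar mentioned in this thread), so adjust it to the classes your own stack trace names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ClasspathCheck {
    /** Returns the names that cannot be loaded from the current classpath. */
    static List<String> missingClasses(List<String> classNames) {
        List<String> missing = new ArrayList<>();
        ClassLoader loader = ClasspathCheck.class.getClassLoader();
        for (String name : classNames) {
            try {
                // initialize=false: only locate the class, do not run
                // its static initializers
                Class.forName(name, false, loader);
            } catch (ClassNotFoundException | LinkageError e) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Assumed probes: one class each from the POI OOXML, xmlbeans
        // and dom4j jars
        List<String> probes = Arrays.asList(
                "org.apache.poi.xwpf.usermodel.XWPFDocument",
                "org.apache.xmlbeans.XmlException",
                "org.dom4j.Document");
        List<String> missing = missingClasses(probes);
        if (missing.isEmpty()) {
            System.out.println("All probe classes were found.");
        } else {
            System.out.println("Not on the classpath: " + missing);
        }
    }
}
```

Run it with the same classpath as the application; any name it reports points to a jar that still needs to be added.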

Again, thanks.

P.S. If anyone needs any help reading .doc or .docx files, I'll be happy to post code here.
Megha Ad
Greenhorn

Joined: Nov 05, 2009
Posts: 1
Hi,
We are facing a similar problem using POI to read Word 2007 documents. Can you please tell me where you got the 3.5 version?
Please share the sample code as well.
Thanks and regards
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 39547
Megha Ad wrote: can you please tell me from where you get 3.5 version?

Searching for "download apache poi" should find it real quick.
Jeya Sri
Greenhorn

Joined: May 27, 2012
Posts: 1
Could you please tell me how to read a .docx file with POI? When I try to read a .docx file using XWPF, it throws this exception:


Exception in thread "main" org.apache.poi.openxml4j.exceptions.InvalidFormatException: Package should contain a content type part [M1.13]
at org.apache.poi.openxml4j.opc.ZipPackage.getPartsImpl(ZipPackage.java:148)
at org.apache.poi.openxml4j.opc.OPCPackage.getParts(OPCPackage.java:623)
at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:209)
at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:186)
at org.apache.poi.openxml4j.opc.OPCPackage.openOrCreate(OPCPackage.java:248)
at view.Document_XWPF_Sample.main(Document_XWPF_Sample.java:28)

Please let me know as soon as possible.
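The message in this trace says the package has no content type part. One way to narrow that down, sketched here with JDK classes only, is to check whether the file is a zip archive containing an entry named [Content_Types].xml, which an OOXML package is expected to have. DocxPackageCheck and diagnose are hypothetical names for this illustration. An empty file, or an old binary .doc file renamed to .docx, would fail this check; whether either is the cause here cannot be told from the post.

```java
import java.io.File;
import java.io.IOException;
import java.util.zip.ZipException;
import java.util.zip.ZipFile;

public class DocxPackageCheck {
    // Name of the content type part inside the zip package
    static final String CONTENT_TYPES = "[Content_Types].xml";

    /** Returns a short description of what the file looks like. */
    static String diagnose(File file) {
        if (!file.isFile()) {
            return "not a file";
        }
        if (file.length() == 0) {
            return "empty file";
        }
        try (ZipFile zip = new ZipFile(file)) {
            if (zip.getEntry(CONTENT_TYPES) == null) {
                return "zip archive without " + CONTENT_TYPES;
            }
            return "looks like an OPC package";
        } catch (ZipException e) {
            // Thrown when the bytes are not a valid zip archive
            return "not a zip archive";
        } catch (IOException e) {
            return "unreadable: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: java DocxPackageCheck file.docx");
            return;
        }
        System.out.println(diagnose(new File(args[0])));
    }
}
```

Anything other than "looks like an OPC package" suggests the problem is in the file rather than in the XWPF code that reads it.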
alaky alakiyea
Greenhorn

Joined: Dec 17, 2012
Posts: 5
I have a problem parsing a zip file with the Tika parser.
When parsing the zip file I get this error:

java.lang.InternalError: jzentry == 0, jzfile = 139750727169136, total = 235, name = /tmp/apache-tika-8076182698055047262.tmp, i = 176, message = null
at java.util.zip.ZipFile$2.nextElement(ZipFile.java:322)
at java.util.zip.ZipFile$2.nextElement(ZipFile.java:304)
at org.apache.poi.openxml4j.opc.ZipPackage.getPartsImpl(ZipPackage.java:158)
at org.apache.poi.openxml4j.opc.OPCPackage.getParts(OPCPackage.java:615)
at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:208)
at org.apache.tika.parser.pkg.ZipContainerDetector.detectOfficeOpenXML(ZipContainerDetector.java:118)
at org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:74)
at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61)
at org.apache.tika.Tika.detect(Tika.java:134)
at org.apache.tika.Tika.detect(Tika.java:181)
at org.apache.tika.Tika.detect(Tika.java:228)
at java.lang.Thread.run(Thread.java:619)

Please respond to my error.
Thanks
 