
Problems reading in .docx files in java

 
D Slevin
Greenhorn
Posts: 6
Hi,

I'm helping to write an application that needs to read in a Word doc (the text of the doc will be processed by some language processing software; I'm working on the
frontend for the project). At the moment, I am able to read in Word docs with the file extension .doc using the Apache POI library (POIFSFileSystem, HWPFDocument and WordExtractor).
Now I want to be able to read in .docx files. I've tried using XWPFDocument and XWPFWordExtractor. I pass in OPCPackage.create(filename) as an argument to XWPFDocument, but
it's not working. The code compiles, but when I run it, it throws an org.apache.xmlbeans.XmlException. I thought I had set the classpath for the relevant jar files.
I'm using Apache POI 3.5 beta6. If anyone can shed some light on this, that would be great!
 
Ulf Dittmer
Rancher
Posts: 42967
73
Welcome to JavaRanch.

Instead of OPCPackage.create, try POIXMLDocument.openPackage. Here's sample code that shows the XWPFWordExtractor in action.

Note that the change notes for the trunk code (post-beta 6) list various improvements in the XWPF extractor. So you may want to grab the latest source from the repository and use that to build the jar files.
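
For anyone curious what the extractor has to work with: a .docx file is a ZIP package, and the body text usually lives in the part word/document.xml. The sketch below is not POI code and not what XWPFWordExtractor does internally; it is a JDK-only illustration that assumes the main part sits at that conventional path and that only paragraph text is wanted (no headers, footnotes or tables handled specially). The class and method names are made up for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, for illustration only; not part of Apache POI.
class PlainDocxText {
    // Namespace of the WordprocessingML elements (w:p, w:r, w:t).
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the text of every w:p paragraph, one paragraph per line. */
    static String extract(Path docx) throws Exception {
        try (ZipFile zip = new ZipFile(docx.toFile())) {
            // Assumption: the main document part is at the conventional path.
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IOException("no word/document.xml in " + docx);
            }
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for the NS lookups below
            Document dom;
            try (InputStream in = zip.getInputStream(entry)) {
                dom = factory.newDocumentBuilder().parse(in);
            }
            StringBuilder out = new StringBuilder();
            NodeList paragraphs = dom.getElementsByTagNameNS(W_NS, "p");
            for (int i = 0; i < paragraphs.getLength(); i++) {
                Element p = (Element) paragraphs.item(i);
                // Concatenate all text runs (w:t) inside this paragraph.
                // Note: paragraphs nested in text boxes would be seen twice.
                NodeList texts = p.getElementsByTagNameNS(W_NS, "t");
                for (int j = 0; j < texts.getLength(); j++) {
                    out.append(texts.item(j).getTextContent());
                }
                out.append('\n');
            }
            return out.toString();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.print(extract(Paths.get(args[0])));
    }
}
```

For real documents the POI extractor is still the better choice, since it follows the package relationships instead of assuming a fixed path.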
 
D Slevin
Greenhorn
Posts: 6
Hi,
Cheers for that. Just after I posted my problem, I got it working last night. I implemented mine a little differently.



I don't know if it's the proper way to do it, but it reads in the file perfectly. I'll have a go at trying the code you linked me to (no harm in knowing 2 ways). I also kept getting ClassNotFoundExceptions. I put the jar files it was looking for (such as xmlbeans and dom4j) on the classpath and it worked then.
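
A quick way to track down missing jars like these before running the real code is to ask the class loader for one class from each library. This is only a sketch: the helper name is invented, and the three class names are examples picked from poi-ooxml, xmlbeans and dom4j, so swap in whatever class your own ClassNotFoundException names.

```java
// Hypothetical diagnostic helper; adjust the class names to your own setup.
class ClasspathCheck {

    /** Returns true if the named class can be located on the classpath. */
    static boolean isAvailable(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] required = {
            "org.apache.poi.xwpf.usermodel.XWPFDocument", // example from poi-ooxml
            "org.apache.xmlbeans.XmlException",           // example from xmlbeans
            "org.dom4j.Document"                          // example from dom4j
        };
        for (String name : required) {
            System.out.println((isAvailable(name) ? "found    " : "MISSING  ") + name);
        }
    }
}
```

Run it with the same classpath as the application; every line marked MISSING points at a jar that still needs to be added.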

Again, thanks.

PS: if anyone needs any help reading doc or docx files, I'll be happy to post code here.
 
Megha Ad
Greenhorn
Posts: 1
Hi,
We are facing a similar problem using POI to read Word 2007 docs. Can you please tell me where you got the 3.5 version?
Please share the sample code as well.
Thanks and regards
 
Ulf Dittmer
Rancher
Posts: 42967
73
Megha Ad wrote:can you please tell me from where you get 3.5 version?

Searching for "download apache poi" should find it real quick.
 
Jeya Sri
Greenhorn
Posts: 1
Could you please tell me how to read a .docx file with POI? When I try to read a .docx file using XWPF, it throws this exception:


Exception in thread "main" org.apache.poi.openxml4j.exceptions.InvalidFormatException: Package should contain a content type part [M1.13]
at org.apache.poi.openxml4j.opc.ZipPackage.getPartsImpl(ZipPackage.java:148)
at org.apache.poi.openxml4j.opc.OPCPackage.getParts(OPCPackage.java:623)
at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:209)
at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:186)
at org.apache.poi.openxml4j.opc.OPCPackage.openOrCreate(OPCPackage.java:248)
at view.Document_XWPF_Sample.main(Document_XWPF_Sample.java:28)

Please let me know as soon as possible.
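
The message "Package should contain a content type part" suggests that POI opened the file but could not find the [Content_Types].xml entry that every OOXML package must contain. One possible cause (an assumption, since the file itself is not shown) is that the file is not really a .docx, for example an old binary .doc saved with the wrong extension. The JDK-only sketch below checks the file signature and the expected ZIP entries; the class name and the returned messages are made up for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.ZipFile;

// Hypothetical diagnostic helper; not part of Apache POI.
class DocxSanityCheck {

    /** Returns a short description of what kind of file this appears to be. */
    static String diagnose(Path file) throws IOException {
        // Read the first four bytes (the file signature).
        byte[] head = new byte[4];
        int n = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int r;
            while (n < head.length && (r = in.read(head, n, head.length - n)) > 0) {
                n += r;
            }
        }
        // D0 CF 11 E0 is the start of the OLE2 signature used by binary .doc files.
        if (n == 4 && (head[0] & 0xFF) == 0xD0 && (head[1] & 0xFF) == 0xCF
                && (head[2] & 0xFF) == 0x11 && (head[3] & 0xFF) == 0xE0) {
            return "OLE2: binary .doc format, try HWPFDocument instead";
        }
        // "PK" 03 04 is the ZIP local file header that a .docx starts with.
        if (!(n == 4 && head[0] == 'P' && head[1] == 'K' && head[2] == 3 && head[3] == 4)) {
            return "NOT_ZIP: file does not start with a ZIP header";
        }
        try (ZipFile zip = new ZipFile(file.toFile())) {
            if (zip.getEntry("[Content_Types].xml") == null) {
                return "NO_CONTENT_TYPES: ZIP file, but not an OOXML package";
            }
            // Assumption: the main part sits at its conventional location.
            if (zip.getEntry("word/document.xml") == null) {
                return "NO_DOCUMENT: OOXML package without word/document.xml";
            }
        }
        return "OK: looks like a .docx package";
    }

    public static void main(String[] args) throws IOException {
        System.out.println(diagnose(Paths.get(args[0])));
    }
}
```

If this reports anything other than OK, the problem lies with the file rather than with the XWPF code that reads it.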
 
alaky alakiyea
Greenhorn
Posts: 5
I have a problem parsing a zip file with the Tika parser.
When parsing the zip file I get this error:

java.lang.InternalError: jzentry == 0, jzfile = 139750727169136, total = 235, name = /tmp/apache-tika-8076182698055047262.tmp, i = 176, message = null
at java.util.zip.ZipFile$2.nextElement(ZipFile.java:322)
at java.util.zip.ZipFile$2.nextElement(ZipFile.java:304)
at org.apache.poi.openxml4j.opc.ZipPackage.getPartsImpl(ZipPackage.java:158)
at org.apache.poi.openxml4j.opc.OPCPackage.getParts(OPCPackage.java:615)
at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:208)
at org.apache.tika.parser.pkg.ZipContainerDetector.detectOfficeOpenXML(ZipContainerDetector.java:118)
at org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:74)
at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61)
at org.apache.tika.Tika.detect(Tika.java:134)
at org.apache.tika.Tika.detect(Tika.java:181)
at org.apache.tika.Tika.detect(Tika.java:228)
at java.lang.Thread.run(Thread.java:619)

Please respond to my error.
Thanks
 