
Java API for word files

 
Miltos Deligiannis
Greenhorn
Posts: 29
Hello, does anybody know of a good, user-friendly API for regex searches combined with Word formatting (bold, italics, paragraph, etc.) in .doc and .docx files?
 
Ulf Dittmer
Rancher
Posts: 42967
Apache POI can do lots of things with such files, but its API is not particularly intuitive. I'm not aware of more capable free libraries, though.
 
Miltos Deligiannis
Greenhorn
Posts: 29
Ulf Dittmer wrote:Apache POI can do lots of things with such files, but its API is not particularly intuitive. I'm not aware of more capable free libraries, though.


I know Apache POI. Indeed, that's what I had in mind when talking about "user friendly"... I don't think it is.
 
Wim Vanni
Ranch Hand
Posts: 96
Eclipse IDE Java Oracle
Docx4j could be a solution.

If you can arrange for the Word documents to arrive as XML, you might be better off using XML/XPath etc. to do the searches.

Cheers,
Wim
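
To illustrate the XML/XPath idea, here is a minimal sketch that assumes the document text is already available as WordprocessingML (the XML found inside a .docx). It selects the runs whose own run properties switch bold on, then keeps those whose text matches a regex. The class and method names are made up for illustration. Bold that is inherited from a style, and matches that span several runs, are not handled here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.regex.Pattern;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: regex search restricted to directly bold runs.
final class BoldRunSearch {
    // Main WordprocessingML namespace used by .docx files.
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Text nodes of runs that carry <w:b/> not explicitly switched off.
    private static final String BOLD_TEXT =
            "//w:r[w:rPr/w:b[not(@w:val='0' or @w:val='false')]]/w:t";

    static List<String> findBoldMatches(String documentXml, String regex) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            // Refuse DOCTYPE declarations to avoid external-entity tricks.
            dbf.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = dbf.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(documentXml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override public String getNamespaceURI(String prefix) {
                    return "w".equals(prefix) ? W_NS : XMLConstants.NULL_NS_URI;
                }
                @Override public String getPrefix(String uri) {
                    return W_NS.equals(uri) ? "w" : null;
                }
                @Override public Iterator<String> getPrefixes(String uri) {
                    return Collections.<String>emptyIterator();
                }
            });

            NodeList texts = (NodeList) xpath.evaluate(BOLD_TEXT, doc, XPathConstants.NODESET);
            Pattern pattern = Pattern.compile(regex);
            List<String> hits = new ArrayList<>();
            for (int i = 0; i < texts.getLength(); i++) {
                String text = texts.item(i).getTextContent();
                if (pattern.matcher(text).find()) {
                    hits.add(text);
                }
            }
            return hits;
        } catch (Exception e) {
            // Parser setup, XML parsing and XPath errors all end up here.
            throw new IllegalArgumentException("could not search document XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<w:document xmlns:w=\"" + W_NS + "\"><w:body><w:p>"
                + "<w:r><w:rPr><w:b/></w:rPr><w:t>Total: 42</w:t></w:r>"
                + "<w:r><w:t>Total: 7</w:t></w:r>"
                + "</w:p></w:body></w:document>";
        System.out.println(findBoldMatches(xml, "\\d+"));
    }
}
```

The same pattern should work for italics by swapping `w:b` for `w:i` in the XPath expression.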
 
Miltos Deligiannis
Greenhorn
Posts: 29
Wim Vanni wrote:Docx4j could be a solution.

If you can arrange for the Word documents to arrive as XML, you might be better off using XML/XPath etc. to do the searches.

Cheers,
Wim


Actually, I need my app to handle both .doc (Word 2003, XP, etc.) and .docx files. Aspose.Words for Java is a very good API that can do many things, but unfortunately I cannot afford it (especially since its license is valid only for a given period)... On the other hand, OO UNO and Apache POI have a pretty steep learning curve. The good thing about Aspose is that it does not require MS Word to be installed, and that's very important, I think.
 
Wim Vanni
Ranch Hand
Posts: 96
Eclipse IDE Java Oracle
Remember that Office 2003 came with XML support, meaning you could save Word (and Excel, etc.) documents in an XML format.

Wim

 
Ulf Dittmer
Rancher
Posts: 42967
The XML supported by Office 2003 is very different from the Office 2007 formats.
 
Miltos Deligiannis
Greenhorn
Posts: 29
Wim Vanni wrote:Remember that Office 2003 came with XML support, meaning you could save Word (and Excel, etc.) documents in an XML format.

Wim



Yep, I know that, but since I need to make the app independent of an MS Word installation, and considering that I already have a big bunch of .doc files, it is essential to handle both formats. I guess I shall have to wait for a free API, or if I become a Java guru I will make one myself :P
 
Miltos Deligiannis
Greenhorn
Posts: 29
Ulf Dittmer wrote:The XML supported by Office 2003 is very different from the Office 2007 formats.


That I didn't know. So now, for efficiency, a program should handle .doc (COM), the 2003 XML format (is that docx too?) and .docx formats, plus independence from Word... That hurts already!
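
For what it's worth, the three formats can usually be told apart from the first bytes of the file. The sketch below assumes the well-known container signatures: binary .doc files are OLE2 compound files, .docx files are ZIP archives, and Word 2003 XML is plain XML that declares the `http://schemas.microsoft.com/office/word/2003/wordml` namespace. The class name and enum are invented for illustration. Note that the signatures identify the container only, so an .xls or an arbitrary ZIP file would be reported the same way.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: guesses the container format from the start of a file.
final class WordFormatSniffer {
    enum Kind { OLE2_CONTAINER, ZIP_CONTAINER, WORD_2003_XML, UNKNOWN }

    // Signature of OLE2 compound files (binary .doc, but also .xls and .ppt).
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // Local file header of a ZIP archive (.docx, but also .xlsx, .jar, ...).
    private static final byte[] ZIP = { 'P', 'K', 3, 4 };
    // Namespace declared by Word 2003 XML documents.
    private static final String WORDML_2003_NS =
            "http://schemas.microsoft.com/office/word/2003/wordml";

    // head: the first bytes of the file; a few kilobytes are assumed to suffice.
    static Kind sniff(byte[] head) {
        if (startsWith(head, OLE2)) {
            return Kind.OLE2_CONTAINER;
        }
        if (startsWith(head, ZIP)) {
            return Kind.ZIP_CONTAINER;
        }
        // Assumes a UTF-8 (or ASCII-compatible) encoding for the XML case.
        String text = new String(head, StandardCharsets.UTF_8);
        if (text.contains(WORDML_2003_NS)) {
            return Kind.WORD_2003_XML;
        }
        return Kind.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] zipHead = { 'P', 'K', 3, 4, 20, 0 };
        System.out.println(sniff(zipHead));
    }
}
```

A real application would still have to open the container to confirm that it holds a Word document rather than, say, a spreadsheet.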
 
Wim Vanni
Ranch Hand
Posts: 96
Eclipse IDE Java Oracle
I'm pretty sure Docx4j can open and convert the old formats to the newer 2007 (2010?) formats. I haven't used this myself, but making a POC (proof of concept) for this shouldn't be too hard. If this is successful, you basically have XML at that point. There are plenty of Java libraries around to handle (and search in) XML, and if needed, adding regexp searching to that shouldn't be difficult either.

You don't have to be a guru. Just learn new ingredients now and then and learn to combine them into extraordinary dishes

Chef Wim
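
On the "at that point you have XML" remark: a .docx file is a ZIP archive, and the body text conventionally lives in the entry `word/document.xml`. A minimal sketch using only the JDK follows. The class name and the `zipOf` helper (which only builds a tiny stand-in archive for the demo) are hypothetical, and the code assumes the usual entry name and UTF-8 encoding rather than resolving the package relationships properly.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper: pulls the main document XML out of a .docx stream.
final class DocxMainPart {
    // Conventional name of the main document part (an assumption, see above).
    static final String MAIN_PART = "word/document.xml";

    static String readDocumentXml(InputStream docx) {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                if (MAIN_PART.equals(e.getName())) {
                    ByteArrayOutputStream out = new ByteArrayOutputStream();
                    byte[] buf = new byte[8192];
                    for (int n = zip.read(buf); n != -1; n = zip.read(buf)) {
                        out.write(buf, 0, n);
                    }
                    return new String(out.toByteArray(), StandardCharsets.UTF_8);
                }
            }
            throw new IllegalArgumentException("no " + MAIN_PART + " entry found");
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    // Demo-only helper: builds a one-entry ZIP archive in memory.
    static byte[] zipOf(String entryName, String content) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write(content.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] fakeDocx = zipOf(MAIN_PART, "<w:document/>");
        System.out.println(readDocumentXml(new ByteArrayInputStream(fakeDocx)));
    }
}
```

With a real file, the stream would come from something like `Files.newInputStream(path)`, and the returned string could then be fed to any XML parser or regex.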
 
Miltos Deligiannis
Greenhorn
Posts: 29
Wim Vanni wrote:I'm pretty sure Docx4j can open and convert the old formats to the newer 2007 (2010?) formats. I haven't used this myself, but making a POC (proof of concept) for this shouldn't be too hard. If this is successful, you basically have XML at that point. There are plenty of Java libraries around to handle (and search in) XML, and if needed, adding regexp searching to that shouldn't be difficult either.

You don't have to be a guru. Just learn new ingredients now and then and learn to combine them into extraordinary dishes

Chef Wim


So, does Docx4j need MS Word installed?
 
Ulf Dittmer
Rancher
Posts: 42967
Miltos Deligiannis wrote:So now, for efficiency, a program should handle .doc (COM), the 2003 XML format (is that docx too?) and .docx formats

Nobody uses the Office 2003 XML formats; you can safely ignore those.

Wim Vanni wrote:I'm pretty sure Docx4j can open and convert the old formats to the newer 2007 (2010?) formats.

I've seen no indication that docx4j can handle the old binary Office formats. Or did you mean the 2003 XML formats?

Miltos Deligiannis wrote:does Docx4j need MS Word installed

No, it's an all-Java solution.
 
Miltos Deligiannis
Greenhorn
Posts: 29

Wim Vanni wrote:I'm pretty sure Docx4j can open and convert the old formats to the newer 2007 (2010?) formats.

Ulf Dittmer wrote:I've seen no indication that docx4j can handle the old binary Office formats. Or did you mean the 2003 XML formats?


Indeed, docx4j cannot handle legacy .doc files. They have to be converted first, for example with Apache POI. This is a quote from the docx4j "Getting Started" guide:

Handling legacy binary .doc files
Apache POI's HWPF can read .doc files, and docx4j could use this for basic conversion of .doc to .docx. The problem with this approach is that POI's HWPF code fails on many .doc files.
An effective approach is to use OpenOffice (via jodconverter) to convert the doc to docx, which docx4j can then process. If you need to return a binary .doc, OpenOffice/jodconverter can convert the docx back to .doc.
There is also http://b2xtranslator.sourceforge.net/ . If a pure Java approach were required, this could be converted.
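
As a rough sketch of the conversion step described in the quote: instead of jodconverter, the example below swaps in LibreOffice's headless command line, driven from Java with `ProcessBuilder`. It assumes that a `soffice` executable is on the PATH and that the installed version accepts the `--headless`, `--convert-to` and `--outdir` options. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: converts a .doc to .docx via a headless office process.
final class DocToDocx {
    // Builds the command line; assumes "soffice" is on the PATH.
    static List<String> command(Path doc, Path outDir) {
        return Arrays.asList(
                "soffice", "--headless",
                "--convert-to", "docx",
                "--outdir", outDir.toString(),
                doc.toString());
    }

    // Runs the conversion and returns the exit code of the office process.
    static int convert(Path doc, Path outDir) {
        try {
            Process process = new ProcessBuilder(command(doc, outDir))
                    .inheritIO()
                    .start();
            return process.waitFor();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while converting", e);
        }
    }

    public static void main(String[] args) {
        // Only prints the command; call convert(...) to actually run it.
        List<String> cmd = command(Paths.get("report.doc"), Paths.get("converted"));
        System.out.println(String.join(" ", cmd));
    }
}
```

The resulting .docx could then be handed to docx4j or inspected as XML. Whether the conversion preserves all formatting would need to be checked on real documents.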


 
Wim Vanni
Ranch Hand
Posts: 96
Eclipse IDE Java Oracle
The OpenOffice API was one I had in mind too when writing my last reply. Like I said: it will come down to combining a few libraries to tackle your problem.

 
Shahzad Latif
Greenhorn
Posts: 28
Miltos Deligiannis wrote:
Actually, I need my app to handle both .doc (Word 2003, XP, etc.) and .docx files. Aspose.Words for Java is a very good API that can do many things, but unfortunately I cannot afford it (especially since its license is valid only for a given period)... On the other hand, OO UNO and Apache POI have a pretty steep learning curve. The good thing about Aspose is that it does not require MS Word to be installed, and that's very important, I think.


Hi Miltos,

Just wanted to share one thing. You only need to upgrade your license if you update to a newer version more than one year after purchasing the license. If you do not need any new features and stay on your current version, you do not have to upgrade and can keep using the same license without any issues.
 