Java API for word files

Miltos Deligiannis
Greenhorn

Joined: Jun 12, 2011
Posts: 29
Hello, does anybody know of a good, user-friendly API for running regex searches combined with Word formatting (bold, italics, paragraph, etc.) in .doc and .docx files?
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41867
Apache POI can do lots of things with such files, but its API is not particularly intuitive. I'm not aware of more capable free libraries, though.


Miltos Deligiannis
Greenhorn

Joined: Jun 12, 2011
Posts: 29
Ulf Dittmer wrote:Apache POI can do lots of things with such files, but its API is not particularly intuitive. I'm not aware of more capable free libraries, though.


I know Apache POI. Indeed, that's what I had in mind when I said "user-friendly"... I don't think it is.
Wim Vanni
Ranch Hand

Joined: Apr 06, 2011
Posts: 96

Docx4j could be a solution.

If you can get the Word documents in an XML format, you might be better off using XML/XPath etc. to do the searches.
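
To illustrate the XML/XPath route, here is a minimal sketch. It assumes the document body is already available as a WordprocessingML string (the XML stored inside a .docx). It selects runs (`w:r`) that carry a direct bold flag (`w:rPr/w:b`) and keeps those whose text matches a regex. The names `BoldRunSearch` and `findBoldMatches` are made up for illustration, and bold that is inherited from a paragraph or character style is not detected.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.regex.Pattern;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: finds directly bold runs whose text matches a regex.
final class BoldRunSearch {
    // Main WordprocessingML namespace used by .docx documents.
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    static List<String> findBoldMatches(String documentXml, String regex) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // required for the w: prefix to resolve
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(documentXml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(new NamespaceContext() {
            @Override public String getNamespaceURI(String prefix) {
                return "w".equals(prefix) ? W_NS : XMLConstants.NULL_NS_URI;
            }
            @Override public String getPrefix(String uri) { return null; }
            @Override public Iterator<String> getPrefixes(String uri) {
                return Collections.emptyIterator();
            }
        });

        // Runs with a <w:b> element that is not explicitly switched off.
        NodeList runs = (NodeList) xpath.evaluate(
                "//w:r[w:rPr/w:b[not(@w:val='false' or @w:val='0' or @w:val='off')]]",
                doc, XPathConstants.NODESET);

        Pattern pattern = Pattern.compile(regex);
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < runs.getLength(); i++) {
            // A run may hold several <w:t> elements; join them before matching.
            NodeList texts = ((Element) runs.item(i)).getElementsByTagNameNS(W_NS, "t");
            StringBuilder runText = new StringBuilder();
            for (int j = 0; j < texts.getLength(); j++) {
                runText.append(texts.item(j).getTextContent());
            }
            if (pattern.matcher(runText).find()) {
                hits.add(runText.toString());
            }
        }
        return hits;
    }
}
```

Calling `BoldRunSearch.findBoldMatches(xml, "\\d{4}-\\d+")` would return the text of each directly bold run that contains something like `2011-42`. A match that spans several runs would need the run texts joined first, which this sketch does not attempt.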

Cheers,
Wim
Miltos Deligiannis
Greenhorn

Joined: Jun 12, 2011
Posts: 29
Wim Vanni wrote:Docx4j could be a solution.

If you can get the Word documents in an XML format, you might be better off using XML/XPath etc. to do the searches.

Cheers,
Wim


Actually, I need my app to handle both .doc (Word 2003, XP, etc.) and .docx files. Aspose.Words for Java is a very good API that can do many things, but unfortunately I cannot afford it (especially since its license is only valid for a limited period)... On the other hand, OO UNO and Apache POI have a pretty steep learning curve. The good thing about Aspose is that it does not require MS Word to be installed, which I think is very important.
Wim Vanni
Ranch Hand

Joined: Apr 06, 2011
Posts: 96

Remember that Office 2003 came with XML support, meaning you could save Word (and Excel, etc.) documents in an XML format.

Wim

Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41867
The XML supported by Office 2003 is very different from the Office 2007 formats.
Miltos Deligiannis
Greenhorn

Joined: Jun 12, 2011
Posts: 29
Wim Vanni wrote:Remember that Office 2003 came with XML support, meaning you could save Word (and Excel, etc.) documents in an XML format.

Wim



Yep, I know that, but since I need the app to be independent of an MS Word installation, and considering that I already have a big bunch of .doc files, it is essential to handle both formats. I guess I shall have to wait for a free API, or if I become a Java guru I will write one myself :P
Miltos Deligiannis
Greenhorn

Joined: Jun 12, 2011
Posts: 29
Ulf Dittmer wrote:The XML supported by Office 2003 is very different from the Office 2007 formats.


That I didn't know. So now, for efficiency, a program should handle .doc (COM), the 2003 XML format (is that .docx too?) and .docx, plus independence from MS Word... That hurts already!
Wim Vanni
Ranch Hand

Joined: Apr 06, 2011
Posts: 96

I'm pretty sure Docx4j can open the old formats and convert them to the newer 2007 (2010?) formats. I haven't used this myself, but making a POC for it shouldn't be too hard. If that is successful, what you basically have at that point is XML. There are plenty of Java libraries around to handle (and search in) XML, and if needed, adding regexp searching to that shouldn't be difficult either.
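
A rough sketch of getting at that XML without any third-party library, assuming the file really is a .docx: the package is a ZIP archive, and the main body conventionally lives in the `word/document.xml` entry. `DocxXmlExtractor` and `readDocumentXml` are hypothetical names, and the code assumes the part is UTF-8 encoded.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper: pulls the main document XML out of a .docx package.
final class DocxXmlExtractor {
    // Conventional location of the main document part inside the ZIP.
    private static final String MAIN_PART = "word/document.xml";

    static String readDocumentXml(InputStream docx) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (MAIN_PART.equals(entry.getName())) {
                    // Copy the current entry's bytes into memory.
                    ByteArrayOutputStream out = new ByteArrayOutputStream();
                    byte[] buffer = new byte[8192];
                    int n;
                    while ((n = zip.read(buffer)) != -1) {
                        out.write(buffer, 0, n);
                    }
                    // Assumption: the part is UTF-8, which is the usual case.
                    return new String(out.toByteArray(), StandardCharsets.UTF_8);
                }
            }
        }
        throw new IOException("No " + MAIN_PART + " entry found; is this a .docx file?");
    }
}
```

The returned string can then be fed to any XML parser or XPath engine. A legacy binary .doc file has no such entry and would end in the `IOException`.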

You don't have to be a guru. Just learn new ingredients now and then, and learn to combine them into extraordinary dishes.

Chef Wim
Miltos Deligiannis
Greenhorn

Joined: Jun 12, 2011
Posts: 29
Wim Vanni wrote:I'm pretty sure Docx4j can open the old formats and convert them to the newer 2007 (2010?) formats. I haven't used this myself, but making a POC for it shouldn't be too hard. If that is successful, what you basically have at that point is XML. There are plenty of Java libraries around to handle (and search in) XML, and if needed, adding regexp searching to that shouldn't be difficult either.

You don't have to be a guru. Just learn new ingredients now and then, and learn to combine them into extraordinary dishes.

Chef Wim


So, does Docx4j need MS Word installed?
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41867
So now, for efficiency, a program should handle .doc (COM), 2003 xml format (is that docx too?) and .docx formats

Nobody uses the Office 2003 XML formats; you can safely ignore those.

I'm pretty sure Docx4j can open and convert the old formats to the newer 2007 (2010?) formats.

I've seen no indication that docx4j can handle the old binary Office formats. Or did you mean the 2003 XML formats?

does Docx4j need ms word installed

No, it's an all-Java solution.
Miltos Deligiannis
Greenhorn

Joined: Jun 12, 2011
Posts: 29

I'm pretty sure Docx4j can open and convert the old formats to the newer 2007 (2010?) formats.

I've seen no indication that docx4j can handle the old binary Office formats. Or did you mean the 2003 XML formats?


Indeed, docx4j cannot handle legacy .doc files. That can only be done by converting the file first, using Apache POI. This is a quote from the docx4j "Getting Started" guide:

Handling legacy binary .doc files
Apache POI's HWPF can read .doc files, and docx4j could use this for basic conversion of .doc to .docx. The problem with this approach is that POI's HWPF code fails on many .doc files.
An effective approach is to use OpenOffice (via jodconverter) to convert the doc to docx, which docx4j can then process. If you need to return a binary .doc, OpenOffice/jodconverter can convert the docx back to .doc.
There is also http://b2xtranslator.sourceforge.net/ . If a pure Java approach were required, this could be converted.
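
Since .doc and .docx need different handling, a program first has to tell them apart. Here is a minimal sketch, assuming it is enough to look at the leading bytes: legacy binary Office files are OLE2 compound documents, while .docx files are ZIP archives. The names `WordFormatSniffer`, `Container` and `sniff` are invented for this example. The check only identifies the container, so an .xls file or an arbitrary ZIP archive would produce the same result.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: guesses the container format from the file signature.
final class WordFormatSniffer {
    enum Container { OLE2, ZIP, UNKNOWN }

    // Signature of OLE2 compound files (legacy .doc, .xls, .ppt).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // Local file header signature of ZIP archives (.docx and friends).
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 0x03, 0x04 };

    static Container sniff(InputStream in) throws IOException {
        byte[] head = new byte[OLE2_MAGIC.length];
        int filled = 0;
        // read() may return fewer bytes than requested, so loop until full or EOF.
        while (filled < head.length) {
            int n = in.read(head, filled, head.length - filled);
            if (n < 0) break;
            filled += n;
        }
        if (startsWith(head, filled, OLE2_MAGIC)) return Container.OLE2;
        if (startsWith(head, filled, ZIP_MAGIC)) return Container.ZIP;
        return Container.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, int length, byte[] magic) {
        if (length < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) return false;
        }
        return true;
    }
}
```

An `OLE2` result would then be routed to a conversion step (POI or OpenOffice/jodconverter, as the quote suggests), and a `ZIP` result to the .docx path. The stream is consumed by the check, so the caller needs to reopen the file or wrap it in a resettable stream.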


Wim Vanni
Ranch Hand

Joined: Apr 06, 2011
Posts: 96

OpenOffice API was one I had in mind too when writing my last reply. Like I said: it will come down to combining a few libraries in order to tackle your problem.

Shahzad Latif
Greenhorn

Joined: Apr 28, 2011
Posts: 28
Miltos Deligiannis wrote:
Actually, I need my app to handle both .doc (Word 2003, XP, etc.) and .docx files. Aspose.Words for Java is a very good API that can do many things, but unfortunately I cannot afford it (especially since its license is only valid for a limited period)... On the other hand, OO UNO and Apache POI have a pretty steep learning curve. The good thing about Aspose is that it does not require MS Word to be installed, which I think is very important.


Hi Miltos,

Just wanted to share one thing. You only need to upgrade your license if you update to a newer version after one year of your license purchase. However, if you do not need any new features and do not upgrade to the newer version, you'll not have to upgrade your license and you can keep using the same license without any issues.


Developer Evangelist @ Aspose. I love to explore and learn new technologies and help other developers along the way.