MS Word to XML conversion

Kushagra Bindal
Ranch Hand

Joined: Oct 15, 2008
Posts: 156
Hi All,

I want to convert an MS Word document to XML. I want my Java code to follow the same steps that I currently perform by hand in MS Word:
1. Open the MS Word document.
2. Click the Save As option.
3. Select .xml as the save type.
4. Save it.

Doing so converts the whole document into XML nodes and tags.

I want to do the same from my Java code. Is there an API, or any other way, to convert an MS Word document to XML?

Please help if anybody has an answer.


Thanks
Kushagra
Phani Raju
Greenhorn

Joined: Aug 03, 2007
Posts: 19
Check the Apache POI API, specifically the HWPF API.
Kushagra Bindal
Ranch Hand

Joined: Oct 15, 2008
Posts: 156
Hi,

I have already gone through this API, but it is mainly for reading MS Word documents (WordExtractor) and has only a few features for writing, if I am not wrong. There is no facility in this API to convert an MS Word document into an XML file. If there is, please suggest a possible way.

Thanks
Kushagra
Phani Raju
Greenhorn

Joined: Aug 03, 2007
Posts: 19
If you are looking for a straight conversion to XML, POI does not help. It can only help you extract the text from the document, which you can then convert to whatever format you like.

I missed the point that you were looking for an outright conversion to XML.
Kushagra Bindal
Ranch Hand

Joined: Oct 15, 2008
Posts: 156
Actually, I want to get the word count of the MS Word document, and it should match exactly the word count shown by the Tools option in MS Word. With POI's WordExtractor class the count does not come out the same. It matches only when I convert the MS Word document to XML, as done from the frontend. A well-formed node for the word count is then created in the XML, and I can read the count from it.
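
Reading the stored count out of an XML file saved from Word could look roughly like the sketch below. It rests on assumptions not stated in the thread: that the count sits in an element named Words inside a DocumentProperties block, in the namespace urn:schemas-microsoft-com:office:office. The class name, method name, and sample string are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class XmlWordCount {

    // Assumed namespace of the document-properties block in Word's XML export.
    static final String OFFICE_NS = "urn:schemas-microsoft-com:office:office";

    /** Returns the value of the first Words element, or -1 if none is present. */
    static int readWordCount(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for getElementsByTagNameNS
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList words = doc.getElementsByTagNameNS(OFFICE_NS, "Words");
            if (words.getLength() == 0) {
                return -1;
            }
            return Integer.parseInt(words.item(0).getTextContent().trim());
        } catch (Exception e) {
            throw new IllegalStateException("Could not read word count", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical, heavily trimmed stand-in for a file saved from Word as XML.
        String sample =
                "<w:wordDocument"
              + " xmlns:w=\"http://schemas.microsoft.com/office/word/2003/wordml\""
              + " xmlns:o=\"urn:schemas-microsoft-com:office:office\">"
              + "<o:DocumentProperties><o:Words>42</o:Words></o:DocumentProperties>"
              + "</w:wordDocument>";
        System.out.println(readWordCount(sample)); // prints 42
    }
}
```

This only reads a number that Word itself wrote when saving, so it still needs the XML file to exist first.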
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41509
    
Are you using the getParagraphText method or the getText method?

Have you tried it with simple documents to determine in which cases the word count is off? I'm sure the POI folks would welcome test cases that show where their approach is faulty.
[ October 16, 2008: Message edited by: Ulf Dittmer ]

Kushagra Bindal
Ranch Hand

Joined: Oct 15, 2008
Posts: 156
Hi, I am using the getText method. With this method I get a lower word count than the actual MS Word document reports.

Or maybe it is because I currently count words on the basis of the space between them, treating whatever follows a space as a new word. Can you tell me in which cases Word considers a set of characters to be a new word? That may solve my problem. But I don't think it is a valid solution, because the conditions may change and then the word count will change again.
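
For reference, a space-based count of the kind described here can be sketched as follows. This is only an illustration of the whitespace rule, not the poster's actual code and not a statement of how Word counts; the class and method names are hypothetical.

```java
public class WhitespaceWordCount {

    /** Counts maximal runs of non-whitespace characters in the text. */
    static int countWords(String text) {
        if (text == null) {
            return 0;
        }
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0;
        }
        // "\\s+" treats any run of spaces, tabs, or line breaks as one separator.
        return trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        System.out.println(countWords("Hello   world\nagain")); // prints 3
        System.out.println(countWords("   "));                  // prints 0
    }
}
```

Under this rule a string such as "well-known" is one word and a lone "-" between spaces is also one word, so punctuation and hyphens are natural places to look for differences from another counter.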
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41509
    
I have no idea what Word does (I doubt anyone could write down hard and fast rules for that :-). That's why I suggested using some simple documents to try and determine where the two approaches differ; then you could say which approach is incorrect according to your definition of word count.
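
One way to run that kind of experiment is a small harness that applies two counting rules to short sample strings and reports where they disagree. The sketch below is an assumption-laden illustration: both rules, the sample strings, and all names are hypothetical, and neither rule is claimed to be what Word or POI does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CountComparison {

    // Rule B treats every run of letters or digits as one word.
    private static final Pattern ALNUM_RUN = Pattern.compile("[\\p{L}\\p{N}]+");

    // Rule A: every whitespace-separated token is a word.
    static int byWhitespace(String text) {
        String t = text.trim();
        return t.isEmpty() ? 0 : t.split("\\s+").length;
    }

    // Rule B: count the letter/digit runs found in the text.
    static int byLetterRuns(String text) {
        Matcher m = ALNUM_RUN.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    /** Returns the samples on which the two rules give different counts. */
    static List<String> disagreements(List<String> samples) {
        List<String> out = new ArrayList<>();
        for (String s : samples) {
            if (byWhitespace(s) != byLetterRuns(s)) {
                out.add(s);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> samples = Arrays.asList(
                "plain text only", "well-known fact", "a - b");
        for (String s : disagreements(samples)) {
            System.out.println(s + " -> " + byWhitespace(s) + " vs " + byLetterRuns(s));
        }
    }
}
```

Swapping one of the rules for the count obtained from Word or from POI's extracted text would turn the same loop into the test cases suggested above.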
Kushagra Bindal
Ranch Hand

Joined: Oct 15, 2008
Posts: 156
This problem can be sorted out if I convert the MS Word document into a standard XML document. So if there is any possible solution for that, please tell me.


Thanks
Kushagra
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41509
    
Kushagra Bindal wrote: This problem can be sorted out if I convert the MS Word document into a standard XML document.

How do you know that if you're not sure how your definition of word count, Word's definition of it, and the text extracted by POI are different?

If this was my problem, I'd make sure to understand why there are differences, and then I'd think about taking steps to address those.
[ October 16, 2008: Message edited by: Ulf Dittmer ]
Kushagra Bindal
Ranch Hand

Joined: Oct 15, 2008
Posts: 156
Hi Ulf,

There is an alternative solution that may help me.

There is functionality in java.io to read the file properties of a document.

The word count is itself stored in the file properties of the document. So maybe this can give me the same word count that MS Word provides, and both will be in sync.
Please show me a way to read that file property.

Thanks
Kushagra
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41509
    
Let's not sidetrack this discussion by talking about document properties. I've responded in the other thread where you asked that same question.
Kushagra Bindal
Ranch Hand

Joined: Oct 15, 2008
Posts: 156
Yes, OK, I got your message in that thread too.

Thanks for that, but in that case I am also getting the same error, namely an EOF-type exception while reading the file.

import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.model.DocumentProperties;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

// Open the .doc file as an OLE2 container and read the word count stored in it.
InputStream objInputStream = new FileInputStream("inputDocs/Performance Management Resources.doc");
POIFSFileSystem poi = new POIFSFileSystem(objInputStream);
HWPFDocument hwpfDoc = new HWPFDocument(poi);

DocumentProperties props = hwpfDoc.getDocProperties();
int wordCount = props.getCWords();
System.out.println(wordCount);

It throws the exception on
POIFSFileSystem poi = new POIFSFileSystem(objInputStream);
that is, while reading the input stream. The main problem is reading this input stream.

Is there any possible solution for that?
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41509
    
As I said, let's not sidetrack this discussion. Repost your last post in the other thread, and I'll respond there. Be sure to include details of the exception you're getting -the full stack trace-; just saying "I'm getting an exception" doesn't do any good.
krithika shekhar
Greenhorn

Joined: Sep 12, 2007
Posts: 9
Kushagra Bindal wrote: Hi All,

I want to convert an MS Word document to XML. I want my Java code to follow the same steps that I currently perform by hand in MS Word:
1. Open the MS Word document.
2. Click the Save As option.
3. Select .xml as the save type.
4. Save it.

Doing so converts the whole document into XML nodes and tags.

I want to do the same from my Java code. Is there an API, or any other way, to convert an MS Word document to XML?

Please help if anybody has an answer.


Thanks

Kushagra


Hi all,

My requirement is similar. I have to convert an MS Word document to XML. Apache POI only reads it. What do I do to convert it to XML? Please help!

Thanks ,
Krithika
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41509
    
krithika shekhar wrote: My requirement is similar. I have to convert an MS Word document to XML. Apache POI only reads it. What do I do to convert it to XML?

What exactly do you mean by "convert"? Create the XML file? There are multiple XML libraries available that can do that: DOM, JDOM, XOM, ...
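
As a rough illustration of the "create the XML file" route with the DOM API that ships with the JDK, the sketch below wraps a list of paragraph strings in elements and serializes them. It assumes the paragraphs have already been extracted from the Word document (for example with POI's WordExtractor, which is not shown); the element names document and paragraph and the class name are hypothetical and unrelated to Word's own XML schema.

```java
import java.io.StringWriter;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class ParagraphsToXml {

    /** Wraps each paragraph in a paragraph element under a document root. */
    static String toXml(List<String> paragraphs) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = doc.createElement("document");
            doc.appendChild(root);
            for (String text : paragraphs) {
                Element p = doc.createElement("paragraph");
                p.setTextContent(text); // special characters are escaped on output
                root.appendChild(p);
            }
            // An identity transform serializes the DOM tree to text.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not build XML", e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for text extracted from a Word document.
        System.out.println(toXml(Arrays.asList("First paragraph", "a < b")));
    }
}
```

This produces a plain text-in-tags file; it does not reproduce the formatting or properties that Word writes when it saves as XML.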
krithika shekhar
Greenhorn

Joined: Sep 12, 2007
Posts: 9
Ulf Dittmer wrote:

What exactly do you mean by "convert"? Create the XML file? There are multiple XML libraries available that can do that: DOM, JDOM, XOM, ...


Yes, Ulf.
Create a new XML file and write into it the contents read from the MS Word document. But I was able to do it using XOM.

Thanks anyway
Krithika

Nilesh Chavan
Greenhorn

Joined: Apr 25, 2011
Posts: 1
Dear Kritika,

Could you please share the sample code that you wrote for reading the Word document and putting its contents into a well-formed XML document?

I'm new to POI. Your help will be highly appreciated.

Thanks!
Nilesh Chavan
Pulkit Kapur
Greenhorn

Joined: Dec 07, 2011
Posts: 1
Nilesh Chavan wrote: Dear Kritika,

Could you please share the sample code that you wrote for reading the Word document and putting its contents into a well-formed XML document?

I'm new to POI. Your help will be highly appreciated.

Thanks!
Nilesh Chavan


Hi Nilesh / Kritika,

Can you share the code, please?

Thanks
Pulkit
Jimmy Clark
Ranch Hand

Joined: Apr 16, 2008
Posts: 2187
Programmatically executing the MS Word SaveAs function to generate an XML-based version of a Word document is very simple (from VBScript). A call to SaveAs on the ActiveDocument with format code 11 (wdFormatXML) will work nicely.
 