MS Word to XML conversion

 
Kushagra Bindal
Ranch Hand
Posts: 156
Hi all,

I want to convert an MS Word document to XML from my Java code, following the same steps I currently perform by hand in Word:
1. Open the Word document.
2. Choose the Save As option.
3. Select .xml as the save type.
4. Save it.

Doing so converts the whole document into XML nodes and tags.

Is there any API that can convert a Word document to XML, or any other way to do this conversion from Java?

Please help if anybody has an answer.


Thanks
Kushagra
 
Phani Raju
Greenhorn
Posts: 19
Check the Apache POI API, specifically the HWPF API.
 
Kushagra Bindal
Ranch Hand
Posts: 156
Hi,

I have already gone through this API, but it is mainly for reading MS Word documents and has only a few features for writing (WordExtractor), if I am not wrong. There is no facility in this API to convert a Word document into an XML file. If there is one, please suggest how to use it.

Thanks
Kushagra
 
Phani Raju
Greenhorn
Posts: 19
If you are looking for a straight conversion to XML, POI does not help. It can only help you extract the text from the document, which you can then convert to whatever format you like.

I missed the point that you were looking for an outright conversion to XML.
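
As a rough illustration of the "extract the text, then convert it yourself" route, here is a minimal sketch that wraps a list of paragraph strings in XML using only the JDK's DOM and Transformer classes. The paragraphs are assumed to have been extracted already (for example with POI); the class name ParagraphsToXml and the element names "document" and "paragraph" are made up for this example and are not Word's own XML schema.

```java
import java.io.StringWriter;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper: turns already-extracted paragraph text into a simple XML string.
class ParagraphsToXml {

    static String toXml(List<String> paragraphs) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .newDocument();

            // Element names are illustrative only.
            Element root = doc.createElement("document");
            doc.appendChild(root);

            for (String text : paragraphs) {
                Element p = doc.createElement("paragraph");
                // The serializer escapes characters such as & and < for us.
                p.setTextContent(text.trim());
                root.appendChild(p);
            }

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not build XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toXml(Arrays.asList("First paragraph.", "Second & last.")));
    }
}
```

Note that this produces your own XML layout, not the file Word writes with Save As. Text taken from real documents may also contain control characters that are not legal in XML, so those would need filtering first.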
 
Kushagra Bindal
Ranch Hand
Posts: 156
Actually, I want to get the word count of the MS Word document, and it should match exactly the count reported by the Word Count tool in MS Word. When I count using POI's WordExtractor class, the number does not come out the same. The counts match only when I convert the Word document to XML the way it is done from the frontend. The XML then contains a well-formed node for the word count, from which I can read the value.
 
Ulf Dittmer
Rancher
Posts: 43081
Are you using the getParagraphText method or the getText method?

Have you tried it with simple documents to determine in which cases the word count is off? I'm sure the POI folks would welcome test cases that show where their approach is faulty.
[ October 16, 2008: Message edited by: Ulf Dittmer ]
 
Kushagra Bindal
Ranch Hand
Posts: 156
Hi, I am using the getText method, and with it I get a word count lower than the actual word count of the MS Word document.

Or maybe it is because I currently count words on the basis of the space between them, treating each space as the start of a new word. Can you tell me in which cases Word considers a set of characters to be a new word? That might solve my problem. But I don't think it is a valid solution, because those conditions could change and then the word count would change again.
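
To show how much the counting rule itself matters, here is a small sketch comparing a count based on single spaces with a count based on runs of any whitespace. It makes no claim about the rules Word applies; the class and method names are made up for this example.

```java
// Hypothetical helper comparing two simple word-counting rules.
class WordCounter {

    // Naive rule: every single space character separates two "words".
    // Tabs and line breaks are not treated as separators, and
    // consecutive spaces produce empty "words".
    static int countBySingleSpace(String text) {
        if (text.isEmpty()) {
            return 0;
        }
        return text.split(" ").length;
    }

    // Whitespace rule: a word is a maximal run of non-whitespace characters,
    // so spaces, tabs and line breaks all act as separators.
    static int countByWhitespace(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0;
        }
        return trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        String sample = "one  two\tthree\r\nfour";
        System.out.println("single space: " + countBySingleSpace(sample));
        System.out.println("whitespace:   " + countByWhitespace(sample));
    }
}
```

Running both rules over the text extracted from a few small test documents, and comparing the results with the number Word reports, is one way to narrow down where the counts start to differ.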
 
Ulf Dittmer
Rancher
Posts: 43081
I have no idea what Word does (I doubt anyone could write down hard and fast rules for that :-). That's why I suggested using some simple documents to try and determine where the two approaches differ; then you could say which approach is incorrect according to your definition of word count.
 
Kushagra Bindal
Ranch Hand
Posts: 156
This problem can be sorted out if I convert the Word document into a standard XML document. So if there is any possible solution for that, please tell me.


Thanks
Kushagra
 
Ulf Dittmer
Rancher
Posts: 43081

Kushagra Bindal wrote: This problem can be sorted out if I convert the Word document into a standard XML document.


How do you know that if you're not sure how your definition of word count, Word's definition of it, and the text extracted by POI are different?

If this was my problem, I'd make sure to understand why there are differences, and then I'd think about taking steps to address those.
[ October 16, 2008: Message edited by: Ulf Dittmer ]
 
Kushagra Bindal
Ranch Hand
Posts: 156
Hi Ulf,

There is an alternative solution that may help me.

There is functionality in java.io to read the file properties of a document.

The word count is itself stored in the file properties of the document. So reading it may give me the same word count that MS Word provides, and both will be in sync.
Please tell me a way to read that file property.

Thanks
Kushagra
 
Ulf Dittmer
Rancher
Posts: 43081
Let's not sidetrack this discussion by talking about document properties. I've responded in the other thread where you asked that same question.
 
Kushagra Bindal
Ranch Hand
Posts: 156
OK, I have got your message in that thread too.

Thanks for that, but in that case I am also getting the same error: an EOF-type exception while reading the file.

// Open the .doc file and wrap it in POI's OLE2 file system
InputStream objInputStream = new FileInputStream("inputDocs/Performance Management Resources.doc");
POIFSFileSystem poi = new POIFSFileSystem(objInputStream);
HWPFDocument hwpdDocs = new HWPFDocument(poi);

// Read the word count stored in the document properties
DocumentProperties a = hwpdDocs.getDocProperties();
int as = a.getCWords();
System.out.println(as);

It throws the exception at
POIFSFileSystem poi = new POIFSFileSystem(objInputStream);
when it reads the input stream.
So the main problem here is reading this input stream.

Is there any possible solution for that?
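
Since the stack trace is not shown, the cause can only be guessed at. One assumption worth ruling out is that the file is not a complete binary .doc (OLE2) file, for example because it is empty, truncated, or was saved in another format. The sketch below uses only the JDK to check whether a file starts with the 8-byte OLE2 signature; the class and method names are made up for this example.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical diagnostic: does the file start with the OLE2 compound-document signature?
class Ole2Check {

    // Signature bytes found at the start of OLE2 files such as binary .doc documents.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Pure check on the first bytes of a file.
    static boolean isOle2Header(byte[] firstBytes) {
        return firstBytes.length >= OLE2_MAGIC.length
                && Arrays.equals(Arrays.copyOf(firstBytes, OLE2_MAGIC.length), OLE2_MAGIC);
    }

    // Reads the first 8 bytes of the file; returns false if the file is shorter than that.
    static boolean looksLikeOle2(File file) {
        byte[] header = new byte[OLE2_MAGIC.length];
        try (DataInputStream in = new DataInputStream(new FileInputStream(file))) {
            in.readFully(header);
        } catch (EOFException e) {
            return false; // fewer than 8 bytes: cannot be a valid OLE2 file
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return isOle2Header(header);
    }

    public static void main(String[] args) {
        File file = new File(args[0]);
        System.out.println(file + " exists=" + file.exists()
                + " length=" + file.length()
                + " ole2=" + (file.exists() && looksLikeOle2(file)));
    }
}
```

If the check fails, the problem lies with the file itself rather than with the POI calls. If it passes, the full stack trace is still needed to say more.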
 
Ulf Dittmer
Rancher
Posts: 43081
As I said, let's not sidetrack this discussion. Repost your last post in the other thread, and I'll respond there. Be sure to include details of the exception you're getting (the full stack trace); just saying "I'm getting an exception" doesn't do any good.
 
krithika shekhar
Greenhorn
Posts: 9

Hi all,

My requirement is similar. I have to convert an MS Word document to XML. Apache POI only reads it. What do I do to convert it to XML? Please help!

Thanks ,
Krithika
 
Ulf Dittmer
Rancher
Posts: 43081

krithika shekhar wrote: Similar is my requirement. I have to convert a MSword doc to XML. Apache Poi only reads it. What do i do for converting it to XML?


What exactly do you mean by "convert"? Create the XML file? There are multiple XML libraries available that can do that: DOM, JDOM, XOM, ...
 
krithika shekhar
Greenhorn
Posts: 9


Yes, Ulf.
Create a new XML file and write into it the contents read from the MS Word document. I was able to do it using XOM.

Thanks anyway
Krithika

 
Nilesh Chavan
Greenhorn
Posts: 1
Dear Krithika,

Could you please share the sample code you wrote for reading the Word document and putting its contents into a well-formed XML document?

I'm new to POI. Your help will be highly appreciated.

Thanks!
Nilesh Chavan
 
Greenhorn
Posts: 1

Hi Nilesh / Krithika,

Can you share the code, please?

Thanks
Pulkit
 
Ranch Hand
Posts: 2187
Programmatically executing the MS Word SaveAs function to generate an XML-based version of a Word document is very simple (from VBScript). A call to SaveAs on the ActiveDocument with format code 11 (wdFormatXML) will work nicely.
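
For anyone who needs to trigger this from Java, one possible approach is to generate a small VBScript and run it with cscript. This is only a sketch and assumes a Windows machine with MS Word installed. The class and method names are made up for this example, and the numeric save-format code is left to the caller.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: builds and runs a VBScript that asks Word to SaveAs in another format.
class WordSaveAsScript {

    // VBScript string literal: wrap in quotes and double any embedded quotes.
    static String quote(String s) {
        return "\"" + s.replace("\"", "\"\"") + "\"";
    }

    // Builds the script text. Paths should be absolute, since Word resolves
    // relative paths against its own working directory.
    static String buildScript(String inputPath, String outputPath, int saveFormatCode) {
        return String.join("\r\n",
                "Set word = CreateObject(\"Word.Application\")",
                "word.Visible = False",
                "Set doc = word.Documents.Open(" + quote(inputPath) + ")",
                "doc.SaveAs " + quote(outputPath) + ", " + saveFormatCode,
                "doc.Close",
                "word.Quit",
                "");
    }

    // Writes the script to a temporary file and runs it with the Windows script host.
    static int run(String script) throws IOException, InterruptedException {
        Path vbs = Files.createTempFile("word-saveas", ".vbs");
        try {
            // Assumes the paths in the script are representable in the default charset.
            Files.write(vbs, script.getBytes(Charset.defaultCharset()));
            Process process = new ProcessBuilder("cscript", "//NoLogo", vbs.toString())
                    .inheritIO()
                    .start();
            return process.waitFor();
        } finally {
            Files.deleteIfExists(vbs);
        }
    }

    public static void main(String[] args) throws Exception {
        // args: absolute input .doc path, absolute output path, numeric save-format code
        String script = buildScript(args[0], args[1], Integer.parseInt(args[2]));
        System.exit(run(script));
    }
}
```

Because this drives the real Word application, the output is the same file Word would write from the Save As dialog, but it cannot run on a machine without Word.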
 