How to extract data from Doc and Docx file format.

 
Rohit More
Greenhorn
Posts: 10
Hibernate Spring Linux
How can I extract data from .doc and .docx files?

I tried reading the file with FileInputStream, but it didn't work for me.
I read somewhere that FileInputStream works only for plain text files, whereas .doc and .docx are structured file formats.
So how do I extract data from Microsoft Office file formats?

Regards
Rohit More
 
Java Cowboy
Posts: 16084
88
Android Scala IntelliJ IDE Spring Java
No, FileInputStream, like all InputStreams, is for reading bytes (binary data) from files. For reading plain text, you would use a Reader: for example a FileReader, or an InputStreamReader when you need to specify the character encoding yourself.
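
To make the byte/text distinction concrete, here is a minimal sketch using only the JDK. The class name, the temporary file, and its contents are made up for illustration, and UTF-8 is an assumed encoding. The same file is read once as raw bytes and once as decoded text.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class BytesVersusText {

    // Reads the raw bytes of the file; no character decoding happens here.
    static byte[] readBytes(Path file) throws IOException {
        try (InputStream in = new FileInputStream(file.toFile())) {
            return in.readAllBytes();
        }
    }

    // Reads the first line as text, decoding the bytes as UTF-8 (an assumption).
    static String readFirstLine(Path file) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            return reader.readLine();
        }
    }

    // Writes a small sample file, reads it both ways, and reports the sizes.
    static String demo() {
        try {
            Path file = Files.createTempFile("demo", ".txt");
            // "Gr\u00fc\u00dfe" has 5 characters, two of which need 2 bytes each in UTF-8.
            Files.write(file, "Gr\u00fc\u00dfe\n".getBytes(StandardCharsets.UTF_8));
            int byteCount = readBytes(file).length;
            int charCount = readFirstLine(file).length();
            Files.delete(file);
            return byteCount + " bytes, " + charCount + " characters in first line";
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Neither approach on its own helps with Word files: the bytes come through fine, but something still has to interpret the file format.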

Microsoft Word .doc and .docx files have specific formats defined by Microsoft: .doc is a proprietary binary format, and .docx is a ZIP package of XML files (Office Open XML). The Apache POI library helps you read and write those formats.
 
Rohit More
Greenhorn
Posts: 10
Hibernate Spring Linux
Thank you, Jesper.
I have solved my problem of reading .doc and .docx files, but I am unable to get the formatting properties of the text.

Suppose I have "demo.docx"
with the following contents:

Welcome to JavaRanch


Now I am trying to read the word "Welcome" together with its italic font style, and similarly the word "JavaRanch" with its style.

Could you give me some examples?
 
Ulf Dittmer
Rancher
Posts: 43081
77
I don't know that there is much documentation beyond http://poi.apache.org/hwpf/quick-guide.html, so you'll have to do some reading of the javadocs, and some experimenting to see what works.

The basic approach is that from a Document you get its Range, and from the Range you get the CharacterRuns and Paragraphs, both of which carry style information.
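
To illustrate what "runs carry style information" means, here is a rough sketch that does not use POI at all. It reads the XML inside a .docx package directly with the JDK. The class and method names are hypothetical, the sample package contains only word/document.xml (so it is not a complete Word file), and the bold style on "JavaRanch" is an assumption. With POI the same information is exposed through its run objects (CharacterRun for .doc, XWPFRun for .docx), so check the javadocs for the exact accessors.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class DocxRunStyles {
    // Namespace of the WordprocessingML elements (w:r, w:t, w:i, w:b, ...).
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // True if the run has the toggle element (e.g. w:i) and it is not switched off.
    static boolean isOn(Element run, String name) {
        NodeList nodes = run.getElementsByTagNameNS(W, name);
        if (nodes.getLength() == 0) return false;
        String val = ((Element) nodes.item(0)).getAttributeNS(W, "val");
        return !(val.equals("false") || val.equals("0"));
    }

    // Lists each run of word/document.xml with its text and italic/bold flags.
    // Simplification: styles inherited from paragraph or named styles are ignored.
    static List<String> describeRuns(InputStream docx) throws Exception {
        List<String> result = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.getName().equals("word/document.xml")) continue;
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                Document xml = factory.newDocumentBuilder()
                        .parse(new ByteArrayInputStream(zip.readAllBytes()));
                NodeList runs = xml.getElementsByTagNameNS(W, "r");
                for (int i = 0; i < runs.getLength(); i++) {
                    Element run = (Element) runs.item(i);
                    result.add("'" + run.getTextContent() + "' italic=" + isOn(run, "i")
                            + " bold=" + isOn(run, "b"));
                }
            }
        }
        return result;
    }

    // Builds a tiny stand-in for demo.docx: one paragraph with three runs.
    static byte[] sampleDocx() throws Exception {
        String xml = "<w:document xmlns:w=\"" + W + "\"><w:body><w:p>"
                + "<w:r><w:rPr><w:i/></w:rPr><w:t>Welcome</w:t></w:r>"
                + "<w:r><w:t xml:space=\"preserve\"> to </w:t></w:r>"
                + "<w:r><w:rPr><w:b/></w:rPr><w:t>JavaRanch</w:t></w:r>"
                + "</w:p></w:body></w:document>";
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write(xml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    // Runs the reader against the sample and joins the result into one string.
    static String demo() {
        try {
            return String.join("\n", describeRuns(new ByteArrayInputStream(sampleDocx())));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

For a real file you would pass a FileInputStream for demo.docx to describeRuns instead of the in-memory sample.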
 
Rohit More
Greenhorn
Posts: 10
Hibernate Spring Linux
Sorry for the delay in replying.

I want to ask one thing. Suppose I break a docx file apart programmatically (unzipping it and zipping it again) and do whatever I want with its contents.
Is doing this in a commercial project valid or not?
 
Ulf Dittmer
Rancher
Posts: 43081
77
And by "valid" you mean... ?
 
Rohit More
Greenhorn
Posts: 10
Hibernate Spring Linux
I mean extracting the docx file, changing its original contents, and then wrapping it up again in a single zip file.
Is this process valid (authenticated) or not?

Actually, someone in the open source community said that it's illegal; he had read that somewhere.
So I just wanted to confirm whether it's legal or illegal.
 
Ulf Dittmer
Rancher
Posts: 43081
77
  • Likes 1
Not sure what you mean by "authenticated", but it is legal in the sense that the MS XML Office formats are not covered by patents that would limit what you can do with them, or how you can create or alter them (in the sense MP3 was/is, and GIF once was).

It is of course perfectly possible to create malformed files with respect to the documented file format that way, but that risk also exists to a lesser degree via possible bugs in POI (which is a very stable library, though, and unlikely to create malformed files).
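
The malformed-file risk with the unzip/edit/rezip approach can be sketched with the JDK alone. Everything named here is hypothetical (the class, the methods, the sample text), and the sample package holds only word/document.xml, so it is not a complete Word file. The check only tests that the XML is well-formed, which is a much weaker condition than being valid for the documented format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

class DocxRoundTrip {

    // Copies every entry of the package, doing a plain string replacement
    // inside word/document.xml. No XML escaping is applied (deliberately naive).
    static byte[] replaceText(byte[] docx, String from, String to) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(docx));
             ZipOutputStream zip = new ZipOutputStream(out)) {
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                byte[] data = in.readAllBytes();
                if (entry.getName().equals("word/document.xml")) {
                    String xml = new String(data, StandardCharsets.UTF_8);
                    data = xml.replace(from, to).getBytes(StandardCharsets.UTF_8);
                }
                zip.putNextEntry(new ZipEntry(entry.getName()));
                zip.write(data);
                zip.closeEntry();
            }
        }
        return out.toByteArray();
    }

    // True if word/document.xml exists in the package and parses as XML.
    static boolean isWellFormed(byte[] docx) throws IOException {
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(docx))) {
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                if (!entry.getName().equals("word/document.xml")) continue;
                try {
                    DocumentBuilderFactory.newInstance().newDocumentBuilder()
                            .parse(new ByteArrayInputStream(in.readAllBytes()));
                    return true;
                } catch (SAXException | ParserConfigurationException e) {
                    return false;
                }
            }
        }
        return false;
    }

    // Builds a tiny stand-in package with a single paragraph.
    static byte[] sample() throws IOException {
        String xml = "<w:document xmlns:w=\"http://schemas.openxmlformats.org/"
                + "wordprocessingml/2006/main\"><w:body><w:p><w:r>"
                + "<w:t>Welcome to JavaRanch</w:t></w:r></w:p></w:body></w:document>";
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write(xml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    // Compares a harmless edit with one that inserts an unescaped '&'.
    static String demo() {
        try {
            byte[] original = sample();
            byte[] plain = replaceText(original, "JavaRanch", "the ranch");
            byte[] broken = replaceText(original, "JavaRanch", "Ranch & Co");
            return "plain edit well-formed: " + isWellFormed(plain)
                    + ", unescaped '&' well-formed: " + isWellFormed(broken);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The second replacement goes wrong because a bare ampersand is not allowed in XML text; editing through an XML parser, or through a library such as POI, avoids that class of mistake.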
 