how to read a particular page from a DOC file

Gajendra Kangokar
Ranch Hand

Joined: Dec 25, 2012
Posts: 82
    
    1

Hello all,
I have a .doc file, but I am not supposed to read the entire file; instead I am given a page number.
So I need to read only that particular page from the .doc file.
I am using the Apache POI API.

Thank you.
Tony Docherty
Bartender

Joined: Aug 07, 2007
Posts: 2364
    
  50
I'm not sure that you can easily do this, as I believe .doc files don't store page number information. The page number is calculated from information such as content, font, page size etc.

I may be wrong but this post on the POI forum seems to confirm my view: http://apache-poi.1045710.n5.nabble.com/Using-text-find-the-page-number-in-word-document-td5710448.html
Gajendra Kangokar
Ranch Hand

Joined: Dec 25, 2012
Posts: 82
    
    1

OK, so the .doc file does not store page numbers.
But is there any way to know that we have come to the end of a page,
or any way to know that the page is changing?
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42594
    
  65
I don't think it is possible to know page numbers before the entire file has been read, for the reasons Tony mentioned.

Gajendra Kangokar wrote: I am not supposed to read the entire file; instead I am given a page number.

This sounds like a really strange requirement; what is the point of it?


Gajendra Kangokar
Ranch Hand

Joined: Dec 25, 2012
Posts: 82
    
    1

I just want to count the number of pages in a .doc file.
We use while((in.read())!=-1) to read until the end of the file.
But is there any logic to check whether control has come to the end of a page?
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42594
    
  65
OK, so that requirement doesn't actually exist; that's good. You could use a library like JODConverter (which relies on running OpenOffice in server mode) to convert the document to PDF - PDFs are fixed in layout, and libraries like PDFBox can tell you the number of pages.
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18879
    
    8

Basically a Word document doesn't have pages at all. When you see it displayed in Word it may appear to have pages, but that's because it's using the default page layout information to paginate the document. If you click on the Page Layout tab you'll see all the things you can change -- margins, page orientation, page size, columns, and more -- and which will affect the pagination. And as already pointed out, there are many other things which affect the pagination.
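To make that concrete, here is a toy sketch in Java. It is not how Word paginates: it assumes a fixed-width font, no images, tables, or paragraph spacing, and the class name ToyPaginator and its parameters are made up for illustration. It only shows that the same amount of text produces a different page count as soon as the layout settings change.

```java
final class ToyPaginator {

    /**
     * Pages needed for textLength characters in a hypothetical fixed-width layout.
     * This is a deliberately simplified model, not Word's layout algorithm.
     */
    static int pageCount(int textLength, int charsPerLine, int linesPerPage) {
        if (textLength < 0 || charsPerLine <= 0 || linesPerPage <= 0) {
            throw new IllegalArgumentException("invalid layout parameters");
        }
        // Ceiling division: a partly filled line still occupies a whole line.
        int lines = (textLength + charsPerLine - 1) / charsPerLine;
        // Ceiling division again: a partly filled page still counts as a page.
        int pages = (lines + linesPerPage - 1) / linesPerPage;
        // An empty document is still displayed as one page.
        return Math.max(pages, 1);
    }

    public static void main(String[] args) {
        int textLength = 10_000; // same content in both cases
        System.out.println(pageCount(textLength, 80, 50)); // wider, taller page: 3
        System.out.println(pageCount(textLength, 60, 40)); // wider margins, shorter page: 5
    }
}
```

The content never changed between the two calls, only the layout did, which is why a page number cannot simply be looked up in the stored text.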

But if, as you say, you're just reading the raw bytes from the .doc file, you don't have any hope of finding out any of those things. You're just reading the document text and the document formatting and other control information as uninterpreted bytes. You can't find out anything at all about the document that way except how many bytes it took Word to store it on disk.
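As a minimal sketch of that point, this is the read loop from the question wrapped in a small Java class. The class name RawByteCount and the sample bytes are made up for illustration; a real .doc file would be opened with something like Files.newInputStream instead. The only thing the loop can report is a byte count.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

final class RawByteCount {

    /** Reads the stream to the end and returns how many bytes it contained. */
    static long countBytes(InputStream in) throws IOException {
        long count = 0;
        // Same loop shape as in the question: read() returns -1 at end of stream.
        while (in.read() != -1) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Arbitrary stand-in bytes, not a real Word document.
        byte[] sample = {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0, 0x00};
        try (InputStream in = new ByteArrayInputStream(sample)) {
            System.out.println("bytes read: " + countBytes(in)); // bytes read: 5
        }
    }
}
```

Nothing in this loop interprets fonts, margins, or paragraph structure, so it has no way of telling where one displayed page ends and the next begins.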
Gajendra Kangokar
Ranch Hand

Joined: Dec 25, 2012
Posts: 82
    
    1

I am not supposed to convert it to PDF.

@Paul, you mean there is no way to know where a page break happened in a DOC file? Is there any way to use a form feed or something?
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42594
    
  65
Yes, that's what Tony, Paul and I have been saying.

Gajendra Kangokar wrote: I am not supposed to convert it to PDF.

Where are all these strange requirements coming from? It sounds like the requirements contain details of the technical implementation, where that kind of thing has no place.
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18879
    
    8

Gajendra Kangokar wrote: @Paul, you mean there is no way to know where a page break happened in a DOC file? Is there any way to use a form feed or something?


Form feed? No, it's not nearly that simple. In fact Word is probably a thousand times as complicated as just throwing in a form-feed character. I'm guessing you haven't actually used Word yourself much?

If you really have to address the requirement of extracting a page from a Word document, you at least have to start by accessing it via Apache POI's Word components, or else Aspose's software which allows you to access Word documents. And then prepare yourself for a long stretch where you learn how to use those things. Last time I looked at accessing Word (from Visual Basic over a decade ago) there were about 500 different types in its data model. I'm sure that the number is closer to 1,000 by now. It isn't simple and you shouldn't expect a simple solution.
Gajendra Kangokar
Ranch Hand

Joined: Dec 25, 2012
Posts: 82
    
    1

Yes, I am using Apache POI. Thank you, I will try the Aspose software as well.
 