Convert .doc file to .txt file

Smriti Anchu
Ranch Hand

Joined: Dec 21, 2004
Posts: 40
Hi ,

I have a template in .doc format. In my application I want to read this .doc file and write its text to a .txt file. But when I do this, I find that some special ASCII characters are also inserted into the text file. I don't want these special characters, only the text (words) present in the Word file.



Please help me in this regard

thanks
Smriti
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42612
DOC files contain plenty of control characters (and other text) that are not part of the main text. You'll need to use a library that understands the DOC format, like Jakarta POI (now Apache POI). Have a look at "Basic Text Extraction" here.


Kartik Lunkad
Greenhorn

Joined: Dec 07, 2009
Posts: 3
I have a similar problem. When I extract the text plainly, the format of the strings is not the same as in a text file. I need to match strings between a text file and a .doc file. It gives 2 same strings as unequal. Any suggestions?
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42612
It gives 2 same strings as unequal.

What does that mean? Can you post a short code section that illustrates the problem?
Kartik Lunkad
Greenhorn

Joined: Dec 07, 2009
Posts: 3
Consider two words, string1 and string 2, extracted from two files: a .txt file and a .doc file. (The string from the .doc file is extracted using the method specified in the above posts.)
When we compare them, they come out as unequal.
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42612
Why should "string1" and "string 2" be considered equal?
Kartik Lunkad
Greenhorn

Joined: Dec 07, 2009
Posts: 3
Suppose they are equal in a certain case, for example string1= "sample" and string2="sample" also. But when extracted from their respective file formats, if you compare them, the compiler will show them as unequal. I hope you got my problem.
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18907

The compiler isn't doing any of that comparing. It's the runtime which is comparing. And if it says the two strings are unequal, then they are unequal. If you say they are equal, then you are using a non-standard definition of equal; or more likely, you have overlooked something. Often people overlook things like trailing blanks, for example, because they aren't easy to see in debugging output.
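
One way to find the overlooked difference is to print each string character by character. The sketch below is only an illustration: the class name, the sample values and the normalize helper are made up for this example, and the trailing carriage return is an assumption about what an extractor might leave behind, not something reported in this thread.

```java
class StringDiffDebug {

    // Lists every character as U+XXXX so invisible ones (blanks, CR, U+00A0) show up.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    // Hypothetical normalization: non-breaking spaces become plain spaces,
    // then leading/trailing whitespace and control characters are trimmed.
    static String normalize(String s) {
        return s.replace('\u00A0', ' ').trim();
    }

    public static void main(String[] args) {
        String fromTxt = "sample";
        String fromDoc = "sample\r"; // assumed: extractor output with a trailing CR

        System.out.println(fromTxt.equals(fromDoc));   // false
        System.out.println(codePoints(fromTxt));
        System.out.println(codePoints(fromDoc));       // ends with U+000D
        System.out.println(normalize(fromTxt).equals(normalize(fromDoc))); // true
    }
}
```

Whether normalizing like this is appropriate depends on what "equal" should mean for the application; the code point dump is the part that shows what is really in the strings.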
Amit kumarJha
Greenhorn

Joined: Jun 29, 2011
Posts: 1
Below program will convert .doc to .txt file:-

import java.io.*;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class ReadDocFile {
    public static void main(String[] args) {
        File in = new File("D:\\New.docx");
        File out = new File("D:\\New.txt");

        try {
            String text;
            // Read the .docx file (the XWPF classes do not read binary .doc files)
            try (FileInputStream fis = new FileInputStream(in)) {
                XWPFDocument doc = new XWPFDocument(fis);
                XWPFWordExtractor ex = new XWPFWordExtractor(doc);
                text = ex.getText();
            }

            // Write the text to the .txt file; the writer is closed automatically
            try (Writer output = new BufferedWriter(new FileWriter(out))) {
                output.write(text);
            }
        } catch (Exception exep) {
            // Report the failure instead of silently swallowing it
            exep.printStackTrace();
        }
    }
}


Also add xmlbeans-2.3.0, dom4j-1.6.1 and stax-api-1.0.1 to the classpath.

Download the Apache POI jar as well.
Rob Spoor
Sheriff

Joined: Oct 27, 2005
Posts: 19761

Welcome to the Ranch! Technically that converts .docx to .txt, not .doc (there's a difference). For .doc files you can use org.apache.poi.hwpf.extractor.WordExtractor and org.apache.poi.hwpf.HWPFDocument instead of org.apache.poi.xwpf.extractor.XWPFWordExtractor and org.apache.poi.xwpf.usermodel.XWPFDocument. The rest of the code should be the same.

And please UseCodeTags next time.
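
The difference between the two formats shows up in the first bytes of the file, which is more reliable than the file extension. The sketch below uses only the JDK; the class, enum and method names are invented for this example. A binary .doc is an OLE2 compound file and an OOXML .docx is a ZIP archive, so the check only narrows things down: other OLE2 files (.xls, .ppt) and any other ZIP archive carry the same signatures.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

class WordFormatSniffer {

    enum Format { DOC_OLE2, DOCX_ZIP, UNKNOWN }

    // OLE2 compound-file signature (binary .doc) and ZIP local-file header (.docx)
    private static final int[] OLE2 = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    private static final int[] ZIP = {0x50, 0x4B, 0x03, 0x04};

    // Classifies a file by its leading bytes.
    static Format detect(byte[] head) {
        if (startsWith(head, OLE2)) return Format.DOC_OLE2;
        if (startsWith(head, ZIP)) return Format.DOCX_ZIP;
        return Format.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, int[] sig) {
        if (data.length < sig.length) return false;
        for (int i = 0; i < sig.length; i++) {
            if ((data[i] & 0xFF) != sig[i]) return false;
        }
        return true;
    }

    // Reads up to 8 bytes from the start of the stream.
    static byte[] readHead(InputStream in) throws IOException {
        byte[] head = new byte[8];
        int n = 0;
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) break;
            n += r;
        }
        return Arrays.copyOf(head, n);
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            byte[] zipHeader = {0x50, 0x4B, 0x03, 0x04};
            System.out.println(detect(zipHeader)); // DOCX_ZIP
            return;
        }
        try (InputStream in = new FileInputStream(args[0])) {
            System.out.println(args[0] + ": " + detect(readHead(in)));
        }
    }
}
```

A program could use the result to pick the HWPF classes for DOC_OLE2 and the XWPF classes for DOCX_ZIP. Note that readHead consumes bytes, so the file has to be reopened before it is handed to POI.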


vamshi gurudu
Greenhorn

Joined: Sep 23, 2008
Posts: 21
Hello Amit,

The code works for some files only.
For some files I get the exception org.apache.poi.openxml4j.exceptions.InvalidFormatException: Package should contain a content type part [M1.13]

How can I resolve this?

Thanks.
Monica Marcus
Ranch Hand

Joined: Oct 17, 2012
Posts: 43
Amit kumarJha wrote:Below program will convert .doc to .txt file:-


Hello, I tried this code but it does not work for me. At the line XWPFDocument doc = new XWPFDocument(fis); everything stops, nothing happens. I inserted a print statement after this line but nothing is printed (to the console). Also no exception is mentioned. Could you give me some suggestions as to what may be the cause of this behavior?

Thanks a bunch,
Monica
John Jai
Bartender

Joined: May 31, 2011
Posts: 1776
Welcome to Javaranch, Monica

You need to print the Exception in the catch block if you need to see it. Also check Rob Spoor's earlier message on some changes to that code.
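
A minimal sketch of that advice follows. It also catches Throwable, because a missing jar typically surfaces as a NoClassDefFoundError, which is an Error and therefore passes straight through catch (Exception e). Whether that is what happens in this case is only a guess; the loadDocument method is a stand-in that simulates such a failure, and the class name in the message is just an example.

```java
class CatchDemo {

    // Stand-in for the POI call; simulates what the JVM throws when a
    // class from a missing jar is needed at runtime (assumed scenario).
    static void loadDocument() {
        throw new NoClassDefFoundError("org/apache/xmlbeans/XmlException");
    }

    static String run() {
        try {
            loadDocument();
            return "ok";
        } catch (Exception e) {
            // Ordinary failures: bad file, wrong format, I/O problems
            e.printStackTrace();
            return "exception: " + e;
        } catch (Throwable t) {
            // Errors such as NoClassDefFoundError are not Exceptions
            t.printStackTrace();
            return "error: " + t;
        }
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Catching Throwable is meant here as a temporary debugging aid around the suspect call, not as something to leave in production code.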
Monica Marcus
Ranch Hand

Joined: Oct 17, 2012
Posts: 43
Hi John, thanks for the message. The fact is that I had noticed Rob Spoor's message and the code works for doc files. My problem is with docx files only. And I print the exceptions in the catch blocks. Any other suggestion? I am completely stuck.
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42612
Monica Marcus wrote:the code works for doc files. My problem is with docx files only.

Really? I would have thought it would be the other way around, because the XWPF classes can handle .docx files, but not .doc files. For .doc files you'd need to use the corresponding HWPF classes.
Monica Marcus
Ranch Hand

Joined: Oct 17, 2012
Posts: 43
Hi Ulf. Well, yes, of course: for doc files I used the HWPF classes.
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42612
So your problem is solved now, and you're able to extract text from both file types?
Monica Marcus
Ranch Hand

Joined: Oct 17, 2012
Posts: 43
No, I can extract text only from the doc files, but not from the docx files. I explained what happens with docx files in my first message of this thread. I would appreciate it very much if you (or someone else) can help, because now I am really stuck.
Monica Marcus
Ranch Hand

Joined: Oct 17, 2012
Posts: 43
Hi guys, I was able to write code that works for docx and doc files (different classes, of course) but I cannot get both to work as part of a larger application. The problem is that I need to use two jar files: poi-3-0-alpha3.jar (for the doc files) and poi-3.9-20121203.jar (for the docx files). Both jar files contain two classes with identical names, but the contents of the classes are different. One of the classes lacks a function required for doc files and the other lacks a function required for docx files. So the order in which the jar files are added to my NetBeans project determines which of the two file types (doc or docx) can be converted to text. Is there a way to make the program look into both jar files?
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42612
Monica Marcus wrote:The problem is that I need to use two jar files: poi-3-0-alpha3.jar (for the doc files) and poi-3.9-20121203.jar (for the docx files)

Why is that? I'm fairly certain that the current POI version does everything 3.0 did, so you should not need to use any of the old jar files. In any case, I strongly advise against mixing jars from different versions; problems like the one you're experiencing are bound to happen.
Monica Marcus
Ranch Hand

Joined: Oct 17, 2012
Posts: 43
No, the current version does not contain the classes mentioned below:

import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.hwpf.HWPFDocument;

I need the older version for them. I tried some intermediate versions too, but they do not work either.
What can I do to have my Java tool work for both doc and docx files?

Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42612
It sure does - both classes are in the scratchpad jar file. If they weren't part of POI, why would they be in the javadocs?

org.apache.poi.hwpf.extractor.WordExtractor and org.apache.poi.hwpf.HWPFDocument
Monica Marcus
Ranch Hand

Joined: Oct 17, 2012
Posts: 43
Well, they are in the javadoc but my compiler (NetBeans) says otherwise. What can I do? Perhaps it is a mistake and the people at apache.org should know about it. I don't know how to contact them.
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42612
How are you adding the scratchpad jar file to the classpath? That's separate from the main jar file - you need both.
Monica Marcus
Ranch Hand

Joined: Oct 17, 2012
Posts: 43
I work with NetBeans, and I did not set my classpath myself. I just added the jar files (both poi jar files) to the project. If I add the older version first and then the new version, it works for doc files only. If I add the new version first and then the older version, it works for docx files only. NetBeans builds a jar file for my whole application to run.
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42612
I guess I need to be explicit about it: the jar files you need to add are named poi-3.9-20121203.jar and poi-scratchpad-3.9-20121203.jar. Don't add any files that are not part of the POI 3.9 download (like from older POI versions) - it simply does not work, nor is it necessary.

(You may also have to add poi-ooxml-3.9-20121203.jar and poi-ooxml-schemas-3.9-20121203.jar, and some of the jars in the "ooxml-lib" directory; I'm not sure in which circumstances those are needed.)
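
To see which jar a class is actually loaded from, and so whether an old POI jar is still shadowing the new one, a small check like the following can be run with the same classpath as the application. It uses only JDK calls; the WhichJar class and its locate method are names made up for this sketch.

```java
import java.security.CodeSource;

class WhichJar {

    // Returns the location (usually a jar URL) a class was loaded from,
    // or a short message if it cannot be found on the classpath.
    static String locate(String className) {
        try {
            // 'false' avoids running static initializers just to look the class up
            Class<?> c = Class.forName(className, false, WhichJar.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            return src == null
                    ? "(no code source, e.g. a JDK class)"
                    : String.valueOf(src.getLocation());
        } catch (ClassNotFoundException | LinkageError e) {
            return "not on classpath: " + e;
        }
    }

    public static void main(String[] args) {
        String[] names = {
            "org.apache.poi.hwpf.HWPFDocument",
            "org.apache.poi.hwpf.extractor.WordExtractor",
            "org.apache.poi.xwpf.usermodel.XWPFDocument"
        };
        for (String name : names) {
            System.out.println(name + " -> " + locate(name));
        }
    }
}
```

With only the POI 3.9 jars on the classpath, the two HWPF classes would be expected to resolve to the scratchpad jar; if any line points at a jar from another POI version, that jar should be removed from the project.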
Monica Marcus
Ranch Hand

Joined: Oct 17, 2012
Posts: 43
Thank you, Ulf. I did not know what the scratchpad jar file is.
 