
How to read lines from .docx, .doc .rtf or .odt document

Ivan Bisevac
Ranch Hand

Joined: Jan 04, 2010
Posts: 48

Can someone explain to me how to read one line from a .docx, .doc, .rtf or .odt document? It doesn't matter which of these file formats you use for the explanation. For example, if I have a document that contains two sentences:



These two sentences are separated by a newline (Enter). How do I take the first sentence and put it in a String variable?


Ivan Bisevac
Ranch Hand

Joined: Jan 04, 2010
Posts: 48

Which library should I use?
Can someone write a short code example for this?
pete stein
Bartender

Joined: Feb 23, 2007
Posts: 1561
Ivan Bisevac wrote:Can someone explain to me how to read one line from a .docx, .doc, .rtf or .odt document? It doesn't matter which of these file formats you use for the explanation. For example, if I have a document that contains two sentences:



These two sentences are separated by a newline (Enter). How do I take the first sentence and put it in a String variable?


I am no expert on this, but I don't believe there is any one-size-fits-all, straightforward answer to this question. If it were me trying to solve this problem, I'd Google for the specification of the document format of interest and then Google for libraries that can read that kind of file. YMMV.
Ivan Bisevac
Ranch Hand

Joined: Jan 04, 2010
Posts: 48

The problem is that I am a beginner in Java programming; I know the basic techniques and a little about databases. I want to build an English-Serbian dictionary project, and I have the words in .rtf, .odt and .docx format. Each entry is one Serbian word, then a space, then the English words, followed by a newline (Enter). I want to read these words from the file and put them in the database. I don't need help with the database, but I do need help with reading .rtf, .odt or .docx.
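
The entry format described above (one Serbian word, a space, then the English words) can be split with plain string handling once a line of text has been extracted. The following is only a minimal sketch: the class name DictionaryLine is made up for illustration, and it assumes the Serbian side is always a single word, which may not hold for every entry.

```java
// Hypothetical helper: splits one dictionary line at the first run of whitespace.
// Assumption (from the format described in the post): the Serbian side is one word.
final class DictionaryLine {
    final String serbian;
    final String english;

    private DictionaryLine(String serbian, String english) {
        this.serbian = serbian;
        this.english = english;
    }

    static DictionaryLine parse(String line) {
        // limit = 2 keeps everything after the first gap together as the English part
        String[] parts = line.trim().split("\\s+", 2);
        String english = parts.length > 1 ? parts[1].trim() : "";
        return new DictionaryLine(parts[0], english);
    }

    public static void main(String[] args) {
        // The c-with-caron is written as a Unicode escape to keep this source ASCII-only.
        DictionaryLine entry = parse("aforisti\u010Dan aphoristic, witty");
        System.out.println("[" + entry.serbian + "] -> [" + entry.english + "]");
    }
}
```

If some Serbian entries consist of more than one word, splitting at the first space is not enough and another marker (such as formatting) would be needed.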

The file is at the following location:

http://biske.hyperphp.com/viewtopic.php?f=8&t=8

Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41136
The premier library for reading DOCX files in Java is Apache POI. For ODT, several libraries are mentioned in the http://faq.javaranch.com/java/AccessingFileFormats page; not sure which one might handle this best.

If you need to handle all 3 file formats, you could convert two of them into the third using the JODConverter library (which needs OpenOffice installed). That way, your code only needs to handle a single format.
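
For a quick experiment without any third-party library, it may help to know that a .docx file is a ZIP archive whose main text lives in word/document.xml, with each paragraph stored as a w:p element and its text in w:t elements. The sketch below uses only the JDK and is not the POI API; the class name DocxParagraphs is invented for illustration. It ignores tabs, manual line breaks, headers and footnotes, and nested paragraphs (for example in text boxes) may show up twice, so a real library remains the safer choice.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: returns the text of every paragraph in a .docx file.
final class DocxParagraphs {
    // Namespace of the "w:" elements in WordprocessingML
    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    static List<String> read(String path) throws Exception {
        try (ZipFile zip = new ZipFile(path)) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IOException("No word/document.xml in " + path);
            }
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            try (InputStream in = zip.getInputStream(entry)) {
                Document xml = factory.newDocumentBuilder().parse(in);
                NodeList paragraphs = xml.getElementsByTagNameNS(W_NS, "p");
                List<String> lines = new ArrayList<>();
                for (int i = 0; i < paragraphs.getLength(); i++) {
                    Element paragraph = (Element) paragraphs.item(i);
                    // A paragraph's text is split over one or more w:t elements
                    NodeList texts = paragraph.getElementsByTagNameNS(W_NS, "t");
                    StringBuilder sb = new StringBuilder();
                    for (int j = 0; j < texts.getLength(); j++) {
                        sb.append(texts.item(j).getTextContent());
                    }
                    lines.add(sb.toString());
                }
                return lines;
            }
        }
    }
}
```

Calling DocxParagraphs.read("words.docx").get(0) would then give the first paragraph as a String (the file name here is just a placeholder). The XML is decoded according to its own declaration, so characters such as č arrive intact in the Java strings.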


Ivan Bisevac
Ranch Hand

Joined: Jan 04, 2010
Posts: 48

I don't need to handle all three formats, just one. I can copy the words from .docx to .odt if that is easier to read.
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41136
Not having tried any of the ODF-handling libraries, I can't say which one might be easiest to use. But I do know that POI works for DOC and DOCX.
Ivan Bisevac
Ranch Hand

Joined: Jan 04, 2010
Posts: 48

I did what I wanted. I extract the first word of each paragraph into one string and the rest of the paragraph into a second string. The best solution would be to extract just the bold text from the paragraph, but I don't know how to do that, so I used another trick. Here is the code:

Ivan Bisevac
Ranch Hand

Joined: Jan 04, 2010
Posts: 48

Is there a way to set the Serbian character set in the code above? How do I do this?
Ivan Bisevac
Ranch Hand

Joined: Jan 04, 2010
Posts: 48

Instead of |aforističan| it gives me |aforisti?an|.

How do I set the charset to Serbian Latin?
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18541

That happens when you do System.out.println(string), right? That's because the console isn't configured with an appropriate encoding. But when you get a real program, you aren't going to use the console for output. So don't waste your time trying to fix it. If you want to display the output in a better way, use a Swing GUI.
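
To see how a ? can appear even though the String itself is fine, here is a small sketch. It assumes the output side uses an encoding that has no č, with ISO-8859-1 standing in for whatever the console really uses; the class name ConsoleEncodingDemo is made up for illustration.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo: the String is correct; the damage happens when it is encoded.
final class ConsoleEncodingDemo {
    // Encodes the text with the given charset and decodes it again,
    // which is roughly what happens on the way to a console.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    // Wraps a stream in a PrintStream that encodes as UTF-8.
    static PrintStream utf8(OutputStream out) throws UnsupportedEncodingException {
        return new PrintStream(out, true, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        String word = "aforisti\u010Dan"; // c-with-caron written as a Unicode escape
        // ISO-8859-1 has no such letter, so the encoder substitutes '?'
        System.out.println(roundTrip(word, StandardCharsets.ISO_8859_1));
        // Only displays correctly if the terminal itself expects UTF-8
        utf8(System.out).println(word);
    }
}
```

Whether the second line displays correctly depends on the terminal, which is why checking the data somewhere other than the console is the more reliable test.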
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41136
The WordExtractor class is a very crude way of getting at the text contained in the document; my guess would be that it completely punts on the question of encodings (the source code would tell you for sure).

As an alternative, you could iterate through all Ranges and/or CharacterRuns of the document via HWPFDocument.getRange(), and see what you get.

(Edit: Just saw Paul's post - that's actually a more likely explanation.)
Ivan Bisevac
Ranch Hand

Joined: Jan 04, 2010
Posts: 48

Paul Clapham wrote:That happens when you do System.out.println(string), right? That's because the console isn't configured with an appropriate encoding. But when you get a real program, you aren't going to use the console for output. So don't waste your time trying to fix it. If you want to display the output in a better way, use a Swing GUI.


That happens when I do System.out.println(string) and also when I write to the MySQL database. The MySQL database collation is set to utf_general_ci, the MySQL connection collation is set to utf_general_ci, and the structure of my table "reci" is id (int), srpska (varchar) and engleska (varchar). The fields srpska and engleska have collation utf_general_ci.

Here is the whole code for both classes:







I don't know what to do to fix this.


Ulf Dittmer wrote:The WordExtractor class is a very crude way of getting at the text contained in the document; my guess would be that it completely punts on the question of encodings (the source code would tell you for sure).

As an alternative, you could iterate through all Ranges and/or CharacterRuns of the document via HWPFDocument.getRange(), and see what you get.

(Edit: Just saw Paul's post - that's actually a more likely explanation.)

Can you give some code? I don't know anything about that class.
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18541

Ivan Bisevac wrote:
Paul Clapham wrote:That happens when you do System.out.println(string), right? That's because the console isn't configured with an appropriate encoding. But when you get a real program, you aren't going to use the console for output. So don't waste your time trying to fix it. If you want to display the output in a better way, use a Swing GUI.


That happens when I do System.out.println(string) and also when I write to the MySQL database. The MySQL database collation is set to utf_general_ci, the MySQL connection collation is set to utf_general_ci, and the structure of my table "reci" is id (int), srpska (varchar) and engleska (varchar). The fields srpska and engleska have collation utf_general_ci.


I configured my installation of MySQL to use UTF-8. I have put Slovak place names in it which contain those characters and it stores them just fine. I don't know what utf-general-ci is supposed to be.
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41136
Ivan Bisevac wrote:Can you give some code? I don't know anything about that class.

Neither do I, but POI is open source, so you can download the code and look at what, exactly, that class is doing. And if it does something that may cause this, use the actual HWPF API classes and methods that I suggested.

But first pursue Paul's idea - you said already that the problem happens when you're using System.out.print, which is no surprise given that most consoles can't handle umlauts and such characters. But how did you find out that the data in the DB is messed up as well?
Ivan Bisevac
Ranch Hand

Joined: Jan 04, 2010
Posts: 48

I used the phpMyAdmin tool to view the data in the database. All letters are fine except ones like č, ć and đ; instead of them there is a ? sign.

I will try ODF; maybe it has better support for character sets.
Ivan Bisevac
Ranch Hand

Joined: Jan 04, 2010
Posts: 48

I did it!!!

The problem was the connection string. I just added ?useUnicode=true&characterEncoding=utf-8 to the connection string and now it is OK.

The complete code is:




Thanks all for help.
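
For later readers, the connection-string change described in this post can be sketched as follows. The two URL parameters are the ones quoted in the post; the host, database name, and credentials are placeholders, and the class name Utf8MySql is made up for illustration. Actually opening a connection also needs the MySQL JDBC driver on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Hypothetical helper: builds a MySQL JDBC URL that asks the driver to use UTF-8.
final class Utf8MySql {
    static String url(String host, String database) {
        return "jdbc:mysql://" + host + "/" + database
                + "?useUnicode=true&characterEncoding=utf-8";
    }

    // host, database, user and password are placeholders supplied by the caller
    static Connection open(String host, String database, String user, String password)
            throws SQLException {
        return DriverManager.getConnection(url(host, database), user, password);
    }

    public static void main(String[] args) {
        // "localhost" and "recnik" are example values only
        System.out.println(url("localhost", "recnik"));
    }
}
```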
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18541

Ivan Bisevac wrote:The problem was the connection string. I just added ?useUnicode=true&characterEncoding=utf-8 to the connection string and now it is OK.


Ah, that's it. I don't have to do that because my database is already configured to use UTF-8. But it's a pain to convert a MySQL database to a different encoding, so leaving it like that is the best thing.

However, can I recommend the use of PreparedStatement for your SQL Inserts? Building an SQL command as a string like that will cause problems when the parameters contain quotes or other reserved characters. And if the parameters are coming from outside your program, a user can manipulate them to cause what's called an "SQL injection" attack on your database. PreparedStatement is the best practice for all SQL access via Java.
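
A minimal sketch of the PreparedStatement approach recommended above, using the table and column names mentioned earlier in the thread (reci, srpska, engleska). It assumes the id column is filled in by the database, which the thread does not actually state, and the class name WordInserter is made up for illustration.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical helper: inserts one dictionary entry using bound parameters.
final class WordInserter {
    // Assumption: id is generated by the database, so only two columns are set.
    static final String INSERT_SQL =
            "INSERT INTO reci (srpska, engleska) VALUES (?, ?)";

    static int insertWord(Connection conn, String srpska, String engleska)
            throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            // Values are sent as parameters, so quotes in the words need no escaping
            ps.setString(1, srpska);
            ps.setString(2, engleska);
            return ps.executeUpdate(); // number of rows inserted
        }
    }
}
```

Because the words never become part of the SQL text, an English phrase containing an apostrophe is stored as-is instead of breaking the statement.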
Rob Spoor
Sheriff

Joined: Oct 27, 2005
Posts: 19655


http://xkcd.com/327/

Just to show you what can go wrong if you don't stop SQL injection


Ivan Bisevac
Ranch Hand

Joined: Jan 04, 2010
Posts: 48

I use this program only to convert the words for the English-Serbian dictionary, so no one else can use my database. As I said, I am a beginner in Java programming. Thanks for the advice; I will look into what a PreparedStatement is.
Lucien Elin
Greenhorn

Joined: Feb 22, 2010
Posts: 1
I have a very similar problem to this one. I want to generate a DOC file to send to a translation agency. Depending on the target language, the DOC file will contain different character sets. Generating the file works well. I only have trouble with Asian languages and with Czech, for example. In Czech I get all letters with simple accents, like á ú Š, properly into the file. But some characters, like č, are not written properly. They appear as a ? in the file.

In the debugger I can see that, in the places where I always get a ?, the string read from the source contains a different character: the character code shown in the debugger is different. It looks like I have to tell the POI API somehow to use a special character set when writing the file. Is there a way to encode the written file with a specific character set?

When writing files through a FileOutputStream I can set the character set when defining the stream. But I did not find a way to do this with the POI API.

Thanks for any help and advice

Lucien
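
A note on the FileOutputStream remark above: in plain Java I/O the character set is chosen on the Writer wrapped around the stream, not on the FileOutputStream itself, which only deals in bytes. A minimal sketch for text files follows; the class name Utf8FileWriter is made up for illustration, and this applies to plain text output only, not to how a library stores text inside a binary .doc file.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: writes text to a file with an explicit encoding.
final class Utf8FileWriter {
    static void write(File file, String text) throws IOException {
        // The OutputStreamWriter does the char-to-byte encoding; the stream just stores bytes
        try (Writer writer = new OutputStreamWriter(
                new FileOutputStream(file), StandardCharsets.UTF_8)) {
            writer.write(text);
        }
    }
}
```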
Ivan Bisevac
Ranch Hand

Joined: Jan 04, 2010
Posts: 48

Can you show us your code? It may be useful to us, and it will make it easier to help you.
 