
Reading/Writing Foreign Text

Mike Watts
Greenhorn

Joined: Aug 17, 2005
Posts: 5
Is there a way to read in a file that contains English and German text and then write it out to another file?

Here is my problem:

I am reading in a file (which contains mostly English text with a little German). I wrap a FileReader in a BufferedReader and read the file line by line, searching for particular strings (English text) and storing them in an ArrayList. When I'm done, I use a BufferedWriter to write all the Strings in my ArrayList out to a file. The problem is that German/foreign text, for example WÄHRUNG, comes out as W?HRUNG. The special characters are not being converted.

I would like to thank you in advance for any advice given.

~Mike
Jeff Albertson
Ranch Hand

Joined: Sep 16, 2005
Posts: 1780
The problem has to do with the decoding that happens when an InputStream of bytes is converted into a Reader of characters, as well as the corresponding encoding of characters into an OutputStream of bytes. The key classes are InputStreamReader and OutputStreamWriter, and you need to specify your charset -- say ISO-8859-1 or UTF-8. FileReader and FileWriter give you no way to choose one; they fall back on the platform default encoding.
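
A minimal sketch of what that could look like. It assumes the input is ISO-8859-1, which is only a guess, since the encoding the C program writes is not stated in this thread. The class name CharsetCopy and its copy method are made up for illustration, and try-with-resources needs Java 7 or later.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CharsetCopy {

    // Copies a text file line by line, decoding the input bytes with inCs
    // and encoding the output characters with outCs.
    static void copy(Path in, Charset inCs, Path out, Charset outCs) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                 new InputStreamReader(Files.newInputStream(in), inCs));
             BufferedWriter writer = new BufferedWriter(
                 new OutputStreamWriter(Files.newOutputStream(out), outCs))) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(line);
                writer.newLine(); // writes the platform line separator
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption: both files use ISO-8859-1; change this to match your data.
        copy(Paths.get(args[0]), StandardCharsets.ISO_8859_1,
             Paths.get(args[1]), StandardCharsets.ISO_8859_1);
    }
}
```

If the output should be in a different encoding than the input, pass a different charset as the last argument to copy.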


There is no emoticon for what I am feeling!
Layne Lund
Ranch Hand

Joined: Dec 06, 2001
Posts: 3061
How are you verifying the contents of the output file? I suspect that your Java program works perfectly, but there may be a problem with the font and/or character set that is used when you display the contents of the file. Are you using the command line (such as the "more" command) or a text editor (such as Notepad) to view the file? In either case, you need to be sure that it supports the character set that you are using. It is highly likely that the contents of the output file are correct but that the characters are not displayed correctly when you try to verify them.
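
One way to take the display out of the picture is to look at the raw bytes. Here is a rough sketch (the class name HexDump and the method toHex are just names I picked) that prints a file's bytes in hex. In ISO-8859-1 the character Ä is the single byte C4, in UTF-8 it is the two bytes C3 84, and a literal question mark is 3F. It reads the whole file into memory, so try it on a small sample rather than a huge file.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class HexDump {

    // Formats each byte as two uppercase hex digits, separated by spaces.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Reads the entire file given as the first argument and dumps it.
        System.out.println(toHex(Files.readAllBytes(Paths.get(args[0]))));
    }
}
```

If the dump shows 3F where the Ä should be, the character was lost when the file was written, not when it was displayed.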

Layne


Java API Documentation
The Java Tutorial
Mike Watts
Greenhorn

Joined: Aug 17, 2005
Posts: 5
I am using the command line to view the file. I can use "more" or the vi editor. The file essentially comes from a C program that writes English and German text to a file. I can view the output of the C program fine from the command line, but the output from Java is not coming out correctly.

I will also try and specify the charsets to see if that fixes the problem.

Another question:

Is there a way to insert text into a file without reading through the whole file, inserting the text, and then writing it out again?

I deal with thousands of text files with over 100,000 lines each. It takes quite some time to do such a thing.

The way I have been doing it is not very efficient. I read through the whole file, storing each line in a Collection while checking for conditions, and if a condition is true, I insert some text. After readLine() returns null, I write out my Collection line by line to a new file.

Thanks in advance!
Jeff Albertson
Ranch Hand

Joined: Sep 16, 2005
Posts: 1780
Originally posted by Mike Watts:

Is there a way to insert text into a file without reading through the whole file, inserting the text, and then writing it out again?

I deal with thousands of text files with over 100,000 lines each. It takes quite some time to do such a thing.

The way I have been doing it is not very efficient. I read through the whole file, storing each line in a Collection while checking for conditions, and if a condition is true, I insert some text. After readLine() returns null, I write out my Collection line by line to a new file.

1. The only alternative to rewriting a file to "edit" it is to use a RandomAccessFile, but that hardly ever works for text, because editing in the middle of a file means replacing N old bytes with N new bytes. Unless your file format has fixed-length lines and your character encoding uses a fixed number of bytes per char (say 1 or 2), this will fail.

2. It sounds like you are reading the entire file into memory before you start to rewrite it. Is it possible to hold fewer lines in memory and interleave reading and writing? This would keep your process from bloating and burdening your machine. Doing this means writing to a temp file (since you are not done reading!) and renaming the temp file at the end if you want to "overwrite" the file contents. If your program is supposed to overwrite contents, you should be doing this in any case, so that if it crashes, only a temp file is left in an incomplete state.

3. 100,000 lines? Perhaps it's time to rethink the design. With that much data, why not keep it in a database? You could write code that generates text file reports when needed. Another processing approach, even if input and output were *required* to be text files, would be to use a database as an intermediate data structure. It'll be slower than holding data in memory, but it will give you a lot of options.
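
Point 2 could be sketched roughly as follows. This is only an illustration under some assumptions: the condition is "the line contains a marker string", the class and method names (StreamingInsert, insertAfterMatches) are hypothetical, and the charset is passed in by the caller. It needs Java 7 or later for java.nio.file.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class StreamingInsert {

    // Streams the file to a temp file in the same directory, writing `extra`
    // as a new line after every line that contains `marker`, then replaces
    // the original with the temp file. Only one line is held in memory.
    static void insertAfterMatches(Path file, Charset cs, String marker, String extra)
            throws IOException {
        Path dir = file.toAbsolutePath().getParent();
        Path temp = Files.createTempFile(dir, "rewrite", ".tmp");
        try {
            try (BufferedReader reader = Files.newBufferedReader(file, cs);
                 BufferedWriter writer = Files.newBufferedWriter(temp, cs)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    writer.write(line);
                    writer.newLine();
                    if (line.contains(marker)) {
                        writer.write(extra);
                        writer.newLine();
                    }
                }
            }
            // Swap in the finished file only after the writer is closed.
            Files.move(temp, file, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            // After a successful move this is a no-op; after a failure it
            // removes the incomplete temp file and leaves the original alone.
            Files.deleteIfExists(temp);
        }
    }
}
```

A call might look like insertAfterMatches(Paths.get("data.txt"), StandardCharsets.ISO_8859_1, "WÄHRUNG", "inserted line"), where the file name and strings are placeholders. Because the original is replaced only at the end, a crash midway leaves it untouched.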
 