simple I/O read into char[] using the fileLength

Ram Kumar Subramaniam
Ranch Hand

Joined: Jan 17, 2003
Posts: 68
Please look at the code below first. I am trying to read a file into a character array, and it works great. But here's my doubt: on the second line I get the length of the file. This returns a double, which I typecast to int to create the char array. Casting to int could shrink the reported size of the file, so the entire content would not be read into the char array. Is this correct?
File file = new File(fileName);
int fileLength = (int)file.length();
char[] cBuffer = new char[fileLength];

FileReader fr = new FileReader(fileName);
BufferedReader br = new BufferedReader(fr, fileLength);
br.read(cBuffer, 0, fileLength);
jason adam
Chicken Farmer
Ranch Hand

Joined: May 08, 2001
Posts: 1932
You are correct (though File.length() returns a long, not a double): there is a possible loss of precision when casting a long to an int.
Good question, and I'm sure there's probably some nifty way to solve it (step in anytime, Jim).
Though I've never dealt with this situation, off the top of my head I'm thinking that if you know for certain the file is going to be more bytes than the maximum value of an int (2147483647 bytes, I believe), you could break the work into manageable chunks and iterate until the file is fully read in. Not sure if this is the best way, just a guess -- something like the sketch below.
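A rough sketch of that idea (fileName and the buffer size are illustrative; what you do with each chunk is up to you):

import java.io.*;

FileReader fr = new FileReader(fileName);
char[] chunk = new char[64 * 1024]; // fixed-size buffer, independent of file length
int count;
// read() returns how many chars were actually read, or -1 at end of file
while ((count = fr.read(chunk, 0, chunk.length)) != -1) {
    // process chunk[0] .. chunk[count - 1] here
}
fr.close();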
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
In Java, no array can ever have more than Integer.MAX_VALUE elements in it. So it's not just a problem with casting length() to int - where would you put the data anyway?
In many cases, you'd be better off processing the file in parts rather than trying to put the whole thing in memory at once. A common approach for a text file is to use readLine() to read one line at a time. As long as no one line takes too much memory, and you don't need to retain previous lines while processing the current one, this works pretty well. You can also just ignore line boundaries and grab as many chars as you can fit into a char[] array, in a loop, until you've read the whole file. What you do with all those chars on each loop iteration is up to you.
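A minimal sketch of the readLine() approach (fileName is assumed from the earlier snippet):

import java.io.*;

BufferedReader br = new BufferedReader(new FileReader(fileName));
String line;
// readLine() returns null once the end of the file is reached
while ((line = br.readLine()) != null) {
    // process one line at a time; previous lines need not be retained
}
br.close();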
Note that JDK 1.4 has the java.nio classes, which will probably be much more efficient for large files. I'd recommend using a FileChannel to map() a MappedByteBuffer, then using asCharBuffer() to see the characters. But the basic problem remains - ByteBuffer and CharBuffer can't have more than Integer.MAX_VALUE elements in them. If your file is bigger than that, you'll have to break it up somehow.
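Roughly like this (a sketch only; note that asCharBuffer() treats each two-byte pair as one char, so it suits two-byte encodings rather than plain ASCII text):

import java.io.*;
import java.nio.*;
import java.nio.channels.*;

FileChannel channel = new FileInputStream(fileName).getChannel();
MappedByteBuffer buffer =
    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
CharBuffer chars = buffer.asCharBuffer(); // views each two-byte pair as a char
channel.close();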
One other thing to be aware of is that the read(char[]) method - and similar methods in InputStream, FileChannel, and other classes - is not guaranteed to provide all the chars or bytes you request. It may return less - check the return value. If you want to guarantee that you read something completely, you usually have to read() in a loop. For example, to fill a char[] array:
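(A minimal sketch; br and cBuffer are assumed from the earlier snippet.)

int offset = 0;
while (offset < cBuffer.length) {
    int count = br.read(cBuffer, offset, cBuffer.length - offset);
    if (count == -1) {
        break; // hit end of stream before the array was full
    }
    offset += count; // keep going until the array is full
}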


"I'm not back." - Bill Harding, Twister
Ram Kumar Subramaniam
Ranch Hand

Joined: Jan 17, 2003
Posts: 68
Thanks for the suggestions. I will have to change the implementation of how I read the file once I sort out the root problem I'm actually facing. (I wasn't aware of java.nio -- been sleeping for some time -- but we're using JDK 1.3, so no choice there.)
I don't use the readLine() method because of a small problem. I am actually trying to read a PDF file and then write it to another PDF file. Note that the PDF file contains Chinese as well as English characters. If I use readLine() the file gets corrupted -- I get an error saying the PDF file is corrupted when I try to open it. If I read the file into a char[] instead, I am able to recreate the file and it opens fine. However, there's a problem with the Chinese characters: the newly created PDF file says that the font used is missing. (Please note that I am able to open the original PDF file with the Chinese characters displayed.)
Am I missing something? I am using BufferedReader and FileReader for this.
(I think I should have posted this in a separate topic. Should I repost it as one?)
Ram Kumar Subramaniam
Ranch Hand

Joined: Jan 17, 2003
Posts: 68
Forgot to mention something:
1.) This works perfectly (reading, writing, and opening the PDF file) when I use VisualAge (JDK 1.2) on one machine (A). However, when I use the command prompt (JDK 1.2) or WSAD (JDK 1.3) on a different machine (B), the generated PDF file shows errors with the Chinese characters. (Strange findings!)
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
Sounds like you're having character encoding issues - and a PDF file is not a text file; you should really treat it as a binary file. Meaning: don't use Readers and Writers to translate it to or from characters - just pass the raw bytes around. FileReader tries to interpret the bytes using your platform's default encoding (which may vary when you switch to a different platform), and some byte sequences don't map to any character at all in some encodings - which means they get dropped, or become '?', or who knows. Additional problems occur when you use readLine() because of the way newlines are handled - if the file is really a PDF file, some things that appear to be \r or \n chars may really be something else entirely. If you were dealing with a real text file, things that look like newlines really would be newlines - but for PDF the bytes could be anything.
You're not supposed to change anything, right - just copy the file? Here's a simple technique for that:
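(A minimal sketch of the byte-copy loop; the file names are illustrative.)

import java.io.*;

InputStream in = new FileInputStream("input.pdf");
OutputStream out = new FileOutputStream("copy.pdf");
byte[] buffer = new byte[8192];
int count;
// pass the raw bytes through unchanged - no character translation
while ((count = in.read(buffer)) != -1) {
    out.write(buffer, 0, count);
}
out.close();
in.close();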

InputStream and OutputStream operate on plain old bytes, and do not attempt to transform them to any other format.
[ May 05, 2003: Message edited by: Jim Yingst ]
Ram Kumar Subramaniam
Ranch Hand

Joined: Jan 17, 2003
Posts: 68
Well, I do a bit of manipulation with the PDF file. It is actually converted to XML (encoding/decoding -- all that stuff is handled) and later converted back to PDF. So I need the file in String form for the XML encoding/decoding part.

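(Roughly what that part of the code does -- a simplified round-trip check, not the exact code:)

String s = new String(cBuffer); // the chars read from the file
byte[] bytes = s.getBytes(); // encode with the platform default encoding
String roundTrip = new String(bytes); // decode the same bytes back
if (s.equals(roundTrip)) {
    // every character survived the default encoding, i.e. it is supported
}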
As you said, the characters which are not supported get printed as '?'. If you notice in the code above, the if branch executes only when the encoding/character is supported.
Well, as I said earlier, it works when executed from VisualAge (i.e. in its JVM -- JDK 1.2). However, if I run the same thing from the command prompt (DOS prompt, JDK 1.2 installed, same machine) the if branch is not executed.
So I guess there must be some jar file or whatever that supports the encoding in VisualAge. Any ideas on that front?
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
All right - your application is more complex than I assumed; it seems you will need Readers rather than InputStreams after all. I suspect one source of trouble is the platform default encoding which FileReader uses - this may vary from machine to machine, and some encodings may not preserve all the info you need. Try inserting this as an experiment:
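(A minimal sketch of such an experiment:)

// print the platform default encoding that FileReader will use
System.out.println(System.getProperty("file.encoding"));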

Now, if you're creating an XML file, you should really choose a particular encoding (e.g. UTF-8) rather than relying on the platform default. Make sure the XML header includes the encoding info, and use an OutputStreamWriter/InputStreamReader to ensure the correct encoding is used. This may solve your problems.
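For example (a sketch; the file name is illustrative):

import java.io.*;

Writer out = new OutputStreamWriter(new FileOutputStream("data.xml"), "UTF-8");
out.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
// ... write the rest of the document ...
out.close();

// and read it back with the same explicit encoding
Reader in = new InputStreamReader(new FileInputStream("data.xml"), "UTF-8");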
Also, I'm a bit suspicious of how a PDF file is getting converted to XML. Are you using base64 or some similar encoding scheme to convert between binary and text formats?
Ram Kumar Subramaniam
Ranch Hand

Joined: Jan 17, 2003
Posts: 68
Problem solved!
I tackled it in a slightly different way: I used streams to read the file and encoded the bytes directly (Base64Encoder did the trick -- one of its constructors). Since the content is encoded, I have no problems with XML etc. Later I just used Base64Decoder to write the bytes directly back to the file.
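(A minimal sketch of that approach, assuming the non-public sun.misc BASE64 classes that shipped with the JDK at the time -- the exact class I used may differ:)

import java.io.*;
import sun.misc.BASE64Decoder;
import sun.misc.BASE64Encoder;

// read the PDF as raw bytes
InputStream in = new FileInputStream("input.pdf");
ByteArrayOutputStream bytes = new ByteArrayOutputStream();
byte[] buffer = new byte[8192];
int count;
while ((count = in.read(buffer)) != -1) {
    bytes.write(buffer, 0, count);
}
in.close();

// encode to text that is safe to embed in XML
String encoded = new BASE64Encoder().encode(bytes.toByteArray());

// later: decode and write the bytes straight back to a file
byte[] decoded = new BASE64Decoder().decodeBuffer(encoded);
OutputStream out = new FileOutputStream("output.pdf");
out.write(decoded);
out.close();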

But I would still like to get the earlier idea working, or rather to understand the concepts of encoding. How do I avoid that '?'? The print on my system shows Cp1252 as the encoding type.
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
The print on my system shows Cp1252 as the encoding type
Try checking the default encoding on the other machine you were using - the one that screwed up the Chinese chars. I'm betting it used a different encoding, and that's the basis for the different behavior.
But still i would like to get the earlier idea working. Or rather try to understand the concepts of encoding etc. How do i avoid that '?'
Well, that can be complex to figure out, depending on exactly what you were doing before. It's hard to cover all the possibilities; as long as you've got it working now, it may be easier not to worry about it. In general, my advice is to always pay careful attention to what encoding a document is in (or is supposed to be in), and to always distrust system default encodings - particularly any time you use "exotic" characters, or any time a file will be passed from one machine to another.
When debugging streams and readers, it's often useless to write results to System.out, as System.out may not have the appropriate fonts to display the chars you're dealing with. I often write results to a file instead, and then look at the contents of the file using another tool.
To look at text (especially exotic chars) I sometimes change the file extension to .html and open it with a browser, since I know my browser has a huge number of fonts installed - I can often see whether characters look like Chinese, or Hindi, or whatever (as opposed to '?' or a black box), even though I don't actually understand the languages themselves.
For binary data (or anything that came through an InputStream rather than a Reader) I may change the file extension to .bin and open it with TextPad. This gives me a nice simple hex representation of the bytes in the file. I may not know what the bytes mean, but at least I can see their values, and look up docs for the appropriate encoding or file format to answer my questions.
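For instance, the write-to-an-.html-file trick might look like this (suspectString and the file name are illustrative):

import java.io.*;

Writer w = new OutputStreamWriter(new FileOutputStream("debug.html"), "UTF-8");
w.write("<html><head><meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\"></head><body>");
w.write(suspectString); // the chars you want to eyeball (not HTML-escaped - fine for a quick look)
w.write("</body></html>");
w.close();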
In this particular case, since you're looking at a PDF file, you're not going to be able to do much with it other than (a) transfer it using Input/OutputStreams with no changes, or (b) transform it to text using something like base64, then do whatever you want with the text. Unless, that is, you find documentation for the PDF file format and can interpret it more intelligently. (I'm too lazy to sort through all the Google hits I got when looking for this; I believe it's proprietary info, not freely available.) Good luck...
[ May 07, 2003: Message edited by: Jim Yingst ]
 