I have a directory which contains large files ~1GB to ~5GB. Recursively i pick up the files and from each file I look for a particular string or sequence of char and create files. The string is the starting and ending point of the sub file which i want to create.
I read some amount of bytes at a time and work on those bytes. However, I am getting OutOfMemory exception after some time. I have set Xmx to 2048. I try to read less number of bytes but then program gets very slow. I have added System.gc() and System.runFinalization() in between the object creation.
I am using RandomFileAccess.
Please let me know how I can over come the problem of OutOfMemory and program to run little faster.
I agree with David. The code is hard to read. I would refactor/rewrite it to use a more object oriented approach and introduce a lot of new methods. Because a single method of 190 lines of code is hard to comprehend. The calls to System.gc() are unnecessary because the garbage collector is guaranteed to run before a OutOfMemoryError is thrown by the JVM.
"Any fool can write code that a computer can understand. Good programmers write code that humans can understand." --- Martin Fowler
Please correct my English.
Joined: Oct 20, 2004
Sorry for being late. Here is the complete code. It has got more comments as well. I have refactored little. Please let me know if this is still not readable.
I would use the profiler too. But I don't see why you need a 100 MB buffer to work in. In most situations a buffer size of somewhere around 8 KB is quite sufficient.
And I wouldn't do my own buffering like that. I'm not quite sure what rules you're using to filter the data but it seems to me that you're looking for the string "ABCD" as some kind of sentinel. I would rewrite the code to read the data sequentially using an ordinary FileInputStream. Set up a buffer which always contains the last four bytes read. Then you can compare that buffer to "ABCD" and act accordingly with respect to whether to write data.
Line 57, 59: You never need to call new String("literal"). Just write "literal" - it makes no sense to create a separate new String object.
If I see this right, when the program doesn't find the sequence ABCD anywhere in the file, it will try to append the whole file to your ByteArrayOutputStream. That could easily cause an OutOfMemoryError if the file is very large.
author and iconoclast
There are big files and each file is made with joining many other files. The starting of nested file is ABCD. So I need to go through each byte and see if ABCD comes in the sequence. Once i find it, i need to start writing the bytes to files until i find another ABCD.
Paul, I can do the way you have mentioned but that makes my program very slow. And i have files with the size of 5GB as well. Reading 8/4 bytes at time will be a very slow progress. I will try to use profiler and see what they have to say.
I have written comments to get better understanding of the algorithm which i have implemented in the program. I can still shorten it. I will do it and post the code again.
Line 57, 59 are just for testing I have kept. Later on these strings will be passed as arguments to the program from the console.
The program doesn't append the whole file to ByteArrayOutputStream when we do not find ABCD. It will only append the current byte which is read and if any bytes A,AB,ABC where found before the current byte (given that current byte is not D).
File will be written/appended only when we find ABCD and RandomFileAccess object is not null. All the bytes written in BinaryArrayOutputStream will be dumped into RandomAccessFile.
Thanks everybody, I really appreciate your valuable time spending on this post.
I will redo the code/comments and post it again.
Joined: Oct 20, 2004
Ernest , Also, 131072 * 100 is not 100MB, it's about 1.25 MB. Just an example of what was meant by "the comments hurt more than they help".
Thanks for pointing this out.
I have corrected back the comments and number of bytes to read (MAXREAD) in the code.
Please have a look at it.
Joined: Oct 13, 2005
Please don't post the code again. My point was not that the comments were too long, but that they were
allwrittenasasinglelinewhi . . . wthescrollbarsappear.
See next post for what it ought to have said
Joined: Oct 13, 2005
I myself wrote: . . .
all written as a single line which nobody could read because it is too wide for the screen even here at work where I have a screen with 1440px from left to right so it is really difficult to read on an ordinary screen or with large writing in fact it is a good idea to keep all line lengths below 80 keystrokes to make it easier to read and I look forward to Maneesh Godbole posting UseRealWords but you will see how the scroll bars appear.
Well, I think the comments are too long, and don't add to our understanding. They might add to *yours*, but then I'd recommend taking them out so they don't confuse *us*, the folks trying to understand your code.The method is named wrong. With the proper name, no comment would be necessary. It also has a side effect, which makes the naming somewhat more difficult, but if it was clear that all file pointer manipulation methods also updated the numberOfData (which is also not a great name), it wouldn't be necessary to include it.
Rather than rename the method, why not just pass in the actual pointer manipulation you want?Otherwise it'd have to be called moveFilePointerBack or something, and it just seems redundant.Here you've described much of what's already known about RandomFileAccess, and again, you are forced to make a comment that would be unnecessary if the method was named appropriately.Here the comment describes precisely what the code does, but takes 2.5x as much text as just the code--this code is self-explanatory. The comment adds unnecessary cognitive overhead.
These same meta-comments apply to much of the code: the code speaks louder than the comments. The code is the ultimate arbiter of what's happening. Comments are difficult to keep in sync with code, because they cannot be (trivially) mechanically checked for correctness. Naming, refactoring, and common sense do more to aid readability than a paragraph for each line of code.
(This is obviously just my opinion, and other folks may feel differently.)
author and jackaroo
How many times do you create a new RandomAccessFile and assign it to fos? And how many times do you close the old streams?
You would probably benefit greatly by moving most of your instance variables to method variables. It will then become more obvious as to when they are not being used correctly.
You would probably also benefit from renaming methods and variables (normally if I see a variable named fos, I expect it to be a FileOutputStream - not a RandomAccessFile (often named raf)).
Neither fos nor raf are good descriptions of what the file is being used for though.