
Parsing huge file without reading into memory

sanju dharma
Ranch Hand

Joined: Oct 19, 2000
Posts: 45
Hi,
I want to parse a huge log file to see the status of a job. I don't want to read it all into memory and create memory-related problems.
Should I use the Runtime environment to run a system command like grep, get its output, and then parse that for the status?
Will this approach work for my purpose? Is there any other approach within Java?
Thanks,
Sudhir
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
Well, grep will work, and for a huge file it may well be faster than Java. However, the exec()/Process API is kind of a pain to work with, and your program won't be very portable. Plus, you're eventually going to have to parse something using Java in order to make the results available to the rest of your Java program. (Admittedly, if you only parse with Java after grep has done its work, the file size may be a lot smaller, so performance will be less of a concern.) But anyway, it's probably worth considering a pure Java solution... read on.
In one sense, in order to parse a file you're going to have to read everything into memory - at least temporarily. However you don't have to keep it there. If you avoid the temptation to build one big String out of the whole file (or a String[] array, or a List of Strings, or other huge structure containing all the file data) you can usually get good performance:
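Something like this (just a sketch - the file name and the string being searched for are placeholders):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class LogScanner {
    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader("job.log"));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                // Only the current line is in memory; check it and move on.
                if (line.indexOf("STATUS") != -1) {
                    System.out.println(line);
                }
            }
        } finally {
            in.close();
        }
    }
}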

If this is too slow, the problem is most likely that you're creating one String object for each line read. These get GC'ed eventually, but that's a lot of objects being created and destroyed. You can get around this by using read(char[]) rather than readLine(). You can re-use the char[] array and simply overwrite the results on each iteration, to avoid creating objects. However this gets more complicated to parse - e.g. if you're looking for "foo" you have to consider the possibility that it's split across two successive reads. If the char[] array has "fo" at the very end, you must save that info until the next read, and check if the very next character is "o". This takes a bit of work to get right, so only go down this route if you've tried the readLine() method and have determined that it's too slow. And you really should do profiling to determine where the problem is; I'm just providing educated guesses here.
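For reference, a rough sketch of that char[] approach, with the match state carried across reads ('foo' is the stand-in pattern; a full matcher would need more care with patterns that overlap themselves):

import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;

public class CharArrayScanner {
    public static void main(String[] args) throws IOException {
        char[] pattern = "foo".toCharArray();
        char[] buf = new char[8192]; // reused on every read; no per-line Strings
        int matched = 0;             // pattern chars matched so far; carries a
                                     // partial match ("fo") across buffer refills
        Reader in = new FileReader("job.log");
        try {
            int n;
            while ((n = in.read(buf)) != -1) {
                for (int i = 0; i < n; i++) {
                    if (buf[i] == pattern[matched]) {
                        if (++matched == pattern.length) {
                            System.out.println("found a match");
                            matched = 0;
                        }
                    } else {
                        // mismatch: restart, retrying this char as a first char
                        matched = (buf[i] == pattern[0]) ? 1 : 0;
                    }
                }
            }
        } finally {
            in.close();
        }
    }
}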


"I'm not back." - Bill Harding, Twister
gautham kasinath
Ranch Hand

Joined: Dec 01, 2000
Posts: 583
Howdy!
I believe you can read the file in bytes. Since you know the starting byte of your pattern, you can look for it and discard everything that doesn't match.
If the starting byte matches, compare the next byte against the next one in your pattern, and so on.
This way you implement "pay as you use" (which is also what the line-reading method does).
And because you work in bytes there is less overhead, and since you don't keep unwanted segments of the line/file around, it's easier on memory.
Regds
Lupo


"In the country of the blind, the one eyed man is the King"
Gautham Kasinath CV at : http://www.geocities.com/gkasinath
saumil shukla
Ranch Hand

Joined: Dec 01, 2000
Posts: 47
If you can get help from Perl, you could parse the file in Perl instead. That keeps the program portable and avoids the performance issues.
Thanks
Mike Brock
Greenhorn

Joined: Dec 30, 2002
Posts: 15
Of course you can read files without completely loading them into memory. Are you not familiar with the concept of a buffer?
Despite this being a fairly rudimentary computer science concept, I will try to point you in the right direction.
There are several ways you can accomplish this (creating a read buffer, that is). Java 1.4's new high-performance I/O library offers a compelling first choice.
Consider this piece of example code:
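(A sketch - the file name is a placeholder; the walkthrough below explains each piece.)

import java.io.FileInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class LineNumberPrinter {
    public static void main(String[] args) throws IOException {
        FileChannel fc = new FileInputStream("input.txt").getChannel();
        ByteBuffer byteBuf = ByteBuffer.allocate(10); // 10-byte read buffer
        StringBuffer lineBuf = new StringBuffer();    // holds the current line
        int lineNumber = 1;
        int bytesRead;
        while ((bytesRead = fc.read(byteBuf)) != -1) {
            byteBuf.flip(); // switch the buffer from filling to draining
            for (; bytesRead > 0; bytesRead--) { // count down: never read past
                byte b = byteBuf.get();          // what was actually filled
                if (b == '\n') {
                    System.out.println(lineNumber++ + ": " + lineBuf);
                    lineBuf.setLength(0);
                } else if (b != '\r') { // ignore carriage returns
                    lineBuf.append(b);  // append to the line buffer (but see below)
                }
            }
            byteBuf.clear(); // ready the buffer for the next fill
        }
        fc.close();
    }
}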

This code is a simple example of a buffered read of a file. In this case we are using two buffers. One is the read buffer 'byteBuf', which only holds 10 bytes of data at any one time; the other is 'lineBuf', a line buffer. The point of this program is simple: print the file to the screen with each line prefixed by its line number.
However, this concept is fairly applicable to other scenarios as well.
The way the buffer works is simple. Every time 'fc.read(byteBuf)' is called, the next chunk of data from the file is filled into the buffer. The integer returned from 'fc.read(byteBuf)', which we capture in 'bytesRead', is the total number of bytes that were read into the buffer.

Why do we need to do this, you ask? Well, we can't always assume that the buffer is full. Imagine you have a file which is 105 bytes long: you will end up passing through this loop at least 11 times, right? So on the 11th and final iteration, you will only find 5 bytes read into the buffer. This is crucial, because if you read past that point you'll be looking at stale or garbage data. That is what the for-loop inside the while-loop is for. It simply counts the value in 'bytesRead' back down to 0, reading one byte at a time. On each iteration, we check for '\n' (end of line) and ignore '\r' (carriage return); otherwise we append the data to 'lineBuf'. When we do detect a '\n', we print out the line with its line number. Pretty simple.
Hope this helps!
Mike.
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
[JY]: In one sense, in order to parse a file you're going to have to read everything into memory - at least temporarily.
Perhaps I should have phrased this differently. Everything will be read into memory at some point, but you don't need to have it all in memory simultaneously. The code I gave shows one simple technique (pre-1.4) for doing this; there are certainly others. Note there's a buffer in the BufferedReader. Also note that converting bytes into chars directly is risky unless you understand character encoding, and/or unless you're sure that the data you're looking at is always in the Unicode range 1-127. If there are any complications, using a Reader rather than an InputStream usually makes things much simpler. On the other hand, the new 1.4 libraries suggested by Mike will probably give better performance, so they're certainly worth looking into...
[ December 31, 2002: Message edited by: Jim Yingst ]
Mike Brock
Greenhorn

Joined: Dec 30, 2002
Posts: 15
Originally posted by Jim Yingst:
[JY]: Also note that converting bytes into chars directly is risky unless you understand character encoding, and/or unless you're sure that the data you're looking at is always in the Unicode range 1-127.

This is completely true, but in this case I was trying to give a simple understandable example. And generally, this is still safe code even if other character encodings are being used, since I am only scanning for line terminators. Correct me if I am wrong, but Unicode's EOL, CR, EOF, etc. characters match up with standard ASCII encodings.
Mike.
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
You're right about the line terminators (as far as I know anyway). However your code does more than just look for line terminators:
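lineBuf.append(b);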

This actually ends up appending string representations of numeric values of bytes, rather than single-char representations. E.g. 'A' becomes "65". This is easily corrected by a cast to char:
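lineBuf.append((char) b);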

(Or declare b as a char rather than a byte in the first place, since it's never used as anything other than a char.) This will work for the majority of common cases, converting a byte directly to a char before appending it to the StringBuffer. But it assumes the value of the byte means the same thing in Unicode as it does in whatever encoding is used in the file. This isn't always the case.

A common example is a file in Cp-1252 (the ubiquitous Windows Latin-1) that contains character values in the 128-159 range; these represent printable characters in Cp-1252, but unrelated nonprintable control chars in Unicode. For example the trademark symbol ™ (should appear as "TM") has Cp-1252 value 153, but Unicode value 8482 (!). When char 153 is appended to the StringBuffer, it gets interpreted as Unicode 153, which is apparently unused and meaningless. Other Cp-1252 codes, for things like right and left single and double quotes, get interpreted as control characters instead. In general, data get mangled and lost this way. It gets worse if the file is in UTF-8 instead, where all char values above 128 mean different things; that mangles many European characters like 'è' and 'ñ'. Not good.

Maybe the file never has any chars like this - but in this day and age that's often an unsafe assumption. And it's a lot easier to let a Reader handle all this, since most encoding problems can be fixed by a one-line change to the declared encoding that an InputStreamReader uses. (That is, once you figure out what encoding the file really uses.)
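In other words, something like this, where the encoding name is the one-line change (the file name is a placeholder):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

public class EncodingAwareScanner {
    public static void main(String[] args) throws IOException {
        // "Cp1252" is the part you'd swap once you know the file's real encoding.
        BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream("job.log"), "Cp1252"));
        for (String line = in.readLine(); line != null; line = in.readLine()) {
            System.out.println(line);
        }
        in.close();
    }
}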
Mike Brock
Greenhorn

Joined: Dec 30, 2002
Posts: 15
You are absolutely right. I missed that when I typed up the example.
I also was kind of assuming that I was reading a standard ASCII text file. But the point about using the read buffer remains the same; the text encoding issue is completely separate from the point I was trying to make.
Regards,
Mike.
David Weitzman
Ranch Hand

Joined: Jul 27, 2001
Posts: 1365
If you open up a MappedByteBuffer on the file and call Charset.forName("US-ASCII").decode(mappedByteBuffer) - substituting whatever charset applies - you'll get a CharBuffer that should be pretty usable.
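For instance (a sketch; the file name and charset are placeholders):

import java.io.FileInputStream;
import java.io.IOException;
import java.nio.CharBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;

public class MappedDecode {
    public static void main(String[] args) throws IOException {
        FileChannel fc = new FileInputStream("job.log").getChannel();
        // Map the file into memory and decode it to a CharBuffer in one shot.
        MappedByteBuffer map = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
        CharBuffer chars = Charset.forName("US-ASCII").decode(map);
        // chars can now be handed to java.util.regex, searched, etc.
        fc.close();
    }
}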
Roberto Lo Giacco
Greenhorn

Joined: Feb 01, 2005
Posts: 2
Sorry, but the memory problem is not solved by using MappedByteBuffers and Charset.decode(...): that method tries to decode the entire buffer of bytes, causing an out-of-memory error...

I need something that decodes the buffer while it's being read from the file system...
David Harkness
Ranch Hand

Joined: Aug 07, 2003
Posts: 1646
Originally posted by Roberto Lo Giacco:
Sorry, but the memory problem is not solved by using MappedByteBuffers and Charset.decode(...): that method tries to decode the entire buffer of bytes, causing an out-of-memory error...
Not if you set the ByteBuffer's position and limit before decoding it. Loop over the mapped buffer, setting up a good block size using position and limit. Decoding will then just decode the bytes in the range you specify.

Use CharsetDecoder.decode(ByteBuffer, CharBuffer, boolean) or one of the other similar methods so you can reuse the same CharBuffer. Since decoding advances the position, it should leave you at the next correct spot, dealing with multi-byte character encodings for you; just set the limit to position + BLOCK_SIZE and keep going.

If you want ultimate speed, cannot count on ASCII files, and don't want to write your own specialized decoder, this is the way to go.
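A sketch of that loop (block size and charset are arbitrary here; malformed-input handling and the decoder's final flush() are omitted):

import java.io.FileInputStream;
import java.io.IOException;
import java.nio.CharBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;

public class BlockDecoder {
    private static final int BLOCK_SIZE = 64 * 1024;

    public static void main(String[] args) throws IOException {
        FileChannel fc = new FileInputStream("job.log").getChannel();
        MappedByteBuffer map = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
        CharBuffer chars = CharBuffer.allocate(BLOCK_SIZE); // reused each pass
        while (map.position() < map.capacity()) {
            // Decode at most BLOCK_SIZE bytes. decode() advances position only
            // past complete characters, so a multi-byte sequence cut off at the
            // limit is picked up again on the next pass.
            map.limit(Math.min(map.position() + BLOCK_SIZE, map.capacity()));
            boolean endOfInput = (map.limit() == map.capacity());
            decoder.decode(map, chars, endOfInput);
            chars.flip();
            process(chars); // hand this block of chars to the parser
            chars.clear();
        }
        fc.close();
    }

    private static void process(CharBuffer chars) {
        System.out.print(chars); // placeholder for the real parsing
    }
}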
[ February 02, 2005: Message edited by: David Harkness ]
Roberto Lo Giacco
Greenhorn

Joined: Feb 01, 2005
Posts: 2
Originally posted by David Harkness:
Not if you set the ByteBuffer's position and limit before decoding it. Loop over the mapped buffer, setting up a good block size using position and limit. Decoding will then just decode the bytes in the range you specify.

Use CharsetDecoder.decode(ByteBuffer, CharBuffer, boolean) or one of the other similar methods so you can reuse the same CharBuffer. Since decoding advances the position, it should leave you at the next correct spot, dealing with multi-byte character encodings for you; just set the limit to position + BLOCK_SIZE and keep going.

If you want ultimate speed, cannot count on ASCII files, and don't want to write your own specialized decoder, this is the way to go.


You are right, but my needs don't allow me to perform the operations you described: the CharBuffer I want to get out of the big log file is going to be parsed by regexp...

I ended up with this solution: wrapping the MappedByteBuffer in a custom CharSequence implementation, named MappedCharBuffer!

The result works correctly with ASCII files only, but log files are usually ASCII anyway...

Here is the code:
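(The regexp and log layout in the main() below are just examples.)

import java.io.FileInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.Iterator;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Wraps a (mapped) ByteBuffer as a CharSequence so java.util.regex can run
// over the file without copying it into a String. One byte == one char, so
// this is only correct for ASCII data.
public class MappedCharBuffer implements CharSequence {
    private final ByteBuffer buf;

    public MappedCharBuffer(ByteBuffer buf) {
        this.buf = buf;
    }

    public int length() {
        return buf.capacity();
    }

    public char charAt(int index) {
        // absolute get(): doesn't disturb the buffer's position
        return (char) (buf.get(index) & 0xFF);
    }

    public CharSequence subSequence(int from, int to) {
        // temporarily narrow the buffer, slice out the range, then restore
        int savedPos = buf.position();
        int savedLimit = buf.limit();
        buf.position(from);
        buf.limit(to);
        ByteBuffer slice = buf.slice();
        buf.limit(savedLimit);  // put the saved values back
        buf.position(savedPos);
        return new MappedCharBuffer(slice);
    }

    public String toString() {
        StringBuffer sb = new StringBuffer(length());
        for (int i = 0; i < length(); i++) {
            sb.append(charAt(i));
        }
        return sb.toString();
    }

    // Count "LOGIN OK" occurrences per username, sorted by username.
    public static void main(String[] args) throws IOException {
        FileChannel fc = new FileInputStream("access.log").getChannel();
        MappedByteBuffer map = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
        Matcher m = Pattern.compile("LOGIN OK\\s+(\\S+)")
                           .matcher(new MappedCharBuffer(map));
        SortedMap counts = new TreeMap(); // username -> count
        while (m.find()) {
            String user = m.group(1);
            Integer old = (Integer) counts.get(user);
            counts.put(user, new Integer(old == null ? 1 : old.intValue() + 1));
        }
        for (Iterator it = counts.entrySet().iterator(); it.hasNext();) {
            System.out.println(it.next());
        }
        fc.close();
    }
}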



Actually the code performs something corresponding to this SQL statement:

SELECT COUNT(*),username FROM log WHERE message LIKE '%LOGIN OK%' GROUP BY username ORDER BY username
David Harkness
Ranch Hand

Joined: Aug 07, 2003
Posts: 1646
I think that's essentially what I just said, except you're tracking the block bounds outside of the class. Also, you might want to put the resetting of position/limit to the saved values into a finally block to be safe. Cool class.
 