
Reading large files in Java

Andrei Antonescu
Ranch Hand

Joined: Jul 08, 2010
Posts: 75
Hello

I want to calculate the MD5 hash of a file (1-2 MB in size). The problem is that with my current approach it takes a lot of time to calculate the hash. Is there any way to do this faster?
I am doing it like this:
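Roughly this (a simplified sketch with made-up names; I read the file as text, concatenate the lines, then hash the resulting String's bytes):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class FileHasher {
    public static String md5(String path) throws IOException, NoSuchAlgorithmException {
        // Read the whole file as text, one line at a time
        BufferedReader reader = new BufferedReader(new FileReader(path));
        String fileContent = "";
        String line;
        while ((line = reader.readLine()) != null) {
            fileContent = fileContent + line; // builds a new String on every iteration
        }
        reader.close();

        // Hash the bytes of the accumulated String
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(fileContent.getBytes());

        // Hex-encode the 16-byte digest
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}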
Thanks in advance
Miran Cvenkel
Ranch Hand

Joined: Nov 23, 2010
Posts: 149
Change the string concatenation fileContent = fileContent + line to sb.append(line) on a StringBuilder.

That should give it a boost, for starters.
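Put together, something like this (a sketch; names are made up):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class TextSlurper {
    // Collect the file's lines with a StringBuilder instead of String concatenation
    public static String read(String path) throws IOException {
        BufferedReader reader = new BufferedReader(new FileReader(path));
        StringBuilder sb = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            sb.append(line); // amortized O(1) per append, no full copy each time
        }
        reader.close();
        return sb.toString();
    }
}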


Searchable nature photo gallery: http://agrozoo.net/jsp/Galery.jsp?l2=en
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 19070
MD5 works with binary data, so instead of reading the file as text strings, it may be faster to read it in as binary into a buffer. No parsing by the library to find the lines, no conversion from bytes to characters and back, etc. It should be much faster.
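For example (a sketch only; for a 1-2 MB file it is fine to read everything into one byte[] and digest it in one call):

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5Binary {
    public static byte[] md5(File file) throws IOException, NoSuchAlgorithmException {
        // For a small file, slurp the raw bytes into a single buffer
        byte[] data = new byte[(int) file.length()];
        FileInputStream in = new FileInputStream(file);
        try {
            int off = 0;
            while (off < data.length) {
                int read = in.read(data, off, data.length - off);
                if (read < 0) break; // unexpected end of file
                off += read;
            }
        } finally {
            in.close();
        }
        // Hash the raw bytes; no character decoding, no line parsing
        MessageDigest md = MessageDigest.getInstance("MD5");
        return md.digest(data);
    }
}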

Of course, if you make the change, the results will likely not be compatible with the MD5 hashes that you obtained with the previous technique.

Henry


Books: Java Threads, 3rd Edition, Jini in a Nutshell, and Java Gems (contributor)
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12835
Also note that reading as text causes character conversion to Unicode, which is time-consuming, and if the default encoding changes you will get a different MD5 signature. Binary is going to be much faster and more stable.
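A quick illustration of the encoding point (my own example, not from the code above):

import java.io.UnsupportedEncodingException;

public class EncodingDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // The same text yields different bytes (and thus a different MD5)
        // under different encodings:
        byte[] utf8   = "héllo".getBytes("UTF-8");      // 6 bytes: 68 C3 A9 6C 6C 6F
        byte[] latin1 = "héllo".getBytes("ISO-8859-1"); // 5 bytes: 68 E9 6C 6C 6F
        System.out.println(utf8.length + " vs " + latin1.length); // prints "6 vs 5"
    }
}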

Bill
Rob Spoor
Sheriff

Joined: Oct 27, 2005
Posts: 19785
MessageDigest is able to calculate intermediate results, so you can feed it the file a chunk at a time:
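A sketch (the 8192-byte buffer is an arbitrary size):

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5Streaming {
    public static byte[] md5(String path) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        InputStream in = new FileInputStream(path);
        try {
            byte[] buffer = new byte[8192]; // read in 8 KB chunks
            int read;
            while ((read = in.read(buffer)) != -1) {
                md.update(buffer, 0, read); // feed each chunk to the digest
            }
        } finally {
            in.close();
        }
        return md.digest(); // finish and return the 16-byte MD5
    }
}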
This way you don't need to store the entire file in memory. It will still take a long time because you still need to read and process the entire file, but you can't do anything about that.


SCJP 1.4 - SCJP 6 - SCWCD 5 - OCEEJBD 6
How To Ask Questions | How To Answer Questions
Andrei Antonescu
Ranch Hand

Joined: Jul 08, 2010
Posts: 75
Hello again,

Thank you all for posting. I am trying to create a file integrity checker like Tripwire. I don't understand why Tripwire can calculate hashes so fast, even for bigger files, and I can't...

Miran Cvenkel
Ranch Hand

Joined: Nov 23, 2010
Posts: 149


Make that byte[] buffer bigger; I guess that would speed things up.
Rob Spoor
Sheriff

Joined: Oct 27, 2005
Posts: 19785
Using a BufferedInputStream around the FileInputStream would probably also improve performance.
Mike Simmons
Ranch Hand

Joined: Mar 05, 2008
Posts: 3018
I'm skeptical that BufferedInputStream will offer much improvement here, considering the latest version of the code is already doing its own buffering using the byte[] array. BufferedInputStream is mostly helpful when the client code is doing a lot of single-byte read() calls. But it's easy to try it and see if it makes a difference. And I agree with Miran that increasing the buffer size may help - but again, it may not, so try it and see.

I note that MessageDigest also has an update(ByteBuffer) method (if using JDK 5 or later). That could end up being much faster than traditional java.io. Or not - often the traditional methods are just as fast, as the traditional classes have been updated to use nio classes internally. But sometimes that's not possible to do efficiently, and the nio classes like ByteBuffer can be much faster. Unfortunately java.nio classes can be hard to use. Still, I think they're worth trying here, if the performance so far is not good enough.
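For instance, one way to put it together without memory mapping (a sketch; the direct 64 KB buffer is my own arbitrary choice):

import java.io.FileInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5Nio {
    public static byte[] md5(String path) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        FileChannel channel = new FileInputStream(path).getChannel();
        try {
            // A direct buffer lets the OS read into native memory,
            // skipping one copy into the Java heap
            ByteBuffer buffer = ByteBuffer.allocateDirect(64 * 1024);
            while (channel.read(buffer) != -1) {
                buffer.flip();     // switch the buffer from writing to reading
                md.update(buffer); // consumes everything up to the limit
                buffer.clear();    // ready for the next read
            }
        } finally {
            channel.close();
        }
        return md.digest();
    }
}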
Rob Spoor
Sheriff

Joined: Oct 27, 2005
Posts: 19785
Definitely a +1 there. With FileChannel.map you can even use memory-mapped I/O:
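A sketch (mapping and digesting the file block by block; names are my own):

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5Mapped {
    public static byte[] md5(File file, int blockSize) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        RandomAccessFile raf = new RandomAccessFile(file, "r");
        FileChannel channel = raf.getChannel();
        try {
            long size = channel.size();
            // Map and digest the file one block at a time
            // (note: mappings are only released when the buffers are garbage collected)
            for (long pos = 0; pos < size; pos += blockSize) {
                long len = Math.min(blockSize, size - pos);
                MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, pos, len);
                md.update(buffer);
            }
        } finally {
            channel.close();
            raf.close();
        }
        return md.digest();
    }
}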
I'll run a short test in a few minutes to see how well this performs compared to a regular read.

Edit: I just ran a test on a 2GB file with a block size of 8192 bytes, and using NIO is slightly faster, but not by much (44s vs 50s). I then increased the block size by a factor of 8 and tried again, and the time dropped to 37s for NIO and 31s for non-NIO. So which one is fastest depends on multiple factors.
Mike Simmons
Ranch Hand

Joined: Mar 05, 2008
Posts: 3018
Hmmm, on my machine (a MacBook Pro reading from an external hard drive) it's much faster to map the entire file at one time, rather than break it up into many small blocks. I assume there's an upper limit somewhere where this would simply throw an error, and blocks are necessary, but I haven't encountered it yet.
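That is, replacing the block loop with a single map call, roughly (reusing the channel and md from the sketch above):

// Map the entire file at once and digest it in one update
MappedByteBuffer whole = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
md.update(whole);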

On the other hand, that method seems no faster (on my system) than the java.io version, i.e. essentially the plain FileInputStream loop shown earlier.

I guess there are a lot of variables here: OS, hardware (CPU and I/O), file size. And there are other ways to put this together using NIO but no memory mapping, using direct or indirect buffers. I think that approach might be more suitable for smaller files. Anyway, hopefully one of these will work well for Andrei's system.
Rob Spoor
Sheriff

Joined: Oct 27, 2005
Posts: 19785
My test file, which is 10 bytes short of 2TB, failed with an IOException saying "parameter is incorrect". After reading only half the file size I got an IOException caused by an OutOfMemoryError. The latest Ubuntu ISO, 693MB, gave the same error with the full size; half the size did work.
Mike Simmons
Ranch Hand

Joined: Mar 05, 2008
Posts: 3018
    
  10
Your code seems to give me the same performance as mine if I use a much larger block size (e.g. > 1 GB). So I guess the trick is to find out what the best block size is for a particular system; your code is also the safer option, since it can prevent the errors where mine will fail.

Then again, as my non-NIO code is also giving me essentially the same results, I'm not sure it matters. But the NIO code may work better on other systems.