I want to calculate the MD5 hash of a file (1–2 MB in size). The problem is that with my current approach it takes a long time to calculate the hash. Is there any way to do this faster?
I am doing it like this:
MD5 works with binary data, so instead of reading the file as text strings, it may be faster to read it in as binary into a buffer: no parsing by the library to find the lines for you, no conversion from characters to bytes, and so on. It should be much faster.
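The original poster's code isn't quoted in the thread, but a minimal sketch of the binary approach described above might look like this (class name, buffer size, and the hex helper are my own choices, not from the original post):

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5Binary {
    // Hash the file's raw bytes directly; no text decoding is involved.
    public static byte[] md5(String path) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = new FileInputStream(path)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                md.update(buffer, 0, read); // feed only the bytes actually read
            }
        }
        return md.digest();
    }

    // Render the 16-byte digest as the usual 32-character hex string.
    public static String toHex(byte[] digest) {
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) sb.append(String.format("%02x", b));
        return sb.toString();
    }
}
```

Because this hashes the raw bytes, line terminators and encoding never enter the picture, which is also why the result won't match a hash computed over decoded text.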
Of course, if you make that change, the results will likely not be compatible with the MD5 hashes that you obtained with the previous technique.
Also note that reading as text causes character conversion to Unicode, which is time consuming, and if the default encoding changes you will get a different MD5 signature. Binary is going to be much faster and more stable.
MessageDigest is able to calculate intermediate results:
This way you don't need to store the entire file in memory. It will still take a long time because you still need to read and process the entire file, but you can't do anything about that.
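One way to sketch this incremental style (names here are my own) is with java.security.DigestInputStream, which updates the digest as a side effect of each read, so only one buffer's worth of the file is in memory at a time:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5Streaming {
    public static byte[] md5(String path) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        // DigestInputStream feeds every byte it reads into the digest for us.
        try (InputStream in = new DigestInputStream(Files.newInputStream(Paths.get(path)), md)) {
            byte[] buffer = new byte[8192];
            while (in.read(buffer) != -1) {
                // reading is enough; the wrapper updates the digest
            }
        }
        return md.digest();
    }
}
```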
I'm skeptical that BufferedInputStream will offer much improvement here, considering the latest version of the code is already doing its own buffering using the byte array. BufferedInputStream is mostly helpful when the client code is doing a lot of single-byte read() calls. But it's easy to try it and see if it makes a difference. And I agree with Miran that increasing the buffer size may help - but again, it may not, so try it and see.
I note that MessageDigest also has an update(ByteBuffer) method (if using JDK 5 or later). That could end up being much faster than traditional java.io. Or not - often the traditional methods are just as fast, as the traditional classes have been updated to use nio classes internally. But sometimes that's not possible to do efficiently, and the nio classes like ByteBuffer can be much faster. Unfortunately java.nio classes can be hard to use. Still, I think they're worth trying here, if the performance so far is not good enough.
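A hedged sketch of the update(ByteBuffer) route (class name and buffer size are my own; FileChannel.open needs Java 7, so on older JDKs you would get the channel from new FileInputStream(path).getChannel() instead):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5Nio {
    public static byte[] md5(String path) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        // A direct buffer lets the OS read into it without an extra copy on the Java side.
        ByteBuffer buffer = ByteBuffer.allocateDirect(64 * 1024);
        try (FileChannel channel = FileChannel.open(Paths.get(path), StandardOpenOption.READ)) {
            while (channel.read(buffer) != -1) {
                buffer.flip();     // switch the buffer from filling to draining
                md.update(buffer); // consumes everything between position and limit
                buffer.clear();    // ready for the next read
            }
        }
        return md.digest();
    }
}
```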
Definitely a +1 there. With FileChannel.map you can even use direct I/O:
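The snippet being discussed isn't shown here, but a sketch of block-wise memory mapping might look like this (class name and the block-size parameter are my own; each mapping must stay under Integer.MAX_VALUE bytes, which is one reason to map in blocks):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5Mapped {
    // Map the file a block at a time; the OS pages the data in on demand.
    public static byte[] md5(String path, long blockSize) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (FileChannel channel = FileChannel.open(Paths.get(path), StandardOpenOption.READ)) {
            long size = channel.size();
            for (long pos = 0; pos < size; pos += blockSize) {
                long len = Math.min(blockSize, size - pos);
                MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, pos, len);
                md.update(buf); // consumes the whole mapped region
            }
        }
        return md.digest();
    }
}
```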
I'll run a short test in a few minutes to see how well this performs compared to a regular read.
Edit: just ran a test on a 2GB file with a block size of 8192 bytes, and using NIO is slightly faster, but not much (44s vs 50s). I then increased the block size by a factor of 8 and tried again, and the time dropped to 37s for NIO and 31s for non-NIO. So which one is fastest depends on multiple factors.
Hmmm, on my machine (a MacBook Pro reading from an external hard drive) it's much faster to map the entire file at one time, rather than break it up into many small blocks. I assume there's an upper limit somewhere where this would simply throw an error, and blocks are necessary, but I haven't encountered it yet.
On the other hand, that method seems no faster (on my system) than the java.io version:
I guess there are a lot of variables here: OS, hardware (CPU and I/O), file size. And there are other ways to put this together using NIO but no memory mapping, using direct or indirect buffers. I think this might be more suitable for smaller files. Anyway, hopefully one of these will work well for Andrei's system.
My test file, which is 10 bytes short of 2TB, failed with an IOException mentioning that the "parameter is incorrect". After reading only half the file size I got an IOException caused by an OutOfMemoryError. The latest Ubuntu ISO, 693MB, gave the same error with the full size; half the size did work.
Your code seems to give me the same performance as mine if I put a much larger block size in place (e.g. > 1 GB). So I guess the trick is to find out what the best block size is for a particular system, and then your code will be safer since it can prevent errors where mine will fail.
Then again, as my non-NIO code is also giving me essentially the same results, I'm not sure it matters. But the NIO code may work better on other systems.