
Reading large files in Java

 
Andrei Antonescu
Ranch Hand
Posts: 75
Hello

I want to calculate the MD5 hash of a file (1-2 MB in size). The problem is that my current approach takes a long time to calculate the hash. Is there any way to do this faster?
I am doing it like this:




Thanks in advance
 
Miran Cvenkel
Ranch Hand
Posts: 172
change


to



and fileContent = fileContent + line --> sb.append....

That should give it a boost, for starters.
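
A rough sketch of the kind of change suggested here, assuming fileContent was a String built up inside a readLine() loop and sb is a StringBuilder. The class and method names are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

class AppendLines {
    // Assumed "before": each + copies everything read so far, so the
    // total work grows roughly quadratically with the file size.
    static String concatWithString(BufferedReader reader) throws IOException {
        String fileContent = "";
        String line;
        while ((line = reader.readLine()) != null) {
            fileContent = fileContent + line;
        }
        return fileContent;
    }

    // Assumed "after": StringBuilder appends into a growable buffer,
    // so the total work stays roughly linear.
    static String concatWithBuilder(BufferedReader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            sb.append(line);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String text = "first\nsecond\nthird\n";
        String a = concatWithString(new BufferedReader(new StringReader(text)));
        String b = concatWithBuilder(new BufferedReader(new StringReader(text)));
        System.out.println(a.equals(b)); // both variants build the same text
    }
}
```

Both methods return the same text; only the amount of copying differs.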
 
Henry Wong
author
Marshal
Pie
Posts: 20828

MD5 works with binary data, so instead of reading the file as text strings, it may be faster to read it as binary data into a buffer: no parsing by the library to find the lines for you, no conversion to bytes, and so on. It should be much faster.

Of course, if you make the change, the result is unlikely to be compatible with the MD5 hashes that you obtained with the previous technique.

Henry
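
To make both points concrete, here is a minimal sketch (not code from this thread) that hashes the raw bytes of a file and, for comparison, the text rebuilt from readLine() calls. The class name, method names, and the temporary file are invented for the example, and UTF-8 is an assumed encoding.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class BinaryVsText {
    static String hex(byte[] digest) {
        return String.format("%032x", new BigInteger(1, digest));
    }

    // Hash the file exactly as it is stored on disk.
    static String md5OfBytes(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        return hex(md.digest(Files.readAllBytes(file)));
    }

    // Hash the text rebuilt line by line; readLine() drops line terminators.
    static String md5OfLines(Path file) throws IOException, NoSuchAlgorithmException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line);
            }
        }
        MessageDigest md = MessageDigest.getInstance("MD5");
        return hex(md.digest(sb.toString().getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("md5-demo", ".txt");
        Files.write(file, "abc\n".getBytes(StandardCharsets.UTF_8));
        System.out.println("bytes: " + md5OfBytes(file));
        System.out.println("lines: " + md5OfLines(file));
        Files.delete(file);
    }
}
```

The two values differ for any file that contains a line break, because the line-based version never sees the terminators.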

 
William Brogden
Author and all-around good cowpoke
Rancher
Posts: 13045
Also note that reading as text will cause character conversions to Unicode, which is time-consuming, and if the default encoding is changed you will get a different MD5 signature. Binary is going to be much faster and more stable.

Bill
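
A small sketch of the encoding problem, assuming the text ends up being converted back to bytes before hashing. The class and method names are made up; the same string gives different digests as soon as it contains a character that the two charsets encode differently.

```java
import java.math.BigInteger;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class EncodingMatters {
    // Encode the text with the given charset, then hash the resulting bytes.
    static String md5Hex(String text, Charset charset) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(text.getBytes(charset));
        return String.format("%032x", new BigInteger(1, digest));
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        String text = "caf\u00e9"; // "café": the last character is outside ASCII
        System.out.println("UTF-8:      " + md5Hex(text, StandardCharsets.UTF_8));
        System.out.println("ISO-8859-1: " + md5Hex(text, StandardCharsets.ISO_8859_1));
    }
}
```

Hashing the raw file bytes avoids this entirely, since no charset is involved.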
 
Rob Spoor
Sheriff
Pie
Posts: 20371
MessageDigest can be fed the data incrementally through repeated update() calls, so you can hash the file one chunk at a time.
This way you don't need to store the entire file in memory. It will still take a long time because you still need to read and process the entire file, but you can't do anything about that.
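
The chunked approach described above could look roughly like this. It is a sketch, not the code originally posted; the class name, method name, and buffer size parameter are assumptions.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class StreamingMd5 {
    // Read the file in chunks and update the digest after each read,
    // so only one buffer's worth of data is in memory at a time.
    static String md5Hex(Path file, int bufferSize)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] buffer = new byte[bufferSize];
        try (InputStream in = new FileInputStream(file.toFile())) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                md.update(buffer, 0, n); // only the bytes actually read
            }
        }
        return String.format("%032x", new BigInteger(1, md.digest()));
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("md5-stream", ".bin");
        Files.write(file, "abc".getBytes(StandardCharsets.US_ASCII));
        System.out.println(md5Hex(file, 8192)); // prints 900150983cd24fb0d6963f7d28e17f72
        Files.delete(file);
    }
}
```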
 
Andrei Antonescu
Ranch Hand
Posts: 75
Hello again,

Thank you all for posting. I am trying to create a file integrity checker like Tripwire. I don't understand why Tripwire can calculate hashes so fast for bigger files, and I can't...

 
Miran Cvenkel
Ranch Hand
Posts: 172


Make that buffer bigger; I guess that would speed things up.
 
Rob Spoor
Sheriff
Pie
Posts: 20371
Using a BufferedInputStream around the FileInputStream would probably also improve performance.
 
Mike Simmons
Ranch Hand
Posts: 3028
I'm skeptical that BufferedInputStream will offer much improvement here, considering the latest version of the code is already doing its own buffering using the byte[] array. BufferedInputStream is mostly helpful when the client code is doing a lot of single-byte read() calls. But it's easy to try it and see if it makes a difference. And I agree with Miran that increasing the buffer size may help - but again, it may not, so try it and see.
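
One way to "try it and see" is a crude timing loop like the sketch below. The class name, the buffer sizes, and the scratch file are all assumptions, and single-run wall-clock timings like these are easily skewed by the OS file cache and JIT warm-up, so treat the numbers as a rough hint only.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.math.BigInteger;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class BufferTrial {
    // Hash a file using our own byte[] buffer, optionally wrapping the
    // stream in a BufferedInputStream as well.
    static String md5Hex(Path file, int bufferSize, boolean wrap)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] buffer = new byte[bufferSize];
        try (InputStream raw = new FileInputStream(file.toFile());
             InputStream in = wrap ? new BufferedInputStream(raw) : raw) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                md.update(buffer, 0, n);
            }
        }
        return String.format("%032x", new BigInteger(1, md.digest()));
    }

    public static void main(String[] args) throws Exception {
        Path file;
        if (args.length > 0) {
            file = Paths.get(args[0]);
        } else {
            // No argument given: fall back to a 16 MB scratch file of zeros.
            file = Files.createTempFile("md5-trial", ".bin");
            file.toFile().deleteOnExit();
            Files.write(file, new byte[16 * 1024 * 1024]);
        }
        int[] sizes = {8 * 1024, 64 * 1024, 1024 * 1024};
        for (int size : sizes) {
            for (boolean wrap : new boolean[] {false, true}) {
                long start = System.nanoTime();
                String hash = md5Hex(file, size, wrap);
                long millis = (System.nanoTime() - start) / 1_000_000;
                System.out.printf("buffer=%d wrapped=%b %d ms %s%n", size, wrap, millis, hash);
            }
        }
    }
}
```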

I note that MessageDigest also has an update(ByteBuffer) method (if using JDK 5 or later). That could end up being much faster than traditional java.io. Or not - often the traditional methods are just as fast, as the traditional classes have been updated to use nio classes internally. But sometimes that's not possible to do efficiently, and the nio classes like ByteBuffer can be much faster. Unfortunately java.nio classes can be hard to use. Still, I think they're worth trying here, if the performance so far is not good enough.
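
For anyone who wants to try update(ByteBuffer) without memory mapping, here is a minimal sketch that reads through a FileChannel into a direct buffer. The class name, method name, and buffer size are assumptions for illustration.

```java
import java.io.IOException;
import java.math.BigInteger;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class ChannelMd5 {
    static String md5Hex(Path file, int bufferSize)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        ByteBuffer buffer = ByteBuffer.allocateDirect(bufferSize);
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            while (channel.read(buffer) != -1) {
                buffer.flip();     // switch from filling to draining
                md.update(buffer); // consumes everything up to the limit
                buffer.clear();    // ready for the next read
            }
        }
        return String.format("%032x", new BigInteger(1, md.digest()));
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("md5-channel", ".bin");
        Files.write(file, "abc".getBytes(StandardCharsets.US_ASCII));
        System.out.println(md5Hex(file, 64 * 1024)); // prints 900150983cd24fb0d6963f7d28e17f72
        Files.delete(file);
    }
}
```

Swapping allocateDirect for allocate gives the indirect (heap) buffer variant, which is worth timing as well.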
 
Rob Spoor
Sheriff
Pie
Posts: 20371
Definitely a +1 there. With FileChannel.map you can even use memory-mapped (direct) buffers.
I'll run a short test in a few minutes to see how well this performs compared to a regular read.

Edit: I just ran a test on a 2GB file with a block size of 8192 bytes, and using NIO is slightly faster, but not by much (44s vs 50s). I then increased the block size by a factor of 8 and tried again, and the time dropped to 37s for NIO and 31s for non-NIO. So which one is fastest depends on multiple factors.
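
A sketch of the block-wise mapping idea, assuming the file is mapped read-only one block at a time and each mapped block is passed to update(ByteBuffer). This is a reconstruction for illustration, not the code that was benchmarked; the class name, method name, and blockSize parameter are hypothetical.

```java
import java.io.IOException;
import java.math.BigInteger;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class MappedBlocksMd5 {
    static String md5Hex(Path file, long blockSize)
            throws IOException, NoSuchAlgorithmException {
        if (blockSize <= 0 || blockSize > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("blockSize must be in 1..Integer.MAX_VALUE");
        }
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = channel.size();
            for (long pos = 0; pos < size; pos += blockSize) {
                long len = Math.min(blockSize, size - pos); // last block may be short
                MappedByteBuffer block =
                        channel.map(FileChannel.MapMode.READ_ONLY, pos, len);
                md.update(block);
            }
        }
        return String.format("%032x", new BigInteger(1, md.digest()));
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("md5-mapped", ".bin");
        file.toFile().deleteOnExit(); // a mapped file may not be deletable right away
        Files.write(file, "abc".getBytes(StandardCharsets.US_ASCII));
        System.out.println(md5Hex(file, 8192)); // prints 900150983cd24fb0d6963f7d28e17f72
    }
}
```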
 
Mike Simmons
Ranch Hand
Posts: 3028
Hmmm, on my machine (a MacBook Pro reading from an external hard drive) it's much faster to map the entire file at one time, rather than break it up into many small blocks. I assume there's an upper limit somewhere where this would simply throw an error, and blocks are necessary, but I haven't encountered it yet.

On the other hand, that method seems no faster (on my system) than the java.io version.

I guess there are a lot of variables here: OS, hardware (CPU and IO), file size. And there are other ways to put this together using NIO but no memory mapping, using direct or indirect buffers. I think this might be more suitable for smaller files. Anyway, hopefully one of these will work well for Andrei's system.
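
Mapping the whole file in one call might look like the sketch below (names are hypothetical, and this is not the code that was timed). On the upper limit: FileChannel.map is documented to require a size no greater than Integer.MAX_VALUE, so a single mapping tops out just under 2 GB, and a mapping can also fail well below that if the process runs short of address space.

```java
import java.io.IOException;
import java.math.BigInteger;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class MappedWholeMd5 {
    static String md5Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = channel.size();
            if (size > Integer.MAX_VALUE) {
                // map() rejects larger sizes; such files need block-wise mapping.
                throw new IOException("File too large to map in one call: " + size + " bytes");
            }
            MappedByteBuffer whole =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, size);
            md.update(whole);
        }
        return String.format("%032x", new BigInteger(1, md.digest()));
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("md5-whole", ".bin");
        file.toFile().deleteOnExit(); // a mapped file may not be deletable right away
        Files.write(file, "abc".getBytes(StandardCharsets.US_ASCII));
        System.out.println(md5Hex(file)); // prints 900150983cd24fb0d6963f7d28e17f72
    }
}
```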
 
Rob Spoor
Sheriff
Pie
Posts: 20371
My test file, which is 10 bytes short of 2GB, failed with an IOException mentioning that the "parameter is incorrect". After reading only half the file size I got an IOException caused by an OutOfMemoryError. The latest Ubuntu ISO, 693MB, gave the same error with the full size; half the size did work.
 
Mike Simmons
Ranch Hand
Posts: 3028
Your code seems to give me the same performance as mine if I put a much larger block size in place (e.g. > 1 GB). So I guess the trick is to find out what the best block size is for a particular system, and then your code will be safer since it can prevent errors where mine will fail.

Then again, as my non-NIO code is also giving me essentially the same results, I'm not sure it matters. But the NIO code may work better on other systems.
 