JavaRanch » Java Forums » Java » Performance
better option than TarInputStream to untar a tar file in terms of performance....anyone?

ruth abraham
Greenhorn

Joined: Oct 31, 2012
Posts: 9
Hi guys...
My app currently makes use of the TarInputStream to untar a tar file of around 10k. All that needs to be done after is to take the content files and place them in a separate directory.
Can anyone help me figure out a better way (in terms of performance) to do this untar part?


Thanks!
Ruth
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41106
    
10K meaning 10000 bytes? That's so small it should hardly take any time. What timings have you done?


Ping & DNS - my free Android networking tools app
fred rosenberger
lowercase baba
Bartender

Joined: Oct 02, 2003
Posts: 11155
    

What are your documented performance requirements?

Without a well-defined target, how do you know when you're done? It's probably always possible to 'improve performance', but the law of diminishing returns applies here.


There are only two hard things in computer science: cache invalidation, naming things, and off-by-one errors
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 7545
    

ruth abraham wrote:My app currently makes use of the TarInputStream to untar a tar file of around 10k. All that needs to be done after is to take the content files and place them in a separate directory.

OK. Use the native command.
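A rough sketch of that suggestion, shelling out to the system tar command (this assumes a Unix-like system with tar on the PATH; the paths and class name here are illustrative, not from the thread):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class NativeUntar {
    // Hand the whole job to the native tar binary instead of TarInputStream.
    // Assumes a Unix-like OS with "tar" on the PATH.
    public static void untar(Path tarFile, Path destDir)
            throws IOException, InterruptedException {
        Files.createDirectories(destDir);
        Process p = new ProcessBuilder(
                "tar", "-xf", tarFile.toString(), "-C", destDir.toString())
                .inheritIO()
                .start();
        int exit = p.waitFor();
        if (exit != 0) {
            throw new IOException("tar exited with status " + exit);
        }
    }
}
```

Whether this actually beats TarInputStream depends on process-startup cost versus the extraction itself; for thousands of tiny archives the JVM-internal approach may well win, so measure both.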

Can anyone help me figure out a better way (in terms of performance) to do this untar part?

Don't worry about performance until you know it's an issue. Worry about getting it right.

And to that end: Why are you actually writing these files out at all? Unless they are actually needed by other, unrelated, applications, it seems to me that this task is likely to be I/O-bound. And you ain't going to solve that unless you rethink your strategy.

Winston

Isn't it funny how there's always time and money enough to do it WRONG?
Articles by Winston can be found here
ruth abraham
Greenhorn

Joined: Oct 31, 2012
Posts: 9
Yes, 10 KB is rather small. But the problem here is that this action is done for 7500 files of 10 KB each, every 15 minutes. We place the untarred files in another path for consumption by other unrelated apps. Now the problem is that the number of files to be untarred every 15 minutes is going to increase from 7500 to 11000, and I was hoping there would be a better way than my current approach.
Jeff Verdegan
Bartender

Joined: Jan 03, 2004
Posts: 6109
    

Again, though, without concrete requirements, and concrete measurements showing how far you are from meeting those requirements, you're stumbling around in the dark.

Is it okay to take the full 15 minutes to handle each batch? If not, how much time are you allowed?

Is it okay for a batch to occasionally take more than 15 minutes to process, or does one batch have to complete before the next one starts?

How long will your current approach take to process 11,000 files? If you need to be done in 15 minutes and it's taking 16, the solution will probably be very different than if it's taking 60 minutes.

How do you know that the TarInputStream is the bottleneck, rather than one of your own classes or some third-party library? Have you used a profiler to measure it, or are you just guessing?
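Before reaching for any of the options below, a crude wall-clock measurement of the untar step alone already tells you a lot. A minimal sketch (the helper name is made up for illustration; a real profiler gives a fuller breakdown):

```java
public class UntarTiming {
    // Measure elapsed wall-clock time of a task in milliseconds.
    // Wrap just the untar step with this to confirm whether it is
    // really where the time goes before optimizing anything.
    public static long timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1_000_000;
    }
}
```

For example, `UntarTiming.timeMillis(() -> untarBatch(files))` run against a realistic batch gives a baseline to compare any change against.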

There are many possible ways to speed up the process. It's impossible to say at this point which ones are most appropriate for your case.

  • Get a faster disk.
  • Get a faster CPU.
  • Get more RAM.
  • Use multiple computers in parallel.
  • Put the source and destination on the same physical drive/controller.
  • Put the source and destination on separate physical drives/controllers.
  • Don't use tar.
  • Find a 3rd party library that's faster than the TarInputStream you're currently using.
  • Get hold of the tar spec and write your own TarInputStream.
  • Find the bug in your code that's the real culprit and fix that.
  • Add a BufferedInputStream around your TarInputStream and read chunks at a time rather than individual bytes.
  • Don't do anything, because you hadn't actually measured before, but now that you have, you find that it's running plenty fast enough.


Some of those are likely to be of little or no value, but without more details, it's impossible to say which ones will be appropriate and which will not.
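The BufferedInputStream item in the list above is often the cheapest win. A sketch of the pattern (the stream-copy loop is plain java.io; the TarInputStream wrapping it would feed from is shown only in a comment, since the exact class depends on which tar library is in use):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class BufferedCopy {
    // Read in multi-KB chunks rather than byte by byte. In the real code the
    // buffered stream would sit under the tar reader, e.g.:
    //   new TarInputStream(new BufferedInputStream(new FileInputStream(tarFile)))
    public static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[8192];   // 8 KB chunks
        long total = 0;
        for (int n; (n = in.read(buf)) != -1; ) {
            out.write(buf, 0, n);
            total += n;
        }
        return total;
    }
}
```

Unbuffered single-byte reads can turn a 10 KB file into thousands of system calls; chunked reads through a buffer collapse that to a handful.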
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 7545

ruth abraham wrote:Yes, 10 KB is rather small. But the problem here is that this action is done for 7500 files of 10 KB each, every 15 minutes. We place the untarred files in another path for consumption by other unrelated apps. Now the problem here is the number of files to be untarred every 15 minutes is going to increase from 7500 to 11000 and I was hoping there would be a better way than my current approach.

OK, so it sounds like you're treating the file system like a database - which is not what they were designed for - and I suspect that most of the I/O will be taken up by the "write" side of this task (creating directories, allocating inodes, writing files, etc.).

In addition to all the things that Jeff listed, there is another possibility if this is a Unix/Linux system (it may also be possible on Windows, but I don't know how):
  • Tune the target filesystem(s) for a small number of bytes per inode. Unix filesystems are configured for general use, but it sounds to me like you're creating tons of very small files inside directory structures, and for that the default isn't so great.

Whatever you come up with, it strikes me that your current methodology may not be very scalable, so you may actually want to think about completely different strategies, e.g.:
  • Don't untar at all, and have all applications read the tar files as streams.
  • Create a function that untars files on the fly so that they can be read/scanned by a normal program.
  • Put the data in a database.

Winston
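The "read the tar files as streams" idea doesn't necessarily need a library: a tar archive is just a sequence of 512-byte headers and padded data blocks. A rough sketch of walking entries in place, in the spirit of Jeff's "get hold of the tar spec" suggestion (hypothetical helper: no checksum verification, no long-name support):

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class TarLister {
    // Walk a tar stream header by header per the ustar layout: each entry is
    // a 512-byte header (name at offset 0, size as octal text at offset 124),
    // followed by the data rounded up to a 512-byte boundary. Two all-zero
    // blocks mark end-of-archive. Lists entry names without extracting.
    public static List<String> listEntries(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        byte[] header = new byte[512];
        while (readFully(in, header)) {
            if (allZero(header)) break;                       // end-of-archive
            names.add(cString(header, 0, 100));               // entry name
            long size = Long.parseLong(cString(header, 124, 12).trim(), 8);
            skipFully(in, ((size + 511) / 512) * 512);        // skip padded data
        }
        return names;
    }

    private static boolean readFully(InputStream in, byte[] buf) throws IOException {
        int off = 0;
        while (off < buf.length) {
            int n = in.read(buf, off, buf.length - off);
            if (n == -1) {
                if (off > 0) throw new EOFException("truncated tar stream");
                return false;
            }
            off += n;
        }
        return true;
    }

    private static void skipFully(InputStream in, long n) throws IOException {
        while (n > 0) {
            long s = in.skip(n);
            if (s <= 0) {
                if (in.read() == -1) throw new EOFException("truncated tar stream");
                s = 1;
            }
            n -= s;
        }
    }

    private static String cString(byte[] b, int off, int len) {
        int end = off;
        while (end < off + len && b[end] != 0) end++;
        return new String(b, off, end - off, StandardCharsets.US_ASCII);
    }

    private static boolean allZero(byte[] b) {
        for (byte x : b) if (x != 0) return false;
        return true;
    }
}
```

The same header walk can hand each entry's bytes to a consumer directly instead of collecting names, which is essentially Winston's "untar on the fly" option without ever touching the disk on the write side.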