ZipInputStream problem

Greenhorn

Posts: 4

posted 15 years ago

Dear all,

I'm encountering something I find very strange when dealing with ZipInputStream.

I compress several files into a zip and store its checksum. When the zip is read, I verify said checksum.

The thing is, depending on which files I compress, the checksum sometimes does not match on decompression. I found the cause for this, but don't understand it.

I set up my input streams like so:

Then I just iterate through all zip entries and decompress them all. The thing is that sometimes the ZipInputStream will have no further entries to read, the available() method will return '0' (indicating EOF reached) but the underlying CheckedInputStream will still have some bytes that haven't been read!
At this point the input's checksum differs from the original, but if I just read these remaining bytes directly from the CheckedInputStream then the checksums do match.

To be clear on this: when I say 'sometimes' it's not that it's random. It somehow depends on which files I zip; the same files will always yield the same result. Furthermore, the extra bytes would seem to be of no use at all; all compressed files seem to be there on decompression without damage (though I still have to check this last statement more thoroughly).

I've been looking around, but found no answer so far and am at a loss on how to solve this.

Any info/tips/suggestions would be highly appreciated.
Many thanks already for taking the time to read this.

Best regards.
[ August 20, 2008: Message edited by: everton landio ]

Nitesh Kant

Bartender

Posts: 1638

posted 15 years ago

everton:
The thing is that sometimes the ZipInputStream will have no further entries to read, the available() method will return '0' (indicating EOF reached) but the underlying CheckedInputStream will still have some bytes that haven't been read!

I think you need to read this article: AvailableDoesntDoWhatYouThinkItDoes

This is how I extract a Jar file. Extracting a Zip works along the same lines.

import java.io.BufferedOutputStream;
import java.io.Closeable;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.jar.JarEntry;
import java.util.jar.JarInputStream;
 
public class JarExtracter {
 
    /**
     * Extracts the passed jar stream into the directory <code>destinationDir</code> <p>
     *
     * @param jar Stream representing the jar file to extract.
     * @param destinationDir Extraction directory.
     * @param cleanBeforeExtract If the destination directory is to be cleaned before extraction.
     * @throws IOException Extraction problems.
     */
    public static void extractJar(JarInputStream jar, File destinationDir, boolean cleanBeforeExtract)
            throws IOException {
        if (destinationDir.exists() && cleanBeforeExtract) {
            System.out.println("Destination directory: " + destinationDir.getAbsolutePath() +
                               ", cleaning up before extraction.");
            emptyAndDeleteDir(destinationDir);
        }
        if (!destinationDir.exists()) {
            // Also recreates the directory after a clean, because emptyAndDeleteDir removes it.
            createDirs(destinationDir);
        }
 
        JarEntry nextEntry;
        while ((nextEntry = jar.getNextJarEntry()) != null) {
            writeEntry(jar, destinationDir, nextEntry);
        }
    }
 
    private static void writeEntry(JarInputStream jar, File destinationDir, JarEntry nextEntry)
            throws IOException {
        String entryName = nextEntry.getName();
        System.out.println("Processing jar entry: " + entryName);
        if (nextEntry.isDirectory()) {
            File dirPath = new File(destinationDir, entryName);
            createDirs(dirPath);
        } else {
            writeFile(jar, destinationDir, entryName);
        }
    }
 
    private static void writeFile(JarInputStream jar, File destinationDir, String entryName)
            throws IOException {
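        File filePath = new File(destinationDir, entryName);
        // Reject entry names such as "../x" that would resolve outside the destination directory.
        String destinationPath = destinationDir.getCanonicalPath() + File.separator;
        if (!filePath.getCanonicalPath().startsWith(destinationPath)) {
            throw new IOException("Entry is outside the destination directory: " + entryName);
        }
        // A file entry may appear before (or without) an entry for its parent directory.
        File parentDir = filePath.getParentFile();
        if (!parentDir.isDirectory()) {
            createDirs(parentDir);
        }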
        BufferedOutputStream entryDestinationStream = null;
        FileOutputStream fileStream = null;
        try {
            fileStream = new FileOutputStream(filePath);
            entryDestinationStream = new BufferedOutputStream(fileStream);
            int chunkSize = 1024;
            byte[] chunk = new byte[chunkSize];
            int bytesRead;
            while ((bytesRead = jar.read(chunk)) != -1) {
                entryDestinationStream.write(chunk, 0, bytesRead);
            }
            System.out.println("Written file: " + filePath.getAbsolutePath());
        } finally {
            if (!closeStream(entryDestinationStream)) {
                closeStream(fileStream);
            }
        }
    }
 
    private static boolean closeStream(Closeable closeable) {
        if(null != closeable) {
            try {
                closeable.close();
                return true;
            } catch (IOException e) {
                System.out.println("Error closing the stream. ");
            }
        }
        return false;
    }
 
    private static void createDirs(File directoryPath) throws IOException {
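        if (directoryPath.isDirectory()) {
            // mkdirs() returns false for a directory that already exists; that is not an error.
            return;
        }
        if (!directoryPath.mkdirs()) {
            throw new IOException("Failed to create the directory: " + directoryPath.getAbsolutePath());
        }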
        System.out.println("Created directory: " + directoryPath.getAbsolutePath());
    }
 
    private static void emptyAndDeleteDir(File dir) {
        if (!dir.exists()) {
            System.out.println("Directory: " + dir +" does not exist, ignoring delete.");
            return;
        }
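        File[] files = dir.listFiles();
        if (files == null) {
            // listFiles() returns null if dir is not a directory or an I/O error occurs.
            System.out.println("Could not list: " + dir.getAbsolutePath());
            return;
        }
        for (File file : files) {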
            if (file.isDirectory()) {
                emptyAndDeleteDir(file);
            } else {
                if (file.delete()) {
                    System.out.println("Deleted: " + file.getAbsolutePath());
                } else {
                    System.out.println("Could not delete: " + file.getAbsolutePath());
                }
            }
        }
        dir.delete();
    }
}

everton landio

Greenhorn

Posts: 4

posted 15 years ago

Many thanks Nitesh for your prompt reply.
I had read the article you linked, but ZipInputStream redefines the available() method. Taken from the javadoc:

Returns 0 after EOF has reached for the current entry data, otherwise always return 1.

Programs should not count on this method to return the actual number of bytes that could be read without blocking.

My unzipping code is quite similar to yours: I iterate through all entries and extract them. The thing is, I get to the point where read() on the ZipInputStream returns -1, which is consistent with available() returning 0 (EOF for current entry) and getNextEntry() returns null, but there are still six thousand something bytes to be read from the underlying CheckedInputStream. aarrrrggggggghhhhhhhhh!!!

I found this comment:

ZipOutputStream produces a slightly non-standard format. ZipOutputStream puts the compressed and uncompressed size and CRC after the data, instead of in the local header just in front of it.

here: http://forums.sun.com/thread.jspa?messageID=2316622

Could this have anything to do with my problem? Can the remaining bytes be this extra data? I doubt this is the case, since then these leftover bytes would appear when I extracted any zip, instead of just sometimes.
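One way to probe that question is to look at what sits at the tail of an archive. In the ZIP format, the entries are followed by a central directory and an end-of-central-directory record, and a sequential reader such as ZipInputStream walks the local entry headers without needing that trailer. The sketch below is only an illustration under stated assumptions: the class name ZipTrailerDemo, the entry name, and the helper methods are made up for this example, and it assumes a small archive with no archive comment and no ZIP64 records.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class; none of these names come from the thread.
class ZipTrailerDemo {
    static final int CENTRAL_HEADER_SIGNATURE = 0x02014b50; // "PK\1\2"
    static final int END_RECORD_SIGNATURE = 0x06054b50;     // "PK\5\6"
    static final int END_RECORD_SIZE = 22;                  // size of the end record when there is no comment

    // Builds a tiny one-entry archive in memory.
    static byte[] makeZip() throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            zip.putNextEntry(new ZipEntry("hello.txt"));
            zip.write("hello zip".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return out.toByteArray();
    }

    // ZIP stores its multi-byte fields in little-endian order.
    static int readInt(byte[] bytes, int offset) {
        return ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getInt(offset);
    }

    // Reads the central directory offset from the end record (field at byte 16 of that record).
    static int centralDirectoryOffset(byte[] zipBytes) {
        int endRecord = zipBytes.length - END_RECORD_SIZE;
        if (endRecord < 0 || readInt(zipBytes, endRecord) != END_RECORD_SIGNATURE) {
            throw new IllegalArgumentException("No end record at the expected position");
        }
        return readInt(zipBytes, endRecord + 16);
    }

    public static void main(String[] args) throws IOException {
        byte[] zip = makeZip();
        int offset = centralDirectoryOffset(zip);
        System.out.println("Archive size: " + zip.length);
        System.out.println("Bytes from the central directory to the end: " + (zip.length - offset));
    }
}
```

The trailer grows with the number of entries and the length of their names, so an archive with many files could plausibly leave a few thousand bytes after the last entry.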

Any ideas?
Thank you all for your attention.

Regards.
[ August 21, 2008: Message edited by: everton landio ]

Nitesh Kant

Bartender

Posts: 1638

posted 15 years ago

everton:
ZipInputStream redefines the available() method.

Oops, I never saw that.

Since you only get this problem sometimes, my suspicion falls on the available() method.

It is probably worth trying to avoid available() and rely on read() returning -1 instead.
I don't know why this is happening, but maybe trying this will help.

everton landio

Greenhorn

Posts: 4

posted 15 years ago

Thanks for your answer Nitesh.

As a matter of fact, I don't use the available() method, I use read(...) until it returns -1. I just also checked that available() returns 0 while debugging, but I can see how my first post could have been confusing in that regard.

Bottom line is: I have a ZipInputStream with no remaining zip entries, no bytes to read from the last of the read entries, but some leftover bytes remaining in the underlying input stream.

I'll keep looking into it and post back here if I find the reason.

Meanwhile, any pointers you guys could give me would be very helpful.

Thanks.

everton landio

Greenhorn

Posts: 4

posted 15 years ago

Ok, I have this kinda figured out so I'm posting (quite belatedly) my current solution.

I checked this thoroughly and am confident that those extra bytes are metadata created by the ZipOutputStream.

If the extra bytes are read, the checksums match, and I've used WinMerge to compare several sets of original files against their uncompressed counterparts and did not find a single difference.

I mentioned before that the difference in the checksums happened just in some cases. I found out that this was because I had the CheckedInputStream wrapped in a BufferedInputStream. When this metadata was sufficiently small, it was pulled entirely into the buffer while reading the final portion of the last ZipEntry, and thus the checksums coincided. When the BufferedInputStream was removed, all checksums showed differences, which was what I expected.
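The buffering effect can be shown in isolation. This is a rough sketch, assuming the wrapping order was a BufferedInputStream around the CheckedInputStream (the thread does not show the actual setup code); the class name BufferedChecksumDemo and its helpers are invented for the illustration. The buffer reads ahead through the checked stream, so the checksum covers more bytes than the consumer has asked for.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.CRC32;
import java.util.zip.CheckedInputStream;

// Hypothetical demo class; none of these names come from the thread.
class BufferedChecksumDemo {

    // CRC-32 of the first len bytes of data.
    static long crcOf(byte[] data, int len) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, len);
        return crc.getValue();
    }

    // Reads `wanted` single bytes, optionally through a buffer, and returns
    // the checksum that the CheckedInputStream has accumulated at that point.
    static long checksumAfterReading(byte[] data, int wanted, boolean buffered) throws IOException {
        CheckedInputStream checked =
                new CheckedInputStream(new ByteArrayInputStream(data), new CRC32());
        InputStream in = buffered ? new BufferedInputStream(checked, 8192) : checked;
        for (int i = 0; i < wanted; i++) {
            in.read();
        }
        return checked.getChecksum().getValue();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[100];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) i;
        }
        long unbuffered = checksumAfterReading(data, 10, false);
        long buffered = checksumAfterReading(data, 10, true);
        // Without the buffer only the 10 consumed bytes are checksummed.
        System.out.println("Unbuffered matches CRC of 10 bytes: " + (unbuffered == crcOf(data, 10)));
        // With the buffer, the read-ahead has already pulled all 100 bytes through.
        System.out.println("Buffered matches CRC of all bytes:  " + (buffered == crcOf(data, data.length)));
    }
}
```

With a small input the read-ahead swallows everything, which would explain matching checksums for small archives only; once the trailing bytes exceed what the buffer happens to pull in, part of the stream never reaches the checksum.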

It's kind of weird though that info on this is not readily available...
Makes me feel like I'm missing something or that maybe I'm not using the checked streams correctly.

Anyway, there you have it: at the moment I'm reading all remaining bytes directly from the checked stream, and if the checksums match I assume all is well.
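For anyone landing here later, the workaround can be sketched roughly as follows. This is not the original code from the thread (which was never posted); the class name DrainCheckedZip, the in-memory archive, and the choice of CRC32 are assumptions made to keep the example self-contained. The key step is draining the CheckedInputStream itself after getNextEntry() returns null, before anything is closed.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.CheckedInputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class; none of these names come from the thread.
class DrainCheckedZip {

    // Builds a small two-entry archive in memory to stand in for a real zip file.
    static byte[] makeZip() throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            for (String name : new String[] {"a.txt", "b.txt"}) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write(("contents of " + name).getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        }
        return out.toByteArray();
    }

    // CRC-32 over the whole archive, as it would be stored at compression time.
    static long crcOf(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    // Reads every entry, then drains whatever the ZipInputStream left unread,
    // and returns the checksum accumulated by the CheckedInputStream.
    static long readAndChecksum(byte[] zipBytes) throws IOException {
        byte[] buf = new byte[1024];
        try (CheckedInputStream checked =
                     new CheckedInputStream(new ByteArrayInputStream(zipBytes), new CRC32());
             ZipInputStream zip = new ZipInputStream(checked)) {
            while (zip.getNextEntry() != null) {
                while (zip.read(buf) != -1) {
                    // A real extractor would write buf to its destination here.
                }
            }
            // Drain the trailing bytes directly from the checked stream.
            while (checked.read(buf) != -1) {
                // Nothing to do; reading is enough to update the checksum.
            }
            return checked.getChecksum().getValue();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] zip = makeZip();
        System.out.println("Checksums match: " + (readAndChecksum(zip) == crcOf(zip)));
    }
}
```

The drain has to read from the CheckedInputStream and not from the ZipInputStream, because the latter already reports end of stream at that point.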

As an extra comment, part of the metadata seems to be a timestamp or some other variable field: if I zip the exact same set of files on different occasions, the resulting zips have different hashes, which didn't seem to be the case when using other compression tools.
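If the varying part is the entry timestamp, which seems likely because ZipOutputStream stamps an entry with the current time when none has been set, then fixing the time on each ZipEntry should make the output repeatable. A minimal sketch under that assumption follows; the class name DeterministicZipDemo and the constant FIXED_TIME_MILLIS are invented for this example, and in practice the original file's last-modified time might be a more useful value than an arbitrary constant.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class; none of these names come from the thread.
class DeterministicZipDemo {
    // Arbitrary fixed timestamp (2000-01-01T00:00:00Z), chosen only for the example.
    static final long FIXED_TIME_MILLIS = 946_684_800_000L;

    // Zips a single entry whose modification time is set explicitly,
    // so the archive bytes do not depend on when the method runs.
    static byte[] zipWithFixedTime(String name, byte[] content) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            ZipEntry entry = new ZipEntry(name);
            entry.setTime(FIXED_TIME_MILLIS);
            zip.putNextEntry(entry);
            zip.write(content);
            zip.closeEntry();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] content = "same content".getBytes(StandardCharsets.UTF_8);
        byte[] first = zipWithFixedTime("data.txt", content);
        byte[] second = zipWithFixedTime("data.txt", content);
        System.out.println("Archives identical: " + Arrays.equals(first, second));
    }
}
```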

Regards.