Compression/Decompression encoding in unix.

pawan chopra
Ranch Hand

Joined: Jan 23, 2008
Posts: 410

Hi All,

I would like to know what encoding is used by compression/decompression algorithms in Unix/Linux. On Windows, for example, Cp437 is used.

Pawan Chopra
Jesper de Jong
Java Cowboy
Saloon Keeper

Joined: Aug 16, 2005
Posts: 14074

Which compression/decompression algorithms? Are you talking about, for example, gzip? It can compress and decompress any kind of data (text or binary); it has nothing to do with character encodings.

If you mean something else, then please explain in more detail what your question is exactly.
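
To illustrate the point about gzip working on raw bytes, here is a minimal sketch. The class name GzipBytesDemo and its helper methods are made up for this example; only the java.util.zip stream classes are real JDK API. The input deliberately contains bytes that are not valid text in most encodings, and the round trip should still return them unchanged.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class, not part of any library.
class GzipBytesDemo {
    // Compress raw bytes; gzip never interprets them as characters.
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(data);
        }
        return buf.toByteArray();
    }

    // Decompress back to the original bytes.
    static byte[] gunzip(byte[] compressed) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = gz.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Arbitrary binary data, no charset involved anywhere.
        byte[] binary = {0, (byte) 0xE4, (byte) 0xF6, (byte) 0xFF, 10, 13};
        System.out.println(Arrays.equals(binary, gunzip(gzip(binary)))); // true
    }
}
```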

pawan chopra
Ranch Hand

Joined: Jan 23, 2008
Posts: 410

Hi Jesper,

Actually, I am using the Java zip utility to compress some files. The file names contain some Scandinavian letters (like ä and ö). When compressing, ZipOutputStream uses UTF-8 to write the file names, so I don't get the correct names back. I saw that there is a bug in Java for this. I am trying to change the implementation of the ZipOutputStream class to accept an encoding in its constructor. I tried this on Windows with the encoding Cp437 and it worked fine for me. When I open the same file on Unix, however, it does not work. So I am looking for the encoding that Unix/Linux uses for file names during compression/decompression.

Let me know if I have not made myself clear; in that case I will explain further. You can also refer to Corrupt File name.
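
For readers on Java 7 or later, patching the class may not be needed: ZipOutputStream and ZipInputStream both have constructors that take a Charset for entry names. The sketch below assumes such a JDK; the class ZipNameCharsetDemo and its methods are hypothetical helpers. It uses ISO-8859-1 because that charset is guaranteed to exist on every JVM. Cp437 would presumably be obtained with Charset.forName("Cp437"), but whether it is available depends on the JDK build.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class, not part of any library.
class ZipNameCharsetDemo {
    // Writes one empty entry per name, encoding the names with nameCharset.
    static byte[] writeZip(Charset nameCharset, String... names) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buf, nameCharset)) {
            for (String name : names) {
                zip.putNextEntry(new ZipEntry(name));
                zip.closeEntry();
            }
        }
        return buf.toByteArray();
    }

    // Reads the entry names back, decoding them with nameCharset.
    static List<String> readNames(byte[] archive, Charset nameCharset) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip =
                 new ZipInputStream(new ByteArrayInputStream(archive), nameCharset)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        Charset latin1 = StandardCharsets.ISO_8859_1;
        // \u00e4 is "ä" and \u00f6 is "ö"; escapes keep the source file ASCII-only.
        byte[] archive = writeZip(latin1, "hyv\u00e4\u00e4.txt", "\u00f6ljy.txt");
        System.out.println(readNames(archive, latin1));
    }
}
```

The names only survive if the writer and the reader agree on the charset, which would explain why an archive written with Cp437 names looks fine in one tool and broken in another.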
Marco Ehrentreich
best scout

Joined: Mar 07, 2007
Posts: 1280

I'm not sure if you have already figured out whether the question was about compression or encoding.

But to convert between different ENCODINGS there's a handy utility called "recode", which lets you easily convert between encodings such as UTF-8 or latin1.
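
The same kind of conversion can be sketched in Java, since the thread is about Java code anyway. This is not how recode works internally; it is only the decode-then-encode idea, with a made-up class name TranscodeDemo.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
class TranscodeDemo {
    // Decode the bytes with the source charset, then encode with the target one.
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        return new String(input, from).getBytes(to);
    }

    public static void main(String[] args) {
        byte[] latin1 = {(byte) 0xE4, (byte) 0xF6}; // "äö" in ISO-8859-1 (latin1)
        byte[] utf8 = transcode(latin1, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        for (byte b : utf8) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println(); // prints: C3 A4 C3 B6
    }
}
```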

Tim Holloway
Saloon Keeper

Joined: Jun 25, 2001
Posts: 15960

I think you're confusing the compression with the facilities that create ZIP/JAR files in the Java compression classes.

Most of the popular compression algorithms are bit-level (binary) algorithms, so they don't care about code pages. There are, in fact, about 5 different algorithms used in ZIP files, and the normal course of events causes the most effective one to be used. In some cases, that's the "store" algorithm, which doesn't compress at all.

I never really paid attention to the limitations on code pages in a ZIP file directory. The first thing I'd do, however, is check the documentation for ZIP files themselves, since ZIP format was intended to be something that was portable even to the extent that you could move them between ASCII and EBCDIC (IBM mainframe) systems.
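
A small sketch of the "store" versus compressed distinction, using the two methods that java.util.zip exposes (ZipEntry.STORED and ZipEntry.DEFLATED). The class ZipMethodDemo and the entry name data.bin are invented for the example. Note that the method is chosen explicitly here; the code does not pick the most effective one automatically.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class, not part of any library.
class ZipMethodDemo {
    // Builds a one-entry archive using ZipEntry.STORED or ZipEntry.DEFLATED.
    static byte[] zipWith(int method, byte[] data) throws IOException {
        ZipEntry entry = new ZipEntry("data.bin");
        entry.setMethod(method);
        if (method == ZipEntry.STORED) {
            // For STORED entries the sizes and CRC-32 must be known up front.
            CRC32 crc = new CRC32();
            crc.update(data);
            entry.setSize(data.length);
            entry.setCompressedSize(data.length);
            entry.setCrc(crc.getValue());
        }
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buf)) {
            zip.putNextEntry(entry);
            zip.write(data);
            zip.closeEntry();
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[10000]; // all zeros, so it compresses very well
        System.out.println("stored:   " + zipWith(ZipEntry.STORED, data).length + " bytes");
        System.out.println("deflated: " + zipWith(ZipEntry.DEFLATED, data).length + " bytes");
    }
}
```

The stored archive comes out slightly larger than the input because of the headers, while the deflated one is much smaller. Neither path looks at character encodings for the file content.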

Customer surveys are for companies who didn't pay proper attention to begin with.
Stefan Wagner
Ranch Hand

Joined: Jun 02, 2003
Posts: 1923

I would recommend iconv rather than recode for converting text files.

UTF-8 should be fine for storing your filenames, and should be understood by Linux as well as Windows, so that's the way to go to avoid conversion trouble.

pawan chopra
Ranch Hand

Joined: Jan 23, 2008
Posts: 410

I have executed the following experiment:
- created several text files with English, Hungarian, Chinese, Japanese and
Korean names
- attempted to compress them using FilZip, WinZip and PKZip
- attempted to uncompress them using the above tools
My findings are:
- FilZip and WinZip cannot add files with non-English-only names (not even
Hungarian which uses Latin characters); they cannot list files
- PKZip can add files with any names, but the names are transformed: all
non-Western European accented Latin characters are converted to a similar
character without the accent (e.g. ű->u, ő->o) and all non-Latin characters are
converted to question marks; NOTE: Accented Western European characters are
preserved (e.g. áéíóöúüñ), thus Spanish is supported
- WinZip cannot list non-Western European file names, but can extract the
files when "Extract all" is selected; but non-Latin characters are replaced
with underscore (_); since all non-Western European Latin characters are
converted to non-accented Western European ones during compression, these files
are listed and extracted but without accents.
- FilZip and PKZip can display and extract all files but with transformation;
see above

Summary: The ZIP format does not support Unicode in filenames. It might be
possible to pick one specific code page/character set that would be usable for
a specific language, but it is not known how, as the tested tools do not
provide control for this.

Solution: No real solution. As a workaround, Spanish text should be used with all
accented characters replaced with their non-accented equivalents (ú->u, ó->o, etc.),
or files should be compressed using the ISO8859P1 character set for filenames.

Note: PKZip is one of the first zip utilities for Windows; WinZip is the market
leader. If they cannot support Unicode, how could we?
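
The accent-replacement workaround described above could be sketched in Java with java.text.Normalizer. The class AccentStripDemo is made up for this example, and this is just one possible way to do it: decompose each character into a base letter plus combining marks, then drop the marks. Non-Latin names (Chinese, Japanese, Korean) pass through unchanged, and letters without a decomposition, such as ø, are not affected either.

```java
import java.text.Normalizer;

// Hypothetical demo class, not part of any library.
class AccentStripDemo {
    // Splits accented letters into base letter + combining marks (NFD),
    // then removes the combining marks.
    static String stripAccents(String name) {
        String decomposed = Normalizer.normalize(name, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // \u0171 = ű, \u0151 = ő, \u00fa = ú, \u00f3 = ó
        System.out.println(stripAccents("\u0171\u0151\u00fa\u00f3.txt")); // uouo.txt
    }
}
```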
Tim Holloway
Saloon Keeper

Joined: Jun 25, 2001
Posts: 15960

Just to repeat myself: Although PKZIP defined the ZIP file standard, the file format standard long ago became independent of whether you used PKWare, Info-ZIP, WinZip or whatever. There are variants and recent struggles to work around the original limitations like 2.2GB/contained file, but there is a published standard, and that's what should be consulted to determine what's allowable for a file contained in a ZIP archive and what options are available.
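
As one example of what the published format description covers: to my reading of PKWARE's APPNOTE, each entry carries a general purpose bit flag, and bit 11 of it marks the file name as UTF-8; without that bit, names are historically treated as IBM code page 437. The sketch below peeks at that flag in archives written by Java 7+ ZipOutputStream. The class ZipFlagDemo and its helpers are hypothetical, and the header offsets are my assumption from the spec, so check them against the standard before relying on this.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class, not part of any library.
class ZipFlagDemo {
    static final int UTF8_NAME_FLAG = 0x0800; // bit 11 of the general purpose flag

    // Writes an archive with a single empty entry, names encoded with nameCharset.
    static byte[] oneEntryZip(String name, Charset nameCharset) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buf, nameCharset)) {
            zip.putNextEntry(new ZipEntry(name));
            zip.closeEntry();
        }
        return buf.toByteArray();
    }

    // Reads the 16-bit little-endian flag at offset 6 of the first local file header.
    static int firstEntryFlags(byte[] archive) {
        boolean hasLocalHeader = archive.length >= 8
                && archive[0] == 'P' && archive[1] == 'K'
                && archive[2] == 3 && archive[3] == 4;
        if (!hasLocalHeader) {
            throw new IllegalArgumentException("no local file header at offset 0");
        }
        return (archive[6] & 0xFF) | ((archive[7] & 0xFF) << 8);
    }

    static boolean nameMarkedUtf8(byte[] archive) {
        return (firstEntryFlags(archive) & UTF8_NAME_FLAG) != 0;
    }

    public static void main(String[] args) throws IOException {
        String name = "\u00e4\u00f6.txt"; // "äö.txt"
        System.out.println("UTF-8 writer:  "
                + nameMarkedUtf8(oneEntryZip(name, StandardCharsets.UTF_8)));
        System.out.println("Latin-1 writer: "
                + nameMarkedUtf8(oneEntryZip(name, StandardCharsets.ISO_8859_1)));
    }
}
```

A tool that ignores this flag will decode UTF-8 names with its own default code page, which is one plausible source of the corrupted names discussed in this thread.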
Stefan Wagner
Ranch Hand

Joined: Jun 02, 2003
Posts: 1923

Well - I made 3 empty files with these names:

and packed them into a zip file:
and it doesn't surprise me to find those names inside the file:

Note that we are talking about the filenames, not the files' content.

The displayed filenames may depend on the font used by your programs, so an extraction might be correct even though the preview seems to show corrupted filenames.

I'm sorry the ranch doesn't allow zipfiles (or Jars) to be uploaded.

Update: I put it on my website: http://home.arcor.de/hirnstrom/tmp/suspicious.zip
I agree. Here's the link: http://aspose.com/file-tools