Which compression/decompression algorithms? Are you talking about, for example, gzip? It can compress and decompress any kind of data (text or binary); it has nothing to do with character encodings.
If you mean something else, then please explain in more detail what your question is exactly.
Actually, I am using the Java zip utility classes to compress some files. The file names contain some Scandinavian letters (like ä and ö). When compressing, ZipOutputStream uses UTF-8 to write the file names, so the names do not come out correctly. I saw that this is a bug in Java. I am trying to change the implementation of the ZipOutputStream class to accept an encoding in its constructor. I have tried this on Windows with encoding Cp437 and it worked fine for me. When I open the same file on Unix, however, it does not work there, so I am looking for the encoding that Unix/Linux uses for file names during compression/decompression.
Let me know if I have not made myself clear; in that case I will explain in more detail. You can also refer to Corrupt File name.
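For anyone on Java 7 or later, patching the class should not be necessary: ZipOutputStream and ZipInputStream both have a constructor that takes a Charset for entry names. Here is a minimal sketch of a round trip with an explicit name charset. The class and method names (ZipNameCharsetSketch, writeZip, readNames) are made up for illustration, and the archive is kept in memory only to keep the example short.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper class, not part of the JDK.
class ZipNameCharsetSketch {

    // Writes one small entry per name; entry names are encoded with nameCharset.
    static byte[] writeZip(List<String> names, Charset nameCharset) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer, nameCharset)) {
            for (String name : names) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write("content".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Lists the entry names, decoding them with nameCharset.
    static List<String> readNames(byte[] zipBytes, Charset nameCharset) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip =
                 new ZipInputStream(new ByteArrayInputStream(zipBytes), nameCharset)) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                names.add(e.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }
}
```

For the Cp437 case described above, the charset argument would be something like Charset.forName("IBM437"), assuming the runtime ships that charset. The reader has to decode with the same charset the writer used, which is probably why an archive that looks right in one tool shows garbled names in another.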
I think you're confusing the compression with the facilities that create ZIP/JAR files in the Java compression classes.
Most of the popular compression algorithms are bit-level (binary) algorithms, so they don't care about code pages. There are, in fact, about 5 different algorithms used in ZIP files, and the normal course of events causes the most effective one to be used. In some cases, that's the "store" algorithm, which doesn't compress at all.
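To illustrate the point about "store" versus real compression: the java.util.zip API exposes only two of the ZIP methods, STORED and DEFLATED, and it does not pick one for you. The sketch below writes the same bytes both ways. ZipMethodSketch and zipSingleEntry are hypothetical names used only for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical helper class, not part of the JDK.
class ZipMethodSketch {

    // Builds a one-entry archive in memory, either stored (no compression) or deflated.
    static byte[] zipSingleEntry(String name, byte[] data, boolean store) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer)) {
            ZipEntry entry = new ZipEntry(name);
            if (store) {
                // STORED entries need size and CRC set before putNextEntry.
                CRC32 crc = new CRC32();
                crc.update(data);
                entry.setMethod(ZipEntry.STORED);
                entry.setSize(data.length);
                entry.setCompressedSize(data.length);
                entry.setCrc(crc.getValue());
            } else {
                entry.setMethod(ZipEntry.DEFLATED);
            }
            zip.putNextEntry(entry);
            zip.write(data);
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }
}
```

Either way, the method only affects the entry's data bytes. The entry name is written into the headers separately, which is where the code page question comes in.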
I never really paid attention to the limitations on code pages in a ZIP file directory. The first thing I'd do, however, is check the documentation for ZIP files themselves, since ZIP format was intended to be something that was portable even to the extent that you could move them between ASCII and EBCDIC (IBM mainframe) systems.
An IDE is no substitute for an Intelligent Developer.
I have executed the following experiment:
- created several text files with English, Hungarian, Chinese, Japanese and
- attempted to compress them using FilZip, WinZip and PKZip
- attempted to uncompress them using the above tools
My findings are:
- FilZip and WinZip cannot add files with non-English-only names (not even
Hungarian which uses Latin characters); they cannot list files
- PKZip can add files with any name, but names are transformed: all
non-Western European accented Latin characters are converted to a similar
character without the accent (e.g. ű->u, ő->o) and all non-Latin characters are
converted to question marks; NOTE: Accented Western European characters are
preserved (e.g. áéíóöúüñ), thus Spanish is supported
- WinZip cannot list non-Western European file names, but can extract the
files when "Extract all" is selected; but non-Latin characters are replaced
with underscore (_); since all non-Western European Latin characters are
converted to non-accented Western European ones during compression, these files
are listed and extracted but without accents.
- FilZip and PKZip can display and extract all files but with transformation;
Summary: ZIP format does not support Unicode in filenames. It might be possible
to pick one specific code page/character set that would be usable for a
specific language, but it is not known how, as the tested tools do not provide
control for this.
Solution: No real solution. As a workaround, Spanish text should be used with all
accented characters replaced with their non-accented equivalents (ú->u, ó->o, etc.), or
files should be compressed using the ISO8859P1 character set for filenames.
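The accent-stripping workaround can be sketched in Java with java.text.Normalizer. This is only one possible approach, under the assumption that plain ASCII names are acceptable; AsciiFileNameSketch and toAsciiName are hypothetical names. It decomposes each character, drops the combining marks, and replaces whatever is still non-ASCII with a question mark, similar to the PKZip behaviour described above.

```java
import java.text.Normalizer;

// Hypothetical helper class, not part of the JDK.
class AsciiFileNameSketch {

    static String toAsciiName(String name) {
        // NFD splits an accented letter into base letter + combining mark(s).
        String decomposed = Normalizer.normalize(name, Normalizer.Form.NFD);
        // Remove the combining marks, keeping the base letters.
        String withoutMarks = decomposed.replaceAll("\\p{M}+", "");
        // Anything still outside ASCII (e.g. CJK characters) becomes '?'.
        return withoutMarks.replaceAll("[^\\x00-\\x7F]", "?");
    }
}
```

Letters that have no decomposition, such as ø, also end up as '?', so a real implementation would probably want its own replacement table for those. Different original names can also collapse to the same ASCII name, which ZipOutputStream rejects as a duplicate entry.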
Note: PKZip is one of the first zip utilities for Windows; WinZip is the market
leader. If they cannot support Unicode, how could we?
Just to repeat myself: Although PKZIP defined the ZIP file standard, the file format standard long ago became independent of whether you used PKWare, Info-ZIP, WinZip or whatever. There are variants, and recent struggles to work around original limitations such as 2.2GB per contained file, but there is a published standard, and that's what should be consulted to determine what's allowable for a file contained in a ZIP archive and what options are available.
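As one example of what the published specification (PKWARE's APPNOTE) says on this: each entry header carries a two-byte general purpose bit flag, and in later revisions bit 11 is the "language encoding flag", which marks the entry's name and comment as UTF-8. Without it, the name is expected to be in the original IBM code page 437. Below is a rough sketch that checks that bit on the first local file header of an archive. ZipUtf8FlagSketch and firstEntryDeclaresUtf8 are hypothetical names, and the code only looks at the first entry.

```java
// Hypothetical helper class, not part of the JDK.
class ZipUtf8FlagSketch {

    // Bit 11 of the general purpose bit flag: names and comments are UTF-8.
    static final int UTF8_NAME_FLAG = 1 << 11;

    static boolean firstEntryDeclaresUtf8(byte[] zipBytes) {
        // A local file header starts with the signature bytes 'P' 'K' 3 4.
        if (zipBytes.length < 8
                || zipBytes[0] != 'P' || zipBytes[1] != 'K'
                || zipBytes[2] != 3 || zipBytes[3] != 4) {
            throw new IllegalArgumentException("not a ZIP local file header");
        }
        // The flag is a little-endian 16-bit value at offset 6,
        // right after the 2-byte "version needed to extract" field.
        int flags = (zipBytes[6] & 0xFF) | ((zipBytes[7] & 0xFF) << 8);
        return (flags & UTF8_NAME_FLAG) != 0;
    }
}
```

Whether a given tool honours the flag is a separate question; older tools may ignore it and decode every name with a local code page.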