• Post Reply Bookmark Topic Watch Topic
  • New Topic
programming forums Java Mobile Certification Databases Caching Books Engineering Micro Controllers OS Languages Paradigms IDEs Build Tools Frameworks Application Servers Open Source This Site Careers Other Pie Elite all forums
this forum made possible by our volunteer staff, including ...
Marshals:
  • Campbell Ritchie
  • Tim Cooke
  • Liutauras Vilda
  • Jeanne Boyarsky
  • paul wheaton
Sheriffs:
  • Ron McLeod
  • Devaka Cooray
  • Henry Wong
Saloon Keepers:
  • Tim Holloway
  • Stephan van Hulst
  • Carey Brown
  • Tim Moores
  • Mikalai Zaikin
Bartenders:
  • Frits Walraven

'Perfect' optimized compression help!

 
Greenhorn
Posts: 10
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
This is my first post! Before I fill in my question, first let me say I started with Python when I was 8 and switched to Java when I was 11. I'm VERY interested in compression, physics simulations, vortices, cellular automaton, and other nerdy stuff. I go to a really advanced school. However, although I am a good programmer, I still have problems with some things like programming algebra and optimizing problems such as compression. I've been trying to solve this for several months and I've been growing desperate. So far in my file compression application I'm making, I have a working GUI, file reader and writer, and decompressor. I have a not-so working, about a third of the way done compressor. My problem needs a lot of explaining, so only help me if you have a lot of time.


Anyways, for the past three quarters of a year (Wow!) I've been working on a dictionary-based file compression application for Mac (but it is platform-independent, you just can double click on compressed files to decompress them) called Comprezzor. The compression method is completely my invention. Basically, it replaces repeating strings with numbers. I've already made it foolproof, with escape characters and everything, and making sure multi digit numbers work.


As you can see there is compression:

How much wood would a woodchuck chuck if a woodchuck could chuck wood?
versus
wood;chuck ;ould; a ;:How much 0 w23011if301c2 10?

The compressed file is roughly 71.4% of the uncompressed file in number of characters. This isn't much, but compression would increase with larger documents.

How it works is it finds every single substring of characters in the whole string. (up to a certain depth, so it doesn't take hours) It then uses a formula I came up with to find, assuming it is given a single digit number as a replacement. The substrings that save the most space are put in the dictionary first. Or should it be ordered by the ones that repeat the most and save space, because if they go farther down the list they'll get multi-digit numbers and cost more per replacement? And what if, say one repeats more than the other, so it should be 10th on the list and get 9, while the other has to get replaced with [multi-digit escape character]10[multi-digit escape character], and not save any space anymore? Should they be swapped, to make both save space?

Plus, say we have the compressible string "green green green green javagreenhorn javagreenhorn javagreenhorn".
You'll compress "green" and "javagreenhorn". However, "javagreenhorn" is larger and will save more space than "green" per replacement, so the compressed form, not including the dictionary, should look like

"0 0 0 0 1 1 1",
not
"0 0 0 0 java0horn java0horn java0horn"

So, that means that in the formula for how much space "green" saves, the number of repeats should be reduced because you should replace larger ones first, because they save more space. However, that might not apply to everything, because say the larger word is only a few characters larger, and it's multi digit, so it doesn't save any space, or saves less space than the smaller word, which is single digit. That would mean that the number of repititions depends on the number of repititions, which throws the program into an infinite loop! AGH!

So, so frustrating!
*throws computer out of window*

Please, I need serious help on a very complicated problem!
 
Author
Posts: 12617
IntelliJ IDE Ruby
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Dictionary-based compression is *extremely* well-documented--there are a million references and implementations freely available; I'd probably start by looking at existing algorithms for ideas on how to improve your own. The compression FAQ has a brief intro to LZ/LZW compression (think ZIP files) and describes the algorithm at a high level. For lower-level information just grab the source for anything that can make a ZIP file.

The compression Usenet group (comp.compression, although there are various sub-groups) is always a good place for both archival and current discussions and would be a better source of information than JavaRanch (IMO) because that's their purpose in life. I haven't lurked on that group for many years now, so I don't know how they are with relative newbies.

To answer your question ("how should I put things in the dictionary" etc.) I'd suggest running tests: implement your various ideas, run them on representative text files, and find out what gives you the best compression/performance ratio, or whatever your ultimate goals are--there's always a trade-off.
 
These are not the droids you are looking for. Perhaps I can interest you in a tiny ad?
Gift giving made easy with the permaculture playing cards
https://coderanch.com/t/777758/Gift-giving-easy-permaculture-playing
reply
    Bookmark Topic Watch Topic
  • New Topic