This is my first post! Before I fill in my question, first let me say I started with Python when I was 8 and switched to
Java when I was 11. I'm VERY interested in compression, physics simulations, vortices, cellular automaton, and other nerdy stuff. I go to a really advanced school. However, although I am a good programmer, I still have problems with some things like programming algebra and optimizing problems such as compression. I've been trying to solve this for several months and I've been growing desperate. So far in my file compression application I'm making, I have a working GUI, file reader and writer, and decompressor. I have a not-so working, about a third of the way done compressor. My problem needs a lot of explaining, so only help me if you have a lot of time.
Anyways, for the past three quarters of a year (Wow!) I've been working on a dictionary-based file compression application for Mac (but it is platform-independent, you just can double click on compressed files to decompress them) called Comprezzor. The compression method is completely my invention. Basically, it replaces repeating strings with numbers. I've already made it foolproof, with escape characters and everything, and making sure multi digit numbers work.
As you can see there is compression:
How much wood would a woodchuck chuck if a woodchuck could chuck wood?
versus
wood;chuck ;ould; a ;:How much 0 w23011if301c2 10?
The compressed file is roughly 71.4% of the uncompressed file in number of characters. This isn't much, but compression would increase with larger documents.
How it works is it finds every single substring of characters in the whole
string. (up to a certain depth, so it doesn't take hours) It then uses a formula I came up with to find, assuming it is given a single digit number as a replacement. The substrings that save the most space are put in the dictionary first. Or should it be ordered by the ones that repeat the most and save space, because if they go farther down the list they'll get multi-digit numbers and cost more per replacement? And what if, say one repeats more than the other, so it should be 10th on the list and get 9, while the other has to get replaced with [multi-digit escape character]10[multi-digit escape character], and not save any space anymore? Should they be swapped, to make both save space?
Plus, say we have the compressible string "green green green green javagreenhorn javagreenhorn javagreenhorn".
You'll compress "green" and "javagreenhorn". However, "javagreenhorn" is larger and will save more space than "green" per replacement, so the compressed form, not including the dictionary, should look like
"0 0 0 0 1 1 1",
not
"0 0 0 0 java0horn java0horn java0horn"
So, that means that in the formula for how much space "green" saves, the number of repeats should be reduced because
you should replace larger ones first, because they save more space. However, that might not apply to everything, because say the larger
word is only a few characters larger, and it's multi digit, so it doesn't save any space, or saves less space than the smaller word, which is single digit. That would mean that the number of repititions depends on the number of repititions, which throws the program into an infinite loop! AGH!
So, so frustrating!
*throws computer out of window*
Please, I need serious help on a very complicated problem!