Win a copy of Mesos in Action this week in the Cloud/Virtualizaton forum!

# Huffman encoding/Decoding

Ayan Biswas
Ranch Hand
Posts: 104
I am writing code to implement Huffman encoding/decoding scheme.Till now I have counted the frequency of characters and stored them in a HashMap .Now I have to start building the huffman tree.Can somebody give me any idea on how to start doing so?

David O'Meara
Rancher
Posts: 13459
convert the HashMap to a TreeMap, if you refer to the datatype using the Map interface then this is a trivial change.
The TreeMap will then allow you to iterate the keys by their natural order and you can use this to get each part to build your final encoding.

Ayan Biswas
Ranch Hand
Posts: 104
@ David thanks for your reply.I guess if I use a TreeMap the caharcters will be sorted in the map according to their respective frequencies.But then how to start building the huffman tree?That is the major problem that I am facing at this moment?

David O'Meara
Rancher
Posts: 13459
To start, you count the frequency of each character eg Map<Character, Integer>
next (possibly) you'd start assembling these based on frequency into a binary tree.
You need a data type that can store a binary tree, then you would convert the initial Map into a TreeMap<Integer, List<BinaryTree>>
You need a List here in case more than one tree has the same frequency.
while the size is greater than one, pop the top two items from the TreeMap, combine and put them back into the TreeMap

Campbell Ritchie
Sheriff
Posts: 48981
60
You could design your own Tree Set which uses a Comparator working on frequencies. Put an object incorporating frequency and a char into it; if you get a collision, you can add the char to a List<Character> associated with the frequency.

Campbell Ritchie
Sheriff
Posts: 48981
60
Extract all the K, V pairs from the Map, put them into a List, and sort on frequency??

Ayan Biswas
Ranch Hand
Posts: 104
Here's what I did..created an abstract Node class and extended it in Leaf and NLeaf class(non-leaf).And the inserted the nodes in TreeSet.So,I was succesful in creating the nodes and sort them but now I'm again stuck with the encoding portion.I was trying to pop the top 2 nodes from treeset and encode them.but by doing so, the top 2 nodes will get 0 and 1 and so on...but it should be like ,the root getting 0 and so on..What shall i do.I'm posting some portions of the code here..
Node class

Leaf class

Nleaf class

part of the main class

Ayan Biswas
Ranch Hand
Posts: 104

Rob Spoor
Sheriff
Posts: 20533
54

Campbell Ritchie
Sheriff
Posts: 48981
60
How do you know whether a Node is a leaf or not when you first create it?

Ayan Biswas
Ranch Hand
Posts: 104

How do you know whether a Node is a leaf or not when you first create it?

Suppose there are three distinct characters in a fille a,b&c.These 3 characters along with their frequencies make the leaves.Now, suppose I pop a&b (say ,their frequencies are least) and the combine them (i.e add their frequencies and ascii values) and create the first non-leaf node and put it in the TreeSet<Node> again.By ,this way i proceed .
In this for loop in the main method I create all the leaf nodes(with its frequency,ascii code and sequence no.).Sequence no. was created to design the compareTo method efficiently.

Ayan Biswas
Ranch Hand
Posts: 104
I have fixed it. Its giving me the codes properly.Now,I have to create the encoded file where I have to replace each character with its code.Will give it a try,now.

Ayan Biswas
Ranch Hand
Posts: 104
In my program to implement huffman algorithm.I have created the huffman codes and stored the ascii values and corresponding codes in a map.While creating the encoded file I followed this approach:I firstly converted the ascii codes in to their corresponding integer values and wrote them in a file.Now ,while decoding I read each byte from that file and converted it into its binary equivalent and searched for their ascii values in the map.This approach is creating the following problem.Since,01,001,1 all have integer equivalent as 1,so when I convert the codes into integer it no longer remains unique.I have ran out of ideas. And I failed to store the actual huffman tree.So I could not follow the decoding scheme where I can traverse the tree a file the codes. Help!

Henry Wong
author
Marshal
Posts: 21127
78
Ayan Biswas wrote::I firstly converted the ascii codes in to their corresponding integer values and wrote them in a file

Please elaborate what you mean here -- ASCII codes are already a number. There is no conversion to "corresponding integer".

Ayan Biswas wrote:Now ,while decoding I read each byte from that file and converted it into its binary equivalent and searched for their ascii values in the map.This approach is creating the following problem.Since,01,001,1 all have integer equivalent as 1,so when I convert the codes into integer it no longer remains unique.

Again, please elaborate.... because with numbers a one is a one is a one -- regardless of how many zeros are in front of it. There is no difference between a ASCII code one and a ASCII code zero one.

Henry

Ayan Biswas
Ranch Hand
Posts: 104
Sorry for not being precise with the query
Ayan Biswas wrote:
:I firstly converted the ascii codes in to their corresponding integer values and wrote them in a file

I meant I converted the huffman codes(which are of type string in my case) of the characters(actually I stored the ascii of the characters ..not the character itself in the tree map) into corresponding integers.so,say huffman code for 'a' is 101 .then the integer will be 5.
Ayan Biswas wrote:
Now ,while decoding I read each byte from that file and converted it into its binary equivalent and searched for their ascii values in the map.This approach is creating the following problem.Since,01,001,1 all have integer equivalent as 1,so when I convert the codes into integer it no longer remains unique.

Again,here I meant huffman codes.Say huffman code for 'b' is 010 and for 'c' it's 10.So,when I convert 010 and 10 to integer both gives 2.

Henry Wong
author
Marshal
Posts: 21127
78
Oh, I see what you mean.... Short answer. Don't do that. Either stay with using strings, or create a new type that tracks both the number *and* the number of leading zeros.

ASCII codes are numbers. There is no concept of how many leading zeros in a number. So the conversion from Huffman code to the number is a lossy conversion. And there is no way to guarantee to convert it back correctly.

Henry

Ayan Biswas
Ranch Hand
Posts: 104
@Henry thanks for the idea.But using the codes in string format wont be much helpful.because firstly the encoded file will be much larger in size and while decoding (say10011) how will I know to search for '1' or 10 or 100 ..But I will try to use the ascii codes and their leading 0's in another Map and give it a try.

Henry Wong
author
Marshal
Posts: 21127
78
Ayan Biswas wrote:@Henry thanks for the idea.But using the codes in string format wont be much helpful.because firstly the encoded file will be much larger in size and while decoding (say10011) how will I know to search for '1' or 10 or 100 ..But I will try to use the ascii codes and their leading 0's in another Map and give it a try.

Oh... I thought you were referring to the mapping -- not the encoded output.

Basically, you can try to do this for the encoded file. Say you have five hoffman codes:

001 101 01 100 110

Instead of storing in as five ascii characters with the binary 00000001 00000101 00000001 00000100 00000110. You need to mash the bits together, and store it as 00110101 10011000.

This not only gives you much better compression than what you have now. But you don't have to worry about the leading zero problem. And hoffman codes don't have a trailing zero problem based on how it is designed.

Henry

Ayan Biswas
Ranch Hand
Posts: 104
@ Henry here is what i am doing.say a=101,b=11,c=111,I converetd all these strings into integer so,a=5,b=3,c=7.Now,suppose the file to be encoded has abbcab.so,the encoded output will be 533753.Now,while decoding I reconstruct the strings from the integers.so 5 becomes 101 and so on.Then I search for that string in the map and place the corresponding character in the decoded file.
But the problem is ,suppose i have d=011,whose integer value is 3 ,same as b's.So,codes are no longer unique.this is the problem that i am facing.

here's my code to decode the file

Henry Wong
author
Marshal
Posts: 21127
78
Ayan Biswas wrote:
But the problem is ,suppose i have d=011,whose integer value is 3 ,same as b's.So,codes are no longer unique.this is the problem that i am facing.

Read my last post again -- if you mash the bits together, you can tell the difference. Mainly...

a b --> 10111000

a d --> 10101100

There is a difference.

And BTW, you can't have b=11 while c=111. Huffman codes don't work that way.

Henry

Ayan Biswas
Ranch Hand
Posts: 104
I have solved the problem of decoding but now I facing the problem of compressing a file using Huffman algorithm Here is my modified code for decoding.Here I am using the huffman tree to decode the encoded file

But the problem I am facing now is that my huffman codes are String type.So if the code for a is 101 then its actually taking 3 bytes to store the code.So,the encoded file size is larger than the actual file!What am I going to do?What is the idea to read and write bits from a file in java?

Campbell Ritchie
Sheriff
Posts: 48981
60
You can try parsing the "101" String with 2 as the radix to get 5, but you are probably better off getting the codes into numbers initially.

By the way: if I remember correctly, each character in a String is stored as a char, so a String like "101" would require 6 bytes, not 3 (plus pointers to the Class<String> object, etc etc).

Ayan Biswas
Ranch Hand
Posts: 104
You can try parsing the "101" String with 2 as the radix to get 5, but you are probably better off getting the codes into numbers initially.

By the way: if I remember correctly, each character in a String is stored as a char, so a String like "101" would require 6 bytes, not 3 (plus pointers to the Class<String> object, etc etc).

I had tried the approach of converting the codes into numbers initially,but it did not work for me because I was not able to save the uniqueness of the codes so i tried the tree traversal approach.it worked but I failed with the compression of data.Any other way to do it?
And the encoded file size is coming as the no.of 0's and 1's in the file.So,if the file consists 10101 the size=5 bytes not 10.

Campbell Ritchie
Sheriff
Posts: 48981
60
Bear has earlier showed you how to compress your file by putting different combinations of bits together. If you are using Strings, then 10101 is at least 10 bytes, if you get it into a number, then 10101 is 5 bits.

Henry Wong
author
Marshal
Posts: 21127
78
Campbell Ritchie wrote:Bear has earlier showed you how to compress your file by putting different combinations of bits together. If you are using Strings, then 10101 is at least 10 bytes, if you get it into a number, then 10101 is 5 bits.

Agreed. Not only is it a difference of 10 bytes versus 5 bits per character -- but if you further mash the bits together across many characters, you won't have the "can't tell difference between 10 and 010" that you have been encountering either.

Henry

Ayan Biswas
Ranch Hand
Posts: 104
So,basically what you are saying is that I shall read 8 bytes of 0's and 1's and then convert it to integer and store in the encoded file?
Mashing of the bits seems a bit difficult to implement.Say,a="101" b="10" now while reading the file i first read a and find the code 101 now since this is less than 8 chars i read the next byte ,and that is 10,so i concatenate it with 101,and continue this until string length is less or equals eight.Now,suppose while reading the next byte the string length becomes more than 8 ,so what I do is that i store the extraneous bits in another string and read the next character.Is this the way forward?This kind of seems messy to me.

Ayan Biswas
Ranch Hand
Posts: 104
here is the problem i am facing now.I have a file consist of a b c d.codes a=100,b=101,c=110,d=111 and space=0.Now,the encoded file is 100010101100111.so ,I take 8 bytes(of 0 and 1) from the encoded file convert it into integer and store it in another file(say auxfile.txt) so, auxfile.txt contains 138 and 103.Now while decoding I first convert 138 and 103 into its bainary equivalent.But to my horror ,while reading the first byte(from auxfile.txt) I get 63 and not 138!The next byte is coming properly as 103.Please help
Here is the code where I am converting strings of 0 ,1 into numbers.

Here is my code where I am trying to re-convert numbers into string of 0 and 1

Ayan Biswas
Ranch Hand
Posts: 104
I have fixed it now.