Huffman Coding: A Case Study of a Comparison
A. Description - prerequisites

Huffman coding is the optimal code for a given probability distribution.

Assume that we have to use the Huffman algorithm to encode the following subset of the English letters, having the probabilities given below:

X = {a, b, c, d, e}
P(X) = {0.082, 0.015, 0.028, 0.043, 0.127}

According to the algorithm, we first arrange the symbols in descending order of transmission probability. The first column of the resulting table contains the symbols and the second column their probabilities. In the next step, the two symbols with the smallest probabilities are combined into one symbol, with a probability equal to the sum of their probabilities. Then we re-arrange the symbols, taking the joining of the last two into account. In the next step, the two symbols with the smallest probabilities are merged again and the remaining symbols are re-arranged. We repeat these merging and re-arranging steps until we finally end up with the merging of the last two remaining symbols, whose probability is the sum of the probabilities of the two. Starting from the last probability column, we assign the bits "0" and "1" to these two symbols, respectively. In the preceding probability column we likewise assign "0" and "1" next to the previously combined symbols, and so on. Finally, reading the assigned bits back from the last column towards the first gives the resulting code for each of the chosen letters.
D. Limitations

Although Huffman's algorithm is optimal for symbol-by-symbol coding with a known input probability distribution, it is not optimal when the symbol-by-symbol restriction is dropped, or when the probability mass functions are unknown, not identically distributed, or not independent. Other methods, such as arithmetic coding and LZW coding, often have better compression capability: both of these methods can combine an arbitrary number of symbols for more efficient coding, and both generally adapt to the actual input statistics, the latter of which is useful when input probabilities are not precisely known or vary significantly within the stream [6].

III. CASE STUDY

A. Description

In order to compare the results of the Huffman coding implementation between different types of human-produced texts, we have chosen three representative documents of human activity. The first is from the field of literature, the 1876 novel “The Adventures of Tom Sawyer” by Mark Twain [7]. The novel is clearly indicative of the folklore surrounding life on the Mississippi River around the time it was written. The second is a plain HTML document of the front page of the BBC news website, as of 08-11-2012. It represents modern spoken English in the western world. Finally, the last document is the C language source code implementation of the Huffman algorithm itself. It represents a strictly technical document, with a limited set of words.

B. Implementation

Two programs were used to apply Huffman coding to the three documents. The first, “letter_count.cpp”, accepts as input a file in text format and counts the exact number of letters and spaces present in the whole document. It also calculates the frequency of each symbol and writes these values to a text file named “freqs.txt”. The second, “huffman.c”, is an implementation of the Huffman algorithm; it outputs the compressed binary code in a text file and also calculates the percentage of memory saved by the use of the algorithm.
C. Results

The first document, “The Adventures of Tom Sawyer”, was found to have a total of 379,055 letters and spaces altogether. The uncompressed text would have been encoded using 552 original bits in total for all letters; after the application of Huffman coding, the total bits used are 322. This yields a memory saving of 58.33%.

The second document, from www.bbc.com, was found to have a total of 88,687 letters and spaces altogether. The uncompressed text would have been encoded using 824 original bits in total for all letters; after the application of Huffman coding, the total bits used are 623. This yields a memory saving of 75.61%.

The third document, “huffman.c”, was found to have a total of 3,367 letters and spaces altogether. The uncompressed text would have been encoded using 144 original bits in total for all letters; after the application of Huffman coding, the total bits used are 78. This yields a memory saving of 54.17%.
IV. CONCLUSION

According to the results of the application of Huffman coding to the three documents, we can say that the gain was greatest for the bbc.com HTML document. This may be a result of the frequent appearance in this document of certain tags, like <div>, <p>, etc. Also interesting is the fact that the amount of compression achieved on the other two documents is almost equal. Further research with more documents needs to be carried out in order to reach a safe conclusion.