
Huffman Coding: A Case Study of a Comparison Between Three Different Types of Documents

Konstantinos D. Pechlivanis

Abstract—We examine the results of applying Huffman coding to three documents of different styles: a novel from the 19th century, an HTML document from a modern news website, and a C language source code file.

Index Terms - Data compression, lossless source coding, entropy rate of a source, information theory

I. INTRODUCTION

Ever since the development of electronic means for the transmission and processing of information, there has been a need to reduce the volume of the data transmitted.

Data compression theory was formulated by Claude E. Shannon in his 1948 paper "A Mathematical Theory of Communication" [1]. Shannon proved that there is a fundamental limit to lossless data compression. This limit, called the entropy rate, is denoted by H. It is possible to compress the source in a lossless manner at a compression rate close to H, and it can be mathematically proven that it is impossible to achieve a better compression rate than H.
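For a memoryless source that emits symbol i with probability p_i, this limit is the Shannon entropy (a standard formula, quoted here for reference):

H = - \sum_i p_i \log_2 p_i

measured in bits per symbol.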
One important method of transmitting messages is to transmit sequences of symbols in their place [2]. For the best performance of communication systems, the most compact possible representation of the messages is sought, which is achieved by removing the redundancy inherent in them. This process is called source coding. More specifically, source coding is the process of converting the sequence of symbols generated by a source into symbol sequences of the code (usually binary sequences), so as to remove the redundancy and obtain a compressed representation of the messages.

Compression can be either lossy or lossless. Lossless compression reduces bits by identifying and eliminating statistical redundancy; no information is lost in lossless compression. Lossy compression reduces bits by identifying marginally important information and removing it [5].

Examples of lossless source coding methods are Shannon coding [1], Shannon–Fano coding [3], Shannon–Fano–Elias coding [4], and Huffman coding [2], the last of which can be proven to be optimal for a specific set of symbols with specific probabilities.

II. HUFFMAN CODING

A. Description - prerequisites

Huffman coding is the optimal code for a given probability of occurrence of the source symbols, and it can be obtained by means of a simple coding algorithm. It can be shown that no other algorithm can lead to the construction of a code with a smaller average codeword length for a given source alphabet. The outcome of this coding is a variable-length code table for encoding a source, where the table has been derived in a particular way, based on the estimated probability of occurrence of each possible value of the source symbol [6].

B. The algorithm

According to Huffman's algorithm, the binary encoding of the source symbols follows these steps (a code sketch of the procedure is given after the list):

1. The source symbols are arranged in decreasing order of probability.
2. The two source symbols with the lowest probabilities are combined into one symbol, whose probability equals the sum of the probabilities of the two, reducing the size of the source alphabet by one.
3. Steps 1 and 2 are repeated until the source alphabet consists of only two symbols. To these two symbols the binary digits 0 and 1 are assigned.
4. A "0" and a "1" are assigned to the one and the other, respectively, of the two symbols that were merged into one in step 2. This step applies to all mergers.
5. The codeword of each symbol is formed by all the bits "0" and "1" associated with that symbol (read from bottom to top), i.e. the digits assigned directly to it or to the merged symbols in which it took part.
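As a concrete illustration of these steps, the following minimal C sketch (an illustrative example, not the "huffman.c" program used in the case study of section III) repeatedly merges the two least-probable symbols into a tree node and then reads each codeword off the tree. The alphabet and probabilities are those of the example in section II.C below.

/* A minimal sketch of binary Huffman code construction following the steps
 * above.  Illustrative only, not the paper's "huffman.c": the two
 * least-probable symbols are merged repeatedly (steps 1-3), and each
 * codeword is then read off the resulting tree (steps 4-5). */
#include <stdio.h>
#include <string.h>

#define NSYM 5
#define MAXCODE 16               /* ample for a 5-symbol alphabet */

struct node {
    double p;                    /* probability of this (possibly merged) symbol */
    int left, right;             /* child indices, -1 for a leaf */
    char sym;                    /* source symbol, meaningful only for leaves */
};

/* Walk the tree, appending '0' on the left branch and '1' on the right,
 * and store the finished codeword when a leaf is reached. */
static void assign(const struct node *t, int i, char *prefix, int depth,
                   char codes[][MAXCODE])
{
    if (t[i].left < 0) {
        prefix[depth] = '\0';
        strcpy(codes[t[i].sym - 'a'], prefix);
        return;
    }
    prefix[depth] = '0';
    assign(t, t[i].left, prefix, depth + 1, codes);
    prefix[depth] = '1';
    assign(t, t[i].right, prefix, depth + 1, codes);
}

int main(void)
{
    const char   sym[NSYM] = { 'a', 'b', 'c', 'd', 'e' };
    const double p[NSYM]   = { 0.082, 0.015, 0.028, 0.043, 0.127 };

    struct node tree[2 * NSYM - 1];
    int active[2 * NSYM - 1];    /* indices of nodes not yet merged */
    int nactive = NSYM, nnodes = NSYM;
    char codes[NSYM][MAXCODE], prefix[MAXCODE];

    for (int i = 0; i < NSYM; i++) {
        tree[i].p = p[i];
        tree[i].left = tree[i].right = -1;
        tree[i].sym = sym[i];
        active[i] = i;
    }

    while (nactive > 1) {
        /* Find the two active nodes with the smallest probabilities
         * (steps 1 and 2); a linear scan is enough for a small alphabet. */
        int lo1 = 0, lo2 = 1;
        if (tree[active[1]].p < tree[active[0]].p) { lo1 = 1; lo2 = 0; }
        for (int i = 2; i < nactive; i++) {
            if (tree[active[i]].p < tree[active[lo1]].p) { lo2 = lo1; lo1 = i; }
            else if (tree[active[i]].p < tree[active[lo2]].p) { lo2 = i; }
        }

        /* Merge them into a new node whose probability is the sum. */
        tree[nnodes].p = tree[active[lo1]].p + tree[active[lo2]].p;
        tree[nnodes].left = active[lo1];
        tree[nnodes].right = active[lo2];
        tree[nnodes].sym = 0;

        /* Replace the merged pair by the combined node in the active list. */
        int a = lo1 < lo2 ? lo1 : lo2;
        int b = lo1 < lo2 ? lo2 : lo1;
        active[a] = nnodes++;
        active[b] = active[--nactive];
    }

    assign(tree, active[0], prefix, 0, codes);

    for (int i = 0; i < NSYM; i++)
        printf("%c  p = %.3f  code = %s\n", sym[i], p[i], codes[i]);
    return 0;
}

A linear scan is used here to find the two smallest probabilities at each step; for a large alphabet a priority queue would be preferable, but the scan mirrors steps 1 and 2 most directly.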
C. Example

The frequencies of the letters in the English language (according to Wikipedia) are used as the symbol probabilities in what follows.
Assume that we use the Huffman algorithm to encode the following subset of the English letters, with the above probabilities:

X = {a, b, c, d, e} and
P(X) = {0.082, 0.015, 0.028, 0.043, 0.127}

According to the algorithm, we first arrange the symbols in descending order of probability. The first column of the resulting table contains the symbols and the second column their probabilities. In the next step, the two symbols with the lowest probabilities are combined, with a probability equal to the sum of their probabilities. Then we re-arrange the symbols, taking the joining of the last two into account. In the next step, the symbols with the lowest probabilities are merged again and the remaining symbols are re-arranged. We repeat these merging and re-arranging steps with the symbols that are left until we finally end up with the merging of the last two symbols, whose probability equals the sum of their probabilities. Starting from the last probability column, we assign the symbols "0" and "1" to these two, respectively. In the preceding probability columns we also assign the symbols "0" and "1" next to the previously combined ones, and so on. Finally, the resulting code for each of the chosen letters is read off from these bit assignments; one possible assignment is shown below.
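One possible resulting assignment, the one produced for instance by the sketch given in section II.B, is the following (other valid Huffman codes for these probabilities differ only in the choice of "0" versus "1" at each merge; the codeword lengths are 1, 2, 3, 4 and 4 bits for e, a, d, c and b respectively):

Symbol  Probability  Codeword
e       0.127        0
a       0.082        10
d       0.043        111
c       0.028        1101
b       0.015        1100

As expected, the most probable of the five letters, e, receives the shortest codeword.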
D. Limitations

Although Huffman's algorithm is optimal for symbol-by-symbol coding with a known input probability distribution, it is not optimal when the symbol-by-symbol restriction is dropped, or when the probability mass functions are unknown, not identically distributed, or not independent. Other methods such as arithmetic coding and LZW coding often have better compression capability: both of these methods can combine an arbitrary number of symbols for more efficient coding, and both generally adapt to the actual input statistics, the latter of which is useful when input probabilities are not precisely known or vary significantly within the stream [6].

III. CASE STUDY

A. Description

In order to compare the results of the Huffman coding implementation between different types of human-produced texts, we have chosen three representative documents of human activity. The first one is from the field of literature: the 1876 novel "The Adventures of Tom Sawyer" by Mark Twain [7]. The novel is clearly indicative of the folklore surrounding life on the Mississippi River around the time it was written. The second is a plain HTML document of the front page of the BBC news channel website, as of 08-11-2012. It represents the modern English spoken in the western world. Finally, the last document is the C language source code implementation of the Huffman algorithm itself. It represents a strictly technical document with a limited set of words.

B. Implementation

Two programs were used in order to apply Huffman coding to the three documents. The first one, "letter_count.cpp", accepts as input a file in text format and counts the exact number of letters and spaces present in the whole document. It also calculates the frequency of each symbol and writes these values to a text file named "freqs.txt". The second, "huffman.c", is an implementation of the Huffman algorithm; its output is the compressed binary code in a text file, and it also calculates the percentage of memory saved by the use of the algorithm.
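The counting pass described above can be sketched as follows. This is an illustrative C reconstruction, not the original letter_count.cpp; the handling of upper-case letters, the character used to print the space, and the exact format of freqs.txt are assumptions.

/* Frequency-counting sketch in the spirit of the paper's letter_count.cpp
 * (hypothetical reconstruction, not the original program).  Counts letters
 * and spaces and writes their relative frequencies to "freqs.txt". */
#include <stdio.h>
#include <ctype.h>

int main(int argc, char *argv[])
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <input.txt>\n", argv[0]);
        return 1;
    }
    FILE *in = fopen(argv[1], "r");
    if (!in) { perror("fopen"); return 1; }

    long count[256] = {0}, total = 0;
    int ch;
    while ((ch = fgetc(in)) != EOF) {
        if (isalpha(ch) || ch == ' ') {      /* letters and spaces only */
            count[tolower(ch)]++;
            total++;
        }
    }
    fclose(in);

    if (total == 0) {
        fprintf(stderr, "no letters or spaces found\n");
        return 1;
    }

    FILE *out = fopen("freqs.txt", "w");
    if (!out) { perror("fopen"); return 1; }
    fprintf(out, "total symbols: %ld\n", total);
    for (int c = 0; c < 256; c++)
        if (count[c] > 0)                    /* space is printed as '_' */
            fprintf(out, "%c %.6f\n", c == ' ' ? '_' : c,
                    (double)count[c] / total);
    fclose(out);
    return 0;
}

Presumably huffman.c then builds its code from the frequencies recorded in freqs.txt, following the procedure outlined in section II.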
C. Results

The first document, "The Adventures of Tom Sawyer", was found to contain a total of 379,055 letters and spaces. The uncompressed text would have been encoded using 552 bits in total for all letters, and after the application of Huffman coding the total number of bits used is 322. This yields a memory saving of 58.33%.

The second document, from www.bbc.com, was found to contain a total of 88,687 letters and spaces. The uncompressed text would have been encoded using 824 bits in total for all letters, and after the application of Huffman coding the total number of bits used is 623. This yields a memory saving of 75.61%.

The third document, "huffman.c", was found to contain a total of 3,367 letters and spaces. The uncompressed text would have been encoded using 144 bits in total for all letters, and after the application of Huffman coding the total number of bits used is 78. This yields a memory saving of 54.17%.
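As an arithmetic check against the reported bit totals: 322/552 ≈ 0.5833, 623/824 ≈ 0.7561 and 78/144 ≈ 0.5417, which match the quoted percentages.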
IV. CONCLUSION

According to the results of applying Huffman coding to the three documents, we can say that the gain was greatest for the bbc.com HTML document. This may be a result of the frequent appearance in that document of certain tags such as <div>, <p>, etc. It is also interesting that the amount of compression achieved for the other two documents is almost equal. Further research, with more documents to examine, is needed in order to reach a safe conclusion.
