
Huffman Coding: A Case Study of a Comparison Between Three Different Types of Documents

Konstantinos D. Pechlivanis

Abstract—We examine the results of applying Huffman coding to three documents of different styles: a novel from the 19th century, an HTML document from a modern news website, and a C language source code file.

Index Terms - Data compression, lossless source coding, entropy rate of a source, information theory

I. INTRODUCTION

Ever since the development of electronic means for the transmission and processing of information, there has been a need to reduce the volume of the data transmitted.

Data compression theory was formulated by Claude E. Shannon in his 1948 paper "A Mathematical Theory of Communication" [1]. Shannon proved that there is a fundamental limit to lossless data compression. This limit, called the entropy rate, is denoted by H. It is possible to compress the source in a lossless manner at a compression rate close to H, and it can be mathematically proven that it is impossible to achieve a better compression rate than H.
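For a memoryless source that emits symbol i with probability p_i, this limit is the Shannon entropy (a standard formula, quoted here for reference):

H = - \sum_i p_i \log_2 p_i

measured in bits per symbol.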
One important method of transmitting messages is to transmit sequences of symbols in their place [2]. For the best performance of communication systems, the most compact possible representation of the messages is sought, which is achieved by removing the redundancy inherent in them. This process is called source coding. More specifically, source coding is the process of converting the sequence of symbols generated by a source into symbol sequences of the code (usually binary sequences), so as to remove the redundancy and obtain a compressed representation of the messages.

Compression can be either lossy or lossless. Lossless compression reduces bits by identifying and eliminating statistical redundancy; no information is lost in lossless compression. Lossy compression reduces bits by identifying marginally important information and removing it [5].

Examples of lossless source coding methods are Shannon coding [1], Shannon–Fano coding [3], Shannon–Fano–Elias coding [4], and Huffman coding [2], the last of which can be proven to be optimal for a specific set of symbols with specific probabilities.

II. HUFFMAN CODING

A. Description - prerequisites

Huffman coding is the optimal code for a given probability of occurrence of the source symbols, and it can be obtained by means of a simple coding algorithm. It can be shown that no other algorithm can lead to the construction of a code with a smaller average codeword length for a given source alphabet. The outcome of this coding is a variable-length code table for encoding a source, where the table has been derived in a particular way, based on the estimated probability of occurrence of each possible value of the source symbol [6].

B. The algorithm

According to Huffman's algorithm, the binary encoding of the source symbols follows these steps (a code sketch of the procedure is given after the list):

1. The source symbols are arranged in decreasing order of probability.
2. The two source symbols with the lowest probabilities are combined into one symbol, whose probability equals the sum of the probabilities of the two, reducing the size of the source alphabet by one.
3. Steps 1 and 2 are repeated until the source alphabet consists of only two symbols. To these two symbols the binary digits 0 and 1 are assigned.
4. A "0" and a "1" are assigned to the one and the other, respectively, of the two symbols that were merged into one in step 2. This step applies to all mergers.
5. The codeword of each symbol is formed by all the bits "0" and "1" associated with that symbol (read from bottom to top), i.e. the digits assigned directly to it or to the merged symbols in which it took part.
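As a concrete illustration of these steps, the following minimal C sketch (an illustrative example, not the "huffman.c" program used in the case study of section III) repeatedly merges the two least-probable symbols into a tree node and then reads each codeword off the tree. The alphabet and probabilities are those of the example in section II.C below.

/* A minimal sketch of binary Huffman code construction following the steps
 * above.  Illustrative only, not the paper's "huffman.c": the two
 * least-probable symbols are merged repeatedly (steps 1-3), and each
 * codeword is then read off the resulting tree (steps 4-5). */
#include <stdio.h>
#include <string.h>

#define NSYM 5
#define MAXCODE 16               /* ample for a 5-symbol alphabet */

struct node {
    double p;                    /* probability of this (possibly merged) symbol */
    int left, right;             /* child indices, -1 for a leaf */
    char sym;                    /* source symbol, meaningful only for leaves */
};

/* Walk the tree, appending '0' on the left branch and '1' on the right,
 * and store the finished codeword when a leaf is reached. */
static void assign(const struct node *t, int i, char *prefix, int depth,
                   char codes[][MAXCODE])
{
    if (t[i].left < 0) {
        prefix[depth] = '\0';
        strcpy(codes[t[i].sym - 'a'], prefix);
        return;
    }
    prefix[depth] = '0';
    assign(t, t[i].left, prefix, depth + 1, codes);
    prefix[depth] = '1';
    assign(t, t[i].right, prefix, depth + 1, codes);
}

int main(void)
{
    const char   sym[NSYM] = { 'a', 'b', 'c', 'd', 'e' };
    const double p[NSYM]   = { 0.082, 0.015, 0.028, 0.043, 0.127 };

    struct node tree[2 * NSYM - 1];
    int active[2 * NSYM - 1];    /* indices of nodes not yet merged */
    int nactive = NSYM, nnodes = NSYM;
    char codes[NSYM][MAXCODE], prefix[MAXCODE];

    for (int i = 0; i < NSYM; i++) {
        tree[i].p = p[i];
        tree[i].left = tree[i].right = -1;
        tree[i].sym = sym[i];
        active[i] = i;
    }

    while (nactive > 1) {
        /* Find the two active nodes with the smallest probabilities
         * (steps 1 and 2); a linear scan is enough for a small alphabet. */
        int lo1 = 0, lo2 = 1;
        if (tree[active[1]].p < tree[active[0]].p) { lo1 = 1; lo2 = 0; }
        for (int i = 2; i < nactive; i++) {
            if (tree[active[i]].p < tree[active[lo1]].p) { lo2 = lo1; lo1 = i; }
            else if (tree[active[i]].p < tree[active[lo2]].p) { lo2 = i; }
        }

        /* Merge them into a new node whose probability is the sum. */
        tree[nnodes].p = tree[active[lo1]].p + tree[active[lo2]].p;
        tree[nnodes].left = active[lo1];
        tree[nnodes].right = active[lo2];
        tree[nnodes].sym = 0;

        /* Replace the merged pair by the combined node in the active list. */
        int a = lo1 < lo2 ? lo1 : lo2;
        int b = lo1 < lo2 ? lo2 : lo1;
        active[a] = nnodes++;
        active[b] = active[--nactive];
    }

    assign(tree, active[0], prefix, 0, codes);

    for (int i = 0; i < NSYM; i++)
        printf("%c  p = %.3f  code = %s\n", sym[i], p[i], codes[i]);
    return 0;
}

A linear scan is used here to find the two smallest probabilities at each step; for a large alphabet a priority queue would be preferable, but the scan mirrors steps 1 and 2 most directly.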
C. Example

The frequencies of the letters in the English language (according to Wikipedia) are used as the symbol probabilities in what follows.
Assume that we use the Huffman algorithm to encode the following subset of the English letters, with the above probabilities:

X = {a, b, c, d, e} and
P(X) = {0.082, 0.015, 0.028, 0.043, 0.127}

According to the algorithm, we first arrange the symbols in descending order of probability. The first column of the resulting table contains the symbols and the second column their probabilities. In the next step, the two symbols with the lowest probabilities are combined, with a probability equal to the sum of their probabilities. Then we re-arrange the symbols, taking the joining of the last two into account. In the next step, the symbols with the lowest probabilities are merged again and the remaining symbols are re-arranged. We repeat these merging and re-arranging steps with the symbols that are left until we finally end up with the merging of the last two symbols, whose probability equals the sum of their probabilities. Starting from the last probability column, we assign the symbols "0" and "1" to these two, respectively. In the preceding probability columns we also assign the symbols "0" and "1" next to the previously combined ones, and so on. Finally, the resulting code for each of the chosen letters is read off from these bit assignments; one possible assignment is shown below.
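One possible resulting assignment, the one produced for instance by the sketch given in section II.B, is the following (other valid Huffman codes for these probabilities differ only in the choice of "0" versus "1" at each merge; the codeword lengths are 1, 2, 3, 4 and 4 bits for e, a, d, c and b respectively):

Symbol  Probability  Codeword
e       0.127        0
a       0.082        10
d       0.043        111
c       0.028        1101
b       0.015        1100

As expected, the most probable of the five letters, e, receives the shortest codeword.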
D. Limitations

Although Huffman's algorithm is optimal for symbol-by-symbol coding with a known input probability distribution, it is not optimal when the symbol-by-symbol restriction is dropped, or when the probability mass functions are unknown, not identically distributed, or not independent. Other methods such as arithmetic coding and LZW coding often have better compression capability: both of these methods can combine an arbitrary number of symbols for more efficient coding, and both generally adapt to the actual input statistics, the latter of which is useful when input probabilities are not precisely known or vary significantly within the stream [6].

III. CASE STUDY

A. Description

In order to compare the results of the Huffman coding implementation between different types of human-produced texts, we have chosen three representative documents of human activity. The first one is from the field of literature: the 1876 novel "The Adventures of Tom Sawyer" by Mark Twain [7]. The novel is clearly indicative of the folklore surrounding life on the Mississippi River around the time it was written. The second is a plain HTML document of the front page of the BBC news channel website, as of 08-11-2012. It represents the modern English spoken in the western world. Finally, the last document is the C language source code implementation of the Huffman algorithm itself. It represents a strictly technical document with a limited set of words.

B. Implementation

Two programs were used in order to apply Huffman coding to the three documents. The first one, "letter_count.cpp", accepts as input a file in text format and counts the exact number of letters and spaces present in the whole document. It also calculates the frequency of each symbol and writes these values to a text file named "freqs.txt". The second, "huffman.c", is an implementation of the Huffman algorithm; its output is the compressed binary code in a text file, and it also calculates the percentage of memory saved by the use of the algorithm.
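The counting pass described above can be sketched as follows. This is an illustrative C reconstruction, not the original letter_count.cpp; the handling of upper-case letters, the character used to print the space, and the exact format of freqs.txt are assumptions.

/* Frequency-counting sketch in the spirit of the paper's letter_count.cpp
 * (hypothetical reconstruction, not the original program).  Counts letters
 * and spaces and writes their relative frequencies to "freqs.txt". */
#include <stdio.h>
#include <ctype.h>

int main(int argc, char *argv[])
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <input.txt>\n", argv[0]);
        return 1;
    }
    FILE *in = fopen(argv[1], "r");
    if (!in) { perror("fopen"); return 1; }

    long count[256] = {0}, total = 0;
    int ch;
    while ((ch = fgetc(in)) != EOF) {
        if (isalpha(ch) || ch == ' ') {      /* letters and spaces only */
            count[tolower(ch)]++;
            total++;
        }
    }
    fclose(in);

    if (total == 0) {
        fprintf(stderr, "no letters or spaces found\n");
        return 1;
    }

    FILE *out = fopen("freqs.txt", "w");
    if (!out) { perror("fopen"); return 1; }
    fprintf(out, "total symbols: %ld\n", total);
    for (int c = 0; c < 256; c++)
        if (count[c] > 0)                    /* space is printed as '_' */
            fprintf(out, "%c %.6f\n", c == ' ' ? '_' : c,
                    (double)count[c] / total);
    fclose(out);
    return 0;
}

Presumably huffman.c then builds its code from the frequencies recorded in freqs.txt, following the procedure outlined in section II.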
C. Results

The first document, "The Adventures of Tom Sawyer", was found to contain a total of 379,055 letters and spaces. The uncompressed text would have been encoded using 552 bits in total for all letters, and after the application of Huffman coding the total number of bits used is 322. This yields a memory saving of 58.33%.

The second document, from www.bbc.com, was found to contain a total of 88,687 letters and spaces. The uncompressed text would have been encoded using 824 bits in total for all letters, and after the application of Huffman coding the total number of bits used is 623. This yields a memory saving of 75.61%.

The third document, "huffman.c", was found to contain a total of 3,367 letters and spaces. The uncompressed text would have been encoded using 144 bits in total for all letters, and after the application of Huffman coding the total number of bits used is 78. This yields a memory saving of 54.17%.
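As an arithmetic check against the reported bit totals: 322/552 ≈ 0.5833, 623/824 ≈ 0.7561 and 78/144 ≈ 0.5417, which match the quoted percentages.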
IV. CONCLUSION

According to the results of applying Huffman coding to the three documents, we can say that the gain was greatest for the bbc.com HTML document. This may be a result of the frequent appearance in that document of certain tags such as <div>, <p>, etc. It is also interesting that the amount of compression achieved for the other two documents is almost equal. Further research, with more documents to examine, is needed in order to reach a safe conclusion.
