
Huffman Coding

What is Huffman Coding and its history


Problem
When you put English text into a computer, it saves every individual character as eight bits of
data: eight ones and zeros. A modern phone or computer might store a few quadrillion of them.
But every time your phone complains that its storage is full, you are running up against the same
question that computer scientists have been working on for decades: how do we reduce the bits
used to store things?

How computers compress text


Computers compress images and videos differently from text: they use lossy compression. This
means that even if an image or video comes out blurry and pixelated, you can still see its
content. For example, the image below is heavily compressed, but it still clearly represents a
carrot:

As you can see, it does not matter if you lose a little bit of detail.

Text, however, has to be losslessly compressed: if you lose a little bit of detail, the original
word can turn into a completely different one. So the first thing we need to know is how text is
stored on disk before it is compressed. On a modern computer, each English character takes up
exactly eight ones and zeros on disk: eight bits, or one byte. There are 256 possible combinations
of those ones and zeros, so there are 256 possible characters. That is enough for the English
alphabet, digits, some punctuation, and so on. If you want to know how long a string of text is,
you just count the bits and divide by eight. But if we try to shorten some of these codes, we run
into a problem: computers do not have the luxury of spaces between codewords. How can we solve
this? A first attempt is discussed below.

A (good?) approach

First, we assign the most common characters to the shortest arrangements of bits. For example,
the space character is generally used most often, so we give it the code "0". The second most
used character is lowercase "e", so we give it the code "1". Then we give lowercase "t" the code
"00", and so on.
But this approach has a problem. As the computer runs through the text, it has no way to know
whether "00" is a "t" or two spaces. Similarly, "000" could be a lowercase "n", a "t" followed by
a space, or three spaces. Because there are no gaps, the computer only sees a constant stream of
ones and zeros. Then, in 1952, David Huffman (hence the name Huffman coding) published an
algorithm that solves this problem.

Huffman Coding algorithm


We now discuss one of the best-known algorithms for lossless data compression. As mentioned
above, it is desirable for a code to have the prefix-free property: for any two symbols, the code
of one symbol should not be a prefix of the code of the other. Huffman’s algorithm is a greedy
approach to generating optimal prefix-free binary codes. Optimality refers to the property that no
other prefix-free code uses fewer bits per symbol, on average, than the code constructed by
Huffman’s algorithm.

Huffman’s algorithm is based on the idea that a variable-length code should use the shortest
codewords for the most likely symbols and the longest codewords for the least likely symbols. In
this way, the average code length is reduced. The algorithm assigns codewords to symbols by
constructing a binary coding tree. Each symbol of the alphabet is a leaf of the coding tree. The
code of a given symbol corresponds to the unique path from the root to that leaf, with 0 or 1
appended to the code for each edge along the path, depending on whether the left or right child of
a given node occurs next along the path.
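To make the idea of average code length concrete, here is a minimal sketch that evaluates it for
the four-symbol example worked through later in this document; the frequencies and codeword
lengths are taken from that example, not computed here:

#include <cstdio>

int main() {
    // Relative frequencies of a, b, c, d in the example "abbcccdddd".
    double p[] = {0.1, 0.2, 0.3, 0.4};
    // Huffman codeword lengths for a (110), b (111), c (10), d (0).
    int len[] = {3, 3, 2, 1};

    double avg = 0.0;
    for (int i = 0; i < 4; i++)
        avg += p[i] * len[i];   // expected bits per symbol

    // Prints 1.9: fewer than 2 bits per symbol on average,
    // versus 8 bits per symbol for fixed-length ASCII.
    printf("average code length = %.1f bits\n", avg);
    return 0;
}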

Main data structures and algorithms of Huffman coding

We discuss the algorithm first and the data structures afterwards.

Algorithms
For example, suppose we want to compress the line "abbcccdddd". Uncompressed, that is 10
characters taking up 80 bits.

First, we count how many times each character is used and put it in a list in order.

Character   a   b   c   d
Frequency   1   2   3   4
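As a minimal sketch of this counting step (the variable names here are illustrative):

#include <iostream>
#include <map>
#include <string>

int main() {
    std::string text = "abbcccdddd";
    std::map<char, int> freq;   // character -> occurrence count
    for (char ch : text)
        freq[ch]++;             // count every character

    // std::map keeps the characters in sorted order.
    for (const auto& entry : freq)
        std::cout << entry.first << ": " << entry.second << "\n";
    return 0;
}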

Now take the two least used characters. Those two are going to be the bottom branches of the
"Huffman tree". Write them down, with how often they are used next to them; that is called their
frequency. Then connect them together, one level up, with the sum of their frequencies. Now add
that new sum back into your list, wherever it sits, and repeat until there is only one node left.

You now have a Huffman tree, and it tells you how to convert your text into ones and zeros. From
the root, each time you take the left-hand branch, write a 0; and each time you take the
right-hand branch, write a 1, until you reach a leaf node. The written string is the code for that
leaf's character. So "d" will be "0", "c" will be "10", "a" will be "110", and "b" will be "111",
and "abbcccdddd" becomes 110 111 111 10 10 10 0 0 0 0 (spaces added for readability; the stored
stream has none). In a larger text, some rare letters may end up with codes longer than 8 bits,
but that is fine, because they are not used very often. You do also have to store the tree itself,
to provide a translation table between your new codes and the uncompressed text. We have
compressed the text down from 80 bits to only 19.

Uncompressing the resulting stream of bits works the other way: just read across, taking the left
branch every time you see a 0 and the right branch every time you see a 1. When you reach a leaf
node, write down that leaf node's character and start again from the root.
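A minimal decoding sketch, assuming the tree has already been built; the Node struct matches the
one introduced in the next section, and the function name decode is illustrative:

#include <string>

struct Node {
    char ch;
    int fr;
    Node *left, *right;
};

// Walk the tree bit by bit: 0 goes left, 1 goes right.
// Every time a leaf is reached, emit its character and restart at the root.
std::string decode(Node* root, const std::string& bits) {
    std::string out;
    Node* cur = root;
    for (char b : bits) {
        cur = (b == '0') ? cur->left : cur->right;
        if (cur->left == nullptr && cur->right == nullptr) {
            out += cur->ch;
            cur = root;
        }
    }
    return out;
}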

Data structure
Step 1: Create a leaf node for each unique character.

struct Node{
   char ch;             // the character stored at this node
   int fr;              // the character's frequency
   Node *left, *right;  // children; both null for a leaf
};

Step 2: Build a min heap of all leaf nodes. We use a priority_queue with a custom comparator:

priority_queue<Node*,vector<Node*>,comp> pq;
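The comparator comp is not defined in the text above; a minimal version (an assumption on our
part) that makes priority_queue, a max-heap by default, put the smallest frequency on top might
look like this:

struct comp {
    bool operator()(Node* a, Node* b) {
        // Greater-than ordering puts the node with the smallest frequency on top.
        return a->fr > b->fr;
    }
};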

Step 3: Extract two nodes with the minimum frequency from the min heap.

Node* left = pq.top(); pq.pop();
Node* right = pq.top(); pq.pop();

Step 4: Create a new internal node with a frequency equal to the sum of the two nodes'
frequencies. Make the first extracted node its left child and the other extracted node its right
child. Add this node to the min heap.
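In code, step 4 could look like the following sketch (using '\0' as a placeholder character for
internal nodes is our assumption; the text does not specify one):

// New internal node: frequency is the sum, children are the two extracted nodes.
Node* parent = new Node{'\0', left->fr + right->fr, left, right};
pq.push(parent);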

Step 5: Repeat steps 3 and 4 until the heap contains only one node. The remaining node is the
root, and the tree is complete.
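Putting steps 1 through 5 together, here is a sketch of the whole construction, plus the
root-to-leaf traversal that collects each character's codeword; it reuses the Node and comp
definitions above, and the function names are illustrative:

#include <map>
#include <queue>
#include <string>
#include <vector>

// Assumes the text contains at least two distinct characters.
Node* buildTree(const std::map<char, int>& freq) {
    std::priority_queue<Node*, std::vector<Node*>, comp> pq;
    for (const auto& entry : freq)                       // steps 1 and 2: leaves into the heap
        pq.push(new Node{entry.first, entry.second, nullptr, nullptr});
    while (pq.size() > 1) {                              // step 5: repeat until one node remains
        Node* left = pq.top(); pq.pop();                 // step 3: the two smallest nodes
        Node* right = pq.top(); pq.pop();
        pq.push(new Node{'\0', left->fr + right->fr, left, right});  // step 4
    }
    return pq.top();                                     // the root of the Huffman tree
}

// Append 0 for each left edge and 1 for each right edge; record codes at leaves.
void buildCodes(Node* n, const std::string& path, std::map<char, std::string>& codes) {
    if (n->left == nullptr && n->right == nullptr) {
        codes[n->ch] = path;
        return;
    }
    buildCodes(n->left, path + "0", codes);
    buildCodes(n->right, path + "1", codes);
}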

Time complexity and space complexity evaluation


Assume a text string of length n and an alphabet of k symbols.

Time complexity
As with Dijkstra’s algorithm for the single-source shortest paths problem in graphs, the running
time of Huffman’s algorithm depends on how the priority queue is implemented. Assuming that a
heap is used, each insert and extract operation requires O(log k) time, where k is the number of
elements in the priority queue. Since these operations are performed a constant number of times
in each iteration of the main loop, and since k - 1 iterations are carried out altogether, the
total running time is O(k log k). The initial construction of the heap may also easily be
completed in O(k log k) time, since it suffices to perform k individual insertions.

Space complexity evaluation


Space complexity is O(k) for the tree and O(n) for the decoded text.

Application
Prefix codes such as these remain in wide use because of their simplicity, high speed, and lack
of patent coverage. They are often used as a "back-end" to other compression methods. Deflate
(PKZIP's algorithm) and multimedia codecs such as JPEG and MP3 have a front-end model and
quantization followed by the use of prefix codes; these are often called "Huffman codes" even
though most applications use pre-defined variable-length codes rather than codes designed using
Huffman's algorithm.

