Data Compression Unit-2
Huffman Coding:
• This technique was developed by David Huffman as part of a class assignment.
• The class was the first ever in the area of information theory and was taught by Robert Fano at MIT.
• The codes generated using this procedure are called Huffman codes.
• These codes are prefix codes and are optimum for a given model (set of probabilities).
• It is a widely used algorithm for lossless data encoding.
The Huffman procedure is based on two observations regarding optimum prefix codes:
1. In an optimum code, symbols that occur more frequently (have a higher probability of occurrence) will have shorter codewords than symbols that occur less frequently.
2. In an optimum code, the two symbols that occur least frequently will have codewords of the same length.
Example: Design a Huffman code for the letters given below.

Letter   Probability
a1       0.2
a2       0.4
a3       0.2
a4       0.1
a5       0.1
Step 1: Arrange the letters in descending order of probability:

Letter   Probability
a2       0.4
a1       0.2
a3       0.2
a4       0.1
a5       0.1

Step 2: Repeatedly combine the two entries with the lowest probabilities into a single node whose probability is their sum, re-sort the list, and continue until a single node of probability 1 remains. At every combination, label one branch 0 and the other 1; a letter's codeword is the sequence of labels on the path from the root down to that letter. One sequence of combinations, consistent with the codeword table below, is:

• a4 (0.1) + a5 (0.1) → a' (0.2)
• a1 (0.2) + a3 (0.2) → a'' (0.4)
• a2 (0.4) + a' (0.2) → a''' (0.6)
• a''' (0.6) + a'' (0.4) → root (1.0)

Reading the branch labels from the root to each leaf gives:

Letter   Probability   Codeword
a1       0.2           10
a2       0.4           00
a3       0.2           11
a4       0.1           010
a5       0.1           011
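The merging procedure above can be sketched in Python. This is a minimal illustration, not part of the original example: the heap-based construction and function name are ours, and ties among equal probabilities may be broken differently than in the tables above, but any Huffman code for this source has the same average length.

```python
import heapq
from itertools import count

def huffman_code(probs):
    """Build a Huffman code {symbol: bitstring} by repeatedly merging
    the two least probable subtrees, as in Step 2 above."""
    tiebreak = count()  # keeps tuple comparison away from the dicts
    heap = [(p, next(tiebreak), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)   # least probable subtree
        p1, _, c1 = heapq.heappop(heap)   # second least probable subtree
        merged = {s: "0" + w for s, w in c0.items()}   # 0 branch
        merged.update({s: "1" + w for s, w in c1.items()})  # 1 branch
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]

probs = {"a1": 0.2, "a2": 0.4, "a3": 0.2, "a4": 0.1, "a5": 0.1}
code = huffman_code(probs)
avg = sum(probs[s] * len(code[s]) for s in probs)
print(code)
print(avg)  # average length is approximately 2.2 bits/symbol for this source
```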
Minimum Variance Huffman Coding:
When several nodes tie for the lowest probability, combining the candidates that sit highest in the sorted list yields a Huffman code whose codeword lengths are as nearly equal as possible; the codewords above form such a minimum variance code. The average length is

l_avg = 0.4(2) + 0.2(2) + 0.2(2) + 0.1(3) + 0.1(3) = 2.2 bits/symbol.

Adaptive Huffman Coding:
• initial_code assigns symbols some initially agreed-upon codes, without any prior knowledge of the frequency counts.
• The encoder and decoder must use exactly the same initial_code and update_tree routines.
• The tree must always maintain its sibling property, i.e., all nodes (internal and leaf) are arranged in order of increasing counts.
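The sibling property can be checked mechanically: list every node's count in the fixed node-number order and verify that the counts never decrease. A small sketch (the function name and the sample count lists are ours):

```python
def sibling_property_holds(counts):
    """True if the node counts, listed in node-number order
    (leaves and internal nodes interleaved), are non-decreasing."""
    return all(a <= b for a, b in zip(counts, counts[1:]))

print(sibling_property_holds([0, 1, 1, 2, 2, 4]))  # True: counts never decrease
print(sibling_property_holds([0, 2, 1, 3]))        # False: 1 after 2 violates the order
```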
Ex: Update the tree for the message "aardvark" by Adaptive Huffman coding.
Sol:
The symbols "a", "a", "r", and "d" are inserted in turn: a repeated symbol simply has its count (and the counts on its path to the root) incremented, while a new symbol replaces the NYT node with an internal node whose children are the new leaf and the NYT node.
(Tree diagrams for "aa", "aar", and "aard": node numbers 45-51 with their counts; leaf a reaches count 2, and leaves r and d enter with count 1.)
4. The next symbol "v" is encountered for the first time, so the NYT node is split again, giving the tree for "aardv".
(Tree diagram: root count 5, node 51; new leaf v, node 44, count 1.)
At level-1 and level-2 the count of the left child is greater than that of the right child, so as per the tree's sibling property the nodes need to be swapped.
5. After swapping at level-1 and level-2, the tree for "aardv" satisfies the sibling property again.
(Tree diagram: root count 5, node 51; leaf a, count 2, now on the left at level-1; leaf r, count 1, on the left at level-2.)
6. The next symbol "a" has been seen before, so its count is simply incremented; the tree now represents "aardva".
(Tree diagram: root count 6; leaf a count 3.)
7. The next symbol "r" has been seen before, so its count is incremented; the tree now represents "aardvar".
(Tree diagram: root count 7; leaf r count 2.)
8. The next symbol "k" is encountered for the first time, so the NYT node is split once more; the tree now represents "aardvark".
(Tree diagrams: root count 8, node 51; new leaf k enters beside the NYT node. At level-3 the left child's count exceeds the right child's, so the level-3 nodes must be swapped.)
Ex: Encode the message "aardvark" by Adaptive Huffman coding, with e = 4 and r = 10 (alphabet size m = 2^e + r = 26).
Sol:
The first symbol "a" is new, with k = 1. Since 1 <= k <= 2r, case 1 applies: send (k-1) = 0 in (e+1) = 5 bits, i.e. 00000.
The next symbol "a" has been encountered earlier, so only its code is sent: the path from the root to the arrival symbol "a", which is the single bit 1.
(Tree: root, node 51, with children NYT, node 49, count 0, and leaf a, node 50, count 2.)
Again the next symbol is "r", and it is encountered for the first time. For symbol "r", k = 18, and we have e = 4 and r = 10. Since 1 <= k <= 2r, case 1 occurs: send (k-1) in (e+1) bits.
(k-1) = (18-1) = 17 and (e+1) = (4+1) = 5 bits.
The representation of 17 in 5 bits is 10001.
The code is NYT (0) + fixed code (10001), so the code for "r" is 010001.
(Updated tree: root count 3, node 51; leaf a, node 50, count 2; internal node 49, count 1, with children NYT, node 47, and leaf r, node 48, count 1.)
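The fixed (NYT) code used in these steps can be written out directly. The sketch below assumes, as in the example, an alphabet of m = 2^e + r letters with e = 4 and r = 10; the function name is ours:

```python
def fixed_code(k, e=4, r=10):
    """Fixed code for the k-th letter of an alphabet of 2**e + r letters:
    case 1 sends (k-1) in e+1 bits, case 2 sends (k-r-1) in e bits."""
    if 1 <= k <= 2 * r:
        return format(k - 1, "0{}b".format(e + 1))
    return format(k - r - 1, "0{}b".format(e))

for ch in "ardvk":
    k = ord(ch) - ord("a") + 1   # position of the letter in the alphabet
    print(ch, k, fixed_code(k))
# a (k=1) -> 00000, r (k=18) -> 10001, d (k=4) -> 00011,
# v (k=22) -> 1011, k (k=11) -> 01010
```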
Again the next symbol is "d", and it is encountered for the first time. For symbol "d", k = 4, and we have e = 4 and r = 10. Since 1 <= k <= 2r, case 1 occurs: send (k-1) in (e+1) bits.
(k-1) = (4-1) = 3 and (e+1) = (4+1) = 5 bits.
The representation of 3 in 5 bits is 00011.
The NYT node is now reached by the path 00, so the code for "d" is 00 + 00011 = 0000011.
(Updated tree: root count 4; leaves a (2), r (1), and d (1), with the NYT node deepest in the tree.)
The next symbol "v" is also encountered for the first time. For symbol "v", k = 22 and 2r = 20, so case 2 occurs: send (k-r-1) in e bits. (k-r-1) = (22-10-1) = 11, and the representation of 11 in 4 bits is 1011. The NYT node is reached by the path 000, so the code for "v" is 000 + 1011 = 0001011, after which the tree is updated (with the level-1 and level-2 swaps shown earlier).
Again the next symbol is "a", and it is encountered earlier. So only the code is sent: the path from the root to the arrival symbol "a". The code for "a" is 0, and the updated tree is given below.
(Updated tree: root count 6; leaf a, count 3, on the left.)
Again the next symbol is "r", and it is encountered earlier. So only the code is sent: the path from the root to the arrival symbol "r". The code for "r" is 10, and the updated tree is given below.
(Updated tree: root count 7; leaf r count 2.)
Again the next symbol is "k", and it is encountered for the first time. For symbol "k", k = 11, and we have e = 4 and r = 10. Since 1 <= k <= 2r, case 1 occurs: send (k-1) in (e+1) bits.
(k-1) = (11-1) = 10 and (e+1) = (4+1) = 5 bits.
The representation of 10 in 5 bits is 01010.
The code is NYT (1100) + fixed code (01010), so the code for "k" is 110001010.
(Updated tree: root count 8; leaf k enters beside the NYT node at level-3, where the sibling property is violated.)
After swapping at level-3, the tree for "aardvark" satisfies the sibling property.
(Tree diagram: root count 8, node 51; leaf a count 3; leaf r count 2; leaves d, v, and k count 1; NYT node 41.)
The complete encoded bitstream is therefore 00000 + 1 + 010001 + 0000011 + 0001011 + 0 + 10 + 110001010 = 00000101000100000110001011010110001010.
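As a quick check, concatenating the per-symbol codewords produced in this example (with the NYT paths read off the trees above) reproduces the full bitstream that the decoding example starts from:

```python
# Codewords for a, a, r, d, v, a, r, k, in transmission order.
codes = ["00000", "1", "010001", "0000011", "0001011", "0", "10", "110001010"]
bitstream = "".join(codes)
print(bitstream)
# 00000101000100000110001011010110001010
```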
After encoding, the codewords sent for the letters of "a a r d v a r k" are 00000, 1, 010001, 0000011, 0001011, 0, 10, and 110001010.

Ex: Decode the message by Adaptive Huffman.
"00000101000100000110001011010110001010"
Sol:
The decoder starts with a tree containing only the NYT node, so the first code is read directly: the first e = 4 bits "0000" give decimal 0. Since 0 < r = 10, read (e+1) bits: "00000", decimal 0, and 0 + 1 = 1. The alphabet at position 1 is "a".
(Tree: root, node 51, with children NYT, node 49, and leaf a, node 50, count 1.)
Now the next bit in the string is "1", so traverse from the root node via link 1 and get the symbol "a"; its count becomes 2.
(Tree: root, node 51, with children NYT, node 49, and leaf a, node 50, count 2.)
Now the next bit in the string is "0", so traverse from the root node via link "0" and reach the NYT node; the single bit "0" identifies it. Read the next e = 4 bits after the NYT bit: "1000", decimal 8. Test the e-bit value against r: 8 < 10 is true, so read (e+1) bits: "10001", decimal 17, and 17 + 1 = 18. The alphabet at position 18 is "r", and the updated tree is given below.
(Tree: root count 3; leaf a count 2; internal node 49 with children NYT, node 47, and leaf r, node 48, count 1.)
Now the next bit in the string is "0"; traverse from the root via link "0", and the next bit is again "0", reaching the NYT node: the two bits "00" are parsed to reach it. Read the next e = 4 bits after the NYT bits: "0001", decimal 1. Since 1 < 10, read (e+1) bits: "00011", decimal 3, and 3 + 1 = 4. The alphabet at position 4 is "d".
(Tree: root count 4; leaves a (2), r (1), d (1), with the NYT node deepest.)
Now the next three bits "000" are parsed from the root to reach the NYT node. Read the next e = 4 bits: "1011", decimal 11. The condition 11 < 10 is false, so compute (value + r + 1) = (11 + 10 + 1) = 22. The alphabet at position 22 is "v".
(Tree: root count 5; leaves a (2), r (1), d (1), v (1).)
Now the next bit in the string is "0"; traversing from the root via link "0" gives the leaf for "a". So the bit "0" decodes to the symbol "a".
(Tree: root count 6; leaf a count 3.)
Now the next bits are "10"; traversing from the root via link "1" and then "0" gives the leaf for "r". So the two bits "10" decode to the symbol "r".
(Tree: root count 7; leaf r count 2.)
Now the next bits "1100" are parsed from the root to reach the NYT node. Read the next e = 4 bits: "0101", decimal 5. Since 5 < 10, read (e+1) bits: "01010", decimal 10, and 10 + 1 = 11. The alphabet at position 11 is "k". The decoded message is therefore "aardvark".
(Tree: root count 7 before this update; leaves a (3), r (2), d (1), v (1).)
The updated tree is given below.
(Final tree after swapping at level-3: root count 8, node 51; leaf a count 3, node 49; leaf r count 2, node 47; leaves d, v, and k count 1; NYT node 41.)
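The decoder's rule for reading a fixed code after reaching the NYT node can be sketched as follows (again assuming e = 4 and r = 10; the function name is ours):

```python
def decode_fixed(bits, e=4, r=10):
    """Read one fixed code from the front of `bits`.
    Returns (alphabet position k, number of bits consumed)."""
    v = int(bits[:e], 2)
    if v < r:                        # e-bit value < r: the code is e+1 bits long
        return int(bits[:e + 1], 2) + 1, e + 1
    return v + r + 1, e              # otherwise it is e bits: k = value + r + 1

print(decode_fixed("10001"))  # (18, 5): position 18 is "r"
print(decode_fixed("1011"))   # (22, 4): position 22 is "v"
print(decode_fixed("01010"))  # (11, 5): position 11 is "k"
```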
Golomb Codes:
• The Golomb code is described in a succinct paper by Solomon W. Golomb (1960).
• The Golomb codes belong to a family of codes designed to encode integers with the assumption that the larger an integer, the lower its probability of occurrence.
• The simplest code for this situation is the unary code.
• The unary code for a positive integer n is simply n 1s followed by a 0.
• Thus, the code for 4 is 11110, and the code for 7 is 11111110.
• The coding scheme works in two parts: a unary part (for the quotient) and a remainder part.
• The Golomb code is actually a family of codes parameterized by an integer m > 0.
• In the Golomb code with parameter m, we represent an integer n > 0 using two numbers q and r, where
  q = ⌊n/m⌋ and r = n - qm.
• q is sent in unary and r in the remainder code described below; the codeword is obtained by concatenating the two parts.
Remainder code of r (truncated binary):
• ⌊log2 m⌋-bit representation of r for the first 2^⌈log2 m⌉ - m values.
• ⌈log2 m⌉-bit representation of r + 2^⌈log2 m⌉ - m for the rest of the values.
Eg: Design a Golomb code for m = 5 and n = 0, 1, 2, ..., 15.
• This code suits sequences with the property that smaller values are more probable than larger values.
With m = 5: ⌈log2 m⌉ = 3 and 2^3 - m = 3, so the first three remainders r = 0, 1, 2 are coded in ⌊log2 5⌋ = 2 bits as 00, 01, 10, while r = 3, 4 are coded as r + 3 = 6, 7 in 3 bits as 110, 111. Concatenating the unary code of q with the remainder code of r gives:

n    q   r   Codeword
0    0   0   000
1    0   1   001
2    0   2   010
3    0   3   0110
4    0   4   0111
5    1   0   1000
6    1   1   1001
7    1   2   1010
8    1   3   10110
9    1   4   10111
10   2   0   11000
11   2   1   11001
12   2   2   11010
13   2   3   110110
14   2   4   110111
15   3   0   111000
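The whole construction, a unary quotient followed by a truncated binary remainder, fits in a few lines of Python. This is a sketch; the function names are ours, and the bit layout follows the remainder-code rule stated above:

```python
from math import ceil, log2

def unary(q):
    """q ones followed by a terminating zero."""
    return "1" * q + "0"

def truncated_binary(r, m):
    """Remainder code: the first 2**ceil(log2 m) - m values use
    floor(log2 m) bits; the rest use ceil(log2 m) bits with an offset."""
    k = ceil(log2(m))
    cutoff = 2 ** k - m
    if r < cutoff:
        return format(r, "0{}b".format(k - 1))
    return format(r + cutoff, "0{}b".format(k))

def golomb(n, m):
    """Golomb code of n with parameter m: unary(q) + truncated_binary(r)."""
    q, r = divmod(n, m)
    return unary(q) + truncated_binary(r, m)

for n in range(8):
    print(n, golomb(n, 5))
# 0 -> 000, 1 -> 001, 2 -> 010, 3 -> 0110, 4 -> 0111, 5 -> 1000, ...
```

When m is a power of two the cutoff is zero, every remainder takes ⌈log2 m⌉ bits, and the code reduces to a Rice code.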
Rice Codes:
In Rice coding, the sequence to be coded is divided into blocks of J values each (here J = 8), and each block is then coded separately.

Tunstall Codes:
Eg: Design a 3-bit Tunstall code for the source {A, B, C} with P(A) = 0.6, P(B) = 0.3, P(C) = 0.1. After the first iteration, in which the most probable letter A is expanded, the table is:

Letters  Probability
B        0.3
C        0.1
AA       0.36
AB       0.18
AC       0.06
Sol:
Discard the highest-probability entry (AA) and concatenate it with all the source letters to form new entries, repeating until the required number of codewords is reached. The new table after the second (last) iteration is:

Letters  Probability  Codeword  S.no
B        0.3          000       0
C        0.1          001       1
AB       0.18         010       2
AC       0.06         011       3
AAA      0.216        100       4
AAB      0.108        101       5
AAC      0.036        110       6
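The iterative construction can be sketched in Python, assuming (as above) single-letter probabilities P(A) = 0.6, P(B) = 0.3, P(C) = 0.1; the function name is ours:

```python
def tunstall(probs, nbits):
    """Variable-to-fixed Tunstall code: expand the most probable entry
    while the table still fits in 2**nbits codewords."""
    entries = dict(probs)                     # start from the single letters
    while len(entries) + len(probs) - 1 <= 2 ** nbits:
        best = max(entries, key=entries.get)  # highest-probability entry
        p = entries.pop(best)                 # discard it ...
        for sym, ps in probs.items():         # ... and append its extensions
            entries[best + sym] = p * ps
    return {s: format(i, "0{}b".format(nbits)) for i, s in enumerate(entries)}

code = tunstall({"A": 0.6, "B": 0.3, "C": 0.1}, 3)
print(code)
# {'B': '000', 'C': '001', 'AB': '010', 'AC': '011',
#  'AAA': '100', 'AAB': '101', 'AAC': '110'}
```

The two expansions (A, then AA) reproduce the seven-entry table above exactly.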
Applications of Huffman Coding:
• Lossless Image Compression
• Text Compression
• Audio Compression
PROBLEMS:
1. Alphabet A={a,b,c,d,e} with
P(a)=0.15, P(b)=0.04, P(c)=0.26,
P(d)=0.05 and P(e)=0.50.
Find:
– Entropy
– Huffman code for the source
– Average length of this code
Ans.
Entropy = 1.82 bits/symbol

Letter  Code
a       110
b       1111
c       10
d       1110
e       0

Average length = 0.15(3) + 0.04(4) + 0.26(2) + 0.05(4) + 0.50(1) = 1.83 bits/symbol
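The answer can be checked numerically from the entropy formula H = -Σ P(x) log2 P(x) and the lengths of the codewords given above:

```python
from math import log2

p = {"a": 0.15, "b": 0.04, "c": 0.26, "d": 0.05, "e": 0.50}
length = {"a": 3, "b": 4, "c": 2, "d": 4, "e": 1}  # lengths of the codes above

H = -sum(pi * log2(pi) for pi in p.values())       # source entropy
avg = sum(p[s] * length[s] for s in p)             # average codeword length
print(round(H, 2), avg)  # entropy is about 1.82, average length about 1.83
```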