Text Compression: Examples Huffman Coding: Go Go Gophers

1. Huffman coding is a data compression technique that uses variable-length codewords to encode characters based on their frequency of occurrence. Frequent characters are assigned shorter codewords, while rare characters are assigned longer codewords.
2. To build a Huffman tree, each unique character is assigned a single-node tree weighted by its frequency count. Trees are repeatedly combined by joining the two lowest-frequency trees until only one tree is left: the Huffman tree.
3. The Huffman coding process analyzes character frequencies, builds the Huffman tree, assigns codewords based on tree paths, and encodes the data using the variable-length codewords to achieve compression.

Text Compression: Examples
Huffman coding: go go gophers
CPS 100

"go go gophers" (13 characters) in the different formats:

Symbol  Count  ASCII         3-bit  Huffman
g       3      103 1100111   000    00
o       3      111 1101111   001    01
p       1      112 1110000   010    1100
h       1      104 1101000   011    1101
e       1      101 1100101   100    1110
r       1      114 1110010   101    1111
s       1      115 1110011   110    101
sp.     2       32 0100000   111    100

 Encoding uses tree: 0 left/1 right
 How many bits? 37!!
 Savings? Worth it?

(The Huffman codes for s and sp. were unreadable on the slide; the values shown are one valid assignment with the same code lengths as the slide's tree.)

Encodings

 ASCII: 8 bits/character
   "abcde" is 01100001011000100110001101100100…
 Unicode: 16 bits/character

Symbol  ASCII     Fixed length  Var. length
a       01100001  000           000
b       01100010  001           11
c       01100011  010           01
d       01100100  011           001
e       01100101  100           10

 Fixed: "abcde" is 000001010011100
 Var.: "abcde" is 000110100110
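The bit counts above can be checked directly. A small sketch (the code table is one valid Huffman assignment for these frequencies; any equivalent tree gives the same lengths):

```python
from collections import Counter

text = "go go gophers"
counts = Counter(text)           # e.g. counts['g'] == 3, counts['o'] == 3

# Cost of each scheme, in bits:
ascii_bits = 8 * len(text)       # 8 bits per character
fixed_bits = 3 * len(text)       # 3 bits suffice for an 8-symbol alphabet

# One valid Huffman code for these frequencies; the code lengths are
# what matter, and several equivalent trees produce them.
code = {'g': '00', 'o': '01', ' ': '100', 's': '101',
        'p': '1100', 'h': '1101', 'e': '1110', 'r': '1111'}
huff_bits = sum(counts[ch] * len(code[ch]) for ch in counts)

print(ascii_bits, fixed_bits, huff_bits)   # 104 39 37
```

So Huffman saves 2 bits over the fixed 3-bit code here; the savings grow as the frequency distribution gets more skewed.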

Huffman Coding

 D. A. Huffman, early 1950s
 Before compressing data, analyze the input stream
 Represent data using variable-length codes
 Variable-length codes, specifically prefix codes
 Each letter is assigned a codeword
 The codeword for a given letter is produced by traversing the Huffman tree
 Property: no codeword produced is the prefix of another
 Letters appearing frequently have short codewords, while those that appear rarely have longer ones
 Huffman coding is an optimal per-character coding method

Building a Huffman tree

 Begin with a forest of single-node trees (leaves)
 Each node/tree/leaf is weighted with its character count
 A node stores two values: character and count
 There are n nodes in the forest, where n is the size of the alphabet
 Repeat until there is only one node left: the root of the tree
 Remove the two minimally weighted trees from the forest
 Create a new tree with the minimal trees as children
• New tree root's weight: sum of children's weights (characters ignored)
 Does this process terminate? How do we get minimal trees?
 Remove minimal trees, hmmm… (this is where a priority queue comes in)
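The merge loop above maps directly onto a priority queue. A minimal sketch (Python's heapq stands in for the course's priority queue; the tie-break counter only keeps tuple comparison away from unorderable nodes):

```python
import heapq
from collections import Counter

def build_huffman_tree(text):
    """Build a Huffman tree with the forest-merge loop from the slide.

    Each forest entry is (weight, tiebreak, node), where a node is either
    a character (leaf) or a (left, right) pair (internal node).
    """
    counts = Counter(text)
    forest = [(w, i, ch) for i, (ch, w) in enumerate(counts.items())]
    heapq.heapify(forest)                   # forest of single-node trees
    tiebreak = len(forest)
    while len(forest) > 1:                  # repeat until one tree: the root
        w1, _, t1 = heapq.heappop(forest)   # remove the two minimally
        w2, _, t2 = heapq.heappop(forest)   #   weighted trees
        # new root's weight is the sum of its children's weights
        heapq.heappush(forest, (w1 + w2, tiebreak, (t1, t2)))
        tiebreak += 1
    return forest[0][2]

root = build_huffman_tree("go go gophers")
```

Ties can be broken either way; different runs may produce differently shaped trees, but all have the same total cost.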


Building a tree

"A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS"

The initial forest of single-node trees, heaviest first:

sp. I  E  N  S  M  A  B  O  T  G  D  L  R  U  P  F  C
11  6  5  5  4  4  3  3  3  3  2  2  2  2  2  1  1  1

Successive slides show the forest after each merge of the two lightest trees: F (1) and C (1) combine into a tree of weight 2, then P (1) and U (2) into a tree of weight 3, then L (2) and R (2) into a tree of weight 4, and so on, each new root weighted by the sum of its children. Near the end, subtrees of weight 21, 16, 12, and 11 combine into two trees of weight 37 and 23.

Building a tree

"A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS"

The finished tree has weight 60 at the root, with subtrees of weight 37 and 23 below it; frequent characters such as the space (11) and I (6) sit high in the tree, while the rarest characters C, F, and P (1 each) are the deepest leaves.

Mary Shaw

 Software engineering and software architecture
 Tools for constructing large software systems
 Development is a small piece of total cost; maintenance is larger and depends on well-designed and well-developed techniques
 Interested in computer science, programming, curricula, canoeing, and health-care costs
 ACM Fellow, Alan Perlis Professor of Computer Science at CMU
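The starting forest on these slides can be reproduced by counting characters (a quick check, not part of the original slides):

```python
from collections import Counter

s = "A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS"
counts = Counter(s)

# Sorted heaviest-first, this reproduces the forest's starting weights:
weights = sorted(counts.values(), reverse=True)
print(weights)
# [11, 6, 5, 5, 4, 4, 3, 3, 3, 3, 2, 2, 2, 2, 2, 1, 1, 1]
print(sum(weights))   # 60: the weight of the final root
```

The root's weight, 60, is just the length of the string: every character contributes its count to exactly one leaf.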
Huffman Complexities

 How do we measure? Size of input file, size of alphabet
 Which is typically bigger?
 Accumulating character counts: ______
 How can we do this in O(1) time? (well, not really)
 Building the heap/priority queue from counts: ____
 Initializing the heap: guaranteed
 Building the Huffman tree: ____
 Why?
 Creating the table of encodings from the tree: ____
 Why?
 Writing the tree and compressed file: _____

Writing code out to file

 How do we go from characters to encodings?
 Build the Huffman tree
 A root-to-leaf path generates the encoding
 Need a way of writing bits out to a file
 Platform dependent?
 Complicated to write bits and read them back in the same ordering
 See the BitInputStream and BitOutputStream classes
 They depend on each other; bit ordering is preserved
 How do we know the bits come from a compressed file?
 Store a magic number
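Bit packing is the fiddly part the slide warns about. A minimal sketch of a bit writer (MSB-first packing is an assumption here; the course's BitOutputStream may order bits differently):

```python
class BitOutput:
    """Minimal sketch of a bit writer: packs bits MSB-first into bytes."""

    def __init__(self):
        self.bytes = bytearray()
        self.acc = 0          # bits accumulated so far, as an integer
        self.nbits = 0        # how many bits are in acc

    def write_bit(self, b):
        self.acc = (self.acc << 1) | (b & 1)
        self.nbits += 1
        if self.nbits == 8:               # flush a full byte
            self.bytes.append(self.acc)
            self.acc, self.nbits = 0, 0

    def write_bits(self, s):
        for ch in s:                      # s is a string of '0'/'1'
            self.write_bit(int(ch))

    def flush(self):
        if self.nbits:                    # pad the last byte with zeros
            self.bytes.append(self.acc << (8 - self.nbits))
            self.acc, self.nbits = 0, 0
        return bytes(self.bytes)

out = BitOutput()
out.write_bits("01000001")   # the 8 bits of 0x41 ('A')
data = out.flush()
print(data)                  # b'A'
```

A real writer would first emit a magic number, so a reader can tell a compressed file from random bytes, and reader and writer must agree on this bit ordering.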

Decoding a message

01100000100001001101

 Decoding also uses the tree: start at the root, go left on 0 and right on 1
 On reaching a leaf, output that character and restart at the root

The slides repeat the tree built above (root weight 60) and show the bits remaining at each step:

 01100000100001001101 (start)
 1100000100001001101
 100000100001001101
 00000100001001101
 0000100001001101 (output so far: "G")
 1 (output so far: "GOOD")
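The walk the slides perform can be written in a few lines. A sketch using the earlier "go go gophers" codewords (the nested tuples are one valid tree shape, an assumption; index 0 selects the left child, 1 the right):

```python
# One valid Huffman tree for "go go gophers": a leaf is a character,
# an internal node is a (left, right) pair.
tree = (('g', 'o'), ((' ', 's'), (('p', 'h'), ('e', 'r'))))

def decode(bits, root):
    """Walk the tree: 0 = left, 1 = right; emit at a leaf, restart at root."""
    out, node = [], root
    for b in bits:
        node = node[int(b)]           # follow one edge per bit
        if isinstance(node, str):     # reached a leaf
            out.append(node)
            node = root               # restart for the next codeword
    return "".join(out)

print(decode("00011000001", tree))    # prints "go go"
```

Because the code is prefix-free, the decoder never needs to look ahead: the first leaf reached is always the right character.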


Decoding a message

 The full bit string 01100000100001001101 decodes to "GOOD"

Huffman coding: go go gophers

Symbol  Count  ASCII         3-bit  Huffman
g       3      103 1100111   000    ??
o       3      111 1101111   001    ??
p       1      112 1110000   010
h       1      104 1101000   011
e       1      101 1100101   100
r       1      114 1110010   101
s       1      115 1110011   110
sp.     2       32 0100000   111

 Choose two smallest weights
 Combine nodes + weights
 Repeat
 Priority queue?
 Encoding uses tree: 0 left/1 right
 How many bits?

(The slide shows the forest mid-build: p, h, e, r, s, and sp. start as single nodes, pairs combine into trees of weight 2 and 3, and those combine into subtrees of weight 4 and 6 alongside g and o.)
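The "0 left/1 right" rule turns the finished tree into the code table the slide leaves as "??": each root-to-leaf path is a codeword. A sketch using the same assumed tree shape as before:

```python
# One valid Huffman tree for "go go gophers".
tree = (('g', 'o'), ((' ', 's'), (('p', 'h'), ('e', 'r'))))

def codewords(node, prefix=""):
    """Yield (character, codeword) pairs: 0 = left child, 1 = right child."""
    if isinstance(node, str):                 # leaf: path so far is the code
        yield node, prefix
    else:                                     # internal node: recurse both ways
        yield from codewords(node[0], prefix + "0")
        yield from codewords(node[1], prefix + "1")

table = dict(codewords(tree))
total = sum(len(table[ch]) for ch in "go go gophers")
print(table['g'], table['p'], total)   # 00 1100 37
```

This answers the slide's "How many bits?": 37 for this tree, versus 39 for the 3-bit fixed code.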

Huffman Tree 2

 "A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS"
 E.g. " A SIMPLE" ⇔ "10101101001000101001110011100000"

Other methods

 Adaptive Huffman coding
 Lempel-Ziv algorithms
 Build the coding table on the fly while reading the document
 Coding table changes dynamically
 Protocol between encoder and decoder so that everyone is always using the right coding scheme
 Works well in practice (compress, gzip, etc.)
 More complicated methods
 Burrows-Wheeler (bzip2)
 PPM statistical methods
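To make "build the coding table on the fly" concrete, here is a tiny LZW sketch (LZW is one member of the Lempel-Ziv family, the variant behind compress; this is an illustration under that assumption, not the exact algorithm used by those tools):

```python
def lzw_compress(text):
    """Tiny LZW sketch: the dictionary grows while reading the input."""
    table = {chr(i): i for i in range(256)}   # start with all single bytes
    w, out = "", []
    for ch in text:
        if w + ch in table:
            w += ch                           # extend the current match
        else:
            out.append(table[w])              # emit code for longest match
            table[w + ch] = len(table)        # new dictionary entry
            w = ch
    if w:
        out.append(table[w])
    return out

codes = lzw_compress("go go gophers go go")
print(len(codes), "codes for", len("go go gophers go go"), "characters")
```

The decoder rebuilds the same dictionary from the codes alone, so no table is shipped with the data; that is the encoder/decoder "protocol" the slide mentions.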
