Chapter 5 Data Compression
Basic Compression
Data Compression
Redundancy
Variable Length Coding
Huffman encoding
Run Length Encoding (RLE)
Quantization (Lossy)
Data Compression
Two categories
• Information Preserving
– Error free compression
– Original data can be recovered completely
• Lossy
– Original data is approximated
– Less than perfect
– Generally allows much higher compression
Basics
Data Compression
– Process of reducing the amount of data required to
represent a given quantity of information
Data vs. Information
– Data and Information are not the same thing
– Data
• the means by which information is conveyed
• various amounts of data can convey the same
information
– Information
• “A signal that contains no uncertainty”
Redundancy
Redundancy
– “data” that provides no relevant information
– “data” that restates what is already known
• For example
– Consider that N1 and N2 denote the number of “data
units” in two sets that represent the same information
– where Cr is the “Compression Ratio”
• Cr = N1 / N2
• EXAMPLE
N1 = 10 and N2 = 1 data units can encode the same information. The Compression Ratio is
Cr = N1/N2 = 10 (or 10:1)
Implying 90% of the data in N1 is redundant
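A minimal sketch (Python; not part of the original slides) of this arithmetic, using the standard relative redundancy formula Rd = 1 − 1/Cr:

```python
def compression_ratio(n1: int, n2: int) -> float:
    """Compression ratio Cr = N1 / N2."""
    return n1 / n2

def relative_redundancy(n1: int, n2: int) -> float:
    """Fraction of the data in N1 that is redundant: Rd = 1 - 1/Cr."""
    return 1.0 - 1.0 / compression_ratio(n1, n2)

# Example from the slide: N1 = 10, N2 = 1
print(compression_ratio(10, 1))    # 10.0 -> 10:1
print(relative_redundancy(10, 1))  # 0.9  -> 90% of the data in N1 is redundant
```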
Variable Length Coding (1)
Our binary system (called natural binary) is not always that good at
representing data from a compression point of view
In natural binary we would need at least 2 bits per symbol to represent the four symbols a, b, c and d, assigning bits as
follows:
a=00, b=01, c=10, d=11
There are 35 pieces of data, that is 35 × 2 bits = 70 bits
Variable Length Coding (2)
Now, consider the occurrence of each symbol:
a, b, c, d
abaaaaabbbcccccaaaaaaaabbbbdaaaaaaa
a = 21/35 (60%)
b = 8/35 (23%)
c = 5/35 (14%)
d = 1/35 (3%)
Variable Length Coding (3)
Idea of variable length coding: assign fewer bits to the more frequent symbols and more bits to the less frequent symbols
a = 21/35 (60%)
b = 8/35 (23%)
c = 5/35 (14%)
d = 1/35 (3%)
For example, with codeword lengths of 1, 2, 3 and 4 bits for a, b, c and d, the total is 21×1 + 8×2 + 5×3 + 1×4 = 56 bits
So we have a compression ratio of 70:56, or 1.25, meaning 20% of the data using natural binary encoding is redundant
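The following sketch (Python; not part of the original slides) reproduces these totals, assuming a prefix-free variable length code with codeword lengths 1, 2, 3 and 4 bits:

```python
from collections import Counter

data = "abaaaaabbbcccccaaaaaaaabbbbdaaaaaaa"   # the 35-symbol sequence above

# Assumed variable-length code (lengths 1, 2, 3, 4); any prefix-free
# assignment with these lengths gives the same totals.
vlc = {"a": "0", "b": "10", "c": "110", "d": "1110"}

counts = Counter(data)
natural_bits = len(data) * 2                               # 2 bits/symbol: a=00, b=01, c=10, d=11
vlc_bits = sum(counts[s] * len(code) for s, code in vlc.items())

print(natural_bits, vlc_bits)                  # 70 56
print(f"Cr = {natural_bits / vlc_bits:.2f}")   # Cr = 1.25
```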
Huffman encoding
This is an example of error free coding: the information is completely the same, the data is different
• Prof. David A. Huffman developed an algorithm to take a data set and compute its “optimal” encoding (bit assignment)
– He developed this as a student at MIT
• There are other VLC techniques, such as Arithmetic coding and LZW coding (ZIP)
Huffman encoding
Huffman encoding of an image is the same idea as VLC, but uses the histogram to measure the frequency of occurrence of each gray level in the grayscale image.
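As a sketch (Python/NumPy; not part of the original slides), the gray-level probabilities p(i) = h(i)/n can be computed from the image histogram; the 8-level image here is hypothetical:

```python
import numpy as np

def gray_level_probabilities(image: np.ndarray, levels: int = 8) -> np.ndarray:
    """p(i) = h(i)/n: h(i) counts pixels with gray level i, n is the total pixel count."""
    h = np.bincount(image.ravel(), minlength=levels)   # histogram h(i)
    return h / image.size                              # n = total number of pixels

# Hypothetical 3-bit (8 gray level) image
img = np.random.randint(0, 8, size=(64, 64))
print(gray_level_probabilities(img))   # probabilities sum to 1.0
```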
Huffman encoding
• p(i) = h(i)/n is the probability of occurrence of gray level i
• h(i) is the frequency of occurrence of gray level i
• n is the total number of pixels in the image

Gray level    p(i) = h(i)/n
0             0.000
1             0.012
2             0.071
3             0.019
4             0.853
5             0.023
6             0.019
7             0.003

The table on the slide also lists, for each gray level, its Huffman codeword, the codeword length l, and l·p.
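A sketch of Huffman's algorithm (Python; not part of the original slides) applied to the probabilities in the table; the resulting codewords are one possible optimal assignment, not necessarily the ones on the original slide:

```python
import heapq
from itertools import count

def huffman_codes(probabilities: dict) -> dict:
    """Build a Huffman code (symbol -> bit string); symbols with probability 0 are skipped."""
    tiebreak = count()  # keeps heap comparisons on numbers only
    heap = [(p, next(tiebreak), {sym: ""}) for sym, p in probabilities.items() if p > 0]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, codes0 = heapq.heappop(heap)   # two least probable subtrees
        p1, _, codes1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes0.items()}
        merged.update({s: "1" + c for s, c in codes1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]

# Gray-level probabilities from the table above
p = {0: 0.000, 1: 0.012, 2: 0.071, 3: 0.019, 4: 0.853, 5: 0.023, 6: 0.019, 7: 0.003}
for level, code in sorted(huffman_codes(p).items()):
    print(level, code)   # the most frequent level (4) gets the shortest codeword
```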
Run Length Encoding (RLE)
RLE is often more compact, especially when the data contains lots of runs of the same number(s)
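A minimal RLE sketch (Python; not part of the original slides), with hypothetical data containing long runs:

```python
def rle_encode(data):
    """Encode a sequence as (value, run_length) pairs."""
    runs = []
    for value in data:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([value, 1])   # start a new run
    return [tuple(run) for run in runs]

def rle_decode(runs):
    """Invert rle_encode: expand each (value, run_length) pair."""
    return [value for value, length in runs for _ in range(length)]

data = [5, 5, 5, 5, 0, 0, 7, 7, 7]    # hypothetical data with runs
runs = rle_encode(data)
print(runs)                            # [(5, 4), (0, 2), (7, 3)]
print(rle_decode(runs) == data)        # True -> RLE is lossless
```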
Quantization (Lossy)
The natural binary range of the quantized data is smaller, so we could also use Huffman encoding to get a bit more compression.
Of course, the values are not the same; on “reconstruction” (multiplying by 5) we get only an approximation of the original:
RECOVERED = {10, 10, 15, 15, 20, 0, 5, 0, 10}
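A minimal quantization sketch (Python; not part of the original slides), assuming the step of 5 implied by the “multiply by 5” reconstruction; the original data here is hypothetical:

```python
STEP = 5  # assumed quantization step

def quantize(values, step=STEP):
    """Integer-divide by the step; this is the lossy part."""
    return [v // step for v in values]

def reconstruct(q_values, step=STEP):
    """Multiply back by the step; only approximates the original."""
    return [q * step for q in q_values]

original  = [12, 10, 17, 15, 21, 2, 5, 3, 11]   # hypothetical original data
recovered = reconstruct(quantize(original))
print(recovered)   # [10, 10, 15, 15, 20, 0, 5, 0, 10] -- close to, but not equal to, the original
```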
Lossy vs. Lossless
• For things like text documents and computer data files, lossy compression
doesn’t make sense
– An approximation of the original is no good!
• But for data like audio or images, small errors are not easily detectable by our senses
– An approximation is acceptable
• This is one reason we can get significant compression of images and audio,
vs. other types of data
– Lossless 10:1 is typically possible
– Lossy 300:1 is possible with no significant perceptual loss