Data Compression

[Diagram: source → encoder → channel → decoder]
Examples of source:
Human speeches, photos, text messages, computer programs …
Examples of channel:
storage media, telephone lines, wireless transmission …
Source-Channel Separation Principle
Self-information

$I(p) = \log_2 \frac{1}{p} = -\log_2 p$

$p = 1 \Rightarrow I(p) = 0$: the event must happen (no uncertainty)
$p \to 0 \Rightarrow I(p) \to \infty$: the event is unlikely to happen (infinite amount of uncertainty)

Intuitively, I(p) measures the amount of uncertainty associated with an event x of probability p.
Weighted Self-information

$I_w(p) = p \cdot I(p) = -p \log_2 p$

p     I(p)   I_w(p)
0     ∞      0
1/2   1      1/2
1     0      0

$I_w(p)$ reaches its maximum at $p = 1/e$, where $I_w(1/e) = \frac{1}{e \ln 2} \approx 0.53$.
Quantification of Uncertainty of a Discrete Source

$H(X) = -\sum_{i=1}^{N} p_i \log_2 p_i$  (bits/sample, or bps)

The probabilities $p_i$ serve as the weighting coefficients: the entropy is the average self-information over the N symbols.
Source Entropy Examples

A source emits one of four directions with

$prob(x = S) = \frac{1}{2}$, $prob(x = N) = \frac{1}{4}$, $prob(x = E) = prob(x = W) = \frac{1}{8}$

$H(X) = -\left(\frac{1}{2}\log_2\frac{1}{2} + \frac{1}{4}\log_2\frac{1}{4} + \frac{1}{8}\log_2\frac{1}{8} + \frac{1}{8}\log_2\frac{1}{8}\right) = 1.75 \ \text{bps}$
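To make the definition concrete, here is a minimal Python sketch (our own helper, not from the slides) that evaluates the entropy formula for this example:

```python
import math

def entropy(probs):
    """H(X) = -sum(p * log2(p)) in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# N/S/E/W source from above:
print(entropy([0.5, 0.25, 0.125, 0.125]))  # -> 1.75
```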
Source Entropy Examples (Con’t)

A binary source with

$p = prob(x = red) = \frac{1}{2}$, $1 - p = prob(x = blue) = \frac{1}{2}$

Consider the event that the first red appears on the k-th pick:

Prob(event) = Prob(blue in the first k-1 picks) × Prob(red in the k-th pick)
            = $(1/2)^{k-1} \cdot (1/2) = (1/2)^k$
Source Entropy Calculation

If we consider all possible events, the sum of their probabilities will be one.

Check: $\sum_{k=1}^{\infty} \left(\frac{1}{2}\right)^k = 1$

Then we can define a discrete random variable X with $P(x = k) = \left(\frac{1}{2}\right)^k$, k = 1, 2, …

Entropy:

$H(X) = -\sum_{k=1}^{\infty} p_k \log_2 p_k = \sum_{k=1}^{\infty} \frac{k}{2^k} = 2 \ \text{bps}$
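A quick numeric sanity check of the infinite sum (our own sketch; the series is truncated, so the result agrees with 2 only to floating-point precision):

```python
import math

# Geometric source P(X = k) = (1/2)^k: truncate the entropy sum at k = 99.
H = -sum((0.5 ** k) * math.log2(0.5 ** k) for k in range(1, 100))
print(H)  # -> 2.0 (up to truncation error)
```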
Notes:
1. Memoryless means that the events are independently
generated (e.g., the outcomes of flipping a coin N times
are independent events)
2. Source redundancy can then be understood as the
difference between the raw data rate and the source entropy
*Code Redundancy

$r = \bar{l} - H(X) \ge 0$  (practical performance minus theoretical bound)

Average code length: $\bar{l} = \sum_{i=1}^{N} p_i l_i$, where $l_i$ is the length of the codeword assigned to the i-th symbol

$H(X) = \sum_{i=1}^{N} p_i \log_2 \frac{1}{p_i}$

Note: if we represent each symbol by q bits (fixed-length codes),
then the redundancy is simply q - H(X) bps.
How to achieve source entropy?

For the four-symbol source above:
fixed-length code: $\bar{l} = 2 \ \text{bps} > H(X)$
variable-length code: $\bar{l} = 1.75 \ \text{bps} = H(X)$
Problems with VLC

• When codewords have fixed lengths, the
boundary of codewords is always
identifiable.
• For codewords with variable lengths, their
boundary could become ambiguous.

symbol  VLC
S       0
N       1
E       10
W       11

Example: the encoder sends S S N W S E … as 0 0 1 11 0 10…
The decoder may parse the same bits as 0 0 11 1 0 10…, i.e., S S W N S E …
Both SSNW SE … and SSWN SE … fit the received bitstream.
Uniquely Decodable Codes

• To avoid the ambiguity in decoding, we
need to enforce certain conditions with
VLC to make them uniquely decodable
• Since ambiguity arises when some
codeword becomes the prefix of another,
it is natural to consider the prefix condition

Example: p, pr, pre, pref, prefi are all prefixes of "prefix"

Prefix condition: no codeword is allowed to
be the prefix of any other codeword.

On the binary codeword tree:
Level 1: 1, 0  (2 nodes)
Level 2: 11, 10, 01, 00  ($2^2$ nodes)
……
Level k: ($2^k$ nodes)
Prefix Condition Examples

symbol x  codeword 1  codeword 2
S         0           0
N         1           10
E         10          110
W         11          111

codeword 1 violates the prefix condition (1 is a prefix of 10 and 11);
codeword 2 satisfies it.
How to satisfy prefix condition?

• Basic rule: If a node is used as a
codeword, then all its descendants
cannot be used as codewords.

Example: choosing 0, 10, 110, … as codewords obeys the rule; each
codeword is a leaf of the tree, so none is an ancestor (prefix) of another.
Property of Prefix Codes

Kraft's inequality: $\sum_{i=1}^{N} 2^{-l_i} \le 1$

Check against the examples above:
codeword 1 (lengths 1, 1, 2, 2): $\sum_i 2^{-l_i} = \frac{1}{2} + \frac{1}{2} + \frac{1}{4} + \frac{1}{4} = \frac{3}{2} > 1$
codeword 2 (lengths 1, 2, 3, 3): $\sum_i 2^{-l_i} = \frac{1}{2} + \frac{1}{4} + \frac{1}{8} + \frac{1}{8} = 1 \le 1$
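A minimal sketch (our own helper, `kraft_sum`) that checks Kraft's inequality for a list of codeword lengths:

```python
from fractions import Fraction

def kraft_sum(lengths):
    """Return sum(2^-l) over the codeword lengths, computed exactly."""
    return sum(Fraction(1, 2 ** l) for l in lengths)

print(kraft_sum([1, 1, 2, 2]))  # codeword 1 -> 3/2 > 1: not a prefix code
print(kraft_sum([1, 2, 3, 3]))  # codeword 2 -> 1: Kraft's inequality holds
```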
Two Goals of VLC design

• achieve optimal code length (i.e., minimal redundancy)
• satisfy the prefix condition (unique decodability)

For an event x with probability of p(x), the optimal
code-length is $\lceil -\log_2 p(x) \rceil$, where $\lceil x \rceil$ denotes the
smallest integer larger than x (e.g., $\lceil 3.4 \rceil = 4$)

code redundancy: $r = \bar{l} - H(X) \ge 0$

Example-I
Step 1: Source reduction

symbol x  p(x)
S         0.5    0.5       0.5
N         0.25   0.25      0.5 (NEW)
E         0.125  0.25 (EW)
W         0.125

At each step the two least probable symbols are merged into a
compound symbol: (EW), then (NEW).
Example-I (Con’t)

Step 2: Codeword assignment

symbol   assignment A   assignment B
S        0              1
N        10             01
E        110            001
W        111            000

The codeword assignment is not unique. In fact, at each
merging point (node), we can arbitrarily assign "0" and "1"
to the two branches (average code length is the same).
Example-II

Step 1: Source reduction

symbol x  p(x)   reduction 1   reduction 2   reduction 3
e         0.4    0.4           0.4           0.6 (aiou)
a         0.2    0.2           0.4 (iou)     0.4
i         0.2    0.2           0.2
o         0.1    0.2 (ou)
u         0.1

compound symbols: (ou), (iou), (aiou)
Example-II (Con’t)

Step 2: Codeword assignment

symbol x  p(x)   codeword
e         0.4    1
a         0.2    01
i         0.2    000
o         0.1    0010
u         0.1    0011

compound symbols: (aiou) = 0, (iou) = 00, (ou) = 001
Example-II (Con’t)

Binary codeword tree representation:

root:    0 → (aiou),   1 → e
(aiou): 00 → (iou),   01 → a
(iou): 000 → i,      001 → (ou)
(ou): 0010 → o,     0011 → u
Example-II (Con’t)

symbol x  p(x)   codeword   length
e         0.4    1          1
a         0.2    01         2
i         0.2    000        3
o         0.1    0010       4
u         0.1    0011       4

$\bar{l} = \sum_{i=1}^{5} p_i l_i = 0.4 \times 1 + 0.2 \times 2 + 0.2 \times 3 + 0.1 \times 4 + 0.1 \times 4 = 2.2 \ \text{bps}$

$H(X) = -\sum_{i=1}^{5} p_i \log_2 p_i = 2.122 \ \text{bps}$, so $r = \bar{l} - H(X) = 0.078 \ \text{bps}$
Summary of Huffman Coding Algorithm

• Achieves minimal redundancy subject to the
constraint that the source symbols be coded one at
a time
• Sorting symbols in descending order of probability is the
key step in source reduction
• The codeword assignment is not unique: exchanging
the labels "0" and "1" at any node of the binary
codeword tree produces another solution that
works equally well
• Only works for a source with a finite number of
symbols (otherwise, it does not know where to start)
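As an illustration, a minimal Huffman coder in Python (our own sketch of the merge-two-least-probable procedure summarized above; `huffman_code` and its tie-breaking counter are our names, not any standard API):

```python
import heapq
from itertools import count

def huffman_code(probs):
    """probs: dict symbol -> probability. Returns dict symbol -> codeword."""
    ties = count()  # unique tie-breaker so the heap never compares dicts
    heap = [(p, next(ties), {s: ""}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)   # least probable subtree
        p1, _, c1 = heapq.heappop(heap)   # next least probable subtree
        merged = {s: "0" + w for s, w in c0.items()}      # prepend branch labels
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, next(ties), merged))
    return heap[0][2]

# Example-I source: yields S=0, N=10, E/W=110/111 (up to 0/1 relabeling)
print(huffman_code({"S": 0.5, "N": 0.25, "E": 0.125, "W": 0.125}))
```

Because the 0/1 labels at each node are arbitrary, any run may produce a relabeled but equally optimal code, exactly as the summary notes.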
Variation: Golomb Codes

Optimal VLC for a geometric source: $P(X = k) = (1/2)^k$, k = 1, 2, …

k   codeword
1   0
2   10
3   110
4   1110
5   11110
6   111110
7   1111110
8   11111110
…   ……

Each codeword consists of k-1 ones followed by a terminating zero, so its
length is exactly $-\log_2 P(X = k) = k$ bits: the code achieves the source entropy.
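A minimal sketch of this codeword table (our own helpers `encode`/`decode`), showing that the code is trivially decodable by counting ones up to the terminating zero:

```python
def encode(k):
    """k-1 ones followed by a terminating zero."""
    return "1" * (k - 1) + "0"

def decode(bits):
    """Split a concatenated bitstream back into the values k."""
    out, run = [], 0
    for b in bits:
        if b == "1":
            run += 1
        else:             # '0' terminates a codeword
            out.append(run + 1)
            run = 0
    return out

stream = "".join(encode(k) for k in [1, 3, 2, 1])
print(stream)          # -> 0110100
print(decode(stream))  # -> [1, 3, 2, 1]
```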
Data Compression Basics
• Discrete source
– Information=uncertainty
– Quantification of uncertainty
– Source entropy
• Variable length codes
– Motivation
– Prefix condition
– Huffman coding algorithm
• Lempel-Ziv coding*
History of Lempel-Ziv Coding
• Invented by Abraham Lempel and Jacob Ziv in 1977
• Numerous variations and improvements
since then
• Widely used in different applications
– Unix system: compress command
– Winzip software (LZW algorithm)
– TIF/TIFF image format
– Dial-up modem (to speed up the transmission)
Dictionary-based Coding
• Use a dictionary
– Think about the evolution of an English dictionary
• It is structured - if any random combination of
letters formed a word, the dictionary would not
exist
• It is dynamic - more and more words are put into the
dictionary as time moves on
– Data compression is similar in the sense that
redundancy reveals itself as patterns, just like English
words in a dictionary
Toy Example

I took a walk in town one day
And met a cat along the way.
What do you think that cat did say?
Meow, Meow, Meow

I took a walk in town one day
And met a pig along the way.
What do you think that pig did say?
Oink, Oink, Oink

I took a walk in town one day
And met a cow along the way.
…

dictionary of repeated patterns:

entry  pattern
1      I took a walk in town one day
2      And met a
3      along the way
4      What do you think that
5      did say?
6      cat
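In the same spirit, a toy LZW-style coder in Python (our own sketch, not the exact variant used by any particular tool): it grows the dictionary as it scans, so repeated phrases collapse into single codes:

```python
def lzw_encode(text):
    # start with all single characters that occur in the text
    dictionary = {ch: i for i, ch in enumerate(sorted(set(text)))}
    out, current = [], ""
    for ch in text:
        if current + ch in dictionary:
            current += ch                    # keep growing the match
        else:
            out.append(dictionary[current])  # emit code for longest match
            dictionary[current + ch] = len(dictionary)  # learn new pattern
            current = ch
    out.append(dictionary[current])
    return out, dictionary

codes, d = lzw_encode("meow meow meow")
print(codes)  # repeated "meow " quickly collapses into fewer, reused codes
```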
Example: raw video data rate

Frame Size         Frame Rate      Bits/pixel   Bit-rate (bps)   File Size (Bytes)
176 x 144 pixels   30 frames/sec   12           9,123,840        68,428,800

(The bit-rate is 176 × 144 × 12 × 30 = 9,123,840 bps; the file size corresponds
to one minute of video: 9,123,840 × 60 / 8 = 68,428,800 bytes.)
Compression

• Objective:
– to reduce the data size
• Approach:
– reduce redundancy
• Uncompressed multimedia objects are large. Objects are
kept in compressed form to
– Save storage space
– Save retrieval bandwidth
– Allow decompression and retrieval in parallel
– Save processing time
• Example: To transmit a digitized color 35 mm slide scanned at
3,000 x 2,000 pixels and 24 bits/pixel over a 28.8 kbaud line would take about

$\frac{3000 \times 2000 \ \text{pixels} \times 24 \ \text{bits/pixel}}{28.8 \times 1024 \ \text{bits/second}} \approx 4883 \ \text{seconds} \approx 81 \ \text{minutes} \approx 1.35 \ \text{hours}$
Lossless vs Lossy Compression
• If the compression and decompression
processes induce no information loss,
then the compression scheme is
lossless; otherwise, it is lossy.
• compression ratio = B0/B1
– B0 = number of bits before compression
– B1 = number of bits after compression
Compression Techniques
• Lossless compression
– Used on programs, DB records, critical
information
• Lossy compression
– Used on images, audio, video, non-critical
data
• Hybrid compression
– JPEG, JPEG-LS, JPEG 2000, MPEG-1,
MPEG-2
Lossless Compression
• Encode into a form to represent the
original in fewer bits
• The original representation can be
perfectly recovered
• Compression ratios:
– Text 2:1
– Bilevel images 15:1
– Facsimile transmission 50:1
Lossless (Noiseless) Methods

• Huffman coding
• Arithmetic coding
• Lempel-Ziv coding
• Run-length coding
Lossy Compression

• Can be decoded into a representation that humans find similar to
the original
• Compression ratios:
– JPEG image 15:1
– MPEG video 200:1

Lossy methods (taxonomy):
• Predictive: motion compensation
• Frequency oriented: transform, filtering, subband
• Importance oriented: subsampling, quantization, bit allocation
• Hybrid: JPEG, MPEG, JPEG 2000
Why is Compression Possible?

• Information redundancy
• Attempt to eliminate redundancy

Uncompressed text: ABCCCCCCCCDEFGGG
Run-Length Encoder output: AB8CDEF3G
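A minimal run-length encoder sketch matching the example above (our own code; since the slides do not spell out the exact rule, we assume runs of three or more characters are replaced by "<count><char>"):

```python
from itertools import groupby

def rle(text):
    out = []
    for ch, grp in groupby(text):
        n = len(list(grp))
        # assumed rule: compress only runs of 3+; shorter runs pass through
        out.append(f"{n}{ch}" if n >= 3 else ch * n)
    return "".join(out)

print(rle("ABCCCCCCCCDEFGGG"))  # -> AB8CDEF3G
```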
Entropy

• Suppose:
– a data source generates output sequence from a set
{A1, A2, …, AN}
– P(Ai): probability of Ai
• First-Order Entropy:
– the average self-information of the data set

$H = -\sum_i P(A_i) \log_2 P(A_i)$
• These numbers are then stored in the RLC compressed file as:
10,1,7,2,10,2,8,3,10,2,6,1,10,2,12,3,10,1,6,1,10,3,12,3,0,3,10,3,0,2,5,3,0,5,3,10,2,9,2,10,1,5,3,4,3,0,2,0,8
• Compression ratio = 64x4 bits : 49x4 bits ≈ 1.3 : 1
Variable Length Codes (VLC)

Since the entropy indicates the information content in an information source
S, it leads to a family of coding methods commonly known as entropy coding
methods. VLC is one of the best-known such methods.
Adaptive Huffman Coding

• Huffman coding requires prior statistical
knowledge about the information source, and
such information is often unavailable, e.g., for live
streaming.
• An adaptive Huffman coding algorithm can be
used, in which statistics are gathered and
updated dynamically as the data stream
arrives.
– The probabilities are no longer based on prior
knowledge but on the actual data received so far.
Huffman Coding

• Idea: a code outputs short codewords for
likely symbols and long codewords for
rare symbols
• Objective: to reduce the average length of
codewords

Symbol  Probability  Codeword
a       0.05         0000
b       0.05         0001
c       0.1          001
d       0.2          01
e       0.3          10
f       0.2          110
g       0.1          111
Huffman Tree

a     b     c     d     e     f     g
0.05  0.05  0.1   0.2   0.3   0.2   0.1

Build the tree by repeatedly merging the two least probable nodes:
1. merge a, b           → 0.1
2. merge (ab), c        → 0.2
3. merge f, g           → 0.3
4. merge (abc), d       → 0.4
5. merge e, (fg)        → 0.6
6. merge (abcd), (efg)  → 1.0 (root)
[Diagram: JPEG DCT-based encoder — uncompressed block → Forward
Discrete Cosine Transform → Quantizer → Entropy Encoder → compressed
block; the quantizer and the entropy encoder are each driven by a
table specification.]
FDCT output

• After preprocessing, each 8x8 source block is
transformed into a 64-point discrete signal
which is a function of two spatial dimensions
x and y (spatial frequencies, or DCT
coefficients)
• Input range [-255, 255]; output range [-2048, 2048]
• One DC coefficient: F(0,0)
• 63 AC coefficients: F(0,1), F(0,2), …, F(7,7)
• In a typical 8x8 block, most of the spatial
frequencies have zero or near-zero amplitude
and need not be encoded after quantization;
this is the foundation for compression
DCT Coefficients Example

source block:

217 216 217 222 228 234 233 232
216 215 217 221 224 227 229 229
216 216 217 220 221 223 226 226
217 217 219 219 221 222 223 226
219 219 219 219 221 222 223 225
221 220 219 218 221 221 223 223
226 223 222 222 223 223 224 222
233 229 225 224 225 226 225 222

DCT coefficients (DC coefficient 1779 at the top-left):

1779  19.3   4.0  -5.2   0.6  -0.6   0.1  -0.6
-3.0  15.7  -2.4   0.4   0.2  -0.1   0.4   0.8
14.7  -0.1  -0.5  -4.2  -0.5  -0.9  -0.2   0.6
-0.8   3.5  -0.2  -0.8  -0.2   0.8   0.2  -0.3
 5.6  -0.3   0.1  -0.9  -0.3   0.9   0.8   0.5
-0.2   0.7  -0.6   0.0   0.0  -0.1   0.3  -0.1
 0.0  -0.1   0.5  -0.6  -0.5   0.2  -0.4   0.1
 0.7   0.1  -0.2  -0.2  -0.1  -0.4  -0.3   0.4
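A sketch reproducing the DC coefficient above (assuming NumPy and SciPy are available; with the orthonormal 2-D DCT, F(0,0) equals 8 times the block mean):

```python
import numpy as np
from scipy.fft import dctn

block = np.array([
    [217,216,217,222,228,234,233,232],
    [216,215,217,221,224,227,229,229],
    [216,216,217,220,221,223,226,226],
    [217,217,219,219,221,222,223,226],
    [219,219,219,219,221,222,223,225],
    [221,220,219,218,221,221,223,223],
    [226,223,222,222,223,223,224,222],
    [233,229,225,224,225,226,225,222],
], dtype=float)

F = dctn(block, norm="ortho")  # 2-D type-II DCT
print(round(F[0, 0]))          # DC coefficient -> 1779
```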
AC Quantized Coefficients

• Order the coefficients into a "zig-zag"
sequence, from DC and AC01 at the top-left
to AC77 at the bottom-right
• Helps to facilitate
entropy coding by
placing low-frequency
coefficients (more likely
to be non-zero) before
high-frequency
coefficients
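A sketch (our own helper, `zigzag_order`) generating the scan order: coefficients are grouped by anti-diagonal (constant x + y), alternating direction, which is the standard JPEG zig-zag pattern:

```python
def zigzag_order(n=8):
    order = []
    for s in range(2 * n - 1):                # s = x + y, one anti-diagonal
        diag = [(x, s - x) for x in range(n) if 0 <= s - x < n]
        order.extend(diag if s % 2 else list(reversed(diag)))
    return order

order = zigzag_order()
print(order[:6])   # -> [(0, 0), (0, 1), (1, 0), (2, 0), (1, 1), (0, 2)]
print(order[-1])   # -> (7, 7)
```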
Probability Distribution of DCT Coefficients

[Figure: probability of each coefficient being non-zero vs. zig-zag index
(0-63); the probability falls from near 1.0 at the lowest-frequency
indices toward 0.0 at the highest.]
Entropy Coding
• JPEG specifies 2 entropy coding
methods
– Huffman coding
– Arithmetic coding
• The baseline sequential codec uses
Huffman coding
• Entropy coding is a 2-step process
1. The quantized coefficients are converted into
intermediate sequence of symbols
2. Symbols are converted into a data stream
JPEG: Modes of Operations
• Sequential Encoding
– Each image component is encoded in a single left-to-right,
top-to-bottom scan
• Progressive Encoding
– The image is encoded in multiple scans for applications in
which transmission time is long
• Lossless Encoding
– The image is encoded to guarantee exact recovery of every
source image sample value
– Low compression ratio
• Hierarchical Encoding (also called Layered encoding)
– The image is encoded at multiple resolutions
– Lower resolution versions may be accessed without first
having to decompress the image at its full resolution
JPEG Picture Quality

For colour images with moderately complex scenes:

• 0.25 – 0.5 bits/pixel: Moderate to Good quality
• 0.5 – 0.75 bits/pixel: Good to Very Good quality
• 0.75 – 1.5 bits/pixel: Excellent quality
• 1.5 – 2.0 bits/pixel: Indistinguishable from the original
JPEG Sequential Encoding
Summary
• An image is divided into components
• Each component is divided into blocks with 8x8
pixels in each block
• Each block goes through
– Forward Discrete Cosine Transform (FDCT)
– The transformed values go through a quantizer
– The quantized values are fed through an entropy
encoder
• The resultant code stream is the compressed
image
Block Transform Encoding

block → DCT → Quantize → Zig-zag → Run-length Code → Huffman Code → 011010001011101...
Block Encoding

original image (4x4 block):    DCT:                     quantized:
139 144 149 153                1260  -1  -12  -5        79  0  -1   0
144 151 153 156                 -23 -17   -6  -3        -2 -1   0   0
150 155 160 163                 -11  -9   -2   2        -1 -1   0   0
159 161 162 160                  -7  -2    0   1         0  0   0   0

The top-left entry (1260, quantized to 79) is the DC component;
the remaining entries are the AC components.

zig-zag scan: 79 0 -2 -1 -1 -1 0 0 -1 0 0 0 0 0 0 0

run-length code (zero-run, value):
(0, 79) (1, -2) (0, -1) (0, -1) (0, -1) (2, -1) (0, 0 = end of block)

Huffman code → coded bitstream 10011011100011... : < 10 bits (0.55 bits/pixel)
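A sketch (our own helper) producing the (zero-run, value) pairs above from the zig-zag sequence, with (0, 0) as the end-of-block marker; note that real JPEG codes the DC term separately (predictively), but here we follow the slide's table:

```python
def run_length_pairs(zz):
    pairs, run = [], 0
    for v in zz[1:]:            # AC coefficients after the DC term
        if v == 0:
            run += 1            # count the zero run
        else:
            pairs.append((run, v))
            run = 0
    # DC first (as in the table above); trailing zeros become the EOB marker
    return [(0, zz[0])] + pairs + [(0, 0)]

zz = [79, 0, -2, -1, -1, -1, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0]
print(run_length_pairs(zz))
# -> [(0, 79), (1, -2), (0, -1), (0, -1), (0, -1), (2, -1), (0, 0)]
```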
Result of Coding/Decoding

original block:      reconstructed block:
139 144 149 153      144 146 149 152
144 151 153 156      148 150 152 154
150 155 160 163      155 156 157 158
159 161 162 160      160 161 161 162

errors (original minus reconstructed):
-5 -2 0  1
-4  1 1  2
-5 -1 3  5
-1  0 1 -2
[Diagram: JPEG 2000 encoder — Original Image → Preprocessing →
Forward Intercomponent Transform → Forward Intracomponent Transform →
Quantization → Tier-1 Encoder → Tier-2 Encoder → Coded Image, with
rate control applied to the encoding stages.]

[Figure: non-progressive vs. progressive decoding of the same image.]
JP2 File Format

• A JP2 file contains a number of boxes
• Each box has
– LBox (32 bits): box length
– TBox (32 bits): box type
– XLBox (64 bits): true length of the box when LBox is 1
– DBox (variable): box data
• The valid file structure is shown below:

JPEG-2000 signature box
File type box
JP2 header box
  Image header box
  Color spec box
  …
Contiguous code stream box
…
Summary of JPEG 2000

• JPEG 2000 compression standard
– Is a hybrid compression method for continuous-tone
image compression
– Implements compression for low bit rates (50:1)
– Compresses each 4096-pixel tile using the 2D
wavelet transform, quantization, multiple coding passes,
progression, rate control, and region-of-interest coding.
From JPEG to JPEG2000