
U.C. Berkeley CS174: Randomized Algorithms
Professor Luca Trevisan
Lecture Note 11
April 15, 2003

Entropy and Data Compression


1  Introduction and Simple Cases

We consider the lossless data compression problem. We have a random variable X ranging over the possible files that we want to compress, and we want to devise algorithms C and D such that the decompression algorithm D inverts the compression algorithm C,

$$\Pr_{x \sim X}[D(C(x)) = x] = 1,$$

and such that the average length E[|C(X)|] of the compressed file is as small as possible. We assume that the output of C() is a sequence of bits; however, we let the input of C() be a sequence of characters coming from an alphabet $\Sigma$ that may not necessarily be {0, 1}.
The existence of a decompressing procedure D() implies that C() is an injective mapping, and this is a minimal requirement for any compression algorithm. Often, we will make an additional requirement on the output of C(): that there are no two inputs x, x' such that C(x) is a prefix of C(x'). This way, while we scan a file C(x) we know when it ends without the need for a special end-of-file marker. A mapping with such a property is called prefix-free. Every mapping can be made prefix-free with a moderate increase in length. Also note that if all the outputs of C() have the same length, then C() is trivially prefix-free.
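To make the prefix-free condition concrete, here is a small illustrative check (my own sketch, not part of the original notes) that tests whether a finite code table is prefix-free by comparing lexicographically adjacent codewords; a prefix of a codeword always sorts immediately before its extensions.

```python
def is_prefix_free(codewords):
    """Return True if no codeword is a prefix of another codeword."""
    codewords = sorted(codewords)  # a prefix of w sorts just before extensions of w
    for shorter, longer in zip(codewords, codewords[1:]):
        if longer.startswith(shorter):
            return False
    return True

# A fixed-length code is trivially prefix-free.
print(is_prefix_free(["00", "01", "10", "11"]))  # True
# "0" is a prefix of "01", so this code is not prefix-free.
print(is_prefix_free(["0", "01", "11"]))         # False
```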
Exercise 1  Show that if C() is an injective mapping then there is a prefix-free mapping C'() such that for every input x we have $|C'(x)| \leq 2\log|C(x)| + |C(x)|$.
A first observation is that no data compression algorithm can possibly reduce the length of
all its possible inputs.
Exercise 2  Prove that if $C : \{0,1\}^n \to \{0,1\}^*$ is an encoding algorithm that allows lossless decoding (i.e., if C is an injective function mapping n bits into a sequence of bits), then there is an input $x \in \{0,1\}^n$ such that $|C(x)| \geq n$.
More generally, a random input is essentially incompressible.
Theorem 1  Let $C : \{0,1\}^n \to \{0,1\}^*$ be an encoding algorithm that allows lossless decoding (i.e., let C be an injective function mapping n bits into a sequence of bits). Let X be uniform in $\{0,1\}^n$. Then, for every t,

$$\Pr[|C(X)| \leq n - t] \leq \frac{1}{2^{t-1}}.$$

For example, there is less than one chance in a million of compressing an input file of n bits into an output file of length n - 21, and less than one chance in eight million that the output will be at least 3 bytes shorter than the input.
Proof: We can write

$$\Pr[|C(X)| \leq n - t] = \frac{|\{f : |C(f)| \leq n - t\}|}{2^n}.$$

Regarding the numerator, the set in question is mapped by the injective function C() into the set of strings of length n - t or less, so its size is no more than $\sum_{l=1}^{n-t} 2^l$, which is at most $2^{n-t+1} - 2 < 2^{n-t+1} = 2^n/2^{t-1}$.
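As a quick numerical check of this bound (my own sketch, not part of the notes), the following counts the possible short outputs exactly and compares the resulting probability with $1/2^{t-1}$, matching the "one in a million" and "one in eight million" figures above.

```python
from fractions import Fraction

def compression_probability_bound(n, t):
    """Exact upper bound on Pr[|C(X)| <= n - t] obtained by counting short output strings."""
    short_outputs = sum(2**l for l in range(1, n - t + 1))  # strings of length 1..n-t
    return Fraction(short_outputs, 2**n)

n = 1000
for t in (21, 24):
    exact = compression_probability_bound(n, t)
    print(t, float(exact), 1 / 2**(t - 1))  # the exact value is just below 1/2^(t-1)
```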

In practice, however, compression algorithms work, and they can reduce the size of most files by a significant factor. The reason is that most files are not completely random. For example, if X is a distribution of texts written in English, then it cannot possibly be the uniform distribution over n-bit strings, because not all n-bit strings are ASCII encodings of meaningful English sentences. If X is uniform over a set of possible messages, then we can still find a lower bound on the encoding length achievable by any compression algorithm, and see that, at least in principle, the lower bound can be matched.
Theorem 2  Let $F \subseteq \Sigma^n$ be a set of possible files, and let X be the uniform distribution over F. Then,

1. For every injective function $C : \Sigma^n \to \{0,1\}^*$ there is a file $f \in F$ such that $|C(f)| \geq \lfloor\log_2 |F|\rfloor$.

2. For every injective function $C : \Sigma^n \to \{0,1\}^*$ and every integer t, we have
$$\Pr[|C(X)| \leq (\log_2 |F|) - t] \leq \frac{1}{2^{t-1}}.$$

3. For every prefix-free mapping $C : \Sigma^n \to \{0,1\}^*$ we have $E[|C(X)|] \geq \log_2 |F|$.

4. There is a prefix-free mapping $C : \Sigma^n \to \{0,1\}^*$ such that $E[|C(X)|] = \lceil\log_2 |F|\rceil$.

Proof: For the first part, let $k = \lfloor\log_2 |F|\rfloor$. There are only $2^k - 1 < |F|$ binary strings of length at most k - 1, and so the set of possible outputs of C() on inputs from F must contain some string of length at least k.

For the second part, let $k = \lfloor\log_2 |F|\rfloor - t$. There are at most $2^{k+1} - 1 < |F|/2^{t-1}$ binary strings of length at most k, and so at most that many elements of F can be mapped by C() into a string of length at most k; dividing by |F| gives the claimed probability bound.

The third part has a fairly complicated proof that we omit.

For the fourth part, let us fix an arbitrary order among the elements of F. If x is the i-th element of F (counting from 0), then we let C(x) be the binary representation of i. The outputs of C() represent numbers in the range $0, \ldots, |F|-1$, and so $\lceil\log_2 |F|\rceil$ bits are enough. We add leading zeroes in front of each number, so that all the outputs of C() have the same length $\lceil\log_2 |F|\rceil$. When all outputs have the same length, the mapping is necessarily prefix-free.
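The construction in the fourth part takes only a few lines of code. The sketch below is my own illustration (not from the notes): it fixes an order on a finite set F and encodes each element by its index written with $\lceil\log_2 |F|\rceil$ bits, so all codewords have the same length and the code is prefix-free.

```python
import math

def fixed_length_code(F):
    """Map each element of the finite set F to a distinct ceil(log2|F|)-bit string."""
    files = sorted(F)                          # fix an arbitrary (here: sorted) order
    k = max(1, math.ceil(math.log2(len(files))))
    encode = {f: format(i, "0{}b".format(k)) for i, f in enumerate(files)}
    decode = {code: f for f, code in encode.items()}
    return encode, decode

encode, decode = fixed_length_code(["aa", "ab", "ba", "bb", "ca"])
print(encode)                    # five files, ceil(log2 5) = 3 bits each
print(decode[encode["ba"]])      # round-trips to "ba"
```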

2  Sequences with Fixed Number of Occurrences

Suppose now that we are in the following setting:

- the file contains n characters;
- the characters come from an alphabet of size c;
- character i has probability p(i) of appearing in the file.

What can we say about the probable and expected length of the output of an encoding algorithm? We will analyse this setting in the next section, but let us first consider a slightly different setting that is simpler to analyse. Consider the set F of files such that

- the file contains n characters;
- the characters come from an alphabet of size c;
- character i occurs $n \cdot p(i)$ times in the file.

When we pick a random element of F, what is the best compression we can achieve? By Theorem 2 we need to compute $\log_2 |F|$. We will show that $\log_2 |F|$ is roughly $n \sum_i p(i) \log_2 1/p(i)$, and so a random element of F cannot be compressed to fewer than $n \sum_i p(i) \log_2 1/p(i)$ bits. In the next section, we show that a similar result also holds for the first setting. Let us call $f(i) = n \cdot p(i)$ the number of occurrences of character i in the file.
Lemma 3

$$|F| = \frac{n!}{f(1)!\,f(2)!\cdots f(c)!}$$

Here is a sketch of the proof of this formula. There are n! permutations of n characters, but many are the same because there are only c different characters. In particular, the f(1) appearances of character 1 are the same, so all f(1)! orderings of these locations are identical. Thus we need to divide n! by f(1)!. The same argument leads us to divide by all the other f(i)!.
Now we have an exact formula for |F|, but it is hard to interpret, so we replace it using Stirling's formula for approximating n!:

$$n! = \Theta\!\left(\sqrt{n}\left(\frac{n}{e}\right)^n\right)$$

Stirling's formula is accurate for large arguments, so we will be interested in approximating |F| for large n. Here goes:

$$|F| = \frac{n!}{(p(1)n)!\,(p(2)n)!\cdots(p(c)n)!}$$
$$\approx \frac{\sqrt{n}\,(n/e)^n}{\sqrt{p(1)n}\,(p(1)n/e)^{p(1)n}\cdots\sqrt{p(c)n}\,(p(c)n/e)^{p(c)n}}$$
$$= \sqrt{\frac{n}{p(1)p(2)\cdots p(c)\,n^c}}\cdot\frac{n^n\,e^{-n}}{p(1)^{p(1)n}\cdots p(c)^{p(c)n}\;n^{p(1)n+\cdots+p(c)n}\;e^{-p(1)n-\cdots-p(c)n}}$$
$$= \sqrt{\frac{1}{n^{c-1}\,p(1)\cdots p(c)}}\cdot\frac{1}{p(1)^{p(1)n}\cdots p(c)^{p(c)n}},$$

where in the last step we used $\sum_i p(i) = 1$ to cancel $n^n e^{-n}$ against $n^{p(1)n+\cdots+p(c)n}\,e^{-p(1)n-\cdots-p(c)n}$. So

$$\log_2 |F| = n\left(\sum_i p(i)\log_2\frac{1}{p(i)}\right) + \Theta(\log n).$$

This means that, up to lower order terms, $\sum_i p(i)\log_2\frac{1}{p(i)}$ is the number of bits per character used in an optimal encoding.
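As a quick numerical check of this approximation (my own sketch, not part of the notes), the following computes $\log_2$ of the exact multinomial coefficient and compares it with $n\sum_i p(i)\log_2(1/p(i))$ for a choice of n and p(i) where the counts $n \cdot p(i)$ are integers.

```python
import math

def log2_multinomial(counts):
    """log2 of n!/(f(1)! ... f(c)!) for the given occurrence counts f(i)."""
    n = sum(counts)
    return (math.lgamma(n + 1) - sum(math.lgamma(f + 1) for f in counts)) / math.log(2)

n = 100000
p = [0.5, 0.25, 0.25]
counts = [int(n * q) for q in p]              # f(i) = n * p(i), chosen to be integers
entropy = sum(q * math.log2(1 / q) for q in p)

print(log2_multinomial(counts))               # exact log2|F|
print(n * entropy)                            # n * sum_i p(i) log2(1/p(i))
# The two values agree up to a Theta(log n) lower-order term.
```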

3  Sequences of Independent and Identically Distributed Random Variables

Now we go back to the setting in which X is a sequence of n independent and identically distributed characters $X_1, \ldots, X_n$, and $\Pr[X_j = i] = p(i)$.

Here is the intuition for the calculation. When we pick a file according to the above distribution, very likely there will be about $p(i) \cdot n$ characters equal to i. Each file with these typical frequencies has a probability of about $p = \prod_i p(i)^{p(i)n}$ of being generated. Since files with typical frequencies make up almost all the probability mass, there must be about $1/p = \prod_i (1/p(i))^{p(i)n}$ files of typical frequencies. Now, we are in a setting which is similar to the one of Theorem 2, where F is the set of files with typical frequencies. We then expect the encoding to be of length at least $\log_2 \prod_i (1/p(i))^{p(i)n} = n\sum_i p(i)\log_2(1/p(i))$.
The quantity $\sum_i p(i)\log_2 1/p(i)$ is the expected number of bits that it takes to encode each character, and is called the entropy of the distribution over the characters. The notion of entropy, the discovery of several of its properties, (a formal version of) the calculation above, as well as an (inefficient) optimal compression algorithm, and much, much more, are due to Shannon, and appeared in the late 1940s in one of the most influential research papers ever written.
First, we give a statement of the law of large numbers, which says that the average of a sequence of independent random variables is concentrated around its expectation.

Lemma 4 (Law of Large Numbers)  Let Z be a random variable with expectation $\mu$ and finite variance, and let $Z_1, \ldots, Z_n, \ldots$ be independent copies of Z. Then for every $\varepsilon > 0$

$$\lim_{n\to\infty} \Pr\left[\left|\frac{1}{n}\sum_{j=1}^n Z_j - \mu\right| > \varepsilon\right] = 0.$$

Proof: Let v be the variance of Z. Then the variance of $Z_1 + \cdots + Z_n$ is $vn$ and its expectation is $\mu n$. By Chebyshev's inequality

$$\Pr\left[\left|\sum_{j=1}^n Z_j - \mu n\right| > \varepsilon n\right] \leq \frac{vn}{\varepsilon^2 n^2} = \frac{v}{\varepsilon^2 n},$$

and the right-hand side tends to zero when n tends to infinity.

Now, for each character $X_j$ of our random file, consider the random variable $Z_j = \log_2 1/p(X_j)$. Clearly, the random variables $Z_1, \ldots, Z_n$ are independent, identically distributed and have finite variance. Their expectation is

$$\sum_i p(i)\log_2\frac{1}{p(i)},$$

and we define the entropy of p, denoted by H(p), to be the above average.
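As an illustration of this definition and of the concentration that Lemma 4 provides, here is a small simulation sketch (mine, not from the notes): it computes H(p) for a toy distribution and checks that the empirical average of $\log_2 1/p(X_j)$ over a long random file is close to H(p).

```python
import math
import random

def entropy(p):
    """H(p) = sum_i p(i) * log2(1/p(i))."""
    return sum(q * math.log2(1 / q) for q in p if q > 0)

p = [0.5, 0.25, 0.125, 0.125]    # toy distribution over a 4-character alphabet
n = 100000

# Sample a random file X_1, ..., X_n and average Z_j = log2(1/p(X_j)).
file_chars = random.choices(range(len(p)), weights=p, k=n)
empirical_average = sum(math.log2(1 / p[i]) for i in file_chars) / n

print(entropy(p))            # 1.75
print(empirical_average)     # close to 1.75 for large n, as the law of large numbers predicts
```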


Now, for every $\varepsilon > 0$ and $\delta > 0$, we have that, for sufficiently large n,

$$\Pr\left[\left|\sum_{j=1}^n Z_j - nH(p)\right| \geq \varepsilon n\right] \leq \delta.$$

Say that a file $(a_1, \ldots, a_n)$ is typical when

$$\left|\sum_{j=1}^n \log_2\frac{1}{p(a_j)} - nH(p)\right| \leq \varepsilon n.$$

Then a random file has probability at least $1 - \delta$ of being typical.

If $(a_1, \ldots, a_n)$ is typical, then it is generated with probability at least $2^{-n(H(p)+\varepsilon)}$ (why?), and so there are at most $2^{n(H(p)+\varepsilon)}$ typical files (why?). In particular, each typical file can be represented using $\lceil n(H(p)+\varepsilon)\rceil$ bits.
Consider the following prefix-free encoding algorithm C. Given an input $(a_1, \ldots, a_n)$, C checks whether $(a_1, \ldots, a_n)$ is typical. If so, and if $(a_1, \ldots, a_n)$ is the t-th typical string in lexicographic order, then C outputs $0, \langle t\rangle$, where $\langle t\rangle$ is the binary representation of t using $\lceil n(H(p)+\varepsilon)\rceil$ bits. Otherwise, C outputs $1, a_1, \ldots, a_n$.

Overall, up to lower order terms, the average encoding length is at most

$$(1-\delta)\,n(H(p)+\varepsilon) + \delta\,n\log_2 c.$$

For larger and larger n, we can make $\varepsilon$ and $\delta$ arbitrarily small, and thus the encoding length per character can get arbitrarily close to H(p) bits.
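The algorithm C just described can be written out directly, although enumerating all typical strings takes exponential time, so the sketch below (my own illustration, with arbitrarily chosen toy parameters) is only meant to make the definitions concrete: a typical file is encoded by a flag bit followed by its rank among the typical strings, and a non-typical file is output verbatim after a flag bit.

```python
import itertools
import math

p = {"a": 0.75, "b": 0.25}   # character distribution (toy example)
n = 8                        # file length (tiny, so that we can enumerate all strings)
eps = 0.3                    # the epsilon in the definition of "typical"
H = sum(q * math.log2(1 / q) for q in p.values())

def is_typical(s):
    """|sum_j log2(1/p(s_j)) - n*H| <= eps*n."""
    return abs(sum(math.log2(1 / p[ch]) for ch in s) - n * H) <= eps * n

# All typical strings in lexicographic order (exponential-time enumeration, illustration only).
typical = [s for s in itertools.product(sorted(p), repeat=n) if is_typical(s)]
index_bits = max(1, math.ceil(n * (H + eps)))

def encode(s):
    s = tuple(s)
    if is_typical(s):
        return "0" + format(typical.index(s), "0{}b".format(index_bits))
    return "1" + "".join(s)            # non-typical: flag bit plus the raw characters

print(len(typical), 2 ** index_bits)   # typical strings fit into 2^ceil(n(H+eps)) indices
print(encode("aaabaaba"))              # a typical file gets a fixed-length codeword
# For such a tiny n the codeword is not yet shorter than the raw file;
# the savings of roughly H(p) + eps bits per character appear as n grows.
```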
It is more difficult, but possible, to show that the average encoding length of any prefix-free encoding must be at least nH(p). It is easy, however, to show that the encoding length can be much smaller than nH(p) only with low probability.

Exercise 3  Show that, for an arbitrary injective procedure C, and for n, $\varepsilon$, $\delta$ as above,

$$\Pr[|C(X_1,\ldots,X_n)| < n(H(p)-\varepsilon) - \log_2(1/\delta)] \leq 2\delta.$$

4  Entropy of General Distributions

More generally, consider the case of an arbitrary distribution over strings of length n formed
with characters coming from a finite set of c possible characters. (In the previous section
we studied the special case in which each character is sampled independently according to
the same fixed distribution.)
For a string $A = (a_1, \ldots, a_n) \in \Sigma^n$ let $p(a_1, \ldots, a_n)$ be its probability according to such a distribution. Then we define the entropy of p to be

$$H(p) = \sum_{A \in \Sigma^n} p(A)\log_2\frac{1}{p(A)}.$$
Exercise 4  Let p() be a probability distribution over $\Sigma$. Let $\bar{p}$ be a probability distribution over $\Sigma^n$ defined by picking each of the n symbols independently according to p(); that is, let $\bar{p}(a_1,\ldots,a_n) = p(a_1)\cdots p(a_n)$. Prove that $H(\bar{p}) = nH(p)$.

Theorem 5  Let X be a random variable ranging over $\Sigma^n$ having distribution p(). Then for every C() that computes a prefix-free encoding of X we have $E[|C(X)|] \geq H(p)$.

5  Arithmetic Encoding

In this section we give a way of encoding arbitrary distributions in an essentially optimal way.

Let $X = (X_1, \ldots, X_n)$ be distributed among strings of length n over a finite alphabet $\Sigma$ of c characters. (For example, c = 2 and $\Sigma = \{0, 1\}$.)

Let us denote by $\preceq$ the lexicographic order among elements of $\Sigma^n$, denote by $A - 1$ the string that comes before A in the lexicographic order, and let p() be a distribution over $\Sigma^n$. Define

$$f(A) = \sum_{B \preceq A} p(B)$$

to be the total probability of strings that come before A (A included), and suppose that f() is efficiently computable.

We write down f(A) and f(A - 1) in binary, and then we encode f(A) as the shortest prefix of the binary expansion of f(A) that is bigger than f(A - 1). (If A is the lexicographically first string, then we use the shortest prefix of the binary expansion of f(A) that is bigger than zero.)

How many bits do we need to encode f(A)? Say that we need k bits: then this means that f(A) and f(A - 1) agree in the first k - 1 bits of their binary expansion, and so $f(A) - f(A-1) < 2^{-(k-1)}$. But, by definition, $f(A) - f(A-1) = p(A)$, and so if we need k bits to encode f(A) it means that $p(A) < 2^{-(k-1)}$. Put another way, the number of bits needed to encode A is at most $\log_2(1/p(A)) + 1$.
The average length of the encoding is then at most

$$\sum_A p(A)\left(1 + \log_2\frac{1}{p(A)}\right) = 1 + H(p).$$

Also, note that the encoding is prefix-free.


What about decoding? Once we receive the encoding of A, we can view it as a number $\tilde{f}$ such that $f(A-1) < \tilde{f} \leq f(A)$, and remember that we assumed that f() is efficiently computable. Then we can use binary search to find A in time logarithmic in $|\Sigma^n|$, that is, linear in n.

For this method to be applicable, we need to be able to compute the function f().

Exercise 5  Prove that if p() describes a sequence of independent and identically distributed characters, then f() is efficiently computable.
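For the i.i.d. case of Exercise 5, f() can indeed be computed exactly, and the "shortest binary prefix exceeding f(A - 1)" rule is short to write down. The following is a rough sketch under those assumptions (my own illustration with a toy two-character alphabet, exact but not optimized), using Python's Fraction type for the probabilities; the only liberty taken is writing the value 1 as 0.111... so that codewords are always read after the binary point.

```python
from fractions import Fraction

alphabet = "ab"                                    # toy alphabet, in lexicographic order
p = {"a": Fraction(3, 4), "b": Fraction(1, 4)}     # i.i.d. character distribution

def prob(A):
    """Probability of the string A under the i.i.d. model."""
    result = Fraction(1)
    for ch in A:
        result *= p[ch]
    return result

def f(A):
    """Total probability of strings of length len(A) that are lexicographically <= A."""
    total, prefix_prob = Fraction(0), Fraction(1)
    for ch in A:
        # strings that agree with A so far and then continue with a smaller character
        total += prefix_prob * sum((p[c] for c in alphabet if c < ch), Fraction(0))
        prefix_prob *= p[ch]
    return total + prefix_prob                     # include A itself

def encode(A):
    """Shortest k-bit prefix of the binary expansion of f(A) that exceeds f(A - 1)."""
    hi, lo = f(A), f(A) - prob(A)                  # lo = f(A - 1), or 0 for the first string
    k = 1
    while True:
        v = min(int(hi * 2**k), 2**k - 1)          # k-bit truncation of the expansion of hi
        if Fraction(v, 2**k) > lo:
            return format(v, "0{}b".format(k))
        k += 1

A = "aabab"
print(f(A), prob(A), encode(A))   # the codeword has at most log2(1/p(A)) + 1 bits
```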

6  Other Encoding Methods

What if the distribution is not known, or if the function f() is not efficiently computable? The Lempel-Ziv algorithm, on which most commercial compression algorithms are based, works without knowing the distribution of the input, and it discovers regularities as it reads the input.

It can be proved (and it is quite difficult) that if p() describes a sequence of independent and identically distributed characters, or if p() describes the sequence of states encountered while running a small Markov chain, then the average output length of the Lempel-Ziv algorithm is close to the entropy of the given distribution.
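The notes do not spell the algorithm out, but as a rough illustration of "discovering regularities while reading the input", here is a minimal LZ78-style parser (one member of the Lempel-Ziv family; a sketch of the parsing step only, not the variant used in any particular commercial tool). It splits the input into phrases, each consisting of a previously seen phrase plus one new character, so repetitive inputs produce few phrases.

```python
def lz78_parse(text):
    """Split text into phrases, each equal to an earlier phrase plus one new character."""
    dictionary = {"": 0}            # phrase -> index; the empty phrase has index 0
    output, current = [], ""
    for ch in text:
        if current + ch in dictionary:
            current += ch           # keep extending the longest already-known phrase
        else:
            output.append((dictionary[current], ch))   # (index of known prefix, new character)
            dictionary[current + ch] = len(dictionary)
            current = ""
    if current:                     # leftover phrase, already present in the dictionary
        output.append((dictionary[current[:-1]], current[-1]))
    return output

# A highly repetitive input is parsed into few phrases, each described by a small pair.
print(lz78_parse("abababababababab"))
```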
