CS174: Note11
Lecture Note 11
April 15, 2003
\[
\frac{1}{2^{t-1}}
\]
For example, there is less than one chance in a million of compressing an input file of n bits
into an output file of length n − 21 (take t = 21 in the bound above), and less than one chance
in eight million that the output will be at least 3 bytes shorter than the input (take t = 24).
Proof: We can write
\[
\Pr[\,|C(X)| \leq n - t\,] = \frac{|\{x \in \{0,1\}^n : |C(x)| \leq n-t\}|}{2^n} \leq \frac{2^{n-t+1}-1}{2^n} < \frac{1}{2^{t-1}} ,
\]
where the first inequality holds because C is injective and there are fewer than 2^{n−t+1} binary strings of length at most n − t.
In practice, however, compression algorithms work, and they can reduce the size of most
files by a significant factor. The reason is that most files are not completely random. For
example, if X is a distribution of texts written in English, then it cannot possibly be the
uniform distribution over n-bit strings, because not all n-bit strings are ASCII encodings of
meaningful English sentences. If X is uniform over a set of possible messages, then we can
still find a lower bound on the encoding length achievable by any compression algorithm,
and see that, at least in principle, the lower bound can be matched.
Theorem 2 Let F ⊆ {0,1}^n be a set of possible files, and let X be the uniform distribution over
F. Then,

1. For every injective function C : {0,1}^n → {0,1}^* there is a file f ∈ F such that |C(f)| ≥ log_2 |F|.

2. For every injective function C : {0,1}^n → {0,1}^* and every integer t, we have
\[
\Pr[\,|C(X)| \leq (\log_2 |F|) - t\,] \leq \frac{1}{2^{t-1}} .
\]
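Both theorems rest on the same counting fact: there are only 2^{ℓ+1} − 1 binary strings of length at most ℓ, so an injective code cannot give short codewords to many files. As a quick illustration (this Python snippet and its function name are ours, not part of the note), one can compute the largest fraction of an m-element set F that could possibly receive codewords of length at most log_2 m − t, and compare it with the bound in part 2:

import math

def max_fraction_short_codewords(m: int, t: int) -> float:
    """Largest fraction of an m-element set F that an injective code
    C : F -> {0,1}* can map to strings of length at most log2(m) - t."""
    max_len = math.floor(math.log2(m)) - t       # longest codeword length allowed
    if max_len < 0:
        return 0.0
    short_strings = 2 ** (max_len + 1) - 1       # binary strings of length 0..max_len
    return min(1.0, short_strings / m)           # injectivity: at most one file per string

if __name__ == "__main__":
    for m in (2 ** 20, 3 ** 13, 10 ** 6):
        for t in (1, 5, 21, 24):
            fraction = max_fraction_short_codewords(m, t)
            bound = 1 / 2 ** (t - 1)
            assert fraction <= bound
            print(f"m = {m}, t = {t}: fraction <= {fraction:.3e}  (bound {bound:.3e})")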
What can we say about the probable and expected length of the output of an encoding algorithm? We will analyse this setting in the next section, but let us first consider a slightly
different setting that is simple to analyse. Consider the set F of files such that

• the file contains n characters;
• the characters come from an alphabet of size c;
• character i occurs n · p(i) times in the file.
When we pick a random element of F, what is the best compression we can achieve? By
Theorem 2 we need to compute log_2 |F|.
We will show that log_2 |F| is roughly n ∑_i p(i) log_2(1/p(i)), and so a random element of F
cannot be compressed to less than n ∑_i p(i) log_2(1/p(i)) bits. In the next section, we show
that a similar result holds also for the first setting. Let us call f(i) = n · p(i) the number of
occurrences of character i in the file.
Lemma 3
\[
|F| = \frac{n!}{f(1)!\, f(2)! \cdots f(c)!}
\]
Here is a sketch of the proof of this formula. There are n! permutations of n characters,
but many are the same because there are only c different characters. In particular, the
f(1) appearances of character 1 are the same, so all f(1)! orderings of these locations are
identical. Thus we need to divide n! by f(1)!. The same argument leads us to divide by all
the other f(i)!.
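As a sanity check of Lemma 3 (this small Python snippet is ours, not part of the note), one can enumerate all distinct orderings of a small multiset of characters and compare the count with the formula:

from itertools import permutations
from math import factorial

def count_files(counts):
    """Number of distinct strings in which character i appears counts[i] times,
    computed both by brute-force enumeration and by the formula of Lemma 3."""
    chars = []
    for i, f_i in enumerate(counts):
        chars.extend([i] * f_i)                  # the multiset of characters
    brute = len(set(permutations(chars)))        # distinct orderings
    formula = factorial(len(chars))
    for f_i in counts:
        formula //= factorial(f_i)               # n! / (f(1)! ... f(c)!)
    return brute, formula

if __name__ == "__main__":
    print(count_files([3, 2, 1]))                # n = 6, c = 3: prints (60, 60)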
Now we have an exact formula for |F|, but it is hard to interpret, so we replace it using
Stirling's formula for approximating n!:
\[
n! \approx \sqrt{2\pi n}\, \left(\frac{n}{e}\right)^{n}
\]
Stirling's formula is accurate for large arguments, so we will be interested in approximating
|F| for large n. Here goes:
\[
|F| = \frac{n!}{(p(1)n)!\,(p(2)n)! \cdots (p(c)n)!}
\approx \frac{\sqrt{2\pi n}\,(n/e)^{n}}{\sqrt{2\pi p(1)n}\,(p(1)n/e)^{p(1)n} \cdots \sqrt{2\pi p(c)n}\,(p(c)n/e)^{p(c)n}}
\]
\[
= \sqrt{\frac{1}{(2\pi n)^{c-1}\, p(1) \cdots p(c)}} \cdot \frac{n^{n} e^{-n}}{n^{n} e^{-n}\, p(1)^{p(1)n} \cdots p(c)^{p(c)n}}
= \frac{1}{\sqrt{(2\pi n)^{c-1}\, p(1) \cdots p(c)}\;\; p(1)^{p(1)n} \cdots p(c)^{p(c)n}}
\]
So
\[
\log_2 |F| = n \left( \sum_i p(i) \log_2 \frac{1}{p(i)} \right) \pm \Theta(\log n)
\]
This means that, up to lower order terms, ∑_i p(i) log_2(1/p(i)) is the number of bits per character used in an optimal encoding.
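To see the approximation numerically (again a sketch that is not part of the note; the distribution p below is just an example), one can compute log_2 |F| exactly from Lemma 3 and compare the per-character rate log_2 |F| / n with ∑_i p(i) log_2(1/p(i)); the gap shrinks like (log n)/n:

from math import factorial, log2

p = [0.5, 0.25, 0.25]                        # example distribution over c = 3 characters

def exact_bits_per_char(n):
    """log2 |F| / n, where |F| = n! / prod_i (n*p(i))!  (Lemma 3)."""
    counts = [round(n * pi) for pi in p]     # n is chosen so that n*p(i) is an integer
    size = factorial(n)
    for f_i in counts:
        size //= factorial(f_i)
    return log2(size) / n

entropy = sum(pi * log2(1 / pi) for pi in p)     # = 1.5 bits per character here

if __name__ == "__main__":
    for n in (8, 64, 512, 4096):
        print(f"n = {n:5d}: log2|F|/n = {exact_bits_per_char(n):.4f}, entropy = {entropy:.4f}")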
Let us now go back to the first setting, in which each of the n characters of the file is sampled
independently according to the distribution p(·). For each character X_j of our random file, consider
the random variable Z_j = log_2(1/p(X_j)). Clearly, the random variables Z_1, ..., Z_n are independent,
identically distributed and have finite variance. Their expectation is
\[
\mathbf{E}[Z_j] = \sum_i p(i) \log_2 \frac{1}{p(i)} = H(p) ,
\]
the entropy of the character distribution p(·). By the law of large numbers, for every fixed ε > 0,
\[
\lim_{n \to \infty} \Pr\left[\, \left| \frac{1}{n} \sum_{j=1}^{n} Z_j - H(p) \right| > \epsilon \,\right] = 0 ,
\]
and Chebyshev's inequality gives the quantitative bound
\[
\Pr\left[\, \left| \sum_{j=1}^{n} Z_j - nH(p) \right| \geq \epsilon n \,\right] \leq \frac{\mathbf{Var}[Z_1]}{\epsilon^2 n} .
\]
On the other hand,
\[
\sum_{j=1}^{n} Z_j = \log_2 \frac{1}{\prod_{j=1}^{n} p(X_j)} ,
\]
so, with high probability, the file we pick has probability roughly 2^{−nH(p)}; as in Theorem 2, this means that no encoding can compress such files to significantly fewer than nH(p) bits.
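A small simulation (ours, with an arbitrary example distribution) illustrates this concentration: the per-character value (1/n) ∑_j log_2(1/p(X_j)) of a random file clusters more and more tightly around H(p) as n grows:

import random
from math import log2

p = [0.5, 0.25, 0.125, 0.125]                # example character distribution
H = sum(q * log2(1 / q) for q in p)          # entropy, 1.75 bits per character here

def bits_per_char_of_random_file(n):
    """(1/n) * log2(1/p(X_1,...,X_n)) for a file of n i.i.d. characters."""
    chars = random.choices(range(len(p)), weights=p, k=n)
    return sum(log2(1 / p[c]) for c in chars) / n

if __name__ == "__main__":
    random.seed(0)
    for n in (100, 10_000, 1_000_000):
        samples = [bits_per_char_of_random_file(n) for _ in range(20)]
        print(f"n = {n}: min = {min(samples):.3f}, max = {max(samples):.3f}, H(p) = {H:.3f}")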
More generally, consider the case of an arbitrary distribution over strings of length n formed
with characters coming from a finite set of c possible characters. (In the previous section
we studied the special case in which each character is sampled independently according to
the same fixed distribution.)
For a string A = (a_1, ..., a_n) ∈ {1, ..., c}^n, let p(a_1, ..., a_n) be its probability according to such a
distribution.
Then we define the entropy of p to be
\[
H(p) = \sum_{A \in \{1,\ldots,c\}^n} p(A) \log_2 \frac{1}{p(A)} .
\]
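For example (this check is ours, not part of the note), if p is the uniform distribution over a set F of files, the definition recovers the quantity appearing in Theorem 2: every A ∈ F has p(A) = 1/|F|, so
\[
H(p) = \sum_{A \in F} \frac{1}{|F|} \log_2 |F| = \log_2 |F| .
\]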
Arithmetic Encoding
For every string A, define
\[
f(A) = \sum_{B \leq A} p(B)
\]
to be the total probability of strings that come before A (A included), where ≤ denotes the lexicographic order on strings, and suppose that
f(·) is efficiently computable. Let A − 1 denote the string that immediately precedes A in this order.

We write down f(A) and f(A − 1) in binary, and then we encode A as the shortest prefix
of the binary expansion of f(A) that is bigger than f(A − 1).¹
How many bits do we need to encode A? Say that we need k bits: this means
that f(A) and f(A − 1) agree in the first k − 1 bits of their binary expansion, and so
f(A) − f(A − 1) < 2^{−(k−1)}. But, by definition, f(A) − f(A − 1) = p(A), and so if we need
k bits to encode A it means that p(A) < 2^{−(k−1)}. Put another way, the number of bits
needed to encode A is at most log_2(1/p(A)) + 1.
The average length of the encoding is then at most
\[
\sum_{A} p(A) \left( 1 + \log_2 \frac{1}{p(A)} \right) = 1 + H(p)
\]
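Here is a minimal sketch of this encoder and decoder in Python (ours, not the note's). To keep f(·) easy to compute, it assumes a small product distribution over strings of length 4 and simply lists all strings in lexicographic order; a practical implementation would compute f(A) incrementally from the structure of p. Exact rational arithmetic (Fraction) avoids rounding issues in the demonstration.

from fractions import Fraction
from itertools import product
from math import log2

# Example setup: strings of length 4 over an alphabet of size 3, characters
# independent with probabilities q, listed in lexicographic order so that
# "A - 1" is simply the previous string in the list.
q = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]
strings = list(product(range(len(q)), repeat=4))

def prob(A):
    pr = Fraction(1)
    for a in A:
        pr *= q[a]
    return pr

cum = {}                                   # cum[A] = f(A), the cumulative probability
total = Fraction(0)
for A in strings:
    total += prob(A)
    cum[A] = total

def encode(A):
    """Shortest prefix of the binary expansion of f(A) that is bigger than f(A-1)."""
    idx = strings.index(A)
    lower = cum[strings[idx - 1]] if idx > 0 else Fraction(0)   # f(A-1), or 0
    upper = cum[A]                                              # f(A)
    k = 1
    while True:
        if upper == 1:                     # binary expansion of 1 is 0.111...
            prefix_val, code = Fraction(2**k - 1, 2**k), "1" * k
        else:
            top = int(upper * 2**k)        # floor(f(A) * 2^k): first k bits of f(A)
            prefix_val, code = Fraction(top, 2**k), bin(top)[2:].zfill(k)
        if prefix_val > lower:
            return code
        k += 1

def decode(code):
    """The codeword's value lies in (f(A-1), f(A)], which identifies A."""
    v = Fraction(int(code, 2), 2 ** len(code))
    for A in strings:
        if cum[A] >= v:
            return A

if __name__ == "__main__":
    for A in [(0, 0, 0, 0), (1, 0, 2, 0), (2, 2, 2, 2)]:
        code = encode(A)
        assert decode(code) == A
        print(A, "->", code, f"({len(code)} bits; log2(1/p(A)) = {log2(1 / prob(A)):.0f})")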
What if the distribution is not known, or if the function f(·) is not efficiently computable?
The Lempel-Ziv algorithm, on which most commercial compression algorithms are based,
works without knowing the distribution of the input, and it discovers regularities as it
reads the input.

It can be proved (and it is quite difficult) that if p(·) describes a sequence of independent
and identically distributed characters, or if p(·) describes the sequence of states encountered
while running a small Markov chain, then the average output length of the Lempel-Ziv
algorithm is close to the entropy of the given distribution.
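For concreteness, here is a minimal LZ78-style parsing in Python (a sketch of ours; commercial tools such as gzip are based on the related LZ77 scheme, with many further optimizations). The compressor maintains a dictionary of previously seen phrases and outputs (phrase index, next character) pairs, thereby learning regularities of the input as it reads it.

def lz78_compress(text: str):
    """Parse text into (dictionary index, next character) pairs, LZ78-style."""
    dictionary = {"": 0}               # phrase -> index; index 0 is the empty phrase
    output = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:  # keep extending the current phrase
            phrase += ch
        else:                          # emit (longest known prefix, new character)
            output.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:                         # leftover phrase at the end of the input
        output.append((dictionary[phrase], ""))
    return output

def lz78_decompress(pairs):
    """Invert lz78_compress."""
    phrases = [""]                     # phrases[i] is the phrase with index i
    out = []
    for index, ch in pairs:
        phrase = phrases[index] + ch
        out.append(phrase)
        phrases.append(phrase)
    return "".join(out)

if __name__ == "__main__":
    message = "abracadabra abracadabra abracadabra"
    pairs = lz78_compress(message)
    assert lz78_decompress(pairs) == message
    print(f"{len(message)} characters -> {len(pairs)} (index, character) pairs")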
¹ If A is the lexicographically first string, then we use the shortest prefix of the binary expansion of f(A)
that is bigger than zero.