Shannon's Noiseless Coding Theorem
We are working with messages written in an alphabet of symbols x_1, ..., x_n which occur with probabilities p_1, ..., p_n. We have defined the entropy E of this set of probabilities to be

E = -\sum_{i=1}^{n} p_i \log_2 p_i.
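As a quick numeric illustration of this definition (a Python sketch; the distributions are invented examples, not from the text):

```python
import math

def entropy(probs):
    """Shannon entropy E = -sum(p_i * log2 p_i), with the
    convention 0 * log2(0) = 0 (justified in the notes below)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin carries one bit of entropy.
print(entropy([0.5, 0.5]))       # → 1.0
# Four equally likely symbols carry two bits.
print(entropy([0.25] * 4))       # → 2.0
```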
These pages give a proof of an important special case of Shannon's theorem (which holds for any uniquely decipherable code). We will prove it for prefix codes, which are defined as follows:

Definition: A (binary) prefix code is an assignment of binary strings (strings of 0s and 1s, "code words") to symbols in the source alphabet so that no code word occurs as the beginning of another code word. Note that a message written in a prefix code can be unambiguously decoded by slicing off code words as they occur.

Theorem: For any binary prefix code encoding x_1, ..., x_n, the average length of a code word must be greater than or equal to E. More explicitly, setting \ell_i as the length of the code word for x_i,

\sum_{i=1}^{n} p_i \ell_i \ge E.
Our proof of this theorem will involve two lemmas.

Lemma 1 (Gibbs' inequality): Suppose p_1, ..., p_n is a probability distribution (i.e. each p_i \ge 0 and \sum_i p_i = 1). Then for any other probability distribution q_1, ..., q_n with the same number of elements,

\sum_{i=1}^{n} p_i \log_2 p_i \ge \sum_{i=1}^{n} p_i \log_2 q_i.
(Notes: 1. The sum on the right may diverge to -\infty if one of the q_i is zero and the corresponding p_i is not. As remarked before, p_i = 0 is not a problem since \lim_{p \to 0} p \log_2 p = (1/\ln 2) \lim_{p \to 0} p \ln p = 0 by L'Hôpital's rule.

2. The inequality is often stated with minus signs:

-\sum_{i=1}^{n} p_i \log_2 p_i \le -\sum_{i=1}^{n} p_i \log_2 q_i.

Our formulation avoids many minus signs, even though the numbers involved are both negative.

3. For a heuristic motivation consider the case where all the p_i are equal. Each p_i = 1/n and

\sum_{i=1}^{n} p_i \log_2 p_i = \frac{1}{n} \log_2 (p_1 p_2 \cdots p_n).

Substituting q_1, ..., q_n for p_1, ..., p_n gives \frac{1}{n} \log_2 (q_1 q_2 \cdots q_n). When the sum of the side-lengths is fixed, the maximum volume of a rectangular solid is obtained when all the sides are equal; so since \sum_{i=1}^{n} q_i = \sum_{i=1}^{n} p_i = 1, the product q_1 q_2 \cdots q_n must be less than or equal to p_1 p_2 \cdots p_n in this case.)
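A quick numeric check of the lemma (the distributions are arbitrary examples chosen for illustration):

```python
import math

def weighted_log_sum(p, q):
    """sum over i of p_i * log2(q_i), skipping terms with p_i = 0.
    Assumes q_i > 0 whenever p_i > 0."""
    return sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
q = [1/3, 1/3, 1/3]
# Gibbs' inequality: the sum is largest when q equals p.
print(weighted_log_sum(p, p) >= weighted_log_sum(p, q))  # → True
```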
Proof (from http://en.wikipedia.org/wiki/Gibbs%27_inequality): Since \log_2 p = \ln p / \ln 2 and \ln 2 > 0, it is enough to prove the inequality with \log_2 replaced by \ln wherever it occurs. Additionally, if any one of the q_i is zero and the corresponding p_i \ne 0, the inequality is automatically true; so we may assume (*) that q_i \ne 0 whenever p_i \ne 0.

We use the following property of the natural logarithm: \ln x \le x - 1 for all x > 0, and \ln x = x - 1 only when x = 1.

In order to avoid zero denominators in the following calculation, we set I = \{ i \mid p_i > 0 \}, the set of indices for which p_i is non-zero (and therefore, by (*), q_i is also non-zero). Then we write

\sum_{i \in I} p_i \ln \frac{q_i}{p_i} \le \sum_{i \in I} p_i \left( \frac{q_i}{p_i} - 1 \right) = \sum_{i \in I} q_i - \sum_{i \in I} p_i = \sum_{i \in I} q_i - 1 \le 0,

so

\sum_{i \in I} p_i \ln q_i \le \sum_{i \in I} p_i \ln p_i.
Now \sum_{i \in I} p_i \ln p_i = \sum_{i=1}^{n} p_i \ln p_i, since the new terms all have p_i = 0; and \sum_{i=1}^{n} p_i \ln q_i \le \sum_{i \in I} p_i \ln q_i, since the new terms are \le 0. I.e.

\sum_{i=1}^{n} p_i \ln q_i \le \sum_{i \in I} p_i \ln q_i \le \sum_{i \in I} p_i \ln p_i = \sum_{i=1}^{n} p_i \ln p_i,
yielding Gibbs' inequality.

Lemma 2 (Kraft's inequality for binary prefix codes): Let x_1, ..., x_n be the symbols in our alphabet, and suppose we have encoded them as binary words using a prefix code. Let \ell_1, ..., \ell_n be the lengths of the words corresponding to x_1, ..., x_n. Then

\sum_{i=1}^{n} 2^{-\ell_i} \le 1.
Proof: Note that a (binary) prefix code can always be represented as a binary tree: as a word is read, the tree branches right or left according as the next bit is 0 or 1. Each word occurs at the end of a unique branch. Set L = \max_i \ell_i. Then the tree corresponding to our prefix code can be extended to a tree where every branch has length L, and there are 2^L branches. A code word of length \ell_i corresponds to pruning off from this tree all the possible extensions of the corresponding branch. There are 2^{L - \ell_i} of these. The total number of deleted branches is then \sum_i 2^{L - \ell_i}; since this sum can be no larger than the total number of branches, we have

\sum_{i=1}^{n} 2^{L - \ell_i} = 2^L \sum_{i=1}^{n} 2^{-\ell_i} \le 2^L,

so

\sum_{i=1}^{n} 2^{-\ell_i} \le 1,

Kraft's inequality.
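The counting argument can be checked numerically for any particular prefix code (a Python sketch; the example word lengths are invented):

```python
def kraft_sum(lengths):
    """sum over i of 2^(-l_i); at most 1 for any binary prefix code."""
    return sum(2 ** -l for l in lengths)

# Word lengths of the prefix code {0, 10, 110, 111}: here the full
# tree of depth L = 3 is pruned exactly, so the sum equals 1.
print(kraft_sum([1, 2, 3, 3]))  # → 1.0
```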
Proof of Shannon's theorem: Take x_1, ..., x_n and p_1, ..., p_n as in the statement, suppose the x_i have been encoded in a binary prefix code, and let \ell_i be the length of the code word for x_i. Then by Kraft's inequality \sum_i 2^{-\ell_i} \le 1. Call this number 1/C, so that C 2^{-\ell_1}, ..., C 2^{-\ell_n} is a probability distribution, and can play the role of \{q_i\} in Gibbs' inequality, which then tells us

\sum_{i=1}^{n} p_i \log_2 p_i \ge \sum_{i=1}^{n} p_i \log_2 (C \, 2^{-\ell_i}) = \sum_{i=1}^{n} p_i (\log_2 C - \ell_i) = \log_2 C - \sum_{i=1}^{n} p_i \ell_i.
Now put back the minus signs and remember that since 1/C \le 1 we have C \ge 1 and \log_2 C \ge 0. We obtain

\sum_{i=1}^{n} p_i \ell_i \ge -\sum_{i=1}^{n} p_i \log_2 p_i + \log_2 C \ge -\sum_{i=1}^{n} p_i \log_2 p_i = E,

as required.
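The theorem can be checked on a small example (a Python sketch; the probabilities and code are chosen for illustration, and happen to make the bound tight):

```python
import math

p       = [0.5, 0.25, 0.125, 0.125]   # source probabilities
lengths = [1, 2, 3, 3]                # lengths for the prefix code {0, 10, 110, 111}

entropy    = -sum(pi * math.log2(pi) for pi in p)
avg_length = sum(pi * li for pi, li in zip(p, lengths))

# Shannon's theorem: average word length >= entropy.
# Here both equal 1.75, so this code is optimal for p.
print(entropy, avg_length)
```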