Entropy
1 Entropy
Let X be a discrete random variable with alphabet X = {1, 2, . . . , m}. Assume there is a probability mass function p(x) over X. How many binary questions, on average, does it take to determine the outcome?
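The answer, developed below, turns out to be roughly the entropy H(X) = \sum_x p(x) \log \frac{1}{p(x)}. As a quick illustration (a minimal sketch; the function name and example distributions are ours):

import math

def entropy(p):
    """Shannon entropy H(X) = sum over x of p(x) * log2(1/p(x)), in bits."""
    return sum(px * math.log2(1.0 / px) for px in p if px > 0)

# A fair 8-sided die takes 3 yes/no questions (binary search): H = log2 8 = 3.
print(entropy([1 / 8] * 8))   # 3.0
# A biased coin has H < 1 bit: it is cheaper to describe than a fair one.
print(entropy([0.9, 0.1]))    # about 0.469 bits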
2 Source Coding
Definition 2.1: A (binary) source code C for a random variable X is a mapping from X to finite binary strings. Let C(x) be the codeword corresponding to x and let l(x) denote the length of C(x).
Definition 2.2: A prefix code is a code in which no codeword is a prefix of any other codeword.
The nice property of a prefix code is that one can transmit multiple outcomes x1 , x2 , . . . xn by just concatenating the codewords into C(x1 )C(x2 ) . . . C(xn ), and the receiver can decode each xi as soon as the last bit of C(xi ) arrives, with no lookahead. In this sense, prefix codes are “self-punctuating”.
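For instance, here is a minimal sketch of instantaneous decoding for a small hypothetical prefix code over X = {1, 2, 3}; the code table and function name are illustrative:

# Hypothetical prefix code for X = {1, 2, 3}; any prefix-free table works.
code = {1: '0', 2: '10', 3: '11'}

def decode(bits, code):
    """Decode a concatenated stream instantly: emit x as soon as a codeword matches."""
    inverse = {w: x for x, w in code.items()}
    out, buf = [], ''
    for b in bits:
        buf += b
        if buf in inverse:          # prefix property: at most one match, no lookahead
            out.append(inverse[buf])
            buf = ''
    return out

print(decode('0101100', code))  # [1, 2, 3, 1, 1]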
Let the expected length of C be:

L(C) = \sum_{x \in \mathcal{X}} p(x) l(x)
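Continuing the hypothetical three-symbol code from above, under an assumed distribution p:

# Expected length of the code {1: '0', 2: '10', 3: '11'} under an assumed p.
p = {1: 0.5, 2: 0.25, 3: 0.25}
code = {1: '0', 2: '10', 3: '11'}
L = sum(p[x] * len(w) for x, w in code.items())
print(L)  # 1.5 bits; here L equals H(X) because every p(x) is a power of 2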
Theorem 2.3: The expected length of any prefix code is at least the entropy, i.e.

L(C) \ge H(X)

and there exists a prefix code whose expected length is within one bit of the entropy:

L(C) \le H(X) + 1
2.1 The Kraft inequality

Theorem 2.4 (Kraft inequality): The codeword lengths of any (binary) prefix code satisfy

\sum_{x \in \mathcal{X}} 2^{-l(x)} \le 1

This theorem is actually more general and applies to uniquely decodable codes. Conversely, given a set of codeword lengths which satisfy this inequality, there exists a prefix code with these lengths.
For the forward direction, generate an infinite sequence of independent fair coin flips and, for each x, let E_x be the event that the stream begins with the codeword C(x), so that \Pr(E_x) = 2^{-l(x)}. By the prefix condition, no codeword begins with another, so the events E_x are disjoint and

\sum_{x \in \mathcal{X}} 2^{-l(x)} = \sum_{x \in \mathcal{X}} \Pr(E_x) = \Pr\Big( \bigcup_{x \in \mathcal{X}} E_x \Big) \le 1

where the last step follows since probabilities are bounded by 1. This proves the first statement. We proved the forward direction with a technique known as the “probabilistic method”.
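A quick simulation of this argument (a sketch; the code table is the same hypothetical one as above):

import random

# Empirical check of the disjoint-events argument for the prefix code
# {'0', '10', '11'}: a random bit stream begins with at most one codeword,
# and the probabilities 2^{-l(x)} sum to at most 1.
random.seed(0)
words = ['0', '10', '11']
trials, hits = 100_000, 0
for _ in range(trials):
    stream = ''.join(random.choice('01') for _ in range(2))  # 2 = max length
    matches = sum(stream.startswith(w) for w in words)
    assert matches <= 1  # disjointness, from the prefix condition
    hits += matches
print(hits / trials)  # close to sum of 2^{-l(x)} = 1.0 (this code is complete)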
For the converse, order the lengths in ascending order l1 ≤ . . . ≤ lm . Pick codewords in this order, subject to the constraint that no previously chosen codeword is a prefix of the new one. To prove that this works, consider a full binary tree of depth lm . Associate each codeword with a path on the tree, from the root to some node at depth li (the end node of the codeword). The prefix condition states that the path of each codeword must not pass through the end node of another codeword’s path. With each leaf node, associate a probability mass of 2^{-lm}; the end node of codeword i then has 2^{lm - li} leaves below it, for a total mass of 2^{-li}. Note that each chosen codeword removes 2^{-li} from the remaining mass to be allocated. Furthermore, by the prefix condition, an allocation is always possible as long as there is enough remaining mass. As the lengths satisfy the Kraft inequality, the initial mass of 1 is enough to assign valid codewords to all the items.
2.2 The proof of the source coding theorem
We first show that there exists a code within one bit of the entropy. Choose the lengths
as:
l(x) = \left\lceil \log \frac{1}{p(x)} \right\rceil
This choice is integer-valued and satisfies the Kraft inequality, since 2^{-l(x)} \le 2^{-\log(1/p(x))} = p(x) and the p(x) sum to 1; hence a prefix code with these lengths exists. Also, we can upper bound the average code length as follows:
\sum_{x \in \mathcal{X}} p(x) l(x) = \sum_{x \in \mathcal{X}} p(x) \left\lceil \log \frac{1}{p(x)} \right\rceil
\le \sum_{x \in \mathcal{X}} p(x) \left( \log \frac{1}{p(x)} + 1 \right)
= H(X) + 1
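A numeric check of both steps, with an assumed example distribution:

import math

# Shannon code lengths l(x) = ceil(log2(1/p(x))) for an assumed distribution p.
p = [0.5, 0.25, 0.15, 0.1]
lengths = [math.ceil(math.log2(1.0 / px)) for px in p]
H = sum(px * math.log2(1.0 / px) for px in p)
L = sum(px * l for px, l in zip(p, lengths))
assert sum(2.0 ** -l for l in lengths) <= 1   # Kraft holds, so a prefix code exists
assert H <= L <= H + 1                        # the bound just derived
print(lengths, round(H, 3), round(L, 3))      # [1, 2, 3, 4] 1.743 1.85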
Now, let us prove the lower bound on L(C). Consider the optimization problem

\min_{l(\cdot)} \sum_{x \in \mathcal{X}} p(x) l(x) \quad \text{such that} \quad \sum_{x \in \mathcal{X}} 2^{-l(x)} \le 1

The above finds the shortest possible expected code length subject to satisfying the Kraft inequality, which every prefix code must obey. If we relax the code lengths to be non-integer, the optimal value can only decrease, so its solution is a lower bound on L(C).
To do this, the Lagrangian is:

\mathcal{L} = \sum_{x \in \mathcal{X}} p(x) l(x) + \lambda \left( \sum_{x \in \mathcal{X}} 2^{-l(x)} - 1 \right)
Taking derivatives with respect to l(x) and λ and setting them to 0 leads to:

p(x) - \lambda \ln 2 \cdot 2^{-l(x)} = 0

\sum_{x \in \mathcal{X}} 2^{-l(x)} - 1 = 0
Solving this gives \lambda = 1/\ln 2 and

l(x) = \log \frac{1}{p(x)}

which can be verified by direct substitution. Substituting these lengths into the objective yields \sum_{x} p(x) \log \frac{1}{p(x)} = H(X), so any prefix code has L(C) \ge H(X). This proves the lower bound.
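To sanity-check the relaxed solution (a sketch; the distribution is an assumed example):

import math

# At the relaxed optimum l(x) = log2(1/p(x)), the Kraft constraint is tight
# and the objective equals the entropy H(X).
p = [0.4, 0.3, 0.2, 0.1]
l = [math.log2(1.0 / px) for px in p]
print(sum(2.0 ** -li for li in l))              # 1.0: constraint met with equality
print(sum(px * li for px, li in zip(p, l)))     # objective value, about 1.846 ...
print(-sum(px * math.log2(px) for px in p))     # ... equals H(X), same value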