
EE376A - Information Theory

Final, Monday March 16th Solutions

Instructions:

• You have three hours, 3.30PM - 6.30PM

• The exam has 4 questions, totaling 120 points.

• Please start answering each question on a new page of the answer booklet.

• You are allowed to carry the textbook, your own notes and other course-related material
with you. Electronic reading devices [including Kindles, laptops, iPads, etc.] are allowed,
provided they are used solely for reading PDF files already stored on them and not for any
other form of communication or information retrieval.

• You are required to provide detailed explanations of how you arrived at your answers.

• You can use previous parts of a problem even if you did not solve them.

• As throughout the course, entropy (H) and Mutual Information (I) are specified in
bits.

• log is taken in base 2.

• Throughout the exam ‘prefix code’ refers to a variable length code satisfying the prefix
condition.

• Good Luck!

1. Three Shannon Codes (25 points)
Let {Ui}i≥1 be a stationary finite-alphabet source whose alphabet size is r. Note that
the stationarity property implies that P(ui) and P(ui|ui−1) do not depend on i. Throughout
this problem, assume that − log P(ui) and − log P(ui|ui−1) are integers for all (ui, ui−1).
Recall the definition of a Shannon code given in the lecture. Your TAs decided to
compress this source in a lossless fashion using Shannon coding. However, each of them
had a different idea:

• Idoia suggested coding symbol-by-symbol, i.e., concatenating Shannon codes for the
respective source symbols U1, U2, . . ..
• Kartik suggested coding in pairs. In other words, first code (U1, U2) with a Shannon
code designed for the pair, then code (U3, U4), and so on.
• Jiantao suggested coding each symbol given the previous symbol by using the
Shannon code for the conditional pmf {P(ui|ui−1)}. In other words, first code U1,
then code U2 given U1, then code U3 given U2, and so on.

In this problem, you will investigate which amongst the three schemes is best for a general
stationary source.

(a) (10 points) If the source is memoryless (i.e., i.i.d.), compare the expected codeword
length per symbol, i.e., (1/n) E[l(U^n)], of each scheme, assuming n > 2 is even.
(b) (15 points) Compare the schemes again, for the case where the source is no longer
memoryless and, in particular, is such that Ui−1 and Ui are not independent.

Solution:
We will first analyze each of the coding schemes for general stationary sources.
Idoia’s scheme: Use codeword length − log P (u) for a symbol u.

    l̄1 = (1/n) E[l1(U^n)]
       = (1/n) E[ Σ_{i=1}^{n} − log P(Ui) ]
       = (1/n) Σ_{i=1}^{n} E[− log P(U1)]      (stationarity, and linearity of expectation)
       = H(U1)                                 (definition of entropy)

Kartik’s scheme: Use codeword length − log P (ui , ui+1 ) for each successive pair of sym-
bols (ui , ui+1 ).

    l̄2 = (1/n) E[l2(U^n)]
       = (1/n) E[ Σ_{i=1}^{n/2} − log P(U_{2i−1}, U_{2i}) ]
       = (1/n) Σ_{i=1}^{n/2} E[− log P(U1, U2)]      (stationarity, and linearity of expectation)
       = (1/2) H(U1, U2)                             (definition of entropy)
Jiantao’s scheme: Use codeword length − log P(u1) for the first symbol u1, and then codeword
length − log P(ui|ui−1) for each subsequent symbol.

    l̄3 = (1/n) E[l3(U^n)]
       = (1/n) E[ − log P(U1) + Σ_{i=2}^{n} − log P(Ui|Ui−1) ]
       = (1/n) ( E[− log P(U1)] + Σ_{i=2}^{n} E[− log P(U2|U1)] )      (stationarity, and linearity of expectation)
       = (1/n) [ H(U1) + (n − 1) H(U2|U1) ]                            (definition of entropy)
(a) Because the source is i.i.d., all three coding schemes have the same average codeword
length, equal to the entropy H(U1). One can verify that when the source is i.i.d.,
l̄1 = l̄2 = l̄3 = H(U1).
(b) Idoia’s codeword length is the longest because H(U2|U1) ≤ H(U2) = H(U1). For n = 2,
the performance of Kartik’s and Jiantao’s coding schemes is identical. However, for
larger n, Jiantao’s coding scheme has the smallest codeword length, since H(U1, U2) =
H(U1) + H(U2|U1) ≥ 2H(U2|U1), with equality iff the source is memoryless. Therefore,
in general, l̄1 ≥ l̄2 ≥ l̄3.
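
As a numerical illustration of this ordering (not part of the original solutions), the Python sketch below evaluates l̄1, l̄2, l̄3 for a hypothetical two-state stationary Markov source. The transition matrix and the value of n are arbitrary choices, and the integer-codeword-length assumption of the problem is ignored, so these are the idealized per-symbol lengths derived above.

```python
import numpy as np

def entropy(p):
    """Entropy in bits of a probability vector, ignoring zero entries."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Hypothetical two-state stationary Markov source (illustrative parameters only).
T = np.array([[0.9, 0.1],
              [0.3, 0.7]])              # T[u, v] = P(U_{i+1} = v | U_i = u)
pi = np.array([0.75, 0.25])             # stationary distribution: pi @ T == pi

H_U1      = entropy(pi)                                    # H(U1)
H_U2_g_U1 = sum(pi[u] * entropy(T[u]) for u in range(2))   # H(U2|U1)
H_U1U2    = H_U1 + H_U2_g_U1                               # H(U1, U2)

n = 100
l1 = H_U1                                  # Idoia: symbol-by-symbol
l2 = H_U1U2 / 2                            # Kartik: pairs
l3 = (H_U1 + (n - 1) * H_U2_g_U1) / n      # Jiantao: conditional coding
print(f"l1 = {l1:.4f}, l2 = {l2:.4f}, l3 = {l3:.4f}")
assert l1 >= l2 >= l3                      # matches the ordering derived above
```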

2. Channel coding with side information (35 points)

Consider the binary channel given by

Yi = Xi ⊕ Zi , (1)

where Xi , Yi , Zi all take values in {0, 1}, and ⊕ denotes addition modulo-2. There are
channel states Si which determine the noise level of Zi as follows.

• Si is binary valued, taking values in the set {G, B}, distributed as

      Si = G with probability 2/3, and Si = B with probability 1/3.

• The conditional distribution of Zi given Si is characterized by

      P(Zi = 1 | Si = s) = 1/4 if s = G, and 1/3 if s = B.

  In other words, Zi | {Si = s} ∼ Bernoulli(ps), where ps = 1/4 if s = G and ps = 1/3 if s = B.

{(Si , Zi )} are i.i.d. (in pairs), independent of the channel input sequence {Xi }.

(a) (10 points) What is the capacity of this channel when both the encoder and the
decoder have access to the state sequence {Si }i≥1 ?
(b) (10 points) What is the capacity of this channel when neither the encoder nor the
decoder has access to the state sequence {Si}i≥1?
(c) (10 points) What is the capacity of this channel when only the decoder knows the
state sequence {Si }i≥1 ?
(d) (5 points) Which is largest and which is smallest among your answers to parts (a),
(b) and (c)? Explain.

Solution:

(a) The capacity of this channel is given by

      C = max_{p(x|s)} I(X; Y | S)
        = max_{p(x|s)} [ H(Y|S) − H(Y|X, S) ]
        = max_{p(x|s)} H(X ⊕ Z | S) − H(Z|S)     (Y = X ⊕ Z, and Z is independent of X given S)
        ≤ 1 − H(Z|S)                             (binary entropy is upper bounded by 1)
        = 1 − P(S = G) H(Z|S = G) − P(S = B) H(Z|S = B)
        = 1 − (2/3) h2(1/4) − (1/3) h2(1/3),
where h2 (·) is the binary entropy function. Note that the above bound is achieved
by choosing input X ∼ Bern(0.5) regardless of the state S. Choosing this input
gives the output a uniform distribution which maximizes the entropy. It is crucial to
state the capacity achieving distribution to show that an upper bound on the mutual
information can be achieved, which is why it is the capacity. Several students did
not do this for the problem, and lost some points.
Alternate solution: Since both encoder and decoder know the state, they can use
the corresponding capacity achieving codes for when the channel is “good” or “bad”
respectively. The capacity is simply the weighted average of the capacities of the two
binary symmetric channels, i.e.

      C = P(S = G) CG + P(S = B) CB
        = (2/3)(1 − h2(1/4)) + (1/3)(1 − h2(1/3))
        = 1 − (2/3) h2(1/4) − (1/3) h2(1/3).
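
For concreteness (this numeric check is not part of the original solutions), a short Python sketch evaluating the capacity expression above:

```python
from math import log2

def h2(p):
    """Binary entropy function in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

# Part (a): weighted average of the good/bad BSC capacities.
C_a = 2/3 * (1 - h2(1/4)) + 1/3 * (1 - h2(1/3))
print(f"C_a = {C_a:.4f} bits per channel use")   # approximately 0.153 bits
```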

(b) When neither the encoder nor the decoder has any state information, there is no way to
exploit the state. The channel then reduces to an average BSC with equivalent crossover
probability

      p = (2/3)(1/4) + (1/3)(1/3) = 5/18.

Thus, the capacity of this channel is simply that of a BSC(p), i.e.,

      C = 1 − h2(5/18)

(c) The decoder has access to the state, which means that equivalently the channel output
can be viewed as the pair (Y, S). The capacity of the channel in this case is

      C = max_{p(x)} I(X; Y, S)
        = max_{p(x)} [ H(Y, S) − H(Y, S|X) ]
        = max_{p(x)} [ H(S) + H(Y|S) − H(S|X) − H(Y|S, X) ]
        = max_{p(x)} [ H(Y|S) − H(Y|S, X) ]      (S and X are independent)
        = max_{p(x)} [ H(Y|S) − H(Z|S) ]         (Y = X ⊕ Z, and Z is independent of X given S)
        ≤ 1 − H(Z|S)                             (binary entropy is upper bounded by 1)
        = 1 − P(S = G) H(Z|S = G) − P(S = B) H(Z|S = B)
        = 1 − (2/3) h2(1/4) − (1/3) h2(1/3),
where the upper bound can be achieved by choosing X ∼ Bern(0.5). This expression
is exactly the same as (a). In other words, knowing the state at the decoder is just
as useful as knowing the state at both encoder and decoder! This magic happens
because the capacity achieving input for part (a) does not need to know the state of
the channel.
(d) In this problem, Ca = Cc > Cb. Clearly, knowing the state is advantageous since it
reduces the uncertainty in the channel noise, i.e., H(Z|S) < H(Z). The capacity in
part (a) is largest because the entire state information is available to both encoder and
decoder. The additional beauty in this problem is that knowing the state at the decoder
alone is sufficient to achieve the capacity of (a). The reason, as stated above, is that
X ∼ Bern(0.5) achieves the capacity in both (a) and (c).
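
The claimed ordering Ca = Cc > Cb can also be checked numerically; the sketch below (my own addition, not from the exam) evaluates all three expressions:

```python
from math import log2

def h2(p):
    """Binary entropy function in bits."""
    return -p * log2(p) - (1 - p) * log2(1 - p)

C_a = 1 - 2/3 * h2(1/4) - 1/3 * h2(1/3)   # state known at encoder and decoder
C_c = C_a                                 # state known at decoder only (same expression)
C_b = 1 - h2(2/3 * 1/4 + 1/3 * 1/3)       # no state information: BSC(5/18)
print(f"C_a = C_c = {C_a:.4f} bits, C_b = {C_b:.4f} bits")
assert C_a > C_b                          # knowing the state strictly helps here
```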

3. Modulo-3 additive noise channel (25 points)

(a) (5 points) Consider the modulo-3 additive white noise channel given by

Yi = Xi ⊕ Zi ,    (2)

where Xi , Zi , Yi all take values in the alphabet {0, 1, 2}, ⊕ denotes addition modulo-3,
and {Zi } are i.i.d. ∼ Z and independent of the channel input sequence {Xi }.

Figure 1: Ternary additive channel (the noise Zi is added to the input Xi modulo-3 to produce Yi).

Show that the capacity of this channel is given by

C = log 3 − H(Z). (3)

(b) (7 points) For ε ≥ 0 define

      φ(ε) = max_{Z : Pr(Z ≠ 0) ≤ ε} H(Z),    (4)

where the maximization is over ternary random variables Z that take values in
{0, 1, 2} (and that satisfy the indicated constraint). Obtain φ(ε) explicitly, as well
as the distribution of the random variable Zε that achieves the associated maximum.

[Distinguish between the ranges 0 ≤ ε < 2/3 and ε ≥ 2/3.]


(c) (5 points) Consider the problem of rate distortion coding of a memoryless source
Ui ∼ U , where the source and the reconstruction alphabets are both equal and
ternary, i.e., U = V = {0, 1, 2}. Let the distortion measure be Hamming loss

      d(u, v) = 0 if u = v, and 1 otherwise.

For U, V that are jointly distributed such that E[d(U, V)] ≤ D, justify the following
chain of equalities and inequalities:

      I(U; V) (i)=  H(U) − H(U|V)
             (ii)=  H(U) − H(U ⊖ V | V)
            (iii)≥  H(U) − H(U ⊖ V)
             (iv)≥  H(U) − φ(D),

where ⊖ denotes subtraction modulo-3 and φ(D) was defined in Equation (4). Argue
why this implies that the rate distortion function of the source U is lower bounded
as
R(D) ≥ H(U ) − φ(D). (5)
The above inequality is known as the ‘Shannon lower bound’ (specialized to our
setting of ternary alphabets and Hamming loss).
(d) (8 points) Show that when U is uniform (on {0, 1, 2}), the Shannon lower bound
holds with equality, i.e.,

R(D) = H(U ) − φ(D) = log 3 − φ(D), 0 ≤ D ≤ 1. (6)

[Hint: establish, by construction, the existence of a joint distribution on U, V such that
U is uniform and the inequalities in Part (c) hold with equality.]

Solution:

(a) Under any input distribution PX,

      I(X; Y) (i)=  H(Y) − H(Y|X)
             (ii)=  H(Y) − H(Y ⊖ X | X)
            (iii)=  H(Y) − H(Z|X)
             (iv)=  H(Y) − H(Z)
              (v)≤  log 3 − H(Z),

where (i) follows from the definition of mutual information, (ii) is due to invariance of
entropy to translation of the RV (or to any one-to-one transformation), (iii) is due to
the channel model, (iv) is due to the independence of the additive channel noise from
the channel input, and (v) is because Y is ternary. On the other hand, when PX
is the uniform distribution, the distribution of Y is uniform as well, in which case (v)
holds with equality.
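
As a quick sanity check (not part of the exam), the sketch below builds the joint pmf of (X, Y) for an arbitrary noise distribution and verifies that a uniform input gives I(X; Y) = log 3 − H(Z); the chosen pZ is illustrative only.

```python
import numpy as np

def H(p):
    """Entropy in bits of a pmf, ignoring zero entries."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

pZ = np.array([0.6, 0.3, 0.1])        # arbitrary illustrative noise distribution
pX = np.ones(3) / 3                   # uniform input

# Joint pmf of (X, Y) with Y = X + Z (mod 3) and Z independent of X.
pXY = np.zeros((3, 3))
for x in range(3):
    for z in range(3):
        pXY[x, (x + z) % 3] += pX[x] * pZ[z]

pY = pXY.sum(axis=0)
I = H(pX) + H(pY) - H(pXY.flatten())  # I(X;Y) = H(X) + H(Y) - H(X,Y)
print(I, np.log2(3) - H(pZ))          # the two values agree
```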
(b) We have

      φ(ε) = max_{Z : Pr(Z ≠ 0) ≤ ε} H(Z) = max_{Z : E[ρ(Z)] ≤ ε} H(Z),    (7)

where ρ(z) = 0 if z = 0, and 1 otherwise.
It can be shown that the maximum is attained by a distribution on Z of the form

      PZ(z) = c(λ) e^{−λρ(z)},    (8)

where c(λ) is the normalization constant and λ ≥ 0 is tuned so that the constraint is
met with equality (when possible). In our case this boils down to the distribution

      PZε(z) = 1 − ε if z = 0, and ε/2 otherwise,

for 0 ≤ ε ≤ 2/3. For 2/3 < ε ≤ 1, the uniform distribution is in the constraint set,
and is therefore the maximizing distribution. Thus

      φ(ε) = H(Zε) if 0 ≤ ε ≤ 2/3, and log 3 if 2/3 < ε ≤ 1.
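
A brute-force check (my own addition, not from the exam): for a fixed ε ≤ 2/3 it searches over ternary pmfs with Pr(Z ≠ 0) ≤ ε and compares the best entropy found with H(Zε), which for this pmf simplifies to h2(ε) + ε.

```python
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def h2(p):
    return H([p, 1 - p])

eps = 0.4                                    # any value in [0, 2/3]
grid = np.linspace(0, eps, 401)

# Brute force over pmfs (1 - p1 - p2, p1, p2) with p1 + p2 <= eps.
best = max(H([1 - p1 - p2, p1, p2])
           for p1 in grid for p2 in grid if p1 + p2 <= eps)

closed_form = h2(eps) + eps                  # entropy of (1 - eps, eps/2, eps/2)
print(best, closed_form)                     # the two values agree closely
```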

(c) (i) From definition of mutual information.
    (ii) From invariance of entropy to translation of the RV (or to any one-to-one transformation).
    (iii) Conditioning reduces entropy.
    (iv) Due to P(U ⊖ V ≠ 0) = E[d(U, V)] ≤ D and the definition of φ.
Thus, H(U ) − φ(D) lower bounds any mutual information in the feasible set over
which the minimum in the definition of R(D) is taken, and therefore lower bounds
R(D).
(d) We need to find a distribution on (U, V) such that:
    (a) U is uniform.
    (b) U ⊖ V is independent of V (for equality in (iii)).
    (c) U ⊖ V ∼ ZD (for equality in (iv)).
    Taking V to be uniform, and U = V ⊕ ZD for ZD independent of V, satisfies these
    three conditions.
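
The construction can also be verified numerically; the sketch below (illustrative, with an arbitrary D) builds the joint pmf with V uniform and U = V ⊕ ZD, and checks that U is uniform, E[d(U, V)] = D, and I(U; V) = log 3 − φ(D).

```python
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

D = 0.3                                        # a distortion level below 2/3
pZ = np.array([1 - D, D / 2, D / 2])           # Z_D, the maximizer from part (b)
pV = np.ones(3) / 3                            # uniform reconstruction V

# Joint pmf of (U, V) with U = V + Z_D (mod 3), Z_D independent of V.
pUV = np.zeros((3, 3))
for v in range(3):
    for z in range(3):
        pUV[(v + z) % 3, v] += pV[v] * pZ[z]

pU = pUV.sum(axis=1)                           # U is uniform, as required
distortion = 1 - np.trace(pUV)                 # P(U != V) = E[d(U, V)]
I_UV = H(pU) + H(pV) - H(pUV.flatten())        # I(U; V)
phi_D = H(pZ)                                  # phi(D) = H(Z_D)
print(distortion, I_UV, np.log2(3) - phi_D)    # distortion = D, and I(U;V) = log 3 - phi(D)
```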

4. Gaussian source and channel (35 points)

• Gaussian Channel

Consider the parallel Gaussian channel which has two inputs X = (X1, X2) and two
outputs Y = (Y1, Y2), where

      Y1 = X1 + Z1
      Y2 = X2 + Z2,

and Zi ∼ N(0, σi²), i = 1, 2, are independent Gaussian random variables. We impose
an average power constraint on the input X, which is

      E[‖X‖²] = E[X1² + X2²] ≤ P

(a) (10 points) Give an explicit formula for the capacity of this channel in terms of
P, σ1², σ2².
(b) (7 points) Suppose you had access to capacity-achieving schemes for the scalar
AWGN channel whose capacity we derived in class. How would you use them
to construct capacity-achieving schemes for this parallel Gaussian channel?
• Gaussian Source

Consider a two-dimensional real-valued source U = (U1, U2) such that U1 ∼ N(0, σ1²),
U2 ∼ N(0, σ2²), and U1 is independent of U2. Let d : R² × R² → R be the
distortion measure

      d(u, v) = ‖u − v‖² = |u1 − v1|² + |u2 − v2|²

We wish to compress i.i.d. copies of the source U , with average per-symbol distortion
no greater than D, i.e. the usual lossy compression setup discussed in class.
(a) (10 points) Evaluate the rate-distortion function R(D) explicitly in terms of the
problem parameters D, σ1², σ2².
(b) (8 points) Suppose you had access to good lossy compressors for the scalar
Gaussian source whose rate-distortion function we derived in class. How would
you use them to construct good schemes for this two-dimensional Gaussian
source?

Solution:

Gaussian Channel

By Shannon’s channel coding theorem, the capacity of this parallel Gaussian channel is
given by

      max_{P_{X1,X2}: E[X1²+X2²] ≤ P} I(X1, X2; Y1, Y2)
        = max_{P_{X1,X2}: E[X1²+X2²] ≤ P} ( h(Y1, Y2) − h(Y1, Y2 | X1, X2) )                 (9)
        = max_{P_{X1,X2}: E[X1²+X2²] ≤ P} ( h(Y1, Y2) − h(Y1|X1) − h(Y2|X2) )                (10)
        ≤ max_{P_{X1,X2}: E[X1²+X2²] ≤ P} ( h(Y1) + h(Y2) − h(Y1|X1) − h(Y2|X2) )            (11)
        = max_{P_{X1,X2}: E[X1²+X2²] ≤ P} ( I(X1; Y1) + I(X2; Y2) )                          (12)
        = max_{P1,P2 ≥ 0, P1+P2 ≤ P} ( max_{P_{X1}: E[X1²] ≤ P1} I(X1; Y1) + max_{P_{X2}: E[X2²] ≤ P2} I(X2; Y2) )   (13)
        = max_{P1,P2 ≥ 0, P1+P2 ≤ P} ( (1/2) log(1 + P1/σ1²) + (1/2) log(1 + P2/σ2²) ).      (14)

Note that in the above chain of inequalities, if we take X1 to be independent of X2, then
every inequality holds with equality. Hence, it suffices to solve the last optimization problem,
whose solution is not only an upper bound, but also a lower bound on the capacity of this
parallel Gaussian channel.
Since the function (1/2) log(1 + x) is increasing for x ≥ 0, the maximum is attained when
P1 + P2 = P. Define the following function of P1 ∈ [0, P]:

      f(P1) = (1/2) log(1 + P1/σ1²) + (1/2) log(1 + (P − P1)/σ2²).    (15)

The function f(P1) is concave on [0, P] and has only one maximum. Taking the derivative
with respect to P1, we have

      f′(P1) = (1/σ1²) / (2(1 + P1/σ1²)) − (1/σ2²) / (2(1 + (P − P1)/σ2²)).    (16)

Setting it to zero, we have P1∗ = (P + σ2² − σ1²)/2. If |σ2² − σ1²| ≤ P, then P1∗ ∈ [0, P], and the optimal
power allocation is given by

      P1∗ = (P + σ2² − σ1²)/2    (17)
      P2∗ = (P + σ1² − σ2²)/2.   (18)

If |σ1² − σ2²| > P, without loss of generality assume σ1² ≤ σ2². Then we should set
P1∗ = P, P2∗ = 0. In other words, if the qualities of the two Gaussian channels are very
different (in the sense that |σ1² − σ2²| > P), then we should allocate all the power to the
stronger (less noisy) channel.
To sum up, we have the capacity of this parallel Gaussian channel equal to

      Cparallel(P) = (1/2) log(1 + P1∗/σ1²) + (1/2) log(1 + (P − P1∗)/σ2²),    (19)

where

      P1∗ = (P + σ2² − σ1²)/2   if |σ1² − σ2²| ≤ P,
      P1∗ = P                   if |σ1² − σ2²| > P and σ1² < σ2²,
      P1∗ = 0                   if |σ1² − σ2²| > P and σ1² > σ2².    (20)
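
A small sketch (not part of the original solutions) of the resulting power allocation and capacity computation, following (19)-(20); the parameter values below are arbitrary, chosen only to exercise both branches:

```python
from math import log2

def parallel_gaussian_capacity(P, s1, s2):
    """Capacity of the two-branch parallel Gaussian channel, following (19)-(20).
    s1 and s2 are the noise variances sigma_1^2 and sigma_2^2."""
    if abs(s1 - s2) <= P:
        P1 = (P + s2 - s1) / 2          # water-filling splits the power
    else:
        P1 = P if s1 < s2 else 0.0      # all power to the less noisy channel
    P2 = P - P1
    return 0.5 * log2(1 + P1 / s1) + 0.5 * log2(1 + P2 / s2)

print(parallel_gaussian_capacity(P=10.0, s1=1.0, s2=4.0))   # |s1 - s2| <= P: power is split
print(parallel_gaussian_capacity(P=2.0,  s1=1.0, s2=9.0))   # |s1 - s2| >  P: all power to channel 1
```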

Suppose we now have capacity-achieving schemes for the single (scalar) Gaussian channel. In
order to achieve the capacity of this parallel Gaussian channel, we first allocate power P1∗ to
channel 1 and power P − P1∗ to channel 2. Then we use the codebook for the Gaussian
channel with rate (1/2) log(1 + P1∗/σ1²) and power P1∗ on channel 1, and the codebook for the
Gaussian channel with rate (1/2) log(1 + (P − P1∗)/σ2²) and power P − P1∗ on channel 2. The joint
codebook has rate Cparallel(P).
Gaussian Source

By Shannon’s rate distortion theorem, the rate distortion function of this source is given
by

      Rjoint(D) = min_{P_{V1,V2|U1,U2}: E[|U1−V1|²+|U2−V2|²] ≤ D} I(U1, U2; V1, V2)                     (21)
                = min ( h(U1, U2) − h(U1, U2 | V1, V2) )                                                (22)
                = min ( h(U1) + h(U2) − h(U1 | V1, V2) − h(U2 | U1, V1, V2) )                           (23)
                ≥ min ( h(U1) + h(U2) − h(U1|V1) − h(U2|V2) )                                           (24)
                = min ( I(U1; V1) + I(U2; V2) )                                                         (25)
                = min_{D1,D2 ≥ 0, D1+D2 ≤ D} ( min_{P_{V1|U1}: E[|U1−V1|²] ≤ D1} I(U1; V1) + min_{P_{V2|U2}: E[|U2−V2|²] ≤ D2} I(U2; V2) )   (26)
                = min_{D1,D2 ≥ 0, D1+D2 ≤ D} ( max{0, (1/2) log(σ1²/D1)} + max{0, (1/2) log(σ2²/D2)} ),   (27)

where the minimizations in (22)-(25) are over the same constraint set as in (21).

We note that in the above chain of inequalities, if we take the joint test channel PV1,V2|U1,U2
to be of the form PV1|U1 PV2|U2, all inequalities hold with equality. Hence, it suffices to solve the
last optimization problem, and the resulting answer is not only a lower bound but also an
upper bound on the rate distortion function of this two-dimensional source.
Since the function (1/2) log(1/x) is non-increasing for x > 0, the minimum is attained when
D1 + D2 = D. We define the function

      g(D1) = max{0, (1/2) log(σ1²/D1)} + max{0, (1/2) log(σ2²/(D − D1))}.    (28)

When D1 ≤ σ1² and D − D1 ≤ σ2², taking the derivative of g(D1), we obtain

      g′(D1) = (1/2) ( 1/(D − D1) − 1/D1 ).    (29)

Setting it to zero, we have D1 = D/2. Hence, if D ≤ 2 min{σ1², σ2²}, the optimal distortion
allocation is

      D1∗ = D/2, D2∗ = D/2.    (30)

When σ1² + σ2² ≥ D > 2 min{σ1², σ2²}, without loss of generality we assume σ1² ≤ σ2².
Then we should use zero rate to describe U1, and allocate distortion D − σ1² to U2. If
D > σ1² + σ2², we simply use zero rate to describe both U1 and U2.
In other words, the joint rate distortion function is given by

      Rjoint(D) = (1/2) log(2σ1²/D) + (1/2) log(2σ2²/D)                if D ≤ 2 min{σ1², σ2²},
      Rjoint(D) = (1/2) log( max{σ1², σ2²} / (D − min{σ1², σ2²}) )     if σ1² + σ2² ≥ D > 2 min{σ1², σ2²},
      Rjoint(D) = 0                                                    if D > σ1² + σ2².    (31)

Suppose we now have good lossy compressors for the scalar Gaussian source. To achieve the rate
distortion function of this two-dimensional source, if D ≤ 2 min{σ1², σ2²}, we use the rate
distortion codes for U1 and U2 independently, with distortion D/2 for each source.
If σ1² + σ2² ≥ D > 2 min{σ1², σ2²}, we simply use a constant 0 to encode the source with the
smaller variance (say U1), and use a rate distortion code to encode the other source with
distortion D − min{σ1², σ2²}. If D > σ1² + σ2², we simply use a constant 0 to describe both
U1 and U2.
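
A corresponding sketch (again my own illustration, not from the exam) of the reverse water-filling rate computation in (31), with arbitrary example variances:

```python
from math import log2

def two_dim_gaussian_rd(D, s1, s2):
    """Rate-distortion function R_joint(D) for independent N(0, s1) and N(0, s2)
    components under squared-error distortion, following (31)."""
    lo, hi = min(s1, s2), max(s1, s2)
    if D <= 2 * lo:
        return 0.5 * log2(2 * s1 / D) + 0.5 * log2(2 * s2 / D)
    elif D <= s1 + s2:
        return 0.5 * log2(hi / (D - lo))   # weaker component described at zero rate
    else:
        return 0.0                         # both components described at zero rate

# Illustrative values only (sigma_1^2 = 1, sigma_2^2 = 4).
for D in (0.5, 3.0, 6.0):
    print(D, two_dim_gaussian_rd(D, 1.0, 4.0))
```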
