$$H(S) = \sum_i P_i \log \frac{1}{P_i} = 1.22 \text{ bits/source symbol}$$

Note that L is indeed bounded below by H(S). This fact, however, is little consolation for a poor code. It is easy to find an instantaneous code for this source which is better than code A. Such a code (code B) is given in the last column of the table; its average length is

L = 1.33 binits/source symbol

Now assume that the source symbols are ordered so that $P_1 \geq P_2 \geq \cdots \geq P_q$. By regarding the last two symbols of S as combined into one symbol, we obtain a new source† from S containing only q − 1 symbols. We refer to this new source as a reduction of S. The symbols of this reduction of S may be reordered, and again we may combine the two least probable symbols to form a reduction of this reduction of S. By proceeding in this manner, we construct a sequence of sources, each containing one fewer symbol than the previous one, until we arrive at a source with only two symbols.

† We take S to be a zero-memory source for the sake of convenience. Since we are allowed to code only one symbol from S at a time, it does not make any difference whether S is a zero-memory or a Markov source.
Figure 4-1. A source and its reductions. (The table lists the symbols and probabilities of the original source and of the successive reduced sources.)
Construction of a sequence of reduced sources, as illustrated, is the first step in the construction of a compact instantaneous code for the original source S. The second step is merely the recognition that a binary compact instantaneous code for the last reduced source (a source with only two symbols) is the trivial code with the two words 0 and 1. As the final step, we shall show that if we have a compact instantaneous code for one of the sources in the sequence of reduced sources, it is a simple matter to construct a compact instantaneous code for the source immediately preceding this reduced source. Using this fact, we start at the last reduced source and its trivial compact instantaneous code and work backward along the sequence of sources until we arrive at a compact instantaneous code for the original source.
Let us assume that we have found a compact instantaneous code for $S_j$, one of the sources in a sequence of reduced sources. One of the symbols of $S_j$, say $s_\alpha$, is formed from two symbols of the preceding source $S_{j-1}$. We call these two symbols $s_{\alpha 0}$ and $s_{\alpha 1}$. Each of the other symbols of $S_j$ corresponds to one of the remaining symbols of $S_{j-1}$. Then the compact instantaneous code for $S_{j-1}$ is formed from the code for $S_j$ as follows:

We assign to each symbol of $S_{j-1}$ (except $s_{\alpha 0}$ and $s_{\alpha 1}$) the code word used by the corresponding symbol of $S_j$. The code words used by $s_{\alpha 0}$ and $s_{\alpha 1}$ are formed by adding a 0 and a 1, respectively, to the code word used for $s_\alpha$.    (4-24)
It is easy to see that the code we form in this manner is instantaneous [condition (3-1)]. The proof that the code is compact is not so immediate, and we defer the proof of this fact until after we have illustrated the construction of a compact code.
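The construction just described, building the reductions and then working backward with rule (4-24), can be sketched in a few lines of code. The following Python sketch is mine, not the text's; the function and variable names are illustrative. It merges the two least probable symbols repeatedly (forming the reductions) and, because each merge prepends the distinguishing binit, the finished words are exactly those that rule (4-24) would produce by appending binits on the backward pass.

```python
import heapq
from itertools import count

def binary_compact_code(probs):
    """Binary compact (Huffman) code: repeatedly combine the two least
    probable symbols, assigning a 0 and a 1 to the two symbols combined."""
    tiebreak = count()                       # keeps heap comparisons well defined
    heap = [(p, next(tiebreak), [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    words = {i: "" for i in range(len(probs))}
    while len(heap) > 1:
        p0, _, group0 = heapq.heappop(heap)  # the two least probable symbols
        p1, _, group1 = heapq.heappop(heap)  # of the current reduction
        for i in group0:
            words[i] = "0" + words[i]
        for i in group1:
            words[i] = "1" + words[i]
        heapq.heappush(heap, (p0 + p1, next(tiebreak), group0 + group1))
    return [words[i] for i in range(len(probs))]

probs = [0.4, 0.3, 0.1, 0.1, 0.06, 0.04]     # the probabilities used in Examples 4-4 and 4-5
code = binary_compact_code(probs)
print(code, sum(p * len(w) for p, w in zip(probs, code)))   # average length 2.2 binits/symbol
```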
Example 4-4. We illustrate the synthesis of a binary compact code for the
source of Figure 4-1.
Figure 4-2. Synthesis of a compact code. (The table lists the symbols and probabilities of the original source, the code words assigned to them, and the reduced sources with their code words.)
We formed the compact code shown at the left in the three steps previously noted. First, we formed a sequence of reduced sources from the original source S. (See also Figure 4-1.) Then we assigned the words 0 and 1 to the last source in the sequence (S₄ in this case). Finally, we worked our way back from S₄ to S through the reduced sources. As we passed from source to source, we "decomposed" one code word each time in order to form two new code words.
Several properties of compact codes are illustrated by this procedure. The multiplicity of compact codes is especially important. Note that the method used to form two words from one in passing from reduced source to reduced source is merely to add a binit to the end of the word we decompose. It makes no difference which of the two words formed is assigned to which of the source symbols. This means that the assignment of the two code symbols 0 and 1 to the various words of the compact code we construct is arbitrary. We may complement† the jth digit of every word of the code and obtain another compact code. For example, if we complement the first and last digits of the code of Figure 4-2, we have the "new" compact code:

† The complement of 0 is 1; the complement of 1 is 0.
(the six code words obtained by complementing the first and last digit of each word of the code of Figure 4-2)

This method of producing a new compact code, however, results in just trivial differences between the two codes. The new code is obtained from the old code by a relabeling. It is also possible to obtain two compact codes for the same source which are fundamentally different. To see this, we synthesize a different code for the example given in Figure 4-2.

Example 4-5. We construct a different code for the source of Example 4-4 in Figure 4-3.

Figure 4-3. Synthesis of a compact code. (The table lists the original source symbols, their probabilities, the code words of the new code, and the reduced sources.)

At one stage of the reduction process we have a choice of which of the equally probable symbols to combine. If we choose the second or the third, the word lengths obtained are

1, 2, 3, 4, 5, 5

The average lengths of the two codes obtained are identical:

$$L = 1(0.4) + 2(0.3) + 4(0.1) + 4(0.1) + 4(0.06) + 4(0.04) = 2.2 \text{ binits/symbol}$$
$$L = 1(0.4) + 2(0.3) + 3(0.1) + 4(0.1) + 5(0.06) + 5(0.04) = 2.2 \text{ binits/symbol}$$

and we cannot construct an instantaneous code for this source with a smaller average length.

Another point made evident by the synthesis procedure described is that it may sometimes be unnecessary to form a sequence of reductions of the original source all the way to a source with only two symbols. This is so since we need only form reductions until we find the first reduction for which we have a compact code. Once we have a compact code for any reduction of a source, we may start working backward from this compact code, as described in rule (4-24). This point is illustrated in Figure 4-4.

Note also two properties of the compact codes we have constructed. First, if the symbols are ordered so that $P_1 \geq P_2 \geq \cdots \geq P_q$, the lengths of the words assigned to these symbols will be ordered so that $l_1 \leq l_2 \leq \cdots \leq l_q$. This is not surprising; it is merely an expression of the fact that we assign the shortest words to the most probable symbols of our source. The second property is perhaps a bit less obvious. We have shown that the lengths of the last two words (in order of decreasing probability) of a compact code are identical:

$$l_{q-1} = l_q \tag{4-27}$$

If there are several symbols with probability $P_q$, we may assign their subscripts so that the words assigned to the last two symbols differ only in their last digit.
4-8. r-ary Compact Codes

Section 4-6 pointed out that the construction of a binary compact code proceeded in three steps. First, we formed a sequence of reduced sources from the original source. Then we found a compact code for one of the sources in this sequence. Finally, we worked our way backward through the sequence, constructing new compact codes from the ones we had obtained, until we formed a compact code for the original source S. In this section we shall see that the construction of a compact code consists of the same three steps when the code alphabet has r symbols. The last two of these steps, furthermore, will be changed in no important respect from the binary case.

The formation of reduced sources preparatory to the synthesis of a binary compact code proceeded by combining the two least probable source symbols in order to form a single symbol. When we wish to form an r-ary compact code, we shall combine the source symbols r at a time in order to form one symbol in the reduced source.
One hitch appears in the r-ary case, however, which did not appear in the binary case. In the binary case, each source in the sequence of reduced sources contained one fewer symbol than the source immediately preceding. In the r-ary case, we combine r symbols to form one symbol, and thus each source in the sequence has r − 1 fewer symbols than the preceding source. We would like the last source in the sequence to have exactly r symbols. (This will allow us to construct the trivial compact code for this source.) The last source will have r symbols if and only if the original source has r + α(r − 1) symbols, where α is an integer. Therefore, if the original source does not have r + α(r − 1) symbols, we add "dummy symbols" to the source until this number is reached. The dummy symbols are assumed to have probability 0, and so they may be ignored after the code is formed.
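The dummy-symbol count is easy to compute. The short Python sketch below is mine (the names are illustrative); it returns the number of probability-zero symbols that must be added so that the source size has the form r + α(r − 1).

```python
def dummy_symbols_needed(q, r):
    """Dummy symbols to add to a q-symbol source so that repeated
    r-into-1 reductions end with a source of exactly r symbols,
    i.e., so that q + d = r + alpha*(r - 1) for some integer alpha >= 0."""
    if q <= r:
        return r - q
    return (r - q) % (r - 1)   # smallest d >= 0 with (q + d - r) divisible by r - 1

print(dummy_symbols_needed(11, 4))   # 2, as in Example 4-6: 11 + 2 = 13 = 4 + 3(3)
print(dummy_symbols_needed(13, 4))   # 0
```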
Example 4-6. Consider the source S with 11 symbols shown in Figure 4-5. We wish to form a sequence of reduced sources from this source before encoding the source in a quaternary code, a code using four code symbols. If the last source in the sequence is to have four symbols, S must have 4 + 3α symbols, where α is an integer. Since 11 is not of the form 4 + 3α, we add two dummy symbols to S to bring the total number of symbols to 13. Now, reducing the source four symbols at a time produces a source with exactly four symbols.
Figure 4-5. A source and its reductions. (The table lists the symbols and probabilities of the original source, including the two dummy symbols of probability 0, and the reduced sources.)
Having formed the reductions shown in Figure 4-5, we proceed to synthesize a compact code in the same manner as in Section 4-6. We assign r code words, each of length 1, to the last reduction in order to form a compact code for this source. This code is enlarged, just as in the binary case, to form a compact code for the preceding reduced source. Each time we go from a reduced source to the preceding reduced source, we form r symbols from one symbol for a net gain of r − 1 symbols. The proof that if we start with a compact code we shall produce a compact code by this procedure is entirely analogous to the proof provided in Section 4-7 (Problem 4-2).
Example 4-7. To provide an illustration of the procedure described above, we find a quaternary compact code (Figure 4-6) for the source of Figure 4-5.

Figure 4-6. A quaternary compact code. (The table lists each source symbol, its probability, and its code word.)
4-9. Code Efficiency and Redundancy

Shannon's first theorem shows that there exists a common yardstick with which we may measure any information source. The value of a symbol from an information source S may be measured in terms of an equivalent number of binary digits needed to represent one symbol from that source; the theorem says that the average value of a symbol from S is H(S). More generally, the average value of a symbol from S in terms of r-ary digits is H_r(S).

Let the average length of a uniquely decodable r-ary code for the source S be L. L cannot be less than H_r(S). Accordingly, we define η, the efficiency of the code, by

$$\eta = \frac{H_r(S)}{L} \tag{4-28}$$

It is also possible to define the redundancy of a code:

$$\text{Redundancy} = 1 - \eta = \frac{L - H_r(S)}{L} \tag{4-29}$$
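The two definitions translate directly into code. The sketch below is mine (the function names are not from the text); it evaluates H_r(S), the efficiency η = H_r(S)/L, and the redundancy 1 − η for a source and a set of word lengths.

```python
import math

def entropy(probs, r=2):
    """H_r(S): the source entropy measured in r-ary units."""
    return sum(p * math.log(1.0 / p, r) for p in probs if p > 0)

def efficiency(probs, word_lengths, r=2):
    """eta = H_r(S) / L for an r-ary code with the given word lengths (eq. 4-28)."""
    L = sum(p * l for p, l in zip(probs, word_lengths))
    return entropy(probs, r) / L

probs = [1/4, 3/4]                 # the two-symbol source of Example 4-8 below
eta = efficiency(probs, [1, 1])    # its compact code uses the words 0 and 1
print(eta, 1 - eta)                # efficiency about 0.811, redundancy about 0.189
```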
Example 4-8. Consider the zero-memory source S = {s₁, s₂}, with P(s₁) = 1/4 and P(s₂) = 3/4. We calculate

$$H(S) = \tfrac{1}{4}\log 4 + \tfrac{3}{4}\log\tfrac{4}{3} = 0.811 \text{ bit}$$

A compact code for this source is as follows:

    s_i    P(s_i)    Code
    s₁     1/4       0
    s₂     3/4       1

The average length of this code is 1 binit, and so the efficiency is

η = 0.811

To improve the efficiency, we might code S², the second extension of S:

    σ_i      P(σ_i)    Compact code
    s₁s₁     1/16      111
    s₁s₂     3/16      110
    s₂s₁     3/16      10
    s₂s₂     9/16      0

The average length of this code is 27/16 binits. The entropy of S² is 2H(S); so

$$\eta_2 = \frac{2 \times 0.811}{27/16} = 0.961$$

Coding the third and fourth extensions of S, we find efficiencies

η₃ = 0.985
and
η₄ = 0.991

As we encode higher and higher extensions of the original source S, the efficiency must approach 1. In this example, the approach is quite rapid, and little increase in efficiency can be obtained by going further than the second extension. Such behavior is typical of Huffman coding.
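This approach to 1 can be reproduced numerically. The sketch below is mine (the helper is a generic compact-code length routine, not the text's); it codes the nth extension of the source of Example 4-8 and prints the efficiency η_n.

```python
import heapq, math
from itertools import count, product

def compact_code_lengths(probs):
    """Word lengths of a binary compact (Huffman) code for the given probabilities."""
    tb = count()
    heap = [(p, next(tb), [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        p0, _, g0 = heapq.heappop(heap)
        p1, _, g1 = heapq.heappop(heap)
        for i in g0 + g1:
            lengths[i] += 1
        heapq.heappush(heap, (p0 + p1, next(tb), g0 + g1))
    return lengths

base = [1/4, 3/4]                                   # the source S of Example 4-8
H = sum(p * math.log2(1 / p) for p in base)
for n in (1, 2, 3, 4):
    ext = [math.prod(c) for c in product(base, repeat=n)]   # probabilities of S^n
    L = sum(p * l for p, l in zip(ext, compact_code_lengths(ext)))
    print(n, n * H / L)   # eta_n approaches 1 as n grows (compare the values in Example 4-8)
```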
The previous example showed the increase in efficiency obtained by encoding higher and higher extensions of a source. It will be instructive to examine the efficiency as a function of r, the number of symbols in the code alphabet.
Example 4-9. We take a zero-memory source S with 13 symbols and symbol probabilities as shown in Table 4-5. In the same table we list the compact (Huffman) code for code alphabets comprised of 2 to 13 symbols.
Table 4-5. Compact codes for different code alphabets. (For each of the 13 source symbols, the table lists its probability and the code words of the compact codes for code alphabets of r = 13, 12, . . . , 2 symbols, together with the average length of each code.)
The entropy of the source of Table 4-5 is 3.125 bits per symbol. Using this fact and (4-28), we may plot the efficiency as a function of r.

Figure 4-7. Code efficiency versus number of code symbols. (Efficiency is plotted against r, the number of code symbols.)

From Figure 4-7 we see that the efficiency tends to climb as r decreases. The increase in efficiency, however, is not monotonic. Note the efficiency of the code for r = 2 and r = 4. The symbol probabilities are all of the form 1/2^α or 1/4^α, where α is an integer. In these cases, we know (Section 4-2) that we can find a compact code with average length equal to the entropy.
NOTES

Note 1. In this chapter we have proved Shannon's first theorem only for ergodic Markov sources with a finite number of symbols (i.e., states). A more elegant proof of this theorem, valid for any stationary, ergodic source, was given by McMillan (1953) in a slightly different form, called the asymptotic equipartition property (AEP). For a general source S, let

$$I(s_1, s_2, \ldots, s_n) = \log \frac{1}{P(s_1, s_2, \ldots, s_n)}$$

S has the AEP if $I(s_1, \ldots, s_n)/n$ converges in probability to H(S). The importance of the AEP lies in the fact that long sequences of outputs from a source with this property can be divided into two classes:
1. A class such that each sequence in the class has probability roughly equal to $2^{-nH(S)}$
2. A class composed of sequences which hardly ever occur

A simple combinatorial proof of the AEP has been provided by Thomasian (1960). A generalization to more complicated sources is given by Perez (1958).
Note 2. In Section 4-6 we provided an example to show that two different (in their word lengths) binary codes might each be compact for a given source. Golomb has investigated the conditions under which this phenomenon can occur and the number of possible nontrivially different compact codes for a given source.

We can describe the construction of a code by means of a code tree (Fano, 1961). For example, consider a binary code and its corresponding tree. Then the question of how many different codes exist for a source with q symbols can be phrased in terms of code trees.

For q = 2 there is only one possible tree, corresponding to the word lengths 1, 1.

For q = 3 there is again only one possible tree, corresponding to the word lengths 1, 2, 2.

For q = 4 there are two possible trees. For q = 5 there are three possible trees. For q = 6 and 7 there are five and nine different trees, respectively.

Golomb has also found conditions on the symbol probabilities in order for more than one compact code to exist. For example, with q = 4, we must clearly have P₁ = P₃ + P₄. Further analysis shows that 1/3 ≤ P₁ ≤ 2/5 if two different compact codes exist.
Note 3. We have treated the problem of coding information sources under the assumption that the duration (or some other criterion of cost) is the same for each code symbol. When this is not the case, certain modifications of the results of Chapter 4 are required. Let the code alphabet be

$$X = \{x_1, x_2, \ldots, x_r\}$$

and let the duration of code symbol $x_i$ be $t_i$. Then if N(T) is the number of sequences of duration exactly T,

$$N(T) = N(T - t_1) + N(T - t_2) + \cdots + N(T - t_r)$$

When this difference equation is solved, we find that N(T) grows as $A r_0^T$ for large T, where A is some constant and $r_0$ is the largest real root of the characteristic equation

$$r^{-t_1} + r^{-t_2} + \cdots + r^{-t_r} = 1$$

The asymptotic number of equivalent binits per unit time is then $\log r_0$, and this result can be used to modify our form of Shannon's first theorem. The problem of finite time coding for such a code alphabet (equivalent to Huffman coding) has been treated by Karp (1961).
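The largest root r₀ of the characteristic equation is easy to find numerically. The sketch below is mine (not from the text); it uses simple bisection and reports both r₀ and the corresponding log₂ r₀ binits per unit time.

```python
import math

def capacity_unequal_durations(durations, tol=1e-12):
    """Largest real root r0 of sum_i r**(-t_i) = 1 (found by bisection),
    and the asymptotic number of binits per unit time, log2(r0)."""
    f = lambda r: sum(r ** (-t) for t in durations) - 1.0
    lo, hi = 1.0 + 1e-9, 2.0
    while f(hi) > 0:                # f decreases in r; enlarge the bracket if needed
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
    r0 = 0.5 * (lo + hi)
    return r0, math.log2(r0)

# two code symbols lasting 1 and 2 time units (a classic illustration)
print(capacity_unequal_durations([1, 2]))   # r0 = 1.618..., about 0.694 binits per unit time
```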
PROBLEMS
4-1. Prove Equation (4-22) for mth-order Markov sources.
4-2. Show that proceeding from reduced source to reduced source, as described in Section 4-8, will produce an r-ary compact code, if one starts with an r-ary compact code.
4-3. A sequence of symbols from S^n is coded by the Huffman procedure into the code alphabet X = {x₁, x₂, . . . , x_r}. The result of this coding procedure may be viewed as a new information source with the source alphabet X. Show that the probability of each of the symbols x_i of this new source approaches 1/r as n increases.
4-4. A zero-memory binary source has P(0) = 0.1 and P(1) = 0.9.
(a) Find H(S).
(b) Find L, the average word length of a compact code for S when X = {0, 1}.
(c) Find L_n for n = 2, 3, 4 and n → ∞, when S^n is encoded into a compact code, still with X = {0, 1}.
(d) Find the efficiency of your four codes.
4-5. In Problem 4-4 we encoded S, S², S³, and S⁴ into X. These codings produced sequences of 0s and 1s. We can view these sequences as being emitted from some new source S_n, as shown in Figure P 4-5. Find H(S_n) when n = 1, 2, 3, 4.

Figure P 4-5. (S^n is encoded into a binary code; the code symbols form the binary source S_n.)
4-6. Given the following table:

    (The table lists the symbols of S and their probabilities.)

(a) Find H(S) and H₃(S).
(b) Find a compact code for S when X = {0, 1} and when X = {0, 1, 2}.
(c) Compute L for both the above codes.
4-7. Given the following table:

    (The table lists the symbols of S and their probabilities.)

(a) Find a compact code for this source with X = {0, 1, 2}.
(b) There may be more than one nontrivially different (i.e., different sets of word lengths) compact code for this source and this alphabet. List the sets of word lengths for all such codes you can find.
4-8. Let α = 1/2 in Problem 2-14. A binary code is found for S with L = H(S). Find L₂, the average length of a compact code for S².
4-9. The source S has nine symbols; each occurs with probability 1/9.
(a) Find a compact code for S using the code alphabet X = {0, 1}.
(b) Find a compact code for S using the code alphabet X = {0, 1, 2}.
(c) Find a compact code for S using the code alphabet X = {0, 1, 2, 3}.
4-10. A source S has six symbols with probabilities P₁ to P₆, respectively. Assume that we have ordered the P_i so that P₁ ≥ P₂ ≥ ⋯ ≥ P₆. We wish to find a compact code for this source using the code alphabet X = {0, 1, 2, 3}.
Find 2 set of word lengths for such a compact code if Pe = gy.
4-11. Find all possible different compact binary codes for the source shown in the following table:

    s_i       s₁    s₂    s₃    s₄    s₅    s₆    s₇    s₈    s₉    s₁₀
    P(s_i)    0.20  0.18  0.12  0.10  0.10  0.08  0.06  0.06  0.05  0.05

(Count as "different" only those codes with different sets of word lengths.)
4-12. (a) Find the five different code trees corresponding to q = 6 in Note 2.
(b) Find the nine different code trees corresponding to q = 7.
4-13. This problem deals with a generalization of Note 2. Find all possible different code trees corresponding to trinary compact codes for sources with q = 3, 4, 5, 6, 7, 8.
5
CHANNELS
AND MUTUAL
INFORMATION
5-1. Introduction
In the first four chapters, we were concerned with properties of information sources and with transformations of sequences of source symbols into sequences of code symbols. We were able to relate our information measure to properties of information sources. In particular, the entropy of a source (expressed in suitable units) was shown to define a lower bound to the average number of code symbols needed to encode each source symbol. We used this bound in Section 4-9 to define the efficiency and redundancy of a code. Indeed, in retrospect we note that a large portion of the first part of this book was devoted to providing a background for our definitions of efficiency and redundancy and to the synthesis of codes containing as little redundancy as possible.

In view of this preoccupation with redundancy reduction up to now, it may surprise the reader to learn that Chapters 5 and 6 will be concerned primarily with methods of putting redundancy back into codes! We shall see that it is not always desirable to use codes with little or no redundancy. In this chapter, our interest will shift from the information source to the information channel, that is, from information generation to information transmission.

Our introduction of the concept of an information channel leads directly to the possibility of errors arising in the process of information transmission. We shall study the effect such errors have on our efforts to transmit information. This will, in turn, lead to the possibility of coding in order to decrease the effect of errors caused by an information channel. The reader should not be surprised to learn that our information measure may be used to analyze this type of coding, as well as the type of coding already discussed. In fact, in spite of the considerable progress we have made so far, the central result of information theory and the most dramatic use of the concept of entropy are yet to come. This result, Shannon's remarkable second theorem, will use the entropy idea to describe how we may utilize an unreliable information channel to transmit reliable information!
5-2. Information Channels

Our primary concern in the rest of this book will be the information channel.

Definition. An information channel† is described by giving an input alphabet A = {a_i}, i = 1, 2, . . . , r; an output alphabet B = {b_j}, j = 1, 2, . . . , s; and a set of conditional probabilities P(b_j/a_i) for all i and j. P(b_j/a_i) is just the probability that the output symbol b_j will be received if the input symbol a_i is sent.

† The channel defined above is sometimes called a zero-memory information channel. A more general definition, where the probability of a given output b_j may depend upon several preceding input symbols or even output symbols, is also possible. Such channels are referred to as channels with memory.
A particular channel of great theoretical and practical importance is the binary symmetric channel (BSC). The channel diagram of the BSC is shown in Figure 5-2. As usual, we let p̄ = 1 − p. This channel has two input symbols (a₁ = 0, a₂ = 1) and two output symbols (b₁ = 0, b₂ = 1). It is symmetric because the probability of receiving a 1 if a 0 is sent is equal to the probability of receiving a 0 if a 1 is sent; this probability, the probability that an error will occur, is p.

Figure 5-1. An information channel. Figure 5-2. The binary symmetric channel (BSC).

A convenient way of describing an information channel is to arrange the conditional probabilities of its outputs as shown in Figure 5-3.

$$\begin{matrix} P(b_1/a_1) & P(b_2/a_1) & \cdots & P(b_s/a_1) \\ P(b_1/a_2) & P(b_2/a_2) & \cdots & P(b_s/a_2) \\ \vdots & & & \vdots \\ P(b_1/a_r) & P(b_2/a_r) & \cdots & P(b_s/a_r) \end{matrix}$$

Figure 5-3. Description of an information channel.

Note that each row of this array corresponds to a fixed input, and that the terms in this row are just the probabilities of obtaining the various b_j at the output if that fixed input is sent. We shall see this description of an information channel so often that it will be useful to have a more streamlined notation for it. Accordingly, we define

$$P_{ij} = P(b_j/a_i) \tag{5-1}$$
and the array of Figure 5-3 becomes the channel matrix P:

$$P = \begin{bmatrix} P_{11} & P_{12} & \cdots & P_{1s} \\ P_{21} & P_{22} & \cdots & P_{2s} \\ \vdots & & & \vdots \\ P_{r1} & P_{r2} & \cdots & P_{rs} \end{bmatrix} \tag{5-2}$$

An information channel is completely described by giving its channel matrix. We therefore use P interchangeably to represent both the channel matrix and the channel itself.

Each row of the channel matrix corresponds to an input of the channel, and each column corresponds to a channel output. Note a fundamental property of the channel matrix: the terms in any one row of the matrix must sum to 1.† This follows since, if we send any input symbol a_i, we must get some output symbol. We write this equation for future reference:

$$\sum_{j=1}^{s} P_{ij} = 1 \qquad \text{for all } i \tag{5-3}$$

The channel matrix of the BSC is

$$P = \begin{bmatrix} \bar p & p \\ p & \bar p \end{bmatrix} \tag{5-4}$$

† Such matrices are called Markov matrices or stochastic matrices.
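As a quick numerical check of these definitions, here is a small Python sketch (mine, not the text's) that builds the BSC matrix of (5-4) and verifies the row-sum property (5-3).

```python
import numpy as np

def bsc(p):
    """Channel matrix of a binary symmetric channel with error probability p."""
    return np.array([[1 - p, p],
                     [p, 1 - p]])

P = bsc(0.1)
assert np.allclose(P.sum(axis=1), 1.0)   # each row sums to 1, equation (5-3)
print(P)
```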
Just as we did in the case of information sources, we may view the inputs and outputs of a channel in blocks of n symbols, rather than individually. Thus, we define the nth extension of a channel.

Definition. Consider an information channel with input alphabet A = {a_i}, i = 1, 2, . . . , r; output alphabet B = {b_j}, j = 1, 2, . . . , s; and channel matrix

$$P = \begin{bmatrix} P_{11} & P_{12} & \cdots & P_{1s} \\ P_{21} & P_{22} & \cdots & P_{2s} \\ \vdots & & & \vdots \\ P_{r1} & P_{r2} & \cdots & P_{rs} \end{bmatrix}$$

The nth extension of this channel has input alphabet A^n = {α_i}, i = 1, 2, . . . , r^n; output alphabet B^n = {β_j}, j = 1, 2, . . . , s^n; and channel matrix
$$\Pi = \begin{bmatrix} \Pi_{11} & \Pi_{12} & \cdots & \Pi_{1s^n} \\ \Pi_{21} & \Pi_{22} & \cdots & \Pi_{2s^n} \\ \vdots & & & \vdots \\ \Pi_{r^n 1} & \Pi_{r^n 2} & \cdots & \Pi_{r^n s^n} \end{bmatrix}$$

Each of the inputs α_i consists of a sequence of n elementary input symbols (a_{i1}, a_{i2}, . . . , a_{in}), and each of the outputs β_j consists of a sequence of n elementary output symbols (b_{j1}, b_{j2}, . . . , b_{jn}). The probability Π_{ij} = P(β_j/α_i) is the product of the corresponding elementary symbol probabilities.

Just as was the case when we defined the extension of an information source, the extension of an information channel is not really a new concept, but just a new way of viewing an old concept. Merely by looking at symbols of some channel in blocks of length n, we obtain the nth extension of that channel.
Example 5-1. The second extension of the BSC is a channel with four input symbols and four output symbols. Its channel matrix is shown in Figure 5-4.

$$P_2 = \begin{bmatrix} \bar p^2 & \bar p p & p\bar p & p^2 \\ \bar p p & \bar p^2 & p^2 & p\bar p \\ p\bar p & p^2 & \bar p^2 & \bar p p \\ p^2 & p\bar p & \bar p p & \bar p^2 \end{bmatrix}$$

Figure 5-4. Channel matrix of the (BSC)².
We note that the channel matrix of the (BSC)² may be written as a matrix of matrices. Let P, as before, be the channel matrix of the BSC. Then the channel matrix of the (BSC)² can also be written

$$P_2 = \begin{bmatrix} \bar p P & pP \\ pP & \bar p P \end{bmatrix}$$

The above matrix is known as the Kronecker square (Bellman, 1960) (or tensor square) of the matrix P. In the more general case, the channel matrix of the nth extension of a channel is the nth Kronecker power of the original channel matrix.
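The Kronecker-power observation makes the nth extension easy to generate mechanically. A sketch (mine; np.kron is numpy's Kronecker product):

```python
import numpy as np
from functools import reduce

def nth_extension(P, n):
    """Channel matrix of the nth extension: the nth Kronecker power of P."""
    return reduce(np.kron, [np.asarray(P)] * n)

p = 0.1
P = np.array([[1 - p, p], [p, 1 - p]])
P2 = nth_extension(P, 2)                  # the 4 x 4 matrix of Figure 5-4
assert np.allclose(P2.sum(axis=1), 1.0)   # rows still sum to 1
print(P2)
```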
In the first part of the book, we used our information measure to measure the average amount of information produced by a source. The function of an information channel, however, is not to produce information, but to transmit information from the input to the output. We expect, therefore, to use our information measure to measure the ability of a channel to transport information. This will indeed be the case; we now proceed to an investigation of the amount of information a channel can transmit.
5-3. Probability Relations in a Channel
Consider an information channel with r input symbols and s output symbols. We define the channel by the channel matrix P:

$$P = \begin{bmatrix} P_{11} & P_{12} & \cdots & P_{1s} \\ P_{21} & P_{22} & \cdots & P_{2s} \\ \vdots & & & \vdots \\ P_{r1} & P_{r2} & \cdots & P_{rs} \end{bmatrix}$$
Let us select input symbols according to the probabilities P(a₁), P(a₂), . . . , P(a_r) for transmission through this channel. Then the output symbols will appear according to some other set of probabilities: P(b₁), P(b₂), . . . , P(b_s). The relations between the probabilities of the various input symbols and the probabilities of the various output symbols are easily derived. For example, there are r ways in which we might receive output symbol b₁. If a₁ is sent, b₁ will occur with probability P₁₁; if a₂ is sent, b₁ will occur with probability P₂₁; etc. We therefore write

$$P(a_1)P_{11} + P(a_2)P_{21} + \cdots + P(a_r)P_{r1} = P(b_1)$$
$$P(a_1)P_{12} + P(a_2)P_{22} + \cdots + P(a_r)P_{r2} = P(b_2)$$
$$\cdots$$
$$P(a_1)P_{1s} + P(a_2)P_{2s} + \cdots + P(a_r)P_{rs} = P(b_s) \tag{5-6}$$

Equations (5-6) provide us with expressions for the probabilities of the various output symbols if we are given the input probabilities P(a_i) and the channel matrix, the matrix of conditional probabilities P(b_j/a_i). For the remainder of this chapter, we assume that we are given the P(a_i) and the P(b_j/a_i), so that the P(b_j) may be calculated from (5-6). Note, however, that if we are given the output probabilities P(b_j) and the P(b_j/a_i), it may not be possible to invert the system of linear equations (5-6) in order to determine the P(a_i). For example, in a BSC with p = 1/2, any set of input probabilities will lead to output symbols which are equiprobable. In general, there may be many input distributions which lead to the same output distribution. If we are given the input distribution, on the other hand, we may always calculate a unique output distribution with the aid of (5-6).
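Equations (5-6) are just a row-vector-times-matrix product. A sketch (mine):

```python
import numpy as np

def output_probabilities(input_probs, P):
    """P(b_j) = sum_i P(a_i) P_ij, i.e., equations (5-6)."""
    return np.asarray(input_probs) @ np.asarray(P)

P = np.array([[2/3, 1/3],
              [1/10, 9/10]])                  # the channel of Example 5-2 below
print(output_probabilities([3/4, 1/4], P))    # [0.525  0.475], i.e., 21/40 and 19/40
```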
In addition to the P(b_j), there are two more sets of probabilities associated with an information channel which may be calculated from the P(a_i) and the P(b_j/a_i). According to Bayes' law, the conditional probability of an input a_i, given that an output b_j has been received, is

$$P(a_i/b_j) = \frac{P(b_j/a_i)P(a_i)}{P(b_j)} \tag{5-7a}$$

or, using (5-6),

$$P(a_i/b_j) = \frac{P(b_j/a_i)P(a_i)}{\sum_A P(b_j/a_i)P(a_i)} \tag{5-7b}$$

The probabilities P(a_i/b_j) are sometimes referred to as backward probabilities in order to distinguish them from the forward probabilities P(b_j/a_i).

The numerator of the right side of (5-7) is the probability of the joint event (a_i, b_j),

$$P(a_i, b_j) = P(b_j/a_i)P(a_i) \tag{5-8a}$$

and this quantity may also be recognized as

$$P(a_i, b_j) = P(a_i/b_j)P(b_j) \tag{5-8b}$$

Figure 5-5. A noisy information channel.
Example 5-2. Let us illustrate the calculation of the various probabilities associated with an information channel. We take a binary channel; that is, A = {0, 1} and B = {0, 1}. We assume that the P(b_j/a_i) are as shown in the matrix

$$P = \begin{bmatrix} \tfrac{2}{3} & \tfrac{1}{3} \\ \tfrac{1}{10} & \tfrac{9}{10} \end{bmatrix}$$

We associate the rows and columns of the above matrix with the input and output symbols in the natural order. Therefore, Pr{b = 0/a = 0} = 2/3, Pr{b = 1/a = 0} = 1/3, etc. Finally, we assume that Pr{a = 0} = 3/4 and Pr{a = 1} = 1/4. The information given above is neatly summarized in Figure 5-5.

The probabilities of the output symbols are obtained with the use of (5-6):

Pr{b = 0} = (3/4)(2/3) + (1/4)(1/10) = 21/40    (5-9a)
Pr{b = 1} = (3/4)(1/3) + (1/4)(9/10) = 19/40    (5-9b)

As a check, we note that Pr{b = 0} + Pr{b = 1} = 1. The conditional input probabilities are obtained from (5-7):

Pr{a = 0/b = 0} = (3/4)(2/3) / (21/40) = 20/21    (5-10a)
Pr{a = 1/b = 1} = (1/4)(9/10) / (19/40) = 9/19    (5-10b)

The other two backward probabilities may be similarly obtained. A simpler method, however, is to use the fact that Pr{a = 0/b = 0} + Pr{a = 1/b = 0} = 1 and Pr{a = 0/b = 1} + Pr{a = 1/b = 1} = 1. Thus,

Pr{a = 1/b = 0} = 1/21    (5-10c)
Pr{a = 0/b = 1} = 10/19    (5-10d)

The probabilities of various joint events are found by using (5-8). We calculate just one of these:

Pr{a = 0, b = 0} = Pr{a = 0/b = 0} Pr{b = 0} = (20/21)(21/40) = 1/2    (5-11)
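The whole table of probabilities in this example can be generated at once. A sketch (mine; the helper name is illustrative):

```python
import numpy as np

def channel_probabilities(input_probs, P):
    """Return P(b), the joint P(a, b), and the backward P(a/b) for channel matrix P."""
    w = np.asarray(input_probs, dtype=float)
    P = np.asarray(P, dtype=float)
    joint = w[:, None] * P          # P(a_i, b_j) = P(a_i) P(b_j/a_i), equation (5-8a)
    out = joint.sum(axis=0)         # P(b_j), equations (5-6)
    backward = joint / out          # P(a_i/b_j), Bayes' law (5-7)
    return out, joint, backward

out, joint, backward = channel_probabilities(
    [3/4, 1/4], [[2/3, 1/3], [1/10, 9/10]])
print(out)                # [21/40, 19/40]
print(backward[:, 0])     # P(a/b = 0) = [20/21, 1/21]
print(joint[0, 0])        # P(a = 0, b = 0) = 1/2
```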
5-4. A Priori and A Posteriori Entropies

The various output symbols of our channel occur according to the set of probabilities P(b_j). Note that the probability that a given output symbol will be b_j is P(b_j) if we do not know which input symbol is sent. On the other hand, if we do know that input symbol a_i is sent, the probability that the corresponding output will be b_j changes from P(b_j) to P(b_j/a_i). Likewise, we recall that the input symbol a_i is chosen with probability P(a_i). If we observe the output symbol b_j, however, we know the probability that a_i is the corresponding input symbol is P(a_i/b_j) [(5-7)]. Let us focus our attention on this change induced on the probabilities of the various input symbols by the reception of a given output symbol b_j.

We shall refer to the P(a_i) as the a priori probabilities of the input symbols, the probabilities of the a_i before the reception of an output symbol. The P(a_i/b_j) will be called the a posteriori probabilities of the input symbols, the probabilities after the reception of b_j. From Section 2-2 we know that we can calculate the entropy of the set of input symbols with respect to both these sets of probabilities. The a priori entropy of A is†

$$H(A) = \sum_A P(a) \log \frac{1}{P(a)} \tag{5-12}$$

and the a posteriori entropy of A, when b_j is received, is

$$H(A/b_j) = \sum_A P(a/b_j) \log \frac{1}{P(a/b_j)} \tag{5-13}$$

The interpretation of these two quantities follows directly from Shannon's first theorem. H(A) is the average number of binits needed to represent a symbol from a source with the a priori probabilities P(a_i), i = 1, 2, . . . , r; H(A/b_j) is the average number of binits needed to represent a symbol from a source with the a posteriori probabilities P(a_i/b_j), i = 1, 2, . . . , r.

† In the remainder of this book it will be convenient to omit the subscripts on a_i and b_j when writing sums over all symbols of the A or B alphabets.
Example 5-3. We repeat, as Figure 5-6, the figure used in Example 5-2 for ease of reference. The a priori entropy of the set of input symbols is

$$H(A) = \tfrac{3}{4}\log\tfrac{4}{3} + \tfrac{1}{4}\log 4 = 0.811 \text{ bit} \tag{5-14}$$

Figure 5-6. A noisy information channel.

If we receive the symbol 0 at the output of the channel, our a posteriori probabilities will be given by (5-10a) and (5-10c). The a posteriori entropy is

$$H(A/0) = \tfrac{20}{21}\log\tfrac{21}{20} + \tfrac{1}{21}\log 21 = 0.276 \text{ bit} \tag{5-15}$$

If we receive the symbol 1, on the other hand, the a posteriori entropy is

$$H(A/1) = \tfrac{10}{19}\log\tfrac{19}{10} + \tfrac{9}{19}\log\tfrac{19}{9} = 0.998 \text{ bit} \tag{5-16}$$

Hence, if a 0 is received, the entropy (our uncertainty about which input was sent) decreases, whereas if a 1 is received, our uncertainty increases.
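These entropies can be checked directly. A small sketch (mine), reusing the probabilities of Example 5-2:

```python
import numpy as np

def entropy(probs):
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

w = np.array([3/4, 1/4])                     # a priori input probabilities
P = np.array([[2/3, 1/3], [1/10, 9/10]])     # the channel of Example 5-2
joint = w[:, None] * P
backward = joint / joint.sum(axis=0)         # a posteriori probabilities P(a/b)

print(entropy(w))                 # H(A)   = 0.811 bit
print(entropy(backward[:, 0]))    # H(A/0) = 0.276 bit: uncertainty decreases
print(entropy(backward[:, 1]))    # H(A/1) = 0.998 bit: uncertainty increases
```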
5-5. Generalization of Shannon's First Theorem

Shannon's first theorem tells us that the entropy of an alphabet may be interpreted as the average number of binits necessary to represent one symbol of that alphabet. Consider this interpretation applied to the a priori and a posteriori entropies (Figure 5-7).

Before reception of the output symbol of the channel, we associate the a priori probabilities P(a_i) with the input alphabet A. The average number of binits necessary to represent a symbol from this alphabet is H(A). If we receive a given symbol, say b_j, we associate the a posteriori probabilities P(a_i/b_j) with the input alphabet.

Figure 5-7. An information channel.

The average number of binits necessary to represent a symbol from the alphabet with these (a posteriori) statistics is H(A/b_j). Since the output symbols occur with probabilities P(b_j), we might expect that the average number of binits (now averaged over b_j also) necessary to represent an input symbol a_i, if we are given an output symbol, is the average a posteriori entropy

$$\sum_B P(b)H(A/b) \tag{5-17}$$

This important result is, in fact, true. It does not follow from Shannon's first theorem, however. That theorem deals only with coding for a source with a fixed set of source statistics and not with coding for a source which selects a new set of source statistics after each output symbol. We therefore generalize Shannon's first theorem to cover this case.
The question we ask in order to obtain such a generalization is the same question we asked in order to obtain Shannon's first theorem, namely, "What is the most efficient method of coding from a source?" (In this case, the source is A.) This time, however, the statistics of the source we wish to code change from symbol to symbol. The indication of which set of source statistics we have is provided by the output of the channel b_j. Since a compact code for one set of source statistics will not, in general, be a compact code for another set of source statistics, we take advantage of our knowledge of b_j for each transmitted symbol to construct s binary† codes, one for each of the possible received symbols b_j. When the output of our channel is b_j, we use the jth binary code to encode the transmitted symbol a_i. Let the word lengths of our s codes be as shown in Table 5-1.

† The binary assumption is not necessary, but it serves to simplify the discussion which follows.

Table 5-1. Word lengths for s codes. (For each input symbol a_i the table lists l_{ij}, the length of the word assigned to a_i in the jth code, j = 1, 2, . . . , s.)
If we require these codes to be instantaneous, we may apply the first part of Shannon's first theorem (4-7) to each code separately and obtain

$$H(A/b_j) \leq \sum_A P(a/b_j)\, l_{ij} = L_j \tag{5-18}$$

L_j is the average length of the jth code. We employ the conditional probabilities P(a/b_j) rather than the marginal probabilities P(a) to calculate L_j, since the jth code is employed only when b_j is the received symbol. The average number of binits used for each member of the A alphabet when we encode in this fashion is obtained by averaging with respect to the received symbols b_j. Multiplying (5-18) by P(b_j) and summing over all B yield

$$\sum_B P(b)H(A/b) \leq \sum_B \sum_A P(a, b)\, l_{ij} = L \tag{5-19}$$

where L is the average number of binits per symbol from the A alphabet, averaged with respect to both input and output symbols. Note the similarity of (5-19) to (4-7).

In order to show that the bound (5-19) can be achieved, we next describe a specific coding procedure. When b_j is the output of our channel, we select l_{ij}, the word length of the code word corresponding to the input a_i, as the unique integer satisfying

$$\log \frac{1}{P(a_i/b_j)} \leq l_{ij} < \log \frac{1}{P(a_i/b_j)} + 1 \tag{5-20}$$

Word lengths defined in this fashion satisfy the Kraft inequality for each j.† The l_{ij}, therefore, define s sets of word lengths acceptable as the word lengths of s instantaneous codes. Now multiply (5-20) by P(a_i, b_j) = P(a_i/b_j)P(b_j):

$$P(b_j)P(a_i/b_j)\log\frac{1}{P(a_i/b_j)} \leq l_{ij}P(a_i, b_j) < P(b_j)P(a_i/b_j)\log\frac{1}{P(a_i/b_j)} + P(a_i, b_j) \tag{5-21}$$

and sum this equation over all members of the A and B alphabets:

$$\sum_B P(b)H(A/b) \leq L < \sum_B P(b)H(A/b) + 1$$

† This can be shown in exactly the same manner used in our proof of Shannon's first theorem (Section 4-3).
where p̄ = 1 − p. Assume that the probabilities of a 0 and a 1 being transmitted are ω and ω̄, respectively. We write the mutual information in the form

$$I(A; B) = H(B) - H(B/A)$$
$$= H(B) - \sum_A P(a) \sum_B P(b/a) \log \frac{1}{P(b/a)}$$
$$= H(B) - \sum_A P(a)\left(p \log \frac{1}{p} + \bar p \log \frac{1}{\bar p}\right)$$
$$= H(B) - \left(p \log \frac{1}{p} + \bar p \log \frac{1}{\bar p}\right)$$

The probabilities that b_j = 0 and b_j = 1 are easily calculated to be ωp̄ + ω̄p and ωp + ω̄p̄, respectively. Hence

$$I(A; B) = (\omega\bar p + \bar\omega p)\log\frac{1}{\omega\bar p + \bar\omega p} + (\omega p + \bar\omega\bar p)\log\frac{1}{\omega p + \bar\omega\bar p} - \left(p\log\frac{1}{p} + \bar p\log\frac{1}{\bar p}\right)$$

We may write I(A; B) in terms of the entropy function (Figure 2-3):

$$I(A; B) = H(\omega\bar p + \bar\omega p) - H(p) \tag{5-48}$$

Figure 5-10. Geometric interpretation of the mutual information of a BSC.

Figure 5-10 provides a geometric proof of the nonnegativity of mutual information. Certain limiting conditions of interest may also be seen from Figure 5-10. For example, for a fixed value of p, we may vary ω and examine the behavior of I(A; B). We see that I(A; B) achieves its maximum when ω = 1/2, and the value of this maximum is 1 − H(p). For ω = 0 or ω = 1, on the other hand, the mutual information is 0.
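Equation (5-48) is simple to evaluate numerically; the sketch below (mine) also confirms the limiting cases just mentioned.

```python
import math

def H(x):
    """The binary entropy function, in bits."""
    if x in (0.0, 1.0):
        return 0.0
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def bsc_mutual_information(w, p):
    """I(A; B) = H(w*(1 - p) + (1 - w)*p) - H(p), equation (5-48)."""
    return H(w * (1 - p) + (1 - w) * p) - H(p)

p = 0.1
print(bsc_mutual_information(0.5, p), 1 - H(p))   # maximum at w = 1/2 equals 1 - H(p)
print(bsc_mutual_information(0.0, p))             # 0.0 when only one input is ever sent
```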
5-8. Noiseless Channels and Deterministic Channels

In this section, we define two special types of channels and obtain simplified expressions for the mutual information of these channels.

Figure 5-11. A noiseless channel.

In the discussion that follows, we assume that each column of the channel matrix has at least one nonzero element. An output symbol corresponding to a column of zeros will occur with probability 0 for any distribution over the input symbols. It is therefore of no interest and may be ignored.
Definition. A channel described by a channel matrix with one, and only one, nonzero element in each column will be called a noiseless channel.

Example 5-5. The channel matrix of a noiseless channel is given as a matrix of three rows and six columns in which each column contains exactly one nonzero element. The channel diagram of this channel is shown in Figure 5-11.

A BSC with probability of error p equal to 0 is a noiseless channel. Note, however, that a BSC with probability of error equal to 1 is also a noiseless channel! This is an expression of the fact that a channel which is consistently in error can be as useful as a channel which is consistently correct.
Definition. A channel described by a channel matrix with one, and only one, nonzero element in each row will be called a deterministic channel.

Example 5-6. The channel matrix of a deterministic channel is given as shown in Figure 5-12.

Figure 5-12. A deterministic channel.

Since there is but one nonzero element in each row of a deterministic channel matrix, and the sum of the elements in each row must be 1, the elements of a deterministic channel matrix are all either 0 or 1.
The mutual information of the types of channels defined above can be easily calculated. First, consider a noiseless channel. In a noiseless channel, when we observe the output b_j, we know with probability 1 which a_i was transmitted; that is, the conditional probabilities P(a_i/b_j) are all either 0 or 1. Now, we write the equivocation H(A/B) as

$$H(A/B) = \sum_B P(b) \sum_A P(a/b) \log \frac{1}{P(a/b)} \tag{5-49}$$

and we note that all the terms in the inner summation (being either of the form 1 × log 1 or 0 × log (1/0)) are zero. Hence, for a noiseless channel,

$$H(A/B) = 0 \tag{5-50}$$
This conclusion is also evident in view of the generalization of Shannon's first theorem (Section 5-5). The outputs of a noiseless channel are sufficient by themselves to specify the inputs to the channel. Hence, the average number of binits necessary to specify the inputs when we know the outputs is zero. From (5-30) we see that, for a noiseless channel,

$$I(A; B) = H(A) \tag{5-51}$$

The amount of information transmitted through such a channel is equal to the total uncertainty of the input alphabet.

For deterministic channels, we may derive a set of analogous results. In a deterministic channel, the input symbol a_i is sufficient to determine the output symbol b_j with probability 1. Hence, all the probabilities P(b_j/a_i) are either 0 or 1, and

$$H(B/A) = \sum_A P(a) \sum_B P(b/a) \log \frac{1}{P(b/a)} = 0 \tag{5-52}$$

On using (5-37), we have, for a deterministic channel,

$$I(A; B) = H(B) \tag{5-53}$$
5-9. Cascaded Channels

Some interesting properties of entropy and mutual information are revealed by consideration of the cascade of two channels (Figure 5-13). [A detailed investigation of cascades of binary channels is provided by Silverman (1955).]

Figure 5-13. The cascade of two channels. (Channel 1 connects alphabet A to alphabet B; channel 2 connects alphabet B to alphabet C.)

We assume that a channel with an r-symbol input alphabet A and an s-symbol output alphabet B is cascaded with a second channel, as indicated above. The input alphabet of the second
channel is identified with B, and its output alphabet, consisting of t symbols, is denoted by C.

The fact that the alphabets are connected by the cascade of Figure 5-13 implies certain relationships among the symbol probabilities. When a_i, a symbol from A, is transmitted, the output of the first channel is some symbol from B, say b_j. In turn, b_j produces an output c_k from the second channel. The symbol c_k depends on the original input a_i only through b_j. In fact, if we know the intermediate symbol b_j, the probability of obtaining the terminal symbol c_k depends only upon b_j, and not upon the initial symbol a_i which produced b_j. This property of cascaded channels may be written as

$$P(c_k/b_j, a_i) = P(c_k/b_j) \qquad \text{for all } i, j, k \tag{5-54}$$

Indeed, (5-54) may be taken as the definition of what we mean by the cascade of two channels. A direct application of Bayes' rule to (5-54) yields a similar equation in the reverse direction:

$$P(a_i/b_j, c_k) = P(a_i/b_j) \tag{5-55}$$

It should be emphasized at this point that (5-54) and (5-55) hold only in the special case where A, B, and C are the alphabets of cascaded channels, as indicated in Figure 5-13.

As we transmit information through cascaded channels from A to B to C, it seems plausible that the equivocation should increase; that is, H(A/C) should be greater than H(A/B). Let us investigate this question.
$$H(A/C) - H(A/B) = \sum_{A,C} P(a, c) \log \frac{1}{P(a/c)} - \sum_{A,B} P(a, b) \log \frac{1}{P(a/b)}$$
$$= \sum_{A,B,C} P(a, b, c) \log \frac{1}{P(a/c)} - \sum_{A,B,C} P(a, b, c) \log \frac{1}{P(a/b)}$$
$$= \sum_{A,B,C} P(a, b, c) \log \frac{P(a/b)}{P(a/c)} \tag{5-56}$$

Now we use (5-55) in (5-56):

$$H(A/C) - H(A/B) = \sum_{A,B,C} P(a, b, c) \log \frac{P(a/b, c)}{P(a/c)}$$
$$= \sum_{B,C} P(b, c) \sum_A P(a/b, c) \log \frac{P(a/b, c)}{P(a/c)} \tag{5-57}$$

We may use the inequality (2-8a) to show that the summation over the A alphabet in (5-57) is nonnegative. Hence,

$$H(A/C) - H(A/B) \geq 0 \tag{5-58}$$
$$H(A/C) \geq H(A/B) \tag{5-59}$$

An immediate consequence of (5-59) is

$$I(A; B) \geq I(A; C) \tag{5-60}$$

These useful inequalities were apparently first proved by Woodward (1955). They show that information channels tend to "leak" information. The information that finally comes out of a cascade of channels can be no greater than the information which would emerge from an intermediate point in the cascade if we could tap such a point.
The condition for equality to hold in (5-59) and (5-60) is of some interest. Retracing the steps in our proof of (5-59), we see that the equality holds if, and only if,

$$P(a/b, c) = P(a/c) \tag{5-61a}$$

for all a symbols and all b and c symbols such that P(b, c) ≠ 0. Equivalently, the condition may also be written as

$$P(a/b) = P(a/c) \tag{5-61b}$$

for all a symbols and all b and c symbols such that P(b, c) ≠ 0.

The condition for equality deserves some comment. At first glance, it might appear that the equality would apply if, and only if, the second channel in the cascade of Figure 5-13 were noiseless. If this channel is noiseless, it is not hard to check that our condition (5-61b) will apply. That condition, however, will also apply in other circumstances, as the next example shows.
Example 5-7. Consider the cascade of the two channels shown in Figure 5-14.

Figure 5-14. Cascaded channels.

In spite of the fact that neither channel is noiseless, it may be seen that (5-61b) holds, and, hence,

$$I(A; B) = I(A; C)$$

In this example, (5-61b) holds for any assignment of probabilities over the input alphabet A. It is also possible to find cases where (5-61b) holds only for some specific input distribution. We shall go into this point further in the next section.
We can illustrate the loss of information as it flows through a cascade of channels by employing people as information channels. A message, originally written in English, is translated into another language, and then translated back into English by a second translator who has not seen the original message. The result of this process will be a corrupted version of the original message, and may be thought of as the result of passing a message through a noisy channel. In order to simulate a cascade of channels, we repeat the process, this time, however, starting with the corrupted version of the message.

The experiment just described was performed using a simple four-line poem, The Turtle, by Ogden Nash. The poem was translated from English to French to English to German to English to Spanish to English. No effort was made to retain the rhyme or meter of the original piece.
The turtle lives 'twixt plated decks
Which practically conceal its sex.
I think it clever of the turtle
In such a fix to be so fertile.

The output of the English-French-English channel was

The turtle lives in a scaled carapace which in fact hides its sex. I find that it is clever for the turtle to be so fertile in such a tricky situation.

The output of the English-German-English channel was

The turtle lives in an enclosed shell under which, in reality, it hides its sex. I find that the turtle must be very clever, indeed, to be so fertile in such a tight situation.

Finally, the output of the English-Spanish-English channel was

The turtle lives inside a closed shell, under which, really, it hides its sex. I feel the turtle had to be certainly clever to be so fertile in a so tight situation.
The noisiness of the human communication channel and the resulting loss of information have been recognized for some time. Thucydides, in Book I of "The Peloponnesian War," states:

Of the events of war, I have not ventured to speak from any chance information, nor according to any notion of my own [i.e., a priori probabilities]; I have described nothing but what I saw myself, or learned from others of whom I made the most careful and particular inquiry [i.e., noiseless channels]. The task was a laborious one, because eyewitnesses of the same occurrence gave different accounts of them, as they remembered or were interested in the actions of one side or the other [i.e., noisy channels].

As a final (but more quantitative) example of the loss of information in cascaded channels, we consider the case of the cascade of two identical BSCs.
Example 5-8. Two BSCs, each with channel matrix

$$P = \begin{bmatrix} \bar p & p \\ p & \bar p \end{bmatrix}$$

are cascaded, the output of the first serving as the input of the second. The two possible inputs to the first BSC are chosen with equal probability. Hence, from (5-48) we have

$$I(A; B) = 1 - H(p) \tag{5-62}$$

It is easy to show that the cascade of these BSCs is equivalent to a single BSC with probability of error 2p̄p. Hence,

$$I(A; C) = 1 - H(2\bar p p) \tag{5-63}$$

If another identical BSC is added (with output alphabet D), we obtain

$$I(A; D) = 1 - H(3\bar p^2 p + p^3) \tag{5-64}$$

These curves are plotted in Figure 5-15.

Figure 5-15. Mutual information of the cascade of n BSCs, plotted against the single-channel error probability p. (Input symbols are assumed equally probable.)
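The curves of Figure 5-15 can be regenerated by composing channel matrices. A sketch (mine): the cascade of n BSCs is itself a BSC whose matrix is the nth power of the single-channel matrix, so its error probability is the off-diagonal entry of that power.

```python
import math
import numpy as np

def H(x):
    return 0.0 if x in (0.0, 1.0) else -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def cascade_mutual_information(p, n):
    """I(A; output of the nth BSC) with equally probable inputs."""
    P = np.array([[1 - p, p], [p, 1 - p]])
    Pn = np.linalg.matrix_power(P, n)     # channel matrix of the whole cascade
    return 1 - H(Pn[0, 1])                # overall error probability is the off-diagonal term

p = 0.1
for n in (1, 2, 3):
    print(n, cascade_mutual_information(p, n))
# n = 2 agrees with 1 - H(2*p*(1 - p)), equation (5-63); n = 3 with equation (5-64)
```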
5-10. Reduced Channels and Sufficient Reductions
In many types of information channels encountered in real life, the set of channel outputs is far larger than the user would like. For example, scientific data relayed from a satellite via a binary telemetry channel often contains information irrelevant to the primary phenomenon under investigation. The antenna on the earth in such a system might obtain a sequence of pulses of various amplitudes. The receiver would take each pulse and, if its amplitude is greater than some threshold, interpret the pulse as a "1"; if the amplitude is less than the threshold, the receiver interprets the pulse as a "0." We may think of two different channels in the situation just described. First, there is the channel with binary inputs (sent from the satellite) and a large number of outputs (corresponding to the number of distinguishable pulse amplitudes). Second, there is the channel with binary inputs and binary outputs (corresponding to the outputs of our receiver). This second channel is clearly a simplification of the first channel; we call the second channel a reduction of the first.
Definition. Consider a channel with r inputs and s outputs described by a channel matrix P:

$$P = \begin{bmatrix} P_{11} & P_{12} & \cdots & P_{1i} & P_{1,i+1} & \cdots & P_{1s} \\ P_{21} & P_{22} & \cdots & P_{2i} & P_{2,i+1} & \cdots & P_{2s} \\ \vdots & & & & & & \vdots \\ P_{r1} & P_{r2} & \cdots & P_{ri} & P_{r,i+1} & \cdots & P_{rs} \end{bmatrix}$$

We define a new channel with r inputs and s − 1 outputs by adding together any two columns of P. We call the channel matrix of the new channel P′:

$$P' = \begin{bmatrix} P_{11} & P_{12} & \cdots & P_{1i} + P_{1,i+1} & \cdots & P_{1s} \\ P_{21} & P_{22} & \cdots & P_{2i} + P_{2,i+1} & \cdots & P_{2s} \\ \vdots & & & & & \vdots \\ P_{r1} & P_{r2} & \cdots & P_{ri} + P_{r,i+1} & \cdots & P_{rs} \end{bmatrix}$$

The new channel P′ is called an elementary reduction of P. We may repeat this process a number of times, forming an elementary reduction of P′, etc. The end product of one or more elementary reductions will be called simply a reduction of the original channel P.
Example 5-9. In Example 5-1 we formed the channel matrix of the (BSC)²:

$$P = \begin{bmatrix} \bar p^2 & \bar p p & p\bar p & p^2 \\ \bar p p & \bar p^2 & p^2 & p\bar p \\ p\bar p & p^2 & \bar p^2 & \bar p p \\ p^2 & p\bar p & \bar p p & \bar p^2 \end{bmatrix}$$
An elementary reduction of P is formed by combining the first and second columns:

$$P' = \begin{bmatrix} \bar p & p\bar p & p^2 \\ \bar p & p^2 & p\bar p \\ p & \bar p^2 & \bar p p \\ p & \bar p p & \bar p^2 \end{bmatrix}$$

A reduction of P is formed by combining the second and third columns of P′:

$$P'' = \begin{bmatrix} \bar p & p \\ \bar p & p \\ p & \bar p \\ p & \bar p \end{bmatrix}$$

A useful way of viewing a reduced channel is as shown in Figure 5-16. The deterministic channel combines symbols of the B alphabet into a smaller number of symbols of the C alphabet.

Figure 5-16. A reduced channel.

Hence, the channel with input alphabet A and output alphabet C indicated by dashed lines in Figure 5-16 is a reduction of channel P. This method of constructing a reduced channel allows us to use the results of the previous section on channel cascades. In particular, we have (referring to Figure 5-16)

$$H(A/C) \geq H(A/B) \tag{5-65}$$
$$I(A; C) \leq I(A; B) \tag{5-66}$$

Forming a reduction of a channel decreases (or at best leaves unchanged) the mutual information of the input and output alphabets. This is the price we pay for simplification in the channel.

A most important question suggested by the above remarks is, "When can we simplify the channel without paying a penalty in reduced mutual information?" That is, "When is the mutual information of a reduced channel equal to that of the original channel?" In order to answer this question, we need only consider the case of elementary reductions. The question in the case of a general reduction may then be answered by induction.

Let us form an elementary reduction of the channel

$$P = \begin{bmatrix} P_{11} & P_{12} & \cdots & P_{1s} \\ P_{21} & P_{22} & \cdots & P_{2s} \\ \vdots & & & \vdots \\ P_{r1} & P_{r2} & \cdots & P_{rs} \end{bmatrix} \tag{5-67}$$

Without loss of generality, we may assume that the elementary reduction is formed by combining the first two columns of P. This situation is described in the channel diagram shown in Figure 5-17.

Figure 5-17. Channel reduction by cascade.

In Section 5-9 we found necessary and sufficient conditions that a cascade not lose information. These were [(5-61b)]

$$P(a/b) = P(a/c) \tag{5-68}$$

for all a, b, and c symbols such that P(b, c) ≠ 0. Since we are investigating an elementary reduction, this condition is satisfied trivially for all B symbols except the two symbols we have combined, b₁ and b₂. Let c₁ be the C symbol formed from b₁ and b₂. On applying (5-68) to b₁ and b₂, we find that the necessary and sufficient conditions are

$$P(a/b_1) = P(a/c_1) = P(a/b_2) \qquad \text{for all } a \tag{5-69}$$
This is equivalent to†

$$P(a/b_1) = P(a/b_2) \qquad \text{for all } a \tag{5-70}$$

In other words, the two output symbols b₁ and b₂ may be combined without loss of information if, and only if, the backward probabilities P(a/b₁) and P(a/b₂) are identical for all a. This is an important result, both in terms of understanding information and from a practical point of view. It provides conditions under which a channel may be simplified without paying a penalty. The backward probabilities, however, depend upon the a priori probabilities P(a); i.e., they depend upon how we use our channel. It is of even more interest to determine when we may combine channel outputs no matter how we use the channel, i.e., for any a priori probabilities. This may be done by using Bayes' law to rewrite (5-70):

$$\frac{P(b_1/a)P(a)}{\sum_A P(b_1/a)P(a)} = \frac{P(b_2/a)P(a)}{\sum_A P(b_2/a)P(a)} \qquad \text{for all } a \tag{5-71}$$

$$\frac{P(b_1/a)}{P(b_2/a)} = \frac{\sum_A P(b_1/a)P(a)}{\sum_A P(b_2/a)P(a)} \qquad \text{for all } a \tag{5-72}$$

If (5-72) is to hold for all possible a priori probabilities P(a), we must have

$$P(b_1/a) = \text{const} \times P(b_2/a) \qquad \text{for all } a \tag{5-73}$$

Equation (5-73) is the condition we seek. If we have a channel matrix satisfying (5-73), we may combine the two columns of the matrix, and the new channel matrix will be just as good as the original one. More precisely, for any set of probabilities over the input alphabet, the mutual information of the channel and the reduced channel will be identical. A reduced channel with this property will be called a sufficient reduction.

† The condition on P(a/c₁) follows automatically from (5-70).
Example 5-10. A channel whose matrix contains a pair of columns satisfying (5-73) may be reduced by combining that pair; if the resulting matrix again contains such a pair, it may be reduced once more. Each channel obtained in this way is a sufficient reduction of the original channel.
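Condition (5-73) says that two output columns may be merged, for every input distribution, exactly when they are proportional. A sketch (mine; the names are illustrative) that tests the condition and performs the merge:

```python
import numpy as np

def sufficient_reduction(P, j, k, tol=1e-12):
    """If columns j and k of channel matrix P are proportional (condition 5-73),
    return the reduced matrix with the two columns combined; otherwise None.
    Assumes j < k."""
    P = np.asarray(P, dtype=float)
    pair = np.column_stack([P[:, j], P[:, k]])
    if np.linalg.matrix_rank(pair, tol=tol) > 1:
        return None                      # not proportional: merging would lose information
    merged = P[:, j] + P[:, k]
    rest = np.delete(P, [j, k], axis=1)
    return np.column_stack([rest[:, :j], merged, rest[:, j:]])

P = [[0.4, 0.2, 0.4],
     [0.2, 0.1, 0.7]]                    # the first two columns are proportional (ratio 2)
print(sufficient_reduction(P, 0, 1))     # [[0.6, 0.4], [0.3, 0.7]]
print(sufficient_reduction(P, 1, 2))     # None
```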
5-11. Additivity of Mutual Information
Another important property of mutual information is additivity. In this section we investigate additivity by considering the average amount of information provided about a set of input symbols by a succession of output symbols; that is, we consider the case where we may gain information about the input by a number of observations, instead of a single observation. An example of this situation occurs when the input symbols of a noisy channel are repeated a number of times, rather than transmitted just once. Such a procedure might be used to improve the reliability of information transmitted through an unreliable channel. Another example is an information channel where the response to a single input is a sequence of output symbols, rather than a single output symbol.

We investigate the additivity property of mutual information in the special case where the output for a single input symbol consists of two symbols. The more general case where the output consists of n symbols may then be treated by induction.

Let us modify our model of the information channel, then, so that instead of a single output for each input, we receive two symbols, say b_j and c_k. The symbols b_j and c_k are from the output alphabets B = {b_j}, j = 1, 2, . . . , s, and C = {c_k}, k = 1, 2, . . . , t, respectively.

Without loss of generality, we may assume that the two output symbols are received in the order b, c. Then the a priori probabilities of the input symbols P(a_i) change into the a posteriori probabilities P(a_i/b_j) upon reception of the first output symbol; upon reception of the second output symbol, they change into the "even more a posteriori" probabilities P(a_i/b_j, c_k).

If the two symbols b_j and c_k are received, the average uncertainty or entropy of the set of input symbols changes from

$$H(A) = \sum_A P(a) \log \frac{1}{P(a)} \tag{5-74a}$$
to the a posteriori entropy
WA/b) = J Pla/t) toe rare
and then to the “even more a posteriori” entropy’
H(A/by 04) = J Pla/by cs) on py (STAC)
- PCy, &)
As in Section 5-5, we average H(A/b_j) over the b_j to find the average a posteriori entropy, or the equivocation of A with respect to B:

    H(A/B) = \sum_B P(b) H(A/b)                                                         (5-75a)

In the same manner, we may average H(A/b_j, c_k) over all b_j and c_k in order to find the equivocation of A with respect to B and C:

    H(A/B, C) = \sum_{B,C} P(b, c) H(A/b, c)                                            (5-75b)
The results of our generalization of Shannon's first theorem (Section 5-5) apply directly to H(A/B, C); H(A/B, C) is the average number of binits necessary to encode a symbol from the A alphabet after we are given the corresponding B and C symbols.
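The sequence of entropies H(A), H(A/B), H(A/B, C) can be computed directly from a joint distribution P(a, b, c). The Python sketch below (the joint distribution is hypothetical, used only for illustration) evaluates them through the identity H(X/Y) = H(X, Y) - H(Y), which is equivalent to averaging the conditional entropies as in (5-75a) and (5-75b).

```python
from math import log2
from collections import defaultdict

def marginal(joint, keep):
    """Sum a joint distribution {(a, b, c): p} down to the coordinates in `keep`."""
    out = defaultdict(float)
    for key, p in joint.items():
        out[tuple(key[i] for i in keep)] += p
    return out

def entropy(dist):
    return sum(p * log2(1.0 / p) for p in dist.values() if p > 0)

def conditional_entropy(joint, target, given):
    """H(target / given) = H(target, given) - H(given), computed from the joint."""
    return entropy(marginal(joint, target + given)) - entropy(marginal(joint, given))

# Hypothetical joint distribution P(a, b, c), for illustration only.
P_abc = {(0, 0, 0): 0.30, (0, 0, 1): 0.10, (0, 1, 0): 0.05, (0, 1, 1): 0.05,
         (1, 0, 0): 0.05, (1, 0, 1): 0.05, (1, 1, 0): 0.10, (1, 1, 1): 0.30}

H_A    = entropy(marginal(P_abc, (0,)))              # H(A)        (5-74a)
H_A_B  = conditional_entropy(P_abc, (0,), (1,))      # H(A/B)      (5-75a)
H_A_BC = conditional_entropy(P_abc, (0,), (1, 2))    # H(A/B, C)   (5-75b)
print(H_A, H_A_B, H_A_BC)   # each term is no greater than the one before it
```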
Equations (5-75a) and (5-75b) suggest two different ways we might measure the amount of information B and C yield about A, that is, the mutual information of (B, C) and A. First, we might define the mutual information of A and (B, C), just as we did when the channel output consisted of a single symbol. That is,

    I(A; B, C) = H(A) - H(A/B, C)                                                       (5-76)

Second, we might consider the amount of information provided about A by B alone, and then the amount of information about A provided by C after we have seen B. These quantities are

    H(A) - H(A/B)                                                                       (5-77a)
and
    H(A/B) - H(A/B, C)                                                                  (5-77b)

The first of these has already been defined as

    I(A; B) = H(A) - H(A/B)                                                             (5-78a)

It is natural to define (5-77b) as

    I(A; C/B) = H(A/B) - H(A/B, C)                                                      (5-78b)

called the mutual information of A and C, given B. Upon adding (5-78a) and (5-78b), we find

    I(A; B) + I(A; C/B) = H(A) - H(A/B, C)
                        = I(A; B, C)                                                    (5-79)
Equation (5-79) expresses the additivity property of mutual information. It says that the average information provided by an observation does not depend upon whether we consider the observation in its entirety or broken into its component parts. This equation may be generalized immediately to

    I(A; B, C, . . . , D) = I(A; B) + I(A; C/B) + . . . + I(A; D/B, C, . . .)           (5-80)

where the term on the left is the average amount of information about A provided by an observation from the alphabets B, C, . . . , D. The first term on the right is the average amount of information about A provided by an observation from the alphabet B. The second term on the right is the average amount of information about A provided by an observation from the alphabet C after an observation from the alphabet B, etc.
The particular order of the information we receive is, of course, irrelevant for (5-79) and (5-80). For example, we may write [corresponding to (5-79)]

    I(A; B, C) = I(A; C) + I(A; B/C)                                                    (5-81)
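Since (5-79) and (5-81) are identities, they can be verified numerically for any joint distribution P(a, b, c). The sketch below (again with a hypothetical joint distribution, chosen only for illustration) expresses each information quantity through joint entropies and checks both decompositions.

```python
from math import log2
from itertools import product

def H(dist):
    """Entropy (in bits) of a distribution given as a dict of probabilities."""
    return sum(p * log2(1.0 / p) for p in dist.values() if p > 0)

def marg(joint, keep):
    out = {}
    for key, p in joint.items():
        k = tuple(key[i] for i in keep)
        out[k] = out.get(k, 0.0) + p
    return out

# Hypothetical joint distribution P(a, b, c), for illustration only.
P = {k: v for k, v in zip(product((0, 1), repeat=3),
                          (0.20, 0.05, 0.10, 0.15, 0.05, 0.20, 0.15, 0.10))}

# Information quantities expressed through joint entropies.
I_A_BC  = H(marg(P, (0,))) + H(marg(P, (1, 2))) - H(P)                        # I(A; B, C)
I_A_B   = H(marg(P, (0,))) + H(marg(P, (1,))) - H(marg(P, (0, 1)))            # I(A; B)
I_A_CgB = (H(marg(P, (0, 1))) + H(marg(P, (1, 2)))
           - H(marg(P, (1,))) - H(P))                                         # I(A; C/B)
I_A_C   = H(marg(P, (0,))) + H(marg(P, (2,))) - H(marg(P, (0, 2)))            # I(A; C)
I_A_BgC = (H(marg(P, (0, 2))) + H(marg(P, (1, 2)))
           - H(marg(P, (2,))) - H(P))                                         # I(A; B/C)

print(abs(I_A_BC - (I_A_B + I_A_CgB)) < 1e-12)   # (5-79)
print(abs(I_A_BC - (I_A_C + I_A_BgC)) < 1e-12)   # (5-81)
```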
We may write the information quantities discussed above in
several different forms. From (5-76) we have
    I(A; B, C) = H(A) - H(A/B, C)
               = \sum_A P(a) \log \frac{1}{P(a)} - \sum_{A,B,C} P(a, b, c) \log \frac{1}{P(a/b, c)}
               = \sum_{A,B,C} P(a, b, c) \log \frac{1}{P(a)} - \sum_{A,B,C} P(a, b, c) \log \frac{1}{P(a/b, c)}
               = \sum_{A,B,C} P(a, b, c) \log \frac{P(a/b, c)}{P(a)}                    (5-82a)
Another useful form is found by multiplying the numerator and denominator of the logarithm of the above equation by P(b, c):

    I(A; B, C) = \sum_{A,B,C} P(a, b, c) \log \frac{P(a, b, c)}{P(a) P(b, c)}           (5-82b)
The reader should note the similarity of (5-82a) and (5-82b) to (5-31a) and (5-31b). We might have obtained (5-82a) and (5-82b) merely by replacing b in (5-31a) and (5-31b) by (b, c). This argument suggests the definition

    H(B, C/A) = \sum_{A,B,C} P(a, b, c) \log \frac{1}{P(b, c/a)}                        (5-83)

It is easily verified that

    I(A; B, C) = H(B, C) - H(B, C/A)                                                    (5-84)
Example 5-11. To illustrate the additivity of mutual information, we consider again the BSC with channel matrix

    [ \bar{p}      p    ]
    [    p      \bar{p} ]

where \bar{p} = 1 - p. This time, however, we assume that the input symbol (either 0 or 1) is repeated, so that the output of the channel consists of two binary symbols b_j, c_k for each input symbol a_i. Furthermore, for simplicity we assume that the two inputs are chosen with equal probabilities. Therefore, setting \omega = 1/2 in (5-48), we find

    I(A; B) = 1 - H(p)                                                                  (5-85)

To find I(A; B, C), we use (5-82b). Table 5-2 gives the necessary probabilities.
Table 5-2. Probabilities of a Repeated BSC

    b_j c_k    P(b_j, c_k/a = 0)    P(b_j, c_k/a = 1)    P(b_j, c_k)
    0 0        \bar{p}^2            p^2                  (p^2 + \bar{p}^2)/2
    0 1        \bar{p} p            p \bar{p}            p \bar{p}
    1 0        p \bar{p}            \bar{p} p            p \bar{p}
    1 1        p^2                  \bar{p}^2            (p^2 + \bar{p}^2)/2
Using these probabilities in (5-82b) yields

    I(A; B, C) = (p^2 + \bar{p}^2) \left[ 1 - H\!\left( \frac{p^2}{p^2 + \bar{p}^2} \right) \right]          (5-86)

The interpretation of (5-86) is clear. If we observe the outputs 10 or 01 from such a channel, their meaning is entirely ambiguous; the two possible inputs will still be equally probable, and we gain no information from our observation. If, on the other hand, we observe 00 or 11, we gain information about the input equivalent to that gained by observing a single output from a BSC with error probability

    \frac{p^2}{p^2 + \bar{p}^2}                                                         (5-87)

From (5-85), the information of such an observation is

    1 - H\!\left( \frac{p^2}{p^2 + \bar{p}^2} \right)

We observe either 00 or 11 with probability p^2 + \bar{p}^2; hence (5-86).
The arguments given above are easily generalized to the case of a BSC used with more than a single repetition. For example, if each input produces three binary outputs, we obtain

    I(A; B, C, D) = (p^3 + \bar{p}^3) \left[ 1 - H\!\left( \frac{p^3}{p^3 + \bar{p}^3} \right) \right] + 3 p \bar{p} [1 - H(p)]          (5-88)

Equations (5-85), (5-86), and (5-88) are plotted in Figure 5-18.
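Equations (5-86) and (5-88) may be checked by computing the mutual information directly from the joint distribution of the input and the repeated outputs. The following Python sketch (our own construction, for illustration only) does this for a BSC with error probability p and equally probable inputs.

```python
from math import log2
from itertools import product

def Hb(x):
    """Binary entropy function H(x) in bits."""
    return 0.0 if x in (0.0, 1.0) else x * log2(1/x) + (1-x) * log2(1/(1-x))

def I_repeated_bsc(p, n):
    """I(A; B1, ..., Bn) for a BSC with error probability p, the input repeated
    n times, and equally probable inputs, computed directly from the joint."""
    info = 0.0
    for outputs in product((0, 1), repeat=n):
        like = []                                   # P(outputs | a) for a = 0, 1
        for a in (0, 1):
            q = 1.0
            for b in outputs:
                q *= (1 - p) if b == a else p
            like.append(q)
        p_out = 0.5 * (like[0] + like[1])           # P(b1, ..., bn)
        for a in (0, 1):
            p_joint = 0.5 * like[a]                 # P(a, b1, ..., bn)
            if p_joint > 0:
                info += p_joint * log2(like[a] / p_out)
    return info

p = 0.1
pb = 1 - p
# (5-86): two repetitions
print(I_repeated_bsc(p, 2), (p**2 + pb**2) * (1 - Hb(p**2 / (p**2 + pb**2))))
# (5-88): three repetitions
print(I_repeated_bsc(p, 3),
      (p**3 + pb**3) * (1 - Hb(p**3 / (p**3 + pb**3))) + 3*p*pb*(1 - Hb(p)))
```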
5-12. Mutual Information of Several Alphabets
In our investigation of the additivity of mutual information in Section 5-11, we were led to a consideration of the sequence of entropy quantities

    H(A)
    H(A/B)
    H(A/B, C)
Figure 5-18. Mutual information of a BSC with n repetitions (plotted against the probability of error p).
Each member of this sequence is no greater than the preceding member. We saw that the difference between two successive members could be interpreted as the average information about A provided by a new observation:

    I(A; B) = H(A) - H(A/B)                                                             (5-90a)
    I(A; C/B) = H(A/B) - H(A/B, C)                                                      (5-90b)

I(A; B) is the mutual information of A and B; I(A; C/B) is the mutual information of A and C after we are given B. Both these quantities, however, involve the mutual information of just two alphabets. It is also possible to define the mutual information of more than two alphabets (McGill, 1954). We define the mutual information of A, B, and C by

    I(A; B; C) = I(A; B) - I(A; B/C)                                                    (5-91a)

Our definition of the mutual information of A, B, and C implies that I(A; B; C) is symmetric in A, B, and C. If this is true,
(5-91a) can also be written

    I(A; B; C) = I(B; C) - I(B; C/A)                                                    (5-91b)
               = I(C; A) - I(C; A/B)                                                    (5-91c)
To prove the symmetry of I(A; B; C), we write (5-91a) as

    I(A; B; C) = \sum_{A,B} P(a, b) \log \frac{P(a, b)}{P(a) P(b)} - \sum_{A,B,C} P(a, b, c) \log \frac{P(a, b/c)}{P(a/c) P(b/c)}
               = \sum_{A,B,C} P(a, b, c) \log \frac{P(a, b) P(a/c) P(b/c)}{P(a) P(b) P(a, b/c)}
               = \sum_{A,B,C} P(a, b, c) \log \frac{P(a, b) P(a, c) P(b, c)}{P(a) P(b) P(c) P(a, b, c)}
               = H(A, B, C) - H(A, B) - H(A, C) - H(B, C) + H(A) + H(B) + H(C)          (5-92)
In addition to exhibiting the symmetry we want, (5-92) is reminiscent of the expression for the mutual information of two alphabets:

    I(A; B) = H(A) + H(B) - H(A, B)                                                     (5-93)
It is easy to generalize (5-92) and (5-93) to more than three alphabets. For example, the mutual information of A, B, C, and D is

    I(A; B; C; D) = I(A; B; C) - I(A; B; C/D)
                  = [H(A) + H(B) + H(C) + H(D)]
                    - [H(A, B) + H(A, C) + H(A, D) + H(B, C) + H(B, D) + H(C, D)]
                    + [H(A, B, C) + H(A, B, D) + H(A, C, D) + H(B, C, D)]
                    - H(A, B, C, D)                                                     (5-94)
Blachman (1961) has suggested a generalization of Figure 5-9 to aid in the interpretation of the above expressions. For three alphabets, we have the relationships shown in Figure 5-19.
Although Figure 5-19 is an important aid in remembering relationships among the quantities we have defined, it can also be somewhat deceptive. The mutual information I(A; B) was shown to be nonnegative; the mutual information I(A; B; C), however, can be negative.
Figure 5-19. Some information relationships.
This means that the intersection of the three circles of Figure 5-19a can be negative! To show this we present an example.
Example 5-12. Consider the three binary alphabets A, B, C. Let a_i and b_j be selected as 0 or 1, each with probability 1/2 and each independently of the other. Finally, we assume that c_k is selected as 0 if a_i equals b_j and as 1 if a_i does not equal b_j. Some of the probabilities of these three random variables are given in Table 5-3.
Table 5-3. Probabilities of Three Random Variables

    a_i b_j c_k    P(a_i, b_j, c_k)    P(a_i/c_k)    P(a_i, b_j/c_k)    P(a_i, b_j)    P(c_k)
    0 0 0          1/4                 1/2           1/2                1/4            1/2
    0 0 1          0                   1/2           0                  1/4            1/2
    0 1 0          0                   1/2           0                  1/4            1/2
    0 1 1          1/4                 1/2           1/2                1/4            1/2
    1 0 0          0                   1/2           0                  1/4            1/2
    1 0 1          1/4                 1/2           1/2                1/4            1/2
    1 1 0          1/4                 1/2           1/2                1/4            1/2
    1 1 1          0                   1/2           0                  1/4            1/2
Using this table, we calculate

    I(A; B) = 0 bits
    I(A; B/C) = 1 bit
    I(A; B; C) = I(A; B) - I(A; B/C) = -1 bit

It is clear why we get such an answer. Since A and B are statistically independent, I(A; B) = 0, and B provides no information about A. If we already know C, however, learning B tells us which A was chosen, and therefore provides us with one bit of information.
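The result of Example 5-12 is easy to reproduce numerically from (5-92). The sketch below builds the joint distribution of the example and evaluates the entropy expansion; it prints -1.0 bit.

```python
from math import log2
from itertools import product

def H(dist):
    return sum(p * log2(1.0 / p) for p in dist.values() if p > 0)

def marg(joint, keep):
    out = {}
    for key, p in joint.items():
        k = tuple(key[i] for i in keep)
        out[k] = out.get(k, 0.0) + p
    return out

# a and b equally likely and independent; c = 0 if a == b, c = 1 otherwise.
P = {}
for a, b in product((0, 1), repeat=2):
    c = 0 if a == b else 1
    P[(a, b, c)] = 0.25

# (5-92): I(A; B; C) = H(A,B,C) - H(A,B) - H(A,C) - H(B,C) + H(A) + H(B) + H(C)
I_ABC = (H(P) - H(marg(P, (0, 1))) - H(marg(P, (0, 2))) - H(marg(P, (1, 2)))
         + H(marg(P, (0,))) + H(marg(P, (1,))) + H(marg(P, (2,))))
print(I_ABC)   # -1.0 bit, as in Example 5-12
```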
5-13. Channel Capacity
Consider an information channel with input alphabet A, output alphabet B, and conditional probabilities P(b_j/a_i). In order to calculate the mutual information

    I(A; B) = \sum_{A,B} P(a, b) \log \frac{P(b/a)}{P(b)}                               (5-95)

it is necessary to know the input symbol probabilities P(a_i). The mutual information, therefore, depends not only upon the channel, but also upon how we use the channel, i.e., the probabilities with which we choose the channel inputs. It is of some interest to examine the variation of I(A; B) as we change the input probabilities.
Example 5-13. For the BSC with probability of error p, we found [(5-48)]

    I(A; B) = H(\omega \bar{p} + \bar{\omega} p) - H(p)                                 (5-96)

where \omega is the probability of selecting a 0 at the input, \bar{\omega} = 1 - \omega, and \bar{p} = 1 - p. We may plot (5-96) as a function of \omega for a fixed p (see Figure 5-20). Hence, the mutual information of a BSC varies from 0 to 1 - H(p). The minimum value of 0 is achieved when \omega = 0 or 1. In these cases, the input is known with probability 1 at the output, even before an output symbol is received. The maximum value of 1 - H(p) is achieved when \omega = 1/2, that is, when both inputs are equally probable.
Figure 5-20. Mutual information of a BSC (plotted against \omega, the probability of a 0 at the input).
For a general information channel, we see that the mutual information can always be made 0 by choosing one of the input symbols with probability 1. Since the mutual information is nonnegative, this answers the question of the minimum value of I(A; B). The question of the maximum value of I(A; B) for a general channel is not so easily answered, however. The maximum value of I(A; B) as we vary the input symbol probabilities is called C, the capacity of the channel:

    C = \max_{P(a_i)} I(A; B)                                                           (5-97)

Note that the capacity of an information channel is a function only of the conditional probabilities defining that channel. It does not depend upon the input probabilities, that is, upon how we use the channel. From Figure 5-20 we see that the capacity of a BSC with error probability p is 1 - H(p).
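For a channel with only two inputs, the maximization in (5-97) involves a single parameter \omega = P(a_1), and the capacity can be estimated by a simple search over \omega. The Python sketch below (a brute-force illustration, not an efficient general algorithm) does this for a BSC and reproduces C = 1 - H(p).

```python
from math import log2

def mutual_information(p_in, channel):
    """I(A; B) for input distribution p_in and channel matrix P(b|a)."""
    n_out = len(channel[0])
    p_out = [sum(p_in[i] * channel[i][j] for i in range(len(p_in)))
             for j in range(n_out)]
    info = 0.0
    for i, pa in enumerate(p_in):
        for j, pba in enumerate(channel[i]):
            if pa > 0 and pba > 0:
                info += pa * pba * log2(pba / p_out[j])
    return info

def capacity_two_input(channel, steps=10000):
    """Crude estimate of C = max over P(a) of I(A; B) for a two-input channel,
    found by searching a fine grid of input probabilities w = P(a_1)."""
    best = 0.0
    for k in range(steps + 1):
        w = k / steps
        best = max(best, mutual_information([w, 1 - w], channel))
    return best

def Hb(x):
    return 0.0 if x in (0.0, 1.0) else x * log2(1/x) + (1-x) * log2(1/(1-x))

p = 0.1
bsc = [[1 - p, p],
       [p, 1 - p]]
print(capacity_two_input(bsc), 1 - Hb(p))   # both close to 0.531 bits
```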
The calculation of the capacity of an information channel is, in general, quite involved (Muroga, 1953; Shannon, 1957b; Fano, 1961). In certain cases, however, the calculation can be simplified.
Figure 5-21. Capacity of a BSC (plotted against the probability of error p).
The most important class of channels for which the calculation simplifies
is the class of uniform channels.
Definition. Consider a channel defined by the channel matrix

    [ P_11    P_12    . . .    P_1s ]
    [ P_21    P_22    . . .    P_2s ]
    [   .       .                .  ]
    [ P_r1    P_r2    . . .    P_rs ]

As before, P_ij = P(b_j/a_i). This channel is said to be uniform if the terms in every row and in every column of the channel matrix consist of a permutation of the terms in the first row.
Example 5-14. We have already dealt with one example of a uniform information channel, the BSC. The natural generalization of the BSC, the r-ary symmetric channel (rSC), is a uniform channel with r input and r output symbols. The channel matrix of the rSC is shown in Figure 5-22.

    [ \bar{p}       p/(r-1)    . . .    p/(r-1) ]
    [ p/(r-1)       \bar{p}    . . .    p/(r-1) ]
    [    .             .                    .   ]
    [ p/(r-1)       p/(r-1)    . . .    \bar{p} ]

    Figure 5-22. Channel matrix of the rSC.

As usual, \bar{p} = 1 - p. The overall probability of error for this channel is p, but there are r - 1 possible incorrect output symbols for each input symbol.
We now calculate the capacity of a general uniform channel. The capacity is the maximum of I(A; B) as we vary the input distribution.

    I(A; B) = H(B) - H(B/A)
            = H(B) - \sum_A P(a) \sum_B P(b/a) \log \frac{1}{P(b/a)}                    (5-98)
The summation over B in the last term of (5-98) is a summation, for each a_i, of the terms in the ith row of the channel matrix. For a uniform channel, however, this summation is independent of i. Hence,

    I(A; B) = H(B) - \sum_B P(b/a) \log \frac{1}{P(b/a)}                                (5-99)
and the last term of (5-99) is independent of the input symbol distribution. To find the maximum of the right side of (5-99), we need only find the maximum of H(B). Since the output alphabet consists of r symbols, we know that H(B) cannot exceed log r bits. H(B) will equal log r bits if and only if all the output symbols occur with equal probability. In general, it is not true that there exists a distribution over the input symbols such that the output symbols are equiprobable. For a uniform channel, however, it is easy to check that equiprobable symbols at the input produce equiprobable symbols at the output. Therefore, the maximum value of (5-99), the capacity of the uniform channel, is
    C = \log r - \sum_B P(b/a) \log \frac{1}{P(b/a)}
      = \log r + \sum_B P(b/a) \log P(b/a)                                              (5-100)
Using (5-100), we calculate the capacity of the rSC:

    C = \log r + \bar{p} \log \bar{p} + p \log \frac{p}{r - 1}
      = \log r - p \log (r - 1) - H(p)                                                  (5-101)
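Equations (5-100) and (5-101) are easy to evaluate. The Python sketch below (our own helper functions, for illustration) computes the capacity of a uniform channel from any one row of its channel matrix and checks the result against the closed form (5-101) for the rSC.

```python
from math import log2

def uniform_channel_capacity(row):
    """Capacity of a uniform channel from any one row of its channel matrix,
    using (5-100): C = log r + sum_B P(b|a) log P(b|a)."""
    r = len(row)
    return log2(r) + sum(p * log2(p) for p in row if p > 0)

def rsc_capacity(r, p):
    """Capacity of the r-ary symmetric channel, (5-101):
    C = log r - p log (r - 1) - H(p)."""
    Hp = 0.0 if p in (0.0, 1.0) else p * log2(1/p) + (1-p) * log2(1/(1-p))
    return log2(r) - (p * log2(r - 1) if r > 1 else 0.0) - Hp

# One row of the rSC channel matrix: 1 - p on the diagonal, p/(r-1) elsewhere.
r, p = 4, 0.1
row = [1 - p] + [p / (r - 1)] * (r - 1)
print(uniform_channel_capacity(row), rsc_capacity(r, p))   # the two agree
```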
5-14. Conditional Mutual Information
The channel capacity is the maximum value of

    I(A; B) = \sum_{A,B} P(a, b) \log \frac{P(b/a)}{P(b)}                               (5-102)

the average of \log [P(b/a)/P(b)] over both the input alphabet A and the output alphabet B. The mutual information may also be written

    I(A; B) = \sum_A P(a) \sum_B P(b/a) \log \frac{P(b/a)}{P(b)}
            = \sum_A P(a) I(a; B)                                                       (5-103)

where we have defined

    I(a; B) = \sum_B P(b/a) \log \frac{P(b/a)}{P(b)}                                    (5-104)

I(a; B) is called the conditional mutual information (conditional on a). The conditional mutual information is the average of \log [P(b/a)/P(b)] with respect to the conditional probability P(b/a).
In general, I(a; B) depends on the input symbol a. When the input symbols are selected according to a set of probabilities which achieve channel capacity, however, we shall show that I(a; B) does not depend upon a for any input symbol with P(a) ≠ 0. When the input probabilities are selected so as to achieve capacity,

    I(a; B) = C                                                                         (5-105)

for all a such that P(a) ≠ 0.
This fact is central to the calculation of the channel capacity for channels more general than the uniform channel treated in the previous section (Fano, 1961). We shall use this fact when we prove Shannon's second theorem in Section 6-10.
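The statement (5-105) can be observed numerically. The sketch below (with our own function names, for illustration) evaluates I(a; B) of (5-104) for each input of a BSC; at the capacity-achieving distribution (equally probable inputs) the two values coincide with C = 1 - H(p), while at any other distribution they differ.

```python
from math import log2

def conditional_mutual_information(a, p_in, channel):
    """I(a; B) of (5-104): the average of log [P(b|a)/P(b)] over P(b|a)."""
    p_out = [sum(p_in[i] * channel[i][j] for i in range(len(p_in)))
             for j in range(len(channel[0]))]
    return sum(pba * log2(pba / p_out[j])
               for j, pba in enumerate(channel[a]) if pba > 0)

p = 0.1
bsc = [[1 - p, p],
       [p, 1 - p]]

# At the capacity-achieving distribution (equally probable inputs for a BSC),
# I(a; B) is the same for every input and equals C = 1 - H(p)  -- (5-105).
print([conditional_mutual_information(a, [0.5, 0.5], bsc) for a in (0, 1)])

# At any other input distribution the two values differ.
print([conditional_mutual_information(a, [0.7, 0.3], bsc) for a in (0, 1)])
```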
We prove (5-105) by contradiction. That is, we assume we have a set of input probabilities† P(a_1), P(a_2), . . . , P(a_r) which achieve channel capacity, but for which (5-105) does not hold; that is, the I(a; B) are not all equal to the capacity. Since the average of I(a; B) does equal the capacity, there must be at least one I(a; B) greater than C and at least one I(a; B) less than C. Without loss of generality, we assume

    I(a_1; B) > C                                                                       (5-106a)
    I(a_2; B) < C                                                                       (5-106b)

Now we change the probabilities

    P(a_1), P(a_2), P(a_3), . . . , P(a_r)                                              (5-107a)
to
    P(a_1) + Δ, P(a_2) - Δ, P(a_3), . . . , P(a_r)                                      (5-107b)

where Δ is some small positive number less than P(a_2), and show that the value of the mutual information increases. Since the original probabilities (5-107a) were assumed to achieve capacity, this is a contradiction; hence, our assumption that the I(a; B) were not all equal must have been in error.
Let us denote the new probabilities in (5-107b) by P_Δ(a_1), P_Δ(a_2), . . . , P_Δ(a_r). The corresponding output probabilities will be written P_Δ(b_1), P_Δ(b_2), . . . , P_Δ(b_s). The P_Δ(b) are given by

    P_Δ(b) = \sum_A P_Δ(a) P(b/a)
           = P(b) + Δ [P(b/a_1) - P(b/a_2)]                                             (5-108)
Let I_Δ(A; B) be the value of the mutual information calculated using the probabilities P_Δ(a); by assumption, the value of the mutual information using the original probabilities P(a) is C, the channel capacity. We calculate
† We assume that none of the P(a_i) are equal to zero. If P(a_i) = 0, we may consider a new channel derived from the old channel by deleting the input a_i.
    I_Δ(A; B) - C = \sum_A P_Δ(a) \sum_B P(b/a) \log \frac{P(b/a)}{P_Δ(b)} - \sum_A P(a) \sum_B P(b/a) \log \frac{P(b/a)}{P(b)}
                  = Δ \left[ \sum_B P(b/a_1) \log P(b/a_1) - \sum_B P(b/a_2) \log P(b/a_2) \right]
                    + \sum_B P_Δ(b) \log \frac{1}{P_Δ(b)} - \sum_B P(b) \log \frac{1}{P(b)}          (5-109)
Upon adding and subtracting

    Δ \left[ \sum_B P(b/a_1) \log \frac{1}{P(b)} - \sum_B P(b/a_2) \log \frac{1}{P(b)} \right]          (5-110)

on the right side of (5-109), we obtain
    I_Δ(A; B) - C = Δ \left[ \sum_B P(b/a_1) \log \frac{P(b/a_1)}{P(b)} - \sum_B P(b/a_2) \log \frac{P(b/a_2)}{P(b)} \right]
                    + \sum_B P_Δ(b) \log \frac{P(b)}{P_Δ(b)}
                  = Δ [I(a_1; B) - I(a_2; B)] + \sum_B P_Δ(b) \log \frac{P(b)}{P_Δ(b)}               (5-111)
We wish to show that the right side of (5-111) is positive, in order to arrive at a contradiction. The first term on the right of (5-111) is positive by (5-106). The second term, however, must be negative by our often used inequality (2-8). Hence, superficially, it appears as if we cannot draw any conclusions about the sign of the right side of (5-111). Such pessimism is not warranted, however, as we shall now demonstrate by examining the last term of (5-111) in more detail.
    \sum_B P_Δ(b) \log \frac{P(b)}{P_Δ(b)} = \sum_B \{ P(b) + Δ [P(b/a_1) - P(b/a_2)] \} \log \frac{1}{1 + Δ [P(b/a_1) - P(b/a_2)] / P(b)}          (5-112)
For x small enough, we may approximate \log [1/(1 + x)] by -x \log e. Using this fact in (5-112), we see that for Δ sufficiently small,

    \sum_B P_Δ(b) \log \frac{P(b)}{P_Δ(b)} ≈ -\log e \sum_B \{ P(b) + Δ [P(b/a_1) - P(b/a_2)] \} \frac{Δ [P(b/a_1) - P(b/a_2)]}{P(b)}
        = -Δ \log e \sum_B [P(b/a_1) - P(b/a_2)] - Δ^2 \log e \sum_B \frac{[P(b/a_1) - P(b/a_2)]^2}{P(b)}
        = -Δ^2 \log e \sum_B \frac{[P(b/a_1) - P(b/a_2)]^2}{P(b)}                                    (5-113)

since \sum_B P(b/a_1) = \sum_B P(b/a_2) = 1. Thus, the second term (a negative quantity) of (5-111) goes as Δ^2 for Δ sufficiently small, whereas the first term (a positive quantity) goes as Δ; by making Δ sufficiently small, the right side can be made positive, and we have our contradiction.
The assumption that not all the conditional mutual informations I(a; B) equal the capacity must be incorrect, and we have proved (5-105).
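The perturbation used in this proof can also be demonstrated numerically. Starting from an input distribution for which the conditional mutual informations differ, the sketch below (our own illustration) shifts a small probability Δ toward the input with the larger I(a; B), as in (5-107b), and shows that I(A; B) increases.

```python
from math import log2

def mutual_information(p_in, channel):
    p_out = [sum(p_in[i] * channel[i][j] for i in range(len(p_in)))
             for j in range(len(channel[0]))]
    return sum(p_in[i] * channel[i][j] * log2(channel[i][j] / p_out[j])
               for i in range(len(p_in))
               for j in range(len(channel[0]))
               if p_in[i] > 0 and channel[i][j] > 0)

def I_a(a, p_in, channel):
    """Conditional mutual information I(a; B) of (5-104)."""
    p_out = [sum(p_in[i] * channel[i][j] for i in range(len(p_in)))
             for j in range(len(channel[0]))]
    return sum(pba * log2(pba / p_out[j])
               for j, pba in enumerate(channel[a]) if pba > 0)

p = 0.1
bsc = [[1 - p, p], [p, 1 - p]]

# A distribution that does not achieve capacity: I(a_1; B) != I(a_2; B).
p_in = [0.3, 0.7]
print(I_a(0, p_in, bsc), I_a(1, p_in, bsc))

# Shift a small amount Delta of probability toward the input with the larger
# conditional mutual information, as in (5-107b); I(A; B) increases.
delta = 0.01
p_new = [p_in[0] + delta, p_in[1] - delta]
print(mutual_information(p_in, bsc), mutual_information(p_new, bsc))
```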
NOTES

Note 1. An information channel may be described as the set of inputs and outputs defined in Section 5-1, together with a probability P(·/a) on B for each a in A.
PROBLEMS

5-7. . . .
(b) Find I(A; B) in terms of I(A_1; B_1), I(A_2; B_2), and . . .
5-8. Generalize Problem 5-7 to the case of n information channels, rather than just 2.
5-9. The binary multiplicative channel shown in the sketch has two binary inputs and one binary output, b = ac. This channel may be described as an ordinary zero-memory channel by considering the four possible input combinations to comprise a new input alphabet A':

    A' = {00, 01, 10, 11}

    Figure P 5-9

(a) Write the channel matrix for the channel with input A' and output B.
(b) The input symbols a and c are selected independently. Pr {a = 0} = \omega_1 and Pr {c = 0} = \omega_2. Define 1 - \omega_1 = \bar{\omega}_1 and 1 - \omega_2 = \bar{\omega}_2. Find I(A'; B). Give an interpretation of your answer.
(c) Find the maximum value of I(A'; B) as \omega_1 and \omega_2 vary. Find all possible combinations of \omega_1 and \omega_2 which achieve this maximum value.
5-10. Let P be the channel matrix of a channel with r inputs and s outputs. Let α be the number of columns of the matrix having all their elements equal to zero.

    Figure P 5-10

(a) If the channel is deterministic, find its capacity.
(b) If (instead of the assumption of part a) we assume that the channel is noiseless, find its capacity.
(c) Now make the assumptions of both parts a and b at the same time. Two such channels (with the assumptions of parts a and b applying to each) are cascaded, as shown in the sketch. Find the capacity of the cascade channel with input A and output C.
    Figure P 5-10
5-11. Two BSCs, each with probability of error p, are cascaded, as shown in the sketch. The inputs 0 and 1 of A are chosen with equal probability. Find the following:
(a) H(A).
(b) H(B).
(c) H(C).
(d) H(A, B).
(e) H(B, C).
(f) H(A, C).
(g) H(A, B, C).
(h) I(A; B; C).
5-12. Let a and b be independent identically distributed binary random variables with the probability of a 0 equal to the probability of a 1. Define the binary random variable c = ab.
(a) Find H(A), H(B), H(C).
(b) Find I(A; B), I(A; C), I(B; C).
(c) Find H(A, B), H(A, C), H(B, C).
(d) Find H(A, B, C).
(e) Find H(A/B), H(A/C), H(B/C).
(f) Find H(A/B, C), H(B/A, C), H(C/A, B).
(g) Find I(A; B/C), I(B; A/C), I(C; A/B).
(h) Find I(A; B; C).
5-13. Let a and b be independent identically distributed binary random variables with the probability of a 0 equal to the probability of a 1. Define the binary random variable c = a + b, modulo 2. That is, c is 0 if a = b and c is 1 if a ≠ b.
(a) Find H(A), H(B), H(C).
(b) Find I(A; B), I(A; C), I(B; C).
(c) Find H(A, B), H(A, C), H(B, C).
(d) Find H(A, B, C).
(e) Find H(A/B), H(A/C), H(B/C).
(f) Find H(A/B, C), H(B/A, C), H(C/A, B).
(g) Find I(A; B/C), I(B; A/C), I(C; A/B).
(h) Find I(A; B; C).
5-14. Find the capacity of the channel

    [channel matrix]

The special case of p = 0 is called the binary erasure channel. Provide an interpretation of the capacity of the binary erasure channel.
5-15. Let P_1 and P_2 be the channel matrices of two channels with input alphabets A_1 and A_2 and output alphabets B_1 and B_2, respectively. Form a new channel matrix P with input alphabet A = A_1 ∪ A_2 and output alphabet B = B_1 ∪ B_2, as shown below:

    P = [ P_1    0  ]
        [  0    P_2 ]

(0 represents a matrix with all zero elements.)
Let P(a) be the probability of an input symbol a ∈ A. Let Q_1 = \sum_{A_1} P(a) and Q_2 = \sum_{A_2} P(a). Q_i is just the probability that a symbol from A_i is sent. Let C_1, C_2, and C be the capacities of P_1, P_2, and P, respectively.
(a) Find the values of Q_i (in terms of C_1 and C_2) which achieve capacity for the channel P.
(b) Find C in terms of C_1 and C_2.
(c) Extend the results of (a) and (b) to cover the case where n, instead of just 2, channels are combined.
5-16. (a) Find the capacity of the channel

    [channel matrix]

Sketch the capacity as a function of p.