Introduction To Information Theory and Coding: Louis Wehenkel
Louis WEHENKEL
IT 2000-8, slide 1
1. Intuitive introduction to channel coding
How can we communicate in reliable fashion over noisy channels ?
[Figure : binary symmetric channel ; each transmitted bit is received correctly with probability 1 − p and flipped with probability p]
IT 2000-8, slide 2
Let us suppose that : p = 0.1 (one error every 10 bits, on the average)
But in order to use the channel (say, a hard-disk) : we want to make sure that during
the whole lifecycle of 100 disks there is no more than one error.
E.g. : lifetime = 10 years, and let us suppose that we transfer 1 GB to the disk every
day :
⇒ Pe < 10^-15 (required)
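A back-of-the-envelope check of this requirement (a minimal sketch ; the conversion 1 GB ≈ 8 × 10^9 bits is an assumption of the example) :

# Total number of bits handled over the lifecycle of the 100 disks
bits_total = 100 * 10 * 365 * 8e9   # 100 disks, 10 years, 1 GB (~8e9 bits) per day
print(bits_total)                   # ~2.9e15 bits
print(1 / bits_total)               # "at most one error" => Pe ~ 3e-16, i.e. of the order of 1e-15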
1. Physical approach : better circuits, lower density, better cooling, increase signal
power. . .
2. System approach : compensate for the bad performances by using the disk in an
“intelligent” way
[Block diagram : message W → ENCODER → X^n → CHANNEL P(Y|X) → Y^n → DECODER → estimated message Ŵ]
IT 2000-8, slide 3
Information theory (and coding) ⇒ system approach (and solutions)
Add redundancy to the input message and exploit this redundancy when decoding the
received message
Information theory :
What are the possibilities (and limitations) in terms of performance tradeoffs ?
⇒ analysis problem
Coding theory :
How to build practical error-compensation systems ?
⇒ design problem
(Cf. analogy with “Systems theory vs Control theory”)
IT 2000-8, slide 4
Some preliminary remarks
Analysis problem : - more or less solved for most channels (but not for networks)
Design problem : - Very difficult in general
- Solved in a satisfactory way for a subset of problems only.
Cost of the system approach : - performance tradeoff, computational complexity,
- loss of real-time characteristics (in some cases).
Cost of the physical approach : - investment, power (energy).
⇒ Tradeoff depends on application contexts
IT 2000-8, slide 5
Error handling systems
[Block diagram : error avoidance and redundancy encoding → coded modulation → channel / storage medium → demodulation → error detection → error correction → error concealment ; a retransmission request feeds back from error detection to the sender ; the stages from modulation to demodulation form the abstract channel]
(Error concealment : exploits natural redundancy to interpolate “missing values”)
NB:
Error detection with retransmission request is an alternative to error correction, but is
not always possible.
In some protocols, error detection merely leads to dropping packets.
⇒ We will come back later to the discussion of “error detection and retransmission”
versus “forward error correction”.
IT 2000-8, slide 6
Codes for detecting and/or correcting errors on the binary symmetric channel
1. Repetition codes :
Source  Code
0       000
1       111
Decoder : majority vote.
Example of transmission : T = 0010110.
s 0 0 1 0 1 1 0
x 000 000 111 000 111 111 000
b 000 001 000 000 101 000 000 (b : noise vector)
y 000 001 111 000 010 111 000
Decoding : T̂ = 0010010
Pe (per source bit) : p^3 + 3p^2(1 − p) = 0.028 and code rate : R = 1/3
NB: to reach Pe ≤ 10^-15 we need R ≤ 1/60 . . .
Other properties : correction of single errors, detection of double errors.
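The numbers above can be checked numerically ; a minimal sketch (function name is illustrative), assuming majority-vote decoding of an N-fold repetition code over a binary symmetric channel with crossover probability p :

from math import comb

def repetition_error_prob(N, p):
    # Decoding fails iff at least (N+1)/2 of the N transmitted copies are flipped (N odd)
    return sum(comb(N, k) * p**k * (1 - p)**(N - k) for k in range((N + 1) // 2, N + 1))

p = 0.1
print(repetition_error_prob(3, p))    # 0.028, as computed above

# Smallest odd N (hence largest rate R = 1/N) meeting Pe <= 1e-15
N = 3
while repetition_error_prob(N, p) > 1e-15:
    N += 2
print(N, 1 / N)                       # N lands in the low sixties, consistent with R <= 1/60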
IT 2000-8, slide 7
2. Linear block codes (Hamming (7, 4))
We would like to maximize the code rate under the reliability constraint Pe ≤ 10−15
Block codes : to a block of K source bits we associate a codeword of length N ≥ K.
Example : Hamming (7, 4)
s x s x s x s x
0000 0000000 0100 0100110 1000 1000101 1100 1100011
0001 0001011 0101 0101101 1001 1001110 1101 1101000
0010 0010111 0110 0110001 1010 1010010 1110 1110100
0011 0011100 0111 0111010 1011 1011001 1111 1111111
This code may be written in a compact way as (s and x denote row vectors)

x = sG with G = [ 1 0 0 0 1 0 1
                  0 1 0 0 1 1 0
                  0 0 1 0 1 1 1
                  0 0 0 1 0 1 1 ] = [I4 P]
IT 2000-8, slide 8
Definition (linear code) We say that a code is linear if all linear combinations of
codewords are also codewords.
Binary codewords of length n form an n−dimensional linear space. A linear code
consists of a linear subspace of this space.
In our example ⇒ first 4 bits = source word, last 3 bits = parity control bits.
E.g. : 5th bit = parity (sum mod. 2) of the first 3 bits.
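Encoding is just a matrix product over GF(2) ; a minimal sketch using the generator matrix G of the previous slide (function name is illustrative) :

G = [[1, 0, 0, 0, 1, 0, 1],
     [0, 1, 0, 0, 1, 1, 0],
     [0, 0, 1, 0, 1, 1, 1],
     [0, 0, 0, 1, 0, 1, 1]]

def hamming_encode(s):
    # x = s G (mod 2) for a 4-bit source word s
    return [sum(s[i] * G[i][j] for i in range(4)) % 2 for j in range(7)]

print(hamming_encode([1, 0, 1, 0]))   # [1, 0, 1, 0, 0, 1, 0] : 1010 -> 1010010, as in the table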
IT 2000-8, slide 9
Decoding :
Let r = x + b denote the received word (b = error vector)
Maximum likelihood decoding :
Decide that the codeword x̂ which maximizes the probability p(x̂|r) was sent.
Assuming that all codewords are equiprobable, this is equivalent to maximizing p(r|x̂)
IT 2000-8, slide 10
Pictorial illustration
Repetition code (3, 1)
[Figure : the 3-bit hypercube {000, . . . , 111} with the two codewords 000 and 111 at opposite corners ; every other vertex is decoded to the nearest codeword]
⇒ maximum likelihood decoding ≡ nearest neighbor decoding
But : there is a more efficient way to do it than searching explicitly for the nearest
neighbor (syndrome decoding)
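For the binary symmetric channel the equivalence is immediate from the likelihood of the received word r (d denotes the Hamming distance) :

P(r|x̂) = p^d(r,x̂) (1 − p)^(n − d(r,x̂)) = (1 − p)^n (p/(1 − p))^d(r,x̂)

which, for p < 1/2, is strictly decreasing in d(r, x̂) : maximizing the likelihood amounts to choosing the codeword closest to r.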
IT 2000-8, slide 11
Let us focus on correcting single errors with the Hamming (7, 4) code
Meaning of the 7 bits of a codeword
s1 s2 s3 s4 p1 p2 p3
If error on any of the 4 first bits (signal bits) → two or three parity check violations.
If single error on one of the parity check bits : only this parity check is violated.
In both cases, we can identify the erroneous bit.
IT 2000-8, slide 12
Alternative representation : parity circles
[Figure : three overlapping parity circles ; p1 covers s1, s2, s3 ; p2 covers s2, s3, s4 ; p3 covers s1, s3, s4 ; each parity bit equals the mod-2 sum of the signal bits in its circle]
IT 2000-8, slide 13
Syndrome decoding
Syndrome : the difference (bitwise, mod 2) between the received parity bits and the parity bits
recomputed from the received signal bits.
⇒ The syndrome is a vector of three bits ⇒ 2^3 = 8 possible syndromes
The syndrome contains all the information needed for optimal decoding :
8 possible syndromes → 8 most likely error patterns (can be precomputed).
E.g. : suppose that r = 0101111 :
- signal bits 0101 → codeword 0101101 (parity 101)
- syndrome : 101 + 111 = 010 (bitwise, mod 2)
- most likely error pattern : 0000010
- decoded word : 0101101
E.g. : suppose that r = 0101110 :
- signal bits 0101 → codeword 0101101 (parity 101)
- syndrome : 101 + 110 = 011 (bitwise, mod 2)
- most likely error pattern : 0001000
- decoded signal bits : 0100 (codeword 0100110).
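A minimal sketch of this table-lookup decoder (the syndrome → error-pattern table below is derived from the generator matrix of slide 8 ; function name is illustrative) :

def hamming_decode(r):
    # r = (s1, s2, s3, s4, p1, p2, p3), a list of 7 bits
    s, p = r[:4], r[4:]
    recomputed = [(s[0] + s[1] + s[2]) % 2,    # p1 checks s1, s2, s3
                  (s[1] + s[2] + s[3]) % 2,    # p2 checks s2, s3, s4
                  (s[0] + s[2] + s[3]) % 2]    # p3 checks s1, s3, s4
    syndrome = tuple(recomputed[k] ^ p[k] for k in range(3))
    # 8 syndromes -> position of the most likely (single) error, or None
    error_pos = {(0, 0, 0): None,
                 (1, 0, 1): 0, (1, 1, 0): 1, (1, 1, 1): 2, (0, 1, 1): 3,   # s1..s4
                 (1, 0, 0): 4, (0, 1, 0): 5, (0, 0, 1): 6}                 # p1..p3
    x = list(r)
    if error_pos[syndrome] is not None:
        x[error_pos[syndrome]] ^= 1            # flip the presumed erroneous bit
    return x[:4]                               # decoded signal bits

print(hamming_decode([0, 1, 0, 1, 1, 1, 1]))   # syndrome 010 -> [0, 1, 0, 1]
print(hamming_decode([0, 1, 0, 1, 1, 1, 0]))   # syndrome 011 -> [0, 1, 0, 0]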
IT 2000-8, slide 14
Summary
Like the repetition code, the Hamming code (7, 4) corrects single errors and detects
double errors, but uses longer words (7 bits instead of 3) and has a higher rate.
If p = 0.1 : probability of error per code word : 0.14
→ (signal) bit error rate (BER) of : 0.07
Less good in terms of BER but better in terms of code-rate : R = 4/7.
There seems to be a compromise between BER and code rate.
Intuitively : lim_{Pe→0} R(Pe) = 0
(this is what most people still believed not so long ago...)
And then ?
...Shannon came...
IT 2000-8, slide 15
Second Shannon theorem
States that if R < C(p) = 1 − H2(p), then Pe may be made arbitrarily small (Pe → 0).
Third Shannon theorem (rate distortion : Pe > 0 tolerated)
Using irreversible compression we can further increase the code rate by a factor
1/(1 − H2(Pe)) if we accept to reduce reliability (i.e. let Pe increase).
Conclusion : we can operate in any region satisfying R ≤ C(p)/(1 − H2(Pe))
[Figure : achievable region in the (R, Pe) plane (Pe on a log scale, from 0.1 down to 10^-11) ; the boundary R = C(p)/(1 − H2(Pe)) separates the possible region from the impossible one, and meets Pe → 0 at R = C(0.1) ≈ 0.53]
Conclusion : we only need two disks and a very good code to reach Pe ≤ 10^-15 .
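The numbers on this slide can be reproduced directly (a minimal sketch) :

from math import log2

def H2(p):
    # binary entropy function in Shannon (handles the p = 0 and p = 1 boundary)
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

p = 0.1
C = 1 - H2(p)
print(C)                       # ~0.531 : where the boundary hits the R axis
print(C / (1 - H2(1e-15)))     # tolerating Pe = 1e-15 changes the bound only marginally
# Since 1/2 < 0.53, a rate-1/2 code (two disks to store the content of one)
# can in principle reach Pe <= 1e-15.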
IT 2000-8, slide 16
2nd Shannon theorem (channel coding)
What is a channel ? A system that carries information to later (storage) and/or elsewhere (transmission).
Abstract model :
[Diagram : input sequence X1, X2, . . . → P(Y1, Y2, . . . |X1, X2, . . .) → output sequence Y1, Y2, . . .]
IT 2000-8, slide 17
Simplifications
Causal channel : if ∀m ≤ n
P (Y1 , . . . , Ym |X1 , . . . , Xn ) = P (Y1 , . . . , Ym |X1 , . . . , Xm ). (1)
Remarks (about the information capacity C = max_{P(X)} I(X ; Y)) :
This quantity relates to one single use of the channel (one symbol).
I(X ; Y) depends both on source and channel properties.
C depends solely on channel properties.
We will see later that this quantity coincides with the notion of operational capacity.
NB: How would you generalize this notion to more general classes of channels ?
IT 2000-8, slide 19
Examples of discrete channels and values of capacity
Channel transition matrix :
[P(Yj |Xi)] = [ P(Y1|X1)     · · ·   P(Y|Y| |X1)
                  ...         ...        ...
                P(Y1|X|X|)   · · ·   P(Y|Y| |X|X|) ]
IT 2000-8, slide 20
2. Noisy channel without overlapping outputs
E.g. : transition matrix
[P(Yj |Xi)] = [ p   (1 − p)   0      0
                0      0      q   (1 − q) ]
H(X |Y) = 0 ⇒ I(X ; Y) = H(X ) ⇒ C = 1. (Achievable...)
3. Noisy type-writer
Input alphabet : a, b, c,. . . , z Output alphabet : a, b, c,. . . , z
P(a|a) = 0.5, P(b|a) = 0.5, P(b|b) = 0.5, P(c|b) = 0.5, . . . , P(z|z) = 0.5, P(a|z) = 0.5
I(X ; Y) = H(Y) − H(Y|X ) with H(Y|X ) = 1 ⇒ max if outputs are equiprobable
E.g. if inputs are equiprobable : H(Y) = log2 26 ⇒ C = log2 13
Achievable : just use the right subset of input alphabet...
(NB: this is THE idea of channel coding)
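A minimal sketch of this idea for the noisy typewriter (names are illustrative) : keep only every second letter, so that the output "clouds" of the selected inputs never overlap and decoding is error-free at rate log2 13.

import random, string

letters = string.ascii_lowercase                  # a .. z

def typewriter(x):
    # the channel outputs the letter itself or its successor, each with probability 0.5
    i = letters.index(x)
    return letters[(i + random.randint(0, 1)) % 26]

codebook = letters[::2]                           # a, c, e, ..., y : 13 non-overlapping inputs

def decode(y):
    i = letters.index(y)
    return y if y in codebook else letters[(i - 1) % 26]

msg = random.choices(codebook, k=20)
assert [decode(typewriter(x)) for x in msg] == msg   # error-free at rate log2(13)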
IT 2000-8, slide 21
4. Binary symmetric channel
[Figure : binary symmetric channel ; 0 → 0 and 1 → 1 with probability 1 − p, crossovers 0 → 1 and 1 → 0 with probability p]

[P(Yj |Xi)] = [ 1 − p    p
                  p    1 − p ]

⇒ C = 1 − H2(p) (cf. slide 16).
1. C ≥ 0.
2. C ≤ min{log |X |, log |Y|}.
Moreover, one can show that I(X ; Y) is continuous and concave with respect to
P (X ).
Thus every local maximum must be a global maximum (on the convex set of input
probability distributions P (X )).
Since the function I(X ; Y) is upper bounded, the capacity must be finite.
One can use powerful optimisation techniques to compute the information capacity of a
large class of channels to any desired accuracy.
In general, the solution cannot be obtained analytically.
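One such technique (not named on the slide) is the Blahut-Arimoto algorithm ; a minimal sketch, assuming the channel is given by its transition matrix W[i, j] = P(Yj |Xi) :

import numpy as np

def blahut_arimoto(W, tol=1e-9, max_iter=10000):
    # alternating maximisation of I(X;Y) over the input distribution p
    W = np.asarray(W, dtype=float)
    p = np.full(W.shape[0], 1.0 / W.shape[0])      # start from the uniform input law
    for _ in range(max_iter):
        q = p @ W                                  # induced output distribution
        with np.errstate(divide="ignore", invalid="ignore"):
            log_ratio = np.where(W > 0, np.log(W / q), 0.0)
        c = np.exp((W * log_ratio).sum(axis=1))    # c_i = exp( D( W[i,:] || q ) ), in nats
        I_low, I_up = np.log(p @ c), np.log(c.max())   # lower/upper bounds bracketing C
        if I_up - I_low < tol:
            break
        p = p * c / (p @ c)                        # re-weight the input distribution
    return I_low / np.log(2)                       # capacity in Shannon/channel use

print(blahut_arimoto([[0.9, 0.1], [0.1, 0.9]]))    # ~0.531 = 1 - H2(0.1) for the BSC above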
IT 2000-8, slide 23
Communication system
[Block diagram : message W → ENCODER → X^n → CHANNEL P(Y|X) → Y^n → DECODER → estimated message Ŵ]
IT 2000-8, slide 24
Memoryless channel specification
Input and output alphabets X and Y, and the P (Yk |Xk ) are given
IT 2000-8, slide 25
(M, n) Code
An (M, n) code for a channel (X , P(Y|X ), Y) is defined by
- an index set {1, . . . , M} (the set of possible messages),
- an encoding function X^n : {1, . . . , M} → X^n, giving the codewords X^n(1), . . . , X^n(M),
- a decoding function g : Y^n → {1, . . . , M}.
IT 2000-8, slide 26
Decoding error rates
1. Conditional probability of error given that index i was sent :

λi = P(g(Y^n) ≠ i | X^n = X^n(i)) = Σ_{Y^n ∈ Y^n} P(Y^n|X^n(i)) (1 − δ_{g(Y^n),i})

2. Maximal error probability :

λ^(n) = max_{i∈{1,...,M}} λi

3. Average error probability :

Pe^(n) = (1/M) Σ_{i=1}^{M} λi
IT 2000-8, slide 27
Optimal decoding rule
By definition : the decoding rule which minimises expected error rate.
For a received word Y^n → choose i such that P(X^n(i)|Y^n) is maximal.
⇒ maximises the a posteriori probability (MAP)
⇒ minimises the error probability for each received Y^n
⇒ minimises the expected error rate.
⇒ general principle in decision theory : Bayes rule
We use information Y (random variable which is observed).
We want to guess (decide on) a certain variable D (choose among M possibilities).
Correct decision : D∗ a random variable ⇒ P (D∗ |Y) known.
Cost of the taken decision : 0 if correct, 1 if incorrect.
Optimal decision based on information Y : D̂(Y) = arg max_D P(D|Y)
IT 2000-8, slide 28
For our channel :
P(X^n(i)|Y^n) = P(Y^n|X^n(i)) P(X^n(i)) / Σ_{j=1}^{M} P(Y^n|X^n(j)) P(X^n(j))

Since the denominator Σ_{j=1}^{M} P(Y^n|X^n(j)) P(X^n(j)) does not depend on the decision, this is the same
as maximizing P(Y^n|X^n(i)) P(X^n(i)).
Discussion
P(Y^n|X^n(i)) : channel specification.
P(X^n(i)) : source specification.
If non-redundant source : P(X^n(i)) independent of i ⇒ maximize P(Y^n|X^n(i)).
⇒ Maximum likelihood rule : minimises Pe^(n)
Quasi-optimal, if the source is quasi non-redundant.
E.g. if we code long source messages (cf. AEP)
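A minimal sketch of this rule for a memoryless channel (names are illustrative) ; with a uniform prior it reduces to maximum likelihood :

import numpy as np

def map_decode(y, codewords, priors, W):
    # pick the index i maximising P(X^n(i)) * prod_k P(y_k | x_k(i)) ;
    # the common denominator P(Y^n = y) is irrelevant for the argmax
    scores = [priors[i] * np.prod([W[xk][yk] for xk, yk in zip(x, y)])
              for i, x in enumerate(codewords)]
    return int(np.argmax(scores))

W = [[0.9, 0.1], [0.1, 0.9]]                       # BSC with p = 0.1
codewords = [(0, 0, 0), (1, 1, 1)]                 # repetition code of slide 7
print(map_decode((0, 1, 0), codewords, [0.5, 0.5], W))   # -> 0 (majority vote)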
IT 2000-8, slide 29
Communication rate : denoted R
The communication rate (denoted by R) of an (M, n) code is defined by R = (log2 M)/n
Shannon/channel use ≡ input entropy per channel use if inputs are uniformly distributed.
(E.g. for the Hamming (7, 4) code : M = 16, n = 7 ⇒ R = 4/7.)
Achievable rate (more subtle notion)
R is said to be achievable if ∃ a sequence of (M(n), n), n = 1, 2, . . . codes such that
lim_{n→∞} λ^(n) = 0 and lim inf_{n→∞} (log2 M(n))/n ≥ R.
IT 2000-8, slide 30
Second Shannon theorem
Objective : prove that information capacity C is equal to operational capacity Co .
Hypothesis : the pair (X^n, Y^n) satisfies the AEP (stationary and ergodic : OK for
stationary finite-memory channels and ergodic inputs).
Information capacity (per channel use) :

C = lim_{n→∞} (1/n) max_{P(X^n)} I(X^n ; Y^n)
Basic idea : for large block lengths, every channel looks like the noisy typewriter :
- it has a subset of inputs that produce essentially disjoint sequences at the output
- the rest of the proof is a matter of counting and packing...
IT 2000-8, slide 31
Outline of the proof (Shannon’s random coding ideas)
Let us fix for a moment P (X ), n and M , and construct a codebook by generating
random signals according to P (X ) (M × n drawings) ⇒ if n is large the codewords
as well as the received sequences must be typical (AEP)
[Figure : input space X^n and output space Y^n ; each codeword X^n_1, X^n_2, X^n_3 is mapped by the channel to an essentially disjoint cloud of typical output sequences in Y^n]
IT 2000-8, slide 32
Let us count and pack (approximately and intuitively...)
At the input : 2^{nH(X)} possible typical messages (we choose M of them at random to
construct our codebook).
At the output : for each typical X^n, 2^{nH(Y|X)} possible typical output sequences.
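The packing step then bounds the number of usable codewords (a standard completion of the argument, sketched in the notation of the slide) :

Since there are only about 2^{nH(Y)} typical output sequences in total, at most

M ≈ 2^{nH(Y)} / 2^{nH(Y|X)} = 2^{n(H(Y) − H(Y|X))} = 2^{nI(X;Y)}

output clouds can be kept essentially disjoint, i.e. R = (log2 M)/n ≲ I(X ; Y) ≤ C.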
IT 2000-8, slide 33
Second Shannon theorem
Statement in two parts :
1. (Achievability) every rate R < C is achievable (hence Co ≥ C) ;
2. (Converse) any achievable rate satisfies R ≤ C (hence Co ≤ C).
IT 2000-8, slide 34