
Introduction to Information Theory

1 / 20
Pre-requisites and course objectives
Essential:
Probability and random variables
Communication Systems/Digital Communication
Good to Have:
Optimization
Course Objectives
Define Information and its measures
State and apply Source and Channel Coding Theorems
Define and compute channel capacities of different types of channels
Apply information theory to other domains

History
This subject celebrates the work of the “Father of the Information Age”, Claude Shannon [tWp01]. His contributions are deemed parallel to Einstein's. See this video. Everything digital is traced back to him!
2 / 20
Brief recap of Random Variables

Random variables are mappings [1] from the sample space to the real numbers:
X : Ω → R
If Ω is countable, X is a discrete random variable and is characterized by its PMF: p_X(x) = P(X = x) [2]
Expectation: E[X] = Σ_{x∈X} x p(x), where X is the domain/support of X.
Function of a random variable: Z = g(X) is another random variable, with E[Z] = E[g(X)] = Σ_{x∈X} g(x) p(x) [3]

Expectation is linear: E[aX + bY] = aE[X] + bE[Y] [4]

[1] In this course we treat them simply as functions; there are further measure-theoretic technicalities, which require a course in measure theory and are out of scope.
[2] We might omit the subscript X and write p(x) if it is clear from context.
[3] Try to prove this when g is monotonic and invertible.
[4] Prove this! Here a and b are deterministic, while X and Y are random variables, not necessarily independent.
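As a quick sanity check of these definitions, here is a minimal Python sketch (the PMF values are made up for illustration, not taken from the slides) that computes E[X] and E[g(X)] from a finite support:

# Minimal sketch (illustrative PMF): expectation of a discrete random variable
# and of a function of it, E[g(X)] = sum_x g(x) p(x).
support = [1, 2, 3, 4]
pmf = [0.1, 0.2, 0.3, 0.4]          # p_X(x); must sum to 1

def expectation(xs, ps, g=lambda x: x):
    """E[g(X)] for a discrete random variable with support xs and PMF ps."""
    return sum(g(x) * p for x, p in zip(xs, ps))

E_X = expectation(support, pmf)                     # E[X]
E_X2 = expectation(support, pmf, g=lambda x: x**2)  # E[X^2] = E[g(X)] with g(x) = x^2
print(E_X, E_X2)                                    # 3.0 10.0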
3 / 20
Information Theory [Cov99]

Motivation: Why do you do courses/pursue research?


Answer: To gain new information
“The sun rises in the east, or 1+1 = 2.” If I teach you this, would you come to class?
“The joy is in the surprise” → information lies in the unexpected.

Self Information Axioms for Events


A sure event is perfectly unsurprising and yields no information.
The less probable an event is, the more information it yields
(monotonic?).
If two independent events are measured separately, the total amount
of information is the sum of the self-informations of the individual
events.

4 / 20
Measures of Information
Self-information
The only functions that satisfy the axioms are logarithmic functions. [a]

I (A) = − log P(A)

where A is an event and P(A) denotes the probability of its occurrence.


[a] You should try to prove this and justify the importance of the negative sign!

If the base of the logarithm is 2, the unit is the shannon or bit; if it is e (natural logarithm), the unit is the nat; and if it is 10, it is the hartley.
Entropy: Measure of uncertainty of a random variable
Let X be a discrete random variable with domain/alphabet X and PMF p(x) = P(X = x), x ∈ X:
H(X) = −Σ_{x∈X} p(x) log p(x)
To preserve continuity, 0 log 0 = 0.
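A minimal Python sketch (illustrative values, not from the slides) of self-information and entropy as defined above, in bits (base-2 logarithm):

import math

def self_information(p_event):
    """I(A) = -log2 P(A), in bits (shannons). A sure event (P(A) = 1) gives 0."""
    return -math.log2(p_event)

def entropy(pmf):
    """H(X) = -sum_x p(x) log2 p(x), with the convention 0 log 0 = 0."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)

print(self_information(0.5), self_information(0.125))   # 1.0 and 3.0 bits
print(entropy([0.5, 0.5]))      # fair coin: 1.0 bit of uncertainty
print(entropy([0.9, 0.1]))      # biased coin: less uncertain, about 0.469 bits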
5 / 20
Entropy and Joint Entropy

Entropy in terms of Expectation


We know that the random variable X is a mapping where each X = x is
associated with a probability p(x).
Let us define the function g(X = x) = log P(X = x). We write this as g(X) = log p(X), as this holds for every X = x. We have
E[g(X)] = E[log p(X)] = Σ_x p(x) log p(x) = −H(X), and hence

H(X) = −E[log p(X)]

Joint Entropy
For a system there is an input and an output, both of which can be treated as random. So we extend the definition to two random variables.
H(X, Y) = −E[log p(X, Y)] = −Σ_x Σ_y p(x, y) log p(x, y)

6 / 20
Conditional Entropy
Evidently, as we extend the definition to two random variables,
conditionals also come into play.
H(X|Y) = −E[log p(X|Y)] = −Σ_x Σ_y p(x, y) log p(x|y)

Relation between Joint and Conditional

H(X, Y) = −E[log p(X, Y)] = −E[log p(X) + log p(Y|X)]
        = H(X) + H(Y|X) = H(Y) + H(X|Y)

As entropy is a measure of uncertainty, this says that the joint uncertainty equals the uncertainty of one variable plus the remaining uncertainty of the other once the first has been observed.

We will soon be seeing all these in terms of Venn diagrams


Exercise: Compute H(X |X ) and H(X , X )
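A small Python sketch (with an arbitrary illustrative joint PMF) that computes H(X), H(Y|X) and H(X,Y) from a joint table and numerically checks the chain rule H(X,Y) = H(X) + H(Y|X); you can adapt it to check your answers to the exercise above.

import math

# Illustrative joint PMF p(x, y) over X in {0, 1} (rows) and Y in {0, 1} (columns).
p_xy = [[0.25, 0.25],
        [0.10, 0.40]]

def H(probs):
    """Entropy in bits of a list of probabilities (0 log 0 = 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

p_x = [sum(row) for row in p_xy]                              # marginal p(x)
p_y = [sum(p_xy[i][j] for i in range(2)) for j in range(2)]   # marginal p(y)

H_X = H(p_x)
H_XY = H([p for row in p_xy for p in row])                    # joint entropy H(X, Y)

# Conditional entropy computed directly: H(Y|X) = -sum p(x,y) log2 p(y|x).
H_Y_given_X = -sum(p_xy[i][j] * math.log2(p_xy[i][j] / p_x[i])
                   for i in range(2) for j in range(2) if p_xy[i][j] > 0)

# Chain rule check: H(X, Y) == H(X) + H(Y|X) up to floating-point error.
assert abs(H_XY - (H_X + H_Y_given_X)) < 1e-12
print(H_X, H_Y_given_X, H_XY)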
7 / 20
Relative Entropy [5]

Relative Entropy/Kullback-Leibler(KL) divergence/I-divergence


The KL divergence is a measure of dissimilarity between two PMFs p(x) and q(x), defined as:
D_KL(p||q) = Σ_x p(x) log (p(x)/q(x)) = E_p[log (p(X)/q(X))]

Significance of Relative Entropy


D(p||q) = −H(X) − E_p[log q(X)] ⇒ −E_p[log q(X)] = H(X) + D(p||q)
D(p||q) is the additional uncertainty incurred when the original distribution p(X) is replaced by a new distribution q(X).

KL divergence or distance is the term used to signify the “distance” between two distributions.
When p = q, the divergence is 0. Similar to entropy, to preserve continuity, 0 log (0/0) = 0, 0 log (0/q) = 0, p log (p/0) = ∞
−Ep [log q(X )] is called cross entropy, used as a loss function in ML
[5] Extensively used in theoretical ML.
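A minimal Python sketch (illustrative distributions) of KL divergence and cross entropy, checking the identity −E_p[log q(X)] = H(X) + D(p||q) stated above:

import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """D(p||q) = sum_x p(x) log2(p(x)/q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(p, q):
    """-E_p[log2 q(X)]"""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]   # "true" distribution (made up for illustration)
q = [0.4, 0.4, 0.2]   # approximating distribution

# Cross entropy decomposes as H(p) + D(p||q).
assert abs(cross_entropy(p, q) - (entropy(p) + kl_divergence(p, q))) < 1e-12
print(kl_divergence(p, q), kl_divergence(q, p))   # note: not symmetric in general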
8 / 20
Mutual Information (MI) [6]
There are many ways to define MI.
I(X; Y) = D(p(x, y)||p(x)p(y)) = E[log (p(X, Y) / (p(X)p(Y)))]
        = −E[log p(X)] − E[log p(Y)] − (−E[log p(X, Y)])
        = H(X) + H(Y) − H(X, Y)
        = H(X) + H(Y) − [H(Y) + H(X|Y)]
        = H(X) − H(X|Y)
        = H(Y) − H(Y|X)
Significance
KL distance of the joint distribution from the product of the marginals (i.e., from independence)
Reduction in uncertainty about one variable when the other is revealed.
I(X; Y) is symmetric in X and Y and always non-negative.
The KL distance, in contrast, is neither symmetric nor does it satisfy the triangle inequality, so it is not a true metric.
[6] This is the most important concept in IT. Channel capacity will be related to this.
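A short Python sketch (same illustrative joint-PMF style as before) computing I(X;Y) both as D(p(x,y)||p(x)p(y)) and as H(X) + H(Y) − H(X,Y), which should agree:

import math

p_xy = [[0.25, 0.25],
        [0.10, 0.40]]                       # illustrative joint PMF p(x, y)

p_x = [sum(row) for row in p_xy]
p_y = [sum(p_xy[i][j] for i in range(2)) for j in range(2)]

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# I(X;Y) as a KL divergence between the joint and the product of the marginals.
I_kl = sum(p_xy[i][j] * math.log2(p_xy[i][j] / (p_x[i] * p_y[j]))
           for i in range(2) for j in range(2) if p_xy[i][j] > 0)

# I(X;Y) = H(X) + H(Y) - H(X,Y)
I_ent = H(p_x) + H(p_y) - H([p for row in p_xy for p in row])

assert abs(I_kl - I_ent) < 1e-12
print(I_kl)   # non-negative; 0 iff X and Y are independent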
9 / 20
Conditional MI and Extension to Multiple RVs
Consider 3 random variables: X , Y , and Z .
Conditional MI
I(X; Y|Z) = H(X|Z) − H(X|Y, Z) = D(p(x, y|z)||p(x|z)p(y|z))

H(X, Y|Z) = −E[log p(X, Y|Z)] = −E[log (p(X, Y, Z) / p(Z))]
          = −E[log (p(X, Z) p(Y|X, Z) / p(Z))]
          = −E[log (p(Z) p(X|Z) p(Y|X, Z) / p(Z))] = H(X|Z) + H(Y|X, Z)

Then, H(X, Y, Z) = H(X) + H(Y, Z|X) = H(X) + H(Y|X) + H(Z|Y, X)

Continuing this way:
H(X1, X2, . . . , Xn) = Σ_i H(Xi | Xi−1, . . . , X1)
Similarly, for MI,
I(X1, X2, . . . , Xn; Y) = H(X1, . . . , Xn) − H(X1, . . . , Xn | Y) = Σ_i I(Xi; Y | X1, . . . , Xi−1)
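A brief Python sketch (three illustrative binary variables with a made-up joint PMF) that checks the entropy chain rule H(X,Y,Z) = H(X) + H(Y|X) + H(Z|Y,X) numerically:

import math

# Illustrative joint PMF over three binary variables (values made up, sum to 1).
p = {(0,0,0): 0.10, (0,0,1): 0.15, (0,1,0): 0.05, (0,1,1): 0.20,
     (1,0,0): 0.10, (1,0,1): 0.05, (1,1,0): 0.25, (1,1,1): 0.10}

def H(dist):
    """Entropy in bits of a PMF given as {outcome: probability}."""
    return -sum(v * math.log2(v) for v in dist.values() if v > 0)

def marginal(dist, idx):
    """Marginal PMF over the coordinate positions in idx."""
    out = {}
    for xyz, v in dist.items():
        key = tuple(xyz[i] for i in idx)
        out[key] = out.get(key, 0.0) + v
    return out

p_x  = marginal(p, (0,))
p_xy = marginal(p, (0, 1))

# Conditional entropies computed directly from their definitions.
H_Y_given_X  = -sum(v * math.log2(v / p_x[(k[0],)]) for k, v in p_xy.items() if v > 0)
H_Z_given_XY = -sum(v * math.log2(v / p_xy[(k[0], k[1])]) for k, v in p.items() if v > 0)

# Chain rule: H(X, Y, Z) = H(X) + H(Y|X) + H(Z|Y, X)
assert abs(H(p) - (H(p_x) + H_Y_given_X + H_Z_given_XY)) < 1e-12
print(H(p), H(p_x), H_Y_given_X, H_Z_given_XY)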
10 / 20
Properties of Information Measures
Similar to conditional Entropy and MI, we can have
Conditional Relative Entropy
D(p(y|x)||q(y|x)) = E_{p(x,y)}[log (p(Y|X) / q(Y|X))]

Show that: D(p(x, y )||q(x, y )) = D(p(x)||q(x)) + D(p(y |x)||q(y |x))
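A quick numerical spot-check of this identity with two made-up joint PMFs (a Python sketch, not part of the original slides):

import math

# Two illustrative joint PMFs over the same 2x2 alphabet.
p = [[0.25, 0.25], [0.10, 0.40]]
q = [[0.30, 0.30], [0.15, 0.25]]

px = [sum(row) for row in p]
qx = [sum(row) for row in q]

def D(a, b):
    """KL divergence between two PMFs given as flat lists."""
    return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

D_joint = D([v for row in p for v in row], [v for row in q for v in row])
D_marg  = D(px, qx)

# Conditional relative entropy D(p(y|x)||q(y|x)) = E_{p(x,y)}[log p(Y|X)/q(Y|X)]
D_cond = sum(p[i][j] * math.log2((p[i][j] / px[i]) / (q[i][j] / qx[i]))
             for i in range(2) for j in range(2) if p[i][j] > 0)

# D(p(x,y)||q(x,y)) = D(p(x)||q(x)) + D(p(y|x)||q(y|x))
assert abs(D_joint - (D_marg + D_cond)) < 1e-12
print(D_joint, D_marg, D_cond)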


Properties
We start to look into some properties of the defined measures till now:
H(X ) is a concave function of p(X ) and the maximum occurs when
p(X ) is uniform.
D(p||q) ≥ 0 and so I (X ; Y ) ≥ 0
I (X ; Y ) is concave w.r.t p(x) for fixed p(y |x) and convex w.r.t
p(y |x) for fixed p(x)

We need some background on convexity and related properties. Those who have taken an optimization course should already be familiar with these.
11 / 20
Basics of Convexity and Optimization
Convex Function
A function f(x) is convex if for any two points x1, x2 and any t ∈ (0, 1), with x = tx1 + (1 − t)x2,
f(tx1 + (1 − t)x2) ≤ tf(x1) + (1 − t)f(x2)

Exercise: Can you show that if this condition is satisfied by a twice-differentiable function, then for any point x ∈ [x1, x2] it must be that f′′(x) ≥ 0?
Concave Function
A function f is concave if −f is convex.

Convex Set
A set C is convex if for any two
points x1 , x2 ∈ C ,
tx1 + (1 − t)x2 ∈ C ∀t ∈ [0, 1]
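A tiny Python sketch (illustrative, a spot-check rather than a proof) that tests the convexity inequality above at random points for f(x) = −log(x), which is convex on (0, ∞):

import math
import random

def is_convex_on_samples(f, lo, hi, trials=10_000, seed=0):
    """Spot-check f(t*x1 + (1-t)*x2) <= t*f(x1) + (1-t)*f(x2) on random samples.

    This is only a numerical sanity check, not a proof of convexity.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        x1, x2 = rng.uniform(lo, hi), rng.uniform(lo, hi)
        t = rng.random()
        x = t * x1 + (1 - t) * x2
        if f(x) > t * f(x1) + (1 - t) * f(x2) + 1e-9:   # small tolerance
            return False
    return True

print(is_convex_on_samples(lambda x: -math.log(x), 0.01, 10.0))   # True: -log is convex
print(is_convex_on_samples(lambda x: math.log(x), 0.01, 10.0))    # False: log is concave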
12 / 20
Why Convexity? [1]
Local and Global Minima
x* is a global minimum of f(x) if f(x*) ≤ f(x) ∀x ∈ Dom(f)
x* is a local minimum of f(x) if ∃ N_ϵ(x*) ⊆ Dom(f) s.t. f(x*) ≤ f(x) ∀x ∈ N_ϵ(x*), where N_ϵ(x*) is an ϵ-neighbourhood of x*.

Unconstrained Optimization [2]
An unconstrained optimization problem is of the form min_x f(x). If f(x) is convex, a local minimum is a global minimum. The first-order condition for a local minimum (for differentiable f) is f′(x) = 0.

Constrained Optimization
A constrained optimization problem is of the form min_x f(x) s.t. x ∈ C. If f(x) is convex and C is a convex set, every local minimum is a global minimum.

The set C can be expressed in terms of functional constraints such as:


C = {x|h1 (x) ≤ 0, h2 (x) ≤ 0, . . . }
[1] Those who have attended my optimization course can skip these.
[2] Already known from the +2 level.
13 / 20
Convexity (Contd.) [7]

We defined convexity using any 2 points. In general, we can extend this to n points. Let x1, x2, . . . , xn be n points and f a convex function; then
f(Σ_{i=1}^{n} θi xi) ≤ Σ_{i=1}^{n} θi f(xi), ∀θi ∈ [0, 1], Σ_i θi = 1.
Proof: We use mathematical induction. The base case, n = 2, is already covered. Assume the statement is true for n; we need to show it is true for n + 1 as well. Let x_{n+1} be the new point. As the statement is true for n = 2, consider x̄ = Σ_{i=1}^{n} θi xi to be the other point; then (the first inequality uses the two-point case, the second uses the induction hypothesis),

f(λx̄ + (1 − λ)x_{n+1}) ≤ λf(x̄) + (1 − λ)f(x_{n+1}) ≤ λ Σ_{i=1}^{n} θi f(xi) + (1 − λ)f(x_{n+1})

Thus, if we choose a new set of coefficients ti = λθi ∈ [0, 1], i = 1, 2, . . . , n and t_{n+1} = (1 − λ), we have Σ_{i=1}^{n+1} ti = 1. At the same time,
f(Σ_{i=1}^{n+1} ti xi) ≤ Σ_{i=1}^{n+1} ti f(xi)
[7] Already covered in optimization.
14 / 20
Jensen’s Inequality
If f is a convex function and X is a discrete random variable:

f (E [X ]) ≤ E [f (X )]

The proof is a straightforward application of the previous slide: if X is finite with PMF represented as p1, p2, . . . , pn, this boils down to
f(Σ_i pi xi) ≤ Σ_i pi f(xi), ∵ pi ∈ [0, 1], Σ_{i=1}^{n} pi = 1
For a concave function, the inequality just reverses.
Question: Is log x convex or concave?
Let us apply this once to log X, where X is either x1 or x2 with probabilities p1 and p2:
log (p1 x1 + p2 x2) ≥ p1 log x1 + p2 log x2 ⇒ p1 x1 + p2 x2 ≥ x1^{p1} · x2^{p2}
When p1 = p2 = 1/2, this is nothing but the A.M.-G.M. inequality. Those who know the weighted A.M.-G.M. inequality may recognize this as:
(Σ_i wi xi) / w ≥ (Π_{i=1}^{n} xi^{wi})^{1/w}, where w = Σ_i wi
As you know, equality occurs iff all the xi are equal, i.e., X is a constant and hence X = E[X].
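A small Python sketch (arbitrary positive values and weights, chosen for illustration) of Jensen's inequality for the concave log, i.e., the weighted A.M.-G.M. inequality described above:

import math

# Illustrative positive values and probability weights (sum to 1).
x = [1.0, 4.0, 9.0, 16.0]
p = [0.1, 0.2, 0.3, 0.4]

lhs = math.log(sum(pi * xi for pi, xi in zip(p, x)))     # log(E[X])
rhs = sum(pi * math.log(xi) for pi, xi in zip(p, x))     # E[log X]

# Jensen for the concave log: log(E[X]) >= E[log X],
# equivalently E[X] >= prod_i x_i^{p_i}  (weighted A.M. >= G.M.).
assert lhs >= rhs
am = sum(pi * xi for pi, xi in zip(p, x))
gm = math.prod(xi ** pi for pi, xi in zip(p, x))
print(am, gm, am >= gm)    # arithmetic mean dominates geometric mean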
15 / 20
Information/Gibbs' Inequality
Let A = {x | p(x) > 0}
−D(p||q) = E_p[log (q(X)/p(X))] ≤ log E_p[q(X)/p(X)] = log Σ_{x∈A} p(x) (q(x)/p(x))
         = log Σ_{x∈A} q(x) ≤ log Σ_{x∈X} q(x) = 0
Equality occurs only when q(x) = c p(x) ⇒ Σ_x q(x) = c Σ_x p(x) ⇒ c = 1 ⇒ p(x) = q(x)
D(p||q) ≥ 0 where equality occurs iff p = q
I (X ; Y ) = D(p(x, y )||p(x)p(y )) ≥ 0 and equality occurs iff X and
Y are independent
Let q(x) = u(x) = 1/|X|, i.e., the uniform distribution. Then
D(p||u) = E[log (p(x)/u(x))] = −H(X) + log |X| ≥ 0 ⇒ H(X) ≤ log |X|
Thus, maximum entropy occurs for uniform distribution
Conditioning reduces entropy
I (X ; Y ) = H(X ) − H(X |Y ) ≥ 0 ⇒ H(X |Y ) ≤ H(X )
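A short Python sketch (illustrative distributions) verifying the consequences listed above: D(p||u) = log|X| − H(X) ≥ 0, and conditioning reduces entropy, H(X|Y) ≤ H(X):

import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Maximum entropy: for any PMF p on an alphabet of size n, H(p) <= log2(n),
# with the gap exactly equal to D(p || uniform).
p = [0.5, 0.25, 0.15, 0.10]                 # illustrative PMF
n = len(p)
D_p_u = sum(pi * math.log2(pi * n) for pi in p if pi > 0)    # D(p || u), u = 1/n
assert abs(D_p_u - (math.log2(n) - H(p))) < 1e-12
assert H(p) <= math.log2(n)

# Conditioning reduces entropy: H(X|Y) <= H(X) for an illustrative joint PMF.
p_xy = [[0.25, 0.25], [0.10, 0.40]]
p_x = [sum(row) for row in p_xy]
p_y = [p_xy[0][j] + p_xy[1][j] for j in range(2)]
H_X_given_Y = -sum(p_xy[i][j] * math.log2(p_xy[i][j] / p_y[j])
                   for i in range(2) for j in range(2) if p_xy[i][j] > 0)
assert H_X_given_Y <= H(p_x)
print(H(p), math.log2(n), H_X_given_Y, H(p_x))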
16 / 20
Problem
The weatherman’s record in a given city is summarized in the table below,
where the numbers indicate the relative frequency of the events:

                      Actual
Prediction       Rain       No Rain
Rain             1/8        3/16
No Rain          1/16       10/16

A clever student notices:


The weatherman is right only 12/16 of the time.
By always predicting “No Rain”, one could be right 13/16 of the time.
The student explains this situation and applies for the weatherman’s job.
However, the weatherman’s boss (who has studied information theory)
rejects the student’s application.
Question: Why does the boss reject the student’s claim?
17 / 20
Solution and Exercises

Information Comparison: Let X denote the actual weather and Y the prediction; let 0 denote “No Rain” and 1 denote “Rain”.
Weatherman’s Information: I(X; Y) = Σ_x Σ_y p(x, y) log (p(x, y) / (p(x)p(y)))
I(X; Y) = (1/8) log ((1/8) / ((3/16)(5/16))) + · · · > 0
Student’s Information: The student always predicts “No Rain”, so the prediction is deterministic and no longer a random event; hence P(X|Y) = P(X), which gives 0 information: I(X; Y) = 0.
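The mutual information for the weatherman’s table can be computed directly; a Python sketch of the calculation (X = actual, Y = prediction, as above):

import math

# Joint relative frequencies from the table: keys are (prediction, actual).
p = {("Rain", "Rain"):      1/8,  ("Rain", "No Rain"):     3/16,
     ("No Rain", "Rain"):   1/16, ("No Rain", "No Rain"): 10/16}

p_pred = {}
p_act = {}
for (pred, act), v in p.items():
    p_pred[pred] = p_pred.get(pred, 0.0) + v
    p_act[act] = p_act.get(act, 0.0) + v

# Weatherman's information I(X; Y) in bits: positive, so the forecasts do
# carry information about the actual weather even though they are often "wrong".
I = sum(v * math.log2(v / (p_pred[pred] * p_act[act]))
        for (pred, act), v in p.items() if v > 0)
print(round(I, 4))

# The student's constant "No Rain" prediction is independent of the weather,
# so its mutual information with the actual weather is 0.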
Exercises: Solve 2.1,2.2,2.3,2.4,2.7,2.8,2.10,2.11,2.12,2.13,2.14 from
Cover and Thomas
The problems are by and large difficult! It is advisable to refer to the solutions at https://cpb-us-w2.wpmucdn.com/sites.gatech.edu/dist/c/565/files/2017/01/solutions2.pdf

18 / 20
References

[Cov99] Thomas M. Cover and Joy A. Thomas, Elements of Information Theory, John Wiley & Sons, 1999.
[tWp01] Contributors to Wikimedia projects, Claude Shannon - Wikipedia, 2001. [Online; accessed 2025-01-06].

19 / 20
Thank You!

20 / 20
