
Introduction to Information Theory

1 / 20
Pre-requisites and course objectives
Essential:
Probability and random variables
Communication Systems/Digital Communication
Good to Have:
Optimization
Course Objectives
Define Information and its measures
State and apply Source and Channel Coding Theorems
Define and compute channel capacities of different types of channels
Apply information theory to other domains

History
This subject celebrates the work of the “Father of the Information Age”, Claude Shannon [tWp01]. His contributions are deemed parallel to Einstein's. See this video. Everything digital is traced back to him!
2 / 20
Brief recap of Random Variables

Random variables are mappings [1] from the sample space to the real numbers:
X : Ω → R
If Ω is countable, X is a discrete random variable and is characterized by its PMF: p_X(x) = P(X = x) [2]
Expectation: E[X] = Σ_{x∈X} x p(x), where X is the domain/support of X.
Function of a random variable: Z = g(X) is another random variable, with E[Z] = E[g(X)] = Σ_{x∈X} g(x) p(x) [3]

Expectation is linear: E[aX + bY] = aE[X] + bE[Y] [4]

[1] In this course we treat them simply as functions; there are further measure-theoretic technicalities, which require a course in measure theory and are out of scope.
[2] We might omit the subscript X and write p(x) if it is clear from context.
[3] Try to prove this when g is monotonic and invertible.
[4] Prove this! Here a and b are deterministic, while X and Y are random variables, not necessarily independent.
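As a quick sanity check of these definitions, here is a minimal Python sketch (the PMF values are made up for illustration, not taken from the slides) that computes E[X] and E[g(X)] from a finite support:

# Minimal sketch (illustrative PMF): expectation of a discrete random variable
# and of a function of it, E[g(X)] = sum_x g(x) p(x).
support = [1, 2, 3, 4]
pmf = [0.1, 0.2, 0.3, 0.4]          # p_X(x); must sum to 1

def expectation(xs, ps, g=lambda x: x):
    """E[g(X)] for a discrete random variable with support xs and PMF ps."""
    return sum(g(x) * p for x, p in zip(xs, ps))

E_X = expectation(support, pmf)                     # E[X]
E_X2 = expectation(support, pmf, g=lambda x: x**2)  # E[X^2] = E[g(X)] with g(x) = x^2
print(E_X, E_X2)                                    # 3.0 10.0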
3 / 20
Information Theory [Cov99]

Motivation: Why do you do courses/pursue research?


Answer: To gain new information
“The sun rises in the east, or 1+1 = 2.” If I teach you this, would you come to class?
“The joy is in the surprise” → information lies in the unexpected.

Self Information Axioms for Events


A sure event is perfectly unsurprising and yields no information.
The less probable an event is, the more information it yields
(monotonic?).
If two independent events are measured separately, the total amount
of information is the sum of the self-informations of the individual
events.

4 / 20
Measures of Information
Self-information
The only functions that satisfy the axioms are logarithmic functions. [a]

I (A) = − log P(A)

where A is an event and P(A) denotes the probability of its occurrence.


[a] You should try to prove this and justify the importance of the negative sign!

If the base of the logarithm is 2, the unit is the shannon or bit; if it is e (natural logarithm), the unit is the nat; and if it is 10, it is the hartley.
Entropy: Measure of uncertainty of a random variable
Let X be a discrete random variable with domain/alphabet X and PMF p(x) = P(X = x), x ∈ X:
H(X) = −Σ_{x∈X} p(x) log p(x)
To preserve continuity, 0 log 0 = 0.
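A minimal Python sketch (illustrative values, not from the slides) of self-information and entropy as defined above, in bits (base-2 logarithm):

import math

def self_information(p_event):
    """I(A) = -log2 P(A), in bits (shannons). A sure event (P(A) = 1) gives 0."""
    return -math.log2(p_event)

def entropy(pmf):
    """H(X) = -sum_x p(x) log2 p(x), with the convention 0 log 0 = 0."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)

print(self_information(0.5), self_information(0.125))   # 1.0 and 3.0 bits
print(entropy([0.5, 0.5]))      # fair coin: 1.0 bit of uncertainty
print(entropy([0.9, 0.1]))      # biased coin: less uncertain, about 0.469 bits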
5 / 20
Entropy and Joint Entropy

Entropy in terms of Expectation


We know that the random variable X is a mapping where each X = x is
associated with a probability p(x).
Let us define the function g(X = x) = log P(X = x). We write this as g(X) = log p(X), as this holds for every X = x. We have
E[g(X)] = E[log p(X)] = Σ_x p(x) log p(x) = −H(X), and hence

H(X) = −E[log p(X)]

Joint Entropy
For a system there is an input and an output, both of which can be treated as random. So we extend the definition to two random variables.
H(X, Y) = −E[log p(X, Y)] = −Σ_x Σ_y p(x, y) log p(x, y)

6 / 20
Conditional Entropy
Evidently, as we extend the definition to two random variables,
conditionals also come into play.
H(X|Y) = −E[log p(X|Y)] = −Σ_x Σ_y p(x, y) log p(x|y)

Relation between Joint and Conditional

H(X, Y) = −E[log p(X, Y)] = −E[log p(X) + log p(Y|X)]
        = H(X) + H(Y|X) = H(Y) + H(X|Y)

As entropy is a measure of uncertainty, this says that the joint uncertainty equals the uncertainty of one variable plus the remaining uncertainty of the other once the first has been observed.

We will soon be seeing all these in terms of Venn diagrams


Exercise: Compute H(X |X ) and H(X , X )
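A small Python sketch (with an arbitrary illustrative joint PMF) that computes H(X), H(Y|X) and H(X,Y) from a joint table and numerically checks the chain rule H(X,Y) = H(X) + H(Y|X); you can adapt it to check your answers to the exercise above.

import math

# Illustrative joint PMF p(x, y) over X in {0, 1} (rows) and Y in {0, 1} (columns).
p_xy = [[0.25, 0.25],
        [0.10, 0.40]]

def H(probs):
    """Entropy in bits of a list of probabilities (0 log 0 = 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

p_x = [sum(row) for row in p_xy]                              # marginal p(x)
p_y = [sum(p_xy[i][j] for i in range(2)) for j in range(2)]   # marginal p(y)

H_X = H(p_x)
H_XY = H([p for row in p_xy for p in row])                    # joint entropy H(X, Y)

# Conditional entropy computed directly: H(Y|X) = -sum p(x,y) log2 p(y|x).
H_Y_given_X = -sum(p_xy[i][j] * math.log2(p_xy[i][j] / p_x[i])
                   for i in range(2) for j in range(2) if p_xy[i][j] > 0)

# Chain rule check: H(X, Y) == H(X) + H(Y|X) up to floating-point error.
assert abs(H_XY - (H_X + H_Y_given_X)) < 1e-12
print(H_X, H_Y_given_X, H_XY)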
7 / 20
Relative Entropy [5]

Relative Entropy/Kullback-Leibler(KL) divergence/I-divergence


The KL divergence is a measure of dissimilarity between two PMFs p(x) and q(x), defined as:
D_KL(p||q) = Σ_x p(x) log (p(x)/q(x)) = E_p[log (p(X)/q(X))]

Significance of Relative Entropy


D(p||q) = −H(X) − E_p[log q(X)] ⇒ −E_p[log q(X)] = H(X) + D(p||q)
D(p||q) is the additional uncertainty incurred when the original distribution p(X) is replaced by a new distribution q(X).

KL divergence or distance is the term used to signify the “distance” between two distributions.
When p = q, the divergence is 0. Similar to entropy, to preserve continuity, 0 log (0/0) = 0, 0 log (0/q) = 0, p log (p/0) = ∞
−Ep [log q(X )] is called cross entropy, used as a loss function in ML
[5] Extensively used in theoretical ML.
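A minimal Python sketch (illustrative distributions) of KL divergence and cross entropy, checking the identity −E_p[log q(X)] = H(X) + D(p||q) stated above:

import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """D(p||q) = sum_x p(x) log2(p(x)/q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(p, q):
    """-E_p[log2 q(X)]"""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]   # "true" distribution (made up for illustration)
q = [0.4, 0.4, 0.2]   # approximating distribution

# Cross entropy decomposes as H(p) + D(p||q).
assert abs(cross_entropy(p, q) - (entropy(p) + kl_divergence(p, q))) < 1e-12
print(kl_divergence(p, q), kl_divergence(q, p))   # note: not symmetric in general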
8 / 20
Mutual Information (MI) [6]
There are many ways to define MI.
I(X; Y) = D(p(x, y)||p(x)p(y)) = E[log (p(X, Y) / (p(X)p(Y)))]
        = −E[log p(X)] − E[log p(Y)] − (−E[log p(X, Y)])
        = H(X) + H(Y) − H(X, Y)
        = H(X) + H(Y) − [H(Y) + H(X|Y)]
        = H(X) − H(X|Y)
        = H(Y) − H(Y|X)
Significance
KL distance of the joint distribution from the product of the marginals (i.e., from independence)
Reduction in uncertainty about one variable when the other is revealed.
I(X; Y) is symmetric in X and Y and always non-negative.
The KL distance, in contrast, is neither symmetric nor does it satisfy the triangle inequality, so it is not a true metric.
[6] This is the most important concept in IT. Channel capacity will be related to this.
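A short Python sketch (same illustrative joint-PMF style as before) computing I(X;Y) both as D(p(x,y)||p(x)p(y)) and as H(X) + H(Y) − H(X,Y), which should agree:

import math

p_xy = [[0.25, 0.25],
        [0.10, 0.40]]                       # illustrative joint PMF p(x, y)

p_x = [sum(row) for row in p_xy]
p_y = [sum(p_xy[i][j] for i in range(2)) for j in range(2)]

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# I(X;Y) as a KL divergence between the joint and the product of the marginals.
I_kl = sum(p_xy[i][j] * math.log2(p_xy[i][j] / (p_x[i] * p_y[j]))
           for i in range(2) for j in range(2) if p_xy[i][j] > 0)

# I(X;Y) = H(X) + H(Y) - H(X,Y)
I_ent = H(p_x) + H(p_y) - H([p for row in p_xy for p in row])

assert abs(I_kl - I_ent) < 1e-12
print(I_kl)   # non-negative; 0 iff X and Y are independent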
9 / 20
Conditional MI and Extension to Multiple RVs
Consider 3 random variables: X , Y , and Z .
Conditional MI
I(X; Y|Z) = H(X|Z) − H(X|Y, Z) = D(p(x, y|z)||p(x|z)p(y|z))

H(X, Y|Z) = −E[log p(X, Y|Z)] = −E[log (p(X, Y, Z) / p(Z))]
          = −E[log (p(X, Z) p(Y|X, Z) / p(Z))]
          = −E[log (p(Z) p(X|Z) p(Y|X, Z) / p(Z))] = H(X|Z) + H(Y|X, Z)

Then, H(X, Y, Z) = H(X) + H(Y, Z|X) = H(X) + H(Y|X) + H(Z|Y, X)

Continuing this way:
H(X1, X2, . . . , Xn) = Σ_i H(Xi | Xi−1, . . . , X1)
Similarly, for MI,
I(X1, X2, . . . , Xn; Y) = H(X1, . . . , Xn) − H(X1, . . . , Xn | Y) = Σ_i I(Xi; Y | X1, . . . , Xi−1)
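A brief Python sketch (three illustrative binary variables with a made-up joint PMF) that checks the entropy chain rule H(X,Y,Z) = H(X) + H(Y|X) + H(Z|Y,X) numerically:

import math

# Illustrative joint PMF over three binary variables (values made up, sum to 1).
p = {(0,0,0): 0.10, (0,0,1): 0.15, (0,1,0): 0.05, (0,1,1): 0.20,
     (1,0,0): 0.10, (1,0,1): 0.05, (1,1,0): 0.25, (1,1,1): 0.10}

def H(dist):
    """Entropy in bits of a PMF given as {outcome: probability}."""
    return -sum(v * math.log2(v) for v in dist.values() if v > 0)

def marginal(dist, idx):
    """Marginal PMF over the coordinate positions in idx."""
    out = {}
    for xyz, v in dist.items():
        key = tuple(xyz[i] for i in idx)
        out[key] = out.get(key, 0.0) + v
    return out

p_x  = marginal(p, (0,))
p_xy = marginal(p, (0, 1))

# Conditional entropies computed directly from their definitions.
H_Y_given_X  = -sum(v * math.log2(v / p_x[(k[0],)]) for k, v in p_xy.items() if v > 0)
H_Z_given_XY = -sum(v * math.log2(v / p_xy[(k[0], k[1])]) for k, v in p.items() if v > 0)

# Chain rule: H(X, Y, Z) = H(X) + H(Y|X) + H(Z|Y, X)
assert abs(H(p) - (H(p_x) + H_Y_given_X + H_Z_given_XY)) < 1e-12
print(H(p), H(p_x), H_Y_given_X, H_Z_given_XY)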
10 / 20
Properties of Information Measures
Similar to conditional Entropy and MI, we can have
Conditional Relative Entropy
D(p(y|x)||q(y|x)) = E_{p(x,y)}[log (p(Y|X) / q(Y|X))]

Show that: D(p(x, y )||q(x, y )) = D(p(x)||q(x)) + D(p(y |x)||q(y |x))
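A quick numerical spot-check of this identity with two made-up joint PMFs (a Python sketch, not part of the original slides):

import math

# Two illustrative joint PMFs over the same 2x2 alphabet.
p = [[0.25, 0.25], [0.10, 0.40]]
q = [[0.30, 0.30], [0.15, 0.25]]

px = [sum(row) for row in p]
qx = [sum(row) for row in q]

def D(a, b):
    """KL divergence between two PMFs given as flat lists."""
    return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

D_joint = D([v for row in p for v in row], [v for row in q for v in row])
D_marg  = D(px, qx)

# Conditional relative entropy D(p(y|x)||q(y|x)) = E_{p(x,y)}[log p(Y|X)/q(Y|X)]
D_cond = sum(p[i][j] * math.log2((p[i][j] / px[i]) / (q[i][j] / qx[i]))
             for i in range(2) for j in range(2) if p[i][j] > 0)

# D(p(x,y)||q(x,y)) = D(p(x)||q(x)) + D(p(y|x)||q(y|x))
assert abs(D_joint - (D_marg + D_cond)) < 1e-12
print(D_joint, D_marg, D_cond)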


Properties
We start to look into some properties of the defined measures till now:
H(X ) is a concave function of p(X ) and the maximum occurs when
p(X ) is uniform.
D(p||q) ≥ 0 and so I (X ; Y ) ≥ 0
I (X ; Y ) is concave w.r.t p(x) for fixed p(y |x) and convex w.r.t
p(y |x) for fixed p(x)

We need some background on convexity and related properties. Those who have taken an optimization course should already be familiar with these.
11 / 20
Basics of Convexity and Optimization
Convex Function
A function f(x) is convex if for any two points x1, x2 and any t ∈ (0, 1), with x = tx1 + (1 − t)x2,
f(tx1 + (1 − t)x2) ≤ tf(x1) + (1 − t)f(x2)

Exercise: Can you show that if this condition is satisfied by a twice-differentiable function, then for any point x ∈ [x1, x2] it must be that f′′(x) ≥ 0?
Concave Function
A function f is concave if −f is convex.

Convex Set
A set C is convex if for any two
points x1 , x2 ∈ C ,
tx1 + (1 − t)x2 ∈ C ∀t ∈ [0, 1]
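A tiny Python sketch (illustrative, a spot-check rather than a proof) that tests the convexity inequality above at random points for f(x) = −log(x), which is convex on (0, ∞):

import math
import random

def is_convex_on_samples(f, lo, hi, trials=10_000, seed=0):
    """Spot-check f(t*x1 + (1-t)*x2) <= t*f(x1) + (1-t)*f(x2) on random samples.

    This is only a numerical sanity check, not a proof of convexity.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        x1, x2 = rng.uniform(lo, hi), rng.uniform(lo, hi)
        t = rng.random()
        x = t * x1 + (1 - t) * x2
        if f(x) > t * f(x1) + (1 - t) * f(x2) + 1e-9:   # small tolerance
            return False
    return True

print(is_convex_on_samples(lambda x: -math.log(x), 0.01, 10.0))   # True: -log is convex
print(is_convex_on_samples(lambda x: math.log(x), 0.01, 10.0))    # False: log is concave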
12 / 20
Why Convexity? [1]
Local and Global Minima
x* is a global minimum of f(x) if f(x*) ≤ f(x) ∀x ∈ Dom(f)
x* is a local minimum of f(x) if ∃ N_ϵ(x*) ⊆ Dom(f) s.t. f(x*) ≤ f(x) ∀x ∈ N_ϵ(x*), where N_ϵ(x*) is an ϵ-neighbourhood of x*.

Unconstrained Optimization [2]
An unconstrained optimization problem is of the form min_x f(x). If f(x) is convex, a local minimum is a global minimum. The first-order condition for a local minimum (for differentiable f) is f′(x) = 0.

Constrained Optimization
A constrained optimization problem is of the form min_x f(x) s.t. x ∈ C. If f(x) is convex and C is a convex set, every local minimum is a global minimum.

The set C can be expressed in terms of functional constraints such as:


C = {x|h1 (x) ≤ 0, h2 (x) ≤ 0, . . . }
[1] Those who have attended my optimization course can skip these.
[2] Already known from the +2 level.
13 / 20
Convexity (Contd.) [7]

We defined convexity using any 2 points. In general, we can extend this to n points. Let x1, x2, . . . , xn be n points and f a convex function; then
f(Σ_{i=1}^{n} θi xi) ≤ Σ_{i=1}^{n} θi f(xi), ∀θi ∈ [0, 1], Σ_i θi = 1.
Proof: We use mathematical induction. The base case, n = 2, is already covered. Assume the statement is true for n; we need to show it is true for n + 1 as well. Let x_{n+1} be the new point. As the statement is true for n = 2, consider x̄ = Σ_{i=1}^{n} θi xi to be the other point; then (the first inequality uses the two-point case, the second uses the induction hypothesis),

f(λx̄ + (1 − λ)x_{n+1}) ≤ λf(x̄) + (1 − λ)f(x_{n+1}) ≤ λ Σ_{i=1}^{n} θi f(xi) + (1 − λ)f(x_{n+1})

Thus, if we choose a new set of coefficients ti = λθi ∈ [0, 1], i = 1, 2, . . . , n and t_{n+1} = (1 − λ), we have Σ_{i=1}^{n+1} ti = 1. At the same time,
f(Σ_{i=1}^{n+1} ti xi) ≤ Σ_{i=1}^{n+1} ti f(xi)
[7] Already covered in optimization.
14 / 20
Jensen’s Inequality
If f is a convex function and X is a discrete random variable:

f (E [X ]) ≤ E [f (X )]

The proof is a straightforward application of the previous slide: if X is finite with PMF represented as p1, p2, . . . , pn, this boils down to
f(Σ_i pi xi) ≤ Σ_i pi f(xi), ∵ pi ∈ [0, 1], Σ_{i=1}^{n} pi = 1
For a concave function, the inequality just reverses.
Question: Is log x convex or concave?
Let us apply this once to log X, where X is either x1 or x2 with probabilities p1 and p2:
log (p1 x1 + p2 x2) ≥ p1 log x1 + p2 log x2 ⇒ p1 x1 + p2 x2 ≥ x1^{p1} · x2^{p2}
When p1 = p2 = 1/2, this is nothing but the A.M.-G.M. inequality. Those who know the weighted A.M.-G.M. inequality may recognize this as:
(Σ_i wi xi) / w ≥ (Π_{i=1}^{n} xi^{wi})^{1/w}, where w = Σ_i wi
As you know, equality occurs iff all the xi are equal, i.e., X is a constant and hence X = E[X].
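A small Python sketch (arbitrary positive values and weights, chosen for illustration) of Jensen's inequality for the concave log, i.e., the weighted A.M.-G.M. inequality described above:

import math

# Illustrative positive values and probability weights (sum to 1).
x = [1.0, 4.0, 9.0, 16.0]
p = [0.1, 0.2, 0.3, 0.4]

lhs = math.log(sum(pi * xi for pi, xi in zip(p, x)))     # log(E[X])
rhs = sum(pi * math.log(xi) for pi, xi in zip(p, x))     # E[log X]

# Jensen for the concave log: log(E[X]) >= E[log X],
# equivalently E[X] >= prod_i x_i^{p_i}  (weighted A.M. >= G.M.).
assert lhs >= rhs
am = sum(pi * xi for pi, xi in zip(p, x))
gm = math.prod(xi ** pi for pi, xi in zip(p, x))
print(am, gm, am >= gm)    # arithmetic mean dominates geometric mean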
15 / 20
Information/Gibbs' Inequality
Let A = {x | p(x) > 0}
−D(p||q) = E_p[log (q(X)/p(X))] ≤ log E_p[q(X)/p(X)] = log Σ_{x∈A} p(x) (q(x)/p(x))
         = log Σ_{x∈A} q(x) ≤ log Σ_{x∈X} q(x) = 0
Equality occurs only when q(x) = c p(x) ⇒ Σ_x q(x) = c Σ_x p(x) ⇒ c = 1 ⇒ p(x) = q(x)
D(p||q) ≥ 0 where equality occurs iff p = q
I (X ; Y ) = D(p(x, y )||p(x)p(y )) ≥ 0 and equality occurs iff X and
Y are independent
Let q(x) = u(x) = 1/|X|, i.e., the uniform distribution. Then
D(p||u) = E[log (p(x)/u(x))] = −H(X) + log |X| ≥ 0 ⇒ H(X) ≤ log |X|
Thus, maximum entropy occurs for uniform distribution
Conditioning reduces entropy
I (X ; Y ) = H(X ) − H(X |Y ) ≥ 0 ⇒ H(X |Y ) ≤ H(X )
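A short Python sketch (illustrative distributions) verifying the consequences listed above: D(p||u) = log|X| − H(X) ≥ 0, and conditioning reduces entropy, H(X|Y) ≤ H(X):

import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Maximum entropy: for any PMF p on an alphabet of size n, H(p) <= log2(n),
# with the gap exactly equal to D(p || uniform).
p = [0.5, 0.25, 0.15, 0.10]                 # illustrative PMF
n = len(p)
D_p_u = sum(pi * math.log2(pi * n) for pi in p if pi > 0)    # D(p || u), u = 1/n
assert abs(D_p_u - (math.log2(n) - H(p))) < 1e-12
assert H(p) <= math.log2(n)

# Conditioning reduces entropy: H(X|Y) <= H(X) for an illustrative joint PMF.
p_xy = [[0.25, 0.25], [0.10, 0.40]]
p_x = [sum(row) for row in p_xy]
p_y = [p_xy[0][j] + p_xy[1][j] for j in range(2)]
H_X_given_Y = -sum(p_xy[i][j] * math.log2(p_xy[i][j] / p_y[j])
                   for i in range(2) for j in range(2) if p_xy[i][j] > 0)
assert H_X_given_Y <= H(p_x)
print(H(p), math.log2(n), H_X_given_Y, H(p_x))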
16 / 20
Problem
The weatherman’s record in a given city is summarized in the table below,
where the numbers indicate the relative frequency of the events:

                      Actual
Prediction       Rain       No Rain
Rain             1/8        3/16
No Rain          1/16       10/16

A clever student notices:


The weatherman is right only 12/16 of the time.
By always predicting “No Rain”, one could be right 13/16 of the time.
The student explains this situation and applies for the weatherman’s job.
However, the weatherman’s boss (who has studied information theory)
rejects the student’s application.
Question: Why does the boss reject the student’s claim?
17 / 20
Solution and Exercises

Information Comparison: Let X denote the actual weather and Y the prediction; let 0 denote “No Rain” and 1 denote “Rain”.
Weatherman’s Information: I(X; Y) = Σ_x Σ_y p(x, y) log (p(x, y) / (p(x)p(y)))
I(X; Y) = (1/8) log ((1/8) / ((3/16)(5/16))) + · · · > 0
Student’s Information: The student always predicts “No Rain”, so the prediction is deterministic and no longer a random event; hence P(X|Y) = P(X), which gives 0 information: I(X; Y) = 0.
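The mutual information for the weatherman’s table can be computed directly; a Python sketch of the calculation (X = actual, Y = prediction, as above):

import math

# Joint relative frequencies from the table: keys are (prediction, actual).
p = {("Rain", "Rain"):      1/8,  ("Rain", "No Rain"):     3/16,
     ("No Rain", "Rain"):   1/16, ("No Rain", "No Rain"): 10/16}

p_pred = {}
p_act = {}
for (pred, act), v in p.items():
    p_pred[pred] = p_pred.get(pred, 0.0) + v
    p_act[act] = p_act.get(act, 0.0) + v

# Weatherman's information I(X; Y) in bits: positive, so the forecasts do
# carry information about the actual weather even though they are often "wrong".
I = sum(v * math.log2(v / (p_pred[pred] * p_act[act]))
        for (pred, act), v in p.items() if v > 0)
print(round(I, 4))

# The student's constant "No Rain" prediction is independent of the weather,
# so its mutual information with the actual weather is 0.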
Exercises: Solve 2.1,2.2,2.3,2.4,2.7,2.8,2.10,2.11,2.12,2.13,2.14 from
Cover and Thomas
The problems are by and large difficult! It is advisable to refer to the solutions at https://cpb-us-w2.wpmucdn.com/sites.gatech.edu/dist/c/565/files/2017/01/solutions2.pdf

18 / 20
References

[Cov99] Thomas M. Cover and Joy A. Thomas, Elements of Information Theory, John Wiley & Sons, 1999.
[tWp01] Contributors to Wikimedia projects, Claude Shannon - Wikipedia, 2001. [Online; accessed 2025-01-06].

19 / 20
Thank You!

20 / 20
