Information Theory meets Machine Learning

Kedar Tatwawadi

EE376a Course
Table of contents

1. Introduction

2. Unsupervised Learning

3. Learning Data Distribution

4. ML for Lossy Compression

1
Introduction
ML and IT

ML/Statistics & Information theory are two sides of the same coin!

Figure 1: ML and IT

2
A short (very) intro to ML

Figure 2: ML zoo

3
A short (very) intro to ML

Figure 3: ML Zoo

4
A short (very) intro to ML

Figure 4: ML Zoo

5
Supervised Learning

Given data tuples (X_1, y_1), (X_2, y_2), . . . , (X_N, y_N), find a function F such that:

F(X) = y

6
Supervised Learning

1. What is the function F?

2. SVM, ConvNet, Recurrent Neural Network, Decision Tree ...

Take CS229, CS231n courses!

7
A short (very) intro to ML

Figure 5: ML Zoo

8
A short (very) intro to ML

Figure 6: ML Zoo

9
Unsupervised Learning
Unsupervised Learning

Given data: X1 , X2 , X3 , . . . , XN
"Learn" something useful about X

1. Clustering
2. Data Representation
3. Distribution of the data

10
Clustering

Figure 7: Clustering
11
Data Representation

Figure 8: Word2Vec Representation

12
Data Representation

13
Learning Data Distribution
Learning the distribution

"Learn" the underlying distribution of the data.

Given data X^(1), X^(2), . . . , X^(N) with distribution p_X(X), how do we learn p_X(X)?

Use cases

1. Sampling
2. Prediction
3. De-noising
4. Compression

14
Sampling

15
Prediction

16
Denoising

17
Learning the distribution

"Learn" the underlying distribution of the data

1. Sampling
2. Prediction
3. De-noising
4. Compression

18
Learning the distribution

Data: X^(1), X^(2), . . . , X^(N) i.i.d. (independent and identically distributed) with distribution p_X(X)

• X_i ∈ X
• Potentially |X| can be high

How do we learn p_X(X)?

• We can use the Log-loss (Cross-entropy loss) to learn p_X(X):

$$p_X = \arg\min_{q(X)} \; \mathbb{E}_{p_X}\!\left[\log \frac{1}{q(X)}\right] \tag{1}$$

19
Learning the distribution

Data: X^(1), X^(2), . . . , X^(N) with distribution p_X(X)

$$
\begin{aligned}
\mathbb{E}_{p_X}\!\left[\log \frac{1}{q(X)}\right]
&= \sum_{x \in \mathcal{X}} p(x) \log \frac{1}{q(x)} \\
&= \sum_{x \in \mathcal{X}} p(x) \log \left(\frac{1}{p(x)} \cdot \frac{p(x)}{q(x)}\right) \\
&= \sum_{x \in \mathcal{X}} p(x) \log \frac{1}{p(x)} + \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)} \\
&= H_p(X) + D_{KL}(p_X \,\|\, q)
\end{aligned}
$$

Since D_KL(p_X || q) ≥ 0, with equality if and only if q = p_X:

$$p_X = \arg\min_{q(X)} \; \mathbb{E}_{p_X}\!\left[\log \frac{1}{q(X)}\right]$$

20
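As a quick numerical check of the identity above, here is a minimal sketch (my addition, not from the slides) using a made-up three-symbol pair of distributions p and q:

```python
# Minimal sketch (illustrative): verify numerically that
# E_p[log 1/q(X)] = H_p(X) + D_KL(p || q).
import numpy as np

p = np.array([0.5, 0.3, 0.2])   # "true" distribution p_X (made up)
q = np.array([0.4, 0.4, 0.2])   # candidate model q (made up)

cross_entropy = np.sum(p * np.log2(1.0 / q))
entropy = np.sum(p * np.log2(1.0 / p))
kl = np.sum(p * np.log2(p / q))

print(f"E_p[log 1/q]        = {cross_entropy:.4f} bits")
print(f"H_p(X) + D_KL(p||q) = {entropy + kl:.4f} bits")  # matches the line above
```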
Learning the distribution

$$p_X = \arg\min_{q(X)} \; \mathbb{E}_{p_X}\!\left[\log \frac{1}{q(X)}\right]$$

• In practice we consider the empirical expectation instead:

$$\arg\min_{q(X)} \; \mathbb{E}_{p_X}\!\left[\log \frac{1}{q(X)}\right] \;\approx\; \arg\min_{q(X)} \; \frac{1}{N} \sum_{i=1}^{N} \log \frac{1}{q(X^{(i)})}$$

21
Learning the distribution

• In practice we consider the empirical expectation instead:

$$
\begin{aligned}
\arg\min_{q(X)} \; \frac{1}{N} \sum_{i=1}^{N} \log \frac{1}{q(X^{(i)})}
&= \arg\min_{q(X)} \; \frac{1}{N} \log \frac{1}{q(X^{(1)})\, q(X^{(2)}) \cdots q(X^{(N)})} \\
&= \arg\min_{q(X)} \; \sum_{x \in \mathcal{X}} \frac{n_x}{N} \log \frac{1}{q(x)} \\
&= \arg\min_{q(X)} \; \mathbb{E}_{\hat{p}_X}\!\left[\log \frac{1}{q(X)}\right]
\end{aligned}
$$

22
Learning the distribution

$$\arg\min_{q(X)} \; \frac{1}{N} \sum_{i=1}^{N} \log \frac{1}{q(X^{(i)})} \;=\; \arg\min_{q(X)} \; \mathbb{E}_{\hat{p}_X}\!\left[\log \frac{1}{q(X)}\right] \;=\; \hat{p}_X, \qquad \hat{p}_X(x) = \frac{n_x}{N}$$

• When X = (Y_1, Y_2, . . . , Y_d), |X| = k^d
• For high |X|, p̂_X is not useful!
• We need more data, or ... some regularization.

23
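A minimal sketch of the claim above (toy data and names are my own, not from the slides): on a small alphabet, the empirical distribution p̂ achieves the lowest empirical log-loss of any model q.

```python
# Minimal sketch: the empirical distribution p_hat(x) = n_x / N minimizes the
# empirical log-loss; any other model (e.g. uniform) does at least as badly.
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
data = rng.choice(["a", "b", "c"], size=100, p=[0.6, 0.3, 0.1])

N = len(data)
p_hat = {x: c / N for x, c in Counter(data).items()}

def empirical_log_loss(q):
    """(1/N) * sum_i log2 1/q(X^(i)) over the observed samples."""
    return np.mean([np.log2(1.0 / q[x]) for x in data])

q_uniform = {x: 1.0 / 3 for x in "abc"}
print(f"log-loss of p_hat:   {empirical_log_loss(p_hat):.4f} bits")    # empirical entropy
print(f"log-loss of uniform: {empirical_log_loss(q_uniform):.4f} bits")  # >= the above
```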
Data Example

• X = (Y1 , Y2 , . . . , Yd ), |X | = k d , N ≈ number of dimensions.

24
Regularization

$$\arg\min_{q(X)} \; \mathbb{E}_{p_X}\!\left[\log \frac{1}{q(X)}\right] \;\approx\; \arg\min_{q(X)} \; \mathbb{E}_{\hat{p}_X}\!\left[\log \frac{1}{q(X)}\right] \;\approx\; \arg\min_{q(X) \in \mathcal{Q}} \; \mathbb{E}_{\hat{p}_X}\!\left[\log \frac{1}{q(X)}\right]$$

• q(X) = q(Y_1, Y_2, . . . , Y_d) = q_1(Y_1) q_2(Y_2|Y_1) q_3(Y_3|Y_2, Y_1) . . . q_d(Y_d|Y_1, . . . , Y_{d-1})
• Q restricts q to a smaller family of distributions,
  e.g.: q(Y_1, Y_2, . . . , Y_d) = q_1(Y_1) q_2(Y_2) q_3(Y_3) . . . q_d(Y_d)

25
Q_I: independent distributions

• Q_I restricts the distribution over the d dimensions to be independent,
  e.g.: q(Y_1, Y_2, . . . , Y_d) = q_1(Y_1) q_2(Y_2) q_3(Y_3) . . . q_d(Y_d)

$$\arg\min_{q(X) \in \mathcal{Q}_I} \; \mathbb{E}_{\hat{p}_X}\!\left[\log \frac{1}{q(X)}\right] \;=\; \big(\hat{q}_1(y_1), \ldots, \hat{q}_d(y_d)\big)$$

• The solution q(Y_1, Y_2, . . . , Y_d) = q̂_1(y_1) q̂_2(y_2) . . . q̂_d(y_d) is not very useful for the tabular dataset, since it ignores all dependence between the columns.

26
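To make the limitation concrete, here is a small sketch (not from the slides; the toy correlated data and variable names are assumptions) that fits the product-of-marginals model in Q_I and compares its log-loss with the full empirical joint:

```python
# Minimal sketch: fit the independent (product-of-marginals) model to correlated
# binary columns and compare log-loss against the full empirical joint.
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
N = 1000
y1 = rng.integers(0, 2, size=N)
y2 = np.where(rng.random(N) < 0.9, y1, 1 - y1)   # Y2 copies Y1 with prob 0.9
data = np.stack([y1, y2], axis=1)

# Independent model: product of per-column empirical marginals
marginals = [np.bincount(col, minlength=2) / N for col in data.T]
def q_indep(x):
    return np.prod([m[v] for m, v in zip(marginals, x)])

# Full empirical joint (feasible here because |X| = 2^2 is tiny)
joint = Counter(map(tuple, data))
def p_hat(x):
    return joint[tuple(x)] / N

loss_indep = np.mean([np.log2(1.0 / q_indep(x)) for x in data])
loss_joint = np.mean([np.log2(1.0 / p_hat(x)) for x in data])
print(f"log-loss, independent model: {loss_indep:.3f} bits/sample")  # ~2 bits
print(f"log-loss, empirical joint:   {loss_joint:.3f} bits/sample")  # ~1.47 bits
```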
Tabular Example

27
Tree-based distributions

• We restrict distributions to T:
  e.g.: T = {q | q(Y_1, Y_2, . . . , Y_d) = q_1(Y_1) q_2(Y_2|Y_{i_2}) q_3(Y_3|Y_{i_3}) . . . q_d(Y_d|Y_{i_d})}
• For every Y_j, we allow dependence on exactly one earlier variable Y_{i_j}, with i_j < j
• This corresponds exactly to a "tree distribution"

28
Tree-based distributions — Examples

• Example tree distribution:


q(Y1 , Y2 , . . . , Y5 ) = q1 (Y1 )q2 (Y2 |Y1 )q3 (Y3 |Y1 )q4 (Y4 |Y2 )q5 (Y5 |Y2 )

29
Tree-based distributions — Examples

• Example tree distribution:


q(Y1 , Y2 , Y3 ) = q1 (Y1 )q2 (Y2 |Y1 )q3 (Y3 |Y2 )

Figure 11: Graph example

30
Tree-based distributions

Figure 12: Graph example

• Tree distributions are practical! Number of parameters = dk^2
• Sampling is easy (in a breadth-first-search order):
  Y_1 → Y_2 → Y_3 → Y_4 → Y_5

31
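A minimal sketch of ancestral sampling from a tree distribution with the structure of the earlier example (Y_1 is the root; Y_2, Y_3 depend on Y_1; Y_4, Y_5 depend on Y_2). The random conditional probability tables are placeholders; in practice they would be estimated from data.

```python
# Minimal sketch: ancestral (BFS-order) sampling from a tree distribution
# over k-ary variables Y1..Y5, one conditional table per non-root node.
import numpy as np

rng = np.random.default_rng(0)
k = 3                                          # alphabet size of each Y_i
parent = {1: None, 2: 1, 3: 1, 4: 2, 5: 2}     # tree structure from the example

root_dist = rng.dirichlet(np.ones(k))
# cpt[i][a] is the distribution of Y_i given its parent's value a
cpt = {i: rng.dirichlet(np.ones(k), size=k) for i in parent if parent[i] is not None}

def sample_tree():
    """Sample (Y1,...,Y5), each node given its parent (1,2,3,4,5 is a valid BFS order)."""
    y = {1: rng.choice(k, p=root_dist)}
    for i in sorted(parent):
        if parent[i] is not None:
            y[i] = rng.choice(k, p=cpt[i][y[parent[i]]])
    return [y[i] for i in sorted(y)]

print(sample_tree())
```

Note that the parameter count matches the slide's claim: one length-k table for the root plus a k-by-k table per remaining node, roughly d·k^2 numbers in total.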
Tree-based distributions

• Can be used for compression, using Arithmetic coding:


• q(Y1 , Y2 , . . . , Y5 ) = q1 (Y1 )q2 (Y2 |Y1 )q3 (Y3 |Y1 )q4 (Y4 |Y2 )q5 (Y5 |Y2 )

Figure 13: HW3 Q3(f)

32
Chow-Liu Tree Algorithm

• Let Î(Y_i; Y_j) be the mutual information computed using the "empirical" distribution p̂_X(X) = p̂_X(Y_1, Y_2, . . . , Y_d).
  The best tree graph representing the data can be found by:

$$G = \arg\max_{\text{tree } G} \; \sum_{\text{edges}(i,j)} \hat{I}(Y_i; Y_j) \tag{2}$$

• Intuition: add edges which have "high" correlation. (A toy implementation is sketched below.)

33
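Here is a toy sketch of the procedure (my own illustration, with assumed binary data and a hand-rolled Kruskal pass): estimate all pairwise plug-in mutual informations from counts, then keep a maximum-weight spanning tree over the d variables.

```python
# Minimal Chow-Liu sketch: plug-in pairwise MI + maximum spanning tree (Kruskal).
import numpy as np
from itertools import combinations

def plugin_mutual_information(a, b):
    """Plug-in (empirical) MI in bits between two discrete columns a, b."""
    n = len(a)
    joint = {}
    for x, y in zip(a, b):
        joint[(x, y)] = joint.get((x, y), 0) + 1
    pa = {x: np.mean(a == x) for x in set(a)}
    pb = {y: np.mean(b == y) for y in set(b)}
    return sum((c / n) * np.log2((c / n) / (pa[x] * pb[y]))
               for (x, y), c in joint.items())

def chow_liu_tree(data):
    """Return the edges of a maximum spanning tree under empirical MI weights."""
    d = data.shape[1]
    edges = sorted(((plugin_mutual_information(data[:, i], data[:, j]), i, j)
                    for i, j in combinations(range(d), 2)), reverse=True)
    parent = list(range(d))                     # union-find for Kruskal
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                            # keep edge only if it joins components
            parent[ri] = rj
            tree.append((i, j, w))
    return tree

# Toy usage: Y1 drives Y2 and Y3, and Y3 drives Y4.
rng = np.random.default_rng(0)
y1 = rng.integers(0, 2, 2000)
y2 = np.where(rng.random(2000) < 0.9, y1, 1 - y1)
y3 = np.where(rng.random(2000) < 0.8, y1, 1 - y1)
y4 = np.where(rng.random(2000) < 0.85, y3, 1 - y3)
print(chow_liu_tree(np.stack([y1, y2, y3, y4], axis=1)))  # recovers (0,1), (2,3), (0,2)
```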
Data Example

• X = (Y1 , Y2 , . . . , Yd ), |X | = k d , N ≈ number of dimensions.

34
Data Example

35
Chow-Liu Tree Algorithm

• Let Î(Y_i; Y_j) be the mutual information computed using the "empirical" distribution p̂_X(X) = p̂_X(Y_1, Y_2, . . . , Y_d).
  The best tree graph representing the data can be found by:

$$G = \arg\max_{\text{tree } G} \; \sum_{\text{edges}(i,j)} \hat{I}(Y_i; Y_j) \tag{3}$$

• Intuition: add edges which have "high" correlation.

• We will prove this in class!

36
Practical Considerations

• Exhaustive search over all trees is not possible; use an O(d log d) maximum-spanning-tree algorithm such as Kruskal's or Prim's.
• We need to compute O(d^2) mutual informations, which is the more costly part.

37
Practical Considerations

• The tree

$$G = \arg\max_{\text{tree } G} \; \sum_{\text{edges}(i,j)} \hat{I}(Y_i; Y_j) \tag{4}$$

  is a solution to the problem

$$\arg\min_{q(X)} \; \mathbb{E}_{\hat{p}_X}\!\left[\log \frac{1}{q(X)}\right] \;\approx\; \arg\min_{q(X)} \; \mathbb{E}_{p_X}\!\left[\log \frac{1}{q(X)}\right]$$

  and thus "approximates" p̂_X(X).

• We really want to solve the problem:

$$\arg\max_{\text{tree } G} \; \sum_{\text{edges}(i,j)} I(Y_i; Y_j)$$

• Using N samples we can have better estimators for I(Y_i; Y_j) than the empirical plug-in estimator Î(Y_i; Y_j).
• Information theory helps us get better estimators! (One example is sketched below.)

38
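The slides do not name a specific improved estimator; as one concrete example (my choice, not the lecture's), here is a sketch of the Miller-Madow bias-corrected entropy and the MI estimate built from it:

```python
# Sketch of one bias-reduced MI estimator (Miller-Madow correction), used here only
# to illustrate "better than plug-in"; other estimators could be substituted.
import numpy as np
from collections import Counter

def entropy_miller_madow(samples):
    """Plug-in entropy in bits plus the Miller-Madow correction (m-1)/(2N) nats."""
    n = len(samples)
    counts = np.array(list(Counter(samples).values()), dtype=float)
    p = counts / n
    h_plugin = -np.sum(p * np.log2(p))
    m = len(counts)                      # number of observed symbols
    return h_plugin + (m - 1) / (2 * n * np.log(2))

def mi_miller_madow(a, b):
    """I(A;B) = H(A) + H(B) - H(A,B), each entropy bias-corrected."""
    return (entropy_miller_madow(a) + entropy_miller_madow(b)
            - entropy_miller_madow(list(zip(a, b))))

# Toy check on independent variables, where the raw plug-in MI is biased upward.
rng = np.random.default_rng(1)
a = rng.integers(0, 8, 200)
b = rng.integers(0, 8, 200)
print(f"Miller-Madow MI estimate: {mi_miller_madow(a, b):.4f} bits")
```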
Practical Considerations

• When we solve the optimization problem

$$G = \arg\max_{\text{tree } G} \; \sum_{\text{edges}(i,j)} \hat{I}(Y_i; Y_j) \tag{5}$$

  we are not penalizing model complexity.

• Practically this is important. For example, in compression we also need space to store the distributions p̂(Y_i | Y_j) themselves (along with the arithmetic coding).

• The BIC criterion (Bayesian Information Criterion) alters the optimization by adding a penalty for model complexity:

$$G = \arg\max_{\text{tree } G} \; \sum_{\text{edges}(i,j)} \left( \hat{I}(Y_i; Y_j) - \frac{1}{2} \log N \, |\mathcal{Y}_i| |\mathcal{Y}_j| \right)$$

43
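A small sketch of how such a penalty plugs into the earlier Chow-Liu sketch: each edge weight becomes MI minus a complexity charge. The exact scaling used here (log2(N)/(2N) per table entry) is an assumption for illustration, not necessarily the lecture's normalization.

```python
# Sketch only: BIC-style penalized edge score for the Kruskal pass in chow_liu_tree
# above (chow_liu_tree is the hypothetical helper from the earlier sketch).
import numpy as np

def bic_edge_score(mi_bits, n_samples, k_i, k_j):
    """Empirical MI minus a per-sample complexity charge for the k_i*k_j CPT entries."""
    penalty = (np.log2(n_samples) / (2 * n_samples)) * k_i * k_j
    return mi_bits - penalty

# With penalized scores, weakly correlated pairs can get a negative score and be
# dropped, yielding a forest rather than a full spanning tree.
print(bic_edge_score(mi_bits=0.02, n_samples=500, k_i=4, k_j=4))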
General Bayesian Networks

44
Chow-Liu Algorithm for Bayesian Networks

• Let Î(Y_i; Y_j) be the mutual information computed using the "empirical" distributions.

For general Bayesian networks:

$$G = \arg\max_{G} \; \sum_{\text{edges}(i,j)} \hat{I}(Y_i; Y_j) \tag{6}$$

• The Chow-Liu algorithm for Bayesian networks is an approximation based on the intuition from the tree-based algorithm.
• Exact solutions are no longer possible; apply heuristic greedy schemes.

45
Learning Distributions — Language model

Figure 14: Slides borrowed from CS224n lecture, Jan 22

46
Learning Distributions — Language model

Figure 15: Slides borrowed from CS224n lecture, Jan 22


47
Learning Distributions — Language model

Figure 16: Slides borrowed from CS224n lecture, Jan 22

48
Learning Distributions — Language model

Figure 17: Slides borrowed from CS224n lecture, Jan 22


49
Learning Distributions — Language model

Figure 18: Slides borrowed from CS224n lecture, Jan 22


50
Learning Distributions — Language model

Figure 19: Slides borrowed from CS224n lecture, Jan 22

51


Learning Distributions — Language model

Figure 20: Slides borrowed from CS224n lecture, Jan 22


52
Learning Distributions — Language model

Figure 21: Slides borrowed from CS224n lecture, Jan 22


53
Learning Distributions — Language model

Figure 22: Slides borrowed from CS224n lecture, Jan 22


54
Learning Distributions — Language model

Figure 23: Slides borrowed from CS224n lecture, Jan 22

55
Learning Distributions — Language model

Figure 24: DeepZip Language compressor based on Language models

56
ML for Lossy Compression
Representation learning for Compression

• More recently, ML is being used for lossy compression of data.

• Why use ML?
  1. Unclear, high-dimensional data models, e.g. natural images.
  2. Unclear loss functions for lossy compression, e.g. image compression → "human perception loss".

57
ML for Compression

Figure 25: WaveOne image compression using neural networks

58

ML for Compression

Figure 26: WaveOne image compression using neural networks

59
ICLR Compression Contest

Figure 27: More details: https://www.compression.cc

60
ML for Compression

• Lots of very different techniques are used. The core idea (sketched in code below) is:
  1. Learn a "smooth" representation of the image
  2. Quantize the representation
  3. Entropy-code the quantized representation
• The "smoothness" of the representation is the key to good lossy compression.
• Different techniques are used for compression: autoencoders, VAEs, GANs.

61
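A toy sketch of the three-step recipe (illustrative assumptions throughout: a PCA projection stands in for a learned neural encoder, uniform quantization, and the entropy of the quantized symbols as a proxy for the rate after entropy coding):

```python
# Minimal sketch: learn a representation, quantize it, estimate the coding rate.
import numpy as np

rng = np.random.default_rng(0)
# Fake "image patches": 64-dim data with low-dimensional structure.
latent = rng.normal(size=(1000, 8))
mixing = rng.normal(size=(8, 64))
patches = latent @ mixing + 0.05 * rng.normal(size=(1000, 64))

# 1. Learn a representation (top-8 PCA directions instead of a trained encoder).
mean = patches.mean(axis=0)
_, _, vt = np.linalg.svd(patches - mean, full_matrices=False)
encoder = vt[:8].T                       # 64 x 8 projection
codes = (patches - mean) @ encoder       # continuous representation

# 2. Quantize the representation with a uniform step size.
step = 0.5
symbols = np.round(codes / step).astype(int)

# 3. Entropy of the quantized symbols approximates the rate after entropy coding.
values, counts = np.unique(symbols, return_counts=True)
p = counts / counts.sum()
rate_bits = -np.sum(p * np.log2(p)) * symbols.shape[1]   # bits per patch (approx.)

recon = (symbols * step) @ encoder.T + mean
mse = np.mean((patches - recon) ** 2)
print(f"approx. rate: {rate_bits:.1f} bits/patch, distortion (MSE): {mse:.4f}")
```

Varying the step size traces out a rate-distortion trade-off, which is the knob the learned codecs discussed next also expose.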
ML for Compression

Some notable papers:

• Toderici, George, et al. ”Full resolution image compression with


recurrent neural networks.” Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition. 2017.

Figure 28: Compression over multiple iterations

62
ML for Compression

Some notable papers:

• Rippel et al., "Real-Time Adaptive Image Compression." Proceedings of ICML 2017.

Figure 29: Using a GAN based loss

63
ML for Channel Coding

Figure 30: Using an RNN for channel coding

64
ML for Channel Coding

Figure 31: Using an RNN for channel coding

65
ML for Channel Coding

Figure 32: Using an RNN for channel coding

66
ML for Joint Source-Channel Coding

Figure 33: NECST: Neural Joint Source-Channel Code, Choi et al., arXiv

67
Conclusion?

ML/Statistics & Information theory are two sides of the same coin!

Figure 34: ML and IT

68
Thank You!

68
