Machine Learning
Kedar Tatwawadi
EE376a Course
Table of contents
1. Introduction
2. Unsupervised Learning
Introduction
ML and IT
ML/Statistics & Information theory are two sides of the same coin!
Figure 1: ML and IT
A (very) short intro to ML
Figure 2: ML Zoo
Figure 3: ML Zoo
Figure 4: ML Zoo
Supervised Learning
A (very) short intro to ML
Figure 5: ML Zoo
Figure 6: ML Zoo
Unsupervised Learning
Given data: $X_1, X_2, X_3, \ldots, X_N$
"Learn" something useful about $X$:
1. Clustering
2. Data Representation
3. Distribution of the data
Clustering
Figure 7: Clustering
Data Representation
Learning Data Distribution
Learning the distribution
1. Sampling
2. Prediction
3. De-noising
4. Compression
Sampling
Prediction
Denoising
Learning the distribution
• $X_i \in \mathcal{X}$
• Potentially $|\mathcal{X}|$ can be very large

How do we learn $p_X(X)$?
$$p_X = \arg\min_{q(X)} \; \mathbb{E}_{p_X}\left[\log \frac{1}{q(X)}\right] \tag{1}$$
Learning the distribution
$$\mathbb{E}_{p_X}\left[\log \frac{1}{q(X)}\right]
= \sum_{x\in\mathcal{X}} p(x)\log\frac{1}{q(x)}
= \sum_{x\in\mathcal{X}} p(x)\log\left(\frac{1}{p(x)}\cdot\frac{p(x)}{q(x)}\right)
= \sum_{x\in\mathcal{X}} p(x)\log\frac{1}{p(x)} + \sum_{x\in\mathcal{X}} p(x)\log\frac{p(x)}{q(x)}$$
The first term is the entropy $H(p_X)$, which does not depend on $q$; the second is the KL divergence $D(p_X \| q) \geq 0$, with equality iff $q = p_X$. Hence the minimizer of the cross-entropy is $p_X$ itself:
$$p_X = \arg\min_{q(X)} \; \mathbb{E}_{p_X}\left[\log \frac{1}{q(X)}\right]$$
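As a quick sanity check, here is a minimal numpy sketch (toy distributions chosen purely for illustration) verifying the decomposition $\mathbb{E}_p[\log \tfrac{1}{q(X)}] = H(p) + D(p\|q)$:

```python
import numpy as np

# Numerical check on a small alphabet that cross-entropy = entropy + KL divergence,
# so the cross-entropy is minimized exactly at q = p.
p = np.array([0.5, 0.3, 0.2])   # "true" distribution p_X (hypothetical)
q = np.array([0.4, 0.4, 0.2])   # candidate model q

cross_entropy = np.sum(p * np.log2(1.0 / q))
entropy       = np.sum(p * np.log2(1.0 / p))
kl_divergence = np.sum(p * np.log2(p / q))

assert np.isclose(cross_entropy, entropy + kl_divergence)
print(cross_entropy, entropy, kl_divergence)
```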
Learning the distribution
$$p_X = \arg\min_{q(X)} \; \mathbb{E}_{p_X}\left[\log \frac{1}{q(X)}\right]$$
In practice we only have samples, so we approximate the expectation by an empirical average:
$$\arg\min_{q(X)} \; \mathbb{E}_{p_X}\left[\log \frac{1}{q(X)}\right]
\;\approx\; \arg\min_{q(X)} \; \frac{1}{N}\sum_{i=1}^{N} \log \frac{1}{q(X^{(i)})}$$
Learning the distribution
$$\arg\min_{q(X)} \; \frac{1}{N}\sum_{i=1}^{N} \log \frac{1}{q(X^{(i)})}
= \arg\min_{q(X)} \; \frac{1}{N}\log \frac{1}{q(X^{(1)})\,q(X^{(2)})\cdots q(X^{(N)})}
= \arg\min_{q(X)} \; \sum_{x\in\mathcal{X}} \frac{n_x}{N}\log \frac{1}{q(x)}
= \arg\min_{q(X)} \; \mathbb{E}_{\hat p_X}\left[\log \frac{1}{q(x)}\right]$$
where $n_x$ is the number of occurrences of $x$ among $X^{(1)},\ldots,X^{(N)}$.
Learning the distribution
$$\arg\min_{q(X)} \; \frac{1}{N}\sum_{i=1}^{N} \log \frac{1}{q(X^{(i)})}
= \arg\min_{q(X)} \; \mathbb{E}_{\hat p_X}\left[\log \frac{1}{q(x)}\right]$$
and the minimizer is the empirical distribution
$$\hat p_X(x) = \frac{n_x}{N}$$

• When $X = (Y_1, Y_2, \ldots, Y_d)$, $|\mathcal{X}| = k^d$
• For large $|\mathcal{X}|$, $\hat p_X$ is not useful!
• We need more data, or ... some regularization.
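A minimal sketch (hypothetical toy data) of the empirical estimate $\hat p_X(x) = n_x/N$, and of why it breaks down once $X$ is a $d$-tuple over a size-$k$ alphabet:

```python
from collections import Counter

# Empirical (maximum-likelihood) estimate of p_X from samples: hat_p(x) = n_x / N.
samples = ["a", "b", "a", "c", "a", "b"]          # hypothetical data
counts = Counter(samples)
N = len(samples)
hat_p = {x: n / N for x, n in counts.items()}
print(hat_p)                                      # {'a': 0.5, 'b': 0.333..., 'c': 0.166...}

# The curse of dimensionality: for X = (Y_1, ..., Y_d) with each Y_i in a
# size-k alphabet, the table has k**d entries -- far more than any dataset.
k, d = 4, 30
print(f"table size k^d = {k**d:.2e}")             # ~1e18 entries for k=4, d=30
```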
Data Example
Regularization
$$\arg\min_{q(X)} \; \mathbb{E}_{p_X}\left[\log \frac{1}{q(x)}\right]
\;\approx\; \arg\min_{q(X)} \; \mathbb{E}_{\hat p_X}\left[\log \frac{1}{q(x)}\right]
\;\approx\; \arg\min_{q(X)\in\mathcal{Q}} \; \mathbb{E}_{\hat p_X}\left[\log \frac{1}{q(x)}\right]$$
• By the chain rule, $q(X) = q(Y_1, Y_2, \ldots, Y_d) = q_1(Y_1)\,q_2(Y_2|Y_1)\,q_3(Y_3|Y_2,Y_1)\cdots q_d(Y_d|Y_1,\ldots,Y_{d-1})$
• $\mathcal{Q}$ restricts the set of allowed distributions,
  e.g.: $q(Y_1, Y_2, \ldots, Y_d) = q_1(Y_1)\,q_2(Y_2)\,q_3(Y_3)\cdots q_d(Y_d)$
$\mathcal{Q}_I$: independent distributions
$$\arg\min_{q(X)\in\mathcal{Q}_I} \; \mathbb{E}_{\hat p_X}\left[\log \frac{1}{q(x)}\right] = \left(\hat q_1(y_1), \ldots, \hat q_d(y_d)\right)$$
i.e. the minimizer is the product of the empirical marginals of $Y_1, \ldots, Y_d$.
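A minimal sketch (hypothetical toy data) of fitting the product-of-marginals model in $\mathcal{Q}_I$: each coordinate simply gets its own empirical marginal.

```python
import numpy as np
from collections import Counter

# Fit q(Y_1,...,Y_d) = q_1(Y_1) * ... * q_d(Y_d) by taking the empirical
# marginal of each coordinate (this is the minimizer over Q_I).
data = np.array([
    [0, 1, 1],
    [0, 0, 1],
    [1, 1, 1],
    [0, 1, 0],
])                                     # hypothetical samples, shape (N, d)
N, d = data.shape

marginals = []
for j in range(d):
    counts = Counter(data[:, j])
    marginals.append({y: n / N for y, n in counts.items()})

def q(x):
    """Probability of a d-tuple x under the independent model."""
    prob = 1.0
    for j, y in enumerate(x):
        prob *= marginals[j].get(y, 0.0)
    return prob

print(q((0, 1, 1)))   # product of the per-coordinate marginal probabilities
```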
Tabular Example
Tree-based distributions
• We restrict distributions to $\mathcal{T}$,
  e.g.: $\mathcal{T} = \{\, q \mid q(Y_1, Y_2, \ldots, Y_d) = q_1(Y_1)\, q_2(Y_2|Y_{i_2})\, q_3(Y_3|Y_{i_3}) \cdots q_d(Y_d|Y_{i_d}) \,\}$
• For every $Y_j$, we allow dependence on exactly one earlier variable $Y_{i_j}$, with $i_j < j$
• This exactly corresponds to a "tree distribution"
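For concreteness, a minimal sketch (hypothetical parents and conditional tables) of evaluating such a tree-factored distribution:

```python
import numpy as np

# Hypothetical tree over d = 3 binary variables:
# Y1 is the root; Y2 and Y3 both depend on Y1 (parents chosen for illustration).
parent = [None, 0, 0]                 # parent[j] = index i_j of Y_j's parent
q_root = np.array([0.6, 0.4])         # q_1(y_1)
# cond[j][y_parent][y_j] = q_j(y_j | y_{i_j}); each row sums to 1
cond = {
    1: np.array([[0.7, 0.3], [0.2, 0.8]]),
    2: np.array([[0.9, 0.1], [0.5, 0.5]]),
}

def tree_prob(y):
    """q(y) = q_1(y_1) * prod_j q_j(y_j | y_parent(j))."""
    prob = q_root[y[0]]
    for j in range(1, len(y)):
        prob *= cond[j][y[parent[j]], y[j]]
    return prob

print(tree_prob((0, 1, 0)))           # 0.6 * 0.3 * 0.9 = 0.162
```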
Tree-based distributions — Examples
Chow-Liu Tree Algorithm
Data Example
Chow-Liu Tree Algorithm
Practical Considerations
• Exhaustive search over all trees is not possible; use an $O(d \log d)$ algorithm such as Kruskal's or Prim's
• Need to compute $O(d^2)$ pairwise mutual informations, which is the more costly part
Practical Considerations
• The tree $G$ is chosen to maximize the total empirical mutual information over its edges:
$$G = \arg\max_{G\,:\,\text{tree}} \; \sum_{\text{edges}(i,j)} \hat I(Y_i; Y_j) \tag{4}$$
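A minimal sketch of this procedure on hypothetical toy data: estimate all $O(d^2)$ pairwise mutual informations, then take a maximum-weight spanning tree (here via Kruskal's algorithm with union-find):

```python
import numpy as np
from collections import Counter
from itertools import combinations

def empirical_mi(x, y):
    """Empirical mutual information (in bits) between two discrete columns."""
    n = len(x)
    pxy, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    mi = 0.0
    for (a, b), c in pxy.items():
        p_ab = c / n
        mi += p_ab * np.log2(p_ab / ((px[a] / n) * (py[b] / n)))
    return mi

def chow_liu_tree(data):
    """Edges of the tree maximizing the sum of pairwise empirical MIs."""
    d = data.shape[1]
    edges = [(empirical_mi(data[:, i], data[:, j]), i, j)
             for i, j in combinations(range(d), 2)]      # O(d^2) MI computations
    edges.sort(reverse=True)                             # heaviest edges first (Kruskal)
    parent = list(range(d))
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                                      # adding this edge keeps it a tree
            parent[ri] = rj
            tree.append((i, j, w))
    return tree

# Hypothetical data: Y2 is a noisy copy of Y1, Y3 is independent.
rng = np.random.default_rng(0)
y1 = rng.integers(0, 2, 1000)
y2 = (y1 + (rng.random(1000) < 0.1)) % 2
y3 = rng.integers(0, 2, 1000)
print(chow_liu_tree(np.column_stack([y1, y2, y3])))       # edge (0, 1) carries the largest MI
```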
General Bayesian Networks
Chow-Liu Algorithm for Bayesian Networks
Learning Distributions — Language model
ML for Lossy Compression
Representation learning for Compression
ML for Compression
• Lots of very different types of techniques are used. The core idea is (see the sketch after this list):
  1. Learn a "smooth" representation for the image
  2. Quantize the representation
  3. Entropy-code the quantized representation
• The "smoothness" of the representation is the key to good lossy compression.
• Different techniques used for compression: Autoencoders, VAEs, GANs
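A toy numpy sketch of that three-step pipeline. PCA stands in for the learned representation here purely for illustration (an autoencoder, VAE, or GAN would play that role in practice); the data, dimensions, and quantization step size are all made up:

```python
import numpy as np

# Toy represent -> quantize -> entropy-code pipeline on synthetic data.
rng = np.random.default_rng(0)
images = rng.normal(size=(1000, 64))            # hypothetical flattened 8x8 "images"

# 1. Learn a low-dimensional representation (top-k principal components).
k = 8
mean = images.mean(axis=0)
_, _, Vt = np.linalg.svd(images - mean, full_matrices=False)
encode = lambda x: (x - mean) @ Vt[:k].T        # R^64 -> R^k
decode = lambda z: z @ Vt[:k] + mean            # R^k  -> R^64

# 2. Quantize the representation with a uniform scalar quantizer.
step = 0.5
quantize = lambda z: np.round(z / step).astype(int)

# 3. Estimate the entropy-coding cost (bits) from the empirical symbol distribution.
symbols = quantize(encode(images)).ravel()
_, counts = np.unique(symbols, return_counts=True)
p = counts / counts.sum()
bits_per_image = -np.sum(p * np.log2(p)) * k
recon = decode(quantize(encode(images)) * step)
print(f"~{bits_per_image:.1f} bits/image, MSE {np.mean((recon - images)**2):.3f}")
```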
ML for Channel Coding
ML for Joint Source-Channel Coding
Figure 33: NECST: Neural Joint Source-Channel Coding, Choi et al., arXiv
Conclusion?
ML/Statistics & Information theory are two sides of the same coin!
Thank You!