Introduction To Machine Learning
ETH Zurich, FS18
Janik Schuettler, Marcel Graetz
1 Overview

I Supervised Learning

2 Regression and Gradient Descent

We try to fit a function to training data (learning) to make predictions. Our goal is to learn a real-valued mapping f : R^d → R.

The general model is

    f(x) = ∑_{i=1}^d w_i x_i + b = w^T x + b = w̃^T x̃

with x̃ = (x_1, ..., x_d, 1) and w̃ = (w_1, ..., w_d, b).

Model error  We measure the goodness of a model (i.e. of the fit) using a p-loss function l_p(w, x, y),

    R̂(w) = ∑_{i=1}^n l_p(w, x_i, y_i) = ∑_{i=1}^n |y_i − w^T x_i|^p,  p ≥ 1.

For p = 2, we get the least squares measure R̂(w) = ∑_{i=1}^n (y_i − w^T x_i)^2.

Optimization problem  We want to find the optimal weight vector w,

    w* = argmin_w R̂(w) = argmin_w ∑_{i=1}^n (y_i − w^T x_i)^2.

2.1 Closed-form Solution: Linear Least Squares

The closed-form solution is w* = (X^T X)^{−1} X^T y. The complexity of solving in closed form is O(nd^2 + d^3).

2.2 Optimization: Gradient Descent

Theorem 2.1 (Gradient descent)  Let f be convex with global minimizer w*. Assume ||w_1 − w*|| ≤ D and ||∇f(w)|| ≤ G for all w ∈ B_D(w*). If we choose η_t = D/(G√t), then

    f(w_T) − f(w*) ≤ GD/√T.

Algorithm 1 Gradient Descent
1: w_0 ∈ R^d  ▷ start with arbitrary w_0
2: for t = 1, 2, ..., T do
3:   ∇_w R̂(w_t) = −2 ∑_{i=1}^n (y_i − w_t^T x_i) x_i
4:   w_{t+1} = w_t − η_t ∇R̂(w_t)  ▷ η_t is the learning rate
5: return w_T

The least squares objective is convex, so gradient descent finds an optimal solution, with better complexity: one iteration of gradient descent costs O(nd).

Problem: for a low step size convergence is very slow, but for a high step size gradient descent can diverge.

Adaptive step size  Examples of how to update the step size adaptively.

GD convergence  Stop if either
• the gradient is small enough, or
• the difference in objective between subsequent iterations is small enough.

2.3 Non-linear Regression via Linear Regression

Non-linear basis functions are used to fit non-linear data via linear regression,

    f(x) = ∑_{i=1}^d w_i φ_i(x).

In 2D, φ could be φ(x) = (1, x_1, x_2, x_1^2, x_2^2, x_1 x_2, ...).

2.4 Model selection

We would like to choose the model that optimizes the trade-off between model complexity and training error, i.e. between under- and overfitting the data. Mathematically, we try to minimize the true risk

    R(w) = E_{x,y}[(y − w^T x)^2] = ∫ P(x, y)(y − w^T x)^2 dx dy,  w* = argmin_w R(w).

However, we can only compute the empirical risk

    R̂_D(w) = (1/|D|) ∑_{(x,y)∈D} (y − w^T x)^2,  ŵ = argmin_w R̂_train(w).

iid assumption  The data set is assumed to be generated independently and identically distributed (iid), (x_i, y_i) ∼ P(X, Y).

Theorem 2.2 (Law of large numbers (LLN))  R̂_D(w) → R(w) for any fixed w almost surely as |D| → ∞.

Convergence of learning  For learning via empirical risk minimization to be successful, we need uniform convergence sup_w |R(w) − R̂_D(w)| → 0 for |D| → ∞, which is not implied by the LLN but depends on the model class.

Splitting the data set  Do not test a model on the training data, because E[R̂_train(ŵ)] < E[R(ŵ)]. Best practice is to split the data set into a training set D_train and a test set D_test. Optimize w on D_train,

    ŵ = argmin_w R̂_train(w),

and evaluate it on D_test,

    R̂_test(ŵ) = (1/|D_test|) ∑_{(x,y)∈D_test} (y − ŵ^T x)^2.

Then E_{D_train, D_test}[R̂_{D_test}(ŵ_{D_train})] = E_{D_test}[R(ŵ_{D_train})].

2.5 Cross validation

The test error R̂_test is itself random, and its variance usually increases for more complex models. The idea is to use and average over multiple test sets to reduce the variance of the estimate. Note that this only works for i.i.d. data.
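A minimal numpy sketch of this protocol for the least squares model of Section 2.1; the 5-fold split and the use of lstsq as closed-form solver are our choices, not prescribed by the notes.

import numpy as np

def fit_least_squares(X, y):
    # closed-form solution w* = (X^T X)^{-1} X^T y, via lstsq for numerical stability
    return np.linalg.lstsq(X, y, rcond=None)[0]

def cross_val_risk(X, y, k=5, seed=0):
    # estimate R(w_hat) by averaging the test error over k disjoint test folds
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w = fit_least_squares(X[train], y[train])
        errors.append(np.mean((y[test] - X[test] @ w) ** 2))
    return np.mean(errors)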
Ridge regression adds a regularization term to the least squares problem,

    w* = argmin_w ∑_{i=1}^n (y_i − w^T x_i)^2 + λ||w||_2^2

for some λ ∈ R. Using the homogeneous representation, the constant term is not regularized. λ balances the two terms:
• λ → ∞: the optimization problem tries to minimize ||w|| only,
• λ → 0: optimization problem with no regularization.

The closed-form solution is w* = (X^T X + λI)^{−1} X^T y. The matrix X^T X + λI is always invertible and better conditioned.

Renormalizing data ensures that each feature has zero mean and unit variance, because scaling does matter for regularization: x̃_{i,j} = (x_{i,j} − μ̂_j)/σ̂_j, where

    μ̂_j = (1/n) ∑_{i=1}^n x_{i,j},  σ̂_j^2 = (1/n) ∑_{i=1}^n (x_{i,j} − μ̂_j)^2.

Algorithm 3 Gradient Descent with regularization

3.1.1 Perceptron and Stochastic Gradient Descent

Perceptron optimization problem

    w* = argmin_w ∑_{i=1}^n l_P(w, x_i, y_i) = argmin_w ∑_{i=1}^n max(0, −y_i w^T x_i).

Perceptron gradient

    ∇_w R̂_P(w) = − ∑_{i: y_i ≠ sign(w^T x_i)} y_i x_i

Stochastic Gradient Descent (SGD) picks data points uniformly at random to compute an unbiased estimate of the gradient (with m points: mini-batch SGD),

    ∇R̂(w) = (1/n) ∑_{i=1}^n ∇l(w; x_i, y_i) = E_{I ∼ Unif{1,...,n}}[∇l(w; x_I, y_I)].
Algorithm 5 Perceptron with Stochastic Gradient Descent
1: w_0 ∈ R^d  ▷ start with arbitrary w_0
2: for t = 1, 2, ... do
3:   pick i_t ∼ Unif{1, ..., n}  ▷ one random data point
4:   if y_{i_t} ≠ sign(w_t^T x_{i_t}) then  ▷ Perceptron gradient
5:     w_{t+1} = w_t + η_t y_{i_t} x_{i_t}
6:   else
7:     w_{t+1} = w_t

Robbins-Monro conditions  Keep the learning rate η_t such that the algorithm will not terminate before converging, i.e. ∑_t η_t = ∞, but with bounded variance, i.e. ∑_t η_t^2 < ∞. These conditions are sufficient for convergence. For example η_t = 1/t or η_t = min(c_1, c_2/t).

Remark 3.1  This is the perceptron algorithm. (Says problem set 3.)

Adaptive learning rates are used by algorithms such as AdaGrad, RMSProp, Adam.

Theorem 3.2  If the data is linearly separable, the Perceptron will obtain a linear separator.

SGD convergence criteria  Stop if either
• a fixed number of iterations have passed,
• GD conditions would suggest convergence (occasionally, say every n-th iteration, compute the full objective value/gradient magnitude),
• the error on a separate validation data set is small enough (direct monitoring). This is a special form of regularization called early stopping.

3.1.2 Support Vector Machines (SVM)

The hinge loss encourages not only correct classification, but correct classification with maximal margin to the decision boundary. Can this lead to non-optimal decisions in case of e.g. separability?

SVM optimization problem

    w* = argmin_w ∑_{i=1}^n max(0, 1 − y_i w^T x_i) + λ||w||_2^2

Theorem 3.3  SVM finds the solution with maximal margin to the decision boundary.

Algorithm 6 Greedy forward selection
1: S = ∅, E_0 = ∞
2: for t = 1 : d do
3:   s_t = argmin_{j ∈ V\S} L̂(S ∪ {j})  ▷ find best element to add
4:   E_t = L̂(S ∪ {s_t})  ▷ compute error
5:   if E_t > E_{t−1} then
6:     break
7:   else
8:     S ← S ∪ {s_t}
9: return S

Algorithm 7 Greedy backward selection
1: S = V, E_{d+1} = ∞
2: for t = d : −1 : 1 do
3:   s_t = argmin_{j ∈ S} L̂(S \ {j})  ▷ find best element to remove
4:   E_t = L̂(S \ {s_t})  ▷ compute error
5:   if E_t > E_{t+1} then
6:     break
7:   else
8:     S ← S \ {s_t}
9: return S

Greedy selection can be suboptimal. As an extreme counterexample, consider a setting in which all features are uninformative on their own, but informative altogether.

3.2.2 Linear models

We want to solve the learning and feature selection problems simultaneously via a single optimization.

Sparse regression  The key idea is to replace feature selection with setting unimportant features to 0, i.e. working with sparse feature representations, encoded via the pseudo-norm ||w||_0 = number of non-zero entries in w. The 0-norm penalty encourages coefficients to be exactly 0 and therefore performs automatic feature selection; however, the resulting optimization problem is hard to solve. We instead use the 1-norm ||·||_1, which keeps the optimization problem convex.

L1-regularized regression problem (Lasso)

    w* = argmin_w ∑_{i=1}^n (y_i − w^T x_i)^2 + λ||w||_1
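The notes leave the solver open; one standard choice (our addition, not from the notes) is proximal gradient descent (ISTA), where the 1-norm enters through soft-thresholding and produces exactly-zero coefficients.

import numpy as np

def soft_threshold(v, t):
    # proximal operator of t*||.||_1: shrinks toward 0, sets small entries exactly to 0
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    # minimize sum_i (y_i - w^T x_i)^2 + lam * ||w||_1 by proximal gradient descent
    eta = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)  # step size from the Lipschitz constant
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = -2 * X.T @ (y - X @ w)            # gradient of the squared-loss part
        w = soft_threshold(w - eta * grad, eta * lam)
    return w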
Definition 3.5 (Kernel)  A kernel is a function k : X × X → R satisfying symmetry and positive semi-definiteness, i.e. for any n and any set S = {x_1, ..., x_n} ⊆ X, the kernel (Gram) matrix

    K = (k(x_i, x_j))_{i,j=1,...,n}

is positive semi-definite.

Theorem 3.6 (Kernel Composition)  Let k_1, k_2 : X × X → R be kernels on the data space X. Then the following are valid kernels:
• k(x, x′) = k_1(x, x′) + k_2(x, x′)
• k(x, x′) = k_1(x, x′) k_2(x, x′)
• k(x, x′) = c k_1(x, x′) for c > 0
• k(x, x′) = f(k_1(x, x′)), where f is a polynomial with positive coefficients or the exponential function.

Theorem 3.7 (Mercer's Theorem)  Let X be a compact subset of R^n and k : X × X → R a kernel function. Then one can expand k in a uniformly convergent series of bounded functions φ_i such that

    k(x, x′) = ∑_{i=1}^∞ λ_i φ_i(x) φ_i(x′).

Kernels in R^d
• Linear kernel: k(x, x′) = x^T x′
• Polynomial kernel: k(x, x′) = (x^T x′ + 1)^m implicitly represents all monomials of up to degree m. In d dimensions, there are (d+m choose m) such monomials.
• Gaussian/RBF kernel: k(x, x′) = exp(−||x − x′||_2^2 / h^2) maps to an infinite-dimensional space.
• Laplacian kernel: k(x, x′) = exp(−||x − x′||_1 / h)
• ? kernel: k(x, x′) = x^T M x′ for a symmetric positive definite matrix M.
• ANOVA kernel: k(x, x′) = ∑_{i=1}^d k_i(x^{(i)}, x′^{(i)}) with k_i(x, x′) = exp(−(x − x′)^2 / h_i^2).

3.3.2 Kernelized Perceptron

Kernelized perceptron optimization problem

    argmin_α ∑_{i=1}^n max{0, −y_i α^T k_i} = min_{α_{1:n}} ∑_{i=1}^n max{0, − ∑_{j=1}^n α_j y_i y_j k(x_i, x_j)}

for k_i = (y_1 k(x_i, x_1), ..., y_n k(x_i, x_n))^T.

Algorithm 8 Kernelized Perceptron
1: α_1 = ... = α_n = 0
2: for t = 1, 2, ... do
3:   pick i_t ∼ Unif{1, ..., n}  ▷ one random data point
4:   ŷ = sign(∑_{j=1}^n α_j y_j k(x_j, x_{i_t}))  ▷ predict
5:   if y_{i_t} ≠ ŷ then  ▷ Perceptron gradient
6:     α_{i_t} ← α_{i_t} + η_t
7: Prediction for a new point x: ŷ = sign(∑_{j=1}^n α_j y_j k(x_j, x))

The kernelized perceptron may have improved performance due to optimized weights, can capture "global trends" with suitable kernels, and depends on wrongly classified examples only, but training requires optimization.

3.3.3 k Nearest Neighbors (k-NN)

k-nearest neighbors  Label a point depending on its k nearest neighbors among all data points,

    y = sign( ∑_{i=1}^n y_i [x_i among k nearest neighbors of x] ).

Choose k using cross validation.

Comparison of Perceptron and k-NN  For k-NN, no training is necessary, but prediction depends on all data, which may render it inefficient.

Parametric vs. nonparametric models  Parametric models have a finite set of parameters (e.g. linear regression, linear perceptron); nonparametric models grow in complexity with the size of the data (e.g. kernelized perceptron, k-NN) and are thus potentially much more expressive, but also computationally expensive. Kernels provide a principled way of deriving nonparametric models from parametric ones.

3.3.4 Kernelized SVM

Kernelized SVM optimization problem

    argmin_α ∑_{i=1}^n max(0, 1 − y_i α^T k_i) + λ α^T D_y K D_y α

for k_i = (y_1 k(x_i, x_1), ..., y_n k(x_i, x_n))^T and D_y = diag(y_1, ..., y_n).

3.3.5 Kernelized Regression

Kernelized linear regression optimization problem

    argmin_α ||α^T K − y||_2^2 + λ α^T K α

with closed-form solution α* = (K + λI)^{−1} y and predictor f(x) = ∑_{i=1}^n α_i k(x_i, x).

3.4 Class Imbalance

                         True label
                     Positive   Negative
Predicted  Positive  TP         FP         ∑ = p_+
           Negative  FN         TN         ∑ = p_−
                     ∑ = n_+    ∑ = n_−

Metrics to measure goodness of fit

    Accuracy:  (TP + TN)/(TP + TN + FP + FN) = (TP + TN)/n
    Precision: TP/(TP + FP) = TP/p_+ ∈ [0, 1]
    Recall (TPR): TP/(TP + FN) = TP/n_+ ∈ [0, 1]
    False positive rate (FPR): FP/(TN + FP) = FP/n_− ∈ [0, 1]
    F1 score:  2TP/(2TP + FP + FN) = 2/((TP + FP)/TP + (TP + FN)/TP)

Accuracy is often not meaningful for imbalanced data sets, because we may prefer certain mistakes over others (trading false positives against false negatives), and minority class instances contribute little to the empirical risk, so they may be ignored during optimization.

Upsampling  Repeat data points from the minority class (possibly with small random perturbations) to obtain a balanced data set. This method makes use of all data, but is slower, and adding perturbations requires arbitrary choices.

Downsampling  Remove training examples from the majority class (e.g. uniformly at random) such that the resulting data set is balanced. This method is faster, because it reduces the training set size, but available data is wasted and information about the majority class is lost.

Cost-sensitive classification  Modify the Perceptron/SVM by using a cost-sensitive loss function l_CS(w; x, y) = c_y l(w; x, y) to take class imbalance into account:

    l_{CS−P}(w; x, y) = c_y max(0, −y w^T x)    (Perceptron)
    l_{CS−H}(w; x, y) = c_y max(0, 1 − y w^T x)  (SVM)

with parameters c_+, c_− > 0.
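A sketch of a single stochastic subgradient step on l_{CS−H}; the cost values c_{+1}, c_{−1} below are arbitrary placeholders, and the regularizer is omitted.

import numpy as np

def cs_hinge_sgd_step(w, x, y, eta, costs={+1: 1.0, -1: 5.0}):
    # l_CS-H(w; x, y) = c_y * max(0, 1 - y w^T x); take one subgradient step
    if 1 - y * (w @ x) > 0:
        w = w + eta * costs[y] * y * x  # mistakes on class y are penalized by c_y
    return w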
Theorem 3.8  Algorithm 1 dominates algorithm 2 in terms of ROC curves if and only if algorithm 1 dominates algorithm 2 in terms of precision-recall curves.

3.5 Multi-class Problems

One-vs-All  Train c binary classifiers, one per class. Classify using the classifier with the largest confidence, i.e. predict

    ŷ = argmax_{i ∈ {1,...,c}} w^{(i)T} x.

One-vs-All discussion  One-vs-All only works well if the classifiers produce confidence scores on the same scale. Individual binary classifiers see imbalanced data even if the whole data set is balanced. One class might not be linearly separable from all other classes.

One-vs-One  Train c(c−1)/2 binary classifiers, one for each pair of classes i, j. Apply a voting scheme: the class with the highest number of positive predictions wins.

One-vs-One discussion  One-vs-One does not rely on confidence scores, but is slower than One-vs-All.

Multi-class methods  Maintain c weight vectors w^{(1)}, ..., w^{(c)}, one for each class, and predict ŷ = argmax_i w^{(i)T} x. Given each data point (x, y), we want to achieve that

    w^{(y)T} x > max_{i≠y} w^{(i)T} x + 1.  (∗)

Multi-class hinge loss

    l_{MC−H}(w^{(1:c)}; x, y) = max(0, 1 + max_{j ∈ {1,...,c}\{y}} w^{(j)T} x − w^{(y)T} x)

    ∇_{w^{(j)}} l_{MC−H}(w^{(1:c)}; x, y) =
        0    if (∗) holds or j ∉ {y, ŷ}
        −x   if (∗) does not hold and j = y
        x    if (∗) does not hold and j = ŷ

Example confusion matrix:

                      True label
                      Cat  Dog  Elephant
Predicted  Cat        5    2    0
           Dog        3    7    0
           Elephant   1    0    6

4 Neural Networks

What are good features?

Neural networks optimization problem

    w* = argmin_{w,θ} ∑_{i=1}^n l(y_i; ∑_{j=1}^m w_j φ(x_i, θ_j))

Feature maps, activation functions  For example φ(x, θ) = ϕ(θ^T x). Activation functions:
• Sigmoid ϕ(z) = 1/(1 + exp(−z))
• Tanh ϕ(z) = tanh z
• ReLU ϕ(z) = max(0, z)

Artificial Neural Networks (ANN) are functions of the form

    f(x; w, θ) = ∑_{j=1}^m w_j ϕ(θ_j^T x).
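A direct numpy transcription of this definition, with the activations listed above; the shape conventions (Theta ∈ R^{m×d}, w ∈ R^m) are ours.

import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))
def relu(z):    return np.maximum(0.0, z)

def ann(x, w, Theta, phi=np.tanh):
    # one hidden layer: f(x) = sum_j w_j * phi(theta_j^T x)
    return w @ phi(Theta @ x)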
Algorithm 9 Forward Propagation
1: v^{(0)} = x  ▷ input layer
2: for each hidden layer l = 1 : L−1 do
3:   z^{(l)} = W^{(l)} v^{(l−1)}
4:   v^{(l)} = ϕ(z^{(l)})
5: f = W^{(L)} v^{(L−1)}
6: y = f  ▷ prediction for regression, or
7: y = sign(f)  ▷ prediction for classification, or
8: y = argmax_j f_j  ▷ prediction for multi-class classification

Theorem 4.1  Let σ be any continuous sigmoidal function. Then finite sums of the form

    G(x) = ∑_{j=1}^N α_j σ(y_j^T x + θ_j)

are dense in C(I_n). In other words, given any f ∈ C(I_n) and ε > 0, there is a sum G(x) of the above form for which

    |G(x) − f(x)| < ε  for all x ∈ I_n.

(Such sums are exactly one-hidden-layer networks with sigmoid activation, cf. the ANN definition above, up to the bias terms θ_j.)

4.1 Training: Momentum SGD, Backpropagation

Training  Given a data set D = {(x_1, y_1), ..., (x_n, y_n)}, we want to optimize the weights W = (W^{(1)}, ..., W^{(L)}) using any loss function l(W; y, x) (Perceptron loss, multi-class hinge loss, squared loss, etc.),

    W* = argmin_W ∑_{i=1}^n l(W; y_i, x_i).

When predicting multiple outputs at the same time, one usually defines the loss as the sum of per-output losses,

    l(W; y, x) = ∑_{i=1}^p l_i(W; y_i, x).

This optimization problem is not convex.

Algorithm 10 SGD for ANNs
1: initialize weights W
2: for t = 1, 2, ... do
3:   pick data point (x, y) ∈ D uniformly at random
4:   take a step in the negative gradient direction,
5:   W = W − η_t ∇_W l(W; y, x)

Computing the gradient  To compute ∇_W l(W; y, x), we use backpropagation, exploiting the chain rule and the weight-specific gradients ∇_{w_{i,j}} l(W; y, x).

Algorithm 11 Backpropagation
1: for the output layer do
2:   δ^{(L)} = Dl(f) = (l′_1(f_1), ..., l′_p(f_p))  ▷ compute "error"
3:   ∇_{W^{(L)}} l(W; y, x) = δ^{(L)} v^{(L−1)T}  ▷ compute gradient matrix
4: for each hidden layer l = L−1 : −1 : 1 do
5:   δ^{(l)} = ϕ′(z^{(l)}) ⊙ (W^{(l+1)T} δ^{(l+1)})  ▷ compute "error"
6:   ∇_{W^{(l)}} l(W; y, x) = δ^{(l)} v^{(l−1)T}  ▷ compute gradient

Derivatives of activation functions
• Sigmoid: ϕ′(z) = (1/(1 + e^{−z}))′ = e^z/(1 + e^z)^2 = (1 − ϕ(z)) ϕ(z). Properties: differentiable and non-zero everywhere, but ϕ′(z) ≈ 0 almost everywhere except for z ≈ 0.
• ReLU: ϕ′(z) = (max(0, z))′ = 1 if z > 0, 0 if z < 0. Properties: not differentiable at 0 (in practice just set the derivative to 0 there; it doesn't really matter), efficient, and > 0 on R_+.

Learning rate  Often initially chosen as a fixed (small) learning rate and decreased slowly after some iterations, e.g. η_t = min(0.1, 100/t). It is also possible to monitor the ratio of weight change (gradient) to weight magnitude: if the ratio is too small, increase the learning rate, otherwise decrease it.

Learning with momentum can help to escape local minima by moving not only in the direction of the gradient, but also in the direction of the last weight update,

    a = m · a + η_t ∇_W l(W; y, x),
    W = W − a,

where m denotes a friction parameter. This method can help to prevent oscillations.

Weight-space symmetries cause "degenerate" local minima, i.e. multiple local minima can be equivalent in terms of the input-output mapping.

4.2 Initialization and Termination

Initialization of weights matters, because the problem is non-convex. Random initialization usually works well, e.g. w_{i,j} ∼ N(0, 0.1) or w_{i,j} ∼ N(0, 1/√|Layer_{l+1}|). However, incorrect initialization can lead to bad results. One might want to repeat training multiple times to avoid getting stuck in a poor local optimum. Less deep architectures are more prone to get stuck in a local optimum.

Termination

4.3 Choosing parameters

In principle, one could use cross validation to compare models; however, training ANNs is usually expensive. Parameters to choose include the number of units and layers, the activation functions, the learning rate, the regularization method, the learning rate schedule, the weight initialization, and the number of convolution/pooling layers.

Type of activation function  Sigmoid and tanh are differentiable and were popular in the past. ReLUs are currently used extensively. They are not differentiable at 0 (not a problem), fast to compute, and their gradients do not vanish (important).

Number of hidden layers [1]  In most tasks, one hidden layer is sufficient. More generally:
• 0: only capable of representing linearly separable functions or decisions.
• 1: can approximate any function that contains a continuous mapping from one finite space to another.
• 2: can represent an arbitrary decision boundary to arbitrary accuracy with rational activation functions and can approximate any smooth mapping to any accuracy.
Where exactly is the difference between 1 and 2?

Number of hidden units [1]  The optimal size of the hidden layer is usually between the size of the input and the size of the output layer. True? Some rules of thumb are
• the number of hidden neurons should be between the size of the input layer and the size of the output layer,
• the number of hidden neurons should be 2/3 the size of the input layer, plus the size of the output layer,
• the number of hidden neurons should be less than twice the size of the input layer.
An upper bound for the number of hidden units is given by

    N_hidden ≤ N_sample / (α (N_input + N_output)),  2 ≤ α ≤ 10.

Where does this come from?
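As a worked example with arbitrary numbers: for N_sample = 10000, N_input = 100 and N_output = 10, the bound gives N_hidden ≤ 10000/(110α), i.e. roughly between 9 (α = 10) and 45 (α = 2) hidden units.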
4.4 Regularization

Neural networks are prone to overfitting due to their large number of parameters.

Early stopping doesn't let the neural net converge: monitor prediction performance on a validation set and stop training once the validation error starts to increase.

Regularization adds the usual regularization term to the optimization problem,

    W* = argmin_W ∑_{i=1}^n l(W; y_i, x_i) + λ||W||_2^2.

Dropout regularization  Randomly ignore hidden units during each iteration of SGD with probability p. After training, scale the weights down to compensate (halve them for p = 1/2).

4.5 Invariances

Predictions should be unchanged under some transformations of the data, e.g. translation, rotation, scale, pitch, speed, etc. Invariances can be learned from data: SIFT (scale-invariant feature transform), cepstrum (speech recognition).

5 Clustering: k-means

The unsupervised analog to classification.

Clustering  Given data points, group them into clusters such that similar points are in the same cluster and dissimilar points are in different clusters. Points are typically represented either in (high-dimensional) Euclidean space or with distances specified by a metric or kernel. Clustering is related to anomaly/outlier detection.

Standard approaches to clustering
• Hierarchical clustering: build a tree (bottom-up or top-down) representing distances among data points. Examples include single- and average-linkage clustering.
• Partitional approaches: define and optimize a notion of "cost" over partitions. Examples include spectral clustering and graph-cut based approaches.
• Model-based approaches: maintain cluster "models" and infer cluster membership (e.g. assign each point to the closest center). Examples include k-means and Gaussian mixture models.

k-means optimization problem  Assumes points are in Euclidean space, x_i ∈ R^d; represents clusters by centers μ_j ∈ R^d; each point is assigned to its closest center (Voronoi partition). The goal is to minimize the average squared distance,

    R̂(μ) = R̂(μ_1, ..., μ_k) = ∑_{i=1}^n d(x_i, μ) = ∑_{i=1}^n min_{j ∈ {1,...,k}} ||x_i − μ_j||_2^2.
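A minimal numpy sketch of Lloyd's heuristic for this objective; initializing the centers at random data points is an arbitrary choice (k-means++ is common), and empty clusters are not handled.

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), k, replace=False)]   # init centers at data points
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        z = d.argmin(axis=1)                       # assign each point to closest center
        mu = np.stack([X[z == j].mean(axis=0) for j in range(k)])  # recompute centers
    return mu, z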
Determining the number of clusters k is difficult, and clusters of arbitrary shape cannot be modeled well. Don't choose k by cross-validation, because there is a strong correlation between training accuracy and test/validation accuracy. Explain, please.

Nonlinear k-means  Applying k-means to kernel principal components is sometimes called kernel k-means or spectral clustering.

6 Dimension Reduction

The unsupervised analog to regression. Given a data set D = {x_1, ..., x_n}, obtain an "embedding" (low-dimensional representation) z_1, ..., z_n ∈ R^k.

Typical approaches  Assume D = {x_1, ..., x_n} ⊆ R^d; obtain a mapping f : R^d → R^k such that k ≪ d. One can distinguish linear dimension reduction, f(x) = Ax, from nonlinear dimension reduction, and parametric from non-parametric approaches.

6.1 Linear Dimension Reduction: PCA

PCA optimization problem

    (W*, z_1*, ..., z_n*) = argmin ∑_{i=1}^n ||W z_i − x_i||_2^2,

such that W ∈ R^{d×k} is orthogonal and z_1, ..., z_n ∈ R^k.

Theorem 6.1 (PCA)  Let Σ = (1/n) ∑_{i=1}^n x_i x_i^T = ∑_{i=1}^d λ_i v_i v_i^T, λ_1 ≥ ... ≥ λ_d ≥ 0, be the empirical covariance. Assume that μ = (1/n) ∑_i x_i = 0. The linear dimension reduction optimization problem is solved by W* = (v_1, ..., v_k) and z_i* = W*^T x_i.

Kernel PCA  Given the eigendecomposition of the kernel matrix K = ∑_i λ_i v_i v_i^T, the kernel principal components are

    α^{(i)} = (1/√λ_i) v_i.

Notes on PCA
• Complexity grows with the number of data points.
• Cannot easily "explicitly" embed high-dimensional data (unless we have an appropriate kernel).
• Kernel PCA corresponds to applying PCA in the feature space induced by the kernel k.
• Can be used to derive non-linear feature maps in closed form. These can be used as inputs, e.g. for SVMs, giving "multilayer SVMs". What does that mean?
• One may want to center the kernel: K′ = K − KE − EK + EKE with E = (1/n)[1, ..., 1][1, ..., 1]^T; this corresponds to centering the features in the induced feature space.

6.2.2 Autoencoders

The idea is to learn the identity function, x ≈ f(x; θ) = f_2(f_1(x; θ_1); θ_2), where f_1 : R^d → R^k is the encoder and f_2 : R^k → R^d is the decoder.

Neural network autoencoders are ANNs with one output unit for each of the d input units, where the number k of hidden units is usually smaller than the number of inputs (compression effect). The goal is to optimize the weights such that the output agrees with the input, for example by minimizing the square loss.

Training autoencoders  For example, minimize the square loss

    min_W ∑_{i=1}^n ||x_i − f(x_i; W)||_2^2

using SGD (backpropagation). Initialization matters and is challenging, cf. work on pretraining, e.g. layerwise training of restricted Boltzmann machines.
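To make Theorem 6.1 concrete, a minimal numpy sketch of linear PCA; the row-per-data-point layout is our convention.

import numpy as np

def pca(X, k):
    # center the data so that mu = 0, as assumed in Theorem 6.1
    Xc = X - X.mean(axis=0)
    Sigma = Xc.T @ Xc / len(Xc)        # empirical covariance
    lam, V = np.linalg.eigh(Sigma)     # eigh returns ascending eigenvalues
    W = V[:, ::-1][:, :k]              # top-k principal directions v_1, ..., v_k
    Z = Xc @ W                         # low-dimensional representation z_i = W^T x_i
    return W, Z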
III Probabilistic modeling

General approach to probabilistic modeling
(i) Start with a statistical assumption on the data; mostly, data points are modeled as i.i.d. (can be relaxed).
(ii) Choose a likelihood function (e.g. Gaussian, Student-t, logistic, exponential). This defines the loss function.
(iii) Choose a prior (e.g. Gaussian, Laplace, exponential). This defines the regularizer.
(iv) Optimize for the MAP parameters.
(v) Choose hyperparameters (e.g. the variance) through cross-validation.
(vi) Make predictions via Bayesian decision theory.

7 Probabilistic Modeling, Bias-Variance Tradeoff

Maximum (conditional) likelihood estimation is a method for optimizing the parameters,

    θ* = argmax_θ P̂(y_1, ..., y_n | x_1, ..., x_n, θ)  =(i.i.d.)=  argmax_θ ∏_{i=1}^n P̂(y_i | x_i, θ),

or equivalently

    θ* = argmax_θ log P̂(y_{1:n} | x_{1:n}, θ) = argmax_θ ∑_{i=1}^n log P̂(y_i | x_i, θ).

MLE = least squares  With the assumption of Gaussian noise, i.e. y_i ∼ N(w^T x_i, σ^2), equivalently y_i = w^T x_i + ε_i with ε_i ∼ N(0, σ^2), maximizing the likelihood is equivalent to least squares estimation,

    argmax_w P(y_1, ..., y_n | x_1, ..., x_n, w) = argmin_w ∑_{i=1}^n (y_i − w^T x_i)^2,

since −log N(y_i; w^T x_i, σ^2) = (y_i − w^T x_i)^2/(2σ^2) + const.

MLE for i.i.d. Gaussian noise  Suppose H = {h : X → R} is a class of functions. Assuming that P(Y = y | X = x) = N(y | h*(x), σ^2) (the Gaussian density in y, with mean h*(x) and variance σ^2) for some function h* : X → R and some σ^2 > 0, the MLE for data D = {(x_1, y_1), ..., (x_n, y_n)} in H is given by

    ĥ = argmin_{h ∈ H} ∑_{i=1}^n (y_i − h(x_i))^2.

7.4 Ridge Regression = Maximum A Posteriori (MAP) Estimation

A posteriori estimate  Introduce bias by expressing assumptions on the parameters through a Bayesian prior, e.g. θ ∼ N(0, β^2 I_d). Then the posterior distribution of θ follows from Bayes' rule,

    P(θ | x, y) = P(θ) P(y | x, θ) / P(y | x).  (Posterior)
Logistic regression replaces the assumption of Gaussian noise (squared
loss) by i.i.d. Bernoulli noise
Maximizing a posteriori estimate parameters θ leads to the ridge
regression problem P(y|x, w) = Ber(y; σ(wT x)).
argmax log P(θ |x, y) = − log P(θ ) − log P(y1:n | x1:n , θ ) The parameters w can be estimated via MLE or MAP estimation.
θ
1 1 n MLE for logistic regression is the convex optimization problem
+ log P(y| x ) = argmin
2β2
k θ k 2
2 + ∑ ( y i − θ T xi ) 2 .
2σ2 i=1 n
ŵ = argmax P(yi:n |w, x1:n ) = argmax ∑ log(1 + exp(−yi wT xi )).
θ
w w i =1
Ridge Regression = MAP Ridge regression can be understood as
finding the Maximum A Posteriori (MAP) parameter estimate for a Logistic loss gradient is given by
linear regression problem, assuming that the noise P(y|x, θ ) is i.i.d.
Gaussian and the prior P(θ ) on the model parameters θ is Gaus- −yx
∇ w l (w) = T x) + 1
= −yxP̂(Y = −y|w, x),
sian. exp ( yw
Regularization and MAP interference More generally, regularized i.e. the gradient is large if model w is ’surprised’ by y.
estimation can often be understood as MAP inference
n Algorithm 14 SGD for logistic regression
argmin ∑ l (wT xi ; xi , yi ) + C (w) = argmax ∏ P(yi |xi , w)P(w)
w w 1: Initialize w
i =1 i
2: for t = 1, 2, . . . do
= argmax P(w| D ) 3: Pick data point (w, y) uniformly at random from data D
w
4: Compute probability of misclassification with current model
where C (w) = − log P(w) and l (wT xi ; xi , yi ) = − log P(yi |xi , w).
1
This perspective allows changing priors (= regularizers) and likeli- P̂(Y = −y|w, x) =
hoods (= loss functions). 1 + exp(ywT x)
| x − µ|
1
P( x; µ, b) = exp −
2b b
Regularizers can be introduced by estimating MAP instead of solv-
ing the MLE problem. For the respective priors, we get the following
One can introduce robustness by changing the likelihood (=loss) optimization problems
function.
• L2 (Gaussian) argminw ∑in=1 log(1 + exp(−yi wT xi )) + λkwk22
Student-t likelihood
• L1 (Laplace) argminw ∑in=1 log(1 + exp(−yi wT xi )) + λkwk1
ν +1
Γ( ν+ 1
( y − w T x) 2 − 2
2 )
P(y|x, w, ν, σ2 ) = √ 1+
πνσ2 Γ( ν2 ) νσ2Algorithm 15 SGD for l2-regularized logistic regression
1: Initialize w
Compared with the Gaussian distribution, outliers are encouraged, 2: for t = 1, 2, . . . do
because the student-t likelihood decreases algebraically (P(|y − µ| > 3: Pick data point (w, y) uniformly at random from data D
t) = O(t−α )), whereas Gaussian likelihood decreases exponentially 4: Compute probability of misclassification with current model
(P(|y − µ| > t) = O(e−t )). Thus, student-t likelihood might be
better for data with extreme outliers. 1
P̂(Y = −y|w, x) =
1 + exp(ywT x)
10
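A numpy sketch of Algorithms 14/15; the update uses the misclassification probability exactly as in the gradient formula above, while the learning-rate schedule is a placeholder of our choosing.

import numpy as np

def logistic_sgd(X, y, n_iter=10000, lam=0.0, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for t in range(1, n_iter + 1):
        i = rng.integers(len(y))                     # pick (x_i, y_i) uniformly at random
        eta = 1.0 / np.sqrt(t)                       # placeholder learning rate
        p = 1.0 / (1.0 + np.exp(y[i] * (w @ X[i])))  # P(Y = -y_i | w, x_i)
        w = (1 - 2 * lam * eta) * w + eta * y[i] * p * X[i]  # lam = 0 recovers Algorithm 14
    return w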
8.2 Kernelized Logistic Regression

Kernelized logistic regression  Find the optimal weights by minimizing the logistic loss plus a regularizer,

    α̂ = argmin_{α ∈ R^n} ∑_{i=1}^n log(1 + exp(−y_i α^T K_i)) + λ α^T K α.  (Learning)

Logistic Regression  Estimated conditional distribution P̂(y | x) = Ber(y; σ(ŵ^T x)), action set A = {+1, −1}, and cost function C(y, a) = [y ≠ a]. Then the action that minimizes the expected cost is the most likely class,

    a* = argmin_{a ∈ A} E_y[C(y, a) | x] = argmax_y P̂(y | x) = sign(ŵ^T x).
Algorithm 16 Uncertainty sampling
1: pool of unlabeled examples D_x = {x_1, ..., x_n}
2: also maintain a data set D, initially empty
3: for t = 1, 2, 3, ... do
4:   estimate P̂(Y_i | x_i) given the current data D
5:   pick the most uncertain unlabeled example, i_t ∈ argmin_i |0.5 − P̂(Y_i | x_i)|
6:   query its label y_{i_t}
7:   set D ← D ∪ {(x_{i_t}, y_{i_t})}

10 Generative Modeling

10.1 Discriminative vs. Generative Modeling

• Discriminative models aim to estimate P(y | x).
• Generative models aim to estimate the joint distribution P(y, x).

Discriminative models can be derived from generative models by

    P(y | x) = P(x, y)/P(x) = P(x, y)/∑_y P(x, y).

Discriminative models do not have access to P(x) and thus will not be able to detect outliers, i.e. points for which P(x) is small.

Typical approach to generative modeling
(i) Estimate a prior on the labels, P(Y = y).
(ii) Estimate the conditional distribution P(X | Y = y) for each class y.
(iii) Obtain the predictive distribution using Bayes' rule,

    P(y | x) = (1/Z) P(y) P(x | y).

10.2 Naive Bayes Model

10.2.1 Model Description

Assumptions for the naive Bayes model:
(i) Class labels can be modeled as generated from a categorical variable, P(Y = y) = p_y, y ∈ Y = {1, ..., c}.
(ii) Naivety: features are conditionally independent given Y, i.e. given the class label, each feature is "generated" independently of the other features. This assumption is strong and mostly not true, but it still works somehow.

    P(X_1, ..., X_d | Y) = ∏_{i=1}^d P(X_i | Y)

(iii) Feature distribution: e.g. for the Gaussian naive Bayes classifier we have P(X_i = x_i | Y = y) = N(x_i | μ_{y,i}, σ_{y,i}^2). Note that the parameters are class- and feature-dependent.

Gaussian Naive Bayes Classifier

(i) Learning: given data D = {(x_1, y_1), ..., (x_n, y_n)}, use MLE.

MLE for the class prior: P̂(Y = y) = p̂_y = #[Y = y]/n.
MLE for the feature distribution: P̂(x_i | y) = N(x_i; μ̂_{y,i}, σ̂_{y,i}^2) with

    μ̂_{y,i} = (1/#[Y = y]) ∑_{j: y_j = y} x_{j,i},  σ̂_{y,i}^2 = (1/#[Y = y]) ∑_{j: y_j = y} (x_{j,i} − μ̂_{y,i})^2.

(Why no −1? The MLE of the variance divides by the count #[Y = y] and is indeed biased; the unbiased sample variance would divide by #[Y = y] − 1.)

(ii) Prediction: given a new data point x,

    P̂(y | x) = (1/Z) P̂(y) P̂(x | y),  Z = ∑_y P̂(y) P̂(x | y),
    y = argmax_{y′} P̂(y′ | x) = argmax_{y′} P̂(y′) ∏_{i=1}^d P̂(x_i | y′).

10.2.2 Gaussian NB vs. Logistic Regression

Assumptions are P(Y = 1) = 0.5 (mild assumption) and class-independent variances, i.e. P(x | y) = ∏_i N(x_i; μ_{y,i}, σ_i^2) (rather strong assumption).

Discriminant function  The decision rule for binary classification, y = argmax_{y′} P̂(y′ | x), is equivalent to

    y = sign( log (P(Y = +1 | x)/P(Y = −1 | x)) ) = sign f(x),

where f is called the discriminant function.

GNB recovers a linear classifier  With the above assumptions, Gaussian naive Bayes produces a linear classifier,

    f(x) = log (P(Y = +1 | x)/P(Y = −1 | x)) = w^T x + w_0  with
    w_0 = log (p̂_+/(1 − p̂_+)) + ∑_{i=1}^d (μ̂_{−,i}^2 − μ̂_{+,i}^2)/(2σ̂_i^2),  w_i = (μ̂_{+,i} − μ̂_{−,i})/σ̂_i^2.

The corresponding class distribution has the same form as logistic regression, i.e.

    P(Y = +1 | x) = 1/(1 + exp(−f(x))) = σ(w^T x + w_0).

Conclusion  If the model assumptions are met, GNB will make the same predictions as logistic regression.

10.2.3 Issues with Naive Bayes Models

Overconfidence  The conditional independence assumption means that features are generated independently given the class label; as a consequence, predictions may become overconfident.

Example  For duplicated features x_2 = x_3 = ... = x_d = x_1, under certain assumptions the naive Bayes model predicts p_1(x) = 1/(1 + exp(−f_1(x))) and p_d(x) = 1/(1 + exp(−d · f_1(x))): the same evidence is counted d times. This gives p_1(ε) ≈ 0.5 + ε, but p_d(ε) ≈ 0 or 1 for large d (overconfidence).

10.2.4 Categorical Naive Bayes for Discrete Features

Setting  Model the features by (conditionally) independent categorical random variables, P(X_i = x | Y = y) = θ_{x|y}^{(i)}, such that θ_{x|y}^{(i)} ≥ 0 for all i, x, y and ∑_{x=1}^c θ_{x|y}^{(i)} = 1 for all i, y.

MLE estimation given a dataset D = {(x_1, y_1), ..., (x_n, y_n)}:
• for the class label distribution, P̂(Y = y) = p̂_y = #[Y = y]/n,
• for the feature distributions, P̂(X_i = c | y) = θ̂_{c|y}^{(i)} = #[X_i = c, Y = y]/#[Y = y].

Lifting the independence assumption requires specifying the probability of every possible feature configuration, i.e. exponentially many parameters in d, which is computationally intractable and a fantastic way to overfit.

10.2.5 Discrete and Categorical Features

The (naive) Bayes classifier does not require each feature to follow the same type of conditional distribution; e.g. model some features as categorical and some as Gaussian:

    X_{1:10} discrete:   P(x_i | y) = Categorical(x_i | y, θ)
    X_{11:20} Gaussian:  P(x_i | y) = N(x_i; μ_{i|y}, σ_{i|y}^2)
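A numpy sketch of the Gaussian naive Bayes estimates and prediction rule above; note that np.var divides by the count, matching the biased MLE.

import numpy as np

def fit_gnb(X, y):
    # MLE per class: prior, per-feature mean and variance
    classes = np.unique(y)
    prior = {c: np.mean(y == c) for c in classes}
    mu    = {c: X[y == c].mean(axis=0) for c in classes}
    var   = {c: X[y == c].var(axis=0) for c in classes}  # MLE: divides by count, not count-1
    return classes, prior, mu, var

def predict_gnb(x, classes, prior, mu, var):
    # y = argmax_y log p_y + sum_i log N(x_i; mu_{y,i}, var_{y,i})
    def log_post(c):
        return np.log(prior[c]) - 0.5 * np.sum(
            np.log(2 * np.pi * var[c]) + (x - mu[c]) ** 2 / var[c])
    return max(classes, key=log_post)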
10.3 Gaussian Bayes Classifier

10.3.1 Model Description

Assumptions for the Bayes model:
(i) Class labels can be modeled as generated from a categorical variable, P(Y = y) = p_y, y ∈ Y = {1, ..., c}.
(ii) Features are modeled as generated by a multivariate Gaussian, P(x | y) = N(x; μ_y, Σ_y).

MLE for the Gaussian Bayes classifier  Given data D = {(x_1, y_1), ..., (x_n, y_n)}, the MLE for the feature distribution is P̂(x | y) = N(x; μ̂_y, Σ̂_y) and the MLE for the class prior is P̂(Y = y), with

    P̂(Y = y) = p̂_y = #[Y = y]/n,
    μ̂_y = (1/#[Y = y]) ∑_{i: y_i = y} x_i,  Σ̂_y = (1/#[Y = y]) ∑_{i: y_i = y} (x_i − μ̂_y)(x_i − μ̂_y)^T.

Fisher's LDA                              Logistic Regression
generative model                          discriminative model
+ can be used to detect outliers          − cannot detect outliers
− assumes normality of x                  + makes no assumptions on X
− not robust against violations           + more robust
  of this assumption

Conclusion  If the model assumptions are met, Fisher's LDA will make the same predictions as logistic regression.

10.3.4 Gaussian Naive Bayes vs. General Gaussian Bayes Classifiers

GNB models                                General GB models
− conditional independence assumption     + captures correlations
  may lead to overconfidence                among features
+ predictions might still be useful       + avoids overconfidence

Posterior over parameters

    P(θ | y_1, ..., y_n) = (1/Z) P(θ) P(y_{1:n} | θ),  Z = ∫ P(θ) P(y_{1:n} | θ) dθ

Beta distribution  Beta(θ; α_+, α_−) = (1/B(α_+, α_−)) θ^{α_+ − 1} (1 − θ)^{α_− − 1}

Definition 10.1 (Conjugate distributions)  A pair of prior distribution and likelihood function is called conjugate if the posterior distribution remains in the same family as the prior.

Beta conjugacy  With prior Beta(θ; α_+, α_−) and n_+ positive and n_− negative labels, the posterior distribution is Beta(θ; α_+ + n_+, α_− + n_−). Thus, α_+ and α_− act as pseudo-counts.

Beta MAP estimate

    θ̂ = argmax_θ P(θ | y_1, ..., y_n; α_+, α_−) = (α_+ + n_+ − 1)/(α_+ + n_+ + α_− + n_− − 2)
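As a worked example (numbers arbitrary): starting from the uniform prior Beta(θ; 1, 1) and observing n_+ = 3, n_− = 1, the posterior is Beta(θ; 4, 2) and the MAP estimate is θ̂ = (4 − 1)/(4 + 2 − 2) = 3/4.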
Prior/Posterior               Likelihood function
Beta                          Bernoulli/Binomial
Dirichlet                     Categorical/Multinomial
Gaussian (fixed covariance)   Gaussian
Gaussian-inverse Wishart      Gaussian
Gaussian process              Gaussian

11 Probabilistic Modeling of Unsupervised Learning: Latent Variable Modeling

We will focus on missing labels; the ideas may be applied to missing data as well.

11.1 Gaussian Mixture Models

Assumptions  P(X, Y) is a Gaussian Bayes classifier: P(Y = y) = p_y and P(x | y) = N(x; μ_y, Σ_y). We also require i.i.d. data.

Gaussian mixtures are convex combinations of Gaussians,

    P(X = x | θ) = P(X = x | μ, Σ, w) = ∑_i w_i N(x; μ_i, Σ_i),

where w_i ≥ 0 and ∑_i w_i = 1.

Mixture modeling  Model each cluster j ∈ {1, ..., k} as a probability distribution P(x | θ_j). Using the i.i.d. assumption, the likelihood of the data is

    P(D | θ) = ∏_{i=1}^n ∑_{j=1}^k w_j P(X = x_i | θ_j).

Optimization problem  Minimizing the negative log-likelihood,

    (μ*, Σ*, w*) = argmin − ∑_i log ∑_{j=1}^k w_j N(x_i | μ_j, Σ_j)  subject to  ∑_{j=1}^k w_j = 1 and Σ_j positive definite,

is non-convex and constrained. One could try to optimize it using (stochastic) gradient descent, but the constraints might be difficult to maintain.

Choosing k  Same as for k-means. However, for GMMs cross-validation typically works fairly well.

GMMs for density estimation rather than clustering: for example, model P(x) as a Gaussian mixture and P(y | x) using logistic regression, a neural network, etc. Then P(x, y) = P(x) P(y | x) is a valid model. This combines the accurate predictions and robustness of a discriminative model with the ability to detect outliers.

Anomaly detection with mixture models compares the estimated density of a data point against a threshold. If we do not have any examples of anomalies, this is challenging. If we do have some examples, we can
• vary the threshold to trade off false positives and false negatives,
• use precision-recall/ROC curves as evaluation criterion, e.g. maximizing the F1 score.
This allows optimizing the threshold, e.g. via cross-validation.

Why are mixture models useful?
• They can encode assumptions about the shape of clusters, e.g. fit ellipses instead of points.
• They can be part of more complex statistical models, e.g. classifiers (or more general probabilistic models).
• Probabilistic models can output the likelihood P(x) of a point x. This can be useful for anomaly detection.
• They can be naturally used for semi-supervised learning.

11.2 Expectation-Maximization Algorithm

Latent variables are variables that are not directly observed but are inferred (through a mathematical model) from other variables that are observed (directly measured). Mathematical models that aim to explain observed variables in terms of latent variables are called latent variable models. Concretely, this means that for each data point x_i we introduce a latent variable z_i denoting the class this point is assigned to.

The EM algorithm is an iterative method to find maximum likelihood (ML) or maximum a posteriori (MAP) estimates of parameters in statistical models where the model depends on unobserved latent variables. The EM iteration alternates between an expectation (E) step and a maximization (M) step, described below.

E-step  Creates a function Q for the expectation of the complete-data log-likelihood

    L(θ; x) = log P(X = x; θ) = log ∑_z P(X = x, Z = z; θ),

evaluated using the current estimate of the parameters, by computing

    γ_z(x) = P(Z = z | X = x, θ^{(t−1)}).

M-step  Computes parameters maximizing the expected log-likelihood found in the E-step by optimizing

    θ^{(t)} = argmax_θ Q(θ; θ^{(t−1)}),
    Q(θ; θ^{(t−1)}) = E_{Z|X=x, θ^{(t−1)}}[log P(X, Z | θ)]
      = ∑_{i=1}^n E_{Z|X=x_i, θ^{(t−1)}}[log P(X = x_i, Z | θ)]
      = ∑_{i=1}^n ∑_{z=1}^k P(Z = z | X = x_i, θ^{(t−1)}) log P(X = x_i, Z = z | θ)
      = ∑_{i=1}^n ∑_{z=1}^k γ_z(x_i) log (P(Z = z) P(X = x_i | Z = z; θ)).

These parameter estimates are then used to determine the distribution of the latent variables in the next E-step. This procedure is equivalent to training a GBC with weighted data and admits a closed-form solution.

11.2.1 Hard-EM Algorithm

Fitting a GMM = training a GBC without labels  The idea is to repeatedly fill in or update the missing labels and then train on the resulting completed dataset. The algorithm assigns (only) a single label to each data point, which is why it is called hard.

Algorithm 17 Hard Expectation Maximization (EM)
1: initialize the parameters θ^{(0)}
2: for t = 1, 2, 3, ... do
3:   E-step: predict the most likely class for each data point,
     z_i^{(t)} = argmax_z P(z | x_i, θ^{(t−1)}) = argmax_z P(z | θ^{(t−1)}) P(x_i | z, θ^{(t−1)}),
4:   and complete the data, D^{(t)} = {(x_1, z_1^{(t)}), ..., (x_n, z_n^{(t)})}.
5:   M-step: compute the MLE as for the Gaussian Bayes classifier (or a more general probabilistic model),
     θ^{(t)} = argmax_θ P(D^{(t)} | θ)

Problems with Hard-EM  Points are assigned a fixed label even though the model is uncertain; this tries to extract too much information from a single point. In practice, it may work poorly if clusters overlap.
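A numpy sketch of Algorithm 17 for a GMM restricted to unit spherical covariances, so that both steps stay short; this restriction and the initialization are our choices, and empty clusters are not handled.

import numpy as np

def hard_em(X, k, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), k, replace=False)]  # init means at data points
    p = np.full(k, 1.0 / k)                       # uniform class prior
    for _ in range(n_iter):
        # E-step: most likely class under N(mu_z, I) with prior p_z
        score = np.log(p)[None, :] - 0.5 * ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        z = score.argmax(axis=1)
        # M-step: MLE on the completed data, as for a Gaussian Bayes classifier
        p  = np.array([np.mean(z == j) for j in range(k)])
        mu = np.stack([X[z == j].mean(axis=0) for j in range(k)])
    return p, mu, z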
11.2.2 Soft-EM Algorithm

Posterior probabilities  Given a model P(z | θ), P(x | z, θ), we can compute a posterior distribution over cluster membership,

    γ_j(x) = P(Z = j | X = x, Σ, μ, w) = w_j P(X = x | Σ_j, μ_j) / ∑_l w_l P(X = x | Σ_l, μ_l).

MLE  At the MLE (μ*, Σ*, w*) = argmin − ∑_i log ∑_{j=1}^k w_j N(x_i | μ_j, Σ_j) it must hold that

    μ_j* = ∑_{i=1}^n γ_j(x_i) x_i / ∑_{i=1}^n γ_j(x_i),
    Σ_j* = ∑_{i=1}^n γ_j(x_i)(x_i − μ_j*)(x_i − μ_j*)^T / ∑_{i=1}^n γ_j(x_i),
    w_j* = (1/n) ∑_{i=1}^n γ_j(x_i).

Algorithm 18 Soft Expectation Maximization (EM)
1: while not converged do
2:   E-step: calculate γ_j^{(t)}(x_i) for each i and j, given the estimates μ^{(t−1)}, Σ^{(t−1)}, w^{(t−1)} from the previous iteration
3:   M-step: fit the clusters to the weighted data points, i.e. calculate w_j^{(t)}, μ_j^{(t)}, and Σ_j^{(t)}

Constrained GMMs are special cases of Gaussian mixtures:
• spherical: Σ_j = σ_j^2 · I, #params = k
• diagonal: Σ_j = diag(σ_{j,1}^2, ..., σ_{j,d}^2), #params = dk
• tied: Σ_1 = ... = Σ_k, #params = d(d+1)/2
• full: #params = k d(d+1)/2

Discussion  Soft EM will typically result in higher likelihood values, because it deals better with "overlapping clusters".

11.2.3 Theory behind EM

Convergence of EM  One can prove that the EM algorithm monotonically increases the likelihood, log P(x_{1:n} | θ^{(t)}) ≥ log P(x_{1:n} | θ^{(t−1)}). For Gaussian mixtures, EM is guaranteed to converge to a local optimum, but the quality of the solution highly depends on the initialization. A common strategy is to rerun the algorithm multiple times and use the solution with the largest likelihood.

Initialization
• Weights: typically use a uniform distribution, w_i^{(0)} = 1/k for all i.
• Means: use random initialization or k-means++, i.e. pick μ_i^{(0)} = x_{j_i}.
• Variances: for example, initialize according to the empirical covariance of the data (perhaps restricted to spherical), Σ_1 = ... = Σ_k = (1/n) ∑_{i=1}^n (x_i − x̄)(x_i − x̄)^T.

Hard EM performs alternating optimization of the complete-data likelihood,

    z_{1:n}^{(t)} = argmax_{z_{1:n}} P(x_{1:n}, z_{1:n} | θ^{(t−1)}),  (E-step)
    θ^{(t)} = argmax_θ P(x_{1:n}, z_{1:n}^{(t)} | θ),  (M-step)

and converges to a local optimum of max_{z_{1:n}, θ} P(x_{1:n}, z_{1:n} | θ).

EM more generally  The EM algorithm is more widely applicable: it can be used whenever the E and M steps are tractable, i.e. whenever we can compute and maximize the complete-data likelihood. This can be used, for example, for imputing (some) missing features and for handling likelihoods beyond the Gaussian (e.g. categorical).

11.2.4 EM vs. k-means

Assumptions: uniform weights over the mixture components and identical spherical covariance matrices.

Hard EM recovers k-means under the above assumptions: the steps of the hard-EM algorithm become the same decisions as in Lloyd's heuristic,

    z_i^{(t)} = argmax_z P(z | x_i, θ^{(t−1)}) = argmin_z ||x_i − μ_z^{(t−1)}||,  (E-step)
    μ_j^{(t)} = (1/n_j) ∑_{i: z_i^{(t)} = j} x_i.  (M-step)

Soft EM recovers k-means under the above assumptions and additionally variances tending to 0, because for σ^2 → 0 it holds that

    γ_i(x) → 1 if μ_i is closest to x, and γ_i(x) → 0 otherwise.

11.3 Avoiding Overfitting with GMMs

Degeneracy  For a single data point, the negative log-likelihood

    −log P(x | μ, σ) = (1/2) log(2πσ^2) + (1/(2σ^2))(x − μ)^2 → −∞  for μ = x and σ^2 → 0,

i.e. the minimization problem is not bounded from below. Thus, the "optimal" GMM chooses k = n and puts one Gaussian around each data point, with variance tending to 0. This is overfitting.

Adding a Wishart prior on the covariance matrix and computing the MAP instead of the MLE can regularize the problem and thus avoid degeneracy (variances → 0). The corresponding update rule reads

    Σ_j* = ∑_{i=1}^n γ_j(x_i)(x_i − μ_j*)(x_i − μ_j*)^T / ∑_{i=1}^n γ_j(x_i) + ν^2 I.

11.4 Gaussian-Mixture Bayes Classifier

Given a labeled data set D = {(x_1, y_1), ..., (x_N, y_N)} with labels y_i ∈ {1, ..., m}, estimate the class prior P(y) and the conditional distribution of each class as a Gaussian mixture model,

    P(x | y) = ∑_{j=1}^{k_y} w_j^{(y)} N(x; μ_j^{(y)}, Σ_j^{(y)}).

Classification is done by Bayes' rule,

    P(y | x) = (1/Z) P(y) ∑_{j=1}^{k_y} w_j^{(y)} N(x; μ_j^{(y)}, Σ_j^{(y)}).

11.5 Semi-supervised Learning with GMMs

We would like to combine unlabeled and labeled data. Semi-supervised learning is learning from large amounts of unlabeled data and small amounts of labeled data.

Modification of the EM algorithm  The computation of γ also takes the labeled data points x_i into account,

    γ_j(x_i) = P(Z = j | x_i, Σ, μ, w)  if x_i is unlabeled,
    γ_j(x_i) = [j = y_i]               if x_i is labeled with label y_i.

The computation of w_j, μ_j and Σ_j does not change.

11.6 Outlook: Implicit Generative Models

Given a sample of (unlabeled) points x_1, ..., x_n, the goal is to learn a model X = f(Z; w). The approach is to optimize the parameters w so as to make samples from the model hard to distinguish from the data sample.

A Convex functions

Theorem A.1 (Jensen's inequality)  Let f be convex and x_1, ..., x_n ∈ R^d, λ_1, ..., λ_n ∈ [0, 1] such that ∑_{i=1}^n λ_i = 1. Then

    f(λ_1 x_1 + ... + λ_n x_n) ≤ λ_1 f(x_1) + ... + λ_n f(x_n).
Also, if x ∈ R^d is a random variable, then

    f(E[x]) ≤ E[f(x)].