Introduction to Machine Learning

ETH Zurich

Janik Schuettler
Marcel Graetz

FS18
Contents

1 Overview

I Supervised Learning
2 Regression and Gradient Descent
  2.1 Closed-form Solution: Linear Least Squares
  2.2 Optimization: Gradient Descent
  2.3 Non-linear Regression via Linear Regression
  2.4 Model selection
  2.5 Cross validation
  2.6 Regularization
3 Classification
  3.1 Linear Classification
    3.1.1 Perceptron and Stochastic Gradient Descent
    3.1.2 Support Vector Machines (SVM)
  3.2 Feature Selection
    3.2.1 Greedy feature selection
    3.2.2 Linear models
  3.3 Non-linear Classification
    3.3.1 Kernels
    3.3.2 k-Perceptron
    3.3.3 k nearest Neighbors (k-NN)
    3.3.4 Kernelized SVM
    3.3.5 Kernelized Regression
  3.4 Class Imbalance
  3.5 Multi-class Problems
4 Neural Networks
  4.1 Training: Momentum SGD, Backpropagation
  4.2 Initialization and Termination
  4.3 Choosing parameters
  4.4 Regularization
  4.5 Invariances
  4.6 Convolutional Neural Networks (CNN)
  4.7 ANNs vs. tanh-kernels

II Unsupervised Learning
5 Clustering: k-means
6 Dimension Reduction
  6.1 Linear Dimension Reduction: PCA
  6.2 Nonlinear Dimension Reduction
    6.2.1 Kernel PCA
    6.2.2 Autoencoders
    6.2.3 Other
  6.3 Autoencoders vs. PCA
  6.4 PCA vs. k-Means

III Probabilistic modeling
7 Probabilistic Modeling, Bias-variance tradeoff
  7.1 Parametric Estimation
  7.2 Least Squares Regression = Gaussian Maximum Likelihood Estimation (MLE)
  7.3 Bias-Variance Tradeoff
  7.4 Ridge Regression = Maximum A Posteriori (MAP) Estimation
  7.5 Examples for other Priors and Likelihood Functions
8 Classification: Logistic regression
  8.1 Regularized Logistic Regression
  8.2 Kernelized Logistic Regression
  8.3 Multi-class Logistic Regression
  8.4 Comparison
9 Bayesian Decision Theory
  9.1 Asymmetric Costs
  9.2 Uncertainty Sampling
10 Generative Modeling
  10.1 Discriminative vs. Generative Modeling
  10.2 Naive Bayes Model
    10.2.1 Model Description
    10.2.2 Gaussian NB vs. Logistic Regression
    10.2.3 Issues with Naive Bayes models
    10.2.4 Categorical Naive Bayes for discrete Features
    10.2.5 Discrete and categorical Features
  10.3 Gaussian Bayes Classifier
    10.3.1 Model Description
    10.3.2 Fisher's Linear Discriminant Analysis (LDA)
    10.3.3 LDA vs. Logistic regression
    10.3.4 Gaussian Naive Bayes vs. General Gaussian Bayes Classifiers
    10.3.5 LDA vs. PCA
    10.3.6 Quadratic Discriminant Analysis (QDA)
11 Probabilistic Modeling of Unsupervised Learning: Latent Variable Modeling
  11.1 Gaussian Mixture Models
  11.2 Expectation-Maximization Algorithm
    11.2.1 Hard-EM Algorithm
    11.2.2 Soft-EM Algorithm
    11.2.3 Theory behind EM
    11.2.4 EM vs. k-means
  11.3 Avoiding Overfitting with GMMs
  11.4 Gaussian-Mixture Bayes classifier
  11.5 Semi-supervised Learning with GMMs
  11.6 Outlook: Implicit generative Models
A Convex functions
Bibliography
1 Overview

I Supervised Learning

2 Regression and Gradient Descent

We try to fit a function to training data (learning) in order to make predictions. Our goal is to learn a real-valued mapping f : R^d → R.

The general model is
f(x) = ∑_{i=1}^d w_i x_i + b = w^T x + b = w̃^T x̃
with x̃ = (x_1, ..., x_d, 1) and w̃ = (w_1, ..., w_d, b).

Model error We measure the goodness of a model (i.e. the fit) using a p-loss function l_p(w, x, y),
R̂(w) = ∑_{i=1}^n l_p(w, x_i, y_i) = ∑_{i=1}^n |y_i − w^T x_i|^p,  p ≥ 1.
For p = 2, we get the least squares measure R̂(w) = ∑_{i=1}^n (y_i − w^T x_i)^2.

Optimization problem We want to find the optimal weight vector
w* = argmin_w R̂(w) = argmin_w ∑_{i=1}^n (y_i − w^T x_i)^2.

2.1 Closed-form Solution: Linear Least Squares
The closed-form solution is w* = (X^T X)^{−1} X^T y.
Complexity for solving in closed form is O(nd^2 + d^3).

2.2 Optimization: Gradient Descent
Theorem 2.1 (Gradient descent) Let f be convex with global minimizer w*. Assume ‖w_1 − w*‖ ≤ D and ‖∇f(w)‖ ≤ G for all w in the ball B_D(w*) ⊆ R^d. If we choose η_t = D/(G√t), then
f(w_T) − f(w*) ≤ GD/√T.

The least squares objective is convex, so gradient descent finds an optimal solution with better complexity: one iteration of gradient descent costs O(nd).

Algorithm 1 Gradient Descent
1: w_0 ∈ R^d  ▷ start with arbitrary w_0
2: for t = 1, 2, ..., T do
3:   ∇_w R̂(w_t) = −2 ∑_{i=1}^n (y_i − w_t^T x_i) x_i
4:   w_{t+1} = w_t − η_t ∇R̂(w_t)  ▷ η_t is the learning rate
5: return w_T

Problem For too small a step size convergence is very slow, while for too large a step size the iteration can diverge.

Adaptive step size Examples of how to update the step size adaptively:
• Line search: optimize the step size in every step,
η_t ← argmin_{η ∈ R_+} R̂(w_t − η ∇R̂(w_t)),  w_{t+1} ← w_t − η_t ∇R̂(w_t).
• Bold driver heuristic: if the objective decreases, increase the step size, else decrease it,
η_{t+1} ← c_inc η_t if R̂(w_{t+1}) < R̂(w_t),  η_{t+1} ← c_dec η_t if R̂(w_{t+1}) > R̂(w_t).

GD convergence Stop if either
• the gradient is small enough, or
• the difference in objective between subsequent iterations is small enough.
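A minimal numpy sketch of Algorithm 1 on the least squares objective; the data X, y, the constant learning rate eta and the iteration count T are illustrative assumptions, not prescriptions from the notes.

import numpy as np

def gradient_descent(X, y, eta=1e-3, T=1000):
    """Minimize R_hat(w) = sum_i (y_i - w^T x_i)^2 by gradient descent."""
    n, d = X.shape
    w = np.zeros(d)                       # start with arbitrary w_0
    for t in range(T):
        grad = -2 * X.T @ (y - X @ w)     # gradient of the empirical risk
        w = w - eta * grad                # eta is the learning rate
    return w

For small d, the closed form np.linalg.solve(X.T @ X, X.T @ y) yields the same minimizer; the iterative version pays off when d is large.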
2.3 Non-linear Regression via Linear Regression
Non-linear basis functions are used to fit non-linear data via linear regression,
f(x) = ∑_{i=1}^d w_i φ_i(x).
In 2D, φ could be φ(x) = (1, x_1, x_2, x_1^2, x_2^2, x_1 x_2, ...).

2.4 Model selection
We would like to choose the model that optimizes the trade-off between model complexity and training error, i.e. between under- and overfitting the data. Mathematically, we try to minimize the true risk
R(w) = E_{x,y}[(y − w^T x)^2] = ∫ P(x, y)(y − w^T x)^2 dx dy,  w* = argmin_w R(w).
However, we can only compute the empirical risk
R̂_D(w) = 1/|D| ∑_{(x,y)∈D} (y − w^T x)^2,  ŵ = argmin_w R̂_train(w).

Theorem 2.2 (Law of large numbers (LLN)) R̂_D(w) → R(w) almost surely for any fixed w as |D| → ∞.

i.i.d. assumption We assume the data set is generated independently and identically distributed (i.i.d.), (x_i, y_i) ∼ P(X, Y).

Convergence of learning For learning via empirical risk minimization to be successful, we need uniform convergence sup_w |R(w) − R̂_D(w)| → 0 for |D| → ∞, which is not implied by the LLN but depends on the model class.

Splitting the data set Do not test a model on the training data, because E[R̂_train(ŵ)] < E[R(ŵ)]. Best practice is to split the data set into a training set D_train and a test set D_test. Optimize w on D_train,
ŵ = argmin_w R̂_train(w),
and evaluate it on D_test,
R̂_test(ŵ) = 1/|D_test| ∑_{(x,y)∈D_test} (y − ŵ^T x)^2.
Then E_{D_train, D_test}[R̂_{D_test}(ŵ_{D_train})] = E_{D_test}[R(ŵ_{D_train})].
2.5 Cross validation
The test error R̂_test is itself random; its variance usually increases for more complex models. The idea is to use and average over multiple test sets to reduce the bias. Note that this only works for i.i.d. data.

Algorithm 2 Model selection with cross validation
1: for each candidate model m do
2:   for i = 1 : k do
3:     D = D_train^(i) ⊎ D_validation^(i)  ▷ split data set
4:     ŵ^(i) = argmin_w R̂_train^(i)(w)  ▷ train model
5:     R̂_m^(i) = R̂_val^(i)(ŵ^(i))  ▷ estimate error
6: m̂ = argmin_m 1/k ∑_{i=1}^k R̂_m^(i)  ▷ select model

Splitting test sets may be done by different procedures.
• Monte-Carlo cross-validation: pick a training set of given size uniformly at random, validate on the remaining points, and estimate the prediction error by averaging the validation error over multiple random trials.
• k-fold cross-validation: partition the data into k folds, train on (k−1) folds, evaluate on the remaining fold, and estimate the prediction error by averaging the validation error obtained while varying the validation fold. This is the default.

The number of folds k is hard to choose:
• too small: risk of overfitting to the test set; using too little data for training risks underfitting the training set,
• too large: in general better performance (k = n is fine), but higher computational complexity,
• in practice mostly k = 5 or k = 10.
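A sketch of k-fold cross-validation (Algorithm 2's inner loop) for one candidate model; train_fn and error_fn are hypothetical stand-ins for a model's training and evaluation routines.

import numpy as np

def kfold_cv_error(X, y, train_fn, error_fn, k=5, seed=0):
    """Estimate the prediction error of one candidate model by k-fold CV.
    Assumes train_fn(X, y) -> model and error_fn(model, X, y) -> scalar."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    errs = []
    for i in range(k):
        val = folds[i]                                    # held-out validation fold
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train_fn(X[train], y[train])              # train on the other k-1 folds
        errs.append(error_fn(model, X[val], y[val]))      # validation error
    return np.mean(errs)                                  # average over folds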
2.6 Regularization
For non-linear models there might not be a natural order of complexity like with monomials, where higher-order monomials account for more complex models. Too complex models often manifest in large weights. The idea of regularization is to keep models simple by penalizing large weights.

Ridge regression adds the term λ‖w‖_2^2,
w* = argmin_w R̂(w) + λ‖w‖_2^2 = argmin_w ∑_{i=1}^n (y_i − w^T x_i)^2 + λ‖w‖_2^2
for some λ ∈ R. Using the homogeneous representation, the constant term is not regularized. λ balances the two objectives:
• λ → ∞: the optimization problem only minimizes ‖w‖,
• λ → 0: no regularization.
The closed-form solution is w* = (X^T X + λI)^{−1} X^T y. The matrix X^T X + λI is always invertible and better conditioned.

Renormalizing data ensures that each feature has zero mean and unit variance, because scaling matters for regularization:
x̃_{i,j} = (x_{i,j} − μ̂_j)/σ̂_j,  where μ̂_j = 1/n ∑_{i=1}^n x_{i,j},  σ̂_j^2 = 1/n ∑_i (x_{i,j} − μ̂_j)^2.

Algorithm 3 Gradient Descent with regularization
1: w_0 ∈ R^d  ▷ start with arbitrary w_0
2: for t = 1, 2, ..., T do
3:   ∇_w R̂(w_t) = −2 ∑_{i=1}^n (y_i − w_t^T x_i) x_i
4:   w_{t+1} = w_t (1 − 2λη_t) − η_t ∇R̂(w_t)  ▷ η_t is the learning rate
5: return w_T

Choosing λ is mostly done using cross-validation over logarithmically spaced values λ ∈ {10^{−6}, 10^{−5}, ..., 10^5, 10^6}.
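A short numpy sketch of the ridge closed form and the feature standardization above; the regularization weight lam is an illustrative assumption.

import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution w* = (X^T X + lam*I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def standardize(X):
    """Zero mean, unit variance per feature (scaling matters for regularization)."""
    return (X - X.mean(axis=0)) / X.std(axis=0)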
3 Classification
We would like to assign data points X a label Y, i.e. Y is discrete.

3.1 Linear Classification
Linear classification seems restrictive, but using more dimensions and the right features often works quite well. Prediction is typically efficient.

Loss functions We want to find the optimal weight vector w* = argmin_w ∑_{i=1}^n l(w, x_i, y_i). Possible loss functions are
• 0/1-loss, minimizing the number of errors, an intractable (non-convex) loss function,
l_{0/1}(w, x_i, y_i) = 1 if y_i ≠ sign(w^T x_i), 0 else,
• perceptron-loss, a tractable surrogate loss function,
l_P(w, x_i, y_i) = max(0, −y_i w^T x_i),
• hinge-loss (SVM),
l_H(w, x_i, y_i) = max(0, 1 − y_i w^T x_i).

3.1.1 Perceptron and Stochastic Gradient Descent
Perceptron optimization problem
w* = argmin_w ∑_{i=1}^n l_P(w, x_i, y_i) = argmin_w ∑_{i=1}^n max(0, −y_i w^T x_i).

Perceptron gradient
∇_w R̂_P(w) = − ∑_{i: y_i ≠ sign(w^T x_i)} y_i x_i

Stochastic Gradient Descent (SGD) picks m data points uniformly at random to estimate the gradient (mini-batch SGD); the full gradient is the expectation over a uniformly drawn index,
∇R̂(w) = 1/n ∑_I ∇l(w; x_I, y_I) = E_{I ∼ Unif{1,...,n}}[∇l(w; x_I, y_I)].

Algorithm 4 Stochastic Gradient Descent
1: w_0 ∈ R^d  ▷ start with arbitrary w_0
2: for t = 1, 2, ... do
3:   pick (x'_i, y'_i)_{i=1}^m ∈ D  ▷ random data points
4:   w_{t+1} = w_t − η_t ∇l(w_t, x', y')
5: return w
Algorithm 5 Perceptron with Stochastic Gradient Descent
1: w_0 ∈ R^d  ▷ start with arbitrary w_0
2: for t = 1, 2, ... do
3:   pick i_t ∼ Unif{1, ..., n}  ▷ one random data point
4:   if y_{i_t} ≠ sign(w_t^T x_{i_t}) then  ▷ perceptron gradient
5:     w_{t+1} = w_t + η_t y_{i_t} x_{i_t}
6:   else
7:     w_{t+1} = w_t

Robbins-Monro conditions keep the learning rate η_t such that the algorithm will not terminate before converging, i.e. ∑_t η_t = ∞, but with bounded variance, i.e. ∑_t η_t^2 < ∞. These conditions are sufficient for convergence. Examples: η_t = 1/t or η_t = min(c_1, c_2/t).

Remark 3.1 This is the perceptron algorithm. (See problem set 3.)

Adaptive learning rates are used by algorithms such as AdaGrad, RMSProp and Adam.

Theorem 3.2 If the data is linearly separable, the Perceptron will find a linear separator.

SGD convergence criteria Stop if either
• a fixed number of iterations have passed,
• the GD conditions suggest convergence (occasionally, say every n-th iteration, compute the full objective value/gradient magnitude), or
• the error on a separate validation data set is small enough (direct monitoring). This is a special form of regularization called early stopping.

3.1.2 Support Vector Machines (SVM)
The hinge-loss encourages not only correct classification, but correct classification with maximal margin to the decision boundary. (Can this lead to non-optimal decisions in case of e.g. separability?)

SVM optimization problem
w* = argmin_w ∑_{i=1}^n max(0, 1 − y_i w^T x_i) + λ‖w‖_2^2

Theorem 3.3 The SVM finds the solution with maximal margin to the decision boundary.

Choosing λ is mostly done using cross-validation. Note that instead of the hinge-loss, one would use the target performance metric (e.g. the number of mistakes) for validation.

3.2 Feature Selection
In many high-dimensional problems we may prefer not to work with all potentially available features, because of
• interpretability and generalization: simple models may be understood and generalize better to more complex tasks,
• storage/computation/cost: it is unnecessary to compute with and store unused or less important features, or features that depend upon one another. Also, acquiring features might be expensive, so one might prefer not to acquire a feature that is not needed.

3.2.1 Greedy feature selection
Greedily add or remove features to maximize cross-validated prediction accuracy, mutual information, or other notions of informativeness. For a set of features V = {1, ..., d}, define a cost function L̂ : P(V) → R as the cross-validation error using features in subsets of V only.

Algorithm 6 Greedy forward selection
1: S = ∅, E_0 = ∞
2: for i = 1 : d do
3:   s_i = argmin_{j ∈ V\S} L̂(S ∪ {j})  ▷ find best element to add
4:   E_i = L̂(S ∪ {s_i})  ▷ compute error
5:   if E_i > E_{i−1} then
6:     break
7:   else
8:     S ← S ∪ {s_i}
9: return S

Algorithm 7 Greedy backward selection
1: S = V, E_{d+1} = ∞
2: for i = d : −1 : 1 do
3:   s_i = argmin_{j ∈ S} L̂(S\{j})  ▷ find best element to remove
4:   E_i = L̂(S\{s_i})  ▷ compute error
5:   if E_i > E_{i+1} then
6:     break
7:   else
8:     S ← S\{s_i}
9: return S

Comparison Forward selection is usually fast (if there are few relevant features); backward selection can handle "dependent" features.

Problems Both algorithms are computationally expensive (models must be retrained many times for different feature combinations) and can be suboptimal. As an extreme counterexample, consider a setting in which all features are uninformative on their own, but informative altogether.

3.2.2 Linear models
We want to solve the learning and feature selection problems simultaneously via a single optimization.

Sparse regression The key idea is to replace feature selection by setting unimportant features to 0, i.e. working with sparse feature representations, encoded by the pseudo-norm ‖w‖_0 = number of non-zero entries in w. The 0-norm penalty encourages coefficients to be exactly 0 and therefore automatic feature selection; however, the optimization problem is hard to solve. We instead use the 1-norm ‖·‖_1 to keep the optimization problem convex.

L1-regularized regression problem (Lasso)
w* = argmin_w ∑_{i=1}^n (y_i − w^T x_i)^2 + λ‖w‖_1
Choosing λ is mostly done using cross-validation.

L1-SVM optimization problem
w* = argmin_w ∑_{i=1}^n max(0, 1 − y_i w^T x_i) + λ‖w‖_1

Comparison Forward/backward selection applies to any prediction method but takes time. L1-regularization is faster (training and feature selection happen jointly) but only works for linear models.

3.3 Non-linear Classification
Theorem 3.4 (Representer Theorem) The normal to the optimal hyperplane lies in the span of the data, ŵ = ∑_{i=1}^n α_i y_i x_i.

Reformulation using inner products x_i^T x_j:
R̂(α) = min_{α_{1:n}} ∑_{i=1}^n max{0, −∑_{j=1}^n α_j y_i y_j x_i^T x_j},
where the inner sum equals y_i w^T x_i.

3.3.1 Kernels
Kernel trick Express the problem such that it depends only on inner products, and replace these inner products with kernels, x_i^T x_j → k(x_i, x_j).
Definition 3.5 (Kernel) A kernel is a function k : X × X → R satisfying symmetry and positive semi-definiteness, i.e. for any n and any set S = {x_1, ..., x_n} ⊆ X, the kernel (Gram) matrix
K = [k(x_i, x_j)]_{i,j=1}^n
is positive semi-definite.

Theorem 3.6 (Kernel Composition) Let k_1, k_2 : X × X → R be kernels on the data space X. Then the following are valid kernels:
• k(x, x') = k_1(x, x') + k_2(x, x'),
• k(x, x') = k_1(x, x') k_2(x, x'),
• k(x, x') = c k_1(x, x') for c > 0,
• k(x, x') = f(k_1(x, x')), where f is a polynomial with positive coefficients or the exponential function.

Theorem 3.7 (Mercer's Theorem) Let X be a compact subset of R^n and k : X × X → R a kernel function. Then one can expand k in a uniformly convergent series of bounded functions φ_i such that
k(x, x') = ∑_{i=1}^∞ λ_i φ_i(x) φ_i(x').

Kernels in R^d
• Linear kernel: k(x, x') = x^T x'.
• Polynomial kernel: k(x, x') = (x^T x' + 1)^m implicitly represents all monomials up to degree m. In d dimensions there are (d+m choose m) such monomials.
• Gaussian/RBF kernel: k(x, x') = exp(−‖x − x'‖_2^2 / h^2) maps to an infinite-dimensional space.
• Laplacian kernel: k(x, x') = exp(−‖x − x'‖_1 / h).
• ? kernel: k(x, x') = x^T M x' for a symmetric positive definite matrix M.
• ANOVA kernel: k(x, x') = ∑_{i=1}^d k_i(x^(i), x'^(i)) with k_i(x, x') = exp(−(x − x')^2 / h_i^2).
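A small numpy sketch of the Gaussian/RBF kernel from the list above, together with the positive semi-definiteness property from Definition 3.5; the data and bandwidth h are illustrative assumptions.

import numpy as np

def rbf_kernel_matrix(X, h=1.0):
    """Gram matrix K_ij = exp(-||x_i - x_j||_2^2 / h^2)."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / h**2)

# A valid kernel's Gram matrix is symmetric positive semi-definite:
X = np.random.default_rng(0).normal(size=(5, 3))
K = rbf_kernel_matrix(X)
assert np.allclose(K, K.T)
assert np.linalg.eigvalsh(K).min() >= -1e-10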
3.3.2 k-Perceptron
Kernelized perceptron optimization problem
argmin_α ∑_{i=1}^n max{0, −y_i α^T k_i} = min_{α_{1:n}} ∑_{i=1}^n max{0, −∑_{j=1}^n α_j y_i y_j k(x_i, x_j)}
for k_i = (y_1 k(x_i, x_1), ..., y_n k(x_i, x_n))^T.

Algorithm 8 Kernelized Perceptron
1: α_1 = ... = α_n = 0
2: for t = 1, 2, ... do
3:   pick i_t ∼ Unif{1, ..., n}  ▷ random data point
4:   ŷ = sign(∑_{j=1}^n α_j y_j k(x_j, x_{i_t}))  ▷ predict
5:   if y_{i_t} ≠ ŷ then  ▷ perceptron gradient
6:     α_{i_t} ← α_{i_t} + η_t
The prediction for a new point x is ŷ = sign(∑_{j=1}^n α_j y_j k(x_j, x)).

3.3.3 k nearest Neighbors (k-NN)
k-nearest neighbors The label depends on the k nearest neighbors among all data points,
y = sign(∑_{i=1}^n y_i [x_i among the k nearest neighbors of x]).
Choose k using cross-validation.

Comparison of Perceptron and k-NN For k-NN no training is necessary, but prediction depends on all data, which may render it inefficient. The kernelized perceptron may have improved performance due to optimized weights, can capture "global trends" with suitable kernels, and depends only on wrongly classified examples, but training requires optimization.

Parametric vs. nonparametric models Parametric models have a finite set of parameters (e.g. linear regression, linear perceptron); nonparametric models grow in complexity with the size of the data (e.g. kernelized perceptron, k-NN) and are thus potentially much more expressive, but also computationally more expensive. Kernels provide a principled way of deriving nonparametric models from parametric ones.

3.3.4 Kernelized SVM
Kernelized SVM optimization problem
argmin_α ∑_{i=1}^n max(0, 1 − y_i α^T k_i) + λ α^T D_y K D_y α
for k_i = (y_1 k(x_i, x_1), ..., y_n k(x_i, x_n))^T and D_y = diag(y_1, ..., y_n).

3.3.5 Kernelized Regression
Kernelized linear regression optimization problem
argmin_α ‖α^T K − y‖_2^2 + λ α^T K α
with closed-form solution α* = (K + λI)^{−1} y and predictor f(x) = ∑_{i=1}^n α_i k(x_i, x).

3.4 Class Imbalance

                       True label
                    Positive  Negative  ∑ =
Predicted Positive    TP        FP      p_+
Predicted Negative    FN        TN      p_−
∑ =                   n_+       n_−

Metrics to measure goodness of fit
Accuracy = (TP + TN)/(TP + TN + FP + FN) = (TP + TN)/n
Precision = TP/(TP + FP) = TP/p_+ ∈ [0, 1]
Recall (TPR) = TP/(TP + FN) = TP/n_+ ∈ [0, 1]
False positive rate (FPR) = FP/(TN + FP) = FP/n_− ∈ [0, 1]
F1 score = 2TP/(2TP + FP + FN), the harmonic mean of precision and recall
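A small sketch computing the metrics above from the confusion counts; the function name and dictionary output are illustrative choices.

def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall (TPR), FPR and F1 from confusion counts."""
    n = tp + fp + fn + tn
    return {
        "accuracy":  (tp + tn) / n,
        "precision": tp / (tp + fp),
        "recall":    tp / (tp + fn),        # true positive rate
        "fpr":       fp / (tn + fp),        # false positive rate
        "f1":        2 * tp / (2 * tp + fp + fn),
    }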
Accuracy is often not meaningful for imbalanced data sets, because it may prefer certain mistakes over others (trading false positives for false negatives). Minority-class instances contribute little to the empirical risk and may be ignored during optimization.

Upsampling Repeat data points from the minority class (possibly with small random perturbations) to obtain a balanced data set. This method makes use of all data, but is slower, and adding perturbations requires arbitrary choices.

Downsampling Remove training examples from the majority class (e.g. uniformly at random) such that the resulting data set is balanced. This method is faster because it reduces the training set size, but available data is wasted, along with information about the majority class.

Cost-sensitive classification Modify the Perceptron/SVM by using a cost-sensitive loss function l_CS(w; x, y) = c_y l(w; x, y) to take class imbalance into account,
l_{CS−P}(w; x, y) = c_y max(0, −y w^T x)  (Perceptron)
l_{CS−H}(w; x, y) = c_y max(0, 1 − y w^T x)  (SVM)
with parameters c_+, c_− > 0.

Receiver operator characteristic (ROC) curve plots the true positive rate against the false positive rate.

Area under the curve (AUC) of the ROC or precision-recall curve measures the ability of a classifier to handle imbalanced classification.

Theorem 3.8 Algorithm 1 dominates algorithm 2 in terms of ROC curves if and only if algorithm 1 dominates algorithm 2 in terms of precision-recall curves.

3.5 Multi-class Problems
One-vs-All Train c binary classifiers, one per class. Classify using the classifier with the largest confidence, i.e. predict
ŷ = argmax_{i ∈ {1,...,c}} w^(i)T x.

One-vs-All discussion One-vs-All only works well if the classifiers produce confidence scores on the same scale. Individual binary classifiers see imbalanced data even if the whole data set is balanced. One class might not be linearly separable from all other classes.

One-vs-One Train c(c−1)/2 binary classifiers, one for each pair of classes i, j. Apply a voting scheme: the class with the highest number of positive predictions wins.

One-vs-One discussion One-vs-One does not rely on confidence scores, but is slower than One-vs-All.

Multi-class methods Maintain c weight vectors w^(1), ..., w^(c), one per class, and predict ŷ = argmax_i w^(i)T x. Given each data point (x, y), we want to achieve
w^(y)T x > max_{i ≠ y} w^(i)T x + 1.  (∗)

Multi-class hinge-loss
l_{MC−H}(w^(1:c); x, y) = max(0, 1 + max_{j ≠ y} w^(j)T x − w^(y)T x)
with gradient
∇_{w^(j)} l_{MC−H}(w^(1:c); x, y) = 0 if (∗) holds or j ∉ {y, ŷ};  −x if (∗) is violated and j = y;  x if (∗) is violated and j = ŷ.

Confusion matrices are often used to evaluate multi-class classifiers.

                        True label
                    Cat   Dog   Elephant
Predicted Cat        5     2       0
Predicted Dog        3     7       0
Predicted Elephant   1     0       6

4 Neural Networks
What are good features?

Neural network optimization problem
w* = argmin_{w,θ} ∑_{i=1}^n l(y_i; ∑_{j=1}^m w_j φ(x_i, θ_j))

Feature maps, activation function For example φ(x, θ) = ϕ(θ^T x).

Activation functions
• Sigmoid: ϕ(z) = 1/(1 + exp(−z))
• Tanh: ϕ(z) = tanh z
• ReLU: ϕ(z) = max(0, z)

Artificial Neural Networks (ANN) are functions of the form
f(x; w, θ) = ∑_{j=1}^m w_j ϕ(θ_j^T x).
Algorithm 9 Forward Propagation
1: v^(0) = x  ▷ input layer
2: for each hidden layer l = 1 : L−1 do
3:   z^(l) = W^(l) v^(l−1)
4:   v^(l) = ϕ(z^(l))
5: f = W^(L) v^(L−1)
6: y = f  ▷ prediction for regression, or
7: y = sign(f)  ▷ prediction for classification, or
8: y = argmax_j f_j  ▷ prediction for multi-class classification

Derivatives of activation functions
• Sigmoid: ϕ'(z) = (1/(1 + e^{−z}))' = e^z/(1 + e^z)^2 = (1 − ϕ(z)) ϕ(z). Properties: differentiable and non-zero everywhere, but ϕ'(z) ≈ 0 almost everywhere except for z ≈ 0.
• ReLU: ϕ'(z) = (max(0, z))' = 1 if z > 0, 0 if z < 0. Properties: not differentiable at 0 (in practice just set it to 0, it doesn't really matter), efficient, and > 0 on all of R_+.

Theorem 4.1 Let σ be any continuous sigmoidal function. Then finite sums of the form
G(x) = ∑_{j=1}^N α_j σ(y_j^T x + θ_j)
are dense in C(I_n). In other words, given any f ∈ C(I_n) and ε > 0, there is a sum G(x) of the above form for which
|G(x) − f(x)| < ε  ∀x ∈ I_n.
(But do we actually get such functions out of neural networks?)

4.1 Training: Momentum SGD, Backpropagation
Training Given a data set D = {(x_1, y_1), ..., (x_n, y_n)}, we want to optimize the weights W = (W^(1), ..., W^(L)) using any loss function l(W; y, x) (perceptron loss, multi-class hinge loss, squared loss, etc.),
W* = argmin_W ∑_{i=1}^n l(W; y_i, x_i).
When predicting multiple outputs at the same time, the loss is usually defined as the sum of per-output losses,
l(W; y, x) = ∑_{i=1}^p l_i(W; y_i, x).
This optimization problem is not convex.

Algorithm 10 SGD for ANNs
1: initialize weights W
2: for t = 1, 2, ... do
3:   pick a data point (x, y) ∈ D uniformly at random
4:   take a step in the negative gradient direction,
5:   W = W − η_t ∇_W l(W; y, x)

Computing the gradient To compute ∇_W l(W; y, x), we use backpropagation, exploiting the chain rule and the weight-specific gradients ∇_{w_{i,j}} l(W; y, x).

Algorithm 11 Backpropagation
1: for the output layer do
2:   δ^(L) = Dl(f) = (l_1'(f_1), ..., l_p'(f_p))  ▷ compute "error"
3:   ∇_{W^(L)} l(W; y, x) = δ^(L) v^(L−1)T  ▷ compute gradient matrix
4: for each hidden layer l = L−1 : −1 : 1 do
5:   δ^(l) = ϕ'(z^(l)) ⊙ (W^(l+1)T δ^(l+1))  ▷ compute "error"
6:   ∇_{W^(l)} l(W; y, x) = δ^(l) v^(l−1)T  ▷ compute gradient

Learning rate Often one initially chooses a fixed (small) learning rate and decreases it slowly after some iterations, e.g. η_t = min(0.1, 100/t). It is also possible to monitor the ratio of weight change (gradient) to weight magnitude: if the ratio is too small, increase the learning rate, otherwise decrease it.

Learning with momentum can help to escape local minima by moving not only in the direction of the gradient, but also in the direction of the last weight update,
a = m · a + η_t ∇_W l(W; y, x),  W = W − a,
where m denotes a friction parameter. This method can help to prevent oscillations.

Weight-space symmetries cause "degenerate" local minima, i.e. multiple local minima can be equivalent in terms of the input-output mapping.

4.2 Initialization and Termination
Initialization of weights matters, because the problem is non-convex. Random initialization usually works well, e.g. w_{i,j} ∼ N(0, 0.1) or w_{i,j} ∼ N(0, 1/√|Layer_{l+1}|). However, incorrect initialization can lead to bad results; one might want to repeat training multiple times to avoid getting stuck in a poor local optimum. Less deep architectures are more prone to getting stuck in a local optimum.

4.3 Choosing parameters
In principle one could use cross-validation to compare models; however, training ANNs is usually expensive. Parameters to choose include the number of units and layers, the activation functions, the learning rate and its schedule, the regularization method, the weight initialization, and the number of convolution/pooling layers.

Type of activation function Sigmoid and tanh are differentiable and were popular in the past. ReLUs are currently used extensively: they are not differentiable (not a problem), fast to compute, and their gradients do not vanish (important).

Number of hidden layers [1] In most tasks, one hidden layer is sufficient. More generally:
• 0: only capable of representing linearly separable functions or decisions,
• 1: can approximate any function that contains a continuous mapping from one finite space to another,
• 2: can represent an arbitrary decision boundary to arbitrary accuracy with rational activation functions and can approximate any smooth mapping to any accuracy.
(Where exactly is the difference between 1 and 2?)

Number of hidden units [1] The optimal size of the hidden layer is usually between the size of the input and the size of the output layer. (True?) Some rules of thumb:
• the number of hidden neurons should be between the size of the input layer and the size of the output layer,
• the number of hidden neurons should be 2/3 the size of the input layer, plus the size of the output layer,
• the number of hidden neurons should be less than twice the size of the input layer.
An upper bound for the number of hidden units is given by
N_hidden ≤ N_sample / (α (N_input + N_output)),  2 ≤ α ≤ 10.
(Where does this come from?)
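To close the training discussion, a minimal numpy sketch of Algorithms 9 and 11 combined, for a one-hidden-layer ReLU network with squared loss; the architecture and loss are illustrative assumptions, not the notes' only setup.

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward_backward(W1, W2, x, y):
    """One forward/backward pass for f = W2 relu(W1 x), loss l = ||f - y||^2."""
    # forward propagation (Algorithm 9)
    z1 = W1 @ x                              # z^(1) = W^(1) v^(0)
    v1 = relu(z1)                            # v^(1) = phi(z^(1))
    f = W2 @ v1                              # output layer
    # backpropagation (Algorithm 11)
    delta2 = 2 * (f - y)                     # "error" at the output: l'(f)
    grad_W2 = np.outer(delta2, v1)           # gradient matrix for W^(2)
    delta1 = (z1 > 0) * (W2.T @ delta2)      # phi'(z^(1)) ⊙ (W^(2)T delta^(2))
    grad_W1 = np.outer(delta1, x)            # gradient matrix for W^(1)
    return grad_W1, grad_W2

An SGD step (Algorithm 10) then subtracts eta times these gradients from W1 and W2.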
4.4 Regularization
Neural networks are prone to overfitting due to their large number of parameters.

Early stopping doesn't let the neural net converge: monitor the prediction performance on a validation set and stop training once the validation error starts to increase.

Regularization adds the usual regularization term to the optimization problem,
W* = argmin_W ∑_{i=1}^n l(W; y_i, x_i) + λ‖W‖_2^2.

Dropout regularization Randomly ignore hidden units during each iteration of SGD with probability p. After training, halve the weights to compensate.

4.5 Invariances
Predictions should be unchanged under some transformations of the data, e.g. translation, rotation, scale, pitch, speed, etc. Invariances can be learned from data: SIFT (scale-invariant feature transform), cepstrum (speech recognition).

Encourage learning of invariances e.g. by
• augmentation of the training set,
• special regularization terms,
• invariance built into pre-processing,
• invariance built into the structure of the ANN (e.g. CNN).
(Did he say more about this?)

4.6 Convolutional Neural Networks (CNN)
CNNs are ANNs for specialized applications like image recognition. They are robust against transformations (translations, rotations, scaling). The hidden layer(s) closest to the input layer share parameters: each hidden unit only depends on "close-by" inputs (e.g. pixels), and the weights are constrained to be identical across all units on the layer. This reduces the number of parameters and encourages robustness against (small amounts of) translation. The weights can still be optimized via backpropagation.

Output dimension of CNNs when applying m different f × f filters to an n × n image with padding p and stride s is given by
L = (n + 2p − f)/s + 1.
For example, a 32 × 32 image with 5 × 5 filters, padding p = 2 and stride s = 1 gives L = (32 + 4 − 5)/1 + 1 = 32, i.e. the spatial size is preserved.

Pooling aggregates several units to decrease the width of the network and hence the number of parameters. Usually either average or maximum values are used.

4.7 ANNs vs. tanh-kernels
The class of functions that can be modeled with ANNs or tanh-kernels is the same. This does not mean the trained models are the same: kernel optimization problems are linear and hence convex, whereas the ANN optimization problem optimizes θ and w jointly, which makes it non-convex. Kernels have a robustness that ANNs do not exhibit; however, ANNs have more degrees of freedom. Noisy data is handled better by kernels because of their robustness.

II Unsupervised Learning

Learning without labels. Typically used for exploratory data analysis, e.g. finding patterns or visualizations.

5 Clustering: k-means
The unsupervised analog to classification.

Clustering Given data points, group them into clusters such that similar points are in the same cluster and dissimilar points are in different clusters. Points are typically represented either in (high-dimensional) Euclidean space or with distances specified by a metric or kernel. Clustering is related to anomaly/outlier detection.

Standard approaches to clustering
• Hierarchical clustering: build a tree (bottom-up or top-down) representing distances among data points. Examples include single- and average-linkage clustering.
• Partitional approaches: define and optimize a notion of "cost" over partitions. Examples include spectral clustering and graph-cut based approaches.
• Model-based approaches: maintain cluster "models" and infer cluster membership (e.g. assign each point to the closest center). Examples include k-means and Gaussian mixture models.

k-means optimization problem assumes points in Euclidean space, x_i ∈ R^d, represents clusters by centers μ_j ∈ R^d, and assigns each point to its closest center (Voronoi partition). The goal is to minimize the average squared distance,
R̂(μ) = R̂(μ_1, ..., μ_k) = ∑_{i=1}^n min_{j ∈ {1,...,k}} ‖x_i − μ_j‖_2^2,  μ̂ = argmin_μ R̂(μ).
This optimization problem is non-convex and NP-hard.

Algorithm 12 k-means algorithm (Lloyd's heuristic)
1: μ^(0) = {μ_1^(0), ..., μ_k^(0)}  ▷ initialize cluster centers
2: while not converged, t = t + 1 do
3:   z_i = argmin_{j ∈ {1,...,k}} ‖x_i − μ_j^(t−1)‖_2^2
4:   ▷ assign each point x_i to the closest center
5:   μ_j^(t) = 1/n_j ∑_{i: z_i = j} x_i
6:   ▷ update each center as the mean of its assigned data points

Properties and challenges of k-means The algorithm is guaranteed to monotonically decrease the average squared distance in each iteration, and it converges to a local minimum. Complexity per iteration is O(nkd).

Initializing k-means Different approaches: multiple random restarts, the farthest-point heuristic (often works well, but prone to outliers), or seeding with k-means++.

Algorithm 13 k-means++ algorithm for initializing centers
1: i_1 ∼ Unif{1, ..., n}
2: μ_1 = x_{i_1}
3: for j = 2 : k do
4:   i_j = i with probability d(x_i, μ_{1:j−1})^2 / ∑_{i=1}^n d(x_i, μ_{1:j−1})^2, where d(x_i, μ_{1:j−1})^2 = min_{1 ≤ l ≤ j−1} ‖x_i − μ_l‖^2
5:   μ_j = x_{i_j}
6: continue with the standard k-means algorithm
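A minimal numpy sketch of Lloyd's heuristic (Algorithm 12) with random initialization; the convergence test and the assumption that no cluster becomes empty are simplifications for illustration.

import numpy as np

def kmeans(X, k, T=100, seed=0):
    """Lloyd's heuristic: alternate assignments and center updates."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]    # initialize centers
    for t in range(T):
        # assign each point to its closest center
        z = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1).argmin(axis=1)
        # update each center as the mean of its assigned points
        # (assumes every cluster keeps at least one point)
        new_mu = np.array([X[z == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_mu, mu):                      # converged
            break
        mu = new_mu
    return mu, z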
Model selection (i.e. determining the number of clusters) is difficult. Approaches include heuristic quality measures, regularization (favor "simple" models with few parameters by penalizing complex ones), and information-theoretic criteria (a tradeoff between robustness (stability) and informativeness). One heuristic for determining k is to pick k such that increasing it further leads to only a negligible decrease in loss.

Challenges with k-means Generally the algorithm only converges to a local minimum (dependence on initialization), the number of iterations needed may be exponential (though rarely in practice), determining the number of clusters k is difficult, and clusters of arbitrary shape cannot be modeled well.

Don't use cross-validation here, because there is a strong correlation between training accuracy and test/validation accuracy. (Explain, please.)

Nonlinear k-means Applying k-means to kernel principal components is sometimes called kernel k-means or spectral clustering.

6 Dimension Reduction
The unsupervised analog to regression. Given a data set D = {x_1, ..., x_n}, obtain an "embedding" (low-dimensional representation) z_1, ..., z_n ∈ R^k.

Typical approaches Assume D = {x_1, ..., x_n} ⊆ R^d and obtain a mapping f : R^d → R^k with k ≪ d. One can distinguish linear dimension reduction f(x) = Ax from nonlinear dimension reduction, and parametric from non-parametric approaches.

6.1 Linear Dimension Reduction: PCA
PCA optimization problem
(W*, z_1*, ..., z_n*) = argmin ∑_{i=1}^n ‖W z_i − x_i‖_2^2,
such that W ∈ R^{d×k} is orthogonal and z_1, ..., z_n ∈ R^k.

Theorem 6.1 (PCA) Let Σ = 1/n ∑_{i=1}^n x_i x_i^T = ∑_{i=1}^d λ_i v_i v_i^T, λ_1 ≥ ... ≥ λ_d ≥ 0, be the empirical covariance, and assume μ = 1/n ∑_i x_i = 0. The linear dimension reduction optimization problem is equivalent to
W* = argmax_W W^T Σ W,  z_i* = (W*)^T x_i,
and its solution is given by
W* = (v_1 | ... | v_k),  z_i* = (W*)^T x_i.
(Proof?)

PCA via SVD From the SVD X = U S V^T, with X ∈ R^{n×d}, U ∈ R^{n×n} and V ∈ R^{d×d} orthogonal, and S ∈ R^{n×d} diagonal, the top k principal components are the first k columns of V, since Σ = 1/n V S^T S V^T.

6.2 Nonlinear Dimension Reduction
6.2.1 Kernel PCA
Theorem 6.2 For w* there exist α_i such that w* = ∑_{i=1}^n α_i x_i.

1D Kernel PCA The 1D kernel PCA problem requires solving
α* = argmax_{α^T K α = 1} α^T K^T K α
with closed-form solution α* = (1/√λ_1) v_1 from the eigendecomposition K = ∑_{i=1}^n λ_i v_i v_i^T, λ_1 ≥ ... ≥ λ_n ≥ 0.

Kernel PCA For general k ≥ 1, the kernel principal components are given by α^(1), ..., α^(k) ∈ R^n, where
α^(i) = (1/√λ_i) v_i
is obtained from K = ∑_{i=1}^n λ_i v_i v_i^T, λ_1 ≥ ... ≥ λ_n ≥ 0. Given this, a new point x is projected to z ∈ R^k with
z_i = ∑_{j=1}^n α_j^(i) k(x, x_j).

Notes on kernel PCA
• Complexity grows with the number of data points.
• Cannot easily "explicitly" embed high-dimensional data (unless we have an appropriate kernel).
• Kernel PCA corresponds to applying PCA in the feature space induced by the kernel k.
• Can be used to derive non-linear feature maps in closed form. These can be used as inputs, e.g. to SVMs, giving "multilayer SVMs". (What does that mean?)
• One may want to center the kernel: with E = 1/n [1, ..., 1][1, ..., 1]^T, use K' = K − KE − EK + EKE. (And this?)

6.2.2 Autoencoders
The idea is to learn the identity function x ≈ f(x; θ) = f_2(f_1(x; θ_1); θ_2), where f_1 : R^d → R^k is the encoder and f_2 : R^k → R^d is the decoder.

Neural network autoencoders are ANNs with one output unit for each of the d input units, where the number k of hidden units is usually smaller than the number of inputs (compression effect). The goal is to optimize the weights such that the output agrees with the input, for example by minimizing the squared loss.

Training autoencoders For example, minimize the squared loss
min_W ∑_{i=1}^n ‖x_i − f(x_i; W)‖_2^2
using SGD (backpropagation). Initialization matters and is challenging, cf. work on pretraining, e.g. layerwise training of restricted Boltzmann machines.

6.2.3 Other
There are other nonlinear methods like locally linear embedding (LLE) or multi-dimensional scaling.

6.3 Autoencoders vs. PCA
If the activation function is the identity φ(z) = z, then fitting an NN autoencoder is equivalent to PCA.

6.4 PCA vs. k-Means
After reformulating the optimization problems, both differ mostly in their constraints.

PCA problem
(W, z_1, ..., z_n) = argmin ∑_{i=1}^n ‖W z_i − x_i‖_2^2,
where W ∈ R^{d×k} is orthogonal and z_1, ..., z_n ∈ R^k.

k-means problem
(W, z_1, ..., z_n) = argmin ∑_{i=1}^n ‖W z_i − x_i‖_2^2,
where W ∈ R^{d×k} is arbitrary and z_1, ..., z_n ∈ E_k = {(1, 0, ..., 0)^T, ..., (0, ..., 0, 1)^T}, the set of unit vectors in R^k.
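A short numpy sketch of the "PCA via SVD" route above; it assumes the rows of X are already centered, as required by Theorem 6.1.

import numpy as np

def pca(X, k):
    """Top-k principal components and embeddings via SVD (X assumed centered)."""
    U, S, Vt = np.linalg.svd(X, full_matrices=False)   # X = U S V^T
    W = Vt[:k].T                                       # first k columns of V
    Z = X @ W                                          # embeddings z_i = W^T x_i
    return W, Z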
III Probabilistic modeling

General approach to probabilistic modeling
(i) Start with a statistical assumption on the data; mostly, data points are modeled as i.i.d. (can be relaxed).
(ii) Choose a likelihood function (e.g. Gaussian, Student-t, logistic, exponential). This defines the loss function.
(iii) Choose a prior (e.g. Gaussian, Laplace, exponential). This defines the regularizer.
(iv) Optimize for the MAP parameters.
(v) Choose hyperparameters (e.g. the variance) through cross-validation.
(vi) Make predictions via Bayesian decision theory.

7 Probabilistic Modeling, Bias-variance tradeoff
We would like to take a statistical perspective on supervised learning and statistically model the data, e.g. quantify uncertainty or express prior knowledge/assumptions about the data. Many of the previous approaches from Part I can be interpreted as fitting probabilistic models.

Fundamental assumption: i.i.d. Our data is independently and identically distributed, (x_i, y_i) ∼ P(X, Y).

Problem formulation We would like to identify a hypothesis h : X → Y that minimizes the prediction error (risk)
R(h) = ∫ P(x, y) l(y; h(x)) dx dy = E_{x,y}[l(y; h(x))],
defined in terms of a loss function.

Risk In least squares regression, the risk is R(h) = E_{x,y}[(y − h(x))^2].

7.1 Parametric Estimation
Bayes' optimal predictor for the squared loss, assuming we knew P(X, Y): minimizing the least squares risk leads to the hypothesis h* given by the conditional mean,
h*(x) = E[Y | X = x].  (Bayes optimal predictor)

Estimating conditional distributions We know which h* minimizes the risk, so one strategy for estimating a predictor from training data is to estimate the conditional distribution
P̂(Y | X)  (conditional distribution)
and then, for a test point x, to predict the label
ŷ = Ê[Y | X = x] = ∫ y P̂(y | X = x) dy.

Parametric estimation P̂(Y | X, θ) is a common approach: choose a particular parametric form with parameters θ, then optimize the parameters, for example by maximum likelihood estimation.

7.2 Least Squares Regression = Gaussian Maximum Likelihood Estimation (MLE)
Maximum (conditional) likelihood estimation is a method for optimizing the parameters,
θ* = argmax_θ P̂(y_1, ..., y_n | x_1, ..., x_n, θ) = argmax_θ ∏_{i=1}^n P̂(y_i | x_i, θ)  (by the i.i.d. assumption)
⇔ θ* = argmax_θ log P̂(y_{1:n} | x_{1:n}, θ) = argmax_θ ∑_{i=1}^n log P̂(y_i | x_i, θ).

MLE = least squares With the assumption of Gaussian noise, i.e. y_i ∼ N(w^T x_i, σ^2), equivalently y_i = w^T x_i + ε_i with ε_i ∼ N(0, σ^2), maximizing the likelihood is equivalent to least squares estimation,
argmax_w P(y_1, ..., y_n | x_1, ..., x_n, w) = argmin_w ∑_{i=1}^n (y_i − w^T x_i)^2,
since log N(y_i; w^T x_i, σ^2) = −(y_i − w^T x_i)^2/(2σ^2) − log √(2πσ^2), and only the first term depends on w.

MLE for i.i.d. Gaussian noise Suppose H = {h : X → R} is a class of functions. Assuming that P(Y = y | X = x) = N(y; h*(x), σ^2) (what does the argument of the mean, y | h*(x), mean here?) for some function h* : X → R and some σ^2 > 0, the MLE for data D = {(x_1, y_1), ..., (x_n, y_n)} in H is given by
ĥ = argmin_{h ∈ H} ∑_{i=1}^n (y_i − h(x_i))^2.

MLE properties for n → ∞ (for finite n we must still avoid overfitting):
• consistency: the parameter estimate converges to the true parameters in probability,
• asymptotic efficiency: smallest variance among all "well-behaved" estimators for large n,
• asymptotic normality.

7.3 Bias-Variance Tradeoff
Prediction error decomposition The prediction error decomposes as bias^2 + variance + noise,
E_D E_{X,Y}[(Y − ĥ_D(X))^2] = E_X[(E_D ĥ_D(X) − h*(X))^2] + E_X E_D[(ĥ_D(X) − E_{D'} ĥ_{D'}(X))^2] + E_{X,Y}[(Y − h*(X))^2].

Bias, variance and noise in estimation The MLE solution depends on the training data, ĥ = ĥ_D = argmin_{g ∈ H} ∑_{(x,y) ∈ D} (y − g(x))^2, but the training data D itself is random, drawn i.i.d. from P(X, Y). We therefore consider E_D[ĥ_D(X)].

Bias is an error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting). The squared bias is expressed as E_X[(E_D ĥ_D(X) − h*(X))^2].

Variance is an error from sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data rather than the intended outputs (overfitting). The variance is expressed as E_X V_D[ĥ_D(X)] = E_X E_D[(ĥ_D(X) − E_{D'} ĥ_{D'}(X))^2].

Noise is the risk incurred by the optimal model, i.e. the irreducible error, constant with respect to model complexity. The noise is expressed as E_{X,Y}[(Y − h*(X))^2].

(J: I think we should talk about these formal formulations of bias, variance and noise again.)

Bias and variance in regression The MLE (= least squares fit) for linear regression is unbiased (if h is in the class H) and is the minimum-variance estimator among all unbiased estimators. However, least squares solutions may overfit. Thus, trade (a little bit of) bias for a (potentially dramatic) reduction in variance, i.e. regularize (ridge, Lasso, etc.).

7.4 Ridge Regression = Maximum A Posteriori (MAP) Estimation
A posteriori estimate Introduce bias by expressing assumptions on the parameters through a Bayesian prior, e.g. θ ∼ N(0, β^2 I_d). The posterior distribution of θ is then given by Bayes' rule,
P(θ | x, y) = P(θ) P(y | x, θ) / P(y | x).  (posterior)
Maximizing the a posteriori estimate of the parameters θ leads to the ridge regression problem,
argmax_θ log P(θ | x, y) = argmin_θ [−log P(θ) − log P(y_{1:n} | x_{1:n}, θ)]
= argmin_θ 1/(2β^2) ‖θ‖_2^2 + 1/(2σ^2) ∑_{i=1}^n (y_i − θ^T x_i)^2,
since the term log P(y | x) does not depend on θ.

Ridge regression = MAP Ridge regression can be understood as finding the maximum a posteriori (MAP) parameter estimate for a linear regression problem, assuming that the noise P(y | x, θ) is i.i.d. Gaussian and the prior P(θ) on the model parameters θ is Gaussian.

Regularization and MAP inference More generally, regularized estimation can often be understood as MAP inference,
argmin_w ∑_{i=1}^n l(w^T x_i; x_i, y_i) + C(w) = argmax_w ∏_i P(y_i | x_i, w) P(w) = argmax_w P(w | D),
where C(w) = −log P(w) and l(w^T x_i; x_i, y_i) = −log P(y_i | x_i, w). This perspective allows changing priors (= regularizers) and likelihoods (= loss functions).

7.5 Examples for other Priors and Likelihood Functions
Laplace prior = L1-regularization
P(x; μ, b) = 1/(2b) exp(−|x − μ|/b)

One can introduce robustness by changing the likelihood (= loss) function.

Student-t likelihood
P(y | x, w, ν, σ^2) = Γ((ν+1)/2) / (√(πνσ^2) Γ(ν/2)) · (1 + (y − w^T x)^2/(νσ^2))^{−(ν+1)/2}
Compared with the Gaussian distribution, outliers are penalized less, because the Student-t likelihood decreases algebraically (P(|y − μ| > t) = O(t^{−α})), whereas the Gaussian likelihood decreases exponentially (P(|y − μ| > t) = O(e^{−t^2})). Thus, the Student-t likelihood might be better for data with extreme outliers.

8 Classification: Logistic regression
There are no natural statistical models for classification.

Link function for logistic regression
σ(w^T x) = 1/(1 + exp(−w^T x))
(What is a link function?)

Logistic regression replaces the assumption of Gaussian noise (squared loss) by i.i.d. Bernoulli noise,
P(y | x, w) = Ber(y; σ(w^T x)).
The parameters w can be estimated via MLE or MAP estimation.

MLE for logistic regression is the convex optimization problem
ŵ = argmax_w P(y_{1:n} | w, x_{1:n}) = argmin_w ∑_{i=1}^n log(1 + exp(−y_i w^T x_i)).

Logistic loss gradient
∇_w l(w) = −yx / (exp(y w^T x) + 1) = −yx P̂(Y = −y | w, x),
i.e. the gradient is large if the model w is "surprised" by y.

Algorithm 14 SGD for logistic regression
1: initialize w
2: for t = 1, 2, ... do
3:   pick a data point (x, y) uniformly at random from the data D
4:   compute the probability of misclassification with the current model,
     P̂(Y = −y | w, x) = 1/(1 + exp(y w^T x))
5:   take a gradient step,
     w ← w + η_t y x P̂(Y = −y | w, x)

Regularizers can be introduced by estimating the MAP instead of solving the MLE problem. For the respective priors, we get the following optimization problems:
• L2 (Gaussian): argmin_w ∑_{i=1}^n log(1 + exp(−y_i w^T x_i)) + λ‖w‖_2^2
• L1 (Laplace): argmin_w ∑_{i=1}^n log(1 + exp(−y_i w^T x_i)) + λ‖w‖_1

Algorithm 15 SGD for L2-regularized logistic regression
1: initialize w
2: for t = 1, 2, ... do
3:   pick a data point (x, y) uniformly at random from the data D
4:   compute the probability of misclassification with the current model,
     P̂(Y = −y | w, x) = 1/(1 + exp(y w^T x))
5:   take a gradient step,
     w ← w(1 − 2λη_t) + η_t y x P̂(Y = −y | w, x)

8.1 Regularized Logistic Regression
Regularized logistic regression Find the optimal weights by minimizing logistic loss + regularizer,
ŵ = argmin_w ∑_{i=1}^n log(1 + exp(−y_i w^T x_i)) + λ‖w‖_2^2  (learning)
= argmax_w P(w | x_1, ..., x_n, y_1, ..., y_n).
Use the conditional distribution
P(y | x, ŵ) = 1/(1 + exp(−y ŵ^T x))  (classification)
e.g. to predict the more likely class.

Generalizations Logistic regression may be kernelized, there exist natural multi-class variants, and one can apply the logistic loss function to neural networks in order to have them output probabilities.
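A minimal numpy sketch of Algorithms 14/15; with lam = 0 it reduces to plain SGD for logistic regression. The learning rate and iteration count are illustrative assumptions, and labels are assumed to be in {−1, +1}.

import numpy as np

def logistic_sgd(X, y, lam=0.0, eta=0.1, T=10000, seed=0):
    """SGD for (optionally L2-regularized) logistic regression."""
    n, d = X.shape
    w = np.zeros(d)
    rng = np.random.default_rng(seed)
    for t in range(T):
        i = rng.integers(n)                                  # random data point
        p_wrong = 1.0 / (1.0 + np.exp(y[i] * (w @ X[i])))    # P(Y = -y | w, x)
        w = w * (1 - 2 * lam * eta) + eta * y[i] * X[i] * p_wrong
    return w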
8.2 Kernelized Logistic Regression
Kernelized logistic regression Find the optimal weights by minimizing logistic loss + regularizer,
α̂ = argmin_{α ∈ R^n} ∑_{i=1}^n log(1 + exp(−y_i α^T k_i)) + λ α^T K α.  (learning)
Use the conditional distribution
P̂(y | x, α̂) = 1/(1 + exp(−y ∑_{j=1}^n α̂_j k(x_j, x)))  (classification)
e.g. to predict the more likely class.

8.3 Multi-class Logistic Regression
Multi-class Maintain one weight vector per class and model
P(Y = y | x, w_1, ..., w_c) = exp(w_y^T x) / ∑_{j=1}^c exp(w_j^T x).
By setting w_c = 0 we force uniqueness and can recover logistic regression as a special case.

Cross-entropy loss is given by
l(y; x; w_1, ..., w_c) = −log P(Y = y | x, w_1, ..., w_c).

8.4 Comparison
SVM vs. logistic regression SVM/Perceptron sometimes has higher classification accuracy and produces sparse solutions, but cannot easily give class probabilities. Logistic regression can give class probabilities, but produces dense solutions.

Outlook: Bayesian learning
• Optimization-based learning such as MAP or MLE, i.e. ŵ = argmax_w P(w | D) with predictions P(y | x, ŵ), ignores the uncertainty in the model, but optimization is typically efficient.
• Integration-based learning / Bayesian model averaging, i.e. P(y | x) = ∫ P(y | x, w) P(w | D) dw, quantifies the uncertainty in the model, but integration is typically intractable.

9 Bayesian Decision Theory
Idea Given a conditional distribution over labels P(y | x) with y ∈ Y, a set of actions A, and a cost function C : Y × A → R, Bayesian decision theory recommends picking the action that minimizes the expected cost,
a* = argmin_{a ∈ A} E_y[C(y, a) | x] = argmin_{a ∈ A} ∑_y P(y | x) C(y, a).
If we had access to the true distribution P(y | x), this decision would implement the Bayes optimal decision. In practice one can only estimate it, e.g. via (logistic) regression.

Logistic regression Estimated conditional distribution P̂(y | x) = Ber(y; σ(ŵ^T x)), action set A = {+1, −1}, and cost function C(y, a) = [y ≠ a]. Then the action that minimizes the expected cost is the most likely class,
a* = argmin_{a ∈ A} E_y[C(y, a) | x] = argmax_y P̂(y | x) = sign(ŵ^T x).

LS regression Estimated conditional distribution P̂(y | x) = N(y; ŵ^T x, σ^2), action set A = R, and cost function C(y, a) = (y − a)^2. Then the action that minimizes the expected cost is the conditional mean,
a* = argmin_{a ∈ A} E_y[C(y, a) | x] = E_y[y | x] = ∫ y P̂(y | x) dy = ŵ^T x.

9.1 Asymmetric Costs
Asymmetric costs Estimated conditional distribution P̂(y | x) = Ber(y; σ(ŵ^T x)), action set A = {+1, −1}, and cost function
C(y, a) = c_FP if y = −1 and a = +1,  c_FN if y = +1 and a = −1,  0 otherwise.
Then the action that minimizes the expected cost is
a* = +1 if P̂(y = +1 | x) > c_FP/(c_FP + c_FN), and −1 otherwise.

Doubtful logistic regression Estimated conditional distribution P̂(y | x) = Ber(y; σ(ŵ^T x)), action set A = {+1, −1, D} with a "doubtful" action D, and cost function
C(y, a) = [y ≠ a] for a ∈ {+1, −1},  c for a = D.
Then the action that minimizes the expected cost is
a* = y if P̂(y | x) ≥ 1 − c, and D otherwise,
i.e. pick the most likely class only if confident enough.

Asymmetric cost for regression Estimated conditional distribution P̂(y | x) = N(y; ŵ^T x, σ^2), action set A = R, and cost function C(y, a) = c_1 max(y − a, 0) + c_2 max(a − y, 0). Setting the derivative of the expected cost to zero, c_1 P̂(y > a | x) = c_2 P̂(y < a | x), shows that the optimal action is the c_1/(c_1 + c_2)-quantile,
a* = argmin_{a ∈ A} E_y[C(y, a) | x] = ŵ^T x + σ Φ^{−1}(c_1/(c_1 + c_2)),
where Φ is the standard Gaussian CDF.

9.2 Uncertainty Sampling
Outlook: active learning We would like to minimize the number of labels. A simple strategy is to always query the label we are most uncertain about. Estimate P̂(y | x) from the data seen so far, and compute p_i = P̂(y = +1 | x_i) for each unlabeled data point x_i. If p_i ≈ 1 or p_i ≈ 0, the model is certain; if p_i ≈ 0.5, the model is uncertain. Define an uncertainty score U_i = −|p_i − 0.5|, find the most uncertain data point i* = argmax_i U_i, and ask for this label. For linear regression, U_i = |w^T x_i|.

Comments Active learning violates the i.i.d. assumption and can get stuck with bad models; more advanced selection criteria are available, e.g. querying the point that reduces the uncertainty of the other points as much as possible.
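Before the uncertainty-sampling algorithm, a tiny sketch of the asymmetric-cost decision rule from Section 9.1; the cost values are illustrative assumptions.

def bayes_decision(p_plus, c_fp, c_fn):
    """Predict +1 iff P(y = +1 | x) exceeds the cost ratio c_FP / (c_FP + c_FN)."""
    return +1 if p_plus > c_fp / (c_fp + c_fn) else -1

# If false negatives are nine times as costly as false positives,
# +1 is already optimal once P(y = +1 | x) exceeds 0.1:
assert bayes_decision(0.2, c_fp=1.0, c_fn=9.0) == +1
assert bayes_decision(0.05, c_fp=1.0, c_fn=9.0) == -1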
Algorithm 16 Uncertainty sampling (i) Learning given Data D = {(x1 , y1 ), . . . , (xn , yn )} using MLE
1: Pool of unlabeled examples Dx = {x1 , . . . , xn } MLE for class prior P̂(Y = y) = p̂y =
# [Y = y ]
2: Also maintain an empty data set D, initially empty n
3: for t = 1, 2, 3, . . . do MLE for feature distribution P̂( xi |y) = N ( xi ; µ̂y,i , σ̂y,i
2 )
4: Estimate P̂(Yi |xi ) given current data D
5: Pick most uncertain unlabeled example µ̂i,y = #[Y1=y] ∑ j:y j =y x j,i , σ̂i,y
2 = 1
( x − µ̂y,i )2
#[Y =y] ∑ j:y j =y j,i
Warum kein -1? Ist das nicht biased? J: wo ein -1?
it ∈ argmin |0.5 − P̂(Yi |xi )|
i (ii) Prediction given new data point x
1
6: Query label yit P̂(y| x ) =
Z
P( y )P(x| y ) Z= ∑ P( y )P(x| y )
7: Set D ← D ∪ {(xi , yit )} y
d
y = argmax P̂(y0 |x) = argmax P̂(y0 ) ∏ P̂( xi |y0 ).
y0 y0 i =1
10 Generative Modeling
10.1 Discriminative vs. Generative Modeling 10.2.2 Gaussian NB vs. Logistic Regression
Assumptions are P(Y = 1) = 0.5 (mild assumption) and inde-
Disciminative vs. generative modeling
pendent variance, i.e. P(x|y) = ∏i N ( xi ; µy,i , σ + i2 ) (rather strong
• Discriminative models aim to estimate P(y|x) assumption).
• Generative models aim to estimate the joint distribution P(y, x) Discriminant function Decision rule forbinary classification y =
0 , x) is equivalent to y = sign log P(Y =+1|x) = sign f (x),

Discriminative models can be derived from generative models by argmax y 0 P̂ ( y P(Y =−1|x)
where f is called the discriminant function.
P(x, y) P(x, y)
P( y |x) = = . GNB recovers linear classifier With the above assumptions Gaus-
P(x) ∑y P( x, y)
sian Naive Bayes produces a linear classifier
Discriminative models do not have access to P(x) and thus will not P (Y = + 1 | x )
be able to detect outliers, i.e. points for which p(xi ) is small. f (x) = log = w T x + w0 with
P (Y = − 1 | x )
Typical approach to generative modeling p̂+ d µ̂2 − µ̂2 µ+,i − µ−,i
w0 = log + ∑ −,i 2 +,i , wi = .
(i) Estimate prior on labels P(Y = y) 1 − P̂+ i=1 2σ̂i σi2
(ii) Estimate conditional distribution P(X|Y = y) for each class y
Corresponding class distribution is of the same form as logistic
(iii) Obtain predictive distribution using Bayes’ rule
regression, i.e.
1 1
P( y |x) = P( y )P(x| y ) P (Y = + 1 | x ) = = σ ( w T x + w0 ) .
Z 1 + exp(− f (x))

Conclusion If model assumptions are met, GNB will make the same
predictions as Logistic Regression.
10.2.3 Issues with Naive Bayes models
Overconfidence Conditional independence assumption means that
features are generated independently given class label. Thus, predic-
tions may become overconfident.
Example For duplicate data points x2 = x3 = . . . = xd = x1 unser
certain assumptions, NBM predicts p1 (x) = 1+exp1( f (x)) and pk (x) =
1
1
This gives p1 (ε) ≈ 0.5 + ε, but pt (ε) ≈ 01 for large d
1+exp(d· f 1 (x))
.
(overconfidence).
10.2.2 Gaussian NB vs. Logistic Regression
Assumptions are P(Y = 1) = 0.5 (mild assumption) and class-independent variances, i.e. P(x|y) = ∏_i N(x_i; µ_{y,i}, σ_i²) (rather strong assumption).

Discriminant function The decision rule for binary classification, y = argmax_{y'} P̂(y'|x), is equivalent to y = sign log [P(Y = +1|x)/P(Y = −1|x)] = sign f(x), where f is called the discriminant function.

GNB recovers a linear classifier With the above assumptions, Gaussian Naive Bayes produces a linear classifier:

    f(x) = log [P(Y = +1|x)/P(Y = −1|x)] = wᵀx + w_0   with
    w_0 = log (p̂_+/(1 − p̂_+)) + ∑_{i=1}^d (µ̂²_{−,i} − µ̂²_{+,i})/(2σ̂_i²),   w_i = (µ̂_{+,i} − µ̂_{−,i})/σ̂_i².

The corresponding class distribution is of the same form as logistic regression, i.e.

    P(Y = +1|x) = 1/(1 + exp(−f(x))) = σ(wᵀx + w_0).

Conclusion If the model assumptions are met, GNB will make the same predictions as logistic regression.

10.2.3 Issues with Naive Bayes models
Overconfidence The conditional independence assumption means that features are generated independently given the class label. Thus, predictions may become overconfident.

Example For duplicated features x_2 = x_3 = ... = x_d = x_1, under certain assumptions the naive Bayes model predicts p_1(x) = 1/(1 + exp(f_1(x))) with one feature but p_d(x) = 1/(1 + exp(d · f_1(x))) with all d copies. This gives p_1(ε) ≈ 0.5 + ε, but p_d(ε) ≈ 0 or 1 for large d (overconfidence).

10.2.4 Categorical Naive Bayes for discrete Features
Setting Model features by (conditionally) independent categorical random variables P(X_i = x | Y = y) = θ^{(i)}_{x|y} such that θ^{(i)}_{x|y} ≥ 0 for all i, x, y and ∑_{x=1}^c θ^{(i)}_{x|y} = 1 for all i, y.

MLE estimation given a dataset D = {(x_1, y_1), ..., (x_n, y_n)}
• for the class label distribution is P̂(Y = y) = p̂_y = (1/n) ∑_j 1[y_j = y],
• for the feature distributions is P̂(X_i = c|y) = θ̂^{(i)}_{c|y} = ∑_j 1[x_{j,i} = c, y_j = y] / ∑_j 1[y_j = y].

Lifting the independence assumption requires specifying the probability of every possible categorical feature configuration, i.e. a number of parameters exponential in d, which is computationally intractable and a fantastic way to overfit.

10.2.5 Discrete and categorical Features
The (naive) Bayes classifier does not require each feature to follow the same type of conditional distribution, e.g. model some features as categorical and some as Gaussian:

    X_{1:10} discrete:   P(x_i|y) = Categorical(x_i | θ^{(i)}_y)
    X_{11:20} Gaussian:  P(x_i|y) = N(x_i; µ_{i|y}, σ²_{i|y})
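A count-based sketch of the categorical MLE above; the layout of the θ array and the optional pseudo-count argument alpha (which anticipates the priors of Section 10.5) are our choices, not part of the lecture:

```python
import numpy as np

def fit_categorical_nb(X, y, n_classes, n_values, alpha=0.0):
    """Count-based MLE for categorical naive Bayes.
    X: (n, d) integers in {0, ..., n_values-1}; y: (n,) labels in {0, ..., n_classes-1}.
    alpha > 0 adds pseudo-counts, i.e. a Dirichlet/Beta-style conjugate prior."""
    n, d = X.shape
    priors = np.bincount(y, minlength=n_classes) / n        # p̂_y
    theta = np.full((d, n_values, n_classes), alpha)        # θ̂^{(i)}_{x|y}
    for i in range(d):
        for c in range(n_classes):
            theta[i, :, c] += np.bincount(X[y == c, i], minlength=n_values)
    theta /= theta.sum(axis=1, keepdims=True)               # normalize over feature values
    return priors, theta
```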
10.3 Gaussian Bayes Classifier

10.3.1 Model Description
Assumptions for the Bayes model are
(i) Class labels can be modeled as generated from a categorical variable P(Y = y) = p_y, y ∈ Y = {1, ..., c}.
(ii) Model features as generated by a multivariate Gaussian, P(x|y) = N(x; µ_y, Σ_y).

ML for Gaussian Bayes Classifier Given data D = {(x_1, y_1), ..., (x_n, y_n)}, the MLE for the feature distribution P̂(x|y) = N(x; µ̂_y, Σ̂_y) with estimators µ̂_y and Σ̂_y, and the MLE for the class prior P̂(Y = y), are given by

    P̂(Y = y) = p̂_y = #[Y = y]/n,
    µ̂_y = (1/#[Y = y]) ∑_{i: y_i = y} x_i,
    Σ̂_y = (1/#[Y = y]) ∑_{i: y_i = y} (x_i − µ̂_y)(x_i − µ̂_y)ᵀ.

(Note the order in the outer product: Σ̂_y is a d × d matrix.)

Discriminant function is given by

    f(x) = log (p/(1 − p)) + (1/2) [ log (|Σ̂_−|/|Σ̂_+|) + (x − µ̂_−)ᵀ Σ̂_−⁻¹ (x − µ̂_−) − (x − µ̂_+)ᵀ Σ̂_+⁻¹ (x − µ̂_+) ].
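These estimators translate directly into a few lines of numpy; a minimal sketch (names are ours):

```python
import numpy as np

def fit_gbc(X, y):
    """MLE for the Gaussian Bayes classifier: per-class prior, mean, full covariance."""
    classes = np.unique(y)
    priors, means, covs = [], [], []
    for c in classes:
        Xc = X[y == c]
        priors.append(len(Xc) / len(X))        # p̂_y = #[Y = y]/n
        mu = Xc.mean(axis=0)                   # µ̂_y
        means.append(mu)
        diff = Xc - mu
        covs.append(diff.T @ diff / len(Xc))   # Σ̂_y as a sum of outer products
    return classes, np.array(priors), np.array(means), np.array(covs)
```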
10.3.2 Fisher's Linear Discriminant Analysis (LDA)
Fisher's linear discriminant analysis (LDA) applies to binary classification (c = 2). Suppose p = 0.5 and Σ̂_+ = Σ̂_− = Σ̂; then the discriminant function again becomes a linear function f(x) = wᵀx + w_0 with

    w = Σ̂⁻¹ (µ̂_+ − µ̂_−)   and   w_0 = (1/2) µ̂_−ᵀ Σ̂⁻¹ µ̂_− − (1/2) µ̂_+ᵀ Σ̂⁻¹ µ̂_+.

Under these assumptions we predict y = sign f(x), which is called Fisher's linear discriminant analysis.

10.3.3 LDA vs. Logistic regression
The corresponding class distribution is of the same form as logistic regression, i.e.

    P(Y = +1|x) = 1/(1 + exp(−f(x))) = σ(wᵀx + w_0).

    Fisher's LDA                          Logistic Regression
    generative model                      discriminative model
    + can be used to detect outliers      − cannot detect outliers
    − assumes normality of x              + makes no assumptions on X
    − not robust against violation        + more robust
      of this assumption

Conclusion If the model assumptions are met, Fisher's LDA will make the same predictions as logistic regression.
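A small sketch of these closed-form weights; using a pooled covariance estimate for the shared Σ̂ is our choice of estimator here:

```python
import numpy as np

def lda_weights(X_pos, X_neg):
    """Fisher's LDA with p = 0.5 and a shared (pooled) covariance estimate."""
    mu_p, mu_n = X_pos.mean(axis=0), X_neg.mean(axis=0)
    dp, dn = X_pos - mu_p, X_neg - mu_n
    sigma = (dp.T @ dp + dn.T @ dn) / (len(X_pos) + len(X_neg))  # pooled Σ̂
    w = np.linalg.solve(sigma, mu_p - mu_n)                      # w = Σ̂⁻¹(µ̂₊ − µ̂₋)
    w0 = 0.5 * (mu_n @ np.linalg.solve(sigma, mu_n)
                - mu_p @ np.linalg.solve(sigma, mu_p))
    return w, w0   # predict y = sign(w @ x + w0)
```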
10.3.4 Gaussian Naive Bayes vs. General Gaussian Bayes Classifiers

    GNB models                              General GB models
    − conditional independence              + captures correlations
      assumption may lead to                  among features
      overconfidence
    + predictions might still be useful     + avoids overconfidence
    + #parameters = O(cd)                   − #parameters = O(cd²)
    + complexity (memory + inference)       − complexity quadratic in d
      linear in d

10.3.5 LDA vs. PCA
LDA can be viewed as a projection onto a 1D subspace that maximizes the ratio of between-class to within-class variance. In contrast, PCA (k = 1) maximizes the variance of the resulting 1D projection.

10.3.6 Quadratic Discriminant Analysis (QDA)
Quadratic discriminant analysis (QDA) uses the non-simplified discriminant function f(x) and predicts using y = sign f(x).

10.4 Outlier Detection
The data point probability can be calculated as

    P(x) = ∑_{y=1}^c P(y) P(x|y) = ∑_{y=1}^c p̂_y N(x|µ̂_y, Σ̂_y).    (GBC)

Outliers are points for which P(x) ≤ τ holds.
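A sketch of this density-threshold test on top of the fitted GBC parameters; tau is a hyperparameter to be tuned, e.g. on validation data:

```python
import numpy as np

def log_gaussian(x, mu, sigma):
    """log N(x; µ, Σ) for a single point x of shape (d,)."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(sigma, diff))

def is_outlier(x, priors, means, covs, tau):
    """Flag x as outlier if P(x) = Σ_y p̂_y N(x; µ̂_y, Σ̂_y) ≤ τ."""
    px = sum(p * np.exp(log_gaussian(x, m, S))
             for p, m, S in zip(priors, means, covs))
    return px <= tau
```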
10.5 Avoiding Overfitting: Introducing Priors
Avoiding overfitting can be done by
• restricting the model class to reduce the number of parameters, e.g. by further assumptions on the covariance structure as in Gaussian Naive Bayes,
• using priors.

Problems with MLE estimation In the extreme case n = 1, the estimator θ̂ = #[Y = 1]/n predicts θ̂ = 1/1 = 1 for D = {(y_1)}, y_1 = 1, i.e. it is overconfident.

Introducing priors by computing the posterior distribution

    P(θ | y_1, ..., y_n) = (1/Z) P(θ) P(y_{1:n}|θ),   Z = ∫ P(θ) P(y_{1:n}|θ) dθ.

Beta distribution Beta(θ; α_+, α_−) = (1/B(α_+, α_−)) θ^{α_+ − 1} (1 − θ)^{α_− − 1}.

Definition 10.1 (Conjugate distributions) A pair of prior distribution and likelihood function is called conjugate if the posterior distribution remains in the same family as the prior.

Beta conjugate With prior Beta(θ; α_+, α_−) and n_+ positive and n_− negative labels, the posterior distribution is Beta(θ; α_+ + n_+, α_− + n_−). Thus, α_+ and α_− act as pseudo-counts.

Beta MAP estimate

    θ̂ = argmax_θ P(θ | y_1, ..., y_n; α_+, α_−) = (α_+ + n_+ − 1)/(α_+ + n_+ + α_− + n_− − 2).
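A worked example with hypothetical numbers: with prior Beta(θ; 2, 2) and observations n_+ = 3, n_− = 1, the posterior is Beta(θ; 5, 3), so

    θ̂_MAP = (5 − 1)/(5 + 3 − 2) = 2/3,

compared to the MLE n_+/(n_+ + n_−) = 3/4. The pseudo-counts pull the estimate towards 1/2, and their influence vanishes as n_+ + n_− grows.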
Remarks Why conjugate priors? They can be used as regularizers at almost no extra computational cost. Choose the hyperparameters by cross-validation.

Examples of conjugate priors are listed below.
    Prior/Posterior                Likelihood function
    Beta                           Bernoulli/Binomial
    Dirichlet                      Categorical/Multinomial
    Gaussian (fixed covariance)    Gaussian
    Gaussian-inverse Wishart       Gaussian
    Gaussian process               Gaussian

11 Probabilistic Modeling of Unsupervised Learning: Latent Variable Modeling

We will focus on missing labels; the ideas may be applied to missing data as well.
11.1 Gaussian Mixture Models
Assumptions P(X, Y) is a Gaussian Bayes classifier: P(Y = y) = p_y and P(x|y) = N(x; µ_y, Σ_y). We also require i.i.d. data.

Gaussian mixtures are convex combinations of Gaussians,

    P(X = x|θ) = P(X = x|µ, Σ, w) = ∑_i w_i N(x; µ_i, Σ_i),

where w_i ≥ 0 and ∑_i w_i = 1.

Mixture modeling models each cluster j ∈ {1, ..., k} as a probability distribution P(x|θ_j). Using the i.i.d. assumption on the data, the likelihood of the data is

    P(D|θ) = ∏_{i=1}^n ∑_{j=1}^k w_j P(X = x_i|θ_j).

Optimization problem Minimizing the negative log-likelihood,

    (µ*, Σ*, w*) = argmin − ∑_i log ∑_{j=1}^k w_j N(x_i|µ_j, Σ_j)   while   ∑_{j=1}^k w_j = 1 and all Σ_j positive definite,

is non-convex and constrained. One could try to optimize it using (stochastic) gradient descent, but the constraints might be difficult to maintain. (A numeric sketch of this objective follows at the end of this subsection.)

Choosing k Same as for k-means. However, for GMMs cross-validation typically works fairly well.

GMMs for density estimation and not for clustering by, for example, modeling P(x) as a Gaussian mixture and P(y|x) using logistic regression, a neural network, etc. Then P(x, y) = P(x) P(y|x) is a valid model. This combines the accurate predictions and robustness of a discriminative model with the ability to detect outliers.

Anomaly detection with mixture models by comparing the estimated density of a data point against a threshold. If we do not have any examples of anomalies, this is challenging. If we do have some examples, we could try
• varying the threshold to trade off false positives against false negatives,
• using precision-recall curves/ROC curves as evaluation criteria, e.g. maximizing the F1-score.
This allows optimizing the threshold, e.g. via cross-validation.

Why are mixture models useful?
• They can encode assumptions about the shape of clusters, e.g. fit ellipses instead of points.
• They can be part of more complex statistical models, e.g. classifiers (or more general probabilistic models).
• Probabilistic models can output the likelihood P(x) of a point x, which can be useful for anomaly detection.
• They can naturally be used for semi-supervised learning.
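The sketch referenced above: evaluating the negative mixture log-likelihood, using a log-sum-exp for numerical stability; the function name and interface are ours:

```python
import numpy as np

def gmm_neg_log_likelihood(X, weights, means, covs):
    """−Σ_i log Σ_j w_j N(x_i; µ_j, Σ_j), the GMM training objective."""
    n, d = X.shape
    log_comp = np.empty((n, len(weights)))
    for j, (w, mu, S) in enumerate(zip(weights, means, covs)):
        diff = X - mu
        _, logdet = np.linalg.slogdet(S)
        maha = np.einsum('ni,ni->n', diff @ np.linalg.inv(S), diff)  # Mahalanobis terms
        log_comp[:, j] = np.log(w) - 0.5 * (d * np.log(2 * np.pi) + logdet + maha)
    m = log_comp.max(axis=1, keepdims=True)                          # log-sum-exp trick
    return -np.sum(m.squeeze(1) + np.log(np.exp(log_comp - m).sum(axis=1)))
```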
11.2 Expectation-Maximization Algorithm
Latent variables are variables that are not directly observed but are rather inferred (through a mathematical model) from other variables that are observed (directly measured). Mathematical models that aim to explain observed variables in terms of latent variables are called latent variable models. Concretely, this means that for each data point x_i we introduce a latent variable z_i denoting the class this point is assigned to.

The EM algorithm is an iterative method to find maximum likelihood (ML) or maximum a posteriori (MAP) estimates of parameters in statistical models where the model depends on unobserved latent variables. The EM iteration alternates between an expectation (E) step and a maximization (M) step, described below.

E-step creates a function Q for the expectation of the complete-data log-likelihood

    L(θ; x) = log P(X = x; θ) = log ∑_z P(X = x, Z = z; θ),

evaluated using the current estimate of the parameters, by computing

    γ_z(x) = P(Z = z | X = x, θ^{(t−1)}).

M-step computes parameters maximizing the expected log-likelihood found in the E-step by optimizing

    θ^{(t)} = argmax_θ Q(θ; θ^{(t−1)}),

    Q(θ, θ^{(t−1)}) = E_{Z|X=x, θ^{(t−1)}} [log P(X, Z|θ)]
                   = ∑_{i=1}^n E_{Z|X=x_i, θ^{(t−1)}} [log P(X = x_i, Z|θ)]
                   = ∑_{i=1}^n ∑_{z=1}^k P(Z = z|X = x_i, θ^{(t−1)}) log P(X = x_i, Z = z|θ)
                   = ∑_{i=1}^n ∑_{z=1}^k γ_z(x_i) log (P(Z = z) P(X = x_i|Z = z; θ)).

These parameter estimates are then used to determine the distribution of the latent variables in the next E-step. For GMMs, this procedure is equivalent to training a GBC with weighted data and admits a closed-form solution.
11.2.1 Hard-EM Algorithm
Fitting a GMM = training a GBC without labels. The idea is to repeatedly fill in or update the missing labels and then train on the resulting dataset. The algorithm assigns (only) a single label to each data point, which is why it is called hard.

Algorithm 17 Hard Expectation Maximization (EM)
1: Initialize the parameters θ^{(0)}
2: for t = 1, 2, 3, ... do
3:   E-Step: Predict the most likely class for each data point,
       z_i^{(t)} = argmax_z P(z|x_i, θ^{(t−1)}) = argmax_z P(z|θ^{(t−1)}) P(x_i|z, θ^{(t−1)}),
4:   and complete the data: D^{(t)} = {(x_1, z_1^{(t)}), ..., (x_n, z_n^{(t)})}.
5:   M-Step: compute the MLE as for the Gaussian Bayes classifier,
       θ^{(t)} = argmax_θ P(D^{(t)}|θ).

Problems with Hard-EM Points are assigned a fixed label even though the model is uncertain. This tries to extract too much information from a single point. In practice, this may work poorly if clusters are overlapping.
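A sketch of hard EM under the simplifying assumptions of Section 11.2.4 (uniform weights, identical spherical covariances), where the E-step reduces to nearest-mean assignment and the scheme recovers Lloyd's k-means heuristic:

```python
import numpy as np

def hard_em(X, k, n_iter=50, seed=0):
    """Hard EM for a GMM with uniform weights and spherical unit covariances."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)]   # init: random data points
    for _ in range(n_iter):
        # E-step: z_i = argmax_z P(z|x_i, θ) = argmin_z ||x_i − µ_z||
        z = np.argmin(((X[:, None, :] - means[None]) ** 2).sum(-1), axis=1)
        # M-step: MLE of each mean on its completed data (keep old mean if cluster empty)
        means = np.array([X[z == j].mean(axis=0) if np.any(z == j) else means[j]
                          for j in range(k)])
    return means, z
```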
11.2.2 Soft-EM Algorithm
Posterior probabilities Given a model P(z|θ), P(x|z, θ), we can compute a posterior distribution over cluster membership:

    γ_j(x) = P(Z = j|X = x, Σ, µ, w) = w_j P(X = x|Σ_j, µ_j) / ∑_l w_l P(X = x|Σ_l, µ_l).

MLE At the MLE (µ*, Σ*, w*) = argmin − ∑_i log ∑_{j=1}^k w_j N(x_i|µ_j, Σ_j), it must hold that

    µ_j* = ∑_{i=1}^n γ_j(x_i) x_i / ∑_{i=1}^n γ_j(x_i),
    Σ_j* = ∑_{i=1}^n γ_j(x_i) (x_i − µ_j*)(x_i − µ_j*)ᵀ / ∑_{i=1}^n γ_j(x_i),
    w_j* = (1/n) ∑_{i=1}^n γ_j(x_i).

Algorithm 18 Soft Expectation Maximization (EM)
1: while not converged do
2:   E-Step: calculate γ_j^{(t)}(x_i) for each i and j, given the estimates µ^{(t−1)}, Σ^{(t−1)}, w^{(t−1)} from the previous iteration.
3:   M-Step: fit the clusters to the weighted data points, i.e. calculate w_j^{(t)}, µ_j^{(t)}, and Σ_j^{(t)} via the stationarity equations above.
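A minimal numpy sketch of Algorithm 18; the initialization choices follow the suggestions in Section 11.2.3, and the optional reg argument adds ν²·I to each covariance as in the MAP regularization of Section 11.3:

```python
import numpy as np

def soft_em(X, k, n_iter=100, reg=0.0, seed=0):
    """Soft EM for a Gaussian mixture model (a sketch, not a robust implementation)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.full(k, 1.0 / k)                                    # uniform weights
    mu = X[rng.choice(n, size=k, replace=False)]               # means: random data points
    cov = np.tile(np.cov(X.T) + 1e-6 * np.eye(d), (k, 1, 1))   # empirical covariance
    for _ in range(n_iter):
        # E-step: responsibilities γ_j(x_i) ∝ w_j N(x_i; µ_j, Σ_j), in log space
        log_g = np.empty((n, k))
        for j in range(k):
            diff = X - mu[j]
            _, logdet = np.linalg.slogdet(cov[j])
            maha = np.einsum('ni,ni->n', diff @ np.linalg.inv(cov[j]), diff)
            log_g[:, j] = np.log(w[j]) - 0.5 * (d * np.log(2 * np.pi) + logdet + maha)
        log_g -= log_g.max(axis=1, keepdims=True)              # stabilize
        g = np.exp(log_g)
        g /= g.sum(axis=1, keepdims=True)
        # M-step: weighted MLE updates
        Nj = g.sum(axis=0)                                     # effective cluster sizes
        w = Nj / n
        mu = (g.T @ X) / Nj[:, None]
        for j in range(k):
            diff = X - mu[j]
            cov[j] = (g[:, j, None] * diff).T @ diff / Nj[j] + reg * np.eye(d)
    return w, mu, cov
```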
Constrained GMMs are special cases of Gaussian mixtures:
• Spherical: Σ_j = σ_j² · I, with #params = k
• Diagonal: Σ_j = diag(σ_{j,1}², ..., σ_{j,d}²), with #params = dk
• Tied: Σ_1 = ... = Σ_k, with #params = d(d+1)/2
• Full, with #params = k · d(d+1)/2

Discussion Soft EM will typically result in higher likelihood values, because it can deal better with "overlapping clusters".

11.2.3 Theory behind EM
Convergence of EM One can prove that the EM algorithm monotonically increases the likelihood, log P(x_{1:n}|θ^{(t)}) ≥ log P(x_{1:n}|θ^{(t−1)}). For Gaussian mixtures, EM is guaranteed to converge to a local optimum, but the quality of the solution highly depends on the initialization. A common strategy is to rerun the algorithm multiple times and use the solution with the largest likelihood.
Initialization
• for the weights: typically use a uniform distribution, w_i^{(0)} = 1/k for all i
• for the means: use random initialization or k-means++, i.e. pick µ_i^{(0)} = x_{j_i}
• for the variances: for example, initialize according to the empirical covariance of the data (perhaps restricted to spherical), Σ_1 = ... = Σ_k = (1/n) ∑_{i=1}^n (x_i − x̄)(x_i − x̄)ᵀ

Hard EM performs alternating optimization of the complete-data likelihood,

    z_{1:n}^{(t)} = argmax_{z_{1:n}} P(x_{1:n}, z_{1:n}|θ^{(t−1)}),   (E-step)
    θ^{(t)} = argmax_θ P(x_{1:n}, z_{1:n}^{(t)}|θ),   (M-step)

and converges to a local optimum of max_{z_{1:n},θ} P(x_{1:n}, z_{1:n}|θ).

EM more generally The EM algorithm is more widely applicable. It can be used whenever the E and M steps are tractable, i.e. whenever we can compute and maximize the complete-data likelihood. This can be used, for example, for imputing (some) missing features and for handling likelihoods beyond Gaussians (e.g. categorical ones).

11.2.4 EM vs. k-means
Assumptions are uniform weights over the mixture components and identical spherical covariance matrices.

Hard EM recovers k-means under the above assumptions: the steps in the hard EM algorithm become the same decisions as in Lloyd's heuristic,

    z_i^{(t)} = argmax_z P(z|x_i, θ^{(t−1)}) = argmin_z ‖x_i − µ_z^{(t−1)}‖,   (E-step)
    µ_j^{(t)} = (1/n_j) ∑_{i: z_i^{(t)} = j} x_i.   (M-step)

Soft EM recovers k-means under the above assumptions and additionally variances tending to 0, because for σ² → 0 it holds that

    γ_i(x) → 1 if µ_i is closest to x, and γ_i(x) → 0 otherwise.
11.3 Avoiding Overfitting with GMMs
Degeneracy For only one data point, the optimal log-likelihood −log P(x|µ, σ) = (1/2) log(2πσ²) + (1/(2σ²))(x − µ)² → −∞ for µ = x and σ² → 0, i.e. the minimization problem is not bounded from below. Thus, the "optimal" GMM chooses k = n and puts one Gaussian around each data point with variance tending to 0. This is overfitting.

Adding a Wishart prior to the covariance matrix and computing the MAP instead of the MLE can regularize the problem and thus avoid degeneracy (variances → 0). The corresponding update rule reads

    Σ_j* = ∑_{i=1}^n γ_j(x_i)(x_i − µ_j*)(x_i − µ_j*)ᵀ / ∑_{i=1}^n γ_j(x_i) + ν² I.

11.4 Gaussian-Mixture Bayes classifier
Given a labeled dataset D = {(x_1, y_1), ..., (x_N, y_N)} with labels y_i ∈ {1, ..., m}, estimate the class prior P(y) and the conditional distribution of each class as a Gaussian mixture model,

    P(x|y) = ∑_{j=1}^{k_y} w_j^{(y)} N(x; µ_j^{(y)}, Σ_j^{(y)}).

Classification is done by Bayes' rule,

    P(y|x) = (1/Z) P(y) ∑_{j=1}^{k_y} w_j^{(y)} N(x; µ_j^{(y)}, Σ_j^{(y)}).

11.5 Semi-supervised Learning with GMMs
We would like to combine unlabeled and labeled data. Semi-supervised learning is learning from large amounts of unlabeled data and small amounts of labeled data.

Modification to the EM algorithm The computation of γ also takes the labeled data points x_i into account:

    γ_j(x_i) = P(Z = j|x_i, Σ, µ, w)   if x_i is unlabeled,
    γ_j(x_i) = [j = y_i]               if x_i is labeled with label y_i.

The computation of w_j, µ_j, and Σ_j does not change.
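A short sketch of this modification on top of the soft-EM E-step above; encoding "unlabeled" as the label −1 is our convention for the example:

```python
import numpy as np

def semi_supervised_responsibilities(gamma_unsup, labels, k):
    """Overwrite soft-EM responsibilities for labeled points.
    gamma_unsup: (n, k) responsibilities from the usual E-step;
    labels: (n,) integers in {0, ..., k-1}, or -1 for unlabeled points."""
    gamma = gamma_unsup.copy()
    observed = labels >= 0
    gamma[observed] = np.eye(k)[labels[observed]]   # γ_j(x_i) = [j = y_i]
    return gamma
```

The M-step then proceeds unchanged on the modified responsibilities.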
11.6 Outlook: Implicit generative Models
Given a sample of (unlabeled) points x_1, ..., x_n, the goal is to learn a model X = f(Z; w). The approach is to optimize the parameters w to make samples from the model hard to distinguish from the data sample.

A Convex functions

Theorem A.1 (Jensen's inequality) Let f be convex and x_1, ..., x_n ∈ R^d, λ_1, ..., λ_n ∈ [0, 1], such that ∑_{i=1}^n λ_i = 1. Then

    f(λ_1 x_1 + ... + λ_n x_n) ≤ λ_1 f(x_1) + ... + λ_n f(x_n).
Also, if x ∈ R^d is a random variable, then

    f(E[x]) ≤ E[f(x)].

Theorem A.2 (Gradient inequality) Let f be convex. Then for all x, y ∈ R^d,

    f(y) − f(x) ≥ ∇f(x)ᵀ(y − x).

Theorem A.3 Let f: R^d → R be convex and let A ∈ R^{d×n}, b ∈ R^d, so that Az + b ∈ R^d for all z ∈ R^n. Then g(z) = f(Az + b) is convex in z ∈ R^n.

Theorem A.4 If f is convex and x* ∈ R^d is such that ∇f(x*) = 0, then x* is a global minimizer, i.e. f(x*) ≤ f(x) for all x ∈ R^d.