Lec10 Intro ML
Types of Machine Learning
• Supervised Learning
• given training examples with corresponding outputs
(also called label, target, class, answer, etc.)
• learn to produce the correct output for a new example
• Unsupervised Learning
• given training examples only
• discover good data representation
• e.g. “natural” clusters
• k-means is the most widely known example
• Reinforcement Learning
• learn to select actions that maximize payoff
Supervised Machine Learning: subtypes
• Classification
• output belongs to a finite set
• example: age {baby, child, adult, elder}
• output is also called class or label
• Regression
• output is continuous
• example: age [0,130]
• Training phase
• estimate function y = f(x) from labeled data
• f(x) is called classifier, learning machine, prediction function, etc.
• Testing phase (deployment)
• predict output f(x) for a new (unseen) sample x
Training/Testing Phases Illustrated
[Figure: block diagram: in training, training examples and their labels are turned into feature vectors and used to learn model f; in testing, f is applied to new examples]
[Figure: hypothesis space of candidate functions f(x,w1), f(x,w2), f(x,w3), f(x,w4), f(x,w5)]
[Figure: 1D example f(x) = −1 + 2x; f(x) = 0 at x = 0.5 separates class −1 from class 1]
Training Phase Example in 1D
• 2 class classification problem, yi ∊{-1,1}
• Training set class -1: {-2, -1, 1} class 1: {2, 3, 5}
[Figure: learned classifier with decision boundary at x = 1.5, separating class −1 from class 1]
[Figures: 2D examples in (x1, x2); a decision boundary splits the feature space into decision regions]
Test Classifier on New Data
• The goal is to classify well on new data
• Test “wiggly” classifier on new data: 25% error
[Figure: "wiggly" decision boundary in (x1, x2) making errors on new samples]
Overfitting
[Figure: overly complex decision boundary in (x1, x2) that fits the training data exactly]
Training Phase
• Find weights w s.t. f(xi,w) = yi “as much as possible” for
training samples xi
• define “as much as possible”
• penalty whenever f(xi,w) ≠ yi
• via loss function L(f(xi,w), yi)
• how to search for good setting of w?
• usually through optimization
• can be very time consuming
• classification error on training data is called training error
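A minimal Python sketch of computing the training error, using the 1D training set from the earlier example; the threshold classifier f(x, w) = sign(x − w) is an assumption chosen only for illustration:

import numpy as np

# Training error: fraction of training samples where f(x_i, w) != y_i
def training_error(f, w, X, y):
    predictions = np.array([f(x, w) for x in X])
    return np.mean(predictions != y)

# Hypothetical 1D threshold classifier f(x, w) = sign(x - w)
f = lambda x, w: 1 if x - w > 0 else -1
X = np.array([-2, -1, 1, 2, 3, 5])    # training samples from the 1D example
y = np.array([-1, -1, -1, 1, 1, 1])   # their labels
print(training_error(f, 1.5, X, y))   # 0.0: the threshold 1.5 separates the two classes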
Testing Phase
• The goal is good performance on unseen examples
• Evaluate performance of the trained classifier f(x,w) on
the test samples (unseen labeled samples)
• Testing on unseen labeled examples is an approximation
of how well classifier will perform in practice
• If testing results are poor, may have to go back to the
training phase and redesign f(x,w)
• Classification error on test data is called test error
• Side note
• when we “deploy” the final classifier f(x,w) in practice, this is
also called testing
Underfitting
Underfitting → Overfitting
[Figure: three fits of increasing complexity: underfitting, "just right", overfitting]
Classification System Design Overview
• Collect and label data by hand
[Figure: hand-labeled training examples: salmon, sea bass, salmon, salmon, sea bass, sea bass]
[Figure: objective J(x1, x2); gradient descent iterates x(1), x(2), ..., x(k) move toward the global minimum, where ∇J(x) = 0]
Fixed learning rate:
  k = 1
  x(1) = any initial guess
  choose α, ε
  while ||∇J(x(k))|| > ε
    x(k+1) = x(k) − α ∇J(x(k))
    k = k + 1

Variable learning rate:
  k = 1
  x(1) = any initial guess
  choose ε
  while ||∇J(x(k))|| > ε
    choose α(k)
    x(k+1) = x(k) − α(k) ∇J(x(k))
    k = k + 1
Variable Learning Rate
• Usually do not keep track of all intermediate solutions
With indexed intermediate solutions:
  k = 1
  x(1) = any initial guess
  choose α, ε
  while ||∇J(x(k))|| > ε
    x(k+1) = x(k) − α ∇J(x(k))
    k = k + 1

Equivalent in-place version:
  k = 1
  x = any initial guess
  choose α, ε
  while ||∇J(x)|| > ε
    x = x − α ∇J(x)
    k = k + 1
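A minimal Python sketch of the fixed-learning-rate loop above; the quadratic objective J(x) = ||x||² (gradient 2x) is an assumption used only for illustration:

import numpy as np

# Gradient descent with a fixed learning rate alpha and stopping threshold eps
def gradient_descent(grad_J, x, alpha=0.1, eps=1e-6, max_iter=10000):
    k = 0
    while np.linalg.norm(grad_J(x)) > eps and k < max_iter:
        x = x - alpha * grad_J(x)   # x(k+1) = x(k) - alpha * grad J(x(k))
        k += 1
    return x

# Assumed objective J(x) = ||x||^2 with gradient 2x; global minimum at the origin
x_min = gradient_descent(lambda x: 2 * x, x=np.array([3.0, -2.0]))
print(x_min)                        # very close to (0, 0)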
Learning Rate
• Monitor learning rate by looking at how fast the
objective function decreases
[Figure: J(x) versus number of iterations for different learning rates, including a very high learning rate]
Learning Rate: Loss Surface Illustration
[Figure: gradient descent paths on a loss surface for α = 0.1, α = 0.01 (about 0.3k updates), and α = 0.001 (about 3k updates)]
Linear Classifiers
Supervised Machine Learning: Recap
• Choose the type of f(x,w)
• w are tunable weights, x is the input example
• f(x,w) should output the correct class of sample x
• use labeled samples to tune the weights w so that
f(x,w) gives the correct class y for each x in the training data
• loss function L(f(x,w) ,y)
• How to choose type of f(x,w)?
• many choices
• simple choice: linear classifier
• other choices
• Neural Network
• Convolutional Neural Network
Linear Classifier
• A classifier that makes a decision based on a linear combination
of features
g(x,w) = w0+x1w1 + … + xdwd
• g(x,w) called discriminant function
• Use
• y = 1 for the first class
• y = -1 for the second class
[Figure: f plotted as a step function of g(x): +1 when g(x) > 0, −1 when g(x) < 0]
• One choice for linear classifier
f(x,w) = sign(g(x,w))
• 1 if g(x,w) is positive
• -1 if g(x,w) is negative
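A minimal Python sketch of this linear classifier, reusing the 2D discriminant g(x) = 3x1 + 2x2 + 4 from the example below; the test point is made up for illustration:

import numpy as np

# Discriminant g(x, w) = w0 + w1*x1 + ... + wd*xd and classifier f = sign(g)
def g(x, w0, w):
    return w0 + np.dot(w, x)

def f(x, w0, w):
    return 1 if g(x, w0, w) > 0 else -1

# Weights from the 2D example below: g(x) = 3*x1 + 2*x2 + 4; the test point is hypothetical
print(f(np.array([1.0, 2.0]), w0=4.0, w=np.array([3.0, 2.0])))   # 3 + 4 + 4 = 11 > 0, so class 1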
Linear Classifier: Decision Boundary
[Figure: feature space (x1, x2); the decision boundary is g(x) = 0, and the decision region where g(x) < 0 is assigned to class 2]
More on Linear Discriminant Function (LDF)
• Example in 2D
g(x, w, w0) = 3x1 + 2x2 + 4,   with x = (x1, x2)t,  w = (3, 2)t,  w0 = 4
• In what follows, the example x augmented with a constant feature 1 is written as z, and (w0, w) are stacked into a single weight vector a, so the discriminant can be written as at z
Loss Function
• How to find solution vector a?
• or, if no separating a exists, a good approximation a?
• Design a non-negative loss function L(a)
• L(a) is small if a is good
• L(a) is large if a is bad
• Minimize L(a) with gradient descent
• Two steps in design of L(a)
1. per-example loss L(f(zi,a),yi)
• penalizes for deviations of f(zi,a) from yi
2. total loss adds up per-sample loss over all training examples
L(a) = Σi L(f(zi, a), yi)
Loss Function, First Attempt
• Per-example loss function ‘counts’ if error happens
L(f(zi, a), yi) = 0 if f(zi, a) = yi,  1 otherwise
• Example
z1 = (1, 2)t with y1 = 1,   z2 = (1, 4)t with y2 = −1
• Total loss function counts total number of errors
L(a) = Σi L(f(zi, a), yi)
[Figure: L(a) plotted as a function of a]
• this total loss is piecewise constant in a, so gradient descent cannot make progress on it directly
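A minimal Python sketch of this counting loss on the two-example training set above; the weight vector a = (2, −3)t is borrowed from the Perceptron example on the next slide:

import numpy as np

# 0-1 ("counting") loss: total number of training errors made by a
def zero_one_loss(a, Z, y):
    predictions = np.sign(Z @ a)        # f(z_i, a) = sign(a^t z_i) for every row of Z
    return int(np.sum(predictions != y))

Z = np.array([[1.0, 2.0], [1.0, 4.0]])  # z1 = (1, 2) and z2 = (1, 4) as rows
y = np.array([1, -1])
print(zero_one_loss(np.array([2.0, -3.0]), Z, y))   # 1: z1 is misclassified, z2 is not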
Perceptron Loss Function
• Different Loss Function: Perceptron Loss
Lp(f(zi, a), yi) = 0 if f(zi, a) = yi,  −yi (at zi) otherwise
• Lp(a) is non-negative
• positive misclassified example zi
• atzi < 0
• yi = 1
• yi(atzi) < 0
• negative misclassified example zi
• atzi > 0
• yi = -1
• yi(atzi) < 0
• if zi is misclassified then yi(atzi) < 0
• if zi is misclassified then -yi(atzi) > 0
• Lp(a) proportional to distance of misclassified example to boundary
Perceptron Loss Function
Lp(f(zi, a), yi) = 0 if f(zi, a) = yi,  −yi (at zi) otherwise
• Example:  z1 = (1, 2)t with y1 = 1,   z2 = (1, 4)t with y2 = −1,   a = (2, −3)t
f(z1, a) = sign(at z1) = sign(2·1 − 3·2) = sign(−4) = −1
f(z2, a) = sign(at z2) = sign(2·1 − 3·4) = sign(−10) = −1
Lp(f(z1, a), y1) = −y1 (at z1) = 4        Lp(f(z2, a), y2) = 0
[Figure: Lp(a) plotted as a function of a]
Optimizing with Gradient Descent
• Per-example loss:  Lp(f(zi, a), yi) = 0 if f(zi, a) = yi,  −yi (at zi) otherwise
• Total loss:  Lp(a) = Σi Lp(f(zi, a), yi)
• Gradient descent update:  a = a − α ∇Lp(a)
• Need gradient vector ∇Lp(a)
• it has the same dimension as a
• e.g. if a = (a1, a2, a3)t, then ∇Lp(a) = (∂Lp/∂a1, ∂Lp/∂a2, ∂Lp/∂a3)t
Optimizing with Gradient Descent
• Per-example loss:  Lp(f(zi, a), yi) = 0 if f(zi, a) = yi,  −yi (at zi) otherwise
• Total loss:  Lp(a) = Σi Lp(f(zi, a), yi)
• Per-example gradient:  ∇Lp(f(zi, a), yi) = ( ∂Lp(f(zi, a), yi)/∂a1, ∂Lp(f(zi, a), yi)/∂a2, ∂Lp(f(zi, a), yi)/∂a3 )t
• Compute and add up the per-example gradients:  ∇Lp(a) = Σi ∇Lp(f(zi, a), yi)
Per Example Loss Gradient
• Per-example loss has two cases
Lp(f(zi, a), yi) = 0 if f(zi, a) = yi,  −yi (at zi) otherwise
∇Lp(f(zi, a), yi) = 0 if f(zi, a) = yi,  ? otherwise
Per Example Loss Gradient
• Per-example loss has two cases
Lp(f(zi, a), yi) = 0 if f(zi, a) = yi,  −yi (at zi) otherwise
• gradient of −yi (at zi) with respect to a is −yi zi
• Gradient for per-example loss:
∇Lp(f(zi, a), yi) = 0 if f(zi, a) = yi,  −yi zi otherwise
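A minimal Python sketch of this per-example gradient, checked on z1 = (1, 2)t, y1 = 1 and a = (2, −3)t from the earlier example:

import numpy as np

# Per-example Perceptron gradient: 0 if z is classified correctly, otherwise -y*z
def perceptron_grad(a, z, y):
    if y * np.dot(a, z) > 0:            # correctly classified: zero gradient
        return np.zeros_like(a)
    return -y * z                        # misclassified: gradient is -y_i * z_i

print(perceptron_grad(np.array([2.0, -3.0]), np.array([1.0, 2.0]), 1))   # [-1. -2.]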
Optimizing with Gradient Descent
• Simpler formula
∇Lp(a) = − Σ yi zi   (sum over misclassified examples i)
a = a + α Σ yi zi   (sum over misclassified examples i)
• Pile all examples as rows in matrix Z
• Pile all labels into column vector Y
Z = [ 1 2 3
      1 4 3
      1 3 5
      1 1 3
      1 5 6 ]     Y = (1, 1, 1, −1, −1)t
Perceptron Loss Batch Example
• Examples in Z, labels in Y
Z = [ 1 2 3
      1 4 3
      1 3 5
      1 1 3
      1 5 6 ]     Y = (1, 1, 1, −1, −1)t
• Initial weights a = (1, 1, 1)t
• This is the line x1 + x2 + 1 = 0
Perceptron Loss Batch Example
Z = [ 1 2 3; 1 4 3; 1 3 5; 1 1 3; 1 5 6 ] ,   a = (1, 1, 1)t ,   Y = (1, 1, 1, −1, −1)t
• Perceptron Batch
a = a + α Σ yi zi   (sum over misclassified examples i)
• Let us use learning rate α = 0.2
a = a + 0.2 Σ yi zi   (sum over misclassified examples i)
• Per-example loss is  Lp(f(zi, a), yi) = 0 if f(zi, a) = yi,  −yi (at zi) otherwise
Perceptron Loss Batch Example
Z = [ 1 2 3; 1 4 3; 1 3 5; 1 1 3; 1 5 6 ] ,   a = (1, 1, 1)t ,   Y = (1, 1, 1, −1, −1)t
• batch update:  a = a + α Σ yi zi   (sum over misclassified examples i)
• per-example-gradient update:  a = a − α ∇Lp(f(zi, a), yi)
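A minimal Python sketch of one batch Perceptron update on this example; Z, Y, α = 0.2 and the initial a = (1, 1, 1)t are taken from the slides, and the code prints which examples are currently misclassified and the updated a:

import numpy as np

Z = np.array([[1, 2, 3],
              [1, 4, 3],
              [1, 3, 5],
              [1, 1, 3],
              [1, 5, 6]], dtype=float)
Y = np.array([1, 1, 1, -1, -1], dtype=float)
a = np.array([1.0, 1.0, 1.0])
alpha = 0.2

misclassified = Y * (Z @ a) <= 0        # y_i (a^t z_i) <= 0 marks an error
print(misclassified)                    # [False False False  True  True]: the last two rows

# a = a + alpha * sum over misclassified examples i of y_i * z_i
a = a + alpha * (Y[misclassified, None] * Z[misclassified]).sum(axis=0)
print(a)                                # [0.6 -0.2 -0.8]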
Batch Size: Loss Surface Illustration
[Figure: two gradient descent paths from start to finish on the loss surface; one of the updates sees only one example at a time]
a = a − (α/m) Σ ∇Lp(f(zi, a), yi)   (sum of per-example gradients over a batch of m examples)
[Figure: two panels of 1D examples z with targets +1 and −1]
• Logistic (sigmoid) function:  σ(t) = 1 / (1 + exp(−t))
[Figure: plot of σ(t), which increases from 0 to 1 and equals 0.5 at t = 0]
[Figure: σ(at z) as a function of at z, with the labels "quadratic loss" and "logistic regression loss"]
Logistic Regression: Loss Function
• Could use (yi − σ(at zi))² as the per-example loss function
• Instead use a different loss
[Figure: plot of −log t for t in (0, 1]]
• if z has label 1, want σ(at z) close to 1; define loss as −log[σ(at z)]
• if z has label 0, want σ(at z) close to 0; define loss as −log[1 − σ(at z)]
• update rule:  a = a + α Σi (yi − σ(at zi)) zi
• Probabilistic interpretation
• P(class 1) = σ(at z)
• P(class 0) = 1 - P(class 1)
• loss function is negative log-likelihood , -log P(y)
• standard objective in statistics
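A minimal Python sketch of logistic regression trained with the update a = a + α Σi (yi − σ(at zi)) zi; the toy 1D data (augmented with a bias feature) and the settings α = 0.1 and 1000 iterations are assumptions for illustration:

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic_regression(Z, y, alpha=0.1, n_iters=1000):
    a = np.zeros(Z.shape[1])
    for _ in range(n_iters):
        p = sigmoid(Z @ a)              # predicted P(class 1) for every sample
        a = a + alpha * Z.T @ (y - p)   # batch update: a = a + alpha * sum_i (y_i - p_i) z_i
    return a

# Samples augmented with a leading 1 (bias feature); labels in {0, 1}
Z = np.array([[1.0, 2.0], [1.0, 4.0], [1.0, -1.0], [1.0, -3.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])
a = fit_logistic_regression(Z, y)
print(sigmoid(Z @ a))                   # probabilities near 1 for the first two samples, near 0 for the rest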
Logistic Regression vs. Perceptron
• Green example classified correctly, but close to the decision boundary
• no loss under Perceptron
• non-negligible loss under logistic regression
• Logistic regression encourages the decision boundary to move away from the training samples
• better generalization
[Figure: two classifiers, both with zero Perceptron loss, one with smaller LR loss and one with larger LR loss; the red classifier works better for new data]
Logistic Regression vs. Regression vs. Perceptron
[Figure: quadratic loss, logistic regression loss, and perceptron loss plotted against y·(at z); the marked regions are y·(at z) < 0 (misclassified), 0 < y·(at z) < 1 (classified correctly but close to the decision boundary), and y·(at z) > 1 (classified correctly and not too close to the decision boundary)]
Multiple Classes
[Figure: decision regions R1 (g1(x) > g2(x) and g1(x) > g3(x)), R2 (g2(x) > g1(x) and g2(x) > g3(x)), R3 (g3(x) > g1(x) and g3(x) > g2(x))]
• Can be shown that decision regions are convex
• In particular, they must be spatially contiguous
Multiclass Linear Classifier: Matrix Notation
• Assume examples x are augmented with extra feature 1, no need
to write bias explicitly
• but from now on we keep writing x rather than switching notation to z
• Define m discriminant functions
gi(x) = witx for i = 1, 2, … m
• Assign x to i that gives maximum gi(x)
• Picture illustration:  g1(x) → 5,  g2(x) → 3,  g3(x) → −9,  g4(x) → 10
• pile all outputs into one vector: (5, 3, −9, 10)t
• decide class 4 (the largest output)
Multiclass Linear Classifier: Matrix Notation
• Could use one dimensional output yi ∊ {1, 2, 3, …, m}
• Convenient to use multi-dimensional outputs (one-hot encoding)
yj = (1, 0, 0, 0)t for class 1,   yj = (0, 1, 0, 0)t for class 2,   yj = (0, 0, 1, 0)t for class 3,   yj = (0, 0, 0, 1)t for class 4
• if sample is of class i, want the output vector to be 0 everywhere except position i, where it should be 1
• example: x is of class 2; got (g1(x), g2(x), g3(x), g4(x))t = (5, 3, −9, 10)t, want (0, 1, 0, 0)t
Multiclass Linear Classifier: Matrix Notation
• Assign x to i that gives maximum gi(x)= witx
g1(x) = w1t x,   g2(x) = w2t x,   g3(x) = w3t x,   g4(x) = w4t x
• In matrix notation, stack the rows wit into a matrix W:
W = [ 2  4 −7
      9 −3  2
      4  5  2
      2 −7  1 ]
• for example, with x = (1, 7, 4)t:  Wx = (2, −4, 47, −43)t
• Assign x to class that corresponds to largest row of Wx
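A minimal Python sketch of this rule, using the W shown above and the example input reconstructed as x = (1, 7, 4)t so that Wx = (2, −4, 47, −43)t:

import numpy as np

W = np.array([[2, 4, -7],
              [9, -3, 2],
              [4, 5, 2],
              [2, -7, 1]], dtype=float)
x = np.array([1.0, 7.0, 4.0])           # augmented sample: the leading 1 is the bias feature

scores = W @ x                          # one discriminant value g_i(x) = w_i^t x per class
print(scores)                           # [2, -4, 47, -43]
print(np.argmax(scores) + 1)            # class 3: the largest row of Wx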
Quadratic Loss Function
• Assign sample xi to class that corresponds to largest row of Wxi
• Loss function?  For the example above, Wxi = (2, −4, 47, −43)t while the target is yi = (0, 0, 0, 1)t
• Can use quadratic loss per sample xi:  ½ ||Wxi − yi||²
• for the example above, the loss is (2² + 4² + 47² + 44²)/2
• total loss on all training samples:  L(W) = ½ Σi ||Wxi − yi||²
• gradient of the loss:  ∇L(W) = Σi (Wxi − yi)(xi)t
• ∇L(W) has the same shape as W
• batch gradient descent update:  W = W − α Σi (Wxi − yi)(xi)t
Quadratic Loss Function
• Consider gradient descent update, single sample x with α = 1
W = W − (Wx − y) xt
• Suppose x = (1, 3, 2)t is of class 2 (so y = (0, 1, 0, 0)t) and
W = [ 2  4 −7
      9 −3  2
      4  5  2
      2 −7  1 ]
• Then Wx − y = (0, 4, 23, −17)t − (0, 1, 0, 0)t = (0, 3, 23, −17)t
  (component 1 is ok, components 2 and 3 are too large, component 4 is too small)
• update rule:  W = W − (0, 3, 23, −17)t (1, 3, 2), which gives
W = [  2    4   −7
       6  −12   −4
     −19  −64  −44
      19   44   35 ]
Quadratic Loss Function
• Recall Wx − y = (0, 3, 23, −17)t: component 1 ok, components 2 and 3 too large, component 4 too small
• With the new
W = [  2    4   −7
       6  −12   −4
     −19  −64  −44
      19   44   35 ] ,   Wx = (0, −38, −299, 221)t, now even farther from the target y = (0, 1, 0, 0)t
• Already saw that quadratic loss does not work that well for classification
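A minimal Python sketch reproducing the single-sample quadratic-loss update above (α = 1, x = (1, 3, 2)t of class 2, W from the slides):

import numpy as np

W = np.array([[2, 4, -7],
              [9, -3, 2],
              [4, 5, 2],
              [2, -7, 1]], dtype=float)
x = np.array([1.0, 3.0, 2.0])
y = np.array([0.0, 1.0, 0.0, 0.0])      # one-hot label for class 2

residual = W @ x - y                    # (0, 3, 23, -17)
W_new = W - np.outer(residual, x)       # W = W - (Wx - y) x^t   (alpha = 1)
print(W_new)                            # rows: (2, 4, -7), (6, -12, -4), (-19, -64, -44), (19, 44, 35)
print(W_new @ x)                        # (0, -38, -299, 221): far from the target (0, 1, 0, 0)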
Softmax Function
• Define the softmax function:  softmax(a)i = exp(ai) / Σj exp(aj)
• Example:
softmax( (−3, 2, 1)t ) = ( exp(−3), exp(2), exp(1) )t / ( exp(−3) + exp(2) + exp(1) )
• Applied to the discriminant outputs (w1t x, w2t x, w3t x, w4t x)t = (2, −1, 5, −3)t:
softmax( (2, −1, 5, −3)t ) = (0.0473, 0.0024, 0.9500, 0.0003)t = ( Pr(class 1), Pr(class 2), Pr(class 3), Pr(class 4) )t
• Update rule:  W = W + α Σi (yi − softmax(Wxi)) (xi)t
• Example, single sample gradient descent with α = 0.1
xi = (1, 3, 2)t,   yi = (0, 0, 0, 1)t,   W = [ 2  4 −7
                                               9 −3  2
                                               4  5  2
                                               2 −7  1 ] ,   Wxi = (0, 4, 23, −17)t
• Update for W: since exp(23) dominates, softmax(Wxi) ≈ (0, 0, 1, 0)t, so yi − softmax(Wxi) ≈ (0, 0, −1, 1)t and
W = W + 0.1 (yi − softmax(Wxi)) (xi)t ≈ [ 2     4    −7
                                          9    −3     2
                                          3.9   4.7   1.8
                                          2.1  −6.7   1.2 ]
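A minimal Python sketch of this softmax update (α = 0.1, x = (1, 3, 2)t of class 4, W from the slides); subtracting the maximum inside the softmax is a standard numerical-stability trick, not something shown on the slides:

import numpy as np

def softmax(v):
    e = np.exp(v - v.max())             # subtract the max for numerical stability
    return e / e.sum()

W = np.array([[2, 4, -7],
              [9, -3, 2],
              [4, 5, 2],
              [2, -7, 1]], dtype=float)
x = np.array([1.0, 3.0, 2.0])
y = np.array([0.0, 0.0, 0.0, 1.0])      # one-hot label for class 4

# W = W + alpha * (y - softmax(Wx)) x^t
W = W + 0.1 * np.outer(y - softmax(W @ x), x)
print(np.round(W, 1))                   # rows: (2, 4, -7), (9, -3, 2), (3.9, 4.7, 1.8), (2.1, -6.7, 1.2)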
More General Discriminant Functions
• Linear discriminant functions
• simple decision boundary
• should try simpler models first to avoid overfitting
• optimal for certain types of data
• Gaussian distributions with equal covariance
• May not be optimal for other data distributions
• Discriminant functions can be more general than linear
• For example, polynomial discriminant functions
• Decision boundaries more complex than linear
• Later will look more at non-linear discriminant functions
Summary
• Linear classifier works well when examples are
linearly separable, or almost separable
• Perceptron loss was historically the first loss function used
• Logistic regression/softmax work better in practice
• Optimization with gradient descent
• stochastic mini-batch works best in practice
Linear Classifier: Quadratic Loss
• Quadratic per-example loss:  Lp(f(zi, a), yi) = ½ (yi − at zi)²
[Figure: 1D example: at z plotted against z, fit to targets +1 and −1]
• This is standard line fitting
• note that even correctly classified examples can have a large loss
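A minimal Python sketch of the last point: with the quadratic loss, an example that is classified correctly can still be penalized heavily (the weights and example below are made up):

import numpy as np

# Quadratic per-example loss 0.5 * (y - a^t z)^2
def quadratic_loss(a, z, y):
    return 0.5 * (y - np.dot(a, z)) ** 2

a = np.array([2.0, -3.0])               # assumed weights
z = np.array([5.0, 1.0])                # a^t z = 7, so sign(a^t z) = +1 = y: correctly classified
print(quadratic_loss(a, z, y=1))        # 18.0: large loss despite the correct classification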