Lec10 Intro ML


Machine Learning: Intro & Linear Classifier


Olga Veksler
Outline
• Introduction to Machine Learning
• motivation of machine learning
• types of machine learning
• basic concepts
• overfitting, underfitting, generalization
• Optimization with gradient descent
• Linear Classifier
• Two classes
• Loss functions
• Gradient descent schemes
• Multiple classes
INTRO TO ML
Why use Machine Learning?
• Difficult to come up with explicit program for some tasks
• Digit Recognition, a classic example
[Figure: images of handwritten digits with their labels, e.g. 0 and 4]
• Easy to collect images of digits with their correct labels
• Machine Learning Algorithm takes collected data and produces a program for recognizing digits
• done right, program will recognize correctly new images it has never seen
What is Machine Learning?
• General definition (Tom Mitchell):
• Based on experience E, improve performance on task T as
measured by performance measure P
• Digit Recognition Example
• T = recognize character in the image
• P = percentage of correctly classified images
• E = dataset of human-labeled images of characters

Types of Machine Learning
• Supervised Learning
• given training examples with corresponding outputs
(also called label, target, class, answer, etc.)
• learn to produce the correct output for a new example
• Unsupervised Learning
• given training examples only
• discover good data representation
• e.g. “natural” clusters
• k-means is the most widely known example
• Reinforcement Learning
• learn to select action that maximizes payoff
Supervised Machine Learning: subtypes
• Classification
• output belongs to a finite set
• example: age ∈ {baby, child, adult, elder}
• output is also called class or label

• Regression
• output is continuous
• example: age ∈ [0,130]

• Difference mostly in design of loss function


Supervised Machine Learning
• Have examples with corresponding outputs
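• Example: fish classification, salmon or sea bass?
$$x_1 = \begin{bmatrix} 3.3 \\ 5.7 \end{bmatrix},\quad x_2 = \begin{bmatrix} 6.3 \\ 8.7 \end{bmatrix},\quad x_3 = \begin{bmatrix} 2.3 \\ 1.7 \end{bmatrix},\quad x_4 = \begin{bmatrix} 6.4 \\ 7.0 \end{bmatrix}$$
• x1: salmon, y1 = 0
• x2: sea bass, y2 = 1
• x3: salmon, y3 = 0
• x4: sea bass, y4 = 1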
• Each example represented by a vector
• data may be given in vector form from the start
• if not, for each example i, extract useful features and put them in a vector xi
• fish classification example
• extract two features, fish length and average fish brightness
• can extract many other features as well
• for images, can also use raw pixel intensity or color as features
• an example is often called a feature vector
• yi is the output for example xi
Supervised Machine Learning
• We are given
1. Training examples x1, x2,…, xn
2. Target output for each sample y1, y2,…, yn
• together, 1 and 2 form the labeled data

• Training phase
• estimate function y = f(x) from labeled data
• f(x) is called classifier, learning machine, prediction function, etc.
• Testing phase (deployment)
• predict output f(x) for a new (unseen) sample x
Training/Testing Phases Illustrated
[Figure: Training: training examples and training labels → feature vectors → training → learned model f]
[Figure: Testing: test image → feature vector → learned model f → label prediction]
More on Training Phase
• Estimate prediction function y = f(x) from labeled data
• Choose hypothesis space f(x) belongs to
• hypothesis space f(x,w) is parameterized by vector of weights w
• each setting of w corresponds to a different hypothesis

[Figure: hypothesis space containing f(x,w1), f(x,w2), f(x,w3), f(x,w4), f(x,w5)]
• find f(x,w) in the hypothesis space s.t. f(xi,w) = yi “as much as possible” for training examples
• define “as much as possible” via some loss function L(f(x,w),y)
Training Phase Example in 1D
• 2 class classification problem, yi ∊ {-1,1}
• Training set class -1: {-2, -1, 1} class 1: {2, 3, 5}
• Hypothesis space f(x,w) = sign(w0 + w1x), with weight vector $w = \begin{bmatrix} w_0 \\ w_1 \end{bmatrix}$
• with w0 = -1, w1 = 2, f(x) = sign(-1 + 2x)
[Figure: line -1 + 2x plotted against x, crossing zero at x = 0.5; class -1 to the left, class 1 to the right]
Training Phase Example in 1D
• 2 class classification problem, yi ∊ {-1,1}
• Training set class -1: {-2, -1, 1} class 1: {2, 3, 5}
• Hypothesis space f(x,w) = sign(w0 + w1x)
• With w0 = -1.5, w1 = 1, f(x) = sign(-1.5 + x)
[Figure: line -1.5 + x plotted against x, crossing zero at x = 1.5; class -1 to the left, class 1 to the right]

• The process of finding good w is weight tuning, or training
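The two weight settings from the 1D slides can be compared by counting training errors. This is a minimal sketch in plain Python; the helper names `predict` and `training_errors` are hypothetical, and treating sign(0) as +1 is an assumption the slides do not state.

```python
def sign(v):
    # Assumed convention: sign(0) counts as +1 (not specified in the slides).
    return 1 if v >= 0 else -1

def predict(x, w0, w1):
    # Hypothesis f(x, w) = sign(w0 + w1 * x)
    return sign(w0 + w1 * x)

# Training set from the slides as (sample, label) pairs
train = [(-2, -1), (-1, -1), (1, -1), (2, 1), (3, 1), (5, 1)]

def training_errors(w0, w1):
    # Number of training samples whose predicted label differs from the true one
    return sum(1 for x, y in train if predict(x, w0, w1) != y)

errors_first = training_errors(-1.0, 2.0)   # boundary at x = 0.5
errors_second = training_errors(-1.5, 1.0)  # boundary at x = 1.5
print(errors_first, errors_second)
```

With the first setting the sample x = 1 falls on the wrong side of the boundary; the second setting separates the two classes.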


Training Phase Example in 2D
• For 2 class problem and 2 dimensional samples
f(x,w) = sign(w0 + w1x1 + w2x2)
[Figure: a decision boundary (a line) dividing the (x1, x2) plane into two decision regions]
• Can be generalized to examples of arbitrary dimension
• Classifier that makes a decision based on linear combination of features is called a linear classifier
Training Phase: Linear Classifier
[Figure: two linear boundaries in the (x1, x2) plane. Bad setting of w: classification error 38%. Better setting of w: classification error 4%.]


Training Stage: More Complex Classifier

[Figure: a “wiggly” decision boundary in the (x1, x2) plane that separates all training samples]
• for example if f(x,w) is a polynomial of high degree
• 0% classification error
Test Classifier on New Data
• The goal is to classify well on new data
• Test “wiggly” classifier on new data: 25% error

[Figure: the “wiggly” decision boundary applied to new samples in the (x1, x2) plane]
Overfitting
[Figure: overfitted decision boundary in the (x1, x2) plane]
• Have only limited amount of data for training
• Complex model often has too many parameters to fit reliably with limited data
• Complex model may adapt too closely to random noise of training data, rather than look at a ‘big picture’
Overfitting: Extreme Example
• 2 class problem: face and non-face images
• Memorize (i.e. store) all the “face” images
• For a new image, see if it is one of the stored faces
• if yes, output “face” as the classification result
• If no, output “non-face”
• also called “rote learning”
• problem: new “face” images are different from stored
“face” examples
• zero error on stored data, 50% error on test (new) data
• decision boundary is very irregular
• Rote learning is memorization without generalization
slide is modified from Y. LeCun
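Rote learning can be written down in a few lines, which makes its failure mode concrete. The sketch below is illustrative only: the class name `RoteFaceClassifier` and the three-number "images" are made up for the example.

```python
class RoteFaceClassifier:
    """Memorizes the training faces and answers by exact lookup."""

    def __init__(self):
        self.stored = set()

    def fit(self, face_images):
        # "Training" is just storing every face image verbatim
        self.stored = {tuple(img) for img in face_images}

    def predict(self, image):
        # A new image is a face only if it is identical to a stored one
        return "face" if tuple(image) in self.stored else "non-face"

# Hypothetical toy "images" represented by three pixel intensities
train_faces = [(0.9, 0.1, 0.8), (0.7, 0.2, 0.9)]
clf = RoteFaceClassifier()
clf.fit(train_faces)

train_predictions = [clf.predict(img) for img in train_faces]
new_face = (0.88, 0.12, 0.79)  # a face, but slightly different from the stored ones
new_prediction = clf.predict(new_face)
print(train_predictions, new_prediction)
```

Every stored face is recognized, so training error is zero, but any unseen face is rejected because it never matches a stored image exactly.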
Generalization
[Figure: a decision boundary shown on training data and on new data]
• Ability to produce correct outputs on previously unseen examples is called generalization
• Big question of learning theory: how to get good generalization with a limited number of examples
• Intuitive idea: favor simpler classifiers
• William of Occam (1284-1347): “entities are not to be multiplied without necessity”
• Simpler decision boundary may not fit ideally to training data but tends to generalize better to new data
Training and Testing
• How to diagnose overfitting?
• Divide all labeled samples x1,x2,…xn into training set
and test set
• Use training set (training samples) to tune classifier
weights w
• Use test set (test samples) to see how well classifier with tuned weights w works on unseen examples
• Thus there are 2 main phases in classifier design
1. training
2. testing

Training Phase
• Find weights w s.t. f(xi,w) = yi “as much as possible” for
training samples xi
• define “as much as possible”
• penalty whenever f(xi,w) ≠ yi
• via loss function L(f(xi,w), yi)
• how to search for good setting of w?
• usually through optimization
• can be very time consuming
• classification error on training data is called training error
Testing Phase
• The goal is good performance on unseen examples
• Evaluate performance of the trained classifier f(x,w) on
the test samples (unseen labeled samples)
• Testing on unseen labeled examples is an approximation
of how well classifier will perform in practice
• If testing results are poor, may have to go back to the
training phase and redesign f(x,w)
• Classification error on test data is called test error
• Side note
• when we “deploy” the final classifier f(x,w) in practice, this is
also called testing
Underfitting

• Can also underfit data, i.e. too simple decision boundary
• chosen hypothesis space is not expressive enough
• No linear decision boundary can separate the samples well
• Training error is too high
• test error is, of course, also high
Underfitting → Overfitting
• underfitting: high training error, high test error
• “just right”: low training error, low test error
• overfitting: low training error, high test error
Sketch of Supervised Machine Learning
• Choose a hypothesis space f(x,w)
• w are tunable weights
• x is the input sample
• tune w so that f(x,w) gives the correct label for
training samples x
• Which hypothesis space f(x,w) to choose?
• has to be expressive enough to model our problem
well, i.e. to avoid underfitting
• yet not too complicated, to avoid overfitting
Classification System Design Overview
• Collect and label data by hand
[Figure: fish images labeled salmon, sea bass, salmon, salmon, sea bass, sea bass]
• Split data into training and test sets
• Preprocess data (e.g. segmenting fish from background)
• Extract possibly discriminating features
• length, lightness, width, number of fins, etc.
• Classifier design
• choose model for classifier
• train classifier on training data
• Test classifier on test data
• the course mostly looks at these last three steps (choose model, train, test)
OPTIMIZATION
Optimization
• How to minimize a function of a single variable
$J(x) = (x-5)^2$
• From calculus, take derivative, set it to 0
$\frac{d}{dx} J(x) = 0$
• Solve the resulting equation
• may be easy or hard to solve
• Example above is easy
$\frac{d}{dx} J(x) = 2(x-5) = 0 \;\Rightarrow\; x = 5$
Optimization
• How to minimize a function of many variables
$J(x) = J(x_1, \ldots, x_d)$
• From calculus, take partial derivatives, set them to 0
$\nabla J(x) = \begin{bmatrix} \frac{\partial}{\partial x_1} J(x) \\ \vdots \\ \frac{\partial}{\partial x_d} J(x) \end{bmatrix} = 0$
• the vector of partial derivatives ∇J(x) is called the gradient
• Solve the resulting system of d equations
• It may not be possible to solve the system of equations above analytically
Optimization: Gradient Direction

[Figure: surface plot of J(x1, x2) over the (x1, x2) plane. Picture from Andrew Ng]
• Gradient ∇J(x) points in the direction of steepest increase of function J(x)
• −∇J(x) points in the direction of steepest decrease


Gradient Direction in 2D
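• $J(x_1, x_2) = (x_1-5)^2 + (x_2-10)^2$
• $\frac{\partial}{\partial x_1} J(x) = 2(x_1 - 5)$
• $\frac{\partial}{\partial x_2} J(x) = 2(x_2 - 10)$
• Let $a = \begin{bmatrix} 10 \\ 5 \end{bmatrix}$
• $\frac{\partial}{\partial x_1} J(a) = 10$
• $\frac{\partial}{\partial x_2} J(a) = -10$
• $\nabla J(a) = \begin{bmatrix} 10 \\ -10 \end{bmatrix}$
• $-\nabla J(a) = \begin{bmatrix} -10 \\ 10 \end{bmatrix}$
[Figure: (x1, x2) plane with the global min at (5, 10), the point a = (10, 5), and the arrow −∇J(a) = (−10, 10) drawn from a]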
Gradient Descent: Step Size
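• $J(x_1, x_2) = (x_1-5)^2 + (x_2-10)^2$
• Which step size to take?
• Controlled by parameter α
• called learning rate
• From previous slide
• $a = \begin{bmatrix} 10 \\ 5 \end{bmatrix}$, $-\nabla J(a) = \begin{bmatrix} -10 \\ 10 \end{bmatrix}$
• Let α = 0.2
$a - \alpha \nabla J(a) = \begin{bmatrix} 10 \\ 5 \end{bmatrix} + 0.2 \begin{bmatrix} -10 \\ 10 \end{bmatrix} = \begin{bmatrix} 8 \\ 7 \end{bmatrix}$
• J(10, 5) = 50; J(8,7) = 18
[Figure: (x1, x2) plane with the global min at (5, 10), the point a = (10, 5), and the arrow −∇J(a) = (−10, 10)]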


Gradient Descent Algorithm
k = 1
x(1) = any initial guess
choose α, ε
while ||∇J(x(k))|| > ε
    x(k+1) = x(k) − α∇J(x(k))
    k = k + 1
[Figure: J(x) plotted against x with iterates x(1), x(2), …, x(k); steps follow −∇J(x(1)), −∇J(x(2)), … until ∇J(x(k)) ≈ 0]
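The loop above can be sketched in plain Python for the earlier example J(x1, x2) = (x1−5)² + (x2−10)². The function names, the `max_iter` safeguard, and the value of `eps` are assumptions added for the illustration; α = 0.2 and the starting point (10, 5) come from the slides.

```python
import math

def J(x):
    return (x[0] - 5) ** 2 + (x[1] - 10) ** 2

def grad_J(x):
    # Partial derivatives 2(x1 - 5) and 2(x2 - 10)
    return [2 * (x[0] - 5), 2 * (x[1] - 10)]

def gradient_descent(x, alpha=0.2, eps=1e-6, max_iter=10_000):
    # max_iter is an extra safeguard, not part of the slide's algorithm
    for _ in range(max_iter):
        g = grad_J(x)
        if math.hypot(g[0], g[1]) <= eps:  # stop when ||grad J(x)|| <= eps
            break
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
    return x

start = [10.0, 5.0]
first_step = [xi - 0.2 * gi for xi, gi in zip(start, grad_J(start))]
x_min = gradient_descent(start)
print(first_step, J(start), J(first_step), x_min)
```

The first step reproduces the move from (10, 5) to (8, 7) and the drop in J from 50 to 18; further iterations approach the minimum at (5, 10).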
Gradient Descent: Local Minimum
• Not guaranteed to find global minimum
• gets stuck in local minimum
[Figure: J(x) plotted against x; iterates x(1), x(2), …, x(k) descend into a local minimum where −∇J(x(k)) = 0, away from the global minimum]
• Still gradient descent is very popular because it is simple and applicable to any differentiable function
How to Set Learning Rate ?
• If α too small, too many iterations to converge
• If α too large, may overshoot the local minimum and possibly never even converge
[Figure: J(x) plotted against x for both cases; with large α the iterates x(1), x(2), x(3), x(4) jump back and forth across the minimum]
• It helps to compute J(x) as a function of iteration number, to make sure we are properly minimizing it
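Tracking J per iteration is easy to try on the 1D example J(x) = (x−5)². In this sketch the three learning rates, the start point x = 0, and the function name `run` are arbitrary choices for illustration.

```python
def run(alpha, steps=20, x=0.0):
    """Return the list of J values seen over `steps` iterations of gradient descent."""
    history = []
    for _ in range(steps):
        history.append((x - 5) ** 2)   # J(x) at the current iterate
        x = x - alpha * 2 * (x - 5)    # gradient step, dJ/dx = 2(x - 5)
    return history

too_small = run(0.001)  # J barely moves
reasonable = run(0.3)   # J drops quickly
too_large = run(1.1)    # each step overshoots by more than the last one
print(too_small[-1], reasonable[-1], too_large[-1])
```

For this particular J, the distance to the minimum is multiplied by (1 − 2α) at every step, so any α above 1 makes the iterates diverge.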
Variable Learning Rate
• If desired, can change learning rate α at each iteration
Fixed learning rate:
k = 1
x(1) = any initial guess
choose α, ε
while ||∇J(x(k))|| > ε
    x(k+1) = x(k) − α∇J(x(k))
    k = k + 1
Variable learning rate:
k = 1
x(1) = any initial guess
choose ε
while ||∇J(x(k))|| > ε
    choose α(k)
    x(k+1) = x(k) − α(k)∇J(x(k))
    k = k + 1
Variable Learning Rate
• Usually do not keep track of all intermediate solutions
Keeping all iterates:
k = 1
x(1) = any initial guess
choose α, ε
while ||∇J(x(k))|| > ε
    x(k+1) = x(k) − α∇J(x(k))
    k = k + 1
Overwriting x in place:
k = 1
x = any initial guess
choose α, ε
while ||∇J(x)|| > ε
    x = x − α∇J(x)
    k = k + 1
Learning Rate
• Monitor learning rate by looking at how fast the objective function decreases
[Figure: J(x) plotted against number of iterations, with one curve each for very high, high, low, and good learning rate]
Learning Rate: Loss Surface Illustration
[Figure: gradient descent paths on a loss surface for α = 0.1, α = 0.01 (~0.3k updates), and α = 0.001 (~3k updates)]
Linear Classifiers
Supervised Machine Learning: Recap
• Choose type of f(x,w)
• w are tunable weights, x is the input example
• f(x,w) should output the correct class of sample x
• use labeled samples to tune weights w so that f(x,w) gives the correct class y for x in training data
• loss function L(f(x,w) ,y)
• How to choose type of f(x,w)?
• many choices
• simple choice: linear classifier
• other choices
• Neural Network
• Convolutional Neural Network
Linear Classifier
• Classifier that makes a decision based on linear combination of features
g(x,w) = w0 + x1w1 + … + xdwd
• g(x,w) called discriminant function
• Use
• y = 1 for the first class
• y = -1 for the second class
• One choice for linear classifier
f(x,w) = sign(g(x,w))
• 1 if g(x,w) is positive
• -1 if g(x,w) is negative
[Figure: g(x) and the step function f(x) plotted against x; f(x) takes values -1 and 1 and switches where g(x) crosses zero]
Linear Classifier: Decision Boundary

[Figure: two linear decision boundaries, a bad boundary and a better boundary]
• f(x,w) = sign(g(x,w)) = sign(w0 + x1w1 + … + xdwd)
• Decision boundary is linear
• Find w0, w1,…, wd that give best separation of two classes with a linear boundary
More on Linear Discriminant Function (LDF)
• LDF: g(x,w,w0) = w0 + x1w1 + … + xdwd
• $w = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_d \end{bmatrix}$ is the weight vector, w0 is the bias or threshold
[Figure: (x1, x2) plane split by the decision boundary g(x) = 0; decision region for class 1 where g(x) > 0, decision region for class 2 where g(x) < 0]
More on Linear Discriminant Function (LDF)

• Decision boundary: g(x,w) = w0+x1w1 + … + xdwd = 0


• This is a hyperplane, by definition
• a point in 1D
• a line in 2D
• a plane in 3D
• a hyperplane in higher dimensions
Vector Notation
• Linear discriminant function $g(x, w, w_0) = w^t x + w_0$
• Example in 2D
$g(x, w, w_0) = 3x_1 + 2x_2 + 4$, with $x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$, $w = \begin{bmatrix} 3 \\ 2 \end{bmatrix}$, $w_0 = 4$
• Shorter notation if add extra feature of value 1 to x
$z = \begin{bmatrix} 1 \\ x_1 \\ x_2 \end{bmatrix}$, $a = \begin{bmatrix} 4 \\ 3 \\ 2 \end{bmatrix}$, $g(z, a) = a^t z = \begin{bmatrix} 4 & 3 & 2 \end{bmatrix} \begin{bmatrix} 1 \\ x_1 \\ x_2 \end{bmatrix}$
• Use $a^t z$ instead of $w^t x + w_0$
$g(z, a) = a^t z = 4 + 3x_1 + 2x_2 = w^t x + w_0 = g(x, w, w_0)$
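The equivalence of the two notations can be checked numerically. This sketch uses the 2D example weights w = (3, 2), w0 = 4; the function names and the test point x are hypothetical.

```python
def g_original(x, w, w0):
    # g(x, w, w0) = w^t x + w0
    return sum(wi * xi for wi, xi in zip(w, x)) + w0

def augment(x):
    # Prepend the constant feature 1 to obtain z
    return [1.0] + list(x)

def g_augmented(z, a):
    # g(z, a) = a^t z
    return sum(ai * zi for ai, zi in zip(a, z))

w, w0 = [3.0, 2.0], 4.0
a = [w0] + w           # a = (w0, w1, w2) = (4, 3, 2)
x = [1.5, -2.0]        # arbitrary test point

value_original = g_original(x, w, w0)
value_augmented = g_augmented(augment(x), a)
print(value_original, value_augmented)
```

Folding the bias into the weight vector means later formulas only need a single dot product per example.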
Fitting Parameters w
• Rewrite $g(x, w, w_0) = \begin{bmatrix} w_0 & w^t \end{bmatrix} \begin{bmatrix} 1 \\ x \end{bmatrix} = a^t z = g(z, a)$
• new weight vector $a = \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_d \end{bmatrix}$, new feature vector $z = \begin{bmatrix} 1 \\ x_1 \\ \vdots \\ x_d \end{bmatrix}$
• z is called augmented feature vector
• new problem equivalent to the old g(z,a) = atz
• f(z,a) = sign(g(z,a))
Solution Region
• If there is a weight vector a that classifies all examples correctly, it is called separating or solution vector
• then there are infinitely many solution vectors a
• original samples x1,… xn are also linearly separable
[Figure: region of solution vectors a]
Loss Function
• How to find solution vector a?
• or, if no separating a exists, a good approximation a?
• Design a non-negative loss function L(a)
• L(a) is small if a is good
• L(a) is large if a is bad
• Minimize L(a) with gradient descent
• Two steps in design of L(a)
1. per-example loss L(f(zi,a),yi)
• penalizes for deviations of f(zi,a) from yi
2. total loss adds up per-example loss over all training examples
$L(a) = \sum_i L(f(z^i, a), y^i)$
Loss Function, First Attempt
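• Per-example loss function ‘counts’ if error happens
$L(f(z^i, a), y^i) = \begin{cases} 0 & \text{if } f(z^i, a) = y^i \\ 1 & \text{otherwise} \end{cases}$
• Example
$z^1 = \begin{bmatrix} 1 \\ 2 \end{bmatrix}$, $y^1 = 1$, $z^2 = \begin{bmatrix} 1 \\ 4 \end{bmatrix}$, $y^2 = -1$, $a = \begin{bmatrix} 2 \\ -3 \end{bmatrix}$
$f(z^1, a) = \mathrm{sign}(a^t z^1) = \mathrm{sign}(1 \cdot 2 - 3 \cdot 2) = -1$
$f(z^2, a) = \mathrm{sign}(a^t z^2) = \mathrm{sign}(1 \cdot 2 - 3 \cdot 4) = -1$
$L(f(z^1, a), y^1) = 1$, $L(f(z^2, a), y^2) = 0$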


Loss Function, First Attempt
• Per-example loss function ‘counts’ if error happens
$L(f(z^i, a), y^i) = \begin{cases} 0 & \text{if } f(z^i, a) = y^i \\ 1 & \text{otherwise} \end{cases}$
• Total loss function counts total number of errors
$L(a) = \sum_i L(f(z^i, a), y^i)$
• For previous example, with $a = \begin{bmatrix} 2 \\ -3 \end{bmatrix}$
$L(f(z^1, a), y^1) = 1$, $L(f(z^2, a), y^2) = 0$
• Total loss
L(a) = 1 + 0 = 1
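The error-count loss for the two-example case can be reproduced with a short sketch. The helper names are hypothetical, and sign(0) = +1 is an assumed convention.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def sign(v):
    # Assumed convention: sign(0) counts as +1
    return 1 if v >= 0 else -1

def zero_one_loss(z, y, a):
    # 0 if the prediction sign(a^t z) equals the label, 1 otherwise
    return 0 if sign(dot(a, z)) == y else 1

examples = [([1, 2], 1), ([1, 4], -1)]  # (z, y) pairs from the slide
a = [2, -3]

per_example = [zero_one_loss(z, y, a) for z, y in examples]
total = sum(per_example)
print(per_example, total)
```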
Loss Function: First Attempt
• ‘error count’ loss function not suitable for gradient descent
• piecewise constant, gradient zero or does not exist
[Figure: piecewise constant L(a) plotted against a]
Perceptron Loss Function
• Different Loss Function: Perceptron Loss
$L_p(f(z^i, a), y^i) = \begin{cases} 0 & \text{if } f(z^i, a) = y^i \\ -y^i (a^t z^i) & \text{otherwise} \end{cases}$
• Lp(a) is non-negative
• positive misclassified example zi
• atzi < 0
• yi = 1
• yi(atzi) < 0
• negative misclassified example zi
• atzi > 0
• yi = -1
• yi(atzi) < 0
• if zi is misclassified then yi(atzi) < 0
• if zi is misclassified then -yi(atzi) > 0
• Lp(a) proportional to distance of misclassified example to boundary
Perceptron Loss Function

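$L_p(f(z^i, a), y^i) = \begin{cases} 0 & \text{if } f(z^i, a) = y^i \\ -y^i (a^t z^i) & \text{otherwise} \end{cases}$
• Example
$z^1 = \begin{bmatrix} 1 \\ 2 \end{bmatrix}$, $y^1 = 1$, $z^2 = \begin{bmatrix} 1 \\ 4 \end{bmatrix}$, $y^2 = -1$, $a = \begin{bmatrix} 2 \\ -3 \end{bmatrix}$
$f(z^1, a) = \mathrm{sign}(a^t z^1) = \mathrm{sign}(1 \cdot 2 - 3 \cdot 2) = \mathrm{sign}(-4) = -1$
$f(z^2, a) = \mathrm{sign}(a^t z^2) = \mathrm{sign}(1 \cdot 2 - 3 \cdot 4) = \mathrm{sign}(-10) = -1$
$L_p(f(z^1, a), y^1) = 4$, $L_p(f(z^2, a), y^2) = 0$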

• Total loss Lp(a) = 4 + 0 = 4
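The same two examples can be scored with the perceptron loss. This is a small sketch with hypothetical helper names; sign(0) = +1 is again an assumed convention.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def sign(v):
    # Assumed convention: sign(0) counts as +1
    return 1 if v >= 0 else -1

def perceptron_loss(z, y, a):
    # 0 for a correctly classified example, -y * (a^t z) for a misclassified one
    return 0 if sign(dot(a, z)) == y else -y * dot(a, z)

examples = [([1, 2], 1), ([1, 4], -1)]  # (z, y) pairs from the slide
a = [2, -3]

per_example = [perceptron_loss(z, y, a) for z, y in examples]
total = sum(per_example)
print(per_example, total)
```

Unlike the error count, the value 4 for the first example grows as that example moves further onto the wrong side of the boundary.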


Perceptron Loss Function
• Per-example loss
$L_p(f(z^i, a), y^i) = \begin{cases} 0 & \text{if } f(z^i, a) = y^i \\ -y^i (a^t z^i) & \text{otherwise} \end{cases}$
• Total loss
$L_p(a) = \sum_i L_p(f(z^i, a), y^i)$
• Lp(a) is piecewise linear and suitable for gradient descent
[Figure: piecewise linear Lp(a) plotted against a]
Optimizing with Gradient Descent
• Per-example loss
$L_p(f(z^i, a), y^i) = \begin{cases} 0 & \text{if } f(z^i, a) = y^i \\ -y^i (a^t z^i) & \text{otherwise} \end{cases}$
• Total loss
$L_p(a) = \sum_i L_p(f(z^i, a), y^i)$
• Gradient descent to minimize Lp(a)
$a = a - \alpha \nabla L_p(a)$
• Need gradient vector ∇Lp(a)
• the same dimension as a
• if $a = \begin{bmatrix} a_1 \\ a_2 \\ a_3 \end{bmatrix}$, then $\nabla L_p(a) = \begin{bmatrix} \frac{\partial L_p}{\partial a_1} \\ \frac{\partial L_p}{\partial a_2} \\ \frac{\partial L_p}{\partial a_3} \end{bmatrix}$
Optimizing with Gradient Descent
• Per-example loss
$L_p(f(z^i, a), y^i) = \begin{cases} 0 & \text{if } f(z^i, a) = y^i \\ -y^i (a^t z^i) & \text{otherwise} \end{cases}$
• Total loss
$L_p(a) = \sum_i L_p(f(z^i, a), y^i)$
• Gradient descent to minimize Lp(a)
$a = a - \alpha \nabla L_p(a)$
• Compute
$\nabla L_p(a) = \nabla \sum_i L_p(f(z^i, a), y^i) = \sum_i \nabla L_p(f(z^i, a), y^i)$
• each term in the sum is a per example gradient
$\nabla L_p(f(z^i, a), y^i) = \begin{bmatrix} \frac{\partial}{\partial a_1} L_p(f(z^i, a), y^i) \\ \frac{\partial}{\partial a_2} L_p(f(z^i, a), y^i) \\ \frac{\partial}{\partial a_3} L_p(f(z^i, a), y^i) \end{bmatrix}$
• Compute and add up per example gradients
Per Example Loss Gradient
• Per-example loss has two cases
$L_p(f(z^i, a), y^i) = \begin{cases} 0 & \text{if } f(z^i, a) = y^i \\ -y^i (a^t z^i) & \text{otherwise} \end{cases}$
• First case, f(zi,a) = yi
$\nabla L_p(f(z^i, a), y^i) = \begin{cases} \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix} & \text{if } f(z^i, a) = y^i \\ ? & \text{otherwise} \end{cases}$
• To save space, rewrite
$\nabla L_p(f(z^i, a), y^i) = \begin{cases} 0 & \text{if } f(z^i, a) = y^i \\ ? & \text{otherwise} \end{cases}$
Per Example Loss Gradient
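• Per-example loss has two cases
$L_p(f(z^i, a), y^i) = \begin{cases} 0 & \text{if } f(z^i, a) = y^i \\ -y^i (a^t z^i) & \text{otherwise} \end{cases}$
• Second case, f(zi,a) ≠ yi
$\nabla L_p(f(z^i, a), y^i) = \begin{bmatrix} \frac{\partial}{\partial a_1} \left(-y^i (a^t z^i)\right) \\ \frac{\partial}{\partial a_2} \left(-y^i (a^t z^i)\right) \\ \frac{\partial}{\partial a_3} \left(-y^i (a^t z^i)\right) \end{bmatrix} = \begin{bmatrix} \frac{\partial}{\partial a_1} \left(-y^i (a_1 z_1^i + a_2 z_2^i + a_3 z_3^i)\right) \\ \frac{\partial}{\partial a_2} \left(-y^i (a_1 z_1^i + a_2 z_2^i + a_3 z_3^i)\right) \\ \frac{\partial}{\partial a_3} \left(-y^i (a_1 z_1^i + a_2 z_2^i + a_3 z_3^i)\right) \end{bmatrix} = \begin{bmatrix} -y^i z_1^i \\ -y^i z_2^i \\ -y^i z_3^i \end{bmatrix} = -y^i z^i$
• Gradient for per-example loss
$\nabla L_p(f(z^i, a), y^i) = \begin{cases} 0 & \text{if } f(z^i, a) = y^i \\ -y^i z^i & \text{otherwise} \end{cases}$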
Optimizing with Gradient Descent
• Simpler formula
$\nabla L_p(a) = \sum_{\text{misclassified examples } i} -y^i z^i$
• Gradient descent update rule for Lp(a)
$a = a + \alpha \sum_{\text{misclassified examples } i} y^i z^i$
• called batch because it is based on all examples


Perceptron Loss Batch Example
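• Examples
• class 1: $x^1 = \begin{bmatrix} 2 \\ 3 \end{bmatrix}$, $x^2 = \begin{bmatrix} 4 \\ 3 \end{bmatrix}$, $x^3 = \begin{bmatrix} 3 \\ 5 \end{bmatrix}$
• class 2: $x^4 = \begin{bmatrix} 1 \\ 3 \end{bmatrix}$, $x^5 = \begin{bmatrix} 5 \\ 6 \end{bmatrix}$
• Labels
$y^1 = 1$, $y^2 = 1$, $y^3 = 1$, $y^4 = -1$, $y^5 = -1$
• Add extra feature
$z^1 = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}$, $z^2 = \begin{bmatrix} 1 \\ 4 \\ 3 \end{bmatrix}$, $z^3 = \begin{bmatrix} 1 \\ 3 \\ 5 \end{bmatrix}$, $z^4 = \begin{bmatrix} 1 \\ 1 \\ 3 \end{bmatrix}$, $z^5 = \begin{bmatrix} 1 \\ 5 \\ 6 \end{bmatrix}$
• Pile all examples as rows in matrix Z
• Pile all labels into column vector Y
$Z = \begin{bmatrix} 1 & 2 & 3 \\ 1 & 4 & 3 \\ 1 & 3 & 5 \\ 1 & 1 & 3 \\ 1 & 5 & 6 \end{bmatrix}$, $Y = \begin{bmatrix} 1 \\ 1 \\ 1 \\ -1 \\ -1 \end{bmatrix}$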
Perceptron Loss Batch Example
• Examples in Z, labels in Y
$Z = \begin{bmatrix} 1 & 2 & 3 \\ 1 & 4 & 3 \\ 1 & 3 & 5 \\ 1 & 1 & 3 \\ 1 & 5 & 6 \end{bmatrix}$, $Y = \begin{bmatrix} 1 \\ 1 \\ 1 \\ -1 \\ -1 \end{bmatrix}$
• Initial weights $a = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}$
• This is line x1 + x2 + 1 = 0
Perceptron Loss Batch Example
$Z = \begin{bmatrix} 1 & 2 & 3 \\ 1 & 4 & 3 \\ 1 & 3 & 5 \\ 1 & 1 & 3 \\ 1 & 5 & 6 \end{bmatrix}$, $a = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}$, $Y = \begin{bmatrix} 1 \\ 1 \\ 1 \\ -1 \\ -1 \end{bmatrix}$
• Perceptron Batch
$a = a + \alpha \sum_{\text{misclassified examples } i} y^i z^i$
• Let us use learning rate α = 0.2
$a = a + 0.2 \sum_{\text{misclassified examples } i} y^i z^i$
• Sample misclassified if y(atz) < 0


Perceptron Loss Batch Example
$Z = \begin{bmatrix} 1 & 2 & 3 \\ 1 & 4 & 3 \\ 1 & 3 & 5 \\ 1 & 1 & 3 \\ 1 & 5 & 6 \end{bmatrix}$, $a = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}$, $Y = \begin{bmatrix} 1 \\ 1 \\ 1 \\ -1 \\ -1 \end{bmatrix}$
• Sample misclassified if y(atz) < 0
• Can compute y(atz) for all samples
$Y .\!* (Z a) = \begin{bmatrix} 1 \\ 1 \\ 1 \\ -1 \\ -1 \end{bmatrix} .\!* \begin{bmatrix} 6 \\ 8 \\ 9 \\ 5 \\ 12 \end{bmatrix} = \begin{bmatrix} 6 \\ 8 \\ 9 \\ -5 \\ -12 \end{bmatrix}$
• Per example loss is
$L_p(f(z^i, a), y^i) = \begin{cases} 0 & \text{if } f(z^i, a) = y^i \\ -y^i (a^t z^i) & \text{otherwise} \end{cases}$
• Total loss is L(a) = 5 + 12 = 17
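Perceptron Loss Batch Example
$Z = \begin{bmatrix} 1 & 2 & 3 \\ 1 & 4 & 3 \\ 1 & 3 & 5 \\ 1 & 1 & 3 \\ 1 & 5 & 6 \end{bmatrix}$, $a = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}$, $Y = \begin{bmatrix} 1 \\ 1 \\ 1 \\ -1 \\ -1 \end{bmatrix}$
• Samples 4 and 5 misclassified
• Perceptron Batch rule update
$a = a + 0.2 \sum_{\text{misclassified examples } i} y^i z^i$
$a = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} + 0.2 \left( -1 \begin{bmatrix} 1 \\ 1 \\ 3 \end{bmatrix} - 1 \begin{bmatrix} 1 \\ 5 \\ 6 \end{bmatrix} \right) = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} - \begin{bmatrix} 0.2 \\ 0.2 \\ 0.6 \end{bmatrix} - \begin{bmatrix} 0.2 \\ 1 \\ 1.2 \end{bmatrix} = \begin{bmatrix} 0.6 \\ -0.2 \\ -0.8 \end{bmatrix}$
• This is line -0.2x1 - 0.8x2 + 0.6 = 0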
Perceptron Loss Batch Example
$Z = \begin{bmatrix} 1 & 2 & 3 \\ 1 & 4 & 3 \\ 1 & 3 & 5 \\ 1 & 1 & 3 \\ 1 & 5 & 6 \end{bmatrix}$, $a = \begin{bmatrix} 0.6 \\ -0.2 \\ -0.8 \end{bmatrix}$, $Y = \begin{bmatrix} 1 \\ 1 \\ 1 \\ -1 \\ -1 \end{bmatrix}$
• Sample misclassified if y(atz) < 0
• Find all misclassified samples
$(Z a) .\!* Y = \begin{bmatrix} -2.2 \\ -2.6 \\ -4.0 \\ 2 \\ 5.2 \end{bmatrix}$
• Total loss is L(a) = 2.2 + 2.6 + 4 = 8.8
• previous loss was 17 with 2 misclassified examples
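Perceptron Loss Batch Example
$Z = \begin{bmatrix} 1 & 2 & 3 \\ 1 & 4 & 3 \\ 1 & 3 & 5 \\ 1 & 1 & 3 \\ 1 & 5 & 6 \end{bmatrix}$, $a = \begin{bmatrix} 0.6 \\ -0.2 \\ -0.8 \end{bmatrix}$, $Y = \begin{bmatrix} 1 \\ 1 \\ 1 \\ -1 \\ -1 \end{bmatrix}$, $(Z a) .\!* Y = \begin{bmatrix} -2.2 \\ -2.6 \\ -4.0 \\ 2 \\ 5.2 \end{bmatrix}$
• Perceptron Batch rule update
$a = a + 0.2 \sum_{\text{misclassified examples } i} y^i z^i$
$a = \begin{bmatrix} 0.6 \\ -0.2 \\ -0.8 \end{bmatrix} + 0.2 \left( 1 \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} + 1 \begin{bmatrix} 1 \\ 4 \\ 3 \end{bmatrix} + 1 \begin{bmatrix} 1 \\ 3 \\ 5 \end{bmatrix} \right) = \begin{bmatrix} 0.6 \\ -0.2 \\ -0.8 \end{bmatrix} + \begin{bmatrix} 0.2 \\ 0.4 \\ 0.6 \end{bmatrix} + \begin{bmatrix} 0.2 \\ 0.8 \\ 0.6 \end{bmatrix} + \begin{bmatrix} 0.2 \\ 0.6 \\ 1 \end{bmatrix} = \begin{bmatrix} 1.2 \\ 1.6 \\ 1.4 \end{bmatrix}$
• This is line 1.6x1 + 1.4x2 + 1.2 = 0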


Perceptron Loss Batch Example
$Z = \begin{bmatrix} 1 & 2 & 3 \\ 1 & 4 & 3 \\ 1 & 3 & 5 \\ 1 & 1 & 3 \\ 1 & 5 & 6 \end{bmatrix}$, $a = \begin{bmatrix} 1.2 \\ 1.6 \\ 1.4 \end{bmatrix}$, $Y = \begin{bmatrix} 1 \\ 1 \\ 1 \\ -1 \\ -1 \end{bmatrix}$
• Sample misclassified if y(atz) < 0
• Find all misclassified samples
$(Z a) .\!* Y = \begin{bmatrix} 8.6 \\ 11.8 \\ 13.0 \\ -7 \\ -17.6 \end{bmatrix}$
• Total loss is L(a) = 7 + 17.6 = 24.6


• previous loss was 8.8 with 3 misclassified examples
• loss went up, means learning rate of 0.2 is too high
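The whole batch example can be replayed in a few lines of plain Python. This is a sketch with hypothetical helper names (`margins`, `total_loss`, `batch_update`); the data, the initial weights, and α = 0.2 are taken from the slides.

```python
Z = [[1, 2, 3], [1, 4, 3], [1, 3, 5], [1, 1, 3], [1, 5, 6]]  # augmented examples
Y = [1, 1, 1, -1, -1]                                        # labels

def margins(a):
    # y * (a^t z) for every example; negative means misclassified
    return [y * sum(ai * zi for ai, zi in zip(a, z)) for z, y in zip(Z, Y)]

def total_loss(a):
    # Perceptron loss: sum of -y * (a^t z) over misclassified examples
    return sum(-m for m in margins(a) if m < 0)

def batch_update(a, alpha):
    # a = a + alpha * (sum of y * z over misclassified examples)
    step = [0.0] * len(a)
    for z, y, m in zip(Z, Y, margins(a)):
        if m < 0:
            step = [s + y * zi for s, zi in zip(step, z)]
    return [ai + alpha * si for ai, si in zip(a, step)]

weights = [[1.0, 1.0, 1.0]]
for _ in range(2):
    weights.append(batch_update(weights[-1], 0.2))
losses = [total_loss(a) for a in weights]

for a, loss in zip(weights, losses):
    print([round(v, 3) for v in a], round(loss, 3))
```

The three printed rows correspond to the three weight vectors in the worked example, including the final increase in loss.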
Single-Sample Gradient Descent
• Batch gradient descent can be slow to converge if lots of examples
• Single sample optimization
• update weights a as soon as possible, after seeing 1 example
• One iteration (epoch)
• go over all examples, in random order
• update after seeing one example zi
$a = a - \alpha \nabla L_p(f(z^i, a), y^i)$
• the term ∇Lp(f(zi,a), yi) is the per example gradient
Batch Size: Loss Surface Illustration
[Figure: two paths from start to finish on a loss surface with a global min. Left: batch gradient descent, one iteration. Right: single sample gradient descent, one iteration, where each update sees only one example.]
Mini-Batch Gradient Descent
• Mini-Batch optimization
• update weights a after seeing m examples
• One iteration (epoch)
• go over all examples, in random order
• update after seeing m examples
$a = a - \alpha \sum_{m \text{ examples}} \nabla L_p(f(z^i, a), y^i)$
• the sum runs over the per example gradients of the m examples in the mini-batch

• Most often used in practice


• Practical Issue: both batch and mini-batch algorithms converge
faster if features are roughly on the same scale
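One epoch of mini-batch descent with the perceptron loss might look like the sketch below. The function name `minibatch_epoch`, the seeded `random.Random` generator, and the reuse of the five-example dataset are choices made for this illustration. Setting m = 1 gives single-sample updates, and m equal to the dataset size gives one batch update.

```python
import random

Z = [[1, 2, 3], [1, 4, 3], [1, 3, 5], [1, 1, 3], [1, 5, 6]]
Y = [1, 1, 1, -1, -1]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def minibatch_epoch(Z, Y, a, alpha, m, rng):
    """One pass over the data in random order, updating after every m examples."""
    order = list(range(len(Z)))
    rng.shuffle(order)
    for start in range(0, len(order), m):
        batch = order[start:start + m]
        step = [0.0] * len(a)
        for i in batch:
            # Per example gradient is -y * z for misclassified examples, 0 otherwise
            if Y[i] * dot(a, Z[i]) < 0:
                step = [s + Y[i] * zi for s, zi in zip(step, Z[i])]
        a = [ai + alpha * si for ai, si in zip(a, step)]
    return a

a_batch = minibatch_epoch(Z, Y, [1.0, 1.0, 1.0], 0.2, m=len(Z), rng=random.Random(0))
a_single = minibatch_epoch(Z, Y, [1.0, 1.0, 1.0], 0.2, m=1, rng=random.Random(0))
print(a_batch, a_single)
```

With m equal to the dataset size, the epoch reduces to the single batch update from the worked example, regardless of the shuffle.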
Logistic Regression, Motivation
• Recall line fitting
[Figure: two 1D fits of labels 1 and -1 plotted against z. Left: solution with squared error loss, one sample misclassified. Right: solution with perceptron loss, all samples classified correctly.]
• Instead of trying to get atz close to y
• use differentiable function σ(atz) that “squishes” the range
• try to get σ(atz) close to y
Linear Classifier: Logistic Regression
• Denote classes with 1 and 0 now
• yi = 1 for positive class, yi = 0 for negative
• Use logistic sigmoid function σ(t) for “squishing” atz
$\sigma(t) = \frac{1}{1 + \exp(-t)}$
[Figure: σ(t) plotted against t, rising from 0 towards 1 and passing through 0.5]
• Despite “regression” in the name, logistic regression is used for classification, not regression
Logistic Regression vs. Regression
[Figure: quadratic loss and logistic regression loss, together with σ(atz), plotted against atz]
Logistic Regression: Loss Function
• Could use (yi − σ(atz))² as per-example loss function
• Instead use a different loss
• if z has label 1, want σ(atz) close to 1, define loss as -log[σ(atz)]
• if z has label 0, want σ(atz) close to 0, define loss as -log[1 − σ(atz)]
[Figure: -log t plotted against t, reaching zero at t = 1]
• Gradient descent batch update rule
$a = a + \alpha \sum_i \left(y^i - \sigma(a^t z^i)\right) z^i$
• Probabilistic interpretation
• P(class 1) = σ(atz)
• P(class 0) = 1 - P(class 1)
• loss function is negative log-likelihood, -log P(y)
• standard objective in statistics
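The batch update rule for logistic regression can be tried on a tiny dataset. In this sketch the 1D samples are borrowed from the earlier training example and relabeled with 0/1; the step size 0.05, the 200 iterations, and all function names are arbitrary choices for illustration.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def sigmoid(t):
    # Logistic sigmoid; fine for the moderate values of t used here
    return 1.0 / (1.0 + math.exp(-t))

def nll(Z, Y, a):
    # Negative log-likelihood: -log(sigma) for label 1, -log(1 - sigma) for label 0
    total = 0.0
    for z, y in zip(Z, Y):
        p = sigmoid(dot(a, z))
        total += -math.log(p) if y == 1 else -math.log(1.0 - p)
    return total

def batch_step(Z, Y, a, alpha):
    # a = a + alpha * sum_i (y_i - sigma(a^t z_i)) z_i
    step = [0.0] * len(a)
    for z, y in zip(Z, Y):
        err = y - sigmoid(dot(a, z))
        step = [s + err * zi for s, zi in zip(step, z)]
    return [ai + alpha * si for ai, si in zip(a, step)]

Z = [[1.0, x] for x in (-2, -1, 1, 2, 3, 5)]  # augmented 1D samples
Y = [0, 0, 0, 1, 1, 1]

a = [0.0, 0.0]
loss_before = nll(Z, Y, a)
for _ in range(200):
    a = batch_step(Z, Y, a, 0.05)
loss_after = nll(Z, Y, a)
print(loss_before, loss_after, a)
```

With zero weights every example has probability 0.5, so the starting loss is 6·log 2; the updates then lower it.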
Logistic Regression vs. Perceptron
• Green example classified correctly, but close to decision boundary
• no loss under Perceptron
• non-negligible loss under logistic regression
• Logistic Regression encourages decision boundary to move away from training samples
• better generalization
[Figure: σ(wtx) plotted against x, with the green example close to the 0.5 crossing]
[Figure: two separating classifiers. One has zero Perceptron loss and smaller LR loss; the other has zero Perceptron loss and larger LR loss. Red classifier works better for new data.]
Logistic Regression vs. Regression vs. Perceptron

[Figure: quadratic loss, logistic regression loss, and perceptron loss plotted against y·atz, with a tick at 1. Regions along the axis: misclassified; classified correctly but close to decision boundary; classified correctly and not too close to the decision boundary.]

• Assuming labels are +1 and -1


Multiple Classes: General Case
• Define m linear discriminant functions
gi(x) = witx + wi0 for i = 1, 2, … m
• Assign x to class i if
gi(x) > gj(x) for all j ≠ i
• Let Ri be decision region for class i
• all points in Ri assigned to class i

[Figure: plane divided into three decision regions]
• R1: g1(x) > g2(x) and g1(x) > g3(x)
• R2: g2(x) > g1(x) and g2(x) > g3(x)
• R3: g3(x) > g1(x) and g3(x) > g2(x)
Multiple Classes
• Can be shown that decision regions are convex
• In particular, they must be spatially contiguous
Multiclass Linear Classifier: Matrix Notation
• Assume examples x are augmented with extra feature 1, no need
to write bias explicitly
• but from now on will not change notation to z’s
• Define m discriminant functions
gi(x) = witx for i = 1, 2, … m
• Assign x to i that gives maximum gi(x)
• Picture illustration
[Figure: x is fed to four discriminant functions: g1(x) → 5, g2(x) → 3, g3(x) → -9, g4(x) → 10]
• pile all outputs into one vector $\begin{bmatrix} 5 \\ 3 \\ -9 \\ 10 \end{bmatrix}$
• decide class 4
Multiclass Linear Classifier: Matrix Notation
• Could use one dimensional output yi ∊ {1, 2, 3, …, m}
• Convenient to use multi-dimensional outputs (one-hot encoding)
$y = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix}$ for class 1, $y = \begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \end{bmatrix}$ for class 2, $y = \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}$ for class 3, $y = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix}$ for class 4
• if sample is of class i, want output vector to be 0 everywhere except position i, where it should be 1
• example: x is of class 2; the outputs g1(x), …, g4(x) give $\begin{bmatrix} 5 \\ 3 \\ -9 \\ 10 \end{bmatrix}$ (got this), but the target is $\begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \end{bmatrix}$ (want this)
Multiclass Linear Classifier: Matrix Notation
• Assign x to i that gives maximum gi(x) = witx
• example: g1(x) = w1tx = [2 4 −7]x, g2(x) = w2tx = [9 −3 2]x, g3(x) = w3tx = [4 5 2]x, g4(x) = w4tx = [2 −7 1]x
• In matrix notation, with rows w1t, …, w4t stacked into W
$W x = \begin{bmatrix} 2 & 4 & -7 \\ 9 & -3 & 2 \\ 4 & 5 & 2 \\ 2 & -7 & 1 \end{bmatrix} \begin{bmatrix} 1 \\ 7 \\ 4 \end{bmatrix} = \begin{bmatrix} 2 \\ -4 \\ 47 \\ -43 \end{bmatrix}$
• Assign x to class that corresponds to largest row of Wx
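The matrix form of the multiclass decision rule is a matrix-vector product followed by an argmax. The sketch below uses W and x from the slide example; `matvec` is a hypothetical helper, and classes are numbered from 1 to match the slides.

```python
W = [[2, 4, -7],
     [9, -3, 2],
     [4, 5, 2],
     [2, -7, 1]]   # one row of weights per class
x = [1, 7, 4]

def matvec(W, x):
    # Each entry of W x is one discriminant g_i(x) = w_i^t x
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

scores = matvec(W, x)
# Index of the largest score, shifted by 1 because classes are numbered from 1
predicted_class = max(range(len(scores)), key=lambda i: scores[i]) + 1
print(scores, predicted_class)
```

Here the third entry, 47, is the largest, so the sample is assigned to class 3.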
Quadratic Loss Function
• Assign sample xi to class that corresponds to largest row of Wxi
• Loss function?
$W x^i = \begin{bmatrix} 2 \\ -4 \\ 47 \\ -43 \end{bmatrix}$, $y^i = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix}$
• Can use quadratic loss per sample xi as ½||Wxi − yi||²
• for example above, loss (2² + 4² + 47² + 44²)/2
• total loss on all training samples $L(W) = \frac{1}{2} \sum_i \| W x^i - y^i \|^2$
• gradient of the loss
$\nabla L(W) = \sum_i (W x^i - y^i)(x^i)^t$
• ∇L(W) has the same shape as W
• batch gradient descent update
$W = W - \alpha \sum_i (W x^i - y^i)(x^i)^t$
Quadratic Loss Function
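• Consider gradient descent update, single sample x with α = 1
$W = W - (W x - y) x^t$
• Suppose $x = \begin{bmatrix} 1 \\ 3 \\ 2 \end{bmatrix}$ and is of class 2 and $W = \begin{bmatrix} 2 & 4 & -7 \\ 9 & -3 & 2 \\ 4 & 5 & 2 \\ 2 & -7 & 1 \end{bmatrix}$
$W x - y = \begin{bmatrix} 0 \\ 4 \\ 23 \\ -17 \end{bmatrix} - \begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \end{bmatrix} = \begin{bmatrix} 0 \\ 3 \\ 23 \\ -17 \end{bmatrix}$
• slide annotations on the rows of Wx: 4 is ok, 23 is too large, −17 is too small
• update rule
$W = \begin{bmatrix} 2 & 4 & -7 \\ 9 & -3 & 2 \\ 4 & 5 & 2 \\ 2 & -7 & 1 \end{bmatrix} - \begin{bmatrix} 0 \\ 3 \\ 23 \\ -17 \end{bmatrix} \begin{bmatrix} 1 & 3 & 2 \end{bmatrix} = \begin{bmatrix} 2 & 4 & -7 \\ 9 & -3 & 2 \\ 4 & 5 & 2 \\ 2 & -7 & 1 \end{bmatrix} - \begin{bmatrix} 0 & 0 & 0 \\ 3 & 9 & 6 \\ 23 & 69 & 46 \\ -17 & -51 & -34 \end{bmatrix} = \begin{bmatrix} 2 & 4 & -7 \\ 6 & -12 & -4 \\ -19 & -64 & -44 \\ 19 & 44 & 35 \end{bmatrix}$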
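Quadratic Loss Function
$W x - y = \begin{bmatrix} 0 \\ 4 \\ 23 \\ -17 \end{bmatrix} - \begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \end{bmatrix} = \begin{bmatrix} 0 \\ 3 \\ 23 \\ -17 \end{bmatrix}$
• slide annotations on the rows of Wx: 4 is ok, 23 is too large, −17 is too small
• With new $W = \begin{bmatrix} 2 & 4 & -7 \\ 6 & -12 & -4 \\ -19 & -64 & -44 \\ 19 & 44 & 35 \end{bmatrix}$, $W x = \begin{bmatrix} 0 \\ -38 \\ -299 \\ 221 \end{bmatrix}$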

• Already saw that quadratic loss does not work that well for classification
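The single-sample quadratic-loss update with α = 1 can be replayed exactly in integer arithmetic. This is a sketch with hypothetical helper names (`matvec`, `quadratic_step`), using W, x, and the class-2 one-hot target from the example.

```python
W = [[2, 4, -7],
     [9, -3, 2],
     [4, 5, 2],
     [2, -7, 1]]
x = [1, 3, 2]
y = [0, 1, 0, 0]   # one-hot target for class 2

def matvec(W, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def quadratic_step(W, x, y, alpha):
    # W = W - alpha * (W x - y) x^t, an outer-product update
    residual = [s - t for s, t in zip(matvec(W, x), y)]
    return [[wij - alpha * r * xj for wij, xj in zip(row, x)]
            for row, r in zip(W, residual)]

scores_old = matvec(W, x)
W_new = quadratic_step(W, x, y, 1)
scores_new = matvec(W_new, x)
print(scores_old, W_new, scores_new)
```

After the step the scores overshoot badly in the opposite direction, which is the behaviour the slide points to.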
Softmax Function
• Define softmax(a) function
$\mathrm{softmax}\left( \begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \end{bmatrix} \right) = \begin{bmatrix} \exp(a_1) / \sum_{j=1}^{4} \exp(a_j) \\ \exp(a_2) / \sum_{j=1}^{4} \exp(a_j) \\ \exp(a_3) / \sum_{j=1}^{4} \exp(a_j) \\ \exp(a_4) / \sum_{j=1}^{4} \exp(a_j) \end{bmatrix}$
• Example
$\mathrm{softmax}\left( \begin{bmatrix} -3 \\ 2 \\ 1 \end{bmatrix} \right) = \begin{bmatrix} \exp(-3) / (\exp(-3) + \exp(2) + \exp(1)) \\ \exp(2) / (\exp(-3) + \exp(2) + \exp(1)) \\ \exp(1) / (\exp(-3) + \exp(2) + \exp(1)) \end{bmatrix} = \begin{bmatrix} 0.005 \\ 0.7275 \\ 0.2676 \end{bmatrix}$
• Softmax renormalizes a vector so that it can be interpreted as
a vector of probabilities
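A direct implementation of softmax takes only a few lines. Subtracting the maximum before exponentiating is a common numerical safeguard that is not mentioned in the slides; it leaves the result unchanged because the common factor cancels in the ratio.

```python
import math

def softmax(a):
    # Shift by the maximum for numerical stability (an added safeguard)
    m = max(a)
    exps = [math.exp(v - m) for v in a]
    s = sum(exps)
    return [e / s for e in exps]

probs = softmax([-3.0, 2.0, 1.0])
print([round(p, 4) for p in probs], sum(probs))
```

The outputs are positive and sum to 1, and the largest input receives the largest probability.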
Softmax Loss Function
• Generalization of logistic regression to multiclass case
• Instead of raw scores
$\begin{bmatrix} w_1^t x \\ w_2^t x \\ w_3^t x \\ w_4^t x \end{bmatrix} = \begin{bmatrix} 2 \\ -1 \\ 5 \\ -3 \end{bmatrix}$
• Use softmax scores
$\mathrm{softmax}\left( \begin{bmatrix} w_1^t x \\ w_2^t x \\ w_3^t x \\ w_4^t x \end{bmatrix} \right) = \mathrm{softmax}\left( \begin{bmatrix} 2 \\ -1 \\ 5 \\ -3 \end{bmatrix} \right) = \begin{bmatrix} 0.0473 \\ 0.0024 \\ 0.9500 \\ 0.0003 \end{bmatrix} = \begin{bmatrix} \Pr(\text{class } 1) \\ \Pr(\text{class } 2) \\ \Pr(\text{class } 3) \\ \Pr(\text{class } 4) \end{bmatrix}$

• Classifier output interpreted as probability for each class


Gradient Descent: Softmax Loss Function
• Optimize under -log Pr(yi) loss function
• Example
$x^i = \begin{bmatrix} 1 \\ 3 \\ 2 \end{bmatrix}$, $y^i = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix}$, $W = \begin{bmatrix} 2 & 4 & -7 \\ 9 & -3 & 2 \\ 4 & 5 & 2 \\ 2 & -7 & 1 \end{bmatrix}$
$\mathrm{softmax}(W x^i) = \mathrm{softmax}\left( \begin{bmatrix} 0 \\ 4 \\ 23 \\ -17 \end{bmatrix} \right) = \begin{bmatrix} 0.000000000102619 \\ 0.000000005602796 \\ 0.999999994294585 \\ \approx 4.2 \times 10^{-18} \end{bmatrix} = \begin{bmatrix} \Pr(\text{class } 1) \\ \Pr(\text{class } 2) \\ \Pr(\text{class } 3) \\ \Pr(\text{class } 4) \end{bmatrix}$
• Loss on this example is -log Pr(class 4) ≈ -log(4.2 × 10⁻¹⁸) ≈ 40


Gradient Descent: Softmax Loss Function
• Update rule for weight matrix W
$W = W + \alpha \sum_i \left( y^i - \mathrm{softmax}(W x^i) \right) (x^i)^t$
• Example, single sample gradient descent with α = 0.1
$x^i = \begin{bmatrix} 1 \\ 3 \\ 2 \end{bmatrix}$, $y^i = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix}$, $W = \begin{bmatrix} 2 & 4 & -7 \\ 9 & -3 & 2 \\ 4 & 5 & 2 \\ 2 & -7 & 1 \end{bmatrix}$, $W x^i = \begin{bmatrix} 0 \\ 4 \\ 23 \\ -17 \end{bmatrix}$
• Update for W
$W = \begin{bmatrix} 2 & 4 & -7 \\ 9 & -3 & 2 \\ 4 & 5 & 2 \\ 2 & -7 & 1 \end{bmatrix} + 0.1 \left( \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix} - \mathrm{softmax}\left( \begin{bmatrix} 0 \\ 4 \\ 23 \\ -17 \end{bmatrix} \right) \right) \begin{bmatrix} 1 & 3 & 2 \end{bmatrix} = \begin{bmatrix} 2 & 4 & -7 \\ 9 & -3 & 2 \\ 3.9 & 4.7 & 1.8 \\ 2.1 & -6.7 & 1.2 \end{bmatrix}$
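The softmax example can be checked numerically: the loss on the sample and the single-sample update with α = 0.1. This sketch uses hypothetical helper names, and the max-shift inside `softmax` is an added numerical safeguard rather than something the slides specify.

```python
import math

W = [[2.0, 4.0, -7.0],
     [9.0, -3.0, 2.0],
     [4.0, 5.0, 2.0],
     [2.0, -7.0, 1.0]]
x = [1.0, 3.0, 2.0]
y = [0.0, 0.0, 0.0, 1.0]   # one-hot target for class 4

def matvec(W, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def softmax(a):
    m = max(a)  # shift for numerical stability; the result is unchanged
    exps = [math.exp(v - m) for v in a]
    s = sum(exps)
    return [e / s for e in exps]

def softmax_step(W, x, y, alpha):
    # W = W + alpha * (y - softmax(W x)) x^t
    p = softmax(matvec(W, x))
    return [[wij + alpha * (yi - pi) * xj for wij, xj in zip(row, x)]
            for row, yi, pi in zip(W, y, p)]

probs = softmax(matvec(W, x))
loss = -math.log(probs[3])          # -log Pr(class 4)
W_new = softmax_step(W, x, y, 0.1)
print(loss, [[round(v, 4) for v in row] for row in W_new])
```

Only the rows for class 3 (over-confident) and class 4 (the true class) change noticeably; the other two rows move by amounts far below the printed precision.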
More General Discriminant Functions
• Linear discriminant functions
• simple decision boundary
• should try simpler models first to avoid overfitting
• optimal for certain type of data
• Gaussian distributions with equal covariance
• May not be optimal for other data distributions
• Discriminant functions can be more general than linear
• For example, polynomial discriminant functions
• Decision boundaries more complex than linear
• Later will look more at non-linear discriminant functions
Summary
• Linear classifier works well when examples are
linearly separable, or almost separable
• Perceptron Loss function was the first historic loss
• Logistic regression/softmax work better in practice
• Optimization with gradient descent
• stochastic mini-batch works best in practice
Linear Classifier: Quadratic Loss
• Quadratic per-example loss
$L_p(f(z^i, a), y^i) = \frac{1}{2} \left( y^i - a^t z^i \right)^2$
[Figure: line atz fitted to labels 1 and -1, plotted against z]
• This is standard line fitting
• note that even correctly classified examples can have a large loss
• Can find optimal weight a analytically with least squares
• expensive for large problems
• Gradient descent more efficient for a larger problem
$\nabla L_p(a) = -\sum_i \left( y^i - a^t z^i \right) z^i$
• Batch update rule
$a = a + \alpha \sum_i \left( y^i - a^t z^i \right) z^i$
