
Machine Learning and Data Mining

Linear classification

Prof. Alexander Ihler


Supervised learning
Notation:
Features x
Targets y
Predictions ŷ
Parameters θ

[Diagram: the learning algorithm. Training data (examples: features and target values) feed a program (the learner), which is characterized by some parameters θ and is a procedure (using θ) that outputs a prediction. A score of performance (cost function) provides feedback; the learner changes θ to improve performance.]
Linear regression
"Predictor":
Evaluate line: r = θ0 + θ1 x
return r

[Plot: training data and the fitted line; Feature x on the horizontal axis, Target y on the vertical axis.]

Contrast with classification
Classify: predict discrete-valued target y

(c) Alexander Ihler


Perceptron Classifier (2 features)
T(f)

f(X,Y)
x1 1
x2 Classifier
2 T(f)
f = 1 X1 + 2 X2 + 0
Threshold {-1, +1}
1 0
weighted sum Function
of the inputs output
= class decision

Visualizing for one feature x:

f(x)
y y T(f)

x (c) Alexander Ihler x


Perceptrons
Perceptron = a linear classifier
The parameters θ are sometimes called weights (w)
real-valued constants (can be positive or negative)
Define an additional constant input x0 = 1

A perceptron calculates 2 quantities:


1. A weighted sum of the input features
2. This sum is then thresholded by the T(.) function

Perceptron: a simple artificial model of human neurons


weights = synapses
threshold = neuron firing

(c) Alexander Ihler


Notation
Inputs:
x0, x1, x2, ..., xn
x1, x2, ..., xn-1, xn are the values of the n features
x0 = 1 (a constant input)
x = [[x0, x1, x2, ..., xn]] : feature vector (row vector)
Weights (parameters):
θ0, θ1, θ2, ..., θn
we have n+1 weights: one for each feature + one for the constant
θ = [[θ0, θ1, θ2, ..., θn]] : parameter vector (row vector)

Linear response
θ0 x0 + θ1 x1 + ... + θn xn = x · θᵀ, then threshold

F = X.dot( theta.T )   # compute linear response
Yhat = np.sign(F)      # predict class +1 or -1
Yhat = 2*(F>0)-1       # "manual" sign of F

(Matlab) >> f = th*x; f = sum(th.*x); yhat = sign(f);
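
As a concrete check of the numpy lines above (a minimal sketch; the array values are illustrative and reuse the θ = (1, .5, -.5) example coming up next):

import numpy as np

theta = np.array([[1.0, 0.5, -0.5]])    # [theta0, theta1, theta2] as a row vector
X = np.array([[1.0, 2.0, 1.0],          # each row: [x0=1, x1, x2]
              [1.0, 0.0, 3.0]])

F = X.dot(theta.T)                      # linear responses, shape (2, 1)
Yhat = np.sign(F)                       # predicted classes, +1 or -1
print(F.ravel(), Yhat.ravel())          # [ 1.5 -0.5 ]  [ 1. -1. ]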


Perceptron Decision Boundary
The perceptron is defined by the decision rule:
ŷ = +1 if θ · x > 0
ŷ = -1 otherwise

The perceptron represents a hyperplane decision surface in d-dimensional space
A line in 2D, a plane in 3D, etc.

The equation of the hyperplane is given by
θ · x = 0
This defines the set of points that are on the boundary.

(c) Alexander Ihler


Example, Linear Decision Boundary

θ = (θ0, θ1, θ2) = (1, .5, -.5)

[Plot: data in the (x1, x2) feature plane.]

From P. Smyth
Example, Linear Decision Boundary

θ = (θ0, θ1, θ2) = (1, .5, -.5)

θ · x = 0
=> .5 x1 - .5 x2 + 1·1 = 0
=> -.5 x2 = -.5 x1 - 1
=> x2 = x1 + 2

[Plot: the boundary line x2 = x1 + 2 in the (x1, x2) feature plane.]

From P. Smyth
Example, Linear Decision Boundary

θ = (θ0, θ1, θ2) = (1, .5, -.5)

θ · x = 0 on the boundary line x2 = x1 + 2
θ · x < 0  =>  x1 + 2 < x2   (this is the equation for decision region -1)
θ · x > 0  =>  x1 + 2 > x2   (decision region +1)

[Plot: the two decision regions on either side of the line in the (x1, x2) feature plane.]

From P. Smyth
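
A quick numeric check of this boundary in Python (the test points are made up for illustration):

import numpy as np

theta = np.array([1.0, 0.5, -0.5])           # (theta0, theta1, theta2)

def predict(x1, x2):
    """Return +1 or -1 from the sign of theta . [1, x1, x2]."""
    f = theta.dot(np.array([1.0, x1, x2]))   # linear response
    return 1 if f > 0 else -1

print(predict(0.0, 0.0))            # x1 + 2 = 2 > x2 = 0: class +1
print(predict(0.0, 5.0))            # x1 + 2 = 2 < x2 = 5: class -1
print(theta.dot([1.0, 1.0, 3.0]))   # a point on the line x2 = x1 + 2: response 0.0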
Separability
A data set is separable by a learner if
There is some instance of that learner that correctly predicts all the data
points
Linearly separable data
Can separate the two classes using a straight line in feature space
in 2 dimensions the decision boundary is a straight line

[Two plots, axes Feature 1 (x1) vs. Feature 2 (x2): linearly separable data, where a straight decision boundary separates the two classes, and linearly non-separable data, where no straight decision boundary can.]

(c) Alexander Ihler
Class overlap
Classes may not be well-separated
Same observation values possible under both classes
High vs low risk; features {age, income}
Benign/malignant cells look similar

[Scatter plot: two overlapping classes of points in a 2D feature space.]

Common in practice
May not be able to perfectly distinguish between classes
Maybe with more features?
Maybe with more complex classifier?
Otherwise, may have to accept some errors

(c) Alexander Ihler


Another example

[Scatter plot: two classes of points in a 2D feature space.]

(c) Alexander Ihler


Non-linear decision boundary

[The same data, now separated by a curved (non-linear) decision boundary.]

(c) Alexander Ihler


Representational Power of Perceptrons
What mappings can a perceptron represent perfectly?
A perceptron is a linear classifier
thus it can represent any mapping that is linearly separable
some Boolean functions like AND (on left)
but not Boolean functions like XOR (on right)

(c) Alexander Ihler
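
As a quick illustration (a sketch, not from the slides; the AND weights are one of many valid choices), a perceptron can compute AND on {0,1} inputs, while no choice of weights reproduces XOR:

import numpy as np

def perceptron(theta, x1, x2):
    """Threshold the linear response theta . [1, x1, x2] to {0, 1}."""
    return int(np.dot(theta, [1.0, x1, x2]) > 0)

theta_and = np.array([-1.5, 1.0, 1.0])     # one choice of weights implementing AND
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, perceptron(theta_and, x1, x2))   # matches x1 AND x2

# XOR is not linearly separable: a coarse brute-force search over weights
# (illustration only) finds no theta matching XOR on all four inputs.
xor = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
grid = np.arange(-2, 2.25, 0.25)
found = any(all(perceptron((t0, t1, t2), x1, x2) == y
                for (x1, x2), y in xor.items())
            for t0 in grid for t1 in grid for t2 in grid)
print("XOR representable on this grid?", found)    # False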


Adding features
Linear classifiers can't learn some functions

1D example: y = T( b x + c )
Not linearly separable

Add quadratic features:
y = T( a x² + b x + c )
Linearly separable in the new features (x, x²)

(c) Alexander Ihler
Adding features
Linear classifiers can't learn some functions

1D example: y = T( b x + c )
Not linearly separable

Quadratic features, visualized in the original feature space:
y = T( a x² + b x + c )
More complex decision boundary: a x² + b x + c = 0

Representational Power of Perceptrons
What mappings can a perceptron represent perfectly?
A perceptron is a linear classifier
thus it can represent any mapping that is linearly separable
some Boolean functions like AND (on left)
but not Boolean functions like XOR (on right)

What kinds of functions would we need to learn the data on the right?

(c) Alexander Ihler


Representational Power of Perceptrons
What mappings can a perceptron represent perfectly?
A perceptron is a linear classifier
thus it can represent any mapping that is linearly separable
some Boolean functions like AND (on left)
but not Boolean functions like XOR (on right)

What kinds of functions would we need to learn the data on the right?
Ellipsoidal decision boundary: a x1² + b x1 + c x2² + d x2 + e x1 x2 + f = 0
(c) Alexander Ihler
Feature representations
Features are used in a linear way
Learner is dependent on representation

Ex: discrete features


Mushroom surface: {fibrous, grooves, scaly, smooth}
Probably not useful to use x = {1, 2, 3, 4}
Better: 1-of-K, x = { [1000], [0100], [0010], [0001] }
Introduces more parameters, but a more flexible relationship

(c) Alexander Ihler
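
A small sketch of the 1-of-K encoding for the mushroom-surface example (the helper function is illustrative, not a specific library call):

import numpy as np

categories = ['fibrous', 'grooves', 'scaly', 'smooth']

def one_of_k(value):
    """Encode a single categorical value as a 1-of-K (one-hot) vector."""
    vec = np.zeros(len(categories))
    vec[categories.index(value)] = 1.0
    return vec

print(one_of_k('scaly'))   # [0. 0. 1. 0.]
# The classifier now gets one weight per surface type, instead of one weight
# forced to treat 'smooth' (4) as twice 'grooves' (2).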


Effect of dimensionality
Data are increasingly separable in high dimension. Is this a good thing?

Good
Separation is easier in higher dimensions (for fixed # of data m)
Increase the number of features, and even a linear classifier will eventually be
able to separate all the training examples!

Bad
Remember training vs. test error? Remember overfitting?
Increasingly complex decision boundaries can eventually get all the training
data right, but it doesn't necessarily bode well for test data

[Plot: predictive error vs. model complexity. Error on training data decreases steadily; error on test data first decreases, then rises. Underfitting on the left, overfitting on the right, with an ideal range in between.]
Summary
Linear classifier (perceptron)

Linear decision boundary


Computing and visualizing

Separability
Limits of the representational power of a perceptron

Adding features
Interpretations
Effect on separability
Potential for overfitting

(c) Alexander Ihler



Machine Learning and Data Mining

Linear classification: Learning

Prof. Alexander Ihler


Learning the Classifier Parameters
Learning from Training Data:
training data = labeled feature vectors
Find parameter values that predict well (low error)
error is estimated on the training data
true error will be on future test data

Define an objective function J(θ):


Classifier accuracy (for a given set of weights and labeled data)

Maximize this objective function (or, minimize error)


An optimization or search problem over the vector (θ1, θ2, θ0)

(c) Alexander Ihler


Training a linear classifier
How should we measure error?
Natural measure = fraction we get wrong (error rate)

err(θ) = (1/m) Σ_i δ( ŷ(i) ≠ y(i) )

where δ( ŷ(i) ≠ y(i) ) = 0 if ŷ(i) = y(i), and 1 otherwise

Yhat = np.sign( X.dot( theta.T ) )   # predict class, +1 or -1
err = np.mean( Y != Yhat )           # count errors: empirical error rate

But, hard to train via gradient descent
Not continuous
As the decision boundary moves, errors change abruptly

1D example:
T(f) = -1 if f < 0
T(f) = +1 if f > 0

(c) Alexander Ihler


Linear regression?
Simple option: set θ using linear regression

In practice, this often doesn't work so well

Consider adding a distant but "easy" point
MSE distorts the solution

(c) Alexander Ihler


Perceptron algorithm
Perceptron algorithm: an SGD-like algorithm
While (~done)
  For each data point j:
    ŷ(j) = T( θ · x(j) )               : predict output for data point j
    θ ← θ + α ( y(j) - ŷ(j) ) x(j)     : "gradient-like" step

Compare to linear regression + MSE cost
Identical update to SGD for MSE, except the error uses the
thresholded ŷ(j) instead of the linear response θ · x(j), so:
(1) For correct predictions, y(j) - ŷ(j) = 0
(2) For incorrect predictions, y(j) - ŷ(j) = ±2

"Adaptive" linear regression: correct predictions stop contributing

(c) Alexander Ihler
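
A compact sketch of this training loop in Python (assumptions: X already contains a constant-1 column, Y takes values ±1, and the learning rate, epoch limit, and example data are illustrative choices, not from the slides):

import numpy as np

def perceptron_train(X, Y, alpha=0.5, epochs=500):
    """Perceptron algorithm: SGD-like updates on misclassified points.

    X : (m, n+1) array with a constant-1 first column; Y : (m,) array of +/-1.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):                    # "while not done"
        errors = 0
        for j in range(X.shape[0]):            # for each data point j
            yhat = 1 if theta.dot(X[j]) > 0 else -1    # yhat(j) = T(theta . x(j))
            if yhat != Y[j]:
                theta += alpha * (Y[j] - yhat) * X[j]  # gradient-like step
                errors += 1
        if errors == 0:                        # converged (if linearly separable)
            break
    return theta

# Tiny illustrative data set (x0 = 1 prepended), separable by x2 = x1 + 2:
X = np.array([[1, 0, 0], [1, 2, 1], [1, 0, 5], [1, 1, 4]], dtype=float)
Y = np.array([1, 1, -1, -1])
theta = perceptron_train(X, Y)
print(np.where(X.dot(theta) > 0, 1, -1) == Y)  # all True once converged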
Perceptron algorithm
Perceptron algorithm: an SGD-like algorithm
While (~done)
  For each data point j:
    ŷ(j) = T( θ · x(j) )               : predict output for data point j
    θ ← θ + α ( y(j) - ŷ(j) ) x(j)     : "gradient-like" step

[Figure: data point y(j) predicted incorrectly, so the weights are updated and the decision boundary moves.]

(c) Alexander Ihler


Perceptron algorithm
Perceptron algorithm: an SGD-like algorithm
While (~done)
  For each data point j:
    ŷ(j) = T( θ · x(j) )               : predict output for data point j
    θ ← θ + α ( y(j) - ŷ(j) ) x(j)     : "gradient-like" step

[Figure: data point y(j) predicted correctly, so no update is made.]

(c) Alexander Ihler

Perceptron algorithm
Perceptron algorithm: an SGD-like algorithm
While (~done)
  For each data point j:
    ŷ(j) = T( θ · x(j) )               : predict output for data point j
    θ ← θ + α ( y(j) - ŷ(j) ) x(j)     : "gradient-like" step
(Converges if data are linearly separable)

[Figure: data point y(j) predicted correctly, so no update is made.]

(c) Alexander Ihler


Surrogate loss functions
Another solution: use a smooth loss
e.g., approximate the threshold function T(f)
Usually some smooth function of distance
Example: sigmoid σ(f), looks like an "S"

Now, measure e.g. MSE
Class y = {0, 1}
Far from the decision boundary: |f(.)| large, small error
Near the boundary: σ(f(.)) near 1/2, larger error

1D example:
Classification error = 2/9; MSE = (0² + 1² + .2² + .25² + .05² + ...)/9

Beyond misclassification rate
Which decision boundary is better?
Both have zero training error (perfect training accuracy)
But, one of them seems intuitively better

[Two plots, axes Feature 1 (x1) vs. Feature 2 (x2): two different linear boundaries that both separate the training data.]
Side benefit of smoothed error function
Encourages data to be far from the decision boundary
See more examples of this principle later...

(c) Alexander Ihler


Training the Classifier
Once we have a smooth measure of quality, we can find the
best settings for the parameters of
f(X1,X2) = a*X1 + b*X2 + c

Example: 2D feature space and the corresponding parameter space

[Figure: data and a candidate decision boundary in feature space [X,Y], and the same classifier as a point in parameter space [arctan(A/B), c]; loss J = 1.9.]

(c) Alexander Ihler


Training the Classifier
Once we have a smooth measure of quality, we can find the
best settings for the parameters of
f(X1,X2) = a*X1 + b*X2 + c

Example: 2D feature space and the corresponding parameter space

[Figure: a better decision boundary in feature space [X,Y] and its point in parameter space [arctan(A/B), c]; loss J = 0.4.]

(c) Alexander Ihler


Training the Classifier
Once we have a smooth measure of quality, we can find the best
settings for the parameters of
f(X1,X2) = a*X1 + b*X2 + c
Finding the minimum loss J(.) in parameter space

[Figure: the loss surface over parameter space, with the best point (minimum MSE) marked; J = 0.1.]

(c) Alexander Ihler


Finding the Best MSE
As in linear regression, this is now just optimization

Methods:
Gradient descent
Improve loss by small changes in parameters
("small" = learning rate)
Or, substitute your favorite optimization algorithm
Coordinate descent
Stochastic search
Genetic algorithms

[Figure: gradient descent steps on a loss surface.]

(c) Alexander Ihler


Gradient Equations
MSE (note, depends on the function σ(.)):

J(a,b,c) = (1/m) Σ_j ( y(j) - σ( a x1(j) + b x2(j) + c ) )²

What's the derivative with respect to one of the parameters, say a?

∂J/∂a = (1/m) Σ_j -2 ( y(j) - σ(f(j)) ) σ'(f(j)) x1(j),  where f(j) = a x1(j) + b x2(j) + c

Error between class and prediction × sensitivity of prediction to changes in parameter a

Similar for parameters b, c [replace x1 with x2 or 1 (constant)]
(c) Alexander Ihler
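
A sketch of these updates as batch gradient descent (the learning rate, iteration count, and function names are assumptions for illustration; y is taken in {0, 1} to match the sigmoid's range):

import numpy as np

def sig(z):                               # logistic sigmoid, range [0, 1]
    return 1.0 / (1.0 + np.exp(-z))

def mse_gradient_descent(X, Y, alpha=0.1, iters=1000):
    """Minimize (1/m) * sum (y - sig(theta . x))^2 by batch gradient descent.

    X : (m, n+1) with constant-1 first column; Y : (m,) with values in {0, 1}.
    """
    m, n1 = X.shape
    theta = np.zeros(n1)
    for _ in range(iters):
        f = X.dot(theta)                  # linear responses
        s = sig(f)                        # predictions sigma(f)
        ds = s * (1 - s)                  # sigmoid derivative at f
        grad = (-2.0 / m) * ((Y - s) * ds).dot(X)   # d/dtheta of the MSE
        theta -= alpha * grad             # small step downhill
    return theta

Predictions can then be made by thresholding, e.g. yhat = (sig(X.dot(theta)) > 0.5).astype(int).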
Saturating Functions
Many possible saturating functions

Logistic sigmoid (scaled for range [0,1]) is
σ(z) = 1 / (1 + exp(-z))
Derivative is
σ'(z) = σ(z) (1 - σ(z))
(to predict: threshold z at 0, or threshold σ(z) at 1/2)

Python implementation:
def sig(z):                        # logistic sigmoid, in [0,1]
    return 1.0 / (1.0 + np.exp(-z))

def dsig(z):                       # its derivative at z
    return sig(z) * (1 - sig(z))

For range [-1, +1]: use 2 σ(z) - 1, with derivative 2 σ(z) (1 - σ(z))
Predict: threshold z (or the rescaled sigmoid) at zero

(c) Alexander Ihler
Logistic regression
Interpret σ(θ · x) as the probability that y = 1
Use a negative log-likelihood loss function
If y = 1, cost is - log Pr[y=1] = - log σ(θ · x)
If y = 0, cost is - log Pr[y=0] = - log (1 - σ(θ · x))

Can write this succinctly:
J(θ) = -(1/m) Σ_j [ y(j) log σ(θ · x(j)) + (1 - y(j)) log(1 - σ(θ · x(j))) ]

(first term nonzero only if y=1; second term nonzero only if y=0)

(c) Alexander Ihler


Logistic regression
Interpret σ(θ · x) as the probability that y = 1
Use a negative log-likelihood loss function
If y = 1, cost is - log Pr[y=1] = - log σ(θ · x)
If y = 0, cost is - log Pr[y=0] = - log (1 - σ(θ · x))

Can write this succinctly:
J(θ) = -(1/m) Σ_j [ y(j) log σ(θ · x(j)) + (1 - y(j)) log(1 - σ(θ · x(j))) ]

Convex! Otherwise similar: optimize J(θ) via gradient descent

1D example:
Classification error = 2/9; NLL = - (log(.99) + log(.97) + ...)/9
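
A minimal sketch of this loss in code (the clip to avoid log(0) is an implementation detail added here, not something from the slides):

import numpy as np

def sig(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll_loss(theta, X, Y):
    """Negative log-likelihood for logistic regression; Y has values in {0, 1}."""
    p = sig(X.dot(theta))                 # Pr[y = 1 | x] for each row of X
    p = np.clip(p, 1e-12, 1 - 1e-12)      # avoid log(0)
    return -np.mean(Y * np.log(p) + (1 - Y) * np.log(1 - p))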


Gradient Equations
Logistic neg-log likelihood loss:
J(θ) = -(1/m) Σ_j [ y(j) log σ(θ · x(j)) + (1 - y(j)) log(1 - σ(θ · x(j))) ]

What's the derivative with respect to one of the parameters?
∂J/∂θ_k = (1/m) Σ_j ( σ(θ · x(j)) - y(j) ) x_k(j)

(c) Alexander Ihler
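
A matching sketch of the gradient (function and variable names are illustrative):

import numpy as np

def sig(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll_gradient(theta, X, Y):
    """Gradient of the logistic NLL: (1/m) * X^T (sigma(X theta) - Y)."""
    m = X.shape[0]
    return X.T.dot(sig(X.dot(theta)) - Y) / m

# One gradient-descent step with an (assumed) learning rate alpha:
# theta = theta - alpha * nll_gradient(theta, X, Y)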


Surrogate loss functions
Replace the 0/1 loss with something easier:

[Plot: loss as a function of the linear response, comparing the 0/1 loss, the logistic MSE, and the logistic negative log likelihood.]

(c) Alexander Ihler


Summary
Linear classifier (perceptron)

Measuring quality of a decision boundary


Error rate (0/1 loss)
Logistic sigmoid + MSE criterion
Logistic Regression

Learning the weights of a linear classifier from data


Reduces to an optimization problem
Perceptron algorithm
For MSE or Logistic NLL, we can do gradient descent
Gradient equations & update rules

(c) Alexander Ihler


Multiclass linear models
Define a generic linear classifier by
ŷ = f(x ; θ) = arg max_c  θ_c · x

Example: y ∈ {-1, +1}
(Standard perceptron rule)

(c) Alexander Ihler


Multiclass linear models
Define a generic linear classifier by
ŷ = f(x ; θ) = arg max_c  θ_c · x

Example: y ∈ {0, 1, 2, ...}
(parameters θ_c for each class c)
(predict the class with the largest linear response)

(c) Alexander Ihler
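
A small sketch of multiclass prediction with per-class parameter vectors (the arrays below are made-up illustrations):

import numpy as np

Theta = np.array([[ 0.5,  1.0, -1.0],   # theta_c for class c = 0
                  [ 0.0, -1.0,  1.0],   # class c = 1
                  [-0.5,  0.0,  0.0]])  # class c = 2; columns: [x0=1, x1, x2]

X = np.array([[1.0, 2.0, 0.0],
              [1.0, 0.0, 3.0]])         # rows: feature vectors with x0 = 1

F = X.dot(Theta.T)                      # linear response of each class, (m, C)
Yhat = np.argmax(F, axis=1)             # predict class with largest response
print(Yhat)                             # [0 1]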
Training multiclass perceptrons
Multi-class perceptron algorithm
Straightforward generalization of the perceptron algorithm

Multilogistic regression
Take p(c | x) ∝ exp[ θ_c · x ]
Normalize by the sum over classes c
Straightforward generalization of logistic regression

(c) Alexander Ihler
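
A sketch of that normalization, often called the softmax (the max subtraction is a standard numerical-stability detail added here, not from the slides):

import numpy as np

def softmax_probs(Theta, x):
    """p(c | x) proportional to exp(theta_c . x), normalized over classes."""
    f = Theta.dot(x)                 # linear response of each class
    f = f - f.max()                  # subtract max for numerical stability
    e = np.exp(f)
    return e / e.sum()               # probabilities sum to 1

Theta = np.array([[0.5, 1.0, -1.0],
                  [0.0, -1.0, 1.0],
                  [-0.5, 0.0, 0.0]])
print(softmax_probs(Theta, np.array([1.0, 2.0, 0.0])))   # largest for class 0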
