Machine Learning and Data Mining: Prof. Alexander Ihler
Linear classification
[Diagram: the learning cycle. A program (learner), characterized by some parameters θ, is a procedure (using θ) that outputs a prediction. Training data (examples) and features are its inputs; feedback / target values score its performance (cost function), and changes to the parameters improve performance.]
Linear regression
Predictor: evaluate the line and return r
[Plot: Target y (0–40) vs. Feature x (0–20), with a fitted line through the data.]
Contrast with classification
Classify: predict discrete-valued target y
[Diagram: inputs x1, x2 feed a classifier that forms a weighted sum of the inputs, f = θ1 x1 + θ2 x2 + θ0, then applies a threshold T(f) ∈ {-1, +1}; the function output is the class decision.]
Linear response:
f(x) = θ0 x0 + θ1 x1 + … + θn xn = θ · x, then threshold:
ŷ = T( f(x) ) = +1 (if θ · x > 0)
              = -1 (otherwise)
Decision boundary θ · x = 0: a line in 2D, a plane in 3D, etc.
Example: θ = (θ0, θ1, θ2) = (1, .5, -.5)
[Plot: the decision boundary θ · x = 0 in the (x1, x2) plane.]
From P. Smyth
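The response-then-threshold rule above can be sketched in NumPy. A minimal sketch: the function name `linear_classify` and the example inputs are my own, not from the slides; θ = (1, .5, -.5) is the slides' example.

```python
import numpy as np

def linear_classify(x, theta):
    """Linear response f = theta0 + theta1*x1 + ... + thetan*xn, then threshold."""
    f = theta[0] + np.dot(theta[1:], x)   # theta . (1, x1, ..., xn)
    return 1 if f > 0 else -1             # T(f): +1 if theta . x > 0, else -1

theta = np.array([1.0, 0.5, -0.5])        # the slides' example (theta0, theta1, theta2)
print(linear_classify(np.array([3.0, 1.0]), theta))   # f = 1 + 1.5 - 0.5 = 2 > 0  -> 1
print(linear_classify(np.array([0.0, 4.0]), theta))   # f = 1 + 0 - 2 = -1 < 0  -> -1
```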
Example, Linear Decision Boundary
θ = (θ0, θ1, θ2) = (1, .5, -.5)
Decision boundary: θ · x = 0
  => 1 · 1 + .5 x1 - .5 x2 = 0
  => x2 = x1 + 2
[Plot: the line x2 = x1 + 2 in the (x1, x2) plane.]
From P. Smyth
Example, Linear Decision Boundary
θ = (θ0, θ1, θ2) = (1, .5, -.5)
Decision boundary: θ · x = 0
θ · x < 0  =>  x1 + 2 < x2   (this is the equation for decision region -1)
θ · x > 0  =>  x1 + 2 > x2   (decision region +1)
[Plot: the boundary line and the two decision regions in the (x1, x2) plane.]
From P. Smyth
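The algebra for this example can be checked numerically: points on x2 = x1 + 2 give θ · x = 0, points with x1 + 2 > x2 fall in region +1, and points with x1 + 2 < x2 fall in region -1. A small sketch (the test points are my own choices):

```python
import numpy as np

theta = np.array([1.0, 0.5, -0.5])                 # (theta0, theta1, theta2) from the slide

def f(x1, x2):
    return theta[0] + theta[1]*x1 + theta[2]*x2    # theta . (1, x1, x2)

# Points exactly on x2 = x1 + 2 lie on the boundary (f = 0):
for x1 in [-1.0, 0.0, 3.0]:
    assert abs(f(x1, x1 + 2)) < 1e-12

assert f(0.0, 5.0) < 0   # x1 + 2 < x2: decision region -1
assert f(0.0, 0.0) > 0   # x1 + 2 > x2: decision region +1
```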
Separability
A data set is separable by a learner if there is some instance of that learner that correctly predicts all the data points.
Linearly separable data: we can separate the two classes using a straight line in feature space; in 2 dimensions the decision boundary is a straight line.
[Plots: two data sets in the (Feature 1, x1) × (Feature 2, x2) plane, each with a decision boundary drawn.]
(c) Alexander Ihler
Class overlap
Common in practice
May not be able to perfectly distinguish between classes
Maybe with more features?
Maybe with a more complex classifier?
Otherwise, may have to accept some errors
[Plots: two example data sets; 1D example: y = T( b x + c ).]
What kinds of functions would we need to learn the data on the right?
Ellipsoidal decision boundary: a x1² + b x1 + c x2² + d x2 + e x1 x2 + f = 0
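One way to read the ellipsoidal boundary: it is still linear in an expanded feature vector (x1², x1, x2², x2, x1·x2, 1). A hedged sketch, with coefficients a…f chosen arbitrarily for illustration (a circle x1² + x2² = 4):

```python
import numpy as np

def quad_features(x1, x2):
    # The boundary a*x1^2 + b*x1 + c*x2^2 + d*x2 + e*x1*x2 + f = 0
    # is linear in these expanded features.
    return np.array([x1**2, x1, x2**2, x2, x1*x2, 1.0])

# Illustrative coefficients (a, b, c, d, e, f): the circle x1^2 + x2^2 = 4
theta = np.array([1.0, 0.0, 1.0, 0.0, 0.0, -4.0])

def classify(x1, x2):
    return 1 if np.dot(theta, quad_features(x1, x2)) > 0 else -1

print(classify(0.0, 0.0))   # inside the circle  -> -1
print(classify(3.0, 0.0))   # outside the circle -> +1
```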
Feature representations
Features are used in a linear way
Learner is dependent on representation
Good
Separation is easier in higher dimensions (for fixed # of data m)
Increase the number of features, and even a linear classifier will eventually be
able to separate all the training examples!
Bad
Remember training vs. test error? Remember overfitting?
Increasingly complex decision boundaries can eventually get all the training data right, but that doesn't necessarily bode well for test data
[Plot: predictive error vs. complexity; error on test data marks an ideal range between underfitting (too simple) and overfitting (too complex).]
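The claim that enough features let even a linear classifier separate all the training examples can be illustrated with polynomial features. A sketch (the data, seed, and degree choices are mine): with as many features as data points, a linear fit interpolates even random labels.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10
x = rng.uniform(-1, 1, m)             # 1D inputs
y = rng.choice([-1.0, 1.0], m)        # arbitrary labels; no single threshold need fit them

def train_accuracy(degree):
    A = np.vander(x, degree + 1)      # polynomial features x^degree, ..., x, 1
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.mean(np.sign(A @ w) == y)

print(train_accuracy(1))              # a line: may misclassify some points
print(train_accuracy(9))              # 10 features for 10 points: fits training data exactly
```

Training accuracy reaching 1.0 on random labels is exactly the warning above: it says nothing good about test error.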
Summary
Linear classifier (perceptron)
Separability
Limits of the representational power of a perceptron
Adding features
Interpretations
Effect on separability
Potential for overfitting
1D example:
T(f) = +1 if f > 0
T(f) = -1 if f < 0
[Plots: when y(j) is predicted incorrectly, update the weights; when y(j) is predicted correctly, no update.]
[Plots: data in the (Feature 1, x1) × (Feature 2, x2) plane.]
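The "update on a mistake" steps above can be written as the standard perceptron rule: when y(j) is predicted incorrectly, θ ← θ + α y(j) x(j); when predicted correctly, no update. A hedged sketch (the data and α below are illustrative; this is the textbook rule, not code from the slides):

```python
import numpy as np

def perceptron_train(X, y, alpha=1.0, epochs=100):
    """Perceptron rule: on a mistake, theta += alpha * y_j * (1, x_j)."""
    X1 = np.hstack([np.ones((len(X), 1)), X])      # prepend constant feature x0 = 1
    theta = np.zeros(X1.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for xj, yj in zip(X1, y):
            yhat = 1 if theta @ xj > 0 else -1     # T(theta . x)
            if yhat != yj:                         # predicted incorrectly: update weights
                theta += alpha * yj * xj
                mistakes += 1
        if mistakes == 0:                          # all predicted correctly: no update, stop
            break
    return theta

# Linearly separable toy data:
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([-1, -1, 1, 1])
theta = perceptron_train(X, y)
```

On linearly separable data this loop stops updating after finitely many mistakes (the perceptron convergence theorem).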
Side benefit of smoothed error function
Encourages data to be far from the decision boundary
See more examples of this principle later...
[Plots: MSE cost surface J over the parameters, with example fits J = 1.9, J = 0.4, and J = 0.1 near the best point (minimum MSE).]
Methods:
Gradient descent
Improve loss by small changes in the parameters ("small" = learning rate)
Python Implementation:

import numpy as np

def sig(z):                              # logistic sigmoid
    return 1.0 / (1.0 + np.exp(-z))      # output in [0, 1]

# For range [-1, +1]: use 2*sig(z) - 1
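A hedged sketch of gradient descent on the logistic MSE, J(θ) = mean (sig(θ · x) − y)² with y ∈ {0, 1}; the learning rate, iteration count, and toy data are my illustrative choices:

```python
import numpy as np

def sig(z):                                  # logistic sigmoid, output in [0, 1]
    return 1.0 / (1.0 + np.exp(-z))

def grad_descent(X, y, alpha=0.5, iters=2000):
    """Minimize J(theta) = mean (sig(theta . x) - y)^2 for y in {0, 1}."""
    X1 = np.hstack([np.ones((len(X), 1)), X])    # prepend constant feature
    theta = np.zeros(X1.shape[1])
    for _ in range(iters):
        s = sig(X1 @ theta)
        # dJ/dtheta = mean of 2*(s - y)*s*(1 - s)*x  (chain rule through the sigmoid)
        g = 2 * ((s - y) * s * (1 - s)) @ X1 / len(y)
        theta -= alpha * g                       # small step: alpha is the learning rate
    return theta

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = grad_descent(X, y)
```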
[Plot, 1D example: comparing the 0/1 loss with the logistic MSE.]
Example: y ∈ {0, 1, 2, …}
Multilogistic regression
Take p(c | x) ∝ exp[ f(x, c) ]  (a linear response for each class c)
Normalize by the sum over classes c
Straightforward generalization of logistic regression
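The normalize-over-classes step can be sketched as a softmax over per-class linear responses. The weight matrix below is an arbitrary illustrative choice, not from the slides:

```python
import numpy as np

def softmax_predict(x, Theta):
    """p(c | x) proportional to exp(theta_c . (1, x)); normalize over classes c."""
    x1 = np.concatenate(([1.0], x))          # prepend constant feature
    f = Theta @ x1                           # one linear response per class
    p = np.exp(f - f.max())                  # subtract max for numerical stability
    return p / p.sum()                       # normalize by the sum over classes

Theta = np.array([[ 1.0, -2.0],              # class 0 weights (illustrative)
                  [ 0.0,  0.0],              # class 1
                  [-1.0,  2.0]])             # class 2
p = softmax_predict(np.array([2.0]), Theta)
print(p.argmax())   # responses (-3, 0, 3): most likely class is 2
```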