ML_Lec 6- Linear Classifiers

The document discusses linear classifiers, particularly focusing on the perceptron algorithm for classification tasks where input variables are continuous and target variables are discrete. It explains the concept of decision boundaries, the learning process for perceptrons, and the challenges associated with non-linearly separable data. Additionally, it touches on generalizations for multiclass problems and practical limitations of the perceptron model.

L6: LINEAR CLASSIFIERS

Classification: Problem Statement


• In regression, we are modeling the relationship between a continuous input variable x and a continuous target variable t.
• In classification, the input variable x may still be continuous, but the target variable is discrete.
• In the simplest case, t can have only 2 values, e.g.,
    t = +1 ↔ x assigned to C1
    t = −1 ↔ x assigned to C2
Example Problem

• Animal or Vegetable?
Discriminative Classifiers

• If the class-conditional distributions are normal, the best thing to do is to estimate the parameters of these distributions and use Bayesian decision theory to classify input vectors. Decision boundaries are then generally quadratic.
• However, if the conditional distributions are not exactly normal, this generative approach will yield sub-optimal results.
• An alternative is to build a discriminative classifier, which models the decision boundary directly rather than the conditional distributions themselves.
Linear Models for Classification

• Linear models for classification separate input vectors into classes using linear (hyperplane) decision boundaries.
  – Example: 2D input vector x, two discrete classes C1 and C2.
[Figure: two classes of 2D inputs plotted against x1 and x2, separated by a linear decision boundary.]
Two-Class Discriminant Function

    y(x) = w^T x + w0

    y(x) ≥ 0 → x assigned to C1
    y(x) < 0 → x assigned to C2

• Thus y(x) = 0 defines the decision boundary.
[Figure: geometry of the boundary in the (x1, x2) plane, with regions R1 (y > 0) and R2 (y < 0). The weight vector w is normal to the boundary; the signed distance of a point x from the boundary is y(x)/‖w‖, and the boundary's offset from the origin is −w0/‖w‖.]
Two-Class Discriminant Function

    y(x) = w^T x + w0

    y(x) ≥ 0 → x assigned to C1
    y(x) < 0 → x assigned to C2

• For convenience, augment the weight and input vectors with a bias component:

    w^T = [w1 … wM]  ⇒  w^T = [w0 w1 … wM]
    x^T = [x1 … xM]  ⇒  x^T = [1 x1 … xM]

• So we can express y(x) = w^T x.


Generalized Linear Models

• For classification problems, we want y to be a predictor of t. In other words, we wish to map the input vector into one of a number of discrete classes, or to posterior probabilities that lie between 0 and 1.
• For this purpose, it is useful to elaborate the linear model by introducing a nonlinear activation function f, which typically constrains y to lie between −1 and 1 or between 0 and 1:

    y(x) = f(w^T x + w0)
The Perceptron

    y(x) = f(w^T x + w0)

    y(x) ≥ 0 → x assigned to C1
    y(x) < 0 → x assigned to C2

• A classifier based upon this simple generalized linear model is called a (single-layer) perceptron.
• It can also be identified with an abstracted model of a neuron called the McCulloch-Pitts model.
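
A minimal sketch of this decision rule (assuming NumPy, a sign-style step activation, and illustrative weights that are not from the lecture):

import numpy as np

def perceptron_predict(w, x):
    """Single-layer perceptron: step activation applied to w^T x.

    w : augmented weight vector [w0, w1, ..., wM]
    x : augmented input vector  [1,  x1, ..., xM]
    Returns +1 (class C1) if w^T x >= 0, else -1 (class C2).
    """
    return 1 if np.dot(w, x) >= 0 else -1

# Hypothetical example: a 2D input with the bias input 1 prepended
w = np.array([-1.0, 0.5, 0.5])   # assumed weights, for illustration only
x = np.array([1.0, 2.0, 1.0])
print(perceptron_predict(w, x))  # -> 1, since w^T x = 0.5 >= 0
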
Parameter Learning

• How do we learn the parameters of a perceptron?
Outline

• The Perceptron Algorithm
• Least-Squares Classifiers
• Fisher's Linear Discriminant
• Logistic Classifiers
Case 1. Linearly Separable Inputs

• For starters, let's assume that the training data are in fact perfectly linearly separable.
• In other words, there exists at least one hyperplane (one set of weights) that yields 0 classification error.
• We seek an algorithm that can automatically find such a hyperplane.
The Perceptron Algorithm

• The perceptron algorithm was invented by Frank Rosenblatt (1962).
• The algorithm is iterative.
• The strategy is to start with a random guess at the weights w, and to then iteratively change the weights to move the hyperplane in a direction that lowers the classification error.
[Photo: Frank Rosenblatt (1928 – 1971)]
The Perceptron Algorithm

• Note that as we change the weights continuously, the classification error changes in a discontinuous, piecewise-constant fashion.
• Thus we cannot use the classification error per se as our objective function to minimize.
• What would be a better objective function?
The Perceptron Criterion

• Note that we seek w such that

    w^T x ≥ 0 when t = +1
    w^T x < 0 when t = −1

• In other words, we would like

    w^T x_n t_n ≥ 0  ∀n

• Thus we seek to minimize

    E_P(w) = − Σ_{n∈M} w^T x_n t_n

  where M is the set of misclassified inputs.


The Perceptron Criterion

    E_P(w) = − Σ_{n∈M} w^T x_n t_n

  where M is the set of misclassified inputs.

• Observations:
  – E_P(w) is always non-negative.
  – E_P(w) is continuous and piecewise linear, and thus easier to minimize (a short computational sketch follows below).
[Figure: E_P(w) plotted against a weight component w_i, showing a continuous, piecewise-linear error surface.]
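
As a concrete illustration, the following sketch (assuming NumPy, targets coded as ±1, and augmented inputs with a leading 1) evaluates E_P(w) by summing −w^T x_n t_n over the misclassified points:

import numpy as np

def perceptron_criterion(w, X, t):
    """Perceptron criterion E_P(w) = -sum_{n in M} w^T x_n t_n.

    w : augmented weight vector, shape (D+1,)
    X : augmented inputs, shape (N, D+1), first column all ones
    t : targets in {+1, -1}, shape (N,)
    """
    scores = (X @ w) * t          # w^T x_n t_n for every n
    misclassified = scores < 0    # the set M
    return -np.sum(scores[misclassified])
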
The Perceptron Algorithm

    E_P(w) = − Σ_{n∈M} w^T x_n t_n

  where M is the set of misclassified inputs.

• Where the derivative exists,

    dE_P(w)/dw = − Σ_{n∈M} x_n t_n

• Gradient descent:

    w^(τ+1) = w^(τ) − η ∇E_P(w) = w^(τ) + η Σ_{n∈M} x_n t_n
The Perceptron Algorithm

    w^(τ+1) = w^(τ) − η ∇E_P(w) = w^(τ) + η Σ_{n∈M} x_n t_n

• Why does this make sense?
  – If an input from C1 (t = +1) is misclassified, we need to make its projection onto w more positive.
  – If an input from C2 (t = −1) is misclassified, we need to make its projection onto w more negative.
The Perceptron Algorithm

• The algorithm can be implemented sequentially:
  – Repeat until convergence:
    - For each input (x_n, t_n):
      - If it is correctly classified, do nothing.
      - If it is misclassified, update the weight vector to be

          w^(τ+1) = w^(τ) + η x_n t_n

      - Note that this will lower the contribution of input n to the objective function:

          −(w^(τ+1))^T x_n t_n = −(w^(τ))^T x_n t_n − η (x_n t_n)^T (x_n t_n) < −(w^(τ))^T x_n t_n

• A complete training loop based on this sequential update is sketched below.
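
A minimal sketch of this sequential perceptron rule, assuming NumPy, augmented inputs (leading 1), targets in {+1, −1}, and an arbitrary illustrative learning rate:

import numpy as np

def train_perceptron(X, t, eta=0.1, max_epochs=1000):
    """Sequential perceptron learning.

    X : augmented inputs, shape (N, D+1), first column all ones
    t : targets in {+1, -1}, shape (N,)
    Returns the learned weight vector, or the last iterate if the data
    are not separated within max_epochs passes.
    """
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x_n, t_n in zip(X, t):
            if t_n * (w @ x_n) <= 0:      # misclassified (or on the boundary)
                w = w + eta * t_n * x_n   # w^(tau+1) = w^(tau) + eta * x_n * t_n
                errors += 1
        if errors == 0:                   # every input classified correctly
            return w
    return w
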
Not Monotonic

• While updating with respect to a misclassified input n will lower the error for that input, the error for other misclassified inputs may increase.
• Also, inputs that had previously been classified correctly may now be misclassified.
• The result is that the perceptron algorithm is not guaranteed to reduce the total error monotonically at each stage.
The Perceptron Convergence Theorem

• Despite this non-monotonicity, if the data are in fact linearly separable, then the algorithm is guaranteed to find an exact solution in a finite number of steps (Rosenblatt, 1962).
Example

[Figure: four snapshots of the perceptron algorithm on a two-class 2D example (both axes from −1 to 1), showing the decision boundary being updated as misclassified points are processed.]
Numerical example
• Compute the perceptron and MSE solution for the dataset
  – X1 = [(1,6), (7,2), (8,9), (9,9)]
  – X2 = [(2,1), (2,2), (2,4), (7,1)]

• Perceptron learning
  – Assume η = 0.1 and an online update rule
  – Assume a(0) = [0.1, 0.1, 0.1]

• SOLUTION
  – Normalize the dataset: augment each sample with a leading 1 and negate the class-2 samples, so that a correct classification always gives aᵀy > 0:

        Y = [  1   1   6
               1   7   2
               1   8   9
               1   9   9
              −1  −2  −1
              −1  −2  −2
              −1  −2  −4
              −1  −7  −1 ]

  – Iterate through all the examples and update a(k) on the ones that are misclassified:
      Y(1): [1 1 6]·[0.1 0.1 0.1]ᵀ > 0  → no update
      Y(2): [1 7 2]·[0.1 0.1 0.1]ᵀ > 0  → no update
      …
      Y(5): [−1 −2 −1]·[0.1 0.1 0.1]ᵀ < 0 → update a(1) = [0.1 0.1 0.1] + η[−1 −2 −1] = [0 −0.1 0]
      Y(6): [−1 −2 −2]·[0 −0.1 0]ᵀ > 0  → no update
      …
      Y(1): [1 1 6]·[0 −0.1 0]ᵀ < 0  → update a(2) = [0 −0.1 0] + η[1 1 6] = [0.1 0 0.6]
      Y(2): [1 7 2]·[0.1 0 0.6]ᵀ > 0  → no update
      …

• In this example, the perceptron rule converges after 175 iterations to a = [−3.5 0.3 0.7]
• To convince yourself this is a solution, compute Ya (you will find that all entries are non-negative).
[Figure: the two classes plotted in the (x1, x2) plane together with the Perceptron and MSE decision boundaries.]
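
The following NumPy sketch mirrors this procedure. The exact trajectory and iteration count can depend on update conventions, so treat the result as illustrative; the lecture reports a = [−3.5, 0.3, 0.7].

import numpy as np

# Dataset from the example above
X1 = np.array([(1, 6), (7, 2), (8, 9), (9, 9)], dtype=float)  # class 1
X2 = np.array([(2, 1), (2, 2), (2, 4), (7, 1)], dtype=float)  # class 2

# "Normalize": augment with a leading 1 and negate the class-2 samples,
# so that a correct classification always satisfies a^T y > 0
Y = np.vstack([np.hstack([np.ones((4, 1)), X1]),
               -np.hstack([np.ones((4, 1)), X2])])

eta = 0.1
a = np.array([0.1, 0.1, 0.1])      # a(0)

for epoch in range(1000):          # upper bound on passes through the data
    updated = False
    for y in Y:
        if a @ y < 0:              # misclassified under the slide's convention
            a = a + eta * y        # online perceptron update
            updated = True
    if not updated:
        break

print(a)       # a separating weight vector
print(Y @ a)   # all entries should be non-negative
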
The First Learning Machine

• Mark 1 Perceptron hardware (c. 1960)
[Photo: the Mark 1 Perceptron hardware: visual inputs, a patch board allowing configuration of the input features φ, and a rack of adaptive weights w (motor-driven potentiometers).]
Practical Limitations

• The Perceptron Convergence Theorem is an important result. However, there are practical limitations:
  – Convergence may be slow.
  – If the data are not separable, the algorithm will not converge.
  – We will only know that the data are separable once the algorithm converges.
  – The solution is in general not unique, and will depend upon initialization, the scheduling of input vectors, and the learning rate η.
Generalization to inputs that are not linearly separable.

• The single-layer perceptron can be generalized to yield good linear solutions to problems that are not linearly separable.
• Example: the Pocket Algorithm (Gallant, 1990)
  – Idea:
    - Run the perceptron algorithm.
    - Keep track of the weight vector w* that has produced the best classification error achieved so far (a sketch follows below).
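
A minimal sketch of the pocket idea, assuming NumPy, augmented inputs, and ±1 targets; the "pocket" simply stores the best weight vector seen during ordinary perceptron updates:

import numpy as np

def pocket_perceptron(X, t, eta=0.1, max_epochs=100):
    """Pocket algorithm: run perceptron updates, keep the best weights seen.

    X : augmented inputs, shape (N, D+1); t : targets in {+1, -1}.
    Returns the weight vector with the fewest training misclassifications.
    """
    w = np.zeros(X.shape[1])
    best_w, best_errors = w.copy(), np.inf
    for _ in range(max_epochs):
        for x_n, t_n in zip(X, t):
            if t_n * (w @ x_n) <= 0:           # misclassified: perceptron update
                w = w + eta * t_n * x_n
            errors = np.sum(t * (X @ w) <= 0)  # current number of training errors
            if errors < best_errors:           # put better weights "in the pocket"
                best_w, best_errors = w.copy(), errors
    return best_w
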
Generalization to Multiclass Problems

• How can we use perceptrons, or linear classifiers in general, to classify inputs when there are K > 2 classes?
K>2 Classes

• Idea #1: Just use K−1 discriminant functions, each of which separates one class Ck from the rest. (One-versus-the-rest classifier.)
• Problem: ambiguous regions
[Figure: one-versus-the-rest boundaries (C1 / not C1, C2 / not C2) and regions R1, R2, R3, with a region whose class assignment is ambiguous.]
K>2 Classes

• Idea #2: Use K(K−1)/2 discriminant functions, each of which separates two classes Cj, Ck from each other. (One-versus-one classifier.)
• Each point is classified by majority vote.
• Problem: ambiguous regions
[Figure: pairwise boundaries between C1, C2 and C3 and regions R1, R2, R3, with a central region marked "?" where the majority vote is ambiguous.]
K>2 Classes

• Idea #3: Use K discriminant functions y_k(x), and use the magnitude of y_k(x), not just the sign:

    y_k(x) = w_k^T x

    x assigned to Ck if y_k(x) > y_j(x) ∀j ≠ k

• Decision boundary between Ck and Cj:

    y_k(x) = y_j(x)  →  (w_k − w_j)^T x + (w_k0 − w_j0) = 0

• This results in decision regions that are simply connected and convex.
[Figure: convex decision regions Ri, Rj, Rk for the K-discriminant classifier.]
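
A small sketch of Idea #3 (assuming NumPy and a weight matrix holding one augmented weight vector per class; the numbers are made up for illustration):

import numpy as np

def multiclass_predict(W, x):
    """K-class linear discriminant: assign x to the class with the largest y_k(x).

    W : weight matrix, shape (K, D+1), row k holds the augmented weights of class k
    x : augmented input, shape (D+1,)
    """
    y = W @ x                 # y_k(x) = w_k^T x for every class k
    return int(np.argmax(y))  # index of the winning class

# Hypothetical 3-class example in 2D (leading 1 is the bias input)
W = np.array([[0.0,  1.0,  0.0],
              [0.0, -1.0,  1.0],
              [0.5,  0.0, -1.0]])
print(multiclass_predict(W, np.array([1.0, 2.0, 0.5])))  # -> 0
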
Example: Kesler’s Construction

• The perceptron algorithm can be generalized to K-class classification problems.
• Example: Kesler's Construction
  – Allows use of the perceptron algorithm to simultaneously learn K separate weight vectors w_i.
  – Inputs are then classified in class i if and only if

        w_i^T x > w_j^T x  ∀j ≠ i

  – The algorithm will converge to an optimal solution if a solution exists, i.e., if all training vectors can be correctly classified according to this rule.
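
The slide does not spell out the construction itself. As a hedged illustration, the sketch below implements the equivalent multiclass perceptron update that Kesler's construction is typically used to justify: when a sample of class i is currently won by class j, the correct class's weights are reinforced and the wrong winner's weights are penalized.

import numpy as np

def multiclass_perceptron(X, labels, K, eta=0.1, max_epochs=100):
    """Multiclass perceptron update.

    X      : augmented inputs, shape (N, D+1)
    labels : integer class labels in {0, ..., K-1}, shape (N,)
    Returns a weight matrix W of shape (K, D+1).
    """
    W = np.zeros((K, X.shape[1]))
    for _ in range(max_epochs):
        mistakes = 0
        for x_n, i in zip(X, labels):
            j = int(np.argmax(W @ x_n))   # currently predicted class
            if j != i:                    # misclassified
                W[i] += eta * x_n         # pull the correct class's score up
                W[j] -= eta * x_n         # push the wrong winner's score down
                mistakes += 1
        if mistakes == 0:
            break
    return W
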
1-of-K Coding Scheme

• When there are K > 2 classes, target variables can be coded using the 1-of-K coding scheme:

    Input from class Ci  ⇔  t = [0 0 … 1 … 0 0]^T   (the 1 is in element i)
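
A short sketch of this coding (NumPy, assuming integer class labels 0, …, K−1):

import numpy as np

def one_of_K(labels, K):
    """1-of-K coding: row n has a 1 in column labels[n], zeros elsewhere."""
    T = np.zeros((len(labels), K))
    T[np.arange(len(labels)), labels] = 1
    return T

print(one_of_K([0, 2, 1], 3))
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]
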
Computational Limitations of Perceptrons

• Initially, the perceptron was thought to be a potentially powerful learning machine that could model human neural processing.
• However, Minsky & Papert (1969) showed that the single-layer perceptron could not learn a simple XOR function.
• This is just one example of a non-linearly separable pattern that cannot be learned by a single-layer perceptron.
[Figure: the XOR pattern in the (x1, x2) plane. Photo: Marvin Minsky.]
Multi-Layer Perceptrons

• Minsky & Papert's book was widely misinterpreted as showing that artificial neural networks were inherently limited.
• This contributed to a decline in the reputation of neural network research through the 70s and 80s.
• However, their findings apply only to single-layer perceptrons. Multi-layer perceptrons are capable of learning highly nonlinear functions, and are used in many practical applications.
Outline

• The Perceptron Algorithm
• Least-Squares Classifiers
• Fisher's Linear Discriminant
• Logistic Classifiers
Dealing with Non-Linearly Separable Inputs

• The perceptron algorithm fails when the training data are not perfectly linearly separable.
• Let's now turn to methods for learning the parameter vector w of a perceptron (linear classifier) even when the training data are not linearly separable.
The Least Squares Method

• In the least-squares method, we simply fit the (x, t) observations with a hyperplane y(x).
• Note that this is a somewhat odd idea, since the t values are binary (when K = 2), e.g., 0 or 1.
• However, it can work reasonably well.
Least Squares: Learning the Parameters

• Assume D-dimensional input vectors x.
• For each class k ∈ 1…K:

    y_k(x) = w_k^T x + w_k0

  or, collecting all K discriminants,

    y(x) = W̃^T x̃

  where

    x̃ = (1, x^T)^T

  and W̃ is the (D+1) × K matrix whose kth column is w̃_k = (w_k0, w_k^T)^T.
Learning the Parameters

• Method #2: Least Squares

    y(x) = W̃^T x̃

• Training dataset (x_n, t_n), n = 1, …, N, where we use the 1-of-K coding scheme for t_n.
• Let T be the N × K matrix whose nth row is t_n^T.
• Let X̃ be the N × (D+1) matrix whose nth row is x̃_n^T.
• Let R_D(W̃) = X̃ W̃ − T.
• Then we define the error as

    E_D(W̃) = (1/2) Σ_{i,j} R_ij² = (1/2) Tr{ R_D(W̃)^T R_D(W̃) }

• Setting the derivative with respect to W̃ to 0 yields

    W̃ = (X̃^T X̃)^{-1} X̃^T T = X̃† T

  Recall: ∂/∂A Tr(AB) = B^T.
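
A minimal sketch of this closed-form fit (NumPy; a pseudo-inverse is used for numerical robustness, and the targets are assumed to be 1-of-K coded as above):

import numpy as np

def fit_least_squares(X, T):
    """Least-squares classifier: W = (X^T X)^{-1} X^T T = pinv(X) T.

    X : augmented inputs, shape (N, D+1)
    T : 1-of-K targets, shape (N, K)
    Returns the (D+1, K) weight matrix.
    """
    return np.linalg.pinv(X) @ T

def predict(W, X):
    """Assign each input to the class with the largest discriminant value."""
    return np.argmax(X @ W, axis=1)
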
Outline

• The Perceptron Algorithm
• Least-Squares Classifiers
• Fisher's Linear Discriminant
• Logistic Classifiers
Fisher’s Linear Discriminant

• Another way to view linear discriminants: find the 1D subspace that maximizes the separation between the two classes.

• Let

    m1 = (1/N1) Σ_{n∈C1} x_n,     m2 = (1/N2) Σ_{n∈C2} x_n

• For example, we might choose w to maximize w^T (m2 − m1), subject to ‖w‖ = 1.
• This leads to w ∝ m2 − m1.
• However, if the conditional distributions are not isotropic, this is typically not optimal.
[Figure: two-class data projected onto the direction joining the class means; the projected classes overlap substantially.]
Fisher’s Linear Discriminant

• Let m̃1 = w^T m1 and m̃2 = w^T m2 be the projected class means on the 1D subspace.
• Let s_k² = Σ_{n∈Ck} (y_n − m̃_k)² be the within-class variance on the subspace for class Ck.

• The Fisher criterion is then

    J(w) = (m̃2 − m̃1)² / (s1² + s2²)

• This can be rewritten as

    J(w) = (w^T S_B w) / (w^T S_W w)

  where

    S_B = (m2 − m1)(m2 − m1)^T   is the between-class scatter, and

    S_W = Σ_{n∈C1} (x_n − m1)(x_n − m1)^T + Σ_{n∈C2} (x_n − m2)(x_n − m2)^T   is the within-class scatter.

• J(w) is maximized for w ∝ S_W^{-1} (m2 − m1).
[Figure: the same data projected onto the Fisher direction, giving much better separation of the two classes.]
Fisher Linear Discriminant Example
Data
• Class 1 has 5 samples: c1 = [(1,2), (2,3), (3,3), (4,5), (5,5)]
• Class 2 has 6 samples: c2 = [(1,0), (2,1), (3,1), (3,2), (5,3), (6,5)]
• Arrange the data in two separate matrices, one sample per row:

    c1 = [ 1 2          c2 = [ 1 0
           2 3                 2 1
           3 3                 3 1
           4 5                 3 2
           5 5 ]               5 3
                               6 5 ]

• Notice that PCA would perform very poorly on this data, because the direction of largest variance is not helpful for classification.
Fisher Linear Discriminant Example
• First compute the mean of each class:

    µ1 = mean(c1) = [3  3.6]        µ2 = mean(c2) = [3.3  2]

• Compute the scatter matrices S1 and S2 for each class:

    S1 = 4·cov(c1) = [ 10   8.0       S2 = 5·cov(c2) = [ 17.3  16
                        8.0  7.2 ]                        16    16 ]

• Within-class scatter:

    S_W = S1 + S2 = [ 27.3  24
                      24    23.2 ]

  It has full rank, so we do not have to solve for eigenvalues.

• The inverse of S_W is

    S_W^{-1} = inv(S_W) = [  0.39  −0.41
                            −0.41   0.47 ]

• Finally, the optimal line direction is

    v = S_W^{-1} (µ1 − µ2) = [ −0.79
                                0.89 ]
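
A short NumPy sketch that reproduces these numbers; np.cov with rowvar=False uses the (N−1) normalization, so multiplying by (N−1) matches the 4·cov and 5·cov scatter matrices above:

import numpy as np

c1 = np.array([(1, 2), (2, 3), (3, 3), (4, 5), (5, 5)], dtype=float)
c2 = np.array([(1, 0), (2, 1), (3, 1), (3, 2), (5, 3), (6, 5)], dtype=float)

mu1, mu2 = c1.mean(axis=0), c2.mean(axis=0)       # [3. 3.6], [3.33 2.]

S1 = (len(c1) - 1) * np.cov(c1, rowvar=False)     # scatter matrix of class 1
S2 = (len(c2) - 1) * np.cov(c2, rowvar=False)     # scatter matrix of class 2
SW = S1 + S2                                      # within-class scatter

v = np.linalg.solve(SW, mu1 - mu2)                # Fisher direction, ~[-0.79, 0.89]
print(SW)
print(v)
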
Connection to MVN Maximum Likelihood

    J(w) is maximized for w ∝ S_W^{-1} (m2 − m1)

• Recall that if the two distributions are normal with the same covariance Σ, the maximum likelihood classifier is linear, with

    w ∝ Σ^{-1} (m2 − m1)

• Further, note that S_W is proportional to the maximum likelihood estimator for Σ.
• Thus FLD is equivalent to assuming MVN distributions with common covariance.
Connection to Least-Squares

• Change the coding scheme used in the least-squares method to

    t_n = N / N1    for C1
    t_n = −N / N2   for C2

• Then one can show that the least-squares solution w satisfies

    w ∝ S_W^{-1} (m2 − m1)
Problems with Least Squares

• Problem #1: Sensitivity to outliers
[Figure: two panels showing how a group of outlying points pulls the least-squares decision boundary away from a good separating line.]
Problems with Least Squares

• Problem #2: A linear activation function is not a good fit to binary data. This can lead to problems.
[Figure: an example where fitting binary targets with a linear activation yields a poor decision boundary.]
Outline

• The Perceptron Algorithm
• Least-Squares Classifiers
• Fisher's Linear Discriminant
• Logistic Classifiers
LOGISTIC DISCRIMINATION

• Consider an M-class task with classes ω1, ω2, …, ωM. In logistic discrimination, the logarithms of the likelihood ratios are modeled via linear functions, i.e.,

    ln[ P(ωi | x) / P(ωM | x) ] = w_{i,0} + w_i^T x,   i = 1, 2, …, M−1

• Taking into account that

    Σ_{i=1}^{M} P(ωi | x) = 1

  it can easily be shown that the above is equivalent to modeling the posterior probabilities as:

    P(ωM | x) = 1 / ( 1 + Σ_{i=1}^{M−1} exp(w_{i,0} + w_i^T x) )

    P(ωi | x) = exp(w_{i,0} + w_i^T x) / ( 1 + Σ_{i=1}^{M−1} exp(w_{i,0} + w_i^T x) ),   i = 1, 2, …, M−1

• For the two-class case it turns out that

    P(ω2 | x) = 1 / ( 1 + exp(w0 + w^T x) )

    P(ω1 | x) = exp(w0 + w^T x) / ( 1 + exp(w0 + w^T x) )

• The unknown parameters w_i, w_{i,0}, i = 1, 2, …, M−1 are usually estimated by maximum likelihood arguments.

• Logistic discrimination is a useful tool, since it allows linear modeling and at the same time ensures that the posterior probabilities add to one.
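
A minimal two-class sketch (NumPy, batch gradient ascent on the log-likelihood, with labels y = 1 for ω1 and y = 0 for ω2 so that P(ω1 | x) = sigmoid(w^T x), matching the two-class formulas above; the learning rate and iteration count are arbitrary illustrative choices):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic(X, y, eta=0.01, n_iters=5000):
    """Two-class logistic discrimination by batch gradient ascent.

    X : augmented inputs, shape (N, D+1); the first column of ones plays the role of w0
    y : labels, y = 1 for class omega_1 and y = 0 for class omega_2
    """
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        p = sigmoid(X @ w)          # current estimates of P(omega_1 | x_n)
        w += eta * X.T @ (y - p)    # gradient of the log-likelihood
    return w
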
Example

[Figure: decision boundaries on the same two-class dataset. Left: Least-Squares. Right: Logistic.]
