ML_Lec 6 - Linear Classifiers
¨ Animal or Vegetable?
Discriminative Classifiers
[Figure: two-class data plotted in the (x1, x2) plane.]
Two-Class Discriminant Function
y(x) = w^t x + w_0

y(x) ≥ 0 → x assigned to C_1

Thus y(x) = 0 defines the decision boundary.

[Figure: geometry of the two-class linear discriminant in the (x1, x2) plane. The boundary y(x) = 0 separates region R1 (y > 0) from R2 (y < 0); w is normal to the boundary, y(x)/‖w‖ is the signed distance of x from it, and −w_0/‖w‖ is the distance of the boundary from the origin.]
Two-Class Discriminant Function
y(x) = w^t x + w_0

y(x) ≥ 0 → x assigned to C_1
y(x) < 0 → x assigned to C_2

Augmented notation: absorb the bias w_0 into the weight vector by prepending a 1 to each input:

x = [x_1 … x_M]^t  ⇒  x̃ = [1 x_1 … x_M]^t,  with correspondingly w̃ = [w_0 w_1 … w_M]^t, so that y(x) = w̃^t x̃.

[Figure: the same decision boundary y(x) = 0 in the (x1, x2) plane, with regions R1 and R2.]
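As a concrete illustration of the discriminant and its augmented form, here is a minimal NumPy sketch; the weight values and data points are made-up assumptions for illustration, not from the lecture:

import numpy as np

# Hypothetical augmented weight vector w_tilde = [w0, w1, w2]
w_tilde = np.array([-1.0, 2.0, 0.5])

# A few made-up 2-D inputs (one row per point)
X = np.array([[0.0, 0.0],
              [1.0, 1.0],
              [-2.0, 3.0]])

# Prepend a 1 to each input so the bias w0 is absorbed into the dot product
X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])

y = X_tilde @ w_tilde                  # y(x) = w_tilde^t x_tilde
labels = np.where(y >= 0, "C1", "C2")  # y(x) >= 0 -> C1, otherwise C2
print(y, labels)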
y(x) = f(w^t x + w_0)
The Perceptron
y(x) = f(w^t x + w_0)

y(x) ≥ 0 → x assigned to C_1
y(x) < 0 → x assigned to C_2
The perceptron criterion:

E_P(w) = − Σ_{n ∈ M} w^t x_n t_n

where M is the set of misclassified patterns and t_n ∈ {+1, −1} are the target labels.

¨ Observations:
¤ E_P(w) is always non-negative.
¤ E_P(w) is continuous and piecewise linear, and thus easier to minimize than the number of misclassifications, which is piecewise constant.

[Figure: E_P(w) as a function of a weight component w_i.]
The Perceptron Algorithm
E_P(w) = − Σ_{n ∈ M} w^t x_n t_n

¨ Gradient descent:

w^(τ+1) = w^(τ) − η ∇E_P(w) = w^(τ) + η Σ_{n ∈ M} x_n t_n
The Perceptron Algorithm
w^(τ+1) = w^(τ) − η ∇E_P(w) = w^(τ) + η Σ_{n ∈ M} x_n t_n
[Figure: four snapshots of online perceptron learning on two-class data in the unit square; the decision boundary is adjusted each time a misclassified point is processed, converging to a boundary that separates the two classes.]
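The update rule above translates directly into a short training loop. Below is a minimal NumPy sketch of online perceptron learning with targets t_n ∈ {+1, −1}; the toy data, learning rate, and epoch limit are illustrative assumptions, not from the lecture.

import numpy as np

def perceptron_train(X, t, eta=0.1, max_epochs=100):
    # Online perceptron learning.
    # X: (N, D) inputs; t: (N,) targets in {+1, -1}.
    # Returns the augmented weight vector w = [w0, w1, ..., wD].
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])  # absorb the bias
    w = np.zeros(X_tilde.shape[1])
    for _ in range(max_epochs):
        updated = False
        for x_n, t_n in zip(X_tilde, t):
            if (w @ x_n) * t_n <= 0:        # misclassified (or on the boundary)
                w = w + eta * t_n * x_n     # w <- w + eta * x_n * t_n
                updated = True
        if not updated:                     # no misclassified points: converged
            break
    return w

# Illustrative linearly separable data
X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -1.0], [-2.0, 0.5]])
t = np.array([1, 1, -1, -1])
print(perceptron_train(X, t))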
Numerical example
• Compute the perceptron and MSE solution for the dataset
[Figure: the example dataset plotted in the (x1, x2) plane.]

• Perceptron learning
– Assume η = 0.1 and an online update rule.
– Example step: for the normalized augmented sample y(5) = [−1 −2 −1], we have [−1 −2 −1]·[0.1 0.1 0.1] = −0.4 < 0, so the sample is misclassified and we update a(1) = [0.1 0.1 0.1] + η[−1 −2 −1] = [0 −0.1 0].
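A quick numerical check of that single update step (the vectors are copied from the example above; the variable names are mine):

import numpy as np

a = np.array([0.1, 0.1, 0.1])        # current augmented weight vector
y5 = np.array([-1.0, -2.0, -1.0])    # normalized augmented sample y(5)
eta = 0.1

print(a @ y5)         # -0.4 < 0, so y(5) is misclassified
print(a + eta * y5)   # updated weights: [ 0.  -0.1  0. ]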
[Figure: combining one-versus-the-rest linear classifiers (e.g., C1 vs. not C1, C2 vs. not C2) divides the plane into regions R1, R2, R3 and leaves an ambiguous region.]
K>2 Classes
[Figure: combining pairwise (one-versus-one) linear classifiers for C1, C2, C3 likewise produces regions R1, R2, R3 with an ambiguous region, marked '?'.]
K>2 Classes
Use K linear discriminants y_k(x) = w_k^t x + w_{k0}, and assign x to class C_k if y_k(x) > y_j(x) for all j ≠ k.

The decision boundary between regions R_k and R_j is then given by

y_k(x) = y_j(x)  →  (w_k − w_j)^t x + (w_{k0} − w_{j0}) = 0

[Figure: two points x_A and x_B inside a decision region R_k; the decision regions defined this way are convex.]
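A minimal sketch of this K-class decision rule; the weight vectors and data points are made-up values for illustration, not from the lecture. Each class k has its own augmented weight vector, and a point is assigned to the class with the largest discriminant y_k(x).

import numpy as np

# Hypothetical augmented weight matrix: column k holds [w_k0, w_k1, w_k2]
W = np.array([[ 0.0,  1.0, -0.5],    # biases w_k0
              [ 1.0, -1.0,  0.0],    # weights on x1
              [ 0.0,  1.0, -1.0]])   # weights on x2

X = np.array([[0.5, 2.0],
              [-1.0, -1.0]])
X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])

Y = X_tilde @ W                  # y_k(x) for every point (rows) and class (columns)
classes = np.argmax(Y, axis=1)   # assign x to the class C_k with the largest y_k(x)
print(Y, classes)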
Example: Kesler’s Construction
[Diagram: Kesler's construction; the augmented sample x̃ is placed in block (element) i of an expanded K(D+1)-dimensional vector, with −x̃ in block j and zeros elsewhere.]
Computational Limitations of Perceptrons
y(x) = W̃^t x̃

where x̃ = (1, x^t)^t and W̃ is a (D + 1) × K matrix whose kth column is w̃_k = (w_{k0}, w_k^t)^t.
Learning the Parameters
Training dataset (x_n, t_n), n = 1, …, N.

Let R_D(W̃) be the sum-of-squares error, R_D(W̃) = ½ Tr{ (X̃W̃ − T)^t (X̃W̃ − T) }, where the nth row of X̃ is x̃_n^t and the nth row of the target matrix T is t_n^t.
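Minimizing this sum-of-squares error has the closed-form pseudoinverse solution W̃ = (X̃^t X̃)^{-1} X̃^t T, i.e., W̃ = pinv(X̃) T. Below is a minimal NumPy sketch; the toy data and the 1-of-K target coding are illustrative assumptions, not from the lecture.

import numpy as np

# Illustrative 2-D data with K = 3 classes
X = np.array([[0.0, 0.0], [1.0, 0.5],
              [4.0, 4.0], [5.0, 3.5],
              [0.0, 5.0], [1.0, 6.0]])
labels = np.array([0, 0, 1, 1, 2, 2])

K = 3
T = np.eye(K)[labels]                                # 1-of-K (one-hot) target matrix
X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])   # augmented inputs

W_tilde = np.linalg.pinv(X_tilde) @ T                # least-squares solution: W = pinv(X) T
pred = np.argmax(X_tilde @ W_tilde, axis=1)          # classify by the largest output
print(pred)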
Fisher's Linear Discriminant

Let m_1 = (1/N_1) Σ_{n ∈ C_1} x_n  and  m_2 = (1/N_2) Σ_{n ∈ C_2} x_n  be the two class means.

For example, we might choose w to maximize w^t (m_2 − m_1), subject to ‖w‖ = 1.

This leads to w ∝ m_2 − m_1.
Let s_k² = Σ_{n ∈ C_k} (y_n − m_k)² be the within-class variance on the projected subspace for class C_k, where y_n = w^t x_n and m_k is the mean of the projected data from class C_k.

The Fisher criterion is then

J(w) = (m_2 − m_1)² / (s_1² + s_2²)
This can be rewritten as

J(w) = (w^t S_B w) / (w^t S_W w)

where S_B = (m_2 − m_1)(m_2 − m_1)^t is the between-class variance, and

S_W = Σ_{n ∈ C_1} (x_n − m_1)(x_n − m_1)^t + Σ_{n ∈ C_2} (x_n − m_2)(x_n − m_2)^t

is the within-class variance.
J(w) is maximized for w ∝ S_W^{-1} (m_2 − m_1).
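As a sanity check on these formulas, here is a minimal NumPy sketch that computes the class means, S_W, and the Fisher direction w ∝ S_W^{-1}(m_2 − m_1); the two-class toy data are an illustrative assumption, not from the lecture.

import numpy as np

# Illustrative two-class data (one sample per row)
X1 = np.array([[4.0, 2.0], [2.0, 4.0], [2.0, 3.0], [3.0, 6.0], [4.0, 4.0]])
X2 = np.array([[9.0, 10.0], [6.0, 8.0], [9.0, 5.0], [8.0, 7.0], [10.0, 8.0]])

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)      # class means

# Within-class scatter: S_W = sum over both classes of (x_n - m_k)(x_n - m_k)^t
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

w = np.linalg.solve(S_W, m2 - m1)              # w proportional to S_W^{-1} (m2 - m1)
w = w / np.linalg.norm(w)                      # normalize for readability
print(w)

# The projected class means w^t m1 and w^t m2 are well separated
print(w @ m1, w @ m2)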
¨ Recall that if the two distributions are normal with the same covariance Σ, the maximum likelihood classifier is linear, with

w ∝ Σ^{-1} (m_2 − m_1)

¨ Further, note that S_W is proportional to the maximum likelihood estimator for Σ.

¨ Thus FLD is equivalent to assuming MVN distributions with common covariance.
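That equivalence is easy to verify numerically. Continuing the sketch above (same assumed toy data): the pooled ML covariance estimate is Σ̂ = S_W / N, so Σ̂^{-1}(m_2 − m_1) and S_W^{-1}(m_2 − m_1) point in the same direction.

import numpy as np

X1 = np.array([[4.0, 2.0], [2.0, 4.0], [2.0, 3.0], [3.0, 6.0], [4.0, 4.0]])
X2 = np.array([[9.0, 10.0], [6.0, 8.0], [9.0, 5.0], [8.0, 7.0], [10.0, 8.0]])
m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

N = len(X1) + len(X2)
Sigma_hat = S_W / N                            # pooled ML covariance estimate

w_fld = np.linalg.solve(S_W, m2 - m1)          # Fisher direction
w_ml = np.linalg.solve(Sigma_hat, m2 - m1)     # Gaussian ML direction

# Identical up to scale: the normalized directions coincide
print(w_fld / np.linalg.norm(w_fld))
print(w_ml / np.linalg.norm(w_ml))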
Connection to Least-Squares
[Figure: two-class data and the corresponding decision boundaries, shown without (left) and with (right) additional outlying points.]
Problems with Least Squares
Outline
Logistic discrimination models the class posteriors directly. Since Σ_{i=1}^{M} P(ω_i | x) = 1, only M − 1 of them need to be modelled:

P(ω_i | x) = exp(w_{i,0} + w_i^T x) / (1 + Σ_{j=1}^{M−1} exp(w_{j,0} + w_j^T x)),   i = 1, 2, …, M−1

P(ω_M | x) = 1 / (1 + Σ_{j=1}^{M−1} exp(w_{j,0} + w_j^T x))

For the two-class case (M = 2):

P(ω_2 | x) = 1 / (1 + exp(w_0 + w^T x))

P(ω_1 | x) = exp(w_0 + w^T x) / (1 + exp(w_0 + w^T x))

The unknown parameters w_i, w_{i,0}, i = 1, 2, …, M−1 are usually estimated by maximum likelihood arguments.
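A minimal sketch of such a maximum-likelihood fit for the two-class case, using plain gradient ascent on the log-likelihood; the toy data, learning rate, and iteration count are illustrative assumptions, not from the lecture. Note that for perfectly separable data the ML weights grow without bound, so in practice one stops after a fixed number of iterations or adds regularization.

import numpy as np

def fit_logistic(X, t, eta=0.1, n_iters=1000):
    # Two-class logistic discrimination fitted by maximum likelihood (gradient ascent).
    # X: (N, D) inputs; t: (N,) targets with t_n = 1 for class w1 and 0 for class w2.
    # Returns the augmented parameter vector [w0, w1, ..., wD].
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])
    w = np.zeros(X_tilde.shape[1])
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-(X_tilde @ w)))   # P(w1 | x_n) for every sample
        w += eta * (X_tilde.T @ (t - p))           # gradient of the log-likelihood
    return w

# Illustrative two-class data
X = np.array([[0.5, 1.0], [1.0, 1.5], [2.0, 2.5],
              [4.0, 4.5], [5.0, 5.0], [6.0, 5.5]])
t = np.array([1, 1, 1, 0, 0, 0])
print(fit_logistic(X, t))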
[Figure: least-squares versus logistic decision boundaries on the same two-class data, without (left) and with (right) outliers; the legend labels are Least-Squares and Logistic, and the logistic boundary is far less affected by the outlying points.]