Unit 3 - Discriminative Models
FC - Fortaleza
Linear models for classification
Probabilistic discriminative models
p(C_1|x) = \sigma\left( \ln \frac{p(x|C_1)\,p(C_1)}{p(x|C_2)\,p(C_2)} \right) = \sigma(w^T x + w_0)
◮ In the generative approach, maximum likelihood is first used to determine the parameters of the class-conditional densities and the class priors p(C_k)
◮ Bayes’ theorem is then used to find the posterior class probabilities p(C_k|x)
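As a small illustration of the result above (not part of the slides), the sketch below builds two Gaussian class-conditional densities with a shared covariance and checks numerically that the posterior from Bayes' theorem coincides with \sigma(w^T x + w_0); the means, covariance and priors are made-up values.

```python
# Illustrative sketch: for Gaussian class-conditionals with a shared covariance,
# the posterior p(C1|x) reduces to sigma(w^T x + w0). All numbers are made up.
import numpy as np
from scipy.stats import multivariate_normal

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

mu1, mu2 = np.array([1.0, 0.0]), np.array([-1.0, 1.0])   # class means
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])                # shared covariance
p1, p2 = 0.4, 0.6                                         # class priors

Sinv = np.linalg.inv(Sigma)
w = Sinv @ (mu1 - mu2)                                    # linear coefficients
w0 = -0.5 * mu1 @ Sinv @ mu1 + 0.5 * mu2 @ Sinv @ mu2 + np.log(p1 / p2)

x = np.array([0.5, -0.2])
num = p1 * multivariate_normal.pdf(x, mu1, Sigma)
den = num + p2 * multivariate_normal.pdf(x, mu2, Sigma)
print(num / den, sigmoid(w @ x + w0))                     # the two values coincide
```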
We considered classification models that work with the original input vector x
However, all of the algorithms are equally applicable if we first make a fixed
nonlinear transformation of the inputs using a vector of basis functions φ(x)
The resulting decision boundaries will be linear in the feature space φ, and
these correspond to nonlinear decision boundaries in the original x space
◮ Classes that are linearly separable in the feature space φ(x) need
not be linearly separable in the original observation space x
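A minimal sketch of such a fixed nonlinear transformation, assuming Gaussian basis functions \phi_j(x) = \exp(-\|x - \mu_j\|^2 / (2 s^2)); the names gaussian_basis, centres and s, and the values used, are illustrative only:

```python
# Sketch: map raw inputs x into a feature space via fixed Gaussian basis functions.
import numpy as np

def gaussian_basis(X, centres, s=0.5):
    """Map an (N, D) input matrix to an (N, M) matrix of Gaussian basis function values."""
    sq_dist = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)   # (N, M)
    return np.exp(-sq_dist / (2.0 * s ** 2))

centres = np.array([[0.0, 0.0], [1.0, 1.0]])          # two made-up centres in input space
X = np.random.uniform(-1, 1, size=(5, 2))             # a few 2-D input points
Phi = gaussian_basis(X, centres)                      # feature vectors (phi_1, phi_2)
print(Phi.shape)                                      # (5, 2)
```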
Fixed basis functions (cont.)
[Figure: left panel, the original input space (x1, x2); right panel, the feature space (φ1, φ2)]
Original input space (x1 , x2 ) together with points from two classes (red/blue)
◮ Two ‘Gaussian’ basis functions φ1 (x) and φ2(x) are defined in this space
with centres (green crosses) and with contours (green circles)
Feature space (φ1, φ2) together with the linear decision boundary (black line)
◮ Nonlinear decision boundary in the original input space (black curve)
Logistic regression
The posterior probability of class C_1 is a logistic sigmoid acting on a linear function of the feature vector \phi
p(C_1|\phi) = y(\phi) = \sigma(w^T \phi), \qquad p(C_2|\phi) = 1 - p(C_1|\phi) (1)
For data \{\phi_n, t_n\}_{n=1}^{N}, with t_n \in \{0, 1\} and \phi_n = \phi(x_n), the likelihood function is
p(t|w) = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n} (2)
By taking the negative log of the likelihood, our error function is defined by
E(w) = -\ln p(t|w) = -\sum_{n=1}^{N} \left\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \right\} (3)
which is the cross-entropy error function with y_n = \sigma(a_n) and a_n = w^T \phi_n
Logistic regression (cont.)
Taking the gradient of the error function with respect to w gives
\nabla E(w) = \sum_{n=1}^{N} (y_n - t_n)\,\phi_n
The contribution to the gradient from data point n is the error (y_n - t_n) between the target value and the model prediction, times the basis function vector \phi_n
◮ The gradient takes the same form as the gradient of the sum-of-squares error function for linear regression models
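A short sketch (not from the slides) of the cross-entropy error (3) and its gradient \sum_n (y_n - t_n)\,\phi_n; the design matrix Phi, the targets t and the helper names are stand-ins for this example:

```python
# Sketch: cross-entropy error and gradient for logistic regression on made-up data.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy_and_grad(w, Phi, t, eps=1e-12):
    """Return E(w) = -sum_n [t_n ln y_n + (1 - t_n) ln(1 - y_n)] and its gradient."""
    y = sigmoid(Phi @ w)                         # y_n = sigma(w^T phi_n)
    E = -np.sum(t * np.log(y + eps) + (1 - t) * np.log(1 - y + eps))
    grad = Phi.T @ (y - t)                       # sum_n (y_n - t_n) phi_n
    return E, grad

rng = np.random.default_rng(0)
Phi = rng.normal(size=(100, 3))                  # made-up N x M design matrix
t = (rng.uniform(size=100) < 0.5).astype(float)  # made-up binary targets
print(cross_entropy_and_grad(np.zeros(3), Phi, t))
```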
Iterative reweighted least squares
An efficient iterative optimisation scheme is given by the Newton-Raphson update
w^{(\text{new})} = w^{(\text{old})} - H^{-1} \nabla E(w)
where H is the Hessian matrix, with elements the second derivatives of E(w) with respect to w
We apply the update first to the sum-of-squares error function of the linear regression model
E_D(w) = \frac{1}{2} \sum_{n=1}^{N} \left( t_n - w^T \phi(x_n) \right)^2
and then to the cross-entropy error function of the logistic regression model
E(w) = -\ln p(t|w) = -\sum_{n=1}^{N} \left\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \right\}
Iterative reweighted least squares (cont.)
The gradient and Hessian of the sum-of-squares error function are
\nabla E(w) = \sum_{n=1}^{N} (w^T \phi_n - t_n)\,\phi_n = \Phi^T \Phi w - \Phi^T t (6)
H = \nabla\nabla E(w) = \sum_{n=1}^{N} \phi_n \phi_n^T = \Phi^T \Phi (7)
where \Phi is the N \times M design matrix whose n-th row is \phi_n^T
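As a quick numerical check (with a made-up design matrix and targets), the matrix forms in (6) and (7) agree with the explicit sums over data points:

```python
# Sketch: verify the matrix forms of the sum-of-squares gradient and Hessian.
import numpy as np

rng = np.random.default_rng(4)
N, M = 30, 3
Phi = rng.normal(size=(N, M))                    # rows are phi_n^T
t = rng.normal(size=N)
w = rng.normal(size=M)

grad_sum = sum((w @ Phi[n] - t[n]) * Phi[n] for n in range(N))   # sum_n (w^T phi_n - t_n) phi_n
H_sum = sum(np.outer(Phi[n], Phi[n]) for n in range(N))          # sum_n phi_n phi_n^T
print(np.allclose(grad_sum, Phi.T @ Phi @ w - Phi.T @ t))        # (6): True
print(np.allclose(H_sum, Phi.T @ Phi))                           # (7): True
```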
The Newton-Raphson update then takes the form
w^{(\text{new})} = w^{(\text{old})} - (\Phi^T \Phi)^{-1} \left\{ \Phi^T \Phi w^{(\text{old})} - \Phi^T t \right\} = (\Phi^T \Phi)^{-1} \Phi^T t
which is the standard least-squares solution, obtained in a single step because the error function is quadratic
Applying the Newton-Raphson update to the cross-entropy error function (3) of the logistic regression model, the gradient and Hessian are
\nabla E(w) = \sum_{n=1}^{N} (y_n - t_n)\,\phi_n = \Phi^T (y - t) (9)
H = \nabla\nabla E(w) = \sum_{n=1}^{N} y_n (1 - y_n)\,\phi_n \phi_n^T = \Phi^T R \Phi (10)
where R is the N \times N diagonal weighting matrix with elements R_{nn} = y_n (1 - y_n)
The Hessian is no longer constant: it depends on w through the weighting matrix R
◮ Since 0 < y_n < 1, the Hessian is positive definite and the error function has a unique minimum
Iterative reweighted least squares (cont.)
The Newton-Raphson update for the logistic regression model then becomes
w^{(\text{new})} = w^{(\text{old})} - (\Phi^T R \Phi)^{-1} \Phi^T (y - t) = (\Phi^T R \Phi)^{-1} \Phi^T R z
where z = \Phi w^{(\text{old})} - R^{-1} (y - t)
The update is a set of normal equations for a weighted least-squares problem
◮ For this reason, the algorithm is known as iterative reweighted least squares, or IRLS
◮ Because the weighting matrix R depends on w, the normal equations are solved repeatedly, each time recomputing R from the new w
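A compact sketch of the IRLS iteration under the notation above; the function name irls is hypothetical, and the small ridge term added to \Phi^T R \Phi is a numerical-stability assumption rather than part of the algorithm as stated:

```python
# Sketch: iterative reweighted least squares for logistic regression on synthetic data.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls(Phi, t, n_iter=20):
    """Fit logistic-regression weights by repeatedly solving the weighted normal equations."""
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)                                  # current predictions y_n
        r = np.clip(y * (1 - y), 1e-10, None)                 # diagonal of R
        z = Phi @ w - (y - t) / r                             # effective targets z_n
        R = np.diag(r)
        # (Phi^T R Phi) w_new = Phi^T R z, with a tiny ridge for numerical stability
        w = np.linalg.solve(Phi.T @ R @ Phi + 1e-8 * np.eye(M), Phi.T @ R @ z)
    return w

rng = np.random.default_rng(1)
Phi = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])   # bias + two features
true_w = np.array([-0.5, 2.0, -1.0])
t = (rng.uniform(size=200) < sigmoid(Phi @ true_w)).astype(float)
print(irls(Phi, t))                                                # approaches true_w
```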
Iterative reweighted least squares (cont.)
As in weighted least-squares problems, the elements of the diagonal weighting matrix R can be interpreted as variances, because the mean and variance of t (using t^2 = t for t \in \{0, 1\}) in the logistic regression model are
\mathbb{E}[t] = \sigma(w^T \phi) = y
\text{var}[t] = \mathbb{E}[t^2] - \mathbb{E}[t]^2 = y(1 - y)
IRLS can also be interpreted as solving a linearised problem in the space of the variable a = w^T \phi: the quantity z_n is an effective target value, obtained by a local linearisation of the logistic sigmoid around the current operating point w^{(\text{old})}
z_n = \phi_n^T w^{(\text{old})} - \frac{y_n - t_n}{y_n (1 - y_n)} (16)
Multiclass logistic regression
For the multiclass case, the posterior probabilities are given by a softmax transformation of linear functions of the feature vector
p(C_k|\phi) = y_k(\phi) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}, \qquad a_k = w_k^T \phi
The derivatives of y_k with respect to the activations a_j are
\frac{\partial y_k}{\partial a_j} = y_k (I_{kj} - y_j) (19)
where I_{kj} are the elements of the identity matrix
Next we need to write the likelihood function using the 1-of-K coding scheme
◮ The target vector t n for feature vector φn belonging to class Ck
is a binary vector with all elements zero except for element k
The likelihood is then given by
p(T|w_1, \ldots, w_K) = \prod_{n=1}^{N} \prod_{k=1}^{K} p(C_k|\phi_n)^{t_{nk}} = \prod_{n=1}^{N} \prod_{k=1}^{K} y_{nk}^{t_{nk}} (20)
where y_{nk} = y_k(\phi_n) and T is the N \times K matrix of target variables with elements t_{nk}
Multiclass logistic regression (cont.)
Taking the negative logarithm of the likelihood gives
E(w_1, \ldots, w_K) = -\ln p(T|w_1, \ldots, w_K) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \ln y_{nk} (21)
This is the cross-entropy error function for the multiclass classification problem
We now take the gradient of the error function with respect to one of the parameter vectors w_j
\nabla_{w_j} E(w_1, \ldots, w_K) = \sum_{n=1}^{N} (y_{nj} - t_{nj})\,\phi_n (22)
◮ We used the result for the derivatives of the softmax function, \partial y_k / \partial a_j = y_k (I_{kj} - y_j)
◮ We also used \sum_k t_{nk} = 1
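A vectorised sketch of (22), assuming a 1-of-K target matrix T of shape N x K and a softmax output; the data and names below are synthetic stand-ins:

```python
# Sketch: gradient of the multiclass cross-entropy error with respect to all w_j at once.
import numpy as np

def softmax(A):
    A = A - A.max(axis=1, keepdims=True)          # subtract row max for numerical stability
    expA = np.exp(A)
    return expA / expA.sum(axis=1, keepdims=True)

def multiclass_grad(W, Phi, T):
    """Return the M x K matrix whose j-th column is sum_n (y_nj - t_nj) phi_n."""
    Y = softmax(Phi @ W)                           # y_nk for all n, k
    return Phi.T @ (Y - T)

rng = np.random.default_rng(2)
N, M, K = 50, 4, 3
Phi = rng.normal(size=(N, M))                      # made-up design matrix
T = np.eye(K)[rng.integers(0, K, size=N)]          # 1-of-K coded targets
print(multiclass_grad(np.zeros((M, K)), Phi, T).shape)   # (4, 3)
```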
Multiclass logistic regression (cont.)
The gradient takes the same form as that found for the sum-of-squares error function with the linear model and for the cross-entropy error with the logistic regression model
◮ The product of the error (y_{nj} - t_{nj}) and the basis function vector \phi_n
The derivative of the log likelihood for a linear regression model with respect to the parameter vector w, for a data point n, took the same form
◮ The error (y_n - t_n) times the feature vector \phi_n
The Hessian comprises blocks of size M \times M, in which block (j, k) is given by
\nabla_{w_k} \nabla_{w_j} E(w_1, \ldots, w_K) = \sum_{n=1}^{N} y_{nk} (I_{kj} - y_{nj})\,\phi_n \phi_n^T (23)
As in the two-class case, the Hessian matrix for the multiclass logistic regression model is positive definite and the error function again has a unique minimum
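As a numerical sanity check (synthetic data, illustrative only), the sketch below assembles the full block Hessian from (23) and confirms it has no negative eigenvalues:

```python
# Sketch: build the (M*K) x (M*K) Hessian from blocks sum_n y_nk (I_kj - y_nj) phi_n phi_n^T
# and check that its smallest eigenvalue is non-negative.
import numpy as np

def softmax(A):
    A = A - A.max(axis=1, keepdims=True)
    expA = np.exp(A)
    return expA / expA.sum(axis=1, keepdims=True)

rng = np.random.default_rng(3)
N, M, K = 50, 4, 3
Phi = rng.normal(size=(N, M))                     # made-up design matrix
W = rng.normal(size=(M, K))                       # arbitrary parameter vectors w_1 ... w_K
Y = softmax(Phi @ W)                              # y_nk

H = np.zeros((M * K, M * K))
for n in range(N):
    S = np.diag(Y[n]) - np.outer(Y[n], Y[n])      # entries y_nk (I_kj - y_nj)
    H += np.kron(S, np.outer(Phi[n], Phi[n]))     # K x K grid of M x M blocks
print(np.linalg.eigvalsh(H).min() >= -1e-9)       # True: no negative eigenvalues
```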