Unit 3-Discriminative Models

This document discusses probabilistic discriminative models and linear classification models. It covers logistic regression, which models the posterior probability of a class using a logistic sigmoid function of a linear combination of features. Maximum likelihood is used to determine the logistic regression model parameters. The document also describes using iterative reweighted least squares to efficiently find the maximum likelihood solution for logistic regression since there is no closed-form solution. Nonlinear kernel methods can also be used to transform inputs to allow modeling of nonlinear decision boundaries.

Probabilistic discriminative models

Linear models for classification

Probabilistic discriminative models

For the two-class classification problem, the posterior probability of class C1 can be written as a logistic sigmoid acting on a linear function of x

p(C_1|x) = \sigma\left( \ln \frac{p(x|C_1)\, p(C_1)}{p(x|C_2)\, p(C_2)} \right) = \sigma(w^T x + w_0)

◮ for a wide choice of class-conditional distributions p(x|Ck )

For the multi-class case, the posterior probability of class Ck is given by a softmax transformation of a linear function of x

p(C_k|x) = \frac{p(x|C_k)\, p(C_k)}{\sum_{j=1}^{K} p(x|C_j)\, p(C_j)} = \frac{\exp(w_k^T x + w_{k0})}{\sum_{j=1}^{K} \exp(w_j^T x + w_{j0})}
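As a quick illustration of these two forms, a minimal NumPy sketch (not part of the original slides; the input and weight values are arbitrary placeholders) that evaluates the sigmoid posterior for two classes and the softmax posterior for K classes:

```python
import numpy as np

def sigmoid(a):
    # Logistic sigmoid: sigma(a) = 1 / (1 + exp(-a))
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    # Softmax over the last axis, shifted by the max for numerical stability
    a = a - np.max(a, axis=-1, keepdims=True)
    e = np.exp(a)
    return e / np.sum(e, axis=-1, keepdims=True)

# Two-class posterior: p(C1|x) = sigma(w^T x + w0), with made-up values
x = np.array([1.0, 2.0])
w, w0 = np.array([0.5, -0.3]), 0.1
print(sigmoid(w @ x + w0))          # p(C1|x)

# Multi-class posterior: p(Ck|x) = softmax(W x + w0)_k, here K = 3 classes
W = np.array([[0.5, -0.3], [0.2, 0.4], [-0.1, 0.1]])
w0_vec = np.array([0.1, 0.0, -0.2])
print(softmax(W @ x + w0_vec))      # posteriors sum to 1
```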
Probabilistic discriminative models (cont.)

For specific choices of class-conditionals p(x|Ck), maximum likelihood can be used to determine the parameters of the densities and the class priors p(Ck)

◮ Bayes’ theorem is then used to find posterior class probabilities p(Ck |x)

An alternative approach is to use the functional form of the generalised linear model explicitly and determine its parameters directly by maximum likelihood
◮ There is an efficient algorithm for finding such solutions
◮ Iterative reweighted least squares, IRLS
Probabilistic discriminative models (cont.)

The indirect approach to finding the parameters of a generalised linear model, by fitting class-conditional densities and class priors separately and then applying Bayes’ theorem, represents an example of generative modelling
◮ We could take such a model and generate synthetic data
by drawing values of x from the marginal distribution p(x)

In the direct approach, we maximise a likelihood function defined through the conditional distribution p(Ck|x); this is a form of discriminative training
◮ One advantage of the discriminative approach is that there
will typically be fewer adaptive parameters to be determined
◮ It may also lead to improved predictive performance, particularly
when the class-conditional density assumptions give a poor
approximation to the true distributions
Fixed basis functions

We considered classification models that work with the original input vector x
However, all of the algorithms are equally applicable if we first make a fixed
nonlinear transformation of the inputs using a vector of basis functions φ(x)

The resulting decision boundaries will be linear in the feature space φ, and
these correspond to nonlinear decision boundaries in the original x space
◮ Classes that are linearly separable in the feature space φ(x) need
not be linearly separable in the original observation space x
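As a sketch of such a fixed nonlinear transformation (assuming Gaussian basis functions with hand-picked centres and a common width s, similar to the two basis functions in the figure that follows; the values are illustrative only):

```python
import numpy as np

def gaussian_basis(X, centres, s=0.5):
    """Map each input x to phi(x) = [exp(-||x - mu_j||^2 / (2 s^2))] over centres mu_j."""
    # X: (N, D) inputs, centres: (M, D) basis-function centres
    sq_dist = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * s ** 2))

# Example: two Gaussian basis functions in a 2-D input space
X = np.array([[0.0, 0.0], [1.0, -1.0], [-0.5, 0.5]])
centres = np.array([[-0.5, 0.5], [0.5, -0.5]])      # basis-function centres
Phi = gaussian_basis(X, centres)                    # shape (3, 2): features (phi1, phi2)
print(Phi)
```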
Fixed basis functions (cont.)

Original input space (x1 , x2 ) together with points from two classes (red/blue)
◮ Two ‘Gaussian’ basis functions φ1 (x) and φ2(x) are defined in this space
with centres (green crosses) and with contours (green circles)

Feature space (φ1, φ2) together with the linear decision boundary (black line)
◮ Nonlinear decision boundary in the original input space (black curve)
Fixed basis functions (cont.)

Often, there is significant overlap between class-conditional densities p(x|Ck )

◮ This corresponds to posterior probabilities p(Ck |x), which are not 0 or 1


◮ At least, for some values of x

In such cases, the optimal solution is obtained by modelling the posterior probabilities p(Ck|x) accurately and then applying standard decision theory

Note that nonlinear transformations φ(x) cannot remove such class overlap
◮ Indeed, they can increase the level of overlap, or even create
overlap where none existed in the original observation space

However, suitable choices of nonlinearity can often make the process of modelling the posterior probabilities easier
Notwithstanding these limitations, models with fixed nonlinear basis functions play an important role in practice
Logistic regression

When considering the two-class problem using a generative approach and under general assumptions, the posterior probability of class C1 can be written as
◮ a logistic sigmoid on a linear function of the feature vector φ, so that

p(C_1|\phi) = y(\phi) = \sigma(w^T \phi) \quad\text{with}\quad p(C_2|\phi) = 1 - p(C_1|\phi)   (1)

◮ The logistic sigmoid function is defined as

\sigma(a) = \frac{1}{1 + \exp(-a)} \quad\text{with}\quad a = \ln \frac{p(x|C_1)\, p(C_1)}{p(x|C_2)\, p(C_2)}

In the terminology of statistics, this model is known as logistic regression
◮ For an M-dimensional feature space φ, the model has M parameters
Logistic regression (cont.)

To fit Gaussian class-conditional densities with maximum likelihood, we need
◮ 2M + M(M + 1)/2 parameters for the means and the (shared) covariance matrix
◮ A total of M(M + 5)/2 + 1 parameters, if we include the class prior p(C1)
◮ The number of parameters grows quadratically with M

For the M parameters of the logistic regression model, we instead use maximum likelihood to determine them directly
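To make the comparison concrete, a quick sketch (assuming, for example, M = 100 features) of the two parameter counts:

```python
M = 100  # dimensionality of the feature space phi

# Generative approach with Gaussian class-conditionals and shared covariance:
# 2M (means) + M(M+1)/2 (covariance) + 1 (class prior) = M(M+5)/2 + 1
generative_params = 2 * M + M * (M + 1) // 2 + 1

# Discriminative logistic regression: one weight per feature
logistic_params = M

print(generative_params, logistic_params)   # 5251 vs 100
```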
Logistic regression (cont.)

For data {φn, tn}, n = 1, …, N, with tn ∈ {0, 1} and φn = φ(xn), the likelihood function

p(t|w) = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n}   (2)

is written for t = (t_1, \ldots, t_N)^T and yn = p(C1|φn)

By taking the negative log of the likelihood, our error function is defined by

E(w) = -\ln p(t|w) = -\sum_{n=1}^{N} \left\{ t_n \ln y_n + (1 - t_n) \ln (1 - y_n) \right\}   (3)

which is the cross-entropy error function with yn = σ(an) and an = w^T φn
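A minimal sketch (with made-up data; the clipping is only a numerical safeguard against log(0), not part of the slides) of evaluating yn = σ(w^T φn) and the cross-entropy error (3):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(w, Phi, t, eps=1e-12):
    """E(w) = -sum_n [ t_n ln y_n + (1 - t_n) ln(1 - y_n) ] with y_n = sigmoid(w^T phi_n)."""
    y = sigmoid(Phi @ w)
    y = np.clip(y, eps, 1.0 - eps)          # avoid log(0)
    return -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))

# Toy example: N = 4 points, M = 2 features (design matrix Phi has phi_n^T in row n)
Phi = np.array([[1.0, 0.2], [1.0, -0.5], [1.0, 1.5], [1.0, -1.0]])
t = np.array([1.0, 0.0, 1.0, 0.0])
w = np.zeros(2)
print(cross_entropy(w, Phi, t))             # equals N * ln 2 at w = 0
```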
Logistic regression (cont.)

By taking the gradient of the error function with respect to w, we get

\nabla E(w) = \sum_{n=1}^{N} (y_n - t_n)\, \phi_n   (4)

The contribution to the gradient from point n comes from the error (yn − tn) between target value and model prediction, times the basis function vector φn
◮ The gradient takes the same form as the gradient of the sum-of-squares error function for linear regression models
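In matrix form the gradient (4) is Φ^T (y − t); a short sketch (reusing the toy Φ and t from the previous sketch) that evaluates it and takes one plain gradient step with an arbitrary step size, shown only to exercise the formula (the slides instead use IRLS for optimisation):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def grad_E(w, Phi, t):
    # grad E(w) = sum_n (y_n - t_n) phi_n = Phi^T (y - t)
    y = sigmoid(Phi @ w)
    return Phi.T @ (y - t)

Phi = np.array([[1.0, 0.2], [1.0, -0.5], [1.0, 1.5], [1.0, -1.0]])
t = np.array([1.0, 0.0, 1.0, 0.0])
w = np.zeros(2)
eta = 0.1                                   # step size (arbitrary choice)
w = w - eta * grad_E(w, Phi, t)             # one gradient-descent step
print(w)
```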
Logistic regression (cont.)

Maximum likelihood can show severe over-fitting for linearly separable datasets
◮ The MLE solution occurs when the hyperplane for σ = 0.5, or w^T φ = 0, separates the two classes and the magnitude of w goes to infinity
◮ The logistic sigmoid becomes infinitely steep (a Heaviside step) in feature space, and every point from each class k gets a posterior probability p(Ck|x) = 1

There is also a continuum of such solutions because any separating hyperplane gives rise to the same posterior probabilities at the training data points
◮ Maximum likelihood does not favour one such solution over another
◮ The solution depends on the optimisation algorithm and initialisation

One possibility would be to introduce a prior over w and find a MAP solution
◮ Add a regularisation term to the error function
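One way to realise this is a sketch assuming a simple quadratic penalty λ/2 ||w||², which corresponds to a zero-mean Gaussian prior over w (the value of λ here is an arbitrary placeholder): the penalty is added to the cross-entropy error and its gradient, so ||w|| can no longer grow without bound on separable data.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def regularised_error_and_grad(w, Phi, t, lam=0.1, eps=1e-12):
    """Cross-entropy error plus lambda/2 * ||w||^2, and its gradient."""
    y = np.clip(sigmoid(Phi @ w), eps, 1.0 - eps)
    E = -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y)) + 0.5 * lam * w @ w
    grad = Phi.T @ (y - t) + lam * w
    return E, grad

# Linearly separable toy data: without the penalty, the MLE would drive ||w|| to infinity
Phi = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
t = np.array([0.0, 0.0, 1.0, 1.0])
print(regularised_error_and_grad(np.array([0.0, 5.0]), Phi, t))
```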
Iterative reweighted least squares

In the case of the linear regression models, the maximum likelihood solution, on the assumption of a Gaussian noise model, leads to a closed-form solution
◮ A consequence of the quadratic dependence of the log likelihood function on w

For logistic regression, due to the nonlinearity of the logistic sigmoid function
◮ There is no longer a closed-form solution
◮ The departure from a quadratic form is not substantial
Specifically, the error function is convex, and hence it has a unique minimum

Furthermore, the error function can be minimised by an efficient iterative technique based on the Newton-Raphson iterative optimisation scheme
◮ A local quadratic approximation to the log likelihood function
Iterative reweighted least squares (cont.)
The Newton-Raphson update, for minimising a function E (w), takes the form
w^{(new)} = w^{(old)} - H^{-1} \nabla E(w)   (5)

H is the Hessian matrix, with elements the second derivatives of E (w) wrt w

We apply the Newton-Raphson method to


1. the sum-of-squares error function (linear regression model)

E_D(w) = \frac{1}{2} \sum_{n=1}^{N} \left\{ t_n - w^T \phi(x_n) \right\}^2

2. the cross-entropy error function (logistic regression model)

E(w) = -\ln p(t|w) = -\sum_{n=1}^{N} \left\{ t_n \ln y_n + (1 - t_n) \ln (1 - y_n) \right\}
Iterative reweighted least squares (cont.)
Gradient and Hessian of the sum-of-squares error function are

\nabla E(w) = \sum_{n=1}^{N} (w^T \phi_n - t_n)\, \phi_n = \Phi^T \Phi\, w - \Phi^T t   (6)

H = \nabla\nabla E(w) = \sum_{n=1}^{N} \phi_n \phi_n^T = \Phi^T \Phi   (7)

where Φ is the N × M design matrix with φn^T in the n-th row
The Newton-Raphson update takes the form

w^{(new)} = w^{(old)} - (\Phi^T \Phi)^{-1} \left( \Phi^T \Phi\, w^{(old)} - \Phi^T t \right) = (\Phi^T \Phi)^{-1} \Phi^T t   (8)

which is the classical/standard least-squares solution

Because the error function is quadratic, the Newton-Raphson formula gives the exact solution in a single step
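A quick numerical check of (8), as a sketch with random toy data: a single Newton-Raphson step from any starting point lands exactly on the least-squares solution (Φ^T Φ)^{-1} Φ^T t.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 20, 3
Phi = rng.normal(size=(N, M))               # design matrix, phi_n^T in row n
t = rng.normal(size=N)                      # real-valued regression targets

w_old = rng.normal(size=M)                  # arbitrary starting point
grad = Phi.T @ Phi @ w_old - Phi.T @ t      # eq. (6)
H = Phi.T @ Phi                             # eq. (7)
w_new = w_old - np.linalg.solve(H, grad)    # one Newton-Raphson step, eq. (5)

w_ls = np.linalg.lstsq(Phi, t, rcond=None)[0]   # standard least-squares solution
print(np.allclose(w_new, w_ls))             # True: exact solution in one step
```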
Iterative reweighted least squares (cont.)

Gradient and Hessian of the cross-entropy error function are

\nabla E(w) = \sum_{n=1}^{N} (y_n - t_n)\, \phi_n = \Phi^T (y - t)   (9)

H = \nabla\nabla E(w) = \sum_{n=1}^{N} y_n (1 - y_n)\, \phi_n \phi_n^T = \Phi^T R\, \Phi   (10)

where R(w) is an N × N diagonal matrix with (n, n) elements

R_{nn} = y_n (1 - y_n)   (11)

The Hessian is no longer constant; it depends on w through the weighting matrix R
Iterative reweighted least squares (cont.)

Because 0 < yn < 1, for an arbitrary non-zero vector u we have that u^T H u > 0
◮ H is positive definite
The error function is convex in w and hence it has a unique minimum

The Newton-Raphson update formula becomes


w^{(new)} = w^{(old)} - (\Phi^T R\, \Phi)^{-1} \Phi^T (y - t)
          = (\Phi^T R\, \Phi)^{-1} \left\{ \Phi^T R\, \Phi\, w^{(old)} - \Phi^T (y - t) \right\}
          = (\Phi^T R\, \Phi)^{-1} \Phi^T R\, z   (12)

where z is an N-vector with elements

z = \Phi\, w^{(old)} - R^{-1} (y - t)   (13)
Iterative reweighted least squares (cont.)

w^{(new)} = (\Phi^T R\, \Phi)^{-1} \Phi^T R\, z \quad\text{with}\quad z = \Phi\, w^{(old)} - R^{-1} (y - t)

The update is the set of normal equations for a weighted least-squares problem

Because the weighting matrix R is not constant but depends on the parameter vector w, we must apply the normal equations iteratively
◮ each time using the new weight vector w to compute a revised weighting matrix R

For this reason, the algorithm is known as iterative reweighted least squares, or IRLS
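Putting equations (9)-(13) together, a minimal IRLS sketch for two-class logistic regression (toy data; the small ridge term is only an added numerical safeguard to keep Φ^T RΦ invertible and is not part of the slides):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls(Phi, t, n_iter=10, ridge=1e-8):
    """Iterative reweighted least squares for two-class logistic regression."""
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)                            # y_n = sigma(w^T phi_n)
        R = y * (1.0 - y)                               # diagonal of R, eq. (11)
        z = Phi @ w - (y - t) / np.maximum(R, ridge)    # effective targets, eq. (13)
        A = Phi.T @ (R[:, None] * Phi) + ridge * np.eye(M)   # Phi^T R Phi (+ small ridge)
        w = np.linalg.solve(A, Phi.T @ (R * z))         # weighted least squares, eq. (12)
    return w

# Toy example (not linearly separable, so the optimum is finite)
Phi = np.array([[1.0, -2.0], [1.0, -0.5], [1.0, 0.5], [1.0, 2.0]])
t = np.array([0.0, 1.0, 0.0, 1.0])
print(irls(Phi, t))
```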
Iterative reweighted least squares (cont.)
As in weighted least-squares problems, the elements of the diagonal weighting
matrix R can be interpreted as variances because the mean and variance of t
(t² = t, for t ∈ {0, 1}) in the logistic regression model are

E[t] = \sigma(x) = y   (14)

\operatorname{var}[t] = E[t^2] - E[t]^2 = \sigma(x) - \sigma(x)^2 = y(1 - y)   (15)
We can interpret IRLS as the solution to a linearised problem in the space of the variable a = w^T φ

The quantity zn (the n-th element of z) can then be given an interpretation as an effective target value in this space, obtained by making a local linear approximation to the logistic sigmoid function around the current operating point w^{(old)}

a_n(w) \simeq a_n(w^{(old)}) + \left. \frac{d a_n}{d y_n} \right|_{w^{(old)}} (t_n - y_n) = \phi_n^T w^{(old)} - \frac{y_n - t_n}{y_n (1 - y_n)} = z_n   (16)
Multiclass logistic regression

In the discussion of generative models for multiclass classification, we have seen that for a large class of distributions, the posterior probabilities are given by a softmax transformation of linear functions of the feature variables

p(C_k|\phi) = y_k(\phi) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}   (17)

where the activations ak are

a_k = w_k^T \phi   (18)

We used maximum likelihood to determine separately the class-conditional densities and the class priors, and then found the corresponding posterior probabilities using Bayes’ theorem, implicitly determining the parameters {wk}
Multiclass logistic regression (cont.)
We can use maximum likelihood to determine the parameters {wk} of this model directly
To do this, we need the derivatives of yk with respect to all of the activations aj

\frac{\partial y_k}{\partial a_j} = y_k (I_{kj} - y_j)   (19)
where Ikj are the elements of the identity matrix

Next we need to write the likelihood function using the 1-of-K coding scheme
◮ The target vector t n for feature vector φn belonging to class Ck
is a binary vector with all elements zero except for element k
The likelihood is then given by

p(T|w_1, \ldots, w_K) = \prod_{n=1}^{N} \prod_{k=1}^{K} p(C_k|\phi_n)^{t_{nk}} = \prod_{n=1}^{N} \prod_{k=1}^{K} y_{nk}^{t_{nk}}   (20)

where ynk = yk(φn), and T is an N × K matrix of target variables with elements tnk
Multiclass logistic regression (cont.)

Taking the negative logarithm gives

E(w_1, \ldots, w_K) = -\ln p(T|w_1, \ldots, w_K) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \ln y_{nk}   (21)

This is the cross-entropy error function for the multiclass classification problem

We now take the gradient of the error function with respect to one parameter vector wj

\nabla_{w_j} E(w_1, \ldots, w_K) = \sum_{n=1}^{N} (y_{nj} - t_{nj})\, \phi_n   (22)

We used the result for the derivatives of the softmax function, \partial y_k / \partial a_j = y_k (I_{kj} - y_j)
We also used \sum_k t_{nk} = 1
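A sketch (toy data with 1-of-K targets; all names and values are illustrative) of evaluating the softmax posteriors ynk and the gradient (22), stacked for all K parameter vectors at once:

```python
import numpy as np

def softmax(A):
    A = A - A.max(axis=1, keepdims=True)    # shift for numerical stability
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def multiclass_grad(W, Phi, T):
    """Gradient of the multiclass cross-entropy wrt each w_j, returned as a (K, M) array.

    Row j is eq. (22): sum_n (y_nj - t_nj) phi_n.
    """
    Y = softmax(Phi @ W.T)                  # Y[n, k] = y_nk, with a_nk = w_k^T phi_n
    return (Y - T).T @ Phi                  # shape (K, M)

# Toy example: N = 4 points, M = 2 features, K = 3 classes
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, -1.0], [1.0, 2.0]])
T = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]], dtype=float)   # 1-of-K coding
W = np.zeros((3, 2))                        # one row per class: w_k^T
print(multiclass_grad(W, Phi, T))
```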
Multiclass logistic regression (cont.)

This is the same form for the gradient as found for the sum-of-squares error function with the linear model and for the cross-entropy error with the logistic regression model
◮ The product of the error (ynj − tnj) times the basis function φn

The derivative of the log likelihood function for a linear regression model with respect to the parameter vector w for a data point n took the same form
◮ The error (yn − tn) times the feature vector φn

Similarly, for the combination of logistic sigmoid activation function and cross-entropy error function, and for the softmax activation function with the multiclass cross-entropy error function, we again obtain this same simple form
Multiclass logistic regression (cont.)

To find a batch algorithm, we can use the Newton-Raphson update to obtain the corresponding IRLS algorithm for the multiclass problem

This requires evaluation of the Hessian matrix, which comprises blocks of size M × M in which block (j, k) is given by

\nabla_{w_k} \nabla_{w_j} E(w_1, \ldots, w_K) = \sum_{n=1}^{N} y_{nk} (I_{kj} - y_{nj})\, \phi_n \phi_n^T   (23)

As in the two-class case, the Hessian matrix for the multiclass logistic regression model is positive definite and the error function again has a unique minimum
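As a sketch of how this block Hessian could be assembled according to eq. (23) (reusing the softmax and toy data shapes from the previous sketch; the eigenvalue check only confirms positive semi-definiteness numerically):

```python
import numpy as np

def softmax(A):
    A = A - A.max(axis=1, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def multiclass_hessian(W, Phi):
    """(K*M) x (K*M) Hessian with M x M block (j, k) = sum_n y_nk (I_kj - y_nj) phi_n phi_n^T."""
    N, M = Phi.shape
    K = W.shape[0]
    Y = softmax(Phi @ W.T)                              # Y[n, k] = y_nk
    H = np.zeros((K * M, K * M))
    for j in range(K):
        for k in range(K):
            coeff = Y[:, k] * ((j == k) - Y[:, j])      # y_nk (I_kj - y_nj) for each n
            block = (Phi * coeff[:, None]).T @ Phi      # sum_n coeff_n phi_n phi_n^T
            H[j * M:(j + 1) * M, k * M:(k + 1) * M] = block
    return H

Phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, -1.0], [1.0, 2.0]])
W = np.zeros((3, 2))
H = multiclass_hessian(W, Phi)
print(np.linalg.eigvalsh(H).min() >= -1e-10)            # no negative eigenvalues
```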
