Unit 3-Discriminative Models

This document discusses probabilistic discriminative models and linear classification models. It covers logistic regression, which models the posterior probability of a class using a logistic sigmoid function of a linear combination of features. Maximum likelihood is used to determine the logistic regression model parameters. The document also describes using iterative reweighted least squares to efficiently find the maximum likelihood solution for logistic regression since there is no closed-form solution. Nonlinear kernel methods can also be used to transform inputs to allow modeling of nonlinear decision boundaries.

Probabilistic discriminative models

Linear models for classification

Probabilistic discriminative models

For the two-class classification problem, the posterior probability of class C1 can be written as a logistic sigmoid acting on a linear function of x

p(C_1|x) = \sigma\left( \ln \frac{p(x|C_1)\, p(C_1)}{p(x|C_2)\, p(C_2)} \right) = \sigma(w^T x + w_0)

◮ for a wide choice of class-conditional distributions p(x|Ck )

For the multi-class case, the posterior probability of class Ck is given by a softmax transformation of a linear function of x

p(C_k|x) = \frac{p(x|C_k)\, p(C_k)}{\sum_{j=1}^{K} p(x|C_j)\, p(C_j)} = \frac{\exp(w_k^T x + w_{k0})}{\sum_{j=1}^{K} \exp(w_j^T x + w_{j0})}
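As a quick illustration of these two forms, a minimal NumPy sketch (not part of the original slides; the input and weight values are arbitrary placeholders) that evaluates the sigmoid posterior for two classes and the softmax posterior for K classes:

```python
import numpy as np

def sigmoid(a):
    # Logistic sigmoid: sigma(a) = 1 / (1 + exp(-a))
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    # Softmax over the last axis, shifted by the max for numerical stability
    a = a - np.max(a, axis=-1, keepdims=True)
    e = np.exp(a)
    return e / np.sum(e, axis=-1, keepdims=True)

# Two-class posterior: p(C1|x) = sigma(w^T x + w0), with made-up values
x = np.array([1.0, 2.0])
w, w0 = np.array([0.5, -0.3]), 0.1
print(sigmoid(w @ x + w0))          # p(C1|x)

# Multi-class posterior: p(Ck|x) = softmax(W x + w0)_k, here K = 3 classes
W = np.array([[0.5, -0.3], [0.2, 0.4], [-0.1, 0.1]])
w0_vec = np.array([0.1, 0.0, -0.2])
print(softmax(W @ x + w0_vec))      # posteriors sum to 1
```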
Probabilistic discriminative models (cont.)

For specific choices of class-conditionals p(x|Ck), maximum likelihood can be used to determine the parameters of the densities and the class priors p(Ck)

◮ Bayes’ theorem is then used to find posterior class probabilities p(Ck |x)

An alternative approach is to use the functional form of the generalised linear model explicitly and determine its parameters directly by maximum likelihood
◮ There is an efficient algorithm for finding such solutions
◮ Iterative reweighted least squares, IRLS
Probabilistic discriminative models (cont.)

The indirect approach to finding the parameters of a generalised linear model, by fitting class-conditional densities and class priors separately and then applying Bayes’ theorem, represents an example of generative modelling
◮ We could take such a model and generate synthetic data
by drawing values of x from the marginal distribution p(x)

In the direct approach, we maximise a likelihood function defined through the conditional distribution p(Ck|x); this is a form of discriminative training
◮ One advantage of the discriminative approach is that there
will typically be fewer adaptive parameters to be determined
◮ It may also lead to improved predictive performance, particularly
when the class-conditional density assumptions give a poor
approximation to the true distributions
Fixed basis functions

We considered classification models that work with the original input vector x
However, all of the algorithms are equally applicable if we first make a fixed
nonlinear transformation of the inputs using a vector of basis functions φ(x)

The resulting decision boundaries will be linear in the feature space φ, and
these correspond to nonlinear decision boundaries in the original x space
◮ Classes that are linearly separable in the feature space φ(x) need
not be linearly separable in the original observation space x
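As a sketch of such a fixed nonlinear transformation (assuming Gaussian basis functions with hand-picked centres and a common width s, similar to the two basis functions in the figure that follows; the values are illustrative only):

```python
import numpy as np

def gaussian_basis(X, centres, s=0.5):
    """Map each input x to phi(x) = [exp(-||x - mu_j||^2 / (2 s^2))] over centres mu_j."""
    # X: (N, D) inputs, centres: (M, D) basis-function centres
    sq_dist = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * s ** 2))

# Example: two Gaussian basis functions in a 2-D input space
X = np.array([[0.0, 0.0], [1.0, -1.0], [-0.5, 0.5]])
centres = np.array([[-0.5, 0.5], [0.5, -0.5]])      # basis-function centres
Phi = gaussian_basis(X, centres)                    # shape (3, 2): features (phi1, phi2)
print(Phi)
```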
Fixed basis functions (cont.)

Original input space (x1 , x2 ) together with points from two classes (red/blue)
◮ Two ‘Gaussian’ basis functions φ1 (x) and φ2(x) are defined in this space
with centres (green crosses) and with contours (green circles)

Feature space (φ1, φ2) together with the linear decision boundary (black line)
◮ Nonlinear decision boundary in the original input space (black curve)
Fixed basis functions (cont.)

Often, there is significant overlap between class-conditional densities p(x|Ck )

◮ This corresponds to posterior probabilities p(Ck |x), which are not 0 or 1


◮ At least, for some values of x

In such cases, the optimal solution is obtained by modelling the posterior probabilities p(Ck|x) accurately and then applying standard decision theory

Note that nonlinear transformations φ(x) cannot remove such class overlap
◮ Indeed, they can increase the level of overlap, or even create
overlap where none existed in the original observation space

However, suitable choices of nonlinearity can often make the process of modelling the posterior probabilities easier
Notwithstanding these limitations, models with fixed nonlinear basis functions play an important role in practice
Logistic regression

When considering the two-class problem using a generative approach and under general assumptions, the posterior probability of class C1 can be written as
◮ a logistic sigmoid on a linear function of the feature vector φ, so that

p(C_1|\phi) = y(\phi) = \sigma(w^T \phi) \quad\text{with}\quad p(C_2|\phi) = 1 - p(C_1|\phi)   (1)

◮ The logistic sigmoid function is defined as

\sigma(a) = \frac{1}{1 + \exp(-a)} \quad\text{with}\quad a = \ln \frac{p(x|C_1)\, p(C_1)}{p(x|C_2)\, p(C_2)}

In the terminology of statistics, this model is known as logistic regression
◮ For an M-dimensional feature space φ, the model has M parameters
Logistic regression (cont.)

To fit Gaussian class-conditional densities with maximum likelihood, we need
◮ 2M + M(M + 1)/2 parameters for the means and the (shared) covariance matrix
◮ A total of M(M + 5)/2 + 1 parameters, if we include the class prior p(C1)
◮ The number of parameters grows quadratically with M

For the M parameters of the logistic regression model, we instead use maximum likelihood to determine them directly
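To make the comparison concrete, a quick sketch (assuming, for example, M = 100 features) of the two parameter counts:

```python
M = 100  # dimensionality of the feature space phi

# Generative approach with Gaussian class-conditionals and shared covariance:
# 2M (means) + M(M+1)/2 (covariance) + 1 (class prior) = M(M+5)/2 + 1
generative_params = 2 * M + M * (M + 1) // 2 + 1

# Discriminative logistic regression: one weight per feature
logistic_params = M

print(generative_params, logistic_params)   # 5251 vs 100
```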
Logistic regression (cont.)

For data {φn, tn}, n = 1, …, N, with tn ∈ {0, 1} and φn = φ(xn), the likelihood function

p(t|w) = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n}   (2)

is written for t = (t_1, \ldots, t_N)^T and yn = p(C1|φn)

By taking the negative log of the likelihood, our error function is defined by

E(w) = -\ln p(t|w) = -\sum_{n=1}^{N} \left\{ t_n \ln y_n + (1 - t_n) \ln (1 - y_n) \right\}   (3)

which is the cross-entropy error function with yn = σ(an) and an = w^T φn
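A minimal sketch (with made-up data; the clipping is only a numerical safeguard against log(0), not part of the slides) of evaluating yn = σ(w^T φn) and the cross-entropy error (3):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(w, Phi, t, eps=1e-12):
    """E(w) = -sum_n [ t_n ln y_n + (1 - t_n) ln(1 - y_n) ] with y_n = sigmoid(w^T phi_n)."""
    y = sigmoid(Phi @ w)
    y = np.clip(y, eps, 1.0 - eps)          # avoid log(0)
    return -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))

# Toy example: N = 4 points, M = 2 features (design matrix Phi has phi_n^T in row n)
Phi = np.array([[1.0, 0.2], [1.0, -0.5], [1.0, 1.5], [1.0, -1.0]])
t = np.array([1.0, 0.0, 1.0, 0.0])
w = np.zeros(2)
print(cross_entropy(w, Phi, t))             # equals N * ln 2 at w = 0
```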
Logistic regression (cont.)

By taking the gradient of the error function with respect to w, we get

\nabla E(w) = \sum_{n=1}^{N} (y_n - t_n)\, \phi_n   (4)

The contribution to the gradient from point n comes from the error (yn − tn) between target value and model prediction, times the basis function vector φn
◮ The gradient takes the same form as the gradient of the sum-of-squares error function for linear regression models
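In matrix form the gradient (4) is Φ^T (y − t); a short sketch (reusing the toy Φ and t from the previous sketch) that evaluates it and takes one plain gradient step with an arbitrary step size, shown only to exercise the formula (the slides instead use IRLS for optimisation):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def grad_E(w, Phi, t):
    # grad E(w) = sum_n (y_n - t_n) phi_n = Phi^T (y - t)
    y = sigmoid(Phi @ w)
    return Phi.T @ (y - t)

Phi = np.array([[1.0, 0.2], [1.0, -0.5], [1.0, 1.5], [1.0, -1.0]])
t = np.array([1.0, 0.0, 1.0, 0.0])
w = np.zeros(2)
eta = 0.1                                   # step size (arbitrary choice)
w = w - eta * grad_E(w, Phi, t)             # one gradient-descent step
print(w)
```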
Logistic regression (cont.)

Maximum likelihood can show severe over-fitting for linearly separable datasets
◮ The MLE solution occurs when the hyperplane for σ = 0.5, or w^T φ = 0, separates the two classes and the magnitude of w goes to infinity
◮ The logistic sigmoid becomes infinitely steep (a Heaviside step) in feature space, and every point from each class k gets a posterior probability p(Ck|x) = 1

There is also a continuum of such solutions because any separating hyperplane gives rise to the same posterior probabilities at the training data points
◮ Maximum likelihood does not favour one such solution over another
◮ The solution depends on the optimisation algorithm and initialisation

One possibility would be to introduce a prior over w and find a MAP solution
◮ Add a regularisation term to the error function
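One way to realise this is a sketch assuming a simple quadratic penalty λ/2 ||w||², which corresponds to a zero-mean Gaussian prior over w (the value of λ here is an arbitrary placeholder): the penalty is added to the cross-entropy error and its gradient, so ||w|| can no longer grow without bound on separable data.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def regularised_error_and_grad(w, Phi, t, lam=0.1, eps=1e-12):
    """Cross-entropy error plus lambda/2 * ||w||^2, and its gradient."""
    y = np.clip(sigmoid(Phi @ w), eps, 1.0 - eps)
    E = -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y)) + 0.5 * lam * w @ w
    grad = Phi.T @ (y - t) + lam * w
    return E, grad

# Linearly separable toy data: without the penalty, the MLE would drive ||w|| to infinity
Phi = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
t = np.array([0.0, 0.0, 1.0, 1.0])
print(regularised_error_and_grad(np.array([0.0, 5.0]), Phi, t))
```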
Iterative reweighted least squares

In the case of the linear regression models, the maximum likelihood solution, on the assumption of a Gaussian noise model, leads to a closed-form solution
◮ A consequence of the quadratic dependence of the log likelihood function on w

For logistic regression, due to the nonlinearity of the logistic sigmoid function
◮ There is no longer a closed-form solution
◮ The departure from a quadratic form is not substantial
Specifically, the error function is convex, and hence it has a unique minimum

Furthermore, the error function can be minimised by an efficient iterative technique based on the Newton-Raphson iterative optimisation scheme
◮ A local quadratic approximation to the log likelihood function
Iterative reweighted least squares (cont.)
The Newton-Raphson update, for minimising a function E (w), takes the form
w^{(new)} = w^{(old)} - H^{-1} \nabla E(w)   (5)

H is the Hessian matrix, with elements the second derivatives of E (w) wrt w

We apply the Newton-Raphson method to


1. the sum-of-squares error function (linear regression model)

E_D(w) = \frac{1}{2} \sum_{n=1}^{N} \left\{ t_n - w^T \phi(x_n) \right\}^2

2. the cross-entropy error function (logistic regression model)

E(w) = -\ln p(t|w) = -\sum_{n=1}^{N} \left\{ t_n \ln y_n + (1 - t_n) \ln (1 - y_n) \right\}
Iterative reweighted least squares (cont.)
Gradient and Hessian of the sum-of-squares error function are

\nabla E(w) = \sum_{n=1}^{N} (w^T \phi_n - t_n)\, \phi_n = \Phi^T \Phi\, w - \Phi^T t   (6)

H = \nabla\nabla E(w) = \sum_{n=1}^{N} \phi_n \phi_n^T = \Phi^T \Phi   (7)

where Φ is the N × M design matrix with φn^T in the n-th row
The Newton-Raphson update takes the form

w^{(new)} = w^{(old)} - (\Phi^T \Phi)^{-1} \left( \Phi^T \Phi\, w^{(old)} - \Phi^T t \right) = (\Phi^T \Phi)^{-1} \Phi^T t   (8)

which is the classical/standard least-squares solution

Because the error function is quadratic, the Newton-Raphson formula gives the exact solution in a single step
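A quick numerical check of (8), as a sketch with random toy data: a single Newton-Raphson step from any starting point lands exactly on the least-squares solution (Φ^T Φ)^{-1} Φ^T t.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 20, 3
Phi = rng.normal(size=(N, M))               # design matrix, phi_n^T in row n
t = rng.normal(size=N)                      # real-valued regression targets

w_old = rng.normal(size=M)                  # arbitrary starting point
grad = Phi.T @ Phi @ w_old - Phi.T @ t      # eq. (6)
H = Phi.T @ Phi                             # eq. (7)
w_new = w_old - np.linalg.solve(H, grad)    # one Newton-Raphson step, eq. (5)

w_ls = np.linalg.lstsq(Phi, t, rcond=None)[0]   # standard least-squares solution
print(np.allclose(w_new, w_ls))             # True: exact solution in one step
```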
Iterative reweighted least squares (cont.)

Gradient and Hessian of the cross-entropy error function are

\nabla E(w) = \sum_{n=1}^{N} (y_n - t_n)\, \phi_n = \Phi^T (y - t)   (9)

H = \nabla\nabla E(w) = \sum_{n=1}^{N} y_n (1 - y_n)\, \phi_n \phi_n^T = \Phi^T R\, \Phi   (10)

where R(w) is an N × N diagonal matrix with (n, n) elements

R_{nn} = y_n (1 - y_n)   (11)

The Hessian is no longer constant; it depends on w through the weighting matrix R
Iterative reweighted least squares (cont.)

Because 0 < yn < 1, for an arbitrary non-zero vector u we have that u^T H u > 0
◮ H is positive definite
The error function is convex in w and hence it has a unique minimum

The Newton-Raphson update formula becomes


w^{(new)} = w^{(old)} - (\Phi^T R\, \Phi)^{-1} \Phi^T (y - t)
          = (\Phi^T R\, \Phi)^{-1} \left\{ \Phi^T R\, \Phi\, w^{(old)} - \Phi^T (y - t) \right\}
          = (\Phi^T R\, \Phi)^{-1} \Phi^T R\, z   (12)

where z is an N-vector with elements

z = \Phi\, w^{(old)} - R^{-1} (y - t)   (13)
Iterative reweighted least squares (cont.)

w^{(new)} = (\Phi^T R\, \Phi)^{-1} \Phi^T R\, z \quad\text{with}\quad z = \Phi\, w^{(old)} - R^{-1} (y - t)

The update is the set of normal equations for a weighted least-squares problem

Because the weighting matrix R is not constant but depends on the parameter vector w, we must apply the normal equations iteratively
◮ each time using the new weight vector w to compute a revised weighting matrix R

For this reason, the algorithm is known as iterative reweighted least squares, or IRLS
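Putting equations (9)-(13) together, a minimal IRLS sketch for two-class logistic regression (toy data; the small ridge term is only an added numerical safeguard to keep Φ^T RΦ invertible and is not part of the slides):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls(Phi, t, n_iter=10, ridge=1e-8):
    """Iterative reweighted least squares for two-class logistic regression."""
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)                            # y_n = sigma(w^T phi_n)
        R = y * (1.0 - y)                               # diagonal of R, eq. (11)
        z = Phi @ w - (y - t) / np.maximum(R, ridge)    # effective targets, eq. (13)
        A = Phi.T @ (R[:, None] * Phi) + ridge * np.eye(M)   # Phi^T R Phi (+ small ridge)
        w = np.linalg.solve(A, Phi.T @ (R * z))         # weighted least squares, eq. (12)
    return w

# Toy example (not linearly separable, so the optimum is finite)
Phi = np.array([[1.0, -2.0], [1.0, -0.5], [1.0, 0.5], [1.0, 2.0]])
t = np.array([0.0, 1.0, 0.0, 1.0])
print(irls(Phi, t))
```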
Iterative reweighted least squares (cont.)
As in weighted least-squares problems, the elements of the diagonal weighting
matrix R can be interpreted as variances because the mean and variance of t
(t² = t, for t ∈ {0, 1}) in the logistic regression model are

E[t] = \sigma(x) = y   (14)

\operatorname{var}[t] = E[t^2] - E[t]^2 = \sigma(x) - \sigma(x)^2 = y(1 - y)   (15)
We can interpret IRLS as the solution to a linearised problem in the space of the variable a = w^T φ

The quantity zn (the n-th element of z) can then be given an interpretation as an effective target value in this space, obtained by making a local linear approximation to the logistic sigmoid function around the current operating point w^{(old)}

a_n(w) \simeq a_n(w^{(old)}) + \left. \frac{d a_n}{d y_n} \right|_{w^{(old)}} (t_n - y_n) = \phi_n^T w^{(old)} - \frac{y_n - t_n}{y_n (1 - y_n)} = z_n   (16)
Multiclass logistic regression

In the discussion of generative models for multiclass classification, we have seen that for a large class of distributions, the posterior probabilities are given by a softmax transformation of linear functions of the feature variables

p(C_k|\phi) = y_k(\phi) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}   (17)

where the activations ak are

a_k = w_k^T \phi   (18)

We used maximum likelihood to determine separately the class-conditional densities and the class priors, and then found the corresponding posterior probabilities using Bayes’ theorem, implicitly determining the parameters {wk}
Multiclass logistic regression (cont.)
We can use maximum likelihood to determine the parameters {wk} of this model directly
To do this, we need the derivatives of yk with respect to all of the activations aj

\frac{\partial y_k}{\partial a_j} = y_k (I_{kj} - y_j)   (19)
where Ikj are the elements of the identity matrix

Next we need to write the likelihood function using the 1-of-K coding scheme
◮ The target vector t n for feature vector φn belonging to class Ck
is a binary vector with all elements zero except for element k
The likelihood is then given by

p(T|w_1, \ldots, w_K) = \prod_{n=1}^{N} \prod_{k=1}^{K} p(C_k|\phi_n)^{t_{nk}} = \prod_{n=1}^{N} \prod_{k=1}^{K} y_{nk}^{t_{nk}}   (20)

where ynk = yk(φn), and T is an N × K matrix of target variables with elements tnk
Multiclass logistic regression (cont.)

Taking the negative logarithm gives

E(w_1, \ldots, w_K) = -\ln p(T|w_1, \ldots, w_K) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \ln y_{nk}   (21)

This is the cross-entropy error function for the multiclass classification problem

We now take the gradient of the error function with respect to one parameter vector wj

\nabla_{w_j} E(w_1, \ldots, w_K) = \sum_{n=1}^{N} (y_{nj} - t_{nj})\, \phi_n   (22)

We used the result for the derivatives of the softmax function, \partial y_k / \partial a_j = y_k (I_{kj} - y_j)
We also used \sum_k t_{nk} = 1
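A sketch (toy data with 1-of-K targets; all names and values are illustrative) of evaluating the softmax posteriors ynk and the gradient (22), stacked for all K parameter vectors at once:

```python
import numpy as np

def softmax(A):
    A = A - A.max(axis=1, keepdims=True)    # shift for numerical stability
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def multiclass_grad(W, Phi, T):
    """Gradient of the multiclass cross-entropy wrt each w_j, returned as a (K, M) array.

    Row j is eq. (22): sum_n (y_nj - t_nj) phi_n.
    """
    Y = softmax(Phi @ W.T)                  # Y[n, k] = y_nk, with a_nk = w_k^T phi_n
    return (Y - T).T @ Phi                  # shape (K, M)

# Toy example: N = 4 points, M = 2 features, K = 3 classes
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, -1.0], [1.0, 2.0]])
T = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]], dtype=float)   # 1-of-K coding
W = np.zeros((3, 2))                        # one row per class: w_k^T
print(multiclass_grad(W, Phi, T))
```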
Multiclass logistic regression (cont.)

This is the same form for the gradient as found for the sum-of-squares error function with the linear model and for the cross-entropy error with the logistic regression model
◮ The product of the error (ynj − tnj) times the basis function φn

The derivative of the log likelihood function for a linear regression model with respect to the parameter vector w for a data point n took the same form
◮ The error (yn − tn) times the feature vector φn

Similarly, for the combination of logistic sigmoid activation function and cross-entropy error function, and for the softmax activation function with the multiclass cross-entropy error function, we again obtain this same simple form
Multiclass logistic regression (cont.)

To find a batch algorithm, we can use the Newton-Raphson update to obtain the corresponding IRLS algorithm for the multiclass problem

This requires evaluation of the Hessian matrix, which comprises blocks of size M × M in which block (j, k) is given by

\nabla_{w_k} \nabla_{w_j} E(w_1, \ldots, w_K) = \sum_{n=1}^{N} y_{nk} (I_{kj} - y_{nj})\, \phi_n \phi_n^T   (23)

As in the two-class case, the Hessian matrix for the multiclass logistic regression model is positive definite and the error function again has a unique minimum
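As a sketch of how this block Hessian could be assembled according to eq. (23) (reusing the softmax and toy data shapes from the previous sketch; the eigenvalue check only confirms positive semi-definiteness numerically):

```python
import numpy as np

def softmax(A):
    A = A - A.max(axis=1, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def multiclass_hessian(W, Phi):
    """(K*M) x (K*M) Hessian with M x M block (j, k) = sum_n y_nk (I_kj - y_nj) phi_n phi_n^T."""
    N, M = Phi.shape
    K = W.shape[0]
    Y = softmax(Phi @ W.T)                              # Y[n, k] = y_nk
    H = np.zeros((K * M, K * M))
    for j in range(K):
        for k in range(K):
            coeff = Y[:, k] * ((j == k) - Y[:, j])      # y_nk (I_kj - y_nj) for each n
            block = (Phi * coeff[:, None]).T @ Phi      # sum_n coeff_n phi_n phi_n^T
            H[j * M:(j + 1) * M, k * M:(k + 1) * M] = block
    return H

Phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, -1.0], [1.0, 2.0]])
W = np.zeros((3, 2))
H = multiclass_hessian(W, Phi)
print(np.linalg.eigvalsh(H).min() >= -1e-10)            # no negative eigenvalues
```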
