Discriminant Analysis: 5.1 The Maximum Likelihood (ML) Rule

1. Discriminant analysis is used to allocate individuals to groups based on measurements, with the goal of minimizing mistakes. The maximum likelihood rule allocates an individual to the group with the highest likelihood.
2. For two normal populations, the maximum likelihood rule allocates using a linear discriminant function that compares the Mahalanobis distances between the individual's measurements and the two population means.
3. The misclassification probabilities can be estimated from the sample Mahalanobis distance between the population means. The maximum likelihood rule minimizes the total probability of misclassification.


5. Discriminant Analysis

Given k populations (groups) $\Pi_1, \ldots, \Pi_k$, we suppose that an individual from $\Pi_j$ has p.d.f. $f_j(x)$ for a set of p measurements x.
The purpose of discriminant analysis is to allocate an individual to one of the groups $\Pi_j$ on the basis of x, making as few "mistakes" as possible. For example, a patient presents at a doctor's surgery with a set of symptoms x. The symptoms suggest a number of possible disease groups $\Pi_j$ to which the patient might belong. What is the most likely diagnosis?
The aim initially is to find a partition of $\mathbb{R}^p$ into disjoint regions $R_1, \ldots, R_k$ together with a decision rule

$$x \in R_j \implies \text{allocate } x \text{ to } \Pi_j.$$

The decision rule will be more accurate if "$\Pi_j$ has most of its probability concentrated in $R_j$" for each j.
5.1 The maximum likelihood (ML) rule
Allocate x to the population $\Pi_j$ that gives the largest likelihood to x. Choose j by

$$L_j(x) = \max_{1 \le i \le k} L_i(x)$$

(break ties arbitrarily).
Result 1
If $\Pi_i$ is the multivariate normal (MVN) population $N_p(\mu_i, \Sigma)$ for $i = 1, \ldots, k$, the ML rule allocates x to the population $\Pi_i$ that minimizes the Mahalanobis distance between x and $\mu_i$.
Proof
$$L_i(x) = |2\pi\Sigma|^{-1/2} \exp\left\{ -\tfrac{1}{2}(x - \mu_i)^T \Sigma^{-1} (x - \mu_i) \right\}$$

so the likelihood is maximized when the quadratic form $(x - \mu_i)^T \Sigma^{-1} (x - \mu_i)$, the squared Mahalanobis distance between x and $\mu_i$, is minimized.
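As a minimal numerical sketch of Result 1, the snippet below allocates an observation to the group whose mean is closest in Mahalanobis distance, assuming a common covariance matrix; the parameter values and observation are made up purely for illustration.

```python
import numpy as np

def ml_allocate(x, means, Sigma):
    """ML rule for MVN groups with common covariance: choose the group i
    minimizing the Mahalanobis distance (x - mu_i)^T Sigma^{-1} (x - mu_i)."""
    Sigma_inv = np.linalg.inv(Sigma)
    d2 = [float((x - mu) @ Sigma_inv @ (x - mu)) for mu in means]
    return int(np.argmin(d2))  # index of the allocated group

# Illustrative (made-up) parameters and observation
means = [np.array([5.0, 3.4]), np.array([6.0, 2.8])]
Sigma = np.array([[0.19, 0.09],
                  [0.09, 0.12]])
print(ml_allocate(np.array([5.2, 3.3]), means, Sigma))  # 0, i.e. the first group
```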
Result 2
When $k = 2$ the ML rule allocates x to $\Pi_1$ if

$$d^T (x - \mu) > 0 \qquad (5.1)$$

where $d = \Sigma^{-1}(\mu_1 - \mu_2)$ and $\mu = \tfrac{1}{2}(\mu_1 + \mu_2)$, and to $\Pi_2$ otherwise.
Proof
For the two-group case, the ML rule is to allocate x to $\Pi_1$ if

$$(x - \mu_1)^T \Sigma^{-1} (x - \mu_1) < (x - \mu_2)^T \Sigma^{-1} (x - \mu_2).$$

Expanding both sides, the quadratic term $x^T \Sigma^{-1} x$ cancels and the inequality reduces to

$$2 d^T x > \mu_1^T \Sigma^{-1} \mu_1 - \mu_2^T \Sigma^{-1} \mu_2 = (\mu_1 - \mu_2)^T \Sigma^{-1} (\mu_1 + \mu_2) = d^T (\mu_1 + \mu_2).$$

Hence the result. The function

$$h(x) = (\mu_1 - \mu_2)^T \Sigma^{-1} \left[ x - \tfrac{1}{2}(\mu_1 + \mu_2) \right] \qquad (5.2)$$

is known as the discriminant function (DF). In this case the DF is linear in x.
5.2 Sample ML rule
In practice $\mu_1, \mu_2, \Sigma$ are estimated by, respectively, $\bar{x}_1, \bar{x}_2, S_P$, where $S_P$ is the pooled (unbiased) estimator of the covariance matrix.
Example
The eminent statistician R.A. Fisher took measurements on samples of size 50 from each of 3 species of iris. Two of the variables, $x_1$ = sepal length and $x_2$ = sepal width, gave the following data on species I and II:

$$\bar{x}_1 = \begin{pmatrix} 5.0 \\ 3.4 \end{pmatrix}, \quad \bar{x}_2 = \begin{pmatrix} 6.0 \\ 2.8 \end{pmatrix}, \quad S_1 = \begin{pmatrix} 0.12 & 0.10 \\ 0.10 & 0.14 \end{pmatrix}, \quad S_2 = \begin{pmatrix} 0.26 & 0.08 \\ 0.08 & 0.10 \end{pmatrix}$$

(The data have been rounded for clarity.)

$$S_p = \frac{50 S_1 + 50 S_2}{98} = \begin{pmatrix} 0.19 & 0.09 \\ 0.09 & 0.12 \end{pmatrix}$$

Hence

$$d = S_p^{-1} (\bar{x}_1 - \bar{x}_2) = \begin{pmatrix} 0.19 & 0.09 \\ 0.09 & 0.12 \end{pmatrix}^{-1} \begin{pmatrix} -1.0 \\ 0.6 \end{pmatrix} = \begin{pmatrix} -11.4 \\ 14.1 \end{pmatrix}$$

$$\bar{x} = \tfrac{1}{2}(\bar{x}_1 + \bar{x}_2) = \begin{pmatrix} 5.5 \\ 3.1 \end{pmatrix}$$

giving the rule: allocate x to $\Pi_1$ if

$$-11.4 (x_1 - 5.5) + 14.1 (x_2 - 3.1) > 0,$$

i.e.

$$-11.4 x_1 + 14.1 x_2 + 19.0 > 0.$$
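A short numpy sketch of this calculation; because the summary statistics above were rounded, recomputing d from them gives values close to, but not exactly, the $(-11.4, 14.1)$ quoted in the notes.

```python
import numpy as np

xbar1 = np.array([5.0, 3.4])   # species I: sepal length, sepal width
xbar2 = np.array([6.0, 2.8])   # species II
S1 = np.array([[0.12, 0.10], [0.10, 0.14]])
S2 = np.array([[0.26, 0.08], [0.08, 0.10]])

Sp = (50 * S1 + 50 * S2) / 98              # pooled covariance matrix
d = np.linalg.solve(Sp, xbar1 - xbar2)     # d = Sp^{-1} (xbar1 - xbar2)
mid = 0.5 * (xbar1 + xbar2)                # midpoint of the two sample means

def allocate(x):
    """Sample ML rule: species I if d^T (x - mid) > 0, otherwise species II."""
    return "I" if d @ (x - mid) > 0 else "II"

print(np.round(d, 1), mid)                 # approx. (-11.6, 13.6) and (5.5, 3.1)
print(allocate(np.array([5.1, 3.5])))      # "I": short, wide sepals look like species I
```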
5.3 Misclassification probabilities
The misclassification probabilities $p_{ij}$, defined as

$$p_{ij} = \Pr[\text{allocate to } \Pi_i \text{ when in fact from } \Pi_j],$$

form a $k \times k$ matrix, of which the diagonal elements $p_{ii}$ are a measure of the classifier's accuracy. For the case $k = 2$,

$$p_{12} = \Pr[h(x) > 0 \mid \Pi_2].$$
Since $h(x) = d^T(x - \mu)$ is a linear compound of x, it has a (univariate) normal distribution. Given that $x \in \Pi_2$:

$$E[h(x)] = d^T \left[ \mu_2 - \tfrac{1}{2}(\mu_1 + \mu_2) \right] = \tfrac{1}{2} d^T (\mu_2 - \mu_1) = -\tfrac{1}{2}\Delta^2$$

where $\Delta^2 = (\mu_2 - \mu_1)^T \Sigma^{-1} (\mu_2 - \mu_1)$ is the squared Mahalanobis distance between $\mu_2$ and $\mu_1$.
The variance of h(x) is

$$d^T \Sigma d = (\mu_1 - \mu_2)^T \Sigma^{-1} \Sigma \Sigma^{-1} (\mu_1 - \mu_2) = (\mu_2 - \mu_1)^T \Sigma^{-1} (\mu_2 - \mu_1) = \Delta^2.$$
Hence, standardizing,

$$p_{12} = \Pr[h(x) > 0] = \Pr\left[ \frac{h(x) + \tfrac{1}{2}\Delta^2}{\Delta} > \frac{\tfrac{1}{2}\Delta^2}{\Delta} \right] = \Pr\left[ Z > \tfrac{1}{2}\Delta \right] = 1 - \Phi\left( \tfrac{1}{2}\Delta \right) = \Phi\left( -\tfrac{1}{2}\Delta \right). \qquad (5.3)$$
By symmetry, if we write $\Delta = \Delta_{12}$, the misclassification rate $p_{21} = \Phi(-\tfrac{1}{2}\Delta_{21})$, and since $\Delta_{12} = \Delta_{21}$ we have $p_{12} = p_{21}$.
Example (contd.)
We can estimate the misclassification probability from the sample Mahalanobis distance between $\bar{x}_2$ and $\bar{x}_1$:

$$D^2 = (\bar{x}_2 - \bar{x}_1)^T S_p^{-1} (\bar{x}_2 - \bar{x}_1) = \begin{pmatrix} 1.0 & -0.6 \end{pmatrix} \begin{pmatrix} 11.4 \\ -14.1 \end{pmatrix} = 19.9$$

$$\Phi\left( -\tfrac{1}{2} D \right) = \Phi(-2.23) = 0.013$$

The estimated misclassification rate is therefore 1.3%.
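A minimal scipy check of this estimate, taking $D^2 = 19.9$ from the calculation above:

```python
import numpy as np
from scipy.stats import norm

D2 = 19.9                          # sample Mahalanobis distance (squared)
p12 = norm.cdf(-0.5 * np.sqrt(D2)) # estimated misclassification probability
print(round(p12, 3))               # 0.013, i.e. about 1.3%
```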
5.4 Optimality of ML rule
We can show that the ML rule minimizes the probability of misclassication if an individual is a
priori equally likely to belong to any population.
Let M be the event of a misclassification and consider a decision rule $\phi(x)$ represented as follows:

$$\phi_i(x) = \begin{cases} 1 & \text{if } x \text{ is assigned to } \Pi_i \\ 0 & \text{otherwise} \end{cases}$$

i.e. $\phi = (\phi_1, \phi_2, \ldots, \phi_k)$ is a 0-1 vector everywhere in the space of x, and $\phi_i(x) = 1$ for $x \in R_i$. Recall that the classifier assigns x to $\Pi_i$ if $x \in R_i$.
The ML rule is represented as

$$\phi_i^{ML}(x) = \begin{cases} 1 & \text{if } f_i(x) \ge f_j(x) \text{ for all } j \ne i \\ 0 & \text{otherwise} \end{cases}$$

Ties can be ignored, arbitrarily decided, or randomized by allowing $\phi_i = 1/t$ if t populations (likelihoods) are tied.
The misclassification probabilities are

$$p_{ij} = \Pr[\text{allocate to } \Pi_i \text{ when in fact from } \Pi_j] = \Pr(\phi_i = 1 \mid x \in \Pi_j) = \int \phi_i(x)\, f_j(x)\, dx = \int_{R_i} f_j(x)\, dx,$$

where x is equally likely to come from each $\Pi_j$ $(j = 1, \ldots, k)$.
The total probability of misclassification is

$$\Pr(M) = \sum_{i=1}^{k} \Pr(M \mid x \in \Pi_i) \Pr(x \in \Pi_i) = \frac{1}{k} \sum_{i=1}^{k} (1 - p_{ii}) = 1 - \frac{1}{k} \sum_{i=1}^{k} p_{ii}.$$
Clearly we need to maximize the sum of the probabilities of correct classification, which is the trace of the misclassification matrix:

$$\sum_{i=1}^{k} p_{ii} = \sum_{i=1}^{k} \int \phi_i(x)\, f_i(x)\, dx = \int \left[ \sum_{i=1}^{k} \phi_i(x)\, f_i(x) \right] dx \le \int \max_i f_i(x)\, dx.$$

Since $\sum_i \phi_i(x) = 1$ at each x, the integrand is a weighted average of the $f_i(x)$, bounded above by $\max_i f_i(x)$, and the ML rule attains this bound by putting all the weight on a maximizing density. This shows that the trace is maximized for the ML rule and therefore $\Pr(M)$ is minimized.
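This can also be checked empirically. The rough Monte Carlo sketch below reuses the rounded values from the earlier example as if they were the true population parameters, draws from the two groups with equal priors, and compares the ML rule's empirical error rate with the theoretical minimum $\Phi(-\Delta/2)$ from (5.3).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu1, mu2 = np.array([5.0, 3.4]), np.array([6.0, 2.8])
Sigma = np.array([[0.19, 0.09], [0.09, 0.12]])

d = np.linalg.solve(Sigma, mu1 - mu2)          # direction of the linear DF
mid = 0.5 * (mu1 + mu2)
Delta = np.sqrt((mu1 - mu2) @ d)               # Mahalanobis distance between the means

n = 100_000
labels = rng.integers(1, 3, size=n)            # equal priors on groups 1 and 2
means = np.where((labels == 1)[:, None], mu1, mu2)
x = rng.multivariate_normal(np.zeros(2), Sigma, size=n) + means
pred = np.where((x - mid) @ d > 0, 1, 2)       # ML rule (5.1)

print((pred != labels).mean())                 # empirical Pr(M), about 0.012
print(norm.cdf(-Delta / 2))                    # theoretical minimum, about 0.012
```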
5.5 Bayes Rule
The Bayes rule generalizes the ML rule by introducing a set of prior probabilities $\pi_i$, assumed known, where

$$\pi_i = \Pr(\text{individual belongs to } \Pi_i).$$
The misclassification probability becomes

$$\Pr(M) = \sum_{i=1}^{k} \Pr(M \mid x \in \Pi_i) \Pr(x \in \Pi_i) = \sum_{i=1}^{k} (1 - p_{ii})\, \pi_i = 1 - \sum_{i=1}^{k} p_{ii}\, \pi_i.$$
The previous analysis carries across as follows:

$$\sum_{i=1}^{k} p_{ii}\, \pi_i = \sum_{i=1}^{k} \pi_i \int \phi_i(x)\, f_i(x)\, dx = \int \left[ \sum_{i=1}^{k} \phi_i(x)\, \pi_i f_i(x) \right] dx \le \int \max_i \pi_i f_i(x)\, dx.$$

The Bayes rule assigns x to the population $\Pi_j$ that maximizes the posterior probability $p(j \mid x) \propto \pi_j f_j(x)$. This rule also minimizes the probability of misclassification $\Pr(M)$.
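A minimal sketch of the Bayes allocation rule, assigning x to the group with the largest $\pi_i f_i(x)$; the priors, means, covariance and test point below are illustrative assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def bayes_allocate(x, priors, means, Sigma):
    """Assign x to the group i maximizing the posterior, i.e. pi_i * f_i(x)."""
    scores = [p * multivariate_normal.pdf(x, mean=m, cov=Sigma)
              for p, m in zip(priors, means)]
    return int(np.argmax(scores))      # 0-based group index

means = [np.array([5.0, 3.4]), np.array([6.0, 2.8])]
Sigma = np.array([[0.19, 0.09], [0.09, 0.12]])

# With priors (0.8, 0.2), a point just on group 2's side of the ML boundary
# is pulled back to group 1 by the larger prior.
print(bayes_allocate(np.array([5.6, 3.1]), [0.8, 0.2], means, Sigma))  # 0
```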
5.6 Minimizing Expected Loss
We can also introduce unequal costs of misclassification. Let $c_{ij} = c(i \mid j)$ be the cost of assigning an individual from $\Pi_j$ to $\Pi_i$. Generally we suppose $c_{ii} = 0$.
Definition

The expected cost of misclassification is known as the Bayes risk:

$$R_i(x) = \sum_{j=1}^{k} c(i \mid j)\, p(j \mid x)$$

is the risk, or expected loss, conditional on x of taking action i, where

$$p(j \mid x) = \frac{\pi_j f_j(x)}{f(x)} \quad \text{and} \quad f(x) = \sum_{j=1}^{k} \pi_j f_j(x).$$
Definition

The overall risk of a rule defined by $\phi$ is the expected loss at x:

$$R(x) = \sum_{i=1}^{k} \phi_i(x)\, R_i(x).$$
We can show that it is optimal to take the action i that minimizes the Bayes risk:

$$E[R(x)] = \int \left[ \sum_{i=1}^{k} \phi_i(x)\, R_i(x) \right] f(x)\, dx \ge \int \left[ \min_i R_i(x) \right] f(x)\, dx.$$

Hence the overall expected loss is minimized by the rule that, at each x, takes the action i minimizing the Bayes risk $R_i(x)$.
Example
Given two populations $\Pi_1, \Pi_2$, suppose $c(2 \mid 1) = 5$ and $c(1 \mid 2) = 10$. Suppose that 20% of the population belong to $\Pi_2$; then $\pi_1 = 0.8$ and $\pi_2 = 0.2$.

Given a new individual x, the Bayes risk of assigning x to $\Pi_1$ is (dropping the common factor $1/f(x)$)

$$R_1(x) = \sum_{j=1}^{k} c(1 \mid j)\, p(j \mid x) \propto 10 \times 0.2 \times f_2(x) = 2 f_2(x).$$

The Bayes risk of assigning x to $\Pi_2$ is

$$R_2(x) = \sum_{j=1}^{k} c(2 \mid j)\, p(j \mid x) \propto 5 \times 0.8 \times f_1(x) = 4 f_1(x).$$

Suppose that a new individual $x_0$ gives $f_1(x_0) = 0.3$ and $f_2(x_0) = 0.4$; then

$$2 \times 0.4 = 0.8 < 4 \times 0.3 = 1.2,$$

so we assign $x_0$ to $\Pi_1$.
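The arithmetic above can be reproduced directly; a minimal sketch using the costs, priors and density values from the example (again dropping the common factor $1/f(x_0)$):

```python
# Costs c(i|j), priors and densities at x0, as given in the example
c_1_given_2, c_2_given_1 = 10, 5   # cost of assigning to 1 when truly 2, and vice versa
pi1, pi2 = 0.8, 0.2
f1_x0, f2_x0 = 0.3, 0.4

R1 = c_1_given_2 * pi2 * f2_x0     # risk of assigning x0 to group 1 (up to 1/f(x0))
R2 = c_2_given_1 * pi1 * f1_x0     # risk of assigning x0 to group 2 (up to 1/f(x0))
print(round(R1, 2), round(R2, 2))  # 0.8 1.2
print("assign to group", 1 if R1 < R2 else 2)  # group 1, the smaller risk
```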