Bayesian Decision Theory
Decision Theory
Henrik I Christensen
Georgia Tech.
Bayesian Decision Theory
• Design classifiers to recommend decisions that
minimize some total expected "risk".
– The simplest risk is the classification error (i.e., costs
are equal).
– Typically, the risk includes the cost associated with
different decisions.
Terminology
• State of nature ω (random variable):
– e.g., ω1 for sea bass, ω2 for salmon
P(error) = P(ω1) if we decide ω2,  P(ω2) if we decide ω1
P(error/x) = P(ω1/x) if we decide ω2,  P(ω2/x) if we decide ω1
or P(error/x) = min[P(ω1/x), P(ω2/x)]
P(Ci/x) = p(x/Ci) P(Ci) / p(x)
• We need to estimate p(x/C1), p(x/C2), P(C1), P(C2)
Example (cont’d)
• Collect data
– Ask drivers how much their car cost and measure the car's height.
• Determine prior probabilities P(C1), P(C2)
– e.g., 1209 samples: #C1=221 #C2=988
P(C1) = 221/1209 = 0.183
P(C2) = 988/1209 = 0.817
Example (cont’d)
• Determine the class-conditional probabilities (likelihoods) p(x/Ci)
– Discretize car height into bins and use the normalized histogram
Example (cont’d)
• Calculate the posterior probability for each bin:
P(C1/x=1.0) = p(x=1.0/C1) P(C1) / [p(x=1.0/C1) P(C1) + p(x=1.0/C2) P(C2)]
            = (0.2081 × 0.183) / (0.2081 × 0.183 + 0.0597 × 0.817) = 0.438
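A minimal Python sketch of this calculation, reusing the counts and the two likelihood values quoted above for the x = 1.0 bin (in practice p(x/Ci) would be read off the normalized histograms):

import numpy as np

# Class priors from the sample counts on the slide.
n_C1, n_C2 = 221, 988
N = n_C1 + n_C2
P_C1, P_C2 = n_C1 / N, n_C2 / N           # ~0.183, ~0.817

# Likelihood values for the bin around x = 1.0 (taken from the slide).
p_x_C1 = 0.2081                            # p(x=1.0 / C1)
p_x_C2 = 0.0597                            # p(x=1.0 / C2)

# Bayes rule: P(C1 / x) = p(x/C1) P(C1) / p(x)
evidence = p_x_C1 * P_C1 + p_x_C2 * P_C2
P_C1_x = p_x_C1 * P_C1 / evidence
print(round(P_C1_x, 3))                    # -> 0.438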
A More General Theory
• Use more than one feature.
• Allow more than two categories.
• Allow actions other than classifying the input to
one of the possible categories (e.g., rejection).
• Employ a more general error function (i.e., “risk”
function) by associating a “cost” (“loss” function)
with each error (i.e., wrong action).
Terminology
• Features form a vector x ∈ R^d
• A finite set of c categories ω1, ω2, …, ωc
• Bayes rule (i.e., using vector notation):
P(ωj/x) = p(x/ωj) P(ωj) / p(x)
where p(x) = Σj p(x/ωj) P(ωj)  (sum over j = 1, …, c)
Overall Risk
• The overall risk is R = ∫ R(α(x)/x) p(x) dx
• The optimum decision rule is the Bayes rule
Overall Risk (cont’d)
• The Bayes decision rule minimizes R by:
(i) Computing R(αi /x) for every αi given an x
(ii) Choosing the action αi with the minimum R(αi /x)
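Here R(αi/x) = Σj λ(αi/ωj) P(ωj/x) is the conditional risk. A minimal Python sketch of steps (i) and (ii), with a made-up 2×2 loss matrix and made-up posteriors just for illustration:

import numpy as np

# lam[i, j] = cost of taking action alpha_i when the true state is omega_j
lam = np.array([[0.0, 2.0],    # action alpha_1: decide omega_1
                [1.0, 0.0]])   # action alpha_2: decide omega_2
posteriors = np.array([0.3, 0.7])          # P(omega_j / x), assumed values

# R(alpha_i / x) = sum_j lam[i, j] * P(omega_j / x)
cond_risk = lam @ posteriors
best_action = np.argmin(cond_risk)         # choose the action with minimum risk
print(cond_risk, best_action)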
• For the two-category, minimum-error-rate case this reduces to a likelihood-ratio test:
decide ω1 if p(x/ω1) / p(x/ω2) > θa, where θa = P(ω2) / P(ω1)
Discriminants for Bayes Classifier
• Assign a discriminant function gi(x) to each category and decide ωi if gi(x) is maximum:
gi(x) = −R(αi/x)   (general case)
gi(x) = P(ωi/x)    (minimum-error-rate case)
Discriminants for Bayes Classifier
(cont’d)
• Is the choice of gi unique?
– Replacing gi(x) with f(gi(x)), where f() is monotonically
increasing, does not change the classification results.
gi(x) = P(ωi/x) = p(x/ωi) P(ωi) / p(x)
gi(x) = p(x/ωi) P(ωi)
gi(x) = ln p(x/ωi) + ln P(ωi)   ← we'll use this form extensively!
Case of two categories
• More common to use a single discriminant function
(dichotomizer) instead of two:
• Examples:
g(x) = P(ω1/x) − P(ω2/x)
g(x) = ln [p(x/ω1) / p(x/ω2)] + ln [P(ω1) / P(ω2)]
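A small sketch of such a dichotomizer for two hypothetical univariate Gaussian classes (the means, variances and priors are assumptions chosen for illustration; decide ω1 when g(x) > 0):

import numpy as np
from scipy.stats import norm

mu1, sigma1, P1 = 0.0, 1.0, 0.5
mu2, sigma2, P2 = 2.0, 1.0, 0.5

def g(x):
    # g(x) = ln p(x/w1)/p(x/w2) + ln P(w1)/P(w2)
    return (norm.logpdf(x, mu1, sigma1) - norm.logpdf(x, mu2, sigma2)
            + np.log(P1 / P2))

print(g(0.5) > 0)   # True: 0.5 is closer to mu1, so decide omega_1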
Decision Regions and Boundaries
• Decision rules divide the feature space in decision regions
R1, R2, …, Rc, separated by decision boundaries.
• The decision boundary is defined by g1(x) = g2(x)
Discriminant Function for
Multivariate Gaussian Density
p(x/ωi) ~ N(µi, Σi)
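A sketch of evaluating the log-form discriminant gi(x) = ln p(x/ωi) + ln P(ωi) when p(x/ωi) ~ N(µi, Σi); the means, covariances and priors below are made up for illustration:

import numpy as np
from scipy.stats import multivariate_normal

means  = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs   = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]
priors = [0.6, 0.4]

def discriminants(x):
    # g_i(x) = ln p(x/w_i) + ln P(w_i)
    return [multivariate_normal.logpdf(x, mean=m, cov=S) + np.log(P)
            for m, S, P in zip(means, covs, priors)]

x = np.array([1.0, 1.5])
print(np.argmax(discriminants(x)))   # index of the winning category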
Multivariate Gaussian Density:
Case I
• Σi = σ²I
– Features are statistically independent
– Each feature has the same variance
– The discriminant is linear: gi(x) = wiᵀx + wi0, where
wi = µi / σ²   and   wi0 = −µiᵀµi / (2σ²) + ln P(ωi)
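A 1-D sketch that builds these linear discriminants and locates the boundary point x0 numerically, using made-up means and unequal priors; with P(ω1) > P(ω2) the boundary lands past the midpoint, away from the more likely class:

import numpy as np

sigma2 = 1.0
mu = np.array([0.0, 3.0])            # class means (assumed)
priors = np.array([0.8, 0.2])        # unequal priors (assumed)

def g(i, x):
    # g_i(x) = (mu_i / sigma^2) x - mu_i^2 / (2 sigma^2) + ln P(w_i)
    return mu[i] / sigma2 * x - mu[i] ** 2 / (2 * sigma2) + np.log(priors[i])

xs = np.linspace(0, 3, 3001)
x0 = xs[np.argmin(np.abs(g(0, xs) - g(1, xs)))]
print(x0)   # ~1.96 > 1.5: the boundary shifts away from the more likely class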
Multivariate Gaussian Density:
Case I (cont’d)
• Properties of decision boundary:
– It passes through x0
– It is orthogonal to the line linking the means.
– What happens when P(ωi) = P(ωj)? Then x0 is the midpoint of the two means.
– If P(ωi) ≠ P(ωj), then x0 shifts away from the most likely category.
– If σ is very small, the position of the boundary is insensitive to P(ωi)
and P(ωj)
Multivariate Gaussian Density:
Case I (cont'd)
If P(ωi) ≠ P(ωj), then x0 shifts away from the most likely category.
Multivariate Gaussian Density:
Case I (cont’d)
• Minimum distance classifier
– When P(ωi) are equal, then:
gi(x) = −||x − µi||²   (choose the class that maximizes gi, i.e., the nearest mean)
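A minimal sketch of the minimum distance classifier with three hypothetical class means:

import numpy as np

means = np.array([[0.0, 0.0],
                  [3.0, 3.0],
                  [0.0, 4.0]])      # assumed class means

def classify(x):
    # g_i(x) = -||x - mu_i||^2 ; picking the max picks the nearest mean
    g = -np.sum((means - x) ** 2, axis=1)
    return np.argmax(g)

print(classify(np.array([0.5, 3.5])))   # -> 2 (nearest to [0, 4])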
Multivariate Gaussian Density:
Case II
• Σi = Σ (all classes share the same covariance matrix)
Multivariate Gaussian Density:
Case II (cont’d)
• Properties of hyperplane (decision boundary):
– It passes through x0
– It is not orthogonal to the line linking the means.
– What happens when P(ωi)= P(ωj) ?
– If P(ωi) ≠ P(ωj), then x0 shifts away from the most likely category.
Multivariate Gaussian Density:
Case II (cont’d)
If P(ωi) ≠ P(ωj), then x0 shifts away from the most likely category.
Multivariate Gaussian Density:
Case II (cont’d)
• Mahalanobis distance classifier
– When the P(ωi) are equal: gi(x) = −(x − µi)ᵀ Σ⁻¹ (x − µi)
(choose the class that maximizes gi, i.e., the nearest mean in Mahalanobis distance)
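A corresponding sketch of the Mahalanobis distance classifier, with an assumed shared covariance matrix and two made-up means:

import numpy as np

Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])      # assumed shared covariance
Sigma_inv = np.linalg.inv(Sigma)
means = np.array([[0.0, 0.0], [3.0, 3.0]])

def classify(x):
    # g_i(x) = -(x - mu_i)^T Sigma^{-1} (x - mu_i); maximize g_i
    diffs = x - means
    g = -np.einsum('id,de,ie->i', diffs, Sigma_inv, diffs)
    return np.argmax(g)

print(classify(np.array([1.0, 1.0])))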
Multivariate Gaussian Density:
Case III
• Σi = arbitrary
– The decision boundaries are hyperquadrics.
– Even when P(ω1) = P(ω2), the boundary does not pass through the midpoint of µ1 and µ2.
Multivariate Gaussian Density:
Case III (cont’d)
Non-linear decision boundaries
Multivariate Gaussian Density:
Case III (cont’d)
• More examples
Error Bounds
• Exact error calculations could be difficult – easier to
estimate error bounds!
P(error) = ∫ min[P(ω1/x), P(ω2/x)] p(x) dx
Error Bounds (cont’d)
• If the class-conditional distributions are Gaussian, then (Chernoff bound, 0 ≤ β ≤ 1):
P(error) ≤ P(ω1)^β P(ω2)^(1−β) e^(−k(β))
where:
k(β) = [β(1−β)/2] (µ2 − µ1)ᵀ [βΣ1 + (1−β)Σ2]⁻¹ (µ2 − µ1) + (1/2) ln( |βΣ1 + (1−β)Σ2| / (|Σ1|^β |Σ2|^(1−β)) )
Error Bounds (cont’d)
• Bhattacharyya bound (β = 1/2): P(error) ≤ √(P(ω1)P(ω2)) e^(−k(1/2))
– e.g., k(0.5) = 4.06 gives P(error) ≤ 0.0087
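A sketch of computing k(1/2) and the Bhattacharyya bound for two hypothetical Gaussian classes (the means, covariances and priors below are assumptions, not the ones behind the 0.0087 figure):

import numpy as np

mu1, mu2 = np.array([0.0, 0.0]), np.array([3.0, 3.0])
S1 = np.eye(2)
S2 = np.array([[2.0, 0.5], [0.5, 1.0]])
P1 = P2 = 0.5

# k(1/2) = (1/8)(mu2-mu1)^T [(S1+S2)/2]^{-1} (mu2-mu1)
#          + (1/2) ln( |(S1+S2)/2| / sqrt(|S1| |S2|) )
S = 0.5 * (S1 + S2)
d = mu2 - mu1
k_half = (0.125 * d @ np.linalg.inv(S) @ d
          + 0.5 * np.log(np.linalg.det(S)
                         / np.sqrt(np.linalg.det(S1) * np.linalg.det(S2))))
bound = np.sqrt(P1 * P2) * np.exp(-k_half)   # P(error) <= bound
print(k_half, bound)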
Receiver Operating
Characteristic (ROC) Curve
• Every classifier employs some kind of threshold.
– e.g., the minimum-error-rate rule thresholds the likelihood ratio at θa = P(ω2)/P(ω1)
Example: Person Authentication
• Two classes: A (authentic person) and I (impostor)
Example: Person Authentication
(cont’d)
• Possible decisions:
– (1) correct acceptance (true positive): X belongs to A, and we decide A
– (2) incorrect acceptance (false positive): X belongs to I, and we decide A
– (3) correct rejection (true negative): X belongs to I, and we decide I
– (4) incorrect rejection (false negative): X belongs to A, and we decide I
Error vs Threshold
ROC
False Negatives vs Positives
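A sketch of how an ROC curve is traced: sweep the decision threshold and record the false-positive and true-positive rates. The 1-D Gaussian score models for A and I are assumptions for illustration:

import numpy as np
from scipy.stats import norm

mu_A, mu_I, sigma = 2.0, 0.0, 1.0
thresholds = np.linspace(-4, 6, 101)

# Decide "A" when the score x exceeds the threshold.
true_positive_rate  = 1 - norm.cdf(thresholds, mu_A, sigma)  # P(decide A / A)
false_positive_rate = 1 - norm.cdf(thresholds, mu_I, sigma)  # P(decide A / I)

for fp, tp in list(zip(false_positive_rate, true_positive_rate))[::20]:
    print(f"FP={fp:.3f}  TP={tp:.3f}")   # sweeping the threshold traces the ROC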
Next Lecture
• Linear Classification Methods
– Hastie et al, Chapter 4
Compound Bayesian
Decision Theory
• Compound decision:
(1) Wait for n fish to emerge.
(2) Make all n decisions jointly.
– Could improve performance when consecutive states
of nature are not statistically independent.
Compound Bayesian
Decision Theory (cont’d)
• Suppose Ω = (ω(1), ω(2), …, ω(n)) denotes the
n states of nature, where ω(i) can take one of
c values ω1, ω2, …, ωc (i.e., c categories)
• Suppose P(Ω) is the prior probability of the n
states of nature.
• Suppose X=(x1, x2, …, xn) are n observed
vectors.
Compound Bayesian
Decision Theory (cont’d)
• Compute the posterior of the joint assignment: P(Ω/X) = p(X/Ω) P(Ω) / p(X)
• Leaving P(Ω) as a general joint prior is
acceptable! i.e., consecutive states of nature may
not be statistically independent!
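A toy sketch of a compound decision: enumerate all c^n joint label sequences, score each by p(X/Ω)P(Ω), and pick the best. The likelihood table and the Markov-style joint prior are made up; the point is only that P(Ω) need not factor across the n states:

import numpy as np
from itertools import product

likelihood = {                       # p(x_i / omega_j) for each observation
    0: np.array([0.7, 0.2]),         # p(x1 / w1), p(x1 / w2)
    1: np.array([0.4, 0.5]),
    2: np.array([0.3, 0.6]),
}

def prior(labels):
    # Hypothetical Markov-style prior favouring runs of the same state,
    # so P(Omega) is deliberately NOT a product of independent priors.
    p = 0.5
    for a, b in zip(labels, labels[1:]):
        p *= 0.8 if a == b else 0.2
    return p

best = max(product(range(2), repeat=3),
           key=lambda labels: prior(labels)
           * np.prod([likelihood[i][w] for i, w in enumerate(labels)]))
print(best)   # jointly most probable sequence of states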