slide07-bayes
1
Bayesian Learning
• Quantities of interest are governed by probability distributions, and optimal decisions can be made by
reasoning about these probabilities.
2
Two Roles for Bayesian Methods
3
Basic Probability Formulas
P (A ∨ B) = P (A) + P (B) − P (A ∧ B)
4
Bayes Theorem
P(h|D) = P(D|h)P(h) / P(D)
• P (h) = prior probability that h holds, before seeing the training data
5
Bayes Theorem: Example
A patient takes a lab test and the result comes back positive. The test returns a correct
positive result in only 98% of the cases in which the disease is actually present, and a
correct negative result in only 97% of the cases in which the disease is not present.
Furthermore, .001 of the entire population have this cancer.
P(cancer) = 0.001        P(¬cancer) = 0.999
P(⊕|cancer) = 0.98       P(⊖|cancer) = 0.02
P(⊕|¬cancer) = 0.03      P(⊖|¬cancer) = 0.97
6
Bayes Theorem: Example
The test returns a correct positive result in only 98% of the cases in which the disease is
actually present, and a correct negative result in only 97% of the cases in which the
disease is not present. Furthermore, .001 of the entire population have this cancer.
P(cancer|⊕) = P(⊕|cancer)P(cancer) / P(⊕)
            = 0.98 × 0.001 / P(⊕)
            = 0.00098 / (P(⊕, cancer) + P(⊕, ¬cancer))
            = 0.00098 / (P(⊕|cancer)P(cancer) + P(⊕|¬cancer)P(¬cancer))
            = 0.00098 / (0.98 × 0.001 + 0.03 × 0.999)
            = 0.031664                                              (1)
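The arithmetic above can be checked with a short script; the numbers are exactly those of the slide's lab-test example:

```python
# Values from the slide's lab-test example.
p_cancer = 0.001
p_pos_given_cancer = 0.98       # correct positive rate
p_pos_given_not_cancer = 0.03   # false positive rate (1 - 0.97)

# Total probability: P(+) = P(+|cancer)P(cancer) + P(+|~cancer)P(~cancer)
p_pos = (p_pos_given_cancer * p_cancer
         + p_pos_given_not_cancer * (1 - p_cancer))

# Bayes theorem: P(cancer|+) = P(+|cancer)P(cancer) / P(+)
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
print(round(p_cancer_given_pos, 6))  # 0.031664
```

Even with a positive test, the posterior stays small because the 0.001 prior dominates.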
7
Conditional Independence
(∀xi , yj , zk ) P (X = xi |Y = yj , Z = zk ) = P (X = xi |Z = zk )
8
Choosing Hypotheses
P(h|D) = P(D|h)P(h) / P(D)
Generally want the most probable hypothesis given the training data
9
Choosing Hypotheses
10
Brute Force MAP Hypothesis Learner
P(h|D) = P(D|h)P(h) / P(D)

h_MAP = argmax_{h∈H} P(h|D)
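A minimal sketch of the brute-force learner over a made-up hypothesis space; the priors and likelihoods below are invented for illustration. Since P(D) is the same for every h, comparing P(D|h)P(h) suffices:

```python
# Hypothetical hypothesis space with invented P(h) and P(D|h).
priors = {"h1": 0.3, "h2": 0.4, "h3": 0.3}          # P(h)
likelihoods = {"h1": 0.2, "h2": 0.05, "h3": 0.3}    # P(D|h)

# P(D) is constant across h, so argmax P(h|D) = argmax P(D|h)P(h).
h_map = max(priors, key=lambda h: likelihoods[h] * priors[h])
print(h_map)  # h3  (0.3 * 0.3 = 0.09 beats 0.06 and 0.02)
```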
11
Learning A Real Valued Function
[Figure: the maximum likelihood hypothesis h_ML fit to noisy samples of the target function f]
• di = f (xi ) + ei
• e_i is a random variable (noise) drawn independently for each x_i according to some Gaussian
distribution with mean zero
Then the maximum likelihood hypothesis h_ML is the one that minimizes the sum of squared errors:

h_ML = argmin_{h∈H} Σ_{i=1}^{m} (d_i − h(x_i))²
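As a sketch of this result, the argmin can be approximated by a grid search over a toy linear hypothesis class h(x) = a·x + b; the data points, the grid, and the hypothesis form are all invented for illustration:

```python
# Invented noisy data, roughly d = 2x plus small noise.
xs = [0.0, 1.0, 2.0, 3.0]
ds = [0.1, 1.9, 4.1, 5.9]

def sse(a, b):
    """Sum of squared errors sum_i (d_i - h(x_i))^2 for h(x) = a*x + b."""
    return sum((d - (a * x + b)) ** 2 for x, d in zip(xs, ds))

# A coarse grid search over (a, b) stands in for the argmin over H.
grid = [(a / 10, b / 10) for a in range(-50, 51) for b in range(-50, 51)]
a_ml, b_ml = min(grid, key=lambda p: sse(*p))
print(a_ml, b_ml)  # 2.0 0.0 (the exact least-squares fit is a≈1.96, b≈0.06)
```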
12
Setting up the Stage
• Training instances hx1 , ..., xm i and target values hd1 , ..., dm i, where di = f (xi ) + ei .
13
Derivation of ML for Func. Approx.
From h_ML = argmax_{h∈H} ∏_{i=1}^{m} p(d_i|h):

d_i ∼ N(f(x_i), σ²).
– x ∼ N(µ, σ²) means random variable x is normally distributed with mean µ and variance σ².
• Using pdf of N :
h_ML = argmax_{h∈H} ∏_{i=1}^{m} (1/√(2πσ²)) e^(−(d_i−µ)²/(2σ²))

h_ML = argmax_{h∈H} ∏_{i=1}^{m} (1/√(2πσ²)) e^(−(d_i−h(x_i))²/(2σ²))
14
Derivation of ML
15
Least Squares as ML
Assumptions
• Observed training values di generated by adding random noise to true target value, where noise
has a normal distribution with zero mean.
Limitations
16
Minimum Description Length
According to information theory, the shortest code length for a message occurring with probability p_i
is − log₂ p_i bits.
17
MDL
• − log₂ P(D|h): description length of training data D given hypothesis h, under the optimal encoding C_{D|H}.
• Finally, we get:

h_MAP = argmin_{h∈H} [ L_{C_{D|H}}(D|h) + L_{C_H}(h) ]
18
MDL
• MAP:
h_MAP = argmin_{h∈H} [ L_{C_{D|H}}(D|h) + L_{C_H}(h) ]

which is the hypothesis that minimizes the combined length of the hypothesis itself and of the data
described with the help of the hypothesis.
• h_MDL = h_MAP if C_1 = C_H and C_2 = C_{D|H}.
19
Bayes Optimal Classifier
• What is the most probable hypothesis given the training data, vs. What is the most probable
classification?
• Example:
– P (h1 |D) = 0.4, P (h2 |D) = 0.3, P (h3 |D) = 0.3.
– Given a new instance x, h1 (x) = 1, h2 (x) = 0, h3 (x) = 0.
– In this case, probability of x being positive is only 0.4.
20
Bayes Optimal Classification
If a new instance can take classification v_j ∈ V, then the probability P(v_j|D) that the correct
classification of the new instance is v_j is:

P(v_j|D) = Σ_{h_i∈H} P(v_j|h_i) P(h_i|D)
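The three-hypothesis example from the previous slide can be computed directly with this formula:

```python
# The slide's example: posteriors 0.4, 0.3, 0.3 and predictions 1, 0, 0.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}   # P(h_i|D)
predictions = {"h1": 1, "h2": 0, "h3": 0}        # h_i(x)

# P(v|D) = sum_i P(v|h_i) P(h_i|D), with P(v|h_i) = 1 iff h_i predicts v.
def p_label(v):
    return sum(p for h, p in posteriors.items() if predictions[h] == v)

best = max([0, 1], key=p_label)
print(best, p_label(0), p_label(1))  # 0 0.6 0.4
```

So the Bayes optimal classification is 0, even though the single most probable hypothesis h1 predicts 1.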
21
Bayes Optimal Classifier
22
Bayes Optimal Classifier: Example
• Bayes optimal classifiers maximize the probability that a new instance is correctly classified,
given the available data, hypothesis space H , and prior probabilities over H .
• Some oddities: the resulting hypothesis can be outside of the hypothesis space.
23
Gibbs Sampling
24
Naive Bayes Classifier
• Want to estimate P (a1 , a2 , ..., an |vj ) and P (vj ) from training data.
25
Naive Bayes
• Naive Bayes requires estimating only (number of distinct attribute values) × (number of distinct target values) terms.
• Naive Bayes uses conditional independence to justify P(a_1, a_2, ..., a_n|v_j) = ∏_i P(a_i|v_j).
26
Naive Bayes Algorithm
27
Naive Bayes: Example
Want to compute:
v_NB = argmax_{v_j∈V} P(v_j) ∏_i P(x_i|v_j)
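A minimal sketch of this classifier over categorical attributes; the weather-style toy dataset below is invented for illustration (it is not the textbook's table):

```python
# Minimal naive Bayes: v_NB = argmax_v P(v) * prod_i P(a_i|v),
# with probabilities estimated as frequencies from a toy dataset.
from collections import Counter

data = [
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "no"),
    (("rain", "mild"), "yes"),
    (("overcast", "hot"), "yes"),
    (("rain", "cool"), "yes"),
]

label_counts = Counter(label for _, label in data)
# attr_counts[(i, value, label)] = number of examples with that combination
attr_counts = Counter()
for attrs, label in data:
    for i, a in enumerate(attrs):
        attr_counts[(i, a, label)] += 1

def classify(attrs):
    n = len(data)
    def score(v):
        s = label_counts[v] / n                                  # P(v)
        for i, a in enumerate(attrs):
            s *= attr_counts[(i, a, v)] / label_counts[v]        # P(a_i|v)
        return s
    return max(label_counts, key=score)

print(classify(("sunny", "hot")))  # no
```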
28
Naive Bayes: Subtleties
• ...but it works surprisingly well anyway. Note that we don't need the estimated posteriors P̂(v_j|x) to be
correct; we need only that

argmax_{v_j∈V} P̂(v_j) ∏_i P̂(a_i|v_j) = argmax_{v_j∈V} P(v_j) P(a_1, ..., a_n|v_j)
29
Naive Bayes: Subtleties
What if none of the training instances with target value v_j have attribute value a_i? Then

P̂(a_i|v_j) ← (n_c + mp) / (n + m)

where n is the number of training examples with v = v_j, n_c is the number of those that also have
a = a_i, p is a prior estimate for P̂(a_i|v_j), and m is the weight given to that prior (the
equivalent sample size).
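The m-estimate can be sketched in a few lines; the counts and prior below are illustrative:

```python
# m-estimate: (n_c + m*p) / (n + m), smoothing frequency n_c/n toward
# the prior p with equivalent sample size m. All numbers are invented.
def m_estimate(n_c, n, p, m):
    return (n_c + m * p) / (n + m)

# Even with n_c = 0 the estimate stays nonzero, so one unseen attribute
# value cannot zero out the whole naive Bayes product.
print(m_estimate(0, 10, 0.5, 2))   # 0.0833...
print(m_estimate(3, 10, 0.5, 2))   # 0.3333...
```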
30
Extra Slides: Will be covered, time permitting
31
Expectation Maximization (EM)
When to use:
Some uses:
32
EM for Estimating k Means
Given:
33
EM for Estimating k Means
E[z_ij] = p(x = x_i|µ = µ_j) / Σ_{n=1}^{2} p(x = x_i|µ = µ_n)

        = e^(−(x_i−µ_j)²/(2σ²)) / Σ_{n=1}^{2} e^(−(x_i−µ_n)²/(2σ²))
M step: Calculate a new maximum likelihood hypothesis h′ = ⟨µ′_1, µ′_2⟩, assuming the value taken on by each
hidden variable z_ij is its expected value E[z_ij] calculated above. Replace h = ⟨µ_1, µ_2⟩ by
h′ = ⟨µ′_1, µ′_2⟩.

µ_j ← Σ_{i=1}^{m} E[z_ij] x_i / Σ_{i=1}^{m} E[z_ij]
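The two steps can be sketched for a two-mean mixture with known, equal variance; the data points, initial means, and σ² = 1 are invented for illustration:

```python
# Sketch of EM for k = 2 means: the E step computes E[z_ij] as on the
# slide, the M step re-estimates each mean as the E[z_ij]-weighted
# average of the data. Data, initial means, and sigma^2 are made up.
import math

xs = [0.0, 0.5, 1.0, 7.0, 7.5, 8.0]
mu = [0.0, 9.0]          # initial guesses for mu_1, mu_2
sigma2 = 1.0

for _ in range(20):
    # E step: E[z_ij] proportional to exp(-(x_i - mu_j)^2 / (2 sigma^2))
    w = []
    for x in xs:
        ps = [math.exp(-(x - m) ** 2 / (2 * sigma2)) for m in mu]
        total = sum(ps)
        w.append([p / total for p in ps])
    # M step: mu_j <- sum_i E[z_ij] x_i / sum_i E[z_ij]
    mu = [sum(w[i][j] * xs[i] for i in range(len(xs)))
          / sum(w[i][j] for i in range(len(xs)))
          for j in range(2)]

print([round(m, 2) for m in mu])  # roughly [0.5, 7.5]
```

The means converge to the centers of the two well-separated clusters.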
34
EM Algorithm
35
General EM Problem
Given:
Determine:
36
General EM Method
Define a likelihood function Q(h′|h) which calculates Y = X ∪ Z using observed X and current parameters h
to estimate Z:

Q(h′|h) ← E[ln P(Y|h′) | h, X]

EM Algorithm:

Estimation (E) step: Calculate Q(h′|h) using the current hypothesis h and the observed data X to estimate
the probability distribution over Y:

Q(h′|h) ← E[ln P(Y|h′) | h, X]

Maximization (M) step: Replace hypothesis h by the hypothesis h′ that maximizes this Q function:

h ← argmax_{h′} Q(h′|h)
37
Derivation of k -Means
38
Deriving ln P (Y |h)
p(y_i|h′) = p(x_i, z_i1, z_i2, ..., z_ik|h′) = (1/√(2πσ²)) e^(−(1/(2σ²)) Σ_{j=1}^{k} z_ij (x_i−µ′_j)²)
Note that the vector ⟨z_i1, ..., z_ik⟩ contains only a single 1 and all the rest are 0.
ln P(Y|h′) = ln ∏_{i=1}^{m} p(y_i|h′)

           = Σ_{i=1}^{m} ln p(y_i|h′)

           = Σ_{i=1}^{m} ( ln(1/√(2πσ²)) − (1/(2σ²)) Σ_{j=1}^{k} z_ij (x_i−µ′_j)² )
39
Deriving E[ln P (Y |h)]
Since this expression for ln P(Y|h′) is a linear function of z_ij, and since E[f(z)] = f(E[z]) for any
linear function f,

E[ln P(Y|h′)] = E[ Σ_{i=1}^{m} ( ln(1/√(2πσ²)) − (1/(2σ²)) Σ_{j=1}^{k} z_ij (x_i−µ′_j)² ) ]

              = Σ_{i=1}^{m} ( ln(1/√(2πσ²)) − (1/(2σ²)) Σ_{j=1}^{k} E[z_ij] (x_i−µ′_j)² )
Thus,
Q(h′|h) = Q(⟨µ′_1, ..., µ′_k⟩|h)

        = Σ_{i=1}^{m} ( ln(1/√(2πσ²)) − (1/(2σ²)) Σ_{j=1}^{k} E[z_ij] (x_i−µ′_j)² )
40
Finding argmax_{h′} Q(h′|h)
With

E[z_ij] = e^(−(x_i−µ_j)²/(2σ²)) / Σ_{n=1}^{2} e^(−(x_i−µ_n)²/(2σ²))

we want to find h′ such that

argmax_{h′} Q(h′|h) = argmax_{h′} Σ_{i=1}^{m} ( ln(1/√(2πσ²)) − (1/(2σ²)) Σ_{j=1}^{k} E[z_ij] (x_i−µ′_j)² )

                    = argmin_{h′} Σ_{i=1}^{m} Σ_{j=1}^{k} E[z_ij] (x_i−µ′_j)²,

which is minimized by

µ_j ← Σ_{i=1}^{m} E[z_ij] x_i / Σ_{i=1}^{m} E[z_ij]
41
Deriving the Update Rule
∂/∂µ′_j Σ_{i=1}^{m} Σ_{j=1}^{k} E[z_ij] (x_i−µ′_j)²

= ∂/∂µ′_j Σ_{i=1}^{m} E[z_ij] (x_i−µ′_j)²

= −2 Σ_{i=1}^{m} E[z_ij] (x_i−µ′_j) = 0

Σ_{i=1}^{m} E[z_ij] x_i − Σ_{i=1}^{m} E[z_ij] µ′_j = 0

Σ_{i=1}^{m} E[z_ij] x_i = µ′_j Σ_{i=1}^{m} E[z_ij]

µ′_j = Σ_{i=1}^{m} E[z_ij] x_i / Σ_{i=1}^{m} E[z_ij]
See Bishop (1995) Neural Networks for Pattern Recognition, Oxford U Press. pp. 63–64.
42
Losses and Risks
Actions: α_i
Loss of α_i when the state is C_k: λ_ik
Expected risk (Duda and Hart, 1973):

R(α_i|x) = Σ_{k=1}^{K} λ_ik P(C_k|x)
7
Losses and Risks: 0/1 Loss
λ_ik = 0 if i = k, 1 if i ≠ k

R(α_i|x) = Σ_{k=1}^{K} λ_ik P(C_k|x)
         = Σ_{k≠i} P(C_k|x)
         = 1 − P(C_i|x)
8
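The expected-risk computation and its 0/1-loss special case can be sketched as follows; the posterior values are invented for illustration:

```python
# R(alpha_i|x) = sum_k lambda_ik P(C_k|x); with 0/1 loss this reduces
# to 1 - P(C_i|x). Posterior values below are made up.
posterior = [0.6, 0.3, 0.1]          # P(C_k|x) for K = 3 classes

def risk(i, loss):
    return sum(loss[i][k] * posterior[k] for k in range(len(posterior)))

# 0/1 loss matrix: lambda_ik = 0 if i == k else 1
zero_one = [[0 if i == k else 1 for k in range(3)] for i in range(3)]

risks = [risk(i, zero_one) for i in range(3)]
best = min(range(3), key=lambda i: risks[i])
print(best, [round(r, 1) for r in risks])  # 0 [0.4, 0.7, 0.9]
```

Choosing the action with minimum risk here is the same as choosing the class with maximum posterior.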
9
Discriminant Functions
Choose C_i if g_i(x) = max_k g_k(x), where g_i(x), i = 1, ..., K, can be:

g_i(x) = −R(α_i|x)
g_i(x) = P(C_i|x)
g_i(x) = p(x|C_i) P(C_i)
11
K=2 Classes
12
Utility Theory
13
Association Rules
Association rule: X → Y
People who buy/click/visit/enjoy X are also likely to
buy/click/visit/enjoy Y.
A rule implies association, not necessarily causation.
14
Association measures
15
Support (X → Y):
P(X, Y) = #customers who bought X and Y / #customers

Confidence (X → Y):
P(Y|X) = P(X, Y) / P(X) = #customers who bought X and Y / #customers who bought X

Lift (X → Y):
P(X, Y) / (P(X)P(Y)) = P(Y|X) / P(Y)
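The three measures can be computed directly from raw basket counts; the baskets below are invented for illustration:

```python
# Support, confidence, and lift for the rule X -> Y, estimated from
# invented co-purchase data.
baskets = [{"X", "Y"}, {"X", "Y"}, {"X"}, {"Y"}, {"Z"}]

n = len(baskets)
n_x = sum(1 for b in baskets if "X" in b)
n_y = sum(1 for b in baskets if "Y" in b)
n_xy = sum(1 for b in baskets if {"X", "Y"} <= b)

support = n_xy / n                            # P(X, Y)
confidence = n_xy / n_x                       # P(Y|X)
lift = (n_xy / n) / ((n_x / n) * (n_y / n))   # P(X,Y) / (P(X)P(Y))

print(support, round(confidence, 3), round(lift, 3))  # 0.4 0.667 1.111
```

A lift above 1 indicates that buying X makes buying Y more likely than it is overall.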