
Bayesian Learning

Olive slides: Alpaydin

Black slides: Mitchell.

1
Bayesian Learning

• Probabilistic approach to inference.

• Quantities of interest are governed by probability distributions, and optimal decisions can be made by reasoning about these probabilities.

• Learning algorithms that directly deal with probabilities.

• Analysis framework for non-probabilistic methods.

2
Two Roles for Bayesian Methods

Provides practical learning algorithms:

• Naive Bayes learning

• Bayesian belief network learning

• Combine prior knowledge (prior probabilities) with observed data

• Requires prior probabilities

Provides useful conceptual framework

• Provides “gold standard” for evaluating other learning algorithms

• Additional insight into Occam’s razor

3
Basic Probability Formulas

• Product rule: probability P(A ∧ B) of a conjunction of two events A and B:

P(A, B) = P(B, A) = P(A ∧ B) = P(A|B)P(B) = P(B|A)P(A)

• Sum rule: probability of a disjunction of two events A and B:

P(A ∨ B) = P(A) + P(B) − P(A ∧ B)

• Theorem of total probability: if events A_1, ..., A_n are mutually exclusive with ∑_{i=1}^{n} P(A_i) = 1, then

P(B) = ∑_{i=1}^{n} P(B|A_i)P(A_i)

4
Bayes Theorem

P(h|D) = P(D|h)P(h) / P(D)

• P(h) = prior probability that h holds, before seeing the training data

• P(D) = prior probability of observing training data D

• P(D|h) = probability of observing D in a world where h holds

• P(h|D) = probability of h holding given the observed data D

• Some useful tricks:

  – P(h, D) = P(D, h)
  – P(h|D) = P(h, D) / P(D)
  – P(D, h) = P(D|h)P(h), from P(D|h) = P(D, h) / P(h)

5
Bayes Theorem: Example

Does patient have cancer or not?

A patient takes a lab test and the result comes back positive. The test returns a correct
positive result in only 98% of the cases in which the disease is actually present, and a
correct negative result in only 97% of the cases in which the disease is not present.
Furthermore, .001 of the entire population have this cancer.

P(cancer) =                P(¬cancer) =
P(⊕|cancer) =              P(⊖|cancer) =
P(⊕|¬cancer) =             P(⊖|¬cancer) =

How does P (cancer|⊕) compare to P (¬cancer|⊕)?

6
Bayes Theorem: Example
The test returns a correct positive result in only 98% of the cases in which the disease is
actually present, and a correct negative result in only 97% of the cases in which the
disease is not present. Furthermore, .001 of the entire population have this cancer.

P(cancer) = 0.001, so P(¬cancer) = 1 − P(cancer) = 1 − 0.001 = 0.999

P(⊕|cancer) = 0.98, so P(⊖|cancer) = 1 − P(⊕|cancer) = 1 − 0.98 = 0.02

P(⊖|¬cancer) = 0.97, so P(⊕|¬cancer) = 1 − P(⊖|¬cancer) = 1 − 0.97 = 0.03

How does P(cancer|⊕) compare to P(¬cancer|⊕)?

P(cancer|⊕) = P(⊕|cancer)P(cancer) / P(⊕)
            = (0.98 × 0.001) / P(⊕)
            = 0.00098 / (P(⊕, cancer) + P(⊕, ¬cancer))
            = 0.00098 / (P(⊕|cancer)P(cancer) + P(⊕|¬cancer)P(¬cancer))
            = 0.00098 / (0.98 × 0.001 + 0.03 × 0.999)
            = 0.031664    (1)

So P(¬cancer|⊕) ≈ 0.968: even after a positive test, ¬cancer remains far more probable.
7
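To make the arithmetic concrete, here is a minimal Python check of the computation above (the variable names are ours, not from the slides):

    p_cancer = 0.001
    p_pos_given_cancer = 0.98
    p_pos_given_no_cancer = 0.03

    # P(+) = P(+|cancer)P(cancer) + P(+|~cancer)P(~cancer)
    p_pos = p_pos_given_cancer * p_cancer + p_pos_given_no_cancer * (1 - p_cancer)
    # Bayes theorem: P(cancer|+) = P(+|cancer)P(cancer) / P(+)
    print(p_pos_given_cancer * p_cancer / p_pos)   # ~0.0317, so ~cancer is still MAP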
Conditional Independence

Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z; that is, if

(∀x_i, y_j, z_k)  P(X = x_i | Y = y_j, Z = z_k) = P(X = x_i | Z = z_k)

More compactly, we write

P(X|Y, Z) = P(X|Z)

Example: Thunder is conditionally independent of Rain, given Lightning:

P(Thunder | Rain, Lightning) = P(Thunder | Lightning)

8
Choosing Hypotheses

P(h|D) = P(D|h)P(h) / P(D)

Generally we want the most probable hypothesis given the training data.

Maximum a posteriori hypothesis h_MAP:

h_MAP = argmax_{h∈H} P(h|D)
      = argmax_{h∈H} P(D|h)P(h) / P(D)
      = argmax_{h∈H} P(D|h)P(h)

9
Choosing Hypotheses

• If all hypotheses are equally probable a priori, i.e. P(h_i) = P(h_j) for all h_i, h_j, then h_MAP reduces to:

h_ML ≡ argmax_{h∈H} P(D|h)

→ the Maximum Likelihood hypothesis.

10
Brute Force MAP Hypothesis Learner

1. For each hypothesis h in H, calculate the posterior probability

P(h|D) = P(D|h)P(h) / P(D)

2. Output the hypothesis h_MAP with the highest posterior probability

h_MAP = argmax_{h∈H} P(h|D)

11
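The brute-force learner can be sketched in a few lines of Python; the names likelihood and the (hypothesis, prior) pairs below are illustrative placeholders, not part of the slides:

    def brute_force_map(hypotheses, likelihood, D):
        # hypotheses: list of (h, P(h)) pairs; likelihood(h, D) computes P(D|h).
        # P(D) is a common normalizer, so the argmax can ignore it.
        best_h, best_score = None, float("-inf")
        for h, prior in hypotheses:
            score = likelihood(h, D) * prior    # unnormalized P(h|D)
            if score > best_score:
                best_h, best_score = h, score
        return best_h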
Learning A Real Valued Function
[Figure: training examples ⟨x_i, d_i⟩ with the fitted maximum likelihood hypothesis h_ML]

Consider any real-valued target function f.

Training examples ⟨x_i, d_i⟩, where d_i is a noisy training value:

• d_i = f(x_i) + e_i

• e_i is a random variable (noise) drawn independently for each x_i according to some Gaussian distribution with mean 0

Then the maximum likelihood hypothesis h_ML is the one that minimizes the sum of squared errors:

h_ML = argmin_{h∈H} ∑_{i=1}^{m} (d_i − h(x_i))²

12
Setting up the Stage

• Probability density function:

p(x₀) ≡ lim_{ε→0} (1/ε) P(x₀ ≤ x < x₀ + ε)

• ML hypothesis:

h_ML = argmax_{h∈H} p(D|h)

• Training instances ⟨x_1, ..., x_m⟩ and target values ⟨d_1, ..., d_m⟩, where d_i = f(x_i) + e_i.

• Assume training examples are mutually independent given h:

h_ML = argmax_{h∈H} ∏_{i=1}^{m} p(d_i|h)

Note: p(a, b|c) = p(a|b, c)·p(b|c), which equals p(a|c)·p(b|c) when a and b are conditionally independent given c.

13
Derivation of ML for Func. Approx.
From h_ML = argmax_{h∈H} ∏_{i=1}^{m} p(d_i|h):

• Since d_i = f(x_i) + e_i and e_i ∼ N(0, σ²), it must be that

d_i ∼ N(f(x_i), σ²).

  – x ∼ N(µ, σ²) means the random variable x is normally distributed with mean µ and variance σ².

• Using the pdf of the normal distribution:

h_ML = argmax_{h∈H} ∏_{i=1}^{m} (1/√(2πσ²)) e^(−(d_i − µ)²/(2σ²))
     = argmax_{h∈H} ∏_{i=1}^{m} (1/√(2πσ²)) e^(−(d_i − h(x_i))²/(2σ²))

14
Derivation of ML

h_ML = argmax_{h∈H} ∏_{i=1}^{m} (1/√(2πσ²)) e^(−(d_i − h(x_i))²/(2σ²))

• Drop the constant factor 1/√(2πσ²) and take the log:

h_ML = argmax_{h∈H} ln ∏_{i=1}^{m} e^(−(d_i − h(x_i))²/(2σ²))
     = argmax_{h∈H} ∑_{i=1}^{m} −(d_i − h(x_i))²/(2σ²)
     = argmin_{h∈H} ∑_{i=1}^{m} (d_i − h(x_i))²    (2)

15
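Since the ML hypothesis under zero-mean Gaussian noise is exactly the least-squares fit, an ordinary least-squares solver recovers h_ML. A small sketch, assuming numpy and a linear hypothesis space:

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 1.0, 50)
    d = 2.0 * x + 1.0 + rng.normal(0.0, 0.1, size=x.shape)   # d_i = f(x_i) + e_i

    # argmin_h sum_i (d_i - h(x_i))^2 over h(x) = w1*x + w0
    A = np.stack([x, np.ones_like(x)], axis=1)
    (w1, w0), *_ = np.linalg.lstsq(A, d, rcond=None)
    print(w1, w0)   # close to the true slope 2.0 and intercept 1.0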
Least Square as ML

Assumptions

• Observed training values d_i are generated by adding random noise to the true target value, where the noise has a normal distribution with zero mean.

• All hypotheses are equally probable a priori (uniform prior).

  – Note: in general, it is possible that h_MAP ≠ h_ML!

Limitations

• Possible noise in the x_i is not accounted for.

16
Minimum Description Length

Occam’s razor: prefer the shortest hypothesis.

h_MAP = argmax_{h∈H} P(D|h)P(h)
      = argmax_{h∈H} [log₂ P(D|h) + log₂ P(h)]
      = argmin_{h∈H} [− log₂ P(D|h) − log₂ P(h)]

Surprisingly, the above can be interpreted as h_MAP preferring shorter hypotheses, assuming a particular encoding scheme is used for the hypothesis and the data.

According to information theory, the shortest code length for a message occurring with probability p_i is − log₂ p_i bits.

17
MDL

h_MAP = argmin_{h∈H} [− log₂ P(D|h) − log₂ P(h)]

• L_C(i): description length of message i with respect to code C.

• − log₂ P(h): description length of h under the optimal code C_H for the hypothesis space H:

L_{C_H}(h) = − log₂ P(h)

• − log₂ P(D|h): description length of training data D given hypothesis h, under the optimal code C_{D|H}:

L_{C_{D|H}}(D|h) = − log₂ P(D|h)

• Finally, we get:

h_MAP = argmin_{h∈H} [L_{C_{D|H}}(D|h) + L_{C_H}(h)]

18
MDL

• MAP:

h_MAP = argmin_{h∈H} [L_{C_{D|H}}(D|h) + L_{C_H}(h)]

• MDL: choose h_MDL such that

h_MDL = argmin_{h∈H} [L_{C₁}(h) + L_{C₂}(D|h)]

which is the hypothesis that minimizes the combined length of the hypothesis itself and of the data described with the help of the hypothesis.

• h_MDL = h_MAP if C₁ = C_H and C₂ = C_{D|H}.

19
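A toy sketch of the MDL choice, assuming we already have code-length functions for hypotheses and data (the names below are ours, purely illustrative):

    import math

    def code_length(p):
        # Optimal code length, in bits, for an event with probability p.
        return -math.log2(p)

    def mdl_choose(hypotheses, L_hypothesis, L_data_given, D):
        # h_MDL = argmin_h [ L_C1(h) + L_C2(D|h) ]
        return min(hypotheses, key=lambda h: L_hypothesis(h) + L_data_given(h, D))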
Bayes Optimal Classifier

• What is the most probable hypothesis given the training data, vs. What is the most probable
classification?

• Example:
  – P(h_1|D) = 0.4, P(h_2|D) = 0.3, P(h_3|D) = 0.3.
  – Given a new instance x: h_1(x) = 1, h_2(x) = 0, h_3(x) = 0.
  – In this case, the probability of x being positive is only 0.4.

20
Bayes Optimal Classification

If a new instance can take classification v_j ∈ V, then the probability P(v_j|D) that the correct classification of the new instance is v_j is:

P(v_j|D) = ∑_{h_i∈H} P(v_j|h_i) P(h_i|D)

Thus, the optimal classification is

argmax_{v_j∈V} ∑_{h_i∈H} P(v_j|h_i) P(h_i|D).

21
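Using the three-hypothesis example from the previous slide, the Bayes optimal classification can be computed directly; a minimal sketch:

    posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}        # P(h_i|D)
    votes = {"h1": {"+": 1.0, "-": 0.0},                  # P(v_j|h_i)
             "h2": {"+": 0.0, "-": 1.0},
             "h3": {"+": 0.0, "-": 1.0}}

    def p_class(v):
        # P(v|D) = sum over h_i of P(v|h_i) P(h_i|D)
        return sum(votes[h][v] * p for h, p in posteriors.items())

    print(max(["+", "-"], key=p_class))   # "-", since P(-|D) = 0.6 > P(+|D) = 0.4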
Bayes Optimal Classifier

What is the assumption for the following to work?

P(v_j|D) = ∑_{h_i∈H} P(v_j|h_i) P(h_i|D)

Let's consider H = {h, ¬h}:

P(v|D) = P(v, h|D) + P(v, ¬h|D)
       = P(v, h, D)/P(D) + P(v, ¬h, D)/P(D)
       = P(v|h, D)P(h|D)P(D)/P(D) + P(v|¬h, D)P(¬h|D)P(D)/P(D)
       {if P(v|h, D) = P(v|h), etc.}
       = P(v|h)P(h|D) + P(v|¬h)P(¬h|D)

22
Bayes Optimal Classifier: Example

• P(h_1|D) = 0.4, P(h_2|D) = 0.3, P(h_3|D) = 0.3.

• Given a new instance x: h_1(x) = 1, h_2(x) = 0, h_3(x) = 0.

  – P(⊖|h_1) = 0, P(⊕|h_1) = 1, etc.
  – P(⊕|D) = 0.4 + 0 + 0 = 0.4, P(⊖|D) = 0 + 0.3 + 0.3 = 0.6
  – Thus, argmax_{v∈{⊕,⊖}} P(v|D) = ⊖.

• Bayes optimal classifiers maximize the probability that a new instance is correctly classified, given the available data, hypothesis space H, and prior probabilities over H.

• Some oddities: the resulting hypothesis can be outside of the hypothesis space.

23
Gibbs Sampling

Finding argmax_{v∈V} P(v|D) by considering every hypothesis h ∈ H can be infeasible. A less optimal, but error-bounded, alternative is Gibbs sampling:

1. Randomly pick h ∈ H with probability P(h|D).

2. Use h to classify the new instance x.

The result is that the expected misclassification rate is at most twice that of the Bayes optimal classifier.

24
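A minimal sketch of the procedure: draw one hypothesis according to its posterior, then classify with it. Here posteriors maps each hypothesis (a plain function, an assumption of this sketch) to P(h|D):

    import random

    def gibbs_classify(posteriors, x):
        hs = list(posteriors)
        h = random.choices(hs, weights=[posteriors[h] for h in hs], k=1)[0]
        return h(x)   # classify the new instance with the sampled hypothesis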
Naive Bayes Classifier

Given attribute values ⟨a_1, a_2, ..., a_n⟩, give the classification v ∈ V:

v_MAP = argmax_{v_j∈V} P(v_j|a_1, a_2, ..., a_n)
      = argmax_{v_j∈V} P(a_1, a_2, ..., a_n|v_j) P(v_j) / P(a_1, a_2, ..., a_n)
      = argmax_{v_j∈V} P(a_1, a_2, ..., a_n|v_j) P(v_j)

• Want to estimate P(a_1, a_2, ..., a_n|v_j) and P(v_j) from training data.

25
Naive Bayes

• P(v_j) is easy to estimate: just count relative frequencies.

• Estimating P(a_1, a_2, ..., a_n|v_j) directly would require a number of estimates equal to the number of possible instances × the number of possible target values.

• Under the naive assumption, P(a_1, a_2, ..., a_n|v_j) can be approximated as

P(a_1, a_2, ..., a_n|v_j) = ∏_i P(a_i|v_j).

• From this, the naive Bayes classifier is defined as:

v_NB = argmax_{v_j∈V} P(v_j) ∏_i P(a_i|v_j)

• Naive Bayes only requires (number of distinct attribute values) × (number of distinct target values) estimates. Naive Bayes uses conditional independence to justify

P(X, Y|Z) = P(X|Y, Z) P(Y|Z) = P(X|Z) P(Y|Z)

26
Naive Bayes Algorithm

Naive_Bayes_Learn(examples)

  For each target value v_j:
    P̂(v_j) ← estimate P(v_j)
    For each attribute value a_i of each attribute a:
      P̂(a_i|v_j) ← estimate P(a_i|v_j)

Classify_New_Instance(x)

  v_NB = argmax_{v_j∈V} P̂(v_j) ∏_i P̂(x_i|v_j)

27
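A compact sketch of the two procedures above for discrete attributes, using raw frequency estimates (no smoothing yet; see the m-estimate a few slides below):

    from collections import Counter, defaultdict

    def naive_bayes_learn(examples):
        # examples: list of (attribute_tuple, target_value) pairs
        class_counts = Counter(v for _, v in examples)
        attr_counts = defaultdict(Counter)
        for attrs, v in examples:
            for i, a in enumerate(attrs):
                attr_counts[v][(i, a)] += 1
        priors = {v: c / len(examples) for v, c in class_counts.items()}
        def cond(i, a, v):                      # estimate of P(a_i = a | v_j = v)
            return attr_counts[v][(i, a)] / class_counts[v]
        return priors, cond

    def classify_new_instance(priors, cond, x):
        def score(v):                           # P^(v) * prod_i P^(x_i|v)
            s = priors[v]
            for i, a in enumerate(x):
                s *= cond(i, a, v)
            return s
        return max(priors, key=score)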
Naive Bayes: Example

Consider PlayTennis again, and a new instance:

x = ⟨Outlk = sun, Temp = cool, Humid = high, Wind = strong⟩

V = {Yes, No}

Want to compute:

v_NB = argmax_{v_j∈V} P(v_j) ∏_i P(x_i|v_j)

P(Y) P(sun|Y) P(cool|Y) P(high|Y) P(strong|Y) = .005
P(N) P(sun|N) P(cool|N) P(high|N) P(strong|N) = .021

Thus, v_NB = No.

28
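A quick numeric check of the slide's numbers, assuming the standard 14-example PlayTennis data from Mitchell (9 Yes, 5 No):

    p_yes = 9/14 * 2/9 * 3/9 * 3/9 * 3/9   # P(Y) P(sun|Y) P(cool|Y) P(high|Y) P(strong|Y)
    p_no  = 5/14 * 3/5 * 1/5 * 4/5 * 3/5   # P(N) P(sun|N) P(cool|N) P(high|N) P(strong|N)
    print(round(p_yes, 3), round(p_no, 3)) # 0.005 and 0.021, so v_NB = No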
Naive Bayes: Subtleties

1. The conditional independence assumption is often violated:

P(a_1, a_2, ..., a_n|v_j) = ∏_i P(a_i|v_j)

• ...but it works surprisingly well anyway. Note that the estimated posteriors P̂(v_j|x) need not be correct; we need only that

argmax_{v_j∈V} P̂(v_j) ∏_i P̂(a_i|v_j) = argmax_{v_j∈V} P(v_j) P(a_1, ..., a_n|v_j)

• Naive Bayes posteriors are often unrealistically close to 1 or 0.

29
Naive Bayes: Subtleties

What if none of the training instances with target value v_j have attribute value a_i? Then P̂(a_i|v_j) = 0, and...

P̂(v_j) ∏_i P̂(a_i|v_j) = 0

The typical solution is a Bayesian estimate for P̂(a_i|v_j):

P̂(a_i|v_j) ← (n_c + m·p) / (n + m)

where

• n is the number of training examples for which v = v_j,
• n_c is the number of examples for which v = v_j and a = a_i,
• p is a prior estimate for P̂(a_i|v_j),
• m is the weight given to the prior (i.e., the number of "virtual" examples).

30
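The m-estimate is one line of code; a small sketch with an illustrative example:

    def m_estimate(n_c, n, p, m):
        # Blend the empirical frequency n_c/n with prior p using m virtual examples.
        return (n_c + m * p) / (n + m)

    # Attribute value never seen with this class (n_c = 0), uniform prior over 3 values:
    print(m_estimate(0, 10, 1/3, 3))   # ~0.077 rather than a hard zero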
Extra Slides: Will be covered, time permitting

31
Expectation Maximization (EM)

When to use:

• Data is only partially observable

• Unsupervised clustering (target value unobservable)

• Supervised learning (some instance attributes unobservable)

Some uses:

• Train Bayesian Belief Networks

• Unsupervised clustering (AUTOCLASS)

• Learning Hidden Markov Models

32
EM for Estimating k Means

Given:

• Instances from X generated by a mixture of k Gaussian distributions
• Unknown means ⟨µ_1, ..., µ_k⟩ of the k Gaussians
• Don't know which instance x_i was generated by which Gaussian

Determine:

• Maximum likelihood estimates of ⟨µ_1, ..., µ_k⟩

Think of the full description of each instance as y_i = ⟨x_i, z_i1, z_i2⟩ (for k = 2), where

• z_ij is 1 if x_i was generated by the j-th Gaussian
• x_i is observable
• z_ij is unobservable

33
EM for Estimating k Means

EM Algorithm: Pick a random initial h = ⟨µ_1, µ_2⟩, then iterate:

E step: Calculate the expected value E[z_ij] of each hidden variable z_ij, assuming the current hypothesis h = ⟨µ_1, µ_2⟩ holds:

E[z_ij] = p(x = x_i|µ = µ_j) / ∑_{n=1}^{2} p(x = x_i|µ = µ_n)
        = e^(−(x_i − µ_j)²/(2σ²)) / ∑_{n=1}^{2} e^(−(x_i − µ_n)²/(2σ²))

M step: Calculate a new maximum likelihood hypothesis h′ = ⟨µ′_1, µ′_2⟩, assuming the value taken on by each hidden variable z_ij is its expected value E[z_ij] calculated above. Replace h = ⟨µ_1, µ_2⟩ by h′ = ⟨µ′_1, µ′_2⟩:

µ_j ← ∑_{i=1}^{m} E[z_ij] x_i / ∑_{i=1}^{m} E[z_ij]

34
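A minimal sketch of the two-mean EM loop above for 1-D data, assuming numpy and a known, shared variance σ²:

    import numpy as np

    def em_two_means(x, sigma, iters=50, seed=0):
        rng = np.random.default_rng(seed)
        mu = rng.choice(x, size=2, replace=False)          # random initial <mu1, mu2>
        for _ in range(iters):
            # E step: E[z_ij] proportional to exp(-(x_i - mu_j)^2 / (2 sigma^2))
            logw = -(x[:, None] - mu[None, :]) ** 2 / (2 * sigma ** 2)
            w = np.exp(logw - logw.max(axis=1, keepdims=True))
            w /= w.sum(axis=1, keepdims=True)
            # M step: mu_j <- sum_i E[z_ij] x_i / sum_i E[z_ij]
            mu = (w * x[:, None]).sum(axis=0) / w.sum(axis=0)
        return mu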
EM Algorithm

EM converges to a local maximum likelihood hypothesis h, and provides estimates of the hidden variables z_ij.

In fact, it finds a local maximum of E[ln P(Y|h)], where:

• Y is the complete (observable plus unobservable variables) data
• The expected value is taken over the possible values of the unobserved variables in Y

35
General EM Problem

Given:

• Observed data X = {x_1, ..., x_m}

• Unobserved data Z = {z_1, ..., z_m}

• Parameterized probability distribution P(Y|h), where

  – Y = {y_1, ..., y_m} is the full data, with y_i = x_i ∪ z_i
  – h are the parameters

Determine:

• h that (locally) maximizes E[ln P(Y|h)]

36
General EM Method
Define a likelihood function Q(h′|h) which calculates Y = X ∪ Z, using the observed X and the current parameters h to estimate Z:

Q(h′|h) ← E[ln P(Y|h′) | h, X]

EM Algorithm:

Estimation (E) step: Calculate Q(h′|h) using the current hypothesis h and the observed data X to estimate the probability distribution over Y:

Q(h′|h) ← E[ln P(Y|h′) | h, X]

Maximization (M) step: Replace hypothesis h by the hypothesis h′ that maximizes this Q function:

h ← argmax_{h′} Q(h′|h)

37
Derivation of k -Means

• Hypothesis h is parameterized by θ = ⟨µ_1, ..., µ_k⟩.

• Observed data X = {⟨x_i⟩}

• Hidden variables Z = {⟨z_i1, ..., z_ik⟩}:

  – z_ij = 1 if input x_i was generated by the j-th normal distribution.
  – For each input, there are k entries.

• First, start by defining ln p(Y|h).

38
Deriving ln P(Y|h)

p(y_i|h′) = p(x_i, z_i1, z_i2, ..., z_ik|h′) = (1/√(2πσ²)) e^(−(1/(2σ²)) ∑_{j=1}^{k} z_ij (x_i − µ′_j)²)

Note that the vector ⟨z_i1, ..., z_ik⟩ contains a single 1, and all the rest are 0.

ln P(Y|h′) = ln ∏_{i=1}^{m} p(y_i|h′)
           = ∑_{i=1}^{m} ln p(y_i|h′)
           = ∑_{i=1}^{m} [ln(1/√(2πσ²)) − (1/(2σ²)) ∑_{j=1}^{k} z_ij (x_i − µ′_j)²]

39
Deriving E[ln P(Y|h)]

Since ln P(Y|h′) is a linear function of the z_ij, and since E[f(z)] = f(E[z]) for linear f:

E[ln P(Y|h′)] = E[∑_{i=1}^{m} (ln(1/√(2πσ²)) − (1/(2σ²)) ∑_{j=1}^{k} z_ij (x_i − µ′_j)²)]
             = ∑_{i=1}^{m} [ln(1/√(2πσ²)) − (1/(2σ²)) ∑_{j=1}^{k} E[z_ij] (x_i − µ′_j)²]

Thus,

Q(h′|h) = Q(⟨µ′_1, ..., µ′_k⟩|h)
        = ∑_{i=1}^{m} [ln(1/√(2πσ²)) − (1/(2σ²)) ∑_{j=1}^{k} E[z_ij] (x_i − µ′_j)²]

40
Finding argmax_{h′} Q(h′|h)

With

E[z_ij] = e^(−(x_i − µ_j)²/(2σ²)) / ∑_{n=1}^{2} e^(−(x_i − µ_n)²/(2σ²))

we want to find h′ such that

argmax_{h′} Q(h′|h) = argmax_{h′} ∑_{i=1}^{m} [ln(1/√(2πσ²)) − (1/(2σ²)) ∑_{j=1}^{k} E[z_ij] (x_i − µ′_j)²]
                    = argmin_{h′} ∑_{i=1}^{m} ∑_{j=1}^{k} E[z_ij] (x_i − µ′_j)²,

which is minimized by

µ_j ← ∑_{i=1}^{m} E[z_ij] x_i / ∑_{i=1}^{m} E[z_ij].

41
Deriving the Update Rule

Set the derivative of the quantity to be minimized to zero:

∂/∂µ′_j ∑_{i=1}^{m} ∑_{j=1}^{k} E[z_ij] (x_i − µ′_j)²
  = ∂/∂µ′_j ∑_{i=1}^{m} E[z_ij] (x_i − µ′_j)²
  = −2 ∑_{i=1}^{m} E[z_ij] (x_i − µ′_j) = 0

Hence:

∑_{i=1}^{m} E[z_ij] x_i − µ′_j ∑_{i=1}^{m} E[z_ij] = 0

∑_{i=1}^{m} E[z_ij] x_i = µ′_j ∑_{i=1}^{m} E[z_ij]

µ′_j = ∑_{i=1}^{m} E[z_ij] x_i / ∑_{i=1}^{m} E[z_ij]

See Bishop (1995) Neural Networks for Pattern Recognition, Oxford U Press. pp. 63–64.

42
Losses and Risks

• Actions: α_i
• Loss of α_i when the state is C_k: λ_ik
• Expected risk (Duda and Hart, 1973):

R(α_i|x) = ∑_{k=1}^{K} λ_ik P(C_k|x)

Choose α_i if R(α_i|x) = min_k R(α_k|x)

7
Losses and Risks: 0/1 Loss
λ_ik = 0 if i = k, 1 if i ≠ k  (0/1 loss)

R(α_i|x) = ∑_{k=1}^{K} λ_ik P(C_k|x)
         = ∑_{k≠i} P(C_k|x)
         = 1 − P(C_i|x)

For minimum risk, choose the most probable class.

8


Losses and Risks: Reject
0 if i  k

ik   if i  K  1 , 0    1
1 otherwise

K
R K 1 | x    P C k | x   
k 1

R i | x    P C k | x  1  P C i | x 
k i

chooseC i if PC i | x   PC k | x  k  i and PC i | x   1  


reject otherwise

9
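A small sketch of the resulting decision rule with a reject option (the names are ours): choose the most probable class if its posterior exceeds 1 − λ, otherwise reject.

    def decide(posteriors, lam):
        # posteriors: dict class -> P(C_i|x); 0 < lam < 1 is the reject loss.
        best = max(posteriors, key=posteriors.get)
        return best if posteriors[best] > 1 - lam else "reject"

    print(decide({"C1": 0.55, "C2": 0.45}, lam=0.3))   # 0.55 < 0.7, so "reject"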
Discriminant Functions
Choose C_i if g_i(x) = max_k g_k(x), where the discriminant functions g_i(x), i = 1, ..., K, can be taken as:

• g_i(x) = −R(α_i|x)
• g_i(x) = P(C_i|x)
• g_i(x) = p(x|C_i) P(C_i)

K decision regions R_1, ..., R_K:

R_i = {x | g_i(x) = max_k g_k(x)}

11
K=2 Classes

• Dichotomizer (K = 2) vs. polychotomizer (K > 2)

• g(x) = g_1(x) − g_2(x)

Choose C_1 if g(x) > 0, C_2 otherwise.

• Log odds:

log [P(C_1|x) / P(C_2|x)]
12
Utility Theory

• Probability of state k given evidence x: P(S_k|x)
• Utility of α_i when the state is k: U_ik
• Expected utility:

EU(α_i|x) = ∑_k U_ik P(S_k|x)

Choose α_i if EU(α_i|x) = max_j EU(α_j|x)

13
Association Rules

• Association rule: X → Y
• People who buy/click/visit/enjoy X are also likely to buy/click/visit/enjoy Y.
• A rule implies association, not necessarily causation.

14
Association measures
• Support(X → Y):

P(X, Y) = #{customers who bought both X and Y} / #{customers}

• Confidence(X → Y):

P(Y|X) = P(X, Y) / P(X)
       = #{customers who bought both X and Y} / #{customers who bought X}

• Lift(X → Y):

P(X, Y) / (P(X) P(Y)) = P(Y|X) / P(Y)

15
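A small sketch computing the three measures from a list of transactions (each a set of items); the function names are illustrative:

    def support(transactions, X, Y):
        return sum(1 for t in transactions if X in t and Y in t) / len(transactions)

    def confidence(transactions, X, Y):
        n_x = sum(1 for t in transactions if X in t)
        return sum(1 for t in transactions if X in t and Y in t) / n_x

    def lift(transactions, X, Y):
        p_y = sum(1 for t in transactions if Y in t) / len(transactions)
        return confidence(transactions, X, Y) / p_y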
