CS7641 SL09. Bayesian Learning
Mohamed Ameen Amer

SL09. Bayesian Learning

Introduction:
• We're trying to learn the best (most probable) hypothesis h ∈ H given input data and domain
  knowledge.
  Best == Most probable
• It's the probability of some hypothesis h given input data D: Pr(h | D)
• We're trying to find the hypothesis h with the highest probability:
  h = argmax_{h ∈ H} Pr(h | D)

Bayes Rule:
• Bayes Rule for probability states that:
  Pr(h | D) = Pr(D | h) Pr(h) / Pr(D)
  - Pr(h | D) → The probability of a specific hypothesis given the input data (the posterior
    probability).
  - Pr(D | h) → The probability of the data given the hypothesis (the likelihood): the probability of
    seeing some particular labels associated with the input points, given a world where hypothesis h
    is true.
  - Pr(h) → The prior probability of a particular hypothesis. This value encapsulates our prior belief
    that one hypothesis is likely or unlikely compared to the other hypotheses. This is basically the
    domain knowledge.
  - Pr(D) → The probability of the data under all hypotheses (a normalizing term).
  - Bayes Rule is derived from the chain rule, writing the joint probability both ways:
    Pr(a, b) = Pr(a | b) Pr(b)
    Pr(a, b) = Pr(b | a) Pr(a)
    then → Pr(a | b) Pr(b) = Pr(b | a) Pr(a)
    Pr(a | b) = Pr(b | a) Pr(a) / Pr(b)

Bayesian Learning:
• Bayesian Learning algorithm:
  For each h ∈ H:
      Calculate Pr(h | D) = Pr(D | h) Pr(h) / Pr(D)
  Output:
      h = argmax_{h ∈ H} Pr(h | D)
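The brute-force loop above can be sketched as follows. The coin-bias hypothesis space, the uniform prior, and the Bernoulli likelihood are assumed toy choices for illustration, not from the notes.

```python
def posterior(hypotheses, prior, likelihood, data):
    """Return Pr(h | D) for every h, via Bayes rule."""
    unnormalized = {h: likelihood(data, h) * prior[h] for h in hypotheses}
    pr_d = sum(unnormalized.values())          # Pr(D) = sum_h Pr(D | h) Pr(h)
    return {h: p / pr_d for h, p in unnormalized.items()}

# Toy example: hypotheses are candidate coin biases, data is coin flips.
hypotheses = [0.2, 0.5, 0.8]                   # assumed Pr(heads) values
prior = {h: 1 / len(hypotheses) for h in hypotheses}   # uniform prior

def likelihood(data, h):
    # Pr(D | h): independent Bernoulli flips (1 = heads)
    p = 1.0
    for d in data:
        p *= h if d == 1 else (1 - h)
    return p

post = posterior(hypotheses, prior, likelihood, [1, 1, 1, 0])
h_best = max(post, key=post.get)               # argmax_h Pr(h | D)
```

Note that the loop touches every h ∈ H, which is exactly why this direct computation doesn't scale to large hypothesis spaces.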


- Using this posterior probability, we can calculate the Maximum a Posteriori (MAP) hypothesis, which
  is the most probable hypothesis given the data across all hypotheses:
  h_MAP = argmax_{h ∈ H} Pr(h | D)
  h_MAP = argmax_{h ∈ H} Pr(D | h) Pr(h) / Pr(D)
- Since we're interested in finding the hypothesis with the highest probability, not the exact
  probability value for each hypothesis, our prior on the data isn't relevant. That is, we can drop
  the Pr(D) term in the denominator, as it affects all hypotheses equally:
  h_MAP = argmax_{h ∈ H} Pr(D | h) Pr(h)
- If we also assume that our prior belief is uniform over all the hypotheses h ∈ H (we believe
  equally in every h ∈ H), then we can drop Pr(h) from the equation, ending up with the Maximum
  Likelihood (ML) hypothesis:
  h_ML = argmax_{h ∈ H} Pr(D | h)
• The problem with Bayesian learning is that it isn't practical to perform these direct computations
  for large hypothesis spaces, because you have to look at every single hypothesis.

Bayesian Learning in Action:
• Assume:
  - We're given noise-free training data {⟨x_i, d_i⟩} as examples of the target concept c.
  - c ∈ H
  - A uniform prior over the hypotheses.
• We need to calculate Pr(h | D) = Pr(D | h) Pr(h) / Pr(D):
  Pr(h) = 1/|H|, because we have a uniform prior.
  Pr(D | h) = 1 if d_i = h(x_i) ∀⟨x_i, d_i⟩ ∈ D, and 0 otherwise.
  - This basically means that Pr(D | h) = 1 if h ∈ VS(D), the version space: the set of hypotheses
    consistent with the data.
  Pr(D) = Σ_{h_i ∈ H} Pr(D | h_i) Pr(h_i) = Σ_{h_i ∈ VS_{H,D}} 1 · (1/|H|) = |VS| / |H|
  Pr(h | D) = (1 · 1/|H|) / (|VS| / |H|) = 1/|VS|
- This means that, given data D, the probability that h is a correct hypothesis is uniform over all
  the hypotheses in the version space.
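The noise-free case above can be checked numerically. This sketch uses an assumed toy hypothesis space (all boolean functions over two binary inputs, represented by their truth tables); with a uniform prior and consistent data, the posterior comes out as 1/|VS| on the version space and 0 elsewhere.

```python
from itertools import product

# Hypothesis space: all 16 boolean functions over 2 binary inputs,
# each represented by its truth table (a tuple of 4 output bits).
inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
H = list(product([0, 1], repeat=4))

def h_of(h, x):
    return h[inputs.index(x)]

# Noise-free data: two labeled examples of the (unknown) target concept.
D = [((0, 0), 0), ((1, 1), 1)]

# Version space: hypotheses consistent with every example.
version_space = [h for h in H if all(h_of(h, x) == d for x, d in D)]

# Pr(h | D) = 1/|VS| for consistent h, 0 otherwise.
post = {h: (1 / len(version_space) if h in version_space else 0.0) for h in H}
```

With two of the four truth-table bits pinned down by the data, two bits remain free, so |VS| = 4 and each consistent hypothesis gets posterior 1/4.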


Bayesian Learning with Noise:
• Assume:
  - We're given training data {⟨x_i, d_i⟩}.
  - d_i = f(x_i) + ε_i
  - ε_i ~ N(0, σ²), IID (Independent and Identically Distributed).
• With a uniform prior, we need to find the maximum likelihood hypothesis:
  h_ML = argmax_{h ∈ H} Pr(D | h)
- Since the data points are IID, Pr(D | h) is the product of the probabilities of the individual
  data points given the hypothesis:
  h_ML = argmax_{h ∈ H} Π_i Pr(d_i | h)
- Given Gaussian noise:
  h_ML = argmax_{h ∈ H} Π_i (1/√(2πσ²)) · exp(−(1/2) · (d_i − h(x_i))² / σ²)
- Since we're looking for the maximum:
  1. We can drop the constant 1/√(2πσ²), since it doesn't affect the argmax.
  2. We can take the natural log (ln) to remove the exponential. Since the ln of a product equals
     the sum of the logs of the terms, we end up with the following function:
     h_ML = argmax_{h ∈ H} Σ_i −(1/2) · (d_i − h(x_i))² / σ²
- Again, since we're computing an argmax, we can drop the constants 1/2 and σ²:
  h_ML = argmax_{h ∈ H} −Σ_i (d_i − h(x_i))²
- Maximizing a negative quantity is the same as minimizing its positive counterpart:
  h_ML = argmin_{h ∈ H} Σ_i (d_i − h(x_i))²
• This means: if you're looking for the maximum likelihood hypothesis, you should minimize the sum
  of squared errors.
• This derivation does not hold if the data is corrupted by any sort of noise other than Gaussian
  noise.
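The equivalence derived above can be demonstrated on a toy problem. This sketch assumes a hypothesis class of lines through the origin and Gaussian-corrupted data, both invented for illustration; the hypothesis minimizing the sum of squared errors is exactly the one maximizing the Gaussian log-likelihood.

```python
import math
import random

random.seed(0)
true_w = 2.0
xs = [x / 10 for x in range(1, 21)]
# d_i = f(x_i) + eps_i, with Gaussian noise eps_i ~ N(0, 0.1^2)
data = [(x, true_w * x + random.gauss(0, 0.1)) for x in xs]

# Discrete hypothesis space: h_w(x) = w * x for candidate slopes w.
H = [w / 10 for w in range(0, 41)]            # w in {0.0, 0.1, ..., 4.0}

def sse(w):
    # Sum of squared errors for hypothesis h_w
    return sum((d - w * x) ** 2 for x, d in data)

def log_likelihood(w, sigma=0.1):
    # ln Pr(D | h_w) under the Gaussian noise model
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (d - w * x) ** 2 / (2 * sigma ** 2) for x, d in data)

h_ml_by_sse = min(H, key=sse)
h_ml_by_ll = max(H, key=log_likelihood)
# The two criteria select the same hypothesis.
```

Because ln Pr(D | h_w) is a constant minus SSE(w)/(2σ²), the two rankings of hypotheses are identical, which is the whole point of the derivation.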


Minimum Description Length:
  h_MAP = argmax_{h ∈ H} Pr(D | h) Pr(h)
  h_MAP = argmax_{h ∈ H} [log Pr(D | h) + log Pr(h)]
  h_MAP = argmin_{h ∈ H} [−log Pr(D | h) − log Pr(h)]
• Information theory: the optimal code for some event w with probability Pr has length −log Pr.
• This means that in order to find the Maximum a Posteriori hypothesis, we need to minimize two
  terms that can each be described as a length:
  - −log Pr(h) → The length of the hypothesis: the number of bits needed to represent this
    hypothesis.
  - −log Pr(D | h) → The length of the data given a particular hypothesis. If the hypothesis
    describes the data perfectly, we don't need to transmit any points; but if the hypothesis labels
    some points wrong, we need to transmit the correct labels for those points to recover the data.
    So basically this term captures the error.
• There is always a trade-off: a more complex hypothesis will drive down the error, while a simpler
  hypothesis will have some error.
• We need to find the best hypothesis, which is the simplest hypothesis that minimizes the error.
  This hypothesis is called the Minimum Description Length (MDL) hypothesis.
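The two length terms can be combined numerically. In this sketch, the two hypotheses and their prior/likelihood values are invented numbers, chosen only to show how a short, slightly wrong hypothesis can beat a long, well-fitting one.

```python
import math

# (prior Pr(h), likelihood Pr(D | h)) for two assumed hypotheses
hypotheses = {
    "simple":  (0.50, 0.02),   # short to describe, fits the data poorly
    "complex": (0.01, 0.90),   # long to describe, fits the data well
}

def description_length(prior, likelihood):
    # length(h) + length(D | h) = -log Pr(h) - log Pr(D | h), in bits
    return -math.log2(prior) - math.log2(likelihood)

lengths = {name: description_length(p, l) for name, (p, l) in hypotheses.items()}
h_mdl = min(lengths, key=lengths.get)   # the MDL (= MAP) hypothesis
```

Here the simple hypothesis wins despite its worse fit, because its 1-bit description more than pays for the extra bits needed to correct its errors.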

Bayesian Classification:
• The question in classification is "What is the best label?", not "What is the best hypothesis?".
• To find the best label, we do a weighted vote over every single hypothesis in the hypothesis
  space, where the weight of each hypothesis is its posterior probability Pr(h | D).
• We end up trying to find the label v_MAP that maximizes this weighted vote:
  v_MAP = argmax_{v_j ∈ V} Σ_{h_i ∈ H} Pr(v_j | h_i) Pr(h_i | D)
• This is the Bayes optimal classifier, and it is computationally very costly: the posterior
  probability Pr(h | D) must be computed for each hypothesis h ∈ H and combined with the prediction
  Pr(v | h) before v_MAP can be computed.
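The weighted vote can be sketched directly. The three hypotheses, their posteriors, and their label predictions below are assumed toy numbers, not from the notes.

```python
# Posterior Pr(h_i | D) for each hypothesis
post = {"h1": 0.4, "h2": 0.3, "h3": 0.3}

# Pr(v_j | h_i): each hypothesis's distribution over labels for a new point
prediction = {
    "h1": {"+": 1.0, "-": 0.0},
    "h2": {"+": 0.0, "-": 1.0},
    "h3": {"+": 0.0, "-": 1.0},
}

labels = ["+", "-"]
# v_MAP = argmax_v sum_h Pr(v | h) Pr(h | D)
vote = {v: sum(prediction[h][v] * post[h] for h in post) for v in labels}
v_map = max(vote, key=vote.get)
```

Note the contrast with MAP: the single most probable hypothesis (h1, posterior 0.4) predicts "+", but the weighted vote over all hypotheses gives "-" a total weight of 0.6, so the Bayes optimal label differs from the MAP hypothesis's label.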
