SL09. Bayesian Learning
Introduction:
• We're trying to learn the best (most probable) hypothesis h from a hypothesis space H, given input data and domain knowledge.
• Best == Most probable.
• What we want is the probability of a hypothesis h given the input data D: Pr(h | D).
• We're trying to find the hypothesis h with the highest such probability:
	h = argmax_{h ∈ H} Pr(h | D)
Bayes Rule:
• Bayes Rule for probability states that:
	Pr(h | D) = Pr(D | h) Pr(h) / Pr(D)
- Pr(h | D) → The posterior probability: the probability of a specific hypothesis given the input data.
- Pr(D | h) → The probability of the data given the hypothesis: the likelihood of seeing these particular labels on the input points in a world where hypothesis h is true.
- Pr(h) → The prior probability of a particular hypothesis. This value encapsulates our prior belief that one hypothesis is more or less likely than the others; it is basically the domain knowledge.
- Pr(D) → The probability of the data, marginalized over all hypotheses (a normalizing term).
- Bayes Rule is derived from the chain rule:
	Pr(a, b) = Pr(a | b) Pr(b)
	Pr(a, b) = Pr(b | a) Pr(a)
	then → Pr(a | b) Pr(b) = Pr(b | a) Pr(a)
	Pr(a | b) = Pr(b | a) Pr(a) / Pr(b)
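- As a quick numeric sanity check, here is a minimal sketch of Bayes Rule in Python; all the numbers are made up purely for illustration:

```python
# Minimal numeric sketch of Bayes Rule; all values are hypothetical.
prior = 0.1        # Pr(h): prior belief in hypothesis h
likelihood = 0.8   # Pr(D | h): probability of the data if h were true
evidence = 0.25    # Pr(D): probability of the data over all hypotheses

posterior = likelihood * prior / evidence   # Pr(h | D) by Bayes Rule
print(posterior)   # 0.32 -- the data raised our belief in h from 0.1 to 0.32
```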
Bayesian Learning:
• Bayesian Learning algorithm:
	For each h ∈ H:
		Calculate Pr(h | D) = Pr(D | h) Pr(h) / Pr(D)
	Output:
		h = argmax_{h ∈ H} Pr(h | D)
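- A minimal Python sketch of this brute-force loop, assuming a small explicit hypothesis space; the hypotheses, priors, and likelihood values are hypothetical:

```python
# Brute-force Bayesian Learning over a small, explicit hypothesis space.
# All hypotheses, priors, and likelihood values below are hypothetical.

priors = {"h1": 0.5, "h2": 0.3, "h3": 0.2}          # Pr(h)
likelihoods = {"h1": 0.05, "h2": 0.40, "h3": 0.10}  # Pr(D | h)

# Pr(D) = sum over h of Pr(D | h) Pr(h)  (the normalizing term)
evidence = sum(likelihoods[h] * priors[h] for h in priors)

# Pr(h | D) for every hypothesis, via Bayes Rule
posteriors = {h: likelihoods[h] * priors[h] / evidence for h in priors}

# Output the most probable hypothesis: argmax over Pr(h | D)
best_h = max(posteriors, key=posteriors.get)
print(posteriors)  # h2 dominates here because its likelihood is highest
print(best_h)      # "h2"
```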
- Using this posterior probability, we can compute the Maximum a Posteriori (MAP) hypothesis, the hypothesis with the highest probability given the data across all hypotheses:
	h_MAP = argmax_{h ∈ H} Pr(h | D)
	h_MAP = argmax_{h ∈ H} Pr(D | h) Pr(h) / Pr(D)
- Since we're interested in finding the hypothesis with the highest probability, not the exact probability value for each hypothesis, the prior on the data isn't really relevant. That is, we don't care about the Pr(D) term in the denominator, as it affects all hypotheses equally:
	h_MAP = argmax_{h ∈ H} Pr(D | h) Pr(h)
- If we also assume that our prior belief is uniform over all hypotheses h ∈ H (we believe in every h ∈ H equally), we can drop Pr(h) from the equation as well, ending up with the Maximum Likelihood hypothesis (see the sketch below):
	h_ML = argmax_{h ∈ H} Pr(D | h)
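- Continuing the hypothetical numbers from the sketch above: dropping Pr(D) leaves the unnormalized scores Pr(D | h) Pr(h), and additionally assuming a uniform prior leaves just the likelihood Pr(D | h). Dropping Pr(D) never changes the argmax, since it scales every hypothesis's score equally:

```python
# MAP and ML with the same hypothetical numbers as the previous sketch.
priors = {"h1": 0.5, "h2": 0.3, "h3": 0.2}          # Pr(h)
likelihoods = {"h1": 0.05, "h2": 0.40, "h3": 0.10}  # Pr(D | h)

# h_MAP: drop Pr(D); the argmax of Pr(D | h) Pr(h) is unchanged because
# Pr(D) scales every hypothesis's score by the same constant.
map_scores = {h: likelihoods[h] * priors[h] for h in priors}
h_map = max(map_scores, key=map_scores.get)

# h_ML: additionally assume a uniform prior, so only Pr(D | h) matters.
h_ml = max(likelihoods, key=likelihoods.get)

print(h_map, h_ml)  # both "h2" with these particular numbers
```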
• The problem with Bayesian Learning is that direct computation is not practical for large hypothesis spaces, because you have to look at every single hypothesis.
- Assuming each training label is the true output corrupted by i.i.d. Gaussian noise, d_i = f(x_i) + ε_i with ε_i ~ N(0, σ²), maximizing the log-likelihood of the data gives:
	h_ML = argmax_{h ∈ H} Σ_i −½ · (d_i − h(x_i))² / σ²
- Again, since we're calculating the maximum, we can remove the ½ and the σ²:
	h_ML = argmax_{h ∈ H} − Σ_i (d_i − h(x_i))²
- Maximizing a negative quantity is the same as minimizing its positive counterpart:
	h_ML = argmin_{h ∈ H} Σ_i (d_i − h(x_i))²
• This means: if you're looking for the maximum likelihood hypothesis, you should minimize the sum of squared errors.
• This equivalence only holds under Gaussian noise; if the data is corrupted by any other kind of noise, minimizing the sum of squared errors no longer yields the maximum likelihood hypothesis.
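- A small sketch of that equivalence, assuming a hypothetical setup where the candidate hypotheses are simple linear functions and the labels are generated with Gaussian noise: ranking the candidates by Gaussian log-likelihood and ranking them by sum of squared errors picks the same hypothesis.

```python
import math
import random

random.seed(0)
sigma = 0.5  # assumed (hypothetical) noise standard deviation

# Hypothetical data: true function f(x) = 2x + 1, labels corrupted by Gaussian noise.
xs = [i / 10 for i in range(20)]
ds = [2 * x + 1 + random.gauss(0, sigma) for x in xs]

# A few hypothetical candidate hypotheses, each a (slope, intercept) pair.
candidates = {"h1": (2.0, 1.0), "h2": (1.5, 1.2), "h3": (2.5, 0.5)}

def sse(h):
    """Sum of squared errors of hypothesis h on the data."""
    a, b = candidates[h]
    return sum((d - (a * x + b)) ** 2 for x, d in zip(xs, ds))

def gaussian_log_likelihood(h):
    """log Pr(D | h) under i.i.d. Gaussian noise with std sigma."""
    a, b = candidates[h]
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (d - (a * x + b)) ** 2 / (2 * sigma ** 2)
               for x, d in zip(xs, ds))

best_by_likelihood = max(candidates, key=gaussian_log_likelihood)
best_by_sse = min(candidates, key=sse)
print(best_by_likelihood, best_by_sse)  # the same hypothesis both ways
```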
Bayesian Classification:
• The question in classification is "What is the best label?", not "What is the best hypothesis?".
• To find the best label, we do a weighted vote over every hypothesis in the hypothesis set, where each hypothesis's weight is its posterior probability Pr(h | D).
• Now we end up trying to maximize v_MAP:
	v_MAP = argmax_{v_j ∈ V} Σ_{h_i ∈ H} Pr(v_j | h_i) Pr(h_i | D)
• The Bayes optimal classifier is computationally very costly. This is because the posterior probability Pr(h | D) must be computed for each hypothesis h ∈ H and combined with the prediction Pr(v | h) before v_MAP can be computed.
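- A minimal sketch of this weighted vote, with hypothetical posteriors Pr(h | D) and per-hypothesis label predictions Pr(v | h) for a binary label set V = {+, -}:

```python
# Bayes optimal classification as a weighted vote over hypotheses.
# The posteriors and per-hypothesis predictions below are hypothetical.

posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}  # Pr(h | D)

# Pr(v | h): each hypothesis's distribution over the labels V = {"+", "-"}.
predictions = {
    "h1": {"+": 1.0, "-": 0.0},
    "h2": {"+": 0.0, "-": 1.0},
    "h3": {"+": 0.0, "-": 1.0},
}

labels = ["+", "-"]

# v_MAP = argmax over v of  sum_h Pr(v | h) Pr(h | D)
votes = {v: sum(predictions[h][v] * posteriors[h] for h in posteriors)
         for v in labels}
v_map = max(votes, key=votes.get)

print(votes)   # {'+': 0.4, '-': 0.6}
print(v_map)   # '-'
```

- Note that with these particular numbers the weighted vote picks "-", even though the single most probable hypothesis (h1) predicts "+"; this is why the Bayes optimal classifier can outperform simply using the prediction of h_MAP.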