Bayesian Learning: Thanks To Nir Friedman, HU
Example
Suppose we are required to build a controller that
removes bad oranges from a packaging line
Decisions are made based on a sensor that reports
the overall color of the orange
[Figure: a color sensor over the packaging line, with bad oranges being diverted]
Classifying oranges
Suppose we know all the aspects of the problem:
Prior Probabilities:
Probability of good (+1) and bad (-1) oranges
Class-conditional densities: p(X | C = -1) and p(X | C = +1)
[Figure: the two class-conditional densities of the sensor reading X]
Bayes Rule
Given this knowledge, we can compute the
posterior probabilities
Bayes Rule:
$$P(C \mid X = x) = \frac{P(C)\,P(X = x \mid C)}{P(X = x)}$$
where the evidence is
$$P(X = x) = P(C = +1)\,P(X = x \mid C = +1) + P(C = -1)\,P(X = x \mid C = -1)$$
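As a concrete illustration, here is a minimal Python sketch of this computation; the prior values and the Gaussian class-conditional densities (means and standard deviations) below are made-up assumptions, not numbers from the lecture.

```python
# A minimal sketch of Bayes rule for the orange example; the priors and the
# Gaussian class-conditional densities are illustrative assumptions.
from math import exp, pi, sqrt

def gaussian_pdf(x, mu, sigma):
    return exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

def posterior_good(x, p_good=0.9, p_bad=0.1):
    """P(C=+1 | X=x) via Bayes rule with assumed class-conditional densities."""
    lik_good = gaussian_pdf(x, mu=0.7, sigma=0.1)   # p(X=x | C=+1), assumed
    lik_bad = gaussian_pdf(x, mu=0.4, sigma=0.15)   # p(X=x | C=-1), assumed
    evidence = p_good * lik_good + p_bad * lik_bad  # P(X=x)
    return p_good * lik_good / evidence

print(posterior_good(0.65))  # posterior probability the orange is good
```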
Posterior of Oranges
[Figure: the data likelihoods p(X | C = -1) and p(X | C = +1); combined with the prior, P(C = -1) p(X | C = -1) and P(C = +1) p(X | C = +1); after normalization, the posteriors P(C = -1 | X) and P(C = +1 | X), ranging from 0 to 1]
Decision making
Intuition:
Predict “Good” if P(C=+1 | X) > P(C=-1 | X)
More generally, choose the action that minimizes the expected loss (risk):
$$R(a \mid X) = \sum_{c} L(a, c)\, P(C = c \mid X)$$
The Risk in Oranges
Loss matrix L(a, c):

             c = -1   c = +1
  a = Bad       1        5
  a = Good     10        0

[Figure: the posteriors P(C = -1 | X), P(C = +1 | X) and the resulting risks R(Good | X), R(Bad | X) as functions of X, on a scale from 0 to 10]
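The decision rule can be checked with a few lines of Python; the loss values come from the table above, while the posterior fed in is just an example input.

```python
# Sketch: expected risk R(a|X) = sum_c L(a,c) P(C=c|X) with the loss matrix
# from the slide; the posterior value used below is only an example input.
LOSS = {                     # LOSS[action][true class]
    "Bad":  {-1: 1, +1: 5},
    "Good": {-1: 10, +1: 0},
}

def risk(action, p_good):
    """Expected loss of `action` given posterior P(C=+1|X) = p_good."""
    posterior = {+1: p_good, -1: 1.0 - p_good}
    return sum(LOSS[action][c] * posterior[c] for c in (-1, +1))

def best_action(p_good):
    return min(("Good", "Bad"), key=lambda a: risk(a, p_good))

print(best_action(0.8))  # P(C=+1|X)=0.8: risk(Good)=2.0, risk(Bad)=4.2 -> Good
```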
Optimal Decisions
Goal:
Minimize risk
Consequence:
With 0/1 loss, $R(a \mid X) = P(a \ne C \mid X)$
Decision rule:
Choose the action $a$ that minimizes $R(a \mid X)$; for 0/1 loss this means predicting the class with the highest posterior
Pros:
Specifies optimal actions in presence of noisy
signals
Can deal with skewed loss functions
Cons:
Requires P(C|X)
Simple Statistics : Binomial Experiment
[Figure: a coin, showing Head and Tail]
When tossed, it can land in one of two positions: Head or Tail
We denote by θ the (unknown) probability P(H).
Estimation task:
Given a sequence of toss samples x[1], x[2], …, x[M], estimate the underlying distribution (i.e., θ)
$$P(\#\text{Heads} = k) = \binom{M}{k}\, \theta^k (1-\theta)^{M-k}$$
and thus $E[\#\text{Heads}] = M\theta$
This suggests that we can estimate $\theta$ by $\dfrac{\#\text{Heads}}{M}$
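A minimal sketch of this estimator, with a made-up toss sequence:

```python
# Estimate theta = P(H) from a sequence of coin tosses by the empirical
# frequency #Heads / M (the MLE derived on the following slides).
tosses = ["H", "T", "H", "H", "T", "H"]   # example data, assumed

def estimate_theta(samples):
    heads = sum(1 for x in samples if x == "H")
    return heads / len(samples)

print(estimate_theta(tosses))  # 4/6 ≈ 0.667
```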
Maximum Likelihood Estimation
MLE Principle:
Learn parameters that maximize the
likelihood function
Maximum A Posteriori (MAP)
Suppose we observe the sequence
H, H
MLE estimate is P(H) = 1, P(T) = 0
$$\text{MLE: } P(H) = \frac{k}{n}$$
$$\text{Laplace correction: } P(H) = \frac{k+1}{n+2}$$
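A small sketch contrasting the two estimates on the H, H example from the slide:

```python
# MLE vs. Laplace-corrected estimate of P(H) for the observed sequence H, H.
def mle(k, n):
    return k / n

def laplace(k, n):
    return (k + 1) / (n + 2)

k, n = 2, 2           # observed: H, H
print(mle(k, n))      # 1.0  -> assigns zero probability to Tail
print(laplace(k, n))  # 0.75 -> leaves some probability for Tail
```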
Bayesian approach: place a prior P(θ) on the parameter instead of assuming a known value
Given examples x1, …, xn, we can compute the posterior distribution on θ:
$$P(\theta \mid x_1, \ldots, x_n) = \frac{P(x_1, \ldots, x_n \mid \theta)\, P(\theta)}{P(x_1, \ldots, x_n)}$$
where the marginal likelihood is
$$P(x_1, \ldots, x_n) = \int P(x_1, \ldots, x_n \mid \theta)\, P(\theta)\, d\theta$$
Binomial Distribution: Laplace Est.
In this case the unknown parameter is θ = P(H)
Simplest prior: P(θ) = 1 for 0 < θ < 1 (uniform)
Likelihood (k heads out of n tosses):
$$P(x_1, \ldots, x_n \mid \theta) = \theta^k (1-\theta)^{n-k}$$
Marginal likelihood:
$$P(x_1, \ldots, x_n) = \int_0^1 \theta^k (1-\theta)^{n-k}\, d\theta$$
Marginal Likelihood
Using integration by parts we have:
$$\int_0^1 \theta^k (1-\theta)^{n-k}\, d\theta
= \left[\frac{\theta^{k+1}}{k+1}(1-\theta)^{n-k}\right]_0^1
+ \frac{n-k}{k+1}\int_0^1 \theta^{k+1}(1-\theta)^{n-k-1}\, d\theta
= \frac{n-k}{k+1}\int_0^1 \theta^{k+1}(1-\theta)^{n-k-1}\, d\theta$$
(the boundary term vanishes at both endpoints)
Multiplying both sides by $\binom{n}{k}$, we have
$$\binom{n}{k}\int_0^1 \theta^k (1-\theta)^{n-k}\, d\theta = \binom{n}{k+1}\int_0^1 \theta^{k+1}(1-\theta)^{n-k-1}\, d\theta$$
Marginal Likelihood - Cont
The recursion terminates when k = n
$$\binom{n}{n}\int_0^1 \theta^n\, d\theta = \frac{1}{n+1}$$
Thus
$$P(x_1, \ldots, x_n) = \int_0^1 \theta^k (1-\theta)^{n-k}\, d\theta = \frac{1}{(n+1)\binom{n}{k}}$$
We conclude that the posterior is
$$P(\theta \mid x_1, \ldots, x_n) = (n+1)\binom{n}{k}\, \theta^k (1-\theta)^{n-k}$$
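The closed form can be sanity-checked numerically; the sketch below compares a simple midpoint-rule integral against 1/((n+1)·C(n,k)) for example values of k and n.

```python
# Numerical check of: integral_0^1 theta^k (1-theta)^(n-k) dtheta
#                     = 1 / ((n+1) * C(n, k)).
from math import comb

def marginal_likelihood_numeric(k, n, steps=100_000):
    """Approximate the integral with a simple midpoint rule."""
    h = 1.0 / steps
    return sum(((i + 0.5) * h) ** k * (1 - (i + 0.5) * h) ** (n - k)
               for i in range(steps)) * h

k, n = 3, 10
print(marginal_likelihood_numeric(k, n))   # ≈ 0.000758
print(1 / ((n + 1) * comb(n, k)))          # exact: 1/1320 ≈ 0.000758
```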
Bayesian Prediction
How do we predict using the posterior?
We can think of this as computing the probability of
the next element in the sequence
$$P(x_{n+1} \mid x_1, \ldots, x_n) = \int P(x_{n+1}, \theta \mid x_1, \ldots, x_n)\, d\theta
= \int P(x_{n+1} \mid \theta, x_1, \ldots, x_n)\, P(\theta \mid x_1, \ldots, x_n)\, d\theta$$
With the uniform prior this gives $P(x_{n+1} = H \mid x_1, \ldots, x_n) = \frac{k+1}{n+2}$, which is exactly the Laplace correction.
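The same integral can be checked numerically; the sketch below integrates θ against the posterior derived above and compares the result to (k+1)/(n+2).

```python
# Bayesian prediction under the uniform prior: the predictive probability of
# Heads equals (k+1)/(n+2), matching the Laplace correction.
from math import comb

def predictive_numeric(k, n, steps=100_000):
    """Integrate theta * posterior(theta) numerically with the midpoint rule."""
    h = 1.0 / steps
    post = lambda t: (n + 1) * comb(n, k) * t ** k * (1 - t) ** (n - k)
    return sum(((i + 0.5) * h) * post((i + 0.5) * h) for i in range(steps)) * h

k, n = 2, 2                      # observed H, H
print(predictive_numeric(k, n))  # ≈ 0.75
print((k + 1) / (n + 2))         # exact: 0.75
```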
Bayesian Classification:
Binary Domain
Consider the following situation:
Two classes: -1, +1
Example dataset:
X1   X2   …   XN   C
 0    1   …    1   +1
 1    0   …    1   -1
 1    1   …    0   +1
 …    …   …    …    …
 0    0   …    0   +1
Binary Domain - Priors
How do we estimate P(C) ?
[Example dataset repeated from the previous slide]
Binary Domain - Attribute Probability
How do we estimate P(X1,…,XN|C) ?
Two sub-problems:
Training set for P(X1,…,XN|C=+1): the instances (rows) with C = +1
Training set for P(X1,…,XN|C=-1): the instances (rows) with C = -1
[Example dataset repeated from the previous slide]
Naïve Bayes
Naïve Bayes:
Assume
$$P(X_1, \ldots, X_N \mid C) = P(X_1 \mid C) \cdots P(X_N \mid C)$$
Simple binomial estimation: count the #1 and #0 values of X1 in the instances where C = +1
[Example dataset repeated from the previous slide]
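A minimal end-to-end sketch of this counting scheme; the tiny dataset, the Laplace-corrected conditional estimates, and the helper names (`train`, `predict`) are illustrative choices, not code from the lecture.

```python
# Naive Bayes with binary attributes: estimate P(C) and each P(Xi | C) by
# counting (with Laplace correction), then classify by the product rule.
from collections import defaultdict

data = [  # (x1, x2, x3, class); layout mirrors the slide's example table
    ((0, 1, 1), +1),
    ((1, 0, 1), -1),
    ((1, 1, 0), +1),
    ((0, 0, 0), +1),
]

def train(examples):
    prior_counts = defaultdict(int)
    attr_counts = defaultdict(lambda: defaultdict(int))  # (i, c) -> value -> count
    for x, c in examples:
        prior_counts[c] += 1
        for i, v in enumerate(x):
            attr_counts[(i, c)][v] += 1
    return prior_counts, attr_counts

def predict(x, prior_counts, attr_counts):
    n = sum(prior_counts.values())
    best, best_score = None, -1.0
    for c, nc in prior_counts.items():
        score = nc / n
        for i, v in enumerate(x):
            # Laplace-corrected estimate of P(Xi = v | C = c)
            score *= (attr_counts[(i, c)][v] + 1) / (nc + 2)
        if score > best_score:
            best, best_score = c, score
    return best

priors, conds = train(data)
print(predict((0, 1, 0), priors, conds))  # most probable class under the model
```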
Interpretation of Naïve Bayes
[Figure: two Gaussian class-conditional densities, N(0, 1²) and N(4, 2²), plotted over the attribute value]
Maximum Likelihood Estimate
Suppose we observe samples x_1, …, x_M
Simple calculations show that the MLE is
$$\hat\mu = \frac{1}{M}\sum_m x_m$$
$$\hat\sigma^2 = \frac{1}{M}\sum_m (x_m - \hat\mu)^2 = \frac{1}{M}\sum_m x_m^2 - \left(\frac{1}{M}\sum_m x_m\right)^2$$
The sufficient statistics are $\sum_m x_m$ and $\sum_m x_m^2$
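A short sketch computing the Gaussian MLE from exactly these two sufficient statistics; the sample values are made up.

```python
# Gaussian MLE from the two sufficient statistics sum(x) and sum(x^2).
samples = [1.2, 0.7, 2.1, 1.5, 0.9]   # example data, assumed

def gaussian_mle(xs):
    M = len(xs)
    s1 = sum(xs)                  # sufficient statistic: sum of x_m
    s2 = sum(x * x for x in xs)   # sufficient statistic: sum of x_m^2
    mu = s1 / M
    var = s2 / M - mu ** 2        # MLE variance (biased, divides by M)
    return mu, var

print(gaussian_mle(samples))
```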
Naïve Bayes with Gaussian Distributions
Recall,
$$P(X_1, \ldots, X_N \mid C) = P(X_1 \mid C) \cdots P(X_N \mid C)$$
Assume: $P(X_i \mid C) \sim N(\mu_{i,C}, \sigma_i^2)$, i.e., the two classes share the same variance, $\sigma_{i,+1}^2 = \sigma_{i,-1}^2 = \sigma_i^2$
Naïve Bayes with Gaussian Distributions
Recall
$$\log \frac{P(+1 \mid X_1, \ldots, X_n)}{P(-1 \mid X_1, \ldots, X_n)} = \log \frac{P(+1)}{P(-1)} + \sum_i \log \frac{P(X_i \mid +1)}{P(X_i \mid -1)}$$
With the shared-variance Gaussian assumption, each term is linear in $X_i$:
$$\log \frac{P(X_i \mid +1)}{P(X_i \mid -1)} = \frac{\mu_{i,+1} - \mu_{i,-1}}{\sigma_i^2}\left(X_i - \frac{\mu_{i,+1} + \mu_{i,-1}}{2}\right)$$
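A sketch of the resulting linear decision rule; all parameter values (priors, means, shared standard deviations) are illustrative.

```python
# Linear classification rule implied by Gaussian Naive Bayes with
# class-shared variances; parameter values below are illustrative.
from math import log

def log_odds(x, prior_pos, prior_neg, mus_pos, mus_neg, sigmas):
    """log P(+1|X)/P(-1|X) = log prior ratio + sum of per-attribute linear terms."""
    total = log(prior_pos / prior_neg)
    for xi, mp, mn, s in zip(x, mus_pos, mus_neg, sigmas):
        total += (mp - mn) / s ** 2 * (xi - (mp + mn) / 2)
    return total

x = [1.0, 3.5]
score = log_odds(x, 0.6, 0.4, mus_pos=[0.0, 4.0], mus_neg=[1.0, 2.0], sigmas=[1.0, 2.0])
print(+1 if score > 0 else -1)
```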
Different Variances?
If we allow different variances, the classification
rule is more complex
The term $\log \frac{P(X_i \mid +1)}{P(X_i \mid -1)}$ is quadratic in $X_i$
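A small sketch showing this numerically; the Gaussian parameters are illustrative, and the x² coefficient noted in the comment follows from expanding the two log densities.

```python
# With different per-class variances the log likelihood ratio for a single
# attribute picks up an x^2 term (coefficient 1/(2*s_neg^2) - 1/(2*s_pos^2)).
from math import log, exp, pi, sqrt

def gaussian_pdf(x, mu, sigma):
    return exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

def log_ratio(x, mu_pos, s_pos, mu_neg, s_neg):
    return log(gaussian_pdf(x, mu_pos, s_pos)) - log(gaussian_pdf(x, mu_neg, s_neg))

# Illustrative parameters: unequal variances make the ratio non-linear in x.
for x in (0.0, 1.0, 2.0):
    print(x, round(log_ratio(x, mu_pos=0.0, s_pos=1.0, mu_neg=4.0, s_neg=2.0), 3))
```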