
Bayesian Learning

Thanks to Nir Friedman, HU

Example
• Suppose we are required to build a controller that removes bad oranges from a packaging line
• Decisions are made based on a sensor that reports the overall color of the orange

[Figure: a packaging line with a sensor that diverts bad oranges into a reject bin]
Classifying oranges
Suppose we know all the aspects of the problem:

Prior probabilities:
• Probability of good (+1) and bad (-1) oranges
• P(C = +1) = probability of a good orange
• P(C = -1) = probability of a bad orange
• Note: P(C = +1) + P(C = -1) = 1
• Assumption: oranges are independent; the occurrence of a bad orange does not depend on previous oranges
Classifying oranges (cont.)
Sensor performance:
• Let X denote the sensor measurement for each type of orange

[Figure: the class-conditional densities p(X | C = -1) and p(X | C = +1) over the sensor reading X]
Bayes Rule
• Given this knowledge, we can compute the posterior probabilities using Bayes rule:

  P(C \mid X = x) = \frac{P(C)\, P(X = x \mid C)}{P(X = x)}

  where

  P(X = x) = P(C = +1)\, P(X = x \mid C = +1) + P(C = -1)\, P(X = x \mid C = -1)
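As an illustration of this computation, here is a minimal sketch in Python; the priors and the Gaussian class-conditional densities for the sensor reading are made-up assumptions, not values from the slides.

```python
from scipy.stats import norm

# Made-up assumptions: priors and Gaussian class-conditional densities
# for the sensor reading X (illustrative only, not from the slides).
prior = {+1: 0.9, -1: 0.1}                    # P(C = +1), P(C = -1)
likelihood = {+1: norm(loc=5.0, scale=1.0),   # p(X | C = +1)
              -1: norm(loc=2.0, scale=1.0)}   # p(X | C = -1)

def posterior(x):
    """Return P(C = c | X = x) for both classes via Bayes rule."""
    joint = {c: prior[c] * likelihood[c].pdf(x) for c in (+1, -1)}
    evidence = sum(joint.values())            # P(X = x)
    return {c: joint[c] / evidence for c in (+1, -1)}

print(posterior(3.0))
```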
Posterior of Oranges

[Figure: the data likelihoods p(X | C = +1) and p(X | C = -1), combined with the priors P(C = +1) and P(C = -1) and then normalized, yield the posteriors P(C = +1 | X) and P(C = -1 | X), which lie between 0 and 1]
Decision making
Intuition:
• Predict "Good" if P(C = +1 | X) > P(C = -1 | X)
• Predict "Bad" otherwise

[Figure: the posteriors P(C = -1 | X) and P(C = +1 | X) plotted against X; the resulting decision regions are "bad", "good", "bad" as X increases]
Loss function
• Assume we have classes +1, -1
• Suppose we can make predictions a_1, …, a_k
• A loss function L(a_i, c_j) describes the loss associated with making prediction a_i when the class is c_j

                     Real label
                      -1    +1
  Prediction  Bad      1     5
              Good    10     0
Expected Risk
• Given the estimates of P(C | X) we can compute the expected conditional risk of each decision:

  R(a \mid X) = \sum_c L(a, c)\, P(C = c \mid X)
The Risk in Oranges

[Figure: using the loss table above, the conditional risks R(Good | X) and R(Bad | X) are plotted against X alongside the posteriors P(C = -1 | X) and P(C = +1 | X)]
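A short sketch of these conditional risks, reusing the hypothetical posterior() function from the earlier snippet and the loss table from the slides; picking the smaller risk anticipates the decision rule on the next slide.

```python
# Loss table from the slides: LOSS[prediction][true class]
LOSS = {"Bad":  {-1: 1,  +1: 5},
        "Good": {-1: 10, +1: 0}}

def expected_risk(action, post):
    """R(a | X) = sum_c L(a, c) * P(C = c | X)."""
    return sum(LOSS[action][c] * post[c] for c in (+1, -1))

def decide(x):
    """Choose the action with the smallest expected conditional risk."""
    post = posterior(x)   # hypothetical function from the earlier sketch
    return min(("Good", "Bad"), key=lambda a: expected_risk(a, post))

print(decide(3.0))
```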
Optimal Decisions
Goal:
• Minimize risk

Optimal decision rule:
• "Given X = x, predict a_i if R(a_i | X = x) = min_a R(a | X = x)"
• (break ties arbitrarily)

Note: randomized decisions do not help
0-1 Loss
• If we don't have prior knowledge, it is common to use the 0-1 loss:
  • L(a, c) = 0 if a = c
  • L(a, c) = 1 otherwise

Consequence:
• R(a | X) = P(a ≠ c | X)
• Decision rule: "choose a_i if P(C = a_i | X) = max_a P(C = a | X)"
Bayesian Decisions: Summary
Decisions are based on two components:
• Conditional distribution P(C | X)
• Loss function L(A, C)

Pros:
• Specifies optimal actions in the presence of noisy signals
• Can deal with skewed loss functions

Cons:
• Requires P(C | X)
Simple Statistics: Binomial Experiment

[Figure: a tossed thumbtack landing as Head or Tail]
• When tossed, the thumbtack can land in one of two positions: Head or Tail
• We denote by θ the (unknown) probability P(H)

Estimation task:
• Given a sequence of toss samples x[1], x[2], …, x[M], we want to estimate the probabilities P(H) = θ and P(T) = 1 - θ
Why Learning is Possible?
Suppose we perform M independent flips of the thumbtack
• The number of heads we see has a binomial distribution:

  P(\#\text{Heads} = k) = \binom{M}{k} \theta^k (1 - \theta)^{M-k}

• and thus E[\#\text{Heads}] = M\theta
• This suggests that we can estimate θ by \#\text{Heads} / M
Maximum Likelihood Estimation

MLE Principle:
Learn parameters that maximize the likelihood function

• This is one of the most commonly used estimators in statistics
• Intuitively appealing
• Well-studied properties
Computing the Likelihood Function
• To compute the likelihood in the thumbtack example we only require N_H and N_T (the number of heads and the number of tails):

  L(\theta : D) = \theta^{N_H} (1 - \theta)^{N_T}

• Applying the MLE principle we get

  \hat{\theta} = \frac{N_H}{N_H + N_T}

• N_H and N_T are sufficient statistics for the binomial distribution
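A tiny sketch of this computation; the toss sequence below is made up for illustration.

```python
# Hypothetical toss sequence: 'H' = head, 'T' = tail.
tosses = "HHTHTTHHHT"

n_heads = tosses.count("H")   # sufficient statistic N_H
n_tails = tosses.count("T")   # sufficient statistic N_T

# MLE: theta_hat = N_H / (N_H + N_T)
theta_hat = n_heads / (n_heads + n_tails)
print(theta_hat)              # 0.6 for the sequence above
```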
Sufficient Statistics
• A sufficient statistic is a function of the data that summarizes the relevant information for the likelihood
• Formally, s(D) is a sufficient statistic if for any two datasets D and D':

  s(D) = s(D')  ⇒  L(θ | D) = L(θ | D')

[Figure: many different datasets map to the same sufficient statistics]
Maximum A Posteriori (MAP)
• Suppose we observe the sequence H, H
• The MLE estimate is P(H) = 1, P(T) = 0
• Should we really believe that tails are impossible at this stage?
• Such an estimate can have a disastrous effect
• If we assume that P(T) = 0, then we are willing to act as though this outcome is impossible
Laplace Correction
Suppose we observe n coin flips with k heads

• MLE:  P(H) = \frac{k}{n}

• Laplace correction:  P(H) = \frac{k + 1}{n + 2}

• As though we observed one additional H and one additional T
• Can we justify this estimate? Uniform prior!
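A brief sketch comparing the MLE with the Laplace-corrected estimate; the two-heads example mirrors the MAP slide above.

```python
def mle(k, n):
    """Maximum likelihood estimate of P(H) from k heads in n flips."""
    return k / n

def laplace(k, n):
    """Laplace correction: as though we saw one extra H and one extra T."""
    return (k + 1) / (n + 2)

# Observing the sequence H, H (k = 2, n = 2):
print(mle(2, 2))      # 1.0  -> tails would be treated as impossible
print(laplace(2, 2))  # 0.75 -> tails remain possible
```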
Bayesian Reasoning
• In Bayesian reasoning we represent our uncertainty about the unknown parameter θ by a probability distribution
• This probability distribution can be viewed as a subjective probability
• It is a personal judgment of uncertainty
Bayesian Inference
We start with:
• P(θ) - prior distribution over the values of θ
• P(x_1, …, x_n | θ) - likelihood of the examples given a known value θ

Given examples x_1, …, x_n, we can compute the posterior distribution on θ:

  P(\theta \mid x_1, \ldots, x_n) = \frac{P(x_1, \ldots, x_n \mid \theta)\, P(\theta)}{P(x_1, \ldots, x_n)}

where the marginal likelihood is

  P(x_1, \ldots, x_n) = \int P(x_1, \ldots, x_n \mid \theta)\, P(\theta)\, d\theta
Binomial Distribution: Laplace Est.
• In this case the unknown parameter is θ = P(H)
• Simplest prior: P(θ) = 1 for 0 < θ < 1
• Likelihood:

  P(x_1, \ldots, x_n \mid \theta) = \theta^k (1 - \theta)^{n-k}

  where k is the number of heads in the sequence

• Marginal likelihood:

  P(x_1, \ldots, x_n) = \int_0^1 \theta^k (1 - \theta)^{n-k}\, d\theta
Marginal Likelihood
Using integration by parts we have:

  \int_0^1 \theta^k (1 - \theta)^{n-k}\, d\theta
    = \left[ \frac{\theta^{k+1}}{k+1} (1 - \theta)^{n-k} \right]_0^1
      + \frac{n-k}{k+1} \int_0^1 \theta^{k+1} (1 - \theta)^{n-k-1}\, d\theta
    = \frac{n-k}{k+1} \int_0^1 \theta^{k+1} (1 - \theta)^{n-k-1}\, d\theta

Multiplying both sides by \binom{n}{k}, we have

  \binom{n}{k} \int_0^1 \theta^k (1 - \theta)^{n-k}\, d\theta
    = \binom{n}{k+1} \int_0^1 \theta^{k+1} (1 - \theta)^{n-k-1}\, d\theta
Marginal Likelihood (cont.)
• The recursion terminates when k = n:

  \binom{n}{n} \int_0^1 \theta^n (1 - \theta)^0\, d\theta = \int_0^1 \theta^n\, d\theta = \frac{1}{n+1}

Thus

  P(x_1, \ldots, x_n) = \int_0^1 \theta^k (1 - \theta)^{n-k}\, d\theta = \frac{1}{(n+1) \binom{n}{k}}

We conclude that the posterior is

  P(\theta \mid x_1, \ldots, x_n) = (n+1) \binom{n}{k}\, \theta^k (1 - \theta)^{n-k}
Bayesian Prediction
• How do we predict using the posterior?
• We can think of this as computing the probability of the next element in the sequence:

  P(x_{n+1} \mid x_1, \ldots, x_n)
    = \int P(x_{n+1}, \theta \mid x_1, \ldots, x_n)\, d\theta
    = \int P(x_{n+1} \mid \theta, x_1, \ldots, x_n)\, P(\theta \mid x_1, \ldots, x_n)\, d\theta
    = \int P(x_{n+1} \mid \theta)\, P(\theta \mid x_1, \ldots, x_n)\, d\theta

• Assumption: if we know θ, the probability of X_{n+1} is independent of X_1, …, X_n:

  P(x_{n+1} \mid \theta, x_1, \ldots, x_n) = P(x_{n+1} \mid \theta)
Bayesian Prediction
• Thus, we conclude that

  P(x_{n+1} = H \mid x_1, \ldots, x_n)
    = \int P(x_{n+1} = H \mid \theta)\, P(\theta \mid x_1, \ldots, x_n)\, d\theta
    = \int \theta\, P(\theta \mid x_1, \ldots, x_n)\, d\theta
    = (n+1) \binom{n}{k} \int_0^1 \theta^{k+1} (1 - \theta)^{n-k}\, d\theta
    = (n+1) \binom{n}{k}\, \frac{1}{(n+2) \binom{n+1}{k+1}}
    = \frac{k+1}{n+2}
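A quick numerical check of this result (assuming scipy is available); the values of n and k are arbitrary.

```python
from math import comb
from scipy.integrate import quad

n, k = 10, 6   # arbitrary example: 6 heads in 10 tosses

# Posterior: P(theta | x_1..x_n) = (n+1) * C(n, k) * theta^k * (1-theta)^(n-k)
posterior = lambda t: (n + 1) * comb(n, k) * t**k * (1 - t)**(n - k)

# P(x_{n+1} = H | x_1..x_n) = integral of theta * posterior(theta) over [0, 1]
pred, _ = quad(lambda t: t * posterior(t), 0.0, 1.0)

print(pred)               # numerical integral
print((k + 1) / (n + 2))  # closed form (k+1)/(n+2) = 7/12
```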
Naïve Bayes

Bayesian Classification: Binary Domain
Consider the following situation:
• Two classes: -1, +1
• Each example is described by N attributes
• Each X_n is a binary variable with values 0, 1

Example dataset:

  X1  X2  …  XN   C
   0   1  …   1  +1
   1   0  …   1  -1
   1   1  …   0  +1
   …   …  …   …   …
   0   0  …   0  +1
Binary Domain - Priors
How do we estimate P(C)?
• Simple binomial estimation
• Count the number of instances with C = -1 and with C = +1 (in the example dataset above)
Binary Domain - Attribute Probability
How do we estimate P(X1, …, XN | C)?
Two sub-problems:
• Training set for P(X1, …, XN | C = +1): the instances with C = +1
• Training set for P(X1, …, XN | C = -1): the instances with C = -1
Naïve Bayes
Naïve Bayes:
• Assume

  P(X_1, \ldots, X_N \mid C) = P(X_1 \mid C) \cdots P(X_N \mid C)

• This is an independence assumption
• Each attribute X_i is independent of the other attributes once we know the value of C
Naïve Bayes: Boolean Domain
• Parameters, for each i:
  θ_{i|+1} = P(X_i = 1 | C = +1)
  θ_{i|-1} = P(X_i = 1 | C = -1)

How do we estimate θ_{1|+1}?
• Simple binomial estimation
• Count the number of 1 and 0 values of X1 in the instances where C = +1
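A compact sketch of these counting estimates on a tiny made-up dataset (the rows below are illustrative, not the ones on the slides); in practice a Laplace correction would be applied to the counts.

```python
import numpy as np

# Illustrative dataset: columns are X1..X3, C holds the class labels.
X = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0],
              [0, 0, 0]])
C = np.array([+1, -1, +1, +1])

# Class priors P(C = c) by simple counting.
prior = {c: np.mean(C == c) for c in (+1, -1)}

# theta[c][i] = P(X_i = 1 | C = c): fraction of 1s for X_i among class-c rows.
theta = {c: X[C == c].mean(axis=0) for c in (+1, -1)}

print(prior)       # e.g. {1: 0.75, -1: 0.25}
print(theta[+1])   # estimated P(X_i = 1 | C = +1) for each attribute i
```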
Interpretation of Naïve Bayes

  \log \frac{P(+1 \mid X_1, \ldots, X_n)}{P(-1 \mid X_1, \ldots, X_n)}
    = \log \frac{P(X_1, \ldots, X_n \mid +1)\, P(+1)}{P(X_1, \ldots, X_n \mid -1)\, P(-1)}
    = \log \frac{P(+1)}{P(-1)} + \log \prod_i \frac{P(X_i \mid +1)}{P(X_i \mid -1)}
    = \log \frac{P(+1)}{P(-1)} + \sum_i \log \frac{P(X_i \mid +1)}{P(X_i \mid -1)}
Interpretation of Naïve Bayes

  \log \frac{P(+1 \mid X_1, \ldots, X_n)}{P(-1 \mid X_1, \ldots, X_n)}
    = \log \frac{P(+1)}{P(-1)} + \sum_i \log \frac{P(X_i \mid +1)}{P(X_i \mid -1)}

• Each X_i "votes" about the prediction
• If P(X_i | C = -1) = P(X_i | C = +1) then X_i has no say in the classification
• If P(X_i | C = -1) = 0 then X_i overrides all other votes (a "veto")
Interpretation of Naïve Bayes

  \log \frac{P(+1 \mid X_1, \ldots, X_n)}{P(-1 \mid X_1, \ldots, X_n)}
    = \log \frac{P(+1)}{P(-1)} + \sum_i \log \frac{P(X_i \mid +1)}{P(X_i \mid -1)}

Set

  w_i = \log \frac{P(X_i = 1 \mid +1)}{P(X_i = 1 \mid -1)} - \log \frac{P(X_i = 0 \mid +1)}{P(X_i = 0 \mid -1)}

  b = \log \frac{P(+1)}{P(-1)} + \sum_i \log \frac{P(X_i = 0 \mid +1)}{P(X_i = 0 \mid -1)}

Classification rule:  \mathrm{sign}\!\left(b + \sum_i w_i x_i\right)
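A sketch of this linear form, reusing the hypothetical prior and theta estimates from the previous snippet; the probabilities are clipped to avoid log(0), which a Laplace correction would handle more gracefully.

```python
import numpy as np

# Assumes `prior` and `theta` from the earlier counting sketch,
# with theta[c][i] = P(X_i = 1 | C = c).
p_plus = np.clip(theta[+1], 1e-9, 1 - 1e-9)
p_minus = np.clip(theta[-1], 1e-9, 1 - 1e-9)

# Weights and bias of the equivalent linear classifier.
w = np.log(p_plus / p_minus) - np.log((1 - p_plus) / (1 - p_minus))
b = np.log(prior[+1] / prior[-1]) + np.sum(np.log((1 - p_plus) / (1 - p_minus)))

def classify(x):
    """Naive Bayes as a linear rule: sign(b + sum_i w_i * x_i)."""
    return int(np.sign(b + np.dot(w, x)))

print(classify(np.array([0, 1, 1])))   # returns +1 or -1
```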
Normal Distribution
The Gaussian distribution:

  X \sim N(\mu, \sigma^2) \quad \text{if} \quad p(x) = \frac{1}{\sqrt{2\pi}\, \sigma}\, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}

[Figure: the densities of N(0, 1^2) and N(4, 2^2)]
Maximum Likelihood Estimate
• Suppose we observe x_1, …, x_M
• Simple calculations show that the MLE is

  \hat{\mu} = \frac{1}{M} \sum_m x_m

  \hat{\sigma}^2 = \frac{1}{M} \sum_m (x_m - \hat{\mu})^2
    = \frac{1}{M} \sum_m x_m^2 - 2\hat{\mu}\, \frac{1}{M} \sum_m x_m + \hat{\mu}^2

• The sufficient statistics are  \frac{1}{M} \sum_m x_m  and  \frac{1}{M} \sum_m x_m^2
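A quick sketch of the Gaussian MLE computed from the two sufficient statistics; the sample values are arbitrary.

```python
import numpy as np

x = np.array([2.1, 3.4, 1.8, 2.9, 3.0])   # arbitrary observations

M = len(x)
s1 = np.sum(x) / M       # sufficient statistic: (1/M) * sum of x_m
s2 = np.sum(x**2) / M    # sufficient statistic: (1/M) * sum of x_m^2

mu_hat = s1
sigma2_hat = s2 - mu_hat**2   # same as np.mean((x - mu_hat)**2)

print(mu_hat, sigma2_hat)
```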
Naïve Bayes with Gaussian Distributions
• Recall,

  P(X_1, \ldots, X_N \mid C) = P(X_1 \mid C) \cdots P(X_N \mid C)

• Assume:  P(X_i \mid C) \sim N(\mu_{i,C},\, \sigma_i^2)
  • The mean of X_i depends on the class
  • The variance of X_i does not

[Figure: two Gaussians for X_i with means \mu_{i,+1} and \mu_{i,-1} and equal variance]
Naïve Bayes with Gaussian Distributions
Recall

  \log \frac{P(+1 \mid X_1, \ldots, X_n)}{P(-1 \mid X_1, \ldots, X_n)}
    = \log \frac{P(+1)}{P(-1)} + \sum_i \log \frac{P(X_i \mid +1)}{P(X_i \mid -1)}

With the Gaussian assumption,

  \log \frac{P(X_i \mid +1)}{P(X_i \mid -1)}
    = \frac{\mu_{i,+1} - \mu_{i,-1}}{\sigma_i} \cdot \frac{1}{\sigma_i}\left(X_i - \frac{\mu_{i,+1} + \mu_{i,-1}}{2}\right)

• \mu_{i,+1} - \mu_{i,-1}: the distance between the means
• X_i - \frac{\mu_{i,+1} + \mu_{i,-1}}{2}: the distance of X_i to the midway point
Different Variances?
• If we allow different variances, the classification rule is more complex
• The term \log \frac{P(X_i \mid +1)}{P(X_i \mid -1)} is then quadratic in X_i
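To tie the last two slides together, here is a sketch of the equal-variance Gaussian Naïve Bayes rule on made-up data; allowing class-dependent variances would add the quadratic term mentioned above.

```python
import numpy as np

# Illustrative continuous data: rows are examples, columns are attributes.
X = np.array([[5.1, 3.0],
              [4.8, 3.2],
              [2.0, 1.1],
              [2.3, 0.9]])
C = np.array([+1, +1, -1, -1])

prior = {c: np.mean(C == c) for c in (+1, -1)}
mu = {c: X[C == c].mean(axis=0) for c in (+1, -1)}   # class-dependent means

# Shared per-attribute variance: pooled within-class squared deviations.
resid = np.concatenate([X[C == c] - mu[c] for c in (+1, -1)])
sigma2 = np.mean(resid**2, axis=0)

def log_odds(x):
    """log P(+1 | x) / P(-1 | x) under the equal-variance Gaussian model."""
    lo = np.log(prior[+1] / prior[-1])
    lo += np.sum((mu[+1] - mu[-1]) / sigma2 * (x - (mu[+1] + mu[-1]) / 2))
    return lo

print(np.sign(log_odds(np.array([4.5, 2.8]))))   # expected: +1
```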
