
Bayesian Basics

Ryan P. Adams

These are notes to help clarify things and create context. Please note that they are not a replacement for the readings.
There are a lot of ideas that seem to carry the name Bayesian, and so it can be unclear sometimes what this word actually means. At a high level, however, it is about being willing to use probability distributions to represent unknown quantities that are not necessarily random. That is, using probability to capture degrees of belief in which the uncertainty may be entirely in one's own head or the state of an algorithm. For example, we might have some noisy astronomical information about the question of how many rings Saturn has. We might have some evidence supporting some number of rings, but it is noisy and incomplete, so we are uncertain. A Bayesian is willing to place a probability distribution on this quantity and represent it as a random variable, because she is uncertain. A frequentist might assert that there is no possibility of repeating a random event that produces a new Saturn with a different number of rings, and so it is inappropriate to consider this a random variable with a probability distribution: there is an unknown truth and we must estimate it. This is a deep philosophical question that has been debated for a long time. In this class, we're going to take it as a given that some kinds of machine learning problems are noisy and uncertain and that it can be useful to reason about these using the calculus of probabilities.
The Bayesian model for machine learning is appealing for a few reasons. First, as I've said, it allows one to represent beliefs in the presence of noise. However, it also allows you to integrate out that uncertainty and account for it when making decisions and predictions from data. It provides a coherent way to balance old data against new data and accumulate more information as it arrives. It also enables one to separate out modeling assumptions from fitting (inference) procedures and separate algorithmic concerns from our inductive biases. Finally, it enables us to handle difficult tasks like model selection in a clear and rigorous way. Being Bayesian is not the only approach to machine learning and statistics, but it can be a nice one for many problems due to these and other properties.
Personally, I like to think of Bayesian inference as a kind of hypothesis processing machine. Imagine that there is a space of possible (unobserved) states of the world θ and we'd like to reason about them. Let's imagine that our a priori beliefs about the world are captured by a prior distribution p(θ). Now, we see some data and those data are coupled to these hypotheses via a likelihood function p(data | θ). This likelihood function is a distribution over data, given a state of the world. The environment gives the data to us and we're stuck with it, so we think of likelihood functions as being functions over their parameters; this can be somewhat confusing because those parameters appear behind the vertical bars.¹ In any case, this hypothesis processing machine has two steps: first, multiply these two functions together pointwise as a function of θ; second, normalize so that you get a probability density back on θ. This multiplication penalizes values of θ that assign low probability to the data, and upweights those that assign high probability to the data.

prior p(θ)  →  multiply by p(data | θ)  →  divide by ∫ p(data | θ) p(θ) dθ  →  posterior p(θ | data)

¹ Hard-core frequentists might say that you can't condition on something that isn't a random variable, and so therefore the likelihood function should be written as a parameterized family of densities like f_θ(data).
Bayes' theorem really is that simple:

p(θ | data) = p(θ) p(data | θ) / ∫ p(θ′) p(data | θ′) dθ′ .

Here I'm using θ′ instead of θ to make it clearer that this denominator doesn't depend on a specific value of θ, but is an integral over all values. It is the normalization constant for this distribution over θ, often called the marginal likelihood:

p(data) = ∫ p(θ) p(data | θ) dθ .
Conjugacy
It is often the case that your prior might have a simple form that you get to choose, but after you multiply it by one or more likelihood functions, it starts to become complicated. This typically means that you can evaluate it pointwise, but only up to a constant, because the marginal likelihood integral becomes intractable. There are a variety of methods out there for dealing with this (common) situation, with the two most popular ones being Markov chain Monte Carlo and variational inference. These more advanced techniques are out of the scope of this course, so we will instead focus on the important situations in which the posterior distribution has the same form as the prior distribution. Likelihoods that have this form are generally exponential family distributions, and the priors that are closed under the corresponding Bayesian updates are called conjugate priors. We won't go into too much detail about exponential family distributions in this course, but the basic idea is that these distributions have the form

p(data | θ) ∝ exp{ θ^T T(data) } ,

that is, the log densities are linear in the parameters. Here I'm imagining that θ is now a real-valued vector. The vector function T(data) provides the sufficient statistics. The book goes into more detail about exponential families and why they are interesting, and CS281 discusses these topics further.
In any case, the reason these kinds of likelihoods are convenient is because if we have a prior that looks like

p(θ) ∝ exp{ θ^T ν } ,

then when you multiply it by one of those likelihoods, things inside the exponential function just add:

p(θ | data) ∝ exp{ θ^T (ν + T(data)) } .

I'm skipping a ton of details here about the conjugate prior setup. Bishop 2.4 has a more rigorous and thorough treatment in which these equations have all the other terms to make the math work out correctly with normalization constants and such. Nevertheless, this is the high-level idea: make things multiply nicely.

Example: Beta-Binomial
The Bernoulli distribution is just the coin flip distribution with some bias θ ∈ (0, 1). We'll say that a heads is 1 and a tails is 0. The probability mass function for the Bernoulli is

Pr(X = x | θ) = θ^x (1 − θ)^{1−x} .

You can write this as an exponential family using the natural parameterization η = ln{θ/(1 − θ)}:

θ^x (1 − θ)^{1−x} = exp{ x ln θ + (1 − x) ln(1 − θ) }
                  ∝ exp{ x ln θ − x ln(1 − θ) }
                  = exp{ x ln( θ/(1 − θ) ) } .
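To make the natural parameterization concrete, here is a quick numerical check (a sketch, not part of the original notes) that the Bernoulli pmf θ^x (1 − θ)^{1−x} equals (1 − θ) · exp{x · ln(θ/(1 − θ))}, i.e., the exponential-family form up to a factor that does not depend on x:

```python
import numpy as np

def bernoulli_pmf(x, theta):
    """Standard Bernoulli pmf: theta^x * (1 - theta)^(1 - x)."""
    return theta**x * (1.0 - theta)**(1 - x)

def bernoulli_expfam(x, theta):
    """Exponential-family form: (1 - theta) * exp{x * eta},
    where eta = ln(theta / (1 - theta)) is the natural parameter."""
    eta = np.log(theta / (1.0 - theta))
    return (1.0 - theta) * np.exp(x * eta)

for theta in [0.2, 0.5, 0.9]:
    for x in [0, 1]:
        assert np.isclose(bernoulli_pmf(x, theta), bernoulli_expfam(x, theta))
print("Bernoulli pmf matches its exponential-family form.")
```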
The conjugate prior for the Bernoulli distribution is the beta distribution, which is a density on the interval (0, 1) given by

p(θ | α, β) = [ Γ(α + β) / (Γ(α) Γ(β)) ] θ^{α−1} (1 − θ)^{β−1} .
Bishop 2.1.1 has nice pictures of the beta distribution and discusses some further properties. Now, in class I sort of threw around "let's imagine you see J heads and K tails," but if you were paying close attention, you know I was skipping over some important pieces. If you have J heads out of J + K tosses, then what you really have is a binomial distribution and there is a binomial coefficient out there in the likelihood:

Pr(J heads, K tails | θ) = (J + K choose J) θ^J (1 − θ)^K .

The binomial coefficient doesn't affect the math we do for the Bayesian update, however, since it just gets sucked into the normalization constant. To see how this works, we first denote our prior parameters for the beta distribution as α_0 and β_0.
p(θ) = Beta(θ | α_0, β_0)

p(θ | J heads, K tails, α_0, β_0) ∝ [ Γ(α_0 + β_0)/(Γ(α_0) Γ(β_0)) θ^{α_0 − 1} (1 − θ)^{β_0 − 1} ] × [ (J + K choose J) θ^J (1 − θ)^K ]
                                  ∝ θ^{α_0 + J − 1} (1 − θ)^{β_0 + K − 1}
                                  = Beta(θ | α = α_0 + J, β = β_0 + K) .

The first bracketed factor is the prior and the second is the likelihood. So after seeing these flips we get a new beta distribution back out!
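As a small illustration (a sketch with made-up counts, not taken from the notes), the update is literally just adding the observed counts to the prior parameters; scipy's beta distribution can then be used to inspect the posterior:

```python
from scipy.stats import beta

# Hypothetical prior pseudo-counts and observed coin flips.
alpha0, beta0 = 2.0, 2.0   # prior Beta(alpha0, beta0)
J, K = 7, 3                # J heads, K tails observed

# Conjugate update: the posterior is Beta(alpha0 + J, beta0 + K).
alpha_N, beta_N = alpha0 + J, beta0 + K
posterior = beta(alpha_N, beta_N)

print("posterior mean of theta:", posterior.mean())
print("95% credible interval:", posterior.interval(0.95))
```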

Bayesian Updates for Gaussians with Known Covariance


Next, let's consider a very slightly more complex model: multivariate Gaussian data with known covariance. That is, we're imagining that we have N data in R^D that have been drawn from a D-dimensional Gaussian distribution with unknown mean μ but known covariance matrix Σ. Gaussian distributions are very fundamental and I am going to assume that you've seen them before and are comfortable with them and what covariance matrices are, etc. For a review, see Bishop 2.3. The probability density function for a Gaussian with this parameterization is

N(x | μ, Σ) = (2π)^{−D/2} |Σ|^{−1/2} exp{ −(1/2) (x − μ)^T Σ^{−1} (x − μ) } .
This is the likelihood function, in terms of μ. The conjugate prior for μ in this case is another Gaussian. We'll denote the prior parameters for that Gaussian as m_0 and S_0. We encounter N data {x_n}_{n=1}^N and now we would like the posterior distribution on μ:

p(μ | m_0, S_0, Σ, {x_n}_{n=1}^N) ∝ N(μ | m_0, S_0) ∏_{n=1}^N N(x_n | μ, Σ)

                                  ∝ exp{ −(1/2) (μ − m_0)^T S_0^{−1} (μ − m_0) } ∏_{n=1}^N exp{ −(1/2) (x_n − μ)^T Σ^{−1} (x_n − μ) } .

Here the first factor is the prior and the product over n is the likelihood.
I've thrown out all of the factors that did not involve μ. It's usually convenient to write this all in log space:

ln p(μ | m_0, S_0, Σ, {x_n}_{n=1}^N) = const − (1/2) [ (μ − m_0)^T S_0^{−1} (μ − m_0) + ∑_{n=1}^N (x_n − μ)^T Σ^{−1} (x_n − μ) ]

                                     = const − (1/2) [ μ^T S_0^{−1} μ − 2 μ^T S_0^{−1} m_0 − 2 μ^T Σ^{−1} ∑_{n=1}^N x_n + N μ^T Σ^{−1} μ ] .

Here I just expanded the two quadratic forms and stuck terms that don't depend on μ into the constant out front. I also observed that the x_n only participate in one of the terms. Now we collapse the like terms and write x̄_N = (1/N) ∑_n x_n for the sample mean of the data:

ln p(μ | m_0, S_0, Σ, {x_n}_{n=1}^N) = const − (1/2) [ μ^T (S_0^{−1} + N Σ^{−1}) μ − 2 μ^T (S_0^{−1} m_0 + N Σ^{−1} x̄_N) ] .
We now complete the square and write this as a quadratic form, which results in some other things getting baked into the constant. We don't care about any of these constants because we have to normalize later anyway. Remember that the (log) normalization constant is just a number we subtract in log space.

ln p(μ | m_0, S_0, Σ, {x_n}_{n=1}^N) = const − (1/2) (μ − m_N)^T S_N^{−1} (μ − m_N)

S_N = (S_0^{−1} + N Σ^{−1})^{−1}

m_N = S_N (S_0^{−1} m_0 + N Σ^{−1} x̄_N) .
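Here is a minimal numpy sketch of this update, using synthetic data and made-up prior parameters (none of these particular numbers come from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

D, N = 2, 50
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])        # known covariance
mu_true = np.array([1.0, -2.0])        # unknown mean we are trying to infer
X = rng.multivariate_normal(mu_true, Sigma, size=N)

m0 = np.zeros(D)                       # prior mean
S0 = 10.0 * np.eye(D)                  # broad prior covariance

Sigma_inv = np.linalg.inv(Sigma)
S0_inv = np.linalg.inv(S0)

# Posterior covariance and mean from completing the square:
#   S_N = (S0^{-1} + N Sigma^{-1})^{-1}
#   m_N = S_N (S0^{-1} m0 + N Sigma^{-1} x_bar)
x_bar = X.mean(axis=0)
S_N = np.linalg.inv(S0_inv + N * Sigma_inv)
m_N = S_N @ (S0_inv @ m0 + N * Sigma_inv @ x_bar)

print("posterior mean m_N:", m_N)      # close to x_bar since the prior is weak
print("posterior covariance S_N:\n", S_N)
```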

So now, in log space, we have a quadratic form, and so we can see we'll wind up with a Gaussian in μ if we exponentiate and normalize! We can compute the posterior mean and covariance, denoted m_N and S_N, respectively, using a little bit of linear algebra.² A few things to think about to help get some intuition:

The more data you get, the bigger N will be and the more relative effect Σ^{−1} and Σ^{−1} x̄_N will have on S_N and m_N, respectively. This feels right because more data means that the prior should be overwhelmed.

Consider what a strong prior would look like for μ, in the case where the dimensions were independent a priori. The matrix S_0 would have small positive numbers on the diagonal and zeros off of it. When we took its inverse, it would still be a diagonal matrix, but now the values on the diagonal would be big. These would be competing with N Σ^{−1} to center the posterior mean at m_0 instead of x̄_N. This also feels right because a strong prior should compete more with the data.

Note that the posterior covariance depends on the data only through N and not on the actual values of x_n. That means that our uncertainty in this case is entirely a function of the number of data. Note that this wouldn't be the case if Σ was unknown.

It is a useful exercise to work through this same math where both μ and Σ are unknown. In this case, there is still a conjugate prior, but now it is a more complicated distribution called a Normal-Inverse-Wishart distribution. The Wishart distribution is a distribution over positive definite matrices that is conjugate to Gaussian likelihoods with unknown covariances. Bishop goes through this in 2.3.6.

Bayesian Linear Regression


We now have the tools to revisit linear regression in a Bayesian setting. Recall that our data are now pairs, {x_n, t_n}_{n=1}^N. We'll assume that there are some basis functions φ and that our inputs become a design matrix Φ, which has N rows and J columns. The targets are real-valued and we stack them into a column vector t ∈ R^N. Our regression model assumes independent zero-mean Gaussian noise with precision β. Our weight parameter is a J-dimensional w, and so we're saying the labels arise as

t_n = φ(x_n)^T w + ε_n ,    ε_n ~ N(0, β^{−1}) .

This becomes a likelihood function for the nth datum via

p(t_n | x_n, w, β) = N(t_n | φ(x_n)^T w, β^{−1}) .

The noise is independent, so we can write the likelihood for all N data as

p(t | {x_n}_{n=1}^N, w, β) = ∏_{n=1}^N N(t_n | φ(x_n)^T w, β^{−1})

                           = N(t | Φ w, β^{−1} I_N) .
² It's pretty common to use the subscript N to denote posterior parameters. This is a kind of reflection of using the subscript 0 for prior parameters; in the beginning you have zero data and afterward you have N data.

Here we're writing this likelihood as a big multivariate Gaussian rather than a product of univariate ones, but it is exactly the same. I'm using the notation I_N to indicate an N × N identity matrix. If you find this switch to matrix notation confusing, it might be worth working through it to convince yourself that it is correct; the sketch below checks it numerically.
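A minimal check, with made-up basis matrix, weights, and noise precision (all hypothetical, just to verify the equivalence of the two forms):

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(1)

N, J = 5, 3
Phi = rng.normal(size=(N, J))      # design matrix (rows are phi(x_n)^T)
w = rng.normal(size=J)             # some weight vector
beta = 4.0                         # noise precision
t = Phi @ w + rng.normal(scale=beta**-0.5, size=N)

# Product of N univariate Gaussians, in log space.
log_lik_univariate = norm.logpdf(t, loc=Phi @ w, scale=beta**-0.5).sum()

# One big N-dimensional Gaussian with covariance beta^{-1} I_N.
log_lik_multivariate = multivariate_normal.logpdf(t, mean=Phi @ w,
                                                  cov=np.eye(N) / beta)

assert np.isclose(log_lik_univariate, log_lik_multivariate)
print("Univariate product and multivariate forms agree.")
```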
To do Bayesian linear regression, we'll need to put a prior on the weights w. The convenient conjugate prior is a Gaussian and, as before, we'll use prior parameters m_0 and S_0; Bishop does the same in Equation (3.48). We proceed exactly as in the simple Gaussian case and write down the prior and likelihood to get the posterior:

p(w | t, Φ, β, m_0, S_0) ∝ N(w | m_0, S_0) N(t | Φ w, β^{−1} I_N) .
We move to log space and collapse constants, as before:

ln p(w | t, Φ, β, m_0, S_0) = const − (1/2) [ (w − m_0)^T S_0^{−1} (w − m_0) + β (t − Φ w)^T (t − Φ w) ]

                            = const − (1/2) [ w^T S_0^{−1} w − 2 w^T S_0^{−1} m_0 − 2β w^T Φ^T t + β w^T Φ^T Φ w ] .

Collect the quadratic and linear terms:

ln p(w | t, Φ, β, m_0, S_0) = const − (1/2) [ w^T (S_0^{−1} + β Φ^T Φ) w − 2 w^T (S_0^{−1} m_0 + β Φ^T t) ] .

Complete the square:

ln p(w | t, Φ, β, m_0, S_0) = const − (1/2) (w − m_N)^T S_N^{−1} (w − m_N)

S_N = (S_0^{−1} + β Φ^T Φ)^{−1}

m_N = S_N (S_0^{−1} m_0 + β Φ^T t) .
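A short numpy sketch of these updates, using a hypothetical polynomial basis and synthetic data (none of it from the notes):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic 1-D regression data.
N = 30
x = rng.uniform(-1, 1, size=N)
beta = 25.0                                   # known noise precision
t = 0.5 - 1.5 * x + 2.0 * x**2 + rng.normal(scale=beta**-0.5, size=N)

# Polynomial basis functions; Phi has N rows and J columns.
def make_phi(x, degree=2):
    return np.vander(x, degree + 1, increasing=True)

Phi = make_phi(x)
J = Phi.shape[1]

# Prior on the weights.
m0 = np.zeros(J)
S0_inv = np.linalg.inv(5.0 * np.eye(J))

# Posterior from completing the square:
#   S_N = (S0^{-1} + beta Phi^T Phi)^{-1},  m_N = S_N (S0^{-1} m0 + beta Phi^T t)
S_N = np.linalg.inv(S0_inv + beta * Phi.T @ Phi)
m_N = S_N @ (S0_inv @ m0 + beta * Phi.T @ t)

print("posterior mean of w:", m_N)   # should be near [0.5, -1.5, 2.0]
```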
So, it turns out that with Gaussian noise we have a Gaussian posterior on the weights. Now, think about what these parameters would look like if we made the prior very weak and zero mean. A very weak independent prior would mean that S_0 was zero off of the diagonal and had large positive values on the diagonal, i.e., large variances and large a priori uncertainty about w. When this matrix is inverted, S_0^{−1} will have zeros off the diagonal and values that are nearly zero on the diagonal. That means that

S_N ≈ β^{−1} (Φ^T Φ)^{−1}

and S_0^{−1} m_0 will be the zero vector. So now the posterior mean will be

m_N ≈ β^{−1} (Φ^T Φ)^{−1} β Φ^T t = (Φ^T Φ)^{−1} Φ^T t ,

where the first factor is just S_N, and which we recognize as both the ordinary least squares and maximum likelihood estimates for w. Bishop 3.3 has some very nice figures showing this posterior for simple data. Note that all through this we have assumed that β is known. It is of course also possible to infer β using an appropriate prior.
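One can check this limit numerically; the sketch below (a hypothetical example, not from the notes) uses a very broad zero-mean prior and confirms that the posterior mean essentially reproduces the ordinary least squares solution:

```python
import numpy as np

rng = np.random.default_rng(3)
N, J, beta = 50, 3, 25.0
Phi = rng.normal(size=(N, J))                        # hypothetical design matrix
t = Phi @ np.array([0.5, -1.5, 2.0]) + rng.normal(scale=beta**-0.5, size=N)

# Very weak zero-mean prior: huge variances, so S0^{-1} is nearly zero.
S0_inv = np.linalg.inv(1e8 * np.eye(J))

S_N = np.linalg.inv(S0_inv + beta * Phi.T @ Phi)
m_N = S_N @ (beta * Phi.T @ t)                       # the S0^{-1} m0 term is zero for m0 = 0

w_ols = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)      # ordinary least squares / maximum likelihood

print(np.allclose(m_N, w_ols, atol=1e-4))            # True: the weak prior recovers OLS
```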

Bayesian Linear Regression Posterior Predictive


We can also compute the posterior predictive in this case. Recall that the posterior predictive is what the model predicts about new data, integrating out the parameters. In this case, that means making a prediction of a new output t at a new input location x, taking into account all possible values of w:

p(t | x, {x_n, t_n}_{n=1}^N, m_0, S_0, β) = ∫ p(t | x, w, β) p(w | {x_n, t_n}_{n=1}^N, m_0, S_0, β) dw

                                          = ∫ N(t | φ(x)^T w, β^{−1}) N(w | m_N, S_N) dw ,

where the first factor in the integrand is the predictive distribution given w and the second is the posterior.

There are different ways to do this integral, including the kind of brute-force algebra we've been using. Personally, I like to think it through using some basic properties of the Gaussian distribution. In particular, if you have a Gaussian random variable z ~ N(μ, Σ), and you apply the linear transformation y = Az + b, the resulting distribution on y is also Gaussian with a simple form: y ~ N(Aμ + b, A Σ A^T). This is just saying: if I have a Gaussian random variable and I perform a linear transformation of it, what is the resulting distribution? That is relevant here because this is exactly what the integral is computing: draw a random w from the posterior and then linearly transform it with φ(x). There's just one more piece of information we need: when we add two independent Gaussian random variables of the same dimension, the covariance of their sum is the sum of their covariance matrices. With these two pieces of knowledge, we see that:

p(t | x, {x_n, t_n}_{n=1}^N, m_0, S_0, β) = N(t | φ(x)^T m_N, φ(x)^T S_N φ(x) + β^{−1}) .
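A short sketch of evaluating this predictive distribution at new inputs, assuming a polynomial basis like the earlier regression sketch (the specific basis and numbers are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)

# Training data and polynomial basis (hypothetical example).
N, beta = 30, 25.0
x = rng.uniform(-1, 1, size=N)
t = 0.5 - 1.5 * x + 2.0 * x**2 + rng.normal(scale=beta**-0.5, size=N)

def phi(x):
    """Polynomial features [1, x, x^2]; rows are phi(x_n)^T."""
    return np.vander(np.atleast_1d(x), 3, increasing=True)

Phi = phi(x)

# Posterior on w (same update as before).
m0, S0_inv = np.zeros(3), np.linalg.inv(5.0 * np.eye(3))
S_N = np.linalg.inv(S0_inv + beta * Phi.T @ Phi)
m_N = S_N @ (S0_inv @ m0 + beta * Phi.T @ t)

# Posterior predictive at new inputs:
#   mean phi(x)^T m_N, variance phi(x)^T S_N phi(x) + 1/beta.
x_star = np.linspace(-1, 1, 5)
Phi_star = phi(x_star)
pred_mean = Phi_star @ m_N
pred_var = np.einsum("ij,jk,ik->i", Phi_star, S_N, Phi_star) + 1.0 / beta

for xs, m, v in zip(x_star, pred_mean, pred_var):
    print(f"x = {xs:+.2f}: predictive mean {m:+.3f}, std {np.sqrt(v):.3f}")
```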
So the predictive distribution is nice and Gaussian. Bishop 3.3.2 has some nice figures of what this
looks like with polynomial basis functions.
