
Multivariate Gaussians

[This note assumes that you know the background material on expectations of random variables.]
We’re going to use Gaussian distributions as parts of models of data, and to represent beliefs
about models. However, most models and algorithms in machine learning involve more than
one scalar variable. (A scalar means a single number, rather than a vector of values.)
Multivariate Gaussians generalize the univariate Gaussian distribution to multiple variables,
which can be dependent.

1 Independent Standard Normals


We could sample a vector x by independently sampling each element from a standard normal
distribution, x_d \sim \mathcal{N}(0, 1). Because the variables are independent, the joint probability is the
product of the individual or marginal probabilities:

p(x) = \prod_{d=1}^{D} p(x_d) = \prod_{d=1}^{D} \mathcal{N}(x_d; 0, 1).    (1)

Usually I recommend that you write any Gaussian PDFs in your maths using the \mathcal{N}(x; \mu, \sigma^2)
notation unless you have to expand them. It will be less writing, and clearer. Here, I want to
combine the PDFs, so will substitute in the standard equation:

p(x) = \prod_{d=1}^{D} \frac{1}{\sqrt{2\pi}} e^{-x_d^2/2} = \frac{1}{(2\pi)^{D/2}} e^{-\frac{1}{2}\sum_d x_d^2}    (2)
     = \frac{1}{(2\pi)^{D/2}} e^{-\frac{1}{2} x^\top x}.    (3)

The PDF is proportional to the Radial Basis Functions (RBFs) we’ve used previously. Here
the normalizer 1/(2\pi)^{D/2} means that the PDF integrates to one.
Like an RBF centred at the origin, this density function only depends on the square-distance
or radius of x from the origin. Any point in a spherical shell (or a circular shell in 2-
dimensions) is equally probable. Therefore if we simulate points in 2-dimensions and draw
a scatter plot:
# Python
import numpy as np
import matplotlib.pyplot as plt

N = int(1e4); D = 2
X = np.random.randn(N, D)   # each row is an independent draw from N(0, I)
plt.plot(X[:, 0], X[:, 1], '.')
plt.axis('square')
plt.show()
We will see a diffuse circular spray of points. The spherical symmetry is a special property
of Gaussians. If you were to draw independent samples from, say, a Laplace distribution
you would see a non-circular distribution that has more density close to the axes.
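If you want to see this contrast for yourself, here is a minimal sketch (assuming NumPy and Matplotlib are imported as in the snippet above; np.random.laplace draws from a Laplace distribution):
# Independent Laplace draws in each dimension, for comparison with the Gaussian cloud.
X_lap = np.random.laplace(loc=0.0, scale=1.0, size=(N, D))
plt.plot(X_lap[:, 0], X_lap[:, 1], '.')
plt.axis('square')
plt.show()
The joint density here is proportional to e^{-(|x_1| + |x_2|)}, whose contours are diamonds rather than circles, so the cloud has more points strung out along the axes.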

2 Covariance
The multivariate generalization of variance is covariance, which is represented with a matrix.
While a variance is often denoted \sigma^2, a covariance matrix is often denoted \Sigma (not to be
confused with a summation \sum_{d=1}^{D} \dots).
The elements of the covariance matrix for a random vector x are:

cov[x]_{ij} = E[x_i x_j] - E[x_i] E[x_j].    (4)



On the diagonal, where i = j, you will see that this definition gives the scalar variances
var[x_i] for each of the elements of the vector. We can write the whole matrix with a linear
algebra expression:

cov[x] = E[x x^\top] - E[x] E[x]^\top.    (5)

Question: What is the covariance S of the spherical distribution of the previous section?
(We will reserve Σ for the covariance of the general Gaussian in the next section.)
The video will take you through the answer, so we’re not making you type it in. But by
considering each element Sij , you should be able to derive the answer yourself using the
results in the review notes on expectations.

Answer: The first term is E[x_i x_j] = E[x_i] E[x_j] if x_i and x_j are independent, which they are if
i \neq j. Thus S_{ij} = 0 for i \neq j. The diagonal elements S_{ii} are equal to the variances of the individual
variables, which are all equal to one. Therefore, S_{ij} = \delta_{ij}, where \delta_{ij} is a Kronecker delta. Or
as a matrix, S = I, the identity matrix.

2.1 Empirical covariance


The covariance above is a formal property of a distribution.
An empirical covariance (or sample covariance) is where the expectations in the definition
of covariance are replaced with averages over samples.[1] In NumPy, np.cov computes a
covariance using expectations under a uniform distribution over N samples. Annoyingly this
function requires an input of shape (D,N), so an (N,D) design matrix must be transposed.
If you have any doubt how covariances are computed, you should write your own version
of cov from primitive matrix operations, and check agreement with np.cov.
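For example, a minimal sketch of such a check might look like this (the function name my_cov and the sizes are arbitrary illustrative choices):
import numpy as np

def my_cov(X):
    """Empirical covariance of an (N, D) design matrix, averaging over the N rows."""
    N = X.shape[0]
    mu = X.mean(axis=0)                        # sample mean, standing in for E[x]
    return (X.T @ X) / N - np.outer(mu, mu)    # E[x x^T] - E[x] E[x]^T, with expectations replaced by averages

X = np.random.randn(1000, 3)
assert np.allclose(my_cov(X), np.cov(X.T, bias=True))   # bias=True makes np.cov also divide by N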

3 Transforming and Rotating: general Gaussians


As with one-dimensional Gaussians, we can generalize the standard zero-mean, unit-variance
Gaussian by a linear transformation and a shift. If any of the steps here are unclear, make
sure you are comfortable with the univariate Gaussian note first.
If we generated the elements of x from independent N (0, 1) draws as above, we could form
a linear combination of these outcomes:
y = Ax. (6)
To keep the discussion simpler, I will assume that A is square and invertible, so y has the
same dimensionality as x.
Question: What is the covariance Σ of the new variable y?

Answer: Simply substitute y into the definition:

cov[y] = E[y y^\top] - E[y] E[y]^\top    (7)
       = E[A x x^\top A^\top] - E[A x] E[A x]^\top    (8)
       = A E[x x^\top] A^\top - A E[x] (A E[x])^\top.    (9)

Because E[x] is zero, the second term is zero, and the expectation in the first term is equal to
cov[x] = I. Therefore,

cov[y] = \Sigma = A A^\top.    (10)
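This result is also easy to sanity-check by simulation; a minimal sketch (the matrix A below is an arbitrary example):
import numpy as np

A = np.array([[2.0, 0.0], [1.0, 1.0]])   # an arbitrary invertible transformation
X = np.random.randn(100000, 2)           # rows are draws from N(0, I)
Y = X @ A.T                              # applies y = A x to each row
print(np.cov(Y.T, bias=True))            # close to A @ A.T
print(A @ A.T)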

1. There is also an “( N − 1)” version of the estimator, just as there is for estimating variances.



Because we’re assuming A is invertible, we can compute the original vector from the
transformed one: x = A−1 y. Substituting that expression into the PDF for x we can see the
shape of the new PDF:
p(y) \propto e^{-\frac{1}{2}(A^{-1}y)^\top (A^{-1}y)}    (11)
     \propto e^{-\frac{1}{2} y^\top A^{-\top} A^{-1} y}.    (12)
As we saw in the univariate Gaussian note, if we stretch out a PDF, we must scale it down so
that the distribution remains normalized. If we apply a linear transformation A to a volume
of points, then the volume is multiplied by |A|, the determinant of the matrix.[2] Therefore,

p(y) = \frac{1}{|A| (2\pi)^{D/2}} e^{-\frac{1}{2} y^\top A^{-\top} A^{-1} y}.    (13)
Usually this expression is re-written in terms of the covariance of the vector. Noticing that

Σ−1 = A−> A−1 , and (14)


> > 2
|Σ| = | AA | = | A|| A | = | A| , (15)
we can write:
1 1 > Σ −1 y 1 > Σ −1 y
p(y) = e− 2 y = |2πΣ|−1/2 e− 2 y . (16)
|Σ|1/2 (2π ) D/2
As demonstrated above, there are different equivalent ways to write the normalizing constant,
and different books will choose different forms.[3]
Finally, we can shift the distribution to have non-zero mean:
z = y + µ. (17)
Shifting the PDF does not change its normalization, so we can simply substitute y = z − µ
into the PDF for y:
p(z) = \mathcal{N}(z; \mu, \Sigma) = \frac{1}{|\Sigma|^{1/2} (2\pi)^{D/2}} e^{-\frac{1}{2}(z - \mu)^\top \Sigma^{-1} (z - \mu)}.    (18)

Here we’ve generalized the N notation for Gaussian distributions to take a mean vector
and a matrix of covariances. In one dimension, these quantities still correspond to the scalar
mean and the variance.
It’s a common mistake to forget the matrix inverse inside the exponential. The inverse
covariance matrix \Sigma^{-1} is also known as the precision matrix.
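As a concrete check of equation (18), here is a minimal sketch that evaluates the density directly and compares it with scipy.stats.multivariate_normal (the particular mu, Sigma and z are arbitrary examples):
import numpy as np
from scipy.stats import multivariate_normal

def gauss_pdf(z, mu, Sigma):
    diff = z - mu
    quad = diff @ np.linalg.solve(Sigma, diff)          # (z-mu)^T Sigma^{-1} (z-mu), without forming the inverse
    return np.linalg.det(2*np.pi*Sigma)**-0.5 * np.exp(-0.5*quad)

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
z = np.array([0.3, -1.5])
assert np.isclose(gauss_pdf(z, mu, Sigma), multivariate_normal.pdf(z, mean=mu, cov=Sigma))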

4 Covariances are positive (semi-)definite


[This section may be tough going on first reading. If so, that’s ok: just keep going to the “check your
understanding” section, which you should work through.]

2. Here |A| is the “Jacobian of the transformation”, although confusingly “Jacobian” can refer to both a matrix
and its determinant. The change of variables might be clearer if we label the different probability density functions:
p_Y(y) = p_X(x)/|A| = p_X(A^{-1}y)/|A|. See also the further reading section.
3. Over the years, many students have questioned whether and why 1/(|\Sigma|^{1/2}(2\pi)^{D/2}) = |2\pi\Sigma|^{-1/2}. The first
thing I do when unsure is to check an example numerically. For example:
D = 5; Sigma = np.cov(np.random.randn(D, 10*D))
lhs = 1 / (np.linalg.det(Sigma)**0.5 * (2*np.pi)**(D/2))
rhs = np.linalg.det(2*np.pi*Sigma)**-0.5
To understand why they’re equal: for a scalar c and a matrix A, we can write |cA| = |cIA| = |cI| |A| = c^D |A|. The
transformation cI stretches an object in each of D directions by c. The determinant gives the resulting volume
change of c^D.



Covariance matrices are always symmetric: in the definition of covariance, cov[x_i, x_j] =
cov[x_j, x_i], or \Sigma_{ij} = \Sigma_{ji}. Moreover, just as variances must be positive (or zero if we are
careful), there is a positive-like constraint on covariance matrices.
A real[4] symmetric matrix \Sigma is positive definite iff[5] it satisfies:

z^\top \Sigma z > 0,   for all real vectors z \neq 0.    (19)

These matrices are always invertible, and the inverse is also positive definite:

z^\top \Sigma^{-1} z > 0,   for all real vectors z \neq 0.    (20)

Therefore, the exponential term in the probability density of a Gaussian falls as we move
away from the mean in any direction iff the covariance is positive definite. If the density
didn’t fall in some direction then it wouldn’t be normalized (integrate to one), and so
wouldn’t be a valid density.
Edge case: In general, covariances can be positive semi-definite, which means:

z^\top \Sigma z \geq 0,   for all real vectors z.    (21)

However, if z^\top \Sigma z = 0 for some z \neq 0, then the determinant |\Sigma| will be zero, and the
covariance won’t be invertible. Therefore the expression we gave for the probability density
is only valid for strictly positive definite covariances.
An example of a Gaussian distribution where the covariance isn’t strictly positive definite
can be simulated by drawing x1 ∼ N (0, 1) and deterministically setting x2 = x1 . You should
be able to show that the theoretical covariance of such vectors is:

\Sigma = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix},    (22)

and you should be able to simulate this process to confirm numerically.[6]


In this example, the probability density is zero “almost everywhere”, for any x where x_1 \neq x_2.
The only way to make \int p(x)\,dx = 1 is to make the density infinite along the line x_1 = x_2.
Gaussian distributions with zero-determinant covariances generalize the Dirac delta function,
where the distribution is constrained to a surface with zero volume, rather than just a point.
Care is required with such distributions, both analytically and numerically. We will stick to
strictly positive definite covariances whenever we can.
Given a real-valued matrix A, \Sigma = A A^\top is always positive semi-definite. Moreover, if \Sigma
is symmetric and positive semi-definite, it can always be written in this form. Allowing
non-symmetric \Sigma wouldn’t expand the set of probability densities that can be expressed.[7]
Therefore, the process for sampling from a Gaussian that was described in this document
is general: we can sample from any Gaussian by transforming draws from a standard
normal, and such a process always generates points from a distribution with a well-defined
covariance.

5 Computing A to sample from N (0, Σ)


Given a covariance matrix \Sigma, the transformation A is not uniquely defined. For example

A = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix},   and   A = \begin{pmatrix} \sqrt{3} & 1 \\ -1 & \sqrt{3} \end{pmatrix},    (23)

4. I often forget such distinctions, because I rarely deal with complex numbers. Although machine learning
systems that use complex numbers have been proposed.
5. “iff” means “if and only if”.
6. For example: x1 = np.random.randn(10**6); X = np.hstack((x1[:,None], x1[:,None])); np.cov(X.T)
7. Because x^\top M x = x^\top (\frac{1}{2}M + \frac{1}{2}M^\top) x, for any x and square matrix M. So we can replace a non-symmetric
precision \Sigma^{-1} with the symmetric matrix (\frac{1}{2}\Sigma^{-1} + \frac{1}{2}\Sigma^{-\top}), and the covariance will be symmetric too.



both give the same product A A^\top. However, any A such that \Sigma = A A^\top can be used to
transform points to sample from \mathcal{N}(0, \Sigma).
It’s common to use a fast and standard matrix routine known as the (lower-triangular)
Cholesky decomposition, which is easy to call in NumPy:
D = 3; Sigma = np.cov(np.random.randn(D, 3*D))   # an arbitrary valid covariance matrix
A = np.linalg.cholesky(Sigma)                    # lower-triangular, so Sigma = A @ A.T
Sigma_from_A = A @ A.T   # up to round-off error, matches Sigma
Do try this out, and look at the matrices; this checking step is not just for beginners! I still
routinely check that functions do what I expect, because I am still frequently bitten by nasty
surprises. For example, SciPy also has a cholesky function:
import scipy.linalg as sla
A = sla.cholesky(Sigma)
Sigma_wrong = A @ A.T # doesn't match Sigma!
Unlike NumPy, the SciPy version gives the upper-triangular Cholesky decomposition by
default — a difference you notice if you look at the matrices for a small example. You need
to transpose the result, or use A = sla.cholesky(Sigma, lower=True).
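Putting these pieces together, a minimal sketch of sampling from N(mu, Sigma) (with an arbitrary example mean and covariance) is:
import numpy as np

mu = np.array([1.0, -1.0, 0.5])
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 0.5]])

A = np.linalg.cholesky(Sigma)        # lower-triangular, so Sigma = A @ A.T
N = 100000
X = np.random.randn(N, mu.size)      # rows are draws from N(0, I)
Z = X @ A.T + mu                     # rows are draws from N(mu, Sigma)

print(Z.mean(axis=0))                # close to mu
print(np.cov(Z.T, bias=True))        # close to Sigma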

6 Check your understanding


Diagonal transformation: A special case of a general transformation A is a diagonal matrix
that simply stretches each variable independently: \Lambda_{ij} = \delta_{ij}\sigma_i. (This Kronecker delta notation
was used earlier in the note.) For three dimensions the transformation would be:

\Lambda = \begin{pmatrix} \sigma_1 & 0 & 0 \\ 0 & \sigma_2 & 0 \\ 0 & 0 & \sigma_3 \end{pmatrix}.    (24)

[The website version of this note has a question here.]


Investigate a special case: You could use Python to sample some points from different
multivariate Gaussians, and see how the covariance affects the cloud of points.
For example, you could use a family of transformations parameterized by a:

A = \begin{pmatrix} 1 & 0 \\ a & 1 - a \end{pmatrix},    (25)

What does this transformation do? Is it clear why the variables are dependent for a \neq 0?
When are the variables maximally dependent?[8] What happens to the PDF as a \to 1, and
why? Does the covariance matrix have an inverse when a = 1?
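One minimal sketch of such an investigation (the particular value of a is just an example to vary):
import numpy as np
import matplotlib.pyplot as plt

a = 0.8                                    # try several values between 0 and 1
A = np.array([[1.0, 0.0], [a, 1.0 - a]])   # the transformation in equation (25)
X = np.random.randn(10000, 2)              # rows are draws from N(0, I)
Y = X @ A.T                                # rows are draws from N(0, A A^T)

plt.plot(Y[:, 0], Y[:, 1], '.')
plt.axis('square')
plt.show()
print(np.cov(Y.T, bias=True))              # close to A @ A.T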
[The website version of this note has a question here.]
Contours: The shape of a two-dimensional Gaussian is often sketched using a contour of its
PDF. Just like the Radial Basis Function (RBF) discussed last week, the contours of a radially
symmetric Gaussian are circular. So if you compute the ( x1 , x2 ) coordinates of some points
on a circle, these can be joined up to plot a contour of the Gaussian with identity covariance.
You can then transform these points, just like sample positions, with a matrix A, to plot a
contour of \mathcal{N}(x; 0, A A^\top).
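For example, here is a minimal sketch of that contour-plotting trick (the matrix A is an arbitrary example; the unit circle gives one contour of the identity-covariance Gaussian):
import numpy as np
import matplotlib.pyplot as plt

theta = np.linspace(0, 2*np.pi, 200)
circle = np.stack([np.cos(theta), np.sin(theta)], axis=1)   # (200, 2) points on the unit circle

A = np.array([[1.0, 0.0], [0.8, 0.2]])    # arbitrary example transformation
contour = circle @ A.T                     # a contour of N(x; 0, A A^T)

plt.plot(contour[:, 0], contour[:, 1], '-')
plt.axis('square')
plt.show()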
[The website version of this note has a question here.]

8. We don’t need a formal definition of dependence here. If you first understand the transformation, this question
has a clear answer and is not ambiguous.



7 Further reading
https://en.wikipedia.org/wiki/Multivariate_normal_distribution
Both Bishop Section 2.3 and Barber Section 8.4 start with the definition that this note builds
up to, and then work in the reverse direction from there to build up an interpretation. These
sections then go further than this note, and both books have some further exercises.
The treatment by Murphy, Section 2.5.2, is rather more terse!
Transforming the PDF of the spherical distribution required getting the normalization correct
due to the change of variables. If you would like a more rigorous treatment, or to understand
what to do if the transformation is non-linear, I’ll defer to the text books. The maths for
transforming a PDF due to a change of variables is quickly reviewed in Barber Section 8.2,
Result 8.1. Murphy’s treatment is longer this time, in Section 2.6.

MLPR:w2e Iain Murray and Arno Onken, http://www.inf.ed.ac.uk/teaching/courses/mlpr/2020/
