w2e_multivariate_gaussian
[This note assumes that you know the background material on expectations of random variables.]
We’re going to use Gaussian distributions as parts of models of data, and to represent beliefs
about models. However, most models and algorithms in machine learning involve more than one
scalar variable. (A scalar means a single number, rather than a vector of values.)
Multivariate Gaussians generalize the univariate Gaussian distribution to multiple variables,
which can be dependent.
$$p(\mathbf{x}) = \prod_{d=1}^{D} p(x_d) = \prod_{d=1}^{D} \mathcal{N}(x_d;\, 0, 1). \tag{1}$$
Usually I recommend that you write any Gaussian PDFs in your maths using the N(x; µ, σ^2)
notation unless you have to expand them. It will be less writing, and clearer. Here, I want to
combine the PDFs, so will substitute in the standard equation:
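$$p(\mathbf{x}) = \prod_{d=1}^{D} \frac{1}{\sqrt{2\pi}}\, e^{-x_d^2/2} = \frac{1}{(2\pi)^{D/2}}\, e^{-\frac{1}{2}\sum_d x_d^2} \tag{2}$$
$$\phantom{p(\mathbf{x})} = \frac{1}{(2\pi)^{D/2}}\, e^{-\frac{1}{2}\mathbf{x}^\top\mathbf{x}}. \tag{3}$$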
The PDF is proportional to the Radial Basis Functions (RBFs) we’ve used previously. Here
the normalizer 1/(2π)^{D/2} means that the PDF integrates to one.
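As a quick numerical sanity check of equations (1) to (3), the sketch below evaluates the
product of D univariate standard normal densities at a test point and compares it with the
closed form. The helper names and the test point are illustrative choices, not part of the note.

```python
import numpy as np

def standard_normal_pdf(x):
    """Univariate N(x; 0, 1) density, applied elementwise."""
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

def spherical_gaussian_pdf(x):
    """Closed form of equation (3) for a D-dimensional vector x."""
    D = x.size
    return np.exp(-0.5 * (x @ x)) / (2 * np.pi)**(D / 2)

x = np.array([0.3, -1.2, 0.7])  # arbitrary test point, D = 3
product_form = np.prod(standard_normal_pdf(x))
closed_form = spherical_gaussian_pdf(x)
print(product_form, closed_form)  # the two values should agree
```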
Like an RBF centred at the origin, this density function depends only on the squared distance
(equivalently, the radius) of x from the origin. Every point on a spherical shell (or a circle
in 2 dimensions) has the same probability density. Therefore if we simulate points in
2 dimensions and draw a scatter plot:
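# Python
import numpy as np
import matplotlib.pyplot as plt

N = int(1e4); D = 2
X = np.random.randn(N, D)          # N independent draws from equation (1)
plt.plot(X[:,0], X[:,1], '.')      # scatter plot of the two coordinates
plt.axis('square')                 # equal axis scales, so a circle looks circular
plt.show()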
We will see a diffuse circular spray of points. The spherical symmetry is a special property
of Gaussians. If you were to draw independent samples from, say, a Laplace distribution
you would see a non-circular distribution that has more density close to the axes.
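One way to see the difference without plotting is to evaluate each density at points that all
lie at the same radius. The sketch below assumes a unit-scale Laplace density,
p(x_d) = (1/2) e^{-|x_d|}, for the comparison. That choice of scale, the radius, and the
function names are illustrative assumptions.

```python
import numpy as np

def gaussian_pdf_2d(X):
    """Independent standard normals, evaluated at each row of X."""
    return np.exp(-0.5 * np.sum(X**2, axis=1)) / (2 * np.pi)

def laplace_pdf_2d(X):
    """Independent unit-scale Laplace variables, evaluated at each row of X."""
    return 0.25 * np.exp(-np.sum(np.abs(X), axis=1))

# Eight points on a circle of radius 2, starting on the x_1 axis.
theta = np.linspace(0, 2 * np.pi, 8, endpoint=False)
circle = 2.0 * np.stack([np.cos(theta), np.sin(theta)], axis=1)

gauss_vals = gaussian_pdf_2d(circle)
laplace_vals = laplace_pdf_2d(circle)
print(gauss_vals.max() - gauss_vals.min())  # essentially zero: constant on the circle
print(laplace_vals[0], laplace_vals[1])     # on-axis point vs diagonal point
```

The Gaussian density is constant around the circle, whereas the Laplace density is larger at
the on-axis point than at the diagonal point of the same radius.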
2 Covariance
The multivariate generalization of variance is covariance, which is represented with a matrix.
While a variance is often denoted σ^2, a covariance matrix is often denoted Σ (not to be
confused with a summation ∑_{d=1}^{D} ...).
The elements of the covariance matrix for a random vector x are:
$$\mathrm{cov}[\mathbf{x}]_{ij} = \mathbb{E}[x_i x_j] - \mathbb{E}[x_i]\,\mathbb{E}[x_j].$$
Question: What is the covariance S of the spherical distribution of the previous section?
(We will reserve Σ for the covariance of the general Gaussian in the next section.)
The video will take you through the answer, so we’re not making you type it in. But by
considering each element S_ij, you should be able to derive the answer yourself using the
results in the review notes on expectations.
Answer: The first term is E[x_i x_j] = E[x_i] E[x_j] if x_i and x_j are independent, which they
are if i ≠ j. Thus S_{i≠j} = 0. The diagonal elements S_ii are equal to the variances of the
individual variables, which are all equal to one. Therefore, S_ij = δ_ij, where δ_ij is a
Kronecker delta. Or as a matrix, S = I, the identity matrix.
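If you want to check this answer numerically, here is a minimal sketch that estimates the
covariance from samples and compares it with the identity. The seed, sample size, and
dimension are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
N, D = 100_000, 3
X = rng.standard_normal((N, D))  # rows are independent draws of x

# np.cov treats rows as variables by default, so say that rows are samples here.
S_hat = np.cov(X, rowvar=False)
print(np.round(S_hat, 2))        # should be close to the identity matrix
```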
1. There is also an “(N − 1)” version of the estimator, just as there is for estimating variances.
Here we’ve generalized the N notation for Gaussian distributions to take a mean vector
and a matrix of covariances. In one dimension, these quantities still correspond to the scalar
mean and the variance.
It’s a common mistake to forget the matrix inverse inside the exponential. The inverse
covariance matrix, Σ^{-1}, is also known as the precision matrix.
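Here is a minimal sketch of evaluating a multivariate Gaussian log-density, assuming the
standard form N(x; µ, Σ) = |2πΣ|^{-1/2} exp(−(1/2)(x − µ)^⊤ Σ^{-1} (x − µ)). The function name
and the test values are illustrative. Rather than forming Σ^{-1} explicitly, the code solves a
linear system, which applies the inverse inside the exponential and is usually better behaved
numerically.

```python
import numpy as np

def gaussian_logpdf(x, mu, Sigma):
    """log N(x; mu, Sigma) for a positive definite covariance Sigma."""
    D = x.size
    diff = x - mu
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu), without forming the inverse.
    quad = diff @ np.linalg.solve(Sigma, diff)
    # slogdet returns (sign, log|Sigma|); the sign is +1 for positive definite Sigma.
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (quad + logdet + D * np.log(2 * np.pi))

x = np.array([0.5, -1.0])
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
print(gaussian_logpdf(x, mu, Sigma))

# In one dimension the expression reduces to the familiar scalar Gaussian.
x1, m1, var1 = 0.3, -0.2, 1.5
scalar_logpdf = -0.5 * ((x1 - m1)**2 / var1 + np.log(2 * np.pi * var1))
one_dim_logpdf = gaussian_logpdf(np.array([x1]), np.array([m1]), np.array([[var1]]))
print(one_dim_logpdf, scalar_logpdf)  # the two values should agree
```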
2. Here |A| is the “Jacobian of the transformation”, although confusingly “Jacobian” can refer to both a matrix
and its determinant. The change of variables might be clearer if we label the different probability density functions:
p_Y(y) = p_X(x)/|A| = p_X(A^{-1} y)/|A|. See also the further reading section.
3. Over the years, many students have questioned whether and why 1/(|Σ|^{1/2} (2π)^{D/2}) = |2πΣ|^{-1/2}. The first
thing I do when unsure is check an example numerically. For example:
import numpy as np
D = 5; Sigma = np.cov(np.random.randn(D, 10*D))
lhs = 1 / (np.linalg.det(Sigma)**0.5 * (2*np.pi)**(D/2))
rhs = np.linalg.det(2*np.pi*Sigma)**-0.5
print(lhs, rhs)  # the two values should agree up to rounding error
To understand why they’re equal: for a scalar c and a matrix A, we can write |cA| = |cIA| = |cI||A| = c^D |A|. The
transformation cI stretches an object in each of D directions by c. The determinant gives the resulting volume
change of c^D.
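these matrices are always invertible, and the inverse is also positive definite:
$$\mathbf{z}^\top \Sigma^{-1} \mathbf{z} > 0 \quad \text{for all real vectors } \mathbf{z} \ne \mathbf{0}.$$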
Therefore, the exponential term in the probability density of a Gaussian falls as we move
away from the mean in any direction iff the covariance is positive definite. If the density
didn’t fall in some direction then it wouldn’t be normalized (integrate to one), and so
wouldn’t be a valid density.
Edge case: In general, covariances can be positive semi-definite, which means:
$$\mathbf{z}^\top \Sigma\, \mathbf{z} \ge 0 \quad \text{for all real vectors } \mathbf{z}.$$
However, if z^⊤ Σ z = 0 for some z ≠ 0, then the determinant |Σ| will be zero, and the
covariance won’t be invertible. Therefore the expression we gave for the probability density
is only valid for strictly positive definite covariances.
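As a hedged illustration of the distinction, the sketch below classifies a symmetric matrix
by its eigenvalues: all strictly positive means positive definite, while a zero eigenvalue
means the matrix is only positive semi-definite. The function name, the tolerance, and both
example matrices are illustrative assumptions.

```python
import numpy as np

def definiteness(Sigma, tol=1e-10):
    """Classify a symmetric matrix using its (real) eigenvalues."""
    eigvals = np.linalg.eigvalsh(Sigma)  # eigenvalues in ascending order
    if eigvals[0] > tol:
        return "positive definite"
    if eigvals[0] >= -tol:
        return "positive semi-definite only"
    return "not a valid covariance"

full_rank = np.array([[2.0, 0.6],
                      [0.6, 1.0]])
degenerate = np.array([[1.0, 1.0],
                       [1.0, 1.0]])

for Sigma in (full_rank, degenerate):
    print(definiteness(Sigma), np.linalg.det(Sigma))

# A direction z with z^T Sigma z = 0 for the degenerate matrix.
z = np.array([1.0, -1.0])
print(z @ degenerate @ z)
```

In practice, attempting a Cholesky decomposition is another common test for strict positive
definiteness, because it fails when the matrix is not positive definite.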
An example of a Gaussian distribution where the covariance isn’t strictly positive definite
can be simulated by drawing x_1 ∼ N(0, 1) and deterministically setting x_2 = x_1. You should
be able to show that the theoretical covariance of such vectors is:
$$\Sigma = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}, \tag{22}$$
4. I often forget such distinctions, because I rarely deal with complex numbers, although machine learning
systems that use complex numbers have been proposed.
5. “iff” means “if and only if”.
6. For example: x1 = np.random.randn(10**6); X = np.hstack((x1[:,None], x1[:,None])); np.cov(X.T)
7. Because x^⊤ M x = x^⊤ ((1/2)M + (1/2)M^⊤) x, for any x and square matrix M. So we can replace a non-symmetric
precision Σ^{-1} with the symmetric matrix ((1/2)Σ^{-1} + (1/2)Σ^{-⊤}), and the covariance will be symmetric too.
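The identity in footnote 7 is easy to check numerically. In this minimal sketch the random
matrix, the vector, and the seed are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 4
M = rng.standard_normal((D, D))  # a generic non-symmetric square matrix
M_sym = 0.5 * M + 0.5 * M.T      # its symmetric part
x = rng.standard_normal(D)

print(x @ M @ x, x @ M_sym @ x)  # the two quadratic forms should agree
```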
What does this transformation do? Is it clear why the variables are dependent for a ≠ 0?
When are the variables maximally dependent⁸? What happens to the PDF as a → 1 and
why? Does the covariance matrix have an inverse when a = 1?
[The website version of this note has a question here.]
Contours: The shape of a two-dimensional Gaussian is often sketched using a contour of its
PDF. Just like the Radial Basis Function (RBF) discussed last week, the contours of a radially
symmetric Gaussian are circular. So if you compute the (x_1, x_2) coordinates of some points
on a circle, these can be joined up to plot a contour of the Gaussian with identity covariance.
You can then transform these points, just like sample positions, with a matrix A, to plot a
contour of N(x; 0, AA^⊤).
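Here is a minimal sketch of that recipe. Points on the unit circle are mapped through A, and
the transformed points should all share the same value of x^⊤ (AA^⊤)^{-1} x, which is what
makes them a contour of N(x; 0, AA^⊤). The particular matrix A and the number of points are
arbitrary illustrative choices.

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [0.8, 0.6]])  # arbitrary invertible example matrix

theta = np.linspace(0, 2 * np.pi, 100)
circle = np.stack([np.cos(theta), np.sin(theta)])  # shape (2, 100), unit radius
ellipse = A @ circle                               # each column is A times a circle point

Sigma = A @ A.T
# Squared Mahalanobis distance x^T Sigma^{-1} x for every contour point.
maha_sq = np.sum(ellipse * np.linalg.solve(Sigma, ellipse), axis=0)
print(maha_sq.min(), maha_sq.max())  # both should be close to 1
```

To draw the contour, the two rows of `ellipse` can be passed to a line-plotting call such as
`plt.plot(ellipse[0], ellipse[1])`, assuming matplotlib has been imported as `plt`.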
[The website version of this note has a question here.]
8. We don’t need a formal definition of dependence here. If you first understand the transformation, this question
has a clear answer and is not ambiguous.