MLPR w0f - Machine Learning and Pattern Recognition
Examples:
P(x) = ⎧ p      x = 1,
       ⎨ 1 − p  x = 0,        (1)
       ⎩ 0      otherwise,
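For concreteness, here is a minimal Python sketch of this probability mass function; the
function name and the default value of p are my own choices, not part of the notes:

    def bernoulli_pmf(x, p=0.3):
        """P(x) for the distribution in (1): p if x == 1, 1 - p if x == 0, 0 otherwise."""
        if x == 1:
            return p
        if x == 0:
            return 1 - p
        return 0.0

    # The probabilities of the two possible outcomes sum to one:
    assert abs(bernoulli_pmf(1) + bernoulli_pmf(0) - 1.0) < 1e-12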
2 Expectations
An expectation is a property of a probability distribution, defined by a probability-weighted
sum. The expectation of some function, f , of an outcome, x , is:
𝔼_P(x)[f(x)] = ∑ᵢ pᵢ f(aᵢ).        (2)
Often the subscript P(x) is dropped from the notation because the reader knows under which
distribution the expectation is being taken. Notation can vary considerably, and details are
often dropped. You might also see E[f], 𝔼[f], or ⟨f⟩, which all mean the same thing.
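As an illustration (a sketch I have added, not part of the notes; the outcomes, probabilities,
and functions are made up), the probability-weighted sum in (2) can be computed directly
with NumPy:

    import numpy as np

    # A small made-up discrete distribution: outcomes a_i with probabilities p_i.
    a = np.array([0.0, 1.0, 2.0])   # possible outcomes a_i
    p = np.array([0.5, 0.3, 0.2])   # their probabilities p_i, summing to one

    def expectation(f, a, p):
        """E_P(x)[f(x)] = sum_i p_i f(a_i), a probability-weighted sum."""
        return np.sum(p * f(a))

    print(expectation(lambda x: x, a, p))     # E[x]   = 0.7
    print(expectation(lambda x: x**2, a, p))  # E[x^2] = 1.1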
The expectation is sometimes a useful representative value of a random function value. The
expectation of the identity function, f (x) = x , is the ‘mean’, which is one measure of the centre
of a distribution.
Expectations are linear, meaning that
𝔼[f(x) + g(x)] = 𝔼[f(x)] + 𝔼[g(x)]   and   𝔼[cf(x)] = c𝔼[f(x)]. (3)
These properties are apparent if you explicitly write out the summations.
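For instance, spelling out the first property as a sum over the outcomes (my own working of
the step the text leaves to the reader, written in LaTeX notation):

    \[
    \mathbb{E}[f(x) + g(x)]
      = \sum_i p_i \big( f(a_i) + g(a_i) \big)
      = \sum_i p_i f(a_i) + \sum_i p_i g(a_i)
      = \mathbb{E}[f(x)] + \mathbb{E}[g(x)].
    \]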
Similarly, the expectation of a constant c is just that constant, because the probabilities sum
to one:
𝔼[c] = c ∑ᵢ pᵢ = c.        (4)
3 The mean
The mean of a distribution over a number is simply the ‘expected’ value of the numerical
outcome.
‘Expected value’ = ‘mean’ = μ = 𝔼[x] = ∑ᵢ pᵢ aᵢ.        (6)
For example, the expected value of a roll of a fair six-sided die is
𝔼[x] = (1/6)×1 + (1/6)×2 + (1/6)×3 + (1/6)×4 + (1/6)×5 + (1/6)×6 = 3.5.        (7)
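A quick numerical sanity check (my own sketch; the sample size and random seed are arbitrary
choices): the average of many simulated fair-die rolls approaches this value.

    import numpy as np

    rng = np.random.default_rng(0)              # seeded generator, arbitrary choice
    rolls = rng.integers(1, 7, size=100_000)    # simulated rolls of a fair six-sided die
    print(rolls.mean())                         # close to 3.5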
In everyday language I wouldn’t say that I ‘expect’ to see 3.5 as the outcome of throwing a
die… I expect to see an integer! However, 3.5 is the ‘expected value’ as it is commonly
defined. Similarly, a single Bernoulli outcome will be a zero or a one, but its ‘expected’ value
is a fraction: 𝔼[x] = p×1 + (1−p)×0 = p.
Change of units: I might have a distribution over heights measured in metres, for which I
have computed the mean. If I multiply the heights by 100 to obtain heights in centimetres, the
mean in centimetres can be obtained by multiplying the mean in metres by 100. Formally:
𝔼[100 x] = 100 𝔼[x] .
4 The variance
The variance is also an expectation, measuring the average squared distance from the mean:

var[x] = 𝔼[(x − μ)²] = 𝔼[x²] − 2μ𝔼[x] + μ² = 𝔼[x²] − 𝔼[x]².

A numerical check of this identity follows the exercises below.
Exercise 3: show that var[cx] = c² var[x], for a constant c.
Exercise 4: show that var[x + y] = var[x] + var[y], for independent outcomes x and y.
Exercise 5: Given outcomes distributed with mean μ and variance σ², how could you shift
and scale them to have mean zero and variance one?
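A small numerical check of the two equivalent forms of the variance given above (my own
sketch, with an arbitrary sample):

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(loc=2.0, scale=3.0, size=1_000_000)  # arbitrary samples, true variance 9

    mu = x.mean()
    print(np.mean((x - mu)**2))         # E[(x - mu)^2]
    print(np.mean(x**2) - mu**2)        # E[x^2] - E[x]^2: the same number
    print(x.var())                      # NumPy's variance, for comparison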
Change of units: If the outcome x is a height measured in metres, then x² has units of m²;
x² is an area. The variance also has units of m², so it cannot be represented on the same scale
as the outcome, because it has different units. If you multiply all heights by 100 to convert to
centimetres, the variance is multiplied by 100². Therefore, the relative size of the mean and
the variance depends on the units you use, and so often isn’t meaningful.
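A minimal NumPy sketch of these unit changes (my own illustration, with made-up heights); it
also checks the earlier claim that 𝔼[100x] = 100𝔼[x]:

    import numpy as np

    heights_m = np.array([1.55, 1.62, 1.70, 1.80, 1.93])  # made-up heights in metres
    heights_cm = 100 * heights_m                          # the same heights in centimetres

    print(heights_cm.mean(), 100 * heights_m.mean())      # means agree: scaled by 100
    print(heights_cm.var(), 100**2 * heights_m.var())     # variances agree: scaled by 100^2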
Standard deviation: The standard deviation σ , the square root of the variance, does have the
same units as the mean. The standard deviation is often used as a measure of the typical
distance from the mean. Often variances are used in intermediate calculations because they are
easier to deal with: it is variances that add (as in Exercise 4 above), not standard deviations.
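As a numerical illustration that it is variances, not standard deviations, that add (my own
sketch with arbitrary independent Gaussian samples; the proof is Exercise 4):

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.normal(0.0, 2.0, size=1_000_000)   # independent samples with variance 4
    y = rng.normal(0.0, 3.0, size=1_000_000)   # independent samples with variance 9

    print((x + y).var(), x.var() + y.var())    # both close to 4 + 9 = 13
    print((x + y).std(), x.std() + y.std())    # about 3.6 vs 5: std devs do not add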
Consider a drunkard taking a sequence of independent random steps x_1, x_2, …, x_N along an
alleyway, each step with mean zero and variance σ², so that his position after N steps is
k_N = ∑_n x_n. If the drunkard started in the centre of the alleyway, will he ever escape? If so,
roughly how long will it take? (If you don’t already know, have a think…)
The expected, or mean, position after N steps is 𝔼[k_N] = N𝔼[x_n] = 0. This doesn’t mean we
don’t think the drunkard will escape. There are ways of escaping both left and right; it’s just
‘on average’ that he’ll stay in the middle.
The variance of the drunkard’s position is var[k_N] = N var[x_n] = Nσ². The standard deviation
of the position is then std[k_N] = √N σ, which is a measure of the width of the distribution over
the displacement from the centre of the alleyway. If we double the length of the alley, then it
will typically take four times the number of random steps to escape.
Corollary: the typical magnitude of the mean of N independent zero-mean variables with
finite variance scales with 1/√N.
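A short simulation of both results (my own sketch; the unit ±1 steps, so that σ = 1, and the
number of simulated walkers are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(3)
    n_walkers = 10_000                                        # number of simulated drunkards
    for N in [100, 400, 1600]:                                # numbers of steps
        steps = rng.choice([-1.0, 1.0], size=(n_walkers, N))  # unit steps: mean 0, variance 1
        k_N = steps.sum(axis=1)                               # each walker's position after N steps
        # std of the position grows like sqrt(N); std of the average step shrinks like 1/sqrt(N)
        print(N, k_N.std(), np.sqrt(N), steps.mean(axis=1).std(), 1 / np.sqrt(N))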
6 Solutions
As always, you are strongly recommended to work hard on a problem yourself before looking
at the solutions. As you transition into doing research, there won’t be any answers, and you
will have to build confidence in getting and checking your own answers.
Exercise 3:
var[cx] = 𝔼[(cx)²] − 𝔼[cx]² = 𝔼[c²x²] − (c𝔼[x])² = c²(𝔼[x²] − 𝔼[x]²) = c² var[x].
Exercise 4:
var[x + y] = 𝔼[(x + y)²] − 𝔼[x + y]² = 𝔼[x²] + 𝔼[y²] + 2𝔼[xy] − (𝔼[x]² + 𝔼[y]² + 2𝔼[x]𝔼[y]) = var[x] + var[y],
if 𝔼[xy] = 𝔼[x]𝔼[y], which is true if x and y are independent variables.
Exercise 5: z = (x − μ)/σ has mean 0 and variance 1. The division is by the standard
deviation, not the variance. You should now be able to prove this result for yourself.
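A quick check of this standardization (my own sketch, with arbitrary non-Gaussian samples):

    import numpy as np

    rng = np.random.default_rng(4)
    x = rng.gamma(shape=2.0, scale=1.5, size=1_000_000)  # arbitrary skewed samples

    z = (x - x.mean()) / x.std()   # shift by the mean, scale by the standard deviation
    print(z.mean(), z.var())       # close to 0 and 1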
What to remember: using the expectation notation where possible, rather than writing out the
summations or integrals explicitly, makes the mathematics concise.