Subject CS2
CMP Upgrade 2021/22
This CMP Upgrade lists the changes to the Syllabus objectives, Core Reading and the ActEd
material since last year that might realistically affect your chance of success in the exam. It is
produced so that you can manually amend your 2021 CMP to make it suitable for study for the
2022 exams. It includes replacement pages and additional pages where appropriate.
Alternatively, you can buy a full set of up-to-date Course Notes / CMP at a significantly reduced
price if you have previously bought the full-price Course Notes / CMP in this subject. Please see
our 2022 Student Brochure for more details.
This upgrade does not contain amendments to the assignments. We only accept the current
version of assignments for marking, ie those published for the sessions leading to the 2022
exams. If you wish to submit your script for marking but have only an old version, then you can
order the current assignments free of charge if you have purchased the same assignments in the
same subject in a previous year, and have purchased marking for the 2022 session.
Alternatively, if you wish to purchase the 2022 assignments, then you can do so at a significantly
reduced rate. Again, please see our 2022 Student Brochure for more details.
The objectives for machine learning have been changed to the following:
5.1 Explain and apply elementary principles of machine learning. (Chapter 21)
5.1.1 Explain the bias/variance trade-off and its relationship with model complexity.
5.1.4 Use software to apply supervised learning techniques to solve regression and
classification problems.
5.1.5 Use metrics such as precision, recall, F1 score and diagnostics such as the ROC
curve and confusion matrix to evaluate the performance of a binary classifier.
Chapter 6
Section 4, pages 21-22
The R boxes on these pages have been updated. Replacement pages are included at the end of
this document.
Chapter 10
Section 7.5, page 38
The Core Reading describing the probability of obtaining exactly t positive groups has been
updated to reference the hypergeometric distribution. The updated Core Reading and following
ActEd paragraph now reads:
(c) There are $\binom{m}{n_1}$ ways to arrange $n_1$ positive and $n_2$ negative signs, since, by definition, $m = n_1 + n_2$.

Hence, the probability of exactly t positive groups can be obtained from the hypergeometric distribution:

$$P(G = t) = \frac{\binom{n_1 - 1}{t - 1}\binom{n_2 + 1}{t}}{\binom{n_1 + n_2}{n_1}}$$

This formula is given on page 34 of the Tables. G follows a hypergeometric distribution and can be thought of as the number of 'successes' drawn from a finite population with $n_2 + 1$ total 'successes' and $n_1 - 1$ total 'failures' when $n_1$ total items are drawn.
Chapter 12
Section 2.5, page 31
The expression for the penalised log-likelihood has been updated to remove the factor of ½ (this also impacts ActEd text on page 43 of the summary and page 54 of the practice questions). The Core Reading now reads:

$$\ell_p(\theta) = \ell(\theta) - \lambda P(\theta)$$
The reference to the MortalitySmooth package has been removed. The following R box is no
longer in the Core Reading:
p-spline forecasting for single years of age or for many ages simultaneously can be carried
out using the MortalitySmooth package in R.
Chapter 13
Section 3.3, page 27
The following lines in R calculate the ACF and PACF functions up to lag 12 for an AR(1) model with parameter 0.7:
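# One possible form of these calculations (values consistent with the output below):
ARMAacf(ar = 0.7, lag.max = 12)               # ACF of the AR(1) process
ARMAacf(ar = 0.7, lag.max = 12, pacf = TRUE)  # PACF of the AR(1) process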
0 1 2 3 4
1.00000000 0.70000000 0.49000000 0.34300000 0.24010000
5 6 7 8 9
0.16807000 0.11764900 0.08235430 0.05764801 0.04035361
10 11 12
0.02824752 0.01977327 0.01384129
[1] 0.7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
The following lines in R plot ACF and PACF functions up to lag 12 as bar plots:
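# A sketch based on the MA(1) example later in this section; titles and colours are assumed:
par(mfrow = c(1,2))
barplot(ARMAacf(ar = 0.7, lag.max = 12)[-1],
        xlab = "Lag", ylab = "ACF", main = "ACF of AR(1)",
        col = "red")
barplot(ARMAacf(ar = 0.7, lag.max = 12, pacf = TRUE),
        xlab = "Lag", ylab = "PACF", main = "PACF of AR(1)",
        col = "blue")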
Note that the value of $\rho_0$ ($=1$) has been excluded from the ACF bar plot using negative indexing, ie using the code [-1].
The following lines in R generate the ACF and PACF functions up to lag 12 for an MA(1) model with parameter 0.7:
0 1 2 3 4
1.0000000 0.4697987 0.0000000 0.0000000 0.0000000
5 6 7 8 9
0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
10 11 12
0.0000000 0.0000000 0.0000000
ARMAacf(ma = 0.7, lag.max = 12, pacf = TRUE)
[1] 0.469798658 -0.283220623 0.185631274
The following lines in R plot the ACF and PACF functions up to lag 12 as bar plots:
par(mfrow = c(1,2))
barplot(ARMAacf(ma=0.7,lag.max = 12)[-1],
xlab = "Lag", ylab = "ACF", main="ACF of MA(1)",
col = "red")
Note that the value of $\rho_0$ ($=1$) has been excluded from the ACF bar plot using negative indexing, ie using the code [-1].
Chapter 14
Section 4.2, page 37
Predicting three steps ahead in R using the ARIMA(1,0,1) model fitted to the data
generated in Section 2.1:
predict(fit, n.ahead = 3)
$pred
Time Series:
Start = 301
End = 303
Frequency = 1
[1] 1.1164950 0.7184494 0.4749197
$se
Time Series:
Start = 301
End = 303
Frequency = 1
[1] 0.9495467 1.4808587 1.6359396
The $pred component of the output contains the predicted values and the $se component
contains estimated standard errors.
The following code outputs predictions and estimated standard errors for 15 to 20 steps
ahead:
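# One possible version (fit is the ARIMA(1,0,1) model fitted above):
preds = predict(fit, n.ahead = 20)
preds$pred[15:20]
preds$se[15:20]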
As indicated in Section 4.1, the predicted values and standard errors are converging.
The predicted values are converging to the estimated mean of the process, which is 0.0911
to 4 decimal places.
The standard errors are converging to 1.722053, which is the square root of $\gamma_0$ for the fitted model.
Chapter 15
Section 1.1, pages 7-10
The R boxes on these pages have been updated. Replacement pages are included at the end of
this document.
Example R code for simulating a random sample of 100 values (with a seed of 3) from the
lognormal distribution with mean and standard deviation (on the logarithmic scale) of 0 and
1, respectively, is:
set.seed(3)
log.norm.sample = rlnorm(100, 0, 1)
Sample moments can then be calculated as well as the sample histogram plotted with the
hist() function. For example, the mean of this sample is:
mean(log.norm.sample)
[1] 1.414397
Similarly, the PDF, CDF and quantiles can be obtained using the R functions dlnorm(),
plnorm() and qlnorm().
There is no built in R code for the Pareto distribution (in the basic R installation) so we have
to define the functions rpareto(), dpareto(), ppareto() and qpareto() from first
principles as follows:
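# One possible set of definitions, using the parameterisation in the Tables
# (argument names are assumed):
dpareto = function(x, a, lambda){ a * lambda^a / (lambda + x)^(a + 1) }
ppareto = function(q, a, lambda){ 1 - (lambda / (lambda + q))^a }
qpareto = function(p, a, lambda){ lambda * ((1 - p)^(-1/a) - 1) }
rpareto = function(n, a, lambda){ qpareto(runif(n), a, lambda) }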
Example R code for simulating a random sample of 100 values (with a seed of 4) from the Pareto distribution with parameters $\alpha = 3$ and $\lambda = 200$ is:
set.seed(4)
pareto.sample = rpareto(100, 3, 200)
Sample moments can then be calculated as well as the sample histogram plotted with the
hist() function. For example, the mean of this sample is:
mean(pareto.sample)
[1] 120.12
The R boxes on these pages have been updated. Replacement pages are included at the end of
this document.
Chapter 16
Section 3.2, page 22
The first paragraph has been updated and a line of ActEd text added underneath. It now reads:
More generally we find that, for a large class of underlying distributions of the data, the
distribution of the threshold exceedances more closely resembles a generalised Pareto
distribution as the threshold u increases towards $x_F$.
Chapter 18
Section 1.1, page 8
Suppose claims (in £’s) have an exponential distribution with parameter 0.0005 . The R
code for simulating 10,000 claims, x, is given by:
x = rexp(10000, 0.0005)
We can obtain the simulated amounts paid by the insurer, y, using the vectorised minimum
function pmin() to compare each claim value with the retention limit, M:
y = pmin(x,M)
Similar code can be used to obtain the simulated reinsurer payments, z, using the
vectorised maximum function pmax():
z = pmax(0, x - M)
set.seed(1)
x = rexp(10000, 0.0005)
M = 3000
y = pmin(x, M)
z = pmax(0, x - M)
We can then obtain the simulated means and variances using the R functions mean() and
var(). For example:
mean(y)
[1] 1543.317
var(y)
[1] 1125872
mean(z)
[1] 453.4059
var(z)
[1] 1679221
We can use these vectors to estimate probabilities. For example, we can use the following
code to estimate the probability that the insurer pays less than £1,000 on any given claim:
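# For example (one possible call consistent with the output below):
mean(y < 1000)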
[1] 0.3949
Section 4, page 24
The R box on this page has been updated. Replacement pages for pages 23-26 are included at the
end of this document.
Chapter 19
Section 3.3, page 12
The Core Reading paragraph describing formula 19.5 has been updated to the following:
Formula (19.5) is the law of total variance, which can be seen as decomposing variability in
S into two distinct components. The first term represents the variability in S due to
variability between individual claims, and the second term is the variability in S attributable
to variability in the number of claims.
The equation near the bottom of the page showing that $\frac{d^3}{dt^3}\log M_S(t)\big|_{t=0}$ is equal to $\lambda m_3$ was incorrect. It has been updated to:

$$\frac{d^3}{dt^3}\log M_S(t)\bigg|_{t=0} = \frac{d^3}{dt^3}\,\lambda\left[M_X(t)-1\right]\bigg|_{t=0} = \lambda m_3$$
The R box on this page has been updated. Replacement pages are included for pages 15-18 at the
end of this document.
The following Core Reading paragraph has been added under the probability function for the
negative binomial distribution:
In this form, N counts the number of failures before k successes are observed in a sequence of independent Bernoulli trials, where the probability of an individual success is p.
Chapter 20
Section 1.2, page 9
To simulate the collective risk model with individual reinsurance we can combine the R
code from Chapters 15, 18 and 19.
For example, to simulate 10,000 values for a reinsurer where claims have a compound Poisson distribution with parameter 1,000 and a gamma claims distribution with $\alpha = 750$ and $\lambda = 0.25$, under individual excess of loss with retention 2,500, we could use:
set.seed(123)
sims = 10000
M = 2500
n = rpois(sims, 1000)
sR = rep(NA, sims)
for(i in 1:sims){
x = rgamma(n[i], shape = 750, rate = 0.25)
z = pmax(0, x - M)
sR[i] = sum(z)
}
We can now estimate moments, the coefficient of skewness, probabilities and quantiles as
before. For example, the sample mean and variance are:
mean(sR)
[1] 499588.8
var(sR)
[1] 257183130
Suppose aggregate claims have a compound Poisson distribution with parameter 1,000 and a gamma claims distribution with $\alpha = 750$ and $\lambda = 0.25$.
To simulate 10,000 values for a reinsurer under aggregate excess of loss with retention limit
3,000,000, we could take the simulations of the aggregate underlying claims, S , from
Section 3.4 of Chapter 19 and use the pmax() function:
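# A sketch, assuming the simulated aggregate claims are stored in the vector s:
sR.agg = pmax(0, s - 3000000)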
We can now estimate moments, the coefficient of skewness, probabilities and quantiles as
before. For example, the sample mean and variance are:
mean(sR.agg)
[1] 36153.85
var(sR.agg)
[1] 2932084069
Chapter 21
There were significant changes to the Core Reading and accompanying ActEd text throughout this
chapter. The full new chapter is included at the end of this document.
Chapter 10
Section 7.5, page 38
A question has been added to show the relationship between the number of positive groups and
the hypergeometric distribution. The question is as follows:
Question
Consider a bag containing $n_2 + 1$ blue balls (successes) and $n_1 - 1$ red balls (failures). Let B be the random variable representing the number of blue balls drawn when drawing a total of $n_1$ balls without replacement.

Show that B has the same distribution as G by considering the probability of drawing exactly t blue balls in the sample, ie $P(B = t)$.
Solution
Using notation from Subject CS1, B follows the hypergeometric distribution with:

$$P(B = t) = \frac{\binom{k}{t}\binom{N-k}{n-t}}{\binom{N}{n}} = \frac{\binom{n_2+1}{t}\binom{n_1-1}{n_1-t}}{\binom{n_1+n_2}{n_1}}$$

However:

$$\binom{n_1-1}{n_1-t} = \binom{n_1-1}{(n_1-1)-(n_1-t)} = \binom{n_1-1}{t-1}$$

Also:

$$n_1 + n_2 = m$$

So:

$$P(B = t) = \frac{\binom{n_2+1}{t}\binom{n_1-1}{t-1}}{\binom{n_1+n_2}{n_1}} = P(G = t)$$
Chapter 12
Summary, page 43
The expression for the penalised log-likelihood in the p-splines section has been updated to remove the factor of ½ in the penalty term.

The expression for the penalised log-likelihood in the fourth bullet point has been updated to remove the factor of ½ in the penalty term.
Chapter 16
Summary, page 37
The Generalised Pareto distribution summary has been updated to the following:
Let losses $X_i$ be IID with cumulative distribution $F(x_i)$. The distribution of conditional losses above a threshold u, $X - u \mid X > u$, is given by:

$$F_u(x) = \frac{F(x+u) - F(u)}{1 - F(u)}$$

For a large class of underlying distributions for the data, this distribution will more closely resemble a generalised Pareto distribution (GPD) as u increases towards $x_F$. The GPD has CDF:

$$G(x) = \begin{cases} 1 - \left(1 + \dfrac{\gamma x}{\beta}\right)^{-1/\gamma} & \gamma \ne 0 \\[2mm] 1 - \exp\left(-\dfrac{x}{\beta}\right) & \gamma = 0 \end{cases}$$

with a scale parameter, $\beta$, and a shape parameter, $\gamma$.
Chapter 17
Section 7.1, page 38 and Summary, page 47
The summary of upper and lower tail dependence results has been updated to include the case when $\rho = -1$ for the Student's t copula. The summary for this copula now reads:
Chapter 18
Section 1.3, page 11
The distribution in the question on this page is the three-parameter Pareto distribution, not the
generalised Pareto distribution. The question has been updated to read:
Question
Claims from a particular portfolio have a three-parameter Pareto distribution with parameters $\alpha = 6$, $\lambda = 200$ and $k = 4$. A proportional reinsurance arrangement is in force with a retained proportion of 80%.
proportion of 80%.
Calculate the mean and variance of the amount paid by the insurer and the reinsurer in respect of
a single claim.
Section 4, page 25
The solution on this page has been updated to include the general form of the estimate of the
parameter. Replacement pages for pages 23-26 are included at the end of this document.
Chapter 21
There were significant changes to the Core Reading throughout this chapter. The full new chapter
is included at the end of this document.
For further details on ActEd’s study materials, please refer to the 2022 Student Brochure, which is
available from the ActEd website at www.ActEd.co.uk.
3.2 Tutorials
For further details on ActEd’s tutorials, please refer to our latest Tuition Bulletin, which is available
from the ActEd website at www.ActEd.co.uk.
3.3 Marking
You can have your attempts at any of our assignments or mock exams marked by ActEd. When
marking your scripts, we aim to provide specific advice to improve your chances of success in the
exam and to return your scripts as quickly as possible.
For further details on ActEd’s marking services, please refer to the 2022 Student Brochure, which
is available from the ActEd website at www.ActEd.co.uk.
ActEd is always pleased to receive feedback from students about any aspect of our study
programmes. Please let us know if you have any specific comments (eg about certain sections of
the notes or particular questions) or general suggestions about how we can improve the study
material. We will incorporate as many of your suggestions as we can when we update the course
material each year.
If you have any comments on this course, please send them by email to [email protected].
$$_tp_x = S_x(t) = \exp\left(-\int_0^t \mu\, ds\right) = \exp\left(-\left[\mu s\right]_0^t\right) = \exp(-\mu t)$$

$$_tq_x = 1 - {}_tp_x = 1 - \exp(-\mu t)$$

For example, if $\mu_x$ takes the constant value 0.001 between ages 25 and 35, then the probability that a life aged exactly 25 will survive to age 35 is:

$$_{10}p_{25} = \exp\left(-\int_0^{10} 0.001\, dt\right) = e^{-0.01} = 0.99005$$
We can use R to simulate values from an exponential distribution, plot its PDF, and calculate
probabilities and percentiles.
Suppose we have an exponential distribution with parameter 0.5 . The PDF evaluated at
x , f ( x ) , can be calculated using the function dexp() as follows:
dexp(x, 0.5)
More generally, the PDF evaluated at a vector of values <values> for the exponential
distribution with parameter <rate> can be obtained with the following code structure:
dexp(<values>, <rate>)
For example, calculating f (2) and f (3) for the exponential distribution with parameter
0.5 :
dexp(c(2,3), 0.5)
A vector of rates can also be provided as the second argument. In this case, R cycles through the
inputted rates as it cycles through the inputted values.
The PDF can be plotted using the plot() function with general structure:
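plot(<x values>, <y values>, type = "l")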
This plots a series of points with x -coordinates given by the values in <x values> and
y -coordinates given by the values in <y values>. Specifying type = "l" (lower case ‘L’)
outputs a line graph joining the points.
Suppose we want to plot the PDF over the interval 0 to 10. Appropriate x -coordinates can
be obtained using the seq() function with the structure:
Example code for calculating the x -coordinates, y -coordinates and plotting the PDF is:
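# One possible version:
x = seq(0, 10, by = 0.01)
y = dexp(x, 0.5)
plot(x, y, type = "l", xlab = "x", ylab = "f(x)",
     main = "PDF of the Exp(0.5) distribution")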
The xlab, ylab arguments are used to label the axes. The main argument is used to give
the graph a title.
Alternatively, the curve() function can be used with the following structure:
Example code for plotting the PDF of the exponential distribution with parameter 0.5 is:
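# One possible version:
curve(dexp(x, 0.5), from = 0, to = 10,
      xlab = "x", ylab = "f(x)", main = "PDF of the Exp(0.5) distribution")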
pexp(2,0.5)
[1] 0.6321206
qexp(0.25, 0.5)
[1] 0.5753641
The function rexp(<n>, <rate>) can be used to generate a random sample of size <n>
from the exponential distribution with parameter <rate>.
For example, the following R code simulates 100 values from the exponential distribution
with parameter 0.5 using the seed 1 and stores them in the object exp.sample:
set.seed(1)
exp.sample = rexp(100, 0.5)
Sample moments can then be calculated using the mean() and var() functions. For
instance, the mean of this sample can be calculated as follows:
mean(exp.sample)
[1] 2.061353
A simple extension to the exponential model is the Weibull model, in which the survival
function Sx (t ) is given by the two-parameter formula:
$$S_x(t) = \exp\left(-ct^{\gamma}\right) \qquad (6.3)$$

Recall that $S_x(t) = 1 - F_x(t)$, where $F_x(t) = P(T_x \le t)$. The CDF of the Weibull distribution is given on page 15 of the Tables.

Since:

$$\mu_{x+t} = -\frac{\partial}{\partial t}\log\left[S_x(t)\right]$$

we see that:

$$\mu_{x+t} = -\frac{\partial}{\partial t}\left[-ct^{\gamma}\right] = -\left[-c\gamma t^{\gamma-1}\right] = c\gamma t^{\gamma-1}$$

Different values of the parameter $\gamma$ can give rise to a hazard that is monotonically increasing or monotonically decreasing as t increases, or in the specific case where $\gamma = 1$, a hazard that is constant, since if $\gamma = 1$:

$$c\gamma t^{\gamma-1} = c \times 1 \times t^{0} = c$$

This can be seen also from the expression for $S_x(t)$ (6.3), from which it is clear that, when $\gamma = 1$, the Weibull model is the same as the exponential model.
We can adjust the R code given above for an exponential distribution to calculate corresponding
quantities for a Weibull distribution.
Example R code for simulating a random sample of 100 values from the Weibull distribution with $c = 2$ and $\gamma = 0.25$ is:
set.seed(5)
weibull.sample = rweibull(100, 0.25, 2^(-1/0.25))
Note that R uses a different parameterisation for the scale parameter, c, from that in the
Formulae and Tables for Actuarial Examinations and presented here.
Sample moments can then be calculated as well as the sample histogram plotted with the
hist() function. For example, the mean of this sample is:
mean(weibull.sample)
[1] 0.7669583
Similarly, the PDF, CDF and quantiles can be obtained using the R functions dweibull(),
pweibull() and qweibull().
Alternatively, we could redefine them from first principles using the parameterisation
presented here and in the Formulae and Tables for Actuarial Examinations as follows:
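# One possible set of definitions; the argument order (c then gamma) is assumed,
# but dweibull2() is the name referred to later in these pages:
dweibull2 = function(x, c, g){ c * g * x^(g - 1) * exp(-c * x^g) }
pweibull2 = function(q, c, g){ 1 - exp(-c * q^g) }
qweibull2 = function(p, c, g){ (-log(1 - p) / c)^(1 / g) }
rweibull2 = function(n, c, g){ qweibull2(runif(n), c, g) }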
$E(e^{tX}\mid\text{Type I})$ is the MGF of the exponential distribution with mean 500, ie $(1 - 500t)^{-1}$.
Similarly, $E(e^{tX}\mid\text{Type II})$ is the MGF of the exponential distribution with mean 1,000, ie $(1 - 1{,}000t)^{-1}$.
So:
We can use R to simulate values from statistical distributions, plot their PDFs, and calculate
probabilities and percentiles. An example involving the exponential distribution is given below.
Suppose we have an exponential distribution with parameter 0.5 . The PDF evaluated at
x , f ( x ) , can be calculated using the function dexp() as follows:
dexp(x, 0.5)
More generally, the PDF evaluated at a vector of values <values> for the exponential
distribution with parameter <rate> can be obtained with the following code structure:
dexp(<values>, <rate>)
For example, calculating f (2) and f (3) for the exponential distribution with parameter
0.5 :
dexp(c(2,3), 0.5)
A vector of rates can also be provided as the second argument. In this case, R cycles through the
inputted rates as it cycles through the inputted values.
The PDF can be plotted using the plot() function with general structure:
This plots a series of points with x -coordinates given by the values in <x values> and
y -coordinates given by the values in <y values>. Specifying type = "l" (lower case ‘L’)
outputs a line graph joining the points.
Suppose we want to plot the PDF over the interval 0 to 10. Appropriate x -coordinates can
be obtained using the seq() function with the structure:
Example code for calculating the x -coordinates, y -coordinates and plotting the PDF is:
The xlab, ylab arguments are used to label the axes. The main argument is used to give
the graph a title.
Figure 15.1
Alternatively, the curve() function can be used with the following structure:
Example code for plotting the PDF of the exponential distribution with parameter 0.5 is:
pexp(2, 0.5)
[1] 0.6321206
qexp(0.25, 0.5)
[1] 0.5753641
The function rexp(<n>, <rate>) can be used to generate a random sample of size <n>
from the exponential distribution with parameter <rate>.
For example, the following R code simulates 100 values from the exponential distribution
with parameter 0.5 using the seed 1 and stores them in the object exp.sample:
set.seed(1)
exp.sample = rexp(100, 0.5)
Sample moments can then be calculated using the mean() and var() functions. For
instance, the mean of this sample can be calculated as follows:
mean(exp.sample)
[1] 2.061353
Figure 15.2
$$f(x) = \frac{\lambda^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha-1}\exp(-\lambda x), \qquad x > 0$$

The gamma function, $\Gamma(\alpha)$, appears in the denominator of this PDF. The definition and properties of this function are given on page 5 of the Tables.

$$E(X) = \frac{\alpha}{\lambda} \qquad\qquad \text{var}(X) = \frac{\alpha}{\lambda^2}$$
Question

Show that the MGF of the $Gamma(\alpha, \lambda)$ distribution is given by:

$$M_X(t) = \left(1 - \frac{t}{\lambda}\right)^{-\alpha}$$
Solution
$$M_X(t) = E\left(e^{tX}\right) = \int_0^\infty e^{tx}\,\frac{\lambda^{\alpha}}{\Gamma(\alpha)}x^{\alpha-1}e^{-\lambda x}\,dx = \int_0^\infty\frac{\lambda^{\alpha}}{\Gamma(\alpha)}x^{\alpha-1}e^{-(\lambda-t)x}\,dx$$

We can make the integrand look like the PDF of the $Gamma(\alpha, \lambda - t)$ distribution by writing:

$$M_X(t) = \frac{\lambda^{\alpha}}{(\lambda-t)^{\alpha}}\int_0^\infty\frac{(\lambda-t)^{\alpha}}{\Gamma(\alpha)}x^{\alpha-1}e^{-(\lambda-t)x}\,dx$$

Since the integral of a PDF over its full range is 1, this gives:

$$M_X(t) = \left(\frac{\lambda}{\lambda-t}\right)^{\alpha} = \left(1 - \frac{t}{\lambda}\right)^{-\alpha}, \qquad\text{for } t < \lambda$$
The variance and skewness can be obtained more quickly using the cumulant generating function (CGF). Recall that:

$$C_X(t) = \ln M_X(t)$$
Question

Use the CGF to derive the coefficient of skewness of the $Gamma(\alpha, \lambda)$ distribution.
Solution
Since $X \sim Gamma(\alpha, \lambda)$:

$$C_X(t) = \ln\left(1 - \frac{t}{\lambda}\right)^{-\alpha} = -\alpha\ln\left(1 - \frac{t}{\lambda}\right)$$

$$C_X'(t) = -\alpha\times\left(-\frac{1}{\lambda}\right)\left(1 - \frac{t}{\lambda}\right)^{-1} = \frac{\alpha}{\lambda}\left(1 - \frac{t}{\lambda}\right)^{-1}$$

$$C_X''(t) = \frac{\alpha}{\lambda}\times(-1)\left(-\frac{1}{\lambda}\right)\left(1 - \frac{t}{\lambda}\right)^{-2} = \frac{\alpha}{\lambda^2}\left(1 - \frac{t}{\lambda}\right)^{-2}$$

$$C_X'''(t) = \frac{\alpha}{\lambda^2}\times(-2)\left(-\frac{1}{\lambda}\right)\left(1 - \frac{t}{\lambda}\right)^{-3} = \frac{2\alpha}{\lambda^3}\left(1 - \frac{t}{\lambda}\right)^{-3}$$

So:

$$\text{skew}(X) = C_X'''(0) = \frac{2\alpha}{\lambda^3}\left(1 - \frac{0}{\lambda}\right)^{-3} = \frac{2\alpha}{\lambda^3}$$

$$\text{coeff of skew}(X) = \frac{\text{skew}(X)}{\text{var}(X)^{3/2}} = \frac{2\alpha/\lambda^3}{(\alpha/\lambda^2)^{3/2}} = \frac{2}{\alpha^{1/2}}$$
Formulae for the PDF, MGF, mean, variance, non-central moments and coefficient of skewness of
the gamma distribution are all given on page 12 of the Tables.
There is no closed form (ie no simple formula) for the CDF of a gamma random variable, which
means that it is not easy to find gamma probabilities directly without using a computer package
such as R. However, these probabilities can be obtained using the relationship between the
gamma and chi-squared distributions.
$$2\lambda X \sim \chi^2_{2\alpha}$$

As an illustration of how this relationship can be used, suppose that $X \sim Gamma(10, 4)$ and we want to calculate $P(X > 4.375)$. Using the result above, we know that $8X \sim \chi^2_{20}$, so:

$$P(X > 4.375) = P(8X > 8\times 4.375) = P(\chi^2_{20} > 35)$$
So:
Example R code for simulating a random sample of 100 values (with a seed of 2) from the gamma distribution with $\alpha = 2$ and $\lambda = 0.25$ is:
set.seed(2)
gamma.sample = rgamma(100, 2, 0.25)
Sample moments can then be calculated using the mean() and var() functions as before,
as well as the sample histogram plotted with the hist() function.
mean(gamma.sample)
[1] 8.732949
The PDF, CDF and quantiles of gamma distributions can be obtained using the R functions
dgamma(), pgamma() and qgamma() respectively.
Question
Derive the formula for the MGF of a standard normal random variable.
3 Estimation
The methods of maximum likelihood, moments and percentiles can be used to fit
distributions to sets of data.
The method of percentiles is outlined in Section 3.3; the other methods have been covered
in Subject CS1, Actuarial Statistics.
We will now give a summary of the method of moments and maximum likelihood estimation and
introduce the method of percentiles. We then discuss how these methods may be applied to the
distributions covered in this chapter. In the next Section we will give a brief reminder of how the
chi-squared test can be used to check the fit of a statistical distribution to a data set.
$$m_j = \frac{1}{n}\sum_{i=1}^{n} x_i^j, \qquad j = 1, 2, \ldots, r$$

where $m_j$ denotes the $j$th non-central sample moment.
So, for example, if we are trying to estimate the value of a single parameter, and we have a sample of n claims whose sizes are $x_1, x_2, \ldots, x_n$, we would solve the equation:

$$E(X) = \frac{1}{n}\sum_{i=1}^{n} x_i$$
ie we would equate the first non-central moments for the population and the sample.
If we are trying to find estimates for two parameters (for example if we are fitting a gamma
distribution and need to obtain estimates for both parameters), we would solve the simultaneous
equations:
$$E(X) = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad\text{and}\qquad E(X^2) = \frac{1}{n}\sum_{i=1}^{n} x_i^2$$
In fact, in the two-parameter case, estimates are often obtained by equating sample and population means and variances. Since:

$$s_n^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n}\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right)$$

this will give the same estimates as would be obtained by equating the first two non-central moments.
More generally, we use as many equations of the form $E(X^k) = \frac{1}{n}\sum_{i=1}^{n} x_i^k$, $k = 1, 2, \ldots$, as are needed to determine estimates of the relevant parameters.
The likelihood function is:

$$L(\theta) = \prod_{i=1}^{n} P\left(X = x_i \mid \theta\right) \qquad\text{for a discrete random variable, } X$$

or:

$$L(\theta) = \prod_{i=1}^{n} f\left(x_i \mid \theta\right) \qquad\text{for a continuous random variable, } X$$

To determine the MLE, the likelihood function needs to be maximised. Often it is practical to consider the log-likelihood function:

$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{n}\log P\left(X = x_i \mid \theta\right) \qquad\text{for a discrete random variable, } X$$

or:

$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{n}\log f\left(x_i \mid \theta\right) \qquad\text{for a continuous random variable, } X$$

The MLE, $\hat\theta$, is then found by solving:

$$\frac{d}{d\theta}\,\ell(\hat\theta) = 0$$
It is necessary to check, either formally or through simple logic, that the turning point is a
maximum. Generally, the likelihood starts at zero, finishes at or tends to zero, and is
nonnegative. Therefore, if there is one turning point it must be a maximum.
Where there is more than one parameter, the MLEs for each parameter can be determined
by taking partial derivatives of the log-likelihood function and setting each to zero.
Checking that the turning point is a maximum is more complicated when there is more than
one parameter. This is beyond the scope of the syllabus.
The determination of MLEs when the data are incomplete is covered in Chapter 18.
We will now look at the distributions described earlier in this chapter and consider how the
parameters can be estimated in each case.
For example, suppose that an insurance company uses an exponential distribution to model the
cost of repairing insured vehicles that are involved in accidents, and the average cost of repairing
a random sample of 1,000 vehicles is £2,200.
We can calculate the maximum likelihood estimate of the exponential parameter as follows.
The likelihood of obtaining these values for the costs, if they come from an exponential distribution with parameter $\lambda$, is:

$$L = \prod_{i=1}^{1{,}000}\lambda e^{-\lambda x_i} = \lambda^{1{,}000} e^{-\lambda\sum x_i} = \lambda^{1{,}000} e^{-1{,}000\lambda\bar{x}}$$

(where $\bar{x} = \frac{1}{1{,}000}\sum_{i=1}^{1{,}000} x_i$ denotes the average claim amount).
We want to determine the value of $\lambda$ that maximises the likelihood, or equivalently the value that maximises the log-likelihood:

$$\ln L = 1{,}000\ln\lambda - 1{,}000\lambda\bar{x}$$

Differentiating with respect to $\lambda$:

$$\frac{\partial}{\partial\lambda}\ln L = \frac{1{,}000}{\lambda} - 1{,}000\bar{x}$$

Setting this equal to zero gives $\lambda = \frac{1}{\bar{x}}$. Differentiating again:

$$\frac{\partial^2}{\partial\lambda^2}\ln L = -\frac{1{,}000}{\lambda^2}$$

Since the second derivative is negative when $\lambda = \frac{1}{\bar{x}}$, the stationary point is a maximum. So $\hat\lambda$, the maximum likelihood estimate of $\lambda$, is $\frac{1}{\bar{x}}$, or $\frac{1}{2{,}200}$.
Alternatively, we could argue that the likelihood function is continuous and is always positive (by necessity) and that $\lambda^n e^{-n\lambda\bar{x}} \to 0$ as $\lambda \to 0$ or $\lambda \to \infty$. So any stationary point that we find must be a maximum.
The ML estimate for the exponential distribution is $\hat\lambda = 1/\bar{x}$. We can use R to calculate the ML estimate for the sample generated earlier:
1 / mean(exp.sample)
[1] 0.4851183
We could also use the fitdistr() function in the MASS package as follows:
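# The MASS package must be loaded first:
library(MASS)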
The fitdistr() function calculates ML estimates analytically when they exist in closed form (ie can be expressed in terms of elementary functions) and uses numerical methods when they do not. For example, the function returns $1/\bar{x}$ for the exponential distribution.
Using the function to calculate the ML estimate based on the data in exp.sample:
fitdistr(exp.sample, "exponential")
rate
0.48511830
(0.04851183)
The second number outputted in brackets is an estimate of the standard error of the
estimator of the rate parameter.
Example R code to construct a function to calculate the negative log-likelihood for the
exponential distribution is:
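# One possible version (the function and argument names are taken from the
# nlm() warning messages shown below):
nfMLE = function(param, data.vector){
  -sum(log(dexp(data.vector, param)))
}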
So, to fit an exponential distribution to the vector exp.sample with initial estimate of
0.5 we could use:
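# One possible version:
lambda = 0.5
MLE = nlm(nfMLE, lambda, data.vector = exp.sample)
MLE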
$minimum
[1] 172.3363
$estimate
[1] 0.4851181
$gradient
[1] 0.0001084857
$code
[1] 1
$iterations
[1] 2
Warning messages:
1: In dexp(data.vector, param) : NaNs produced
2: In nlm(nfMLE, lambda, data.vector = exp.sample) :
NA/Inf replaced by maximum positive value
3: In dexp(data.vector, param) : NaNs produced
4: In nlm(nfMLE, lambda, data.vector = exp.sample) :
NA/Inf replaced by maximum positive value
$minimum is the value of the negative log-likelihood for the estimated parameter value.
$estimate is the estimate of the parameter. The estimated value of 0.4851181 is close to
the true value of 0.5. It is also close to the analytical estimate of 0.4851183 obtained by
using 1 x or the fitdistr() function.
In general, the estimates obtained with this method may differ to those obtained when using
the fitdistr() function, even when it also uses numerical methods.
$code indicates why the numerical algorithm terminated. It is important to check whether
this is because a possible solution has been found or for some other reason. An output of 1
indicates that the relative gradient is close to 0, ie it is likely that a maximum has been
identified. More detail on the possible output codes is provided in the help page for this
function. This can be accessed with the command ?nlm.
$iterations indicates how many iterations the numerical algorithm performed before
terminating.
The output from the nlm() function has been stored in the R object MLE. These
components can be extracted from this object. For example, extracting the estimate:
MLE$estimate
[1] 0.4851181
To obtain ML estimates in R, we can use the fitdistr() function in the MASS package as
follows:
fitdistr(gamma.sample, "gamma")
shape rate
1.66867434 0.19107663
(0.21650487) (0.02886839)
As closed form expressions do not exist for the ML estimates, the fitdistr() function
uses a numerical algorithm and this requires starting values. The default starting values
used by the fitdistr() function are the method of moments estimates. Alternative
starting values <alpha> and <lambda> can be specified using the argument start as
follows:
fitdistr(gamma.sample, "gamma",
start = list(shape = <alpha>, rate = <lambda>))
Specifying a lower (or upper) limit changes the numerical algorithm used and may lead to
different estimates.
Alternatively, we could define a function to calculate the negative log-likelihood and use the
nlm() function on it as before:
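# One possible version (neg.log.lik is the name referred to below):
neg.log.lik = function(params, data.vector){
  -sum(log(dgamma(data.vector, params[1], params[2])))
}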
As the nlm() function can only minimise with respect to one input, the params input to the
neg.log.lik() function is a vector of length two containing values for the parameters,
and .
So, to fit a gamma distribution to the vector gamma.sample with initial estimates of $\alpha = 1$ and $\lambda = 0.5$, say, we could use:
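# One possible call:
nlm(neg.log.lik, c(1, 0.5), data.vector = gamma.sample)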
$minimum
[1] 309.9164
$estimate
[1] 1.6686885 0.1910796
$gradient
[1] -1.440461e-05 6.125589e-05
$code
[1] 1
$iterations
[1] 18
$$\hat\mu = \bar{x} \qquad\text{and}\qquad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

The estimate for the population variance is $\frac{n-1}{n}$ times the usual sample variance. Of course, provided the sample size is large, there will be little difference between estimates calculated using the two different sample variance formulae.
$$\hat\mu = \frac{1}{n}\sum_{i=1}^{n} y_i \qquad\text{and}\qquad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2$$

where $y_i = \ln x_i$.
The ML estimates for the lognormal distribution exist in closed form. We can use R to
calculate the ML estimates of and for the sample generated earlier:
n = length(log.norm.sample)
(m = mean(log(log.norm.sample)))
[1] 0.01103557
(s = sqrt((n-1) / n * var(log(log.norm.sample))))
[1] 0.8517857
We could also use the fitdistr() function in the MASS package as follows, which
calculates the estimates in the same way:
fitdistr(log.norm.sample, "log-normal")
meanlog sdlog
0.01103557 0.85178567
(0.08517857) (0.06023034)
Alternatively, we can define a function to calculate the negative log-likelihood and use the
nlm() function on it as before (even though, in practice, a numerical approach is not
required here):
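# A sketch (the function name is assumed):
neg.log.lik.lnorm = function(params, data.vector){
  -sum(log(dlnorm(data.vector, params[1], params[2])))
}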
The input params is a vector of length two, corresponding to the parameters and , the
mean and standard deviation (on the logarithmic scale).
For example, to fit a lognormal distribution to the vector log.norm.sample with the sample mean and the sample standard deviation of the logged data as initial estimates ($\mu = 0.01$ and $\sigma = 0.86$ to 2 decimal places) we could use:
(m = mean(log(log.norm.sample)))
[1] 0.01103557
(s = sd(log(log.norm.sample)))
[1] 0.8560768
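# One possible call (using the function sketched above):
nlm(neg.log.lik.lnorm, c(m, s), data.vector = log.norm.sample)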
$minimum
[1] 126.9554
$estimate
[1] 0.01103519 0.85178553
$gradient
[1] 1.698197e-05 9.808332e-05
$code
[1] 1
$iterations
[1] 2
As an example, suppose that based on an analysis of past claims, an insurance company believes
that individual claims in a particular category for the coming year will be lognormally distributed
with a mean size of £5,000 and a standard deviation of £7,500. The company wants to estimate
the proportion of claims that will exceed £25,000.
To do this, it needs to estimate the parameters, and 2 , of the lognormal distribution. Equating
the formulae for the mean and standard deviation of the lognormal distribution to the values given
gives:
$$e^{\mu + \frac{1}{2}\sigma^2} = 5{,}000 \qquad\text{and}\qquad e^{\mu + \frac{1}{2}\sigma^2}\sqrt{e^{\sigma^2} - 1} = 7{,}500$$

Dividing the second equation by the first:

$$\sqrt{e^{\sigma^2} - 1} = \frac{7{,}500}{5{,}000} = 1.5 \quad\Rightarrow\quad \sigma^2 = \ln\left(1.5^2 + 1\right) = 1.179$$

and hence $\mu = \ln 5{,}000 - \tfrac{1}{2}\times 1.179 = 7.928$.

The required proportion is then:

$$P\left(N(7.928, 1.179) > \ln 25{,}000\right) = P\left(N(0,1) > \frac{\ln 25{,}000 - 7.928}{\sqrt{1.179}}\right) = 1 - \Phi(2.025) = 0.021$$
The method of moments estimates were also calculated in the previous R example, which we
then used as initial values when determining the maximum likelihood estimates.
Question
Claims arising from a particular group of policies are believed to follow a Pareto distribution with parameters $\alpha$ and $\lambda$. A random sample of 20 claims gives values such that $\sum x = 1{,}508$ and $\sum x^2 = 257{,}212$. Estimate $\alpha$ and $\lambda$ using the method of moments.
Solution

The mean of the Pareto distribution is:

$$E(X) = \frac{\lambda}{\alpha - 1}$$

Rearranging the variance formula to find $E(X^2)$, we have:

$$E(X^2) = \text{var}(X) + \left[E(X)\right]^2 = \frac{2\lambda^2}{(\alpha-1)(\alpha-2)}$$

So we set:

$$\frac{\lambda}{\alpha-1} = \frac{1{,}508}{20} = 75.4 \qquad\text{and}\qquad \frac{2\lambda^2}{(\alpha-1)(\alpha-2)} = \frac{257{,}212}{20} = 12{,}860.6$$

Squaring the first of these equations and substituting into the second, we see that:

$$\frac{2\times 75.4^2(\alpha-1)}{\alpha-2} = 12{,}860.6$$

Solving this equation, we find that the method of moments estimates of $\alpha$ and $\lambda$ are 9.630 and 650.7, respectively.
To carry out numerical maximum likelihood estimation, we can define a function to calculate
the negative log-likelihood and use the nlm() function on it as before:
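# A sketch (the function name is assumed); dpareto() was defined earlier:
neg.log.lik.pareto = function(params, data.vector){
  -sum(log(dpareto(data.vector, params[1], params[2])))
}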
Here we are using the dpareto() function that we defined earlier. The input params is a vector of length two, corresponding to the parameters $\alpha$ and $\lambda$.
For example, to fit a Pareto distribution to the vector pareto.sample with initial estimates of $\alpha = 4.53$ and $\lambda = 424$ (the method of moments estimates using the $n-1$ denominator sample variance) we could use:
m = mean(pareto.sample)
v = var(pareto.sample)
(a = (-2*v/m^2) / (1 - v/m^2))
[1] 4.530004
(lam = m * (a - 1))
[1] 424.024
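# One possible call (using the function sketched above):
nlm(neg.log.lik.pareto, c(a, lam), data.vector = pareto.sample)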
$minimum
[1] 575.0624
$estimate
[1] 3.540334 308.705072
$gradient
[1] -3.692868e-06 4.934819e-08
$code
[1] 1
$iterations
[1] 19
As for estimation, the CDF does not exist in closed form, so the method of percentiles is not
available.
ML can be used, but again suitable computer software is required; the method of moments
can provide initial estimates for any iterative scheme.
We will need to define a function to calculate the negative log-likelihood and use the
function nlm() on it as before.
To obtain ML estimates in R, we could use the fitdistr() function in the MASS package. As both parameters need to be positive, we can specify a lower limit as follows:
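# One possible call; the result is stored in the object MLE for use below:
(MLE = fitdistr(weibull.sample, "weibull", lower = 0))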
shape scale
0.24117205 0.04588711
(0.01930196) (0.02000550)
These estimates are for the parameterisation used in R. We can transform these into
estimates for the parameterisation used here in the notes and in the Tables (using the
invariance property of ML estimates) as follows:
(c = MLE$estimate["scale"] ^ (-MLE$estimate["shape"]))
scale
2.10263
(g = MLE$estimate["shape"])
shape
0.241172
Recall that the shape and scale parameters used by R are $\gamma$ and $c^{-1/\gamma}$ respectively. So:

$$\gamma = \text{shape} \qquad\text{and}\qquad c = \text{scale}^{-\text{shape}}$$
The fitdistr() function uses a numerical algorithm for the Weibull distribution, which requires
starting values. If no values are provided, then the function automatically calculates a starting
point.
Estimates obtained from the method of percentiles could also be used as the initial values.
Q1 = quantile(weibull.sample, 0.25)
Q3 = quantile(weibull.sample, 0.75)
25%
0.2035839
(c = log(0.75) / -Q1^g)
25%
1.868239
fitdistr(weibull.sample, "weibull",
start = list(shape = g, scale = c^(-1/g)),
lower = 0)
shape scale
0.24117530 0.04587929
(0.01930205) (0.01999973)
Converting these into the parameterisation presented here and in the Tables after first saving the
fit in an object:
(c = fit$estimate["scale"] ^ (-fit$estimate["shape"]))
scale
2.102737
(g = fit$estimate["shape"])
shape
0.2411753
These estimates are very similar to those calculated using the starting values calculated by
the function.
Alternatively, we could define a function to calculate the negative log-likelihood and use the
function nlm() on it as before:
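# A sketch (function name and starting values are assumed); dweibull2() uses the
# parameterisation in the Tables:
neg.log.lik.weibull = function(params, data.vector){
  -sum(log(dweibull2(data.vector, params[1], params[2])))
}
nlm(neg.log.lik.weibull, c(2, 0.25), data.vector = weibull.sample)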
$minimum
[1] -251.7054
$estimate
[1] 2.1027447 0.2411672
$gradient
[1] -6.758241e-07 1.164750e-03
$code
[1] 2
$iterations
[1] 13
Here dweibull2() has been used to construct the negative likelihood function. So, these
estimates are for the parameterisation presented here and in the Formulae and Tables for
Actuarial Examinations.
Recall that $code indicates why the numerical algorithm terminated. An output of 2
indicates that it is likely that the maximum likelihood estimate has been identified. More
detail on the possible output codes is provided in the help page for this function. This can
be accessed with the command ?nlm.
In the case where $\gamma$ has the known value $\gamma_0$, maximum likelihood is easy enough.

To do this, we let $y_i = x_i^{\gamma_0}$. If the original distribution is Weibull, the y values now have an exponential distribution. The MLE of c can now be determined in the usual way.

In the case where $\gamma$ has the known value $\gamma_0$, then $y_i = x_i^{\gamma_0}$ has a Pareto distribution if the original distribution is Burr. This can be seen by comparing the CDFs. A Pareto distribution can then be fitted to the $y_i$'s.
In the method of moments, the first two moments are used if there are two unknown
parameters, and this seems intuitively reasonable (although the theoretical basis for this is
not so clear). In a similar fashion, when using the method of percentiles, the median would
be used if there were one parameter to estimate. With two parameters, the best procedure
is less clear, but the lower and upper quartiles seem a sensible choice.
Example
Estimate c and $\gamma$ in the Weibull distribution using the method of percentiles, where the first sample quartile is 401 and the third sample quartile is 2,836.75.

Solution

The two equations for c and $\gamma$ are:

$$c\times 401^{\gamma} = -\ln 0.75 \qquad\text{and}\qquad c\times 2{,}836.75^{\gamma} = -\ln 0.25$$

Dividing, it is found that $\tilde\gamma = 0.8038$, and hence $\tilde c = 0.002326$, where ~ denotes the percentile estimate. Note that $\tilde\gamma$ is less than 1, indicating a fatter tail than the exponential distribution gives.
$$F(x) = 1 - e^{-cx^{\gamma}}$$

So, when using the method of percentiles with the lower and upper quartiles, $Q_1$ and $Q_3$, we have:

$$F(Q_1) = 1 - e^{-cQ_1^{\gamma}} = 0.25$$

$$F(Q_3) = 1 - e^{-cQ_3^{\gamma}} = 0.75$$

ie:

$$cQ_1^{\gamma} = -\ln 0.75 \qquad\text{and}\qquad cQ_3^{\gamma} = -\ln 0.25$$

Dividing one equation by the other and taking logs gives:

$$\gamma = \frac{\ln\left(\dfrac{\ln 0.75}{\ln 0.25}\right)}{\ln\left(\dfrac{Q_1}{Q_3}\right)}$$

The estimate of $\gamma$ can then be substituted into either of the starting equations to find the estimate for c.
We can apply the method of percentiles to any distribution for which it is possible to calculate a
closed form for the cumulative distribution function, although the resulting algebra can be messy.
Question
Claims arising from a particular group of policies are believed to follow a Pareto distribution with
parameters and . A random sample of 20 claims has a lower quartile of 11 and an upper
quartile of 85. Estimate the values of and using the method of percentiles.
Solution
The cumulative distribution function of the Pareto distribution is $F(x) = 1 - \left(\dfrac{\lambda}{\lambda + x}\right)^{\alpha}$, so:

$$F(Q_1) = 1 - \left(\frac{\lambda}{\lambda + Q_1}\right)^{\alpha} = 0.25$$

$$F(Q_3) = 1 - \left(\frac{\lambda}{\lambda + Q_3}\right)^{\alpha} = 0.75$$

So:

$$Q_1 = \lambda\left[(3/4)^{-1/\alpha} - 1\right] \qquad\text{and}\qquad Q_3 = \lambda\left[(1/4)^{-1/\alpha} - 1\right]$$

Substituting in the sample quartiles:

$$11 = \lambda\left[(3/4)^{-1/\alpha} - 1\right]$$

$$85 = \lambda\left[(1/4)^{-1/\alpha} - 1\right]$$

Dividing:

$$\frac{11}{85} = \frac{(3/4)^{-1/\alpha} - 1}{(1/4)^{-1/\alpha} - 1}$$

We cannot solve this algebraically but it can easily be done on a computer, eg using the Goal Seek function in Excel. Doing this, we find that $\alpha = 1.284$ and hence $\lambda = 43.790$.
These estimates are very different from those obtained using the method of moments. (In an earlier question, we calculated the method of moments estimates of $\alpha$ and $\lambda$ to be 9.630 and 650.7, respectively.)
The method of percentiles is very unreliable for estimating the parameters of a Pareto
distribution unless we use extremely large samples. In this particular case, the method of
percentiles is unlikely to give us reasonable estimates unless we use samples of, say, 1,000 or
more.
For example, plotting a histogram of the data in exp.sample and overlaying the fitted
distribution:
lambda = 1 / mean(exp.sample)
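# One possible version (plotting options are assumed):
hist(exp.sample, freq = FALSE)
curve(dexp(x, lambda), add = TRUE, col = "red")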
In the hist() function, the argument freq has been set to FALSE to plot the data on a
density scale rather than a frequency scale. This makes it comparable with the scale of a
PDF. Alternatively, the prob argument can be set to TRUE.
In the curve() function, the add argument has been set to TRUE to add the curve to the
existing histogram.
A legend can also be added to label the different information in the plot:
legend("topright",
legend = c("histogram of sample from Exp(0.5) distribution",
"Fitted exponential distribution PDF"),
col = c("black", "red"),
lty = 1)
Figure 15.3
Better yet, we can plot an empirical density function from the data, using the function
density(), and compare to the true density function of the fitted distribution. We can add
the empirical density with the lines() function:
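# A sketch: add the empirical density to the existing plot (colour assumed):
lines(density(exp.sample), col = "blue")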
Figure 15.4
An even better way is to use the qqplot() function to compare the sample data to
theoretical values from the fitted model distribution:
A straight diagonal line indicates perfect fit. A comparison line can be added with the
abline(<intercept>, <slope>) function:
abline(0, 1)
The relevant theoretical values can be calculated using the quantile function of the relevant
distribution, evaluated at appropriate percentage points. These can be calculated using the
ppoints(<n>) function, where <n> is the sample size.
For example:
n = length(exp.sample)
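# One possible call (axis labels assumed):
qqplot(qexp(ppoints(n), lambda), exp.sample,
       xlab = "Theoretical quantiles", ylab = "Sample quantiles")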
abline(0, 1)
Figure 15.5
The fit of the distribution can also be tested formally by using a $\chi^2$ test. The $\chi^2$ test has been covered in Subject CS1, Actuarial Statistics.
Given a sample of claims in the vector x, a set of sample payments of the insurer, y, and the
reinsurer, z, with retention M after applying the inflation factor k would be:
y = pmin(k * x, M)
z = pmax(0, k * x - M)
We can then estimate moments, probabilities and quantiles for the distributions of the
insurer’s and reinsurer’s payments as before.
4 Estimation
Consider the problem of estimation in the presence of excess of loss reinsurance. Suppose
that the claims record shows only the net claims paid by the insurer. A typical claims
record might be:
As before, we wish to estimate the parameters for the distribution we have assumed for the
claims.
The method of moments is not available since even the mean claim amount cannot be
computed. On the other hand, it may be possible to use the method of percentiles without
alteration; this would happen if the retention level M is high and only the higher sample
percentiles were affected by the (few) reinsurance claims.
The statistical terminology for a sample of the form (18.6) is censored. In general, a
censored sample occurs when some values are recorded exactly and the remaining values
are known only to exceed a particular value, here the retention level M .
Maximum likelihood can be applied to censored samples. The likelihood function is made
up of two parts. If the values of x1, x2 , ... , x n are recorded exactly these contribute a factor
of:
$$L_1(\theta) = \prod_{i=1}^{n} f(x_i; \theta)$$

If a further m claims are referred to the reinsurer, then the insurer records a payment of M for each of these claims. These censored values then contribute a factor:

$$L_2(\theta) = \prod_{j=1}^{m} P(X > M), \qquad\text{ie } \left[P(X > M)\right]^m$$

The full likelihood is then:

$$L(\theta) = \prod_{i=1}^{n} f(x_i;\theta)\times\left[1 - F(M;\theta)\right]^m$$
The reason for multiplying is that the likelihood reflects the probability of getting the n claims
with known values and m claims exceeding M . Also, we are assuming that the claims are
independent.
In R, we can define a function to calculate the negative censored log-likelihood and use the
function nlm() on it as in Chapter 15.
set.seed(1)
exp.sample = rexp(100, 0.5)
Say that the values in this vector represent full claim amounts and that the insurer has
excess of loss reinsurance with M 3. We can create a vector of the censored data (the
claim amounts paid by the insurer) as follows:
M = 3
cens.sample = exp.sample
cens.sample[cens.sample > M] = M
The final line selects all the claim values greater than the retention limit and sets them equal to this
limit, which is the amount paid by the insurer on these claims.
For example, counting how many claims were above the retention limit:
length(cens.sample[cens.sample == M])
[1] 19
We can then create a function to calculate the negative log-likelihood assuming that only the censored data is available (and not the underlying claim amounts). One possible version (the function and argument names are assumed) is:

neg.cens.log.lik = function(param, data.vector, M){
  # contribution from the fully observed claims (values below the retention limit)
  cont.1 = -sum(log(dexp(data.vector[data.vector < M], param)))
  # contribution from the censored claims (values recorded as exactly M)
  cont.2 = -sum(log(1 - pexp(data.vector[data.vector == M], param)))
  cont.1 + cont.2
}
As we are working with continuous data, we assume that all values in the censored data
exactly equal to the retention limit correspond to underlying claim values over the limit.
We can then minimise it with the nlm() function using the reciprocal of the sample mean of
the censored data, say, as the starting value (which would be the MLE for an uncensored
sample):
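# One possible call (using the function sketched above):
nlm(neg.cens.log.lik, 1 / mean(cens.sample), data.vector = cens.sample, M = M)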
$minimum
[1] 138.3461
$estimate
[1] 0.4926395
$gradient
[1] 7.048584e-06
$code
[1] 1
$iterations
[1] 5
Note that for the exponential distribution, the ML estimate can be calculated analytically
even with censored data. So numerical methods are not actually required. Obtaining the
estimate exactly:
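# One possible version, consistent with the output below:
(n1 = length(cens.sample[cens.sample < M]))
(n2 = length(cens.sample[cens.sample == M]))
n1 / sum(cens.sample)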
[1] 81
[1] 19
[1] 0.49264
Question
Claims from a portfolio are believed to follow an Exp( ) distribution. The insurer has effected
individual excess of loss reinsurance with a retention limit of 1,000.
The insurer observes a random sample of 100 claims, and finds that the average amount of the 90 claims that do not exceed 1,000 is 82.9. There are 10 claims that do exceed the retention limit.

Calculate the maximum likelihood estimate of $\lambda$.
Solution
The likelihood function is:

$$L = \prod_{i=1}^{90}\lambda e^{-\lambda x_i}\times\left[P(X > 1{,}000)\right]^{10} = \lambda^{90} e^{-\lambda\sum x_i}\times e^{-10{,}000\lambda}$$

Taking logs:

$$\ln L = 90\ln\lambda - \lambda\left(10{,}000 + \sum x_i\right)$$

Differentiating and setting the derivative equal to zero:

$$\frac{\partial}{\partial\lambda}\ln L = \frac{90}{\lambda} - \left(10{,}000 + \sum x_i\right) = 0$$

$$\Rightarrow\quad \hat\lambda = \frac{90}{10{,}000 + \sum x_i} = \frac{90}{10{,}000 + (90\times 82.9)} = 0.005154$$

This is of the form $\dfrac{n_1}{n_2 M + \sum x_i}$ where, using the notation in the R example above, $n_1 = 90$, $n_2 = 10$ and $M = 1{,}000$.

Differentiating again:

$$\frac{\partial^2}{\partial\lambda^2}\ln L = -\frac{90}{\lambda^2}$$

This is negative when $\lambda = 0.005154$. (In fact it is always negative.) So we have a maximum turning point and hence $\hat\lambda = 0.005154$.
5 Policy excess
Insurance policies with an excess are common in motor insurance and many other kinds of
property and accident insurance. Under this kind of policy, the insured agrees to carry the
full burden of the loss up to a limit, L , called the excess. If the loss is an amount X ,
greater than L, then the policyholder will claim only $X - L$. If Y is the amount actually paid by the insurer, then:

$$Y = 0 \quad\text{if } X \le L$$
$$Y = X - L \quad\text{if } X > L$$
Clearly, the premium due on any policy with an excess will be less than that on a policy
without an excess.
This assumes that some of the saving is actually passed on to the policyholder. A policy excess
may also be referred to as a deductible.
The position of the insurer for a policy with an excess is exactly the same as that of the
reinsurer under excess of loss reinsurance. The position of the policyholder as far as
losses are concerned is exactly the same as that of an insurer with an excess of loss
reinsurance contract.
In practice, expenses form a significant part of the insurance cost. So the presence of an excess
might not affect the premium as much as might be expected. A premium calculated ignoring
expenses is called a ‘risk premium’.
Question
An insurer believes that claims from a particular type of policy follow a Pareto distribution with parameters $\alpha = 2$ and $\lambda = 900$. The insurer wishes to introduce a policy excess so that 20% of losses result in no claim to the insurer.

Calculate the size of the excess.
Solution
Let L be the size of the excess. The insurer wants to set L so that $P(X \le L) = 0.2$. Using the given loss distribution, we have:

$$P(X \le L) = 1 - \left(\frac{900}{900 + L}\right)^2$$

So we require:

$$1 - \left(\frac{900}{900 + L}\right)^2 = 0.2$$
Note also that the formula for the skewness of S has a simple form when S is a compound
Poisson random variable:
$$\text{skew}[S] = \lambda m_3 \qquad (19.12)$$

ie:

$$\text{skew}(S) = \lambda E(X^3)$$
The easiest way to show that the third central moment of S is m3 is to use the cumulant
generating function:
$$C_S(t) = \log M_S(t)$$
To determine the skewness, we differentiate it three times with respect to t and set $t = 0$, ie:

$$\text{skew}[S] = \frac{d^3}{dt^3}\log M_S(t)\bigg|_{t=0}$$

In other words, $\text{skew}[S] = C_S'''(0)$, just as $E(S) = C_S'(0)$ and $\text{var}(S) = C_S''(0)$.
For a compound Poisson random variable:

$$\log M_S(t) = \lambda\left[M_X(t) - 1\right]$$

So:

$$\frac{d^3}{dt^3}\log M_S(t)\bigg|_{t=0} = \frac{d^3}{dt^3}\,\lambda\left[M_X(t) - 1\right]\bigg|_{t=0} = \lambda m_3$$

ie $\text{skew}[S] = \lambda m_3$.

The coefficient of skewness is:

$$\frac{\text{skew}(S)}{\left[\text{var}(S)\right]^{3/2}}$$

Hence the coefficient of skewness is $\lambda m_3 / (\lambda m_2)^{3/2}$.
This result shows that the distribution of S is positively skewed, since $m_3$ is the third moment about zero of $X_i$ and hence is greater than zero because $X_i$ is a non-negative valued random variable. Note that the distribution of S is positively skewed even if the distribution of $X_i$ is negatively skewed. The coefficient of skewness of S is $\lambda m_3/(\lambda m_2)^{3/2}$, and hence goes to 0 as $\lambda \to \infty$. Thus for large values of $\lambda$, the distribution of S is almost symmetric.
$$E(S) = \lambda E(X) = \lambda m_1$$
$$\text{var}(S) = \lambda E(X^2) = \lambda m_2$$
$$\text{skew}(S) = \lambda E(X^3) = \lambda m_3$$
Despite being able to calculate the moments of a compound Poisson distribution, we are not able to calculate its probabilities directly, as it does not correspond to a standard distribution. Working by hand, we could use the Central Limit Theorem to approximate it using a normal distribution. However, with computer packages such as R, we can simulate an empirical compound distribution from which we can estimate probabilities with ease.
The following R code simulates 10,000 values from a compound Poisson distribution with parameter 1,000 and a gamma claims distribution with $\alpha = 750$ and $\lambda = 0.25$:
set.seed(123)
sims = 10000
n = rpois(sims, 1000)
s = rep(NA, sims)
for(i in 1:sims){
x = rgamma(n[i], shape = 750, rate = 0.25)
s[i] = sum(x)
}
Note we have used set.seed(123) so you can obtain the same values as this example.
We can obtain the sample mean of 2,997,651, the sample standard deviation of 93,719.71
and the sample coefficient of skewness of 0.02655921 as follows:
mean(s)
[1] 2997651
sd(s)
[1] 93719.71
n = length(s)
skewness = sum((s - mean(s))^3) / n   # third central moment (one possible form)
skewness / var(s)^(3/2)
[1] 0.02655921
[1] 0.4881
quantile(s, 0.9)
90%
3115719
We can plot a histogram of the sample from the compound distribution using the hist()
function:
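# For example:
hist(s, prob = TRUE)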
We set prob = TRUE so that the histogram can be compared to probability density
functions. Alternatively, setting freq = FALSE produces the same output.
Figure 19.1
We can calculate an empirical density function using the density() function and overlay
this onto the histogram using the lines() function:
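# A sketch (colour and line width taken from the legend code below):
lines(density(s), col = "blue", lwd = 3)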
We can then also superimpose a normal or other distribution to see if it provides a good
approximation as well as a legend to identify the curves:
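# A sketch of the normal approximation overlay (using the sample mean and
# standard deviation; plotting options taken from the legend code below):
curve(dnorm(x, mean = mean(s), sd = sd(s)), add = TRUE,
      col = "red", lty = 2, lwd = 3)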
legend("topleft",
legend = c("Empirical density", "Normal approximation"),
col = c("blue", "red"), lty = c(1,2), lwd = 3)
Figure 19.2
The plot indicates that the normal distribution appears to be a reasonable approximation.
However, a better way to check the fit with a normal distribution is to examine a Q-Q plot
using the qqnorm() function:
qqnorm(<simulated values>)
If the normal distribution is a good approximation, then the points should be close to the
straight diagonal line plotted with qqline(<simulated values>). For example:
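# For example:
qqnorm(s)
qqline(s)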
Figure 19.3
Despite showing some deviation to the line in the tails, the Q-Q plot also indicates that the
normal distribution appears to be a reasonable approximation.
To check the fit more generally with any model distribution, we can use the qqplot()
function to compare the sample data to the theoretical values of the fitted distribution as
described in Chapter 15.
Let $S_1, S_2, \ldots, S_n$ be independent random variables. Suppose that each $S_i$ has a compound Poisson distribution with parameter $\lambda_i$, and that the CDF of the individual claim amount random variable for each $S_i$ is $F_i(x)$.

Then $A = S_1 + S_2 + \cdots + S_n$ has a compound Poisson distribution with Poisson parameter $\lambda$ and individual claim amount CDF $F(x)$, where:

$$\lambda = \sum_{i=1}^{n}\lambda_i \qquad\text{and}\qquad F(x) = \frac{1}{\lambda}\sum_{i=1}^{n}\lambda_i F_i(x)$$
To prove the result, first note that $F(x)$ is a weighted average of distribution functions and that these weights are all positive and sum to one. This means that $F(x)$ is a distribution function and this distribution has MGF:

$$M(t) = \int_0^\infty e^{tx} f(x)\, dx$$

where $f(x) = F'(x)$ is the PDF of the individual claim amount random variable for A.

So:

$$M(t) = \int_0^\infty \exp(tx)\,\frac{1}{\lambda}\sum_{i=1}^{n}\lambda_i f_i(x)\, dx = \frac{1}{\lambda}\sum_{i=1}^{n}\lambda_i\int_0^\infty \exp(tx)\, f_i(x)\, dx = \frac{1}{\lambda}\sum_{i=1}^{n}\lambda_i M_i(t) \qquad (19.13)$$
By independence of $\{S_i\}_{i=1}^{n}$:

$$M_A(t) = \prod_{i=1}^{n} E\left(\exp(tS_i)\right)$$

As $S_i$ is a compound Poisson random variable, its MGF is of the form given by formula (19.11), so:

$$E\left(\exp(tS_i)\right) = \exp\left\{\lambda_i\left(M_i(t) - 1\right)\right\}$$

Thus:

$$M_A(t) = \prod_{i=1}^{n}\exp\left\{\lambda_i\left(M_i(t) - 1\right)\right\} = \exp\left\{\sum_{i=1}^{n}\lambda_i\left(M_i(t) - 1\right)\right\}$$
ie:

$$M_A(t) = \exp\left\{\lambda\left(M(t) - 1\right)\right\} \qquad (19.14)$$

where:

$$\lambda = \sum_{i=1}^{n}\lambda_i \qquad\text{and}\qquad M(t) = \frac{1}{\lambda}\sum_{i=1}^{n}\lambda_i M_i(t)$$
By the one-to-one relationship between distributions and MGFs, formula (19.14) shows that A has a compound Poisson distribution with Poisson parameter $\lambda$. By (19.13), the individual claim amount distribution has CDF $F(x)$.
Machine learning
Syllabus objectives
5.1.1 Explain the bias/variance trade-off and its relationship with model
complexity.
5.1.5 Use metrics such as precision, recall, F1 score and diagnostics such as the
ROC curve and confusion matrix to evaluate the performance of a binary
classifier.
0 Introduction
In this chapter we introduce the concept of machine learning. This is an extremely broad topic
with a wide range of applications. We focus on getting to grips with some fundamentals of two
types of machine learning, supervised and unsupervised.
We start with a description of machine learning as well as some examples from everyday life that
may be familiar.
In Section 2 we look at supervised learning in more detail. First, we introduce the aim of a
supervised learning algorithm, which is to approximate the functional relationship between input
variables (eg age, sex and smoking status) and an output variable (eg life expectancy). We then
consider how we might evaluate the model output and some of the issues that can lead to poor
performance. We also look at some common supervised learning problems and the typical
workflow involved in approaching such problems.
In Section 3 we turn to applications of supervised learning and introduce some commonly used
techniques, with some examples of them in action.
Finally, we give an overview of unsupervised learning and consider the K -means algorithm and
principal components analysis as example applications.
When describing the new age of ‘big data’ researchers often talk about the three V’s: volume,
velocity and variety. There are now very large volumes of data in use, which computers can
process very rapidly (velocity) and the data can take many different forms (variety).
While there is a large overlap between the fields of machine learning and statistical
modelling, in machine learning, there is typically more emphasis on prediction than on
inference.
For example, using a fitted model to predict future claims rather than carrying out tests on the
model’s parameters or constructing confidence intervals.
Machine learning uses highly flexible models, which can capture complex and subtle
patterns in data. The downside of this flexibility is the tendency of machine learning models
to overfit to the data used to train them: they mistakenly capture incidental properties of the
data used to train them.
The idea of overfitting to the training data is explored further in Section 2.4.
Broadly, machine learning problems can be divided into supervised learning problems and
unsupervised learning problems. This chapter will mostly focus on supervised learning.
Examples of problems which are commonly solved using machine learning include:
targeting of advertising at consumers using web sites
location of stock within supermarkets to maximise turnover
forecasting of election results
prediction of which borrowers are most likely to default on a loan.
2 Supervised learning
In a typical supervised learning problem, the aim is to determine the relationship between a collection of d input variables, $x_1, \ldots, x_d$, and an output variable, y, based on a sample of training data $(x_i, y_i)$, $i = 1, 2, \ldots, n$. For each observation, the values of the input variables, $x_i = (x_{i1}, x_{i2}, \ldots, x_{id})$, are structured as a vector, so that the entirety of the input data, x, can be considered as an $n \times d$ matrix, ie with observations in rows and input variables in columns.
With supervised learning, the algorithm has a target output, y i , for each point in the training
data. It aims to predict this target based on the values of the input variables by identifying a
relationship between input and output.
Once a model has been created using the training data, it can be used to predict the output in
cases where it has not yet been observed, based on the values of the input variables.
determining the relationship between expected survival time (the output) and
background covariates such as age, sex and smoking behaviour (the inputs)
predicting whether or not mortgage applicants are likely to default over the
mortgage term (the output), given background information on salary, debts and prior
credit behaviour (the inputs).
In the first example, a supervised learning model would be trained on a data set where:
y_i is the observed survival time for the i-th life in the training data set
x_i = (x_{i1}, x_{i2}, ..., x_{id}) is the corresponding vector of covariate values for this life, ie age,
sex, smoker status etc.
Once trained, it can be used to predict survival times for lives still alive based on their
characteristics.
In the second example, the model would be trained on a data set where:
y_i indicates whether or not the i-th borrower in the training data defaulted over the mortgage term
x_i = (x_{i1}, x_{i2}, ..., x_{id}) is the corresponding vector of covariate values for this borrower, ie salary, debts, prior credit behaviour etc.
Once trained, we can predict whether a new applicant will default on their mortgage.
Another actuarial example is using a generalised linear model in insurance pricing to predict claim
values based on policyholder characteristics.
In general, we assume that the output is related to the inputs by:
y_i = f(x_i) + ε_i
where ε_i is a random error term.
f represents the true unknown relationship between the input and the output. When training a
supervised learning model, we are trying to approximate this relationship as best as possible
based on the available data.
One familiar setting is where f is a linear function of the inputs and the errors ε_i are taken
to be normally distributed random variables: this is the problem of linear regression,
covered in Subject CS1.
In the (multiple) linear regression model, the assumed relationship between input and output is
as follows:
y_i = β_0 + β_1 x_{i1} + β_2 x_{i2} + ... + β_d x_{id} + ε_i
In linear regression we assume that the functional form of f is linear. Not all machine learning
methods assume a particular form of f .
Here fˆ( x ) is the approximation of f outputted by the supervised learning algorithm, which
would have been trained on a data set for which the outputs are known.
Where prediction is the primary concern, machine learning methods often regard fˆ as a
black box – its functional form is unimportant, so long as it produces good predictions.
If prediction is the main aim, then we may not be that interested in the underlying structure of
the relationship between input and output, only how accurate our model is.
While the most practical use of machine learning methods is accurate prediction, the
question of inference is also important. For example, where machine learning methods are
used in making credit decisions, there may be strong regulatory or ethical pressures to
understand the factors that contribute to a decision.
Problems where the output variable is qualitative (categorical), such as when predicting
credit default (yes/no), are said to be classification problems. The most common
classification tasks are binary classification tasks, where the outcome variable takes two
distinct values.
Question
Give examples of problems that would come under the headings of classification and regression.
Solution
An example of a classification problem is a spam filter that classifies emails into the two
categories ‘Safe’ or ‘Suspicious’.
An example of a regression problem is a health awareness app that predicts the user’s life
expectancy.
Commonly used statistical models, such as linear regression for continuous output
variables, or logistic regression for binary output variables, are examples of parametric
machine learning methods. They make assumptions about the functional form of f , which
simplify the learning problem. Moreover, these assumptions give the regression
parameters an explicit interpretation. For example, in the linear model:
y_i = β_0 + β_1 x_i + ε_i
the parameter β_1 can be interpreted as the expected change in y_i resulting from a one-unit increase in x_i.
By contrast, many machine learning methods make no (or few) assumptions about the
explicit functional form of f . Instead, they aim to produce an estimate, fˆ , that performs
well on the data used to train the model, without being too badly behaved. These more
flexible methods are often called non-parametric methods. This does not mean that they
don’t have parameters – they typically have many – rather, it means that their parameters do
not have simple interpretations in the context of the problem, and they are not usually of
direct interest.
Examples of such flexible non-parametric methods include:
smoothing splines
neural networks.
These methods will not be discussed in this chapter, but they are well-covered in the
references at the end of the chapter. In Section 3.3, we will demonstrate bagged decision
trees and the random forest. These are flexible non-parametric models with comparable
performance to the methods above.
A natural measure of the accuracy of f̂ is the mean squared error (MSE) of its predictions:
MSE = (1/n) Σ_{i=1}^{n} (y_i − f̂(x_i))²
More precisely, this might be called the training MSE, because it is evaluated on the data
used to train the model. Given that we want to use the model to predict unseen data, this
measure does not quite match up with our requirements. Given enough parameters, we
could easily make a model that matches the training data with zero training error, but such a
model would likely perform badly when used to predict new observations.
The aim is to capture the underlying relationship between input and output, not just capture the
idiosyncrasies of the training data. With enough parameters, a model can be built so that the
predictions will exactly match the observed output for the training data. However, this reflects
the specifics of that particular data set and is therefore not likely to produce accurate predictions
for other, unseen, data points.
Figure 21.1 shows three different models fitted to a sample of size n = 50. It illustrates the
two different modes of failure when fitting statistical models to data, and their relationship
with model complexity.
The true function f is a polynomial of degree 5 in a single variable x , and the three
different models are polynomials of different degrees, although the behaviour exhibited
here is seen for supervised learning models more generally.
The Core Reading here means that the behaviour demonstrated in this section (the two modes of
failure) is not specific to polynomials.
The fitted models are polynomials of the form:
f̂(x) = Σ_{k=0}^{d} β̂_k x^k
where the parameter estimates, β̂_k, are the values that minimise the training MSE.
Figure 21.1 shows f̂ for d = 1, 5 and 15.
Three models have been fitted to the training data to compare their behaviour:
a linear model, which has far fewer parameters than the true function
a polynomial of degree 5, which has the same number of parameters as the true function
a polynomial of degree 15, which has many more parameters than the true function.
Figure 21.1 below shows a plot of the training data, the three fitted models and the true model:
Figure 21.1
For d = 1, the model is just a straight line, and so it is too rigid to accommodate the
curvature in f. This leads to a bias in the predictions of f̂: systematically too large in the
centre, and too small at the edges.
This model doesn’t have enough parameters to capture the changes in shape that we can see
exhibited by the true f . We saw this same issue when we considered the graduation of mortality
rates by parametric formula in Chapter 11. If we use a formula with too few parameters, the
graduated rates will be overgraduated. They will be smoothed too much and will not follow the
underlying pattern of the rates closely enough, eg ‘smoothing over’ genuine features such as the
accident hump.
In contrast, the model with d = 15 is too flexible: f̂ bends to accommodate noise in the
training data. The more flexible the model, the better it will fit the training data. However, in
bending to accommodate the training data, fˆ departs substantially from f so that the error
in predicting unseen observations will be large.
The more parameters in the model, the more flexible it is. If we fitted a polynomial with high
enough degree to this data, then it would exactly go through each of the observed data points (ie
it would have a training MSE of 0). This would perfectly capture the specifics of the training data
but would not be a good approximation to the true relationship shown on the graph. So, it would
likely not produce accurate predictions for unseen data.
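The behaviour described above can be reproduced with a few lines of R. The sketch below uses our own simulated data and an assumed degree-5 polynomial for the true f (it is not the data behind Figure 21.1): the training MSE keeps falling as the degree increases, while the test MSE typically deteriorates once the model becomes too flexible.
set.seed(1)
f <- function(x) 1 + 2*x - 3*x^2 + 0.5*x^3 + x^4 - 0.8*x^5   # assumed 'true' degree-5 polynomial
x <- runif(50, -1, 1)
y <- f(x) + rnorm(50, sd = 0.3)                               # training sample of size n = 50
x_new <- runif(1000, -1, 1)
y_new <- f(x_new) + rnorm(1000, sd = 0.3)                     # unseen observations
for (d in c(1, 5, 15)) {
  fit <- lm(y ~ poly(x, d))                                   # fit a polynomial of degree d
  train_mse <- mean((y - fitted(fit))^2)
  test_mse  <- mean((y_new - predict(fit, newdata = data.frame(x = x_new)))^2)
  cat("degree", d, "- training MSE:", round(train_mse, 3), " test MSE:", round(test_mse, 3), "\n")
}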
We also saw this issue with graduation by parametric formula. If we use a formula with too many
parameters, the graduated rates will be undergraduated. They will follow the crude rates too
closely, reflecting a lot of the random ‘noise’ present in the data, rather than just capturing the
underlying pattern of the rates.
The model with d = 5 approximates the true f well, unsurprisingly, since it has exactly the
right form. While the illustration here uses a simple class of models (polynomials), the
relationship between the flexibility of a model, as measured by the number of free
parameters, and prediction performance on unseen data, is very general. Evaluating and
controlling model flexibility is an important part of designing good machine learning
methods.
The expected squared error when predicting an unseen observation (x_0, y_0) can be decomposed as:
E[(y_0 − f̂(x_0))²] = var(ε_0) + var(f̂(x_0)) + [f(x_0) − E(f̂(x_0))]²
The previous MSE formula was for the training data. This is now the expected MSE for an unseen
data point based on our approximation, f̂. There are two elements of randomness over which
this expectation is being taken. Firstly, f̂ is being treated as a random variable (ie we are
considering the estimator, rather than the estimate). This is because different training data sets
lead to different approximations of the functional relationship. Secondly, the value of y_0 given
x_0 is random due to the presence of the error term, ε_0.
The expected MSE on the left can be thought of as the average of the (squared) error on an
unseen observation with input value x_0, taken over the distribution of random errors in the
training sample, ε_1, ..., ε_n, and the random error of the unseen observation, ε_0. Note that
here f̂ is considered a random variable, as it depends on the training observations
(x_i, y_i), i = 1, 2, ..., n.
The Core Reading is describing the two elements of randomness outlined previously. The
distribution of random errors in the training sample drives different approximations of the
functional relationship for different training data sets. The distribution of the random error for an
unseen observation drives different values of y_0 for a given x_0.
var(ε_0) is the error variance, sometimes called the irreducible error. It represents a
fundamental limitation on the precision of an individual prediction, as a result of the random
error term present in each observation.
var(f̂(x_0)), the variance of f̂, is the contribution to the expected error that comes from
having used only a finite sample of size n : a different sample of size n would give a
different estimate. High variance is a sign that the model is too flexible – this additional
flexibility soaks up noise in the training data, leading to overfitting. This mode of failure can
be seen in the left panel of Figure 21.2 below.
If the variance of f̂ is large, then this means that the estimate varies a lot between training
samples. This is a sign that it is overfitting, capturing the idiosyncratic trends of the training data.
[f(x_0) − E(f̂(x_0))]² is the square of the bias of f̂. It represents the systematic error that
arises because the structure of the fitted model cannot exactly reproduce the true relationship.
Recall from Subject CS1 that the bias of an estimator is the difference between its expectation
and the quantity being estimated. Here the bias of f̂ is given by:
bias(f̂(x_0)) = E[f̂(x_0)] − f(x_0)
and note that (E[f̂(x_0)] − f(x_0))² = (f(x_0) − E[f̂(x_0)])².
This is a measure of the fundamental difference between the structure of the model being fitted
and the true functional relationship.
For example, if simple linear regression is used for a problem where the relationship
between x and y is genuinely non-linear, then even in the limit of an infinite sample, and
small error variance, an error will be incurred when using fˆ in place of f : this is bias. This
mode of failure can be seen in the right panel of Figure 21.2 below. High bias is a sign that
the model is not flexible enough – it is not able to capture all of the signal in the data,
leading to underfitting.
Again from Subject CS1, we know that the MSE of an estimator can be written as:
MSE[f̂(x_0)] = var(f̂(x_0)) + [bias(f̂(x_0))]²
So, for a given (unseen) value x_0, the expected mean square error for the corresponding output
value y_0 can be thought of as the irreducible error (ie var(ε_0)) plus the MSE of the estimator,
MSE[f̂(x_0)]. Although we have no control over var(ε_0), we can try to find an approach to
estimating the functional form that minimises MSE[f̂(x_0)] and hence minimises the expected MSE for
an unseen output. It is possible for an estimator to be biased but have a lower overall MSE, in
which case it may be the preferred approach. Even though it is wrong on average (biased), it is
more likely to give something closer to the true functional relationship (low MSE).
Figure 21.2
Machine learning models typically contain many parameters, and so are capable of
capturing faint patterns, which simpler methods might not be able to distinguish from noise.
However, in being so flexible, their primary mode of failure is in overfitting, which results in
high-variance predictions. For this reason, it is important to control the flexibility of our
models (see Section 3.1 on penalized regression models) or find ways of reducing the
variance of predictions (see Section 3.3 on decision trees and random forests).
The output of a binary classifier can be represented as shown in the table below. When
populated with values from data, this table is known as a confusion matrix.
The terminology is based on the idea that the matrix of possible outcomes quantifies the extent
to which, in the context of disease diagnosis, the test ‘confuses’ patients who do / do not have the
condition.
                 Predicted YES           Predicted NO
Actual YES       true positive (TP)      false negative (FN)
Actual NO        false positive (FP)     true negative (TN)
Question
A country is introducing a new screening programme for early identification of people with a
particular type of cancer.
(i) Explain what ‘false positive’ and ‘false negative’ results would be in this context.
(ii) Discuss the impact of false positives and false negatives from the point of view of a
patient.
(iii) State an additional concern regarding false negatives if this had been a test for an
infectious disease.
Solution
(i) A false positive is a patient that the test flags as having the disease, but in fact does not.
A false negative is a patient that the test indicates as not having the disease, when in fact
they do have it.
(ii) A false positive outcome is undesirable because the patient may be caused unnecessary
worry or required to undergo further tests or treatment before it is established that they
do not actually have the disease.
A false negative outcome is also undesirable because the patient may not now be
identified early enough to receive effective treatment for the disease.
(iii) With an infectious disease, there is the additional concern that false negative patients
may unknowingly spread the disease to other people.
There are several useful measures we can calculate from the confusion matrix to gauge the
effectiveness of the test.
Accuracy. This is the proportion of correct predictions. Note that raw accuracy scores can
be misleading, particularly for imbalanced data sets, where most of the observations come
from a single class. For a problem where 95% of the observations are from one class, even
the naïve classifier that assigns all observations to the more common class has 95%
accuracy. Hence it is common to report the accuracy alongside the accuracy of the naïve
classifier for comparison.
accuracy = (TP + TN) / (TP + TN + FP + FN)
This measure can also be used for classification problems where the outcome variable can take
more than two values.
Precision and recall. Consider a diagnostic test for a medical condition. The patients who
take the test either have the condition or they do not. The test will classify (predict) patients
as having the condition or not according to whether the outcome of the test fulfils certain
criteria.
The precision is the proportion of those predicted as having the condition who actually do have it:
precision = TP / (TP + FP)
The recall is the proportion of those who actually have the condition that the test correctly identifies:
recall = TP / (TP + FN)
In Subject CS1, this measure is called the sensitivity and is equivalent to 1 − P(type II error), or the
power of a test where the null hypothesis is that the patient is healthy. It is also called the true
positive rate. If FN = 0, ie if the test has not missed anyone who has the condition, it equals
100%.
The F1 score combines the precision and the recall into a single measure:
F1 score = (2 × precision × recall) / (precision + recall)
The ‘F’ here arose historically and doesn’t actually stand for anything. The ‘1’ subscript just
identifies this measure out of several similar measures that have also been proposed and could be
used instead.
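For reference, these measures can be calculated directly from the four cells of a confusion matrix. The helper function below is our own illustration rather than Core Reading code.
classification_metrics <- function(TP, FP, FN, TN) {
  # accuracy, precision, recall and F1 score from the cells of a confusion matrix
  precision <- TP / (TP + FP)
  recall    <- TP / (TP + FN)
  c(accuracy  = (TP + TN) / (TP + TN + FP + FN),
    precision = precision,
    recall    = recall,
    F1        = 2 * precision * recall / (precision + recall))
}
# For example, for a test with TP = 76, FP = 4, FN = 4 and TN = 16
# (Test 1 in the question later in this section):
classification_metrics(TP = 76, FP = 4, FN = 4, TN = 16)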
Question
(i) Explain why the precision and recall are not necessarily useful stand-alone measures, by
considering example confusion matrices.
(ii) Derive and simplify a formula for the harmonic mean of the precision and the recall.
Hint: The harmonic mean of a set of values is the reciprocal of the arithmetic mean of their
reciprocals.
Solution
(i) Consider a diagnostic test that predicts every individual as having the condition. In this
case, we have FN = 0 (the other values in the confusion matrix are not directly relevant for
this example, though we would also have TN = 0). So, the recall is:
recall = TP / (TP + FN) = TP / TP = 1
This is the best possible recall score. However, this test is likely not very useful, as it does
not differentiate between patients at all.
Next consider a diagnostic test that is always correct when it predicts a patient as having a
condition but rarely makes such predictions across the affected population. In this case
we have that FP = 0 (the other values in the confusion matrix are not directly relevant for
this example as long as we assume that TP 0 ). So, the precision is:
precision = TP / (TP + FP) = TP / TP = 1
This is the best possible precision score. However, this test on its own may not be that
useful as it identifies few patients with the condition.
For example, imagine there is a particular compound that, when present in the blood,
means that the patient definitely has the condition but the lack of that compound is
inconclusive. If the presence of that compound is extremely rare within the population
with the condition, then this test may not be that useful as the sole diagnostic tool.
However, it could of course be used as part of a more comprehensive diagnostic strategy.
(ii) Using the hint given, we can see that the harmonic mean H of the precision and the recall
can be found from the equation:
1/H = (1/2)(1/Precision + 1/Recall)
So:
H = 2 / (1/Precision + 1/Recall) = (2 × Precision × Recall) / (Precision + Recall)
(iii) F1 is the harmonic mean of the precision and the recall. This is different from the more
familiar arithmetic mean, but it also gives an average value of the two measures taken
together and results in a value in the same range, ie 0 to 1.
There is a trade-off between recall (the true positive rate, also known as the sensitivity) and
the false positive rate (the proportion of cases that are incorrectly classified as positives).
The false positive rate is:
false positive rate = FP / (TN + FP)
This is not the same as (1 – the true positive rate), ie it is not the same as 1 – the recall rate.
In the context of this trade-off, the sensitivity is often compared with the specificity, which
is 1 – false positive rate.
Recall from Subject CS1 that the specificity is defined to be the true negative rate (or
1 − P(type I error) for a test where the null hypothesis is that the patient is healthy).
The trade-off between recall and the false positive rate can be illustrated using a receiver
operating characteristic (ROC) curve. An example is shown below, taken from Alan Chalk
and Conan McMurtrie ‘A practical introduction to Machine Learning concepts for actuaries’
Casualty Actuarial Society E-forum, Spring 2016.
Figure 21.3
This figure shows the ROC curve for a logistic regression classifier fitted to the cause
codes for aircraft accidents.
Logistic regression is a type of generalised linear model where the response variable is a binary
outcome. It is based on the logistic function f(x) = 1/(1 + e^(−x)), shown in the graph below. This
function converts an input value, which can be anywhere in the range −∞ < x < ∞, to an
output value on a continuous scale between 0 and 1. If we interpret the output value as a
probability, we can convert it to a categorical output by saying that values exceeding a specified
value p (eg p = 0.5) correspond to Yes, while smaller values correspond to No.
[Graph of the logistic function y = 1/(1 + exp(−x)) for x between −4 and 4, rising from 0 towards 1.]
The ROC curve is used when the test involves a threshold of some kind. In the aircraft accidents
example, the aim of the logistic regression model is to identify accidents that were caused by the
aircraft (outcome 1) or for some other reason (outcome 2) based on key words in descriptions of
the incidents. The output of the logistic regression model is interpreted as the probability of
outcome 1. A threshold is then chosen such that accidents with an output above that threshold
are classified as outcome 1 and outcome 2 otherwise. The false positive rate and true positive
rate can then be calculated for that particular threshold based on the classifications. This is
repeated for different thresholds and the points plotted on a graph.
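As an illustration of this process (our own sketch, not the aircraft accident analysis itself), the function below computes the (false positive rate, true positive rate) points for a range of thresholds, given a hypothetical vector probs of predicted probabilities and the corresponding 0/1 outcomes actual.
roc_points <- function(probs, actual, thresholds = seq(0, 1, by = 0.01)) {
  # for each threshold, classify using probs > threshold and compute the two rates
  t(sapply(thresholds, function(th) {
    pred <- as.numeric(probs > th)
    TP <- sum(pred == 1 & actual == 1); FP <- sum(pred == 1 & actual == 0)
    FN <- sum(pred == 0 & actual == 1); TN <- sum(pred == 0 & actual == 0)
    c(fpr = FP / (FP + TN), tpr = TP / (TP + FN))
  }))
}
# plot(roc_points(probs, actual), type = "l",
#      xlab = "False positive rate", ylab = "True positive rate")
# abline(0, 1, lty = 2)   # diagonal corresponding to random guessing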
Points near the top left of the graph correspond to a good test where the true positive rate is high
and the false positive rate is low.
The diagonal line corresponds to a neutral ‘zero-sum’ test where there is a simple trade-off with
any improvement in the true positive rate being matched by an equal deterioration in the false
positive rate. As the diagonal represents naïve models that randomly guess the outcome for each
individual, it is therefore a lower bound for models with any predictive power.
The area under the ROC curve (AUC) is a commonly reported single-number measure of
model performance. However, as the AUC aggregates performance over the entire range of
false positives, it is not a reliable guide to performance in realistic problem domains, where
only classifiers with well-controlled error rates would be deployed: ordinarily, only
classifiers with small false positive rates would be useful in practice.
In the previous figure, the AUC for the naïve model is the area of the lower right triangle, which is
0.5. For the logistic regression model, the AUC is 0.83. The area of the whole rectangle is 1
(representing the best possible AUC for an ROC curve). So, a model that has some predictive
power over just random guessing will have an AUC between 0.5 and 1.
However, the AUC can sometimes be misleading. It is a single metric that describes model
performance for a range of thresholds, which may not always be an appropriate measure. For
example, say that the treatment for a particular condition is incredibly invasive, whilst the
condition itself is not particularly serious. When diagnosing patients, it may be desirable to keep
the false positive rate fairly low (to avoid giving the treatment to people that don’t need it).
Suppose that a false positive rate below 5% is desired. In this case, a model that has strictly
higher true positive rates for false positive rates below 5% should be preferred, even if it has a
lower overall AUC.
Question
Two different tests have been applied to 100 individuals to identify whether or not a particular
feature is present. These resulted in the following confusion matrices:
(a) Test 1
                 Predicted YES    Predicted NO    TOTAL
Actual YES       TP = 76          FN = 4          80
Actual NO        FP = 4           TN = 16         20
TOTAL            80               20              100
(b) Test 2
                 Predicted YES    Predicted NO    TOTAL
Actual YES       TP = 70          FN = 10         80
Actual NO        FP = 1           TN = 19         20
TOTAL            71               29              100
Calculate the precision, recall, F1 score, false positive rate and accuracy for each matrix and
comment on the answers.
Solution
(a) precision = TP/(TP + FP) = 76/(76 + 4) = 95%, recall = TP/(TP + FN) = 76/(76 + 4) = 95%
F1 score = (2 × 0.95 × 0.95)/(0.95 + 0.95) = 95%
false positive rate = FP/(TN + FP) = 4/(16 + 4) = 20%
accuracy = (TP + TN)/(TP + TN + FP + FN) = (76 + 16)/(76 + 16 + 4 + 4) = 92%
(b) precision = TP/(TP + FP) = 70/(70 + 1) = 98.6%, recall = TP/(TP + FN) = 70/(70 + 10) = 87.5%
F1 score = (2 × 0.986 × 0.875)/(0.986 + 0.875) = 92.7%
false positive rate = FP/(TN + FP) = 1/(19 + 1) = 5%
accuracy = (TP + TN)/(TP + TN + FP + FN) = (70 + 19)/(70 + 19 + 1 + 10) = 89%
The precision values show that Test 2 is more reliable when it flags an individual as having the
feature, ie out of those predicted as having the condition, Test 2 gets a higher proportion
correct.
However, the recall values show that Test 1 is much better at identifying individuals who do have
the feature. In other words, Test 1 correctly identifies a higher proportion out of the population
of those with the condition.
Given that the tests each outperform the other for one of these metrics, which of them may be
more useful depends on the misclassification costs.
According to the F1 scores, the overall performance of Test 1 is (just) better than Test 2.
However, Test 1 has a much higher false positive rate, indicating that it flags a higher proportion
of individuals who do not have the feature as having it.
The accuracy shows that Test 1 is a better predictor overall (although they are very similar). A
naïve predictor that predicts everyone as having the feature has an accuracy of 80%. Both tests
have a higher accuracy, indicating they have some predictive power over and above this naïve
classifier.
2.7 Train-validation-test
The most important test of a machine learning method is an evaluation of its out-of-sample
performance: how does it perform on unseen observations?
Train/test split
It is common to withhold a fraction of a dataset to evaluate out-of-sample performance.
This is called a training/test split. The split should be taken at random, or a suitably
stratified sample taken where there are distinct subpopulations. The training sample is
used to fit the model. The input variables of the test observations are then passed to the
trained model to determine predicted outputs for the test observations. We can then
compute the average prediction error on the test observations.
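A minimal sketch of such a split in R is shown below. The data frame dat, its response column y, the 80/20 proportion and the use of lm() are all assumptions made for illustration.
set.seed(1)
n <- nrow(dat)
train_rows <- sample(1:n, size = round(0.8 * n))   # random 80/20 split
train <- dat[train_rows, ]
test  <- dat[-train_rows, ]
model <- lm(y ~ ., data = train)                   # fit using the training sample only
preds <- predict(model, newdata = test)
mean((test$y - preds)^2)                           # average prediction error on the test set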
Train/validation/test split
Some machine learning methods have tuneable parameters, which we might call
hyperparameters, that control aspects of the model-fitting process, or the penalty for
overfitting. One example is the regularisation parameter in penalised regression, discussed
in Section 3.1.
We can say that the parameters of a model are variables internal to the model whose values are
estimated from the data and are used to calculate predictions using the model.
Hyperparameters are variables external to the model whose values are set in advance by the user.
They are chosen based on the user’s knowledge and experience in order to produce a model that
works well. The chosen hyperparameters are then evaluated and tuned using the validation data
set.
Question
Give examples of hyperparameters in models studied earlier in this course.
Solution
Graduation
When graduating mortality rates using a formula from the Gompertz-Makeham family, GM(r,s), the
numbers of polynomial terms, r and s, are chosen by the user and so can be viewed as hyperparameters.
This form of the Gompertz-Makeham family of curves is the one used most widely. However, it
does not match the form given on page 32 of the Tables.
Time series
If we are fitting a linear time series using an ARIMA model, we need to decide on the values of d ,
p and q , which determine the number of levels of differencing to apply and the number of
moving average and autoregressive terms to include.
K-fold cross-validation
An alternative to the train/validation/test split considered above is to split the data into K
subsets of roughly equal size. K different instances of the model are then trained, with
instance i using all data except the i th subset. Subset i is then used to evaluate the
out-of-sample error of model instance i . The average of the K different out-of-sample
errors can then be used as a proxy for the true out-of-sample error.
An advantage of this method is that all observations are used for both training and validation,
and each observation is used for validation exactly once.
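A sketch of K-fold cross-validation in R, again assuming a data frame dat with response column y and using lm() purely for illustration:
set.seed(123)
K <- 5
folds <- sample(rep(1:K, length.out = nrow(dat)))     # randomly assign each row to a fold
cv_errors <- numeric(K)
for (i in 1:K) {
  fit   <- lm(y ~ ., data = dat[folds != i, ])        # train on all data except fold i
  preds <- predict(fit, newdata = dat[folds == i, ])  # predict fold i
  cv_errors[i] <- mean((dat$y[folds == i] - preds)^2)
}
mean(cv_errors)                                       # proxy for the true out-of-sample error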
Collecting data
Data may come from a variety of sources, including sample surveys, population censuses,
company administration systems and databases constructed for specific purposes (such as
the Human Mortality Database, www.mortality.org).
During the last 20–30 years, the size of datasets available for analysis by actuaries and
other researchers has increased enormously. Datasets, such as those on purchasing
behaviour collected by supermarkets, relate to millions of transactions.
Types of data
There are many different types of data we might need to deal with. The table below illustrates
the ‘traditional’ types of data that have been used by actuaries and statisticians.
DATA TYPES
Numerical (ie numbers):
  Discrete: age last birthday, number of children, number of claims
  Continuous: exact age, salary, claim amount
Categorical (ie not numbers):
  Attribute (dichotomous): Alive/Dead, Male/Female, Claim/No claim, Pass/Fail
  Nominal: customer name, type of claim, occupation, marital status, country, colour of car
  Ordinal: month (Jan, Feb, Mar, …), exam grade (A, B, C, …), size (S, M, L, XL), Agree/Don’t know/Disagree
Attribute (or dichotomous) data refers to variables whose values consist of just two categories.
Ordinal variables take values that can be ordered in a natural way, whereas the values for nominal
variables cannot.
Some variables can be classed in several ways, eg the number of claims could be treated as a
continuous variable, rather than discrete, if the values were large. We’ve seen this idea before
when we used a normal approximation to a binomial or Poisson distribution in Subject CS1.
Similarly, colour of car could be recorded as red, orange, yellow, etc (ie nominal data) or (if it
came from a video image) it could be measured on the RGB scale (ie a vector of three discrete
numerical values). It’s also quite common to record attribute data as 1’s and 0’s (ie with discrete
numerical values) to make it easier to count subsets and to calculate proportions and averages.
Nowadays, however, there are many other types of data that don’t directly correspond to these
familiar types. For example, a motor insurer might be provided with a memory card containing
footage of an accident recorded on a vehicle’s dash cam (dashboard camera). This will typically
be a very large file containing a mixture of video and audio information, as well as structural
information (eg markers for the start and end of each frame of the video), header information (eg
the date it was captured, the serial number of the camera and the software version) and other
embedded information such as time markers for the footage and satellite coordinates.
A data set must first be prepared in such a way that a computer is able to access it and
apply a range of algorithms. If the data are already in a regular format, this may be a simple
matter of importing the data into whatever computer package is being used to develop the
algorithms. If the data are stored in complex file formats, it will be useful to convert the data
to a format suitable for analysis. See Tidy Data (Wickham) for a more systematic approach
to data preparation beyond the CS2 syllabus.
It may be possible to estimate or derive missing values from other available information.
Look for obvious errors in variables, and in combinations of variables.
Exploratory data analysis (EDA) should be used to check for errors and to understand
high-level properties of the data. When performing extensive exploratory analysis, it is
important to be careful of ‘data dredging’, which could result in mistaking chance
relationships for signals. It is a good idea to state the relationships that would be expected
between variables before carrying out exploratory data analysis.
These first few steps are also covered in the data analysis chapter of Subject CS1.
Feature scaling
Some machine learning techniques will only work effectively if the variables are of similar
scale. We can see this by recalling that, in a linear regression model (Subject CS1, Chapter
12) the parameter β_j, associated with variable x_j, measures the impact on y of a one-unit
change in x_j. If x_j is measured in, say, metres, the value of β_j will be 100 times larger
than it would be with the same data if x_j were measured in centimetres.
In Section 3.1, we consider expressions for regression penalties, such as ridge regression.
The ridge penalty for model complexity has the form Σ_{j=1}^{d} β_j², which is only a meaningful
expression if the parameters β_j are on the same scale. One common approach is to centre
input variables, by subtracting the sample mean, and then to rescale each variable to have
standard deviation 1.
Although this section is focused on supervised learning, the scaling of the variables is particularly
important with the unsupervised K-means algorithm, which we will discuss later.
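In R, this centring and rescaling can be carried out with the scale() function. Here X is a placeholder numeric matrix (or data frame of numeric columns) of input variables.
X_scaled <- scale(X, center = TRUE, scale = TRUE)  # subtract column means, divide by column sds
round(colMeans(X_scaled), 10)                      # each column now has mean (approximately) 0
apply(X_scaled, 2, sd)                             # and standard deviation 1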
A typical split as described in Section 2.7 would use 60% of the data for training, 20% for
validation and 20% for testing. A guide might be to select enough data for the validation
data set and the testing data set so that the validation and testing processes can function,
and to allocate the rest of the data to the training data set. In practice, this often leads to
around a 60%/20%/20% split.
Evaluation
Once the model has been trained on a set of data, its performance should be evaluated.
How this is done may depend on the purpose of the analysis. If the aim is prediction, then
one obvious approach is to test the model on a set of data different from the one used for
development. If the aim is to identify hidden patterns in the data, other measures of
performance may be needed.
The data used should be fully described and available to other researchers.
The selection of the algorithm and the development of the model should be
described, again with computer code being made available. This should include the
parameters of the model and how and why they were chosen.
To ensure reproducibility in stochastic models in R, use the same numerical seed in the
function set.seed().
This is the problem of overfitting: the estimate too closely resembles the specifics of the training
data, so large variation is observed between training samples.
One approach is to seek a model that uses fewer parameters by imposing a penalty on
model complexity. Commonly used criteria include the Akaike Information Criterion (AIC)
and the Bayesian Information Criterion (BIC):
AIC = deviance + 2d
BIC = deviance + d × ln(n)
Here d represents the number of parameters in the model and n is the number of observations.
The deviance here is a measure of the average error of the model’s outputs and measures the
goodness of fit. The extra terms reflect the number of parameters in the model. If we aim to
minimise the AIC or the BIC, this will allow us to find a good trade-off between the two objectives
of obtaining a good fit to the data and minimising the number of parameters in the model.
These criteria attempt to allow for the fact that more complex models will necessarily fit the
training data better. Instead of seeking models with the lowest deviance, models with the
lowest AIC or BIC are preferred. Automatic model selection methods such as stepwise
selection can systematically explore the space of all possible models to find reasonable
candidate models according to these criteria. See Chapter 13 of Subject CS1. However,
this approach will not work well when the number of input variables is large: it is
computationally expensive to explore a large model space.
The more input variables there are, the more possible models there are and so the larger the
model space. The more models we want to fit and evaluate, the more computing power is
required.
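In R, AIC and BIC values can be extracted from a fitted model, and stepwise selection carried out with step(). The data frame dat and the use of glm() are assumptions for illustration; R's AIC() and BIC() are based on the log-likelihood rather than the deviance, but the two differ only by a constant for a given dataset, so model comparisons are unaffected.
full_model <- glm(y ~ ., data = dat)
AIC(full_model)    # penalty of 2 per parameter
BIC(full_model)    # heavier penalty of log(n) per parameter
# stepwise selection: add/remove terms while the AIC can still be reduced
step_model <- step(full_model, direction = "both", trace = 0)
summary(step_model)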
Regularization extends the idea of penalizing more complex models. Instead of maximising
the log-likelihood function, we maximise a penalized log-likelihood:
ℓ(β_0, β_1, ..., β_d | x, y) − λ g(β_0, β_1, ..., β_d)
where g represents a penalty function and the regularization parameter λ ≥ 0 controls how
strongly the penalty should be applied. As λ → 0 we recover the unpenalized model.
By maximising this expression, we are aiming for a trade-off where we try to maximise the
log-likelihood but, at the same time, try to minimise the penalty applied (since this is subtracted).
Ridge regression: g(β_0, β_1, ..., β_d) = Σ_{j=1}^{d} β_j²
Lasso regression: g(β_0, β_1, ..., β_d) = Σ_{j=1}^{d} |β_j|
The names are based on the geometrical interpretations of these measures, which you are not
expected to know.
In either case, for large values of λ, small absolute values of the parameters are
encouraged – hence, these are commonly known as shrinkage methods. A good value of λ
is typically chosen by cross-validation.
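For example, the glmnet package (our choice, not one specified in the Core Reading) fits ridge and lasso regressions and chooses λ by cross-validation. It expects a numeric matrix of inputs x and a response vector y.
library(glmnet)
ridge_cv <- cv.glmnet(x, y, alpha = 0)   # alpha = 0 gives the ridge penalty
lasso_cv <- cv.glmnet(x, y, alpha = 1)   # alpha = 1 gives the lasso penalty
ridge_cv$lambda.min                      # value of lambda with the smallest cross-validated error
coef(lasso_cv, s = "lambda.min")         # fitted coefficients at that value of lambda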
Question
A random sample x_1, ..., x_n of n values has been taken from a Poisson distribution with unknown
mean μ. The value of μ is to be estimated by penalised maximum likelihood, using the penalty
function P(μ) = (μ − 5)² and regularisation parameter λ.
(i) Explain why this particular penalty function might have been chosen.
(ii) Write down the likelihood function for μ.
(iii) Show that the penalised maximum likelihood estimate μ̂ satisfies the equation:
n(μ̂ − x̄) + 2λμ̂(μ̂ − 5) = 0
(iv) Calculate the value of μ̂ based on the sample of values 5.7, 5.4, 4.6, 5.0 and 4.9 when
λ = 0.2.
(v) Use the equation in part (iii) to show algebraically what happens to the value of μ̂:
(a) as λ → 0
(b) as λ → ∞.
(vi) Comment on your answers to parts (iv) and (v).
Solution
In this question we have just one parameter (ie d = 1), which is called μ (corresponding to β_1).
(i) We might choose this penalty function if we believe the true value of μ is close to 5, as
the penalty when μ = 5 would be zero.
(ii) The likelihood function is:
L(μ) = ∏_{i=1}^{n} e^(−μ) μ^(x_i) / x_i! = e^(−nμ) μ^(Σ x_i) × constant = e^(−nμ) μ^(n x̄) × constant
(iii) To maximise the penalised log-likelihood, we equate its derivative with respect to the parameter to zero:
d/dμ [log L(μ) − λ(μ − 5)²] = n x̄/μ − n − 2λ(μ − 5) = 0
Multiplying through by μ and rearranging:
n(x̄ − μ̂) − 2λμ̂(μ̂ − 5) = 0, ie n(μ̂ − x̄) + 2λμ̂(μ̂ − 5) = 0, as required.
(iv) Here n = 5 and x̄ = (1/5)(5.7 + 5.4 + 4.6 + 5.0 + 4.9) = 25.6/5 = 5.12
With λ = 0.2, the equation from part (iii) becomes:
5(μ̂ − 5.12) + 0.4μ̂(μ̂ − 5) = 0, ie 0.4μ̂² + 3μ̂ − 25.6 = 0
Solving this quadratic:
μ̂ = [−3 ± √(3² + 4(0.4)(25.6))] / (2 × 0.4) = (−3 ± √49.96)/0.8 = −12.585 or 5.085
Since μ must be positive, the penalised maximum likelihood estimate is μ̂ = 5.085.
(v)(a) As λ → 0, the equation from part (iii) becomes n(μ̂ − x̄) = 0, so μ̂ = x̄.
(v)(b) Dividing the equation through by λ gives:
n(μ̂ − x̄)/λ + 2μ̂(μ̂ − 5) = 0
As λ → ∞, this becomes:
2μ̂(μ̂ − 5) = 0, so μ̂ = 0 or μ̂ = 5
Since the value of μ must be strictly positive, the required estimate would be μ̂ = 5.
(vi) If we apply no penalty, as in part (v)(a), the method reduces to maximum likelihood
estimation and we get the usual estimate for μ, ie the sample mean of 5.12.
If we apply a high penalty to values that are not close to the target value of 5, as in part
(v)(b), the method will produce a value close to 5, irrespective of the actual values in the
sample.
As expected, the estimated value of 5.085 lies between these two values.
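The answer to part (iv) can be checked numerically in R by maximising the penalised log-likelihood directly (the function name pen_loglik is our own):
x <- c(5.7, 5.4, 4.6, 5.0, 4.9)
n <- length(x)
lambda <- 0.2
# penalised log-likelihood for mu, ignoring terms that do not involve mu
pen_loglik <- function(mu) -n * mu + n * mean(x) * log(mu) - lambda * (mu - 5)^2
optimise(pen_loglik, interval = c(0.01, 20), maximum = TRUE)$maximum
# approximately 5.085, agreeing with part (iv)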
Recall from Subject CS1 Chapter 14 that if B_1, B_2, ..., B_R constitute a partition of a sample
space S and P(B_i) > 0 for i = 1, 2, ..., R, then for any event A in S such that P(A) > 0:
P(B_r | A) = P(A | B_r) P(B_r) / P(A)
where
P(A) = Σ_{i=1}^{R} P(A | B_i) P(B_i)
for r = 1, 2, ..., R.
This is Bayes’ Theorem, which allows us to ‘invert’ conditional probabilities, ie to work out the
values of the probabilities P (Br | A) when we know the probabilities P ( A | Br ) .
Question
Prove the result above (Bayes’ Theorem).
Solution
The proof uses the definition of conditional probability, P(X | Y) = P(X, Y)/P(Y), and the equivalent
identity P(X, Y) = P(X | Y) P(Y).
Using the definition of conditional probabilities, the probability we want to find is:
P(B_r | A) = P(B_r, A) / P(A)
Using the identity above, the numerator on the right-hand side can be written as:
P(B_r, A) = P(A | B_r) P(B_r)
If we condition the denominator on the different possible values of B_r, we can write it as:
P(A) = P(A | B_1)P(B_1) + P(A | B_2)P(B_2) + ... + P(A | B_R)P(B_R) = Σ_{r=1}^{R} P(A | B_r)P(B_r)
Hence:
P(B_r | A) = P(A | B_r)P(B_r) / P(A) = P(A | B_r)P(B_r) / Σ_{r=1}^{R} P(A | B_r)P(B_r)
Naïve Bayes classification uses this formula to classify cases into mutually exclusive
categories on some outcome y , on the basis of a set of covariates x1,..., xd . The events
A are equivalent to the covariates taking some set of values, and the partition B1, B2 ,..., BR
is the set of values that the outcome can take.
Suppose the outcome is whether a person will survive for ten years. Let y_i = 1 denote the
outcome that person i survives, and y_i = 0 denote the outcome that person i dies. Then if
we have d covariates, we can write the posterior probability that y_i = 1, given the covariate
information, as:
P(y_i = 1 | x_{i1}, ..., x_{id}) = P(x_{i1}, ..., x_{id} | y_i = 1) P(y_i = 1) / P(x_{i1}, ..., x_{id})
It is difficult to estimate the joint distribution of the variables x_1, ..., x_d, particularly in the
high-dimensional setting.
In practice many of the combinations of values will not be present in the sample, so the
corresponding probabilities, P(x_{i1}, ..., x_{id}), cannot be estimated. For example, for a motor insurer
that uses several rating factors (eg age of policyholder, occupation, make of car, age of car,
postcode area), many of the subsets will be empty.
The naïve Bayes algorithm assumes that the values of the covariates x_{ij}, j = 1, ..., d, are
conditionally independent given the value of y_i:
P(x_{i1}, x_{i2}, ..., x_{id} | y_i = 1) = P(x_{i1} | y_i = 1) × P(x_{i2} | y_i = 1) × ... × P(x_{id} | y_i = 1)
Question
Illustrate why this assumption might not be accurate for a motor insurer using the rating factors
age of policyholder, occupation, make of car, age of car, and postcode area to build a model to
predict whether or not a claim will be made.
Solution
As an example, say that out of the policyholders in the training data without a claim, 25% are
under 25, 20% drive high performance cars and 40% drive cars less than 3 years old.
However, it is unlikely that 2% (ie 25% × 20% × 40%) of policyholders without a claim would be
under 25 with high performance cars under 3 years old, as young drivers are unlikely to be able to
afford such vehicles – or to be able to pay to insure them.
So that:
P(y_i = 1 | x_{i1}, ..., x_{id}) ∝ P(y_i = 1) ∏_{j=1}^{d} P(x_{ij} | y_i = 1)
Assuming that the costs of misclassification are equal, we assign an observation to the
class with the highest posterior probability.
For a given set of values of x_{i1}, ..., x_{id}, the denominator P(x_{i1}, ..., x_{id}) doesn't change. So, we can
treat this as a constant in the calculations and just look at the relative values – hence the
proportional sign. To find the actual values of the individual probabilities, we can just divide by
the total, to produce a set of probabilities that add up to 1.
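For reference, a naïve Bayes classifier can be fitted in R using, for example, the naiveBayes() function in the e1071 package. The package choice, the data frame claims and its columns are assumptions made for illustration.
library(e1071)
# 'claims' is a hypothetical data frame with a factor outcome 'fraudulent'
# and categorical covariates such as 'region' and 'size'
nb_model <- naiveBayes(fraudulent ~ region + size, data = claims)
predict(nb_model, newdata = new_claims, type = "raw")   # posterior class probabilities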
Question
A motor insurer has analysed a sample of 1,000 claims for three different geographical regions
split by the size of claim (Small / Medium / Large). It has then classified them according to
whether they proved to be fraudulent or genuine claims. The results are shown in the tables
below:
(i) Give a formula that could be used to estimate the probability that a new claim from
Region 3 for a Medium amount will prove to be fraudulent.
(ii) Estimate the probability that each of the following types of new claim will be fraudulent:
(a) a Medium claim from Region 3
(b) a Large claim from Region 1
(c) a Small claim from Region 2.
The insurer decides to classify all claims with a probability of being fraudulent greater than 5% as
fraudulent (pending further investigation).
(iii) Determine which, if any, of the claim types in part (ii) would be investigated.
Solution
(i) Using Bayes’ Theorem (and obvious abbreviations for the events, with F = fraudulent,
G = genuine, and R3, M denoting a Medium claim from Region 3), we have:
P(F | R3, M) = P(R3, M | F) P(F) / [P(R3, M | F) P(F) + P(R3, M | G) P(G)]
(ii)(a) From the tables:
P(R3, M | F) = 20/50 = 0.4, P(R3, M | G) = 180/950
P(F) = 50/(50 + 950) = 0.05, P(G) = 950/(50 + 950) = 0.95
So: P(F | R3, M) = (0.4 × 0.05) / (0.4 × 0.05 + (180/950) × 0.95) = 0.1
So the estimated probability that a claim from Region 3 for a Medium amount is
fraudulent is 10%.
In fact, we can do this calculation directly from the table. For Region 3 and Medium
amounts there were 20 fraudulent claims and 180 genuine claims. So the estimated
probability that a claim from Region 3 for a Medium amount is fraudulent is
20/(20 + 180) = 0.1, ie 10%.
(ii)(b) Similarly, the estimated probability that a claim from Region 1 for a Large amount is
fraudulent is 3/(3 + 176) = 0.0168, ie 1.68%.
(ii)(c) Since there were no fraudulent claims from Region 2 for a Small amount in the sample,
the estimated probability that a claim from Region 2 for a Small amount is fraudulent is 0.
(iii) The insurer would investigate claims for a Medium amount from Region 3 (from (ii)(a))
but neither of the other two claim types.
The Core Reading stated earlier that, for equal misclassification costs, we allocate points to the
category with the highest posterior probability for classification. This is unlikely to be the case
here as misclassifying a fraudulent claim as genuine is likely more costly than misclassifying a
genuine claim as fraudulent.
In addition, if the approach is taken to only classify claims as fraudulent if the posterior probability
is greater than 0.5, then no claims would ever be classified as fraudulent based on this model. This
is due to the low overall prevalence of fraud in the three regions. However, a more complex model
that takes into account more attributes of the claims, as opposed to just region, may lead to
higher possible posterior probabilities of fraud.
[Figure: a classification tree. The first split asks whether Height > 180cm: YES leads to the prediction Male; NO leads to a further split on whether Weight > 80kg, with YES predicting Male and NO predicting Female.]
Figure 21.4
This figure shows a classification tree for predicting gender based on height and weight.
Given the height and weight of a new individual, it is easy to predict the individual’s gender.
To do so, start at the top of the tree and, at each level of the tree, choose the appropriate
path. If the individual’s height is greater than 180 cm, predict male. Otherwise, consider
their weight: if it is greater than 80kg, predict male; if not, predict female.
The rationale here may be that tall people (Height > 180cm) tend to be male, so we can separate
them out at the first stage. Of the remaining people, males tend to be heavier than females, so
we can split these at a level that is likely to be a reasonable boundary between males and females
(80kg, say).
We now consider how this decision tree may have been constructed (learned) using data.
[Figure: the input space, with Height and Weight on the axes, partitioned into three rectangles by the splits at Height = 180cm and Weight = 80kg, corresponding to the terminal nodes of the tree.]
Figure 21.5
At each stage, the best partition is chosen, to minimise a loss function appropriate to the
problem.
In order to build a decision tree, we need to choose partitions at each stage, ie how to divide up
the input space. To do this we need to choose the input variable to use, here height or weight,
and the value at which to split the data.
This is known as a greedy algorithm: at each stage we choose the split that appears to be the most effective at
separating the remaining elements, without thinking ahead to the consequences that this might
have on the later splits. So, we just ‘bite off as much as we can’ at each stage. The effectiveness
of splits is measured by a loss function, which is discussed later.
The partition of the sample space corresponds to the tree in Figure 21.4, with each of the
three sub-rectangles corresponding to a terminal node in the tree.
Figure 21.5 shows how the example decision tree in Figure 21.4 partitions the input space. First
the space is divided into two rectangles at the point where height is 180cm. The rectangle
corresponding to height of no more than 180cm is then further split into two rectangles based on
weight.
An unseen observation would then be placed in the most common class within its
sub-rectangle.
For a classification tree, we assign a prediction category to each of the terminal nodes
(sub-rectangles in the input space). One way to do this is to predict the category associated with
the most individuals from the training data in that node. This is then the predicted category for
new, unseen, data points if they lie within that sub-rectangle of the input space.
Consider, for example, predicting whether individuals are sick or healthy. Incorrectly classifying a sick individual as healthy has a much higher cost than
incorrectly classifying a healthy individual as sick. These asymmetric misclassification costs mean
that it may be better to predict ‘sick’ for terminal nodes with a certain threshold of sick
individuals, rather than requiring the majority of individuals in that node to be sick.
Loss functions
For a classification problem with K classes (here, for binary classification K 2 ), one
common loss function is the Gini index. This is a measure of how ‘pure’ the leaf nodes are
(ie how mixed the training data assigned to each node is). For a tree with external nodes
j = 1, ..., J, this is:
Σ_{j=1}^{J} Σ_{k=1}^{K} n_j p_{jk}(1 − p_{jk})
where p_{jk} is the proportion of training observations under node j of type k and n_j is the
number of observations under node j.
Note that:
Σ_{j=1}^{J} Σ_{k=1}^{K} n_j p_{jk}(1 − p_{jk}) = Σ_{j=1}^{J} Σ_{k=1}^{K} n_j (p_{jk} − p_{jk}²)
= Σ_{j=1}^{J} n_j (Σ_{k=1}^{K} p_{jk} − Σ_{k=1}^{K} p_{jk}²)
= Σ_{j=1}^{J} n_j (1 − Σ_{k=1}^{K} p_{jk}²)
since the proportions within each node sum to 1, ie Σ_{k=1}^{K} p_{jk} = 1.
The measure of purity for the j-th external (terminal) node (or resulting sub-rectangle) is
Σ_{k=1}^{K} p_{jk}(1 − p_{jk}), or equivalently 1 − Σ_{k=1}^{K} p_{jk}². The overall Gini index is then a weighted sum of these values over
all external nodes. The weights are the numbers of data points in each node, the n_j.
If all the training observations under node j are of the same type (perfect class purity),
then:
Σ_{k=1}^{K} p_{jk}(1 − p_{jk}) = 0
This formula gives the probability that two items selected at random (with replacement) from the
node are of different types (ie if the first item is of type k, the second item will not be of type k).
So, if all the items are of the same type, the probability will be 0. If all observations are of the
same type, say k, then p_{jk} = 1 for that type and the proportions for all other types are 0.
For a node that has an equal split of classes for a binary classification problem (worst
purity), this quantity takes its maximum value of 0.5.
For a classification problem where the data points are divided into m distinct categories, the Gini
impurity score for a node must take a value between 0 and 1 − 1/m. As m → ∞, the upper limit of the
Gini impurity score tends to 1.
For a regression problem, the predicted output value of a new observation under node j,
ŷ_j, is the mean output value of all training observations under node j. The corresponding
loss function is the squared error loss. If the j-th node contains n_j observations, the
squared error is:
Σ_{j=1}^{J} Σ_{i=1}^{n_j} (y_i − ŷ_j)²
For each node, the predicted value that minimises the sum of squared errors for that node is the
mean output value for the training observations in that node. To see this, we have that the
squared error for node j, SE_j, given some predicted value for this node, ŷ_j, is:
SE_j = Σ_{i=1}^{n_j} (y_i − ŷ_j)²
Differentiating with respect to ŷ_j:
dSE_j/dŷ_j = −2 Σ_{i=1}^{n_j} (y_i − ŷ_j)
Setting this derivative equal to zero:
−2 Σ_{i=1}^{n_j} (y_i − ŷ_j) = 0
n_j ŷ_j = Σ_{i=1}^{n_j} y_i
ŷ_j = (1/n_j) Σ_{i=1}^{n_j} y_i = ȳ_j
The second derivative confirms that this is a minimum:
d²SE_j/dŷ_j² = 2n_j > 0
At the start of the procedure for a classification problem, when all observations belong to the
same region, the value of the Gini impurity score is:
n Σ_{k=1}^{K} p_k(1 − p_k)
Here p_k is the proportion of training observations of type k and n is the total number of data
points.
When deciding on the first partition, we need to choose the input variable and the threshold. Say
we are evaluating the partition using some variable x_k and some threshold s. This partition
creates two ‘child’ nodes from the parent node that contains all the data. After this split, these
two child nodes are now the external (terminal) nodes of the tree.
The Gini impurity score for the tree after this split is:
Σ_{j=1}^{2} Σ_{k=1}^{K} n_j p_{jk}(1 − p_{jk})
where p_{jk} and n_j are as defined previously. The reduction in the loss function is then given by:
n Σ_{k=1}^{K} p_k(1 − p_k) − Σ_{j=1}^{2} Σ_{k=1}^{K} n_j p_{jk}(1 − p_{jk})
We want to identify the partition that maximises the above reduction in the loss function.
However, the first part of the expression does not change for different partitions. So, this is
equivalent to identifying the partition that minimises the score for the tree after the split, ie that
minimises Σ_{j=1}^{2} Σ_{k=1}^{K} n_j p_{jk}(1 − p_{jk}).
Say that using variable x_k and threshold s maximises the reduction in the loss function for the
first partition. Consider, for example, further splitting the data for which x_k ≤ s. When we split
this data using some other variable x_{k′} and some threshold s′, the tree then has three external
(terminal) nodes.
The Gini impurity score for the tree after this split is now:
Σ_{j=1}^{3} Σ_{k=1}^{K} n′_j p′_{jk}(1 − p′_{jk})
where p′_{jk} and n′_j are as defined previously but for the revised tree. The reduction in the loss
function is then given by:
Σ_{j=1}^{2} Σ_{k=1}^{K} n_j p_{jk}(1 − p_{jk}) − Σ_{j=1}^{3} Σ_{k=1}^{K} n′_j p′_{jk}(1 − p′_{jk})
Once again, we want to identify the partition that maximises the above reduction in the loss
function. However, as before, the first part of the expression does not change for different
partitions on the data for which x_k ≤ s. So, this is equivalent to obtaining the partition that
minimises the score for the tree after the split, ie that minimises Σ_{j=1}^{3} Σ_{k=1}^{K} n′_j p′_{jk}(1 − p′_{jk}).
Also, the impurity score for the first external node does not change when considering these
different partitions, ie n′_1 = n_1 and p′_{1k} = p_{1k} for all k. So, this is equivalent to minimising the
following (where the sum starts from 2 instead of 1 as we are ignoring the first external node):
Σ_{j=2}^{3} Σ_{k=1}^{K} n′_j p′_{jk}(1 − p′_{jk})
In general, then, whichever node is being split, the greedy algorithm chooses the partition that minimises:
Σ_{c=1}^{2} Σ_{k=1}^{K} n_c p_{ck}(1 − p_{ck})
where c = 1, 2 indexes the two child nodes of the split, p_{ck} is the proportion of training
observations under node c of type k and n_c is the number of data points in the c-th node.
It is generally easier to calculate Σ_{c=1}^{2} Σ_{k=1}^{K} n_c p_{ck}(1 − p_{ck}) instead of the full reduction in the loss
function.
Similarly, for a regression problem, the partition chosen is the one that minimises:
Σ_{c=1}^{2} Σ_{i=1}^{n_c} (y_i − ŷ_c)²
where c = 1, 2 indexes the two child nodes of the split, y_i is the observed value of the i-th data
point in the c-th child node, ŷ_c is the average value of the data points in the c-th child node and
n_c is the total number of data points in the c-th child node.
Question
The graph below shows the heights and weights of the 16 sportsmen.
(i) Calculate the Gini index after each of the following proposed splits for the first split of a
decision tree for the sportsmen (these heights are shown with the dotted lines in the graph above):
(a) a split at a height of 195cm
(b) a split at a height of 160cm.
(ii) Explain which of the two proposed splits is preferred when using the greedy approach
with the Gini index.
Solution
(i)(a) Let the first child node of the proposed split contain those sportsmen that are taller than 195cm.
This node only contains basketball players (ie it is perfectly pure). The Gini impurity score for this
child node can be calculated as follows:
G_1 = 1 − p_{1B}² − p_{1C}² − p_{1F}² − p_{1J}² − p_{1R}² − p_{1T}²
= 1 − (3/3)² − (0/3)² − (0/3)² − (0/3)² − (0/3)² − (0/3)²
= 0
Here, p_{1B} is the proportion of basketball players in the first child node, p_{1C} the proportion of
cyclists etc. As expected, the score for this node is 0, as it only contains one type of athlete.
The second child node of the proposed split contains all those sportsmen that are shorter than
195cm. The Gini impurity score for this child node is:
G_2 = 1 − p_{2B}² − p_{2C}² − p_{2F}² − p_{2J}² − p_{2R}² − p_{2T}²
= 1 − (0/13)² − (2/13)² − (3/13)² − (2/13)² − (3/13)² − (3/13)²
= 134/169 = 0.793
The overall Gini index after this proposed split is a weighted sum of G_1 and G_2. The weights are
the number of data points in each of the child nodes. So:
G = 3G_1 + 13G_2 = 3 × 0 + 13 × (134/169) = 134/13 = 10.307
(i)(b) Let the first child node of the proposed split contain those sportsmen that are shorter than
160cm. This node only contains jockeys (ie it is perfectly pure). The Gini impurity score for this
child node can be calculated as follows:
G_1 = 1 − p_{1B}² − p_{1C}² − p_{1F}² − p_{1J}² − p_{1R}² − p_{1T}²
= 1 − (0/2)² − (0/2)² − (0/2)² − (2/2)² − (0/2)² − (0/2)²
= 0
Here, p_{1B} is the proportion of basketball players in the first child node, p_{1C} the proportion of
cyclists etc. As expected, the score for this node is 0, as it is pure.
The second child node of the proposed split contains all those sportsmen that are taller than
160cm. The Gini impurity score for this child node is:
G_2 = 1 − p_{2B}² − p_{2C}² − p_{2F}² − p_{2J}² − p_{2R}² − p_{2T}²
= 1 − (3/14)² − (2/14)² − (3/14)² − (0/14)² − (3/14)² − (3/14)²
= 39/49 = 0.796
The overall Gini index after this proposed split is a weighted sum of G_1 and G_2, where the
weights are based on the number of data points in each of the child nodes:
G = 2G_1 + 14G_2 = 2 × 0 + 14 × (39/49) = 78/7 = 11.143
(ii) When using the greedy approach, we want to maximise the reduction in the Gini impurity score.
This is the same as minimising the Gini impurity score after the split. The first split has a lower
score of 10.307 compared to 11.143 for the second split. So, the first split is preferred.
To calculate the actual reduction in the loss function we can first calculate the Gini impurity score
for the data before any splits:
G_0 = 16(1 − p_B² − p_C² − p_F² − p_J² − p_R² − p_T²)
= 16(1 − (3/16)² − (2/16)² − (3/16)² − (2/16)² − (3/16)² − (3/16)²)
= 53/4 = 13.25
The reduction in the loss function for the first split is 13.25 − 10.307 = 2.943, compared with a
reduction of 13.25 − 11.143 = 2.107 for the second split. So, the first split has a larger reduction in
the loss function.
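The calculations above can be reproduced with a short R helper function (our own code) that takes the class counts in each child node of a proposed split:
weighted_gini <- function(child_counts) {
  # weighted Gini impurity: sum over child nodes of n_j * (1 - sum of squared proportions)
  sum(sapply(child_counts, function(counts) {
    n_j <- sum(counts)
    p   <- counts / n_j
    n_j * (1 - sum(p^2))
  }))
}
# split at 195cm: 3 basketball players vs the remaining 13 sportsmen
weighted_gini(list(c(3, 0, 0, 0, 0, 0), c(0, 2, 3, 2, 3, 3)))   # 10.307...
# split at 160cm: 2 jockeys vs the remaining 14 sportsmen
weighted_gini(list(c(0, 0, 0, 2, 0, 0), c(3, 2, 3, 0, 3, 3)))   # 11.143...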
Bagged decision trees
For each unseen observation we have B predictions, one from each of the trees. For regression
problems we average each of these for an overall prediction. For classification problems, we
predict the category with the most occurrences amongst the individual predictions.
The process of perturbing the training dataset is bootstrapping: sampling with replacement
from the original data.
For example, if the training dataset were {1, 2, 3, 4, 5}, three bootstrap samples might be
{1, 2, 2, 3, 5}, {1, 3, 4, 4, 5} and {2, 2, 3, 5, 5}.
Averaging the predictions from B decisions trees built on bootstrap samples typically
leads to much better predictions than would be obtained from a single decision tree in
isolation. In particular, bagging reduces the variance of the predictions.
One nice feature of this approach is that it naturally suggests a way to estimate the
performance of the method on unseen data. A typical bootstrap sample will contain around
two thirds (1 − 1/e) of the original observations. To see this, note that since we sample
with replacement, the number of times an individual observation occurs in any given
bootstrap sample is binomially distributed on n trials, with success probability 1/n. This
distribution is roughly Poisson with mean 1. It follows that a particular observation is
excluded from a bootstrap sample (zero successes) with probability close to e^(−1). These
excluded observations are said to be out-of-bag, and the average out-of-bag error can be
used as an estimate of the error on unseen data.
For example, if the bootstrap sample from the set of observations 1,2,3,4,5 was 1,2,2,3,3 ,
then this sample doesn’t contain the original observations 4 and 5. These are the out-of-bag
observations for this sample. We can evaluate the performance of the tree fitted on the sample
1,2,2,3,3 by using observations 4 and 5. This can be repeated for each of the trees built using
the different bootstrapped samples to give an overall estimate of the performance on unseen
data.
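A quick R illustration (our own) of bootstrap sampling and out-of-bag observations:
set.seed(42)
obs <- 1:5
boot_sample <- sample(obs, size = length(obs), replace = TRUE)   # one bootstrap sample
boot_sample
setdiff(obs, boot_sample)        # the out-of-bag observations for this sample
# for larger n, the proportion of out-of-bag observations is close to exp(-1) = 0.368
n <- 1000
oob_prop <- replicate(500, length(setdiff(1:n, sample(1:n, n, replace = TRUE))) / n)
mean(oob_prop)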
Random forest
This is an improvement to the bagged decision tree method, which aims to reduce the
correlations between the predictions from different trees. As in the previous section, we
build B different trees, each on a bootstrapped sample. However, at each stage in the
building of a decision tree, we now consider only a random sample of the d different input
variables. One common choice for classification problems is to take √d of the input
variables at each stage, and for regression problems to choose d/3 input variables.
At each stage of building a decision tree, we need to consider how to partition the input space, ie
which variable to use and what the split point should be. In a random forest, we only consider a
subset of the available input variables at each point. This subset of ‘candidate’ variables is
selected at random for each split point.
While it seems counterintuitive to disregard information, this approach has good statistical
motivation. In a setting with a single strong relevant input variable, and a number of weaker
variables, in each bootstrapped sample, the strong predictor would always determine the
first split.
At each stage in a greedy algorithm, we choose the split that appears to be the most effective at
separating the remaining elements. If there was one very strong predictor, then this is likely to be
chosen as the first split in every bootstrapped sample.
This means that predictions from different trees would be highly correlated, leading to a
smaller variance reduction for a given number of bootstrap samples. Selecting the best
split from a randomly chosen subset of the predictor variables leads to trees that are
uncorrelated and maximises the variance reduction from bagging.
Equivalently, sampling input variables compensates for the greedy nature of the
tree-building algorithm, and so allows for better exploration of the predictor space.
By only considering a subset of variables at each split point, we should get a wider variety of
partitions across the trees that may have better overall performance than trees constructed using
a greedy approach on the entire input parameter set.
To illustrate bagged decision trees and Random Forest, we work with the Boston housing
dataset from the MASS package in R. The dataset aims to predict median house prices
using background information.
library(MASS)
library(tree)
library(randomForest)
Split the data into a subset used for training, and a withheld test subset.
set.seed(1)
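A command along the following lines could be used to create the training rows (this is an
illustrative sketch; the object name train is an assumption):

# Randomly select half of the row numbers of the Boston data set as training rows
train <- sample(1:nrow(Boston), nrow(Boston) / 2)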
This creates a set of row numbers indicating the data points to be used as the training data.
The first argument in the tree function is a formula object, similar to the input required in the
lm(), glm() or survfit() functions throughout Subject CS1 and Subject CS2. The aim is to
predict median house prices, which are in the column medv in the Boston data set. The code
medv~. indicates that we are predicting medv using all available input variables. The full stop is
shorthand for including all the other columns of the data set. The data argument is set to the
Boston data set filtered to only include those data points in the training data set.
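For example, a command of the form being described here is (an illustrative sketch; the object
name tree_boston is an assumption):

# Fit a regression tree for medv using all other columns, on the training rows only
tree_boston <- tree(medv ~ ., data = Boston[train, ])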
Here we select all the median house prices for the data points in the test data set, ie we select
those not in the training rows by using negative indexing.
The predict function is used to predict the median house prices for all the data points in the test
data set, based on the values of the other columns. We can then plot a graph of predicted vs.
observed values in the test data set:
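For example, these steps could be carried out as follows (an illustrative sketch; only the names
y_test and y_hat also appear in the output below):

# Observed test-set values, selected using negative indexing
y_test <- Boston$medv[-train]

# Predicted values for the test rows from the single tree
y_hat <- predict(tree_boston, newdata = Boston[-train, ])

# Predicted vs observed values, with the line y = x for reference
plot(y_hat, y_test)
abline(0, 1)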
We can then calculate the mean squared error of the test set, which is representative of the
average squared distance between observed and predicted values for unseen observations.
mean((y_test-y_hat)^2)
[1] 35.28688
When plotting, note the common predicted value for all observations in the same region.
For a single decision tree, values that fall within the same terminal node (sub-rectangle of the
input space) have the same predicted value. So, this is why there are many points with the same
value of y_hat.
Now we make predictions using (by default) 500 bootstrapped trees. The number of trees
used can be adjusted using the ntree argument. To see the effect of bagging in isolation, we
use the randomForest command and specify mtry=13 to include all 13 input variables.
Recall that bagging is when we average predictions over a number of different decision trees.
Setting mtry = 13 means that all 13 input variables are considered at each split point. When we
introduced random forests above, we discussed considering only a subset of the input variables at
each split point. Here we first use all the input variables as candidates at each split to investigate
the impact of bagging alone.
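An illustrative sketch of the bagged fit (the object name bag_boston is an assumption) is:

# Bagging: a random forest that considers all 13 predictors at every split
bag_boston <- randomForest(medv ~ ., data = Boston[train, ], mtry = 13)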
We can then calculate predictions for each data point in the testing data:
We can then plot the predictions against the observations of the testing data set:
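For example (the name y_bag matches the output below; the rest is an illustrative sketch):

# Predictions from the bagged model for the test rows, plotted against observations
y_bag <- predict(bag_boston, newdata = Boston[-train, ])
plot(y_bag, y_test)
abline(0, 1)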
Comparing this graph to the predictions from a single tree, we can see that there is much better
correspondence between predicted values and observations when averaging over multiple
predictions.
mean((y_test - y_bag)^2)
[1] 23.33324
The mean square error on unseen observations is much lower than when using a single
decision tree.
Now, by omitting mtry, the algorithm’s default choice (for a regression problem) is d/3, which
leads to a further improvement in MSE.
Now only 4 variables (chosen at random) are considered at each split point when constructing
each tree.
Once again we can calculate predictions for each data point in the testing data:
Again, we can then plot the predictions against the observations of the testing data set:
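An illustrative sketch of these steps (the name y_rf matches the output below; rf_boston is an
assumed name) is:

# Random forest with the default mtry for regression, here floor(13/3) = 4 variables per split
rf_boston <- randomForest(medv ~ ., data = Boston[train, ])
y_rf <- predict(rf_boston, newdata = Boston[-train, ])
plot(y_rf, y_test)
abline(0, 1)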
This graph looks similar to the previous plot of bagged predictions. However, the points are even
less spread out around the line y = x, suggesting a further improvement from taking this
approach.
mean((y_test-y_rf)^2)
[1] 18.32787
4 Unsupervised learning
We saw previously that supervised learning methods seek the patterns in a collection of
variables (the inputs) that best predict another variable (the output), eg we seek the
combinations of salary and debt that are most associated with higher risk of mortgage
default. In contrast, unsupervised learning methods determine the most dominant patterns
within a collection of variables, without reference to any external target.
With unsupervised learning, there is no output variable to aim at; instead, we are trying to identify
patterns solely in the input variables, x_1, ..., x_d.
Unsupervised learning methods are used to identify major components of variability in a data set.
For example, we might try to identify distinct sub-groups within the data; this is known as
clustering, and an example clustering method is the K-means algorithm discussed in the next
section. Alternatively, we might ask whether there are strong relationships among the variables,
so that the entire dataset can be approximately described by a much smaller number of variables.
Market basket analysis can be used by online retailers as the basis for the ‘Other customers also
bought …’ recommendations or for promoting bundles of items that are frequently bought
together.
Another example of a clustering problem is a system that groups together similar text documents
based on, for example, the frequency of words within the documents.
We might ask whether we can identify groups (clusters) of policies which have similar
characteristics. We may not know in advance what these clusters are likely to be, or even
how many there are in our data.
There are a range of clustering algorithms available, but many are based on the K -means
algorithm. This is an iterative algorithm, which starts with an initial division of the data into
K clusters and adjusts that division in a series of steps designed to increase the
homogeneity within each cluster and to increase the heterogeneity between clusters.
The K -means algorithm proceeds as follows. Let us suppose we have data on J variables.
1. Choose a number of clusters, K , into which the data are to be divided. This could
be done on the basis of prior knowledge of the problem. Alternatively, the algorithm
could be run several times with different numbers of clusters to see which produces
the most satisfactory and interpretable result. There are various measures of within-
and between-group heterogeneity, often based on within-groups sums of squares.
Comparing within-groups sums of squares for different numbers of clusters might
identify a value of K beyond which no great increase in within-group homogeneity
is obtained.
2. Choose initial cluster centres, for example by allocating the data points to the K
clusters at random.
3. Assign cases to the cluster centre that is nearest to them, using some measure of
distance. One common measure is Euclidean distance:
dist(x, k) = √( Σ_{j=1}^{J} (x_j − k_j)² )
Here x j is the standardised value of variable j for case x , and k j is the value of
variable j at the centre of cluster k (k = 1, ..., K). Note that it is often necessary to
standardise the data before calculating any distance measure, as in Section 2.8.
This is because the measure of distance is based purely on the numerical values of the
variables. If some of the variables take large values because of the units of measurement
that have been adopted, these variables will dominate the calculations and the other
variables will effectively be ignored.
4. Calculate the centroid of each cluster, using the mean values of the data points
assigned to that cluster. This centroid becomes the new centre of each cluster.
The centroid is the centre of gravity of the data points when each has the same weight.
To calculate it, we find the average of each ‘coordinate’ in the data set.
5. Re-assign cases to the nearest cluster centre using the new cluster centres.
Example
The graph below shows the values of two variables, x1 and x2 , for 18 data points (after
appropriate scaling).
We can use the K-means algorithm with K = 3 to try to identify any natural clusters in the data.
We start by allocating each of the data points to one of the 3 clusters at random.
We’ve used squares, circles and diamonds to distinguish the three groups. We then need to
calculate the position of the centroid of each cluster by averaging the x and y coordinates of the
data points within each cluster. The centroids are indicated by the larger hollow shapes.
The three centroids happen to be all quite close together at this stage.
We now reallocate each data point to the cluster whose centroid it is nearest to, and then
recalculate the positions of the centroids of the new groups.
Again, we reallocate each data point to the cluster whose centre it is nearest to, and then
recalculate the centres:
At this point, no more re-assignments take place. So, the centres do not move, and the algorithm
has converged. The final allocation of each point is indicated by the shapes and the final cluster
centres are indicated by the larger hollow shapes.
We could try repeating the algorithm with a different number of clusters, though it is not
necessarily straightforward to decide on the number to use. We can sometimes be informed by
prior knowledge of the data.
The table below shows the strengths and weaknesses of the K -means algorithm.
Strengths:
- Uses simple principles for identifying clusters, which can be explained in non-statistical terms
- Highly flexible and can be adapted to address nearly all its shortcomings with simple adjustments
- Fairly efficient and performs well

Weaknesses:
- Less sophisticated than more recent clustering algorithms
- Not guaranteed to find the optimal set of clusters because it incorporates a random element
- Requires a reasonable guess as to how many clusters naturally exist in the data

Source: B. Lantz, Machine Learning with R (Birmingham, Packt Publishing, 2013), p. 271
The interpretation and evaluation of the results of K -means clustering can be somewhat
subjective. If the K -means exercise has been useful, the characteristics of the clusters will
be interpretable within the context of the problem being studied and will either confirm that
the pre-existing opinion about the existence of homogeneous groups has an evidential base
in the data, or provide new insights into the existence of groups that were not seen before.
One objective criterion that can be examined is the size of each of the clusters. If one
cluster contains the vast majority of the cases, or there are clusters with only a few cases,
this may indicate that the algorithm has failed to find K meaningful groups.
Another way to judge the effectiveness of the algorithm is to repeat the process several times
with different random allocations at the start. If similar groupings are obtained each time, it is
likely that there is a valid basis underlying the clusters.
One function that can perform K -means clustering in R is kmeans(). There are also
several machine learning packages that can carry out K -means clustering.
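A minimal illustration of kmeans() (not Core Reading code; the data here are simulated purely
for the example) is:

# K-means with K = 3 on 18 points measured on two standardised variables
set.seed(42)
x <- matrix(rnorm(36), ncol = 2)
fit <- kmeans(x, centers = 3, nstart = 10)
fit$cluster   # cluster allocated to each point
fit$centers   # final cluster centres (centroids)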
Principal components analysis (PCA) is a technique that can be used to reduce the dimensionality
of a data set. Given a data set containing d variables, PCA represents the data using a different
set of d variables (components) that are uncorrelated linear combinations of the original set. If
strong relationships exist between the original variables, then the new components are
constructed in such a way that the full data set can be reasonably accurately represented using
only a subset of these components.
In order to choose how many components are retained, we might decide to keep as many as
necessary to ‘explain’ a desired proportion of the total variance within the data set. For example,
we may wish to retain the smallest number of components, c d , that explain at least 90% of the
variation in the original data set.
PCA can be a useful tool in machine learning, where data sets can be extremely large and
therefore computationally expensive.
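For example, PCA can be carried out in R using the prcomp() function. The sketch below (not
Core Reading code) uses the built-in USArrests data set purely for illustration:

# PCA on standardised variables, then find how many components explain at least 90%
# of the total variance
pca <- prcomp(USArrests, scale. = TRUE)
prop_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
prop_var
which(prop_var >= 0.9)[1]   # smallest number of components explaining at least 90%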
E[(y_0 − f̂(x_0))²] = var(ε_0) + var(f̂(x_0)) + [f(x_0) − E(f̂(x_0))]²

To derive this, we first write y_0 = f(x_0) + ε_0 in the LHS above and use f̂ and f to represent
f̂(x_0) and f(x_0) respectively:

E[(f + ε_0 − f̂)²] = var(f + ε_0 − f̂) + [E(f + ε_0 − f̂)]²

Since f is a fixed quantity and ε_0 has mean zero and is independent of f̂, the variance term is
var(ε_0) + var(f̂) and the expectation term is f − E(f̂).

So:

E[(y_0 − f̂)²] = var(ε_0) + var(f̂) + [f − E(f̂)]²

as required.

Recall that f̂ here (in random variable form) is an estimator of the true functional form, f. So:

[f − E(f̂)]² = [E(f̂) − f]² = bias(f̂)²

This gives:

E[(y_0 − f̂)²] = var(ε_0) + var(f̂) + bias(f̂)²

The sum of the last two terms, var(f̂) + bias(f̂)², is an alternative representation of the mean
squared error (MSE) of the estimator f̂. MSE is covered in Chapter 8 of Subject CS1.
The chapter summary starts on the next page so that you can
keep all the chapter summaries together for revision purposes.
Chapter 21 Summary
What is machine learning?
Machine learning is a collection of methods for the automatic detection and exploitation of
patterns in data. Examples can be found in many areas of everyday life, eg in targeting
online advertising.
Machine learning is increasingly being used in the areas of finance and insurance, eg for
predicting which borrowers are most likely to default on a loan.
Broadly, machine learning problems can be divided into supervised learning problems and
unsupervised learning problems.
In a typical supervised learning problem, the aim is to determine the relationship between a
collection of d input variables, x_1, ..., x_d, and an output variable, y, based on a sample of
training data (x_i, y_i), i = 1, 2, ..., n.
With unsupervised learning, the algorithm aims to identify patterns in a data set, without
being given a specific target. So, there is no output variable, and we are instead trying to
identify patterns in the input variables, x_1, ..., x_d.
In supervised learning, the data are assumed to satisfy:

y_i = f(x_i) + ε_i

where f is the functional relationship and ε_i is an error term. The aim of supervised
learning is to estimate f ( xi ) based on the available data.
Prediction vs inference
If prediction is the main aim, then we may not be that interested in the underlying structure
of the relationship between input and output, only how accurate a model is.
Problems where the output variable is qualitative (categorical), such as when predicting
credit default (yes/no), are said to be classification problems. The most common
classification tasks are binary classification tasks, where the outcome variable takes one of
two distinct values.
Machine learning methods that make assumptions about the functional relationship
(eg assuming it is linear, as in linear regression) are called parametric methods. Those that
make no (or few) such assumptions are non-parametric methods.
The mean square error (MSE) compares a model’s predictions with the true output values. It
is defined as:
MSE = (1/n) Σ_{i=1}^{n} (y_i − f̂(x_i))²
The lower the MSE, the closer the predictions are to the true values. More precisely, this
might be called the training MSE because it is evaluated on the data used to train the model.
Given that we want to use a model to predict the results for unseen data, this measure does
not quite match up with our requirements. Given enough parameters, we could easily make
a model that matches the training data with zero training error, but such a model would
likely perform badly when used to predict new observations.
Any model should have a sufficient number of parameters to be flexible enough to capture
underlying trends, but not so many that it reflects specific features of the training set that
we would not expect to be present in new data.
For a given (unseen) value x_0, the expected mean square error for the corresponding
output value y_0 can be decomposed into three interpretable terms:

E[(y_0 − f̂(x_0))²] = var(ε_0) + var(f̂(x_0)) + [f(x_0) − E(f̂(x_0))]²

var(ε_0) is the irreducible error and reflects the uncertainty in y_0 even if the true functional
relationship is known.

var(f̂(x_0)) is the variance of the estimator of the functional relationship. It reflects the
variation in the estimates across different training samples.

[f(x_0) − E(f̂(x_0))]² is the square of the bias of the estimator and represents the underlying
difference between the structure of the fitted model and the true functional relationship.
Classification models that result in Yes or No outputs can be assessed using a confusion
matrix for the test set, which shows the numbers of true positives (TP), false positives (FP),
false negatives (FN) and true negatives (TN). The following metrics can then be calculated:

Precision = TP / (TP + FP)

Recall (sensitivity) = TP / (TP + FN)

F1 score = 2 × Precision × Recall / (Precision + Recall)

False positive rate = FP / (TN + FP)

Accuracy = (TP + TN) / (TP + TN + FP + FN)

These all take values in the range 0% to 100%; higher values are better for all of them except
the false positive rate, for which lower values are better. One minus the false positive
rate is also known as the specificity.
A receiver operating characteristic (ROC) curve can be used to compare models. This plots
the true positive rate against the false positive rate for different thresholds (the rule for
assigning predicted categories).
Train-validation-test
One approach is to split the data into a training set (used to fit the model), a validation set
(used to tune it) and a test set (used to assess its final performance).
An alternative is to split the data into K subsets of roughly equal size. K different instances
of the model are then trained, with instance i using all data except the i th subset. Subset i
is then used to evaluate the out-of-sample error of model instance i . The average of the K
different out-of-sample errors can then be used as a proxy for the true out-of-sample error.
Machine learning tasks can be broken down into the following stages:
collecting data
exploring and preparing the data
feature scaling (to ensure that the input variables have similar ranges of values)
splitting the data
training the model
evaluation.
Reproducibility of research
The penalised log-likelihood has the form:

l(β_0, β_1, ..., β_d | x_1, ..., x_n) − λ g(β_0, β_1, ..., β_d)

We could use Σ_{i=1}^{d} β_i² (ridge) or Σ_{i=1}^{d} |β_i| (Lasso) for the penalty function
g(β_0, β_1, ..., β_d).

To try to select an appropriate number of parameters, d (with sample size n), we can
minimise an information criterion such as the AIC or BIC.
The naïve Bayes algorithm uses Bayes’ formula to classify items by determining the relative
likelihood of each of the possible options for the given values of the covariates x1i ,..., xdi .
The method assumes that the conditional probabilities of the covariates given an outcome
are independent. For example, for the outcome y i 1 , we have:
P(y_i = 1 | x_1i, ..., x_di) ∝ P(y_i = 1) ∏_{j=1}^{d} P(x_ji | y_i = 1)
Decision trees, also known as classification and regression trees (CART), classify items by
asking a series of questions about the inputs that attempt to differentiate between the
categories. The simplest method of construction is to use greedy splitting where, at each
stage, the best partition of the data is chosen to maximise the reduction in a loss function.
For classification problems, a common loss function is the Gini index. For regression
problems, squared error loss is commonly used.
The Gini index is defined as:

Σ_{j=1}^{J} n_j Σ_{k=1}^{K} p_jk (1 − p_jk)

where j indexes over the external nodes, p_jk is the proportion of sample items of class k
present at the j th external node and n_j is the number of data points present at the j th
external node. When using greedy splitting, the aim at each stage is to find the partition that
maximises the reduction in the loss function. For any particular split point, this is equivalent
to minimising the following weighted sum of the impurity scores of the child nodes of the
proposed partition:

Σ_{c=1}^{2} n_c Σ_{k=1}^{K} p_ck (1 − p_ck)
where c indexes over the two child nodes, pck is the proportion of sample items of class k
present at the c th child node and nc is the number of data points present at the c th child
node.
The impurity score for each node satisfies:

Σ_{k=1}^{K} p_ck (1 − p_ck) = 1 − Σ_{k=1}^{K} p_ck²

This gives a value between 0 (‘pure’) and 1 − 1/K (‘mixed’), where K is the number of distinct
classes. As K → ∞, the upper limit tends to 1.
The squared error is:

Σ_{j=1}^{J} Σ_{i=1}^{n_j} (y_i − ŷ_j)²

where y_i are the observed values and ŷ_j is the predicted value for the j th external node.
When using greedy splitting, we want to maximise the reduction in the loss function. This is
equivalent to obtaining the partition that minimises the sum of squared errors for the child
nodes resulting from the proposed split, ie the partition that minimises:
Σ_{c=1}^{2} Σ_{i=1}^{n_c} (y_i − ŷ_c)²
The K -means clustering algorithm involves modelling the data values as points in space.
Starting from an initial cluster allocation (commonly random), the method repeatedly finds
the centroid of the data points that have been allocated to each cluster and then reallocates
the points to the cluster whose centroid they are nearest to. When this process reaches a
stage where no further changes are made, the algorithm has converged.
Principal components analysis is a technique that can be used to reduce the dimensionality of
a data set. Given a data set containing d variables, PCA represents the data using a different
set of d variables (components) that are uncorrelated linear combinations of the original set.
If strong relationships exist between the original variables, then the new components are
constructed in such a way that the full data set can be reasonably accurately represented
using only a subset of these components.
The practice questions start on the next page so that you can
keep the chapter summaries together for revision purposes.
21.1 (i) Explain the difference between supervised and unsupervised learning.
(ii) State whether each of the following applications involves supervised or unsupervised
learning, and suggest a suitable machine learning algorithm that could be used in each
case:
(a) predicting a university graduate’s salary at age 40 based on the subject they
studied, the grade they obtained in their degree and their sex
(b) grouping car insurance policyholders based on geographical area, premium paid
and claims experience
(iii) Explain what is meant by each of the following terms in the context of machine learning:
(a) hyperparameters
(b) CART
(c) greedy splitting
(d) clustering.
21.3 (a) Describe why the following model does not fit the observed data well:
21.4 A particular feature is present in 10 individuals out of a group of 100, and a test is available
that aims to detect whether the feature is present.
(a) Draw up a confusion matrix for a test that can identify this feature perfectly.
(b) Calculate the precision, recall, F1 score, false positive rate, and accuracy, based on the
numbers in your matrix.
21.5 A test is available to detect a certain virus. The test has sensitivity (ie recall) 95% and specificity
99.5% (ie false positive rate 0.5%). The prevalence of the virus is 0.2%.
(a) Construct the confusion matrix for this test based on a population of 100,000 people.
21.6 A random sample x_1, ..., x_n of values has been taken from a N(μ, σ_0²) distribution, where the
value of σ_0 is known but the value of μ is unknown. However, it is believed that the value of μ
is close to 100. It has been suggested that μ could be estimated using a penalised log-likelihood
function.
(i) Describe how this method works.
(ii) Suggest why the penalty function (μ − 100)² would be suitable to use in this case.
(iii) Show that the estimate of μ derived using this method with the penalty function in
part (ii) is:

μ̂ = ( n x̄ / σ_0² + 200 λ ) / ( n / σ_0² + 2 λ )
(iv) Comment on how the value of μ̂ calculated using the formula in part (iii) is influenced by
the value chosen for the regularisation parameter.
(v) Explain why this method might be preferable in some circumstances to the basic method
of maximum likelihood.
21.7 A warehouse stores four types of single malt whisky in identical bottles in an underground
storeroom, before they are labelled and distributed. Recently the warehouse experienced a
major flood in which the stock records were destroyed and the handwritten descriptions on some
of the cases were washed off, so that they could no longer be identified.
The warehouse manager has asked you to supervise the process of identifying the type of whisky
each of these cases belongs to, so that they can be correctly labelled. A professional whisky taster
has provided a report based on a single bottle from each case, which he has classified based on
three criteria: Smoky, Fruity, Colour (each on a scale of 1 to 3).
The standard descriptions of the four types are shown in the table below. These can be
considered to have a probability of 80% of being correct, while any other description has a
probability of 10%. The warehouse manager has also indicated the proportions of each type she
believes were in stock at the time of the flood.
(i) Show that, under the assumptions of the naïve Bayes model:

P(y = A | x_1, x_2, x_3) ∝ P(y = A) P(x_1 | y = A) P(x_2 | y = A) P(x_3 | y = A)
The taster has described the sample bottle from one of the cases as: Smoky 2, Fruity 2,
Colour 2 (medium).
(ii) Use the formula from part (i) to estimate how likely this case is to be of each of the four
types, and hence recommend how it should be labelled assuming equal misclassification
costs.
(iii) State two advantages and one disadvantage of the naïve Bayes classification method as a
machine learning technique.
21.8 A doctor is using K-means clustering with K = 5 to classify her patients by height and weight.
The raw data shows the patients’ statistics in m and kg, but she has converted the heights to cm.
She used a method based on Euclidean distance, which converged after 3 iterations, giving the
following final centroids of the clusters:
Group 1 2 3 4 5
Height (cm) 165 160 175 150 185
Weight (kg) 55 65 80 90 100
(iv) Three of the patients in the data set have the following data values:
By using a graph, or otherwise, identify the clusters to which these patients belong based
on their heights and weights.
(v) State whether the results in part (iv) would have differed if the clusters had been
obtained using the absolute distance metric D_abs(x, k) = Σ_{j=1}^{J} |x_j − k_j|.
21.9 (i) (a) If p_1 + p_2 + ... + p_K = 1, prove the identity:

Σ_{k=1}^{K} p_k Σ_{i=1, i≠k}^{K} p_i = 1 − Σ_{k=1}^{K} p_k²
(b) Explain how this identity can be used to calculate a measure of the effectiveness
of a proposed split point when constructing a decision tree.
A researcher is considering two possible decision trees to classify items of four different types
labelled A, B, C and D. A sample of 15 items classified using each of the trees gave the results
shown below.
Tree 1:
TEST 1: YES gives AAAB; NO leads to TEST 2
TEST 2: YES gives CCC; NO leads to TEST 3
TEST 3: YES gives DDDD; NO gives BBBB

Tree 2:
TEST 4: YES gives BBBBBD; NO leads to TEST 5
TEST 5: YES gives AAA; NO leads to TEST 6
TEST 6: YES gives CC; NO gives DDDC
(ii) (a) Calculate the Gini index for the initial split point of each tree.
(b) Calculate the accuracy of each tree, assuming that the predicted category for each
final leaf node is the type making up the majority in that node.
(c) Comment on your answers to parts (a) and (b).
21.10 An online retailer ran the K-means algorithm with K = 5 on its customer base, using the following
variables:
total spend in the last year
number of items purchased in the last year.
The centres of each cluster are plotted as the larger hollow shapes.
The retailer’s current marketing strategy is a monthly email to all customers outlining its ‘top picks’
across its entire product range.
(i) Comment on the characteristics of the customers in each cluster.
(ii) Suggest how the retailer could market more specifically to customers in Cluster 3.
Chapter 21 Solutions
21.1 (i) Supervised and unsupervised learning
With supervised learning, the desired outcome for each data point is specified in advance and the
algorithm aims to reproduce these as closely as possible.
With unsupervised learning, the desired outcome for each data point is not specified in advance,
as the data are unlabelled. The algorithm aims to identify patterns within the data.
(ii) Applications
(a) Here we would aim to reproduce the salaries for a sample of graduates as closely as
possible based on the three variables. So, this would be supervised learning. We could
use a multiple linear regression model here.
(b) Here we would hope that the algorithm can find suitable homogeneous groups that we
don’t know in advance. So, this would be unsupervised learning. We could use the
K -means clustering algorithm here.
(iii) Terminology
(a) As well as the ‘internal’ parameters that a model estimates from the data and uses to
calculate predictions, machine learning methods often also require hyperparameters,
which are external to the model and whose values are set in advance based on the user’s
knowledge and experience, in order to produce a model that works well. An example is
the number of clusters to aim for with the K -means algorithm.
(b) CART is an abbreviation for classification and regression trees, which is another name for
decision trees. When used for classification problem, these categorise data points by
asking a series of questions that attempt to home in on a classification.
(c) Greedy splitting is a method of constructing a decision tree. At each stage, we choose the
split that leads to the largest reduction in the loss function. In other words, we choose
the split that appears to be the most effective at separating the remaining elements,
without thinking ahead to the consequences this might have for the later splits.
(d) Clustering refers to grouping data into a set of homogeneous groups or clusters, which
can be done using methods such as the K -means algorithm.
21.2 The expected mean square error for the output value for a given unseen input value can be
represented as:
E[(y_0 − f̂(x_0))²] = var(ε_0) + var(f̂(x_0)) + [f(x_0) − E(f̂(x_0))]²

var(ε_0), which is the irreducible error and reflects the uncertainty in y_0 even if the true
functional relationship is known.

var(f̂(x_0)), which is the variance of the estimator of the functional relationship. It reflects the
variation in the estimates across different training samples.

[f(x_0) − E(f̂(x_0))]², which is the square of the bias of the estimator of the functional relationship.
It represents the underlying difference between the structure of the fitted model and the true
functional relationship.
The fitted model is a straight line and is not flexible enough to capture what appears to be a
somewhat cyclical pattern observed in the data. This model has bias – there is a fundamental
difference between the structure of the fitted model and the true functional relationship.
(b) Improvements
We could try a model with more parameters to try to better capture the patterns in the data.
              Predicted YES   Predicted NO   TOTAL
Actual YES    TP = 10         FN = 0         10
Actual NO     FP = 0          TN = 90        90
TOTAL         10              90             100

(b) Performance metrics

precision = TP / (TP + FP) = 10 / (10 + 0) = 100%

recall = TP / (TP + FN) = 10 / (10 + 0) = 100%

F1 score = 2 × 100% × 100% / (100% + 100%) = 100%

false positive rate = FP / (TN + FP) = 0 / (90 + 0) = 0%

accuracy = (TP + TN) / (TP + TN + FP + FN) = (10 + 90) / (10 + 90 + 0 + 0) = 100%
(b) F1 score
We need the precision and recall in order to calculate the F1 score. The recall is given in the
question as 95%. The precision is:
Precision = TP / (TP + FP) = 190 / (190 + 499) = 190 / 689 = 27.6%

So the F1 score is:

F1 = 2 × Precision × Recall / (Precision + Recall) = 2 × (190/689) × 0.95 / ((190/689) + 0.95) ≈ 42.7%
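As a quick numerical check (not part of the official solution), these figures can be reproduced
in R from the confusion matrix counts for a population of 100,000:

# Counts implied by prevalence 0.2%, sensitivity 95% and specificity 99.5%
TP <- 190; FN <- 10; FP <- 499; TN <- 99301

precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)
f1        <- 2 * precision * recall / (precision + recall)

c(precision = precision, recall = recall, F1 = f1)   # approx 0.276, 0.95, 0.427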
This method is based on the method of maximum likelihood where we choose the parameter
values to maximise the log-likelihood of the data available. This gives the parameter values that
best explain the values in the data.
However, we can also apply a penalty function, which is chosen to make the method more likely
to produce parameter estimates close to a set of target values that we are expecting. The penalty
is subtracted from the log-likelihood and we maximise this adjusted function instead.
We expect the value of μ to be close to 100. The function (μ − 100)² would be suitable to use
here because it takes a large positive value if μ is a long way from 100 (in either direction).
(iii) Formula
The likelihood of the sample is:

L = ∏_{i=1}^{n} (1 / (σ_0 √(2π))) exp( −½ ((x_i − μ) / σ_0)² ) = constant × exp( −(1 / (2σ_0²)) Σ_{i=1}^{n} (x_i − μ)² )

So the log-likelihood is:

log L = −n log σ_0 − (1 / (2σ_0²)) Σ_{i=1}^{n} (x_i − μ)² + constant

Subtracting the penalty gives the penalised log-likelihood:

(log L)* = −n log σ_0 − (1 / (2σ_0²)) Σ_{i=1}^{n} (x_i − μ)² + constant − λ (μ − 100)²

To maximise this, we equate the derivative with respect to the parameter μ to zero:

∂(log L)* / ∂μ = (1 / σ_0²) Σ_{i=1}^{n} (x_i − μ) − 2λ (μ − 100) = (n / σ_0²)(x̄ − μ) − 2λ (μ − 100) = 0

Rearranging:

μ ( n / σ_0² + 2λ ) = n x̄ / σ_0² + 200 λ

μ̂ = ( n x̄ / σ_0² + 200 λ ) / ( n / σ_0² + 2 λ )

If λ were set equal to zero, there would be no penalty and the method would reduce to
maximum likelihood estimation. As expected, the formula in part (iii) would then give the usual
formula for the MLE of the mean of a normal distribution:

μ̂ = ( n x̄ / σ_0² + 0 ) / ( n / σ_0² + 0 ) = x̄

If λ were given a very high value, the penalty would dominate the calculations and, in the limit,
we would have:

μ̂ = lim_{λ→∞} ( n x̄ / σ_0² + 200 λ ) / ( n / σ_0² + 2 λ ) = lim_{λ→∞} ( n x̄ / (λ σ_0²) + 200 ) / ( n / (λ σ_0²) + 2 ) = 200 / 2 = 100
The basic (unpenalised) method of maximum likelihood can sometimes lead to unreliable results.
The estimated values of the parameters can be very sensitive to the sample data and can vary
wildly.
This is most likely to happen when the sample size is small or the likelihood function is very flat so
that changes in the parameter values make very little difference to the log-likelihood.
Applying a penalty function encourages the method to produce parameter estimates that are
close to the values that would be expected from prior expectations.
P(y = A | x_1, x_2, x_3) = P(y = A, x_1, x_2, x_3) / P(x_1, x_2, x_3)   (definition of conditional probability)

= P(x_1, x_2, x_3 | y = A) P(y = A) / P(x_1, x_2, x_3)   (definition of conditional probability in reverse)

The denominator does not depend on the type y, so:

P(y = A | x_1, x_2, x_3) ∝ P(x_1, x_2, x_3 | y = A) P(y = A)

Under the naïve Bayes assumption, the conditional probabilities of the descriptions given the
type are independent, so:

P(y = A | x_1, x_2, x_3) ∝ P(y = A) P(x_1 | y = A) P(x_2 | y = A) P(x_3 | y = A)
Let y denote the type and let x_1, x_2, x_3 denote the three descriptions (Smoky, Fruity, Colour).
We can then apply the result from part (i) to calculate the probability that this case is a Mactavish
whisky (y = M), given that it has been described as Smoky (S = 2), Fruity (F = 2) and medium
Colour (C = 2). The resulting relative likelihoods for the four types are in the ratio:

M : W : G : D = 4 : 24 : 64 : 128   (total 220)

= 4/220 : 24/220 : 64/220 : 128/220 = 0.0182 : 0.1091 : 0.2909 : 0.5818
So the recommendation would be that this case is a Dogavulin whisky, as this has a far higher
probability (58%) than the other three types.
The main disadvantage is that it assumes that the conditional probabilities are independent
(which can be a poor approximation when the variables are correlated).
K is a hyperparameter specifying the number of clusters the algorithm should aim to produce –
in this case, 5.
The weights in the original data provided were given in units of kilograms. These cover a range of
values of about 50kg.
The heights in the original data provided were given in units of metres. These cover a range of
values of about 0.50m.
So with units of (kg, m) the range for the weights is about 100 times greater, which would mean
that the weights would totally dominate the calculations and the heights would effectively be
ignored.
However, when the doctor converts the heights to centimetres, the range of values is then about
50cm, which is numerically very similar to the range for the weights. This gives the two variables
a similar weighting in the calculations.
In practice both variables would be standardised by, for example, subtracting the mean and
dividing by the standard deviation.
(iii) Convergence
The algorithm involves repeatedly finding the centroid of the data points that have been allocated
to each cluster and then reallocating the points to the cluster whose centroid they are nearest to.
When this process reaches a stage where no further changes are made, the algorithm has
converged.
If we plot the patients and the centroids for the clusters on a graph, we can easily see which
centroid each patient is nearest to, and hence identify the cluster for these patients.
If we do the calculations, we find that the shortest Euclidean distances D_i (for the centroids
i = 1, 2, ..., 5) are:
The absolute distance measures the distance between points assuming that we can only move
horizontally or vertically.
Here we have two dimensions (the two variables Weight and Height), so J = 2. With this metric
the distance between the two points (x_1, x_2) and (k_1, k_2) is:

|x_1 − k_1| + |x_2 − k_2|
With this metric, Mr Blobby’s distance from the centroid for cluster 4 is:
The diagram below shows the distance to each centroid for Mr Blobby.
We can see that if the clusters had been constructed using absolute distances, we would get the
same answers as if they had been constructed using Euclidean distance.
We can measure the effectiveness of a proposed split point by examining the ‘purity’ of the data
in each of the child nodes and then calculating an overall measure of the purity of the split.
This can be done by multiplying the proportion of items of type k at each child node by the
proportions for each other type i k and summing. These values are then weighted by the
number of items at that node to calculate an overall measure that we want to minimise. Using
the identity in part (i)(a) leads to the following equivalent expression that we want to minimise:
Σ_{c=1}^{2} n_c ( 1 − Σ_{k=1}^{K} p_ck² )
First tree
In Tree 1, the top final node contains AAAB, ie 3 A’s and 1 B. This is the first child node of the
initial split. For this child node, the proportions are p_A = 3/4 and p_B = 1/4. So the Gini impurity
score for this node is:

G_1 = 1 − [ (3/4)² + (1/4)² ] = 1 − 10/16 = 3/8
The second child node of the initial split contains all the remaining data points. The proportions
are p_B = 4/11, p_C = 3/11 and p_D = 4/11. So the Gini impurity score for this node is:

G_2 = 1 − [ (4/11)² + (3/11)² + (4/11)² ] = 1 − 41/121 = 80/121

The overall Gini index after this first split is then the weighted sum of these scores:

G = 4 × 3/8 + 11 × 80/121 = 3/2 + 80/11 = 193/22 ≈ 8.773
Second tree
In Tree 2, the top final node contains BBBBBD, ie 5 B’s and 1 D. This is the first child node of the
initial split. For this child node, the proportions are p_B = 5/6 and p_D = 1/6. So the Gini
impurity score for this node is:

G_1 = 1 − [ (5/6)² + (1/6)² ] = 1 − 26/36 = 5/18

The second child node of the first split contains all the remaining data points. The proportions are
p_A = 3/9 = 1/3, p_C = 1/3 and p_D = 1/3. So the Gini impurity score for this node is:

G_2 = 1 − [ (1/3)² + (1/3)² + (1/3)² ] = 1 − 1/3 = 2/3
The overall Gini index after this initial split is then the weighted sum of these scores:

G = 6 × 5/18 + 9 × 2/3 = 5/3 + 6 = 23/3 ≈ 7.667
Assuming that the assigned labels for each node are given by the majority type within them, the
predicted values for the first tree, going in order down the tree, are A, C, D and B. There are 14
out of 15 correctly predicted types. So the accuracy is:
14/15 = 0.933
The predicted values for the second tree, going in order down the tree, are B, A, C and D. There
are 13 out of 15 correctly predicted types. So the accuracy is:
13/15 = 0.867
(ii)(c) Comment
The Gini index for the initial split is lower for the second tree than it is for the first. This means
that, when taking a greedy approach to tree construction, the initial split from the second tree
would be selected over the initial split from the first tree.
However, the first tree has a higher overall prediction accuracy than the second. So, this tree is
likely to be preferred overall.
This illustrates how the greedy approach doesn’t necessarily produce the best overall tree. Here,
the initial split in the second tree has a lower Gini index for the initial split because it separates
out 5 B’s and 1 D compared to the 3 A’s and 1 B in the first tree. However, after the initial split,
the subsequent subtree of the first tree outperforms that of the second, perfectly classifying the
remaining items.
Customers in Cluster 1 made more purchases than the average customer, but overall spent less
than average. Assuming the purchases were spread across the year, these customers appear to
be using the online retailer regularly to make mainly small to medium-sized purchases.
Customers in Cluster 2 made lots of purchases and spent a lot of money. They appear to also be
regularly purchasing the cheaper to mid-range items, but more frequently than customers in
Cluster 1.
Customers in Cluster 3 spent a lot of money on a few items. So, they don’t make purchases often
but when they do, they are buying the very expensive items.
Customers in Cluster 4 appear to have made very few mid-range to expensive purchases. They
may not be regular visitors to the retailer or could be newer customers.
Customers in Cluster 5 spent a similar amount to customers in Cluster 1 but over fewer purchases.
They appear to have made a few more expensive purchases over the year.
From part (i), customers in Cluster 3 are those that spend a lot of money on a few purchases,
suggesting that they are more interested in the expensive items. The retailer could email these
customers about their top-end products only, rather than the entire product range.
End of Part 5
What next?
1. Briefly review the key areas of Part 5 and/or re-read the summaries at the end of
Chapters 17 to 21.
2. Ensure you have attempted some of the Practice Questions at the end of each chapter in
Part 5. If you don’t have time to do them all, you could save the remainder for use as part
of your revision.
3. Attempt Assignment X5.
4. Read the Chapter 17 to 21 material (copulas, reinsurance, risk models and machine
learning) of the Paper B Online Resources (PBOR).
5. Attempt Assignment Y2.
Time to consider …
… ‘rehearsal’ products
Mock Exam and marking – You can attempt the Mock Exam and get it marked using Mock
Exam Marking or more flexible Marking Vouchers.
Additional Mock Pack (AMP) and Marking Vouchers – The AMP includes two additional
mock exam papers that you can attempt and get marked using Marking Vouchers.
Results of surveys suggest that attempting the mock exams and having them marked
improves your chances of passing the exam. Students have said:
‘Overall the marking was extremely useful and gave detailed comments
on where I was losing marks and how to improve on my answers and
exam technique. This is exactly what I was looking for – thank you!’