
Exam SRM

updated 08/04/22

STATISTICAL LEARNING

Types of Variables
Response      A variable of primary interest
Explanatory   A variable used to study the response variable
Count         A quantitative variable usually valid on non-negative integers
Continuous    A real-valued quantitative variable
Nominal       A categorical/qualitative variable having categories without a meaningful or logical order
Ordinal       A categorical/qualitative variable having categories with a meaningful or logical order

Notation
$y, Y$                          Response variable
$x, X$                          Explanatory variable
Subscript $i$                   Index for observations
$n$                             No. of observations
Subscript $j$                   Index for variables except response
$p$                             No. of variables except response
$\mathbf{A}^{T}$                Transpose of matrix $\mathbf{A}$
$\mathbf{A}^{-1}$               Inverse of matrix $\mathbf{A}$
$\varepsilon$                   Error term
$\hat{y}, \hat{Y}, \hat{f}(x)$  Estimate/estimator of $f(x)$

Contrasting Statistical Learning Elements
• Statistical learning problems: Supervised (has a response variable) vs. Unsupervised (no response variable)
• Supervised problems: Regression (quantitative response variable) vs. Classification (categorical response variable)
• Modeling problems: Parametric (functional form of $f$ specified) vs. Non-Parametric (functional form of $f$ not specified)
• Purpose of a method: Prediction (output of $\hat{f}$) vs. Inference (comprehension of $f$'s properties)
• Flexibility ($\hat{f}$'s ability to follow the data) vs. Interpretability ($\hat{f}$'s ability to be understood)
• Data: Training (observations used to train/obtain $\hat{f}$) vs. Test (observations not used to train/obtain $\hat{f}$)



Regression Problems
$Y = f(x_1, \dots, x_p) + \varepsilon$ where $\mathrm{E}[\varepsilon] = 0$, so $\mathrm{E}[Y] = f(x_1, \dots, x_p)$

Test MSE $= \mathrm{E}\big[(Y - \hat{Y})^2\big]$, which can be estimated using $\dfrac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n}$

For fixed inputs $x_1, \dots, x_p$, the test MSE is
$\underbrace{\mathrm{Var}\big[\hat{f}(x_1, \dots, x_p)\big] + \big(\mathrm{Bias}\big[\hat{f}(x_1, \dots, x_p)\big]\big)^2}_{\text{reducible error}} + \underbrace{\mathrm{Var}[\varepsilon]}_{\text{irreducible error}}$

Descriptive Data Analysis

Numerical Summaries
$\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n}, \qquad s_x^2 = \dfrac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}$
$cov_{x,y} = \dfrac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
$r_{x,y} = \dfrac{cov_{x,y}}{s_x \cdot s_y} = \dfrac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \cdot \sum_{i=1}^{n}(y_i - \bar{y})^2}}, \qquad -1 \le r_{x,y} \le 1$
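The estimators above are direct to compute. Below is a minimal Python/numpy sketch (the variable names `y_test` and `y_hat` and the toy numbers are illustrative, not from the sheet) that evaluates the test-MSE estimate and the sample correlation $r_{x,y}$.

```python
import numpy as np

def test_mse(y_test, y_hat):
    """Estimate of E[(Y - Y_hat)^2]: average squared prediction error on test data."""
    y_test, y_hat = np.asarray(y_test, float), np.asarray(y_hat, float)
    return np.mean((y_test - y_hat) ** 2)

def sample_corr(x, y):
    """r_{x,y} = cov_{x,y} / (s_x * s_y), using n-1 denominators throughout."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    cov = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)
    return cov / (x.std(ddof=1) * y.std(ddof=1))

# Toy held-out data and predictions
print(test_mse([2.0, 3.5, 4.1], [1.8, 3.9, 4.0]))       # = 0.07
print(sample_corr([1, 2, 3, 4], [2.1, 3.9, 6.2, 8.0]))  # close to 1
```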

Classification Problems
Test Error Rate $= \mathrm{E}\big[I(Y \ne \hat{Y})\big]$, which can be estimated using $\dfrac{\sum_{i=1}^{n} I(y_i \ne \hat{y}_i)}{n}$

Bayes Classifier:
$f(x_1, \dots, x_p) = \arg\max_{c}\, \Pr(Y = c \mid X_1 = x_1, \dots, X_p = x_p)$

Key Ideas
• The disadvantage to parametric methods is the danger of choosing a form for $f$ that is not close to the truth.
• The disadvantage to non-parametric methods is the need for an abundance of observations.
• Flexibility and interpretability are typically at odds.
• As flexibility increases, the training MSE (or error rate) decreases, but the test MSE (or error rate) follows a U-shaped pattern.
• Low flexibility leads to a method with low variance and high bias; high flexibility leads to a method with high variance and low bias.

Scatterplots
Plot the values of two variables to investigate their relationship.

Box Plots
Capture a variable's distribution using its median, 1st and 3rd quartiles, and distribution tails.
[Figure: box plot labeling the smallest non-outlier, 1st quartile, median, 3rd quartile, largest non-outlier, and outliers; the box spans the interquartile range, and each of the four segments contains 25% of the observations.]

qq Plots
Plot sample quantiles against theoretical quantiles to determine whether the sample and theoretical distributions have similar shapes.



LINEAR MODELS

Simple Linear Regression (SLR)
Special case of MLR where $p = 1$

Estimation
$b_1 = \dfrac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$
$b_0 = \bar{y} - b_1\bar{x}$

SLR Inferences
Standard Errors
$se_{b_0} = \sqrt{\mathrm{MSE}\left(\dfrac{1}{n} + \dfrac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right)}$
$se_{b_1} = \sqrt{\dfrac{\mathrm{MSE}}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$
$se_{\hat{y}} = \sqrt{\mathrm{MSE}\left(\dfrac{1}{n} + \dfrac{(x - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right)}$
$se_{\hat{y}_{n+1}} = \sqrt{\mathrm{MSE}\left(1 + \dfrac{1}{n} + \dfrac{(x_{n+1} - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right)}$

Multiple Linear Regression (MLR)
$Y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \varepsilon$

Notation
$\beta_j$      The $j$th regression coefficient
$b_j$          Estimate of $\beta_j$
$\sigma^2$     Variance of response / irreducible error
MSE            Estimate of $\sigma^2$
$\mathbf{X}$   Design matrix
$\mathbf{H}$   Hat matrix
$e$            Residual
SST            Total sum of squares
SSR            Regression sum of squares
SSE            Error sum of squares

Assumptions
1. $Y_i = \beta_0 + \beta_1 x_{i,1} + \dots + \beta_p x_{i,p} + \varepsilon_i$
2. $x_{i,j}$'s are non-random
3. $\mathrm{E}[\varepsilon_i] = 0$
4. $\mathrm{Var}[\varepsilon_i] = \sigma^2$
5. $\varepsilon_i$'s are independent
6. $\varepsilon_i$'s are normally distributed
7. The predictor $x_j$ is not a linear combination of the other $p$ predictors, for $j = 0, 1, \dots, p$

Estimation – Ordinary Least Squares (OLS)
$\hat{y} = b_0 + b_1 x_1 + \dots + b_p x_p$
$\mathbf{b} = \begin{pmatrix} b_0 \\ \vdots \\ b_p \end{pmatrix} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}$
$\mathrm{MSE} = \mathrm{SSE}/(n - p - 1)$
residual standard error $= \sqrt{\mathrm{MSE}}$

Other Numerical Results
$\mathbf{H} = \mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}$
$\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}$
$e = y - \hat{y}$
$\mathrm{SST} = \sum_{i=1}^{n}(y_i - \bar{y})^2$ = total variability
$\mathrm{SSR} = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2$ = explained
$\mathrm{SSE} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ = unexplained
$\mathrm{SST} = \mathrm{SSR} + \mathrm{SSE}$
$R^2 = \mathrm{SSR}/\mathrm{SST}$
$R^2_{adj.} = 1 - \dfrac{\mathrm{MSE}}{s_y^2} = 1 - (1 - R^2)\left(\dfrac{n - 1}{n - p - 1}\right)$

Key Ideas
• $R^2$ is a poor measure for model comparison because it will increase simply by adding more predictors to a model.
• Polynomials do not change consistently by unit increases of their variable, i.e. no constant slope.
• Only $w - 1$ dummy variables are needed to represent $w$ classes of a categorical predictor; one of the classes acts as a baseline.
• In effect, dummy variables define a distinct intercept for each class. Without the interaction between a dummy variable and a predictor, the dummy variable cannot additionally affect that predictor's regression coefficient.

MLR Inferences
Notation
$\hat{\beta}_j$                        Estimator for $\beta_j$
$\hat{Y}$                              Estimator for $\mathrm{E}[Y]$
$se$                                   Estimated standard error
$H_0$                                  Null hypothesis
$H_1$                                  Alternative hypothesis
df                                     Degrees of freedom
$t_{1-q,\,\mathrm{df}}$                $q$ quantile of a $t$-distribution
$\alpha$                               Significance level
$k$                                    Confidence level
ndf                                    Numerator degrees of freedom
ddf                                    Denominator degrees of freedom
$F_{1-q,\,\mathrm{ndf},\,\mathrm{ddf}}$  $q$ quantile of an $F$-distribution
$Y_{n+1}$                              Response of new observation
Subscript $r$                          Reduced model
Subscript $f$                          Full model

Standard Errors
$se_{b_j} = \sqrt{\widehat{\mathrm{Var}}\big[\hat{\beta}_j\big]}$

Variance-Covariance Matrix
$\widehat{\mathrm{Var}}\big[\hat{\boldsymbol{\beta}}\big] = \mathrm{MSE}\,(\mathbf{X}^{T}\mathbf{X})^{-1} =
\begin{pmatrix}
\widehat{\mathrm{Var}}[\hat{\beta}_0] & \widehat{\mathrm{Cov}}[\hat{\beta}_0, \hat{\beta}_1] & \cdots & \widehat{\mathrm{Cov}}[\hat{\beta}_0, \hat{\beta}_p] \\
\widehat{\mathrm{Cov}}[\hat{\beta}_0, \hat{\beta}_1] & \widehat{\mathrm{Var}}[\hat{\beta}_1] & \cdots & \widehat{\mathrm{Cov}}[\hat{\beta}_1, \hat{\beta}_p] \\
\vdots & \vdots & \ddots & \vdots \\
\widehat{\mathrm{Cov}}[\hat{\beta}_0, \hat{\beta}_p] & \widehat{\mathrm{Cov}}[\hat{\beta}_1, \hat{\beta}_p] & \cdots & \widehat{\mathrm{Var}}[\hat{\beta}_p]
\end{pmatrix}$

$t$ Tests
$t \text{ statistic} = \dfrac{\text{estimate} - \text{hypothesized value}}{\text{standard error}}$
$H_0: \beta_j = \text{hypothesized value}$

Test Type     Rejection Region
Two-tailed    $|t \text{ statistic}| \ge t_{\alpha/2,\, n-p-1}$
Left-tailed   $t \text{ statistic} \le -t_{\alpha,\, n-p-1}$
Right-tailed  $t \text{ statistic} \ge t_{\alpha,\, n-p-1}$

$F$ Tests
$F \text{ statistic} = \dfrac{\mathrm{MSR}}{\mathrm{MSE}} = \dfrac{\mathrm{SSR}/p}{\mathrm{SSE}/(n - p - 1)}$
$H_0: \beta_1 = \beta_2 = \dots = \beta_p = 0$
Reject $H_0$ if $F$ statistic $\ge F_{\alpha,\, \mathrm{ndf},\, \mathrm{ddf}}$
• ndf $= p$
• ddf $= n - p - 1$
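The matrix form of the OLS estimator above translates directly into code. Below is a minimal numpy sketch (the fitting function, column layout, and toy data are illustrative assumptions, not part of the sheet) that builds a design matrix with an intercept column, solves the normal equations, and reports MSE and $R^2$.

```python
import numpy as np

def ols_fit(X_raw, y):
    """Fit MLR by OLS: b = (X'X)^(-1) X'y, with an intercept column prepended."""
    X_raw, y = np.asarray(X_raw, float), np.asarray(y, float)
    n, p = X_raw.shape
    X = np.column_stack([np.ones(n), X_raw])     # design matrix
    b = np.linalg.solve(X.T @ X, X.T @ y)        # OLS coefficients
    y_hat = X @ b
    sse = np.sum((y - y_hat) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    mse = sse / (n - p - 1)                      # estimate of sigma^2
    r2 = 1 - sse / sst                           # = SSR/SST
    return b, mse, r2

# Toy data: two predictors, five observations
X = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
y = [3.1, 3.9, 7.2, 8.1, 10.8]
print(ols_fit(X, y))
```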



Partial $F$ Tests
$F \text{ statistic} = \dfrac{(\mathrm{SSE}_r - \mathrm{SSE}_f)/(p_f - p_r)}{\mathrm{SSE}_f/(n - p_f - 1)}$
$H_0$: Some $\beta_j$'s $= 0$
Reject $H_0$ if $F$ statistic $\ge F_{\alpha,\, \mathrm{ndf},\, \mathrm{ddf}}$
• ndf $= p_f - p_r$
• ddf $= n - p_f - 1$

For all hypothesis tests, reject $H_0$ if $p$-value $\le \alpha$.

Confidence and Prediction Intervals
estimate $\pm$ ($t$ quantile)(standard error)

Quantity          Interval Expression
$\beta_j$         $b_j \pm t_{(1-k)/2,\, n-p-1} \cdot se_{b_j}$
$\mathrm{E}[Y]$   $\hat{y} \pm t_{(1-k)/2,\, n-p-1} \cdot se_{\hat{y}}$
$Y_{n+1}$         $\hat{y}_{n+1} \pm t_{(1-k)/2,\, n-p-1} \cdot se_{\hat{y}_{n+1}}$

Linear Model Assumptions
Leverage
$h_i = \mathbf{x}_i^{T}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{x}_i = \dfrac{se_{\hat{y}_i}^2}{\mathrm{MSE}}$
$h_i = \dfrac{1}{n} + \dfrac{(x_i - \bar{x})^2}{\sum_{u=1}^{n}(x_u - \bar{x})^2}$ for SLR
• $1/n \le h_i \le 1$
• $\sum_{i=1}^{n} h_i = p + 1$
• Frees rule of thumb: $h_i > 3\left(\dfrac{p+1}{n}\right)$

Studentized and Standardized Residuals
$e_{stu,\,i} = \dfrac{e_i}{\sqrt{\mathrm{MSE}_{(i)}(1 - h_i)}}$
$e_{sta,\,i} = \dfrac{e_i}{\sqrt{\mathrm{MSE}(1 - h_i)}}$
• Frees rule of thumb: $|e_{sta,\,i}| > 2$

Cook's Distance
$D_i = \dfrac{\sum_{u=1}^{n}\big(\hat{y}_u - \hat{y}_{(i)u}\big)^2}{\mathrm{MSE}(p + 1)} = \dfrac{e_i^2\, h_i}{\mathrm{MSE}(p + 1)(1 - h_i)^2}$

Plots of Residuals
• $e$ versus $\hat{y}$
  Residuals are well-behaved if
  o Points appear to be randomly scattered
  o Residuals seem to average to 0
  o Spread of residuals does not change
• $e$ versus $i$
  Detects dependence of error terms
• $qq$ plot of $e$

Variance Inflation Factor
$\mathrm{VIF}_j = \dfrac{1}{1 - R_j^2} = \dfrac{s_{x_j}^2(n - 1)}{\mathrm{MSE}}\, se_{b_j}^2$
Tolerance is the reciprocal of VIF.
• Frees rule of thumb: any $\mathrm{VIF}_j \ge 10$

Key Ideas
• As realizations of a $t$-distribution, studentized residuals can help identify outliers.
• When residuals have a larger spread for larger predictions, one solution is to transform the response variable with a concave function.
• There is no universal approach to handling multicollinearity; it is even possible to accept it, such as when there is a suppressor variable. On the other hand, it can be eliminated by using a set of orthogonal predictors.

Model Selection
Notation
$g$                Total no. of predictors in consideration
$p$                No. of predictors for a specific model
$\mathrm{MSE}_g$   MSE of the model that uses all $g$ predictors
$\mathrm{M}_p$     The "best" model with $p$ predictors

Best Subset Selection
1. For $p = 0, 1, \dots, g$, fit all $\binom{g}{p}$ models with $p$ predictors. The model with the largest $R^2$ is $\mathrm{M}_p$.
2. Choose the best model among $\mathrm{M}_0, \dots, \mathrm{M}_g$ using a selection criterion of choice.

Forward Stepwise Selection
1. Fit all $g$ simple linear regression models. The model with the largest $R^2$ is $\mathrm{M}_1$.
2. For $p = 2, \dots, g$, fit the models that add one of the remaining predictors to $\mathrm{M}_{p-1}$. The model with the largest $R^2$ is $\mathrm{M}_p$.
3. Choose the best model among $\mathrm{M}_0, \dots, \mathrm{M}_g$ using a selection criterion of choice.

Backward Stepwise Selection
1. Fit the model with all $g$ predictors, $\mathrm{M}_g$.
2. For $p = g - 1, \dots, 1$, fit the models that drop one of the predictors from $\mathrm{M}_{p+1}$. The model with the largest $R^2$ is $\mathrm{M}_p$.
3. Choose the best model among $\mathrm{M}_0, \dots, \mathrm{M}_g$ using a selection criterion of choice.

Selection Criteria
• Mallows' $C_p$
  $C_p = \dfrac{\mathrm{SSE} + 2p \cdot \mathrm{MSE}_g}{n}$
  $C_p = \dfrac{\mathrm{SSE}}{\mathrm{MSE}_g} - n + 2(p + 1)$
• Akaike information criterion
  $\mathrm{AIC} = \dfrac{\mathrm{SSE} + 2p \cdot \mathrm{MSE}_g}{n \cdot \mathrm{MSE}_g}$
• Bayesian information criterion
  $\mathrm{BIC} = \dfrac{\mathrm{SSE} + \ln n \cdot p \cdot \mathrm{MSE}_g}{n \cdot \mathrm{MSE}_g}$
• Adjusted $R^2$
• Cross-validation error

Validation Set
• Randomly splits all available observations into two groups: the training set and the validation set.
• Only the observations in the training set are used to attain the fitted model, and those in the validation set are used to estimate the test MSE.

$k$-fold Cross-Validation
1. Randomly divide all available observations into $k$ folds.
2. For $v = 1, \dots, k$, obtain the $v$th fit by training with all observations except those in the $v$th fold.
3. For $v = 1, \dots, k$, use $\hat{y}$ from the $v$th fit to calculate a test MSE estimate with observations in the $v$th fold.
4. To calculate CV error, average the $k$ test MSE estimates in the previous step.

Leave-one-out Cross-Validation (LOOCV)
• Calculate LOOCV error as a special case of $k$-fold cross-validation where $k = n$.
• For MLR:
  $\mathrm{LOOCV\ Error} = \dfrac{1}{n}\sum_{i=1}^{n}\left(\dfrac{y_i - \hat{y}_i}{1 - h_i}\right)^2$

Key Ideas on Cross-Validation
• The validation set approach has unstable results and will tend to overestimate the test MSE. The two other approaches mitigate these issues.
• With respect to bias, LOOCV < $k$-fold CV < Validation Set.
• With respect to variance, LOOCV > $k$-fold CV > Validation Set.
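The $k$-fold procedure above is short to code. The following Python sketch (an illustration under made-up data; the fold assignment and the use of a least-squares fit are assumptions, not the sheet's prescription) estimates CV error for an OLS model.

```python
import numpy as np

def kfold_cv_error(X_raw, y, k, seed=0):
    """k-fold CV error for an OLS fit: average of the k held-out MSE estimates."""
    X_raw, y = np.asarray(X_raw, float), np.asarray(y, float)
    n = len(y)
    folds = np.random.default_rng(seed).permutation(n) % k   # random fold labels 0..k-1
    fold_mses = []
    for v in range(k):
        train, test = folds != v, folds == v
        X_tr = np.column_stack([np.ones(train.sum()), X_raw[train]])
        X_te = np.column_stack([np.ones(test.sum()), X_raw[test]])
        b = np.linalg.lstsq(X_tr, y[train], rcond=None)[0]    # fit on training folds
        fold_mses.append(np.mean((y[test] - X_te @ b) ** 2))  # test MSE on held-out fold
    return np.mean(fold_mses)

# Toy example with 3 folds
X = np.array([[1., 2.], [2., 1.], [3., 4.], [4., 3.], [5., 5.], [6., 2.]])
y = np.array([3.1, 3.9, 7.2, 8.1, 10.8, 9.5])
print(kfold_cv_error(X, y, k=3))
```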



Other Regression Approaches

Standardizing Variables
• A centered variable is the result of subtracting the sample mean from a variable.
• A scaled variable is the result of dividing a variable by its sample standard deviation.
• A standardized variable is the result of first centering a variable, then scaling it.

Ridge Regression
Coefficients are estimated by minimizing the SSE while constrained by $\sum_{j=1}^{p} b_j^2 \le a$, or equivalently, by minimizing the expression $\mathrm{SSE} + \lambda\sum_{j=1}^{p} b_j^2$.

Lasso Regression
Coefficients are estimated by minimizing the SSE while constrained by $\sum_{j=1}^{p} |b_j| \le a$, or equivalently, by minimizing the expression $\mathrm{SSE} + \lambda\sum_{j=1}^{p} |b_j|$.

Key Ideas on Ridge and Lasso
• $x_1, \dots, x_p$ are scaled predictors.
• $\lambda$ is inversely related to flexibility.
• With a finite $\lambda$, none of the ridge estimates will equal 0, but the lasso estimates could equal 0.

Partial Least Squares
• The first partial least squares direction, $z_1$, is a linear combination of standardized predictors $x_1, \dots, x_p$, with coefficients based on the relation between $x_j$ and $y$.
• Every subsequent partial least squares direction is calculated iteratively as a linear combination of "updated predictors", which are the residuals of fits with the "previous predictors" explained by the previous direction.
• The directions $z_1, \dots, z_g$ are used as predictors in a multiple linear regression. The number of directions, $g$, is a measure of flexibility.

Weighted Least Squares
• $\mathrm{Var}[\varepsilon_i] = \sigma^2/w_i$
• Equivalent to running OLS with $\sqrt{w}\,y$ as the response and $\sqrt{w}\,\mathbf{x}$ as the predictors, hence minimizing $\sum_{i=1}^{n} w_i(y_i - \hat{y}_i)^2$.
• $\mathbf{b} = (\mathbf{X}^{T}\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{W}\mathbf{y}$ where $\mathbf{W}$ is the diagonal matrix of the weights.

$k$-Nearest Neighbors (KNN)
1. Identify the "center of the neighborhood", i.e. the location of an observation with inputs $x_1, \dots, x_p$.
2. Starting from the "center of the neighborhood", identify the $k$ nearest training observations.
3. For classification, $\hat{y}$ is the most frequent category among the $k$ observations; for regression, $\hat{y}$ is the average of the response among the $k$ observations.
$k$ is inversely related to flexibility.
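When the predictors are standardized and the response is centered, the ridge objective $\mathrm{SSE} + \lambda\sum_j b_j^2$ has a closed-form minimizer. The numpy sketch below is an illustration under those assumptions (function name and data are made up); it also shows the coefficients shrinking toward 0 as $\lambda$ grows.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge coefficients for standardized X and centered y:
    minimizes SSE + lam * sum(b_j^2), giving b = (X'X + lam*I)^(-1) X'y."""
    X = np.asarray(X, float)
    y = np.asarray(y, float)
    X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardize predictors
    y = y - y.mean()                                    # center response (intercept unpenalized)
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

X = np.array([[1., 2.], [2., 1.], [3., 4.], [4., 3.], [5., 5.]])
y = np.array([3.1, 3.9, 7.2, 8.1, 10.8])
for lam in (0.0, 1.0, 10.0):
    print(lam, ridge_fit(X, y, lam))   # larger lambda -> smaller coefficients
```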




Key Results for Distributions in the Exponential Family
For each distribution: probability function; $\theta$; $\phi$; $b(\theta)$; canonical link, $b'^{-1}(\mu)$.

Normal: $\dfrac{1}{\sigma\sqrt{2\pi}}\exp\left[-\dfrac{(y-\mu)^2}{2\sigma^2}\right]$;  $\theta = \mu$,  $\phi = \sigma^2$,  $b(\theta) = \dfrac{\theta^2}{2}$,  canonical link $= \mu$

Binomial (fixed $n$): $\dbinom{n}{y}\pi^{y}(1-\pi)^{n-y}$;  $\theta = \ln\left(\dfrac{\pi}{1-\pi}\right)$,  $\phi = 1$,  $b(\theta) = n\ln(1 + e^{\theta})$,  canonical link $= \ln\left(\dfrac{\mu}{n-\mu}\right)$

Poisson: $\dfrac{\lambda^{y}}{y!}\exp(-\lambda)$;  $\theta = \ln\lambda$,  $\phi = 1$,  $b(\theta) = e^{\theta}$,  canonical link $= \ln\mu$

Negative Binomial (fixed $r$): $\dfrac{\Gamma(y+r)}{y!\,\Gamma(r)}\,p^{r}(1-p)^{y}$;  $\theta = \ln(1-p)$,  $\phi = 1$,  $b(\theta) = -r\ln(1 - e^{\theta})$,  canonical link $= \ln\left(\dfrac{\mu}{r+\mu}\right)$

Gamma: $\dfrac{\gamma^{\alpha}}{\Gamma(\alpha)}\,y^{\alpha-1}\exp(-y\gamma)$;  $\theta = -\dfrac{\gamma}{\alpha}$,  $\phi = \dfrac{1}{\alpha}$,  $b(\theta) = -\ln(-\theta)$,  canonical link $= -\dfrac{1}{\mu}$

Inverse Gaussian: $\sqrt{\dfrac{\lambda}{2\pi y^{3}}}\exp\left[-\dfrac{\lambda(y-\mu)^2}{2\mu^2 y}\right]$;  $\theta = -\dfrac{1}{2\mu^2}$,  $\phi = \dfrac{1}{\lambda}$,  $b(\theta) = -\sqrt{-2\theta}$,  canonical link $= -\dfrac{1}{2\mu^2}$



NON-LINEAR MODELS

Generalized Linear Models
Notation
$\theta, \phi$                   Linear exponential family parameters
$\mathrm{E}[Y], \mu$             Mean response
$b'(\theta)$                     Mean function
$v(\mu)$                         Variance function
$h(\mu)$                         Link function
$\mathbf{b}$                     Maximum likelihood estimate of $\boldsymbol{\beta}$
$l(\mathbf{b})$                  Maximized log-likelihood
$l_0$                            Maximized log-likelihood for null model
$l_{sat}$                        Maximized log-likelihood for saturated model
$e$                              Residual
$\mathbf{I}$                     Information matrix
$\chi^2_{1-q,\,\mathrm{df}}$     $q$ quantile of a chi-square distribution
$D^{*}$                          Scaled deviance
$D$                              Deviance statistic

Linear Exponential Family
Prob. fn. of $Y = \exp\left[\dfrac{y\theta - b(\theta)}{\phi} + a(y, \phi)\right]$
$\mathrm{E}[Y] = b'(\theta)$
$\mathrm{Var}[Y] = \phi \cdot b''(\theta) = \phi \cdot v(\mu)$

Model Framework
• $h(\mu) = \mathbf{x}^{T}\boldsymbol{\beta}$
• Canonical link is the link function where $h(\mu) = b'^{-1}(\mu)$.

Parameter Estimation
$l(\boldsymbol{\beta}) = \sum_{i=1}^{n}\left[\dfrac{y_i\theta_i - b(\theta_i)}{\phi} + a(y_i, \phi)\right]$
where $\theta_i = b'^{-1}\big(h^{-1}(\mathbf{x}_i^{T}\boldsymbol{\beta})\big)$
The score equations are the partial derivatives of $l(\boldsymbol{\beta})$ with respect to each $\beta_j$, all set equal to 0. The solution to the score equations is $\mathbf{b}$. Then, $\hat{\mu} = h^{-1}(\mathbf{x}^{T}\mathbf{b})$.

Numerical Results
$D^{*} = 2[l_{sat} - l(\mathbf{b})]$
$D = \phi D^{*}$
For MLR, $D = \mathrm{SSE}$
$R^2_{\text{max-scaled}} = \dfrac{1 - \exp\{2[l_0 - l(\mathbf{b})]/n\}}{1 - \exp\{2 l_0/n\}}$
$R^2_{\text{pseudo}} = \dfrac{l(\mathbf{b}) - l_0}{l_{sat} - l_0}$
$\mathrm{AIC} = -2 \cdot l(\mathbf{b}) + 2 \cdot (p + 1)^{*}$
$\mathrm{BIC} = -2 \cdot l(\mathbf{b}) + \ln n \cdot (p + 1)^{*}$
*Assumes only $\boldsymbol{\beta}$ needs to be estimated. If estimating $\phi$ is required, replace $p + 1$ with $p + 2$.

Residuals
Raw Residual
$e_i = y_i - \hat{\mu}_i$
Pearson Residual
$e_i = \dfrac{y_i - \hat{\mu}_i}{\sqrt{\phi \cdot v(\hat{\mu}_i)}}$
The Pearson chi-square statistic is $\sum_{i=1}^{n} e_i^2$.
Deviance Residual
$e_i = \pm\sqrt{D_i^{*}}$ whose sign follows the $i$th raw residual
Anscombe Residual
$e_i = \dfrac{t(y_i) - \hat{\mathrm{E}}[t(Y_i)]}{\sqrt{\widehat{\mathrm{Var}}[t(Y_i)]}}$

Inference
• Maximum likelihood estimators $\hat{\boldsymbol{\beta}}$ asymptotically have a multivariate normal distribution with mean $\boldsymbol{\beta}$ and asymptotic variance-covariance matrix $\mathbf{I}^{-1}$.
• To address overdispersion, change the variance to $\mathrm{Var}[Y_i] = \delta \cdot \phi_i \cdot b''(\theta_i)$ and estimate $\delta$ as the Pearson chi-square statistic divided by $n - p - 1$.

Likelihood Ratio Tests
$\chi^2 \text{ statistic} = 2\big[l(\mathbf{b}_f) - l(\mathbf{b}_r)\big]$
$H_0$: Some $\beta_j$'s $= 0$
Reject $H_0$ if $\chi^2$ statistic $\ge \chi^2_{\alpha,\, p_f - p_r}$

Goodness-of-Fit Tests
$Y$ follows a distribution of choice with $g$ free parameters, whose domain is split into $w$ mutually exclusive intervals.
$\chi^2 \text{ statistic} = \sum_{c=1}^{w}\dfrac{(n_c - nq_c)^2}{nq_c}$
$H_0: q_c = \dfrac{n_c}{n}$ for all $c = 1, \dots, w$
Reject $H_0$ if $\chi^2$ statistic $\ge \chi^2_{\alpha,\, w - g - 1}$

Tweedie Distribution
$\mathrm{E}[Y] = \mu, \quad \mathrm{Var}[Y] = \phi \cdot \mu^{d}$

Distribution        $d$
Normal              0
Poisson             1
Tweedie             (1, 2)
Gamma               2
Inverse Gaussian    3
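As a concrete illustration of solving the score equations, the sketch below fits a log-link Poisson GLM by Newton-Raphson. It is a simplified example under assumed toy data and a crude starting value; for the Poisson case the score is $\sum_i \mathbf{x}_i(y_i - \mu_i)$ and the information matrix is $\sum_i \mu_i\mathbf{x}_i\mathbf{x}_i^{T}$, matching the Poisson results later in the sheet.

```python
import numpy as np

def poisson_glm_fit(X, y, iters=25):
    """Newton-Raphson solution of the Poisson (log-link) score equations:
    score = X'(y - mu), information I = sum_i mu_i x_i x_i', with mu = exp(X b)."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    b = np.linalg.lstsq(X, np.log(y), rcond=None)[0]    # crude start (assumes y > 0 here)
    for _ in range(iters):
        mu = np.exp(X @ b)
        score = X.T @ (y - mu)
        info = X.T @ (mu[:, None] * X)
        b = b + np.linalg.solve(info, score)             # Newton step
    mu = np.exp(X @ b)
    info = X.T @ (mu[:, None] * X)
    return b, np.linalg.inv(info)                        # MLE b and estimated Var-Cov = I^(-1)

# Toy data: intercept plus one predictor, count responses
X = np.column_stack([np.ones(6), np.arange(6.0)])
y = np.array([1, 1, 3, 4, 8, 13], dtype=float)
b_hat, var_cov = poisson_glm_fit(X, y)
print(b_hat)                        # coefficient estimates
print(np.sqrt(np.diag(var_cov)))    # asymptotic standard errors
```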



Logistic and Probit Regression
• The odds of an event are the ratio of the probability that the event will occur to the probability that the event will not occur.
• The odds ratio is the ratio of the odds of an event with the presence of a characteristic to the odds of the same event without the presence of that characteristic.

Binary Response
Function Name           $h(\mu)$
Logit                   $\ln\left(\dfrac{\mu}{1-\mu}\right)$
Probit                  $\Phi^{-1}(\mu)$
Complementary log-log   $\ln(-\ln(1-\mu))$

$l(\boldsymbol{\beta}) = \sum_{i=1}^{n}\big[y_i\ln\mu_i + (1 - y_i)\ln(1 - \mu_i)\big]$
$\dfrac{\partial}{\partial\boldsymbol{\beta}}\, l(\boldsymbol{\beta}) = \sum_{i=1}^{n}\mathbf{x}_i(y_i - \mu_i)\dfrac{\mu_i'}{\mu_i(1 - \mu_i)} = \mathbf{0}$
$D = 2\sum_{i=1}^{n}\left[y_i\ln\left(\dfrac{y_i}{\hat{\mu}_i}\right) + (1 - y_i)\ln\left(\dfrac{1 - y_i}{1 - \hat{\mu}_i}\right)\right]$
Pearson residual, $e_i = \dfrac{y_i - \hat{\mu}_i}{\sqrt{\hat{\mu}_i(1 - \hat{\mu}_i)}}$
Pearson chi-square statistic $= \sum_{i=1}^{n}\dfrac{(y_i - \hat{\mu}_i)^2}{\hat{\mu}_i(1 - \hat{\mu}_i)}$

Nominal Response – Generalized Logit
Let $\pi_{i,c}$ be the probability that the $i$th observation is classified as category $c$. The reference category is $k$.
$\ln\left(\dfrac{\pi_{i,c}}{\pi_{i,k}}\right) = \mathbf{x}_i^{T}\boldsymbol{\beta}_c$
$\pi_{i,c} = \begin{cases} \dfrac{\exp(\mathbf{x}_i^{T}\boldsymbol{\beta}_c)}{1 + \sum_{u \ne k}\exp(\mathbf{x}_i^{T}\boldsymbol{\beta}_u)}, & c \ne k \\[2ex] \dfrac{1}{1 + \sum_{u \ne k}\exp(\mathbf{x}_i^{T}\boldsymbol{\beta}_u)}, & c = k \end{cases}$
$l(\boldsymbol{\beta}) = \sum_{i=1}^{n}\sum_{c=1}^{w} I(y_i = c)\ln\pi_{i,c}$

Ordinal Response – Proportional Odds Cumulative
$h(\Pi_c) = \alpha_c + \mathbf{x}_i^{T}\boldsymbol{\beta}$ where
• $\Pi_c = \pi_1 + \dots + \pi_c$
• $\mathbf{x}_i = \begin{pmatrix} x_{i,1} \\ \vdots \\ x_{i,p} \end{pmatrix}, \quad \boldsymbol{\beta} = \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}$

Poisson Count Regression
$\ln\mu = \mathbf{x}^{T}\boldsymbol{\beta}$
$l(\boldsymbol{\beta}) = \sum_{i=1}^{n}\big[y_i\ln\mu_i - \mu_i - \ln(y_i!)\big]$
$\dfrac{\partial}{\partial\boldsymbol{\beta}}\, l(\boldsymbol{\beta}) = \sum_{i=1}^{n}\mathbf{x}_i(y_i - \mu_i) = \mathbf{0}$
$\mathbf{I} = \sum_{i=1}^{n}\mu_i\mathbf{x}_i\mathbf{x}_i^{T}$
$D = 2\sum_{i=1}^{n}\left[y_i\left(\ln\left(\dfrac{y_i}{\hat{\mu}_i}\right) - 1\right) + \hat{\mu}_i\right]$
Pearson residual, $e_i = \dfrac{y_i - \hat{\mu}_i}{\sqrt{\hat{\mu}_i}}$
Pearson chi-square statistic $= \sum_{i=1}^{n}\dfrac{(y_i - \hat{\mu}_i)^2}{\hat{\mu}_i}$

Poisson Regression with Exposures Model
$\ln\mu = \ln w + \mathbf{x}^{T}\boldsymbol{\beta}$

Alternative Count Models
These models can incorporate a Poisson distribution while letting the mean of the response differ from the variance of the response:

Models              Mean < Variance    Mean > Variance
Negative binomial   Yes                No
Zero-inflated       Yes                No
Hurdle              Yes                Yes
Heterogeneity       Yes                No
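The binary-response deviance and Pearson chi-square statistic above are simple to evaluate once fitted probabilities are in hand. The sketch below is a minimal illustration (observed responses and fitted probabilities are made up; it does not perform the fitting itself).

```python
import numpy as np

def binary_fit_diagnostics(y, mu_hat):
    """Deviance and Pearson chi-square statistic for a binary-response GLM,
    given observed 0/1 responses y and fitted probabilities mu_hat in (0, 1)."""
    y, mu = np.asarray(y, float), np.asarray(mu_hat, float)
    # y*ln(y/mu) contributes ln(1/mu) when y = 1 and 0 when y = 0 (0*ln 0 = 0 convention);
    # the (1 - y) term mirrors this.
    dev_terms = np.where(y == 1, -np.log(mu), -np.log(1.0 - mu))
    deviance = 2.0 * np.sum(dev_terms)
    pearson = np.sum((y - mu) ** 2 / (mu * (1.0 - mu)))
    return deviance, pearson

y      = [1, 0, 1, 1, 0, 0]
mu_hat = [0.8, 0.3, 0.6, 0.9, 0.2, 0.4]
print(binary_fit_diagnostics(y, mu_hat))
```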





TIME SERIES

Trend Models
Notation
Subscript $t$            Index for observations
$T_t$                    Trends in time
$S_t$                    Seasonal trends
$\varepsilon_t$          Random patterns
$\hat{y}_{n+l}$          $l$-step ahead forecast
$se$                     Estimated standard error
$t_{1-q,\,\mathrm{df}}$  $q$ quantile of a $t$-distribution
$n_1$                    Training sample size
$n_2$                    Test sample size

Trends
Additive: $Y_t = T_t + S_t + \varepsilon_t$
Multiplicative: $Y_t = T_t \times S_t + \varepsilon_t$

Stationarity
Stationarity describes how something does not vary with respect to time. Control charts can be used to identify stationarity.

White Noise
$\hat{y}_{n+l} = \bar{y}$
$se_{\hat{y}_{n+l}} = s_y\sqrt{1 + 1/n}$
$100k\%$ prediction interval for $y_{n+l}$ is $\hat{y}_{n+l} \pm t_{(1-k)/2,\, n-1} \cdot se_{\hat{y}_{n+l}}$

Random Walk
$w_t = y_t - y_{t-1}$
$\hat{y}_{n+l} = y_n + l\bar{w}$
$se_{\hat{y}_{n+l}} = s_w\sqrt{l}$
Approximate 95% prediction interval for $y_{n+l}$ is $\hat{y}_{n+l} \pm 2 \cdot se_{\hat{y}_{n+l}}$

Model Comparison
$\mathrm{ME} = \dfrac{1}{n_2}\sum_{t=n_1+1}^{n} e_t$
$\mathrm{MPE} = \dfrac{100}{n_2}\sum_{t=n_1+1}^{n}\dfrac{e_t}{y_t}$
$\mathrm{MSE} = \dfrac{1}{n_2}\sum_{t=n_1+1}^{n} e_t^2$
$\mathrm{MAE} = \dfrac{1}{n_2}\sum_{t=n_1+1}^{n} |e_t|$
$\mathrm{MAPE} = \dfrac{100}{n_2}\sum_{t=n_1+1}^{n}\left|\dfrac{e_t}{y_t}\right|$

Autoregressive Models
Notation
$\rho_k$        Lag $k$ autocorrelation
$r_k$           Lag $k$ sample autocorrelation
$\sigma^2$      Variance of white noise
$s^2$           Estimate of $\sigma^2$
$b_0$           Estimate of $\beta_0$
$b_1$           Estimate of $\beta_1$
$\bar{y}_{-}$   Sample mean of first $n - 1$ observations
$\bar{y}_{+}$   Sample mean of last $n - 1$ observations

Autocorrelation
$r_k = \dfrac{\sum_{t=k+1}^{n}(y_{t-k} - \bar{y})(y_t - \bar{y})}{\sum_{t=1}^{n}(y_t - \bar{y})^2}$

Testing Autocorrelation
test statistic $= r_k/se_{r_k}$ where $se_{r_k} = 1/\sqrt{n}$
$H_0: \rho_k = 0$ against $H_1: \rho_k \ne 0$
Reject $H_0$ if $|$test statistic$| \ge z_{1-\alpha/2}$

AR(1) Model
$Y_t = \beta_0 + \beta_1 Y_{t-1} + \varepsilon_t$

Assumptions
1. $\mathrm{E}[\varepsilon_t] = 0$
2. $\mathrm{Var}[\varepsilon_t] = \sigma^2$
3. $\mathrm{Cov}[\varepsilon_{t+k}, Y_t] = 0$ for $k > 0$

• If $\beta_1 = 0$, $Y_t$ follows a white noise process.
• If $\beta_1 = 1$, $Y_t$ follows a random walk process.
• If $-1 < \beta_1 < 1$, $Y_t$ is stationary.

Properties of Stationary AR(1) Model
$\mathrm{E}[Y_t] = \dfrac{\beta_0}{1 - \beta_1}$
$\mathrm{Var}[Y_t] = \dfrac{\sigma^2}{1 - \beta_1^2}$
$\rho_k = \beta_1^{k}$

Estimation
$b_1 = \dfrac{\sum_{t=2}^{n}(y_{t-1} - \bar{y}_{-})(y_t - \bar{y}_{+})}{\sum_{t=2}^{n}(y_{t-1} - \bar{y}_{-})^2} \approx r_1$
$b_0 = \bar{y}_{+} - b_1\bar{y}_{-} \approx \bar{y}(1 - r_1)$
$s^2 = \dfrac{\sum_{t=2}^{n} e_t^2}{n - 3}$
$\widehat{\mathrm{Var}}[Y_t] = \dfrac{s^2}{1 - b_1^2}$

Smoothing and Predictions
$\hat{y}_t = b_0 + b_1 y_{t-1}, \quad 2 \le t \le n$
$\hat{y}_{n+l} = \begin{cases} b_0 + b_1 y_{n+l-1}, & l = 1 \\ b_0 + b_1\hat{y}_{n+l-1}, & l > 1 \end{cases}$
$se_{\hat{y}_{n+l}} = s\sqrt{1 + b_1^2 + b_1^4 + \dots + b_1^{2(l-1)}}$
$100k\%$ prediction interval for $y_{n+l}$ is $\hat{y}_{n+l} \pm t_{(1-k)/2,\, n-3} \cdot se_{\hat{y}_{n+l}}$

Other Time Series Models
Notation
$k$    Moving average length
$w$    Smoothing parameter
$g$    Seasonal base
$d$    No. of trigonometric functions

Smoothing with Moving Averages
$Y_t = \beta_0 + \varepsilon_t$

Smoothing
$\hat{s}_t = \dfrac{y_t + y_{t-1} + \dots + y_{t-k+1}}{k}$
$\hat{s}_t = \hat{s}_{t-1} + \dfrac{y_t - y_{t-k}}{k}, \quad k = 1, 2, \dots$

Predictions
$b_0 = \hat{s}_n$
$\hat{y}_{n+l} = b_0$

Double Smoothing with Moving Averages
$Y_t = \beta_0 + \beta_1 t + \varepsilon_t$

Smoothing
$\hat{s}_t^{(2)} = \dfrac{\hat{s}_t + \hat{s}_{t-1} + \dots + \hat{s}_{t-k+1}}{k}$
$\hat{s}_t^{(2)} = \hat{s}_{t-1}^{(2)} + \dfrac{\hat{s}_t - \hat{s}_{t-k}}{k}, \quad k = 1, 2, \dots$

Predictions
$b_0 = \hat{s}_n$
$b_1 = \dfrac{2\big(\hat{s}_n - \hat{s}_n^{(2)}\big)}{k - 1}$
$\hat{y}_{n+l} = b_0 + b_1 \cdot l$
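The lag-$k$ sample autocorrelation and the AR(1) shortcut estimates are easy to compute directly. The sketch below is an illustration on a made-up series; it uses the approximations $b_1 \approx r_1$ and $b_0 \approx \bar{y}(1 - r_1)$ from the Estimation block rather than the exact conditional least squares fit.

```python
import numpy as np

def lag_autocorr(y, k):
    """Lag-k sample autocorrelation r_k (all deviations taken from the overall mean)."""
    y = np.asarray(y, float)
    d = y - y.mean()
    return np.sum(d[:-k] * d[k:]) / np.sum(d ** 2)

def ar1_estimates(y):
    """Approximate AR(1) fit using b1 ~ r1 and b0 ~ ybar * (1 - r1)."""
    y = np.asarray(y, float)
    r1 = lag_autocorr(y, 1)
    return y.mean() * (1 - r1), r1        # (b0, b1)

y = [5.0, 5.6, 5.2, 5.9, 6.4, 6.1, 6.8, 7.0]
print(lag_autocorr(y, 1))
print(ar1_estimates(y))
```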



Exponential Smoothing
$Y_t = \beta_0 + \varepsilon_t$

Smoothing
$\hat{s}_t = (1 - w)(y_t + w y_{t-1} + \dots + w^{t} y_0)$
$\hat{s}_t = (1 - w)y_t + w\hat{s}_{t-1}, \quad 0 \le w < 1$
The value of $w$ is determined by minimizing $SS(w) = \sum_{t=1}^{n}(y_t - \hat{s}_{t-1})^2$.

Predictions
$b_0 = \hat{s}_n$
$\hat{y}_{n+l} = b_0$

Double Exponential Smoothing
$Y_t = \beta_0 + \beta_1 t + \varepsilon_t$

Smoothing
$\hat{s}_t^{(2)} = (1 - w)(\hat{s}_t + w\hat{s}_{t-1} + \dots + w^{t}\hat{s}_0)$
$\hat{s}_t^{(2)} = (1 - w)\hat{s}_t + w\hat{s}_{t-1}^{(2)}, \quad 0 \le w < 1$

Predictions
$b_0 = 2\hat{s}_n - \hat{s}_n^{(2)}$
$b_1 = \dfrac{1 - w}{w}\big(\hat{s}_n - \hat{s}_n^{(2)}\big)$
$\hat{y}_{n+l} = b_0 + b_1 \cdot l$

Key Ideas for Smoothing
• It is only appropriate for time series data without a linear trend.
• It is related to weighted least squares.
• A double smoothing procedure can be used to forecast time series data with a linear trend.
• Holt-Winter double exponential smoothing is a generalization of the double exponential smoothing.

Seasonal Time Series Models
Fixed Seasonal Effects – Trigonometric Functions
$S_t = \sum_{i=1}^{d}\big[\beta_{1,i}\sin(f_i t) + \beta_{2,i}\cos(f_i t)\big]$
• $f_i = 2\pi i/g$
• $d \le g/2$

Seasonal Autoregressive Models, SAR(p)
$Y_t = \beta_0 + \beta_1 Y_{t-g} + \dots + \beta_p Y_{t-pg} + \varepsilon_t$

Holt-Winter Seasonal Additive Model
$Y_t = \beta_0 + \beta_1 t + S_t + \varepsilon_t$
• $S_t = S_{t-g}$
• $\sum_{t=1}^{g} S_t = 0$

Unit Root Test
• A unit root test is used to evaluate the fit of a random walk model.
• A random walk model is a good fit if the time series possesses a unit root.
• The Dickey-Fuller test and augmented Dickey-Fuller test are two examples of unit root tests.

Volatility Models
ARCH($p$) Model
$\sigma_t^2 = \theta + \gamma_1\varepsilon_{t-1}^2 + \dots + \gamma_p\varepsilon_{t-p}^2$

GARCH($p, q$) Model
$\sigma_t^2 = \theta + \gamma_1\varepsilon_{t-1}^2 + \dots + \gamma_p\varepsilon_{t-p}^2 + \delta_1\sigma_{t-1}^2 + \dots + \delta_q\sigma_{t-q}^2$
$\mathrm{Var}[\varepsilon_t] = \dfrac{\theta}{1 - \sum_{j=1}^{p}\gamma_j - \sum_{j=1}^{q}\delta_j}$

Assumptions
• $\theta > 0$
• $\gamma_j \ge 0$
• $\delta_j \ge 0$
• $\sum_{j=1}^{p}\gamma_j + \sum_{j=1}^{q}\delta_j < 1$
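The single exponential smoothing recursion and the one-step-ahead criterion $SS(w)$ are short to implement. The sketch below is an illustration only; the initialization $\hat{s}_0 = y_0$ and the toy data are assumptions the sheet does not specify.

```python
import numpy as np

def exponential_smooth(y, w):
    """One-pass exponential smoothing: s_t = (1 - w) y_t + w s_{t-1}, with s_0 = y_0 assumed."""
    y = np.asarray(y, float)
    s = np.empty_like(y)
    s[0] = y[0]
    for t in range(1, len(y)):
        s[t] = (1 - w) * y[t] + w * s[t - 1]
    return s

def ss_w(y, w):
    """One-step-ahead sum of squares SS(w) = sum_t (y_t - s_{t-1})^2, used to choose w."""
    s = exponential_smooth(y, w)
    return np.sum((np.asarray(y, float)[1:] - s[:-1]) ** 2)

y = [10.2, 10.8, 10.5, 11.1, 10.9, 11.4, 11.2]
for w in (0.2, 0.5, 0.8):
    print(w, round(ss_w(y, w), 3))   # grid search over the smoothing parameter
```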



DECISION TREES

Regression and Classification Trees
Notation
$R$          Region of predictor space
$n_m$        No. of observations in node $m$
$n_{m,c}$    No. of category $c$ observations in node $m$
$I$          Impurity
$E$          Classification error rate
$G$          Gini index
$D$          Cross entropy
$T$          Subtree
$|T|$        No. of terminal nodes in $T$
$\lambda$    Tuning parameter

Algorithm
1. Construct a large tree with $g$ terminal nodes using recursive binary splitting.
2. Obtain a sequence of best subtrees, as a function of $\lambda$, using cost complexity pruning.
3. Choose $\lambda$ by applying $k$-fold cross validation. Select the $\lambda$ that results in the lowest cross-validation error.
4. The best subtree is the subtree created in step 2 with the selected $\lambda$ value.

Recursive Binary Splitting
Regression: Minimize $\sum_{m=1}^{g}\sum_{i:\,\mathbf{x}_i \in R_m}\big(y_i - \bar{y}_{R_m}\big)^2$
Classification: Minimize $\dfrac{1}{n}\sum_{m=1}^{g} n_m \cdot I_m$

More Under Classification:
$\hat{p}_{m,c} = n_{m,c}/n_m$
$E_m = 1 - \max_c \hat{p}_{m,c}$
$G_m = \sum_{c=1}^{w}\hat{p}_{m,c}\big(1 - \hat{p}_{m,c}\big)$
$D_m = -\sum_{c=1}^{w}\hat{p}_{m,c}\ln\hat{p}_{m,c}$
deviance $= -2\sum_{m=1}^{g}\sum_{c=1}^{w} n_{m,c}\ln\hat{p}_{m,c}$
residual mean deviance $= \dfrac{\text{deviance}}{n - g}$

Cost Complexity Pruning
Regression: Minimize $\sum_{m=1}^{|T|}\sum_{i:\,\mathbf{x}_i \in R_m}\big(y_i - \bar{y}_{R_m}\big)^2 + \lambda|T|$
Classification: Minimize $\dfrac{1}{n}\sum_{m=1}^{|T|} n_m \cdot I_m + \lambda|T|$

Key Ideas
• Terminal nodes or leaves represent the partitions of the predictor space.
• Internal nodes are points along the tree where splits occur.
• Terminal nodes do not have child nodes, but internal nodes do.
• Branches are lines that connect any two nodes.
• A decision tree with only one internal node is called a stump.

Advantages of Trees
• Easy to interpret and explain
• Can be presented visually
• Manage categorical variables without the need of dummy variables
• Mimic human decision-making

Disadvantages of Trees
• Not robust
• Do not have the same degree of predictive accuracy as other statistical methods

Multiple Trees

Bagging
1. Create $b$ bootstrap samples from the original training dataset.
2. Construct a decision tree for each bootstrap sample using recursive binary splitting.
3. Predict the response of a new observation by averaging the predictions (regression trees) or by using the most frequent category (classification trees) across all $b$ trees.

Properties
• Increasing $b$ does not cause overfitting.
• Bagging reduces variance.
• Out-of-bag error is a valid estimate of test error.

Random Forests
1. Create $b$ bootstrap samples from the original training dataset.
2. Construct a decision tree for each bootstrap sample using recursive binary splitting. At each split, a random subset of $k$ variables is considered.
3. Predict the response of a new observation by averaging the predictions (regression trees) or by using the most frequent category (classification trees) across all $b$ trees.

Properties
• Bagging is a special case of random forests.
• Increasing $b$ does not cause overfitting.
• Decreasing $k$ reduces the correlation between predictions.

Boosting
Let $z_1$ be the actual response variable, $y$.
1. For $k = 1, 2, \dots, b$:
   • Use recursive binary splitting to fit a tree with $d$ splits to the data with $z_k$ as the response.
   • Update $z_k$ by subtracting $\lambda \cdot \hat{f}_k(\mathbf{x})$, i.e. let $z_{k+1} = z_k - \lambda \cdot \hat{f}_k(\mathbf{x})$.
2. Calculate the boosted model prediction as $\hat{f}(\mathbf{x}) = \sum_{k=1}^{b}\lambda \cdot \hat{f}_k(\mathbf{x})$.

Properties
• Increasing $b$ can cause overfitting.
• Boosting reduces bias.
• $d$ controls complexity of the boosted model.
• $\lambda$ controls the rate at which boosting learns.
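The node impurity measures used for classification splits are simple functions of the within-node category proportions. The sketch below (category counts are illustrative) evaluates the classification error rate, Gini index, and cross entropy for a single node.

```python
import numpy as np

def node_impurities(counts):
    """Classification error E_m, Gini index G_m, and cross entropy D_m for one node,
    given the category counts n_{m,c} in that node."""
    counts = np.asarray(counts, float)
    p = counts / counts.sum()              # p_hat_{m,c}
    error = 1.0 - p.max()
    gini = np.sum(p * (1.0 - p))
    p_pos = p[p > 0]                       # convention: 0 * ln 0 = 0
    entropy = -np.sum(p_pos * np.log(p_pos))
    return error, gini, entropy

print(node_impurities([8, 2]))   # fairly pure node: low impurity on all three measures
print(node_impurities([5, 5]))   # maximally impure two-category node
```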



UNSUPERVISED LEARNING

Principal Components Analysis
Notation
$z, Z$           Principal component (score)
Subscript $m$    Index for principal components
$\phi$           Principal component loading
$x, X$           Centered explanatory variable

Principal Components
$z_m = \sum_{j=1}^{p}\phi_{j,m}x_j, \qquad z_{i,m} = \sum_{j=1}^{p}\phi_{j,m}x_{i,j}$
• $\sum_{j=1}^{p}\phi_{j,m}^2 = 1$
• $\sum_{j=1}^{p}\phi_{j,m}\cdot\phi_{j,u} = 0, \quad m \ne u$

Proportion of Variance Explained (PVE)
$\sum_{j=1}^{p} s_{x_j}^2 = \sum_{j=1}^{p}\dfrac{1}{n-1}\sum_{i=1}^{n} x_{i,j}^2$
$s_{z_m}^2 = \dfrac{1}{n-1}\sum_{i=1}^{n} z_{i,m}^2$
$\mathrm{PVE} = \dfrac{s_{z_m}^2}{\sum_{j=1}^{p} s_{x_j}^2}$

Key Ideas
• The variance explained by each subsequent principal component is always less than the variance explained by the previous principal component.
• All principal components are uncorrelated with one another.
• A dataset has $\min(n - 1, p)$ distinct principal components.
• The first $k$ principal component scores and loadings approximate the original dataset, $x_{i,j} \approx \sum_{m=1}^{k} z_{i,m}\phi_{j,m}$.

Principal Components Regression
$Y = \theta_0 + \theta_1 z_1 + \dots + \theta_k z_k + \varepsilon$
• If $k = p$, then $\beta_j = \sum_{m=1}^{k}\theta_m\phi_{j,m}$.

Cluster Analysis
Notation
$C$       Cluster containing indices
$W(C)$    Within-cluster variation of cluster
$|C|$     No. of observations in cluster

Euclidean Distance $= \sqrt{\sum_{j=1}^{p}\big(x_{i,j} - x_{m,j}\big)^2}$

$k$-Means Clustering
1. Randomly assign a cluster to each observation. This serves as the initial cluster assignments.
2. Calculate the centroid of each cluster.
3. For each observation, identify the closest centroid and reassign to that cluster.
4. Repeat steps 2 and 3 until the cluster assignments stop changing.

$W(C_u) = \dfrac{1}{|C_u|}\sum_{i,m \in C_u}\sum_{j=1}^{p}\big(x_{i,j} - x_{m,j}\big)^2 = 2\sum_{i \in C_u}\sum_{j=1}^{p}\big(x_{i,j} - \bar{x}_{u,j}\big)^2$

Hierarchical Clustering
1. Select the dissimilarity measure and linkage to be used. Treat each observation as its own cluster.
2. For $k = n, n-1, \dots, 2$:
   • Compute the inter-cluster dissimilarity between all $k$ clusters.
   • Examine all $\binom{k}{2}$ pairwise dissimilarities. The two clusters with the lowest inter-cluster dissimilarity are fused. The dissimilarity indicates the height in the dendrogram at which these two clusters join.

Linkage     Inter-cluster dissimilarity
Complete    The largest dissimilarity
Single      The smallest dissimilarity
Average     The arithmetic mean
Centroid    The dissimilarity between the cluster centroids

Key Ideas
• For $k$-means clustering, the algorithm needs to be repeated for each $k$.
• For hierarchical clustering, the algorithm only needs to be performed once for any number of clusters.
• The result of clustering depends on many parameters, such as:
  o Choice of $k$ in $k$-means clustering
  o Choice of number of clusters, linkage, and dissimilarity measure in hierarchical clustering
  o Choice to standardize variables
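Principal component loadings, scores, and PVE can be obtained from the singular value decomposition of the centered data matrix. The sketch below is an illustration (function name and data are made up); because an orthogonal rotation preserves total variance, dividing each score variance by their sum equals $s_{z_m}^2 / \sum_j s_{x_j}^2$.

```python
import numpy as np

def pca_pve(X):
    """Principal component loadings, scores, and proportion of variance explained,
    computed from the SVD of the centered data matrix."""
    X = np.asarray(X, float)
    Xc = X - X.mean(axis=0)                   # center each explanatory variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = Vt.T                           # columns are the loading vectors phi_{.,m}
    scores = Xc @ loadings                    # z_{i,m}
    var_by_pc = scores.var(axis=0, ddof=1)    # s^2_{z_m}
    pve = var_by_pc / var_by_pc.sum()         # proportion of variance explained per PC
    return loadings, scores, pve

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])
loadings, scores, pve = pca_pve(X)
print(pve)   # the first component explains most of the variance
```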

www.coachingactuaries.com
Copyright © 2022 Coaching Actuaries. All Rights Reserved. Personal copies permitted. Resale or distribution is prohibited.