SRM Formula Sheet-2
updated 08/04/22
Training: observations used to train/obtain $\hat{f}$.
Test: observations not used to train/obtain $\hat{f}$.
Key Ideas
• The disadvantage of parametric methods is the danger of choosing a form for $f$ that is not close to the truth.
• The disadvantage of non-parametric methods is the need for an abundance of observations.
• Flexibility and interpretability are typically at odds.
• As flexibility increases, the training MSE (or error rate) decreases, but the test MSE (or error rate) follows a U-shaped pattern.
• Low flexibility leads to a method with low variance and high bias; high flexibility leads to a method with high variance and low bias.

Box Plots
Display the smallest non-outlier, 1st quartile, median, 3rd quartile, largest non-outlier, and any outliers; each of the four sections between consecutive boundaries contains 25% of the observations.
qq Plots
Plots sample quantiles against theoretical quantiles to determine
whether the sample and theoretical distributions have
similar shapes.
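A minimal sketch (not from the sheet; the data and seed are assumptions) of drawing exactly this comparison with scipy:

```python
# Minimal sketch (assumed example): compare sample quantiles to
# theoretical normal quantiles with a QQ plot.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
sample = rng.gamma(shape=2.0, size=200)   # skewed data for contrast

# probplot orders the sample and plots it against normal quantiles;
# systematic curvature signals the two distributions differ in shape.
stats.probplot(sample, dist="norm", plot=plt)
plt.show()
```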
Validation Set
• Randomly splits all available observations into two groups: the training set and the validation set.
• Only the observations in the training set are used to obtain the fitted model, and those in the validation set are used to estimate the test MSE.

Confidence and Prediction Intervals
$\beta_j$: $b_j \pm t_{1-\alpha/2,\,n-p-1} \cdot se_{b_j}$
$\mathrm{E}[Y]$: $\hat{y} \pm t_{1-\alpha/2,\,n-p-1} \cdot se_{\hat{y}}$
$Y_{n+1}$: $\hat{y}_{n+1} \pm t_{1-\alpha/2,\,n-p-1} \cdot se_{\hat{y}_{n+1}}$

Multicollinearity
There are several ways of handling multicollinearity; it is even possible to accept it, such as when there is a suppressor variable. On the other hand, it can be eliminated by using a set of orthogonal predictors.

Linear Model Assumptions

Leverage
$h_i = \mathbf{x}_i^T (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{x}_i = \dfrac{se_{\hat{y}_i}^2}{\text{MSE}}$

Model Selection
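A short numpy sketch (an assumed example; the data, seed, and dimensions are made up) tying together the coefficient interval and leverage formulas above:

```python
# Minimal sketch (assumed example): coefficient CI and leverage for an
# OLS fit, following the interval and leverage formulas above.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept + p predictors
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + rng.normal(scale=0.8, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y                          # fitted coefficients
resid = y - X @ b
mse = resid @ resid / (n - p - 1)              # s^2 with n - p - 1 df
se_b = np.sqrt(mse * np.diag(XtX_inv))         # se of each coefficient

t_crit = stats.t.ppf(0.975, df=n - p - 1)      # 95% CI, alpha = 0.05
ci = np.column_stack([b - t_crit * se_b, b + t_crit * se_b])

# Leverage: diagonal of the hat matrix, h_i = x_i' (X'X)^{-1} x_i
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
print(ci, h[:5])
```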
Notation
• A standardized variable is the result of first centering a variable, then scaling it.

Ridge Regression
Coefficients are estimated by minimizing the SSE while constrained by $\sum_{j=1}^{p} b_j^2 \le a$ or, equivalently, by minimizing the expression $\text{SSE} + \lambda \sum_{j=1}^{p} b_j^2$.

Lasso Regression
Coefficients are estimated by minimizing the SSE while constrained by $\sum_{j=1}^{p} |b_j| \le a$ or, equivalently, by minimizing the expression $\text{SSE} + \lambda \sum_{j=1}^{p} |b_j|$.
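A brief scikit-learn sketch (an assumed example; the sheet itself names no software) contrasting the two penalties, with `alpha` playing the role of $\lambda$:

```python
# Minimal sketch (assumed example): ridge and lasso fits, where `alpha`
# plays the role of the penalty parameter lambda above.
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, 0.0, -2.0, 0.0]) + rng.normal(size=100)

Xs = StandardScaler().fit_transform(X)   # standardize so the penalty treats predictors equally

ridge = Ridge(alpha=1.0).fit(Xs, y)      # shrinks all coefficients toward 0
lasso = Lasso(alpha=0.1).fit(Xs, y)      # can set some coefficients exactly to 0
print(ridge.coef_, lasso.coef_)
```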
with the "previous predictors" explained category among the 𝑘𝑘 observations; for
Lasso Regression regression, 𝑦𝑦+ is the average of the
by the previous direction.
Coefficients are estimated by minimizing response among the 𝑘𝑘 observations.
• The directions 𝑧𝑧# , … , 𝑧𝑧L are used as
the SSE while constrained by ∑$;(#a𝑏𝑏; a ≤ 𝑎𝑎 𝑘𝑘 is inversely related to flexibility.
predictors in a multiple linear regression.
or equivalently, by minimizing the The number of directions, 𝑔𝑔, is a measure
expression SSE + 𝜆𝜆 ∑$;(#a𝑏𝑏; a.
of flexibility.
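A plain-numpy sketch (an assumed example) of the three KNN steps above:

```python
# Minimal sketch (assumed example): KNN prediction following the three
# steps above, using plain numpy rather than a library implementation.
import numpy as np

def knn_predict(X_train, y_train, x_new, k, classification=True):
    # Steps 1-2: distances from the "center of the neighborhood" (x_new)
    # to every training observation, keeping the k nearest.
    dist = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dist)[:k]
    neighbors = y_train[nearest]
    if classification:
        # Step 3 (classification): most frequent category among the k.
        values, counts = np.unique(neighbors, return_counts=True)
        return values[np.argmax(counts)]
    # Step 3 (regression): average response among the k.
    return neighbors.mean()

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
print(knn_predict(X, y, np.array([0.2, -0.1]), k=5))
```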
Key Results for Distributions in the Exponential Family
Each distribution below is written in the form $f(y) = \exp\left[\frac{y\theta - b(\theta)}{\phi} + c(y, \phi)\right]$; the entries are the pdf/pmf, the canonical parameter $\theta$, the dispersion $\phi$, the cumulant function $b(\theta)$, and the canonical link $g(\mu)$.

Normal: $f(y) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(y-\mu)^2}{2\sigma^2}\right]$; $\theta = \mu$; $\phi = \sigma^2$; $b(\theta) = \frac{\theta^2}{2}$; $g(\mu) = \mu$

Binomial (fixed $n$): $f(y) = \binom{n}{y} \pi^y (1-\pi)^{n-y}$; $\theta = \ln\frac{\pi}{1-\pi}$; $\phi = 1$; $b(\theta) = n \ln\left(1 + e^\theta\right)$; $g(\mu) = \ln\frac{\mu}{n-\mu}$

Poisson: $f(y) = \frac{\lambda^y \exp(-\lambda)}{y!}$; $\theta = \ln\lambda$; $\phi = 1$; $b(\theta) = e^\theta$; $g(\mu) = \ln\mu$

Negative Binomial (fixed $r$): $f(y) = \frac{\Gamma(y+r)}{y!\,\Gamma(r)} p^r (1-p)^y$; $\theta = \ln(1-p)$; $\phi = 1$; $b(\theta) = -r\ln\left(1 - e^\theta\right)$; $g(\mu) = \ln\frac{\mu}{r+\mu}$

Gamma: $f(y) = \frac{\gamma^\alpha}{\Gamma(\alpha)} y^{\alpha-1} \exp(-y\gamma)$; $\theta = -\frac{\gamma}{\alpha}$; $\phi = \frac{1}{\alpha}$; $b(\theta) = -\ln(-\theta)$; $g(\mu) = -\frac{1}{\mu}$

Inverse Gaussian: $f(y) = \sqrt{\frac{\lambda}{2\pi y^3}} \exp\left[-\frac{\lambda(y-\mu)^2}{2\mu^2 y}\right]$; $\theta = -\frac{1}{2\mu^2}$; $\phi = \frac{1}{\lambda}$; $b(\theta) = -\sqrt{-2\theta}$; $g(\mu) = -\frac{1}{2\mu^2}$
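The $\theta$, $\phi$, and $b(\theta)$ entries obey the standard exponential-family identities $\mathrm{E}[Y] = b'(\theta)$ and $\mathrm{Var}(Y) = \phi\, b''(\theta)$; a quick numerical check (a sketch, not from the sheet) for the Poisson row:

```python
# Minimal sketch (assumed example): numerically verify E[Y] = b'(theta)
# and Var(Y) = phi * b''(theta) for the Poisson row of the table above.
import math

lam = 3.0
theta = math.log(lam)      # canonical parameter: theta = ln(lambda)
phi = 1.0                  # dispersion
b = math.exp               # b(theta) = e^theta for the Poisson

eps = 1e-5                 # central finite differences for b' and b''
b1 = (b(theta + eps) - b(theta - eps)) / (2 * eps)
b2 = (b(theta + eps) - 2 * b(theta) + b(theta - eps)) / eps**2

print(b1, lam)             # E[Y]   = b'(theta)        = lambda
print(phi * b2, lam)       # Var(Y) = phi * b''(theta) = lambda
```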
Smoothing
• Stationarity describes a process whose behavior does not vary with respect to time. Control charts can be used to identify stationarity.
• To test $H_0: \rho_k = 0$ against $H_1: \rho_k \ne 0$, reject $H_0$ if $|\text{test statistic}| \ge z_{1-\alpha/2}$, where $se_{r_k} = 1/\sqrt{n}$.

$k$-term moving average and its recursive update:
$\hat{s}_t = \dfrac{y_t + y_{t-1} + \cdots + y_{t-k+1}}{k}$
$\hat{s}_t = \hat{s}_{t-1} + \dfrac{y_t - y_{t-k}}{k}, \quad k = 1, 2, \ldots$
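A sketch (an assumed example; taking the test statistic to be $r_k / se_{r_k}$, the usual form consistent with the rejection rule above) of the lag-$k$ autocorrelation test, together with a $k$-term moving average:

```python
# Minimal sketch (assumed example): lag-k sample autocorrelation with the
# z-test above (se = 1/sqrt(n)), plus a k-term moving average.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
y = rng.normal(size=200)                     # white noise, so rho_k ~ 0

def autocorr_test(y, k, alpha=0.05):
    n = len(y)
    ybar = y.mean()
    r_k = np.sum((y[k:] - ybar) * (y[:-k] - ybar)) / np.sum((y - ybar) ** 2)
    z = r_k / (1 / np.sqrt(n))               # test statistic r_k / se
    reject = abs(z) >= stats.norm.ppf(1 - alpha / 2)
    return r_k, z, reject

def moving_average(y, k):
    # s_t = (y_t + ... + y_{t-k+1}) / k for t = k-1, ..., n-1 (0-indexed)
    return np.convolve(y, np.ones(k) / k, mode="valid")

print(autocorr_test(y, k=1))
print(moving_average(y, k=5)[:3])
```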
Predictions
Single smoothing: $b_0 = \hat{s}_n$, and $\hat{y}_{n+l} = b_0$.
Double smoothing: $b_0 = 2\hat{s}_n - \hat{s}_n^{(2)}$, $b_1 = \dfrac{1-w}{w}\left(\hat{s}_n - \hat{s}_n^{(2)}\right)$, and $\hat{y}_{n+l} = b_0 + b_1 \cdot l$.
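A sketch (an assumed example) of forecasting with double smoothing; the recursion $\hat{s}_t = (1-w)y_t + w\hat{s}_{t-1}$ is the standard one and is assumed here, since only the prediction formulas appear above:

```python
# Minimal sketch (assumed example): forecasts from double exponential
# smoothing with discount weight w, using the prediction formulas above.
import numpy as np

def double_smooth_forecast(y, w, l):
    s1 = y[0]                      # first smoothed series, seeded at y_0
    s2 = y[0]                      # smoothed series of the smoothed series
    for t in range(1, len(y)):
        s1 = (1 - w) * y[t] + w * s1
        s2 = (1 - w) * s1 + w * s2
    b0 = 2 * s1 - s2               # b0 = 2*s_n - s_n^(2)
    b1 = (1 - w) / w * (s1 - s2)   # b1 = ((1-w)/w) * (s_n - s_n^(2))
    return b0 + b1 * l             # y_hat_{n+l} = b0 + b1 * l

rng = np.random.default_rng(5)
trend = 0.5 * np.arange(100) + rng.normal(size=100)  # linear trend + noise
print(double_smooth_forecast(trend, w=0.8, l=3))
```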
Key Ideas for Smoothing
• Single smoothing is only appropriate for time series data without a linear trend.
• It is related to weighted least squares.
• A double smoothing procedure can be used to forecast time series data with a linear trend.
• Holt-Winters double exponential smoothing is a generalization of double exponential smoothing.

White Noise

AR(1) Model

Seasonal Autoregressive Model
$Y_t = \beta_0 + \beta_1 Y_{t-g} + \cdots + \beta_p Y_{t-pg} + \varepsilon_t$

Volatility Models
ARCH($p$) Model
Assumptions (for the general GARCH($p,q$) form; ARCH has no $\delta_j$ terms)
• $\theta > 0$
• $\gamma_j \ge 0$
• $\delta_j \ge 0$
• $\sum_{j=1}^{p} \gamma_j + \sum_{j=1}^{q} \delta_j < 1$
Regression:
Minimize $\sum_{m=1}^{g} \sum_{i:\, \mathbf{x}_i \in R_m} \left(y_i - \hat{y}_{R_m}\right)^2$

Classification:
Minimize $\dfrac{1}{n} \sum_{m=1}^{g} n_m \cdot I_m$

More Under Classification:
$\hat{p}_{m,c} = n_{m,c} / n_m$
$E_m = 1 - \max_c \hat{p}_{m,c}$
$G_m = \sum_{c=1}^{C} \hat{p}_{m,c} \left(1 - \hat{p}_{m,c}\right)$
$D_m = -\sum_{c=1}^{C} \hat{p}_{m,c} \ln \hat{p}_{m,c}$
$\text{deviance} = -2 \sum_{m=1}^{g} \sum_{c=1}^{C} n_{m,c} \ln \hat{p}_{m,c}$
$\text{residual mean deviance} = \dfrac{\text{deviance}}{n - g}$
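A sketch (an assumed example; the counts are made up) computing the node impurity measures above from category counts:

```python
# Minimal sketch (assumed example): the node impurity measures above for
# a single node m, computed from category counts n_{m,c}.
import numpy as np

counts = np.array([10, 6, 4])            # n_{m,c} for C = 3 categories
p = counts / counts.sum()                # p_hat_{m,c} = n_{m,c} / n_m

error_rate = 1 - p.max()                 # E_m = 1 - max_c p_hat
gini = np.sum(p * (1 - p))               # G_m
cross_entropy = -np.sum(p * np.log(p))   # D_m (assumes all counts > 0)

print(error_rate, gini, cross_entropy)
```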
Multiple Trees
Bagging
1. Create $b$ bootstrap samples from the original training dataset.
2. Construct a decision tree for each bootstrap sample using recursive binary splitting.
3. Predict the response of a new observation by averaging the predictions (regression trees) or by using the most frequent category (classification trees) across all $b$ trees.
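A scikit-learn sketch (an assumed example; the data and seed are made up) of steps 1-3 for regression trees:

```python
# Minimal sketch (assumed example): bagging for regression trees,
# following steps 1-3 above.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(6)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

b = 100
trees = []
for _ in range(b):
    idx = rng.integers(0, len(X), size=len(X))   # step 1: bootstrap sample
    tree = DecisionTreeRegressor()               # step 2: recursive binary splitting
    trees.append(tree.fit(X[idx], y[idx]))

x_new = np.array([[0.5, -1.0]])
# Step 3 (regression): average the b trees' predictions.
y_hat = np.mean([t.predict(x_new)[0] for t in trees])
print(y_hat)
```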
Properties
• Increasing $b$ does not cause overfitting.
• Bagging reduces variance.
• Out-of-bag error is a valid estimate of test error.

Boosting
Properties
• …accuracy as other statistical methods.
• Increasing $b$ can cause overfitting.
• Boosting reduces bias.
• $d$ controls the complexity of the boosted model.
• $\lambda$ controls the rate at which boosting learns.
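A sketch (an assumed example; the shrunken-residual-fitting scheme is the standard gradient-boosting recipe, not taken from the sheet) illustrating the roles of $b$, $d$, and $\lambda$ from the Properties above:

```python
# Minimal sketch (assumed example): boosting for regression with shallow
# trees, showing b (number of trees), d (interaction depth), and lambda
# (the learning rate) from the Properties above.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(7)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

b, d, lam = 200, 2, 0.05
pred = np.zeros(len(y))
trees = []
for _ in range(b):
    resid = y - pred                                  # fit each tree to current residuals
    tree = DecisionTreeRegressor(max_depth=d).fit(X, resid)
    pred += lam * tree.predict(X)                     # shrink each tree's contribution by lambda
    trees.append(tree)

def boosted_predict(x_new):
    return lam * sum(t.predict(x_new) for t in trees)

print(boosted_predict(np.array([[0.5, -1.0]]))[0])
```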