CH 2 Quantitative Analysis
By AnalystPrep
12 - Fundamentals of Probability
13 - Random Variables
14 - Common Univariate Random Variables
15 - Multivariate Random Variables
16 - Sample Moments
17 - Hypothesis Testing
18 - Linear Regression
19 - Regression with Multiple Explanatory Variables
20 - Regression Diagnostics
21 - Stationary Time Series
22 - Nonstationary Time Series
23 - Measuring Return, Volatility, and Correlation
24 - Simulation and Bootstrapping
25 - Machine-Learning Methods
26 - Machine Learning and Prediction
Reading 12: Fundamentals of Probability
Probability quantifies the likelihood that some event will occur.

Sample Space (Ω)

The sample space is the set of all possible outcomes of an experiment; the possible outcomes are dependent on the problem being studied. For example, when modeling returns from a portfolio, the sample space is a set of real numbers. As another example, assume we want to model defaults in loan payment; we know that there can only be two outcomes: either the firm defaults or it doesn't. As such, the sample space is Ω = {Default, No Default}. To give yet another example, the sample space when a fair six-sided die is tossed is made up of six different outcomes:
Ω = {1, 2, 3, 4, 5, 6}
Events (ω)
An event is a set of outcomes (which may contain more than one element). For example, suppose
we tossed a die. A “6” would constitute an event. If we toss two dice simultaneously, a {6, 2}
would constitute an event. An event that contains only one outcome is termed an elementary
event.
The event space refers to the set of all possible outcomes and combinations of outcomes. For
example, consider a scenario where we toss two fair coins simultaneously. The event space would consist of the elementary outcomes {HH, HT, TH, TT}, together with combinations of these outcomes such as "at least one head" or "exactly one tail."
Note: If the coins are fair, the probability of a head, P(H), equals the probability of a tail, P(T).
Probability
The probability of an event refers to the likelihood of that particular event occurring. For
example, the probability of a Head when we toss a coin is 0.5, and so is the probability of a Tail.
According to the frequentist interpretation, the probability of an event is the proportion of times the event occurs when a set of independent experiments is repeated a large number of times. It is called the frequentist interpretation because it defines an event's probability as the limit of its relative frequency in many trials. It is just a conceptual explanation; in finance, we deal with actual, non-repeatable events.
Two events, A and B, are said to be mutually exclusive if the occurrence of A rules out the
occurrence of B, and vice versa. For example, a car cannot turn left and turn right at the same
time.
Mutually exclusive events are such that one event precludes the occurrence of all the other events. Thus, if you roll a die and a 4 comes up, that particular event precludes all the other events, i.e., 1, 2, 3, 5, and 6. In other words, rolling a 1 and rolling a 5 are mutually exclusive events.
Furthermore, there is no way a single investment can have more than one arithmetic
mean return. Thus, arithmetic returns of, say, 20% and 17% constitute mutually
exclusive events.
Independent Events
Two events, A and B, are independent if the fact that A occurs does not affect the probability of B occurring. When two events are independent, the probability of one event happening does not depend on whether the other event occurs, and both events can happen at the same time. For example, we can define A as the likelihood that it rains on March 15 in New York and B as the probability that it rains in Frankfurt on March 15. In
this instance, both events can happen simultaneously or not.
Another example would be defining event A as getting tails on the first coin toss and B as getting tails on the second coin toss. Landing on tails on the first toss does not affect the probability of landing on tails on the second toss.
Intersection
The intersection of events, say A and B, is the set of outcomes occurring in both A and B, denoted A ∩ B. If the events A1, A2, …, An are independent, then:

$$P(A_1 \cap A_2 \cap \dots \cap A_n) = P(A_1) \times P(A_2) \times \dots \times P(A_n)$$

If A and B are mutually exclusive, then:

$$P(A \cap B) = P(\text{A and B}) = 0$$

This is because A's occurrence rules out B's occurrence. Remember that a car cannot turn left and turn right at the same time.
Union
The union of events, say A and B, is the set of outcomes occurring in at least one of the two sets, denoted A ∪ B.

To determine the likelihood of either of two mutually exclusive events occurring, we sum their individual probabilities:

$$P(A \cup B) = P(A) + P(B)$$

Given two events A and B that are not mutually exclusive, the probability of their union is:

$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$

Another important concept under probability is the complement of a set, denoted by Aᶜ (where A can be any event), which is the set of outcomes that are not in A. Since an event and its complement cover the entire sample space, P(Aᶜ) = 1 − P(A). For example, if A is the event of rolling a 6 on a die, then Aᶜ is the event of rolling 1, 2, 3, 4, or 5.
Conditional Probability
Until now, we've only looked at unconditional probabilities. An unconditional probability (also
known as a marginal probability) is simply the probability that an event occurs without
considering any other preceding events. In other words, unconditional probabilities are not
conditioned on the occurrence of any other events; they are 'stand-alone' events.
Conditional probability is the probability of one event occurring with some relationship to one
or more other events. Our interest lies in the probability of an event 'A' given that another event
'B' has already occurred. Here's what you should ask yourself: "What is the probability of one event occurring if another event has already taken place?" We compute this as:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$
Bayes' Theorem
Bayes' theorem describes the probability of an event based on prior knowledge of conditions that might be related to the event. Given two events, A and B, then:

$$P(A \mid B) = \frac{P(B \mid A) \times P(A)}{P(B)}$$
Supposing that we are issued with two bonds, A and B. Each bond has a default probability of
10% over the following year. We are also told that there is a 6% chance that both the bonds will
default, an 86% chance that none of them will default, and a 14% chance that either of the bonds will default.
Often, there is a high correlation between bond defaults. This can be attributed to the shared sensitivity of bond issuers to broad economic conditions. The 6% probability of both bonds defaulting is higher than the 1% probability (10% × 10%) that would apply had the default events been independent.
The features of the probability matrix can also be expressed in terms of conditional probabilities.
For example, the likelihood that bond A will default given that B has defaulted is computed as:
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{6\%}{10\%} = 60\%$$
This means that in 60% of the scenarios in which bond B will default, bond A will also default.
$$P(A \cap B) = P(A \mid B) \times P(B) \quad \text{(I)}$$

Also:

$$P(A \cap B) = P(B \mid A) \times P(A) \quad \text{(II)}$$

The right-hand sides of equations (I) and (II) are equated and rearranged to give Bayes' theorem:

$$P(A \mid B) = \frac{P(B \mid A) \times P(A)}{P(B)}$$
When presented with new data, Bayes' theorem can be applied to update beliefs. To understand how the theorem provides a framework for how exactly the new beliefs should be formed, consider the following example.
Based on an examination of historical data, it's been determined that all fund managers at a
certain Fund fall into one of two groups: Stars and Non-Stars. Stars are the best managers. The
probability that a Star will beat the market in any given year is 75%. Other managers are just as likely to beat the market as they are to underperform it (i.e., Non-Stars have 50/50 odds of beating the market). For both types of managers, the probability of beating the market is independent from one year to the next. Stars are rare: of a given pool of managers, only 16% turn out to be Stars.
A new manager was added to the portfolio of funds three years ago. Since then, the new
manager has beaten the market every year. What was the probability that the manager was a
star when the manager was first added to the portfolio? What is the probability that this
manager is a star now? What's the probability that the manager will beat the market next year, given this track record?
Solution
We first summarize the data by introducing some notations as follows: The chances that a
manager will beat the market on the condition that he is a star is:
$$P(B \mid S) = 0.75 = \frac{3}{4}$$

$$P(B \mid \bar{S}) = 0.5 = \frac{1}{2}$$
The chances of the new manager being a star at the time he was added to the analyst's portfolio are exactly the chances that any manager is a star, i.e., the unconditional probability:
$$P(S) = 0.16 = \frac{4}{25}$$
To evaluate the likelihood of him being a star at present, we compute the likelihood of him being
a star given that he has beaten the market for three consecutive years, P (S|3B), using the Bayes’
theorem:
$$P(S \mid 3B) = \frac{P(3B \mid S) \times P(S)}{P(3B)}$$
$$P(3B \mid S) = \left(\frac{3}{4}\right)^3 = \frac{27}{64}$$
The unconditional probability that the manager beats the market for three consecutive years, P(3B), is the denominator:
$$P(3B) = \left(\frac{3}{4}\right)^3 \times \frac{4}{25} + \left(\frac{1}{2}\right)^3 \times \frac{21}{25} = \frac{69}{400}$$
Therefore:
$$P(S \mid 3B) = \frac{\left(\frac{27}{64}\right)\left(\frac{4}{25}\right)}{\left(\frac{69}{400}\right)} = \frac{9}{23} \approx 39\%$$
Therefore, there is a 39% chance that the manager is a star after beating the market for three consecutive years. This is our updated belief, and it is a significant improvement over the prior (unconditional) probability of 16%.
Finally, we compute the manager's chances of beating the market next year. This is the sum of the chances of a star beating the market and the chances of a non-star beating the market, each weighted by the updated probability that the manager is of that type:
$$P(B) = \frac{3}{4} \times \frac{9}{23} + \frac{1}{2} \times \frac{14}{23} = \frac{55}{92} \approx 60\%$$
$$P(S \mid 3B) = \frac{P(3B \mid S) \times P(S)}{P(3B)}$$

The left-hand side of the formula is the posterior. The first item in the numerator is the likelihood, the second is the prior, and the denominator is the unconditional (marginal) probability of the observed data.
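To make the updating mechanics concrete, here is a minimal Python sketch (standard library only) that reproduces the star-manager calculation; the variable names are illustrative choices, not part of the original text.

```python
from fractions import Fraction

# Prior probability that a manager is a star
p_star = Fraction(4, 25)              # 16%
p_not_star = 1 - p_star               # 84%

# Likelihood of beating the market 3 years in a row, by manager type
p_3b_given_star = Fraction(3, 4) ** 3   # (3/4)^3 = 27/64
p_3b_given_not = Fraction(1, 2) ** 3    # (1/2)^3 = 1/8

# Unconditional probability of the observed track record (the evidence)
p_3b = p_3b_given_star * p_star + p_3b_given_not * p_not_star   # 69/400

# Bayes' theorem: posterior probability that the manager is a star
posterior_star = p_3b_given_star * p_star / p_3b                # 9/23 ≈ 39%

# Probability of beating the market next year, using the updated beliefs
p_beat_next = Fraction(3, 4) * posterior_star + Fraction(1, 2) * (1 - posterior_star)

print(float(posterior_star))   # ≈ 0.391
print(float(p_beat_next))      # ≈ 0.598
```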
Question 1
The probability that the Eurozone economy will grow this year is 18%, and the
probability that the European Central Bank (ECB) will loosen its monetary policy is
52%.
Assume that the joint probability that the Eurozone economy will grow and the ECB
will loosen its monetary policy is 45%. What is the probability that either the
Eurozone economy will grow or the ECB will loosen its monetary policy?
A. 42.12%
B. 25%
C. 11%
D. 17%
P(E) = 0.18 (the probability that the Eurozone economy will grow is 18%)
P(M) = 0.52 (the probability that the ECB will loosen its monetary policy is 52%)
P(EM) = 0.45 (the joint probability that the Eurozone economy will grow and the ECB will loosen its monetary policy is 45%)

The probability that either the Eurozone economy will grow or the central bank will loosen its monetary policy is:

$$P(E \cup M) = P(E) + P(M) - P(EM) = 0.18 + 0.52 - 0.45 = 0.25 = 25\%$$

The correct answer is B.
Question 2
Given the following information, calculate the unconditional probability of reaching the office, P(O):

P(O|T) = 0.62: conditional probability of reaching the office if the train arrives on time
P(O|Tᶜ) = 0.47: conditional probability of reaching the office if the train does not arrive on time
P(T) = 0.65: unconditional probability of the train arriving on time
P(O) = ?: unconditional probability of reaching the office
A. 0.4325
B. 0.5675
C. 0.3856
D. 0.5244
If P(T) = 0.65 (the unconditional probability of the train arriving on time is 0.65), then the unconditional probability of the train not arriving on time is P(Tᶜ) = 1 − P(T) = 1 − 0.65 = 0.35.

Note: P(O) is the unconditional probability of reaching the office. It is simply the sum of:

1. the probability of reaching the office if the train arrives on time, multiplied by the probability of the train arriving on time, and
2. the probability of reaching the office if the train does not arrive on time, multiplied by the probability of the train not arriving on time (or, given the information, one minus the probability of the train arriving on time).

$$P(O) = P(O \mid T)P(T) + P(O \mid T^c)P(T^c) = 0.62 \times 0.65 + 0.47 \times 0.35 = 0.5675$$

The correct answer is B.
Question 3
Suppose you are an equity analyst for the XYZ investment bank. You use historical data to categorize fund managers as excellent or average. Excellent managers outperform the market 70% of the time, and average managers outperform the market only 40% of the time. Furthermore, 20% of all fund managers are excellent managers and 80% are simply average. The probability of a manager outperforming the market in any given year is independent of their performance in any other year.

A new fund manager started three years ago and outperformed the market all three years. What is the probability that the manager is an excellent manager?
A. 29.53%
B. 12.56%
C. 57.26%
D. 30.21%
The best way to visualize this problem is to start off with a probability matrix:
Let E be the event of an excellent manager, and A represent the event of an average
manager.
We know that:

P(O|E) = 0.7, P(O|A) = 0.4, P(E) = 0.2, and P(A) = 0.8

We want P(E|O), where O denotes the event of outperforming the market in all three years:
$$P(E \mid O) = \frac{P(O \mid E) \times P(E)}{P(O \mid E) \times P(E) + P(O \mid A) \times P(A)} = \frac{0.7^3 \times 0.2}{0.7^3 \times 0.2 + 0.4^3 \times 0.8} = 57.26\%$$

The correct answer is C.
Reading 13: Random Variables
Explain the differences between a probability mass function and a probability density
function.
Random Variables
A random variable is a variable whose possible values are outcomes of a random phenomenon. It
is a function that maps outcomes of a random process to real values.

Conventionally, random variables are written in upper case (such as X, Y, and Z), while the realized values are written in lower case (such as x, y, and z).
For example, let X be the random variable as a result of rolling a die. Therefore, x is the outcome
of one roll, and it could take any of the values 1, 2, 3, 4, 5, or 6. The probability that the resulting outcome is, say, 3 is written as:

$$P(X = x), \quad \text{where } x = 3$$
Types of Random Variables
A discrete random variable is one that produces a set of distinct values. A discrete random variable manifests in one of two ways:

If the range of all possible values is a finite set, e.g., {1, 2, 3, 4, 5, 6}, as in the case of a die roll; or
If the range of all possible values is a countably infinite set, e.g., {1, 2, 3, ...}, such as the number of candidates registered for the FRM Level 1 exam at any given time.
Since the possible values of a random variable are mostly numerical, they can be described using mathematical functions. A function f_X(x) = P(X = x) for each x in the range of X is the probability function (PF) of X; it explains how the total probability (which is 1) is distributed among the possible values of X.
There are two functions used when explaining the features of the distribution of discrete random
variables: probability mass function (PMF) and cumulative distribution function (CDF).
Probability Mass Function (PMF)

This function gives the probability that a random variable takes a particular value. Since the PMF is a probability, it must satisfy two conditions:

1. f_X(x) ≥ 0 (probabilities cannot be negative)
2. ∑ₓ f_X(x) = 1 (the sum across all values in the support of the random variable must equal 1)

For example, consider a Bernoulli random variable X, which takes the value 1 with probability p and 0 with probability 1 − p. Its PMF is:

$$f_X(x) = p^x (1-p)^{1-x}, \quad x = 0, 1$$

so that:

$$f_X(0) = p^0 (1-p)^{1-0} = 1-p$$

and

$$f_X(1) = p^1 (1-p)^{1-1} = p$$
Looking at the above results, the first property (f_X(x) ≥ 0) of probability distributions is met. For the second property:

$$\sum_{x} f_X(x) = \sum_{x=0,1} f_X(x) = (1-p) + p = 1$$
Moreover, the probability that we observe the value 0 is 1 − p and the probability of observing the value 1 is p, so the PMF can be written as:

$$f_X(x) = \begin{cases} 1-p, & x = 0 \\ p, & x = 1 \end{cases}$$
A graph of the Bernoulli PMF (for p = 0.7) consists of two spikes, since the PMF is only defined at the points x = 0 and x = 1.
Cumulative Distribution Function (CDF)

The CDF measures the probability of realizing a value less than or equal to the input x, Pr(X ≤ x):

$$F_X(x) = Pr(X \leq x)$$

For a discrete random variable, the CDF is a step function (in contrast with the PMF) that takes values between 0 and 1 (in the case of Bernoulli random variables) inclusively.
$$F_X(x) = \begin{cases} 0, & x < 0 \\ 1-p, & 0 \leq x < 1 \\ 1, & x \geq 1 \end{cases}$$
F_X(x) is defined for all real values of x. The graph of F_X(x) against x begins at 0 and rises by jumps at the values of x for which P(X = x) is positive, reaching its maximum value of 1.
Since CDF is defined for all values of x, the CDF for a Bernoulli distribution with a parameter
p=0.7 is:
$$F_X(x) = \begin{cases} 0, & x < 0 \\ 0.3, & 0 \leq x < 1 \\ 1, & x \geq 1 \end{cases}$$
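As an illustration, the following Python sketch (using scipy.stats, with p = 0.7 as in the text) evaluates the Bernoulli PMF and CDF at a few points:

```python
from scipy.stats import bernoulli

p = 0.7
X = bernoulli(p)

# PMF is defined only at the support points 0 and 1
print(X.pmf(0))     # 0.3 -> 1 - p
print(X.pmf(1))     # 0.7 -> p

# CDF is defined for all real x and is a step function
print(X.cdf(-0.5))  # 0.0
print(X.cdf(0.5))   # 0.3
print(X.cdf(1.0))   # 1.0
```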
Relationship Between the CDF and PMF with Discrete Random Variables
The CDF can be represented as the sum of the PMF for all the values that are less than or equal
to x. Simply put:
$$F_X(x) = \sum_{t \in R(X),\, t \leq x} f_X(t)$$
On the other hand, the PMF is equivalent to the difference between the CDF evaluated at consecutive values of X. That is:

$$f_X(x) = F_X(x) - F_X(x-1)$$
Example: There are 8 hens with different weights in a cage. Hens 1 to 3 weigh 1 kg, hens 4 and 5 weigh 2 kg, and the rest weigh 3 kg. We need to develop the PMF and the CDF.
Solution
The random variable X here is the weight of a randomly selected hen, taking the values 1 kg, 2 kg, or 3 kg. The PMF is:
$$f_X(1) = Pr(X=1) = \frac{3}{8}$$

$$f_X(2) = Pr(X=2) = \frac{2}{8} = \frac{1}{4}$$

$$f_X(3) = Pr(X=3) = \frac{3}{8}$$
$$f_X(x) = \begin{cases} \frac{3}{8}, & x = 1 \\ \frac{1}{4}, & x = 2 \\ \frac{3}{8}, & x = 3 \end{cases}$$
The CDF accumulates the probabilities of all realized values less than or equal to x. So,
$$F_X(0) = Pr(X \leq 0) = 0$$

$$F_X(1) = Pr(X \leq 1) = \frac{3}{8}$$

$$F_X(2) = Pr(X \leq 2) = \frac{3}{8} + \frac{2}{8} = \frac{5}{8} \quad \left[\text{using } F_X(x) = \sum_{t \in R(X),\, t \leq x} f_X(t)\right]$$

$$F_X(3) = Pr(X \leq 3) = \frac{5}{8} + \frac{3}{8} = 1$$
$$F_X(x) = \begin{cases} 0, & x < 1 \\ \frac{3}{8}, & 1 \leq x < 2 \\ \frac{5}{8}, & 2 \leq x < 3 \\ 1, & x \geq 3 \end{cases}$$
Note that
$$f_X(x) = F_X(x) - F_X(x-1)$$

$$f_X(3) = F_X(3) - F_X(2) = 1 - \frac{5}{8} = \frac{3}{8}$$
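A quick way to check the hen example is to build the PMF and CDF directly from the weights; the sketch below uses only numpy and the values from the example:

```python
import numpy as np

weights = np.array([1, 1, 1, 2, 2, 3, 3, 3])   # the 8 hens from the example

values = np.array([1, 2, 3])
pmf = np.array([(weights == v).mean() for v in values])   # [3/8, 1/4, 3/8]
cdf = np.cumsum(pmf)                                      # [3/8, 5/8, 1]

print(dict(zip(values, pmf)))
print(dict(zip(values, cdf)))
```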
Continuous Random Variables

A continuous random variable can assume any value along a given interval of the number line, for instance x > 0, −∞ < x < ∞, or 0 < x < 1. Examples of continuous random variables include the price of a stock or bond, or the value at risk of a portfolio at a particular point in time.
Probability Density Function (PDF)

Given a PDF f(x), we can determine the probability that X falls between two limits, a and b; this probability is the likelihood that the random variable X falls within that interval:

$$Pr(a < X \leq b) = \int_a^b f(x)\,dx$$

The probability that X lies between two values is the area under the density function graph between those values.
Probability distribution function is another term used to refer to the probability density function.
The properties of the PDF are the same as those of PMF. That is:
1. f_X(x) ≥ 0, −∞ ≤ x ≤ ∞ (nonnegativity)
2. $$\int_{r_{\min}}^{r_{\max}} f(x)\,dx = 1$$ (the total probability must equal 1, just like in discrete random variables)

The upper and lower bounds of f(x) are defined by r_min and r_max.
Cumulative Distribution Function (CDF)

It is also called the cumulative density function and is closely related to the concept of a PDF. A CDF defines the likelihood of a random variable falling below a specific value. To determine the CDF, the PDF is integrated from its lower bound.
The CDF is traditionally denoted by the capital letter of the corresponding density function.
The following computation depicts a CDF, F(x), of a random variable X whose PDF is f(x):
$$F(a) = \int_{-\infty}^{a} f(x)\,dx = P(X \leq a)$$
The area under the PDF up to a point is a depiction of the CDF. The CDF is non-decreasing and varies from zero to one. The CDF must be zero at the minimum value of the support, since the variable cannot be less than the minimum, and the likelihood that the random variable is less than or equal to the maximum value is one.
To obtain the PDF from the CDF, we have to compute the first derivative of the CDF. Therefore:
$$f(x) = \frac{dF(x)}{dx}$$
Next, we look at how to determine the probability that a random variable X will fall between two values a and b:

$$P(a < X \leq b) = \int_a^b f(x)\,dx = F(b) - F(a)$$
Similarly, the probability that X exceeds a value a is:

$$P(X > a) = 1 - F(a)$$
Example: The continuous random variable X has a PDF of f(x) = 12x²(1 − x) for 0 < x < 1. We need to find the CDF, F(x).
Solution
We know that:
$$F(x) = \int_{-\infty}^{x} f(t)\,dt$$

$$F(x) = \int_0^x 12t^2(1-t)\,dt = \left[4t^3 - 3t^4\right]_0^x = x^3(4-3x)$$

So,

$$F(x) = x^3(4-3x)$$
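As a sanity check on the integration above, the following sketch uses sympy (an assumption of convenience, not part of the original text) to recover the CDF from the stated PDF:

```python
import sympy as sp

x, t = sp.symbols('x t', positive=True)

pdf = 12 * t**2 * (1 - t)           # f(t) = 12t^2(1 - t) on (0, 1)
cdf = sp.integrate(pdf, (t, 0, x))  # F(x) = integral of f(t) from 0 to x

print(sp.expand(cdf))               # 4*x**3 - 3*x**4, i.e., F(x) = x^3(4 - 3x)
```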
Expected Values
Expected values are numerical summaries of features of the distribution of random variables. Denoted by E[X] or μ, the expected value gives the value of X that measures the average, or center, of the distribution. For a discrete random variable:

$$E[X] = \sum_x x f(x)$$

It is simply the sum of the products of the values of the random variable and the probabilities assumed by those values.
Example: There are 8 hens with different weights in a cage. Hens 1 to 3 weigh 1 kg, hens 4 and 5 weigh 2 kg, and the rest weigh 3 kg. We need to calculate the mean weight of the hens.
Solution
$$f(x) = \begin{cases} \frac{3}{8}, & x = 1 \\ \frac{1}{4}, & x = 2 \\ \frac{3}{8}, & x = 3 \end{cases}$$
Now,
$$E[X] = \sum_x x f(x) = 1 \times \frac{3}{8} + 2 \times \frac{1}{4} + 3 \times \frac{3}{8} = 2$$
For a continuous random variable:

$$E[X] = \int_{-\infty}^{\infty} x f(x)\,dx$$

Basically, it is all about integrating the product of the value of the random variable and its density over the support.
Example: The continuous random variable X has a PDF of f(x) = 12x²(1 − x) for 0 < x < 1. Calculate E[X].
Solution
We know that:
$$E[X] = \int_{-\infty}^{\infty} x f(x)\,dx$$
So,
$$E(X) = \int_0^1 x \cdot 12x^2(1-x)\,dx = \left[3x^4 - \frac{12}{5}x^5\right]_0^1 = 0.6$$
For random variables that are functions, we apply the same method as that of a “single” random
variable. That is, summing or integrating the product of the value of the random variable
function and the probability assumed by the corresponding random variable function.
$$E[g(X)] = \sum_x g(x)f(x) \quad \text{(discrete)}$$

$$E[g(X)] = \int_{-\infty}^{\infty} g(x)f(x)\,dx \quad \text{(continuous)}$$
Example: A random variable X has the PDF:

$$f_X(x) = \frac{1}{5}x^2, \quad \text{for } 0 < x < 3$$

Calculate E(2X + 1).
Solution
$$E[g(X)] = \int_{-\infty}^{\infty} g(x)f(x)\,dx$$

$$= \int_0^3 (2x+1)\,\frac{1}{5}x^2\,dx = \frac{1}{5}\left[\frac{x^4}{2} + \frac{x^3}{3}\right]_0^3 = 9.9$$
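A quick numerical check of this expectation, using scipy's quadrature with the PDF and function from the example (a sketch, not part of the original text):

```python
from scipy.integrate import quad

pdf = lambda x: x**2 / 5          # f_X(x) = x^2 / 5 on (0, 3), as in the example
g = lambda x: 2 * x + 1           # the function whose expectation we want

expectation, _ = quad(lambda x: g(x) * pdf(x), 0, 3)
print(expectation)                # ≈ 9.9
```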
Properties of Expectation

The expected value of a constant c is the constant itself; that is, E(c) = c. Moreover, the expected value of a random variable is itself a constant, so E[E(X)] = E(X). Expectation is also linear: for constants a and b, E(aX + b) = aE(X) + b.
Variance

The variance of a random variable measures the spread (dispersion or variability) of the distribution about its mean:

$$Var(X) = E\left[(X - \mu)^2\right]$$

Intuitively, the standard deviation is the square root of the variance. Now, denoting E(X) = μ, the variance can also be written as:

$$Var(X) = E(X^2) - \mu^2$$
Example: The continuous random variable X has a PDF of f(x) = 12x²(1 − x) for 0 < x < 1. Calculate the variance of X.
Solution
We know that:

$$Var(X) = E(X^2) - [E(X)]^2$$

We have to calculate E(X) and E(X²):
$$E(X) = \int_0^1 x \cdot 12x^2(1-x)\,dx = \left[3x^4 - \frac{12}{5}x^5\right]_0^1 = 0.6$$

$$E(X^2) = \int_0^1 (12x^4 - 12x^5)\,dx = \left[\frac{12}{5}x^5 - 2x^6\right]_0^1 = 0.4$$
So,

$$Var(X) = E(X^2) - [E(X)]^2 = 0.4 - 0.6^2 = 0.04$$
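The same variance can be checked numerically; a small sketch using scipy.integrate.quad with the example's PDF:

```python
from scipy.integrate import quad

pdf = lambda x: 12 * x**2 * (1 - x)      # f(x) = 12x^2(1 - x) on (0, 1)

mean, _ = quad(lambda x: x * pdf(x), 0, 1)        # E[X]   = 0.6
second, _ = quad(lambda x: x**2 * pdf(x), 0, 1)   # E[X^2] = 0.4

variance = second - mean**2
print(mean, second, variance)            # 0.6 0.4 0.04
```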
Moments
Moments are defined as expected values that briefly describe the features of a distribution. The first moment is the mean:

$$\mu_1 = E(X)$$

Therefore, the first moment provides information about the average value. The second and higher moments are broadly divided into central and non-central moments.
Central Moments
$$\mu_k = E\left([X - E(X)]^k\right), \quad k = 2, 3, \dots$$
Where k denotes the order of the moment. Central moments are moments about the mean.
Non-Central Moments
Non-central moments describe those moments about 0. The general formula is given by:
$$\mu_k = E(X^k)$$
Note that the central moments can be constructed from the non-central moments and the first moment (the mean).
Population Moments
The four common population moments are: mean, variance, skewness, and kurtosis.
The Mean
μ = E(X)
The Variance
The variance measures the spread of the random variable from its mean. The standard deviation
(σ) is the square root of the variance. The standard deviation is more commonly quoted in the
world of finance because it is easily comparable to the mean since they share the measurement
units.
The Skewness
$$\text{skew}(X) = \frac{E\left([X - E(X)]^3\right)}{\sigma^3} = E\left[\left(\frac{X-\mu}{\sigma}\right)^3\right]$$

Note that (X − μ)/σ is a standardized version of X with a mean of 0 and a variance of 1.
Skewness can be positive or negative.
Positive skew
In most cases (but not always), the mean is greater than the median, or equivalently,
the mean is greater than the mode; in which case the skewness is greater than zero.
Negative skew
In most cases (but not always), the mean is lower than the median, or equivalently,
the mean is lower than the mode, in which case the skewness is lower than zero.
Kurtosis
The kurtosis is defined as the fourth standardized moment, given by:

$$\text{Kurt}(X) = \frac{E\left([X - E(X)]^4\right)}{\sigma^4} = E\left[\left(\frac{X-\mu}{\sigma}\right)^4\right]$$
The description of kurtosis is analogous to that of skewness, except that the fourth power implies that kurtosis measures the absolute (rather than signed) deviation of the random variable. The reference value is 3, the kurtosis of a normal distribution; distributions with kurtosis above 3 are said to have fat tails.
Linear Transformations of Random Variables

A linear transformation of a random variable involves shifting the variable by a constant, α, and multiplying the variable by a constant, β. Here, α is referred to as the shift constant, and β is the scale constant. The transformation shifts X by α and scales it by β. The process results in the formation of a new random variable, usually denoted by Y:

$$Y = \alpha + \beta X$$
The study of linear transformations of random variables is motivated by the fact that many variables used in finance are linear transformations of other variables.
Suppose your salary is α dollars per year, and you are entitled to a bonus of β dollars for every
dollar of sales you successfully bring in. Let X be what you sell in a certain year. How much in total will you make that year?
Solution
We can linearly transform the sales variable X into a new variable Y that represents the total
amount made.
Y = α + βx
If Y = α + βX, where α and β are constants, the mean of Y is given by:

$$E(Y) = \alpha + \beta E(X)$$
The shift parameter α does not affect the variance. Why? Because variance is a measure of spread from the mean; adding α does not change the spread but merely shifts the distribution left or right. Thus:

$$Var(Y) = \beta^2 \sigma^2$$

and the standard deviation of Y is:

$$\sqrt{\beta^2 \sigma^2} = |\beta|\,\sigma$$
It can also be shown that if β is positive (so that Y = α + βx is an increasing transformation), then
the skewness and kurtosis of Y are identical to the skewness and kurtosis of X. This is because
both moments are defined on standardized quantities, which removes the effect of the shift and the scale.
We know that:
$$\text{skew}(X) = E\left[\left(\frac{X-\mu}{\sigma}\right)^3\right]$$
Now,
$$\text{skew}(Y) = \frac{E\left([Y - E(Y)]^3\right)}{\sigma_Y^3} = E\left[\left(\frac{Y - E(Y)}{\sigma_Y}\right)^3\right]$$

$$= E\left[\left(\frac{\alpha + \beta X - (\alpha + \beta\mu)}{\beta\sigma}\right)^3\right]$$

$$= E\left[\left(\frac{\beta(X-\mu)}{\beta\sigma}\right)^3\right] = E\left[\left(\frac{X-\mu}{\sigma}\right)^3\right] = \text{skew}(X)$$
However, if β < 0, the magnitude of skewness of Y is the same as that of X but with the opposite
sign because of the odd power (i.e., 3). On the other hand, the kurtosis is unaffected because it is based on an even power (i.e., 4).
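The effect of a linear transformation on the moments can also be illustrated empirically; a short sketch using numpy and scipy.stats (the values of α, β, and the simulated sample are arbitrary choices):

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(42)
x = rng.lognormal(mean=0.0, sigma=0.5, size=100_000)   # a skewed sample

alpha, beta = 2.0, -3.0          # shift and (negative) scale constants
y = alpha + beta * x

print(np.mean(y), alpha + beta * np.mean(x))    # means agree: E(Y) = alpha + beta*E(X)
print(np.var(y), beta**2 * np.var(x))           # Var(Y) = beta^2 * Var(X)
print(skew(x), skew(y))                         # same magnitude, opposite sign (beta < 0)
print(kurtosis(x), kurtosis(y))                 # kurtosis unchanged
```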
Just like any data, quantities such as the quantiles and the modes are used to describe the
distribution.
The Quantiles
For a continuous random variable X, the α-quantile of X is the smallest number m such that:

$$Pr(X < m) = \alpha$$

where α ∈ [0, 1].
For instance, if X is a continuous random variable, the median is defined to be the solution of:
$$P(X < m) = \int_{-\infty}^{m} f_X(x)\,dx = 0.5$$
Similarly, the lower and upper quartiles are such that P(X < Q1) = 0.25 and P(X < Q3) = 0.75. The interquartile range is:

$$IQR = Q_3 - Q_1$$
Example: Calculating the Quartiles of a PDF

A random variable X has the PDF f_X(x) = 3e^(−2x), x ≥ 0 (as used in the solution below). Calculate the median, m.
Solution
$$P(X < m) = \int_0^m 3e^{-2x}\,dx = 0.5$$

So,

$$\left[-\frac{3}{2}e^{-2x}\right]_0^m = -\frac{3}{2}e^{-2m} + \frac{3}{2} = 0.5$$

$$\Rightarrow m = -\frac{1}{2}\ln\left(\frac{2}{3}\right) = 0.2027$$
Mode
The mode measures the common tendency, that is, the location of the most observed value of a
random variable. In a continuous random variable, the mode is represented by the highest point
in the PDF.
Random variables can be unimodal if there is just one mode, bimodal if there are two modes, or multimodal if there are more than two modes.
A plot of the PDF makes the difference between unimodal and bimodal distributions easy to see: a unimodal density has a single peak, while a bimodal density has two.
Question 1
The standard deviation of a random variable X is 2. Calculate the variance of (3 − 4X).
A. 29
B. 30
C. 64
D. 35
Solution
Recall that:

$$Var(a + bX) = b^2\,Var(X)$$

So,

$$Var(3 - 4X) = (-4)^2\,Var(X) = 16\,Var(X)$$

But we are given that the standard deviation is 2, implying that the variance is 4. Therefore,

$$Var(3 - 4X) = 16 \times 4 = 64$$

The correct answer is C.
Question 2
A continuous random variable has a pdf given by f X(x) = ce−3x for all x > 0. Calculate
Pr(X<6.5)
A. 0.4532
B. 0.4521
C. 0.3321
D. 0.9999
Solution

Since f_X(x) is a PDF, it must integrate to 1:

$$\int_{-\infty}^{\infty} f(x)\,dx = 1$$

So,

$$\int_0^{\infty} ce^{-3x}\,dx = c\left[-\frac{1}{3}e^{-3x}\right]_0^{\infty} = c\left[0 - \left(-\frac{1}{3}\right)\right] = 1 \;\Rightarrow\; c = 3$$

Therefore, the PDF is f_X(x) = 3e^(−3x), so that Pr(X < 6.5) is given by:

$$\int_0^{6.5} 3e^{-3x}\,dx = 3\left[-\frac{1}{3}e^{-3x}\right]_0^{6.5} = 1 - e^{-3 \times 6.5} = 0.9999$$

The correct answer is D.
Reading 14: Common Univariate Random Variables
Distinguish the key properties among the following distributions: uniform distribution, Bernoulli distribution, binomial distribution, Poisson distribution, normal distribution, lognormal distribution, chi-squared distribution, Student's t, and F-distributions, and identify common occurrences of each distribution.

Describe a mixture distribution and explain the creation and characteristics of mixture distributions.
Parametric Distributions
There are two types of distributions, namely parametric and non-parametric distributions.
Functions mathematically describe parametric distributions. On the other hand, one cannot use a mathematical function to describe a non-parametric distribution.
Bernoulli Distribution
Bernoulli distribution is a discrete random variable that takes on values of 0 and 1. This
distribution is suitable for scenarios with binary outcomes, such as corporate defaults. Most of the time, the value 1 represents success and 0 represents failure.
The Bernoulli distribution has a parameter p, which is the probability of success, i.e., the probability that X = 1:

$$P(X = 1) = p \quad \text{and} \quad P(X = 0) = 1 - p$$

The probability mass function of the Bernoulli distribution, stated as X ∼ Bernoulli(p), is given by:

$$f_X(x) = p^x(1-p)^{1-x}, \quad x = 0, 1$$

and the CDF is:

$$F_X(x) = \begin{cases} 0, & x < 0 \\ 1-p, & 0 \leq x < 1 \\ 1, & x \geq 1 \end{cases}$$

Therefore, the mean and variance of the distribution are computed as:

$$E(X) = p \times 1 + (1-p) \times 0 = p$$

$$V(X) = p(1-p)$$
Example: A Bernoulli random variable X has p = 0.75. Calculate the ratio of its mean to its variance.

Solution
E(X) = p
and
V (X) = p(1 − p)
So,
$$\frac{E(X)}{V(X)} = \frac{p}{p(1-p)} = \frac{1}{1-p} = \frac{1}{0.25} = 4$$
Binomial Distribution
A binomial random variable quantifies the total number of successes from n independent Bernoulli trials, with the probability of success being p and, of course, of failure being 1 − p. Consider the following
example:
Suppose we are given two independent bonds with a default likelihood of 10%. Then we have the
following possibilities:
None of them defaults, one of them defaults, or both of them default. The respective probabilities are:

$$P(X = 0) = (1 - 10\%)^2 = 81\%$$

$$P(X = 1) = 2 \times 10\% \times (1 - 10\%) = 18\%$$

$$P(X = 2) = 10\%^2 = 1\%$$

Similarly, with three such independent bonds,

$$P(X = 0) = (1 - 10\%)^3 = 72.9\%$$

$$P(X = 3) = 10\%^3 = 0.1\%$$
Suppose now that we have n bonds. The following combination represents the number of ways in which x of the n bonds can default:

$$\binom{n}{x} = \frac{n!}{x!\,(n-x)!} \quad \dots \text{(I)}$$
If p is the likelihood that one bond will default, then the probability that a particular set of x bonds defaults (and the remaining n − x do not) is:

$$p^x (1-p)^{n-x} \quad \dots \text{(II)}$$
Combining equations (I) and (II), we can determine the likelihood of x bonds defaulting as follows:

$$P(X = x) = \binom{n}{x} p^x (1-p)^{n-x}, \quad \text{for } x = 0, 1, 2, \dots, n$$
This is the PMF of the binomial distribution.
Therefore, the binomial distribution has two parameters, n and p, and is usually stated as X ∼ B(n, p). Its CDF is:

$$F_X(x) = \sum_{i=0}^{\lfloor x \rfloor} \binom{n}{i} p^i (1-p)^{n-i}$$
The mean and variance of the binomial distribution can be evaluated using moments. They are given by:

$$E(X) = np$$

and

$$V(X) = np(1-p)$$
The binomial can be approximated using a normal distribution (as will be seen later) if np ≥ 10
and n(1 − p) ≥ 10
Example: Suppose X ∼ B(4, 0.6). Calculate P(X ≥ 3).

Solution
$$P(X = x) = \binom{n}{x} p^x (1-p)^{n-x}$$

$$\Rightarrow P(X \geq 3) = P(X=3) + P(X=4) = \binom{4}{3} p^3 (1-p)^{4-3} + \binom{4}{4} p^4 (1-p)^{4-4}$$

$$= \binom{4}{3} 0.6^3 (1-0.6)^{1} + \binom{4}{4} 0.6^4 (1-0.6)^{0} = 0.3456 + 0.1296 = 0.4752$$
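The same computation in Python (a sketch using scipy.stats.binom with the n = 4, p = 0.6 from the example):

```python
from scipy.stats import binom

n, p = 4, 0.6
# P(X >= 3) = P(X = 3) + P(X = 4), or equivalently 1 - P(X <= 2)
prob = binom.pmf(3, n, p) + binom.pmf(4, n, p)
print(prob)                       # 0.4752
print(1 - binom.cdf(2, n, p))     # same answer via the CDF
```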
Poisson Distribution
Events are said to follow a Poisson process if they happen at a constant rate over time, and the likelihood that one event will take place is independent of all the other events; for instance, the number of defaults in a bond portfolio in a given month.
Suppose that X is a Poisson random variable, stated as X~Poisson(λ) then the PMF is given by:
$$P(X = x) = \frac{\lambda^x e^{-\lambda}}{x!}$$
and the CDF is:

$$F_X(x) = e^{-\lambda}\sum_{i=0}^{\lfloor x \rfloor} \frac{\lambda^i}{i!}$$
The Poisson parameter λ (lambda), termed as the hazard rate, represents the mean number of
events in an interval. Therefore, the mean and variance of the Poisson distribution are given by:
E(X) = λ
And
V (X) = λ
A fixed income portfolio is made of a huge number of independent bonds. The average number of
bonds defaulting every month is 10. What is the probability that there are exactly 5 defaults in
one month?
Solution
For Poisson distribution:
$$P(X = x) = \frac{\lambda^x e^{-\lambda}}{x!}$$

$$P(X = 5) = \frac{10^5 e^{-10}}{5!} = 0.03783$$
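A one-line check with scipy.stats.poisson (λ = 10, as in the example):

```python
from scipy.stats import poisson

print(poisson.pmf(5, mu=10))   # ≈ 0.03783, probability of exactly 5 defaults
```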
The notable feature of a Poisson distribution is that it is infinitely divisible. That is, if X1 ∼ Poisson(λ1) and X2 ∼ Poisson(λ2) are independent, then their sum is also Poisson:

$$Y = X_1 + X_2 \sim \text{Poisson}(\lambda_1 + \lambda_2)$$

Therefore, the Poisson distribution is suitable for time series data, since summing the number of events over adjacent time intervals still yields a Poisson random variable.
Uniform Distribution
A uniform distribution is a continuous distribution that takes any value within the range [a, b], with each value being equally likely. Its PDF is:

$$f_X(x) = \frac{1}{b-a}, \quad a \leq x \leq b$$
Note that the PDF of a uniform random variable does not depend on x since all values are equally
likely.
$$F_X(x) = \begin{cases} 0, & x < a \\ \frac{x-a}{b-a}, & a \leq x \leq b \\ 1, & x \geq b \end{cases}$$
When a = 0 and b = 1, the distribution is called the standard uniform distribution. From a standard uniform random variable U1, we can construct any other uniform random variable U2 ∼ U(a, b) using the formula:

$$U_2 = a + (b-a)U_1$$
The uniform distribution is denoted by X ∼ U(a, b), and the mean and variance are given by:

$$E(X) = \frac{a+b}{2}$$

$$V(X) = \frac{(b-a)^2}{12}$$
For instance, the mean and variance of the standard uniform distribution U1 ∼ U(0, 1) are given by:

$$E(X) = \frac{0+1}{2} = \frac{1}{2}$$

and

$$V(X) = \frac{(1-0)^2}{12} = \frac{1}{12}$$
Assume that we want to calculate the probability that X falls in the interval l < X < u, where l is the lower limit and u is the upper limit. That is, we need P(l < X < u) given that X ∼ U(a, b). To do this, we use:

$$P(l < X < u) = \frac{\min(u, b) - \max(l, a)}{b-a}$$

When the interval lies entirely within [a, b], this simplifies to:

$$\frac{u-l}{b-a}$$
Example: Given the uniform distribution X ∼ U(−5, 10), calculate the mean, variance, and P(−3 < X < 6).
Solution
$$E(X) = \frac{a+b}{2} = \frac{-5+10}{2} = 2.5$$

and

$$V(X) = \frac{(10-(-5))^2}{12} = \frac{225}{12} = 18.75$$

$$P(l < X < u) = \frac{\min(u,b) - \max(l,a)}{b-a}$$

$$P(-3 < X < 6) = \frac{\min(6,10) - \max(-3,-5)}{10-(-5)} = \frac{6-(-3)}{15} = \frac{9}{15} = 0.60$$
Alternatively, you can think of the probability as the area under the curve. Note that the height of the uniform density is 1/(b − a) and the length of the interval is u − l. That is:

$$\frac{1}{b-a} \times (u-l) = \frac{1}{10-(-5)} \times (6-(-3)) = \frac{9}{15} = 0.60$$
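The same quantities via scipy.stats.uniform (note that scipy parameterizes the distribution by loc = a and scale = b − a):

```python
from scipy.stats import uniform

a, b = -5, 10
X = uniform(loc=a, scale=b - a)

print(X.mean(), X.var())        # 2.5, 18.75
print(X.cdf(6) - X.cdf(-3))     # 0.6
```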
Normal Distribution
Also called the Gaussian distribution, the normal distribution has a symmetrical PDF, and the
mean and median coincide with the highest point of the PDF. Furthermore, the normal distribution is supported on the entire real line.
The following is the formula of a PDF that is normally distributed, for a given random variable X :
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad -\infty < x < \infty$$

$$X \sim N(\mu, \sigma^2)$$
We read this as "X is normally distributed, with mean μ and variance σ²." Any linear combination of independent normal variables is also normal. To illustrate this, assume X and Y are two independent normally distributed variables, and a and b are constants. Then Z = aX + bY is also normal:

$$Z = aX + bY \sim N(a\mu_X + b\mu_Y,\; a^2\sigma_X^2 + b^2\sigma_Y^2)$$

For instance, for a = b = 1, Z = X + Y and thus Z ∼ N(μ_X + μ_Y, σ_X² + σ_Y²).
A standard normal distribution is a normal distribution whose mean is 0 and whose standard deviation is 1. Its PDF is denoted by ϕ:

$$\phi(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}x^2}$$
To determine a normal variable whose standard deviation is σ and mean is μ, we multiply a standard normal variable Z by σ and then add the mean:

$$X = \mu + \sigma Z \;\Rightarrow\; X \sim N(\mu, \sigma^2)$$
Three independent standard normal variables X1, X2, and X3 can be combined in the following way to construct two correlated standard normal variables, X_A and X_B, with correlation ρ:

$$X_A = \sqrt{\rho}\,X_1 + \sqrt{1-\rho}\,X_2$$

$$X_B = \sqrt{\rho}\,X_1 + \sqrt{1-\rho}\,X_3$$
The z-value measures how many standard deviations the corresponding x-value is above or below the mean:

$$z = \frac{X - \mu}{\sigma} \sim N(0, 1)$$

where

$$X \sim N(\mu, \sigma^2)$$

Converting normal random variables X into z-values is termed standardization. The values of the standard normal CDF, Φ(z), are usually tabulated.
For example, consider the normal distribution X ∼ N(1, 2). We wish to calculate P(X > 2).

Solution

Standardizing,

$$P(X > 2) = 1 - P(X \leq 2) = 1 - \Phi\!\left(\frac{2-1}{\sqrt{2}}\right) = 1 - \Phi(0.71) \approx 1 - 0.7611 = 0.2389 \approx 24\%$$
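Verifying with scipy.stats.norm (scipy takes the standard deviation, so we pass scale = √2):

```python
from math import sqrt
from scipy.stats import norm

# X ~ N(mean=1, variance=2)  =>  standard deviation = sqrt(2)
print(1 - norm.cdf(2, loc=1, scale=sqrt(2)))   # ≈ 0.2398
print(norm.sf(2, loc=1, scale=sqrt(2)))        # same thing via the survival function
```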
The relationship between x-values and z-values is as follows:

x-value     z-value
μ           0
μ + 1σ      1
μ + 2σ      2
μ + nσ      n
Recall that for a binomial random variable, if np ≥ 10 and n(1 − p) ≥ 10, then the binomial distribution can be approximated by a normal distribution:

$$X \sim N\big(np,\; np(1-p)\big)$$

Also, the Poisson distribution is approximately normal when λ is large (λ ≥ 1000), so that:

$$X \sim N(\lambda, \lambda)$$
We then calculate the probabilities while maintaining the normal distribution principles. The normal distribution has several other important properties:

The normal distribution is widely used in the Central Limit Theorem (CLT), which is utilized in hypothesis testing.
The normal distribution is closely related to other important distributions, such as the chi-squared, Student's t, and F distributions.
A notable property of normal random variables is that they are infinitely divisible, which makes the normal distribution suitable for modeling asset prices.
Normal distributions are closed under linear operations. In other words, a weighted sum of independent normal random variables is itself normally distributed.
Lognormal Distribution
A random variable X has a lognormal distribution if its natural logarithm is normally distributed, such that:

$$Y = \ln X \quad \Leftrightarrow \quad X = e^Y$$

where

$$Y \sim N(\mu, \sigma^2)$$
Since Y ∼ N(μ, σ²), the PDF of a lognormal random variable is:

$$f(x) = \frac{1}{x\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{\ln(x) - \mu}{\sigma}\right)^2}, \quad x \geq 0$$
A variable is said to have a lognormal distribution if its natural logarithm has a normal
distribution. The lognormal distribution is undefined for negative values, unlike the normal
distribution that has a range of values between negative infinity and positive infinity.
If the above equation of the density function of the lognormal distribution is rearranged, we obtain an equation that has a form similar to the normal distribution. That is:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{\frac{\sigma^2}{2}-\mu}\, e^{-\frac{1}{2}\left(\frac{\ln x - (\mu - \sigma^2)}{\sigma}\right)^2}$$
From the above, we notice that the lognormal distribution is asymmetrical. It is not symmetrical around its mean, as is the case with the normal distribution. The lognormal distribution peaks at exp(μ − σ²).
The mean of the lognormal distribution is given by:

$$E[X] = e^{\mu + \frac{1}{2}\sigma^2}$$
This yields an expression that closely resembles the Taylor expansion of the natural logarithm relating a log return r to the corresponding simple return R:

$$r \approx R - \frac{1}{2}R^2$$
The following is the formula for the variance of the lognormal distribution:

$$V(X) = E\left[(X - E[X])^2\right] = \left(e^{\sigma^2} - 1\right) e^{2\mu + \sigma^2}$$
Consider a lognormal distribution given by X ∼ LogN (0.08 , 0.2) . Calculate the expected value.
Solution
$$E[X] = e^{\mu + \frac{1}{2}\sigma^2} = e^{0.08 + \frac{1}{2}\times 0.2} = e^{0.18} = 1.1972$$
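A quick check with scipy.stats.lognorm (scipy parameterizes the lognormal with s = σ and scale = e^μ):

```python
from math import exp, sqrt
from scipy.stats import lognorm

mu, sigma2 = 0.08, 0.2
X = lognorm(s=sqrt(sigma2), scale=exp(mu))

print(X.mean())                  # ≈ 1.1972 = exp(mu + sigma2 / 2)
print(exp(mu + sigma2 / 2))      # same, from the closed-form formula
```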
Chi-Squared Distribution, χ²
Assume we have k independent standard normal variables Z1, Z2, …, Zk. The sum of their squares,

$$S = \sum_{i=1}^{k} Z_i^2$$

follows a chi-squared distribution with k degrees of freedom:

$$S \sim \chi^2_k$$
Here, k is called the number of degrees of freedom. It is important to note that two independent chi-squared variables with degrees of freedom k1 and k2, respectively, have a sum that is chi-squared distributed with k1 + k2 degrees of freedom.

The chi-squared variable is asymmetrical and takes on non-negative values only. Its mean and variance are:

$$E(S) = k$$

and

$$V(S) = 2k$$
The chi-squared distribution has the following PDF, for positive values of x:

$$f(x) = \frac{1}{2^{\frac{k}{2}}\,\Gamma\!\left(\frac{k}{2}\right)}\, x^{\frac{k}{2}-1} e^{-\frac{x}{2}}$$
where Γ is the gamma function, defined as:

$$\Gamma(n) = \int_0^{\infty} x^{n-1} e^{-x}\,dx$$

For a positive integer n, Γ(n) = (n − 1)!. For instance:

$$\Gamma(3) = (3-1)! = 2 \times 1 = 2$$
This distribution is widely applicable in statistics and risk management when testing hypotheses. The chi-squared distribution is approximated by a normal distribution, χ²_k ≈ N(k, 2k), when the number of degrees of freedom is large. This is true because, as the number of degrees of freedom increases, the skewness reduces.
Degrees of freedom measure the amount of data required to test model parameters. If we have a
sample size n, the degrees of freedom are given by n – p, where p is the number of parameters
estimated.
Student’s t Distribution
This distribution is often called the t distribution. Let Z be the standard normal variable, and U a
chi-squared variable with k degrees of freedom. Also, assume that U is independent of Z. Then the random variable X defined as:

$$X = \frac{Z}{\sqrt{U/k}}$$

follows a Student's t distribution with k degrees of freedom.
The following formula represents its PDF:

$$f(x) = \frac{\Gamma\!\left(\frac{k+1}{2}\right)}{\sqrt{k\pi}\,\Gamma\!\left(\frac{k}{2}\right)}\left(1 + \frac{x^2}{k}\right)^{-\frac{k+1}{2}}$$
The mean of the t distribution is zero, and the distribution is symmetrical around it. That is:

$$E(X) = 0$$

The variance is:

$$V(X) = \frac{k}{k-2}$$

and the kurtosis is:

$$\text{Kurt}(X) = 3\,\frac{k-2}{k-4}$$
It is easy to see that the mean is valid for k > 1, the variance is finite for k > 2, and the kurtosis is finite for k > 4.
The distribution converges to a standard normal distribution as k tends towards infinity (k → ∞). When k > 2, the variance of the distribution is k/(k − 2), which converges to one as k increases.
We can also separate the degrees of freedom from the variance to get what is called the standardized Student's t distribution. Recall that V(aX) = a²V(X), so that:

$$V\left[\sqrt{\frac{k-2}{k}}\,X\right] = 1$$

where

$$X \sim t_k$$
The generalized Student's t is called the standardized Student's t because it has a mean of 0 and a variance of 1. Note that we can still rescale it to have any variance, provided k > 2.
A generalized Student's t is specified by its mean, variance, and number of degrees of freedom. This distribution is widely applicable in hypothesis testing and in modeling the returns of financial assets.
Example: The kurtosis of the returns on a bond portfolio is 6. What are the degrees of freedom if the returns were generated using a Student's t_k distribution?
Solution
$$\text{Kurt}(X) = 3\,\frac{k-2}{k-4}$$

$$\therefore 6 = 3\,\frac{k-2}{k-4} \;\Rightarrow\; 2(k-4) = k-2$$

so that:

$$k = 6$$
F–Distribution
The F-distribution is often used in the analysis of variance (ANOVA). The F-distribution is an asymmetric distribution that has a minimum value of 0 but no maximum value; notably, the curve approaches but never quite touches the horizontal axis. A random variable X defined as:

$$X = \frac{U_1/k_1}{U_2/k_2} \sim F(k_1, k_2)$$

follows an F-distribution, provided that U1 and U2 are independent chi-squared random variables with k1 and k2 as their respective degrees of freedom.
Its PDF is:

$$f(x) = \frac{\sqrt{\dfrac{(k_1 x)^{k_1}\, k_2^{k_2}}{(k_1 x + k_2)^{k_1 + k_2}}}}{x\, B\!\left(\frac{k_1}{2}, \frac{k_2}{2}\right)}$$

where B is the beta function:

$$B(x, y) = \int_0^1 z^{x-1}(1-z)^{y-1}\,dz$$
The mean and variance are:

$$E(X) = \frac{k_2}{k_2 - 2} \quad \text{for } k_2 > 2$$

$$\sigma^2 = \frac{2k_2^2(k_1 + k_2 - 2)}{k_1(k_2 - 2)^2(k_2 - 4)} \quad \text{for } k_2 > 4$$
Suppose that X is a random variable with a t-distribution with k degrees of freedom. Then its square is F-distributed:

$$X^2 \sim F(1, k)$$
Beta Distribution

The beta distribution applies to continuous random variables in the range of 0 to 1. This distribution is similar to the triangular distribution in the sense that both are applicable in the modelling of default rates and recovery rates. Assuming that a and b are two positive constants, the PDF of the beta distribution is:

$$f(x) = \frac{1}{B(a, b)}\, x^{a-1}(1-x)^{b-1}, \quad 0 \leq x \leq 1$$
where

$$B(a, b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}$$
The following two equations represent the mean and variance of the beta distribution:
$$\mu = \frac{a}{a+b}$$

$$\sigma^2 = \frac{ab}{(a+b)^2(a+b+1)}$$
Exponential Distribution
The exponential distribution is a continuous distribution with a parameter β, whose PDF is:

$$f_X(x) = \frac{1}{\beta}\, e^{-\frac{x}{\beta}}, \quad x \geq 0$$
The CDF is given by:

$$F_X(x) = 1 - e^{-\frac{x}{\beta}}$$
The parameter β of the exponential distribution determines the mean and variance of the distribution:

$$E(X) = \beta$$

and

$$V(X) = \beta^2$$
Notably, the exponential distribution is a close 'cousin' of the Poisson distribution: the waiting times between consecutive events in a Poisson process are exponentially distributed. Another feature of the exponential distribution is that it is memoryless; that is, the distribution of the remaining waiting time is independent of how much time has already elapsed.
Assume that the time to default for a specific segment of mortgage consumers is exponentially
distributed with a β of ten years. What is the probability that a borrower will not default before
year 11?
Solution

To find the probability that the borrower will not default before year eleven, we start by calculating the cumulative distribution function up to year eleven and then subtract this from 100%:

$$F(11) = 1 - e^{-\frac{11}{10}} = 1 - 0.3329 = 0.6671 = 66.71\%$$

Therefore, the probability that the borrower will not default before year 11 is 1 − 0.6671 = 0.3329, or about 33.3%.
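The same calculation with scipy.stats.expon (scipy's scale parameter corresponds to β = 10):

```python
from scipy.stats import expon

beta = 10          # mean time to default, in years
T = expon(scale=beta)

print(T.cdf(11))   # ≈ 0.6671, probability of default before year 11
print(T.sf(11))    # ≈ 0.3329, probability of NO default before year 11
```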
Mixture Distributions

Mixture distributions are new, more complex distributions built using two or more component distributions. In general, the PDF of a mixture is:

$$f(x) = \sum_{i=1}^{n} w_i f_i(x) \quad \text{such that} \quad \sum_{i=1}^{n} w_i = 1$$
The f_i(x)'s are the component distributions, and the w_i's are the weights or mixing proportions. The component weights must sum to one for the resulting mixture to be a legitimate distribution. Intuitively, a two-component mixture first draws a value from a Bernoulli random variable and, depending on the outcome (0 or 1), then draws from the corresponding component distribution. With this construction, it is straightforward to compute the CDF of the mixture, for example when the component distributions are normal random variables. Mixture distributions are very flexible, as they can introduce features such as skewness and fat tails.
For example, consider X1 ∼ F_X1, X2 ∼ F_X2, and W ∼ Bernoulli(p). The mixture can be written as:

$$Y = W X_1 + (1-W) X_2$$

so that Y is drawn from X1 with probability p and from X2 with probability 1 − p.
Both the PDF and the CDF of the mixture distribution are weighted averages of the constituent PDFs and CDFs:

$$f_Y(y) = p\, f_{X_1}(y) + (1-p)\, f_{X_2}(y)$$

and

$$F_Y(y) = p\, F_{X_1}(y) + (1-p)\, F_{X_2}(y)$$

Intuitively, the computation of the moments is done in a similar way. That is:

$$E(Y) = p\, E(X_1) + (1-p)\, E(X_2)$$

and

$$E\left[(Y - \mu)^k\right] = p\,E\left[(X_1 - \mu)^k\right] + (1-p)\,E\left[(X_2 - \mu)^k\right]$$

where

$$\mu = E(Y)$$
Using the same logic, we can calculate the other higher central moments such as the kurtosis
and skewness. However, note that the mixture distribution might exhibit skewness and excess kurtosis even when its components do not (for example, when the components are normal random variables). Moreover, mixing components with different means and variances leads to a distribution that is both skewed and heavier-tailed than its components.
Example: Consider two normal random variables X1 ∼ N(0.15, 0.60) and X2 ∼ N(−0.8, 3), combined in a mixture. What is the mean of the mixture?

Solution

We know that the mean of a mixture is the weighted average of the component means, E(Y) = w1 E(X1) + w2 E(X2), with the weights given by the mixing proportions.
Question
The number of new clients that a wealth management company receives in a month is distributed as a Poisson random variable with mean 2. Calculate the probability that the company receives exactly 28 new clients in a given year.
A. 5.48%
B. 0.10%
C. 3.54%
D. 10.2%
Since new clients arrive at a rate of 2 per month, the annual rate is λ = 2 × 12 = 24. For a Poisson random variable:

$$P(X = n) = \frac{\lambda^n e^{-\lambda}}{n!}$$

$$P(X = 28) = \frac{24^{28} e^{-24}}{28!} = 5.48\%$$

The correct answer is A.
Reading 15: Multivariate Random Variables
Explain how a probability matrix can be used to express a probability mass function (PMF).

Compute the marginal and conditional distributions of a discrete bivariate random variable.

Explain how the expectation of a function is computed for a bivariate discrete random variable.

Explain the relationship between the covariance and correlation of two random variables and how these are related to the independence of the two variables.

Explain the effects of applying linear transformations on the covariance and correlation between two random variables.

Explain how the iid property is helpful in computing the mean and variance of a sum of iid random variables.
Multivariate random variables accommodate the dependence between two or more random
variables. The concepts under multivariate random variables (such as expectations and moments) are analogous to those of univariate random variables.
Multivariate Discrete Random Variables
A multivariate random variable is defined on a vector of outcomes from the same sample space. In other words, multivariate random variables are vectors of random variables. For instance, a bivariate random variable X can be a vector with two components, X1 and X2, with realizations x1 and x2.
The PMF or PDF for a bivariate random variable gives the probability that the two random
variables each take a certain value. If we wish to plot these functions, we would need three dimensions (x1, x2, and the probability).

The PMF of a bivariate random variable is a function that gives the probability that the components simultaneously take particular values:

$$f_{X_1,X_2}(x_1, x_2) = P(X_1 = x_1, X_2 = x_2)$$

The PMF explains the probability of a realization as a function of x1 and x2. The PMF has the following properties:

1. f_{X1,X2}(x1, x2) ≥ 0
2. The sum of f_{X1,X2}(x1, x2) across all values in the support of X1 and X2 equals 1.
The trinomial distribution is the distribution of n independent trials where each trial results in one of three outcomes (a generalization of the binomial distribution). The first, second, and third components are X1, X2, and n − X1 − X2, respectively. However, the third component is redundant since it is fully determined by the first two. The distribution has three parameters:

1. n, representing the total number of trials;
2. p1, the probability of the first outcome; and
3. p2, the probability of the second outcome, with the probability of the third outcome being 1 − p1 − p2.

The PMF is:

$$f_{X_1,X_2}(x_1, x_2) = \frac{n!}{x_1!\,x_2!\,(n - x_1 - x_2)!}\, p_1^{x_1} p_2^{x_2} (1 - p_1 - p_2)^{n - x_1 - x_2}$$
The CDF of a bivariate discrete random variable returns the total probability that each component is less than or equal to a given value:

$$F_{X_1,X_2}(x_1, x_2) = \sum_{t_1 \leq x_1} \sum_{t_2 \leq x_2} f_{X_1,X_2}(t_1, t_2)$$

In this equation, t1 ranges over the values that X1 may take as long as t1 ≤ x1. Similarly, t2 ranges over the values that X2 may take as long as t2 ≤ x2.
Probability Matrices
In financial markets, market sentiments play a role in determining the return earned on a
security. Suppose the return earned on a bond is in part determined by the rating given to the bond by analysts. The joint distribution of the bond return and the analyst rating can be laid out in a probability matrix:
                              Bond Return (X1)
                          −10%      0%      10%
Analyst      Positive +1    5%      5%      30%
Rating (X2)  Neutral   0   10%     10%      15%
             Negative −1   20%      5%       0%
Each cell represents the probability of a joint outcome. For example, there’s a 5% probability of a
negative return (-10%) if analysts have positive views about the bond and its issuer. In other
words, there’s a 5% probability that the bond will decline in price with a positive rating.
Similarly, there's a 10% chance that the bond's price will not change (and hence a zero return) when analysts hold a neutral view.
Marginal Distributions

The marginal distribution gives the distribution of a single variable in a joint distribution. In the discrete case, it is obtained by summing the joint probabilities for X1 across all the values in the support of X2. The resulting PMF of X1 is denoted f_{X1}(x1):

$$f_{X_1}(x_1) = \sum_{x_2 \in R(X_2)} f_{X_1,X_2}(x_1, x_2)$$

Similarly,

$$f_{X_2}(x_2) = \sum_{x_1 \in R(X_1)} f_{X_1,X_2}(x_1, x_2)$$
Using the probability matrix we created above, we can come up with marginal distributions for both the bond return (X1) and the analyst rating (X2).
For X1 ,
P(X1 = −10%) = 5% + 20% + 10% = 35%
P(X1 = 0%) = 5% + 10% + 5% = 20%
P(X1 = +10%) = 30% + 15% + 0% = 45%
For X2,

P(X2 = +1) = 5% + 5% + 30% = 40%
P(X2 = 0) = 10% + 10% + 15% = 35%
P(X2 = −1) = 20% + 5% + 0% = 25%
As you may have noticed, the marginal distribution satisfies the properties of a valid probability mass function:

$$\sum_{\forall x_1} f_{X_1}(x_1) = 1$$

and

$$f_{X_1}(x_1) \geq 0$$
We can, in addition, use the marginal PMF to compute the marginal CDF. The marginal CDF is given by:

$$F_{X_1}(x_1) = \sum_{t_1 \in R(X_1),\, t_1 \leq x_1} f_{X_1}(t_1)$$
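The marginal distributions can be read off the probability matrix by summing rows and columns; a small numpy sketch using the matrix above:

```python
import numpy as np

# Rows: analyst rating (+1, 0, -1); columns: bond return (-10%, 0%, +10%)
joint = np.array([
    [0.05, 0.05, 0.30],
    [0.10, 0.10, 0.15],
    [0.20, 0.05, 0.00],
])

marginal_return = joint.sum(axis=0)   # [0.35, 0.20, 0.45]
marginal_rating = joint.sum(axis=1)   # [0.40, 0.35, 0.25]

print(marginal_return, marginal_rating)
print(joint.sum())                    # 1.0, as required of a PMF
```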
Independence

Recall that two events A and B are independent if:

$$P(A \cap B) = P(A)P(B)$$

This principle applies to bivariate random variables as well. If the components of the bivariate random variable are independent, then the joint PMF is the product of the marginal PMFs:

$$f_{X_1,X_2}(x_1, x_2) = f_{X_1}(x_1)\, f_{X_2}(x_2)$$
Now let's use our earlier example on the return earned on a bond. If we assume that the two variables – return and rating – are independent, we can calculate the joint distribution by multiplying their marginal distributions. But are they really independent? Let's find out! We have already established the joint and the marginal distributions. Assuming that our two variables are independent, the implied joint distribution would be as follows:
                              Bond Return (X1)
                          −10%       0%       10%
Analyst      Positive +1    14%      8%       18%
Rating (X2)  Neutral   0  12.25%     7%     15.75%
             Negative −1   8.75%     5%     11.25%
We obtain the table above by multiplying the marginal PMF of the bond return by the marginal
PMF of ratings. For example, the marginal probability that the bond return is 10% is 45% -- the
sum of the third column. The marginal probability of a positive rating is 40% -- the sum of the
first row. These two values, when multiplied, give the 18% joint probability in the first row of the table.

It is clear that the two variables are not independent because multiplying their marginal PMFs does not reproduce the joint probabilities in the original probability matrix.
Conditional Distributions

Recall that the conditional probability of an event A given event B is:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$
This result can be applied to bivariate distributions. That is, the conditional distribution of X1 given that X2 takes the value x2 is:

$$f_{X_1 \mid X_2}(x_1 \mid X_2 = x_2) = \frac{f_{X_1,X_2}(x_1, x_2)}{f_{X_2}(x_2)}$$
From the result above, the conditional distribution is the joint distribution divided by the marginal distribution of the conditioning variable.
Example: Calculating the Conditional Distribution
Suppose we want to find the distribution of bond returns conditional on a positive analyst rating.
Returns (X1)                        −10%            0%             10%
f_{X1|X2}(x1 | X2 = 1)        5%/40% = 12.5%  5%/40% = 12.5%  30%/40% = 75%
= P(X1 = x1 | X2 = 1)
What we have done is to take the joint probabilities where there’s a positive analyst rating and
then divided these values by the marginal probability of a positive rating (40%) to produce the
conditional distribution.
Note that the conditional PMF obeys the laws of probability, i.e., the conditional probabilities are non-negative and sum to one (12.5% + 12.5% + 75% = 100%).
Conditional distributions can also be computed for one variable while conditioning on more than one outcome of the other variable. For example, assume that we need to compute the conditional distribution of the bond returns given that analyst ratings are non-negative. Therefore, our conditioning set is {+1, 0}:

$$X_2 \in \{+1, 0\}$$

The conditional PMF sums the joint probabilities across all outcomes in the conditioning set, S = {+1, 0}. The marginal probability that X2 ∈ {+1, 0} is the sum of the marginal probabilities of these two outcomes, i.e., 40% + 35% = 75%.
                              Bond Return (X1)
                          −10%      0%      10%    f_{X2}(x2)
Analyst      Positive +1    5%      5%      30%       40%
Rating (X2)  Neutral   0   10%     10%      15%       35%
             Negative −1   20%      5%       0%       25%
f_{X1}(x1)                 35%     20%      45%
$$f_{X_1 \mid X_2}(x_1 \mid x_2 \in \{+1, 0\}) = \begin{cases} \frac{5\% + 10\%}{75\%} = 20\%, & x_1 = -10\% \\ \frac{5\% + 10\%}{75\%} = 20\%, & x_1 = 0\% \\ \frac{30\% + 15\%}{75\%} = 60\%, & x_1 = +10\% \end{cases}$$
Recall that:

$$f_{X_1 \mid X_2}(x_1 \mid X_2 = x_2) = \frac{f_{X_1,X_2}(x_1, x_2)}{f_{X_2}(x_2)}$$

Also, if the components of the bivariate distribution are independent, then:

$$f_{X_1,X_2}(x_1, x_2) = f_{X_1}(x_1)\, f_{X_2}(x_2)$$

Applying this to the conditional distribution formula, we get:

$$f_{X_1 \mid X_2}(x_1 \mid X_2 = x_2) = \frac{f_{X_1}(x_1)\, f_{X_2}(x_2)}{f_{X_2}(x_2)} = f_{X_1}(x_1)$$

That is, when the components are independent, conditioning on X2 does not change the distribution of X1.
Expectations
The expectation of a function of a bivariate random variable is defined in the same way as that of the univariate random variable. Consider the function g(X1, X2). The expectation is defined as:

$$E[g(X_1, X_2)] = \sum_{x_1} \sum_{x_2} g(x_1, x_2)\, f_{X_1,X_2}(x_1, x_2)$$

Here, g(x1, x2) may depend on both x1 and x2, or it may be a function of one component only.
Example: Consider the bivariate random variable (X1, X2) with the following joint probabilities:

             X1 = 1    X1 = 2
  X2 = 3       10%       15%
  X2 = 4       70%        5%
Solution
Using the formula:
Moments
Just like with univariate random variables, we use expectations to define the moments. The second moment involves the covariance between the components of the bivariate random variable:

$$Cov(X_1, X_2) = E\big[(X_1 - E(X_1))(X_2 - E(X_2))\big] = E(X_1 X_2) - E(X_1)E(X_2)$$

Note that Cov(X1, X1) = Var(X1), and that if X1 and X2 are independent, then Cov(X1, X2) = 0.
Most often, the correlation between X1 and X2 is reported rather than the covariance. Now let Var(X1) = σ1², Var(X2) = σ2², and Cov(X1, X2) = σ12. Then the correlation is defined as:

$$Corr(X_1, X_2) = \rho_{X_1 X_2} = \frac{Cov(X_1, X_2)}{\sqrt{\sigma_1^2}\sqrt{\sigma_2^2}} = \frac{\sigma_{12}}{\sigma_1 \sigma_2}$$

so that:

$$\sigma_{12} = \rho_{X_1 X_2}\, \sigma_1 \sigma_2$$
Correlation measures the strength of the linear relationship between the two random variables and always lies between −1 and +1. For example, if X2 = α + βX1, then:

$$Corr(X_1, X_2) = \rho_{X_1 X_2} = \frac{\beta\,Var(X_1)}{\sqrt{Var(X_1)}\sqrt{\beta^2 Var(X_1)}} = \frac{\beta}{|\beta|}$$

It is now evident that if β > 0, then ρ_{X1X2} = 1, and if β < 0, then ρ_{X1X2} = −1.
Now consider scaling each variable by a constant, giving aX1 and bX2. Then,

$$Cov(aX_1, bX_2) = ab\,Cov(X_1, X_2)$$

This implies that the scale factor of each random variable affects the covariance multiplicatively.
Using the above results, the corresponding correlation coefficient of aX1 and bX2 is given by:

$$Corr(aX_1, bX_2) = \frac{ab\,Cov(X_1, X_2)}{\sqrt{a^2 Var(X_1)}\sqrt{b^2 Var(X_2)}} = \frac{ab}{|a||b|}\,\rho_{X_1 X_2}$$
Application of Correlation: Portfolio Variance and Hedging
The variances of the underlying securities and their respective correlations are the necessary ingredients for computing the variance of a portfolio. Consider a portfolio made up of two securities whose random returns are X_A and X_B, with means μ_A and μ_B and standard deviations σ_A and σ_B. The variance of the sum of the returns is:

$$\sigma^2_{A+B} = \sigma_A^2 + \sigma_B^2 + 2\rho_{AB}\,\sigma_A \sigma_B$$
If both securities have equal variance, so that σ_A² = σ_B² = σ², this becomes:

$$\sigma^2_{A+B} = 2\sigma^2(1 + \rho_{AB})$$

If, in addition, the correlation between the two securities is zero, the equation simplifies further, and we have the following relation for the standard deviation:

$$\sigma_{A+B} = \sqrt{2}\,\sigma$$
More generally, if Y is the sum of n random variables,

$$Y = \sum_{i=1}^{n} X_i$$

then its variance is:

$$\sigma_Y^2 = \sum_{i=1}^{n}\sum_{j=1}^{n} \rho_{ij}\,\sigma_i \sigma_j$$
In case all the X_i's are uncorrelated and all variances are equal to σ², then we have:

$$\sigma_Y^2 = n\sigma^2 \quad \Rightarrow \quad \sigma_Y = \sqrt{n}\,\sigma$$

This is what is called the square root rule for the addition of uncorrelated variables.
For a general weighted sum, Y = aX_A + bX_B, the same logic applies with weights a and b.

The major challenge during hedging is correlation. Suppose we hold $1 of security A and add h units of security B to form a hedged portfolio; h is, therefore, the hedge ratio. The hedged portfolio and its variance are:

$$P = X_A + hX_B$$

$$\sigma_P^2 = \sigma_A^2 + h^2\sigma_B^2 + 2h\rho_{AB}\,\sigma_A\sigma_B$$

The minimum-variance hedge ratio can be determined by taking the derivative with respect to h and setting it equal to zero:

$$\frac{d\sigma_P^2}{dh} = 2h\sigma_B^2 + 2\rho_{AB}\,\sigma_A\sigma_B = 0$$

$$\Rightarrow h^* = -\rho_{AB}\,\frac{\sigma_A}{\sigma_B}$$
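A quick numerical illustration of the minimum-variance hedge ratio (the volatilities and correlation below are made-up inputs, not from the text):

```python
import numpy as np

sigma_a, sigma_b, rho = 0.20, 0.25, 0.6   # hypothetical annual vols and correlation

def portfolio_variance(h):
    """Variance of P = X_A + h * X_B."""
    return sigma_a**2 + h**2 * sigma_b**2 + 2 * h * rho * sigma_a * sigma_b

h_star = -rho * sigma_a / sigma_b          # closed-form minimum-variance hedge ratio
print(h_star)                              # -0.48

# Check numerically that h_star indeed minimizes the variance
grid = np.linspace(-2, 2, 4001)
print(grid[np.argmin(portfolio_variance(grid))])   # ≈ -0.48
```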
The covariance matrix of a bivariate random variable X is a 2×2 matrix that displays the variances and covariance of the components of X:

$$Cov(X) = \begin{bmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{bmatrix}$$
Conditional Expectation
A conditional expectation is simply the mean calculated after a set of prior conditions has happened. It is the value that a random variable takes "on average" over an arbitrarily large number of occurrences, given that a certain set of conditions has occurred. A conditional expectation uses the same expression as any other expectation; it is a weighted average in which the weights are the conditional probabilities.

In the bond return/rating example, we may wish to calculate the expected return on the bond given a positive analyst rating. Recall the conditional distribution:
Returns (X1)                        −10%            0%             10%
f_{X1|X2}(x1 | X2 = 1)        5%/40% = 12.5%  5%/40% = 12.5%  30%/40% = 75%
= P(X1 = x1 | X2 = 1)

$$E(X_1 \mid X_2 = 1) = -10\% \times 12.5\% + 0\% \times 12.5\% + 10\% \times 75\% = 6.25\%$$
Conditional Variance
We can calculate the conditional variance by substituting the conditional expectation into the variance formula. We know that:

$$Var(X_1 \mid X_2 = 1) = E(X_1^2 \mid X_2 = 1) - \left[E(X_1 \mid X_2 = 1)\right]^2$$

Returning to our example above, we need to calculate:

$$E(X_1^2 \mid X_2 = 1) = (-0.10)^2 \times 12.5\% + 0^2 \times 12.5\% + (0.10)^2 \times 75\% = 0.00875$$

so that:

$$Var(X_1 \mid X_2 = 1) = \sigma^2_{(X_1 \mid X_2 = 1)} = 0.00875 - [0.0625]^2 = 0.004844 = 0.4844\%$$
If we wish to find the conditional standard deviation of the returns, we just take the square root of the conditional variance:

$$\sigma_{(X_1 \mid X_2 = 1)} = \sqrt{0.004844} = 6.96\%$$
Multivariate Continuous Random Variables

Before we continue, it is essential to note that continuous random variables make use of the same concepts and methodologies as discrete random variables. The main distinguishing factor is that PDFs and integrals replace PMFs and sums.

The joint (bivariate) density gives the probability that the pair (X1, X2) takes values in a given region:

$$P(a < X_1 < b,\; c < X_2 < d) = \int_a^b\!\!\int_c^d f_{X_1,X_2}(x_1, x_2)\,dx_2\,dx_1$$
The joint PDF is always nonnegative, and integrating it over the entire support yields a value of 1. That is:

$$f_{X_1,X_2}(x_1, x_2) \geq 0$$

and

$$\int\!\!\int f_{X_1,X_2}(x_1, x_2)\,dx_1\,dx_2 = 1$$
Example: Assume that the random variables X1 and X2 are jointly distributed as:

$$f_{X_1,X_2}(x_1, x_2) = k(x_1 + 3x_2), \quad 0 < x_1 < 2,\; 0 < x_2 < 2$$

Find the value of k, and hence calculate P(X1 < 1, X2 > 1).
Solution
Since the joint PDF must integrate to 1:

$$\int_0^2\!\!\int_0^2 f_{X_1,X_2}(x_1, x_2)\,dx_1\,dx_2 = 1$$
We have:

$$\int_0^2\!\!\int_0^2 k(x_1 + 3x_2)\,dx_1\,dx_2 = \int_0^2 k\left[\frac{1}{2}x_1^2 + 3x_1 x_2\right]_0^2 dx_2 = \int_0^2 k(2 + 6x_2)\,dx_2 = k\left[2x_2 + 3x_2^2\right]_0^2 = 16k = 1$$

$$\Rightarrow k = \frac{1}{16}$$

So,

$$f_{X_1,X_2}(x_1, x_2) = \frac{1}{16}(x_1 + 3x_2)$$
Therefore,

$$P(X_1 < 1, X_2 > 1) = \int_0^1\!\!\int_1^2 \frac{1}{16}(x_1 + 3x_2)\,dx_2\,dx_1 = 0.3125$$
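This probability can be confirmed numerically with scipy.integrate.dblquad (note that dblquad integrates over the inner variable first):

```python
from scipy.integrate import dblquad

joint_pdf = lambda x2, x1: (x1 + 3 * x2) / 16     # f(x1, x2) = (x1 + 3*x2)/16

# P(X1 < 1, X2 > 1): x1 from 0 to 1 (outer), x2 from 1 to 2 (inner)
prob, _ = dblquad(joint_pdf, 0, 1, lambda x1: 1, lambda x1: 2)
print(prob)    # 0.3125
```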
The joint CDF is given by:

$$F_{X_1,X_2}(x_1, x_2) = \int_{-\infty}^{x_1}\!\!\int_{-\infty}^{x_2} f_{X_1,X_2}(t_1, t_2)\,dt_2\,dt_1$$

Note that the lower bound of each integral can be adjusted so that it is the lower limit of the variable's support.
Using the example above, we can calculate F(X1 < 1, X2 < 1) in a similar way as above.
For continuous random variables, the marginal PDF is obtained by integrating out the other variable:

$$f_{X_1}(x_1) = \int_{-\infty}^{\infty} f_{X_1,X_2}(x_1, x_2)\,dx_2$$

Similarly,

$$f_{X_2}(x_2) = \int_{-\infty}^{\infty} f_{X_1,X_2}(x_1, x_2)\,dx_1$$
Note that if we want to find the marginal distribution of X1 we integrate X2 out and vice versa.
Example: Consider again the joint density:

$$f_{X_1,X_2}(x_1, x_2) = \frac{1}{16}(x_1 + 3x_2), \quad 0 < x_1 < 2,\; 0 < x_2 < 2$$
We wish to find the marginal distribution of X1 . This implies that we need to integrate out X2. So,
$$f_{X_1}(x_1) = \int_0^2 \frac{1}{16}(x_1 + 3x_2)\,dx_2 = \frac{1}{16}\left[x_1 x_2 + \frac{3}{2}x_2^2\right]_0^2 = \frac{1}{16}\left[2x_1 + 6\right] = \frac{1}{8}(x_1 + 3)$$
Conditional Distributions
The conditional distribution is defined analogously to that of discrete random variables. That is:

$$f_{X_1 \mid X_2}(x_1 \mid X_2 = x_2) = \frac{f_{X_1,X_2}(x_1, x_2)}{f_{X_2}(x_2)}$$
Conditional distributions are applied in finance, for example in risk management. For instance, we may wish to compute the conditional distribution of interest rates, X1, given that a related variable, X2, has taken a particular value or fallen below a given level.
Independent and Identically Distributed (iid) Random Variables

A collection of random variables is independent and identically distributed (iid) if each random variable has the same probability distribution as the others and all are mutually independent.

Example: Consider a sequence of tosses of a fair coin.

The coin has no memory, so all the throws are "independent."
The probability of heads vs. tails in every throw is 50:50, so the coin is fair and stays fair; the distribution from which every throw is drawn is the same and stays the same. Hence the throws are "identically distributed."
Consider iid variables generated by a normal distribution. They are typically written as:

$$X_i \overset{iid}{\sim} N(\mu, \sigma^2)$$

The expected value of the sum of n iid random variables is:

$$E\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} E(X_i) = \sum_{i=1}^{n} \mu = n\mu$$

where E(Xi) = μ.
The result above is valid since the variables are independent and have similar moments.
Maintaining this line of thought, the variance of iid random variables is given by:
$$Var\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} Var(X_i) + 2\sum_{j=1}^{n}\sum_{k=j+1}^{n} Cov(X_j, X_k)$$

$$= \sum_{i=1}^{n} \sigma^2 + 2\sum_{j=1}^{n}\sum_{k=j+1}^{n} 0 = n\sigma^2$$
The independence property is important because there’s a difference between the variance of
the sum of multiple random variables and the variance of a multiple of a single random variable.
In the case of a multiple of a single variable, X1, with variance σ²:

Var(2X1) = 4Var(X1) = 4 × σ² = 4σ²
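The contrast between the variance of a sum and the variance of a multiple can be illustrated by simulation. The sketch below uses NumPy with an illustrative normal distribution and parameter values that are assumptions of the illustration, not taken from the text.

    # Simulation contrast: Var(X1 + X2) = 2*sigma^2 for iid draws,
    # versus Var(2*X1) = 4*sigma^2 for a multiple of a single draw.
    import numpy as np

    rng = np.random.default_rng(42)
    sigma, trials = 2.0, 200_000          # illustrative values, not from the text

    x = rng.normal(0.0, sigma, size=(trials, 2))   # two iid draws per trial

    var_sum = (x[:, 0] + x[:, 1]).var()   # ~ 2*sigma^2 = 8
    var_twice = (2 * x[:, 0]).var()       # ~ 4*sigma^2 = 16

    print(round(var_sum, 2), round(var_twice, 2))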
Practice Question
Let X be the portion of an insurance claim representing damage to the house, and let Y be the portion of the same claim representing damage to the rest of the property. The joint density of X and Y is f(x, y) = 6[1 − (x + y)] for x > 0, y > 0, x + y < 1.

What is the probability that the portion of a claim representing damage to the rest of the property is less than 0.3?

A. 0.657

B. 0.450

C. 0.415

D. 0.752
The correct answer is A.

The marginal density of Y is obtained by integrating out x over 0 < x < 1 − y:

f_Y(y) = ∫_0^(1−y) 6[1 − (x + y)] dx = 6[x − x²/2 − xy]_0^(1−y)

= 6[(1 − y) − (1 − y)²/2 − y(1 − y)]

At this point we can factor out (1 − y) and simplify what remains in the square bracket:

6(1 − y)[1 − (1 − y)/2 − y] = 6(1 − y)[(1 − y)/2] = 3(1 − y)² = 3 − 6y + 3y²

So,

P(Y < 0.3) = ∫_0^0.3 (3 − 6y + 3y²) dy = 0.9 − 0.27 + 0.027 = 0.657
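The answer can be verified numerically by integrating the joint density directly over the region {0 < y < 0.3, 0 < x < 1 − y}; the short SciPy sketch below (Python is an assumption of the illustration, not part of the reading) does exactly that.

    # Numerical check: f(x, y) = 6*(1 - x - y) on x > 0, y > 0, x + y < 1.
    from scipy.integrate import dblquad

    # x (inner) from 0 to 1 - y, y (outer) from 0 to 0.3.
    prob, _ = dblquad(lambda x, y: 6 * (1 - x - y),
                      0, 0.3, lambda y: 0, lambda y: 1 - y)
    print(round(prob, 3))  # 0.657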
Reading 16: Sample Moments
Estimate the mean, variance, and standard deviation using sample data.
Describe the bias of an estimator and explain what the bias measures.
Explain what is meant by the statement that the mean estimator is BLUE.
Describe the consistency of an estimator and explain the usefulness of this concept.
Explain how the Law of Large Numbers (LLN) and Central Limit Theorem (CLT) apply to the sample mean.
Estimate the mean of two random variables and apply the CLT.
Explain how coskewness and cokurtosis are related to skewness and kurtosis.
Sample Moments
Recall that moments are expected values that concisely describe the features of a distribution. Sample moments are used to approximate the unknown population moments. Such moments include the mean, variance, skewness, and kurtosis. We shall discuss each moment in detail.
Estimation of the Mean

The population mean, denoted by μ, is estimated by the sample mean (X̄). The estimated mean is denoted by μ̂ and defined by:

μ̂ = X̄ = (1/n) ∑_{i=1}^n Xi

where E(Xi) = μ and n is the number of observations.
Note that the mean estimator is a function of random variables, and thus it is a random variable.
Consequently, we can examine its properties as a random variable (its mean and variance)
as follows:
E(μ̂) = E(X̄) = E[(1/n) ∑_{i=1}^n Xi] = (1/n) ∑_{i=1}^n E(Xi) = (1/n) ∑_{i=1}^n μ = (1/n) × nμ = μ

The above result is true since we have assumed that the Xi's are iid. The mean estimator is therefore an unbiased estimator of μ. In general, the bias of an estimator θ̂ of a population parameter θ is defined as:

Bias(θ̂) = E(θ̂) − θ

where θ̂ is the estimator of the true population value θ. So, in the case of the population mean:

Bias(μ̂) = E(μ̂) − μ = μ − μ = 0

Since the bias of the mean estimator is 0, it is an unbiased estimator of the population mean.
Using conventional features of a random variable, the variance of the mean estimator is
calculated as:
Var(μ̂) = Var((1/n) ∑_{i=1}^n Xi) = (1/n²)[∑_{i=1}^n Var(Xi) + covariance terms]
But we are assuming that Xi 's are iid, and thus they are uncorrelated, implying that their
covariance is equal to 0. Consequently, taking Var(Xi ) = σ2 , the above formula changes to:
Var(μ̂) = Var((1/n) ∑_{i=1}^n Xi) = (1/n²) ∑_{i=1}^n Var(Xi) = (1/n²) ∑_{i=1}^n σ² = (1/n²) × nσ² = σ²/n

Thus,

Var(μ̂) = σ²/n
Looking at the last formula, the variance of the mean estimator depends on the variance of the data (σ²) and the sample size n. Consequently, the variance of the mean estimator decreases as the number of observations (sample size) increases. This implies that the larger the sample, the more precise the mean estimate.
An experiment was done to find out the number of hours that candidates spend preparing for the
FRM part 1 exam. It was discovered that for a sample of 10 students, the following times were
spent:
318, 304, 317, 305, 309, 307, 316, 309, 315, 327

Estimate the mean number of hours spent.

Solution
We know that:
X̄ = μ̂ = (1/n) ∑_{i=1}^n Xi

⇒ X̄ = (318 + 304 + 317 + 305 + 309 + 307 + 316 + 309 + 315 + 327)/10 = 312.7

As the sample size (the number of observations) increases, the sample mean tends to get closer to the true population mean.
Estimation of Variance and Standard Deviation
A natural estimator of the population variance σ² is:

σ̂² = (1/n) ∑_{i=1}^n (Xi − μ̂)²

Note that we are still assuming that the Xi's are iid. As compared to the mean estimator, the sample variance estimator is biased:

Bias(σ̂²) = E(σ̂²) − σ² = ((n − 1)/n)σ² − σ² = −σ²/n

This implies that the bias decreases as the number of observations increases. Intuitively, the source of the bias is the variance of the mean estimator (σ²/n). Since the bias is known, we can correct for it and obtain the unbiased estimator:
s² = (n/(n − 1)) σ̂² = (n/(n − 1)) × (1/n) ∑_{i=1}^n (Xi − μ̂)² = (1/(n − 1)) ∑_{i=1}^n (Xi − μ̂)²

It can be shown that E(s²) = σ², and thus s² is an unbiased variance estimator. In practice, financial analysis usually involves large data sets, in which case the difference between s² and σ̂² is negligible and either of these values can be used. However, when the number of observations is large (n ≥ 30), σ̂² is conventionally preferred.
The sample standard deviation is the square root of the sample variance. That is:

σ̂ = √σ̂²

or

s = √s²

Note that the square root is a nonlinear function, and thus the standard deviation estimators are biased even when the underlying variance estimator is unbiased.
Example: Calculating the Sample Variance Estimator (Unbiased)
Using the example as in calculating the sample mean, what is the sample variance?
Solution
s² = (1/(n − 1)) ∑_{i=1}^n (Xi − μ̂)²
Xi      (Xi − μ̂)²
318     (318 − 312.7)² = 28.09
304     75.69
317     18.49
305     59.29
309     13.69
307     32.49
316     10.89
309     13.69
315     5.29
327     204.49
Total   462.10
s² = (1/(n − 1)) ∑_{i=1}^n (Xi − μ̂)² = 462.1/(10 − 1) = 51.34
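The same figures can be reproduced with NumPy; the ddof argument controls whether the divisor is n or n − 1. This is a quick check in Python (not a tool used in the reading).

    # Sample mean and variance for the study-hours data above.
    import numpy as np

    hours = np.array([318, 304, 317, 305, 309, 307, 316, 309, 315, 327])

    mean = hours.mean()               # 312.7
    var_unbiased = hours.var(ddof=1)  # 51.34 (divides by n - 1)
    var_biased = hours.var(ddof=0)    # 46.21 (divides by n)

    print(mean, round(var_unbiased, 2), round(var_biased, 2))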
The mean and the standard deviation are widely used to summarize data because:

I. The mean and the variance are often adequate to describe the data.

II. They give a clue on the range of values that can be observed.

III. The units of the mean and the standard deviation are the same as those of the data, and thus they are easy to interpret.
Skewness
As we saw in an earlier chapter, the skewness is the cubed standardized central moment, given by:

skew(X) = E[(X − E[X])³]/σ³ = E[((X − μ)/σ)³]

Note that (X − μ)/σ is a standardized version of X with a mean of 0 and a variance of 1. Equivalently,

skew(X) = E[(X − E[X])³] / (E[(X − E[X])²])^(3/2) = μ₃/σ³
The skewness measures the asymmetry of the distribution (since the third power preserves the sign of the deviation). When the skewness is negative, there is a higher probability of observing large negative values than large positive values (the tail is on the left side of the distribution). Conversely, if the skewness is positive, there is a higher probability of observing large positive values than large negative values (the tail is on the right side of the distribution).
The skewness estimator is obtained by replacing the expectations with sample averages and is given by:

μ̂₃/σ̂³

We can estimate μ̂₃ as:

μ̂₃ = (1/n) ∑_{i=1}^n (xi − μ̂)³
The following are the data on the financial analysis of a sales company’s income over the last 100
months:
n = 100, ∑_{i=1}^100 (xi − μ̂)² = 674,759.90, and ∑_{i=1}^100 (xi − μ̂)³ = −12,456.784

Estimate the skewness of the income distribution.

Solution
skew = μ̂₃/σ̂³ = [(1/n) ∑_{i=1}^n (xi − μ̂)³] / [(1/n) ∑_{i=1}^n (xi − μ̂)²]^(3/2) = [(1/100)(−12,456.784)] / [(1/100) × 674,759.90]^(3/2) = −0.000225
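The arithmetic can be reproduced in a few lines of Python (used here purely as a calculator; the summary statistics are the ones given in the example).

    # Skewness estimate from the summary statistics above.
    n = 100
    sum_sq = 674_759.90      # sum of (x_i - mean)^2
    sum_cube = -12_456.784   # sum of (x_i - mean)^3

    mu3_hat = sum_cube / n              # estimated third central moment
    sigma_hat = (sum_sq / n) ** 0.5     # estimated standard deviation
    skew_hat = mu3_hat / sigma_hat ** 3

    print(round(skew_hat, 6))  # approximately -0.000225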
Kurtosis
The kurtosis is the fourth standardized central moment:

Kurt(X) = E[(X − E[X])⁴]/σ⁴ = E[((X − μ)/σ)⁴]

Equivalently,

Kurt(X) = E[(X − E[X])⁴] / (E[(X − E[X])²])² = μ₄/σ⁴

The description of kurtosis is analogous to that of the skewness, only that the fourth power means kurtosis measures the magnitude of deviations regardless of their sign. The reference value for a normally distributed random variable is 3. A random variable with kurtosis exceeding 3 is said to be heavy-tailed (leptokurtic).
The kurtosis estimator likewise replaces expectations with sample averages and is given by:

μ̂₄/σ̂⁴

where

μ̂₄ = (1/n) ∑_{i=1}^n (xi − μ̂)⁴
The Mean Estimator Is BLUE

We say that the mean estimator is the Best Linear Unbiased Estimator (BLUE) of the population mean because:

I. It has the lowest variance of any Linear Unbiased Estimator (LUE); and

II. It is an unbiased estimator of the population mean (as shown earlier).

Linear estimators are linear functions of the data and can be written as:

μ̂ = ∑_{i=1}^n ωi Xi

where ωi is independent of Xi. In the case of the sample mean estimator, ωi = 1/n. Recall that for such an estimator to be unbiased, the weights must sum to one.
BLUE ranks an estimator as best when it has the smallest variance among all linear and unbiased estimators. However, there are other, potentially superior, estimators, such as Maximum Likelihood Estimators (MLE).
Recall that the mean estimator is unbiased, and its variance takes a simple form. Moreover, if the
data used are iid and normally distributed, then the estimator is also normally distributed.
However, it is difficult to derive the exact distribution of the mean from a finite number of observations.

To overcome this, we use the behavior of the mean in large sample sizes (that is, as n → ∞) to approximate the distribution of the mean in finite sample sizes. We shall explain the behavior of
the mean estimator using the Law of Large Numbers (LLN) and the Central Limit Theorem
(CLT).
The Law of Large Numbers

The law of large numbers (Kolmogorov's Strong Law of Large Numbers) states that if the Xi's are iid with mean μ, then the sample mean converges almost surely to the population mean:

μ̂n = (1/n) ∑_{i=1}^n Xi →a.s. μ

An estimator is said to be consistent if the LLN applies to it. Consistency requires that an estimator:

I. Is unbiased, or has a bias that goes to 0 as the number of observations n increases; and

II. Has a variance that decreases as the number of observations n increases, that is, Var(μ̂n) → 0.

Moreover, under the LLN, the sample variance is consistent. That is, the LLN implies that σ̂² →a.s. σ².
However, the LLN alone is not useful for approximating the distribution of the mean estimator, because the variance of μ̂n tends to 0 as n → ∞.
The Central Limit Theorem (CLT) states that if X1, X2, …, Xn is a sequence of iid random variables with a finite mean μ and a finite non-zero variance σ², then the distribution of (μ̂ − μ)/(σ/√n) tends to the standard normal distribution as n → ∞. Put simply,

(μ̂ − μ)/(σ/√n) → N(0, 1)
Note that μ̂ = X̄ = the sample mean.
Note that the CLT extends the LLN and provides a way of approximating the distribution of the sample mean estimator. The CLT is appealing because it does not require knowledge of the distribution of the underlying random variables.
Since the CLT is asymptotic, we can also use the unstandardized form:

μ̂ ∼ N(μ, σ²/n)

so that

Z = (μ̂ − μ)/(σ/√n)
The sample size n required for the normal approximation to be accurate depends on the shape of the population (the distribution of the Xi's): the further the population is from normal, the larger the sample needed.
Example: Applying CLT
A sales expert believes that the number of sales per day for a particular company has a mean of
40 and a standard deviation of 12. He surveyed for over 50 working days. What is the probability
that the sample mean of sales for this company is less than 35?
Solution
By the CLT,

μ̂ ∼ N(μ, σ²/n)

We need P(μ̂ < 35):

P(μ̂ < 35) = P(Z < (35 − 40)/(12/√50)) = P(Z < −2.946)

= 1 − P(Z < 2.946) = 0.00161
Median
Median is a central tendency measure of distribution, also called the 50% quantile, which divides
the distribution in half ( 50% of observations lie on either side of the median value).
When the sample size n is odd, the value in position (n + 1)/2 of the sorted list is used to estimate the median:

Med(x) = x_{(n+1)/2}

If the number of observations is even, the median is estimated as the average of the two middle values:

Med(x) = ½(x_{n/2} + x_{n/2+1})
Solution
The sample size is 6 (even), so the median is given by:

Med(Age) = ½(x₃ + x₄) = ½(43 + 50) = 46.5

The median is used when the exact midpoint of the distribution is desired, or when there are extreme values (outliers) that would distort the mean.
Other Quantiles

Other quantiles, such as the 25% and 75% quantiles, are estimated analogously to the median. For a θ-quantile, we locate position nθ in the sorted list. If nθ is not an integer, we average the values in the positions immediately below and above nθ.

So, in our example above, for the 25% quantile (θ = 0.25), nθ = 6 × 0.25 = 1.5. This implies that we need to average the 1st and 2nd values:

q̂₂₅ = ½(25 + 34) = 29.5
The interquartile range (IQR) is defined as the difference between the 75% and 25% quantiles. That is:

IQR = q̂₇₅ − q̂₂₅

The IQR is a measure of dispersion and thus can be used as an alternative to the standard deviation.

If we use the example above, for the 75% quantile, nθ = 6 × 0.75 = 4.5. So, we need to average the 4th and 5th values:

q̂₇₅ = ½(50 + 51) = 50.5

IQR = 50.5 − 29.5 = 21
Quantile-based measures have two attractive properties:

I. The units of the quantiles are the same as those of the data used; hence, they are easy to interpret.
II. They are robust to outliers of the data. The median and the IQR are unaffected by the
outliers.
Multivariate Moments

We can extend the definition of moments from univariate to multivariate random variables. The mean is essentially unaffected by this extension because it is just the collection of the means of the individual variables. However, if we extend the variance, we need to estimate the covariance between each pair of variables in addition to the variance of each variable. Moreover, we can also define multivariate analogs of skewness and kurtosis (coskewness and cokurtosis), discussed later in this chapter.
Covariance
With covariance, we focus on the relationship between the deviations of two variables from their respective means, rather than the deviation of a single variable from its mean:

Cov(X, Y) = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]E[Y]

The sample covariance estimator is analogous to this result. It is given by:

σ̂XY = (1/n) ∑_{i=1}^n (Xi − μ̂X)(Yi − μ̂Y)

where μ̂X and μ̂Y are the sample means of X and Y, respectively.

The sample covariance estimator is biased towards zero; the bias can be removed by multiplying by n/(n − 1), just as with the variance estimator.
Correlation
Correlation measures the strength of the linear relationship between two random variables. It is estimated by dividing the sample covariance by the product of the sample standard deviation estimators of each random variable:

ρ̂XY = σ̂XY/(√σ̂²X √σ̂²Y) = σ̂XY/(σ̂X σ̂Y)
The Mean of Two Random Variables and the CLT

We estimate the mean of each of two random variables the same way we estimate that of a single random variable:

μ̂x = (1/n) ∑_{i=1}^n xi

and

μ̂y = (1/n) ∑_{i=1}^n yi
Assuming each of the random variables is iid, we can apply the CLT to each mean estimator separately. If we instead consider the joint behavior (as a bivariate statistic), the CLT applies to the 2×1 vector that stacks the two mean estimators:

μ̂ = [μ̂x, μ̂y]'

This vector is asymptotically normally distributed as long as the random vector Zi = [Xi, Yi] is iid, with covariance matrix:

[σ²X    σXY]
[σXY    σ²Y]

Note that in a covariance matrix, the main diagonal displays the variances of the random variables, and the off-diagonal entries are the covariances between the pairs of random variables. The CLT for the bivariate case is then:

√n [μ̂x − μx]  →  N( [0],  [σ²X    σXY] )
   [μ̂y − μy]       [0]   [σXY    σ²Y]

Equivalently, without the √n scaling, the vector of sample means is approximately normally distributed as:

[μ̂x]  ≈  N( [μx],  [σ²X/n    σXY/n] )
[μ̂y]       [μy]   [σXY/n    σ²Y/n]
Example: Applying the Bivariate CLT

The annualized estimates of the means, variances, covariance, and correlation for the monthly returns of a stock portfolio (T) and government bonds (G) over 350 months are as shown below:

Moment:   μ̂T     σ²T     μ̂G     σ²G    σTG    ρTG
Value:    11.9   335.6   6.80   26.7   14.0   0.1434
We need to compare the volatility, interpret the correlation coefficient, and apply bivariate CLT.
Solution
Looking at the output, it is evident that the return from the stock portfolio is more volatile than the government bond return, since it has a much higher variance. The correlation between the two investments is positive but weak (0.1434).

Applying the bivariate CLT,

√n [μ̂x − μx]  →  N( [0],  [335.6   14.0] )
   [μ̂y − μy]       [0]   [14.0    26.7]

But the mean estimators themselves have a limiting distribution (assumed to be normal). So,

[μ̂x]  ≈  N( [μx],  [0.9589    0.04   ] )
[μ̂y]       [μy]   [0.04      0.07629]

Note that the new covariance matrix is equivalent to the previous covariance matrix divided by the sample size, n = 350.
Note that in the bivariate CLT, the correlation between the two sample mean estimators is the same as the correlation in the underlying data.
Coskewness and Cokurtosis are an extension of the univariate skewness and kurtosis.
Coskewness
These measures capture the likelihood of one variable taking a large directional value whenever the other variable is large in magnitude. When there is no sensitivity of the direction of one variable to the magnitude of the other, the two coskewness measures are 0. For example, the coskewness in a bivariate normal distribution is always 0, even when the correlation is different from 0. Note that there are two coskewness measures for a pair of random variables.

The coskewness is estimated using the same analogy, that is, by replacing the expectations with sample averages:

skew(X, X, Y) = [(1/n) ∑_{i=1}^n (xi − μ̂X)²(yi − μ̂Y)] / (σ̂²X σ̂Y)

skew(X, Y, Y) = [(1/n) ∑_{i=1}^n (xi − μ̂X)(yi − μ̂Y)²] / (σ̂X σ̂²Y)
Cokurtosis
The reference value of kurtosis for a normally distributed random variable is 3. However, comparing an estimated cokurtosis with that of the normal is not easy, since the cokurtosis of the bivariate normal depends on the correlation: when the cokurtosis kurt(X, X, Y, Y) equals 1, the random variables are uncorrelated, and the cokurtosis increases towards 3 as the magnitude of the correlation increases towards 1.
Practice Question
The monthly profits of a company over the last 100 months are summarized by the following statistics:

∑_{i=1}^100 xi = 3,353 and ∑_{i=1}^100 xi² = 844,536
What is the sample mean and standard deviation of the monthly profits?
Solution
μ̂ = X̄ = (1/n) ∑_{i=1}^n Xi

⇒ X̄ = (1/100) × 3,353 = 33.53

s² = (1/(n − 1)) ∑_{i=1}^n (Xi − μ̂)²

Note that

∑_{i=1}^n (Xi − μ̂)² = ∑_{i=1}^n (Xi² − 2Xiμ̂ + μ̂²) = ∑_{i=1}^n Xi² − 2μ̂ ∑_{i=1}^n Xi + nμ̂²

and since

μ̂ = (1/n) ∑_{i=1}^n Xi  ⇒  ∑_{i=1}^n Xi = nμ̂

we have

∑_{i=1}^n Xi² − 2μ̂ × nμ̂ + nμ̂² = ∑_{i=1}^n Xi² − nμ̂²

Thus:

s² = (1/(n − 1)) ∑_{i=1}^n (Xi − μ̂)² = (1/(n − 1)) [∑_{i=1}^n Xi² − nμ̂²] = (1/99)(844,536 − 100 × 33.53²) = 7,395.0496

s = √7,395.0496 = 85.99
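The shortcut formula above is easy to verify in Python (used here only as a calculator for the two sums given in the question).

    # Sample mean and standard deviation from the two sums above.
    from math import sqrt

    n = 100
    sum_x = 3_353
    sum_x2 = 844_536

    mean = sum_x / n                           # 33.53
    var = (sum_x2 - n * mean ** 2) / (n - 1)   # about 7395.05 (unbiased)
    std = sqrt(var)                            # about 85.99

    print(mean, round(var, 4), round(std, 2))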
Reading 17: Hypothesis Testing
Construct and apply confidence intervals for one-sided and two-sided hypothesis tests,
and interpret the results of hypothesis tests with a specific level of confidence.
Differentiate between a one-sided and a two-sided test and identify when to use each
test.
Explain the difference between Type I and Type II errors and how these relate to the size and power of a test.
Identify the steps to test a hypothesis about the difference between two population
means.
Explain the problem of multiple testing and how it can bias results.
A hypothesis is a statement about one or more population parameters that we wish to evaluate using sample data. Hypothesis testing examines whether the observed data are consistent with the hypothesis. It starts by stating the null hypothesis and the alternative hypothesis. The null hypothesis is an assumption about the population parameter, while the alternative hypothesis describes the parameter values for which the null hypothesis is rejected.
The critical values are determined by the distribution of the test statistic (when the null
hypothesis is true) and the size of the test (which gives the size at which we reject the null
hypothesis).
Components of the Hypothesis Testing
As stated earlier, the first stage of the hypothesis test is the statement of the null hypothesis. The
null hypothesis is the statement concerning the population parameter values. It brings out the
The null hypothesis, denoted as H0, represents the current state of knowledge about the
population parameter that’s the subject of the test. In other words, it represents the “status
quo.” For example, the U.S Food and Drug Administration may walk into a cooking oil
manufacturing plant intending to confirm that each 1 kg oil package has, say, 0.15% cholesterol
A test would then be carried out to confirm or reject the null hypothesis.
H0: μ = μ0

or, for a one-sided test,

H0: μ ≤ μ0

where:

μ = the true population mean, and

μ0 = the hypothesized value of the population mean.
The alternative hypothesis, denoted H1, is a contradiction of the null hypothesis. The alternative hypothesis determines the values of the population parameter at which the null hypothesis is rejected. Thus, rejecting H0 makes H1 valid. We accept the alternative hypothesis when the null hypothesis is rejected.

Using our FDA example above, the alternative hypothesis would be:

H1: μ ≠ μ0

or, for a one-sided test,

H1: μ > μ0

Note that the alternative hypothesis contradicts the corresponding statement of the null hypothesis.
A test statistic is a standardized value computed from sample information when testing
hypotheses. It compares the given data with what we would expect under the null hypothesis.
Thus, it is a major determinant when deciding whether to reject H0, the null hypothesis.
We use the test statistic to gauge the degree of agreement between the sample data and the null hypothesis. Analysts use the following general formula when calculating the test statistic:

Test statistic = (sample statistic − hypothesized value) / standard error of the sample statistic

The test statistic is a random variable that changes from one sample to another. Test statistics assume a variety of distributions. We shall focus on normally distributed test statistics because they are used to test hypotheses concerning means, regression coefficients, and other econometric models.
We shall consider the hypothesis test on the mean. Consider the null hypothesis H0: μ = μ0 and assume that the data are iid, so that the sample mean is asymptotically normally distributed:

√n(μ̂ − μ) ∼ N(0, σ²)

where σ² is the variance of the iid random variables. This asymptotic distribution implies that the standardized test statistic is:

T = (μ̂ − μ0)/√(σ̂²/n) ∼ N(0, 1)

Note that this is consistent with our initial definition of the test statistic. The specific test statistic used in practice depends on the distribution assumed for the data and the parameter being tested.
We can subdivide the set of values that the test statistic can take into two regions: the non-rejection region, which is consistent with H0, and the rejection region (critical region), which is inconsistent with H0. If the test statistic takes a value within the critical region, we reject the null hypothesis.

Just like any other statistic, the distribution of the test statistic must be fully specified under the null hypothesis (i.e., when H0 is true).
The Size of the Hypothesis Test and the Type I and Type II
Errors
While using sample statistics to draw conclusions about the parameters of the population as a
whole, there is always the possibility that the sample collected does not accurately represent the
population. Consequently, statistical tests carried out using such sample data may yield incorrect
results that may lead to erroneous rejection (or lack thereof) of the null hypothesis. We have two
types of errors:
Type I Error
Type I error occurs when we reject a true null hypothesis. For example, a type I error would occur if we concluded that the cholesterol content differs from the stated level when, in fact, it does not.
Type II Error
Type II error occurs when we fail to reject a false null hypothesis. In such a scenario, the test
provides insufficient evidence to reject the null hypothesis when it’s false.
The level of significance denoted by α represents the probability of making a type I error, i.e.,
rejecting the null hypothesis when, in fact, it’s true. α is the direct opposite of β, which is taken
to be the probability of making a type II error within the bounds of statistical testing. The ideal
but practically impossible statistical test would be one that simultaneously minimizes α and β.
We use α to determine critical values that subdivide the distribution into the rejection and the
non-rejection regions.
The Critical Value and the Decision Rule
The decision to reject or not to reject the null hypothesis is based on the distribution assumed by
the test statistic. This means if the variable involved follows a normal distribution, we use the
level of significance (α) of the test to come up with critical values that lie along with the standard
normal distribution.
The decision rule combines the critical value (denoted by Cα), the alternative hypothesis, and the test statistic (T). The decision rule specifies whether to reject or not to reject the null hypothesis based on a comparison of the test statistic with the critical value.

When testing against the two-sided alternative, the decision is to reject the null hypothesis if |T| > Cα; that is, reject the null hypothesis if the absolute value of the test statistic is greater than the critical value. When testing against a one-sided alternative, reject the null hypothesis if T < Cα when using a one-sided lower alternative (where Cα is the lower critical value) and if T > Cα when using a one-sided upper alternative. When a null hypothesis is rejected, the result is said to be statistically significant at the chosen significance level.
Note that prior to decision-making, one must decide whether the test should be one-tailed or
two-tailed. The following is a brief summary of the decision rules under different scenarios:
Left One-tailed Test

Decision rule: Reject H0 if the test statistic is less than the critical value. Otherwise, do not
reject H0.
Right One-tailed Test
Decision rule: Reject H0 if the test statistic is greater than the critical value. Otherwise, do not
reject H0.
Two-tailed Test
Decision rule: Reject H0 if the test statistic is greater than the upper critical value or less than the lower critical value. Otherwise, do not reject H0.
Consider, α=5%. Consider a one-sided test. The rejection regions are shown below:
The first graph represents the rejection region when the alternative is one-sided lower: the rejection region lies in the left tail of the distribution. The second graph represents the rejection region when the alternative is one-sided upper: the rejection region lies in the right tail of the distribution.
Consider the returns from a portfolio X = (x1 , x 2 ,… , x n ) from 1980 through 2020. The
approximated mean of the returns is 7.50%, with a standard deviation of 17%. We wish to
determine whether the expected value of the return is different from 0 at a 5% significance level.
Solution
n = 40, μ̂ = 0.075, μ0 = 0, σ̂ = 0.17

So,

T = (0.075 − 0)/√(0.17²/40) ≈ 2.79

At the significance level α = 5%, the critical value is ±1.96. Since this is a two-sided test, the rejection regions are (−∞, −1.96) and (1.96, ∞).

Since the test statistic (2.79) is higher than the critical value (1.96), we reject the null hypothesis and conclude that the expected return is different from 0.
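The test can be replicated in a few lines of Python with SciPy (a minimal sketch using the example's numbers; the reading itself does not rely on any software).

    # Two-sided z-test: H0: mu = 0 vs H1: mu != 0 with mean 7.5%, sd 17%, n = 40.
    from math import sqrt
    from scipy.stats import norm

    mu_hat, mu0, sigma_hat, n, alpha = 0.075, 0.0, 0.17, 40, 0.05

    t_stat = (mu_hat - mu0) / (sigma_hat / sqrt(n))   # about 2.79
    crit = norm.ppf(1 - alpha / 2)                    # 1.96
    p_value = 2 * (1 - norm.cdf(abs(t_stat)))

    print(round(t_stat, 2), round(crit, 2), round(p_value, 4), abs(t_stat) > crit)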
The example above is an example of a Z-test (which is emphasized in this chapter and follows from the central limit theorem (CLT)). However, we can use the Student's t-distribution if the random variables are iid and normally distributed and the sample size is small (n < 30). In that case, the test statistic uses the unbiased variance estimator s² and is given by:

T = (μ̂ − μ0)/√(s²/n) ∼ t(n−1)
The power of a test is the counterpart of the probability of a type II error. While the level of significance gives us the probability of rejecting the null hypothesis when it is, in fact, true, the power of a test gives the probability of correctly rejecting the null hypothesis when it is false. In other words, it gives the likelihood of rejecting H0 when, indeed, it is false. Denoting the probability of a type II error by β:

Power of a Test = 1 − β

The power of a test measures the likelihood that a false null hypothesis is rejected. It is influenced by the sample size, the distance between the hypothesized parameter value and the true value, and the size of the test.
Confidence Intervals
A confidence interval is the range of parameter values within which the true parameter can be found at a given confidence level. For instance, a 95% confidence interval constitutes the set of parameter values for which the null hypothesis cannot be rejected when using a 5% test size. Therefore, a 1 − α confidence interval contains the values that cannot be rejected at a test size of α.
It is important to note that the confidence interval depends on the alternative hypothesis
statement in the test. Let us start with the two-sided test alternatives.
H0: μ = 0

H1: μ ≠ 0

For this two-sided test, the confidence interval is given by:

[μ̂ − Cα × σ̂/√n, μ̂ + Cα × σ̂/√n]
Example: Constructing a Two-Sided Confidence Interval

Consider the returns from a portfolio X = (x1, x2, …, xn) from 1980 through 2020. The approximated mean of the returns is 7.50%, with a standard deviation of 17%. Calculate the 95% confidence interval for the mean return.

[μ̂ − Cα × σ̂/√n, μ̂ + Cα × σ̂/√n]

= [0.0750 − 1.96 × 0.17/√40, 0.0750 + 1.96 × 0.17/√40]

= [0.0223, 0.1277]

Thus, the confidence interval implies that any hypothesized value of the mean between 2.23% and 12.77% cannot be rejected at a 5% test size.
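The interval can be recomputed with SciPy's normal quantile function (a short Python check on the numbers in the example).

    # 95% two-sided confidence interval for the mean return.
    from math import sqrt
    from scipy.stats import norm

    mu_hat, sigma_hat, n, alpha = 0.075, 0.17, 40, 0.05
    half_width = norm.ppf(1 - alpha / 2) * sigma_hat / sqrt(n)

    print(round(mu_hat - half_width, 4), round(mu_hat + half_width, 4))  # 0.0223 0.1277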
One-Sided Alternative
For a one-sided alternative, the confidence interval has only one finite bound and is given by either:

(−∞, μ̂ + Cα × σ̂/√n]

or

[μ̂ − Cα × σ̂/√n, ∞)

depending on the direction of the alternative, where Cα is now the one-sided critical value (1.645 at the 5% level).

For the upper alternative,

H0: μ ≤ 0

H1: μ > 0

the confidence interval contains the values of μ0 that cannot be rejected:

[μ̂ − Cα × σ̂/√n, ∞) = [0.0750 − 1.645 × 0.17/√40, ∞) = [0.0308, ∞)
For the lower alternative,

H0: μ ≥ 0

H1: μ < 0

the confidence interval is:

(−∞, μ̂ + Cα × σ̂/√n] = (−∞, 0.0750 + 1.645 × 0.17/√40] = (−∞, 0.1192]

Note that the critical value decreased from 1.96 to 1.645 because the test is now one-sided rather than two-sided.
The p-Value
When carrying out a statistical test with a fixed value of the significance level (α), we merely
compare the observed test statistic with some critical value. For example, we might “reject
H0 using a 5% test” or “reject H0 at 1% significance level”. The problem with this ‘classical’
approach is that it does not give us details about the strength of the evidence against the null
hypothesis.
The p-value provides a more informative approach to hypothesis testing. The p-value is the lowest significance level at which we can reject H0. This means that the strength of the evidence against H0 increases as the p-value becomes smaller. The computation of the p-value depends on the alternative.

For one-tailed tests, the p-value is the probability that lies below the calculated test statistic for left-tailed tests, and the probability that lies above the test statistic for right-tailed tests.

Denoting the test statistic by T, the p-value for H1: μ > 0 is given by:

P(Z ≥ |T|) = 1 − Φ(|T|)

where Z is a standard normal random variable and Φ is its cumulative distribution function; the absolute value of T ensures that the probability is taken in the upper tail.

If the test is two-tailed, the p-value is given by the sum of the probabilities in the two tails. We start by determining the probability lying below the negative value of the test statistic, then add the probability lying above the positive value of the test statistic. That is, the p-value for a two-tailed test is:

2[1 − Φ(|T|)]
Let θ represent the probability of obtaining a head when a coin is tossed. Suppose we toss the
coin 200 times, and heads come up in 85 of the trials. Test the following hypothesis at 5% level of
significance.
H0: θ = 0.5

H1: θ < 0.5

Solution

Our p-value is given by P(X ≤ 85), where X ∼ Binomial(200, 0.5) with mean np = 200 × 0.5 = 100, assuming H0 is true. Using the normal approximation with a continuity correction,

P(Z < (85.5 − 100)/√50) = P(Z < −2.05) = 1 − 0.97982 = 0.02018
Recall that for a binomial distribution the variance is np(1 − p) = 200 × 0.5 × 0.5 = 50. (We have applied the Central Limit Theorem by treating the binomial distribution as approximately normal.)

Since the p-value is less than 0.05, H0 is extremely unlikely, and we have strong evidence against H0 in favor of H1. Clearly expressing this result, we could say:

"There is very strong evidence against the hypothesis that the coin is fair. We, therefore, reject the null hypothesis."

Remember, failure to reject H0 does not mean it is true. It means there is insufficient evidence to justify rejecting it.
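For comparison, the sketch below reproduces the normal approximation (with the continuity correction used above) and also reports the exact binomial p-value; SciPy is an illustrative assumption here, not something the reading requires.

    # Coin example: 85 heads in 200 tosses, H0: theta = 0.5 vs H1: theta < 0.5.
    from math import sqrt
    from scipy.stats import norm, binom

    n, x, p0 = 200, 85, 0.5

    # Normal approximation with a continuity correction (as in the text).
    z = (x + 0.5 - n * p0) / sqrt(n * p0 * (1 - p0))   # about -2.05
    p_approx = norm.cdf(z)

    # Exact binomial p-value for comparison.
    p_exact = binom.cdf(x, n, p0)

    print(round(z, 2), round(p_approx, 5), round(p_exact, 5))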
A CFA candidate conducts a statistical test about the mean value of a random variable X.
She obtains a test statistic of 2.2. Given a 5% significance level, determine and interpret the p-
value
Solution

Since this is a two-tailed test, the p-value is:

p-value = 2[1 − Φ(2.2)] = 2 × (1 − 0.9861) = 0.0278 = 2.78%
Interpretation
The p-value (2.78%) is less than the level of significance (5%). Therefore, we have sufficient
evidence to reject H0. In fact, the evidence is so strong that we would also reject H0 at
significance levels of 4% and 3%. However, at significance levels of 2% or 1%, we would not have sufficient evidence to reject H0.
It’s common for analysts to be interested in establishing whether there exists a significant
difference between the means of two different populations. For instance, they might want to
know whether the average returns for two subsidiaries of a given company
Wi = [X i, Y i]
Assume that the components Xi and Y iare both iid and are correlated. That is:
Corr(Xi , Yi ) ≠ 0
H 0 : μX = μY
H 1 : μX ≠ μY
In other words, we want to test whether the constituent random variables have equal means.
H0 : μ X − μ Y = 0
H1 : μ X − μ Y ≠ 0
Zi = Xi − Yi

Therefore, considering the above random variable, if the null hypothesis is correct, then:

T = μ̂z/√(σ̂²z/n) ∼ N(0, 1)
Note that the test statistic accounts for the correlation between Xi and Yi. It is easy to see that:

Var(Zi) = Var(Xi) + Var(Yi) − 2Cov(Xi, Yi)

which can be estimated as:

σ̂²z = σ̂²X + σ̂²Y − 2σ̂XY

and

μ̂z = μ̂X − μ̂Y

so that the test statistic becomes:

T = (μ̂X − μ̂Y)/√((σ̂²X + σ̂²Y − 2σ̂XY)/n)
This formula indicates that correlation plays a crucial role in determining the magnitude of the
test statistic.
Another special case arises when Xi and Yi are iid and independent of each other. The test statistic then becomes:

T = (μ̂X − μ̂Y)/√(σ̂²X/nX + σ̂²Y/nY)
X nY
An investment analyst wants to test whether there is a significant difference between the means
of the two portfolios at a 95% level. The first portfolio X consists of 30 government-issued bonds
and has a mean of 10% and a standard deviation of 2%. The second portfolio Y consists of 30
private bonds with a mean of 14% and a standard deviation of 3%. The correlation between the
two portfolios is 0.7. State the null hypothesis and determine whether it is rejected or otherwise.
Solution
H0: μX - μY=0 vs. H1: μX - μY ≠ 0.
Note that this is a two-tailed test. At 95% level, the test size is α=5% and thus the critical value
C α = ±1.96.
Recall that:

T = (μ̂X − μ̂Y)/√((σ̂²X + σ̂²Y − 2σ̂XY)/n) = (μ̂X − μ̂Y)/√((σ̂²X + σ̂²Y − 2ρ̂XYσ̂Xσ̂Y)/n)

= (0.10 − 0.14)/√((0.02² + 0.03² − 2 × 0.7 × 0.02 × 0.03)/30) = −10.215

The test statistic is far less than −1.96. Therefore, the null hypothesis is rejected at the 95% level.
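The arithmetic in this example is easy to verify with a few lines of Python (a calculator-style sketch using the example's numbers).

    # Test statistic for equal means with correlated series:
    # mu_X = 10%, mu_Y = 14%, sigma_X = 2%, sigma_Y = 3%, rho = 0.7, n = 30.
    from math import sqrt

    mu_x, mu_y = 0.10, 0.14
    s_x, s_y, rho, n = 0.02, 0.03, 0.7, 30

    var_z = s_x**2 + s_y**2 - 2 * rho * s_x * s_y   # variance of Z = X - Y
    t_stat = (mu_x - mu_y) / sqrt(var_z / n)

    print(round(t_stat, 3))  # about -10.215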
The Problem of Multiple Testing

Multiple testing occurs when multiple hypothesis tests are conducted on the same data set. The reuse of data produces spurious results and unreliable conclusions that do not hold up to scrutiny. The fundamental problem with multiple testing is that the test size (i.e., the probability that a true null is rejected) applies only to a single test. Repeated testing creates an effective test size that is much larger than the assumed size α and therefore increases the chance of false discoveries.

Some control methods have been developed to combat multiple testing. These include the Bonferroni correction, the False Discovery Rate (FDR), and the Familywise Error Rate (FWER).
Practice Question
An experiment was done to find out the number of hours that candidates spend preparing for the FRM Part 1 exam. It was discovered that for a sample of 10 students, the sample mean was 312.7 hours with a sample standard deviation of 7.2 hours. Which of the following is the 95% confidence interval for the mean number of hours spent?

A. [307.5, 317.9]

B. [307.6, 317.8]

C. [307.9, 317.5]

D. [307.3, 318.2]
The correct answer is A.

Since the sample is small and the population standard deviation is unknown, we use the t-distribution. To find the value of t(1−α/2), we use the t-table with (10 − 1 =) 9 degrees of freedom and α/2 = 0.025, which gives 2.262. The confidence interval is:

X̄ ± t(1−α/2) × s/√n = 312.7 ± 2.262 × 7.2/√10

= [307.5, 317.9]
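A short SciPy check of the t-based interval (Python is used here only to confirm the table value and the arithmetic):

    # t-based 95% confidence interval: n = 10, mean 312.7, sample sd 7.2.
    from math import sqrt
    from scipy.stats import t

    n, mean, s = 10, 312.7, 7.2
    t_crit = t.ppf(0.975, df=n - 1)    # about 2.262
    half_width = t_crit * s / sqrt(n)

    print(round(mean - half_width, 1), round(mean + half_width, 1))  # 307.5 317.9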
Reading 18: Linear Regression
Describe the models that can be estimated using linear regression and differentiate them from those that cannot.
Construct, apply, and interpret hypothesis tests and confidence intervals for a single regression coefficient.
Describe the relationship between a t-statistic, its p-value, and a confidence interval.
Linear regression is a statistical tool for modeling the relationship between two random variables. This chapter concentrates on the linear regression model with a single explanatory variable.

As stated earlier, linear regression determines the relationship between the dependent variable Y and the independent (explanatory) variable X. The linear regression with a single explanatory variable is given by:

Y = β0 + βX + ϵ
where:

Y = the dependent (explained) variable;

X = the independent (explanatory) variable;

β0 = the intercept;

β = the slope coefficient; and

ϵ = the error term (sometimes referred to as a shock). It represents the portion of Y that cannot be explained by X.
The assumption is that the expectation of the error is 0. That is, E(ϵ) = 0 and thus,
⇒ E[Y ] = β0 + βE[X]
Note that β0 is the value of E[Y] when X = 0. However, there are cases where the explanatory variable cannot equal 0. In that case, β0 is interpreted as the value that ensures the fitted regression line passes through the point of the means, Ȳ = β̂0 + β̂X̄, where Ȳ and X̄ are the means of the yi and xi values.
The independent variable can be continuous, discrete, or even a function of other variables. Despite this diversity, linear regression imposes three key restrictions:

1. The relationship between the dependent variable Y and the explanatory variables must be linear in the unknown coefficients.

2. The error term must be additive (although its variance is allowed to depend on the explanatory variables).

3. The independent (explanatory) variables must be observable, so that the model can actually be estimated from data.

For example, the model

Y = β0 + βX^k + ϵ

cannot be estimated using linear regression due to the presence of the unknown parameter k, which violates the first restriction (it is a nonlinear regression function). This kind of model requires nonlinear estimation methods.
Transformations
When a linear regression model does not satisfy the linearity conditions stated above, we can
reverse the violation of the restrictions by transforming the model. Consider the model:
Y = β0X βϵ
Where ϵ is the positive error term (shock). Clearly, this model violates the condition of the
restriction since X is raised to an unknown parameter β, and the error term is not additive.
However, we can make this model linear by taking the natural logarithm of both sides of the equation, so that:

ln(Y) = ln(β0) + β ln(X) + ln(ϵ)

The last equation can be written as:

Ỹ = β̃0 + βX̃ + ϵ̃

where Ỹ = ln(Y), X̃ = ln(X), β̃0 = ln(β0), and ϵ̃ = ln(ϵ).
Clearly, this equation satisfies the three linearity conditions. It is worth noting that when we interpret the parameters of the transformed model, we measure the change in the transformed variables, not the original ones. For instance, ln(Y) = ln(β0) + β ln(X) + ln(ϵ) implies that β represents the change in ln(Y) for a one-unit change in ln(X), which is approximately the percentage change in Y for a 1% change in X.
There are cases where the explanatory variables are binary numbers (0 and 1) representing the
occurrences of an event. These binary numbers are called dummies. For instance,
Yi = β0 + βDi + ϵi,  i = 1, …, n

where β is the coefficient on Di.
The equation will change to the one written below under the condition that Di = 0:
Y i = β0 + ϵ i
When Di = 1:
Y i = β0 + β + ϵi
This implies that when Di = 1, E(Yi|Di = 1) = β0 + β. For example, if Di indicates that the ratio of students to teachers in a school is low, then test scores have a population mean of β0 + β for such schools. The conditional expectations of Yi when Di = 1 and when Di = 0 therefore differ by β:

(β0 + β) − β0 = β
The Ordinary Least Squares (OLS) is a method of estimating the linear regression parameters by
minimizing the sum of squared deviations. The regression coefficients chosen by the OLS
estimators are such that the observed data and the regression line are as close as possible.
Consider a regression equation:
Y = β0 + βX + ϵ
Assume that xi and yi are linearly related. Then the parameters can be estimated using OLS. The estimators minimize the residual sum of squares:

∑_{i=1}^n (yi − β̂0 − β̂xi)² = ∑_{i=1}^n ϵ̂i²

where β̂0 and β̂ are the parameter estimators (the intercept and the slope, respectively) that minimize the squared deviations between the fitted line β̂0 + β̂xi and yi. The resulting estimators are:

β̂0 = Ȳ − β̂X̄

and

β̂ = ∑_{i=1}^n (xi − X̄)(yi − Ȳ) / ∑_{i=1}^n (xi − X̄)²

After the estimation of the parameters, the estimated regression line is given by:

ŷi = β̂0 + β̂xi

and the residuals are:

ϵ̂i = yi − ŷi = yi − β̂0 − β̂xi
The variance of the error term is estimated as:

s² = (1/(n − 2)) ∑_{i=1}^n ϵ̂i²

which can equivalently be written as:

s² = (n/(n − 2)) σ̂²Y (1 − ρ̂²XY)
Note that n-2 implies that two parameters are estimated and that s2 is an unbiased estimator of
σ2 . Moreover, it is assumed that the mean of the residuals is zero and uncorrelated with the
explanatory variables Xi .
If we multiply both the numerator and the denominator of the slope estimator by 1/n, we have:

β̂ = [(1/n) ∑_{i=1}^n (xi − X̄)(yi − Ȳ)] / [(1/n) ∑_{i=1}^n (xi − X̄)²]

Note that the numerator is the sample covariance between X and Y, and the denominator is the sample variance of X. So,

β̂ = σ̂XY/σ̂²X
Recall that:

Corr(X, Y) = ρXY = Cov(X, Y)/(σXσY)

⇒ σXY = ρXYσXσY

So,

β̂ = ρ̂XYσ̂Xσ̂Y/σ̂²X

∴ β̂ = ρ̂XYσ̂Y/σ̂X
Example: Estimating the Regression Coefficients

An investment analyst wants to explain the return from a portfolio (Y) using the prevailing interest rates (X) over the past 30 years. The mean interest rate is 7%, and the estimated covariance matrix of the returns and interest rates is:

[σ̂²Y    σ̂XY]   [1600   500]
[σ̂XY   σ̂²X ] = [500    338]
Assume that the analyst wants to estimate the linear regression equation:
Ŷi = β̂0 + β̂Xi
Estimate the parameters and, thus, the model equation.
Solution
Now,
β̂ = σ̂XY/σ̂²X = 500/338 = 1.4793

and the intercept follows from β̂0 = Ȳ − β̂X̄ = Ȳ − 1.4793 × 0.07, where Ȳ is the mean portfolio return.
Assumptions of OLS
1. The conditional mean of the error term given the independent variable Xi is 0; more precisely, E(ϵi|Xi) = 0. This also implies that the independent variable and the error term are uncorrelated.

2. Both the dependent and independent variables are iid. This assumption concerns the drawing of the sample: (Xi, Yi), i = 1, …, n are iid if simple random sampling is applied when drawing observations from a single large population. Although the iid assumption is reasonable for many data collection schemes, not all sampling schemes produce iid observations on (Xi, Yi).

3. Large outliers are unlikely. Under this assumption, observations whose values of Xi and/or Yi fall far outside the usual range of the data are unlikely. Results of OLS regression can be misleading when there are large outliers.

4. The variance of the independent variable is strictly positive, that is, σ²X > 0. This is required for the slope coefficient to be identified.

5. The variance of the error term does not depend on the explanatory variable, Var(ϵi|X) = σ² < ∞, and the variance of all the error terms (shocks) is the same. This is the homoskedasticity assumption.
Under these assumptions, the OLS parameter estimators are unbiased; that is, E(β̂0) = β0 and E(β̂) = β.

Lastly, the assumptions ensure that the estimated parameters are normally distributed in large samples. The slope estimator satisfies:

√n(β̂ − β) ∼ N(0, σ²/σ²X)

where σ² is the variance of the error term and σ²X is the variance of X. Similarly, for the intercept:

√n(β̂0 − β0) ∼ N(0, σ²(μ²X + σ²X)/σ²X)
According to the central limit theorem (CLT), β̂ can be treated as a normal random variable with mean equal to the true value β and variance σ²/(nσ²X). That is:

β̂ ∼ N(β, σ²/(nσ²X))

However, σ² and σ²X are unknown, so for hypothesis testing we replace them with their estimators:

σ̂² = s²

σ̂²X = (1/n) ∑_{i=1}^n (xi − X̄)²  ⇒  nσ̂²X = ∑_{i=1}^n (xi − X̄)²

so that the estimated variance of β̂ is:

σ̂²β = σ̂²/∑_{i=1}^n (xi − X̄)² = s²/(nσ̂²X)
The standard error of β̂, denoted SEEβ, is the square root of this variance:

SEEβ = √(s²/(nσ̂²X)) = s/(√n σ̂X)

Similarly, the estimated variance of the intercept is:

σ̂²β0 = s²(μ̂²X + σ̂²X)/(nσ̂²X)
When the OLS assumptions are met and large samples are used, the parameter estimators are approximately normally distributed. Therefore, we can run hypothesis tests on the parameters just as we did for the mean estimator. For instance, we may want to test the significance of a single regression coefficient, such as H0: β = 0 against H1: β ≠ 0.

Whenever a statistical test is being performed, the following procedure is generally considered ideal:
1. State both the null and the alternative hypotheses;

2. Select the appropriate test statistic, i.e., identify what is being tested (e.g., a population mean or a regression coefficient);

3. Specify the level of significance; and

4. Clearly state the decision rule to guide you in choosing whether to reject or not to reject the null hypothesis.
For a single regression coefficient, the test statistic is:

T = (β̂ − βH0)/SEEβ

This statistic has an asymptotic standard normal distribution and is compared with a critical value Ct. The null hypothesis is rejected if:

|T| > Ct

For instance, if we assume a 5% significance level for a two-sided test, the critical value is 1.96.
We can also evaluate p-values. For one-tailed tests, the p-value is the probability that lies below the calculated test statistic for left-tailed tests, and the probability that lies above the test statistic for right-tailed tests.

Denoting the test statistic by T, the p-value for H1: β > 0 is given by:

P(Z ≥ |T|) = 1 − Φ(|T|)

where Z is a standard normal random variable; the absolute value of T ensures that the probability is taken in the upper tail.

If the test is two-tailed, the p-value is the sum of the probabilities in the two tails: the probability lying below the negative value of the test statistic plus the probability lying above the positive value of the test statistic, that is, 2[1 − Φ(|T|)].

We can also construct confidence intervals (discussed in detail in the previous chapter). Recall that a confidence interval is the range of parameter values within which the true parameter can be found at a given confidence level. For instance, a 95% confidence interval constitutes the set of parameter values for which the null hypothesis cannot be rejected when using a 5% test size. For a two-tailed hypothesis test, the confidence interval for the slope is given by:

[β̂ − Cα × SEEβ, β̂ + Cα × SEEβ]
Example: Testing the Significance of the Slope

An investment analyst wants to explain the return from a portfolio (Y) using the prevailing interest rates (X) over the past 30 years. The mean interest rate is 7%, and the estimated covariance matrix of the returns and interest rates is:

[σ̂²Y    σ̂XY]   [1600   500]
[σ̂XY   σ̂²X ] = [500    338]
Assume that the analyst wants to estimate the linear regression equation:
Ŷi = β̂0 + β̂Xi
Test whether the slope coefficient is equal to zero and construct a 95% confidence interval for the slope.

Solution
The test statistic is:

T = (β̂ − βH0)/SEEβ̂

where

β̂ = σ̂XY/σ̂²X = 500/338 = 1.4793

and

SEEβ̂ = s/(√n σ̂X)

But

s² = (n/(n − 2)) σ̂²Y (1 − ρ̂²XY)

where ρ̂XY = σ̂XY/(σ̂Xσ̂Y) = 500/(√338 × √1600) = 0.6799, so ρ̂²XY = 0.4623. Therefore,

s² = (30/28) × 1600 × (1 − 0.4623) = 921.81  ⇒  s = 30.36

So,

SEEβ̂ = s/(√n σ̂X) = 30.36/(√30 × √338) = 0.3015

and

T = (β̂ − βH0)/SEEβ̂ = 1.4793/0.3015 = 4.91

For the two-tailed test at the 5% level, the critical value is 1.96. Since the test statistic is greater than the critical value, we reject the null hypothesis and conclude that the slope is significantly different from zero.

The 95% confidence interval for the slope is:

[β̂ − Ct × SEEβ̂, β̂ + Ct × SEEβ̂] = [1.4793 − 1.96 × 0.3015, 1.4793 + 1.96 × 0.3015] = [0.8883, 2.0703]
Practice Question 1
Assume that you have carried out a regression analysis (to determine whether the
slope is different from 0) and found out that the slope β^ = 1.156. Moreover, you have
constructed a 95% confidence interval of [0.550, 1.762]. What is the likely value of the corresponding test statistic?
A. 4.356
B. 3.7387
C. 0.7845
D. 0.6545
Solution
The correct answer is B.

This is a two-tailed test since we are asked to determine whether the slope is different from 0, so the 95% confidence interval is β̂ ± 1.96 × SEEβ. Using the lower bound:

1.156 − 1.96 × SEEβ = 0.550  ⇒  SEEβ = (1.156 − 0.550)/1.96 = 0.3092

T = (β̂ − βH0)/SEEβ = (1.156 − 0)/0.3092 = 3.7387
Practice Question 2
A trader develops a simple linear regression model to predict the price of a stock. The
estimated slope coefficient for the regression is 0.60, the standard error is equal
to 0.25, and the sample has 30 observations. Determine if the estimated slope coefficient is statistically significant at the 5% level.
Solution
Step 1: State the hypotheses

H0: β1 = 0

H1: β1 ≠ 0

Step 2: Compute the test statistic

T = (β1 − βH0)/Sβ1 = (0.60 − 0)/0.25 = 2.4

Step 3: Determine the critical value

With n − 2 = 28 degrees of freedom and a 5% significance level (two-tailed), the critical value is t(0.025, 28) = 2.048.

Step 4: State the decision rule

Reject H0; the slope coefficient is statistically significant since 2.4 > 2.048.
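The same decision can be reproduced with SciPy's t-distribution (a minimal Python check on the critical value and the comparison).

    # Significance test for the slope: beta_hat = 0.60, SE = 0.25, n = 30.
    from scipy.stats import t

    beta_hat, se, n = 0.60, 0.25, 30
    t_stat = beta_hat / se                # 2.4
    t_crit = t.ppf(0.975, df=n - 2)       # about 2.048

    print(round(t_stat, 2), round(t_crit, 3), abs(t_stat) > t_crit)  # 2.4 2.048 True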
Reading 19: Regression with Multiple Explanatory Variables
Interpret goodness of fit measures for single and multiple regressions, including R2 and
adjusted R2.
Construct, apply, and interpret joint hypothesis tests and confidence intervals for
the effect of more than one independent variable on a given dependent variable.
A multiple regression model with k explanatory variables is given by:

Yi = β0 + β1X1i + β2X2i + … + βkXki + εi,  for i = 1, 2, …, n

Intuitively, the multiple regression model has k slope coefficients and k + 1 regression coefficients in total. Normally, statistical software (such as Excel and R) is used to estimate the model.

The slope coefficient βj measures the change in the dependent variable Y when the independent variable Xj changes by one unit while holding the other independent variables constant. The interpretation of the multiple regression coefficients is therefore different from that of a linear regression with one independent variable: the effect of one variable is explored while keeping the other variables constant.
For instance, a linear regression model with one independent variable could be estimated as Ŷ = 0.6 + 0.85X1. In this case, the slope coefficient is 0.85, which implies that a one-unit increase in X1 leads to a 0.85-unit increase in Y.

Now, assume that we add a second independent variable so that the estimated regression equation is Ŷ = 0.6 + 0.85X1 + 0.65X2. A unit increase in X1 will not generally result in a 0.85-unit increase in Y unless X1 and X2 are uncorrelated. Therefore, we interpret 0.85 as follows: a one-unit increase in X1 leads to a 0.85-unit increase in the dependent variable Y, while keeping X2 constant.

Although the multiple regression parameters can be estimated by hand, doing so is challenging since it involves a large amount of algebra and the use of matrices. However, we can build a foundation of understanding using the multiple regression model with two explanatory variables.
Consider the two-variable model:

Yi = β0 + β1X1i + β2X2i + εi

The slope β1 can be estimated in three steps (a numerical sketch of this procedure follows below).

The first step is to regress X1 on X2 and obtain the residuals of X1i:

ϵX1i = X1i − α̂0 − α̂1X2i

where α̂0 and α̂1 are the OLS estimators from the regression of X1 on X2.

The next step is to regress Y on X2 to get the residuals of Yi:

ϵYi = Yi − γ̂0 − γ̂1X2i

where γ̂0 and γ̂1 are the OLS estimators from the regression of Y on X2.

The final step is to regress the residuals of Y on the residuals of X1:

ϵYi = β̂1ϵX1i + ϵi

Note that there is no constant in this last regression because the expected values of ϵYi and ϵX1i are both 0. The purpose of the first and second regressions is to exclude the effect of X2 from both Y and X1 by splitting each variable into a fitted value, which is correlated with X2, and a residual, which is uncorrelated with X2. The last regression therefore relates the components of Y and X1 that are uncorrelated with X2, which isolates β1.

The OLS estimator for β2 can be obtained analogously by exchanging the roles of X1 and X2 in the process above. By repeating this process, we can estimate a model with k explanatory variables. Most of the time, this is done using a statistical package such as Excel or R.
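The sketch below illustrates the three-step (partialling out) procedure on simulated data and checks it against a direct two-variable OLS fit. The data-generating parameters and the use of NumPy are assumptions of the illustration, not part of the reading.

    # Partialling-out estimate of beta1 versus a direct multiple regression.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 1_000
    x2 = rng.normal(size=n)
    x1 = 0.5 * x2 + rng.normal(size=n)               # X1 correlated with X2
    y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(size=n)

    def residuals(v, w):
        """Residuals from regressing v on a constant and w."""
        X = np.column_stack([np.ones(n), w])
        coef, *_ = np.linalg.lstsq(X, v, rcond=None)
        return v - X @ coef

    e_x1 = residuals(x1, x2)                 # step 1: purge X2 from X1
    e_y = residuals(y, x2)                   # step 2: purge X2 from Y
    beta1_steps = (e_x1 @ e_y) / (e_x1 @ e_x1)   # step 3: slope, no constant

    # Direct multiple regression for comparison.
    X = np.column_stack([np.ones(n), x1, x2])
    beta_direct, *_ = np.linalg.lstsq(X, y, rcond=None)

    print(round(beta1_steps, 4), round(beta_direct[1], 4))  # the two estimates coincide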
Suppose that we have n observations of the dependent variable (Y) and the independent variables (X1, X2, …, Xk). For us to make valid inferences from the regression, we make the following classical assumptions:

1. The relationship between the dependent variable Y and the independent variables X1, X2, …, Xk is linear in the coefficients.

2. The independent variables (X1, X2, …, Xk) are iid. Moreover, there is no exact linear relationship between two or more of the independent variables (no perfect multicollinearity).

3. The expected value of the error term, conditional on the independent variables, is 0: E(ϵ|X1, X2, …, Xk) = 0.

4. The variance of the error term is the same for all observations (homoskedasticity).

5. The error terms are uncorrelated across observations: E(ϵiϵj) = 0 for all i ≠ j.

6. The error term ϵ is normally distributed. This allows us to test hypotheses about the regression coefficients.

7. There are no large outliers, so that E(Xji⁴) < ∞ for all j = 1, 2, …, k.
These assumptions are almost the same as those of linear regression with one independent variable, except that the second assumption is extended to rule out exact linear relationships between the independent variables.

Goodness of Fit

The goodness of fit of a regression is measured using the coefficient of determination (R²) and the adjusted R².

Recall that the standard error of the estimate indicates how precise a forecast made by a regression model is likely to be. However, it does not tell us how well the independent variable explains the dependent variable. The coefficient of determination corrects this shortcoming.

The coefficient of determination measures the proportion of the total variation in the dependent variable that is explained by the independent variable. We can calculate the coefficient of determination in two ways.
First, the coefficient of determination can be computed by squaring the correlation coefficient (r) between the dependent and independent variables:

R² = r²

Recall that:

r = Cov(X, Y)/(σXσY)

where σX is the standard deviation of X and σY is the standard deviation of Y. However, this method only accommodates regression with one independent variable.
Example: Calculating the Coefficient of Determination

The correlation coefficient between the money supply growth rate (the dependent variable, Y) and inflation rates (the independent variable, X) is 0.7565. The standard deviation of the dependent variable is 0.050, that of the independent variable is 0.02, and their covariance is 0.0007565. Calculate and interpret the coefficient of determination.

Solution

We know that:

r = Cov(X, Y)/(σXσY) = 0.0007565/(0.05 × 0.02) = 0.7565

so that

R² = r² = 0.7565² ≈ 0.5723

So, in this regression, the inflation rate explains roughly 57.23% of the variation in the money supply growth rate.
Second, the coefficient of determination can be obtained by decomposing the total variation. If the regression relationship were not used, our best estimate for any observation of the dependent variable would be its mean, Ȳ. Alternatively, we can predict each observation using the regression equation, so that:

Yi = β0 + β1X1i + β2X2i + … + βkXki + εi = Ŷi + ϵ̂i

that is,

Yi = Ŷi + ϵ̂i

Now, if we subtract the mean of the dependent variable from both sides, square, and sum over all observations, we get:

∑_{i=1}^n (Yi − Ȳ)² = ∑_{i=1}^n (Ŷi − Ȳ + ϵ̂i)²

= ∑_{i=1}^n (Ŷi − Ȳ)² + 2∑_{i=1}^n ϵ̂i(Ŷi − Ȳ) + ∑_{i=1}^n ϵ̂i²

Note that, by the properties of the OLS residuals,

2∑_{i=1}^n ϵ̂i(Ŷi − Ȳ) = 0

so that

∑_{i=1}^n (Yi − Ȳ)² = ∑_{i=1}^n (Ŷi − Ȳ)² + ∑_{i=1}^n ϵ̂i²

But ϵ̂i² = (Yi − Ŷi)², so that

∑_{i=1}^n ϵ̂i² = ∑_{i=1}^n (Yi − Ŷi)²

Therefore,

∑_{i=1}^n (Yi − Ȳ)² = ∑_{i=1}^n (Ŷi − Ȳ)² + ∑_{i=1}^n (Yi − Ŷi)²
If the regression is useful for predicting Yi, then the residual (error) sum of squares will be small relative to the total variation. Now let:
Explained Sum of Squares: ESS = ∑_{i=1}^n (Ŷi − Ȳ)²

Residual Sum of Squares: RSS = ∑_{i=1}^n (Yi − Ŷi)²

Total Sum of Squares: TSS = ∑_{i=1}^n (Yi − Ȳ)²

Then:

TSS = ESS + RSS

1 = ESS/TSS + RSS/TSS

⇒ ESS/TSS = 1 − RSS/TSS

Now, recall that the coefficient of determination is the fraction of the total variation that is explained by the regression:

R² = Explained Variation/Total Variation = ESS/TSS = 1 − RSS/TSS
If a model does not explain any of the variation in the observed data, it has an R² of 0. On the other hand, if the model perfectly describes the data, it has an R² of 1. All other values lie between 0 and 1. An R² close to 1 would mean that the independent variables explain nearly all of the variation in the dependent variable.
Limitations of R2
1. As the number of explanatory variables increases, the value of R² never decreases, even if the new variable is almost completely irrelevant to the dependent variable. For instance, if a regression model with one explanatory variable is modified to have two explanatory variables, the new R² is greater than or equal to that of the single-explanatory-variable model. Only in the case where the coefficient on the added variable is exactly 0 does the R² remain unchanged.

3. There is no standard value of R² that is considered good, because typical values depend on the type of data being analyzed.
The Adjusted R²

Denoted by R̄², the adjusted R² measures the goodness of fit in a way that does not automatically increase when an independent variable is added to the model; that is, it is adjusted for degrees of freedom. Note that R̄² is produced by most statistical software. The relationship between R² and R̄² is given by:

R̄² = 1 − (RSS/(n − k − 1))/(TSS/(n − 1)) = 1 − ((n − 1)/(n − k − 1))(1 − R²)
where n is the number of observations and k is the number of independent variables.

The adjusted R² can increase when a variable is added, but only if the new variable improves the model more than would be expected by chance. If the added variable improves the model by less than expected by chance, the adjusted R² decreases.

When k ≥ 1, R² > R̄², since adding an extra independent variable that causes only a small increase in R² results in a decrease in R̄². This also explains why R̄² can be negative when the fit is extremely poor, even though R² always lies between 0 and 1.

A point to note is that when we use R̄² to compare regression models, the dependent variable must be defined in the same way and the sample size must be the same in each model; the same requirement applies to comparisons based on R².
The following are points to watch out for when relying on R² or R̄²:

It is not always true that the regressors are a true cause of the dependent variable just because there is a high R² or R̄².

It is not necessarily true that there is no omitted variable bias just because we have a high R² or R̄².

It is not necessarily true that we have the most appropriate set of regressors just because we have a high R² or R̄².

It is not necessarily true that we have an inappropriate set of regressors just because we have a low R² or R̄².

A high R̄² does not automatically indicate that the regression is well specified in the sense of including the right set of variables, since a high R̄² can reflect other features of the data used in the analysis. Moreover, R̄² can be negative if the regression model produces an extremely poor fit.
Joint Hypothesis Tests: The F-test

Previously, we conducted hypothesis tests on individual regression coefficients using the t-test. In multiple regression, we also need to perform joint hypothesis tests on several coefficients at once, using the F-test.

We cannot test the null hypothesis that all the slope coefficients are equal to 0 using separate t-tests, because individual tests on the coefficients do not account for the correlation between the estimated coefficients.

The F-test (a test of the regression's overall significance) determines whether the slope coefficients in a multiple linear regression are all equal to 0. That is, the null hypothesis is stated as:

H0: β1 = β2 = … = βk = 0

against the alternative that at least one slope coefficient is not equal to 0.
To compute the test statistic for this null hypothesis, we require:

I. The residual sum of squares, ∑_{i=1}^n (Yi − Ŷi)²;

II. The explained (regression) sum of squares, ∑_{i=1}^n (Ŷi − Ȳ)²;

III. The number of parameters to be estimated (for example, in a regression with one independent variable there are two parameters, the slope and the intercept); and

IV. The number of observations, n.
Using the above four requirements, we can determine the F-statistic. The F-statistic measures
how effective the regression equation explains the changes in the dependent variable. The F-
statistic is denoted by F(Number of slope parameters, n-(number of parameters)). For instance, the F-
statistic for multiple regression with two slope coefficients (and one intercept coefficient) is
denoted as F2, n-3. The value n-3 represents the degrees of freedom for the F-statistic.
The F-statistic is the ratio of the average regression (explained) sum of squares to the average sum of squared errors. The average regression sum of squares is the regression sum of squares divided by the number of slope parameters (k) estimated. The average sum of squared errors is the sum of squared errors divided by the number of observations (n) less the total number of estimated parameters (k + 1).

For a multiple linear regression model with k independent variables, the F-statistic is therefore:

F = (ESS/k)/(RSS/(n − (k + 1)))
In regression analysis output (ANOVA part), MSR and MSE are displayed as the first and the
second quantities under the MSS (mean sum of the squares) column, respectively. If the overall
If the independent variables do not explain any of the variations in the dependent variable, each
predicted independent variable Y^ i) possess the mean value of the dependent variable (Y ).
So, how do we decide on the F-test? We reject the null hypothesis at the α significance level if the computed F-statistic is greater than the upper α critical value of the F-distribution with the appropriate numerator and denominator degrees of freedom.
An analyst runs a regression of monthly value-stock returns on four independent variables over 48 months. The total sum of squares for the regression is 360, and the sum of squared errors is 120.
Test the null hypothesis at a 5% significance level (95% confidence) that all four independent variables' slope coefficients are jointly equal to zero.
Solution
H0: β1 = β2 = β3 = β4 = 0
Versus
H1: At least one βj ≠ 0, j = 1, …, 4
ESS = TSS − RSS = 360 − 120 = 240
F = (ESS/k) / (RSS/(n − (k + 1))) = (240/4) / (120/43) = 21.5
The 5% critical value of F(4, 43) is approximately 2.6, which is far below the computed statistic, so we reject the null hypothesis.
Conclusion: at least one of the 4 independent variables is significantly different from zero.
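The following is a minimal Python sketch of this calculation using the example's numbers; scipy is assumed to be available for the critical value and is not part of the reading.

    from scipy import stats   # assumed available

    n, k = 48, 4
    tss, rss = 360.0, 120.0
    ess = tss - rss                                   # explained sum of squares = 240
    f_stat = (ess / k) / (rss / (n - (k + 1)))        # = 21.5
    crit = stats.f.ppf(0.95, dfn=k, dfd=n - k - 1)    # 5% critical value of F(4, 43)
    print(round(f_stat, 2), round(crit, 2), f_stat > crit)   # reject H0 if True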
An investment analyst wants to determine whether the natural log of the ratio of bid-offer spread
to the price of a stock can be explained by the natural log of the number of market participants
and the amount of market capitalization. He assumes a 5% significance level. The following is the ANOVA portion of the regression output:
ANOVA        df       SS           MSS          F          Significance F
Regression   2        3,730.1534   1,865.0767   2,217.95   0.00
Residual     2,797    2,351.9973   0.8409
Total        2,799    5,801.2051
We are concerned with the ANOVA (analysis of variance) results. We need to conduct an F-test to determine whether the two independent variables are jointly significant.
Solution
H0: β1 = β2 = 0
vs
H1: At least one βj ≠ 0, j = 1, 2
There are two slope coefficients, k = 2 (the coefficients on the natural log of the number of market participants and the amount of market capitalization), which gives the degrees of freedom for the numerator of the F-statistic. For the denominator, the degrees of freedom are n − (k + 1) = 2,800 − 3 = 2,797.
The sum of the squared errors is 2,351.9973, while the regression sum of squares is 3,730.1534.
F(2, 2797) = (ESS/k) / (RSS/(n − (k + 1))) = (3,730.1534/2) / (2,351.9973/2,797) = 1,865.0767/0.8409 = 2,217.95
Since we are working at a 5% (0.05) significance level, we look at the second column of the F-distribution table, which displays critical values for 2 degrees of freedom in the numerator. Because the denominator degrees of freedom (2,797) exceed the largest tabulated value, the critical value lies between the entries for 120 and 1,000 denominator degrees of freedom.
F Distribution: Critical Values of F (5% significance level)
Denominator df \ Numerator df:   1      2      3      4      5      6      7      8      9      10
10 4.96 4.10 3.71 3.48 3.33 3.22 3.14 3.07 3.02 2.98
11 4.84 3.98 3.59 3.36 3.20 3.09 3.01 2.95 2.90 2.85
12 4.75 3.89 3.49 3.26 3.11 3.00 2.91 2.85 2.80 2.75
13 4.67 3.81 3.41 3.18 3.03 2.92 2.83 2.77 2.71 2.67
14 4.60 3.74 3.34 3.11 2.96 2.85 2.76 2.70 2.65 2.60
15 4.54 3.68 3.29 3.06 2.90 2.79 2.71 2.64 2.59 2.54
16 4.49 3.63 3.24 3.01 2.85 2.74 2.66 2.59 2.54 2.49
17 4.45 3.59 3.20 2.96 2.81 2.70 2.61 2.55 2.49 2.45
18 4.41 3.55 3.16 2.93 2.77 2.66 2.58 2.51 2.46 2.41
19 4.38 3.52 3.13 2.90 2.74 2.63 2.54 2.48 2.42 2.38
20 4.35 3.49 3.10 2.87 2.71 2.60 2.51 2.45 2.39 2.35
21 4.32 3.47 3.07 2.84 2.68 2.57 2.49 2.42 2.37 2.32
22 4.30 3.44 3.05 2.82 2.66 2.55 2.46 2.40 2.34 2.30
23 4.28 3.42 3.03 2.80 2.64 2.53 2.44 2.37 2.32 2.27
24 4.26 3.40 3.01 2.78 2.62 2.51 2.42 2.36 2.30 2.25
25 4.24 3.39 2.99 2.76 2.60 2.49 2.40 2.34 2.28 2.24
26 4.22 3.37 2.98 2.74 2.59 2.47 2.39 2.32 2.27 2.22
27 4.21 3.35 2.96 2.73 2.57 2.46 2.37 2.31 2.25 2.20
28 4.20 3.34 2.95 2.71 2.56 2.45 2.36 2.29 2.24 2.19
29 4.18 3.33 2.93 2.70 2.55 2.43 2.35 2.28 2.22 2.18
30 4.17 3.32 2.92 2.69 2.53 2.42 2.33 2.27 2.21 2.16
35 4.12 3.27 2.87 2.64 2.49 2.37 2.29 2.22 2.16 2.11
40 4.08 3.23 2.84 2.61 2.45 2.34 2.25 2.18 2.12 2.08
50 4.03 3.18 2.79 2.56 2.40 2.29 2.20 2.13 2.07 2.03
60 4.00 3.15 2.76 2.53 2.37 2.25 2.17 2.10 2.04 1.99
70 3.98 3.13 2.74 2.50 2.35 2.23 2.14 2.07 2.02 1.97
80 3.96 3.11 2.72 2.49 2.33 2.21 2.13 2.06 2.00 1.95
90 3.95 3.10 2.71 2.47 2.32 2.20 2.11 2.04 1.99 1.94
100 3.94 3.09 2.70 2.46 2.31 2.19 2.10 2.03 1.97 1.93
120 3.92 3.07 2.68 2.45 2.29 2.18 2.09 2.02 1.96 1.91
1000 3.85 3.00 2.61 2.38 2.22 2.11 2.02 1.95 1.89 1.84
As seen from the table, the critical value of the F-test for the null hypothesis to be rejected is
between 3.00 and 3.07. The actual F-statistic is 2217.95, which is far higher than the F-test
critical value, and thus we reject the null hypothesis that all the slope coefficients are equal to 0.
A confidence interval (CI) is a closed interval in which the actual parameter is believed to lie with some degree of confidence. Confidence intervals can be used to perform hypothesis tests. For instance, we may want to assess a stock's valuation using the capital asset pricing model (CAPM). In this case, we may wish to test whether the stock's beta equals the market's (average) level of systematic risk.
The same approach used in regression analysis with one explanatory variable also applies in a multiple regression setting: the confidence interval for a coefficient is the estimated coefficient plus or minus the critical t-value times its standard error.
An economist tests the hypothesis that interest rates and inflation can explain GDP growth in a country. Using 73 observations, the analyst formulates the following regression equation:
GDP growth = b̂0 + b̂1(Interest) + b̂2(Inflation)
The estimated coefficient on the inflation rate is 0.20 with a standard error of 0.04. What is the 95% confidence interval for the coefficient on the inflation rate?
A. 0.12024 to 0.27976
B. 0.13024 to 0.37976
C. 0.12324 to 0.23976
D. 0.11324 to 0.13976
Solution
From the regression analysis, β̂ = 0.20 and the estimated standard error is sβ̂ = 0.04. The number of degrees of freedom is 73 − 3 = 70, so the critical t-value at the 0.05 significance level is t0.025,70 = 1.994. Therefore, the 95% confidence interval for the inflation coefficient is:
0.20 ± 1.994 × 0.04 = 0.20 ± 0.07976 = [0.12024, 0.27976]
The correct answer is A.
Practice Questions
Question 1
An analyst runs a regression of monthly value-stock returns on four independent variables over 48 months. The total sum of squares for the regression is 360, and the sum of squared errors is 120. What is the R² of the regression?
A. 42.1%
B. 50%
C. 33.3%
D. 66.7%
The correct answer is D.
R² = ESS/TSS = (360 − 120)/360 = 66.7%
Question 2
Refer to the previous problem. What is the adjusted R² of the regression?
A. 27.1%
B. 63.6%
C. 72.9%
D. 36.4%
The correct answer is B.
R̄² = 1 − [(n − 1)/(n − k − 1)] × (1 − R²)
= 1 − [(48 − 1)/(48 − 4 − 1)] × (1 − 0.667)
= 63.6%
Question 3
Refer to the previous problem. The analyst now adds four more independent variables to the regression, and the new R² increases to 69%. What is the new adjusted R², and which model would the analyst most likely prefer?
A. The analyst would prefer the model with four variables because its adjusted R2 is
higher.
B. The analyst would prefer the model with four variables because its adjusted R2 is
lower.
C. The analyst would prefer the model with eight variables because its adjusted R2 is
higher.
D. The analyst would prefer the model with eight variables because its adjusted R2 is
lower.
The correct answer is A.
New R² = 69%
New adjusted R² = 1 − [(48 − 1)/(48 − 8 − 1)] × (1 − 0.69) = 62.6%
The analyst would prefer the first model because it has a higher adjusted R² (63.6% versus 62.6%); the four additional variables do not improve the fit by enough to offset the loss of degrees of freedom.
Question 4
An economist tests the hypothesis that GDP growth in a certain country can be explained by interest rates and inflation. Using 30 observations, she estimates a multiple regression and obtains a coefficient of 0.20 on the interest rate with a standard error of 0.05. Which of the following is the correct conclusion of a 5% two-tailed test of the significance of the interest rate coefficient?
A. Since the test statistic < t-critical, we accept H0; the interest rate coefficient
B. Since the test statistic > t-critical, we reject H0; the interest rate coefficient
C. Since the test statistic > t-critical, we reject H0; the interest rate coefficient is
D. Since the test statistic < t-critical, we accept H1; the interest rate coefficient
Hypothesis:
H 0 : β^ 1 = 0 vs H 1 : β^ 1 ≠ 0
The test statistic is:
t = (0.20 − 0)/0.05 = 4
The critical value is t(α/2, n−k−1) = t0.025,27 = 2.052 (which can be found on the t-table).
Since the test statistic (4) is greater than the critical value (2.052), we reject H0; the interest rate coefficient is statistically significant.
Reading 20: Regression Diagnostics
Explain two model selection procedures and how these relate to the bias-variance tradeoff.
Describe the various methods of visualizing residuals and their relative strengths.
Determine the conditions under which OLS is the best linear unbiased estimator.
An ideal regression model should consist of all the variables that explain the dependent variable and exclude those that do not. Model specification includes residual diagnostics and statistical tests of the assumptions underlying the OLS estimators. Basically, the choice of variables to be included in a model depends on the bias-variance tradeoff. For instance, large models that include all of the relevant variables are likely to have unbiased coefficient estimates. On the other side, smaller models lead to more precise (lower-variance) parameter estimates but risk excluding relevant variables.
Conventional specification analysis makes sure that the functional form of the model is adequate and that the residuals behave as assumed.
An omitted variable is one with a non-zero coefficient that is excluded from the regression model. Omitting such a variable has two consequences:
I. The remaining variables pick up the effect of the excluded variable through their common variation. Thus, their coefficients no longer consistently estimate the effect of a change in the independent variable on the dependent variable while keeping all other things constant.
II. The magnitude of the estimated residuals is larger than that of the true shocks, since the estimated residuals contain both the true shock and the effect of the omitted variable.
Consider the true model:
Yi = α + β1X1i + β2X2i + ϵi
If we omit X2 from the estimated model, then the estimated model is:
Yi = α + β1X1i + ϵi
In this case, the estimated coefficient on X1 converges not to β1 but to:
β1 + β2δ
Where:
δ = Cov(X1, X2)/Var(X1)
It is clear that the bias due to the omitted variable depends on the population coefficient of the excluded variable, β2, and the strength of the relationship between X2 and X1, represented by δ.
When the correlation between X1 and X2 is high, X1 can explain a significant proportion of the variation in X2, and hence the bias is large. On the other hand, if the independent variables are uncorrelated, δ = 0 and the coefficient on X1 is unbiased.
Conclusively, an omitted variable leads to bias in the coefficients on the variables that are included in the model.
An extraneous variable is one that is unnecessarily included in the model, that is, one whose true coefficient is zero. Including extraneous variables is costly because it reduces the precision of the estimated coefficients, which can be seen from the adjusted R²:
R̄² = 1 − ξ(RSS/TSS)
Where:
ξ = (n − 1)/(n − k − 1)
Looking at the formula above, adding more variables increases the value of k, which in turn increases the value of ξ and hence reduces R̄². However, if the added variable is relevant, the RSS falls, which offsets the effect of ξ and produces a larger R̄².
Contrastingly, this is not the case when the true coefficient of the added variable is equal to 0 because, in this case, the RSS remains essentially constant as ξ increases, leading to a smaller R̄² and larger standard errors. Lastly, if the correlation between X1 and X2 increases, the standard errors of the estimated coefficients rise.
The bias-variance tradeoff amounts to choosing between including irrelevant variables and excluding relevant variables. Bigger models tend to have lower bias because they include more of the relevant variables. However, their regression parameters are estimated less precisely. Conversely, regression models with fewer independent variables are characterized by lower variance but a higher risk of omitted variable bias.
Two model selection procedures help navigate this tradeoff.
1. General-to-Specific Model Selection
In the general-to-specific method, we start with a large general model that incorporates
all the relevant variables. Then, the reduction of the general model starts. We use
hypothesis tests to establish if there are any statistically insignificant coefficients in the
estimated model. When such coefficients are found, the variable with the coefficient
with the smallest t-statistic is removed. The model is then re-estimated using the
remaining set of independent variables. Once more, hypothesis tests are carried out to
establish if statistically insignificant coefficients are present. These two steps (remove
and re-estimate) are repeated until all coefficients that are statistically insignificant have
been removed.
2. m-fold Cross-Validation
The m-fold cross-validation model-selection method aims at choosing the model that performs best out-of-sample.
As a first step, the number of candidate models has to be decided, and this is determined in part by the number of explanatory variables. When this number is small, the researcher can consider all possible combinations. With 10 variables, for example, 1,024 (= 2¹⁰) distinct models are possible. For each candidate model, the procedure is then:
1. Divide the data into m equal-sized groups (blocks).
2. Estimate the model's parameters using m − 1 of the groups; these groups make up what we call the training block. The excluded group is referred to as the validation block.
3. Use the estimated parameters and the data in the excluded block (the validation block) to compute residuals. These are out-of-sample residuals, since they are computed using data not included in the estimation sample.
4. Repeat steps 2 and 3 m times, so that each group serves once as the validation block and is used to compute residuals.
5. Compute the sum of squared errors using the residuals estimated from the out-of-sample data.
6. Select the model with the smallest out-of-sample sum of squared residuals.
A minimal sketch of this procedure appears after this list.
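The sketch below illustrates the cross-validation loop for a single candidate model; only numpy is assumed, and the random assignment of observations to blocks is one of several possible choices.

    import numpy as np

    def mfold_cv_sse(X, y, m=5, seed=42):
        """Out-of-sample SSE of an OLS model estimated by m-fold cross-validation."""
        rng = np.random.default_rng(seed)
        folds = rng.integers(0, m, size=len(y))          # assign each row to a block
        sse = 0.0
        for j in range(m):
            train, valid = folds != j, folds == j        # training vs validation block
            beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
            resid = y[valid] - X[valid] @ beta           # out-of-sample residuals
            sse += np.sum(resid ** 2)
        return sse

    # The candidate model (set of columns in X) with the smallest SSE is selected.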
Heteroskedasticity
Recall that homoskedasticity is one of the critical assumptions underlying the distribution of the OLS estimator. That is, the variance of ϵi is constant and does not vary with any of the independent variables; formally, Var(ϵi│X1i, X2i, …, Xki) = σ². When this assumption fails, the shocks are said to be heteroskedastic.
Test for Heteroskedasticity
Halbert White proposed a simple test with a two-step procedure. The squared OLS residuals are regressed on:
1. A constant;
2. All of the independent variables; and
3. The cross products of the independent variables, including the product of each variable with itself (i.e., the squared terms).
Consider the model:
Yi = α + β1X1i + β2X2i + ϵi
The first step is to calculate the residuals using the OLS parameter estimates:
ϵ̂i = Yi − α̂ − β̂1X1i − β̂2X2i
In the second step, the squared residuals are regressed on the variables listed above:
ϵ̂i² = γ0 + γ1X1i + γ2X2i + γ3X1i² + γ4X2i² + γ5X1iX2i + ηi
The null hypothesis of homoskedasticity is: H0: γ1 = ⋯ = γ5 = 0.
The test statistic is nR², where R² is computed from the second (auxiliary) regression. Under the null, the test statistic has a χ² distribution with k(k + 3)/2 degrees of freedom, where k is the number of explanatory variables in the original model.
For instance, if the number of explanatory variables is two (k = 2), then the test statistic has a χ²₅ distribution.
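A minimal sketch of this two-step procedure follows. The data are simulated for illustration, and statsmodels is assumed to be available; it is not part of the reading.

    import numpy as np
    import statsmodels.api as sm   # assumed available

    rng = np.random.default_rng(0)
    n = 500
    X1, X2 = rng.normal(size=n), rng.normal(size=n)
    y = 1.0 + 0.5 * X1 - 0.3 * X2 + rng.normal(scale=1 + np.abs(X1), size=n)

    # Step 1: OLS of y on a constant, X1 and X2; keep the squared residuals.
    X = sm.add_constant(np.column_stack([X1, X2]))
    e2 = sm.OLS(y, X).fit().resid ** 2

    # Step 2: regress e^2 on a constant, X1, X2, X1^2, X2^2 and X1*X2.
    Z = sm.add_constant(np.column_stack([X1, X2, X1**2, X2**2, X1 * X2]))
    aux = sm.OLS(e2, Z).fit()
    lm_stat = n * aux.rsquared      # n*R^2 ~ chi-square(5) under the null
    print(lm_stat)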
The three common methods of handling data with heteroskedastic shocks include:
1. Ignoring the heteroskedasticity and estimating the parameters with OLS anyway. However simple, this method leads to less precise model parameter estimates.
2. Transformation of the data. For instance, positive data can be log-transformed to try to remove heteroskedasticity and give a better view of the data. Another transformation is dividing the variables by a scale-related variable.
3. Weighted least squares (WLS).
Weighted least squares is a more involved method that applies weights to the data before estimating the parameters. That is, if we know that Var(ϵi) = wi²σ², where wi is known, then we can transform the data by dividing by wi to remove the heteroskedasticity from the errors. In other words, WLS regresses Yi/wi on 1/wi and Xi/wi, such that:
Yi/wi = α(1/wi) + β(Xi/wi) + ϵi/wi
which can be written as:
Ȳi = αC̄i + βX̄i + ϵ̄i
Note that the parameters of the model above are estimated by applying OLS to the transformed data, that is, by regressing the weighted version of Yi, which is Ȳi, on the two weighted explanatory variables C̄i = 1/wi and X̄i = Xi/wi. Note that the WLS model does not explicitly include the intercept α, but its interpretation is unchanged: α is still the coefficient on the transformed constant.
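The following is a minimal sketch of this transformation, assuming the weights wi are known; only numpy is used.

    import numpy as np

    def wls_by_transformation(y, x, w):
        """WLS via data transformation: divide y, the constant, and x by the known
        weights w, then run OLS on the transformed data (no separate intercept)."""
        Xt = np.column_stack([1.0 / w, x / w])    # transformed constant and regressor
        yt = y / w
        beta, *_ = np.linalg.lstsq(Xt, yt, rcond=None)
        return beta                                # [alpha, beta] of the original model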
Multicollinearity
Multicollinearity occurs when one or more of the independent variables can be substantially explained by the other independent variables. For instance, in the case of two independent variables, there is evidence of multicollinearity if the two are highly (but not perfectly) correlated.
In contrast with multicollinearity, perfect collinearity is where one of the variables is perfectly correlated with the others, such that the R² of the regression of Xj on the remaining independent variables is precisely 1.
Conventionally, an R² above 90% in such a regression leads to problems in medium sample sizes, such as samples of around 100 observations. Multicollinearity does not pose an issue for parameter estimation; rather, it inflates the standard errors of the affected coefficients.
When multicollinearity is present, the coefficients in a regression model may be jointly statistically significant (the F-statistic is substantial) while the individual t-statistics are very small (less than 1.96), since the data identify the collective effect of the correlated variables rather than each variable's individual effect.
Addressing Multicollinearity
There are two ways of dealing with multicollinearity:
II. Identification of the multicollinear variables and exclusion of them from the model.
To identify the multicollinear variables, each candidate Xj is regressed on the remaining k − 1 independent variables. The variance inflation factor (VIF) for the variable Xj is given by:
VIFj = 1/(1 − Rj²)
Where Rj² originates from regressing Xj on the other variables in the model. When the value of the VIF is above 10 (equivalently, Rj² > 0.9), it is considered too high, and the variable is a candidate for removal. A minimal sketch of the VIF calculation follows.
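This sketch computes the VIF of one column of a regressor matrix by brute force; only numpy is assumed.

    import numpy as np

    def vif(X, j):
        """VIF of column j: regress X_j on the other columns (plus a constant)
        and return 1 / (1 - R_j^2)."""
        y = X[:, j]
        others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1 - resid.var() / y.var()
        return 1.0 / (1.0 - r2)

    # A VIF above 10 (R_j^2 > 0.9) is usually taken as a sign of serious multicollinearity.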
Residual Plots
Residual plots are used to identify deficiencies in a model's specification. When the residuals are not systematically related to any of the included independent (explanatory) variables and are relatively small (within ±4s, where s is the standard deviation of the model's shocks), the model is considered adequately specified.
Outliers
Outliers are values that, if removed from the sample, produce large changes in the estimated coefficients. They can also be viewed as data points that deviate markedly from the pattern of the rest of the data.
Cook's distance measures the impact of dropping a single observation j on the regression fit:
Dj = Σi=1..n (Ŷi(−j) − Ŷi)² / (k s²)
Where:
Ŷi(−j) = fitted value of Yi when observation j is excluded and the model is estimated on the remaining n − 1 observations;
Ŷi = fitted value of Yi using the full sample;
k = number of coefficients in the regression model;
s² = estimated variance of the regression residuals.
When an observation is an inlier (its exclusion does not materially affect the coefficient estimates), the value of its Cook's distance (Dj) is small. On the other hand, Dj is greater than 1 if the observation is an outlier.
Consider the following dataset:
Observation Y X
1 3.67 1.85
2 1.88 0.65
3 1.35 −0.63
4 0.34 1.24
5 −0.89 −2.45
6 1.95 0.76
7 2.98 0.85
8 1.65 0.28
9 1.47 0.75
10 1.58 −0.43
11 0.66 1.14
12 0.05 −1.79
13 1.67 1.49
14 −0.14 −0.64
15 9.05 1.87
If you look at the dataset above, it is easy to see that observation 15 is much larger than the rest of the observations, so it may be an outlier. However, we need to ascertain this.
We begin by fitting the model to the whole dataset and then to the 14 observations that remain after excluding observation 15:
Ŷi = 1.4465 + 1.1281Xi
Ŷi(−j) = 1.1516 + 0.6828Xi
Observation   Y       X       Ŷi        Ŷi(−j)     (Ŷi − Ŷi(−j))²
1             3.67    1.85    3.5330    2.4148     1.2504
2             1.88    0.65    2.1790    1.5954     0.3406
3             1.35    −0.63   0.7358    0.7214     0.0002
4             0.34    1.24    2.8453    1.9983     0.7174
5             −0.89   −2.45   −1.3174   −0.5213    0.6338
6             1.95    0.76    2.3039    1.6705     0.4012
7             2.98    0.85    2.4053    1.7320     0.4533
8             1.65    0.28    1.7624    1.3428     0.1761
9             1.47    0.75    2.2926    1.6637     0.3955
10            1.58    −0.43   0.9614    0.8580     0.0107
11            0.66    1.14    2.7325    1.9210     0.6585
12            0.05    −1.79   −0.5728   −0.0706    0.2522
13            1.67    1.49    3.1274    2.1690     0.9185
14            −0.14   −0.64   0.7245    0.7146     0.0001
15            9.05    1.87    3.5560    2.4284     1.2715
Sum                                                7.4800
Dj = Σi=1..n (Ŷi(−j) − Ŷi)² / (k s²) = 7.4800 / (2 × 3.554) = 1.0523
Since Dj is greater than 1, observation 15 is an outlier.
OLS is the best linear unbiased estimator (BLUE) when some key assumptions are met, which implies that it has the smallest possible variance among all estimators that are linear in the data and unbiased. Among the required assumptions are:
Linearity: the model must be linear in the parameters being estimated by OLS.
Random sampling: the data must have been randomly sampled from the population.
Two caveats about the BLUE property are worth noting:
I. A large proportion of estimators are not linear, such as maximum likelihood estimators (which may be biased).
II. The BLUE property depends heavily on the residuals being homoskedastic. In the case that the variances of the residuals vary with the independent variables, it is possible to construct more efficient linear unbiased estimators (LUE) of the coefficients α and β using WLS, but this requires extra assumptions.
When the residuals are iid and normally distributed with a mean of 0 and variance of σ², formally stated as ϵi ∼ iid N(0, σ²), OLS upgrades from BLUE to BUE (Best Unbiased Estimator) by virtue of having the smallest variance among all linear and non-linear unbiased estimators. However, normally distributed errors are not a requirement for accurate estimates of the model coefficients.
Practice Question 1
I. Homoskedasticity means that the variance of the error terms is constant for all
independent variables.
II. Heteroskedasticity means that the variance of error terms varies over the sample.
A. Only I
B. II and III
Solution
If the variance of the residuals is constant across all observations in the sample, the errors are said to be homoskedastic. If, instead, the variance of the residuals differs across observations, the data are said to exhibit heteroskedasticity, i.e., the variance of the residuals is not the same across observations. Heteroskedasticity poses a significant problem: it introduces a bias into the estimators of the standard errors of the regression coefficients.
Practice Question 2
correlated with the remaining variables.
C. Presence of heteroskedasticity.
Solution
An extraneous variable is one that is unnecessarily included in the model, that is, one whose true coefficient (and hence its consistently estimated value) is zero. Heteroskedasticity, by contrast, is present when the variance of the errors varies systematically with the independent variables of the model.
Reading 21: Stationary Time Series
Define white noise; describe independent white noise and normal (Gaussian) white
noise.
processes.
A time series is a set of observations on a variable recorded over time, for example, the monthly sales of a company for the past ten years. Time-series models are used to forecast the future values of the series. A time series can be decomposed into trend, seasonal, and cyclical components. A trending time series changes its level over time, while a seasonal time series has predictable changes at given times of the year. Lastly, a cyclical time series, as its name suggests, reflects cycles in the data. We will concentrate on cyclical dynamics (especially linear stochastic processes).
A stochastic process is a sequence of random variables ordered in time. The process is usually denoted by Yt, where the subscript indicates the time ordering, so that Ys occurs before Yt if s < t. In principle, a stochastic process has no fixed beginning or end: it starts in the infinite past and proceeds to the infinite future. However, only a finite subset of the process is ever observed in practice.
A series is said to be covariance stationary if both its mean and covariance structure are stable over time.
over time.
I. The mean does not change over time (it is constant). That is:
E(Yt) = μ ∀t
II. The variance does not change over time (it is constant and finite). That is:
V(Yt) = γ0 < ∞ ∀t
III. The autocovariances of the time series are finite, do not change over time, and depend only on the displacement h:
Cov(Yt, Yt−h) = γh ∀t
Covariance stationarity is crucial so that the time series has a constant relationship across time and so that the parameters are easily interpreted, since the parameter estimators will be asymptotically normally distributed.
It can be quite challenging to quantify the stability of a covariance structure. We will, therefore,
use the autocovariance function. The autocovariance is the covariance between the stochastic
process at a different point in time (analogous to the covariance between two random variables).
It is given by:
γh = Cov(Yt, Yt−h)
and, by symmetry, γh = γ−h = γ|h|.
This asserts the fact that the autocovariance depends only on the displacement h and not on the time t. The autocorrelation function scales the autocovariance by the variance, so that:
ρ(h) = Cov(Yt, Yt−h) / [√V(Yt) √V(Yt−h)] = γh/γ0
Similarly, for h = 0:
ρ(0) = γ0/γ0 = 1
The autocorrelation ranges from −1 to 1 inclusive. The partial autocorrelation function, denoted p(h), is the coefficient of Yt−h in a linear population regression of Yt on Yt−1, …, Yt−h. This regression is referred to as an autoregression because the regression is on lagged values of the variable itself.
White Noise
Assume that:
yt = ϵt
ϵt ∼ (0, σ²), σ² < ∞
where ϵt is a shock that is uncorrelated over time. Therefore, ϵt and yt are said to be serially uncorrelated.
This serially uncorrelated process with a zero mean and constant variance is referred to as zero-mean white noise (or simply white noise) and is written as:
ϵt ∼ WN(0, σ²)
And:
yt ∼ WN(0, σ²)
Note that ϵt and yt are serially uncorrelated but not necessarily serially independent. If y is both serially uncorrelated and serially independent, it is said to be independent white noise. Therefore, we write:
yt ∼ iid(0, σ²)
This is read as "y is independently and identically distributed with a mean of 0 and constant variance σ²." If, in addition, y follows a normal distribution, then y is called normal white noise or Gaussian white noise, written as:
yt ∼ iid N(0, σ²)
To characterize the dynamic stochastic structure of yt ∼ WN(0, σ²), note that the unconditional mean and variance are:
E(yt) = 0
And:
var(yt) = σ²
These two are constant, since only the displacement (and not time) affects the autocovariances. All the autocovariances and autocorrelations are zero beyond displacement zero, since white noise is serially uncorrelated:
γ(h) = { σ², h = 0; 0, h ≥ 1 }
ρ(h) = { 1, h = 0; 0, h ≥ 1 }
Beyond displacement zero, all partial autocorrelations of a white noise process are also zero, because by construction white noise is serially uncorrelated. The partial autocorrelation function is:
p(h) = { 1, h = 0; 0, h ≥ 1 }
Simple transformations of white noise are used to construct processes with much richer dynamics. Conversely, the 1-step-ahead forecast errors from a good model should themselves be white noise.
The mean and variance of a process conditional on its past are another crucial characterization of its dynamics. To compare conditional and unconditional moments, consider an independent white noise process. Its conditional mean and variance are:
E(yt|Ωt−1) = 0
var(yt|Ωt−1) = σ²
In general, the conditional mean and variance of a process need not be constant, but independent white noise series have identical conditional and unconditional means and variances.
Wold's Theorem
Wold's theorem states that any covariance-stationary series can be represented as a (possibly infinite) moving average of white noise shocks:
Yt = ϵt + β1ϵt−1 + β2ϵt−2 + ⋯ = Σi=0..∞ βiϵt−i, with β0 = 1
Where:
ϵt ∼ WN(0, σ²)
This general linear form is called the Wold representation of the series.
Time-Series Models
Autoregressive (AR) Models
AR models are time-series models widely used in finance and economics that link the current value of the stochastic process Yt to its previous value Yt−1. The first-order AR model, denoted AR(1), is given by:
Yt = α + βYt−1 + ϵt
Where:
α = intercept
β = AR parameter
ϵt ∼ WN(0, σ²)
Since Yt is assumed to be covariance stationary, the mean, variance, and autocovariances are all constant. Taking expectations of both sides of the AR(1) equation gives:
μ = α + βμ + 0
∴ μ = α/(1 − β)
Similarly, for the variance:
γ0 = β²γ0 + σ² + 0
∴ γ0 = σ²/(1 − β²)
Note that Cov(Yt−1, ϵt) = 0, since Yt−1 is uncorrelated with the current and future shocks ϵt, ϵt+1, …
The autocovariances of an AR(1) process are calculated recursively. For h ≥ 1:
Cov(Yt, Yt−h) = Cov(α + βYt−1 + ϵt, Yt−h) = βCov(Yt−1, Yt−h) + Cov(ϵt, Yt−h) = βγh−1
Iterating this recursion gives:
γh = βʰγ0, and more generally γh = β^|h| γ0
Dividing by γ0 gives the autocorrelation function (ACF):
ρ(h) = βʰγ0/γ0 = β^|h|
The ACF decays toward 0 as h increases, and it oscillates in sign if −1 < β < 0. The partial autocorrelation function of an AR(1) process cuts off after the first lag:
p(h) = { β^|h|, h ∈ {0, ±1}; 0, |h| ≥ 2 }
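A short simulation sketch (using illustrative parameter values) shows that the sample lag-1 autocorrelation of an AR(1) is close to β; only numpy is assumed.

    import numpy as np

    def simulate_ar1(alpha, beta, sigma, T, seed=0):
        """Simulate Y_t = alpha + beta*Y_{t-1} + eps_t, eps ~ N(0, sigma^2),
        starting at the unconditional mean alpha/(1 - beta)."""
        rng = np.random.default_rng(seed)
        y = np.empty(T)
        y[0] = alpha / (1 - beta)
        for t in range(1, T):
            y[t] = alpha + beta * y[t - 1] + rng.normal(scale=sigma)
        return y

    y = simulate_ar1(alpha=1.0, beta=0.8, sigma=1.0, T=5000)
    acf1 = np.corrcoef(y[1:], y[:-1])[0, 1]   # sample ACF at lag 1
    print(round(acf1, 2))                      # roughly 0.8 = beta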
The lag operator, denoted by L, is important for manipulating complex time-series models. As its name suggests, the lag operator moves the index of an observation one step back. Its key properties are:
(I) The lag operator moves the index of a time series one step back:
LYt = Yt−1
(II) Applying the lag operator m times (the mth-order lag operator) gives:
LᵐYt = Yt−m
(III) The lag operator leaves constants unchanged. For example, Lα = α.
(IV) A pth-order lag polynomial is defined as:
a(L) = 1 + a1L + a2L² + … + apLᵖ
so that:
a(L)Yt = Yt + a1Yt−1 + a2Yt−2 + … + apYt−p
(V) Lag polynomials have a multiplicative (commutative) property. For two lag polynomials a(L) and b(L):
a(L)b(L) = b(L)a(L)
(VI) Under some restrictive conditions, a lag operator polynomial can be inverted so that a(L)a(L)⁻¹ = 1. When a(L) is a first-order lag polynomial, 1 − a1L, it is invertible if |a1| < 1, in which case:
(1 − a1L)⁻¹ = Σi=0..∞ a1ⁱLⁱ
Applying the lag operator to the AR(1) model:
Yt = α + βYt−1 + ϵt
Yt = α + β(LYt) + ϵt
⇒ (1 − βL)Yt = α + ϵt
Inverting (1 − βL) when |β| < 1 gives:
Yt = (1 − βL)⁻¹(α + ϵt) = α Σi=0..∞ βⁱ + Σi=0..∞ βⁱLⁱϵt = α/(1 − β) + Σi=0..∞ βⁱϵt−i
The AR(p) model generalizes the AR(1) model to include p lags of Y:
Yt = α + β1Yt−1 + β2Yt−2 + … + βpYt−p + ϵt
Its mean and variance are:
E(Yt) = α/(1 − β1 − β2 − … − βp)
V(Yt) = γ0 = σ²/(1 − β1ρ1 − β2ρ2 − … − βpρp)
From the formulas for the mean and variance of the AR(p) model, the covariance-stationarity condition requires that:
β1 + β2 + ⋯ + βp < 1
The autocorrelation function of the AR(p) model has the same structural form as that of the AR(1) model: the ACF decays toward 0 as the displacement between the two observations increases, and it may oscillate.
The first-order moving average model, MA(1), is given by:
Yt = μ + θϵt−1 + ϵt
Where ϵt ∼ WN(0, σ²).
Evidently, the process Yt depends on the current shock ϵt and the previous shock ϵt−1, where the coefficient θ measures the magnitude of the effect of the previous shock on the process. Note that μ is the mean of the process.
For θ > 0, the MA(1) process is persistent because consecutive values are positively correlated. On the other hand, if θ < 0, the process mean-reverts quickly because the effect of the previous shock is partially reversed in the current period.
The MA(1) model is always a covariance-stationary process. The mean is μ, as shown above, while the variance is:
γ0 = (1 + θ²)σ²
The variance calculation uses the fact that the shocks are white noise and hence uncorrelated with one another. The autocorrelation function of the MA(1) process is:
ρ(h) = { 1, h = 0; θ/(1 + θ²), h = 1; 0, h ≥ 2 }
The partial autocorrelations (PACF) of the MA(1) model are more complex and are non-zero at all lags.
From the MA(1), we can generalize to the qth-order MA process. Denoted MA(q), it is given by:
Yt = μ + ϵt + θ1ϵt−1 + … + θqϵt−q
The mean of the MA(q) process is still μ, since all the shocks are white noise (their expectations are 0). The autocovariance function of the MA(q) process is:
γ(h) = { σ² Σi=0..q−h θiθi+h, 0 ≤ h ≤ q; 0, h > q }
where θ0 = 1.
The value of θ can be determined by substituting the observed value of the autocorrelation function and solving the resulting quadratic equation. The partial autocorrelation of an MA(q) model is non-zero at all lags and decays gradually.
Example: The Mean of an MA(2) Process
Given an MA(2) process Yt = 3.0 + 5ϵt−1 + 5.75ϵt−2 + ϵt, where ϵt ∼ WN(0, σ²), what is the mean of the process?
Solution
Comparing with the general form Yt = μ + θ1ϵt−1 + θ2ϵt−2 + ϵt, where μ is the mean, the mean of the above process is 3.0.
The ARMA model is a combination of AR and MA processes. A first-order ARMA model, ARMA(1,1), is given by:
Yt = α + βYt−1 + θϵt−1 + ϵt
Its mean and variance are:
μ = α/(1 − β)
γ0 = σ²(1 + 2βθ + θ²)/(1 − β²)
The autocovariance function is:
γ(h) = { σ²(1 + 2βθ + θ²)/(1 − β²), h = 0; σ²(β + θ)(1 + βθ)/(1 − β²), h = 1; βγh−1, h ≥ 2 }
The ACF of the ARMA(1,1) decays as the displacement h increases and oscillates if β < 0, which is consistent with an AR process. The PACF tends to 0 as h increases, which is consistent with an MA process. The decay of the ARMA's ACF and PACF is gradual, which distinguishes it from pure AR and pure MA models.
From the variance formula of the ARMA(1,1), it is easy to see that the process is covariance stationary provided |β| < 1.
ARMA(p,q) Model
As the name suggests, the ARMA(p,q) model is a combination of the AR(p) and MA(q) processes. Its form is given by:
Yt = α + β1Yt−1 + … + βpYt−p + θ1ϵt−1 + … + θqϵt−q + ϵt
When expressed using lag polynomials, this expression reduces to:
β(L)Yt = α + θ(L)ϵt
The ARMA(p,q) process is covariance stationary if its AR component is stationary. The autocovariances and ACF of an ARMA process are complex and decay at a slow rate.
Sample Autocorrelation
The sample autocorrelations are used in validating ARMA models. The sample autocovariance at displacement h is estimated as:
γ̂h = (1/(T − h)) Σi=h+1..T (Yi − Ȳ)(Yi−h − Ȳ)
A first test for autocorrelation is a graphical examination: plot the ACF and PACF of the residuals and check for any deficiencies, such as an inability of the model to capture the dynamics in the data.
The Box-Pierce and Ljung-Box tests both test the null hypothesis that:
H0: ρ1 = ρ2 = … = ρh = 0
Both test statistics are (approximately) chi-squared distributed with h degrees of freedom (χ²h). If the test statistic is larger than the critical value, the null hypothesis of no autocorrelation is rejected.
Box-Pierce Test
QBP = T Σi=1..h ρ̂i²
That is, the test statistic is the sum of the squared sample autocorrelations scaled by the sample size T, which makes it approximately chi-squared distributed under the null.
Ljung-Box Test
The Ljung-Box test is a refined version of the Box-Pierce test that is more appropriate in small samples. Its test statistic is:
QLB = T(T + 2) Σi=1..h ρ̂i²/(T − i)
A minimal sketch of both statistics follows.
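The sketch below reproduces both statistics from a vector of sample autocorrelations (the values are those used in the practice questions at the end of this reading); only numpy is assumed.

    import numpy as np

    def box_pierce(rho, T):
        """Q_BP = T * sum of squared sample autocorrelations (rho excludes lag 0)."""
        rho = np.asarray(rho)
        return T * np.sum(rho ** 2)

    def ljung_box(rho, T):
        """Q_LB = T(T+2) * sum of rho_i^2 / (T - i), the small-sample refinement."""
        rho = np.asarray(rho)
        i = np.arange(1, len(rho) + 1)
        return T * (T + 2) * np.sum(rho ** 2 / (T - i))

    rho_hat = [0.25, -0.10, -0.05]               # sample ACF at lags 1-3
    print(box_pierce(rho_hat, 300))              # 22.5
    print(round(ljung_box(rho_hat, 300), 2))     # 22.74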
Model Selection
The first step in model selection is the inspection of the sample autocorrelations and PACFs. These provide initial indications of the correlation structure of the data and can thus be used to select a set of candidate models.
The next step is to measure the fit of the selected model. The most commonly used measure of a model's fit is the mean squared error (MSE), defined as:
σ̂² = (1/T) Σt=1..T ϵ̂t²
When the MSE is small, the selected model explains more of the variation in the time series. However, choosing the model with the smallest MSE is equivalent to maximizing the coefficient of determination (R²), which can lead to overfitting. To address this problem, other measures of fit have been developed. These measures add an adjustment factor (penalty) to the MSE each time a parameter is added, and they are termed information criteria (IC).
There are two such ICs: the Akaike information criterion (AIC) and the Bayesian information criterion (BIC).
AIC = T ln σ̂² + 2k
Where T is the sample size and k is the number of parameters. The AIC adds a penalty of 2 for each additional parameter.
BIC = T ln σ̂² + k ln T
Where the variables are defined as in the AIC; however, note that the adjustment factor in the BIC increases with the sample size T. Hence, the BIC is a consistent model-selection criterion. Moreover, the BIC never selects a model larger than the one selected by the AIC. A minimal sketch of both criteria follows.
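The helper below computes both criteria from a model's residuals; it is an illustrative sketch, not part of the reading.

    import numpy as np

    def aic_bic(residuals, k):
        """AIC = T*ln(sigma^2_hat) + 2k and BIC = T*ln(sigma^2_hat) + k*ln(T),
        where sigma^2_hat is the mean squared error of the fitted model."""
        T = len(residuals)
        sigma2 = np.mean(np.asarray(residuals) ** 2)
        return T * np.log(sigma2) + 2 * k, T * np.log(sigma2) + k * np.log(T)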
The Box-Jenkins methodology provides criteria for selecting among models that are equivalent but have different parameterizations. The equivalency of the models implies that their means, ACFs, and PACFs are identical.
The Box-Jenkins methodology postulates two principles for selecting among such models. The first principle is parsimony: given two equivalent models, choose the one with fewer parameters. The second principle is invertibility, which states that when selecting an MA or ARMA specification, the invertible representation should be chosen.
Model Forecasting
Forecasting is the process of using current information to predict future values. In time-series analysis, the one-step forecast is the conditional expectation E(YT+1|ΩT), where ΩT denotes the information set at time T, which includes the entire history of Y (YT, YT−1, …) and of the shocks. For brevity, we write:
ET(YT+1) = E(YT+1|ΩT)
Principles of Forecasting
I. The expectation of a current or past value of a variable is its realization. That is, ET(YT) = YT. The same applies to the residuals: ET(ϵT−1) = ϵT−1.
II. The expectation of any future shock is 0. That is, ET(ϵT+h) = 0.
III. Forecasts are constructed recursively, beginning with ET(YT+1); the forecast for a given horizon builds on the forecast for the previous horizon.
For an AR(1) model, the one-step forecast is:
ET(YT+1) = α + βET(YT) + ET(ϵT+1) = α + βYT
Note that we use the current value YT, and the future shock ϵT+1 has expectation zero. Recursively:
ET(YT+2) = α + βET(YT+1) = α + β(α + βYT)
⇒ ET(YT+2) = α + αβ + β²YT
When h is large, βʰ becomes very small, by the covariance stationarity of Yt (|β| < 1). Therefore, the h-step forecast converges to:
lim h→∞ ET(YT+h) = lim h→∞ [α Σi=0..h−1 βⁱ + βʰYT] = α/(1 − β)
This limit is the mean of the AR(1) model. In other words, long-horizon forecasts revert to the mean-reverting level and no longer depend on YT:
lim h→∞ ET(YT+h) = E(Yt)
The forecast error is the difference between the true future value and the forecasted value, that is, ϵT+h = YT+h − ET(YT+h).
For longer time horizons, the forecasts are mostly functions of the model parameters.
Example: One-Step Forecast from an AR(1) Model
An AR(1) model for the default rate on premiums for an insurance company is given by:
Dt = 0.055 + 0.934Dt−1 + ϵt
Given that DT = 1.50, what is the one-step forecast of the default rate?
Solution
We need:
ET(DT+1) = α + βDT
⇒ ET(DT+1) = 0.055 + 0.934 × 1.5 = 1.456
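The recursion in Principle III is easy to automate; the sketch below is an illustration using the numbers from the example above.

    def ar1_forecasts(alpha, beta, y_T, h):
        """Recursive AR(1) forecasts: E_T(Y_{T+1}) = alpha + beta*Y_T and
        E_T(Y_{T+h}) = alpha + beta*E_T(Y_{T+h-1}); future shocks are set to zero."""
        forecasts, prev = [], y_T
        for _ in range(h):
            prev = alpha + beta * prev
            forecasts.append(prev)
        return forecasts

    # Worked example above: alpha = 0.055, beta = 0.934, D_T = 1.50
    print(round(ar1_forecasts(0.055, 0.934, 1.50, 1)[0], 4))   # 1.456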
Some time-series data are seasonal. For instance, sales in the summer may differ systematically from sales in the winter. Time series with deterministic seasonality are non-stationary, while those with stochastic seasonality are stationary and can therefore be modeled with AR or ARMA processes.
A pure seasonal lag model uses only lags at the seasonal frequency. For instance, for quarterly data, the pure seasonal AR(1) model is:
(1 − βL⁴)Yt = α + ϵt
So that:
Yt = α + βYt−4 + ϵt
A more flexible seasonal specification includes both short-term and seasonal lag components, with the short-term lags capturing the dynamics within a season.
Seasonality can also be introduced into AR, MA, or ARMA models by multiplying the short-run lag polynomial by the seasonal lag polynomial. For instance, a seasonal ARMA is specified as:
ARMA(p, q) × (ps , q s )f
Where p and q are the orders of the short run-lag polynomials, and ps and qs are the seasonal lag
polynomials. Practically, seasonal lag polynomials are restricted to one seasonal lag because the
accuracy of the parameter approximations depends on the number of full seasonal cycles in the
sample data.
Question 1
The following sample autocorrelation estimates are obtained using 300 data points:
Lag:          1      2      3
Coefficient:  0.25   −0.1   −0.05
What is the value of the Box-Pierce test statistic based on these three autocorrelations?
A. 22.5
B. 22.74
C. 30
D. 30.1
The correct answer is A.
QBP = T Σh=1..m ρ̂²(h) = 300 × (0.25² + (−0.1)² + (−0.05)²) = 22.5
Question 2
The following sample autocorrelation estimates are obtained using 300 data points:
Lag:          1      2      3
Coefficient:  0.25   −0.1   −0.05
What is the value of the Ljung-Box test statistic based on these three autocorrelations?
A. 30.1
B. 30
C. 22.5
D. 22.74
The correct answer is D.
QLB = T(T + 2) Σh=1..m ρ̂²(h)/(T − h)
= 300 × 302 × (0.25²/299 + (−0.1)²/298 + (−0.05)²/297)
= 22.74
Note: Provided the sample size is large, the Box-Pierce and Ljung-Box tests produce very similar values.
Question 3
Yesterday's realization, y(t−1), was 0.015, and the lagged shock was −0.160. Today's shock is 0.170.
If the weight parameter theta, θ, is equal to 0.70 and the mean of the process is 0.5, what is today's realization of the MA(1) process?
A. -4.205
B. 4.545
C. 0.558
D. 0.282
The correct answer is C.
Note that an MA(1) process does not depend on the previous realization yt−1; it depends only on the mean, today's shock, and the lagged shock:
yt = μ + θϵt−1 + ϵt
= 0.5 + 0.7 × (−0.160) + 0.170
= 0.558
Reading 22: Nonstationary Time Series
Explain how to construct an h-step-ahead point forecast for a time series with
seasonality.
Calculate the estimated trend value and form an interval forecast for a time series.
Recall that stationary time series have means, variances, and autocovariances that are independent of time. Any time series that violates this property is termed non-stationary.
Non-stationary time series include time trends, random walks (also called unit-root processes), and seasonalities. Time trends reflect the tendency of a time series to grow (or decline) over time.
Seasonalities occur when the time series changes systematically across seasons, such as from quarter to quarter. Seasonality can appear as shifts in the mean (for example, depending on the period of the year) or as a seasonal cycle in the series (which occurs when the current shock is related to the shock in the same period of previous years). Seasonalities can be modeled using dummy variables or by modeling period-over-period changes (such as year-over-year changes) in an attempt to remove the seasonal component.
In a random walk, the values of the time series depend on their previous values and the accumulated shocks. We discuss these features in turn below.
Time Trends.
The time trend deterministically shifts the mean of the time series. The time trend can be linear
Linear trend models are those that the dependent variable changes at a constant rate with time.
If the time series y t has a linear trend, we can model the series by the following equation:
Yt = β0 + β1 t + ϵt , t = 1 , 2, … , T
Where
ϵ t= a random error term (Shock) and is white noise (ϵt ∼ WN(0, σ 2))
From the equation above, β0 + β1t predicts the level of Yt at any time t. The slope β1 is described as the trend coefficient. We estimate both parameters, β0 and β1, using OLS, and the expected value of the series at time t is:
E(Yt) = β0 + β1t
Estimation of the Trend Value Under Linear Trend Models
Using the estimated coefficients, we can predict the value of the dependent variable at any time t (t = 1, 2, …, T). For instance, the trend value at time 2 is Ŷ2 = β̂0 + β̂1(2). We can also forecast the value of the time series beyond the sample period, that is, at T + 1. The predicted value of Yt at time T + 1 is ŶT+1 = β̂0 + β̂1(T + 1).
Example: Trend Projection Under a Linear Trend Model
A linear trend is defined as Yt = 17.5 + 0.65t. What is the trend projection for time 10?
Solution
Ŷ10 = 17.5 + 0.65 × 10 = 24
In a linear trend model, the period-to-period change is constant, which can pose problems when modeling economic and financial time series:
1. When the trend is positive, the proportional growth rate implied by the model decreases over time.
2. If the slope coefficient is less than 0, Yt will eventually tend toward negative values, a situation that is not plausible for most financial time series, e.g., asset prices and quantities.
Considering these limitations, we turn to the log-linear trend model, which has a constant proportional growth rate. Fitting a linear trend to a series that actually grows exponentially also produces serially correlated errors; the appropriate model for a time series with exponential growth is the log-linear trend.
Assume that the time series is defined as:
Yt = e^(β0 + β1t), t = 1, 2, …, T
which can also be written as (by taking natural logarithms on both sides):
ln Yt = β0 + β1t, t = 1, 2, …, T
By exponential growth, we mean growth at a constant rate with continuous compounding. This can be seen as follows. Using the formula above, the values of the time series at times 1 and 2 are Y1 = e^(β0 + β1(1)) and Y2 = e^(β0 + β1(2)). The ratio Y2/Y1 is:
Y2/Y1 = e^(β0 + β1(2)) / e^(β0 + β1(1)) = e^(β1)
Similarly, the value of the time series at time t is Yt = e^(β0 + β1t), and at t + 1 it is Yt+1 = e^(β0 + β1(t+1)), so that:
Yt+1/Yt = e^(β0 + β1(t+1)) / e^(β0 + β1t) = e^(β1)
Taking the natural logarithm of both sides of the above equation gives:
ln(Yt+1/Yt) = ln Yt+1 − ln Yt = β1
From the above results, the proportional growth in the time series over two consecutive periods is constant:
(yt+1 − yt)/yt = yt+1/yt − 1 = e^(β1) − 1
Example: Estimating a Trend Value Under a Log-Linear Trend Model
An investment analyst wants to model the weekly sales (in millions) of his company using sales data from January 2016 to February 2018. The fitted regression equation is:
ln Ŷt = 5.1062 + 0.0443t
What is the estimated trend value of sales in the 80th week?
Solution
From the regression equation, β̂0 = 5.1062 and β̂1 = 0.0443. We know that, under the log-linear trend model:
Ŷt = e^(β̂0 + β̂1t)
Therefore, Ŷ80 = e^(5.1062 + 0.0443 × 80) = e^(8.6502) ≈ 5,711 million.
Polynomial trends generalize the linear trend:
Yt = β0 + β1t + β2t² + ⋯ + βm t^m + ϵt, t = 1, 2, …, T
Practically speaking, polynomial time trends are usually limited to the linear trend (discussed above) and the quadratic (second-degree) trend. In a quadratic time trend, the parameters can be estimated using OLS. The estimated parameters are asymptotically normally distributed, so statistical inference using t-statistics and standard errors is valid provided the residuals are well behaved.
As the name suggests, the log-quadratic time trend is a mixture of the log-linear and quadratic trends. It is given by:
ln Yt = β0 + β1t + β2t²
It can be shown that the (approximate) growth rate of the log-quadratic time trend is β1 + 2β2t. This can be seen as follows.
The value of the time series at time t is Yt = e^(β0 + β1t + β2t²), and at t + 1 it is Yt+1 = e^(β0 + β1(t+1) + β2(t+1)²). This implies that the ratio:
Yt+1/Yt = e^(β0 + β1(t+1) + β2(t+1)²) / e^(β0 + β1t + β2t²) ≈ e^(β1 + 2β2t)
Example: Calculating the Growth Rate of a Log-Quadratic Time Trend
The monthly real GDP of a country over 20 years is modeled by a fitted log-quadratic time trend with β̂1 = 0.015 and β̂2 = 0.0000564.
What is the growth rate of the country's real GDP at the end of the 20 years?
Solution
The growth rate of a log-quadratic trend is:
β1 + 2β2t
Using β̂1 = 0.015 and β̂2 = 0.0000564 at t = 240, the growth rate is:
0.015 + 2 × 0.0000564 × 240 = 0.0421, or about 4.21%.
Note that, since the data are modeled monthly, the end of 20 years corresponds to the 240th month!
The coefficient of determination (R²) of a time-trend regression is always high and tends to 100% as the sample size increases. Therefore, R² is not an appropriate measure of fit for trend models.
Seasonality
Seasonality is a feature of a time series in which the data undergo regular and predictable changes that recur every calendar year. For instance, gas consumption in the US rises during the winter.
Seasonal effects are observed within a calendar year, e.g., spikes in sales over Christmas, while cyclical effects span time periods shorter or longer than one calendar year, e.g., spikes in sales due to low unemployment rates.
Suppose there are s seasons in a year. Then the pure seasonal dummy model is:
Yt = β0 + Σj=1..s−1 γjDjt + ϵt
Where the dummy variable for season j is:
Djt = { 1, t mod s = j; 0, t mod s ≠ j }
The mean of the series in season 1 is:
E[Y1] = β0 + γ1
and in season 2:
E[Y2] = β0 + γ2
Since in period s all the dummy variables are zero, the mean of the series in season s is:
E[Ys] = β0
The parameters of the seasonal model are estimated by OLS, regressing Yt on a constant and the s − 1 dummy variables.
Time trends and seasonalities can be insufficient to explain an economic time series, since their residuals might not be white noise. In the case where the detrended, deseasonalized series appears to be stationary but the residuals are not white noise, we can add stationary components (such as AR terms) to capture the remaining dynamics.
For example, consider the linear trend model:
Yt = β0 + β1t + ϵt
If the residuals are not white noise but the series appears otherwise stationary, we can include a lagged value of the series:
Yt = β0 + β1t + δ1Yt−1 + ϵt
Combining a trend, seasonal dummies, and an AR term gives:
Yt = β0 + β1t + Σj=1..s−1 γjDjt + δ1Yt−1 + ϵt
Note that the AR component reflects the cyclicality of the time series, while γj measures the shift of the mean away from the trend growth β1t. However, such combinations do not always produce a model with the required dynamics; for instance, the Ljung-Box statistic may still indicate remaining serial correlation in the residuals.
A random walk is a time series in which the value of the series in one period equals its value in the previous period plus an unpredictable random shock. A random walk is written as:
Yt = Yt−1 + ϵt
Substituting recursively,
Yt = (Yt−2 + ϵt−1) + ϵt = ⋯
so that:
Yt = Y0 + Σi=1..t ϵi
The random walk is a particular case of an AR(1) model with β0 = 0 and β1 = 1. We cannot use standard regression techniques to analyze such an AR(1), because a random walk has neither a finite mean-reverting level nor a finite variance. Recall that if Yt has a mean-reverting level, it is given by β0/(1 − β1). In a random walk, β0 = 0 and β1 = 1, so the ratio 0/(1 − 1) = 0/0 is undefined.
Moreover, the variance of a random walk grows with time:
V(Yt) = tσ²
The implication of this unbounded variance is that we cannot rely on standard statistical inference.
Unit Roots
We have so far discussed random walks without a drift, in which the current value is the best forecast of all future values. A random walk with a drift is given by:
Yt = β0 + β1Yt−1 + ϵt, with β0 ≠ 0 and β1 = 1
Or:
Yt = β0 + Yt−1 + ϵt
Where ϵt ∼ WN(0, σ²)
Recall that β1 = 1 implies an undefined mean-reversion level and hence non-stationarity. Therefore, we cannot use the AR model to analyze such a time series unless we transform it by first-differencing:
ΔYt = Yt − Yt−1 = β0 + ϵt, β0 ≠ 0
The unit root test applies the random walk concept to determine whether a time series is non-stationary, by focusing on the slope coefficient in a random-walk-with-drift specification of the AR(1) model. This test is popularly known as the Dickey-Fuller test.
Consider an AR(1) model. A time series generated by an AR(1) model is covariance stationary if the absolute value of the lag coefficient β1 is less than 1, that is, |β1| < 1. We therefore cannot rely on standard statistical results if the lag coefficient is greater than or equal to 1 in absolute value.
When the lag coefficient is exactly equal to 1, the time series is said to have a unit root. In other words, the time series is a random walk and hence not covariance stationary.
The unit root problem can also be expressed using lag polynomials. Let ψ(L) be the full lag polynomial, which can be factorized into the unit root term (1 − L) and a remainder lag polynomial ϕ(L), where ϕ(L) is the characteristic lag polynomial of a stationary time series. Moreover, let θ(L)ϵt be an MA component. Then a unit root process can be written as:
ψ(L)Yt = θ(L)ϵt
(1 − L)ϕ(L)Yt = θ(L)ϵt
An AR(2) model is given by Y t = 1.7Y t−1 − 0.7Y t−2 + ϵ t . Does the process contain a unit root?
Solution
Using the definition of a lag polynomial, we can write the above equation as:
(1 − 1.7L + 0.7L 2 )Y t = ϵ t
(1 − L)(1 − 0.7L)Y t = ϵ t
Therefore, the process has a unit root due to the presence of a unit root lag operator (1-L).
The presence of a unit root creates several problems:
1. A unit root process does not have a mean-reverting level. Recall that a stationary time series does mean-revert, so its long-run mean can be estimated.
2. In time series with unit roots, spurious relationships are a problem. A spurious relationship is one in which there is no meaningful link between two time series, yet a regression of one on the other appears statistically significant.
3. The parameter estimators in ARMA time series with a unit root follow Dickey-Fuller (DF) distributions, which are asymmetric, depend on the sample size, and have critical values that depend on whether deterministic time trends have been included. This characteristic makes it difficult to draw sound statistical inferences and to select models when fitting such series.
If the time series appears to have a unit root, the best approach is to model the first-differenced series as an autoregressive time series, which can then be analyzed using standard regression methods.
Recall that the random walk with a drift is a form of AR(1) model given by:
Yt = β0 + Yt−1 + ϵt
Where ϵt ∼ WN(0, σ²)
Clearly, β1 = 1 implies that the time series has an undefined mean-reversion level and is hence non-stationary. Therefore, we cannot use the AR model to analyze the series unless we transform it by first-differencing:
ΔYt = Yt − Yt−1 = β0 + ϵt, β0 ≠ 0
Using lag polynomials, let ΔYt = Yt − Yt−1, where Yt has a unit root (implying that ΔYt is stationary). Then:
(1 − L)ϕ(L)Yt = ϵt
ϕ(L)[(1 − L)Yt] = ϵt
ϕ(L)[Yt − LYt] = ϵt
ϕ(L)ΔYt = ϵt
Since ϕ(L) is the lag polynomial of a stationary series, the time series defined by ΔYt must be stationary.
Unit Root Test
The unit root test is carried out using the Augmented Dickey-Fuller (ADF) test. The test involves OLS estimation of a regression in which the first difference of the series is regressed on deterministic terms, the lagged level of the series, and lagged differences:
ΔYt = δ0 + δ1t + γYt−1 + Σi=1..p λiΔYt−i + ϵt
Where:
δ0 + δ1t = deterministic terms (a constant and a time trend)
γ = the coefficient on the lagged level, which is the focus of the test
The test statistic for the ADF test is the t-statistic of γ̂ (the estimate of γ).
To get the gist of this, assume that we are conducting an ADF test on a time series using the lagged level only:
ΔYt = γYt−1 + ϵt
If γ = 0, then ΔYt = ϵt, that is:
Yt = Yt−1 + ϵt
Therefore, the time series is a random walk if γ = 0. This leads to the hypotheses:
H0: γ = 0 (the time series has a unit root and is non-stationary)
H1: γ < 0 (the time series is covariance stationary)
You should note that this is a one-sided test, so the null hypothesis is not rejected for γ ≥ 0; negative values of γ correspond to a stationary AR time series. For example, recall that the AR(1) model
Yt = β0 + β1Yt−1 + ϵt
can be rewritten by subtracting Yt−1 from both sides:
ΔYt = β0 + γYt−1 + ϵt, where γ = β1 − 1
Clearly, if β1 = 1, then γ = 0. Therefore, testing γ = 0 is equivalent to testing β1 = 1. In other words, if there is a unit root in the AR(1) model, then in the regression of the first difference on the first lag of the series, γ = 0, implying that the series has a unit root.
Implementing an ADF test on a time series requires making two choices: which deterministic terms to include and how many lags of the differenced data to use. The number of lags should be large enough to absorb any short-run dynamics in the difference ΔYt.
The appropriate method of selecting the number of lagged differences is the AIC (which tends to select a somewhat larger model than the BIC). The maximum lag length should be set depending on the frequency of the data and the length of the sample.
The Dickey-Fuller distributions depend on the choice of deterministic terms included. The deterministic terms can be excluded altogether, or a constant, or a constant and a trend can be included. All else equal, adding more deterministic terms reduces the probability of rejecting the null hypothesis when the time series does not actually have a unit root, and hence reduces the power of the ADF test. Therefore, only the relevant deterministic terms should be included.
A practical approach is to include the deterministic terms that are significant at the 10% level. If the deterministic trend term is not significant at 10%, it is dropped and only the constant is used. If the constant is also insignificant, it too can be dropped and the test rerun without deterministic terms. It is important to note that the majority of macroeconomic time series require the use of the constant.
In the case that the null of the ADF test cannot be rejected, the series should be differenced and the test rerun on the differenced series to make sure it is stationary. If, after differencing twice, the series still appears non-stationary, then other transformations of the data, such as taking the natural logarithm (if the series is always positive), might be required. A minimal sketch of the ADF test follows.
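The sketch below applies the ADF test to a simulated random walk; statsmodels is assumed to be available and is not part of the reading.

    import numpy as np
    from statsmodels.tsa.stattools import adfuller   # assumed available

    rng = np.random.default_rng(1)
    y = np.cumsum(rng.normal(size=500))              # simulated random walk (true unit root)

    # "ct" includes a constant and a trend; the lag length is chosen by AIC.
    stat, pvalue, usedlag, nobs, crit, icbest = adfuller(y, regression="ct", autolag="AIC")
    print(round(stat, 3), round(pvalue, 3), crit)    # expect to fail to reject H0 of a unit root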
Example: Applying the ADF Test
A financial analyst wishes to conduct an ADF test on the log of 20 years of real GDP data, from 1999 to 2019.
The ADF output reports the results for different sets of deterministic terms (first column); the last three columns indicate the number of lags selected by the AIC and the 5% and 1% critical values appropriate to the underlying sample size and deterministic terms. The quantities in parentheses (below the parameter estimates) are the test statistics.
Solution
The hypothesis statement of the ADF test is H0: γ = 0 (unit root) against H1: γ < 0.
We begin by choosing the appropriate model. The deterministic trend term is significant (its test statistic exceeds the 1% and 5% critical values in absolute value); thus, we choose the model with both a constant and a trend.
For this model, the null hypothesis is rejected at the 99% confidence level, since |−4.376| > |−3.984|. Note that the null hypothesis is also rejected at the 95% confidence level. However, if the model had included only a constant, or no deterministic terms, the null hypothesis would not be rejected.
Seasonal differencing is an alternative method of modeling a seasonal time series with a unit root. Seasonal differencing subtracts the value in the same period of the previous year, which removes deterministic seasonalities, the unit root, and the time trend.
Consider the following quarterly time series with deterministic seasonality and a non-zero growth rate:
Yt = β0 + β1t + Σj=1..3 γjDjt + ϵt
Where ϵt ∼ WN(0, σ²).
Applying the seasonal (fourth) difference, Δ4Yt = Yt − Yt−4, the constant and the seasonal dummies cancel, since:
γj(Djt − Dj,t−4) = 0
⇒ Δ4Yt = β1(t − (t − 4)) + ϵt − ϵt−4
Therefore,
Δ4Yt = 4β1 + ϵt − ϵt−4
Intuitively, this is an MA-type process, which is covariance stationary. The seasonally differenced series is interpreted as the year-over-year change in Yt, or the year-over-year growth rate in the case of a logged series.
Spurious Regression
Spurious regression is a common problem in time-series analysis, but it can be avoided by making sure that each of the time series in question is stationary, using methods such as first-differencing and log transformation (when the series is strictly positive).
Practically, many financial and economic time series are highly persistent yet still stationary. Therefore, differencing is only required when there is clear evidence of a unit root in the series. Moreover, when it is difficult to tell whether a series is stationary or not, it is good statistical practice to build models both in levels and in differences.
For example, suppose we wish to model the interest rate on government bonds using an AR(3) model. The AR(3) is estimated on the levels, and the differences (if we assume the existence of a unit root) are modeled by an AR(2), since the AR order is reduced by one due to differencing. Considering models in both levels and differences allows us to choose the best model when the time series is highly persistent.
Forecasting
Forecasting with non-stationary time series is analogous to forecasting with stationary time series. For a linear trend model,
Yt = β0 + β1t + ϵt
the value at time T + h is
YT+h = β0 + β1(T + h) + ϵT+h
so the h-step-ahead forecast is ET(YT+h) = β0 + β1(T + h). This follows because β0 and β1(T + h) are constants, while ϵT+h ∼ WN(0, σ²) has expectation zero.
Recall that a seasonal time series can be modeled using dummy variables. Consequently, we need to keep track of the period being forecast. The seasonal dummy model is:
YT = β0 + Σj=1..s−1 γjDjt + ϵt
so the one-step-ahead forecast is:
ET(YT+1) = β0 + γj
Where:
j = (T + 1) mod s is the season being forecast, and the coefficient on the omitted season is 0.
For instance, for a quarterly seasonal model that excludes the dummy variable for the fourth quarter (Q4), the forecast made at T = 116 for period 117 is:
ET(YT+1) = β0 + γ(116+1) mod 4 = β0 + γ1
Therefore, the h-step-ahead forecast is obtained by tracking the period of T + h, so that:
ET(YT+h) = β0 + γj
Where:
j = (T + h) mod s
When the model is specified in logs with Gaussian shocks, ϵt ∼ iid N(0, σ²), forecasting the level of the series requires a lognormal correction. Recall that if X ∼ N(μ, σ²) and W = e^X, then W is lognormally distributed with mean:
E(W) = e^(μ + σ²/2)
For a log-linear trend model, the forecast of the log of the series is:
ET(ln YT+h) = β0 + β1(T + h)
Thus, the forecast of the level of the series is:
ET(YT+h) = e^(β0 + β1(T + h) + σ²/2)
Confidence intervals are constructed to reflect the uncertainty around the forecasted value. The confidence interval depends on the variance of the forecast error, which is defined as:
ϵT+h = YT+h − ET(YT+h)
i.e., the difference between the realized future value and the forecasted value.
For the linear trend model,
YT+h = β0 + β1(T + h) + ϵT+h
and clearly,
ET(YT+h) = β0 + β1(T + h)
If we wish to construct a 95% confidence interval, and the forecast error is Gaussian white noise, the interval is:
ET(YT+h) ± 1.96σ
σ is not known, and it is therefore estimated by the standard deviation of the forecast errors.
Intuitively, a confidence interval for any model can be computed in the same way from the distribution of the individual forecast error ϵT+h = YT+h − ET(YT+h).
A linear time trend model is estimated on annual government bond interest rates:

$$R_t = 0.25 + 0.000154t + \hat{\epsilon}_t$$

The standard deviation of the forecasting error is estimated to be $\hat{\sigma} = 0.0245$. What is the 95% confidence interval for the second year ahead if the forecasting errors (residuals) are Gaussian white noise?

(Note that the first time period is t = 2000 and the last time period is t = 2020.)

Solution

$$E_T(Y_{T+h}) \pm 1.96\hat{\sigma} = 0.28083 \pm 1.96 \times 0.0245 = [0.2328, 0.3289]$$

So the 95% confidence interval for the interest rate is between approximately 23.28% and 32.89%.
Question 1

A seasonal dummy model is estimated on the quarterly growth rates of mortgages:

$$Y_t = \beta_0 + \sum_{j=1}^{s-1} \gamma_j D_{jt} + \epsilon_t$$

The estimated parameters, using data up to the end of 2019, are $\hat{\gamma}_1 = 6.25$, $\hat{\gamma}_2 = 50.52$, $\hat{\gamma}_3 = 10.25$, and $\hat{\beta}_0 = -10.42$. What is the forecasted value of the growth rate for the second quarter?

A. 40.10

B. 34.56

C. 43.56

D. 36.90

The correct answer is A.

For a second-quarter forecast, the dummy variables take the values:

$$D_{jt} = \begin{cases} 1, & \text{for } Q_2 \\ 0, & \text{for } Q_1, Q_3 \text{ and } Q_4 \end{cases}$$

So,

$$E(\hat{Y}_{Q_2}) = \beta_0 + \sum_{j=1}^{3} \gamma_j D_{jt} = -10.42 + 0 \times 6.25 + 1 \times 50.52 + 0 \times 10.25 = 40.1$$
Question 2

A monthly time series model is estimated for a variable observed within California in the US. The time series model contains both a trend and a seasonal component. The trend component is reflected in the variable time (t), where t is measured in months, and the seasons are defined by quarterly dummy variables. The model started in April 2019; for example, y(T+1) refers to May 2019. The seasonal component is reflected by the intercept (15.5) plus the three seasonal dummy variables ($D_2$, $D_3$, and $D_4$).

The 11-step-ahead forecast combines the trend coefficient (0.20), the intercept, and the relevant seasonal dummy coefficient (4.0):

$$y_{T+11} = 0.20 \times 11 + 15.5 + 4.0 \times 1 = 21.7$$
Reading 23: Measuring Return, Volatility, and Correlation
Calculate, distinguish, and convert between simple and continuously compounded returns.
Define and distinguish between volatility, variance rate, and implied volatility.
Describe how the first two moments may be insufficient to describe non-normal
distributions.
Explain how the Jarque-Bera test is used to determine whether returns are normally
distributed.
Describe the power law and its use for non-normal distributions.
Define correlation and covariance and differentiate between correlation and dependence.
Describe properties of correlations between normally distributed variables when using a one-factor model.
Measurement of Returns
A return is a profit from an investment. Two common methods used to measure returns include:
P t − Pt−1
Rt =
P t−1
247
© 2014-2024 AnalystPrep.
Where
The time scale is arbitrary or shorter period such monthly or quarterly. Under the simple returns
method, the returns over multiple periods is the product of the simple returns in each period.
T
1 + RT = ∏ (1 + R t)
t=i
T
⇒ RT = (∏ (1 + R t)) − 1
t=i
Time Price
0 100
1 98.65
2 98.50
3 97.50
4 95.67
5 96.54
Calculate the simple return based on the data for all periods.
Solution

We need to calculate the simple return over multiple periods, which is given by:

$$1 + R_T = \prod_{t=1}^{T} (1 + R_t)$$

Time Price Rt 1 + Rt
0 100 − −
1 98.65 −0.0135 0.9865
2 98.50 −0.00152 0.998479
3 97.50 −0.01015 0.989848
4 95.67 −0.01877 0.981231
5 96.54 0.009094 1.009094
Product 0.9654

Note that

$$R_t = \frac{P_t - P_{t-1}}{P_{t-1}}$$

so that

$$R_1 = \frac{P_1 - P_0}{P_0} = \frac{98.65 - 100}{100} = -0.0135$$

and

$$R_2 = \frac{P_2 - P_1}{P_1} = \frac{98.50 - 98.65}{98.65} = -0.00152$$

and so on. Then,

$$\prod_{t=1}^{5} (1 + R_t) = 0.9865 \times 0.998479 \times \dots \times 1.009094 = 0.9654$$

So,

$$R_T = 0.9654 - 1 = -0.0346 = -3.46\%$$
Continuously Compounded Returns

Continuously compounded returns, denoted by $r_t$, are the difference between the natural logarithm of the asset price at time t and at time t−1:

$$r_t = \ln P_t - \ln P_{t-1}$$

Computing the compounded return over multiple periods is easy because it is just the sum of the single-period returns:

$$r_T = \sum_{t=1}^{T} r_t$$
Time Price
0 100
1 98.65
2 98.50
3 97.50
4 95.67
5 96.54
What is the continuously compounded return based on the data over all periods?
Solution
T
rT = ∑ rt
t=1
Where
rt = ln P t − ln P t−1
Time Price rt = ln P t − ln P t−1
0 100 −
1 98.65 −0.01359
2 98.50 −0.00152
3 97.50 −0.0102
4 95.67 −0.01895
5 96.54 0.009053
Sum −0.03521
Note that $r_1 = \ln 98.65 - \ln 100 = -0.01359$, $r_2 = \ln 98.50 - \ln 98.65 = -0.00152$, and so on.
Also,
$$r_T = \sum_{t=1}^{5} r_t = -0.01359 - 0.00152 - \dots + 0.009053 = -0.03521 = -3.521\%$$
Intuitively, the compounded return is an approximation of the simple return. The approximation, however, is prone to significant error over longer time horizons, so compounded and simple returns should not be treated as interchangeable over long horizons. The exact relationship between the compounded return and the simple return is given by the formula:

$$1 + R_t = e^{r_t}$$
What is the equivalent simple return for a 30% continuously compounded return?
Solution.
$$1 + R_t = e^{r_t} \;\Rightarrow\; R_t = e^{r_t} - 1 = e^{0.3} - 1 = 0.3499 = 34.99\%$$
It is worth noting that compound returns are always less than the simple return. Moreover,
simple returns are never less than -100%, unlike compound returns, which can be less than
-100%. For instance, the equivalent compound return for -65% simple return is:
rt = ln (1 − 0.65) = −104.98%
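The conversions above are easy to verify numerically. Below is a minimal sketch (using the price series from the earlier examples) that computes simple and continuously compounded returns and checks the relationship 1 + R = e^r.

```python
import numpy as np

prices = np.array([100, 98.65, 98.50, 97.50, 95.67, 96.54])  # prices from the example

simple = prices[1:] / prices[:-1] - 1           # R_t = (P_t - P_{t-1}) / P_{t-1}
logret = np.diff(np.log(prices))                # r_t = ln P_t - ln P_{t-1}

print((1 + simple).prod() - 1)                  # multi-period simple return, about -3.46%
print(logret.sum())                             # multi-period compounded return, about -3.52%
print(np.allclose(1 + simple, np.exp(logret)))  # True: 1 + R_t = e^{r_t}
```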
Volatility

The volatility of a variable, denoted σ, is the standard deviation of its returns. The standard deviation of returns measures the volatility of the return over the time period at which it is captured.

Consider the linear scaling of the mean and variance over the period at which the returns are measured. Assume that returns follow the model:

$$r_t = \mu + \sigma e_t$$

Where $E(r_t) = \mu$ is the mean of the return and $V(r_t) = \sigma^2$ is the variance of the return. $e_t$ is the shock, which is assumed to be iid with mean 0 and variance 1. Consequently, the return is also iid and normally distributed with mean μ and variance σ², i.e., $r_t \overset{iid}{\sim} N(\mu, \sigma^2)$.
Assume that we wish to calculate the return under this model over 10 working days (two weeks). The 10-day return is:

$$\sum_{i=1}^{10} r_{t+i} = \sum_{i=1}^{10} (\mu + \sigma e_{t+i}) = 10\mu + \sigma \sum_{i=1}^{10} e_{t+i}$$

So the mean of the return over the 10 days is 10μ and the variance is 10σ², since the $e_t$ are iid with unit variance; the 10-day volatility is therefore $\sqrt{10}\,\sigma$.

Therefore, the variance and the mean of the return scale with the holding period, while the volatility scales with the square root of the holding period. This feature allows us to convert volatility between different holding periods. For instance, given daily volatility, we obtain yearly (annualized) volatility by scaling it by $\sqrt{252}$:

$$\sigma_{annual} = \sqrt{252} \times \sigma_{daily}$$

Note that 252 is the conventional number of trading days in a year in most markets.
The monthly volatility of the price of gold is 4% in a given year. What is the annualized volatility?

Solution

Using the scaling argument, the corresponding annualized volatility is given by:

$$\sigma_{annual} = \sqrt{12} \times \sigma_{monthly} = \sqrt{12} \times 4\% = 13.86\%$$
Variance Rate

The variance rate, also termed the variance, is the square of volatility. Like the mean, the variance rate is linear in the holding period and can therefore be converted between periods. For instance, the annual variance rate is 12 times the monthly variance rate:

$$\sigma^2_{annual} = 12 \times \sigma^2_{monthly}$$

The variance rate is estimated as:

$$\hat{\sigma}^2 = \frac{1}{T} \sum_{t=1}^{T} (r_t - \hat{\mu})^2$$

Where $\hat{\mu}$ is the sample mean of the returns, and T is the sample size.
The investment returns of a certain entity for five consecutive days are 6%, 5%, 8%, 10%, and 11%. Calculate the variance rate of the returns.

Solution

$$\hat{\mu} = \frac{1}{5}(0.06 + 0.05 + 0.08 + 0.10 + 0.11) = 0.08$$

$$\hat{\sigma}^2 = \frac{1}{T} \sum_{t=1}^{T} (r_t - \hat{\mu})^2 = \frac{1}{5}\left[(0.06 - 0.08)^2 + (0.05 - 0.08)^2 + (0.08 - 0.08)^2 + (0.10 - 0.08)^2 + (0.11 - 0.08)^2\right] = 0.00052 = 0.052\%$$
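As a check, here is a minimal sketch (using the five daily returns above) that estimates the variance rate and converts the corresponding daily volatility to an annualized figure under the square-root-of-time rule.

```python
import numpy as np

r = np.array([0.06, 0.05, 0.08, 0.10, 0.11])   # daily returns from the example

mu_hat = r.mean()
var_rate = ((r - mu_hat) ** 2).mean()          # (1/T) * sum of squared deviations
vol_daily = np.sqrt(var_rate)

print(var_rate)                                # 0.00052
print(vol_daily)                               # about 0.0228 (2.28% per day)
print(np.sqrt(252) * vol_daily)                # annualized volatility, about 0.36
```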
Implied Volatility

Another estimate of volatility can be implied from option valuation. Options (both puts and calls) have payouts that are nonlinear functions of the price of the underlying asset. For instance, the payout from a put option is given by:

$$\max(K - P_T, 0)$$

where $P_T$ is the price of the underlying asset at maturity, K is the strike price, and T is the maturity. Therefore, the payout from an option is sensitive to the variance of the return on the asset.
The Black-Scholes-Merton model is commonly used for option pricing valuation. The model
relates the price of an option to the risk-free rate of interest, the current price of the underlying
asset, the strike price, time to maturity, and the variance of return.
For instance, the price of a call option can be denoted by:

$$C_t = f(r_f, T, P_t, \sigma^2)$$

Where:

$r_f$ = the risk-free rate of interest

T = time to maturity

$P_t$ = the current price of the underlying asset

$\sigma^2$ = the variance of the return on the asset

The implied volatility is the value of σ that reconciles the observed option price with the other parameters. It is an annualized value and does not need to be converted further.
The volatility index (VIX) measures the implied volatility of the S&P 500 over the coming 30 calendar days. The VIX is constructed from a variety of options with different strike prices. Similar volatility indices are constructed for a variety of other assets, such as gold, but the approach is only applicable to highly liquid derivative markets.
Financial returns are often assumed to follow a normal distribution. A normal distribution is thin-tailed and has no skewness and no excess kurtosis. This assumption is sometimes invalid because many return series are both skewed and heavy-tailed. To determine whether it is appropriate to assume that asset returns are normally distributed, we can use the Jarque-Bera test.
The Jarque-Bera Test
The Jarque-Bera test checks whether the skewness and kurtosis of returns are compatible with those of a normal distribution.

Denoting the skewness by S and the kurtosis by k, the hypothesis statement of the Jarque-Bera test is stated as:

$$H_0: S = 0 \text{ and } k = 3$$

vs.

$$H_1: S \neq 0 \text{ or } k \neq 3$$

The test statistic is:

$$JB = (T - 1)\left(\frac{\hat{S}^2}{6} + \frac{(\hat{k} - 3)^2}{24}\right)$$

The basis of the test is that, under the normal distribution, the sample skewness is asymptotically normally distributed with variance 6/T, so that $(T-1)\frac{\hat{S}^2}{6}$ is approximately chi-squared distributed with one degree of freedom ($\chi^2_1$); similarly, the sample kurtosis is asymptotically normally distributed with mean 3 and variance 24/T, so that $(T-1)\frac{(\hat{k}-3)^2}{24}$ is also approximately a $\chi^2_1$ variable. Combining these arguments, and given that the two components are independent:

$$JB \sim \chi^2_2$$
When the test statistic is greater than the critical value, the null hypothesis is rejected; otherwise, we fail to reject the null hypothesis. We use the chi-squared table with the appropriate degrees of freedom:
d.f. .995 .99 .975 .95 .9 .1 .05 .025 .01
1 0.00 0.00 0.00 0.00 0.02 2.71 3.84 5.02 6.63
2 0.01 0.02 0.05 0.10 0.21 4.61 5.99 7.38 9.21
3 0.07 0.11 0.22 0.35 0.58 6.25 7.81 9.35 11.34
4 0.21 0.30 0.48 0.71 1.06 7.78 9.49 11.14 13.28
5 0.41 0.55 0.83 1.15 1.61 9.24 11.07 12.83 15.09
6 0.68 0.87 1.24 1.64 2.20 10.64 12.59 14.45 16.81
7 0.99 1.24 1.69 2.17 2.83 12.02 14.07 16.01 18.48
8 1.34 1.65 2.18 2.73 3.49 13.36 15.51 17.53 20.09
9 1.73 2.09 2.70 3.33 4.17 14.68 16.92 19.02 21.67
10 2.16 2.56 3.25 3.94 4.87 15.99 18.31 20.48 23.21
11 2.60 3.05 3.82 4.57 5.58 17.28 19.68 21.92 24.72
12 3.07 3.57 4.40 5.23 6.30 18.55 21.03 23.34 26.22
For example, the critical value of a $\chi^2_2$ variable at the 5% significance level is 5.991; thus, if the computed test statistic is greater than 5.991, the null hypothesis is rejected.
An investment's return series has a skewness of 0.75 and a kurtosis of 3.15. If the sample size is 125, what is the JB test statistic? Do the data qualify as normally distributed at the 95% confidence level?

Solution

$$JB = (T - 1)\left(\frac{\hat{S}^2}{6} + \frac{(\hat{k} - 3)^2}{24}\right) = (125 - 1)\left(\frac{0.75^2}{6} + \frac{(3.15 - 3)^2}{24}\right) = 11.74$$

Since the test statistic is greater than the 5% critical value (5.991), the null hypothesis that the returns are normally distributed is rejected.
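For illustration, here is a minimal sketch that computes the JB statistic directly from a return series, following the (T − 1) scaling used in the formula above; the skewness and kurtosis estimators are computed by hand with numpy rather than taken from a statistics library.

```python
import numpy as np

def jarque_bera(r):
    """JB statistic using the (T - 1) scaling from the text."""
    T = len(r)
    z = r - r.mean()
    s2 = (z ** 2).mean()
    skew = (z ** 3).mean() / s2 ** 1.5          # sample skewness
    kurt = (z ** 4).mean() / s2 ** 2            # sample kurtosis (3 for a normal)
    return (T - 1) * (skew ** 2 / 6 + (kurt - 3) ** 2 / 24)

rng = np.random.default_rng(42)
normal_r = rng.normal(0, 0.01, 1000)            # thin-tailed returns
fat_r = rng.standard_t(df=3, size=1000) * 0.01  # heavy-tailed returns

print(jarque_bera(normal_r))                    # usually below the 5.991 critical value
print(jarque_bera(fat_r))                       # usually far above 5.991 -> reject normality
```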
The Power Law

The power law is an alternative method of determining whether returns are normal by studying the tails. For a normal distribution, the tails are thin, such that the probability of any return greater than kσ decreases sharply as k increases. Other distributions have tails that decay much more slowly.

Power law tails are such that the probability of observing a value greater than a given value x is:

$$P(X > x) = kx^{-\alpha}$$

where k and α are constants. The tail behavior of distributions is conveniently compared by taking the natural logarithm of this probability:

$$\ln P(X > x) = \ln k - \alpha \ln x$$

To test whether the power law holds, a graph of ln P(X > x) is plotted against ln x; under the power law, this plot is linear with slope −α.

For a normal distribution, ln P(X > x) declines roughly quadratically in x, and hence the plot decays quickly, reflecting thin tails. For other distributions, such as the Student's t distribution, the plot is approximately linear in ln x, so the tails decay at a slow rate, producing fatter tails (and more extreme observations).
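The scaling implied by the power law is easy to work with directly: doubling the threshold multiplies the tail probability by 2^(−α). The sketch below is a minimal illustration under assumed values of k and α (both hypothetical, chosen only for the example).

```python
# Hedged sketch: tail probabilities under a power law P(X > x) = k * x**(-alpha).
alpha = 3.0          # assumed tail index
k = 40.0             # assumed scale constant, e.g. calibrated so that P(X > 10) = 0.04

def tail_prob(x):
    return k * x ** (-alpha)

print(tail_prob(10))                   # 0.04
print(tail_prob(20))                   # 0.005: doubling x divides the probability by 2**alpha = 8
print(tail_prob(20) / tail_prob(10))   # 0.125 = 2**(-3), independent of k
```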
Dependence and Correlation of Random Variables.
Two random variables X and Y are said to be independent if their joint density function is the product of their marginal densities, that is, $f_{X,Y}(x, y) = f_X(x) f_Y(y)$. Otherwise, the random variables are said to be dependent. The dependence between random variables can be linear or nonlinear.

The linear relationship between the random variables can be measured using the correlation estimator, or equivalently through the regression:

$$Y_i = \alpha + \beta X_i + \epsilon_i$$

The slope β is related to the correlation coefficient ρ. That is, if β = 0, then the random variables $X_i$ and $Y_i$ are uncorrelated; otherwise, β ≠ 0. In fact, if the variances of the random variables are standardized so that they are both equal to unity ($\sigma^2_X = \sigma^2_Y = 1$), the slope of the regression equation is equal to the correlation coefficient (β = ρ). Thus, the regression equation reflects how the two variables move together linearly.
Nonlinear dependence is complex and thus cannot be summarized using a single statistic.
Measures of Correlation
Correlation can also be measured using the rank correlation (Spearman's rank correlation) and Kendall's τ correlation coefficient. The values of these correlation coefficients lie between −1 and 1. A value of 0 indicates no association, while values of 1 and −1 indicate perfect positive and perfect negative association, respectively.
Rank Correlation
The rank correlation uses the ranks of observations of random variables X and Y. That is, rank
correlation depends on the linear relationship between the ranks rather than the random
variables themselves.
The ranks are such that 1 is assigned to the smallest value, 2 to the next value, and so on, until the largest value is assigned the rank n. When a rank repeats itself, an average is computed over the positions the tied observations occupy, and each tied observation is assigned that averaged rank. Consider the ranks 1, 2, 3, 3, 3, 4, 5, 6, 7, 7. Rank 3 is repeated three times, and rank 7 is repeated two times. The tied 3's occupy positions 3, 4, and 5, so their averaged rank is $\frac{3+4+5}{3} = 4$. The tied 7's occupy positions 9 and 10, so their averaged rank is $\frac{9+10}{2} = 9.5$. Note that we are averaging the ranks that the tied observations would have been assigned had they not been tied.
Now, denote the ranks of X by $R_X$ and those of Y by $R_Y$. The rank correlation estimator is given by:

$$\hat{\rho}_s = \frac{\widehat{Cov}(R_X, R_Y)}{\sqrt{\hat{V}(R_X)}\sqrt{\hat{V}(R_Y)}}$$

Alternatively, when all the ranks are distinct (no repeated ranks), the rank correlation can be estimated as:

$$\hat{\rho}_s = 1 - \frac{6\sum_{i=1}^{n} (R_{X_i} - R_{Y_i})^2}{n(n^2 - 1)}$$
The intuition behind the last formula is that when highly ranked values of X are paired with similarly ranked values of Y, the differences $R_{X_i} - R_{Y_i}$ are small, and thus the correlation tends to 1. On the other hand, if small rank values of X are matched with large rank values of Y, the correlation tends to −1.

When the variables X and Y have a linear relationship, the linear and rank correlations have equal value. However, rank correlation is inefficient compared to linear correlation and is mainly used for confirmatory checks. On the other hand, rank correlation is insensitive to outliers because it only deals with the ranks and not the values of X and Y.
i X Y
1 0.35 2.50
2 1.73 6.65
3 −0.45 −2.43
4 −0.56 −5.04
5 4.03 3.20
6 3.21 2.31
What is the rank (Spearman's) correlation coefficient between X and Y?

Solution
Consider the following table, where the ranks of each variable have been filled in and the squared rank differences computed:
i X Y RX RY (RX − RY )2
1 0.35 2.50 3 4 1
2 1.73 6.65 4 6 4
3 −0.45 −2.43 2 2 0
4 −0.56 −5.04 1 1 0
5 4.03 3.20 6 5 1
6 3.21 2.31 5 3 4
Sum 10
Since there are no repeated ranks, then the rank correlation is given by:
2
6 ∑ni=1 (RXi − RY i ) .
ρ^s = 1 −
n(n 2 − 1)
6 × 10
= 1− = 1 − 0.2857 = 0.7143
6(62 − 1)
Kendal’s Tau is a non-parametric measure of the relationship between two random variables, say,
262
© 2014-2024 AnalystPrep.
Consider the set of random variables Xi and Yi . These pairs are said to be concordant for all i≠j if
the ranks of the components agree. That is, Xi > Xj when Y i > Yj or Xi < Xj when Y i < Y j . That is,
they are concordant if they agree on the same directional position (consistent). When the pairs
disagree, they are termed as discordant. Note that ties are neither concordant nor discordant.
Intuitively, random variables with a high number of concordant pairs have a strong positive
correlation, while those with a high number of discordant pairs are negatively correlated.
nc − nd nc nd
τ^ = = −
n(n − 1) n c + nd + nt n c + nd + n t
2
Where
nt =number of ties
It is easy to se that Kendal’s Tau is equivalent to the difference between the probabilities of
concordance and discordance. Moreover, when all the pairs are concordant, τ^ = 1 and when all
i X Y
1 0.35 2.50
2 1.73 6.65
3 −0.45 −2.43
4 −0.56 −5.04
5 4.03 3.20
6 3.21 2.31
What is Kendall’s τ correlation coefficient?
Solution
i X Y RX RY
1 0.35 2.50 3 4
2 1.73 6.65 4 6
3 −0.45 −2.43 2 2
4 −0.56 −5.04 1 1
5 4.03 3.20 6 5
6 3.21 2.31 5 3
Next, arrange the ranks in order of the rank of X. For each row, the concordant pairs (C) are the number of Y ranks below it that are greater than the given Y rank, and the discordant pairs (D) are the number of Y ranks below it that are smaller.

RX RY C D
1 1 5 0
2 2 4 0
3 4 2 1
4 6 0 2
5 3 1 0
6 5 − −
Total 12 3

For example, in the second row, C = 4 because four of the Y ranks below it (4, 6, 3, and 5) are greater than 2, and D = 0 because no Y rank below it is less than 2. This is continued up to the second-to-last row, since there are no rows below the last one. (Note that the fourth row has C = 0 and D = 2: neither of the Y ranks below it, 3 and 5, exceeds 6.)

So, $n_c = 12$ and $n_d = 3$.

$$\Rightarrow \hat{\tau} = \frac{n_c - n_d}{\frac{n(n-1)}{2}} = \frac{12 - 3}{\frac{6(6-1)}{2}} = \frac{9}{15} = 0.60$$
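Both rank-based measures are available in scipy; the short check below, using the six (X, Y) pairs from the example, should reproduce a Spearman coefficient of about 0.714 and a Kendall τ of about 0.60.

```python
from scipy import stats

x = [0.35, 1.73, -0.45, -0.56, 4.03, 3.21]
y = [2.50, 6.65, -2.43, -5.04, 3.20, 2.31]

rho_s, _ = stats.spearmanr(x, y)   # Spearman rank correlation
tau, _ = stats.kendalltau(x, y)    # Kendall's tau (no ties here, so tau-b equals tau-a)

print(round(rho_s, 4))             # 0.7143
print(round(tau, 4))               # 0.6
```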
Practice Question
Suppose that we know from experience that α = 3 for a particular financial variable,
A. 125%
B. 0.5%
C. 4%
D. 0.1%
Solution

From the given tail probability, the constant k can be recovered from $P(X > x) = kx^{-\alpha}$, and the required probability is then obtained by evaluating the same expression at the new threshold.
Reading 24: Simulation and Bootstrapping
Explain the use of antithetic and control variates in reducing Monte Carlo sampling
error.
Describe the bootstrapping method and its advantage over the Monte Carlo simulation.
Simulation builds an artificial environment in which a problem can be studied; by analyzing the simulated results, researchers gain insight into real problems. Examples of applications of simulation are the calculation of option payoffs and the assessment of the accuracy of an estimator. Two common simulation methods are Monte Carlo simulation (Monte Carlo) and bootstrapping.
Monte Carlo Simulation approximates the expected value of a random variable using the
numerical methods. The Monte Carlo generates the random variables from an assumed data
generating process (DGP), and then it applies a function(s) to create realizations from the
unknown distribution of the transformed random variables. This process is repeated (to improve
the accuracy), and the statistic of interest is then approximated using the simulated values.
Bootstrapping is a type of simulation that uses the observed data to simulate from the unknown distribution that generated them. In other words, bootstrapping combines the observed data and random resampling to create new samples.
The notable similarity between Monte Carlo and bootstrapping is that both aim at calculating the
expected value of the function by using simulated data (often by use of a computer).
Also, the contrasting feature in these methods is that in Monte Carlo simulation, a data
generating process (DGP) is entirely used to simulate the data. However, in bootstrapping,
observed data is used to generate the simulated data without specifying an underlying DGP.
Simulation requires the generation of random variables from an assumed distribution, usually on a computer. However, computer-generated numbers are not truly random; they are produced by deterministic functions (pseudo-random number generators, PRNGs) whose output appears random. The initial value fed to a PRNG is termed the seed value; a given seed always generates the same sequence of pseudo-random numbers.
The ability of the simulated variables from PRNGs to replicate makes it possible to use pseudo
numbers across multiple experiments because the same sequence of random variables can be
generated using the same seed value. Therefore, we can use this feature to choose the best
model or reproduce the same results in the future, for example to satisfy regulatory requirements.
Simulating random variables from a specific distribution is initiated by first generating a random
number from a uniform distribution (0,1). After that, the cumulative distribution of the
distribution we are trying to simulate is used to get the random values from that distribution.
That is, we first generate a random number U from U(0,1) distribution, then, we use the
generated random number to simulate a random variable X with the pdf f(x) by using the CDF,
F(x).
Let U be the probability that X takes a value less than or equal to x, that is,

$$U = P(X \leq x) = F(x)$$

Then we can derive the random variable x as:

$$x = F^{-1}(u)$$

To put this in a more straightforward perspective, the algorithm for simulating a random variable X with CDF F(x) is: (i) generate a random number u from the U(0,1) distribution; (ii) set $x = F^{-1}(u)$.
Example: Generating Random Variables from an Exponential Distribution

Assume that we want to simulate three random variables from an exponential distribution with parameter λ = 0.2, using the values 0.112, 0.508, and 0.005 from U(0,1).

Solution

This question assumes that the uniform random numbers have already been generated. The inverse of the exponential CDF is:

$$F^{-1}(u) = -\frac{1}{\lambda} \ln(1 - u) = -\frac{1}{0.2} \ln(1 - u) = -5\ln(1 - u)$$

So that:

$$x_1 = -5 \ln(1 - u_1) = -5 \ln(1 - 0.112) = 0.5939$$

$$x_2 = -5 \ln(1 - u_2) = -5 \ln(1 - 0.508) = 3.5464$$

$$x_3 = -5 \ln(1 - u_3) = -5 \ln(1 - 0.005) = 0.0251$$
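The inverse-transform step can be checked in a couple of lines; the sketch below reproduces the three exponential draws from the uniform values given above.

```python
import numpy as np

lam = 0.2
u = np.array([0.112, 0.508, 0.005])   # uniform(0, 1) draws from the example

x = -np.log(1.0 - u) / lam            # inverse transform: x = F^{-1}(u) = -ln(1 - u)/lambda
print(np.round(x, 4))                 # [0.5939 3.5464 0.0251]
```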
Monte Carlo simulation is used to estimate population moments or functions of them. The procedure is as follows.

Assume that X is a random variable that can be simulated, and let g(X) be a function that can be evaluated at the realizations of X. The simulation generates multiple copies of g(X) by repeatedly drawing X and evaluating the function. This process is repeated b times, so that a set of iid draws is generated from the (unknown) distribution of g(X), which can then be used to estimate the desired statistic.

For instance, if we wish to estimate the mean of g(X), it is estimated by:

$$\hat{E}(g(X)) = \frac{1}{b} \sum_{i=1}^{b} g(X_i)$$

Because the generated variables are iid, the Law of Large Numbers implies that:

$$\lim_{b \to \infty} \hat{E}(g(X)) = E(g(X))$$

Also, the Central Limit Theorem applies to the estimated mean, so that:

$$Var\left[\hat{E}(g(X))\right] = \frac{\sigma^2_g}{b}$$

The second moment, which is the variance (the standard variance estimator), is estimated as:

$$\hat{\sigma}^2_g = \frac{1}{b} \sum_{i=1}^{b} \left(g(X_i) - \hat{E}[g(X)]\right)^2$$

From the CLT, the standard error of the simulated expectation is given by:

$$\sqrt{\frac{\sigma^2_g}{b}} = \frac{\sigma_g}{\sqrt{b}}$$

The standard error of the simulated expectation measures the level of accuracy of the estimated expected value.
Another quantity that can be calculated from the simulation is the α-quantile, obtained by arranging the b draws in ascending order and then selecting the value in position bα of the sorted set.

Moreover, simulation can be used to determine the finite-sample properties of estimated parameters. Assume that the sample size n is large enough so that the approximation by the CLT is adequate. Now, consider the finite-sample distribution of a parameter estimator $\hat{\theta}$. Using the assumed DGP, n observations are simulated:

$$X = [x_1, x_2, \dots, x_n]$$

We then estimate the parameter on the simulated data and repeat the exercise b times, obtaining $(\hat{\theta}_1, \hat{\theta}_2, \dots, \hat{\theta}_b)$ as draws from the finite-sample distribution of the estimator of θ. From these values, we can work out the properties of the estimator $\hat{\theta}$. For instance, the bias, defined as:

$$Bias(\hat{\theta}) = E(\hat{\theta}) - \theta$$

is estimated by:

$$\widehat{Bias}(\hat{\theta}) = \frac{1}{b} \sum_{i=1}^{b} (\hat{\theta}_i - \theta)$$
Having covered the basics of Monte Carlo simulation, its basic algorithm is as follows:

i. Generate uniform random numbers and transform them into draws of the random variable X from the assumed DGP.

ii. Evaluate the function of interest, g(X), at each simulated draw.

iii. Repeat steps (i) and (ii) b times.

iv. Estimate the quantity of interest (e.g., the mean) from the b simulated values.

v. Determine the accuracy of the estimated quantity by calculating the standard error. If the standard error is large, increase the number of replications b to obtain the smallest error possible.
Example: Using Monte Carlo Simulation to Estimate the Price of a Call Option

Recall that the payoff of a call option is given by:

$$\max(0, S_T - K)$$

$S_T$ is the price of the underlying stock at the time of maturity T, and K is the strike price. The payoff of the call option is a non-linear function of the underlying stock price at the expiration date, and thus we can estimate the price of the call option by simulation.

Assuming that the log of the stock price is normally distributed, the log stock price at maturity can be modeled as the sum of the initial log price, a deterministic drift, and a normally distributed error:

$$s_T = s_0 + T\left(r_f - \frac{\sigma^2}{2}\right) + \sigma\sqrt{T}\, x_i$$

Where:

$s_0$ = the natural log of the initial stock price

$r_f$ = the risk-free rate of interest

$\sigma^2$ = the variance of the (annualized) return

T = time to maturity in years

$x_i$ = a standard normal random variable

From the formula above, simulating the price of the underlying stock requires values for the initial price, the risk-free rate, the volatility, and the time to maturity. Using the simulated price of the stock, the discounted payoff of the option is calculated as:

$$c = e^{-r_f T} \max(S_T - K, 0)$$

And thus the mean of the price of the call option can be estimated as:

$$\hat{E}(c) = \bar{c} = \frac{1}{b} \sum_{i=1}^{b} c_i$$

Where $c_i$ is the simulated payoff of the call option for draw i. Note that, using the equation for $s_T$, the simulated stock prices can be expressed as:

$$S_{T_i} = e^{s_0 + T\left(r_f - \frac{\sigma^2}{2}\right) + \sigma\sqrt{T}\, x_i}$$

And thus:

$$g(x_i) = c_i = e^{-r_f T} \max\left(e^{s_0 + T\left(r_f - \frac{\sigma^2}{2}\right) + \sigma\sqrt{T}\, x_i} - K,\; 0\right)$$

The standard error of the estimate is:

$$s.e.\left(\hat{E}(c)\right) = \sqrt{\frac{\hat{\sigma}^2_g}{b}} = \frac{\hat{\sigma}_g}{\sqrt{b}}$$

Where:

$$\hat{\sigma}^2_g = \frac{1}{b} \sum_{i} (c_i - \bar{c})^2$$

Given the standard error, we can calculate confidence intervals for the estimated mean of the call option price. For instance, the 95% confidence interval is given by:

$$\bar{c} \pm 1.96 \times \frac{\hat{\sigma}_g}{\sqrt{b}}$$
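A minimal sketch of this procedure is shown below; the inputs (spot, strike, rate, volatility, and maturity) are hypothetical and chosen only for illustration. With enough replications the simulated price stabilizes, and the reported standard error shrinks at the rate 1/√b.

```python
import numpy as np

# Hypothetical inputs (assumptions for illustration only)
S0, K, rf, sigma, T = 100.0, 105.0, 0.03, 0.20, 1.0
b = 100_000                                  # number of replications

rng = np.random.default_rng(1)
x = rng.standard_normal(b)                   # iid N(0, 1) shocks

# Simulated terminal prices: ln S_T = ln S0 + (rf - sigma^2/2) T + sigma sqrt(T) x
ST = np.exp(np.log(S0) + (rf - 0.5 * sigma**2) * T + sigma * np.sqrt(T) * x)

c = np.exp(-rf * T) * np.maximum(ST - K, 0)  # discounted payoffs
c_bar = c.mean()                             # Monte Carlo price estimate
se = c.std() / np.sqrt(b)                    # standard error of the estimate

print(round(c_bar, 4), round(se, 4))
print("95% CI:", (round(c_bar - 1.96 * se, 4), round(c_bar + 1.96 * se, 4)))
```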
Reducing Monte Carlo Sampling Error

To set the mood, recall that the estimation of expected values in simulation relies on the Law of Large Numbers (LLN) and that the standard error of the estimated expected value is proportional to 1/√b. Therefore, the accuracy of the simulation depends on the variance of the simulated quantities. Two techniques are commonly used to reduce this sampling error:

1. Antithetic Variables.

2. Control Variates.
Antithetic Variables
Recall that, for two random variables $X_1$ and $X_2$,

$$Var(X_1 + X_2) = Var(X_1) + Var(X_2) + 2Cov(X_1, X_2)$$

If the covariance between the variables is negative (the variables are negatively correlated), then:

$$Var(X_1 + X_2) < Var(X_1) + Var(X_2)$$
The antithetic variables use the last result. The antithetic variables reduce the sampling error by
incorporating the second set of variables that are generated in such a way that they are
negatively correlated with the initial iid simulated variables. That is, each simulated variable is
paired with an antithetic variable so that they occur in pairs and are negatively correlated.
Concretely, if $U_1 \sim U(0,1)$, then:

$$F^{-1}(U_1) \sim F_X$$

$$U_2 = 1 - U_1$$

$$F^{-1}(U_2) \sim F_X$$

By construction, the correlation between $U_1$ and $U_2$ is negative, and so is the correlation between the two simulated values when the transformation is monotonic.

Using antithetic random variables is analogous to the typical Monte Carlo simulation, except that each uniform draw $U_1$ is paired with its antithetic counterpart $U_2 = 1 - U_1$. Note that the number of independent uniform draws is b/2, since the simulated values occur in pairs. Antithetic variables reduce the sampling error only if the function g(X) is monotonic in x, so that the negative correlation of the uniforms carries over to the simulated values of g.

Notably, the antithetic random variables reduce the sampling error through the correlation coefficient. The usual sampling error using b iid simulated values is:

$$\frac{\sigma_g}{\sqrt{b}}$$

But by introducing the antithetic random variables, the standard error becomes:

$$\frac{\sigma_g \sqrt{1 + \rho}}{\sqrt{b}}$$
Clearly, the standard error decreases when the correlation coefficient, ρ < 0.
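Here is a minimal sketch of the idea, estimating E[e^U] for U ~ U(0,1) — a deliberately simple, monotonic target chosen only for illustration. Pairing each draw with 1 − U reduces the standard error for the same total number of function evaluations.

```python
import numpy as np

rng = np.random.default_rng(7)
b = 100_000
g = np.exp                            # monotonic target function; true E[e^U] = e - 1

# Standard Monte Carlo with b iid draws
u = rng.uniform(size=b)
plain = g(u)

# Antithetic: b/2 draws, each paired with 1 - u, then averaged within each pair
u1 = rng.uniform(size=b // 2)
pairs = 0.5 * (g(u1) + g(1.0 - u1))

print(plain.mean(), plain.std() / np.sqrt(b))        # estimate and standard error
print(pairs.mean(), pairs.std() / np.sqrt(b // 2))   # smaller standard error
```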
Control Variates
Control variates reduce the sampling error by incorporating an additional variable that has a mean of zero and is correlated with the quantity being simulated. The control variate has a mean of zero so that it does not bias the approximation. Given that the control variate and the desired function are correlated, an effective combination (with optimal weights) of the control variate and the initial simulated values can be constructed to reduce the sampling error.

Recall that the plain Monte Carlo estimator is:

$$\hat{E}[g(X)] = \frac{1}{b} \sum_{i=1}^{b} g(x_i)$$

and each simulated value can be written as:

$$g(x_i) = E[g(X)] + \eta_i$$

where $\eta_i$ is a mean-zero simulation error. Denote the control variate by $h(X_i)$, so that by definition $E[h(X_i)] = 0$ and it is correlated with $\eta_i$.

An ideal control variate should be inexpensive to construct and highly correlated with g(X), so that the combination that minimizes the estimation error can be obtained from the regression:

$$g(x_i) = \beta_0 + \beta_1 h(X_i) + \text{error}$$

The intercept estimate $\hat{\beta}_0$ is then the control-variate estimate of E[g(X)].
Disadvantages of Simulation
Monte Carlo simulation can produce unreliable approximations of moments if the DGP used does not adequately describe the observed data. This mostly occurs due to misspecification of the assumed model.

Simulation can be costly, especially when running many simulation experiments or when the function being evaluated is computationally expensive.
Bootstrapping
As stated earlier, bootstrapping is a type of simulation that uses the observed data to simulate from the unknown distribution that generated them. Note, however, that bootstrapping does not directly model the observed data or make any assumption about the distribution; rather, the unknown distribution from which the sample was drawn is treated as the population from which new samples are generated.
There are two types of bootstraps:

i. The iid bootstrap

ii. The circular block bootstrap
iid Bootstrap
The iid bootstrap constructs samples by drawing with replacement from the observed data. Assume that a simulation sample of size m is to be created from observed data with n observations. The iid bootstrap constructs observation indices by randomly sampling with replacement from the values 1, 2, ..., n. These random indices are then used to draw the observed data points to be included in the simulated data (the bootstrap sample).

For instance, assume we want to draw 10 observations from a sample of 50 data points $\{x_1, x_2, x_3, \dots, x_{50}\}$. The first simulation could use the observations $\{x_1, x_{12}, x_{23}, x_{11}, x_{32}, x_{43}, x_1, x_{22}, x_2, x_{22}\}$, the second simulation could use $\{x_{50}, x_{21}, x_{23}, x_{19}, x_{32}, x_{49}, x_{41}, x_{22}, x_{12}, x_{39}\}$, and so on.
In other words, the iid bootstrap is analogous to Monte Carlo simulation, with bootstrap samples used instead of simulated samples. Under the iid bootstrap, expected values are estimated as:

$$\hat{E}[g(X)] = \frac{1}{b} \sum_{j=1}^{b} g\left(x^{BS}_{1,j}, x^{BS}_{2,j}, \dots, x^{BS}_{m,j}\right)$$

Where:

$x^{BS}_{i,j}$ = observation i in bootstrap sample j
The iid bootstrap is suitable when the observations are independent over time; using it on dependent data is inappropriate. In short, the algorithm for generating a sample using the iid bootstrap is: (i) draw m indices with replacement from {1, 2, ..., n}; (ii) form the bootstrap sample from the corresponding observations; and (iii) compute the statistic of interest on the bootstrap sample, repeating the process b times.
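The sketch below is a minimal example (using simulated returns as stand-in data) that applies the iid bootstrap to approximate the sampling distribution of the mean return.

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(0.0005, 0.01, size=250)        # stand-in daily returns (assumed data)

n, b = len(data), 5000
boot_means = np.empty(b)
for j in range(b):
    idx = rng.integers(0, n, size=n)             # sample indices 0..n-1 with replacement
    boot_means[j] = data[idx].mean()             # statistic computed on bootstrap sample j

print(data.mean())                               # sample mean
print(boot_means.std())                          # bootstrap estimate of its standard error
print(np.quantile(boot_means, [0.025, 0.975]))   # bootstrap 95% confidence interval
```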
The Circular Block Bootstrap

The circular block bootstrap differs from the iid bootstrap in that, instead of sampling each data point with replacement, it samples blocks of q consecutive observations with replacement. For instance, assume that we have 50 observations grouped into five blocks, each containing 10 consecutive observations.

The blocks are sampled with replacement until the desired sample size is produced. If the number of observations in the sampled blocks is larger than the required sample size, the extra values at the end are dropped.

The block size should be large enough to capture the dependence between observations, but not so large that only a few distinct blocks are available. Conventionally, the block size is set close to the square root of the sample size. The algorithm is:

i. Decide on the block size q; preferably, the block size should be approximately equal to the square root of the sample size.

ii. Select the first block index i from (1, 2, ..., n) and transfer $\{x_i, x_{i+1}, \dots, x_{i+q-1}\}$ to the bootstrap sample; because the bootstrap is circular, indices past n wrap around to the beginning of the sample.

iii. In case the bootstrap sample has fewer than m elements, repeat step (ii).

iv. In case the bootstrap sample has more than m elements, omit the values from the end of the bootstrap sample so that it has exactly m elements.
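A minimal sketch of the circular block bootstrap is shown below (the block size and data are assumptions for illustration); note how indices wrap around the end of the sample, which is what makes the scheme "circular."

```python
import numpy as np

rng = np.random.default_rng(11)
data = rng.normal(0, 1, size=50)        # stand-in time series (assumed data)
n = len(data)
m = n                                   # desired bootstrap sample size
q = int(np.sqrt(n))                     # block size, conventionally about sqrt(n)

def circular_block_sample():
    out = []
    while len(out) < m:
        start = rng.integers(0, n)                  # random starting index of a block
        idx = (start + np.arange(q)) % n            # wrap around the end of the sample
        out.extend(data[idx])
    return np.array(out[:m])                        # trim any excess from the end

boot_means = np.array([circular_block_sample().mean() for _ in range(2000)])
print(boot_means.std())                 # block-bootstrap standard error of the mean
```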
Application of Bootstrapping
One of the applications of bootstrapping is the estimation of the p-Value-at-Risk (p-VaR) in financial risk management. The p-VaR is the smallest loss such that:

$$\Pr(L > VaR) = 1 - p$$

Where:

L = the loss on the portfolio

p = the VaR confidence level

If the loss is measured as a percentage of a particular portfolio, then the p-VaR can be seen as a quantile of the return distribution. For instance, if we wish to calculate a one-year VaR of a portfolio, we simulate one year of data (252 days) and then find the required quantile of the simulated annual returns.

The VaR is then calculated by sorting the bootstrapped annual returns from lowest to highest and then selecting observation (1 − p)b, which is basically the empirical (1 − p) quantile of the annual returns.
The following are two situations where bootstrapping will not be sufficiently effective:

Outliers in the data – there is a likelihood that the bootstrap samples over- or under-represent these extreme observations, making the results unstable.

Non-independent data – when the iid bootstrap is applied, the data are assumed to be independent; if the observations are dependent over time, this assumption is violated.
Disadvantages of Bootstrapping
Bootstrapping uses the whole data to generate a simulated sample and thus may make
the simulated sample unreliable when the past and the present data are different. For
example, the present state of a financial market might be different from the past.
Bootstrapping of historical data can be unreliable due to changes in the market, so that the present differs from the past. For instance, if we are bootstrapping market interest rates, there might be huge discrepancies between past and present market conditions.
Monte Carlo simulation uses an entire statistical model, including an assumption about the distribution of the shocks; therefore, the results are inaccurate if the assumed model is poorly specified.

On the other hand, bootstrapping does not specify a model but instead assumes that the past resembles the present. In other words, bootstrapping lets the observed data stand in for the unknown distribution.

Both Monte Carlo simulation and bootstrapping are affected by the "Black Swan" problem, whereby the resulting simulations closely resemble historical data. In other words, the simulations tend to reflect historical experience, so events that have never occurred in the data will not appear in the simulations.
Practice Question

Which of the following statements best describes antithetic variables as used in Monte Carlo simulation?

A. They are variables that are generated to have a negative correlation with the initial simulated variables.

B. They are mean-zero values that are correlated with the desired statistic being estimated.

C. They are mean-zero variables that are negatively correlated with the desired statistic.

Solution

The correct answer is A.

Antithetic variables are used to reduce the sampling error in Monte Carlo simulation. They are constructed to have a negative correlation with the initial iid simulated variables, so that averaging each pair reduces the variance of the estimator.
Reading 25: Machine-Learning Methods
Explain the differences among the training, validation, and test data sub-samples, and how each is used in machine learning.
Machine learning (ML) is the art of programming computers to learn from data. Its basic idea is
that systems can learn from data and recognize patterns without active human intervention. ML
is best suited for certain applications, such as pattern recognition and complex problems that
require large amounts of data and are not well solved with traditional approaches.
On the other hand, classical econometrics has traditionally been used in finance to identify
patterns in data. It has a solid foundation in mathematical statistics, probability, and economic
theory. In this case, the analyst researches the best model to use along with the variables to be
used. The computer's algorithm tests the significance of the variables and, based on the results, the model specification is refined.
Machine learning and traditional linear econometric approaches are both employed in
prediction. The former has several advantages: machine learning does not rely on much financial
theory when selecting the most relevant features to include in a model. It can also be used by a
researcher who is unsure or has not specified whether the relationship between variables is
linear or non-linear. The ML algorithm automatically selects the most relevant features and determines the form of the relationship.
Secondly, ML algorithms are flexible and can handle complex relationships between variables.
y = β0 + β1 X1 + β2 X2 + ε
Suppose that the effect of X1 on y depends on the level of X2 . Analysts would miss this
interaction effect unless a multiplicative term was explicitly included in the model. In the case of
many explanatory variables, a linear model may be difficult to construct for all combinations of
interaction terms. The use of machine learning algorithms can mitigate this problem by
Additionally, the traditional statistical approaches for evaluating models, such as analyses of
statistical significance and goodness of fit tests, are not typically applied in the same way to
supervised machine learning models. This is because the goal of supervised machine learning is
often to make accurate predictions rather than to understand the underlying relationships between variables.
There are different terminologies and notations used in ML. This is because engineers, rather than statisticians, developed most machine learning techniques. There has been a lot of relabeling of familiar concepts: features/inputs are simply independent variables, targets/outputs are dependent variables, and so on.
The following gives a summary of some of the differences between ML techniques and classical
econometrics.
Goals: Machine learning builds models that can learn from data and continuously improve their performance with time, without needing to specify the relationships between variables in advance. Classical econometrics identifies and estimates the relationships between variables and tests hypotheses about those relationships.

Data requirements: ML models can deal with large amounts of complex and unstructured data. Classical econometrics requires well-structured and clearly defined dependent and independent variables.

Assumptions: ML models are not built on distributional assumptions and can handle non-linear relationships between variables. Classical econometrics is based on various assumptions, e.g., normally distributed errors and linear relationships between variables.

Interpretability: ML models may be complex to interpret, as they may involve complex patterns and relationships that are difficult to understand or explain. Classical statistical models can be interpreted in terms of the relationships between variables.
There are many types of Machine learning systems. Some of the types include unsupervised
Unsupervised Learning
As the name suggests, the system attempts to learn without a teacher. It recognizes data
patterns without an explicit target. More specifically, it uses inputs (X’s) for analysis with no
corresponding target (Y). Data is clustered to detect groups or factors that explain the data. It is, in essence, used for exploratory analysis and pattern discovery.
For example, unsupervised learning can be used by an entrepreneur who sells books to detect
groups of similar customers. The entrepreneur will at no point tell the algorithm which group a
customer belongs to. It instead finds the connections without the entrepreneur’s help. The
algorithm may notice, for instance, that 30% of the store’s customers are males who love science
fiction books and frequent the store mostly during weekends, while 25% are females who enjoy
drama books. A hierarchical clustering algorithm can be used to further subdivide groups into
smaller ones.
Supervised Learning
By using well-labeled training data, this system is trained to work as a supervisor to teach the
machine to predict the correct output. You can think of it as how a student learns under the
supervision of a teacher. In supervised learning, a mapping function is determined that can map
inputs (X’s) with output (Y). The output is also known as the target, while X’s are also known as
the features.
Typically, there are two types of tasks in supervised learning. One is classification. For example,
a loan borrower may be classified as “likely to repay” or “likely to default.” The second one is the
prediction of a target numerical value. For example, predicting a vehicle’s price based on a set of
features such as mileage, year of manufacture, etc. For the latter, the labels will indicate the selling prices. As for the former, the features would be the borrower's credit score, income, etc., while the labels would indicate whether the borrower repaid or defaulted.
Reinforcement Learning
Reinforcement learning differs from other forms of learning. A learning system called an agent
perceives and interprets its environment, performs actions, and is rewarded for desired behavior
and penalized for undesired behavior. This is done through a trial-and-error approach. Over time,
the agent learns by itself what is the best strategy (policy) that will generate the best reward
while avoiding undesirable behaviors. Reinforcement learning can be used to optimize portfolio allocation and to create trading bots that learn from stock market data through trial and error, among other applications.
Training ML models can be slowed by the millions of features that might be present in each training instance. Having many features can also make it difficult to find a good solution. This problem is often referred to as the curse of dimensionality.

Dimensions and features are often used interchangeably. Dimension reduction involves reducing the number of features: it simplifies complex datasets, scales down the computational burden of dealing with large datasets, and can improve model performance.

PCA is the most popular dimension reduction approach. It involves projecting the training dataset onto a lower-dimensional hyperplane. This is done by finding the directions in the dataset that capture the most variance and projecting the dataset onto those directions. PCA thereby retains as much of the original information as possible with far fewer variables.
In PCA, the variance measures the amount of information. Hence, principal components capture
the most variance and retain the most information. Accordingly, the first principal component
will account for the largest possible variance; the second component will intuitively account for
the second largest variance (provided that it is uncorrelated with the first principal component),
and so on. A scree plot shows how much variance is explained by the principal components of the data. The principal components that explain a significant proportion of the variance are retained, while the remaining components are discarded.
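As a minimal sketch (using scikit-learn on randomly generated, correlated features as stand-in data), the snippet below fits a PCA and prints the explained-variance ratios that a scree plot would display.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))                         # 3 underlying drivers (assumed)
loadings = rng.normal(size=(3, 20))
X = latent @ loadings + 0.1 * rng.normal(size=(500, 20))   # 20 correlated observed features

pca = PCA(n_components=10)
pca.fit(X)
print(np.round(pca.explained_variance_ratio_, 3))          # most variance in the first few PCs
print(np.round(np.cumsum(pca.explained_variance_ratio_), 3))
```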
Researchers are concerned about which principal components will adequately explain returns in
a hypothetical Very Small Cap (VSC) 30 and Diversified Small Cap (DSC) 500 equity index over a
15-year period. DSC 500 is a diversified index that contains stocks across all sectors, whereas
VSC 30 is a concentrated index that contains technology stocks. In addition to index prices, the
dataset contains more than 1000 technical and fundamental features. The fact that the dataset
has so many features causes them to overlap due to multicollinearity. This is where PCA comes in
handy, as it works by creating new variables that can explain most of the variance while remaining uncorrelated with one another.
A scree plot was produced for each index. Based on the 20 principal components generated, the first three components explain 88% and 91% of the variance in the VSC 30 and DSC 500 index values, respectively. The scree plots for both indexes illustrate that the incremental contribution to explaining the variance structure is very small after PC5 or so. From PC5 onwards, it is therefore possible to disregard the remaining components without losing much information.
Clustering is a type of unsupervised machine-learning technique that organizes data points into groups (clusters) of similar observations.
Clusters contain observations from data that are similar in nature. K-means is an iterative
algorithm that is used to solve clustering problems. K is the number of fixed clusters determined
by the analyst at the outset. It is based on the idea of minimizing the sum of squared distances
between data points and the centroid of the cluster to which they belong. The following outlines the K-means algorithm:

1. Randomly allocate initial K centroids within the data (centers of the clusters).

2. Assign each data point to the closest centroid.

3. Calculate the new K centroids for each cluster by taking the average value of all data points assigned to that cluster.

4. Reassign each data point to the closest centroid based on the newly calculated centroids.

5. Repeat the process of recalculating the new K centroids and reassigning data points until the centroids converge or a maximum number of iterations is reached.
Iterations continue until no data point is left to reassign to the closest centroid (there is no need
to recalculate new centroids). The distance between each data point and the centroids can be
measured in two ways. The first is the Euclidean distance, while the second is the Manhattan
distance.
Consider two features x and y, which both have two data points A and B, with coordinates $(x_A, y_A)$ and $(x_B, y_B)$, respectively. The Euclidean distance, also known as the $L_2$-norm, is calculated as the square root of the sum of the squares of the differences between the coordinates of the two points:

$$d_E = \sqrt{(x_B - x_A)^2 + (y_B - y_A)^2}$$

Imagine the Pythagorean Theorem, where the Euclidean distance is the unknown side of a right-angled triangle.
In the case that there are more than two dimensions, for example, n features for two data points
A and B , Euclidean distance will be constructed in a similar fashion. Euclidean distance is also
known as the "straight-line distance" because it is the shortest distance between two points. Manhattan distance, also known as the $L_1$-norm, is calculated as the sum of the absolute differences between the coordinates. For a two-dimensional case:

$$d_M = |x_B - x_A| + |y_B - y_A|$$
Manhattan distance is named after the layout of streets in Manhattan, where streets are laid out
in a grid pattern, and the only way to travel between two points is by going along the grid lines.
Suppose you have financial data for three companies, P, Q, and R, each described by the same set of numerical features. Calculate the Euclidean and Manhattan distances between companies P and Q in feature space.

Euclidean Distance

To calculate the Euclidean distance between companies P and Q in feature space for the raw data, we first find the difference between each feature value for the two companies and then square the differences. The Euclidean distance is the square root of the sum of these squared differences.

Manhattan Distance

To calculate the Manhattan distance between companies P and Q in feature space for the raw data, we simply find the absolute difference between each feature value for the two companies. The Manhattan distance between companies P and Q is the sum of these absolute differences.
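The sketch below uses purely hypothetical feature values for companies P and Q to show how the two distances are computed.

```python
import numpy as np

# Hypothetical feature vectors (e.g., P/E, leverage, ROE) -- assumptions for illustration only
p = np.array([12.0, 0.8, 0.15])
q = np.array([18.0, 0.5, 0.10])

euclidean = np.sqrt(((p - q) ** 2).sum())   # L2-norm: straight-line distance
manhattan = np.abs(p - q).sum()             # L1-norm: sum of absolute differences

print(round(euclidean, 4))   # about 6.0077
print(round(manhattan, 4))   # 6.35
```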
The formulas described above give the distance between two points A and B. It should be noted that K-means aims to minimize the distance between each data point and its centroid rather than the distance between data points. The data points end up closer to their centroids as the algorithm converges.
Inertia, also known as the Within-Cluster Sum of Squared errors (WCSS), is a measure of the
sum of the squared distances between the data points within a cluster and the cluster's centroid.
$$WCSS = \sum_{i=1}^{n} d_i^2$$

where $d_i$ is the distance between data point i and the centroid of its cluster. The K-means algorithm aims to minimize the inertia by iteratively reassigning data points to different clusters and updating the cluster centroids until convergence. The final inertia value can be used to compare the quality of different clusterings.
Choosing an Appropriate Value for K
Choosing an appropriate value for K can affect the performance of the K-means model. For
example, if K is set too low, the clusters may be too general and may not be a true representative
of the underlying structure of the data. Similarly, if K is set too high, the clusters may be too
specific and may not represent the data’s overall structure. These clusters may not be useful for
the intended purpose of the analysis in either case. It is, therefore, important to choose K
optimally in practice.
The optimal value of K can be calculated using different methods, such as the elbow method and
the silhouette analysis. The elbow method fits the K-means model for different values of K and
plots the inertia/WCSS for each value of K. Similar to PCA, this is called a scree plot. It is then examined for the obvious point on the plot where the inertia starts decreasing more slowly as K increases (the elbow), which is chosen as the optimal value of K. In other words, it is the value of K beyond which adding more clusters yields only a small reduction in inertia.
The second approach involves fitting the K-means model for a range of values of K and
determining the silhouette coefficient for each value of K. The silhouette coefficient compares
the distance of each data point from other points in its own cluster with its distance from the
data points in the other closest cluster. In other words, it measures the similarity of a data point
to its own cluster compared to the other closest clusters. The optimal value of K is the one that maximizes the average silhouette coefficient.
K-means clustering is simple and easy to implement, making it a popular choice for clustering tasks. There are some disadvantages to K-means, such as the need to specify the number of clusters K in advance, which can be difficult if the dataset is not well separated. Additionally, it assumes that the clusters are spherical and equal in size, which is not always the case in practice.
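The following is a minimal sketch (scikit-learn, with synthetic two-feature data standing in for standardized characteristics) showing how K-means is fitted and how the inertia values used in the elbow method are obtained.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
# Synthetic data: three blobs in two standardized features (assumed data)
X = np.vstack([rng.normal(c, 0.5, size=(100, 2)) for c in ([0, 0], [4, 0], [2, 4])])

inertias = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_          # within-cluster sum of squares (WCSS)

print(inertias)                        # drops sharply up to k = 3, then flattens (the "elbow")

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(np.bincount(labels))             # roughly 100 points per cluster
```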
The K-means algorithm is very common in investment practice. It can be used for data exploration, for example, to group assets or companies with similar characteristics.

Understand the differences between and consequences of underfitting and overfitting, and propose potential remedies for each.
Overfitting
Imagine that you have traveled to a new country, and the shop assistant rips you off. It is a
natural instinct to assume that all shop assistants in that country are thieves. If we are not
careful, machines can also fall into the same trap of overgeneralizing. This is known as
overfitting in ML.
Overfitting occurs when the model has been trained too well on the training data and performs
poorly on new, unseen data. An overfitted model can have too many model parameters, thus
learning the detail and noise in the training data rather than the underlying patterns. This is a
problem because it means that the model cannot make reliable predictions about new data,
which can lead to poor performance in real-world applications. The evaluation of the ML
algorithm thus focuses on its prediction error on new data rather than on its goodness of fit on
the trained data. If an algorithm is overfitted to the training data, it will have a low prediction
error on the training data but a high prediction error on new data.
The dataset to which an ML model is applied is normally split into training and validation
samples. The training data set is used to train the ML model by fitting the model parameters. On
the other hand, the validation data set is used to evaluate the trained model and estimate how
Overfitting is a severe problem in ML, which can easily have thousands of parameters, unlike
classical econometric models that can only have a few parameters. Potential remedies for
overfitting include decreasing the complexity of the model, reducing the number of features, or using regularization.
Underfitting
Underfitting is the opposite of overfitting. It occurs when a model is too simple and thus not able
to capture the underlying patterns in the training data. This results in poor performance on both
the training data and new data. For example, we would expect a linear model of life satisfaction
to be prone to underfitting, as the real world is more complicated than the model. In this scenario, the model's predictions will be inaccurate even on the training data.
Underfitting is more likely in conventional models because they tend to be less flexible than ML
models. The former follows a predetermined set of rules or assumptions, while ML approaches
do not follow assumptions about the structure of the model. It should be noted, however, that ML
models can still experience underfitting. This can happen when there is insufficient data to train
the model, when the data is of poor quality, and if there is excessively stringent regularization. Regularization is a technique that penalizes the model as its complexity increases. If the regularization is set too high, it can cause
the model to underfit the data. Potential remedies for addressing underfitting include increasing
the complexity of the model, adding more features, or increasing the amount of training data.
Bias-Variance Tradeoff
The complexity of the ML model, which determines whether the data is over, under, or well-
fitted, involves a phenomenon called bias-variance tradeoff. Complexity refers to the number of
features in a model and whether a model is linear or non-linear (with non-linear being too
complex). Bias occurs when a complex model is approximated with a simpler model, i.e., by
omitting relevant factors and interactions. A model with highly biased predictions is likely to be
oversimplified and thus results in underfitting. Variance refers to how sensitive the model is to small fluctuations in the training data. A model with high variance in its predictions is likely to be overly complex and thus prone to overfitting.
Bias and variance are both affected by model complexity: as complexity increases, bias typically falls while variance rises, and the best-fitted models balance the two.
Sample Splitting and Preparation
Data Preparation
There is a tendency for ML algorithms to perform poorly when the variables have very different
scales. For example, there is a vast difference in the range between income and age. A person’s
income ranges in the thousands while their age ranges in the tens. Since ML algorithms only see
numbers, they will assume that higher-ranging numbers (income in this case) are superior, which
is false. It is, therefore, crucial to have values in the same range. Standardization and normalization are the two common rescaling approaches.
Standardization involves centering and scaling variables. Centering is where the variable’s mean
value is subtracted from all observations on that variable (so standardized values have a mean of
0). Scaling is where the centered values are divided by the standard deviation, so that the standardized values have a standard deviation of 1:

$$x_i(\text{standardized}) = \frac{x_i - \mu}{\sigma}$$
Normalization, also known as min-max scaling, entails rescaling values from 0 to 1. This is done
by subtracting the minimum value (x min ) from each observation and dividing by the difference
between the maximum (x max) and minimum values (xmin ) of X. This is expressed as follows:
$$x_i(\text{normalized}) = \frac{x_i - x_{min}}{x_{max} - x_{min}}$$
Standardization is preferred when the data include outliers. This is because normalization compresses all observations into the [0, 1] range, so a single extreme value can distort the scaling of the rest of the data.
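A short sketch of both rescalings (on an assumed feature vector containing one outlier) is shown below; note how the outlier squeezes the normalized values together while the standardized values remain spread out.

```python
import numpy as np

x = np.array([25.0, 32.0, 41.0, 38.0, 29.0, 400.0])   # assumed feature values, with an outlier

standardized = (x - x.mean()) / x.std()                # mean 0, standard deviation 1
normalized = (x - x.min()) / (x.max() - x.min())       # rescaled to the [0, 1] interval

print(np.round(standardized, 3))
print(np.round(normalized, 3))   # non-outlier values are crushed near 0 by the outlier
```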
Data Cleaning
This is a crucial component of ML and may be the difference between a model's success and failure. Common issues include the following:
Missing data: Analysts encounter this issue very often. Missing data can be dealt with
in the following ways. First, observations with only a small number of missing values
can be removed. Secondly, they can replace them with the mean or median of the non-
missing observations. Lastly, it may be possible to estimate the missing values based on their relationships with other variables in the data.
Unwanted observations: Observations that are not relevant to the specific task should be removed, as they are distractions.

Problematic features: A feature value that lies many standard deviations from the mean should be examined carefully, as it may be an outlier or a data error.
We briefly discussed the training and validation data sets, which are in-sample datasets.
Additionally, there is an out-of-sample dataset, which is the test data. The training dataset
teaches an ML model to make predictions, i.e., it learns the relationships between the input data
and the desired output. A validation dataset is used to evaluate the performance of an ML model
during the training process. It compares the performance of different models so as to determine
which one generalizes (fits) best to new data. A test dataset is used to evaluate an ML model’s
final performance and identify any remaining issues or biases in the model. The performance of a
good ML model on the test dataset should be relatively similar to the performance on the
training dataset. However, the training and test datasets may perform differently, and perfect agreement between them should not be expected.
It is up to the researchers to decide how to subdivide the available data into the three samples. A
common rule of thumb is to use two-thirds of the sample for training and the remaining third to
be equally split between validation and testing. The subdivision of the data will be less crucial
when the overall data points are large. Using a small training dataset can introduce biases into
the parameter estimation because the model will not have enough data to learn the underlying
patterns in the data accurately. Using a small validation dataset can lead to inaccurate model
evaluation because the model may not have enough data to assess its performance accurately;
thus, it will be hard to identify the best specification. When subdividing the data into training,
validation, and test datasets, it is crucial to consider the type of data you are working with.
For cross-sectional data, it is best to divide the dataset randomly, as the data has no natural
ordering (i.e., the variables are not related to each other in any specific order). For time series
data, it is best to divide the data in chronological order, starting with the training data, then the validation data, and finally the test data.
Cross-validation Searches
Cross-validation can be used when the overall dataset is insufficient to be divided into training,
validation, and testing datasets. In cross-validation, training and validation datasets are
combined into one sample, and the testing dataset is excluded. The combined data is then
equally split into sub-samples, with a different sub-sample left out each time as the test dataset.
This technique is known as k-fold cross-validation. It splits the training and validation data into k
sub-samples, and the model is trained and evaluated k times while leaving out the test data from
the combined sample. The values k = 5 and k =10 are commonly chosen for k-fold cross-
validation.
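A minimal sketch of k-fold cross-validation with scikit-learn is shown below (the synthetic regression data and the plain linear model are both assumptions for illustration); each of the k = 5 folds takes a turn as the held-out evaluation set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))                         # synthetic features (assumed)
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.5, 200)

scores = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])   # train on k-1 folds
    pred = model.predict(X[val_idx])                             # evaluate on the held-out fold
    scores.append(mean_squared_error(y[val_idx], pred))

print(np.round(scores, 4))          # one validation MSE per fold
print(round(np.mean(scores), 4))    # cross-validated estimate of out-of-sample error
```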
In reinforcement learning, an agent receives a reward or punishment depending on its actions. It then uses the feedback to learn the actions that are likely to generate the highest reward. The algorithm learns through trial and error, much as a player improves by repeatedly playing a game.
How Reinforcement Learning Operates
The environment consists of the state space, action space, and the reward function. The state
space is the set of all possible states in which the agent can be. On the other hand, the action
space consists of a set of actions that the agent can take. Lastly, the reward function defines the
feedback that the agent receives for taking a particular action in a given state space.
The setup involves specifying the learning algorithm and any relevant parameters. The agent is then placed in its initial state in the environment.
Take an Action
The agent chooses an action depending on its current state and the learning algorithm. This
action is then taken in the environment, which may lead to a change of state and a reward. At
any given state, the algorithm can choose between taking the best course of action (exploitation)
and trying a new action (exploration). Exploitation is assigned the probability p and exploration
the probability 1 − p. The value of p increases as more trials are concluded and the algorithm has learned more about which actions tend to generate the highest rewards.
Update the Internal State
Based on the reward received and the environment's new state, the agent updates its internal estimates of the value of each action. This step is repeated after every action.
Repeat
The agent continues to take actions and update its internal state until it reaches a predefined goal or a maximum number of trials.
The Monte Carlo method estimates the value of a state or action based on the final reward
received at the end of an episode. On the other hand, the temporal difference method updates
the value of a state or action by looking at only one decision ahead when updating strategies.
An estimate of the expected value of taking action A in state S, after several trials, is denoted as
Q(S, A). The estimated value of being in state S at any time is the value of the best available action in that state, i.e., the maximum of Q(S, A) over the actions A. After each trial, the Q values are updated using an updating rule of the form:

Q_new(S, A) = Q_old(S, A) + α (R − Q_old(S, A))
Where α is a parameter, say 0.05, which is the learning rate that determines how much the agent
updates its Q value based on the difference between the expected and actual reward.
Suppose that we have three states (S1 , S 2, S3) and two actions (A1, A2) , with the following
Q(S, A) values:
S1 S2 S3
A1 0.3 0.4 0.5
A2 0.7 0.6 0.5
Monte-Carlo Method
Suppose that on the next trial, Action 2 is taken in State 3, and the total subsequent reward is
1.0. If α = 0.075, the Monte Carlo method would lead to Q(3, 2) being updated from 0.5 to 0.5 + 0.075 × (1.0 − 0.5) = 0.5375.
Temporal Difference Method
If the next decision on the trial under consideration is made when we are in State 2, the relevant look-ahead value is the value of being in State 2, which is Q(2, 2) = 0.6 (the best available action in that state). Assuming no intermediate reward before that decision, the temporal difference method would lead to Q(3, 2) being updated from 0.5 to 0.5 + 0.075 × (0.6 − 0.5) = 0.5075.
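The two updates can be verified with a few lines of arithmetic. This is a minimal sketch of the updating rule Q_new = Q_old + α(target − Q_old); the assumption of no intermediate reward in the temporal difference case mirrors the example above.

def update_q(q_old, target, alpha):
    # Move the old estimate a fraction alpha of the way toward the target.
    return q_old + alpha * (target - q_old)

alpha = 0.075
q_old = 0.5   # current estimate of Q(S3, A2)

# Monte Carlo: the target is the total reward observed at the end of the trial (1.0).
print(update_q(q_old, target=1.0, alpha=alpha))   # 0.5375

# Temporal difference: the target is the value of the next state reached, Q(S2, A2) = 0.6,
# assuming no intermediate reward is received before that decision.
print(update_q(q_old, target=0.6, alpha=alpha))   # 0.5075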
Reinforcement learning has several applications in finance:
1. Trading: Reinforcement learning algorithms can learn from past data and market
dynamics to make informed decisions on when to buy and sell, possibly optimizing the timing and profitability of trades.
2. Detecting fraud: RL can be used to detect fraudulent activity in financial transactions.
This algorithm learns from past data and hence adapts to new fraud patterns. This means
that the algorithm becomes better at detecting and preventing fraud with time.
3. Credit decisions: RL can be used to decide whether to grant a borrower a loan. The algorithm can be trained on historical data about borrowers and their credit and repayment histories.
4. Risk management: RL can be trained using past data to identify and mitigate financial
risks.
5. Portfolio optimization: RL can be trained to take actions that modify the allocation of
assets in the portfolio over time, with the aim of maximizing portfolio returns and
minimizing risks.
Natural language processing (NLP) focuses on helping machines process and understand human
language. A typical NLP workflow in finance involves the following steps:
1. Data collection: Involves acquiring textual data from various sources, including financial news, company filings, and social media.
2. Data preprocessing: The raw textual data is cleaned, formatted, and transformed into a
form suitable for computer usage. Tasks such as tokenization, stemming, and stop word removal are performed at this stage.
3. Feature extraction: This involves extracting relevant features from the preprocessed
data. It may involve extracting financial metrics, sentiments, and other relevant
information.
4. Model training: This involves training the machine learning model using the extracted
features.
5. Model evaluation: This involves evaluating the performance of the trained model to ensure that it generalizes well to new data; techniques such as cross-validation can be employed here. Model evaluation is carried out on the test dataset.
6. Model deployment: The evaluated model is then deployed for use in real-world
investment scenarios.
Data Preprocessing
Textual data (unstructured data) is more suitable for human consumption than for
computer processing. Unstructured data thus needs to be converted to structured data through
cleaning and preprocessing, a process called text processing. Text cleansing involves
removing HTML tags, punctuation, numbers, and white spaces (e.g., tabs and indents).
The next step is text wrangling (preprocessing), which involves the following (a short illustrative sketch follows the list):
1. Tokenization: Involves separating a piece of text into smaller units called tokens. It
allows the NLP model to analyze the textual data more easily by breaking it down into smaller, more manageable pieces such as words or sentences.
3. Removing stop words: These are words with no informational value, e.g., as, the, is,
used as sentence connectors. They are eliminated to reduce the number of tokens in the
training data.
4. Stemming: Reduces all the variations of a word to a common value (base form/stem).
For example, “earned,” “earnings,” and “earning” are all assigned the common stem “earn.”
5. Lemmatization: Converts each word to its dictionary base form (lemma) rather than simply chopping off word endings. Unlike stemming, lemmatization incorporates the full structure of the word and therefore tends to be more accurate, although it is computationally more demanding compared
to stemming.
6. Consider “n-grams:” These are words that need to be placed together to convey a specific meaning, e.g., “interest rate.”
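The sketch below illustrates cleansing, tokenization, stop-word removal, and a crude rule-based stemmer in plain Python. The tiny stop-word list and the suffix rules are illustrative assumptions rather than a production NLP pipeline.

import re

STOP_WORDS = {"as", "the", "is", "a", "an", "and", "of", "to"}  # illustrative list only

def clean(text):
    # Text cleansing: strip HTML tags, punctuation, numbers, and extra white space.
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"[^a-zA-Z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip().lower()

def tokenize(text):
    return text.split()

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token):
    # Crude stemmer: chop common suffixes so "earned", "earnings", "earning" -> "earn".
    for suffix in ("ings", "ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

raw = "<p>Earnings of the firm earned 12% and earning growth is strong.</p>"
tokens = remove_stop_words(tokenize(clean(raw)))
print([stem(t) for t in tokens])   # ['earn', 'firm', 'earn', 'earn', 'growth', 'strong']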
Finance professionals can leverage NLP to derive insights from large volumes of textual data and make better-informed decisions. Applications include:
Trading: NLP can be employed to analyze real-time financial data, e.g., stock prices, to
derive trends and patterns that could be used to inform investment decisions.
Risk management: NLP can be used to identify possible risks in financial contracts
and regulatory filings, for example, by flagging language that implies a high level of risk or uncertainty.
News analysis: NLP can be used to derive information from news articles and other
sources of financial information, e.g., earnings reports. The resulting information can be used to identify emerging trends, risks, and investment opportunities.
Sentiment analysis: NLP can be used to measure the public opinion of a company,
industry, or market trend by analyzing sentiments on social media posts and news
articles. Investors can use this information to make more informed investment
decisions. Investors can classify the text as positive, negative, or neutral based on the sentiment it expresses.
Detect accounting fraud: For example, the Securities and Exchange Commission (SEC) has applied NLP-based analysis of corporate filings and disclosures to help detect accounting fraud.
Text classification: NLP can be used to categorize newswire statements based on the news they represent, e.g., education, financial, environmental, etc.
Practice Question
Which of the following is least likely a task that can be performed using natural
language processing?
A. Sentiment analysis.
B. Text translation.
C. Image recognition.
D. Text classification.
Solution
C is correct: Image recognition is not a task that can be performed using NLP. This is because NLP
focuses on processing and understanding human language, not images.
A is incorrect: NLP can be used for sentiment analysis, for example, to classify text as positive, negative, or neutral.
D is incorrect: NLP can be used for text classification, for example, to categorize newswire statements based on the news they represent, e.g., education, financial,
environmental, etc.
Reading 26: Machine Learning and Prediction
After completing this reading, you should be able to:
Discuss why regularization is useful and distinguish between the ridge regression and
LASSO approaches.
Outline the intuition behind the K-nearest neighbors and support vector machine approaches.
Understand how neural networks are constructed and how their weights are
determined.
Evaluate the predictive performance of logistic regression models and neural network models.
Linear regression models the relationship between a dependent variable and one or more
independent variables by fitting a linear equation to the observed data. It works by finding the
line of best fit through the data points; this line is called the regression line.
The fitted equation can then be used to make predictions about the dependent variable for new values of the independent variables.
The regression line can be expressed as follows:
y = α + β1x1 + β2x2 + … + βnxn
Where:
y = Dependent variable.
α = Intercept.
x1, x2, …, xn = Independent variables.
β1, β2, …, βn = Coefficients on the independent variables.
The coefficients show the effect of each independent variable on the dependent variable and are estimated from the training data.
Training any machine learning model aims to minimize the cost (loss) function. A cost function
measures the inaccuracy of the model predictions. It is the sum of squared residuals (RSS) for a
linear regression model. This is the sum of the squared differences between the actual and predicted values:
RSS = ∑_{i=1}^{n} ( y_i − α − ∑_{j} β_j x_ij )^2

where the outer sum runs over the n observations and the inner sum runs over the independent variables.
To measure how well the line fits the data, take the difference between each actual data point (y)
and its predicted value, square these differences to penalize larger deviations, add them up, and take
the average.
The advantage of linear regression is that it is easy to understand and interpret. However, it has limitations:
It assumes that residuals (the differences between observed and predicted values) are independent and normally distributed with constant variance.
It is prone to overfitting.
Example: Prediction using Linear Regression
Aditya Khun, an investment analyst, wants to predict the return on a stock based on its P/E ratio
and the market capitalization of the company using linear regression in machine learning. Khun
has access to the P/E ratio and market capitalization dataset for several stocks, along with their
corresponding returns. Khun can employ linear regression to model the relationship between the
return on a stock and its P/E ratio and market capitalization. The following equation represents
the model:

Return = β0 + β1 × (P/E ratio) + β2 × (Market capitalization)

Where:
β0 = Intercept.
β1, β2 = Coefficients on the P/E ratio and market capitalization, respectively.
The first step of fitting a linear regression model is estimating the values of the coefficients β0 , β1
, and β2 using the training data. Coefficients that minimize the sum of the squared residuals are
determined.
Intercept = 3.432.
Given a P/E ratio of 14 and a market capitalization of $150M, the return of the stock can be
determined as follows:
Logistic Regression
When using a linear regression model for binary classification, where the dependent variable Y
can only be 0 or 1, the model can predict probabilities outside the range of 0 to 1. This occurs
because the model attempts to fit a straight line to the data, and the predicted values may not be
restricted to the valid range of probabilities. As a result, the model may produce predictions that
are less than zero or greater than one. To avoid this issue, it may be necessary to use a different
type of model, such as logistic regression, which is specifically designed for binary classification
tasks and ensures that the predicted probabilities are within the valid range. This is achieved by
applying a sigmoid function. The sigmoid function graph is shown in the figure below.
Logistic regression is used to forecast a binary outcome. In other words, it predicts the likelihood
that an observation falls into one of two classes. Writing y_j = α + β1x_1j + β2x_2j + … + βkx_kj for observation j, the sigmoid function applied to y_j is:

F(y_j) = e^(y_j) / (1 + e^(y_j))
Where:
α = Intercept term.
β1, β2, …, βk = Coefficients on the independent variables (features).

The probability that the outcome for observation j equals 1 is:
p_j = e^(y_j) / (1 + e^(y_j))
The probability that the outcome equals 0 is (1 − p_j).
The associated cost function measures how often the model predicts zero when the true answer is one and vice versa. The
logistic regression coefficients are trained using techniques such as maximum likelihood
estimation (MLE) to predict values close to 0 and 1. MLE works by selecting the values of the
model parameters (α and the βs) that maximize the likelihood of the training data occurring. The
likelihood function is a mathematical function that describes the probability of the observed data
given the model parameters. By maximizing the likelihood function, we can find the values of the
parameters most likely to have produced the observed data. This can be expressed as:
L = ∏_{j=1}^{n} F(y_j)^{y_j} (1 − F(y_j))^{1−y_j}
It is often easier to maximize the log-likelihood function, log(L), than the likelihood function
itself. The log-likelihood function is obtained by taking the natural logarithm of the likelihood
function:
log(L) = ∑_{j=1}^{n} [ y_j log(F(y_j)) + (1 − y_j) log(1 − F(y_j)) ]
Once the model parameters (α and the βs) that maximize the log-likelihood function have been
estimated using MLE, predictions can be made using the logistic regression model. To make
predictions, a threshold value Z is chosen. If the predicted probability p j is greater than or equal
to the threshold Z, the model predicts the positive outcome (y_j = 1); if p_j is less than the
threshold Z, the model predicts the negative outcome (y_j = 0). This is expressed as:

y_j = 1 if p_j ≥ Z
y_j = 0 if p_j < Z
A credit analyst wants to predict whether a customer will default on a loan based on their credit
score and debt-to-income ratio. He gathers a dataset of 500 customers, with their corresponding
credit scores, debt-to-income ratio, and whether they defaulted on the loan. He then splits the
data into training and test sets and uses the training data to train a logistic regression model.
The model learns the following relationship between the independent variables (input features) and the probability of default:

Probability of default = e^(−10 + 0.012×Credit score + 0.4×Debt-to-income) / (1 + e^(−10 + 0.012×Credit score + 0.4×Debt-to-income))
The above expression calculates the probability that the customer will default on the loan, given the customer's credit score and debt-to-income ratio.
So, if the credit score is 650 and the debt-to-income ratio is 0.6, the probability of default will be
calculated as:
Probability of default = e^(−10 + 0.012×650 + 0.4×0.6) / (1 + e^(−10 + 0.012×650 + 0.4×0.6)) ≈ 12%
So there is a 12% probability that the customer will default on the loan. One can then use a
threshold (such as 50%) to convert this probability into a binary prediction (either “default” or
“no default”). Since 12% < 50%, we can classify this as “no default.”
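The 12% figure can be reproduced directly from the fitted equation. A minimal sketch:

import math

def probability_of_default(credit_score, dti):
    # Fitted logistic model from the example: y = -10 + 0.012*credit_score + 0.4*DTI.
    y = -10 + 0.012 * credit_score + 0.4 * dti
    return math.exp(y) / (1 + math.exp(y))

p = probability_of_default(650, 0.6)
print(round(p, 3))                                # approximately 0.12
print("default" if p >= 0.5 else "no default")    # threshold of 50% gives "no default"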
Logistic regression is applied for prediction and classification tasks in machine learning. For
example, you could use logistic regression to classify stock returns as either “positive” or
“negative” based on a set of input features that you choose. It is simple to implement and
interpret. However, it assumes a linear relationship between the dependent and independent
variables and requires a large sample size to achieve stable estimates of the coefficients.
Categorical data refers to information presented in groups and can take on values that are
names, attributes, or labels. It is not in a numerical format. For example, a given set of stocks
can be categorized as either growth or value stocks depending on the investment style. Many ML algorithms, however, require numerical inputs, so categorical data must first be converted into numbers.
It isn't easy to transform categorical variables, especially non-ordinal categorical data, where the
classes are not in any order. Mapping or encoding involves transforming non-numerical
information into numbers. One-hot encoding is the most common solution for dealing with non-
ordinal categorical data. It involves creating a new dummy variable for each group of the
categorical feature and encoding the categories as binary. Each observation is marked as either 1 (it belongs to that category) or 0 (it does not).
For ordered categorical variables, for example, where a candidate's grades are specified as
either poor, good, or excellent, a variable that equals 0 for poor, 1 for good, and 2 for excellent can be used.
If an intercept term and a full set of dummy variables (one for every category) are included in a model, the dummy
variable trap may be encountered. The dummies are then perfectly collinear with the intercept, so the model has multiple possible
solutions, and a unique best-fit solution cannot be found. To address this issue, one dummy can be dropped, or techniques such as
regularization can be used. Regularization penalizes the magnitude of the coefficients of the
model, which can help to reduce the impact of collinear variables and prevent the dummy variable trap.
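A minimal pandas sketch of one-hot encoding is shown below. Dropping one dummy (drop_first=True) is one simple way to avoid the dummy variable trap when an intercept is included; the column names and categories are illustrative.

import pandas as pd

df = pd.DataFrame({"style": ["growth", "value", "growth", "value"],
                   "grade": ["poor", "good", "excellent", "good"]})

# Non-ordinal feature: one-hot encode, dropping one category to avoid the dummy variable trap.
dummies = pd.get_dummies(df["style"], prefix="style", drop_first=True)

# Ordinal feature: map the ordered categories directly to integers.
grade_map = {"poor": 0, "good": 1, "excellent": 2}
df["grade_encoded"] = df["grade"].map(grade_map)

print(pd.concat([df, dummies], axis=1))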
Regularization
Regularization is a technique that prevents overfitting in machine learning models by penalizing
large coefficients. It adds a penalty term to the model's objective function, encouraging the
coefficients to take on smaller values. This reduces the impact of correlated variables, as it
forces the model to rely more on the overall pattern of the data and less on the influence of any
single variable. It improves the generalization of the model to new, unseen data.
Before regularization is applied, features are usually rescaled. Normalization involves scaling the data to have a minimum value of 0 and a maximum value of 1. On the other hand,
standardization involves scaling the data so that it has a mean of zero and a standard deviation
of one. Ridge regression and the least absolute shrinkage and selection operator (LASSO) are two widely used regularized regression approaches.
Ridge Regression
Ridge regression is a regression technique used to analyze data and make predictions. It is similar to ordinary least squares regression but
includes a penalty term that constrains the size of the model's coefficients. Consider a dataset
with n observations on each of k features in addition to a single output variable y and, for
simplicity, assume that we are estimating a standard linear regression model with hats above
parameters denoting their estimated values. The relevant objective function (referred to as a loss function) is:

L = (1/n) ∑_{j=1}^{n} ( y_j − α̂ − β̂1 x_1j − β̂2 x_2j − … − β̂k x_kj )^2 + λ ∑_{i=1}^{k} β̂_i^2
or
L = RSS + λ ∑_{i=1}^{k} β̂_i^2
The first term in the expression is the residual sum of squares, which measures how well the
model fits the data. The second term is the shrinkage term, which introduces a penalty for large
slope parameter values. This is known as regularization, and it helps to prevent overfitting,
which is when a model fits the training data too well and performs poorly on new, unseen data.
The parameter λ is a hyperparameter, which means that it is not part of the model itself but is
used to determine the model. In this case, it controls the relative weight given to the shrinkage
term versus the model fit term. It is essential to tune the value of λ (i.e., to perform hyperparameter optimization) to find an appropriate balance between fitting the data and keeping the coefficients small. Note that the α and β coefficients are model parameters, while λ is a hyperparameter.
LASSO Regression
Like ridge regression, LASSO introduces a penalty term to the objective function to prevent overfitting. However, the penalty
term in LASSO regression takes the form of the absolute value of the coefficients rather than their squared values:
L = (1/n) ∑_{j=1}^{n} ( y_j − α̂ − β̂1 x_1j − β̂2 x_2j − … − β̂k x_kj )^2 + λ ∑_{i=1}^{k} |β̂_i|

or

L = RSS + λ ∑_{i=1}^{k} |β̂_i|
Unlike LASSO, ridge regression has closed-form solutions. This means that the values of the coefficients can be calculated directly, without the
need for iterative optimization. On the other hand, LASSO does not have closed-form solutions
for the coefficients, so a numerical optimization procedure must be used to determine the values
of the parameters.
Ridge regression and LASSO have a crucial difference. Ridge regression adds a penalty term that
reduces the magnitude of the β parameters and makes them more stable. The effect of this is to
“shrink” the β parameters towards zero, but not all the way to zero. This can be especially useful
when there is multicollinearity among the variables, as it can help to prevent any one variable from dominating the model.
However, LASSO sets some of the less important β parameters to exactly zero. The effect of this
is to perform feature selection, as the β parameters corresponding to the least important
features will be set to zero. In contrast, the β parameters corresponding to the more important
features will be retained. This can be useful in cases where the number of variables is very large,
and some variables are irrelevant or redundant. The choice between LASSO and ridge regression
depends on the specific needs of the model and the data at hand.
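A minimal scikit-learn sketch comparing OLS, ridge, and LASSO coefficients on synthetic data is shown below. The data, the λ values (called alpha in scikit-learn), and the use of standardization are illustrative assumptions.

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
X[:, 4] = X[:, 3] + rng.normal(scale=0.01, size=200)      # two highly correlated features
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200)  # features 3-5 are irrelevant

X = StandardScaler().fit_transform(X)  # regularization is usually applied to standardized features

for name, model in [("OLS", LinearRegression()),
                    ("Ridge", Ridge(alpha=1.0)),
                    ("LASSO", Lasso(alpha=0.1))]:
    model.fit(X, y)
    # Ridge shrinks coefficients toward zero; LASSO sets some of them exactly to zero.
    print(name, np.round(model.coef_, 3))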
Elastic Net
Elastic net regularization is a method that combines the L1 (LASSO) and L2 (ridge) regularization techniques in a single model. The loss function is:
L = (1/n) ∑_{j=1}^{n} ( y_j − α̂ − β̂1 x_1j − β̂2 x_2j − … − β̂k x_kj )^2 + λ1 ∑_{i=1}^{k} β̂_i^2 + λ2 ∑_{i=1}^{k} |β̂_i|

or

L = RSS + λ1 ∑_{i=1}^{k} β̂_i^2 + λ2 ∑_{i=1}^{k} |β̂_i|
Elastic net retains the advantages of both L1 and L2 regularization. These advantages include decreasing the magnitude of some
parameters and eliminating some unimportant ones. This can help to improve the model's ability to generalize to new data.
Example: Regularization
OLS regression determines the coefficients of the model by minimizing the sum of the squared
residuals (RSS). Note that it does not incorporate any regularization and can therefore lead to
large coefficients and overfitting. On the other hand, ridge regularization adds a penalty
term to the RSS. The penalty term is the sum of the squared coefficient values multiplied by λ, which controls the
strength of the penalty and can be adjusted to find an optimal balance between the model's
fitness and the model's simplicity. Notice that as λ increases, the penalty term becomes more dominant and the estimated coefficients shrink further toward zero.
As discussed earlier, LASSO uses the sum of the absolute values of the coefficients as the penalty
term. This leads to some coefficients being reduced to zero, which eliminates unnecessary
features from the model. Notice the same from the table above. Similar to ridge regression, the strength of the LASSO penalty is controlled by the hyperparameter λ.
Choosing the value of the hyperparameter in a regularized regression model is an important step
in the modeling process, as it can significantly impact the model's performance. One common
approach to selecting the value of the hyperparameter is to use cross-validation, which involves
splitting the data into a training set, a validation set, and a test set. This was discussed in detail
in Chapter 14. The training set is used to fit the model and determine the coefficients for
different values of λ . The validation set determines how well the model generalizes to new data.
The test set is used to evaluate the final performance of the model and provide an unbiased estimate of how it will perform on new data.
Decision Trees
A decision tree is a supervised machine-learning technique that can be used to predict either a
categorical target variable (producing a classification tree) or a continuous target variable (producing a regression tree). It creates
a tree-like decision model based on the input features. At each internal node of the tree, there is
a question, and the algorithm makes a decision based on the value of one of the features. It then
branches an observation to another node or a leaf. A leaf is a terminal node that leads to no
further nodes. In other words, a decision tree consists of an initial root node, decision nodes, and leaf (terminal) nodes.
Classification and Regression Tree (CART) is a decision tree algorithm commonly used for
supervised learning tasks, such as classification and regression. One of the main benefits of
CART is that it is highly interpretable, meaning it is easy to understand how the model makes
predictions. This is because CART models are built using a series of simple decision rules that
are easy to understand and follow. For this reason, CART models are often referred to as “white-
box models,” in contrast to other techniques like neural networks, which are often referred to as
“black-box models.” Neural networks are more challenging to interpret because they are based
on complex mathematical equations that are not as easy to understand and follow.
The following is a visual representation of a simple model for predicting whether a company will default on its debt.
When building a decision tree, the goal is to create a model that can accurately predict the value
of a target variable based on the importance of other features in the dataset. To do this, the
decision tree must decide which features to split on at each node of the tree. The tree is
constructed by starting at the root node and recursively partitioning the data into smaller and
smaller groups based on the values of the chosen features. We use a measure called information
gain to decide which feature to split on at each node. Information gain quantifies how much additional information a feature provides about the target variable. In other words, it measures how much the feature helps in predicting the target.
There are two commonly used measures of information gain: entropy and the Gini coefficient.
Both of these measures are used to evaluate the purity of a node in the decision tree. The goal is
to choose the feature that results in the most significant reduction in entropy or the Gini
coefficient, as this will be the most helpful feature in predicting the target variable.
Entropy ranges from 0 to 1 for a binary outcome, with 0 representing a completely ordered or predictable system and 1 representing maximum disorder (unpredictability). It is calculated as:
Entropy = − ∑_{i=1}^{K} p_i log2(p_i)
Where K is the total number of possible outcomes and p_i is the probability of outcome i. The
logarithm used in the formula is typically the base-2 logarithm, also known as the binary
logarithm.
The Gini coefficient is calculated as:

Gini = 1 − ∑_{i=1}^{K} p_i^2
A credit card company is building a decision-tree model to classify credit card holders as high-
risk or low-risk for defaulting on their payments. They have the following data on whether a
credit card holder has defaulted (“Defaulted”) and two features (for the label and the features, in
each case, “yes” = 1 and “no” = 0): whether the credit card holder has a high income and whether the holder has previously made late payments.
The base entropy measures the randomness (uncertainty) of the output series before any split is made. It is calculated as:
Entropy = − ∑_{i=1}^{K} p_i log2(p_i)
Where K is the number of possible outcomes and p_i is the probability of outcome i. The logarithm used in the formula is the base-2 logarithm, also known as the binary logarithm.
In this case, three credit card holders defaulted, and five didn't.
Entropy = − ( 3/8 × log2(3/8) + 5/8 × log2(5/8) ) = 0.954
Both features are binary, so there are no issues with determining a threshold as there would be
for a continuous series. The first stage is to calculate the entropy if the split was made for each
of the two features. Examining the High_income feature first, among high-income credit card
owners (feature = 1), two defaulted while two did not, leading to entropy for this sub-set of:
Entropy = − ( 2/4 × log2(2/4) + 2/4 × log2(2/4) ) = 1
Among non-high-income credit card owners (feature = 0), one defaulted while three did not, leading to an entropy of:

Entropy = − ( 1/4 × log2(1/4) + 3/4 × log2(3/4) ) = 0.811
The weighted entropy for splitting by income level is therefore given by:
Entropy = 4/8 × 1 + 4/8 × 0.811 = 0.906
We repeat this process by calculating the entropy that would occur if the split were made on the late payments feature.
Three of the four credit card owners who made late payments (feature = 1) defaulted, while one
did not.
Entropy = − ( 3/4 × log2(3/4) + 1/4 × log2(1/4) ) = 0.811
Among the four credit card owners who did not make late payments (feature = 0), none
defaulted, so the entropy of that sub-set is 0. The weighted entropy for the late payments feature is, therefore:
Entropy = 4/8 × 0.811 + 4/8 × 0 = 0.4055
Notice that the weighted entropy is minimized (equivalently, the information gain is maximized) when the sample is first split by the late payments feature.
This feature therefore becomes the root node of the decision tree. For credit card owners who do not make late
payments (i.e., the feature =0), there is already a pure split as none of them defaulted. This is to
say that credit card holders who make timely payments do not default. This means that no
further splits are required along this branch. The (incomplete) tree structure is, therefore:
Ensemble Techniques
In an ensemble approach, a collection of models is used to make predictions rather than relying on the output of a single model. The idea behind
ensemble learning is that the individual models in the ensemble may have different error rates
and make noisy predictions. Still, by taking the average result of many predictions from various
models, the noise can be reduced, and the overall forecast can be more accurate.
There are two objectives of using an ensemble approach in machine learning. First, ensembles
can often achieve better performance than individual models (think of the law of large numbers
where, as the number of models in the ensemble increases, the overall prediction accuracy tends
to improve). Second, ensembles can be more robust and less prone to overfitting, as they are
able to average out the errors made by individual models. Some common ensemble techniques are discussed below.
Bootstrap Aggregation
Bootstrap aggregation (bagging) involves constructing multiple decision trees by sampling from the original training data. The decision trees are then
combined to make a final prediction. A basic bagging algorithm for a decision tree involves the following steps:
1. Sample the training data with replacement to obtain multiple subsets of the training data.
2. Construct a decision tree on each subset of the training data using the usual techniques.
3. Combine the predictions made by each of the decision tree models (e.g., by averaging them) to make a final forecast.
Sampling with replacement is a statistical method that involves randomly selecting a sample
from a dataset and returning the selected element back into the dataset before choosing the next
element. This means that an element can be selected multiple times, or it can be left out entirely.
Sampling with replacement allows for the use of out-of-bag (OOB) data for model evaluation.
OOB data are observations that were not selected in a particular sample, and therefore were not
used for model training. These observations can be used to evaluate the model's performance, as
they can provide an estimate of how the model will perform on unseen data.
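A minimal scikit-learn sketch of bagging with out-of-bag evaluation is shown below; the synthetic data and the choice of 100 trees are illustrative assumptions. The default base learner in BaggingClassifier is a decision tree.

import numpy as np
from sklearn.ensemble import BaggingClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# Each tree is trained on a bootstrap sample (sampling with replacement);
# the observations left out of a given sample form that tree's out-of-bag (OOB) set.
bagging = BaggingClassifier(n_estimators=100, bootstrap=True,
                            oob_score=True, random_state=0).fit(X, y)

print("OOB accuracy estimate:", round(bagging.oob_score_, 3))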
Random Forests
A random forest is an ensemble of decision trees. The number of features chosen for each tree is
usually approximately equal to the square root of the total number of features. The individual
decision trees in a random forest are trained on different subsets of the data and different
subsets of the features, which means that each tree may give a slightly different prediction.
However, by combining the predictions of all the trees, the random forest can produce a more
accurate final prediction. The performance improvements of ensembles are often greatest when
the individual model outputs have low correlations with one another, because this helps the errors of the individual models to offset one another.
Boosting
Boosting is an ensemble learning technique that involves training a series of weak models, where
each successive model is trained on the errors or residuals of its predecessor. The goal of
boosting is to improve the model's overall performance by combining the weaker models'
predictions to reduce bias and variance. Gradient boosting and AdaBoost (adaptive boosting) are two commonly used boosting algorithms.
AdaBoost
AdaBoost is a boosting algorithm that trains a series of weak models, where each successive
model focuses more on the examples that were difficult for its predecessor to predict correctly.
This results in new predictors that concentrate more and more on the hard cases. Specifically,
AdaBoost adjusts the weights of the training examples at each iteration based on the previous
model's performance, focusing the training on the examples that are most difficult to predict.
1. The AdaBoost algorithm first trains a base classifier (such as a decision tree) on the
training data.
2. The algorithm then uses the trained classifier to make predictions on the training set and
calculates the errors or residuals between the predicted labels and the true labels.
3. The algorithm then adjusts the weights of the training examples based on the previous
classifier's performance, focusing the training on the examples that were most difficult to
predict correctly. Specifically, the weights of the misclassified examples are increased, while those of correctly classified examples are decreased.
4. A second classifier is then trained on the updated weights. The whole process is repeated
until a predetermined number of classifiers have been trained, or until the model's performance reaches a desired level.
The final prediction of the AdaBoost model is calculated by combining the predictions of all of
the individual classifiers using a weighted sum, where the weight given to each classifier depends on its accuracy.
Gradient Boosting
In gradient boosting, a new model is trained on the residuals or errors of the previous model,
which are used as the target labels for the current model. This process is repeated until a
predetermined number of models have been trained, or until the model's performance meets a
desired threshold. In contrast to AdaBoost, which adjusts the weights of the training examples at
each iteration based on the performance of the previous classifier, gradient boosting tries to fit
the new predictor to the residual errors made by the previous predictor.
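The sketch below illustrates the gradient boosting idea for a regression problem by fitting each new shallow tree to the residuals of the current ensemble; the learning rate of 0.1, the 50 rounds, and the synthetic data are illustrative assumptions.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=300)

learning_rate, n_rounds = 0.1, 50
prediction = np.full_like(y, y.mean())   # start from a constant prediction
trees = []

for _ in range(n_rounds):
    residuals = y - prediction                     # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)  # add a small correction each round
    trees.append(tree)

print("Training MSE after boosting:", round(float(np.mean((y - prediction) ** 2)), 4))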
K-Nearest Neighbors
K-nearest neighbors (KNN) is a supervised machine learning technique commonly used for
classification and regression tasks. The idea is to find similarities or “nearness” between a new
observation and its k-nearest neighbors in the existing dataset. To do this, the model uses one of
the distance metrics described in the previous chapter (Euclidean distance or Manhattan
distance) to calculate the distance between the new observation and each observation in the
training set. The k observations with the smallest distances are considered the k-nearest
neighbors of the new observation. The class label (for classification) or value (for regression) of the new observation is then determined by a majority vote or an average across its k-nearest neighbors, respectively.
KNN is sometimes called a “lazy learner” as it does not learn the relationships between the
features and the target like other approaches do. Instead, it simply stores the training data and
makes predictions based on the similarity between the new observation and its k-nearest neighbors at prediction time.
Here are the basic steps involved in implementing the KNN model:
Choosing an appropriate value for K is important, as it can impact the model's ability to
generalize to new data and avoid overfitting or underfitting. If K is too large so that many
neighbors are selected, it will give a high bias but low variance, and vice versa for small K. If the
value of K is set too small, it may result in a model that is more sensitive to individual
observations and more complex. This may allow the model to fit the training data better.
However, it may also make the model more prone to overfitting and not generalize well to new
data.
A typical heuristic for selecting K is to set it approximately equal to the square root of the size of
the training sample. For example, if the training sample contains 10,000 points, then K could be set to approximately √10,000 = 100.
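A minimal scikit-learn KNN sketch using the square-root heuristic for K is shown below; the synthetic training data is an illustrative assumption.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(4)
X_train = rng.normal(size=(400, 2))
y_train = (X_train[:, 0] ** 2 + X_train[:, 1] ** 2 > 1).astype(int)

k = int(round(np.sqrt(len(X_train))))     # square-root heuristic: k = 20 for 400 points
knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)

new_observation = np.array([[0.2, -0.1]])
print("k =", k, "prediction:", knn.predict(new_observation)[0])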
Support Vector Machines
Support vector machines (SVMs) are supervised machine learning models commonly used for
classification tasks, particularly when there are many features. An SVM works by finding the separating
hyperplane, the center line of the widest possible path between the two classes; the width of this path is called the margin.
This hyperplane (the solid blue line in the figure below) is constructed by finding the two
parallel lines that are furthest apart and that best separate the observations into the two classes.
The data points on the edge of this path, or the points closest to the hyperplane, are called
support vectors.
Emma White is a portfolio manager at Delta Investments, a firm that manages a diverse range of
investment portfolios for its clients. Delta has a portfolio of “investment-grade” stocks, which are
relatively low-risk and have a high likelihood of producing steady returns. The portfolio also
includes a selection of “non-investment grade” stocks, which are higher-risk and have the
potential for higher returns but also come with a greater risk of loss.
White is considering adding a new stock, ABC Inc., to the portfolio. ABC is a medium-sized
company in the retail sector but has not yet been rated by any of the major credit rating
agencies. To determine whether ABC is suitable for the portfolio, White decides to use machine
learning methods to predict the stock's risk level. How can Emma use the SVM algorithm to classify ABC Inc.?
Solution
White would first gather data on the features and target of bonds from companies rated as either
investment grade or non-investment grade. She would then use this data to train the SVM
algorithm to identify the optimal hyperplane that separates the two classes. Once the SVM model
is trained, White can use it to predict the rating of ABC Inc's bonds by inputting the features of
the bonds into the model and noting on which side of the margin the data point lies. If the data
point lies on the side of the margin associated with the investment grade class, then the SVM
model would predict that ABC Inc's bonds are likely to be investment grade. If the data point lies
on the side of the margin associated with the non-investment grade class, then the SVM model
would predict that ABC Inc's bonds are likely to be non-investment grade.
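A minimal sketch of how such an SVM workflow might look in scikit-learn is shown below. The two features (leverage ratio and earnings volatility), the labels, and ABC Inc.'s feature values are hypothetical.

import numpy as np
from sklearn.svm import SVC

# Hypothetical training data: [leverage ratio, earnings volatility] for rated companies,
# labeled 1 = investment grade, 0 = non-investment grade.
X_train = np.array([[0.2, 0.05], [0.3, 0.08], [0.25, 0.06], [0.35, 0.10],
                    [0.7, 0.30], [0.8, 0.35], [0.65, 0.28], [0.75, 0.32]])
y_train = np.array([1, 1, 1, 1, 0, 0, 0, 0])

svm = SVC(kernel="linear").fit(X_train, y_train)   # fit the maximum-margin hyperplane

abc_features = np.array([[0.4, 0.12]])             # hypothetical features for ABC Inc.
side = svm.predict(abc_features)[0]
print("Predicted class:", "investment grade" if side == 1 else "non-investment grade")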
Neural Networks
Neural networks (NNs), also known as artificial neural networks (ANNs), are machine learning
algorithms capable of learning and adapting to complex nonlinear relationships between input
and output data. They can be used for both classification and regression tasks in supervised
learning, as well as for reinforcement learning tasks that do not require human-labeled training
data. A feed-forward neural network with backpropagation is a type of artificial neural network
that updates its weights and biases through an iteration process called backpropagation.
In this neural network, there are three input variables, a single hidden layer comprising three
nodes and a single output variable. The output variable is determined based on the values of the
hidden nodes, which are calculated from the input variables. The hidden node values are obtained by applying an activation function to a weighted linear combination of the inputs plus a bias.
∅ is known as an activation function, which is a nonlinear function that is applied to the linear
combination of the input feature values to introduce nonlinearity into the model.
The output is then calculated as:

y = ∅(W2_11 H_1 + W2_21 H_2 + W2_31 H_3 + W_4)
The other W parameters (coefficients in the linear functions) are weights. As previously stated, if
the activation functions were not included, the model would only be able to output linear
combinations of the inputs and hidden layer values, limiting its ability to identify complex
nonlinear relationships. This is not desirable, as the main purpose of using a neural network is to capture such complex, nonlinear relationships.
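A minimal NumPy sketch of the forward pass for this architecture (three inputs, three hidden nodes, one output) is shown below, using the sigmoid function as the activation ∅. The weight and bias values are arbitrary illustrative assumptions; in practice they would be learned from the training data.

import numpy as np

def sigmoid(z):
    # A common choice of activation function.
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 0.3])            # three input features (illustrative values)

W1 = np.array([[0.2, -0.4, 0.1],          # weights from the 3 inputs to the 3 hidden nodes
               [0.7, 0.3, -0.6],
               [-0.5, 0.8, 0.2]])
b1 = np.array([0.1, -0.2, 0.05])          # hidden-layer biases

w2 = np.array([0.6, -0.3, 0.9])           # weights from the 3 hidden nodes to the output
b2 = 0.05                                 # output bias (16 parameters in total)

H = sigmoid(W1 @ x + b1)                  # hidden node values
y = sigmoid(w2 @ H + b2)                  # network output
print(round(float(y), 4))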
The parameters of a neural network are chosen based on the training data, similar to how the
parameters are chosen in linear or logistic regression. To predict the value of a continuous
variable, we can select the parameters that minimize the mean squared error. For a classification problem, we can instead use a cross-entropy (maximum likelihood) objective.
There are no exact formulas for finding the optimal values for the parameters in a neural
network. Instead, a gradient descent algorithm is used to find values that minimize the error for
the training set. This involves starting with initial values for the parameters and iteratively
adjusting them in the direction that reduces the error of the objective function. This process is
similar to stepping down a valley, with each step following the steepest descent.
The learning rate is a hyperparameter that determines the size of the step taken during the
gradient descent algorithm. If the learning rate is too small, it will take longer to reach the
optimal parameters, but if it is too large, the algorithm may oscillate from one side of the valley
to another instead of accurately finding the optimal values. A hyperparameter is a value set
before the model training process begins and is used to control the model's behavior. It is not a
parameter of the model itself but rather a value used to determine how the model will be trained
and function.
In the example given earlier, the neural network had 16 parameters (i.e., a total of the weights
and the biases). The presence of many hidden layers and nodes in a neural network can lead to
too many parameters and the risk of overfitting. To prevent overfitting, calculations are
performed on a validation data set while training the model on the training data set. As the
gradient descent algorithm progresses through the multi-dimensional valley, the objective
function will improve for both data sets.
However, at a certain point, further steps down the valley will begin to degrade the model's
performance on the validation data set while continuing to improve it on the training data set.
This indicates that the model is starting to overfit, so the algorithm should be stopped at this point to prevent further overfitting (early stopping).
A confusion matrix is a tool used to evaluate the performance of a binary classification model,
where the output variable is a binary categorical variable with two possible values (such as
“default” or “not default”). It is a 2×2 table that summarizes how the model's predictions compare with the actual outcomes:
i. True positive (TP) refers to the number of times the model correctly predicted a default when the borrower actually defaulted.
ii. False negative (FN) refers to the number of times the model incorrectly predicted no default when the borrower actually defaulted.
iii. False positive (FP) refers to the number of times the model incorrectly predicted a default when the borrower did not default.
iv. True negative (TN) refers to the number of times the model correctly predicted no default when the borrower did not default.
The most common performance metrics based on a confusion matrix are:
i. Accuracy: This is the model's overall accuracy, calculated as the number of correct predictions divided by the total number of predictions:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
ii. Precision: This is the proportion of positive predictions that are correct, calculated as:

Precision = TP / (TP + FP)
iii. Recall: This is the proportion of actual positive cases that were correctly predicted,
calculated as:

Recall = TP / (TP + FN)
iv. Error rate: This is the proportion of incorrect predictions made by the model, calculated as
follows:

Error rate = (FP + FN) / (TP + TN + FP + FN)
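A minimal sketch that computes these four metrics from the cells of a confusion matrix; the counts passed in are arbitrary illustrative values.

def confusion_metrics(tp, tn, fp, fn):
    # Compute accuracy, precision, recall, and error rate from the confusion-matrix cells.
    total = tp + tn + fp + fn
    return {"accuracy": (tp + tn) / total,
            "precision": tp / (tp + fp),
            "recall": tp / (tp + fn),
            "error_rate": (fp + fn) / total}

print(confusion_metrics(tp=90, tn=880, fp=20, fn=10))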
Suppose we have a dataset of 1600 borrowers, 400 of whom defaulted on their loans and 1200 of
whom did not. We can use logistic regression or a neural network to create a prediction model
that predicts the likelihood that a borrower will default on their loan. We can set a threshold (e.g., 0.5) for converting the predicted probability into a predicted class label.
Assume that a neural network with one hidden layer and backpropagation is used to model the
data. The hidden layer has 5 units, and the activation function used is the logistic function. The
loss function used in the optimization process is based on an entropy measure. Note that a loss
function is used to evaluate how well a model performs on a given task. The optimization process
aims to find the set of model parameters that minimize the loss function. Suppose that the
optimization process takes 150 iterations to converge, which means it takes 150 steps to find the set of parameter values that minimizes the loss function.
In the context of machine learning, the effectiveness of a model specification is evaluated based
on its performance in classifying a validation sample. For simplicity, a threshold of 0.5 is used to
determine the predicted class label based on the model's output probability. If the probability of
a default predicted by the model is greater than or equal to 0.5, the predicted class label is
“default.” If the probability is less than 0.5, the predicted class label is “no default.”
Adjusting the threshold can affect the true positive and false positive rates in different ways. For
example, if the threshold is set too low, the model may have a high true positive rate and a high
false positive rate because the model is classifying more observations as positive. On the other
hand, if the threshold is set too high, the model may have a low true positive rate and a low false
positive rate because the model is classifying fewer observations as positive. This trade-off
between true positive and false positive rates is similar to the trade-off between type I and type
II errors in hypothesis testing. In hypothesis testing, a type I error occurs when the null
hypothesis is rejected when it is actually true. In contrast, a type II error occurs when the null hypothesis is not rejected when it is actually false.
Hypothetical confusion matrices for the logistic regression and neural network models are presented for both the training and validation samples.
The values in the confusion matrix can be used to calculate various evaluation metrics:
Training sample Validation sample
Performance Logistic Neural Logistic Neural
metrics regression network regression network
Accuracy 0.781 0.743 0.654 0.651
Precision 0.667 0.470 0.641 0.646
Recall 0.250 0.235 0.364 0.338
The model appears to perform slightly better on the training data than on the validation data,
indicating that the model is overfitting. To improve the model's performance, it may be beneficial
to remove some of the features with limited empirical relevance or apply regularization to the
model. These steps may help reduce overfitting and improve the model's ability to generalize to
new data.
There is not much difference in the performance of the logistic regression and neural network
approaches. The logistic regression model has a higher true positive rate but a lower true
negative rate for the training data compared to the neural network model. On the other hand,
the neural network model appears to have a higher true positive rate but a lower true negative
rate for the validation data compared to the logistic regression model.
The receiver operating characteristic (ROC) curve is a graphical representation of the trade-off
between the true positive rate and the false positive rate, which is illustrated in the figure below.
The curve is constructed by varying the threshold used to classify an observation as positive or negative, and plotting the true positive rate against the false positive rate at each threshold.
A higher area under the receiver operating curve (or area under curve/AUC) value indicates
better performance, with a perfect model having an AUC of 1. An AUC value of 0.5 corresponds
to the dashed line in the figure above and indicates that the model is no better than random
guessing. In contrast, an AUC value less than 0.5 indicates that the model has a negative
predictive value.
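A minimal scikit-learn sketch of computing the ROC curve and the AUC from predicted default probabilities; the labels and scores below are illustrative values.

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])    # 1 = default, 0 = no default
y_score = np.array([0.1, 0.3, 0.8, 0.6, 0.2, 0.9, 0.4, 0.7, 0.3, 0.35])

fpr, tpr, thresholds = roc_curve(y_true, y_score)     # points on the ROC curve
print("Number of ROC points:", len(thresholds))
print("AUC:", roc_auc_score(y_true, y_score))         # 0.90 for this illustrative data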
Practice Question
An analyst evaluates two default-prediction models on the same dataset and obtains the following confusion matrices:
Model A
Predicted: Predicted:
No Default Default
Actual: No Default T N = 100 F P = 50
Actual: default F N = 50 T P = 900
Model B
Predicted: Predicted:
No Default Default
Actual: No Default T N = 120 F P = 80
Actual: default F N = 30 T P = 870
The model that is most likely to have a higher accuracy and higher precision,
respectively, is:
Solution
Model accuracy is calculated as (TP + TN) / (TP + TN + FP + FN).

Model A accuracy = (900 + 100) / (900 + 100 + 50 + 50) = 0.909
Model B accuracy = (870 + 120) / (870 + 120 + 80 + 30) = 0.900
Model A has a slightly higher accuracy than model B.
Model precision is calculated as TP / (TP + FP).
Model precision for A = 900 / (900 + 50) = 0.9474
Model precision for B = 870 / (870 + 80) = 0.9158

Model A therefore has both the higher accuracy and the higher precision.