
FRM Part I Exam

By AnalystPrep

Study Notes - Quantitative Analysis

Last Updated: Feb 12, 2024

©2024 AnalystPrep “This document is protected by International copyright laws. Reproduction and/or distribution of this document is prohibited. Infringers will be prosecuted in their local jurisdictions.”


Table of Contents

12 - Fundamentals of Probability
13 - Random Variables
14 - Common Univariate Random Variables
15 - Multivariate Random Variables
16 - Sample Moments
17 - Hypothesis Testing
18 - Linear Regression
19 - Regression with Multiple Explanatory Variables
20 - Regression Diagnostics
21 - Stationary Time Series
22 - Nonstationary Time Series
23 - Measuring Return, Volatility, and Correlation
24 - Simulation and Bootstrapping
25 - Machine-Learning Methods
26 - Machine Learning and Prediction

Reading 12: Fundamentals of Probability

After completing this reading, you should be able to:

Describe an event and an event space.

Describe independent events and mutually exclusive events.

Explain the difference between independent events and conditionally independent

events.

Calculate the probability of an event for a discrete probability function.

Define and calculate a conditional probability.

Distinguish between conditional and unconditional probabilities.

Explain and apply Bayes' rule.

Probability is the foundation of statistics, risk management, and econometrics. Probability

quantifies the likelihood that some event will occur. For instance, we could be interested in the

probability that there will be a defaulter in a prime mortgage facility.

Sample Space, Event Space, and Events

Sample Space (Ω)

A sample space is defined as the collection of all possible outcomes of an experiment. The

outcomes are dependent on the problem being studied. For example, when modeling returns

from a portfolio, the sample space is a set of real numbers. As another example, assume we want

to model defaults in loan payment; we know that there can only be two outcomes: either the firm

defaults or it doesn’t. As such, the sample space is Ω = {Default, No Default}. To give yet

another example, the sample space when a fair six-sided die is tossed is made of six different

outcomes:

Ω = {1, 2, 3, 4, 5, 6}

Events (ω)

An event is a set of outcomes (which may contain one or more elements). For example, suppose we toss a die: a “6” would constitute an event. If we toss two dice simultaneously, a {6, 2} would constitute an event. An event that contains only one outcome is termed an elementary event.

Event Space (F)

The event space is the set of all combinations of outcomes to which probabilities can be assigned. For example, consider a scenario where we toss two fair coins simultaneously. The possible outcomes are:

{HH, HT, TH, TT}

and the event space consists of these outcomes together with all of their combinations, such as {HH} or {HT, TH}.

Note: If the coins are fair, the probability of a head, P(H), equals the probability of a tail, P(T).

Probability

The probability of an event refers to the likelihood of that particular event occurring. For

example, the probability of a Head when we toss a coin is 0.5, and so is the probability of a Tail.

According to the frequentist interpretation, the probability of an event is the fraction of times the event occurs when a set of independent experiments is performed. We call this the frequentist interpretation because it defines an event's probability as the limit of its relative frequency over many trials. It is chiefly a conceptual device; in finance, we deal with actual, non-experimental events, such as the return earned on a stock.

Independent and Mutually Exclusive Events

Mutually Exclusive Events

Two events, A and B, are said to be mutually exclusive if the occurrence of A rules out the

occurrence of B, and vice versa. For example, a car cannot turn left and turn right at the same

time.

Mutually exclusive events are such that one event precludes the occurrence of all the other events. Thus, if you roll a die and a 4 comes up, that particular outcome precludes all the others, i.e., 1, 2, 3, 5, and 6. In other words, rolling a 1 and rolling a 5 are mutually exclusive events: they cannot occur simultaneously.

Furthermore, there is no way a single investment can have more than one arithmetic mean return. Thus, arithmetic mean returns of, say, 20% and 17% constitute mutually exclusive events.

Independent Events

Two events, A and B, are independent if the occurrence of A does not affect the probability of B occurring. In other words, the probability of one event happening does not depend on whether the other event occurs. For example, let A be the event that it rains in New York on March 15 and B the event that it rains in Frankfurt on March 15. Knowing whether it rained in one city tells us nothing about the other; both events may occur together, separately, or not at all.

Another example would be defining event A as getting tails on the first coin toss and B as getting tails on the second. Landing on tails on the first toss does not affect the probability of getting tails on the second toss.

Intersection

The intersection of two events, say A and B, is the set of outcomes occurring in both A and B, denoted A ∩ B. (In a Venn diagram, this is the overlapping region of A and B.)

For independent events,

P(A ∩ B) = P(A and B) = P(A) × P(B)

Independence can be extended to n events: let A1, A2, …, An be independent events; then:

P(A1 ∩ A2 ∩ … ∩ An) = P(A1) × P(A2) × … × P(An)

For mutually exclusive events,

P (A ∩ B) = P(A and B) = 0

This is because A's occurrence rules out B's occurrence. Remember that a car cannot turn left

and turn right at the same time!

Union

The union of two events, say A and B, is the set of outcomes occurring in at least one of the two sets, A or B, denoted A ∪ B. (In a Venn diagram, this is the combined region covered by A and B.)

To determine the likelihood that at least one of two mutually exclusive events occurs, we sum their individual probabilities. The following is the statistical notation:

P(A ∪ B) = P(A or B) = P(A) + P(B)

Given two events, A and B, that are not mutually exclusive (their intersection may be non-empty), the probability that at least one of the events will occur is given by:

P(A ∪ B) = P(A or B) = P(A) + P(B) − P(A ∩ B)

The Complement of a Set

Another important concept under probability is the complement of a set, denoted A^c (where A can be any event), which is the set of outcomes that are not in A. Since A and A^c are mutually exclusive and together make up the entire sample space, the axiom that total probability equals one implies that:

P(A ∪ A^c) = P(A) + P(A^c) = 1

Conditional Probability

Until now, we've only looked at unconditional probabilities. An unconditional probability (also

known as a marginal probability) is simply the probability that an event occurs without

considering any other preceding events. In other words, unconditional probabilities are not

conditioned on the occurrence of any other events; they are 'stand-alone' probabilities.

Conditional probability is the probability of one event occurring given some relationship to one or more other events. Our interest lies in the probability of an event A given that another event B has already occurred. Here's what you should ask yourself: "What is the probability of one event occurring if another event has already taken place?" We pronounce P(A|B) as "the probability of A given B," and it is given by:

P(A|B) = P(A ∩ B) / P(B)

The bar sandwiched between A and B simply indicates "given."

Bayes' Theorem

Bayes' theorem describes the probability of an event based on prior knowledge of conditions that might be related to the event. Assuming that we have two events, A and B, then according to Bayes' theorem:

P(A|B) = (P(B|A) × P(A)) / P(B)

Applying Bayes' Theorem

Suppose that we hold two bonds, A and B. Each bond has a default probability of 10% over the following year. We are also told that there is a 6% chance that both bonds will default, an 86% chance that neither will default, and a 14% chance that at least one of the bonds will default. All of this information can be summarized in a probability matrix.

Often, there is a high correlation between bond defaults. This can be attributed to the sensitivity

displayed by bond issuers when dealing with broad economic issues. The 6% probability of both bonds defaulting is higher than the 1% joint probability that would apply if the default events were independent (i.e., P(A) × P(B) = 10% × 10% = 1%).

The features of the probability matrix can also be expressed in terms of conditional probabilities.

For example, the likelihood that bond A will default given that B has defaulted is computed as:

P(A|B) = P(A ∩ B) / P(B) = 6% / 10% = 60%

This means that in 60% of the scenarios in which bond B will default, bond A will also default.

The above equation is often written as:

P(A ∩ B) = P(A|B) × P(B)   (I)

Also:

P(A ∩ B) = P(B|A) × P(A)   (II)

Equating the right-hand sides of equations (I) and (II) and rearranging gives Bayes' theorem:

P(B|A) × P(A) = P(A|B) × P(B)

⇒ P(A|B) = (P(B|A) × P(A)) / P(B)
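As a quick numerical check, the two-bond example above can be reproduced directly from the definitions. This is our Python sketch (the variable names are ours, not from the notes):

```python
# Probability matrix values from the two-bond example in the text.
p_A = 0.10    # P(A): bond A defaults
p_B = 0.10    # P(B): bond B defaults
p_AB = 0.06   # P(A ∩ B): both bonds default

# Conditional probability of A defaulting given that B has defaulted
p_A_given_B = p_AB / p_B          # 60%

# Joint probability that would apply if the defaults were independent
p_if_indep = p_A * p_B            # 1%

# Bayes' theorem recovers P(B|A) from P(A|B)
p_B_given_A = p_A_given_B * p_B / p_A   # also 60% here, since P(A) = P(B)
```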

When presented with new data, Bayes' theorem can be applied to update beliefs. To see how the theorem provides a framework for exactly how beliefs should be updated, consider the following scenario:

Example: Applying Bayes' Theorem

Based on an examination of historical data, it's been determined that all fund managers at a

certain Fund fall into one of two groups: Stars and Non-Stars. Stars are the best managers. The

probability that a Star will beat the market in any given year is 75%. Other managers are just as likely to beat the market as to underperform it (i.e., Non-Stars have 50/50 odds of beating the market). For both types of managers, the probability of beating the market is independent from one year to the next. Stars are rare: of a given pool of managers, only 16% turn out to be Stars.

A new manager was added to the portfolio of funds three years ago. Since then, the new

manager has beaten the market every year. What was the probability that the manager was a

star when the manager was first added to the portfolio? What is the probability that this

manager is a star now? What's the probability that the manager will beat the market next year,

given that he has beaten it in the past three years?

Solution

We first summarize the data by introducing some notation. The probability that a manager beats the market given that he is a Star is:

P(B|S) = 0.75 = 3/4

The probability that a non-Star manager beats the market is:

P(B|S̄) = 0.5 = 1/2

The probability that the new manager was a Star at the time he was added to the analyst's portfolio is simply the unconditional probability that any manager is a Star:

P(S) = 0.16 = 4/25

To evaluate the likelihood of him being a Star at present, we compute the likelihood of him being a Star given that he has beaten the market for three consecutive years, P(S|3B), using Bayes' theorem:

P(S|3B) = (P(3B|S) × P(S)) / P(3B)

P(3B|S) = (3/4)^3 = 27/64

The denominator is the unconditional probability that a manager beats the market for three consecutive years:

P(3B) = P(3B|S) × P(S) + P(3B|S̄) × P(S̄)

P(3B) = (3/4)^3 × 4/25 + (1/2)^3 × 21/25 = 69/400

Therefore:

P(S|3B) = ((27/64) × (4/25)) / (69/400) = 9/23 ≈ 39%

Therefore, there is a 39% chance that the manager is a Star, given that he has beaten the market for three consecutive years. This is our new belief, and it is a significant improvement on our old belief of 16%.

Finally, we compute the manager's chances of beating the market the next year. This happens to

be the summation of the chances of a star beating the market and the chances of a non-star

beating the market, weighted by the new belief:

P(B) = P(B|S) × P(S) + P(B|S̄) × P(S̄)

P(B) = 3/4 × 9/23 + 1/2 × 14/23 = 55/92 ≈ 60%

We also have that:

P(S|3B) = (P(3B|S) × P(S)) / P(3B)

The left-hand side of this formula is the posterior. In the numerator, the first factor is the likelihood, and the second is the prior.
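The full three-year update above can be sketched in a few lines of Python. This is our illustration; the variable names are assumptions, not AnalystPrep code:

```python
# Bayesian update for the Star-manager example.
p_star = 4 / 25                # prior P(S) = 0.16
p_beat_given_star = 3 / 4      # P(B|S)
p_beat_given_nonstar = 1 / 2   # P(B|S-bar)

# Likelihood of beating the market three years in a row, for each type
lik_star = p_beat_given_star ** 3        # 27/64
lik_nonstar = p_beat_given_nonstar ** 3  # 1/8

# Total (unconditional) probability of three consecutive wins
p_3b = lik_star * p_star + lik_nonstar * (1 - p_star)   # 69/400

# Posterior belief that the manager is a Star
posterior = lik_star * p_star / p_3b                    # 9/23, about 39%

# Probability of beating the market next year under the updated belief
p_next = p_beat_given_star * posterior + p_beat_given_nonstar * (1 - posterior)
```

Note how the same three quantities recur: prior, likelihood, and their normalization by the total probability of the observed evidence.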

Question 1

The probability that the Eurozone economy will grow this year is 18%, and the

probability that the European Central Bank (ECB) will loosen its monetary policy is

52%.

Assume that the joint probability that the Eurozone economy will grow and the ECB

will loosen its monetary policy is 45%. What is the probability that either the

Eurozone economy will grow or the ECB will loosen its monetary policy?

A. 42.12%

B. 25%

C. 11%

D. 17%

The correct answer is B.

The addition rule of probability is used to solve this question:

P(E) = 0.18 (the probability that the Eurozone economy will grow is 18%)

P(M) = 0.52 (the probability that the ECB will loosen its monetary policy is 52%)

P(EM) = 0.45 (the joint probability that the Eurozone economy will grow and the ECB will loosen its monetary policy is 45%)

The probability that either the Eurozone economy will grow or the central bank will loosen its monetary policy is:

P(E or M) = P(E) + P(M) − P(EM) = 0.18 + 0.52 − 0.45 = 0.25
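The addition rule used in this solution can be sketched in one line of Python (our code, not part of the original notes):

```python
# Addition rule: P(E or M) = P(E) + P(M) - P(E and M)
p_E, p_M, p_EM = 0.18, 0.52, 0.45
p_E_or_M = p_E + p_M - p_EM   # 0.25
```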

Question 2

A mathematician has given you the following conditional probabilities:

p(O|T) = 0.62: conditional probability of reaching the office if the train arrives on time
p(O|T^c) = 0.47: conditional probability of reaching the office if the train does not arrive on time
p(T) = 0.65: unconditional probability of the train arriving on time
p(O) = ?: unconditional probability of reaching the office

What is the unconditional probability of reaching the office, p(O)?

A. 0.4325

B. 0.5675

C. 0.3856

D. 0.5244

The correct answer is B.

This question can be solved using the total probability rule.

If p(T) = 0.65 (the unconditional probability of the train arriving on time is 0.65), then the unconditional probability of the train not arriving on time is p(T^c) = 1 − p(T) = 1 − 0.65 = 0.35.

Now, we can solve for p(O):

p(O) = p(O|T) × p(T) + p(O|T^c) × p(T^c) = 0.62 × 0.65 + 0.47 × 0.35 = 0.5675

Note: p(O) is the unconditional probability of reaching the office. It is simply the sum of:

1. the probability of reaching the office if the train arrives on time, multiplied by the probability of the train arriving on time, and

2. the probability of reaching the office if the train does not arrive on time, multiplied by the probability of the train not arriving on time (or, given the information, one minus the probability of the train arriving on time).
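The total probability rule in this solution can be sketched as follows (our code; the variable names are ours):

```python
# Total probability rule: P(O) = P(O|T)P(T) + P(O|T^c)P(T^c)
p_O_given_T = 0.62
p_O_given_Tc = 0.47
p_T = 0.65
p_Tc = 1 - p_T   # 0.35

p_O = p_O_given_T * p_T + p_O_given_Tc * p_Tc   # 0.5675
```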

Question 3

Suppose you are an equity analyst for the XYZ investment bank. You use historical

data to categorize the managers as excellent or average. Excellent managers

outperform the market 70% of the time and average managers outperform the

market only 40% of the time. Furthermore, 20% of all fund managers are excellent

managers and 80% are simply average. The probability of a manager outperforming

the market in any given year is independent of their performance in any other year.

A new fund manager started three years ago and outperformed the market all three

years. What’s the probability that the manager is excellent?

A. 29.53%

B. 12.56%

C. 57.26%

D. 30.21%

The correct answer is C.

The best way to visualize this problem is to start off with a probability matrix:

Kind of manager   Probability   Probability of beating the market
Excellent         0.2           0.7
Average           0.8           0.4

Let E be the event of an excellent manager, and A represent the event of an average

manager.

P(E) = 0.2 and P(A) = 0.8

Further, let O be the event of outperforming the market.

We know that:

P(O|E) = 0.7 and P(O|A) = 0.4

We want P(E|O):

P(E|O) = (P(O|E) × P(E)) / (P(O|E) × P(E) + P(O|A) × P(A))
       = (0.7^3 × 0.2) / (0.7^3 × 0.2 + 0.4^3 × 0.8)
       = 57.26%

Note: The power of three is used to indicate three consecutive years.
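Since the same Bayesian update appears in both the Star-manager example and this question, it can be wrapped in a small helper function. This is our sketch; the function name is hypothetical:

```python
def posterior_after_n_wins(prior, p_win_good, p_win_bad, n):
    """Posterior probability of the 'good' manager type after
    beating the market n years in a row, via Bayes' theorem."""
    lik_good = p_win_good ** n          # P(n wins | good type)
    lik_bad = p_win_bad ** n            # P(n wins | bad type)
    total = lik_good * prior + lik_bad * (1 - prior)   # P(n wins)
    return lik_good * prior / total

# Question 3: excellent managers (20% of pool, 70% win rate)
# vs. average managers (80% of pool, 40% win rate), three wins observed
p_excellent = posterior_after_n_wins(0.2, 0.7, 0.4, 3)   # about 0.5726
```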

Reading 13: Random Variables

After completing this reading, you should be able to:

Describe and distinguish a probability mass function from a cumulative distribution

function and explain the relationship between these two.

Understand and apply the concept of a mathematical expectation of a random variable.

Describe the four common population moments.

Explain the differences between a probability mass function and a probability density

function.

Characterize the quantile function and quantile-based estimators.

Explain the effect of a linear transformation of a random variable on the mean,

variance, standard deviation, skewness, kurtosis, median, and interquartile range.

Random Variables

A random variable is a variable whose possible values are outcomes of a random phenomenon. It is a function that maps outcomes of a random process to real values; a particular value taken on by a random variable is called a realization of the random process.

Precisely, if ω is an element of a sample space Ω and x is the realization, then X(ω) = x. Conventionally, random variables are written in upper case (such as X, Y, and Z) while the realized values are written in lower case (such as x, y, and z).

For example, let X be the random variable resulting from rolling a die. Then x, the outcome of one roll, could take any of the values 1, 2, 3, 4, 5, or 6. The probability that the resulting random variable is equal to 3 can be expressed as:

P(X = x), where x = 3

Types of Random Variables

Discrete Random Variables

A discrete random variable is one that produces a set of distinct values. A random variable is discrete if:

the range of all possible values is a finite set, e.g., {1, 2, 3, 4, 5, 6} in the case of a six-sided die; or

the range of all possible values is a countably infinite set, e.g., {1, 2, 3, …}

Examples of discrete random variables include:

Picking a random stock from the S&P 500.

The number of candidates registered for the FRM level 1 exam at any given time.

The number of study topics in a program.

Probability Functions under Discrete Random Variables

Since the possible values of a random variable are mostly numerical, they can be explained using

mathematical functions. A function f X(x) = P(X = x) for each x in the range of X is the probability

function (PF) of X and explains how the total chance (which is 1) is distributed amongst the

possible values of X.

There are two functions used when explaining the features of the distribution of discrete random

variables: probability mass function (PMF) and cumulative distribution function (CDF).

Probability Mass Function (PMF)

This function gives the probability that a random variable takes a particular value. Since PMF

outputs the probabilities, it should possess the following properties:

1. fX(x) ≥ 0 for all x in the range of X (the value returned must be nonnegative)

2. Σx fX(x) = 1 (the sum across all values in the support of the random variable must equal 1)

Example: Bernoulli Distribution

Assume that X is a Bernoulli random variable. The PMF of X is given by:

fX(x) = p^x (1 − p)^(1−x), x = 0, 1

The possible values of a Bernoulli random variable are 0 and 1. Therefore,

fX(0) = p^0 (1 − p)^(1−0) = 1 − p

and

fX(1) = p^1 (1 − p)^(1−1) = p

Looking at the above results, the first property of probability distributions (fX(x) ≥ 0) is met. For the second property:

Σx fX(x) = fX(0) + fX(1) = (1 − p) + p = 1

Moreover, the probability of observing the value 0 is 1 − p, and the probability of observing the value 1 is p. More precisely,

fX(x) = 1 − p for x = 0, and fX(x) = p for x = 1

The original notes plot the Bernoulli PMF for p = 0.7; note that the PMF is only defined for x = 0, 1.
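The Bernoulli PMF above can be sketched as a small Python function (ours, for illustration):

```python
def bernoulli_pmf(x, p):
    """PMF f_X(x) = p**x * (1 - p)**(1 - x) for x in {0, 1}, else 0."""
    if x not in (0, 1):
        return 0.0
    return p ** x * (1 - p) ** (1 - x)

p = 0.7
# f_X(0) = 1 - p and f_X(1) = p; the PMF sums to 1 over the support
probs = [bernoulli_pmf(x, p) for x in (0, 1)]
```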

Cumulative Distribution Function (CDF)

The CDF measures the probability of realizing a value less than or equal to the input x, Pr(X ≤ x). It is denoted by FX(x), and so,

FX(x) = Pr(X ≤ x)

The CDF is non-decreasing in x since it accumulates total probability. Unlike the PMF, it is defined for every real value of x, and its values lie between 0 and 1 inclusive.

For instance, the CDF of the Bernoulli random variable is:

FX(x) = 0 for x < 0; 1 − p for 0 ≤ x < 1; and 1 for x ≥ 1

FX(x) is defined for all real values of x. The graph of FX(x) against x begins at 0 and then rises by jumps at the values of x for which P(X = x) is positive, reaching its maximum value of 1. The original notes plot this step function for the Bernoulli distribution with p = 0.7.

Since the CDF is defined for all values of x, the CDF for a Bernoulli distribution with parameter p = 0.7 is:

FX(x) = 0 for x < 0; 0.3 for 0 ≤ x < 1; and 1 for x ≥ 1

22
© 2014-2024 AnalystPrep.
Relationship Between the CDF and PMF with Discrete Random Variables

The CDF can be represented as the sum of the PMF over all values that are less than or equal to x. Simply put:

FX(x) = Σ fX(t), summed over t ∈ R(X) with t ≤ x

where R(X) is the range of realized values of X.

Conversely, the PMF is the difference between consecutive values of the CDF. That is:

fX(x) = FX(x) − FX(x − 1)

Example: PMF and CDF under Discrete Random Variables

There are 8 hens with different weights in a cage. Hens 1 to 3 weigh 1 kg, hens 4 and 5 weigh 2 kg, and the rest weigh 3 kg. We need to develop the PMF and the CDF.

Solution

The random variable X, the weight of a hen, takes the values 1 kg, 2 kg, or 3 kg:

fX(1) = Pr(X = 1) = 3/8
fX(2) = Pr(X = 2) = 2/8 = 1/4
fX(3) = Pr(X = 3) = 3/8

So, the PMF is:

fX(x) = 3/8 for x = 1; 1/4 for x = 2; and 3/8 for x = 3

For the CDF, we accumulate the PMF over all realized values of the random variable. So,

FX(0) = Pr(X ≤ 0) = 0
FX(1) = Pr(X ≤ 1) = 3/8
FX(2) = Pr(X ≤ 2) = 3/8 + 2/8 = 5/8 [using FX(x) = Σ fX(t), t ≤ x]
FX(3) = Pr(X ≤ 3) = 5/8 + 3/8 = 1

So the CDF is:

FX(x) = 0 for x < 1; 3/8 for 1 ≤ x < 2; 5/8 for 2 ≤ x < 3; and 1 for x ≥ 3

Note that

fX (x) = FX (x) − FX (x − 1)

which implies that:

fX(3) = FX(3) − FX(2) = 1 − 5/8 = 3/8

This gives the same result as before.
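The PMF-to-CDF relationship in this example can be sketched in Python (our code; representing the PMF as a dictionary is our choice):

```python
# PMF of hen weights from the example: 3 hens of 1 kg, 2 of 2 kg, 3 of 3 kg
pmf = {1: 3/8, 2: 2/8, 3: 3/8}

def cdf(x):
    """CDF as the running sum of the PMF over support values <= x."""
    return sum(p for value, p in pmf.items() if value <= x)

# Recover the PMF as differences of consecutive CDF values:
# f_X(x) = F_X(x) - F_X(x - 1)
recovered = {x: cdf(x) - cdf(x - 1) for x in pmf}
```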

Continuous Random Variables

A continuous random variable can assume any value along a given interval of a number line.

For instance, the interval could be x > 0, −∞ < x < ∞, or 0 < x < 1. Examples of continuous random variables include the price of a stock or bond, or the value at risk of a portfolio at a particular point in time.

The following relationship holds for a continuous random variable X:

P [r1 < X < r2 ] = p

This implies that p is the likelihood that the random variable X falls between r1 and r2 .

The Probability Density Function (PDF) under Continuous Random Variables

A probability density function (PDF) allows us to calculate the probability of an event.

Given a PDF f(x), we can determine the probability that x falls between a and b:

Pr(a < X ≤ b) = ∫_a^b f(x) dx

The probability that X lies between two values is the area under the density function's graph between those two values. "Probability distribution function" is another term sometimes used to refer to the probability density function.

The properties of the PDF mirror those of the PMF. That is:

1. fX(x) ≥ 0 for −∞ < x < ∞ (nonnegativity)

2. ∫_{rmin}^{rmax} f(x) dx = 1 (the total probability must equal 1, just as for discrete random variables)

where rmin and rmax are the lower and upper bounds of the support of f(x).

Cumulative Distribution Functions (CDF) under Continuous Random Variables

The CDF is also called the cumulative density function and is closely related to the concept of a PDF. A CDF gives the likelihood of a random variable falling at or below a specific value. To determine the CDF, the PDF is integrated from its lower bound.

The corresponding density function’s capital letter has traditionally been used to denote the CDF.

The following computation depicts a CDF, F(x), of a random variable X whose PDF is f(x):

F(a) = ∫_{−∞}^{a} f(x) dx = P[X ≤ a]

The area under the PDF up to a given point depicts the CDF. The CDF is non-decreasing and varies from zero to one: it must be zero at the minimum of the support, since the variable cannot be less than that minimum, and the likelihood that the random variable is less than or equal to the maximum of the support is 100%.

To obtain the PDF from the CDF, we have to compute the first derivative of the CDF. Therefore:

f(x) = dF(x)/dx

Next, we look at how to determine the probability that a random variable X will fall between

some two values, a and b.

P[a < X ≤ b] = ∫_a^b f(x) dx = F(b) − F(a)

where a is less than b.

The following relationship is also true:

P[X > a] = 1 − F(a)

Example: Formulating the CDF of a Continuous Random Variable

The continuous random variable X has a PDF of f(x) = 12x^2 (1 − x) for 0 < x < 1. We need to find the expression for F(x).

Solution

We know that:

F(x) = ∫_{−∞}^{x} f(t) dt

F(x) = ∫_0^x 12t^2 (1 − t) dt = [4t^3 − 3t^4]_0^x = x^3 (4 − 3x)

So,

F(x) = x^3 (4 − 3x)
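As a sketch (ours), the closed-form CDF can be cross-checked by numerically integrating the PDF with a simple midpoint rule:

```python
def pdf(x):
    """PDF f(x) = 12x^2(1 - x) on (0, 1), zero elsewhere."""
    return 12 * x**2 * (1 - x) if 0 < x < 1 else 0.0

def cdf_numeric(x, n=10_000):
    """Midpoint-rule approximation of the integral of the PDF from 0 to x."""
    h = x / n
    return sum(pdf((i + 0.5) * h) for i in range(n)) * h

def cdf_closed(x):
    """Closed form derived in the text: F(x) = x^3 (4 - 3x)."""
    return x**3 * (4 - 3 * x)
```

At x = 1 the closed form gives 1, as a CDF must; at interior points the numerical and closed-form values agree to high precision.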

Expected Values

Expected values are numerical summaries of features of the distribution of a random variable. Denoted by E[X] or μ, the expected value measures the average, or center, of the distribution of X: it is the mean of the distribution.

For discrete random variables, the expected value is given by:

E[X] = Σx x fX(x)

It is simply the sum, over the support, of each value of the random variable multiplied by the probability of that value.

Example: Calculating the Expected Value of a Discrete Random Variable

There are 8 hens with different weights in a cage. Hens 1 to 3 weigh 1 kg, hens 4 and 5 weigh

2kg, and the rest weigh 3kg. We need to calculate the mean weight of the hens.

Solution

We had calculated the PMF as:

fX(x) = 3/8 for x = 1; 1/4 for x = 2; and 3/8 for x = 3

Now,

E[X] = Σx x fX(x) = 1 × 3/8 + 2 × 1/4 + 3 × 3/8 = 2

So, the mean weight of the hens in the cage is 2 kg.
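The discrete expectation can be sketched in two lines of Python (ours):

```python
# Mean weight of the hens: E[X] = sum of x * f_X(x) over the support
pmf = {1: 3/8, 2: 1/4, 3: 3/8}
mean = sum(x * p for x, p in pmf.items())   # 2.0
```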

For a continuous random variable, the mean is given by:

E[X] = ∫_{−∞}^{∞} x f(x) dx

Essentially, we integrate the product of each value of the random variable and the probability density at that value.

Example: Calculating the Expected Value of a Continuous Random Variable

The continuous random variable X has a PDF of f(x) = 12x^2 (1 − x) for 0 < x < 1. We need to calculate E[X].

Solution

We know that:

E[X] = ∫_{−∞}^{∞} x f(x) dx

28
© 2014-2024 AnalystPrep.
So,

E[X] = ∫_0^1 x · 12x^2 (1 − x) dx = [3x^4 − (12/5)x^5]_0^1 = 0.6

For a function of a random variable, we apply the same method as for a "single" random variable: we sum or integrate the product of each value of the function and the probability assumed by the corresponding value of the random variable.

Assume that the function is g(x). Then:

E[g(X)] = Σx g(x) fX(x)

for the discrete case, and

E[g(X)] = ∫_{−∞}^{∞} g(x) f(x) dx

for the continuous case.

Example: Calculating the Expected Value of a Function of a Random Variable

A random variable X has a PDF of:

fX(x) = (1/5) x^2, for 0 < x < 3

Calculate E(2X + 1).

Solution

E[g(X)] = ∫_{−∞}^{∞} g(x) f(x) dx

E(2X + 1) = ∫_0^3 (2x + 1) · (1/5) x^2 dx = (1/5) [x^4/2 + x^3/3]_0^3 = 9.9
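The same midpoint-rule idea (our sketch, using the density given in the example) reproduces E(2X + 1) numerically:

```python
def pdf(x):
    """Density used in the example: f_X(x) = x^2 / 5 on (0, 3)."""
    return x**2 / 5 if 0 < x < 3 else 0.0

def expect_g(g, a=0.0, b=3.0, n=30_000):
    """Midpoint-rule approximation of E[g(X)] = integral of g(x) f(x) dx."""
    h = (b - a) / n
    mids = (a + (i + 0.5) * h for i in range(n))
    return sum(g(m) * pdf(m) for m in mids) * h

e_2x_plus_1 = expect_g(lambda x: 2 * x + 1)   # about 9.9
```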

Properties of Expectation

The expectation operator is a linear operator: for constants a and b, E(aX + b) = aE(X) + b. In particular, the expectation of a constant is the constant itself, E(c) = c. Moreover, the expected value of a random variable is a constant, not a random variable.

For a non-linear function g(x), E(g(X)) ≠ g(E(X)) in general. For instance, E(1/X) ≠ 1/E(X).

The Variance of a Random Variable

The variance of a random variable measures the spread (dispersion or variability) of the distribution about its mean. Mathematically,

Var(X) = E[(X − E(X))^2] = E(X^2) − [E(X)]^2

The standard deviation is the square root of the variance. Denoting E(X) = μ, we have:

Var(X) = E(X^2) − μ^2

Example: Calculating the Variance of a Random Variable

The continuous random variable X has a PDF of f(x) = 12x^2 (1 − x) for 0 < x < 1. We need to calculate Var[X].

Solution

We know that:

Var(X) = E(X^2) − [E(X)]^2

We had calculated E(X) = 0.6. We now calculate E(X^2):

E(X^2) = ∫_0^1 x^2 · 12x^2 (1 − x) dx = [(12/5)x^5 − 2x^6]_0^1 = 0.4

So,

Var(X) = 0.4 − 0.6^2 = 0.04
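A numerical sketch (ours) of the moment computations for f(x) = 12x^2(1 − x):

```python
def pdf(x):
    """f(x) = 12x^2(1 - x) on (0, 1), zero elsewhere."""
    return 12 * x**2 * (1 - x) if 0 < x < 1 else 0.0

def moment(k, n=20_000):
    """Midpoint-rule approximation of the non-central moment E[X^k]."""
    h = 1.0 / n
    mids = ((i + 0.5) * h for i in range(n))
    return sum(m**k * pdf(m) for m in mids) * h

mean = moment(1)             # about 0.6
var = moment(2) - mean**2    # about 0.04
```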

Moments

Moments are defined as the expected values that briefly describe the features of a distribution.

The first moment is defined to be the expected value of X:

μ1 = E(X)

Therefore, the first moment provides information about the average value. The second and higher moments are broadly divided into central and non-central moments.

Central Moments

The general formula for the central moments is:

μk = E[(X − E(X))^k], k = 2, 3, …

Where k denotes the order of the moment. Central moments are moments about the mean.

Non-Central Moments

Non-central moments describe those moments about 0. The general formula is given by:

μk = E(X^k)

Note that central moments can be constructed from the non-central moments, and that the first moment, μ1 = E(X), is conventionally taken about zero (i.e., it is a non-central moment).

Population Moments

The four common population moments are: mean, variance, skewness, and kurtosis.

The Mean

The mean is the first moment and is given by:

μ = E(X)

It is the average (also called the location of the distribution) value of X.

The Variance

This is the second moment. It is presented as:

σ^2 = E[(X − E(X))^2] = E[(X − μ)^2]

The variance measures the spread of the random variable from its mean. The standard deviation

(σ) is the square root of the variance. The standard deviation is more commonly quoted in the

world of finance because it is easily comparable to the mean since they share the measurement

units.

The Skewness

Skewness is a cubed standardized central moment given by:

skew(X) = E[(X − E(X))³] / σ³ = E[((X − μ)/σ)³]

Note that (X − μ)/σ is a standardized version of X, with a mean of 0 and a variance of 1.

Skewness can be positive or negative.

Positive skew

The right tail is longer

The mass of the distribution is concentrated on the left

There are a few relatively high values.

In most cases (but not always), the mean is greater than the median, or equivalently,

the mean is greater than the mode; in which case the skewness is greater than zero.

Negative skew

The left tail is longer

The mass of the distribution is concentrated on the right

The distribution has a few relatively low values.

In most cases (but not always), the mean is lower than the median, or equivalently,

the mean is lower than the mode, in which case the skewness is lower than zero.

Kurtosis

The Kurtosis is defined as the fourth standardized moment given by:

Kurt(X) = E[(X − E(X))⁴] / σ⁴ = E[((X − μ)/σ)⁴]

The description of kurtosis is analogous to that of skewness, except that the fourth power means kurtosis measures the magnitude of deviations regardless of their direction, giving extra weight to the tails. The reference value for a normally distributed random variable is 3. A random variable with kurtosis exceeding 3 is said to be heavy-tailed (fat-tailed).

Effect of Linear Transformation on Moments

In very basic terms, a linear transformation is a change to a variable characterized by one or

more of the major math operations:

adding a constant to the variable,

subtracting a constant from the variable,

multiplying the variable by a constant,

and/or dividing the variable by a constant.

Transformation results in the formation of a new random variable.

If X is a random variable and α and β are constants, then α + βX is a linear transformation of X. Here, α is referred to as the shift constant and β as the scale constant. The transformation shifts X by α and scales it by β. The process results in a new random variable, usually denoted by Y:

Y = α + βX

Linear transformation of random variables is informed by the fact that many variables used in

finance and risk management do not have a natural scale.

Example: Linear Transformation of Random Variables

Suppose your salary is α dollars per year, and you are entitled to a bonus of β dollars for every

dollar of sales you successfully bring in. Let X be what you sell in a certain year. How much in

total do you make?

Solution

We can linearly transform the sales variable X into a new variable Y that represents the total

amount made.

Y = α + βX

Where α serves as the shift constant and β as the scale constant.

Effect on Mean and Variance

If Y = α + βX, where α and β are constants, then the mean of Y is given by:

E(Y) = E(α + βX) = α + βE(X)

The variance is given by:

Var(Y) = Var(α + βX) = β²Var(X) = β²σ²

The shift parameter α does not affect the variance. Why? Because variance is a measure of

spread from the mean; adding α does not change the spread but merely shifts the distribution to

the left or right.

The standard deviation of Y is given by:

√(β²σ²) = |β|σ

It also follows that α does not affect the standard deviation.
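These rules can be sketched in a few lines of code; the numbers below are illustrative, not from the text:

```python
# Sketch of the rules above for Y = alpha + beta*X:
# E(Y) = alpha + beta*E(X); Var(Y) = beta^2 * Var(X); sd(Y) = |beta| * sd(X).
# Illustrative values: E(X) = 4, Var(X) = 4, and the transformation Y = 3 - 4X.

mu_x, var_x = 4.0, 4.0
alpha, beta = 3.0, -4.0

mu_y = alpha + beta * mu_x        # the shift and the scale both affect the mean
var_y = beta ** 2 * var_x         # the shift alpha drops out of the variance
sd_y = abs(beta) * var_x ** 0.5   # |beta| times the standard deviation

print(mu_y, var_y, sd_y)  # -13.0 64.0 8.0
```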

Effect on Skewness and Kurtosis

It can also be shown that if β is positive (so that Y = α + βx is an increasing transformation), then

the skewness and kurtosis of Y are identical to the skewness and kurtosis of X. This is because

both moments are defined on standardized quantities, which removes the effect of the shift

constant α and the scaling factor β . This can be seen as follows:

We know that:

skew(X) = E[((X − μ)/σ)³]

Now,

skew(Y) = E[(Y − E(Y))³] / σ_Y³ = E[((Y − E(Y))/σ_Y)³]

= E[((α + βX − (α + βμ)) / (βσ))³]

= E[((β(X − μ)) / (βσ))³] = E[((X − μ)/σ)³] = skew(X)

However, if β < 0, the magnitude of skewness of Y is the same as that of X but with the opposite

sign because of the odd power (i.e., 3). On the other hand, the kurtosis is unaffected because it

uses an even power (i.e., 4).
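The sign flip can be illustrated by simulation. A sketch using a right-skewed exponential sample and the transformation Y = 3 − 4X (both illustrative choices):

```python
import random

# Simulation sketch: for Y = alpha + beta*X with beta < 0, the sample skewness
# of Y equals minus the sample skewness of X (the odd third power flips the sign).

random.seed(42)
x = [random.expovariate(1.0) for _ in range(100_000)]   # right-skewed sample

def sample_skew(data):
    n = len(data)
    m = sum(data) / n
    sd = (sum((v - m) ** 2 for v in data) / n) ** 0.5
    return sum(((v - m) / sd) ** 3 for v in data) / n

alpha, beta = 3.0, -4.0
y = [alpha + beta * v for v in x]

print(round(sample_skew(x), 3), round(sample_skew(y), 3))  # equal magnitude, opposite signs
```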

Quantiles and Modes

Just like any data, quantities such as the quantiles and the modes are used to describe the

distribution.

The Quantiles

For a continuous random variable X, the α-quantile of X is the smallest number m such that:

P(X < m) = α

where α ∈ [0, 1].

For instance, if X is a continuous random variable, the median is defined to be the solution of:

P(X < m) = ∫_{−∞}^{m} f_X(x) dx = 0.5

Similarly, the lower and upper quartiles are such that P(X < Q₁) = 0.25 and P(X < Q₃) = 0.75.

The interquartile range (IQR) is an alternative measure of spread. It is given by:

IQR = Q₃ − Q₁
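Quantiles can be computed numerically by inverting the CDF. A sketch using bisection, illustrated with the CDF F(x) = 4x³ − 3x⁴ that corresponds to the pdf f(x) = 12x²(1 − x) used earlier:

```python
# Finding the alpha-quantile by bisection on the CDF. The CDF used here,
# F(x) = 4x^3 - 3x^4 on [0, 1], corresponds to the pdf f(x) = 12x^2(1 - x).

def cdf(x):
    return 4 * x ** 3 - 3 * x ** 4

def quantile(alpha, lo=0.0, hi=1.0, tol=1e-10):
    """Smallest m in [lo, hi] with F(m) = alpha, found by bisection."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if cdf(mid) < alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

q1, median, q3 = quantile(0.25), quantile(0.5), quantile(0.75)
print(round(median, 4), round(q3 - q1, 4))
```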

Example: Calculating the Median of a PDF

The random variable X has a pdf given by:

f_X(x) = 2e^{−2x}, x > 0

Calculate the median of the distribution.

Solution

Denote the median by m. Then m is such that:

P(X < m) = ∫₀^m 2e^{−2x} dx = 0.5

So,

[−e^{−2x}]₀^m = 1 − e^{−2m} = 0.5

⇒ m = −(1/2) × ln(1/2) = 0.3466

Mode

The mode measures the most common outcome, that is, the location of the most frequently observed value of a random variable. For a continuous random variable, the mode is the highest point of the PDF.

Random variables can be unimodal if there’s just one mode, bimodal if there are two modes, or

multimodal if there are more than two modes.

The graph below shows the difference between unimodal and bimodal distributions.

Question 1

If a random variable X has a mean of 4 and a standard deviation of 2, calculate Var(3 − 4X).

A. 29

B. 30

C. 64

D. 35

Solution

The correct answer is C.

Recall that:

Var(α + βX) = β²Var(X)

So,

Var(3 − 4X) = (−4)²Var(X) = 16Var(X)

But we are given that the standard deviation is 2, implying that the variance is 4.

Therefore,

Var(3 − 4X) = 16 × 4 = 64

Question 2

A continuous random variable has a pdf given by f_X(x) = ce^{−3x} for all x > 0. Calculate P(X < 6.5).

A. 0.4532

B. 0.4521

C. 0.3321

D. 0.9999

Solution

The correct answer is D.

We need to find the constant c first. We know that:


∫ f(x)dx = 1
−∞

So,

∫₀^∞ ce^{−3x} dx = c[−(1/3)e^{−3x}]₀^∞ = c[0 − (−1/3)] = c/3 = 1

⇒ c = 3

Therefore, the PDF is f_X(x) = 3e^{−3x}, so that P(X < 6.5) is given by:

∫₀^{6.5} 3e^{−3x} dx = [−e^{−3x}]₀^{6.5} = 1 − e^{−3×6.5} = 0.9999

Reading 14: Common Univariate Random Variables

After completing this reading, you should be able to:

Distinguish the key properties among the following distributions: uniform distribution,

Bernoulli distribution, Binomial distribution, Poisson distribution, normal distribution,

lognormal distribution, Chi-squared distribution, student’s t, and F-distributions, and

identify common occurrences of each distribution.

Describe a mixture distribution and explain the creation and characteristics of mixture

distributions.

Parametric Distributions

There are two types of distributions, namely parametric and non-parametric distributions. Parametric distributions are described by mathematical functions with a fixed set of parameters. A non-parametric distribution, on the other hand, cannot be described by such a function. Examples of parametric distributions are the uniform and normal distributions.

Discrete Random Variables

Bernoulli Distribution

A Bernoulli random variable is a discrete random variable that takes on the values 0 and 1. This distribution is suitable for scenarios with binary outcomes, such as corporate defaults. Conventionally, 1 is labeled a “success” and 0 a “failure.”

The Bernoulli distribution has a parameter p which is the probability of success, i.e., the

probability that X=1, then:

P [X = 1] = p and P [X = 0] = 1 − p

The probability mass function of the Bernoulli distribution stated as X ∼ Bernoulli (p) is given by:

f_X(x) = p^x (1 − p)^{1−x}, x ∈ {0, 1}

The PMF confirms that:

P [X = 1] = p and P [X = 0] = 1 − p

The CDF of a Bernoulli distribution is a step function given by:

F_X(x) = 0 for x < 0; 1 − p for 0 ≤ x < 1; and 1 for x ≥ 1.

Therefore, the mean and variance of the distribution are computed as:

E (X) = p × 1 + (1 − p) × 0 = p

V (X) = E(X 2 ) − [E(X)]2 = [p × 12 + (1 − p) × 0 2 ] − p 2 = p(1 − p)

Example: Bernoulli Distribution

What is the ratio of the mean to variance for X~Bernoulli(0.75)?

Solution

We know that for Bernoulli Distribution,

E(X) = p

and

V (X) = p(1 − p)

So,

E(X) p 1
= = =4
V (X) p(1 − p) 0.25

Thus, E(X): V(X)=4:1
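The example can be confirmed in a couple of lines:

```python
# Bernoulli(p): E(X) = p and V(X) = p(1 - p); mean-to-variance ratio for the
# example's p = 0.75.

p = 0.75
mean = p
variance = p * (1 - p)
ratio = mean / variance
print(ratio)  # 4.0
```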

Binomial Distribution

A binomial distribution is a collection of Bernoulli random variables. A binomial random variable counts the total number of successes across n independent Bernoulli trials, each with probability of success p and, of course, probability of failure 1 − p. Consider the following example:

Suppose we are given two independent bonds with a default likelihood of 10%. Then we have the

following possibilities:

Both do not default,

Both of them default, or

Only one of them defaults.

Let X represent the number of defaults:

P [X = 0] = (1 − 10%)2 = 81%

P [X = 1] = 2 × 10% × (1 − 10%) = 18%

P [X = 2] = 10%2 = 1%

If we possess three independent bonds having a 10% default probability then:

P [X = 0] = (1 − 10%)3 = 72.9%

P [X = 1] = 3 × 10% × (1 − 10%)2 = 24.3%

P [X = 2] = 3 × 10% 2 × (1 − 10%) = 2.7%

P [X = 3] = 10%3 = 0.1%

Suppose now that we have n bonds. The following combination (read “n choose x”) represents the number of ways in which x of the n bonds can default:

C(n, x) = n! / (x!(n − x)!)    ………… equation I

If p is the likelihood that one bond will default, then the probability that a particular set of x bonds (and no others) defaults is given by:

p^x (1 − p)^{n−x}    ………… equation II

Combining equations I and II, we can determine the likelihood of x bonds defaulting as follows:

P[X = x] = C(n, x) p^x (1 − p)^{n−x}, for x = 0, 1, 2, …, n

This is the PMF of the binomial distribution.

Therefore, the binomial distribution has two parameters, n and p, and is usually stated as X ∼ B(n, p).

The CDF of a binomial distribution is given by:

F_X(x) = Σ_{i=0}^{⌊x⌋} C(n, i) p^i (1 − p)^{n−i}

where ⌊x⌋ is the largest integer less than or equal to x.

The mean and variance of the binomial distribution can be evaluated using moments. The mean

and variance are given by:

E(X) = np

And

V (X) = np(1 − p)

The binomial can be approximated using a normal distribution (as will be seen later) if np ≥ 10

and n(1 − p) ≥ 10

Example: Binomial Distribution

Consider a binomial distribution X ∼ B(4, 0.6). Calculate P(X ≥ 3).

Solution

We know that for binomial distribution:

P[X = x] = C(n, x) p^x (1 − p)^{n−x}

In this case, n = 4 and p = 0.6

⇒ P(X ≥ 3) = P(X = 3) + P(X = 4) = C(4, 3) 0.6³(1 − 0.6)^{4−3} + C(4, 4) 0.6⁴(1 − 0.6)^{4−4}

= 0.3456 + 0.1296 = 0.4752
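The computation above can be reproduced with the binomial PMF formula (Python 3.8+ for `math.comb`):

```python
from math import comb

# Binomial PMF from the formula above; reproduces the example's P(X >= 3)
# for X ~ B(4, 0.6).

def binom_pmf(x, n, p):
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

prob = binom_pmf(3, 4, 0.6) + binom_pmf(4, 4, 0.6)
print(round(prob, 4))  # 0.4752
```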

Poisson Distribution

Events are said to follow a Poisson process if they happen at a constant rate over time and the occurrence of one event is independent of all the other events; for instance, the number of defaults that occur in each month.

Suppose that X is a Poisson random variable, stated as X~Poisson(λ) then the PMF is given by:

P[X = x] = λ^x e^{−λ} / x!

The CDF of a Poisson distribution is given by:

F_X(x) = e^{−λ} Σ_{i=0}^{⌊x⌋} λ^i / i!

The Poisson parameter λ (lambda), termed the hazard rate, represents the mean number of

events in an interval. Therefore, the mean and variance of the Poisson distribution are given by:

E(X) = λ

And

V (X) = λ

Example: Poisson Distribution

A fixed income portfolio is made up of a large number of independent bonds. The average number of

bonds defaulting every month is 10. What is the probability that there are exactly 5 defaults in

one month?

Solution

For Poisson distribution:

P(X = x) = λ^x e^{−λ} / x!

For this question, we have that: λ = 10 and we need:

P(X = 5) = 10⁵ e^{−10} / 5! = 0.03783

The notable feature of a Poisson distribution is that it is infinitely divisible. That is, if X₁ ∼ Poisson(λ₁) and X₂ ∼ Poisson(λ₂) are independent and Y = X₁ + X₂, then:

Y ∼ Poisson(λ₁ + λ₂)

Therefore, Poisson distribution is suitable for time series data since summing the number of

events in the sampling interval does not distort the distribution.

Continuous Random Variables

Uniform Distribution

A uniform distribution is a continuous distribution that takes values within the range [a, b], with every value in the range equally likely to occur.

The PDF of a uniform distribution is given by:

f_X(x) = 1 / (b − a), a ≤ x ≤ b

Note that the PDF of a uniform random variable does not depend on x since all values are equally

likely.

The CDF of the uniform distribution is:


F_X(x) = 0 for x < a; (x − a)/(b − a) for a ≤ x ≤ b; and 1 for x > b.

When a = 0 and b = 1, the distribution is called the standard uniform distribution. From a standard uniform random variable U₁, we can construct any other uniform random variable U₂ using the formula:

U₂ = a + (b − a)U₁

where a and b are the limits of U₂.

The uniform distribution is denoted by X ∼ U(a, b), and the mean and variance are given by:

E(X) = (a + b) / 2

V(X) = (b − a)² / 12

For instance, the mean and variance of the standard uniform distribution U₁ ∼ U(0, 1) are given by:

E(X) = (0 + 1) / 2 = 1/2

and

V(X) = (1 − 0)² / 12 = 1/12

Assume that we want to calculate the probability that X falls in the interval l < X < u where l is

the lower limit and u is the upper limit. That is, we need P (l < X < u) given that X ∼ U(a, b). To

compute this, we use the formula:

P(l < X < u) = (min(u, b) − max(l, a)) / (b − a)

Intuitively, if l ≥ a and u ≤ b, the formula above simplifies to:

(u − l) / (b − a)

Example: Uniform Distribution

Given the uniform distribution X ∼ U(−5, 10), calculate the mean, variance, and P(−3 < X < 6).

Solution

For uniform distribution,

E(X) = (a + b)/2 = (−5 + 10)/2 = 2.5

And

V(X) = (10 − (−5))² / 12 = 225/12 = 18.75

For P(−3 < X < 6), using the formula:

P(l < X < u) = (min(u, b) − max(l, a)) / (b − a)

P(−3 < X < 6) = (min(6, 10) − max(−3, −5)) / (10 − (−5)) = (6 − (−3))/15 = 9/15 = 0.60

Alternatively, you can think of the probability as the area under the curve. Note that the height of the uniform distribution is 1/(b − a) and the length of the interval is u − l. That is:

(1/(b − a)) × (u − l) = (1/(10 − (−5))) × (6 − (−3)) = 9/15 = 0.60
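The min/max formula translates directly into code; the `max(0.0, ...)` guard (an addition, not in the text) handles intervals lying entirely outside [a, b]:

```python
# Interval probability for X ~ U(a, b) via the min/max formula above;
# reproduces P(-3 < X < 6) for X ~ U(-5, 10).

def uniform_interval_prob(l, u, a, b):
    return max(0.0, min(u, b) - max(l, a)) / (b - a)

prob = uniform_interval_prob(-3, 6, -5, 10)
print(prob)  # 0.6
```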

Normal Distribution

Also called the Gaussian distribution, the normal distribution has a symmetrical PDF, and the

mean and median coincide with the highest point of the PDF. Furthermore, the normal

distribution always has a skewness of 0 and a kurtosis of 3.

The following is the formula of a PDF that is normally distributed, for a given random variable X :

f(x) = (1/(σ√(2π))) e^{−(1/2)((x − μ)/σ)²}, −∞ < x < ∞

When a variable is normally distributed, it is often written as follows, for convenience:

X ∼ N (μ , σ 2)

Where E(X) = μ and V (X) = σ2

We read this as X is normally distributed, with a mean, μ ,and variance of σ2 . Any linear

combination of independent normal variables is also normal. To illustrate this, assume X and Y

are two variables that are normally distributed. We also have constants a and b . Then Z will be

normally distributed such that:

Z = aX + bY, such that Z ∼ N(aμ_X + bμ_Y, a²σ_X² + b²σ_Y²)

For instance, for a = b = 1, Z = X + Y, and thus Z ∼ N(μ_X + μ_Y, σ_X² + σ_Y²)

A standard normal distribution is a normal distribution whose mean is 0 and whose standard deviation is 1. It is denoted by N(0, 1), and its PDF is as shown below:

ϕ(x) = (1/√(2π)) e^{−x²/2}

To construct a normal variable whose standard deviation is σ and whose mean is μ, we multiply a standard normal variable Z by σ and then add the mean:

X = μ + σZ ⇒ X ∼ N(μ, σ²)

Three standard normal variables X1 , X2 , and X3 are combined in the following way to construct

two normal variables that are correlated:

X_A = √ρ X₁ + √(1 − ρ) X₂

X_B = √ρ X₁ + √(1 − ρ) X₃

Where X A and XB have a correlation of ρ , and are standard normal variables.

The z-value measures how many standard deviations the corresponding x value is above or below

the mean. It is given by:

z = (X − μ)/σ ∼ N(0, 1)

where

X ∼ N(μ, σ²)

Converting a normal random variable X in this way is termed standardization. The values of the standard normal CDF, Φ(z), are usually tabulated.

For example, consider the normal distribution X~N(1,2). We wish to calculate P(X>2).

Solution

We standardize and then use the z-table:

P(X > 2) = 1 − P(X ≤ 2) = 1 − Φ((2 − 1)/√2) = 1 − Φ(0.71)

From the z-table, Φ(0.71) ≈ 0.7611, so:

P(X > 2) ≈ 1 − 0.7611 = 0.2389 ≈ 24%

x-value z-value
μ 0
μ + 1σ 1
μ + 2σ 2
μ + nσ n
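Standardization can also be checked numerically with the error function, since Φ(z) = (1 + erf(z/√2))/2. A sketch for P(X > 2) with X ∼ N(1, 2):

```python
from math import erf, sqrt

# Standard normal CDF via the error function: Phi(z) = (1 + erf(z / sqrt(2))) / 2.
# Checks P(X > 2) for X ~ N(1, 2), i.e. mean 1 and variance 2.

def phi(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

mu, var = 1.0, 2.0
z = (2.0 - mu) / sqrt(var)   # standardize x = 2
prob = 1.0 - phi(z)          # P(X > 2)
print(round(prob, 4))  # 0.2398
```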

Recall that for a binomial random variable, if np ≥ 10 and n(1 − p) ≥ 10, then the binomial

distribution is normally distributed as:

X ∼ N (np, np(1 − p))

Also, a Poisson distribution is approximately normal when λ is large (e.g., λ ≥ 1,000), so that:

X ∼ N (λ , λ)

We then calculate the probabilities using standard normal distribution methods. The normal distribution is very popular compared to other distributions because:

Many discrete and continuous random variables distributions can be approximated

using the normal distribution.

The normal distribution is widely used in the Central Limit Theorem (CLT), which is utilized in hypothesis testing.

The normal distribution is closely related to other important distributions, such as the

chi-squared and the F distributions.

The notable property of the normal random variables is that they are infinitely divisible,

which makes the normal distribution suitable for modeling asset prices.

The normal distributions are closed under linear operations. In other words, the

weighted sum of the normal random variables is also normally distributed.

Lognormal Distribution

A variable X is said to be lognormally distributed if the variable Y is normally distributed such

that:

Y = lnX

This also can be treated as:

X = eY

Where

Y ∼ N (μ, σ 2 )

Since Y ∼ N(μ, σ²), the PDF of a lognormal random variable is:

f(x) = (1/(xσ√(2π))) e^{−(1/2)((ln(x) − μ)/σ)²}, x > 0

A variable is said to have a lognormal distribution if its natural logarithm has a normal

distribution. The lognormal distribution is undefined for negative values, unlike the normal

distribution that has a range of values between negative infinity and positive infinity.

If the above equation of the density function of the lognormal distribution is rearranged, we obtain an equation that has a similar form to the normal distribution. That is:

f(x) = (1/(σ√(2π))) e^{σ²/2 − μ} e^{−(1/2)((ln x − (μ − σ²))/σ)²}

From the above, we notice that the lognormal distribution is asymmetrical: it is not symmetrical around the mean, as is the case under the normal distribution. The lognormal distribution peaks at exp(μ − σ²).

The following is the formula for the mean:

E[X] = e^{μ + σ²/2}

This yields an expression that closely resembles the Taylor expansion of the natural logarithm around 1. Recall that:

r ≈ R − (1/2)R²

where R is a simple return and r is the corresponding log return.

The following is the formula for the variance of the lognormal distribution:

V(X) = E[(X − E[X])²] = (e^{σ²} − 1) e^{2μ + σ²}

Example: Lognormal Distribution

Consider a lognormal distribution given by X ∼ LogN (0.08 , 0.2) . Calculate the expected value.

Solution

For the lognormal distribution, the expected value is given by:

E[X] = e^{μ + σ²/2} = e^{0.08 + 0.2/2} = e^{0.18} = 1.1972
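The mean and variance formulas can be evaluated directly (here σ² = 0.2, as in the example):

```python
from math import exp

# Lognormal mean and variance from the formulas above, using the example's
# parameters mu = 0.08 and sigma^2 = 0.2.

mu, sigma2 = 0.08, 0.2
mean = exp(mu + sigma2 / 2)
variance = (exp(sigma2) - 1) * exp(2 * mu + sigma2)
print(round(mean, 4), round(variance, 4))  # 1.1972 0.3173
```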

Chi-Squared Distribution, χ2

Assume we have k independent standard normal variables Z₁, …, Z_k. The sum of their squares then has a chi-squared distribution, written as follows:

S = Σ_{i=1}^{k} Z_i²

So, we can denote chi-distribution as:

S ∼ χ²_k

k is called the number of degrees of freedom. It is important to note that two independent chi-squared variables, with degrees of freedom k₁ and k₂ respectively, have a sum that is chi-squared distributed with (k₁ + k₂) degrees of freedom.

The chi-squared variable is asymmetrical and takes on non-negative values only. The distribution has a mean of k and a standard deviation of √(2k).

The distribution has a mean and variance given by:

E (S) = k

and

V (S) = 2k

The chi-squared distribution takes the following PDF, for positive values of x:

f(x) = (1 / (2^{k/2} Γ(k/2))) x^{k/2 − 1} e^{−x/2}

The gamma function, Γ, is such that:

Γ(n) = ∫₀^∞ x^{n−1} e^{−x} dx

Note also that, for positive integers n, the gamma function satisfies:

Γ(n) = (n − 1)!

For instance:

Γ (3) = (3 − 1)! = 2 × 1 = 2

This distribution is widely applicable in statistics and risk management when testing hypotheses.

The chi-squared distribution is approximated by the normal distribution when k is large. This implies that:

χ²_k ≈ N(k, 2k)

This is true because as the number of degrees of freedom increases, the skewness reduces.

Degrees of freedom measure the amount of data available to estimate or test model parameters. If we have a sample of size n, the degrees of freedom are given by n − p, where p is the number of parameters estimated.

Student’s t Distribution

This distribution is often called the t distribution. Let Z be the standard normal variable, and U a

chi-square variable with k degrees of freedom. Also, assume that U is independent of Z. Then, a

random variable X that follows a t distribution is such that:

X = Z / √(U/k)

The following formula represents its PDF:

f(x) = (Γ((k + 1)/2) / (√(kπ) Γ(k/2))) (1 + x²/k)^{−(k+1)/2}

The mean of the t distribution is zero, and the distribution is symmetrical around it.

That is:

E (X) = 0

The variance is given by:

V(X) = k / (k − 2)

The kurtosis is also given by:

Kurt(X) = 3(k − 2) / (k − 4)

It is easy to see that the mean is defined for k > 1 and the variance is finite for k > 2. The kurtosis is only defined for k > 4 and is always higher than 3.

The distribution converges to a standard normal distribution as k tends towards infinity (k → ∞). When k > 2, the variance of the distribution is k/(k − 2), which converges to one as k increases.

We can also rescale the distribution to remove the effect of the degrees of freedom on the variance, giving what is called the standardized Student’s t. Using the formula:

V(aX) = a²V(X)

it is easy to see that:

V[√((k − 2)/k) X] = 1

where

X ∼ t_k

The standardized Student’s t has a mean of 0 and a variance of 1; it can still be rescaled to have any variance for k > 2. The resulting generalized Student’s t is specified by the mean, the variance, and the number of degrees of freedom. It is stated as Gen. t_k(μ, σ²).

This distribution is widely applicable in hypotheses testing, and modeling the returns of financial

assets due to the excess kurtosis it displays.

Example: Standardized Student’s t

The kurtosis of the returns on a bond portfolio is 6. What are the degrees of freedom if the returns are modeled using a Student’s t_k distribution?

Solution

We know that for the t-distribution:

Kurt(X) = 3(k − 2) / (k − 4)

∴ 6 = 3(k − 2)/(k − 4) ⇒ 2(k − 4) = k − 2

So that:

k = 6
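The kurtosis formula can be inverted algebraically, which gives a quick check of the example:

```python
# Inverting Kurt(X) = 3(k - 2)/(k - 4) for k gives k = (4*Kurt - 6)/(Kurt - 3);
# the example's kurtosis of 6 implies k = 6.

def df_from_kurtosis(kurt):
    return (4 * kurt - 6) / (kurt - 3)

k = df_from_kurtosis(6)
print(k)  # 6.0
```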

F–Distribution

The F-distribution is often used in the analysis of variance (ANOVA). The F distribution is an

asymmetric distribution that has a minimum value of 0, but no maximum value. Notably, the

curve approaches but never quite touches the horizontal axis.

X is said to follow an F-distribution with parameters k₁ and k₂ if:

X = (U₁/k₁) / (U₂/k₂) ∼ F(k₁, k₂)

provided that U₁ and U₂ are independent chi-squared random variables with k₁ and k₂ degrees of freedom, respectively.

The F-distribution has the following PDF:

f(x) = √((k₁x)^{k₁} k₂^{k₂} / (k₁x + k₂)^{k₁+k₂}) / (x B(k₁/2, k₂/2))

B(x,y) is a beta function such that:

B(x, y) = ∫₀¹ z^{x−1} (1 − z)^{y−1} dz

The distribution has the following mean and variance respectively:

E(X) = k₂ / (k₂ − 2), for k₂ > 2

σ² = 2k₂²(k₁ + k₂ − 2) / (k₁(k₂ − 2)²(k₂ − 4)), for k₂ > 4

Suppose that X is a random variable with a t-distribution with k degrees of freedom. Then X² has an F-distribution with 1 and k degrees of freedom, i.e.,

X² ∼ F(1, k)

The Beta Distribution

The beta distribution applies to continuous random variables in the range of 0 and 1. This

distribution is similar to the triangle distribution in the sense that they are both applicable in the

modelling of default rates and recovery rates. Assuming that a and b are two positive constants,

then the PDF of the beta distribution is written as:

f(x) = (1/B(a, b)) x^{a−1} (1 − x)^{b−1}, 0 ≤ x ≤ 1

where B(a, b) = Γ(a)Γ(b) / Γ(a + b)

The following two equations represent the mean and variance of the beta distribution:

μ = a / (a + b)

σ² = ab / ((a + b)²(a + b + 1))

Exponential Distribution

The exponential distribution is a continuous distribution with a parameter β, whose PDF is:

f_X(x) = (1/β) e^{−x/β}, x ≥ 0

The CDF is also given by:

F_X(x) = 1 − e^{−x/β}

The parameter of the exponential distribution determines the mean and variance of the

distribution. That is:

E(X) = β

And

V (X) = β 2

Notably, the exponential distribution is a close ‘cousin’ of the Poisson distribution: the time intervals between successive events in a Poisson process are exponentially distributed. Another feature of the exponential distribution is that it is memoryless. That is, the distribution of the remaining waiting time does not depend on how much time has already elapsed.
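Memorylessness can be verified directly from the survival function S(x) = e^{−x/β}; the values β = 10, s = 5, and t = 10 below are illustrative:

```python
from math import exp

# Memorylessness of the exponential: P(X > s + t | X > s) = P(X > t).
# Illustrated with beta = 10 and waiting times s = 5, t = 10.

beta = 10.0

def survival(x):
    return exp(-x / beta)          # P(X > x) = 1 - F(x)

s, t = 5.0, 10.0
conditional = survival(s + t) / survival(s)   # P(X > s + t | X > s)
unconditional = survival(t)                   # P(X > t)
print(round(conditional, 6), round(unconditional, 6))  # both 0.367879
```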

Example: Exponential Distribution

Assume that the time to default for a specific segment of mortgage consumers is exponentially

distributed with a β of ten years. What is the probability that a borrower will not default before

year 11?

Solution

The probability that the borrower will not default before year eleven is the survival probability, that is, one minus the cumulative distribution up to year eleven:

P(X > 11) = 1 − P(X ≤ 11) = 1 − F_X(11)

= 1 − (1 − e^{−11/10}) = e^{−1.1} = 0.3329 = 33.3%

The Mixture Distribution

Mixture distributions are new distributions built from two or more component distributions. In this summary, we shall concentrate on two-component mixtures.

Generally, a mixture distribution comes from a weighted average distribution of density

functions, and can be written as follows:

f(x) = Σ_{i=1}^{n} w_i f_i(x), such that Σ_{i=1}^{n} w_i = 1

The f_i(x)’s are the component distributions, with the w_i’s as the weights or mixing proportions. The component weights must all sum to one for the resulting mixture to be a legitimate distribution. A two-component mixture can be viewed as drawing from a Bernoulli random variable and, depending on the outcome (0 or 1), sampling from the corresponding component distribution. Constructed this way, it is straightforward to compute the CDF of the mixture when the component distributions are normal random variables. Mixture distributions are very flexible, as they fall between parametric and non-parametric distributions.

For example, consider X₁ ∼ F_{X₁}, X₂ ∼ F_{X₂}, and W ∼ Bernoulli(p). The mixture distribution of X₁ and X₂ is then given by:

Y = W X₁ + (1 − W) X₂

Both the PDF and the CDF of the mixture distribution are weighted averages of the component PDFs and CDFs. That is:

F_Y(y) = p F_{X₁}(y) + (1 − p) F_{X₂}(y)

and

f_Y(y) = p f_{X₁}(y) + (1 − p) f_{X₂}(y)

Intuitively, the computation of the central moment is done in a similar way. That is:

E(Y ) = pE(X1 ) + (1 − p)E(X 2 )

And

V (Y ) = E(Y 2 ) − (E(Y ))2

Where

E(Y 2 ) = pE(X12 ) + (1 − p)E(X 22)

Using the same logic, we can calculate the other higher central moments, such as the skewness and kurtosis. Note, however, that a mixture distribution may exhibit skewness and excess kurtosis even when its components do not (for example, when the components are normal random variables). Moreover, mixing components with different means and variances leads to a distribution that is both skewed and heavy-tailed.

Example: Mixture Distributions

Consider two normal random variables X₁ ∼ N(0.15, 0.60) and X₂ ∼ N(−0.8, 3). What is the mean of the resulting mixture distribution (Y) if the weight on X₁ is 0.6?

Solution

We know that:

E(Y ) = pE(X1 ) + (1 − p)E(X2 )


= 0.6 × 0.15 + (1 − 0.6)(−0.8)
= −0.23
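The moment formulas above extend this example to the variance as well; a sketch using the same components:

```python
# Mixture moments from the formulas above: E(Y) = p*E(X1) + (1-p)*E(X2), and
# V(Y) = E(Y^2) - E(Y)^2 with E(Xi^2) = V(Xi) + E(Xi)^2. Components as in the example.

p = 0.6
mu1, var1 = 0.15, 0.60    # X1 ~ N(0.15, 0.60)
mu2, var2 = -0.8, 3.0     # X2 ~ N(-0.8, 3)

mean = p * mu1 + (1 - p) * mu2
second = p * (var1 + mu1 ** 2) + (1 - p) * (var2 + mu2 ** 2)
variance = second - mean ** 2

print(round(mean, 2), round(variance, 4))  # -0.23 1.7766
```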

Question

The number of new clients that a wealth management company receives in a month is

distributed as a Poisson random variable with mean 2. Calculate the probability that

the company receives exactly 28 clients in a year.

A. 5.48%

B. 0.10%

C. 3.54%

D. 10.2%

The correct answer is A.

The number of clients in a year (2 × 12) has a Poi(24) distribution.

P[X = n] = (λⁿ / n!) e^{−λ}

P[X = 28] = (24²⁸ / 28!) e^{−24} = 5.48%
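This uses the additivity of independent Poissons shown earlier, and the answer can be checked directly:

```python
from math import exp, factorial

# Monthly counts are Poisson(2); the sum over 12 independent months is
# Poisson(24), so P(X = 28) follows directly from the Poisson PMF.

lam = 2 * 12
prob = lam ** 28 * exp(-lam) / factorial(28)
print(f"{prob:.2%}")  # 5.48%
```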

Reading 15: Multivariate Random Variables

After completing this reading, you should be able to:

Explain how a probability matrix can be used to express a probability mass function

(PMF).

Compute the marginal and conditional distributions of a discrete bivariate random

variable.

Explain how the expectation of a function is computed for a bivariate discrete random

variable.

Define covariance and explain what it measures.

Explain the relationship between the covariance and correlation of two random

variables and how these are related to the independence of the two variables.

Explain the effects of applying linear transformations on the covariance and correlation

between two random variables.

Compute the variance of a weighted sum of two random variables.

Compute the conditional expectation of a component of a bivariate random variable.

Describe the features of an iid sequence of random variables.

Explain how the iid property is helpful in computing the mean and variance of a sum of

iid random variables.

Multivariate Random Variables

Multivariate random variables accommodate the dependence between two or more random

variables. The concepts under multivariate random variables (such as expectations and

moments) are analogous to those under univariate random variables.

Multivariate Discrete Random Variables

Multivariate random variables involve defining several random variables simultaneously on a

sample space. In other words, multivariate random variables are vectors of random variables.

For instance, a bivariate random variable X can be a vector with two components X1 and X2 with

the corresponding realizations being x1 and x 2 , respectively.

The PMF or PDF for a bivariate random variable gives the probability that the two random

variables each take a certain value. If we wish to plot these functions, we would need three

axes: X₁, X₂, and the PMF/PDF. This is also applicable to the CDF.

The Probability Mass Function (PMF)

The PMF of a bivariate random variable is a function that gives the probability that the components of X take the values X₁ = x₁ and X₂ = x₂. That is:

f_{X₁,X₂}(x₁, x₂) = P(X₁ = x₁, X₂ = x₂)

The PMF explains the probability of realization as a function of x 1 and x 2. The PMF has the

following properties:

1. f_{X₁,X₂}(x₁, x₂) ≥ 0

2. Σ_{x₁} Σ_{x₂} f_{X₁,X₂}(x₁, x₂) = 1

Example: Trinomial Distribution

The trinomial distribution is the distribution of n independent trials where each trial results in

one of the three outcomes (a generalization of the binomial distribution). The first, second and

the third components are X1 , X2 and n − X1 − X2 respectively. However, the third component is

redundant provided that we know X1 and X 2.

The trinomial distribution has three parameters:

1. n, representing the total number of the trials

2. p1 , representing the probability of realizing X1

3. p2 , representing the probability of realizing X2

Intuitively, the probability of observing n − X1 − X 2 is:

1 − p1 − p2

The PMF of the trinomial distribution, therefore, is given by:

fX1,X2(x1, x2) = [n! / (x1! x2! (n − x1 − x2)!)] p1^x1 p2^x2 (1 − p1 − p2)^(n − x1 − x2)
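The trinomial PMF is easy to verify numerically. The following Python sketch (with illustrative parameters n = 5, p1 = 0.2, p2 = 0.3, which are not from the text) checks that the probabilities over all feasible outcomes sum to 1:

```python
from math import factorial

# PMF of the trinomial distribution; n = 5, p1 = 0.2, p2 = 0.3 are illustrative values.
def trinomial_pmf(x1, x2, n, p1, p2):
    coef = factorial(n) // (factorial(x1) * factorial(x2) * factorial(n - x1 - x2))
    return coef * p1**x1 * p2**x2 * (1 - p1 - p2)**(n - x1 - x2)

# The probabilities over all feasible (x1, x2) pairs with x1 + x2 <= n must sum to 1.
n, p1, p2 = 5, 0.2, 0.3
total = sum(trinomial_pmf(a, b, n, p1, p2)
            for a in range(n + 1) for b in range(n + 1 - a))
print(round(total, 10))  # 1.0
```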

The Cumulative Distribution Function (CDF)

The CDF of a bivariate discrete random variable returns the total probability that each

component is less than or equal to a given value. It is given by:

FX1,X2(x1, x2) = P(X1 ≤ x1, X2 ≤ x2) = ∑_{t1 ∈ R(X1), t1 ≤ x1} ∑_{t2 ∈ R(X2), t2 ≤ x2} fX1,X2(t1, t2)

In this equation, t1 ranges over the values that X1 may take as long as t1 ≤ x1. Similarly, t2 ranges

over the values that X2 may take as long as t2 ≤ x2.

Probability Matrices

The probability matrix is a tabular representation of the PMF.

Example: Probability Matrix

In financial markets, market sentiments play a role in determining the return earned on a

security. Suppose the return earned on a bond is in part determined by the rating given to the

bond by analysts. For simplicity, we are going to assume the following:

There are only three possible returns :10%, 0%, or -10%

Analyst ratings (sentiments) can be positive, neutral, or negative

We can represent this in a probability matrix as follows:

Bond Return (X 1 )
−10% 0% 10%
Analyst Positive +1 5% 5% 30%
(X2 ) Neutral 0 10% 10% 15%
Negative −1 20% 5% 0%

Each cell represents the probability of a joint outcome. For example, there's a 5% probability that the

bond returns -10% and that analysts have positive views about the bond and its issuer. In other

words, there's a 5% joint probability that the bond declines in price while carrying a positive rating.

Similarly, there's a 10% joint probability that the bond's price does not change (and hence a zero return)

while the rating is neutral.

The Marginal Distribution

The marginal distribution gives the distribution of a single variable in a joint distribution. In the

case of bivariate distribution, the marginal PMF of X1 is computed by summing up the

probabilities for X1 across all the values in the support of X 2. The resulting PMF of X1 is denoted

by f X1 (x 1 ), i.e., the marginal distribution of X1 .

f X1(x 1 ) = ∑ f X1,X2 (x 1 ,x 2 )
x2 ϵR(X2 )

Intuitively, the PMF of X2 is given by:

f X2(x 2 ) = ∑ f X1,X2 (x 1 ,x 2 )
x1 ϵR(X1 )

Example: Computing the Marginal Distribution

Using the probability matrix, we created above, we can come up with marginal distributions for

both X1 (return) and X2 (analyst ratings) as follows:

For X1 ,

P(X1 = −10%) = 5% + 20% + 10% = 35%
P(X1 = 0%) = 5% + 10% + 5% = 20%
P(X1 = +10%) = 30% + 15% + 0% = 45%

For X2 ,

P(X2 = +1) = 5% + 5% + 30% = 40%


P(X2 = 0) = 10% + 10% + 15% = 35%
P(X2 = −1) = 20% + 5% + 0% = 25%

In summary, the marginal distribution of X1 is given below:

Return(X1 ) −10% 0% 10%


P(X1 = x 1 ) 35% 20% 45%

Bond Return (X1 )


−10% 0% 10% f X2(x 2 )
Analyst Positive +1 5% 5% 30% 40%
(X2 ) Neutral 0 10% 10% 15% 35%
Negative −1 20% 5% 0% 25%
fX1 (x1 ) 35% 20% 45%

As you may have noticed, the marginal distribution satisfies the properties of a valid probability

distribution. That is:

∑ f X1 (x 1 ) = 1
∀X1

And

f X1(x 1 ) ≥ 0

This is true because the marginal PMF is a univariate distribution.
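The marginal sums above can be reproduced with a short Python sketch of the probability matrix:

```python
# Joint PMF from the probability matrix above: keys are (rating, return) pairs.
joint = {
    (+1, -0.10): 0.05, (+1, 0.00): 0.05, (+1, 0.10): 0.30,
    ( 0, -0.10): 0.10, ( 0, 0.00): 0.10, ( 0, 0.10): 0.15,
    (-1, -0.10): 0.20, (-1, 0.00): 0.05, (-1, 0.10): 0.00,
}

# Marginal PMFs: sum the joint PMF over the other variable.
f_return, f_rating = {}, {}
for (rating, ret), p in joint.items():
    f_return[ret] = f_return.get(ret, 0.0) + p
    f_rating[rating] = f_rating.get(rating, 0.0) + p

print({k: round(v, 2) for k, v in f_return.items()})  # {-0.1: 0.35, 0.0: 0.2, 0.1: 0.45}
print({k: round(v, 2) for k, v in f_rating.items()})  # {1: 0.4, 0: 0.35, -1: 0.25}
```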

We can, in addition, use the marginal PMF to compute the marginal CDF, which gives

P(X1 ≤ x1). That is:

FX1 (x1 ) = ∑ fX1 (t1 )
t1 ϵR(X1 )
t1 ≤x1

Independence of Random Variables

Recall that if the two events A and B are independent then:

P(A ∩ B) = P(A)P(B)

This principle applies to bivariate random variables as well. If the distributions of the

components of the bivariate distribution are independent, then:

f X1,X2 (x1 , x 2 ) = f X1 (x1 )f X2(x 2 )

Example: Independence of Random Variables

Now let’s use our earlier example on the return earned on a bond. If we assume that the two

variables – return and ratings – are independent, we can calculate the joint distribution by the

multiplying their marginal distributions. But are they really independent? Let’s find out! We have

already established the joint and the marginal distributions, as reproduced in the following table.

Bond Return (X1 )


−10% 0% 10% f X2(x 2 )
Analyst Positive +1 5% 5% 30% 40%
(X2 ) Neutral 0 10% 10% 15% 35%
Negative −1 20% 5% 0% 25%
fX1 (x1 ) 35% 20% 45%

So assuming that our two variables are independent, our joint distribution would be as follows:

Bond Return (X 1 )
−10% 0% 10%
Analyst Positive +1 14% 8% 18%
(X2 ) Neutral 0 12.25% 7% 15.75%
Negative −1 8.75% 5% 11.25%

We obtain the table above by multiplying the marginal PMF of the bond return by the marginal

PMF of the ratings. For example, the marginal probability that the bond return is 10% is 45% (the

sum of the third column), and the marginal probability of a positive rating is 40% (the sum of the

first row). Multiplying these two values gives the joint probability in the upper-right cell of

the table (18%):

45% × 40% = 18%

It is clear that the two variables are not independent because multiplying their marginal PMFs

does not lead us back to the joint PMF.
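This failure of independence can be checked mechanically; the sketch below compares each joint probability with the product of the corresponding marginals:

```python
# Independence check: under independence the joint PMF equals the product of marginals.
returns = [-0.10, 0.00, 0.10]
ratings = [1, 0, -1]
joint = {
    (1, -0.10): 0.05, (1, 0.00): 0.05, (1, 0.10): 0.30,
    (0, -0.10): 0.10, (0, 0.00): 0.10, (0, 0.10): 0.15,
    (-1, -0.10): 0.20, (-1, 0.00): 0.05, (-1, 0.10): 0.00,
}
f1 = {r: sum(joint[(s, r)] for s in ratings) for r in returns}  # marginal of the return
f2 = {s: sum(joint[(s, r)] for r in returns) for s in ratings}  # marginal of the rating

independent = all(abs(joint[(s, r)] - f2[s] * f1[r]) < 1e-12
                  for s in ratings for r in returns)
print(independent)  # False: e.g. f2[1]*f1[0.10] = 0.40*0.45 = 0.18, but the joint entry is 0.30
```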

The Conditional Distributions

The conditional distributions describe the probability of an outcome of a random variable

conditioned on the other random variable taking a particular value.

Recall that, given any two events A and B, then:

P(A ∩ B)
P(A│B) =
P(B)

This result can be applied in bivariate distributions. That is, the conditional distribution of X1

given X2 is defined as:

f X1,X2 (x 1, x 2 )
f X1│X2(x 1 │X2 = x2 ) =
f X2(x 2 )

From the result above, the conditional distribution is joint distribution divided by the marginal

distribution of the conditioning variable.

Example: Calculating the Conditional Distribution

Bond Return (X1 )


−10% 0% 10% f X2(x 2 )
Analyst Positive +1 5% 5% 30% 40%
(X2 ) Neutral 0 10% 10% 15% 35%
Negative −1 20% 5% 0% 25%
fX1 (x1 ) 35% 20% 45%

Suppose we want to find the distribution of bond returns conditional on a positive analyst rating.

The conditional distribution is:

f X1,X2 (x1 , X2 = 1) fX1, X2(x 1 , X2 = 1)


f(X1│X2)(x 1 │X2 = 1) = =
f X2(x 2 = 1) 40%

With this, we can proceed to determine specific conditional probabilities:

Returns (X1):                −10%              0%               10%
f(X1│X2)(x1│X2 = 1):   5%/40% = 12.5%   5%/40% = 12.5%   30%/40% = 75%

These are the conditional probabilities P(X1 = x1 | X2 = 1).

To obtain the conditional distribution, we take the joint probabilities in the positive-rating row

and divide each by the marginal probability of a positive rating (40%).

Note that the conditional PMF obeys the laws of probability, i.e.,

1. f(X1│X2)(x1│X2 = x2) ≥ 0 (non-negativity)

2. ∑∀(X1│X2) f(X1│X2)(x 1 │X2 = x2 ) = 1
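The division by the conditioning probability can be sketched in a few lines of Python:

```python
# Conditional PMF of the return given a positive rating: the positive-rating row
# of the joint PMF divided by the marginal probability of a positive rating.
row_pos = {-0.10: 0.05, 0.00: 0.05, 0.10: 0.30}   # P(X1 = x1, X2 = +1)
p_pos = sum(row_pos.values())                     # P(X2 = +1) = 0.40

cond = {x1: p / p_pos for x1, p in row_pos.items()}
print({k: round(v, 3) for k, v in cond.items()})  # {-0.1: 0.125, 0.0: 0.125, 0.1: 0.75}
```

Note that the resulting probabilities are non-negative and sum to 1, as required.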

Conditional Distribution for a Set of Outcomes

Conditional distributions can be computed for one variable, while conditioning on more than one

variable.

For example, assume that we need to compute the conditional distribution of the bond returns

given that analyst ratings are non-negative. Therefore, our conditioning set is {+1,0}:

X2 ∈ {+1, 0}

The conditional PMF must sum across all outcomes in the conditioning set S = {+1, 0}:

f(X1│X2)(x1 │ x2 ∈ S) = [∑_{x2 ∈ S} fX1,X2(x1, x2)] / [∑_{x2 ∈ S} fX2(x2)]

The marginal probability that X2 ∈ {+1 , 0} is the sum of the marginal probabilities of these two

outcomes:

fX2(+1) + fX2(0) = 40% + 35% = 75%

Bond Return (X1 )
−10% 0% 10% f X2(x 2 )
Analyst Positive +1 5% 5% 30% 40%
(X2 ) Neutral 0 10% 10% 15% 35%
Negative −1 20% 5% 0% 25%
fX1 (x1 ) 35% 20% 45%

Thus, the conditional distribution is given by:

f(X1│X2)(x1 │ x2 ∈ {+1, 0}) =
    (5% + 10%) / 75% = 20%    for x1 = −10%
    (5% + 10%) / 75% = 20%    for x1 = 0%
    (30% + 15%) / 75% = 60%   for x1 = +10%

Independence and Conditional Distribution of Random Variables

Recall that the conditional distribution is given by:

f X1,X2 (x1 , x 2 )
f(X1│X2)(x 1 │X2 = x2 ) =
fX2 (x 2 )

This can be rewritten into:

f(X1,X2) (x1 , x 2 ) = f (X1│X2 )(x1 │X2 = x 2 )fX2 (x2 )

Or

f (X1,X2)(x 1 , x2 ) = f X2│X1 (x 2 │X1 = x 1 )f X1(x 1 )

Also, if the distributions of the components of the bivariate distributions are independent, then:

f (X1,X2) (x 1 , x2 ) = f X1 (x 1 )f X2(x 2 )

If we substitute this in the above results we get:

f X1(x 1 )fX2 (x 2) = f (X1│X2) (x1 │X2 = x 2 )fX2 (x2 )


⇒ f X1(x 1 ) = f (X1│X2)(x1 │X2 = x 2 )

Applying again to

f X1,X2 (x1 , x 2 ) = f(X2│X1)(x 2 │X1 = x 1 )f X1(x 1 )

we get:

fX2 (x2 ) = f (X2 │X1) (x 2 │X1 = x 1 )

Expectations

The expectation of a function of a bivariate random variable is defined in the same way as that of

the univariate random variable. Consider the function g(X1 , X2 ). The expectation is defined as:

E(g(X 1, X2 )) = ∑ ∑ g(x1 , x 2 )fX1, X2(x 1 ,x 2 )


x1 ϵR(X1 ) x2 ϵR(X2 )

g(x1, x2) may depend on both x1 and x2, or it may be a function of only one component. Just like for

univariate random variables,

E(g(X1, X2)) ≠ g(E(X1), E(X2))

for a nonlinear function g(x1, x2).

Example: Calculating the Expectation

Consider the following probability mass function:

X1
1 2
X2 3 10% 15%
4 70% 5%

Given that g(x1, x2) = x1^(x2), calculate E(g(X1, X2)).

Solution

Using the formula:

E(g(X 1, X2 )) = ∑ ∑ g(x1 , x 2 )fX1, X2(x 1 , x 2 )


x1 ϵR(X1 ) x2 ϵR(X2 )

In this case we need:

E(g(X1 , X2 )) = ∑ ∑ g(x 1 , x 2 )f X1,X2 (x 1 , x2 )


x1 ϵ{1, 2} x2 ϵ{3,4}

= 1³(0.10) + 1⁴(0.70) + 2³(0.15) + 2⁴(0.05)

= 0.10 + 0.70 + 1.20 + 0.80 = 2.80
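The same double sum can be written directly in Python:

```python
# Verify E[g(X1, X2)] = 2.80 for g(x1, x2) = x1**x2 and the 2x2 PMF in the example.
pmf = {(1, 3): 0.10, (1, 4): 0.70, (2, 3): 0.15, (2, 4): 0.05}

expected = sum((x1 ** x2) * p for (x1, x2), p in pmf.items())
print(round(expected, 2))  # 2.8
```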

Moments

Just like the univariate random variables, we shall use the expectations to define the moments.

The first moment is defined as:

E(X) = [E(X1 ), E(X2 )] = [μ1 , μ2 ]

The second moment involves the variances of the components X1 and X2 and the covariance

between them. For instance, the variance of their sum is given by:

Var(X1 + X2) = Var(X1) + Var(X2) + 2Cov(X1, X2)

The covariance between X1 and X2 is defined as:

Cov(X1, X2) = E[(X1 − E[X1])(X2 − E[X2])] = E[X1X2] − E[X1]E[X2]

Note that Cov(X1, X1) = Var(X1), and that if X1 and X2 are independent, then

E[X1X2] = E[X1]E[X2], and thus:

Cov(X1, X2) = E[X1X2] − E[X1]E[X2] = E[X1]E[X2] − E[X1]E[X2] = 0

More often, the correlation between X1 and X2 is reported. Now let Var(X1) = σ1², Var(X2) = σ2², and

Cov(X1, X2) = σ12. Then the correlation is defined as:

Corr(X1, X2) = ρX1X2 = Cov(X1, X2)/(σ1 σ2) = σ12/(√σ1² √σ2²)

Therefore, we can write this in terms of covariance. That is:

σ12 = ρX1X2 σ1 σ2

Correlation measures the strength of the linear relationship between the two random

variables, and it always lies between -1 and 1. That is, −1 ≤ Corr(X1, X2) ≤ 1.

For instance, if X 2 = α + βX1 then:

Cov(X1 , X2 ) = Cov(X1 , α + βX1 ) = βVar(X1 )

But we know that Var(α + βX1 ) = β2 Var(X1 ). So,

Corr(X1, X2) = ρX1X2 = βVar(X1) / (√Var(X1) √(β²Var(X1))) = β/|β|

It is now evident that if β > 0, then ρX1X2 = 1, and if β < 0, then ρX1X2 = −1.

Similarly, if we consider two scaled random variables a + bX 1 and c + dX 2

Then,

Cov(a + bX1 , c + dX2 ) = bdCov(X 1 , X2 )

This implies that the scale factor in each random variable multiplicatively affects the covariance.

Using the above results, the corresponding correlation coefficient of aX1 and bX2 is given by :

Corr(aX1, bX2) = abCov(X1, X2) / (√(a²Var(X1)) √(b²Var(X2)))
= (ab/(|a||b|)) × Cov(X1, X2)/(√Var(X1) √Var(X2))
= (ab/(|a||b|)) ρX1X2
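Putting the covariance and correlation formulas to work on the bond return/rating joint PMF from earlier gives a concrete check (the resulting figures follow from the probability matrix; they are not worked examples from the text):

```python
# Covariance and correlation of return (r) and rating (s) from the joint PMF above.
joint = {
    (1, -0.10): 0.05, (1, 0.00): 0.05, (1, 0.10): 0.30,
    (0, -0.10): 0.10, (0, 0.00): 0.10, (0, 0.10): 0.15,
    (-1, -0.10): 0.20, (-1, 0.00): 0.05, (-1, 0.10): 0.00,
}
e1 = sum(r * p for (s, r), p in joint.items())        # E[X1] = 0.01
e2 = sum(s * p for (s, r), p in joint.items())        # E[X2] = 0.15
e12 = sum(s * r * p for (s, r), p in joint.items())   # E[X1 * X2] = 0.045

cov = e12 - e1 * e2                                   # Cov = E[X1X2] - E[X1]E[X2]
var1 = sum(r**2 * p for (s, r), p in joint.items()) - e1**2
var2 = sum(s**2 * p for (s, r), p in joint.items()) - e2**2
corr = cov / (var1**0.5 * var2**0.5)

print(round(cov, 4), round(corr, 3))  # 0.0435 0.618
```

The positive correlation confirms the intuition that better analyst ratings tend to accompany higher bond returns.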

Application of Correlation: Portfolio Variance and Hedging

The variance of the underlying securities and their respective correlations are the necessary

ingredients if the variance of a portfolio of securities is to be determined. Assuming that we have

two securities whose random returns are XA and XB, with means μA and μB, standard deviations

σA and σB, and correlation ρAB between them. Then, the variance of XA plus XB can be computed as follows:

σ²(A+B) = σ²A + σ²B + 2ρAB σA σB

If both securities have equal variance, σ²A = σ²B = σ², the equation simplifies to:

σ²(A+B) = 2σ²(1 + ρAB)

If the correlation between the two securities is zero, the equation can be simplified further. We

have the following relation for the standard deviation:

ρAB = 0 ⇒ σA+B = √2σ

For any number of variables, we have that:

Y = ∑_{i=1}^{n} Xi

σ²Y = ∑_{i=1}^{n} ∑_{j=1}^{n} ρij σi σj

If all the Xi's are uncorrelated and each has standard deviation σ, then we have:

σY = √n σ   if ρij = 0 ∀ i ≠ j

This is what is called the square root rule for the addition of uncorrelated variables.

Suppose that Y, XA , and XB are such that:

Y = aXA + bXB

Therefore, with our standard notation, we have that:

σY2 = a2 σA2 + b 2σB2 + 2abρAB σAσB … … … … … Eq 1

A major challenge in hedging is correlation. Suppose we hold $1 of security A and hedge it with

$h of another security B; h is the hedge ratio. The variance of the hedged portfolio P can easily be

computed by applying Eq. 1:

P = XA + hXB
σ²P = σ²A + h²σ²B + 2hρAB σA σB

The minimum-variance hedge ratio is found by taking the derivative of the portfolio variance

with respect to h and setting it equal to zero:

dσ²P/dh = 2hσ²B + 2ρAB σA σB = 0

⇒ h* = −ρAB (σA/σB)

To determine the minimum variance achievable, we substitute h* into the portfolio variance equation:

min[σP2 ] = σA2 (1 − ρ2AB )
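The hedge-ratio algebra can be verified numerically; the parameter values below (σA = 0.20, σB = 0.25, ρAB = 0.60) are assumed for illustration and are not from the text:

```python
# Minimum-variance hedge ratio h* = -rho_AB * (sigma_A / sigma_B), with
# illustrative (assumed) parameter values.
sigma_A, sigma_B, rho_AB = 0.20, 0.25, 0.60

h_star = -rho_AB * sigma_A / sigma_B
var_hedged = sigma_A**2 + h_star**2 * sigma_B**2 + 2 * h_star * rho_AB * sigma_A * sigma_B
var_min = sigma_A**2 * (1 - rho_AB**2)   # closed-form minimum variance

print(round(h_star, 2), round(var_hedged, 4), round(var_min, 4))  # -0.48 0.0256 0.0256
```

The hedged variance at h* matches the closed-form minimum σ²A(1 − ρ²AB), as the derivation requires.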

The Covariance Matrix

For a bivariate random variable X, the covariance matrix is a 2×2 matrix that displays the

variances of, and the covariance between, the components of X. It is given by:

Cov(X) = [ σ1²  σ12 ]
         [ σ12  σ2² ]

The Variance of Sums of Random Variables

The variance of the sum of two random variables is given by:

Var(X1 + X2) = Var(X1) + Var(X2) + 2Cov(X1, X2)

If the random variables are independent, then Cov(X1, X2) = 0 and thus:

Var(X1 + X2) = Var(X1) + Var(X2)

In the case of weighted random variables, the variance is given by:

Var(aX1 + bX2) = a²Var(X1) + b²Var(X2) + 2abCov(X1, X2)

Conditional Expectation

A conditional expectation is simply the mean calculated after a set of prior conditions has

happened. It is the value that a random variable takes “on average” over an arbitrarily large

number of occurrences – given the occurrence of a certain set of "conditions." A conditional

expectation uses the same expression as any other expectation and is a weighted average where

the probabilities are determined by a conditional PMF.

For a discrete random variable, the conditional expectation is given by:

E(X1│X2 = x2) = ∑_i x1i f(x1i│X2 = x2)

Example: Calculating the Conditional Expectation

In the bond return/rating example, we may wish to calculate the expected return on the bond

given a positive analyst rating, i.e., E(X1 │X2 = 1)

If you recall, the conditional distribution is as follows:

Returns (X1):                −10%              0%               10%
f(X1│X2)(x1│X2 = 1):   5%/40% = 12.5%   5%/40% = 12.5%   30%/40% = 75%

These are the conditional probabilities P(X1 = x1 | X2 = 1).

The conditional expectation of the return is determined as follows:

E(X1 │X2 = 1) = −0.10 × 0.125 + 0 × 0.125 + 0.10 × 0.75 = 0.0625 = 6.25%

Conditional Variance

We can calculate the conditional variance by substituting the expectation in the variance formula

with the conditional expectation.

We know that:

Var(X1) = E[(X1 − E(X1))²] = E(X1²) − [E(X1)]²

Now the conditional variance of X1 conditional on X2 is given by:

Var(X1 │X2 = x 2) = E(X 21 |X2 = x2 ) − [E(X1 |X2 = x2 )]2

Returning to our example above, the conditional variance Var(X1 |X2 = 1) is given by:

Var(X1│X2 = 1) = E(X1²|X2 = 1) − [E(X1|X2 = 1)]²

Now,

E(X1 |X2 = 1) = 0.0625

We need to calculate:

E(X1²│X2 = 1) = (−0.10)² × 0.125 + 0² × 0.125 + (0.10)² × 0.75 = 0.00875

So that:

Var(X1│X2 = 1) = σ²(X1│X2=1) = 0.00875 − [0.0625]² = 0.004844 = 0.4844%

If we wish to find the standard deviation of the returns, we just find the square root of the

variance:

σ(X1│X2=1) = √0.004844 = 0.06960 = 6.96%
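The conditional moments above can be recomputed in a few lines:

```python
# Conditional mean, variance, and standard deviation of the bond return
# given a positive rating, from the conditional PMF derived above.
cond = {-0.10: 0.125, 0.00: 0.125, 0.10: 0.75}

mean = sum(x * p for x, p in cond.items())          # E(X1 | X2 = 1)
second = sum(x ** 2 * p for x, p in cond.items())   # E(X1^2 | X2 = 1)
var = second - mean ** 2                            # conditional variance
std = var ** 0.5                                    # conditional standard deviation

print(round(mean, 4), round(var, 6), round(std, 4))  # 0.0625 0.004844 0.0696
```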

Continuous Random Variables

Before we continue, it is essential to note that continuous random variables make use of the

same concepts and methodologies as discrete random variables. The main distinguishing factor

is that instead of PMFs, continuous random variables use PDFs.

The Joint PDF

The joint (bivariate) distribution function gives the probability that the pair (X1 , X2 ) takes values

in a stated region A. It is given by:

P(a < X1 < b, c < X2 < d) = ∫_a^b ∫_c^d fX1,X2(x1, x2) dx2 dx1

The joint PDF is always nonnegative, and integrating it over the entire support yields a value of 1. That is:

fX1,X2(x1, x2) ≥ 0

and

∫∫ fX1,X2(x1, x2) dx1 dx2 = 1,

where the double integral is taken over the entire support of (X1, X2).

Example: Calculating the Joint Probability

Assume that the random variables (X1 ) and (X2 ) are jointly distributed as:

f X1,X2 (x1 , x 2 ) = k(x 1 + 3x 2 ) 0 < x 1 < 2 , 0 < x2 < 2

Calculate the probability P(X 1 < 1 , X2 > 1).

Solution

We need to first calculate the value of k.

Using the principle that the joint PDF must integrate to 1 over its support:

∫_0^2 ∫_0^2 k(x1 + 3x2) dx1 dx2 = 1

We have:

∫_0^2 ∫_0^2 k(x1 + 3x2) dx1 dx2 = ∫_0^2 k[(1/2)x1² + 3x1x2]_0^2 dx2
= ∫_0^2 k(2 + 6x2) dx2 = k[2x2 + 3x2²]_0^2 = 16k

16k = 1 ⇒ k = 1/16

So,

fX1,X2(x1, x2) = (1/16)(x1 + 3x2)

Therefore,

P(X1 < 1, X2 > 1) = ∫_0^1 ∫_1^2 (1/16)(x1 + 3x2) dx2 dx1 = 0.3125
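A quick numerical integration confirms the probability; a midpoint Riemann sum is used, which is exact here since the integrand is linear:

```python
# Numerical check of P(X1 < 1, X2 > 1) for f(x1, x2) = (x1 + 3*x2)/16 on (0,2)x(0,2).
n = 200
h = 1.0 / n
prob = 0.0
for i in range(n):                 # x1 over (0, 1)
    x1 = (i + 0.5) * h
    for j in range(n):             # x2 over (1, 2)
        x2 = 1.0 + (j + 0.5) * h
        prob += (x1 + 3.0 * x2) / 16.0 * h * h
print(round(prob, 4))  # 0.3125
```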

Joint Cumulative Distribution Function (CDF)

The joint cumulative distribution is given by:

x1 x2
F(X1 < x1 , X2 < x 2) = ∫ ∫ f X1,X2 (t1 , t2 )dt 1 dt2
−∞ −∞

Note that the lower bound of the integral can be adjusted so that it is the lower value of the support interval.

Using the example above, we can calculate F(1, 1) = P(X1 ≤ 1, X2 ≤ 1) in a similar way.

The Marginal Distributions

For continuous random variables, the marginal distribution is given by:

fX1(x1) = ∫_{−∞}^{∞} fX1,X2(x1, x2) dx2

Similarly,

fX2(x2) = ∫_{−∞}^{∞} fX1,X2(x1, x2) dx1

Note that if we want to find the marginal distribution of X1 we integrate X2 out and vice versa.

Example: Computing the Marginal Distribution

Consider the example above. We have that

fX1,X2(x1, x2) = (1/16)(x1 + 3x2),  0 < x1 < 2, 0 < x2 < 2

We wish to find the marginal distribution of X1. This implies that we need to integrate out X2. So,

fX1(x1) = ∫_0^2 (1/16)(x1 + 3x2) dx2 = (1/16)[x1x2 + (3/2)x2²]_0^2 = (1/16)(2x1 + 6) = (1/8)(x1 + 3)

Note that we can calculate f X2 (x 2 ) in a similar manner.
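Both the derivation of fX1 and its validity as a density can be checked numerically; midpoint sums are exact here because the integrands are linear:

```python
# Two checks on the marginal PDF: (i) integrating the joint PDF over x2 at a fixed
# x1 reproduces f_X1(x1) = (x1 + 3)/8, and (ii) the marginal integrates to 1 over (0, 2).
n = 1000
x1 = 0.5
h = 2.0 / n
marg = sum((x1 + 3.0 * (j + 0.5) * h) / 16.0 * h for j in range(n))   # integral of f(x1, x2) dx2
total = sum(((i + 0.5) * h + 3.0) / 8.0 * h for i in range(n))        # integral of f_X1(x1) dx1

print(round(marg, 4), round(total, 4))  # 0.4375 1.0
```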

Conditional Distributions

The conditional distribution is defined analogously to that of discrete random variables. That is:

f(X1│X2)(x1│X2 = x2) = fX1,X2(x1, x2) / fX2(x2)

Conditional distributions have applications in finance, such as risk management. For instance,

we may wish to compute the conditional distribution of interest rates (X1) given that

investors (X2) experience a huge loss.

Independent, Identically Distributed (IID) Random Variables

A collection of random variables is independent and identically distributed (iid) if each random

variable has the same probability distribution as the others and all are mutually independent.

Example:

Consider successive throws of a fair coin:

The coin has no memory, so all the throws are "independent".

The probability of heads vs. tails on every throw is 50:50, so the coin stays fair; the

distribution from which every throw is drawn stays the same, and thus each outcome is

"identically distributed".

iid variables are mostly applied in time series analysis.

Mean and Variance of iid Variables

Consider iid variables generated by a normal distribution. They are typically written as:

Xi ~ iid N(μ, σ²)

The expected value of the sum of these iid variables is given by:

E(∑_{i=1}^{n} Xi) = ∑_{i=1}^{n} E(Xi) = ∑_{i=1}^{n} μ = nμ

where E(Xi) = μ.

The result above is valid since the variables are independent and have similar moments.

Maintaining this line of thought, the variance of iid random variables is given by:

Var(∑_{i=1}^{n} Xi) = ∑_{i=1}^{n} Var(Xi) + 2∑_{j=1}^{n} ∑_{k=j+1}^{n} Cov(Xj, Xk)
= ∑_{i=1}^{n} σ² + 0 = nσ²

The independence property is important because there’s a difference between the variance of

the sum of multiple random variables and the variance of a multiple of a single random variable.

If X1 and X2 are iid with variance σ 2, then,

Var(X1 + X2 ) = Var(X1 ) + Var(X2 ) = σ 2 + σ 2 = 2σ 2


Var(X1 + X2 ) ≠ Var(2X1 )

In the case of a multiple of a single variable, X1, with variance σ 2,

Var(2X1 ) = 4Var(X1 ) = 4 × σ 2 = 4σ 2
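A small simulation makes the distinction concrete (the seed and sample size are arbitrary choices):

```python
import random
import statistics

# For iid X1, X2 with variance sigma^2 = 1: Var(X1 + X2) = 2, while Var(2*X1) = 4.
random.seed(7)
reps = 100_000
sums, doubles = [], []
for _ in range(reps):
    x1, x2 = random.gauss(0, 1), random.gauss(0, 1)
    sums.append(x1 + x2)
    doubles.append(2 * x1)

# The sample variances should be close to 2 and 4, respectively.
print(round(statistics.variance(sums), 2), round(statistics.variance(doubles), 2))
```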

Practice Question

A company is reviewing fire damage claims under a comprehensive business

insurance policy. Let X be the portion of a claim representing damage to inventory

and let Y be the portion of the same claim representing damage to the rest of

the property. The joint density function of X and Y is:

f(x, y) = 6[1 − (x + y)]  for x > 0, y > 0, x + y < 1, and f(x, y) = 0 elsewhere.

What is the probability that the portion of a claim representing damage to the rest of

the property is less than 0.3?

A. 0.657

B. 0.450

C. 0.415

D. 0.752

The correct answer is A.

First, we find the marginal PDF of Y:

fY(y) = ∫_0^{1−y} 6[1 − (x + y)] dx = [6(x − x²/2 − xy)]_0^{1−y}

Substituting the limits gives:

6[(1 − y) − (1 − y)²/2 − y(1 − y)]

At this point we can factor out (1 − y) and simplify what remains in the square bracket:

6(1 − y)[1 − (1 − y)/2 − y] = 6(1 − y)[(1 − y)/2]

Cancelling the 2 with the 6:

6(1 − y)[(1 − y)/2] = 3(1 − y)(1 − y) = 3(1 − 2y + y²) = 3 − 6y + 3y²

So,

f Y (y) = 3 − 6y + 3y 2 , 0 < y < 1

We need P(Y < 0.3). So,

P(Y < 0.3) = ∫_0^{0.3} (3 − 6y + 3y²) dy = 0.9 − 0.27 + 0.027 = 0.657
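The answer can be double-checked by integrating the joint density numerically over the region y < 0.3:

```python
# Numerical check of P(Y < 0.3) for f(x, y) = 6*(1 - x - y) on the triangle
# x > 0, y > 0, x + y < 1, via a midpoint Riemann sum.
ny, nx = 300, 300
prob = 0.0
for j in range(ny):
    y = 0.3 * (j + 0.5) / ny       # midpoint of each y strip in (0, 0.3)
    width = (1.0 - y) / nx         # x runs over (0, 1 - y)
    for i in range(nx):
        x = (i + 0.5) * width
        prob += 6.0 * (1.0 - x - y) * width * (0.3 / ny)
print(round(prob, 3))  # 0.657
```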

Reading 16: Sample Moments

After completing this reading, you should be able to:

Estimate the mean, variance, and standard deviation using sample data.

Explain the difference between a population moment and a sample moment.

Distinguish between an estimator and an estimate.

Describe the bias of an estimator and explain what the bias measures.

Explain what is meant by the statement that the mean estimator is BLUE.

Describe the consistency of an estimator and explain the usefulness of this concept.

Explain how the Law of Large Numbers (LLN) and Central Limit Theorem (CLT) apply

to the sample mean.

Estimate and interpret the skewness and kurtosis of a random variable.

Use sample data to estimate quantiles, including the median.

Estimate the mean of two random variables and apply the CLT.

Estimate the covariance and correlation between two random variables.

Explain how coskewness and cokurtosis are related to skewness and kurtosis.

Sample Moments

Recall that moments are defined as the expected values that briefly describe the features of a

distribution. Sample moments are those that are utilized to approximate the unknown population

moments. Sample moments are calculated from the sample data.

Such moments include mean, variance, skewness, and kurtosis. We shall discuss each moment in

detail.

Estimation of the Mean

The population mean, denoted by μ, is estimated by the sample mean (X̄). The mean estimator is

denoted by μ̂ and defined by:

μ̂ = X̄ = (1/n) ∑_{i=1}^{n} Xi

where Xi is a random variable assumed to be independent and identically distributed, so

E(Xi) = μ, and n is the number of observations.

Note that the mean estimator is a function of random variables, and thus it is itself a random variable.

Consequently, we can examine its properties as a random variable (its mean and variance).

For instance, the expectation of the mean estimator μ̂ is the population mean μ. This can be seen

as follows:

E(μ̂) = E(X̄) = E[(1/n) ∑_{i=1}^{n} Xi] = (1/n) ∑_{i=1}^{n} E(Xi) = (1/n) ∑_{i=1}^{n} μ = (1/n) × nμ = μ

The above result is true since we have assumed that Xi 's are iid. The mean estimator is an

unbiased estimator of the population mean.

The bias of an estimator is defined as:

Bias(θ̂) = E(θ̂) − θ

where θ̂ is an estimator of the population parameter θ. So, in the case of the population mean:

Bias(μ̂) = E(μ̂) − μ = μ − μ = 0

Since the bias of the mean estimator is 0, it is an unbiased estimator of the population mean.

Using conventional features of a random variable, the variance of the mean estimator is

calculated as:

Var(μ̂) = Var((1/n) ∑_{i=1}^{n} Xi) = (1/n²)[∑_{i=1}^{n} Var(Xi) + covariances]

But we are assuming that the Xi's are iid, and thus they are uncorrelated, implying that their

covariances equal 0. Consequently, taking Var(Xi) = σ², the above formula becomes:

Var(μ̂) = (1/n²) ∑_{i=1}^{n} Var(Xi) = (1/n²) × nσ² = σ²/n

Thus:

Var(μ̂) = σ²/n

Looking at the last formula, the variance of the mean estimator depends on the data variance

(σ²) and the sample size n. Consequently, the variance of the mean estimator decreases as the

number of observations (sample size) increases. This implies that the larger the sample

size, the closer the estimated mean is to the population mean.

Example: Calculating the Sample Mean

An experiment was done to find out the number of hours that candidates spend preparing for the

FRM part 1 exam. It was discovered that for a sample of 10 students, the following times were

spent:

318, 304, 317, 305, 309 , 307, 316, 309, 315, 327

What is the sample mean?

Solution

We know that:

X̄ = μ̂ = (1/n) ∑_{i=1}^{n} Xi

⇒ X̄ = (318 + 304 + 317 + 305 + 309 + 307 + 316 + 309 + 315 + 327)/10 = 312.7

Desirable Properties of the Sample Mean Estimator

The mean estimator is averagely equal to the population mean.

As the sample size (the number of the observation) increases, the sample mean tends to

the population mean.

In large samples, the sample mean is approximately normally distributed (by the Central Limit Theorem).

Estimation of Variance and Standard Deviation

The sample estimator of variance is defined as:

σ̂² = (1/n) ∑_{i=1}^{n} (Xi − μ̂)²

Note that we are still assuming that the Xi's are iid. Unlike the mean estimator, the sample

estimator of variance is biased. It can be proved that:

Bias(σ̂²) = E(σ̂²) − σ² = ((n − 1)/n)σ² − σ² = −σ²/n

This implies that the bias shrinks as the number of observations increases. Intuitively, the

source of the bias is the variance of the mean estimator (σ²/n). Since the bias is known, we can

construct an unbiased estimator of variance as:

s² = (n/(n − 1)) σ̂² = (n/(n − 1)) × (1/n) ∑_{i=1}^{n} (Xi − μ̂)² = (1/(n − 1)) ∑_{i=1}^{n} (Xi − μ̂)²

It can be shown that E(s²) = σ², and thus s² is an unbiased variance estimator. Maintaining this

line of thought, it might seem that s² is a better estimator of variance than σ̂², but this is not

necessarily true since the variance of σ̂² is less than that of s². Financial analysis typically

involves large data sets, so either of these estimators can be used. Conventionally, however,

when the number of observations is large (n ≥ 30), σ̂² is preferred.

The sample standard deviation is the square root of the sample variance. That is:

^2
^ = √σ
σ

or

s = √ s2

Note that the square root is a nonlinear function; thus, the standard deviation estimators are

biased, but the bias diminishes as the sample size increases.

Example: Calculating the Sample Variance Estimator (Unbiased)

Using the data from the sample mean example, what is the sample variance?

Solution

The sample estimator of variance is given by:

s² = (1/(n − 1)) ∑_{i=1}^{n} (Xi − μ̂)²

To make the calculation easier, we construct the following table:

Xi        (Xi − μ̂)²
318 (318 − 312.7)2 = 28.09
304 75.69
317 18.49
305 59.29
309 13.69
307 32.49
316 10.89
309 13.69
315 5.29
327 204.49
Total 462.1

So, the variance is given by:

s² = (1/(n − 1)) ∑_{i=1}^{n} (Xi − μ̂)² = 462.1/(10 − 1) = 51.34
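Both the sample mean and the unbiased sample variance can be computed in a few lines:

```python
# Sample mean, unbiased sample variance, and standard deviation for the
# 10 study-time observations from the examples above.
data = [318, 304, 317, 305, 309, 307, 316, 309, 315, 327]

n = len(data)
mean = sum(data) / n                                  # 312.7
var = sum((x - mean) ** 2 for x in data) / (n - 1)    # 462.1 / 9, roughly 51.34
std = var ** 0.5                                      # roughly 7.17

print(mean, round(var, 2), round(std, 2))
```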

Reasons Why the Mean and Standard Deviation Are Used

I. The mean and the variance are almost adequate to describe data.

II. They give a clue on the range of the values that can be observed.

III. The units of the mean and the standard deviation are the same as those of the data, and

thus they can be easily compared.

Skewness

As we saw in chapter two, the skewness is the standardized third central moment, given by:

skew(X) = E[(X − E(X))³]/σ³ = E[((X − μ)/σ)³]

Note that (X − μ)/σ is a standardized version of X with a mean of 0 and a variance of 1.

This can also be written as:

skew(X) = E[(X − E(X))³] / (E[(X − E(X))²])^(3/2) = μ₃/σ³

where μ₃ is the third central moment and σ is the standard deviation.

The skewness measures the asymmetry of the distribution (since the third power preserves the

sign of the deviation). When the skewness is negative, large negative values are more likely

than large positive values (the tail is on the left side of the distribution). Conversely, if the

skewness is positive, large positive values are more likely than large negative values (the tail

is on the right side of the distribution).

The skewness estimator applies the principle of expectation to the sample and is denoted by:

μ̂₃/σ̂³

We can estimate μ₃ as:

μ̂₃ = (1/n) ∑_{i=1}^{n} (xi − μ̂)³

Example: Calculating the Skewness

The following are the data on the financial analysis of a sales company’s income over the last 100

months:

n = 100, ∑_{i=1}^{n}(xi − μ̂)² = 674,759.90, and ∑_{i=1}^{n}(xi − μ̂)³ = −12,456.784.

Calculate the skewness.

Solution

The skewness is given by:

μ̂₃/σ̂³ = [(1/n) ∑(xi − μ̂)³] / [(1/n) ∑(xi − μ̂)²]^(3/2)
= [(1/100)(−12,456.784)] / [(1/100) × 674,759.90]^(3/2) = −0.000225
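The same computation in Python:

```python
# Skewness estimate from the summary statistics in the example.
n = 100
sum_sq = 674_759.90     # sum of (x_i - mean)^2
sum_cu = -12_456.784    # sum of (x_i - mean)^3

mu3_hat = sum_cu / n               # estimated third central moment
sigma3_hat = (sum_sq / n) ** 1.5   # estimated sigma cubed
skew = mu3_hat / sigma3_hat

print(round(skew, 6))  # -0.000225
```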

Kurtosis

The kurtosis is defined as the fourth standardized moment, given by:

Kurt(X) = E[(X − E(X))⁴]/σ⁴ = E[((X − μ)/σ)⁴]

The above can be written as:

Kurt(X) = E[(X − E(X))⁴] / (E[(X − E(X))²])² = μ₄/σ⁴

The interpretation of kurtosis is analogous to that of skewness, except that the fourth power

means kurtosis measures the magnitude of deviations regardless of their sign. The reference

value for a normally distributed random variable is 3. A random variable with kurtosis exceeding

3 is said to be heavy- or fat-tailed.

The kurtosis estimator applies the principle of expectation to the sample and is denoted by:

μ̂₄/σ̂⁴

We can estimate μ₄ (the fourth central moment) as:

μ̂₄ = (1/n) ∑_{i=1}^{n} (xi − μ̂)⁴

The BLUE Mean Estimator

We say that the mean estimator is the Best Linear Unbiased Estimator (BLUE) of the population

mean when the data used are iid. That is,

I. It has the lowest variance of any Linear Unbiased Estimator (LUE).

II. It is the unbiased estimator of the population mean (as shown earlier)

III. It is a linear function of the data used.

A linear estimator is a linear function of the data and can be defined as:

μ̂ = ∑_{i=1}^{n} ωi Xi

where ωi is independent of Xi. In the case of the sample mean estimator, ωi = 1/n. Recall that we

showed the unbiasedness of the sample mean estimator earlier.

An estimator is "best" in the BLUE sense when it has the smallest variance among all linear

unbiased estimators. However, superior estimators that are nonlinear or biased may exist, such

as Maximum Likelihood Estimators (MLE).

The Behavior of Mean in Large Sample Sizes

Recall that the mean estimator is unbiased and that its variance takes a simple form. Moreover, if the data used are iid and normally distributed, then the estimator is also normally distributed. However, deriving the exact distribution of the mean from a finite number of observations is generally difficult.

To overcome this, we use the behavior of the mean in large samples (that is, as n → ∞) to approximate the distribution of the mean in finite samples. We shall explain the behavior of the mean estimator using the Law of Large Numbers (LLN) and the Central Limit Theorem (CLT).

The Law of Large Numbers (LLN)

The law of large numbers (Kolmogorov's Strong Law of Large Numbers) states that if the Xᵢ form a sequence of iid random variables with E(Xᵢ) ≡ μ, then:

μ̂ₙ = (1/n) ∑ᵢ₌₁ⁿ Xᵢ → μ  (almost surely)

Put in words, the sample mean estimator μ̂ₙ converges almost surely (a.s.) to the population mean μ.

An estimator to which the LLN applies is said to be consistent. Consistency requires that an estimator:

I. is unbiased, or that any bias decreases as n increases; and

II. has a variance that decreases as the number of observations n increases, that is, Var(μ̂ₙ) → 0.

Moreover, the sample variance is also consistent: the LLN implies that σ̂² → σ² almost surely.

Note, however, that the LLN alone is of limited use for approximating the distribution of the mean, because Var(μ̂ₙ) tends to 0 as n → ∞, so the limiting distribution is degenerate.

The Central Limit Theorem (CLT)

The Central Limit Theorem (CLT) states that if X₁, X₂, …, Xₙ is a sequence of iid random variables with a finite mean μ and a finite non-zero variance σ², then the distribution of (μ̂ − μ)/(σ/√n) tends to a standard normal distribution as n → ∞.

Put simply,

(μ̂ − μ)/(σ/√n) → N(0, 1)

Note that μ̂ = X̄ = the sample mean.

The CLT extends the LLN and provides a way of approximating the distribution of the sample mean estimator. The CLT is widely applicable since it does not require knowledge of the distribution of the underlying random variables.

Since the CLT is asymptotic, we can also use the unstandardized form:

μ̂ ∼ N(μ, σ²/n)

Note that we can go back to the standard normal variable Z as:

Z = (μ̂ − μ)/(σ/√n)

which is the result we had initially.

The main question is: how large must n be?

The required n depends on the shape of the population (the distribution of the Xᵢ), e.g., its skewness. As a rule of thumb, the CLT approximation is appropriate when n ≥ 30.

Example: Applying CLT

A sales expert believes that the number of sales per day for a particular company has a mean of

40 and a standard deviation of 12. He surveyed for over 50 working days. What is the probability

that the sample mean of sales for this company is less than 35?

Solution

Using the information given in the question,

μ = 40, σ = 12 and n=50

By central limit theorem,

σ2
^ ∼ N (μ,
μ )
n

We need:

P(μ̂ < 35) = P(Z < (35 − 40)/(12/√50)) = P(Z < −2.946) = 1 − Φ(2.946) = 0.00161
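The probability can be checked with Python's standard library alone (statistics.NormalDist); the inputs restate the example's assumptions:

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 40, 12, 50        # assumed population mean, std dev, sample size
se = sigma / sqrt(n)             # standard error of the sample mean
z = (35 - mu) / se               # standardize the value 35
p = NormalDist().cdf(z)          # P(sample mean < 35) under the CLT approximation

print(round(z, 3), round(p, 5))  # -2.946 0.00161
```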

Estimation of Median and Other Quantiles

Median

The median is a measure of central tendency, also called the 50% quantile: it divides the distribution in half (50% of observations lie on either side of the median value).

When the sample size n is odd, the value in position (n + 1)/2 of the sorted list is used to estimate the median:

Med(x) = x_((n+1)/2)

If the number of observations is even, the median is estimated as the average of the two central points of the sorted list. That is:

Med(x) = (1/2)[x_(n/2) + x_(n/2+1)]

Example: Calculating the Median

The ages of experienced financial analysts in a country are:

56, 51, 43, 34, 25, 50.

What is the median age of the analysts?

Solution

We need to arrange the data in ascending order:

25, 34, 43, 50, 51, 56

The sample size is 6 (even), so the median is given by:

Med(Age) = (1/2)(x₃ + x₄) = (1/2)(43 + 50) = 46.5

Properties of the Median

It may not be an actual observation in the data set.

It is not affected by extreme values because the median is a positional measure.

It is used when the exact midpoint of the score distribution is desired, or when there

are many outliers (extreme observations).

Other Quantiles

Other quantiles, such as the 25% and 75% quantiles, are estimated analogously to the median. A θ-quantile is estimated from position nθ in the sorted list; if nθ is not an integer, we average the two sorted values whose positions bracket nθ.

So, in our example above, the position of the 25% quantile (θ = 0.25) is 6 × 0.25 = 1.5. This implies that we need to average the 1st and 2nd values:

q̂₂₅ = (1/2)(25 + 34) = 29.5

The Interquartile Range

The interquartile range (IQR) is defined as the difference between the 75% and 25% quantiles. That is:

IQR = q̂₇₅ − q̂₂₅

The IQR is a measure of dispersion and can therefore be used as an alternative to the standard deviation.

Continuing the example above, the position of the 75% quantile is 6 × 0.75 = 4.5, so we average the 4th and 5th values:

q̂₇₅ = (1/2)(50 + 51) = 50.5

So the IQR is:

IQR = 50.5 − 29.5 = 21
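The quantile rule described above can be sketched in Python (position nθ in the sorted list, averaging the two bracketing values when nθ is not an integer and the two central values when it is). Note this is one reading of the rule: library defaults such as numpy.quantile use different interpolation conventions, and the odd-n median rule (position (n + 1)/2) is a separate special case not handled here:

```python
import math

def quantile(data, theta):
    """theta-quantile using the position n*theta rule from the text."""
    x = sorted(data)
    pos = len(x) * theta
    if pos == int(pos):                       # integer position:
        k = int(pos)                          # average x_k and x_(k+1)
        return 0.5 * (x[k - 1] + x[k])
    lo, hi = math.floor(pos), math.ceil(pos)  # otherwise average the two
    return 0.5 * (x[lo - 1] + x[hi - 1])      # bracketing sorted values

ages = [56, 51, 43, 34, 25, 50]
q25, q50, q75 = (quantile(ages, t) for t in (0.25, 0.50, 0.75))
print(q25, q50, q75, q75 - q25)  # 29.5 46.5 50.5 21.0
```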

Desirable Properties of Quantiles

I. The units of the quantiles are the same as those of the data used hence they are easy to

interpret.

II. They are robust to outliers of the data. The median and the IQR are unaffected by the

outliers.

The Multivariate Moments

We can extend the definition of moments from univariate to multivariate random variables. The mean extends trivially: the multivariate mean is simply the vector of the univariate sample means.

Extending the variance, however, requires estimating the covariance between each pair of variables in addition to the variance of each data series. Moreover, we can also define kurtosis and skewness analogously to the univariate case.

Covariance

With covariance, we focus on the relationship between the deviations of two variables from their means, rather than on the deviation of a single variable from its mean.

Recall that the covariance of two variables X and Y is given by:

Cov(X, Y) = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]E[Y]

The sample covariance estimator is analogous to this result. It is given by:

σ̂_XY = (1/n) ∑ᵢ₌₁ⁿ (Xᵢ − μ̂_X)(Yᵢ − μ̂_Y)

where:

μ̂_X = the sample mean of X

μ̂_Y = the sample mean of Y

The sample covariance estimator is biased towards zero, but we can remove the bias by using n − 1 instead of n.

Correlation

Correlation measures the strength of the linear relationship between two random variables and always lies between −1 and 1. That is, −1 ≤ Corr(X, Y) ≤ 1.

Correlation is a standardized form of the covariance. It is estimated by dividing the sample covariance by the product of the sample standard deviation estimators of the two random variables:

ρ̂_XY = σ̂_XY / √(σ̂_X² σ̂_Y²) = σ̂_XY / (σ̂_X σ̂_Y)
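The covariance and correlation estimators can be sketched in plain Python (using the biased 1/n form, as in the text; the data are made up for illustration):

```python
import math

def sample_cov(x, y):
    """Biased (1/n) sample covariance estimator."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / n

def sample_corr(x, y):
    """Sample correlation: covariance over the product of std deviations."""
    return sample_cov(x, y) / math.sqrt(sample_cov(x, x) * sample_cov(y, y))

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]  # y = 2x, a perfect linear relationship
print(sample_cov(x, y), sample_corr(x, y))  # 2.5 1.0
```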

Sample Mean of Two Variables

We estimate the mean of each of two random variables the same way we estimate that of a single variable. That is:

μ̂_x = (1/n) ∑ᵢ₌₁ⁿ xᵢ

and

μ̂_y = (1/n) ∑ᵢ₌₁ⁿ yᵢ

Assuming both random variables are iid, we can apply the CLT to each estimator separately. If, however, we consider their joint behavior (as a bivariate statistic), the CLT applies to the two mean estimators stacked into a 2 × 1 vector:

μ̂ = (μ̂_x, μ̂_y)′

which is normally distributed as long as the random vector Z = [X, Y] is iid. The CLT for this vector depends on the covariance matrix:

[σ_X² σ_XY; σ_XY σ_Y²]

Note that in a covariance matrix, the diagonal holds the variances of the individual series, while the off-diagonal entries are the covariances between the pairs of random variables. The CLT for bivariate iid data is then:

√n (μ̂_x − μ_x, μ̂_y − μ_y)′ → N( (0, 0)′, [σ_X² σ_XY; σ_XY σ_Y²] )

Equivalently, without the √n scaling, the vector of means is approximately normally distributed:

(μ̂_x, μ̂_y)′ → N( (μ_x, μ_y)′, [σ_X²/n σ_XY/n; σ_XY/n σ_Y²/n] )

Example: Applying Bivariate CLT

The annualized estimates of the means, variances, covariance, and correlation for the monthly returns of a stock portfolio (T) and government bonds (G) over 350 months are as shown below:

Moment   μ̂_T    σ_T²    μ̂_G    σ_G²    σ_TG    ρ_TG
Value    11.9   335.6   6.80   26.7    14.0    0.1434

We need to compare the volatility, interpret the correlation coefficient, and apply bivariate CLT.

Solution

Looking at the output, it is evident that the stock return is more volatile than the government bond return, since it has a much higher variance. The correlation between the two return series is positive but small.

If we apply the bivariate CLT, then:

√n (μ̂_T − μ_T, μ̂_G − μ_G)′ → N( (0, 0)′, [335.6 14.0; 14.0 26.7] )

But the mean estimators themselves have a limiting distribution (assumed to be normal). So,

(μ̂_T, μ̂_G)′ → N( (μ_T, μ_G)′, [0.9589 0.04; 0.04 0.07629] )

Note that the new covariance matrix equals the previous covariance matrix divided by the sample size n = 350.

In the bivariate CLT, the correlation between the sample means equals the correlation between the original data series, since dividing the covariance matrix by n leaves correlations unchanged.
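The 1/n scaling step and the invariance of correlation can be checked numerically using the figures from the example:

```python
import math

var_t, var_g, cov_tg, n = 335.6, 26.7, 14.0, 350

# Covariance matrix of the sample means: each entry divided by n.
scaled = [v / n for v in (var_t, cov_tg, var_g)]
print([round(v, 4) for v in scaled])  # [0.9589, 0.04, 0.0763]

# Correlation is unchanged by the 1/n scaling.
corr_data = cov_tg / math.sqrt(var_t * var_g)
corr_means = scaled[1] / math.sqrt(scaled[0] * scaled[2])
print(abs(corr_data - corr_means) < 1e-12)  # True
```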

Coskewness and Cokurtosis

Coskewness and Cokurtosis are an extension of the univariate skewness and kurtosis.

Coskewness

The two coskewness measures are defined as:

Skew(X, X, Y) = E[(X − E[X])²(Y − E[Y])] / (σ_X² σ_Y)

Skew(X, Y, Y) = E[(X − E[X])(Y − E[Y])²] / (σ_X σ_Y²)

These measures capture the likelihood of one variable taking a large directional value whenever the other variable is large in magnitude. When the direction of one variable is insensitive to the magnitude of the other, both coskewness measures are 0. For example, the coskewness of a bivariate normal is always 0, even when the correlation is different from 0. Note that the univariate skewness estimators correspond to Skew(X, X, X) and Skew(Y, Y, Y).

So how do we estimate coskewness?

Coskewness is estimated by analogy, replacing the expectation operator with a sample average. The two coskewness estimators are given by:

Skew(X, X, Y) = [(1/n) ∑ᵢ₌₁ⁿ (xᵢ − μ̂_X)²(yᵢ − μ̂_Y)] / (σ̂_X² σ̂_Y)

Skew(X, Y, Y) = [(1/n) ∑ᵢ₌₁ⁿ (xᵢ − μ̂_X)(yᵢ − μ̂_Y)²] / (σ̂_X σ̂_Y²)
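A minimal sketch of the Skew(X, X, Y) estimator, with the averaging made explicit (the tiny data set is made up to show a positive coskewness: y tends to be large when |x| is large):

```python
import math

def coskew_xxy(x, y):
    """Estimate Skew(X, X, Y) using 1/n sample moments."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    var_x = sum((a - mx) ** 2 for a in x) / n
    var_y = sum((b - my) ** 2 for b in y) / n
    m_xxy = sum((a - mx) ** 2 * (b - my) for a, b in zip(x, y)) / n
    return m_xxy / (var_x * math.sqrt(var_y))

x = [-1.0, 0.0, 1.0]
y = [1.0, 0.0, 1.0]      # y is large whenever |x| is large
print(coskew_xxy(x, y))  # positive, about 0.707
```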

Cokurtosis

There are, intuitively, three configurations of cokurtosis:

Kurt(X, X, Y, Y) = E[(X − E[X])²(Y − E[Y])²] / (σ_X² σ_Y²)

Kurt(X, X, X, Y) = E[(X − E[X])³(Y − E[Y])] / (σ_X³ σ_Y)

Kurt(X, Y, Y, Y) = E[(X − E[X])(Y − E[Y])³] / (σ_X σ_Y³)

Recall that the reference kurtosis value for a normally distributed random variable is 3, and that a random variable with kurtosis exceeding 3 is termed heavy- or fat-tailed. Comparing a cokurtosis to that of the normal is not as simple, however, since the cokurtosis of a bivariate normal depends on the correlation: Kurt(X, X, Y, Y) equals 1 when the random variables are uncorrelated and increases as the correlation deviates from 0.

Practice Question

A sample of 100 monthly profits gave the following data:

∑ᵢ₌₁¹⁰⁰ xᵢ = 3,353 and ∑ᵢ₌₁¹⁰⁰ xᵢ² = 844,536

What is the sample mean and standard deviation of the monthly profits?

A. Sample Mean=33.53, Standard deviation=85.99

B. Sample Mean=53.53, Standard deviation=85.55

C. Sample Mean=43.53, Standard deviation=89.99

D. Sample Mean=33.63, Standard deviation=65.99

Solution

The correct answer is A.

Recall that the sample mean is given by:

μ̂ = X̄ = (1/n) ∑ᵢ₌₁ⁿ Xᵢ = (1/100) × 3,353 = 33.53

The sample variance is given by:

s² = (1/(n − 1)) ∑ᵢ₌₁ⁿ (Xᵢ − μ̂)²

Note that:

(Xᵢ − μ̂)² = Xᵢ² − 2Xᵢμ̂ + μ̂²

so that:

∑ᵢ₌₁ⁿ (Xᵢ − μ̂)² = ∑ᵢ₌₁ⁿ Xᵢ² − 2μ̂ ∑ᵢ₌₁ⁿ Xᵢ + nμ̂²

Note again that:

μ̂ = (1/n) ∑ᵢ₌₁ⁿ Xᵢ  ⇒  ∑ᵢ₌₁ⁿ Xᵢ = nμ̂

So:

∑ᵢ₌₁ⁿ (Xᵢ − μ̂)² = ∑ᵢ₌₁ⁿ Xᵢ² − 2nμ̂² + nμ̂² = ∑ᵢ₌₁ⁿ Xᵢ² − nμ̂²

Thus:

s² = (1/(n − 1)) [∑ᵢ₌₁ⁿ Xᵢ² − nμ̂²]

So, in our case:

s² = (1/99) × (844,536 − 100 × 33.53²) = 7,395.0496

so that the standard deviation is:

s = √7,395.0496 = 85.99
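The shortcut formula can be verified directly from the two given sums:

```python
import math

n = 100
sum_x = 3_353.0     # sum of x_i
sum_x2 = 844_536.0  # sum of x_i squared

mean = sum_x / n                           # sample mean
var = (sum_x2 - n * mean ** 2) / (n - 1)   # shortcut form of the sample variance
std = math.sqrt(var)

print(mean, round(var, 4), round(std, 2))  # 33.53 7395.0496 85.99
```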

Reading 17: Hypothesis Testing

After completing this reading, you should be able to:

Construct an appropriate null hypothesis and alternative hypothesis and distinguish

between the two.

Construct and apply confidence intervals for one-sided and two-sided hypothesis tests,

and interpret the results of hypothesis tests with a specific level of confidence.

Differentiate between a one-sided and a two-sided test and identify when to use each

test.

Explain the difference between Type I and Type II errors and how these relate to the

size and power of a test.

Understand how a hypothesis test and a confidence interval are related.

Explain what the p-value of a hypothesis test measures.

Interpret the results of hypothesis tests with a specific level of confidence.

Identify the steps to test a hypothesis about the difference between two population

means.

Explain the problem of multiple testing and how it can bias results.

Hypothesis testing is defined as a process of determining whether a hypothesis is in line with

the sample data. Hypothesis testing tries to test whether the observed data of the hypothesis is

true. Hypothesis testing starts by stating the null hypothesis and the alternative hypothesis. The

null hypothesis is an assumption of the population parameter. On the other hand, the alternative

hypothesis states the parameter values (critical values) at which the null hypothesis is rejected.

The critical values are determined by the distribution of the test statistic (when the null

hypothesis is true) and the size of the test (which gives the size at which we reject the null

hypothesis).

Components of the Hypothesis Testing

The elements of the test hypothesis include:

I. The null hypothesis.

II. The alternative hypothesis.

III. The test statistic.

IV. The size of the hypothesis test and errors.

V. The critical value.

VI. The decision rule.

The Null hypothesis

As stated earlier, the first stage of the hypothesis test is the statement of the null hypothesis. The

null hypothesis is a statement concerning the population parameter values. It conveys the notion that “there is nothing unusual about the data.”

The null hypothesis, denoted as H0, represents the current state of knowledge about the

population parameter that’s the subject of the test. In other words, it represents the “status

quo.” For example, the U.S Food and Drug Administration may walk into a cooking oil

manufacturing plant intending to confirm that each 1 kg oil package has, say, 0.15% cholesterol

and not more. The inspectors will formulate a hypothesis like:

H0: Each 1 kg package has 0.15% cholesterol.

A test would then be carried out to confirm or reject the null hypothesis.

Other typical statements of H0 include:

H0 : μ = μ0

H0 : μ ≤ μ0

Where:

μ = true population mean and,

μ0 = the hypothesized population mean.

The Alternative Hypothesis

The alternative hypothesis, denoted H1, is a contradiction of the null hypothesis. The alternative hypothesis determines the values of the population parameter at which the null hypothesis is rejected. Thus, rejecting H0 makes H1 valid. We accept the alternative hypothesis when the “status quo” is discredited and found to be untrue.

Using our FDA example above, the alternative hypothesis would be:

H1: Each 1 kg package does not have 0.15% cholesterol.

The typical statements of H1 include:

H1 : μ ≠ μ0

H1 : μ > μ0

Where:

μ = true population mean and,

μ0 = the hypothesized population mean.

Note that we have stated the alternative hypothesis, which contradicted the above statement of

the null hypothesis.

The Test Statistic

A test statistic is a standardized value computed from sample information when testing

hypotheses. It compares the given data with what we would expect under the null hypothesis.

Thus, it is a major determinant when deciding whether to reject H0, the null hypothesis.

We use the test statistic to gauge the degree of agreement between sample data and the null

hypothesis. Analysts use the following formula when calculating the test statistic.

(Sample Statistic–Hypothesized Value)


Test Statistic =
(Standard Error of the Sample Statistic)

The test statistic is a random variable that changes from one sample to another. Test statistics assume a variety of distributions. We shall focus on normally distributed test statistics because they are used in tests of hypotheses concerning means, regression coefficients, and other econometric models.

We shall consider the hypothesis test on the mean. Consider a null hypothesis H₀: μ = μ₀. Assume that the data used are iid and asymptotically normally distributed, so that:

√n(μ̂ − μ) ∼ N(0, σ²)

where σ² is the variance of the sequence of iid random variables used. This asymptotic distribution leads to the test statistic:

T = (μ̂ − μ₀) / √(σ̂²/n) ∼ N(0, 1)

Note this is consistent with our initial definition of the test statistic.

The following table gives a brief outline of the various test statistics used regularly, based on the

distribution that the data is assumed to follow:

Hypothesis Test      Test Statistic
Z-test               z-statistic
Chi-square test      Chi-square statistic
t-test               t-statistic
ANOVA                F-statistic

We can subdivide the set of values that can be taken by the test statistic into two regions: the non-rejection region, which is consistent with H0, and the rejection region (critical region), which is inconsistent with H0. If the test statistic takes a value within the critical region, we reject H0.

Just like with any other statistic, the distribution of the test statistic must be specified entirely

under H0 when H0 is true.

The Size of the Hypothesis Test and the Type I and Type II Errors

While using sample statistics to draw conclusions about the parameters of the population as a

whole, there is always the possibility that the sample collected does not accurately represent the

population. Consequently, statistical tests carried out using such sample data may yield incorrect

results that may lead to erroneous rejection (or lack thereof) of the null hypothesis. We have two

types of errors:

Type I Error

Type I error occurs when we reject a true null hypothesis. For example, a type I error would

manifest in the form of rejecting H0 = 0 when it is actually zero.

Type II Error

Type II error occurs when we fail to reject a false null hypothesis. In such a scenario, the test

provides insufficient evidence to reject the null hypothesis when it’s false.

The level of significance, denoted by α, represents the probability of making a type I error, i.e., rejecting the null hypothesis when, in fact, it is true. It should not be confused with β, the probability of making a type II error. The ideal, but practically impossible, statistical test would be one that simultaneously minimizes both α and β.

We use α to determine critical values that subdivide the distribution into the rejection and the

non-rejection regions.

The Critical Value and the Decision Rule

The decision to reject or not to reject the null hypothesis is based on the distribution assumed by

the test statistic. This means if the variable involved follows a normal distribution, we use the

level of significance (α) of the test to come up with critical values that lie along with the standard

normal distribution.

The decision rule is a result of combining the critical value (denoted by Cα ), the alternative

hypothesis, and the test statistic (T). The decision rule is to whether to reject the null hypothesis

in favor of the alternative hypothesis or fail to reject the null hypothesis.

For the t-test, the decision rule depends on the alternative hypothesis. When testing a two-sided alternative, the decision is to reject the null hypothesis if |T| > Cα; that is, reject the null hypothesis if the absolute value of the test statistic is greater than the critical value. For one-sided tests (with Cα > 0), reject the null hypothesis if T < −Cα when using a one-sided lower alternative, and if T > Cα when using a one-sided upper alternative. When a null hypothesis is rejected at the α significance level, we say that the result is significant at the α level.

Note that prior to decision-making, one must decide whether the test should be one-tailed or

two-tailed. The following is a brief summary of the decision rules under different scenarios:

Left One-tailed Test

H1: parameter < X

Decision rule: Reject H0 if the test statistic is less than the critical value. Otherwise, do not

reject H0.

Right One-tailed Test

H1: parameter > X

Decision rule: Reject H0 if the test statistic is greater than the critical value. Otherwise, do not

reject H0.

Two-tailed Test

H1: parameter ≠ X (not equal to X)

Decision rule: Reject H0 if the test statistic is greater than the upper critical value or less than

the lower critical value.

Consider α = 5% and a one-sided test. The rejection regions are shown below:

The first graph represents the rejection region when the alternative is one-sided lower. In this case, the hypotheses are stated as:

H0: μ ≥ μ0 vs. H1: μ < μ0.

The second graph represents the rejection region when the alternative is one-sided upper. The hypotheses, in this case, are stated as:

H0: μ ≤ μ0 vs. H1: μ > μ0.

Example: Hypothesis Test on the Mean

Consider the returns from a portfolio X = (x1 , x 2 ,… , x n ) from 1980 through 2020. The

approximated mean of the returns is 7.50%, with a standard deviation of 17%. We wish to

determine whether the expected value of the return is different from 0 at a 5% significance level.

Solution

We start by stating the two-sided hypothesis test:

H0: μ =0 vs. H1: μ ≠ 0

The test statistic is:


T = (μ̂ − μ₀) / √(σ̂²/n) ∼ N(0, 1)

In this case, we have:

n = 40, μ̂ = 0.075, μ₀ = 0, and σ̂² = 0.17²

So,

T = (0.075 − 0) / √(0.17²/40) ≈ 2.79

At the significance level α = 5%, the critical value is ±1.96. Since this is a two-sided test, the rejection regions are (−∞, −1.96) and (1.96, ∞), as shown in the diagram below:

Since the test statistic (2.79) is greater than the critical value, we reject the null hypothesis in favor of the alternative hypothesis.
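The test can be reproduced with the standard library (the critical value is computed rather than read from a table):

```python
from math import sqrt
from statistics import NormalDist

mu_hat, mu_0, sigma, n = 0.075, 0.0, 0.17, 40
alpha = 0.05

t = (mu_hat - mu_0) / sqrt(sigma ** 2 / n)  # test statistic
c = NormalDist().inv_cdf(1 - alpha / 2)     # two-sided critical value (about 1.96)

print(round(t, 2), abs(t) > c)  # 2.79 True
```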

The example above is an example of a Z-test (which is mostly emphasized in this chapter and

immediately follows from the central limit theorem (CLT)). However, we can use the Student’s t-

distribution if the random variables are iid and normally distributed and the sample size is small (n < 30).

With the Student's t-distribution, we use the unbiased estimator of the variance. That is:

s² = (1/(n − 1)) ∑ᵢ₌₁ⁿ (xᵢ − μ̂)²

Therefore, the test statistic for H₀: μ = μ₀ is given by:

T = (μ̂ − μ₀) / √(s²/n) ∼ t(n−1)

The Type II Error and the Test Power

The power of a test is, in a sense, the complement of the test size. While the level of significance gives the probability of rejecting the null hypothesis when it is, in fact, true, the power of a test gives the probability of correctly rejecting the null hypothesis when it is false. In other words, it gives the likelihood of rejecting H0 when, indeed, it is false. Denoting the probability of a type II error by β, the power of a test is given by:

Power of a Test = 1 − β

The power of a test measures the likelihood that a false null hypothesis is rejected. It is influenced by the sample size, the distance between the hypothesized parameter value and the true value, and the size of the test.

Confidence Intervals

A confidence interval can be defined as the range of parameter values within which the true parameter lies at a given confidence level. For instance, a 95% confidence interval constitutes the set of parameter values for which the null hypothesis cannot be rejected when using a 5% test size. More generally, a 1 − α confidence interval contains the values that cannot be rejected at a test size of α.

It is important to note that the confidence interval depends on the alternative hypothesis

statement in the test. Let us start with the two-sided test alternatives.

H0 : μ = 0

H1 : μ ≠ 0

Then the 1 − α confidence interval is given by:

[μ̂ − Cα × σ̂/√n, μ̂ + Cα × σ̂/√n]

where Cα is the critical value at test size α.

Example: Calculating Two-Sided Alternative Confidence Intervals

Consider the returns from a portfolio X = (x1 , x 2 ,… , x n ) from 1980 through 2020. The

approximated mean of the returns is 7.50%, with a standard deviation of 17%. Calculate the 95%

confidence interval for the portfolio return.

The 1 − α confidence interval is given by:

[μ̂ − Cα × σ̂/√n, μ̂ + Cα × σ̂/√n]
= [0.0750 − 1.96 × 0.17/√40, 0.0750 + 1.96 × 0.17/√40]
= [0.0223, 0.1277]

Thus, the confidence intervals imply any value of the null between 2.23% and 12.77% cannot be

rejected against the alternative.
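A quick numerical check of the interval:

```python
from math import sqrt

mu_hat, sigma, n, c_alpha = 0.075, 0.17, 40, 1.96  # inputs from the example

half_width = c_alpha * sigma / sqrt(n)  # C_alpha * sigma_hat / sqrt(n)
lo, hi = mu_hat - half_width, mu_hat + half_width

print(round(lo, 4), round(hi, 4))  # 0.0223 0.1277
```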

One-Sided Alternative

For a one-sided alternative, the confidence interval is one-sided as well. It is given by:

(−∞, μ̂ + Cα × σ̂/√n]

for the lower alternative (H1: μ < μ0), or

[μ̂ − Cα × σ̂/√n, ∞)

for the upper alternative (H1: μ > μ0).

Example: Calculating the One-Sided Alternative Confidence Interval

Assume that we were conducting the following one-sided test:

H0 : μ ≤ 0

H1 : μ > 0

This is an upper one-sided alternative, so the 95% confidence interval for the portfolio return is:

[μ̂ − Cα × σ̂/√n, ∞)
= [0.0750 − 1.645 × 0.17/√40, ∞)
= [0.0308, ∞)

On the other hand, if the hypothesis test were:

H0 : μ ≥ 0

H1 : μ < 0

the 95% confidence interval would be:

(−∞, μ̂ + Cα × σ̂/√n]
= (−∞, 0.0750 + 1.645 × 0.17/√40]
= (−∞, 0.1192]

Note that the critical value decreased from 1.96 to 1.645 because only one tail of the distribution is now used.

The p-Value

When carrying out a statistical test with a fixed value of the significance level (α), we merely

compare the observed test statistic with some critical value. For example, we might “reject

H0 using a 5% test” or “reject H0 at 1% significance level”. The problem with this ‘classical’

approach is that it does not give us details about the strength of the evidence against the null

hypothesis.

Determination of the p-value gives statisticians a more informative approach to hypothesis

testing. The p-value is the lowest level at which we can reject H0. This means that the strength of

the evidence against H0 increases as the p-value becomes smaller. The test statistic depends on

the alternative.

The p-Value for One-Tailed Test Alternative

For one-tailed tests, the p-value is the probability lying below the calculated test statistic for left-tailed tests. Similarly, the probability lying above the test statistic gives the p-value for right-tailed tests.

Denoting the test statistic by T and letting Z be a standard normal random variable, the p-value for the upper alternative H1: μ > μ0 is given by:

P(Z > T) = 1 − P(Z ≤ T) = 1 − Φ(T)

Conversely, for the lower alternative H1: μ < μ0, the p-value is given by:

P(Z ≤ T) = Φ(T)

The p-Value for Two-Tailed Test Alternative

If the test is two-tailed, the p-value is given by the sum of the probabilities in the two tails: the probability lying below the negative of the absolute value of the test statistic plus the probability lying above its positive value. By symmetry, the p-value for the two-tailed hypothesis test is:

2[1 − Φ(|T|)]

The absolute value |T| ensures that the upper tail is measured whether T is negative or positive.

Example 1: p-Value for One-Sided Alternative

Let θ represent the probability of obtaining a head when a coin is tossed. Suppose we toss the

coin 200 times, and heads come up in 85 of the trials. Test the following hypothesis at 5% level of

significance.

H0: θ = 0.5

H1: θ < 0.5

Solution

First, note that the number of heads in repeated coin tosses follows a binomial distribution.

Our p-value is P(X ≤ 85), where X ∼ Binomial(200, 0.5) with mean np = 200 × 0.5 = 100, assuming H0 is true. Applying the normal approximation with a continuity correction:

P(Z < (85.5 − 100)/√50) = P(Z < −2.05) = 1 − 0.97982 = 0.02018

Recall that for a binomial distribution, the variance is given by:

np(1 − p) = 200 × 0.5 × (1 − 0.5) = 50

(We have applied the Central Limit Theorem by approximating the binomial distribution with a normal distribution.)

Since the probability is less than 0.05, H0 is extremely unlikely, and we actually have strong

evidence against H0 that favors H1. Thus, clearly expressing this result, we could say:

“There is very strong evidence against the hypothesis that the coin is fair. We, therefore,

conclude that the coin is biased against heads.”

Remember, failure to reject H0 does not mean it’s true. It means there’s insufficient evidence to

justify rejecting H0, given a certain level of significance.
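The continuity-corrected normal approximation used above can be checked with the standard library:

```python
from math import sqrt
from statistics import NormalDist

n, p0, heads = 200, 0.5, 85
mean = n * p0              # 100
var = n * p0 * (1 - p0)    # 50

# Normal approximation to P(X <= 85) with a continuity correction.
z = (heads + 0.5 - mean) / sqrt(var)
p_value = NormalDist().cdf(z)

print(round(z, 2), p_value < 0.05)  # -2.05 True
```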

Example 2: p-Value for Two-Sided Alternative

A CFA candidate conducts a statistical test about the mean value of a random variable X.

H0: μ = μ0 vs. H1: μ ≠ μ0

She obtains a test statistic of 2.2. Given a 5% significance level, determine and interpret the p-

value

Solution

P-value = 2P (Z > 2.2) = 2[1– P (Z ≤ 2.2)] = 1.39% × 2 = 2.78%

(We have multiplied by two since this is a two-tailed test)

Interpretation

The p-value (2.78%) is less than the level of significance (5%). Therefore, we have sufficient

evidence to reject H0. In fact, the evidence is so strong that we would also reject H0 at

significance levels of 4% and 3%. However, at significance levels of 2% or 1%, we would not

reject H0 since the p-value surpasses these values.

Hypothesis about the Difference between Two Population Means

It’s common for analysts to be interested in establishing whether there exists a significant

difference between the means of two different populations. For instance, they might want to

know whether the average returns for two subsidiaries of a given company

exhibit significant differences.

Now, consider a bivariate random variable:

Wi = [X i, Y i]

Assume that the components Xi and Y iare both iid and are correlated. That is:

Corr(Xi , Yi ) ≠ 0

Now, suppose that we want to test the hypothesis that:

H 0 : μX = μY

H 1 : μX ≠ μY

In other words, we want to test whether the constituent random variables have equal means.

Note that the hypothesis statement above can be written as:

H0 : μ X − μ Y = 0

H1 : μ X − μ Y ≠ 0

To execute this test, consider the variable:

Zi = Xi − Y i

Therefore, considering the above random variable, if the null hypothesis is correct then,

E(Zi) = E(Xi) − E(Y i ) = μX − μY = 0

Intuitively, this can be considered as a standard hypothesis test of

H0: μZ =0 vs. H1: μZ ≠ 0.

The test statistic is given by:

T = μ̂_z / √(σ̂_z²/n) ∼ N(0, 1)

Note that the test statistic accounts for the correlation between Xᵢ and Yᵢ. It is easy to see that:

V(Zᵢ) = V(Xᵢ) + V(Yᵢ) − 2Cov(Xᵢ, Yᵢ)

which can be denoted as:

σ̂_z² = σ̂_X² + σ̂_Y² − 2σ̂_XY

with

μ̂_z = μ̂_X − μ̂_Y

Thus, the test statistic can be written as:

T = (μ̂_X − μ̂_Y) / √((σ̂_X² + σ̂_Y² − 2σ̂_XY)/n)

This formula indicates that correlation plays a crucial role in determining the magnitude of the

test statistic.

Another special case of the test statistic arises when Xᵢ and Yᵢ are iid and independent of each other. The test statistic is then given by:

T = (μ̂_X − μ̂_Y) / √(σ̂_X²/n_X + σ̂_Y²/n_Y)

where n_X and n_Y are the sample sizes of X and Y, respectively.

Example: Hypothesis Test on Two Means

An investment analyst wants to test whether there is a significant difference between the means

of the two portfolios at a 95% level. The first portfolio X consists of 30 government-issued bonds

and has a mean of 10% and a standard deviation of 2%. The second portfolio Y consists of 30

private bonds with a mean of 14% and a standard deviation of 3%. The correlation between the

two portfolios is 0.7. State the null hypothesis and determine whether it is rejected or not.

Solution

The hypothesis statement is given by:

H0: μX - μY=0 vs. H1: μX - μY ≠ 0.

Note that this is a two-tailed test. At 95% level, the test size is α=5% and thus the critical value

C α = ±1.96.

Recall that:

Cov(X, Y ) = σXY = ρX Y σXσY

Where ρ_XY is the correlation coefficient between X and Y.

Now the test statistic is given by:

T = (μ̂X − μ̂Y) / √((σ̂²X + σ̂²Y − 2σ̂XY)/n) = (μ̂X − μ̂Y) / √((σ̂²X + σ̂²Y − 2ρ̂XY σ̂X σ̂Y)/n)

= (0.10 − 0.14) / √((0.02² + 0.03² − 2 × 0.7 × 0.02 × 0.03)/30) = −10.215

The test statistic is far much less than -1.96. Therefore the null hypothesis is rejected at a 95%

level.
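The arithmetic of this example can be checked with a short Python sketch (a minimal illustration, not part of the reading; the figures are taken directly from the example above):

```python
import math

# Verify the correlated two-means test statistic from the example:
# mu_X = 10%, mu_Y = 14%, sd_X = 2%, sd_Y = 3%, rho = 0.7, n = 30 pairs.
def paired_mean_test(mu_x, mu_y, sd_x, sd_y, rho, n):
    # For Z = X - Y: Var(Z) = Var(X) + Var(Y) - 2 * rho * sd_x * sd_y
    var_z = sd_x ** 2 + sd_y ** 2 - 2 * rho * sd_x * sd_y
    return (mu_x - mu_y) / math.sqrt(var_z / n)

t_stat = paired_mean_test(0.10, 0.14, 0.02, 0.03, 0.7, 30)
print(round(t_stat, 3))  # -10.215, far below -1.96, so reject H0
```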

The Problem of Multiple Testing

Multiple testing occurs when multiple hypothesis tests are conducted on the same data

set. The reuse of data results in spurious results and unreliable conclusions that do not hold up

to scrutiny. The fundamental problem with multiple testing is that the test size (i.e., the

probability that a true null is rejected) is only applicable for a single test. However, repeated

testing creates test sizes that are much larger than the assumed size of alpha and therefore

increases the probability of a Type I error.

Some control methods have been developed to combat multiple testing. These include the Bonferroni correction and procedures that control the False Discovery Rate (FDR) and the Familywise Error Rate (FWER).
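The Bonferroni correction, for instance, divides the test size among the tests: with m tests and overall size α, each p-value is compared to α/m. A minimal sketch with hypothetical p-values:

```python
# Bonferroni correction sketch: comparing each p-value to alpha/m keeps the
# familywise error rate at or below alpha. The p-values below are hypothetical.
def bonferroni_reject(p_values, alpha=0.05):
    m = len(p_values)
    return [p < alpha / m for p in p_values]

print(bonferroni_reject([0.004, 0.02, 0.03, 0.045, 0.30]))
# [True, False, False, False, False]: only the first survives alpha/5 = 0.01
```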

Practice Question

An experiment was done to find out the number of hours that candidates spend

preparing for the FRM part 1 exam. It was discovered that for a sample of 10

students, the sample mean time was 312.7 hours with a sample standard deviation of 7.2 hours. Construct a 95% confidence interval for the mean time spent, given the following t-table:

q 0.95 0.975 0.99 0.995 0.999 0.9995


n=1 6.314 12.706 31.821 63.657 318.309 636.619
2 2.920 4.303 6.965 9.925 22.327 31.599
3 2.353 3.182 4.541 5.841 10.215 12.924
4 2.132 2.776 3.747 4.604 7.173 8.610
5 2.015 2.571 3.365 4.032 5.893 6.869

A. [307.5, 317.9]

B. [307.6, 317.8]

C. [307.9, 317.5]

D. [307.3, 318.2]

The correct answer is A.

Population variance is unknown; we must use the t-score.

To find the value of t_{1−α/2}, we use the t-table with (10 − 1 =) 9 degrees of freedom and the (1 − 0.025 =) 0.975 quantile, which gives us 2.262.

So the confidence interval is given by:

X̄ ± t_{1−α/2} × s/√n = 312.7 ± 2.262 × 7.2/√10 = [307.5, 317.9]
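The solution's arithmetic can be reproduced with a short script (a sketch; the critical value 2.262 is taken from the t-table, as above):

```python
import math

# t-based confidence interval from the practice question:
# mean 312.7, s = 7.2, n = 10, t_{0.975, 9} = 2.262 (from the table).
xbar, s, n, t_crit = 312.7, 7.2, 10, 2.262
half_width = t_crit * s / math.sqrt(n)
lo, hi = xbar - half_width, xbar + half_width
print(round(lo, 1), round(hi, 1))  # 307.5 317.9
```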

Reading 18: Linear Regression

After completing this reading, you should be able to:

Describe the models that can be estimated using linear regression and differentiate

them from those which cannot.

Interpret the results of an OLS regression with a single explanatory variable.

Describe the key assumptions of OLS parameter estimation.

Characterize the properties of OLS estimators and their sampling distributions.

Construct, apply, and interpret hypothesis tests and confidence intervals for a single

regression coefficient in a regression.

Explain the steps needed to perform a hypothesis test in linear regression.

Describe the relationship between a t-statistic, its p-value, and a confidence interval.

Linear regression is a statistical tool for modeling the relationship between two random

variables. This chapter will concentrate on the linear regression model (regression model with

one explanatory variable).

The Linear Regression Model

As stated earlier, linear regression determines the relationship between the dependent variable Y

and the independent (explanatory) variable X. The linear regression with a single explanatory

variable is given by:

Y = β0 + βX + ϵ

Where:

β0 =constant intercept (the value of Y when X=0)

β = the slope, which measures the sensitivity of Y to variation in X.

ϵ = the error (sometimes referred to as the shock). It represents the portion of Y that cannot be explained

by X.

The assumption is that the expectation of the error is 0. That is, E(ϵ) = 0 and thus,

E[Y ] = E[β0 ] + βE[X] + E[ϵ]

⇒ E[Y ] = β0 + βE[X]

Note that β0 is the value of Y when X = 0. However, there are cases when the explanatory variable never equals 0. In this case, β0 is interpreted as the value that ensures the fitted line passes through the point of means, i.e., Ȳ = β̂0 + β̂X̄, where Ȳ and X̄ are the means of the yi and xi, respectively.

The Linearity of a Regression

The independent variables can be continuous, discrete, or even functions. Whatever their form, the explanatory variables must satisfy the following conditions:

1. The relationship between the dependent variable Y and the explanatory variables

(X1 , X2 , … , Xn ) must be linear.

2. The error term must be additive (although its variance may depend on the explanatory variables).

3. The independent (explanatory) variables must be observable. A model that depends on unobservable variables cannot be estimated.

A good example of a violation of the linearity principle is:

Y = β0 + βX k + ϵ

This model cannot be estimated using linear regression due to the presence of the unknown

parameter k, which violates the first restriction (the regression function is non-linear). This kind of

nonlinearity can be corrected through transformation.

Transformations

When a linear regression model does not satisfy the linearity conditions stated above, we can

reverse the violation of the restrictions by transforming the model. Consider the model:

Y = β0 X^β ϵ

Where ϵ is the positive error term (shock). Clearly, this model violates the condition of the

restriction since X is raised to an unknown parameter β, and the error term is not additive.

However, we can make this model linear by taking natural logarithm on both sides of the

equation so that:

ln(Y) = ln(β0 X^β ϵ)

ln(Y) = ln β0 + β ln X + ln ϵ

The last equation can be written as:

Ŷ = β̂0 + βX̂ + ϵ̂

where Ŷ = ln Y, β̂0 = ln β0, X̂ = ln X, and ϵ̂ = ln ϵ.

Clearly, this equation satisfies the three linearity conditions. It is worth noting that when we are

interpreting the parameters of the transformed model, we measure the effect of the transformed independent variable X on the transformed dependent variable Y.

For instance, ln(Y ) = lnβ0 + βlnX + lnϵ implies that β represents the change in lnY corresponding

to a unit change in lnX.
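The log transformation above can be illustrated with simulated data. This is a sketch only; the parameter values b0 = 2.0 and b = 1.5 (and the noise level) are assumed for illustration and do not come from the reading:

```python
import math
import random

# Simulate Y = b0 * X^b * eps, then estimate b and ln(b0) by OLS on the
# log-transformed equation ln Y = ln b0 + b ln X + ln eps.
random.seed(42)
b0_true, b_true = 2.0, 1.5          # assumed true parameters
xs = [random.uniform(1.0, 10.0) for _ in range(500)]
ys = [b0_true * x ** b_true * math.exp(random.gauss(0, 0.1)) for x in xs]

lx = [math.log(x) for x in xs]
ly = [math.log(y) for y in ys]
mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
slope = sum((a - mx) * (c - my) for a, c in zip(lx, ly)) / \
        sum((a - mx) ** 2 for a in lx)
intercept = my - slope * mx          # estimate of ln b0
print(round(slope, 2), round(math.exp(intercept), 2))  # close to 1.5 and 2.0
```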

The Use of the Dummy Variables

There are cases where the explanatory variables are binary numbers (0 and 1) representing the

occurrences of an event. These binary numbers are called dummies. For instance,

Assuming Di is a variable such that:

Di = 1 if the student-teacher ratio in the ith school < 20
Di = 0 if the student-teacher ratio in the ith school ≥ 20

The following is the population regression model with regressor Di:

Yi = β0 + βDi + ϵi, ∀i = 1, … , n

β is the coefficient on Di .

The equation will change to the one written below under the condition that Di = 0:

Y i = β0 + ϵ i

When Di = 1:

Y i = β0 + β + ϵi

This implies that when Di = 1, E(Yi|Di = 1) = β0 + β. The test scores will have a population mean value of β0 + β when the ratio of students to teachers is low. The conditional expectations of Yi when Di = 1 and when Di = 0 will differ by β, written as:

(β0 + β) − β0 = β

This makes β the difference between the two population means.
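The result can be illustrated numerically; the observations below are toy values chosen purely for illustration:

```python
# With a 0/1 dummy regressor, the OLS slope equals the difference between
# the two group means, and the intercept equals the D = 0 group mean.
y0 = [10.0, 12.0, 11.0]              # observations where D = 0 (group mean 11)
y1 = [15.0, 17.0, 16.0]              # observations where D = 1 (group mean 16)
d = [0] * len(y0) + [1] * len(y1)
y = y0 + y1
n = len(y)
dbar, ybar = sum(d) / n, sum(y) / n
beta = sum((di - dbar) * (yi - ybar) for di, yi in zip(d, y)) / \
       sum((di - dbar) ** 2 for di in d)
beta0 = ybar - beta * dbar
print(beta0, beta)  # 11.0 5.0 -> intercept is the D=0 mean, slope is the gap
```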

The Ordinary Least Squares

The Ordinary Least Squares (OLS) is a method of estimating the linear regression parameters by

minimizing the sum of squared deviations. The regression coefficients chosen by the OLS

estimators are such that the observed data and the regression line are as close as possible.

Consider a regression equation:

Y = β0 + βX + ϵ

Where X and Y each consist of n observations, X = (x1, x2, …, xn) and Y = (y1, y2, …, yn). Assuming xi and yi are linearly related, the parameters can be estimated using OLS. The estimators minimize the residual sum of squares such that:

∑_{i=1}^n (yi − β̂0 − β̂xi)² = ∑_{i=1}^n ϵ̂i²

Where β̂0 and β̂ are the parameter estimators (the intercept and the slope, respectively) that minimize the squared deviations between the line β̂0 + β̂xi and yi, so that:

β^ 0 = Y¯ − β^ X̄

and

β̂ = ∑_{i=1}^n (xi − X̄)(yi − Ȳ) / ∑_{i=1}^n (xi − X̄)²

Where X̄ and Y¯ are the means of X and Y respectively.

After the estimation of the parameters, the estimated regression line is given by:

y^i = β^ 0 + β^x i

And the linear regression residual error term is given by:

ϵ̂i = yi − ŷi = yi − β̂0 − β̂xi

The variance of the error term is approximated as:

s² = (1/(n − 2)) ∑_{i=1}^n ϵ̂i²

It can also be shown that:

s² = (n/(n − 2)) σ̂²Y (1 − ρ̂²XY)

Note that n-2 implies that two parameters are estimated and that s2 is an unbiased estimator of

σ2 . Moreover, it is assumed that the mean of the residuals is zero and uncorrelated with the

explanatory variables Xi .

Now, consider the formula:

β̂ = ∑_{i=1}^n (xi − X̄)(yi − Ȳ) / ∑_{i=1}^n (xi − X̄)²

If we multiply both the numerator and the denominator by 1/n, we have:

β̂ = [(1/n) ∑_{i=1}^n (xi − X̄)(yi − Ȳ)] / [(1/n) ∑_{i=1}^n (xi − X̄)²]

Note that the numerator is the covariance between X and Y, and the denominator is the variance

of X. So that we can write:

β̂ = [(1/n) ∑_{i=1}^n (xi − X̄)(yi − Ȳ)] / [(1/n) ∑_{i=1}^n (xi − X̄)²] = σ̂XY / σ̂²X

Also recall that:

Corr(X, Y) = ρXY = Cov(X, Y) / (σX σY)

⇒ σXY = ρXY σX σY

So,

β̂ = ρ̂XY σ̂X σ̂Y / σ̂²X

∴ β̂ = ρ̂XY σ̂Y / σ̂X
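The closed-form formulas above can be sketched on toy data (the observations below are assumed for illustration only):

```python
# OLS slope, intercept, and error-variance estimate from the formulas above.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
beta = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
       sum((x - xbar) ** 2 for x in xs)
beta0 = ybar - beta * xbar
resid = [y - beta0 - beta * x for x, y in zip(xs, ys)]
s2 = sum(e ** 2 for e in resid) / (n - 2)   # unbiased estimate of sigma^2
print(round(beta, 2), round(beta0, 2))  # 1.99 0.05
```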

Example: Estimating the Linear Regression Parameters

An investment analyst wants to explain the return from the portfolio (Y) using the prevailing

interest rates (X) over the past 30 years. The mean interest rate is 7%, and the return from the

portfolio is 14%. The covariance matrix is given by:

[ σ̂²Y   σ̂XY ]   [ 1600   500 ]
[ σ̂XY   σ̂²X ] = [  500   338 ]

Assume that the analyst wants to estimate the linear regression equation:

Y^ i = β^ 0 + β^ Xi

Estimate the parameters and, thus, the model equation.

Solution

Now,

β̂ = σ̂XY / σ̂²X = 500/338 = 1.4793

and

β^ 0 = Y¯ − β^ X̄ = 0.14 − 1.4793 × 0.07 = 0.0364

So, the estimated equation is given by:

Y^i = 0.0364 + 1.4793Xi
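The example's arithmetic can be verified with a short sketch:

```python
# Moment-based OLS estimates of the slope and intercept from the example's
# covariance matrix (cov = 500, var_X = 338) and sample means.
cov_xy, var_x = 500.0, 338.0
ybar, xbar = 0.14, 0.07
beta = cov_xy / var_x                  # beta_hat = sigma_XY / sigma_X^2
beta0 = ybar - beta * xbar             # beta0_hat = Ybar - beta_hat * Xbar
print(round(beta, 4), round(beta0, 4))  # 1.4793 0.0364
```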

Assumptions of OLS

The OLS estimators assume the following:

1. The conditional mean of the error term given the independent variables Xi is 0. More precisely, E(ϵi|Xi) = 0. This also implies that the independent variables and the error term are uncorrelated and that E(ϵi) = 0.

2. Both the dependent and independent variables are i.i.d. This assumption concerns the

drawing of the sample. According to this assumption, (Xi , Yi ), i = 1, … ,n are i.i.d in case a

simple random sampling is applied when drawing observations from a single large

population. Despite the i.i.d assumption being a reasonable assumption for many data

collection schemes, all sampling schemes do not produce i.i.d observations on (Xi , Yi ).

3. Large outliers are unlikely. In this assumption, observations whose values of Xi and/or Yi

fall far outside the usual range of the data, are unlikely. These observations are known as

significant outliers. Results of OLS regression can be misleading due to large outliers.

4. The variance of the independent variable is strictly positive. That is, σ²X > 0. This is essential in estimating the regression parameters.

5. The variance of the error term is independent of the explanatory variables and that

V (ϵ i│X) = σ2 < ∞ and that the variance of all the error terms (shocks) is equal. This

assumption is termed as the homoskedasticity assumption.

Under these assumptions, the OLS parameter estimators are unbiased. That is, E(β̂0) = β0 and E(β̂) = β. They are also consistent: the estimates converge to the true values as the sample size increases.

Lastly, the assumptions ensure that the estimated parameters are normally distributed. The

asymptotic distribution of the slope is given by:

√n (β̂ − β) ∼ N(0, σ²/σ²X)

Where σ2 is the variance of the error term and σ2X is the variance of X. It is easy to see that the

variance of β^ increases as σ2 increases.

For the intercept, the asymptotic distribution is defined as:

√n (β̂0 − β0) ∼ N(0, σ²(μ²X + σ²X)/σ²X)

According to the central limit theorem (CLT), β̂ can be treated as a normal random variable with mean equal to the true value β and variance σ²/(nσ²X). That is:

β̂ ∼ N(β, σ²/(nσ²X))

However, σ² is unknown, so for hypothesis testing we replace it with its estimator:

σ̂² = s²

So, recall that for a large sample size:

σ̂²X = (1/n) ∑_{i=1}^n (xi − X̄)²

⇒ nσ̂²X = ∑_{i=1}^n (xi − X̄)²

Therefore, the variance of the parameter β can be written as:

σ̂²β = σ̂² / ∑_{i=1}^n (xi − X̄)² = s² / (nσ̂²X)

The standard error estimate of the β denoted as SEE β is equivalent to the square root of its

variance, so:

SEE_β = √(s² / (nσ̂²X)) = s / (√n σ̂X)

Analogously, the variance of the intercept:

σ̂²β0 = s²(μ̂²X + σ̂²X) / (nσ̂²X)

Hypothesis Testing on the Linear Regression Parameters

When the OLS assumptions are met, the parameter estimators are approximately normally distributed in large samples. Therefore, we can run hypothesis tests on the parameters just as we would on any random variable.

A hypothesis test is a statistical procedure where an analyst tests an assumption about the population

parameters. For instance, we may want to test the significance of a single regression coefficient

in a simple linear regression. Most of the hypothesis tests are t-tests.

Whenever a statistical test is being performed, the following procedure is generally considered

ideal:

1. Statement of both the null and the alternative hypothesis;

2. Select the appropriate test statistic, i.e., what’s being tested, e.g., the population means,

the difference between sample means, or variance;

3. Specify the level of significance;

4. Clearly, state the decision rule to guide you in choosing whether to reject or not to reject

the null hypothesis;

5. Calculate the sample statistic, and finally

6. Make a decision based on the sample results.

For instance, assume we are testing the null hypothesis that:

H 0 : β = βH0 vs. H1 : β ≠ βH0

Where βH0 is the hypothesized slope parameter.

Then the test statistic will be:

T = (β̂ − βH0) / SEE_β

This statistic possesses asymptotic normal distribution, which is then compared to a critical

value C t . The null hypothesis is rejected if:

|T| > Ct

For instance, if we assume a 5% significance level in this case, then the critical value is 1.96.

We can also evaluate the p-values. For one-tailed tests, the p-value is given by the probability

that lies below the calculated test statistic for left-tailed tests. Similarly, the likelihood that lies

above the test statistic in right-tailed tests gives the p-value.

Denoting the test statistic by T, the p-value for H1 : β^ > 0 is given by:

P (Z > |T |) = 1 − P(Z ≤ |T |) = 1 − Φ(|T|)

Conversely, for H 1 : β^ ≤ 0 the p-value is given by:

P (Z ≤ |T |) = Φ(|T |)

Where z is a standard normal random variable, the absolute value of T (|T|) ensures that the

right tail is measured whether T is negative or positive.

If the test is two-tailed, this value is given by the sum of the probabilities in the two tails. We

start by determining the probability lying below the negative value of the test statistic. Then, we

add this to the probability lying above the positive value of the test statistic. That is the p-value

for the two-tailed hypothesis test is given by:

2[1 − Φ(|T |)]

We can also construct confidence intervals (discussed in detail in the previous chapter). Recall

that a confidence interval is the range of values within which the true parameter lies with a given confidence level. For instance, a 95% confidence interval is the set of parameter values for which the null hypothesis cannot be rejected when using a 5% test size.

For instance, if we are performing the two-tailed hypothesis tests, then the confidence interval is

given by:

[β^ − Ct × SEEβ , β^ + C t × SEEβ ]
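The test statistic, two-tailed p-value, and confidence interval can be sketched together. The values β̂ = 1.2 and SEE = 0.4 below are assumed for illustration and do not come from the reading:

```python
import math

# Slope test under H0: beta = 0 with assumed beta_hat and standard error.
def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

beta_hat, see, beta_h0 = 1.2, 0.4, 0.0
t = (beta_hat - beta_h0) / see
p_two_tailed = 2.0 * (1.0 - phi(abs(t)))        # 2[1 - Phi(|T|)]
ci = (round(beta_hat - 1.96 * see, 3), round(beta_hat + 1.96 * see, 3))
print(round(t, 2), round(p_two_tailed, 4), ci)  # 3.0 0.0027 (0.416, 1.984)
```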

Example: Hypothesis Test on the Linear Regression Parameters

An investment analyst wants to explain the return from the portfolio (Y) using the prevailing

interest rates (X) over the past 30 years. The mean interest rate is 7%, and the return from the

portfolio is 14%. The covariance matrix is given by:

[ σ̂²Y   σ̂XY ]   [ 1600   500 ]
[ σ̂XY   σ̂²X ] = [  500   338 ]

Assume that the analyst wants to estimate the linear regression equation:

Y^ i = β^ 0 + β^ Xi

Test whether the slope coefficient is equal to zero and construct a 95% confidence interval for

the slope of the coefficient.

Solution

We start by stating the hypothesis: H0: β = 0 vs. H1: β ≠ 0. The test statistic is:

T = (β̂ − βH0) / SEE_β

We had calculated the slope from the matrix as:

β̂ = σ̂XY / σ̂²X = 500/338 = 1.4793

Now, recall that:

SEE_β̂ = s / (√n σ̂X)

But

s² = (n/(n − 2)) σ̂²Y (1 − ρ̂²XY)

So, in this case, using ρ̂XY = σ̂XY/(σ̂X σ̂Y) = 500/(√338 × √1600) = 0.6799:

s² = (30/(30 − 2)) × 1600 × (1 − 0.6799²) = 921.81

Therefore,

s = √s² = √921.81 = 30.361

So,

SEE_β̂ = s/(√n σ̂X) = 30.361/(√30 × √338) = 0.3015

Therefore the t-statistic is given by:

T = (β̂ − βH0)/SEE_β = 1.4793/0.3015 = 4.906

For the two-tailed test, the critical value is 1.96, and since the t-statistic exceeds the critical value, we reject the null hypothesis.

For the 95% CI, we know it is given by:

[β̂ − Ct × SEE_β, β̂ + Ct × SEE_β]

= [1.4793 − 1.96 × 0.3015, 1.4793 + 1.96 × 0.3015]

= [0.8884, 2.0702]

Practice Question 1

Assume that you have carried out a regression analysis (to determine whether the

slope is different from 0) and found out that the slope β^ = 1.156. Moreover, you have

constructed a 95% confidence interval of [0.550, 1.762]. What is the likely value of

your test statistic?

A. 4.356

B. 3.7387

C. 0.7845

D. 0.6545

Solution

The Correct answer is B

This is a two-tailed test since we’re asked to determine if the slope is different from

zero. We know that:

[β^ − Ct × SEEβ , β^ + C t × SEEβ ]

Which in this case is [0.550, 1.762].

We need to find the value of SEE β. That is:

1.156 − 1.96 × SEE_β = 0.550 ⇒ SEE_β = (1.156 − 0.550)/1.96 = 0.3092

And we know that:

T = (β̂ − βH0)/SEE_β = (1.156 − 0)/0.3092 = 3.7387

Practice Question 2

A trader develops a simple linear regression model to predict the price of a stock. The

estimated slope coefficient for the regression is 0.60, the standard error is equal

to 0.25, and the sample has 30 observations. Determine if the estimated slope

coefficient is significantly different than zero at a 5% level of significance by correctly

stating the decision rule.

A. Accept H1; The slope coefficient is statistically significant.

B. Reject H0; The slope coefficient is statistically significant.

C. Reject H0; The slope coefficient is not statistically significant.

D. Accept H1; The slope coefficient is not statistically significant.

Solution

The correct answer is B.

Step 1: State the hypothesis

H0:β1=0

H1:β1≠0

Step 2: Compute the test statistic

t = (β̂1 − βH0)/Sβ1 = (0.60 − 0)/0.25 = 2.4

Step 3: Find the critical value, tc

From the t table, we can find t0.025,28 is 2.048

Step 4: State the decision rule

Reject H0; The slope coefficient is statistically significant since 2.048 < 2.4.

Reading 19: Regression with Multiple Explanatory Variables

After completing this reading, you should be able to:

Distinguish between the relative assumptions of single and multiple regression.

Interpret regression coefficients in multiple regression.

Interpret goodness of fit measures for single and multiple regressions, including R2 and

adjusted R2.

Construct, apply, and interpret joint hypothesis tests and confidence intervals for

multiple coefficients in regression.

Unlike linear regression, multiple regression simultaneously considers the influence of

multiple explanatory variables on a response variable Y. In other words, it permits us to evaluate

the effect of more than one independent variable on a given dependent variable.

The form of the multiple regression model (equation) is given by:

Y i = β0 + β1 X1 i + β2 X2 i + … + βk Xk i + εi ∀i = 1, 2, … n

Intuitively, the multiple regression model has k slope coefficients and k+1 regression

coefficients. Normally, statistical software (such as Excel and R) is used to estimate the

multiple regression model.

Interpreting the Multiple Regression Coefficients

The slope coefficient βj measures the change in the dependent variable Y when the independent variable Xj changes by one unit, holding the other independent variables constant.

The interpretation of the multiple regression coefficients is quite different compared to linear

regression with one independent variable. The effect of one variable is explored while keeping

other independent variables constant.

For instance, a linear regression model with one independent variable could be estimated as

Ŷ = 0.6 + 0.85X1. In this case, the slope coefficient is 0.85, which implies that a one-unit increase in X1 results in a 0.85-unit increase in the dependent variable Y.

Now, assume that we add a second independent variable to the regression so that the regression equation is Ŷ = 0.6 + 0.85X1 + 0.65X2. A unit increase in X1 will not result in a 0.85-unit increase in Y unless X1 and X2 are uncorrelated. Therefore, we interpret 0.85 as follows: a one-unit increase in X1 leads to a 0.85-unit increase in the dependent variable Y, keeping X2 constant.

OLS Estimators for the Multiple Regression Parameters

Although the multiple regression parameters can be estimated by hand, it is challenging since it involves

a huge amount of algebra and the use of matrices. However, we build a foundation of

understanding using the multiple regression model with two explanatory variables.

Consider the following multiple regression equation.

Yi = β0 + β1 X1 i + β2 X2 i + εi

The OLS estimator of β1 is estimated as follows:

The first step is to regress X1 on X2 and obtain the residuals of X1i, given by:

ϵ̂X1i = X1i − α̂0 − α̂1 X2i

Where α̂0 and α̂1 are the OLS estimators from the regression of X1 on X2.

The next step is to regress Y on X2 to get the residuals of Yi, given by:

ϵ̂Yi = Yi − γ̂0 − γ̂1 X2i

Where γ̂0 and γ̂1 are the OLS estimators from the regression of Y on X2. The final step is to regress the residuals of Y on the residuals of X1 (ϵ̂Yi on ϵ̂X1i) to get:

ϵ̂Yi = β̂1 ϵ̂X1i + ϵi

Note that no constant is needed, because the expected values of ϵ̂Yi and ϵ̂X1i are both 0. The purpose of the first two regressions is to remove the effect of X2 from both Y and X1: each variable is split into a fitted value, which is correlated with X2, and a residual, which is uncorrelated with X2. The final regression therefore relates the components of Y and X1 that are uncorrelated with X2.

The OLS estimator for β2 can be obtained analogously to that of β1 by exchanging X1 and X2 in the process above. By repeating this process, we can estimate a k-parameter model such as:

Yi = β0 + β1 X1i + β2 X2i + … + βk Xki + ε i∀i = 1 , 2, … n

Most of the time, this is done using a statistical package such as Excel and R.
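The three-step partialling-out procedure above can be sketched on simulated data. This is a rough illustration; the coefficients b0 = 1, b1 = 2, b2 = −3 and the data-generating setup are assumed:

```python
import random

# Estimate b1 in Y = b0 + b1*X1 + b2*X2 + eps via the residual regressions.
random.seed(7)

def simple_ols(x, y):
    """Return (intercept, slope) of an OLS regression of y on x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    slope = sum((a - xbar) * (c - ybar) for a, c in zip(x, y)) / \
            sum((a - xbar) ** 2 for a in x)
    return ybar - slope * xbar, slope

n = 2000
x2 = [random.gauss(0, 1) for _ in range(n)]
x1 = [0.5 * v + random.gauss(0, 1) for v in x2]      # X1 correlated with X2
y = [1 + 2 * a - 3 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]

# Step 1: residuals of X1 after regressing X1 on X2.
a0, a1 = simple_ols(x2, x1)
rx1 = [a - (a0 + a1 * b) for a, b in zip(x1, x2)]
# Step 2: residuals of Y after regressing Y on X2.
g0, g1 = simple_ols(x2, y)
ry = [c - (g0 + g1 * b) for c, b in zip(y, x2)]
# Step 3: regress residual Y on residual X1 through the origin.
b1_hat = sum(a * c for a, c in zip(rx1, ry)) / sum(a * a for a in rx1)
print(round(b1_hat, 1))  # close to the true coefficient b1 = 2
```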

Assumptions of the Multiple Regression Model

Suppose that we have n observations of the dependent variable (Y) and the independent

variables (X1, X2, . . . , Xk), we need to estimate the equation:

Yi = β0 + β1 X1i + β2 X2i + … + βk Xki + ε i ∀i = 1 , 2, … n

For us to make a valid inference from the above equation, we need to make classical normal

multiple linear regression model assumptions as follows:

1. The relationship between the dependent variable, Y, and the independent variables, X1,

X2, . . . , Xk, is linear.

2. The independent variables (X1, X2, . . . , Xk) are iid. Moreover, no exact linear relationship exists between two or more of the independent variables, X1, X2, . . . ,

X k.

3. The expected value of the error term, conditioned on the independent variables, is

0: E(ϵ| X1, X2, . . . , Xk) = 0

4. The variance of the error term is equal for all observations. That is,

E(ϵ 2i ) = σ2ϵ , i = 1 , 2, … , n (homoskedasticity assumption). The assumption enables us to

estimate the distribution of the regression coefficients.

5. The error term ϵ is uncorrelated in all observations. Mathematically put,

E(ϵ i ϵj ) = 0 ∀i ≠ j

6. The error term ϵ is normally distributed. This allows us to test the hypothesis about

regression analysis.

7. There are no large outliers, so that E(X⁴ji) < ∞ for all j = 1, 2, …, k.

The assumptions are almost the same as those of linear regression with one independent

variable, except that the second assumption is extended to rule out exact linear relationships between the independent variables (perfect multicollinearity).

Measures of Goodness of Fit

The goodness of fit of a regression is measured using the coefficient of determination (R²) and the adjusted coefficient of determination.

The Coefficient of Determination ( R2)

Recall that the standard error of estimate indicates how precise a forecast made by a regression model is. However, it does not tell us how well the independent variable explains the dependent variable. The coefficient of determination corrects this shortcoming.

The coefficient of determination measures the proportion of the total variation in the dependent variable explained by the independent variable. We can calculate it in two ways:

1. Squaring the Correlation Coefficient between the Dependent and Independent Variables

The coefficient of determination can be computed by squaring the correlation coefficient (r) between the dependent and independent variables. That is:

R2 = r2

Recall that:

r = Cov(X, Y) / (σX σY)

Where

Cov(X, Y )-covariance between two variables, X and Y

σX -standard deviation of X

σY -standard deviation of Y

However, this method only accommodates regression with one independent variable.

Example: Calculating the Coefficient of Determination using the Correlation Coefficient

The correlation coefficient between the money supply growth rate (dependent, Y) and inflation

rates (independent, X) is 0.7565. The standard deviation of the dependent (explained) variable is

0.050, and that of the independent variable is 0.02. Regression analysis for the ten years was

conducted on this variable. We need to calculate the coefficient of determination.

Solution

We know that:

r = Cov(X, Y)/(σX σY) = 0.0007565/(0.05 × 0.02) = 0.7565

So, the coefficient of determination is given by:

r² = 0.7565² = 0.5723 = 57.23%

So, in the regression, the inflation rate explains roughly 57.23% of the variation in the money supply growth rate over the ten years.
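The example can be verified with a short sketch:

```python
# R^2 as the squared correlation coefficient, using the example's figures:
# covariance 0.0007565, standard deviations 0.02 and 0.05.
cov_xy = 0.0007565
sd_x, sd_y = 0.02, 0.05
r = cov_xy / (sd_x * sd_y)
r_squared = r ** 2
print(round(r, 4), round(r_squared, 4))  # 0.7565 0.5723
```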

2. Method for Regression Models with One or More Independent Variables

Without a regression model, our best estimate of any observation of the dependent variable would be its mean. Alternatively, instead of using the mean as an estimate of Yi, we can predict an estimate using the regression equation. The resulting value is denoted as:

Yi = β0 + β1X1i + β2X2i + … + βkXki + ϵi = Ŷi + ϵ̂i

So that:

Yi = Ŷi + ϵ̂i

Now if we subtract the mean of the dependent variable in the above equation and square and

sum on both sides so that:

∑_{i=1}^n (Yi − Ȳ)² = ∑_{i=1}^n (Ŷi − Ȳ + ϵ̂i)²
= ∑_{i=1}^n (Ŷi − Ȳ)² + 2∑_{i=1}^n ϵ̂i(Ŷi − Ȳ) + ∑_{i=1}^n ϵ̂i²

Note that:

2∑_{i=1}^n ϵ̂i(Ŷi − Ȳ) = 0

since the sample correlation between Ŷi and ϵ̂i is 0. The expression, therefore, reduces to:

∑_{i=1}^n (Yi − Ȳ)² = ∑_{i=1}^n (Ŷi − Ȳ)² + ∑_{i=1}^n ϵ̂i²

But

ϵ̂i² = (Yi − Ŷi)²

So that:

∑_{i=1}^n ϵ̂i² = ∑_{i=1}^n (Yi − Ŷi)²

Therefore,

∑_{i=1}^n (Yi − Ȳ)² = ∑_{i=1}^n (Ŷi − Ȳ)² + ∑_{i=1}^n (Yi − Ŷi)²

If the regression analysis is useful for predicting Yi using the regression equation, then the error

should be smaller than predicting Yi using the mean.

Now let:

Explained Sum of Squares (ESS) = ∑_{i=1}^n (Ŷi − Ȳ)²

Residual Sum of Squares (RSS) = ∑_{i=1}^n (Yi − Ŷi)²

Total Sum of Squares (TSS) = ∑_{i=1}^n (Yi − Ȳ)²

Then:

TSS = ESS + RSS

If we divide both sides by TSS, we get:

1 = ESS/TSS + RSS/TSS

⇒ ESS/TSS = 1 − RSS/TSS

Now, recall that the coefficient of determination is the fraction of the overall variation in Y that is explained by the regression. Denoted by R², it is given by:

R² = Explained Variation / Total Variation = ESS/TSS = 1 − RSS/TSS

If a model does not explain any of the variation in the observed data, it has an R² of 0. On the other hand, if the model describes the data perfectly, it has an R² of 1. All other values lie between 0 and 1. For instance, in the example above, R² ≈ 0.57, so the inflation rate explains about 57% of the variation in the money supply growth rate.
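The decomposition TSS = ESS + RSS can be checked numerically on a small OLS fit (the data below are toy values assumed for illustration):

```python
# Fit OLS on toy data, then verify TSS = ESS + RSS and R^2 = ESS/TSS.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
beta = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
       sum((x - xbar) ** 2 for x in xs)
beta0 = ybar - beta * xbar
fitted = [beta0 + beta * x for x in xs]
ess = sum((f - ybar) ** 2 for f in fitted)
rss = sum((y - f) ** 2 for y, f in zip(ys, fitted))
tss = sum((y - ybar) ** 2 for y in ys)
print(round(ess, 3), round(rss, 3), round(tss, 3), round(ess / tss, 4))
# 18.818 0.082 18.9 0.9957 -> ESS + RSS equals TSS for an OLS fit
```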

Limitations of R2

1. As the number of explanatory variables increases, the value of R2 always increases even

if the new variable is almost completely irrelevant to the dependent variable. For

instance, if a regression model with one explanatory variable is modified to have two

explanatory variables, the new R 2 is greater or equal to that of a single explanatory

model. In the case where β = 0, adding a variable will not increase R2 . In that case, the

RSS will remain the same and so does R 2 .

2. The coefficient of determination R² cannot be compared across models with different dependent variables. For instance, we cannot compare the R² for Yi and ln Yi.

3. There is no standard value of R 2 that is considered good because its values depend on

the nature of the data involved.

Considering the first limitation, we now discuss the adjusted R 2.

The Adjusted R2

Denoted by R̄², the adjusted R² measures the goodness of fit but does not automatically increase when an independent variable is added to the model; that is, it is adjusted for the degrees of freedom. Note that R̄² is produced by statistical software. The relationship between R² and R̄² is given by:

R̄² = 1 − (RSS/(n − k − 1)) / (TSS/(n − 1)) = 1 − ((n − 1)/(n − k − 1)) (1 − R²)

Where

n=number of observations

k=number of the independent variables (Slope coefficients)
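The adjusted-R² formula can be sketched as a short function; the R², n, and k values below are assumed for illustration:

```python
# Adjusted R^2: 1 - ((n - 1) / (n - k - 1)) * (1 - R^2)
def adjusted_r2(r2, n, k):
    return 1 - ((n - 1) / (n - k - 1)) * (1 - r2)

print(round(adjusted_r2(0.60, 30, 2), 4))  # 0.5704 with two regressors
print(round(adjusted_r2(0.61, 30, 3), 4))  # 0.565: a weak third regressor lowers it
```

Note that even though the second model's R² is higher (0.61 vs. 0.60), its adjusted R² is lower, illustrating the penalty for adding a variable that improves the fit by less than expected by chance.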

The adjusted R-squared can increase, but that happens only if the new variable improves the

model more than would be expected by chance. If the added variable improves the model by less

than expected by chance, then the adjusted R-squared decreases.

When k ≥ 1, R² > R̄², since adding a new independent variable decreases R̄² if that addition causes only a small increase in R². This explains why R̄² can be negative even though R² is always nonnegative.

A point to note: when we use R̄² to compare regression models, the dependent variable must be defined in the same way and the sample sizes must be the same.

The following are the factors to watch out for when applying R² or R̄²:

An added variable is not necessarily statistically significant just because R² or R̄² has increased.

It is not always true that the regressors are a true cause of the dependent variable just because there is a high R² or R̄².

It is not necessarily true that there is no omitted variable bias just because we have a high R² or R̄².

It is not necessarily true that we have the most appropriate set of regressors just because we have a high R² or R̄².

It is not necessarily true that we have an inappropriate set of regressors just because we have a low R² or R̄².

R̄² does not automatically indicate that a regression is well specified by virtue of including the right set of variables, since a high R̄² could reflect other uncertainties in the data or the analysis. Moreover, R̄² can be negative if the regression model produces an extremely poor fit.

Joint Hypothesis Test on Multiple Regression Parameters

Previously, we had conducted hypothesis tests on individual regression coefficients using the t-

test. We need to perform a joint hypothesis test on the multiple regression coefficients using the

F-test based on the F-statistic.

In multiple regression, we cannot test the null hypothesis that all the slope coefficients are equal

to 0 using the t-test. This is because an individual test on the coefficient does not accommodate

the effect of interactions among the independent variables (multicollinearity).

The F-test (a test of the regression's overall significance) determines whether the slope coefficients in

multiple linear regression are all equal to 0. That is, the null hypothesis is stated as

H₀: β₁ = β₂ = ⋯ = βₖ = 0 against the alternative hypothesis that at least one slope coefficient is

not equal to 0.

To accurately compute the test statistic for the null hypothesis that the slope coefficients are all equal to 0, we

need to identify the following:

I. The Sum of Squared Residuals (SSR) given by:

SSR = ∑ᵢ₌₁ⁿ (Yᵢ − Ŷᵢ)²

This is also called the residual sum of squares.

II.Explained Sum of Squares (ESS) given by:

ESS = ∑ᵢ₌₁ⁿ (Ŷᵢ − Ȳ)²

III. The total number of observations (n).

IV. The number of parameters to be estimated. For example, in a regression analysis with one independent variable, there are two parameters: the slope and the intercept coefficients.

Using the above four requirements, we can determine the F-statistic. The F-statistic measures

how effective the regression equation explains the changes in the dependent variable. The F-

statistic is denoted by F(Number of slope parameters, n-(number of parameters)). For instance, the F-

statistic for multiple regression with two slope coefficients (and one intercept coefficient) is

denoted as F2, n-3. The value n-3 represents the degrees of freedom for the F-statistic.

The F-statistic is the ratio of the average regression sum of squares to the average amount of

squared errors. The average regression sum of squares is the regression sum of squares divided

by the number of slope parameters (k) estimated. The average sum of squared errors is the sum

of squared errors divided by the number of observations (n) less a total number of parameters

estimated ((n - (k + 1)). Mathematically:

F = (Average regression sum of squares) / (Average sum of squared errors)

= [ESS / (number of slope parameters estimated)] / [SSR / (n − number of parameters estimated)]

In this case, we are dealing with a multiple linear regression model with k independent variables, whose F-statistic is given by:

F = (ESS/k) / [SSR/(n − (k + 1))]

In regression analysis output (ANOVA part), MSR and MSE are displayed as the first and the

second quantities under the MSS (mean sum of the squares) column, respectively. If the overall

regression’s significance is high, then the ratio will be large.

If the independent variables do not explain any of the variation in the dependent variable, each predicted value of the dependent variable (Ŷᵢ) equals the mean of the dependent variable (Ȳ). Consequently, the regression sum of squares is 0, implying that the F-statistic is 0.

So, how do we decide the outcome of the F-test? We reject the null hypothesis at the α significance level if the computed

F-statistic is greater than the upper α critical value of the F-distribution with the provided

numerator and denominator degrees of freedom (F-test is always a one-tailed test).

Example: Conducting F-test

An analyst runs a regression of monthly value-stock returns on four independent variables over

48 months.

The total sum of squares for the regression is 360, and the sum of squared errors is 120.

Test the null hypothesis at a 5% significance level (95% confidence) that all the four independent

variables are equal to zero.

Solution

H₀: β₁ = β₂ = ⋯ = β₄ = 0

Versus

H₁: at least one βⱼ ≠ 0, j = 1, 2, …, 4

ESS = TSS – SSR = 360 – 120 = 240

The calculated test statistic:

F = (ESS/k) / [SSR/(n − (k + 1))] = (240/4) / (120/43) = 21.5

The critical value F₄,₄₃ is approximately 2.59 at a 5% significance level.

Decision: Reject H0.

Conclusion: At least one of the four independent variables is significantly different from zero.
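The arithmetic of this example can be sketched in a few lines (the helper function is illustrative, not part of the curriculum):

```python
def f_statistic(tss, ssr, n, k):
    """F = (ESS/k) / (SSR/(n - k - 1)), where ESS = TSS - SSR."""
    ess = tss - ssr
    return (ess / k) / (ssr / (n - k - 1))

# Monthly value-stock returns: n = 48, k = 4, TSS = 360, SSR = 120
f = f_statistic(360, 120, 48, 4)
print(round(f, 1))   # 21.5, far above the critical value of about 2.59
```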

Example: Calculating F-statistic and Conducting the F-test

An investment analyst wants to determine whether the natural log of the ratio of bid-offer spread

to the price of a stock can be explained by the natural log of the number of market participants

and the amount of market capitalization. He assumes a 5% significance level. The following is

the result of the regression analysis.

Coefficient Standard Error t-Statistic


Intercept 1.6959 0.2375 7.0206
Number of market participants −1.6168 0.0708 −22.8361
Amount of Capitalization −0.4709 0.0205 −22.9707

ANOVA df SS MSS F Significance F
Regression 2 3,730.1534 1,865.0767 2,217.95 0.00
Residual 2,797 2,351.9973 0.8409
Total 2,799 5,801.2051

Residual standard error 0.9180
Multiple R-squared 0.6418
Observations 2,800

We are concerned with the ANOVA (analysis of variance) results. We need to conduct an F-test to determine the overall significance of the regression.

Solution

So, the hypothesis is stated as:

H₀: β₁ = β₂ = 0

vs

H₁: at least one βⱼ ≠ 0, j = 1, 2

There are two slope coefficients, k = 2 (the coefficients on the natural log of the number of market participants and on the amount of market capitalization), which is the number of degrees of freedom for the numerator of the F-statistic. For the denominator, the degrees of freedom are n − (k + 1) = 2800 − 3 = 2,797.

The sum of the squared errors is 2,351.9973, while the regression sum of squares is 3,730.1534.

Therefore, the F-statistic is:

F₂,₂₇₉₇ = (ESS/k) / [SSR/(n − (k + 1))] = (3730.1534/2) / (2351.9973/2797) = 2217.95

Since we are working at a 5% (0.05) significance level, we look at the second column of the F-distribution table, which displays critical values for F-statistics with 2 degrees of freedom in the numerator, as seen below:

F Distribution: Critical Values of F (5% significance level)

1 2 3 4 5 6 7 8 9 10
10 4.96 4.10 3.71 3.48 3.33 3.22 3.14 3.07 3.02 2.98
11 4.84 3.98 3.59 3.36 3.20 3.09 3.01 2.95 2.90 2.85
12 4.75 3.89 3.49 3.26 3.11 3.00 2.91 2.85 2.80 2.75
13 4.67 3.81 3.41 3.18 3.03 2.92 2.83 2.77 2.71 2.67
14 4.60 3.74 3.34 3.11 2.96 2.85 2.76 2.70 2.65 2.60
15 4.54 3.68 3.29 3.06 2.90 2.79 2.71 2.64 2.59 2.54
16 4.49 3.63 3.24 3.01 2.85 2.74 2.66 2.59 2.54 2.49
17 4.45 3.59 3.20 2.96 2.81 2.70 2.61 2.55 2.49 2.45
18 4.41 3.55 3.16 2.93 2.77 2.66 2.58 2.51 2.46 2.41
19 4.38 3.52 3.13 2.90 2.74 2.63 2.54 2.48 2.42 2.38
20 4.35 3.49 3.10 2.87 2.71 2.60 2.51 2.45 2.39 2.35
21 4.32 3.47 3.07 2.84 2.68 2.57 2.49 2.42 2.37 2.32
22 4.30 3.44 3.05 2.82 2.66 2.55 2.46 2.40 2.34 2.30
23 4.28 3.42 3.03 2.80 2.64 2.53 2.44 2.37 2.32 2.27
24 4.26 3.40 3.01 2.78 2.62 2.51 2.42 2.36 2.30 2.25
25 4.24 3.39 2.99 2.76 2.60 2.49 2.40 2.34 2.28 2.24
26 4.22 3.37 2.98 2.74 2.59 2.47 2.39 2.32 2.27 2.22
27 4.21 3.35 2.96 2.73 2.57 2.46 2.37 2.31 2.25 2.20
28 4.20 3.34 2.95 2.71 2.56 2.45 2.36 2.29 2.24 2.19
29 4.18 3.33 2.93 2.70 2.55 2.43 2.35 2.28 2.22 2.18
30 4.17 3.32 2.92 2.69 2.53 2.42 2.33 2.27 2.21 2.16
35 4.12 3.27 2.87 2.64 2.49 2.37 2.29 2.22 2.16 2.11
40 4.08 3.23 2.84 2.61 2.45 2.34 2.25 2.18 2.12 2.08
50 4.03 3.18 2.79 2.56 2.40 2.29 2.20 2.13 2.07 2.03
60 4.00 3.15 2.76 2.53 2.37 2.25 2.17 2.10 2.04 1.99
70 3.98 3.13 2.74 2.50 2.35 2.23 2.14 2.07 2.02 1.97
80 3.96 3.11 2.72 2.49 2.33 2.21 2.13 2.06 2.00 1.95
90 3.95 3.10 2.71 2.47 2.32 2.20 2.11 2.04 1.99 1.94
100 3.94 3.09 2.70 2.46 2.31 2.19 2.10 2.03 1.97 1.93
120 3.92 3.07 2.68 2.45 2.29 2.18 2.09 2.02 1.96 1.91
1000 3.85 3.00 2.61 2.38 2.22 2.11 2.02 1.95 1.89 1.84

As seen from the table, the critical value of the F-test for the null hypothesis to be rejected is

between 3.00 and 3.07. The actual F-statistic is 2217.95, which is far higher than the F-test

critical value, and thus we reject the null hypothesis that all the slope coefficients are equal to 0.

Calculating the Confidence Interval for the Regression Coefficient

Confidence interval (CI) is a closed interval in which the actual parameter is believed to lie with

some degree of confidence. Confidence intervals are used to perform hypothesis tests. For

instance, we may want to assess a stock's valuation using the capital asset pricing model (CAPM). In this case, we may wish to test the hypothesis that the stock's beta equals the market's average beta, i.e., that it carries the market's average systematic risk.

The same analogy used in the regression analysis with one explanatory variable is also used in a

multiple regression model using the t-test.

Example: Calculating the Confidence Interval (CI)

An economist tests the hypothesis that interest rates and inflation can explain GDP growth in a

country. Using some 73 observations, the analyst formulates the following regression equation:

GDP growth = b̂₀ + b̂₁(Interest) + b̂₂(Inflation)

The regression estimates are as follows:

Coefficient Standard Error


Intercept 0.04 0.6%
Interest rates 0.25 6%
Inflation 0.20 4%

What is the 95% confidence interval for the coefficient on the inflation rate?

A. 0.12024 to 0.27976

B. 0.13024 to 0.37976

C. 0.12324 to 0.23976

D. 0.11324 to 0.13976

Solution

The correct answer is A

From the regression analysis, β̂ = 0.20 and the estimated standard error is s_β̂ = 0.04. The number of degrees of freedom is 73 − 3 = 70, so the t-critical value at the 5% significance level is t₀.₀₂₅,₇₀ = 1.994. Therefore, the 95% confidence interval for the inflation coefficient is:

β̂ ± t_c × s_β̂ = 0.20 ± 1.994 × 0.04 = [0.12024, 0.27976]
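The interval arithmetic can be verified with a short sketch (values taken from the example; the t-critical value is read from the t-table, not computed):

```python
# 95% CI for the inflation coefficient: estimate 0.20, standard error 0.04,
# and t-critical value t(0.025, 70) = 1.994 from the t-table
beta_hat, se, t_crit = 0.20, 0.04, 1.994
lower = beta_hat - t_crit * se
upper = beta_hat + t_crit * se
print(round(lower, 5), round(upper, 5))   # 0.12024 0.27976
```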

Practice Questions

Question 1

An analyst runs a regression of monthly value-stock returns on four independent

variables over 48 months. The total sum of squares for the regression is 360 and the

sum of squared errors is 120. Calculate the R2.

A. 42.1%

B. 50%

C. 33.3%

D. 66.7%

The correct answer is D.

R² = ESS/TSS = (360 − 120)/360 = 66.7%

Question 2

Refer to the previous problem and calculate the adjusted R2.

A. 27.1%

B. 63.6%

C. 72.9%

D. 36.4%

The correct answer is B.

R̄² = 1 − [(n − 1)/(n − k − 1)] × (1 − R²)

= 1 − [(48 − 1)/(48 − 4 − 1)] × (1 − 0.667)

= 63.6%

Question 3

Refer to the previous problem. The analyst now adds four more independent variables

to the regression and the new R2 increases to 69%. What is the new adjusted R2 and

which model would the analyst prefer?

A. The analyst would prefer the model with four variables because its adjusted R2 is

higher.

B. The analyst would prefer the model with four variables because its adjusted R2 is

lower.

C. The analyst would prefer the model with eight variables because its adjusted R2 is

higher.

D. The analyst would prefer the model with eight variables because its adjusted R2 is

lower.

The correct answer is A.

New R² = 69%

New adjusted R² = 1 − [(48 − 1)/(48 − 8 − 1)] × (1 − 0.69) = 62.6%

The analyst would prefer the first model because it has a higher adjusted R2 and the

model has four independent variables as opposed to eight.

Question 4

An economist tests the hypothesis that GDP growth in a certain country can be

explained by interest rates and inflation.

Using some 30 observations, the analyst formulates the following regression

equation:

GDP growth = β̂₀ + β̂₁(Interest) + β̂₂(Inflation)

Regression estimates are as follows:

Coefficient Standard Error


Intercept 0.10 0.5%
Interest Rates 0.20 0.05
Inflation 0.15 0.03

Is the coefficient for interest rates significant at 5%?

A. Since the test statistic < t-critical, we accept H0; the interest rate coefficient

is not significant at the 5% level.

B. Since the test statistic > t-critical, we reject H0; the interest rate coefficient

is not significant at the 5% level.

C. Since the test statistic > t-critical, we reject H0; the interest rate coefficient is

significant at the 5% level.

D. Since the test statistic < t-critical, we accept H1; the interest rate coefficient

is significant at the 5% level.

The correct answer is C.

We have GDP growth = 0.10 + 0.20(Int) + 0.15(Inf)

Hypothesis:

H₀: β₁ = 0 vs H₁: β₁ ≠ 0

The test statistic is:

t = (0.20 − 0)/0.05 = 4

The critical value is t(α/2, n-k-1) = t0.025,27 = 2.052 (which can be found on the t-table).

df/p 0.40 0.25 0.10 0.05 0.025 0.01


25 0.256060 0.684430 1.316345 1.708141 2.05954 2.48511
26 0.255955 0.684043 1.314972 1.705618 2.05553 2.47863
27 0.255858 0.683685 1.313703 1.703288 2.05183 2.47266
28 0.255768 0.683353 1.312527 1.701131 2.04841 2.46714
29 0.255684 0.683044 1.311434 1.699127 2.04523 2.46202

Decision: Since test statistic > t-critical, we reject H0.

Conclusion: The interest rate coefficient is significant at the 5% level.
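The test statistic and decision can be reproduced in a short sketch (values from the regression table and the t-table above):

```python
# Significance test for the interest-rate coefficient (Question 4)
coef, se = 0.20, 0.05
t_stat = (coef - 0) / se     # 4.0
t_crit = 2.052               # t(0.025, 27) from the t-table
print(t_stat, abs(t_stat) > t_crit)   # reject H0 at the 5% level
```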

Reading 20: Regression Diagnostics

After completing this reading, you should be able to:

Explain how to test whether regression is affected by heteroskedasticity.

Describe approaches to using heteroskedastic data.

Characterize multicollinearity and its consequences; distinguish between

multicollinearity and perfect collinearity.

Describe the consequences of excluding a relevant explanatory variable from a model

and contrast those with the consequences of including an irrelevant regressor

Explain two model selection procedures and how these relate to the bias-variance

tradeoff.

Describe the various methods of visualizing residuals and their relative strengths.

Describe methods for identifying outliers and their impact.

Determine the conditions under which OLS is the best linear unbiased estimator.

Regression Model Specifications

Model specification is a process of determining which independent variables should be included

in or excluded from a regression model.

That is, an ideal regression model should include all the variables that explain the dependent variable and exclude those that do not.

Model specification includes the residual diagnostics and the statistical tests on the assumptions

of OLS estimators. Basically, the choice of variables to be included in a model depends on the bias-variance tradeoff. For instance, large models that include all the relevant variables are likely to have unbiased coefficients. On the other hand, smaller models produce more accurate (lower-variance) estimates of the included parameters.

The conventional specification makes sure that the functional form of the model is adequate, the

parameters are constant, and the homoscedasticity assumption is met.

The Omitted Variables

An omitted variable is one that has a non-zero coefficient but is excluded from the regression model.

Effects of Omitting Variables

I. The remaining variables sustain the impact of the excluded variables in terms of the

common variation. Thus, they do not consistently approximate the change in the

independent variable on the dependent variable while keeping all other things constant.

II. The magnitude of the estimated residuals is larger than that of the true errors. This is because the estimated residuals contain both the true error and the effect of the omitted variable, which cannot be reflected in the included variables.

Illustration of the Omitted Variables

Suppose that the regression model is stated as:

Yᵢ = α + β₁X₁ᵢ + β₂X₂ᵢ + εᵢ

If we omit X2 from the estimated model, then the model is given by:

Yᵢ = α + β₁X₁ᵢ + εᵢ

Now, in large samples, the OLS estimator β̂₁ converges to:

β₁ + β₂δ

Where:

δ = Cov(X₁, X₂) / Var(X₁)

δ is the population slope coefficient in a regression of X2 on X1 .

It is clear that the bias due to the omitted variable depends on the population coefficient of the excluded variable, β₂, and on the strength of the relationship between X₂ and X₁, represented by δ.

When the correlation between X₁ and X₂ is high, X₁ explains a significant proportion of the variation in X₂, and hence the bias is large. On the other hand, if the independent variables are uncorrelated, that is, δ = 0, then β̂₁ is a consistent estimator of β₁.

In conclusion, an omitted variable biases the coefficients on the included variables that are correlated with it.
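A small simulation (with made-up coefficients, not from the curriculum) illustrates this large-sample result: regressing Y on X₁ alone recovers β₁ + β₂δ rather than β₁.

```python
import random

random.seed(42)
n = 100_000
b1, b2 = 2.0, 3.0
# X2 = 0.5*X1 + u, so delta = Cov(X1, X2)/Var(X1) = 0.5
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [0.5 * a + random.gauss(0, 1) for a in x1]
y = [1.0 + b1 * a + b2 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]

def ols_slope(xs, ys):
    """Simple-regression slope: Cov(x, y)/Var(x)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((a - mx) * (c - my) for a, c in zip(xs, ys))
    return cov / sum((a - mx) ** 2 for a in xs)

b1_hat = ols_slope(x1, y)    # the short regression omits X2
print(round(b1_hat, 2))      # converges to b1 + b2*delta = 3.5, not b1 = 2.0
```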

Inclusion of Extraneous Variables

An extraneous variable is one that is unnecessarily included in the model: its actual coefficient is 0, and it is consistently estimated to be 0 in large samples. Including such variables is nonetheless costly.

Illustration of Effect of Inclusion of Extraneous Random Variables

Recall that the adjusted R² is given by:

R̄² = 1 − ξ × (RSS/TSS)

Where:

ξ = (n − 1)/(n − k − 1)

Looking at the formula above, adding more variables increases the value of k, which in turn increases the value of ξ and hence tends to reduce the value of R̄². However, if the added variables are relevant, the RSS is smaller, which offsets the effect of ξ and produces a larger R̄².

Contrastingly, this is not the case when the true coefficient of the added variable is equal to 0, because the RSS then remains essentially constant as ξ increases, leading to a smaller R̄² and a larger standard error. Lastly, as the correlation between X₁ and X₂ increases, the standard errors rise.

The Bias-Variance Tradeoff

The bias-variance tradeoff amounts to choosing between including irrelevant variables and excluding relevant ones. Bigger models tend to have low bias because they include more of the relevant variables. However, they are less accurate in approximating the regression parameters due to the possibility of involving extraneous variables.

Moreover, regression models with fewer independent variables are characterized by low estimation error but are more prone to biased parameter estimates.

Methods of Choosing a Model from a Set of Independent Variables

1. General-to-Specific Model Selection

In the general-to-specific method, we start with a large general model that incorporates

all the relevant variables. Then, the reduction of the general model starts. We use

hypothesis tests to establish if there are any statistically insignificant coefficients in the

estimated model. When such coefficients are found, the variable with the coefficient

with the smallest t-statistic is removed. The model is then re-estimated using the

remaining set of independent variables. Once more, hypothesis tests are carried out to

establish if statistically insignificant coefficients are present. These two steps (remove

and re-estimate) are repeated until all coefficients that are statistically insignificant have

been removed.

2. m-fold Cross-Validation

The m-fold cross-validation model-selection method aims at choosing the model that’s

best at fitting observations not used to estimate parameters.

How is this method executed?

As a first step, the number of models has to be decided, and this is determined in part by

the number of explanatory variables. When this number is small, the researcher can

consider all the possible combinations. With 10 variables, for example, 1,024 (= 2¹⁰) distinct models can be constructed.

The cross-validation process proceeds as follows:

1. Shuffle the dataset randomly.

2. Split the dataset into m groups.

3. Estimate parameters using m-1 of the groups; these groups make up what we call

the training block. The excluded group is referred to as the validation block.

4. Use the estimated parameters and the data in the excluded block (validation

block) to compute residual values. These residuals are referred to as out-of-

sample residuals since they are arrived at using data not included in the sample

used to come up with the parameter estimates.

5. Repeat parameter estimation and residual computation a total of m times; each

group has to serve as the validation block and used to compute residuals.

6. Compute the sum of squared errors using the residuals estimated from the out-

of-sample data.

7. Select the model with the smallest out-of-sample sum of squared residuals.
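The steps above can be sketched for a single candidate model; the data, function names, and parameters below are illustrative, and the simple OLS fit is done in closed form:

```python
import random

random.seed(0)
n, m = 200, 5
x = [random.gauss(0, 1) for _ in range(n)]
y = [1 + 2 * xi + random.gauss(0, 1) for xi in x]

idx = list(range(n))
random.shuffle(idx)                      # step 1: shuffle the dataset
folds = [idx[i::m] for i in range(m)]    # step 2: split it into m groups

def ols(xs, ys):
    """Closed-form simple OLS: returns (intercept, slope)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    b = sum((a - mx) * (c - my) for a, c in zip(xs, ys))
    b /= sum((a - mx) ** 2 for a in xs)
    return my - b * mx, b

oos_sse = 0.0
for j in range(m):                       # step 5: each group validates once
    train = [i for g in range(m) if g != j for i in folds[g]]
    a, b = ols([x[i] for i in train], [y[i] for i in train])       # step 3
    oos_sse += sum((y[i] - (a + b * x[i])) ** 2 for i in folds[j]) # steps 4, 6
print(round(oos_sse, 1))                 # step 7: pick the model minimizing this
```

Repeating this for each candidate model and comparing the out-of-sample sums of squared residuals completes the selection.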

Heteroskedasticity

Recall that homoskedasticity is one of the critical assumptions in the determination of the distribution of the OLS estimator. That is, the variance of εᵢ is constant and does not vary with any of the independent variables; formally stated as Var(εᵢ | X₁ᵢ, X₂ᵢ, …, Xₖᵢ) = σ².

Heteroskedasticity is a systematic pattern in the residuals where the variances of the

residuals are not constant.

Test for Heteroskedasticity

Halbert White proposed a simple test, with the following two-step procedures:

I. Estimate the model and calculate the residuals, ε̂ᵢ

II. Regress the squared residuals on:

1. A constant

2. All explanatory variables

3. The cross product of all the independent variables, including the product of each

variable with itself.

Consider an original model with two independent variables:

Yᵢ = α + β₁X₁ᵢ + β₂X₂ᵢ + εᵢ

The first step is to calculate the residuals by utilizing the OLS parameter estimators:

ε̂ᵢ = Yᵢ − α̂ − β̂₁X₁ᵢ − β̂₂X₂ᵢ

Now, we need to regress the squared residuals on a constant, X₁, X₂, X₁², X₂², and X₁X₂:

ε̂ᵢ² = γ₀ + γ₁X₁ᵢ + γ₂X₂ᵢ + γ₃X₁ᵢ² + γ₄X₂ᵢ² + γ₅X₁ᵢX₂ᵢ + ηᵢ

If the data are homoskedastic, then ε̂ᵢ² must not be explained by any of the variables, and the null hypothesis is H₀: γ₁ = ⋯ = γ₅ = 0.

The test statistic is calculated as nR², where R² comes from the second-step regression. It has a χ² distribution with k(k + 3)/2 degrees of freedom, where k is the number of explanatory variables in the first-step model.

For instance, if the number of explanatory variables is two (k = 2), then the test statistic has a χ₅² distribution.
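A sketch of the two-step procedure for the simplest case of one explanatory variable (k = 1), so the second-stage regressors are a constant, X, and X², and the statistic is χ² with k(k + 3)/2 = 2 degrees of freedom. The data are simulated, and the small normal-equations solver is purely illustrative:

```python
import random

def solve(A, b):
    """Solve the k x k system A x = b by Gauss-Jordan elimination."""
    k = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        M[c] = [v / M[c][c] for v in M[c]]
        for r in range(k):
            if r != c:
                f = M[r][c]
                M[r] = [v - f * w for v, w in zip(M[r], M[c])]
    return [M[i][k] for i in range(k)]

def ols_fit(X, y):
    """OLS coefficients via the normal equations (rows of X start with 1)."""
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)

random.seed(1)
n = 500
x = [random.gauss(0, 1) for _ in range(n)]
# Simulated heteroskedastic data: the error's sd grows with |x|
y = [0.5 + 1.0 * xi + random.gauss(0, 1 + abs(xi)) for xi in x]

# Step I: estimate the model and compute squared residuals
b = ols_fit([[1.0, xi] for xi in x], y)
e2 = [(yi - b[0] - b[1] * xi) ** 2 for xi, yi in zip(x, y)]

# Step II: regress the squared residuals on a constant, x, and x^2
g = ols_fit([[1.0, xi, xi * xi] for xi in x], e2)
fitted = [g[0] + g[1] * xi + g[2] * xi * xi for xi in x]
ebar = sum(e2) / n
r2 = 1 - (sum((a - f) ** 2 for a, f in zip(e2, fitted))
          / sum((a - ebar) ** 2 for a in e2))
lm = n * r2   # chi-square with k(k+3)/2 = 2 df; the 5% critical value is 5.99
print(round(lm, 1), lm > 5.99)
```

Because the simulated errors are genuinely heteroskedastic, the statistic should comfortably exceed the 5% critical value and the null of homoskedasticity is rejected.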

Modeling Heteroskedastic Data

The three common methods of handling data with heteroskedastic shocks include:

1. Ignoring the heteroskedasticity when approximating the parameters and then

utilizing the White covariance estimator in hypothesis tests.

However simple, this method leads to less accurate model parameter estimates

compared to other methods that address the heteroskedasticity.

2. Transformation of data.

For instance, positive data can be log-transformed to try and remove heteroskedasticity

and give a better view of data. Another transformation can be in the form of dividing the

dependent variable by another positive variable.

3. Use of weighted least squares (WLS).

This is a complicated method that applies weights to the data before approximating the

parameters. That is, if we know that Var(εᵢ) = wᵢ²σ², where wᵢ is known, then we can transform the data by dividing by wᵢ to remove the heteroskedasticity from the errors. In other words, WLS regresses Yᵢ/wᵢ on Xᵢ/wᵢ, such that:

Yᵢ/wᵢ = α(1/wᵢ) + β(Xᵢ/wᵢ) + εᵢ/wᵢ

Ȳᵢ = αC̄ᵢ + βX̄ᵢ + ε̄ᵢ

Note that the parameters of the model above are estimated by applying OLS to the transformed data: the weighted version of Yᵢ, which is Ȳᵢ, is regressed on the two weighted explanatory variables C̄ᵢ = 1/wᵢ and X̄ᵢ = Xᵢ/wᵢ. Note that the WLS model does not explicitly include the intercept α, but the interpretation is still the same; the coefficient on C̄ᵢ plays the role of the intercept.
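A minimal sketch of the transformation, assuming (for illustration only) that Var(εᵢ) = wᵢ²σ² with known weights wᵢ = Xᵢ; the transformed model is then estimated by OLS via its 2×2 normal equations:

```python
import random

random.seed(7)
n = 4000
x = [random.uniform(1, 5) for _ in range(n)]
# Assume Var(eps_i) = (w_i * sigma)^2 with known weights w_i = x_i
y = [2 + 3 * xi + random.gauss(0, xi) for xi in x]

# Transform: regress y_i/w_i on c_i = 1/w_i and x_i/w_i, no extra intercept.
# Because w_i = x_i here, the second transformed regressor is identically 1.
yt = [yi / wi for yi, wi in zip(y, x)]
c = [1.0 / wi for wi in x]
xt = [xi / wi for xi, wi in zip(x, x)]

# Solve the 2x2 normal equations for (alpha, beta)
scc = sum(a * a for a in c)
scx = sum(a * b for a, b in zip(c, xt))
sxx = sum(b * b for b in xt)
scy = sum(a * v for a, v in zip(c, yt))
sxy = sum(b * v for b, v in zip(xt, yt))
det = scc * sxx - scx * scx
alpha = (scy * sxx - sxy * scx) / det
beta = (scc * sxy - scx * scy) / det
print(round(alpha, 1), round(beta, 1))   # recovers alpha near 2, beta near 3
```

The transformed errors εᵢ/wᵢ are homoskedastic, which is what makes plain OLS appropriate on the weighted data.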

Multicollinearity

Multicollinearity occurs when one or more of the independent variables can be substantially explained by the others. For instance, in the case of two independent variables, there is evidence of multicollinearity if the R² is very high when one variable is regressed on the other.

In contrast with multicollinearity, perfect collinearity is where one of the variables is perfectly explained by the others, such that the R² of the regression of Xⱼ on the remaining independent variables is precisely 1.

Conventionally, an R² above 90% leads to problems in medium-sized samples, such as n = 100. Multicollinearity does not pose an issue for parameter estimation as such, but rather it brings some difficulties in interpreting and modeling the data.

When multicollinearity is present, the coefficients in a regression model may be jointly statistically significant (the F-statistic is substantial) while the individual t-statistics are very small (less than 1.96), because the regression can measure the collective effect of the correlated variables but not their individual effects.

Addressing Multicollinearity

There are two ways of dealing with multicollinearity:

I. Ignoring multicollinearity altogether, since it is technically not an estimation problem.

II. Identifying the multicollinear variables and excluding them from the model.

Multicollinear variables can be identified using the variance inflation factor, which compares the variance of the regression coefficient on independent variable Xⱼ in two models: one that includes only Xⱼ and one that includes all k independent variables. The auxiliary regression of Xⱼ on the remaining variables is:

Xⱼᵢ = γ₀ + γ₁X₁ᵢ + ⋯ + γⱼ₋₁Xⱼ₋₁,ᵢ + γⱼ₊₁Xⱼ₊₁,ᵢ + ⋯ + γₖXₖᵢ + ηᵢ

The variance inflation factor (VIF) for the variable Xⱼ is given by:

VIFⱼ = 1 / (1 − Rⱼ²)

Where Rⱼ² comes from regressing Xⱼ on the other variables in the model. When the value of the VIF is above 10, it is considered too high, and the variable should be excluded from the model.
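With only two independent variables, Rⱼ² is simply the squared correlation between them, so the VIF can be sketched on simulated data (all parameters below are illustrative):

```python
import random

random.seed(3)
n = 300
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [0.97 * a + random.gauss(0, 0.25) for a in x1]  # nearly collinear with x1

def corr(a, b):
    """Sample correlation coefficient."""
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    return cov / (sum((u - ma) ** 2 for u in a)
                  * sum((v - mb) ** 2 for v in b)) ** 0.5

r2_j = corr(x1, x2) ** 2     # R_j^2 of regressing x2 on the only other regressor
vif = 1 / (1 - r2_j)
print(round(vif, 1), vif > 10)   # a VIF above 10 flags the variable
```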

Residual Plots

Residual plots are used to identify deficiencies in a model specification. When the residuals are not systematically related to any of the included independent (explanatory) variables and are relatively small in magnitude (within ±4s, where s is the standard deviation of the model's shocks), the model is considered a good fit.

A residual plot is a graph of ε̂ᵢ (vertical axis) against the independent variables xᵢ. Alternatively, we could plot the standardized residuals ε̂ᵢ/s, which makes any large deviations apparent.

Outliers

Outliers are values that, if removed from the sample, produce large changes in the estimated

coefficients. They can also be viewed as data points that deviate so much from the other observations that they appear to have been generated by a different mechanism.

Cook’s distance helps us measure the impact of dropping a single observation j on a regression

(and the line of best fit).

The Cook's distance is given by:

Dⱼ = [∑ᵢ₌₁ⁿ (Ŷᵢ⁽⁻ʲ⁾ − Ŷᵢ)²] / (k s²)

Where:

Ŷᵢ⁽⁻ʲ⁾ = fitted value of Yᵢ when observation j is excluded and the model is estimated using the remaining n − 1 observations

k = number of coefficients in the regression model

s² = estimated error variance from the model using all observations

When an observation is an inlier (excluding it does not materially affect the coefficient estimates), the value of its Cook's distance (Dⱼ) is small. On the other hand, Dⱼ greater than 1 indicates an outlier.

Example: Calculating Cook’s Distance

Consider the following data sets:

Observation Y X
1 3.67 1.85
2 1.88 0.65
3 1.35 −0.63
4 0.34 1.24
5 −0.89 −2.45
6 1.95 0.76
7 2.98 0.85
8 1.65 0.28
9 1.47 0.75
10 1.58 −0.43
11 0.66 1.14
12 0.05 −1.79
13 1.67 1.49
14 −0.14 −0.64
15 9.05 1.87

If you look at the dataset above, it is easy to see that observation 15 is much larger than the rest of the observations and is possibly an outlier. However, we need to ascertain this.

We begin by fitting the whole dataset (giving fitted values Ŷᵢ) and then the 14 observations that remain after excluding the point that we believe is an outlier.

If we fit the whole dataset, we get the following regression equation:

Ŷᵢ = 1.4465 + 1.1281Xᵢ

And if we exclude the observation that we believe is an outlier, we get:

Ŷᵢ⁽⁻ʲ⁾ = 1.1516 + 0.6828Xᵢ

Now the fitted values are as shown below:

Observation Y X Ŷᵢ Ŷᵢ⁽⁻ʲ⁾ (Ŷᵢ⁽⁻ʲ⁾ − Ŷᵢ)²
1 3.67 1.85 3.533 2.4148 1.2504
2 1.88 0.65 2.179 1.5954 0.3406
3 1.35 −0.63 0.7358 0.7214 0.0002
4 0.34 1.24 2.8453 1.9983 0.7174
5 −0.89 −2.45 −1.3174 −0.5213 0.6338
6 1.95 0.76 2.3039 1.6705 0.4012
7 2.98 0.85 2.4053 1.7320 0.4533
8 1.65 0.28 1.7624 1.3428 0.1761
9 1.47 0.75 2.2926 1.6637 0.3955
10 1.58 −0.43 0.9614 0.8580 0.0107
11 0.66 1.14 2.7325 1.9210 0.6585
12 0.05 −1.79 −0.5728 −0.0706 0.2522
13 1.67 1.49 3.1274 2.1690 0.9185
14 −0.14 −0.64 0.7245 0.7146 0.0001
15 9.05 1.87 3.556 2.4284 1.2715
Sum 7.4800

If s² = 3.554, the Cook's distance is:

Dⱼ = [∑ᵢ₌₁ⁿ (Ŷᵢ⁽⁻ʲ⁾ − Ŷᵢ)²] / (k s²) = 7.4800 / (2 × 3.554) = 1.0523

Since Dⱼ > 1, observation 15 can be considered an outlier.
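The final computation can be reproduced directly (values from the example):

```python
# Cook's distance for observation 15, using the quantities from the example
sum_sq = 7.4800        # sum over i of (Yhat_i^(-j) - Yhat_i)^2
k, s2 = 2, 3.554       # number of coefficients; error variance from all data
D = sum_sq / (k * s2)
print(round(D, 4))     # 1.0523, which exceeds 1, flagging an outlier
```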

Strengths of Ordinary Least Squares (OLS )

OLS is the Best Linear Unbiased Estimator (BLUE) when some key assumptions are met, which implies that it attains the smallest possible variance among all estimators that are linear and unbiased:

Linearity: the model must be linear in the parameters being estimated.

Random: the data must have been randomly sampled from the population.

Non-Collinearity: the regressors being calculated should not be perfectly correlated

with each other.

Exogeneity: the regressors aren’t correlated with the error term.

Homoscedasticity: the variance of the error term is constant.

However, being a BLUE estimator comes with the following limitations:

I. Many estimators are not linear, such as maximum likelihood estimators (which may be biased).

II. The BLUE property is heavily dependent on the residuals being homoskedastic. In the case that the variances of the residuals vary with the independent variables, it is possible to construct linear unbiased estimators (LUE) of the coefficients α and β using WLS, but with extra assumptions.

When the residuals are iid and normally distributed with a mean of 0 and a variance of σ², formally stated as εᵢ ~ iid N(0, σ²), OLS upgrades from BLUE to BUE (Best Unbiased Estimator) by virtue of having the smallest variance among all linear and nonlinear estimators. However, normally distributed errors are not a requirement for accurate estimates of the model coefficients, nor a necessity for the desirable properties of the estimators.

Practice Question 1

Which of the following statements is/are correct?

I. Homoskedasticity means that the variance of the error terms is constant for all

independent variables.

II. Heteroskedasticity means that the variance of error terms varies over the sample.

III. The presence of conditional heteroskedasticity reduces the standard error.

A. Only I

B. II and III

C. All statements are correct

D. None of the statements are correct

Solution

The correct answer is C.

All statements are correct

If the variance of the residuals is constant across all observations in the sample, the

regression is said to be homoskedastic. When the opposite is true, the regression is

said to exhibit heteroskedasticity, i.e., the variance of the residuals is not the same

across all observations in the sample. The presence of conditional heteroskedasticity

poses a significant problem: it introduces a bias into the estimators of the standard

error of the regression coefficients. As such, it understates the standard error.

Practice Question 2

A financial analyst fails to include a variable which inherently has a non-zero

coefficient in his regression analysis. Moreover, the ignored variable is highly

correlated with the remaining variables.

What is the most likely deficiency of the analyst’s model?

A. Omitted variable bias.

B. Bias due to inclusion of extraneous variables.

C. Presence of heteroskedasticity.

D. None of the above.

Solution

The correct answer is A.

Omitted variable bias occurs under two conditions:

I. A variable with a non-zero coefficient is omitted

II. A variable that is omitted is correlated with remaining (included) variables.

These conditions are met in the description of the analyst’s model.

Option B is incorrect since an extraneous variable is one that is unnecessarily included in the model; its true coefficient is 0, and its estimated coefficient converges to 0 in large samples.

Option C is incorrect because heteroskedasticity is a condition where the variance of

the errors varies systematically with the independent variables of the model.

Reading 21: Stationary Time Series

After completing this reading, you should be able to:

Describe the requirements for a series to be covariance stationary.

Define the autocovariance function and the autocorrelation function.

Define white noise; describe independent white noise and normal (Gaussian) white

noise.

Define and describe the properties of autoregressive (AR) processes.

Define and describe the properties of moving average (MA) processes.

Explain how a lag operator works.

Explain mean reversion and calculate a mean-reverting level.

Define and describe the properties of autoregressive moving average (ARMA)

processes.

Describe the application of AR, MA, and ARMA processes.

Describe sample autocorrelation and partial autocorrelation.

Describe the Box-Pierce Q-statistic and the Ljung-Box Q statistic.

Explain how forecasts are generated from ARMA models.

Describe the role of mean reversion in long-horizon forecasts.

Explain how seasonality is modeled in a covariance-stationary ARMA.

A time series is a collection of observations on a variable’s outcomes in distinct periods — for example, the monthly sales of a company for the past ten years. Time series models are used to forecast future values of the series. A time series can be decomposed into trend, seasonal, and cyclical components. A trending time series changes its level over time, while a seasonal time series has predictable changes within a given period. Lastly, a cyclical time series, as its name suggests,

reflects the cycles in a given data. We will concentrate on the cyclical data (especially linear

stochastic processes).

A stochastic process is a set of random variables ordered in time. The stochastic process is usually denoted by Yt, and the subscript orders the random variables in time so that Ys occurs before Yt if s < t.

A linear process has a general form of:

Yt = αt + β0ϵt + β1ϵt−1 + β2ϵt−2 + … = αt + ∑_{i=0}^{∞} βi ϵt−i

The linear process is linear in the shocks ϵt. The term αt is deterministic, while the βi are constant coefficients.

Covariance Stationary Time Series

The ordered set: {… , y −2 , y −1 ,y 0 , y 1, y 2 , …} is called the realization of a time series. Theoretically,

it starts from the infinite past and proceeds to the infinite future. However, only a finite subset of

the realization can be used in practice; this subset is called a sample path.

A series is said to be covariance stationary if both its mean and covariance structure is stable

over time.

More specifically, a time series is said to be covariance stationary if:

I. The mean does not change and is thus constant over time. That is:

E(Yt) = μ ∀t

II. The variance does not change over time; it is finite and constant. That is:

V(Yt) = γ0 < ∞ ∀t

III. The autocovariance of the time series is finite, does not change over time, and depends only on the distance between two observations. That is:

Cov(Yt, Yt−h) = γh ∀t

The covariance stationarity is crucial so that the time series has a constant relationship across

time and that the parameters are easily interpreted since the parameters will be asymptotically

normally distributed.

Autocovariance and Autocorrelation Functions

The Autocovariance Function

It can be quite challenging to quantify the stability of a covariance structure. We will, therefore,

use the autocovariance function. The autocovariance is the covariance between the stochastic

process at different points in time (analogous to the covariance between two random variables).

It is given by:

γt,h = E [(Yt − E (Yt )) (Yt−h − E (Y t−h ))]

And if the length h = 0 then:

γt,h = E [(Y t − E(Y t))2 ]

Which is the variance of Yt .

The autocovariance is a function of h so that:

γh = γ|h|

This is asserting the fact that the autocovariance depends on the length h and not the time t. So

that:

Cov (Y t, Yt−h ) = Cov (Y t−h , Y t)

The autocorrelation is defined as:

ρ(h) = Cov(Yt, Yt−h) / (√V(Yt) √V(Yt−h)) = γh / √(γ0 γ0) = γh / γ0
Similarly, for h = 0:

ρ(0) = γ0 / γ0 = 1

The autocorrelation ranges from −1 to 1 inclusive. The partial autocorrelation function, denoted p(h), is the coefficient of Yt−h in a linear population regression of Yt on Yt−1, …, Yt−h. This regression is referred to as an autoregression because the variable is regressed on its own lagged values.

White Noise

Assume that:

yt = ϵt

ϵt ∼ (0, σ²), σ² < ∞

where ϵt is the shock and is uncorrelated over time. Therefore, ϵ t and y t are said to be serially

uncorrelated.

A process with zero mean, unchanging variance, and no serial correlation is referred to as zero-mean white noise (or just white noise) and is written as:

ϵt ∼ W N (0, σ 2)

And:

yt ∼ WN (0, σ 2)

ϵt and yt are serially uncorrelated but not necessarily serially independent. If y is serially uncorrelated and also serially independent, then it is said to be independent white noise.

Therefore, we write:

yt ∼ iid (0, σ²)

This is read as “y is independently and identically distributed with mean 0 and constant variance σ².” If, in addition, the serially independent y is normally distributed, then y is called normal white noise or Gaussian white noise, written as:

yt ∼ iid N(0, σ²)

To characterize the dynamic stochastic structure of y t ∼ WN (0, σ 2), it follows that the

unconditional mean and variance of y are:

E (y t) = 0

And:

var (y t) = σ 2

These two are constant since only displacement affects the autocovariances rather than time. All

the autocovariances and autocorrelations are zero beyond displacement zero since white noise is

uncorrelated over time.

The following is the autocovariance function for a white noise process:

γ(h) = { σ², h = 0
0, h ≥ 1

The following is the autocorrelation function for a white noise process:

ρ (h) = { 1, h=0
0, h≥1

Beyond displacement zero, all partial autocorrelations for a white noise process are zero. Thus,

by construction white noise is serially uncorrelated. The following is the function of the partial

autocorrelation for a white noise process:

p (h) = { 1, h=0
0, h≥1
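These properties are straightforward to check by simulation. A minimal sketch (hypothetical σ = 2, NumPy assumed) confirms that Gaussian white noise has a sample mean near 0, a sample variance near σ², and sample autocorrelations near 0 at positive lags:

```python
import numpy as np

rng = np.random.default_rng(7)
sigma = 2.0
y = rng.normal(0.0, sigma, size=100_000)   # Gaussian white noise, WN(0, sigma^2)

def sample_autocorr(y, h):
    """Sample autocorrelation at lag h >= 1."""
    yb = y - y.mean()
    return (yb[h:] * yb[:-h]).sum() / (yb**2).sum()

rho1 = sample_autocorr(y, 1)   # should be close to 0
```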

Simple transformations of white noise are used to construct processes with much richer dynamics. In turn, the 1-step-ahead forecast errors from a good model should be white noise.

The mean and variance of a process, conditional on its past, is another crucial characterization of

dynamics with crucial implications for forecasting.

To compare the conditional and unconditional means and variances, consider the independent white noise yt ∼ iid(0, σ²). Then y has an unconditional mean of 0 and an unconditional variance of σ².

Now, consider the information set:

Ωt−1 = {yt−1 , y t−2 , …}

Or:

Ωt−1 = {ϵt−1 , ϵt−2 , …}

The conditional mean and variance do not necessarily have to be constant. The conditional mean

for the independent white noise process is:

E (y t|Ωt−1 ) = 0

The conditional variance is:

var (yt |Ωt−1 ) = E ((yt − E (yt |Ωt−1 ))2 |Ωt−1 ) = σ 2

Independent white noise series have identical conditional and unconditional means and

variances.

Wold’s Theorem

Assuming that {yt} is any zero-mean covariance-stationary process. Then:


Yt = ϵt + β1ϵt−1 + β2ϵt−2 + ⋯ = ∑_{i=0}^{∞} βi ϵt−i

Where:

ϵt ∼ W N (0, σ 2)

Note that β0 = 1 and ∑_{i=0}^{∞} βi² < ∞.

Wold’s representation is the appropriate model for any covariance-stationary series. Since ϵt corresponds to the 1-step-ahead forecast error incurred when a particularly good forecast is applied, the ϵt’s are called the innovations.

Time-Series Models

The Autoregressive (AR) Models

AR models are time series models, widely used in finance and economics, that link the stochastic process Yt to its previous value Yt−1. The first-order AR model, denoted AR(1), is given by:

Y t = α + βYt−1 + ϵt

Where:

α = intercept

β = AR parameter

ϵt = the shock, which is white noise (ϵt ∼ WN(0, σ²))

Since Yt is assumed to be covariance stationary, the mean, variance, and autocovariances are all constant. By the principle of covariance stationarity,

E(Yt ) = E(Y t−1 ) = μ

Therefore,

E(Yt) = E(α + βYt−1 + ϵt) = α + βE(Yt−1) + E(ϵt)

⇒ μ = α + βμ + 0

∴ μ = α / (1 − β)

And for the variance,

V(Yt) = V(α + βYt−1 + ϵt) = β²V(Yt−1) + V(ϵt) + 2βCov(Yt−1, ϵt)

γ0 = β²γ0 + σ² + 0

⇒ γ0 = σ² / (1 − β²)

Note that Cov(Yt−1, ϵt) = 0 since Yt−1 depends only on the shocks ϵt−1, ϵt−2, …, which are uncorrelated with ϵt.

The autocovariances for the AR(1) process are calculated recursively. The first autocovariance for the AR(1) model is given by:

Cov(Yt, Yt−1) = Cov(α + βYt−1 + ϵt, Yt−1)
= βCov(Yt−1, Yt−1) + Cov(Yt−1, ϵt)
= βγ0

The remaining autocovariance is recursively calculated as:

Cov(Yt, Yt−h) = Cov(α + βYt−1 + ϵt, Yt−h)
= βCov(Yt−1, Yt−h) + Cov(Yt−h, ϵt)
= βγh−1

It should be easy to see that Cov(Yt−h , ϵt ) = 0 . Applying this recursion analogy:

γh = βh γ0

Therefore we can generalize the autocovariance as:

γh = β|h| γ0

Intuitively, the autocorrelation function is given by:

ρ(h) = β^|h| γ0 / γ0 = β^|h|

The ACF decays to 0 as h increases and oscillates when −1 < β < 0. The partial autocorrelation of an AR(1) model is given by:

∂(h) = { β^|h|, |h| ≤ 1
0, |h| ≥ 2
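A simulation with illustrative parameters (α = 1, β = 0.8, σ = 1, none of which come from the text) confirms the AR(1) moment formulas: the sample mean approaches α/(1 − β) = 5, the sample variance approaches σ²/(1 − β²), and the lag-1 autocorrelation approaches β:

```python
import numpy as np

alpha, beta, sigma = 1.0, 0.8, 1.0
mu = alpha / (1 - beta)               # long-run mean = 5
gamma0 = sigma**2 / (1 - beta**2)     # long-run variance, about 2.778

rng = np.random.default_rng(0)
T = 100_000
eps = rng.normal(0.0, sigma, T)
y = np.empty(T)
y[0] = mu                             # start at the mean to avoid burn-in
for t in range(1, T):
    y[t] = alpha + beta * y[t - 1] + eps[t]

rho1 = np.corrcoef(y[1:], y[:-1])[0, 1]   # lag-1 autocorrelation, near beta
```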

The Lag Operator

The lag operator denoted by L is important for manipulating complex time-series models. As its

name suggests, the lag operator moves the index of a particular observation one step back. That

is:

LY t = Y t−1

Properties of the Lag Operator

(I). The lag operator moves the index of a time series one step back. That is:

LY t = Y t−1

(II). Consider the following mth-order lag operator polynomial Lm then:

L^m Yt = Yt−m

For instance, L²Yt = L(LYt) = L(Yt−1) = Yt−2

(III). The lag operator of a constant is just a constant.

For example Lα = α

(IV). The pth order lag operator is given by:

a(L) = 1 + a1 L + a2 L 2 + … + ap Lp

so that:

a(L)Y t = Yt + a1 Y t−1 + a2 Y t−2 + … + ap Y t−p

(V). The lag operator has a multiplicative property. Consider two lag operators a(L) and b(L).

Then:

a(L)b(L)Yt = (1 + a1L)(1 + b1L)Yt
= (1 + a1L)(Yt + b1Yt−1)
= Yt + b1Yt−1 + a1Yt−1 + a1b1Yt−2

Moreover, the lag operator has a commutative property so that:

a(L)b(L) = b(L)a(L)

(VI). Under some restrictive conditions, the lag operator polynomial can be inverted so that a(L)a(L)⁻¹ = 1. A first-order lag operator polynomial a(L) = 1 − a1L is invertible if |a1| < 1, and its inverse is given by:

(1 − a1L)⁻¹ = ∑_{i=0}^{∞} a1^i L^i
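The inversion property can be checked numerically. In the sketch below (hypothetical coefficient a1 = 0.6 and a simulated series), filtering with (1 − a1L) and then applying the truncated inverse ∑ a1^i L^i recovers the original observation up to a geometrically small remainder of order a1^(K+1):

```python
import numpy as np

rng = np.random.default_rng(1)
a1 = 0.6
y = rng.normal(size=300)

# Apply (1 - a1 L): z_t = y_t - a1 * y_{t-1}
z = y.copy()
z[1:] -= a1 * y[:-1]

# Apply the truncated inverse: y_t is approximately sum_{i=0}^{K} a1^i * z_{t-i}
K = 50
t = 200
y_rec = sum(a1**i * z[t - i] for i in range(K + 1))
# The telescoping sum leaves an error of a1^(K+1) * y_{t-K-1}, about 1e-11 here
```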

For an AR(1) model,

Y t = α + βYt−1 + ϵt

This can be expressed with the lag operator so that:

Yt = α + βLYt + ϵt

⇒ (1 − βL)Yt = α + ϵt

If |β|<1, then the lag polynomial above is invertible so that:

(1 − βL)−1 (1 − βL)Y t = (1 − βL)−1 α + (1 − βL)−1 ϵ t

⇒ Yt = α ∑_{i=0}^{∞} β^i + ∑_{j=0}^{∞} β^j L^j ϵt = α / (1 − β) + ∑_{i=0}^{∞} β^i ϵt−i

The pth Order Autoregressive Model (AR(p))

The AR(p) model is a generalization of the AR(1) model that includes p lags of Yt. Thus, the AR(p) is given by:

Yt = α + β1Yt−1 + β2Yt−2 + … + βpYt−p + ϵt

If Yt is covariance stationary, then the long-run mean is given by:

E(Yt) = α / (1 − β1 − β2 − … − βp)

And the long-run variance is given by:

V(Yt) = γ0 = σ² / (1 − β1ρ1 − β2ρ2 − … − βpρp)

From the formulas of the mean and variance of the AR(p) model, the covariance stationarity

property is satisfied if:

β1 + β2 + ⋯ + βp < 1

Otherwise, the covariance stationarity will be violated.

The autocorrelation function of the AR(p) model has the same structure as that of the AR(1) model: the ACF decays to 0 as the lag between the two observations increases and may oscillate. However, higher-order ARs may exhibit more complex structures in their ACFs.

The Moving Average Models (MA)

The first-order moving average model denoted by MA(1) is given by:

Y t = μ + θϵ t−1 + ϵ t

Where ϵ t ∼ W N(0, σ2 ).

Evidently, the process Y t depends on the current shock ϵt and the previous shock ϵ t−1 where the

coefficient θ measures the magnitude at which the previous shock affects the process. Note μ is

the mean of the process since:

E(Y t) = E(μ + θϵt−1 + ϵt ) = E(μ) + θE(ϵ t−1 ) + E(ϵt )


= μ +0 +0 = μ

For θ > 0 , MA(1) is persistent because the consecutive values are positively correlated. On the

other hand, if θ < 0, the process mean reverts because the effect of the previous shock is

reversed in the current period.

The MA(1) model is always a covariance stationary process. The mean is as shown above, while

the variance of the MA(1) model is given by:

V (Yt ) = V (μ + θϵt−1 + ϵt ) = V (μ) + θ2 V (ϵt−1 ) + V (ϵ t)


= 0 + θ2 V (ϵt−1 ) + V (ϵ t) = θ2 σ2 + σ2
⇒ V (Yt ) = σ2 (1 + θ2 )

The variance derivation uses the fact that the shocks are white noise processes, which are uncorrelated.

The MA(1) model has a non-zero autocorrelation function given by:


⎪ 1, h = 0
θ
ρ (h) = ⎨ ,h = 1
1+ 2 θ


0, h ≥ 2

The partial autocorrelations (PACF) of the MA(1) model are complex and non-zero at all lags.
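The MA(1) moment formulas can be verified by simulation. The parameters below (μ = 0.5, θ = 0.7, σ = 1) are illustrative:

```python
import numpy as np

mu, theta, sigma = 0.5, 0.7, 1.0
var_theory = sigma**2 * (1 + theta**2)    # 1.49
rho1_theory = theta / (1 + theta**2)      # about 0.4698

rng = np.random.default_rng(3)
T = 100_000
eps = rng.normal(0.0, sigma, T + 1)
y = mu + theta * eps[:-1] + eps[1:]       # Y_t = mu + theta*eps_{t-1} + eps_t

rho1_sample = np.corrcoef(y[1:], y[:-1])[0, 1]
```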

From the MA(1), we can generalize the qth order MA process. Denoted by MA(q), it is given by:

Y t = μ + ϵ t + θ1 ϵt −1 + … + θq ϵ t−q

The mean of the MA(q) process is still μ since all the shocks are white noise process (their

expectations are 0). The autocovariance function of the MA(q) process is given by:

γ(h) = { σ² ∑_{i=0}^{q−h} θi θi+h, 0 ≤ h ≤ q
0, h > q

where θ0 = 1.

The value of θ can be determined by substituting the value taken by the autocorrelation function

and solving the resulting quadratic equation. The partial autocorrelation of an MA(q) model is

complex and non-zero at all lags.

Example: Moving Average Process.

Given an MA(2), Yt = 3.0 + 5ϵt−1 + 5.75ϵ t−2 + ϵ t where ϵ t ∼ WN (0, σ2 ). What is the mean of the

process?

Solution

The MA(2) is given by:

Y t = μ + θ1 ϵ t−1 + θ2 ϵ t−2 + ϵ t

Where μ is the mean. So, the mean of the above process is 3.0
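The remaining moments of this MA(2) follow from the same shock-based reasoning. The arithmetic below assumes a shock variance of σ² = 1 (the question does not specify it) and applies the MA(q) variance and autocovariance formulas from above:

```python
mu, theta1, theta2, sigma2 = 3.0, 5.0, 5.75, 1.0   # sigma^2 = 1 is an assumption

mean = mu                                           # shocks have zero expectation
variance = sigma2 * (1 + theta1**2 + theta2**2)     # 59.0625
# gamma(h) = sigma^2 * sum_{i=0}^{q-h} theta_i * theta_{i+h}, with theta_0 = 1
gamma1 = sigma2 * (theta1 + theta1 * theta2)        # 33.75
gamma2 = sigma2 * theta2                            # 5.75
```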

The Autoregressive Moving Average (ARMA) Models

The ARMA model is a combination of AR and MA processes. Consider a first-order ARMA model

(ARMA(1,1)). It is given by:

Y t = α + βYt−1 + θϵt−1 + ϵt

The mean of the ARMA(1,1) model is given by:

μ = α / (1 − β)

And the variance is given by:

γ0 = σ²(1 + 2βθ + θ²) / (1 − β²)

The autocovariance function is given by:

γ(h) = { σ²(1 + 2βθ + θ²) / (1 − β²), h = 0
σ²(β + θ)(1 + βθ) / (1 − β²), h = 1
βγh−1, h ≥ 2

The ACF of the ARMA(1,1) decays as the lag h increases and oscillates if β < 0, which is consistent with the AR model.

The PACF tends to 0 as the lag h increases, which is consistent with the MA process. The decay of the ARMA’s ACF and PACF is slow, which distinguishes it from the pure AR and MA models.

From the variance formula of ARMA(1,1), it is easy to see that the process is covariance stationary if |β| < 1.
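The ARMA(1,1) moment formulas can be evaluated directly. The parameter values below are hypothetical, chosen only to illustrate the calculation:

```python
alpha, beta, theta, sigma2 = 0.5, 0.6, 0.3, 1.0    # illustrative values

mu = alpha / (1 - beta)                             # 1.25
gamma0 = sigma2 * (1 + 2*beta*theta + theta**2) / (1 - beta**2)
gamma1 = sigma2 * (beta + theta) * (1 + beta*theta) / (1 - beta**2)
gamma2 = beta * gamma1                              # gamma_h = beta*gamma_{h-1}, h >= 2
rho1 = gamma1 / gamma0
```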

ARMA(p,q) Model

As the name suggests, ARMA(p,q) is a combination of the AR(p) and MA(q) process. Its form is

given by:

Yt = α + β1Yt−1 + … + βpYt−p + θ1ϵt−1 + … + θqϵt−q + ϵt

When expressed using lag polynomial, this expression reduces to:

β(L)Yt = α + θ(L)ϵt

Analogous to ARMA(1,1), ARMA(p,q) is covariance stationary if its AR portion is covariance stationary. The autocovariances and ACFs of the ARMA process are complex; they decay slowly to 0 as the lag h increases and may oscillate.

Sample Autocorrelation

The sample autocorrelation is utilized in validating the ARMA models. The autocovariance

estimator is given by:

γ̂h = (1 / (T − h)) ∑_{i=h+1}^{T} (Yi − Ȳ)(Yi−h − Ȳ)

Where Ȳ is the full-sample mean.

The autocorrelation estimator is given by:

ρ̂h = ∑_{i=h+1}^{T} (Yi − Ȳ)(Yi−h − Ȳ) / ∑_{i=1}^{T} (Yi − Ȳ)² = γ̂h / γ̂0

The autocorrelation estimator is such that −1 ≤ ρ̂h ≤ 1.
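The estimators translate directly into code. A minimal NumPy sketch (simulated white noise, so every ρ̂(h) with h ≥ 1 should be near zero):

```python
import numpy as np

def autocov(y, h):
    """Sample autocovariance at lag h, scaled by 1/(T - h); variance for h = 0."""
    ybar = y.mean()
    return ((y[h:] - ybar) * (y[:-h] - ybar)).sum() / (len(y) - h) if h else y.var()

def autocorr(y, h):
    """Sample autocorrelation: rho_hat(h) = gamma_hat(h) / gamma_hat(0)."""
    return autocov(y, h) / autocov(y, 0)

rng = np.random.default_rng(5)
y = rng.normal(size=50_000)   # white noise: autocorrelations near 0 for h >= 1
```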

Test for Autocorrelation

Autocorrelation can be tested graphically by plotting the ACF and PACF of the residuals and checking for deficiencies, such as the inadequacy of the model to capture the dynamics of the data. However, graphical methods are unreliable.

The common tests used are Box-Pierce and Ljung-Box tests.

Box-Pierce and Ljung-Box Tests.

The Box-Pierce and Ljung-Box tests both test the null hypothesis that:

H0: ρ1 = ρ2 = … = ρh = 0

against the alternative that:

H1: ρj ≠ 0 (at least one is non-zero)

Both test statistics are chi-square distributed (χ²h) random variables if the null hypothesis is true. If the test statistic is larger than the critical value, the null hypothesis is rejected.

Box-Pierce Test

The test statistic under the Box-Pierce is given by:

QBP = T ∑_{i=1}^{h} ρ̂i²

That is, the test statistic is the sum of squared autocorrelations scaled by the sample size T, which is a χ²h random variable if the null hypothesis is true.

Ljung-Box Test

Ljung-Box test is a revised version of Box-Pierce that is appropriate with small sample sizes. The

test statistic is given by:

QLB = T(T + 2) ∑_{i=1}^{h} ρ̂i² / (T − i)

The Ljung-Box test statistic is also a χ²h random variable.
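Both statistics are simple functions of the sample autocorrelations. A sketch with hypothetical sample values (T = 300 and ρ̂ = 0.25, −0.10, −0.05):

```python
def box_pierce(rhos, T):
    """Q_BP = T * sum of squared sample autocorrelations."""
    return T * sum(r**2 for r in rhos)

def ljung_box(rhos, T):
    """Q_LB = T(T + 2) * sum of rho_i^2 / (T - i)."""
    return T * (T + 2) * sum(r**2 / (T - i) for i, r in enumerate(rhos, start=1))

rhos = [0.25, -0.10, -0.05]   # hypothetical sample autocorrelations
T = 300
q_bp = box_pierce(rhos, T)    # 22.5
q_lb = ljung_box(rhos, T)     # about 22.74
```

Each statistic would then be compared against a chi-square critical value with 3 degrees of freedom.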

Model Selection

The first step in model selection is the inspection of the sample autocorrelations and the PACFs.

This provides the initial signs of the correlation of the data and thus can be used to select the

type of models to be used.

The next step is to measure the fit of the selected model. The most commonly used method of

measuring the model’s fit is Mean Squared Error (MSE) which is defined as:

σ̂² = (1/T) ∑_{t=1}^{T} ϵ̂t²

When the MSE is small, the selected model explains more of the time series. However, always choosing the model with the smallest MSE amounts to maximizing the coefficient of determination R², which can lead to overfitting. To address this problem, other methods have been developed to measure the fit of the model. These methods add an adjustment (penalty) factor to the MSE each time a parameter is added. These measures are termed Information Criteria (IC).

There are two such ICs: Akaike Information Criteria (AIC) and the Bayesian Information Criteria

(BIC).

Akaike Information Criteria (AIC)

Akaike Information Criteria (AIC) is defined as:

AIC = T ln σ̂² + 2k

Where T is the sample size and k is the number of parameters. The AIC adds an adjustment of 2 for each parameter in the model.

Bayesian Information Criteria (BIC).

Bayesian Information Criteria (BIC) is defined as:

BIC = T ln σ̂² + k ln T

Where the variables are defined as in AIC; however, note that the adjustment factor in BIC

increases with an increase in the sample size T. Hence, it is a consistent model selection

criterion. Moreover, the BIC criterion does not select the model that is larger than that selected

by AIC.
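Both criteria are one-line computations. The sketch below uses hypothetical fits in which adding a parameter lowers the residual variance only slightly, so the two criteria can disagree:

```python
import math

def aic(T, sigma2_hat, k):
    """AIC = T * ln(sigma_hat^2) + 2k."""
    return T * math.log(sigma2_hat) + 2 * k

def bic(T, sigma2_hat, k):
    """BIC = T * ln(sigma_hat^2) + k * ln(T)."""
    return T * math.log(sigma2_hat) + k * math.log(T)

T = 500                                                    # hypothetical sample size
aic_small, bic_small = aic(T, 1.00, 2), bic(T, 1.00, 2)    # 2-parameter model
aic_big, bic_big = aic(T, 0.99, 3), bic(T, 0.99, 3)        # 3-parameter model
```

Here AIC prefers the larger model, while BIC, whose penalty grows with ln T, prefers the smaller one.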

The Box-Jenkins Methodology

The Box-Jenkins methodology provides criteria for selecting between models that are equivalent but have different parameter values. The equivalency of the models implies that their means, ACFs, and PACFs are equal.

The Box-Jenkins methodology postulates two principles for selecting models. The first principle is parsimony: given two equivalent models, choose the model with fewer parameters.

The second principle is invertibility, which states that when selecting an MA or ARMA model, one should select the model whose MA coefficients are invertible.

Model Forecasting

Forecasting is the process of using current information to predict the future. In time series forecasting, we can make a one-step forecast or a forecast at any horizon h.

The one-step time series forecast is the conditional expectation E(YT+1 | ΩT). ΩT is termed the information set at time T, which includes the entire history of Y (YT, YT−1, …) and the shock history (ϵT, ϵT−1, …). In practice, this forecast is shortened to ET(YT+1) so that:

E(YT+1 | ΩT) = ET(YT+1)

Principles of Forecasting.

There are three rules of forecasting:

I. The time-T expectation of any variable already observed at time T is its realization. That is, ET(YT) = YT. This also applies to the shocks: ET(ϵT−1) = ϵT−1.

II. The value of the expectation of future shocks is always 0. That is,

E T (ϵT +h ) = 0

III. Forecasts are generated recursively, beginning with ET(YT+1); the forecast for a given horizon may depend on the forecast for the previous horizon.

Let us consider some examples.

For the AR(1) model, the one-step forecast is given by:

E T (Y T+1 ) = ET (α + βY T + ϵT +1 ) = α + βET (YT ) + 0


= α + βY T

Note that we use the current value YT to predict YT+1; the future shock ϵT+1 has an expectation of zero.

The two-step forecast is given by:

ET (YT +2 ) = E T (α + βYT +1 + ϵ T+2 )


= α + βET (YT +1 ) + E T (ϵ T+2 )

But ET (ϵT +2 ) = 0 and ET (YT +1 ) = α + βY T

So that:

ET(YT+2) = α + β(α + βYT) = α + αβ + β²YT

Analogously, for time horizon h we have:

ET(YT+h) = α + αβ + αβ² + … + αβ^{h−1} + β^h YT = ∑_{i=0}^{h−1} αβ^i + β^h YT
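The recursion can be coded directly. A minimal sketch with illustrative parameters (α = 0.5, β = 0.8, YT = 2):

```python
def ar1_forecast(alpha, beta, y_T, h):
    """h-step AR(1) forecast via E_T(Y_{T+j}) = alpha + beta * E_T(Y_{T+j-1})."""
    f = y_T
    for _ in range(h):
        f = alpha + beta * f
    return f

alpha, beta, y_T = 0.5, 0.8, 2.0
f1 = ar1_forecast(alpha, beta, y_T, 1)   # alpha + beta*y_T = 2.1
f2 = ar1_forecast(alpha, beta, y_T, 2)   # alpha + alpha*beta + beta^2*y_T = 2.18
```

As h grows, the forecast converges to the long-run mean α/(1 − β) = 2.5, which is the mean reversion discussed in the next section.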

The Mean Reverting Level

When h is large, β^h must be very small because covariance stationarity of Yt requires |β| < 1. Therefore, it can be shown that:

lim_{h→∞} ET(YT+h) = lim_{h→∞} (∑_{i=0}^{h−1} αβ^i + β^h YT) = α / (1 − β)

The limit is actually the mean of the AR(1) model. At the mean-reverting level, the current value YT no longer affects the long-horizon forecast of Y. That is,

lim_{h→∞} ET(YT+h) = E(Yt)

The same procedure is applied to MA and ARMA models.

The forecast error is the difference between the true future value and the forecasted value, that

is,

ϵ T+1 = Y T+1 − ET (YT +1 )

For longer time horizons, the forecast errors are mostly functions of the model parameters.

Example: Model Forecasting

An AR(1) model for the default rate in premiums for an insurance company is given by

Dt = 0.055 + 0.934Dt−1 + ϵ t

Given that DT = 1.50, what is the one-step forecast of the default rate?

Solution

We need:

ET (YT +1 ) = α + βYT
⇒ E T (DT +1 ) = 0.055 + 0.934 × 1.5 = 1.4560

Seasonality of Time Series

Some time-series data are seasonal. For instance, sales in summer may differ from sales in winter. Time series with deterministic seasonality are termed non-stationary, while those with stochastic seasonality are stationary and can hence be modeled with AR or ARMA processes.

A pure seasonal model utilizes lags at the seasonal frequency. For instance, with quarterly data, the pure seasonal AR(1) model of the seasonal time series is:
(1 − βL 4 )Y t = α + ϵ t

So that:

Y t = α + βYt−4 + ϵt

A more flexible seasonal model includes both short-term and seasonal lag components. The short-term components utilize lags at the observation frequency.

Seasonality can also be introduced to AR, MA, or ARMA models by multiplying the short-run lag polynomial by the seasonal lag polynomial. For instance, the seasonal ARMA is specified as:

ARMA(p, q) × (ps , q s )f

Where p and q are the orders of the short-run lag polynomials, and ps and qs are the orders of the seasonal lag polynomials. Practically, seasonal lag polynomials are restricted to one seasonal lag because the

accuracy of the parameter approximations depends on the number of full seasonal cycles in the

sample data.
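A pure seasonal AR(1) is easy to simulate (illustrative parameters; the seasonal period s = 4 corresponds to quarterly data). The correlation appears at the seasonal lag, while the lag-1 correlation is near zero because consecutive observations share no shocks:

```python
import numpy as np

alpha, beta, s = 1.0, 0.5, 4          # Y_t = alpha + beta * Y_{t-4} + eps_t
rng = np.random.default_rng(11)
T = 50_000
eps = rng.normal(size=T)
y = np.full(T, alpha / (1 - beta))    # initialize the first s values at the mean
for t in range(s, T):
    y[t] = alpha + beta * y[t - s] + eps[t]

rho_1 = np.corrcoef(y[1:], y[:-1])[0, 1]   # non-seasonal lag: near 0
rho_s = np.corrcoef(y[s:], y[:-s])[0, 1]   # seasonal lag: near beta
```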

Question 1

The following sample autocorrelation estimates are obtained using 300 data points:

Lag 1 2 3
Coefficient 0.25 −0.1 −0.05

Compute the value of the Box-Pierce Q-statistic.

A. 22.5

B. 22.74

C. 30

D. 30.1

The correct answer is A.

QBP = T ∑_{h=1}^{m} ρ̂²(h) = 300(0.25² + (−0.1)² + (−0.05)²) = 22.5

Question 2

The following sample autocorrelation estimates are obtained using 300 data points:

Lag 1 2 3
Coefficient 0.25 −0.1 −0.05

Compute the value of the Ljung-Box Q-statistic.

A. 30.1

B. 30

C. 22.5

D. 22.74

The correct answer is D.

QLB = T(T + 2) ∑_{h=1}^{m} ρ̂²(h) / (T − h) = 300(302)(0.25²/299 + (−0.1)²/298 + (−0.05)²/297) = 22.74

Note: Provided the sample size is large, the Box-Pierce and the Ljung-Box tests

typically arrive at the same result.

Question 3

Assume the shock in a time series is approximated by Gaussian white noise.

Yesterday's realization, y(t−1), was 0.015, and the lagged shock was -0.160. Today's

shock is 0.170.

If the weight parameter theta, θ, is equal to 0.70 and the mean of the process is 0.5,

determine today's realization under a first-order moving average, MA(1), process.

A. -4.205

B. 4.545

C. 0.558

D. 0.282

The correct answer is C.

Today’s shock = ϵ t ; yesterday’s shock = ϵ t−1; today’s realization = y t ; yesterday’s

realization = yt−1 .

The MA(1) is given by:

yt = μ + θϵt−1 + ϵt = 0.5 + 0.7(−0.160) + 0.170 = 0.558

Reading 22: Nonstationary Time Series

After completing this reading, you should be able to:

Describe linear and nonlinear time trends.

Explain how to use regression analysis to model seasonality.

Describe a random walk and a unit root.

Explain the challenges of modeling time series containing unit-roots.

Describe how to test if a time series contains a unit root.

Explain how to construct an h-step-ahead point forecast for a time series with

seasonality.

Calculate the estimated trend value and form an interval forecast for a time series.

Recall that stationary time series have means, variances, and autocovariances that are independent of time. Therefore, any time series that violates this rule is termed a non-stationary time series.

Nonstationary time series include time trends, random walks (also called unit roots), and seasonalities. Time trends reflect the tendency of a time series to grow over time. Seasonalities occur due to changes in the time series over different seasons, such as each quarter. Seasonality can shift the mean (for example, depending on the period of the year) or create a cycle in the mean of the time series (this occurs when the current value's shock depends on the shock of the same period in a previous cycle). Seasonality can be modeled using dummy variables or by modeling period-over-period changes (such as year-over-year) in an attempt to remove the seasonal change in the mean.

In a random walk, each value of the time series depends on the previous value and its shock. We discuss each of these non-stationarities below.

Time Trends.

The time trend deterministically shifts the mean of the time series. The time trend can be linear or non-linear (the latter includes log-linear and quadratic trends).

Linear Time Trends

Linear trend models are those in which the dependent variable changes at a constant rate with time. If the time series Yt has a linear trend, we can model the series by the following equation:

Yt = β0 + β1 t + ϵt , t = 1 , 2, … , T

Where

Y t =the value of the time series at time t (trend value at time t)

β0 =the y-intercept term

β1 =the slope coefficient

t=time, the independent (explanatory) variable

ϵ t= a random error term (Shock) and is white noise (ϵt ∼ WN(0, σ 2))

From the equation above, β0 + β1t predicts Yt at any time t. β1 is described as the trend coefficient since it is the slope coefficient. We estimate both parameters β0 and β1 using ordinary least squares; the estimates are denoted β̂0 and β̂1, respectively.

The mean of the linear time series is:

E(Yt) = β0 + β1 t

On a graph, a linear trend appears as a straight line angled diagonally up or down.

Estimation of the Trend Value Under Linear Trend Models

Using the estimated coefficients, we can predict the value of the dependent variable at any time (t = 1, 2, …, T). For instance, the trend value at time 2 is Ŷ2 = β̂0 + β̂1(2). We can also forecast the value of the time series outside the sample period, that is, at T+1. Therefore, the predicted value of Yt at time T+1 is ŶT+1 = β̂0 + β̂1(T + 1).

Example: Calculating the Trend Value

A linear trend is defined to be Y t = 17.5 + 0.65t. What is the trend projection for time 10?

Solution

We substitute t = 10:

Ŷ10 = 17.5 + 0.65 × 10 = 24
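A linear trend fit can be sketched with NumPy's polyfit. The data below are synthetic, generated from assumed true coefficients β0 = 17.5 and β1 = 0.65 plus noise; OLS recovers the coefficients, and the fitted line gives the trend projection:

```python
import numpy as np

rng = np.random.default_rng(2024)
t = np.arange(1, 51)
y = 17.5 + 0.65 * t + rng.normal(0.0, 0.5, size=t.size)   # assumed trend + noise

b1, b0 = np.polyfit(t, y, deg=1)   # polyfit returns the slope first
trend_10 = b0 + b1 * 10            # estimated trend value at t = 10, near 24
```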

Disadvantages of Linear Time Series

In a linear time series, growth is constant in absolute terms, which might pose problems in economic and financial time series:

1. When the trend is positive, the proportional growth rate is expected to decrease over time as the level of the series rises.

2. If the slope coefficient is less than 0, Yt will tend toward negative values, a situation that would not be plausible in most financial time series, e.g., asset prices and quantities.

Considering these limitations, we discuss the log-linear time series, which grows at a constant rate rather than by a constant amount.

Log-Linear Trend Models

Sometimes linear trend models result in serially correlated errors, for instance when fitted to time series with exponential growth rates. The appropriate model for a time series with exponential growth is the log-linear trend model.

Log-linear trends are those in which the variable changes at a constant proportional rate, so the absolute change grows or shrinks over time rather than remaining constant as in linear trends.

Assume that the time series is defined as:

Y t = eβ0+β1t, t = 1,2, … , T

Which also can be written as (by taking the natural logarithms on both sides):

ln Y t = β0 + β1 t, t = 1, 2,… , T

By Exponential rate, we mean growth at a constant rate with continuous compounding. This can

be seen as follows: Using the time series formula above, the values of the time series at times 1 and 2 are Y1 = e^{β0+β1(1)} and Y2 = e^{β0+β1(2)}. The ratio Y2/Y1 is given by:
is given by:

Y2 / Y1 = e^{β0+β1(2)} / e^{β0+β1(1)} = e^{β1}

Similarly, the value of the time-series at time t is Y t = eβ0+β1t , and at t+1, we have

Y t+1 = eβ0+β1(t+1) . This implies that the ratio:

Yt+1 / Yt = e^{β0+β1(t+1)} / e^{β0+β1t} = e^{β1}

If we take the natural logarithm on both sides of the above equation we have:

ln(Yt+1 / Yt) = ln Yt+1 − ln Yt = β1

The log-linear model implies that:

E(ln Y t+1 − lnY t) = β1

From the above results, proportional growth in time series over the two consecutive periods is

equal. That is:

(Yt+1 − Yt) / Yt = Yt+1/Yt − 1 = e^{β1} − 1

Example: Calculating the Trend Value of a Log-Linear Trend Time Series

An investment analyst wants to fit the weekly sales (in millions) of his company by using the

sales data from Jan 2016 to Feb 2018. The regression equation is defined as:

lnY t = 5.1062 + 0.0443t, t = 1 , 2, … , 100

What is the trend estimated value of the sales in the 80th week?

Solution

From the regression equation, β̂0 = 5.1062 and β̂1 = 0.0443. We know that, under log-linear

trend models, the predicted trend value is given by:

Ŷt = e^(β̂0 + β̂1 t)

⇒ Ŷ80 = e^(5.1062+0.0443×80) = 5711.29 million

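As a quick sanity check, the trend forecast above can be reproduced in a few lines of Python (the function name is ours; the coefficients are the fitted values from the example):

```python
import math

# Log-linear trend forecast: Y_t = exp(b0 + b1 * t).
# b0 = 5.1062 and b1 = 0.0443 are the fitted values from the example above.
def log_linear_forecast(b0: float, b1: float, t: int) -> float:
    """Trend value implied by the fitted model ln(Y_t) = b0 + b1 * t."""
    return math.exp(b0 + b1 * t)

y80 = log_linear_forecast(5.1062, 0.0443, 80)
print(round(y80, 2))  # roughly 5711 (millions), matching the worked example
```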
Quadratic Time Trend

A polynomial-time trend can be defined as:

Yt = β0 + β1 t + β2 t² + ⋯ + βm t^m + ϵt, t = 1, 2, …, T

Practically speaking, polynomial time trends are usually limited to the linear (discussed above) and the quadratic (second-degree) time trend. In a quadratic time trend, the parameters can be estimated using OLS. The estimated parameters are asymptotically normally distributed, and hence statistical inference using t-statistics and standard errors is valid only if the residuals ϵt are white noise.

The Log-Quadratic Time Trend

As the name suggests, this time trend is a mixture of the log-linear and quadratic time series. It

is given by:

ln Y t = β0 + β1 t + β2 t2

It can be shown that the growth rate of the log-quadratic time trend is β1 + 2β2 t. This can be

seen as follows:

The value of the time series at time t is Yt = e^(β0+β1 t+β2 t²), and at t+1, we have
Yt+1 = e^(β0+β1(t+1)+β2(t+1)²). This implies that the ratio:

Yt+1/Yt = e^(β0+β1(t+1)+β2(t+1)²) / e^(β0+β1 t+β2 t²) = e^(β1+β2(2t+1)) ≈ e^(β1+2β2 t)

If we take a natural log on the results, we get the desired result.

Example: Calculating the Growth Rate of Log-Quadratic Time Trend

The monthly real GDP of a country over 20 years can be modeled by the time series equation

given by:

ln(RGDPt) = 6.75 + 0.015t + 0.0000564t²

What is the growth rate of the real GDP of this country at the end of 20 years?

Solution

This is the log-quadratic time trend whose growth rate is given by

β1 + 2β2 t

From the given time-series equation, we have β̂1 = 0.015 and β̂2 = 0.0000564, so that

the growth rate is given by:

β1 + 2β2 t = 0.015 + 2 × 0.0000564 × 240 = 0.0421

Note that, since the data is monthly, the end of 20 years corresponds to the 240th month.

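The growth-rate calculation can be sketched in Python; the coefficient values are those given in the example:

```python
# Growth rate of a log-quadratic trend at time t: g(t) = b1 + 2 * b2 * t.
def log_quadratic_growth(b1: float, b2: float, t: int) -> float:
    return b1 + 2 * b2 * t

# b1 = 0.015, b2 = 0.0000564, evaluated at the 240th month as in the example.
g = log_quadratic_growth(0.015, 0.0000564, 240)
print(round(g, 4))  # 0.0421
```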
The coefficient of determination (R²) for a time-trend series is always high and tends to 100% as the sample size increases. Therefore, R² is not an appropriate measure of fit for trend series. Other alternatives, such as residual diagnostics, can be useful.

Seasonality

Seasonality is a feature of a time series in which the data undergoes regular and predictable

changes that recur every calendar year. For instance, gas consumption in the US rises during the

winter and falls during the summer.

Seasonal effects are observed within a calendar year, e.g., spikes in sales over Christmas, while cyclical effects span time periods shorter or longer than one calendar year, e.g., spikes in sales due to low unemployment rates.

Modeling Seasonal Time Series

Regression on seasonal dummies is an essential method of modeling seasonality. Assume that there are s seasons in a year. Then the pure seasonal dummy model is:

Yt = β0 + γ1 D1t + γ2 D2t + ⋯ + γs−1 Ds−1,t + ϵt


= β0 + ∑(j=1 to s−1) γj Djt + ϵt

Djt is defined as:

Djt = 1 if t mod s = j, and 0 otherwise.

γj measures the difference between the mean in period j and the mean in period s.

Note that X mod Y is the remainder of X/Y. For instance, 9 mod 4 = 1.

The mean of the first period of the seasonality is:

E[Y1 ] = β0 + γ1

And the mean of period 2 is:

E[Y2 ] = β0 + γ2

In period s, all dummy variables are zero, so the mean of the seasonality at time s is:

E[Y s ] = β0

The parameters of the seasonal model are estimated by OLS, regressing Yt on a constant and s−1 dummy variables.

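The estimation described above can be sketched with NumPy. We build a noiseless quarterly series from known parameters (the values of β0 and the γj below are illustrative, not from the text) and recover them by OLS:

```python
import numpy as np

# Seasonal dummy regression: y_t = b0 + gamma_j on s-1 dummies, where period
# s (t mod 4 == 0) is the omitted base period. True parameters are
# illustrative; with no noise, OLS recovers them exactly.
s, T = 4, 40
b0_true = 2.0
gam_true = {1: 0.5, 2: -0.3, 3: 1.2}

t = np.arange(1, T + 1)
X = np.column_stack([np.ones(T)] + [(t % s == j).astype(float) for j in (1, 2, 3)])
y = X @ np.array([b0_true, gam_true[1], gam_true[2], gam_true[3]])

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # [2.0, 0.5, -0.3, 1.2] recovered
```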
Combination of Stationary and Non-Stationary Time Series

Time trends and seasonalities can be insufficient to explain economic time series, since their residuals might not be white noise. When the detrended series appears stationary but the residuals are not white noise, we can add stationary time series components (such as AR and MA terms) to capture the remaining dynamics of the series.

Consider the following linear time trend.

Y t = β0 + β1 t + ϵ t

If the residuals are not white noise but the time series appears to be stationary, we can include

an AR term to make the model’s residuals white noise:

Y t = β0 + β1 t + δ1 Y t−1 + ϵ t

We can also add the seasonal component (if it exists):

Yt = β0 + β1 t + ∑(j=1 to s−1) γj Djt + δ1 Yt−1 + ϵt

Note that the AR component reflects the cyclicality of the time series, and γj measures the shift of the mean from the trend growth, i.e., β1 t. However, such combinations do not always lead to a model with the required dynamics. For instance, the Ljung-Box statistic may still suggest rejection of the null hypothesis of white-noise residuals.

Unit Roots and Random Walks

A random walk is a time series in which the value of the series in one period is equivalent to the

value of the series in the previous period plus the unforeseeable random error. A random walk

can be defined as follows:

Let

Y t = Y t−1 + ϵ t

Intuitively,

Yt−1 = Y t−2 + ϵ t−1

If we substitute Y t−1 in the first equation, we get,

Y t = (Yt−2 + ϵ t−1 ) + ϵt

Continuing this process, it implies that a random walk is given by:

Yt = Y0 + ∑(i=1 to t) ϵi

The random walk equation is a particular case of an AR(1) model with β0 = 0 and β1 = 1. Thus, we cannot utilize standard regression techniques to estimate such an AR(1), because a random walk does not have a finite mean-reverting level or a finite long-run variance. Recall that if Yt has a mean-reverting level, then Yt = β0 + β1 Yt, and the level is β0/(1 − β1). However, in a random walk, β0 = 0 and β1 = 1, so β0/(1 − β1) = 0/0, which is undefined.

The variance of a random walk is given by:

V(Yt) = tσ 2

Because the variance of a random walk grows without bound as t increases, we are unable to use standard regression analysis on a time series that appears to be a random walk.

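A short simulation illustrates V(Yt) = tσ²; the path count, horizon, and seed below are arbitrary choices:

```python
import numpy as np

# Cross-sectional variance of simulated random-walk paths grows linearly in t,
# consistent with V(Y_t) = t * sigma^2.
rng = np.random.default_rng(42)
n_paths, T, sigma = 20_000, 200, 1.0
paths = rng.normal(0.0, sigma, size=(n_paths, T)).cumsum(axis=1)  # Y_0 = 0

var_50 = paths[:, 49].var()     # ~ 50  (Var of Y_50)
var_200 = paths[:, 199].var()   # ~ 200 (Var of Y_200)
print(var_50, var_200, var_200 / var_50)  # ratio close to 4
```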
Unit Roots

So far, we have been discussing random walks without a drift, for which the current value is the best predictor of the next period's value.

A random walk with a drift is a time series that increases or decreases by a constant expected amount in each period, in addition to the random shock. It is mathematically described as:

Yt = β0 + β1 Yt−1 + ϵt

β0 ≠ 0, β1 = 1

Or

Yt = β0 + Yt−1 + ϵt

Where ϵt ∼ WN(0, σ²)

Recall that β1 = 1 implies an undefined mean-reversion level and hence non-stationarity. Therefore, we are unable to use the AR model to analyze such a time series unless we transform it by taking the first difference:

ΔYt = Yt − Yt−1 = β0 + ϵt, β0 ≠ 0,

which is covariance stationary.

The unit root test applies random walk concepts to determine whether a time series is non-stationary by focusing on the slope coefficient in the AR(1) form of a random walk with a drift. This test is popularly known as the Dickey-Fuller test.

The Unit Root Problem

Consider an AR(1) model. If the time series originates from an AR(1) model, then it is covariance stationary if the absolute value of the lag coefficient β1 is less than 1, that is, |β1| < 1. Therefore, we cannot depend on standard statistical results if the lag coefficient is greater than or equal to 1 in absolute value (|β1| ≥ 1).

When the lag coefficient is exactly equal to 1, the time series is said to have a unit root. In other words, the time series is a random walk and hence not covariance stationary.

The unit root problem can also be expressed using the lag polynomial. Let

ψ(L) be the full lag polynomial, which can be factorized into the unit root lag denoted by (1-L)

and the remainder lag polynomial ϕ(L) which is the characteristic lag for stationary time series.

Moreover, let θ(L)ϵt be an MA component. Thus, the unit root process can be described as:

ψ(L)Yt = θ(L)ϵt

This can be factorized into:

(1 − L)ϕ(L)Yt = θ(L)ϵt

Example: Checking for Unit Roots using the Lag Polynomials

An AR(2) model is given by Y t = 1.7Y t−1 − 0.7Y t−2 + ϵ t . Does the process contain a unit root?

Solution

If we rearrange the equation:

Yt − 1.7Y t−1 + 0.7Y t−2 = ϵ t

Using the definition of a lag polynomial, we can write the above equation as:

(1 − 1.7L + 0.7L²)Yt = ϵt

The left-hand side contains a quadratic lag polynomial, which can be factorized. So,

(1 − L)(1 − 0.7L)Y t = ϵ t

Therefore, the process has a unit root due to the presence of a unit root lag operator (1-L).

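The same check can be done numerically by finding the roots of the lag polynomial 1 − 1.7L + 0.7L² with NumPy (a root exactly at 1 signals a unit root):

```python
import numpy as np

# Roots of the lag polynomial 1 - 1.7L + 0.7L^2. np.roots takes coefficients
# from the highest power down, i.e., 0.7L^2 - 1.7L + 1. A root at exactly 1
# is a unit root; the other root, 1/0.7 ~ 1.43, comes from the stationary
# factor (1 - 0.7L).
roots = np.roots([0.7, -1.7, 1.0])
has_unit_root = any(abs(r - 1.0) < 1e-8 for r in roots)
print(roots, has_unit_root)  # one root is 1.0 -> True
```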
Challenges of Modeling Time Series Containing Unit Roots

1. A unit root process does not have a mean-reverting level. Recall that a stationary time series does mean-revert; that is, its long-run mean can be estimated.

2. In a time series with a unit root, spurious relationships are a problem. A spurious correlation arises when there is no meaningful link between the time series, yet regression analysis produces significant parameter estimates.

3. The parameter estimators in ARMA time series with a unit root follow the Dickey-Fuller (DF) distribution, which is asymmetric, depends on the sample size, and has critical values that depend on whether time trends have been incorporated. These characteristics make it difficult to draw sound statistical inference and perform model selection when fitting the models.

Transformation of Time Series with Unit Roots

If the time series appears to have unit roots, the best method is to model the first-differenced series as an autoregressive time series, which can be effectively analyzed using regression analysis.

Recall that the random walk with a drift is a form of AR(1) model given by:

Yt = β0 + Yt−1 + ϵt,

Where ϵt ∼ WN(0, σ²).

Clearly, β1 = 1 implies that the time series has an undefined mean-reversion level and is hence non-stationary. Therefore, we are unable to use the AR model to analyze the time series unless we transform it by taking the first difference to get:

ΔYt = Yt − Yt−1 = β0 + ϵt, β0 ≠ 0,

Where ϵt ∼ WN(0, σ²), so the differenced series is covariance stationary.

Using lag polynomials, let ΔYt = Yt − Yt−1, where Yt has a unit root (implying that ΔYt does not have a unit root). Then:

(1 − L)ϕ(L)Y t = ϵt
ϕ(L)[(1 − L)Y t] = ϵt
ϕ(L)[(Y t − LY t)] = ϵt
ϕ(L)ΔY t = ϵt

Since the lag polynomial ϕ(L) is stationary series lag polynomial, the time series defined by ΔYt

must be stationary.

Unit Root Test

The unit root test is done using the Augmented Dickey-Fuller (ADF) test. The test involves OLS estimation of a regression in which the difference of the time series is regressed on the lagged level, appropriate deterministic terms, and lagged differences.

The ADF regression is given by:

ΔYt = γYt−1 + (δ0 + δ1 t) + (λ1 ΔYt−1 + λ2 ΔYt−2 + ⋯ + λp ΔYt−p) + ϵt

Where:

γYt−1 = lagged level;

δ0 + δ1 t = deterministic terms;

λ1 ΔYt−1 + λ2 ΔYt−2 + ⋯ + λp ΔYt−p = lagged differences.

The test statistic for the ADF test is the t-statistic of γ̂ (the estimate of γ).

To get the gist of this, assume that we are conducting an ADF test on a time series with the lagged level only:

ΔYt = γYt−1 + ϵt

Intuitively, if the time series is a random walk, then:

Y t = Y t−1 + ϵ t

If we subtract Yt−1 on both sides we get:

Y t − Yt−1 = Y t−1 − Y t−1 + ϵ t


⇒ ΔY t = 0 × Y t−1 + ϵ t

Therefore, it implies that the time series is a random walk if γ=0. This leads us to the hypothesis

statement of the ADF test:

H 0 : γ = 0 (The time series is a random walk)

H1 : γ < 0 (the time series is covariance stationary)

You should note this is a one-sided test, and thus the null hypothesis is not rejected if γ̂ ≥ 0; a negative γ is what corresponds to a stationary AR time series. For example, recall that the AR(1) model is given by:

Y t = β0 + β1 Y t−1 + ϵ t

If we subtract Yt−1 from both sides of the AR(1) above we have:

Y t − Yt−1 = β0 + (β1 − 1)Y t−1 + ϵ t

Now let γ = (β1 − 1). Therefore,

ΔYt = β0 + γY t−1 + ϵ t

Clearly, if β1 = 1, then γ = 0. Therefore, testing γ = 0 is equivalent to testing β1 = 1. In other words, if there is a unit root in an AR(1) model (with the dependent variable being the first difference of the series and the independent variable its first lag), then γ = 0, implying that the series has a unit root and is non-stationary.

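A stripped-down version of this regression can be run on simulated data to see how γ̂ behaves; this is an illustrative sketch only (no lagged differences, and a proper test would compare the t-statistic of γ̂ to Dickey-Fuller critical values):

```python
import numpy as np

# Simplified Dickey-Fuller regression (constant, no lagged differences):
# regress dY_t on [1, Y_{t-1}] and inspect gamma = beta1 - 1. For a pure
# random walk gamma should be near 0; for a stationary AR(1) with beta1 = 0.5
# it should be near -0.5.
rng = np.random.default_rng(7)

def df_gamma(y: np.ndarray) -> float:
    dy, ylag = np.diff(y), y[:-1]
    X = np.column_stack([np.ones(len(ylag)), ylag])
    coef, *_ = np.linalg.lstsq(X, dy, rcond=None)
    return coef[1]

T = 1_000
random_walk = rng.normal(size=T).cumsum()
ar1 = np.zeros(T)
for t in range(1, T):                 # y_t = 0.5 * y_{t-1} + e_t
    ar1[t] = 0.5 * ar1[t - 1] + rng.normal()

print(df_gamma(random_walk))  # close to 0
print(df_gamma(ar1))          # close to -0.5
```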
Implementing an ADF test on a time series requires making two choices: which deterministic terms to include, and the number of lags of the differenced data to use. The number of lags is simple to determine: it should be large enough to absorb any short-run dynamics in the difference ΔYt.

The appropriate method of selecting the lagged differences is the AIC (which selects a relatively larger model as compared to the BIC). The lag length should be set depending on the length of the time series and the sampling frequency.

The Dickey-Fuller distributions depend on which deterministic terms are included. The deterministic terms can be excluded entirely, or a constant, or a constant plus trend can be used. All else equal, adding more deterministic terms reduces the chance of rejecting the null hypothesis when the time series does not have a unit root, and hence reduces the power of the ADF test. Therefore, only relevant deterministic terms should be included.

The recommended method of choosing appropriate deterministic terms is to include the deterministic terms that are significant at the 10% level. If the deterministic trend term is not significant at 10%, it is dropped and the constant deterministic term is used instead. If the constant is also insignificant, it too can be dropped and the test rerun without deterministic terms. It is important to note that the majority of macroeconomic time series require the use of the constant.

If the null of the ADF test cannot be rejected, the series should be differenced and the test rerun to make sure the differenced series is stationary. If the series is still non-stationary after this is repeated (double differencing), then other transformations of the data, such as taking the natural log (if the time series is always positive), might be required.

Example: Conducting the ADF Test

A financial analyst wishes to conduct an ADF test on the log of 20-year real GDP from 1999 to

2019. The result of the tests is shown below:

Deterministic       γ          δ0         δ1      Lags   5% CV    1% CV
None             −0.004                             8    −1.940   −2.570
                (−1.665)
Constant         −0.008      0.010                  4    −2.860   −3.445
                (−1.422)    (1.025)
Trend            −0.084                 0.188       3    −3.420   −3.984
                (−4.376)              (−4.110)

The ADF output reports results for the different deterministic specifications (first column), and the last three columns indicate the number of lags selected by the AIC and the 5% and 1% critical values appropriate to the underlying sample size and deterministic terms. The quantities in parentheses (below the parameter estimates) are the test statistics.

Determine whether the time series contains a unit root.

Solution

The hypothesis statement of the ADF test is:

H 0 : γ = 0 (The time series is a random walk)

H 1 : γ < 0 (the time series is a covariance stationary )

We begin by choosing the appropriate model. The trend term is significant (its test statistic, −4.110, is large in absolute value), so we choose the model with the trend deterministic term.

For this model, the null hypothesis is rejected at the 1% significance level since |−4.376| > |−3.984|. Note that the null hypothesis is also rejected at the 5% level.

Moreover, had the constant-only or no-deterministic model been used, the null hypothesis would not have been rejected. This reiterates the importance of choosing an appropriate model.

The Seasonal Differencing

Seasonal differencing is an alternative method of modeling a seasonal time series with a unit root. It is done by subtracting the value in the same period of the previous year, which removes the deterministic seasonalities, the unit root, and the time trends.

Consider the following quarterly time series with deterministic seasonalities and non-zero

growth rate:

Yt = β0 + β1 t + γ1 D1t + γ2 D2t + γ3 D3t + ϵ t

Where ϵ t ∼ WN(0,σ 2 ).

Denote the seasonal difference by Δ4 Yt = Yt − Yt−4. Then:

Δ4 Yt = (β0 + β1 t + γ1 D1t + γ2 D2t + γ3 D3t + ϵt) − (β0 + β1(t − 4) + γ1 D1,t−4 + γ2 D2,t−4 + γ3 D3,t−4 + ϵt−4)

= β1(t − (t − 4)) + [γ1(D1t − D1,t−4) + γ2(D2t − D2,t−4) + γ3(D3t − D3,t−4)] + ϵt − ϵt−4

But

γj(Djt − Dj,t−4) = 0,

because Djt = Dj,t−4 by the definition of the seasonal dummies. So:

Δ4 Yt = β1(t − (t − 4)) + ϵt − ϵt−4

Therefore,

Δ4 Yt = 4β1 + ϵt − ϵt−4

Intuitively, this is a seasonal MA(1) model (an MA(1) at the seasonal lag), which is covariance stationary. The seasonally differenced time series is described as the year-over-year change in Yt, or the year-over-year growth in the case of a logged time series.

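The algebra above can be verified numerically: on a noiseless quarterly series with a linear trend and deterministic seasonal dummies (illustrative parameter values), the lag-4 difference collapses to the constant 4β1:

```python
import numpy as np

# Noiseless quarterly series: y_t = b0 + b1*t + seasonal effect gam[t % 4].
# b0, b1, and the seasonal effects are illustrative values.
b0, b1 = 10.0, 0.25
gam = np.array([0.0, 1.5, -0.8, 2.1])
t = np.arange(1, 41)
y = b0 + b1 * t + gam[t % 4]

# Seasonal (lag-4) difference: intercept and dummies cancel, leaving 4*b1.
d4 = y[4:] - y[:-4]
print(d4[:5])  # every entry equals 4 * b1 = 1.0
```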
Spurious Regression

Spurious regression is a type of regression that gives misleading statistical evidence of a linear relationship between unrelated non-stationary variables. This is a problem in time series analysis, but it can be avoided by making sure each of the time series in question is stationary, using methods such as first differencing and log transformation (when the time series is positive).

Condition for Differencing in Time Series

Practically, many financial and economic time series are highly persistent but plausibly stationary. Therefore, differencing is only required when there is clear evidence of a unit root in the time series. Moreover, when it is difficult to tell whether a time series is stationary or not, it is good statistical practice to build models both in levels and in differences.

For example, suppose we wish to model the interest rate on government bonds using an AR(3) model. The AR(3) is estimated on the levels, while the differences (if we assume a unit root exists) are modeled by an AR(2), since the AR order is reduced by one due to differencing. Considering models in both levels and differences allows us to choose the best model when the time series is highly persistent.

Forecasting

Forecasting in non-stationary time series is analogous to that in stationary time series. That is, the forecast made at time T for h steps ahead is the conditional expectation of YT+h.

Consider a linear time trend:

YT = β0 + β1 T + ϵT

Intuitively,

YT+h = β0 + β1(T + h) + ϵT+h

Taking the expectation, we get:

ET(YT+h) = ET(β0) + ET(β1(T + h)) + ET(ϵT+h)


⇒ ET (Y T+h ) = β0 + β1 (T + h)

This is true because both β0 and β1(T + h) are constants, while ϵT+h ∼ WN(0, σ²).

Forecasting in Seasonal Time Series

Recall that the seasonal time series can be modeled using the dummy variables. Consequently,

we need to track the period of the forecast we desire. The seasonal dummy model is given by:

Yt = β0 + ∑(j=1 to s−1) γj Djt + ϵt

The first-step forecast is:

ET (Y T+1 ) = β0 + γj

Where:

j = (T + 1) mod s is the period being forecast, and the coefficient on the omitted period is 0.

For instance, for a quarterly seasonal time series that excludes the dummy variable for the fourth quarter (Q4), the one-step forecast made at T = 116 is given by:

ET(YT+1) = β0 + γ(116+1) mod 4 = β0 + γ1

Therefore, the h-step-ahead forecast is obtained by tracking the period of T + h, so that:

ET(YT+h) = β0 + γj

Where:

j = (T + h)mod s

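The period-tracking logic can be sketched as a small helper function; β0 and the γj values are hypothetical:

```python
# h-step-ahead forecast of a pure seasonal dummy model: compute j = (T + h) mod s
# and add the matching coefficient (0 for the omitted base period, j = 0 here).
def seasonal_forecast(b0: float, gamma: dict, T: int, h: int, s: int = 4) -> float:
    j = (T + h) % s
    return b0 + gamma.get(j, 0.0)   # gamma holds keys 1..s-1

gamma = {1: 0.9, 2: -0.4, 3: 1.1}
# One-step forecast made at T = 116: j = 117 mod 4 = 1, so gamma_1 is added.
print(seasonal_forecast(5.0, gamma, T=116, h=1))  # 5.9
```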
Forecasting in Log Models

Under the log model, you should note that:

E(YT+h) ≠ e^(E(ln YT+h))

If the residuals are Gaussian white noise, that is:

ϵt ∼ iid N(0, σ²),

then the properties of the log-normal distribution can be used for forecasting. If X ∼ N(μ, σ²), then W = e^X is log-normally distributed with parameters μ and σ². Recall that the mean of a log-normal distribution is given by:

E(W) = e^(μ + σ²/2)

Using this analogy, for a log-linear time trend model:

ln YT+h = β0 + β1(T + h) + ϵT+h

the forecast of the log series at time T + h is:

ET(ln YT+h) = β0 + β1(T + h)

The variance of the shock is σ², so that:

ln YT+h ∼ N(β0 + β1(T + h), σ²)

Thus,

ET(YT+h) = e^(β0 + β1(T + h) + σ²/2)

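The σ²/2 adjustment can be checked with a quick Monte Carlo experiment (μ, σ, and the seed below are illustrative):

```python
import numpy as np

# Mean of a log-normal: E[e^X] = exp(mu + sigma^2/2) when X ~ N(mu, sigma^2).
rng = np.random.default_rng(0)
mu, sigma = 0.1, 0.4

analytic = np.exp(mu + sigma**2 / 2)   # the corrected forecast, ~1.1972
naive = np.exp(mu)                     # exp of the mean log value, ~1.1052
simulated = np.exp(rng.normal(mu, sigma, size=500_000)).mean()
print(analytic, naive, simulated)      # simulated mean sits near the analytic value
```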
Forecasting Confidence Intervals

Confidence intervals are constructed to reflect the uncertainty of the forecasted value. The confidence interval depends on the variance of the forecast error, which is defined as:

ϵT+h = YT+h − ET(YT+h)

i.e., the difference between the realized value and the forecasted value.

Consider the linear time trend model:

Y T+h = β0 + β1 (T + h) + ϵT+h

Clearly,

ET (Y T+h ) = β0 + β1 (T + h)

And the forecast error is ϵT+h

If we wish to construct a 95% confidence interval, given that the forecast error is Gaussian white

noise, then the confidence interval is given by:

ET (Y T+h ) ± 1.96σ

σ is not known and is therefore estimated by the standard deviation of the forecast errors.

Intuitively, the confidence interval for any model can be computed from the corresponding forecast error ϵT+h = YT+h − ET(YT+h).

Example: Forecasting and Forecasting Confidence Intervals

A linear time trend model is estimated on annual government bond interest rates from the year 2000 to 2020. The model's equation is given by:

Rt = 0.25 + 0.0000154t + ϵ̂t

The standard deviation of the forecast error is estimated to be σ̂ = 0.0245. What is the 95% confidence interval for the forecast in the second year if the residuals are Gaussian white noise?

(Note that for the first time period t=2000 and the last time period is t=2020)

Solution

The second year starting from 2000 is 2002. So,

ET(R2002) = 0.25 + 0.0000154 × 2002 = 0.2808308

The 95% confidence interval is given by:

ET (Y T+h ) ± 1.96σ
= 0.28083 ± 1.96 × 0.0245
= [0.2328108, 0.3288508]

So the 95% confidence interval for the interest rate is between 23.28% and 32.89%.

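The interval construction can be sketched directly, reusing the point forecast and residual standard deviation computed above:

```python
# 95% confidence interval: point forecast +/- 1.96 * sigma_hat, assuming
# Gaussian white-noise forecast errors. Values are from the example above.
point, sigma_hat, z = 0.2808308, 0.0245, 1.96
lo, hi = point - z * sigma_hat, point + z * sigma_hat
print(lo, hi)  # about (0.2328, 0.3289), i.e., 23.28% to 32.89%
```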
Question 1

The seasonal dummy model is generated on the quarterly growth rates of mortgages.

The model is given by:

Yt = β0 + ∑(j=1 to s−1) γj Djt + ϵt

The estimated parameters are γ^1 = 6.25, γ^2 = 50.52, γ^3 = 10.25 and β^ 0 = −10.42

using the data up to the end of 2019. What is the forecasted value of the growth rate

of the mortgages in the second quarter of 2020?

A. 40.10

B. 34.56

C. 43.56

D. 36.90

The correct answer is A.

We need to set the dummy variables for the second quarter:

D2t = 1 for Q2, and D1t = D3t = 0 (Q1, Q3, and Q4 dummies are zero).

So,

E(ŶQ2) = β0 + ∑(j=1 to 3) γj Djt = −10.42 + 0 × 6.25 + 1 × 50.52 + 0 × 10.25 = 40.1

Question 2

A mortgage analyst produced a model to predict housing starts (given in thousands)

within California in the US. The time series model contains both a trend and a

seasonal component and is given by the following:

Yt = 0.2t + 15.5 + 4.0 × D2t + 6.4 × D3t + 0.5 × D4t

The trend component is reflected in the time variable (t), measured in months, and the seasons are defined as follows:

Season Months Dummy


Winter December, January, and February
Spring March, April, and May D2t
Summer June, July, and August D3t
Fall September, October, and November D4t

The model started in April 2019; for example, y(T+1) refers to May 2019.

What does the model predict for March 2020?

A. 21,700 housing starts

B. 22,500 housing starts

C. 24,300 housing starts

D. 20,225 housing starts

The correct answer is A.

The model is given as:

Yt = 0.2t + 15.5 + 4.0 × D2t + 6.4 × D3t + 0.5 × D4t

Important: Since we have three dummies and an intercept, quarterly seasonality is reflected by the intercept (15.5), which captures the omitted winter season, plus the three seasonal dummy variables (D2, D3, and D4).

If YT+1 = May 2019, then March 2020 = YT + 11

Finally, note that March falls under D2t

yT+11 = 0.20 × 11 + 15.5 + 4.0 × 1 = 21.7

Thus, the model predicts 21,700 housing starts in March 2020.

Reading 23: Measuring Return, Volatility, and Correlation

After completing this reading, you should be able to:

Calculate, distinguish, and convert between simple and continuously compounded

returns.

Define and distinguish between volatility, variance rate, and implied volatility.

Describe how the first two moments may be insufficient to describe non-normal

distributions.

Explain how the Jarque-Bera test is used to determine whether returns are normally

distributed.

Describe the power law and its use for non-normal distributions.

Define correlation and covariance and differentiate between correlation and

dependence.

Describe properties of correlations between normally distributed variables when using

a one-factor model.

Measurement of Returns

A return is a profit from an investment. Two common methods used to measure returns include:

1. Simple Returns Method

2. Continuously Compounded Returns Method.

The Simple Returns Method

Denoted by Rt, the simple return is given by:

Rt = (Pt − Pt−1) / Pt−1

Where

P t=Price of an asset at time t (current time)

P t−1 =Price of an asset at time t-1 (past time)

The time scale is arbitrary, e.g., daily, monthly, or quarterly. Under the simple returns method, the gross return over multiple periods is the product of the per-period gross returns. Mathematically:

1 + RT = ∏(t=1 to T) (1 + Rt)

⇒ RT = [∏(t=1 to T) (1 + Rt)] − 1

Example: Calculating the Simple Returns

Consider the following data.

Time Price
0 100
1 98.65
2 98.50
3 97.50
4 95.67
5 96.54

Calculate the simple return based on the data for all periods.

Solution

We need to calculate the simple return over multiple periods which is given by:

1 + RT = ∏(t=1 to T) (1 + Rt)

Consider the following table:

Time Price Rt 1 + Rt
0 100 − −
1 98.65 −0.0135 0.9865
2 98.50 −0.00152 0.998479
3 97.50 −0.01015 0.989848
4 95.67 −0.01877 0.981231
5 96.54 0.009094 1.009094
Product 0.9654

Note that

Rt = (Pt − Pt−1) / Pt−1

so that

R1 = (P1 − P0)/P0 = (98.65 − 100)/100 = −0.0135

And

R2 = (P2 − P1)/P1 = (98.50 − 98.65)/98.65 = −0.00152

And so on.

Also note that:

∏(t=1 to 5) (1 + Rt) = 0.9865 × 0.998479 × … × 1.009094 = 0.9654

So,

1 + RT = 0.9654 ⇒ R T = −0.0346 = −3.46%

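The multi-period computation can be sketched in Python; since the product of gross returns telescopes, it must equal P_T/P_0 − 1:

```python
# Multi-period simple return: product of per-period gross returns, using the
# price series from the example. The product telescopes to P_T / P_0 - 1.
prices = [100, 98.65, 98.50, 97.50, 95.67, 96.54]

gross = 1.0
for p_prev, p in zip(prices[:-1], prices[1:]):
    gross *= 1 + (p - p_prev) / p_prev    # (1 + R_t) for each period

total_simple = gross - 1
print(round(total_simple, 4))  # -0.0346, i.e., -3.46%, as in the example
```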
Continuously Compounded Returns Method

Denoted by rt, the continuously compounded return is the difference between the natural logarithms of the asset price at times t and t−1. It is given by:

rt = ln Pt − ln Pt−1

Computing the compounded returns over multiple periods is easy because it is just the sum of

returns of each period. That is:

rT = ∑(t=1 to T) rt

Example: Calculating Continuously Compounded Returns

Consider the following data.

Time Price
0 100
1 98.65
2 98.50
3 97.50
4 95.67
5 96.54

What is the continuously compounded return based on the data over all periods?

Solution

The continuously compounded return over the multiple periods is given by

rT = ∑(t=1 to T) rt

Where

rt = ln P t − ln P t−1

Consider the following table:

Time Price rt = ln P t − ln P t−1
0 100 −
1 98.65 −0.01359
2 98.50 −0.00152
3 97.50 −0.0102
4 95.67 −0.01895
5 96.54 0.009053
Sum −0.03521

Note that

r1 = ln P 1 − ln P 0 = ln 98.65 − ln 100 = −0.01359


r2 = ln P 2 − ln P 1 = ln 98.50 − ln 98.65 = −0.00152

And so on.

Also,

rT = ∑(t=1 to 5) rt = −0.01359 + (−0.00152) + ⋯ + 0.009053 = −0.03521 = −3.521%

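The same data can be used to sketch the additive property of log returns, together with the conversion back to the simple return:

```python
import math

# Multi-period continuously compounded return: sum of per-period log returns,
# which telescopes to ln(P_T) - ln(P_0). The conversion R = e^r - 1 recovers
# the simple return computed earlier.
prices = [100, 98.65, 98.50, 97.50, 95.67, 96.54]

r_total = sum(math.log(p / p_prev) for p_prev, p in zip(prices[:-1], prices[1:]))
print(round(r_total, 5))                 # -0.03521, i.e., -3.521%

R_total = math.exp(r_total) - 1
print(round(R_total, 4))                 # -0.0346, the equivalent simple return
```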
Relationship between the Compounded and Simple Returns

Intuitively, the compounded return approximates the simple return. The approximation, however, is prone to significant error over longer time horizons, and thus it is accurate only over short horizons.

The relationship between the compounded returns and the simple returns is given by the

formula:

1 + Rt = ert

Example: Conversion Between the Simple and Compound Returns

What is the equivalent simple return for a 30% continuously compounded return?

Solution.

Using the formula:

1 + R t = ert
⇒ Rt = e^(rt) − 1 = e^0.3 − 1 = 0.3499 = 34.99%

It is worth noting that the compound return is always less than the corresponding simple return. Moreover, simple returns are never less than −100%, unlike compound returns, which can be less than −100%. For instance, the compound return equivalent to a −65% simple return is:

rt = ln (1 − 0.65) = −104.98%

Measurement of Volatility and Risk

The volatility of a variable denoted as σ is the standard deviation of returns. The standard

deviation of returns measures the volatility of the return over the time period at which it is

captured.

Consider the linear scaling of the mean and variance over the period at which the returns are

measured. The model is given by:

rt = μ + σet

Where E(rt) = μ is the mean of the return and V(rt) = σ² is its variance. et is the shock, which is assumed to be iid with mean 0 and variance 1. Moreover, the return is assumed to be iid and normally distributed with mean μ and variance σ², i.e., rt ∼ iid N(μ, σ²). Note the shock can also be expressed as ϵt = σet, where ϵt ∼ N(0, σ²).

Assume that we wish to calculate the return under this model over 10 working days (two weeks). Since the model deals with compound returns, we have:

∑(i=1 to 10) rt+i = ∑(i=1 to 10) (μ + σ et+i) = 10μ + σ ∑(i=1 to 10) et+i

so that the mean of the return over the 10 days is 10μ and its variance is 10σ², since et is iid. The volatility of the return is, therefore:

√10 × σ

Therefore, the mean and variance of returns scale linearly with the holding period, while the volatility scales with the square root of the holding period. This feature allows us to convert volatility between different horizons.

For instance, given daily volatility, we obtain the yearly (annualized) volatility by scaling it by √252. That is:

σannual = √252 × σdaily

Note that 252 is the conventional number of trading days in a year in most markets.

Example: Calculating the Annualized Volatility

The monthly volatility of the price of gold is 4% in a given year. What is the annualized volatility

of the gold price?

Solution

Using the scaling analogy, the corresponding annualized volatility is given by:

σannual = √12 × 0.04 = 13.86%

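The square-root-of-time scaling can be sketched as follows (the 1.2% daily volatility in the second conversion is a hypothetical value):

```python
import math

# Volatility scales with the square root of the holding period
# (the variance rate scales linearly).
monthly_vol = 0.04
annual_vol = math.sqrt(12) * monthly_vol
print(round(annual_vol, 4))  # 0.1386, i.e., 13.86%, matching the example

# Hypothetical daily volatility of 1.2%, annualized with 252 trading days:
daily_vol = 0.012
annual_from_daily = math.sqrt(252) * daily_vol
print(round(annual_from_daily, 4))
```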
Variance Rate

The variance rate, also termed the variance, is the square of volatility. Like the mean, the variance rate is linear in the holding period and hence can be converted between periods. For instance, the annual variance rate is obtained from the monthly variance rate as:

σ²annual = 12 × σ²monthly

The variance of returns can be estimated as:

σ̂² = (1/T) ∑(t=1 to T) (rt − μ̂)²

Where μ̂ is the sample mean of returns, and T is the sample size.

Example: Calculating the Variance of Return

The investment returns of a certain entity for five consecutive days is 6%, 5%, 8%,10% and 11%.

What is the variance estimator of returns?

Solution

We start by calculating the sample mean:

μ̂ = (1/5)(0.06 + 0.05 + 0.08 + 0.10 + 0.11) = 0.08

So that the variance estimator is:

σ̂² = (1/T) ∑(t=1 to T) (rt − μ̂)²

= (1/5)[(0.06 − 0.08)² + (0.05 − 0.08)² + (0.08 − 0.08)² + (0.10 − 0.08)² + (0.11 − 0.08)²] = 0.00052

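The estimator can be sketched in plain Python using the returns from the example (note the division by T, as in the formula above, rather than T − 1):

```python
# Variance estimator sigma^2_hat = (1/T) * sum (r_t - mu_hat)^2, applied to
# the five returns from the example.
returns = [0.06, 0.05, 0.08, 0.10, 0.11]
T = len(returns)
mu_hat = sum(returns) / T
var_hat = sum((r - mu_hat) ** 2 for r in returns) / T
print(mu_hat, var_hat)  # approximately 0.08 and 0.00052
```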
The Implied Volatility

Implied volatility is an alternative measure of volatility that is extracted from option prices. Options (both puts and calls) have payouts that are nonlinear functions of the price of the underlying asset. For instance, the payout from a put option is given by:

max(K − PT, 0)

where PT is the price of the underlying asset at maturity, K is the strike price, and T is the maturity period. The price of an option is therefore sensitive to the variance of the return on the asset.

The Black-Scholes-Merton model is commonly used for option pricing. The model relates the price of an option to the risk-free rate of interest, the current price of the underlying asset, the strike price, the time to maturity, and the variance of returns.

For instance, the price of a call option can be denoted by:

Ct = f(rf, T, Pt, K, σ²)

Where:

rf = risk-free rate of interest;

T = time to maturity;

Pt = current price of the underlying asset;

K = strike price;

σ² = variance of the return.

Given the observed market price of an option and the other inputs (rf, T, and Pt), the implied volatility σ is the value of volatility that equates the model price to the market price. The implied volatility is an annualized value and does not need to be converted further.

The volatility index (VIX) measures the volatility in the S&P 500 over the coming 30 calendar

days. VIX is constructed from a variety of options with different strike prices. Similar volatility indices exist for other assets, such as gold; however, this approach requires highly liquid derivatives markets and is thus not applicable to most financial assets.

The Financial Returns Distribution

Financial returns are often assumed to follow a normal distribution. A normal distribution is thin-tailed and has neither skewness nor excess kurtosis. This assumption is often not valid in practice because many return series are both skewed and heavy-tailed.

To determine whether it is appropriate to assume that the asset returns are normally distributed,

we use the Jarque-Bera test.

The Jarque-Bera Test

The Jarque-Bera test examines whether the skewness and kurtosis of returns are compatible with those of a normal distribution.

Denoting the skewness by S and kurtosis by k, the hypothesis statement of the Jarque-Bera test

is stated as:

H 0 : S = 0 and k=3 (the returns are normally distributed)

vs

H 1 : S ≠ 0 or k ≠ 3 (the returns are not normally distributed)

The test statistic (JB) is given by:

JB = (T − 1) [Ŝ²/6 + (k̂ − 3)²/24]

Where T is the sample size.

The basis of the test is that, under the normal distribution, the skewness estimator is asymptotically normally distributed with a variance of 6, so that Ŝ²/6 is chi-squared distributed with one degree of freedom (χ²₁). Similarly, the kurtosis estimator is asymptotically normally distributed with a mean of 3 and a variance of 24, so that (k̂ − 3)²/24 is also a χ²₁ variable. Combining these results, and given that the two variables are independent:

JB ∼ χ²₂

The Decision Rule of the JB Test

When the test statistic is greater than the critical value, the null hypothesis is rejected; otherwise, we fail to reject the null hypothesis. We use the chi-squared table with the appropriate degrees of freedom:

Chi-square Distribution Table

d.f. .995 .99 .975 .95 .9 .1 .05 .025 .01
1 0.00 0.00 0.00 0.00 0.02 2.71 3.84 5.02 6.63
2 0.01 0.02 0.05 0.10 0.21 4.61 5.99 7.38 9.21
3 0.07 0.11 0.22 0.35 0.58 6.25 7.81 9.35 11.34
4 0.21 0.30 0.48 0.71 1.06 7.78 9.49 11.14 13.28
5 0.41 0.55 0.83 1.15 1.61 9.24 11.07 12.83 15.09
6 0.68 0.87 1.24 1.64 2.20 10.64 12.59 14.45 16.81
7 0.99 1.24 1.69 2.17 2.83 12.02 14.07 16.01 18.48
8 1.34 1.65 2.18 2.73 3.49 13.36 15.51 17.53 20.09
9 1.73 2.09 2.70 3.33 4.17 14.68 16.92 19.02 21.67
10 2.16 2.56 3.25 3.94 4.87 15.99 18.31 20.48 23.21
11 2.60 3.05 3.82 4.57 5.58 17.28 19.68 21.92 24.72
12 3.07 3.57 4.40 5.23 6.30 18.55 21.03 23.34 26.22

For example, the critical value of a χ²₂ distribution at a 5% significance level is 5.991; thus, if the computed test statistic is greater than 5.991, the null hypothesis is rejected.

Example: Conducting a JB Test

Investment return is such that it has a skewness of 0.75 and a kurtosis of 3.15. If the sample size

is 125, what is the JB test statistic? Does the data qualify to be normally distributed at a 95%

confidence level?

Solution

The test statistic is given by:

JB = (T − 1) [Ŝ²/6 + (k̂ − 3)²/24] = (125 − 1) [0.75²/6 + (3.15 − 3)²/24] = 11.74

Since the test statistic is greater than the 5% critical value (5.991), then the null hypothesis that

the data is normally distributed is rejected.
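The test can be sketched in a few lines (the skewness, kurtosis, and sample size are taken from the example; the 5.991 critical value is the 5% χ²₂ value from the table above):

```python
# Inputs from the example: skewness 0.75, kurtosis 3.15, sample size 125.
S_hat, k_hat, T = 0.75, 3.15, 125

# Jarque-Bera statistic: (T - 1) * (S^2 / 6 + (k - 3)^2 / 24)
jb = (T - 1) * (S_hat ** 2 / 6 + (k_hat - 3) ** 2 / 24)   # ~11.74

# Compare with the 5% critical value of a chi-squared(2) distribution
critical_value = 5.991
reject_normality = jb > critical_value                    # True: reject normality
```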

The Power Law

The power law is an alternative method of determining whether the returns are normal or not by

studying the tails. For a normal distribution, the tail is thin, such that the probability of any

return greater than kσ decreases sharply as k increases. Other distributions are such that their

tails decrease relatively slowly, given a large deviation.

The power law tails are such that, the probability of observing a value greater than a given value

x is defined as:

P(X > x) = kx −α

Where k and α are constants.

The tail behavior of distributions is effectively compared by considering the natural log

(ln(P(X>x))) of the tail probability. From the above equation:

ln prob(X > x) = ln k − αln x

To test whether the above equation holds, a graph of ln prob(X > x) is plotted against ln x.

For a normal distribution, the plot is quadratic in x, and hence it decays quickly, meaning the distribution has thin tails. For other distributions, such as Student's t, the plot is linear in ln x, and thus the tails decay at a slow rate; these distributions have fatter tails (they produce values that are far from the mean).
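The contrast between the two tail behaviors can be illustrated numerically (a sketch; the constants k = 1 and α = 3 are illustrative assumptions):

```python
import math

def power_law_tail(x, k, alpha):
    """P(X > x) = k * x^(-alpha) for a power-law tail."""
    return k * x ** (-alpha)

def normal_tail(x):
    """P(X > x) for a standard normal variable, via the error function."""
    return 0.5 * (1 - math.erf(x / math.sqrt(2)))

# The ratio of power-law to normal tail probabilities grows explosively with x,
# showing that power-law tails decay far more slowly than normal tails.
# (k = 1, alpha = 3 are illustrative constants.)
tail_ratios = [power_law_tail(x, 1, 3) / normal_tail(x) for x in (2, 4, 6)]
```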

Dependence and Correlation of Random Variables.

The two random variables X and Y are said to be independent if their joint density function is

equal to the product of their marginal distributions. Formally stated:

fX,Y(x, y) = fX(x) × fY(y)

Otherwise, the random variables are said to be dependent. The dependence of random variables

can be linear or nonlinear.

The linear relationship of the random variables is measured using the correlation estimator

called Pearson’s correlation.

Recall that given the linear equation:

Yi = α + βXi + ϵi

The slope β is related to the correlation coefficient ρ. That is, if β = 0, then the random variables

Xi and Yi are uncorrelated; otherwise, β ≠ 0. In fact, if the variances of the random variables are

engineered such that they are both equal to unity (σX2 = σY2 = 1), the slope of the regression

equation is equal to the correlation coefficient (β = ρ). Thus, the regression equation reflects how

the correlation measures the linear dependence.

Nonlinear dependence is complex and thus cannot be summarized using a single statistic.

Measures of Correlation

Correlation is commonly measured using the rank correlation (Spearman's rank correlation) and Kendall's τ correlation coefficient. The values of these correlation coefficients lie between -1 and 1. A value of 0 indicates no monotonic association between the random variables (note that this does not by itself imply independence); a positive (negative) correlation indicates an increasing (a decreasing) relationship between the random variables.

Rank Correlation

The rank correlation uses the ranks of observations of random variables X and Y. That is, rank

correlation depends on the linear relationship between the ranks rather than the random

variables themselves.

The ranks are such that 1 is assigned to the smallest value, 2 to the next value, and so on until

the largest value is assigned n.

When a value repeats, the positions the tied observations would have occupied are averaged, and each tied observation is assigned that averaged rank. Consider the ranks 1,2,3,3,3,4,5,6,7,7. The value ranked 3 is repeated three times and would have occupied positions 3, 4, and 5, so each receives the averaged rank (3+4+5)/3 = 4. The values originally ranked 4, 5, and 6 then occupy positions 6, 7, and 8. The value ranked 7 is repeated twice and occupies positions 9 and 10, so each receives the averaged rank (9+10)/2 = 9.5.

So the new ranks are: 1, 2, 4, 4, 4, 6, 7, 8, 9.5, 9.5.

Now, denote the rank of X by RX and that of Y by R Y then the rank correlation estimator is given

by:

ρ̂s = Ĉov(RX, RY) / (√V̂(RX) × √V̂(RY))

Alternatively, when all the ranks are distinct (no repeated ranks), the rank correlation estimator

is estimated as:

ρ̂s = 1 − [6 Σ_{i=1}^{n} (RXi − RYi)²] / [n(n² − 1)]

The intuition behind this formula is that when highly ranked values of X are paired with correspondingly ranked values of Y, the differences RXi − RYi are very small, and the correlation tends to 1. On the other hand, if the smaller rank values of X are matched with larger rank values of Y, the differences RXi − RYi are relatively large, and the correlation tends to -1.

When the variables X and Y have a linear relationship, the linear and rank correlations are equal. However, rank correlation is inefficient compared to linear correlation and is mainly used for confirmatory checks. On the other hand, rank correlation is insensitive to outliers because it only deals with the ranks, not the values, of X and Y.

Example: Calculating the Rank Correlation

Consider the following data.

i X Y
1 0.35 2.50
2 1.73 6.65
3 −0.45 −2.43
4 −0.56 −5.04
5 4.03 3.20
6 3.21 2.31

What is the value of rank correlation?

Solution

Consider the following table where the ranks of each variable have been filled and the square of

their difference in ranks.

i X Y RX RY (RX − RY )2
1 0.35 2.50 3 4 1
2 1.73 6.65 4 6 4
3 −0.45 −2.43 2 2 0
4 −0.56 −5.04 1 1 0
5 4.03 3.20 6 5 1
6 3.21 2.31 5 3 4
Sum 10

Since there are no repeated ranks, then the rank correlation is given by:

ρ̂s = 1 − [6 Σ_{i=1}^{n} (RXi − RYi)²] / [n(n² − 1)] = 1 − (6 × 10) / (6(6² − 1)) = 1 − 0.2857 = 0.7143
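The computation can be verified with a short script (a sketch using the example's data; the `ranks` helper assumes no ties, as is the case here):

```python
X = [0.35, 1.73, -0.45, -0.56, 4.03, 3.21]
Y = [2.50, 6.65, -2.43, -5.04, 3.20, 2.31]

def ranks(values):
    """Rank 1 for the smallest value, 2 for the next, and so on (assumes no ties)."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

RX, RY = ranks(X), ranks(Y)
n = len(X)

# Spearman's rank correlation for distinct ranks:
# rho_s = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))
d2 = sum((rx - ry) ** 2 for rx, ry in zip(RX, RY))    # sum of squared rank differences
rho_s = 1 - 6 * d2 / (n * (n ** 2 - 1))               # ~0.7143
```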

The Kendall's Tau (τ)

Kendall's Tau is a non-parametric measure of the relationship between two random variables, say, X and Y. Kendall's τ compares the frequency of concordant and discordant pairs.

Consider pairs of observations (Xi, Yi) and (Xj, Yj), i ≠ j. The pairs are said to be concordant if the ranks of the components agree: Xi > Xj when Yi > Yj, or Xi < Xj when Yi < Yj. That is,

they are concordant if they agree on the same directional position (consistent). When the pairs

disagree, they are termed as discordant. Note that ties are neither concordant nor discordant.

Intuitively, random variables with a high number of concordant pairs have a strong positive

correlation, while those with a high number of discordant pairs are negatively correlated.

The Kendall's Tau is defined as:

τ̂ = (nc − nd) / [n(n − 1)/2] = nc/(nc + nd + nt) − nd/(nc + nd + nt)

Where

nc =number of concordant pairs

nd =number of discordant pairs

nt =number of ties

It is easy to see that Kendall's Tau is equivalent to the difference between the probabilities of concordance and discordance. Moreover, when all the pairs are concordant, τ̂ = 1, and when all pairs are discordant, τ̂ = −1.

Example: Calculating the Kendall’s Tau

Consider the following data (same as the example above).

i X Y
1 0.35 2.50
2 1.73 6.65
3 −0.45 −2.43
4 −0.56 −5.04
5 4.03 3.20
6 3.21 2.31

What is Kendall’s τ correlation coefficient?

Solution

The first step is to rank each data:

i X Y RX RY
1 0.35 2.50 3 4
2 1.73 6.65 4 6
3 −0.45 −2.43 2 2
4 −0.56 −5.04 1 1
5 4.03 3.20 6 5
6 3.21 2.31 5 3

Next, arrange the ranks in order of rank X. For each row, the concordant count (C) is the number of RY ranks below it that are greater than the row's RY, and the discordant count (D) is the number of RY ranks below it that are smaller.

RX RY C D
1 1 5 0
2 2 4 0
3 4 2 1
4 6 0 2
5 3 1 0
6 5 − −
Total 12 3

Note that, for the second row, C = 4 is the number of ranks below it that are greater than 2 (namely 4, 6, 3, and 5), and D = 0 is the number of ranks below it that are less than 2. This counting is continued up to the second-last row, since there are no ranks left to look up below the final row.

So, nc = 12 and nd = 3, and:

τ̂ = (nc − nd) / [n(n − 1)/2] = (12 − 3) / [6(6 − 1)/2] = 9/15 = 0.6
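The pair counting can be verified by brute force over all pairs (a sketch using the example's ranks; each pair is counted once, for i < j):

```python
# Ranks of X and Y from the example, in the original observation order.
RX = [3, 4, 2, 1, 6, 5]
RY = [4, 6, 2, 1, 5, 3]
n = len(RX)

# Count concordant and discordant pairs over all i < j.
nc = nd = 0
for i in range(n):
    for j in range(i + 1, n):
        s = (RX[i] - RX[j]) * (RY[i] - RY[j])
        if s > 0:
            nc += 1     # pair moves in the same direction: concordant
        elif s < 0:
            nd += 1     # pair moves in opposite directions: discordant

tau = (nc - nd) / (n * (n - 1) / 2)   # for this data: (12 - 3) / 15 = 0.6
```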

Practice Question

Suppose that we know from experience that α = 3 for a particular financial variable,

and we observe that the probability that X > 10 is 0.04.

Determine the probability that X is greater than 20.

A. 125%

B. 0.5%

C. 4%

D. 0.1%

The correct answer is B.

From the given probability, we can get the value of constant k as follows:

prob(X > x) = kx^(−α)


0.04 = k(10)^(−3)
k = 40

Thus,

P(X > 20) = 40(20)^(−3) = 0.005 or 0.5%

Note: The power law provides an alternative to assuming normal distributions.

Reading 24: Simulation and Bootstrapping

After completing this reading, you should be able to:

Describe the basic steps to conduct a Monte Carlo simulation.

Describe ways to reduce the Monte Carlo sampling error.

Explain the use of antithetic and control variates in reducing Monte Carlo sampling

error.

Describe the bootstrapping method and its advantage over the Monte Carlo simulation.

Describe pseudo-random number generation.

Describe situations where the bootstrapping method is ineffective.

Describe the disadvantages of the simulation approach to financial problem-solving.

Simulation is a way of modeling random events to match real-world outcomes. By observing

simulated results, researchers gain insight into real problems. Examples of applications of simulation include calculating option payoffs and determining the accuracy of an estimator. Two common simulation methods are Monte Carlo simulation and bootstrapping.

Monte Carlo Simulation approximates the expected value of a random variable using the

numerical methods. The Monte Carlo generates the random variables from an assumed data

generating process (DGP), and then it applies a function(s) to create realizations from the

unknown distribution of the transformed random variables. This process is repeated (to improve

the accuracy), and the statistic of interest is then approximated using the simulated values.

Bootstrapping is a type of simulation where it uses the observed variables to simulate from the

unknown distribution that generates the observed variables. In other words, bootstrapping

involves the combination of the observed data and the simulated values to create a new sample

that is related but different from the observed data.

The notable similarity between Monte Carlo and bootstrapping is that both aim at calculating the

expected value of the function by using simulated data (often by use of a computer).

Also, the contrasting feature in these methods is that in Monte Carlo simulation, a data

generating process (DGP) is entirely used to simulate the data. However, in bootstrapping,

observed data is used to generate the simulated data without specifying an underlying DGP.

Simulation of Random Variables

Simulation requires the generation of random variables from an assumed distribution, usually using a computer. However, computer-generated numbers are not truly random and are thus termed pseudo-random numbers. Pseudo-random numbers are produced by complex deterministic functions (pseudo-random number generators, PRNGs) whose output appears random. The initial value supplied to a PRNG is termed the seed; running the PRNG with the same seed reproduces the same sequence of random variables.

This reproducibility makes it possible to use pseudo-random numbers across multiple experiments because the same sequence of random variables can be

generated using the same seed value. Therefore, we can use this feature to choose the best

model or reproduce the same results in the future in case of regulatory requirements. Moreover,

the corresponding random variables can be generated using different computers.

Simulating Random Variables from a Specific Distribution

Simulating random variables from a specific distribution is initiated by first generating a random

number from a uniform distribution (0,1). After that, the cumulative distribution of the

distribution we are trying to simulate is used to get the random values from that distribution.

That is, we first generate a random number U from U(0,1) distribution, then, we use the

generated random number to simulate a random variable X with the pdf f(x) by using the CDF,

F(x).

Let U be the probability that X takes a value less than or equal to x, that is,

U = P(X ≤ x) = F(x)

Then we can derive the random variable x as:

x = F−1 (u)

To put this in a more straightforward perspective, the algorithm for simulating random variable

from a specific distribution involves:

1. Generating a random variable u from the uniform distribution U(0,1)

2. Compute x = F−1 (u)

Note that the random variable X has a CDF F(x) as shown below:

P(X ≤ x) = P(F−1 (U) ≤ x) = P(U ≤ F(x)) = F(x)

Example: Generating Random Variables from Exponential Distribution

Assume that we want to simulate three random variables from an exponential distribution with a

parameter λ = 0.2 using the values 0.112, 0.508, and 0.005 from U(0,1).

Solution

This question assumes that the uniform random variables have already been generated. The inverse of the CDF of the exponential distribution is given by:

F⁻¹(u) = −(1/λ) ln(1 − u)

So, in this case:

x = −(1/0.2) ln(1 − u) = −5 ln(1 − u)

So the random variables are:

x1 = −5 ln(1 − 0.112) = 0.5939
x2 = −5 ln(1 − 0.508) = 3.5464
x3 = −5 ln(1 − 0.005) = 0.0251

The random variables are 0.5939, 3.5464, and 0.0251.
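The two-step inverse-transform algorithm can be sketched for this exponential case:

```python
import math

lam = 0.2
u_values = [0.112, 0.508, 0.005]   # step 1: draws from U(0, 1)

# Step 2: apply the inverse CDF of the exponential distribution,
# F^{-1}(u) = -(1/lambda) * ln(1 - u)
x_values = [-(1 / lam) * math.log(1 - u) for u in u_values]
```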

Monte Carlo Simulation

Monte Carlo simulation is used to estimate population moments or functions of them. The Monte Carlo procedure is as follows:

Assume that X is a random variable that can be simulated and let g(X) be a function that can be

evaluated at the realizations of X. Then, the simulation generates multiple copies of g(X) by

simulating draws xi from X and calculating gi = g(xi).

This process is then repeated b times so that a set of iid variables is generated from the unknown

distribution g(X), which can then be used to estimate the desired statistic.

For instance, if we wish to estimate the mean of the generated random variables, then the mean

is given by:

Ê(g(X)) = (1/b) Σ_{i=1}^{b} g(Xi)

This is true because the generated variables are iid, and then the process is repeated b times.

Consequently, by the law of large number (LLN),

lim_{b→∞} Ê(g(X)) = E(g(X))

Also, the Central Limit Theorem applies to the estimated mean so that:

Var[Ê(g(X))] = σ²g / b

Where σg2 = Var(g(X))

The second moment, which is the variance (standard variance estimator) is estimated as:

σ̂²g = (1/b) Σ_{i=1}^{b} (g(Xi) − Ê[g(X)])²

From CLT, the standard error of the simulated expectation is given by:

√(σ²g / b) = σg / √b

The standard error of the simulated expectation measures the level of accuracy of the

estimation; thus, the choice of b determines the accuracy of the simulation.

Another quantity that can be calculated from the simulation is the α-quantile by arranging the b

draws in ascending order then selecting the value bα of the sorted set.

Moreover, using the simulation, we determine the finite sample properties of the estimated

parameters. Assume that the sample size n is large enough so that approximation by CLT is

adequate. Now, consider a finite-sample distribution of a parameter θ^. Using the assumed DGP, n

random samples are generated so that:

X = [x1 , x 2, … , xn ]

We need to estimate a parameter θ^.

We would need to simulate new data set and estimate the parameter b times: (θ^1 , θ^2 ,… , θ^b )

from the finite-sample distribution of the estimator of θ. From these values, we can rule out the

properties of the estimator θ^. For instance, the bias defined as:

Bias(θ) = E(θ^) − θ

That can be approximated as:

B̂ias(θ) = (1/b) Σ_{i=1}^{b} (θ̂i − θ)

Having covered the basics of Monte Carlo simulation, its basic algorithm is as follows:

i. Generate the data: x i = [x1i, x 2i , … , x ni] by using the assumed DGP.

ii. Compute the desired function or statistic gi = g(x i) .

iii. Iterate steps 1 and 2 b times.

iv. From the replications {g 1 , g 2 , … , gb } , calculate the statistic of interest.

v. Determine the accuracy of the estimated quantity by calculating the standard error. If

the standard error is huge, increase the number of b-replications to obtain the smallest

error possible.
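The steps above can be sketched for a simple case (an illustrative assumption: g(x) = x² with X ~ N(0, 1), so the true value E[g(X)] = 1):

```python
import random

random.seed(42)
b = 100_000   # number of replications

# Steps i-iii: simulate x_i ~ N(0, 1) and evaluate g(x) = x^2 on each draw.
g_values = [random.gauss(0, 1) ** 2 for _ in range(b)]

# Step iv: the statistic of interest -- here the mean, approximating E[X^2] = 1.
estimate = sum(g_values) / b

# Step v: standard error of the simulated expectation, sigma_g / sqrt(b).
var_g = sum((g - estimate) ** 2 for g in g_values) / b
std_error = (var_g / b) ** 0.5
```

Increasing b shrinks the standard error at the rate 1/√b, as discussed above.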

Example: Using the Monte Carlo Simulation to Estimate the Price of a Call Option

Recall that the price of a call option is given by:

max(0, ST − K)

ST is the price of the underlying stock at the time of maturity T, and K is the strike price. The

price of the call option is a non-linear function of the underlying stock price at the expiration

date, and thus, we can model the price of the call option.

Assuming that the log of the stock price is normally distributed, the log price of the stock can be modeled as the sum of the initial log price, a drift term, and a normally distributed error. Mathematically:

sT = s0 + T(rf − σ²/2) + √T xi

Where

s0 = the initial log stock price

T= time to maturity in years

rf = the annualized risk-free rate

σ 2= variance of the stock return

x i= simulated values from N(0, σ 2)

From the formula above, to simulate the price of the underlying stock requires the estimation of

the stock volatility.

Using the simulated price of the stock, the price of the option can be calculated as:

c = e^(−rf T) max(ST − K, 0)

And thus the mean of the price of the call option can be estimated as:

Ê(c) = c̄ = (1/b) Σ_{i=1}^{b} ci

where ci is the simulated payoff of the call option. Note that, using the equation sT = s0 + T(rf − σ²/2) + √T xi, the simulated stock prices can be expressed as:

STi = e^(s0 + T(rf − σ²/2) + √T xi)

And thus

g(xi) = ci = e^(−rf T) max(e^(s0 + T(rf − σ²/2) + √T xi) − K, 0)

The standard error of the call option price is given by:

s.e.(Ê(c)) = √(σ̂²g / b) = σ̂g / √b

where

σ̂²g = (1/b) Σ_{∀i} (ci − c̄)²

Given that we calculate the standard error, we can calculate the confidence intervals for the

estimated mean of the call option price. For instance, the 95% confidence interval is given by:

Ê(c) ± 1.96 × s.e.(Ê(c))
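The whole procedure can be sketched as follows (all parameter values are illustrative assumptions, not from the text; the shock is written as σ√T·z with z ~ N(0, 1), an equivalent form of the √T·xi term with xi ~ N(0, σ²)):

```python
import math
import random

random.seed(7)

# Illustrative inputs (assumptions): spot 100, strike 100, risk-free rate 5%,
# volatility 20%, one year to maturity.
S0, K, rf, sigma, T = 100.0, 100.0, 0.05, 0.20, 1.0
b = 200_000

payoffs = []
for _ in range(b):
    z = random.gauss(0, 1)
    # Terminal price: S_T = S0 * exp((rf - sigma^2/2) * T + sigma * sqrt(T) * z)
    ST = S0 * math.exp((rf - 0.5 * sigma ** 2) * T + sigma * math.sqrt(T) * z)
    # Discounted call payoff
    payoffs.append(math.exp(-rf * T) * max(ST - K, 0.0))

c_hat = sum(payoffs) / b                                # simulated call price
var_g = sum((c - c_hat) ** 2 for c in payoffs) / b
se = math.sqrt(var_g / b)                               # standard error
ci_95 = (c_hat - 1.96 * se, c_hat + 1.96 * se)          # 95% confidence interval
```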

Reducing Monte Carlo Sampling Error

Sampling error in Monte Carlo simulation is reduced by two complementary methods:

1. Antithetic Variables, and

2. Control Variates.

These methods can be used simultaneously.

To set the mood, recall that the estimation of expected values in simulation depends on the Law

of Large Numbers (LLN) and that the standard error of the estimated expected value is

proportional to 1/√b. Therefore, the accuracy of the simulation depends on the variance of the

simulated quantities.

Antithetic Variables

Recall variance between two random variables X and Y is given by:

Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y)

Otherwise, if the variables are independent, then:

Var(X + Y) = Var(X) + Var(Y)

Moreover, if the covariance between the variables is negative (or negatively correlated), then:

Var(X + Y) = Var(X) + Var(Y) − 2Cov(X, Y)

The antithetic variables use the last result. The antithetic variables reduce the sampling error by

incorporating the second set of variables that are generated in such a way that they are

negatively correlated with the initial iid simulated variables. That is, each simulated variable is

paired with an antithetic variable so that they occur in pairs and are negatively correlated.

If U1 is a uniform random variable, then:

F−1 (U1 ) ∼ Fx

Denote an antithetic variable U 2 which is generated using:

U2 = 1 − U1

Note that U 2 is also a uniform random variable so that:

F−1 (U2 ) ∼ Fx

Then by definition of antithetic variables, the correlation between U1 and U2 is negative as well

as their mappings onto the CDF Fx .

Using the antithetic random variables is analogous to typical Monte Carlo simulation only that

values are constructed in pairs {U1, 1 − U1}, {U2, 1 − U2}, … , {U_{b/2}, 1 − U_{b/2}}, which are then transformed to have the desired distribution using the inverse CDF.

Note that the number of simulations is b/2 since the simulation values are in pairs. The antithetic

variables reduce the sampling error only if the function g(X) is monotonic in x, so that the negative correlation between the paired draws carries over to g evaluated at the paired draws.

Notably, the antithetic random variables reduce the sampling error through the correlation

coefficient. Note that the usual sampling error using b iid simulated values is:

σg / √b

By introducing the antithetic random variables, the standard error becomes:

σg √(1 + ρ) / √b

Clearly, the standard error decreases when the correlation coefficient ρ < 0.
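The variance reduction can be demonstrated numerically (a sketch; the target E[e^U] with U uniform on (0,1) is an illustrative assumption, chosen because exp is monotonic):

```python
import math
import random

random.seed(1)

# Target: E[exp(U)] for U ~ Uniform(0, 1); the true value is e - 1 ~ 1.71828.
n_pairs = 5_000

plain, antithetic = [], []
for _ in range(n_pairs):
    # Plain Monte Carlo: two independent draws.
    plain.append(math.exp(random.random()))
    plain.append(math.exp(random.random()))
    # Antithetic sampling: pair each draw u with 1 - u and average the pair.
    u = random.random()
    antithetic.append(0.5 * (math.exp(u) + math.exp(1 - u)))

def mean_var(xs):
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / len(xs)

m_plain, var_plain = mean_var(plain)
m_anti, var_anti = mean_var(antithetic)
# var_anti << var_plain: exp is monotonic, so exp(u) and exp(1 - u)
# are negatively correlated and the pair averages vary much less.
```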

Control Variates

Control variates reduce the sampling error by incorporating values that have a mean of zero and are correlated with the simulation. Because the control variates have a mean of zero, they do not bias the approximation. Given that the control variate and the desired function are correlated, an effective combination (with optimal weights) of the control variate and the initial simulated value can be formed to reduce the variance of the approximation.

Recall that expected value is approximated as:

Ê[g(X)] = (1/b) Σ_{i=1}^{b} g(xi)

Since this estimate is consistent, we can break down to:

Ê[g(X)] = E[g(X)] + ηi

Where ηi is a mean zero error. That is: E(ηi) = 0

Denote the control variate by h(Xi ) so that by definition, E[h(Xi )] = 0 and that it is correlated with

ηi .

An ideal control variate should be inexpensive to construct and highly correlated with g(X). The optimal combination parameters that minimize the estimation error can be approximated from the regression equation:

g(xi ) = β0 + β1 h(Xi)
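A minimal control-variate sketch, assuming the illustrative target E[e^U] for U ~ Uniform(0, 1), with h(U) = U − 0.5 as the mean-zero control variate:

```python
import math
import random

random.seed(3)

# Target: E[exp(U)], U ~ Uniform(0, 1); control variate h(U) = U - 0.5,
# which has known mean zero and is highly correlated with exp(U).
b = 10_000
u = [random.random() for _ in range(b)]
g = [math.exp(x) for x in u]
h = [x - 0.5 for x in u]

g_bar = sum(g) / b
h_bar = sum(h) / b

# Regression slope: beta1 = Cov(g, h) / Var(h)
cov_gh = sum((gi - g_bar) * (hi - h_bar) for gi, hi in zip(g, h)) / b
var_h = sum((hi - h_bar) ** 2 for hi in h) / b
beta1 = cov_gh / var_h

# Control-variate estimator: subtract the (mean-zero) scaled control variate.
adjusted = [gi - beta1 * hi for gi, hi in zip(g, h)]
cv_estimate = sum(adjusted) / b

var_plain = sum((gi - g_bar) ** 2 for gi in g) / b
var_cv = sum((a - cv_estimate) ** 2 for a in adjusted) / b
# var_cv is a small fraction of var_plain, so far fewer replications
# are needed for the same accuracy.
```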

Disadvantages of Simulation

Monte Carlo Simulation can result in unreliable approximates of moments if the DGPs

used do not adequately describe the observed data. This mostly occurs due to

misspecifications of the DGP.

Simulation can be costly, especially when you are running multiple simulation

experiments because it can be time-consuming.

Bootstrapping

As stated earlier, bootstrapping is a type of simulation that uses the observed data to simulate from the unknown distribution that generated that data. Note that bootstrapping does not directly model the observed data or impose any assumption about the distribution; rather, it treats the observed data as representative draws from the unknown distribution that generated them.

There are two types of bootstraps:

i. iid Bootstraps

ii. Circular Blocks Bootstraps (CBB)

iid Bootstrap

iid bootstraps select the samples that are constructed with replacement from the observed data.

Assume that a simulation sample of size m is created from the observed data with n

observations. iid bootstraps construct observation indices by randomly sampling with replacing

from the values 1,2,..., n. These random indices are then used to draw the observed data to be

included in the simulated data (bootstrap sample).

For instance, assume we want to draw 10 observations from a sample of 50 data points:

{x1, x2, x3, … , x50}. The first simulation could use the observations {x1, x12, x23, x11, x32, x43, x1, x22, x2, x22}, and the second simulation could use {x50, x21, x23, x19, x32, x49, x41, x22, x12, x39}, and so on until the desired number of simulations is reached.

In other words, iid bootstrap is analogous to Monte Carlo Simulation, where bootstrap samples

are used instead of simulated samples. Under iid bootstrap, the expected values are estimated

as:

Ê[g(X)] = (1/b) Σ_{j=1}^{b} g(x^BS_{1,j}, x^BS_{2,j}, … , x^BS_{m,j})

Where

x^BS_{i,j} = observation i in bootstrap sample j

b = total number of bootstrap samples

The iid bootstrap is suitable when the observations are independent over time; it is thus often unsuitable in financial analysis because most financial data are dependent.

In short, the algorithm for generating a sample using the iid bootstrap is:

i. Create a random set of m integers (i1 , i2 , … , im ) from (1,2,…,n) with replacement.

ii. Construct the bootstrap sample as x i1 , x i2 , … , x im
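The two steps can be sketched as follows (the data values are illustrative assumptions):

```python
import random

random.seed(11)

# Illustrative observed data (n = 10 returns).
data = [0.06, 0.05, 0.08, 0.10, 0.11, -0.02, 0.03, 0.07, -0.01, 0.04]
n = len(data)
m = n          # bootstrap sample size
b = 1_000      # number of bootstrap samples

boot_means = []
for _ in range(b):
    # Step i: draw m indices from 0..n-1 with replacement.
    idx = [random.randrange(n) for _ in range(m)]
    # Step ii: construct the bootstrap sample from those indices.
    sample = [data[i] for i in idx]
    boot_means.append(sum(sample) / m)

# The spread of boot_means approximates the sampling variation of the mean.
boot_estimate = sum(boot_means) / b
```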

Circular Block Bootstrap (CBB)

The circular block bootstrap differs from the iid bootstrap in that instead of sampling each data

point with replacement, it samples blocks of size q with replacement. For instance, assume that we have 50 observations divided into five blocks, each containing q = 10 observations.

The blocks are sampled with replacement until the desired sample size is produced. In the case

that the number of observations in sampled blocks is larger than the required sample size, some

of the observations are omitted in the last block.

The block size should be large enough to reflect the dependence in the observations, but not so large that only a few distinct blocks are available. Conventionally, the size of the

blocks is the square root of the sample size (√ n).

The general steps of generating sample using the CBB are:

i. Decide on the block size q; preferably, the block size should be close to the square root of the sample size, i.e., √n.

ii. Select a block starting index i from (1, 2, … , n) and transfer {xi, xi+1, … , xi+q−1} to the bootstrap sample, where indices larger than n wrap around.

iii. In case the bootstrap sample has fewer than m elements, repeat step (ii) above.

iv. In case the bootstrap sample has more than m elements, omit the values from the end of

the bootstrap sample until the sample size is m.

Application of Bootstrapping

One of the applications of bootstrapping is the estimation of the p-Value-at-Risk (p-VaR) in financial markets. Recall that the p-VaR is defined by:

Pr(L > VaR) = 1 − p

Where:

L = loss of the portfolio over a given period, and

1 − p = the probability that the loss exceeds the VaR.

If the loss is measured in percentages of a particular portfolio, then p-VaR can be seen as a

quantile of the return distribution. For instance, if we wish to calculate a one-year VaR of a

portfolio, then we will simulate a one-year data (252 days) and then find the quantile of the

simulated annual returns.

The VaR is then calculated by sorting the bootstrapped annual returns from lowest to highest

and then taking the observation at position (1 − p)b, which is the empirical (1 − p) quantile of the annual returns.
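The procedure can be sketched as follows (the daily returns are synthetic, illustrative data; summing daily log-returns into an annual return is a simplifying assumption):

```python
import random

random.seed(2024)

# Synthetic "observed" daily returns for one year (252 trading days).
daily = [random.gauss(0.0004, 0.01) for _ in range(252)]

b = 2_000    # bootstrap replications
annual_returns = []
for _ in range(b):
    # Resample 252 daily returns with replacement and compound into one year
    # (summing is exact for log-returns).
    sample = [random.choice(daily) for _ in range(252)]
    annual_returns.append(sum(sample))

# 95% VaR: the empirical 5% quantile of the bootstrapped annual returns,
# expressed as a loss.
annual_returns.sort()
p = 0.95
var_95 = -annual_returns[int((1 - p) * b)]
```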

Situations Where Bootstrap Will be Ineffective

The following are the two situations where bootstraps will not be sufficiently effective:

In cases where there are outliers in the data, hence there is a likelihood that the

bootstrap’s conclusion will be affected.

Non-independent data – The (iid) bootstrap assumes the data are independent of one another; when this assumption fails, the bootstrap results may be misleading.

Disadvantages of Bootstrapping

Bootstrapping uses the whole data to generate a simulated sample and thus may make

the simulated sample unreliable when the past and the present data are different. For

example, the present state of a financial market might be different from the past.

Bootstrapping of historical data can be unreliable due to changes in the market so that

the present is different from the past. For instance, if we are bootstrapping market

interest rates, there might be huge discrepancies due to past and present market

forces, which cause the interest rate to fluctuate significantly.

Comparison between Monte Carlo Simulation and Bootstrapping

Monte Carlo simulation uses an entire statistical model that incorporates the assumption on the

distribution of the shocks, and therefore, the results are inaccurate if the model used is poor

even when the replications are significantly large.

On the other hand, bootstrapping does not specify a model but instead assumes that the past resembles the present. In other words, bootstrapping uses the observed data, including any dependence structure, to reflect the sampling variation.

Both Monte Carlo simulation and bootstrapping are affected by the "Black Swan" problem: the resulting simulations in both methods closely resemble historical data (or the assumed model). In other words, the simulations tend to reflect what has already been observed and cannot anticipate unprecedented events.

Practice Question

Which of the following statements correctly describes an antithetic variable?

A. They are variables that are generated to have a negative correlation with the

initial simulated sample.

B. They are mean-zero values that are correlated with the desired statistic that is to be computed through simulation.

C. They are the mean zero variables that are negatively correlated with the

initial simulated sample.

D. None of the above

Solution

The correct answer is A.

Antithetic variables are used to reduce the sampling error in the Monte Carlo

simulation. They are constructed to have a negative correlation with the initial

simulated sample so that the overall standard error of approximation is reduced.
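The variance-reduction effect can be illustrated with a small simulation; the payoff function exp(z) and the sample size are arbitrary choices for this sketch:

```python
import math
import random
import statistics

random.seed(7)

def payoff(z):
    # Illustrative monotone payoff of a standard normal draw
    return math.exp(z)

n = 10_000

# Plain Monte Carlo: n independent draws
plain = [payoff(random.gauss(0, 1)) for _ in range(n)]

# Antithetic variates: n/2 draws, each paired with its negative.
# Each pair average is one lower-variance observation because
# payoff(z) and payoff(-z) are negatively correlated.
antithetic = []
for _ in range(n // 2):
    z = random.gauss(0, 1)
    antithetic.append(0.5 * (payoff(z) + payoff(-z)))

se_plain = statistics.stdev(plain) / math.sqrt(len(plain))
se_anti = statistics.stdev(antithetic) / math.sqrt(len(antithetic))
print(f"Plain SE: {se_plain:.4f}, Antithetic SE: {se_anti:.4f}")
```

The reduction is guaranteed in expectation only when the payoff is a monotone function of the draw, which makes the antithetic pair negatively correlated.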

Reading 25: Machine-Learning Methods

After completing this reading, you should be able to:

Discuss the philosophical and practical differences between machine-learning

techniques and classical econometrics.

Differentiate among unsupervised, supervised, and reinforcement learning models.

Use principal components analysis to reduce the dimensionality of a set of features.

Describe how the K-means algorithm separates a sample into clusters.

Understand the differences between and consequences of underfitting and overfitting

and propose potential remedies for each.

Explain the differences among the training, validation, and test data sub-samples, and

how each is used.

Explain how reinforcement learning operates and how it is used in decision-making.

Be aware of natural language processing and how it is used.

Machine-Learning Techniques vs. Classical Econometrics

Machine learning (ML) is the art of programming computers to learn from data. Its basic idea is

that systems can learn from data and recognize patterns without active human intervention. ML

is best suited for certain applications, such as pattern recognition and complex problems that

require large amounts of data and are not well solved with traditional approaches.

On the other hand, classical econometrics has traditionally been used in finance to identify

patterns in data. It has a solid foundation in mathematical statistics, probability, and economic

theory. In this case, the analyst researches the best model to use along with the variables to be

used. The computer’s algorithm tests the significance of variables, and based on the results, the

analyst decides whether the data supports the theory.

Machine learning and traditional linear econometric approaches are both employed in

prediction. The former has several advantages: machine learning does not rely on much financial

theory when selecting the most relevant features to include in a model. It can also be used by a

researcher who is unsure or has not specified whether the relationship between variables is

linear or non-linear. The ML algorithm automatically selects the most relevant features and

determines the most appropriate relationships between the variables.

Secondly, ML algorithms are flexible and can handle complex relationships between variables.

Consider the following linear regression model:

y = β0 + β1 X1 + β2 X2 + ε

Suppose that the effect of X1 on y depends on the level of X2 . Analysts would miss this

interaction effect unless a multiplicative term was explicitly included in the model. In the case of

many explanatory variables, a linear model may be difficult to construct for all combinations of

interaction terms. The use of machine learning algorithms can mitigate this problem by

automatically capturing interactions.

Additionally, the traditional statistical approaches for evaluating models, such as analyses of

statistical significance and goodness of fit tests, are not typically applied in the same way to

supervised machine learning models. This is because the goal of supervised machine learning is

often to make accurate predictions rather than to understand the underlying relationships

between variables or to test hypotheses.

There are different terminologies and notations used in ML. This is because engineers, rather

than statisticians, developed most machine learning techniques. There has been a lot of

discussion of features/inputs and targets/outputs. According to classical econometrics,

features/inputs are simply independent variables. Targets/outputs are dependent variables, and

the values of the outputs are referred to as labels.

The following gives a summary of some of the differences between ML techniques and classical

econometrics.

Goals
  Machine-learning techniques: Build models that can learn from data and continuously improve their performance over time; the relationships between variables do not need to be specified in advance.
  Classical econometrics: Identifies and estimates the relationships between variables; it also tests hypotheses about these relationships.

Data requirements
  Machine-learning techniques: ML models can deal with large amounts of complex and unstructured data.
  Classical econometrics: Requires well-structured and clearly defined dependent and independent variables.

Assumptions
  Machine-learning techniques: Not built on distributional assumptions and can handle non-linear relationships between variables.
  Classical econometrics: Based on various assumptions, e.g., errors are normally distributed and relationships between variables are linear.

Interpretability
  Machine-learning techniques: May be complex to interpret, as they may involve complex patterns and relationships that are difficult to understand or explain.
  Classical econometrics: Statistical models can be interpreted in terms of the relationships between variables.

Types of Machine Learning

There are many types of machine-learning systems, including unsupervised learning, supervised learning, and reinforcement learning.

Unsupervised Learning

As the name suggests, the system attempts to learn without a teacher. It recognizes data

patterns without an explicit target. More specifically, it uses inputs (X’s) for analysis with no

corresponding target (Y). Data is clustered to detect groups or factors that explain the data. It is,

therefore, not used for predictions.

For example, unsupervised learning can be used by an entrepreneur who sells books to detect

groups of similar customers. The entrepreneur will at no point tell the algorithm which group a

customer belongs to. It instead finds the connections without the entrepreneur’s help. The

algorithm may notice, for instance, that 30% of the store’s customers are males who love science

fiction books and frequent the store mostly during weekends, while 25% are females who enjoy

drama books. A hierarchical clustering algorithm can be used to further subdivide groups into

smaller ones.

Supervised Learning

This system is trained on well-labeled data: the labels act as a supervisor that teaches the machine to predict the correct output. You can think of it as how a student learns under the supervision of a teacher.

inputs (X’s) with output (Y). The output is also known as the target, while X’s are also known as

the features.

Typically, there are two types of tasks in supervised learning. One is classification. For example,

a loan borrower may be classified as “likely to repay” or “likely to default.” The second one is the

prediction of a target numerical value. For example, predicting a vehicle’s price based on a set of

features such as mileage, year of manufacture, etc. For the latter, labels will indicate the selling

prices. As for the former, the features would be the borrower’s credit score, income, etc., while

the labels would be whether they defaulted.

Reinforcement Learning

Reinforcement learning differs from other forms of learning. A learning system called an agent

perceives and interprets its environment, performs actions, and is rewarded for desired behavior

and penalized for undesired behavior. This is done through a trial-and-error approach. Over time,

the agent learns by itself what is the best strategy (policy) that will generate the best reward

while avoiding undesirable behaviors. Reinforcement learning can be used to optimize portfolio

allocation and create trading bots that can learn from stock market data through trial and error,

among many other uses.

Principal Components Analysis (PCA)

Training ML models can be slowed by the millions of features that might be present in each

training instance. The many features can also make it difficult to find a good solution. This

problem is referred to as the curse of dimensionality.

Dimensions and features are often used interchangeably. Dimension reduction involves reducing

the features of a dataset without losing important information. It is useful in ML as it simplifies

complex datasets, scales down the computational burden of dealing with large datasets, and

improves the interpretability of models.

PCA is the most popular dimension reduction approach. It involves projecting the training

dataset onto a lower-dimensional hyperplane. This is done by finding the directions in the

dataset that capture the most variance and projecting the dataset onto those directions. PCA

reduces the dimensionality of a dataset while preserving as much information as possible.

In PCA, the variance measures the amount of information. Hence, principal components capture

the most variance and retain the most information. Accordingly, the first principal component

will account for the largest possible variance; the second component will intuitively account for

the second largest variance (provided that it is uncorrelated with the first principal component),

and so on. A scree plot shows how much variance is explained by the principal components of the

data. The principal components that explain a significant proportion of the variance are retained

(usually 85% to 95%).
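For intuition, here is a minimal two-feature sketch of PCA, using the closed-form eigenvalues of the 2×2 sample covariance matrix (the data are simulated and illustrative):

```python
import math
import random
import statistics

random.seed(1)

# Two correlated illustrative features
x = [random.gauss(0, 1) for _ in range(500)]
y = [0.8 * xi + 0.3 * random.gauss(0, 1) for xi in x]

def covariance_2x2(a, b):
    """Sample variances and covariance of two features."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    n = len(a)
    var_a = sum((v - ma) ** 2 for v in a) / (n - 1)
    var_b = sum((v - mb) ** 2 for v in b) / (n - 1)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b)) / (n - 1)
    return var_a, var_b, cov

var_a, var_b, cov = covariance_2x2(x, y)

# Eigenvalues of [[var_a, cov], [cov, var_b]] via the quadratic formula;
# these are the variances captured by the two principal components
tr = var_a + var_b
det = var_a * var_b - cov ** 2
disc = math.sqrt(tr ** 2 - 4 * det)
lam1, lam2 = (tr + disc) / 2, (tr - disc) / 2  # lam1 >= lam2

explained = lam1 / (lam1 + lam2)
print(f"PC1 explains {explained:.1%} of total variance")
```

Because the two features are strongly correlated, the first principal component captures most of the total variance, which is exactly the redundancy PCA exploits.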

Example: Principal Components Analysis (PCA)

Researchers are concerned about which principal components will adequately explain returns in

a hypothetical Very Small Cap (VSC) 30 and Diversified Small Cap (DSC) 500 equity index over a

15-year period. DSC 500 is a diversified index that contains stocks across all sectors, whereas

VSC 30 is a concentrated index that contains technology stocks. In addition to index prices, the

dataset contains more than 1,000 technical and fundamental features. Because the dataset has so many features, many of them overlap due to multicollinearity. This is where PCA comes in

handy, as it works by creating new variables that can explain most of the variance while

preserving information in the data.

Below is a scree plot for each index. Based on the 20 principal components generated, the first three components explain 88% and 91% of the variance in the VSC 30 and DSC 500 index values, respectively. Scree plots for both indexes illustrate that the incremental contribution to explaining the variance structure is very small after PC5 or so. From PC5 onwards, it is possible to ignore the principal components without losing important information.

The K-Means Clustering Algorithm

Clustering is a type of unsupervised machine-learning technique that organizes data points into

similar groups. These groups are called clusters.

Clusters contain observations from data that are similar in nature. K-means is an iterative

algorithm that is used to solve clustering problems. K is a fixed number of clusters determined by the analyst at the outset. The algorithm is based on the idea of minimizing the sum of squared distances

between data points and the centroid of the cluster to which they belong. The following outlines

the process for implementing K-means clustering:

1. Randomly allocate initial K centroids within the data (centers of the clusters).

2. Assign each data point to the closest centroid, creating K clusters.

3. Calculate the new K centroids for each cluster by taking the average value of all data

points assigned to that cluster.

4. Reassign each data point to the closest centroid based on the newly calculated centroids.

5. Repeat the process of recalculating the new K centroids until the centroids converge or a

predetermined number of iterations has been reached.
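The steps above can be sketched in pure Python (a toy two-dimensional dataset; K, the data, and the initialization are illustrative):

```python
import math
import random

random.seed(3)

# Toy 2-D data: two illustrative groups of points
data = [(random.gauss(0, 0.5), random.gauss(0, 0.5)) for _ in range(20)]
data += [(random.gauss(5, 0.5), random.gauss(5, 0.5)) for _ in range(20)]

def k_means(points, k=2, iters=50):
    # Step 1: randomly pick K initial centroids from the data
    centroids = random.sample(points, k)
    assignments = [0] * len(points)
    for _ in range(iters):
        # Step 2: assign each point to its closest centroid (Euclidean distance)
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no data point was reassigned
        assignments = new_assignments
        # Step 3: recompute each centroid as the mean of its cluster
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    # Inertia (WCSS): sum of squared point-to-centroid distances
    wcss = sum(math.dist(p, centroids[a]) ** 2
               for p, a in zip(points, assignments))
    return centroids, assignments, wcss

centroids, labels, wcss = k_means(data)
print(centroids, wcss)
```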

Iterations continue until no data point is left to reassign to the closest centroid (there is no need

to recalculate new centroids). The distance between each data point and the centroids can be

measured in two ways. The first is the Euclidean distance, while the second is the Manhattan

distance.

Consider two data points A and B in a space of two features x and y, with coordinates (xA, yA) and (xB, yB), respectively. The Euclidean distance, also known as the L2-norm, is calculated as the square root of the sum of the squares of the differences between the coordinates of the two points. Think of the Pythagorean theorem, where the Euclidean distance is the hypotenuse of a right-angled triangle.

For a two-dimensional space, this is represented as:

Euclidean Distance (dE) = √((xB − xA)² + (yB − yA)²)

In the case of more than two dimensions, for example, n features for two data points A and B, the Euclidean distance is constructed in a similar fashion. Euclidean distance is also known as the "straight-line distance" because it is the shortest distance between two points, indicated by the solid line in the figure below. Manhattan distance, also known as the L1-norm, is calculated as the sum of the absolute differences between the coordinates of the two points. For a two-dimensional space, this is represented as:

Manhattan distance (dM) = |xB − xA| + |yB − yA|

Manhattan distance is named after the layout of streets in Manhattan, where streets are laid out

in a grid pattern, and the only way to travel between two points is by going along the grid lines.

Example: Calculating Euclidean and Manhattan distances

Suppose you have the following financial data for three companies:

Company P:

Feature 1: Market Capitalization = $0.5 billion

Feature 2: P/E Ratio = 9

Feature 3: Debt-to-Equity Ratio = 0.6

Company Q:

Feature 1: Market Capitalization = $2.5 billion

Feature 2: P/E Ratio = 15

Feature 3: Debt-to-Equity Ratio = 8

Company R

Feature 1: Market Capitalization = $85 billion

Feature 2: P/E Ratio = 32

Feature 3: Debt-to-Equity Ratio = 45

Calculate the Euclidean and Manhattan distances between companies P and Q in feature space

for the raw data.

Euclidean Distance

To calculate the Euclidean distance between companies P and Q in feature space for the raw

data, we first need to find the difference between each feature value for the two companies and

then square the differences. The Euclidean distance is then calculated by taking the square root

of the sum of these squared differences.

Euclidean Distance (dE) = √((0.5 − 2.5)² + (9 − 15)² + (0.6 − 8)²) = √94.76 = 9.73

Manhattan Distance

To calculate the Manhattan distance between companies P and Q in feature space for the raw data, we simply find the absolute difference between each feature value for the two companies and sum these differences. The Manhattan distance between companies P and Q in feature space is:

Manhattan Distance (dM) = |0.5 − 2.5| + |9 − 15| + |0.6 − 8| = |−2| + |−6| + |−7.4| = 2 + 6 + 7.4 = 15.4
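The two calculations can be confirmed in code (using the feature values from the example above):

```python
import math

p = (0.5, 9, 0.6)   # Company P: market cap ($bn), P/E, D/E
q = (2.5, 15, 8)    # Company Q

# Euclidean: square root of the sum of squared coordinate differences
euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Manhattan: sum of absolute coordinate differences
manhattan = sum(abs(a - b) for a, b in zip(p, q))

print(round(euclidean, 2), round(manhattan, 1))  # 9.73 and 15.4
```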

Performance Measurement for K-means

Formulas described above indicate the distance between two points A and B . It should be noted

that K-means aims to minimize the distance between each data point and its centroid rather than

to minimize the distance between data points. The data points will be closer to the centroids

when the model fits better.

Inertia, also known as the Within-Cluster Sum of Squared errors (WCSS), is a measure of the

sum of the squared distances between the data points within a cluster and the cluster's centroid.

Denoting the distance measure as di , WCSS is expressed as:

WCSS = ∑ᵢ₌₁ⁿ dᵢ²

K-means algorithm aims to minimize the inertia by iteratively reassigning data points to different

clusters and updating the cluster centroids until convergence. The final inertia value can be used

to measure the quality of the clusters produced by the K-means algorithm.

Choosing an Appropriate Value for K

Choosing an appropriate value for K can affect the performance of the K-means model. For

example, if K is set too low, the clusters may be too general and may not be truly representative of the underlying structure of the data. Similarly, if K is set too high, the clusters may be too

specific and may not represent the data’s overall structure. These clusters may not be useful for

the intended purpose of the analysis in either case. It is, therefore, important to choose K

optimally in practice.

The optimal value of K can be calculated using different methods, such as the elbow method and

the silhouette analysis. The elbow method fits the K-means model for different values of K and

plots the inertia/ WCSS for each value of K. Similar to PCA, this is called a scree-plot. It is then

examined for the obvious point on the plot where the inertia decreases more slowly as K

increases (elbow), which is chosen as the optimal value of K. In other words, it is the value that

corresponds to the “elbow” point in the scree plot.

The second approach involves fitting the K-means model for a range of values of K and

determining the silhouette coefficient for each value of K. The silhouette coefficient compares

the distance of each data point from other points in its own cluster with its distance from the

data points in the other closest cluster. In other words, it measures the similarity of a data point

to its own cluster compared to the other closest clusters. The optimal value of K is the one that

corresponds to the highest silhouette coefficient across all data points.

Advantages and Disadvantages of the K-Means Algorithm

K-means clustering is simple and easy to implement, making it a popular choice for clustering

tasks. There are some disadvantages to K-means, such as the need to specify the number of clusters in advance, which can be difficult if the dataset is not well separated. Additionally, it assumes that the clusters are

spherical and equal in size, which is not always the case in practice.

K-means algorithm is very common in investment practice. It can be used for data exploration in

high-dimensional data to discover patterns and group similar observations together.

Overfitting and Underfitting


Overfitting

Imagine that you have traveled to a new country, and the shop assistant rips you off. It is a

natural instinct to assume that all shop assistants in that country are thieves. If we are not

careful, machines can also fall into the same trap of overgeneralizing. This is known as

overfitting in ML.

Overfitting occurs when the model has been trained too well on the training data and performs

poorly on new, unseen data. An overfitted model can have too many model parameters, thus

learning the detail and noise in the training data rather than the underlying patterns. This is a

problem because it means that the model cannot make reliable predictions about new data,

which can lead to poor performance in real-world applications. The evaluation of the ML

algorithm thus focuses on its prediction error on new data rather than on its goodness of fit on

the trained data. If an algorithm is overfitted to the training data, it will have a low prediction

error on the training data but a high prediction error on new data.

The dataset to which an ML model is applied is normally split into training and validation

samples. The training data set is used to train the ML model by fitting the model parameters. On

the other hand, the validation data set is used to evaluate the trained model and estimate how

well the model will generalize into new data.

Overfitting is a severe problem in ML models, which can easily have thousands of parameters, unlike classical econometric models, which typically have only a few. Potential remedies for

overfitting include decreasing the complexity of the model, reducing features, or using

techniques such as regularization or early stopping.

Underfitting

Underfitting is the opposite of overfitting. It occurs when a model is too simple and thus not able

to capture the underlying patterns in the training data. This results in poor performance on both

the training data and new data. For example, we would expect a linear model of life satisfaction

to be prone to underfit as the real world is more complicated than the model. In this scenario,

the ML predictions are likely to be inaccurate, even on the training data.

Underfitting is more likely in conventional models because they tend to be less flexible than ML

models. The former follows a predetermined set of rules or assumptions, while ML approaches

do not follow assumptions about the structure of the model. It should be noted, however, that ML

models can still experience underfitting. This can happen when there is insufficient data to train

the model, when the data is of poor quality, and if there is excessively stringent regularization.

Regularization is an approach commonly used to prevent overfitting. It adds a penalty to the

model as the complexity of the model increases. If the regularization is set too high, it can cause

the model to underfit the data. Potential remedies for addressing underfitting include increasing

the complexity of the model, adding more features, or increasing the amount of training data.

Bias-Variance Tradeoff

The complexity of the ML model, which determines whether the data is over, under, or well-

fitted, involves a phenomenon called bias-variance tradeoff. Complexity refers to the number of

features in a model and whether a model is linear or non-linear (with non-linear being too

complex). Bias occurs when a complex model is approximated with a simpler model, i.e., by

omitting relevant factors and interactions. A model with highly biased predictions is likely to be

oversimplified and thus results in underfitting. Variance refers to how sensitive the model is to small fluctuations in the training data. A model with high variance in predictions is likely to be overly complex and thus results in overfitting.

The figure below illustrates how bias and variance are affected by model complexity.

Sample Splitting and Preparation

Data Preparation

There is a tendency for ML algorithms to perform poorly when the variables have very different

scales. For example, there is a vast difference in the range between income and age. A person’s

income ranges in the thousands while their age ranges in the tens. Since ML algorithms only see

numbers, they will assume that higher-ranging numbers (income in this case) are superior, which

is false. It is, therefore, crucial to have values in the same range. Standardization and

normalization are two methods for rescaling variables.

Standardization involves centering and scaling variables. Centering is where the variable’s mean

value is subtracted from all observations on that variable (so standardized values have a mean of

0). Scaling is where the centered values are divided by the standard deviation so that the

distribution has a unit variance. This is expressed as follows:

xᵢ(standardized) = (xᵢ − μ) / σ

Normalization, also known as min-max scaling, entails rescaling values from 0 to 1. This is done

by subtracting the minimum value (x min ) from each observation and dividing by the difference

between the maximum (x max) and minimum values (xmin ) of X. This is expressed as follows:

xᵢ(normalized) = (xᵢ − xmin) / (xmax − xmin)

The preferable rescaling method depends on the data characteristics:

Standardization is used when the data includes outliers. This is because normalization

would compress data points into a narrow range of 0 − 1 , which would be

uncharacteristic of the original data.

Data must be normally distributed for standardization to be used, whereas

normalization can be used when the data distribution is unknown.
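Both rescalings are straightforward to implement (a minimal sketch; the income figures are illustrative):

```python
import statistics

incomes = [32_000, 45_000, 51_000, 58_000, 75_000]

def standardize(xs):
    """Center on the mean and scale by the (population) standard deviation."""
    mu, sigma = statistics.fmean(xs), statistics.pstdev(xs)
    return [(x - mu) / sigma for x in xs]

def normalize(xs):
    """Min-max scaling: rescale values into the range 0 to 1."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

z = standardize(incomes)
m = normalize(incomes)
print([round(v, 2) for v in z])
print([round(v, 2) for v in m])
```

After standardization the values have mean 0 and unit variance; after normalization the smallest value maps to 0 and the largest to 1.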

Data Cleaning

This is a crucial component of ML and may be the difference between an ML's success and

failure. Data cleaning is necessary for the following reasons:

Missing data: Analysts encounter this issue very often. Missing data can be dealt with in the following ways. First, observations with only a small number of missing values can be removed. Second, missing values can be replaced with the mean or median of the non-missing observations. Lastly, it may be possible to estimate the missing values based on observations of other features.

Inconsistent recording: It is important to record data consistently so that it can be

read correctly and easily used.

Unwanted observations: Observations that are not relevant to the specific task

should be removed. The result is a more efficient analysis and a reduction in

distractions.

Duplicate observations: Duplicate data points should be removed to avoid biases.

Problematic features: Feature values that lie many standard deviations from the mean should be carefully monitored, as they can be problematic.

Training, Validation, and Test Datasets

We briefly discussed the training and validation data sets, which are in-sample datasets.

Additionally, there is an out-of-sample dataset, which is the test data. The training dataset

teaches an ML model to make predictions, i.e., it learns the relationships between the input data

and the desired output. A validation dataset is used to evaluate the performance of an ML model

during the training process. It compares the performance of different models so as to determine

which one generalizes (fits) best to new data. A test dataset is used to evaluate an ML model’s

final performance and identify any remaining issues or biases in the model. The performance of a

good ML model on the test dataset should be relatively similar to the performance on the

training dataset. However, the training and test datasets may perform differently, and perfect

generalization may not always be possible.

It is up to the researchers to decide how to subdivide the available data into the three samples. A

common rule of thumb is to use two-thirds of the sample for training and the remaining third to

be equally split between validation and testing. The subdivision of the data will be less crucial

when the overall data points are large. Using a small training dataset can introduce biases into

the parameter estimation because the model will not have enough data to learn the underlying

patterns in the data accurately. Using a small validation dataset can lead to inaccurate model

evaluation because the model may not have enough data to assess its performance accurately;

thus, it will be hard to identify the best specification. When subdividing the data into training,

validation, and test datasets, it is crucial to consider the type of data you are working with.

For cross-sectional data, it is best to divide the dataset randomly, as the data has no natural

ordering (i.e., the variables are not related to each other in any specific order). For time series

data, it is best to divide the data in chronological order, starting with the training data, followed by the validation data, and then the testing data.
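A chronological split following the rule of thumb above might be sketched as follows (the data and proportions are illustrative):

```python
# Stand-in for 300 chronologically ordered observations
observations = list(range(300))

n = len(observations)
n_train = (2 * n) // 3          # two-thirds for training
n_val = (n - n_train) // 2      # remaining third split equally

# Chronological order preserved: train first, then validation, then test
train = observations[:n_train]
validation = observations[n_train:n_train + n_val]
test = observations[n_train + n_val:]

print(len(train), len(validation), len(test))  # 200 50 50
```

For cross-sectional data, the indices would instead be shuffled before splitting, since such data has no natural ordering.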

Cross-validation Searches

Cross-validation can be used when the overall dataset is insufficient to be divided into training,

validation, and testing datasets. In cross-validation, training and validation datasets are

combined into one sample, and the testing dataset is excluded. The combined data is then equally split into k sub-samples, with a different sub-sample left out each time as the validation dataset. This technique is known as k-fold cross-validation: the model is trained and evaluated k times, each time using the left-out sub-sample for validation. The values k = 5 and k = 10 are commonly chosen for k-fold cross-validation.
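Generating the k folds can be sketched as follows (the helper function and the choice n = 20, k = 5 are illustrative):

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k roughly equal folds; each fold serves
    once as the validation set while the remaining folds form the
    training set."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    splits = []
    for i in range(k):
        val = folds[i]
        train = [idx for j, f in enumerate(folds) if j != i for idx in f]
        splits.append((train, val))
    return splits

splits = k_fold_indices(n=20, k=5)
for train, val in splits:
    print(len(train), len(val))  # 16 4 each time
```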

Reinforcement Learning (RL)

Reinforcement learning involves training an agent to make a series of decisions in an

environment to maximize a reward. The agent is given feedback as either a reward or

punishment depending on its actions. It then uses the feedback to learn the actions that are

likely to generate the highest reward. The algorithm learns through trial and error by playing

many times against itself.

How Reinforcement Learning Operates

Define the Environment

The environment consists of the state space, action space, and the reward function. The state

space is the set of all possible states in which the agent can be. On the other hand, the action

space consists of a set of actions that the agent can take. Lastly, the reward function defines the

feedback that the agent receives for taking a particular action in a given state space.

Initialize the Agent

This involves specifying the learning algorithm and any relevant parameters. The agent is then placed in the initial state of the environment.

Take an Action

The agent chooses an action depending on its current state and the learning algorithm. This

action is then taken in the environment, which may lead to a change of state and a reward. At

any given state, the algorithm can choose between taking the best-known course of action (exploitation) and trying a new action (exploration). Exploitation is assigned probability p and exploration probability 1 − p; p increases as more trials are concluded and the algorithm learns more about the best strategy.

Update the Agent

Based on the agent’s reward and the environment’s new state, it updates its internal state. This

update is carried out using some form of the optimization algorithm.

Repeat the Process

The agent continues to take actions and update its internal state until it reaches a predefined

number of iterations or a terminal state is reached.

Monte Carlo vs. Temporal Difference Methods for Reinforcement Learning

The Monte Carlo method estimates the value of a state or action based on the final reward

received at the end of an episode. On the other hand, the temporal difference method updates

the value of a state or action by looking at only one decision ahead when updating strategies.

An estimate of the expected value of taking action A in state S, after several trials, is denoted as Q(S, A). After each trial, this estimate is updated as:

Qnew(S, A) = Qold(S, A) + α[R − Qold(S, A)]

Where α is the learning rate, a parameter that determines how much the agent updates its Q-value based on the difference between the expected and actual reward.

Example: Reinforcement Learning

Suppose that we have three states (S1 , S 2, S3) and two actions (A1, A2) , with the following

Q(S, A) values:

S1 S2 S3
A1 0.3 0.4 0.5
A2 0.7 0.6 0.5

Monte-Carlo Method

Suppose that on the next trial, Action 2 is taken in State 3, and the total subsequent reward is

1.0. If α = 0.075, the Monte Carlo method would lead to Q(3, 2) being updated from 0.5 to:

Qnew(S3, A2) = Qold(S3, A2) + 0.075 × (1.0 − Qold(S3, A2))

= 0.5 + 0.075 × (1.0 − 0.5) = 0.5375

Temporal Difference Method

Suppose the next decision on the trial under consideration is made when we are in State 2, and

a reward of 0.3 is earned between the two decisions.

The value of being in State 2, Action 2 is 0.6. The temporal difference method would lead to

Q(3, 2) being updated from 0.5 to:

0.5 + 0.075(0.3 + 0.6 − 0.5) = 0.53

Potential Applications of Reinforcement Learning in Finance

1. Trading: Reinforcement learning algorithms can learn from past data and market

dynamics to make informed decisions on when to buy and sell, possibly optimizing the

trading of financial instruments, including stocks, bonds, and derivatives.

2. Detecting fraud: RL can be used to detect fraudulent activity in financial transactions.

This algorithm learns from past data and hence adapts to new fraud patterns. This means

that the algorithm becomes better at detecting and preventing fraud with time.

3. Credit scoring: RL can be used to predict the probability of a borrower defaulting on a

loan. The algorithm can be trained on historical data about borrowers and their credit

histories to achieve this.

4. Risk management: RL can be trained using past data to identify and mitigate financial

risks.

5. Portfolio optimization: RL can be trained to take actions that modify the allocation of

assets in the portfolio over time, with the aim of maximizing portfolio returns and

minimizing risks.

Natural Language Processing

Natural language processing (NLP) focuses on helping machines process and understand human

language.

Steps Involved in NLP Process

The main steps in the NLP process are outlined below:

1. Data collection: Involves acquiring data from various sources, including financial

statements, news articles, social media posts, etc.

2. Data preprocessing: The raw textual data is cleaned, formatted, and transformed into a

form suitable for computer usage. Tasks such as tokenization, stemming, and stop word

removal can be carried out at this stage.

3. Feature extraction: This involves extracting relevant features from the preprocessed

data. It may involve extracting financial metrics, sentiments, and other relevant

information.

4. Model training: This involves training the machine learning model using the extracted

features.

307
© 2014-2024 AnalystPrep.
5. Model evaluation: This involves evaluating the performance of the trained model to

ensure it generates accurate and reliable predictions. Techniques such as cross-

validation can be employed here. Model evaluation is carried out on the test dataset.

6. Model deployment: The evaluated model is then deployed for use in real-world

investment scenarios.

Data Preprocessing

Textual data (unstructured data) is more suitable for human consumption than for

computer processing. Unstructured data thus needs to be converted to structured data through

cleaning and preprocessing, a process called text processing. Text cleansing involves

removing HTML tags, punctuation, numbers, and white spaces (e.g., tabs and indents).

The next step is text wrangling (preprocessing) which involves the following:

1. Tokenization: Involves separating a piece of text into smaller units called tokens. It

allows the NLP model to analyze the textual data more easily by breaking it down into

individual units that can be more easily processed.

2. Lowercasing: To avoid discriminating between “stock” and “Stock.”

3. Removing stop words: These are words with no informational value, e.g., as, the, is,

used as sentence connectors. They are eliminated to reduce the number of tokens in the

training data.

4. Stemming: Reduces all the variations of a word into a common value (base form/stem):

For example, “earned,” “earnings,” and “earning” are all assigned a common value of

earn. It only removes the suffixes of words.

5. Lemmatization: Involves reducing words to their base form/lemma to identify related

words. Unlike stemming, lemmatization incorporates the full structure of the word and

uses a dictionary or morphological analysis to identify the lemma. It generates more

accurate base forms of words. However, it is more computationally expensive compared

to stemming.

6. Consider “n-grams:” These are words that need to be placed together to give a specific

meaning. For example, “strong earnings,” “negative outlook,” or “market uncertainty.”
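The cleansing and wrangling steps above can be sketched with the standard library alone (a minimal illustration; real pipelines typically use libraries such as NLTK or spaCy, and the stop-word list below is a tiny stand-in):

```python
import re

STOP_WORDS = {"the", "is", "as", "a", "of", "and"}  # tiny illustrative list

def preprocess(text):
    # Text cleansing: drop punctuation and numbers, keep letters and spaces
    text = re.sub(r"[^A-Za-z\s]", " ", text)
    # Lowercasing, then tokenization into word-level tokens
    tokens = text.lower().split()
    # Stop-word removal
    return [t for t in tokens if t not in STOP_WORDS]

def ngrams(tokens, n=2):
    # n-grams: adjacent token groups such as "strong earnings"
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = preprocess("The company reported strong earnings, up 12%.")
print(tokens)              # ['company', 'reported', 'strong', 'earnings', 'up']
print(ngrams(tokens)[:2])  # ['company reported', 'reported strong']
```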

Finance professionals can leverage NLP to derive insights from large volumes of textual data to make

more informed decisions. The following are some applications of NLP.

Trading: NLP can be employed to analyze real-time financial data, e.g., stock prices, to

derive trends and patterns that could be used to inform investment decisions.

Risk management: NLP can be used to identify possible risks in financial contracts

and regulatory filings. For example, identifying language that implies a high level of

risk, or wordings/clauses that could be interpreted differently by different parties.

News analysis: NLP can be used to derive information from news articles and other

sources of financial information, e.g., earnings reports. The resulting information can

then be used to monitor companies’ performance and identify potential investment

opportunities.

Sentiment analysis: NLP can be used to measure the public opinion of a company,

industry, or market trend by analyzing sentiments on social media posts and news

articles. Investors can use this information to make more informed investment

decisions. Investors can classify the text as positive, negative, or neutral based on the

sentiment expressed in the text.

Customer service: NLP can be employed in chatbots to aid companies in responding

to customer queries faster and more efficiently.

Detect accounting fraud: For example, to detect accounting fraud, the Securities and

Exchange Commission (SEC) analyzed large amounts of publicly available corporate

disclosure documents to identify patterns in language that indicated fraud.

Text classification: This is the process of assigning text data to prespecified

categories. For example, text classification could involve assigning newswire

statements based on the news they represent, e.g., education, financial, environmental,

etc.

Practice Question

Which of the following is least likely a task that can be performed using natural

language processing?

A. Sentiment analysis.

B. Text translation.

C. Image recognition.

D. Text classification.

Solution

The correct answer is C.

Image recognition is not a task that can be performed using NLP. This is because NLP

is focused on understanding and processing text, not images.

A is incorrect: NLP can be used for sentiment analysis. For example, NLP can be

used to measure the public opinion of a company, industry, or market trend by

analyzing sentiments on social media posts.

B is incorrect: Financial documents may need to be translated into different

languages to reach a global audience.

D is incorrect: Text classification is the process of assigning text data to

prespecified categories. For example, text classification could involve assigning

newswire statements based on the news they represent, e.g., education, financial,

environmental, etc.

Reading 26: Machine Learning and Prediction

After completing this reading, you should be able to:

Explain the role of linear regression and logistic regression in prediction.

Understand how to encode categorical variables.

Discuss why regularization is useful and distinguish between the ridge regression and

LASSO approaches.

Show how a decision tree is constructed and interpreted.

Describe how ensembles of learners are built.

Outline the intuition behind the K nearest neighbors and support vector machine

methods for classification.

Understand how neural networks are constructed and how their weights are

determined.

Evaluate the predictive performance of logistic regression models and neural network

models using a confusion matrix.

Role of Linear and Logistic Regression in Prediction

Linear Regression (Ordinary Least Squares)

Linear regression models the relationship between a dependent variable and one or more

independent variables by fitting a linear equation to the observed data. It works by finding the

line of the best fit through the data points. This line is called a regression line, and it is straight.

The equation of the best fit can then be used to make predictions about the dependent variable

based on new values of the independent variables.

The regression line can be expressed as follows:

y = α + β₁x₁ + β₂x₂ + … + βₙxₙ

Where:

y = Dependent variable.

α = Intercept.

x₁, x₂, …, xₙ = Independent variables.

β₁, β₂, …, βₙ = Multiple regression coefficients.

The coefficients show the effect of each independent variable on the dependent variable and are

calculated based on the data.

The Cost Function for Linear Regression

Training any machine learning model aims to minimize the cost (loss) function. A cost function

measures the inaccuracy of the model predictions. It is the sum of squared residuals (RSS) for a

linear regression model. This is the sum of the squared difference between the actual and

predicted values of the response (dependent variable).

RSS = ∑ᵢ₌₁ⁿ (yᵢ − α − ∑ⱼ₌₁ᵏ βⱼxᵢⱼ)²

Where xᵢⱼ is the ith observation of the jth variable, and k is the number of independent variables.

To measure how well the data fits the line, take the difference between each actual data point (y)

and the model's prediction (ŷ). The differences are then squared to eliminate negative numbers

and penalize larger differences. The squared differences are then added up, and an average is

taken.
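The RSS cost function above can be computed directly; a minimal sketch:

```python
def rss(y_actual, y_predicted):
    # Residual sum of squares: squared gaps between observed values and
    # the model's predictions, summed over all observations
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_actual, y_predicted))

print(rss([8, 15, 18], [9, 14, 18]))  # (-1)^2 + 1^2 + 0^2 = 2
```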

The advantage of linear regression is that it is easy to understand and interpret. However, it has

the following limitations:

It assumes a linear relationship between the dependent and independent variables.

It assumes that residuals (the difference between observed and predicted values) are

normally distributed and have a constant variance.

It is prone to overfitting.

It assumes that there is no multicollinearity.

Example: Prediction using Linear Regression

Aditya Khun, an investment analyst, wants to predict the return on a stock based on its P/E ratio

and the market capitalization of the company using linear regression in machine learning. Khun

has access to the P/E ratio and market capitalization dataset for several stocks, along with their

corresponding returns. Khun can employ linear regression to model the relationship between the

return on a stock and its P/E ratio and market capitalization. The following equation represents

the model:

Return = β0 + β1 P/E ratio + β2 Market capitalization

Where:

Return = Dependent variable.

P/E ratio and market capitalization = Independent variables.

β0 = Intercept.

β1 and β2 are the coefficients of the model.

The first step of fitting a linear regression model is estimating the values of the coefficients β0 , β1

, and β2 using the training data. Coefficients that minimize the sum of the squared residuals are

determined.

Suppose we have the following data for 6 stocks:

Stock   P/E Ratio   Market cap ($millions)   Return
1       9           200                      8%
2       11          300                      15%
3       14          400                      18%
4       16          500                      19%
5       18          600                      23%
6       20          700                      27%

Given the following parameters and coefficients:

Intercept = 3.432.

P/E Ratio coefficient = −0.114.

Market cap coefficient = 0.0368.

The prediction equation is expressed as follows:

Return = 3.432 − 0.114 × P/E ratio + 0.0368 × Market capitalization

Given a P/E ratio of 14 and a market capitalization of $150M, the return of the stock can be

determined as follows:

Return = 3.432 − 0.114 × 14 + 0.0368 × 150 = 7.356%
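The prediction equation can be wrapped in a small function (the intercept and slope coefficients are the fitted values given above; they are not re-estimated here):

```python
def predict_return(pe_ratio, market_cap,
                   intercept=3.432, b_pe=-0.114, b_cap=0.0368):
    # Plug the features into the fitted regression line
    return intercept + b_pe * pe_ratio + b_cap * market_cap

print(round(predict_return(14, 150), 3))  # 7.356
```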

Logistic Regression

When using a linear regression model for binary classification, where the dependent variable Y

can only be 0 or 1, the model can predict probabilities outside the range of 0 to 1. This occurs

because the model attempts to fit a straight line to the data, and the predicted values may not be

restricted to the valid range of probabilities. As a result, the model may produce predictions that

are less than zero or greater than one. To avoid this issue, it may be necessary to use a different

type of model, such as logistic regression, which is specifically designed for binary classification

tasks and ensures that the predicted probabilities are within the valid range. This is achieved by

applying a sigmoid function. The sigmoid function graph is shown in the figure below.

Logistic regression is used to forecast a binary outcome. In other words, it predicts the likelihood

of an event occurring based on independent variables, which can be categorical or continuous.

The logistic regression model is expressed as:

F(yⱼ) = e^(yⱼ) / (1 + e^(yⱼ))

Where:

yⱼ = α + β₁x₁ⱼ + β₂x₂ⱼ + … + βₘxₘⱼ

α = Intercept term.

β₁, …, βₘ = Coefficients that must be learned from the training data.

The probability that yⱼ = 1 is expressed as:

pⱼ = e^(yⱼ) / (1 + e^(yⱼ))

The probability that yⱼ = 0 is (1 − pⱼ).

The Cost Function for Logistic Regression

This measures how often we predicted zero when the true answer was one and vice versa. The

logistic regression coefficients are trained using techniques such as maximum likelihood

estimation (MLE) to predict values close to 0 and 1. MLE works by selecting the values of the

model parameters (α and the βs) that maximize the likelihood of the training data occurring. The

likelihood function is a mathematical function that describes the probability of the observed data

given the model parameters. By maximizing the likelihood function, we can find the values of the

parameters most likely to have produced the observed data. This can be expressed as:

∏ⱼ₌₁ⁿ F(yⱼ)^(yⱼ) (1 − F(yⱼ))^(1−yⱼ)

It is often easier to maximize the log-likelihood function, log(L), than the likelihood function

itself. The log-likelihood function is obtained by taking the natural logarithm of the likelihood

function:

log(L) = ∑ⱼ₌₁ⁿ [yⱼ log(F(yⱼ)) + (1 − yⱼ) log(1 − F(yⱼ))]

Once the model parameters (α and the βs) that maximize the log-likelihood function have been

estimated using MLE, predictions can be made using the logistic regression model. To make

predictions, a threshold value Z is chosen. If the predicted probability pⱼ is greater than or equal

to the threshold Z, the model predicts the positive outcome (yⱼ = 1); if pⱼ is less than the

threshold Z, the model predicts the negative outcome (yⱼ = 0). This is expressed as:

yⱼ = 1 if pⱼ ≥ Z, and yⱼ = 0 if pⱼ < Z

Example: Using Logistic Regression to Predict Loan Default

A credit analyst wants to predict whether a customer will default on a loan based on their credit

score and debt-to-income ratio. He gathers a dataset of 500 customers, with their corresponding

credit scores, debt-to-income ratio, and whether they defaulted on the loan. He then splits the

data into training and test sets and uses the training data to train a logistic regression model.

The model learns the following relationship between the independent variables (input features)

and dependent variables (loan default):

Probability of default = e^z / (1 + e^z), where z = −10 + 0.012 × Credit score + 0.4 × Debt-to-income

The above expression calculates the probability that the customer will default on the loan, given

their credit score and debt-to-income ratio.

So, if the credit score is 650 and the debt-to-income ratio is 0.6, the probability of default will be

calculated as:

z = −10 + 0.012 × 650 + 0.4 × 0.6 = −1.96

Probability of default = e^(−1.96) / (1 + e^(−1.96)) ≈ 12%

So there is a 12% probability that the customer will default on the loan. One can then use a

threshold (such as 50%) to convert this probability into a binary prediction (either “default” or

“no default”). Since 12% < 50%, we can classify this as “no default.”
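The default-probability calculation and the threshold rule can be sketched as follows (the coefficients are the ones given in the example):

```python
import math

def default_probability(credit_score, dti):
    # Linear score passed through the sigmoid, which keeps output in (0, 1)
    z = -10 + 0.012 * credit_score + 0.4 * dti
    return math.exp(z) / (1 + math.exp(z))

def classify(p, threshold=0.5):
    # Convert the probability into a binary prediction
    return "default" if p >= threshold else "no default"

p = default_probability(650, 0.6)
print(round(p, 4))  # ≈ 0.1235
print(classify(p))  # no default
```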

Applications of Logistic Regression

Logistic regression is applied for prediction and classification tasks in machine learning. For

example, you could use logistic regression to classify stock returns as either “positive” or

“negative” based on a set of input features that you choose. It is simple to implement and

interpret. However, it assumes a linear relationship between the dependent and independent

variables and requires a large sample size to achieve stable estimates of the coefficients.

Encoding Categorical Variables

Categorical data refers to information presented in groups and can take on values that are

names, attributes, or labels. It is not in a numerical format. For example, a given set of stocks

can be categorized as either growth or value stocks depending on the investment style. Many ML

algorithms struggle to deal with such data.

It isn't easy to transform categorical variables, especially non-ordinal categorical data, where the

classes are not in any order. Mapping or encoding involves transforming non-numerical

information into numbers. One-hot encoding is the most common solution for dealing with non-

ordinal categorical data. It involves creating a new dummy variable for each group of the

categorical feature and encoding the categories as binary. Each observation is marked as either

belonging (Value=1) or not belonging (Value=0) to that group.

Example: One-hot Encoding for Sector, Industry.

                  Utilities  Technology  Transportation  Internet  Airlines  Electric
Meta                  0          1             0            1         0         0
Energy                1          0             0            0         0         1
Alibaba               0          1             0            1         0         0
Virgin Atlantic       0          0             1            0         1         0

For ordered categorical variables, for example, where a candidate's grades are specified as

either poor, good, or excellent, a dummy variable that equals 0 for poor, 1 for good, and 2 for

excellent can be used.
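One-hot encoding can be sketched without any ML library (a minimal illustration; in practice a routine such as pandas' get_dummies is typically used):

```python
def one_hot(labels, categories=None):
    # One dummy column per category; each row has a single 1 marking
    # the group the observation belongs to
    if categories is None:
        categories = sorted(set(labels))
    return [[1 if label == cat else 0 for cat in categories] for label in labels]

styles = ["Growth", "Value", "Growth", "Value"]
print(one_hot(styles))  # [[1, 0], [0, 1], [1, 0], [0, 1]]
```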

If an intercept term and correlated dummy variables are included in a model, the dummy

variable trap may be encountered. This means that the model will have multiple possible

solutions, and we cannot find a unique best-fit solution. To address this issue, techniques such as

regularization can be used. These approaches penalize the magnitude of the coefficients of the

model, which can help to reduce the impact of correlated variables and prevent the dummy

variable trap from occurring.

Regularization

Regularization is a technique that prevents overfitting in machine learning models by penalizing

large coefficients. It adds a penalty term to the model's objective function, encouraging the

coefficients to take on smaller values. This reduces the impact of correlated variables, as it

forces the model to rely more on the overall pattern of the data and less on the influence of any

single variable. It improves the generalization of the model to new, unseen data.

Regularization requires the data to be normalized or standardized. Normalization is a method of

scaling the data to have a minimum value of 0 and a maximum value of 1. On the other hand,

standardization involves scaling the data so that it has a mean of zero and a standard deviation

of one. Ridge regression and the least absolute shrinkage and selection operator (LASSO)

regression are the two commonly used regularization techniques.

Ridge Regression

Ridge regression, sometimes known as L2 regularization, is a type of linear regression that is

used to analyze data and make predictions. It is similar to ordinary least squares regression but

includes a penalty term that constrains the size of the model's coefficients. Consider a dataset

with n observations on each of k features in addition to a single output variable y and, for

simplicity, assume that we are estimating a standard linear regression model with hats above

parameters denoting their estimated values. The relevant objective function (referred to as a loss

function) in ridge regression is:

L = (1/n) ∑ⱼ₌₁ⁿ (yⱼ − α̂ − β̂₁x₁ⱼ − β̂₂x₂ⱼ − … − β̂ₖxₖⱼ)² + λ ∑ᵢ₌₁ᵏ (β̂ᵢ)²

or

L = RSS + λ ∑ᵢ₌₁ᵏ (β̂ᵢ)²

The first term in the expression is the residual sum of squares, which measures how well the

model fits the data. The second term is the shrinkage term, which introduces a penalty for large

slope parameter values. This is known as regularization, and it helps to prevent overfitting,

which is when a model fits the training data too well and performs poorly on new, unseen data.

The parameter λ is a hyperparameter, which means that it is not part of the model itself but is

used to determine the model. In this case, it controls the relative weight given to the shrinkage

term versus the model fit term. It is essential to tune the value of λ, or perform hyperparameter

optimization, to find the best value for the given situation. ∝


^ and β̂i are the model parameters,

while λ is a hyperparameter.
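A minimal sketch of ridge estimation via its closed-form solution (the features are assumed standardized and the target centered, so no intercept is estimated; the toy data below is invented for illustration):

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Closed-form ridge estimator: beta_hat = (X'X + lam*I)^(-1) X'y
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

# Two highly correlated features, standardized; centered target
X = np.array([[1.0, 0.9], [2.0, 2.1], [3.0, 2.9], [4.0, 4.2]])
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = np.array([1.1, 1.9, 3.2, 3.8])
y = y - y.mean()

beta_ols = ridge_fit(X, y, lam=0.0)    # lam = 0 reduces to OLS
beta_ridge = ridge_fit(X, y, lam=5.0)  # lam > 0 shrinks the coefficients
print(np.linalg.norm(beta_ridge) < np.linalg.norm(beta_ols))  # True
```

Setting lam to zero recovers the OLS solution, which makes the shrinkage effect easy to see side by side.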

Least Absolute Shrinkage and Selection Operator (LASSO)

LASSO regression, sometimes known as L1 regularization, is similar to ridge regression in that it

introduces a penalty term to the objective function to prevent overfitting. However, the penalty

term in LASSO regression takes the form of the absolute value of the coefficients rather than the

square of the coefficients as in ridge regression.

L = (1/n) ∑ⱼ₌₁ⁿ (yⱼ − α̂ − β̂₁x₁ⱼ − β̂₂x₂ⱼ − … − β̂ₖxₖⱼ)² + λ ∑ᵢ₌₁ᵏ |β̂ᵢ|

Also expressed as:

L = RSS + λ ∑ᵢ₌₁ᵏ |β̂ᵢ|

In ridge regression, the values of α̂ and the β̂ᵢ can be determined analytically using closed-form

solutions. This means that the values of the coefficients can be calculated directly, without the

need for iterative optimization. On the other hand, LASSO does not have closed-form solutions

for the coefficients, so a numerical optimization procedure must be used to determine the values

of the parameters.

Ridge regression and LASSO have a crucial difference. Ridge regression adds a penalty term that

reduces the magnitude of the β parameters and makes them more stable. The effect of this is to

“shrink” the β parameters towards zero, but not all the way to zero. This can be especially useful

when there is multicollinearity among the variables, as it can help to prevent one variable from

dominating the others.

However, LASSO sets some of the less important β parameters to exactly zero. The effect of this

is to perform feature selection, as the β parameters corresponding to the least important

features will be set to zero. In contrast, the β parameters corresponding to the more important

features will be retained. This can be useful in cases where the number of variables is very large,

and some variables are irrelevant or redundant. The choice between LASSO and ridge regression

depends on the specific needs of the model and the data at hand.

Elastic Net

Elastic net regularization is a method that combines the L1 and L2 regularization techniques in a

single loss function:

L = (1/n) ∑ⱼ₌₁ⁿ (yⱼ − α̂ − β̂₁x₁ⱼ − β̂₂x₂ⱼ − … − β̂ₖxₖⱼ)² + λ₁ ∑ᵢ₌₁ᵏ (β̂ᵢ)² + λ₂ ∑ᵢ₌₁ᵏ |β̂ᵢ|

or

L = RSS + λ₁ ∑ᵢ₌₁ᵏ (β̂ᵢ)² + λ₂ ∑ᵢ₌₁ᵏ |β̂ᵢ|

By adjusting λ 1 and λ2 , which are hyperparameters, it is possible to obtain the advantages of

both L1 and L2 regularization. These advantages include decreasing the magnitude of some

parameters and eliminating some unimportant ones. This can help to improve the model's

performance and the accuracy of its predictions.

Example: Regularization

Table 1: OLS, Ridge, and LASSO Regression Estimates

Feature     OLS      Ridge       Ridge       LASSO        LASSO
                     (λ = 0.1)   (λ = 0.5)   (λ = 0.01)   (λ = 0.1)
Intercept   6.27     2.45        2.33        2.40         2.29
1           −20.02   −6.23       −1.90       −1.20        0
2           51.53    9.99        2.32        1.19         0.50
3           −32.45   −2.41       −0.43       0            0
4           10.01    0.89        0.51        0            0
5           −5.92    −1.64       −1.22       −1.01        0

OLS regression determines the coefficients of the model by minimizing the sum of the squared

residuals (RSS). Note that it does not incorporate any regularization and can therefore lead to

significant coefficients and overfitting. On the other hand, ridge regularization adds a penalty

term to RSS. The penalty term is determined as the sum of the squared coefficient values,

multiplied by λ, which is regarded as a hyperparameter. The hyperparameter controls the

strength of the penalty and can be adjusted to find an optimal balance between the model's

fitness and the model's simplicity. Notice that as λ increases, the penalty term becomes more

influential, and the coefficient values become smaller.

As discussed earlier, LASSO uses the sum of the absolute values of the coefficients as the penalty

term. This leads to some coefficients being reduced to zero, which eliminates unnecessary

features from the model. Notice the same from the table above. Similar to ridge regression, the

strength of the penalty can be modified by adjusting the value of λ.

Choosing the value of the hyperparameter in a regularized regression model is an important step

in the modeling process, as it can significantly impact the model's performance. One common

approach to selecting the value of the hyperparameter is to use cross-validation, which involves

splitting the data into a training set, a validation set, and a test set. This was discussed in detail

in Chapter 14. The training set is used to fit the model and determine the coefficients for

different values of λ . The validation set determines how well the model generalizes to new data.

The test set is used to evaluate the final performance of the model and provide an unbiased

estimate of the model's accuracy.
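A simple validation-set search for λ can be sketched as follows (the synthetic data, true coefficients, and candidate grid are invented for illustration; the ridge estimator assumes standardized features and no intercept):

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Closed-form ridge estimator: (X'X + lam*I)^(-1) X'y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def validation_rss(X_tr, y_tr, X_val, y_val, lam):
    # Fit on the training set, then score the fit on the validation set
    resid = y_val - X_val @ ridge_fit(X_tr, y_tr, lam)
    return float(resid @ resid)

# Synthetic data: 60 observations, 3 features, known coefficients plus noise
rng = np.random.default_rng(42)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=60)

X_tr, X_val = X[:40], X[40:]   # training set
y_tr, y_val = y[:40], y[40:]   # validation set (test set would be held out separately)

grid = [0.0, 0.1, 1.0, 10.0, 100.0]
best_lam = min(grid, key=lambda lam: validation_rss(X_tr, y_tr, X_val, y_val, lam))
print("chosen lambda:", best_lam)
```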

Decision Trees

A decision tree is a supervised machine-learning technique that can be used to predict either a

categorical target variable (producing a classification tree) or a continuous target variable

(producing a regression tree). It creates
a tree-like decision model based on the input features. At each internal node of the tree, there is

a question, and the algorithm makes a decision based on the value of one of the features. It then

branches an observation to another node or a leaf. A leaf is a terminal node that leads to no

further nodes. In other words, the decision tree includes the initial root node, decision nodes,

and terminal nodes.

Classification and Regression Tree (CART) is a decision tree algorithm commonly used for

supervised learning tasks, such as classification and regression. One of the main benefits of

CART is that it is highly interpretable, meaning it is easy to understand how the model makes

predictions. This is because CART models are built using a series of simple decision rules that

are easy to understand and follow. For this reason, CART models are often referred to as “white-

box models,” in contrast to other techniques like neural networks, which are often referred to as

“black-box models.” Neural networks are more challenging to interpret because they are based

on complex mathematical equations that are not as easy to understand and follow.

The following is a visual representation of a simple model for predicting whether a company will

issue dividends to shareholders based on the company's profits:

When building a decision tree, the goal is to create a model that can accurately predict the value

of a target variable based on the importance of other features in the dataset. To do this, the

decision tree must decide which features to split on at each node of the tree. The tree is

constructed by starting at the root node and recursively partitioning the data into smaller and

smaller groups based on the values of the chosen features. We use a measure called information

gain to determine which feature to split at each node.

Information gain measures how much uncertainty or randomness is reduced by obtaining

additional information about the feature. In other words, it measures how much the feature helps

us predict the target variable.

There are two commonly used measures of information gain: entropy and the Gini coefficient.

Both of these measures are used to evaluate the purity of a node in the decision tree. The goal is

to choose the feature that results in the most significant reduction in entropy or the Gini

coefficient, as this will be the most helpful feature in predicting the target variable.

For a binary outcome, entropy ranges from 0 to 1, with 0 representing a completely ordered or

predictable system and 1 representing a completely random or unpredictable system. It is expressed as:

Entropy = −∑ᵢ₌₁ᴷ pᵢ log₂(pᵢ)

Where K is the total number of possible outcomes and pᵢ the probability of that outcome. The

logarithm used in the formula is typically the base-2 logarithm, also known as the binary

logarithm.

The Gini measure is expressed as:

Gini = 1 − ∑ᵢ₌₁ᴷ pᵢ²
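Both impurity measures are one-liners; a sketch (0 · log₂ 0 is treated as 0 so that a pure group has zero entropy):

```python
import math

def entropy(probs):
    # Shannon entropy in bits over a list of class probabilities
    return -sum(p * math.log2(p) for p in probs if p > 0)

def gini(probs):
    # Gini impurity: 1 minus the sum of squared class probabilities
    return 1 - sum(p ** 2 for p in probs)

print(round(entropy([0.5, 0.5]), 4))  # 1.0 (maximum uncertainty, binary case)
print(round(entropy([3/8, 5/8]), 4))  # 0.9544
print(round(gini([0.5, 0.5]), 4))     # 0.5
```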

Example: Building a Decision-Tree Model to Classify Credit Card Holders as High Risk or Low Risk

A credit card company is building a decision-tree model to classify credit card holders as high-

risk or low-risk for defaulting on their payments. They have the following data on whether a

credit card holder has defaulted (“Defaulted”) and two features (for the label and the features, in

each case, “yes” = 1 and “no” = 0): whether the credit card holder has a high income and

whether they have a history of late payments:

Defaulted   High_income   Late_payments
    1            1              1
    0            0              0
    0            0              0
    1            1              1
    1            0              1
    0            0              1
    0            1              0
    0            1              0

1. Calculate the “base entropy” of the defaulted series.

The base entropy measures the randomness (uncertainty) of the output series before any data is

split into separate groups or categories.

Entropy = −∑ᵢ₌₁ᴷ pᵢ log₂(pᵢ)

Where:

K = Total number of possible outcomes.

pᵢ = Probability of that outcome.

The logarithm used in the formula is typically the base-2 logarithm, also known as the binary

logarithm.

In this case, three credit card holders defaulted, and five didn't.

Entropy = −((3/8) log₂(3/8) + (5/8) log₂(5/8)) = 0.954

2. Build a decision tree for this problem

Both features are binary, so there are no issues with determining a threshold as there would be

for a continuous series. The first stage is to calculate the entropy if the split was made for each

of the two features. Examining the High_income feature first, among high-income credit card

owners (feature = 1), two defaulted while two did not, leading to entropy for this sub-set of:

Entropy = −((2/4) log₂(2/4) + (2/4) log₂(2/4)) = 1

Among non-high income credit card owners (feature = 0), one defaulted while three did not,

leading to an entropy of:

Entropy = −((1/4) log₂(1/4) + (3/4) log₂(3/4)) = 0.811

The weighted entropy for splitting by income level is therefore given by:

Weighted entropy = (4/8) × 1 + (4/8) × 0.811 = 0.906

Information gain = 0.954 − 0.906 = 0.048

We repeat this process by calculating the entropy that would occur if the split was made via the

late payment feature.

Three of the four credit card owners who made late payments (feature = 1) defaulted, while one

did not.

Entropy = −((3/4) log₂(3/4) + (1/4) log₂(1/4)) = 0.811

Among the four credit card owners who did not make late payments (feature = 0), none

defaulted. The weighted entropy for late payments feature is, therefore:

Weighted entropy = (4/8) × 0.811 = 0.4055

Information gain = 0.954 − 0.4055 = 0.5485
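The split comparison above can be replicated in code; this sketch reproduces the weighted-entropy calculation on the dataset from the table:

```python
import math

def group_entropy(labels):
    # Entropy of a list of class labels, in bits
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(labels, feature):
    # Base entropy minus the weighted entropy of the groups obtained by
    # splitting the sample on each value of the feature
    n = len(labels)
    weighted = 0.0
    for value in set(feature):
        group = [lab for lab, f in zip(labels, feature) if f == value]
        weighted += (len(group) / n) * group_entropy(group)
    return group_entropy(labels) - weighted

defaulted     = [1, 0, 0, 1, 1, 0, 0, 0]
high_income   = [1, 0, 0, 1, 0, 0, 1, 1]
late_payments = [1, 0, 0, 1, 1, 1, 0, 0]

print(round(information_gain(defaulted, high_income), 4))    # 0.0488
print(round(information_gain(defaulted, late_payments), 4))  # 0.5488
```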

Notice that the information gain is maximized (equivalently, the weighted entropy after the

split is minimized) when the sample is first split by the late payments feature. This becomes

the root node of the decision tree. For credit card owners who do not make late

payments (i.e., feature = 0), there is already a pure split, as none of them defaulted; that is, in
this sample, credit card holders who make timely payments do not default, so no further splits
are required along this branch. In the resulting (incomplete) tree, the late payments feature sits
at the root, the feature = 0 branch terminates in a pure "no default" leaf, and only the
feature = 1 branch remains to be split.

Ensemble Techniques

Ensemble learning is a machine learning technique in which a group of models, or an ensemble,

is used to make predictions rather than relying on the output of a single model. The idea behind

ensemble learning is that the individual models in the ensemble may have different error rates

and make noisy predictions. Still, by taking the average result of many predictions from various

models, the noise can be reduced, and the overall forecast can be more accurate.

There are two objectives of using an ensemble approach in machine learning. First, ensembles

can often achieve better performance than individual models (think of the law of large numbers

where, as the number of models in the ensemble increases, the overall prediction accuracy tends

to improve). Second, ensembles can be more robust and less prone to overfitting, as they are

able to average out the errors made by individual models. Some ensemble techniques are

discussed below, i.e., bootstrap aggregation, random forests, and boosting.

Bootstrap Aggregation

Bootstrap aggregation, or bagging, is a machine-learning technique that involves creating

multiple decision trees by sampling from the original training data. The decision trees are then

combined to make a final prediction. A basic bagging algorithm for a decision tree would involve

the following steps:

1. Sample the training data with replacement to obtain multiple subsets of the training
data.

2. Construct a decision tree on each subset of the training data using the usual techniques.

3. Combine the predictions made by each of the decision tree models (e.g., by averaging
them) to make a forecast.

Sampling with replacement is a statistical method that involves randomly selecting a sample

from a dataset and returning the selected element back into the dataset before choosing the next

element. This means that an element can be selected multiple times, or it can be left out entirely.

Sampling with replacement allows for the use of out-of-bag (OOB) data for model evaluation.

OOB data are observations that were not selected in a particular sample, and therefore were not

used for model training. These observations can be used to evaluate the model's performance, as

they can provide an estimate of how the model will perform on unseen data.
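A minimal sketch of sampling with replacement and the resulting out-of-bag set (the function name is my own, not from any library):

```python
import random

def bootstrap_indices(n_obs, rng):
    """Draw n_obs indices with replacement; return (in-bag list, out-of-bag set)."""
    in_bag = [rng.randrange(n_obs) for _ in range(n_obs)]
    oob = set(range(n_obs)) - set(in_bag)
    return in_bag, oob

rng = random.Random(42)          # fixed seed so the draw is reproducible
in_bag, oob = bootstrap_indices(10, rng)
# On average about 1/e ≈ 37% of observations end up out-of-bag;
# a tree trained on `in_bag` can then be evaluated on `oob`.
```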

Random Forests

A random forest is an ensemble of decision trees. The number of features chosen for each tree is

usually approximately equal to the square root of the total number of features. The individual

decision trees in a random forest are trained on different subsets of the data and different

subsets of the features, which means that each tree may give a slightly different prediction.

However, by combining the predictions of all the trees, the random forest can produce a more

accurate final prediction. The performance improvements of ensembles are often greatest when

the individual model outputs have low correlations with one another because this helps to

improve the generalization of the model.

Boosting

Boosting is an ensemble learning technique that involves training a series of weak models, where

each successive model is trained on the errors or residuals of its predecessor. The goal of

boosting is to improve the model's overall performance by combining the weaker models'

predictions to reduce bias and variance. Gradient boosting and AdaBoost (Adaptive Boosting) are

the most popular methods.

AdaBoost

AdaBoost is a boosting algorithm that trains a series of weak models, where each successive

model focuses more on the examples that were difficult for its predecessor to predict correctly.

This results in new predictors that concentrate more and more on the hard cases. Specifically,

AdaBoost adjusts the weights of the training examples at each iteration based on the previous

model's performance, focusing the training on the examples that are most difficult to predict.

Here is a more detailed description of the process:

1. The AdaBoost algorithm first trains a base classifier (such as a decision tree) on the

training data.

2. The algorithm then uses the trained classifier to make predictions on the training set and

calculates the errors or residuals between the predicted labels and the true labels.

3. The algorithm then adjusts the weights of the training examples based on the previous

classifier's performance, focusing the training on the examples that were most difficult to

predict correctly. Specifically, the weights of the misclassified examples are increased,

while the weights of the correctly classified examples are decreased.

4. A second classifier is then trained on the updated weights. The whole process is repeated

until a predetermined number of classifiers have been trained, or until the model's

performance meets a desired threshold.

The final prediction of the AdaBoost model is calculated by combining the predictions of all of

the individual classifiers using a weighted sum, where the accuracy of each classifier
determines its weight.
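One AdaBoost reweighting round can be sketched as follows, using the standard update in which the classifier weight is alpha = ½·ln((1 − ε)/ε) for weighted error ε; the four-example sample is hypothetical:

```python
import math

def adaboost_reweight(weights, misclassified):
    """One AdaBoost round: compute the classifier weight alpha and
    reweight the examples (up if misclassified, down if correct)."""
    eps = sum(w for w, m in zip(weights, misclassified) if m)   # weighted error
    alpha = 0.5 * math.log((1 - eps) / eps)
    new_w = [w * math.exp(alpha if m else -alpha)
             for w, m in zip(weights, misclassified)]
    total = sum(new_w)
    return alpha, [w / total for w in new_w]                    # renormalize

w0 = [0.25, 0.25, 0.25, 0.25]            # equal weights to start
alpha, w1 = adaboost_reweight(w0, [True, False, False, False])
# The single misclassified example's weight rises from 0.25 to 0.50,
# so the next classifier concentrates on the hard case.
```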

Gradient Boosting

In gradient boosting, a new model is trained on the residuals or errors of the previous model,

which are used as the target labels for the current model. This process is repeated until a

predetermined number of models have been trained, or until the model's performance meets a

desired threshold. In contrast to AdaBoost, which adjusts the weights of the training examples at

each iteration based on the performance of the previous classifier, gradient boosting tries to fit

the new predictor to the residual errors made by the previous predictor.
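The residual-fitting loop can be illustrated with a deliberately trivial "weak learner", the mean of the current residuals, standing in for the regression tree that a real implementation would fit at each step:

```python
def gradient_boost_means(y, n_rounds, learning_rate):
    """Toy gradient boosting for squared-error loss: at each round, fit the
    residuals with their mean and add a damped step to the predictions."""
    pred = [0.0] * len(y)
    for _ in range(n_rounds):
        resid = [yi - pi for yi, pi in zip(y, pred)]   # targets for this round
        step = sum(resid) / len(resid)                 # "model" fit to residuals
        pred = [pi + learning_rate * step for pi in pred]
    return pred

pred = gradient_boost_means([1.0, 2.0, 3.0, 4.0], n_rounds=50, learning_rate=0.5)
# With this trivial learner the predictions converge to the sample mean (2.5);
# a real implementation would fit a small tree to the residuals instead.
```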

K-Nearest Neighbors and Support Vector Machine Methods

K-Nearest Neighbors

K-nearest neighbors (KNN) is a supervised machine learning technique commonly used for

classification and regression tasks. The idea is to find similarities or “nearness” between a new

observation and its k-nearest neighbors in the existing dataset. To do this, the model uses one of

the distance metrics described in the previous chapter (Euclidean distance or Manhattan

distance) to calculate the distance between the new observation and each observation in the

training set. The k observations with the smallest distances are considered the k-nearest

neighbors of the new observation. The class label or value of the new observation is determined

based on these neighbors' class labels or values.

KNN is sometimes called a “lazy learner” as it does not learn the relationships between the

features and the target like other approaches do. Instead, it simply stores the training data and

makes predictions based on the similarity between the new observation and its K-nearest

neighbors in the training set.

The basic steps involved in implementing the KNN model are: (1) choose the number of neighbors, K, and a distance metric; (2) compute the distance between the new observation and every observation in the training set; (3) select the K training observations with the smallest distances; and (4) assign the new observation the majority class label (for classification) or the average value (for regression) of those K neighbors.

Choosing an appropriate value for K is important, as it affects the model's ability to generalize
to new data and to avoid overfitting or underfitting. If K is too large, so that many neighbors
are selected, the model will have high bias but low variance, and vice versa for small K: a value
of K that is too small produces a more complex model that is more sensitive to individual
observations, which may allow it to fit the training data better but also makes it more prone to
overfitting and less able to generalize to new data.

A typical heuristic for selecting K is to set it approximately equal to the square root of the size of

the training sample. For example, if the training sample contains 10,000 points, then K could be

set to 100 (the square root of 10,000).
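The KNN prediction rule can be sketched as follows; the two-dimensional points and the "low-risk"/"high-risk" labels are made-up toy data:

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k):
    """Classify x by majority vote among its k nearest neighbors
    (Euclidean distance) in the training set."""
    neighbors = sorted(
        (math.dist(xi, x), yi) for xi, yi in zip(train_X, train_y)
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

X = [(1, 1), (1, 2), (4, 4), (5, 5)]
y = ["low-risk", "low-risk", "high-risk", "high-risk"]
label = knn_predict(X, y, (1.5, 1.5), k=3)   # → "low-risk"
```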

Support Vector Machines

Support vector machines (SVMs) are supervised machine learning models commonly used for
classification tasks, particularly when there are many features. An SVM works by finding the
separating hyperplane that maximizes the distance between the two classes; this distance is
called the margin. The hyperplane can be pictured as the center line of the widest possible path
between the classes: it is constructed by finding the two parallel lines that are furthest apart
and that best separate the observations into the two classes. The data points on the edge of
this path, i.e., the points closest to the hyperplane, are called support vectors.

Example: Support Vector Machine

Emma White is a portfolio manager at Delta Investments, a firm that manages a diverse range of

investment portfolios for its clients. Delta has a portfolio of “investment-grade” stocks, which are

relatively low-risk and have a high likelihood of producing steady returns. The portfolio also

includes a selection of “non-investment grade” stocks, which are higher-risk and have the

potential for higher returns but also come with a greater risk of loss.

White is considering adding a new stock, ABC Inc., to the portfolio. ABC is a medium-sized

company in the retail sector but has not yet been rated by any of the major credit rating

agencies. To determine whether ABC is suitable for the portfolio, White decides to use machine

learning methods to predict the stock's risk level. How can Emma use the SVM algorithm to

explore the implied credit rating of ABC?

Solution

White would first gather data on the features and risk classifications of stocks from companies
rated as either investment grade or non-investment grade. She would then use this data to train
the SVM algorithm to identify the optimal hyperplane that separates the two classes. Once the
SVM model is trained, White can use it to predict the implied rating of ABC Inc.'s stock by
inputting the stock's features into the model and noting on which side of the hyperplane the
data point lies. If the data point lies on the side associated with the investment-grade class, the
SVM model predicts that ABC Inc.'s stock is likely to be investment grade; if it lies on the side
associated with the non-investment-grade class, the model predicts that the stock is likely to be
non-investment grade.

Neural Networks

Neural networks (NNs), also known as artificial neural networks (ANNs), are machine learning

algorithms capable of learning and adapting to complex nonlinear relationships between input

and output data. They can be used for both classification and regression tasks in supervised

learning, as well as for reinforcement learning tasks that do not require human-labeled training

data. A feed-forward neural network with backpropagation is an artificial neural network that
updates its weights and biases through an iterative process called backpropagation.

Consider a network with three input variables, a single hidden layer comprising three nodes,
and a single output variable. The output variable is determined from the values of the hidden
nodes, which are in turn calculated from the input variables. The equations that determine the
values at the hidden nodes are:

H1 = φ(W111 X1 + W112 X2 + W113 X3 + W1)
H2 = φ(W121 X1 + W122 X2 + W123 X3 + W2)
H3 = φ(W131 X1 + W132 X2 + W133 X3 + W3)

Here, φ is known as an activation function: a nonlinear function that is applied to the linear
combination of the input feature values to introduce nonlinearity into the model.

The value of y is determined by applying an activation function to a linear combination of the

values in the hidden layer.

y = φ(W211 H1 + W221 H2 + W231 H3 + W4)

where W1, W2, W3, and W4 are biases.

The other W parameters (coefficients in the linear functions) are weights. As previously stated, if

the activation functions were not included, the model would only be able to output linear

combinations of the inputs and hidden layer values, limiting its ability to identify complex

nonlinear relationships. This is not desirable, as the main purpose of using a neural network is to

identify and model these kinds of relationships.
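A single forward pass through the 3-input, 3-hidden-node, 1-output network described above can be sketched as follows; the logistic (sigmoid) function plays the role of φ, and all weight and bias values are arbitrary illustrative numbers:

```python
import math

def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Forward pass: hidden values H_j = phi(sum_i W1[j][i]*x[i] + b1[j]),
    then y = phi(sum_j W2[j]*H_j + b2)."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    return sigmoid(sum(w * hj for w, hj in zip(W2, h)) + b2)

x  = [0.5, -1.0, 2.0]                                        # three input features
W1 = [[0.1, 0.2, -0.1], [0.0, 0.3, 0.2], [-0.2, 0.1, 0.05]]  # hidden-layer weights
b1 = [0.0, 0.1, -0.1]                                        # hidden-layer biases
W2 = [0.4, -0.3, 0.2]                                        # output-layer weights
b2 = 0.05                                                    # output-layer bias
y = forward(x, W1, b1, W2, b2)   # an output between 0 and 1
```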

The parameters of a neural network are chosen based on the training data, similar to how the

parameters are chosen in linear or logistic regression. To predict the value of a continuous

variable, we can select the parameters that minimize the mean squared errors. We can use a

maximum likelihood criterion to choose the parameters for classification tasks.

There are no exact formulas for finding the optimal values for the parameters in a neural

network. Instead, a gradient descent algorithm is used to find values that minimize the error for

the training set. This involves starting with initial values for the parameters and iteratively

adjusting them in the direction that reduces the error of the objective function. This process is

similar to stepping down a valley, with each step following the steepest descent.

The learning rate is a hyperparameter that determines the size of the step taken during the

gradient descent algorithm. If the learning rate is too small, it will take longer to reach the

optimal parameters, but if it is too large, the algorithm may oscillate from one side of the valley

to another instead of accurately finding the optimal values. A hyperparameter is a value set

before the model training process begins and is used to control the model's behavior. It is not a

parameter of the model itself but rather a value used to determine how the model will be trained

and function.
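The gradient descent step, and the role of the learning rate, can be seen in a one-parameter example; the quadratic objective and step counts are chosen purely for illustration:

```python
def gradient_descent(grad, x0, learning_rate, n_steps):
    """Repeatedly step against the gradient of a one-parameter objective."""
    x = x0
    for _ in range(n_steps):
        x -= learning_rate * grad(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is f'(x) = 2(x - 3).
grad = lambda x: 2.0 * (x - 3.0)
x_min = gradient_descent(grad, x0=0.0, learning_rate=0.1, n_steps=100)
# With learning_rate = 0.1 the iterates converge close to x = 3; with a rate
# above 1.0 each step overshoots the minimum and the iterates diverge.
```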

In the example given earlier, the neural network had 16 parameters (i.e., a total of the weights

and the biases). The presence of many hidden layers and nodes in a neural network can lead to

too many parameters and the risk of overfitting. To prevent overfitting, calculations are

performed on a validation data set while training the model on the training data set. As the

gradient descent algorithm progresses through the multi-dimensional valley, the objective

function will improve for both data sets.

However, at a certain point, further steps down the valley will begin to degrade the model's

performance on the validation data set while continuing to improve it on the training data set.

This indicates that the model is starting to overfit, so the algorithm should be stopped to prevent

this from happening.

Predictive Performance of Logistic Regression Models vs. Neural Network Models Using a Confusion Matrix

A confusion matrix is a tool used to evaluate the performance of a binary classification model,

where the output variable is a binary categorical variable with two possible values (such as

“default” or “not default”). It is a 2×2 table that shows the possible outcomes, and whether the

predicted outcome was correct. A confusion matrix is organized as follows:

Predicted positive Predicted negative


Actual positive TP FN
Actual negative FP TN

The four elements of the table are:

i. True positive (TP) refers to the number of times the model correctly predicted that a

borrower would default on their loan.

ii. False negative (FN) refers to the number of times the model incorrectly predicted that a

borrower would not default, when in fact, they did.

iii. False positive (FP) refers to the number of times the model incorrectly predicted that a

borrower would default, when in fact, they did not.

iv. True negative (TN) refers to the number of times the model correctly predicted that a

borrower would not default on their loan.

The most common performance metrics based on a confusion matrix are:

i. Accuracy: This is the model's overall accuracy, calculated as the number of correct

predictions divided by the total number of predictions. In the case of a binary

classification problem, the accuracy is calculated as follows:

(TP + TN) / (TP + TN + FP + FN)

ii. Precision: This is the proportion of correct positive predictions, calculated as:

TP / (TP + FP)

iii. Recall: This is the proportion of actual positive cases that were correctly predicted,

calculated as:

TP / (TP + FN)

iv. The error rate is the proportion of incorrect predictions made by the model, calculated as

follows:

Error rate = (1 − Accuracy)
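These four metrics follow directly from the cells of the confusion matrix. As a check, the numbers below use the logistic regression training sample from the example that follows:

```python
def confusion_metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall, and error rate from confusion-matrix counts."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    error_rate = 1.0 - accuracy
    return accuracy, precision, recall, error_rate

# Logistic regression training sample: TP = 100, FN = 300, FP = 50, TN = 1150
acc, prec, rec, err = confusion_metrics(tp=100, fn=300, fp=50, tn=1150)
# acc ≈ 0.781, prec ≈ 0.667, rec = 0.250, err ≈ 0.219
```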

Example: Confusion Matrix

Suppose we have a dataset of 1600 borrowers, 400 of whom defaulted on their loans and 1200 of

whom did not. We can use logistic regression or a neural network to create a prediction model

that predicts the likelihood that a borrower will default on their loan. We can set a threshold

value to convert the predicted probabilities into binary values of 0 or 1.

Assume that a neural network with one hidden layer and backpropagation is used to model the

data. The hidden layer has 5 units, and the activation function used is the logistic function. The

loss function used in the optimization process is based on an entropy measure. Note that a loss

function is used to evaluate how well a model performs on a given task. The optimization process

aims to find the set of model parameters that minimize the loss function. Suppose that the

optimization process takes 150 iterations to converge, which means it takes 150 steps to find the

set of model parameters that minimize the loss function.

In the context of machine learning, the effectiveness of a model specification is evaluated based

on its performance in classifying a validation sample. For simplicity, a threshold of 0.5 is used to

determine the predicted class label based on the model's output probability. If the probability of

a default predicted by the model is greater than or equal to 0.5, the predicted class label is

“default.” If the probability is less than 0.5, the predicted class label is “no default.”

Adjusting the threshold can affect the true positive and false positive rates in different ways. For

example, if the threshold is set too low, the model may have a high true positive rate and a high

false positive rate because the model is classifying more observations as positive. On the other

hand, if the threshold is set too high, the model may have a low true positive rate and a low false

positive rate because the model is classifying fewer observations as positive. This trade-off

between true positive and false positive rates is similar to the trade-off between type I and type

II errors in hypothesis testing. In hypothesis testing, a type I error occurs when the null

hypothesis is rejected when it is actually true. In contrast, a type II error occurs when the null

hypothesis is not rejected when it is actually false.
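Converting the model's predicted probabilities into class labels with a threshold, and seeing how lowering the threshold flags more positives, can be sketched as (the probabilities are made up):

```python
def classify(probabilities, threshold=0.5):
    """Map predicted default probabilities to binary labels (1 = default)."""
    return [1 if p >= threshold else 0 for p in probabilities]

probs = [0.10, 0.45, 0.55, 0.90]
labels_default = classify(probs)        # → [0, 0, 1, 1]
labels_low_thr = classify(probs, 0.4)   # → [0, 1, 1, 1]: more positives flagged
```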

Hypothetical confusion matrices for the logistic and neural network models are presented for

both the training and validation samples.

Logistic Regression Training Sample

Predicted: Default Predicted: No Default


Actual: Default T P = 100 F N = 300
Actual: No default F P = 50 T N = 1150

Logistic Regression Validation Sample

Predicted: Default Predicted: No Default


Actual: Default T P = 100 F N = 175
Actual: No default F P = 56 T N = 337

Neural Network Training Sample

Predicted: Default Predicted: No Default


Actual: Default T P = 94 F N = 306
Actual: No default F P = 106 T N = 1094

Neural Network Validation Sample

Predicted: Default Predicted: No Default


Actual: Default T P = 93 F N = 182
Actual: No default F P = 51 T N = 342

The values in the confusion matrix can be used to calculate various evaluation metrics:

Training sample Validation sample
Performance Logistic Neural Logistic Neural
metrics regression network regression network
Accuracy 0.781 0.743 0.654 0.651
Precision 0.667 0.470 0.641 0.646
Recall 0.250 0.235 0.364 0.338

Both models perform somewhat better on the training data than on the validation data,
indicating that they are overfitting. To improve performance, it may be beneficial to remove
some of the features with limited empirical relevance or to apply regularization to the model.
These steps may help reduce overfitting and improve the model's ability to generalize to new
data.

There is not much difference in the performance of the logistic regression and neural network
approaches. On the training data, the logistic regression model has both a higher true positive
rate (0.250 vs. 0.235) and a higher true negative rate (0.958 vs. 0.912) than the neural network
model. On the validation data, the logistic regression model again has the higher true positive
rate (0.364 vs. 0.338), while the neural network model has a slightly higher true negative rate
(0.870 vs. 0.857).

The Receiver Operating Characteristic Curve

The receiver operating characteristic (ROC) curve is a graphical representation of the trade-off
between the true positive rate and the false positive rate. It is constructed by varying the
threshold value, or decision boundary, used to classify predictions as positive or negative, and
plotting the true positive rate against the false positive rate at each threshold.

A higher area under the receiver operating characteristic curve (the area under the curve, or
AUC) indicates better performance, with a perfect model having an AUC of 1. An AUC of 0.5
corresponds to the 45-degree diagonal of the ROC plot and indicates that the model is no
better than random guessing. An AUC below 0.5 indicates that the model performs
systematically worse than random guessing (its predicted rankings are effectively inverted).
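The AUC can also be computed directly from model scores via its rank interpretation: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case (the scores below are made up for illustration):

```python
def auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly,
    counting ties as half a win."""
    wins = 0.0
    for sp in pos_scores:
        for sn in neg_scores:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

a = auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])   # 8 of 9 pairs correct → ≈ 0.889
```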

Practice Question

Consider the following confusion matrices.

Model A

Predicted: Predicted:
No Default Default
Actual: No Default T N = 100 F P = 50
Actual: default F N = 50 T P = 900

Model B

Predicted: Predicted:
No Default Default
Actual: No Default T N = 120 F P = 80
Actual: default F N = 30 T P = 870

The model that is most likely to have a higher accuracy and higher precision,

respectively, is:

A. Higher accuracy: Model A, Higher precision: Model B.

B. Higher accuracy: Model B, Higher precision: Model A.

C. Higher accuracy: Model A, Higher precision: Model A.

D. Higher accuracy: Model B, Higher precision: Model B.

Solution

The correct answer is C.

Model accuracy is calculated as (TP + TN) / (TP + TN + FP + FN):

Model A accuracy = (900 + 100) / (900 + 100 + 50 + 50) = 0.909
Model B accuracy = (870 + 120) / (870 + 120 + 80 + 30) = 0.900

Model A has a slightly higher accuracy than model B.

Model precision is calculated as follows:

TP / (TP + FP)

Model A precision = 900 / (900 + 50) = 0.9474
Model B precision = 870 / (870 + 80) = 0.9158

Model A has a higher precision than Model B.
