LM10 Simple Linear Regression IFT Notes
1 & 2. Introduction and Estimation of the Simple Linear Regression Model
   Estimating the Parameters of a Simple Linear Regression
3. Assumptions of the Simple Linear Regression Model
4.1 Analysis of Variance
   Breaking down the Sum of Squares Total into Its Components
   Measures of Goodness of Fit
   ANOVA and Standard Error of Estimate in Simple Linear Regression
4.2 Hypothesis Testing of Linear Regression Coefficients
   Hypothesis Tests of the Slope Coefficient
   Hypothesis Tests of the Intercept
   Hypothesis Tests of Slope When Independent Variable Is an Indicator Variable
   Test of Hypotheses: Level of Significance and p-Values
5. Prediction Using Simple Linear Regression and Prediction Intervals
6. Functional Forms for Simple Linear Regression
   The Log-Lin Model
   The Lin-Log Model
   The Log-Log Model
   Selecting the Correct Functional Form
Summary
This document should be read in conjunction with the corresponding reading in the 2023 Level I CFA®
Program curriculum. Some of the graphs, charts, tables, examples, and figures are copyright
2022, CFA Institute. Reproduced and republished with permission from CFA Institute. All rights
reserved.
Required disclaimer: CFA Institute does not endorse, promote, or warrant the accuracy or quality of
the products or services offered by IFT. CFA Institute, CFA®, and Chartered Financial Analyst® are
trademarks owned by CFA Institute.
Ver 1.0
Variation of Y = ∑ (Yi − Ȳ)², with the sum running over i = 1 to n.
The variation of Y is also called sum of squares total (SST) or the total sum of squares. Our
aim is to understand what explains the variation of Y.
The analyst now wants to check if another variable, CAPEX, can be used to explain the
variation of ROA. The analyst defines CAPEX as capital expenditures in the previous period,
scaled by the prior period's beginning property, plant, and equipment. He gathers the
following data for the six companies:
Company ROA (%) CAPEX (%)
A 6 0.7
B 4 0.4
C 15 5.0
D 20 10.0
E 10 8.0
F 20 12.5
Arithmetic mean 12.50 6.10
The relation between the two variables can be visualized using a scatter plot.
The variable whose variation we want to explain, also called the dependent variable is
presented on the vertical axis. It is typically denoted by Y. In our example ROA is the
dependent variable.
The explanatory variable, also called the independent variable is presented on the
horizontal axis. It is typically denoted by X. In our example CAPEX is the independent
variable.
A linear regression model computes the best fit line through the scatter plot, which is the
line with the smallest distance between itself and each point on the scatter plot. The
regression line may pass through some points, but not through all of them.
Regression analysis with only one independent variable is called simple linear regression
(SLR). Regression analysis with more than one independent variable is called multiple
regression. In this reading we focus on regression with a single independent variable, i.e.,
simple linear regression.
Estimating the Parameters of a Simple Linear Regression
The Basics of Simple Linear Regression
Linear regression assumes a linear relationship between the dependent variable (Y) and
independent variable (X). The regression equation is expressed as follows:
Yi = b0 + b1 Xi + εi
where:
i = 1, …, n
Y = dependent variable
b0 = intercept
b1 = slope coefficient
X = independent variable
ε = error term
b0 and b1 are called the regression coefficients.
The slope coefficient b1 shows how much Y changes when X changes by one unit.
Estimating the Regression Line
Linear regression chooses the estimated values for the intercept b̂0 and slope b̂1 such that
the sum of the squared errors (SSE), i.e., the sum of the squared vertical distances between
the observations and the regression line, is minimized.
This is represented by the following equation. The error terms are squared so that they don’t
cancel out each other. The objective of the model is that the sum of the squared error terms
should be minimized.
SSE = ∑ (Yi − Ŷi)² = ∑ [Yi − (b̂0 + b̂1 Xi)]², with the sums running over i = 1 to n.
The resulting least squares estimates are:
b̂1 = ∑ (Xi − X̄)(Yi − Ȳ) / ∑ (Xi − X̄)²
b̂0 = Ȳ − b̂1 X̄
Note: On the exam you will most likely be given the values of b1 and b0. It is unlikely that you
will be asked to calculate these values. Nevertheless, the following table shows how to
calculate the slope and intercept.
Company   ROA (Yi)   CAPEX (Xi)   (Yi − Ȳ)²   (Xi − X̄)²   (Yi − Ȳ)(Xi − X̄)
A 6.0 0.7 42.25 29.16 35.10
B 4.0 0.4 72.25 32.49 48.45
C 15.0 5.0 6.25 1.21 -2.75
D 20.0 10.0 56.25 15.21 29.25
E 10.0 8.0 6.25 3.61 -4.75
F 20.0 12.5 56.25 40.96 48.00
Sum 75.0 36.6 239.50 122.64 153.30
Mean 12.5 6.100
Slope coefficient: b̂1 = 153.30 / 122.64 = 1.25
Intercept: b̂0 = Ȳ − b̂1 X̄ = 12.5 − 1.25 × 6.1 = 4.875
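The calculation above can be reproduced with a short Python sketch (a minimal illustration assuming NumPy is available; the variable names are ours, not part of the curriculum):

import numpy as np

# ROA (%) and CAPEX (%) for the six companies A-F
roa   = np.array([6.0, 4.0, 15.0, 20.0, 10.0, 20.0])
capex = np.array([0.7, 0.4, 5.0, 10.0, 8.0, 12.5])

x_bar, y_bar = capex.mean(), roa.mean()   # 6.10 and 12.50

# Least squares estimates: slope = sum of cross-deviations / sum of squared X-deviations
b1_hat = np.sum((capex - x_bar) * (roa - y_bar)) / np.sum((capex - x_bar) ** 2)
b0_hat = y_bar - b1_hat * x_bar

print(b1_hat)   # 1.25   (= 153.30 / 122.64)
print(b0_hat)   # 4.875  (= 12.5 - 1.25 * 6.1)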
3. Assumptions of the Simple Linear Regression Model
The simple linear regression model is based on the following four assumptions:
1. Linearity: The relationship between the dependent variable, Y, and the independent
variable, X, is linear.
2. Homoskedasticity: The variance of the regression residuals is the same for all
observations.
3. Independence: The observations, pairs of Ys and Xs, are independent of one another.
This implies the regression residuals are uncorrelated across observations.
4. Normality: The regression residuals are normally distributed.
Assumption 1: Linearity
Since we are fitting a straight line through a scatter plot, we are implicitly assuming that the
true underlying relationship between the two variables is linear.
If the relationship between the variables is non-linear, then using a simple linear regression
model will produce invalid results.
Exhibit 10 from the curriculum shows two variables that have an exponential relationship. A
linear regression line does not fit this relationship well: at lower and higher values of X the
model underestimates Y, whereas at middle values of X it overestimates Y.
Another point related to this assumption is that the independent variable X should not be
random. If X is random, there would be no linear relationship between the two variables.
Also, the residuals of the model should be random. They should not exhibit a pattern when
plotted against the independent variable. As shown in Exhibit 11, the residuals from the
linear regression model in Exhibit 10 do not appear random.
Hence, a linear regression model should not be used for these two variables.
Assumption 2: Homoskedasticity
Assumption #2 is that the variance of the residuals is constant for all observations. This
condition is called homoskedasticity. If the variance of the error term is not constant, then it
is called heteroskedasticity.
Exhibit 12 shows a scatter plot of short-term interest rates versus inflation rate for a
country. The data represents a total span of 16 years. We will refer to the first eight years of
normal rates as Regime 1 and the second eight years of artificially low rates as Regime 2.
At first glance, the model seems to fit the data well. However, when we plot the residuals of
the model against the years, we can see that the residuals for the two regimes appear
different. This is shown in Exhibit 13 below. The variation in residuals for Regime 1 is much
smaller than the variation in residuals for Regime 2. This is a clear violation of the
homoskedasticity assumption and the two regimes should not be clubbed together in a
single model.
Assumption 3: Independence
Assumption #3 states that the residuals should be uncorrelated across observations. If the
residuals exhibit a pattern, then this assumption will be violated.
For example, Exhibit 15 from the curriculum plots the quarterly revenues of a company over
40 quarters. The data displays a clear seasonal pattern. Quarter 4 revenues are considerably
higher than the first 3 quarters.
A plot of the residuals from this model in Exhibit 16 also helps us see this pattern. The
residuals are correlated – they are high in Quarter 4 and then fall back in the other quarters.
Both exhibits show that the assumption of residual independence is violated and the model
should not be used for this data.
Assumption 4: Normality
Assumption #4 states that the residuals should be normally distributed.
Instructor’s Note: This assumption does not mean that X and Y should be normally
distributed, it only means that the residuals from the model should be normally distributed.
4.1 Analysis of Variance
Breaking down the Sum of Squares Total into Its Components
To evaluate how well a linear regression model explains the variation of Y we can break
down the total variation in Y (SST) into two components: Sum of square errors (SSE) and
regression sum of squares (RSS).
Total sum of squares (SST) = ∑ (Yi − Ȳ)². This measures the total variation in the
dependent variable. SST is equal to the sum of squared distances between the actual values
of Y and the average value of Y.
The regression sum of squares (RSS) = ∑ (Ŷi − Ȳ)². This is the amount of total variation
in Y that is explained by the regression equation. RSS is the sum of squared distances
between the predicted values of Y and the average value of Y.
The sum of squared errors or residuals (SSE) = ∑ (Yi − Ŷi)². This is also known as the
residual sum of squares. It measures the unexplained variation in the dependent variable.
SSE is the sum of the squared vertical distances between the actual values of Y and the
predicted values of Y on the regression line.
SST = RSS + SSE
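As a quick check, this decomposition can be verified in Python for the ROA/CAPEX regression estimated earlier (a minimal sketch assuming NumPy; the fitted line b̂0 = 4.875, b̂1 = 1.25 comes from the earlier section, and the RSS/SSE values in the comments are our own computations):

import numpy as np

roa   = np.array([6.0, 4.0, 15.0, 20.0, 10.0, 20.0])
capex = np.array([0.7, 0.4, 5.0, 10.0, 8.0, 12.5])
y_hat = 4.875 + 1.25 * capex                  # predicted values from the fitted line

sst = np.sum((roa - roa.mean()) ** 2)         # total variation (239.50)
rss = np.sum((y_hat - roa.mean()) ** 2)       # explained variation (191.625)
sse = np.sum((roa - y_hat) ** 2)              # unexplained variation (47.875)

print(np.isclose(sst, rss + sse))             # True: SST = RSS + SSE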
F-test: For a meaningful regression model, the slope coefficients should be non-zero. This is
determined through the F-test which is based on the F-statistic. The F-statistic tests whether
all the slope coefficients in a linear regression are equal to 0. In a regression with one
independent variable, this is a test of the null hypothesis H0: b1 = 0 against the alternative
hypothesis Ha: b1 ≠ 0.
The F-statistic also measures how well the regression equation explains the variation in the
dependent variable. The four values required to construct the F-statistic for null hypothesis
testing are:
• The total number of observations (n)
• The total number of independent variables (k)
• The regression sum of squares (RSS)
• The sum of squared errors or residuals (SSE)
The F-statistic is calculated as:
F = (RSS / k) / (SSE / (n − (k + 1))) = Mean square regression / Mean square error = MSR / MSE
These quantities are typically presented in an ANOVA table:
Source                              Degrees of Freedom   Sum of Squares   Mean Square
Regression (explained variation)    k                    RSS              MSR = RSS / k
Error (unexplained variation)       n − k − 1            SSE              MSE = SSE / (n − k − 1)
Total                               n − 1                SST
n represents the number of observations and k represents the number of independent variables.
With one independent variable, k = 1. Hence, MSR = RSS and MSE = SSE / (n - 2).
Information from the ANOVA table can be used to compute the standard error of estimate
(SEE). The standard error of estimate (SEE) measures how well a given linear regression
model captures the relationship between the dependent and independent variables. It is the
standard deviation of the prediction errors. A low SEE implies an accurate forecast.
The formula for the standard error of estimate is given below:
Standard error of estimate (SEE) = √MSE
A low SEE implies that the error (or residual) terms are small and hence the linear
regression model does a good job of capturing the relationship between dependent and
independent variables.
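A short Python sketch ties these measures together for the ROA/CAPEX regression. The RSS and SSE inputs below are our own values computed from the six-company data and the fitted line (they are not quoted in the text), so treat them as derived inputs:

import numpy as np

rss, sse = 191.625, 47.875        # explained and unexplained variation (derived above)
n, k = 6, 1                       # observations and independent variables
sst = rss + sse

r_squared = rss / sst             # coefficient of determination, about 0.80
msr = rss / k                     # mean square regression
mse = sse / (n - k - 1)           # mean square error
see = np.sqrt(mse)                # standard error of estimate, about 3.46
f_stat = msr / mse                # about 16.01

print(r_squared, see, f_stat)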
The following example demonstrates how to interpret an ANOVA table.
Example: Using ANOVA Table Results to Evaluate a Simple Linear Regression
You are provided the following ANOVA table:
Source Sum of Squares Degrees of Freedom Mean Square
Regression 576.1485 1 576.1485
Error 1,873.5615 98 19.1180
Total 2,449.7100
1. What is the coefficient of determination for this regression model?
2. What is the standard error of the estimate for this regression model?
3. At a 5% level of significance, do we reject the null hypothesis of the slope coefficient
equal to zero if the critical F-value is 3.938?
4. Based on your answers to the preceding questions, evaluate this simple linear regression
model.
Solution to 1: Coefficient of determination (R2) = RSS / SST = 576.1485 / 2,449.71 = 23.52%.
Solution to 2: Standard error of estimate (SEE) = √MSE = √19.1180 = 4.3724
Solution to 3:
F = MSR / MSE = 576.1485 / 19.1180 = 30.1364
Since the calculated F-stat is higher than the critical value of 3.938, we can conclude that the
slope coefficient is statistically different from 0.
Solution to 4: The coefficient of determination indicates that the model explains 23.52% of
the variation in Y. Also, the F-stat confirms that the model’s slope coefficient is statistically
different from 0. Overall, the model seems to fit the data reasonably well.
4.2 Hypothesis Testing of Linear Regression Coefficients
Hypothesis Tests of the Slope Coefficient
In order to test whether an estimated slope coefficient is statistically significant, we use
hypothesis testing.
Continuing with our previous example of a simple linear regression with ROA as the
dependent variable and CAPEX as the independent variable, suppose we want to test
whether the slope coefficient of CAPEX is different from zero.
The steps are:
Step 1: State the hypothesis
H0: b1 = 0; Ha: b1 ≠ 0
Step 2: Identify the appropriate test statistic
To test the significance of individual slope coefficients we use the t-statistic. It is calculated
by subtracting the hypothesized population slope (B1) from the estimated slope coefficient
(b̂1) and then dividing this difference by the standard error of the slope coefficient, s_b̂1:
t = (b̂1 − B1) / s_b̂1
Another feature of simple linear regression is that the F-stat is simply the square of the t-stat
for the slope (or, equivalently, for the correlation). For our example, the t-stat for the slope is
4.00131, so the F-stat is 4.00131² ≈ 16.01.
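The standard error of the slope is not quoted in this extract; the sketch below derives it from the standard formula s_b̂1 = SEE / √∑(Xi − X̄)² using our computed SEE, and confirms that F equals the square of the slope t-stat (a minimal illustration assuming NumPy and SciPy):

import numpy as np
from scipy import stats

b1_hat = 1.25                 # estimated slope
see    = 3.4596               # standard error of estimate (derived from the data)
s_xx   = 122.64               # sum of squared deviations of CAPEX
n      = 6

s_b1   = see / np.sqrt(s_xx)                      # standard error of the slope, ~0.312
t_stat = (b1_hat - 0) / s_b1                      # ~4.0013
p_val  = 2 * stats.t.sf(abs(t_stat), df=n - 2)    # two-tailed p-value, ~0.016

print(t_stat ** 2)    # ~16.01, equal to the F-statistic
print(p_val <= 0.05)  # True: reject H0 that the slope is zero at the 5% level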
Hypothesis Tests of the Intercept
For the ROA regression example, the intercept is 4.875%. Say you want to test if the
intercept is statistically greater than 3%. This will be a one-tailed hypothesis test and the
steps are:
Step 1: State the hypothesis
H0: b0 ≤ 3% versus Ha: b0 > 3%
Step 2: Identify the appropriate test statistic
To test whether the population intercept is a specific value we can use the following t-stat:
t_intercept = (b̂0 − B0) / s_b̂0
with 6 − 2 = 4 degrees of freedom.
The standard error of the intercept is calculated as:
s_b̂0 = √[ 1/n + X̄² / ∑ (Xi − X̄)² ]
Instructor’s Note: On the exam, the value of the standard error will most likely be given to
you. You are unlikely to be asked to calculate this value.
Step 3: Specify the level of significance
α = 5%.
Step 4: State the decision rule
Critical t-value = 2.132.
Reject the null if the calculated t-statistic is greater than 2.132.
Step 5: Calculate the test statistic
s_b̂0 = √[ 1/6 + 6.1² / 122.64 ] = 0.68562
t_intercept = (b̂0 − B0) / s_b̂0 = (4.875 − 3.0) / 0.68562 = 2.73475
Step 6: Make a decision
Since the calculated t-stat is greater than the critical value, we can reject the null hypothesis
and conclude that the intercept is greater than 3%.
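The same six steps can be reproduced numerically (a minimal Python sketch assuming SciPy; it simply mirrors the calculation above):

import numpy as np
from scipy import stats

b0_hat, b0_null = 4.875, 3.0          # estimated intercept and hypothesized value
n, x_bar, s_xx  = 6, 6.1, 122.64      # sample size, mean CAPEX, sum of squared X-deviations

s_b0   = np.sqrt(1 / n + x_bar ** 2 / s_xx)   # 0.68562, as in Step 5
t_stat = (b0_hat - b0_null) / s_b0            # 2.73475
t_crit = stats.t.ppf(0.95, df=n - 2)          # 2.132 for a one-tailed 5% test

print(t_stat > t_crit)   # True: reject H0, the intercept is greater than 3%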
Hypothesis Tests of Slope When Independent Variable Is an Indicator Variable
An indicator variable or a dummy variable can only take values of 0 or 1. An independent
variable is set up as an indicator variable in specific cases. Say we want to evaluate if a
company’s quarterly earnings announcement influences its monthly stock returns. Here the
monthly returns RET would be regressed on the indicator variable, EARN, that takes on a
value of 0 if there is no earnings announcement that month and 1 if there is an earnings
announcement.
The simple linear regression model can be expressed as:
RETi = b0 + b1EARNi + ϵi
Say we run the regression analysis over a 30-month period. The observations and regression
results are shown in Exhibit 28.
Clearly the returns for announcement months are different from the returns for months
without announcement.
The results of the regression are given in Exhibit 29.
             Estimated Coefficients   Standard Error of Coefficients   Calculated Test Statistic
Intercept    0.5629                   0.0560                           10.0596
EARN         1.2098                   0.1158                           10.4435
We can draw the following inferences from this table (a small simulation sketch after the list illustrates the last two points):
• The t-stats for both the intercept and the slope are high hence we can conclude that
both the intercept and the slope are statistically significant.
• The intercept is the predicted value of Y when X = 0. Therefore, the intercept (0.5629)
is the mean of the returns for non-earnings-announcement months.
• The slope coefficient (1.2098) is the difference in means of returns between earnings-
announcement and non-announcement months.
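The last two bullets can be verified on any data set. The sketch below uses simulated (hypothetical) returns rather than the Exhibit 28 data to show that regressing returns on a 0/1 indicator recovers the group means (a minimal illustration assuming NumPy):

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sample: 24 non-announcement months (EARN = 0) and 6 announcement months (EARN = 1)
earn = np.array([0] * 24 + [1] * 6)
ret  = np.where(earn == 1, 1.8, 0.6) + rng.normal(0, 0.3, size=30)

# Simple OLS of RET on the indicator EARN
b1_hat = np.sum((earn - earn.mean()) * (ret - ret.mean())) / np.sum((earn - earn.mean()) ** 2)
b0_hat = ret.mean() - b1_hat * earn.mean()

# Intercept equals the mean return of non-announcement months;
# slope equals the difference in mean returns between the two groups
print(np.isclose(b0_hat, ret[earn == 0].mean()))                          # True
print(np.isclose(b1_hat, ret[earn == 1].mean() - ret[earn == 0].mean()))  # True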
Test of Hypotheses: Level of Significance and p-Values
p-value: At times financial analysts report the p-value or probability value for a particular
hypothesis. The p-value is the smallest level of significance at which the null hypothesis can
be rejected. It allows the reader to interpret the results rather than be told that a certain
hypothesis has been rejected or accepted. In most regression software packages, the p-
values printed for regression coefficients apply to a test of the null hypothesis that the true
parameter is equal to 0 against the alternative that the parameter is not equal to 0, given the
estimated coefficient and the standard error for that coefficient.
Here are a few important points connecting the t-statistic and the p-value (illustrated in the short sketch after this list):
• The higher the t-statistic, the smaller the p-value.
• The smaller the p-value, the stronger the evidence to reject the null hypothesis.
• Given a p-value, if p-value ≤ α, then reject the null hypothesis H0, where α is the significance
level. For example, if we are given a p-value of 0.03 (3%), we can reject the null hypothesis at
the 5% significance level, but not at the 1% significance level.
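The p-value can be obtained directly from the t-distribution. A minimal sketch using SciPy, applied to the slope t-stat of the ROA/CAPEX example (the inputs are our derived values):

from scipy import stats

t_stat, df = 4.0013, 4                       # slope t-stat and degrees of freedom (n - 2)
p_value = 2 * stats.t.sf(abs(t_stat), df)    # two-tailed p-value, roughly 0.016

print(p_value <= 0.05)   # True:  reject H0 at the 5% significance level
print(p_value <= 0.01)   # False: cannot reject H0 at the 1% significance level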
5. Prediction Using Simple Linear Regression and Prediction Intervals
We use regression equations to make predictions about a dependent variable. Let us
consider the regression equation Y = b0 + b1X. The predicted value is Ŷ = b̂0 + b̂1X.
Once the variance of the prediction error is known, it is easy to determine the confidence
interval around the prediction. The steps are:
1. Make the prediction.
2. Compute the variance of the prediction error.
3. Determine tc at the chosen significance level α.
4. Compute the (1 − α) prediction interval using the formula: Ŷ ± tc × sf
In our ROA regression model, if a company's CAPEX is 6%, its forecasted ROA is:
Ŷ = b̂0 + b̂1X = 4.875 + 1.25 × 6 = 12.375%
Assuming a 5% significance level (α), two-sided, with n − 2 = 4 degrees of freedom, the
critical values for the prediction interval are ±2.776.
The standard error of the forecast is:
sf = 3.459588 × √[ 1 + 1/6 + (6 − 6.1)² / 122.640 ] = 3.736912
The 95% prediction interval is therefore 12.375% ± 2.776 × 3.736912, i.e., approximately
2.00% to 22.75%.
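The full calculation can be scripted as follows (a minimal sketch assuming SciPy; the inputs are the regression estimates derived earlier in these notes):

import numpy as np
from scipy import stats

b0_hat, b1_hat, see = 4.875, 1.25, 3.459588   # intercept, slope, standard error of estimate
n, x_bar, s_xx      = 6, 6.1, 122.64
x_new               = 6.0                     # CAPEX value for the forecast

y_hat  = b0_hat + b1_hat * x_new                                  # 12.375
s_f    = see * np.sqrt(1 + 1 / n + (x_new - x_bar) ** 2 / s_xx)   # 3.7369
t_crit = stats.t.ppf(0.975, df=n - 2)                             # 2.776

print(y_hat - t_crit * s_f, y_hat + t_crit * s_f)   # prediction interval, roughly 2.0% to 22.7%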
6. Functional Forms for Simple Linear Regression
When the underlying relationship is not linear, we may have to modify either the dependent
or the independent variable to make the simple linear regression model fit well. The
modification process is called ‘transformation’ and the different types of transformations are:
• Using the log of the dependent variable
• Using the log of the independent variable
• Using the square of the independent variable
• Differencing the independent variable
In the subsequent sections, we will discuss three commonly used functional forms based on
log transformations:
• Log-lin model: The dependent variable is logarithmic but the independent variable is
linear.
• Lin-log model: The dependent variable is linear but the independent variable is
logarithmic.
• Log-log model: Both the dependent and independent variables are in logarithmic
form.
The Log-Lin Model
In the log-lin model, the dependent variable is in logarithmic form while the independent
variable is linear. The regression equation is expressed as:
ln Yi = b0 + b1 Xi
The slope coefficient in this model is the relative change in the dependent variable for an
absolute change in the independent variable. We can see that the log-lin model is a better
fitting model than the simple linear model for data with exponential growth.
Example: Making forecasts with a log-lin model
Suppose the regression model is: ln Y = −7 + 2X. If X is 2.5% what is the forecasted value of
Y?
Solution:
Ln Y = -7 + 2*2.5 = -2
Y = e-2 = 0.135335
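The same forecast in Python (a trivial sketch using only the standard library):

import math

b0, b1, x = -7.0, 2.0, 2.5          # ln Y = -7 + 2X, with X = 2.5
y_forecast = math.exp(b0 + b1 * x)  # Y = e**(-2)

print(y_forecast)                   # 0.1353...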
The Lin-Log Model
In the lin-log model, the dependent variable is linear but the independent variable is
logarithmic. The regression equation is expressed as:
Yi = b0 + b1 ln Xi
The slope coefficient in this model is the absolute change in the dependent variable for a
relative change in the independent variable.
Consider a plot of operating profit margin as the dependent variable (Y) and unit sales as the
independent variable (X). The scatter plot and regression line for a sample of 30 companies
are shown in Exhibit 35.
Instead of using the unit sales directly, if we transform the variable and use the natural log of
unit sales as the independent variable, we get a much better fit. This is shown in Exhibit 36.
The R2 of the model jumps from 55.10% to 97.17%. For this data, the lin-log model has
significantly higher explanatory power than the simple linear model.
The Log-Log Model
In the log-log model, both the dependent and independent variables are in logarithmic form.
It is also called the ‘double-log’ model. The regression equation is expressed as:
ln Yi = b0 + b1 ln Xi
The slope coefficient in this model is the relative change in the dependent variable for a
relative change in the independent variable. The model is often used to calculate elasticities.
Consider a plot of company revenues as the dependent variable (Y) and the advertising
spend as a percentage of SG&A, ADVERT, as the independent variable (X). The scatter plot
and regression line for a sample company are shown in Exhibit 37 below:
The R2 of this model is only 20.89%. If we use the natural logs of both the revenues and
ADVERT variables, we get a much better fit. This is shown in Exhibit 38.
The R2 of the model increases more than fourfold to 84.91%, and the F-stat jumps from
7.39 to 157.52. For this data, the log-log model results in a much better fit than the simple
linear model.
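The effect of the log-log transformation can be illustrated on simulated data (a hypothetical sketch assuming NumPy, not the Exhibit 37/38 data): when the true relation is a power law, the R2 of the log-log fit is much higher than that of the untransformed fit.

import numpy as np

rng = np.random.default_rng(1)

# Hypothetical power-law data: Y = 0.1 * X**3 with multiplicative noise
x = rng.uniform(1, 50, size=200)
y = 0.1 * x ** 3 * np.exp(rng.normal(0, 0.2, size=200))

def r_squared(x_var, y_var):
    # R-squared of a simple OLS fit of y_var on x_var
    b1 = np.sum((x_var - x_var.mean()) * (y_var - y_var.mean())) / np.sum((x_var - x_var.mean()) ** 2)
    b0 = y_var.mean() - b1 * x_var.mean()
    sse = np.sum((y_var - (b0 + b1 * x_var)) ** 2)
    sst = np.sum((y_var - y_var.mean()) ** 2)
    return 1 - sse / sst

print(r_squared(x, y))                   # lin-lin fit: noticeably below 1
print(r_squared(np.log(x), np.log(y)))   # log-log fit: close to 1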
Summary
LO: Describe a simple linear regression model, how the least squares criterion is used
to estimate regression coefficients, and the interpretation of these coefficients.
A linear regression model computes the best fit line through the scatter plot, which is the
line with the smallest distance between itself and each point on the scatter plot.
The variable whose variation we want to explain is called the dependent variable.
The explanatory variable is called the independent variable.
Linear regression chooses the estimated values for the intercept b̂0 and slope b̂1 such that
the sum of the squared errors (SSE), i.e., the sum of the squared vertical distances between
the observations and the regression line, is minimized.
The intercept is the value of the dependent variable when the independent variable is zero.
The slope measures the change in the dependent variable for a one-unit change in the
independent variable. If the slope is positive, the two variables move in the same direction. If
the slope is negative, the two variables move in opposite directions.
LO: Explain the assumptions underlying the simple linear regression model, and
describe how residuals and residual plots indicate if these assumptions may have
been violated.
The simple linear regression model is based on the following four assumptions:
1. Linearity: The relationship between the dependent variable, Y, and the independent
variable, X, is linear.
2. Homoskedasticity: The variance of the regression residuals is the same for all
observations.
3. Independence: The observations, pairs of Ys and Xs, are independent of one another.
This implies the regression residuals are uncorrelated across observations.
4. Normality: The regression residuals are normally distributed.
LO: Calculate and interpret measures of fit and formulate and evaluate tests of fit and
of regression coefficients in a simple linear regression.
Measures of goodness of fit:
The coefficient of determination, denoted by R2, measures the fraction of the total variation
in the dependent variable that is explained by the independent variable.
R2 = Explained variation / Total variation = Regression sum of squares (RSS) / Sum of squares total (SST)
The F-statistic tests whether all the slope coefficients in a linear regression are equal to 0. In
a regression with one independent variable, this is a test of the null hypothesis H0: b1 = 0
against the alternative hypothesis Ha: b1 ≠ 0.
The F-statistic is calculated as:
F = (RSS / k) / (SSE / (n − (k + 1))) = Mean square regression / Mean square error = MSR / MSE
The standard error of estimate (SEE) measures how well a given linear regression model
captures the relationship between the dependent and independent variables. It is the
standard deviation of the prediction errors. A low SEE implies an accurate forecast.
Standard error of estimate (SEE) = √MSE
LO: Calculate and interpret the predicted value for the dependent variable, and a
prediction interval for it, given an estimated linear regression model and a value for
the independent variable.
The estimated variance of the prediction error is given by:
sf² = s² × [ 1 + 1/n + (X − X̄)² / ((n − 1) sx²) ]
Once the variance of the prediction error is known, it is easy to determine the confidence
interval around the prediction. The steps are:
1. Make the prediction.
2. Compute the variance of the prediction error.
3. Determine tc at the chosen significance level α.
4. Compute the (1 − α) prediction interval using the formula: Ŷ ± tc × sf
LO: Describe different functional forms of simple linear regressions.
The three commonly used functional forms based on log transformations:
• Log-lin model: The dependent variable is logarithmic but the independent variable is
linear.
• Lin-log model: The dependent variable is linear but the independent variable is
logarithmic.
• Log-log model: Both the dependent and independent variables are in logarithmic
form.