Linear Regression Models
Com Finance
Session outline
At the end of this session, students must be able to:
• Understand and differentiate various linear regression models; single and multiple regressions.
• Apply linear regression models using selected computer software packages and interpret the
results.
• Linear regression is concerned with describing and evaluating the relationship
between a given variable and one or more other variables.
• In this technique, the dependent variable is continuous, the independent variable(s) can
be continuous or discrete, and the regression line is linear.
• Linear regression establishes a relationship between a dependent variable (Y) and one
or more independent variables (X) using a best-fit straight line (also known as the
regression line).
Classical Linear Regression Model
In econometrics, there is often a need to examine the relationship
between two or more financial variables.
The relationship between variables can be explored by:
a. Constructing scatter diagrams
b. Building a linear regression model
c. Calculating a correlation coefficient
Scatter plots
It is a graph that shows the relationship between
the observations for two data series in two
dimensions.
The pattern of the data is indicative of the type of relationship between the two variables:
• Positive relationship
• Negative relationship
• No relationship
[Figure: three example scatter plots – a positive relationship, a negative relationship (e.g. car reliability against age of car), and no relationship.]
Types of linear regression models
• There are two types of linear regression models: simple (single) linear regression, where the dependent variable is explained by one independent variable, and multiple linear regression, where it is explained by two or more.
• Simple regression, e.g. analysing how asset returns vary with changes in the level of market risk. Here asset
returns are expected to be affected or influenced by one factor only, i.e. market risk.
• Multiple regression, e.g. evaluating how the share price of a company is influenced by company size, market risk,
sector-sensitive risk and inward foreign direct investment into the sector. Here the share
price of a listed company is expected to be affected or influenced by a
number of factors. This is more realistic in practice, so multiple regression has
more value in evaluating such finance models.
THE SIMPLE REGRESSION MODEL
Regression analysis is concerned with describing and
evaluating the relationship between a given
variable (explained or dependent variable) and one
or more other variables (explanatory or independent
variables).
In statistical modelling, regression analysis is a
statistical process for estimating the relationships
among variables.
Explained variable is denoted by y and explanatory
variable by x.
Regression is an attempt to explain the variation in a
dependent variable using the variation in
independent variables.
Regression is thus often used to suggest causation, although
regression alone cannot establish it.
If the independent variable(s) sufficiently explain the
variation in the dependent variable, the model can
be used for prediction.
Sources of Errors in Regression
$$\hat{\beta} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} \quad\text{and, for the intercept,}\quad \hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x}$$
The linear regression model generates a straight line that summarises the values of all data
points for corresponding values of y and x:

$$Y_i = \alpha + \beta X_i + u_i$$
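As an illustration (not from the original slides), the OLS estimates above can be computed directly with NumPy; the data values below are made up.

```python
# A minimal sketch of the OLS formulas above, on made-up data.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # explanatory variable
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])  # explained variable

x_bar, y_bar = x.mean(), y.mean()

# Slope: sum of cross-deviations over sum of squared x-deviations
beta_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)

# Intercept: the fitted line passes through the point of means
alpha_hat = y_bar - beta_hat * x_bar

residuals = y - (alpha_hat + beta_hat * x)  # u_i = actual minus fitted
print(f"alpha_hat = {alpha_hat:.3f}, beta_hat = {beta_hat:.3f}")
```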
Linear regression model based OLS method
• The method used to fit the data to the straight line is called the Ordinary Least
Squares (OLS) method.
• OLS takes the vertical distance from each data point to the line, squares it, and
minimises the total sum of these squares – hence 'least squares'.
• Geometrically, this is equivalent to minimising the sum of the areas of the squares drawn on the
distances from each data point to the line, on either side of it.
• $u_i$ denotes the residual, the difference between the actual value $Y_i$ and the value fitted by the line.
• $\mathrm{var}(u_i) = \sigma^2 < \infty$ – the variance of the errors is constant and finite over all values of $x_i$ (homoscedasticity).
• $\mathrm{cov}(u_i, u_j) = 0$ – the errors are linearly independent of one another (no autocorrelation).
• NB: one rule of thumb in linear regression is that the sample size must be at least 15 cases
per independent variable.
Testing & interpreting linear regression assumptions from SPSS output.
1. Linearity – the P-P plot shows the distribution of data points close to the diagonal line that
cuts through the square of the plot. Any points far away from it suggest problems with
linearity.
2. Normality – the histogram must show the bell-shaped curve of the normal distribution of the
residuals of the dependent variable.
3. Multicollinearity – tests whether individual independent variables are highly related to one
another, thereby representing the same effect on the dependent variable and thus inflating the results of
the model. It is tested via either the tolerance level or the Variance Inflation Factor (VIF).
The tolerance value must be between 0 and 1: anything below 0.2 is unacceptable, above 0.2 but
less than 0.5 is moderate, and close to 1 is very good. VIF is the inverse of the tolerance score: a VIF
below 3 is great, above 3 but less than 5 is moderate, up to 10 is not a good sign, while
anything above 10 is unacceptable.
4. Outliers – cases with outliers must be removed from the data before your final analysis, and
such changes must be explained in your presentation. They are tested using either the Mahalanobis or
Cook's distances. Cook's distance must be between 0 and 1; anything above 1 indicates the
presence of outliers, which must be removed or treated as missing.
5. Independence of errors – the Durbin-Watson statistic tests autocorrelation in the residuals. It ranges
between 0 and 4. Values between 1.5 and 2.5 indicate no autocorrelation, a value towards 0
indicates positive autocorrelation and a value towards 4 indicates negative autocorrelation. (A Python sketch of these diagnostics follows the list.)
Addressing normality problem - Transformation of quantitative data
When the data do not meet the normality assumption, linear regression cannot be
used directly; the data must either 1) be transformed first, to normalise them or bring them near normality, or 2)
be analysed with nonparametric models.
Data can be transformed in three ways: by taking the logarithm, the square root or the reciprocal of the variable, as sketched below.
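A minimal Python sketch of the three transformations, applied to a made-up right-skewed variable; the domain restrictions in the comments are standard caveats, not from the slides.

```python
# Transformations to bring a skewed variable closer to normality.
import numpy as np

x = np.random.default_rng(1).lognormal(mean=0.0, sigma=1.0, size=100)

x_log = np.log(x)      # logarithm: values must be strictly positive
x_sqrt = np.sqrt(x)    # square root: values must be non-negative
x_recip = 1.0 / x      # reciprocal: values must be non-zero
```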
Addressing outlier, multicollinearity and linearity problems – By elimination
1. Outliers – tested using either the Mahalanobis or Cook's distances. Cook's distance
must be between 0 and 1; anything above 1 indicates the presence of outliers. Cases
with values above 1 must be removed or treated as missing data.
2. Linearity – the P-P plot must show the distribution of data points close to the
diagonal line that cuts through the square of the plot. Any points far away
from it suggest problems with linearity and must be excluded.
3. Multicollinearity – tested via either the tolerance level or the Variance Inflation
Factor (VIF). The tolerance value must be between 0 and 1: anything below 0.2 is
unacceptable, above 0.2 but less than 0.5 is moderate, and close to 1 is very good.
VIF is the inverse of the tolerance score: a VIF below 3 is great, above 3 but less than
5 is moderate, up to 10 is not a good sign, while anything above 10 is unacceptable.
When two independent variables are highly correlated, one must be
removed from the analysis (see the sketch after this list).
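A minimal sketch (assumed, not from the lecture) of this elimination step: compute pairwise correlations among the regressors and drop one variable from any highly correlated pair before fitting; the 0.8 cut-off is illustrative.

```python
# Drop one of each pair of highly correlated independent variables.
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
df = pd.DataFrame({"x1": rng.normal(size=50), "x3": rng.normal(size=50)})
df["x2"] = df["x1"] * 0.98 + rng.normal(scale=0.05, size=50)  # near-duplicate of x1

corr = df.corr().abs()
# Keep only the upper triangle so each pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.8).any()]
print("Drop before fitting:", to_drop)  # expected: ['x2']
```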
The aim is to model or produce an unbiased estimate of the current price of any property
based on its size, the size of its land and its age; thus Price is the DV, while the three IVs are house size, land size
and age of the property. Since all these variables are continuous, multiple linear regression is the most
appropriate method. Based on the data in the SPSS file, the aim of this exercise is to:
1. Test the assumptions of multiple linear regression.
2. Apply the model.
3. Present and interpret the results.
A worked sketch of this exercise follows.
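Since the SPSS file is not reproduced here, the following Python sketch simulates stand-in data with the same structure (65 properties; price, land size, house size, age) and fits the multiple regression with statsmodels; the simulated coefficients are illustrative assumptions only.

```python
# Multiple linear regression sketch for the property-price exercise.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 65  # same number of properties as the Bulawayo data set
df = pd.DataFrame({
    "land_size": rng.uniform(200, 800, n),   # square metres
    "house_size": rng.uniform(60, 250, n),   # square metres
    "age": rng.uniform(1, 40, n),            # years
})
# Simulated price in thousands of dollars (coefficients are made up)
df["price"] = (0.16 * df["land_size"] + 0.02 * df["house_size"]
               + rng.normal(scale=18, size=n))

X = sm.add_constant(df[["land_size", "house_size", "age"]])
model = sm.OLS(df["price"], X).fit()
print(model.summary())  # F-test, R squared, coefficients, Durbin-Watson
```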
Presenting and interpreting results
• Results section of your project/dissertation/thesis/research paper must cover the following:
1. Descriptive statistics
• Before presenting your main results, it is important first to understand your data. So, present simple
descriptive statistics of the data as an indicator of the expected results as well as of the nature and scope of your
data.
2. Assumptions diagnosis – discuss all the relevant assumption tests and provide statistics
showing how they are met.
3. Results – present results only, do not interpret.
4. Discussion of results – interpret the results and discuss variance or agreement with
literature or theory.
Academic writing structure…
Chapter Four: DATA ANALYSIS AND INTERPRETATION OF RESULTS
Introduction: introduce and state the main methods used and the main classes of the results presented
Main body: must cover.
• Descriptive statistics – state the results and explain what they mean in general and in relation to the literature.
• Diagnostic tests – explain how good and significant the tests are, if they are applicable to your research.
• Main results – present the main results objective by objective (if quantitative, state the statistical significance of
each result; if qualitative, support with examples given verbatim by respondents or the frequency of the
responses).
• Results discussion – interpret the results and show how they link to literature and theory.
• Chapter summary: summarise the main results and introduce the next chapter.
NB: All headings and subheadings must clearly reflect what is presented in the results or what the results are
all about. This is important for the reader to easily link the results to the particular problem or objective of the
study addressed by those findings. Headings must be informative enough to the reader in all chapters.
Example 1.
Poor heading: Descriptive statistics
Good heading: Summary statistics of factors affecting the price of properties
Example 2:
Poor heading: Linear regression results
Good heading: The effects of property characteristics on the price of properties
Factors affecting prices of houses
4.0 Introduction
This section presents results on the factors that influence the prices of properties. The data
consist of 65 properties in selected suburbs of Bulawayo and capture the price of each
property, its age, the size of the land on which it is located and the size of the house itself. The
results were analysed using a multiple linear regression model. Three sets of results are
presented: first the descriptive statistics describing the nature and scope of the data,
followed by tests of the assumptions of the multiple linear regression model, and then the main
results. The last section discusses the results.
The regression model as a whole is correctly specified and fits the data well, as the F-
statistic is significant (Table 2). In addition, the adjusted R square indicates that about
82% of the variation in house prices is explained by changes in the three selected
independent variables (Table 3), which gives the model high explanatory power. The
model does not seem to have any multicollinearity problems, as evidenced by the low
values of the VIF. The residuals do not suffer from autocorrelation, given the moderate value
of the Durbin-Watson statistic. Finally, the residuals of the dependent variable are
normally distributed and show a linear pattern (see Appendices 1 and 2), and no case has a
Cook's distance greater than 1, meaning no outliers are observed in the data.
Table 2: ANOVA
Model 1       Sum of Squares    df    Mean Square    F        Sig.
Regression        89695.407      3      29898.469    93.13    .000b
Residual          19584.381     61        321.06
Total            109279.788     64

Table 3: Model Summary
Model    R       R Square    Adjusted R Square    Std. Error of the Estimate    Durbin-Watson
1        .906a   .821        .812                 17.918                        1.682
The results show that only the size of the house and the size of the land on which it is
located are important factors affecting the price of a house. However, the age of the house
does not seem to affect house prices. Based on these results, prices of
houses can be modelled using land size and house size only (Table 4).
As an example, if a house is located on a 300-square-metre stand and its built-up area is
90 square metres, its price would be approximately $50 760.00, i.e.
(300 × 0.162) + (90 × 0.024) = 50.76, with price measured in thousands of dollars.
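The sketch below reproduces this calculation; it assumes, as the arithmetic implies, that price is measured in thousands of dollars.

```python
# Worked prediction using the two significant coefficients above.
land_size, house_size = 300, 90  # square metres
price_thousands = 0.162 * land_size + 0.024 * house_size
print(f"Predicted price = ${price_thousands * 1000:,.2f}")  # $50,760.00
```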
Notes on interpreting and presenting linear regression results from SPSS
output.
1. Model fit – results from a regression must be used only if the data fit the model well. The
model fit must be significant with a p-value of less than 0.05; you find this in the ANOVA
table. If this condition is satisfied, we conclude that the model fits the data well and can go
ahead and interpret the coefficients of the independent variables. Another way to assess the goodness of the
model is to look at the R squared. This tells us how much of the variation in the dependent variable
is explained by the selected independent variables. The higher this value, the better
the model. This is useful especially where the stepwise method is used. R squared is used in simple
linear regression, while adjusted R squared is preferred in multiple linear regression. (A sketch of reading these statistics programmatically follows below.)
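As an illustration of reading these fit statistics programmatically rather than from the SPSS ANOVA and model-summary tables, the following statsmodels sketch uses simulated data.

```python
# Reading model-fit statistics from a fitted OLS result.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
X = sm.add_constant(rng.normal(size=(60, 3)))
y = X @ np.array([1.0, 2.0, -1.5, 0.0]) + rng.normal(scale=0.5, size=60)
model = sm.OLS(y, X).fit()

print(f"F-statistic = {model.fvalue:.2f} (p = {model.f_pvalue:.4f})")
print(f"R squared = {model.rsquared:.3f}, adjusted = {model.rsquared_adj:.3f}")
if model.f_pvalue < 0.05:
    print("Model fits well; interpret the coefficients in model.params")
```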
2. Block method
a) In this case you enter the independent variables in sets (blocks) of selected independent
variables, each as a separate block, in ascending or descending order. The aim is to compare the
R squared and significance levels of the different blocks to identify which block gives a better fit (see the sketch below).
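A hedged sketch of the block idea in Python: fit nested models, adding a block of regressors at a time, and compare the adjusted R squared; the variable names and data are illustrative, not from the lecture.

```python
# Block (hierarchical) entry: compare fit as blocks of regressors are added.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 100
df = pd.DataFrame(rng.normal(size=(n, 4)), columns=["x1", "x2", "x3", "x4"])
df["y"] = 1 + 2 * df["x1"] + 0.5 * df["x2"] + rng.normal(scale=1.0, size=n)

blocks = [["x1", "x2"], ["x1", "x2", "x3", "x4"]]  # block 1, then block 2 added
for i, cols in enumerate(blocks, start=1):
    res = sm.OLS(df["y"], sm.add_constant(df[cols])).fit()
    print(f"Block {i}: adj. R squared = {res.rsquared_adj:.3f}, "
          f"F p-value = {res.f_pvalue:.4g}")
```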