Sessions 6 & 7: Regression Analysis
www.cranfield.ac.uk/som
Content:
Sessions 1 & 2: Probability and Probability Distributions
Session 3: Sampling and Estimations
Session 4: Hypothesis Testing
Session 5: Problem Solving
SESSIONS 6 & 7: REGRESSION ANALYSIS
Session 8: Regression Models with Dummy Variables
Sessions 9 & 10: Problem Solving and Exam Revision
Statistical Analysis in Finance
Reading:
Statistical Techniques in Business and Economics (17/E) by Douglas A. Lind, William G. Marchal and Samuel A. Wathen, 2017, McGraw-Hill. Chapters 13 and 14.
Examples
• Does the amount Boots spends per month on training its sales force affect its monthly sales?
• Does the number of hours students study for the SAF exam influence the exam score?
Scatter Plot
[Figure: scatter plot of Barclays' stock return against the FTSE-ALL share return]
Correlation
Correlation Coefficient

$$r = \frac{\mathrm{Cov}(x, y)}{s_x s_y}, \qquad \mathrm{Cov}(x, y) = \sum_{i=1}^{n} \frac{(x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

$$s_x = \sqrt{s_x^2}, \qquad s_x^2 = \sum_{i=1}^{n} \frac{(x_i - \bar{x})^2}{n - 1}$$
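As a minimal sketch, the covariance and correlation formulas above can be computed directly. The data here are made-up illustrative numbers, not from the lecture.

```python
import math

def correlation(x, y):
    """Sample correlation r = Cov(x, y) / (s_x * s_y)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    return cov / (s_x * s_y)

# illustrative (hypothetical) data
print(round(correlation([1, 2, 3, 4, 5], [2, 4, 5, 4, 5]), 4))  # 0.7746
```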
Correlation Coefficient, Example 1
$$t_{df} = \frac{r}{s_r}, \quad \text{where } s_r = \sqrt{\frac{1 - r^2}{n - 2}}$$

Equivalently,

$$t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}}$$

Reject H0 if:

$$t > t_{\alpha, n-2} \quad \text{or} \quad t < -t_{\alpha, n-2}$$
Testing the Significance of r
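The significance test for r can be sketched as follows. The values r = 0.759 and n = 12 are illustrative; the two-tailed critical value for df = 10 at α = .05 is the tabled 2.228.

```python
import math

def t_stat_for_r(r, n):
    """Test statistic for H0: rho = 0, t = r * sqrt(n - 2) / sqrt(1 - r^2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# illustrative values: r = 0.759 from a sample of n = 12 (df = 10)
t = t_stat_for_r(0.759, 12)
print(round(t, 3))  # 3.686
# since 3.686 > 2.228 (the tabled t for alpha = .05, df = 10), reject H0
```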
The basic premise of simple linear
regression analysis
$$y = \beta_0 + \beta_1 x$$

where y is the dependent variable and x is the independent variable.
• The aim of regression analysis is to estimate the unknown
population parameters (β0 and β1).
• We usually use sample data to estimate the population
parameters of interest. Let b0 and b1 represent the estimates
of β0 and β1, respectively.
$$\hat{y} = b_0 + b_1 x$$
For example, to test whether larger firms (x) usually pay higher Dividends (y)
• The effect of firm size on dividends can be written using a simple equation:
$$y = \beta_0 + \beta_1 x$$

$$y = \beta_0 + \beta_1 x + \varepsilon \quad \text{(population equation)}$$

$$y = b_0 + b_1 x + e \quad \text{(sample equation)}$$
Determining sample regression equation
$$\hat{y} = b_0 + b_1 x$$

• y-hat is the estimated value of y for a selected value of x
Determining sample regression equation
(cont’d)
$$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$$

$$b_0 = \bar{y} - b_1 \bar{x}$$
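As a sketch, the two estimation formulas translate directly into code; the data are made-up illustrative numbers.

```python
def ols_fit(x, y):
    """Least-squares estimates: b1 = Sxy / Sxx, b0 = ybar - b1 * xbar."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b0 = y_bar - b1 * x_bar
    return b0, b1

# illustrative (hypothetical) data
b0, b1 = ols_fit([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(b0, 2), round(b1, 2))  # 2.2 0.6
```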
Deriving OLS
Let

$$L = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2$$

Setting the partial derivatives to zero:

$$\frac{\partial L}{\partial b_1} = -2 \sum_{i=1}^{n} x_i (y_i - b_0 - b_1 x_i) = 0$$

Solving gives

$$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x}$$
Relation between b1 (slope) and r (correlation)

$$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$$

If we divide numerator and denominator by $(n - 1)$:

$$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y}) / (n - 1)}{\sum (x_i - \bar{x})^2 / (n - 1)} = \frac{\mathrm{Cov}(x, y)}{s_x^2}$$

Since

$$r = \frac{\mathrm{Cov}(x, y)}{s_x s_y}$$

the SLOPE OF THE REGRESSION LINE is

$$b_1 = r \frac{s_y}{s_x}$$
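A quick numerical check of the identity b1 = r(s_y/s_x), using toy (hypothetical) data:

```python
import math

# toy data (hypothetical), checking that b1 equals r * s_y / s_x
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

b1 = s_xy / s_xx                       # regression slope
r = s_xy / math.sqrt(s_xx * s_yy)      # correlation coefficient
s_x = math.sqrt(s_xx / (n - 1))
s_y = math.sqrt(s_yy / (n - 1))
print(abs(b1 - r * s_y / s_x) < 1e-12)  # True
```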
Example 3
An article in Business Week listed the “Best Small Companies.” We are
interested in the current results of the companies’ sales and earnings. A random
sample of 12 companies was selected and the sales and earnings, in millions,
are reported below. Let sales be the independent variable and earnings be the
dependent variable. Determine the regression equation.
Company               Earnings (m)   Sales (m)
Papa International    4.9            89.2
Applied Innovation    4.4            18.6
Integracare           1.3            18.2
Wall Data             8.0            71.7
Davidson Associates   6.6            58.6
Chico's Fas           4.1            46.8
Checkmate Elec        2.6            17.5
Royal Grip            1.7            11.9
M-Wave                3.5            19.6
Serving-N-Slide       8.2            51.2
Daig                  6.0            28.6
Cobra Golf            12.8           69.2
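As a sketch, the least-squares formulas can be applied directly to the Example 3 data; the printed equation is computed from the table above.

```python
# Example 3 data: earnings (y) regressed on sales (x), both in millions
sales =    [89.2, 18.6, 18.2, 71.7, 58.6, 46.8, 17.5, 11.9, 19.6, 51.2, 28.6, 69.2]
earnings = [4.9,  4.4,  1.3,  8.0,  6.6,  4.1,  2.6,  1.7,  3.5,  8.2,  6.0,  12.8]

n = len(sales)
x_bar, y_bar = sum(sales) / n, sum(earnings) / n
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(sales, earnings))
      / sum((x - x_bar) ** 2 for x in sales))
b0 = y_bar - b1 * x_bar
# prints approximately: earnings-hat = 1.85 + 0.0836 * sales
print(f"earnings-hat = {b0:.2f} + {b1:.4f} * sales")
```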
Example 4
$$\hat{y} = -1.74 + 1.641x$$
Question: If an analyst tells you that she expects the FTSE-ALL
share to yield a 20% return next year, what would you expect the
return on BP to be?
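The answer follows by plugging x = 20 into the estimated equation:

```python
# prediction from the estimated equation y-hat = -1.74 + 1.641x, at x = 20 (%)
b0, b1 = -1.74, 1.641
x = 20
y_hat = b0 + b1 * x
print(round(y_hat, 2))  # 31.08, i.e. an expected return of about 31.08%
```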
• After the dialog box opens, select the data for your dependent
variable (Earnings) in the Input Y Range and the data for your
independent variable (Sales) in the Input X Range.
$$b \pm t_{\alpha,\, n-2}\, s_b$$
Standard errors of estimates
[Figure: two scatter plots around the same fitted line; in one the points lie close to the line (small standard error), in the other they scatter further from it (large standard error)]
Standard errors of estimates (cont’d)
Coefficient of determination (R2)

[Figure: scatter plot with fitted line, illustrating for a single observation the deviation of y from its mean and from the fitted value]

$$R^2 = \frac{RSS}{TSS} = 1 - \frac{ESS}{TSS}$$

where RSS is the regression (explained) variation, ESS the error variation, and TSS the total variation.

[Figure: Earnings (£m), y, plotted against Sales (£m), x, showing actual earnings and predicted earnings around the fitted line]
Coefficient of Determination (Goodness of Fit)

$$ESS = \sum (y - \hat{y})^2 = \sum e^2$$
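A minimal sketch of the goodness-of-fit calculation, reusing the earlier toy data and its fitted line (both hypothetical):

```python
def r_squared(y, y_hat):
    """R^2 = 1 - ESS/TSS, with ESS = sum of squared residuals."""
    y_bar = sum(y) / len(y)
    ess = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    tss = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ess / tss

# fitted values from the toy regression y-hat = 2.2 + 0.6x (hypothetical data)
y = [2, 4, 5, 4, 5]
fits = [2.2 + 0.6 * x for x in [1, 2, 3, 4, 5]]
print(round(r_squared(y, fits), 4))  # 0.6
```

Note that 0.6 equals the square of the correlation (r ≈ 0.7746) for the same data, as expected in simple regression.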
Determining the sample MLR equation (cont'd)

With one independent variable: $\hat{y} = b_0 + b_1 x_1$. With two: $\hat{y} = b_0 + b_1 x_1 + b_2 x_2$.

In matrix form:

$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & x_{12} \\ 1 & x_{21} & x_{22} \\ \vdots & \vdots & \vdots \\ 1 & x_{n1} & x_{n2} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix} + \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix}$$

and the OLS estimates are

$$\begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_k \end{bmatrix} = (X'X)^{-1} X'y$$
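A minimal numpy sketch of the matrix formula, with made-up data (two regressors plus an intercept column of ones):

```python
import numpy as np

# b = (X'X)^{-1} X'y with an intercept column of ones (made-up data)
x1 = np.array([1.0, 2, 3, 4, 5])
x2 = np.array([2.0, 1, 4, 3, 5])
y = np.array([3.0, 4, 8, 9, 13])

X = np.column_stack([np.ones_like(x1), x1, x2])
b = np.linalg.inv(X.T @ X) @ X.T @ y
print(b.shape)  # (3,)
# the OLS residuals are orthogonal to every column of X
print(np.allclose(X.T @ (y - X @ b), 0))  # True
```

In practice `np.linalg.lstsq` is numerically preferable to forming the inverse explicitly; the version above simply mirrors the slide's formula.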
Interpreting Coefficients in MLR
$$y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k + e$$

Standard error of the estimate:

$$s = \sqrt{\frac{\sum (y - \hat{y})^2}{n - k - 1}} = \sqrt{\frac{\sum e^2}{n - k - 1}} = \sqrt{\frac{ESS}{n - k - 1}}$$
Adjusted R2

$$R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}$$
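The adjustment can be sketched as a one-line function; the inputs below are illustrative, not from a lecture example.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# illustrative: R^2 = 0.65 with n = 80 observations and k = 2 regressors
print(round(adjusted_r2(0.65, 80, 2), 4))  # 0.6409
```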
$$b \pm t_{\alpha,\, n-k-1}\, s_b$$
Testing the multiple regression model
(joint hypothesis test)
The joint (global) hypothesis test is used to investigate whether any of the independent variables have significant coefficients. The hypotheses are:

$$H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0$$
$$H_1: \text{Not all } \beta\text{'s equal 0, or at least one } \beta_i \neq 0$$

Test statistic:

$$F = \frac{\sum (\hat{y} - \bar{y})^2 / k}{\sum (y - \hat{y})^2 / (n - k - 1)} = \frac{RSS/k}{ESS/(n - k - 1)}$$

Decision Rule: reject $H_0$ if the calculated F exceeds the critical value $F_{\alpha,\, k,\, n-k-1}$.

The statistic can be rewritten in terms of $R^2$:

$$F = \frac{RSS/k}{ESS/(n - k - 1)} = \frac{RSS}{ESS}\cdot\frac{n - k - 1}{k} = \frac{RSS}{\text{Total SS} - RSS}\cdot\frac{n - k - 1}{k}$$

$$= \frac{RSS/\text{Total SS}}{1 - RSS/\text{Total SS}}\cdot\frac{n - k - 1}{k} = \frac{R^2/k}{(1 - R^2)/(n - k - 1)}$$
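The final form of the statistic can be sketched directly from R2; the values R2 = 0.6, n = 80, k = 2 are illustrative.

```python
def f_stat(r2, n, k):
    """Global F statistic, F = (R^2 / k) / ((1 - R^2) / (n - k - 1))."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))

# illustrative: R^2 = 0.6, n = 80 observations, k = 2 regressors
print(round(f_stat(0.6, 80, 2), 2))  # 57.75
```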
Example 6
Suppose we want to estimate the relationship between Wage and Age and
Education. Say we gathered data on 80 workers at Cranfield School of
Management with information on their hourly wage, education and age.
• Estimate whether Wages are determined by Education and Age (call this
model 1).
Example 6: Solution
[Regression output table not fully reproduced; p-values in parentheses: (0.127), (0.193)]
Example 6: Solution
• Interpretation
• Holding Age constant, the coefficient of Education is 1.44. This means that if education increases by one year (unit), the wage would increase by 1.44.
• Holding Education constant, the coefficient of Age is 0.04. This means that as a person gets older by one year (unit), the wage would increase by 0.04.
• The Education coefficient is significant (p-value < 5%), so we reject the null hypothesis that βEducation = 0.
• The Age coefficient is not significant (p-value > 5%), so we do not reject the null hypothesis that βAge = 0.
• Predicted wage for 16 years of education at age 41: Y = 2.63 + 1.44(16) + 0.04(41) = 27.31
• The Adj-R2 is 0.60, so the model is able to explain 60% of the variation in the dependent variable.
• The F-statistic of 62.46 is significant (p-value < 5%). Reject the null: there is strong evidence that at least one of the coefficients is different from zero. The model as a whole is therefore significant.
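The prediction step works out as:

```python
# predicted wage from Example 6's fitted model,
# for 16 years of education and age 41
y_hat = 2.63 + 1.44 * 16 + 0.04 * 41
print(round(y_hat, 2))  # 27.31
```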
4. Conditional on x1, x2, …, xk, the variance of the error term is the same for all observations. The error term is homoskedastic (equally scattered): Var(ε) = σ2.
5. Conditional on x1, x2 , …. xk, the error term is uncorrelated across observations, Cov (ε i,
εj)=0; i≠j. There is no autocorrelation.
6. The error term is not correlated with any independent variable,
Cov (εi,xi)=0. There is no endogeneity.
7. The error term is normally distributed. This assumption allows us to construct confidence intervals and conduct tests of significance. If this assumption does not hold, the regression hypothesis tests are not valid.
GAUSS-MARKOV THEOREM
• Effects of multicollinearity:

$$VIF = \frac{1}{1 - R_j^2}$$

The term $R_j^2$ refers to the coefficient of determination obtained when the selected independent variable is regressed on the remaining independent variables.
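The inflation formula can be sketched directly; the value of R_j² below is illustrative.

```python
def vif(r2_j):
    """Variance inflation factor, VIF = 1 / (1 - R_j^2)."""
    return 1.0 / (1.0 - r2_j)

# illustrative: if x_j is well explained by the other regressors (R_j^2 = 0.9),
# the variance of its coefficient estimate is inflated tenfold
print(round(vif(0.9), 2))  # 10.0
```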
Common Violation 2: The error term is not
homoskedastic
• Under homoscedasticity, the variance of the errors is the same for all observations of the independent variables: Var(ε) = σ2.
• Under heteroskedasticity, the variance of the error term changes for different values of at least one independent variable.
• Greater dispersion (cross-section): e.g., saving is more variable in households with higher average income; dividends are more variable in firms with higher profits.
Common Violation 3: The error terms are
autocorrelated
• If X3,i, like many economic series (e.g., GDP, stock returns), exhibits a trend over time, then X3,i depends on X3,i-1, X3,i-2 and so on.
Common Violation 4: Endogeneity
Quadratic regression models
$$y = b_0 + b_1 x + b_2 x^2 + e$$
Quadratic regression models (cont’d)
$$y = b_0 + b_1 x + b_2 x^2 + e$$

[Figure: two panels; when $b_2 < 0$ the fitted curve is concave (inverted U shape), when $b_2 > 0$ it is convex (U shape)]
Example 7
Example 7 (cont’d)
[Figure: scatter plot of wages versus age (roughly 20 to 80), with linear and quadratic fitted lines overlaid]
Example 7 (cont’d)
            LR Model (1)   QR Model (2)
Education   1.441**        1.254**
            (0.000)        (0.000)
Age2                       -0.0133**
                           (0.000)

P-values in parentheses; ** p < 0.05
Example 7 (cont'd): Multicollinearity?
Centering means that you subtract off the mean before squaring.
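A sketch of why centering helps: with hypothetical ages spread evenly from 20 to 80, the raw correlation between Age and Age2 is near 1, while the centered version is essentially zero.

```python
import math

def corr(a, b):
    n = len(a)
    a_bar, b_bar = sum(a) / n, sum(b) / n
    cov = sum((x - a_bar) * (y - b_bar) for x, y in zip(a, b))
    va = sum((x - a_bar) ** 2 for x in a)
    vb = sum((y - b_bar) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

ages = list(range(20, 81))                     # hypothetical ages 20..80
raw = corr(ages, [a ** 2 for a in ages])
mean_age = sum(ages) / len(ages)
c_ages = [a - mean_age for a in ages]          # centered ages
centered = corr(c_ages, [c ** 2 for c in c_ages])
print(round(raw, 3), round(centered, 3))
```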
Example 7 (cont’d):
QR Model (1):
  Education        1.254**   (0.000)
  Age              1.350**   (0.000)
  Age2             -0.0133** (0.000)
  Constant         -22.72**  (0.000)
  Observations     80
  Adjusted R2      0.826
  Corr(Age, Age2)  0.987
  VIF (Age)        43.08
  VIF (Age2)       42.99
  VIF (Education)  1.05

Transformed QR Model (2):
  Education        1.254**   (0.000)
  C_Age            0.031     (0.128)
  C_Age2           -0.0133** (0.000)
  Constant         11.461    (0.000)
  Observations     80
  Adjusted R2      0.826
  Corr(Age, Age2)  -0.086
  VIF (Age)        1.01
  VIF (Age2)       1.05
  VIF (Education)  1.05

P-values in parentheses; ** p < 0.05
Evaluating effects for a quadratic
regression model (QRM)
Example 8
2. The optimal age at which the wage is maximized, found from the turning point $x^* = -b_1 / (2 b_2)$, is about 51 years, with a wage of about £31.60 (for 16 years of education).
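The turning-point calculation can be sketched with the coefficients from QR Model (1) of Example 7:

```python
# turning point of the quadratic wage model, using Example 7's coefficients:
# wage-hat = -22.72 + 1.254*Education + 1.350*Age - 0.0133*Age^2
b0, b_edu, b_age, b_age2 = -22.72, 1.254, 1.350, -0.0133

age_star = -b_age / (2 * b_age2)              # x* = -b1 / (2 * b2)
wage_at_max = b0 + b_edu * 16 + b_age * age_star + b_age2 * age_star ** 2
print(round(age_star), round(wage_at_max, 2))  # 51 31.6
```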
Practical Example 9
Practical Example 9 (cont.)
H0: β1 = β2 = β3 = β4 = β5 = 0
H1: Not all the β's are 0
Practical Example 9 (cont.)
• Next we evaluate the individual regression coefficients. The null
hypothesis and the alternate hypothesis are (α = .05):
H0: βi = 0
H1: βi ≠ 0
• p-values for the regression coefficients for home value, years of
education, and gender are all less than .05. We conclude that these
regression coefficients are not equal to zero and are significant
predictors of family income.
• The p-values for age and mortgage amount are greater than the significance level of .05, so we do not reject the null hypotheses for these variables. These regression coefficients are not significantly different from zero and are not related to family income.
• Based on the results of testing each of the regression coefficients, we conclude that the variables age and mortgage amount are not effective predictors of family income. Thus, they should be removed from the multiple regression equation. Remember that we must remove one independent variable at a time and redo the analysis to evaluate the overall effect of removing the variable.