Regression Analysis

The document discusses regression analysis, which is used to describe and evaluate the relationship between a dependent variable and one or more independent variables. It covers simple linear regression, which develops an equation to express this relationship. The aim is to estimate the population parameters, and sample data is used to estimate the sample regression equation and coefficients through the method of least squares. Tests of significance can determine if the estimated relationships are statistically meaningful.

Statistical Analysis in Finance

Sessions 6 & 7:
Regression Analysis

Dr. Nemanja Radić

www.cranfield.ac.uk/som

Statistical Analysis in Finance

Content :
Sessions 1 & 2: Probability and Probability Distributions
Session 3: Sampling and Estimations
Session 4: Hypothesis Testing
Session 5: Problem Solving
SESSIONS 6 & 7: REGRESSION ANALYSIS
Session 8: Regression Models with Dummy Variables
Sessions 9 &10: Problem Solving and Exam Revision
2
Statistical Analysis in Finance

Reading:
Statistical Techniques in Business and Economics (17/E) by Douglas A. Lind, William G. Marchal and Samuel A. Wathen, 2017, McGraw-Hill. Chapters 13 and 14.

Intended Learning Outcomes

• Discuss the limitations of correlation analysis.


• Estimate the simple linear regression model and interpret the
coefficients.
• Estimate the multiple linear regression model and interpret
the coefficients.
• Conduct tests of individual significance.
• Conduct a test of joint significance.
• Describe common violations of the assumptions of ordinary
least squares (OLS) method.
• Use and evaluate quadratic regression models.
4
What is Correlation Analysis?

§ Used to report the relationship between two variables.

CORRELATION ANALYSIS: A group of techniques to measure the relationship between two variables.

§ In addition to graphing techniques, we’ll develop numerical measures to describe the relationships.

Examples
§ Does the amount Boots spends per month on training its sales
force affect its monthly sales?
§ Does the number of hours students study for SAF exam influence
the exam score?
5

Scatter Plot

• A scatter plot is a graph that shows the relationship between the observations for two data series in two dimensions.

[Figure: scatter plot of Barclays' stock return against the FTSE-ALL share return.]
6
Correlation

• A scatter plot uses a graph; correlation analysis expresses the same relationship using a single number.
• The correlation coefficient measures the linear association between two variables.
• It is not implied that changes in one variable cause changes in the other variable.
• Rather, it simply states that there is evidence for a linear relationship between the two variables, and that the movements in the two are on average related to an extent given by the correlation coefficient.

Correlation Coefficient

CORRELATION COEFFICIENT: A measure of the strength of the linear relationship between two variables.

Characteristics of the correlation coefficient are:
• The sample correlation coefficient is identified as r
• It shows the direction and strength of the linear relationship between two interval- or ratio-scale variables
• It ranges from -1.00 to 1.00
• If it’s 0, there is no association
• A value near 1.00 indicates a direct or positive correlation
• A value near -1.00 indicates a negative correlation
8
Correlation Coefficient

The following graphs summarize the strength and direction of the correlation coefficient:

Calculating the correlation coefficient

r = Cov(x, y) / (s_x · s_y)

Cov(x, y) = Σ (x_i - x̄)(y_i - ȳ) / (n - 1)

s_x = √(s_x²),  where  s_x² = Σ (x_i - x̄)² / (n - 1)

(sums run over i = 1, ..., n)
10
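As a quick illustration, these formulas can be checked in Python with NumPy (a minimal sketch; the x and y arrays below are made-up values, not the course data set):

import numpy as np

x = np.array([96.0, 40.0, 104.0, 128.0, 164.0, 76.0])   # illustrative values of x
y = np.array([41.0, 30.0, 45.0, 60.0, 65.0, 30.0])      # illustrative values of y

n = len(x)
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)   # sample covariance
s_x, s_y = x.std(ddof=1), y.std(ddof=1)                      # sample standard deviations
r = cov_xy / (s_x * s_y)                                     # correlation coefficient
print(r)                     # agrees with np.corrcoef(x, y)[0, 1]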
Correlation Coefficient, Example 1

How is the correlation coefficient determined? We’ll use the North American Copier Sales data as an example. We begin with a scatter diagram, but this time we’ll draw a vertical line at the mean of the x-values (96 sales calls) and a horizontal line at the mean of the y-values (45 copiers).

11

Limitations of correlation analysis

• The correlation coefficient captures only a linear relationship.
• The correlation coefficient may not be a reliable measure in the presence of outliers.
• Even if two variables are highly correlated, one does not necessarily cause the other.
Testing the significance of the correlation
coefficient

• We need to be able to determine whether the relationship implied by the sample correlation coefficient is real or due to chance.
• In other words, we would like to test whether the population correlation coefficient is different from zero:

H0: ρ = 0 (the correlation in the population is 0)
H1: ρ ≠ 0 (the correlation in the population is not 0)

The test statistic follows a t-distribution with df = n - 2:

t = r / s_r,  where  s_r = √((1 - r²) / (n - 2))
14

Testing the significance of the correlation coefficient

Equivalently, the test statistic can be written as

t = r·√(n - 2) / √(1 - r²),  with n - 2 degrees of freedom

Reject H0 if:
t > t_{α, n-2} or t < -t_{α, n-2}
15
Testing the Significance of r

• Recall that the sales manager from North American Copier Sales found an r of 0.865.
• Could the result be due to sampling error? Remember only 15 salespeople were sampled.
• We ask the question, could there be zero correlation in the population from which the sample was selected?
• We’ll let ρ represent the correlation in the population and conduct a hypothesis test to find out.

16
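Plugging the figures quoted above (r = 0.865, n = 15) into the test statistic, a minimal SciPy sketch of the test looks like this (using a two-tailed 5% critical value):

from scipy import stats

r, n = 0.865, 15
t = r * (n - 2) ** 0.5 / (1 - r ** 2) ** 0.5      # test statistic with df = n - 2
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 2)      # two-tailed 5% critical value
print(t, t_crit)          # t is about 6.2, well above roughly 2.16, so reject H0: rho = 0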

Regression Analysis (RA)

• It is concerned with describing and evaluating the relationship between a given variable (usually called the dependent variable, y) and one or more other variables (usually known as the independent variables, x).
• Specifically, regression is an attempt to explain movements in a dependent variable by reference to movements in one or more other independent variables.
• It develops an equation that expresses the relationship between the dependent variable and the independent variable.

18
The basic premise of simple linear
regression analysis

y = β0 + β1·x

where y is the dependent variable and x is the independent variable.
• The aim of regression analysis is to estimate the unknown population parameters (β0 and β1).
• We usually use sample data to estimate the population parameters of interest. Let b0 and b1 represent the estimates of β0 and β1, respectively.
• We form the sample regression equation as

y = b0 + b1·x
19

The basic premise of simple linear regression analysis

For example, to test whether larger firms (x) usually pay higher dividends (y):

• The effect of firm size on dividends can be written using a simple equation:

y = β0 + β1·x

• However, this equation (y = β0 + β1·x) is completely deterministic.
• Is this realistic? No. So what we do is add a random error term, ε, into the equation:

y = β0 + β1·x + ε   (population equation)
y = b0 + b1·x + e   (sample equation)
Determining sample regression equation

• The least-squares principle minimizes the sum of squared errors from the estimated regression line.
• In other words, we choose b0 and b1 so that the (vertical) distances from the data points to the fitted line are minimised (so that the line fits the data as closely as possible).
21

Determining sample regression equation (cont’d)

ŷ = b0 + b1·x

• ŷ (y-hat) is the estimated value of y for a selected value of x
• b0 is the intercept (constant)
• b1 is the slope of the line;
b1 is also known as the marginal effect: the average change in ŷ for each change of one unit in the independent variable x.

22
Determining sample regression equation
(cont’d)

• The common method used to estimate the coefficients (b0 and b1) is known as least squares (LS).
• The objective of LS is to make the difference between the predicted values (ŷ) and actual values (y) as small as possible.
• What we actually do is take each distance and square it (i.e. take the area of each of the squares in the diagram) and minimise the total sum of the squares (hence least squares).

23

Computing the slope (b1) and the intercept (b0)

• Using calculus we can show that

b1 = Σ (x_i - x̄)(y_i - ȳ) / Σ (x_i - x̄)²

b0 = ȳ - b1·x̄
24
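These two formulas translate directly into code. A minimal NumPy sketch with made-up data:

import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])     # illustrative independent variable
y = np.array([3.1, 5.0, 7.2, 8.8, 11.1])     # illustrative dependent variable

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)   # slope
b0 = y.mean() - b1 * x.mean()                                                # intercept
print(b0, b1)     # matches np.polyfit(x, y, 1), which returns (slope, intercept)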
Deriving OLS

• For a sample of n observations (i = 1, 2, ..., n)
• OLS minimises the sum of squared residuals

e1² + e2² + e3² + ... + en²

We know that e_i² = (y_i - ŷ_i)² = (y_i - b0 - b1·x_i)²

So let L = Σ e_i² = Σ (y_i - b0 - b1·x_i)²

We want to minimise L with respect to (w.r.t.) b0 and b1, so we differentiate L w.r.t. b0 and b1.
25

Deriving OLS (cont’d)


∂L/∂b0 = -2 Σ (y_i - b0 - b1·x_i)

∂L/∂b1 = -2 Σ x_i·(y_i - b0 - b1·x_i)

Setting these derivatives to zero, substituting and rearranging gives

b1 = Σ (x_i - x̄)(y_i - ȳ) / Σ (x_i - x̄)²

b0 = ȳ - b1·x̄

26
Relation between b1(slope)
and r (correlation)

b1 = Σ (x_i - x̄)(y_i - ȳ) / Σ (x_i - x̄)²

If we divide both the numerator and the denominator by (n - 1):

b1 = [Σ (x_i - x̄)(y_i - ȳ) / (n - 1)] / [Σ (x_i - x̄)² / (n - 1)] = Cov(x, y) / s_x²

Since r = Cov(x, y) / (s_x·s_y), the SLOPE OF THE REGRESSION LINE can be written as

b1 = r · (s_y / s_x)
27

Example 3
An article in Business Week listed the “Best Small Companies.” We are
interested in the current results of the companies’ sales and earnings. A random
sample of 12 companies was selected and the sales and earnings, in millions,
are reported below. Let sales be the independent variable and earnings be the
dependent variable. Determine the regression equation.
Company Earnings (m) Sales (m)
Papa International 4.9 89.2
Applied Innovation 4.4 18.6
Integracare 1.3 18.2
Wall Data 8 71.7
Davidson Associates 6.6 58.6
Chico’s Fas 4.1 46.8
Checkmate Elec 2.6 17.5
Royal Grip 1.7 11.9
M-Wave 3.5 19.6
Serving-N-Slide 8.2 51.2
Daig 6 28.6
Cobra Golf 12.8 69.2
Example 4

Let’s assume that we regressed the monthly BP stock returns on the monthly FTSE-ALL share index returns for a period of 36 months. The estimated line equation is given as

ŷ = -1.74 + 1.641·x

Question: If an analyst tells you that she expects the FTSE-ALL share to yield a 20% return next year, what would you expect the return on BP to be?

30
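Working through the arithmetic, the estimated line gives ŷ = -1.74 + 1.641 × 20 = 31.08, so the expected return on BP would be roughly 31%.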

Excel and Regression, Example 5

• You can use the standard Excel functions to compute the intercept and slope in the regression equation:
=INTERCEPT(y-range, x-range)
=SLOPE(y-range, x-range)

• Alternatively, open the data in an Excel spreadsheet and from the menu choose Data > Data Analysis > Regression.
• After the dialog box opens, select the data for your dependent variable (Earnings) in the Input Y Range and the data for your independent variable (Sales) in the Input X Range.
• We can display the output on a new page, in the current worksheet, or even in a new workbook.
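Outside Excel, the same regression can be run in Python with statsmodels (a sketch only; the file name is hypothetical and the data are assumed to contain Earnings and Sales columns as in Example 3):

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("small_companies.csv")     # hypothetical file holding the Example 3 data
X = sm.add_constant(df["Sales"])            # adds the intercept term b0
model = sm.OLS(df["Earnings"], X).fit()     # least-squares fit of Earnings on Sales
print(model.params)                         # b0 (const) and b1 (Sales)
print(model.summary())                      # full output: t-statistics, R-squared, F-test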
Testing the significance of the coefficients
(b0 and b1)

H0: β = 0 (the coefficient of the linear model is 0)
H1: β ≠ 0 (the coefficient of the linear model is not 0)

Test statistic: t = b / s_b, where s_b is the standard error of the slope (b).

Reject H0 if: t > t_{α, n-2} or t < -t_{α, n-2}

We can also construct a confidence interval for b:

b ± t_{α, n-2}·s_b
32

Evaluating a regression equation’s ability to predict

• The standard error of the estimate
• The coefficient of determination (R²)

33
Standard errors of estimates

• The estimates are specific to the sample used.
• It would be desirable to have an idea of how “good” these estimates are, in the sense of having some measure of the reliability or precision of the estimates.
• Whether they are likely to vary from one sample to another sample within the given population.
• An idea of the precision of the estimates can be obtained using the standard error.

34

Regression Standard Error

[Figure: two scatter plots around a fitted regression line; in one the points lie close to the line, in the other they are scattered further from the line.]

The Regression Standard Error (or standard error of the estimate) is a measure of the dispersion of the observed values around the line of regression.
35
Standard errors of estimates (cont’d)

• The standard error of the estimate measures the scatter, or dispersion, of the observed values around the line of regression.
• Formula used to compute the standard error:

s_{y.x} = √( Σ (y - ŷ)² / (n - 2) )

where ŷ is the predicted value of y.

36

Standard error of the estimate - Excel

37
Coefficient of determination (R2)

y
32

Explained variation (RSS)


29

26 y
23
!
"
SRSS Error variation (ESS)
R2 =
STSS Total variation (TSS)

20 25 30 35 40 45
x 38

Goodness of fit (R2)

[Figure: scatter plot of Earnings (£m), y, against Sales (£m), x, with the predicted earnings from the fitted regression line overlaid.]

39
Coefficient of Determination
(Goodness of Fit)

• The coefficient of determination, R², measures the fraction of the total variation in the dependent variable that is explained by the independent variable. It ranges from 0 to 1. One way to define R² is to say that it is the square of the correlation coefficient between y and ŷ.

R² = Regression Sum of Squares (explained variation) / Total Sum of Squares (total variation)
   = RSS / Total SS = Σ (ŷ - ȳ)² / Σ (y - ȳ)²
   = 1 - Error (residual) Sum of Squares (unexplained variation) / Total Sum of Squares (total variation)
   = 1 - ESS / Total SS = 1 - Σ (y - ŷ)² / Σ (y - ȳ)²
40
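The decomposition above can be verified numerically. A minimal sketch, assuming y holds the observed values and y_hat the fitted values from a regression:

import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination, computed as 1 - ESS/TSS."""
    ess = np.sum((y - y_hat) ** 2)         # unexplained (residual) variation
    tss = np.sum((y - np.mean(y)) ** 2)    # total variation
    return 1 - ess / tss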

Multiple Linear Regression

• What if our dependent (y) variable depends on more than one independent variable?
• For example, a firm’s dividend payments might plausibly depend on:
1. Firm profit
2. Firm level of debt
3. Available investments
4. Ownership structure of the firm
• Similarly, stock returns might depend on several factors.
• Having just one independent variable is no good in this case; we want to have more than one x variable. It is very easy to generalise the simple model to one with k independent variables.
41
Multiple Linear Regression (MLR) Model

• The population regression model of a dependent variable, Y, on a set of k independent variables, x1, x2, ..., xk is given by:

Y = β0 + β1·x1 + β2·x2 + ... + βk·xk + ε

where β0 is the Y-intercept of the regression surface and each βi, i = 1, 2, ..., k, is the slope of the regression surface with respect to xi. We call them partial regression coefficients. ε is the stochastic error term.

42

Determining the sample MLR equation

• Since we work with a sample we now write

y = b0 + b1·x1 + b2·x2 + ... + bk·xk + e

• where y is the dependent variable, x1, x2, ..., xk are the independent variables, and e is the residual (error).
• As in the case of the simple linear regression model, we apply the least-squares principle (estimator) to minimise the sum of squared errors:

ESS = Σ (y - ŷ)² = Σ e²
43
Determining the sample MLR equation
(cont’d)

Simple and multiple least-squares regression

[Figure: left panel shows a fitted line ŷ = b0 + b1·x in (x, y) space; right panel shows a fitted plane ŷ = b0 + b1·x1 + b2·x2 in (x1, x2, y) space.]

In a simple regression model, the least-squares estimator minimizes the sum of squared errors from the estimated regression line. In a multiple regression model, the least-squares estimator minimizes the sum of squared errors from the estimated regression plane.
44

Determining the sample MLR equation (cont’d)

• We rely on statistical packages to estimate sample MLR coefficients.
• Let’s consider the following sample MLR equation with 2 independent variables:

y_i = b0 + b1·x_i1 + b2·x_i2 + e_i

• In matrix form this is y = Xb + e, where y is the n×1 vector of observations on the dependent variable, X is the n×3 matrix whose columns are a column of ones and the two independent variables, b = (b0, b1, b2)′ is the 3×1 coefficient vector, and e is the n×1 vector of residuals.


45
Computing the slopes and the
intercept MLR

• In the simple linear regression, we took the error (residual) sum of squares and minimised it w.r.t. b0 and b1.
• In matrix notation, the residuals are stacked in an n×1 vector e = (e1, e2, ..., en)′.
• The residual sum of squares is then given by

e′e = e1² + e2² + ... + en² = Σ e²
46

Computing the slopes and the intercept MLR (cont’d)

• In order to obtain the parameter estimates, b0, b1, ..., bk, we would minimise the residual sum of squares with respect to all the b’s.
• It can be shown that

(b0, b1, ..., bk)′ = (X′X)⁻¹ X′y
47
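The matrix formula maps one-to-one onto NumPy. A minimal sketch with an illustrative design matrix X that already contains the column of ones:

import numpy as np

X = np.array([[1.0, 2.0, 1.0],
              [1.0, 3.0, 2.0],
              [1.0, 5.0, 2.5],
              [1.0, 7.0, 4.0],
              [1.0, 9.0, 5.5]])            # column of ones plus two regressors
y = np.array([4.0, 6.1, 8.0, 11.2, 14.1])  # illustrative dependent variable

b = np.linalg.inv(X.T @ X) @ X.T @ y       # (X'X)^(-1) X'y
print(b)                                   # b0, b1, b2
# in practice np.linalg.lstsq(X, y, rcond=None) is the numerically safer route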
Interpreting Coefficients in MLR

y = b0 + b1 x1 + b2 x2 + ... + bk xk + e

• In MLR, there is a slight modification in the interpretation of the estimated coefficients, as they show partial influences.

For example, if there are 3 independent variables, the value of b1 estimates how a one-unit change in the first independent variable (x1) will influence y, assuming that the other two independent variables are held constant.

48

Evaluating a regression equation’s ability to predict in MLR

• The multiple standard error of the estimate:

s = √( Σ (y - ŷ)² / (n - k - 1) ) = √( Σ e² / (n - k - 1) ) = √( ESS / (n - k - 1) )

• The coefficient of determination (R-square) and Adjusted R-square

49
Adjusted R2

• More independent variables always result in a higher R².
• But some of these variables may be unimportant and should not be in the model.
• The Adjusted R² tries to balance the raw explanatory power against the desire to include only important independent variable predictors.

R²_adj = 1 - (1 - R²)·(n - 1) / (n - k - 1)
50
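The formula is a one-liner to check in Python (a sketch with illustrative numbers):

def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(adjusted_r2(0.62, n=80, k=2))    # about 0.610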

Testing the significance of individual coefficients

H0: βi = 0 (the slope of the linear model is 0)
H1: βi ≠ 0 (the slope of the linear model is not 0)

Test statistic: t = bi / s_bi, where s_bi is the standard error of the slope bi.

Reject H0 if: t > t_{α, n-k-1} or t < -t_{α, n-k-1}

We can also construct a confidence interval for bi:

bi ± t_{α, n-k-1}·s_bi
51
Testing the multiple regression model
(joint hypothesis test)

The joint (global) hypothesis test is used to investigate whether any of the independent variables have significant coefficients. The hypotheses are:

H0: β1 = β2 = ... = βk = 0
H1: Not all βs equal 0 (at least one βi ≠ 0)

F = [Σ (ŷ - ȳ)² / k] / [Σ (y - ŷ)² / (n - k - 1)] = (RSS / k) / (ESS / (n - k - 1))

Decision Rule:
Reject H0 if F > F_{α, k, n-k-1}

Relation between R2 and F-test

F = (R² / k) / [(1 - R²) / (n - k - 1)]

Derivation:

F = (RSS / k) / (ESS / (n - k - 1)) = (RSS / ESS)·(n - k - 1) / k
  = [RSS / (Total SS - RSS)]·(n - k - 1) / k
  = [(RSS / Total SS) / (1 - RSS / Total SS)]·(n - k - 1) / k
  = (R² / k) / [(1 - R²) / (n - k - 1)]
54
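This relation is easy to verify in code. A minimal sketch with illustrative numbers:

def f_statistic(r2, n, k):
    """Joint F-test computed from the R-squared of the fitted model."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))

print(f_statistic(0.618, n=80, k=2))    # about 62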
Example 6

Suppose we want to estimate the relationship between Wage and Age and
Education. Say we gathered data on 80 workers at Cranfield School of
Management with information on their hourly wage, education and age.
• Estimate whether Wages are determined by Education and Age (call this model 1).
• What is the sample regression equation?
• Interpret the regression coefficients.
• Predict the wage if education is 16 and age is 41.
• Interpret the F-test and the adjusted R-square.
• Estimate whether Wages are determined by Age (model 2); estimate whether Wages are determined by Education (model 3).
• Which model is a better fit (i.e. compare the three models)?


55

Example 6: Solution

                 Model 1    Model 2    Model 3
Education        1.441                 1.45
                 (0.000)               (0.000)
Age              0.0472     0.063
                 (0.127)    (0.193)
Constant         2.638      21.77      4.83
                 (0.2684)   (0.000)    (0.0131)
Observations     80         80         80
Adjusted R²      0.608      0.009      0.601
Standard Error   4.678      7.446      4.719

P-values in parentheses
Example 6: Solution

• Y(wage) = 2.63 + 1.44 × Education + 0.04 × Age
• Interpretation:
• Holding Age constant, the coefficient of Education is 1.44. This means that if education increases by one year (unit), the wage would increase by 1.44.
• Holding Education constant, the coefficient of Age is 0.04. This means that as a person gets older by one year (unit), the wage would increase by 0.04.
• The education coefficient is significant (p-value < 5%), so we reject the null hypothesis that Education = 0.
• The age coefficient is not significant (p-value > 5%), so we do not reject the null hypothesis that Age = 0.
• Y = 2.63 + 1.44 × 16 + 0.04 × 41 = 27.31
• The Adjusted R² is 0.60, so the model is able to explain 60% of the variation in the dependent variable.
• The F-statistic of 62.46 is significant (p-value < 5%). Reject the null; there is strong evidence that at least one of the coefficients is different from zero. The model as a whole is therefore significant.
57

The assumptions underlying the Classical Linear Regression Model (CLRM)

• The statistical properties of the LS estimator, as well as the validity of the testing procedures, depend on the assumptions of the CLRM.
• We usually make the following set of assumptions:

1. The regression model Y = β0 + β1·x1 + β2·x2 + ... + βk·xk + ε is linear in the parameters (it means that the parameters are not multiplied together, divided, squared or cubed, etc.).
2. Conditional on x1, x2, ..., xk, the error term has a zero mean; E(ε) = 0.
3. No multicollinearity, i.e. no perfect linear relationships among the x variables.
4. Conditional on x1, x2, ..., xk, the variance of the error term is the same for all observations. The error term is homoskedastic (equally scattered), Var(ε) = σ².
5. Conditional on x1, x2, ..., xk, the error term is uncorrelated across observations, Cov(εi, εj) = 0 for i ≠ j. There is no autocorrelation.
6. The error term is not correlated with any independent variable, Cov(εi, xi) = 0. There is no endogeneity.
7. The error term is normally distributed. This assumption allows us to construct confidence intervals and conduct the tests of significance. If this assumption does not hold, the regression hypothesis tests are not valid.
58
GAUSS-MARKOV THEOREM

• On the basis of the CLRM assumptions, the OLS method gives the best linear unbiased estimators (BLUE):
• (1) The estimators are unbiased: in repeated applications of the method, the estimators are on average equal to their true values.
• (2) In the class of linear estimators, OLS estimators have minimum variance; i.e., they are efficient.

59

Common Violation 1: Multicollinearity

Multicollinearity exists when independent variables (x’s) are correlated.

• Effects of multicollinearity:
1. An independent variable known to be an important predictor ends up having a regression coefficient that is not significant.
2. A regression coefficient that should have a positive sign turns out to be negative, or vice versa.
3. When an independent variable is added or removed, there is a drastic change in the values of the remaining regression coefficients.
60
Common Violation 1: Multicollinearity
(cont’d)

• A general rule is that if the correlation between two independent variables is between -0.70 and 0.70, there is not a problem using both of the independent variables.
• A more precise test is to use the variance inflation factor (VIF).
• A VIF > 10 is unsatisfactory. Remove that independent variable from the analysis.
• The value of the VIF is found as follows:

VIF = 1 / (1 - R²_j)

The term R²_j refers to the coefficient of determination from a regression in which the selected independent variable is used as the dependent variable and the remaining independent variables are used as the independent variables.
61
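In Python, statsmodels provides this calculation directly (a sketch with a small made-up data frame of regressors; in practice X would be your actual design matrix):

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.DataFrame({"education": [12, 16, 14, 18, 11, 20, 15, 13],
                   "age":       [25, 32, 40, 29, 51, 45, 38, 60]})   # illustrative regressors
X = sm.add_constant(df)                                              # add the intercept column
vif = {col: variance_inflation_factor(X.values, i)
       for i, col in enumerate(X.columns)}
print(vif)    # a VIF above 10 for any regressor signals problematic multicollinearity
              # (the VIF reported for the constant itself is not meaningful)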

Common Violation 1: Multicollinearity (cont’d)

• A remedy may be to simply drop one of the collinear variables if we can justify it as redundant.
• Alternatively, we could try to increase our sample size.
• Another option would be to try to transform our variables so that they are no longer collinear.
• Last, especially if we are interested only in maintaining a high predictive power, it may make sense to do nothing.

62
Common Violation 2: The error term is not
homoskedastic
• The assumption: the variance of the errors is the same for all observations of the independent variables (homoskedasticity), Var(ε) = σ².
• The violation: the variance of the error term changes for different values of at least one independent variable.
• A common cause is greater discretion (in cross-sections). E.g., saving is more variable in households with higher average income; dividends are more variable in firms with higher profits.

63

Common Violation 2: The error term is not homoskedastic (cont’d)

• Heteroskedasticity results in inefficient estimators, and the hypothesis tests for significance are no longer valid.
• To get around this, some researchers use LS estimates along with corrected standard errors, called White’s standard errors. Many statistical packages have this option available.

64
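In statsmodels, for example, White-type heteroskedasticity-robust standard errors can be requested when the model is fitted (a sketch with simulated data; the variable names are illustrative):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({"education": rng.integers(10, 20, 80),
                   "age": rng.integers(20, 65, 80)})
df["wage"] = 2 + 1.4 * df["education"] + 0.05 * df["age"] + rng.normal(0, 4, 80)

# same OLS coefficients, but White's (HC0) heteroskedasticity-consistent standard errors
model = smf.ols("wage ~ education + age", data=df).fit(cov_type="HC0")
print(model.summary())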
Common Violation 3: The error terms are
autocorrelated

• We assume that the error term is uncorrelated across observations when obtaining OLS estimates: Cov(εi, εj) = 0 for i ≠ j.
• One factor that can cause autocorrelation is omitted variables.
• Suppose Yi is related to X2,i and X3,i, but we wrongfully do not include X3,i in our model.
• The effect of X3,i will be captured by the disturbances εi.
• If X3,i, like many economic series (e.g., GDP, stock returns), exhibits a trend over time, then X3,i depends on X3,i-1, X3,i-2 and so on.
• Similarly, εi then depends on εi-1, εi-2 and so on.


65

Common Violation 3: The error terms are autocorrelated (cont’d)

• A plot of the residuals against time shows:

[Figure: residuals plotted against time.]

66
Common Violation 4: Endogeneity

• Endogeneity in the regression model refers to the error term being correlated with the independent variables, Cov(εi, xi) ≠ 0.
• This commonly occurs due to an omitted independent variable.
• For example, a person’s salary may be highly correlated with that person’s innate ability. But since we cannot include ability, it gets incorporated into the error term. If we try to predict salary by years of education, which may also be correlated with innate ability, then we have an endogeneity problem.

67

Common Violation 4: Endogeneity (cont’d)

• Endogeneity will result in biased estimators, and so is quite a serious problem.
• Unfortunately, endogeneity is difficult to fix. Most commonly, we would like to find an instrumental variable, one that is correlated with the endogenous independent variable but uncorrelated with the error term. But it may be difficult to find such a variable.
• The instrumental variable approach will be discussed further in the FEC course.

68
Quadratic regression models

• Sometimes the relationship cannot be represented by a straight line and must instead be captured by an appropriate curve.
• Because the linearity assumption places a restriction on the linearity of the β parameters, not on the x values, we can capture many interesting non-linear relationships within this framework.

69

Quadratic regression models (cont’d)

• For example, the relation between wage and age tends to be an inverted “U” shape. The same relationship could be observed between managerial ownership and firm performance.
• Workers can expect wages to rise with age only up to a certain point, beyond which wages begin to fall.
• Such a relationship can be estimated by a quadratic regression model:

y = b0 + b1·x + b2·x² + e
70
Quadratic regression models (cont’d)

• For a quadratic regression, we estimate:

y = b0 + b1·x + b2·x² + e

• The sign of b2 determines the shape: b2 < 0 gives an inverted-U (concave) curve, while b2 > 0 gives a U-shaped (convex) curve.

[Figure: two panels plotting y against x, an inverted-U curve for b2 < 0 and a U-shaped curve for b2 > 0.]

71

Example 7

Using the Wage example, we can estimate the following models:

Wage = b0 + b1·Education + b2·Age + e              (1)
Wage = b0 + b1·Education + b2·Age + b3·Age² + e    (2)

1. Determine which model fits the data best.

72
Example 7 (cont’d)

[Figure: scatter plot of wages versus age (Wage on the y-axis, Age on the x-axis), with the fitted Linear and Quadratic regression lines overlaid.]

73

Example 7 (cont’d)

                 LR Model (1)   QR Model (2)
Education        1.441**        1.254**
                 (0.000)        (0.000)
Age              0.0472         1.350**
                 (0.127)        (0.000)
Age²                            -0.0133**
                                (0.000)
Constant         2.638          -22.72**
                 (0.268)        (0.000)
Observations     80             80
Adjusted R²      0.608          0.826

P-values in parentheses
** p < 0.05

74
Example 7 (cont’d): Multicollinearity?

QR Model (2) estimates: Education 1.254** (0.000); Age 1.350** (0.000); Age² -0.0133** (0.000); Constant -22.72** (0.000); Observations 80; Adjusted R² 0.826.

Corr(Age, Age²) = 0.987

Variable     VIF
Age          43.08
Age²         42.99
Education    1.05

A correlation between a variable and its power term can be reduced by “centering”.

Centering means that you subtract off the mean before squaring.
75
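Centering is a one-line transformation applied before building the squared term. A minimal sketch with illustrative ages:

import pandas as pd

age = pd.Series([23, 31, 38, 44, 52, 60, 27, 49])   # illustrative ages
c_age = age - age.mean()        # centre the variable first...
c_age2 = c_age ** 2             # ...then square it
print(age.corr(age ** 2))       # close to 1: raw Age and Age-squared are almost collinear
print(c_age.corr(c_age2))       # much smaller after centering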

Example 7 (cont’d):

                   QR Model        Transformed QR Model
Education          1.254**         1.254**
                   (0.000)         (0.000)
Age / C_Age        1.350**         0.031
                   (0.000)         (0.128)
Age² / C_Age²      -0.0133**       -0.0133**
                   (0.000)         (0.000)
Constant           -22.72**        11.461
                   (0.000)         (0.000)
Observations       80              80
Adjusted R²        0.826           0.826
Corr(Age, Age²)    0.987           -0.086
VIF (Age)          43.08           1.01
VIF (Age²)         42.99           1.05
VIF (Education)    1.05            1.05

In the Transformed QR Model, Age and Age² refer to the centered variables C_Age and C_Age².
P-values in parentheses; ** p < 0.05

76
Evaluating effects for a quadratic
regression model (QRM)

• The marginal effect is the change in the dependent variable (y) due to a one-unit change in the independent variable (x).
• In a linear regression model (y = b0 + b1·x + e), the marginal effect is constant and is estimated by b1.
• In a QRM (y = b0 + b1·x + b2·x² + e), the marginal effect depends on the value of x at which it is evaluated.
• We can show through calculus that the marginal effect of x on y is b1 + 2·b2·x.
• It is common to use the sample mean of x when interpreting the marginal effect.
77
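Using the quadratic estimates from Example 7 (1.350 on Age and -0.0133 on Age²), the marginal effect at any chosen age is a one-line calculation (sketch):

b1, b2 = 1.350, -0.0133      # Age and Age-squared coefficients from Example 7

def marginal_effect(x):
    """Estimated change in wage from one more year of age, evaluated at age x."""
    return b1 + 2 * b2 * x

print(marginal_effect(30))   # about 0.55 at age 30
print(marginal_effect(55))   # about -0.11: past the turning point, wages decline with age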

Evaluating effects for a quadratic regression model (QRM) (cont’d)

• We can also use the predicted equation to estimate where the dependent variable (y) will be maximized or minimized.
• y reaches an optimal value (a maximum or a minimum) when the marginal effect equals 0.
• The value of x at which this happens is obtained by solving b1 + 2·b2·x = 0, giving x = -b1 / (2·b2).

78
Example 8

1. Use the quadratic equation to determine the optimal age at which wage is maximized.
2. Using the answer to question 1, what is the maximum wage for an individual with 16 years of education?

1. Recall our estimated equation:

Wage = -22.72 + 1.254·Education + 1.350·Age - 0.0133·Age²

Age = -1.35 / (2 × (-0.0133)) = 50.75

2. The optimal age at which the wage is maximized is about 51 years, with a wage of about £31.60 (for 16 years of education).

79

Practical Example 9

• The Bank of New England is a large financial institution


serving the New England states as well as New York and
New Jersey. The mortgage department of the Bank of New
England is studying data from recent loans.
• Of particular interest is how such factors as the value of the
home being purchased ($000), education level of the head
of the household (number of years, beginning with first
grade), age of the head of the household, current monthly
mortgage payment (in dollars), and gender of the head of
the household (male = 1, female = 0) relate to the family
income.
• The mortgage department would like to know whether these
variables are effective predictors of family income.

80
Practical Example 9 (cont.)

y = b0 + b1X1 + b2X2 + b3X3 +b4X4 + b5X5 + e


1) We begin by calculating the correlation matrix shown below. It shows
the relationship between each of the independent variables and the
dependent variable.
2) Be careful in result interpretation:
• An increase of $1,000 in the value of the home suggests an increase of
$72 in family income. An increase of 1 year of education increases
income by $1,624, another year older reduces income by $122, and an
increase of $1,000 in the mortgage reduces income by $1.
• If a male is head of the household, the value of family income will
increase by $1,807. Remember that “female” was coded 0 and “male”
was coded 1, so a male head of household is positively related to home
value.
• The age of the head of household and monthly mortgage payment are
inversely related to family income. This is true because the sign of the
regression coefficient is negative.
81

Practical Example 9 (cont.)

• Next we conduct the Global hypothesis test. Here we check to


see if any of the regression coefficients are different from 0 (α =
.05):

H 0: β 1 = β 2 = β 3 = β 4 = β 5 = 0
H1: Not all the β’s are 0

• The p-value from the table is 0.000. Because the p-value is


less than the significance level, we reject the null hypothesis
and conclude that at least one of the regression coefficients
is not equal to zero.

82
Practical Example 9 (cont.)
• Next we evaluate the individual regression coefficients. The null
hypothesis and the alternate hypothesis are (α = .05):
H 0: β i = 0
H 1: β i ≠ 0
• p-values for the regression coefficients for home value, years of
education, and gender are all less than .05. We conclude that these
regression coefficients are not equal to zero and are significant
predictors of family income.
• The p-value for age and mortgage amount are greater than the
significance level of .05, so we do not reject the null hypotheses for
these variables. The regression coefficients are not different from zero
and are not related to family income.
• Based on the results of testing each of the regression coefficients, we
conclude that the variables age and mortgage amount are not effective
predictors of family income. Thus, they should be removed from the
multiple regression equation. Remember that we must remove one
independent variable at a time and redo the analysis to evaluate the
overall effect of removing the variable.

Practical Example 9 (cont.)

• Now, let us remove the variables


• Observe the R2 and adjusted R2 change without the mortgage
variable.
• Also observe that the p-value associated with age is greater than
the .05 significance level. So next we remove the age variable
and redo the analysis.
• Our final step is to examine the regression assumptions with our
regression model.
• The first assumption is that there is a linear relationship between
each independent variable and the dependent variable. It is not
necessary to review the dummy variable Gender because there
are only two possible outcomes. (scatter plots of family income
versus home value and family income versus years of education).

84
Practical Example 9 (cont.)

• If the linearity assumption is valid, then the distribution of residuals


should follow the normal probability distribution with a mean of zero. To
evaluate this assumption, we will use a histogram and a normal
probability plot.

85

Practical Example 9 (cont.)

• The final assumption refers to multicollinearity. This means that the


independent variables should not be highly correlated. We suggested a
rule of thumb that multicollinearity would be a concern if the correlations
among independent variables were close to 0.7 or −0.7.
• To calculate the VIFs, we need to do a regression analysis for each
independent variable as a function of the other independent variables.
From each of these regression analyses, we need the R2 to compute
the VIF using formula (14–7).
• If the VIFs are less than 10, then multicollinearity is not a concern.

86
Practical Example 9 (cont.)

• To summarize, the multiple regression equation is


ŷ = 74.527 + 0.063(Value) + 1.016(Education) +
1.770(Gender)

• This equation explains 71.6% of the variation in family


income. There are no major departures from the multiple
regression assumptions of linearity, normally distributed
residuals, and multicollinearity.

87

Practical Example (cont.)

88
Practical Example (cont.)

89
