
QMT 109: Business Research and Statistical Methods


Wilson Gan
Lecture Twelve

Reference for Definitions and Formulas: Bowerman et al. (2009) and Field (2013)
Multiple Linear Regression
Analysis (Part I)
Now let’s extend simple linear regression analysis by adding
more independent variables…
Recall: Fuel Consumption Case
 Cities in the USA use natural gas to heat their homes when temperatures are low. As most of you know, natural gas consumption is highest during the winter. Assume that we are analysts in a management consulting firm hired by a natural gas company serving a small city to predict weekly natural gas demand based on average temperature, so the company can prepare its supply chain accordingly. The company has provided us with eight weeks' worth of data on the city's average temperature and corresponding weekly fuel consumption.
 The natural gas firm will tolerate an error rate of at most 10 percent, meaning that the difference between our prediction and actual usage should not exceed 10 percent. This is because the firm will have to pay gas transmission companies if it over- or under-predicts demand.

Source: Bowerman et al. (2009)


Suppose we want to use another variable to
predict gas consumption…
 The chill index for a given average hourly temperature
expresses the combined effects of all other major weather-
related factors that influence natural gas consumption, such as
wind velocity, sunlight, cloud cover, and the passage of
weather fronts. The chill index is expressed as a whole number
between 0 and 30. A weekly chill index near 0 indicates that,
given the average hourly temperature during the week, all other
major weather-related factors will only slightly increase natural
gas consumption. A weekly chill index near 30 indicates that,
given the average hourly temperature during the week, other
weather-related factors will greatly increase natural gas
consumption.

Source: Bowerman et al. (2009)


Note the general form of the multiple
regression equation

 The model: y = β0 + β1x1 + β2x2 + … + βkxk + ε
 µy = β0 + β1x1 + β2x2 + … + βkxk is the mean value of the dependent variable y when the values of the independent variables are x1, x2, …, xk
 β0 is the y-intercept: the mean of y when all independent variables equal 0
 ε is an error term that describes the effect on y of all factors other than the independent variables
 Remember that we do not know the real values of the regression parameters β0, β1, β2, …, βk. We estimate them through the least squares prediction equation:
 ŷ = b0 + b1x1 + b2x2 + … + bkxk
 Note: The formula for the least squares point estimates of the parameters in multiple regression uses matrix algebra
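In matrix form the estimates are b = (XᵀX)⁻¹Xᵀy, where X is the design matrix with a leading column of 1s. A minimal numpy sketch of that computation; the data values below are made-up stand-ins for the fuel consumption case, not the textbook's:

```python
import numpy as np

# Hypothetical data in the spirit of the fuel consumption case
x1 = np.array([28.0, 28.0, 32.5, 39.0, 45.9, 57.8, 58.1, 62.5])  # avg temperature (°F)
x2 = np.array([18.0, 14.0, 24.0, 22.0, 8.0, 16.0, 1.0, 0.0])     # chill index
y = np.array([12.4, 11.7, 12.4, 10.8, 9.4, 9.5, 8.0, 7.5])       # weekly fuel use

# Design matrix: a column of 1s (for b0) plus one column per predictor
X = np.column_stack([np.ones_like(y), x1, x2])

# Least squares point estimates b = (X'X)^-1 X'y
b = np.linalg.solve(X.T @ X, X.T @ y)
print(b)  # [b0, b1, b2]
```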
Source: Bowerman et al. (2009)
Sample Multiple Regression Output in Excel

 Suppose the weather forecasting service predicts that the average hourly temperature next week will be 40 °F with a chill index of 10. Estimate fuel consumption for next week.
[The slide shows the Excel regression output for the two-predictor model.]
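Once the fitted coefficients are read off the output, the point prediction is just the equation evaluated at the forecast values. A minimal sketch; the coefficients below are hypothetical placeholders, since the slide's output image isn't reproduced here:

```python
# Hypothetical fitted coefficients (read these off the regression output)
b0, b1, b2 = 13.109, -0.090, 0.083

x1, x2 = 40.0, 10.0            # forecast: 40 °F, chill index of 10
y_hat = b0 + b1 * x1 + b2 * x2
print(round(y_hat, 3))         # point prediction of weekly fuel consumption
```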

Source: Bowerman et al. (2009)


The Good News? The statistical significance
formulas are very similar to SRM
 Sum of Squared Residuals, SSE, where a residual is the difference between actual and predicted:
SSE = Σ eᵢ² = Σ (yᵢ − ŷᵢ)²
 The goal of the least squares method is to produce the equation that minimizes SSE
 The Mean Square Error, s² = SSE / [n − (k + 1)], is the point estimate of the residual variance σ²
 The Standard Error, s = √s², is the point estimate of the residual standard deviation σ


F Test checks whether at least one of the
independent variables is significant
 H0: None of the independent
variables are significantly related
to y
 Ha: At least one of the
independent variables is
significantly related to y
 F is based on k numerator and
[n- (k+1)] denominator degrees
of freedom
 Reject H0 if F(model) > F or p-
value < 
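A small sketch of the test, assuming the ANOVA sums of squares are already in hand (the example numbers are hypothetical):

```python
from scipy import stats

def overall_f_test(explained_ss, sse, n, k, alpha=0.05):
    """Overall F test: H0 is that none of the k predictors is related to y."""
    f_stat = (explained_ss / k) / (sse / (n - (k + 1)))
    p_value = stats.f.sf(f_stat, k, n - (k + 1))  # area to the right of F
    return f_stat, p_value, p_value < alpha       # reject H0 if p-value < alpha

print(overall_f_test(explained_ss=24.9, sse=0.67, n=8, k=2))
```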

Source: Bowerman et al. (2009)


Test each independent variable for
statistical significance
 A regression model is not likely to be useful unless there is a
significant relationship between y and at least one x
 To test significance, we use the null hypothesis: H0: βj = 0
 Versus the alternative hypothesis: Ha: βj ≠ 0
 Note: The test will only be valid if SRM assumptions hold.
Alternative Hypothesis: Ha: βj ≠ 0
Reject H0 if: |t| > tα/2 (equivalently, t > tα/2 or t < −tα/2)
p-value: Twice the area under the t distribution to the right of |t|
 Remember: In this case, degrees
of freedom = n – (k+1)
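The t statistic is the coefficient estimate divided by its standard error, both of which appear in standard regression output. A minimal sketch:

```python
from scipy import stats

def coefficient_t_test(b_j, s_bj, n, k, alpha=0.05):
    """Two-sided t test of H0: beta_j = 0 with n - (k + 1) degrees of freedom."""
    t = b_j / s_bj
    p_value = 2 * stats.t.sf(abs(t), df=n - (k + 1))  # twice the right-tail area
    return t, p_value, p_value < alpha                # reject H0 if p-value < alpha
```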



And if you want, test the significance of the
y-intercept β0
 To test significance, we use the null hypothesis: H0: β0 = 0
 Versus the alternative hypothesis: Ha: β0 ≠ 0
 Note: The test will only be valid if SRM assumptions hold.
 If the intercept is not statistically significant, we can drop it from the
model (unless common sense dictates otherwise)
Alternative Hypothesis: Ha: β0 ≠ 0
Reject H0 if: |t| > tα/2 (equivalently, t > tα/2 or t < −tα/2)
p-value: Twice the area under the t distribution to the right of |t|
 Remember: In this case, degrees
of freedom = n - (k+1)



And we can apply the concept of the
confidence interval for βj and β0 as well…
 The confidence intervals here apply to each βj separately
 100(1 − α)% confidence interval for βj: [bj ± tα/2 sbj]
 100(1 − α)% confidence interval for β0: [b0 ± tα/2 sb0]
 df = n − (k + 1)
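A minimal sketch of the interval computation, given the estimate and its standard error from the regression output:

```python
from scipy import stats

def coefficient_ci(b_j, s_bj, n, k, alpha=0.05):
    """100(1 - alpha)% confidence interval for beta_j: b_j +/- t_{alpha/2} * s_bj."""
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - (k + 1))
    return b_j - t_crit * s_bj, b_j + t_crit * s_bj
```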

 Note that there is a different formula to determine the confidence interval for the estimate of the mean value, or the point prediction of an individual value, of the dependent variable (discussed on the next slide)



Same confidence and prediction interval
formulas as with SRM

CONFIDENCE INTERVAL: [ŷ ± tα/2 · s · √(distance value)]
PREDICTION INTERVAL: [ŷ ± tα/2 · s · √(1 + distance value)]

 Distance value = leverage value


 Except that the distance value is calculated using matrix
algebra, hence we rely on the value from statistics software
 Note degrees of freedom = [n – (k+1)]
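A minimal sketch, assuming ŷ, s, and the distance (leverage) value have been taken from software output:

```python
import math
from scipy import stats

def interval_for_y(y_hat, s, distance_value, n, k, alpha=0.05, prediction=False):
    """CI for the mean of y; set prediction=True for an individual y."""
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - (k + 1))
    extra = 1.0 if prediction else 0.0            # the "1 +" in the prediction interval
    half_width = t_crit * s * math.sqrt(extra + distance_value)
    return y_hat - half_width, y_hat + half_width
```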



The R2 and R formulas are also similar
 The multiple coefficient of determination, R2, is the proportion of
the total variation in the n observed values of the dependent variable
that is explained by the regression model

 Note: Total variation = Explained variation + Unexplained variation
 Unexplained variation = SSE
 R² = Explained variation / Total variation

 The multiple correlation coefficient measures the strength of association


between y and the independent variables, and is denoted by R



But there’s also an adjusted R2
 Problem (in simple terms): The more independent variables
one throws into the model, the higher the R2 regardless of
whether these variables are meaningful (related to the
dependent variable)
 Solution (in simple terms): the adjusted R², which applies a penalty of sorts for each incremental independent variable in the model. One standard form of the formula:
Adjusted R² = 1 − (1 − R²) × (n − 1) / [n − (k + 1)]
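As a one-liner (the example call uses made-up numbers):

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - (k + 1))."""
    return 1 - (1 - r2) * (n - 1) / (n - (k + 1))

print(adjusted_r2(0.974, n=8, k=2))  # ~0.963: slightly below the raw R^2
```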



Suppose one wants to include qualitative /
categorical independent variables…
 It is possible through dummy variables
 Scenario: Suppose that Electronics World, a chain of stores that sells audio and video equipment, has gathered data on store sales, the number of households in each store's area, and the location of each store for a regression
 Store sales is the dependent variable
 Number of households in the store’s area and location of each
store are the independent variables
 Location of each store is a categorical variable with three possible
values:
 Suburban street location
 Downtown street location
 Mall location

Source: Bowerman et al. (2009)


Multiple Regression Model with Dummy
Variables
 A dummy variable can only take a value of 0 or 1
 Define the following dummy variables for Electronics World:
 DD (downtown location)
 DM (mall location)

Store is in a suburban street location: DD = 0, DM = 0
Store is in a downtown location: DD = 1, DM = 0
Store is in a mall location: DD = 0, DM = 1
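A minimal pandas sketch of this coding; the data rows are hypothetical, not the textbook's:

```python
import pandas as pd

# Hypothetical store records; location is the categorical predictor
df = pd.DataFrame({
    "households": [161, 99, 135, 120, 164],
    "location": ["street", "downtown", "mall", "street", "mall"],
    "sales": [157.3, 93.3, 136.8, 123.8, 153.5],
})

# Dummy-code location; dropping "street" makes it the baseline (0 on both dummies)
dummies = pd.get_dummies(df["location"]).drop(columns="street").astype(int)
X = pd.concat([df[["households"]], dummies], axis=1)
print(X)  # columns: households, downtown (DD), mall (DM)
```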

Source: Bowerman et al. (2009)


Imagine each dummy variable as an additional y-intercept
[The slide's chart shows three parallel regression lines of sales against households, one per location type, shifted vertically by the dummy coefficients.]

Source: Bowerman et al. (2009)


Sample Excel Output of MRM with Dummy
Variables

[The slide shows the Excel regression output for the dummy-variable model.]
 We are 95% confident that, for any given number of households in a store's area:
 The mean monthly sales volume in a mall location is between 18.554 and 38.193 greater than the mean monthly sales volume in a street location
 The mean monthly sales volume in a downtown location is between 3.636 less than and 17.363 greater than the mean monthly sales volume in a street location
Source: Bowerman et al. (2009)
Multiple Linear Regression
Analysis (Part II)
Model Building
Regression Assumptions and Outliers
Multicollinearity
Steps to Building a Regression Model

Source: Field (2013)


Data Transformation
 You can transform the regression variables (both dependent and independent) to improve model fit or address violations of regression assumptions. Some examples (a sketch follows the table):

Log transformation (log xi) — Can correct for: positive skew, positive kurtosis, unequal variances, lack of linearity
Reciprocal transformation (1/xi) — Can correct for: positive skew, positive kurtosis, unequal variances
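A minimal sketch of fitting on a log-transformed predictor (made-up data):

```python
import numpy as np

# Hypothetical positively skewed predictor and its response
x = np.array([1.2, 1.8, 2.5, 4.0, 7.1, 12.9, 30.4])
y = np.array([3.1, 3.9, 4.4, 5.2, 6.0, 6.8, 7.9])

# Log-transform x to pull in the long right tail, then fit on the new scale
log_x = np.log(x)
b1, b0 = np.polyfit(log_x, y, 1)  # slope first, then intercept
print(b0, b1)                     # model: y = b0 + b1 * log(x)
```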

Source: Field (2013) and Bowerman et al. (2009)


But aren’t we doing a linear regression
here?
 A regression equation is linear when it is linear in the parameters. While the equation must be linear in the parameters, you can transform the predictor variables to improve fit.
 In other words, curvature in the predictors (such as x² or log x terms) does not make a model nonlinear; what matters is how the βs enter the equation.

[The slide shows five example equations from the cited Minitab post, labeled in order: Linear; Linear; Non-linear in β1; Linear; Non-linear in β2 and β3.]
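A quick illustration that "linear" refers to the parameters: a quadratic in x is still fit by ordinary least squares, because each β multiplies a known column:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 20)
y = 2.0 + 1.5 * x - 0.3 * x**2 + rng.normal(0.0, 0.2, x.size)

# y = b0 + b1*x + b2*x^2 is linear in the parameters, so OLS applies:
# just add x**2 as another column of the design matrix.
X = np.column_stack([np.ones_like(x), x, x**2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)  # estimates of [b0, b1, b2]
```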

Sources: http://blog.minitab.com/blog/adventures-in-statistics/what-is-the-difference-between-linear-and-nonlinear-equations-in-regression-analysis and http://www.stat.colostate.edu/regression_book/chapter9.pdf
But note that data transformation changes
interpretation of the regression function

 For example:
 In the standard model, a one-unit change in x results in a change in y equal to x's parameter value
 In the power (log-log or elasticity) model, a one percent change in x results in a percent change in y equal to x's parameter value
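A tiny numeric check of the elasticity reading, with hypothetical fitted values b0 and b1:

```python
import numpy as np

# Log-log model: ln(y) = b0 + b1*ln(x), i.e., y = exp(b0) * x**b1
b0, b1 = 0.7, 1.8                       # hypothetical fitted values
x = 100.0
y = np.exp(b0) * x**b1
y_after = np.exp(b0) * (1.01 * x)**b1   # raise x by one percent
print((y_after / y - 1) * 100)          # ~1.8, i.e., roughly b1 percent
```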
Source: http://stattrek.com/regression/linear-transformation.aspx?Tutorial=AP
The assumptions that govern MRM are
similar to that of SRM
1. Mean of Zero
The mean of the error terms is equal to zero
2. Constant Variance Assumption
Homoscedasticity: the variance of the error terms, σ², is the same for every combination of values of x1, x2, …, xk
3. Normality Assumption
The error terms follow a normal distribution for every combination of values of x1, x2, …, xk
4. Independence Assumption
The values of the error terms are statistically independent of
each other

Source: Bowerman et al. (2009)


Approach to checking MRM assumptions is
similar to that of SRM
 Checks of regression assumptions are performed by analyzing
the regression residuals
 Residuals versus each independent variable
 Residuals versus predicted y’s
 Residuals in time order (if the response is a time series)

 With any real data, assumptions will not hold exactly. Mild
departures do not affect our ability to make statistical
inferences.
 In checking assumptions, we are looking for pronounced
departures from the assumptions.
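A minimal matplotlib sketch of the standard residual plots (pass in whatever your fitted model produced):

```python
import matplotlib.pyplot as plt

def residual_plots(x_vars, y_hat, residuals, names):
    """Plot residuals against each predictor and against the predicted y's."""
    n_plots = len(x_vars) + 1
    fig, axes = plt.subplots(1, n_plots, figsize=(4 * n_plots, 3))
    for ax, x, name in zip(axes, x_vars, names):
        ax.scatter(x, residuals)
        ax.axhline(0, color="gray")
        ax.set(xlabel=name, ylabel="residual")
    axes[-1].scatter(y_hat, residuals)       # residuals vs. predicted y's
    axes[-1].axhline(0, color="gray")
    axes[-1].set(xlabel="predicted y", ylabel="residual")
    plt.tight_layout()
    plt.show()
```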



Check for OUTLIERS: Why are we
concerned with outliers in a regression?

 Outliers can disproportionately affect the regression equation, so one must identify them and check whether they are valid or erroneous data
Source: Bowerman et al. (2009)
How do we detect outliers? [Short answer]

 Run outlier diagnostic using statistics software


 In MEGASTAT, this is the Diagnostics and Influential Residuals
option
 In SPSS, this is the Save | Residuals option



How do we detect outliers? [Slightly longer
answer]
Leverage values — Recall: leverage value = distance value. If the leverage value for an observation is large, the observation has substantial leverage in determining the least squares prediction equation
Studentized residuals — A specific observation's residual divided by the estimate of its standard deviation
Studentized deleted residuals — Would deleting the observation significantly affect the regression model? Calculated by subtracting from yi the point prediction ŷ(i) computed from a regression on all n observations except observation i
Cook's D — A statistical measure of an observation's overall influence, used to detect outliers
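All four diagnostics are available from statsmodels; a sketch on made-up data:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: fit an OLS model, then pull the influence diagnostics
rng = np.random.default_rng(1)
X = sm.add_constant(rng.normal(size=(30, 2)))
y = X @ np.array([5.0, 1.2, -0.8]) + rng.normal(scale=0.5, size=30)

influence = sm.OLS(y, X).fit().get_influence()
print(influence.hat_matrix_diag)             # leverage (distance) values
print(influence.resid_studentized_internal)  # studentized residuals
print(influence.resid_studentized_external)  # studentized deleted residuals
print(influence.cooks_distance[0])           # Cook's D per observation
```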



How do we address identified outliers?
 If you believe they are erroneous data, delete them
 If you believe they are illustrative of a special scenario, separate them into another group, perhaps by employing a dummy variable
 Example: In a regression of average daily balance
(independent), number of bank accounts (independent) and
average monthly bank visits (dependent), one sees a set of
customers with a very large number of bank accounts.
 If these are outliers, and there’s indeed a legitimate reason
for having a large number of bank accounts, then the
researcher may separate them into a group



Check for MULTICOLLINEARITY: Why is it
important?
 MULTICOLLINEARITY exists when independent variables in
a regression are related to or dependent upon each other
 It is BAD in a multiple regression. Why?
 Assume perfect collinearity between the two independent
variables: It becomes impossible to obtain unique estimates of
regression coefficients because there are an infinite number of
combinations of coefficients that would work equally well [stated
another way, b1 and b2 are interchangeable]
 It inflates the standard deviation of the population of all possible least squares estimates of b1 and b2
 Though perfect collinearity is rare in real life, even mild collinearity may pose problems for a regression

Source: Bowerman et al. (2009)


Check for MULTICOLLINEARITY: Three
problems that arise
 Untrustworthy bs: Given the large variance in possible b values, a b coefficient based on the sample is less likely to represent the population
 Importance or significance of bs: The bs are the coefficients of the predictors, whose significance is tested via a t-test
 Given the inflation in the standard deviation of b, the standard error is likewise inflated, hence the t-statistic becomes smaller and the p-value larger
 Limits the size of R: Given high levels of multicollinearity across independent variables, the incremental explained variance from each additional independent variable is minimal [in other words, R and R² do not increase much as one adds more collinear variables]
Source: Field (2013)
How to check for multicollinearity?

 If the largest VIF is greater than 10, there is severe multicollinearity
 If the largest VIF is between 5 and 10, there is moderate multicollinearity
 If the mean VIF is substantially greater than 1, the regression may be biased due to multicollinearity
Source: Bowerman et al. (2009)
Back up: What is a VIF?
 The Variance Inflation Factor (VIF), in a nutshell, indicates whether an independent variable has a strong relationship with the other independent variables. Its standard formula:
VIFj = 1 / (1 − Rj²), where Rj² is the R² from regressing xj on the other independent variables

 One may also want to produce a correlation matrix to check for


the correlation between each of the independent variables
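A sketch of computing VIFs with statsmodels, on made-up collinear data:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical predictors, with x2 deliberately collinear with x1
rng = np.random.default_rng(2)
x1 = rng.normal(size=50)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=50)
X = sm.add_constant(np.column_stack([x1, x2]))

# VIF for each predictor (column 0 is the intercept, so skip it)
vifs = [variance_inflation_factor(X, j) for j in range(1, X.shape[1])]
print(vifs)  # both VIFs should be well above 1 here
```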



Back up: Produce a matrix scatterplot if you
are a visual person, or…
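The slide's scatterplot image isn't reproduced; a minimal pandas sketch for producing one (hypothetical data):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical DataFrame of the independent variables
rng = np.random.default_rng(3)
df = pd.DataFrame({"x1": rng.normal(size=40)})
df["x2"] = 0.8 * df["x1"] + rng.normal(scale=0.4, size=40)
df["x3"] = rng.normal(size=40)

pd.plotting.scatter_matrix(df, figsize=(8, 8))  # one panel per variable pair
plt.show()
```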



Back up: a correlation matrix, if you need to
see the “numbers”

 Rule of thumb: r > 0.8 may be worth noting
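A matching sketch for the correlation matrix (reusing the hypothetical predictors from the scatterplot sketch):

```python
import numpy as np
import pandas as pd

# Same hypothetical predictors as in the scatterplot sketch
rng = np.random.default_rng(3)
df = pd.DataFrame({"x1": rng.normal(size=40)})
df["x2"] = 0.8 * df["x1"] + rng.normal(scale=0.4, size=40)
df["x3"] = rng.normal(size=40)

corr = df.corr()
print(corr.round(2))  # scan the off-diagonal entries for |r| > 0.8
```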


FINAL POINT: Does one just throw variables
into a multiple regression model?
 Three major approaches (a stepwise sketch follows the table):

Hierarchical — Known predictors, based on past research or experience, are entered into the model first. After this, new predictors are entered all at once or in a stepwise procedure
Forced Entry — All predictors are entered simultaneously
Stepwise — Variables are iteratively inserted into and deleted from the model based on mathematical criteria:
•Begin with the predictor that has the highest correlation with the dependent variable
•Add the predictor that results in the greatest improvement in the explained variation of the dependent variable
•As the model is built, a removal test is made of the least useful predictor (based on multicollinearity)
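A simplified sketch of the forward step only, using raw R² improvement as a stand-in for the entry criterion (real stepwise procedures use partial F tests or p-values):

```python
import numpy as np
import statsmodels.api as sm

def forward_stepwise(X, y, names, min_gain=0.01):
    """Greedily add the predictor that most improves R^2 until gains are negligible."""
    selected, remaining, best_r2 = [], list(range(X.shape[1])), 0.0
    while remaining:
        r2, j = max(
            (sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit().rsquared, j)
            for j in remaining
        )
        if r2 - best_r2 < min_gain:  # stop: no candidate helps enough
            break
        best_r2 = r2
        selected.append(j)
        remaining.remove(j)
    return [names[j] for j in selected]

# Hypothetical data: y truly depends on x1 and x3 only
rng = np.random.default_rng(4)
X = rng.normal(size=(60, 4))
y = 2.0 + 1.5 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(scale=0.5, size=60)
print(forward_stepwise(X, y, ["x1", "x2", "x3", "x4"]))
```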

Source: Field (2013)


