Chapter 9:
Regression Analysis
Business Analytics: Methods, Models,
and Decisions, 1st edition
James R. Evans
Copyright © 2013 Pearson Education, Inc.
publishing as Prentice Hall 9-1
Chapter 9 Topics
Regression Analysis
Simple Linear Regression
Residual Analysis and Regression Assumptions
Multiple Linear Regression
Building Good Regression Models
Regression with Categorical Independent
Variables
Regression Models with Nonlinear Terms
Regression Analysis
Regression analysis is a tool for building statistical
models that characterize relationships among a
dependent variable and one or more independent
variables, all of which are numerical.
Simple linear regression involves a single
independent variable.
Multiple regression involves two or more
independent variables.
Simple Linear Regression
Finds a linear relationship between:
- one independent variable X and
- one dependent variable Y
First prepare a scatter plot to verify the data has a
linear trend.
Use alternative approaches if the data is not linear.
Figure 9.1
Simple Linear Regression
Example 9.1
Home Market Value Data
Size of a house is typically related to its market value.
X = square footage
Y = market value ($)
Figure 9.2
The scatter plot of the full data set (42 homes) indicates a linear trend.
Figure 9.3
Simple Linear Regression
Finding the Best-Fitting Regression Line
Two possible lines are shown below.
Line A is clearly a better fit to the data.
We want to determine the best regression line.
Ŷ = b0 + b1X
where
b0 is the intercept
b1 is the slope
Figure 9.4
Simple Linear Regression
Example 9.2
Using Excel to Find the Best Regression Line
Market value = 32673 + 35.036(square feet)
The regression model
explains variation in
market value due to
size of the home.
It provides better
estimates of market
value than simply
using the average.
Figure 9.5
Simple Linear Regression
Least-Squares Regression
Regression analysis finds the equation of the best-fitting line that minimizes the sum of the squares of the observed errors (residuals).
Figure 9.6
Using calculus we can solve for the slope and intercept
of the least-squares regression line.
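The closed-form result of that calculus step can be sketched in Python. This is a minimal illustration rather than the Excel workflow used in the chapter; the function name `least_squares_line` and the five data points are hypothetical and are not taken from the Home Market Value file.

```python
from statistics import mean

def least_squares_line(xs, ys):
    """Return (b0, b1) for the line that minimizes the sum of squared residuals."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Slope: sum of cross-deviations divided by sum of squared X deviations.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    # Intercept: the fitted line passes through the point (x_bar, y_bar).
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical data lying exactly on Y = 30000 + 35 X (an assumption for the demo).
square_feet = [1500, 1700, 1800, 2000, 2200]
market_value = [30000 + 35 * x for x in square_feet]
b0, b1 = least_squares_line(square_feet, market_value)
print(round(b0, 2), round(b1, 2))
```

Because the made-up points fall exactly on a line, the function recovers that line's intercept and slope; with real data the residuals would be nonzero.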
Simple Linear Regression
Least-Squares Regression Equations
Slope
b1 =SLOPE(known ys, known xs)
Intercept
b0 =INTERCEPT(known ys, known xs)
Predict Ŷ for specified X values: Ŷ = b0 + b1X
Ŷ =TREND(known ys, known xs, new xs)
Simple Linear Regression
Example 9.3 Using Excel Functions to Find Least-Squares Coefficients
Slope = b1 = 35.036
=SLOPE(C4:C45, B4:B45)
Intercept = b0 = 32,673
=INTERCEPT(C4:C45, B4:B45)
Figure 9.2
Estimate Y when X = 1800 square feet
Ŷ = 32,673 + 35.036(1800) = $95,737.80
=TREND(C4:C45, B4:B45, 1800)
Simple Linear Regression
Excel Regression tool
Data > Data Analysis > Regression
Input Y Range
Input X Range
Labels
Excel outputs a table with many useful regression statistics.
Figure 9.7
Simple Linear Regression
Regression Statistics in Excel’s Output
Multiple R
|r|, where r is the sample correlation coefficient
r varies from -1 to +1 (r is negative if the slope is negative)
R Square
coefficient of determination, R²
varies from 0 (no fit) to 1 (perfect fit)
Adjusted R Square
adjusts R² for sample size and number of X variables
Standard Error
variability between observed and predicted Y values
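As a rough sketch of how these four statistics relate to one another, the Python below computes them for a simple linear regression using the standard textbook definitions. The function name `regression_statistics` and the data are hypothetical; this is not Excel's implementation.

```python
from math import sqrt
from statistics import mean

def regression_statistics(xs, ys):
    """Multiple R, R Square, Adjusted R Square, and Standard Error
    for a simple linear regression (k = 1 independent variable)."""
    n, k = len(xs), 1
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    # Total sum of squares and error (residual) sum of squares.
    sst = sum((y - y_bar) ** 2 for y in ys)
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    r_square = 1 - sse / sst
    return {
        "multiple_r": sqrt(r_square),
        "r_square": r_square,
        "adjusted_r_square": 1 - (1 - r_square) * (n - 1) / (n - k - 1),
        "standard_error": sqrt(sse / (n - k - 1)),
    }

# Hypothetical data, not the Home Market Value file.
summary = regression_statistics([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```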
Simple Linear Regression
Example 9.4 Interpreting Regression Statistics for
Simple Linear Regression (Home Market Value)
53% of the variation in home market values can be explained by home size.
The standard error of $7,287 is less than the standard deviation of market value (not shown), $10,553.
Figure 9.8
Simple Linear Regression
Regression Analysis of Variance
ANOVA conducts an F-test to determine whether
variation in Y is due to varying levels of X.
ANOVA is used to test for significance of regression:
H0: population slope coefficient = 0
H1: population slope coefficient ≠ 0
Excel reports the p-value (Significance F).
Rejecting H0 indicates that X explains variation in Y.
From Figure 9.8
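The F statistic behind this test can be sketched as the ratio of the regression mean square to the error mean square. The Python below is an illustrative sketch with hypothetical data and a hypothetical function name; it stops at the F statistic, because the p-value (Significance F) requires the F distribution, which Excel supplies.

```python
from statistics import mean

def anova_f_statistic(xs, ys):
    """F = MSR / MSE for simple linear regression (1 and n - 2 degrees of freedom)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b0 = y_bar - b1 * x_bar
    # Variation explained by the regression line.
    ssr = sum((b0 + b1 * x - y_bar) ** 2 for x in xs)
    # Variation left unexplained (sum of squared residuals).
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    msr = ssr / 1        # one independent variable
    mse = sse / (n - 2)
    return msr / mse

# Hypothetical data.
f_stat = anova_f_statistic([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(f_stat, 4))
```

A large F (small p-value) leads to rejecting H0, that is, concluding that X explains variation in Y.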
Simple Linear Regression
Example 9.5 Interpreting Significance of Regression
H0: Home size is not a significant variable
H1: Home size is a significant variable
p-value = 3.798 × 10⁻⁸
Reject H0.
The slope is not equal to zero.
Using a linear relationship, home size is a significant
variable in explaining variation in market value.
From Figure 9.8
Simple Linear Regression
Testing Hypotheses for Regression Coefficients
An alternative method for testing whether the slope is zero is to use a t-test:
t = (b1 − 0) / (standard error of b1), with n − 2 degrees of freedom
Excel provides the p-values for tests on the slope
and intercept.
From Figure 9.8
Simple Linear Regression
Example 9.6 Interpreting Hypothesis Tests for
Regression Coefficients (Home Market Value)
p-value for test on the intercept = 0.000649
p-value for test on the slope = 3.798 × 10⁻⁸
Both tests reject their null hypotheses.
Both the intercept and slope coefficients are
significantly different from zero.
From Figure 9.8
Simple Linear Regression
From Figure 9.8
Residual Analysis and Regression Assumptions
Residual Analysis
Residuals are observed errors.
Residual = Actual Y value − Predicted Y value
Standard residual = residual / standard deviation of the residuals
Rule of thumb: Standard residuals outside of ±2
or ±3 are potential outliers.
Excel provides a table and a plot of residuals.
Figure 9.9
Figure 9.10
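The residual calculations and the ±2 rule of thumb can be sketched as follows. This is an illustrative sketch under one stated assumption: the residuals are scaled by their sample standard deviation (n − 1 divisor), and other software may scale them differently. The function names and data are hypothetical.

```python
from statistics import mean, stdev

def standard_residuals(xs, ys):
    """Residuals (actual minus predicted) divided by their sample standard deviation."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b0 = y_bar - b1 * x_bar
    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    s = stdev(residuals)  # assumption: n - 1 divisor
    return [e / s for e in residuals]

def potential_outliers(std_residuals, cutoff=2.0):
    """Indices of observations whose standard residual lies outside +/- cutoff."""
    return [i for i, z in enumerate(std_residuals) if abs(z) > cutoff]

# Hypothetical data: the last point sits far above the trend of the others.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2, 3, 4, 5, 6, 7, 8, 9, 10, 30]
z = standard_residuals(xs, ys)
flagged = potential_outliers(z)
print(flagged)
```

A flagged observation is only a candidate outlier; whether to keep it is a judgment about the data, not something the cutoff decides.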
Residual Analysis and Regression Assumptions
Example 9.8 Interpreting Residual Output
None of the residuals in the table of 5 homes
shown below appear to be outliers.
In the full data set of 42 homes, there is a
standardized residual larger than 4.
This small home may have a pool or unusually
large piece of land.
Figure 9.9
Figure 9.3
Residual Analysis and Regression Assumptions
Checking Assumptions
Linearity
- examine scatter diagram (should appear linear)
- examine residual plot (should appear random)
Normality of Errors
- view a histogram of standard residuals
- regression is robust to departures from normality
Homoscedasticity
- variation about the regression line is constant
Independence of Errors
- successive observations should not be related
Residual Analysis and Regression Assumptions
Example 9.9 Checking Regression Assumptions for
the Home Market Value Data
Linearity - linear trend in scatterplot
- no pattern in residual plot
Figure 9.3 Figure 9.10
Residual Analysis and Regression Assumptions
Example 9.9 (continued) Checking Regression
Assumptions for the Home Market Value Data
Normality of Errors – residual histogram appears
slightly skewed but is not a serious departure
Figure 9.11
Residual Analysis and Regression Assumptions
Example 9.9 (continued) Checking Regression
Assumptions for the Home Market Value Data
Homoscedasticity – residual plot shows no serious
difference in the spread of the data for different X
values.
Figure 9.10
Residual Analysis and Regression Assumptions
Example 9.9 (continued) Checking Regression
Assumptions for the Home Market Value Data
Independence of Errors – Because the data is
cross-sectional, we can assume that this condition
holds.
All 4 regression assumptions are reasonable for
the Home Market Value data.
Multiple Linear Regression
Multiple regression has more than one independent variable.
The multiple linear regression equation is:
Y = β0 + β1X1 + β2X2 + … + βkXk + ε
The ANOVA test for significance of the entire model is:
H0: β1 = β2 = … = βk = 0
H1: at least one βj ≠ 0
One can also test for significance of individual regression coefficients.
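To illustrate how the coefficients of a model with several independent variables can be estimated, the sketch below solves the least-squares normal equations in plain Python. It is one possible approach, not the method Excel uses internally; the function name and the data are hypothetical, and the solver assumes the independent variables are not perfectly collinear.

```python
def fit_multiple_regression(rows, ys):
    """Least-squares coefficients [b0, b1, ..., bk] from the normal equations
    (X'X) b = X'y, solved by Gauss-Jordan elimination with partial pivoting."""
    X = [[1.0] + [float(v) for v in row] for row in rows]  # leading 1 = intercept
    p = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(p)]
         + [sum(r[i] * y for r, y in zip(X, ys))]
         for i in range(p)]
    for col in range(p):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * c for a, c in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]

# Hypothetical data generated from Y = 5 + 2*X1 - 3*X2 with no error term.
rows = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8), (6, 4)]
ys = [5 + 2 * x1 - 3 * x2 for x1, x2 in rows]
coefficients = fit_multiple_regression(rows, ys)
print([round(c, 6) for c in coefficients])
```

Since the made-up data contain no error, the fit recovers the generating coefficients up to floating-point rounding.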
Multiple Linear Regression
Example 9.10 Interpreting Regression Results for
the Colleges and Universities Data
Colleges try to predict student graduation rates
using a variety of characteristics, such as:
1. Median SAT
2. Expenditures/student
3. Acceptance rate
4. Top 10% of HS class
Figure 9.12
Multiple Linear Regression
Example 9.10 (continued) Interpreting Regression
Results for the Colleges and Universities Data
Figure 9.13
Multiple Linear Regression
Example 9.10 (continued) Interpreting Regression
Results for the Colleges and Universities Data
All of the slope
coefficient p-values
are < 0.05.
From Figure 9.13
The residual plots (only one shown
here) show random patterns about 0.
Normal probability plots (not shown)
also validate assumptions.
Figure 9.14
Multiple Linear Regression
Analytics in Practice:
Using Linear Regression and
Interactive Risk Simulators to
Predict Performance at ARAMARK
ARAMARK, located in Philadelphia, is an award-winning provider of professional services.
They developed an online tool called “interactive risk simulators” (shown on the next slide) that allows users to change various business metrics and immediately see the results.
The simulators use linear regression models.
Multiple Linear Regression
Analytics in Practice: (ARAMARK continued)
Risk metrics are adjusted using sliders.
The tool allows users (managers and directors) to see the
impact of these risks on the business.
Figure 9.15
Building Good Regression Models
Not all of the independent variables in a linear
regression model are always significant.
We will learn how to build good regression models
that include the “best” set of variables.
Banking Data includes demographic information
on customers in the bank’s current market.
Figure 9.16
Building Good Regression Models
Predicting Average Bank Balance using Regression
Home Value and Education
are not significant.
Figure 9.17
Building Good Regression Models
Systematic Approach to Building Good Multiple
Regression Models
1. Construct a model with all available independent
variables and check for significance of each.
2. Identify the variable with the largest p-value that is greater than α.
3. Remove that variable and evaluate adjusted R².
4. Continue until all variables are significant.
Find the model with the highest adjusted R².
(Do not use unadjusted R², since it never decreases
when variables are added.)
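The reason for comparing models on adjusted R² can be seen from its standard formula, sketched below in Python. The R² values, sample size, and variable counts are hypothetical numbers chosen only to show the effect.

```python
def adjusted_r_square(r_square, n, k):
    """Adjusted R-squared for n observations and k independent variables."""
    return 1 - (1 - r_square) * (n - 1) / (n - k - 1)

# Hypothetical comparison: dropping a weak variable lowers R-squared slightly
# (0.900 -> 0.899) but raises adjusted R-squared.
full = adjusted_r_square(0.900, n=30, k=5)
reduced = adjusted_r_square(0.899, n=30, k=4)
print(round(full, 4), round(reduced, 4))
```

The penalty term (n − 1)/(n − k − 1) grows with k, so a variable must explain enough additional variation to offset it.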
Building Good Regression Models
Example 9.11
Identifying the Best Regression Model
Bank regression after removing Home Value
Adjusted R² improves slightly.
All X variables are significant.
Figure 9.18
Building Good Regression Models
Multicollinearity
- occurs when there are strong correlations among
the independent variables
- makes it difficult to isolate the effects of
independent variables
- signs of slope coefficients may be opposite of the
true value and p-values can be inflated
Correlations above 0.7 in absolute value are an indication that
multicollinearity might exist.
Variance Inflation Factors are a better indicator.
Parsimony is an age-old principle that applies here.
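A variance inflation factor follows the standard definition VIF_j = 1 / (1 − R_j²), where R_j² comes from regressing independent variable j on the other independent variables. The sketch below only evaluates that formula; the function name is hypothetical, and the two-variable shortcut in the comments is an assumption that holds only when there are exactly two independent variables.

```python
def variance_inflation_factor(r_square_j):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 is obtained by regressing
    independent variable j on all the other independent variables."""
    return 1.0 / (1.0 - r_square_j)

# With exactly two independent variables, R_j^2 equals the squared
# correlation between them, so correlations of 0.7 and 0.95 give:
vif_at_07 = variance_inflation_factor(0.7 ** 2)
vif_at_095 = variance_inflation_factor(0.95 ** 2)
print(round(vif_at_07, 2), round(vif_at_095, 2))
```

The factor rises slowly at moderate correlations and steeply as the correlation approaches 1.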
Building Good Regression Models
Example 9.12
Identifying Potential Multicollinearity
Colleges and Universities (full model)
Full model
Adjusted R² = 0.4921
Figure 9.13
Building Good Regression Models
Example 9.12 (continued)
Identifying Potential Multicollinearity
Correlation Matrix (Colleges and Universities data)
From Figure 9.19
All of the correlations are within ±0.7
Signs of the coefficients are questionable for
Expenditures and Top 10%.
Building Good Regression Models
Example 9.12 (continued)
Identifying Potential Multicollinearity
Colleges and Universities (reduced model)
Dropping Top 10%
Adjusted R² drops to 0.4559
Building Good Regression Models
Example 9.12 (continued)
Identifying Potential Multicollinearity
Colleges and Universities (reduced model)
Dropping Expenditures
Adjusted R² drops to 0.4556
Building Good Regression Models
Example 9.12 (continued)
Identifying Potential Multicollinearity
Colleges and Universities (reduced model)
Dropping Expenditures and Top 10%
Adjusted R² drops to 0.3613
Which of the 4 models would you choose?
Building Good Regression Models
Example 9.12 (continued)
Banking Data (full model)
Full Model
Adjusted R² = 0.9441
Education and Home Value
are not significant.
Figure 9.17
Building Good Regression Models
Example 9.12 (continued)
Identifying Potential Multicollinearity
Correlation matrix for the Banking data
From Figure 9.20
From Figure 9.17
Some of the correlations exceed 0.7 for Home
Value and Wealth.
Signs of the coefficients for predicting bank
balance are as expected (positive).
Building Good Regression Models
Example 9.12 (continued)
Banking Data (reduced model)
Dropping Wealth and Home Value
Adjusted R² drops to 0.9201
Education is not significant.
Building Good Regression Models
Example 9.12 (continued)
Identifying Potential Multicollinearity
Re-ordered Correlation matrix for Banking data
From Figure 9.20
By re-ordering the variables, we can see the
correlations for Age, Education, and Wealth are all
within ±0.7.
Let’s try a reduced model with the Age, Education,
and Wealth variables.
Building Good Regression Models
Example 9.12 (continued)
Banking Data (reduced model) ** best model
Dropping Income and Home Value.
Adjusted R² = 0.9345.
All variables are significant.
Multicollinearity is not a problem.
Figure 9.21
Regression with Categorical Variables
Dealing with Categorical Variables
Categorical variables must be coded numerically using dummy variables.
For variables with 2 categories, code as 0 and 1.
For variables with k ≥ 3 categories, create k−1
binary (0,1) variables.
Interaction Terms
A dependence between two variables is called
interaction.
Test for interaction by adding a new term to the
model, such as X3 = X1X2.
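Both ideas on this slide, k−1 dummy variables and a product term for interaction, can be sketched in a few lines of Python. The helper names `dummy_code` and `interaction`, the choice of baseline level, and the sample values are all hypothetical.

```python
def dummy_code(values, baseline):
    """Code a categorical variable with k levels as k-1 binary (0,1) columns.
    The baseline level is represented by a row of all zeros."""
    levels = sorted(set(values) - {baseline})
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows

def interaction(x1, x2):
    """Interaction column X3 = X1 * X2, computed row by row."""
    return [a * b for a, b in zip(x1, x2)]

# Hypothetical four-level variable; "A" is chosen as the baseline.
levels, dummies = dummy_code(["A", "B", "C", "D", "B"], baseline="A")
print(levels)
print(dummies)

# Hypothetical numeric variable and a 0/1 dummy variable.
x3 = interaction([25, 30, 41], [0, 1, 1])
print(x3)
```

Using k−1 columns rather than k avoids a redundant column: the baseline level is already identified when every dummy is 0.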
Regression with Categorical Variables
Example 9.13 A Model with Categorical Variables
Employee Salaries provides data for 35 employees
Predict Salary using Age and MBA (yes=1, no=0)
Figure 9.22
Regression with Categorical Variables
Example 9.13 (continued)
Salary = 893.59 + 1044(Age) for those without MBA
Salary = 15,660.82 + 1044(Age) for those with MBA
Adjusted R² = 0.949858
Figure 9.23
Regression with Categorical Variables
Example 9.14 Incorporating Interaction Terms in a
Regression Model
Define an interaction between Age and MBA and
include in the regression model.
Interaction = (Age)(MBA)
Figure 9.24
Regression with Categorical Variables
Example 9.14 (continued) Incorporating Interaction
Terms in a Regression Model
MBA is now insignificant so we
will drop it from the model.
Adjusted R² = 0.976701
Figure 9.25
Regression with Categorical Variables
Example 9.14 (continued)
Salary = 3,323 + 984(Age) for those without MBA
Salary = 3,323 + 1410(Age) for those with MBA
Adjusted R² = 0.976727
(a slight improvement)
Figure 9.26
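The practical difference between the model without interaction (Example 9.13) and the model with interaction (Example 9.14) can be seen by evaluating the printed equations. The sketch below uses the rounded coefficients shown on the slides; the function names are hypothetical and the ages are arbitrary.

```python
def salary_no_interaction(age, mba):
    """Example 9.13 equations: an MBA shifts the intercept only."""
    intercept = 15660.82 if mba else 893.59
    return intercept + 1044 * age

def salary_with_interaction(age, mba):
    """Example 9.14 equations: an MBA changes the slope on Age."""
    slope = 1410 if mba else 984
    return 3323 + slope * age

for age in (30, 50):
    gap_13 = salary_no_interaction(age, 1) - salary_no_interaction(age, 0)
    gap_14 = salary_with_interaction(age, 1) - salary_with_interaction(age, 0)
    print(age, round(gap_13, 2), gap_14)
```

Without interaction the predicted MBA premium is the same at every age; with interaction it grows with age, because the two lines have different slopes.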
Regression with Categorical Variables
Example 9.15 A Regression Model with Multiple
Levels of Categorical Variables
Surface Finish data provides measurements for
35 parts produced on a lathe.
Figure 9.27
Regression with Categorical Variables
Example 9.15 (continued)
A Regression Model with
Multiple Levels of
Categorical Variables
Tool Type (A,B,C,D) is now
coded as 3 dummy variables
Figure 9.28
Regression with Categorical Variables
Example 9.15 (continued) A Regression Model with
Multiple Levels of Categorical Variables
Tool A: Surf. Finish = 24.5 + 0.098 RPM
Tool B: Surf. Finish = 11.2 + 0.098 RPM
Tool C: Surf. Finish = 4.0 + 0.098 RPM
Tool D: Surf. Finish = -1.6 + 0.098 RPM
Figure 9.29
Regression Models with Nonlinear Terms
Curvilinear Regression
Curvilinear models may be appropriate when
scatter charts or residual plots show nonlinear
relationships.
A second-order polynomial might be used:
Y = β0 + β1X + β2X² + ε
Here β1 represents the linear effect of X on Y and
β2 represents the curvilinear effect.
This model is linear in the β parameters so we can
use linear regression methods.
Regression Models with Nonlinear Terms
Example 9.16 Modeling Beverage Sales Using
Curvilinear Regression
Sales of cold beverages increase when it is hotter
outside.
Figure 9.30
Regression Models with Nonlinear Terms
Example 9.16 (continued) Modeling Beverage Sales
Using Curvilinear Regression
U-shape residual plot
Figure 9.31
Regression Models with Nonlinear Terms
Example 9.16 (continued) Modeling Beverage
Sales Using Curvilinear Regression
Residual pattern is more random
Sales = 142,850 − 3643(temperature) + 23.3(temperature)²
Figure 9.32
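The fitted second-order model can be evaluated directly, and its derivative gives the change in predicted sales per additional degree at a given temperature. The sketch below uses the rounded coefficients printed on the slide; the function names and the temperature of 95 are hypothetical choices for illustration.

```python
# Rounded coefficients from the fitted second-order model on the slide.
B0, B1, B2 = 142850.0, -3643.0, 23.3

def predicted_sales(temperature):
    """Sales = b0 + b1*T + b2*T^2."""
    return B0 + B1 * temperature + B2 * temperature ** 2

def marginal_effect(temperature):
    """Derivative of predicted sales with respect to temperature: b1 + 2*b2*T."""
    return B1 + 2 * B2 * temperature

sales_at_95 = predicted_sales(95)
slope_at_95 = marginal_effect(95)
print(round(sales_at_95, 1), round(slope_at_95, 1))
```

Unlike a straight line, the effect of one more degree is not constant; it depends on the temperature at which it is evaluated.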
Regression Models with Nonlinear Terms
Example 9.16 (continued) Modeling Beverage
Sales Using Curvilinear Regression
Second Order Polynomial Trendline
Chapter 9 - Key Terms
Autocorrelation
Coefficient of determination
Coefficient of multiple determination
Curvilinear regression model
Dummy variables
Homoscedasticity
Interaction
Least-squares regression
Multicollinearity
Multiple correlation coefficient
Chapter 9 - Key Terms (continued)
Multiple linear regression
Parsimony
Partial regression coefficient
Regression analysis
Residuals
Significance of regression
Simple linear regression
Standard error of the estimate
Standard residuals
Case Study
Performance Lawn Equipment (9)
Recall that PLE produces lawnmowers and a
medium size diesel power lawn tractor.
Predict what might have happened if PLE never
implemented the 2009 defect reduction initiative.
Determine the effect of education, GPA, and age
when hired on employee retention.
Investigate the rate of learning following the
implementation of the new production technology.
Write a formal report summarizing your results.