Module 6A Estimating Relationships
ESTIMATING RELATIONSHIPS
Predictive Analytics
Learning Objectives:
At the end of the lesson, the student should be able to:
• Recall how to estimate and interpret a linear regression model.
• Interpret goodness-of-fit measures.
• Conduct tests of significance.
• Estimate and interpret multiple linear regression models.
Regression Analysis
▪ One of the most widely used techniques in predictive analytics.
▪ Used to capture the relationship between two or more variables and
to predict the outcome of a target variable based on several input
variables.
▪ The hypothesized relationship may be linear, quadratic, or some other
form.
▪ It also allows us to make assessments and robust predictions by
determining which of the relationships matter most or can be
ignored.
Examples of nonlinear
relationships
THE LINEAR REGRESSION MODEL
We formulate a linear model that relates the outcome of a target
variable (also called a response, criterion, or dependent variable) to
one or more input variables (also called stimulus, predictor, or
independent variables).
Consequently, we use the information on the predictor variables
to describe and/or predict changes in the response variable.
Source: Statistics for Managers Using Microsoft Excel, 5e © 2008 Prentice-Hall, Inc.
Regression Analysis: Types of relationships
Simple Linear Regression Equation
A common approach to obtaining estimates for the coefficients is to use
the ordinary least squares (OLS) method. OLS estimators have many desirable
properties if certain assumptions hold.
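For reference, the simple linear regression model and its OLS estimates can be written in the standard textbook form below. The symbols ($\beta_0$, $\beta_1$, $\varepsilon$, $b_0$, $b_1$) are notation assumed here, not copied from the slide.

```latex
\begin{aligned}
y &= \beta_0 + \beta_1 x + \varepsilon, \qquad \hat{y} = b_0 + b_1 x \\
b_1 &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad
b_0 = \bar{y} - b_1 \bar{x}
\end{aligned}
```

OLS chooses $b_0$ and $b_1$ so that the sum of squared residuals $\sum (y_i - \hat{y}_i)^2$ is as small as possible.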
where Earnings is annual post-college earnings (in $), Cost is the average annual
cost (in $), Grad is the graduation rate (in %), Debt is the percentage of students
paying down debt (in %), and City takes the value 1 if the college is located in a
city and 0 otherwise.
a. What is the sample regression equation?
b. Interpret the slope coefficients.
c. Predict annual post-college earnings if a college’s average annual cost is
$25,000, its graduation rate is 60%, its percentage of students paying down
debt is 80%, and it is located in a city.
Table 1
Example 1 Answers
a. The sample regression equation is:
Example 1 data: study hours (X) and score (Y)
Obs.   Hours (X)   Score (Y)
1      1           53
2      5           74
3      7           59
4      8           43
5      10          56
6      11          84
7      14          96
8      15          69
9      15          84
10     19          83
Example 1 Excel output
For the scatterplot:
1. Highlight the X array and the Y array.
2. Choose Insert.
3. Choose Scatter among the available chart types.
4. Edit the axis labels.
Example 1 Excel output
For regression:
1. Go to Data, then choose Data Analysis.
2. Choose Regression among the Data Analysis tools.
3. Fill in the necessary fields.
4. Click OK.
Example 1 Excel output
Intercept = 49.477
Example 1 Interpretation of coefficients
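$\widehat{Score} = 49.477 + 1.9641 \cdot X$, where X is study hours
▪ Slope (1.9641): each additional hour of study is associated with an average increase of about 1.96 points in score.
▪ Intercept (49.477): the predicted score when study hours are zero.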
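The coefficients can be checked outside Excel. The sketch below is a minimal pure-Python version of the OLS formulas, assuming the ten (hours, score) pairs listed for Example 1 in this module; the variable names are illustrative only.

```python
# Example 1 data as listed in this module: study hours (X) and score (Y).
hours = [1, 5, 7, 8, 10, 11, 14, 15, 15, 19]
scores = [53, 74, 59, 43, 56, 84, 96, 69, 84, 83]

n = len(hours)
x_bar = sum(hours) / n
y_bar = sum(scores) / n

# Sums of squares and cross-products around the means.
s_xx = sum((x - x_bar) ** 2 for x in hours)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores))

b1 = s_xy / s_xx          # slope
b0 = y_bar - b1 * x_bar   # intercept

print(f"Score = {b0:.3f} + {b1:.4f} * Hours")
```

The printed equation should agree with the Excel output up to rounding.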
$0 \le r^2 \le 1$
Example 1 Coefficient of determination
$r^2 = \frac{SSR}{SST} = \frac{1020.341}{2588.90} = 0.39412$
39.41% of the variation in scores
is explained by the variation in
study hours.
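The sums of squares behind this figure can be reproduced directly. The sketch below assumes the Example 1 data listed in this module and refits the line so that it is self-contained; names such as `ssr` and `sst` are illustrative.

```python
hours = [1, 5, 7, 8, 10, 11, 14, 15, 15, 19]
scores = [53, 74, 59, 43, 56, 84, 96, 69, 84, 83]

n = len(hours)
x_bar = sum(hours) / n
y_bar = sum(scores) / n

# OLS slope and intercept for the simple regression.
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores)) / sum(
    (x - x_bar) ** 2 for x in hours
)
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * x for x in hours]

sst = sum((y - y_bar) ** 2 for y in scores)               # total variation
ssr = sum((f - y_bar) ** 2 for f in fitted)               # explained variation
sse = sum((y - f) ** 2 for y, f in zip(scores, fitted))   # unexplained variation

r2 = ssr / sst
print(f"SSR = {ssr:.3f}, SSE = {sse:.3f}, SST = {sst:.2f}, r2 = {r2:.5f}")
```

Because SST = SSR + SSE, the same value can also be obtained as 1 - SSE/SST.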
Standard Error of Estimate
Example 1 Standard error of estimate
$S_{YX} = 14.002$
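As a worked check, the value follows from the usual formula for the standard error of estimate in simple regression. SST and SSR are the Example 1 figures reported in this module; SSE is derived here as SST minus SSR.

```latex
S_{YX} = \sqrt{\frac{SSE}{n-2}}
       = \sqrt{\frac{SST - SSR}{n-2}}
       = \sqrt{\frac{2588.90 - 1020.341}{10 - 2}}
       = \sqrt{196.070}
       \approx 14.002
```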
Comparing Standard Errors
$S_{YX}$ is a measure of the variation of observed Y values from the
regression line.
The magnitude of $S_{YX}$ should always be judged relative to the
size of the Y values in the sample data.
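One simple way to make that comparison concrete (an illustrative choice, not a rule stated in the slides) is to express the standard error of estimate as a fraction of the mean of Y. The sketch below does this for Example 1, using the SST and SSR values reported in this module.

```python
import math

scores = [53, 74, 59, 43, 56, 84, 96, 69, 84, 83]
n = len(scores)

sst, ssr = 2588.90, 1020.341           # Example 1 sums of squares
s_yx = math.sqrt((sst - ssr) / (n - 2))

mean_score = sum(scores) / n
ratio = s_yx / mean_score              # illustrative relative measure

print(f"S_YX = {s_yx:.3f}, mean Y = {mean_score:.1f}, ratio = {ratio:.2f}")
```

A ratio near 0.20 means the typical prediction error is roughly one fifth of the average score, which is sizeable for this sample.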
Adjusted R squared
Tests of Significance
We test for joint and individual significance to determine
whether there is evidence of a linear relationship between the
response variable and the predictor variables.
For the tests to be valid, the OLS estimators b1, b2, …, bk
must be normally distributed. This condition is satisfied if the
random error term is normally distributed. If we cannot
assume normality of the errors, then the tests are valid
only for large sample sizes.
Test of Joint Significance or Overall Fit
Table 1
Test of Joint Significance or Overall fit
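For reference, the test of joint significance is usually set up as below. This is the standard textbook form; k (number of predictors) and n (sample size) are notation assumed here.

```latex
\begin{aligned}
H_0 &: \beta_1 = \beta_2 = \cdots = \beta_k = 0 \\
H_A &: \text{at least one } \beta_j \neq 0 \\
F &= \frac{MSR}{MSE} = \frac{SSR / k}{SSE / (n - k - 1)}, \qquad df_1 = k, \quad df_2 = n - k - 1
\end{aligned}
```

A large F value (small p-value) is evidence that the predictors are jointly related to the response.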
Test of Individual Significance:
▪ The t-test for a population slope is used to determine if there
is a linear relationship between X and Y.
▪ Null and alternative hypotheses
H0: β1 = 0 (no linear relationship)
H1: β1 ≠ 0 (linear relationship does exist)
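The test statistic for these hypotheses in simple regression is commonly written as follows. The symbol $S_{b_1}$ (the standard error of the slope) is notation assumed here.

```latex
t = \frac{b_1 - \beta_1}{S_{b_1}}, \qquad
S_{b_1} = \frac{S_{YX}}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}, \qquad
df = n - 2
```

Under H0 the hypothesized value is $\beta_1 = 0$, so the statistic reduces to $b_1 / S_{b_1}$.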
Example 1 Inferences about the slope
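The estimated regression equation is Score = 49.477 + 1.9641 × Hours, with slope $b_1 = 1.9641$ and standard error $S_{b_1} = 0.8610$.
$t = \frac{b_1 - \beta_1}{S_{b_1}} = \frac{1.9641 - 0}{0.8610} = 2.2812$
$df = n - 2 = 10 - 2 = 8$
Two-tailed p-value in Excel: TDIST(2.281221, 8, 2) = 0.052 (equivalently, T.DIST.2T(2.281221, 8))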
Example 1 Inference about slope
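• H0: β1 = 0
• H1: β1 ≠ 0
p-value = TDIST(2.281221, 8, 2) = 0.052
Since 0.052 > 0.05, we do not reject H0 at α = 0.05: at that level, the evidence of a linear relationship between study hours and score is not statistically significant.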
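The p-value can also be approximated without Excel. The sketch below is one possible approach, not the method used in the slides: it integrates the Student's t density numerically with Simpson's rule, using only the standard library. The function names are hypothetical.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p(t_stat, df, steps=2000):
    """Two-tailed p-value via Simpson's rule on [0, |t|]; steps must be even."""
    upper = abs(t_stat)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = 2 * (h / 3) * total   # P(-|t| < T < |t|) by symmetry
    return 1 - central_area

b1, se_b1, n = 1.9641, 0.8610, 10        # Example 1 values from this module
t_stat = (b1 - 0) / se_b1
p_value = two_tailed_p(t_stat, n - 2)

print(f"t = {t_stat:.4f}, p-value = {p_value:.3f}")
```

The result should round to the same p-value that Excel reports for 8 degrees of freedom.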
$b_1 = -24.975$: Sales will decrease, on average, by 24.975 pies per week for each $1 increase in selling price, net of the effects of changes due to advertising.
$b_2 = 74.131$: Sales will increase, on average, by 74.131 pies per week for each $100 increase in advertising cost, net of the effects of changes due to price.
Predict sales for a week if the selling price is $6.50 and
the advertising cost is $420:
$F = \frac{MSR}{MSE} = \frac{14730.013}{2252.776} = 6.539$
The p-value is 0.012. Reject the
null hypothesis at α=0.05.
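The F statistic and its p-value can be checked with a few lines of code. The sketch below takes MSR and MSE from the slide. The denominator degrees of freedom (12) are an assumption inferred from the reported sums of squares, not a figure stated in the slides. The closed-form tail probability used here holds only when the numerator degrees of freedom equal 2, as in this two-predictor model.

```python
msr = 14730.013   # mean square regression (from the slide)
mse = 2252.776    # mean square error (from the slide)

df1 = 2           # two predictors: selling price and advertising cost
df2 = 12          # ASSUMPTION: n - k - 1 with n = 15, inferred from the output

f_stat = msr / mse

# For df1 == 2 only, P(F > f) has the closed form (1 + 2f/df2) ** (-df2/2).
p_value = (1 + df1 * f_stat / df2) ** (-df2 / 2)

print(f"F = {f_stat:.3f}, p-value = {p_value:.3f}")
```

For other numbers of predictors the tail probability needs the general F distribution, for example Excel's F.DIST.RT.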
$0 \le r^2 \le 1$
ASSESSING OVERALL FIT: Coeff. of Multiple Determination
$R^2 = \frac{SSR}{SST} = \frac{29460.027}{56493.333} = 0.521$
52.1% of the variation in pie sales
is explained by the variation in
selling price and advertising cost.
ADJUSTED R2
▪ R-squared never decreases when a new predictor variable X is
added to the model, even if that variable has little explanatory power.
▪ This can be a disadvantage when comparing models.
▪ What is the net effect of adding a new variable?
▪ We lose a degree of freedom when a new variable is
added.
▪ Did the new X variable add enough independent power to
offset the loss of one degree of freedom?
ADJUSTED R2
▪ The adjusted R2 shows the proportion of variation in Y explained by
all X variables adjusted for the number of X variables used.
Adjusted $R^2 = 0.442$
44.2% of the variation in pie sales is explained by
the variation in selling price and advertising cost,
taking into account the sample size and number
of predictor variables.
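The adjustment is usually computed as 1 - (1 - R²)(n - 1)/(n - k - 1). The sketch below applies that standard formula to the pie sales example. The sample size n = 15 is an assumption inferred from the reported mean squares, and `adjusted_r2` is a hypothetical helper name.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

k = 2                               # selling price and advertising cost
n = 15                              # ASSUMPTION: inferred, not stated in the slides
msr, mse = 14730.013, 2252.776      # mean squares reported for the pie sales model

ssr = k * msr                       # regression sum of squares
sse = (n - k - 1) * mse             # error sum of squares
r2 = ssr / (ssr + sse)

adj = adjusted_r2(r2, n, k)
print(f"R2 = {r2:.3f}, adjusted R2 = {adj:.3f}")
```

The adjusted value is always at most R², and the gap widens as more predictors are added relative to the sample size.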
How many predictors?
▪ One way to prevent overfitting the model is to limit the
number of predictors based on the sample size.
Aside from visually examining scatter plots of the independent variable (IV) against the dependent
variable (DV) to assess linearity, the scatter plot of the IV against the residuals may also be examined.
The plots at the left show curved patterns, which indicate that the relationship in the data is not linear,
so another model should be used.
Checking the assumptions by examining the residuals
Residual Analysis for Equal Variance: Plot X against the residuals
Checking the assumptions by examining the residuals
Residual Analysis for Equal Variance: Plot the predicted values against the residuals
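The values needed for both residual plots can be tabulated as follows. This is a minimal sketch assuming the Example 1 data and the fitted coefficients reported in this module; the plotting itself is left to Excel, JASP, or Gretl.

```python
hours = [1, 5, 7, 8, 10, 11, 14, 15, 15, 19]
scores = [53, 74, 59, 43, 56, 84, 96, 69, 84, 83]

b0, b1 = 49.477, 1.9641   # fitted coefficients reported for Example 1

predicted = [b0 + b1 * x for x in hours]
residuals = [y - p for y, p in zip(scores, predicted)]

# Columns: X, predicted Y, residual. Plot column 3 against column 1 or 2.
print(f"{'X':>4} {'Pred':>8} {'Resid':>8}")
for x, p, e in zip(hours, predicted, residuals):
    print(f"{x:>4} {p:>8.2f} {e:>8.2f}")
```

A roughly constant vertical spread of residuals across X (or across the predicted values) is consistent with equal variance; a funnel shape suggests a violation.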
Checking the assumptions by examining the residuals
Residual Analysis for Normality:
1. Examine the Stem-and-Leaf Display of the Residuals
2. Examine the Box-and-Whisker Plot of the Residuals
3. Examine the Histogram of the Residuals
4. Construct a normal probability plot.
5. Construct a Q-Q plot.
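The coordinates for a Q-Q plot of the residuals can be computed as sketched below. It assumes the Example 1 data and reported coefficients, and it uses the plotting positions (i - 0.5)/n, which is one common convention among several.

```python
from statistics import NormalDist, mean, stdev

hours = [1, 5, 7, 8, 10, 11, 14, 15, 15, 19]
scores = [53, 74, 59, 43, 56, 84, 96, 69, 84, 83]
b0, b1 = 49.477, 1.9641   # fitted coefficients reported for Example 1

residuals = [y - (b0 + b1 * x) for x, y in zip(hours, scores)]

# Standardize and sort the residuals (sample quantiles).
# stdev() divides by n - 1; it is used here only to put residuals on a z scale.
m, s = mean(residuals), stdev(residuals)
ordered = sorted((e - m) / s for e in residuals)

# Matching standard normal quantiles (theoretical quantiles).
n = len(ordered)
theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

for q, z in zip(theoretical, ordered):
    print(f"{q:>7.3f} {z:>7.3f}")
```

If the residuals are approximately normal, the printed pairs fall close to a straight line when plotted.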
Checking the assumptions by examining the residuals
Durbin-Watson statistic:
$D = \frac{\sum_{i=2}^{n} (e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2}$
▪ The possible range is 0 ≤ D ≤ 4.
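The statistic D above, which matches the Durbin-Watson statistic, can be computed with a short function. The sketch below uses made-up residual sequences purely to show the two extremes of the range; `durbin_watson` is a hypothetical helper name.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squares."""
    num = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    den = sum(e ** 2 for e in residuals)
    return num / den

# Hypothetical residuals, in time order.
print(durbin_watson([1, -1, 1, -1]))   # alternating signs -> 3.0
print(durbin_watson([2, 2, 2]))        # identical residuals -> 0.0
```

Values near 2 suggest no first-order autocorrelation, values near 0 suggest positive autocorrelation, and values near 4 suggest negative autocorrelation. The statistic is meaningful only when the observations have a natural order, such as time.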
The assumption of
normality of residuals is
satisfied since the points
follow a straight line.
Example 1 in JASP: Correlation analysis
Example 1 in JASP: Regression analysis with assumption checks
Example 1 in Gretl: Regression analysis
Example 1 in Gretl: Output
Example 1 in Gretl: Assumption checks
Strategies when performing regression analysis
▪ Start with a scatter plot of Y against X to observe a possible
relationship.
▪ Perform residual analysis to check the assumptions.
▪ Plot the residuals vs X to check for violations of
assumptions such as equal variance.
▪ Use a histogram, stem and leaf display, box and whisker
plot or normal probability plot of the residuals to uncover
possible non-normality.
▪ If there is any violation of any assumption, use alternative
methods or models.
▪ If there is no evidence of assumption violation, then test for
the significance of the regression coefficients.
▪ Avoid making predictions or forecasts outside the relevant
range.
Exercises:
Consider the data on the selling price of a
home (in thousands of dollars), home size (in sq ft), lot size
(in thousands of sq ft), and baths (number of
bathrooms).
a. Obtain the regression equation using
Price as response variable and the rest as
predictors.
b. Describe the results in terms of model fit.
c. Describe the results in terms of the
coefficient of determination.
d. Interpret the different coefficients.
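For readers who want to check their Excel results in code, the sketch below fits a multiple regression by solving the normal equations with Gauss-Jordan elimination. The exercise data are not reproduced in this module, so the numbers are made up for illustration (generated exactly from y = 1 + 2·x1 + 3·x2), and `fit_ols` is a hypothetical helper name.

```python
def fit_ols(rows, y):
    """Return [b0, b1, ..., bk] by solving the normal equations (X'X) b = X'y."""
    X = [[1.0] + list(r) for r in rows]          # prepend the intercept column
    p = len(X[0])

    # Augmented matrix [X'X | X'y].
    A = [
        [sum(xi[a] * xi[b] for xi in X) for b in range(p)]
        + [sum(xi[a] * yi for xi, yi in zip(X, y))]
        for a in range(p)
    ]

    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [v - factor * w for v, w in zip(A[r], A[col])]

    return [A[i][p] / A[i][i] for i in range(p)]

# Hypothetical data: two predictors, response built as 1 + 2*x1 + 3*x2.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
y = [9, 8, 19, 18, 26]

coefs = fit_ols(rows, y)
print([round(c, 6) for c in coefs])
```

With the actual exercise data, each row would hold home size, lot size, and number of bathrooms, and y would hold the selling prices. This direct solve is fine for small, well-conditioned problems; statistical software uses more numerically stable methods.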
Some helpful sources:
▪ The Four Assumptions of Linear Regression – Statology