
Applied Linear Statistical Models
MS 5218
Dr. Lilun DU
Multiple Regression
COURSE INFORMATION

Announcement
◼ Instructor: Dr. Lilun DU
◼ Office: 7-252 AC3
◼ Email: [email protected]
◼ Office hours: by appointment
◼ Course Website: Canvas
◼ TAs: Shuhua XIAO / Mingshuo LIU
◼ TA Emails: [email protected] / [email protected]
◼ TA Office hours: Tue 3pm-4pm (LIU); Mon 10am-11am
Schedule

(Weekly schedule table shown on slide.)
Course Materials
◼ No required textbooks
◼ All lecture notes, assignments, data/code, and other materials are available on Canvas
◼ Optional references
➢ 1. Kleinbaum, Kupper, Nizam, and Rosenberg, Applied Regression Analysis and Other Multivariable Methods, Thomson
➢ 2. Stine and Foster, Statistics for Business Decision Making and Analysis, Pearson
R and RStudio
◼ R is a programming language
◼ R is convenient for statistical computing and graphical presentation, i.e., for analysing and visualizing data
◼ RStudio is an integrated development environment (IDE) for R
◼ You are very likely to see R code and output in the exam
Assessment
◼ Continuous Assessment: 40%
➢ One written assignment: 20%
➢ One group project (at most 6 people, around 40 groups): 20%
◼ Final Exam (open-notes): 60%
Written Assignment
◼ To be assigned after the 5th or 6th lecture
◼ Completed assignment to be uploaded to Canvas by a specific day and time
◼ After the deadline, no homework will be accepted
➢ homework submitted within 24 hours of the deadline is accepted, but its grade is discounted by 80%
➢ exception: certified religious or medical excuses
Group Project - Content
◼ You may form a group of 1-6 members.
◼ Each group is expected to perform statistical analysis on a dataset using the methods covered in this course.
◼ Each group is expected to provide a proper interpretation of the results obtained.
Group Project - Presentation
◼ Present the methods, results, and insights of your group project in the final week (last two lectures)
◼ Due to the class size (about 90 students), the presentation is limited to 12 minutes per group
◼ The grade is based on
➢ whether the method is used correctly, whether the results are reasonable, and whether the results are presented clearly
➢ the presentation itself, such as fluency and how easy it is to follow
➢ individual grades are further adjusted by your contribution to the project, as reported by your teammates
Final Exam
◼ Exam scope: all materials from the course
◼ The exam will take place offline (in person)
◼ Open notes: you may bring notes
◼ In preparation for the final exam, one set of practice questions will be provided
Intended Lectures
◼ 1. Multiple Regression
◼ 2. Building Regression Models
◼ 3. ANCOVA
◼ 4. ANOVA
◼ 5. Curves (Transformation)
◼ 6. Model Selection
◼ 7. Time Series
◼ 8. Logistic Regression
◼ 9. Poisson Regression
◼ 10. Survival Analysis
◼ 11. Bayesian Linear Regression
REVIEW OF SRM

Simple Linear Regression
◼ In linear regression, we consider the conditional distribution of one variable (y) at each of several levels of a second variable (x).
◼ y is known as the dependent variable (response)
◼ x is known as the independent variable (explanatory variable)
Example: Tesla
◼ Tesla CEO Elon Musk became the richest person in the world once again at the start of 2021.
◼ In 2020 alone, Tesla's share price rocketed upward by more than 700%.
◼ We can use the Capital Asset Pricing Model (CAPM) to analyze
➢ (1) whether Tesla beats the market
➢ (2) whether it is significantly aggressive or defensive
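To make questions (1) and (2) concrete, here is a sketch of the standard CAPM setup behind them (not spelled out on the slide): regress Tesla's daily return on the market's daily return,

$r_{\text{Tesla}} = \alpha + \beta \, r_{\text{SP500}} + \varepsilon.$

An intercept $\alpha$ significantly above zero suggests Tesla beats the market; a slope $\beta$ significantly above 1 marks an aggressive stock, and one significantly below 1 a defensive stock. (At daily frequency the risk-free rate is close to zero, so simple returns are used in place of excess returns.)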
Example: Tesla (n=253)
◼ The scatterplot (shown on slide) summarizes the relationship between the daily (simple) return of Tesla and the daily (simple) return of the SP500 in 2020.
Determine Regression Equation
◼ One goal of regression is to draw the 'best' line through the data points (the least-squares regression line)
◼ The fitted value is $\hat{y} = b_0 + b_1 x$
◼ $e = y - \hat{y}$ is the residual, which describes the error of prediction/estimation
◼ Least squares: choose $b_0$ and $b_1$ to minimize the sum of squared residuals $\sum_{i=1}^{n} e_i^2$
◼ Formula: $b_1 = r \, \dfrac{s_y}{s_x}$ and $b_0 = \bar{y} - b_1 \bar{x}$
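A minimal R sketch of these formulas on simulated data (all names here are illustrative, not from the course materials):

# Simulate data from a known line, then estimate the intercept and slope by hand
set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)
b1 <- cor(x, y) * sd(y) / sd(x)  # b1 = r * s_y / s_x
b0 <- mean(y) - b1 * mean(x)     # b0 = ybar - b1 * xbar
c(b0, b1)
# The built-in least-squares fit gives the same estimates
coef(lm(y ~ x))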
RMSE
◼ The standard deviation of the residuals measures how much the residuals vary around the fitted values; it is called the root mean squared error (RMSE):

$s_e = \sqrt{\dfrac{e_1^2 + e_2^2 + \cdots + e_n^2}{n-2}}$
R-squared
◼ A regression line splits the response into two parts, a fitted value and a residual: $y = \hat{y} + e$
◼ As a summary of the fitted line, it is common to report how much of the variation of $y$ is explained by the regression line, the r-squared:

$r^2 = 1 - \dfrac{\sum_{i=1}^{n} e_i^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$
Regression output (R)
(R summary output shown on slide.)
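Since the slide's output image does not survive in this text, here is a hedged sketch of how such output is produced in R (continuing the simulated x and y above; the course's actual data are on Canvas):

# Fit the simple regression and print the standard summary output
fit <- lm(y ~ x)
summary(fit)             # coefficient table, RMSE, r-squared
summary(fit)$sigma       # RMSE (residual standard error), s_e
summary(fit)$r.squared   # r-squared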
Simple Regression Model (SRM)
◼ A statistical model describes the variation in data as the combination of a pattern plus a background of remaining, unexplained variation.
◼ A pattern in data is a systematic, predictive feature. If customers who receive coupons buy more cereal than customers without coupons, there is a pattern.
◼ The model is used to describe future data.
Model and Assumptions
◼ The model is $y = \beta_0 + \beta_1 x + \varepsilon$
◼ The random errors ($\varepsilon$)
➢ 1. are independent of one another,
➢ 2. have equal variance $\sigma_\varepsilon^2$, and
➢ 3. are normally distributed with mean zero.
Regression Diagnosis (R)
◼ Residual plot and QQ plot (shown on slide)
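A minimal sketch of how these two diagnostic plots are drawn in R (for the illustrative fit above):

# Residuals versus fitted values: look for patterns or unequal variance
plot(fitted(fit), residuals(fit))
abline(h = 0)
# Normal quantile-quantile (QQ) plot of the residuals: check normality
qqnorm(residuals(fit))
qqline(residuals(fit))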
Inference
◼ Three parameters identify the population described by the simple regression model. Least-squares regression provides the estimates: $b_0$ estimates $\beta_0$, $b_1$ estimates $\beta_1$, and $s_e$ estimates $\sigma_\varepsilon$.
◼ The sampling distributions of $b_0$ and $b_1$ (normal distributions)
◼ Confidence intervals and hypothesis tests for $\beta_0$ and $\beta_1$
Regression table
◼ 95% CI for $\beta_1$: $b_1 \pm 2\,se(b_1)$
◼ Hypothesis test: $H_0: \beta_1 = 0$ versus $H_1: \beta_1 \neq 0$

Term       Estimate   SE      t-Stat   p-Value   LCI     UCI
Intercept  0.9021     0.312   2.891    0.00418   0.3     1.5
Slope      1.2326     0.144   8.557    <.0001    0.949   1.516

◼ Read against the Tesla questions (assuming, as the slide sequence suggests, that this is the Tesla CAPM fit): the intercept is significantly positive, so Tesla "beats the market"; the slope's CI [0.949, 1.516] contains 1, so Tesla is not significantly aggressive.
Prediction Interval
◼ The 95% prediction interval for the response $y_{new}$ under the Simple Regression Model equals

$\hat{y}_{new} \pm t_{0.025,\,n-2}\; se(\hat{y}_{new}),$

where $\hat{y}_{new} = b_0 + b_1 x_{new}$ and

$se(\hat{y}_{new}) = \text{RMSE}\,\sqrt{1 + \dfrac{1}{n} + \dfrac{(x_{new} - \bar{x})^2}{(n-1)\, s_x^2}}$
PI illustration
(Illustrative figure shown on slide.)
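In R, this interval is available from predict() without computing the formula by hand (a sketch on the illustrative fit; the value x = 1 is arbitrary):

# 95% prediction interval for a new observation at x = 1
predict(fit, newdata = data.frame(x = 1),
        interval = "prediction", level = 0.95)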
THE MULTIPLE REGRESSION MODEL
The Multiple Regression Model
◼ A chain is considering where to locate a new restaurant. Is it better to locate it far from the competition or in a more affluent area?
▪ Use multiple regression to describe the relationship between several explanatory variables and the response.
▪ Multiple regression separates the effects of each explanatory variable on the response and reveals which ones really matter.
The Multiple Regression Model
▪ Multiple regression model (MRM): a model for the association in the population between multiple explanatory variables and a response.
▪ k: the number of explanatory variables in the multiple regression (k = 1 in simple regression).
The Multiple Regression Model
◼ The response $Y$ is linearly related to $k$ explanatory variables $X_1, X_2, \ldots, X_k$ by the equation

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon, \qquad \varepsilon \sim N(0, \sigma_\varepsilon^2)$

◼ The unobserved errors in the model
1. are independent of one another,
2. have equal variance, and
3. are normally distributed around the regression equation.
The Multiple Regression Model
▪ (SRM vs. MRM) While the SRM bundles all but one explanatory variable into the error term, multiple regression allows several variables to be included in the model.
▪ In the MRM, residuals departing from normality may suggest that an important explanatory variable has been omitted.
Interpreting Multiple Regression
◼ Example: Women's Apparel Stores
▪ Response variable: sales at stores in a chain of women's apparel (annual dollars per square foot of retail space).
▪ Two explanatory variables: median household income in the area (thousands of dollars) and number of competing apparel stores in the same mall.
Interpreting Multiple Regression
◼ Example: Women's Apparel Stores
▪ Begin with a scatterplot matrix, a table of scatterplots arranged as in a correlation matrix.
▪ Using a scatterplot matrix to understand the data can save considerable time later when interpreting the multiple regression results (see the R sketch below).
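A hedged sketch of both matrices in R; the data frame stores and the variable names Sales, Income, and Competitors are assumptions for illustration:

# Scatterplot matrix of the response and the two explanatory variables
pairs(~ Sales + Income + Competitors, data = stores)
# Correlation matrix of the same variables
cor(stores[, c("Sales", "Income", "Competitors")])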
Interpreting Multiple Regression
◼ Scatterplot Matrix: Women's Apparel Stores (figure on slide)
Interpreting Multiple Regression
◼ Example: Women's Apparel Stores
◼ The scatterplot matrix for this example
▪ confirms a positive linear association between sales and median household income;
▪ shows a weak association between sales and the number of competitors.
Interpreting Multiple Regression
◼ Correlation Matrix: Women's Apparel Stores (table on slide)
Interpreting Multiple Regression
◼ R-squared and $s_e$
▪ The equation of the fitted model for estimating sales in the women's apparel stores example is

$\hat{y} = 60.359 + 7.966\,\text{Income} - 24.165\,\text{Competitors}$
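A sketch of the corresponding fit in R, reusing the assumed names from the scatterplot-matrix sketch above:

# Multiple regression of sales on income and number of competitors
fit2 <- lm(Sales ~ Income + Competitors, data = stores)
summary(fit2)   # partial slopes, R-squared, adjusted R-squared, s_e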
Interpreting Multiple Regression
◼ R-squared
▪ $R^2$ indicates that the fitted equation explains 59.47% of the store-to-store variation in sales.
▪ For this example, $R^2$ is larger than the $r^2$ values of the separate SRMs fitted for each explanatory variable; it is also larger than their sum.
▪ $R^2$ never decreases when an explanatory variable is added to a regression.
Interpreting Multiple Regression
◼ Adjusted R-squared
▪ $\bar{R}^2$ is known as the adjusted R-squared. It adjusts for both the sample size n and the model size k, and it is always smaller than $R^2$:

$\bar{R}^2 = 1 - (1 - R^2) \times \dfrac{n-1}{n-k-1}$
Interpreting Multiple Regression
◼ Calibration Plot (for $R^2$)
▪ Calibration plot: scatterplot of the response $y$ on the fitted values $\hat{y}$.
▪ $R^2$ is the squared correlation between $y$ and $\hat{y}$; the tighter the data cluster along the diagonal line in the calibration plot, the larger the $R^2$ value.
Interpreting Multiple Regression
◼ Calibration Plot: Women's Apparel Stores (figure on slide)
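A minimal sketch of the calibration plot in R (for the assumed fit2 above):

# Calibration plot: response versus fitted values, with the diagonal line
plot(fitted(fit2), stores$Sales)
abline(0, 1)
# R-squared equals the squared correlation between y and y-hat
cor(fitted(fit2), stores$Sales)^2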
Interpreting Multiple Regression
◼ RMSE ($s_e$):

$s_e = \sqrt{\dfrac{e_1^2 + e_2^2 + \cdots + e_n^2}{n-k-1}}$

◼ The residual degrees of freedom, n − k − 1, is the divisor of $s_e$.
◼ For this example, $s_e$ = $68.03
Interpreting Multiple Regression
◼ Marginal and Partial Slopes
▪ Partial slope: the slope of an explanatory variable in a multiple regression, which statistically excludes the effects of the other explanatory variables.
▪ Marginal slope: the slope of an explanatory variable in a simple regression.
Interpreting Multiple Regression
◼ Partial Slopes: Women's Apparel Stores (coefficient table on slide)
Interpreting Multiple Regression
◼ Partial Slopes: Women's Apparel Stores
▪ The slope b1 = 7.966 for Income implies that a store in a location whose median household income is $10,000 higher sells, on average, $79.66 more per square foot than a store in a less affluent location with the same number of competitors.
▪ The slope b2 = -24.165 implies that, among stores in equally affluent locations, each additional competitor lowers average sales by $24.165 per square foot.
Interpreting Multiple Regression
◼ Marginal and Partial Slopes
▪ Partial and marginal slopes agree only when the explanatory variables are uncorrelated.
▪ In this example they do not agree.
▪ For instance, the marginal slope for Competitors is 4.6352.
▪ It is positive because more affluent locations tend to draw more competitors.
▪ The MRM separates these effects; the SRM does not.
Interpreting Multiple Regression
◼ Path Diagram
▪ Path diagram: a schematic drawing of the relationships among the explanatory variables and the response.
▪ Collinearity: very high correlations among the explanatory variables that make the estimates in a multiple regression uninterpretable.
Interpreting Multiple Regression
◼ Path Diagram: Women's Apparel Stores (diagram on slide; the Income path is labeled 7.966 and the Competitors path -24.165)
◼ Income has a direct positive effect on sales and an indirect negative effect on sales via the number of competitors.
Checking Conditions
◼ Conditions for Inference
◼ Use the residuals from the fitted MRM to check that the errors in the model
▪ are independent;
▪ have equal variance; and
▪ follow a normal distribution.
Checking Conditions
◼ Residual Plots
◼ Linearity: plot the residuals versus each explanatory variable to verify that the relationships are linear.
◼ Equal Variance: plot the residuals versus $\hat{y}$ (also reveals outliers).
◼ Independence: if the data form a time series, use a timeplot of the residuals together with the Durbin-Watson statistic; otherwise, understand the context. (A Durbin-Watson sketch in R follows below.)
▪ Note: dependence can arise from omitting an important variable from the model (plot the residuals versus the lurking variable).
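A hedged sketch of the Durbin-Watson test in R using the lmtest package (one of several packages that provide it):

# Durbin-Watson test for first-order autocorrelation in the residuals
# install.packages("lmtest")   # once, if not installed
library(lmtest)
dwtest(fit2)   # a statistic near 2 suggests no autocorrelation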
Checking Conditions
◼ Residual Plot: Women's Apparel Stores (residuals vs. fitted values, shown on slide)
◼ Visually estimate that $s_e$ is less than $75 per sq ft (all residuals lie within $150 of the horizontal line).
◼ No evident pattern.
Checking Conditions
◼ Residual Plot: Women's Apparel Stores
◼ The plot of residuals versus Income (shown on slide) does not indicate a problem.
Checking Conditions
◼ Residual Plot: Women's Apparel Stores
◼ The plot of residuals versus Competitors (shown on slide) does not indicate a problem.
Checking Conditions
◼ Check Normality: Women's Apparel Stores
◼ The quantile plot (shown on slide) indicates that the nearly normal condition is satisfied (although a slight skew is evident in the histogram).
Inference in Multiple Regression
◼ Inference for the Model: F-test
▪ F-test: a test of the explanatory power of the MRM as a whole.
▪ F-statistic: the ratio of the sample variance of the fitted values to the variance of the residuals.
Inference in Multiple Regression
◼ Inference for the Model: F-test
◼ The F-statistic

$F = \dfrac{R^2}{1 - R^2} \times \dfrac{n-k-1}{k}$

◼ is used to test the null hypothesis that all slopes are equal to zero, e.g., $H_0: \beta_1 = \beta_2 = 0$.
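A quick check of this formula in R on the assumed fit2 (here k = 2 explanatory variables):

# F-statistic computed from R-squared, n, and k
r2 <- summary(fit2)$r.squared
n <- nrow(stores)
k <- 2
r2 / (1 - r2) * (n - k - 1) / k
# Matches the value reported by summary()
summary(fit2)$fstatistic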
Inference in Multiple Regression
◼ F-test Results in the Analysis of Variance Table (shown on slide)
◼ The F-statistic has a p-value of <0.0001; reject $H_0$.
◼ Income and Competitors together explain statistically significant variation in sales.
Inference in Multiple Regression
◼ Inference for One Coefficient
▪ The t-statistic is used to test each slope using the null hypothesis $H_0: \beta_j = 0$.
▪ The t-statistic is calculated as

$t_j = \dfrac{b_j - 0}{se(b_j)}$
Inference in Multiple Regression
◼ t-test Results for Women's Apparel Stores (table on slide)
◼ The t-statistics and associated p-values indicate that both slopes are significantly different from zero.
Inference in Multiple Regression
◼ Confidence Interval Results (shown on slide)
◼ Consistent with the t-statistic results.
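In R, the t-statistics and the confidence intervals come straight from the fitted model (the assumed fit2 again):

# Coefficient table: estimates, standard errors, t-statistics, p-values
coef(summary(fit2))
# 95% confidence intervals for the intercept and partial slopes
confint(fit2, level = 0.95)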
Inference in Multiple Regression
◼ Prediction Intervals
▪ An approximate 95% prediction interval is given by $\hat{y} \pm 2 s_e$.
▪ For example, the 95% prediction interval for sales per square foot at a location with median income of $70,000 and 3 competitors is approximately $545.48 ± $137.29 per square foot.
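The exact interval in R, for comparison with the ±2·s_e approximation (assumed names; Income is recorded in thousands of dollars):

# y-hat = 60.359 + 7.966*70 - 24.165*3 = 545.48 dollars per square foot
new_store <- data.frame(Income = 70, Competitors = 3)
predict(fit2, newdata = new_store, interval = "prediction", level = 0.95)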
Steps in Fitting a Multiple Regression
1. What is the problem to be solved? Do these data help in solving it?
2. Check the scatterplots of the response versus each explanatory variable (scatterplot matrix).
3. If the scatterplots appear straight enough, fit the multiple regression model. Otherwise, find a transformation.
Steps in Fitting a Multiple Regression
4. Obtain the residuals and fitted values from the regression.
5. Make scatterplots that show the overall model. Use the plot of e versus $\hat{y}$ to check for similar variances.
6. Check the residuals for dependence.
7. Scatterplot the residuals versus individual explanatory variables. Look for patterns.
Steps in Fitting a Multiple Regression
8. Check whether the residuals are nearly normal.
9. Use the F-statistic to test the null hypothesis that the collection of explanatory variables has no effect on the response.
10. If the F-statistic is statistically significant, test and interpret the individual partial slopes.
Pitfalls/take-home message
▪ Don't confuse a multiple regression with several simple regressions.
▪ Don't believe that you have all of the important variables.
▪ Don't think that you have found causal effects.
