Applied Linear Statistical Models: MS 5218
Dr. Lilun DU
Multiple Regression Models
COURSE INFORMATION
Announcement
◼ Instructor: Dr. Lilun DU
◼ Office: 7-252 AC3
◼ Email: [email protected]
◼ Office hours: by appointment
◼ Course Website: Canvas
◼ TA: Shuhua XIAO/Mingshuo LIU
◼ TA Emails: [email protected] / [email protected]
◼ TA Office hours: Tue 3pm-4pm (LIU); Mon 10am-11am
Schedule
Course Materials
◼ No required textbooks
◼ All lecture notes, assignments, data/code,
and other materials are available on
Canvas
◼ Optional references
➢ 1. Kleinbaum, Kupper, Nizam, and Rosenberg, Applied Regression Analysis and Other Multivariable Methods, Thomson
➢ 2. Stine and Foster, Statistics for Business: Decision Making and Analysis, Pearson
R and RStudio
◼ R is a programming language
◼ R is convenient for statistical computing and graphics, making it easy to analyze and visualize data
◼ RStudio is an integrated development environment for R
◼ You are very likely to see R code and output in the exam
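◼ A minimal sketch of an R session in RStudio (the file and column names here are hypothetical, not the course data):
  # read a CSV file into a data frame
  returns <- read.csv("tesla_sp500_2020.csv")
  head(returns)                         # look at the first few rows
  summary(returns)                      # basic summary statistics
  plot(returns$sp500, returns$tesla)    # scatterplot of one column against another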
Assessment
◼ Continuous Assessment: 40%
➢ One written assignment 20%
➢ One group project (at most 6 people, around
40 groups) 20%
Written Assignment
◼ To be assigned after the 5th or 6th lecture
◼ The completed assignment is to be uploaded to Canvas by a specified day and time
Group project - presentation
◼ Present the methods, results, and insights of
your group project in the final week (last two
lectures)
◼ Due to the class size (about 90 students), each group's presentation will be limited to 12 minutes
◼ The grade is based on
➢ whether the method is used correctly, whether the results are reasonable, and whether the results are presented clearly
➢ the presentation itself, such as fluency and how easy it is to follow
➢ Individual grades are further determined by your contribution to the project, as assessed by your teammates
Final Exam
◼ Exam scope: all the materials from the
course
◼ The exam will take place offline (in person)
◼ Open notes: you may bring your notes
◼ In preparation for the final exam, one set of practice questions will be provided
Intended Lectures
◼ 1. Multiple Regression
◼ 2. Building Regression Models
◼ 3. ANCOVA
◼ 4. ANOVA
◼ 5. CURVE (Transformation)
◼ 6. Model Selection
◼ 7. Time Series
◼ 8. Logistic Regression
◼ 9. Poisson Regression
◼ 10. Survival Analysis
◼ 11. Bayesian Linear Regression
REVIEW OF SRM
Simple Linear Regression
◼ In linear regression, we consider the conditional
distribution of one variable (y) at each of several levels
of a second variable (x).
◼ y is known as the dependent variable (response)
◼ x is known as the independent variable (explanatory
variable)
Example: Tesla
◼ Tesla's CEO, Elon Musk, once again became the richest person in the world at the start of 2021.
Example: Tesla (n=253)
◼ The scatterplot summarizes the relationship between the daily (simple) return of Tesla and the daily (simple) return of the S&P 500 in 2020.
Determine Regression Equation
◼ One goal of regression is to draw the ‘best’ line through
the data points (least-squares regression line)
◼ The fitted value: $\hat{y} = b_0 + b_1 x$
◼ Formula: $b_1 = r\,\dfrac{s_y}{s_x}$ and $b_0 = \bar{y} - b_1 \bar{x}$
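◼ A quick R check of these formulas against lm(), using small made-up vectors:
  x <- c(1, 2, 3, 4, 5)
  y <- c(2.0, 2.9, 4.2, 4.8, 6.1)
  b1 <- cor(x, y) * sd(y) / sd(x)    # slope from the correlation formula
  b0 <- mean(y) - b1 * mean(x)       # intercept
  c(b0, b1)
  coef(lm(y ~ x))                    # least-squares estimates, should match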
RMSE
◼ The standard deviation of the residuals measures how much the residuals vary around the fitted values; this is called the root mean squared error (RMSE)
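◼ For reference, the usual formula in simple regression divides by n − 2 (two estimated coefficients):
$$\mathrm{RMSE} = s_e = \sqrt{\frac{\sum_{i=1}^{n} e_i^2}{n-2}}$$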
R-squared
◼ A regression line splits the response into two parts, a fitted value and a residual: $y = \hat{y} + e$
$$r^2 = 1 - \frac{\sum_{i=1}^{n} e_i^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$
Regression output (R)
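◼ Output like this is typically produced by calling summary() on an lm fit; a sketch using the hypothetical Tesla data frame from the earlier R slide:
  # regress Tesla's daily return on the S&P 500's daily return
  fit <- lm(tesla ~ sp500, data = returns)
  summary(fit)    # coefficients, standard errors, t-statistics, R-squared, residual standard error (RMSE)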
Simple Regression Model (SRM)
◼ A statistical model describes the variation in data as the
combination of a pattern plus a background of remaining,
unexplained variation.
Model and Assumptions
◼ The model is 𝑦 = 𝛽0 + 𝛽1 𝑥 + 𝜀
Regression Diagnostics (R)
◼ Residual plot ◼ QQ plots
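◼ The same diagnostic plots can be drawn in R from the fitted model (fit is the lm object from the earlier sketch):
  plot(fitted(fit), resid(fit),
       xlab = "Fitted values", ylab = "Residuals")   # residual plot
  abline(h = 0, lty = 2)
  qqnorm(resid(fit)); qqline(resid(fit))             # normal QQ plot of the residuals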
Inference
◼ Three parameters identify the population described by
the simple regression model. The least-squares
regression provides the estimates: 𝑏0 estimates 𝛽0 , 𝑏1
estimates 𝛽1 , and 𝑠𝑒 estimates 𝜎𝜀 .
Regression table
◼ 95% CI for 𝛽1 : 𝑏1 ± 2𝑠𝑒(𝑏1 )
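◼ In R, exact t-based intervals (rather than the ±2·se approximation) come from confint applied to the fitted model:
  confint(fit, level = 0.95)    # 95% confidence intervals for the intercept and slope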
Prediction Interval
◼ The 95% prediction interval for a new response $y_{new}$ under the Simple Regression Model is approximately $\hat{y}_{new} \pm 2\,\mathrm{se}(\hat{y}_{new})$, where
$$\mathrm{se}(\hat{y}_{new}) = \mathrm{RMSE}\sqrt{1 + \frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{(n-1)\,s_x^2}}$$
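◼ In R, this interval is produced by predict with interval = "prediction" (the column name sp500 is hypothetical, as before):
  predict(fit, newdata = data.frame(sp500 = 0.01),
          interval = "prediction", level = 0.95)    # fitted value plus lower and upper limits for y_new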
PI illustration
THE MULTIPLE REGRESSION MODEL
The Multiple Regression Model
◼ The response $Y$ is linearly related to $k$ explanatory variables $X_1, X_2, \ldots, X_k$ by the equation
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon, \qquad \varepsilon \sim N(0, \sigma_\varepsilon^2)$$
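◼ A sketch of fitting such a model in R for the Women's Apparel Stores example (file and column names are hypothetical):
  apparel <- read.csv("womens_apparel.csv")
  fit_mr <- lm(Sales ~ Income + Competitors, data = apparel)   # two explanatory variables (k = 2)
  summary(fit_mr)    # partial slopes, R-squared, RMSE, overall F-test, t-tests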
Interpreting Multiple Regression
◼ Example: Women’s Apparel Stores
Interpreting Multiple Regression
◼ Example: Women’s Apparel Stores
Interpreting Multiple Regression
◼ Scatterplot Matrix: Women’s Apparel Stores
Interpreting Multiple Regression
◼ Example: Women’s Apparel Stores
Interpreting Multiple Regression
◼ Correlation Matrix: Women’s Apparel Stores
Interpreting Multiple Regression
◼ R-squared and se
▪ The equation of the fitted model for estimating
sales in the women’s apparel stores example is
Interpreting Multiple Regression
◼ R-squared
Interpreting Multiple Regression
◼ Adjusted R-squared
$$\bar{R}^2 = 1 - (1 - R^2) \times \frac{n-1}{n-k-1}$$
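◼ In R, both versions are stored in the model summary (fit_mr is the multiple regression fit from the earlier sketch):
  s <- summary(fit_mr)
  s$r.squared        # R-squared
  s$adj.r.squared    # adjusted R-squared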
Interpreting Multiple Regression
◼ Calibration Plot (for 𝑅 2 )
Interpreting Multiple Regression
◼ Calibration Plot: Women’s Apparel Stores
Interpreting Multiple Regression
◼ RMSE $s_e$:
$$s_e = \sqrt{\frac{e_1^2 + e_2^2 + \cdots + e_n^2}{n-k-1}}$$
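◼ In R this quantity is reported as the residual standard error and can also be computed by hand (fit_mr from the earlier sketch):
  sigma(fit_mr)                                       # RMSE, the residual standard error
  sqrt(sum(resid(fit_mr)^2) / df.residual(fit_mr))    # same value: sqrt of SSE / (n - k - 1)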
Interpreting Multiple Regression
◼ Partial Slopes: Women’s Apparel Stores
Interpreting Multiple Regression
◼ Partial Slopes: Women’s Apparel Stores
Interpreting Multiple Regression
◼ Marginal and Partial Slopes
▪ The MRM separates these effects but the SRM does not.
Interpreting Multiple Regression
◼ Path Diagram
Interpreting Multiple Regression
◼ Path Diagram: Women’s Apparel Stores
(path diagram figure; labeled path coefficients: 7.966 and -24.165)
Checking Conditions
◼ Residual Plots
◼ Linearity: a plot of residuals versus each explanatory variable is used to verify that the relationships are linear
◼ Equal variance: a plot of residuals versus the fitted values (also reveals outliers)
◼ Independence: if the data form a time series, use a timeplot of the residuals together with the Durbin-Watson statistic; otherwise, rely on understanding of the context (see the sketch below)
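◼ A minimal sketch of these checks in R, assuming the add-on package lmtest is available (fit_mr from the earlier sketch):
  plot(resid(fit_mr), type = "l")    # timeplot of residuals for time-series data
  library(lmtest)
  dwtest(fit_mr)                     # Durbin-Watson test for first-order autocorrelation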
Checking Conditions
◼ Residual Plot: Women’s Apparel Stores
Checking Conditions
◼ Residual Plot: Women’s Apparel Stores
Checking Conditions
◼ Check Normality: Women’s Apparel Stores
Inference in Multiple Regression
◼ Inference for the Model: F-test
◼ The F-Statistic
$$F = \frac{R^2}{1-R^2} \times \frac{n-k-1}{k}$$
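◼ In R, the overall F-statistic is part of the model summary, and the formula above can be verified from R-squared (fit_mr from the earlier sketch):
  s <- summary(fit_mr)
  s$fstatistic    # F value, numerator df (k), and denominator df (n - k - 1)
  s$r.squared / (1 - s$r.squared) *
    s$fstatistic["dendf"] / s$fstatistic["numdf"]    # reproduces the F value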
Inference in Multiple Regression
◼ F-test Results in Analysis of Variance
Table
Inference in Multiple Regression
◼ t-test Results for Women’s Apparel Stores
Inference in Multiple Regression
◼ Confidence Interval Results
Inference in Multiple Regression
◼ Prediction Intervals
Steps in Fitting a Multiple Regression
4. Obtain the residuals and fitted values from the
regression.
Steps in Fitting a Multiple Regression
Pitfalls/take-home message
▪ Don’t confuse a multiple regression with
several simple regressions.