Linear Regression
Data Mining and Predictive Analytics, Second Edition, by Daniel Larose and Chantal Larose, John Wiley and Sons, Inc., 2015.
Outline:
– Modeling phase
– Selecting a subset of predictors
– Evaluation phase: accuracy
Linear Regression
• Linear regression examines relationship between set of predictors and a
response variable
• Predictors typically continuous
• Categorical predictors can be included, through use of dummy variables
The outcome is modeled as a linear combination of the predictors, weighted by the coefficients, plus an error term:
y = β0 + β1x1 + β2x2 + ⋯ + βpxp + ε
Explanatory vs. Predictive?
• You should know the goal of your regression analysis before beginning the modeling process.
• Predictive goal: build a good predictive model, one that predicts new records accurately.
– Focus is on the predictions ŷ
– Split the dataset into training and testing sets
An Example of Linear Regression
• Cereals data set contains nutritional information for 77 cereals
• Dataset includes both categorical and numeric predictors. Categorical variables
should be converted to dummy variables
Variable      Type / Description
Name          Name of the cereal
Manuf         Categorical
Type          Categorical
Calories      Numeric
Protein       Numeric
Fat           Numeric
Sodium        Numeric
Fiber         Numeric
Carbo         Numeric
Sugars        Numeric
Potass        Numeric
Vitamins      Numeric
Shelf         Categorical
Weight        Numeric
Cups          Numeric
Cold          Categorical
Nabisco       Categorical
Quaker        Categorical
Kelloggs      Categorical
GeneralMills  Categorical
Ralston       Categorical
AHFP          Categorical
Rating        Numeric
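Before modeling, the data are loaded and the categorical predictors declared as factors. A minimal sketch, assuming the data sit in a file named cereal.csv (the file name is an assumption):
# Load the cereals data (hypothetical file name)
cereal = read.csv("cereal.csv")
# Declare the categorical predictors as factors
cereal$Manuf = as.factor(cereal$Manuf)
cereal$Type = as.factor(cereal$Type)
cereal$Shelf = as.factor(cereal$Shelf)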
Explanatory vs. Predictive?
• Goal: Predictive
• Research Question: How can we predict cereal rating using the other variables?
An Example of Linear Regression (cont’d)
# preprocessing
which(is.na(cereal$Sugars)) # Record 58 is missing
which(is.na(cereal$Potass)) # Record 5 and 21 are missing
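One way to proceed is to drop the records with missing values and then create the training/testing split used in the later slides. A sketch; the 75/25 split fraction and the seed are assumptions, not taken from the slides:
# Drop records 5, 21, and 58, which have missing values
cereal = cereal[complete.cases(cereal[, c("Sugars", "Potass")]), ]
# Split into training and testing sets (assumed 75/25 split and seed)
set.seed(1)
train.idx = sample(nrow(cereal), size = round(0.75 * nrow(cereal)))
cereal.train = cereal[train.idx, ]
cereal.test = cereal[-train.idx, ]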
An Example of Linear Regression (cont’d)
• Suppose we estimate the rating of a cereal, given its sugar and fiber contents
# Fit the model on the training data, then display summaries
reg_model = lm(Rating ~ Sugars + Fiber, data = cereal.train)
summary(reg_model)
Coefficient of Determination (𝑅2 )
• Recall the Coefficient of Determination: R² = SSR/SST = 1 − SSE/SST, the proportion of the variability in the response that is explained by the regression
summary(reg_model)$r.squared
Adjusting R2: Penalizing Models for Including Non-Useful Predictors
• R²adj = 1 − (1 − R²) · (n − 1) / (n − m − 1)
– n: number of records
– m: number of predictors
summary(reg_model)$adj.r.squared
Inference in Multiple Regression
• Three inferential methods:
1. t-test for the relationship between the response y and a specific predictor xi, in the presence of the other predictors x1, x2, …, xi−1, xi+1, …, xm
2. F-test for significance of entire regression equation
3. Confidence interval for 𝛽𝑖 , slope of 𝑖𝑡ℎ predictor
T-test for Relationship Between rating and sugars
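The t-test for each predictor appears in the coefficients table of the model summary; a minimal sketch, using the reg_model fit above:
# The Sugars row gives the t-statistic and p-value for the relationship
# between Rating and Sugars, in the presence of Fiber
summary(reg_model)$coefficients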
F-test for Significance of Overall Regression Model
• The F-test evaluates the linear relationship between the response and the set of predictors taken as a whole
• Hypotheses for F-Test:
– 𝐻0 : β1 = β2 = … = βm = 0
– 𝐻1 : At least one βi ≠ 0
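In R, the overall F-statistic is printed in the last line of summary(reg_model); it can also be extracted directly, as in this sketch:
# F value with its numerator and denominator degrees of freedom
f = summary(reg_model)$fstatistic
# p-value for the overall F-test
pf(f[1], f[2], f[3], lower.tail = FALSE)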
Confidence Interval for Slope of Regression Line
# CI for coefficients
confint(reg_model, level = 0.95)
• We are 95% confident that the true coefficient of Sugars lies between –2.69 and –1.85
• Note that the point β1 = 0 is not contained in the interval (–2.69, –1.85)
• Therefore, we are 95% confident that there is a significant linear relationship between nutritional rating and sugar content
Regression with Categorical Predictors Using Dummy Variables
• A categorical predictor with k levels enters the model as k − 1 dummy (indicator) variables
Regression with Categorical Predictors Using Dummy Variables (cont’d)
# One specification consistent with these slides (an assumption): add Shelf to the sugars + fiber model
reg_model2 = lm(Rating ~ Sugars + Fiber + Shelf, data = cereal.train)
summary(reg_model2)
Regression with Categorical Predictors Using Dummy Variables (cont’d)
• What about Shelf1? Shelf1 serves as the reference category: its effect is absorbed into the intercept, and each remaining shelf coefficient measures the difference from shelf 1
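A sketch showing how to inspect or change the reference level in R (the choice of shelf 2 below is purely illustrative):
# The first factor level is the reference, absorbed into the intercept
levels(cereal.train$Shelf)
# Hypothetical: make shelf 2 the reference level instead
cereal.train$Shelf = relevel(cereal.train$Shelf, ref = "2")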
Model Evaluation Techniques for the Estimation and Prediction Tasks
• Note: in the accuracy() output, the row labeled “Test set” refers to the data passed to the function (here, the data that lm analyzed), not necessarily a held-out testing dataset
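A sketch of this evaluation, assuming the forecast package (which provides accuracy()) and the reg_model fit earlier:
library(forecast)
# Compare fitted values with the ratings that lm analyzed;
# accuracy() labels this row "Test set" even though it is the training data
accuracy(fitted(reg_model), reg_model$model$Rating)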
Variable Selection Methods
Goal: Find a parsimonious model (the simplest model that performs sufficiently well)
– More robust
– Higher predictive accuracy
Variable Selection Methods
• Four variable selection methods:
1. Forward selection. Initially, there are no predictor variables in the
regression equation. Predictor variables are entered one at a time,
only if they meet certain criteria specified in terms of F ratio. The
order in which the variables are included is based on the
contribution to the explained variance.
2. Backward elimination. Initially, all the predictor variables are
included in the regression equation. Predictors are then removed
one at a time based on the F ratio for removal.
3. Stepwise selection. A variable that has been entered into the
model early in the forward selection process may turn out to be
nonsignificant once other variables have been entered; stepwise
selection rechecks the variables in the model and removes those
that are no longer significant.
4. Best subsets (exhaustive search): finds the best selection by
examining all possible combinations of predictors. Warning: this
may become intractable and take a very long time.
Forward Selection Procedure
• Procedure begins with no variables in model
• Step 1:
– The predictor x1 most highly correlated with the response is selected
– If model not significant, stop and report no predictors important
– Otherwise, proceed to Step 2
• Step 2:
– For remaining predictors, compute sequential F-statistic given predictors
already in model
– For example, first pass sequential F-Statistics computed for F(x2|x1), F(x3|x1),
F(x4|x1)
– Select variable with largest sequential F-statistic
• Step 3:
– Test the significance of the sequential F-statistic for the variable selected in Step 2
– If the resulting model is not significant, stop, and report the model without the variable selected in Step 2
– Otherwise, add the variable from Step 2, and return to Step 2
Forward Selection Procedure (cont’d)
# Define base intercept-only model
model.null = lm(Rating ~ 1, data = cereal.train)
# Full model (assumes identifier columns such as Name have been dropped); step() selects by AIC
model.full = lm(Rating ~ ., data = cereal.train)
reduced.forward.model = step(model.null, scope = list(lower = model.null, upper = model.full), direction = "forward")
summary(reduced.forward.model)
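Note that step() selects by AIC rather than the F ratio described above; the sequential F-tests themselves can be examined with add1(), as in this sketch:
# Sequential F-test for each candidate predictor, given the current (null) model
add1(model.null, scope = formula(model.full), test = "F")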
Backwards Elimination Procedure
• Procedure begins with all variables in model
• Step 1:
– Perform regression on full model with all variables
– For example, assume model has x1,…,x4
• Step 2:
– For each variable in the model, compute the partial F-statistic
– First pass includes F(x1|x2, x3 , x4), F(x2|x1, x3 , x4), F(x3|x1, x2 , x4), F(x4|x1, x2 , x3)
– Select variable with smallest partial F-statistic, denoted Fmin
• Step 3:
– If Fmin not significant, remove associated variable from model and return to
Step 2
– Otherwise, if Fmin significant, stop algorithm and report current model
– If first pass, then current model is full model
– If not first pass, then full set of predictors reduced by one or more variables
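The partial F-statistics described above can be computed with drop1(); a sketch, assuming the model.full fit from the forward selection slide:
# Partial F-test for removing each predictor from the full model
drop1(model.full, test = "F")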
Backwards Elimination Procedure (cont’d)
reduced.backward.model = step(model.full, direction = "backward")
summary(reduced.backward.model)
Stepwise Procedure
• Stepwise Procedure represents modification to Forward Selection
Procedure
• A variable entered into model during forward selection process
may turn out to be non-significant, as additional variables enter
model
• Stepwise Procedure tests this possibility
• If variable in model no longer significant, variable with smallest
partial F-statistic removed from model
• Procedure terminates when no additional variables can enter or
be removed
Stepwise Procedure (cont’d)
reduced.step.model = step(model.null, scope = list(lower = model.null, upper = model.full), direction = "both")
summary(reduced.step.model)
Best Subsets Procedure
library(fastDummies)
library(leaps)   # regsubsets() performs the exhaustive search
cereal.train.dummy = dummy_cols(cereal.train, select_columns = c("Manuf", "Type", "Shelf"))
# Remove the original variables Manuf, Type, and Shelf, and drop one dummy variable from each of them
cereal.train.dummy = cereal.train.dummy[, c(-1, -2, -12, -23, -30, -32)]
# Run best subsets (nvmax, the largest subset size to consider, is an assumption here)
search = regsubsets(Rating ~ ., data = cereal.train.dummy, nvmax = ncol(cereal.train.dummy) - 1)
search.summary = summary(search)
# Show models
search.summary$which
# Show metrics
search.summary$rsq
search.summary$adjr2
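To settle on a single model, one common choice is the subset size with the highest adjusted R²:
# Size of the best subset by adjusted R-squared, and its predictors
best.size = which.max(search.summary$adjr2)
search.summary$which[best.size, ]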
Evaluate the selected model in the testing dataset
• Let’s assume the stepwise model is the best one, and now apply it to the testing dataset
cereal.test$Shelf = as.factor(cereal.test$Shelf)
# Generate predictions on the testing dataset, then compute accuracy measures
pred.reduced.step = predict(reduced.step.model, newdata = cereal.test)
accuracy(pred.reduced.step, cereal.test$Rating)
The slides are derived from the following publisher instructor
material. This work is protected by United States copyright laws
and is provided solely for the use of instructors in teaching
their courses and assessing student learning. Dissemination or
sale of any part of this work will destroy the integrity of the
work and is not permitted. All recipients of this work are
expected to abide by these restrictions.
Data Mining and Predictive Analytics, Second Edition, by Daniel Larose and Chantal Larose, John Wiley and Sons, Inc., 2015.