
The slides are derived from the following publisher instructor material. This work is protected by United States copyright laws and is provided solely for the use of instructors in teaching their courses and assessing student learning. Dissemination or sale of any part of this work will destroy the integrity of the work and is not permitted. All recipients of this work are expected to abide by these restrictions.

Data Mining and Predictive Analytics, Second Edition, by Daniel Larose and Chantal Larose, John Wiley and Sons, Inc., 2015.
Linear Regression
Outline:

This chapter shows:

– Differentiating between explanatory and predictive modeling with regression
– Fitting a predictive model
– Assessing predictive accuracy
– Selecting a subset of predictors

[Figure: the CRISP-DM standard process — Business/Research Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment phases]
3
Linear Regression
• Linear regression examines relationship between set of predictors and a
response variable
• Predictors typically continuous
• Categorical predictors can be included, through use of dummy variables

ŷ = β0 + β1x1 + β2x2 + ⋯ + βpxp + ε

where ŷ is the outcome, β0 is the constant (intercept), β1, …, βp are the coefficients, x1, …, xp are the predictors, and ε is the error (noise) term

4
Explanatory Vs. Predictive?

• You should know the goal of your regression analysis before beginning
the modeling process.

Goal:
– Explanatory and descriptive: a good explanatory model fits the data closely. Focus is on the coefficients β. Use the entire dataset.
– Predictive: a good predictive model predicts new records accurately. Focus is on the predictions ŷ. Split the dataset into training and testing sets.

5
An Example of Linear Regression
• Cereals data set contains nutritional information for 77 cereals
• Dataset includes both categorical and numeric predictors. Categorical variables
should be converted to dummy variables
Variable: Type / description
Name: Name of the cereal
Manuf: Categorical
Type: Categorical
Calories: Numeric
Protein: Numeric
Fat: Numeric
Sodium: Numeric
Fiber: Numeric
Carbo: Numeric
Sugars: Numeric
Potass: Numeric
Vitamins: Numeric
Shelf: Categorical
Weight: Numeric
Cups: Numeric
Cold: Categorical
Nabisco: Categorical
Quaker: Categorical
Kelloggs: Categorical
GeneralMills: Categorical
Ralston: Categorical
AHFP: Categorical
Rating: Numeric

6
Explanatory Vs. Predictive?

• Research question examples

Goal:
– Explanatory and descriptive (use the entire dataset). Research questions: Is there an association between rating and sugar, after adjusting for confounders? What variables contribute to a high cereal rating?
– Predictive (split the dataset into training and testing). Research question: How can we predict cereal rating using other variables?

7
An Example of Linear Regression (cont’d)

# Read in the cereals dataset
cereal = read.csv("cereals.csv", stringsAsFactors = TRUE)

# Preprocessing
which(is.na(cereal$Sugars))   # Record 58 is missing
which(is.na(cereal$Potass))   # Records 5 and 21 are missing

cereal = cereal[c(-5, -21, -58), ]   # Delete the records with missing values
cereal = cereal[, c(-1)]             # Drop the first column (cereal name)

# Partition into training and testing datasets (74 complete records remain)
set.seed(123)
cereal.randomized = cereal[sample(nrow(cereal), replace = FALSE), ]
cereal.train = cereal.randomized[1:round(.7*74), ]
cereal.test  = cereal.randomized[(round(.7*74)+1):74, ]

8
An Example of Linear Regression (cont’d)
• Suppose we estimate the rating of a cereal, given its sugar and fiber contents

• Use Sugars and Fiber (predictors) to estimate Rating (response)

# Run regression analysis


reg_model=lm(Rating~Sugars+Fiber, data=cereal.train)

# Display summaries
summary(reg_model)

9
An Example of Linear Regression (cont’d)

• Estimated regression equation:

ŷ = 52.4201 − 2.2704(sugars) + 3.1113(fiber)

• Interpreted as: “Estimated nutritional rating equals 52.4201 minus 2.2704


times grams of sugar plus 3.1113 times grams of fiber”

• b1 = –2.2704, indicates negative relationship between sugars and rating

– “estimated decrease in nutritional rating per unit increase in sugar


content is 2.2704, when fiber content held constant”

• b2 = 3.1113, indicates a positive relationship between fiber and rating
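To see the equation in action, here is a small sketch that applies the fitted model to a hypothetical cereal; the sugar and fiber values are made up for illustration and it assumes reg_model from the earlier slide.

# Sketch: estimate the rating of a hypothetical cereal with 5 g of sugar
# and 3 g of fiber (illustrative values, not from the dataset)
new_cereal = data.frame(Sugars = 5, Fiber = 3)
predict(reg_model, newdata = new_cereal)
# equivalent hand calculation from the estimated equation:
52.4201 - 2.2704*5 + 3.1113*3   # = 50.40, matching predict() up to coefficient rounding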

10
Coefficient of Determination (𝑅2 )
• Recall Coefficient of Determination:
summary(reg_model)$r.squared

• R² represents the proportion of variability in the response accounted for by the linear relationship with the predictor set

• Question: Would we expect a higher R² value when using two predictors, rather than one?
  – Yes, R² always increases when an additional predictor is included
  – Where the new predictor is useful, R² increases substantially
  – Otherwise, R² may increase by only a small or negligible amount
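A minimal sketch illustrating the point above, assuming the cereal.train data frame from the earlier slides:

# R-squared never decreases when a predictor is added
summary(lm(Rating ~ Sugars, data = cereal.train))$r.squared
summary(lm(Rating ~ Sugars + Fiber, data = cereal.train))$r.squared   # at least as large as the value above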

11
Adjusting R2: Penalizing Models for Including Non-Useful Predictors

• Recall R2 always increases when additional predictor added to


model, whether it’s useful or not
• Largest R2 may occur for models with most predictors, rather than
best predictors
• Adjusted R2 measure “adjusts” R2, by penalizing models that include
non-useful predictors
• Formula for adjusted R²:

  R²_adj = 1 − (1 − R²) × (n − 1) / (n − m − 1)

  – n: number of records
  – m: number of predictors

summary(reg_model)$adj.r.squared
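As a check on the formula, a small sketch that recomputes adjusted R² by hand, assuming reg_model and cereal.train from the earlier slides (m = 2 predictors):

# Recompute adjusted R-squared from the formula above
r2 = summary(reg_model)$r.squared
n  = nrow(cereal.train)   # number of records
m  = 2                    # number of predictors (Sugars and Fiber)
1 - (1 - r2) * (n - 1) / (n - m - 1)
summary(reg_model)$adj.r.squared   # should agree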

12
Inference in Multiple Regression
• Three inferential methods:
1. T-test for relationship between response Y and specific predictor xi,
in presence of other predictors X1, X2, …, Xi-1, Xi+1, …, Xm
2. F-test for significance of entire regression equation
3. Confidence interval for 𝛽𝑖 , slope of 𝑖𝑡ℎ predictor

13
T-test for the Relationship Between Rating and Fiber

• The null hypothesis is rejected, giving evidence of a linear relationship between rating and fiber, in the presence of sugars

– H0: β2 = 0 (there is no association between Fiber and Rating)
– H1: β2 ≠ 0 (there is an association between Fiber and Rating)
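The t-statistic and p-value for each coefficient are reported in the lm summary; a short sketch, assuming reg_model from the earlier slides:

# Each row gives the estimate, standard error, t value, and p-value
# for the t-test of that coefficient against zero
summary(reg_model)$coefficients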

14
F-test for Significance of Overall Regression Model
• F-test evaluates linear relationship between response and set of
predictors for entire model
• Hypotheses for F-Test:
– 𝐻0 : β1 = β2 = … = βm = 0
– 𝐻1 : At least one βi ≠ 0
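The overall F-test appears at the bottom of summary(reg_model); it can also be extracted directly. A sketch, assuming reg_model from the earlier slides:

# Extract the F-statistic and its degrees of freedom, then compute the p-value
fstat = summary(reg_model)$fstatistic
fstat
pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)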

15
Confidence Interval for Slope of Regression Line

# CI for coefficients
confint(reg_model, level = 0.95)

• We are 95% confident that the true coefficient of Sugars lies between −2.69 and −1.85
• Note that the point β1 = 0 is not contained in the interval (−2.69, −1.85)
• Therefore, we are 95% confident that the linear relationship between nutritional rating and sugar content is significant

16
Regression with Categorical Predictors Using Dummy Variables

• Thus far, examples used continuous predictor variables


• However, categorical variables can be included in model through
use of dummy variables
• Assume Shelf is included as a predictor, along with the continuous predictors Sugars and Fiber
• For regression, a categorical variable with k categories is transformed into k − 1 dummy variables (R's lm function does this automatically; see the sketch below)
• A dummy variable is binary: it equals 1 when the observation belongs to the category, and 0 otherwise
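A minimal sketch of what that transformation looks like, assuming cereal.train from the earlier slides and that Shelf has levels 1, 2, and 3:

# Inspect the k - 1 indicator columns R builds for a factor predictor
cereal.train$Shelf = as.factor(cereal.train$Shelf)
head(model.matrix(~ Shelf, data = cereal.train))   # intercept plus Shelf2 and Shelf3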

17
Regression with Categorical Predictors Using Dummy Variables (cont’d)

• Shelf variable transformed to two indicator variables:


Shelf2 = 1 if the cereal is located on shelf 2, and 0 otherwise
Shelf3 = 1 if the cereal is located on shelf 3, and 0 otherwise

• Note: location of a cereal on shelf 1 is implied when Shelf2 = 0 and Shelf3 = 0

# Run multiple regression analysis with Shelf as a categorical variable
cereal.train$Shelf = as.factor(cereal.train$Shelf)
reg_model2 = lm(Rating ~ Sugars + Fiber + Shelf, data = cereal.train)

summary(reg_model2)

18
Regression with Categorical Predictors Using Dummy Variables (cont’d)

• Including the indicator variables in the model:

ŷ = b0 + b1(sugars) + b2(fiber) + b3(Shelf2) + b4(Shelf3)

What about Shelf1?

• How do we interpret the coefficients for Shelf2 and Shelf3?

• Look at the confidence intervals for Shelf2 and Shelf3. What do you observe?

19
Model Evaluation Techniques For the Estimation and Prediction Tasks

• It is useful to know how well a model is performing on the training data. However, it is more useful to know how well the model is performing on the testing (or holdout) data. Some textbooks call this the validation dataset.

• Performance measures will always be optimistic on the training data, because the model was built from that same data

• We get a better sense of how well the model is doing if we look at how well it performs on data it has never "seen", that is, data not used to create the model

20
Model Evaluation Techniques For the Estimation and Prediction Tasks (cont’d)

# Make predictions using the regression model


predictions = predict(reg_model, newdata=cereal.test)

#compute common accuracy measures for the model


library(forecast)
accuracy(predictions, cereal.test$Rating)

# Calculate the baseline predictions (the average rating from the training dataset)


mean_rating_train = mean(cereal.train$Rating)
baseline_predictions = rep(mean_rating_train,length(predictions))

#compute common accuracy measures for the baseline


accuracy(baseline_predictions, cereal.test$Rating)

21
Model Evaluation Techniques For the Estimation and Prediction Tasks (cont’d)

Note: in the output, the label "Test set" refers to the data that was analyzed, not necessarily the "testing dataset"

22
Variable Selection Methods

23
Variable Selection Methods
Goal: Find parsimonious model (the simplest model
that performs sufficiently well)

– More robust
– Higher predictive accuracy

• We will assess predictive accuracy on testing dataset

• Assist analyst in determining which variables to


include in model

24
Variable Selection Methods
• Four variable selection methods:
1. Forward selection. Initially, there are no predictor variables in the
regression equation. Predictor variables are entered one at a time,
only if they meet certain criteria specified in terms of F ratio. The
order in which the variables are included is based on the
contribution to the explained variance.
2. Backward elimination. Initially, all the predictor variables are
included in the regression equation. Predictors are then removed
one at a time based on the F ratio for removal.
3. Stepwise selection. A variable that has been entered into the model early in the forward selection process may turn out to be nonsignificant once other variables have been entered into the model.
4. Best subsets (exhaustive search). Provides the best selection by examining all possible combinations of predictors. Warning: this may become intractable and take a very long time.

25
Forward Selection Procedure (cont’d)
• Procedure begins with no variables in model
• Step 1:
– Predictor x1 most highly correlated with response selected
– If model not significant, stop and report no predictors important
– Otherwise, proceed to Step 2
• Step 2:
– For remaining predictors, compute sequential F-statistic given predictors
already in model
– For example, first pass sequential F-Statistics computed for F(x2|x1), F(x3|x1),
F(x4|x1)
– Select variable with largest sequential F-statistic
• Step 3:
– Test the significance of the sequential F-statistic for the variable selected in Step 2
– If the resulting model is not significant, stop and report the model without the variable selected in Step 2
– Otherwise, add the variable from Step 2 to the model and return to Step 2

26
Forward Selection Procedure (cont’d)
# Define base intercept only model
model.null = lm(Rating ~ 1 , data=cereal.train)

# Full model with all predictors


model.full = lm(Rating ~ . , data= cereal.train)

# Perform forward algorithm


reduced.forward.model = step(model.null, scope =
list(lower = model.null, upper = model.full), direction =
"forward", trace = FALSE)

summary(reduced.forward.model)

27
Backwards Elimination Procedure (cont’d)
• Procedure begins with all variables in model
• Step 1:
– Perform regression on full model with all variables
– For example, assume model has x1,…,x4
• Step 2:
– For each variable in the model, compute the partial F-statistic
– First pass includes F(x1|x2, x3 , x4), F(x2|x1, x3 , x4), F(x3|x1, x2 , x4), F(x4|x1, x2 , x3)
– Select variable with smallest partial F-statistic, denoted Fmin
• Step 3:
– If Fmin not significant, remove associated variable from model and return to
Step 2
– Otherwise, if Fmin significant, stop algorithm and report current model
– If first pass, then current model is full model
– If not first pass, then full set of predictors reduced by one or more variables

28
Backwards Elimination Procedure (cont’d)

# Perform backward algorithm


reduced.backward.model = step(model.full, scope = list(lower =
model.null, upper = model.full), direction = "backward", trace
= FALSE)

summary(reduced.backward.model)

29
Stepwise Procedure
• Stepwise Procedure represents modification to Forward Selection
Procedure
• A variable entered into model during forward selection process
may turn out to be non-significant, as additional variables enter
model
• Stepwise Procedure tests this possibility
• If variable in model no longer significant, variable with smallest
partial F-statistic removed from model
• Procedure terminates when no additional variables can enter or
be removed

30
Stepwise Procedure (cont’d)

#Perform Step-wise algorithm


reduced.step.model = step(model.full, scope = list(lower =
model.null, upper = model.full), direction = "both", trace =
FALSE)

summary(reduced.step.model)

31
Best Subsets Procedure
library(fastDummies)
cereal.train.dummy = dummy_cols(cereal.train, select_columns = c("Manuf", "Type", "Shelf"))

# Remove the original variables "Manuf", "Type", and "Shelf",
# and drop one dummy variable from each of them
cereal.train.dummy = cereal.train.dummy[, c(-1, -2, -12, -23, -30, -32)]

# Perform exhaustive search
library(leaps)
search = regsubsets(Rating ~ ., data = cereal.train.dummy, nbest = 1,
                    nvmax = dim(cereal.train.dummy)[2], method = "exhaustive")

search.summary = summary(search)

# Show models
search.summary$which

# Show metrics
search.summary$rsq
search.summary$adjr2

32
Best Subsets Procedure

• Adjusted R² rises until you hit 9 predictors, then stabilizes, so choose the model with 9 predictors, according to the adjusted R² criterion

• The Best Subsets Procedure is an attractive selection method when the predictor set is not large (fewer than ~30 predictors)
• The procedure is intractably slow for larger predictor sets
• p predictors lead to 2^p − 1 possible models
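As a short sketch, the adjusted R² criterion can also be applied programmatically, assuming the search.summary object from the previous slide:

# Pick the subset size with the largest adjusted R-squared
best.size = which.max(search.summary$adjr2)
best.size
search.summary$which[best.size, ]   # predictors included in that model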

33
Evaluate the selected model in the testing dataset

• Let's take the stepwise model as the best one and apply it to the testing dataset

# Evaluating the model with the testing dataset
cereal.test$Shelf = as.factor(cereal.test$Shelf)

pred.reduced.step = predict(reduced.step.model, newdata = cereal.test)

accuracy(pred.reduced.step, cereal.test$Rating)

34
