3.3 Logistic Regression - v3
Logistic regression
Venkat Reddy
Contents
• The need for logistic regression
• The logistic regression model
• Meaning of beta
• Goodness of fit
• Prediction
Data Analytics Course
Main steps in statistical data analysis
Ford case study
• Driving while not alert can be deadly.
• The objective is to design a detector/classifier that will detect whether the
driver is alert or not alert, employing any combination of vehicular,
environmental and driver physiological data that are acquired while driving.
• The third column has a value X for each row, where
  • X = 1 if the driver is alert
  • X = 0 if the driver is not alert
• The next 8 columns, with headers P1, P2, ..., P8, represent physiological
data;
• The next 11 columns, with headers E1, E2, ..., E11, represent
environmental data;
• The next 11 columns, with headers V1, V2, ..., V11, represent
vehicular data.
Recap – Multiple Regression
• Import the ford training data
• Print contents
• Identify independent & dependent variables
• Is there any interdependency between variables?
• Run the basic multicollinearity test and delete the redundant
variables
• What is the final list of independent variables?
Why do we need logistic regression?
• Consider the Age vs. ice cream sales data. The dataset has two
columns.
• Age – a continuous variable between 6 and 80
• Buy – a binary variable (0 = Yes; 1 = No)
Demo – Need for logistic regression
• Download the ice cream sales data and fit a linear regression
line.
• What is the R-squared?
• If age increases, what happens to ice cream sales?
• Does a 25-year-old person buy an ice cream?
• Can we fit a linear regression line to this data?
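A rough Python sketch of why a straight line struggles with a 0/1 outcome. The ages and buy flags below are invented for illustration (they are not the course's ice cream data), and `ols_fit` is a hypothetical helper written here, not a library function.

```python
# Hypothetical data, invented for illustration: age and a 0/1 buy flag
# (0 = Yes, 1 = No, following the coding used in these notes).
ages = [8, 12, 15, 20, 24, 30, 35, 42, 50, 58, 65, 72]
buy = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1]


def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


intercept, slope = ols_fit(ages, buy)
pred_25 = intercept + slope * 25  # fitted value for a 25-year-old
pred_80 = intercept + slope * 80  # fitted value for an 80-year-old

print(f"line: buy = {intercept:.3f} + {slope:.4f} * age")
print(f"fitted value at age 25: {pred_25:.3f}")
print(f"fitted value at age 80: {pred_80:.3f}")
```

With this made-up data the fitted value at age 80 is above 1, so it cannot be read as a probability. That is the motivation for a curve that stays between 0 and 1.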
Real-life examples
• Gaming - Win vs. Loss
• Sales - Buying vs. Not buying
• Marketing – Response vs. No Response
• Credit card & Loans – Default vs. Non Default
• Operations – Attrition vs. Retention
• Websites – Click vs. No click
• Fraud identification – Fraud vs. Non Fraud
• Healthcare – Cure vs. No Cure
Why not linear?
[Figure: Bought ice cream (Yes/No) plotted against Age, 0 to 80]
Some nonlinear functions
[Figure: panels showing Gaussian, quadratic polynomial, sine, exponential,
logistic and double exponential curves]
Better fit?
[Figure: fitted curve for Bought ice cream (Yes = 0, No = 1) against Age, 0 to 80]
The Logistic function
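• We want a model that predicts probabilities between 0 and 1, that is,
an S-shaped curve.
• There are lots of S-shaped curves. We use the logistic model:
• $P = \frac{\exp(\beta_0 + \beta_1 X)}{1 + \exp(\beta_0 + \beta_1 X)}$, or equivalently $\log_e\left[\frac{P}{1-P}\right] = \beta_0 + \beta_1 X$
• The function on the left of the second form, $\log_e[P/(1-P)]$, is called the
logit (log odds) function; its inverse, $\exp(z)/[1 + \exp(z)]$, is the logistic function.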
$$P(y \mid x) = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}$$
[Figure: S-shaped curve of P(y|x), ranging from 0.0 to 1.0, plotted against x]
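The two forms of the model are inverses of each other. A minimal Python sketch, where `logistic` and `logit` are illustrative helper names rather than anything from the course material:

```python
import math


def logistic(z):
    """Map any real number z to a probability strictly between 0 and 1."""
    # Algebraically the same as exp(z) / (1 + exp(z)).
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Map a probability p (0 < p < 1) to its log odds."""
    return math.log(p / (1.0 - p))


for z in (-4.0, -1.0, 0.0, 1.0, 4.0):
    p = logistic(z)
    # logit(p) recovers z, up to floating-point rounding.
    print(f"z = {z:+.1f}  P = {p:.4f}  logit(P) = {logit(p):+.4f}")
```

Whatever the value of the linear part, the output stays inside (0, 1), which is what makes the curve usable as a probability.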
Logistic regression function
• Logistic regression models the logit of the outcome instead of the
outcome itself; i.e., instead of modeling winning or losing, we build a model
for the log odds of winning or losing
• The logit is the natural logarithm of the odds of the outcome
• ln(probability of the outcome (p) / probability of not having the outcome
(1-p))
$$\ln\left(\frac{P}{1-P}\right) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i$$
$$P(y \mid x) = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}$$
Curve fitting using MLE
• Remember OLS for linear models?
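• OLS picks the line that minimizes the squared errors, which suits a
continuous outcome with normally distributed errors. A 0/1 outcome does
not have normally distributed errors, so we use maximum likelihood instead
• Imagine a logistic curve through the data. At each point the curve gives
a probability for the outcome that was actually observed. Using calculus,
we find the values of the betas that make the observed data as a whole
most likely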
• Maximum Likelihood Estimator:
• Starts with arbitrary values of the regression coefficients and
constructs an initial model for predicting the observed data.
• Then evaluates the errors in that prediction and changes the regression
coefficients so as to make the likelihood of the observed data greater
under the new model.
• Repeats until the model converges, meaning the differences between
the newest model and the previous model are trivial.
• The idea is that you “find and report as statistics” the parameters
that are most likely to have produced your data.
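The loop described above can be sketched in a few lines of Python. This is an illustration only: the data are invented, and plain gradient ascent is used because it is short. It is not a description of the optimizer SAS uses internally.

```python
import math

# Hypothetical data, invented for illustration (not the course data).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [0, 0, 1, 0, 1, 1, 0, 1]


def prob(b0, b1, xi):
    """Model probability that y = 1 at xi."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))


def log_likelihood(b0, b1):
    """Log of the probability the model assigns to the observed 0/1 data."""
    return sum(
        yi * math.log(prob(b0, b1, xi)) + (1 - yi) * math.log(1.0 - prob(b0, b1, xi))
        for xi, yi in zip(x, y)
    )


b0, b1 = 0.0, 0.0   # arbitrary starting values
rate = 0.01         # step size, an assumption chosen small enough to be stable
history = [log_likelihood(b0, b1)]

for step in range(20000):
    # Prediction errors (observed minus predicted) drive the update.
    errors = [yi - prob(b0, b1, xi) for xi, yi in zip(x, y)]
    new_b0 = b0 + rate * sum(errors)
    new_b1 = b1 + rate * sum(e * xi for e, xi in zip(errors, x))
    # Converged when the newest model barely differs from the previous one.
    converged = abs(new_b0 - b0) < 1e-8 and abs(new_b1 - b1) < 1e-8
    b0, b1 = new_b0, new_b1
    history.append(log_likelihood(b0, b1))
    if converged:
        break

print(f"b0 = {b0:.4f}, b1 = {b1:.4f}, log-likelihood = {history[-1]:.4f}")
```

Each pass moves the coefficients in the direction that raises the likelihood of the observed data, so the recorded log-likelihood values never go down.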
Logistic Regression in SAS
/* Fit a logistic regression of buy on Age using the sales dataset */
proc logistic data=sales;
   model buy=Age;
run;
Probability modeled is Buy=0
$$P(y \mid x) = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}$$
$$P(\text{buy}=0) = \frac{\exp(3.8982 - 0.1353 \cdot \text{Age})}{1 + \exp(3.8982 - 0.1353 \cdot \text{Age})}$$
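As a worked example, the fitted equation can be evaluated for the 25-year-old from the earlier demo. A small Python sketch, using the intercept and Age coefficient from the equation above; `prob_buy0` is just an illustrative name:

```python
import math

INTERCEPT = 3.8982   # from the fitted equation in these notes
AGE_COEF = -0.1353   # from the fitted equation in these notes


def prob_buy0(age):
    """P(buy = 0), i.e. probability of buying, under the fitted model."""
    z = INTERCEPT + AGE_COEF * age
    return math.exp(z) / (1.0 + math.exp(z))


p25 = prob_buy0(25)
print(f"P(buy=0 | Age=25) = {p25:.4f}")
```

Since SAS models Buy=0 and 0 is coded as Yes, this is the probability that the person buys.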
Lab: Logistic Regression
• What is the response/dependent variable in ford data?
• What are the independent or predictor variables?
• Build a logistic regression line on the given data
• Write the line equation
Goodness of fit for a logistic regression
• Chi-Square
• The Chi-Square statistic and associated p-value (Sig.) tests whether the model
coefficients as a group equal zero.
• Concordance and discordance
• Imagine a pair of observations (1,0): one whose actual outcome is 1 and one whose actual outcome is 0
• Concordance: the model gives a higher probability to the actual 1 than to the actual 0
• Discordance: the model gives a lower probability to the actual 1 than to the actual 0
• Tie: the model gives the same probability to both
• Percent correct predictions: correct predictions, false positives & false negatives
• The "Percent Correct Predictions" statistic assumes that if the estimated p is
greater than or equal to .5 then the event is expected to occur, and not to occur
otherwise.
• By converting the estimated probabilities to 0s and 1s in this way and comparing them to the
actual 0s and 1s, the % correct Yes, % correct No, and overall % correct scores are calculated.
            Model 0    Model 1
Actual 0
Actual 1
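The percent-correct calculation can be sketched as follows. The actual outcomes and estimated probabilities are invented for illustration, and 1 is treated as the event ("Yes") here, which is an assumption of this sketch.

```python
# Hypothetical actual outcomes and estimated probabilities of the event.
actual = [1, 1, 1, 0, 0, 1, 0, 0, 1, 0]
p_hat = [0.9, 0.7, 0.4, 0.2, 0.6, 0.8, 0.1, 0.3, 0.5, 0.45]

# The event is expected to occur when the estimated p is at least 0.5.
predicted = [1 if p >= 0.5 else 0 for p in p_hat]

# counts[(a, m)] = number of rows with actual value a and model value m.
counts = {(a, m): 0 for a in (0, 1) for m in (0, 1)}
for a, m in zip(actual, predicted):
    counts[(a, m)] += 1

pct_correct_yes = 100 * counts[(1, 1)] / (counts[(1, 0)] + counts[(1, 1)])
pct_correct_no = 100 * counts[(0, 0)] / (counts[(0, 0)] + counts[(0, 1)])
pct_correct_overall = 100 * (counts[(0, 0)] + counts[(1, 1)]) / len(actual)

print("            Model 0  Model 1")
print(f"Actual 0    {counts[(0, 0)]:7d}  {counts[(0, 1)]:7d}")
print(f"Actual 1    {counts[(1, 0)]:7d}  {counts[(1, 1)]:7d}")
print(f"% correct Yes = {pct_correct_yes:.1f}")
print(f"% correct No = {pct_correct_no:.1f}")
print(f"Overall % correct = {pct_correct_overall:.1f}")
```

The off-diagonal cells are the false positives (actual 0, model 1) and false negatives (actual 1, model 0).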
Goodness of fit for overall model
Testing Global Null Hypothesis: BETA=0
Association of Predicted Probabilities and Observed Responses
Percent Concordant   92.0    Somers' D   0.851
Percent Discordant    6.9    Gamma       0.861
Percent Tied          1.1    Tau-a       0.365
Pairs                 525    c           0.926
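A short Python sketch of how pairs are classified and how three of the summary statistics follow from the percentages. The five-row example is invented; the percentages (92.0, 6.9, 1.1) are the ones reported in the SAS output in these notes. The formulas used are the standard ones: Somers' D = (concordant - discordant) / all pairs, Gamma = (concordant - discordant) / (concordant + discordant), c = (concordant + half the ties) / all pairs.

```python
def classify_pairs(actual, p_hat):
    """Count concordant, discordant and tied (1,0) pairs."""
    ones = [p for a, p in zip(actual, p_hat) if a == 1]
    zeros = [p for a, p in zip(actual, p_hat) if a == 0]
    concordant = discordant = tied = 0
    for p1 in ones:
        for p0 in zeros:
            if p1 > p0:
                concordant += 1   # actual 1 got the higher probability
            elif p1 < p0:
                discordant += 1   # actual 1 got the lower probability
            else:
                tied += 1
    return concordant, discordant, tied


# Hypothetical mini example: 3 ones and 2 zeros give 6 pairs.
print(classify_pairs([1, 1, 0, 0, 1], [0.8, 0.4, 0.4, 0.3, 0.2]))

# Percentages as reported in the SAS output.
pct_concordant, pct_discordant, pct_tied = 92.0, 6.9, 1.1

somers_d = (pct_concordant - pct_discordant) / 100
gamma = (pct_concordant - pct_discordant) / (pct_concordant + pct_discordant)
c_stat = (pct_concordant + 0.5 * pct_tied) / 100

print(f"Somers' D = {somers_d:.3f}, Gamma = {gamma:.3f}, c = {c_stat:.3f}")
```

The results agree with the reported Somers' D, Gamma and c up to rounding of the percentages. Tau-a is left out because it also needs the number of observations, which the output shown here does not include.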
Goodness of fit
• Note: the % correctly predicted within each subgroup (actual 0s and actual 1s)
is also important, especially if most of the data are 0s or most are 1s
• Hosmer and Lemeshow Goodness-of-Fit Test
• A chi-square test comparing observed and expected event ("bad") counts after
dividing the observations into groups
• The test assesses whether or not the observed event rates match
expected event rates in subgroups of the model population.
• The Hosmer–Lemeshow test specifically identifies subgroups as the
deciles of fitted risk values. Models for which expected and observed
event rates in subgroups are similar are called well calibrated.
/* The lackfit option requests the Hosmer and Lemeshow goodness-of-fit test */
proc logistic data=Ice_cream_sales;
   model buy=Age / lackfit;
run;
• Other methods
  • ROC curves
  • Somers' D, Gamma, Tau-a, c
  • More than a dozen “R2”-type summaries
Hosmer and Lemeshow Goodness-of-Fit Test
Group  Total  Observed  Expected  Observed  Expected
3      5      1         0.07      4         4.93
4      5      0         0.21      5         4.79
5      5      0         0.61      5         4.39
6      5      0         1.21      5         3.79
7      5      1         2.00      4         3.00
8      5      5         3.10      0         1.90
9      5      4         3.98      1         1.02
10     4      4         3.78      0         0.22
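A hedged Python sketch of how each group contributes to the Hosmer and Lemeshow chi-square: the squared gap between observed and expected counts, divided by the expected count, summed over both outcome columns. Only the groups that appear in these notes (3 to 10) are included, so the sum printed here is a partial sum and not the full test statistic.

```python
# (group, total, observed_a, expected_a, observed_b, expected_b)
# Rows copied from the partition table in these notes; a and b are the
# two outcome columns of that table.
rows = [
    (3, 5, 1, 0.07, 4, 4.93),
    (4, 5, 0, 0.21, 5, 4.79),
    (5, 5, 0, 0.61, 5, 4.39),
    (6, 5, 0, 1.21, 5, 3.79),
    (7, 5, 1, 2.00, 4, 3.00),
    (8, 5, 5, 3.10, 0, 1.90),
    (9, 5, 4, 3.98, 1, 1.02),
    (10, 4, 4, 3.78, 0, 0.22),
]


def contribution(obs_a, exp_a, obs_b, exp_b):
    """Chi-square contribution of one group: sum of (O - E)^2 / E."""
    return (obs_a - exp_a) ** 2 / exp_a + (obs_b - exp_b) ** 2 / exp_b


partial_sum = 0.0
for group, total, obs_a, exp_a, obs_b, exp_b in rows:
    term = contribution(obs_a, exp_a, obs_b, exp_b)
    partial_sum += term
    print(f"group {group:2d}: contribution = {term:.3f}")

print(f"partial sum over the groups shown = {partial_sum:.3f}")
```

Groups where the observed and expected counts are close contribute little; a group with a large contribution is where the model is poorly calibrated.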
• Remove the first 15 variables and rebuild the model
• Is there any change in the goodness of fit?
Meaning of beta
• In linear regression, β denotes the change in y for a unit
change in x; here, β denotes the change in the log odds of
y for a unit change in x
• For example
• In linear regression: if y = 2 + 3x and x increases by 1 unit, then y increases by
3 units
• In logistic regression: if log(P/(1-P)) = 2 + 3x and x increases by 1 unit, then the
log odds of P(y=1) increase by 3 units
• β = log odds ratio associated with the predictor
• e^β = odds ratio
$$\ln\left(\frac{P}{1-P}\right) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i$$
$$P(y \mid x) = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}$$
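To make "e^β = odds ratio" concrete, here is a small Python sketch using the coefficients from the SAS output in these notes (intercept 3.8982, Age -0.1353). The helper name `odds` is illustrative only.

```python
import math

INTERCEPT = 3.8982   # from the SAS output in these notes
AGE_COEF = -0.1353   # from the SAS output in these notes


def odds(age):
    """Odds P/(1-P) of the modeled outcome at a given age."""
    return math.exp(INTERCEPT + AGE_COEF * age)


odds_ratio = math.exp(AGE_COEF)

# The ratio of odds for a 1-unit increase in age is the same at any age.
ratio_26_27 = odds(27) / odds(26)
ratio_46_47 = odds(47) / odds(46)

print(f"exp(beta) = {odds_ratio:.4f}")
print(f"odds(27)/odds(26) = {ratio_26_27:.4f}")
print(f"odds(47)/odds(46) = {ratio_46_47:.4f}")
```

An odds ratio below 1 means each extra year of age multiplies the odds of the modeled outcome by that factor, so the odds fall as age rises.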
Individual impact- Wald chi-square
Analysis of Maximum Likelihood Estimates
Parameter  DF  Estimate  Standard Error  Wald Chi-Square  Pr > ChiSq
Intercept  1   3.8982    1.3446          8.4044           0.0037
Age        1   -0.1353   0.0410          10.8989          0.0010
Weight     1   -0.037    0.004           2.665            0.0002
• The higher the Wald chi-square, the higher the impact/importance of the variable
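The Wald chi-square for a single coefficient is the squared ratio of the estimate to its standard error. A brief Python sketch that checks this against the Intercept and Age rows of the SAS output; `wald_chi_square` is an illustrative helper name.

```python
def wald_chi_square(estimate, std_error):
    """Wald chi-square for one coefficient: (estimate / standard error)^2."""
    return (estimate / std_error) ** 2


wald_intercept = wald_chi_square(3.8982, 1.3446)
wald_age = wald_chi_square(-0.1353, 0.0410)

print(f"Intercept: {wald_intercept:.4f}")
print(f"Age:       {wald_age:.4f}")
```

Small differences from the reported values come from the estimates and standard errors being rounded to four decimals in the output.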
If age increases by 1 unit then
Age  a + b*age  exp(a + b*age)  1 + exp(a + b*age)  exp(a + b*age)/(1 + exp(a + b*age))  Change in Age  Change in Prob
26   0.38       1.46            2.46                59.40%
27   0.25       1.28            2.28                56.10%                               1              3.30%
25   0.52       1.67            2.67                62.61%                               (1)            -3.22%
36   -0.97      0.38            1.38                27.44%                               10             31.96%
16   1.73       5.66            6.66                84.98%                               (10)           -25.59%
46   -2.33      0.10            1.10                8.90%                                20             50.49%
6    3.09       21.90           22.90               95.63%                               (20)           -36.24%
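The table can be reproduced with a few lines of Python. This sketch uses the fitted coefficients from these notes and takes age 26 as the baseline, as the table does; the change in probability is baseline minus the probability at the new age.

```python
import math

A = 3.8982    # intercept from the fitted model in these notes
B = -0.1353   # Age coefficient from the fitted model in these notes


def prob(age):
    """exp(a + b*age) / (1 + exp(a + b*age))."""
    z = A + B * age
    return math.exp(z) / (1.0 + math.exp(z))


baseline_age = 26
baseline = prob(baseline_age)
print(f"Age {baseline_age}: {baseline:.2%}")

for age in (27, 25, 36, 16, 46, 6):
    change_in_age = age - baseline_age
    change_in_prob = baseline - prob(age)
    print(f"Age {age:2d}: {prob(age):.2%}  change in age {change_in_age:+d}  "
          f"change in prob {change_in_prob:+.2%}")
```

Equal steps in age do not give equal steps in probability: the coefficient is constant on the log odds scale, not on the probability scale.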
Age  a + b*age  exp(a + b*age)  1 + exp(a + b*age)  exp(a + b*age)/(1 + exp(a + b*age))
23   0.79       2.20            3.20                69%
91   -8.41      0.00            1.00                0%
27   0.25       1.28            2.28                56%
27   0.25       1.28            2.28                56%
43   -1.92      0.15            1.15                13%
49   -2.73      0.07            1.07                6%
43   -1.92      0.15            1.15                13%
79   -6.79      0.00            1.00                0%
46   -2.33      0.10            1.10                9%
success/failure
Venkat Reddy Konasani
Manager at Trendwise Analytics
[email protected]
[email protected]
www.TrendwiseAnalytics.com/venkat
+91 9886 768879
This presentation is just class notes. The course notes for Data Analysis Training were written by me (Venkata Reddy Konasani) as an aid for myself.
The best way to treat this is as a high-level summary; the actual session went more in depth (explained the examples, for instance) and contained
other information. Most of this material was written as informal notes, not intended for publication.