3.3 Logistic Regression - v3

Data Analytics Course

Logistic regression
Venkat Reddy
Contents
• Need for logistic regression
• The logistic regression model
• Meaning of beta
• Goodness of fit
• Prediction

Main steps in statistical data analysis

Ford case study
• Driving while not alert can be deadly. The objective is to design a
detector/classifier that will detect whether the driver is alert or not alert,
employing any combination of vehicular, environmental and driver physiological
data acquired while driving.
• The third column has a value X for each row, where
  • X = 1 if the driver is alert
  • X = 0 if the driver is not alert
• The next 8 columns, with headers P1, P2, ..., P8, represent physiological data.
• The next 11 columns, with headers E1, E2, ..., E11, represent environmental data.
• The next 11 columns, with headers V1, V2, ..., V11, represent vehicular data.

Recap – Multiple Regression
• Import the Ford training data
• Print the contents
• Identify the independent and dependent variables
• Is there any interdependency between the variables?
• Run the basic multicollinearity test and delete the redundant variables
• What is the final list of independent variables?

What is the need for logistic regression?
• Consider the Age vs. ice cream sales data. The dataset has two columns:
  • Age – continuous variable between 6 and 80
  • Buy (0 = Yes; 1 = No)

Demo – Need for logistic regression
• Download the ice cream sales data and fit a linear regression line.
• What is the R-squared?
• If age increases, what happens to ice cream sales?
• Does a 25-year-old person buy an ice cream?
• Can we fit a linear regression line to this data?

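To see why a straight line is awkward for a 0/1 outcome, here is a minimal Python sketch that fits ordinary least squares to a small made-up dataset and evaluates the line at a few ages. The ages and purchases below are hypothetical, not the course data; Buy is coded as on the slide (0 = Yes, 1 = No).

```python
# Hypothetical toy data, NOT the course dataset.
# Coding follows the slide: 0 = Yes (bought), 1 = No (did not buy).
ages = [6, 10, 15, 20, 25, 30, 40, 50, 60, 70, 80]
buy = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

n = len(ages)
mean_age = sum(ages) / n
mean_buy = sum(buy) / n

# Ordinary least squares for a single predictor.
sxy = sum((x - mean_age) * (y - mean_buy) for x, y in zip(ages, buy))
sxx = sum((x - mean_age) ** 2 for x in ages)
slope = sxy / sxx
intercept = mean_buy - slope * mean_age


def linear_prediction(age):
    """Value of the fitted straight line at the given age."""
    return intercept + slope * age


for age in (6, 25, 80):
    print(age, round(linear_prediction(age), 3))
```

On this toy data the line gives a value above 1 at age 80, which cannot be read as a probability.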
Real-life examples
• Gaming – Win vs. Loss
• Sales – Buying vs. Not buying
• Marketing – Response vs. No response
• Credit card & Loans – Default vs. Non-default
• Operations – Attrition vs. Retention
• Websites – Click vs. No click
• Fraud identification – Fraud vs. Non-fraud
• Healthcare – Cure vs. No cure

Why not linear?
[Figure: plot of Bought ice cream (Yes/No) against Age, 0 to 80]

Some nonlinear functions
[Figure: panels showing Gaussian, quadratic polynomial, sine, exponential, logistic and double exponential curves]

Better fit?
[Figure: Bought ice cream (Yes = 0, No = 1) plotted against Age, 0 to 80]

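The Logistic function
• We want a model that predicts probabilities between 0 and 1, that is, an
S-shaped curve.
• There are lots of S-shaped curves. We use the logistic model:
• $P = \frac{\exp(\beta_0 + \beta_1 X)}{1 + \exp(\beta_0 + \beta_1 X)}$, or equivalently $\log_e\left[\frac{P}{1-P}\right] = \beta_0 + \beta_1 X$
• The function on the left of the second form, $\log_e[P/(1-P)]$, is called the logit
(log-odds) function. Its inverse, the S-shaped curve in the first form, is the
logistic function.

[Figure: S-shaped curve of $P(y \mid x) = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}$ against x, rising from 0.0 to 1.0; here $\alpha = \beta_0$ and $\beta = \beta_1$]
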
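The relationship between the S-shaped curve and the log odds can be checked numerically. The sketch below uses only Python's standard library; the function names `logistic` and `logit` are chosen here for illustration and are not from the slides.

```python
import math


def logistic(z):
    """Map any real number z to a probability strictly between 0 and 1."""
    return math.exp(z) / (1.0 + math.exp(z))


def logit(p):
    """Log odds of a probability p, the inverse of logistic()."""
    return math.log(p / (1.0 - p))


for z in (-4, -2, 0, 2, 4):
    p = logistic(z)
    # logit(p) should recover the z we started from.
    print(z, round(p, 4), round(logit(p), 4))
```

Applying `logit` to the output of `logistic` returns the original linear predictor, which is why a model that is linear in the log odds produces an S-shaped curve in the probability.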
Logistic regression function
• Logistic regression models the logit of the outcome instead of the outcome
itself, i.e. instead of modelling winning or losing directly, we build a model for
the log odds of winning
• The logit is the natural logarithm of the odds of the outcome:
• ln(probability of the outcome (p) / probability of not having the outcome (1-p))

$$\ln\left(\frac{P}{1-P}\right) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i$$

$$P(y \mid x) = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}$$

Curve fitting using MLE
• Remember OLS for linear models? OLS picks the line that minimizes the squared
errors.
• That idea does not carry over directly: the outcome here is 0 or 1, so the
errors cannot follow a normal distribution. Instead, imagine a logistic curve
through the data and, at each point, read off the probability that the curve
assigns to the outcome actually observed. The product of these probabilities is
the likelihood, and we look for the values of the betas that maximize it.
• Maximum Likelihood Estimator:
  • Starts with arbitrary values of the regression coefficients and
  constructs an initial model for predicting the observed data.
  • Then evaluates the errors in that prediction and changes the regression
  coefficients so as to make the likelihood of the observed data greater
  under the new model.
  • Repeats until the model converges, meaning the differences between
  the newest model and the previous model are trivial.
• The idea is that you “find and report as statistics” the parameters
under which your observed data are most likely.
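The iterative procedure described above can be sketched for a single predictor in plain Python. This is only an illustration under stated assumptions: the data are a hypothetical toy set (not the course data), the update rule is Newton's method, and the internals of any particular package may differ.

```python
import math

# Hypothetical toy data, NOT the course dataset: 1 = bought, 0 = did not buy.
ages = [10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
bought = [1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0]


def prob(a, b, x):
    """Modelled probability of the event at x for intercept a and slope b."""
    return 1.0 / (1.0 + math.exp(-(a + b * x)))


def log_likelihood(a, b):
    """Log of the probability the model assigns to the observed outcomes."""
    total = 0.0
    for x, y in zip(ages, bought):
        p = prob(a, b, x)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total


def gradient(a, b):
    """Derivatives of the log-likelihood with respect to a and b."""
    ga = sum(y - prob(a, b, x) for x, y in zip(ages, bought))
    gb = sum((y - prob(a, b, x)) * x for x, y in zip(ages, bought))
    return ga, gb


a, b = 0.0, 0.0                      # arbitrary starting values
history = [log_likelihood(a, b)]
for _ in range(25):
    ga, gb = gradient(a, b)
    # Entries of the 2x2 information matrix (negative second derivatives).
    w = [prob(a, b, x) * (1.0 - prob(a, b, x)) for x in ages]
    h_aa = sum(w)
    h_ab = sum(wi * x for wi, x in zip(w, ages))
    h_bb = sum(wi * x * x for wi, x in zip(w, ages))
    det = h_aa * h_bb - h_ab * h_ab
    # Newton step: solve the 2x2 linear system by hand.
    step_a = (h_bb * ga - h_ab * gb) / det
    step_b = (h_aa * gb - h_ab * ga) / det
    a, b = a + step_a, b + step_b
    history.append(log_likelihood(a, b))
    if abs(step_a) < 1e-10 and abs(step_b) < 1e-10:
        break                        # converged: the update is trivial

print("intercept:", round(a, 4), "slope:", round(b, 4))
print("log-likelihood, start and end:", round(history[0], 4), round(history[-1], 4))
```

The log-likelihood at the final coefficients is higher than at the starting values, and the loop stops once the change in the coefficients becomes negligible, mirroring the "repeat until the model converges" step.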
Logistic Regression in SAS
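/* Fit a logistic regression of buy on Age.
   By default PROC LOGISTIC models the lower ordered response value, here buy=0. */
proc logistic data=sales;
model buy=Age;
run;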

$$P(y \mid x) = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}$$

Probability modeled is Buy=0:

$$P(\text{buy}=0) = \frac{\exp(3.8982 - 0.1353 \cdot \text{Age})}{1 + \exp(3.8982 - 0.1353 \cdot \text{Age})}$$

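As a quick check of the fitted equation, the sketch below plugs an age into it using the intercept (3.8982) and Age coefficient (-0.1353) reported on the slide. The function name `prob_buy0` is made up for this example. Recall that Buy = 0 means "Yes" in this dataset's coding.

```python
import math

INTERCEPT = 3.8982    # from the slide's fitted equation
SLOPE_AGE = -0.1353   # from the slide's fitted equation


def prob_buy0(age):
    """P(buy = 0), i.e. probability of buying, at the given age."""
    z = INTERCEPT + SLOPE_AGE * age
    return math.exp(z) / (1.0 + math.exp(z))


p25 = prob_buy0(25)
print(f"P(buy=0 | Age=25) = {p25:.4f}")
```

For a 25-year-old this gives a probability of about 0.626, which matches the 62.61% shown for age 25 in the deck's later table.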
Lab: Logistic Regression
• What is the response/dependent variable in ford data?
• What are the independent or predictor variables?
• Build a logistic regression line on the given data
• Write the line equation

Goodness of fit for a logistic regression
• Chi-Square
  • The Chi-Square statistic and its associated p-value (Sig.) test whether the
  model coefficients as a group equal zero.
• Concordance and discordance
  • Imagine a pair (1, 0): one observation whose actual outcome is 1 and one
  whose actual outcome is 0
  • Concordance: the model gives a higher probability to the actual 1 than to the actual 0
  • Discordance: the model gives a lower probability to the actual 1 than to the actual 0
  • Tie: the model gives the same probability to both
• Correct predictions, false positives & false negatives
  • The "Percent Correct Predictions" statistic assumes that if the estimated p is
  greater than or equal to .5 then the event is expected to occur, and not to
  occur otherwise.
  • By converting the estimated probabilities to 0s and 1s in this way and
  comparing them to the actual 0s and 1s, the % correct Yes, % correct No, and
  overall % correct scores are calculated.

              Model
              0      1
Actual   0
         1

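The "percent correct" calculation can be sketched as follows. The actual outcomes and predicted probabilities are hypothetical values invented for the illustration; the 0.5 cutoff is the one described on the slide.

```python
# Hypothetical actual outcomes and predicted event probabilities (not course data).
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
p_hat = [0.91, 0.72, 0.55, 0.38, 0.12, 0.45, 0.61, 0.08, 0.30, 0.50]

# Predict the event when the estimated probability is at least 0.5.
predicted = [1 if p >= 0.5 else 0 for p in p_hat]

# Actual-by-model table: table[(actual, model)] = count.
table = {(a, m): 0 for a in (0, 1) for m in (0, 1)}
for a, m in zip(actual, predicted):
    table[(a, m)] += 1

pct_correct_1 = table[(1, 1)] / (table[(1, 1)] + table[(1, 0)])
pct_correct_0 = table[(0, 0)] / (table[(0, 0)] + table[(0, 1)])
pct_correct_all = (table[(1, 1)] + table[(0, 0)]) / len(actual)

print("            Model 0  Model 1")
print(f"Actual 0    {table[(0, 0)]:7d}  {table[(0, 1)]:7d}")
print(f"Actual 1    {table[(1, 0)]:7d}  {table[(1, 1)]:7d}")
print(pct_correct_1, pct_correct_0, pct_correct_all)
```

Reporting the two per-class rates alongside the overall rate matters when one class dominates, because a model can then score a high overall percentage while rarely predicting the minority class correctly.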
Goodness of fit for overall model

Testing Global Null Hypothesis: BETA=0
Test               Chi-Square   DF   Pr > ChiSq
Likelihood Ratio   28.9720      1    <.0001
Score              22.3501      1    <.0001
Wald               10.8989      1    0.0010

Association of Predicted Probabilities and Observed Responses
Percent Concordant   92.0    Somers' D   0.851
Percent Discordant   6.9     Gamma       0.861
Percent Tied         1.1     Tau-a       0.365
Pairs                525     c           0.926

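The association measures in this output can be reproduced from the concordant, discordant and tied percentages. The sketch below first counts pairs on a small hypothetical example and then applies the standard formulas to the percentages printed above. The helper name `association` is made up here, and SAS's own computation may differ in details such as how ties are detected.

```python
def association(actual, p_hat):
    """Count concordant/discordant/tied (1, 0) pairs and derive c, Somers' D, Gamma."""
    ones = [p for a, p in zip(actual, p_hat) if a == 1]
    zeros = [p for a, p in zip(actual, p_hat) if a == 0]
    conc = disc = tied = 0
    for p1 in ones:
        for p0 in zeros:
            if p1 > p0:
                conc += 1
            elif p1 < p0:
                disc += 1
            else:
                tied += 1
    pairs = len(ones) * len(zeros)
    return {
        "pairs": pairs,
        "concordant": conc,
        "discordant": disc,
        "tied": tied,
        "c": (conc + 0.5 * tied) / pairs,
        "somers_d": (conc - disc) / pairs,
        "gamma": (conc - disc) / (conc + disc),
    }


# Hypothetical example data (not the course dataset).
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
p_hat = [0.91, 0.72, 0.55, 0.38, 0.12, 0.45, 0.61, 0.08, 0.30, 0.50]
result = association(actual, p_hat)
print(result)

# Same formulas applied to the percentages printed in the output above.
pct_conc, pct_disc, pct_tied = 92.0, 6.9, 1.1
slide_somers_d = (pct_conc - pct_disc) / 100
slide_c = (pct_conc + 0.5 * pct_tied) / 100
slide_gamma = (pct_conc - pct_disc) / (pct_conc + pct_disc)
print(slide_somers_d, slide_c, slide_gamma)
```

Up to rounding of the printed percentages, the last three values agree with the Somers' D (0.851), c (0.926) and Gamma (0.861) in the output.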
Goodness of fit
• Note: the % correctly predicted within each subgroup (actual 0s and actual 1s)
is also important, especially if most of the data are 0s or most are 1s
• Hosmer and Lemeshow Goodness-of-Fit Test
  • Chi-square test comparing the observed and expected event ("bad") counts
  after dividing the observations into groups
  • The test assesses whether or not the observed event rates match the
  expected event rates in subgroups of the model population.
  • The Hosmer–Lemeshow test specifically identifies subgroups as the
  deciles of fitted risk values. Models for which expected and observed
  event rates in subgroups are similar are called well calibrated.
proc logistic data=Ice_cream_sales;
model buy=Age /lackfit;
run;
• Other methods
  • ROC curves
  • Somers' D, Gamma, Tau-a, c
  • More than a dozen “R2”-type summaries

Hosmer and Lemeshow Goodness-of-Fit Test

Partition for the Hosmer and Lemeshow Test
                    Buy = 0                Buy = 1
Group   Total   Observed   Expected   Observed   Expected
1       5       0          0.01       5          4.99
2       6       0          0.03       6          5.97
3       5       1          0.07       4          4.93
4       5       0          0.21       5          4.79
5       5       0          0.61       5          4.39
6       5       0          1.21       5          3.79
7       5       1          2.00       4          3.00
8       5       5          3.10       0          1.90
9       5       4          3.98       1          1.02
10      4       4          3.78       0          0.22

Hosmer and Lemeshow Goodness-of-Fit Test
Chi-Square   DF   Pr > ChiSq
19.0740      8    0.0145

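The Hosmer and Lemeshow chi-square can be approximately recomputed from the partition table by summing (observed - expected)^2 / expected over both outcome columns of every group. The sketch below copies the counts from the table; the degrees of freedom are taken as the number of groups minus 2, which matches the DF of 8 reported above.

```python
# (observed Buy=0, expected Buy=0, observed Buy=1, expected Buy=1) for each group,
# copied from the partition table.
groups = [
    (0, 0.01, 5, 4.99),
    (0, 0.03, 6, 5.97),
    (1, 0.07, 4, 4.93),
    (0, 0.21, 5, 4.79),
    (0, 0.61, 5, 4.39),
    (0, 1.21, 5, 3.79),
    (1, 2.00, 4, 3.00),
    (5, 3.10, 0, 1.90),
    (4, 3.98, 1, 1.02),
    (4, 3.78, 0, 0.22),
]

# Pearson-type chi-square over both outcome columns of every group.
hl_statistic = sum(
    (o0 - e0) ** 2 / e0 + (o1 - e1) ** 2 / e1
    for o0, e0, o1, e1 in groups
)
degrees_of_freedom = len(groups) - 2

print(round(hl_statistic, 2), degrees_of_freedom)
```

The result lands close to, but not exactly on, the reported 19.0740, presumably because the expected counts in the table are rounded to two decimals. Most of the statistic comes from group 3, where one purchase was observed against an expected count of only 0.07.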
Lab: Logistic regression
• How good is the fit?
  • What is the chi-square?
  • What is the concordance?
  • What is the discordance?
• Remove the first 15 variables and rebuild the model
• Is there any change in the goodness of fit?

Meaning of beta
• In linear regression, $\beta$ denotes the change in y for a unit change in x;
here $\beta$ denotes the change in the log odds of y for a unit change in x
• For example
  • In linear regression: if y = 2 + 3x, then when x increases by 1 unit y
  increases by 3 units
  • In logistic regression: if log(P/(1-P)) = 2 + 3x, then when x increases by
  1 unit the log odds of y = 1 increase by 3 units
• $\beta$ = log odds ratio associated with a one-unit increase in the predictor
• $e^{\beta}$ = odds ratio

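To make the odds-ratio reading concrete, the sketch below uses the Age coefficient from the ice cream model reported elsewhere in these notes (intercept 3.8982, slope -0.1353, probability modeled is Buy = 0). The function name `odds_buy0` is made up for the illustration.

```python
import math

INTERCEPT = 3.8982    # from the fitted ice cream model in these notes
SLOPE_AGE = -0.1353   # from the fitted ice cream model in these notes


def odds_buy0(age):
    """Odds P/(1-P) of buy = 0 at the given age; equals exp(a + b*age)."""
    return math.exp(INTERCEPT + SLOPE_AGE * age)


odds_ratio = math.exp(SLOPE_AGE)

# The ratio of odds for a one-year increase is the same at every starting age.
ratio_young = odds_buy0(26) / odds_buy0(25)
ratio_old = odds_buy0(61) / odds_buy0(60)

print(round(odds_ratio, 4), round(ratio_young, 4), round(ratio_old, 4))
```

Each additional year of age multiplies the odds of buying by roughly 0.87, whatever the starting age. The change in probability, by contrast, depends on where on the curve you start.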
$$\ln\left(\frac{P}{1-P}\right) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i$$

$$P(y \mid x) = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}$$

Individual impact – Wald chi-square

Analysis of Maximum Likelihood Estimates
Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept   1    3.8982     1.3446           8.4044            0.0037
Age         1    -0.1353    0.0410           10.8989           0.0010
Weight      1    -0.037     0.004            2.665             0.0002

• The higher the Wald chi-square, the higher the impact/importance of the variable

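The Wald chi-square for a single coefficient is the squared ratio of the estimate to its standard error. The sketch below recomputes it for the Intercept and Age rows, using the estimates and standard errors printed in the table; only these two rows are used in the illustration.

```python
# (estimate, standard error) copied from the table for two of the rows.
estimates = {
    "Intercept": (3.8982, 1.3446),
    "Age": (-0.1353, 0.0410),
}

# Wald chi-square = (estimate / standard error) squared.
wald = {name: (est / se) ** 2 for name, (est, se) in estimates.items()}

for name, value in wald.items():
    print(name, round(value, 3))
```

The small differences from the printed 8.4044 and 10.8989 come from the estimates and standard errors being rounded to four decimals in the table.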
If age increases by 1 unit then
Changes are measured from age 26; Change in Prob = P(age 26) - P(new age).

Age   a+b*age   exp(a+b*age)   1+exp(a+b*age)   exp(a+b*age)/(1+exp(a+b*age))   Change in Age   Change in Prob
26    0.38      1.46           2.46             59.40%
27    0.25      1.28           2.28             56.10%                          1               3.30%
25    0.52      1.67           2.67             62.61%                          -1              -3.22%
36    -0.97     0.38           1.38             27.44%                          10              31.96%
16    1.73      5.66           6.66             84.98%                          -10             -25.59%
46    -2.33     0.10           1.10             8.90%                           20              50.49%
6     3.09      21.90          22.90            95.63%                          -20             -36.24%

• If age increases by 10 units
  • the probability of buying decreases by ______
• If age decreases by 10 units
  • the probability of buying increases by ______
• Non-linear?

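The non-linearity asked about above can be checked by recomputing the probabilities around age 26. This sketch uses the fitted intercept and Age coefficient from the notes (3.8982 and -0.1353, probability modeled is Buy = 0); the function name `prob_buy0` is chosen here for illustration.

```python
import math

INTERCEPT = 3.8982    # from the fitted model in these notes
SLOPE_AGE = -0.1353   # from the fitted model in these notes


def prob_buy0(age):
    """P(buy = 0), i.e. probability of buying, at the given age."""
    z = INTERCEPT + SLOPE_AGE * age
    return math.exp(z) / (1.0 + math.exp(z))


base_age = 26
base = prob_buy0(base_age)
for delta in (1, -1, 10, -10, 20, -20):
    p = prob_buy0(base_age + delta)
    # Positive values mean the probability of buying dropped relative to age 26.
    print(f"age {base_age + delta:2d}: P = {p:.4f}, P(26) - P = {base - p:+.4f}")

drop_when_10_older = base - prob_buy0(base_age + 10)
rise_when_10_younger = prob_buy0(base_age - 10) - base
```

Moving ten years up from 26 lowers the probability by about 0.32, while moving ten years down raises it by only about 0.26. Equal steps in age do not give equal steps in probability.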
Prediction
Probability modeled is Buy=0
Age   a+b*age   exp(a+b*age)   1+exp(a+b*age)   exp(a+b*age)/(1+exp(a+b*age))
8     2.82      16.71          17.71            94%
17    1.60      4.94           5.94             83%
87    -7.87     0.00           1.00             0%
62    -4.49     0.01           1.01             1%
85    -7.60     0.00           1.00             0%
16    1.73      5.66           6.66             85%
23    0.79      2.20           3.20             69%
91    -8.41     0.00           1.00             0%
27    0.25      1.28           2.28             56%
27    0.25      1.28           2.28             56%
43    -1.92     0.15           1.15             13%
49    -2.73     0.07           1.07             6%
43    -1.92     0.15           1.15             13%
79    -6.79     0.00           1.00             0%
46    -2.33     0.10           1.10             9%

/* Save the predicted probability of each observation as phat in pred_ice_logistic */
proc logistic data=Ice_cream_sales;
model buy=Age;
output out=pred_ice_logistic p=phat;
run;

Lab: Prediction & Impact of predictor variables
• What are the top 5 impacting variables?
• Keep only the top 15 variables and rebuild the model
• What is the dip in goodness of fit?
• Homework: Use the test data and estimate the probability of success/failure

Venkat Reddy Konasani
Manager at Trendwise Analytics
www.TrendwiseAnalytics.com/venkat
+91 9886 768879

This presentation is just class notes. The course notes for Data Analysis Training were written by me (Venkata Reddy Konasani) as an aid for myself.
The best way to treat this is as a high-level summary; the actual session went into more depth (it explained the examples, for instance) and contained
other information. Most of this material was written as informal notes, not intended for publication.

Please send questions/comments/corrections to [email protected]
