An Introduction to Logistic Regression

John Whitehead
Department of Economics
Appalachian State University

Outline
Introduction and Description
Some Potential Problems and Solutions
Writing Up the Results

Introduction and Description

Why use logistic regression?
Estimation by maximum likelihood
Interpreting coefficients
Hypothesis testing
Evaluating the performance of the model

Why use logistic regression?

There are many important research topics for which the dependent variable is "limited." For example, voting, morbidity or mortality, and participation data are not continuous or normally distributed.
Binary logistic regression is a type of regression analysis where the dependent variable is a dummy variable: coded 0 (did not vote) or 1 (did vote).

The Linear Probability Model

In the OLS regression:

$Y = \alpha + \beta X + e$, where $Y \in \{0, 1\}$

The error terms are heteroskedastic
e is not normally distributed because Y takes on only two values
The predicted probabilities can be greater than 1 or less than 0

An Example: Hurricane Evacuations

Q: EVAC
Did you evacuate your home to go someplace safer before Hurricane Dennis (Floyd) hit?
1 YES
2 NO
3 DON'T KNOW
4 REFUSED

The Data
EVAC   PETS   MOBLHOME   TENURE   EDUC
  0      1       0         16      16
  0      1       0         26      12
  0      1       1         11      13
  1      1       1          1      10
  1      0       0          5      12
  0      0       0         34      12
  0      0       0          3      14
  0      1       0          3      16
  0      1       0         10      12
  0      0       0          2      18
  0      0       0          2      12
  0      1       0         25      16
  1      1       1         20      12

OLS Results
Dependent Variable: EVAC

Variable         B        t-value
(Constant)      0.190      2.121
PETS           -0.137     -5.296
MOBLHOME        0.337      8.963
TENURE         -0.003     -2.973
EDUC            0.003      0.424
FLOYD           0.198      8.147

R2              0.145
F-stat         36.010

Problems:
Predicted Values outside the 0,1 range

Descriptive Statistics
                                   N      Minimum    Maximum    Mean        Std. Deviation
Unstandardized Predicted Value    1070    -.08498    .76027     .2429907    .1632
Valid N (listwise)                1070
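To see how a linear probability model can leave the unit interval, the OLS coefficients reported in the results table can be applied to a single household. This is a minimal sketch: the household profile is hypothetical, chosen only for illustration, and is not a row from the survey.

```python
# OLS (linear probability model) coefficients from the OLS results table.
LPM_COEF = {
    "const": 0.190, "PETS": -0.137, "MOBLHOME": 0.337,
    "TENURE": -0.003, "EDUC": 0.003, "FLOYD": 0.198,
}

def lpm_predict(household):
    """Return the LPM fitted value: constant plus the sum of B * X."""
    return LPM_COEF["const"] + sum(
        LPM_COEF[name] * value for name, value in household.items()
    )

# Hypothetical household (an assumption, not taken from the data).
household = {"PETS": 1, "MOBLHOME": 0, "TENURE": 34, "EDUC": 12, "FLOYD": 0}
p_hat = lpm_predict(household)
print(f"LPM fitted value: {p_hat:.3f}")
```

For this profile the fitted value falls slightly below zero, so it cannot be read as a probability.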

Heteroskedasticity

Park Test
Dependent Variable: LNESQ

Variable         B        t-stat
(Constant)     -2.34     -15.99
LNTNSQ         -0.20      -6.19

The Logistic Regression Model

The "logit" model solves these problems:

$\ln[p/(1-p)] = \alpha + \beta X + e$

p is the probability that the event Y occurs, p(Y=1)
p/(1-p) is the odds of the event
ln[p/(1-p)] is the log odds, or "logit"

More: The logistic distribution constrains the estimated probabilities to lie between 0 and 1. The estimated probability is:

$p = 1/[1 + \exp(-\alpha - \beta X)]$

if you let $\alpha + \beta X = 0$, then p = .50
as $\alpha + \beta X$ gets really big, p approaches 1
as $\alpha + \beta X$ gets really small (large and negative), p approaches 0
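The three limiting cases can be checked numerically. A minimal sketch, where `z` stands for the linear index $\alpha + \beta X$ and the grid of values is arbitrary:

```python
import math

def logistic(z):
    """Map a linear index z = alpha + beta * X to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Arbitrary index values, from very negative to very positive.
for z in (-10, -2, 0, 2, 10):
    print(f"z = {z:>3}: p = {logistic(z):.4f}")
```

Whatever the value of the index, the result stays strictly between 0 and 1, which is the property the linear probability model lacks.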

Comparing LP and Logit Models

[Figure: predicted probability curves for the LP Model and the Logit Model; vertical axis marked at 1]

Maximum Likelihood Estimation (MLE)

MLE is a statistical method for estimating the coefficients of a model. The likelihood function (L) measures the probability of observing the particular set of dependent variable values (p1, p2, ..., pn) that occur in the sample:

$L = \text{Prob}(p_1 \cdot p_2 \cdots p_n)$

The higher the L, the higher the probability of observing the ps in the sample.

MLE involves finding the coefficients ($\alpha$, $\beta$) that make the log of the likelihood function (LL < 0) as large as possible.
Or, it finds the coefficients that make -2 times the log of the likelihood function (-2LL) as small as possible.
The maximum likelihood estimates solve the following condition:

$\sum_{i=1}^{n} \{Y_i - p(Y_i = 1)\} X_i = 0$

summed over all observations, i = 1, ..., n
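The first-order condition can be illustrated with a small Newton-Raphson routine. This is only a sketch under stated assumptions: it uses the 13 rows shown in "The Data" with MOBLHOME as the single regressor, so its estimates will not match the SPSS output, which comes from the full sample and four regressors. The function names are illustrative.

```python
import math

# EVAC and MOBLHOME columns from the 13 rows shown in "The Data".
EVAC     = [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
MOBLHOME = [0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

def probabilities(x, alpha, beta):
    return [1.0 / (1.0 + math.exp(-(alpha + beta * xi))) for xi in x]

def score(y, x, alpha, beta):
    """First-order conditions: sum of (y - p) and sum of (y - p) * x."""
    p = probabilities(x, alpha, beta)
    g0 = sum(yi - pi for yi, pi in zip(y, p))
    g1 = sum((yi - pi) * xi for yi, pi, xi in zip(y, p, x))
    return g0, g1

def fit_logit(y, x, iterations=25):
    """Newton-Raphson for ln[p/(1-p)] = alpha + beta * x (one regressor)."""
    alpha, beta = 0.0, 0.0
    for _ in range(iterations):
        g0, g1 = score(y, x, alpha, beta)
        # Information matrix (negative Hessian) with weights w = p(1 - p).
        w = [pi * (1.0 - pi) for pi in probabilities(x, alpha, beta)]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: add H^{-1} g, using the closed-form 2x2 inverse.
        alpha += (h11 * g0 - h01 * g1) / det
        beta += (h00 * g1 - h01 * g0) / det
    return alpha, beta

alpha_hat, beta_hat = fit_logit(EVAC, MOBLHOME)
print(f"alpha = {alpha_hat:.4f}, beta = {beta_hat:.4f}")
print("score at the estimates:", score(EVAC, MOBLHOME, alpha_hat, beta_hat))
```

At the estimates both elements of the score are zero up to floating-point error, which is the condition stated above. With a single dummy regressor the fitted probabilities simply reproduce the evacuation shares in each group (1 of 10 without a mobile home, 2 of 3 with one).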

Interpreting Coefficients

Since:

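$\ln[p/(1-p)] = \alpha + \beta X + e$

The slope coefficient ($\beta$) is interpreted as the rate of change in the "log odds" as X changes, which is not very useful. Since:

$p = 1/[1 + \exp(-\alpha - \beta X)]$

The marginal effect of a change in X on the probability is: $\partial p/\partial X = f(\alpha + \beta X)\,\beta$, where f is the logistic density function, so the effect equals $\beta\, p(1-p)$ and depends on where it is evaluated.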
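Because the marginal effect depends on where it is evaluated, it can help to compute it at several points. A minimal sketch, using the TENURE coefficient from the SPSS output in this section; the index values at which it is evaluated are arbitrary assumptions:

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def marginal_effect(beta, z):
    """dp/dX = beta * p * (1 - p), evaluated at the linear index z."""
    p = logistic(z)
    return beta * p * (1.0 - p)

beta_tenure = -0.0198  # TENURE coefficient from the SPSS output

# Illustrative index values (assumed); the effect is largest at z = 0 (p = .5).
for z in (-2.0, 0.0, 2.0):
    print(f"z = {z:+.1f}: p = {logistic(z):.3f}, "
          f"dp/dX = {marginal_effect(beta_tenure, z):.5f}")
```

The effect is largest in absolute value at p = .5 and shrinks toward zero as p approaches 0 or 1.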

An interpretation of the logit coefficient which is usually more intuitive is the "odds ratio"
Since:

$p/(1-p) = \exp(\alpha + \beta X)$

$\exp(\beta)$ is the odds ratio: the factor by which the odds p/(1-p) are multiplied when the independent variable increases by one unit

From SPSS Output:

Variable         B         Exp(B)    1/Exp(B)
PETS           -0.6593     0.5172     1.933
MOBLHOME        1.5583     4.7508
TENURE         -0.0198     0.9804     1.020
EDUC            0.0501     1.0514
Constant       -0.916

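The odds of evacuating for households with pets are 0.517 times the odds for households without pets. Equivalently, households without pets have odds of evacuating 1.933 times as high as those with pets.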
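The Exp(B) and 1/Exp(B) columns follow directly from the B column. A minimal sketch that recomputes them from the coefficients in the SPSS output:

```python
import math

# Logit coefficients (B) from the SPSS output.
B = {"PETS": -0.6593, "MOBLHOME": 1.5583, "TENURE": -0.0198, "EDUC": 0.0501}

# exp(B) is the multiplicative change in the odds for a one-unit change in X.
odds_ratios = {name: math.exp(b) for name, b in B.items()}

for name, ratio in odds_ratios.items():
    line = f"{name:<9} Exp(B) = {ratio:.4f}"
    if ratio < 1:
        # For negative coefficients the inverse is often easier to describe.
        line += f"   1/Exp(B) = {1 / ratio:.3f}"
    print(line)
```

Reporting the inverse for negative coefficients only changes the reference group; the underlying comparison is the same.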

Hypothesis Testing

The Wald statistic for the $\beta$ coefficient is:

$\text{Wald} = [\beta/\text{s.e.}_B]^2$

which is distributed chi-square with 1 degree of freedom.
The "Partial R" (in SPSS output) is

$R = \{(\text{Wald} - 2)/[-2LL(\alpha)]\}^{1/2}$

An Example:
Variable         B         S.E.      Wald       df    R          Sig       t-value
PETS           -0.6593     0.2012    10.732           -0.1127    0.0011    -3.28
MOBLHOME        1.5583     0.2874    29.39             0.1996    0          5.42
TENURE         -0.0198     0.008      6.1238          -0.0775    0.0133    -2.48
EDUC            0.0501     0.0468     1.1483           0.0000    0.2839     1.07
Constant       -0.916      0.69       1.7624    1                0.1843    -1.33
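The Wald and R columns can be approximately reproduced from B and S.E. The sketch below makes two assumptions: R takes the sign of the coefficient, and R is set to 0 when Wald is below 2, which is consistent with the EDUC row. The value of -2LL for the constant-only model is taken from the SPSS printout in this section. Small differences from the table come from the rounded inputs.

```python
import math

NEG2LL_NULL = 687.35714  # -2LL of the constant-only model (SPSS printout)

def wald(b, se):
    """Wald statistic: squared ratio of the coefficient to its standard error."""
    return (b / se) ** 2

def partial_r(b, se, neg2ll_null=NEG2LL_NULL):
    """Signed partial R; assumed to be 0 when Wald < 2, as in the EDUC row."""
    w = wald(b, se)
    if w < 2.0:
        return 0.0
    return math.copysign(math.sqrt((w - 2.0) / neg2ll_null), b)

rows = {  # variable: (B, S.E.) from the example table
    "PETS": (-0.6593, 0.2012),
    "MOBLHOME": (1.5583, 0.2874),
    "TENURE": (-0.0198, 0.008),
    "EDUC": (0.0501, 0.0468),
}
for name, (b, se) in rows.items():
    print(f"{name:<9} Wald = {wald(b, se):7.3f}   R = {partial_r(b, se):+.4f}")
```

The t-value column in the table is simply B/S.E., the signed square root of the Wald statistic.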

Evaluating the Performance of the Model

There are several statistics which can be used for comparing alternative models or evaluating the performance of a single model:
Model Chi-Square
Percent Correct Predictions
Pseudo-R2

Model Chi-Square
The model likelihood ratio (LR) statistic is

$LR[i] = -2[LL(\alpha) - LL(\alpha, \beta)]$

{Or, as you are reading SPSS printout: LR[i] = [-2LL (of beginning model)] - [-2LL (of ending model)]}

The LR statistic is distributed chi-square with i degrees of freedom, where i is the number of independent variables.
Use the Model Chi-Square statistic to determine if the overall model is statistically significant.

An Example:
Beginning Block Number 1. Method: Enter

-2 Log Likelihood   687.35714

Variable(s) Entered on Step Number 1..
  PETS       PETS
  MOBLHOME   MOBLHOME
  TENURE     TENURE
  EDUC       EDUC

Estimation terminated at iteration number 3 because Log Likelihood decreased by less than .01 percent.

-2 Log Likelihood   641.842

          Chi-Square   df   Sign.
Model       45.515      4   0.0000
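The Model Chi-Square in the printout is the difference between the beginning and ending -2 Log Likelihood values. A minimal sketch of that arithmetic; the critical value is the standard chi-square table entry for 4 degrees of freedom at the .01 level:

```python
neg2ll_begin = 687.35714   # constant-only model
neg2ll_end = 641.842       # model with PETS, MOBLHOME, TENURE, EDUC
n_regressors = 4           # degrees of freedom of the test

model_chi_square = neg2ll_begin - neg2ll_end
CRITICAL_01_DF4 = 13.277   # chi-square critical value, df = 4, p = .01

print(f"Model chi-square = {model_chi_square:.3f} (df = {n_regressors})")
print("Significant at .01:", model_chi_square > CRITICAL_01_DF4)
```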

Percent Correct Predictions

The "Percent Correct Predictions" statistic assumes that if the estimated p is greater than or equal to .5 then the event is expected to occur and not occur otherwise. By assigning these probabilities 0s and 1s and comparing these to the actual 0s and 1s, the % correct Yes, % correct No, and overall % correct scores are calculated.

An Example:
                Predicted
Observed        0       1     % Correct
0             328      24      93.18%
1             139      44      24.04%
Overall                        69.53%
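The three percentages come straight from the four cell counts. A minimal sketch, with cells keyed by (observed, predicted):

```python
# Classification table counts keyed by (observed, predicted).
counts = {(0, 0): 328, (0, 1): 24, (1, 0): 139, (1, 1): 44}

def percent(part, whole):
    return 100.0 * part / whole

observed_no = counts[(0, 0)] + counts[(0, 1)]
observed_yes = counts[(1, 0)] + counts[(1, 1)]

pct_correct_no = percent(counts[(0, 0)], observed_no)
pct_correct_yes = percent(counts[(1, 1)], observed_yes)
pct_correct_all = percent(counts[(0, 0)] + counts[(1, 1)],
                          observed_no + observed_yes)

print(f"% correct No:  {pct_correct_no:.2f}")
print(f"% correct Yes: {pct_correct_yes:.2f}")
print(f"Overall:       {pct_correct_all:.2f}")
```

The model classifies non-evacuees far better than evacuees, which the overall figure alone would hide.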

Pseudo-R2

One pseudo-R2 statistic is the McFadden's-R2 statistic:

$\text{McFadden's-}R^2 = 1 - [LL(\alpha, \beta)/LL(\alpha)]$
{= 1 - [-2LL($\alpha$, $\beta$)/-2LL($\alpha$)] (from SPSS printout)}

where the R2 is a scalar measure which varies between 0 and (somewhat close to) 1, much like the R2 in a LP model.

An Example:
Beginning -2 LL           687.36
Ending -2 LL              641.84
Ending/Beginning          0.9338
McF. R2 = 1 - E./B.       0.0662
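The same calculation in code, as a minimal sketch using the two -2LL values from the example:

```python
def mcfadden_r2(neg2ll_begin, neg2ll_end):
    """McFadden's R2 = 1 - (ending -2LL / beginning -2LL)."""
    return 1.0 - neg2ll_end / neg2ll_begin

r2 = mcfadden_r2(687.36, 641.84)
print(f"Ending/Beginning = {641.84 / 687.36:.4f}")
print(f"McFadden's R2    = {r2:.4f}")
```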

Some potential problems and solutions

Omitted Variable Bias
Irrelevant Variable Bias
Functional Form
Multicollinearity
Structural Breaks

Omitted Variable Bias

Omitted variable(s) can result in bias in the coefficient estimates. To test for omitted variables you can conduct a likelihood ratio test:

LR[q] = {[-2LL(constrained model, i=k-q)] - [-2LL(unconstrained model, i=k)]}

where LR is distributed chi-square with q degrees of freedom, with q = 1 or more omitted variables
{This test is conducted automatically by SPSS if you specify "blocks" of independent variables}

An Example:
Variable           B         Wald      Sig
PETS             -0.699     10.968    0.001
MOBLHOME          1.570     29.412    0.000
TENURE           -0.020      5.993    0.014
EDUC              0.049      1.079    0.299
CHILD             0.009      0.011    0.917
WHITE             0.186      0.422    0.516
FEMALE            0.018      0.008    0.928
Constant         -1.049      2.073    0.150

Beginning -2 LL   687.36
Ending -2 LL      641.41

Constructing the LR Test

Ending -2 LL Partial Model    641.84
Ending -2 LL Full Model       641.41
Block Chi-Square                0.43
DF                              3
Critical Value                 11.345

Since the chi-squared value is less than the critical value the set of coefficients is not statistically significant. The full model is not an improvement over the partial model.
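Instead of comparing with a tabulated critical value, the p-value of the block chi-square can be computed directly. The sketch below uses a closed-form expression for the chi-square upper-tail probability that holds for integer degrees of freedom; the helper name `chi2_sf` is illustrative and not part of any SPSS output.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(chi-square with df > x), integer df >= 1, x >= 0."""
    half = x / 2.0
    if df % 2 == 0:
        # Even df: finite Poisson-type sum.
        return math.exp(-half) * sum(
            half ** j / math.factorial(j) for j in range(df // 2)
        )
    # Odd df: normal tail term plus a finite sum of half-integer powers.
    extra = sum(
        half ** (j - 0.5) / math.gamma(j + 0.5)
        for j in range(1, (df - 1) // 2 + 1)
    )
    return math.erfc(math.sqrt(half)) + math.exp(-half) * extra

neg2ll_partial = 641.84   # model without CHILD, WHITE, FEMALE
neg2ll_full = 641.41      # model with the three added variables
q = 3                     # number of added variables

block_chi_square = neg2ll_partial - neg2ll_full
p_value = chi2_sf(block_chi_square, q)
print(f"Block chi-square = {block_chi_square:.2f}, df = {q}, p = {p_value:.3f}")
```

A p-value this far above conventional levels leads to the same conclusion as the critical-value comparison: the three added variables do not improve the model.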

Irrelevant Variable Bias

The inclusion of irrelevant variable(s) can result in poor model fit. You can consult your Wald statistics or conduct a likelihood ratio test.

Functional Form
Errors in functional form can result in biased coefficient estimates and poor model fit. You should try different functional forms by logging the independent variables, adding squared terms, etc. Then consult the Wald statistics and model chi-square statistics to determine which model performs best.

Multicollinearity
The presence of multicollinearity will not lead to biased coefficients. But the standard errors of the coefficients will be inflated. If a variable which you think should be statistically significant is not, consult the correlation coefficients. If two variables are correlated at a rate greater than .6, .7, .8, etc. then try dropping the least theoretically important of the two.

Structural Breaks

You may have structural breaks in your data. Pooling the data imposes the restriction that an independent variable has the same effect on the dependent variable for different groups of data when the opposite may be true. You can conduct a likelihood ratio test:

LR[i+1] = -2LL(pooled model) - [-2LL(sample 1) + -2LL(sample 2)]

where samples 1 and 2 are pooled, and i is the number of independent variables.

An Example

Is the evacuation behavior from Hurricanes Dennis and Floyd statistically equivalent?
Variable            Floyd B    Dennis B    Pooled B
PETS                 -0.66      -1.20       -0.79
MOBLHOME              1.56       2.00        1.62
TENURE               -0.02      -0.02       -0.02
EDUC                  0.05      -0.04        0.02
Constant             -0.92      -0.78       -0.97

Beginning -2 LL     687.36     440.87     1186.64
Ending -2 LL        641.84     382.84     1095.26
Model Chi-Square     45.52      58.02       91.37

Constructing the LR Test

                     Floyd     Dennis     Pooled
Ending -2 LL        641.84     382.84    1095.26

Chi-Square [Pooled - (Floyd + Dennis)]     70.58
DF                                          5
Critical Value (p = .01)                   15.086

Since the chi-squared value is greater than the critical value, the two sets of coefficients are statistically different. The pooled model is inappropriate.
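The test statistic is built from the three ending -2LL values. A minimal sketch of the arithmetic, with degrees of freedom equal to the number of independent variables plus one for the constant:

```python
# Ending -2LL values from the Floyd, Dennis, and pooled models.
neg2ll = {"Floyd": 641.84, "Dennis": 382.84, "Pooled": 1095.26}
n_regressors = 4  # PETS, MOBLHOME, TENURE, EDUC

# LR[i+1] = -2LL(pooled) - [-2LL(sample 1) + -2LL(sample 2)]
lr = neg2ll["Pooled"] - (neg2ll["Floyd"] + neg2ll["Dennis"])
df = n_regressors + 1

print(f"LR = {lr:.2f} with {df} degrees of freedom")
```

The statistic is far larger than any conventional chi-square critical value for 5 degrees of freedom, so the restriction of equal coefficients across the two storms is rejected.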

What should you do?

Try adding a dummy variable: FLOYD = 1 if Floyd, 0 if Dennis

Variable         B         Wald      Sig
PETS           -0.85      27.20     0.000
MOBLHOME        1.75      65.67     0.000
TENURE         -0.02       8.34     0.004
EDUC            0.02       0.27     0.606
FLOYD           1.26      59.08     0.000
Constant       -1.68       8.71     0.003
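One way to read the FLOYD coefficient is to compare predicted probabilities for the same household under each storm. The sketch below uses the coefficients from the table above; the household profile is hypothetical and chosen only for illustration.

```python
import math

# Coefficients from the pooled model with the FLOYD dummy.
COEF = {"const": -1.68, "PETS": -0.85, "MOBLHOME": 1.75,
        "TENURE": -0.02, "EDUC": 0.02, "FLOYD": 1.26}

def predict(household):
    """Predicted probability of evacuating from the logit index."""
    z = COEF["const"] + sum(COEF[k] * v for k, v in household.items())
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    return p / (1.0 - p)

# Hypothetical household (an assumption, not a row from the data).
base = {"PETS": 0, "MOBLHOME": 1, "TENURE": 10, "EDUC": 12}
p_dennis = predict({**base, "FLOYD": 0})
p_floyd = predict({**base, "FLOYD": 1})
odds_ratio = odds(p_floyd) / odds(p_dennis)

print(f"P(evacuate | Dennis) = {p_dennis:.3f}")
print(f"P(evacuate | Floyd)  = {p_floyd:.3f}")
print(f"Odds ratio for FLOYD = {odds_ratio:.3f}")
```

The dummy shifts the intercept only, so the odds ratio between the two storms equals exp(1.26) for every household, while the change in probability depends on the household's other characteristics.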

Writing Up Results
Present descriptive statistics in a table.
Make it clear that the dependent variable is discrete (0, 1) and not continuous and that you will use logistic regression.
Logistic regression is a standard statistical procedure so you don't (necessarily) need to write out the formula for it.
You also (usually) don't need to justify that you are using Logit instead of the LP model or Probit (similar to logit but based on the normal distribution [the tails are less fat]).

An Example:
"The dependent variable which measures the willingness to evacuate is EVAC. EVAC is equal to 1 if the respondent evacuated their home during Hurricanes Floyd and Dennis and 0 otherwise. The logistic regression model is used to estimate the factors which influence evacuation behavior."

Organize your regression results in a table:

In the heading, state your dependent variable (dependent variable = EVAC) and that these are "logistic regression results."
Present coefficient estimates, t-statistics (or Wald, whichever you prefer), and (at least the) model chi-square statistic for overall model fit.
If you are comparing several model specifications you should also present the % correct predictions and/or Pseudo-R2 statistics to evaluate model performance.
If you are comparing models with hypotheses about different blocks of coefficients or testing for structural breaks in the data, you could present the ending log-likelihood values.

An Example:
Table 2. Logistic Regression Results
Dependent Variable = EVAC

Variable               B         B/S.E.
PETS                 -0.6593     -3.28
MOBLHOME              1.5583      5.42
TENURE               -0.0198     -2.48
EDUC                  0.0501      1.07
Constant             -0.916      -1.33
Model Chi-Squared    45.515

When describing the statistics in the tables, point out the highlights for the reader. What are the statistically significant variables?

"The results from Model 1 indicate that coastal residents behave according to risk theory. The coefficient on the MOBLHOME variable is positive and statistically significant at the p < .01 level (t-value = 5.42). The odds of evacuating are 4.75 times higher for mobile home residents."

Is the overall model statistically significant?

"The overall model is significant at the .01 level according to the Model chi-square statistic. The model predicts 69.5% of the responses correctly. The McFadden's R2 is .066."

Which model is preferred?

"Model 2 includes three additional independent variables. According to the likelihood ratio test statistic, the partial model is superior to the full model in terms of overall model fit. The block chi-square statistic is not statistically significant at the .01 level (critical value = 11.35 [df=3]). The coefficients on the children, gender, and race variables are not statistically significant at standard levels."

Also
You usually don't need to discuss the magnitude of the coefficients--just the sign (+ or -) and statistical significance.
If your audience is unfamiliar with the extensions (beyond SPSS or SAS printouts) to logistic regression, discuss the calculation of the statistics in an appendix or footnote or provide a citation.
Always state the degrees of freedom for your likelihood-ratio (chi-square) test.

References

http://personal.ecu.edu/whiteheadj/data/logit/
http://personal.ecu.edu/whiteheadj/data/logit/logitpap.htm
E-mail: [email protected]
