2 Dealing With Logistic Regression
1. Overall Understanding
Logistic Regression
Definition: Logistic regression is a statistical model that, in its basic form, uses a logistic function to model a binary dependent variable. In regression analysis, logistic regression (or logit regression) estimates the parameters of a logistic model (a form of binary regression).
Binary Logistic Regression
When to use: Logistic regression is the statistical technique used to predict the relationship between predictors (our independent variables) and a predicted variable (the dependent variable) where the dependent variable is binary (e.g., sex [male vs. female], response [yes vs. no], score [high vs. low], etc.).
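To make this concrete, the logistic function maps a linear combination of the predictors onto a probability between 0 and 1 (this is the standard formulation, added here for reference):

\[
\Pr(Y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}
\]

where Y is the binary dependent variable, x_1, ..., x_k are the predictors, and \beta_0, ..., \beta_k are the parameters the regression estimates.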
Necessary Logistic Regression Diagnosis
- Specification error (pre-estimation): perform linktest
- Goodness of fit: perform lfit
- Other diagnostics (post-estimation): perform fitstat
- Multicollinearity: perform collin
- Heteroscedasticity: perform hettest
Note: Ordinal logistic regression, by contrast, is used when the dependent variable is ordered, that is, when it has a meaningful order and more than two categories (or levels).
1. Specification error
The linktest command checks for specification error. After a model is run (in this case, logit or logistic), linktest uses the linear predicted value (_hat) and the linear predicted value squared (_hatsq) as the predictors to rebuild the model. The variable _hat should be a statistically significant predictor, since it is the predicted value from the model.
Step 1: run the logistic regression (logit DV IVs) [result appears]
Step 2: run linktest [results for _hat and _hatsq appear]
Step 3: interpret the output to assess specification error
Decision:
1. Check the _hat value; it should be statistically significant.
2. Check the _hatsq value; it should not be statistically significant.
3. If conditions 1 and 2 come out as desired, the result confirms that we have chosen meaningful predictors and that the model has no obvious specification error.
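A minimal Stata sketch of Steps 1-3, using placeholder variable names (hiqual as the binary dependent variable, avg_ed and yr_rnd as illustrative predictors; these names are not from the original):

* Step 1: fit the logistic regression
logit hiqual avg_ed yr_rnd

* Step 2: specification error test; linktest refits the model using
* _hat (the linear prediction) and _hatsq (its square) as predictors
linktest

* Step 3: in the linktest output, _hat should be statistically
* significant and _hatsq should not be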
2. Goodness of fit
We have seen from our previous lessons that Stata’s output of logistic regression contains the
log likelihood chi-square and pseudo R-square for the model.
The log likelihood chi-square is an omnibus test to see if the model as a whole is statistically
significant.
Another commonly used test of model fit is the Hosmer and Lemeshow’s goodness-of-fit test.
The idea behind the Hosmer and Lemeshow’s goodness-of-fit test is that the predicted
frequency and observed frequency should match closely, and that the more closely they
match, the better the fit. The Hosmer-Lemeshow goodness-of-fit statistic is computed as the
Pearson chi-square from the contingency table of observed frequencies and expected
frequencies. Similar to a test of association for a two-way table, a good fit as measured by Hosmer and Lemeshow's test will yield a large p-value. When there are continuous predictors in the model, there will be many cells defined by the predictor variables, making a very large contingency table, which would too often yield a significant result. A common practice, therefore, is to combine the patterns formed by the predictor variables into 10 groups and form a 2-by-10 contingency table.
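As a worked formulation (standard in the literature, not shown in the original), the Hosmer-Lemeshow statistic over G groups compares observed and expected event counts:

\[
H = \sum_{g=1}^{G} \frac{(O_g - E_g)^2}{E_g \left(1 - E_g / n_g\right)}
\]

where O_g and E_g are the observed and expected numbers of events in group g and n_g is the group size; H is referred to a chi-square distribution with G - 2 degrees of freedom.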
Step 1: run logistic regression (logit DV IVs) [Result appears]
Step 2: lfit [Result appears]
Decision:
If the p-value is greater than 0.05 (5%), we can say that Hosmer and Lemeshow's goodness-of-fit test indicates that our model fits the data well.
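A sketch of the corresponding commands, again with placeholder variable names (lfit is the older name for what newer Stata versions call estat gof; the group(10) option requests the 10-group table described above):

* Step 1: fit the model
logit hiqual avg_ed yr_rnd

* Step 2: Hosmer-Lemeshow goodness-of-fit test with 10 groups;
* a p-value above 0.05 indicates the model fits the data well
lfit, group(10)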
3. Other Diagnostics
There are many other measures of model fit, such as the AIC (Akaike Information Criterion) and the BIC (Bayesian Information Criterion); the user-written fitstat command reports these along with many other fit statistics.
Note: fitstat is often used to compare models, but it should only be used to compare nested models.
Decision:
1. fitstat reports a wide range of diagnostic statistics for the estimated model.
2. These statistics give readers meaningful insight into how well the model fits the data.
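A sketch of a typical fitstat workflow for comparing two nested models (fitstat is a user-written command by Long and Freese; the variable names are placeholders):

* install the user-written command if needed
* findit fitstat

* fit the full model and save its fit statistics under the name m1
logit hiqual avg_ed yr_rnd
fitstat, saving(m1)

* fit a nested model and compare its fit statistics with m1
logit hiqual avg_ed
fitstat, using(m1)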
4. Multicollinearity
Multicollinearity (or collinearity for short) occurs when two or more independent variables in the model are approximately determined by a linear combination of other independent variables in the model. When collinearity is perfect, Stata issues a note informing us that a variable (in the example output, yr_rnd) has been dropped from the model due to collinearity. We cannot assume that the variable Stata drops is the "correct" variable to omit from the model; rather, we need to rely on theory to determine which variable should be omitted.
Decision:
As a rule of thumb, a tolerance of 0.1 or less (equivalently, a VIF of 10 or greater) is cause for concern; a sketch of the collin command follows.
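A sketch of the collinearity check with the user-written collin command (install via findit collin; the variable list is illustrative):

* collin takes the list of independent variables directly,
* so no model needs to be fit first
collin avg_ed yr_rnd meals

* in the output, a tolerance of 0.1 or less (VIF of 10 or more)
* flags a problematic variable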
Best Wishes