• When and Why do we Use Logistic
Regression?
• Binary
• Multinomial
• Theory Behind Logistic Regression
• Assessing the Model
• Assessing predictors
• Things that can go Wrong
• Interpreting Logistic Regression
When And Why
• To predict an outcome variable that is
categorical from one or more categorical or
continuous predictor variables.
• Used because having a categorical outcome
variable violates the assumption of linearity in
normal regression.
• Does not assume a linear relationship
between DV and IV
No assumptions about the distributions of the predictor
variables.
Predictors do not have to be normally distributed.
Logistic regression does not make any assumptions of normality, linearity,
and homogeneity of variance for the independent variables.
Because it does not impose these requirements, logistic regression is preferred to
discriminant analysis when the data do not satisfy these assumptions.
• Logistic regression is used to analyze relationships between a
dichotomous dependent variable and continuous or dichotomous
independent variables.
• Logistic regression combines the independent variables to
estimate the probability that a particular event will occur, i.e., that a
subject will be a member of one of the groups defined by the
dichotomous dependent variable.
$$P(Y) = \frac{1}{1 + e^{-(b_0 + b_1 X_1 + \varepsilon_i)}}$$
• Outcome
• We predict the probability of the outcome
occurring
• b0 and b1
• Can be thought of in much the same way as
multiple regression
• Note the normal regression equation forms part
of the logistic regression equation
$$P(Y) = \frac{1}{1 + e^{-(b_0 + b_1 X_1 + b_2 X_2 + \dots + b_n X_n + \varepsilon_i)}}$$
• Outcome
• We still predict the probability of the
outcome occurring
• Differences
• Note the multiple regression equation forms
part of the logistic regression equation
• This part of the equation expands to
accommodate additional predictors
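To make the equation concrete, here is a minimal Python sketch of the logistic function; the coefficient and predictor values are invented for illustration and are not taken from any example in these slides.

```python
import math

def logistic_probability(b0, coefs, xs):
    """P(Y) = 1 / (1 + e^-(b0 + b1*x1 + ... + bn*xn)).
    coefs and xs are parallel lists of coefficients and predictor values."""
    linear_part = b0 + sum(b * x for b, x in zip(coefs, xs))  # the regression equation
    return 1 / (1 + math.exp(-linear_part))

# Single predictor (hypothetical values): b0 = -2.0, b1 = 0.05, X1 = 60
print(logistic_probability(-2.0, [0.05], [60]))             # ≈ 0.73
# Two predictors (also hypothetical):
print(logistic_probability(-1.5, [0.8, -0.3], [2.0, 1.0]))  # ≈ 0.45
```

Note how the familiar regression equation sits inside the exponent, and adding predictors only lengthens that linear part.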
$$\text{logit}(p) = \alpha + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p$$
α represents the baseline disease risk: the log odds of disease when all predictors are zero.
β1 represents the amount by which the log odds of disease change for a unit
change in X1.
β2 is the amount by which the log odds of disease change for a unit change in
X2.
……. and so on.
What changes is the log odds. The odds themselves are multiplied by e^β for each unit change in the predictor.
If β = 1.6, the odds are multiplied by e^1.6 ≈ 4.95.
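As a quick check of this arithmetic in code (β = 1.6 is the slide's own example):

```python
import math

beta = 1.6
odds_ratio = math.exp(beta)  # e^1.6
print(round(odds_ratio, 2))  # 4.95: a unit change in the predictor
                             # multiplies the odds by about 4.95
```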
Measuring the Probability of Outcome
The probability of the outcome is measured by the odds of occurrence
of an event.
If P is the probability of an event, then (1-P) is the probability of it not
occurring.
$$\text{Odds of success} = \frac{P}{1 - P}$$
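A one-line sketch of this conversion (the probability values are just illustrations):

```python
def odds(p):
    """Convert a probability P into odds: P / (1 - P)."""
    return p / (1 - p)

print(odds(0.8))  # 4.0, i.e. odds of 4 to 1
print(odds(0.5))  # 1.0, i.e. even odds
```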
• Forced Entry: All variables entered simultaneously.
• Hierarchical: Variables entered in blocks.
• Blocks should be based on past research or the theory being tested. This is a
good method.
• Stepwise: Variables entered on the basis of statistical criteria
(i.e. relative contribution to predicting outcome).
• Should be used only for exploratory analysis.
Stage 1:
Objectives of Logistic Regression
• Identify the independent variables that impact the
dependent variable.
• Establish a classification system, based on the
logistic model, for determining group
membership.
• BINARY LOGISTIC REGRESSION
It is used when the dependent variable is dichotomous.
• MULTINOMIAL LOGISTIC REGRESSION
It is used when the dependent (outcome) variable has more than
two categories.
[Diagrams: independent variable(s) → dependent variable, for the binary and multinomial cases]
Binary logistic regression expression:
$$\text{logit}(Y) = \beta_0 + \beta_1 X_1 + \varepsilon$$
Y = dependent variable
β0 = constant
β1 = coefficient of variable X1
X1 = independent variable
ε = error term
• Very small samples carry a great deal of sampling
error.
• A very large sample size decreases the chance of
error.
• Logistic regression requires a larger sample size than
multiple regression.
• Researchers recommend a sample size greater
than 400.
Sample Size Per Category of the Independent Variable
The recommended sample size for each group is at
least 10 observations per estimated parameter.
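A small helper that applies this rule of thumb; counting the intercept as an estimated parameter is an assumption here, since conventions vary:

```python
def minimum_group_size(n_predictors, per_parameter=10):
    """Rule of thumb: at least 10 observations per estimated parameter
    (predictors plus intercept, by assumption) in each outcome group."""
    n_parameters = n_predictors + 1  # +1 for the intercept (assumed counted)
    return per_parameter * n_parameters

print(minimum_group_size(2))  # 30 observations per group for 2 predictors
```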
The logistic relationship described earlier is used both in estimating the logistic
model and in establishing the relationship between the dependent
and independent variables.
The result is a unique transformation of the dependent variable, which
impacts not only the estimation process but also the resulting
coefficients of the independent variables.
Maximum Likelihood Estimation (MLE)
MLE is a statistical method for estimating the
coefficients of a model.
The likelihood function (L) measures the
probability of observing the particular set of
dependent variable values (p1, p2, ..., pn) that
occur in the sample:
$$L = p_1 \times p_2 \times \dots \times p_n$$
The higher L is, the higher the probability of
observing these values in the sample.
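In practice the product is computed on the log scale to avoid numerical underflow; a minimal sketch for a binary outcome, with made-up outcomes and fitted probabilities:

```python
import math

def log_likelihood(y, p):
    """Log-likelihood of observed binary outcomes y (0/1) given
    predicted probabilities p; summing logs is equivalent to
    multiplying the individual probabilities."""
    return sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
               for yi, pi in zip(y, p))

y = [1, 0, 1, 1]            # hypothetical observed outcomes
p = [0.8, 0.3, 0.6, 0.9]    # hypothetical model probabilities
print(log_likelihood(y, p))  # ≈ -1.20; MLE chooses coefficients maximizing this
```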
[Figure: the logit model produces an S-shaped curve whose range is bounded between 0 and 1, unlike the linear probability (LP) model.]
The data used to conduct logistic regression is from a survey of 30
homeowners conducted by an electricity company about an offer of roof
solar panels with a 50% subsidy from the state government as part of the
state’s environmental policy.
The variables are:
IVs: household income (measured in units of a thousand dollars), age of
householder, monthly mortgage, size of family household
DV: whether the householder would take or decline the offer.
Taking the offer was coded as 1 and declining the offer was coded as 0.
To determine whether household income and monthly
mortgage will predict taking or declining the solar
panel offer
Independent Variables: household income and monthly
mortgage
Dependent Variable: take the offer or decline the offer
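A sketch of how this analysis could be run in Python with statsmodels; the file name and column names (take_offer, income, mortgage) are assumptions for illustration, as the survey data are not reproduced in these slides:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical file and column names; the actual survey data are not shown.
df = pd.read_csv("solar_survey.csv")  # take_offer (1/0), income, mortgage

# Binary logistic regression of take_offer on income and mortgage
model = smf.logit("take_offer ~ income + mortgage", data=df).fit()
print(model.summary())        # coefficients are on the log-odds scale
print(np.exp(model.params))   # exponentiate to get odds ratios
```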
Two hypotheses to be tested
• There are two hypotheses to test in relation to the overall fit of the
model:
H0: The model is a good-fitting model.
H1: The model is not a good-fitting model (i.e., the predictors have a
significant effect).
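One standard way to test these hypotheses is a likelihood-ratio test of the fitted model against the intercept-only (null) model; a sketch, continuing the hypothetical statsmodels fit above:

```python
from scipy import stats

# Likelihood-ratio statistic: twice the gain in log-likelihood over
# the intercept-only model (statsmodels exposes both values).
lr_stat = 2 * (model.llf - model.llnull)
p_value = stats.chi2.sf(lr_stat, df=model.df_model)  # df = number of predictors
print(f"LR chi-square = {lr_stat:.2f}, p = {p_value:.4f}")
# A small p-value supports H1: the predictors significantly improve the model.
```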