
Modeling Binary Outcomes Using Binary Logistic Regression
Logistic Regression
In many studies the outcome variable of interest is the presence or absence of some condition, such as:
- survival status (alive or dead)
- responding or not to a treatment
- having an MI (myocardial infarction) or not
- birth weight status (normal or low)

For such outcome variables we cannot use ordinary multiple (linear) regression; instead we use the logistic regression method.
Cont…
Binary logistic regression is a form of regression used when the dependent variable is dichotomous and the independents are of any type.

Multinomial logistic regression handles the case of a dependent variable with more than two classes. When the classes of the dependent variable can be ordered, ordinal logistic regression is preferred to multinomial logistic regression.
Cont…
Logistic regression can be used to predict a dependent variable on the basis of continuous and/or categorical independents and to:
 determine the percent of variance in the dependent variable explained by the independents;
 rank the relative importance of the independents;
 assess interaction effects; and
 understand the impact of covariate control variables.
Cont…
Logistic regression applies maximum likelihood estimation after transforming the dependent variable into a logit variable (the natural log of the odds of the dependent occurring or not).

Logistic regression estimates the probability of a certain event occurring.

Note that logistic regression calculates changes in the log odds of the dependent variable, not changes in the dependent itself as OLS regression does.
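As a quick numerical illustration of the logit transform (a minimal sketch, not tied to any particular dataset):

```python
import numpy as np

# The logit transform: probability -> odds -> log odds
p = np.array([0.1, 0.5, 0.9])
odds = p / (1 - p)      # odds of the event occurring
logit = np.log(odds)    # natural log of the odds
print(odds)             # [0.111... 1.0 9.0]
print(logit)            # [-2.197... 0.0 2.197...]
```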
Cont…
Example: Consider the infant survival data to predict factors associated with low birth weight.

The outcome of interest is a dichotomous variable (BWt ≥ 2500 gm = 0, BWt < 2500 gm = 1).

By considering the age of the mother as the predictor variable, construct a scatter plot of the outcome and predictor variables.
[Figure: scatter plot of birth weight status (0/1) by age of the mother, ages 10 to 50.]
Cont…
We can see that this plot is less informative about the relationship between the outcome and the explanatory variable than when the outcome variable is continuous.

As an alternative to plotting the individual values of the outcome variable, we can group the data into, say, five-year age groups and calculate the mean value of LBW for each age group.
Cont…
The mean of the dichotomous random variable Y = LBW, designated by p, is the proportion of times that it takes the value 1 (success), i.e., BWt < 2500 gm (LBW).

p = P(Y = 1) = P("success")
Sample data: low birth weight by age of the mother

Age of mother    No     Yes   Total   % LBW
< 20             506    68    574     11.8
20-24            1207   123   1330    9.2
25-29            1163   101   1264    8.0
30-34            838    78    916     8.5
35-39            613    55    668     8.2
40-44            113    15    128     11.7
> 44             28     4     32      12.5
Total            4468   444   4912    9.0
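These group percentages can be reproduced directly from the counts in the table; a minimal sketch:

```python
# Percent low birth weight per age group (counts from the table above)
groups = {
    "< 20":  (506, 68),
    "20-24": (1207, 123),
    "25-29": (1163, 101),
    "30-34": (838, 78),
    "35-39": (613, 55),
    "40-44": (113, 15),
    "> 44":  (28, 4),
}
for age, (no, yes) in groups.items():
    pct = 100 * yes / (no + yes)    # proportion with BWt < 2500 gm
    print(f"{age:6s} {pct:.1f}%")   # matches the % LBW column
```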
[Figure: cumulative low birth weight rate (cumulative percent, 0 to 9) by age group of the mother: < 15, 15-19, 20-24, 25-29, 30-34, 35-39, 40-44, > 44.]
Cont…

For the data, 666 of 7873 newborns had birth weight < 2500 gm.

This gives an estimated low birth weight probability of 666/7873 = 0.085.

Overall, 8.5% of the newborns had BWt < 2500 gm.
Cont…

The statistical model that is generally preferred for the analysis of a binary response is the binary logistic regression model, stated in terms of the probability that Y = 1 given X, the values of the predictors:

Prob{Y = 1 | X} = [1 + exp(-Xβ)]^-1, where Xβ stands for β0 + β1X1 + β2X2 + … + βkXk

The regression parameters β are estimated by the method of maximum likelihood.

The function P = [1 + exp(-x)]^-1 is called the logistic function. The function has an unlimited range for x, while P is restricted to the range 0 to 1.
A logistic function or logistic curve is a common "S" shape (sigmoid curve):

P = e^(α + βx) / (1 + e^(α + βx))

[Figure: the logistic curve for α = 0 and β = 1, plotted for x from -6 to 6; P rises smoothly from near 0 to near 1.]
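A minimal sketch evaluating this function, showing that P stays between 0 and 1 however far x ranges:

```python
import numpy as np

def logistic(x, alpha=0.0, beta=1.0):
    # P = e^(alpha + beta*x) / (1 + e^(alpha + beta*x))
    return 1.0 / (1.0 + np.exp(-(alpha + beta * x)))

x = np.arange(-6, 7, 2)
print(np.round(logistic(x), 3))
# [0.002 0.018 0.119 0.5   0.881 0.982 0.998]
```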
Cont…
Thus, modeling the probability p with a logistic function is equivalent to fitting a linear regression in which the continuous response y has been replaced by the logarithm of the odds of success for a dichotomous random variable.

Instead of assuming that the relationship between p and x is linear, we assume that the relationship between ln[p/(1-p)] and x is linear. The technique of fitting a model of this form is known as logistic regression.

Logistic functions are thus used in logistic regression to model how the probability p of an event is affected by one or more explanatory variables.
Example
Fit a logistic regression model between low birth weight and age of the mother:

           B       S.E.   Wald   df  Sig.   Exp(B) (OR)   95% C.I. for Exp(B)
Age        -0.036  0.014  6.47   1   0.011  0.965         0.938 - 0.992
Constant   -1.302  0.366  12.64  1   0.000  0.272

ln[p̂/(1 - p̂)] = -1.302 - 0.036X

From the model, the coefficient of age implies that for a one-year increase in the age of the mother, the log odds that the newborn will be low birth weight decreases by 0.036. When the log odds decrease, the probability p also decreases.
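A minimal sketch of fitting this kind of model with statsmodels; the data below are synthetic stand-ins generated from the fitted equation above (the real estimates come from the SPSS output, not this code):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the infant survival data (illustrative only)
rng = np.random.default_rng(0)
age = rng.uniform(15, 45, size=5000)
p = 1 / (1 + np.exp(-(-1.302 - 0.036 * age)))  # equation from the slide
lbw = rng.binomial(1, p)
df = pd.DataFrame({"age": age, "lbw": lbw})

model = smf.logit("lbw ~ age", data=df).fit()
print(model.summary())        # B, S.E., Wald z, p-values
print(np.exp(model.params))   # odds ratios, comparable to Exp(B)
```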
Plot of estimated probability of LBW by mother's age

[Figure: the fitted probability of LBW declines smoothly from about 0.16 at age 10 to about 0.04 at age 50.]
Inference on coefficients

Example
Consider the systolic blood pressure data and let < 140 and ≥ 140 mmHg be the categories defining normal and elevated systolic blood pressure. Let the group ≥ 140 be elevated systolic BP; we want to identify factors that contribute to elevated systolic blood pressure.

Let's start by taking age as the independent variable to predict elevated systolic blood pressure:
The logistic regression model

           B       S.E.   Wald   df  Sig.   Exp(B)
age        0.062   0.011  32.01  1   0.000  1.064
Constant   -3.813  0.496  59.23  1   0.000  0.022
a. Variable(s) entered on step 1: age.
Estimating Probabilities
In order to estimate the probability that a person with a particular age will have elevated systolic blood pressure, we simply substitute the appropriate value of x into the preceding equation.

Example: to estimate the probability that a person aged 48 has elevated systolic blood pressure:

ln[p̂/(1 - p̂)] = -3.813 + 0.062(48) = -0.837
p̂/(1 - p̂) = e^(-0.837) = 0.433
p̂ = 0.433/(1 + 0.433) = 0.302
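The same calculation as a minimal sketch in code, using the coefficients from the fitted model above:

```python
import math

b0, b1 = -3.813, 0.062           # fitted intercept and age coefficient
age = 48
logit = b0 + b1 * age            # log odds of elevated SBP at age 48
p = 1 / (1 + math.exp(-logit))   # inverse logit -> probability
print(round(logit, 3), round(p, 3))   # -0.837 0.302
```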
Multiple Logistic Regression

We have seen that the age of the person influences the probability that the systolic blood pressure will be ≥ 140. What will be the fitted equation if weight is included in the model?

To model the probability p as a function of the two explanatory variables, we fit a model of the form:

ln[p/(1 - p)] = α + β1x1 + β2x2

where x1 designates age and x2 the weight of the person.
Both age and weight are in the model

           B       S.E.   Wald  df  Sig.   Exp(B)  95% C.I. for Exp(B)
age        0.059   0.011  26.2  1   0.000  1.06    1.037 - 1.084
weight     0.042   0.008  30.8  1   0.000  1.04    1.028 - 1.059
Constant   -6.976  0.810  74.1  1   0.000  0.00
a. Variable(s) entered on step 1: age, weight.
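A minimal sketch applying the fitted two-predictor equation (the age of 50 and weight of 70 are illustrative values, not from the source):

```python
import math

b0, b_age, b_wt = -6.976, 0.059, 0.042   # fitted coefficients
age, weight = 50, 70                     # hypothetical person
logit = b0 + b_age * age + b_wt * weight
p = 1 / (1 + math.exp(-logit))
print(round(p, 3))   # ~0.252: estimated probability of elevated SBP
```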
Like the linear regression model, the logistic regression model can be generalized to include discrete or nominal variables in addition to continuous ones.

Suppose that in the above model we include smoking and sex:

ln[p̂/(1 - p̂)] = a + b1x1 + b2x2 + b3x3 + b4x4 + b5x5

where x1, x2, x3, x4, and x5 represent respectively age, weight, smoking and sex.
Age, weight, sex and smoking are all in the model
Adding unordered categorical independent variables

 Choose one level of the variable to be a baseline/reference (better if it is the largest group of subjects).

 Create DUMMY (INDICATOR) variables coded 0 or 1 for each level of the variable except the baseline/reference.

 Include all dummy variables in the model.

 Or use the categorical option in SPSS, which automatically creates the dummy variables for you (a sketch of the equivalent in Python follows below).
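A minimal sketch of the same dummy coding in Python with pandas (hypothetical data; `drop_first=True` drops the reference level):

```python
import pandas as pd

# Hypothetical smoking variable with three levels
df = pd.DataFrame({"smoking": ["never", "former", "current",
                               "never", "current"]})

# One 0/1 indicator per level except the reference; with drop_first=True
# the first level in sorted order ("current") serves as the baseline
dummies = pd.get_dummies(df["smoking"], prefix="smoking", drop_first=True)
print(dummies)   # columns: smoking_former, smoking_never
```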
Adding ordered categorical variables

First examine it as a categorical variable.

If there is evidence of a dose-response relation (increasing/decreasing OR):
◦ Repeat the analyses in SPSS, and this time do not tell SPSS it is a categorical variable.
◦ Examine the results for evidence of a dose-response relation (trend).

Both approaches are sketched below.
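A minimal sketch of the two approaches with the statsmodels formula API (synthetic data; `agegrp` is a hypothetical ordered group code 1-5 and `y` a binary outcome):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: ordered age-group codes 1..5 and a binary outcome
rng = np.random.default_rng(1)
agegrp = rng.integers(1, 6, size=800)
y = rng.binomial(1, 1 / (1 + np.exp(-(-2.0 + 0.3 * agegrp))))
df = pd.DataFrame({"agegrp": agegrp, "y": y})

# 1) Treat the ordered variable as categorical: one dummy per level
m_cat = smf.logit("y ~ C(agegrp)", data=df).fit()

# 2) Treat it as a single numeric trend (dose-response) term
m_trend = smf.logit("y ~ agegrp", data=df).fit()

print(m_cat.params)     # level-by-level log odds ratios
print(m_trend.params)   # a single slope for the trend
```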
Main points in Logistic Regression

Unlike OLS regression, logistic regression does not:
 assume linearity of the relationship between the independent variables and the dependent,
 require normally distributed variables,
 assume homoscedasticity,
and in general has less stringent requirements.

It does, however, require that observations are independent and that the logit of the dependent variable is linearly related to the independent variables.
Why do we use logistic regression rather than ordinary linear regression for a binary dependent variable?

We use logistic regression with a binary dependent variable for the following reasons:

1. If you use linear regression, the predicted values will become greater than one or less than zero if you move far enough on the X-axis. Such values are theoretically inadmissible.

2. One of the assumptions of linear regression is that the variance of Y is constant across values of X. This cannot be the case with a binary variable, because the variance is P(1-P).
Cont…
When 50 percent of the people are 1s, the variance is 0.25, its maximum value. As we move to more extreme values, the variance decreases. When P = 0.10, the variance is 0.1 × 0.9 = 0.09; so as P approaches 1 or 0, the variance approaches zero.

3. The significance testing of the b weights rests upon the assumption that errors of prediction are normally distributed. Because Y only takes the values 0 and 1, this assumption is pretty hard to justify, even approximately.
Cont…
There are two main uses of logistic regression. The first is the prediction of group membership. Since logistic regression calculates the probability of success over the probability of failure, the results of the analysis are in the form of an odds ratio.

Logistic regression is often used in epidemiological studies where the result of the analysis is the probability of developing a disease after controlling for other associated risks or factors.
Logistic Regression Model building

The goal of logistic regression is to correctly predict the category of outcome for individual cases using the most parsimonious model.

To accomplish this goal, a model is created that includes all predictor variables that are useful in predicting the response variable.

Several different options are available during model creation. Variables can be entered into the model in the order specified by the researcher, or logistic regression can test the fit of the model after each coefficient is added or deleted; this is called stepwise regression.
Cont…
Stepwise regression is used in the exploratory phase of research, but it is not recommended for theory testing (Menard 1995).

Theory testing is the testing of a-priori theories or hypotheses about the relationships between variables.

Exploratory testing makes no a-priori assumptions regarding the relationships between the variables; the goal is to discover relationships.
Cont…
Backward stepwise regression appears to be the preferred method for exploratory analyses: the analysis begins with a full or saturated model, and variables are eliminated from the model in an iterative process.

The fit of the model is tested after the elimination of each variable to ensure that the model still adequately fits the data. When no more variables can be eliminated from the model, the analysis is complete.
Cont…
Logistic regression also provides knowledge of the relationships and strengths among the variables (e.g., smoking 10 packs a day puts a person at a higher risk for developing cancer than working in an asbestos mine).

The process by which coefficients are tested for significance for inclusion in or elimination from the model involves several different techniques.
Cont…
The success of the logistic regression can be assessed by looking at the classification table, which shows correct and incorrect classifications of the dichotomous, ordinal, or polychotomous dependent variable.

Also, goodness-of-fit tests such as the model chi-square are available as indicators of model appropriateness, as is the Wald statistic for testing the significance of individual independent variables.
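A minimal sketch of producing a classification table from a fitted statsmodels logit model (synthetic data; `pred_table` cross-tabulates observed against predicted outcomes at a 0.5 cutoff):

```python
import numpy as np
import statsmodels.api as sm

# Tiny synthetic example: fit a logit, then tabulate classifications
rng = np.random.default_rng(2)
x = rng.normal(size=300)
y = rng.binomial(1, 1 / (1 + np.exp(-x)))
model = sm.Logit(y, sm.add_constant(x)).fit()

# Rows = observed (0, 1); columns = predicted (0, 1) at cutoff 0.5
table = model.pred_table(threshold=0.5)
print(table)
print(round(table.trace() / table.sum(), 3))   # fraction correctly classified
```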
Wald Test

A Wald test is used to test the statistical significance of each coefficient (β) in the model. A Wald test calculates a Z statistic, which is:

Z = b / se(b)

This z value is then squared, yielding a Wald statistic with a chi-square distribution:

Wald χ² = [b / se(b)]²
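A minimal sketch of the Wald test for the age coefficient in the SBP model (b and se are taken from the output above; the small discrepancy from the SPSS value 32.007 comes from rounding of the displayed b and S.E.):

```python
from scipy.stats import chi2

b, se = 0.062, 0.011
z = b / se                     # Wald Z statistic
wald = z ** 2                  # chi-square statistic with 1 df
p_value = chi2.sf(wald, df=1)
print(round(wald, 2), p_value)   # ~31.77, p < 0.001
```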
Logistic regression

Model: log odds = β0 + β1x

Null hypothesis: β1 = 0 (or exp(β1) = OR = 1)

LRT statistic = -2 × (LL if β1 = 0 - LL if β1 = b1)
             = -2 × (L excluding x - L including x)

where "L excluding x" is the log likelihood of the model fitted without x, and "L including x" is the log likelihood of the model fitted with x.
Likelihood-Ratio Test
The likelihood-ratio test uses the ratio of the maximized value of the likelihood function for the full model over the maximized value of the likelihood function for the simpler model.

The likelihood-ratio test statistic equals:

-2(LL0 - LL1)

where LL0 and LL1 are the log likelihoods of the simpler and full models. This log transformation of the likelihood functions yields a chi-squared statistic.
Likelihood ratio test
This is the recommended test statistic to use when building a model through backward stepwise elimination.

The statistic is -2 × (the difference between the log likelihood at the hypothesised value of β1 and the log likelihood at the MLE b1).

It is distributed as a chi-squared distribution with 1 degree of freedom.
Example: From logistic regression on SBP

• Effect of age on risk of elevated SBP

           B       S.E.   Wald    df  Sig.   Exp(B)
age        .062    .011   32.007  1   .000   1.064
Constant   -3.813  .496   59.229  1   .000   .022

Wald test for the effect of age: χ² = (.062/.011)² ≈ 32, df = 1, p < 0.001.
Example: From logistic regression

Effect of age on risk of elevated SBP: the likelihood ratio test for adding age to the model.

The -2 log likelihood for the model with age (491.405) is shown in the SPSS model summary.

SPSS doesn't give us the -2 log likelihood for the model with just the intercept (527.108), but if it did, 527.108 - 491.405 would give us 35.703, as shown in the Omnibus Tests box.
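A minimal sketch of this likelihood ratio test, using the two -2 log likelihood values quoted above:

```python
from scipy.stats import chi2

neg2ll_null = 527.108   # -2 log likelihood, intercept-only model
neg2ll_age = 491.405    # -2 log likelihood, model with age
lrt = neg2ll_null - neg2ll_age   # = -2(LL0 - LL1) = 35.703
p_value = chi2.sf(lrt, df=1)     # one parameter (age) added
print(round(lrt, 3), p_value)
```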
LRT in logistic regression
The likelihood ratio test can be used to look at the effect of adding more than one parameter (e.g. when x is a factor with more than 2 levels, or when we want to see the effect of adding several variables at once).

Just fit the model with and without these parameters (e.g. with and without x) and compare twice the difference in the log likelihood of the two models to a chi-squared distribution.
Cont…
The degrees of freedom equal the number of parameters added, i.e. if x has 3 levels then df = 3 - 1 = 2.
