
Introduction to Logistic Regression

Rachid Salmi, Jean-Claude Desenclos, Alain Moren, Thomas Grein


Content

• Simple and multiple linear regression


• Simple logistic regression
– The logistic function
– Estimation of parameters
– Interpretation of coefficients
• Multiple logistic regression
– Interpretation of coefficients
– Coding of variables
• Examples in Egret

• Modelling tomorrow
Simple linear regression
Table 1 Age and systolic blood pressure (SBP) among 33 adult women

Age SBP Age SBP Age SBP


22 131 41 139 52 128
23 128 41 171 54 105
24 116 46 137 56 145
27 106 47 111 57 141
28 114 48 115 58 153
29 123 49 133 59 157
30 117 49 128 63 155
32 122 50 183 67 176
33 99 51 130 71 172
35 121 51 133 77 178
40 147 51 144 81 217
[Scatter plot: SBP (mm Hg) versus age (years) for the data in Table 1]

adapted from Colton T. Statistics in Medicine. Boston: Little Brown, 1974


Simple linear regression
• Relation between 2 continuous variables (SBP and age)

y = α + β1x1      (β1 = slope)
• Regression coefficient β1
– Measures association between y and x
– Amount by which y changes on average when x changes by one unit
– Least squares method
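As a minimal sketch (not part of the original slides), the least-squares fit for the Table 1 data can be reproduced in Python; numpy is assumed to be available and the variable names are chosen for illustration.

```python
import numpy as np

# Age and SBP pairs transcribed from Table 1 (33 adult women)
age = np.array([22, 23, 24, 27, 28, 29, 30, 32, 33, 35, 40,
                41, 41, 46, 47, 48, 49, 49, 50, 51, 51, 51,
                52, 54, 56, 57, 58, 59, 63, 67, 71, 77, 81])
sbp = np.array([131, 128, 116, 106, 114, 123, 117, 122, 99, 121, 147,
                139, 171, 137, 111, 115, 133, 128, 183, 130, 133, 144,
                128, 105, 145, 141, 153, 157, 155, 176, 172, 178, 217])

# Least-squares estimates of the slope (beta1) and intercept (alpha)
beta1, alpha = np.polyfit(age, sbp, deg=1)
print(f"SBP = {alpha:.1f} + {beta1:.2f} * age")
```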
Multiple linear regression

• Relation between a continuous variable and a set of i continuous variables

y = α + β1x1 + β2x2 + ... + βixi

• Partial regression coefficients βi

– Amount by which y changes on average when xi changes by one
unit and all the other xi remain constant
– Measures association between xi and y adjusted for all other xi

• Example
– SBP versus age, weight, height, etc
Multiple linear regression

y = α + β1x1 + β2x2 + ... + βixi

Dependent variable       Independent variables
Predicted variable       Predictor variables
Response variable        Explanatory variables
Outcome variable         Covariables
Multivariate analysis

Model                  Outcome

Linear regression      continuous
Poisson regression     counts
Cox model              survival
Logistic regression    binomial
...

• Choice of the tool according to the study, its objectives, and the variables
– Control of confounding
– Model building, prediction
Logistic regression

• Models the relationship between a set of variables xi


– dichotomous (eat : yes/no)
– categorical (social class, ... )
– continuous (age, ...)

and

– dichotomous variable Y

• Dichotomous (binary) outcome is the most common situation in biology and epidemiology
Logistic regression (1)

Table 2 Age and signs of coronary heart disease (CD)


How can we analyse these data?

• Comparison of the mean age of diseased and non-diseased women

– Non-diseased: 38.6 years
– Diseased: 58.7 years (p < 0.0001)

• Linear regression?
Dot-plot: Data from Table 2
Logistic regression (2)

Table 3 Prevalence (%) of signs of CD according to age group


Dot-plot: Data from Table 3

[Axes: Diseased % (0 to 100) versus age group]
The logistic function (1)
[Figure: probability of disease (0.0 to 1.0) as an S-shaped function of x]
The logistic function (2)

logit of P(y|x) = ln [ P(y|x) / (1 - P(y|x)) ] = α + βx
The logistic function (3)

• Advantages of the logit


– Simple transformation of P(y|x)
– Linear relationship with x
– Can be continuous (logit ranges from -∞ to +∞)
– Known binomial distribution (P between 0 and 1)
– Directly related to the notion of odds of disease

ln [ P / (1 - P) ] = α + βx          P / (1 - P) = e^(α + βx)
Interpretation of β (1)

P / (1 - P) = e^(α + βx) = e^α × (e^β)^x

so a one-unit increase in x multiplies the odds of disease by e^β (the odds ratio).
Interpretation of β (2)

• β = increase in log-odds for a one unit increase in x


• Test of the hypothesis that β = 0 (Wald test)

χ² = β² / Variance(β)   (1 df)

• Interval testing
Example

• Age (<55 and 55+ years) and risk of developing coronary heart disease (CD)
• Results of fitting Logistic Regression Model

ln [ P / (1 - P) ] = α + β1 × Age = -0.841 + 2.094 × Age
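A short worked sketch of what these fitted coefficients imply (added for illustration; Age is assumed here to be coded 0 for <55 and 1 for 55+ years):

```python
import math

alpha, beta = -0.841, 2.094          # intercept and Age coefficient from the slide

odds_young = math.exp(alpha)          # odds of CD when Age = 0 (<55 years)
odds_old = math.exp(alpha + beta)     # odds of CD when Age = 1 (55+ years)

print(f"OR (55+ vs <55) = {odds_old / odds_young:.2f}")        # exp(beta), about 8.1
print(f"P(CD | <55) = {odds_young / (1 + odds_young):.2f}")    # about 0.30
print(f"P(CD | 55+) = {odds_old / (1 + odds_old):.2f}")        # about 0.78
```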
Fitting equation to the data

• Linear regression: Least squares


• Logistic regression: Maximum likelihood
• Likelihood function
– Estimates parameters α and β with the property that the
likelihood (probability) of the observed data is higher than
for any other parameter values
– Practically easier to work with log-likelihood
L(β) = ln ℓ(β) = Σ i=1..n [ yi ln π(xi) + (1 - yi) ln (1 - π(xi)) ]

where π(xi) is the modelled probability P(y = 1 | xi)
Maximum likelihood

• Iterative computing
– Choice of an arbitrary value for the coefficients (usually 0)
– Computing of log-likelihood
– Variation of coefficients’ values
– Reiteration until maximisation (plateau)

• Results
– Maximum Likelihood Estimates (MLE) for α and β
– Estimates of P(y) for a given value of x
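One possible sketch of this iterative scheme is Newton-Raphson maximisation of the log-likelihood, starting from coefficients of 0 as the slide suggests (illustrative code only, not the algorithm of any particular package):

```python
import numpy as np

def fit_logistic(x, y, n_iter=25):
    """Iteratively maximise the logistic log-likelihood (Newton-Raphson)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    X = np.column_stack([np.ones_like(x), x])       # intercept (alpha) + covariate
    b = np.zeros(2)                                 # arbitrary starting values: 0
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-X @ b))           # current estimates of P(y=1 | x)
        score = X.T @ (y - pi)                      # gradient of the log-likelihood
        hessian = -(X.T * (pi * (1 - pi))) @ X      # curvature
        b = b - np.linalg.solve(hessian, score)     # update alpha and beta
    return b                                        # MLEs of alpha and beta
```

In practice the loop would stop when the log-likelihood stops changing (the plateau mentioned above) rather than after a fixed number of iterations.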
Multiple logistic regression

• More than one independent variable


– Dichotomous, ordinal, nominal, continuous …

ln [ P / (1 - P) ] = α + β1x1 + β2x2 + ... + βixi
• Interpretation of βi
– Increase in log-odds for a one unit increase in xi with all the
other xi constant
– Measures association between xi and log-odds adjusted for
all other xi
Multiple logistic regression

• Effect modification
– Can be modelled by including interaction terms

ln [ P / (1 - P) ] = α + β1x1 + β2x2 + β3 x1 × x2
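A hedged sketch of fitting such an interaction model with the statsmodels formula interface; the toy data frame and its columns y, x1, x2 are made up purely for illustration.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Toy data, for illustration only
df = pd.DataFrame({
    "x1": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "x2": [0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1],
    "y":  [0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0],
})

# 'x1 * x2' expands to x1 + x2 + x1:x2, i.e. both main effects plus the product term (beta3)
model = smf.logit("y ~ x1 * x2", data=df).fit()
print(model.summary())
```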
Statistical testing

• Question
– Does a model including a given independent variable
provide more information about the dependent variable than
the model without this variable?
• Three tests
– Likelihood ratio statistic (LRS)
– Wald test
– Score test
Likelihood ratio statistic

• Compares two nested models


Log(odds) = α + β1x1 + β2x2 + β3x3 + β4x4   (model 1)
Log(odds) = α + β1x1 + β2x2                 (model 2)

• LR statistic
-2 log (likelihood of model 2 / likelihood of model 1) =
-2 log (likelihood of model 2) minus -2 log (likelihood of model 1)

LR statistic is a χ² with DF = number of extra parameters in the larger model (here 2: β3 and β4)
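As a minimal sketch (assuming scipy is available), the LR statistic and its p-value can be computed from the two models' log-likelihoods:

```python
from scipy.stats import chi2

def lr_test(loglik_reduced, loglik_full, df_extra):
    """Likelihood ratio statistic for two nested models and its chi-square p-value."""
    lr = -2 * (loglik_reduced - loglik_full)
    return lr, chi2.sf(lr, df_extra)

# e.g. lr, p = lr_test(loglik_model2, loglik_model1, df_extra=2)
```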
Example
P = probability of cardiac arrest
Exc: 1 = lack of exercise, 0 = exercise
Smk: 1 = smoker, 0 = non-smoker

ln [ P / (1 - P) ] = α + β1 Exc + β2 Smk
                   = 0.7102 + 1.0047 Exc + 0.7005 Smk
                              (SE 0.2614)  (SE 0.2664)

adapted from Kerr, Handbook of Public Health Methods, McGraw-Hill, 1998


• Interactive effect between smoking and exercise?

ln [ P / (1 - P) ] = α + β1 Exc + β2 Smk + β3 Smk × Exc

• Product term β3 = -0.4604 (SE 0.5332)

Wald test = 0.75 (1 df)

-2log(L) = 342.092 with interaction term
         = 342.836 without interaction term

→ LR statistic = 0.74 (1 df), p = 0.39

→ No evidence of any interaction
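These two tests can be checked directly from the numbers quoted above (a small verification sketch, added for illustration):

```python
from scipy.stats import chi2

beta3, se3 = -0.4604, 0.5332                       # product term and its SE
print(f"Wald = {(beta3 / se3) ** 2:.2f} (1 df)")   # about 0.75

m2ll_with, m2ll_without = 342.092, 342.836         # -2 log(L) with / without interaction
lr = m2ll_without - m2ll_with
print(f"LR = {lr:.2f}, p = {chi2.sf(lr, 1):.2f}")  # about 0.74, p about 0.39
```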
Coding of variables (1)

• Dichotomous variables: yes = 1, no = 0


• Continuous variables
– Increase in OR for a one unit change in exposure
variable
– Logistic model is multiplicative →
OR increases exponentially with x
» If OR = 2 for a one unit change in exposure and x increases
from 2 to 5: OR = 2 × 2 × 2 = 2³ = 8

– Verify that OR increases exponentially with x.
When in doubt, treat as a qualitative variable
Continuous variable?
• Relationship between SBP>160 mmHg and body weight

• Introduce BW as continuous variable?


– Code weight as a single variable, e.g. 3 equal classes:
40-60 kg = 0, 60-80 kg = 1, 80-100 kg = 2

– Compatible with assumption of multiplicative model


– If not compatible, use indicator variables
Coding of variables (2)

• Nominal variables or ordinal variables with unequal classes:
– Tobacco smoked: no = 0, grey = 1, brown = 2, blond = 3
– Model assumes that OR for blond vs. no tobacco = (OR for grey vs. no tobacco)³
– Use indicator variables (dummy variables)
Indicator variables: Type of tobacco

• Neutralises the artificial hierarchy between classes in the variable "type of tobacco"
• No assumptions made
• 3 variables (3 df) in model using same reference
• OR for each type of tobacco adjusted for the others in
reference to non-smoking
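A minimal sketch of this dummy coding in Python with pandas (the category labels are taken from the previous slide; non-smoking is kept as the reference by dropping its column):

```python
import pandas as pd

# Illustrative records of the nominal variable "type of tobacco"
tobacco = pd.Series(["no", "grey", "brown", "blond", "no", "brown"], name="tobacco")

# Three indicator (dummy) variables, all using non-smokers ("no") as the reference
indicators = pd.get_dummies(tobacco).drop(columns="no")
print(indicators)
```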
Examples using Egret
Example 1: Low Birth Weight Study

• 198 observations
• Low Birth Weight [LBW]
– 1= Birth weight < 2500g
– 0= Birth weight >= 2500g
• Age of mother in years
• Weight of mother in pounds [LWT]
• Race (1,2,3)
• Number of doctor’s visits in the last trimester [FTV]
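A hedged sketch of how such a model could be fitted today with statsmodels instead of Egret; the file name and column names (LOW, AGE, LWT, RACE, FTV) are assumptions for illustration, not part of the original example.

```python
import pandas as pd
import statsmodels.formula.api as smf

lbw = pd.read_csv("lbw.csv")   # assumed file with columns LOW (0/1), AGE, LWT, RACE (1-3), FTV

# C(RACE) turns the 3-level nominal variable into indicator variables (race 1 as reference)
model = smf.logit("LOW ~ AGE + LWT + C(RACE) + FTV", data=lbw).fit()
print(model.summary())         # coefficients are log-odds ratios adjusted for the other terms
```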
Example 2: Risk of death from bacterial
meningitis according to treatment
• 161 observations
• Death (0,1)
• Treatment
– 1 = Chloramphenicol, 2 = Ampicillin
• Delay before treatment (onset, in days)
• Convulsions (1,0)
• Level of consciousness (1-3)
• Severity of dehydration (1-3)
• Age in years
• Pathogen
– 1 = Others, 2 = HiB, 3 = Streptococcus pneumoniae
Reference

• Hosmer DW, Lemeshow S. Applied Logistic Regression. New York: Wiley & Sons, 1989.
