Logistic regression_2021 ch-8

The document outlines the analysis of categorical data, focusing on methods to explore relationships between categorical variables, estimate associations, and calculate probabilities using statistical techniques such as Chi-squared tests and logistic regression. It includes definitions, hypothesis testing, examples, and interpretations of results, particularly in health-related datasets. Additionally, it discusses the assumptions and limitations of these statistical methods.

Analysis of Categorical Data

Haileab Fekadu (MPH, Ass’t Professor)

Department of Epidemiology and Biostatistics


Institute of Public Health
University of Gondar

haileabfekadu@gmail.com

July 2021 1
Objectives:
 Explore the relationship between two categorical variables

 Estimate the strength of association

 Identify explanatory variables that influence the values of a response
variable and estimate the probability of the outcome

 Calculate, interpret and write a report on a given health and
health-related dataset

2
Statistical Methods for Categorical Variables
Chi-squared test

Measures of association

Logistic Regression

3
Recap: Important terms

 Null hypothesis vs alternative hypothesis

 Level of significance vs confidence interval

 Critical value vs test statistic

 P-value vs level of significance

 Statistical significance

4
Introduction
Categorical Response Variables:
Variables measured on a nominal or an ordinal scale

Examples:
 Binary response: whether or not a person smokes (Y = smoker / non-smoker)
 Binary response: success of a medical treatment (Y = survives / dies)
 Ordinal response: pain (Y = none / moderate / severe)
5
Introduction cont’d…

• Often, for the sake of simplicity, continuous data are
"dichotomized" or "trichotomized"

 E.g., birth weight (low, normal)

 Annual income (<15,000; 15,000–25,000; 25,000–40,000; >40,000)

6
Testing and measuring Association
Testing Hypothesis about association between variables

Hypothesis
1. The hypothesis to be tested can be stated as:
H0: There is no association between the variables
H1: There is an association between the variables
2. Compute the test statistic and obtain its tabulated (critical) value
3. Compare the two (tabulated and calculated values)
4. Make a decision

7
Testing Association: Chi-square
• The Chi-squared test statistic is used to test the association between
two categorical variables:

χ² = Σ (Oij − Eij)² / Eij, summed over all cells

 If the contingency table is 2 × 2, the degrees of freedom are (2 − 1)(2 − 1) = 1
 If the contingency table is r × c, the degrees of freedom are (r − 1)(c − 1)

• The Chi-squared test measures the disparity between the observed
frequencies (data from the sample) and the expected frequencies
(computed from the probability distribution under the null hypothesis) 8
Example
• A cross-sectional survey was conducted to assess the
association between head injury and wearing a helmet.
• A total of 793 individuals were included in the study and
the data are given below.

• Test for the presence of an association between head injury and
wearing a helmet at the 0.05 level of significance

9
Solution
• Hypothesis:
– H0: There is no association between helmet use and head injury
– HA: There is an association between helmet use and head injury

• Test statistic: χ² = 28.26

• The critical value at 1 df with a 5% level of significance: 3.84
• Decision: since 28.26 > 3.84, reject H0

Conclusion: there is an association between head injury and wearing a
helmet

10
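A quick way to reproduce such a test in STATA is the immediate form of tabulate. The cell counts below are assumed for illustration only (they are consistent with n = 793 and χ² ≈ 28.26, but the original table from the slide is not reproduced here):

STATA Code (sketch, assumed counts):
* rows = helmet worn (yes / no), columns = head injury (yes / no)
* counts are illustrative, not taken from the slide's table
tabi 17 218 \ 130 428, chi2

tabi takes the table cells row by row, with rows separated by a backslash; the chi2 option requests the Pearson Chi-squared test.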
Chi-squared cont’d…

• The Chi-squared test is valid if:

– No observed cell has a value of 0
– Not more than 20% of the expected cells have values less than 5
– The cells are mutually exclusive

• Limitations
– Discussion

11
Measures of Association
• How do we determine whether a certain disease is associated
with a certain exposure?

• If an association between disease and exposure exists,


 How strong is it?
 What is the direction of the association?
 How do we know whether the association is just by chance
or due to other factors?

• Commonly used measures of association are the odds ratio and the rate ratio

12
Measures of Association
Odds Ratio:
• Ratio of two odds
• The odds in favor of an event happening (such as getting a
disease) is the probability of the event happening relative to
the probability of the event not happening.

• In case control studies, we know the proportion of cases who


were exposed and the proportion of controls who were
exposed

• For example, suppose the probability (p) of a patient surviving is 60%.
• Then the probability of death (1 − p) is 40%, so the odds of the
patient surviving are p/(1 − p) = 0.6/0.4 = 1.5 13
Measures of Association cont’d…
Odds Ratio …
• Consider the 2x2 contingency table

OR = ad / bc

• If OR = 1: no association between disease and exposure
• If OR > 1: positive association between disease and exposure
• If OR < 1: negative association between disease and exposure
(exposure may be protective against the disease)
14
Measures of Association cont’d…
Example
• A case control study was conducted on 200 cases of heart
disease and 400 controls to determine whether heart disease
is associated with smoking status.
• The results are shown in the table below

OR = (112 × 224) / (176 × 88) = 1.62

• Interpret ???
15
Interpretation

• The odds of smoking were 1.62 times higher among CHD cases
than among the controls

• We are 95% certain that the true odds ratio for smoking is between
1.13 and 2.31.

• The 95% CI does not include 1; we therefore conclude that the
odds of smoking were significantly higher among cases than
controls

16
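A minimal STATA sketch for this case-control table, assuming the immediate command cci takes the counts in the order exposed cases, unexposed cases, exposed controls, unexposed controls:

STATA Code (sketch):
* 112 smoking cases, 88 non-smoking cases, 176 smoking controls, 224 non-smoking controls
cci 112 88 176 224

The output reports the odds ratio (about 1.62) together with a 95% confidence interval, so the hand calculation and the interpretation above can be checked directly.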
Examples using STATA
Consider the Jimma Infant data

• Tabulate bwt by sex

• Determine whether there is an association between sex and bwt

• STATA Code:
– tabulate sex bwt
– tabulate sex bwt, chi2

17
STATA output

18
Logistic regression
In linear regression,
 Dependent variable → continuous
 Independent variable(s) → categorical or numeric

 What can we do if the dependent variable is categorical
(dichotomous, multinomial or ordinal)?

19
Logistic regression for binary outcome
• Suppose Y = 1 (Yes/success), with π = probability of success = P(Y = 1)
          Y = 0 (No/failure), with 1 − π = probability of failure = P(Y = 0)

 In ordinary regression the model predicts the mean of Y for any
combination of predictors.
 What's the "mean" of a 0/1 indicator variable?

ȳ = (Σ yi)/n = (# of 1's)/(# of trials) = proportion of "successes"

 Goal of logistic regression: predict the "true" proportion of success,
π, at any value of the predictor.
Modeling binary data cont’d…

• The objective is to study the effect of explanatory variables on:

 An outcome variable with two possible categorical
outcomes (1 = success; 0 = failure)
 The events are assumed independent from subject to subject
 Explanatory variables can be categorical and/or continuous
 The model provides a way to estimate the probability π of the success
category of the outcome variable

21
Logit Function
[Plot: the logistic (S-shaped) curve of y against x]

y = exp(b0 + b1·x) / (1 + exp(b0 + b1·x))
Logistic regression cont’d…
 Such an S-shaped (sigmoid) curve is difficult to describe with a linear
equation for two reasons.
 First, even though it seems linear at the center of the curve, the
extremes do not follow a linear trend;
 Second, the errors are neither normally distributed nor constant across
the entire range of data.

Question! So what do we do with this S-shaped curve?

 Answer:
 First: find a function that best fits (can be linked to) this S-shaped
graph
 Second: find another function that transforms the S-shaped graph
into a linear function

23
(I) Finding a function that best fits with the S-
shaped graph of probability

[Plot: S-shaped curve of probability P against x (e.g., age), bounded between 0 and 1]

P = P(y | x) = P(success given that x occurred) = P(a person is CHD-positive
given that his age is x)

It always has an S-shaped curve within the range of 0 and 1 for any x

That is why we link it with p (probability), which has the same S-shape in the
same range of 0 to 1

24
p = exp(b0 + b1X) / (1 + exp(b0 + b1X))

log( π / (1 − π) ) = β0 + β1X

 The logit transformation linearizes the sigmoid curve.
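For completeness, the short algebra behind the logit link (a standard derivation, not shown on the slide):

If p = exp(β0 + β1X) / (1 + exp(β0 + β1X)), then
1 − p = 1 / (1 + exp(β0 + β1X)),
so p / (1 − p) = exp(β0 + β1X), and therefore
log( p / (1 − p) ) = β0 + β1X,

which is linear in X.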
Interpretation
• The parameter β determines the rate of increase or decrease of the curve

 When β > 0, π(x) increases as x increases

 When β < 0, π(x) decreases as x increases

 When β = 0, the curve flattens to a horizontal line

26


Example-1: The Aspirin and Myocardial Infarction Data
• Relationship between aspirin use and heart attacks
• 5-year randomized study
• Does regular aspirin intake reduce cardiovascular disease?

27
Data exploration:
Explore relationship between variables
• We want to model the probability of developing MI given aspirin
intake.

28
Modeling binary data cont’d…
• Determine the probability of success
• The probability of success: P(Z=1).
• This is the probability of having cardiovascular disease.

Pj = exp(β0 + β1Xj) / (1 + exp(β0 + β1Xj)), where the parameter β1 is the
treatment effect
• We want to see whether aspirin intake has an effect on the
probability of having myocardial infarction.

29
Model formulation: Binary logistic Regression
• Model the probability of having myocardial infarction given
aspirin intake
logit(π(x)) = β0 + β1Xj    Remark: π(x) = π = P

πj = exp(β0 + β1Xj) / (1 + exp(β0 + β1Xj))

• The logit model is the logit transformation of the
probability
 logit(p) = log(odds)

• Where odds = P(success) / (1 − P(success)) 30
Example-1: The Aspirin and Myocardial Infarction Data

• logit(p) = −4.05 − 0.61X

• The intercept, −4.04971, is the estimated log odds of MI in the
placebo group.
• The parameter estimate for the effect of aspirin intake
is −0.60544.
• How do we interpret the parameters?
• Compare the probability of the event among different groups
 Odds Ratio: e^β

31
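A quick numerical check of how the odds ratio follows from the fitted coefficient:

θ = exp(β1) = exp(−0.605) ≈ 0.55,

i.e. the odds ratio of about 0.545 reported on the next slide: aspirin intake multiplies the odds of myocardial infarction by roughly 0.55 relative to placebo.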
Example-1: cont’d…

• logit(p) = −4.05 − 0.61X

• The odds ratio, θ, is equal to 0.545.
• Since θ < 1, the odds of myocardial infarction in the
aspirin intake group are smaller than the odds of myocardial
infarction in the placebo group.
• This means that aspirin reduces the risk of myocardial
infarction.

32
Multiple Logistic Regression Model
• Simple logistic regression

• Determines the effect of a single predictor

• If we consider more than one predictor, the model becomes a multiple
(multivariable) logistic regression model

33
Multivariable logistic regression:

• The joint effect of all explanatory variables, say X1, X2, …,
Xp, put together on the odds is modeled as:

log( π / (1 − π) ) = α + β1X1 + β2X2 + … + βpXp

• α: represents the overall (baseline) disease risk

• βj: represents the fraction by which the disease risk is
altered by a unit change in Xj
• What changes is the log odds
• The odds themselves are changed by a factor of exp(βj) 34
Overall test: Likelihood ratio test
• The likelihood ratio test, which makes use of the deviance, is
analogous to the F-test from linear regression

• In its most basic form, it tests the hypothesis that all the
coefficients in the model are equal to 0:
H0: ß1 = ß2 = . . . = ßk = 0
H1: at least one ßj ≠ 0

If we reject the null hypothesis, we then have to identify which
variables are significant.

35
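For reference, the test statistic has the standard form: if L0 is the likelihood of the null model and L1 the likelihood of the fitted model with k predictors, then

G² = −2 (ln L0 − ln L1) = D0 − D1,

which under H0 follows approximately a Chi-squared distribution with k degrees of freedom.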


Test of significance cont’d…
I) Z-test
• The significance of each variable can be assessed by treating

Z = b / se(b)

as a standard normal variate.

• The corresponding P-values are easily computed (found from the
table of the Z-distribution).

36
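A standard companion to the Z-test (not on the slide): an approximate 95% confidence interval for a coefficient, and for the corresponding odds ratio, is

b ± 1.96 · se(b)   and   exp( b ± 1.96 · se(b) ).

The variable is significant at the 5% level when the interval for the odds ratio excludes 1 (equivalently, when |Z| > 1.96).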


Deviance
• The deviance of a model is −2 times its log likelihood (−2LL).

• As a model’s ability to predict outcomes improves, the deviance


falls.
• Poorly-fitting models have higher deviance.

• If a model perfectly predicts outcomes, the deviance will be zero.

• This is analogous to the situation in linear regression, where the


residual sum of squares falls to ‘0’ if the model predicts the values
of the dependent variable perfectly.

37


Deviance
• Based on the deviance, it is possible to construct an analogue of
r² for logistic regression, commonly referred to as the
pseudo r².

• If G1² is the deviance of the model with predictors, and G0² is the
deviance of the null model, the pseudo r² of the model is

r² = 1 − G1² / G0² = 1 − (ln L1 / ln L0)

• One might think of it as the proportion of deviance explained.

38
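A numerical illustration using the Jimma infant model reported later (slide 47): there ln L1 = −2136.60 and LR chi2(4) = 290.71, so ln L0 = −2136.60 − 290.71/2 ≈ −2281.96, and

r² = 1 − (−2136.60) / (−2281.96) ≈ 0.064,

which matches the Pseudo R2 = 0.0637 printed in that STATA output.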


Assumptions

1. Meaningful coding. Logistic coefficients will be difficult to


interpret if not coded meaningfully.
• The convention for binomial logistic regression is to code the
dependent class of greatest interest as 1 and the other class as 0.
2. Inclusion of all relevant variables in the regression model

3. Exclusion of irrelevant variables from the model


4. No multicollinearity: to the extent that one independent variable is a linear
function of another independent variable, the problem of multicollinearity
will occur in logistic regression.
As the independent variables increase in correlation with each other, the
standard errors of the logit (effect) coefficients become inflated.

39


Example : Jimma infant Data

Simple Logistic Regression


• Outcome: birth weight status (bwt, 1 = low)
• Factor: sex (1 = male)

40
Example : Jimma infant Data cont’d…

41
Example : Jimma infant Data cont’d…

Logistic regression reporting odds ratio

STATA CODE:
logit bwt i.sex, or
42
Example : Jimma infant Data cont’d…

• Take males as a reference group


• The estimate now stands for females

• Interpretation and conclusion?


43
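One way to make males the reference group in STATA (a sketch, assuming sex is coded 1 = male, 2 = female) is the base-level prefix:

STATA Code (sketch):
* set level 1 (male) as the base category, so the reported odds ratio is for females
logit bwt ib1.sex, or

The ib#. operator only changes which level is treated as the baseline; the fitted model itself is unchanged.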
Example : Jimma infant Data cont’d…

Multiple Logistic Regression

• Considering the variables sex and place of residence

• Outcome: birth weight status (bwt, 1 = low)

– Factor 1: sex2 (1 = female)
– Factor 2: place of residence (1 = urban, 2 = semi-urban,
3 = rural)

44
Example : Jimma infant Data cont’d…

Multiple Logistic Regression …


Logistic regression Number of obs = 7,873
LR chi2(3) = 258.69
Prob > chi2 = 0.0000
Log likelihood = -2152.613 Pseudo R2 = 0.0567

bwt Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]

sex
male .6468934 .0539523 -5.22 0.000 .5493393 .7617715

place
semi-urban 1.935937 .3516486 3.64 0.000 1.356053 2.763794
rural 5.436301 .835046 11.02 0.000 4.023039 7.346031

_cons .0326106 .0049134 -22.72 0.000 .0242721 .0438136

STATA CODE:
logit bwt i.sex i.place, or 45
Example : Jimma infant Data cont’d…

Multiple Logistic Regression …

SU and R refer to semi-urban and rural, respectively

 Interpret and conclude all the important information

46
Example : Jimma infant Data cont’d…
Multiple logistic Regression …
Add a third variable: mother’s age (momage) measured in years
Logistic regression Number of obs = 7,873
LR chi2(4) = 290.71
Prob > chi2 = 0.0000
Log likelihood = -2136.6035 Pseudo R2 = 0.0637

bwt Coef. Std. Err. z P>|z| [95% Conf. Interval]

sex
male -.4384726 .0835816 -5.25 0.000 -.6022895 -.2746556

site
2 .4978627 .1891165 2.63 0.008 .1272011 .8685243
3 1.733488 .1537344 11.28 0.000 1.432174 2.034802

momage -.0243593 .0066555 -3.66 0.000 -.0374037 -.0113148


_cons -2.80704 .2240589 -12.53 0.000 -3.246187 -2.367893

STATA CODE:
logit bwt i.sex i.place momage 47
Example : Jimma infant Data cont’d…

Multiple Logistic Regression …

logit(P) = −2.81 − 0.44·male + 0.50·SU + 1.73·R − 0.02·momage

 Interpret and conclude all the important information

48
Example : Jimma infant Data cont’d…
Multiple Logistic Regression …
Odds Ratio estimates
Logistic regression Number of obs = 7,873
LR chi2(4) = 290.71
Prob > chi2 = 0.0000
Log likelihood = -2136.6035 Pseudo R2 = 0.0637

bwt Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]

sex
male .6450209 .0539119 -5.25 0.000 .5475566 .7598337

site
2 1.645201 .3111347 2.63 0.008 1.135645 2.383391
3 5.660361 .8701922 11.28 0.000 4.187793 7.650734

momage .975935 .0064953 -3.66 0.000 .9632872 .988749


_cons .0603835 .0135295 -12.53 0.000 .0389223 .0936779

STATA CODE:
logit bwt i.sex i.place age, or 49
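The odds ratios in this table are simply the exponentiated coefficients from the previous output, for example

exp(−0.4384726) ≈ 0.645 (male), exp(1.733488) ≈ 5.66 (rural), exp(−0.0243593) ≈ 0.976 (momage),

so each additional year of maternal age multiplies the odds of low birth weight by about 0.98, holding sex and place of residence constant.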
Model Comparison
• Compare the model fit statistics, say the -2 log-likelihood, of the
three models considered so far, i.e.,
• Simple logistic regression: only `sex' as a factor,
• Multiple logistic regression: `sex' and `place' as two factor
variables, and
• Multiple logistic regression: `sex', `place' and `maternal age'.

• Which one fits the data better? Why?

50
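A STATA sketch of this comparison using likelihood-ratio tests of the nested models (variable names as used on the earlier slides):

STATA Code (sketch):
logit bwt i.sex
estimates store m1
logit bwt i.sex i.place
estimates store m2
logit bwt i.sex i.place momage
estimates store m3
* each lrtest compares a smaller model with the larger model it is nested in
lrtest m1 m2
lrtest m2 m3

A small p-value from lrtest indicates that the added variable(s) significantly improve the fit, which is equivalent to comparing the models' -2 log-likelihoods.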
Hosmer and Lemeshow Test
• The Hosmer-Lemeshow goodness-of-fit statistic is used to
assess whether the fitted multiple logistic regression model
adequately describes the observed data.

• The Hosmer-Lemeshow goodness-of-fit statistic is

computed as the Pearson chi-square from the contingency
table of observed frequencies and expected frequencies.

• A good fit as measured by the Hosmer-Lemeshow test will

yield a large p-value

51
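In STATA the test can be obtained after fitting the model, for example (a sketch using the Jimma variables):

STATA Code (sketch):
logit bwt i.sex i.place momage
* Hosmer-Lemeshow test based on 10 groups of predicted risk
estat gof, group(10)

A large p-value from this test is consistent with an adequately fitting model.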


52
