Week 8 - Logistic Regression
Logistic Regression
Dr Carolina Feher da Silva
Today’s class
What is the point of logistic regression?
• Activity – picking the right test for research scenarios
• Break
Running a logistic regression in jamovi
Part 2:
• Be able to run a logistic regression analysis in jamovi
Part 3:
• Interpret statistical output of a logistic regression analysis, including model results and checking assumptions
• Be able to report results clearly in APA style – text and graphs
What is the point of logistic regression?
When should I use it?
Results might look like this …
The logistic function
What is binary logistic regression?
• Like linear regression, logistic regression is all
about prediction
• Prediction of categorical outcomes instead of
continuous
• More specifically:
• Binary logistic regression
• Outcomes that are binary (e.g., yes vs no), that
is, they have 2 options only
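A minimal sketch of how a logistic model turns a predictor value into a predicted probability of the binary outcome. The intercept and slope below are hypothetical values chosen only for illustration; they do not come from any data set in this class.

```python
import math

def logistic(z):
    """Map any real number z to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predicted_probability(x, intercept, slope):
    """Predicted probability of the outcome (e.g., 'yes') at predictor value x."""
    return logistic(intercept + slope * x)

# Hypothetical coefficients (assumption, for illustration only)
for x in [-4, -2, 0, 2, 4]:
    p = predicted_probability(x, intercept=0.0, slope=1.0)
    print(f"x = {x:>2}: predicted probability = {p:.3f}")
```

However large or small the predictor gets, the prediction stays between 0 and 1, which is why this function suits yes/no outcomes.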
Effect size for predictors: B and beta in linear regression; odds ratio in logistic regression (must report 95% CI)
What is an odds ratio (OR)?
• If you want to know in detail:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2938757/
• What are the odds of an event?
• The probability that the event will occur divided by the probability that the event will not
occur.
• If the probability is 0.8, then the odds are 0.8/0.2 = 4.
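The conversion between probability and odds can be sketched as below. The function names are illustrative, not part of jamovi.

```python
def probability_to_odds(p):
    """Odds = probability the event occurs / probability it does not occur."""
    if not 0 <= p < 1:
        raise ValueError("p must be at least 0 and less than 1")
    return p / (1 - p)

def odds_to_probability(odds):
    """Reverse conversion: probability = odds / (1 + odds)."""
    return odds / (1 + odds)

# The example from the slide: a probability of 0.8
print(f"odds = {probability_to_odds(0.8):.2f}")
print(f"probability = {odds_to_probability(4):.2f}")
```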
Odds ratio for a predictor X:
$$\text{OR}_X = \frac{\text{odds after a unit change in the predictor}}{\text{original odds (before the change)}}$$
Greater than 1 = greater probability of experiencing the event (significant if the whole 95% CI is greater than 1)
Less than 1 = lower probability of experiencing the event (significant if the whole 95% CI is less than 1)
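A small sketch of the odds ratio as "odds after the change divided by odds before the change". The probabilities used here are hypothetical and chosen only to give round numbers.

```python
def odds(p):
    """Odds of an event with probability p."""
    return p / (1 - p)

def odds_ratio(p_before, p_after):
    """Odds after a unit change in the predictor divided by the original odds."""
    return odds(p_after) / odds(p_before)

# Hypothetical probabilities (assumption, for illustration only)
or_up = odds_ratio(p_before=0.5, p_after=0.8)    # odds go from 1 to 4
or_down = odds_ratio(p_before=0.5, p_after=0.2)  # odds go from 1 to 0.25

print(f"OR = {or_up:.2f}  -> above 1, the event becomes more likely")
print(f"OR = {or_down:.2f}  -> below 1, the event becomes less likely")
```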
Example 1: Stroke and depression
Depression
No stroke 1
Stroke 3.5 (1.4-8.3) ***
*** p<0.001
People who have had a stroke have 3.5 times the odds of being diagnosed with depression
compared with people who have not had a stroke.
Whyte et al (2004). Depression after stroke: A prospective epidemiological study. Journal of the American Geriatrics
Society: 52(5); 774-778.
Example 2: Oral contraceptive pill use and
depression
Depression
No oral contraceptive use 1
Current oral contraceptive use 0.81 (0.47-1.40)
While the odds ratio suggests that women taking oral contraceptives may have lower odds of
experiencing depression, the fact that the 95% confidence interval includes 1 means this difference is not
statistically significant.
Cheslack-Postava et al (2015). Oral Contraceptive Use and Psychiatric Disorders in a Nationally Representative Sample
of Women. Archives of Women’s Mental Health: 18(1); 103-111.
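The "does the 95% CI include 1?" rule from the two examples above can be written as a one-line check. The odds ratios and intervals are the published values quoted on the slides; the function name is illustrative.

```python
def ci_excludes_one(lower, upper):
    """True when the whole 95% CI lies above 1 or the whole CI lies below 1."""
    return lower > 1 or upper < 1

# (odds ratio, CI lower bound, CI upper bound) as quoted in Examples 1 and 2
examples = {
    "Stroke": (3.5, 1.4, 8.3),
    "Current oral contraceptive use": (0.81, 0.47, 1.40),
}

for label, (or_value, lower, upper) in examples.items():
    verdict = "significant" if ci_excludes_one(lower, upper) else "not significant"
    print(f"{label}: OR = {or_value} ({lower}-{upper}) -> {verdict}")
```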
Odds ratios with continuous predictors
• Different way to think about odds ratio
The effects of anxiety, lying and depression on being a bully are statistically significant.
For example: for every 1-point increase in anxiety, the odds of being a bully are multiplied by
0.36, that is, they decrease by 64%.
Salmon (1998). Bullying in schools: self-reported anxiety, depression and self-esteem in secondary school children. BMJ
317: 924-925.
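For a continuous predictor the odds ratio applies to each one-point step, so the effect multiplies across steps. A short sketch using the anxiety odds ratio of 0.36 quoted above (the helper names are illustrative):

```python
def percent_change_in_odds(odds_ratio):
    """Percentage change in the odds for a one-unit increase in the predictor."""
    return (odds_ratio - 1) * 100

def odds_multiplier(odds_ratio, units):
    """Factor by which the odds change after `units` one-point increases."""
    return odds_ratio ** units

anxiety_or = 0.36  # odds ratio for anxiety, as quoted on the slide

print(f"1-point increase: odds change by {percent_change_in_odds(anxiety_or):.0f}%")
print(f"2-point increase: odds multiplied by {odds_multiplier(anxiety_or, 2):.2f}")
```

A two-point increase multiplies the odds by 0.36 twice, not by 0.72.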
Summary: Logistic regression
• Logistic regression allows us to determine the probability of
experiencing an event/outcome.
• Logistic regression can use both continuous and categorical predictors
(for categorical predictors we compare to a ‘reference’ group).
• The outcome for logistic regression is always categorical (binary).
• The change in the odds of experiencing an event is indicated by the odds ratio
and its 95% confidence interval (if the 95% confidence interval doesn't
include 1, the effect is statistically significant).
• The contribution of each predictor is assessed with the Wald statistic.
Activity: Choose the test
Instructions:
• In groups, work through research scenarios/questions
• Choose which statistical test is the most appropriate to use
• Things to keep in mind to help you make your decision:
Research Question 1:
Mental health problems can be measured on a continuum (e.g. number of
symptoms) or discretely (e.g. diagnosed with an anxiety disorder vs not). If we had
data on both of these outcomes and a potential predictor (i.e., adverse childhood
experiences), how would you analyse the data to answer the following hypotheses?
H1: higher levels of adverse childhood experiences are associated with higher anxiety symptoms
H2: higher levels of adverse childhood experiences are associated with anxiety disorders.
Research Question 2:
Bevan has data on one million participants’ big five personality traits
collected in 2011. He links this data to the national death register to see
if there are any links between big five traits and dying in the past 10
years. First, what analysis would you use? Second, are there any
confounding variables or moderators you would recommend including
in the analysis?
Break Time
Stretch your legs
Get a cuppa
Running a logistic regression in jamovi
Logistic regression example
• Do personality traits and gender predict getting a first (vs not) on the PSY2017 exam?
• The outcome variable is now whether a person gets a first (70% or higher) or lower than a
first (≤69%) = ‘psy2017_first’; a first = 1 and below a first = 0 (coded like a dummy
variable)
• Gender is coded Female = 0 and Male = 1 (this is a dummy variable)
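The coding scheme can be sketched as follows. The student records are invented for illustration and are not the class data set; in jamovi the same recoding would normally be done with a computed or transformed variable.

```python
def code_first(mark_percent):
    """1 if the exam mark is a first (70% or higher), otherwise 0."""
    return 1 if mark_percent >= 70 else 0

def code_gender(gender):
    """Dummy code gender as on the slide: Female = 0, Male = 1."""
    return {"Female": 0, "Male": 1}[gender]

# Hypothetical students as (gender, exam mark in %) pairs
students = [("Female", 72), ("Male", 69), ("Female", 55), ("Male", 70)]

coded = [
    {"gender": code_gender(g), "psy2017_first": code_first(mark)}
    for g, mark in students
]
for row in coded:
    print(row)
```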
Statistical Analysis process
Descriptive statistics
Assumption testing
• Scatterplot
Histogram should look like this: two columns, same width
2. Reporting
Model fit results: Overall Model
• How much variance (%) in ‘Exam first’ does the model predict?
• Pseudo R² = approximate % of variance explained; used because there is no ‘proper’ R² for logistic regression
• Pseudo adjusted R²: only increases if new predictors improve the model (so good for comparing models)
• Overall model test = Is the model significantly better than a model with no predictors (i.e., guessing everyone’s outcome from the overall proportion of firsts)?
(similar to the F test in multiple regression, but a chi-square instead)
• Model comparisons tell us if model 2 (block 1 and 2) is better than model 1 alone (if doing blocks)
2. Overall Logistic Regression model
• Pseudo R2 statistics
• May be best for comparing results of different models that use the same data set, as they focus on how much the fit changes as
predictors change.
• May be less good as an indicator of variance explained that can be used to compare across models (i.e., not quite the same as, or as good
as, a normal R²)
• People may vary in terms of which pseudo R2 to report
• See more info here: https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-what-are-pseudo-r-squareds/
• Report:
• “The overall model is significant, χ²(6) = 13.2, p = .040.
• Pseudo R² statistics indicate that approximately 9% (Cox and Snell R²) to 13% (Nagelkerke R²) of variance in the
outcome is explained by the model.”
The overall model’s test (chi-square) assesses whether the overall logistic regression model is statistically
significant or not (similar to the F-ratio in multiple regression)
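For those curious where these numbers come from, here is a sketch of the standard formulas for the overall model chi-square and three commonly reported pseudo R² statistics. All are computed from the log-likelihood of a model with no predictors and of the fitted model. The log-likelihoods and sample size below are hypothetical; they are not the values behind the example report.

```python
import math

def model_fit(ll_null, ll_model, n):
    """Overall model test and pseudo R-squared values from two log-likelihoods.

    ll_null  : log-likelihood of the model with no predictors
    ll_model : log-likelihood of the model with predictors
    n        : number of participants
    """
    chi_square = 2 * (ll_model - ll_null)           # likelihood-ratio chi-square
    mcfadden = 1 - ll_model / ll_null
    cox_snell = 1 - math.exp(-chi_square / n)
    max_cox_snell = 1 - math.exp(2 * ll_null / n)   # upper limit of Cox and Snell
    nagelkerke = cox_snell / max_cox_snell          # rescaled so the maximum is 1
    return {
        "chi_square": chi_square,
        "mcfadden": mcfadden,
        "cox_snell": cox_snell,
        "nagelkerke": nagelkerke,
    }

# Hypothetical values (assumption, for illustration only)
fit = model_fit(ll_null=-90.0, ll_model=-80.0, n=150)
for name, value in fit.items():
    print(f"{name}: {value:.3f}")
```

Nagelkerke is always at least as large as Cox and Snell, which is why reports often give the two as a range.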
Model fit continued…
• The classification table is also useful for evaluating how good your model is
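A sketch of how a classification table is built: each person's predicted probability is turned into a predicted category using a cut-off (0.5 is assumed here), then compared with what actually happened. The observed outcomes and predicted probabilities are invented for illustration.

```python
def classification_table(observed, predicted_prob, cutoff=0.5):
    """Cross-tabulate observed outcomes (0/1) against predictions at a cut-off."""
    tp = fp = tn = fn = 0
    for y, p in zip(observed, predicted_prob):
        predicted = 1 if p >= cutoff else 0
        if y == 1 and predicted == 1:
            tp += 1   # correctly predicted a first
        elif y == 1 and predicted == 0:
            fn += 1   # missed a first
        elif y == 0 and predicted == 0:
            tn += 1   # correctly predicted below a first
        else:
            fp += 1   # wrongly predicted a first
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # % of actual 1s predicted correctly
        "specificity": tn / (tn + fp),   # % of actual 0s predicted correctly
    }

# Hypothetical data (assumption, for illustration only)
observed = [1, 1, 1, 0, 0, 0, 0, 1]
predicted_prob = [0.9, 0.7, 0.4, 0.2, 0.6, 0.1, 0.3, 0.8]
table = classification_table(observed, predicted_prob)
print(table)
```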
3. Reporting model Predictors
Examine if individual predictors were significant, and if so how important
were they in the model?
1 predictor was statistically significant
2a. Reporting predictors:
• Look at p values to determine significance of each predictor.
• The odds ratio (OR) tells us the direction and magnitude of the effect of each
predictor on the outcome, i.e. odds ratio = effect size
• Report:
• “On average, for every unit increase in conscientiousness the odds of getting a first
become 2.68 times larger, OR = 2.68, 95%CI [1.41, 5.08].”
• Odds ratios that are below 1 and significant simply indicate a decrease, e.g. for an OR of
0.50 …
• A predictor is not statistically significant (p > .05) if its 95% confidence interval (CI)
includes 1 (e.g. 0.5 to 2.3), as the relationship could be an increase or a decrease.
Thus, we can’t be sure the odds differ for people with different levels of these traits.
• Always report CIs for Odds ratios.
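The odds ratio and its CI come from the model's B coefficient and standard error by exponentiating: OR = exp(B), with limits exp(B ± 1.96 × SE). A sketch follows. The B and SE values are hypothetical, chosen to give an odds ratio close to the conscientiousness example; they are not the actual jamovi output.

```python
import math

def odds_ratio_with_ci(b, se, z=1.96):
    """Convert a logistic regression coefficient B (log-odds) to an OR with 95% CI."""
    odds_ratio = math.exp(b)
    lower = math.exp(b - z * se)
    upper = math.exp(b + z * se)
    return odds_ratio, lower, upper

# Hypothetical coefficient and standard error (assumption, for illustration only)
or_value, lower, upper = odds_ratio_with_ci(b=0.986, se=0.327)
includes_one = lower <= 1 <= upper

print(f"OR = {or_value:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
print("not significant" if includes_one else "significant")
```

Because the interval is built on the log-odds scale, it is not symmetric around the odds ratio once exponentiated.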
ASSUMPTION B Outliers
B) Outliers
Maximum Cook’s distance < 1 = no outliers
• Look in the descriptives of the new Cook’s distance variable:
• Are any cases/individuals having an excessive effect on the
model?
• No influential case when the maximum Cook’s distance is less than 1
• Don’t have to report specifically, but could write…
• No single cases exerted excessive effects on the regression model (Cook’s Distance,
range = 0.001 to 0.050).
• If higher than 1, then run model again without those cases and report….
• X cases were exerting large effects on the regression model (Cook’s Distance = X for
case X, X for case X). Re-running the regression model without these cases resulted in
….
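The "maximum Cook's distance less than 1" rule amounts to the check sketched below, applied to the saved Cook's distance variable. The distances are invented for illustration, and case numbers are taken to be row numbers.

```python
def influential_cases(cooks_distances, threshold=1.0):
    """Return (case number, distance) pairs whose Cook's distance exceeds the threshold."""
    return [
        (case, d)
        for case, d in enumerate(cooks_distances, start=1)
        if d > threshold
    ]

# Hypothetical Cook's distances (assumption, for illustration only)
clean = [0.001, 0.012, 0.050, 0.008]
with_outlier = [0.004, 1.350, 0.020, 0.007]

print(f"range = {min(clean)} to {max(clean)}, flagged: {influential_cases(clean)}")
print(f"flagged: {influential_cases(with_outlier)}")
```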
ASSUMPTION F Multicollinearity
F) Multicollinearity – correlation matrix
Example:
Red ones are significant relationships between predictors (but low correlation values, so OK).
Green ones are significant relationships between predictors & outcome.
Blue ones are non-significant relationships between predictor & outcome.
F) Multicollinearity
• Don’t want predictors to be too similar to each other
• Problems if Variance Inflation Factor (VIF) > 10
• (tolerance = 1/VIF) These are fine
• Problems if predictors correlated >.8 or >.9 in correlation matrix
• If high VIFs, examine correlation matrix & drop or combine problem predictor(s)
• e.g. combine highly correlated predictors, or drop one of them
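A sketch of how VIF, tolerance and the correlation cut-offs relate. A predictor's VIF is 1 / (1 − R²), where R² comes from predicting that predictor from the other predictors. The example assumes a model with only two predictors, in which case that R² is simply their squared correlation. The correlation values are hypothetical.

```python
def vif_from_r_squared(r_squared):
    """VIF for a predictor, given the R-squared from regressing it on the other predictors."""
    return 1 / (1 - r_squared)

def tolerance(vif):
    """Tolerance is the reciprocal of VIF."""
    return 1 / vif

# Hypothetical correlations between two predictors (assumption: only two predictors)
for r in (0.30, 0.80, 0.90, 0.95):
    vif = vif_from_r_squared(r ** 2)
    print(f"r = {r:.2f}: VIF = {vif:.2f}, tolerance = {tolerance(vif):.2f}")
```

With two predictors, VIF only passes 10 once the correlation is around .95, so the correlation rule of thumb (>.8 or >.9) is the stricter of the two checks.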
• Predictors = first model uses different types of stress, then the number of
stressors
• Outcome = 6-year incidence of depression
Different kinds of stress & depression
Variable Category OR (95% CI)
Job strain Low 1
High 1.58 (1.25-2.00)
Negative Life events No 1
Yes 2.11 (1.68-2.65)
Chronic stress No 1
Yes 1.99 (1.46-2.71)
Childhood trauma No 1
Yes 1.68 (1.32-2.15)
All models were adjusted by gender, age, marital status, education and time-varying covariates (employment status, self-rated health and having
one or more long-term medical conditions)