Statistics 244 - Binary Response Regression, and Related Issues
For observation i, have set of p predictors xi1 , . . . , xip , and a binary response variable
Yi = 0 if observation i is a “failure”, and Yi = 1 if observation i is a “success”.
Assume
Yi ∼ Bern(pi )
As in least-squares regression, want to model Yi as a function of the predictor variables.
Viewing the Bernoulli model as a GLM (exponential dispersion family), we have
φ = 1, b(θi) = log(1 + exp(θi)), c(yi, φ) = 0.
This implies pi = µi = b'(θi) = exp(θi)/(1 + exp(θi)).
Data were collected on patients with malignant melanoma who had their tumor removed by
surgery. The surgeries, which took place between 1962 and 1977, removed the tumors as well
as the surrounding skin. Patients were followed until the end of 1977.
We want to model the probability of death from melanoma by 1977, coded as status, as a function of
sex = Male or Female, along with age (years), year of operation, tumor thickness, and ulcer (Absent or Present).
Data summaries
> summary(melanoma)
status sex age year thickness
Alive:134 Female:119 Min. : 4.00 Min. :1962 Min. : 0.100
Dead : 57 Male : 72 1st Qu.:41.00 1st Qu.:1968 1st Qu.: 0.970
Median :54.00 Median :1971 Median : 1.940
Mean :51.52 Mean :1970 Mean : 2.861
3rd Qu.:63.50 3rd Qu.:1972 3rd Qu.: 3.540
Max. :95.00 Max. :1977 Max. :17.420
ulcer
Absent :108
Present: 83
Is it reasonable to let ηi = g(µi ) = µi ?
• If we assume pi = xi β, then it’s possible that large values of xij can result in pi > 1.
• Similarly, this parameterization does not prevent pi < 0.
• This argues to choose link function g() to ensure 0 < pi < 1.
Want to choose link function so that g maps from (0, 1) to (−∞, ∞).
Inverse link functions (the function that gives the mean as a function of the linear predictor)
• Logit link: pi = µi = exp(xi β)/(1 + exp(xi β)) = 1/(1 + exp(−xi β))
• Probit link: pi = µi = Φ(xi β), where Φ is the standard normal cdf
• Complementary log-log link: pi = µi = 1 − exp(−exp(xi β))
• When −1 ≤ η ≤ 1, all three (inverse) link functions are nearly linearly related. This means
that they are equally good/bad to use when probabilities can be expected to be away from
0 or 1.
• Main difference between logit and probit (inverse) link functions is that logit tends to assign
conservative probabilities (closer to 0.5) for extreme values of the linear predictor. Logit is
more robust than probit.
• Notice that the complementary log-log inverse link function predicts extreme low probabilities, but conservative high probabilities.
• Preference for one of these link functions over another can only be established with very
large samples of data. However, if working with very large sets of data, probably should
reconsider working with a linear function of the predictor variables.
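As a rough numerical illustration (not from the original notes), the three inverse links can be compared side by side in base R over a range of linear predictor values:

eta <- seq(-3, 3, by = 0.5)
round(cbind(eta,
            logit   = plogis(eta),                 # exp(eta)/(1 + exp(eta))
            probit  = pnorm(eta),
            cloglog = 1 - exp(-exp(eta))), 3)

Over the middle of the range the three columns are roughly linear transformations of one another; they differ mainly in the tails.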
Latent variable formulation
Suppose there is a latent (unobserved) variable zi = xi β + εi, and we observe only whether zi falls above or below a threshold (say, Yi = 1 if zi exceeds it). If εi has a standard logistic distribution, then
Pr(Yi = 1) = exp(xi β)/(1 + exp(xi β)),
which is logistic regression.
In the equation
zi = xi β + εi ,
if εi ∼ N(0, 1), then this leads to probit regression (i.e., Bernoulli model with a probit link function).
If εi has cdf Fεi(x) = 1 − exp(−e^x), the “extreme value” distribution, then the complementary log-log link results.
In general, Fεi (x) being a cdf of a continuous distribution over R is a good candidate for an inverse
link function for a binary response model. The inverse cdf therefore can be used as a link function.
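A small simulation sketch (the coefficients here are made up for illustration) of the latent-variable view: thresholding a normal latent variable at zero reproduces probit regression.

set.seed(2)
n <- 5000
x <- rnorm(n)
z <- -0.5 + 1.2 * x + rnorm(n)      # latent variable with N(0,1) error
y <- as.numeric(z > 0)              # only the thresholded outcome is observed
coef(glm(y ~ x, family = binomial(link = "probit")))   # estimates near (-0.5, 1.2)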
• In some situations, a binary outcome could be viewed as choice between two alternatives.
• It might be sensible to assume that, underlying the potential choice, there is a continuum of
values that expresses the merit or utility.
• The latent variable view of the model assumes that if a person’s (unobserved) utility is above
a certain threshold, one decision is made; if it is below the threshold, the other decision is
made.
• Note that the latent variable framework does not change the probability model; it just lends
an interpretation of the underpinnings of the model.
• Hard to distinguish the fit of different link functions without a lot of data.
• Logit link has some nice mathematical advantages over other link functions (as we shall see).
In many situations, the predictor variables are sampled (or determined by design) first, and then
the response is treated as a random variable. These are called prospective designs.
Randomly select a sample of smokers and non-smokers, and wait to see whether study participants develop lung cancer by age 75.
Here, let xi be 0 or 1 if participant i is a non-smoker or a smoker, respectively, and let yi be 0 or 1
if the participant avoids or develops lung cancer. Possible logistic regression model:
logit pi = β1 + β2 xi .
A prospective design for studying the effects of smoking on lung cancer can be inefficient and
time-consuming.
Identify a random sample of elderly people with lung cancer, and a random sample without lung cancer. Then identify whether they were smokers versus non-smokers.
In this situation, the yi were sampled first, and the xi are observed subsequently.
Retrospective sampling could produce a very different type of sample than prospective sampling:
For example, despite lung cancer being relatively rare, we could ensure through retrospective
sampling just as many lung cancer cases as non-cases.
If sampling were carried out retrospectively (instead of prospectively), and we fit a logistic regression model of the form
logit pi = β1 + β2 xi ,
the estimated coefficient β2 on xi will be essentially the same as if we had collected the data prospectively (only the intercept changes)!
Bottom line If we want to estimate the effect of a predictor variable in a logistic regression from data collected in a prospective study, we can carry out a retrospective study instead and get approximately the same estimate.
Comments:
• No need to restrict xi to being a binary predictor, or even a single predictor. The result holds
for a vector of arbitrary predictor variables, xi .
• If we chose another link function besides logit, this would not work.
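A simulation sketch (made-up coefficients, not the lung cancer data) illustrating the point: the slope from a retrospective sample is close to the slope from the full prospective sample, while the intercept shifts.

set.seed(1)
n <- 100000
x <- rbinom(n, 1, 0.3)                     # exposure indicator
y <- rbinom(n, 1, plogis(-4 + 1.5 * x))    # rare outcome, true slope 1.5

coef(glm(y ~ x, family = binomial))        # prospective (full cohort) fit

cases    <- which(y == 1)                  # retrospective: all cases plus
controls <- sample(which(y == 0), length(cases))   # an equal number of controls
s <- c(cases, controls)
coef(glm(y[s] ~ x[s], family = binomial))  # slope still near 1.5; intercept shifts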
Model can be fit using maximum likelihood estimation, implemented numerically using Fisher
scoring.
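For illustration only, a minimal Fisher scoring (iteratively reweighted least squares) loop for the logistic model; glm() does this with more numerical care. The function name and interface here are assumptions.

fisher_scoring <- function(X, y, tol = 1e-8, maxit = 25) {
  # X: model matrix (with intercept column), y: 0/1 response
  beta <- rep(0, ncol(X))
  for (it in seq_len(maxit)) {
    eta <- drop(X %*% beta)
    mu  <- plogis(eta)                      # fitted probabilities
    w   <- mu * (1 - mu)                    # Fisher scoring weights
    z   <- eta + (y - mu) / w               # working response
    beta_new <- drop(solve(crossprod(X, w * X), crossprod(X, w * z)))
    if (max(abs(beta_new - beta)) < tol) return(beta_new)
    beta <- beta_new
  }
  beta
}
# e.g. fisher_scoring(model.matrix(~ year, data = melanoma),
#                     as.numeric(melanoma$status == "Dead"))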
Can also show the estimated variance (lowercase “var”) of β̂ is
var(β̂) = (X^T W X)^{-1},
where W is diagonal with
wii = 1 / [ (∂ηi/∂µi)^2 φ V(µ̂i) ] (evaluated at µi = µ̂i) = µ̂i (1 − µ̂i),
with µ̂i = g^{-1}(xi β̂) = exp(xi β̂)/(1 + exp(xi β̂)), and φ = 1 for the Bernoulli model.
> melanoma.year.glm =
    glm(status ~ year, family=binomial, data=melanoma)
> summary(melanoma.year.glm)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 351.33724 123.35457 2.848 0.00440 **
year -0.17880 0.06263 -2.855 0.00431 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The estimated coefficient of year, β̂1 = −0.17880, means that for every year later an operation was performed, there is a 0.17880 drop in the logit of the probability that the patient dies by 1977.
∂µ/∂xj = βj exp(xβ)/[1 + exp(xβ)]^2 = βj µ(1 − µ).
This means that for probabilities near (say) µ = 0.5, a unit increase in xj corresponds to an increase
of βj (0.5)(1 − 0.5) = βj /4 in probability. For probabilities closer to 0 or 1, the increase is less.
Of course, this interpretation depends on the approximate probability so should be used with
caution.
For example, in the range of probabilities of death of around 0.35, every additional year later an operation is performed results in a drop of about β̂1 µ(1 − µ) = 0.1788 × 0.35 × 0.65 ≈ 0.041 in the probability of death.
• Test statistic is z = (β̂j − βj) / s.e.(β̂j), where s.e.(β̂j) = sqrt[ var(β̂)jj ]. Under an assumed βj, z ∼ N(0, 1). (The t-distribution is a poor approximation for the distribution of β̂j in a logistic regression; the numerator and denominator are not independent.)
• Conventional to use
β̂j ± z*_{1−α/2} s.e.(β̂j)
where z*_{1−α/2} is the (1 − α/2) quantile of N(0, 1).
• This is actually a crude calculation, and relies on the sample size being large enough to
assume β̂j is approximately normally distributed. Generally better to use profile likelihood
confidence intervals.
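In R, both interval types are available for the simple melanoma fit above: confint.default() gives the Wald interval, while confint() on a glm object profiles the likelihood (via the MASS package).

confint.default(melanoma.year.glm)   # estimate +/- z* x s.e.
confint(melanoma.year.glm)           # profile likelihood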
Logistic regression coefficients for binary predictors and odds ratios
For event A, define
odds(A) = Pr(A) / (1 − Pr(A))
Basically the odds, a monotonically increasing function of probability, maps probabilities to a scale
that ranges from 0 to ∞.
Some odds
Pr(A) = 0 ←→ odds(A) = 0
Pr(A) = 1/2 ←→ odds(A) = 1
Pr(A) = 1 ←→ odds(A) = ∞
Notice that
logit Pr(Y = 1) = log[ Pr(Y = 1) / (1 − Pr(Y = 1)) ] = log[ Pr(Y = 1) / Pr(Y = 0) ] = log(odds(Y = 1)).
The “logit” function is sometimes called the “log-odds” function.
Odds ratios
For a chosen value a, consider the quantity
ORj(a) = odds(Y = 1 | xj = a + 1) / odds(Y = 1 | xj = a)
For binary predictors (where xj = 0, 1), it is very common to summarize the effects as odds ratios.
Not as compelling to use odds ratios when xj is a quantitative predictor.
The simplest way to obtain a confidence interval for ORj is to
1. Compute a confidence interval for βj (e.g., the Wald or profile likelihood interval).
2. Recognizing that ORj = exp(βj), exponentiate the endpoints of the confidence interval from the previous step.
Example: Melanoma
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 460.62370 150.46918 3.061 0.002204 **
sexMale 0.50311 0.36880 1.364 0.172505
age 0.02214 0.01191 1.859 0.063020 .
year -0.23556 0.07652 -3.078 0.002083 **
thickness 0.11393 0.06669 1.708 0.087554 .
ulcerPresent 1.44607 0.39367 3.673 0.000239 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Confidence interval for the odds ratio of death associated with ulceration of the tumor.
β̂ulcer = 1.446
s.e.(β̂ulcer ) = 0.394
The endpoints of an approximate 95% confidence interval for βulcer are 1.446 ± 1.96 × 0.394, i.e., (0.674, 2.218). Exponentiating gives an approximate 95% confidence interval for the odds ratio of (e^0.674, e^2.218) ≈ (1.96, 9.19).
Could apply the same procedure with the endpoints based on the profile likelihood confidence
interval.
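In R this amounts to exponentiating either interval for the ulcer coefficient (assuming melanoma1.glm is the five-predictor fit shown above):

exp(confint.default(melanoma1.glm)["ulcerPresent", ])   # from the Wald interval
exp(confint(melanoma1.glm)["ulcerPresent", ])           # from the profile likelihood interval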
Inference for µ
• obtaining the standard error of probability estimates using the Delta method estimates of
the variance, or
• obtaining the confidence interval of xβ, and finding the inverse logit of the endpoints.
The following produces the predicted probabilities and Delta method standard errors based on
the fitted model:
Example:
> melanoma.newdat =
    data.frame(...,   # values chosen for sex, age, and year not shown here
               thickness=2.5, ulcer="Absent")
> melanoma1.newpred =
predict(melanoma1.glm,newdata=melanoma.newdat,
type="response",se.fit=T)
> melanoma1.newpred
$fit
1
0.0938881
$se.fit
1
0.03119798
1. Compute the logit of the estimated probability and its associated standard error.
2. Form a confidence interval for the logit (e.g., estimate ± 1.96 × standard error).
3. Invert the endpoints of the confidence interval through the inverse logit function.
Example
> melanoma1a.newpred =
predict(melanoma1.glm,newdata=melanoma.newdat,type="link",
se.fit=T)
> print(melanoma1a.newpred$fit)
1
-2.267059
> print(melanoma1a.newpred$se.fit)
[1] 0.3667195
Example (cont’d)
Confidence interval:
> print(plogis(c(
melanoma1a.newpred$fit - 1.96*melanoma1a.newpred$se.fit,
melanoma1a.newpred$fit + 1.96*melanoma1a.newpred$se.fit))
)
1 1
0.04807017 0.17533355
Note that the symmetric interval based on the Delta method can extend beyond (0, 1).
Likelihood ratio tests for nested models
Want to test model M0 with p0 coefficients nested within model M1 with p1 coefficients.
Let D(µ̂0 | y) be the deviance for model M0 and let D(µ̂1 | y) be the deviance for the model M1 .
Then
D(µ̂0 | y) − D(µ̂1 | y)
has approximately a χ²_{p1−p0} distribution under model M0. Determine the p-value based on this statistic.
Analysis of Deviance
Can also use the deviance function on the fitted models to obtain the deviances:
> deviance(melanoma1.glm)
[1] 186.1809
> deviance(melanoma2.glm)
[1] 211.8841
Extra linear parameters coming from thickness (1) and ulcer (1) for a total of 2.
Thus we compare (211.88 − 186.18) = 25.7 to a χ22 distribution to obtain a p-value. This can be
obtained from the anova output as before.
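A sketch of the same comparison in R (assuming melanoma2.glm is the smaller model nested within melanoma1.glm):

anova(melanoma2.glm, melanoma1.glm, test = "Chisq")
# or directly from the two deviances:
pchisq(211.88 - 186.18, df = 2, lower.tail = FALSE)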
When only categorical predictors are observed, Bernoulli responses can be grouped within categories to form Binomial responses.
A study in which 338 HIV-infected veterans whose immune systems were beginning to deteriorate
were either assigned to take AZT immediately or to wait until their cells showed severe immune
weakness. The response was whether a veteran developed AIDS symptoms during the 3-year
study.
Data
Letting Yi be 1 if veteran i experienced AIDS symptoms, and 0 if not, we would like to model the
Bernoulli probability pi = Pr(Yi = 1) as a function of race and AZT use.
To implement the analysis, could list out all 338 cases. But this would be inefficient.
Instead, let c index a unique category formed from the factors, with c = 1, 2, . . . , C. In the HIV
example, C = 4.
Assume logit pc = xc β.
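A sketch of how the grouped (binomial counts) fit might be set up in R; the data frame name aids and the columns yes (number developing symptoms) and n (category total) are assumptions for illustration, not part of the original notes.

aids.glm <- glm(cbind(yes, n - yes) ~ Race + AZT, family = binomial, data = aids)
summary(aids.glm)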
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.07357 0.26294 -4.083 4.45e-05 ***
RaceWhite 0.05548 0.28861 0.192 0.84755
AZTYes -0.71946 0.27898 -2.579 0.00991 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Comparing ungrouped and grouped analysis
Maximum likelihood produces the same overall inferences because the likelihoods are the same
up to a multiplicative constant.
When C categories of predictors are observed, the deviance for the logistic regression model is
D(µ | y) = 2 Σ_{c=1}^{C} [ yc log(yc / µc) + (nc − yc) log((nc − yc) / (nc − µc)) ]
Note that this is not equal to the deviance viewing the data as ungrouped.
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.07357 0.26294 -4.083 4.45e-05 ***
RaceWhite 0.05548 0.28861 0.192 0.84755
AZTYes -0.71946 0.27898 -2.579 0.00991 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Parameter estimates are identical to grouped analysis, but deviances are different.
The deviance for the saturated model under the grouped-data representation also differs from that under the ungrouped representation (because for grouped data the saturated model uses µ̂c = yc, while for ungrouped data it uses µ̂i = yi).
• Consequence – cannot evaluate fit of model by comparing scaled deviance to a χ²_{n−p} distribution.
• Resolution – much more justified to compare two models subtracting scaled deviances rather
than evaluating single models with a scaled deviance.
Is Race a significant predictor of having AIDS symptoms, accounting for AZT in the model?
Grouped analysis:
Ungrouped analysis
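With either representation, the comparison can be carried out with an analysis-of-deviance table; a sketch using the grouped fit assumed above:

drop1(aids.glm, test = "Chisq")   # likelihood ratio test for each term, here Race given AZT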
Goodness of fit
One common summary compares the fitted model's deviance to the null deviance, D(µ̂(0) | y), where µ̂(0) is the vector of mean estimates for the model with only an intercept term (no predictors). This model is often generically called the “null” model.
Problem The null deviance is not well-defined: it could depend on whether the data were grouped.
Deviance residuals
r_c^D = sign(yc − µ̂c) sqrt(dc)
with
dc = 2 [ yc log(yc / µ̂c) + (nc − yc) log((nc − yc) / (nc − µ̂c)) ].
Letting
ri* = (yi − µ̂i) / sqrt[ (1 − hii) µ̂i (1 − µ̂i) ]    or    rc* = (yc − µ̂c) / sqrt[ (1 − hcc) µ̂c (nc − µ̂c)/nc ]
be the standardized residuals, where hii or hcc is the corresponding diagonal element of H_W, to a first approximation
var(ri*) ≈ 1 and var(rc*) ≈ 1.
Basic diagnostics
• Plot deviance or jackknifed residuals against fitted values to detect non-linearities, outliers.
• Plot deviance or jackknifed residuals against values of a single predictor, also to detect non-linearities.
Problem With binary responses, residual plots almost always look striped.
An approach recommended by Gelman
• Group observations according to the sorted fitted values into bins of size G (e.g., G at least
6-7 or more)
• Plot the average fitted value within each bin against the average residual within each bin.
Instead of fitted values, can use the same strategy for predictor variables when one wants to investigate non-linearities of the residuals by individual predictors.
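A minimal sketch of such a binned-residual plot (the function and argument names are assumptions); fit is a fitted binomial glm with a 0/1 response.

binned_resid_plot <- function(fit, G = 10) {
  p <- fitted(fit)
  r <- residuals(fit, type = "response")   # y - fitted probability
  o <- order(p)                            # sort by fitted value
  grp <- ceiling(seq_along(o) / G)         # consecutive bins of size G
  plot(tapply(p[o], grp, mean), tapply(r[o], grp, mean),
       xlab = "Average fitted value", ylab = "Average residual")
  abline(h = 0, lty = 2)
}
# e.g. binned_resid_plot(melanoma1.glm, G = 10)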
Predictive performance of binary response regression
• Already analyzed a data set consisting of people with and without a disease, along with
predictors of having the disease. The analysis results in an estimated logistic regression.
• For each person, can plug in the predictor values into the logistic regression equation to
obtain an estimated probability of having the disease.
Do higher probabilities of having the disease correspond to actually having the disease?
Ingredients The observed binary responses yi and the corresponding fitted probabilities p̂i from the model.
Simple prediction rule Let c be a chosen cutoff value. Declare that the person has the disease if
p̂ ≥ c, and not if p̂ < c.
Let
ỹi(c) = 0 if p̂i < c, and 1 if p̂i ≥ c
be a binary prediction as a function of the cutoff value c.
Let
F (c) = Pr(ỹ(c) = 0 | y = 1)
G(c) = Pr(ỹ(c) = 0 | y = 0)
Now define
Sensitivity(c) = Pr(ỹ(c) = 1 | y = 1) = 1 − F(c)
Specificity(c) = Pr(ỹ(c) = 0 | y = 0) = G(c)
Ideally both are close to 1.
Predictive summaries
Sensitivity(0.5) = Pr(ỹ(0.5) = 1 | y = 1)
Specificity(0.5) = Pr(ỹ(0.5) = 0 | y = 0)
In other words, compute the fraction of observations with p̂i ≥ 0.5 among those with yi = 1 (the
sensitivity), and compute the fraction of observations with p̂i < 0.5 among those with yi = 0 (the
specificity).
Question What happens if we choose the cutoff c close to 0? Answer p̂i will always be greater than c, so that ỹi(c) is always 1. This means
Sensitivity(c) = Pr(ỹ(c) = 1 | y = 1) ≈ 1
Specificity(c) = Pr(ỹ(c) = 0 | y = 0) ≈ 0
Question What happens if we choose the cutoff c close to 1? Answer p̂i will always be less than c, so that ỹi(c) is always 0. This means
Sensitivity(c) = Pr(ỹ(c) = 1 | y = 1) ≈ 0
Specificity(c) = Pr(ỹ(c) = 0 | y = 0) ≈ 1
Moral of the story When constructing a prediction rule, there is a tradeoff between sensitivity and
specificity. Just selecting a cutoff for c may not be sufficient.
Summary for describing overall discrimination: ROC (“receiver operating characteristic”) curve
1. Consider a fine grid of cutoff values c between 0 and 1 (for example, the distinct fitted probabilities p̂i).
2. For each c, compute the empirical sensitivity Pr(ỹ(c) = 1 | y = 1), and “1 − specificity,” Pr(ỹ(c) = 1 | y = 0).
3. Plot the “sensitivity” (y-axis) versus “1 − specificity” (x-axis) for this collection of points.
This approach only makes sense if the model produces many different predicted probabilities (a
model with only one or two factors would not be appropriate for an ROC analysis).
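A bare-bones sketch of the empirical ROC computation (packages such as pROC or ROCR do this more carefully); y is the 0/1 response and phat the fitted probabilities.

roc_points <- function(y, phat) {
  cuts <- sort(unique(phat))
  data.frame(fpr  = sapply(cuts, function(cc) mean(phat[y == 0] >= cc)),   # 1 - specificity
             sens = sapply(cuts, function(cc) mean(phat[y == 1] >= cc)))   # sensitivity
}
# e.g.:
# r <- roc_points(as.numeric(melanoma$status == "Dead"), fitted(melanoma1.glm))
# plot(r$fpr, r$sens, type = "l", xlab = "1 - specificity", ylab = "Sensitivity")
# abline(0, 1, lty = 2)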
Interpretation
• Best possible situation: Plot jumps to a sensitivity of 1 and stays flat. This corresponds to
perfect accuracy.
• Worst possible situation: Diagonal line (y = x) from the origin to (1, 1). This corresponds to
random guessing.
Actual worst possible situation: curve is under the line y = x; predictions have poor sensitivity and specificity.
The more area under the ROC curve, the greater the accuracy.
For 0 ≤ t ≤ 1, define
ROC(t) = 1 − F (G−1 (1 − t)),
where
G−1 (1 − t) = inf{x ∈ R : G(x) ≥ 1 − t}.
Summarizing an ROC curve: c-statistic (area under the ROC curve, “AUC”)
c-statistic = AUC = ∫_0^1 ROC(u) du
Pick at random an observation with yi = 1 and an observation with yi = 0. Then the concordance
index is the probability that the fitted probability for yi = 1 is larger than the fitted probability for
yi = 0.
This statistic is an accepted discrimination measure for binary response regressions. In fact, it is
equivalent to the Mann-Whitney U -statistic (also known as the Wilcoxon rank-sum statistic) for
testing whether one sample is stochastically larger than another. Will discuss this in section.
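A direct (if brute-force) sketch of the c-statistic as a concordance probability over all (case, non-case) pairs of fitted probabilities, counting ties as one half:

c_statistic <- function(y, phat) {
  d <- outer(phat[y == 1], phat[y == 0], "-")   # all pairwise differences
  mean((d > 0) + 0.5 * (d == 0))
}
# e.g. c_statistic(as.numeric(melanoma$status == "Dead"), fitted(melanoma1.glm))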
ROC analysis for melanoma models
One criticism of ROC approach: Overfitting
• In situations where the regression model is overfitting (e.g., if too many predictor variables
are incorporated), then the ROC analysis will be misleading.
• Possible solution: For each observation i, determine the predicted probability p̂i based on
fitting a logistic regression without observation i. Use these jackknifed probability estimates
in place of the ordinary logistic regression probability estimates.
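A simple (slow) sketch of computing these jackknifed probabilities; the function name is an assumption for illustration.

loo_phat <- function(form, dat) {
  sapply(seq_len(nrow(dat)), function(i) {
    fit_i <- glm(form, family = binomial, data = dat[-i, ])   # refit without observation i
    predict(fit_i, newdata = dat[i, , drop = FALSE], type = "response")
  })
}
# e.g. loo_phat(status ~ sex + age + year + thickness + ulcer, melanoma)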
If binary responses can be separated (or nearly so) into all 0s and all 1s as a linear function of X,
then the MLE of β may not exist (or at least may not be finite).
• Complete separation
• Quasi-complete separation
Logistic regression for the first data set
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 165.32 407521.43 0 1
x -47.23 115264.41 0 1
Logistic regression for the second data set
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 137.32 54599.64 0.003 0.998
x -39.23 15599.90 -0.003 0.998
(Dispersion parameter for binomial family taken to be 1)
Comments
• Instances of predictions p̂i = 0 or p̂i = 1 (to many decimal places) are an indicator of separation.
• For complete separation, the maximized log-likelihood is 0 (to many decimal places); for
quasi-complete separation, it is strictly negative.
• The more covariates one has in a binary response model, the greater the potential for separation.
Addressing separation
• Can use approaches other than ordinary maximum likelihood. For example, Bayesian approaches with proper priors, regularization (penalized likelihood), and Firth's “bias-reduced” approach all produce finite estimates.
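For example, assuming the logistf package is available, Firth's bias-reduced fit returns finite estimates even for separated data (the variable names here are hypothetical):

library(logistf)
logistf(y ~ x, data = dat)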