Categorical-Notes-Ch4
The techniques presented in Chapters 2 and 3 are limited in two respects. First, they consider only categorical predictors X, without allowing for numeric predictors. Second, they accommodate only relatively small numbers of predictors. To overcome these limitations, we develop an extension of the linear model capable of handling categorical and count responses: the so-called generalized linear model, or GLM.
Definition 4.1. A family of distributions is called an exponential dispersion family if its density/mass function can be written as
$$f(y; \theta, \phi) = \exp\left\{\frac{y\theta - b(\theta)}{\phi/\omega} + c(y, \phi)\right\},$$
for known functions b(·), c(·, ·) and known constant ω > 0. The parameter θ is referred to as the canonical parameter and φ is known as the dispersion parameter.
Remark 4.1. This definition is given in Section 4.4 of Agresti; we will use this one rather than the less general exponential family definition from Section 4.1.1 of Agresti.
Example 4.1. The Poisson distribution with mean λ has mass function
$$f(y; \lambda) = \frac{\lambda^y e^{-\lambda}}{y!} = \exp\{y \log\lambda - \lambda - \log y!\},$$
which is of exponential dispersion form with θ = log λ, b(θ) = e^θ, φ = ω = 1, and c(y, φ) = −log y!.
Exercise 4.1. Show that the following families are exponential dispersion
families for suitable choices of b(·), c(·, ·), θ, φ, ω.
Definition 4.2. A generalized linear model for Y and X models the response Y_i with the density/mass function
$$f(y_i; \theta(x_i), \omega_i, \phi) = \exp\left\{\frac{y_i\,\theta(x_i) - b(\theta(x_i))}{\phi/\omega_i} + c(y_i, \phi)\right\},$$
where θ(x_i) depends on the covariates only through the linear combination x_i^⊤β. The term η_i = x_i^⊤β is referred to as the linear predictor.
The mean µ_i = E(Y_i | X_i = x_i) is related to the linear predictor through a link function g(·),
$$g(\mu_i) = x_i^\top \beta,$$
for some parameter β in a GLM. Now, we have just shown that µ_i = b′(θ_i); hence,
$$g(b'(\theta_i)) = x_i^\top \beta$$
so that
$$\theta_i = (b')^{-1}\left(g^{-1}(x_i^\top \beta)\right).$$
Now, imagine we set g(µ) = (b′)^{−1}(µ). This leads to the model θ_i = x_i^⊤β, so that the linear predictor η_i = x_i^⊤β and the canonical parameter θ_i coincide. This choice of the link function is referred to as the canonical link.
In some sense, the canonical link is the most "natural" choice of link function, and (as we will see) various aspects of GLMs simplify when the canonical link is chosen.
Example 4.3. Consider the binomial dispersion family from Example 4.2, in which Y = Z/n where Z ∼ Binomial(n, π). For this model we have φ = 1, ω = n, and b(θ) = log(1 + e^θ), where θ = log{π/(1 − π)}. We have
$$b'(\theta) = \frac{e^\theta}{1 + e^\theta}.$$
Now, noting that expit(x) = e^x/(1 + e^x) is the inverse function of logit(x) = log{x/(1 − x)}, we have
$$E(Y) = b'(\theta) = \operatorname{expit}(\theta) = \pi.$$
Exercise 4.2. Show that the Poisson exponential dispersion family described in Example 4.1 has canonical link g(λ) = log λ. Additionally, use the properties of the exponential dispersion family to verify that E(Y) = λ and Var(Y) = λ for the Poisson distribution.
For the Poisson family we then have
$$\operatorname{Var}(Y_i) = e^{\theta_i}.$$
Assuming the canonical link, it follows that Var(Y_i | X_i = x) = E(Y_i | X_i = x) = e^{x^⊤β}.
Recall that, for a fixed φ, the model which fits the data as closely as possible is the one which simply takes µ_i = Y_i. This model is referred to as the saturated model; it has a separate mean parameter for every observation.
The scaled deviance is twice the gap in log-likelihood between the saturated model and the fitted model, i.e. it is
$$D^\star = 2\sum_{i=1}^n \frac{\omega_i\left[Y_i(\tilde\theta_i - \hat\theta_i) - \left(b(\tilde\theta_i) - b(\hat\theta_i)\right)\right]}{\phi},$$
where θ̃_i and θ̂_i denote the estimates of θ_i under the saturated and fitted models, respectively.
Remark 4.2. For Poisson and binomial GLMs, φ ≡ 1, so that the scaled deviance D⋆ and the (raw) deviance D are equal.
A common use of the scaled deviance D⋆ is as a test statistic for assessing goodness of fit. A sensible way to check goodness of fit is to test your model against the model which fits the data as well as possible; if you cannot reject your model in favor of this larger model, this gives some assurance that the model is not out-of-line with the data.
$$\Lambda = D_0^\star - D_1^\star \,\overset{\cdot}{\sim}\, \chi^2_d \quad \text{under } H_0,$$
where D_0⋆ is the deviance of the model under H_0, D_1⋆ is the deviance of the model under H_1, and d = dim γ. An application of analysis of deviance is given in Section 4.3.2. This χ² approximation is valid even when D⋆ is not itself χ²; all we need is that d = dim γ is fixed in the asymptotics.
Fitting the ordinary linear model
$$Y_i = X_i^\top\beta + \epsilon_i$$
has a problem in that the ε_i's will not have the same variance, i.e., there is heteroskedasticity. To resolve this problem, one approach is to transform Y_i so that it is close to homoskedastic. For Poisson data, the so-called "variance stabilizing transformation" gives the linear model
$$\sqrt{Y_i} = X_i^\top\beta + \epsilon_i,$$
For binomial data, logistic regression instead models the success probability directly as
$$\pi_i = \frac{\exp(x_i^\top\beta)}{1 + \exp(x_i^\top\beta)}.$$
A plot of the expit function is given below. As we can see, the function has an "S" shape, tending to 0 and 1 as x → −∞ and x → +∞, respectively. A nice feature of the expit function is that it respects the fact that π_i ∈ [0, 1].
[Figure: the expit function plotted over x ∈ (−6, 6), rising from 0 to 1 in an "S" shape.]
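This curve can be reproduced with the built-in function plogis, the standard logistic distribution function, which coincides with expit:

curve(plogis(x), from = -6, to = 6, xlab = "x", ylab = "expit(x)")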
Example 4.4. On January 28, 1986, the space shuttle Challenger broke apart
just after launch, taking the lives of all seven of the crew. This example is taken
from an article by Dalal et al. (1989), which examined whether the incident
should have been predicted, and hence prevented, on the basis of data from
previous flights.
The cause of the failure was ultimately attributed to the failure of crucial shuttle components known as O-rings; these components had been tested prior to the launch to see if they could hold up under a variety of temperatures. For our analysis, we let Y_i = 1 if an O-ring failed on a given test shuttle flight and Y_i = 0 otherwise, and let X_i = (1, Temperature_i). The temperature on the day of the Challenger launch was 31° Fahrenheit, or roughly 0° Celsius.
We let Y_i ∼ Bernoulli(π_i) and use a logistic regression model logit(π_i) = X_i^⊤β. The data is available in the vcd package:
library(vcd)
data(SpaceShuttle)
head(SpaceShuttle)
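The logistic regression model is fit as follows (the call is reconstructed from the Call line of the summary below):

fit_space <- glm(I(Fail == "yes") ~ Temperature,
                 family = binomial, data = SpaceShuttle)
summary(fit_space)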
##
## Call:
## glm(formula = I(Fail == "yes") ~ Temperature, family = binomial,
## data = SpaceShuttle)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0611 -0.7613 -0.3783 0.4524 2.2175
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 15.0429 7.3786 2.039 0.0415 *
## Temperature -0.2322 0.1082 -2.145 0.0320 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 28.267 on 22 degrees of freedom
## Residual deviance: 20.315 on 21 degrees of freedom
## (1 observation deleted due to missingness)
## AIC: 24.315
##
## Number of Fisher Scoring iterations: 5
[Figure: estimated probability of O-ring failure as a function of temperature (40 to 90 °F), from the fitted logistic regression.]
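One way to reproduce this figure, which also defines the expit function and the coefficient vector beta used in the computations below (this code is our own sketch):

beta <- coef(fit_space)                       # fitted coefficients
expit <- function(x) exp(x) / (1 + exp(x))    # inverse logit
curve(expit(beta[1] + beta[2] * x), from = 40, to = 90,
      xlab = "Temperature", ylab = "Estimated probability of failure")

Extrapolating to the temperature on the day of the launch: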
expit(beta[1] + 31 * beta[2])
## (Intercept)
## 0.9996088
It seems unlikely that the astronauts would get on the shuttle if they were
aware of this.
In logistic regression, increasing the j-th predictor from x_j to x_j + Δ (holding the other predictors fixed) multiplies the odds by e^{Δβ_j}:
$$\frac{\operatorname{Odds}(x_j + \Delta)}{\operatorname{Odds}(x_j)} = e^{\Delta\beta_j}.$$
Example 4.5. For the shuttle data, the estimated regression coefficients are
## (Intercept) Temperature
## 15.0429016 -0.2321627
exp(-10 * beta[2])
## Temperature
## 10.19225
That is, the odds of failure increase by a factor of roughly 10 (estimated) for
every decrease of 10 degrees.
These are essentially Wald-based intervals, and they are generally not preferred. A better confidence interval can be obtained by inverting a likelihood ratio test. Consider the null hypothesis H_0: β_j = b versus the alternative H_a: β_j ≠ b. This hypothesis can be tested by computing the likelihood ratio statistic Λ(b) = 2{ℓ(β̂) − ℓ(β̃(b))}, where β̃(b) maximizes the log-likelihood ℓ subject to the constraint β_j = b, and rejecting H_0 when Λ(b) > χ²_{1,α}.
Inverting this test, we can form an asymptotic 100(1 − α)% confidence interval
for βj as {b : Λ(b) ≤ χ21,α }. This is referred to as a profile-likelihood confidence
interval.
Example 4.6. The profile confidence interval is easy to obtain in R; we compare
this with the Wald interval.
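A sketch of the computation: the matrix below contains the Wald intervals, computed by hand from the estimated standard errors (the profile intervals are obtained with confint(fit_space)):

se <- sqrt(diag(vcov(fit_space)))            # Wald standard errors
unname(cbind(beta - qnorm(0.975) * se,
             beta + qnorm(0.975) * se))      # Wald intervals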
## [,1] [,2]
## [1,] 0.5810523 29.50475096
## [2,] -0.4443022 -0.02002324
The intervals largely overlap but are somewhat different. Again, we generally
prefer the profile confidence intervals. From this, we see that the multiplicative
effect on the odds of a change of −10 degrees in temperature is, with 95%
confidence, in the interval
exp(-10 * confint(fit_space))[2,]
Another choice is the linear link function,
$$\pi_i = x_i^\top\beta.$$
This is adequate in many situations; however, one runs into problems when x_i^⊤β > 1 or x_i^⊤β < 0. Whenever there is a continuous predictor with an unrestricted range, there will always exist values of x which cause this to happen. As we know that probabilities must lie in [0, 1], this is problematic. Hence, one should think very carefully before using the linear link function for binomial regression.
A benefit of the linear link function is that it is extremely easy to interpret.
More generally, suppose each subject has a latent tolerance T_i with continuous distribution function F(·), and that Y_i = 1 exactly when T_i ≤ X_i^⊤β, so that π_i = F(X_i^⊤β). This gives the GLM F^{−1}(π_i) = X_i^⊤β, a binomial GLM with link function g(π) = F^{−1}(π). Thus, given any distribution function F(·), we can define a binomial GLM.
Logistic regression corresponds to taking T_i to follow the standard logistic distribution, which has density
$$f(t) = \frac{e^t}{(1+e^t)^2}, \qquad -\infty < t < \infty.$$
Aside from logistic regression, the latent tolerance model has many inter-
esting special cases. The probit model considers Ti ∼ Normal(0, 1), while the
complementary log-log model sets Ti to have an extreme value distribution.
In practice, the main difference between these models lies in how they handle
outliers, which is determined by how fast F (x) → 0, 1 as x → ±∞. When F (x)
corresponds to a light-tailed distribution, such as a normal distribution, we
expect to see very few outliers in Y ; here, an outlier is a point (x, y) where x
is extreme and y takes on the “wrong” value (i.e., y is strongly predicted to be
1, but is 0 instead). Conversely, a heavy-tailed distribution (such as a Cauchy
distribution) does not have problems with outliers.
The logistic regression model has exponential tails, with F(x) ≈ e^x as x → −∞ and F(x) ≈ 1 − e^{−x} as x → ∞. This is heavier than the normal distribution,
which has super-exponential tails, but lighter than the t-distribution, which has polynomial tails.

[Figure 4.1: the probit, logit, and complementary log-log link functions plotted over (−4, 4).]
We can also choose F (t) to correspond to an asymmetric distribution. This
is often done in toxicology studies, where a sufficiently high toxicity basically
guarantees death, but very small doses can also be fatal. In this case, the
complementary log-log model F^{−1}(π) = log{−log(1 − π)} is a good choice. A comparison of the complementary log-log, probit, and logistic links is given in Figure 4.1.
Agresti considers the following data, which come from an epidemiological survey investigating snoring as a risk factor for heart disease. The sample consists of 2484 subjects and is given in the following table.
                       Heart Disease
Snoring                 Yes     No
Never                    24   1355
Occasionally             35    603
Nearly every night       21    192
Every night              30    224
The variable snoring is ordinal; to account for this, we will assign the nu-
merical scores 0, 2, 4, and 5 to the categories. We construct the data in R as:
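The data may be constructed as follows (a sketch, consistent with the output below and with the model call used later):

yes <- c(24, 35, 21, 30)
no <- c(1355, 603, 192, 224)
snoring <- cbind(Yes = yes, No = no)    # successes and failures
snoring_scores <- c(0, 2, 4, 5)         # ordinal scores
cbind(snoring, snoring_scores)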
## Yes No snoring_scores
## [1,] 24 1355 0
## [2,] 35 603 2
## [3,] 21 192 4
## [4,] 30 224 5
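The model is fit with the following call (reconstructed from the Call line in the summary below):

snoring_fit <- glm(snoring ~ snoring_scores, family = binomial)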
Note how the data is input: the response is taken to be a matrix whose first column consists of the number of successes and whose second column consists of the number of failures. The term family = binomial specifies that we are doing logistic regression. We obtain a summary of the model as
summary(snoring_fit)
##
## Call:
## glm(formula = snoring ~ snoring_scores, family = binomial)
##
## Deviance Residuals:
## 1 2 3 4
## -0.8346 1.2521 0.2758 -0.6845
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
The snoring scores are highly significant. We also note that the deviance is D = 2.8089. The asymptotic χ² approximation is likely to be good here, as the cell counts are large. Comparing this to a χ²_{4−2} distribution, a P-value for the test against the saturated model is 0.2455, indicating that there is little evidence of lack of fit.
Exercise 4.5. Interpret the estimated coefficient for snoring scores and
construct a Wald interval for the coefficient.
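The fitted probability of heart disease at each snoring level can be recovered from the fit; a sketch consistent with the table below:

labels <- c("Never", "Occasionally", "Nearly Every Night", "Every Night")
cbind(labels, round(fitted(snoring_fit), 3))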
## [,1] [,2]
## 1 "Never" "0.021"
## 2 "Occasionally" "0.044"
## 3 "Nearly Every Night" "0.093"
## 4 "Every Night" "0.132"
This model was introduced in Exercise 4.2, where we saw that it corresponds to a Poisson GLM using the canonical link g(λ) = log λ; that is, the model is
$$\log\{E(Y_i \mid X_i = x)\} = x^\top\beta.$$
Thus, holding all other predictors fixed, shifting x_j by Δ has the effect of multiplying the mean by e^{Δβ_j}:
$$\frac{E(Y_i \mid X_i = x + \Delta e_j)}{E(Y_i \mid X_i = x)} = e^{\Delta\beta_j},$$
where e_j denotes the j-th standard basis vector.
As a running example, we consider the ships data from the MASS package, modeling the number of damage incidents using:
• An intercept term.
• A factor for the ship type.
• A factor for the year the ship was built.
• A factor for the period of operation.
• An offset term, log(service), accounting for the number of months in service.
library(MASS)
data(ships)
head(ships)
For this data, each row actually corresponds to the number of incidents for all ships of the same type/year/period, with the variable service giving the total number of months those ships were in service. The Poisson loglinear model can be fit as:
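The fitting call below is reconstructed from the Call line of the model summary shown later in these notes:

fit_ships_poisson <- glm(incidents ~ type + yearbuilt + period,
                         family = poisson, data = ships,
                         subset = (service != 0),
                         offset = log(service))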
We can also test for the individual factors using likelihood ratio tests; the anova function in R performs an analysis of deviance in which the various terms are added sequentially to the model.
It is also possible to test for the effect of each factor assuming all other terms
are being included in the model. This can be done using the drop1 function.
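For example (a sketch; test = "Chisq" requests the likelihood ratio tests):

anova(fit_ships_poisson, test = "Chisq")   # terms added sequentially
drop1(fit_ships_poisson, test = "Chisq")   # each term, given all others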
From the likelihood ratio tests, we see that there is a large amount of evidence
for all three predictors. The deviance for this example is
D <- deviance(fit_ships_poisson)
p_value <- pchisq(D, df.residual(fit_ships_poisson),
                  lower.tail = FALSE)
c(D = D, df = df.residual(fit_ships_poisson), p_value = p_value)
## D df p_value
## 38.69505154 25.00000000 0.03951433
There is some evidence of lack of fit for the model. One possibility for
correcting this is to consider interaction terms. For example:
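A sketch of this fit, with fit_ships_int our own name for the expanded model; its analysis of deviance is shown below:

fit_ships_int <- update(fit_ships_poisson, . ~ . + type:yearbuilt)
anova(fit_ships_int, test = "Chisq")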
##
## Response: incidents
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 33 146.328
## type 4 55.439 29 90.889 2.629e-11 ***
## yearbuilt 3 41.534 26 49.355 5.038e-09 ***
## period 1 10.660 25 38.695 0.001095 **
## type:yearbuilt 12 24.108 13 14.587 0.019663 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The deviance of 14.587 is now in line with the reference χ²_{13} distribution.
Exercise 4.6. Suppose Y has a negative binomial distribution with k trials and success probability π. Show that
$$E(Y) = \frac{k(1-\pi)}{\pi} \quad \text{and} \quad \operatorname{Var}(Y) = \frac{k(1-\pi)}{\pi^2}.$$
Writing π in terms of E(Y) = µ, show this gives
$$\pi = \frac{k}{\mu + k} \quad \text{and} \quad \operatorname{Var}(Y) = \mu + \frac{\mu^2}{k}.$$
Hint: we can write Y as the sum of k independent Geometric random variables.
The key point of the negative binomial GLM is that it implies overdispersion: instead of the variance being equal to µ, it is equal to µ + µ²/k. When k is very large relative to µ, there is not much overdispersion; on the other hand, if k is small relative to µ, then there is a lot of overdispersion.
When k = 1, the negative binomial corresponds to a geometric random variable. It turns out that the negative binomial family also includes the Poisson distribution as a limiting case.
Exercise 4.7. Suppose that
$$Y \sim \operatorname{Poisson}(\lambda) \text{ given } \lambda, \qquad \lambda \sim \operatorname{Gamma}(k, k/\mu).$$
(a) Show that Y has a negative binomial distribution with k trials and mean µ.
(b) As k → ∞, show that Y converges in distribution to a Poisson(µ).
Note: the gamma is parameterized so that E(λ) = µ and Var(λ) = µ²/k.
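Although the proof is left as an exercise, the mixture representation is easy to check by simulation; a sketch, with arbitrary choices of k and µ:

set.seed(1)
k <- 2; mu <- 3
lambda <- rgamma(1e5, shape = k, rate = k / mu)  # E(lambda) = mu, Var(lambda) = mu^2/k
y <- rpois(1e5, lambda)                          # Poisson draws given lambda
c(mean = mean(y), var = var(y), theory = mu + mu^2 / k)

The sample variance of y should be close to µ + µ²/k rather than µ, illustrating the overdispersion discussed next.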
Remark 4.4. Exercise 4.7 is important for the following reasons. First, it indicates that we do not need to restrict k to be an integer. Second, it makes it clear why Y is overdispersed relative to a Poisson; as we have mentioned, introducing a latent variable will automatically cause overdispersion. Third, it gives some intuitive justification for how a negative binomial might arise in practice, even if there is no obvious "coin flipping" going on. Fourth, it makes it clear that the Poisson GLM is a limiting case of a negative binomial GLM, so if a Poisson GLM is correct we will not lose anything (aside from some estimation efficiency) by using a negative binomial GLM.
The quasi-Poisson model retains the Poisson mean structure but allows the variance to be inflated:
$$\operatorname{Var}(Y_i \mid X_i = x) = \phi\,\mu_i.$$
The quasi-Poisson model reduces to the Poisson GLM when φ ≡ 1, but allows for overdispersion when φ > 1.
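We illustrate with data, discussed by Agresti, in which each of 1308 survey respondents reported the number of homicide victims they knew (resp), along with the respondent's race (race); this description is our own reconstruction of the example. The first few rows:

head(victim)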
## resp race
## 1 0 black
## 2 0 black
## 3 0 black
## 4 0 black
## 5 0 black
## 6 0 black
tail(victim)
## resp race
## 1303 2 white
## 1304 3 white
## 1305 3 white
## 1306 3 white
## 1307 3 white
## 1308 6 white
library(tidyverse)
victim %>% group_by(race) %>%
dplyr::summarise(mean = mean(resp), var = var(resp))
## # A tibble: 2 x 3
## race mean var
## <fct> <dbl> <dbl>
## 1 white 0.0923 0.155
## 2 black 0.522 1.15
Exercise 4.8. Explain why the table above suggests that the Poisson GLM
is not appropriate for modeling the homicide data.
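The negative binomial model can be fit with glm.nb from the MASS package and the quasi-Poisson model with family = quasipoisson; the object names below are our own:

fit_negbin <- MASS::glm.nb(resp ~ race, data = victim)
fit_quasi <- glm(resp ~ race, family = quasipoisson, data = victim)
k_negbin <- fit_negbin$theta                 # estimated size parameter k
phi_quasi <- summary(fit_quasi)$dispersion   # estimated dispersion phi
c(k = k_negbin, phi = phi_quasi)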
## k phi
## 0.2023119 1.7456940
This gives the estimates k = 0.2 for the negative binomial model and φ =
1.75 for the quasi-Poisson. We now see how the variance estimates for the
models line up with the empirical variance.
victim_summary <- victim %>%
  group_by(race) %>%
  dplyr::summarise(mean = mean(resp), empirical_var = var(resp)) %>%
  mutate(poisson_var = mean,
         negbin_var = mean + mean^2 / k_negbin,
         quasi_var = phi_quasi * mean)
print(victim_summary)
## # A tibble: 2 x 6
## race mean empirical_var poisson_var negbin_var quasi_var
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 white 0.0923 0.155 0.0923 0.134 0.161
## 2 black 0.522 1.15 0.522 1.87 0.911
This gives the following table:

Model               Var, White   Var, Black
Observed                  0.16         1.15
Poisson                   0.09         0.52
Negative Binomial         0.13         1.87
Quasi-Poisson             0.16         0.91
The quasi-Poisson model does quite well at recovering the observed/empirical variance for each race, while the Poisson does not do very well. The negative binomial is somewhat in-between, giving a slightly lower variance for white respondents and a moderately larger variance for black respondents.
Differentiating the log-likelihood L with respect to β gives the score function
$$u(\beta) = \frac{\partial L}{\partial \beta} = \sum_{i=1}^n \frac{\omega_i X_i (Y_i - \mu_i)}{\phi\, V(\mu_i)\, g'(\mu_i)},$$
where V(µ_i) = b″{(b′)^{−1}(µ_i)}, and the MLE of β satisfies u(β̂) = 0.
Remark 4.6. Even though there are no β's on the right-hand side of u(β), note that µ_i = g^{−1}(X_i^⊤β), so that µ_i depends implicitly on β.
Next, we derive the Fisher information. We have
$$\begin{aligned}
-\frac{\partial^2 L}{\partial\beta_j\,\partial\beta_k} &= -\frac{1}{\phi}\sum_i \omega_i X_{ij}\,\frac{\partial}{\partial\beta_k}\,\frac{Y_i - \mu_i}{V(\mu_i)\,g'(\mu_i)} \\
&= -\frac{1}{\phi}\sum_i \omega_i X_{ij}\left\{\frac{\partial}{\partial\mu_i}\,\frac{Y_i - \mu_i}{V(\mu_i)\,g'(\mu_i)}\right\}\frac{\partial\mu_i}{\partial\eta_i}\,\frac{\partial\eta_i}{\partial\beta_k} \\
&= \frac{1}{\phi}\sum_i X_{ij}X_{ik}\left[\frac{\omega_i}{V(\mu_i)\,g'(\mu_i)^2} - \frac{\omega_i(Y_i-\mu_i)}{g'(\mu_i)}\,\frac{\partial}{\partial\mu_i}\left(\frac{1}{V(\mu_i)\,g'(\mu_i)}\right)\right].
\end{aligned}$$
It follows that the observed Fisher information is
$$\mathcal J = X^\top \widetilde W X/\phi,$$
where W̃ is diagonal with entries
$$\widetilde w_{ii} = \frac{\omega_i}{V(\mu_i)\,g'(\mu_i)^2} - \frac{\omega_i(Y_i - \mu_i)}{g'(\mu_i)}\,\frac{\partial}{\partial\mu_i}\left(\frac{1}{V(\mu_i)\,g'(\mu_i)}\right),$$
while the expected Fisher information is
$$\mathcal I = E_\beta(\mathcal J) = X^\top W X/\phi,$$
where W is diagonal with entries ω_i/[V(µ_i)g′(µ_i)²]; here we have simply used the fact that E_β(Y_i − µ_i) = 0. We summarize these results as:
(a) The observed Fisher information is J = X^⊤W̃X/φ.
(b) The matrix W̃ is diagonal, with entries
$$\widetilde w_{ii} = \frac{\omega_i}{V(\mu_i)\,g'(\mu_i)^2} - \frac{\omega_i(Y_i - \mu_i)}{g'(\mu_i)}\,\frac{\partial}{\partial\mu_i}\left(\frac{1}{V(\mu_i)\,g'(\mu_i)}\right).$$
(c) The expected Fisher information is I = E_β(J) = X^⊤WX/φ, where W is diagonal with entries ω_i/[V(µ_i)g′(µ_i)²].
(d) The matrix W can be approximated with Ŵ, which has diagonal entries ω_i/[V(µ̂_i)g′(µ̂_i)²], where µ̂_i is the MLE of µ_i.
Remark 4.7. Note the similarity between the asymptotic covariance of β̂ for the linear model Y = Xβ + ε with iid mean-0, finite-variance errors ε_i and that for the GLM. In the former case, the covariance of β̂ is σ²(X^⊤X)^{−1}. The formula φ(X^⊤WX)^{−1} generalizes this to the case of generalized linear models.
Remark 4.8. Typically, we favor the use of the observed Fisher information J (evaluated at β̂) to approximate the variance of β̂; we do not favor the use of the expected Fisher information I.
Exercise 4.9. Recall that we defined V(µ_i) = b″{(b′)^{−1}(µ_i)} and that we define the canonical link by g(µ_i) = (b′)^{−1}(µ_i).
(a) Show that J = I when the canonical link is used. Hint: Show that V(µ_i)g′(µ_i) = 1.
(b) Suppose that the canonical link is used, that φ is known, and that the design matrix X is regarded as fixed and known. Show that X^⊤Ỹ is sufficient for β, where Ỹ = (ω_1Y_1, …, ω_nY_n)^⊤.
The Pearson residual for observation i is defined as
$$e_i = \frac{\sqrt{\omega_i}\,(Y_i - \hat\mu_i)}{\sqrt{V(\hat\mu_i)}}.$$
The Pearson residual is relatively intuitive: it measures how far Y_i is from its estimated mean, scaled by an estimate of its standard deviation (up to the dispersion). Additionally, the Pearson residuals are associated with the generalized X² statistic
$$X^2 = \sum_{i=1}^n \frac{\omega_i\,(Y_i - \hat\mu_i)^2}{V(\hat\mu_i)} = \sum_{i=1}^n e_i^2.$$
Note: The Pearson residuals do not depend on φ; similarly, the deviance resid-
uals (to be defined below) do not depend on φ. Hence, they can have variance
much larger than 1; keep this in mind when trying to interpret them!
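In R, both types of residuals can be extracted from a fitted GLM with the residuals function; for instance, for the ships model:

residuals(fit_ships_poisson, type = "pearson")    # Pearson residuals
residuals(fit_ships_poisson, type = "deviance")   # deviance residuals (defined next)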
Let
$$d_i = 2\,\omega_i\left[Y_i(\tilde\theta_i - \hat\theta_i) - \left(b(\tilde\theta_i) - b(\hat\theta_i)\right)\right],$$
where θ̃_i and θ̂_i are as in Definition 4.4. Then we define the deviance residual for observation i as
$$\sqrt{d_i} \times \operatorname{sign}(Y_i - \hat\mu_i).$$
The deviance D is the sum of the squared deviance residuals, giving an analogy with the relationship between the Pearson residuals and the generalized X² statistic. The quantiles of the deviance residuals are reported in the output of the glm function in R. For the ships dataset, we have
summary(fit_ships_poisson)
##
## Call:
## glm(formula = incidents ~ type + yearbuilt + period, family = poisson,
## data = ships, subset = (service != 0), offset = log(service))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
The deviance residuals for this example vary between −1.7 and 2.8.
Generally, we prefer to have residuals which are standardized to have (approximate) mean 0 and variance 1. The deviance residuals and Pearson residuals do not have this property. In the case of the Pearson residual, the reason that we do not have approximate variance 1 is that Y_i and µ̂_i are correlated; the same issue occurs in linear regression models, wherein the standardized residuals (Y_i − Ŷ_i)/σ̂ do not have variance 1. When φ = 1, Agresti shows (see Sections 4.5.6 and 4.5.7) that
$$\operatorname{Var}(e_i) \approx 1 - \hat h_{ii}, \quad \text{where} \quad \hat h_{ii} = \left[\widehat W^{1/2} X (X^\top \widehat W X)^{-1} X^\top \widehat W^{1/2}\right]_{ii}. \tag{4.2}$$
This motivates the standardized Pearson residual e_i/√(1 − ĥ_ii), which has approximate variance 1.
Example 4.8. This example concerns the 1973 admissions data for departments at the University of California, Berkeley. The key inferential issue is assessing whether there is evidence of sex bias in the admissions practices. We consider two predictors: the sex of the applicant and the department applied to.
The data is built into R and can be loaded as follows:
data(UCBAdmissions)
berk_0 <- data.frame(UCBAdmissions)
print(berk_0)
We place this data into a form suitable for the glm function (with admit-
ted/rejected counts on the same row) using the tidyverse package:
library(tidyverse)
berk <- berk_0 %>% group_by(Gender, Dept) %>%
  dplyr::summarise(Admitted = sum(Freq * (Admit == "Admitted")),
                   Rejected = sum(Freq * (Admit == "Rejected")))
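The sex-only model is fit as follows (the object name fit_gender is our own; the call matches the summary below):

fit_gender <- glm(cbind(Admitted, Rejected) ~ Gender,
                  family = binomial, data = berk)
summary(fit_gender)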
##
## Call:
## glm(formula = cbind(Admitted, Rejected) ~ Gender, family = binomial,
## data = berk)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -16.7915 -4.7613 -0.4365 5.1025 11.2022
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.22013 0.03879 -5.675 1.38e-08 ***
## GenderFemale -0.61035 0.06389 -9.553 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 877.06 on 11 degrees of freedom
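The output above indicates that males are significantly more likely to be admitted. Next, we adjust for department (the object name fit_gender_dept is again our own; the call matches the summary below):

fit_gender_dept <- glm(cbind(Admitted, Rejected) ~ Gender + Dept,
                       family = binomial, data = berk)
summary(fit_gender_dept)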
##
## Call:
## glm(formula = cbind(Admitted, Rejected) ~ Gender + Dept, family = binomial,
## data = berk)
##
## Deviance Residuals:
## 1 2 3 4 5 6 7 8
## -1.2487 -0.0560 1.2533 0.0826 1.2205 -0.2076 3.7189 0.2706
## 9 10 11 12
## -0.9243 -0.0858 -0.8509 0.2052
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.58205 0.06899 8.436 <2e-16 ***
## GenderFemale 0.09987 0.08085 1.235 0.217
## DeptB -0.04340 0.10984 -0.395 0.693
## DeptC -1.26260 0.10663 -11.841 <2e-16 ***
## DeptD -1.29461 0.10582 -12.234 <2e-16 ***
## DeptE -1.73931 0.12611 -13.792 <2e-16 ***
## DeptF -3.30648 0.16998 -19.452 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 877.056 on 11 degrees of freedom
## Residual deviance: 20.204 on 5 degrees of freedom
## AIC: 103.14
##
## Number of Fisher Scoring iterations: 4
Interestingly, our results conflict with those of the previous analysis. Females have an odds of admittance which is e^{0.1} ≈ 1.11 times that of males, and the result is not statistically significant. On the other hand, we see that the department an individual applies to is highly important!
This is an example of Simpson’s paradox — the direction of the effect of sex
reverses after we take department into account. Essentially what is happening
is that females apply to departments which are more selective than males. In
particular, examining the raw data, we see that females tend not to apply to
departments A and B, which happen to also have very high acceptance rates.
Hence, the effect of gender is primarily accounted for by the tendency of females
to apply to selective departments.
Next, we observe that the model logit(π_ij) = α + β_i + γ_j actually does not fit the data particularly well. We have a residual deviance of 20.204 on 5 degrees of freedom, which gives a P-value of 0.001 for the test of the model against a saturated model. To understand why the model does not appear to fit well, we look at the standardized Pearson residuals:
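These can be computed with rstandard (a sketch, using our fit_gender_dept object from above):

rstandard(fit_gender_dept, type = "pearson")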
## 1 2 3 4 5 6
## -4.0272880 -0.2797222 1.8808316 0.1412619 1.6334924 -0.3026439
## 7 8 9 10 11 12
## 4.0272880 0.2797222 -1.8808316 -0.1412619 -1.6334924 0.3026439
We see very large residuals relative to a standard normal distribution for ob-
servation 1 (department A, male) and observation 7 (department A, female).
Examining these observations, we see that department A seems to have an ex-
tremely high acceptance rate for females; 512/825 males are accepted, while
89/108 females are accepted. If we fit the model with department A removed,
we get
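A sketch of the fit (the object name is our own; the call matches the summary below):

fit_berk_noA <- glm(cbind(Admitted, Rejected) ~ Dept + Gender,
                    family = binomial, data = subset(berk, Dept != "A"))
summary(fit_berk_noA)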
##
## Call:
## glm(formula = cbind(Admitted, Rejected) ~ Dept + Gender, family = binomial,
## data = subset(berk, Dept != "A"))
##
## Deviance Residuals:
## 2 3 4 5 6 8 9 10
## -0.1191 0.5239 -0.5164 0.6868 -0.5024 0.5680 -0.3914 0.5440
## 11 12
## -0.4892 0.5158
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.54418 0.08584 6.340 2.3e-10 ***
## DeptC -1.14008 0.12188 -9.354 < 2e-16 ***
## DeptD -1.19456 0.11984 -9.968 < 2e-16 ***
## DeptE -1.61308 0.13928 -11.581 < 2e-16 ***
## DeptF -3.20527 0.17880 -17.927 < 2e-16 ***
## GenderFemale -0.03069 0.08676 -0.354 0.724
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 539.4581 on 9 degrees of freedom
## Residual deviance: 2.5564 on 4 degrees of freedom
## AIC: 71.791
##
## Number of Fisher Scoring iterations: 3
This model seems to fit the data extremely well, with deviance 2.56 on 4 degrees
of freedom. Considering department A in isolation, we have:
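Again a sketch, with our own object name; the call matches the summary below:

fit_berk_A <- glm(cbind(Admitted, Rejected) ~ Gender,
                  family = binomial, data = subset(berk, Dept == "A"))
summary(fit_berk_A)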
##
## Call:
## glm(formula = cbind(Admitted, Rejected) ~ Gender, family = binomial,
## data = subset(berk, Dept == "A"))
##
## Deviance Residuals:
## [1] 0 0
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.49212 0.07175 6.859 6.94e-12 ***
## GenderFemale 1.05208 0.26271 4.005 6.21e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1.9054e+01 on 1 degrees of freedom
## Residual deviance: 5.5511e-15 on 0 degrees of freedom
## AIC: 15.706
##
## Number of Fisher Scoring iterations: 3
This equation can be solved numerically using Newton's method. Recall that, to solve the equation g(x) = 0, Newton's method iterates the update
$$x \leftarrow x - (\nabla g(x))^{-1}\, g(x),$$
where ∇g denotes the Jacobian matrix of g. Recalling that the Jacobian matrix of −u(β) is the observed Fisher information, we get the following algorithm: starting from an initial guess, iterate
$$\beta \leftarrow \beta + \mathcal J(\beta)^{-1}\, u(\beta)$$
until convergence.
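To make the algorithm concrete, here is a minimal sketch of the iteration for the special case of a Poisson GLM with the canonical log link, so that ω_i = 1, φ = 1, and V(µ)g′(µ) = 1, giving u(β) = X^⊤(Y − µ) and J = X^⊤ diag(µ) X; the function name is our own:

fisher_scoring_poisson <- function(X, y, tol = 1e-10, maxit = 50) {
  beta <- rep(0, ncol(X))
  for (iter in seq_len(maxit)) {
    mu <- exp(drop(X %*% beta))     # mu_i = exp(x_i' beta), the inverse link
    u <- drop(t(X) %*% (y - mu))    # score function for the canonical link
    J <- t(X) %*% (mu * X)          # observed (= expected) information
    step <- solve(J, u)
    beta <- beta + step             # Newton update: beta <- beta + J^{-1} u(beta)
    if (sum(abs(step)) < tol) break
  }
  beta
}

# Sanity check against glm() on simulated data:
set.seed(1)
X <- cbind(1, rnorm(100))
y <- rpois(100, exp(0.5 + 0.3 * X[, 2]))
cbind(ours = fisher_scoring_poisson(X, y),
      glm = coef(glm(y ~ X[, 2], family = poisson)))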
Exercise 4.11. Show that the MLE β̂ does not depend on the value of the dispersion parameter φ.
Recall that the expected Fisher information is
$$\mathcal I = X^\top W X/\phi,$$
and that the score function is
$$u(\beta) = \sum_{i=1}^N \frac{\omega_i X_i (Y_i - \mu_i)}{\phi\, V(\mu_i)\, g'(\mu_i)}.$$
Quasi-likelihood methods arise from the observation that the solution β̂ to the equation u(β) = 0 is valid under the weaker set of assumptions:
• The mean of the response is correctly specified, i.e., g(µ_i) = X_i^⊤β.
• The variance of the response is given by Var(Y_i | X_i = x) = (φ/ω_i) V(µ_i) for some φ, known weight ω_i, and function V(µ).
Further, the function u(β) behaves very much like a score function, even when it does not correspond to any genuine GLM score function. For example, when the assumptions above are true, the asymptotic variance of β̂ is given by the inverse of a "pseudo" Fisher information,
$$\mathcal I^{-1} = \phi\,(X^\top W X)^{-1}.$$
To illustrate, consider the snoring data again. There are two equivalent ways to set up the binomial model:
(M1) For each of the 4 snoring levels, group together the individuals at that level and let Y_j denote the proportion of the n_j individuals at level j who develop heart disease, so that n_j Y_j ∼ Binomial(n_j, π_j).
(M2) For each of the 2484 individuals surveyed, let Y_i ∼ Binomial(1, π_i), where π_i is the probability that an individual at snoring level X_i develops heart disease.
These two approaches describe the same generative model, but differ in that M1 considers a total of N = 4 observations by grouping together all individuals at the same level of snoring, whereas M2 considers a total of N = 2484 individuals. Fitting these two models will result in the same β̂, the same variance estimator for β̂, and the same inferences in general, as the likelihood function for β is the same.
These approaches differ in one key aspect! They do not result in the same saturated model! Note that the saturated model for M1 has four parameters: one for each snoring level. On the other hand, the saturated model for M2 has 2484 parameters: one for each subject! Not only do the saturated models differ, but the deviance statistic for M1 can be well-approximated by a χ²_{4−p} distribution, while the deviance statistic for M2 does not have an asymptotic χ² distribution, because the number of parameters of the saturated model grows with n.
To see this, first, we examine the deviance statistic with the grouped data
(i.e., using M1)
deviance(snoring_fit)
## [1] 2.808912
df.residual(snoring_fit)
## [1] 2
coef(snoring_fit)
## (Intercept) snoring_scores
## -3.8662481 0.3973366
The model fits quite well. Now, let’s fit the model to ungrouped data (i.e.,
using M2).
counts <- c(24, 1355, 35, 603, 21, 192, 30, 224)
snore_scores <- c(0, 0, 2, 2, 4, 4, 5, 5)
binary_data <- c(rep(1, counts[1]), rep(0, counts[2]),
rep(1, counts[3]), rep(0, counts[4]),
rep(1, counts[5]), rep(0, counts[6]),
rep(1, counts[7]), rep(0, counts[8]))
binary_scores <- c(rep(0, counts[1]), rep(0, counts[2]),
rep(2, counts[3]), rep(2, counts[4]),
rep(4, counts[5]), rep(4, counts[6]),
rep(5, counts[7]), rep(5, counts[8]))
big_snore <- data.frame(disease = binary_data, snore = binary_scores)
tail(big_snore)
## disease snore
## 2479 0 5
## 2480 0 5
## 2481 0 5
## 2482 0 5
## 2483 0 5
## 2484 0 5
The object big_snore has a separate row for each individual, giving the disease status and snoring score. Fitting this model, we see that the estimated MLE is the same but the deviance is different.
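The fitting call and deviance computation (reconstructed; the variable names match the coefficients printed below):

big_snore_fit <- glm(disease ~ snore, family = binomial, data = big_snore)
deviance(big_snore_fit)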
## [1] 837.7316
df.residual(big_snore_fit)
## [1] 2482
coef(big_snore_fit)
## (Intercept) snore
## -3.8662481 0.3973366
For the purpose of assessing goodness of fit, we should use the deviance for
M1 rather than M2.