
Chapter 4

Generalized Linear Models

The techniques presented in Chapter 2 and Chapter 3 are limited in two respects.
First, they consider only categorical predictors X, without allowing for numeric
predictors. Second, they accommodate only relatively small numbers of predictors.
To overcome these limitations, we develop an extension of the linear model which
is capable of handling general predictors while modeling categorical and count
responses; this leads to the so-called generalized linear model, or GLM.

4.1 Introduction to GLMs


Consider data consisting of a response vector Y = (Y1 , . . . , Yn ) and a design
matrix
 �
X1
 
X =  ...  .
Xn�
We will model the distribution of [Yi | Xi = xi ] using a distribution from the
exponential dispersion family.

Definition 4.1. A family of densities/mass functions {f(·; θ, φ) : θ ∈ Θ, φ ∈ Φ} is an exponential dispersion family if we can write

    f(y; θ, φ) = exp{ (yθ − b(θ))/(φ/ω) + c(y, φ) }

for known functions b(·), c(·, ·) and known constant ω > 0. The parameter
θ is referred to as the canonical parameter and φ is known as the dispersion
parameter.
Remark 4.1. This definition is given in Section 4.4 of Agresti; we will be using
this one rather than using the less general exponential family definition from
Section 4.1.1 of Agresti.


Example 4.1. Consider Y ∼ Poisson(λ). The mass function of Y is

    f(y; λ) = λ^y e^{−λ} / y! = exp{ y log λ − λ − log y! }.

Taking θ = log λ, b(θ) = e^θ, φ = ω = 1, and c(y, φ) = − log y!, we have

    f(y; λ) = exp{ (yθ − b(θ))/(φ/ω) + c(y, φ) }.
Hence the Poisson(λ) family is an exponential dispersion family.
Example 4.2. Let Y = Z/n where Z ∼ Binomial(n, π). Then Y has mass function

    f(y; n, π) = C(n, ny) π^{ny} (1 − π)^{n(1−y)}
               = exp{ ny log(π/(1 − π)) + n log(1 − π) + log C(n, ny) },

where C(n, ny) denotes the binomial coefficient "n choose ny". Now set θ = log(π/(1 − π)), φ = 1, ω = n, b(θ) = log(1 + e^θ), and c(y, φ) = log C(n, ny). Then

    f(y; n, π) = exp{ (yθ − b(θ))/(φ/ω) + c(y, φ) }.
Hence the possible distributions of Y form an exponential dispersion family.

Exercise 4.1. Show that the following families are exponential dispersion
families for suitable choices of b(·), c(·, ·), θ, φ, ω.

(a) The Normal(µ, σ 2 ) family.


(b) The Gamma(α, β) family.

Definition 4.2. A generalized linear model for Y and X models the re-
sponse Yi with the density/mass function
    f(yi; θ(xi), ωi, φ) = exp{ (yi θ(xi) − b(θ(xi)))/(φ/ωi) + c(yi, φ) },

where

1. f(·; θ, ω, φ) is an exponential dispersion family; this is referred to as the stochastic component of the model.

2. g(µi) = xi′β for a known link function g(·), where µi = E{Yi | Xi = xi}. This is referred to as the systematic component of the model. The term ηi = xi′β is referred to as the linear predictor.

4.1.1 Moments of the Exponential Dispersion Family


Given an exponential dispersion family, the log-likelihood is given by

    log f(y; θ, φ, ω) = (yθ − b(θ))/(φ/ω) + c(y, φ).

Suppose (φ, ω) are known; then the score is given by

    u(θ) = (y − b′(θ))/(φ/ω).

Because the score has mean 0, this gives

    Eθ u(θ) = (Eθ(Y) − b′(θ))/(φ/ω) = 0.

Hence Eθ(Y) = b′(θ). Next, the Fisher information is given by

    Iθ = b′′(θ)/(φ/ω).

Note that, because the second derivative does not depend on Y, this is both the observed and expected Fisher information for θ. Because the Fisher information gives the variance of the score, we have

    (ω²/φ²) Varθ(Y) = ω b′′(θ)/φ.

Thus, Varθ(Y) = (φ/ω) b′′(θ). These facts are summarized below.

Fact 4.1. Suppose that Y ∼ f(y; θ, φ, ω) where f(·; θ, φ, ω) is an exponential dispersion family. Then

1. Eθ(Y) = b′(θ); and

2. Varθ(Y) = (φ/ω) b′′(θ).

4.1.2 The canonical parameter and the canonical link


Fact 4.1 gives an implicit relationship between θ and µ in a generalized linear model. Recall that we assume

    g(µi) = xi′β,

for some parameter β in a GLM. Now, we have just shown that µi = b′(θi); hence,

    g(b′(θi)) = xi′β,

so that

    θi = (b′)⁻¹(g⁻¹(xi′β)).

Now, imagine we set g(µ) = (b′)⁻¹(µ). This leads to the model θi = xi′β, so that the linear predictor ηi = xi′β and the canonical parameter θi coincide. This choice of link function is referred to as the canonical link.

Definition 4.3. In a generalized linear model, the canonical link function is given by g(µ) = (b′)⁻¹(µ), and the resulting mass function for Yi is

    exp{ (yi xi′β − b(xi′β))/(φ/ωi) + c(yi, φ) }.

In some sense, this choice of link function is the most “natural” choice of
link function, and (as we will see) various aspects of GLMs become simplified
when the canonical link is chosen.
Example 4.3. Consider the binomial dispersion family from Example 4.2, in which Y = Z/n where Z ∼ Binomial(n, π). For this model, we have φ = 1, ω = n, and b(θ) = log(1 + e^θ) where θ = log(π/(1 − π)). We have

    b′(θ) = e^θ / (1 + e^θ).

Now, noting that expit(x) = e^x / (1 + e^x) is the inverse function of logit(x) = log(x/(1 − x)), we have

    E(Y) = b′(θ) = π.

Taking a second derivative, we have

    b′′(θ) = e^θ / (1 + e^θ)² = (e^θ / (1 + e^θ)) · (1 − e^θ / (1 + e^θ)) = π(1 − π).

The variance of Y is then

    Var(Y) = (φ/ω) b′′(θ) = π(1 − π)/n.

This illustrates that the formulae we have given for the moments work as intended. The canonical link is given by the inverse of b′(θ) = expit(θ), i.e., g(π) = log(π/(1 − π)). The generalized linear model for binomial proportion data with the link function g(π) = log(π/(1 − π)) is referred to as a logistic regression model.

Exercise 4.2. Show that the Poisson exponential dispersion family described in Example 4.1 has canonical link g(λ) = log λ. Additionally, use the properties of the exponential dispersion family to verify that E(Y) = λ and Var(Y) = λ for the Poisson distribution.
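As an aside, R's family objects store the link and variance functions that appear in these calculations. A minimal sketch using base R's binomial() and poisson() families (nothing here is specific to a fitted model):

fam <- binomial()          # canonical link = logit
fam$linkfun(0.25)          # logit(0.25) = log(1/3)
fam$linkinv(log(1/3))      # expit; recovers 0.25
fam$variance(0.25)         # V(mu) = mu * (1 - mu) = 0.1875
poisson()$variance(3)      # V(mu) = mu for the Poisson family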

4.1.3 GLMs model the variance


For most types of categorical data, there is a relationship between the mean
and variance. For example, Poisson models have E(Y ) = Var(Y ) and binomial
proportion models (Y = Z/n, Z ∼ Binomial(n, π)) have Var(Y ) = E(Y )[1 −
E(Y )]/n.
One reason for preferring generalized linear models is that they respect these
relationships. That is, generalized linear models automatically incorporate het-
eroskedasticity. For example, the Poisson model has

    Var(Yi) = e^{θi}.

Assuming a canonical link, we have Var(Yi | Xi = x) = E(Yi | Xi = x) = e^{x′β}.

4.1.4 The Deviance of a GLM


Given data D = {(Xi, Yi)} which are modeled by some GLM, the log-likelihood is given by

    Σ_{i=1}^n [ ωi(Yi θ(µi) − b(θ(µi)))/φ + c(Yi, φ) ],    (4.1)

where recall that

    θ(µi) = (b′)⁻¹(µi)  and  g(µi) = Xi′β.

For a fixed φ, the model which fits the data as closely as possible is the model which simply takes µi = Yi.

Exercise 4.3. Let µ = (µ1 , . . . , µn ). Show that (4.1) is maximized as a


function of µ when µi = Yi for i = 1, . . . , n.

The log-likelihood of this model is given by

    Σ_{i=1}^n [ ωi(Yi θ(Yi) − b(θ(Yi)))/φ + c(Yi, φ) ].

This model, which fits the data as closely as possible, is referred to as the
saturated model, and is a model which has a separate mean parameter for every
observation.

Definition 4.4. The scaled deviance of a GLM given data D = {(Xi, Yi)} is the likelihood ratio test statistic for testing the model against the saturated model, i.e. it is

    D* = 2 Σ_{i=1}^n ωi [ Yi(θ̃i − θ̂i) − (b(θ̃i) − b(θ̂i)) ] / φ,

where θ̃i = θ(Yi), θ̂i = θ(µ̂i), and µ̂i = g⁻¹(Xi′β̂), where β̂ is the MLE of β.

The deviance of a GLM is

    D = φD* = 2 Σ_{i=1}^n ωi [ Yi(θ̃i − θ̂i) − (b(θ̃i) − b(θ̂i)) ].

Remark 4.2. For Poisson and Binomial GLMs, φ ≡ 1 so that the scaled deviance D* and the (raw) deviance D are equal.
A common use of the scaled deviance D* is as a test statistic for assessing goodness of fit. A sensible way to check goodness of fit is to test your model against the model which fits the data as well as possible; if you cannot reject your model in favor of this larger model, this gives some assurance that the model is not out-of-line with the data.
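To make the deviance concrete, here is a minimal sketch (simulated data; the variable names are illustrative) that computes the Poisson deviance directly from its definition and compares it with the value reported by R:

set.seed(1)
x   <- rnorm(50)
y   <- rpois(50, exp(0.5 + 0.3 * x))
fit <- glm(y ~ x, family = poisson)
mu  <- fitted(fit)
## Poisson deviance: 2 * sum[y * log(y / mu) - (y - mu)], with y * log(y / mu) taken as 0 when y = 0
D_hand <- 2 * sum(ifelse(y > 0, y * log(y / mu), 0) - (y - mu))
c(by_hand = D_hand, from_R = deviance(fit))   # the two values should agree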

Fact 4.2. Consider a GLM for D = {(Xi, Yi)} with a p-dimensional coefficient vector β and m observations Yi. Assume that the model is correct. Then:

1. If the GLM is a Poisson GLM, then D ∼ χ²_{m−p} for fixed m as the true expected counts µi tend to ∞.

2. If the GLM is a Binomial GLM, then D ∼ χ²_{m−p} for fixed m as the true expected counts ni πi tend to ∞.

Remark 4.3. It is not true in general that D or D* will have an asymptotic χ² distribution as the number of observations diverges (m → ∞). This is because the number of parameters in the saturated model increases as the number of observations increases. This excludes a number of important special cases: for example, binary regression will not typically have an asymptotic χ² distribution for D because the expected counts are all bounded by ni = 1 (and hence cannot go to ∞).

4.1.5 Analysis of deviance


The scaled deviance D* can also be used to conduct hypothesis tests of nested models. Suppose we have a GLM whose coefficient vector is partitioned as (β, γ), so that the linear predictor is x′(β′, γ′)′, and we are interested in testing the hypothesis H0 : γ = 0 against the alternative H1 : γ ≠ 0. This can be done using a likelihood ratio test, which can be conveniently expressed in the form

    Λ = D0* − D1* ∼ χ²_d under H0,

where D0* is the scaled deviance of the model under H0, D1* is the scaled deviance of the model under H1, and d = dim γ. An application of analysis of deviance is given in Section 4.3.2. This χ² approximation is valid even when D* is not itself χ²; all we need is that d = dim γ is fixed in the asymptotics.

4.1.6 Why GLMs?


The GLM approach should be contrasted with a "transformation"-based approach. For example, consider Yi ∼ Poisson(λi) and suppose we want to model Yi as a function of predictors Xi. Aside from the fact that λi must be positive, the linear model

    Yi = Xi′β + εi

has a problem in that the εi's will not have the same variance, i.e., there is heteroskedasticity. To resolve this problem, one approach is to transform Yi so that it is close to homoskedastic. For Poisson data, the so-called "variance stabilizing transformation" gives the linear model

    √Yi = Xi′β + εi,

where now the εi's have approximately constant variance.

The transformation approach considers modeling E[T(Yi)], in this case where T(y) = √y. A generalized linear model instead considers transforming the mean; we consider g(E(Yi)). Even if T = g, these approaches are not the same because E(g(Y)) ≠ g(E(Y)) in general.
A primary advantage of a GLM over using a transformation is that we have a model for E(Yi) directly. The transformation approach only gives a model for E(T(Yi)) and makes inference about E(Yi) difficult.
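As a small illustration (simulated data; names are illustrative), the following sketch contrasts the two approaches: the lm() fit models E(√Yi), while the glm() fit models E(Yi) on the log scale.

set.seed(2)
x <- runif(200)
y <- rpois(200, exp(1 + 2 * x))
fit_transform <- lm(sqrt(y) ~ x)               # variance-stabilizing transformation
fit_glm       <- glm(y ~ x, family = poisson)  # models log E(Y) directly
## Squaring the lm() prediction is not an estimate of E(Y); the GLM targets E(Y) directly.
c(lm_squared = unname(predict(fit_transform, data.frame(x = 0.5)))^2,
  glm        = unname(predict(fit_glm, data.frame(x = 0.5), type = "response")),
  truth      = exp(1 + 2 * 0.5))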

4.2 Binomial GLMs


4.2.1 Logistic Regression
We consider the case of binomial proportion data, in which the response is Yi = Zi/n where Zi ∼ Binomial(n, πi). The logistic regression model sets

    logit(πi) = xi′β,  where  logit(π) = log(π/(1 − π))

is referred to as the logit function. As mentioned in Example 4.3, the logit function has inverse logit⁻¹(x) = expit(x), where expit(x) = e^x/(1 + e^x) = (1 + e^{−x})⁻¹. Hence, the model for πi is given by

    πi = exp(xi′β) / (1 + exp(xi′β)).

A plot of the expit function is given below. As we can see, the function has an
“S” shape, and tends to 0, 1 as x → ±∞. A nice feature of the expit function
is that it respects the fact that πi ∈ [0, 1].

[Figure: the expit function plotted over x from −6 to 6, increasing from 0 to 1.]

Example 4.4. On January 28, 1986, the space shuttle Challenger broke apart
just after launch, taking the lives of all seven of the crew. This example is taken
from an article by Dalal et al. (1989), which examined whether the incident
should have been predicted, and hence prevented, on the basis of data from
previous flights.
The cause of the failure was ultimately attributed to the failure of a crucial
shuttle component known as the O-rings; these components had been tested
prior to the launch to see if they could hold up under a variety of temperatures.
For our analysis, we will let Yi = 1 or 0 if the O-ring failed on a given test
shuttle flight and let Xi = (1, Temperaturei ). The temperature on the day of
the Challenger launch was 31◦ Fahrenheit, or roughly 0◦ Celsius.
We let Yi ∼ Bernoulli(πi ) and use a logistic regression model logit(πi ) =
Xi� β. The data is available in the vcd package:

library(vcd)
data(SpaceShuttle)
head(SpaceShuttle)

## FlightNumber Temperature Pressure Fail nFailures Damage


## 1 1 66 50 no 0 0
## 2 2 70 50 yes 1 4
## 3 3 69 50 no 0 0
## 4 4 80 50 <NA> NA NA
## 5 5 68 50 no 0 0
## 6 6 67 50 no 0 0

The model can be fit using the glm function in R as follows.

fit_space <- glm(I(Fail == 'yes') ~ Temperature, data = SpaceShuttle,
                 family = binomial)
summary(fit_space)

##
## Call:
## glm(formula = I(Fail == "yes") ~ Temperature, family = binomial,
## data = SpaceShuttle)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0611 -0.7613 -0.3783 0.4524 2.2175
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 15.0429 7.3786 2.039 0.0415 *
## Temperature -0.2322 0.1082 -2.145 0.0320 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 28.267 on 22 degrees of freedom
## Residual deviance: 20.315 on 21 degrees of freedom
## (1 observation deleted due to missingness)
## AIC: 24.315
##
## Number of Fisher Scoring iterations: 5

As we can see, the coefficient corresponding to the predictor Temperature is


significant; we will delve into the other components of the fit later, but for now
we will just plot the fit.

beta <- coef(fit_space)

## expit() is not part of base R; define it so that the plotting code below runs
expit <- function(x) 1 / (1 + exp(-x))

plot(function(x) expit(beta[1] + x * beta[2]), xlim = c(40, 90),
     xlab = "Temperature", ylab = "Estimated probability of failure")
with(SpaceShuttle, points(Temperature, ifelse(Fail == 'yes', 1, 0)))

[Figure: estimated probability of O-ring failure as a function of temperature (40–90 °F), with the observed failure indicators overlaid.]

We make two observations at this point:


1. The dataset is relatively small, and certainly predicting the outcome at
31◦ based on these results requires a large amount of extrapolation.
2. If we take the model at face-value, a failure was inevitable; the estimated
probability of failure at 31◦ is virtually 1:

expit(beta[1] + 31 * beta[2])

## (Intercept)
## 0.9996088

It seems unlikely that the astronauts would get on the shuttle if they were
aware of this.

4.2.2 Interpreting the coefficients of a logistic regression


A nice feature of the logistic regression model is that the coefficients have relatively simple interpretations. Consider predictor j and a vector x = (x1, . . . , xP). Consider shifting xj by Δ units while holding all other entries fixed. Then

    logit π(xj + Δ) = Σ_{p≠j} xp βp + (xj + Δ)βj = logit π(xj) + Δβj.

The odds ratio corresponding to a change of Δ units in j is then

    Odds(xj + Δ) / Odds(xj) = e^{Δβj}.

This leads to the following interpretation:

Holding all other predictors fixed, a change in xj by Δ has a multiplicative effect of e^{Δβj} on the odds of success.

Example 4.5. For the shuttle data, the estimated regression coefficients are

beta <- coef(fit_space)


print(beta)

## (Intercept) Temperature
## 15.0429016 -0.2321627

A change in the temperature of −10 degrees results in a multiplicative in-


crease in the odds of failure by an estimated factor of

exp(-10 * beta[2])

## Temperature
## 10.19225

That is, the odds of failure increase by a factor of roughly 10 (estimated) for
every decrease of 10 degrees.

4.2.3 Confidence intervals for coefficients


The output of the glm function gives standard errors which we can use to compute confidence intervals of the form

    β̂j ± z_{α/2} se(β̂j).

These are essentially Wald-based intervals and they are generally not preferred.
A better confidence interval can be obtained by inverting a likelihood ratio test.
Consider the null hypothesis H0 : βj = b versus the alternative Ha : βj ≠ b.
This hypothesis can be tested by the following procedure:

1. Fit the model with βj unrestricted and compute the log-likelihood A.

2. Fit the model with βj fixed at b and compute the log-likelihood B.


3. Under H0, the test statistic Λ(b) = −2(B − A) has an asymptotic χ²_1 distribution; reject if Λ(b) > χ²_{1,α}.

Inverting this test, we can form an asymptotic 100(1 − α)% confidence interval for βj as {b : Λ(b) ≤ χ²_{1,α}}. This is referred to as a profile-likelihood confidence interval.
Example 4.6. The profile confidence interval is easy to obtain in R; we compare
this with the Wald interval.

## Profile confidence intervals


confint(fit_space)

## Waiting for profiling to be done...


## 2.5 % 97.5 %
## (Intercept) 3.3305848 34.34215133
## Temperature -0.5154718 -0.06082076

## Computing Wald intervals


var_beta <- vcov(fit_space) ## Get the inverse Fisher information
ci_space <- function(j) beta[j] +
c(-1, 1) * sqrt(var_beta[j,j]) * qnorm(0.975)
rbind(ci_space(1), ci_space(2))

## [,1] [,2]
## [1,] 0.5810523 29.50475096
## [2,] -0.4443022 -0.02002324

The intervals largely overlap but are somewhat different. Again, we generally
prefer the profile confidence intervals. From this, we see that the multiplicative
effect on the odds of a change of −10 degrees in temperature is, with 95%
confidence, in the interval

exp(-10 * confint(fit_space))[2,]

## Waiting for profiling to be done...


## 2.5 % 97.5 %
## 173.246863 1.837136

4.2.4 Other Choices of Link Functions


The linear link function
The choice of link function determines how the success probability π varies with the predictor xi. The simplest choice of link function would be a simple linear link function

    πi = xi′β.

This is adequate in many situations; however, one runs into problems when xi′β > 1 or xi′β < 0. Whenever there is a continuous predictor with an unrestricted range, there will always exist values of x which cause this to happen. As we know that probabilities must lie in [0, 1], this is problematic. Hence, one should think very carefully before using a linear link function for a binomial GLM.
A benefit of the linear link function is that it is extremely easy to interpret.

Latent tolerance link functions


A general class of link functions can be obtained by viewing the response Yi as being determined through a latent tolerance model. Consider a toxicology example, in which individual i can tolerate a dose Ti of some treatment (Yi = 0), but if the dose exceeds Ti then the individual dies (Yi = 1). We model the dosage a subject receives as a function of covariates, ηi = Xi′β. As we do not know the tolerance, we model Ti ∼ F, i.i.d., for some known distribution function F. This results in the following induced model for Yi:

    Pr(Yi = 1) = Pr(Ti ≤ Xi′β) = F(Xi′β).

This gives the GLM F⁻¹(πi) = Xi′β, a binomial GLM with link function g(π) = F⁻¹(π). Thus, given any distribution function F(·), we can define a binomial GLM.

Exercise 4.4. Show that logistic regression corresponds to a latent tolerance model in which Ti is a random variable with density

    f(t) = e^t / (1 + e^t)²,  −∞ < t < ∞.

The distribution of Ti in this case is referred to as the logistic distribution.

Aside from logistic regression, the latent tolerance model has many inter-
esting special cases. The probit model considers Ti ∼ Normal(0, 1), while the
complementary log-log model sets Ti to have an extreme value distribution.
In practice, the main difference between these models lies in how they handle
outliers, which is determined by how fast F (x) → 0, 1 as x → ±∞. When F (x)
corresponds to a light-tailed distribution, such as a normal distribution, we
expect to see very few outliers in Y ; here, an outlier is a point (x, y) where x
is extreme and y takes on the “wrong” value (i.e., y is strongly predicted to be
1, but is 0 instead). Conversely, a heavy-tailed distribution (such as a Cauchy
distribution) does not have problems with outliers.
The logistic regression model has exponential tails, with F(x) ≈ e^x as x → −∞ and F(x) ≈ 1 − e^{−x} as x → ∞. This is heavier than the normal distribution, which has super-exponential tails, but lighter than the t-distribution, which has polynomial tails.

[Figure 4.1: Comparison of the probit, logit, and complementary log-log link functions, after scaling and centering.]
We can also choose F (t) to correspond to an asymmetric distribution. This
is often done in toxicology studies, where a sufficiently high toxicity basically
guarantees death, but very small doses can also be fatal. In this case, the
complementary log-log model F⁻¹(π) = log{− log(1 − π)} is a good choice. A comparison of the complementary log-log, probit, and logistic links is given in Figure 4.1.
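In R these alternative links can be requested through the family argument. A minimal sketch, refitting the shuttle data of Example 4.4 with the probit and complementary log-log links (this assumes fit_space and the SpaceShuttle data from Section 4.2.1 are available):

fit_probit  <- glm(I(Fail == 'yes') ~ Temperature, data = SpaceShuttle,
                   family = binomial(link = "probit"))
fit_cloglog <- glm(I(Fail == 'yes') ~ Temperature, data = SpaceShuttle,
                   family = binomial(link = "cloglog"))
## The coefficients live on different scales, but the fitted probabilities are similar.
cbind(logit = fitted(fit_space), probit = fitted(fit_probit),
      cloglog = fitted(fit_cloglog))[1:5, ]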

4.2.5 Fitting a Binomial GLM

Agresti considers the following data. This data is from an epidemiological survey
to investigate snoring as a risk factor for heart disease. The sample consists of
2484 subjects and is given in the following table.

Heart Disease
Snoring Yes No
Never 24 1355
Occasionally 35 603
Nearly every night 21 192
Every night 30 224

The variable snoring is ordinal; to account for this, we will assign the nu-
merical scores 0, 2, 4, and 5 to the categories. We construct the data in R as:

snoring <- cbind(Yes = c(24, 35, 21, 30),
                 No  = c(1355, 603, 192, 224))
snoring_scores <- c(0, 2, 4, 5)
print(cbind(snoring, snoring_scores))

## Yes No snoring_scores
## [1,] 24 1355 0
## [2,] 35 603 2
## [3,] 21 192 4
## [4,] 30 224 5

We consider the data as consisting of four binomial random variables Zi ∼ Binomial(ni, πi) where, for example, Z1 = 24 and n1 = 24 + 1355. To fit a binomial logistic regression to Yi = Zi/ni, we can use the commands

snoring_fit <- glm(snoring ~ snoring_scores, family = binomial)

Note how the data is input: the response is taken to be a matrix with the
first column consisting of number of successes and the second column consisting
of the number of failures. The term family = binomial specifies that we are
doing logistic regression. We obtain a summary of the model as

summary(snoring_fit)

##
## Call:
## glm(formula = snoring ~ snoring_scores, family = binomial)
##
## Deviance Residuals:
## 1 2 3 4
## -0.8346 1.2521 0.2758 -0.6845
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)

## (Intercept) -3.86625 0.16621 -23.261 < 2e-16 ***


## snoring_scores 0.39734 0.05001 7.945 1.94e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 65.9045 on 3 degrees of freedom
## Residual deviance: 2.8089 on 2 degrees of freedom
## AIC: 27.061
##
## Number of Fisher Scoring iterations: 4

The snoring scores are highly significant. We also note that the deviance is D = 2.8089. The asymptotic χ² approximation is likely to be good here, as the cell counts are large. Comparing this to a χ²_{4−2} distribution, a P-value for the test against the saturated model is 0.2455, indicating that there is little evidence of lack of fit.
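The P-value quoted above can be computed directly from the reported deviance; a minimal sketch:

pchisq(deviance(snoring_fit), df = df.residual(snoring_fit),
       lower.tail = FALSE)   # matches the 0.2455 quoted above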

Exercise 4.5. Interpret the estimated coefficient for snoring scores and
construct a Wald interval for the coefficient.

Next, we get the predictions from the model.

snoring_predictions <- predict(snoring_fit, type = 'response')


print(cbind(
c("Never", "Occasionally", "Nearly Every Night", "Every Night"),
round(snoring_predictions,3)
))

## [,1] [,2]
## 1 "Never" "0.021"
## 2 "Occasionally" "0.044"
## 3 "Nearly Every Night" "0.093"
## 4 "Every Night" "0.132"

Evidently the probability of developing heart disease increases dramatically as


the snoring score increases, from an estimated 2% to an estimated 13%.

4.3 Poisson Loglinear Models


The simplest GLM for count data is the Poisson loglinear model.

Definition 4.5. Given data D = {(Xi, Yi)}, the distribution of [Yi | Xi] follows a Poisson loglinear model if

    Yi ∼ Poisson(exp(x′β))  given Xi = x.

This model was introduced in Exercise 4.2, where we saw that this corresponds to a Poisson GLM using the canonical link g(λ) = log λ; that is, the model is

    log{E(Yi | Xi = x)} = x′β.

4.3.1 Interpreting the coefficients


The coefficients of a Poisson loglinear model, like the coefficients of a binomial logistic regression model, have nice interpretations. Consider shifting xj by Δ units while holding all other entries fixed. Then

    log λ(xj + Δ) = Σ_{p≠j} xp βp + (xj + Δ)βj = log λ(xj) + Δβj.

Thus, holding all other predictors fixed, shifting xj by Δ has the effect of multiplying the mean by e^{Δβj}:

    λ(xj + Δ) = e^{Δβj} λ(xj).

This leads to the following interpretation:

Holding all other predictors fixed, a change in xj by Δ units has a multiplicative effect of e^{Δβj} on the mean of Yi.

4.3.2 Poisson Loglinear Models with an Offset


Many Poisson loglinear models also include a term called an offset. The need
for offset terms is best understood through an example.
Example 4.7. This example, taken from Section 6.3.2 of McCullagh and
Nelder (1989), concerns modeling the rate of reported damage incidents of cer-
tain types of cargo-carrying ships. We consider the following predictors which
make up Xi :

• An intercept term.

• The type of ships (A–E).

• The year of construction, as a categorical variable (60–64, 65–69, 70–74, 75–79).

• The period of operation (60–74, 75–79).

• The number of months of service.



Each of these predictors is categorical, with the exception of months of service.


Letting λ(x) be E{Yi | Xi = x}, we could posit a Poisson loglinear model as

log λ(x) = α + βType + γYear + δPeriod + ζ × Months.

Instead, we make the following observation:


Consider two ships, of the same type, constructed in the same year, and serving in the same period. If the first ship is in service for twice as many months as the second, how many more incidents do we expect it to have on average?
Assuming that incidents arrive according to something like a homogeneous Pois-
son process, the answer is clear: we would expect twice as many accidents. This
suggests the model

log λ(x) = α + log Months + βType + γYear + δPeriod .

The term log Months is called an offset. It essentially corresponds to a predictor


which has a known coefficient equal to 1.
We now fit this Poisson loglinear model with an offset term to the data.
First we print the data:

library(MASS)
data(ships)
head(ships)

## type year period service incidents


## 1 A 60 60 127 0
## 2 A 60 75 63 0
## 3 A 65 60 1095 3
## 4 A 65 75 1095 4
## 5 A 70 60 1512 6
## 6 A 70 75 3353 18

For this data, the rows actually correspond to the number of incidents for
all ships of the same type/year/period, with the offset term service giving the
total number of months those ships were at service. The Poisson loglinear model
can be fit as:

ships$yearbuilt <- as.factor(ships$year)


fit_ships_poisson <- glm(incidents ~ type + yearbuilt + period,
offset = log(service), family = poisson,
data = ships, subset = (service != 0))
summary(fit_ships_poisson)

##
## Call:

## glm(formula = incidents ~ type + yearbuilt + period, family = poisson,


## data = ships, subset = (service != 0), offset = log(service))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6768 -0.8293 -0.4370 0.5058 2.7912
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.943769 0.561747 -14.141 < 2e-16 ***
## typeB -0.543344 0.177590 -3.060 0.00222 **
## typeC -0.687402 0.329044 -2.089 0.03670 *
## typeD -0.075961 0.290579 -0.261 0.79377
## typeE 0.325579 0.235879 1.380 0.16750
## yearbuilt65 0.697140 0.149641 4.659 3.18e-06 ***
## yearbuilt70 0.818427 0.169774 4.821 1.43e-06 ***
## yearbuilt75 0.453427 0.233170 1.945 0.05182 .
## period 0.025631 0.007885 3.251 0.00115 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 146.328 on 33 degrees of freedom
## Residual deviance: 38.695 on 25 degrees of freedom
## AIC: 154.56
##
## Number of Fisher Scoring iterations: 5

We can also test for the individual factors using likelihood ratio tests; the anova function in R performs an analysis of deviance in which the various terms are added sequentially to the model.

anova(fit_ships_poisson, test = "LRT")

## Analysis of Deviance Table


##
## Model: poisson, link: log
##
## Response: incidents
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 33 146.328

## type 4 55.439 29 90.889 2.629e-11 ***


## yearbuilt 3 41.534 26 49.355 5.038e-09 ***
## period 1 10.660 25 38.695 0.001095 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

It is also possible to test for the effect of each factor assuming all other terms are included in the model. This can be done using the drop1 function.

drop1(fit_ships_poisson, test = "LRT")

## Single term deletions


##
## Model:
## incidents ~ type + yearbuilt + period
## Df Deviance AIC LRT Pr(>Chi)
## <none> 38.695 154.56
## type 4 62.365 170.23 23.670 9.300e-05 ***
## yearbuilt 3 70.103 179.97 31.408 6.975e-07 ***
## period 1 49.355 163.22 10.660 0.001095 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the likelihood ratio tests, we see that there is a large amount of evidence
for all three predictors. The deviance for this example is

D <- deviance(fit_ships_poisson)
p_value <- pchisq(D, df.residual(fit_ships_poisson),
lower.tail = FALSE)

c(D = D, df = df.residual(fit_ships_poisson), p_value = p_value)

## D df p_value
## 38.69505154 25.00000000 0.03951433

There is some evidence of lack of fit for the model. One possibility for
correcting this is to consider interaction terms. For example:

fit_ships_interaction <- glm(incidents ~ type * yearbuilt + period,


offset = log(service), family = poisson,
data = ships, subset = (service != 0 ))
anova(fit_ships_interaction, test = "LRT")

## Analysis of Deviance Table


##
## Model: poisson, link: log

##
## Response: incidents
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 33 146.328
## type 4 55.439 29 90.889 2.629e-11 ***
## yearbuilt 3 41.534 26 49.355 5.038e-09 ***
## period 1 10.660 25 38.695 0.001095 **
## type:yearbuilt 12 24.108 13 14.587 0.019663 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The deviance of 14.587 is now in line with the reference χ²_{13} distribution.

4.4 Dealing with Overdispersion


For count data, overdispersion occurs when Var(Yi) > E(Yi). This situation is very common! For example, it necessarily occurs when Yi ∼ Poisson(ηi) given ηi, and ηi is itself random.

4.4.1 Negative Binomial GLMs


For loglinear GLMs, the root of the problem is that the Poisson distribution
only has one parameter. Alternatively, we might use a negative binomial model.
Recall that a negative binomial describes the following experiment:

Flip a coin with probability of Heads equal to π repeatedly. Then


Y = the number of Tails you flip until you have observed k heads.
The mass function of Y in this case is

    f(y; π, k) = Z(y, k) · π^k (1 − π)^y,  y = 0, 1, 2, . . . ,

where Z(y, k) is the number of Bernoulli sequences of length y + k consisting of exactly y failures and k successes, subject to the constraint that the last trial is a success, i.e.,

    Z(y, k) = C(y + k − 1, y) = (y + k − 1)! / (y!(k − 1)!) = Γ(y + k) / (Γ(k)Γ(y + 1)).

Exercise 4.6. Suppose that Y has a negative binomial distribution with success probability π and k successes. Show that

    E(Y) = k(1 − π)/π,  and  Var(Y) = k(1 − π)/π².

Writing π in terms of E(Y) = µ, show this gives

    π = k/(µ + k)  and  Var(Y) = µ + µ²/k.
Hint: we can write Y as the sum of k independent Geometric random
variables.

Definition 4.6. A loglinear negative binomial GLM for [Yi | Xi = xi] models the mass function of Yi as

    f(yi | k, µi) = Γ(yi + k) / (Γ(k)Γ(yi + 1)) · (k/(µi + k))^k (1 − k/(µi + k))^{yi},

where log µi = xi′β.

The key point of the negative binomial GLM is that it implies overdispersion.
Instead of having the variance equal to µ, it is equal to µ + µ²/k. When k is very
large relative to µ, we will not have much overdispersion; on the other hand, if
k is small relative to µ then we will have a lot of overdispersion.
When k = 1, the negative binomial corresponds to a geometric random
variable. It turns out that the negative binomial distribution also includes the
Poisson distribution as a special case.

Exercise 4.7. Consider the hierarchical model

Y ∼ Poisson(λ) given λ,
λ ∼ Gamma(k, k/µ).

(a) Show that Y has a negative binomial distribution with k trials and
mean µ.
(b) As k → ∞, show that Y converges in distribution to a Poisson(µ).
Note: the gamma is parameterized so that E(λ) = µ and Var(λ) = µ²/k.

Remark 4.4. Exercise 4.7 is important for the following reasons. First, it indicates that we do not need to restrict k to be an integer. Second, it makes it clear why Y is overdispersed relative to a Poisson; as we have mentioned, introducing a latent variable will automatically cause overdispersion. Third,

it gives some intuitive justification for how a negative binomial might arise in
practice, even if there is no obvious “coin flipping” going on. Fourth, it makes it
clear that the Poisson GLM is a special case of a negative binomial GLM, so if
a Poisson GLM is correct we will not lose anything (aside from some estimation
efficiency) by using a negative binomial GLM.
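A minimal simulation sketch of this mixture representation (the values of k and µ below are illustrative): a gamma-mixed Poisson is overdispersed, with variance close to µ + µ²/k.

set.seed(3)
k  <- 2; mu <- 4
lambda <- rgamma(1e5, shape = k, rate = k / mu)  # E(lambda) = mu, Var(lambda) = mu^2 / k
y      <- rpois(1e5, lambda)
c(mean = mean(y), var = var(y), mu_plus_mu2_over_k = mu + mu^2 / k)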

4.4.2 Quasi-likelihood Methods


We briefly introduce the idea of quasi-likelihood here, since it is often used to
deal with overdispersion. Rather than specifying a parametric model for the
response Yi , we have a model with the following components:
(A1) A linear predictor ηi = Xi′β;

(A2) A link function g(·) such that µi = g⁻¹(ηi); and

(A3) A variance function V(·) such that Var(Yi | Xi = x) = (φ/ωi) V(µi).

Remark 4.5. These components should be compared with Fact 4.1. Every GLM satisfies assumptions (A1–A3) with V(µi) = b′′{(b′)⁻¹(µi)}; for example, a Poisson GLM has V(µi) = µi, while a binomial GLM has V(µi) = µi(1 − µi). The quasi-likelihood framework is more general, however, because there is no GLM that corresponds to the choices φ > 1 and V(µ) = µ, or V(µ) = µ(1 − µ).
The reason for the name “quasi-likelihood” will become clearer later, when
we study how they are fit.

Definition 4.7. A quasi-Poisson model for Yi given Xi is a model for Yi


which satisfies assumptions A1 and A2 with variance function V (µ) = µ
and ωi = 1 so that

Var(Yi | Xi = x) = φµi .
The Poisson GLM is a type of quasi-Poisson model when φ ≡ 1, but the
quasi-Poisson model allows overdispersion when φ > 1.

4.4.3 Quasi-Poisson or Negative Binomial?


The negative binomial and quasi-Poisson models give two possible ways of deal-
ing with overdispersion. But which one is most appropriate?
The negative binomial model has a couple of features which make it some-
what undesirable. The form of the variance is µi + µi²/k, i.e., the negative binomial has variance which is quadratic in µi. Many (including Agresti) find it unnatural to have the variance scale with µi², and hence prefer the quasi-Poisson as a default.

4.4.4 Homicide Example


The following data is taken from Table 14.6 in Agresti. The data is from a survey
of 1308 people in which they were asked how many homicide victims they know.
The variables are resp, the number of victims the respondent knows, and race,
the race of the respondent (black or white). The question: to what extent does
race predict how many homicide victims a person knows? The data is given by:

Response Black White


0 119 1070
1 16 60
2 12 14
3 7 4
4 3 0
5 2 0
6 0 1

We load the data into R:

black <- c(119,16,12,7,3,2,0)


white <- c(1070,60,14,4,0,0,1)
resp <- c(rep(0:6,times=black), rep(0:6,times=white))
race <- factor(c(rep("black", sum(black)),
rep("white", sum(white))),
levels = c("white","black"))
victim <- data.frame(resp, race)
head(victim)

## resp race
## 1 0 black
## 2 0 black
## 3 0 black
## 4 0 black
## 5 0 black
## 6 0 black

tail(victim)

## resp race
## 1303 2 white
## 1304 3 white
## 1305 3 white
## 1306 3 white
## 1307 3 white
## 1308 6 white

We consider GLMs of the form

    g(µi) = β0 + β1 I(person i is black).

First, we compute summary statistics for the data.

library(tidyverse)
victim %>% group_by(race) %>%
dplyr::summarise(mean = mean(resp), var = var(resp))

## # A tibble: 2 x 3
## race mean var
## <fct> <dbl> <dbl>
## 1 white 0.0923 0.155
## 2 black 0.522 1.15

We display these results in the following table.

Respondent Race Mean Variance


White 0.09 0.16
Black 0.52 1.15

Exercise 4.8. Explain why the table above suggests that the Poisson GLM
is not appropriate for modeling the homicide data.

Next, we fit the Poisson, negative binomial, and quasi-Poisson models.

fit_poisson <- glm(resp ~ race, family = poisson, data = victim)


fit_negbin <- glm.nb(resp ~ race, data = victim)
fit_quasi <- glm(resp ~ race, data = victim, family = quasipoisson)

k_negbin <- fit_negbin$theta


phi_quasi <- summary(fit_quasi)$dispersion
print(c(k = k_negbin, phi = phi_quasi))

## k phi
## 0.2023119 1.7456940

This gives the estimates k = 0.2 for the negative binomial model and φ =
1.75 for the quasi-Poisson. We now see how the variance estimates for the
models line up with the empirical variance.

victim_summary <- victim %>% group_by(race) %>%


dplyr::summarise(mean = mean(resp), empirical_var = var(resp))
victim_summary <- victim_summary %>%

mutate(poisson_var = mean,
negbin_var = mean + mean^2 / k_negbin,
quasi_var = phi_quasi * mean)
print(victim_summary)

## # A tibble: 2 x 6
## race mean empirical_var poisson_var negbin_var quasi_var
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 white 0.0923 0.155 0.0923 0.134 0.161
## 2 black 0.522 1.15 0.522 1.87 0.911
This gives the following table:

Model                Var, White    Var, Black
Observed             0.16          1.15
Poisson              0.09          0.52
Negative Binomial    0.13          1.87
Quasi-Poisson        0.16          0.91

The quasi-Poisson model does quite well at recovering the observed/empirical variance for each race, while the Poisson does not do very well. The negative binomial is somewhat in-between, giving a slightly lower variance for white respondents and a moderately larger variance for black respondents.

4.5 Technical Details: Likelihood Theory


We now delve into the likelihood equations to gain insight into GLMs and understand better how they are fit. Recall that the log-likelihood of a GLM is

    L = Σ_{i=1}^n [ ωi(Yi θi − b(θi))/φ + c(Yi, φ) ].

Differentiating with respect to βj and applying the chain rule gives

    ∂L/∂βj = Σ_{i=1}^n [ ωi(Yi − b′(θi))/φ ] (∂θi/∂µi)(∂µi/∂ηi)(∂ηi/∂βj),

where recall that µi = b′(θi), g(µi) = ηi, and ηi = Xi′β. We now begin to simplify. First, using the fact that (d/dx) f⁻¹(x) = 1/f′(f⁻¹(x)), we derive the following:

    ∂θi/∂µi = 1 / b′′{(b′)⁻¹(µi)} =: 1/V(µi),
    ∂µi/∂ηi = 1 / g′(µi),
    ∂ηi/∂βj = Xij.

Hence, the score vector is given by

    ∂L/∂β = Σ_{i=1}^n ωi Xi (Yi − µi) / [φ V(µi) g′(µi)].

Summarizing, we have the following.

Fact 4.3. The score of β is

    u(β) = Σ_i ωi Xi (Yi − µi) / [φ V(µi) g′(µi)],

where V(µi) = b′′{(b′)⁻¹(µi)}, and the MLE β̂ of β satisfies u(β̂) = 0.

Remark 4.6. Even though there are no β's on the right-hand side of u(β), note that µi = g⁻¹(Xi′β), so that µi depends implicitly on β.
Next, we derive the Fisher information. We have

    −∂²L/(∂βj ∂βk) = −(1/φ) Σ_i ωi Xij ∂/∂βk [ (Yi − µi)/(V(µi) g′(µi)) ]
                   = −(1/φ) Σ_i ωi Xij ∂/∂µi [ (Yi − µi)/(V(µi) g′(µi)) ] (∂µi/∂ηi)(∂ηi/∂βk)
                   = (1/φ) Σ_i Xij Xik { ωi/[V(µi) g′(µi)²] − [ωi(Yi − µi)/g′(µi)] ∂/∂µi [1/(V(µi) g′(µi))] }.

Hence, the observed Fisher information can be written as

    J = X′ W̃ X / φ,

where W̃ is diagonal with (i, i)th entry

    ωi/[V(µi) g′(µi)²] − [ωi(Yi − µi)/g′(µi)] ∂/∂µi [1/(V(µi) g′(µi))].

The expected Fisher information is then

    I = Eβ(J) = X′ W X / φ,

where W is diagonal with entries ωi/[V(µi) g′(µi)²]; here we have simply used the fact that Eβ(Yi − µi) = 0. We summarize these results as:

Fact 4.4. (a) The observed Fisher information for a GLM of Yi | Xi = x is given by

    J = X′ W̃ X / φ,

where W̃ is diagonal with (i, i)th entry

    w̃ii = ωi/[V(µi) g′(µi)²] − [ωi(Yi − µi)/g′(µi)] ∂/∂µi [1/(V(µi) g′(µi))].

(b) The expected Fisher information is

    I = Eβ(J) = X′ W X / φ,

where W is diagonal with entries

    wii = ωi / [V(µi) g′(µi)²].

(c) The asymptotic covariance of the MLE β̂ of β is φ(X′ W X)⁻¹.

(d) The matrix W can be approximated by Ŵ, which has diagonal entries ωi/[V(µ̂i) g′(µ̂i)²], where µ̂i is the MLE of µi.

Remark 4.7. Note the similarity between the asymptotic covariance of β̂ for the linear model Y = Xβ + ε with i.i.d. mean-0, finite-variance errors εi and that for the GLM. In the former case, the covariance of β̂ is σ²(X′X)⁻¹. The formula φ(X′WX)⁻¹ generalizes this to the case of generalized linear models.
Remark 4.8. Typically, we favor the use of the observed Fisher information J (evaluated at β̂) to approximate the variance of β̂; we do not favor the use of the expected Fisher information I.
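A minimal numerical check of Fact 4.4(c), using the snoring fit from Section 4.2.5 (canonical link, φ = 1, ωi = ni, so that wii = ni π̂i(1 − π̂i)):

X      <- model.matrix(snoring_fit)
n_i    <- rowSums(snoring)               # group sizes, i.e., the weights omega_i
pi_hat <- fitted(snoring_fit)            # estimated success probabilities
W      <- diag(n_i * pi_hat * (1 - pi_hat))
solve(t(X) %*% W %*% X)                  # phi * (X' W X)^{-1} with phi = 1
vcov(snoring_fit)                        # should agree up to numerical error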

Exercise 4.9. Recall that we defined V(µi) = b′′{(b′)⁻¹(µi)} and that we define the canonical link by g(µi) = (b′)⁻¹(µi).

(a) Show that J = I when the canonical link is used. Hint: Show that V(µi) g′(µi) = 1.

(b) Suppose that the canonical link is used, that φ is known, and that the design matrix X is regarded as fixed and known. Show that X′Ỹ is sufficient for β, where Ỹ = (ω1 Y1, . . . , ωn Yn).

4.6 Analysis of Residuals


As in linear models, examination of residuals can give insight into where a
poorly-fitting GLM might be failing. There are several possibilities for residuals

to check.

Definition 4.8. The Pearson residual for observation i is

    ei = (Yi − µ̂i) / √(V(µ̂i)/ωi).

The Pearson residual is relatively intuitive: how far away is Yi from its estimated mean, scaled by an estimate of its standard deviation. Additionally, the Pearson residuals are associated with the generalized X² statistic

    X² = Σ_{i=1}^n ωi(Yi − µ̂i)²/V(µ̂i) = Σ_{i=1}^n ei².

When φ = 1, the generalized X² statistic is associated with the score test comparing the fitted model to the saturated model. We use this observation as motivation to define residuals using the likelihood ratio test comparing the fitted model to the saturated model.

Note: The Pearson residuals do not depend on φ; similarly, the deviance resid-
uals (to be defined below) do not depend on φ. Hence, they can have variance
much larger than 1; keep this in mind when trying to interpret them!
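A minimal sketch computing the Pearson residuals and the generalized X² statistic for the ships fit of Section 4.3.2, alongside its deviance:

e_pearson <- residuals(fit_ships_poisson, type = "pearson")
X2        <- sum(e_pearson^2)            # generalized X^2 statistic
c(X2 = X2, deviance = deviance(fit_ships_poisson),
  df = df.residual(fit_ships_poisson))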

Definition 4.9. Define

    di = 2ωi [ Yi(θ̃i − θ̂i) − {b(θ̃i) − b(θ̂i)} ],

where θ̃i and θ̂i are as in Definition 4.4. Then we define the deviance residual for observation i as

    √di × sign(Yi − µ̂i).

The deviance D is the sum of the squared deviance residuals, giving an analogy with the relationship between Pearson's residuals and the generalized X² statistic. The quantiles of the deviance residuals are reported in the output of the glm function in R. For the ships dataset, we have

summary(fit_ships_poisson)

##
## Call:
## glm(formula = incidents ~ type + yearbuilt + period, family = poisson,
## data = ships, subset = (service != 0), offset = log(service))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max

## -1.6768 -0.8293 -0.4370 0.5058 2.7912


##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.943769 0.561747 -14.141 < 2e-16 ***
## typeB -0.543344 0.177590 -3.060 0.00222 **
## typeC -0.687402 0.329044 -2.089 0.03670 *
## typeD -0.075961 0.290579 -0.261 0.79377
## typeE 0.325579 0.235879 1.380 0.16750
## yearbuilt65 0.697140 0.149641 4.659 3.18e-06 ***
## yearbuilt70 0.818427 0.169774 4.821 1.43e-06 ***
## yearbuilt75 0.453427 0.233170 1.945 0.05182 .
## period 0.025631 0.007885 3.251 0.00115 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 146.328 on 33 degrees of freedom
## Residual deviance: 38.695 on 25 degrees of freedom
## AIC: 154.56
##
## Number of Fisher Scoring iterations: 5

The deviance residuals for this example vary between −1.7 and 2.8.
Generally, we prefer to have residuals which are standardized to have (approximate) mean 0 and variance 1. The deviance residuals and Pearson residuals do not have this property. In the case of the Pearson residual, the reason that we do not have approximate variance 1 is that Yi and µ̂i are correlated; the same issue occurs in linear regression models, wherein the standardized residuals (Yi − Ŷi)/σ̂ do not have variance 1. When φ = 1, Agresti shows (see Sections 4.5.6 and 4.5.7) that

    Var(ei) ≈ 1 − ĥii,  where  ĥii = [Ŵ^{1/2} X (X′ Ŵ X)⁻¹ X′ Ŵ^{1/2}]_{ii},    (4.2)

i.e., ĥii is the ith diagonal element of Ŵ^{1/2} X (X′ Ŵ X)⁻¹ X′ Ŵ^{1/2}. When φ ≠ 1, we instead have Var(ei) ≈ φ(1 − ĥii).

Definition 4.10. The standardized Pearson residual of observation i is

    ri = ei / √(φ̂(1 − ĥii)),

where ĥii is as defined in (4.2) and φ̂ is an appropriate estimate of φ (or φ̂ = φ if φ is known).

Example 4.8. This example concerns the 1973 admissions data for departments at the University of California at Berkeley. The key inferential issue lies in assessing whether there is evidence of sex bias in the admissions practices. We consider two predictors: the sex of the student and the department the student applied to.
The data is built into R and can be loaded as follows:

data(UCBAdmissions)
berk_0 <- data.frame(UCBAdmissions)
print(berk_0)

## Admit Gender Dept Freq


## 1 Admitted Male A 512
## 2 Rejected Male A 313
## 3 Admitted Female A 89
## 4 Rejected Female A 19
## 5 Admitted Male B 353
## 6 Rejected Male B 207
## 7 Admitted Female B 17
## 8 Rejected Female B 8
## 9 Admitted Male C 120
## 10 Rejected Male C 205
## 11 Admitted Female C 202
## 12 Rejected Female C 391
## 13 Admitted Male D 138
## 14 Rejected Male D 279
## 15 Admitted Female D 131
## 16 Rejected Female D 244
## 17 Admitted Male E 53
## 18 Rejected Male E 138
## 19 Admitted Female E 94
## 20 Rejected Female E 299
## 21 Admitted Male F 22
## 22 Rejected Male F 351
## 23 Admitted Female F 24
## 24 Rejected Female F 317

We place this data into a form suitable for the glm function (with admit-
ted/rejected counts on the same row) using the tidyverse package:

library(tidyverse)
berk <- berk_0 %>% group_by(Gender, Dept) %>%
dplyr::summarise(Admitted = sum(Freq * (Admit == "Admitted")),

Rejected = sum(Freq * (Admit == "Rejected"))) %>%


as.data.frame()
print(berk)

## Gender Dept Admitted Rejected


## 1 Male A 512 313
## 2 Male B 353 207
## 3 Male C 120 205
## 4 Male D 138 279
## 5 Male E 53 138
## 6 Male F 22 351
## 7 Female A 89 19
## 8 Female B 17 8
## 9 Female C 202 391
## 10 Female D 131 244
## 11 Female E 94 299
## 12 Female F 24 317

We consider a logistic regression model in which the number admitted for gender i in department j is binomial with success probability πij. We first consider the model logit(πij) = α + βi, i.e., the probability of being accepted depends only on gender.

berk_glm_1 <- glm(cbind(Admitted, Rejected) ~ Gender,


family = binomial, data = berk)
summary(berk_glm_1)

##
## Call:
## glm(formula = cbind(Admitted, Rejected) ~ Gender, family = binomial,
## data = berk)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -16.7915 -4.7613 -0.4365 5.1025 11.2022
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.22013 0.03879 -5.675 1.38e-08 ***
## GenderFemale -0.61035 0.06389 -9.553 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 877.06 on 11 degrees of freedom

## Residual deviance: 783.61 on 10 degrees of freedom


## AIC: 856.55
##
## Number of Fisher Scoring iterations: 4

Based on these results, we see that females have an odds of admittance


which is e−0.61 = 0.54 times that of males; moreover, these results are highly
statistically significant. One can imagine making a claim of serious gender bias
on the basis of this analysis.
Consider now the model logit(πij ) = α + βi + γj , so that the probability of
being accepted depends on both gender and department.

berk_glm <- glm(cbind(Admitted, Rejected) ~ Gender + Dept,


family = binomial, data = berk)
summary(berk_glm)

##
## Call:
## glm(formula = cbind(Admitted, Rejected) ~ Gender + Dept, family = binomial,
## data = berk)
##
## Deviance Residuals:
## 1 2 3 4 5 6 7 8
## -1.2487 -0.0560 1.2533 0.0826 1.2205 -0.2076 3.7189 0.2706
## 9 10 11 12
## -0.9243 -0.0858 -0.8509 0.2052
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.58205 0.06899 8.436 <2e-16 ***
## GenderFemale 0.09987 0.08085 1.235 0.217
## DeptB -0.04340 0.10984 -0.395 0.693
## DeptC -1.26260 0.10663 -11.841 <2e-16 ***
## DeptD -1.29461 0.10582 -12.234 <2e-16 ***
## DeptE -1.73931 0.12611 -13.792 <2e-16 ***
## DeptF -3.30648 0.16998 -19.452 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 877.056 on 11 degrees of freedom
## Residual deviance: 20.204 on 5 degrees of freedom
## AIC: 103.14
##
## Number of Fisher Scoring iterations: 4

Interestingly, our results conflict with those of the previous analysis. Females have an odds of admittance which is e^{0.1} = 1.11 times that of males, and the results are not statistically significant. On the other hand, we see that the department an individual applies to is highly important!
This is an example of Simpson’s paradox — the direction of the effect of sex
reverses after we take department into account. Essentially what is happening
is that females apply to departments which are more selective than males. In
particular, examining the raw data, we see that females tend not to apply to
departments A and B, which happen to also have very high acceptance rates.
Hence, the effect of gender is primarily accounted for by the tendency of females
to apply to selective departments.
Next, we observe that the model logit(πij) = α + βi + γj actually does not fit the data particularly well. We have a residual deviance of 20.2 on 5 degrees of freedom, which gives a P-value of 0.001 for the test of the model against the saturated model. To understand why the model does not appear to fit well, we look at the standardized Pearson residuals:

rstandard(berk_glm, type = "pearson")

## 1 2 3 4 5 6
## -4.0272880 -0.2797222 1.8808316 0.1412619 1.6334924 -0.3026439
## 7 8 9 10 11 12
## 4.0272880 0.2797222 -1.8808316 -0.1412619 -1.6334924 0.3026439

We see very large residuals relative to a standard normal distribution for ob-
servation 1 (department A, male) and observation 7 (department A, female).
Examining these observations, we see that department A seems to have an ex-
tremely high acceptance rate for females; 512/825 males are accepted, while
89/108 females are accepted. If we fit the model with department A removed,
we get

berk_no_A <- glm(cbind(Admitted, Rejected) ~ Dept + Gender,


family = binomial,
data = subset(berk, Dept != "A"))
summary(berk_no_A)

##
## Call:
## glm(formula = cbind(Admitted, Rejected) ~ Dept + Gender, family = binomial,
## data = subset(berk, Dept != "A"))
##
## Deviance Residuals:
## 2 3 4 5 6 8 9 10
## -0.1191 0.5239 -0.5164 0.6868 -0.5024 0.5680 -0.3914 0.5440
## 11 12
## -0.4892 0.5158

##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.54418 0.08584 6.340 2.3e-10 ***
## DeptC -1.14008 0.12188 -9.354 < 2e-16 ***
## DeptD -1.19456 0.11984 -9.968 < 2e-16 ***
## DeptE -1.61308 0.13928 -11.581 < 2e-16 ***
## DeptF -3.20527 0.17880 -17.927 < 2e-16 ***
## GenderFemale -0.03069 0.08676 -0.354 0.724
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 539.4581 on 9 degrees of freedom
## Residual deviance: 2.5564 on 4 degrees of freedom
## AIC: 71.791
##
## Number of Fisher Scoring iterations: 3

This model seems to fit the data extremely well, with deviance 2.56 on 4 degrees
of freedom. Considering department A in isolation, we have:

summary(glm(cbind(Admitted, Rejected) ~ Gender,


family = binomial,
data = subset(berk, Dept == "A")))

##
## Call:
## glm(formula = cbind(Admitted, Rejected) ~ Gender, family = binomial,
## data = subset(berk, Dept == "A"))
##
## Deviance Residuals:
## [1] 0 0
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.49212 0.07175 6.859 6.94e-12 ***
## GenderFemale 1.05208 0.26271 4.005 6.21e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1.9054e+01 on 1 degrees of freedom
## Residual deviance: 5.5511e-15 on 0 degrees of freedom

## AIC: 15.706
##
## Number of Fisher Scoring iterations: 3

We draw the following conclusions from this analysis:


1. Within department A, there appears to be substantial evidence that female
applicants have a higher acceptance rate than male applicants.
2. Ignoring department A, there is no evidence of an effect of sex on admission
rate, after controlling for department.
As before, we urge caution in interpreting these results in a causal fashion; we
are not saying that being female causes one to have a higher acceptance rate in
department A, only that there is strong evidence of a statistical association.

4.7 How to Fit a GLM


In general, there is no closed form for the MLE β̂ of a generalized linear model. Recall that the MLE satisfies the score equations

    u(β) = Σ_{i=1}^n ωi Xi (Yi − µi) / [φ V(µi) g′(µi)] = 0.

This equation can be solved numerically using Newton's method. Recall that, to solve the equation g(x) = 0, Newton's method updates

    x ← x − (∇g)⁻¹(x) g(x),

where ∇g denotes the Jacobian matrix of g. Recalling that the Jacobian matrix of −u(β) is the observed Fisher information, we get the following algorithm.

Algorithm 1 Newton's method for computing β̂

1: Initialize β^(0) and set t = 0.
2: Set β^(t+1) = β^(t) + J(β^(t))⁻¹ u(β^(t)).
3: Set t = t + 1.
4: If converged, set β̂ = β^(t); else, return to Step 2.

This algorithm works quite well, provided that J is positive definite. A


second idea is to replace J with I. This leads to the so-called Fisher scoring
algorithm.
Remark 4.9. In R, the Fisher scoring method is the default method for fitting
glms. Fisher scoring tends to be more numerically stable (note that I is the
variance of the score, so it is guaranteed to be positive definite), but can be
slow to converge near the optimum. Fisher scoring is also known as iteratively
reweighted least-squares; see Section 4.6.4 of Agresti for a discussion of this.

Algorithm 2 Fisher scoring method for computing β̂

1: Initialize β^(0) and set t = 0.
2: Set β^(t+1) = β^(t) + I(β^(t))⁻¹ u(β^(t)).
3: Set t = t + 1.
4: If converged, set β̂ = β^(t); else, return to Step 2.

Exercise 4.10. For what class of GLMs is Fisher scoring equivalent to


Newton’s method?
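A minimal sketch of Fisher scoring for the grouped binomial logistic model, applied to the snoring data of Section 4.2.5 (the variable names here are illustrative); the result should agree with coef(snoring_fit):

X    <- cbind(1, snoring_scores)              # design matrix
y    <- snoring[, "Yes"] / rowSums(snoring)   # proportions Y_i = Z_i / n_i
wts  <- rowSums(snoring)                      # weights omega_i = n_i
beta <- c(0, 0)                               # starting value
for (t in 1:25) {
  mu    <- 1 / (1 + exp(-drop(X %*% beta)))   # expit of the linear predictor
  score <- t(X) %*% (wts * (y - mu))          # u(beta); canonical link, phi = 1
  info  <- t(X) %*% (wts * mu * (1 - mu) * X) # expected Fisher information
  beta  <- drop(beta + solve(info, score))
}
beta   # compare with coef(snoring_fit)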

4.8 Estimating the Dispersion Parameter


The Poisson loglinear and binomial logistic regression models work with a fixed/known
dispersion parameter φ. For other GLMs, or for quasi-likelihood GLMs, we do
not know the dispersion parameter φ.

Exercise 4.11. Show that the MLE β̂ does not depend on the value of the dispersion parameter φ.

In principle, φ could be estimated by maximum likelihood, and the previous exercise suggests that we can proceed by first computing β̂ as usual and then optimizing over φ with β fixed at β̂. This is not usually done in practice; instead, we prefer to use the so-called moment estimator

    φ̂ = (1/(n − p)) Σ_{i=1}^n ωi(Yi − µ̂i)² / V(µ̂i).

The statistic (n − p)φ̂/φ can be shown to have an asymptotic χ²_{n−p} distribution under the same conditions under which the deviance is χ²_{n−p}, allowing us to construct confidence intervals for φ if desired. This is also rarely done, the reason being that the dispersion parameter φ is rarely itself of interest, and the intervals constructed may be highly sensitive to the parametric assumptions we are making.
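A minimal sketch computing the moment estimator for the quasi-Poisson fit of Section 4.4.4 (there V(µ) = µ and ωi = 1), compared with the dispersion reported by R:

mu_hat  <- fitted(fit_quasi)
y_obs   <- victim$resp
p       <- length(coef(fit_quasi))
phi_mom <- sum((y_obs - mu_hat)^2 / mu_hat) / (length(y_obs) - p)
c(by_hand = phi_mom, from_R = summary(fit_quasi)$dispersion)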

4.8.1 Why the Dispersion Parameter Matters


In view of Exercise 4.11, one might wonder why we care about the dispersion parameter at all; we usually care about β, and φ does not impact the point estimate of β. For example, the estimates β̂ obtained from a Poisson loglinear model and a quasi-Poisson loglinear model are the same.
The reason we estimate φ is that it impacts the variance estimate of β̂. The Fisher information is given by

    I = X′ W X / φ.

Hence, the variance of β̂ is directly proportional to φ! If we were to use, say, a Poisson loglinear model in the presence of overdispersion, the consequence is that we will underestimate the variance of β̂ and end up with inference which is highly inaccurate; in particular, we expect higher false positive rates and lower coverage of confidence intervals when the model does not account for overdispersion.

4.9 More on Quasi-Likelihood


We now discuss quasi-likelihood methods in more detail. The starting point for these methods is the score function of a GLM, which is (see Fact 4.3)

    u(β) = Σ_{i=1}^n ωi Xi (Yi − µi) / [φ V(µi) g′(µi)].

Quasi-likelihood methods arise from the observation that the solution β̂ to the equation u(β) = 0 is valid under the weaker set of assumptions:

• The mean response is given by E{Yi | Xi = x} = µi = g⁻¹(x′β).

• The variance of the response is given by Var(Yi | Xi = x) = (φ/ωi) V(µi) for some φ, known weight ωi, and function V(µ).

Further, the function u(β) behaves very much like a score function, even when it does not correspond to any genuine GLM score function. For example, when the assumptions above are true, the asymptotic variance of β̂ is given by the inverse of a "pseudo" Fisher information,

    I⁻¹ = φ(X′ W X)⁻¹.

Example 4.9. Suppose that Yi given Xi = x has mean given by logit(µi) = x′β and variance (φ/ni) V(µi) = (φ/ni) µi(1 − µi). This model is referred to as a quasi-binomial model. This model is useful when a binomial logistic model Zi ∼ Binomial(ni, πi) with Yi = Zi/ni might be appropriate, but the data is overdispersed, i.e., there is more variability in the observations than can be accounted for by a binomial distribution. This might occur, for example, if the Zi's are actually Binomial(ni, πi) where the individual πi's are themselves random variables.
As with the quasi-Poisson model, the quasi-binomial model will have the same point estimator β̂, but the variance estimate will be inflated by φ̂. The dispersion parameter in this case can be estimated in the same fashion as for usual GLMs, i.e., with the moment estimator.
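A minimal sketch fitting a quasi-binomial model to the snoring data of Section 4.2.5; the point estimates match the binomial fit, and only the estimated dispersion (and hence the standard errors) can differ:

snoring_quasi <- glm(snoring ~ snoring_scores, family = quasibinomial)
summary(snoring_quasi)$dispersion                  # estimated phi
cbind(binomial      = coef(snoring_fit),
      quasibinomial = coef(snoring_quasi))         # identical point estimates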

4.10 Grouped versus ungrouped deviance


Recall the snoring dataset from Section 4.2.5. There are actually two equivalent
ways of writing this GLM.

(M1) For each level of snoring i, let Zi ∼ Binomial(ni , πi ) be the number of


individuals with heart disease where ni is the number of individuals at
snoring level i and πi is the probability of heart disease. Set Yi = Zi /ni .

(M2) For each of the 2484 individuals surveyed, let Yi ∼ Binomial(1, πi ) where
πi is the probability that an individual at snoring level Xi develops heart
disease.

These two approaches describe the same generative model, but differ in that M1 considers a total of N = 4 observations by grouping together all individuals at the same level of snoring, whereas M2 considers a total of N = 2484 individuals. Fitting these two models will result in the same β̂, the same variance estimator for β̂, and the same inferences in general, as the likelihood function for β is the same.
These approaches differ in one key aspect! They do not result in the same saturated model! Note that the saturated model for M1 has four parameters: one for each snoring level. On the other hand, the saturated model for M2 has 2484 parameters: one for each subject! Not only do the saturated models differ, but the deviance statistic for M1 can be well-approximated by a χ²_{4−p} distribution, while the deviance statistic for M2 does not have an asymptotic χ² distribution, because the number of parameters of the saturated model grows with n.
To see this, first, we examine the deviance statistic with the grouped data
(i.e., using M1)

deviance(snoring_fit)

## [1] 2.808912

df.residual(snoring_fit)

## [1] 2

coef(snoring_fit)

## (Intercept) snoring_scores
## -3.8662481 0.3973366

The model fits quite well. Now, let’s fit the model to ungrouped data (i.e.,
using M2).

counts <- c(24, 1355, 35, 603, 21, 192, 30, 224)
snore_scores <- c(0, 0, 2, 2, 4, 4, 5, 5)
binary_data <- c(rep(1, counts[1]), rep(0, counts[2]),
rep(1, counts[3]), rep(0, counts[4]),
rep(1, counts[5]), rep(0, counts[6]),
rep(1, counts[7]), rep(0, counts[8]))
binary_scores <- c(rep(0, counts[1]), rep(0, counts[2]),
rep(2, counts[3]), rep(2, counts[4]),
rep(4, counts[5]), rep(4, counts[6]),
rep(5, counts[7]), rep(5, counts[8]))
big_snore <- data.frame(disease = binary_data, snore = binary_scores)
tail(big_snore)

## disease snore
## 2479 0 5
## 2480 0 5
## 2481 0 5
## 2482 0 5
## 2483 0 5
## 2484 0 5

The object big_snore has a separate row for each individual, giving the disease status and snoring score. Fitting this model, we see that the MLE is the same but the deviance is different.

big_snore_fit <- glm(disease ~ snore, family = binomial,


data = big_snore)
deviance(big_snore_fit)

## [1] 837.7316

df.residual(big_snore_fit)

## [1] 2482

coef(big_snore_fit)

## (Intercept) snore
## -3.8662481 0.3973366

For the purpose of assessing goodness of fit, we should use the deviance for
M1 rather than M2.
