Count Data
Up to this point, the response variables have all been continuous measurements such as weights, heights,
lengths, temperatures and growth rates. A great deal of the data collected by scientists, medical statisticians
and economists, however, is in the form of counts (whole numbers or integers). The number of individuals
who died, the number of firms going bankrupt, the number of days of frost, the number of red blood cells
on a microscope slide, and the number of craters in a sector of lunar landscape are all potentially interesting
variables for study. With count data, the number 0 often appears as a value of the response variable (consider,
for example, what a 0 would mean in the context of the examples just listed). In this chapter we deal with
data on frequencies, where we count how many times something happened, but we have no way of knowing
how often it did not happen (e.g. lightning strikes, bankruptcies, deaths, births). This is in contrast to count
data on proportions, where we know the number doing a particular thing, but also the number not doing
that thing (e.g. the proportion dying, sex ratios at birth, proportions of different groups responding to a
questionnaire).
Straightforward linear regression methods (assuming constant variance, normal errors) are not appropriate
for count data for four main reasons: the linear model might lead to the prediction of negative counts; the
variance of the response variable is likely to increase with the mean; the errors will not be normally distributed;
and zeros are difficult to handle in transformations.
In R, count data are handled very elegantly in a generalized linear model by specifying family=poisson
which sets errors = Poisson and link = log (see p. 558). The log link ensures that all the fitted
values are positive, while the Poisson errors take account of the fact that the data are integer and have variances
that are equal to their means.
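In symbols, for a single explanatory variable x the model is log(λ) = a + bx, so the fitted mean count is
λ = exp(a + bx), which is always positive, and the Poisson assumption is that the variance of the response
equals this mean: var(y) = E(y) = λ.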
The following example has a count (the number of reported cancer cases per year per clinic) as the response
variable, and a single continuous explanatory variable (the distance from a nuclear plant to the clinic in
kilometres). The question is whether or not proximity to the reactor affects the number of cancer cases.
clusters<-read.table("c:\\temp\\clusters.txt",header=T)
attach(clusters)
names(clusters)
[1] "Cancers" "Distance"
There seems to be a downward trend in cancer cases with distance (see the plot below). But is the trend
significant? We do a regression of cases against distance, using a GLM with Poisson errors:
model1<-glm(Cancers~Distance,poisson)
summary(model1)
Call:
glm(formula = Cancers ~ Distance, family = poisson)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.5504 -1.3491 -1.1553 0.3877 3.1304
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.186865 0.188728 0.990 0.3221
Distance -0.006138 0.003667 -1.674 0.0941 .
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 149.48 on 93 degrees of freedom
Residual deviance: 146.64 on 92 degrees of freedom
AIC: 262.41
The trend does not look to be significant, but look at the residual deviance. Under Poisson errors, it is assumed
that residual deviance is equal to the residual degrees of freedom (because the variance and the mean should
be the same). The fact that residual deviance is larger than residual degrees of freedom indicates that we
have overdispersion (extra, unexplained variation in the response). We compensate for the overdispersion by
refitting the model using quasi-Poisson rather than Poisson errors:
model2<-glm(Cancers~Distance,quasipoisson)
summary(model2)
Call:
glm(formula = Cancers ~ Distance, family = quasipoisson)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.5504 -1.3491 -1.1553 0.3877 3.1304
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.186865 0.235364 0.794 0.429
Distance -0.006138 0.004573 -1.342 0.183
(Dispersion parameter for quasipoisson family taken to be 1.555271)
Null deviance: 149.48 on 93 degrees of freedom
Residual deviance: 146.64 on 92 degrees of freedom
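A quick numerical check of the degree of overdispersion (this line is not part of the original code) is the
ratio of the residual deviance to its degrees of freedom; it is similar to, though not identical to, the
Pearson-based dispersion estimate of 1.555 reported by summary:
deviance(model2)/df.residual(model2)   # 146.64/92, about 1.59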
Compensating for the overdispersion has increased the p value to 0.183, so there is no compelling evidence
to support the existence of a trend in cancer incidence with distance from the nuclear plant. To draw the fitted
model through the data, you need to understand that the GLM with Poisson errors uses the log link, so the
parameter estimates and the predictions from the model (the ‘linear predictor’) are in logs, and need to be
antilogged, using exp(yv), before the (non-significant) fitted line is drawn.
xv <- seq(0,100)
yv <- predict(model2,list(Distance=xv))
plot(Cancers~Distance,pch=21,col="red",bg="orange")
lines(xv,exp(yv),col="blue")
[Figure: scatterplot of Cancers (0 to 6) against Distance (0 to 100 km), with the back-transformed fitted line.]
Note how odd the scatterplot looks when we have count data as the response. The values of the response
variable are in rows (because they are all whole numbers), and there are data points all over the place (not
clustered around the regression line, as they were in previous regressions).
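The next worked example concerns counts of damaged cells (cells) in subjects classified by smoking status
(smoker), body weight (weight: normal, over or obese), sex and age (young, mid or old). Assuming the data
are held in a tab-delimited text file with those column names (the file name below is illustrative only, not
taken from the text), they would be read and attached in the same way as before:
cell.data<-read.table("c:\\temp\\cells.txt",header=T)   # illustrative file name
attach(cell.data)
names(cell.data)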
It is always a good idea with count data to get a feel for the overall frequency distribution of counts using
table:
table(cells)
cells
0 1 2 3 4 5 6 7
314 75 50 32 18 13 7 2
Most subjects (314 of them) showed no damaged cells, and the maximum of 7 was observed in just two
patients.
We begin data inspection by tabulating the main effect means:
tapply(cells,smoker,mean)
FALSE TRUE
0.5478723 1.9111111
tapply(cells,weight,mean)
normal obese over
0.5833333 1.2814371 0.9357143
tapply(cells,sex,mean)
female male
0.6584507 1.2202643
tapply(cells,age,mean)
mid old young
0.8676471 0.7835821 1.2710280
It looks as if smokers had a substantially higher mean count than non-smokers, overweight and obese subjects
had higher counts than those of normal weight, males had a higher count than females, and young subjects
had a higher mean count than middle-aged or older people. We need to test whether any of these differences
are significant and to assess whether there are interactions between the explanatory variables.
model1<-glm(cells~smoker*sex*age*weight,poisson)
summary(model1)
Null deviance: 1052.95 on 510 degrees of freedom
Residual deviance: 736.33 on 477 degrees of freedom
AIC: 1318
Number of Fisher Scoring iterations: 6
The residual deviance (736.33) is much greater than the residual degrees of freedom (477), indicating
overdispersion, so before interpreting any of the effects, we should refit the model using quasi-Poisson errors:
model2<-glm(cells~smoker*sex*age*weight,quasipoisson)
summary(model2)
The first thing to understand is what the NAs in the coefficients table mean. This is a sign of aliasing:
there is no information in the dataframe from which to estimate this particular interaction term. There is an
apparently significant three-way interaction between smoking, age and obesity (p = 0.0424). There were too
few subjects to assess the four-way interaction (see the NAs in the table), so we begin model simplification
by removing the highest-order interaction:
model3<-update(model2, ~. -smoker:sex:age:weight)
summary(model3)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.897195 0.436988 -2.053 0.04060 *
smokerTRUE 0.030263 0.735386 0.041 0.96719
sexmale 0.297192 0.570009 0.521 0.60234
ageold -0.118726 0.528165 -0.225 0.82224
ageyoung 0.289259 0.639618 0.452 0.65130
weightobese 1.302660 0.898307 1.450 0.14768
weightover -0.005052 1.027198 -0.005 0.99608
smokerTRUE:sexmale 0.527345 0.867294 0.608 0.54345
smokerTRUE:ageold -0.566584 1.700590 -0.333 0.73915
smokerTRUE:ageyoung 0.757297 0.939746 0.806 0.42073
sexmale:ageold -0.379884 0.935365 -0.406 0.68483
sexmale:ageyoung -0.610703 0.920969 -0.663 0.50758
smokerTRUE:weightobese 3.924591 1.475476 2.660 0.00808 **
smokerTRUE:weightover 1.192159 1.259888 0.946 0.34450
sexmale:weightobese -1.273202 1.040701 -1.223 0.22178
sexmale:weightover 0.154097 1.098781 0.140 0.88853
ageold:weightobese -0.993355 0.970484 -1.024 0.30656
ageyoung:weightobese -1.346913 1.459454 -0.923 0.35653
ageold:weightover 0.454217 1.090260 0.417 0.67715
ageyoung:weightover -0.483955 1.300866 -0.372 0.71004
smokerTRUE:sexmale:ageold 0.771116 1.451512 0.531 0.59549
smokerTRUE:sexmale:ageyoung -0.210317 1.140384 -0.184 0.85376
smokerTRUE:sexmale:weightobese -2.500668 1.369941 -1.825 0.06857 .
smokerTRUE:sexmale:weightover -1.110222 1.217531 -0.912 0.36230
smokerTRUE:ageold:weightobese -0.882951 1.187871 -0.743 0.45766
smokerTRUE:ageyoung:weightobese -2.453315 1.047067 -2.343 0.01954 *
smokerTRUE:ageold:weightover 0.823018 1.528233 0.539 0.59045
smokerTRUE:ageyoung:weightover 0.040795 1.223664 0.033 0.97342
sexmale:ageold:weightobese 2.338617 1.324805 1.765 0.07816 .
sexmale:ageyoung:weightobese 2.822032 1.623849 1.738 0.08288 .
sexmale:ageold:weightover -0.442066 1.545451 -0.286 0.77497
sexmale:ageyoung:weightover 0.357807 1.291194 0.277 0.78181
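Each subsequent step of the simplification follows the same pattern; here is a sketch of one such step (the
term removed is chosen arbitrarily for illustration), using an F test because the dispersion parameter is
estimated under quasi-Poisson errors:
model4<-update(model3, ~. -smoker:sex:weight)   # drop one non-significant three-way interaction
anova(model3,model4,test="F")                   # F test appropriate with quasi-Poisson errors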
The remaining model simplification is left to you as an exercise. Your minimal adequate model might look
something like this:
newWt<-weight
levels(newWt)[c(1,3)]<-"not"
model15<-glm(cells~smoker*newWt,quasipoisson)
summary(model15)
Call:
glm(formula = cells ~ smoker + newWt + smoker:newWt, family = quasipoisson)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.6511 -1.1742 -0.9709 0.5217 3.8157
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.7522 0.1279 -5.883 7.31e-09 ***
smokerTRUE 1.0523 0.1740 6.048 2.84e-09 ***
newWtobese 0.3803 0.1924 1.977 0.0486 *
smokerTRUE:newWtobese 0.5764 0.2573 2.240 0.0255 *
This model shows a highly significant interaction between smoking and weight in determining the number
of damaged cells, but there are no convincing effects of age or sex. In a case like this, it is useful to produce
a summary table to highlight the effects:
tapply(cells,list(smoker,weight),mean)
The interaction arises because the response to smoking depends on body weight: smoking adds a mean of
about 0.5 damaged cells for individuals with normal body weight, but adds 2.8 damaged cells for obese
people.
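One way to recover fitted group means like these directly from model15 (a sketch, using the coefficient names
printed above) is to back-transform the appropriate sums of coefficients from the log scale; note that these
are means for the two-level newWt factor, which pools the normal and overweight classes:
b<-coef(model15)
exp(b["(Intercept)"])                          # non-smokers, not obese
exp(b["(Intercept)"]+b["smokerTRUE"])          # smokers, not obese
exp(b["(Intercept)"]+b["newWtobese"])          # non-smokers, obese
exp(sum(b))                                    # smokers, obese (all four terms apply)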
It is straightforward to turn the summary table into a barplot:
barplot(tapply(cells,list(smoker,weight),mean),col=c("wheat2","wheat4"),
beside=T,ylab="damaged cells",xlab="body mass")
legend(1.2,3.4,c("non-smoker","smoker"),fill=c("wheat2","wheat4"))
[Figure: barplot of mean numbers of damaged cells for non-smokers and smokers in each body mass class, with legend.]
The next example involves analysis of covariance with count data: the response is a count of plant species on
plots that differ in biomass (a continuous explanatory variable) and in soil pH (a factor with three levels,
labelled high, mid and low).
species<-read.table("c:\\temp\\species.txt",header=T)
attach(species)
names(species)
[1] "pH" "Biomass" "Species"
plot(Biomass,Species,type="n")
spp<-split(Species,pH)
bio<-split(Biomass,pH)
points(bio[[1]],spp[[1]],pch=16,col="red")
points(bio[[2]],spp[[2]],pch=16,col="green")
points(bio[[3]],spp[[3]],pch=16,col="blue")
legend(locator(1),legend=c("high","low","medium"),
pch=c(16,16,16),col=c("red","green","blue"),title="pH")
[Figure: scatterplot of Species against Biomass (0 to 10), with points coloured by soil pH: high (red), low (green), medium (blue).]
Note the use of split to create separate lists of plotting coordinates for the three levels of pH. It is clear that
species declines with biomass, and that soil pH has a big effect on species, but does the slope of the relationship
between species and biomass depend on pH? The lines look reasonably parallel from the scatterplot. This is
a question about interaction effects, and in analysis of covariance, interaction effects are about differences
between slopes:
model1<-glm(Species~ Biomass*pH,poisson)
summary(model1)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.76812 0.06153 61.240 < 2e-16 ***
Biomass -0.10713 0.01249 -8.577 < 2e-16 ***
pHlow -0.81557 0.10284 -7.931 2.18e-15 ***
pHmid -0.33146 0.09217 -3.596 0.000323 ***
Biomass:pHlow -0.15503 0.04003 -3.873 0.000108 ***
Biomass:pHmid -0.03189 0.02308 -1.382 0.166954
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 452.346 on 89 degrees of freedom
Residual deviance: 83.201 on 84 degrees of freedom
AIC: 514.39
There is no evidence of overdispersion (residual deviance = 83.201 on 84 d.f.). We can test for the need for
different slopes by comparing this maximal model (with six parameters) with a simpler model with different
intercepts but the same slope (four parameters):
model2<-glm(Species~Biomass+pH,poisson)
anova(model1,model2,test="Chi")
The slopes are very significantly different (p = 0.000 33), so we are justified in retaining the more complicated
model1.
Finally, we draw the fitted lines through the scatterplot, using predict. We need to specify values for all
of the explanatory variables in the model, namely biomass (a continuous variable) and soil pH (a three-level
categorical variable). First, the continuous variable for the x axis:
xv<-seq(0,10,0.1)
Next we need to provide a vector of factor levels for soil pH and this vector must be exactly the same length
as the vector of x values (length(xv) = 101 in this case). It is simplest to use the factor levels for pH in
the order in which they appear:
levels(pH)
We shall draw the line for the high pH first, remembering to antilog the predictions:
pHs<-factor(rep("high",101))
xv<-seq(0,10,0.1)
yv<-predict(model1,list(Biomass=xv,pH=pHs))
lines(xv,exp(yv),col="red")
pHs<-factor(rep("low",101))
yv<-predict(model1,list(Biomass=xv,pH=pHs))
lines(xv,exp(yv),col="green")
pHs<-factor(rep("mid",101))
yv<-predict(model1,list(Biomass=xv,pH=pHs))
lines(xv,exp(yv),col="blue")
You could make the R code more elegant by writing a function to plot any number of lines, depending on the
number of levels of the factor (three levels of pH in this case).
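Here is a minimal sketch of such a function (the function name and its arguments are illustrative, not from
the text): it loops over the factor levels, builds a prediction frame for each level, and back-transforms the
linear predictor before drawing each line:
add.fitted.lines<-function(model,xv,xname,facname,faclevels,cols){
  for (i in seq_along(faclevels)) {
    nd<-data.frame(xv,factor(rep(faclevels[i],length(xv)),levels=faclevels))
    names(nd)<-c(xname,facname)
    lines(xv,exp(predict(model,nd)),col=cols[i])
  }
}
add.fitted.lines(model1,seq(0,10,0.1),"Biomass","pH",levels(pH),c("red","green","blue"))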
[Figure: the same scatterplot of Species against Biomass, with the three back-transformed fitted lines added, one for each level of soil pH.]
Here are data on the numbers of bankruptcies in 80 districts. The question is whether there is any evidence
that some districts show greater than expected numbers of cases. What would we expect? Of course we should
expect some variation, but how much, exactly? Well, that depends on our model of the process. Perhaps the
simplest model is that absolutely nothing is going on, and that every single bankruptcy case is completely
independent of every other. That leads to the prediction that the numbers of cases per district will follow a
Poisson process, a distribution in which the variance is equal to the mean (see p. 314). Let us see what the
data show.
case.book<-read.table("c:\\temp\\cases.txt",header=T)
attach(case.book)
names(case.book)
[1] "cases"
First we need to count the numbers of districts with no cases, one case, two cases, and so on. The R
function that does this is called table:
frequencies<-table(cases)
frequencies
cases
0 1 2 3 4 5 6 7 8 9 10
34 14 10 7 4 5 2 1 1 1 1
There were no cases at all in 34 districts, but one district had 10 cases. A good way to proceed is to compare
our distribution (called frequencies) with the distribution that would be observed if the data really did
come from a Poisson distribution as postulated by our model. We can use the R function dpois to compute
the probability density of each of the 11 frequencies from 0 to 10 (we multiply the probability produced by
dpois by the total sample of 80 to obtain the predicted frequencies). We need to calculate the mean number
of cases per district – this is the Poisson distribution’s only parameter:
mean(cases)
[1] 1.775
The plan is to draw two distributions side by side, so we set up the plotting region:
windows(7,4)
par(mfrow=c(1,2))
Now we plot the observed frequencies in the left-hand panel and the predicted, Poisson frequencies in the
right-hand panel:
barplot(frequencies,ylab="Frequency",xlab="Cases",col="green4", main="Cases")
barplot(dpois(0:10,1.775)*80,names=as.character(0:10),
ylab="Frequency",xlab="Cases",col="green3",main="Poisson")
[Figure: two barplots side by side: left panel, the observed frequencies of cases per district; right panel, the expected Poisson frequencies for a mean of 1.775.]
The distributions are very different: the mode of the observed data is 0, but the mode of the Poisson distribution
with the same mean is 1; the observed data contained examples of 8, 9 and 10 cases, but these would be
highly unlikely under a Poisson process. We would say that the observed data are highly aggregated – they
have a variance–mean ratio much greater than 1 (the Poisson distribution, of course, has a variance–mean
ratio of 1):
var(cases)/mean(cases)
[1] 2.99483
So, if the data are not Poisson distributed, how are they distributed? A good candidate distribution where
the variance–mean ratio is this big (around 3.0) is the negative binomial distribution (see p. 315). This is a
two-parameter distribution: the first parameter is the mean number of cases (1.775), and the second is called
the clumping parameter, k (measuring the degree of aggregation in the data: small values of k (k < 1) show
high aggregation, while large values of k (k > 5) show randomness). We can get an approximate estimate of
the magnitude of k from
k̂ = x̄²/(s² − x̄).
mean(cases)^2/(var(cases)-mean(cases))
[1] 0.8898003
so we shall work with k = 0.89. How do we compute the expected frequencies? The density function for
the negative binomial distribution is dnbinom and it has three arguments: the frequency for which we want
the probability (in our case 0 to 10), the number of successes (in our case 1), and the mean number of cases
(1.775); we multiply by the total number of districts (80) to obtain the expected frequencies:
exp<-dnbinom(0:10,1,mu=1.775)*80
We will draw a single figure in which the observed and expected frequencies are drawn side by side. The
trick is to produce a new vector (called both) which is twice as long as the observed and expected frequency
vectors (2 × 11 = 22). Then, we put the observed frequencies in the odd-numbered elements (using modulo
2 to calculate the values of the subscripts), and the expected frequencies in the even-numbered elements:
both<-numeric(22)
both[1:22 %% 2 != 0]<-frequencies
both[1:22 %% 2 == 0]<-exp
labels<-character(22)
labels[1:22 %% 2 == 0]<-as.character(0:10)
Now we can produce the barplot, using dark red for the observed frequencies and dark blue for the negative
binomial frequencies (‘expected’):
windows(7,7)
barplot(both,col=rep(c("red4","blue4"),11),names=labels,ylab="Frequency",
xlab="Cases")
[Figure: barplot of observed (dark red) and expected negative binomial (dark blue) frequencies for 0 to 10 cases per district.]
Now we need to add a legend to show what the two colours of the bars mean. You can locate the legend by trial
and error, or by left-clicking the mouse when the cursor is in the correct position, using the locator(1)
function to fix the top left corner of the legend box (see p. 194):
legend(locator(1),c("observed","expected") ,fill=c("red4","blue4"))
The fit to the negative binomial distribution is much better than it was with the Poisson distribution, especially
in the right-hand tail. But the observed data have too many 0s and too few 1s to be represented perfectly by
a negative binomial distribution. If you want to quantify the lack of fit between the observed and expected
frequency distributions, you can calculate Pearson’s chi-squared, Σ(O − E)²/E, based on the number of
comparisons that have expected frequency greater than 4:
exp
[1] 28.8288288 18.4400617 11.7949944 7.5445460 4.8257907 3.0867670
[7] 1.9744185 1.2629164 0.8078114 0.5167082 0.3305070
If we accumulate the rightmost six frequencies, then all the values of exp will be bigger than 4. The degrees
of freedom are then given by the number of legitimate comparisons (6) minus the number of parameters
estimated from the data (2 in our case) minus 1 (for contingency, because the total frequency must add up to
80), i.e. 3 d.f. We reduce the lengths of the observed and expected vectors, creating an upper interval called
5+ for ‘5 or more’:
cs<-factor(0:10)
levels(cs)[6:11]<-"5+"
levels(cs)
[1] "0" "1" "2" "3" "4" "5+"
Now we make the two shorter vectors, of and ef (for the ‘observed’ and ‘expected’ frequencies):
ef<-as.vector(tapply(exp,cs,sum))
of<-as.vector(tapply(frequencies,cs,sum))
Finally, we can compute the chi-squared value measuring the difference between the observed and expected
frequency distributions, and use 1-pchisq to work out the p value:
sum((of-ef)^2/ef)
[1] 3.594145
1-pchisq(3.594145,3)
[1] 0.3087555
We conclude that a negative binomial description of these data is reasonable (the observed and expected
distributions are not significantly different, p = 0.31).
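As an alternative to the moment estimate of k used above (this step is not in the original analysis), the two
negative binomial parameters can be estimated by maximum likelihood with fitdistr from the MASS package:
library(MASS)
fitdistr(cases,"negative binomial")   # maximum-likelihood estimates of size (= k) and mu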
The data analysed in this section refer to children from Walgett, New South Wales, Australia, who were
classified by sex (with two levels: male (M) and female (F)), culture (also with two levels: Aboriginal (A)
and not (N)), age group (with four levels: F0 (primary), F1, F2 and F3) and learner status (with two levels:
average (AL) and slow (SL)). The response variable is a count of the number of days absent from school in a
particular school year (Days).
library(MASS)
data(quine)
attach(quine)
names(quine)
[1] "Eth" "Sex" "Age" "Lrn" "Days"
We begin with the maximal model, fitted with quasi-Poisson errors to allow for overdispersion:
model1<-glm(Days~Eth*Sex*Age*Lrn,quasipoisson)
summary(model1)
Call:
glm(formula = Days ~ Eth * Sex * Age * Lrn, family = quasipoisson)
Deviance Residuals:
Min 1Q Median 3Q Max
-7.3872 -2.5129 -0.4205 1.7424 6.6783
Coefficients: (4 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.0564 0.3346 9.135 2.22e-15 ***
EthN -0.1386 0.4904 -0.283 0.7780
SexM -0.4914 0.5082 -0.967 0.3356
AgeF1 -0.6227 0.5281 -1.179 0.2407
AgeF2 -2.3632 2.2066 -1.071 0.2864
AgeF3 -0.3784 0.4296 -0.881 0.3802
LrnSL -1.9577 1.8120 -1.080 0.2822
EthN:SexM -0.7524 0.8272 -0.910 0.3649
EthN:AgeF1 0.1029 0.7427 0.139 0.8901
EthN:AgeF2 -0.5546 3.8094 -0.146 0.8845
EthN:AgeF3 0.0633 0.6194 0.102 0.9188
SexM:AgeF1 0.4092 0.9372 0.437 0.6632
SexM:AgeF2 3.1098 2.2506 1.382 0.1696
SexM:AgeF3 1.1145 0.6173 1.806 0.0735 .
EthN:LrnSL 2.2588 1.9474 1.160 0.2484
SexM:LrnSL 1.5900 1.9448 0.818 0.4152
AgeF1:LrnSL 2.6421 1.8688 1.414 0.1601
AgeF2:LrnSL 4.8585 2.8413 1.710 0.0899 .
AgeF3:LrnSL NA NA NA NA
EthN:SexM:AgeF1 -0.3105 1.6756 -0.185 0.8533
EthN:SexM:AgeF2 0.3469 3.8928 0.089 0.9291
EthN:SexM:AgeF3 0.8329 0.9629 0.865 0.3888
EthN:SexM:LrnSL -0.1639 2.1666 -0.076 0.9398
EthN:AgeF1:LrnSL -3.5493 2.0712 -1.714 0.0892 .
EthN:AgeF2:LrnSL -3.3315 4.2739 -0.779 0.4373
EthN:AgeF3:LrnSL NA NA NA NA
SexM:AgeF1:LrnSL -2.4285 2.1901 -1.109 0.2697
SexM:AgeF2:LrnSL -4.1914 2.9472 -1.422 0.1576
SexM:AgeF3:LrnSL NA NA NA NA
EthN:SexM:AgeF1:LrnSL 2.1711 2.7527 0.789 0.4319
EthN:SexM:AgeF2:LrnSL 2.1029 4.4203 0.476 0.6351
EthN:SexM:AgeF3:LrnSL NA NA NA NA
(Dispersion parameter for quasipoisson family taken to be 9.514226)
Null deviance: 2073.5 on 145 degrees of freedom
Residual deviance: 1173.9 on 118 degrees of freedom
Notice that certain interactions are aliased (shown by rows with NA). These have not been estimated because
of missing factor-level combinations, as indicated by the zeros in the following table:
ftable(table(Eth,Sex,Age,Lrn))
Instead of using quasi-Poisson errors (as above) we could use a negative binomial model. This is in the MASS
library and involves the function glm.nb. The modelling proceeds in exactly the same way as with a typical
GLM:
model.nb1<-glm.nb(Days~Eth*Sex*Age*Lrn)
summary(model.nb1,cor=F)
Call:
glm.nb(formula = Days ~ Eth * Sex * Age * Lrn,
init.theta = 1.92836014510701, link = log)
(Dispersion parameter for Negative Binomial(1.9284) family taken to be 1)
Null deviance: 272.29 on 145 degrees of freedom
Residual deviance: 167.45 on 118 degrees of freedom
AIC: 1097.3
Theta: 1.928
Std. Err.: 0.269
2 x log-likelihood: -1039.324
The output is slightly different from that of a conventional GLM: you see the estimated negative binomial
parameter (here called Theta, but known to us as k, and equal to 1.928) and its approximate standard error
(0.269), and 2 times the log-likelihood (contrast this with the residual deviance from our quasi-Poisson model,
which was 1173.9; see above). Note that the residual deviance in the negative binomial model (167.45) is not
2 times the log-likelihood.
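A quick way to see how strongly these data favour the negative binomial over plain Poisson errors (a comparison
not made in the text) is to fit the same maximal model with Poisson errors and compare AIC values; the
quasi-Poisson model has no likelihood, so it cannot be compared in this way:
model.pois<-glm(Days~Eth*Sex*Age*Lrn,poisson)
AIC(model.pois,model.nb1)   # lower AIC indicates the better-supported model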
An advantage of the negative binomial model over the quasi-Poisson is that we can automate the model
simplification with stepAIC:
model.nb2<-stepAIC(model.nb1)
summary(model.nb2,cor=F)
model.nb3<-update(model.nb2,~. - Sex:Age:Lrn)
anova(model.nb3,model.nb2)