Pearson's goodness of fit statistic as a score test statistic
Gordon K. Smyth

To appear in Science and Statistics: A Festschrift for Terry Speed, D. R. Goldstein (ed.), IMS Lecture Notes – Monograph Series, Volume 40, Institute of Mathematical Statistics, Hayward, California, March 2003.
Abstract. For any generalized linear model, the Pearson goodness of fit statistic is the score test statistic for testing the current model against the saturated model. The relationship between the Pearson statistic and the residual deviance is therefore the relationship between the score test and the likelihood ratio test statistics, and this clarifies the role of the Pearson statistic in generalized linear models. The result is extended to cases in which there are multiple response observations for the same combination of explanatory variables.

Keywords. Pearson statistic; score test; chi-square statistic; generalized linear model; exponential family nonlinear model; saturated model.
1 Introduction
Goodness of fit tests go back at least to Pearson's (1900) article [17] establishing the asymptotic chi-square distribution for a goodness of fit statistic for the multinomial distribution. Pearson's chi-square statistic includes the test for independence in two-way contingency tables. It has been extended in generalized linear model theory to a test for the adequacy of the current fitted model. Given a generalized linear model with responses $y_i$, weights $w_i$, fitted means $\hat\mu_i$, variance function $v(\mu)$ and dispersion $\phi = 1$, the Pearson goodness of fit statistic is

$$X^2 = \sum_i \frac{w_i (y_i - \hat\mu_i)^2}{v(\hat\mu_i)}$$

[14]. If the fitted model is correct and the observations $y_i$ are approximately normal, then $X^2$ is approximately distributed as $\chi^2$ on the residual degrees of freedom for the model.
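As a concrete illustration (not from the paper), the following minimal sketch simulates a Poisson log-linear model with unit weights and unit dispersion, so that $v(\mu) = \mu$, fits it by Fisher scoring, and evaluates $X^2$. The simulated data and the plain scoring loop are illustrative assumptions.

```python
# Minimal sketch: Pearson X^2 for a Poisson log-linear model (w_i = 1, phi = 1).
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, n)])
y = rng.poisson(np.exp(X @ np.array([1.0, 0.5])))

# Fisher scoring / IRLS for the Poisson log-linear model.
beta = np.zeros(p)
for _ in range(25):
    mu = np.exp(X @ beta)
    W = mu                          # working weights for the log link
    z = X @ beta + (y - mu) / mu    # working response
    beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))

mu_hat = np.exp(X @ beta)
X2 = np.sum((y - mu_hat) ** 2 / mu_hat)   # Pearson X^2 with v(mu) = mu
print(X2, "compared with chi^2 on", n - p, "df if the model fits")
```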
The Pearson goodness of fit statistic $X^2$ is one of two goodness of fit tests in routine use in generalized linear models, the other being the residual deviance. The residual deviance is the log-likelihood ratio statistic for testing the fitted model against the saturated model in which there is a regression coefficient for every observation. The Pearson statistic is a quadratic form alternative to the residual deviance, and is often preferred over the residual deviance because of its moment estimator character. The expected value of the Pearson statistic depends only on the first two moments of the distribution of the $y_i$, and in this sense the Pearson statistic is robust against misspecification of the response distribution.

The score test, like the likelihood ratio test, is a general asymptotic parametric test associated with the likelihood function [22]. Score tests are often simpler than likelihood ratio tests because the statistic requires parameter estimators to be obtained only under the null hypothesis. For this reason score tests have been proposed frequently in generalized linear model contexts to test for various sorts of model complications such as overdispersion [5] [3] [7] [24] [13] [19], zero inflation [8], adequacy of the link function [20] [9], or extra terms in the fitted model [21] [4] [1] [2] [26] [19].

While the residual deviance arises from a general inferential principle, namely the likelihood ratio test, the origin of the Pearson statistic has seemed more ad hoc. Several authors have noted that score tests for extra terms in the linear predictor give rise to chi-square statistics, but there has been no result for the residual Pearson statistic itself. Pregibon [21] shows, by using one-step estimators, that the score statistic for extra terms in the linear predictor can be expressed as a difference between two chi-square statistics, just as the likelihood ratio test can be obtained as the difference between two residual deviances. Cox and Hinkley [6, Examples 9.17 and 9.21] show that the simplest Pearson statistic, the goodness of fit statistic for the multinomial distribution, can be derived as a score statistic.

This article shows that Cox and Hinkley's result for the multinomial extends to all generalized linear models. The Pearson goodness of fit statistic is itself a score test statistic, testing the current model against the saturated model. The relationship between the Pearson statistic and the residual deviance is therefore the relationship between the score test and the likelihood ratio test statistics, and this clarifies the role of the Pearson statistic in generalized linear models.

The result of this article extends to several more general situations. It extends to data sets with multiple counts in categories and to generalizations of exponential family models, such as overdispersion models, for which there are extra parameters in the variance function. It includes as special cases, for example, the results on tests for independence in two-way contingency tables of Thall [26] and Paul and Banerjee [19]. The general proofs given here are simpler and more transparent than the special case proofs for contingency tables. Finally, the results given here do not require link-linearity as in generalized linear models, but apply to any exponential family nonlinear regression model.

The theory of score tests is reviewed briefly in Section 2, and the background material on exponential dispersion models and generalized linear models is given in Section 3.
2 Score tests
This section briefly summarizes the theory of likelihood score tests. Further background on score tests and likelihood ratio tests can be found in Rao [23, pages 417–418] and Cox and Hinkley [6, Section 9.3].

Let $\ell(y; \theta_1, \theta_2)$ be a log-likelihood function depending on a response vector $y$ and parameter vectors $\theta_1$ and $\theta_2$. We wish to test the composite hypothesis $H_0\colon \theta_2 = 0$ against the alternative that $\theta_2$ is unrestricted. The components of $\theta_1$ are so-called nuisance parameters because they are not of interest in the test, but values must be estimated for them for a test statistic to be computed. The likelihood score vectors for $\theta_1$ and $\theta_2$ are the partial derivatives

$$\dot\ell_1 = \frac{\partial \ell}{\partial \theta_1} \quad\text{and}\quad \dot\ell_2 = \frac{\partial \ell}{\partial \theta_2}.$$
The expected or Fisher information matrix is $I = E(\dot\ell\,\dot\ell^T)$, which is partitioned conformally with $\dot\ell = (\dot\ell_1^T, \dot\ell_2^T)^T$ as

$$I = \begin{pmatrix} I_{11} & I_{12} \\ I_{21} & I_{22} \end{pmatrix}.$$

The score test statistic is based on the fact that the score vector has mean zero and covariance matrix $I$. If the nuisance vector $\theta_1$ is known, then the score test statistic of $H_0$ is $Z = I_{22}^{-T/2} \dot\ell_2$, where $I_{22}^{1/2}$ stands for any factor such that $I_{22}^{T/2} I_{22}^{1/2} = I_{22}$, or equivalently

$$S = Z^T Z = \dot\ell_2^T I_{22}^{-1} \dot\ell_2$$
with $\dot\ell_2$ and $I_{22}$ evaluated at $\theta_2 = 0$. The score vector is a sum of terms corresponding to individual observations and so is asymptotically normal under standard regularity conditions. It follows that $Z$ is asymptotically a standard normal $p_2$-vector
under the null hypothesis $H_0$, and that $S$ is asymptotically chi-square distributed on $p_2$ degrees of freedom, where $p_2$ is the dimension of $\theta_2$.

If the nuisance parameters are not known, then the score test substitutes for them their maximum likelihood estimators $\hat\theta_1$ under the null hypothesis. Setting $\theta_1 = \hat\theta_1$ is equivalent to setting $\dot\ell_1 = 0$, so we need the asymptotic distribution of $\dot\ell_2$ conditional on $\dot\ell_1 = 0$, which is normal with mean zero and covariance matrix
$$I_{2.1} = I_{22} - I_{21} I_{11}^{-1} I_{12}.$$
The score test statistic is then

$$S = \dot\ell_2^T I_{2.1}^{-1} \dot\ell_2$$

with $\dot\ell_2$ and $I_{2.1}$ evaluated at $\theta_1 = \hat\theta_1$ and $\theta_2 = 0$. If $I_{12} = 0$ then $\theta_1$ and $\theta_2$ are said to be orthogonal. In that case, $\dot\ell_1$ and $\dot\ell_2$ are independent and $I_{2.1} = I_{22}$, meaning that the information matrix $I_{22}$ does not need to be adjusted for estimation of $\theta_1$.

Neyman [15] and Neyman and Scott [16] show that the asymptotic distribution and efficiency of the score statistic $S$ are unchanged if an estimator other than the maximum likelihood estimator is used for the nuisance parameters, provided that the estimator is consistent with convergence rate at least $O(n^{-1/2})$, where $n$ is the number of observations. They show that we can substitute into $S$ any estimator $\tilde\theta_1$ of $\theta_1$ for which $\sqrt{n}\,|\tilde\theta_1 - \theta_1|$ is bounded in probability as $n \to \infty$. In that case they rename the score statistic the $C(\alpha)$ test statistic.
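As a computational summary of this section, the following sketch evaluates $S$ from the score subvector and partitioned information. How $\dot\ell_2$ and the $I_{jk}$ blocks are obtained at $\theta_1 = \hat\theta_1$, $\theta_2 = 0$ is assumed supplied by the model at hand.

```python
# Sketch of the score (C(alpha)) statistic S = l2' I_{2.1}^{-1} l2.
import numpy as np

def score_statistic(l2, I11, I12, I22):
    """Return S = l2' (I22 - I21 I11^{-1} I12)^{-1} l2, with I21 = I12'."""
    I2_1 = I22 - I12.T @ np.linalg.solve(I11, I12)  # adjusted information I_{2.1}
    return float(l2 @ np.linalg.solve(I2_1, l2))

# Under H0, S is asymptotically chi-square on p2 = len(l2) degrees of freedom.
```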
3 Generalized linear models

Generalized linear models assume that observations are distributed according to a linear exponential family with an additional dispersion parameter. The density or probability mass function for each response is assumed to be of the form

$$f(y; \theta, \phi) = a(y, \phi) \exp[\{y\theta - \kappa(\theta)\}/\phi] \tag{1}$$
where $a$ and $\kappa$ are suitable known functions. The mean is $\mu = \dot\kappa(\theta)$ and the variance is $\phi\,\ddot\kappa(\theta)$. The mean $\mu$ and the canonical parameter $\theta$ are one-to-one functions of one another. We call $\phi$ the dispersion parameter and $v(\mu) = \ddot\kappa(\theta)$ the variance function. Following Jørgensen [10, 12], we call the distribution described by (1) an exponential dispersion model and denote it $ED(\mu, \phi)$.

If $y_1, \ldots, y_n$ are independently distributed as $ED(\mu, \phi)$ then the sample mean $\bar y$ is sufficient for $\mu$ and $\bar y \sim ED(\mu, \phi/n)$. More generally, if $y_i \sim ED(\mu, \phi/w_i)$ where the $w_i$ are known weights, then the weighted mean $\bar y_w$ is sufficient for $\mu$ and

$$\bar y_w = \frac{\sum_{i=1}^n w_i y_i}{\sum_{i=1}^n w_i} \sim ED\left(\mu,\; \frac{\phi}{\sum_{i=1}^n w_i}\right).$$
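As a standard special case of the form (1), the Poisson distribution with mean $\mu$ has

$$f(y; \theta) = \frac{1}{y!}\exp\{y\theta - e^\theta\}, \qquad \theta = \log\mu, \quad \kappa(\theta) = e^\theta, \quad \phi = 1,$$

so that $\mu = \dot\kappa(\theta) = e^\theta$ and $v(\mu) = \ddot\kappa(\theta) = e^\theta = \mu$; that is, the Poisson distribution is $ED(\mu, 1)$ with variance function $v(\mu) = \mu$.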
In a generalized linear model, the means $\mu_i$ of the responses are assumed to follow the link-linear model

$$g(\mu_i) = x_i^T \beta \tag{2}$$

where $g$ is a known monotonic link function, $x_i$ is a vector of covariates and $\beta$ is an unknown vector of regression coefficients. Without loss of generality we will assume that the $n \times p$ matrix $X$ with rows $x_i^T$ is of full column rank and that $p < n$, where $p$ is the dimension of $\beta$.

More generally, we will consider generalized nonlinear models in which the mean vector $\mu = (\mu_1, \ldots, \mu_n)^T$ is a general $n$-dimensional function of the $p$-vector $\beta$. To ensure that the parametrization is not degenerate, we will assume that the gradient matrix $\partial\mu/\partial\beta$ is of full column rank, at least in a neighborhood containing the true value of $\beta$ and the maximum likelihood estimate $\hat\beta$.

In this article we mainly consider models in which the dispersion is known, $\phi = 1$ say. Most models with discrete responses have known dispersion.
4 Goodness of fit

Let $\Omega$ be the locus of possible values for $\mu$, $\Omega = \{\mu(\beta) : \beta \in \mathbb{R}^p\}$. Let $H_0$ be the null hypothesis that $\mu$ belongs to $\Omega$ and let $H_a$ be the alternative hypothesis that $\mu$ is unrestricted. The goodness of fit test for the current model tests $H_0$ against $H_a$. For a generalized linear model, $H_0$ is the hypothesis that the $\mu_i$ are described by the link-linear model (2).

Theorem 1 The score statistic for the goodness of fit test of a generalized nonlinear model with unit dispersion is the Pearson chi-square statistic
$$S = \sum_{i=1}^n \frac{w_i (y_i - \hat\mu_i)^2}{v(\hat\mu_i)}$$
where $\hat\mu_i$ is the expected value $\mu_i$ evaluated at the maximum likelihood estimator $\hat\beta$.

Proof. There exists a parameter vector $\theta_2$ of dimension $n - p$ such that $(\beta^T, \theta_2^T)^T$ is a one-to-one transformation of $\mu$ in the neighborhood of interest and such that $\theta_2 = 0$ if and only if $\mu \in \Omega$. The goodness of fit test consists of testing $H_0\colon \theta_2 = 0$ against the alternative that $\theta_2$ is unrestricted. The components of the original parameter vector $\beta$ are the nuisance parameters for this test. In the generalized linear model case, the implicit parameter vector $\theta_2$ can be constructed by finding an $n \times (n - p)$ matrix $X_2$ such that $(X, X_2)$ is of full rank. Then $H_a$ is the saturated model that $g(\mu) = X\beta + X_2\theta_2$ for some $\beta$ and some $\theta_2$.
Let $\dot\ell_1$ and $\dot\ell_2$ be the score vectors for $\beta$ and $\theta_2$ respectively, and let $I$ be the Fisher information matrix, partitioned into $I_{11}$, $I_{12}$ and $I_{22}$ as in Section 2. The Fisher information for $\theta_2$ adjusted for estimation of $\beta$ is $I_{2.1}$, and the score statistic for testing $H_0$ is

$$S = \dot\ell_2^T I_{2.1}^{-1} \dot\ell_2$$

where $\dot\ell_2$ and $I_{2.1}$ are evaluated at $\beta = \hat\beta$ and $\theta_2 = 0$.
Let $V = \mathrm{diag}\{v(\mu_i)/w_i\}$ and write $e = V^{-1/2}(y - \mu)$ for the vector of Pearson residuals. Also write

$$U_1 = V^{-1/2} \frac{\partial\mu}{\partial\beta} \quad\text{and}\quad U_2 = V^{-1/2} \frac{\partial\mu}{\partial\theta_2}.$$

It is straightforward to show that the score vectors are given by $\dot\ell_j = U_j^T e$ for $j = 1, 2$, and the information matrices are given by

$$I_{jk} = U_j^T U_k$$

for $j, k = 1, 2$ [25] [27].

Write $P_1$ for the matrix $P_1 = U_1 (U_1^T U_1)^{-1} U_1^T$ of the orthogonal projection onto the column space of $U_1$. Also write $U_{2.1} = (I - P_1) U_2$ and $P_{2.1}$ for the matrix

$$P_{2.1} = U_{2.1} (U_{2.1}^T U_{2.1})^{-1} U_{2.1}^T$$

of the orthogonal projection onto the column space of $U_{2.1}$. Then $P_1$ and $P_{2.1}$ project onto orthogonal subspaces and $P_1 + P_{2.1} = I$, since the dimensions of the subspaces add to $n$. We can rewrite

$$I_{2.1} = U_2^T U_2 - U_2^T U_1 (U_1^T U_1)^{-1} U_1^T U_2 = U_2^T (I - P_1) U_2 = U_{2.1}^T U_{2.1}.$$

We can also rewrite

$$\dot\ell_2 = (U_2^T - U_2^T P_1) e = U_{2.1}^T e$$

because evaluating at $\beta = \hat\beta$ ensures that $U_1^T e = 0$ and hence $P_1 e = 0$. This shows that the score statistic is

$$S = e^T U_{2.1} (U_{2.1}^T U_{2.1})^{-1} U_{2.1}^T e = e^T P_{2.1} e = e^T (P_1 + P_{2.1}) e = e^T e = \sum_{i=1}^n \frac{w_i (y_i - \hat\mu_i)^2}{v(\hat\mu_i)},$$

which is the Pearson statistic. $\Box$
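The algebra above can be checked numerically. The following sketch, under assumed simulated Poisson data with the log link, constructs $U_1$, an arbitrary full-rank completion playing the role of $U_2$, and the projections, and confirms that $e^T P_{2.1} e$ equals the Pearson statistic $e^T e$. The data and the particular completion are illustrative choices.

```python
# Numerical check of Theorem 1 for a Poisson log-linear model (w_i = 1).
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 2
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, n)])
y = rng.poisson(np.exp(X @ np.array([1.0, 0.5])))

# Fit by Fisher scoring, as in the earlier sketch.
beta = np.zeros(p)
for _ in range(25):
    mu = np.exp(X @ beta)
    z = X @ beta + (y - mu) / mu
    beta = np.linalg.solve(X.T @ (mu[:, None] * X), X.T @ (mu * z))
mu = np.exp(X @ beta)

V_isqrt = 1 / np.sqrt(mu)               # V^{-1/2}, since v(mu) = mu and w_i = 1
e = V_isqrt * (y - mu)                  # Pearson residuals
U1 = V_isqrt[:, None] * (mu[:, None] * X)   # V^{-1/2} dmu/dbeta for the log link
# Complete U1 to full rank; the orthogonal complement plays the role of U_{2.1}.
Q, _ = np.linalg.qr(np.column_stack([U1, rng.standard_normal((n, n - p))]))
U2 = Q[:, p:]
P1 = U1 @ np.linalg.solve(U1.T @ U1, U1.T)
U21 = (np.eye(n) - P1) @ U2
S = e @ U21 @ np.linalg.solve(U21.T @ U21, U21.T @ e)
print(S, np.sum(e ** 2))                # agree: score statistic = Pearson X^2
```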
The result extends immediately to the case in which there are multiple response observations for the same combination of explanatory variables.

Corollary 1 Suppose that the $y_{ij} \sim ED(\mu_i, 1/w_{ij})$, $i = 1, \ldots, n$, $j = 1, \ldots, n_i$, are independent, where the $\mu_i$ follow a generalized nonlinear model with unit dispersion. The score statistic for the goodness of fit test is

$$S = \sum_{i=1}^n \frac{w_i (\bar y_{w_i} - \hat\mu_i)^2}{v(\hat\mu_i)}$$

where

$$w_i = \sum_{j=1}^{n_i} w_{ij} \quad\text{and}\quad \bar y_{w_i} = \frac{1}{w_i} \sum_{j=1}^{n_i} w_{ij} y_{ij}.$$
Proof. The weighted means $\bar y_{w_i}$ are sufficient for the $\mu_i$, and $\bar y_{w_i} \sim ED(\mu_i, 1/w_i)$. The $\bar y_{w_i}$ are distributed as for the $y_i$ but with weights $w_i$, so the result follows immediately from Theorem 1. $\Box$
Example. Suppose that the $y_{ij}$ are binary responses and that $w_{ij} = 1$ for all $i$ and $j$. Then

$$S = \sum_{i=1}^n \frac{n_i (r_i - \hat p_i)^2}{v(\hat p_i)}$$
where $r_i$ is the empirical proportion for the $i$th covariate-combination group, $\hat p_i$ is the estimated probability that $y_{ij} = 1$, and $v(\hat p_i) = \hat p_i (1 - \hat p_i)$. If $y_{i\cdot} = \sum_{j=1}^{n_i} y_{ij}$ is the number of successes for the $i$th group, then the $y_{i\cdot}$ are binomial$(n_i, p_i)$ and
$$S = \sum_{i=1}^n \frac{(y_{i\cdot} - \hat\mu_i)^2}{v_i(\hat\mu_i)}$$
with $\mu_i = n_i p_i$ and $v_i(\mu_i) = \mu_i (n_i - \mu_i)/n_i$. This is the Pearson goodness of fit statistic for the data summarized in the usual generalized linear model way as binomial counts for each covariate-combination group.

Example. Paul and Banerjee [19] derive the score test for interaction in a two-way contingency table with multiple counts in each cell. Corollary 1 includes Paul and Banerjee's Theorem 1 as a special case.
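A quick numerical sketch of the binary-data example, with illustrative group sizes, success counts, and fitted probabilities (assumed to come from some fitted model), confirms that the proportion form and the count form of the statistic agree:

```python
# Pearson statistic for grouped binary data: proportion form vs count form.
import numpy as np

n_i = np.array([10, 12, 8, 15])             # observations per covariate group
y_i = np.array([4, 7, 2, 9])                # successes per group
p_hat = np.array([0.35, 0.55, 0.30, 0.60])  # fitted probabilities (assumed given)

r_i = y_i / n_i                             # empirical proportions
S_props = np.sum(n_i * (r_i - p_hat) ** 2 / (p_hat * (1 - p_hat)))

mu_i = n_i * p_hat                          # binomial means
v_i = mu_i * (n_i - mu_i) / n_i             # binomial variance function
S_counts = np.sum((y_i - mu_i) ** 2 / v_i)

print(S_props, S_counts)                    # identical
```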
5 Extra variance parameters

Suppose now that there are extra parameters which affect the variance of the $y_i$, but not the mean, and which are outside the exponential dispersion model framework. Let $\gamma$ be the vector of extra parameters and let $G$ be the parameter space for $\gamma$. Suppose that for each fixed value of $\gamma$, the $y_i$ follow a generalized nonlinear model with variance function $v(\mu; \gamma)$. The values of $\gamma$ effectively index a class of generalized nonlinear models. This setup arises frequently when extra parameters are introduced to accommodate overdispersion in generalized linear models [1] [2] [7] [19].

It is straightforward to show that $\beta$ and $\gamma$ are orthogonal parameters. This follows because $\dot\ell_1 = (\partial\mu/\partial\beta)^T V^{-1}(y - \mu)$ and $\mu$ does not depend on $\gamma$. Therefore the cross derivative $\partial\dot\ell_1/\partial\gamma$ will be linear in $y$ and will have expectation zero. Orthogonality of $\beta$ and $\gamma$ implies that estimation of $\gamma$ does not affect the form of the score statistics for goodness of fit. According to the theory of $C(\alpha)$ tests, $\gamma$ may be replaced in the score test statistics by any estimator which is $O(n^{-1/2})$ consistent without changing the distributional properties of $S$ to first order. This gives the following theorem.

Theorem 2 Suppose that for each $\gamma \in G$, the $y_i \sim ED(\mu_i, 1/w_i)$, $i = 1, \ldots, n$, are independent with variance function $v(\mu; \gamma)$. The $C(\alpha)$ goodness of fit statistic is
$$S = \sum_{i=1}^n \frac{w_i (y_i - \hat\mu_i)^2}{v(\hat\mu_i; \hat\gamma)}.$$
Corollary 2 With multiple observations $y_{ij} \sim ED(\mu_i, 1/w_{ij})$ as in Corollary 1, the $C(\alpha)$ goodness of fit statistic is

$$S = \sum_{i=1}^n \frac{w_i (\bar y_{w_i} - \hat\mu_i)^2}{v(\hat\mu_i; \hat\gamma)}$$
where $\hat\gamma$ is any $\sqrt{n}$-consistent estimator of $\gamma$, $\hat\mu_i$ is the maximum likelihood estimator of $\mu_i$ given $\gamma = \hat\gamma$, the $w_i$ are sums of weights and the $\bar y_{w_i}$ are weighted means as before. The proofs of Theorem 2 and the corollary are similar to the proofs of Theorem 1 and Corollary 1.

Example. Suppose that $y_{ij}$ follows a negative binomial distribution with mean $\mu_i$ and variance function $v(\mu; c) = \mu + c\mu^2$, $i = 1, \ldots, n$, $j = 1, \ldots, n_i$, for each $c \geq 0$. Suppose that the $\mu_i$ are a function of a vector of regression parameters. For fixed values of $c$, the means $\bar y_i$ are sufficient for the $\mu_i$ and are negative binomial with the same variance function and weights $n_i$. The $C(\alpha)$ goodness of fit statistic therefore is

$$S = \sum_{i=1}^n \frac{n_i (\bar y_i - \hat\mu_i)^2}{\hat\mu_i + \hat c\, \hat\mu_i^2}$$

where $\hat c$ is a $\sqrt{n}$-consistent estimator of $c$ and $\hat\mu_i$ is the maximum likelihood estimator of $\mu_i$ with $c = \hat c$. This includes Theorem 3 of Paul and Banerjee [19].

One possible estimator for $\gamma$ is the maximum likelihood estimator. An alternative estimation method is to solve $S = n - p$ with respect to $\gamma$. This estimator is often preferred in overdispersion contexts because it is evidently a consistent estimator based only on the first and second moments of the $y_i$, and therefore has a quasi-likelihood flavor (Breslow [2]). Obviously the score statistic $S$ is no longer useful as a goodness of fit statistic if $\gamma$ is estimated by either of the above methods. If there are repeat observations for covariate combinations, then an estimate of $\gamma$ may instead be obtained from the pure error or within-covariate-combination variability. In this approach, $\gamma$ can be estimated by solving

$$\sum_{i=1}^n \sum_{j=1}^{n_i} \frac{w_{ij} (y_{ij} - \bar y_{w_i})^2}{v(\bar y_{w_i}; \gamma)} = \sum_{i=1}^n (n_i - 1).$$
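The estimating equation above can be solved with one-dimensional root finding. The following sketch does this for the negative binomial variance function $v(\mu; c) = \mu + c\mu^2$ with $w_{ij} = 1$; the data and the bracketing interval passed to `brentq` are illustrative assumptions.

```python
# Sketch: pure-error estimation of c in v(mu; c) = mu + c*mu^2 (w_ij = 1).
import numpy as np
from scipy.optimize import brentq

groups = [np.array([3, 7, 5, 9, 4, 8]),       # repeat observations for
          np.array([2, 6, 10, 5, 7]),         # three covariate combinations
          np.array([12, 4, 6, 9, 8, 15, 9])]  # (illustrative data)

def pure_error_equation(c):
    # sum_ij (y_ij - ybar_i)^2 / v(ybar_i; c)  minus  sum_i (n_i - 1)
    lhs = sum(np.sum((y - y.mean()) ** 2) / (y.mean() + c * y.mean() ** 2)
              for y in groups)
    return lhs - sum(len(y) - 1 for y in groups)

c_hat = brentq(pure_error_equation, 0.0, 1.0)  # root on an assumed bracket
print(c_hat)
```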
6 Unknown dispersion

All the above results have assumed that $\phi = 1$. If $\phi$ is unknown, then both $\dot\ell$ and $I$ are divided by $\phi$ and the score statistic for goodness of fit for a generalized nonlinear model becomes

$$S = \frac{1}{\phi} \sum_{i=1}^n \frac{w_i (y_i - \hat\mu_i)^2}{v(\hat\mu_i)}.$$

The appearance of the unknown scale parameter $\phi$ in $S$ means that the statistic is no longer useful for judging goodness of fit. The statistic leads instead, by equating $S$ to its expectation, to the so-called Pearson estimator of $\phi$,

$$\hat\phi = \frac{1}{n - p} \sum_{i=1}^n \frac{w_i (y_i - \hat\mu_i)^2}{v(\hat\mu_i)},$$
which is the default estimator of $\phi$ in the generalized linear model functions of the statistical programs S-Plus and R. Other estimators of $\phi$ are discussed by Jørgensen [11].

When there are repeat observations, the difference between the full version of the score statistic in Theorem 1 and the reduced form in Corollary 1 can be used to define a pure error estimate of the dispersion parameter $\phi$,

$$\hat\phi_{\mathrm{pure}} = \frac{1}{\sum_{i=1}^n (n_i - 1)} \sum_{i=1}^n \sum_{j=1}^{n_i} \frac{w_{ij} (y_{ij} - \bar y_{w_i})^2}{v(\bar y_{w_i})}.$$
In the case of normal linear regression, this is the well known pure error estimator of the variance. With the use of this estimator, the score statistic recovers its use as a goodness of fit statistic, but now as a generalized $F$-statistic rather than chi-square. Substituting the pure error estimator into the score test for the reduced data gives

$$F = \frac{1}{(n - p)\,\hat\phi_{\mathrm{pure}}} \sum_{i=1}^n \frac{w_i (\bar y_{w_i} - \hat\mu_i)^2}{v(\hat\mu_i)}.$$

If the $y_{ij}$ are approximately normal, then $F$ follows approximately an $F$-distribution on $n - p$ and $\sum_{i=1}^n (n_i - 1)$ degrees of freedom under the null hypothesis. This is asymptotically true, for example, as the weights $w_{ij} \to \infty$ or the dispersion $\phi \to 0$, because any exponential dispersion model $ED(\mu, \phi)$ tends to normality as $\phi \to 0$ [11, 12]. The $F$ statistic above is a generalization of the normal theory equivalents, described for example by Weisberg [28, Section 4.3].
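The following sketch assembles the pure error estimator and the $F$-statistic for grouped data with $w_{ij} = 1$, so that $w_i = n_i$. The fitted means, the variance function, and the parameter count $p$ are assumed supplied by a fit of the current model.

```python
# Sketch of the pure-error dispersion estimate and the generalized F-statistic
# for grouped data with w_ij = 1 (so w_i = n_i).
import numpy as np

def pure_error_F(groups, mu_hat, v, p):
    """groups: list of arrays of repeat observations y_ij, one array per
    covariate combination; mu_hat: fitted means, one per group;
    v: variance function; p: number of regression parameters.
    Returns (phi_pure, F)."""
    n = len(groups)
    n_i = np.array([len(y) for y in groups])
    ybar = np.array([y.mean() for y in groups])
    # Pure error estimate of phi from within-group variability.
    phi_pure = sum(np.sum((y - y.mean()) ** 2) / v(y.mean()) for y in groups) \
               / np.sum(n_i - 1)
    # Score statistic for the group means, scaled by the pure error estimate;
    # approximately F(n - p, sum(n_i - 1)) under the null hypothesis.
    F = np.sum(n_i * (ybar - mu_hat) ** 2 / v(mu_hat)) / ((n - p) * phi_pure)
    return phi_pure, F

# Example use with a constant variance function (normal linear model case):
# pure_error_F([np.array([1.0, 2.0]), np.array([2.0, 3.0, 4.0])],
#              mu_hat=np.array([1.4, 3.1]), v=lambda m: 1.0, p=1)
```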
Dedication
This article is in honor of Terry Speed, from whom I learned generalized linear models while an undergraduate student in Perth, Western Australia. Terry's enthusiasm for
References
[1] Breslow, N. E. Score tests in overdispersed generalized linear models. In A. Decarli, B. J. Francis, R. Gilchrist, and G. U. H. Seeber, editors, Proceedings of GLIM 89 and the 4th International Workshop on Statistical Modelling, pages 64–74. Springer-Verlag: New York, 1989.

[2] Breslow, N. E. Tests of hypotheses in overdispersed Poisson regression and other quasi-likelihood models. Journal of the American Statistical Association, 85:565–571, 1990.

[3] Cameron, A., and Trivedi, P. Regression-based tests for overdispersion in the Poisson model. Journal of Econometrics, 46:347–364, 1990.

[4] Chen, C.-F. Score tests for regression models. Journal of the American Statistical Association, 78:158–161, 1983.

[5] Cox, D. R. Some remarks on overdispersion. Biometrika, 70:269–274, 1983.

[6] Cox, D. R., and Hinkley, D. V. Theoretical Statistics. Chapman and Hall: London, 1974.

[7] Dean, C. B. Testing for overdispersion in Poisson and binomial regression models. Journal of the American Statistical Association, 87:451–457, 1992.

[8] Deng, D., and Paul, S. R. Score tests for zero inflation in generalized linear models. Canadian Journal of Statistics, 28:563–570, 2000.

[9] Genter, F. C., and Farewell, V. T. Goodness-of-link testing in ordinal regression models. Canadian Journal of Statistics, 13:37–44, 1985.

[10] Jørgensen, B. Exponential dispersion models (with discussion). Journal of the Royal Statistical Society Series B, 49:127–162, 1987.

[11] Jørgensen, B. The theory of exponential dispersion models and analysis of deviance. Monografias de Matemática No. 51, Instituto de Matemática Pura e Aplicada, Rio de Janeiro, 1992.

[12] Jørgensen, B. Theory of Dispersion Models. Chapman and Hall: London, 1997.

[13] Lu, W.-S. Score tests for overdispersion in Poisson regression models. Journal of Statistical Computation and Simulation, 56:213–228, 1997.

[14] McCullagh, P., and Nelder, J. A. Generalized Linear Models. Chapman and Hall: London, 1989.
[15] Neyman, J. Optimal asymptotic tests of composite hypotheses. In U. Grenander, editor, Probability and Statistics: The Harald Cramér Volume, pages 213–234. Wiley: New York, 1959.

[16] Neyman, J., and Scott, E. On the use of C(α) optimal tests of composite hypotheses. Bulletin of the International Statistical Institute, Proceedings of the 35th Session, 41:477–497, 1966.

[17] Pearson, K. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine Series 5, 50:157–175, 1900.

[18] Paul, S. R., and Deng, D. Goodness of fit of generalized linear models to sparse data. Journal of the Royal Statistical Society Series B, 62:323–333, 2000.

[19] Paul, S. R., and Banerjee, T. Analysis of two-way layout of count data involving multiple counts in each cell. Journal of the American Statistical Association, 93:1419–1429, 1998.

[20] Pregibon, D. Goodness of link tests for generalized linear models. Applied Statistics, 29:15–24, 1980.

[21] Pregibon, D. Score tests in GLIM with applications. In R. Gilchrist, editor, GLIM 82: Proceedings of the International Conference on Generalised Linear Models, pages 87–97. Springer-Verlag: New York, 1982.

[22] Rao, C. R. Large sample tests of statistical hypotheses concerning several parameters with applications to problems of estimation. Proceedings of the Cambridge Philosophical Society, 44:50–57, 1948.

[23] Rao, C. R. Linear Statistical Inference and its Applications, Second Edition. Wiley: New York, 1973.

[24] Smith, P. J., and Heitjan, D. F. Testing and adjusting for departures from nominal dispersion in generalized linear models. Applied Statistics, 42:31–41, 1993.

[25] Smyth, G. K. Exponential dispersion models and the Gauss-Newton algorithm. Australian Journal of Statistics, 33:57–64, 1991.

[26] Thall, P. F. Score tests in two-way layouts of counts. Communications in Statistics Part A: Theory and Methods, 21:3017–3036, 1992.

[27] Wei, B.-C. Exponential Family Nonlinear Models. Springer-Verlag: Singapore, 1998.

[28] Weisberg, S. Applied Linear Regression. Wiley: New York, 1985.