The VGAM Package For Categorical Data Analysis: Thomas W. Yee
Thomas W. Yee
University of Auckland
Abstract
Classical categorical regression models such as the multinomial logit and proportional
odds models are shown to be readily handled by the vector generalized linear and additive
model (VGLM/VGAM) framework. Additionally, there are natural extensions, such as
reduced-rank VGLMs for dimension reduction, and allowing covariates that have values
specific to each linear/additive predictor, e.g., for consumer choice modeling. This article
describes some of the framework behind the VGAM R package, its usage and implemen-
tation details.
Keywords: categorical data analysis, Fisher scoring, iteratively reweighted least squares, multi-
nomial distribution, nominal and ordinal polytomous responses, smoothing, vector generalized
linear and additive models, VGAM R package.
1. Introduction
This is a VGAM vignette for categorical data analysis (CDA) based on Yee (2010a). Any
subsequent features (especially non-backward compatible ones) will appear here.
The subject of CDA is concerned with analyses where the response is categorical regardless of
whether the explanatory variables are continuous or categorical. Such data arise very
frequently. Over the years several CDA regression models for polytomous responses have become
popular, e.g., those in Table 1. Not surprisingly, the models are interrelated: their foundation
is the multinomial distribution and consequently they share similar and overlapping properties
which modellers should know and exploit. Unfortunately, software has been slow to reflect
their commonality and this makes analyses unnecessarily difficult for the practitioner on
several fronts, e.g., using different functions/procedures to fit different models which does not
aid the understanding of their connections.
This historical misfortune can be seen by considering R functions for CDA. From the Com-
prehensive R Archive Network (CRAN, http://CRAN.R-project.org/) there is polr() (in
MASS; Venables and Ripley 2002) for a proportional odds model and multinom() (in nnet;
Venables and Ripley 2002) for the multinomial logit model. However, both of these can be
considered ‘one-off’ modeling functions rather than providing a unified offering for CDA. The
function lrm() (in rms; Harrell, Jr. 2016) has greater functionality: it can fit the proportional
odds model (and the forward continuation ratio model upon preprocessing). Neither polr()
nor lrm() appears able to fit the nonproportional odds model. There are non-CRAN packages
too, such as the modeling function nordr() (in gnlm; Lindsey 2007), which can fit the pro-
portional odds, continuation ratio and adjacent categories models; however it calls nlm() and
the user must supply starting values.

Table 1: Quantities defined in VGAM for a categorical response Y taking values 1, . . . , M + 1.
Covariates x have been omitted for clarity. The LHS quantities are η_j or η_{j−1} for j = 1, . . . , M
(not reversed) and j = 2, . . . , M + 1 (if reversed), respectively. All models are estimated by
minimizing the deviance. All except for multinomial() are suited to ordinal Y.

In general these R (R Development Core Team 2009)
modeling functions are not modular and often require preprocessing and sometimes are not
self-starting. The implementations can be perceived as scattered and piecemeal in nature.
Consequently if the practitioner wishes to fit the models of Table 1 then there is a need to
master several modeling functions from several packages each having different syntaxes etc.
This is a hindrance to efficient CDA.
SAS (SAS Institute Inc. 2003) does not fare much better than R. Indeed, it could be considered
as having an excess of options which bewilders the non-expert user; there is little coherent
overriding structure. Its proc logistic handles the multinomial logit and proportional odds
models, as well as exact logistic regression (see Stokes et al. 2000, which is for Version 8
of SAS). The fact that the proportional odds model may be fitted by proc logistic, proc
genmod and proc probit arguably leads to possible confusion rather than the making of
connections, e.g., genmod is primarily for GLMs and the proportional odds model is not
a GLM in the classical Nelder and Wedderburn (1972) sense. Also, proc phreg fits the
multinomial logit model, and proc catmod with its WLS implementation adds to further
potential confusion.
This article attempts to show how these deficiencies can be addressed by considering the
vector generalized linear and additive model (VGLM/VGAM) framework, as implemented by
the author’s VGAM package for R. The main purpose of this paper is to demonstrate how the
framework is very well suited to many ‘classical’ regression models for categorical responses,
and to describe the implementation and usage of VGAM for such. To this end an outline of
this article is as follows. Section 2 summarizes the basic VGLM/VGAM framework. Section
3 centers on functions for CDA in VGAM. Given an adequate framework, some natural
extensions of Section 2 are described in Section 4. Users of VGAM can benefit from Section
5 which shows how the software reflects their common theory. Some examples are given in
Section 6. Section 7 contains selected topics in statistical computing that are more relevant
to programmers interested in the underlying code. Section 8 discusses several utilities and
extensions needed for advanced CDA modeling, and the article concludes with a discussion.
This document was run using VGAM 0.7-10 (Yee 2010b) under R 2.10.0.
Some general references for categorical data providing background to this article include
Agresti (2010), Agresti (2013), Agresti (2018), Fahrmeir and Tutz (2001), Fullerton and
Xu (2016), Harrell (2015), Hensher et al. (2015), Leonard (2000), Lloyd (1999), Long (1997),
McCullagh and Nelder (1989), Simonoff (2003), Smithson and Merkle (2013) and Tutz (2012).
An overview of models for ordinal responses is Liu and Agresti (2005), and a manual for fitting
common models found in Agresti (2002) to polytomous responses with various software is
Thompson (2009). A package for visualizing categorical data in R is vcd (Meyer et al. 2006,
2009).
2. VGLM/VGAM overview
This section summarizes the VGLM/VGAM framework with a particular emphasis toward
categorical models since the classes encapsulate many multivariate response models in, e.g.,
survival analysis, extreme value analysis, quantile and expectile regression, time series, bioas-
say data, nonlinear least-squares models, and scores of standard and nonstandard univariate
and continuous distributions. The framework is partially summarized by Table 2. More gen-
eral details about VGLMs and VGAMs can be found in Yee and Hastie (2003) and Yee and
Wild (1996) respectively. An informal and practical article connecting the general framework
with the software is Yee (2008).
2.1. VGLMs
Suppose the observed response y is a q-dimensional vector. VGLMs are defined as a model
for which the conditional distribution of Y given explanatory x is of the form

f(y | x; B, φ) = h(y, η_1(x), . . . , η_M(x), φ)    (1)

for some known function h(·), where B = (β_1 β_2 · · · β_M) is a p × M matrix of regression
coefficients, and the jth linear predictor is

η_j = η_j(x) = β_j⊤ x = Σ_{k=1}^{p} β_(j)k x_k,   j = 1, . . . , M.    (2)

Here x = (x_1, . . . , x_p)⊤ with x_1 = 1 if there is an intercept. Note that (2) means that all the
parameters may be potentially modelled as functions of x. It can be seen that VGLMs are like
GLMs but allow for multiple linear predictors, and they encompass models outside the small
confines of the exponential family. In (1) the quantity φ is an optional scaling parameter
which is included for backward compatibility with common adjustments to overdispersion,
e.g., with respect to GLMs.
In general there is no relationship between q and M : it depends specifically on the model or
distribution to be fitted. However, for the ‘classical’ categorical regression models of Table 1
we have M = q − 1 since q is the number of levels the multi-category response Y has.
The ηj of VGLMs may be applied directly to parameters of a distribution rather than just to
a mean for GLMs. A simple example is a univariate distribution with a location parameter
ξ and a scale parameter σ > 0, where we may take η1 = ξ and η2 = log σ. In general,
ηj = gj (θj ) for some parameter link function gj and parameter θj . For example, the adjacent
categories models in Table 1 are ratios of two probabilities, therefore a log link of ζjR or ζj is
η(x)                                              Model      Function    Reference
B_1⊤ x_1 + Σ_{k=p1+1}^{p1+p2} H_k f*_k(x_k)       VGAM       vgam()      Yee and Wild (1996)
B_1⊤ x_1 + A ν                                    RR-VGLM    rrvglm()    Yee and Hastie (2003)

Table 2: Some of the package VGAM and its framework. The vector of latent variables
ν = C⊤ x_2 where x⊤ = (x_1⊤, x_2⊤).
the default. In VGAM, there are currently over a dozen links to choose from, of which any
can be assigned to any parameter, ensuring maximum flexibility. Table 5 lists some of them.
VGLMs are estimated using iteratively reweighted least squares (IRLS) which is particularly
suitable for categorical models (Green 1984). All models in this article have a log-likelihood
ℓ = Σ_{i=1}^{n} w_i ℓ_i    (3)
where the wi are known positive prior weights. Let xi denote the explanatory vector for the
ith observation, for i = 1, . . . , n. Then one can write

η_i = η(x_i) = (η_1(x_i), . . . , η_M(x_i))⊤ = B⊤ x_i = (β_1⊤ x_i, . . . , β_M⊤ x_i)⊤

              ⎡ β_(1)1 · · · β_(1)p ⎤
            = ⎢         ...         ⎥ x_i = ( β_(1) · · · β_(p) ) x_i.    (4)
              ⎣ β_(M)1 · · · β_(M)p ⎦
Fisher scoring usually has good numerical stability because the Wi are positive-definite over
a larger region of parameter space than Newton-Raphson. For the categorical models in this
article the expected information matrices are simpler than the observed information matrices,
and are easily derived, therefore all the families in Table 1 implement Fisher scoring.
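To make the IRLS idea concrete, here is a minimal base-R Fisher scoring loop for the M = 1 special case of binary logistic regression (where the expected and observed informations coincide). This is only an illustrative sketch, not the package's internals; the helper name irls_logit is made up.

```r
# Minimal sketch of Fisher scoring / IRLS for binary logistic regression
# (the M = 1 special case); irls_logit is a hypothetical helper, not VGAM code.
irls_logit <- function(X, y, iters = 25) {
  beta <- rep(0, ncol(X))
  for (it in seq_len(iters)) {
    eta <- drop(X %*% beta)        # current linear predictor
    mu  <- plogis(eta)             # fitted probabilities
    w   <- mu * (1 - mu)           # expected-information working weights
    z   <- eta + (y - mu) / w      # working (adjusted) response
    beta <- solve(crossprod(X, w * X), crossprod(X, w * z))
  }
  drop(beta)
}

# Agrees with glm()'s own IRLS fit:
set.seed(1)
x <- rnorm(200); X <- cbind(1, x)
y <- rbinom(200, 1, plogis(-0.5 + x))
max(abs(irls_logit(X, y) - coef(glm(y ~ x, family = binomial))))  # essentially zero
```

The same weighted least squares step, applied with M × M weight matrices W_i and vector working responses, is the vector analogue used for the models of Table 1.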
2.2. VGAMs

VGAMs provide additive-model extensions to VGLMs; that is, (2) is generalized to

η_j(x) = β_(j)1 + Σ_{k=2}^{p} f_(j)k(x_k),   j = 1, . . . , M,

a sum of smooth functions of the individual covariates, just as with ordinary GAMs (Hastie
and Tibshirani 1990). The f k = (f(1)k (xk ), . . . , f(M )k (xk ))> are centered for uniqueness, and
are estimated simultaneously using vector smoothers. VGAMs are thus a visual data-driven
method that is well suited to exploring data, and they retain the simplicity of interpretation
that GAMs possess.
An important concept, especially for CDA, is the idea of ‘constraints-on-the-functions’. In
practice we often wish to constrain the effect of a covariate to be the same for some of the ηj
and to have no effect for others. We shall see below that this constraints idea is important
for several categorical models because of a popular parallelism assumption. As a specific
example, for VGAMs we may wish to take

η_1(x) = β_(1)1 + f_(1)2(x_2) + f_(1)3(x_3),
η_2(x) = β_(2)1 + f_(1)2(x_2),    (5)

so that f_(1)2 ≡ f_(2)2 and f_(2)3 ≡ 0. For VGAMs, we can represent these models using
η(x) = β_(1) + Σ_{k=2}^{p} f_k(x_k) = H_1 β*_(1) + Σ_{k=2}^{p} H_k f*_k(x_k)    (6)
(e_k is a vector of zeros except for a one in the kth position) so that X_VLM is (nM) × p*
where p* = Σ_{k=1}^{p} ncol(H_k) is the total number of columns of all the constraint matrices.
fact, the latter is (for now?). Users need to set cumulative(parallel = TRUE) explicitly to
fit a proportional odds model—hopefully this will alert them to the fact that they are making
the proportional odds assumption and check its validity (Peterson 1990), e.g., through a
deviance or likelihood ratio test. However, the default means numerical problems can occur
far more readily. Thus there is tension between the two options. As a compromise
there is now a VGAM family function called propodds(reverse = TRUE) which is equivalent
to cumulative(parallel = TRUE, reverse = reverse, link = "logit").
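A minimal sketch of this equivalence, assuming the pneumo data frame shipped with VGAM (coal miners' pneumoconiosis, a standard example in the package's documentation); argument names follow the package version used in this document and may differ in later releases:

```r
# Sketch: propodds(reverse = TRUE) versus the explicit cumulative() call.
# Assumes VGAM's pneumo data; let = log exposure time.
library(VGAM)
pneumo <- transform(pneumo, let = log(exposure.time))
fit1 <- vglm(cbind(normal, mild, severe) ~ let,
             propodds(reverse = TRUE), data = pneumo)
fit2 <- vglm(cbind(normal, mild, severe) ~ let,
             cumulative(parallel = TRUE, reverse = TRUE), data = pneumo)
coef(fit1, matrix = TRUE)  # the two fits should coincide
```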
By the way, note that arguments such as parallel can handle a slightly more complex syntax.
A call such as parallel = TRUE ~ x2 + x5 - 1 means the parallelism assumption is only
applied to X2 and X5 . This might be equivalent to something like parallel = FALSE ~ x3
+ x4, i.e., to the remaining explanatory variables.
4. Other models
Given the VGLM/VGAM framework of Section 2 it is found that natural extensions are
readily proposed in several directions. This section describes some such extensions.
B_2 = C A⊤    (9)
and if the rank R is kept low then this can cut down the number of regression coefficients
dramatically. If R = 2 then the results may be biplotted (biplot() in VGAM). Here, C and
A are p2 × R and M × R respectively, and usually they are ‘thin’.
More generally, the class of reduced-rank VGLMs (RR-VGLMs) is simply a VGLM where B2
is expressed as a product of two thin estimated matrices (Table 2). Indeed, Yee and Hastie
(2003) show that RR-VGLMs are VGLMs with constraint matrices that are unknown and
estimated. Computationally, this is done using an alternating method: in (9) estimate A
given the current estimate of C, and then estimate C given the current estimate of A. This
alternating algorithm is repeated until convergence within each IRLS iteration.
Incidentally, special cases of RR-VGLMs have appeared in the literature. For example, an
RR-multinomial logit model is known as the stereotype model (Anderson 1984). Another
is Goodman (1981)'s RC model (see Section 4.2), which is a reduced-rank multivariate Poisson
model. Note that the parallelism assumption of the proportional odds model (McCullagh and
Nelder 1989) can be thought of as a type of reduced-rank regression where the constraint
matrices are thin (1_M, actually) and known.
The modeling function rrvglm() should work with any VGAM family function compatible
with vglm(). Of course, its applicability should be restricted to models where a reduced-rank
regression of B2 makes sense.
where µij = E(Yij ) is the mean of the i-j cell, and the rank R satisfies R < min(n, M ).
The modeling function grc() should work on any two-way table Y of counts generated by
(10) provided the number of 0’s is not too large. Its usage is quite simple, e.g., grc(Ymatrix,
Rank = 2) fits a rank-2 model to a matrix of counts. By default a Rank = 1 model is fitted.
where “Ti > Tj ” means i is preferred over j. Suppose that πi > 0. Let Yij be the number of
times that Ti is preferred over Tj in the nij comparisons of the pairs. Then Yij ∼ Bin(nij , pi/ij ).
This is a Bradley-Terry model (without ties), and the VGAM family function is brat().
Maximum likelihood estimation of the parameters π1 , . . . , πM +1 involves maximizing
∏_{i<j} ( n_ij choose y_ij ) [ π_i / (π_i + π_j) ]^{y_ij} [ π_j / (π_i + π_j) ]^{n_ij − y_ij},   1 ≤ i < j ≤ M + 1.
By default, π_{M+1} ≡ 1 is used for identifiability; however, this can be changed very easily.
Note that one can define linear predictors ηij of the form
logit( π_i / (π_i + π_j) ) = log( π_i / π_j ) = λ_i − λ_j.    (11)
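Because (11) is linear in the λ's, the no-ties model is an ordinary logistic regression on a ±1 design matrix, which allows a quick base-R check. The helper fit_bt and the win counts below are made up for illustration; they are not part of VGAM.

```r
# Sketch: fitting (11) by plain logistic regression; fit_bt and the
# example counts are hypothetical, not VGAM code.
fit_bt <- function(wins) {   # wins[i, j] = number of times i beat j
  K <- nrow(wins)
  X <- NULL; yy <- nn <- numeric(0)
  for (i in 1:(K - 1)) for (j in (i + 1):K) {
    xr <- numeric(K - 1)               # lambda_K is fixed at 0
    if (i < K) xr[i] <- 1
    if (j < K) xr[j] <- -1
    X  <- rbind(X, xr)
    yy <- c(yy, wins[i, j])
    nn <- c(nn, wins[i, j] + wins[j, i])
  }
  coef(glm(cbind(yy, nn - yy) ~ X - 1, family = binomial))
}

wins <- matrix(c(0, 7, 8,
                 3, 0, 6,
                 2, 4, 0), nrow = 3, byrow = TRUE)
fit_bt(wins)   # lambda_1 and lambda_2; team 1 is the strongest here
```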
Table 3: Some genetic models currently implemented and their unique parameters.
The VGAM framework can handle the Bradley-Terry model only for intercept-only models;
it has
λj = ηj = log πj = β(1)j , j = 1, . . . , M. (12)
As well as having many applications in the field of preferences, the Bradley-Terry model has
many uses in modeling ‘contests’ between teams i and j, where only one of the teams can
win in each contest (ties are not allowed under the classical model). The packaging function
Brat() can be used to convert a square matrix into one that has more columns, to serve as
input to vglm(). An example is journal citation data, where a citation of article B by article
A is a win for article B and a loss for article A. On a specific data set,
by default, where there are M competitors and π_M ≡ 1. As with brat(), one can choose a
different reference group and reference value.
Other R packages for the Bradley-Terry model include BradleyTerry2 by H. Turner and D.
Firth (with and without ties; Firth 2005, 2008) and prefmod (Hatzinger 2009).
Genotype      AA    AO    BB    BO    AB    OO
Probability   p^2   2pr   q^2   2qr   2pq   r^2
Blood group   A     A     B     B     AB    O

Table 4: Probability table for the ABO blood group system. Note that p and q are the
parameters and r = 1 − p − q.
For example the ABO blood group system has two independent parameters p and q, say.
Here, the blood groups A, B and O form six possible combinations (genotypes) consisting of
AA, AO, BB, BO, AB, OO (see Table 4). A and B are dominant over blood type O. Let p,
q and r be the probabilities for A, B and O respectively (so that p + q + r = 1) for a given
population. The log-likelihood function is

ℓ(p, q) = n_A log(p^2 + 2pr) + n_B log(q^2 + 2qr) + n_AB log(2pq) + 2 n_O log(1 − p − q).
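This log-likelihood can also be maximized directly with base R's optim(), which is a useful cross-check on the Fisher scoring fit. The phenotype counts below are made-up illustration values, not data from the text.

```r
# Sketch: direct maximization of l(p, q); the counts nA, nB, nAB, nO
# are made-up illustration values.
abo_loglik <- function(par, nA, nB, nAB, nO) {
  p <- par[1]; q <- par[2]; r <- 1 - p - q
  if (p <= 0 || q <= 0 || r <= 0) return(-1e10)   # outside the simplex
  nA  * log(p^2 + 2 * p * r) + nB * log(q^2 + 2 * q * r) +
  nAB * log(2 * p * q)       + 2 * nO * log(r)
}
mle <- optim(c(0.3, 0.1), abo_loglik, nA = 725, nB = 258, nAB = 72, nO = 1073,
             control = list(fnscale = -1))   # fnscale = -1: maximize
mle$par   # estimates of (p, q)
```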
The function Coef(), which applies only to intercept-only models, applies the inverse link
function g_j^{−1} to η̂_j to give θ̂_j, where g_j(θ_j) = η_j.
6. Examples
This section illustrates CDA modeling on three data sets in order to give a flavour of what is
available in the package.
R> head(marital.nz, 4)
R> summary(marital.nz)
R> head(depvar(fit.ms), 4)
R> colSums(depvar(fit.ms))
to produce Figure 1. The scale argument is used here to ensure that the y-axes have a
common scale—this makes comparisons between the component functions less susceptible to
misinterpretation. The first three plots are the (centered) fb(s)2 (x2 ) for η1 , η2 , η3 , where
(s, t) = (1, 1), (2, 3), (3, 4), and x_2 is age. The last plot shows the smooths overlaid to aid
comparison.
It may be seen that the ±2 standard error bands about the Widowed group are particularly wide
at young ages because of a paucity of data, and likewise at old ages amongst the Singles. The
fb(s)2 (x2 ) appear as one would expect. The log relative risk of being single relative to being
married/partnered drops sharply from ages 16 to 40. The fitted function for the Widowed group
increases with age and looks reasonably linear. The f̂_(1)2(x_2) suggests a possible maximum
around 50 years of age: this could indicate the greatest marital conflict occurs during the
mid-life crisis years!

Figure 1: Fitted (and centered) component functions f̂_(s)2(x_2) from the NZ marital status
data (see Equation 14). The bottom RHS plot shows the smooths overlaid.
The methods function for plot() can also plot the derivatives of the smooths. The call
Figure 2: Estimated first derivatives of the component functions, f̂′_(s)2(x_2), from the NZ
marital status data (see Equation 14).
Then
confirms that one term was used for each component function. The plots from
Figure 3: Parametric version of fit.ms: fit2.ms. The component functions are now
quadratic, piecewise quadratic/zero, or linear.
[1] 7.59
is small, so it seems the parametric model is quite reasonable against the original
nonparametric model. Specifically, the difference in the number of ‘parameters’ is approximately
[1] 3.151
[1] 0.0619
Figure 4: Fitted probabilities for each class for the NZ male European marital status data
(from Equation 14).
which gives Figure 4. This shows that between 80–90% of NZ white males aged between
their early 30s to mid-70s were married/partnered. The proportion widowed started to rise
steeply from 70 years onwards but remained below 0.5 since males die younger than females
on average.
R> # Scale the variables? Yes; the Anderson (1984) paper did (see his Table 6).
R> head(backPain, 4)
x1 x2 x3 pain
1 1 1 1 same
2 1 1 1 marked.improvement
3 1 1 1 complete.relief
4 1 2 1 same
R> summary(backPain)
x1 x2 x3 pain
Min. :1.00 Min. :1.00 Min. :1.00 worse : 5
1st Qu.:1.00 1st Qu.:2.00 1st Qu.:1.00 same :14
Median :2.00 Median :2.00 Median :1.00 slight.improvement :18
Mean :1.61 Mean :2.07 Mean :1.37 moderate.improvement:20
3rd Qu.:2.00 3rd Qu.:3.00 3rd Qu.:2.00 marked.improvement :28
Max. :2.00 Max. :3.00 Max. :2.00 complete.relief :16
R> backPain <- transform(backPain, sx1 = -scale(x1), sx2 = -scale(x2), sx3 = -scale(x3))
displays the six ordered categories. Now a rank-1 stereotype model can be fitted with
Then
R> Coef(bp.rrmlm1)
A matrix:
latvar
log(mu[,1]/mu[,6]) 1.0000
log(mu[,2]/mu[,6]) 0.3094
log(mu[,3]/mu[,6]) 0.3467
log(mu[,4]/mu[,6]) 0.5099
log(mu[,5]/mu[,6]) 0.1415
C matrix:
latvar
sx1 -2.628
sx2 -2.146
sx3 -1.314
B1 matrix:
log(mu[,1]/mu[,6]) log(mu[,2]/mu[,6]) log(mu[,3]/mu[,6])
(Intercept) -2.914 0.2945 0.5198
log(mu[,4]/mu[,6]) log(mu[,5]/mu[,6])
(Intercept) 0.3511 0.9026
are the fitted A, C and B_1 (see Equation 9 and Table 2), which agree with Anderson's Table 6.
Here, what is known as “corner constraints” is used ((1, 1) element of A ≡ 1), and only the
intercepts are not subject to any reduced-rank regression by default. The maximized log-likelihood
from logLik(bp.rrmlm1) is −151.55. The standard errors of each parameter can
be obtained by summary(bp.rrmlm1). The negative elements of Ĉ imply the latent variable
ν̂ decreases in value with increasing sx1, sx2 and sx3. The elements of Â tend to decrease,
so it suggests patients get worse as ν increases, i.e., get better as sx1, sx2 and sx3 increase.
A rank-2 model fitted with a different normalization

produces uncorrelated ν̂_i = Ĉ⊤ x_2i. In fact var(lv(bp.rrmlm2)) equals I_2 so that the latent
variables are also scaled to have unit variance. The fit was biplotted (rows of Ĉ plotted as
arrows; rows of Â plotted as labels) using

to give Figure 5. It is interpreted via inner products due to (9). The different normalization
means that the interpretation of ν_1 and ν_2 has changed, e.g., increasing sx1, sx2 and sx3
results in increasing ν̂_1 and patients improve more. Many of the latent variable points ν̂_i are
coincident due to the discrete nature of the x_i. The rows of Â are centered on the blue labels
(rather cluttered unfortunately) and do not seem to vary much as a function of ν_2. In fact
this is confirmed by Anderson (1984) who showed a rank-1 model is to be preferred.
This example demonstrates the ability to obtain a low dimensional view of higher dimen-
sional data. The package’s website has additional documentation including more detailed
Goodman’s RC and stereotype examples.
Figure 5: Biplot of a rank-2 reduced-rank multinomial logit (stereotype) model fitted to the
back pain data. A convex hull surrounds the latent variable scores ν̂_i (whose observation
numbers are obscured because of their discrete nature). The position of the jth row of Â is
the center of the label “log(mu[,j]/mu[,6])”.
The working weight matrices Wi may become large for categorical regression models. In
general, we have to evaluate the Wi for i = 1, . . . , n, and naively, this could be held in an
array of dimension c(M, M, n). However, since the Wi are symmetric positive-definite it
suffices to only store the upper or lower half of the matrix.
The variable wz in vglm.fit() stores the working weight matrices W_i in a special format
called the matrix-band format. This format comprises an n × M* matrix where

M* = Σ_{i=1}^{hbw} (M − i + 1) = hbw (2M − hbw + 1) / 2
is the number of columns. Here, hbw refers to the half-bandwidth of the matrix, which is an
integer between 1 and M inclusive. A diagonal matrix has unit half-bandwidth, a tridiagonal
matrix has half-bandwidth 2, etc.
Suppose M = 4. Then wz will have up to M* = 10 columns enumerating the unique elements
of W_i as follows:

        ⎡ 1  5  8  10 ⎤
W_i  =  ⎢    2  6   9 ⎥ .    (15)
        ⎢       3   7 ⎥
        ⎣           4 ⎦
That is, the order is firstly the diagonal, then the band above that, followed by the second
band above the diagonal etc. Why is such a format adopted? For this example, if Wi is
diagonal then only the first 4 columns of wz are needed. If W_i is tridiagonal then only the
first 7 columns of wz are needed. If W_i is banded then wz need not have all M(M + 1)/2
columns; only M* columns suffice, and the rest of the elements of W_i are implicitly zero.
As well as reducing the size of wz itself in most cases, the matrix-band format often makes
the computation of wz very simple and efficient. Furthermore, a Cholesky decomposition of
a banded matrix will be banded. A final reason is that sometimes we want to input Wi into
VGAM: if wz is M × M × n then vglm(..., weights = wz) will result in an error whereas
it will work if wz is an n × M ∗ matrix.
To facilitate the use of the matrix-band format, a few auxiliary functions have been written.
In particular, there is iam(), which gives the indices for an array-to-matrix mapping. In the
4 × 4 example above,
$row.index
[1] 1 2 3 4 1 2 3 1 2 1
$col.index
[1] 1 2 3 4 2 3 4 3 4 4
returns the indices for the respective array coordinates for successive columns of matrix-
band format (see Equation 15). If diag = FALSE then the first 4 elements in each vector
are omitted. Note that the first two arguments of iam() are not used here and have been
assigned NAs for simplicity. For its use on the multinomial logit model, where (W_i)_jj =
w_i μ_ij (1 − μ_ij), j = 1, . . . , M, and (W_i)_jk = −w_i μ_ij μ_ik, j ≠ k, this can be programmed
succinctly like
(the actual code is slightly more complicated). In general, VGAM family functions can be
remarkably compact, e.g., acat(), cratio() and multinomial() are all less than 120 lines
of code each.
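To make the enumeration concrete, here is a small base-R function, an illustration rather than the package's actual iam(), that reproduces the index vectors shown above:

```r
# Sketch of the matrix-band enumeration in (15): the diagonal first,
# then each successive superdiagonal. Not VGAM's actual iam().
iam_sketch <- function(M) {
  ri <- ci <- integer(0)
  for (d in 0:(M - 1)) {      # d = 0 is the diagonal, d = 1 the first band, ...
    ri <- c(ri, 1:(M - d))
    ci <- c(ci, 1:(M - d) + d)
  }
  list(row.index = ri, col.index = ci)
}
iam_sketch(4)   # matches the $row.index and $col.index shown above
```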
∂p_j(x) / ∂x = h′(η_j) β_j − h′(η_{j−1}) β_{j−1}.    (17)
The function margeff() returns an array with these derivatives and should handle any value
of reverse and parallel.
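For a cumulative-link model, where p_j(x) = h(η_j) − h(η_{j−1}), (17) can be checked numerically in a few lines of base R; the coefficient values below are made up for the check.

```r
# Numeric check of (17) for a cumulative logit model with M = 2 and one
# covariate: h = plogis, h' = dlogis; b1 and b2 are made-up coefficients.
b1 <- c(-1.0, 0.5)   # (intercept, slope) for eta_1
b2 <- c( 0.5, 0.8)   # (intercept, slope) for eta_2
p2 <- function(x) plogis(b2[1] + b2[2] * x) - plogis(b1[1] + b1[2] * x)  # P(Y = 2)
x0 <- 0.3
analytic <- dlogis(b2[1] + b2[2] * x0) * b2[2] - dlogis(b1[1] + b1[2] * x0) * b1[2]
numeric  <- (p2(x0 + 1e-6) - p2(x0 - 1e-6)) / 2e-6
c(analytic, numeric)   # the two derivatives agree closely
```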
log[ P(Y = j) / P(Y = M + 1) ] = β*_(j)1 + β*_(1)2 (x_i2j − x_i24) + β*_(1)3 (x_i3j − x_i34) + β*_(1)4 x_i4,    (18)
where, for the ith person, xi2j is the cost for the jth transport means, and xi3j is the journey
time of the jth transport means. The distance to get to work is xi4 ; it has the same value
regardless of the transport means.
Equation 18 implies H_1 = I_3 and H_2 = H_3 = H_4 = 1_3. Note also that if the last response
category is used as the baseline or reference group (the default of multinomial()) then
x_{ik,M+1} can be subtracted from x_{ikj} for j = 1, . . . , M; this is the natural way x_{ik,M+1} enters
into the model.
Recall from (2) that we had
η_j(x_i) = β_j⊤ x_i = Σ_{k=1}^{p} x_ik β_(j)k.    (19)
Often β**_j = β**, say. In (21) the variables in x*_i are common to all η_j, and the variables in
x*_ij have different values for differing η_j. This allows for covariate values that are specific to
each ηj , a facility which is very important in many applications.
The use of the xij argument with the VGAM family function multinomial() has very
important applications in economics. In that field the term “multinomial logit model” includes
a variety of models such as the “generalized logit model” where (19) holds, the “conditional
logit model” where (20) holds, and the “mixed logit model,” which is a combination of the
two, where (21) holds. The generalized logit model focuses on the individual as the unit of
analysis, and uses individual characteristics as explanatory variables, e.g., the age of the person
in the transport example. The conditional logit model assumes that explanatory variables take
different values for each alternative and that the impact of a unit of x_k is constant across
alternatives, e.g., journey time in the choice of transport mode. Unfortunately, there is confusion in the
literature for the terminology of the models. Some authors call multinomial() with (19) the
“generalized logit model”. Others call the mixed logit model the “multinomial logit model” and
view the generalized logit and conditional logit models as special cases. In VGAM terminology
there is no need to give different names to all these slightly differing special cases. They are all
still called multinomial logit models, although it may be added that there are some covariate-
specific linear/additive predictors. The important thing is that the framework accommodates
xij , so one tries to avoid making life unnecessarily complicated. And xij can apply in theory
to any VGLM and not just to the multinomial logit model. Imai et al. (2008) present another
perspective on the xij problem with illustrations from Zelig (Imai et al. 2009).
η(x_i) = Σ_{k=1}^{p} diag(x_ik1, . . . , x_ikM) H_k β*_k.    (23)
Each component of the list xij is a formula having M terms (ignoring the intercept) which
specifies the successive diagonal elements of the matrix X*_(ik). Thus each row of the constraint
matrix may be multiplied by a different vector of values. The constraint matrices themselves
are not affected by the xij argument.
How can one fit such models in VGAM? Let us fit (18). Suppose the journey cost and time
variables have had the cost and time of walking subtracted from them. Then, using “.trn”
to denote train,
should do the job. Here, the argument form2 is assigned a second S formula which is used
in some special circumstances or by certain types of VGAM family functions. The model
has H1 = I3 and H2 = H3 = H4 = 13 because the lack of parallelism only applies to the
intercept. However, unless Cost is the same as Cost.bus and Time is the same as Time.bus,
this model should not be plotted with plotvgam(); see the author’s homepage for further
documentation.
By the way, suppose β*_(1)4 in (18) is replaced by β*_(j)4. Then the above code but with
will not work because the basis functions for ns(Cost.bus), ns(Cost.trn) and ns(Cost.car)
are not identical since the knots differ. Consequently, they represent different functions despite
having common regression coefficients.
Fortunately, it is possible to force the ns() terms to have identical basis functions by using a
trick: combine the vectors temporarily. To do this, one can let
This computes a natural cubic B-spline evaluated at x but it uses the other arguments as well
to form an overall vector from which to obtain the (common) knots. Then the usage of NS()
can be something like
So NS(Cost.bus, Cost.trn, Cost.car) is the smooth term for Cost.bus, etc. Furthermore,
plotvgam() may be applied to fit4, in which case the fitted regression spline is plotted against
its first inner argument, viz. Cost.bus.
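The elided NS() definition can be sketched along the following lines (an assumption about its form, not necessarily the author's actual code): compute the knots and boundary knots from the pooled vector, then reuse them when evaluating the basis at x alone.

```r
# Hedged sketch of an NS() wrapper: the spline basis for x uses knots
# derived from c(x, ...), so calls sharing the same pooled vector
# produce identical basis functions.
library(splines)
NS <- function(x, ..., df = 3) {
  basis <- ns(c(x, ...), df = df)                 # knots from the pooled vector
  ns(x, knots = attr(basis, "knots"),
     Boundary.knots = attr(basis, "Boundary.knots"))
}
```

Then NS(Cost.bus, Cost.trn, Cost.car) and NS(Cost.trn, Cost.bus, Cost.car) evaluate the same basis functions at different covariates, which is exactly what common regression coefficients require.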
One reason why it will also predict correctly is “smart prediction” (Yee 2008).
Implementation details
The xij argument operates after the ordinary X_VLM matrix is created. Then selected columns
of X_VLM are modified using the constraint matrices and the xij and form2 arguments, that is,
using form2's model matrix X_F2 and the H_k. This whole operation is possible because X_VLM
remains structurally the same. The crucial equation is (24).
Other xij examples are given in the online help of fill() and vglm.control(), as well as
at the package’s webpage.
9. Discussion
This article has sought to convey how VGLMs/VGAMs are well suited for fitting regression
models for categorical data. The primary strength of the approach is its simple and unified
framework, which, when reflected in software, makes practical CDA more understandable and
efficient. Furthermore, there are natural extensions such as a reduced-rank variant and
covariate-specific η_j. The VGAM package potentially offers a wide selection of models and utilities.
There is much future work to do. Some useful additions to the package include:
1. Bias-reduction (Firth 1993) is a method for removing the O(n−1 ) bias from a maximum
likelihood estimate. For a substantial class of models including GLMs it can be for-
mulated in terms of a minor adjustment of the score vector within an IRLS algorithm
(Kosmidis and Firth 2009). One by-product, for logistic regression, is that while the
maximum likelihood estimate (MLE) can be infinite, the adjustment leads to estimates
that are always finite. At present the R package brglm (Kosmidis 2008) implements bias-
reduction for a number of models. Bias-reduction might be implemented by adding an
argument bred = FALSE, say, to some existing VGAM family functions.
4. For logistic regression, SAS’s proc logistic gives a warning if the data are completely separated or quasi-completely separated. Separation causes some regression coefficient estimates to tend to ±∞. With such data, all R implementations known to me give at best vague warnings, which is rather unacceptable (Allison 2004). The safeBinaryRegression package (Konis 2009) overloads glm() so that a check for the existence of the MLE is made before a binary response GLM is fitted.
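Items 1 and 4 are connected: Firth’s adjustment yields finite estimates precisely where separation makes the MLE infinite. Here is a minimal self-contained sketch in plain R of the score adjustment within Fisher scoring (my own illustrative function firth_logistic, not VGAM or brglm code): the raw score contribution y − p is replaced by y − p + h(1/2 − p), where h holds the hat-matrix diagonals.

```r
# Sketch of bias-reduced (Firth) logistic regression via an adjusted
# score inside Fisher scoring. X is the design matrix (with intercept
# column), y a 0/1 response vector.
firth_logistic <- function(X, y, iters = 100) {
  beta <- rep(0, ncol(X))
  for (it in seq_len(iters)) {
    p <- plogis(drop(X %*% beta))
    W <- p * (1 - p)
    XWhalf <- X * sqrt(W)
    I <- crossprod(XWhalf)                        # Fisher information
    h <- diag(XWhalf %*% solve(I, t(XWhalf)))     # hat-matrix diagonals
    U <- crossprod(X, y - p + h * (0.5 - p))      # Firth-adjusted score
    beta <- beta + solve(I, U)                    # scoring update
  }
  drop(beta)
}

# Completely separated data: the MLE of the slope is infinite, yet the
# bias-reduced estimate is finite.
X <- cbind(1, c(-2, -1, 1, 2))
y <- c(0, 0, 1, 1)
firth_logistic(X, y)
```

In VGAM, an analogous modification of the working residuals inside IRLS is what the hypothetical bred argument mentioned above would switch on.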
In closing, the VGAM package is continually being developed, so some future changes in implementation details and usage may occur, including non-backward-compatible ones (see the NEWS file). Further documentation and updates are available at the author’s homepage, whose URL is given in the DESCRIPTION file.
Acknowledgments
The author thanks Micah Altman, David Firth and Bill Venables for helpful conversations,
and Ioannis Kosmidis for a reprint. Thanks also to The Institute for Quantitative Social
Science at Harvard University for their hospitality while this document was written during a
sabbatical visit.
References
Agresti A (2002). Categorical Data Analysis. 2nd edition. John Wiley & Sons, New York,
USA.
Agresti A (2010). Analysis of Ordinal Categorical Data. 2nd edition. Wiley, Hoboken, NJ, USA.
Agresti A (2013). Categorical Data Analysis. 3rd edition. Wiley, Hoboken, NJ, USA.
Agresti A (2018). An Introduction to Categorical Data Analysis. 3rd edition. Wiley, New York, USA.
Allison P (2004). “Convergence Problems in Logistic Regression.” In M Altman, J Gill, MP McDonald (eds.), Numerical Issues in Statistical Computing for the Social Scientist, pp. 238–252. Wiley-Interscience, Hoboken, NJ, USA.
Anderson JA (1984). “Regression and Ordered Categorical Variables.” Journal of the Royal
Statistical Society B, 46(1), 1–30.
Buja A, Hastie T, Tibshirani R (1989). “Linear Smoothers and Additive Models.” The Annals
of Statistics, 17(2), 453–510.
Chambers JM, Hastie TJ (eds.) (1993). Statistical Models in S. Chapman & Hall, New York,
USA.
Firth D (1993). “Bias Reduction of Maximum Likelihood Estimates.” Biometrika, 80(1), 27–38.
Firth D (2005). “Bradley-Terry Models in R.” Journal of Statistical Software, 12(1), 1–12.
URL http://www.jstatsoft.org/v12/i01/.
Firth D (2008). BradleyTerry: Bradley-Terry Models. R package version 0.8-7, URL http:
//CRAN.R-project.org/package=BradleyTerry.
Fullerton AS, Xu J (2016). Ordered Regression Models: Parallel, Partial, and Non-Parallel
Alternatives. Chapman & Hall/CRC, Boca Raton, FL, USA.
Goodman LA (1981). “Association Models and Canonical Correlation in the Analysis of Cross-
classifications Having Ordered Categories.” Journal of the American Statistical Association,
76(374), 320–334.
Green PJ (1984). “Iteratively Reweighted Least Squares for Maximum Likelihood Estimation,
and Some Robust and Resistant Alternatives.” Journal of the Royal Statistical Society B,
46(2), 149–192.
Harrell, Jr FE (2016). rms: Regression Modeling Strategies. R package version 4.5-0, URL
http://CRAN.R-project.org/package=rms.
Hastie T (2008). gam: Generalized Additive Models. R package version 1.01, URL http:
//CRAN.R-project.org/package=gam.
Hastie T, Tibshirani R, Buja A (1994). “Flexible Discriminant Analysis by Optimal Scoring.”
Journal of the American Statistical Association, 89(428), 1255–1270.
Hastie TJ, Tibshirani RJ (1990). Generalized Additive Models. Chapman & Hall, London.
Hatzinger R (2009). prefmod: Utilities to Fit Paired Comparison Models for Preferences.
R package version 0.8-16, URL http://CRAN.R-project.org/package=prefmod.
Hensher DA, Rose JM, Greene WH (2015). Applied Choice Analysis. Second edition. Cam-
bridge University Press, Cambridge, U.K.
Imai K, King G, Lau O (2008). “Toward A Common Framework for Statistical Analysis and
Development.” Journal of Computational and Graphical Statistics, 17(4), 892–913.
Imai K, King G, Lau O (2009). Zelig: Everyone’s Statistical Software. R package version 3.4-
5, URL http://CRAN.R-project.org/package=Zelig.
Konis K (2009). safeBinaryRegression: Safe Binary Regression. R package version 0.1-2,
URL http://CRAN.R-project.org/package=safeBinaryRegression.
Kosmidis I (2008). brglm: Bias Reduction in Binary-Response GLMs. R package version 0.5-
4, URL http://CRAN.R-project.org/package=brglm.
Kosmidis I, Firth D (2009). “Bias Reduction in Exponential Family Nonlinear Models.”
Biometrika, 96(4), 793–804.
Lange K (2002). Mathematical and Statistical Methods for Genetic Analysis. 2nd edition.
Springer-Verlag, New York, USA.
Leonard T (2000). A Course in Categorical Data Analysis. Chapman & Hall/CRC, Boca
Raton, FL, USA.
Lindsey J (2007). gnlm: Generalized Nonlinear Regression Models. R package version 1.0,
URL http://popgen.unimaas.nl/~jlindsey/rcode.html.
Liu I, Agresti A (2005). “The Analysis of Ordered Categorical Data: An Overview and a Survey of Recent Developments.” Sociedad Estadística e Investigación Operativa Test, 14(1), 1–73.
Lloyd CJ (1999). Statistical Analysis of Categorical Data. John Wiley & Sons, New York,
USA.
Long JS (1997). Regression Models for Categorical and Limited Dependent Variables. Sage
Publications, Thousand Oaks, CA, USA.
MacMahon S, Norton R, Jackson R, Mackie MJ, Cheng A, Vander Hoorn S, Milne A, McCul-
loch A (1995). “Fletcher Challenge-University of Auckland Heart & Health Study: Design
and Baseline Findings.” New Zealand Medical Journal, 108, 499–502.
McCullagh P, Nelder JA (1989). Generalized Linear Models. 2nd edition. Chapman & Hall,
London.
Meyer D, Zeileis A, Hornik K (2009). vcd: Visualizing Categorical Data. R package version 1.2-
7, URL http://CRAN.R-project.org/package=vcd.
Nelder JA, Wedderburn RWM (1972). “Generalized Linear Models.” Journal of the Royal
Statistical Society A, 135(3), 370–384.
Peterson B (1990). “Letter to the Editor: Ordinal Regression Models for Epidemiologic Data.”
American Journal of Epidemiology, 131, 745–746.
Peterson B, Harrell FE (1990). “Partial Proportional Odds Models for Ordinal Response
Variables.” Applied Statistics, 39(2), 205–217.
R Development Core Team (2009). R: A Language and Environment for Statistical Computing.
R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http:
//www.R-project.org/.
SAS Institute Inc (2003). The SAS System, Version 9.1. Cary, NC. URL http://www.sas.
com/.
Smithson M, Merkle EC (2013). Generalized Linear Models for Categorical and Continuous
Limited Dependent Variables. Chapman & Hall/CRC, London.
Stokes ME, Davis CS, Koch GG (2000). Categorical Data Analysis Using the SAS System. 2nd edition. SAS Institute Inc., Cary, NC, USA.
Turner H, Firth D (2007). “gnm: A Package for Generalized Nonlinear Models.” R News,
7(2), 8–12. URL http://CRAN.R-project.org/doc/Rnews/.
Tutz G (2012). Regression for Categorical Data. Cambridge University Press, Cambridge.
Venables WN, Ripley BD (2002). Modern Applied Statistics with S. 4th edition. Springer-
Verlag, New York. URL http://www.stats.ox.ac.uk/pub/MASS4/.
Wand MP, Ormerod JT (2008). “On Semiparametric Regression with O’Sullivan Penalized
Splines.” The Australian and New Zealand Journal of Statistics, 50(2), 179–198.
Weir BS (1996). Genetic Data Analysis II: Methods for Discrete Population Genetic Data.
Sinauer Associates, Inc., Sunderland, MA, USA.
Wild CJ, Yee TW (1996). “Additive Extensions to Generalized Estimating Equation Meth-
ods.” Journal of the Royal Statistical Society B, 58(4), 711–725.
Yee TW (2008). “The VGAM Package.” R News, 8(2), 28–39. URL http://CRAN.R-project.
org/doc/Rnews/.
Yee TW (2010a). “The VGAM Package for Categorical Data Analysis.” Journal of Statistical
Software, 32(10), 1–34. URL http://www.jstatsoft.org/v32/i10/.
Yee TW (2010b). VGAM: Vector Generalized Linear and Additive Models. R package
version 0.7-10, URL http://CRAN.R-project.org/package=VGAM.
Yee TW, Hastie TJ (2003). “Reduced-rank Vector Generalized Linear Models.” Statistical
Modelling, 3(1), 15–41.
Yee TW, Stephenson AG (2007). “Vector Generalized Linear and Additive Extreme Value
Models.” Extremes, 10(1–2), 1–19.
Yee TW, Wild CJ (1996). “Vector Generalized Additive Models.” Journal of the Royal
Statistical Society B, 58(3), 481–493.
Affiliation:
Thomas W. Yee
Department of Statistics
University of Auckland, Private Bag 92019
Auckland Mail Centre
Auckland 1142, New Zealand
E-mail: [email protected]
URL: http://www.stat.auckland.ac.nz/~yee/