The Theory and Practice of Item Response Theory
R. J. de Ayala
One could make a case that item response theory (IRT) is the most important statistical
method about which most of us know little or nothing. The seeds of IRT lie in the psycho-
metric tradition of classical test theory (CTT). In that theory an observed score is presumed
to be a function of true score plus a random error. Frederic Lord and others realized in the
1950s that CTT did not work in the testing of ability, because it presumes a linear model
in which there is a single factor. However, when items for standardized tests were factor
analyzed, “difficulty” factors emerged in which the easy and hard items formed separate
factors. Also, in CTT an item’s mean and error variance were independent, whereas with
standardized tests very easy and very difficult items have less variance than do moderately
difficult items. After some attempt to rescue CTT, Lord and others realized that a completely
new perspective on measurement was required; the method that emerged from these efforts
was IRT.
IRT was rapidly embraced by test manufacturers as a much more realistic model than
CTT. Additionally, worries began to emerge that standardized tests were unfair to certain
groups, and IRT provided a principled way to determine if items were “fair.”
However, IRT was not embraced by everyday researchers. There were several different
reasons for this. First, IRT requires sample sizes too large for most research projects. Sec-
ond, IRT is much more complicated than CTT, and researchers have been intimidated by
the mathematics. Third, readily accessible software for estimating the parameters of IRT models has not been available.
There have also been social reasons why most of us know so little about IRT. IRT
experts have seemed more interested in communicating with each other than with the gen-
eral research community. In this book, we have an expert in the field of IRT who is reaching
out to explain IRT to us. IRT requires radically rethinking measurement in a model that is
not linear. The familiar concepts of “error” and “reliability” do not appear in standard IRT
models. Therefore, the reader of this book needs to work hard to learn not only a new lan-
guage but also a new way of thinking about measurement. We need to realize, as Lord did
in the 1950s, that for many problems of measurement, IRT, not CTT, is the way to approach
the problem. This book can aid us with that rethinking.
David A. Kenny
Preface
Item response theory (IRT) methodology forms the foundation of many psychometric appli-
cations. For instance, IRT is used for equating alternate examination forms, for instrument
design and development, for item banking, for computerized adaptive testing, for certifica-
tion testing, and in various other psychological assessment domains. Well-trained psycho-
metricians are increasingly expected to understand and be able to use IRT.
In this book we address the “how to” of applying IRT models while at the same time
providing enough technical substance to answer the “why” questions. We make extensive
use of endnotes and appendices to address some technical issues, and the reader is referred
to appropriate sources for even greater technical depth than is provided here. To facilitate
understanding the application of the models, we use common data sets across the chapters.
In addition, the exemplary model applications employ several common software packages.
Some of these are free (BIGSTEPS, NOHARM), while others are commercially available
(BILOG-MG, MULTILOG, PARSCALE). The terminology used in this book is more general
than is typically seen in IRT textbooks and is not specific to educational measurement. The
reader is assumed to be familiar with common psychometric concepts such as reliability,
validity, scaling, levels of measurement, and factor analysis, as well as regression analysis.
Chapter 1 “begins at the beginning” by presenting the basic concept of why we are
interested in performing measurement. There is a short philosophical treatment of measure-
ment, pointing out the desirable characteristics that we would like measurement to have.
The traditional psychometric concepts of levels of measurement, reliability, validity, and
various approaches to measurement, such as IRT, latent class analysis, and classical test
theory, are woven into this introduction.
Chapter 2 introduces the simplest IRT model, the Rasch or one-parameter logistic
(1PL) model. This model establishes the principles that underlie the more complex models
discussed in subsequent chapters. The chapter also introduces the empirical data set that is
used in the next four chapters and discusses philosophical differences between Rasch and
non-Rasch models. Chapters 3 and 4 conceptually present two different parameter estima-
tion techniques, applied to the data set introduced in Chapter 2. Chapter 3’s application
begins by examining the unidimensionality assumption through nonlinear factor analysis
and proceeds to item parameter estimation and interpretation. In Chapter 4 we reanalyze our
data using a different estimation technique and a different program. In addition, we present
alternative item- and model-level fit analyses by using statistical and graphical methods.
Chapter 5 presents the two-parameter logistic (2PL) model. Building on the 1PL model
presentation, the chapter is concerned only with the features that are unique to this model.
Additional fit analysis methods are introduced in this chapter in our reanalysis of the data
used with the 1PL model. Similarly, Chapter 6 introduces the characteristics specific to the
three-parameter logistic model and reanalyzes the data used in the 1PL and 2PL model
examples. By the end of Chapter 6 our fit analysis has evolved to include both item- and
model-level fit analyses, model comparison statistics, and person fit analysis. The chapter
also examines IRT’s assumptions regarding functional form and conditional independence.
As such, through these chapters we demonstrate the steps of model and data fit that would
normally be used in practice.
Because data are not always dichotomous, we next discuss models that are appropriate
for polytomous data (e.g., data from Likert response scales) in Chapters 7–9. The models
are divided between those for ordered and those for unordered polytomous data. Chapters
7 and 8 address ordered polytomous data from both Rasch and non-Rasch perspectives
(i.e., Rasch: partial credit and rating scale models; non-Rasch: generalized partial credit and
graded response models). As in other parts of the book, common data sets are used across
these chapters. Modeling unordered polytomous data as well as simultaneously using mul-
tiple IRT models are addressed in Chapter 9.
All of the models presented to this point assume that an individual’s responses are a
function of a single-person latent variable. In Chapter 10 we generalize the models pre-
sented in Chapters 5 and 6 to multiple latent variables.
As mentioned, IRT is used for equating multiple forms and in the creation of item
banks. Techniques for accomplishing both of these purposes are presented in Chapter 11.
Our final addition to our model–data fit toolbox is provided in Chapter 12, where we are
concerned with whether an item functions differently for different groups of people. Stated
another way, do factors (e.g., the respondents’ gender) that are tangential to the construct
an item is supposed to be measuring affect how people respond to the item? If so, then the
item may be biased against those individuals. Techniques for identifying items with such
discrepancies are presented in this, the last chapter.
I would like to acknowledge that this book has been influenced by my interactions with
various individuals over the past two decades. Barbara Dodd, Bill Koch, Earl Jennings, Chan
Dayton, Bill Hays, Frank Baker, Mark Reckase, Dave Weiss, and Seock-Ho Kim are some
of them. My apologies to anyone whom I have omitted. I would also like to acknowledge
the graduate students in psychometrics I have had the pleasure of teaching. Thanks also to
Bruno Zumbo, Department of Education, University of British Columbia, who reviewed the
book. I am appreciative of the support and patience of C. Deborah Laughton and David A.
Kenny, whose unenviable task was to edit earlier drafts of the manuscript. Finally, I would
like to thank Dorothy Hochreich for her sage advice many years ago. Additional output and
the data sets are available at the author’s website: http://cehs.unl.edu/EdPsych/RJSite/home
Contents
1 • Introduction to Measurement
Measurement
Some Measurement Issues
Item Response Theory
Classical Test Theory
Latent Class Analysis
Summary
References
Indices
F number of factors f = 1, ... , F
L number of items j = 1, ... , L
N number of persons i = 1, ... , N
m number of response categories k = 1, ... , m
m number of operations, thresholds, boundaries k = 1, ... , m or k = 0, ... , m
R number of quadrature points r = 1, ... , R
S number of cognitive operations s = 1, ... , S
x response vector
xj dichotomous response or category score on item j
xjk polytomous response on item j’s kth category
ν index for latent classes (nu) ν = 1, ... , G
πν latent class ν proportion
Symbols
w eigenvector
λ eigenvalue (lambda)
Σ variance/covariance matrix
p probability
h² communality
a item loading
υ population parameters vector; MMLE (upsilon)
ϑ item parameter matrix (capital alternative form of phi)
Γ DIF, group variable (capital gamma)
Λ DIF, person location (e.g., θ or Xi) (capital lambda)
Parameters
θ person location (theta)
α item discrimination (alpha)
τ threshold (tau)
ηs elementary component s
σe(θ̂) standard error of person location; se(θ̂) sample standard error
σe(δ̂) standard error of item location; se(δ̂) sample standard error
σmeas standard error of measurement
The estimate of a parameter is represented with a circumflex. For example, the estimate of the
parameter θ is symbolized as θ̂ , the estimate of the parameter α is represented as â , etc.
Transformation Equations
ξ* = ξ(ζ) + κ
γ* = γ – α(µ)/σ
α* = α/σ
Acronyms
CTT classical test theory
df degrees of freedom
DIF differential item functioning
EAP expected a posteriori
GUI graphical user interface
IRF item response function
IRS item response surface
IRT item response theory (latent trait theory)
JMLE joint maximum likelihood estimation (also called unconditional maximum likelihood estimation)
LCA latent class analysis
LSA latent structure analysis
MAP maximum a posteriori
MINF multidimensional information
MIRT multidimensional item response theory
MMLE marginal maximum likelihood estimation
SD standard deviation
SEE standard error of estimate
TCC test characteristic curve or total characteristic curve
TCF total characteristic function
Introduction to Measurement
I often say that when you can measure what you are speaking about and
express it in numbers you know something about it; but when you cannot
measure it, when you cannot express it in numbers, your knowledge is of
a meagre and unsatisfactory kind: it may be the beginning of knowledge,
but you have scarcely, in your thoughts, advanced to the state of science,
whatever the matter may be.
—Sir William Thomson (Lord Kelvin) (1891, p. 80)
This book is about a particular measurement perspective called item response theory (IRT),
latent trait theory, or item characteristic curve theory. To understand this measurement per-
spective, we need to address what we mean by the concept of measurement. Measurement
can be defined in many different ways. A classic definition is that measurement is “the
assignment of numerals to objects or events according to rules. The fact that numerals
can be assigned under different rules leads to different kinds of scales and different kinds
of measurement” (Stevens, 1946, p. 677). Although commonly used in introductory mea-
surement and statistics texts, this definition reflects a rather limited view. Measurement is
more than just the assignment of numbers according to rules (i.e., labeling); it is a process
by which an attempt is made to understand the nature of a variable (cf. Bridgman, 1928).
Moreover, whether the process results in numeric values with inherent properties or the
identification of different classes depends on whether we conceptualize the variable of inter-
est as continuous or categorical. IRT provides one particular mathematical technique for
performing measurement in which the variable is considered to be continuous in nature.
Measurement
Because anxiety involves feelings, it is not possible to directly observe anxiety. As such, anxiety is an
unobservable or latent variable or construct.
The measurement process involves deciding whether our latent variable, anxiety, should
be conceptualized as categorical, continuous, or both. In the categorical case we would clas-
sify individuals into qualitatively different latent groups so that, for example, one group may
be interpreted as representing individuals with incapacitating anxiety and another group
representing individuals without anxiety. In this conceptualization the persons differ from
one another in kind on the latent variable. Typically, these latent categories are referred to as
latent classes. Alternatively, anxiety could be conceptualized as continuous. From this per-
spective, individuals differ from one another in their quantity of the latent variable. Thus, we
might label the ends of the latent continuum as, say, “high anxiety” and “low anxiety.” When
the latent variable is conceptualized as having categorical and continuous facets, then we
have a combination of one or more latent classes and one or more latent continua. In this
case, the latent classes are subpopulations that are homogeneous with respect to the variable
of interest, but differ from one another in kind. Within each of these classes there is a latent
continuum on which the individuals within the class may be located. For example, assume
that our sample of respondents consists of two classes. One class could consist of indi-
viduals whose anxiety is so severe that they suffer from incapacitating attacks of terror. As
such, these individuals are so qualitatively different from other persons that they need to be
addressed separately from those whose anxiety is not so severe. Therefore, the second class
contains individuals who do not suffer from incapacitating attacks of terror. Within each of
these classes we have a latent continuum on which we locate the class’s respondents.
Although we cannot observe our latent variable, its existence may be inferred from
behavioral manifestations or manifest variables (e.g., restlessness, sleeping difficulties, head-
aches, trembling, muscle tension, item responses, self-reports). These manifestations allow
for several different approaches to measuring generalized anxiety. For example, one approach
may involve physiological assessment via an electromyogram of the degree of muscle ten-
sion. Other approaches might involve recording the number of hours spent sleeping or the
frequency and duration of headaches, using a galvanic skin response (GSR) feedback device
to assess sweat gland activity, or more psychological approaches, such as asking a series
of questions. These approaches, either individually or collectively, provide our operational
definition of generalized anxiety (Bridgman, 1928). That is, our operational definition speci-
fies how we go about collecting our observations (i.e., the latent variable’s manifestations).
Stated concisely, our interest is in our latent variable and its operational definition is a means
to that end.
The measurement process, so far, has involved our conceptualization of the latent vari-
able’s nature and its operational definition. We also need to decide on the correspondence
between our observations of the individuals’ anxiety levels and their locations on the con-
tinuum and/or in a class. In general, scaling is the process of establishing the correspon-
dence between the observation data and the persons’ locations on the latent variable. Once
we have our individuals located on the latent variable, we can then compare them to one
another. IRT is one approach to establishing this correspondence between the observation
data and the persons’ locations on the latent variable. Examples of other relevant scaling
processes are Guttman Scalogram analysis (Guttman, 1950), Coombs Unfolding (Coombs,
1950), and the various Thurstone approaches (Thurstone, 1925, 1928, 1938). Alternative
scaling approaches may be found in Dunn-Rankin, Knezek, Wallace, and Zhang (2004),
Gulliksen (1987), Maranell (1974), and Nunnally and Bernstein (1994).
Some Measurement Issues
Before proceeding to discuss various latent variable methods for scaling our observations,
we need to discuss four issues. The first issue involves the consistency of the measures. By way
of analogy, assume that we are measuring the length of a box. If our repeated measurements
of the length of the box were constant, then these measurements would be considered to
be highly consistent or to have high reliability. However, if these repeated measurements
varied wildly from one another, then they would be considered to have low consistency or
to have low reliability. In the former case, our measurements would have a small amount
of error, whereas in the latter they would have a comparatively larger amount of error. The
consistency (or lack thereof) would affect our confidence in the measurements. That is, in
the first scenario we would have greater confidence in our measurements than in the second
scenario.
The second issue concerns the validity of the measures. Although there are various
types of validity, we define validity as the degree to which our measures are actually mani-
festations of the latent variable. As a contrarian example, assume we use the “frequency
and duration of headaches” approach for measuring anxiety. Although some persons may
recognize that there might be a relationship between “frequency and duration of headaches”
and anxiety level, they may not consider this approach, in and of itself, to be an accurate
“representation” of anxiety. In short, simply because we make a measure does not mean that
the measure necessarily results in an accurate reflection of the variable of theoretical interest
(i.e., our measurements may or may not have validity). A necessary, but not sufficient condi-
tion for our measurements to have validity is that they possess a high degree of reliability.
Therefore, it is necessary to be concerned not only with the consistency of our measure-
ments, but also with their validity. Obtaining validity evidence is part of the measurement
process.
The third issue concerns a desirable property we would like our measurements to pos-
sess. Thurstone (1928) noted that a measuring instrument must not be seriously affected
in its measuring function by the object of measurement. In other words, we would like our
measurement instrument to be independent of what it is we are measuring. If this is true,
then the instrument possesses the property of invariance. For instance, if we measure the
size of a shoe box by using a meter stick, then the measurement instrument (i.e., the meter
stick) is not affected by and is independent of which box is measured. Contrast this with
the situation in which measuring a shoe box’s size is done not by using a meter stick, but
by stretching a string along the shortest dimension of the box and cutting the string so
that its length equals the shortest dimension. This string would serve as our measurement
instrument and we would use it to measure the other two dimensions of the box. In short,
the measurements would be multiples of the shortest dimension. Then suppose we use this
approach to measure a cereal box. That is, for the cereal box its shortest dimension is used
4 THE THEORY AND PRACTICE OF ITEM RESPONSE THEORY
to define the measurement instrument. Obviously, the box we are measuring affects our
measurement instrument and our measurements would not possess the invariance property.
Without invariance our comparisons across different boxes would have limited utility.
The final issue we present brings us back to the classic definition of measurement men-
tioned above. Depending on which approach we use to measure anxiety (i.e., GSR, dura-
tion of headache, item responses, etc.), the measurements have certain inherent properties
that affect how we interpret their information. For instance, the “duration of headaches”
approach produces measurements that cannot be negative and that allow us to make com-
parative statements among people as well as to determine whether a person has a headache.
These properties are a reflection of the fact that the measurements have not only a constant
unit, but also an (absolute) zero point that reflects the absence of what is being measured. Invoking Stevens’s (1946) levels of measurement taxonomy or Coombs’s (1974) taxonomy, these numbers would reflect a ratio scale.
In contrast, if we use a GSR device for measuring anxiety we would need to establish a
baseline or a zero point by canceling out an individual’s normal skin resistance static level
before we measure the person’s GSR. As a result, and unlike that of the ratio scale, this zero
point is not an absolute zero, but rather a relative one. However, all of our measurements
would still have a constant unit and would be considered to be on an interval scale. Another
approach to measuring anxiety is to ask an individual to rate his or her anxiety in terms of
severity. This ratings approach would produce numbers that are on an ordinal scale. These
approaches allow us to make comparative statements, such as “This person’s anxiety level is
greater than (or less than) that of another,” or in the case of the ratio scale, “This person’s anxiety level is half as severe as that person’s anxiety level.” Alternatively, if our question
simply requires the respondent to reply “yes,” he or she is experiencing a symptom, or “no,”
he or she is not, then the “yes/no” responses would reflect a nominal scale. These various
scenarios show that how we interpret and use our data needs to take into account the differ-
ent types of information that the observations carry.
In the following discussion we present three approaches for establishing a correspon-
dence between our observations and our latent variable. We begin by briefly introducing
IRT, followed by classical test theory (CTT). Both of these approaches assume that the latent
variable is continuous. The last approach discussed, latent class analysis (LCA), is appropri-
ate for categorical latent variables. Appendix E, “Mixture Models,” addresses the situation
when a latent variable is conceptualized as having categorical and continuous facets.
Item Response Theory
“Theory” is used here in the sense that it is a paradigm that attempts to explain all the facts
with which it can be confronted (Kuhn, 1970, p. 18). IRT is, in effect, a system of models
that defines one way of establishing the correspondence between latent variables and their
manifestations. It is not a theory in the traditional sense because it does not explain why a
person provides a particular response to an item or how the person decides what to answer
(cf. Falmagne, 1989). Instead, IRT is like the theory of statistical estimation. IRT uses latent
characterizations of individuals and items as predictors of observed responses. Although
some researchers (e.g., Embretson, 1984; Fischer & Formann, 1982) have attempted to
use item characteristics to explain why an item is located at a particular point, for the
most part, IRT like other scaling methods (e.g., Guttman Scalogram, Coombs Unfolding)
treats the individual as a black box. (See Appendix E, “Linear Logistic Test Model [LLTM],”
for a brief presentation of one of these explanatory approaches, as well as De Boeck &
Wilson [2004] for alternative approaches.) The cognitive processes used by an individual
to respond to an item are not modeled in the commonly used IRT models. In short, this
approach is analogous to measuring the speed of an automobile without understanding how
an automobile moves.1
In IRT persons and items are located on the same continuum. Most IRT models assume
that the latent variable is represented by a unidimensional continuum. In addition, for an
item to have any utility it must be able to differentiate among persons located at different
points along a continuum. An item’s capacity to differentiate among persons reduces our
uncertainty about their locations. This capacity to differentiate among people with different
locations may be held constant or allowed to vary across an instrument’s items. Therefore,
individuals are characterized in terms of their locations on the latent variable and, at a mini-
mum, items are characterized with respect to their locations and capacity to discriminate
among persons. The gist of IRT is the (logistic or multinomial) regression of observed item
responses on the persons’ locations on the latent variable and the item’s latent characteriza-
tions.
Classical Test Theory
Like IRT, classical test theory (CTT) or true score theory also assumes that the latent variable
is continuous. CTT is the approach that most readers have been exposed to throughout their
education. In contrast to IRT in which the item is the unit of focus, in CTT the respondent’s
observed score on a whole instrument is the unit of focus. The individual’s observed score,
X, is (typically) the unweighted sum of the person’s responses to an instrument’s items. In
ability or achievement assessment this sum reflects the number of correct responses.
CTT is based on the true score model. This model relates the individual’s observed
score to his or her location on the latent variable. To understand this model, assume that an
individual is administered an instrument an infinite independent number of times. On each
of these administrations we calculate the individual’s observed score. The mean of the infi-
nite number of observed scores is the expectation of the observed scores (i.e., µi = E(Xi)).
On any given administration of the instrument the person’s observed score will not exactly
agree with the mean, µ, of the observed scores. This difference between the observed score
and the mean is considered to be error. Symbolically, we may write the relationship between
person i’s observed score, the expectation, and error as
Xi = µi + Εi (1.1)
where Εi is the error score or the error of measurement (i.e., Εi = Xi – µi); Ε is the capital Greek
letter epsilon. Equation 1.1 is known as the true score model. In words, this model states
that person i’s observed performance on an instrument is a function of his or her expected
performance on the instrument plus error. Given that the error scores are considered to be
random and that µi = E(Xi), then it follows that the mean error for an individual across the
infinite number of independent administrations of the instrument is zero.
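To make Equation 1.1 concrete, the following minimal Python sketch simulates a large number of independent administrations of an instrument to a single person. The trait score of 32 and the normal error distribution are purely illustrative assumptions and do not come from any data set used in this book.

```python
import numpy as np

# Illustrative sketch of Equation 1.1: X_i = mu_i + E_i (all values are assumptions).
rng = np.random.default_rng(0)

mu_i = 32.0                   # person i's trait score, i.e., E(X_i)
n_administrations = 100_000   # stand-in for "an infinite number" of administrations

errors = rng.normal(loc=0.0, scale=3.0, size=n_administrations)  # E_i, random error
observed = mu_i + errors                                          # X_i, observed scores

print(observed.mean())           # approximately 32.0: the mean observed score recovers mu_i
print((observed - mu_i).mean())  # approximately 0.0: the mean error across administrations
```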
By convention µi is typically represented by the Latin or Roman letter T. However, to be
consistent with our use of Greek letters to symbolize parameters, we use the capital Greek
letter tau, Τ. The symbol Τ represents the person’s true score (i.e., Τi = E(Xi)). The term true
score should not be interpreted as indicating truth in any way. As such, true score is a mis-
nomer. To avoid this possible misinterpretation, we refer to Τi (or µi) as individual i’s trait
score.2 A person’s trait score represents his or her location on the latent variable of interest
and is fixed for an individual and instrument. The common representation of the model in
Equation 1.1 is
Xi = Τi + Εi (1.2)
Although Equation 1.1 may be considered more informative than Equation 1.2, we follow
the convention of using Τ for the trait score.
There is a functional relationship between the IRT person latent trait (θ) and the CTT
person trait characterization. This relationship is based on the assumption of parallel forms
for an instrument. That is, each item has the same response function on all the forms. Fol-
lowing Lord and Novick (1968), assume that we administer an infinite number of indepen-
dent parallel forms of an instrument to an individual. Then the expected proportion of 1s or
expected trait score, EΤ, across these parallel forms is equal to the average probability of a
response of 1 on the instrument, given the person’s latent trait and an IRT model. As a con-
sequence, the IRT θ is the same as the expected proportion EΤ except for the difference in
their scales of measurement. That is, θ has a range of –∞ to ∞, whereas for EΤ the range is
0 to 1. The expected trait score EΤ is related to the IRT latent trait by a monotonic increasing
transformation. This transformation is discussed in Chapters 4 and 10.
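The following Python sketch illustrates this monotonic relationship. The logistic response function and the five item locations used here are illustrative assumptions (the logistic form is developed in Chapter 2); the point is simply that the expected proportion of 1s is bounded by 0 and 1 and increases with θ.

```python
import numpy as np

def irf(theta, delta):
    """A simple logistic item response function (an assumed form; see Chapter 2)."""
    return 1.0 / (1.0 + np.exp(-(theta - delta)))

item_locations = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])  # hypothetical five-item instrument

for theta in (-3.0, -1.0, 0.0, 1.0, 3.0):
    # Expected trait score: the average probability of a response of 1 across the items.
    expected_tau = irf(theta, item_locations).mean()
    print(f"theta = {theta:+.1f}  ->  expected proportion = {expected_tau:.3f}")
# The expected proportion increases monotonically with theta but is bounded by 0 and 1,
# whereas theta itself ranges from -infinity to infinity.
```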
In addition to the true score model, CTT is based on a set of assumptions. These
assumptions are that, in the population, (1) the errors are uncorrelated with the trait scores
for an instrument, (2) the errors on one instrument are uncorrelated with the trait scores on
a different instrument, and (3) the errors on one instrument are uncorrelated with the error
scores on a different instrument. These assumptions are considered to be “weak” assump-
tions because they are likely to be met by the data. In contrast, IRT is based on “strong”
assumptions.3 These IRT assumptions are discussed in the following chapter.
These CTT assumptions and the model given in Equation 1.1 (or Equation 1.2) form
the basis of the psychometric concept of reliability and the validity coefficient. For example,
the correlation of the observed scores on an instrument and the corresponding trait scores
is the index of reliability for the instrument. Moreover, using the variances of the trait scores
(σ²T) and observed scores (σ²X) we can obtain the population reliability of an instrument’s scores:
ρXX′ = σ²T / σ²X    (1.3)
Because trait score variance is unknown, we can only estimate ρXX′. Some of the traditional
approaches for estimating reliability are KR-20, KR-21, and coefficient alpha. An assessment
of the variance of the errors of measurement in any set of observed scores may be obtained
by substituting s²T = s²X – s²E into Equation 1.3 to get
s²E = s²X(1 − ρXX′)    (1.4)
The square root of s²E (i.e., sE) is referred to as the standard error of measurement. The
standard error of measurement is the standard deviation of the errors of measurement
associated with the observed scores for a particular group of respondents.
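As a small numerical sketch of Equations 1.3 and 1.4, consider the following Python fragment. The variance and reliability values are made-up assumptions used only to show the arithmetic.

```python
import math

# Equation 1.3: reliability as the ratio of trait score variance to observed score variance.
var_trait = 80.0       # sigma^2_T (assumed value)
var_observed = 100.0   # sigma^2_X (assumed value)
reliability = var_trait / var_observed
print(reliability)     # 0.80

# Equation 1.4: error variance and the standard error of measurement, using a sample
# observed-score variance and an estimated reliability (e.g., coefficient alpha).
s2_x = 100.0
rho_hat = 0.80
s2_e = s2_x * (1.0 - rho_hat)
print(s2_e)            # 20.0
print(math.sqrt(s2_e)) # about 4.47, the standard error of measurement
```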
From the foregoing it should be clear that because an individual’s trait score is latent
and unknown, then the error associated with an observed score is also unknown. Therefore,
Equations 1.1 and 1.2 have two unknown quantities, an individual’s trait score and error
score. Lord (1980) points out that the model given in Equations 1.1 or 1.2 cannot be dis-
proved by any set of data. As a result, one difference between IRT and CTT is that with IRT
we can engage in model–data fit analysis, whereas in CTT we do not examine model–data
fit and simply assume the model to be true.
As may be obvious, the observed score X is influenced by the instrument’s character-
istics. For example, assume a proficiency testing situation. An easy test administered an
infinite number of independent times to an individual will yield a different value of Τi than
a difficult test administered an infinite number of independent times to the same individual.
This is analogous to the example of measuring the shoe and cereal boxes by using the short-
est dimension of each box. In short, as is the case with the box example, in CTT person
measurement is dependent on the instrument’s characteristics. Moreover, because the vari-
ance of the sample’s observed scores appears in both Equations 1.3 and 1.4, one may deduce
that the heterogeneity (or lack thereof) of the observed scores affects both reliability and
the standard error of measurement. In addition, the quantities in Equations 1.3 and 1.4 cannot be considered to be solely properties of the instrument, but rather also reflect the sample’s characteristics.
In short, the instrument’s characteristics affect the person scores and sample characteristics
affect the quantitative indices of the instrument (e.g., item difficulty and discrimination,
reliability, etc.). Thus, Thurstone’s (1928) idea of invariance does not exist in CTT. In con-
trast, with IRT it is possible to have invariance of both person and item characterizations.
See Appendix E, “Dependency in Traditional Analysis Statistics and Observed Scores,” for
a demonstration of this lack of invariance with CTT. In addition, Gulliksen (1987) con-
tains detailed information on CTT and Engelhard (1994, 2008) presents a historical view
of invariance.
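The following Python sketch illustrates this sample dependence. Purely for illustration, it assumes that responses to a single item of fixed location are generated by a logistic model of the kind introduced in Chapter 2; the two samples’ person distributions and sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def classical_p_value(theta_sample, delta):
    """Classical item 'difficulty' (proportion correct) for an item of fixed location delta."""
    prob = 1.0 / (1.0 + np.exp(-(theta_sample - delta)))  # assumed generating model
    responses = rng.binomial(1, prob)                      # simulated 0/1 responses
    return responses.mean()

delta = 0.0                                   # the item's location never changes
low_group = rng.normal(-1.0, 1.0, size=5000)  # a less proficient sample (assumed)
high_group = rng.normal(1.0, 1.0, size=5000)  # a more proficient sample (assumed)

print(classical_p_value(low_group, delta))    # roughly 0.3: the item looks "hard"
print(classical_p_value(high_group, delta))   # roughly 0.7: the same item looks "easy"
```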
Latent Class Analysis
Unlike IRT’s premise of a continuous latent variable, in latent class analysis (LCA) the latent
variable is assumed to be categorical. That is, the latent variable consists of a set of mutually
exclusive and exhaustive latent classes.4 To be more specific, “there exists a set of latent
classes, such that the manifest relationship between any two or more items on a test can be
accounted for by the existence of these basic classes and by these alone” (Stouffer, 1950, p.
6). In LCA the comparison of individuals involves comparing their latent class memberships
rather than comparing their locations on a continuous latent variable.
For an understanding of the nature of a categorical latent variable, we turn to two empiri-
cal studies. The first is a study of the nosologic structure of psychotic illness by Kendler,
Karkowski, and Walsh (1998). In their study, these authors conceptualized this latent vari-
able as categorical. Their LCA showed that their participants belonged to one of six classes:
(1) classic schizophrenia, (2) major depression, (3) schizophreniform disorder, (4) bipolar-
schizomania, (5) schizodepression, and (6) hebephrenia. The second example involves cheat-
ing on academic examinations (Dayton & Scheers, 1997); the latent variable is cheating. The
LCA of the investigators’ data revealed a structure with two latent classes. One class consisted
of persons who were “persistent cheaters,” whereas the second class consisted of individuals
who would either exhibit “opportunistic” cheating or might not cheat at all.
In both of these examples, we can see that respondents differ from one another on the
latent variable in terms of their class membership rather than in terms of their locations on
a continuum. In Appendix E, “Mixture Models,” we discuss an approach in which we com-
bine LCA and IRT. That is, we can conceptualize academic performance as involving both
latent classes and continua. For example, we can have one class of persistent cheaters and
another class of noncheaters. Within each class there is a proficiency variable continuum.
Therefore, the cheater class has its own continuum on which we can compare individual
performances. Similarly, the noncheater class has a separate continuum that we use to com-
pare the noncheaters’ performances with one another. These latter comparisons are not
contaminated by the cheaters’ performances and the noncheaters are not disadvantaged by
the presence of the cheaters.
In general, LCA determines the number of latent classes that best explains the data
(i.e., determining the latent class structure). This process involves comparing models that
vary in their respective number of classes (e.g., a one-class model, a two-class model, and
so on). Determining the latent class structure involves not only statistical tests of fit, but
also the interpretability of the solution. With each class structure one has estimates of the
items’ characteristics. Based on these item characteristics and the individuals’ responses,
the respondents are assigned to one of the latent classes. Subsequent to this assignment, we
obtain estimates of the relative size of each class. These relative sizes are known as the latent
class proportions (πνs, where ν is the latent class index). The sum of the latent class propor-
tions across the latent classes is constrained to 1.0. For example, if the latent variable, say
algebra proficiency, has a two-class structure, then π1 might equal 0.65 and π2 = 1 – 0.65
= 0.35. Moreover, our latent classes’ interpretation may reveal that the larger class (i.e., π1
= 0.65) consists of persons who have mastered algebraic problems, whereas the other class
consists of individuals who have not mastered the problems. In short, the data’s latent struc-
ture consists of masters and nonmasters.
One may conceive of a situation in which, if one had a large number of latent classes and if they were ordered, there would be little difference between conceptualizing the latent variable as continuous or as categorical. In point of fact, a latent class model with a sufficient number of latent classes is equivalent to an IRT model. For example, for a data set with four items, a latent class model with at least three latent classes would provide item characterizations “equivalent” to those of an IRT model that uses only item location parameters. Appendix
E, “Mixture Models,” contains additional information about LCA.
Summary
In IRT, the item and person parameters transcend the particular sample used in their estimation. Moreover, unlike CTT, with IRT
we are able to make predictive statements about respondents’ performance as well as exam-
ine the tenability of the model vis-à-vis the data. In the next chapter the simplest of the
IRT models is presented. This model, the Rasch or one-parameter logistic model, contains
a single parameter that characterizes the item’s location on the latent variable continuum.
We show how this single-item parameter can be used to estimate a respondent’s location on
a latent variable.
Notes
1. Although understanding how the automobile moves is not necessary in order to measure the
speed with which it moves, nonetheless fully understanding how the automobile moves can lead to
an improved measurement process.
2. A trait is exemplified primarily in the things that a person can do (Thurstone, 1947) and is
any “distinguishable, relatively enduring way in which one individual varies from another” (Guilford,
1959, p. 6). However, we do not consider traits to be rigidly fixed or predetermined (see Anastasi,
1983).
3. There is an implicit unidimensionality assumption in CTT. That is, for observed scores to have
any meaning they need to represent the sum of responses to items that measure the same thing. For
instance, assume that an examination consists of five spelling questions and five single-digit addition
problems. Presumably our examination data would consist of two dimensions representing spelling
and addition proficiencies. If a person had an observed score of 5, it would not be possible to deter-
mine whether he or she is perfect in spelling, perfect in addition, or in some combination of spelling
and addition proficiencies. In this case, the observed score has no intrinsic meaning. In contrast, if
the examination consists of only spelling questions, then the score would indicate how well a person
could spell the questions on the test and would have intrinsic meaning.
4. Both IRT and LCA can be considered to be special instances of the general theoretical frame-
work for modeling categorical variables, known as latent structure analysis (LSA; Lazarsfeld, 1950).
Moreover, both linear and nonlinear factor analysis may be regarded as special cases of LSA (McDon-
ald, 1967).
The One-Parameter Model
There are many possible construct domains to which IRT may be applied. These involve
psychological constructs, such as neuroticism, motivation, social anxiety, cognitive develop-
ment, consumer preferences, proficiency, and so on. Although each of these constructs could
be conceptualized as being categorical (e.g., consisting of only latent classes), in the current
context we would conceptualize each of these latent variables as a continuum. Whatever
the construct of interest may be, we assume that it is manifested through an individual’s
responses to a series of items. The simplest IRT model that could be used in this situation
is one that characterizes each item in terms of a single parameter. This parameter is the
item’s location on the latent continuum that represents the construct. The concept that an
item has a location is not new and may be traced back to Thurstone (1925, 1928); also see
Andrich (1978a), Lumsden (1978), and Yen (1986). In this chapter an IRT model with a single item parameter is conceptually developed. In the context of this model the general principles
and assumptions underlying IRT, as well as a parameter estimation approach, are presented.
In subsequent chapters more sophisticated estimation approaches and more complicated
models are discussed.
FIGURE 2.1. Graphical representation of latent variable continuum with five items (circles).
Typically, we use standard score-like values (i.e., z-scores) to mark off the continuum and represent the metric in IRT.
For this continuum assume that the upper end of the continuum indicates greater
mathematics proficiency than does the lower end. This means that items located toward
the right side require an individual to have greater proficiency to correctly answer the items
than items located toward the left side. As can be seen, our instrument’s items are located
throughout the continuum with some above 0 and others below 0. For instance, the first
item is located at –2, the second item at –1, and so on. We use the Greek letter δ (delta) to
represent an item’s location, and δj represents the jth item location on this continuum. Using
this notation, the first item’s location is represented as δ1 = –2 and the fifth’s as δ5 = 2. More-
over, the Greek letter θ (theta) is used to represent the person location on this continuum. In
the context of this example, a person’s location reflects his or her mathematics proficiency.
According to the figure, person A is located at 0.0 (i.e., θA = 0.0). As should be clear from
Figure 2.1, both items and persons are located on the same continuum.
One implication of having both persons and items located on the same continuum
is that it is possible to make comparative statements about how a typical person might
respond to an item. For example, because the lower end of the continuum represents less
mathematics proficiency than the upper end, items that are located in the lower end require
less proficiency to be correctly answered than those in the upper end. As a result, chances
are that a person located at 0 will correctly respond to items located in the lower end of the
continuum (e.g., item 1 with a δ1 = –2). However, if we administer an item located closer
to 0, say item 2 with δ2 = –1, then chances are that the person will respond correctly, but
we recognize that there is an increased possibility that he or she may incorrectly respond.
This incorrect response may be due to a lapse in being able to recall relevant information,
the tip-of-tongue phenomenon, or another such cause. Similarly, administering an item,
such as item 4 (δ4 = 1), to a person located at 0 will likely result in an incorrect response,
but there is still a sizeable chance that he or she may correctly answer the item because of
the closeness in the proficiency required by the item and that which the person possesses.
In other words, the greater the distance between the person and item locations, the greater
the certainty we have in how the person is expected to respond to the item. However, as this
distance approaches zero, then the more likely we are to say that there is a 50:50 chance that
the person will correctly respond to the item. These expectations about how a person will
respond are expressed probabilistically (i.e., “chances of a correct response are . . .”).
Although the foregoing may make intuitive sense, one might ask, “Are there data
that support a pattern of an increasing probability of a correct response as person location
increases?” In Figure 2.2 we see that the answer is yes. This figure shows that the proportion
of individuals correctly responding to an item is an S-shape (sigmoidal) function of their
standard scores on a test. In this case, the participants were administered an examination to
assess their latent mathematics proficiency. The complete data from this administration are
presented in Table 2.1.
From the graph we see that as the z-scores increase, there is an increase in the propor-
tion of examinees correctly responding to the item; however, this increase is not constant
across the continuum. Moreover, we see that as one progresses beyond a z of 1 the points
begin to form a plateau. Conversely, as one progresses below a z of –1 the proportions also
start to level off. However, there is a range on the z-score metric around –0.5 where the
proportion of individuals with a correct response is around 0.50. That is, this is the point
at which we would say that there is a 50:50 chance that the person will correctly respond
to the item.
We can trace the nonlinear pattern of the empirical proportions in Figure 2.2 and
obtain an empirical trace line (Lazarsfeld, 1950). This trace line would clearly be a sigmoid
or S-shaped curve (i.e., an ogive). However, rather than being satisfied to simply describe
the pattern, we can develop a model that incorporates our ideas about how an observed
response is governed by a person’s location in order to be able to predict response behavior.
The nonlinearity shown in Figure 2.2 eliminates using the linear regression of proportions
on the person locations to predict response behavior. Because this ogival pattern is evident
FIGURE 2.2. Empirical proportions of individuals with a correct response on one item as a function
of standardized number correct scores.
TABLE 2.1. Response patterns, their frequencies, and number correct scores
Response pattern   Frequency   Number correct
10000              2280        1
01000              242         1
00100              235         1
00010              158         1
00001              184         1
11000              1685        2
10100              1053        2
01100              134         2
10010              462         2
01010              92          2
00110              65          2
10001              571         2
01001              79          2
00101              87          2
00011              41          2
11100              1682        3
11010              702         3
10110              370         3
01110              63          3
11001              626         3
10101              412         3
10011              166         3
01101              52          3
01011              28          3
00111              15          3
11110              2095        4
11101              1219        4
11011              500         4
10111              187         4
01111              40          4
11111              3385        5
Note. N = 19,601.
in cumulative distributions, such as the cumulative normal distribution or the logistic dis-
tribution, we might consider using one of these for our modeling. In point of fact, we use
the logistic function in the following because of its simplicity. (The use of the cumulative
normal function is discussed in Appendix C.)
In its simplest form the logistic model may be presented as
p(x) = e^z / (1 + e^z)    (2.1)
where p(x) is the probability of values of 1 when the predictor takes on the value of x,
e is a constant equal to 2.7183 . . . , and z is some linear combination of, for example,
predictor variable(s) and a constant. By appropriately specifying z, we can arrive at a model
for predicting response behavior on an item j.
To specify z, we return to the idea above that the distance between the person and the
item locations (i.e., (θ – δj)) is an important determinant of the probability of his or her
response (cf. Rasch, 1980; Wright & Stone, 1979). Therefore, letting z = (θ – δj) results in a
model that would allow one to predict the probability of a response of 1 as a function of both
the item and person locations.1 By substitution of this z into the logistic model we have
p(xj = 1|θ, δj) = e^(θ – δj) / (1 + e^(θ – δj))    (2.2)
where p(xj = 1|θ, δj) is the probability of the response of 1 (i.e., xj = 1), θ is the person
location, and δj is item j’s location. This model is called the Rasch model (Rasch, 1980).
Expressed in words, Equation 2.2 says that the probability of a response of 1 on item j is a
function of the distance between a person located at θ and the item located at δ. (Technically,
we are talking about the probability of a randomly selected individual from those located at
θ.) The right side maps the (potentially infinite) distance between the person’s location and
the item’s location onto the [0, 1] probability scale. A response of 1 simply indicates that
an event occurred or we observed a success. (We use the phrase “response of 1” instead of
“correct response” because the instrument may not be an examination; given a proficiency
context, we may refer to the response as correct or incorrect.) For convenience, pj is used
for p(xj = 1|θ, δj) in the following.
The theoretical range of the item locations δs, as well as the person locations θs, is from
–∞ to ∞. However, typical item and person locations fall within –3 to 3. In proficiency test-
ing the item locations are referred to as item difficulties.2 In general, items located somewhat
below 0.0 are said to be “easy” items (e.g., below –2.0) and items somewhat above 0.0 are
“hard” items (e.g., above 2.0). In general, the items that are considered to be “easy” are the ones that even persons with low proficiencies have a tendency to answer correctly. Conversely, the “harder” items are the ones that only persons with high proficiencies tend to get correct.
Items around 0.0 are considered to be of “average difficulty.”
As an example of using the Rasch model to predict response behavior, assume that we
administer a mathematics item located at 1 (i.e., δ = 1) to individuals located at 0 (i.e., θ =
0). According to the model in Equation 2.2 the probability of a correct response for a ran-
domly selected individual from this group would be
pj = e^(0 – 1) / (1 + e^(0 – 1)) = 0.2689
That is, the probability that a randomly selected person located at 0 will correctly respond
to this item is only 0.2689. The magnitude of this probability should not be surprising,
given that this item is located above (or to the right of) the individual’s location and so
the item requires more mathematical proficiency to be correctly answered than the person
possesses. Actually, chances are that a person located at 0 will incorrectly respond to this
item rather than respond correctly because the probability of an incorrect response to this
item by someone located at 0 is 1 – 0.2689 = 0.7311.
Another way of interpreting our probabilities is to convert them to the odds of a cor-
rect response on the item. For example, converting these probabilities to odds we find that
the odds of a response of 1 are approximately 1 to 2.7, or that it’s almost three times more
likely that the person will incorrectly respond to the item than correctly respond. Appendix
E, “Odds, Odds Ratios, and Logits,” contains more information on odds.
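The following Python sketch reproduces this worked example directly from Equation 2.2; the person and item locations (θ = 0, δ = 1) are those used in the text.

```python
import math

def rasch_p(theta, delta):
    """Equation 2.2: probability of a response of 1 given theta and delta."""
    return math.exp(theta - delta) / (1.0 + math.exp(theta - delta))

theta, delta = 0.0, 1.0       # the values from the worked example

p = rasch_p(theta, delta)
print(round(p, 4))            # 0.2689, probability of a response of 1
print(round(1.0 - p, 4))      # 0.7311, probability of a response of 0

odds = p / (1.0 - p)          # odds of a response of 1
print(round(odds, 4))         # about 0.3679, i.e., roughly 1 to 2.7
print(round(1.0 / odds, 2))   # about 2.72: a response of 0 is almost three times as likely
```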
For a given item location, the substitution of different values of θ into Equation 2.2
produces a series of probabilities that when graphed show a pattern similar to that shown
in Figure 2.2. The trace line produced by the model given in Equation 2.2 is referred to as
an item characteristic curve (Lord, 1952), an item curve (Tucker, 1946), or an item response
function (IRF). Example IRFs are shown in Figure 2.3 and are discussed below. For the
Rasch model the item’s location, δ, is defined as the point of inflexion or the “middle” point
of the IRF; an inflexion point is where the slope of a function changes direction. Because the
Rasch model IRF has a lower asymptote of 0 and an upper asymptote of 1, this midpoint has
a value of 0.50. Therefore, for the Rasch model the item’s location corresponds to the point
on the continuum where the probability of a response of 1 is 0.50.
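A short sketch of this substitution process appears below; the item location of 1.0 and the grid of θ values are illustrative choices, not values taken from the data set.

```python
import numpy as np

def rasch_p(theta, delta):
    """Equation 2.2, evaluated for an array of person locations."""
    return np.exp(theta - delta) / (1.0 + np.exp(theta - delta))

delta = 1.0                              # an assumed item location
thetas = np.linspace(-3.0, 3.0, 13)      # a grid of person locations
irf = rasch_p(thetas, delta)             # tracing the item response function (IRF)

for t, p in zip(thetas, irf):
    print(f"theta = {t:+.1f}   p = {p:.3f}")

# At theta equal to delta the probability is exactly 0.50, the IRF's point of inflexion.
print(rasch_p(np.array([delta]), delta))  # [0.5]
```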
In the IRT literature one sometimes sees reference to a one-parameter model. In this section
we present the one-parameter model, and in the subsequent section we discuss whether the
Rasch model and the one-parameter model should be considered distinct models.
FIGURE 2.3. Empirical proportions and IRFs corresponding to different discrimination values.
Figure 2.3 shows a series of IRFs overlaid on the item data shown in Figure 2.2. Each
of these IRFs uses an estimated item location of –0.37 (i.e., d̂ j = –0.37).3 The dashed IRF
(labeled Rasch) is based on the Rasch model and is created by substituting d̂ = –0.37 for δ in
Equation 2.2 and using values of θ from –2.0 to 2.0.
As we see, the predicted Rasch IRF is not as steep as the observed response function
(i.e., the empirical trace line suggested by the proportions). To better match the empirical
trace line we need to increase the slope of the IRF. To do this we revisit the exponent in
Equation 2.2. This exponent, (θ – δ), can be considered to have a multiplier whose value is
1. That is, if we symbolize this multiplier by α, then the exponent becomes α(θ – δ). Dis-
tributing α across (θ – δ) and letting
γ = –αδ    (2.3)
we obtain
α(θ – δ) = αθ – αδ = αθ + γ    (2.4)
Equation 2.4 is the slope–intercept parameterization of the exponent. In this form we have
the equation for a straight line, where α represents the slope and γ symbolizes the intercept;
“αθ + γ” is sometimes referred to as being in linear form. Although the intercept is related to
the item location, it is not the item’s location. To obtain the item’s location we would use
δ = –γ/α    (2.5)
To better understand the slope–intercept form, examine Figure 2.4. (We refer to the
line in the graph as a logit regression line.) This figure shows αθ + γ as a function of θ for an
item located at –0.37. As can be seen, the line’s intercept (γ) equals 0.37 and the slope (α) is
1. From these values we can obtain the item’s location on the θ continuum using Equation
2.5:
δ = –γ/α = –0.37/1.0 = –0.37    (2.6)
Because α is directly related to the logit regression line’s slope, a change in α leads to a
change in the line’s slope. The effect of changing α “passes through” the reparameterization
of the slope–intercept form into the α(θ – δ) deviate form. That is, the slope of the IRF may
be modified by changing the value of α.4 For example, by increasing α we arrive at the other
two IRFs shown in Figure 2.3. The solid bold and nonbold IRFs are the IRFs when α = 1.5
or α = 2.0, respectively. As we see, using an α = 2.0 results in a predicted IRF that almost
perfectly matches the empirical trace line. Stated another way, in this case the net effect of
increasing the value of α is to improve the fit of the model to the data.
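The following sketch traces the model’s predictions for the three discrimination values shown in Figure 2.3 and checks the conversion in Equations 2.5 and 2.6. The item location of –0.37 and the intercept of 0.37 come from the text; the grid of θ values is an arbitrary illustrative choice.

```python
import numpy as np

def irf(theta, alpha, delta):
    """IRF with a discrimination multiplier: e^(alpha(theta - delta)) / (1 + e^(alpha(theta - delta)))."""
    z = alpha * (theta - delta)
    return np.exp(z) / (1.0 + np.exp(z))

delta = -0.37
thetas = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

for alpha in (1.0, 1.5, 2.0):                     # the three IRFs in Figure 2.3
    print(alpha, np.round(irf(thetas, alpha, delta), 3))
# Larger alpha values yield larger changes in probability around delta, i.e., steeper IRFs.

# Equations 2.5 and 2.6: recovering the item location from the slope-intercept form.
alpha, gamma = 1.0, 0.37
print(-gamma / alpha)                              # -0.37
```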
We can rewrite Equation 2.2 to explicitly incorporate α. Doing this produces the one-
parameter logistic (1PL) model:
p(xj = 1|θ, α, δj) = e^(α(θ – δj)) / (1 + e^(α(θ – δj)))    (2.7)
or, equivalently,
p(xj = 1|θ, α, δj) = 1 / (1 + e^(–α(θ – δj)))    (2.8)
The lack of a subscript on α means that α does not vary across items. As such, the
corresponding IRFs do not cross one another.5 Sometimes “α(θ – δ)” (or “γ + αθ”) is
referred to as a logistic deviate. For simplicity of presentation pj is used in lieu of p(xj = 1|θ,
α, δj) in the following.
Because α is related to the IRF’s slope, it reflects how well an item discriminates among
individuals located at different points along the continuum. As a consequence, α is known
as the item discrimination parameter. To understand this, assume we have three items with
different αs, but located at 0.0 (i.e., δ1 = δ2 = δ3 = 0). Our three discrimination parameters
are 0, 1, and 2. In addition, we have a respondent A located at –1 (θA = –1) and another
respondent B located at 1 (i.e., θB = 1).
For the item with α = 0.0 our IRF, as well as logit regression line, is horizontal. As a
result, the predicted probabilities of a response of 1 for the two respondents is 0.5. In this
case, the item does not provide any information for differentiating between the two respon-
dents. This lack of discriminatory power is a direct function of α = 0.0. In contrast, with
the second item (α = 1) we have different predictions for our respondents; for respondent
A the p2 = 0.2689 and for person B the p2 = 0.7311. Therefore, this item’s α allows us to
distinguish between the two respondents.
Developing this idea further, the third item (α = 2.0) would have the steepest IRF (and
logit regression line) of the three items. This steepness is reflected in a greater difference in
the predicted probabilities for our respondents than seen with the previous two items. That
is, for this item respondent A has a p3 = 0.1192 and for person B we have p3 = 0.8808. In
short, the magnitude of the difference in these predicted probabilities is a direct function
of the item’s α. Therefore, items with larger αs (i.e., steeper logit regression lines and IRFs)
do a better job of discriminating among respondents located at different points on the con-
tinuum than do items with smaller αs.
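These predicted probabilities are easy to reproduce. The short Python sketch below (a minimal illustration, not part of any of the software packages discussed in this book) evaluates Equation 2.8 for the three hypothetical items (δ = 0; α = 0, 1, and 2) and the two respondents (θ = –1 and θ = 1) discussed above:

    import math

    def p_correct(theta, alpha, delta):
        """1PL model probability of a response of 1 (Equation 2.8)."""
        return 1.0 / (1.0 + math.exp(-alpha * (theta - delta)))

    for alpha in (0.0, 1.0, 2.0):
        pa = p_correct(-1.0, alpha, 0.0)   # respondent A at theta = -1
        pb = p_correct(1.0, alpha, 0.0)    # respondent B at theta = +1
        print(alpha, round(pa, 4), round(pb, 4))
    # alpha = 0 -> 0.5 and 0.5; alpha = 1 -> 0.2689 and 0.7311; alpha = 2 -> 0.1192 and 0.8808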
To summarize, both the 1PL and Rasch models require that items have a constant value for
α, but allow the items to differ in their locations. For the Rasch model this constant is 1.0,
whereas for the 1PL model the constant α does not have to be equal to 1.0. Mathematically,
the 1PL and the Rasch models are equivalent. The values from one model can be transformed
into the other by appropriate rescaling. The use of the Rasch model sets α to 1.0, and
this constant value is absorbed into the metric used in defining the continuum; this is
demonstrated in Chapter 4.
However, to some the Rasch model represents a different philosophical perspective
than that embodied in the 1PL model. The 1PL model is focused on fitting the data as
well as possible, given the model’s constraints. In contrast, the Rasch model is a model
used to construct the variable of interest (cf. Andrich, 1988; Wilson, 2005; Wright, 1984;
Wright & Masters, 1982; Wright & Stone, 1979). In short, this perspective says that the
Rasch model is the standard by which one can create an instrument for measuring a vari-
able. This perspective is similar to that seen in Guttman Scaling and Coombs Unfolding
(Coombs, 1950), and is analogous to what is done in the physical sciences.6 For example,
consider the measurement of time. The measurement of time involves a repetitive pro-
cess that marks off equal increments (i.e., units) of the (latent) variable time. In order
to measure time we need to define our unit (e.g., a standard period of oscillation). With
the Rasch model the unit is defined as the logit. That is, the unit is the distance on our
continuum that leads to an increase in the odds of success by a factor equal to the tran-
scendental constant e. Therefore, analogous to time measurement, our measurements with
a one-parameter model are based on the (repetitive) use of a unit that remains constant
across our metric.
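Under the Rasch model this statement can be checked directly: the odds of success are e^(θ – δ), so moving one logit up the continuum multiplies the odds by e no matter where on the continuum one starts. A minimal Python check, assuming δ = 0 for simplicity:

    import math

    def odds(theta, delta):
        # Odds of success under the Rasch model: p / (1 - p) = e**(theta - delta)
        return math.exp(theta - delta)

    for theta in (-2.0, 0.0, 1.3):
        print(odds(theta + 1.0, 0.0) / odds(theta, 0.0))   # always e = 2.71828...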
For simplicity in the following discussion we use the general term 1PL model to refer
to both α = 1.0 (i.e., the Rasch model) and to the situation where α is equal to some other
constant. However, when we use the term Rasch model, we are referring to the situation
when α = 1.0 and a measurement philosophy that states that the Rasch model is the basis
for constructing the variable of interest.
IRT models assume that the response data are a manifestation of one or more person-oriented
latent dimensions or factors. This is typically referred to as the dimensionality assumption and
is reflected both in the models and in their graphical representations. For instance, in the 1PL
model we use a single person location variable, θ, to reflect that one latent variable accounts
for a person’s response behavior. Moreover, in the 1PL model’s conceptual development, as
well as in its IRF, this assumption is reflected in a single continuum to represent the latent
variable (see Figures 2.1 and 2.3). All the models presented in Chapters 2–9 assume a single
latent person variable. Therefore, for these models the dimensionality assumption is referred
to as the unidimensionality assumption. Specifically, the unidimensionality assumption states
that the observations on the manifest variables (e.g., the items) are solely a function of a
single continuous latent person variable. If one has a unidimensional latent space, then the
persons may be located and compared on this latent variable. In terms of our example, this
assumption states that there is a single latent mathematics proficiency variable that underlies
the respondents’ performance on our instrument. In contrast, if we needed to also know the
respondents’ locations on an obsessiveness latent variable to account for their performance
on our instrument, then the response data would be best modeled using a two-dimensional,
not a unidimensional, model.
We view the unidimensionality assumption as representing an ideal situation analogous
to the homogeneity of variance assumption in analysis of variance (ANOVA). In practice,
there will most likely be some degree of violation of the unidimensionality assumption. This
degree of violation may or may not be problematic. That is, although the data may in truth
be a manifestation of, for example, two latent variables, a unidimensional model may pro-
vide a sufficiently accurate representation to still be useful. (This is similar to an ANOVA in
which the homogeneity of variance assumption is violated, but the F test is still useful under
certain conditions.) Of course, in some situations the degree of violation may be so large
that a unidimensional model is not useful. In these situations one might consider the use
of a multidimensional model (Chapter 10) or some other approach to modeling the data.
However, regardless of whether one uses a unidimensional model or not, it should be noted
that whether the estimated θs are meaningful and useful is a validity issue. In short, the esti-
mated θs, in and of themselves, do not guarantee that the latent variable that is intended to
be measured (e.g., mathematics proficiency) is, in fact, measured.
A second assumption is that the responses to an item are independent of the responses
to any other item conditional on the person’s location. This assumption is referred to as
conditional independence or local independence. We use the term conditional independence
in the following because we consider it to be more descriptive than the term local indepen-
dence.
In the unidimensional case, the conditional independence assumption says that how
a person responds to a question is determined solely by his or her location on the latent
continuum and not by how he or she responds to any other question on the examination.
If this were not true, then more than the person’s, say, mathematics ability, would be affect-
ing his or her responses and one would have a nonunidimensional situation. Given this
interpretation, it is not surprising that sometimes the unidimensionality assumption and the
conditional independence assumption are discussed as being one and the same. However,
they may not be one and the same in all cases; also see Goldstein (1980). For instance, cer-
tain instrument formats lead to a dependency among item responses that does not appear to
invoke additional latent variables to define the latent space. For example, we might have a
series of questions that all relate to the same passage or a set of hierarchically related items
in which answering later items is based, in part, on answer(s) to earlier item(s). With these
formats item responses will most likely violate the conditional independence assumption.
In contrast, there are cases of item interdependency that are due to additional latent
variables. For example, consider the case of speededness in which an individual has insuf-
ficient time to respond to all the items on an instrument. In this situation, unless we use an
additional latent variable, such as an individual's rapidity, in defining the latent space,
conditional independence is violated for the speeded items. That is, the unidimensionality
assumption is violated because we have two latent person variables, rapidity and the target
latent variable (e.g., mathematics proficiency). Furthermore, to the extent that the latent
variable rapidity is associated with gender or ethnic group differences, then one may also
observe that the speeded items exhibit differential item functioning (Chapter 12). Verhelst,
Verstralen, and Jansen (1997) and Roskam (1997) both present models for time-limited
measurement.
Strictly speaking, the conditional independence assumption states that for “any group
of examinees all characterized by the same values θ1, θ2, . . . , θk, the (conditional) distribu-
tions of the item scores are all independent of each other” (Lord & Novick, 1968, p. 361).7
That is, when all the latent variables that define the complete latent space are known and
taken into account, then the item responses are independent of one another. Therefore, the
conditional independence assumption applies not only to unidimensional, but also to the
multidimensional IRT models.
A third assumption is the functional form assumption. This assumption states that the
data follow the function specified by the model. For instance, for Equations 2.2 and 2.7 the
functional form is that of an “S”-shaped curve. This ogival form matches the empirical data
for the item shown in Figure 2.2. (Although these data were modeled using a logistic func-
tion, an alternative approach would be to use a probit strategy; see Appendix C.)
In the context of the 1PL model, the functional form assumption also embodies that
all items on an instrument have IRFs with a common lower asymptote of 0 and a common
slope. This common slope (i.e., constant α) across items is reflected in parallel IRFs. As is
the case with the unidimensionality assumption, this assumption is rarely exactly met in
practice. However, if the IRFs are parallel within sampling error, then this is interpreted as
indicating model–data fit. Several different ways of determining model–data fit are addressed
in the following chapters.
To demonstrate the principles underlying person and item parameter estimation, we use
response data from the administration of a five-item mathematics examination. Consistent
with the measurement issues discussed in Chapter 1, we assume that we have content
validity evidence for our instrument. Although there is some controversy concerning the
concept of content validity, in this book we assume that it is a useful concept. See Sireci
(1998) for a discussion of the concept of content validity. We conceptualize mathematics
proficiency as a continuous latent variable.
Although our example data come from proficiency assessment, we could have just as
easily used data from a personality, attitude, or interest inventory. Moreover, our IRT model
does not make an assumption about the item response format used on our instrument.
Whether the questions, for example, use a multiple-choice, open-ended, true-false, forced-
choice, or fill-in-the-blank response format is irrelevant. All that matters is that the data
analyzed are dichotomous and that the assumptions are tenable. The appropriateness of the
model to the data is a fit issue.
For our example, the dichotomous data were determined by classifying the examinees’
responses into one of two categories, correct or incorrect. If the examinee correctly per-
formed the mathematical operation(s) on an item, then his or her response was categorized
in the “correct” category and assigned a value of 1. Otherwise, his or her response was cat-
egorized as incorrect and received a value of 0. Table 2.1 contains the response patterns and
their frequency of occurrence.
With binary data and five items there are 2^L = 2^5 = 32 possible unique response pat-
terns, where L is the number of items. As can be seen from Table 2.1, each of these possible
patterns is observed in the data set. There are 691 persons who did not correctly respond
to any of the items (i.e., the response pattern 00000 with an X = 0 or a “zero score”), and
there are 3385 persons who obtained a perfect score of 5 (i.e., X = 5, for the response pat-
tern 11111). For pedagogical purposes the items are presented in ascending order of item
location. Therefore, one would expect that if a person had item 2 correct, then that person
should have also had item 1 correct.8
In practice, we know neither the item parameters (e.g., the δs) nor the person parameters (θs).
Because, in general, our interest is in estimating person locations, we begin with estimating
respondent locations on the latent continuum. For estimating a respondent’s location
(θ̂ ), we assume that we know the items’ locations on the latent continuum. Although we
conceptually present the estimation process here, in Appendix A we furnish a mathematical
treatment and address obtaining the item location estimates in Appendix B.
Figure 2.5 shows the locations of our mathematics items. Item 1 has a location at –1.9
(i.e., δ1 = –1.9), item 2 is located at δ2 = –0.6, and the remaining items are located at
δ3 = –0.25, δ4 = 0.30, and δ5 = 0.45; these values correspond to their locations in Figure 2.5.
Stated in words, item 1 is almost two units below the zero point and item 5 is almost a half
unit above the zero point. The units defining the metric are called logits; see Appendix E,
“Odds, Odds Ratios, and Logits,” for information on logits. In general, and given that we are
assessing mathematics proficiency, item 1 would tend to be considered an “easy” item and
is comparatively easier than the remaining items on the instrument. Items 2 through 5 are
generally considered to be of “average” difficulty.
FIGURE 2.5. Graphical representation of the item locations for the mathematics test.
To estimate a person's location, we first calculate the probability of each observed item response for a trial value of θ (step 1). Consider the response pattern 11000 (items 1 and 2 correct, items 3 through 5 incorrect) and a trial location of θ = –3.0. Using Equation 2.2, the probability of a correct response to item 1 for someone located at –3.0 is

p(x1 = 1 | θ = –3.0, δ1 = –1.9) = e^(–3.0 – (–1.9)) / (1 + e^(–3.0 – (–1.9))) = 0.2497

For our second item, and using δ2 in lieu of δ1, the probability of a correct response to item
2 for someone located at –3.0 is

p(x2 = 1 | θ = –3.0, δ2 = –0.6) = e^(–3.0 – (–0.6)) / (1 + e^(–3.0 – (–0.6))) = 0.0832
These probabilities reflect what one would expect—namely, that a randomly selected
person located at –3.0 has a higher probability of correctly answering the easiest item on the
instrument than of correctly answering a harder item.
The responses to items 3 through 5 are incorrect. Therefore, to determine the probabil-
ity of an incorrect response, we use the complement of Equation 2.2 for items 3 through 5.
Item 5 is used to demonstrate obtaining the probability of an incorrect response. For item 5
the probability of an incorrect response for someone located at –3.0 is

p(x5 = 0 | θ = –3.0, δ5 = 0.45) = 1 – p(x5 = 1 | θ, δ5) = 1 / (1 + e^(–3.0 – 0.45)) = 0.9692
That is, a person located at –3.0 has a very high probability of incorrectly answering the
hardest item on the instrument. The probabilities of incorrect responses to items 3 and
4 would be obtained in a similar fashion. These probabilities are p(x3 = 0) = 0.9399 and
p(x4 = 0) = 0.9644.
So far we have the individual item probabilities for a θ = –3.0. To obtain the likelihood
of the observed response pattern 11000 requires multiplying the individual item probabili-
ties (step 2). For individuals located at θ = –3.0 the likelihood of observing the response
pattern 11000 is given by

L(11000 | θ = –3.0) = (0.2497)(0.0832)(0.9399)(0.9644)(0.9692) ≈ 0.018

Stated in words, the likelihood of individuals located at –3.0 providing the response
pattern 11000 is about 0.02.
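For readers who wish to verify such calculations, the following short Python sketch reproduces the item probabilities and the likelihood of the pattern 11000 at θ = –3.0, assuming the Rasch model and the item locations given above:

    import math

    delta = [-1.9, -0.6, -0.25, 0.30, 0.45]   # item locations used in this example
    x = [1, 1, 0, 0, 0]                        # the response pattern 11000
    theta = -3.0

    def p_correct(theta, d):
        """Rasch model probability of a response of 1 (Equation 2.2)."""
        return math.exp(theta - d) / (1.0 + math.exp(theta - d))

    # Step 1: probability of each observed response; step 2: their product
    likelihood = 1.0
    for d, resp in zip(delta, x):
        p = p_correct(theta, d)
        likelihood *= p if resp == 1 else (1.0 - p)
    print(round(likelihood, 4))   # approximately 0.018, i.e., "about 0.02"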
These two steps, calculating the individual item probabilities and then the joint prob-
ability of the pattern, would be repeated for the θs in the range –3.0 to 3.0 (step 3). Concep-
tually, the resulting series of probabilities from –3.0 to 3.0 collectively form the likelihood
function (L). In step 4 the L is examined to determine the location of the maximum of the
likelihood function. This location is the estimate of the person location (θ̂ ) that would most
likely produce the response pattern 11000 on this mathematics examination, using our
model and item parameters.
We may symbolically represent the above steps for calculating a likelihood. Letting x
represent a response pattern (e.g., x = 11000), then the likelihood of person i’s response
vector, xi, is
L(xi | θ, α, δ) = ∏_{j=1}^{L} pj^(xij) (1 – pj)^(1 – xij)   (2.9)
where pj is short for p(xij = 1|θi, α, δj), xij is person i’s response to item j, δ is a vector
containing item location parameters, L is the number of items on the instrument (i.e., its
length), and “ ∏ ” is the product symbol. From Equation 2.9 one sees that as the number of
items increases, the product of these probabilities will potentially become so small that it will
become difficult to represent on any calculational device. Therefore, rather than working
directly with the probability, the natural logarithmic transformation of the probability (i.e.,
loge(pj) or ln(pj)) is typically used. This transformation results in a summation rather than
a multiplication. The utilization of logs results in a likelihood that is called the log likelihood
function, lnL(xi), where
lnL(xi | θ, α, δ) = ∑_{j=1}^{L} [ xij ln(pj) + (1 – xij) ln(1 – pj) ]   (2.10)
A graphical representation of the log likelihood for the pattern 11000 is presented
in Figure 2.6. The vertical line in the body of the graph shows that the location of the
maximum of the log likelihood occurs at approximately at –0.85 (i.e., this is the value that
is most likely to result in the response pattern 11000 on this instrument). This value would
be the estimated person location for this response pattern (i.e., θ̂ = –0.85).
What would the lnLs look like for the other response patterns that have the same
observed score of 2 (e.g., 10100, 01100, etc.)? All of these lnLs exhibit the same form as
seen with the pattern 11000 and with their maxima located at the same θ, but each likeli-
hood is less in magnitude throughout the entire θ continuum than that shown for 11000.
Figure 2.7 contains these lnLs for all 10 patterns with an X = 2. This pattern of lnLs is
intuitively appealing because incorrectly answering the easiest three items and correctly
answering the hardest two items (i.e., 00011) is not as likely to occur as the reverse pat-
tern 11000. In short, none of the other patterns for X = 2 are as likely to occur as 11000.
Moreover, although different patterns of responses may produce the same X and have
varying degrees of likelihood, for the 1PL model a given X yields the same θ̂ regardless of
the pattern of responses that produced X. Stated another way, Figure 2.7 shows that for
the 1PL model the person's observed score (i.e., Xi = ∑_{j=1}^{L} xij) contains all the information
necessary for estimating θi or, statistically speaking, Xi is a sufficient statistic for estimat-
ing θi (cf. Rasch, 1980).9 In effect, the data shown in Table 2.1 may be collapsed into six
observed scores of 0, 1, 2, 3, 4, and 5. As such, and in the context of the 1PL model, the
actual pattern of responses that make up each observed score is ignored in our estimate of
θ when we use the maximum likelihood approach.10 We can proceed to obtain the θ̂ s for
the remaining patterns in Table 2.1 by determining their lnLs in a fashion similar to that
used with 11000.
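As a conceptual illustration of steps 1 through 4, the following Python sketch evaluates the log likelihood of the pattern 11000 over a grid of θ values from –3.0 to 3.0 and reports the grid point at which it is largest; with the item locations used above the maximum falls near –0.85 (the grid spacing, here 0.01, limits the precision of the resulting estimate):

    import math

    delta = [-1.9, -0.6, -0.25, 0.30, 0.45]   # item locations used in this example
    x = [1, 1, 0, 0, 0]                        # the response pattern 11000

    def log_likelihood(theta, x, delta):
        """Equation 2.10 with alpha = 1 (the Rasch model)."""
        lnL = 0.0
        for resp, d in zip(x, delta):
            p = 1.0 / (1.0 + math.exp(-(theta - d)))
            lnL += resp * math.log(p) + (1 - resp) * math.log(1.0 - p)
        return lnL

    # Step 3: evaluate the log likelihood across the theta range;
    # step 4: take the grid point with the largest value as the estimate.
    grid = [i / 100.0 for i in range(-300, 301)]
    theta_hat = max(grid, key=lambda t: log_likelihood(t, x, delta))
    print(theta_hat)   # approximately -0.84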
FIGURE 2.7. Log likelihood functions for all patterns that result in an X = 2.
Although we located the maximum of the lnL by visual inspection, there are alternative,
more sophisticated approaches that can be used to find the maximum's location. One of these
approaches is Newton's method for maximum likelihood estimation (MLE); this approach is
discussed in Appendix A. Assume that we use Newton’s method to find the MLE θ̂ s for the
remaining observed scores. For the individuals who obtained only one correct answer (X
= 1), their θ̂ = –1.9876. That is, these 3099 individuals (i.e., 2280 + 242 + . . . + 184) are
located approximately two logits below the zero point on the mathematics proficiency con-
tinuum metric. For the observed scores of 2, 3, and 4, the obtained θ̂ s are –0.8378, 0.1008,
and 1.1796, respectively. Comparing the MLE θ̂ for X = 2 with our visual inspection estimate
(θ̂ = –0.85; X = 2) shows close agreement. We also see that as an individual answers more
questions correctly, his or her corresponding MLE θ̂ increases to indicate greater mathemat-
ics proficiency. However, unlike the observed scores, our θ̂ s are invariant of this particular
mathematics examination.
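A compact Newton–Raphson sketch in Python, assuming the Rasch model and the approximate item locations given earlier (–1.9, –0.6, –0.25, 0.30, 0.45), reproduces these MLE θ̂s for the observed scores 1 through 4 (zero and perfect scores are excluded because their estimates are infinite, as discussed next):

    import math

    delta = [-1.9, -0.6, -0.25, 0.30, 0.45]   # item locations used in this example

    def mle_theta(score, delta, start=0.0, iters=20):
        """Newton-Raphson solution of the Rasch likelihood equation sum(p) = X."""
        theta = start
        for _ in range(iters):
            p = [1.0 / (1.0 + math.exp(-(theta - d))) for d in delta]
            gradient = score - sum(p)                    # first derivative of lnL
            hessian = -sum(pj * (1.0 - pj) for pj in p)  # second derivative of lnL
            theta -= gradient / hessian
        return theta

    for score in (1, 2, 3, 4):
        print(score, round(mle_theta(score, delta), 4))
    # roughly -1.9876, -0.8378, 0.1008, and 1.1796, respectively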
It may have been noticed that θ̂ s were not provided for people who obtained either a
zero score (X = 0) or a perfect score (X = 5) on the examination. This is because the
corresponding log likelihoods do not have a maximum. For instance, the log likelihood for
the perfect score (X = 5) is presented in Figure 2.8. It can be seen that this log likelihood
conforms to what one would expect. There are an infinite number of θs above 4 that could
produce the observed score of 5. Because there is no way of distinguishing which θ among
them is most likely, given our data, we do not have a finite θ̂ for a perfect score. In this
case, our log likelihood is asymptotic with 0.0 and the estimate of θ would be ∞. For a
zero score the log likelihood would be the mirror image of the one shown in Figure 2.8
and our θ̂ would be –∞.
When zero and perfect scores are encountered in practice, the various computer esti-
mation programs have different kludges for handling these scores. For example, in the next
chapter the perfect and zero scores are modified so that they are no longer perfect and zero,
respectively. Alternatively, an estimation approach that incorporates ancillary population
information can be used to provide θ̂ s for zero and perfect scores. This approach falls within
Bayesian estimation and is discussed in Chapter 4.
FIGURE 2.8. Log likelihood function for the perfect score of X = 5 (x = 11111).
The uncertainty in θ̂ is quantified by its error variance or, equivalently, by its square root, the standard error of estimate (SEE). Asymptotically, the error variance of the MLE θ̂, conditional on θ, is

σ²_e(θ̂ | θ) = 1 / ( –E[∂²lnL/∂θ²] ) = 1 / ( ∑_{j=1}^{L} [p′j]² / [pj(1 – pj)] )   (2.11)

where pj is given by the IRT model, p′j is the model's first derivative, and E is the symbol
for expectation (Lord, 1980). Given that the first derivative of the 1PL model is p′j =
α[pj(1 – pj)], then Equation 2.11 simplifies to

σ²_e(θ̂ | θ) = 1 / ( ∑_{j=1}^{L} α²[pj(1 – pj)] )   (2.12)
In practice, θ̂ is substituted for θ in the IRT model. For example, for the person location
estimate of –0.8378 our SEE is 0.9900 (i.e., roughly a full logit).
Table 2.2 contains the MLE θ̂ s and their corresponding SEEs. The magnitude of the
SEE is influenced not only by the quality of the items on the instrument, but also by the
instrument’s length. The addition of items similar to those on the instrument will lead to a
decrease in the standard errors of θ̂ . For example, if we lengthen our example mathematics
test to 20 items by quadrupling the five-item set (i.e., four items at –1.9, four items at –0.6,
etc.), then our SEE for θ̂ = –0.8378 decreases to 0.4950.
We can use our SEEs to create a maximum likelihood confidence limit estimator (Birn-
baum, 1968) by

θ̂ ± z(α/2)·se(θ̂)   (2.13)

Equation 2.13 tells us the range within which we would expect θ to lie (1 – α)100% of the
time.13 For the example’s observed score of 2 with θ̂ = –0.8378, the 95% confidence band
would be [–0.8378 ± 1.96*0.9900] = [–2.7783, 1.1026]. That is, we would expect the θ that
produced an X = 2 on the instrument to lie within this interval 95% of the time. The width of
the confidence band is directly related to the degree of uncertainty about a person’s location.
A narrow interval indicates comparatively less uncertainty about a person’s location than a
wider interval.
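A brief Python sketch of Equations 2.12 and 2.13 for θ̂ = –0.8378, assuming the Rasch model (α = 1) and the item locations used in this example, reproduces both the SEE of 0.9900 and the 95% confidence band:

    import math

    delta = [-1.9, -0.6, -0.25, 0.30, 0.45]   # item locations used in this example
    theta_hat = -0.8378                        # MLE for an observed score of 2
    alpha_item = 1.0                           # common discrimination (Rasch model)

    p = [1.0 / (1.0 + math.exp(-alpha_item * (theta_hat - d))) for d in delta]
    info = sum(alpha_item ** 2 * pj * (1.0 - pj) for pj in p)   # denominator of Eq. 2.12
    see = math.sqrt(1.0 / info)                                  # approximately 0.9900
    lower, upper = theta_hat - 1.96 * see, theta_hat + 1.96 * see
    print(round(see, 4), round(lower, 4), round(upper, 4))       # about 0.99, -2.78, 1.10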
Table 2.2. MLE θ̂s and Their Corresponding SEEs for the Different Xs

X    θ̂          se(θ̂)     95% CB                Number of Iterations
0    –∞          ∞          ∞                     ∞
1    –1.9876     1.2002     –4.3401, 0.3648       4
2    –0.8378     0.9900     –2.7783, 1.1026       3
3    0.1008      0.9717     –1.8037, 2.0053       3
4    1.1796      1.1562     –1.0864, 3.4457       3
5    ∞           ∞          ∞                     ∞

Note. ∞ and –∞ symbolize infinity and negative infinity, respectively. CB, confidence band.
So far we have viewed the estimation of a person’s location from the perspective of how
uncertain we are about the person’s location. We can also take the opposing perspective.
That is, how certain are we about a person’s location or, similarly, how much information
do we have about a person’s location? From this perspective, the confidence band’s width
is indirectly related to the information we have for estimating a person’s location with an
instrument. A narrow interval indicates comparatively more information for estimating a
person’s location than does a wider interval. If we take the reciprocal of Equation 2.11,
we obtain an expression that directly specifies the “amount of information to be expected
in respect of any unknown parameters, from a given number of observations of indepen-
dent objects or events, the frequencies of which depend on that parameter” (Fisher, 1935,
p. 18).
Because the instrument’s items are our “observations,” and given the conditional
independence assumption, Fisher’s idea about information may be applied to quantify the
amount of information that the items as well as the instrument provide for estimating the
person location parameters. Following Fisher’s use of “I” to represent the concept informa-
tion, then an estimator’s information equals the reciprocal of Equation 2.11:
I(θ) = E[ (∂lnL/∂θ)² ] = –E[ ∂²lnL/∂θ² ] = 1 / σ²_e(θ)   (2.14)

By substitution of Equation 2.11 into Equation 2.14 we obtain the total information (I(θ))
provided by the instrument for estimating θ:

I(θ) = 1 / σ²_e(θ) = ∑_{j=1}^{L} [p′j]² / [pj(1 – pj)]   (2.15)
where all terms have been defined above. Equation 2.15 is also referred to as test information
or total test information. Unlike the concept of reliability that depends on both instrument
and sample characteristics, an instrument’s total information is a property of the instrument
itself (cf. Samejima, 1990). In this book the term total information is used in lieu of test
information or total test information to reflect the fact that the instrument may not necessarily
be a test.
Equation 2.15 specifies how much information an instrument provides for separating
two distinct θs, θ1 and θ2, that are in proximity to one another. By analogy, in simple linear
regression the steeper the slope of the regression line, the greater the difference between
the predicted values for two different predictor values.
(For example, imagine the slope is 0 in one case and 0.9 in another case. In the former
situation one would predict the same value for two different predictor values, whereas in
the latter one would predict two different values for two different predictor values.) The
numerator of Equation 2.15 is the (squared) slope, whereas the denominator is a reflection
of the variability at the point at which the slope is calculated. Therefore, less variability
(i.e., greater certainty) at the point at which one calculates the slope combined with a steep
slope provides more information for distinguishing between θ1 and θ2 than if one had more
variability and/or a less steep slope. Moreover, Equation 2.15 shows, all things being equal,
that lengthening an instrument leads to a concomitant increase in precision for estimating
person locations.
An instrument’s total information reflects that each of the items potentially contributes
some information to reduce the uncertainty about a person’s location independent of the
other items on the instrument. It is because of this independence that we can sum the indi-
vidual item contributions to obtain the total information (Equation 2.15). This individual
item contribution is known as item information, Ij(θ) (Birnbaum, 1968):
Ij(θ) = [p′j]² / [pj(1 – pj)]   (2.16)
(The subscript j on I signifies item information, whereas the lack of a subscript indicates
total information.) Therefore, total information equals the sum of the item information,
I(θ) = Σ Ij(θ).
For the 1PL model Equation 2.16 simplifies to

Ij(θ) = α²[pj(1 – pj)]   (2.17)

Because the product pj(1 – pj) reaches its maximum value when pj = (1 – pj), the maximum
item information for the 1PL model is α²(0.25). Figure 2.9 shows an example item information
function for an item located at 0.35 based on the Rasch model (i.e., α = 1.0). As can be seen,
the item information function is unimodal and symmetric about the item’s location with a
maximum value of 0.25.
With the 1PL model all the items exhibit the same pattern seen in Figure 2.9. Namely,
(1) an item provides its maximum information at its location, (2) the item information
function is unimodal and symmetric about δ, and (3) all items on an instrument provide
the same maximum amount of information of α²(0.25) at their respective locations. We now
apply these concepts to our instrument.
The likelihood principles outlined above for person estimation can also be applied to the
estimation of item locations. The MLE of item locations is presented in Appendix B. Using
MLE we estimated the five item locations. These estimates, δ̂s, are presented in Table 2.3
along with their corresponding standard errors. As we see, item 1 is located at δ̂1 = –2.5118,
item 2 is located at δ̂2 = –0.1592, and so on; α = 1. In general, item 1 would be considered
to be an easy item. The most difficult item on the mathematics instrument is item 5 with a
location of 1.6414.
Figure 2.10 contains the items’ corresponding IRFs. Because the items have a common
α and α is related to an IRF’s slope, it is not surprising that the IRFs are parallel to one
another. One also sees that the probability of a correct response is 0.5 at the items’ estimated
locations (i.e., an item’s location corresponds to the IRF’s point of inflexion).
To obtain an idea of how well an item and the entire instrument can estimate person
locations, we examine the item and total information. Figure 2.11 shows the information
provided by each item (Ij(θ)) and by the instrument (I(θ)) as a function of θ. As can be
seen, item 1 (nonbold solid line) provides its maximum information for estimating θ at its
location of –2.5118. As we move away from this point, in either direction, the item provides
progressively less information about θ. In fact, this item provides virtually no
information for estimating persons located at or above 2.0. Therefore, using this item and
others like it to estimate individuals with θ > 2.0 would not yield precise estimates and the
corresponding standard errors would be comparatively large.
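A rough Python sketch of these item and total information functions, assuming the Rasch model and the estimated item locations reported in Table 2.3, anticipates the description of the total information function below:

    import math

    # MLE item locations from Table 2.3
    delta_hat = [-2.5118, -0.1592, 0.3518, 1.3203, 1.6414]
    alpha_item = 1.0   # Rasch model

    def item_info(theta, d):
        """Item information for the 1PL model (Equation 2.17)."""
        p = 1.0 / (1.0 + math.exp(-alpha_item * (theta - d)))
        return alpha_item ** 2 * p * (1.0 - p)

    def total_info(theta):
        """Total information (Equation 2.15): the sum of the item informations."""
        return sum(item_info(theta, d) for d in delta_hat)

    print(round(item_info(-2.5118, delta_hat[0]), 3))   # 0.25, item 1 at its own location
    print(round(item_info(3.0, delta_hat[0]), 3))       # near 0, item 1 far above its location
    print(round(total_info(0.70), 2), round(total_info(3.0), 2))   # about 0.92 versus 0.40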
The total information function (labeled “Total”) shows that the instrument provides
its maximum information for estimating θ in a neighborhood around 0.70. As we progress
away from this neighborhood the instrument provides less information for estimating θ so
that, for example, at θ = 3.0 the instrument provides roughly half as much information for
estimating θ as it does at 0.70.14 Recalling the inverse relationship between information and
SEE, this means that person location estimates are least precise where the instrument
provides the least information.
Table 2.3. MLE δ̂s and Their Corresponding SEEs for the Five-Item Instrument

Item    δ̂           se(δ̂)
1       –2.5118     0.0246
2       –0.1592     0.0183
3       0.3518      0.0183
4       1.3203      0.0198
5       1.6414      0.0209
FIGURE 2.10. IRFs for all five items on the mathematics instrument.
FIGURE 2.11. Item and total information functions for the mathematics instrument.
For example, we might add items located at the lower end of the continuum to increase the information about individuals located at the lower
end of the continuum. If it were necessary to restrict our instrument to five items, then
we might remove one or more items from the instrument that provide redundant informa-
tion. For example, we might consider removing item 4 (δ4 = 1.3203) because it and item 5
(δ5 = 1.6414) are providing somewhat redundant information about examinee location. In
contrast, if we desired to provide better estimation above 0.7, then items located around 0.7
and greater would be included in the instrument. In this fashion we can design an instru-
ment to measure along a wide range of the continuum or, conversely, very precisely in a
narrow range by adding items located within the range of interest.
To facilitate designing an instrument with specific estimation properties, we can specify
a target total information function for the instrument. This target total information function
would provide a blueprint for the area(s) on the continuum for which we wish to provide a
high degree of precision for estimating person locations. For instance, consider a certifica-
tion or licensure situation that involves a (decision) cutpoint above which a person would
be, for example, certified and below which the person would not be certified. In this case it
would be desirable to have an instrument with a total information function that is peaked
at the cutpoint. This instrument would have enhanced capacity for distinguishing between
individuals who were located near the cutpoint. To achieve this goal we would add items
whose information maxima were at or near the cutpoint. Moreover, this greater informa-
tion would reduce our SEEs at the cutpoint and thereby reduce the confidence band width
(Equation 2.13) used to decide whether someone is above or below the cutpoint. (That is,
we could use the interval estimate from Equation 2.13, not the point estimate θ̂ , for deciding
whether an individual is above or below the cutpoint.)
Alternatively, we may wish to have equiprecise estimation across a range of θ (i.e.,
“constant” SEE). In this case, the corresponding target total information function would
resemble a rectangular (i.e., uniform) distribution across the θ range. To achieve this objec-
tive we would add items whose information maxima are located at different points along the
θ range of interest.
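As a toy illustration of working toward such a target, the following Python sketch greedily selects items from a hypothetical Rasch item pool so that each new item adds the most information at whichever target point is currently served least well; the pool, target points, and instrument length are all made-up values for illustration only:

    import math

    pool = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]  # hypothetical item locations
    targets = [-1.0, 0.0, 1.0]   # theta points where roughly equal precision is wanted
    length = 5                   # desired instrument length

    def info(theta, d):
        p = 1.0 / (1.0 + math.exp(-(theta - d)))   # Rasch item information
        return p * (1.0 - p)

    selected = []
    for _ in range(length):
        # Find the target point with the least accumulated information so far ...
        worst = min(targets, key=lambda t: sum(info(t, d) for d in selected))
        # ... and add the remaining pool item that is most informative there.
        best = max((d for d in pool if d not in selected), key=lambda d: info(worst, d))
        selected.append(best)
    print(sorted(selected))   # items spread across the targeted range

In practice, as the text notes, item selection would also be constrained to preserve content representation and other validity considerations.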
In summary, information functions may be used to design an instrument with spe-
cific characteristics. This capability takes advantage of the fact that items and persons are
located on the same continuum, as well as our capacity to assess the amount of information
for estimating person locations based solely on the item parameter estimates. Success in
developing an instrument whose observed total information function is similar to the target
information function depends on having an adequate pool of items to work with and impos-
ing constraints on the item selection to ensure that the resulting instrument has validity
with respect to the construct of interest. Theunissen (1985), van der Linden and Boekkooi-
Timminga (1989), and van der Linden (1998) contain detailed information for automating
instrument development using a target instrument information function and taking into
consideration content representation; also see Baker, Cohen, and Barmish (1988) as well
as Luecht (1998). In general, it is not possible to use CTT to design an instrument in this
fashion. Of course, after developing the instrument it would still be necessary to perform a
validation study for the instrument.
Summary
Although for some individuals the Rasch and 1PL models reflect different perspectives
for performing measurement, the models are mathematically equivalent. For some the
Rasch model is seen as the standard by which to create a measurement device. From this
perspective, for the data to be useful for measurement they must follow the model. Data
that do not exhibit fit with the model are seen as suspect and may need to be discarded.
In contrast, the 1PL model may be viewed as representing a statistical approach of trying
to model response data. In this case, if one has model–data misfit, it is the model that is
seen as suspect. For both models and for IRT in general, the graphical representation of
the predicted probabilities of a response of 1 on an item is referred to as an item response
function.
Because the 1PL model states that items have a common discrimination parameter (α),
the corresponding IRFs are parallel to one another. The item’s location (δ) on the latent
continuum is defined by the location of the IRF’s point of inflexion. In addition, we assume
that the construct of interest is unidimensional, that the data are consistent with the model’s
functional form, and that the responses are conditionally independent. The unidimensional-
ity assumption states that the observations may be explained solely on the basis of a single
latent person trait, θ. The functional form assumption states that the response data for an
item follow an ogival pattern. In the context of the 1PL model the conditional independence
assumption states that the responses to an item are independent of the responses to another
item, conditional on person location. The tenability of these assumptions for a data set
needs to be examined.
In contrast to CTT, in IRT both persons and items are located on the same continuum.
Although items are allowed to vary in their locations, an item’s capacity to differentiate
among individuals is held constant across items. This capacity is captured by the item dis-
crimination parameter. For the Rasch model α = 1.0 and for the 1PL model α may be equal
to some constant other than 1.0.
With the 1PL model the sum of person i’s responses (observed score) is a sufficient
statistic for estimating his or her location, and the sum of the responses to an item j (item
score) is a sufficient statistic for estimating the item’s location. All individuals who receive
the same observed score will obtain the same estimated person location (θ̂ ), and all items
that have the same item score will receive the same estimated location (δ̂). The accuracy of
the θ̂ and δ̂ is indexed by their respective standard errors. Smaller standard errors of esti-
mate reflect greater accuracy than do larger standard errors. Moreover, one’s uncertainty of
each parameter is reflected in the concept of information. In general, each item provides
information for estimating a person’s location. The sum of the item information functions is
the instrument’s total information function. The concept of total information can be used to
design instruments with specific psychometric properties.
In this chapter we conceptually presented maximum likelihood estimation; Appendices
A and B contain a more formal treatment for estimating the person and item parameters,
respectively. In the next chapter the estimation of person and item parameters is further
developed, using the likelihood principle presented above. Further, we demonstrate a pro-
cess that can be used in practice for obtaining IRT parameter estimates. As part of this pro-
cess we assess some of IRT’s assumptions and examine model–data fit.
Notes
1. There are other ways of conceptualizing the relationship between the person and item loca-
tions that do not involve taking their difference. For instance, Ramsay (1989) examined the ratio of
the person and item locations (i.e., θ/δ). Although Ramsay’s model is slightly more complicated than
the standard approach of examining the difference between θ and δ, it does have the benefits of allow-
ing different estimates of θ for different response patterns that yield the same observed score, as well
as a way of accounting for examinee guessing.
2. Some treatments of the Rasch model (e.g., Rasch, 1961; Wright, 1968) use the concept of an
item’s “easiness” (Εj) to represent an item’s location. In these cases, easiness may be transformed to
item difficulty, δ, by δj = –ln(Εj) (i.e., Εj = e^(–δj)).
3. To obtain this δ̂j, we use an approximation strategy based on the item's traditional item dif-
ficulty, Pj; this approach is discussed in Chapter 5 and Appendix C. Specifically, δj is equal to the
z-score that delimits an area above it that equals Pj. The Pj for the item in Figure 2.2 is 0.644. From
a standard unit normal curve table we find that a z = –0.37 corresponds to the point above which lies
0.644 of the area under the normal curve. Therefore, the estimate of the item’s location is –0.37. A
more sophisticated approach for estimating an item’s location is presented in Appendix B.
4. The first derivative, p′j, of Equation 2.7 is

p′j = α[pj(1 – pj)] = α e^(α(θ – δj)) / (1 + e^(α(θ – δj)))²   (2.18)

Evaluating this derivative at the item's location (i.e., where θ – δj = 0) gives

p′j = α e^(α(0)) / (1 + e^(α(0)))² = α e⁰ / (1 + e⁰)² = 0.25α   (2.19)
Therefore, strictly speaking, α in Equation 2.7 is proportional to the slope of the tangent line to the
IRF at δj. Note that, by convention, we define the IRF’s slope to be the one calculated at the item’s
location (i.e., an IRF has different slopes at different points along the function).
5. Strictly speaking, the IRFs are parallel when α is constant across items with different locations
and when the lower asymptote of the corresponding IRFs is constant (e.g., the IRF lower asymptotes
equal 0.0). Moreover, it is the tangent lines to the IRF at δj for the items that are parallel.
6. This philosophy is similar to the Guttman Scalogram (Guttman, 1950) technique for measur-
ing, for example, attitudes. In scalogram analysis if one can successfully apply the technique, then
the resulting scale is unidimensional and has certain known properties. However, there is no guar-
antee that one will be able to successfully apply scalogram analysis to a particular data set. In short,
simply because one would like to measure a particular construct does not necessarily mean one will
be successful. This standard is greater than simply asking a series of questions—the data must fit a
scalogram in order for the creation of an instrument to be considered successful. (Strictly speaking,
there is some latitude.) Similarly, using the Rasch model as the standard for being able to measure a
construct means that the data must fit the Rasch model. Whether the unidimensional scale produced
by the Rasch model is meaningful or useful is a validity question. The Rasch model differs from the
Guttman Scalogram model most notably in that the Rasch model is a probabilistic model, whereas the
scalogram model is deterministic.
7. McDonald (McDonald, 1979, 1981; McDonald & Mok, 1995) has asserted that there are two
principles of conditional independence. The first is the strong principle of conditional (local) indepen-
dence and the second is the weak principle of conditional (local) independence. The strong principle is
the one defined by Lord and Novick (1968); that is, after taking into account (conditioning on) all the
relevant latent variables, the item responses for all subsets of items are mutually statistically indepen-
dent. The weak principle states that the covariability between two item vectors is zero after condition-
ing on all the relevant latent variables (i.e., pairwise conditional independence). McDonald refers to
these two forms as principles rather than assumptions because “the (strong or weak) principle of local
independence is not an assumption, but serves to provide the mathematical definition of latent traits”
(McDonald & Mok, 1995, p. 25). The existence of weak conditional independence is a necessary, but
not sufficient, condition for the existence of strong conditional independence.
The weak principle of conditional independence is the basic assumption that underlies com-
mon factor analysis (McDonald, 1979). This connection with factor analysis and the capability of
factor analysis to distinguish between major and minor factors provides another way of looking at the
dimensionality assumption. Essential dimensionality (dE; Stout, 1990) is the minimal number of major
factors necessary for a weakly monotone latent model to achieve essential independence (EI). Stout
(1990) states that EI exists when the conditional covariances between items, on average, approach
zero as test length becomes infinite. When dE equals 1, then one has essentially unidimensionality.
Stout (1987) developed a statistical test, T, to detect departure from (essential) unidimensionality in
a data set. This approach is implemented in DIMTEST; DIMTEST is available from the Assessment
Systems Corporation (assess.com/xcart/home.php). Nandakumar (1991) and Hattie, Krakowski, Rog-
ers, and Swaminathan (1996) contain readable presentations of the DIMTEST approach to assessing
dimensionality.
8. This idea of logically consistent response patterns or ideal response patterns is seen in a per-
fect Guttman scale (Guttman, 1950) or in Walker’s (1931) unig test. A unig test consists of observed
scores, X, composed of the correct answers to the X easiest items. Moreover, the observed score X + 1
contains the correct answers of score X plus one more. In the context of the 1PL model, the condition
for an ideal response pattern for person i is xij ≥ xiv when pij ≥ piv, where j and v are two items. For
example, assuming three items are ordered from easy to hard, then the logically consistent patterns
are 000 (all items incorrect), 100 (easiest item correct, harder items incorrect), 110 (hardest item
incorrect, easier items correct), and 111 (all items correct). These ideal response patterns are also
known as Guttman patterns or conformal patterns. Although our instrument does not conform to a
unig test (or a perfect Guttman scale), the ideal response patterns (i.e., 00000, 10000, 11000, 11100,
11110, 11111) do have the largest frequency for a given observed score, X. For instance, for X = 1 (i.e.,
the patterns 10000, 01000, 00100, 00010, 00001), the 10000 pattern has the largest frequency. There
are C_X^L = L! / [X!(L – X)!] different pattern combinations for a given X score.
9. According to Fisher (1971b), “if θ [is] the parameter to be estimated, θ1 a statistic which con-
tains the whole of the information as to the value of θ, which the sample supplies, and θ2 any other
statistic, then . . . when θ1 is known, knowledge of the value of θ2 throws no further light upon the
value of θ” (pp. 316–317). By definition, a statistic or estimator (y) for a parameter is sufficient if it
contains all of the information regarding the parameter that can be obtained from the sample (i.e., no
further information about the parameter can be obtained from any other sample statistic (e.g., y′) that
is functionally independent of y). In short, a sufficient statistic represents a form of data reduction that
preserves the information contained in the data. To demonstrate the existence of sufficient statistics
for the Rasch model, we follow Wright and Stone (1979). Starting with an N × L data matrix X whose
entries xij represent the binary response on the jth item by the ith person, then the marginal totals
(i.e., row and column sums) for X are
person i's observed score (i.e., row i in X): Xi = ∑_{j=1}^{L} xij

and item j's item score (i.e., column j in X): qj = ∑_{i=1}^{N} xij

Assuming conditional independence, the probability of the data matrix X under the Rasch model is

P(X | θ, δ) = ∏_{i=1}^{N} ∏_{j=1}^{L} e^(xij(θi – δj)) / (1 + e^(θi – δj))

Converting these products to sums in the exponent and then factoring the numerator, we have

P(X | θ, δ) = exp( ∑_i Xi θi ) exp( –∑_j qj δj ) / ∏_{i=1}^{N} ∏_{j=1}^{L} (1 + exp[θi – δj])

where we use "exp[z]" instead of "e^z" to simplify presentation. This numerator has the form of suf-
ficient statistics: one for estimating θi and the other for estimating δj. (See Fisher (1971b) for the
condition that must be satisfied for the existence of a sufficient statistic.) Specifically, the observed
score Xi (= ∑_{j=1}^{L} xij) is a sufficient statistic for estimating θi and the item score qj (= ∑_{i=1}^{N} xij) is a sufficient
statistic for estimating δj (also see Rasch, 1980).
The sufficient statistic Xi is the sum of person i’s responses to the items and does not involve
item characteristics (e.g., item locations or discriminations). This is why the 10 patterns with an Xi =
2 (Figure 2.7) all have the same θ̂ . Similarly, the sufficient statistic qj is the sum of the responses on
item j and involves only item responses and not person characteristics.
10. This does not mean that an alternative estimation procedure that does not involve the cri-
terion of maximizing the likelihood function will also yield the same θ̂ for all response patterns
that produce a given observed score, X. For example, the minimum χ² estimation method seeks to
estimate the parameters that minimize a χ² criterion (i.e., maximize data–model fit). (See Linacre
(2004) or Baker and Kim (2004) for more information on this estimation method.) Utilizing this
procedure, “persons with different response patterns, but with the same raw [observed] score, obtain
different measures [θ̂s] characterized by different standard errors, and similarly for items” (Linacre,
2004, p. 6).
11. Strictly speaking, one has only the sample standard error of estimate (se( θ̂)); that is, an esti-
mate of σe(θ̂).
12. The standard error of measurement is a global measure of error across the manifest variable
score metric for a given sample. It may (and most likely will) over- or underestimate the degree of
error at different points along the score metric (Feldt & Brennan, 1989; Livingston, 1982). In our
example, there are six possible Xis, but there is only one standard error of measurement. One could
calculate a conditional standard error of measurement at each Xi between the zero and perfect scores
that would provide a more accurate reflection of the error at that particular score than would the stan-
dard error of measurement. See Livingston (1982), Kolen, Hanson, and Brennan (1992), and Kolen,
Zeng, and Hanson (1996) for approaches to calculating a conditional standard error of measurement.
These conditional standard errors would be the same for everyone who obtained the same Xi score.
13. The use of z in this formula is based on the fact “that a maximum likelihood estimator has
approximately (asymptotically) the normal distribution with mean θ, the true ability value, and vari-
ance 1/I(θ), under conditions satisfied by most of the models” (Cramér, 1946, p. 500; cited in Birn-
baum, 1968, p. 457).
14. As Lord (1980) points out, “information is not a pure number; the units in terms of which
information is measured depend on the units used to measure ability” (p. 87). Therefore, the fore-
going description of the locations of item and total information maxima as well as the shape of the
information function is tied to the particular θ metric used and should not be considered absolute.
3
Joint Maximum Likelihood Parameter Estimation
In Chapter 2 we introduced the Rasch model. As part of its presentation we discussed using
the likelihood of the observed responses as a mechanism for estimating the model’s param-
eters; also see Appendices A and B. There are, in fact, multiple approaches that use a likeli-
hood function for estimating item and person parameters. In this chapter one approach is
presented and in Chapter 4 a second procedure, marginal maximum likelihood estimation,
is discussed.
Chapter 2’s (as well as Appendix A’s) presentation of estimating person locations
assumes that the item parameters are known. Similarly, in Appendix B the estimation of
item locations assumes that the person locations are known. However, in practice, neither
the person nor the item locations are known. In this chapter we conceptually introduce a
procedure for the simultaneous estimation of both item and person parameters. We then
present an example of applying this estimation procedure and some of the steps involved in
assessing model–data fit, such as dimensionality and invariance assessment.
Various strategies have been developed to solve the problem of estimating one set of
parameters (e.g., the person set) without knowledge of another set of parameters (e.g., the
item set). One of these approaches maximizes the joint likelihood function of both persons
and items in order to simultaneously estimate both person and item parameters. This
strategy is referred to as joint maximum likelihood estimation (JMLE). To arrive at the joint
likelihood function for persons and items, we begin with the likelihood function for persons.
Assuming conditional independence and (without loss of generalizability) dichotomous
responses, the probability of a person’s responses is simply the product of the probability
of the responses across an instrument’s items. Symbolically, this can be represented for an
L-item long instrument by
p(x | θ, α, δ) = ∏_{j=1}^{L} pj^(xj) (1 – pj)^(1 – xj)   (3.1)
The term p(x|θ, α, δ) is the probability of the response vector, x, conditional on the person’s
location, θ, item discrimination α, and on a vector of item location parameters, δ (i.e., δ =
(δ1, . . . , δL)). The probability for item j, pj, is calculated according to a particular model
(e.g., the 1PL model).
To obtain the joint likelihood function, L, across both persons and items, one multiplies
Equation 3.1 across the N persons:
L = ∏_{i=1}^{N} ∏_{j=1}^{L} pj(θi)^(xij) (1 – pj(θi))^(1 – xij)   (3.2)
Recall from Chapter 2 that to avoid numerical accuracy issues, the likelihood function is
typically transformed by using the natural log (ln). Therefore, by applying the natural log
transformation to Equation 3.2 we obtain the joint log likelihood function:
lnL = ∑_{i=1}^{N} ∑_{j=1}^{L} [ xij ln(pj(θi)) + (1 – xij) ln(1 – pj(θi)) ]   (3.3)
The values of the θs and δs that maximize Equation 3.3 are taken as the person and item
parameter estimates, respectively. These estimates are determined by setting the first
derivatives of lnL to zero in a fashion similar to what is presented in Appendices A and B.
This strategy of maximizing the joint likelihood function proceeds in a series of steps
and stages. (For simplicity we assume the Rasch model in describing these steps.) In step
1 the item locations are estimated using provisional estimates of person locations. These
provisional person location estimates are treated as “known” for purposes of estimating
the item locations. The estimation of item locations is done first because, typically, one has
substantially more persons than items and thereby there is more information for estimating
the item locations. Because the estimation of one item’s parameter does not depend on the
parameters of other items, the items are estimated one at a time. In step 2 these estimates
are treated as “known” and are used for estimating person locations. Each person’s location
is estimated independently of those of the other individuals.
In step 1 the estimation of the item locations used provisional estimates of person loca-
tions. However, after step 2 these provisional person estimates have been improved and
these improved estimates should lead to more accurate item location estimates. Therefore,
in our second stage, step 1 is repeated using the improved estimates of person locations from
step 2 to obtain better estimates of the item locations. With these improved item location
estimates, step 2 is repeated using the improved item location estimates to improve the per-
son location estimates. This “ping-pong” stage process continues until successive improve-
ments in the person and item location estimates are considered “indistinguishable” from one
another (i.e., these improvements are less than the convergence criterion; see Appendix A).
See Baker and Kim (2004) and Lord (1980) for further details and the relevant equations.
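The following Python sketch conveys the flavor of this back-and-forth procedure for the Rasch model. It is a bare-bones illustration, not any particular program's algorithm: it assumes a complete 0/1 response matrix from which zero and perfect person scores (and items answered correctly by no one or by everyone) have already been removed, uses one Newton step per parameter per stage, and fixes the metric by item centering (discussed below):

    import numpy as np

    def jmle_rasch(X, stages=100, tol=1e-4):
        """Toy JMLE for the Rasch model. X is an N x L matrix of 0/1 responses."""
        N, L = X.shape
        r = X.mean(axis=1)                     # proportion correct per person
        s = X.mean(axis=0)                     # proportion correct per item
        theta = np.log(r / (1 - r))            # provisional person locations (logits)
        delta = np.log((1 - s) / s)            # provisional item locations (logits)
        delta -= delta.mean()                  # item centering fixes the metric
        for _ in range(stages):
            p = 1 / (1 + np.exp(-(theta[:, None] - delta[None, :])))
            # Step 1: improve the item locations, treating the thetas as known.
            delta_new = delta + (p - X).sum(axis=0) / (p * (1 - p)).sum(axis=0)
            delta_new -= delta_new.mean()      # re-center after the item step
            p = 1 / (1 + np.exp(-(theta[:, None] - delta_new[None, :])))
            # Step 2: improve the person locations, treating the deltas as known.
            theta_new = theta + (X - p).sum(axis=1) / (p * (1 - p)).sum(axis=1)
            done = max(np.abs(theta_new - theta).max(),
                       np.abs(delta_new - delta).max()) < tol
            theta, delta = theta_new, delta_new
            if done:                           # successive improvements "indistinguishable"
                break
        return theta, delta

An operational implementation would add refinements that are omitted from this sketch, such as better starting values, convergence safeguards, and corrections for the well-known bias of JMLE estimates.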
On occasion one encounters the term unconditional maximum likelihood estimation
(UCON). UCON is a synonymous term for JMLE. The term UCON has been used by Ben
Wright (e.g., Wright & Stone, 1979) to distinguish this approach from another estima-
tion approach used with the Rasch model called conditional maximum likelihood estimation,
or CMLE (Andersen, 1972). (CMLE takes advantage of the separability of the person and
item parameters in the Rasch model to condition the likelihood function on the Rasch
model’s sufficient statistics. The result provides consistent maximum likelihood estimators
of δ [Andersen, 1972]. CMLE can only be used with the Rasch model and its various exten-
sions.)
JMLE (UCON) is used in estimation programs such as WINSTEPS (Linacre, 2001a),
BIGSTEPS (Linacre & Wright, 2001), FACETS (Linacre, 2001b), and Quest (Adams &
Khoo, 1996), as well as in LOGIST (Wingersky, Barton, & Lord, 1982). CMLE is one of
two estimation techniques available in OPLM (Verhelst, Glas, & Verstralen, 1995). More
technical information about JMLE and CMLE estimation may be found in Baker and Kim
(2004).
Although with CTT the trait score’s metric is specified by the expectation of the observed
scores, in IRT this is not the case. To understand why this is so, consider our development
of the Rasch model. In developing the model our concern is only with the distance between
the person and an item’s locations, (θ – δ). If θ = 2 and δ = 1, then this distance is 1.0 and
the probability of a response of 1 with the Rasch model would equal 0.7311. However, for
any θ and δ whose difference equals 1.0 the Rasch model would result in exactly the same
probability of a successful response (e.g., if θ' = 50 and δ' = 49, then pj = 0.7311). This
raises the question, “Should the person be located at 2 and the item located at 1, or should
the person be located at 50 and the item located at 49?” In other words, because multiple
values of θ and δ lead to the same probability, the continuum’s metric is not absolute, but
rather it is relative and nonunique. In IRT we have an indeterminacy in the origin and unit
of measurement of the metric. In short, our metric is determined or is unique only up to a
linear transformation. For our example the linear transformation for the θs is θ* = θ(1) + 48
and for the δs it is δ* = δ(1) + 48. This property is referred to as the indeterminacy of scale,
indeterminacy of metric, or the model identification problem.
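A two-line Python check makes the point concrete; the parameter values are those from the example above:

    import math

    def p(theta, delta):
        # Rasch model probability of a correct response
        return 1 / (1 + math.exp(-(theta - delta)))

    print(p(2, 1), p(50, 49))   # both print 0.7310..., only the difference theta - delta matters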
An implication of this indeterminacy is the need to fix or anchor the metric to a par-
ticular origin to estimate the model’s parameters (i.e., the item or the person parameters);
with the Rasch model the unit of measurement is fixed at 1.0. Because people and items are
located on the same continuum, we only need to fix either the person or the item locations.
One method for fixing the metric, person centering, sets the mean of the θ̂s to 0.0 after each step of person location estimation. A second approach, item centering, sets the mean of the δ̂s to 0.0 after each step of item estimation.1 Although these two approaches will most likely
result in different estimates, the relative nature of the metric means that model–data fit will
not be adversely affected by the centering approach used.
To see the effect of this indeterminacy with the JMLE algorithm, recall that initially the
item locations are estimated, given provisional estimates of the person locations. In the next
step the person locations are estimated relative to the item location estimates. Subsequently,
the item locations are reestimated relative to the new estimates of person locations, and so
on across the stages. Because with each step the estimation proceeds relative to the improved
estimates from the previous step, the metric’s mean begins to drift across the steps. There-
fore, for the JMLE algorithm the metric of either the person or item locations needs to be
fixed after each step. For instance, to exemplify item centering, we would first estimate the item locations, calculate the mean δ̂, and then subtract it from each δ̂. This produces a metric centered at 0.0, and the centering is performed after the relevant estimation step. Centering is simply an application of a linear transformation.
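As an illustration, item centering reduces to a single subtraction applied after each item estimation step; the estimates shown are hypothetical:

import numpy as np

delta_hat = np.array([-0.42, 0.13, 0.95, 1.10])  # hypothetical item location estimates
delta_hat = delta_hat - delta_hat.mean()         # item centering: the mean is now 0.0
# Person centering would instead subtract the mean of the theta estimates; either
# choice merely relocates the metric's origin and does not affect model-data fit.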
Of the various commonly available IRT programs, WINSTEPS (Linacre, 2001a), BIG-
STEPS (Linacre & Wright, 2001), and FACETS (Linacre, 2001b) all use item centering.
An older program, LOGIST, uses person centering. Newer programs such as BILOG-MG
(Zimowski, Muraki, Mislevy, & Bock, 2003), BILOG 3 (Mislevy & Bock, 1997), PARSCALE
(Muraki & Bock, 2003), and MULTILOG (Thissen, Chen, & Bock, 2003) use a variant of
person centering based on the posterior latent variable (person) distribution (Baker, 1990);
this approach is discussed in Chapter 4.
The process of obtaining estimates of person and item parameters is known as calibration.
Wright (1977a) stated “that calibration sample sizes of 500 are more than adequate in
practice and that useful information can be obtained from samples as small as 100” (p. 224).
However, in this latter situation one has less calibration precision than with larger sample
sizes (Wright, 1977a) as well as a loss of power for detecting model–data misfit (Whitely,
1977).2 That is, the degree of model–data misfit that one is willing to tolerate should be
taken into consideration when discussing calibration sample sizes. We feel that determining the calibration sample size should also take into account the sample size requirement(s) of any ancillary technique(s), such as the methods to be used for dimensionality assessment (e.g., factor analysis). For example, there are various rules of thumb for factor analysis sample sizes, such as that the number of persons should be from 3 to 10 times the number of items; these ratios interact with the magnitude of the factor loadings. The 10:1
ratio is a very common suggestion, whereas the smaller ratios are applicable only when the
communalities are large (cf. MacCallum, Widaman, Zhang, & Hong, 1999). As such, our
calibration sample size is affected by the requirements for performing the factor analysis.
It cannot be stressed enough that sample size guidelines should not be interpreted
as hard-and-fast rules. Specific situations may require more or fewer persons than other
situations, given the (mis)match between the instrument’s range of item locations and the
sample’s range of person locations, the desired degree of estimation accuracy of both items
and persons, and pragmatic issues concerning model–data fit, ancillary technique sample
size requirements, estimation technique, the amount of missing data, and generalizability.
For instance, if one is interested in establishing norms for a particular population, then the
representativeness of the sample would be paramount. This may require obtaining a large
sample in order to formulate a convincing argument to support the utility of the norms. As
such, it may be tempting to simply adopt the philosophy that one should have as large a
sample size as one can obtain. However, the increased cost of collecting a large sample may
not always be justified. For example, if one is performing a survey, then approximately 1200
randomly sampled individuals may be sufficient, provided that the size of the population is
large (Dillman, 2000). Another consideration is that the smaller the sample size, the higher the probability, all other things being equal, that everyone will provide the same response (e.g., a response of 0) to one or more items. When this happens, the item’s location cannot be estimated. The same problem may occur with “short” instruments. That is, with a short instrument there is an increased chance of a person providing the same response (e.g., 1) to all items, and such an individual’s location cannot be estimated using MLE. Given the foregoing
caveats and considerations, a rough sample size guideline is that a calibration should have
a few hundred respondents. This should not be interpreted as a minimum, but rather as a
desirable target. Certain applications may require more respondents, whereas in others a
smaller sample may suffice.
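As a simple pre-calibration screen, one can flag the items and persons whose responses are constant and whose locations therefore cannot be estimated by MLE. This sketch assumes a 0/1 data matrix and is not part of any calibration program:

import numpy as np

def flag_constant(X):
    """Indices of items (columns) and persons (rows) whose 0/1 responses are all identical."""
    n_persons, n_items = X.shape
    item_sums = X.sum(axis=0)
    person_sums = X.sum(axis=1)
    bad_items = np.where((item_sums == 0) | (item_sums == n_persons))[0]
    bad_persons = np.where((person_sums == 0) | (person_sums == n_items))[0]
    return bad_items, bad_persons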
To demonstrate the application of the Rasch model for measurement we use the mathematics
data introduced in Chapter 2 (Table 2.1). We use JMLE for estimating our item and person
parameters. In Chapter 4 we reanalyze these data using a different program, BILOG-MG, to
demonstrate a second estimation technique.
For this example, we use the BIGSTEPS software program. This calibration program
implements JMLE for the Rasch model and certain extensions of it. In this chapter we use
some of the program’s features and introduce additional features in Chapter 7 when it is
used with polytomous data. One distinguishing feature of BIGSTEPS is its lack of a GUI, or
graphical user interface (i.e., menus and dialogs). We recognize that many readers are more
comfortable using a program with a GUI. An alternative program, WINSTEPS, uses a GUI
and can accomplish everything that BIGSTEPS can and more. However, our decision to use
BIGSTEPS is based on the fact that it is free (WINSTEPS is not) and that its output is similar
enough to WINSTEPS’ that the reader can easily make the transition to WINSTEPS. A third
program, MINISTEPS, is a free “student version” of WINSTEPS that cannot be used with
our data, owing to the program’s limitations on the number of items and cases, but that can
suffice for classroom examples.
The application of IRT for measuring a variable requires three categories of activi-
ties. The first category consists of calibration-related activities. A second category involves
model–data fit assessment, and the third category requires obtaining validity evidence. To
varying degrees, all three categories inform one another. We use the term category rather
than, say, stages, to emphasize that the corresponding activities are not necessarily done
in sequence (i.e., category 1 activities precede category 2 activities, etc.). In the following
discussion the primary emphasis is on the first two categories. These two categories are dis-
cussed first and then followed by an example.
Category 1 involves the calibration-related activities. Obviously, the entire process
begins with the construction of an instrument. The instrument is pilot tested, refined (if
necessary), and then administered to the individuals of interest. After this administration
the data are inspected for anomalous responses (e.g., miscoded responses, multiple respons-
es to an item, etc.) and, if necessary, appropriately corrected. After this “cleaning” of the
data, we perform some of the category 2 activities, and these, in turn, lead to our calibration
(category 1).
Category 2 consists of model–data fit activities. Some of these activities are performed
prior to the calibration, whereas others occur after the calibration. Although there are some
model–data fit activities that transcend individual programs, some approaches for assessing
model–data fit are easier to perform with some calibration programs than with others. For
example, the assessment of the data’s dimensionality and the examination of invariance of
parameter estimates (discussed below) are examples of activities that transcend individual
calibration programs. In contrast, specific fit statistics/indices and graphical assessment of
fit tend to be particular to a specific program because individual programs provide different
fit information.
The following example begins with an assessment of the tenability of the IRT unidimen-
sionality assumption. In short, and in the context of the Rasch model, the question being
asked is, “Do our data conform to the unidimensional Rasch model?” After we perform our
dimensionality assessment we proceed to the data’s calibration. As part of this calibration
various fit statistics are produced at both the model and item/person levels. The model-level
fit statistics are examined first, followed by the item-level fit statistics. In some cases, model-
level misfit may be diagnosed by examining the item-level fit statistics. The final step in our
fit examination for this example involves assessing the invariance of the item parameter
estimates. This is followed by a brief discussion of obtaining validity evidence for the instru-
ment (i.e., category 3 activities).
Dimensionality Assessment
In assessing dimensionality our interest is in the number of content-oriented factors. The
traditional approach for assessing dimensionality involves the factor analysis of a correlation
matrix. Whenever one factor analyzes a correlation matrix derived from binary data, there
is a possibility of obtaining artifactual factor(s) that are related to the nonlinearity between
the items and the common factors. These “factors of curvilinearity” have sometimes been
referred to as “difficulty” factors and are not considered to be content-oriented factors
(Ferguson, 1941; McDonald, 1967; McDonald & Ahlawat, 1974; Thurstone, 1938). To
avoid extracting these difficulty factors, McDonald (1967) suggests the use of nonlinear
factor analysis. Because our data are dichotomous, we use this nonlinear approach for our
dimensionality analysis. This nonlinear strategy is implemented in the program NOHARM
(Fraser, 1988; Fraser & McDonald, 2003).
An alternative to using NOHARM is to use TESTFACT (Wood et al., 2003) and per-
form full-information factor analysis (Muraki & Engelhard, 1985; Wood et al., 2003). We
selected NOHARM over TESTFACT on the basis of research showing that NOHARM per-
forms well in dimensionality recovery studies (e.g., De Champlain & Gessaroli, 1998; Finch
& Habing, 2005; Knol & Berger, 1991) and because it is available at no cost. For additional
information on approaches for assessing unidimensionality, see Hattie (1985) and Panter,
Swygert, Dahlstrom, and Tanaka (1997).
NOHARM (Normal Ogive Harmonic Analysis Robust Method) is a general program
that takes advantage of the relationship between nonlinear factor analysis and the normal
ogive model in order to fit unidimensional and multidimensional normal ogive models (see
Appendix C). In the current context, we fit one- and two-dimensional two-parameter (2P)
models to the data. To determine which dimensional solution is “best,” the differences in fit
among the models are examined. Because the models we are fitting do not address guessing,
we are assuming that the response data are not influenced by guessing. Moreover, we are also assuming that the latent trait is normally distributed (or, for the two-dimensional solution, multivariate normal; McDonald, 1981). Rather than calculate the raw product–moment matrix on more than 19,600 individuals separately for each NOHARM analysis, we calculate it once and then use it as input to both analyses; Chapter 10 shows how to perform the analysis with individual case data.3
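For reference, the unidimensional two-parameter normal ogive model that NOHARM fits may be written, in one common parameterization with a discrimination parameter $\alpha_j$ (see Appendix C), as

$$P(x_j = 1 \mid \theta) = \Phi\bigl(\alpha_j(\theta - \delta_j)\bigr),$$

where $\Phi$ is the standard normal cumulative distribution function; the multidimensional version replaces $\alpha_j(\theta - \delta_j)$ with a weighted composite of the latent dimensions.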
The raw product–moment matrix, P, is obtained by X'X(1/N), where N is the number
of cases and X is the data matrix. If we let X consist of binary responses, then P contains
the item means or traditional item difficulties, Pjs, along its main diagonal and the sums
of product terms divided by N as its off-diagonal elements. For example, for a three-item
instrument the raw product–moment matrix is
$$
\mathbf{P} \;=\;
\begin{bmatrix}
P_1 & \left(\sum X_1 X_2\right)/N & \left(\sum X_1 X_3\right)/N \\
\left(\sum X_2 X_1\right)/N & P_2 & \left(\sum X_2 X_3\right)/N \\
\left(\sum X_3 X_1\right)/N & \left(\sum X_3 X_2\right)/N & P_3
\end{bmatrix}
\qquad (3.4)
$$
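For readers who wish to compute P themselves, a minimal sketch (the file name and layout are hypothetical):

import numpy as np

X = np.loadtxt("responses.dat")   # hypothetical N x J matrix of 0/1 responses
N = X.shape[0]
P = (X.T @ X) / N                 # raw product-moment matrix of Equation 3.4
# Main diagonal: item means (the traditional item difficulties, the Pj's).
# Off-diagonals: sums of cross-products divided by N, e.g., (sum of X1*X2)/N.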
In its output, NOHARM provides a residual matrix to facilitate model–data fit analysis. The residual matrix is the discrepancy between
the observed covariances and those of the items after the model has been fitted to the data.
Therefore, the ideal situation is one in which the discrepancies are zero. In general, our unidimensional solution’s residuals are small relative to the item covariances. Moreover, examination of the residual matrix does not reveal any large residuals. To summarize
the residual matrix, NOHARM provides its root mean square (RMS). The RMS is the square
root of the average squared difference between the observed and predicted covariances.
Therefore, small values of RMS indicate good fit. This overall measure of model–data misfit
may be evaluated by comparing it to four times the reciprocal of the square root of the sam-
ple size (i.e., the “typical” standard error of the residuals; McDonald, 1997). For these data
this criterion is 0.0286. A second measure is Tanaka’s (1993) goodness-of-fit index (GFI);
Chapter 10 contains a description of the GFI. McDonald (1999) suggests that a GFI of 0.90
indicates an acceptable level of fit and a value of 0.95 indicates “good” fit; GFI = 1 indicates
perfect fit. Therefore, according to these indices and in light of the residuals, there does not
appear to be sufficient evidence to reject a unidimensional solution.
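To make these criteria concrete, the following sketch computes an RMS from a set of residuals and McDonald’s (1997) comparison value for the present sample size; the residuals themselves would be taken from the NOHARM output:

import numpy as np

def rms(residuals):
    """Root mean square of the residuals (observed minus model-implied covariances)."""
    r = np.asarray(residuals, dtype=float)
    return float(np.sqrt(np.mean(r ** 2)))

N = 19601                      # calibration sample size for these data
criterion = 4.0 / np.sqrt(N)   # the "typical" standard error of the residuals
print(round(criterion, 4))     # 0.0286, the value cited in the text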
The subsequent NOHARM analysis for obtaining the two-dimensional solution is per-
formed by modifying the second line in the input command file to be “5 2 19601 1 1
0 0 0”.4 As is the case with the one-dimensional solution, the residuals are small relative to the item covariances, and the residual matrix does not reveal any large values. The solution’s RMS of 0.00102 is substantially less than the criterion of
0.0286. Not surprisingly, as the dimensionality of the models increased, the corresponding
residuals decreased and, therefore, so did the RMS. With respect to Tanaka’s index, the two-
dimensional solution’s value is 0.9999. Although the two-dimensional 2P model has the lower RMS and the larger Tanaka index, the application of Occam’s razor leads us not to reject the unidimensional model of the data. Therefore, we conclude that our unidimensional
model is a sufficiently accurate representation of the data to proceed with the IRT calibra-
tion.5 Appendix E contains an approximate chi-square statistic that can be used to supple-
ment NOHARM’s fit information.