Frederic M. Lord - Applications of Item Response Theory To Practical Testing Problems (1980)
FREDERIC M. LORD
Educational Testing Service
Routledge, Taylor & Francis Group
270 Madison Avenue, New York, NY 10016
2 Park Square, Milton Park, Abingdon, Oxon OX14 4RN
Publisher's Note
The publisher has gone to great lengths to ensure the quality of this reprint
but points out that some imperfections in the original may be apparent.
Preface
The topics, organization, and presentation are those used in a 4-week seminar
held each summer for the past several years. The material is organized primarily
to maintain the reader's interest and to facilitate understanding; thus all related
topics are not always packed into the same chapter. Some knowledge of classical
test theory, mathematical statistics, and calculus is helpful in reading this material.
Chapter 1, a perspective on classical test theory, is perhaps not essential for
References
Cohen, A. S. Bibliography of papers on latent trait assessment. Evanston, Ill.: Region V Technical
Assistance Center, Educational Testing Service Midwestern Regional Office, 1979.
Warm, T. A. A primer of item response theory. Technical Report 941078. Oklahoma City, Okla.:
U.S. Coast Guard Institute, 1978.
FREDERIC M. LORD
I INTRODUCTION TO ITEM RESPONSE THEORY
1 Classical Test Theory—Summary and Perspective
1.1. INTRODUCTION
This chapter is not a substitute for a course in classical test theory. On the
contrary, some knowledge of classical theory is presumed. The purpose of this
chapter is to provide some perspective on basic ideas that are fundamental to all
subsequent work.
A psychological or educational test is a device for obtaining a sample of
behavior. Usually the behavior is quantified in some way to obtain a numerical
score. Such scores are tabulated and counted. Their relations to other variables of
interest are studied empirically.
If the necessary relationships can be established empirically, the scores may
then be used to predict some future behavior of the individuals tested. This is
actuarial science. It can all be done without any special theory. On this basis, it is
sometimes asserted from an operationalist viewpoint that there is no need for any
deeper theory of test scores.
Two or more "parallel" forms of a published test are commonly produced.
We usually find that a person obtains different scores on different test forms.
How shall these be viewed?
Differences between scores on parallel forms administered at about the same
time are usually not of much use for describing the individual tested. If we want a
single score to describe his test performance, it is natural to average his scores
across the test forms taken. For usual scoring methods, the result is effectively
the same as if all forms administered had been combined and treated as a single
test.
The individual's average score across test forms will usually be a better
measurement than his score on any single form, because the average score is
based on a larger sample of behavior. Already we see that there is something of
deeper significance than the individual's score on a particular test form.
$$\rho_{XT}^2 \equiv \frac{\sigma_{XT}^2}{\sigma_X^2 \sigma_T^2} = \frac{\sigma_T^2}{\sigma_X^2} = 1 - \frac{\sigma_E^2}{\sigma_X^2} . \qquad \text{(1-6)}$$
If ρXT were nearly 1.00, we could safely substitute the available test score X for
the unknown measurement of interest T.
Equations (1-2) through (1-6) are tautologies that follow automatically from
the definition of T and E.
What has our deeper theory gained for us? The theory arises from the realization that T, not X, is the quantity of real interest. When a job applicant leaves
the room where he was tested, it is T, not X, that determines his capacity for
future performance.
We cannot observe T, but we can make useful inferences about it. How this is
done becomes apparent in subsequent sections (also, see Section 4.2).
An example will illustrate how true-score theory leads to different conclusions
than would be reached by a simple consideration of observed scores. An
achievement test is administered to a large group of children. The lowest scoring
children are selected for special training. A week later the specially trained
children are retested to determine the effect of the training.
True-score theory shows that a person may receive a very low test score either
because his true score is low or because his error score E is low (he was
unlucky), or both. The lowest scoring children in a large group most likely have
not only low T but also low E. If they are retested, the odds are against their being
so unlucky a second time. Thus, even if their true scores have not increased, their
observed scores will probably be higher on the second testing. Without true-score
theory, the probable observed-score increase would be credited to the special
training. This effect has caused many educational innovations to be mistakenly
labeled "successful."
It is true that repeated observations of test scores and retest scores could lead
the actuarial scientist to the observation that in practice, other things being equal,
initially low-scoring children tend to score higher on retesting. The important
point is that true-score theory predicts this conclusion before any tests are given
and also explains the reason for this odd occurrence. For further theoretical
discussion, see Linn and Slinde (1977) and Lord (1963). In practical applications, we can determine the effects of special training for the low-scoring children by splitting them at random into two groups, comparing the experimental group that received the training with the control group that did not.
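The effect is easy to reproduce in a small simulation. The sketch below is illustrative only (the sample size, score scale, and error variance are assumed, not taken from the text): true scores are held fixed across two testings, the lowest scorers on the first test are selected, and their average observed score nonetheless rises on the retest.

```python
import random

random.seed(0)
N = 10000                                           # number of children (assumed)
true = [random.gauss(50, 10) for _ in range(N)]     # true scores T
test1 = [t + random.gauss(0, 5) for t in true]      # X1 = T + E1
test2 = [t + random.gauss(0, 5) for t in true]      # X2 = T + E2, no training effect

# Select the lowest-scoring 10% on the first test.
cutoff = sorted(test1)[N // 10]
low = [i for i in range(N) if test1[i] <= cutoff]

mean = lambda v: sum(v) / len(v)
print("mean first-test score of selected group:", round(mean([test1[i] for i in low]), 2))
print("mean retest score of selected group    :", round(mean([test2[i] for i in low]), 2))
print("mean true score of selected group      :", round(mean([true[i] for i in low]), 2))
# The retest mean is higher than the first-test mean and close to the group's
# true-score mean, even though training had no effect in this simulation.
```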
Note that we do not define true score as the limit of some (operationally
impossible) process. The true score is a mathematical abstraction. A statistician
doing an analysis of variance components does not try to define the model
Equations (1-1) through (1-6) cannot be disproved by any set of data. These
equations do not enable us to estimate σ²T, σ²E, or ρXT, however. To estimate
these important quantities, we need to make some assumptions. Note that no
assumption about the real world has been made up to this point.
It is usual to assume that errors of measurement are uncorrelated with true
scores on different tests and with each other: For tests X and Y,
ρ(EX, EY) = 0,   ρ(EX, TY) = 0   (X ≠ Y). (1-7)
Exceptions to these assumptions are considered in path analysis (Hauser &
Goldberger, 1971; Milliken, 1971; Werts, Linn, & Jöreskog, 1974; Werts,
Rock, Linn, & Jöreskog, 1977).
1.5. ENVOI
In item response theory (as discussed in the remaining chapters of this book) the
expected value of the observed score is still called the true score. The discrepancy between observed score and true score is still called the error of measurement. The errors of measurement are thus necessarily unbiased and uncorrelated
with true score. The assumptions of (1-7) will be satisfied also; thus all the
remaining equations in this chapter, including those in the Appendix, will hold.
Nothing in this book will contradict either the assumptions or the basic conclusions of classical test theory. Additional assumptions will be made; these will
allow us to answer questions that classical theory cannot answer. Although we
will supplement rather than contradict classical theory, it is surprising how little
we will use classical theory explicitly.
Further basic ideas and formulas of classical test theory are summarized for
easy reference in an appendix to this chapter. The reader may skip to Chapter 2.
APPENDIX
Composite Tests
Up to this point, there has been no assumption that our test is composed of
subtests or of test items. If the test score X is a sum of subtest or item scores Yi,
so that

$$X = \sum_{i=1}^{n} Y_i ,$$
where i′ indexes the items in test X′. If all subtests are parallel,
$$\rho_{XX'} = \frac{n\rho_{YY'}}{1 + (n-1)\rho_{YY'}} , \qquad \text{(1-17)}$$
$$\rho_{XT}^2 = \rho_{XX'} \ge \frac{n}{n-1}\left(1 - \frac{\sum_i \sigma_i^2}{\sigma_X^2}\right) \equiv \alpha . \qquad \text{(1-18)}$$
Alpha is not a reliability coefficient; it is a lower bound.
If items are scored either 0 or 1, α becomes the Kuder-Richardson formula-20
coefficient ρ20: from (1-18) and (1-23),

$$\rho_{XT}^2 = \rho_{XX'} \ge \frac{n}{n-1}\left[1 - \frac{\sum_i \pi_i(1-\pi_i)}{\sigma_X^2}\right] = \rho_{20} , \qquad \text{(1-19)}$$

$$\rho_{20} \ge \frac{n}{n-1}\left[1 - \frac{\mu_X(n-\mu_X)}{n\sigma_X^2}\right] = \rho_{21} . \qquad \text{(1-20)}$$
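As a worked illustration of (1-18) and (1-19), the following sketch computes coefficient alpha from a small matrix of 0/1 item scores, where it coincides with KR-20. The data are made up, and population (divide-by-N) variances are used to match the formulas above.

```python
def kr20(scores):
    """Coefficient alpha / KR-20 for a list of examinees' 0/1 item-score vectors."""
    N = len(scores)          # examinees
    n = len(scores[0])       # items
    # proportion of correct answers for each item
    pi = [sum(row[i] for row in scores) / N for i in range(n)]
    # variance of number-right scores (population form, dividing by N)
    x = [sum(row) for row in scores]
    mean_x = sum(x) / N
    var_x = sum((xi - mean_x) ** 2 for xi in x) / N
    sum_item_var = sum(p * (1 - p) for p in pi)      # sum of pi(1 - pi)
    return (n / (n - 1)) * (1 - sum_item_var / var_x)

# made-up 0/1 responses: 6 examinees x 4 items
data = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
print("KR-20 =", round(kr20(data), 3))
```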
Item Theory
Denote the score on item i by Yi. Classical item analysis provides various
tautologies. The variance of the test scores is

$$\sigma_X^2 = \sum_i \sum_j \sigma_i \sigma_j \rho_{ij} = \sigma_X \sum_i \sigma_i \rho_{iX} , \qquad \text{(1-21)}$$

where ρij and ρiX are Pearson product-moment correlation coefficients. If Yi is
always 0 or 1, then X is the number-right score, the interitem correlation ρij is a
phi coefficient, and ρix is an item-test point biserial correlation. Classical item
analysis theory may deal also with the biserial correlation between item score and
test score and with the tetrachoric correlations between items (see Lord &
Novick, 1968, Chapter 15). In the case of dichotomously scored items (Yi = 0 or
1), we have
$$\mu_X = \sum_{i=1}^{n} \pi_i , \qquad \text{(1-22)}$$

$$\rho_{Xc} = \frac{\sum_i \sigma_i \rho_{ic}}{\sqrt{\sum_i \sum_j \sigma_i \sigma_j \rho_{ij}}} . \qquad \text{(1-25)}$$
These two formulas provide the two paradoxical classical rules for building a
test:
Overview
Classical test theory is based on the weak assumptions (1-7) plus the assumption
that we can build strictly parallel tests. Most of its equations are unlikely to be
contradicted by data. Equations (1-1) through (1-13) are unlikely to be falsified,
since they involve the unobservable variables T and E. Equations (1-15), (1-16),
and (1-20)-(1-25) cannot be falsified because they are tautologies.
The only remaining equations of those listed are (1-14) and (1-17)-(1-19).
These are the best known and most widely used practical outcomes of classical
test theory. Suppose when we substitute sample statistics for parameters in
(1-17), the equality is not satisfied. We are likely to conclude that the discrepancies are due to sampling fluctuations or else that the subtests are not really
strictly parallel.
The assumption (1-7) of uncorrelated errors is also open to question, however.
Equations (1-7) can sometimes be disproved by path analysis methods. Similar
comments apply to (1-14), (1-18), and (1-19).
Note that classical test theory deals exclusively with first and second moments:
with means, variances, and covariances. An extension of classical test theory
to higher-order moments is given in Lord and Novick (1968, Chapter 10). Without such extension, classical test theory cannot investigate the linearity or nonlinearity of a regression, nor the normality or nonnormality of a frequency
distribution.
REFERENCES
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. The dependability of behavioral
measurements: Theory of generalizability for scores and profiles. New York: Wiley, 1972.
Hauser, R. M., & Goldberger, A. S. The treatment of unobservable variables in path analysis. In H.
L. Costner (Ed.), Sociological methodology, 1971. San Francisco: Jossey-Bass, 1971.
Linn, R. L., & Slinde, J. A. The determination of the significance of change between pre- and
posttesting periods. Review of Educational Research, 1977, 47, 121-150.
Lord, F. M. Elementary models for measuring change. In C. W. Harris (Ed.), Problems in measur-
ing change. Madison: University of Wisconsin Press, 1963.
Lord, F. M., & Novick, M. R. Statistical theories of mental test scores. Reading, Mass.: Addison-
Wesley, 1968.
Milliken, G. A. New criteria for estimability for linear models. The Annals of Mathematical Statis-
tics, 1971, 42, 1588-1594.
Werts, C. E., Linn, R. L., & Jöreskog, K. G. Intraclass reliability estimates: Testing structural
assumptions. Educational and Psychological Measurement, 1974, 34, 25-33.
Werts, C. E., Rock, D. A., Linn, R. L., & Jöreskog, K. G. Validating psychometric assumptions
within and between several populations. Educational and Psychological Measurement, 1977,
37, 863-872.
2 Item Response Theory—Introduction and Preview
2.1. INTRODUCTION
Commonly, a test consists of separate items and the test score is a (possibly
weighted) sum of item scores. In this case, statistics describing the test scores of
a certain group of examinees can be expressed algebraically in terms of statistics
describing the individual item scores for the same group [see Eq. (1-21) to
(1-25)]. As already noted, classical item theory (which is only a part of classical
test theory) consists of such algebraic tautologies.
Such a theory makes no assumptions about matters that are beyond the control
of the psychometrician. It cannot predict how individuals will respond to items
unless the items have previously been administered to similar individuals. In
practical test development work, we need to be able to predict the statistical and
psychometric properties of any test that we may build when administered to any
target group of examinees. We need to describe the items by item parameters and
the examinees by examinee parameters in such a way that we can predict prob-
abilistically the response of any examinee to any item, even if similar examinees
have never taken similar items before. This involves making predictions about
things beyond the control of the psychometrician—predictions about how people
will behave in the real world.
As an especially clear illustration of the need for such a theory, consider the
basic problem of tailored testing: Given an individual's response to a few items
already administered, choose from an available pool one item to be administered
to him next. This choice must be made so that after repeated similar choices the
examinee's ability or skill can be estimated as accurately as possible from his
responses. To do this even approximately, we must be able to estimate the
examinee's ability from any set of items that may be given to him. We must also
know how effective each item in the pool is for measuring at each ability level.
Neither of these things can be done by means of classical mental test theory.
In most testing work, our main task is to infer the examinee's ability level or
skill. In order to do this, we must know something about how his ability or skill
determines his response to an item. Thus item response theory starts with a
mathematical statement as to how response depends on level of ability or skill.
This relationship is given by the item response function (trace line, item characteristic curve).
This book deals chiefly with dichotomously scored items. Responses will be
referred to as right or wrong (but see Chapter 15 for dealing with omitted
responses). Early work in this area was done by Brogden (1946), Lawley (1943),
Lazarsfeld (see Lazarsfeld & Henry, 1968), Lord (1952), and Solomon (1961),
among others. Some polychotomous item response models are treated by Andersen (1973a, b), Bock (1972, 1975), and Samejima (1969, 1972). Related models in bioassay are treated by Aitchison and Bennett (1970), Amemiya (1974a, b, c), Cox (1970), Finney (1971), Gurland, Ilbok, and Dahm (1960), Mantel (1966), and van Strik (1960).
2.2. ITEM RESPONSE FUNCTIONS

Let us denote by θ the trait (ability, skill, etc.) to be measured. For a dichotomous
item, the item response function is simply the probability P or P(θ) of a correct
response to the item. Throughout this book, it is (very reasonably) assumed that
P(θ) increases as θ increases. A common assumption is that this probability can
be represented by the (three-parameter) logistic function
$$P \equiv P(\theta) = c + \frac{1-c}{1 + e^{-1.7a(\theta-b)}} , \qquad \text{(2-1)}$$
where a, b, and c are parameters characterizing the item, and e is the mathematical constant 2.71828. . . . Logistic item response functions for 50 four-choice
word-relations items are shown in Fig. 2.2.1 to illustrate the variety found in a
typical published test. This logistic model was originated and developed by Allan
Birnbaum.
Figure 2.2.2 illustrates the meaning of the item parameters. Parameter c is the
probability that a person completely lacking in ability (θ = −∞) will answer the
item correctly. It is called the guessing parameter or the pseudo-chance score
level. If an item cannot be answered correctly by guessing, then c = 0.
Parameter b is a location parameter: It determines the position of the curve
along the ability scale. It is called the item difficulty. The more difficult the item,
the further the curve is to the right. The logistic curve has its inflexion point at
θ= b. When there is no guessing, b is the ability level where the probability of a
correct answer is .5. When there is guessing, b is the ability level where the
probability of a correct answer is halfway between c and 1.0.

FIG. 2.2.1. Item response functions for SCAT II Verbal Test, Form 2B.
Parameter a is proportional to the slope of the curve at the inflexion point [this
slope actually is .425a(1 − c)]. Thus a represents the discriminating power of
the item, the degree to which item response varies with ability level.
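A direct coding of (2-1) may help fix the roles of a, b, and c; this is only a sketch with arbitrary illustrative parameter values.

```python
import math

def p_logistic(theta, a, b, c):
    """Three-parameter logistic item response function, Eq. (2-1)."""
    return c + (1 - c) / (1 + math.exp(-1.7 * a * (theta - b)))

# An illustrative four-choice item: moderate discrimination, middling difficulty,
# pseudo-chance level .2.
a, b, c = 1.0, 0.5, 0.2
for theta in (-3, -1, 0, b, 1, 3):
    print(f"theta = {theta:>4}: P = {p_logistic(theta, a, b, c):.3f}")
# At theta = b the probability is (1 + c)/2; as theta -> -infinity it approaches c.
```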
An alternative form of item response function is also frequently used: the
(three-parameter) normal ogive,
$$P \equiv P(\theta) = c + (1-c) \int_{-\infty}^{a(\theta-b)} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\, dt . \qquad \text{(2-2)}$$
Again, c is the height of the lower asymptote; b is the ability level at the point of
inflexion, where the probability of a correct answer is (1 + c)/2; a is proportional to the slope of the curve at the inflexion point [this slope actually is a(1 − c)/√(2π)].

FIG. 2.2.2. Meaning of item parameters (see text).
The difference between functions (2-1) and (2-2) is less than .01 for every set
of parameter values. On the other hand, for c = 0, the ratio of the logistic
function to the normal function is 1.0 at a(θ - b) = 0, .97 at - 1, 1.4 at - 2, 2.3
at - 2.5, 4.5 at - 3, and 34.8 at - 4. The two models (2-1) and (2-2) give very
similar results for most practical work.
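The numerical closeness of (2-1) and (2-2) is easy to check. The sketch below is illustrative; it evaluates both functions with c = 0 over a grid of values of a(θ − b), using the error function for the normal ogive.

```python
import math

def logistic(z):
    """Logistic model, Eq. (2-1), with c = 0, as a function of z = a(theta - b)."""
    return 1 / (1 + math.exp(-1.7 * z))

def normal_ogive(z):
    """Normal-ogive model, Eq. (2-2), with c = 0: the standard normal cdf of z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

zs = [i / 100 for i in range(-500, 501)]          # z from -5 to 5
max_diff = max(abs(logistic(z) - normal_ogive(z)) for z in zs)
print("largest |logistic - normal ogive| difference:", round(max_diff, 4))
# about .01, as stated in the text; the tail ratio is another matter:
print("ratio of the two functions at z = -3:", round(logistic(-3) / normal_ogive(-3), 2))
```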
The reader may ask for some a priori justification of (2-1) or (2-2). No
convincing a priori justification exists (however, see Chapter 3). The model must
be justified on the basis of the results obtained, not on a priori grounds.
No one has yet shown that either (2-1) or (2-2) fits mental test data significantly better than the other. The following references are relevant for any statistical investigation along these lines: Chambers and Cox (1967), Cox (1961, 1962), Dyer (1973, 1974), Meeter, Pirie, and Blot (1970), Pereira (1977a, b), Quesenberry and Starbuck (1976), and Stone (1977).
In principle, examinees at high ability levels should virtually never answer an
easy item incorrectly. In practice, however, such an examinee will occasionally
make a careless mistake. Since the logistic function approaches its asymptotes
less rapidly than the normal ogive, such careless mistakes will do less violence to
the logistic than to the normal ogive model. This is probably a good reason for
preferring the logistic model in practical work.
Prentice (1976) has suggested a two-parameter family of functions that includes both (2-1) and (2-2) when a = 1, b = 0, and c = 0 and also includes a
variety of skewed functions. The location, scale, and guessing parameters are
easily added to obtain a five-parameter family of item response curves, each item
being described by five parameters.
2.3. CHECKING THE MATHEMATICAL MODEL
Either (2-1) or (2-2) may provide a mathematical statement of the relation be-
tween the examinee's ability and his response to a test item. A more searching
consideration of the practical meaning of (2-1) and (2-2) is found in Section 15.7.
Such mathematical models can be used with confidence only after repeated
and extensive checking of their applicability. If ability could be measured accu-
rately, the models could be checked directly. Since ability cannot be measured
accurately, checking is much more difficult. An ideal check would be to infer
from the model the small-sample frequency distribution of some observable
quantity whose distribution does not depend on unknown parameters. This does
not seem to be possible in the present situation.
The usual procedure is to make various tangible predictions from the model
and then to check with observed data to see if these predictions are approximately
correct. One substitutes estimated parameters for true parameters and hopes to
obtain an approximate fit to observed data. Just how poor a fit to the data can be
tolerated cannot be stated exactly because exact sampling variances are not
known. Examples of this sort of check on the model are found throughout this
book. See especially Fig. 3.5.1. If time after time such checks are found to be
satisfactory, then one develops confidence in the practical value of the model for
predicting observable results.
Several researchers have produced simulated data and have checked the fit of
estimated parameters to the true parameters (which are known since they were
used to generate the data). Note that this convenient procedure is not a check on
the adequacy of the model for describing the real world. It is simply a check on
the adequacy of whatever procedures the researcher is using for parameter esti-
mation (see Chapter 12).
At this point, let us look at a somewhat different type of check on our item
response model (2-1). The solid curves in Fig. 2.3.1 are the logistic response
curves for five SAT verbal items estimated from the response data of 2862
students, using the methods of Chapter 12. The dashed curves were estimated,
almost without assumption as to their mathematical form, from data on a total
sample of 103,275 students, using the totally different methods of Section 16.13.
The surprising closeness of agreement between the logistic and the unconstrained
item response functions gives us confidence in the practical value of the logistic
model, at least for verbal items like these.
The following facts may be noted, to point up the significance of this result:
1. The solid and dashed curves were obtained from totally different assump-
tions. The solid curve assumes the logistic function, also that the test items all
measure just one psychological dimension. The dashed curve assumes only that
the conditional distribution of number-right observed score for given true score is
a certain approximation to a generalized binomial distribution.
FIG. 2.3.1. Five item characteristic curves estimated by two different methods. (From F. M. Lord, Item characteristic curves estimated without knowledge of their mathematical form—a confrontation of Birnbaum's logistic model. Psychometrika, 1970, 35, 43-50.)

2. The solid and dashed curves were obtained from different kinds of raw data. The solid curve comes from an analysis of all the responses of a sample of
students to all 90 SAT verbal items. The dashed curve is obtained just from
frequency distributions of number-right scores on the SAT verbal test and, in a
minor way, from the variance across items of the proportion of correct answers to
the item.
3. The solid curve is a logistic function. The dashed curve is the ratio of two
polynomials, each of degree 89.
4. The solid curve was estimated from a bimodal sample of 2862 examinees,
selected by stratified sampling to include many high-ability and many low-ability
students. The dashed curve was estimated from all 103,275 students tested in a
regular College Board test administration.
Further details of this study are given in Sections 16.12 and 16.13.
These five items are the only items to be analyzed to date by this method. The
five items were chosen solely for the variety of shapes represented. If a hundred
or so items were analyzed in this way, it is likely that some poorer fits would be
found.
It is too much to expect that (2-1) or (2-2) will hold exactly for every test item
and for every examinee. If some examinees become tired, sick, or uncooperative
partway through the testing, the mathematical model will not be strictly appro-
priate for them. If some test items are ambiguous, have no correct answer, or
have more than one correct answer, the model will not fit such items. If exam-
inees omit some items, skip back and forth through the test, and do not have time
to finish the test, perhaps marking all unfinished items at random, the model
again will not apply.
A test writer tries to provide attractive incorrect alternatives for each
multiple-choice item. We may imagine examinees so completely lacking in
ability that they do not even notice the attractiveness of such alternatives and so
respond to the items completely at random; their probability of success on such
items will be 1/A, where A is the number of alternatives per item. We may also
imagine other examinees with sufficient ability to see the attractiveness of the
incorrect alternatives although still lacking any knowledge of the correct answer;
their probability of success on such items is often less than 1/A. If this occurs, the
item response function is not an increasing function of ability and cannot be fitted
by any of the usual mathematical models.
We might next imagine examinees who have just enough ability to eliminate
one (or two, or three,.. .) of the incorrect alternatives from consideration, al-
though still lacking any knowledge of the correct answer. Such examinees might
be expected to have a chance of 1/(A - 1) (or 1/(A - 2), 1/(A - 3),. . .) of
answering the item correctly, perhaps producing an item response function look-
ing like a staircase.
Such anticipated difficulties deterred the writer for many years from research
on item response theory. Finally, a large-scale empirical study of 150 five-choice
FIG. 2.3.2. Proportion of correct answers to an item as a function of number-right test score. The
two items shown are the two worst examples of nonmonotonicity among the 150 items studied.
2.4. UNIDIMENSIONAL TESTS
Note, however, that latent trait theory is more general than factor analysis.
Ability θ is probably not normally distributed for most groups of examinees.
Unidimensionality, however, is a property of the items; it does not cease to exist
just because we have changed the distribution of ability in the group tested.
Tetrachoric correlations are inappropriate for nonnormal distributions of ability;
they are also inappropriate when the item response function is not a normal
ogive. Tetrachoric correlations are always inappropriate whenever there is guessing. This poses a problem for factor analysts in defining what is meant by
common factor, but it does not disturb the unidimensionality of a pool of items.
It seems plausible that tests of spelling, vocabulary, reading comprehension,
arithmetic reasoning, word analogies, number series, and various types of spatial
tests should be approximately one-dimensional. We can easily imagine tests that
are not. An achievement test in chemistry might in part require mathematical
training or arithmetic skill and in part require knowledge of nonmathematical
facts.
Item response theory can be readily formulated to cover cases where the test
items measure more than one latent trait. Practical application of multidimensional item response theory is beyond the present state of the art, however, except in special cases (Kolakowski & Bock, 1978; Mulaik, 1972; Samejima, 1974; Sympson, 1977).

FIG. 2.4.1. The 12 largest latent roots in order of size for the SCAT 2A Verbal Test.
There is great need for a statistical significance test for the unidimensionality
of a set of test items. Attempts in this direction have been made by Christoffersson (1975), Indow and Samejima (1962), and Muthén (1977).
A rough procedure is to compute the latent roots of the tetrachoric item
intercorrelation matrix with estimated communalities placed in the diagonal. If
(1) the first root is large compared to the second and (2) the second root is not
much larger than any of the others, then the items are approximately unidimen-
sional. This procedure is probably useful even though tetrachoric correlation
cannot usually be strictly justified. (Note that Jöreskog's maximum likelihood
factor analysis and accompanying significance tests are not strictly applicable to
tetrachoric correlation matrices.)
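The rough procedure can be sketched in a few lines of code. The correlation matrix below is made up purely for illustration; it stands in for a tetrachoric intercorrelation matrix with crude communality estimates placed in the diagonal.

```python
import numpy as np

# Made-up "tetrachoric" intercorrelations for 5 items.
R = np.array([
    [0.0, 0.45, 0.40, 0.35, 0.42],
    [0.45, 0.0, 0.38, 0.33, 0.40],
    [0.40, 0.38, 0.0, 0.30, 0.36],
    [0.35, 0.33, 0.30, 0.0, 0.31],
    [0.42, 0.40, 0.36, 0.31, 0.0],
])
# Crude communality estimate for each item: its largest correlation with any other item.
np.fill_diagonal(R, R.max(axis=1))

roots = np.sort(np.linalg.eigvalsh(R))[::-1]     # latent roots, largest first
print("latent roots:", np.round(roots, 3))
# A first root that dominates, with a second root not much larger than the rest,
# suggests the items are approximately unidimensional.
```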
Figure 2.4.1 shows the first 12 latent roots obtained in this way for the SCAT
II Verbal Test, Form 2A. This test consists of 50 word-relations items. The data
were the responses of a sample of 3000 high school students. The plot suggests
that the items are reasonably one-dimensional.
2.5. PREVIEW
FIG. 2.5.1. Item and test information functions. (From F. M. Lord, An analysis
of the Verbal Scholastic Aptitude Test using Birnbaum's three-parameter logistic
model. Educational and Psychological Measurement, 1968, 28, 989-1020.)
the examinee's ability θ from his responses. It can be shown that the test information function I{θ} is simply the sum of the item information functions:

$$I\{\theta\} = \sum_i I\{\theta, u_i\} . \qquad \text{(2-6)}$$
The test information function for the five-item test is shown in Fig. 2.5.1.
We have in (2-6) the very important result that when item responses are
optimally weighted, the contribution of the item to the measurement effectiveness
of the total test does not depend on what other items are included in the test. This
is a different situation from that in classical test theory, where the contribution of
each item to test reliability or to test validity depends inextricably on what other
items are included in the test.
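Under the logistic model (2-1) the item information function can be evaluated as I{θ, ui} = [P′i(θ)]²/[Pi(θ)Qi(θ)], where P′i is the derivative with respect to θ. The sketch below (with made-up item parameters, not the SAT items of the figures) computes the item informations this way and sums them as in (2-6).

```python
import math

def p3pl(theta, a, b, c):
    """Three-parameter logistic item response function, Eq. (2-1)."""
    return c + (1 - c) / (1 + math.exp(-1.7 * a * (theta - b)))

def item_info(theta, a, b, c):
    """Item information I{theta, u_i} = P'^2 / (P Q) for the logistic model."""
    L = 1 / (1 + math.exp(-1.7 * a * (theta - b)))   # logistic part without c
    P = c + (1 - c) * L
    dP = 1.7 * a * (1 - c) * L * (1 - L)             # derivative of P with respect to theta
    return dP ** 2 / (P * (1 - P))

# five illustrative items (a, b, c)
items = [(1.2, -1.0, 0.2), (0.8, -0.5, 0.2), (1.0, 0.0, 0.2),
         (1.5, 0.5, 0.2), (0.6, 1.0, 0.2)]

for theta in (-2, -1, 0, 1, 2):
    infos = [item_info(theta, *it) for it in items]
    print(f"theta = {theta:>3}: test information = {sum(infos):.3f}")   # Eq. (2-6)
```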
FIG. 2.5.2. Optimal (logistic) scoring weight for five items as a function of
ability level. (From F. M. Lord, An analysis of the Verbal Scholastic Aptitude
Test using Birnbaum's three-parameter logistic model. Educational and Psy-
chological Measurement, 1968, 28, 989-1020.)
the five additional easy items. Curve 6 shows the effect of discarding (not
scoring) the easier half of the test. Curve 7 shows the effect of discarding the
harder half of the test; notice that the resulting half-length test is actually better
for measuring low-ability examinees than is the regular full-length SAT. Curve 8
shows a hypothetical SAT just like the regular full-length SAT except that all
items are at the same middle difficulty level.
Results such as these are useful for planning revision of an existing test,
perhaps increasing its measurement effectiveness at certain specified ability
levels and decreasing its effectiveness at other levels. These and other useful
applications of item response theory are treated in detail in subsequent chapters.
REFERENCES
Aitchison, J., & Bennett, J. A. Polychotomous quantal response by maximum indicant. Biometrika,
1970, 57, 253-262.
Amemiya, T. Qualitative response models. Technical Report No. 135. Stanford, Calif.: Institute for
Mathematical Studies in the Social Sciences, Stanford University, 1974. (a)
Amemiya, T. The maximum likelihood estimator vs. the minimum chi-square estimator in the
general qualitative response model. Technical Report No. 136. Stanford, Calif.: Institute for
Mathematical Studies in the Social Sciences, Stanford University, 1974. (b)
Amemiya, T. The equivalence of the nonlinear weighted least squares method and the method of
scoring in the general qualitative response model. Technical Report No. 137. Stanford, Calif.:
Institute for Mathematical Studies in the Social Sciences, Stanford University, 1974. (c)
Andersen, E. B. Conditional inference and models for measuring. Copenhagen: Mentalhygiejnisk
Forlag, 1973. (a)
Andersen, E. B. Conditional inference for multiple-choice questionnaires. British Journal of
Mathematical and Statistical Psychology, 1973, 26, 31-44. (b)
Bock, R. D. Estimating item parameters and latent ability when responses are scored in two or more
nominal categories. Psychometrika, 1972, 37, 29-51.
Bock, R. D. Multivariate statistical methods in behavioral research. New York: McGraw-Hill,
1975.
Brogden, H. E. Variation in test validity with variation in the distribution of item difficulties, number
of items, and degree of their intercorrelation. Psychometrika, 1946, 11, 197-214.
Chambers, E. A., & Cox, D. R. Discrimination between alternative binary response models.
Biometrika, 1967, 54, 573-578.
Christoffersson, A. Factor analysis of dichotomized variables. Psychometrika, 1975, 40, 5-32.
Cox, D. R. Tests of separate families of hypotheses. In J. Neyman (Ed.), Proceedings of the Fourth
Berkeley Symposium on Mathematical Statistics and Probability (Vol. 1). Berkeley: University of
California Press, 1961.
Cox, D. R. Further results on tests of separate families of hypotheses. Journal of the Royal Statistical
Society, 1962, 24, 406-424.
Cox, D. R. The analysis of binary data. London: Methuen, 1970.
Dyer, A. R. Discrimination procedures for separate families of hypotheses. Journal of the American
Statistical Association, 1973, 68, 970-974.
Dyer, A. R. Hypothesis testing procedures for separate families of hypotheses. Journal of the
American Statistical Association, 1974, 69, 140-145.
Finney, D. J. Probit analysis (3rd ed.). New York: Cambridge University Press, 1971.
Gurland, J., Ilbok, J., & Dahm, P. A. Polychotomous quantal response in biological assay. Biomet-
rics, 1960, 16, 382-398.
Indow, T., & Samejima, F. LIS measurement scale for non-verbal reasoning ability. Tokyo:
Nihon-Bunka Kagakusha, 1962. (In Japanese)
Kolakowski, D., & Bock, R. D. Multivariate generalizations of probit analysis. Unpublished manu-
script, 1978.
Lawley, D. N. On problems connected with item selection and test construction. Proceedings of the
Royal Society of Edinburgh, 1943, 61, 273-287.
Lazarsfeld, P. F., & Henry, N. W. Latent structure analysis. Boston: Houghton-Mifflin, 1968.
Lord, F. M. A theory of test scores. Psychometric Monograph No. 7. Psychometric Society, 1952.
Mantel, N. Models for complex contingency tables and polychotomous dosage response curves.
Biometrics, 1966, 22, 83-95.
Meeter, D., Pirie, W., & Blot, W. A comparison of two model discrimination criteria. Technomet-
rics, 1970, 12, 457-470.
3 Relation to Conventional Item Analysis
FIG. 3.1.1. Selected item-test regressions for five-choice Scholastic Aptitude Test items (crosses show regression when omitted responses are replaced by random responses).
the expectation being taken over all individuals at score level x. Now, for any
individual, x is the sum (over items) of his item scores; that is

$$x = \sum_{i=1}^{n} u_i . \qquad \text{(3-1)}$$

Then by definition

$$\sum_{i=1}^{n} \mu_{i|x} \equiv \sum_{i=1}^{n} E(u_i|x) = x . \qquad \text{(3-2)}$$

We can understand this general result most easily by considering the special
case when all the items are statistically equivalent. In this case, μi|x is by
definition the same for all items, so (3-2) can be written

$$\sum_{i=1}^{n} \mu_{i|x} = n\mu_{i|x} = x ,$$

from which it follows that μi|x = x/n for each item. Thus the iosr of each item is
a straight line through the origin with slope 1/n. Note that for statistically equivalent items μi|x = x/n even when the items are entirely uncorrelated with each
other. The iosr has a slope of 1/n even when the test does not measure anything!
This is still true if each item is negatively correlated with every other item!
All this proves that we cannot as a general matter expect item-observed score
regressions to be even approximately normal ogives. We shall not make further
use of item-observed score regressions in this book. The regression of item score
on true score is considered in Section 16.12.
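The claim is easily verified by simulation. In the sketch below (made-up data), the item responses are independent coin flips, so the items are uncorrelated and the test measures nothing; the regression of an item score on number-right score is still very nearly x/n.

```python
import random
from collections import defaultdict

random.seed(1)
n, N = 10, 50000                 # items, examinees (assumed for illustration)
# independent coin-flip responses: the items are uncorrelated and measure nothing
data = [[random.randint(0, 1) for _ in range(n)] for _ in range(N)]

# mean score on item 0 among examinees at each total score x
by_total = defaultdict(list)
for row in data:
    by_total[sum(row)].append(row[0])

for x in sorted(by_total):
    if len(by_total[x]) >= 200:                    # skip sparsely populated score levels
        mean_item = sum(by_total[x]) / len(by_total[x])
        print(f"x = {x:>2}: mean item score = {mean_item:.3f}, x/n = {x / n:.3f}")
```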
3.2. RATIONALE FOR NORMAL OGIVE MODEL

The writer prefers to consider the choice of item response function, such as Eq.
(2-1) or (2-2), as a basic assumption to be justified by methods discussed in
Section 2.3 rather than by any a priori argument. This is particularly wise when
there is guessing, since one assumption often used in this case to deduce Eq.
(2-1) or (2-2) from a priori considerations is that examinees either know the
correct answer to the item or else guess at random. This assumption is totally
unacceptable and would discredit the entire theory if the theory depended on it.
The alternate, acceptable point of view is simply that Eq. (2-1) and (2-2) are
useful as versatile formulas capable of adequately representing a wide variety of
FIG. 3.2.1. Hypothetical conditional distribution of Yi′ for three levels of ability θ, showing the regression μi′|θ and the cutting point γi that separates right answers from wrong answers.
$$\int_{-L}^{\infty} = \int_{-\infty}^{L} ,$$
This is the same as Eq. (2-2) for the normal ogive item response function when ci
= 0.
Note that we have not made any assumption about the distribution of ability θ
in the total group tested. In particular, contrary to some assertions in the literature, we have not assumed that ability is normally distributed in the total group.
Furthermore, if (3-5) holds for some group of examinees, selection on θ will not
change the conditional distribution of Y'i for fixed θ and hence will not change
(3-5). Thus the shape of the distribution of ability θ in the total group tested is
irrelevant to our derivation of (3-5).
Equation (3-5) has the form of a cumulative frequency distribution, as do Eq.
(2-1) and (2-2) when c = 0. In general, however, there seems to be little reason
for thinking of an item response curve as a cumulative frequency distribution.
3.3. RELATION TO CONVENTIONAL ITEM STATISTICS
Conventional item analysis deals with πi, conventionally called the item difficulty, the proportion of examinees answering item i correctly. It also deals with ρix, the product-moment correlation between item score ui and number-right test score x, often called the point-biserial item-test correlation, or else with ρ′ix, the
corresponding biserial item-test correlation. A general formula for the relation of
biserial correlation (ρ') to point-biserial correlation is
$$\rho = \rho' \, \frac{\varphi(\gamma)}{\sqrt{\pi(1-\pi)}} , \qquad \text{(3-6)}$$
where φ(γ) is the normal curve ordinate at the point γ that cuts off area π of the
standardized normal curve.
If ability θ is normally distributed and ci = 0, then by definition the
product-moment correlation ρ′iθ (or simply ρ′i) between Y′i and θ is also the
biserial correlation between ui and θ. Such a relationship is just what is meant by
biserial correlation.
There is also a product-moment or point-biserial correlation between ui and θ,
to be denoted by ρiθ. To the extent that number-right score x is a measure of
ability θ, ρix is an approximation to ρi ≡ ρiθ and ρ′ix is an approximation to ρ′i ≡ ρ′iθ.
Combined with (3-3), this (crude) approximation yields a conceptually illuminating crude relationship between the conventional item-test correlation and the ai
parameter of item response theory, valid only for the case where θ is normally
distributed and there is no guessing:
$$a_i \equiv \frac{\rho'_{ix}}{\sqrt{1 - \rho_{ix}^{\prime\,2}}} \qquad \text{(3-7)}$$

and

$$\rho'_{ix} \equiv \frac{a_i}{\sqrt{1 + a_i^2}} , \qquad \text{(3-8)}$$
where ≡ denotes approximate equality. This shows that under the assumptions
made, the item discrimination parameter ai and the item-test biserial correlation
ρ'ix are approximately monotonic increasing functions of each other.
Approximations (3-7) and (3-8) hold only if the unit of measurement for θ has
been chosen so that the mean of θ is 0 and the standard deviation is 1 (see Section
3.5). Approximations (3-7) and (3-8) do not hold unless θ is normally distributed
in the group tested. They do not hold if there is guessing. In addition, the
approximations fall short of accuracy because (1) the test score x contains errors
of measurement whereas θ does not; and (2) x and θ have differently shaped
distributions (the relation between x and θ is nonlinear).
Approximations (3-7) and (3-8) are given here not for practical use but rather
34 3. RELATION TO CONVENTIONAL ITEM ANALYSIS
to give an idea of the nature of the item discrimination parameter ai. The relation
of ai to conventional item and test parameters is illustrated in Table 3.8.1.
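As a numerical illustration of (3-7) and (3-8), valid only under the idealized conditions just stated, the conversions in both directions are one-liners; the ρ′ values below reproduce the first row of Table 3.8.1.

```python
import math

def a_from_biserial(rho):
    """Approximate item discrimination a_i from the item-test biserial, Eq. (3-7)."""
    return rho / math.sqrt(1 - rho ** 2)

def biserial_from_a(a):
    """Approximate item-test biserial from the discrimination a_i, Eq. (3-8)."""
    return a / math.sqrt(1 + a ** 2)

for rho in (0.2, 0.4, 0.6, 0.7, 0.8, 0.9):
    a = a_from_biserial(rho)
    print(f"rho' = {rho:.1f}  ->  a = {a:.2f}  ->  rho' back = {biserial_from_a(a):.2f}")
# The a values 0.20, 0.44, 0.75, 0.98, 1.33, 2.06 match the first row of Table 3.8.1.
```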
Item i is answered correctly whenever the examinee's ability Yi' is greater
than γi. If ability θ is normally distributed, then Yi' will not only be conditionally
normally distributed for fixed θ but also unconditionally normally distributed in
the total population. Since the unconditional mean and variance of Yi' have been
chosen to be 0 and 1, respectively, a simple relation between γi and πi (propor-
tion of correct answers to item i in the total group) can be written down: When θ
is normally distributed,
$$\pi_i = \int_{\gamma_i}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\, dt , \qquad \text{(3-9)}$$

$$b_i = \frac{\gamma_i}{\rho'_i} . \qquad \text{(3-10)}$$
If all items have equal discriminating power ai, then by (3-4) all ρi' are equal and
the difficulty parameter bi is proportional to γi, the normal curve deviate corresponding to the proportion of correct answers πi. Thus when all items are equally discriminating, there is a monotonic relation between bi and πi: as πi increases,
bi and γi both decrease. When all items are not equally discriminating, the
relation between bi and γi or πi depends on ai. In general, arranging items in
order on πi is not the same as arranging them on bi.
3.4. INVARIANT ITEM PARAMETERS

As pointed out earlier, an item response function can also be viewed as the
regression of item score on ability. In many statistical contexts, regression
functions remain unchanged when the frequency distribution of the predictor
variable is changed. In the present context this should be quite clear: The probability of a correct answer to item i from examinees at a given ability level θ0
depends only on θ0, not on the number of people at θ0, nor on the number of
people at other ability levels θ1, θ2, . . . . Since the regression is invariant, its
lower asymptote, its point of inflexion, and the slope at this point all stay the
same regardless of the distribution of ability in the group tested. Thus ai, bi, and
ci are invariant item parameters. According to the model, they remain the same
regardless of the group tested.
Suppose, on the contrary, it is found that the item response curves of a set of
items differ from one group to another. This means that people in group 1 (say) at
ability level θ0 have a different probability of success on the set of items than do
people in group 2 at the same θ0. This now means that the test is able to
discriminate group 1 individuals from group 2 individuals of identical ability
level θ0. And this, finally, means that the test items are measuring some dimension on which the groups differ, a dimension other than θ. But our basic assumption here is that the test items have only one dimension in common. The conclusion is either that this particular test is not one-dimensional as we require or else
that we should restrict our research to groups of individuals for whom the items
are effectively one-dimensional.
The invariance of item parameters across groups is one of the most important
characteristics of item response theory. We are so accustomed to thinking of item
difficulty as the proportion (πi) of correct answers that it is hard to imagine how
item difficulty can be invariant across groups that differ in ability level. The
following illustration may help to clarify matters.
Figure 3.4.1 shows two rather different item characteristic curves. Inverted on
the baseline are the distributions of ability for two different groups of examinees.
First of all, note again: The ability required for a certain probability of success on
an item does not depend on the distribution of ability in some group; consequently, the item difficulty b should be the same regardless of the group from
which it is determined.
Now note carefully the following. In group A, item 1 is answered correctly
less often than item 2. In group B, the opposite occurs. If we use the proportion
of correct answers as a measure of item difficulty, we find that item 1 is easier
than item 2 for one group but harder than item 2 for the other group.
Proportion of correct answers in a group of examinees is not really a measure
of item difficulty. This proportion describes not only the test item but also the
group tested. This is a basic objection to conventional item analysis statistics.
FIG. 3.4.1. Item response curves in relation to two groups of examinees. (From F. M. Lord, A study of item bias, using item characteristic curve theory. In Y. H. Poortinga (Ed.), Basic problems in cross-cultural psychology. Amsterdam: Swets and Zeitlinger, 1977, pp. 19-29.)

Item-test correlations vary from group to group also. Like other correlations,
item-test correlations tend to be high in groups that have a wide range of talent,
low in groups that are homogeneous.
3.5. INDETERMINACY
Item response functions Pi(θ) like Eq. (2-1) and (2-2) ordinarily are taken to be
functions of a i (θ — bi). If we add a constant to every θ and at the same time add
the same constant to every bi, the quantity a i (θ — bi) is unchanged and so is the
response function Pi(θ). This means that the choice of origin for the ability scale
is purely arbitrary; we can choose any origin we please for measuring ability as
long as we use the same origin for measuring item difficulty bi.
Similarly, if we multiply every θ by a constant, multiply every bi by the same
constant, and divide every ai by the same constant, the quantity a i (θ - bi)
remains unchanged and so does the response function Pi(θ). This means that the
choice of unit for measuring ability is also purely arbitrary.
One could decide to choose the origin and unit for measuring ability in such a
way that the first person tested is assigned θ1 = 0 and the second person tested is
assigned θ2 = 1 or — 1. Another possibility would be to choose so that for the
first item b1 = 0 and a1 = 1. Scales chosen in this way would be meaningless to
anyone unfamiliar with the first two persons tested or with the first item adminis
tered. A more common procedure is to choose the scale so that the mean and
standard deviation of θ are 0 and 1 for the group at hand.
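The indeterminacy is easy to verify numerically: shifting and rescaling θ and bi by the same constants while dividing ai by the scaling constant leaves Pi(θ) untouched. A minimal sketch with arbitrary values:

```python
import math

def p3pl(theta, a, b, c):
    """Three-parameter logistic item response function, Eq. (2-1)."""
    return c + (1 - c) / (1 + math.exp(-1.7 * a * (theta - b)))

a, b, c = 1.3, -0.4, 0.2          # illustrative item parameters
theta = 0.7                       # illustrative ability

shift, scale = 2.0, 3.0           # arbitrary change of origin and unit
theta2 = scale * theta + shift    # rescaled ability
b2 = scale * b + shift            # rescaled difficulty
a2 = a / scale                    # rescaled discrimination

print(p3pl(theta, a, b, c))       # same value...
print(p3pl(theta2, a2, b2, c))    # ...because a(theta - b) is unchanged
```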
The invariance of item parameters, emphasized in Section 3.4, clearly holds
only as long as the origin and unit of the ability scale is fixed. This means that if
we determine the bi for a set of items from one group of examinees and then
independently from another, we should not expect the two sets of bi to be
identical. Rather we should expect them to have a linear relation to each other
(like the relation between Fahrenheit and Celsius temperature scales).
Figure 3.5.1 compares estimated bi from a group of 2250 white students with
estimated bi from a group of 2250 black students for 85 verbal items from the
College Board SAT. Most of the scatter about the line is due to sampling
fluctuations in the estimates; some of the scatter is due to failure of the model to
hold exactly for groups as different as these (see Chapter 14).
If we determine the ai for a set of items independently from two different
groups, we expect the two sets of values to be identical except for an undetermined unit of measurement that will be different for the two groups. We expect
the ai to lie along a straight line passing through the origin (0, 0), with a slope
reciprocal to the slope of the line relating the two sets of bi. The slope represents
the ratio of scale units for the two sets of parameters. The two sets of ai are
related in the same way as two sets of measurements of the same physical
objects, one set expressed in inches and the other in feet.
The ci are not affected by changes in the origin and unit of the ability scale.
The ci should be identical from one group to another.
FIG. 3.5.1. Estimated difficulty parameters (b) for 85 items for blacks and for
whites. (From F. M. Lord, A study of item bias, using item characteristic curve
theory. In Y. H. Poortinga (Ed.), Basic problems in cross-cultural psychology.
Amsterdam: Swets and Zeitlinger, 1977, pp. 19-29.)
Ability parameters θ are also invariant from one test to another except for choice of origin and scale, assuming that the tests both measure the same ability, skill, or trait. For 1830 sixth-grade pupils, Fig. 3.5.2 compares the θ estimated from a 50-item Metropolitan vocabulary test with the θ estimated from a 42-item SRA vocabulary test. Both tests consist of four-choice items.
The scatter about a straight line is more noticeable here than in Fig. 3.5.1
because there each bi was estimated from the responses of 2250 students, whereas here each θ is estimated from the responses to only 42 or 50 items. Thus the estimates
of θ are more subject to sampling fluctuations than the estimates of bi. The broad
scatter at low ability levels is due to guessing, random or otherwise. A more
detailed evaluation of the implications of Fig. 3.5.2 is given in Section 12.6. It is
shown there that after an appropriate transformation is made, the transformed
estimates of θ from the two tests correlate higher than do number-right scores on
the two tests.

FIG. 3.5.2. Ability estimates from a 50-item MAT vocabulary test are compared with ability estimates from a 42-item SRA vocabulary test for 1830 sixth-grade pupils. (Ability estimates outside the range −2.5 < θ < 2.5 are printed on the border of the table.)
In conclusion, in item response theory the item parameters are invariant from
group to group as long as the ability scale is not changed; in classical item
analysis, the item parameters are not invariant from group to group, although
they are unaffected by choice of ability scale. Similarly, ability θ is invariant
across tests of the same psychological dimension as long as the ability scale is not
changed; number-right test score is not invariant from test to test, although it is
unaffected by choice of scale for measuring θ.
3.7. ITEM INTERCORRELATIONS
Although we do not expect the restrictive model of the previous section to hold
for most actual data, some useful conclusions can be drawn from it that will help
us understand the relation of our latent item parameters to familiar quantities. It is
clear that when Y′i and Y′j are normally distributed, the product-moment correlation ρ′ij between them is, by definition, the same as the tetrachoric correlation
between item i and item j. Under the restrictive model, the ρ′ij will have just one
common factor, θ, so that

$$\rho'_{ij} = \frac{a_i a_j}{\sqrt{1+a_i^2}\,\sqrt{1+a_j^2}} .$$

Conversely, under the restrictive model the ρ′i, and thus the ai, can be inferred
from a factor analysis of tetrachoric item intercorrelations:

$$\rho_i^{\prime\,2} = \frac{\rho'_{ij}\,\rho'_{ik}}{\rho'_{jk}} \qquad (i \ne j,\; i \ne k,\; j \ne k). \qquad \text{(3-12)}$$

This is not recommended in the usual situations where there is guessing, however.
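The sketch below (made-up discriminations) builds the tetrachoric correlations implied by the restrictive model and then recovers ρ′i, and hence ai, from a triple of correlations as in (3-12).

```python
import math

a_true = [0.6, 1.0, 1.5, 0.8]                       # illustrative discriminations

def rho_prime(a):
    """Item-ability biserial implied by discrimination a, cf. Eq. (3-8)."""
    return a / math.sqrt(1 + a ** 2)

n = len(a_true)
# tetrachoric correlations under the restrictive model
r = [[rho_prime(a_true[i]) * rho_prime(a_true[j]) if i != j else None
      for j in range(n)] for i in range(n)]

# recover rho'_1 from items (1, 2, 3) by Eq. (3-12), then a_1
i, j, k = 0, 1, 2
rho1_sq = r[i][j] * r[i][k] / r[j][k]
rho1 = math.sqrt(rho1_sq)
a1 = rho1 / math.sqrt(1 - rho1_sq)                  # invert Eq. (3-8)
print("recovered rho'_1 =", round(rho1, 3), " recovered a_1 =", round(a1, 3))
```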
TABLE 3.8.1
Relation of Item Discriminating Power ai to Various Conventional Item Parameters, and to Parameters of a 50-Item Test, when Ability Is Normally Distributed (μθ = 0, σθ = 1) and All Free-Response Items Are of 50% Difficulty (πi = .50, bi = 0)

                 ai =   0     .20    .44    .75    .98    1.33   2.06    Eq. no.
ρ′iθ                    0     .2     .4     .6     .7     .8     .9      (3-3)

Free-Response
ρij                     0     .025   .10    .23    .33    .44    .60     (3-13)
ρi(x−i)                 0     .12    .29    .47    .56    .66    .77     (3-14)
ρxx′                    0     .57    .85    .94    .96    .98    .99     (1-17)

Multiple-Choice (c = .2)
ρIJ                     0     .017   .07    .16    .22    .29    .40     (3-19)
familiar probably have reliabilities close to .90. If so, we should focus our
attention on the column with ρXX′ = .90 at the bottom and ai = .75 at the top.
The top half of the table assumes that ci = 0. This is referred to as the
free-response case (although free-response items do not necessarily have ci =
0). Note that by (3-4) and (3-9), under the assumptions made, free-response
items with bi = 0 will have exactly 50% correct answers (πi = .50) in the total
group of examinees. The parameters shown are the biserial item-ability correlation ρ′iθ; the point-biserial (product-moment) item-ability correlation ρiθ; the
tetrachoric item intercorrelation ρ′ij; the product-moment item intercorrelation
ρij (phi coefficient); the item-test correlation ρi(x−i), where x − i is number-right score on the remaining 49 items; and the parallel-forms test reliability ρxx′.
The equation used to calculate each parameter is referenced in the table.
The bottom half of the table deals with multiple-choice tests. The theoretical
relation between the multiple-choice and the free-response case is discussed in
the Appendix. For the rest of this chapter only, multiple-choice items are indexed
by I and J to distinguish them from free-response items (indexed by i and j); the
number-right score on a multiple-choice test will be denoted by X to distinguish
it from the score x obtained from free-response items. The multiple-choice item
intercorrelation ρIJ (phi coefficient) is computed by (3-19) from the free-response ρij. All multiple-choice parameters in the table are computed from ρIJ.
All numbered equations except (3-19) apply equally to multiple-choice and to
free-response items.
Note in passing several things of general interest in the table:
1. A comparison of ρxx′ with ρXX′ indicates the loss in test reliability when
low-ability examinees are able to get one-fifth of the items right without knowing
any answers.
2. The standard deviation σX of number-right scores varies very sharply with
item discriminating power (with item intercorrelation).
3. The usual item-test correlation ρix or ρ′ix (also ρIX or ρ′IX) is spuriously
high because item i is included in x (or I in X). The amount of the spurious effect
can be seen by comparing ρIX and ρI(X−I).
4. For free-response items, the item-test correlation ρi(x−i) in the last two
columns of the table is higher than the item-ability correlation ρiθ. This may be
viewed as due to the fact (see Section 3.1) that the item observed-score regression is more nearly linear than the item-ability regression (item response function).
APPENDIX
This appendix provides those formulas not given elsewhere that are necessary for
computing Table 3.8.1. In the top half of the table, the phi coefficient ρij was
obtained from the tetrachoric ρ′ij by a special formula (Lord & Novick, 1968,
Eq. 15.9.3) applicable only to items with 50% correct answers:

$$\rho_{ij} = \frac{2}{\pi} \arcsin \rho'_{ij} , \qquad \text{(3-13)}$$

the arcsin being expressed in radians. The test reliability ρxx′ was obtained from
ρij by the Spearman-Brown formula (1-17) for the correlation between two
parallel tests after lengthening each of them 50 times. The item-test correlation
ρi(x−i) was obtained from a well-known closely related formula for the correlation between one test (i) and the lengthened form (y) of a parallel test (j):

$$\rho_{iy} = \frac{m \sigma_i \rho_{ij}}{\sigma_y} , \qquad \text{(3-14)}$$

where m is the number of times j is lengthened [for ρi(x−i) in Table 3.8.1, m =
49], y is number-right score on the lengthened test, σ²i = πi(1 − πi) is the
variance (1-23) of the item score (ui = 0 or 1), and

$$\sigma_y^2 = \sigma_i^2 \left[ m + m(m-1)\rho_{ij} \right] \qquad \text{(3-15)}$$

is the variance of the y scores [see Eq. (1-21)].
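These formulas can be chained to reproduce the free-response entries of Table 3.8.1. The sketch below does the column with ρ′iθ = .6 for a 50-item test of 50%-difficulty items (the setup assumed in the table), using (3-13), the Spearman-Brown formula (1-17), and (3-14) with (3-15).

```python
import math

rho_prime_theta = 0.6                         # biserial item-ability correlation
rho_prime_ij = rho_prime_theta ** 2           # tetrachoric interitem correlation

# phi coefficient for 50%-difficulty items, Eq. (3-13)
rho_ij = (2 / math.pi) * math.asin(rho_prime_ij)

# parallel-forms reliability of a 50-item test, Spearman-Brown, Eq. (1-17)
n = 50
rho_xx = n * rho_ij / (1 + (n - 1) * rho_ij)

# item-test correlation with the item excluded, Eqs. (3-14) and (3-15)
m = n - 1
pi_i = 0.5
sigma_i = math.sqrt(pi_i * (1 - pi_i))                           # Eq. (1-23)
sigma_y = math.sqrt(sigma_i ** 2 * (m + m * (m - 1) * rho_ij))   # Eq. (3-15)
rho_i_xminusi = m * sigma_i * rho_ij / sigma_y                   # Eq. (3-14)

print("rho_ij      =", round(rho_ij, 3))          # about .23
print("rho_xx'     =", round(rho_xx, 3))          # about .94
print("rho_i(x-i)  =", round(rho_i_xminusi, 3))   # about .47
```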
The usual point-biserial item-test correlation ρix is computed from ρi(x−i) by
a formula derived as follows:
Suppose now that we change the items to multiple choice with cI = cJ = c >
0. According to Eq. (2-1) or (2-2), the effect will be that, of the people who got
each free-response item wrong, a fraction c will now get the corresponding
multiple-choice item right. Thus πI = πi + c(1 − πi). The new 2 × 2 table for
multiple-choice items will therefore be
½(1 − c)    ½(1 + c)
When the general formula for a phi coefficient is applied to the last 2 × 2 table
(with A denoting the proportion of examinees answering both free-response items
correctly), we find that for the multiple-choice items under consideration

$$\rho_{IJ} = \frac{A^2(1-c)^4 + cA(1-c)^2 - (1-c)^2\left(\tfrac{1}{2} - A + cA\right)^2}{\tfrac{1}{4}(1-c^2)} = \frac{1-c}{1+c}\,(4A - 1) .$$
Using (3-18) we find a simple relation between the free-response ρij and the
multiple-choice ρIJ for the special case where πi = πj = .5:

$$\rho_{IJ} = \frac{1-c}{1+c}\,\rho_{ij} . \qquad \text{(3-19)}$$
This formula is a special case of the more general formula in Eq. (7-3).
REFERENCES
Fan, C.-T. On the applications of the method of absolute scaling. Psychometrika, 1957, 22, 175-183.
Gulliksen, H. Theory of mental tests. New York: Wiley, 1950.
Lord, F. M., & Novick, M. R. Statistical theories of mental test scores. Reading, Mass.: Addison-
Wesley, 1968.
4 Test Scores and Ability Estimates as Functions of Item Parameters
If the item response function Pi(θ) varies from item to item, as is ordinarily the
case, the frequency distribution φ(x|θ) of the number-
right test score for a person with ability θ is a generalized binomial (Kendall &
Stuart, 1969, Section 5.10). This distribution can be generated by the generating
function
∏ⁿᵢ₌₁ (Qi + Pi t).   (4-1)
For example, if n = 3, the scores x = 0, 1, 2, 3 occur with relative frequency
Q1Q2Q3; Q1Q2P3 + Q1P2Q3 + P1Q2Q3; Q1P2P3 + P1Q2P3 + P1P2Q3; and
P1P2P3, respectively. The columns of Table 4.3.1 give the reader a good idea of
the kinds of φ(x|θ) encountered in practice.
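To make (4-1) concrete, the following Python sketch builds φ(x|θ) by convolving the items one at a time, which is equivalent to expanding the generating function and collecting the coefficient of each power of the dummy variable. The item probabilities used are those given for Exercise 4-3; everything else (function names, the printout) is illustrative only.

import math

def number_right_distribution(P):
    """Return phi(x|theta) for x = 0, 1, ..., n, given P = [P_1(theta), ..., P_n(theta)]."""
    dist = [1.0]                      # distribution for a zero-item "test"
    for p in P:
        q = 1.0 - p
        new = [0.0] * (len(dist) + 1)
        for x, prob in enumerate(dist):
            new[x] += prob * q        # item answered wrong: score unchanged
            new[x + 1] += prob * p    # item answered right: score increases by 1
        dist = new
    return dist

P = [0.7848, 0.6, 0.4152]             # the three P_i(0) values of Exercise 4-3
phi = number_right_distribution(P)
mean = sum(x * f for x, f in enumerate(phi))      # should equal the sum of the P_i (Eq. 4-2 below)
var = sum(p * (1 - p) for p in P)                 # should equal the sum of P_i Q_i (Eq. 4-3 below)
print(phi, mean, var)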
Although φ(x|θ), the conditional distribution of number-right score, cannot be
written in a simple form, its mean μx|θ and variance σ²x|θ for given θ are simply

μx|θ = Σⁿᵢ₌₁ Pi(θ),   (4-2)

σ²x|θ = Σⁿᵢ₌₁ PiQi.   (4-3)
The mean (4-2) can be derived from the fact that x ≡ Σᵢ ui and the familiar fact
that the mean of ui is Pi. The variance (4-3) can be derived from the familiar
binomial variance σ²(ui) = PiQi by noting that

σ²x|θ = σ²(Σᵢ ui | θ) = Σᵢ σ²(ui) = Σᵢ PiQi

because of local independence. Note that φ(x|θ), μx|θ, and σx|θ refer to the
distribution of x (1) for all people at ability level θ and also (2) for any given
individual whose ability level is θ.
If P̄ is the average of the Pi(θ) taken over the n items, then in practice φ(x|θ) is
usually very much like the binomial distribution (n choose x) P̄ˣ Q̄ⁿ⁻ˣ, where Q̄ ≡ 1 −
P̄. The main difference is that the variance Σᵢ PiQi is always less than the
binomial variance nP̄Q̄ unless Pi = P̄ for all items. The difference between the
two variances is simply nσ²P|θ, where σ²P|θ is the variance of the Pi(θ) for fixed
θ taken over items:

σ²x|θ = nP̄Q̄ − nσ²P|θ.   (4-4)
4.2. TRUE SCORE

The number-right true score ξ for an examinee at ability level θ is defined as his
expected number-right score:

ξ ≡ ξ(θ) ≡ Σⁿᵢ₌₁ Pi(θ).   (4-5)
Since each Pi(θ) is an increasing function of θ, number-right true score is an
increasing function of ability.
This is the same true score denoted by T in Section 1.2. The classical notation
avoids Greek letters; the present notation emphasizes that the relation of observed
score x to true score ξ is the relation of a sample observation to a population
parameter.
True score ξ and ability θ are the same thing expressed on different scales of
measurement. The important difference is that the measurement scale for ξ
depends on the items in the test; the measurement scale for θ is independent of
the items in the test (Section 3.4). This makes θ more useful than ξ when we wish
to compare different tests of the same ability. Such comparisons are an essential
part of any search for efficient test design (Chapter 6).
4.3. STANDARD ERROR OF MEASUREMENT
s²e·ξ = (1/N) Σᴺₐ₌₁ σ²e|ξₐ.   (4-6)
When ξ is fixed, so is θ and vice versa [see Eq. (4-5)]. Thus

σ²e|ξ = σ²x|θ = Σⁿᵢ₌₁ PiQi.   (4-7)
TABLE 4.3.1
Conditional Distribution φ(x|θ) of Number-Right Score x at Selected Fixed Values of θ
(θ = −3.000 to +3.000 in steps of 0.375; cell entries are relative frequencies, in percent, of each score at each ability level)
Table 4.3.1 was computed from (4-1) using estimated item parameters. All items
are five-choice items. The typical ogive shape of the regression function μx|θ =
Σⁿᵢ₌₁ Pi(θ) is apparent in this table, as are the typical decreasing standard error of
measurement and increasing skewness at high ability levels.
As already noted, formula (4-2) for the regression μx|θ of number-right score on
ability is the same as formula (4-5) for the relation of true score to ability. This
important function ξ = ξ(θ) ≡ Σⁿᵢ₌₁ Pi(θ) and also the function

ζ ≡ ζ(θ) ≡ (1/n) Σⁿᵢ₌₁ Pi(θ)   (4-9)
are called test characteristic functions. Either of these functions specifies the
distortion imposed on the ability scale when number-right score on a particular
set of test items is used as a measure of ability. A typical example of a test
characteristic function appears in Fig. 5.5.1.
Over ability ranges where the test characteristic curve is relatively steep,
score differences are exaggerated compared to ability differences. Over ranges
where the test characteristic curve is relatively flat, score differences are com
pressed compared to ability differences. Since number-right scores are integers,
compression of a wide range of ability into one or two discrete score values
necessarily results in inaccurate measurement.
If all items had the same response function, clearly the test characteristic
function (4-9) would be the same function also. More generally, test characteris
tic curves usually have ogive shapes similar to but not identical with item re
sponse functions. Differences in difficulty among items cause a flattening of the
test characteristic curve. If all items had the same response curves except that
their difficulty parameters bi were uniformly distributed, the test characteristic
curve would be virtually a straight line except at its extremes. For a long test, the
greater the range of the bi, the more nearly horizontal the test characteristic
curve.
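A small Python sketch makes the flattening effect of spread-out difficulties visible. The three-parameter logistic form of Eq. (2-1) is assumed, and all item parameters and grid values below are made up for illustration.

import math

D = 1.7  # scaling constant used throughout the book

def P(theta, a, b, c):
    """Three-parameter logistic item response function, Eq. (2-1)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def zeta(theta, items):
    """Test characteristic function (4-9): average of P_i(theta) over the n items."""
    return sum(P(theta, a, b, c) for (a, b, c) in items) / len(items)

# Hypothetical 20-item tests, a = 1.0 and c = .2 for every item
peaked = [(1.0, 0.0, 0.2) for _ in range(20)]                   # all b_i = 0
spread = [(1.0, -2.0 + 4.0 * i / 19, 0.2) for i in range(20)]   # b_i spread uniformly on (-2, 2)

for theta in (-3, -2, -1, 0, 1, 2, 3):
    print(theta, round(zeta(theta, peaked), 3), round(zeta(theta, spread), 3))

The printed columns show the peaked test rising steeply near θ = 0 while the spread test rises more nearly linearly over the whole range.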
If a test is composed of two sets of items, one set easy and the other set
difficult, the test characteristic curve may have three relatively flat regions: It
may be flat in the middle as well as at extreme ability levels. Such a test will
compress the ability scale and provide poor measurement at middle ability levels,
as well as at the extremes.
If the distribution of ability is assumed to have some specified shape (for
example, it is bell-shaped), the effect of the distortions introduced by various
types of test characteristic functions can be visualized. If a test is much too easy
for the group tested, the point of inflection of the test characteristic curve may
fall in the lower tail of the distribution where there are no examinees. Only the
top part of the test characteristic curve may be relevant for the particular group
tested. In this case, most examinees in the group may be squeezed into a few
discrete score values at the top of the score range. The bottom part of the
available score range is unused. Measurement is poor except for the lower ability
levels of the group. An assumed bell-shaped distribution of ability is turned into a
negatively skewed, in the extreme a J-shaped, distribution of number-right
scores. Such a test may be very appropriate if its main use is simply to weed out
a few of the lowest scoring examinees.
If the test is much too hard for the group tested, an opposite situation will
exist. The score distribution will be positively skewed but will not be J-shaped if
there is guessing, because zero scores are then unlikely. Such a test may be very
appropriate for a scholarship examination or for selecting a few individuals from
a large group of applicants.
If the test is not very discriminating, the test characteristic curve will be
relatively flat. If the relevant part of the characteristic curve (the part where the
examinees occur) is nearly straight, the shape of the frequency distribution of
ability will not be distorted by the test. However, a wide range of ability will be
squeezed into a few middling number-right scores, with correspondingly poor
measurement.
If the test is very discriminating, its characteristic curve will be correspondingly
steep in the middle. The curve cannot be steep throughout the ability
range because it is asymptotic to ζ = 1 at the right and to ζ = c̄ at the left, where

c̄ = (1/n) Σⁿᵢ₌₁ ci.   (4-10)
Thus there will be good measurement in the middle but poor measurement at
the extremes. If the test difficulty is appropriate for the group tested, the middle
part of the bell-shaped distribution of ability will be spread out and the tails
squeezed together. The result in this case is a platykurtic distribution of number-
right scores. The more discriminating the items, the more platykurtic the
number-right score distribution, other things being equal. In the extreme, a
U-shaped distribution of number-right scores may be obtained.
If we wish to discriminate well among people near a particular ability level (or
levels), we should build a test that has a steep characteristic curve at the point(s)
where we want to discriminate. For example, if a test is to be used only to select a
single individual for a scholarship or prize, then the items should be so difficult
that only the top person in the group tested knows the answer to more than half of
the test items. The problem of optimal test design for such a test is discussed in
Chapters 5, 6, and 11.
An understanding of the role of the test characteristic curve is important in
designing a test for a specific purpose. Diagrams showing graphically just how
various test characteristic curves distort the ability scale, and thus the frequency
distribution of ability, are given in Lord and Novick (1968, Section 16.14).
4.6. THE TOTAL-GROUP DISTRIBUTION OF NUMBER-RIGHT SCORE

If the abilities θₐ (a = 1, 2, ..., N) of the N examinees in a group are known, the
total-group distribution of number-right score is

φ̄(x) = (1/N) Σᴺₐ₌₁ φ(x|θₐ),   (4-12)

where φ(x|θₐ) is calculated from (4-1).
Any desired moments of the total-group distribution of test score can be
calculated from the Ø(x) obtained by (4-12). The expected score for the N
examinees also can be found from (4-12) and (4-2):
E(x) = (1/N) Σᴺₐ₌₁ μx|θₐ = (1/N) Σᴺₐ₌₁ Σⁿᵢ₌₁ Pi(θₐ).   (4-13)
An estimate of the expected variance of the N scores can be found from an
ANOVA identity relating total-group statistics to conditional statistics:
s²x = (1/N) Σᴺₐ₌₁ Σⁿᵢ₌₁ PiaQia + (1/N) Σᴺₐ₌₁ (Σⁿᵢ₌₁ Pia)² − (1/N²)(Σᴺₐ₌₁ Σⁿᵢ₌₁ Pia)²,   (4-14)

where Pia ≡ Pi(θₐ).
Although we shall have little use for test reliability coefficients in this book, it is
reassuring to have a formula relating test reliability to item and ability parame
ters. A conventional definition of test reliability is given by Eq. (1-6) and (1-9):
Written in our current notation, this definition is
ρxx' ≡ ρ²xξ ≡ 1 − σ²x|ξ / σ²x.   (4-15)
For a sample of examinees, (4-15), (4-6), (4-8), and (4-14) suggest an appropriate
sample reliability coefficient:

ρ̂xx' = [Σᴺₐ₌₁ (Σⁿᵢ₌₁ Pia)² − (Σᴺₐ₌₁ Σⁿᵢ₌₁ Pia)²/N] /
       [Σᴺₐ₌₁ Σⁿᵢ₌₁ PiaQia + Σᴺₐ₌₁ (Σⁿᵢ₌₁ Pia)² − (Σᴺₐ₌₁ Σⁿᵢ₌₁ Pia)²/N].   (4-16)
From (4-7), we see that (4-15) is the complement of the ratio of (averaged
squared error about the regression of x on θ) to (variance of x). Reliability is
therefore, by definition, equal to the correlation ratio of score x on ability θ.
function. This step function cuts off 2½% or less of the frequency at every ability
level θ. Repeat this process for the upper tails of the score distributions.
The two resulting step functions are shown in Fig. 4.8.1. No matter what the
value of θ may be, in the long run at least 95% of all randomly chosen scores will
lie in the region between these step functions.
Now consider a random examinee. His number-right score on the test is x0,
say. We are going to assert that he is in the region between the step functions.
This assertion will be correct at least 95% of the time for randomly chosen
examinees. But given that this examinee's test score is x0, this assertion is in
logic completely equivalent to the assertion that he lies in a certain interval on θ.
In Fig. 4.8.1, the ends of an illustrative interval are denoted by θ̲ and θ̄. We shall
therefore assert that his ability θ lies in the interval (θ̲, θ̄). Such assertions will be
correct in the long run at least 95% of the time for randomly chosen examinees.
An interval with this property is called a 95% confidence interval. Such
confidence intervals are basic to the very important concept of test information,
introduced in Chapter 5. It is for this reason that we consider it in such detail
here.
A point estimate of θ for given x would be provided by the regression of θ on
x (see Section 12.8). Although the regression of x on θ is given by (4-2), the
FIG. 4.8.1. Confidence interval (θ̲, θ̄) for estimating ability (SAT Mathematics
Test, January 1971). [Number-right score, with an observed score x0 marked, plotted against ability θ.]
Number right is not the only way, nor the best way, to score a test. For more
general results, and for other reasons, we need to know the conditional frequency
distribution of the pattern of item responses—the joint distribution of all item
responses ui (i = 1, 2 , . . . , n) for given θ.
For item i, the conditional distribution, given θ, of a single item response is

L(ui|θ) = Pi(θ) if ui = 1;  Qi(θ) if ui = 0;  0 otherwise.   (4-18)
This may be written more compactly in various ways. For present purposes, we
shall write

L(ui|θ) = Pi^ui Qi^(1−ui).   (4-19)
The reader should satisfy himself that (4-18) and (4-19) are identical for the two
permissible values of ui.
Because of local independence, which is guaranteed by unidimensionality
(Section 2.4), success on one item is statistically independent of success on other
items. Therefore, the joint distribution of all item responses, given θ, is the
product of the distributions (4-19) for the separate items:
L(u|θ; a, b, c) = L(u1, u2, ..., un|θ) = ∏ⁿᵢ₌₁ Pi^ui Qi^(1−ui),   (4-20)

where u = {ui} is the column vector {u1, u2, ..., un}' and a, b, c are vectors
of the ai, bi, and ci.
Equation (4-20) may be viewed as the conditional distribution of the pattern u
of item responses for a given individual with ability θ and for known item
parameters a, b , and c. In this case the ui (i = 1, 2 , . . . , n) are random
variables and θ, a, b , and c are considered fixed.
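The following Python sketch evaluates the likelihood (4-20) of one observed response pattern for a range of θ values when the item parameters are treated as known. The three item parameter triples are those of Table 4.17.1; the response pattern and the printout are assumed for illustration.

import math

D = 1.7

def P(theta, a, b, c):
    """Three-parameter logistic item response function, Eq. (2-1)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def likelihood(u, theta, items):
    """L(u|theta; a, b, c) of Eq. (4-20): product over items of P^u * Q^(1-u)."""
    L = 1.0
    for u_i, (a, b, c) in zip(u, items):
        P_i = P(theta, a, b, c)
        L *= P_i if u_i == 1 else (1.0 - P_i)
    return L

items = [(1/1.7, -1.0, 0.2), (1/1.7, 0.0, 0.2), (1/1.7, 1.0, 0.2)]   # test 1 of Table 4.17.1
u = [1, 1, 0]                                                        # illustrative response pattern
print([round(likelihood(u, t, items), 4) for t in (-3, -2, -1, 0, 1, 2, 3)])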
If the ui for an individual have already been determined from his answer
sheet, they are no longer chance variables but known constants. In this case,
assuming the item parameters to be known from pretesting, it is useful to think of
(4-20) as a function of the mathematical variable θ, which represents the (unknown)
ability level of the examinee. Considered in this way, (4-20) is the
FIG. 4.9.1. Logarithm of likelihood functions for estimating the ability of six
selected examinees from the SCAT II 2B Mathematics test. [Log likelihood plotted against ability θ.]
likelihood function for θ. The maximum likelihood estimate θ̂ (see Section 4.13)
of the examinee's ability is the value of θ that maximizes the likelihood (4-20) of
his actually observed responses ui (i = 1, 2, ..., n).
Figure 4.9.1 shows the logarithm of six logistic likelihood functions computed
independently by (4-20) for six selected examinees taking a 100-item high
school mathematics aptitude test. The maxima of these six curves are found at
ability levels θ̂ = −5.6, −4.6, −1.0, .1, 1.0, and 3.7. These six values are the
maximum likelihood ability estimates θ̂ for the six examinees.
independent. Thus the joint distribution of the N different u for all examinees is
the product of the separate distributions. This joint distribution is then

L(U|θ; a, b, c) ≡ L(u1, u2, ..., uN|θ) = ∏ᴺₐ₌₁ ∏ⁿᵢ₌₁ Pia^uia Qia^(1−uia).   (4-21)
The likelihood function (4-20) for θ for one examinee can also be written

L(u|θ) = ∏ⁿᵢ₌₁ (Pi/Qi)^ui · ∏ⁿᵢ₌₁ Qi.
In general, this is not helpful; but in the case of the logistic function [Eq. (2-1)]
when ci = 0,

Pi/Qi = [1/(1 + e^(−DLi))] / [1 − 1/(1 + e^(−DLi))] = e^(DLi),   (4-22)

where D is the constant 1.7 and

Li ≡ ai(θ − bi).   (4-23)

Substituting (4-22) in the preceding likelihood function gives the logistic likelihood
function
L(u|θ) = exp(D Σⁿᵢ₌₁ uiLi) ∏ⁿᵢ₌₁ Qi
       = exp(−D Σⁿᵢ₌₁ aibiui) e^(Dθs) ∏ⁿᵢ₌₁ Qi(θ),   (4-24)

where

s ≡ s(u) ≡ Σⁿᵢ₌₁ aiui.   (4-25)
In Appendix 4, it is shown that if
the expectation of s is

E(s|θ) = Σⁿᵢ₌₁ aiPi(θ).   (4-27)
Note that (4-27) is a kind of true score, although different from the usual
number-right true score ξ. A consistent estimator θ̂ of θ is found by solving for θ̂
the equation

Σⁿᵢ₌₁ aiPi(θ̂) = s.   (4-28)
It is shown in Section 4.14 that the θ̂ obtained from Eq. (4-28) is also the
maximum likelihood estimator of θ under the logistic model with all ci = 0.
If s is sufficient for θ, so is any monotonic function of s. It is generally agreed
that when a sufficient statistic s exists for θ, any statistical inference for θ should
be based on some function of s and not on any other statistic.
The three conditions stated at the end of the preceding section are the most
general conditions for the existence of a sufficient statistic for θ. There is no
sufficient statistic when the item response function is a normal ogive, even
though the normal ogive and logistic functions are empirically almost indistinguishable.
There is no sufficient statistic when there is guessing, that is, when
ci ≠ 0. This means that there is no sufficient statistic in cases, frequently reported
in the literature, where the Rasch model (see Wright, 1977) is (improperly) used
when the items can be answered correctly by guessing.
When no sufficient statistic exists, the statistician uses other estimation methods,
such as maximum likelihood. As already noted in Section 4.10, the maximum
likelihood estimates θ̂ₐ (a = 1, 2, ..., N) and âi, b̂i, and ĉi (i = 1, 2, ..., n)
are by definition the parameter values that maximize (4-21) when the matrix of
observed item responses U ≡ ‖uia‖ is known. In practice, the maximum likelihood
estimates are found by taking derivatives of the logarithm of the likelihood
function, setting the derivatives equal to zero, and then solving the resulting
likelihood equations.
The natural logarithm of (4-21), to be denoted by l, is

l ≡ ln L(U|θ; a, b, c) = Σᴺₐ₌₁ Σⁿᵢ₌₁ [uia ln Pia + (1 − uia) ln Qia].   (4-29)
If χ represents θₐ, aj, bj, or cj, the derivative of the log likelihood with respect
to χ is

∂l/∂χ = Σᴺₐ₌₁ Σⁿᵢ₌₁ [uia/Pia − (1 − uia)/Qia] P'ia
      = Σᴺₐ₌₁ Σⁿᵢ₌₁ [(uia − Pia)/(PiaQia)] P'ia,   (4-30)

where P'ia ≡ ∂Pia/∂χ. An explicit expression for P'ia can be written as soon as
the mathematical form of Pia is specified, as by Eq. (2-1) or (2-2). The result for
the three-parameter logistic model is given by Eq. (4-40). Some practical procedures
for solving the resulting likelihood equations are discussed in Chapter 12.
When a, b, c are known from pretesting, the likelihood equation for estimating
the ability of each examinee is obtained by setting (4-30) equal to zero:
Σⁿᵢ₌₁ [(uia − Pia)/(PiaQia)] P'ia = 0.   (4-31)
This is a nonlinear equation in just one unknown, θₐ. The maximum likelihood
estimate θ̂ₐ of the ability of examinee a is a root of this equation. The roots of
(4-31) can be found by iterative numerical procedures, once the mathematical
form for Pia is specified.
If the number of items is small, (4-31) may have more than one root
(Samejima, 1973). This may cause difficulty if the number of test items n is 2 or
3, as in Samejima's examples. Multiple roots have not been found to occur in
practical work with n ≥ 20.
If the number of items is large enough, the long test being formed by combining
parallel subtests, the uniqueness of the root θ̂ of the likelihood equation
(4-31) is guaranteed by a theorem of Foutz (1977). The unique root is a consistent
estimator; that is, it converges to the true parameter value as the number of
parallel subtests becomes large.
∂Pia/∂θₐ = D ai Pia Qia.   (4-32)

Substituting this for P'ia in (4-31) and rearranging gives the likelihood equation

Σⁿᵢ₌₁ ai Pia(θₐ) = Σⁿᵢ₌₁ ai uia.   (4-33)
This again is a nonlinear equation in a single unknown, θₐ.
Note that its root θ̂ₐ, the maximum likelihood estimator, is a function of the
sufficient statistic (4-25). Thus (4-33) is the same as (4-28). It is a general
property that a maximum likelihood estimator will be a function of sufficient
statistics whenever the relevant sufficient statistics exist.
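A minimal Python sketch of solving the likelihood equation (4-33) numerically: because the left side Σ aiPia(θ) is increasing in θ, a simple bisection search locates the root θ̂ₐ from the sufficient statistic s = Σ aiuia. The bisection routine, the search bounds, and the five item parameter pairs below are all assumptions for illustration (the logistic model with ci = 0 is used, as the text requires).

import math

D = 1.7

def P2(theta, a, b):
    """Two-parameter logistic response function (Eq. (2-1) with c = 0)."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def mle_theta(u, items, lo=-6.0, hi=6.0, tol=1e-6):
    """Solve Eq. (4-33): sum_i a_i P_i(theta) = sum_i a_i u_i, by bisection."""
    s = sum(a * u_i for u_i, (a, b) in zip(u, items))      # sufficient statistic (4-25)
    if s <= 0 or s >= sum(a for a, b in items):
        raise ValueError("all-wrong or all-right patterns have no finite estimate")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if sum(a * P2(mid, a, b) for a, b in items) < s:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Illustrative five-item test (hypothetical a_i, b_i) and one response pattern
items = [(0.8, -1.5), (1.0, -0.5), (1.2, 0.0), (1.0, 0.5), (0.9, 1.5)]
print(round(mle_theta([1, 1, 1, 0, 0], items), 3))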
Suppose all items have the same response function P(θ). We shall call this the
case of equivalent items. This is not likely to occur in practice, but it is a limiting
case that throws some light on practical situations.
In the case of equivalent items, the likelihood equation (4-31) for estimating θ
becomes (P'/PQ) Σᵢ (ui − P) = 0, or

P(θ) = (1/n) Σⁿᵢ₌₁ ui ≡ z,   (4-34)
where z ≡ x/n is the proportion of items answered correctly. The maximum
likelihood estimator θ̂ is found by solving (4-34) for θ:

θ̂ = P⁻¹(z),   (4-35)

where P⁻¹( ) is the inverse function to P( ), whatever the item response
function may be.
Note that when all items are equivalent, a sufficient statistic for estimating
ability θ is s = Σᵢ aiui = a Σᵢ ui = ax. Thus, in this special case, both the
number-right score x and the proportion-correct score z are sufficient statistics
for estimating ability.
Exercise 4.15.1

Suppose that P(θ) is given by Eq. (2-1) and all items have ai = a, bi = b, and
ci = c, where a, b, and c are known. Show that the maximum likelihood
estimator θ̂ of ability is given by

θ̂ = (1/Da) ln[(z − c)/(1 − z)] + b.   (4-36)

(Here and throughout this book, "ln" denotes a natural logarithm.) If c = 0, θ̂ is
a linear function of the logarithm of the odds ratio (probability of success)/
(probability of failure).
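Equation (4-36) is easy to check numerically. The short Python sketch below (the values of a, b, c, and z are assumed for illustration) evaluates it and verifies that the resulting θ̂ reproduces P(θ̂) = z.

import math

D = 1.7

def theta_hat(z, a, b, c):
    """Maximum likelihood estimate (4-36) for a test of equivalent items."""
    return (1.0 / (D * a)) * math.log((z - c) / (1.0 - z)) + b

def P(theta, a, b, c):
    """Three-parameter logistic response function, Eq. (2-1)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

a, b, c, z = 1 / 1.7, 0.0, 0.2, 0.7      # assumed values; z must lie between c and 1
t = theta_hat(z, a, b, c)
print(round(t, 4), round(P(t, a, b, c), 4))   # P(theta_hat) should reproduce z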
A few useful formulas involving the three-parameter logistic function [Eq. (2-1)]
are recorded here for convenient reference. These formulas do not apply to the
three-parameter normal ogive [Eq. (2-2)].
Pi = ci + (1 − ci)/(1 + e^(−DLi)) = (ci + e^(DLi))/(1 + e^(DLi)),   (4-37)

Pi/Qi = (ci + e^(DLi))/(1 − ci).   (4-39)
P'i/Qi = Dai/(1 + e^(−DLi)).   (4-41)
4.17. EXERCISES
4-1 Compute P(θ) under the normal ogive model [Eq. (2-2)] for a = 1/1.7,
b = 0, c = .2, and θ = −3, −2, −1, 0, 1, 2, 3. Compare with the results
given for item 2 in Table 4.17.1 under the logistic model. Plot the item
response function P(θ) for each item in test 1, using the values given in
Table 4.17.1.
TABLE 4.17.1
Item Response Function P(θ) and Related Functions for Test 1,
Composed of n = 3 Items with Parameters a1 = a2 = a3 = 1/1.7,
b1 = −1, b2 = 0, b3 = +1, c1 = c2 = c3 = .2
[Columns: θ for items 1, 2, 3; P(θ); P(θ)Q(θ); P'(θ); P'²/PQ; P'/PQ.]
*For item i, enter the table with the θ values shown in column i (i = 1, 2, 3).
4-2 Compute from Eq. (4-1) for examinees at θ = 0 the frequency distribution
φ(x|θ) of the number-right score on a test composed of n = 3 equivalent
items, given that Pi(0) = .6 for each item. Compute the mean score from
the φ(x|θ), also from (4-2). Compute the standard deviation (4-3) of
number-right scores. Compute the mean of the proportion-correct score
z = x/n.
4-3 Compute from (4-1) the frequency distribution of number-right score x on
test 1 when θ = 0, given that P1(0) = .7848, P2(0) = .6, P3(0) = .4152.
Compute μx|θ, μz|θ, and σx|θ. Compare with the results of Exercise 4-2.
4-4 Note that σx|θ is the standard error of measurement, (4-8). Check the
value found in Exercise 4-3, using Eq. (4-4).
4-5 Compute from Table 4.3.1 the standard deviation of the conditional distribution
φ(x|θ) of number-right scores when θ = −3, 0, +2.25. (Because
of rounding errors, the columns do not add to exactly 100; compute
the standard deviation of the distribution as tabled.)
4-6 What is the range of number-right true scores ξ on test 1 (see Table
4.17.1)?
4-7 In Table 4.3.1, find very approximately an (equal-tailed) 94% confidence
interval for θ when x = 26.
4-8 Given that P1(0) = .7848, P2(0) = .6, P3(0) = .4152, as in Exercise 4-3,
compute from (4-20) the likelihood when θ = 0 of every possible pattern
of responses to this three-item test.
4-9 Given that u1 = 0, u2 = 0, and u3 = 1, compute for θ = −3, −2, −1, 0,
1, 2, 3 and plot the likelihood function (4-24) for a three-item test composed
of equivalent items with a = 1/1.7, b = 0, and c = 0 for each
item. The necessary values of P(θ) are given in Table 4.17.2.
4-10 For Exercise 4-9, show that the right side of (4-33) exceeds the left side
when θa = −1 but that the left side exceeds the right side when θa = 0;
consequently the maximum likelihood estimator θ̂a satisfying (4-33) lies
between −1 and 0.
4-11 Find from (4-36) the maximum likelihood estimate θ̂ for the situation in
Exercises 4-9 and 4-10.
TABLE 4.17.2
Logistic Item Response Function P(θ) when a = 1/1.7, b = 0, c = 0
[Tabled for θ = −3, −2, −1, 0, 1, 2, 3.]
APPENDIX

Prob(u0|s0, θ) = Prob(u0|θ) / Σ Prob(u|θ),

where the summation is over all vectors u for which Σᵢ aiui = s0. By (4-24),

Prob(u0|s0, θ) = [exp(−D Σᵢ aibiu0i) e^(Dθs0) ∏ᵢ Qi(θ)] / [Σ exp(−D Σᵢ aibiui) e^(Dθs0) ∏ᵢ Qi(θ)],

the summation in the denominator again being over all u for which Σᵢ aiui = s0.
The point of this result is not the formula obtained but the fact that it does not
depend on θ. In view of the definition of a sufficient statistic (see Section 4.12),
we therefore have the following: If
REFERENCES
Birnbaum, A. Test scores, sufficient statistics, and the information structures of tests. In F. M. Lord
& M. R. Novick, Statistical theories of mental test scores. Reading, Mass.: Addison-Wesley,
1968.
Foutz, R. V. On the unique consistent solution to the likelihood equations. Journal of the American
Statistical Association, 1977, 72, 147-148.
Kendall, M. G., & Stuart, A. The advanced theory of statistics (Vol. 1, 3rd ed.). New York: Hafner,
1969.
Lord, F. M., & Novick, M. R. Statistical theories of mental test scores. Reading, Mass.: Addison-
Wesley, 1968.
Samejima, F. A comment on Birnbaum's three-parameter logistic model in the latent trait theory.
Psychometrika, 1973, 38, 221-233.
Wright, B. D. Solving measurement problems with the Rasch model. Journal of Educational Measurement, 1977, 14, 97-116.
5 Information Functions and Optimal Scoring Weights
The information function I{θ, y} for any score y is by definition inversely proportional
to the square of the length of the asymptotic confidence interval for
estimating ability θ from score y (Birnbaum, 1968, Section 17.7). In this chapter,
an asymptotic result means a result that holds when the number n of items
(not the number N of people) becomes very large. In classical test theory, it is
usual to consider that a test is lengthened by adding items "like those in the
test," that is, by adding test forms that are strictly parallel (see Section 1.4) to the
original test. This guarantees that an examinee's proportion-correct true score
("zeta") ζ ≡ ξ/n is not changed by lengthening the test. We shall use lengthening
in this sense here and throughout this book.
Denote by z ≡ x/n the observed proportion-correct score (proportion of n
items answered correctly). The regression of z on ability θ is by Eq. (4-2) and
(4-9)

μz|θ = (1/n) Σⁿᵢ₌₁ Pi(θ) = ζ.   (5-1)

This regression is not changed by lengthening the test. The variance of z for fixed
θ is seen from Eq. (4-3) and (4-4) to be

σ²z|θ = (1/n²) Σⁿᵢ₌₁ PiQi = (1/n)(P̄Q̄ − σ²P|θ).   (5-2)

This variance approaches zero as n becomes large.
tan α = CB/AB = 2(1.96 σz|θ)/AB,

or

AB = 3.92 σz|θ / tan α.

FIG. 5.1.1. Construction of a 95% asymptotic confidence interval (θ̲, θ̄) for ability θ.
Since tan α is the slope of the regression line μz|θ, the information function, as
defined at the beginning of this chapter, for score z is proportional to

1/AB² = (dμz|θ/dθ)² / [(3.92)² Var(z|θ)].
Figure 5.1.1 was derived for estimating ability from the proportion-correct
score z. For unidimensional tests, the same line of reasoning applies quite generally
to almost all kinds of test scores in practical use. Thus Birnbaum (1968)
defines the information function for any score y to be

I{θ, y} ≡ (dμy|θ/dθ)² / Var(y|θ).   (5-3)

The information function for score y is by definition the square of the ratio of the
slope of the regression of y on θ to the standard error of measurement of y for
fixed θ.
Now true score η, corresponding to observed score y, is fixed whenever θ is
fixed. (If this were not true, the test items would be systematically measuring
1. The smaller the standard error of measurement σy|η, the more information
y provides about θ.
2. The steeper the slope of the regression μy|θ (the more sharply y varies
with θ), the more information y provides about θ.
If two tests have the same true-score scale, their effectiveness as measuring
instruments can properly be summarized by their standard errors of measurement
at various true-score levels. If, however, the tests measure the same trait but their
true-score scales are nonlinearly related, the situation is different. This will
ordinarily be the case whenever the tests are not parallel forms (see Chapter 13).
In this case, it is not enough to compare standard errors of measurement; we must
also take the relation of their true-score scales into account. This is the reason
why the score information function depends not only on the standard error of
measurement but also on the slope of the regression of score on ability.
Example 5.1

The use of (5-3) can be illustrated by deriving the information function for the
proportion-correct score z. From (5-1),

dμz|θ/dθ = (1/n) Σⁿᵢ₌₁ P'i(θ),

where P'i(θ) is the derivative of Pi(θ) with respect to θ. From (5-3) and (5-2) we
can now write the information function for z:

I{θ, z} = [Σⁿᵢ₌₁ P'i(θ)]² / Σⁿᵢ₌₁ Pi(θ)Qi(θ).
This result is the same as the information function for number-right score, which
is derived from a more general result in the sequel and presented as Eq. (5-13).
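A small Python sketch of the result of Example 5.1: the information for the proportion-correct score is the squared sum of the item slopes divided by the sum of the item variances. The three-parameter logistic form of Eq. (2-1), the numerical-derivative helper, and the item parameters are all assumed for illustration.

import math

D = 1.7

def P(theta, a, b, c):
    """Three-parameter logistic item response function, Eq. (2-1)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def P_prime(theta, a, b, c, h=1e-5):
    """Numerical derivative of P with respect to theta."""
    return (P(theta + h, a, b, c) - P(theta - h, a, b, c)) / (2 * h)

def info_z(theta, items):
    """I{theta, z} of Example 5.1: (sum of P'_i)^2 / sum of P_i Q_i."""
    slopes = sum(P_prime(theta, a, b, c) for a, b, c in items)
    variances = sum(P(theta, a, b, c) * (1 - P(theta, a, b, c)) for a, b, c in items)
    return slopes ** 2 / variances

items = [(1.0, -1.0, 0.2), (1.0, 0.0, 0.2), (1.0, 1.0, 0.2)]   # hypothetical parameters
print([round(info_z(t, items), 3) for t in (-2, -1, 0, 1, 2)])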
A nonasymptotic derivation of (5-3) was given by Lord (1952, Eq. 57) before the
term score information function was coined and also in a different context, by
Mandel and Stiehler (1954). Suppose that we are using test score y in an effort to
discriminate individuals at θ' from individuals at θ". Figure 5.2.1 illustrates the
two frequency distributions φ(y|θ) at θ' and θ" and shows the mean μy|θ of each
distribution.
A natural statistic to use to measure the effectiveness of y for this purpose is
the ratio

(μy|θ" − μy|θ') / σ̄y|θ,

where the denominator is some sort of average of σy|θ' and σy|θ". The displayed
ratio is proportional to the difference between means divided by its standard
error, sometimes called a critical ratio.
If θ' and θ" are close together, µy θ will be an approximately linear function
of θ in the interval (θ' ,θ"). Thus the numerator of our ratio will be proportional to
the distance θ" - θ'. The coefficient of proportionality is the slope of the
regression, given by the derivative d µy θ/dθ. Over short distances, it will make
no difference whether this slope is taken at θ" or at θ'. Also, σy|θ" will be close
to σy|θ', so their average will differ little from σy|θ'. Thus our ratio can be
written

(θ" − θ') (dμy|θ/dθ) / σy|θ'.

The information function (5-3) when θ = θ' is directly proportional to the
square of the ratio just derived. The coefficient of proportionality is (θ" − θ')², a
quantity of no relevance for assessing the discriminating power of test score y at
ability level θ = θ'.
If asymptotic values of μy|θ and Var(y|θ) are used in (5-3), we find an
asymptotic information function explicable in terms of the length of the asymptotic
confidence interval. If exact values of μy|θ and Var(y|θ) are used, the
The maximum likelihood estimator θ̂ is a kind of test score. Thus, we can use
(5-3) to find the information function of the maximum likelihood estimator. To
do this, we need (asymptotic) formulas for the regression μθ̂|θ and for the variance
σ²θ̂|θ.
There is a general theorem, under regularity conditions satisfied here
whenever the item parameters are known from previous testing: A maximum
likelihood estimator θ̂ of a parameter θ is asymptotically normally distributed
with mean θ0 (the unknown true parameter value) and variance

Var(θ̂|θ0) = 1 / E{[∂ ln L/∂θ]² | θ0},   (5-4)

where L is the likelihood function.
When the item parameters are known, we have from Eq. (5-4) and (4-30) that

1/Var(θ̂|θ0) = E{[Σⁿᵢ₌₁ (ui − Pi)P'i/PiQi]² | θ0}
            = E{[Σⁿᵢ₌₁ (ui − Pi)P'i/PiQi][Σⁿⱼ₌₁ (uj − Pj)P'j/PjQj] | θ0}
            = Σⁿᵢ₌₁ Σⁿⱼ₌₁ [P'i0 P'j0 / (Pi0Qi0 Pj0Qj0)] E[(ui − Pi)(uj − Pj) | θ0].

Since E(ui|θ0) = Pi0, the expectation under the summation sign is a covariance.
Because of local independence, ui is distributed independently of uj for fixed θ.
Consequently the covariance is zero except when i = j, in which case it is a
variance. Thus

1/Var(θ̂|θ0) = Σⁿᵢ₌₁ (P'i0²/Pi0²Qi0²) Var(ui|θ0) = Σⁿᵢ₌₁ P'i0²/Pi0Qi0.
Dropping the subscript o, the formula for the asymptotic sampling variance of the
maximum likelihood estimator is thus
Var(θ̂|θ) = 1 / Σⁿᵢ₌₁ (P'i²/PiQi).   (5-5)
Now, as already stated, θ̂ is a consistent estimator; so asymptotically μθ̂|θ = θ.
Thus asymptotically the numerator of the information function (5-3) for score
θ̂ is (dμθ̂|θ/dθ)² = 1. Thus the (asymptotic) information function (5-3) of the
maximum likelihood estimator of ability is the reciprocal of the asymptotic
variance (5-5):

I{θ} ≡ I{θ, θ̂} = Σⁿᵢ₌₁ P'i²/PiQi.   (5-6)
Let us note an obvious theorem in passing:
Theorem 5.3.2. The test information function I{θ} given by (5-6) is an upper
bound to the information that can be obtained by any method of scoring the test.
where τ'(θ) is the derivative of τ(θ). Since E(t|θ) ≡ τ(θ), we have from (5-3),
(5-4), and (5-7) asymptotically

I{θ, t} = [τ'(θ)]² / Var(t|θ) ≤ E[(∂ ln L/∂θ)²] = 1/Var(θ̂|θ) = I{θ}.   (5-8)

This result holds under rather general regularity conditions on the item response
function Pi(θ).
A very important feature of (5-6) is that the test information consists entirely of
independent and additive contributions from the items. The contribution of an
item does not depend on what other items are included in the test. The contribution
of a single item is P'i²/PiQi. This contribution is called the item information
function:

I{θ, ui} = P'i²/PiQi.   (5-9)

Item information functions for five familiar items are shown in Figure 2.5.1
along with the I{θ} for the five-item test.
In classical test theory, by contrast, the validity coefficient ρxC for number-
right test score (correlation between score and criterion C) is given by Eq. (1-25)
in terms of item intercorrelations ρij and item-criterion correlations ρ i c . There is
no way to identify the contribution of a single item to test validity; the contribu
tion of the item depends in an intricate way on the choice of items included in the
test. The same may be said of an item's contribution to coefficient alpha, as
shown by Eq. (1-24), and to other test reliability coefficients.
For emphasis and clarity, let us elaborate here Birnbaum's (1968) suggested
procedure for test construction, previewed in Chapter 2. The procedure operates
on a pool of items that have been calibrated by pretesting, so that we have the
item information curve for each item.1
1. Decide on the shape desired for the test information function. Remember
that this information function is inversely proportional to the squared length of
the asymptotic confidence interval for estimating ability from test score. What
accuracy of ability estimation is required of the test at each ability level? The
desired curve is the target information curve.
2. Select items with item information curves that will fill the hard-to-fill
areas under the target information curve.
3. Cumulatively add up the item information curves, obtaining at all times the
information curve for the part-test composed of items already selected.
4. Continue (backtracking if necessary) until the area under the target information
curve is filled up to a satisfactory approximation. (A computational sketch of this procedure is given below.)
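The Python sketch below is an illustrative greedy rendering of the rules above, not the author's program: the target curve, the item pool (precomputed item information curves on a θ grid), and the stopping rule are all assumptions. At each step it picks the calibrated item whose information curve most reduces the remaining deficit under the target curve.

def assemble_test(pool_curves, target):
    """Greedy sketch of the test-construction procedure above.
    pool_curves[i] is item i's information curve evaluated on a grid of theta values;
    target is the target information curve on the same grid (all values assumed)."""
    total = [0.0] * len(target)
    chosen, remaining = [], set(range(len(pool_curves)))
    while remaining and any(t > s for t, s in zip(target, total)):
        def deficit_filled(i):
            # credit information only where the target has not yet been reached
            return sum(min(c, max(t - s, 0.0))
                       for c, t, s in zip(pool_curves[i], target, total))
        best = max(remaining, key=deficit_filled)
        if deficit_filled(best) == 0.0:
            break                    # no remaining item helps; stop (backtracking omitted)
        remaining.discard(best)
        chosen.append(best)
        total = [s + c for s, c in zip(total, pool_curves[best])]
    return chosen, total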
¹These rules are reproduced with special permission from F. M. Lord, Practical applications of
item characteristic curve theory. Journal of Educational Measurement, Summer 1977, 14, No. 2,
117-138. Copyright 1977, National Council on Measurement in Education, Inc., East Lansing, Mich.

The item information curve for the three-parameter logistic model in Eq. (2-1)
can be written down from (5-9) in many forms, such as
I{θ, ui} = D²a²i (Qi/Pi) [(Pi − ci)/(1 − ci)]²,

or

I{θ, ui} = D²a²i (1 − ci) / [(ci + e^(DLi))(1 + e^(−DLi))²].
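The two algebraic forms just given are easy to check against each other numerically. The Python sketch below (the item parameter values are assumed for illustration) evaluates both for a single item and prints matching values.

import math

D = 1.7

def item_info_form1(theta, a, b, c):
    """D^2 a^2 (Q/P) [(P - c)/(1 - c)]^2."""
    L = a * (theta - b)
    P = c + (1.0 - c) / (1.0 + math.exp(-D * L))
    Q = 1.0 - P
    return (D * a) ** 2 * (Q / P) * ((P - c) / (1.0 - c)) ** 2

def item_info_form2(theta, a, b, c):
    """D^2 a^2 (1 - c) / [(c + e^{DL})(1 + e^{-DL})^2]."""
    L = a * (theta - b)
    return (D * a) ** 2 * (1.0 - c) / ((c + math.exp(D * L)) * (1.0 + math.exp(-D * L)) ** 2)

a, b, c = 1.0, 0.5, 0.2          # hypothetical item parameters
for theta in (-2, -1, 0, 1, 2):
    print(theta, round(item_info_form1(theta, a, b, c), 4), round(item_info_form2(theta, a, b, c), 4))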
Suppose the test score is the weighted composite y ≡ Σᵢ wiui, where the wi are
any set of weights. Since each ui is a (locally) independent binomial variable, we
have

μ_Σwu|θ = Σᵢ wiPi,   (5-10)

σ²_Σwu|θ = Σᵢ w²iPiQi.   (5-11)

By (5-3), the information function for the weighted composite is

I{θ, Σᵢwiui} = (Σᵢ wiP'i)² / Σᵢ w²iPiQi.   (5-12)
If the weights are all 1, y is the usual number-right score x. Thus the information
function for number-right score x is

I{θ, x} = (Σᵢ P'i)² / Σᵢ PiQi.   (5-13)

Note that (5-12) and (5-13) cannot be expressed as simple sums of independent
additive contributions from individual items, as in (5-6).
Figure 5.5.1 shows the estimated information function I{θ, x} for the
number-right score on a high school-level verbal test (SCAT II, Form 2A)
composed of 50 four-choice word-relations items. For comparison, the test in
formation function I{θ} is shown and also two other information functions to be
discussed later. The test characteristic curve is given also.
We have seen that the squared slope of the test characteristic curve is the
numerator of the information function for number-right score. The inflection
point of the test characteristic curve in the figure is to the left of the maximum
information, showing the effect of the denominator (squared standard error of
measurement).
The relation shown between the number-right curve and the upper bound I{θ}
is fairly typical of plots seen by the writer. This relation is of interest since it
limits the extent to which we can hope to improve the accuracy of measurement
by improving the method of scoring the test.
FIG. 5.5.1. Test characteristic curve (solid line) and various information curves
(dashed lines) for SCAT II 2A Verbal test. Item-scoring weights for the information
curves are specified in the legend (equal weights; optimal weights; scoring weights ai; scoring weights (5-18)).
[Number-right score and score information plotted against ability.]
We have the surprising result that the information function for the weighted
composite Σᵢ (P'i/PiQi)ui is the same as the test information function, which is
the maximum information attainable by any scoring method. Thus

Wi(θ) ≡ P'i(θ) / [Pi(θ)Qi(θ)].   (5-15)
P'i = Dai Qi(Pi − ci) / (1 − ci).   (5-16)

From (5-15) and (5-16), the optimal item-scoring weights are

Wi(θ) = Dai(Pi − ci) / [Pi(1 − ci)] = Dai / (1 + ci e^(−DLi)),   (5-17)
where Li ≡ ai(θ − bi). Note that when ci = 0, the optimal weight is 1.7ai or,
since we may divide all the weights by 1.7, simply ai.
At high ability levels Pi(θ) approaches 1; consequently Wi(θ) approaches Dai. Thus, we see
that at high ability levels optimal scoring weights under the logistic model are
proportional to item discriminating power ai.
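The behavior just described is easy to reproduce. The Python sketch below (the two item parameter triples are assumed for illustration) evaluates the optimal weights (5-17) for an easy and a difficult item at several ability levels.

import math

D = 1.7

def optimal_weight(theta, a, b, c):
    """Optimal item-scoring weight W_i(theta) of Eq. (5-17) for the logistic model."""
    L = a * (theta - b)
    return D * a / (1.0 + c * math.exp(-D * L))

easy = (1.0, -1.5, 0.2)          # hypothetical easy item
hard = (1.0, 1.5, 0.2)           # hypothetical difficult item
for theta in (-3, -1, 1, 3):
    print(theta,
          round(optimal_weight(theta, *easy), 3),
          round(optimal_weight(theta, *hard), 3))
# At high theta both weights approach D*a; at low theta the difficult item's
# weight collapses toward zero because of guessing.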
The optimal weights Wi(θ) under the logistic model are shown in Fig. 2.5.2
for five familiar items. Note the following facts about optimal item weights for
the logistic model, visible from the curves in the figure.2
1. As ability increases, the curve representing optimal item weight as a func
tion of ability sooner or later becomes virtually horizontal. Thus, for sufficiently
high ability levels, the optimal item weights are virtually independent of ability
level. The optimum weight at this upper asymptote is proportional to the item
parameter ai. This occurs because there is no guessing at high ability levels.
2. As ability decreases from a very high level, the optimal weight curves for
the difficult items begin to decline. The reason is that at lower ability levels
random guessing destroys the value of these items.
3. As ability decreases further, the optimal weights for these difficult items
become virtually zero. Such items will not be wanted if the test is used only to
discriminate among examinees at low ability levels.
In summary, under the logistic model, the optimal weight to be assigned to an
item for discriminating at high ability levels depends on the general discriminat
ing power of the item. The optimal weight to be used for discriminating at lower
ability levels depends not only on the general discriminating power of the item
but also very much on the amount of random guessing occurring on the item at
these ability levels. Thus, all moderately discriminating items are of use for
discriminating at high ability levels, whereas only the easy items are of appreci
able use for discriminating at low ability levels.
Item-scoring weights that are optimal for a particular examinee can never be
determined exactly, since we do not know the examinee's ability θ exactly.3 A
2
The remainder of this paragraph is taken with permission from F. M. Lord, An analysis of the
Verbal Scholastic Aptitude Test using Birnbaum's three-parameter logistic model. Educational and
Psychological Measurement, 1968, 28, 989-1020.
3
The remainder of this section is adapted and reprinted with special permission from F. M. Lord,
Practical applications of item characteristic curve theory. Journal of Educational Measurement,
Summer 1977, 14, No. 2, 117-138. Copyright 1977, National Council on Measurement in Educa
tion, Inc., East Lansing, Mich.
Is there an item response function P(θ) such that the optimal weights w(θ)
actually do not depend on θ? If so, then
w(θ) ≡ P'(θ) / [P(θ)Q(θ)] = A,
where A is some constant. This leads to the differential equation
dP / [P(1 − P)] = A dθ.
Integrating, we have uniquely
−ln[(1 − P)/P] = Aθ + B,
where B is a constant of integration. Solving for P, we find
P ≡ P(θ) = 1 / (1 + e^(−Aθ−B)) = 1 / (1 + e^(−Da(θ−b))),

where A ≡ Da and B ≡ −Dab.
In summary, when the item response function is a two-parameter logistic
function, the optimal scoring weight Wi(θ) does not depend on θ. The optimal
weight is wi = ai, the item discrimination index. The optimally weighted composite
of item scores is s ≡ Σᵢ aiui, the sufficient statistic of Section 4.12. The
two-parameter logistic function, which does not permit guessing, is the most
general item response function for which the optimal item scoring weights do not
depend on θ.
Figure 5.5.1 shows the information curve obtained when the weights wi = ai
are used for the SCAT II-2A verbal test. The score Σᵢ aiui is optimally efficient
at high ability levels but is less efficient than number-right score at low ability
levels. This is the result to be expected on a multiple-choice test, since wi = ai is
optimal only when there is no guessing.
From Eq. (4-31) and (5-15), if the item parameters are known from pretesting,
the maximum likelihood estimate of ability is obtained by solving, for θ̂, the
equation

Σⁿᵢ₌₁ Wi(θ̂) ui = Σⁿᵢ₌₁ Wi(θ̂) Pi(θ̂).

Thus we see that the maximum likelihood estimator θ̂ is itself a function of the
optimally weighted composite of item scores ΣᵢWi(θ)ui with θ̂ substituted for θ.
This is true regardless of the form of the item response function Pi(θ).
5.9. EXERCISES
5-1 For test 1, compute from Table 4.17.1 the mean number-right score x at
θ = −3, −2, −1, 0, 1, 2, 3 using Eq. (4-2). Plot the regression of x on
θ. This is the test characteristic function.
5-2 As in Exercise 5-1, compute the standard deviation [Eq. (4-3)] of number-right
score for integer values of θ. Plot σx|θ on the same graph as the regression.
5-3 From Table 4.17.1, plot on a single graph the item information function
for each of the three items in test 1.
5-4 Compute from Table 4.17.1 the test information function (5-6) of test 1.
Plot on the same graph as μx|θ and σx|θ. Also plot on the same graph as
the item information functions.
5-5 to 5-8 Using Table 4.17.2, repeat Exercises 5-1 to 5-4 for a three-item test
with a = 1/1.7, b = 0, c = 0 for all items. Compare with the results of
test 1.
5-9 Compute from (5-5) the variance of θ̂ at integral values of θ for test 1.
5-10 From Table 4.17.1, compute the score information function (5-13) for the
number-right score x. Plot this and the test information function (5-6)
from Exercise 5-4 on the same graph.
5-11 For each item in test 1, plot the optimal scoring weights (5-15) as a function
of θ.
5-12 For test 1, compute the optimally weighted composite score ΣᵢWi(θ)ui
for examinees at ability level θ = 0 responding u1 = 1, u2 = 0, u3 = 0.
Repeat for u1 = 0, u2 = 1, u3 = 0; also repeat for u1 = 0, u2 = 0, u3
= 1. Can you explain why the scores for the three patterns should be in
the rank order you have found?
5-13 Compute the optimal item-scoring weight (5-15) at each θ level for the
items in Table 4.17.2. Explain.
APPENDIX
(1/n) times a constant term (a term that does not vary with n) and also that E(y −
η)³, the third sampling moment of y, is of order n^(−3/2) (is a constant divided by
n^(3/2)). We assume this in all that follows.
Expanding Y(y) by Taylor's formula, we have

Y(y) − Y(η) = Y'(η)(y − η) + ½Y"(η)(y − η)² + δY'"(η)(y − η)³,   (5-20)

where 0 < δ < 1 and Y'(η), Y"(η), Y'"(η) are derivatives of Y(η) with respect to η.
Rearranging (5-20) and taking expectations, we find that the expectation of Y is
Squaring (5-20) and taking expectations, we find a formula for the sampling
variance of Y:

Var(Y|θ) ≡ E{[Y(y) − Y(η)]² | θ} = [Y'(η)]² Var(y|θ) + terms of order n^(−3/2).   (5-22)
From (5-21),

(d/dθ) E(Y|θ) = Y'(η)(dη/dθ) + terms of order 1/n.
From this and (5-22) and (5-3), we obtain the information function for the
transformed score Y(y):

I{θ, Y(y)} = (dη/dθ)² / Var(y|θ) + neglected terms.
Since Var(y|θ) is a constant times 1/n and the numerator is independent of n, the
fraction on the right is of order n (a constant times n). The largest neglected
terms are easily seen to be constant with respect to n. For large n, the largest
neglected terms are therefore small compared to the term retained. The term retained
is seen to be the information function of the untransformed score y. Asymptotically,

I{θ, Y(y)} = I{θ, y}.   (5-23)
In summary, if (1) y is a score chosen so that the corresponding true score
does not vary with n, (2) Y(y) is a monotonic transformation of y not involving
n, (3) Var(y|θ) is of order 1/n and E[(y − η)³|θ] is of order n^(−3/2), then the score
transformation Y(y) does not change the asymptotic score information function.
The first restriction in this summary is readily removed for most sensible
methods of scoring. The number-right true score ξ, for example, varies with n,
whereas the proportion-correct true score ζ = ξ/n does not; yet both ξ and ζ have
the same information: I{θ, ξ} = I{θ, ζ}.
The invariance (5-23) of I{θ, Y(y)} is important; however, it is not surprising
6 The Relative Efficiency of Two Tests
The relative efficiency of test score y with respect to test score x is the ratio of
their information functions:

RE{y, x} ≡ I{θ, y} / I{θ, x}.   (6-1)

Scores x and y may be scores on two different tests of the same ability θ, or x and
y may result from scoring the same test in two different ways. Relative efficiency
is defined only when the θ in I{θ, y} is the same θ as in I{θ, x}. Although the
notation does not make it explicit, it should be clear that the relative efficiency of
two test scores varies according to ability level.
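A short Python sketch of (6-1) makes the dependence on ability level concrete. The two ten-item tests below, scored number-right via Eq. (5-13) under the three-parameter logistic model of Eq. (2-1), are hypothetical stand-ins for a "peaked" and a "regular" test; all parameter values are assumptions.

import math

D = 1.7

def P(theta, a, b, c):
    """Three-parameter logistic item response function, Eq. (2-1)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def P_prime(theta, a, b, c):
    """Derivative of the 3PL response function with respect to theta."""
    e = math.exp(-D * a * (theta - b))
    return (1.0 - c) * D * a * e / (1.0 + e) ** 2

def info_number_right(theta, items):
    """I{theta, x} of Eq. (5-13) for number-right score."""
    num = sum(P_prime(theta, a, b, c) for a, b, c in items) ** 2
    den = sum(P(theta, a, b, c) * (1 - P(theta, a, b, c)) for a, b, c in items)
    return num / den

peaked = [(1.0, 0.0, 0.2)] * 10                                  # hypothetical peaked test (x)
regular = [(1.0, -2.0 + 4.0 * i / 9, 0.2) for i in range(10)]    # hypothetical regular test (y)

for theta in (-2, -1, 0, 1, 2):
    re = info_number_right(theta, regular) / info_number_right(theta, peaked)
    print(theta, round(re, 2))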
The dashed curve in Fig. 6.7.1 shows estimated relative efficiency of a
"regular" test compared to a "peaked" test. Both are 45-item verbal tests
composed of five-choice items. The regular test (y) consists of the even-
numbered items in a 90-item College Board SAT. The peaked test (x) consists of
45 items from the same test with difficulty parameters nearest the average bi (the
average over all 90 items).
There is considerable overlap in items between the two 45-item tests, but this
does not impair the comparison. As the figure shows, from the third percentile up
through the thirtieth, the regular test with its wide spread of item difficulty is less
than half as efficient as the peaked test. In other words, the regular test would
have to be lengthened to more than 90 items in order to be as efficient as the
45-item peaked test within this range.
The ability scale θ is the scale on which all item response functions have the
particular mathematical form Pi(θ). This is a specified form chosen by the
psychometrician, such as Eq. (2-1) or (2-2). Except for the theoretical case where
all items are equivalent, there is no transformation of the ability scale that will
convert a set of normal ogive response functions to logistic, or vice versa.
Once we have found the scale θ on which all item response curves are (say)
logistic, it is often thought that this scale has unique virtues. This conclusion is
incorrect, however, as the following illustration shows.
Consider the transformations

θ* ≡ θ*(θ) ≡ Ke^(kθ),   b*i ≡ Ke^(kbi),   a*i ≡ Dai/k,   (6-2)

where K and k are any positive constants. Under the logistic model

Pi ≡ ci + (1 − ci)/(1 + e^(−Dai(θ−bi))) ≡ ci + (1 − ci)/[1 + (b*i/θ*)^(a*i)].   (6-3)

Also

Qi = (1 − ci)(b*i/θ*)^(a*i) / [1 + (b*i/θ*)^(a*i)].
Thus

(Pi − ci)/Qi = (θ*/b*i)^(a*i).   (6-4)

This last equation relates probability of success on an item to the ratio of examinee
ability θ* to item difficulty b*i. The relation is so simple and direct as to
suggest that the θ* scale may be better for measuring ability than is the θ scale.
By assumption, all items have logistic response curves on the θ scale; however,
it is equally true that all items have response curves given by (6-3) on the θ*
scale. Thus there is no obvious reason to prefer θ to θ*.
If there is no unique virtue in the θ scale for ability, we should consider how a
monotonic transformation of this scale affects our theoretical machinery. There
is nothing about our definition or derivation of the information function [Eq. (5-3)]
that requires us to use the θ scale rather than the θ* scale. If θ* is any monotonic
transformation of θ, the information function for making inferences about θ*
from y is defined by Eq. (5-3) to be
I{θ*, y} ≡ (dμy|θ*/dθ*)² / Var(y|θ*).   (6-5)
Before proceeding, we need to clarify a notational paradox. Note that, for
every θ0,
FIG. 6.3.1. Score information function for measuring ability θ, SAT Mathematics
test. Taken with permission from F. M. Lord, The 'ability' scale in item characteristic
curve theory. Psychometrika, 1975, 40, 205-217. [Score information plotted against ability expressed as percentile rank in the group tested.]
function may be drastically altered by the transformation. Worse yet, the ability
level at which a test provides maximum information may be totally different
when ability is measured by θ* rather than by θ. Or I{θ, x} may have one
maximum, whereas I{θ*, x} has two separate maxima. Actually, any single-
valued continuous information function on θ may be transformed to any other
such information function by a suitably chosen monotonic transformation θ*(θ).
Figure 6.3.1 shows the information function I{θ, x} for number-right score
on a 60-item College Board mathematics aptitude test. The baseline, representing
ability, is marked off in terms of estimated percentile rank on ability for the
group tested rather than in terms of θ values. Figure 6.3.2 shows a rather mild
transformation θ*(θ) ≡ ω(θ). Figure 6.3.3 shows the resulting information func-
FIG. 6.3.2. Relation of the ω scale of ability to the usual θ scale. Taken with
permission from F. M. Lord, The 'ability' scale in item characteristic curve
theory. Psychometrika, 1975, 40, 205-217.
tion I{ω, x} on the ω scale for the same number-right score. The information
functions for the same score x on the two different ability scales bear little resemblance
to each other.
Clearly information is not a pure number; the units in terms of which informa
tion is measured depend on the units used to measure ability. This must be true,
since information is defined by the length of a confidence interval, and this
length is expressed in terms of the units used to measure ability. If we are
uncertain what units to use to quantify ability, then to the same extent we do not
know how to quantify information.
We cannot draw any useful conclusions from the shape of a single information
function unless we assert that the ability scale we are using is unique except for a
FIG. 6.3.3. Score information function for measuring ability ω, SAT Mathematics
test. Taken with permission from F. M. Lord, The 'ability' scale in item characteristic
curve theory. Psychometrika, 1975, 40, 205-217.
linear transformation. Most important we cannot know at what ability level the
test or test score discriminates best, unless we have an ability scale that is not
subject to challenge.
Even though a single information curve may not be readily interpretable,
comparisons between two or more information curves are not impaired by doubt
about the ability scale. This important fact is easily proved in Section 6.4.
Suppose we transform the ability scale monotonically to θ*(θ) and then compute
the relative efficiency of two scores, x and y (which may be scores on one test or
on two different tests), for measuring θ*. Replacing θ by θ* in (6-1) and using
(6-6), we find

RE{y, x} = I{θ*, y}/I{θ*, x} = [I{θ, y}/(dθ*/dθ)²] / [I{θ, x}/(dθ*/dθ)²] = I{θ, y}/I{θ, x}.

Comparing this with (6-1), we see that relative efficiency is invariant under any
monotonic transformation of the ability scale. It is for this reason that the symbol
θ does not appear in the notation RE{y, x}.
For the reasons outlined in Section 6.3, the practical applications of item
response theory in this book are not based on inference from an isolated information
function. We shall compare information curves, or equivalently we shall rely
on a study of relative efficiency. Such comparisons are not affected by the choice
of scale for measuring ability.
It was noted in Section 4.2 that number-right true score ξ is a monotonic increas
ing transformation of ability θ. What is the information function of number-right
score x for making inferences about true score ξ?
If we substitute ξ for θ* and x for y in (6-5), we find

I{ξ, x} = (dμx|ξ/dξ)² / σ²x|ξ.   (6-7)

Now the true score ξ is defined as the expectation of x. It follows that μx|ξ = ξ.
If we substitute this into the numerator of (6-7), the desired information function
is found to be

I{ξ, x} ≡ 1/σ²x|ξ.   (6-8)
When using observed score x to make inferences about the corresponding
true score ξ, the appropriate information function I{ξ, x} is the reciprocal of
the squared standard error of measurement of score x at ξ. This result will hold
for any score x, not just for number-right score, as long as ξ ≡ μx|θ is a monotonic
function of θ.
Figure 6.5.1 shows I{ξ, x} for the same test represented in Fig. 6.3.1 and
6.3.3. The reader should compare these three information functions, noting once
again that information functions do not give a unique answer to the question: ' 'At
what ability level does the test measure best?"
The reader may have been startled to find from Fig. 6.5.1 that I{ξ, x} is
greatest at high and at low ability levels and least at moderate ability levels.
Actually, similar results would be found for most tests. Examinees at very high
FIG. 6.5.1. Score information function for measuring the true score ξ on SAT
Mathematics test. [Score information plotted against number-right true score.]
ability levels are virtually certain to obtain a perfect score on the test. Thus for
them the standard error of measurement σx|ξ is nearly zero, their true score ξ is
very close to n, the length of the confidence interval for estimating ξ from x is
nearly zero, and consequently I{ξ, x} ≡ 1/σ²x|ξ is very large. Clearly true score
ξ can be estimated very accurately for such examinees: It is close to n. Their
ability θ cannot be estimated accurately, however: We know that their θ is high
without knowing how high. This situation is mirrored by the fact that I{ξ, x} is
very large for such examinees, whereas I{θ, x} is near zero. The reader should
understand these conclusions if he is to make proper use of information functions
(or of standard errors of measurement).
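As a minimal numerical illustration of this behavior, suppose (purely for the sketch, anticipating the binomial error model introduced later in this chapter) that σ²_{x|ξ} = ξ(n − ξ)/n for a hypothetical n-item test. Then I{ξ, x} is smallest near the middle of the true-score range and grows without bound near ξ = 0 and ξ = n.

```python
import numpy as np

n = 60                               # hypothetical test length
xi = np.array([3.0, 10.0, 20.0, 30.0, 40.0, 50.0, 57.0])
# Binomial error model: sigma^2(x | xi) = xi (n - xi) / n,
# so the information of x about the true score is I{xi, x} = n / [xi (n - xi)].
info = n / (xi * (n - xi))
for t, i in zip(xi, info):
    print(f"true score {t:4.0f}:  I(xi, x) = {i:.3f}")
# Smallest near xi = n/2, increasingly large toward xi = 0 and xi = n:
# the U-shape described for Fig. 6.5.1.
```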
Suppose now that we have another test measuring the same ability θ as test x.
Denote the observed score on the new test by y and the corresponding true score
by η. As in (6-8), the information function for y on η will be

I{η, y} = 1/σ²_{y|η}.   (6-9)
Similarly,
RE{x, y} = [σ²_{y|η}/σ²_{x|ξ}] (dξ/dη)².   (6-12)
Equations (6-11) and (6-12) are valid regardless of the scale used to measure
ability (see Section 6.4). In particular, (6-11) and (6-12) do not assume that
ability is to be measured on the true-score scale ξ.
Denote by p(ξ) the frequency distribution (density) of true score ξ in some
population of examinees. The distribution q(η) of η = η(ξ) in this same population
is then found from

q(η) dη ≡ p(ξ) dξ.   (6-13)

Rearranging, we have

dη/dξ = p(ξ)/q(η).
Substituting this into (6-11), we find
RE{y, x} = [σ²_{x|ξ} p²(ξ)] / [σ²_{y|η} q²[η(ξ)]].   (6-14)
To our surprise, this formula shows that the relative efficiency of two tests can
be expressed directly in terms of true-score frequency distributions and standard
errors of measurement. The formulas agree with the vague intuitive notion that a
test is more discriminating at true-score levels where the scores are spread out
and less discriminating at true-score levels where the scores pile up.
In more familiar terms, this equation says that η₀ ≡ η(ξ₀) has the same percentile
rank in q(η) as ξ₀ does in p(ξ). Thus for any value of ξ, η = η(ξ) is to be
obtained by standard equipercentile equating. The distributions q(η) and p(ξ)
must be for the same group or for statistically equivalent groups of examinees.
Given estimates of q(η) and p(ξ), the integration and equating are done by
numerical methods by the computer (see Section 17.3).
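A sketch of the equipercentile step on a discrete grid, with purely hypothetical densities standing in for p(ξ) and q(η); the cumulative distributions are matched by linear interpolation.

```python
import numpy as np

def equipercentile_map(xi_grid, p_xi, eta_grid, q_eta):
    """For each xi, the eta with the same cumulative probability (percentile rank)."""
    P = np.cumsum(p_xi); P /= P[-1]          # cumulative distribution of xi
    Q = np.cumsum(q_eta); Q /= Q[-1]         # cumulative distribution of eta
    return np.interp(P, Q, eta_grid)         # invert Q at the percentile ranks of xi

# Hypothetical discretized true-score densities for a 50-item test x and a 30-item test y.
xi_grid = np.arange(0, 51, dtype=float)
eta_grid = np.arange(0, 31, dtype=float)
p_xi = np.exp(-0.5 * ((xi_grid - 30.0) / 8.0) ** 2)    # illustrative shapes only
q_eta = np.exp(-0.5 * ((eta_grid - 20.0) / 5.0) ** 2)

eta_of_xi = equipercentile_map(xi_grid, p_xi, eta_grid, q_eta)
for xi0 in (10, 25, 40):
    print(f"xi = {xi0}  ->  equated eta = {eta_of_xi[xi0]:.2f}")
```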
A computer program (Stocking, Wingersky, Lees, Lennon, & Lord, 1973) is
available to compute (6-14). The program uses estimates of p(ξ) and q(η) obtained
by the methods of Chapter 16. It then uses (6-17) to find equivalent values of ξ
and η. Finally, using approximation (6-15), it computes relative efficiencies by
(6-14).
In Fig. 6.7.1, the solid curve is the approximate relative efficiency from
(6-14). The dotted curve is the ratio of information functions computed by (6-1)
FIG. 6.7.1. Approximation (solid line) to relative efficiency [Eq. (6-14)] compared
with estimate (dashed line) from Eq. (6-1) and (5-13). (From F. M. Lord, The
relative efficiency of two tests as a function of ability level. Psychometrika, 1974,
39, 351-358.)
from estimated item response function parameters. The two tests under comparison
are the regular test (y) and the peaked test (x), described in more detail in
Section 6.1. Approximation (6-14) tends to oscillate about the estimated relative
efficiency (6-1), but the approximation is adequate for the practical purpose of
comparing the effectiveness of the tests over a range of ability levels. The
agreement found here and in later sections of this chapter between relative
efficiency calculated from item parameters and relative efficiency approximated
from totally different sources is a reassuring illustration of the adequacy of item
response theory and of the procedures used for estimation of item parameters.
As noted in Section 6.4, the relative efficiency of two tests remains the same
under any monotonic transformation of the ability scale. Thus, the RE curve can
be plotted against any convenient baseline. In Fig. 6.7.1, the baseline is scaled in
terms of true score ζ = ξ/n = Σᵢ Pᵢ(θ)/n for the peaked test [see Eq. (4-5)].
Is it a good rule of test construction to spread the items over a wide range of
item difficulty, so as to have some items that are appropriate for each examinee?
Or will a peaked test with all items of equal difficulty be better for everyone? In
Fig. 6.7.1, the peaked test (really only partially peaked—it is hard to find 45
items that are identical in difficulty) is better than the regular (unpeaked) test for
all examinees from the first through the seventy-fifth percentile. If the peaked
test were more difficult, it might be better from perhaps the tenth percentile up
through the ninetieth.
Although the approximation of Section 6.7 avoids the need to estimate item
response function parameters, the method (see Chapter 16) for estimating p(ξ)
and q(η) is far from simple. Section 6.7 is included here because it leads to the
suggestion that a simple approximation to relative efficiency can be obtained by
substituting observed-score relative frequencies, fx and fy, say, for the true-score
densities p(ξ) and q(η).
A simple approximation to σ_{x|ξ} and σ_{y|η} is also available. If the n_x items in
test x are considered as a random sample from an infinite pool of items, then the
sampling distribution of number-right score x for a particular examinee, over
successive random samples of items, is the familiar binomial distribution

C(n_x, x) ζ^x (1 − ζ)^{n_x − x},

where ζ is a parameter characterizing the individual. Since ℰ(x|ζ) = n_xζ = ξ
for the binomial, ζ or ξ is the individual's true score. [Although it may not
seem so, the fact is that the binomial model just described holds just as well
when the items are of widely varying difficulty as when they are all of the
same difficulty. A simple discussion of this fact is given by Lord (1977).]
Under the binomial model just outlined, the sampling variance of an examinee's
number-right score x over random samples of items is given by the familiar formula

σ²_{x|ξ} = n_xζ(1 − ζ) = ξ(n_x − ξ)/n_x.   (6-19)
FIG. 6.8.1. Estimated true-score distribution for the sixth-grade data for STEP (1), MAT (2), CAT
(3), ITBS (4), Stanford (5), CTBS (6), and SRA (7).
Equation (6-20) will work best with a large sample of examinees, perhaps
several thousand. If the sample size is smaller, the equipercentile equating of x
and y will be irregular because of local irregularities in fx and fy. This can be
overcome by smoothing distributions fx and fy. Smaller samples can then be
used, but at some cost in labor.
In order to investigate the adequacy of (6-20), the relative efficiencies of the
vocabulary sections of seven nationally known reading tests were approximated
by formula (6-20) and also by the computer program (Stocking et al., 1973)
described in Section 6.7. For each test, a carefully selected representative national
sample of 10,000 or more sixth graders from the Anchor Test Study
(Loret, Seder, Bianchini, & Vale, 1974) supplies the frequency distribution of
number-right vocabulary score needed for the two methods. The number of items
per vocabulary section ranges from n = 30 through n = 50.
Figure 6.8.1 shows the true-score distributions for the seven vocabulary tests
as estimated by the method of Chapter 16. These are the p(ξ) and q(η) used in
(6-14) to obtain the smooth curves in Fig. 6.9.1-6.9.6. As already noted, a test
in general tends to be less efficient where the true scores pile up and more
efficient where the true scores are spread out.
Figures 6.9.1 to 6.9.6 show the efficiency curves for six of the tests relative to
the Metropolitan Reading Tests (1970), Intermediate Level, Form F, Word
Analysis subtest (MAT). The smooth curves are obtained from (6-14); the broken
lines are obtained from (6-20), after grouping together adjacent pairs of raw
scores in order to reduce zigzags due to sampling fluctuations.
Although (6-20) gives only approximate results, the approximation is seen to
be quite adequate for many purposes. Rough calculations using (6-20) can be
conveniently made under circumstances not permitting the use of an elaborate
computer program.
Figure 6.9.1 shows the relative efficiency of STEP (Sequential Tests of Edu
cational Progress) Series II (1969), Level 4, Form A, Reading subtest. STEP is
more efficient than MAT for the bottom fifth or sixth of the pupils and less
efficient for the rest of the students. Between the fortieth and eightieth percen
tiles, STEP would have to be tripled in length in order to be as effective as MAT.
STEP (n y = 30) is actually three-fifths as long as MAT (n x = 50), as shown by
1 This section is revised and printed with special permission from F. M. Lord, Quick estimates of
the relative efficiency of two tests as a function of ability level. Journal of Educational Measurement,
Winter 1974, 11, No. 4, 247-254. Figures 6.9.1, 6.9.6, and Table 6.9.1 are taken from the
same source. Copyright 1974, National Council on Measurement in Education, Inc., East Lansing,
Mich.
[FIG. 6.9.1 (relative efficiency of STEP compared to MAT) and FIG. 6.9.2: relative efficiency (vertical axis, from .16 to 6.3) plotted against percentile (0 to 100); the dashed line in each figure marks the length ratio ny/nx.]
FIG. 6.9.3. Relative efficiency of Iowa Test of Basic Skills (1970), Level 12, Form
5, Vocabulary compared to MAT.
FIG. 6.9.4. Relative efficiency of Stanford Reading Tests (1964), Intermediate II,
Form W, Word Meaning compared to MAT.
[FIG. 6.9.5. Relative efficiency compared to MAT, plotted against percentile; dashed line marks the length ratio ny/nx.]
FIG. 6.9.6. Relative efficiency of SRA Achievement Series (1971), Green Edition,
Form E, Vocabulary compared to MAT.
the dashed line representing the ratio ny/nx. The dashed line represents the
relative efficiency that would be expected if the two tests differed only in length.
The fact that STEP is more efficient for low-ability students and less efficient
at higher ability levels is to be expected in view of the fact that this STEP (Level
4) is extremely easy for most sixth-grade pupils in the representative national
sample. It has long been known that an easy test discriminates best at low ability
levels and is less effective at higher levels than a more difficult test would be.
Similar conclusions can be drawn from the other figures. It is valuable to have
such relative efficiency curves whenever a choice has to be made between tests
measuring the same ability or trait.
Numerical Example
For illustrative purposes, Table 6.9.1 shows a method for computing RE {y, x}
for one set of data, with N = 10,000, nx = 50, ny = 30. The method illustrated
is a little rough but seems adequate for the purpose.
The raw data for the table are the frequency distributions given by the un-
italicized figures in columns fx and fy . The score ranges covered by the table are
TABLE 6.9.1
Illustrative Computations for Relative Efficiency*
                          Test X                          Test Y
Percentile Rank     x      fx      Fx          y      fy      Fy        RE{y, x}
17.5 1141
18 176 14 203
12.70 18.23 (175.08) 14.5 204.5 1270 1.13
13.17 18.5 174 1317 14.73 (205.19) 1.12
19 172 15 206
14.76 19.42 (166.96) 15.5 228 1476 .85
14.89 19.5 166 1489 15.55 (230.20) .83
20 160 16 250
16.49 20.5 171 1649 16.19 (258.74) .71
17.26 20.92 (180.24) 16.5 273 1726 .71
21 182
18.31 21.5 186.5 1831 16.85 (289.10) .69
22 191 17 296
20.22 22.5 196 2022** 17.50 (313) 2022** .67
23 201 18 330
*Italicized figures are obtained by linear interpolation. The remaining figures indicate exact
values obtained from the data.
**These two numbers are identical only by coincidence.
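A sketch of the same kind of quick computation for hypothetical data: the two observed-score distributions are matched by percentile rank, the binomial error variances x(n_x − x)/n_x and y(n_y − y)/n_y of (6-19) stand in for σ²_{x|ξ} and σ²_{y|η}, and the observed relative frequencies stand in for the true-score densities in (6-14). All numbers below are simulated, not the Anchor Test Study data.

```python
import numpy as np

def quick_relative_efficiency(fx, fy, n_x, n_y, percentile_ranks):
    """Quick RE{y, x} at selected percentile ranks, from observed score frequencies fx, fy."""
    x_scores = np.arange(len(fx), dtype=float)
    y_scores = np.arange(len(fy), dtype=float)
    Fx = (np.cumsum(fx) - 0.5 * fx) / fx.sum()     # percentile rank at each x (mid-interval)
    Fy = (np.cumsum(fy) - 0.5 * fy) / fy.sum()
    out = []
    for p in percentile_ranks:
        x = np.interp(p, Fx, x_scores)             # score x with percentile rank p
        y = np.interp(p, Fy, y_scores)             # equated y, same percentile rank
        f_x = np.interp(x, x_scores, fx)           # frequency per unit score at x
        f_y = np.interp(y, y_scores, fy)
        var_x = x * (n_x - x) / n_x                # binomial error variance at x
        var_y = y * (n_y - y) / n_y
        out.append((var_x / var_y) * (f_x / f_y) ** 2)
    return out

# Simulated frequency distributions for a 50-item test x and a 30-item test y (N = 10,000).
rng = np.random.default_rng(0)
fx = np.bincount(rng.binomial(50, 0.55, 10000), minlength=51).astype(float)
fy = np.bincount(rng.binomial(30, 0.60, 10000), minlength=31).astype(float)
print(quick_relative_efficiency(fx, fy, 50, 30, percentile_ranks=[0.15, 0.50, 0.85]))
```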
6.10. REDESIGNING A TEST
When a test is to be changed, we normally have response data for a typical group
of examinees. Such data are not available when a new test is designed. This
section deals chiefly with changing or redesigning an existing test. Procedures
for redesign are best explained by citing a concrete example.
Recently, it was decided to change slightly the characteristics of the College
Entrance Examination Board's Scholastic Aptitude Test, Verbal Section. It was
desired to make the test somewhat more appropriate at low ability levels without
impairing its effectiveness at high ability levels. The possibility of simultane-
ously shortening the test was also considered.
A first step was to estimate the item parameters for all items in a typical
current form of the Verbal test. A second step was to compute from the item
parameters the information curves for variously modified hypothetical forms of
the Verbal test. Each of these curves was compared to the information curve of
the actual Verbal test. The ratio of the two curves is the relative efficiency of the
modified test, which varies as a function of ability level.
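A sketch of that second step, assuming hypothetical three-parameter logistic item parameters rather than the actual SAT Verbal estimates: compute the test information curve of a "current" form and of a modified form, and take their ratio as the relative efficiency at each ability level.

```python
import numpy as np

def info_3pl(theta, a, b, c):
    # Item information for a 3PL item (D = 1.7).
    L = 1.7 * a * (theta - b)
    p = c + (1 - c) / (1 + np.exp(-L))
    dp = 1.7 * a * (p - c) * (1 - p) / (1 - c)
    return dp ** 2 / (p * (1 - p))

def test_info(theta, items):
    return sum(info_3pl(theta, a, b, c) for a, b, c in items)

theta = np.linspace(-3, 3, 61)

# Hypothetical item parameters for the current form (85 items, difficulties spread widely).
rng = np.random.default_rng(1)
current = [(rng.uniform(0.6, 1.4), rng.uniform(-2.0, 2.0), 0.2) for _ in range(85)]

# A hypothetical modification: drop the 15 hardest items and add 15 easy ones.
modified = sorted(current, key=lambda it: it[1])[:70] + \
           [(1.0, rng.uniform(-2.5, -1.0), 0.2) for _ in range(15)]

relative_efficiency = test_info(theta, modified) / test_info(theta, current)
for t, re in zip(theta[::15], relative_efficiency[::15]):
    print(f"theta = {t:5.2f}   RE(modified, current) = {re:5.2f}")
```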
Let us now consider some typical questions: How would the relative effi-
ciency of the existing test be changed by
These questions are taken up one by one in the correspondingly numbered para-
graphs that follow, illustrating the results of various design changes on the SAT
Verbal test. In Figure 6.10.1, the horizontal scale representing the ability mea-
sured by the test has for convenience been marked off to correspond to College
Board true scaled scores.
(The test-score information curves are computed for hypothetical examinees
who omit no items. The SAT is normally scored with a "correction for guess-
ing," but when there are no omitted items, the corrected score is perfectly cor-
related with number-right score. For this reason, the relative efficiencies dis-
FIG. 6.10.1. Relative efficiency of various modified SAT Verbal tests. (From F. M.
Lord, Practical applications of item characteristic curve theory. Journal of Educa-
tional Measurement. Summer 1977, 14, No. 2, 117-138. Copyright 1977,
National Council on Measurement in Education, Inc., East Lansing, Mich.)
cussed below are equally appropriate for corrected scores and for number-right
scores. See Section 15.11 for a detailed discussion of work with formula scores.)
6.11. EXERCISES
6-1 Using Table 4.17.1, compute the test information function for a three-item
test with a = 1/1.7, b = 0, c = .2 for all items. Plot it.
6-2 Using Table 4.17.2, plot on the same graph the test information function
for a three-item test with a = 1/1.7, b = 0, c = 0 (see Exercise 5-8).
6-3 Compute and plot the efficiency of the test in Exercise 6-1 relative to test 1
(see Exercise 5-4).
6-4 Compute and plot the efficiency of the test in Exercise 6-2 relative to test 1
(see Exercise 5-4).
6-5 For each item in test 1, plot P(θ) from Table 4.17.1 against θ* = e^θ. These
are the item response functions when θ* is used as the measure of ability.
Compare with Exercise 4-1.
6-6 Using Table 4.17.1, compute and plot against θ* the item information
function I{θ*, ui} for each item in test 1. Compare with Exercise 5-3.
6-7 Using Table 4.17.1, plot the information function (6-8) of number-right
observed score x on number-right true score ξ for test 1. Necessary values
were calculated in Exercise 5-2.
6-8 Using Table 4.17.1, compute for test 1 at θ = −3, −2, −1, 0, 1, 2, 3 the
σ²_{x|ξ} of (6-19), and compare with the σ²_{x|ξ} of Eq. (4-3). The necessary
values of ξ = ξ(θ) are calculated in Exercise 5-1. Explain the discrepancy
between the two sets of results.
6-9 Suppose test 1 is modified by replacing item 2 by an item exactly like item
1. Compute the information function for the modified test and plot its
relative efficiency with respect to test 1 (see Exercise 5-4).
REFERENCES
Lord, F. M. A strong true-score theory, with applications. Psychometrika, 1965, 30, 239-270.
Lord, F. M. Practical applications of item characteristic curve theory. Journal of Educational
Measurement, 1977, 14, 117-138.
Loret, P. G., Seder, A., Bianchini, J. C., & Vale, C. A. Anchor Test Study—Equivalence and norms
tables for selected reading achievement tests (grades 4, 5, 6). Washington, D.C.: U.S. Govern-
ment Printing Office, 1974.
Stocking, M., Wingersky, M. S., Lees, D. M., Lennon, V., & Lord, F. M. A program for
estimating the relative efficiency of tests at various ability levels, for equating true scores, and for
predicting bivariate distributions of observed scores. Research Memorandum 73-24. Princeton,
N.J.: Educational Testing Service, 1973.
7 Optimal Number of
Choices Per Item 1
7.1. INTRODUCTION
Typical multiple-choice tests have four or five alternative choices per item. What
is the optimal number?
If additional choices did not increase total testing time or add to the cost of the
test, it would seem from general considerations that the more choices, the better.
The same conclusion can be reached by examination of the formula (4-43) for the
logistic item information function: Information is maximized when c = 0. An
empirical study by Vale and Weiss (1977) reaches the same conclusion.
In practice, increasing the number of choices will usually increase the testing
time. Each approach treated in this chapter makes the assumption that total
testing time for a set of n items is proportional to the number A of choices per
item. This means that nA, the total number of alternatives in the entire test, is
assumed fixed.
It seems likely that many or most item types do not satisfy this condition, but
doubtless some item types will be found for which the condition can be shown to
hold approximately. The relation of n to A for fixed testing time should be
determined experimentally for each given item type; the theoretical approaches
given here should then be modified in obvious ways to determine the optimal
1
This chapter is adapted by special permission from F. M. Lord, Optimal number of choices per
item—a comparison of four approaches. Journal of Educational Measurement, Spring 1977, 14, No.
1. Copyright 1977. National Council on Measurement in Education, Inc., East Lansing, Mich.
Research reported was supported by grant GB-41999 from the National Science Foundation.
value of A for each item type. A useful procedure for doing this is described in
Grier (1976).
In this chapter, some published empirical results, two published theoretical
approaches, and also an unpublished classical test theory approach are compared
with some new results obtained from item response theory. From some points of
view, the contrasts between the different approaches are as interesting and in-
structive as the actual answers given to the question asked.
Ruch and Charles (1928), Ruch, DeGraff, and Gordon (1926, pp. 54-88), Ruch
and Stoddard (1925, 1927), and Toops (1921), among others, reported data on
the relative time required to answer items with various numbers of alternatives.
Their empirical evidence regarding the optimal number of alternatives for
maximum test reliability is somewhat contradictory. Ruch and Stoddard (1927)
and Ruch and Charles (1928) concluded that because more of such items can be
administered in a given length of time, two- and three-choice items give as good
or better results than do four- and five-choice items.
More recently, Williams and Ebel (1957, p. 64) report that
For tests of equal working time . . . three-choice vocabulary test items gave a test of
equal reliability, and two-choice items a test of higher reliability, in comparison
with standard four-choice items. However, neither of the differences was signifi-
cant at the 10% level of confidence.
Tversky (1964) proposed that the optimal number of choices is the value of A
that maximizes the "discrimination function" A^n. He chose this function because
A^n is the total number of possible distinct response patterns on n A-choice
items and also for other related reasons.
Tversky easily showed that when nA = K is fixed, A^n is maximized by A = e
= 2.718. For integer values of A, A^n is maximized by A = 3. Tversky concludes
that when nA = K, three choices per item is optimal.
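A short check of Tversky's criterion for a hypothetical fixed total of K = nA = 60 alternatives:

```python
K = 60                                   # total number of alternatives, held fixed (hypothetical)
for A in range(2, 7):
    n = K / A                            # number of items that fit in the available testing time
    print(A, round(n, 1), round(A ** n)) # the "discrimination function" A**n
# A**(K/A) is maximized at A = e = 2.718...; among integers, A = 3 gives the largest value.
```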
Grier (1975) investigated the same problem. He also found that three-choice
items are best when the total number of alternatives is fixed. Two-choice items
are next best.
Grier reached these conclusions by maximizing an approximation to the
Kuder-Richardson Formula-21 reliability coefficient. This approximation, given
as Eq. (7-2), is derived by Ebel (1969) on the assumption that the mean number-
right score x is halfway between the maximum possible score n and the expected
chance score n/A and also that the standard deviation of test scores sx is one-
sixth of the difference between the maximum possible score and the expected
chance score:
x̄ = (n + n/A)/2,   (7-1)

s_x = (n − n/A)/6,

r_21 = [n/(n − 1)][1 − 9(A + 1)/(n(A − 1))].   (7-2)
This formula (as Ebel points out) is not useful for small n. When A = 3, the
value of r21 given by (7-2) is negative unless n > 18.
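A short check of approximation (7-2) with the total number of alternatives held fixed (K = nA = 120, a hypothetical value); it reproduces the ordering Grier reports and the negative values for short three-choice tests noted above.

```python
def r21(n, A):
    # Ebel's approximation (7-2) to the Kuder-Richardson Formula-21 reliability.
    return (n / (n - 1)) * (1 - 9 * (A + 1) / (n * (A - 1)))

K = 120                                  # hypothetical fixed total number of alternatives
for A in (2, 3, 4, 5):
    n = K // A
    print(f"A = {A}, n = {n:3d}, r21 = {r21(n, A):.3f}")
print(r21(15, 3))                        # negative for a short three-choice test, as noted above
```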
r′ = r / [1 + 1/((A − 1)p)],   (7-3)
where r' is the product-moment intercorrelation between k-choice items when
k = A. Here p and r denote, respectively, the difficulty (proportion of correct
answers) and the product-moment intercorrelation of k-choice items when k =
∞. This formula is a generalization of Eq. (3-19) and may be derived by the same
approach.
By the Spearman-Brown formula [Eq. (1-17)], the reliability r′_tt of number-right
scores on a test composed of n equivalent A-choice items is found from
(7-3) to be
r′_tt = nr′/[1 + (n − 1)r′] = nr/[(n − 1)r + 1 + 1/((A − 1)p)].   (7-4)

Since n = K/A, Eq. (7-4) becomes

r′_tt = Kr/[Kr + (1 − r)A + A/((A − 1)p)].   (7-5)
We wish to know what value of A, the number of choices, will maximize the
reliability r′_tt. The optimal value of A is the value that minimizes the denominator
of (7-5). The derivative of the denominator with respect to A is
1 − r − 1/[(A − 1)²p]. Setting this equal to zero and solving for A, the optimal value
is found to be

A = 1 + 1/√[(1 − r)p].   (7-6)
It is easy to verify that this value of A provides a maximum rather than a
minimum for r'tt.
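A sketch evaluating (7-6) and (7-5) for hypothetical values of r and p (the tabled values of this section are not reproduced here):

```python
import math

def optimal_A(r, p):
    # Eq. (7-6): the (not necessarily integer) number of choices that maximizes r'_tt.
    return 1 + 1 / math.sqrt((1 - r) * p)

def r_tt(A, K, r, p):
    # Eq. (7-5): reliability of K/A equivalent A-choice items, with K = nA held fixed.
    return K * r / (K * r + (1 - r) * A + A / ((A - 1) * p))

K, p = 120, 0.5                          # hypothetical total alternatives and item difficulty
for r in (0.05, 0.10, 0.20):
    vals = [round(r_tt(A, K, r, p), 3) for A in (2, 3, 4, 5)]
    print(f"r = {r:.2f}: optimal A = {optimal_A(r, p):.2f}, r'_tt for A = 2..5: {vals}")
```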
Some optimal values of A from (7-6) are shown in the following table:
For p = .5 these values agree rather well with those found by Grier (1975). Our
optimal values, however, unlike Grier's, are independent of test length, as indeed
they should be. For p ≠ .5, the results are different from Grier's.
Some typical values of test reliability are shown in the following table for the
case where p = .5:
A =2 A = 3 A = 4 A =5
FIG. 7.6.1. Relative efficiency of five SAT Verbal tests that differ only in test
length and in the value of ci.
FIG. 7.6.2. Efficiency of three SAT Verbal tests relative to the ci = .333 test
after the ci = .5 test has been made easier and the ci = .2 and ci = .25 tests
have been made harder.
mately. The dashed curve in Fig. 7.6.2 shows the relative efficiency of the
ci = .50 test when all its items are made slightly easier (all the item difficulty
parameters bi decreased by 0.1). The dotted curve shows the relative efficiency
of the ci = .25 test when all its items are made slightly harder (all bi increased
by 0.1). The solid curve shows a harder ci = .20 test (all bi increased 0.2). The
efficiencies are shown relative to the test with ci = .333. The test with ci = .333
is clearly superior to the others.
Comparisons using item response theory assume that ci can be changed
without affecting the item discrimination power ai. This would be true if exam
inees either knew the answer or guessed at random. When an examinee has
partial information about an item, the value of ai is likely to change with the
number of alternatives. This effect could operate against reducing the number of
alternatives per item. The extent of this effect cannot be confidently predicted
here.
If a test is to be used only to accept or to reject examinees, all items in the test
ideally should be maximally informative at the cutting score. Now, the maximum
possible information M_i obtainable from a logistic item with parameters a_i and c_i
is given by Eq. (10-6). If all items have a_i = a and c_i = c and if the number of
items that can be administered in the available testing time is proportional to c,
what value of c will maximize the test information I{θ} = nM_i ∝ cM_i at the
cutting score? Numerical investigation of cM_i using Eq. (10-6) shows that the
optimal value of c is c = .374.
REFERENCES
Ebel, R. L. Expected reliability as a function of choices per item. Educational and Psychological
Measurement, 1969, 29, 565-570.
Grier, J. B. The number of alternatives for optimum test reliability. Journal of Educational Measurement, 1975, 12, 109-112.
Grier, J. B. The optimal number of alternatives at a choice point with travel time considered. Journal
of Mathematical Psychology, 1976, 14, 91-97.
Ruch, G. M., & Charles, J. W. A comparison of five types of objective tests in elementary psychology. Journal of Applied Psychology, 1928, 12, 398-404.
Ruch, G. M., DeGraff, M. H., & Gordon, W. E. Objective examination methods in the social
studies. New York: Scott, Foresman and Co., 1926.
Ruch, G. M., & Stoddard, G. D. Comparative reliabilities of five types of objective examinations.
Journal of Educational Psychology, 1925, 16, 89-103.
Ruch, G. M., & Stoddard, G. D. Tests and measurement in high school instruction. Chicago: World
Book, 1927.
Toops, H. A. Trade tests in education. Teachers College Contributions to Education (No. 115). New
York: Columbia University, 1921.
Tversky, A. On the optimal number of alternatives of a choice point. Journal of Mathematical
Psychology, 1964, 1, 386-391.
Vale, C. D., & Weiss, D. J. A comparison of information functions of multiple-choice and free-
response vocabulary items. Research Report 77-2. Minneapolis: Psychometric Methods Pro-
gram, Department of Psychology, University of Minnesota, 1977.
Williams, B. J., & Ebel, R. L. The effect of varying the number of alternatives per item on multiple-choice vocabulary test items. The Fourteenth Yearbook. National Council on Measurements Used in Education, 1957.
8 Flexilevel Tests1
8.1. INTRODUCTION
It is well known (see also Theorem 8.7.1) that for accurate measurement the
difficulty level of a psychological test should be appropriate to the ability level of
the examinee. With conventional tests, this goal is achievable for all examinees
only if they are fairly homogeneous in ability. College entrance examinations,
for example, could provide more reliable measurement at particular ability levels
if they did not need to cover such a wide range of examinee talent (see Section
6.10). Furthermore, in many situations it is psychologically desirable that the test
difficulty be matched to the examinee's ability: A test that is excessively difficult
for a particular examinee may have a demoralizing or otherwise undesirable
effect.
There has recently been increasing interest in "branched," "computer-
assisted," "individualized," "programmed," "sequential," or tailored testing
(Chapter 10). When carefully designed, such testing comes close to matching the
difficulty of the items administered to the ability level of the examinee. The
practical complications involved in achieving this result are great, however.
Some simplification can be obtained by simple two-stage testing—by use of a
routing test followed by the administration of one of several alternative second-
stage tests (Chapter 9). This reduces the number of items needed and eliminates
Sections 8.1 through 8.4 and Fig. 8.2.1 are taken with special permission and with some
revisions from F. M. Lord, The self-scoring flexilevel test. Journal of Educational Measurement,
Fall 1971, 8, No. 3, 147-151. Copyright 1971, National Council on Measurement in Education,
Inc., East Lansing, Mich.
the need for a computer to administer them. To obtain comparable scores from
different second-stage tests, however, expensive equating procedures based on
special large-scale administrations are required.
To a degree, the same result, the matching of item difficulty with ability level,
can be achieved with fewer complications. This can be done by modifying the
directions, the test booklet, and the answer sheet of an ordinary conventional
test. The modified test is called a flexilevel test.
Consider a conventional multiple-choice test in which the items are arranged
in order of difficulty. The general idea of a flexilevel test is simply that the
examinee starts with the middle item in the test and proceeds, taking an easier
item each time he gets an item wrong and a harder item each time he gets an item
right. He stops when he has answered half the items in the test.
Let us consider a concrete example, starting with a conventional test of N =
75 items. (In this chapter, the symbol N is used with this special meaning; in
other chapters, N denotes the number of examinees.) For purposes of discussion,
we assume that the items are arranged in order of difficulty; however, it is seen
later that any rough approximation to this is adequate. The middle item of the
conventional test (formerly item 38) is the first item in the flexilevel test. It is
printed in the center at the top of the first page of the flexilevel test. The page
below this, and subsequent pages, are divided in half vertically (see Fig. 8.2.1).
[Fig. 8.2.1 layout: below the first item, each page is divided into two columns, the left-hand column listing the easier items and the right-hand column the harder items, each column numbered 2, 3, . . . down the page.]
Items formerly numbered 39, 40, 41, . . . , 75 appear in that order in the right-hand
columns, the hardest item (formerly item 75) at the bottom of the last page.
In place of the old numbers, these items are numbered in blue as items 1, 2,
3, . . . , 37, respectively. Items formerly numbered 37, 36, 35, . . . , 1 appear in
that order in the left-hand columns, the easiest item (formerly item 1) at the
bottom of the last page. In place of the old numbers, these items are numbered in
red as items 1, 2, 3, . . . , 37, respectively (the easiest item is now at the end and
is numbered 37). The layout is indicated in Fig. 8.2.1.
The answer sheet used for a flexilevel test must inform the examinee whether
each answer is right or wrong. When the examinee chooses a wrong answer, a
red spot appears where he has marked or punched the answer sheet. When he
chooses a right answer, a blue spot appears. Answer sheets similar to this are
commercially available in a variety of designs.
In answering the test, the examinee must follow one rule. When his answer to
an item is correct, he should turn next to the lowest numbered "blue" item not
previously answered. When his answer is incorrect, he should work next on the
lowest numbered "red" item not previously answered.
Each examinee is to answer just ½(N + 1) = 38 items. One way to make it
apparent to him when he has finished the test would be to print the answer sheet
in two columns, using the same format as in Fig. 8.2.1 but with the second
column inverted. Thus, the examinee works down from the top in the first
column of the answer sheet and up from the bottom in the second column. The
examinee can be told to stop (he has completed the test) when he has responded
to one item in each row of the answer sheet.
It is now clear that the high-ability examinee who does well on the first items
he answers will automatically be administered a harder set of items than the
low-ability examinee who does poorly on the first items. Within limits, the
flexilevel test automatically adjusts the difficulty of the items to the examinee's
ability level.
8.3. SCORING
Let us first agree that when examinees answer the same items, we will be
satisfied to consider examinees with the same number-right score equal. A sur-
prising feature of the flexilevel test is that even though different examinees take
different sets of items, complicated and expensive scoring or equating procedures
to put all examinees on the same score scale are not needed. The obvious validity
of the scoring (by contrast with tailored testing) will keep examinees from feeling
that they are the victims of occult scoring methods. Finally, the test is self-
scoring—the examinee can determine his score without counting the number of
correct answers.
The score on a flexilevel test will be the number of questions answered
correctly, except that examinees who miss the last question they attempt receive
a one-half point "bonus." Justification that this scoring provides comparable
scores, as well as procedures for arriving at an examinee's score without count-
ing the number of correct answers, is given in the following section.
A flexilevel test has the following properties, which the reader should verify for
himself. For convenience of exposition, we at first assume, as before, that the
items in the conventional test are arranged in order of difficulty. Later on we see
that any rough approximation will be adequate.
For simplicity, assume throughout that the examinee has completed the re-
quired ½(N + 1) = 38 items (the complications arising when examinees do not
have enough time are not dealt with here). Also, assume that the examinee has
been instructed to indicate on the answer sheet the item he would have to answer
next if the test were continued. (In an exceptional case, this might be a dummy
"item 3 8 , " which need not actually appear in the test booklet, since no one will
ever reach it.) An examinee who indicates that he would next try a blue item will
be called a blue examinee; one who indicates a red item will be called a red
examinee.
2. For a blue examinee, the number of right answers is equal to the serial
number of the item that would be answered next if the test were continued.
3. For a red examinee, the number of wrong answers is equal to the serial
number of the item that would be answered next if the test were continued. The
number of right answers is obtained by subtracting this serial number from ½(N
+ 1). (A different serial numbering of the red items could give the number of
right answers directly but might confuse the examinee while he is taking the test.)
4. All blue examinees who have a given number-right score have answered
the same block of items.
5. All red examinees who have a given number-right score have answered the
same block of items.
It can now be seen that all blue examinees can properly be compared with
each other in terms of their number-right scores, even though examinees with
different scores have not taken the same test. Consider two blue examinees, A
and B, whose number-right scores differ by 1. The items answered by the two
examinees are identical except that A had one item that was harder than any of
B's and B had one item that was easier than any of A's. The higher scoring
examinee, A, is clearly the better of the two because he took the harder test.
The same reasoning shows that all red examinees can properly be compared
with each other in terms of their number-right scores:
In the foregoing discussion, the item taken by A and not by B was far apart on
the difficulty scale from the item taken by B and not by A. Thus A still would be
considered better than B even if the difficulty levels of individual items had been
roughly estimated rather than accurately determined. It will be seen that still
simpler considerations make exact determination of difficulty levels unnecessary
for the remaining comparisons among examinees, discussed below. Thus:
It remains to be shown how blue examinees can be compared with red exam-
inees. Consider a red examinee with a number-right score of x. If his very last
response had been correct instead of wrong, he would have been a blue examinee
with a score of x + 1. Clearly, his actual performance was worse than this; so we
conclude that
Finally, we can compare a blue examinee and a red examinee, both having the
same number-right score. Suppose we hypothetically administer to each exam-
inee the item that he would normally take if the testing were continued. If both
examinees answer this item correctly, they both become blue examinees with
identical number-right scores. We have agreed that such examinees can be con-
sidered equal. In order hypothetically to reach this equality, however, the blue
examinee had to answer a hard item correctly, whereas the red examinee had
only to answer an easy item correctly. Clearly, without the hypothetical extra
item, the standing of the blue examinee is inferior to the standing of the red
examinee:
9. A red examinee has outperformed all blue examinees having the same
number-right score.
In view of this last conclusion, let us modify the scoring by adding one-half
score point to the number-right score of each red examinee. Thus, once we agree
to use number-right score for examinees answering the same block of items, we
can say that
Item response theory is essential both for good design and for evaluation of novel
testing procedures, such as flexilevel testing. If its basic assumptions hold, item
response theory allows us to state precisely the relation between the parameters
of the test design and the properties of the test scores produced.
Although the properties of test scores depend on the design parameters, the
dependence is in general not a simple one. Item response theory will be most
easily applicable if we make some simplifying assumptions. Even then, it is hard
to state unequivocal rules for optimal test design. In the present state of the art,
the following procedure is typical.
In this way, 100 or 200 different designs for some novel testing procedure can be
tried out on the computer in a short time using simulated examinees.
Nothing like this could be done if 100 actual tests had to be built and adminis-
tered to statistically adequate samples of real examinees. When we have learned as
much as we can from simulated examinees, then we can design an adequate test,
build it, administer it in a real testing situation, and evaluate the results. The real
test administration is indispensable. Limits on testing time, attitudes of exam-
inees, failure to follow directions, or other violations of the assumptions of the
model may in practice invalidate all theoretical predictions.
The preliminary theoretical work and computer simulation are also important.
Without them, the test actually built is likely to be a very inadequate one.
We can evaluate any given flexilevel test once we can determine ø(y|θ), the
conditional frequency distribution of test scores y for examinees at ability level
θ. Given some mathematical form for the function P_i = P_i(θ) = P(θ; a_i, b_i, c_i),
the value of ø(y|θ) can be determined numerically for any specified value of
θ by the recursive method outlined below. In the case of flexilevel tests, the
testing and scoring procedures are so fully specified that the item parameters are
the only parameters involved. It is assumed that the item parameters have already
been determined by pretesting.3
Assume the N test items to be arranged in order of difficulty, as measured by
the parameter bi. We choose N to be an odd number. For present purposes (not
for actual test administration), identify the items by the index i, taking on the
values −n + 1, −n + 2, . . . , −1, 0, 1, . . . , n − 2, n − 1, respectively, when
the items are arranged in order of difficulty. Thus n = (N + 1)/2 is the number of
items answered by each examinee, and b0 is the median item difficulty.
Consider, for example, the sequence of right (R) and wrong (W) answers R W
W R W R R R W R. Following the rules given for a flexilevel test, we see that the
corresponding sequence of items answered is

i = 0, +1, −1, −2, +2, −3, +3, +4, +5, −4, (+6).

Let I_v be the random variable denoting the vth item administered (v = 1,
2, . . . , n + 1); thus I_v takes the integer values i = −n + 1, −n + 2, . . . , n − 1.
The general rule for flexilevel tests is that when I_v > 0, either

I_{v+1} = I_v + 1   or   I_{v+1} = I_v − v,

and when I_v < 0, either

I_{v+1} = I_v − 1   or   I_{v+1} = I_v + v.

For example, if the fourth item administered is indexed by I_4 = −2, the next
item to be administered must be either I_5 = −2 − 1 = −3 or I_5 = −2 + 4 =
+2, depending on whether item 4 is answered incorrectly or correctly.
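A minimal sketch of this branching rule; the function below (hypothetical, not part of the original testing materials) reproduces the worked example above.

```python
def flexilevel_sequence(responses):
    """Item indices administered on a flexilevel test, given responses (1 = right, 0 = wrong).

    Items are indexed ..., -2, -1, 0, +1, +2, ... in order of difficulty; item 0 comes first.
    The last entry returned is the item that *would* be administered next.
    """
    items = [0]
    for v, u in enumerate(responses, start=1):    # the v-th item administered is items[-1]
        i = items[-1]
        if i >= 0:
            nxt = i + 1 if u else i - v           # right: next harder; wrong: next easier
        else:
            nxt = i + v if u else i - 1
        items.append(nxt)
    return items

# The worked example from the text: R W W R W R R R W R.
print(flexilevel_sequence([1, 0, 0, 1, 0, 1, 1, 1, 0, 1]))
# [0, 1, -1, -2, 2, -3, 3, 4, 5, -4, 6]
```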
Let P_v(i′|i, θ) denote the probability that item i′ will be the next item administered, given that item i is the vth item administered and given ability θ.
2
Sections 8.6 through 8.9 are revised and taken with permission from F. M. Lord, The theoretical
study of the measurement effectiveness of flexilevel tests. Educational and Psychological Measure
ment, 1971, 31, 805-813.
3
The reader concerned only with practical conclusions may skip to Section 8.7.
If i < 0, (8.1)
Pi(θ) if i' = i + v,
P v (i'|i, θ) =
{
Qi(θ) if i' = i - 1,
0 otherwise.
For examinees at ability level θ, let p_v(i|θ) denote the probability that item i is
the vth item administered (v = 1, 2, . . . , n + 1). For fixed v, the joint
distribution of i and i′ is the product of the marginal distribution p_v(i|θ) and the
conditional distribution P_v(i′|i, θ). Summing this product over i, we obtain the
overall probability that item i′ will be administered on the (v + 1)th trial:

p_{v+1}(i′|θ) = Σ_{i=−n+1}^{n−1} p_v(i|θ) P_v(i′|i, θ).   (8-2)
The rightmost probability, Pv, is known from (8-1). The other probability on the
right, pv, can be found by the procedure described below.
The first item administered (v = 1) is always item I1 = 0, so
p_1(i|θ) = 1 if i = 0, and 0 otherwise.
Starting with this fact and with a knowledge of all the P_i(θ) (item response
functions) for a specified value of θ, the values of p_2(i′|θ) for each i′ can be
obtained from (8-2). Drop the prime from the final result. Repetition of the same
procedure now gives us p_3(i′|θ), the overall probability that item i′ (i′ = −n + 1,
−n + 2, . . . , n − 1) will be the third item administered. Successive repetitions
of the same procedure give us p_4(i′|θ), p_5(i′|θ), . . . , p_{n+1}(i′|θ).
Now we can make use of an already verified feature of flexilevel tests. Again
let i′ represent the (v + 1)th item to be administered. If i′ > 0, then the
number-right score x on the v items already administered was x = i′; if i′ < 0,
then x = v + i′. Thus the frequency distribution of the number-right score x for
examinees at ability level θ is given by p_{n+1}(x|θ) for those examinees who
answered correctly the nth (last) item administered and by p_{n+1}(x − n|θ) for
those who answered incorrectly. This frequency distribution can be computed
recursively from (8-1) and (8-2).
As already noted, the actual score assigned on a flexilevel test is y = x if the
last item is answered correctly and y = x + ½ if it is answered incorrectly.
Consequently the conditional distribution of test scores is
ø(y|θ) = p_{n+1}(y|θ) if y is an integer, and ø(y|θ) = p_{n+1}(y − n − ½|θ) if y is a half-integer.   (8-3)

For any specified test design, this conditional frequency distribution ø(y|θ) can
be computed from (8-1) and (8-2) for y = ½, 1, 1½, . . . , n for various values
of θ.
Such a distribution constitutes the totality of possible information relevant to
evaluating the effectiveness of y as a measure of ability θ. Having computed
ø(y|θ), we compute its mean μ_{y|θ} and its variance σ²_{y|θ}. The necessary derivative
dμ_{y|θ}/dθ is readily approximated by numerical methods:

dμ_{y|θ}/dθ = (μ_{y|θ+Δ} − μ_{y|θ})/Δ,

approximately, when Δ is a small increment in θ. From these we compute the
information function [Eq. (5-3)] for test score y.
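A sketch of the whole recursion (8-1) through (8-3) and of the resulting score information function, assuming a three-parameter normal-ogive response function with equal a and c and a short hypothetical 21-item flexilevel test; the numerical results reported in the figures below were of course computed for 60-item tests, not this toy example.

```python
import numpy as np
from math import erf, sqrt

def p_normal_ogive(theta, a, b, c):
    # Three-parameter normal-ogive item response function, as assumed in this section.
    return c + (1 - c) * 0.5 * (1 + erf(a * (theta - b) / sqrt(2)))

def flexilevel_score_distribution(theta, b, a=1.0, c=0.0):
    """phi(y | theta) for a flexilevel test with item difficulties b[i], i = -(n-1), ..., n-1.

    b maps the item index i to its difficulty; n = (N + 1)/2 items are answered per examinee.
    Returns a dict mapping each attainable score y to its probability at this theta.
    """
    n = (len(b) + 1) // 2
    p = {0: 1.0}                                   # item 0 is always administered first
    for v in range(1, n + 1):                      # map p_v(.) into p_{v+1}(.), Eq. (8-2)
        nxt = {}
        for i, prob in p.items():
            P = p_normal_ogive(theta, a, b[i], c)
            if i >= 0:                             # branching rule: right answer -> harder item
                up, down = i + 1, i - v
            else:
                up, down = i + v, i - 1
            nxt[up] = nxt.get(up, 0.0) + prob * P
            nxt[down] = nxt.get(down, 0.0) + prob * (1 - P)
        p = nxt
    phi = {}
    for i, prob in p.items():                      # Eq. (8-3): convert the (n+1)th index to y
        y = float(i) if i > 0 else n + i + 0.5     # half-point bonus for "red" examinees
        phi[y] = phi.get(y, 0.0) + prob
    return phi

def score_information(theta, b, a=1.0, c=0.0, delta=1e-3):
    # I{theta, y} = (d mu / d theta)^2 / sigma^2, with a forward-difference derivative.
    def mean_var(t):
        phi = flexilevel_score_distribution(t, b, a, c)
        ys = np.array(list(phi.keys()))
        ps = np.array(list(phi.values()))
        m = float((ys * ps).sum())
        return m, float(((ys - m) ** 2 * ps).sum())
    m0, v0 = mean_var(theta)
    m1, _ = mean_var(theta + delta)
    return ((m1 - m0) / delta) ** 2 / v0

# Hypothetical 21-item flexilevel test (n = 11 items answered), difficulties in steps of d = 0.1.
N = 21
b = {i: 0.1 * i for i in range(-(N // 2), N // 2 + 1)}
for t in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(t, round(score_information(t, b), 3))
```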
The numerical results reported here are obtained on the assumption that Pi is a
three-parameter normal ogive [Eq. (2-2)]. The results would presumably be
about the same if Pi had been assumed logistic rather than normal ogive.
To keep matters simple, we consider tests in which all items have the same
discriminating power a and also the same guessing parameter c. Results are
presented here separately for c = 0 (no guessing) and c = .2. The results are
general for any value of a > 0, since a can be absorbed into the unit of
measurement chosen for the ability scale (see the baseline scale shown in the
figures). Each examinee answers exactly n = 60 items. For simplicity, we
consider tests in which the item difficulties form an arithmetic sequence, so that
b_{i+1} − b_i = d, say, for i = −n + 1, −n + 2, . . . , n − 2.
Figure 8.7.1 compares the effectiveness of four 60-item (n = 60, N = 119)
flexilevel tests and three bench mark tests by means of score information curves.
The "standard test" is a conventional 60-item test composed entirely of items of
difficulty b = 0, scored by counting the number of right answers. There is no
guessing, so c = 0. The values of a and c are the same for bench mark and
flexilevel tests. The average value of bi, averaged over items, is zero for all
seven tests.
The figure shows that the standard test is best for discriminating among
examinees at ability levels near θ = 0. If good discrimination is important at θ =
±2/2a or θ = ±3/2a, then a flexilevel test such as the one with d = .033/2a or
d = .050/2a is better. The larger d is, the poorer the measurement at θ = 0 but
the better the measurement at extreme values of θ.
[Fig. 8.7.1 curve labels: standard; d = .033/2a; d = .050/2a; d = .067/2a; d = .100/2a; and a bench mark test with half its items at b = −2.8/2a and half at b = +2.8/2a. Vertical axis: I{θ, y}; horizontal axis: θ from −3/2a to +3/2a.]
FIG. 8.7.1. Score information functions for four 60-item flexilevel tests with b₀ =
0 (dotted curves) and three bench mark tests, c = 0. (From F. M. Lord, The
theoretical study of the measurement effectiveness of flexilevel tests. Educational
and Psychological Measurement, 1971, 31, 805-813.)
Figure 8.8.1 compares the effectiveness of three 60-item flexilevel tests with
each other and with five bench mark tests. All items have c = .2 and all have the
same discriminating power a. Numerical labels on the curves are for a = .75.
The standard test is a conventional 60-item test with all items at difficulty level
b = .5/2a, scored by counting the number of right answers.
If all the item difficulties in any test were changed by some constant amount
b, the effect would be simply to translate the corresponding curve by an amount
b along the θ-axis. The difficulty level of each bench mark test and the starting
FIG. 8.8.1. Information functions for three 60-item flexilevel tests (dotted curves)
and five bench mark tests, c = .2. (Numerical labels on curves are for a = .75.)
(From F. M. Lord, The theoretical study of the measurement effectiveness of
flexilevel tests. Educational and Psychological Measurement, 1971,31, 805-813.
item difficulty level b0 of each flexilevel test in Fig. 8.8.1 has been chosen so as
to give maximum information somewhere in the neighborhood of θ = 0.
The standard test is again found to be best for discriminating among exam
inees at ability levels near θ = 0. At θ = ±2 the flexilevel tests are better than
any of the other conventional (bench mark) tests, although the situation is less
clear than before because of the asymmetry of the curves.
When a = .75, the 60-item flexilevel test with b0 = - . 6 and d = .022
gives about as effective measurement as a
58-item standard test at θ = 0,
60-item standard test at θ = ±.67,
8.9. CONCLUSION
Near the middle of the ability range for which the test is designed, a flexilevel
test is less effective than is a comparable peaked conventional test. In the outlying
half of the ability range, the flexilevel test provides more accurate measurement
in typical aptitude and achievement testing situations than a peaked conventional
test composed of comparable items. The advantage of flexilevel tests over
conventional tests at low ability levels is significantly greater when there is
guessing than when there is not.
Since most examinees lie in the center of the distribution where the peaked
conventional test is superior, a flexilevel test may not have a higher reliability
coefficient for the total group than the peaked conventional test. The flexilevel
test is designed for situations where it is important to measure well at both high
and low ability levels. As shown by the unpeaked bench mark tests in the figures,
unpeaked conventional tests cannot do as well in any part of the range as a
suitably designed flexilevel test. The most likely application of flexilevel tests is
in situations where it would otherwise be necessary to unpeak a conventional test
in an attempt to obtain adequate measurement at the extremes of the ability
range. Such situations are found in nationwide college admissions testing and
elsewhere.
Empirical studies need to answer such questions as the following:
Several empirical studies of varied merit have already been carried out, with
various results. The reader is referred to Betz and Weiss (1975), where several of
these are discussed, to Harris and Pennell (1977), and to Seguin (1976).
8.10. EXERCISES
REFERENCES
Betz, N. E., & Weiss, D. J. Empirical and simulation studies of flexilevel ability testing. Research
Report 75-3. Minneapolis: Psychometric Methods Program, Department of Psychology, Univer
sity of Minnesota, 1975.
Harris, D. A., & Pennell, R. J. Simulated and empirical studies of flexilevel testing in Air Force
technical training courses. Report No. AFHRL-TR-77-51. Brooks Air Force Base, Texas:
Human Resources Laboratory, 1977.
Seguin, S. P. An exploratory study of the efficiency of the flexilevel testing procedure. Unpublished
doctoral dissertation, University of Toronto, 1976.
9 Two-Stage Procedures1 and
Multilevel Tests
9.1. INTRODUCTION
1
Sections 9.1-9.8 are revised and printed with permission from F. M. Lord, A theoretical study of
two-stage testing. Psychometrika, 1971, 36, 227-242.
the special sense that the second-stage test is administered only to borderline
examinees. The advantages of this procedure come from economy in testing
time.
In contrast, the present chapter is concerned with situations where the im
mediate purpose of the testing is measurement, not classification. Here, the total
number of test items administered to a single examinee is fixed. Any advantage
of two-stage testing appears as improved measurement.
This chapter attempts to find, under specified restrictions, some good designs
for two-stage testing. A "good" procedure provides reasonably accurate mea
surement for all examinees including those who would obtain near-perfect or
near-zero (or near-chance-level) scores on a conventional test.
The particulars at our disposal in designing a two-stage testing procedure
include the following:
For the sake of simplicity, we assume that the available items differ only in
difficulty, bi. They all have equal discrimination parameters a and equal guess
ing parameters c. Also, we consider here only the case where the routing test
and each of the second-stage tests are peaked; that is, each subtest is composed
of items all of equal difficulty. These assumptions mean that within a subtest all
items are statistically equivalent, with item response function P ≡ P(θ). (Sections
9.9-9.13 describe an approach that avoids these restrictive assumptions.)
9.3. SCORING
routing test and for the second-stage test yields two such estimates, θ̂₁ and θ̂₂, for
any given examinee. These are jointly sufficient statistics for θ. They must be
combined into a single estimate. In the situation at hand, it would be inefficient
to discard θ̂₁ and use only θ̂₂. Unfortunately, there is no uniquely best way to
combine the two jointly sufficient statistics.
For present purposes, θ̂₁ and θ̂₂ will be averaged after weighting them inversely
according to their (estimated) large-sample variances. It is well known
that this weighting produces a consistent estimator with approximately minimum
large-sample variance (see Graybill and Deal, 1959). Thus, an examinee's
score θ̂ on the two-stage test will be proportional to

θ̂₁/V(θ̂₁) + θ̂₂/V(θ̂₂),

where V is an estimate of large-sample variance, to be denoted by Var. We
multiply this by V(θ̂₁)V(θ̂₂)/[V(θ̂₁) + V(θ̂₂)] to obtain the examinee's overall
score, defined as

θ̂ = [θ̂₁V(θ̂₂) + θ̂₂V(θ̂₁)] / [V(θ̂₁) + V(θ̂₂)].   (9-4)
The multiplying factor is chosen so that θ̂ is asymptotically unbiased:

ℰθ̂ ≡ [θ Var θ̂₂ + θ Var θ̂₁] / [Var θ̂₁ + Var θ̂₂] = θ.
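A one-line sketch of this inverse-variance weighting with hypothetical estimates and variances; the combined score is pulled toward whichever subtest estimate has the smaller variance.

```python
def combined_score(theta1, v1, theta2, v2):
    # Inverse-variance weighting as in (9-4): each estimate is weighted by the other's variance.
    return (theta1 * v2 + theta2 * v1) / (v1 + v2)

# Hypothetical routing-test and second-stage estimates with their estimated variances.
print(combined_score(0.40, 0.30, 0.75, 0.08))   # pulled toward the more precise estimate, 0.75
```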
From Eq. (5-5), for equivalent items,
If there are n1 items in the routing test and n2 = n — n1 items in the second-stage
test, there are at most (n₁ + 1)(n₂ + 1) different possible numerical values for θ̂.
Let θ̂_xy denote the value of θ̂ when the number-right score on the routing test is x
and on the second-stage test is y. By Eq. (4-1), the frequency distribution of x for
fixed θ is the binomial

C(n₁, x) Pˣ Q^{n₁−x},

where P is given by (9-1) with a_i = a, c_i = c, and b_i equal to the difficulty
level (b, say) of the routing test. The distribution of y is the binomial

C(n₂, y) P_yʸ Q_y^{n₂−y},

where P_y is similarly given by (9-1) with b_i equal to the difficulty level of the
second-stage test, this being a numerical function of x, here denoted by b(x),
assigned in advance by the psychometrician.
These two binomials are independent when θ is fixed. Given numerical values
for n₁, n₂, a, b, c and for b(x) (x = 0, 1, . . . , n₁), the exact frequency
distribution p_xy of the examinee's score θ̂ for an examinee at any given ability
level θ can be computed from the product of the two binomials:

p_xy = Prob(θ̂ = θ̂_xy|θ) = C(n₁, x) Pˣ Q^{n₁−x} C(n₂, y) P_yʸ Q_y^{n₂−y}.   (9-7)
This frequency distribution contains all possible information relevant for choosing
among the specified two-stage testing procedures.
In actual practice, it is necessary to summarize somehow the plethora of
numbers computed from (9-7). This is done by using the information function for
θ̂ given by Eq. (5-3). For given θ, the denominator of the information function is
the variance of θ̂ given θ, computed in straightforward fashion from the known
conditional frequencies (9-7). We have similarly for the numerator
ℰ(θ̂|θ) = Σ_{x=0}^{n₁} Σ_{y=0}^{n₂} p_xy θ̂_xy.

Since θ̂_xy is not a function of θ,

dℰ(θ̂|θ)/dθ = Σ_{x=0}^{n₁} Σ_{y=0}^{n₂} (∂p_xy/∂θ) θ̂_xy.
A formula for ∂p_xy/∂θ is easily written from (9-7) and (9-1), from which the
numerical value of the numerator of Eq. (5-3) is calculated for given θ. In this
way, I{θ, θ̂} is evaluated numerically for all ability levels of interest.
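A sketch of this computation for a hypothetical "11; ±1, ±.5"-style design, assuming a normal-ogive response function. The scoring rule used below is only a placeholder (overall proportion correct), not the weighted estimate of Eq. (9-4), since the variance formulas are not reproduced above; the point is the structure of the calculation: the product of two binomials as in (9-7), the conditional mean and variance, and a numerical derivative of the mean.

```python
import numpy as np
from math import comb, erf, sqrt

def P3no(theta, a, b, c):
    # Three-parameter normal-ogive response function for a peaked subtest.
    return c + (1 - c) * 0.5 * (1 + erf(a * (theta - b) / sqrt(2)))

def two_stage_information(theta, n1, n2, a, b_route, b_of_x, c, score, delta=1e-3):
    """I{theta, score} for a two-stage design, from the exact distribution (9-7)."""
    def mean_var(t):
        vals, probs = [], []
        P1 = P3no(t, a, b_route, c)
        for x in range(n1 + 1):
            px = comb(n1, x) * P1 ** x * (1 - P1) ** (n1 - x)
            P2 = P3no(t, a, b_of_x(x), c)        # difficulty of the second stage assigned to x
            for y in range(n2 + 1):
                pxy = px * comb(n2, y) * P2 ** y * (1 - P2) ** (n2 - y)
                vals.append(score(x, y))
                probs.append(pxy)
        vals, probs = np.array(vals), np.array(probs)
        mu = float((vals * probs).sum())
        return mu, float(((vals - mu) ** 2 * probs).sum())
    m0, v0 = mean_var(theta)
    m1, _ = mean_var(theta + delta)
    return ((m1 - m0) / delta) ** 2 / v0

# Hypothetical "11; +-1, +-.5" style design: an 11-item routing test at b = 0 and four
# alternative 49-item second-stage tests at b = -1, -.5, +.5, +1 (a = 1, c = 0 for all items).
def b_of_x(x):
    return (-1.0, -0.5, 0.5, 1.0)[min(x // 3, 3)]

score = lambda x, y: (x + y) / 60.0   # placeholder scoring rule, NOT the weighted score (9-4)

for t in (-1.5, -0.5, 0.0, 0.5, 1.5):
    print(t, round(two_stage_information(t, 11, 49, 1.0, 0.0, b_of_x, 0.0, score), 2))
```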
FIG. 9.5.1. Information functions for some two-stage testing designs when n = 60, c = 0.
b were investigated. For this reason, only the left portion of each curve is shown
in Fig. 9.5.1.
The two solid curves are benchmarks, with which the two-stage procedures
are to be compared. The "standard" curve shows the information function for
the number-right score on a 60-item peaked conventional test whose items all
have the same difficulty level, b, and the same discriminating power, a. The
"up-and-down" benchmark curve is the " b e s t " of those obtained by the up-
and-down method of tailored testing (see Chapter 10; the benchmark curve of
Fig. 9.5.1 here is taken with permission from Fig. 7.6 of Lord, 1970).
If an examiner wants accurate measurement for typical examinees in the group
tested and is less concerned about examinees at the extremes of the ability range,
he should use a peaked conventional test. If a two-stage procedure is to be really
valuable, it will usually be because it provides good measurement for extreme as
well as for typical examinees. For this reason, an attempt was made to find
two-stage procedures with information curves similar to (or better than) the
"up-and-down" curve shown in the figure. For Sections 9.5 and 9.6 nearly 200
different two-stage designs were simulated for this search. Obviously, empirical
investigations of 200 designs would have been out of the question.
Surprisingly, Fig. 9.5.1 shows that when there is no guessing, it is possible
for a 60-item two-stage procedure to approximate the measurement efficiency of
a good 60-item up-and-down tailored testing procedure throughout the ability
range from θ = b − 1.5/a to θ = b + 1.5/a. The effectiveness of the two-stage
procedures shown falls off rather sharply outside this ability range, but this range
is adequate or more than adequate for most testing purposes.
The label " 1 1 ; ± 1, ± . 5 " indicates that the routing test contains n1 = 11
items (at difficulty b) and that there are four alternative 49-item second-stage
tests with difficulty levels b ± 1/a and b ± .5/a. The cutting points on this
routing test are equally spaced in terms of number-right scores, x1: If x1 — 0-2,
the examinee is routed to the easiest second-stage test; if x1 = 3 - 5 , to the next
easiest; and so on.
The label " 7 ; ± 1.125, ± . 3 1 2 5 " is similarly interpreted, the examinees
being routed according to the score groupings x1 = 0 - 1 , x1 = 2 - 3 , x1 = 4 - 5 ,
and x1 = 6-7. The label " 1 1 ; ± 1 . 2 5 , ± . 7 5 , ± . 2 5 " similarly indicates a proce
dure with six alternative second-stage procedures, assigned according to the
groupings x1, = 0 - 1 , x1 = 2 - 3 , . . . , x1 = 10-11.
A 60-item up-and-down procedure in principle requires 1830 items before
testing can start; in practice, 600 items might be adequate without seriously
impairing measurement. Two of the two-stage procedures shown in Fig. 9.5.1
require only slightly more than 200 items.
The two-stage procedures shown in Figure 9.5.1 are the "best" out of approx-
imately sixty 60-item procedures studied with c = 0. None of the two-stage
procedures that at first seemed promising according to armchair estimates turned
out well. From this experience, it seems that casually designed two-stage tests
are likely to provide fully effective measurement only over a relatively narrow
range of ability, or possibly not at all.

9.6. DISCUSSION OF RESULTS FOR 60-ITEM TESTS WITH NO GUESSING

Table 9.6.1 shows the information at four different ability levels obtainable from
some of the better procedures. The following generalizations are plausible and
should hold in most situations.
Length of Routing Test. If the routing test is too long, not enough items are
left for the second-stage test, so that measurement may be effective near θ = b
but not at other ability levels. The test is not adaptive. If the routing test is too
short, then examinees are poorly allocated to the second-stage tests. In this case,
if the second-stage tests all have difficulty levels near b, then effective measure
ment may be achieved near θ = b but not at other ability levels; if the second-
stage tests differ considerably in difficulty level, then the misallocation of exam
inees may lead to relatively poor measurement at all ability levels. The results
shown in Fig. 9.5.1 and Table 9.6.1 suggest that n1 = 3 is too small and n1 = 11
TABLE 9.6.1
Information for Various 60-Item Testing Procedures with c = 0
[Table body not reproduced; columns give information** at four ability levels.]
*All cutting points are equally spaced, except for the starred procedure, which has score groups
x1 = 0, x1 = 1-3, x1 = 4-6, x1 = 7.
**All information values are to be multiplied by a².
is too large for the range b ± 1.5/a in the situation considered, assuming that no
more than four second-stage tests are used.
Some 40-odd different procedures were tried out for the case where a total of n =
15 items with c = 0 are to be administered to each examinee. The "best" of
these—those with information curves near the up-and-down bench mark—are
shown in Fig. 9.7.1. The bench mark here is again one of the "best" up-and-
down procedures [see Stocking (1969), Fig. 2].
Table 9.7.1 shows results for various other two-stage procedures not quite so
"good" as those in Fig. 9.7.1. In general, these others either did not measure
well enough at extreme ability levels or else did not measure well enough at θ =
b. The results for n = 15 seem to require no further comment, since the general
principles are the same as for n = 60.
FIG. 9.7.1. Information functions for some two-stage testing designs when n = 15, c = 0. [Curves shown: "up-and-down," "3; ±1, ±.25," "3; ±1.25, ±.5," and "3; ±1.25, ±.25"; vertical axis: I{θ, θ̂}; horizontal axis: θ from b − 1.5/a to b.]
TABLE 9.7.1
Information for Various 15-Item Testing Procedures with c = 0
[Table body not reproduced; columns give information** at several ability levels.]
*All cutting points are equally spaced, except for the starred procedure, which has score groups
x1 = 0-1, x1 = 2, x1 = 3-4.
**All information values are to be multiplied by a².
9.8. ILLUSTRATIVE 60-ITEM TWO-STAGE TESTS WITH GUESSING

About 75 different 60-item two-stage procedures with c = .20 were tried out.
The "best" of these are shown in Fig. 9.8.1 along with an appropriate bench
mark procedure (see Lord, 1970, Fig. 7.8).
Apparently, when items can be answered correctly by guessing, two-stage
testing procedures are not so effective for measuring at extreme ability levels as
are the better up-and-down procedures. Unless some really "good" two-stage
procedures were missed in the present investigation, it appears that a two-stage
test might require 10 or more alternative second stages in order to measure well
throughout the range shown in Fig. 9.8.1. Such tests were not studied here
because the cost of producing so many second stages may be excessive. Possibly
a three-stage procedure would be preferable.
When there is guessing, maximum information is likely to be obtained at an
ability level higher than θ = b, as is apparent from Fig. 9.8.1. This means that
the examiner will probably wish to choose a value of b (the difficulty level of the
routing test) somewhat below the mean ability level of the group to be tested. If a
value of b were chosen near μθ, the mean ability level of the group, as might well
be done if there were no guessing, then the two-stage procedures shown in Fig.
9.8.1 would provide good measurement for the top examinees (above θ = b +
FIG. 9.8.1. Information functions for some two-stage testing designs when n = 60, c = .2. [Curves shown: the "standard" benchmark and two-stage tests 68, 69, and 65e; vertical axis: I{θ, θ̂}, 0 to 25a²; horizontal axis: θ from b − 1.5/a to b + 2/a.]
1/a) but quite poor measurement for the bottom examinees (below θ = b − 1/a).
If an examiner wants good measurement over two or three standard deviations on
each side of the mean ability level of the group, he should choose the value of b
for the two-stage procedures in Fig. 9.8.1 so that μθ falls near b + .75/a. In this
way, the ability levels of his examinees might be covered by the range from θ =
b − .75/a to θ = b + 2.25/a, for example.
The three two-stage tests shown in Fig. 9.8.1 are as follows. Test 68 has an
11-item routing test with six score groups x1 = 0-3, 4, 5-6, 7-8, 9-10, 11,
corresponding to six alternative second-stage tests at difficulty levels b2 where
a(b2 − b) = −1.35, −.65, −.325, +.25, +.75, and +1.5. Test 69 has a
17-item routing test with x1 = 0-5, 6-7, 8-10, 11-13, 14-15, 16-17 and a(b2 −
b) = −1.5, −.75, −.25, +.35, +.9, +1.5. Test 65e has an 11-item routing test
with x1 = 0-2, 3-4, 5-6, 7-8, 9-10, 11 and a(b2 − b) = −1.5, −.9, −.3, +.2,
+.6, +1.0.
A table of numerical values would be bulky and is not given here. Most of the
conclusions apparent from such a table have already been stated.
[FIG. 9.9.1. Levels 1 through 5 spanning item numbers 1 to 60.]
m = (n2L − n)/(L − 1).
In practice, L, m, n2, and n are necessarily integers. There is no cause for
confusion, however, if for convenience this restriction is sometimes ignored in
theoretical work.
In what follows, we frequently refer to a second-stage test as a level. Figure
9.9.1 illustrates a second-stage design with L = 5, n2 = 32, n = 60, and m =
25.
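(For the design of Fig. 9.9.1, the formula gives m = (32 · 5 − 60)/(5 − 1) = 100/4 = 25.)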
If there are too many second-stage tests, the scaling and equating of these tests
becomes burdensome. In any case, it will be found that there is relatively little
gain from having more than a few second-stage tests.
FIG. 9.10.1. Efficiency of each of the four levels of a poorly designed two-stage test, relative to Form SSA 45. [Vertical axis: relative efficiency; horizontal axis: scaled score, 200 to 700; one curve per level.]
Figure 9.10.1 shows the relative efficiency curves obtained in this way for the
four (second-stage) levels of a hypothetical two-stage mathematics aptitude test.
This four-level test was the first trial design worked out by the writer after
studying the item parameters obtained for the SAT mathematics aptitude test,
Form SSA 45. It is presented here to emphasize that armchair designs of two-
stage tests, even when based on more data than are usually available, are likely to
be very inadequate.
The high relative efficiency of level 1 at scores below 300 was a desired part
of the design. The similar high efficiency of level 2 below 300 is completely
unnecessary, unplanned, and undesirable. Level 2 is too easy and too much like
level 1.
The intention was that each level should reach above 100% relative efficiency
for part of the score scale. Level 3 falls seriously short of this. As a result, the
four-level test design would be inferior to the regular SAT for the majority of
examinees. The shortcomings of level 3 could be remedied by restricting its
range of item difficulty. Level 4 may be unnecessarily effective at the top of the
score range and beyond. It should perhaps be easier.
9.11. DEPENDENCE OF THE TWO-STAGE TEST ON ITS LEVELS

The design inadequacies apparent in Fig. 9.10.1 can be rather easily covered up
by increasing the number of levels and restricting the difficulty range of each
level. After trying about a dozen different designs, the seven-level test shown in
Fig. 9.11.1 was devised.
The solid curves in the figure are the relative efficiency curves for the seven
levels (the lower portion of each curve is not shown). The dashed lines are
relative efficiency curves for the entire seven-level two-stage test (formulas for
obtaining the relative efficiency curve of the entire two-stage test from the curves
of the individual levels are derived in the Appendix at the end of this chapter).
The lower dashed curve assumes the routing test score has a standard error of
measurement of 90 scaled-score units. The upper curve assumes the very low
value of 30. To achieve a standard error of 30, the routing test would have to be
as long or longer than the present SAT—an impractical requirement included for
its theoretical interest.
As mentioned earlier, subsequent results to be given here assume the standard
error of measurement of the routing test to be 75 scaled-score points. This value
is bracketed by the two values shown. A standard error of about 75 would be
expected for a routing test consisting of 12 mathematics items.
The relationship between the efficiency curves for the individual levels and
the efficiency curve of the entire two-stage test is direct and visually obvious
from the figure. The effect of lowering the accuracy of the routing test is also
clear and simple to visualize. The effect is less than might be expected.
Each level in Fig. 9.11.1 has only two-thirds as many items as the regular
SAT mathematics aptitude test. Thus use of a two-stage test may enable us to
increase the accuracy of measurement while reducing the official testing time for
each examinee (ignoring the time required for the self-administered routing test,
answered in advance of the regular testing).
FIG. 9.11.1. Relation between the relative efficiency of a two-stage test and the relative efficiency of the individual levels. [Solid curves: levels 1 through 7; dashed curves: the entire two-stage test for routing-test standard errors of 90 and 30 scaled-score units; horizontal axis: scaled score, 200 to 700.]
TABLE 9.12.1
Determination of Cutting Points for Assigning Levels of the Two-Stage Test in Fig. 9.11.1

Level    Scores at which RE = 1.00    Cutting Score
1                   389
                                           324
2        259*       465
                                           400
3        335        519
                                           469
4        419        577
                                           531
5        485        671
                                           616
6        561        759*
                                           685
7        611
In order to use a routing test for assigning examinees to levels, the score scale
must be divided by cutting points that determine, for each scaled score on the
routing test, the level to be assigned. There is no simple and uniquely optimal
way to determine these cutting points.
A method that seems effective is illustrated in Table 9.12.1. for the multilevel
test of Fig. 9.11.1. The cutting score between two adjacent levels in the table is
taken to be the average of the two numbers connected by oblique lines. The
cutting scores so obtained are indicated along the baseline of Fig. 9.11.1.
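A minimal sketch of this averaging rule, using the crossing points listed in Table 9.12.1 (only the rule itself is illustrated; the numbers are copied from the table):

```python
# Scaled scores at which each level's relative efficiency crosses 1.00,
# taken from Table 9.12.1; None marks an open end of the score range.
levels = [(None, 389), (259, 465), (335, 519), (419, 577),
          (485, 671), (561, 759), (611, None)]

# The cutting score between two adjacent levels is the average of the upper
# crossing of the easier level and the lower crossing of the harder level.
cuts = [(levels[i][1] + levels[i + 1][0]) / 2 for i in range(len(levels) - 1)]
print(cuts)   # [324.0, 400.0, 469.0, 531.0, 616.0, 685.0], as in Table 9.12.1
```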
A convincing justification for this procedure is not immediately apparent. The
procedure has been found to give good results as long as the levels are reasonably
spaced and exceed a relative efficiency of 1.00 in a suitable score interval. Small
changes in the cutting scores will have little effect on the RE curve of the
two-stage test.
9.13. RESULTS FOR VARIOUS TWO-STAGE DESIGNS

[FIG. 9.13.1. Relative efficiency versus scaled score (200 to 700); curves labeled by number of levels.]

Further experimentation with different designs shows that, with care, good results
can be achieved with a two-stage test having only three or four (second-stage)
levels. Relative efficiency curves for four different two-stage test designs
are shown in Fig. 9.13.1. The curves were obtained in an effort to raise the
lowest point of the curve without changing its general overall shape. It is proba-
bly not possible to make any great improvement from this point of view on the
designs shown. This may account for the fact that the four curves shown differ
little from each other. It would, of course, be easy to design two-stage tests with
very differently shaped curves, if that were desired.
The identifying number on each curve in Figure 9.13.1 is its L value, the
number of levels. The four designs shown are partially described in Table
9.13.1. A full description of each two-stage test would require listing all item
parameters for each level and would not add much of value to the illustrative
examples given.
TABLE 9.13.1
Description of Two-Stage Designs Shown in Fig. 9.13.1
3 102 45 1.045
4 114 45 1.076
5 114 45 1.101
7 123 41 1.074
Marco (1977) describes a "multilevel" test that resembles a two-stage test ex
cept that there is no routing test; instead, each examinee routes himself to levels
that seem of appropriate difficulty to him. Item response theory cannot predict
how a person will route himself. The present results may nevertheless be relevant
if his errors in self-routing are similar to the errors made by some kind of routing
test.
Empirical studies of two-stage testing are reported by Linn, Rock, and Cleary
(1969) and by Larkin and Weiss (1975). Other studies are cited in these refer
ences. A simulation study of two-stage testing is reported by Betz and Weiss
(1974).
Simulation studies in general confirm, or at least do not disagree with, conclu
sions reached here. Empirical studies frequently do not yield clear-cut results.
This last might well be expected whenever total group reliability or validity
coefficients are used to compare two-stage tests with conventional tests.
If the conventional test contains a wide spread of item difficulties, the two-
stage test may be better at all ability levels, in which case it will have higher
total-group reliability. If the conventional test is somewhat peaked at the appro
priate difficulty level, however, it will be better than the two-stage test at moder
ate ability levels where most of the examinees are found; the two-stage test will
be better at the extremes of the ability range. The two-stage test will in this case
probably show lower total-group reliability than the conventional test, because
most of the group is at the ability level where the conventional test is peaked.
Two-stage tests will be most valuable in situations where the group tested has
a wider range of ability than can be measured effectively by a peaked conven
tional test.
9.15. EXERCISES
APPENDIX
where pl|η is the probability that an examinee with true score η will be assigned
to level l. Since μ(y l |η) is constant, the last term is zero. Thus the denominator
of I{η, y} is
σ²y|η = Σ_{l=1}^{L} pl|η Var(yl|η).
REFERENCES
Betz, N. E., & Weiss, D. J. Simulation studies of two-stage ability testing. Research Report 74-4.
Minneapolis: Psychometric Methods Program, Department of Psychology, University of Min
nesota, 1974.
Cronbach, L. J., & Gleser, G. C. Psychological tests and personnel decisions (2nd ed.). Urbana,
111.: University of Illinois Press, 1965.
Graybill, F. A., & Deal, R. Combining unbiased estimators. Biometrics, 1959, 75, 543-550.
Larkin, K. C , & Weiss, D. J. An empirical comparison of two-stage and pyramidal adaptive ability
testing. Research Report 75-1. Minneapolis: Psychometric Methods Program, Department of
Psychology, University of Minnesota, 1975.
Linn, R. L., Rock, D. A., & Cleary, T. A. The development and evaluation of several programmed
testing methods. Educational and Psychological Measurement, 1969, 29, 129-146.
Lord, F. M. Some test theory for tailored testing. In W. H. Holtzman (Ed.), Computer assisted
instruction, testing, and guidance. New York: Harper and Row, 1970.
Marco, G. L. Item characteristic curve solutions to three intractable testing problems. Journal of
Educational Measurement, 1977, 14, 139-160.
Stocking, M. Short tailored tests. Research Bulletin 69-63 and Office of Naval Research Technical
Report N00014-69-C-0017. Princeton, N.J.: Educational Testing Service, 1969.
10 Tailored Testing
10.1. INTRODUCTION 1
It seems likely that in the not too distant future many mental tests will be
administered and scored by computer. Computerized instruction will be com-
mon, and it will be convenient to use computers to administer achievement tests
also.
The computer can test many examinees simultaneously with the same or with
different tests. If desired, each examinee can be allowed to answer test questions
at his own rate of speed. This situation opens up new possibilities. The computer
can do more than simply administer a predetermined set of test items. Given a
pool of precalibrated items to choose from, the computer can design a different
test for each examinee.
An examinee is measured most effectively when the test items are neither too
difficult nor too easy for him. Thus for any given psychological trait the com-
puter's main task at each step of the test administration might be to estimate
tentatively the examinee's level on the trait, on the basis of his responses to
whatever items have already been administered. The computer could then choose
the next item to be administered on the basis of this tentative estimate.
Such testing has been called adaptive testing, branched testing, individualized
testing, programmed testing, sequential item testing, response-contingent test-
ing, and computerized testing. Clearly, the procedure could be implemented
¹This section is a revised version of the introductory section in F. M. Lord, Some test theory for
tailored testing. In W. H. Holtzman (Ed.), Computer assisted instruction, testing, and guidance.
New York: Harper and Row, 1970, pp. 139-183. Used by permission.
without a computer. Here, emphasizing the key feature, we shall speak of tai
lored testing. This term was suggested by William W. Turnbull in 1951.
It should be clear that there are important differences between testing for
instructional purposes and testing for measurement purposes. The virtue of an
instructional test lies ultimately in its effectiveness in changing the examinee. At
the end we would like him to be able to answer every test item correctly. A
measuring instrument, on the other hand, should not alter the trait being mea
sured. Moreover (see Section 10.2), measurement is most effective when the
examinee only knows the answers to about half of the test items. The discussion
here is concerned exclusively with measurement problems and not at all with
instructional testing.
10.2. MAXIMIZING INFORMATION

Suppose we have a pool of calibrated items. Which single item from the pool will
add the most to the test information function at a given ability level?
According to Eq. (5-6), each item contributes independently to the test information
function I{θ}. This contribution is given by

I{θ, ui} = Pi'²/(PiQi),    (5-9)
the item information function. To answer the question asked, compute Eq. (5-9)
for each item in the pool and then pick the item that gives the most information at
the required ability level θ. It is useful here to discuss the maximum of the item
information function in some detail, so as to provide background for tailored
testing applications.
Under the logistic model [Eq. (2-1)] when there is no guessing, P'i =
DaiPiQi. The item information function [Eq. (5-9)] is thus
I{θ, ui} = D²ai² Pi(θ)Qi(θ).    (10-1)
Now, PiQi is a maximum when Pi = .5. It follows that when there is no
guessing, an item gives its maximum information for those examinees who have a
50% chance of answering correctly. When Pi(θ) = .5, we have θ = bi. Thus,
when there is no guessing, an item gives its maximum information for examinees
whose ability θ is equal to the item difficulty bi. All statements in this paragraph
may be shown to hold for the normal ogive model also.
The maximum information, to be denoted by Mi, for the logistic model with
no guessing is seen from (10-1) to be
Mi = D²ai²/4 = .722ai².    (10-2)

For the normal ogive model, the corresponding maximum is

Mi = 2ai²/π = .637ai².    (10-3)
Note that maximum information Mi is proportional to the square of the item
discriminating power ai. Thus an item at the proper difficulty level with ai =
1.0 is worth as much as four items with ai = .5.
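As a small numerical check of (10-1) and (10-2) (a sketch only; the item parameters are arbitrary):

```python
import math

D = 1.7

def item_info(theta, a, b):
    """Eq. (10-1): item information for the logistic model with no guessing."""
    p = 1 / (1 + math.exp(-D * a * (theta - b)))
    return D**2 * a**2 * p * (1 - p)

# Information peaks at theta = b, with value D^2 a^2 / 4 = .722 a^2.
print(round(item_info(0.0, 1.0, 0.0), 3))   # 0.722
print(round(item_info(0.0, 0.5, 0.0), 3))   # 0.181 -- one-quarter as much
```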
On a certain item, suppose that examinees guess at random with probability p
of success whenever they do not know the correct answer. (Item response theory
does not use this assumption; it is used here only as a bench mark.) According to
this supposition, the actual proportion of correct answers to the item at ability
level θ will be Pi(θ) + pQi(θ). Accordingly, a common rule of thumb for test
design is that the average "item difficulty" ( = proportion of correct answers in
the group tested) should be .5 when there is no guessing and ½(1 + p) when
there is random guessing with chance of success p. Let us check this rule using
item information functions.
It is not difficult to show (Birnbaum, 1968, Eq. 20.4.21) for the three-
parameter logistic model that an item gives maximal information at ability level
θ = θi where
θi = bi + (1/Dai) ln[(1 + √(1 + 8ci))/2].    (10-4)
When ci = 0, the item gives maximal information when θ = bi. When ci ≠ 0,
θi > bi. The distance from the item difficulty level bi to the optimal θi is
inversely proportional to the item discriminating power ai.
It is readily found from (10-4) that when ability and item difficulty bi are
optimally matched, the proportion of correct answers is
Pi(θi) = ¼(1 + √(1 + 8ci)).    (10-5)
If we substitute ci for p in the old rule of thumb for test design and subtract the
results from (10-5), the difference vanishes for ci = 0 and for ci = 1; for all
other permissible values of c i , Pi(θi) exceeds the probability given by the rule of
thumb. Thus, under the logistic model, an item will be maximally informative
for examinees whose probability of success is somewhat greater than ½(1 + ci).
Under this optimal matching, the maximum information for the three-parameter logistic model is

Mi = [D²ai²/(8(1 − ci)²)][1 − 20ci − 8ci² + (1 + 8ci)^{3/2}].    (10-6)

Typical values of Pi(θi) and of the maximum information Mi can be computed directly from (10-5) and (10-6).
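For instance, the following sketch evaluates (10-5) and (10-6) at a few values of ci (the particular ci values are chosen only for illustration; results are shown per unit ai²):

```python
import math

D = 1.7

def p_at_optimum(c):
    """Eq. (10-5): proportion correct when ability and item difficulty are optimally matched."""
    return 0.25 * (1 + math.sqrt(1 + 8 * c))

def max_info_per_a2(c):
    """Eq. (10-6) divided by a_i^2."""
    return D**2 / (8 * (1 - c) ** 2) * (1 - 20 * c - 8 * c**2 + (1 + 8 * c) ** 1.5)

for c in (0.0, 0.20, 0.25, 0.50):
    print(f"c = {c:.2f}   P_i(theta_i) = {p_at_optimum(c):.3f}   M_i / a_i^2 = {max_info_per_a2(c):.3f}")
# c = 0.00   P_i(theta_i) = 0.500   M_i / a_i^2 = 0.722
# c = 0.20   P_i(theta_i) = 0.653   M_i / a_i^2 = 0.492
# c = 0.25   P_i(theta_i) = 0.683   M_i / a_i^2 = 0.447
# c = 0.50   P_i(theta_i) = 0.809   M_i / a_i^2 = 0.261
```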
10.3. ADMINISTERING THE TAILORED TEST
Consider now the tailored testing of a single examinee. If we know nothing about
him, we may administer first an item of middle difficulty from the available item
pool. If we have information about the examinee's educational level, or some
other relevant fact, we may be able to pick a first item that is better matched to
his ability level. Unless the test is very short, a poor choice of the first item will
have little effect on the final result.
If the examinee answers the first item incorrectly (correctly), we suppose that
it is hard (easy) for him, so we choose an easier (harder) item to administer next.
If he answers this incorrectly (correctly) also, we next administer a still easier
(harder) item, and so on.
There will be no finite maximum likelihood estimate of the examinee's ability
as long as his answers are all correct or all incorrect. Such a situation will not
continue very long, however: If successive decrements (increments) in item
difficulty are sizable, as they should be, we will soon be administering items at
an extreme level of difficulty or easiness.
Once the examinee has given at least one right answer and at least one wrong
answer, it is usually possible to solve the likelihood Eq. (5-19) for θ, obtaining a
finite maximum likelihood estimate, denoted by θ̂, of the examinee's ability.
Since Eq. (5-19) is an equation in just one unknown (θ), it may be readily solved
by numerical methods.
Samejima (1973) has pointed out that in certain cases the likelihood equation
may have no finite root or may have both a finite and an infinite root (see end of
Section 4.13). If this occurs, we can follow Samejima's suggestion (1977) and
administer next an extremely easy item if θ̂ = −∞ or an extremely hard item if
θ̂ = +∞. This procedure (repeated if necessary) should quickly give a usable
ability estimate without danger of further difficulties. Such difficulties are ex
tremely rare, once the number of items administered is more than 10 or 15.
As soon as we have a maximum likelihood estimate θ̂ of the examinee's
ability, we can evaluate the information function of each item in the pool at θ =
θ̂. We administer next the item that gives the most information at θ̂. When the
examinee has responded to this new item, we can reestimate θ and repeat the
procedure. When enough items have been administered, the final θ̂ is the exam-
inee's score. All such scores are on the same scale for all examinees, even though
different examinees may have taken totally different sets of items.
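A minimal sketch (not the operational program) of the administration loop just described is given below. It assumes logistic items with no guessing, solves the likelihood equation (5-19) by Newton-Raphson once the responses are mixed, and selects each next item by maximum information at the current estimate; the item pool and the simulated answers are hypothetical, included only so the sketch runs end to end.

```python
import math, random

D = 1.7

def P(theta, a, b):
    """Logistic item response function with no guessing."""
    return 1 / (1 + math.exp(-D * a * (theta - b)))

def mle(responses, theta=0.0):
    """Newton-Raphson solution of the likelihood equation (c_i = 0);
    responses is a list of (a, b, u) triples with at least one 0 and one 1."""
    for _ in range(25):
        num = sum(D * a * (u - P(theta, a, b)) for a, b, u in responses)
        den = sum(D**2 * a**2 * P(theta, a, b) * (1 - P(theta, a, b)) for a, b, u in responses)
        theta += num / den
    return theta

def tailor(pool, true_theta, n=25):
    """Administer n items, always the most informative one at the current estimate."""
    responses, remaining, theta = [], list(pool), 0.0
    for _ in range(n):
        a, b = max(remaining, key=lambda ab:
                   D**2 * ab[0]**2 * P(theta, ab[0], ab[1]) * (1 - P(theta, ab[0], ab[1])))
        remaining.remove((a, b))
        u = 1 if random.random() < P(true_theta, a, b) else 0   # simulated answer
        responses.append((a, b, u))
        if any(r[2] for r in responses) and not all(r[2] for r in responses):
            theta = mle(responses, theta)       # finite MLE exists; update it
        else:
            theta += 0.7 if u == 1 else -0.7    # step up or down until responses are mixed
    return theta
```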
The pool of items available to the computer must be very much larger than the
number n of items administered to any one examinee. If the pool contains 200 or
more items, it may be impractical to calibrate the items by administering them all
simultaneously to a single group of examinees. In certain cases, furthermore, the
range of item difficulty may be too great for administration to a single group:
Low-ability examinees, for example, who are needed to calibrate the easy items,
might find the very hard items intolerable.
When different items are calibrated on different groups of examinees, the
calibrations will in general not be comparable, because of the essential indeter
minacy of origin and scale from group to group (see Section 3.5). There are
many special ways to design test administrations so that the data can be pieced
together to place all the estimated parameters on the same scale. A simple
design might be as follows.
Divide the entire pool of items to be calibrated into K modules. If a very wide
range of item difficulty is to be covered, modules 1 , 2 , . . . , ½K should increase
in difficulty from module to module; ½K, ½K + 1,. . . , K should decrease in
difficulty. Form a subtest by combining modules 1 and 2; another by combining
modules 2 and 3; another by combining 3 and 4 , . . . ; another by combining K —
1 and K. Form a Kth subtest by combining modules K and 1. Administer the K
subtests to K nonoverlapping groups of examinees, giving a different subtest to
each group.
With this design, each item is taken by two groups of examinees. Each group
of examinees shares items with two other groups. This interlocking makes it
possible to estimate all item parameters and all ability parameters by maximum
likelihood simultaneously in a single computer run (but see Chapter 13 Appendix
for a procedure to accelerate iterative convergence). Thus all item parameters are
placed on the same scale without any inefficient piecing together of estimates
from different sources.
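A sketch of the interlocking design just described (only the pairing of modules into subtests is represented; the module contents themselves are not):

```python
def linking_subtests(K):
    """Subtest k combines modules k and k+1; the Kth subtest combines modules K and 1,
    so every module appears in exactly two subtests."""
    return [(k, k + 1) for k in range(1, K)] + [(K, 1)]

print(linking_subtests(6))
# [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1)]
```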
10.5. A BROAD-RANGE TAILORED TEST

Two parallel forms of a tailored test of verbal ability have been built, using the
principles outlined in the preceding sections. A main feature is that this test is
appropriate at any level of verbal ability from fourth grade up through graduate
school.
Many of the test items for grades 4 to 12 were obtained from the Cooperative
School and College Ability Tests and the Cooperative Sequential Tests of Educa
tional Progress. The remaining items were obtained from the College Board's
Preliminary SAT, their regular SAT, and the Graduate Record Examination.
A total of more than 1000 verbal items were available from these sources. All
items were calibrated and put on the same scale by piecing together scraps of data
available from various regular test administrations and from various equating
studies. The resulting item parameter estimates are not so accurate as could have
been obtained by the procedure outlined in the preceding section; this would have
required a large amount of special testing, however.
A very few items with very poor discriminating power ai were discarded. A
better test could have been constructed by keeping only a few hundred of the
most discriminating items (as done by Urry, 1974). Here, this was considered
undesirable in principle because of the economic cost of discarding hundreds of
test items.
Additional items were discarded because they were found to duplicate mate
rial covered by other items. The remaining 900 items were arranged in order of
difficulty (b i ) and then grouped on difficulty into 10 groups.
All items in the most extreme groups were retained because of a scarcity of
very difficult and very easy items. At intermediate difficulty levels, many more
items were available than were really needed for the final item pool. Although all
items could have been retained, 50 items were chosen at random from each
difficulty level for use in the final pool for the broad-range tailored test, thus
freeing the other items for other uses. Note again that this selection was made at
random and not on the basis of item discriminating power, for the reason outlined
in a preceding paragraph.
A total of 363 items were selected by this procedure. Five different item types
were represented. Within each of the 10 difficulty levels, items of a given item
type were grouped into pairs of approximately equal difficulty (bi). Two parallel
item pools were then formed by assigning one item from each pair at random to
each pool. The two pools thus provide two parallel forms for the broad-range
tailored test.
Each of the two item pools has exactly 25 items at each difficulty level, except
for extreme levels where there are insufficient items. This makes it possible to
administer 25 items to an examinee, all or most at a difficulty level appropriate
for him. In using the broad-range tailored test, exactly 25 items are administered
to each examinee.
If the items given to one examinee were selected solely on the basis of I{θ,
θ̂}, it could happen by chance, in an extreme case, that examinee A might
receive only items of type C and examinee B might receive only items of type D.
If this happened, it would cast considerable doubt on any comparison of the two
examinees' "verbal ability" scores. One good way to avoid this problem would
be to require that the first item administered to any examinee always be of type
C, the second item always be of type D, and so forth. This would assure that all
examinees take the same number of items of each type. A practical approxima
tion to this was implemented for the broad-range tailored test. The details need
not be spelled out here.
Once a maximum likelihood estimate of the examinee's ability is available, as
described in Section 10.3, the item to be administered next is thereafter always
the item of the required item type that gives the most information at the currently
estimated ability level θ̂. If one item in the pool has optimal difficulty level
(10-4) at θ̂ but another item is more discriminating, the latter item may give more
information at θ̂ and may thus be the one selected to be administered next. Note
that this procedure tends to administer the most discriminating items (highest ai)
first and the least discriminating items last or not at all.
For the flexilevel tests and for the two-stage tests of preceding chapters, it is
possible to write formulas for computing the (conditional) mean and variance of
the final test score for people at any specified ability level. The information
function can then be evaluated from these. Some early theoretical work in tai
lored testing was done in the same way. The up-and-down branching method
(Lord, 1970) and the Robbins-Monro branching method (Lord, 1971) both have
formulas for the small-sample conditional mean and variance of the final test
score.
The method used here for choosing items, while undoubtedly more efficient
than the up-and-down or the Robbins-Monro methods, does not appear to
permit calculation of the required mean and variance. Thus, any comparative
evaluation of procedures here must depend on Monte Carlo estimates of the
required mean and variance, obtained by computer simulation. Monte Carlo
methods are more expensive and less accurate than exact formulas. Monte Carlo
methods should be avoided whenever formulas can be obtained.
Simulated tailored testing can be carried out as follows. A set of equally
spaced ability levels are chosen for study. The following procedure is repeated
independently for each ability level θ.
Some way of selecting the first item to be administered is specified. The
known parameters of the first items administered are used to compute P1(θ), the
probability of a correct response to item 1 for examinees at the chosen θ level. A
hypothetical observation u1a = 0 or u1a = 1 is drawn at random with probability
of success P1(θ). This specifies the response of examinee a (at the specified
ability level θ) to the first item. The second item to be administered is chosen by
the rules of Section 10.3. Then P2(θ) is computed, and a value of u2a is drawn at
random with probability of success P2(θ). The entire process is repeated until
n = 25 items have been administered to examinee a. According to the rules of
Section 10.3, this will involve the computation of many successive maximum
likelihood estimates θ̂a of the ability of examinee a. The successive θ̂a are used
to select the items to be administered but not in the computation of Pi(θ). The
final θ̂a, based on the examinee's responses to all 25 items, is his final test score.
In the simulations reported here, the foregoing procedure was repeated inde
pendently for 200 examinees at each of 13 different θ levels. At each θ level, the
mean m and variance s2 of the 200 final scores were computed. In principle, if
the θ levels are not too far apart, the information function at each chosen level of
θ can be approximated from these results, using the formula (compare Section
5.2)
I{θ, θ̂} ≈ [m(θ̂|θ+1) − m(θ̂|θ−1)]² / [(θ+1 − θ−1)² s²(θ̂|θ0)],    (10-7)

where θ−1, θ0, and θ+1 denote successive levels of θ, not too far apart. This
formula uses a common numerical approximation to the derivative of m(θ̂|θ).
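A sketch of this approximation, taking as given the Monte Carlo means and variances of the final scores at equally spaced ability levels (all inputs hypothetical):

```python
def info_from_simulation(thetas, means, variances):
    """Eq. (10-7): information at each interior ability level, with the slope of the
    conditional mean taken by a central difference across adjacent levels."""
    estimates = {}
    for k in range(1, len(thetas) - 1):
        slope_sq = ((means[k + 1] - means[k - 1]) / (thetas[k + 1] - thetas[k - 1])) ** 2
        estimates[thetas[k]] = slope_sq / variances[k]
    return estimates
```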
Both μ(θ̂|θ) and σ²(θ̂|θ) can be estimated by the Monte Carlo method from
200 final scores with fair accuracy. The difference in the numerator of (10-7)
is a quite unstable estimator, however, because of loss of significant figures
due to cancellation. This is a serious disadvantage of Monte Carlo evaluation of
test procedures and designs.
In the present case, it was found that μ(θ̂|θ) was close to θ, showing that θ̂ is a
reasonably unbiased estimator of θ. Under such conditions [see Eq. (5-8)], the
information function is inversely proportional to the error variance σ²(θ̂|θ). Re-
sults here are therefore presented in terms of estimated error variance (Fig.
10.7.1) or its reciprocal (Fig. 10.7.2).
²Figures 10.7.1 and 10.7.2, also the accompanying explanations, are taken with permission from
F. M. Lord, A broad-range tailored test of verbal ability. In C. L. Clark (Ed.), Proceedings of the
First Conference on Computerized Adaptive Testing. Washington, D.C.: United States Civil Service
Commission, 1976, pp. 75-78; also Applied Psychological Measurement, 1977, 1, 95-100.
[FIG. 10.7.1. Standard error of measurement (vertical axis, about .1 to .3) plotted against ability level, for different difficulty levels of the first item administered.]
the horizontal scale—about fifth-grade level. The small dots represent the results
when the difficulty level of the first item was near 0—about ninth-grade level.
For the hexagons, it was near .75—near the average verbal ability level of col-
lege applicants taking the College Entrance Examination Board's Scholastic
Aptitude Test. For the points marked by an x, it was near 1.5. For any given
ability level, the standard error of measurement varies surprisingly little, consid-
ering the extreme variation in starting item difficulty. We see that the difficulty
level of the first item administered is not likely to be a serious problem for the
kind of tailored testing recommended here.
It is important to compare the broad-range tailored test with a conventional
test. Let us compare it with a 25-item version of the Preliminary Scholastic
Aptitude Test of the College Entrance Examination Board. Figure 10.7.2 shows
the information function for the Verbal score on each of three forms of the PSAT
adjusted to a test length of just 25 items and also the approximate information
function for the verbal score on the broad-range tailored test, which administers
just 25 items to each examinee. The PSAT information functions are computed
from estimated item parameters; the tailored test information function is the
reciprocal of the squared standard error of measurement from Monte Carlo re-
sults. The tailored test shown in Fig. 10.7.2 corresponds to the hexagons of Fig.
10.7.1.
This tailored test is at least twice as good as a 25-item conventional PSAT at
almost all ability levels. This is not surprising: At the same time that we are
tailoring the test to fit the individual, we are taking advantage of the large item
pool, using the best 25 items available within certain restrictions on item type.
Because we are selecting only the best items, the comparison may be called
unfair to the PSAT. It is not clear, however, how a "fair" evaluation of the
tailored test is to be made.
FIG. 10.7.2. Information function for the 25-item tailored test, also for three forms of the Preliminary Scholastic Aptitude Test (dotted lines) adjusted to a test length of 25 items. [Vertical axis: information, 0 to 8.0.]
Part of the advantage of the tailored test is due to matching the difficulty of the
items administered to the ability level of the examinee. Part is due to selecting the
most discriminating items. A study of a hypothetical broad-range tailored test
composed of items all having the same discriminating power would throw light
on this problem. It would show how much gain could be expected solely from
matching item difficulty to ability level.
The advantages of selecting the best items from a large item pool are made
clear by the following result. Suppose each examinee answers just 25 items, but
these are selected from the combined pool of 363 items rather than from the pool
of 183 items used for Fig. 10.7.2. Monte Carlo results show that the tailored test
with the doubled item pool will give at least twice as much information as the
25-item tailored test of Fig. 10.7.2. Selecting the best items from a 363-item pool
gives a better set of 25 items than selecting from a 183-item pool.
If it is somehow uneconomical to make heavy use of the most discriminating
items in a pool, one could require that item selection should be based only on
item difficulty and not on information or discriminating power. If this restriction
is not accepted, it is not clear how adjustment should be made for size of item
pool when comparing different tailored tests.
others. Selected reports from these sources and others are included in the list of
references. For good recent reviews of tailored testing, see McBride (1976) and
Killcross (1976).
REFERENCES
Technical Study 74-3. Washington, D.C.: Research Section, Personnel Research and Develop-
ment Center, 1974.
Urry, V. W. Tailored testing: A successful application of latent trait theory. Journal of Educational
Measurement, 1977, 14, 181-186.
Vale, C. D., & Weiss, D. J. A simulation study of stradaptive ability testing. Research Report 75-6.
Minneapolis: Psychometric Methods Program, Department of Psychology, University of Min-
nesota, 1975.
Weiss, D. J. Strategies of adaptive ability measurement. Research Report 74-5. Minneapolis:
Psychometric Methods Program, Department of Psychology, University of Minnesota, 1974.
Weiss, D. J. (Ed.). Computerized adaptive trait measurement: Problems and prospects. Research
Report 75-5. Minneapolis: Psychometric Methods Program, Department of Psychology, Univer-
sity of Minnesota, 1975.
Weiss, D. J. (Ed.). Applications of computerized adaptive testing. Research Report 77-1. Min-
neapolis: Psychometric Methods Program, Department of Psychology, University of Minnesota,
1977.
Weiss, D. J., & Betz, N. E. Ability measurement: Conventional or adaptive? Research Report 73-1.
Minneapolis: Psychometric Methods Program, Department of Psychology, University of Min-
nesota, 1973.
11 Mastery Testing
11.1. INTRODUCTION
What is needed is a way of evaluating the testing procedure that does not
depend on the unknown distributions of ability in the groups to be tested. The
approach in this chapter makes no use of these unknown ability distributions.
We gain flexibility and generality when we do not require knowledge of the
ability distribution in the group tested. At the same time, we necessarily pay a
price: An evaluation based on incomplete information cannot have every virtue
of an evaluation based on complete information.
11.3. DECISION RULES

The practical use of any mastery test involves some rule for deciding which
examinees are to be classified as masters and which are not. In the case of a test
composed of n dichotomous items, each decision is necessarily based on the
examinee's n responses. For any one examinee, these responses are denoted by
the vector u = {u1, u2, . . . ,un}. Since the decision d depends on u, we may
write it as the function d ≡ d(u). The decision to classify an examinee as a
master will be denoted by d = 1 or "accept"; as a nonmaster, by d = 0 or
"reject."
The decision rule d may produce two kinds of errors:
d(u) = 1 (accept) when θ = θ1,
d(u) = 0 (reject) when θ = θ2.
Let α and β denote, respectively, the probabilities of these two kinds of errors, so
that
α ≡ Prob[d(u) = 1 | θ1],
β ≡ Prob[d(u) = 0 | θ2].    (11-1)

1 − β = Σ_{d=1} Prob(u|θ2).

This can be written

1 − β = Σ_{d=1} λ(u)Prob(u|θ1),    (11-3)

where λ(u) is the likelihood ratio

λ(u) = Prob(u|θ2)/Prob(u|θ1).    (11-4)
This ratio compares the likelihood of the observed response vector u when the
examinee is at θ2 with the likelihood when he is at θ1. This is the ratio commonly
used to test the hypothesis θ = θ1 versus the hypothesis θ = θ2. Note that the
likelihood ratio for any u is known from item response theory [see Eq. (4-20) and
(11-12)] as soon as the item parameters and the form of the item response
function are specified.
The expected value of the likelihood ratio for given θ1, given the decision to
accept, is

E[λ(u) | θ = θ1, d = 1] = Σ_{d=1} λ(u)Prob(u|θ1) / Σ_{d=1} Prob(u|θ1).

¹The problem of how to score the test can be solved by application of the Neyman-Pearson
Theorem (R. V. Hogg & A. T. Craig, Introduction to mathematical statistics (3rd ed.). New York:
Macmillan, 1970, Chapter 9). An explicit proof is given here in preference to a simple citation of the
theorem.
To apply this decision rule for fixed α in practical testing, proceed as follows:
1. Score each answer sheet to obtain λ(u), the likelihood ratio (11-4) for the
pattern of responses given by the examinee [see Eq. (11-12)].
2. Accept each examinee whose score λ(u) is above some cutting score λ0α
that depends on α, as specified in steps 2 and 3 above.
3. If λ(u) = λ0α for some examinees, choose among these at random as in step
4 above.
Note that the examinee's score actually is the likelihood ratio for the responses on
his answer sheet. The optimal scoring procedure does not require that we know
the distribution of ability in the group tested. Simplified scoring methods are
considered in Section 11.8.
11.5. LOSSES
              θ1             θ2
d = 1         N1α            N2(1 − β)
d = 0         N1(1 − α)      N2β
Total         N1             N2
The expected loss, to be denoted by C, is N1αA' + N2βB'. Van Ryzin and
Susarla (1977) give a practicable empirical Bayes procedure that, in effect,
estimates N1 and N2 while minimizing C for fixed A' and B'. See also Snijders
(1977). We shall not consider such procedures here.
Define A = N1A' and B = N2B'; then the expected loss is
C = Aα + Bβ.    (11-6)
11.6. CUTTING SCORE FOR THE LIKELIHOOD RATIO

Suppose, first, that we are given A and B. We shall find a decision rule that
minimizes C.
If α is given, Section 11.4 provides the solution to the decision problem posed.
The present section deals with the case where α is not known but A and B are
specified instead.
For any given α, it is obvious from (11-6) that the expected loss C is
minimized by making β as small as possible. Thus here we again want to use the
likelihood ratio λ(u) as the examinee's score, accepting only the highest scoring
examinees. In Section 11.4, we determined the cutting score so as to satisfy the
first equation in (11-2) for given α. Here the cutting score that minimizes C, to
be denoted now simply by λ0, will be shown to have a simple relation to A and B.
Let r = 1, 2,. . . , R index all numerically different scores λ(u), arranging
them in order so that λR > λR−1 > ... > λ1. Let r* denote the lowest score to
be "accepted" under the decision rule. Consider first the case where it is unnec
essary to assign any examinees at random. The expected loss can then be written
Cr* = A Σ_{r≥r*} Prob(λr|θ1) + B Σ_{r<r*} Prob(λr|θ2).    (11-7)
We wish to choose α to minimize C. This is the same as choosing r* to
minimize C. Since C ≥ 0, and since each summation in (11-7) lies between 0
and 1, C must have a minimum, to be denoted by C°, on 0 ≤ α ≤ 1, or
(equivalently) on 1 ≤ r* ≤ R. Denote by r° the value of r* that minimizes Cr*.
If C° is a minimum, we must have Cr°+1 − C° ≥ 0 and Cr°−1 − C° ≥ 0. (If α
= 0 or α = 1, only one of these inequalities is required; for simplicity, we ignore
this trivial case.) Substituting into these inequalities from (11-7), we find

Prob(λr°|θ2)/Prob(λr°|θ1) ≥ A/B ≥ Prob(λr°−1|θ2)/Prob(λr°−1|θ1).    (11-8)
If there is only one pattern of scores that has the likelihood ratio λr° ≡ λr°(u),
we can denote this pattern by ur°. In this case, Prob(λr°|θ) = Prob(ur°|θ) and the
left side of (11-8) is the likelihood ratio λr°. If there is only one pattern of scores
that has the likelihood ratio λr°−1, the right side of (11-8), similarly, is the
likelihood ratio λr°−1. Thus (11-8) becomes

λr° ≥ A/B ≥ λr°−1.    (11-9)
This same result is easily found also for the case where there may be several u
with the same λ(u). The conclusion is that expected loss is minimized by choos
ing the cutting score to be
λ° = A/B.    (11-10)
This conclusion was reached for the special case where all examinees with
score λr can be accepted (none have to be assigned at random). We now remove
this limitation by showing the following: If any examinees have scores exactly
equal to the cutting score λ° = A/B, there will be no difference in expected loss
however these examinees are assigned.
Consider examinees whose score pattern u° is such that
λ(u°) = Prob(u°|θ2)/Prob(u°|θ1) = A/B.
1. The score assigned to each examinee is the likelihood ratio for his response
pattern u.
2. Accept the examinee if his score exceeds the ratio of the two costs, A/B;
reject him if his score is less than A/B.
3. If his score equals A/B, either decision is optimal.
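A minimal sketch of this rule under the three-parameter logistic model follows; the item parameters, the levels θ1 and θ2, and the costs A and B are hypothetical values chosen only for illustration.

```python
import math

D = 1.7

def P(theta, a, b, c):
    """Three-parameter logistic item response function."""
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

def likelihood_ratio(u, items, theta1, theta2):
    """Eq. (11-12): Prob(u | theta2) / Prob(u | theta1)."""
    lam = 1.0
    for ui, (a, b, c) in zip(u, items):
        p1, p2 = P(theta1, a, b, c), P(theta2, a, b, c)
        lam *= p2 / p1 if ui == 1 else (1 - p2) / (1 - p1)
    return lam

items = [(1.0, -0.5, 0.2), (0.8, 0.0, 0.2), (1.2, 0.5, 0.2)]   # (a, b, c) per item
A, B = 5.0, 2.0
lam = likelihood_ratio([1, 1, 0], items, theta1=-1.0, theta2=1.0)
print(round(lam, 3), "accept" if lam > A / B else "reject")
```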
Theorem 11.6.1. The expected loss C is minimized by the decision rule d(u) if
and only if
Usually we do not know the losses A', B', or the weights A and B. In such cases,
we can fall back on the fact that any decision rule d(u) obtained from Theorem
11.6.1 for any A and any B is an admissible decision rule. In the present context,
this means that no other rule d*(u) can have smaller error probability α unless it
also has a larger β, nor can it have a smaller β unless it also has a larger α.
To prove that d(u) of Theorem 11.6.1 is admissible, suppose to the contrary
that α* = Prob(d* = 1|θ1) < α and that at the same time β* = Prob(d* = 0|θ2)
≤ β. It would follow that C* = Aα* + Bβ* < Aα + Bβ = C. But this
contradicts the theorem, which states that d minimizes C; hence the supposition
must be false. Any decision rule defined by Theorem 11.6.1 is an admissible
rule; there is no other rule that is better both at θ1 and at θ2.
11.8. WEIGHTED SUM OF ITEM SCORES
It will make no difference if we use the logarithm of the likelihood ratio as the
examinee's score instead of the ratio itself. From Eq. (4-20), the likelihood ratio
for response pattern u is
λ(u) = Π_{i=1}^{n} [Pi(θ2)/Pi(θ1)]^{ui} [Qi(θ2)/Qi(θ1)]^{1−ui}.    (11-12)

Taking logarithms,

ln λ(u) = Σ_{i=1}^{n} ui ln[Pi(θ2)Qi(θ1)/(Pi(θ1)Qi(θ2))] + K,    (11-13)

where

K = Σ_{i=1}^{n} ln[Qi(θ2)/Qi(θ1)].    (11-14)
Thus, the examinee's score y (say) may be taken to be a weighted sum of item
scores:
y = y(u) = Σ_{i=1}^{n} wi(θ1, θ2)ui,    (11-15)

where the scoring weights are given by

wi(θ1, θ2) = ln[Pi(θ2)Qi(θ1)/(Pi(θ1)Qi(θ2))].    (11-16)

The corresponding cutting score for y is

y0 = ln A − ln B − K.    (11-17)
In general, the item-scoring weights wi(θ1, θ2) depend on the choice of θ1 and θ2.
If we divide all wi(θ1, θ2) by θ2 − θ1, and make a corresponding change in the
cutting score y0, this does not change any decision. Let us relabel θ1 as θ0 and
define the locally best scoring weight as the limit of wi(θ0, θ2) when θ2 → θ0:

W0i ≡ Wi(θ0) ≡ lim_{θ2→θ0} [wi(θ0, θ2)/(θ2 − θ0)].    (11-18)

Carrying out the limit gives

Wi(θ0) = P'i(θ0)/[Pi(θ0)Qi(θ0)].    (11-19)
The locally best weighted sum Y of item scores is seen from (11-19) to be

Y = Σ_{i=1}^{n} Wi(θ0)ui = Σ_{i=1}^{n} P'i(θ0)ui/[Pi(θ0)Qi(θ0)].    (11-20)
If the examinee's score is Y, what is the appropriate cutting score Y0?
The locally best score Y (11-20) is obtained from the optimally weighted sum
y (11-15) by the relation

Y = lim_{θ2→θ1} [y/(θ2 − θ1)], evaluated at θ0.    (11-21)

The corresponding cutting score is

Y0 = −(d/dθ0) Σ_{i=1}^{n} ln Qi(θ0) = Σ_{i=1}^{n} P'i(θ0)/Qi(θ0).    (11-22)
If the cost of accepting nonmasters who are just below θ0 is equal to the cost
of rejecting masters who are just above θo, then we can use Y0 in (11-22) as the
cutting score against which each person's score Y is compared.
If an examinee's score Y is exactly at the cutting score Y0, we have from
(11-20) and (11-22)
Σ_{i=1}^{n} P'i(θ0)ui/[Pi(θ0)Qi(θ0)] = Σ_{i=1}^{n} P'i(θ0)/Qi(θ0).    (11-23)
Comparing this with Eq. (5-19), we see that this is the same as the likelihood
equation for estimating the examinee's ability from his responses u. This means
that if a person's score Y is at the cutting point, then the maximum likelihood
estimate of his ability is exactly θ0, the ability level that divides masters from
nonmasters. This result further clarifies the choice of Y0 as a cutting score.
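A short sketch of these formulas for logistic items with no guessing (hypothetical item parameters; for this model P′ = DaPQ, so the weight (11-19) reduces to Da and each term of the cutting score (11-22) to DaP at θ0):

```python
import math

D = 1.7

def P(theta, a, b):
    return 1 / (1 + math.exp(-D * a * (theta - b)))

def locally_best(items, theta0):
    """Weights W_i(theta0) from (11-19) and cutting score Y0 from (11-22)."""
    weights, y0 = [], 0.0
    for a, b in items:
        p = P(theta0, a, b)
        p_prime = D * a * p * (1 - p)
        weights.append(p_prime / (p * (1 - p)))   # = D * a for this model
        y0 += p_prime / (1 - p)                    # = D * a * P(theta0)
    return weights, y0

weights, y0 = locally_best([(1.0, -0.5), (0.8, 0.0), (1.2, 0.5)], theta0=0.0)
print([round(w, 2) for w in weights], round(y0, 2))   # [1.7, 1.36, 2.04] 2.41
```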
11.11. EVALUATING A MASTERY TEST

If we are concerned with two specified ability levels θ1 and θ2, as in Sections
11.2-11.8, we obviously should evaluate the mastery test in terms of the expected
loss (11-6). For this, we need to know A, B, α, and β.
Given θ1, θ2, the item parameters, the form of the item response function, and
the cutting score y0, we can determine misclassification probabilities α and β
from the frequency distribution of the weighted sum of item scores

y = Σ_i wi(θ1, θ2)ui.

The required frequency distribution fy = fy(y) for any given θ is provided by the
generating function

Σ_y fy t^y = Π_{i=1}^{n} [Qi(θ) + Pi(θ)t^{wi(θ1,θ2)}]    (11-24)

[compare Eq. (4-1)]. In other words, the frequency fy of any score y appears as
the coefficient of t^y on the right side of (11-24) after expansion. To obtain α (or
β), (11-24) must be evaluated at θ = θ1 (or θ = θ2) and the frequencies cumulated
as required by (11-2) and by Theorem 11.6.1. For example, if n = 2,
w1(θ1, θ2) = .9, w2(θ1, θ2) = 1.2, then (11-24) becomes

Q1Q2 + P1Q2t^{.9} + Q1P2t^{1.2} + P1P2t^{2.1}.

Thus fy(0) = Q1Q2, fy(.9) = P1Q2, fy(1.2) = Q1P2, fy(2.1) = P1P2.
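The same bookkeeping can be sketched by direct enumeration of response patterns, which is equivalent to expanding the generating function (11-24); the item probabilities and cutting score below are hypothetical.

```python
from itertools import product

def score_distribution(P, w):
    """Exact distribution of y = sum of w_i * u_i; P gives P_i(theta) for each item."""
    dist = {}
    for u in product((0, 1), repeat=len(P)):
        prob = 1.0
        for ui, pi in zip(u, P):
            prob *= pi if ui else 1 - pi
        y = sum(wi * ui for wi, ui in zip(w, u))
        dist[y] = dist.get(y, 0.0) + prob
    return dist

w = [0.9, 1.2]                                     # the weights of the example above
dist1 = score_distribution([0.3, 0.4], w)          # hypothetical P_i(theta1)
dist2 = score_distribution([0.7, 0.8], w)          # hypothetical P_i(theta2)
y0 = 1.0                                           # hypothetical cutting score
alpha = sum(p for y, p in dist1.items() if y >= y0)   # accept an examinee at theta1
beta = sum(p for y, p in dist2.items() if y < y0)     # reject an examinee at theta2
print(round(alpha, 3), round(beta, 3))             # 0.4 0.2
```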
If A and B are known, the expected loss can be computed from (11-6). If A
and B are not known a priori, but some cutting score λ° has somehow been
chosen, the ratio of A to B can be found from the equation λ° = A/B. Together
with α and β, this ratio is all that is needed to determine the relative effectiveness
of different mastery tests.
Although the expected loss can be determined as just described, the procedure
is not simple and the formulas are not suitable for needed further mathematical
derivations. In view of this, we shall often evaluate the effectiveness of a mastery
test by the test information I{θ} at the ability level θ = θ0 that separates mastery
from nonmastery. The test information at ability θ0 is

I0{θ} = Σ_{i=1}^{n} Pi'²/(PiQi) |θ=θ0.    (11-25)
For the three-parameter logistic case, if all items have the same ai = a and ci
= c and if the bi are optimal (11-26), then the required number n0 of items is
found by dividing the chosen value of I0{θ} by the maximum item information
M given by Eq. (10-6). This gives

n0 = 8(1 − c)²I0{θ} / {D²a²[1 − 20c − 8c² + (1 + 8c)^{3/2}]}.    (11-28)
If c = 0, the required test length is
n0 = 1.384 I0{θ}/a².    (11-29)
The number of items required is inversely proportional to the square of the
item discriminating power a. If a certain mastery test requires 100 items with
c = 0, we find from (11-28) that it will require 138 items with c = .167, 147
items with c = .20, 162 items with c = .25, 191 items with c = .333, or 277
items with c = .50.
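The figures in the preceding sentence follow directly from (11-28); a small check (the target information is set so that exactly 100 items are required when c = 0):

```python
D = 1.7

def n_required(I0, a, c):
    """Eq. (11-28): items needed to reach information I0 at theta0, common a and c,
    optimal item difficulties."""
    bracket = 1 - 20 * c - 8 * c**2 + (1 + 8 * c) ** 1.5
    return 8 * (1 - c) ** 2 * I0 / (D**2 * a**2 * bracket)

a = 1.0
I0 = 25 * D**2 * a**2          # chosen so that n_required(I0, a, 0) = 100
for c in (0.0, 0.167, 0.20, 0.25, 0.333, 0.50):
    print(c, round(n_required(I0, a, c)))
# 100, 138, 147, 162, 191, 277 -- the progression quoted above
```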
According to the approach suggested here, the design of a mastery test for a
unidimensional skill could proceed somewhat as follows.
11.15. EXERCISES
11-1 Suppose a mastery test consists of n = 5 items exactly like item 2 in test 1
and also that θ1 = −1, θ2 = +1. What is α if we accept only examinees
who score x = 5? (Use Table 4.17.1.) What is β? What if we accept x ≥
4? x ≥ 3? x ≥ 2? If A = B = 1, what is the expected loss C of accepting
x = 5? x ≥ 4? x ≥ 3? x ≥ 2?
11-2 What is the score (likelihood ratio) of an examinee who gets all five items
right (x = 5) in Exercise 11-1? What is his score if x = 4? 3? 2? 1? 0?
Which examinees will be accepted if A = 5, B = 2?
11-3 Suppose test 1 (see Table 4.17.1) is used as a mastery test with θ1 = −1, θ2
= +1. What is the score (likelihood ratio) of an examinee with u = {1, 0,
0}? {0, 1, 0}? {0, 0, 1}? Do these scores arrange these examinees in the
order that you would expect? Explain the reason for the ordering obtained.
11-4 What is the optimal scoring weight (11-16) for each of the three items in
the test in Exercise 11-3 (be sure to use natural logarithms)? Why does
item 3 get the least weight?
11-5 What is the optimally weighted sum of item scores (11-15) for an exam
inee in Exercise 11-3 with response pattern u = {1, 1, 0}? {1, 0, 1}? {0,
1, 1}? If A = B, what is the cutting score (11-17) for these optimally
weighted sums of item scores?
11-6 What are the locally best scoring weights (11-19) for each of the three
items in the test in Exercise 11-3 when θ0 = 0? Compare with the weights
found in Exercise 11-4. Are the differences important?
11-7 What is the locally best weighted sum of item scores (11-20) for an
examinee in Exercise 11-3 with response pattern u = {1, 1, 0}? {1,0,
1}? {0, 1, 1}? If A = B, what is the locally best cutting score (11-22) for
these scores? Compare with the results of Exercise 11-5 and comment on
the comparison.
REFERENCES
Birnbaum, A. Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord
and M. R. Novick, Statistical theories of mental test scores. Reading, Mass.: Addison-Wesley,
1968.
Birnbaum, A. Statistical theory for logistic mental test models with a prior distribution of ability.
Journal of Mathematical Psychology, 1969, 6, 258-276.
Davis, C. E., Hickman, J., & Novick, M. R. A primer on decision analysis for individually
prescribed instruction. Technical Bulletin No. 17. Iowa City, Ia.: Research and Development
Division, The American College Testing Program, 1973.
Glass, G. V. Standards and criteria. Journal of Educational Measurement, 1978, 15, 237-261.
Huynh, H. On the reliability of decisions in domain-referenced testing. Journal of Educational
Measurement, 1976, 13, 253-264.
Snijders, T. Complete class theorems for the simplest empirical Bayes decision problems. The
Annals of Statistics, 1977, 5, 164-171.
Subkoviak, M. J. Estimating reliability from a single administration of a criterion-referenced test.
Journal of Educational Measurement, 1976, 13, 265-276.
Swaminathan, H., Hambleton, R. K., & Algina, J. A Bayesian decision-theoretic procedure for use
with criterion-referenced tests. Journal of Educational Measurement, 1975, 12, 87-98.
van der Linden, W. J. Forgetting, guessing, and mastery: The Macready and Dayton models revisited
and compared with a latent trait approach. Journal of Educational Statistics, 1978, 3, 305-317.
van Ryzin, J., & Susarla, V. On the empirical Bayes approach to multiple decision problems. The
Annals of Statistics, 1977, 5, 172-181.
III PRACTICAL PROBLEMS AND
FURTHER APPLICATIONS
12 Estimating Ability and
Item Parameters
In its simplest form, the parameter estimation problem is the following. We are
given a matrix U = ||u_{ia}|| consisting of the responses (u_{ia} = 0 or 1) of each of N
examinees to each of n items. We assume that these responses arise from a
certain model such as Eq. (2-1) or (2-2). We need to infer the parameters of the
model: a_i, b_i, c_i (i = 1, 2, ..., n) and θ_a (a = 1, 2, ..., N).
As noted in Section 4.10 and illustrated for one θ in Fig. 4.9.1, the maximum
likelihood estimates are the parameter values that maximize the likelihood
L(U|θ; a, b, c) given the observations U. Maximum likelihood estimates are usually
found from the roots of the likelihood equations (4-30), which set the derivatives
of the log likelihood equal to zero. The likelihood equations (4-30) are
Σ_{i=1}^{n} (u_{ia} − P_{ia}) P'_{ia} / (P_{ia} Q_{ia}) = 0   (a = 1, 2, ..., N),   (12-1a)

where P'_{ia} ≡ ∂P_{ia}/∂θ_a = D a_i Q_{ia}(P_{ia} − c_i) / (1 − c_i), together with three analogous
equations in the derivatives of P_{ia} with respect to a_i, b_i, and c_i for each item
(i = 1, 2, ..., n).   (12-1b)
A similar set of likelihood equations can be obtained in the same way for the
three-parameter normal ogive model.
These formulas are given here to show their particular character. The reader
need not be concerned with the details. The important characteristic of (12-1a) is
that when the item parameters are known, the ability estimate θ̂_a for examinee a
is found from just one equation out of the N equations (12-1a). The estimate θ̂_a does
not depend on the other θ̂. When the examinee parameters are known, the three
parameters for item i are estimated by solving just three equations out of (12-1b).
The estimates for item i do not depend on the parameters of the other items.
This suggests an iterative procedure where we treat the trial values of θ_a (a =
1, 2, ..., N) as known while solving (12-1b) for the estimates â_i, b̂_i, ĉ_i (i = 1,
2, ..., n); then treat all item parameters (i = 1, 2, ..., n) as known while
solving (12-1a) for new trial values θ̂_a (a = 1, 2, ..., N). This is to be repeated
until the numerical values converge. Because of the independence within each set
of parameter estimates when the other set is fixed, this procedure is simpler and
quicker than solving for all parameters at once.
When the item parameters are treated as known, the information and the first
derivative of the log likelihood for the ability parameters are

I_rr = Σ_{i=1}^{n} P'²_{ia} / (P_{ia} Q_{ia})   (r ≡ a = 1, 2, ..., N).   (12-3)

From Eq. (4-30),

S_r = Σ_{i=1}^{n} (u_{ia} − P_{ia}) P'_{ia} / (P_{ia} Q_{ia})   (r ≡ a = 1, 2, ..., N).   (12-4)

When the item parameters are fixed, the correction Δ_r⁰ to the trial value θ_r⁰ is
simply Δ_r⁰ = S_r⁰ / I_r⁰. Thus θ̂_a (a = 1, 2, ..., N) can be readily found by the
iterative method of the preceding paragraph.
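The following sketch, with hypothetical item parameters not taken from the text, illustrates (12-3), (12-4), and the correction Δ_r⁰ = S_r⁰/I_r⁰ for a single examinee when the item parameters are treated as known; this is the θ step of the alternating procedure described above.

```python
import math

D = 1.7

def p3(theta, a, b, c):
    """Three-parameter logistic item response function, Eq. (2-1)."""
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

def theta_step(u, items, theta=0.0, n_iter=25):
    """Scoring (modified Newton-Raphson) iterations for one examinee.
    u: list of 0/1 responses; items: list of (a, b, c) treated as known."""
    for _ in range(n_iter):
        S = I = 0.0
        for ui, (a, b, c) in zip(u, items):
            P = p3(theta, a, b, c)
            Q = 1 - P
            Pprime = D * a * Q * (P - c) / (1 - c)   # dP/d(theta)
            S += (ui - P) * Pprime / (P * Q)         # Eq. (12-4)
            I += Pprime**2 / (P * Q)                 # Eq. (12-3)
        theta += S / I                               # correction S/I
        if abs(S / I) < 1e-8:
            break
    return theta, 1.0 / math.sqrt(I)   # estimate and its asymptotic standard error

# Hypothetical three-item example.
items = [(1.0, -0.5, 0.2), (0.8, 0.0, 0.2), (1.2, 0.7, 0.2)]
print(theta_step([1, 1, 0], items))
```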
When the ability parameters are fixed (treated as known), the parameter
vector x is {a₁, b₁, c₁; a₂, b₂, c₂; ...; a_n, b_n, c_n}. Formulas (see Appendix 12)
for the I_qr are obtained by the same method used to find Eq. (5-5). The information
matrix ||I_qr|| is a diagonal supermatrix whose diagonal elements are 3 × 3
matrices, one for each item. The 3 × 3 matrices are not diagonal. The corrections
Δ_r are obtained separately for each item, by solving three linear equations in
three unknown Δ's.
If the true item parameters were known, then the asymptotic sampling variance
of θ̂_a would be approximated by 1/I_aa evaluated at θ_a = θ̂_a. This is readily
obtained from the modified Newton-Raphson procedure after convergence [see Eq.
(5-4)]. Ability estimates θ̂_a and θ̂_b are uncorrelated when a ≠ b.
A large sampling variance occurs when the likelihood function has a relatively
flat maximum, as in the two curves on the left and the one on the right of Fig.
4.9.1. A small sampling variance occurs when the likelihood function has a
well-determined maximum, as in the middle three curves of Fig. 4.9.1.
If the true ability parameters were known, the asymptotic sampling variance-
covariance matrix of the â_i, b̂_i, and ĉ_i would be approximated by the inverse of
the 3 × 3 matrix of the I_qr for item i, evaluated at a_i, b_i, and c_i. Parameter
estimates for item i are uncorrelated with estimates for item j when i ≠ j.
In practice, estimated sampling variances and covariances of parameter estimates
are obtained by substituting estimated parameter values for parameters
assumed known. When all parameters must be estimated, this substitution
underestimates the true sampling fluctuations.
Andersen (1973) argues that when item and person parameters are estimated
simultaneously, the estimates do not converge to their true values as the number
of examinees becomes large. The relevant requirement, however, is convergence
when the number of people and the number of items both become large together.
A proof of convergence for this case has been given for the Rasch model by
Haberman (1977). It seems likely that convergence will be similarly proved for
the three-parameter model also.
If an examinee does not respond to the last few items in a test because of lack of
time, his (lack of) behavior with respect to these items is not described by current
item response theory. Unidimensional item response theory deals with actual
responses; it does not predict whether or not a response will occur. Many low-
ability students answer all the items in a test well ahead of the allotted time limit.
Ability to answer test items rapidly is thus only moderately correlated, if at all,
with ability to answer correctly. Item response theory currently deals with the
latter ability and not at all with the former. [Models for speeded tests have been
developed by Meredith (1970), Rasch (1960, Chapter 3), van der Ven (1976),
and others.]
If we knew which items the examinee did not have time to consider, we would
ignore these items when estimating his ability. This is appropriate because of the
fundamental property that the examinee's ability θ is the same for all items in a
unidimensional pool. Except for sampling fluctuations, our estimate of θ will be
the same no matter what items in the pool are used to obtain it. Note that our
ability estimate for the individual represents what he can do on items that he has
time to reach and consider. It does not tell us what he can do in a limited testing
time.
In practice, all consecutively omitted items at the end of his answer sheet are
ignored for estimating an examinee's ability. Such items are called not reached
items. If the examinee did not read and respond to the test items in serial order,
we may be mistaken in assuming that he did not read such a "not reached" item.
We may also be mistaken in assuming that he did read all (earlier) items to which
he responded. The assumption made here, however, seems to be the best
practical assumption currently available.
If an examinee answers all test items correctly, the maximum likelihood estimate
of his ability is θ = + ∞. In this case, Bayesian methods would surely give a
more plausible result. In maximum likelihood estimation, such examinees may
be omitted from the data if desired. Their inclusion, however, will not affect the
estimates obtained for the item parameters nor for other ability parameters.
If an examinee answers all items incorrectly, the maximum likelihood estimate
of his ability is θ̂ = −∞. Most examinees answer at least a few items
correctly, however, if only by guessing. By Eq. (2-1) or (2-2), P_i(θ) ≥ c_i. Thus
by Eq. (4-5) the number-right true score ξ is always greater than or equal to
Σ_i c_i. An examinee's number-right observed score may be less than Σ_i c_i because
he is unlucky in his guessing; in this case we are likely to find an estimate
of θ̂ = −∞ for him.
On the other hand, an examinee with a very low number-right score may still
have a finite θ provided he has answered the easiest items correctly. This occurs
because, as seen in Section 5.6, hard items receive very little scoring weight in
determining θ̂ for low-ability examinees. Correspondingly, an examinee with a
number-right score above Σ_i c_i may still obtain θ̂ = −∞. This will happen if he
answers some very easy items incorrectly. A person who gets easy items wrong
cannot be a high-ability person.
Figure 3.5.2 compares ability estimates θ for 1830 sixth-grade pupils from a
50-item MAT vocabulary test with estimates obtained independently from a
42-item SRA vocabulary test. Values of θ̂ outside the range −2.5 to +2.5 are
plotted on the perimeter of the figure. Many of the points on the left and lower
boundaries of the figure represent pupils for whom θ̂ = −∞ on one or both tests.
It appears from the figure, as we would anticipate, that θ values between 0 and
1 show much smaller sampling errors than do extreme θ values. Although not
apparent from the figure, a pupil might have a θ of —10 on one test and a θ of
— 50 on the other, simply because of sampling fluctuations. For many practical
purposes, the difference between θ = —10 and θ = — 50 for a sixth-grade pupil is
unimportant, even if numerically large. In the usual frame of reference, it may
not be necessary to distinguish between these two ability levels. If we did wish to
distinguish them, we would have to administer a much easier test.
Since some of the θ are negatively infinite in Fig. 3.5.2, we cannot compute a
correlation coefficient for this scatterplot. If we simply omit all infinite estimates,
the correlation coefficient might be dominated by the large scatter of a
few extreme θ. If we omit these θ also, the correlation obtained will depend on
just which θ we choose to omit and which we retain.
It is helpful to transform the θ scale to a more familiar scale that better
represents our interests. A convenient and meaningful scale to use is
the proportion-correct score scale. We can transform all θ on to this scale by
the familiar transformation [Eq. (4-9)]:

ζ ≡ ζ(θ) ≡ (1/n) Σ_{i=1}^{n} P_i(θ).
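A minimal sketch of this transformation, with hypothetical item parameters (not the MAT or SRA items): each θ̂ is replaced by the estimated proportion-correct true score ζ(θ̂).

```python
import math

D = 1.7

def p3(theta, a, b, c):
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

def zeta(theta, items):
    """Proportion-correct true score, Eq. (4-9): the average of the P_i(theta)."""
    return sum(p3(theta, a, b, c) for a, b, c in items) / len(items)

items = [(1.0, -1.0, 0.2), (0.9, 0.0, 0.2), (1.1, 1.0, 0.2)]
for th in (-2.5, -1.0, 0.0, 1.0, 2.5):
    print(th, round(zeta(th, items), 3))
# Even theta = -infinity would map to a finite value, (1/n) times the sum of the c_i.
```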
Figure 12.6.1 is the same as Fig. 3.5.2 except that the points are now plotted
on the proportion-correct score scale. The θ obtained from the SRA items have
been transformed into SRA proportion-correct estimated true scores; the θ from
MAT, to MAT estimated true scores. As noted often before, proportion-correct
true scores cannot fall below Σ_i c_i / n. The product-moment correlation between
SRA and MAT values of ζ is found to be .914. This is notably higher than the
FIG. 12.6.1. Estimated true scores on a 50-item MAT vocabulary test and a 42-
item SRA vocabulary test for 1830 sixth-grade pupils.
Actually, all θ, ai, and bi (but not ci) are unidentifiable until we agree on some
arbitrary choice of origin and unit of measurement (see Section 3.5). Once this
choice is made, all θ and all item parameters will ordinarily be identifiable in a
suitable infinite population of examinees and infinite pool of test items.
Just as his θ cannot be estimated from the responses of an examinee who
answers all n test items correctly, similarly b_i cannot be estimated from the
responses of a sample of N examinees all of whom answer item i correctly. This
does not mean that b_i is unidentifiable; it only means that the data are inadequate
for our purpose. We need a larger sample of examinees (some examinees will
surely get the item wrong if the sample is large enough) or, better, a
sample of examinees at lower ability levels.
If only a few examinees in a large sample answer item i correctly, b̂_i will
have a large sampling error. To make this clearer, consider a special case where
c_i = 0, a_i = 1/1.7 = .588, and all examinees are at θ = 0. The standard error of
the proportion p_i of correct answers in a sample of N such examinees is given by
the usual binomial formula:

SE(p_i) = √(P_i Q_i / N).

Under the logistic model, for the special case considered,

P_i = (1 + e^{b_i})⁻¹

or

b_i = ln(1/P_i − 1).

The asymptotic standard error of b̂_i = ln(1/p_i − 1) is easily found by the delta
method (Kendall & Stuart, 1969, Chapter 10) to be

SE(b̂_i) = 1 / √(N P_i Q_i).

The two standard errors are compared in the following listing for N = 1000.
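The original listing is not reproduced above; the sketch below recomputes the comparison it summarizes, under the stated special case (c_i = 0, Da_i = 1, all examinees at θ = 0, N = 1000). The particular b_i values shown are illustrative.

```python
import math

N = 1000
print(" b_i     P_i    SE(p_i)   SE(b_i-hat)")
for b in (-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0):
    P = 1.0 / (1.0 + math.exp(b))      # since D*a = 1, theta = 0, c = 0
    Q = 1.0 - P
    se_p = math.sqrt(P * Q / N)        # binomial SE of the sample proportion
    se_b = 1.0 / math.sqrt(N * P * Q)  # delta-method SE of b-hat = ln(1/p - 1)
    print(f"{b:5.1f}  {P:6.3f}  {se_p:8.4f}  {se_b:10.4f}")
# SE(p_i) is largest near P = .5, while SE(b_i-hat) grows rapidly for extreme items.
```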
The problem is even more obvious for ci, which represents the performance
of low-ability examinees. If we have no such examinees in our sample, we
cannot estimate ci. This is the fault of the sample and not the fault of the
estimation method. In such a sample, any reasonable value of ci will be able to
fit the data about as well as any other. If we arbitrarily assign some plausible
value of ci and then estimate ai and bi accordingly, we shall obtain a good
description of our data.
We need not forego practical applications of item response theory just because
some of the parameters cannot be estimated accurately from our data, as long as
we restrict our conclusions to ranges for which our data are relevant. If we wish
to predict far outside these ranges, we must gather data relevant to our problem.
In work with published tests, it is usual to test similar groups of examinees year
after year with parallel forms of the same test. When this happens, we can form a
good picture of the frequency distribution of ability in the next group of examinees
to be tested. Such "prior" information can be used to advantage to improve
parameter estimation, provided it can be conveniently quantified and conveniently
processed numerically.
Suppose each examinee tested is known to be randomly drawn from a population
in which the distribution of ability is g(θ). The joint distribution of examinee
ability θ and item response vector u for a randomly chosen examinee is equal to
the conditional probability (4-20) multiplied by g(θ):

L(u, θ | a, b, c) ≡ L(u | θ; a, b, c) g(θ) = g(θ) ∏_{i=1}^{n} [P_i(θ)]^{u_i} [Q_i(θ)]^{1−u_i}.   (12-5)

The marginal distribution of u for a randomly chosen examinee is obtained from
the joint distribution by integrating out θ:

L(u | a, b, c) = ∫_{−∞}^{∞} g(θ) ∏_{i=1}^{n} [P_i(θ)]^{u_i} [Q_i(θ)]^{1−u_i} dθ.   (12-6)
The conditional distribution of θ for given u is obtained by dividing (12-5) by
(12-6). This last is the posterior distribution of θ given the item response vector
u. Since (12-6) is not a function of θ, we can say that the posterior distribution of
θ is proportional to (12-5). This distribution contains all the information we have
for inferring the ability θ of an examinee whose item responses are u.
If we want a point estimate of θ for a particular examinee, we can use the
mean of the posterior distribution (see Birnbaum, 1969) or its mode. There is no
convenient mathematical expression for either mean or mode, but both can be
determined numerically.
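Both quantities are easy to obtain by numerical integration over a grid of θ values. In the sketch below the prior g(θ) is taken to be standard normal and the item parameters are hypothetical; neither choice comes from the text.

```python
import math

D = 1.7

def p3(theta, a, b, c):
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

def posterior_summaries(u, items, lo=-6.0, hi=6.0, m=1201):
    """Posterior mean (EAP) and mode of theta for response vector u, with a
    standard normal prior g(theta); Eqs. (12-5) and (12-6) by quadrature."""
    step = (hi - lo) / (m - 1)
    grid = [lo + k * step for k in range(m)]
    post = []
    for th in grid:
        like = 1.0
        for ui, (a, b, c) in zip(u, items):
            P = p3(th, a, b, c)
            like *= P if ui else (1 - P)
        prior = math.exp(-0.5 * th * th)          # N(0, 1) up to a constant
        post.append(like * prior)
    total = sum(post)                              # proportional to (12-6)
    eap = sum(th * w for th, w in zip(grid, post)) / total
    mode = grid[max(range(m), key=post.__getitem__)]
    return eap, mode

items = [(1.0, -1.0, 0.2), (0.9, 0.0, 0.2), (1.1, 0.5, 0.2), (0.8, 1.0, 0.2)]
print(posterior_summaries([1, 1, 0, 0], items))
```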
12.9. FURTHER THEORETICAL COMPARISON OF ESTIMATORS
Use of the posterior mean to estimate θ from u is the same as using the
regression (conditional expectation) of θ on u. Denote the posterior mean by θ̄.
Suppose we have a population of individuals, and for each individual we calculate
the posterior mean θ̄. Let σ_θ̄ denote the standard deviation of these posterior
means. By a standard definition, the correlation ratio of θ on u is

η_θu ≡ σ_θ̄ / σ_θ,

where σ²_θ is the prior or marginal variance of θ. Since θ̄ correlates only imperfectly
with θ, η_θu < 1 and σ_θ̄ < σ_θ; thus in any group the distribution of the
estimates θ̄ will have a smaller variance than does the distribution of true ability
θ. The θ̄ exhibit regression toward the mean.
If we define θ* ≡ θ̄ / η_θu, then σ_θ* = σ_θ: the estimates have the same variance
as the true values. But θ* is not a type of estimate usually used by Bayesian
statisticians. The estimate θ̄ minimizes the mean square error of estimation over
all examinees, but, as we have seen, it does not have the same variance as θ; the
estimate θ* has the same variance as θ, but it is a worse estimate in terms of
mean square error.
Is θ̄ or θ* unbiased for a particular person? If an estimator t of θ is unbiased for
every individual in a population of individuals, then the error t − θ is uncorrelated
with θ, so that in the population σ²_t = σ²_θ + σ²_{t−θ} > σ²_θ. Thus the unbiased
estimates have a larger variance than the true ability parameters. They also
have a larger variance than θ̄ or θ*. Thus neither θ̄ nor θ* is unbiased.
The foregoing problems are simply a manifestation of the basic fact that the
properties of estimates are never exactly the same as the properties of the true
values.
The mode of the posterior distribution of θ is called the Bayesian modal
estimator. If the posterior distribution is unimodal and symmetric, then the
Bayesian modal estimator will be the same as the posterior mean θ̄, whose
properties as an estimator have already been discussed.
If g(θ) is uniform, then [see Eq. (12-5)] the posterior distribution of θ is
proportional to the likelihood function (4-20). Thus when g(θ) is uniform, the
maximum likelihood estimator θ that maximizes (4-20) is the same as the Bayes
ian modal estimator that maximizes (12-5). Since any bell-shaped g(θ) is surely
nearer the truth than a uniform distribution of θ, it has been argued, the Bayesian
modal estimator (BME) computed from a suitable bell-shaped prior, g(θ), must
surely be better than the maximum likelihood estimator (MLE), which (it is
asserted) assumes a uniform prior.
188 12. ESTIMATING ABILITY AND ITEM PARAMETERS
The trouble with this argument is that it tacitly assumes the conclusion to be
proved. If the BME were a faultless estimation procedure, then this line of
reasoning would show that the MLE is inferior whenever g(θ) is not uniform. On
the other hand, if the BME is less than perfect as an estimator, then the MLE
cannot be criticized on the grounds that under an implausible assumption (uniform
distribution of θ) it happens to coincide with the BME.
The MLE is invariant under any continuous one-to-one transformation of the
parameter. The same likelihood equations will result whether we estimate θ or
θ* ≡ K e^{kθ}, as in Eq. (6-2), or ξ ≡ Σ_{i=1}^{n} P_i(θ). Thus if θ̂ is the MLE of θ, the
MLE of θ* will be K e^{kθ̂} and the MLE of ξ will be Σ_{i=1}^{n} P_i(θ̂).
If the cited argument proved that the MLE assumes a uniform distribution of
θ, then the same argument would prove that the MLE assumes a uniform
distribution of θ* and also of ξ. This is self-contradictory, since if any one of these is
uniformly distributed, the others cannot be.
The absurd conclusion stems from the fact that BME is not invariant. It yields
a substantively different estimate depending on whether we estimate θ, θ*, or ξ.
The proof of this statement follows.
In simplified notation, the posterior distribution is proportional to L(u | Θ = θ) g(θ).
The BME is the mode of the posterior distribution. If θ* = θ*(θ) rather
than θ is the parameter of interest, then the prior distribution of θ* is

g*(θ*) = g(θ) dθ/dθ*.

The posterior distribution of θ* is thus proportional to

L(u | Θ* = θ*) g*(θ*) ≡ L(u | Θ = θ) g(θ) dθ/dθ*.
The posterior for θ* differs from the posterior for θ by the factor dθ/dθ*. Thus
the two posterior distributions will in general have different modes and will
therefore yield different BME's.
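The non-invariance is easy to exhibit numerically. In the sketch below (hypothetical items and a standard normal prior, neither taken from the text) the posterior mode is found once for θ and once for the true score ξ = Σ P_i(θ); the ξ value implied by the θ mode differs from the directly computed ξ mode because of the factor dθ/dξ.

```python
import math

D = 1.7

def p3(theta, a, b, c):
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

def dp3(theta, a, b, c):
    P = p3(theta, a, b, c)
    return D * a * (P - c) * (1 - P) / (1 - c)   # dP/d(theta)

items = [(1.0, -1.0, 0.2), (0.9, 0.0, 0.2), (1.1, 0.5, 0.2), (0.8, 1.0, 0.2)]
u = [1, 1, 0, 0]

grid = [-6.0 + k * 0.01 for k in range(1201)]
post_theta, post_xi, xi_vals = [], [], []
for th in grid:
    like = 1.0
    for ui, (a, b, c) in zip(u, items):
        P = p3(th, a, b, c)
        like *= P if ui else (1 - P)
    g = math.exp(-0.5 * th * th)                 # prior density for theta
    dxidth = sum(dp3(th, a, b, c) for a, b, c in items)
    post_theta.append(like * g)                  # posterior density in theta
    post_xi.append(like * g / dxidth)            # posterior density in xi (Jacobian)
    xi_vals.append(sum(p3(th, a, b, c) for a, b, c in items))

i_theta = max(range(len(grid)), key=post_theta.__getitem__)
i_xi = max(range(len(grid)), key=post_xi.__getitem__)
print("xi at the theta-mode :", round(xi_vals[i_theta], 4))
print("mode of xi itself    :", round(xi_vals[i_xi], 4))   # generally different
```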
As stated at the beginning, the purpose of this section is not to fault the BME
but to point out a fallacy in a plausible line of reasoning, which superficially
appears to show that the MLE assumes a uniform distribution of the parameter
estimated. As pointed out at the end of Section 12.8, no point estimator, whether
Bayesian or non-Bayesian, has all possible desirable properties. If g(θ) is known
approximately, then any inference about θ may properly be based on the pos
terior distribution of θ given u (but not necessarily on the mode of this distribu
tion). The MLE, on the other hand, is of interest in situations where we cannot or
do not wish to restrict our attention to any particular g(θ).
It is often argued in other applications of Bayesian methods that the choice of
prior distribution, here g(θ), does not matter much when the number n of
observations is large. This fact is not helpful here, since n here is the number of
items. Mental test theory exists only because the observed score on n items
differs nonnegligibly from the true score that would be found if n were infinite.
It is likely that new and better methods will be found for estimating both item
parameters and ability parameters. Illustrative data in this book have mostly been
obtained by certain modified maximum likelihood methods (Wood & Lord,
1976; Wood, Wingersky, & Lord, 1976). It is not the purpose of this book to
recommend any particular estimation method, however, since such a recommendation
is likely to become quickly out of date. The practical applications
outlined in these chapters are useful regardless of whatever effective estimation
method is used.
Anderson (1978) and Maurelli (1978) report studies comparing maximum
likelihood estimates with Bayesian estimates. The interested reader is referred to
Urry (1977) for a description of an alternative estimation procedure.
12.12. THE RASCH MODEL

Rasch's item response theory (Rasch 1966a, 1966b; Wright, 1977) assumes that
all items are equally discriminating and that items cannot be answered correctly
by guessing. The Rasch model is the special case of the three-parameter logistic
model arising when c_i = 0 and a_i = a for all items. If the Rasch assumptions are
satisfied for some set of data, then sufficient statistics (Section 4.12) are available
for estimating both item difficulty and examinee ability. If, as is usually the
case, however, the Rasch assumptions are not met, then use of the Rasch model
does not provide estimators with optimal properties. This last statement seems
obvious, but it is often forgotten.
In any comparison of results from use of the Rasch model with results from
the use of the three-parameter logistic model, it is important to remember the
following. If the Rasch model holds, we are comparing the results of two statistical
estimation procedures; we are not comparing two different models, since the
Rasch model is included in the three-parameter model. If the Rasch model does
not hold, then its use must be justified in some way. If sample size is small, for
example, Rasch estimates may be more accurate than three-parameter-model
estimates, even when the latter model holds and the Rasch model does not.
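A small numerical illustration of the sufficient-statistic property mentioned above (difficulties are hypothetical): under the Rasch model, two response patterns with the same number-right score have a likelihood ratio that does not depend on θ, so they lead to the same θ̂; under the three-parameter model they generally do not.

```python
import math

def rasch(theta, b):
    """Rasch item response function: the 3PL with c = 0 and a common a (here Da = 1)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def likelihood(theta, u, bs):
    L = 1.0
    for ui, b in zip(u, bs):
        P = rasch(theta, b)
        L *= P if ui else (1 - P)
    return L

bs = [-1.0, 0.0, 1.0]
u1, u2 = [1, 1, 0], [0, 1, 1]          # same number-right score of 2
for theta in (-1.0, 0.0, 1.0, 2.0):
    ratio = likelihood(theta, u1, bs) / likelihood(theta, u2, bs)
    print(theta, round(ratio, 6))       # constant in theta: exp(b3 - b1) here
```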
12.13. EXERCISES
APPENDIX
Listed here for convenient reference are formulas for I_qr (q, r = a, b, c) for the
three-parameter logistic function. These formulas are used in the modified
Newton-Raphson iterations (Section 12.2) and for computing sampling variances
of maximum likelihood estimators (Section 12.3). Here I_qr ≡ Σ_a (∂P_ia/∂q)(∂P_ia/∂r) / (P_ia Q_ia),
the sum running over examinees a = 1, 2, ..., N:

I_aa = [D² / (1 − c_i)²] Σ_{a=1}^{N} (θ_a − b_i)² (P_ia − c_i)² Q_ia / P_ia,   (12-8)

I_bb = [D² a_i² / (1 − c_i)²] Σ_{a=1}^{N} (P_ia − c_i)² Q_ia / P_ia,   (12-9)

I_cc = [1 / (1 − c_i)²] Σ_{a=1}^{N} Q_ia / P_ia,   (12-10)

I_ab = −[D² a_i / (1 − c_i)²] Σ_{a=1}^{N} (θ_a − b_i)(P_ia − c_i)² Q_ia / P_ia,   (12-11)

I_ac = [D / (1 − c_i)²] Σ_{a=1}^{N} (θ_a − b_i)(P_ia − c_i) Q_ia / P_ia,   (12-12)

I_bc = −[D a_i / (1 − c_i)²] Σ_{a=1}^{N} (P_ia − c_i) Q_ia / P_ia.   (12-13)
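A sketch of how these formulas are used in practice, with hypothetical ability values and item parameters: the 3 × 3 information matrix for one item is built by summing, over examinees, the outer product of the derivatives of P_ia with respect to (a_i, b_i, c_i) divided by P_ia Q_ia, and its inverse approximates the sampling variance-covariance matrix of (â_i, b̂_i, ĉ_i).

```python
import math
import numpy as np

D = 1.7

def p3(theta, a, b, c):
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

def item_information(thetas, a, b, c):
    """3x3 information matrix for one item, Eqs. (12-8)-(12-13)."""
    info = np.zeros((3, 3))
    for th in thetas:
        P = p3(th, a, b, c)
        Q = 1 - P
        dP = np.array([
            D * (th - b) * (P - c) * Q / (1 - c),   # dP/da
            -D * a * (P - c) * Q / (1 - c),         # dP/db
            Q / (1 - c),                            # dP/dc
        ])
        info += np.outer(dP, dP) / (P * Q)
    return info

# Hypothetical sample of abilities and hypothetical item parameters.
rng = np.random.default_rng(0)
thetas = rng.standard_normal(2000)
I = item_information(thetas, a=1.0, b=0.0, c=0.2)
cov = np.linalg.inv(I)        # approximate sampling variances and covariances
print(np.sqrt(np.diag(cov)))  # SE(a-hat), SE(b-hat), SE(c-hat)
```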
REFERENCES

13 Equating
Consider a situation where many people are being selected to do typing. All
applicants have been tested before applying for employment. Some come with
test record x showing typing speed in words per second; others come with test
record y showing typing speed in seconds per word. We assume for this section
that all typing tests are perfectly reliable ("infallible").
There is a one-to-one correspondence between the two measures x and y of
typing speed. The hiring official will undoubtedly wish to express all typing
speeds in the same terms for easy comparison of applicants. Perhaps he will
replace all y scores by their reciprocals. In general, we denote such a transformation
by x_y ≡ x(y). In the illustration, x(y) = 1/y. Clearly x and x_y are comparable
values.
[Note on notation: In practical test work, a new test is commonly equated to
an old test. It seems natural to call the old test x and the new test y. The function
x_y ≡ x(y) may seem awkward, since we habitually think of y as a function of x.
An alternative notation would write y*(y) instead of x(y); this would fail to
emphasize the key fact that x_y ≡ x(y) is on the same scale as x.]
In mental testing, we may believe that two tests, x and y, measure the same
trait without knowing the mathematical relation between the score scale for x and
the score scale for y. Suppose the hiring officer knew that both x and y were
perfectly reliable measures of typing speed but did not know how each was
expressed. Could he without further testing find the mathematical relation x(y)
between x and y, so as to use them for job selection? If many applicants have
both x and y, it is easy to find x(y) approximately. But suppose that the hiring
officer never obtains both x and y on the same individual.
The true situation, not known to the hiring officer, is illustrated schematically
in Fig. 13.1.1. The points falling along the curve represent corresponding values
of x and y for various individuals. The frequency distributions of x and y for the
combined population (of all applicants regardless of test taken) are indicated
along the two axes (the y distribution is shown upside down).
It should be clear from Fig. 13.1.1 that when y and x have a one-to-one
monotonic relation, any cutting score Y₀ on y implies a cutting score X₀ = x(Y₀)
on x. Moreover, the people who lie to the right of Y₀ are the same people as the
people who lie below X₀. Thus the percentile rank of Y₀, counted from right to
left in the distribution of y, is the same as the percentile rank of X₀ counting
upward in the distribution of x. The cutting scores X₀ and Y₀ are said to have an
equipercentile relationship.
If G(y) denotes the cumulative distribution of y cumulated from right to left
and F(x) denotes the cumulative distribution of x cumulated from low to high,
then F(X0) = G(Y0) for any pair of corresponding cutting points (Y0,X0). Since
F is monotonic, we can solve this equation for X0, obtaining X0 = F-1 [G(Y 0 )],
where F-1 is the inverse function of F. Thus, the transformation x(y) is given by
x_y ≡ x(y) = F⁻¹[G(y)],   (13-1)

or, equivalently,

F(x_y) = p,
G(y) = p,   (13-2)

where p is the percentile rank of both y and x_y. This last pair of equations is a
direct statement of the equipercentile relationship of x_y to y.

[FIG. 13.1.1. The curve of corresponding values (Y₀, X₀), with x or x(y) on the vertical axis and y on the horizontal axis; the cumulative proportions F(X₀) and G(Y₀) are marked on the two distributions.]
In the present application where y = 1/x, y decreases as x increases. In most
typical applications, x and y increase together, in which case we define G(y) in
the usual way, cumulated from left to right. If x and y increase together, (13-2)
still applies, using the appropriate definition of G(y).
Suppose, now, that the hiring officer in our illustration knows that applicants
with x are a random sample from the same population as applicants with y. This
information will allow him to estimate the mathematical relation of x to y. His
sample cumulative distribution of y values is an estimate of G(y); his sample
distribution of x values is an estimate of F(x). He can therefore estimate the
relationship x(y) from (13-1) or (13-2).
When this has been done, the transformed or "equated" score x_y is on the x
scale of measurement, and test y is said to have been equated to test x. To
summarize, if (1) x(y) is a one-to-one function of y and (2) the X group has the
same distribution of ability as the Y group, then (3) equipercentile equating will
transform y to the same scale of measurement as x; and (4) the transformed y,
denoted by x_y ≡ x(y), will have the same frequency distribution as x. Two
perfectly reliable tests measuring the same trait can be equated by administering
them to equivalent populations of examinees and carrying out an equipercentile
equating by (13-2). The equating will be the same no matter what the distribution
of ability in the two equivalent populations.
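A sample-based sketch of equipercentile equating for the common case in which x and y increase together (the data and function name are illustrative only): estimate G from the y sample, F from the x sample, and set x(y) = F⁻¹[G(y)].

```python
import numpy as np

def equipercentile(y_value, x_sample, y_sample):
    """Equate a y score to the x scale: x(y) = F^{-1}[G(y)], Eq. (13-1)."""
    p = np.mean(y_sample <= y_value)              # sample estimate of G(y)
    return np.quantile(x_sample, p)               # sample estimate of F^{-1}(p)

# Two equivalent groups measuring the same trait on different scales
# (here y is simply a rescaled version of x, for illustration only).
rng = np.random.default_rng(1)
trait = rng.normal(50, 10, size=4000)
x_sample = trait[:2000]                           # group taking test x
y_sample = 2.0 * trait[2000:] + 30                # group taking test y

for y0 in (90, 110, 130, 150):
    print(y0, round(float(equipercentile(y0, x_sample, y_sample)), 2))
# The equated values should fall near (y0 - 30) / 2, the true relation here.
```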
13.2. EQUITY
From Eq. (4-1), the conditional frequency distribution f_{x|θ} of number-right test
score x is given by the identity

Σ_{x=0}^{n} f_{x|θ} t^x ≡ ∏_{i=1}^{n} (Q_i + P_i t),   (13-6)
where the symbol t serves only to determine the proper grouping of terms on the
right. For n = 3, for example, (13-6) becomes
f_{0|θ} + f_{1|θ} t + f_{2|θ} t² + f_{3|θ} t³ ≡ Q₁Q₂Q₃
+ (Q₁Q₂P₃ + Q₁P₂Q₃ + P₁Q₂Q₃) t + (Q₁P₂P₃ + P₁Q₂P₃
+ P₁P₂Q₃) t² + P₁P₂P₃ t³,
which holds for all values of t.
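The identity (13-6) also gives a convenient way to compute f_{x|θ} numerically: multiply out the polynomial ∏(Q_i + P_i t) one item at a time and read off the coefficients of t^x. A sketch, with illustrative P_i values:

```python
def score_distribution(probs):
    """Coefficients f[x] of t**x in prod_i (Q_i + P_i * t), x = 0, ..., n.
    This is the conditional distribution of number-right score at fixed theta."""
    f = [1.0]
    for P in probs:
        Q = 1.0 - P
        new = [0.0] * (len(f) + 1)
        for x, fx in enumerate(f):
            new[x] += fx * Q          # item answered wrong: score unchanged
            new[x + 1] += fx * P      # item answered right: score goes up by 1
        f = new
    return f

# n = 3 example of the text, with illustrative values for P1, P2, P3.
print(score_distribution([0.9, 0.6, 0.3]))
# The four coefficients are f(0|theta), ..., f(3|theta), and they sum to 1.
```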
13.3. CAN FALLIBLE TESTS BE EQUATED?

Consider a second test, y, consisting of m items with item response functions
p_j ≡ p_j(θ) and q_j ≡ 1 − p_j. By the same identity applied to test y, the conditional
distribution of the transformed score x(y) satisfies

Σ_{y=0}^{m} h_{x(y)|θ} t^y ≡ ∏_{j=1}^{m} (q_j + p_j t).   (13-7)
Equity, however, requires that the distribution of the function x(y) be the
same as that of x. Substituting x(y) for x in (13-6), we have
Σ_{x(y)=0}^{n} f_{x(y)|θ} t^{x(y)} ≡ ∏_{i=1}^{n} (Q_i + P_i t).   (13-8)

Since each f_{x(y)|θ} > 0, the h_{x(y)|θ} in (13-7) must be the same as the f_{x(y)|θ} in
(13-8). Thus m = n. From (13-3), (13-7), and (13-8),

∏_{i=1}^{n} (Q_i + P_i t) ≡ ∏_{j=1}^{n} (q_j + p_j t)   (13-9)

for all θ and for all t.
We now prove (under regularity conditions) that (13-9) will hold only if tests x
and y are strictly parallel. Since t in (13-9) is arbitrary, replace t by t + 1 to
obtain ∏_i (1 + P_i t) ≡ ∏_j (1 + p_j t). Taking logarithms, we have Σ_i ln(1 + P_i t)
≡ Σ_j ln(1 + p_j t). Expanding each logarithm in a power series, we have for t² <
1 and for all θ that

Σ_i (P_i t − ½P_i² t² + ⅓P_i³ t³ − ...) ≡ Σ_j (p_j t − ½p_j² t² + ⅓p_j³ t³ − ...).
(13-10)
After dividing by n, this may be rewritten

Σ_{r=1}^{∞} μ_r (−t)^r / r ≡ Σ_{r=1}^{∞} ν_r (−t)^r / r,   (13-11)

where (for any given θ) μ_r ≡ n⁻¹ Σ_i^n P_i^r is the rth (conditional) moment about the
origin of the P_i for the n items in test x, and ν_r is the rth (conditional) moment
about the origin of the p_j for the n items in test y. Because a convergent Taylor
series is unique, it follows that μ_r = ν_r. Since the distribution of a bounded
variable is determined by its moments [Kendall & Stuart, 1969, Section
4.22(c)], it follows under realistic regularity conditions¹ that for each item in test
x there is an item in test y with the same item response function P(θ), and vice
versa.
Since it contradicts common thinking and practice, it is worth stating this
result as a theorem:
Theorem 13.3.1. Under realistic regularity conditions, scores x and y on two
tests cannot be equated unless either (1) both scores are perfectly reliable or (2)
the two tests are strictly parallel [in which case x(y) ≡ y].
Since test users are frequently faced with a real practical need for equating tests
from different publishers, what can be done in the light of Theorem 13.3.1,
which states that such tests cannot be equated? A first reaction is typically to try
to use some prediction approach based on regression equations.
If we were to try to predict x from y, we would clearly be doing the wrong
thing. From the point of view of the examinee, x and y are symmetrically
related. A basic requirement of equating is that the result should be the same no
matter which test is called x and which is called y. This requirement is not
satisfied when we predict one test from the other.
Suppose we have some criterion, denoted by ω, such as grade-point average
or success on the job. Denote by R_x(ω|x) the value of ω predicted from x by the
usual (linear or nonlinear) regression equation and by R_y(ω|y) the value predicted
from y. A sophisticated regression approach will determine x(y) so that
R_x[ω|x(y)] = R_y(ω|y). For example, if ω is academic grade-point average, a person
scoring y on test y and a person scoring x on test x will be treated as equal
whenever their predicted grade-point averages are equal.
By Theorem 13.3.1, however, it is clear that such an x(y) will ordinarily not
satisfy the equity requirement. Other difficulties follow. We state here a general
conclusion reached in the appendix at the end of this chapter: Suppose x(y) is
defined by Rx [ω|x(y)] ≡ Ry(ω|y). The transformation x(y) found typically will
vary from group to group unless x and y are equally correlated with the criterion
ω. This is not satisfactory: An equating should hold for all subgroups of our total
group (for men, women, blacks, whites, math majors, etc.).
Suppose that x is an accurate predictor of ω and that y is not. Competent
[Footnote 1: The need for regularity conditions has been pointed out by Charles E. Davis. For example, let
test x consist of the two items illustrated in Fig. 3.4.1. Let θ⁺ denote the value of θ where these two
response functions cross. Let the first (second) item in test y have the same response function as item
1 (2) in test x up to θ⁺ and the same response function as item 2 (1) in test x above θ⁺. Test x can be
equated to this specially contrived test y. Since such situations are not realistic, the mathematical
regularity conditions required to eliminate them are not detailed here.]
examinees are severely penalized if they take test y: Their chance of selection
may be little better than under random selection. Regression methods may optimize
selection from the point of view of the selecting institution; they may not
yield a satisfactory solution to the equity problem from the point of view of the
applicants.
Equations (13-12) are parametric equations for the relation between η and ξ. A
single equation for the relationship is found (in principle) by eliminating θ from
the two parametric equations. In practice, this relationship can be estimated by
using estimated item parameters to approximate the Pi(θ) and Pj(θ) and then
substituting a series of arbitrary values of θ into (13-12) and computing ξ and η
for each θ. The resulting paired values define ξ as a function of η (or vice versa)
and constitute an equating of these true scores. Note that the relation between ξ
and η is mathematical and not statistical (not a scatterplot). Since ξ and η are
each monotonic increasing functions of θ, it follows that ξ is a monotonic
increasing function of η.
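A sketch of the computation just described, with hypothetical item parameters for the two tests: a grid of θ values is substituted into ξ(θ) = Σ_i P_i(θ) and η(θ) = Σ_j P_j(θ), and the paired values define the true-score equating.

```python
import math

D = 1.7

def p3(theta, a, b, c):
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

def true_score(theta, items):
    return sum(p3(theta, a, b, c) for a, b, c in items)

# Hypothetical item parameters (a, b, c) for test x and test y.
test_x = [(1.0, -1.0, 0.2), (1.1, 0.0, 0.2), (0.9, 1.0, 0.2), (1.2, 0.5, 0.2)]
test_y = [(0.8, -0.5, 0.2), (1.0, 0.3, 0.2), (1.3, 1.2, 0.2)]

pairs = []
theta = -4.0
while theta <= 4.0:
    pairs.append((true_score(theta, test_y), true_score(theta, test_x)))
    theta += 0.1

def xi_equivalent(eta_value):
    """True score on x equated to a given true score eta on y, by interpolation."""
    for (e0, x0), (e1, x1) in zip(pairs, pairs[1:]):
        if e0 <= eta_value <= e1:
            w = (eta_value - e0) / (e1 - e0)
            return x0 + w * (x1 - x0)
    raise ValueError("eta outside the chance-to-perfect range of test y")

print(round(xi_equivalent(2.0), 3))
```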
Figure 13.5.1 shows the estimated test characteristic curves of two calculus
tests: AP with 45 five-choice items and CLEP with 50 five-choice items. The
broken line in Fig. 13.5.1 graphically illustrates the meaning of true-score
equating. The broken line shows that a true score of 35 on CLEP is equivalent to a true
[FIG. 13.5.1. Estimated test characteristic curves for the AP and CLEP calculus tests: ability on the horizontal axis, true scores (0-45 for AP, 0-50 for CLEP) on the vertical axes.]
If test x and test y are both given to the same examinees, the test administered
second is not being given under typical conditions because of practice effects, for
example, fatigue. If, on the other hand, each of the two tests is given to a
different sample of examinees, the equating is impaired by differences between
the samples.
Differences between the two samples of examinees can be measured and
controlled by administering to each examinee an anchor test measuring the same
ability as x and y. When an anchor test is used, equating may be carried out even
when the x group and the y group are not at the same ability level. The anchor
test may be a part of both test x and test y; such an anchor test is called internal.
[Fig. 13.5.2 shows two curves, a true-score equating and an equipercentile equating, with AP score plotted against CLEP score.]
FIG. 13.5.2. Two estimates of the line of relationship between AP and CLEP.
If the anchor test is external to the two regular tests, the item parameters for
the anchor items are not used in the equating procedure of Section 13.5. They
have served their purpose by tying the data together so that all parameters are
expressed on the same scale. Without such anchor items, equating would be
impossible unless the x group and the y group had the same distribution of
ability.
The marginal distribution of test score x in a group with ability distribution γ(θ) is then

φ_x(x) = ∫_{−∞}^{∞} φ_x(x|θ) γ(θ) dθ.   (13-14)

The function φ_x(x|θ_a) is given by Eq. (4-1), using estimated item parameters for
test x items in place of their true values.
Similar equations apply for test y. Furthermore, since x and y are independently
distributed when θ is fixed, the joint distribution of scores x and y for the
specified group is estimated by
φ(x, y) = (1/N) Σ_{a=1}^{N} φ_x(x|θ̂_a) φ_y(y|θ̂_a),   (13-15)

or by

φ(x, y) = ∫_{−∞}^{∞} φ_x(x|θ) φ_y(y|θ) γ(θ) dθ.   (13-16)
Note that, thanks to the anchor test, it is possible to estimate the joint distribution
of x and y even though no examinee has taken both tests.
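A sketch of (13-15) with hypothetical item parameters and a hypothetical set of ability estimates: the conditional score distributions φ_x(x|θ̂_a) and φ_y(y|θ̂_a), computed as in Eq. (4-1), are multiplied and averaged over the group.

```python
import math
import numpy as np

D = 1.7

def p3(theta, a, b, c):
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

def score_dist(theta, items):
    """Conditional distribution of number-right score given theta, Eq. (4-1)."""
    f = np.array([1.0])
    for a, b, c in items:
        P = p3(theta, a, b, c)
        new = np.zeros(len(f) + 1)
        new[:-1] += f * (1 - P)
        new[1:] += f * P
        f = new
    return f

# Hypothetical tests and ability estimates for the combined anchored group.
test_x = [(1.0, -1.0, 0.2), (1.1, 0.0, 0.2), (0.9, 1.0, 0.2)]
test_y = [(0.8, -0.5, 0.2), (1.0, 0.3, 0.2), (1.3, 1.2, 0.2), (1.0, 0.8, 0.2)]
theta_hat = np.random.default_rng(2).standard_normal(500)

joint = np.zeros((len(test_x) + 1, len(test_y) + 1))
for th in theta_hat:
    joint += np.outer(score_dist(th, test_x), score_dist(th, test_y))
joint /= len(theta_hat)     # Eq. (13-15): estimated joint distribution of (x, y)

print(joint.sum())          # should be 1 (up to rounding)
print(joint.round(3))
```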
The integrand of (13-16) is the trivariate distribution of θ, x, and y. Since θ
determines the true scores ξ and η, this distribution also represents the joint
distribution of the four variables ξ, η, x, and y. The joint distribution contains
all possible information about the relation of x to y. Yet, by Section 13 3, it
cannot provide an adequate equating of x and y unless the two tests are already
parallel.
A plausible procedure is to determine the equipercentile relationship between
x and y from (13-15) or (13-16) and to treat this as an approximate equating. Is
this better than applying the true-score equating of Section 13.5 to observed
scores x and y or to estimated true scores ξ̂ = Σ_i P_i(θ̂) and η̂ = Σ_j P_j(θ̂)?
At present, we have no criterion for evaluating the degree of inadequacy of an
imperfect equating. Without such a criterion, the question cannot be answered.
At least the equipercentile equating of x and y covers the entire range of observed
scores, whereas the equating of ξ and η cannot provide any guide for scores
below the "chance" levels represented by Σ_i c_i.
The solid line of relationship in Fig. 13.5.2 is obtained by the methods of this
section. The result agrees very closely with the true-score equating of Section
13.6. Further comparisons of this kind need to be made before we can safely
generalize this conclusion.
Figure 13.5.2 is presented because it deals with a practical equating problem. For
just this reason, however, there is no satisfactory way to check on the accuracy of
the results obtained. The results obtained by conventional methods cannot be
justified as a criterion.
The following example was set up so as to have a proper criterion for the
equating results. Here, unknown to the computer procedure, test X and test Y are
actually the same test. Thus we know in advance what the line of relation should be.
[Footnote 2: This section is taken with special permission from F. M. Lord, Practical applications of item
characteristic curve theory. Journal of Educational Measurement, Summer 1977, 14, No. 2, 117-138.
Copyright 1977, National Council on Measurement in Education, Inc., East Lansing, Mich.]
Test X is the 85-item verbal section of the College Board SAT, Form XSA,
administered to a group of 2802 college applicants in a regular SAT administra-
tion. Test Y is the same test administered to a second group of 2763 applicants.
Both groups also took a 39-item verbal test mostly, but not entirely, similar to the
regular 85-item test. The 39-item test is used here as an anchor test for the
equating.
The two groups differed in ability level (otherwise the outcome of the equat-
ing would be a foregone conclusion). The proportion of correct answers given to
typical items is lower by roughly 0.10 in the first group than in the second.
The equating was carried out exactly as described in Section 13.6. One
computer run was made simultaneously for all 85 + 85 + 39 = 209 items and for
all 5565 examinees. The resulting line of relationship between true-scores on test
FIG. 13.8.1. Estimated equating (crosses) between "Test X" and "Test Y,"
which are actually identical.
X and test Y is shown by the crosses in Fig. 13.8.1. It agrees very well with the
45-degree line, also shown, that should be found when a test is equated to itself.
Results like this have been obtained for many sets of data. It is the repeated
finding of such results that encourages confidence in the item response function
model used.
13.9. PREEQUATING
Publishers who annually produce several forms of the same test have a continual
need for good equating methods. Conventional methods may properly require
1000 or more examinees for each test equated. If the test to be equated is a secure
test used for an important purpose, a special equating administration is likely to
impair its security. The reason is that coaching schools commonly secure detailed
advance information about the test questions from such equating administrations.
In preequating, by use of item response theory, each new test form is equated
to previous forms before it is administered. In preequating, a very large pool of
calibrated test items is maintained. New forms are built from this pool. The
method of Section 13.5 is used to place all true scores on the same score scale.
Scaled observed scores on the various test forms are then treated as if they were
interchangeable. A practical study of preequating is reported by Marco (1977).
Preequating eliminates special equating administrations for new test forms. It
requires instead special calibration administrations for new test items to be in-
cluded in the pool. Each final test form is drawn from so many calibration
administrations that its security is not seriously compromised.
Figure 13.9.1 shows a plan for a series of administrations for item calibration.
Rows in the table represent sets of items; columns represent groups of exam-
inees. An asterisk represents the administration of an item set to a particular
group. The column labeled n shows the number of items in each item set; the row
labeled n shows the number of items taken by each group. The row labeled N
shows the number of examinees in each group; the column labeled N shows the
number of examinees taking each item set.
A total of 399 items are to be precalibrated using a total of 15,000 examinees.
The total number of responses is approximately 938,000; thus an examinee takes
about 62 items on the average, and an item is administered to about 2350
examinees on the average. The table is organized to facilitate an understanding of
the adequacy of linkages tying the data together.
Item set F72 consists of previously calibrated items taken from the precali-
brated item pool. The remaining items are all new items to be calibrated. The item
parameters of the new items will be estimated from these data while the bi for the
20 precalibrated items are held fixed at their precalibrated values. This will place
all new item parameters on the same scale as the precalibrated item pool. All new
item and examinee parameters will be estimated simultaneously by maximum like-
lihood from the total data set consisting of all 938,000 responses (see Appendix).
[FIG. 13.9.1 layout: 14 item sets (rows) crossed with examinee groups X1-X23 (columns, total N = 15,000); set 7 consists of the 20 precalibrated F72 items administered to 9,000 examinees; the n and N margins described above account for 399 items and about 938,000 responses in all.]
FIG. 13.9.1. Item calibration plan.
Since practical pressures often require that tests be ''equated" at least approxi-
mately, the procedures suggested in Sections 13.5-13.7 may be used. What is
really needed is a criterion for evaluating approximate procedures, so as to be
able to choose from among them. If you can't be fair (provide equity) to
everyone, what is the next best thing?
There is a parallel here to the problem of determining an unbiased selection
procedure (Hunter & Schmidt, 1976; Thorndike, 1971). Some procedures are
fair from the point of view of the selecting institutions. Usually, however, no
procedure can be simultaneously fair, even from a statistical point of view, both
to the selecting institutions and to various subgroups of examinees.
In the present problem, the equating needs of a particular selecting institution
could be satisfied by regression methods (Section 13.4). If regression methods of
"equating" are used, however, examinees could properly complain that they had
been disadvantaged (denied admission to college, for example) because they had
taken test y instead of test x or test x instead of test y. It seems important to avoid
this.
An equipercentile "equating" of raw scores has the convenient property that
when a cutting score is used, the proportion of selected examinees will be the
same for those taking test x and for those taking test y, except for sampling
fluctuations. This will be true regardless of where the institution sets its cutting
score. Thus equipercentile "equating" of raw scores gives an appearance of
being fair to everyone.
Most practical equatings are carried out between "parallel" test forms. In
such cases, forms x and y are so nearly alike that equipercentile equating, or
even conventional mean-and-sigma equating, should yield excellent results. This
chapter does not discourage such practical procedures. This chapter tries to
clarify the implications of equating as a concept. Such clarification is especially
important for any practical equating of tests from two different publishers or of
tests at two different educational levels.
The reader is referred to Angoff (1971) for a detailed exposition of conven-
tional equating methods. Woods and Wiley (1978) give a detailed account of
their application of item response theory to a complicated practical equating
problem involving the equating of 60 different reading tests, using available data
from 31 states and the District of Columbia.
13.11. EXERCISES
13-1 The test characteristic function for test 1 was computed in Exercise 5.9.1.
Compute for θ = −3, −2, −1, 0, 1, 2, 3 the test characteristic function of
a test composed of n = 3 items just like the items in Table 4.17.2. From
these two test characteristic functions, determine equated true scores for
these two tests. Plot seven points on the equating function x(y) and
connect by a smooth curve.
13-2 Suppose that test x is a perfectly reliable test with scores x ≡ T. Suppose
test y is a poorly reliable test with scores y ≡ T + E, where E is a random
error of measurement, as in Section 1.2. Make a diagram showing a
scatterplot for x and y and also the regressions of y on x and of x on y.
Discuss various functions x(y) that might be used to try to "equate"
scores on test y to scores on test x.
APPENDIX
Suppose x(y) is defined in the total group by

R_x[ω|x(y)] ≡ R_y(ω|y),   (13-17)

and consider a subgroup explicitly selected on the criterion ω. Under the usual selection formulas,

ρ'²_xω = 1 / {1 + (σ²_ω/σ'²_ω)[(1 − ρ²_xω)/ρ²_xω]},   (13-18)

where the prime denotes a statistic for the selected group. For any group

β_xω β_ωx = ρ²_xω.   (13-19)

From (13-18) and (13-19) we have

β'_xω β'_ωx = 1 / {1 − (σ²_ω/σ'²_ω)[1 − (1/ρ²_xω)]}   (13-20)

and similarly for y

β'_yω β'_ωy = 1 / {1 − (σ²_ω/σ'²_ω)[1 − (1/ρ²_yω)]}.   (13-21)
If the equating (13-17) is to hold for the selected group, we must have R'_x[ω|x(y)]
≡ R'_y(ω|y) and consequently β'_ωx = β'_ωy. Dividing (13-21) by (13-20) to eliminate
β'_ωy = β'_ωx, we have

β'_yω / β'_xω = {σ'²_ω − σ²_ω[1 − (1/ρ²_xω)]} / {σ'²_ω − σ²_ω[1 − (1/ρ²_yω)]}.   (13-22)

We assume, as is usual, that the (linear) regressions on ω are the same before and
after selection: β'_xω = β_xω and β'_yω = β_yω. Thus, finally

β_yω / β_xω = {σ'²_ω − σ²_ω[1 − (1/ρ²_xω)]} / {σ'²_ω − σ²_ω[1 − (1/ρ²_yω)]}.   (13-23)
Consider what happens when σ'_ω varies from group to group. All unprimed
statistics in (13-23) refer to a fixed group and do not vary. The ratio on the left
stays the same, but the ratio on the right can stay the same only if ρ_yω = ρ_xω.
This is an illustration of a more general conclusion:
Suppose x(y) is defined by Rx[ω|x(y)] ≡ Ry(ω|y). The transformation x(y)
that is found will typically vary from group to group unless x and y are equally
correlated with the criterion ω.
If successive iterates X_{t−2}, X_{t−1}, X_t approach their limit X_∞ approximately
geometrically, with common ratio r, then X_t − X_∞ = r(X_{t−1} − X_∞), so that

X_∞ = (X_t − r X_{t−1}) / (1 − r).   (13-26)

The rate r can thus be found from

r = (X_t − X_{t−1}) / (X_{t−1} − X_{t−2}).   (13-27)
In practice, r is computed from (13-27), using the results of three successive
iterations. Then (13-26) provides an extrapolated approximation to the maximum
likelihood estimate X_∞.
In the situation illustrated by Fig. 13.9.1, the b_i for all items in each subtest
may be averaged and this average substituted for X in (13-27) to find the rate r.
The same rate r may then be used in (13-26) separately for each item to
approximate the maximum likelihood estimator b̂_i.
These b̂_i for all items are then held fixed and the θ̂_a are estimated iteratively
for all individuals. The θ̂_a are then held fixed while reestimating all item
parameters by ordinary estimation methods. Additional applications of (13-26) and
(13-27) may be carried out after further iterations that provide new values of
Xt-2, Xt-1, and Xt. One application of (13-26) and (13-27), however, will often
sufficiently accelerate convergence.
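A minimal sketch of the extrapolation (the iterate values below are artificial): r comes from (13-27) and the extrapolated limit from (13-26).

```python
def extrapolate(x_prev2, x_prev1, x_curr):
    """Geometric extrapolation of slowly converging iterates, Eqs. (13-26)-(13-27)."""
    r = (x_curr - x_prev1) / (x_prev1 - x_prev2)       # Eq. (13-27)
    return (x_curr - r * x_prev1) / (1.0 - r)          # Eq. (13-26)

# Artificial iterates approaching 1.0 geometrically with rate r = 0.8:
# X_t = 1 - 0.8**t, so three successive iterates recover the limit exactly.
x = [1 - 0.8**t for t in (5, 6, 7)]
print(extrapolate(*x))        # 1.0
```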
For raw scores on test y below the chance level Σ_j c_j, a linear relation that matches
the chance-level means and standard deviations may be used:

x(y) = [√(Σ_i c_i(1 − c_i)) / √(Σ_j c_j(1 − c_j))] (y − Σ_j c_j) + Σ_i c_i.   (13-28)

We use (13-28) for test y scores below Σ_j c_j; we use true-score equating
(13-12) above Σ_j c_j. The equating relationship so defined is continuous: When y
= Σ_j c_j, we find that x(y) = Σ_i c_i whether we use the true-score equating curve
of (13-12) or the raw-score "equating" line of (13-28). We cannot defend
(13-28) as uniquely correct, but it is a good practical solution to an awkward
problem.
REFERENCES

14 Study of Item Bias

14.1. INTRODUCTION
other group. This situation is the simplest and most commonly considered case of
item bias.
If the item response functions for the two groups cross, as is frequently found
in practice, the bias is more complicated. Such an item is clearly biased for and
against certain subgroups.
It seems clear from all this that item response theory is basic to the study of
item bias. Mellenbergh (1972) reports an unsuccessful early study of this type,
using the Rasch model. A recent report, comparing item response theory and
other methods of study, is given by Ironson (1978). Before applying item re-
sponse theory here, let us first consider a conventional approach in current use.
14.2. A CONVENTIONAL APPROACH

For illustrative purposes, we shall compare the responses of about 2250 whites
with the responses of about 2250 blacks on the 85-item Verbal section of the
April 1975 College Board SAT.2 Each group is about 44% male. All items are
five-choice items.
For each item, Fig. 14.2.1 plots pi, the proportion of correct answers, for
blacks against Pi for whites. Items (crosses) falling along the diagonal (dashed)
line in the figure are items that are as easy for blacks as for whites. Items below
this line are easier for whites. The solid oblique line is a straight line fitted to the
scatter of points. The solid line differs from the diagonal line because whites
score higher on the test than blacks. If all the items fell directly on the solid line,
we could say that the items are all equally biased or, conceivably, equally
unbiased.
It has been customary to look at the scatter of items about the solid line and to
pick out the items lying relatively far from the line and consider them as atypical
and undesirable. In the middle of Fig. 14.2.1 there is one item lying far below the
line that appears to be strongly biased in favor of whites and also another item far
above the line that favors blacks much more than other items. A common judg-
ment would be that both of these items should be removed from the test.
In Fig. 14.2.1 the standard error of a single proportion is about .01, or less.
Thus, most of the scattering of points is not attributable to sampling fluctuations.
Unfortunately, the failure to fall along a straight line is not necessarily attributa-
ble to differences among items in bias. This statement is true for several different
reasons, discussed below.
[Footnote 1: Most of this section is taken, by permission, from F. M. Lord, A study of item bias, using item
characteristic curve theory. In Y. H. Poortinga (Ed.), Basic problems in cross-cultural psychology.
Amsterdam: Swets and Zeitlinger, 1977, pp. 19-29.]
[Footnote 2: Thanks are due to Gary Marco and to the College Entrance Examination Board for permission to
present some of their data and results here.]
FIG. 14.2.1. Proportion of right answers to 85 items, for blacks and for whites.
In the first place, we should expect the scatter in Fig. 14.2.1 to fall along a
curved and not a straight line. If an item is easy enough, everyone will get it
right, and the item will fall at (1, 1). If the item is hard enough, everyone will
perform at some "chance" level c, so the item will fall at (c, c). Logically, the
items must fall along some line passing through the points (c, c) and (1, 1). If the
groups performed equally well on the test, the points could fall along the
diagonal line. But since one group performs better than the other, most of the
points must lie to one side of the diagonal line, so the relationship must be
curved.
Careful studies attempt to avoid this curvature by transforming the propor-
tions. If an analysis of variance is to be done, the conventional transformation is
the arcsine transformation. The real purpose of the arcsine transformation is to
equalize sampling variance. Whatever effect it may have in straightening the line
of relationship is purely incidental.
[Fig. 14.2.2: item difficulty indices for blacks plotted against whites.]
In practice, items differ from each other in discriminating power (ρ'i or,
equivalently, ai). Use of (14-2) may make items of the same discriminating
power lie along the same straight line; but items of a different discriminating
power will then lie along a different straight line. The more discriminating items
will show more difference between blacks and whites than do the less
discriminating items; thus use of (14-2) cannot make all items fall along a single
straight line. All this is a reflection of the fact, noted earlier (Section 3.4), that π_i
is really not a proper measure of item difficulty. Thus the π_i, however
transformed, are not really suitable for studying item bias.
Suppose we plan to study item bias with respect to several groups of examinees.
A possible practical procedure is as follows:
Standardizing on the bi means that the scale is chosen so that the mean of the
bi is 0 and the standard deviation is 1.0 (see Section 3.5). Except for sampling
fluctuations, this automatically places all parameters for all groups on the same
scale. If the usual method of standardizing on θ were used, the item parameters
for each group would be on a different scale.
Before standardizing on the bi, it would be best to look at all bi values and
exclude very easy and very difficult items both from the mean and the standard
deviation. Items with low ai should also be omitted. The reason in both cases is
that the bi for such items have large sampling errors. Such items are omitted only
from the mean and standard deviation used for standardization; they are treated
like other items for all other purposes.
Following the outlined procedure, a given item response function will be
compared across groups on âi and bi only. We are acting as if a given item has
the same ci in all groups. The reason for doing this is that many ĉ's are so
indeterminate (see Chapter 12) that they are simply set at a typical or average
value; this makes tests of statistical significance among ĉi impossible in many or
most cases. If there are differences among groups in ci, they cannot be found by
the recommended procedure; however, this should not prevent us from observing
differences in ai and bi The null hypothesis states that ai bi, and ci do not
vary across groups. If the recommended procedure discovers significant dif-
ferences, it is clear that the null hypothesis must be rejected.
14.4. COMPARING ITEM RESPONSE FUNCTIONS

Figure 14.4.1 compares estimated item response functions for an antonym item.
The data are the same as for Fig. 14.2.1 and 14.2.2. The top and bottom 5% of
individuals in each group are indicated by individual dots, except that the lowest
5% of the black group fall outside the limits of the figure. Clearly, this item is
much more discriminating among whites than it is among blacks.
Figure 14.4.2 shows an item on which blacks as a whole do worse than
whites; nevertheless at every ability level blacks do better than whites! Such
results are possible because there are more whites than blacks at high values of θ
and more blacks than whites at low values of θ. The item is a reading
comprehension item from the SAT. This is the only item out of 85 for which the item
response function of blacks is consistently so far above that of whites. The reason
for this result will be suggested by the following excerpts from the reading
passage on which the item is based:
FIG. 14.4.1. Black (dashed) and white (solid) item response curves for item 8.
(From F. M. Lord, Test theory and the public interest. In Proceedings of the 1976
ETS Invitational Conference—Testing and the Public Interest. Princeton, N.J.:
Educational Testing Service, 1977.)
14.4. COMPARING ITEM RESPONSE FUNCTIONS 219
[Figure 14.4.2 plots the probability of a correct answer (0 to 1.0) against ability (−4 to +3) for the two groups.]
FIG. 14.4.2. Item response curves for item 59. (From F. M. Lord, Test theory
and the public interest. Proceedings of the 1976 ETS Invitational Conference—
Testing and the Public Interest. Princeton, N.J.: Educational Testing Ser-
vice, 1977.)
American blacks have been rebelling in various ways against their status since
1619. Countless Africans committed suicide on the passage to America. . . . From
1955 to the present, the black revolt has constituted a true social movement.
It is often difficult or impossible to judge from figures like the two shown
whether differences between two response functions may be due entirely to
sampling fluctuations. A statistical significance test is very desirable. An obvious
procedure is to compare, for a given item, the difference between the black and
the white b̂i with its standard error

SE(b̂i1 − b̂i2) = √(Var b̂i1 + Var b̂i2),   (14-7)

where Var b̂i = 1 / 𝓔[(∂ ln L/∂bi)²],
and similarly for âi. As shown for Eq. (5-5), we can carry out the differentiation
and expectation operations. In the case of the three-parameter logistic function,
after substituting estimated parameters for their unknown true values, we obtain
Var b̂i = [ (D²âi² / (1 − ĉi)²) Σ_{a=1}^{N} (Pia − ĉi)² Qia/Pia ]⁻¹,   (14-8)

Var âi = [ (D² / (1 − ĉi)²) Σ_{a=1}^{N} (θ̂a − b̂i)² (Pia − ĉi)² Qia/Pia ]⁻¹.   (14-9)
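A small numerical sketch of (14-8) and (14-9) follows. It is not part of the original text; the function and variable names are invented for illustration, and Pia is taken to be the three-parameter logistic value at the estimated parameters.

```python
import numpy as np

D = 1.7  # scaling constant of the logistic model

def sampling_variances(theta, a_hat, b_hat, c_hat):
    """Approximate Var(b_hat) and Var(a_hat) for one item, Eqs. (14-8), (14-9)."""
    theta = np.asarray(theta, dtype=float)
    P = c_hat + (1.0 - c_hat) / (1.0 + np.exp(-D * a_hat * (theta - b_hat)))
    Q = 1.0 - P
    w = (P - c_hat) ** 2 * Q / P               # common factor in both sums
    k = D ** 2 / (1.0 - c_hat) ** 2
    var_b = 1.0 / (k * a_hat ** 2 * np.sum(w))
    var_a = 1.0 / (k * np.sum((theta - b_hat) ** 2 * w))
    return var_b, var_a
```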
If many of the items are found to be seriously biased, it appears that the items are
not strictly unidimensional: The θ obtained for blacks, for example, is not strictly
comparable to the θ obtained for whites. This casts some doubt on the results
obtained when all items are analyzed together. A solution (suggested by Gary
Marco) is
TABLE 14.5.1
Approximate Significance Test for the Hypothesis That Blacks and Whites Have Identical Item Response Functions
Item No. | âi (Whites) | âi (Blacks) | b̂i (Whites) | b̂i (Blacks) | Chi Square | Significance Level

Table 14.5.1 gives final results for the first 15 verbal items for the data described in Section 14.2.
Does the SAT measure the same psychological trait for blacks as for whites? If it measured totally different traits for blacks and for whites, Fig. 14.2.2 would show little or no relationship between the item difficulty indices for the two groups. In view of this, the study shows that the test does measure approximately the same skill for blacks and whites.
The item characteristic curve techniques used here can pick out certain atypical items that should be cut out from the test. It is to be hoped that careful study will help us understand better why certain items are biased, why certain groups of people respond differently from others on certain items, and what can be done about this.
The significance test used in Section 14.5 has been questioned on the grounds that if some items are biased, unidimensionality and local independence are violated; hence the item parameter estimates are not valid. This objection is not compelling if the test has been purified. Statistical tests of a null hypothesis (in this case the hypothesis of no bias) are typically made by assuming that the null hypothesis holds and then looking to see if the data fit this assumption. If not, then the null hypothesis is rejected.

³The remainder of this section is taken, by permission, from F. M. Lord, A study of item bias, using item characteristic curve theory. In Y. H. Poortinga (Ed.), Basic problems in cross-cultural psychology. Amsterdam: Swets and Zeitlinger, 1977, pp. 19-29.
The statistical significance tests used are open to several other criticisms,
however:
In view of these difficulties, it is wise to make some check on the adequacy of the
statistical method, such as that described below.
For the data discussed in this chapter, an empirical check was carried out. All
4500 examinees, regardless of color, were divided at random into two groups,
"reds" and "blues." The entire item bias study was repeated step by step for
these two new groups.
TABLE 14.6.1
Distribution of Significance Levels Testing the Difference Between the Item Response Functions for Two Randomly Selected Groups of Subjects*
Significance Level | No. of Items
Table 14.6.1 shows the 85 significance levels obtained for the 85 SAT Verbal items. Since the groups were random groups, the significance levels should be approximately rectangularly distributed from 0 to 1, with about 8½ items for each interval of width .10. The actual results are very close to this.
Although Table 14.6.1 is not a complete proof of the adequacy of the statistical procedures of Section 14.4, a comparison with the complete SAT results abstracted for Table 14.5.1 makes it very clear that blacks and whites are quite different from random groups for present purposes. The final SAT results show a significant difference at the 5% level between blacks and whites for 38 out of 85 items; this is quite different from the 9 out of 85 shown in Table 14.6.1 for a comparison of random groups.
It is not claimed here that the suggested statistical significance tests are optimal, nor that the parameter estimates are valid for those items that are biased. A good additional check on the foregoing statistical analysis could be obtained by repeating the entire comparison separately for independent samples of blacks and whites.
APPENDIX
This appendix describes the chi-square test used in Section 14.4 for the null
hypothesis that for given i both bi1 = bi2 and ai1 = ai2. The procedure is based
on the chi-square statistic
χi² ≡ vi′ Σi⁻¹ vi,   (14-10)

where vi′ is the vector (b̂i1 − b̂i2, âi1 − âi2) and Σi⁻¹ is the inverse of the asymptotic variance-covariance matrix of b̂i1 − b̂i2 and âi1 − âi2.
Since âi1 and b̂i1 for whites are independent of âi2 and b̂i2 for blacks, we have

Σi = Σi1 + Σi2,   (14-11)

where Σi1 is the sampling variance-covariance matrix of âi1 and b̂i1 in group 1, and similarly for Σi2. These latter matrices are found for maximum likelihood estimators from the formulas Σi1 = Ii1⁻¹ and Σi2 = Ii2⁻¹, where Ii is the 2 × 2 information matrix for âi and b̂i [Eq. (12-8), (12-9), (12-11)]. The diagonal elements of Ii are the reciprocals of (14-8) and (14-9).
The significance test is carried out separately for each item by computing χi² and looking up the result in a table of the chi-square distribution. If the null hypothesis is true, χi² has a chi-square distribution with 2 degrees of freedom (Morrison, 1967, p. 129, Eq. 1).
When there are more than two groups, a simultaneous significance test for
differences across groups on ai and bi can be made by multivariate analysis of
variance.
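The appendix procedure can be sketched numerically as follows. This code is not from the original text; the covariance matrices are assumed to have already been obtained as inverses of the 2 × 2 information matrices for each group, and all names are illustrative.

```python
import numpy as np
from scipy.stats import chi2

def item_bias_chisq(b1, a1, cov1, b2, a2, cov2):
    """Chi-square test (2 df) that one item's (b, a) are equal in two groups.

    cov1, cov2: 2x2 asymptotic covariance matrices of (b_hat, a_hat)
    in groups 1 and 2, e.g., inverses of the information matrices.
    """
    v = np.array([b1 - b2, a1 - a2])
    sigma = np.asarray(cov1) + np.asarray(cov2)   # Eq. (14-11)
    x2 = float(v @ np.linalg.inv(sigma) @ v)      # Eq. (14-10)
    p_value = chi2.sf(x2, df=2)
    return x2, p_value
```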
REFERENCES
Ironson, G. H. A comparative analysis of several methods of assessing item bias. Paper presented at
the annual meeting of the American Educational Research Association, Toronto, March 1978.
Mellenbergh, G. J. Applicability of the Rasch model in two cultures. In L. J. Cronbach & P. J. D.
Drenth (Eds.), Mental tests and cultural adaptation. The Hague: Mouton, 1972.
Morrison, D. F. Multivariate statistical methods. New York: McGraw-Hill, 1967.
15 Omitted Responses and
Formula Scoring
The simpler item response theories consider only two kinds of response to an
item. Such theories are not directly applicable if the item response can be right,
wrong, or omitted.
More complex theories deal with cases where the item response may be A, B,
C, D, or E, for example. Although these more complex theories have sometimes
been used to deal with omitted responses, it is not always obvious that the
mathematical models used are appropriate or effective for this use.
only if they are unspeeded. In practice, some deviation from this rule can doubtless be tolerated.
If most examinees read and respond to items in serial order, a practical procedure for formula-scored tests is to ignore the "not-reached" responses of each examinee when making statistical inferences about examinee and item parameters. Such treatment of not-reached responses is discussed in Section 12.4.
To summarize: If item response theory is to be applied, tests should be
unspeeded. If many examinees do not have time to finish the test, purely random
responses may be discouraged by using formula scoring and giving appropriate
directions to the examinee. The not-reached responses that appear in formula-
scored tests should be ignored during parameter estimation.
If (1) number-right scores are used, (2) proper test directions are given, (3) the
examinees understand the directions, and (4) they act in their own self-interest,
then there will be no omitted responses. If formula scoring is used with appro-
15.7. THE PRACTICAL MEANING OF AN ITEM RESPONSE FUNCTION
Such situations occur constantly in practice. Since apparently Pi(θA) > Pi(θB),
θA must be greater than θB. But also apparently Pj(θA) < Pj (θB), so θA must be
less than θB. What is the source of this absurdity?
The trouble comes from an unsuitable interpretation of the practical meaning
of the item response function Pi(θA) = Prob(uiA = 1|θ A ). If we try to interpret
Pi(θA) as the probability that a particular examinee A will answer a particular
item i correctly, we are likely to reach absurd conclusions. To obtain useful
results, we may properly
A complete mathematical model for item response data with omits would involve
many new parameters: for example, a parameter for each examinee, representing
his behavior when faced with a choice of omitting or guessing at random. Such
complication might make parameter estimation impractical; we therefore avoid
all such complicated models here.
Since "not-reached" responses can be ignored in parameter estimation, why
not ignore omitted responses? Two lines of reasoning make it clear that we
cannot do this:
Supplying random responses in place of omits does not introduce a bias into the
examinee's formula score: His expected formula score is the same whether he
omits or responds at random. The objection to requiring him to respond is that the
required (random) responses would reduce the accuracy of measurement.
Although we can obtain unbiased estimates of ability by supplying random
responses in place of omits, introduction of random error degrades the data.
There should be some way to obtain unbiased estimates of the same parameters
without degrading the data.
A method for doing this is described in Lord (1974). The usual likelihood
function

L ≡ ∏i ∏a (Pia)^uia (Qia)^(1−uia)   (4-21)

is replaced by

∏i ∏a (Pia)^via (Qia)^(1−via),   (15-2)
where via = uia if the examinee responds to the item and via = 1/A if the examinee omits the item. The product over a is to be taken only over the examinees who actually reached item i. It should be noted that (15-2) is not a likelihood function. Nevertheless, if the item parameters are known, the value of θa that maximizes (15-2) is a better estimate of θ than the maximum likelihood estimate obtained from Eq. (4-21) after replacing omits by random responses.
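A minimal sketch of this idea, assuming known three-parameter logistic item parameters and A response choices per item; the names are illustrative and not from the original text.

```python
import numpy as np
from scipy.optimize import minimize_scalar

D = 1.7

def estimate_theta_with_omits(u, a, b, c, A):
    """Maximize expression (15-2) over theta for one examinee.

    u[i] is 1 (right), 0 (wrong), or None (omitted); not-reached items
    should simply be excluded from u before calling.  Omits enter with
    pseudo-response v = 1/A rather than being replaced by random answers.
    """
    v = np.array([1.0 / A if ui is None else float(ui) for ui in u])
    a, b, c = map(np.asarray, (a, b, c))

    def neg_log(theta):
        P = c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))
        return -np.sum(v * np.log(P) + (1.0 - v) * np.log(1.0 - P))

    res = minimize_scalar(neg_log, bounds=(-4.0, 4.0), method="bounded")
    return res.x
```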
y ≡ x − (n − x)/(A − 1) = (Ax − n)/(A − 1).   (15-3)
If there are no omits, formula score y is a specified linear transformation of
number-right score x.
There are two ways we can predict a person's formula score from his ability θ and from the item parameters. If we know which items examinee a answered, his number-right true score is

ξa = Σ⁽ᵃ⁾ Pi(θa),   (15-4)

where the summation is over the items answered by examinee a. From this and (15-1), the examinee's true formula score ηa is

ηa = Σ⁽ᵃ⁾ Pi(θa) − [Σ⁽ᵃ⁾ Qi(θa)] / (A − 1).   (15-5)
We can estimate the examinee's observed formula score from his θ̂a by substituting ŷa for ηa and θ̂a for θa in (15-5).
If examinee a answered all the items in the test, (15-5) becomes

ηa = [A Σ_{i=1}^{n} Pi(θa) − n] / (A − 1).   (15-6)

This can also be derived directly from (15-3). Again, the examinee's formula score can be estimated from his θ̂a by substituting θ̂a for θa and ŷa for ηa in (15-6).
An examinee's formula score has the same expected value whether he omits items or whether he answers them at random. If we do not know which items the examinee omitted, we cannot use (15-5), but we can still use (15-6) if the examinee finished the test.
If the examinee did not finish the test, we can use (15-5) or (15-6) to estimate his actual formula score on the partly speeded test from his θ̂, provided we know which items he did not reach: The not-reached items are simply omitted from the summations in (15-5) and (15-6). If we do not know which items he reached, we can still use (15-6) to estimate the formula score that he would get if given time to finish the test.
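The following sketch (illustrative names; not from the original) evaluates (15-5) and (15-6) for one examinee with known item parameters.

```python
import numpy as np

D = 1.7

def true_formula_score(theta, a, b, c, A, answered=None):
    """True formula score eta for one examinee, Eqs. (15-5)/(15-6).

    answered: boolean mask of items the examinee answered; None means
    all n items were answered, which reduces (15-5) to (15-6).
    """
    a, b, c = map(np.asarray, (a, b, c))
    P = c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))
    if answered is not None:
        P = P[np.asarray(answered, dtype=bool)]
    return P.sum() - (1.0 - P).sum() / (A - 1)
```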
REFERENCES
Diamond, J., & Evans, W. The correction for guessing. Review of Educational Research, 1973, 43,
181-191.
Ebel, R. L. Blind guessing on objective achievement tests. Journal of Educational Measurement,
1968, 5, 321-325.
Lord, F. M. Estimation of latent ability and item parameters when there are omitted responses.
Psychometrika, 1974, 39, 247-264.
Sax, G., & Collet, L. The effects of differing instructions and guessing formulas on reliability and
validity. Educational and Psychological Measurement, 1968, 28, 1127-1136.
Slakter, M. J. Generality of risk taking on objective examinations. Educational and Psychological Measurement, 1969, 29, 115-128.
Traub, R. E., & Hambleton, R. K. The effect of scoring instructions and degree of speededness on
the validity and reliability of multiple-choice tests. Educational and Psychological Measurement,
1972, 32, 737-758.
Waters, L. K. Effect of perceived scoring formula on some aspects of test performance. Educational and Psychological Measurement, 1967, 27, 1005-1010.
IV ESTIMATING TRUE-SCORE
DISTRIBUTIONS
16 Estimating True-Score
Distributions 1
16.1. INTRODUCTION
We have already seen [Eq. (4-5) or (4-9)] that true score ζ or ξ on a test is simply a monotonic transformation of ability θ. The transformation is different from test to test. If we know the distribution g(ζ) of true score, the joint distribution of true score and observed score is

φ(x, ζ) = g(ζ)h(x|ζ),   (16-1)

where h(x|ζ) is the conditional distribution of observed score for given true score. The form of the conditional distribution h(x|ζ) is usually known [see Eq. (4-1), (11-24)]; its parameters (the ai, bi, and ci) can be estimated. If we can estimate g(ζ) also, then we can estimate the joint distribution of true score and observed score. As noted in Section 4.5, this joint distribution contains all relevant information for describing and evaluating the properties of observed score x as a measure of true score ζ or as a measure of ability θ. An estimated true-score distribution is thus essential to understanding the measurement process, the effects of errors of measurement, and the properties of observed scores as fallible measurements.
In addition, an estimated true-score distribution can be used for many other
purposes, to be explained in more detail:
¹Much of the material in this chapter was first presented in Lord (1969).
If we integrate (16-1) over all true scores, we obtain the marginal distribution of observed scores:

φ(x) = ∫₀¹ g(ζ)h(x|ζ) dζ.   (16-2)

Our first problem is to infer the unknown g(ζ) from φ(x), the distribution of observed scores in the population, presumed known, and from h(x|ζ), also known.
If observed score x were a continuous variable, (16-2) would be a Fredholm
integral equation of the first kind. In this case it may be possible to solve (16-2)
and determine g(ζ) uniquely. Here we deal only with the usual case where x is
number-right score, so that (16-2) need hold only for x = 0, 1, 2, ..., n.
Suppose temporarily that h(x|ζ) is binomial (see Section 4.1). Let us multiply both sides of (16-2) by x^[r] ≡ x(x − 1) ··· (x − r + 1), where r is a positive integer. Summing over all x, we have
Σ_{x=0}^{n} x^[r] φ(x) = ∫₀¹ g(ζ) Σ_{x=0}^{n} x^[r] h(x|ζ) dζ.

Now, the sum on the left is by definition the rth factorial moment of φ(x), to be denoted by M[r]; the sum on the right is the rth factorial moment of the binomial distribution, which is known (Kendall & Stuart, 1969, Eq. 5.8) to be n^[r]ζ^r. The foregoing equation can now be written

M[r] / n^[r] = ∫₀¹ ζ^r g(ζ) dζ ≡ μ′r   (r = 1, 2, ..., n),   (16-3)
where μ'r is the rth ordinary moment of the true-score distribution g(ζ). This
equation shows that when h(x|ζ) is binomial, the first n moments of g(ζ) can be
easily determined from the first n moments of the distribution of observed
scores. This last statement is still true when h(x|ζ) has the generalized binomial
distribution [Eq. (4-1)] appropriate for item response theory.
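Under the binomial assumption, for example, the first few true-score moments can be read directly off the observed-score distribution, as in this sketch (not from the original; names are illustrative).

```python
import numpy as np

def true_score_moments(phi, n, r_max=4):
    """Ordinary moments of g(zeta) from an observed-score distribution, Eq. (16-3).

    phi[x] is the proportion of examinees with number-right score x = 0..n.
    Returns mu_prime[r] for r = 1..r_max (assumes h(x|zeta) is binomial).
    """
    x = np.arange(n + 1)
    mu = []
    for r in range(1, r_max + 1):
        x_fact = np.ones(n + 1)
        n_fact = 1.0
        for k in range(r):          # factorial powers x^[r] and n^[r]
            x_fact *= (x - k)
            n_fact *= (n - k)
        mu.append(np.sum(x_fact * phi) / n_fact)
    return mu
```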
Since only n mathematically independent quantities can be determined from the n mathematically independent values φ(1), φ(2), ..., φ(n), it follows that the higher moments, above order n, of the true-score distribution cannot be determined from φ(x). Indeed, any g(ζ) with appropriate moments up through order n will satisfy (16-3) exactly, regardless of the value of its higher moments. Since the frequency distribution of a bounded integer-valued variable (x) is determined by its moments, it follows that any g(ζ) with the appropriate moments up through order n will be a solution to (16-2). Thus, the true-score distribution for an infinite population of examinees in principle cannot be determined exactly from their φ(x) and h(x|ζ) (x = 0, 1, ..., n).
If two different true-score distributions have the same moments up through order n, they have the same best fitting polynomial of degree n in the least-squares sense (Kendall & Stuart, 1969, Section 3.34). If the distributions oscillate more than n times about the best fitting polynomial, they could differ noticeably from each other. If the true-score distributions are reasonably smooth, however, without many peaks and valleys, they will be closely fitted by the same degree-n polynomial. Since they each differ little from the polynomial, they cannot differ much from each other. Thus any smooth g(ζ) with the required moments up through order n will be a good approximation to the true g(ζ) whenever the latter is smooth.
It is common experience in many diverse areas that sample frequency distributions of continuous variables become smooth as sample size is increased. We therefore assume here that the true g(ζ) is smooth.
A function with many sharp local fluctuations is not well fitted by any smooth function. Thus we could take

∫ [g(ζ) − γ(ζ)]² dζ

as a convenient measure of smoothness, where γ(ζ) is some smooth density function specified by the user. Actually, we shall use instead the related measure

∫₀¹ [g(ζ) − γ(ζ)]² / γ(ζ) dζ.   (16-4)

This measure of smoothness is the same as an ordinary chi square between g(ζ) and γ(ζ) except that summation is replaced by integration.
The need for the user to choose γ(ζ) may seem disturbing. For most practical purposes, however, it has been found satisfactory to choose γ(ζ) ≡ 1 or γ(ζ) ∝ ζ(1 − ζ). The choice usually makes little difference in practice. Remember that we are finding one among many g(ζ), all of which produce an exact fit to the population φ(x). Any smooth solution to our problem will be very close to any other smooth solution.
Given γ(ζ), h(x|ζ), and φ(x), what we require is to find the g(ζ) that minimizes (16-4) subject to the restriction that g(ζ) must satisfy (16-2) exactly for x = 0, 1, 2, ..., n. This is a problem in the calculus of variations. The solution (Lord, 1969) is

g(ζ) = γ(ζ) Σ_{X=0}^{n} λX h(X|ζ),   (16-5)

the values of the λX being chosen so that (16-2) is satisfied for x = 0, 1, 2, ..., n. To find the λX, substitute (16-5) in (16-2):

Σ_{X=0}^{n} λX ∫₀¹ γ(ζ)h(X|ζ)h(x|ζ) dζ = φ(x)   (x = 0, 1, ..., n).   (16-6)

These are n + 1 simultaneous linear equations in the n + 1 unknowns λX. If h(x|ζ) is binomial and if γ(ζ) is constant or a beta distribution with integer parameters, then the integral in (16-6) can be evaluated exactly for X, x = 0, 1, 2, ..., n. If h(x|ζ) is the generalized binomial of Eq. (4-1), we replace it by a two- or four-term approximation (see Lord, 1969), after which the integral in (16-6) can again be evaluated exactly. The required values of λX are then found by inverting the resulting matrix of coefficients and solving linear equations (16-6).
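The following sketch sets up and solves the linear system (16-6) by simple numerical quadrature and then evaluates (16-5). It is not from the original; it assumes a binomial h(x|ζ) and uniform γ(ζ) ≡ 1, and all names are illustrative.

```python
import numpy as np
from scipy.stats import binom

def fit_true_score_density(phi, n, grid_size=2001):
    """Solve Eq. (16-6) for lambda and return g(zeta) on a grid, Eq. (16-5).

    phi[x] is the population observed-score distribution for x = 0..n;
    gamma(zeta) is taken to be 1.  Note that the solution is extremely
    sensitive to small changes in phi (see Section 16.5).
    """
    zeta = np.linspace(0.0, 1.0, grid_size)
    h = np.array([binom.pmf(x, n, zeta) for x in range(n + 1)])  # h(x|zeta) rows
    # Coefficient matrix: integral of h(X|zeta) h(x|zeta) d zeta (trapezoid rule).
    A = np.array([[np.trapz(h[X] * h[x], zeta) for X in range(n + 1)]
                  for x in range(n + 1)])
    lam = np.linalg.solve(A, np.asarray(phi, dtype=float))
    g = h.T @ lam                    # g(zeta) = sum_X lam[X] h(X|zeta)
    return zeta, g                   # g may go negative; see Section 16.5
```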
To be a valid solution, the g(ζ) found from (16-5) in this way must be
nonnegative for 0 ≤ ζ ≤ 1. This requirement could be imposed as part of the
calculus of variations problem; however, the resulting solution might still be
intuitively unsatisfactory because of its angular character. A practical way of
dealing with this condition is suggested at the end of Section 16.5.
16.5. A PRACTICAL ESTIMATION PROCEDURE
The problem solved in the last section has no direct practical application, since we never know the φ(x) exactly. Instead, we have sample frequencies f(x) that are only rough approximations to the φ(x). In most statistical work, the substitution of sample values for population values provides an acceptable approximation, but not in the present case, as we shall see.
It is clear from (16-2) that φ(x) is a weighted average of h(x|ζ), averaged over 0 ≤ ζ ≤ 1 with weight g(ζ). Likewise, the first difference Δφ(x) ≡ φ(x + 1) − φ(x) is a weighted average of conditional first differences h(x + 1|ζ) − h(x|ζ). Since an average is never more than its largest component, Δφ(x) can never be greater than maxζ [h(x + 1|ζ) − h(x|ζ)]. This proves that sufficiently sharp changes in φ(x) are incompatible with (16-2). A similar argument holds for second- and higher-order differences. Thus any sample frequency distribution f(x) may be incompatible with (16-2) simply because of local irregularities due to sampling fluctuations. In such cases, any g(ζ) obtained by the methods of Section 16.3 is negative somewhere in the range 0 ≤ ζ ≤ 1 and thus is not an acceptable solution to our problem. This is what usually happens when f(x) is substituted for φ(x) in (16-5) and (16-6).
The statistical estimation problem under discussion is characterized by the fact that a small change in the observed data produces a large change in the solution. Such problems are important in many areas of science where the scientist is trying to infer unobservable, causal variables from their observed effects. Recently developed methods for dealing with this class of problems are discussed by Craven and Wahba (1977), Franklin (1970, 1974), Gavurin and Rjabov (1973), Krjanev (1974), Shaw (1973), Varah (1973), and Wahba (1977).
where the notation indicates that the summation is to be taken over all integers x
in class interval u.
The reasoning of Section 16.3 can now be applied to the grouped frequencies.
The basic equation specifying the model is now
φu = ∫₀¹ g(ζ) Σ_{x:u} h(x|ζ) dζ   (u = 1, 2, ..., U).   (16-8)
240 16. ESTIMATING TRUE-SCORE DISTRIBUTIONS
axu ≡ Σ_{X:u} ∫₀¹ γ(ζ)h(X|ζ)h(x|ζ) dζ.
The main problem not already dealt with is the choice of class intervals for
grouping the number-right scores. Arbitrary grouping frequently fails to provide
a good fit to the data, as measured by a chi square between actual and estimated
frequencies f(x) and Ø(x). A possible automatic method for finding a successful
grouping is as follows.
1. Group the tails of the sample distribution to avoid very small values of f(x).
2. Arbitrarily group the remainder of the score range, starting in the middle
and keeping the groups as narrow as possible, until the total number of groups is
reduced to some practical number, perhaps U = 25.
3. Estimate λu (u = 1, 2 , . . . , 25) by maximum likelihood, subject to the
restriction that λu ≥ 0.
4. As a by-product of step 3, obtain the asymptotic variance-covariance
matrix of the nonzero λu (the inverse of the Fisher information matrix).
5. Compute φ̂(x) from (16-13).
6. Compute the empirical chi square comparing φ̂(x) with f(x):

X² = Σ_{x=1}^{n} [f(x) − φ̂(x)]² / φ̂(x).   (16-14)
7. Determine the percentile rank of X² in a standard chi-square table with U* − U degrees of freedom, where U* is the number of class intervals at the end of step 1.
8. If λu and λu+1 were identical for some u, it would make no difference if we combined intervals u and u + 1 into a single class interval (the reader may check this assertion for himself). If λu and λu+1 are nearly the same, it makes little difference if we combine the two intervals.
(a) For each u = 1, 2, ..., U − 1, compute the asymptotic variance of λu+1 − λu.
Under the procedure suggested above, the grouping is determined by the data. Thus, strictly speaking, the resulting X² no longer has a chi-square distribution with U* − U degrees of freedom. If an accurate chi-square test of significance is required, the data should be split into random halves and the grouping determined from one half as described above. A chi-square test of significance can then be properly carried out, using this grouping, on the other half of the data.
Chi-square significance levels quoted in this chapter and in the next chapter are computed as in step 10. Thus the significance levels quoted are only nominal; they are numerically larger (less "significant") than they should be.
FIG. 16.7.1. Sample observed-score distribution (irregular polygon), estimated population true-score and observed-score distributions, sixth-grade vocabulary test, N = 1715.
We know from item response theory that true score ζ can never be less than Σ_{i=1}^{n} ci/n. In the case of Fig. 16.7.1, estimated ĉi were available. These values were utilized by setting a lower limit of Σ_{i=1}^{n} ĉi/n = .225755 to the range of ζ, requiring that g(ζ) = 0 when ζ < .225755. This requirement was imposed by replacing the lower limit 0 in the integrals of (16-8) and (16-11) by .225755. The integral in (16-11), now an incomplete beta function, can be evaluated by recursive procedures (Jordan, 1947, Section 25, Eq. 5) without using approximate methods, as long as γ(ζ) is either a constant or a beta function with integer exponents.
In the case of Fig. 16.7.1, γ(ζ) was taken as constant. The figure shows the
resulting estimated true- and observed-score distributions. The estimated true-
score distribution has U — 1 = 3 independent parameters λu. The chi square
(16-14) is 23.5; nominally, the degrees of freedom are 30, suggesting a good fit.
The results are considered further in the next section.
16.8. BIMODALITY
[Figure: frequency distributions of raw scores (0 to 40) for item discriminations a = .8, .9, and 1.0, with r = .882, .898, and .910, respectively.]
If a test is lengthened by adding parallel forms of the test, the true score of each
person remains unchanged; thus g(ζ) is also unchanged. Any change in test
FIG. 16.10.1. Estimated population true- and observed-score distributions, sixth-grade vocabulary test, N = 1715.
length n changes h(x|ζ) in a known way. Thus the theoretical effect of test length on φ(x) can be determined from (16-2).
In practical applications, we have the estimated true-score distribution (16-12). In this case, the effect of test length on φ(x) can be determined by varying n in (16-11) and (16-13). The axu defined by (16-11) must be determined from (16-10) each time n is changed; the estimates of λu are supposed to be unaffected by changes in n.
Figure 16.10.1 shows estimated proportion-correct observed-score frequency distributions when the 42-item vocabulary test of Fig. 16.7.1 is shortened or lengthened to n = 5, 10, 20, 40, 80, 160, or ∞. As n becomes large, the distribution of proportion-correct score z ≡ x/n approaches g(ζ). For small n, observed- and true-score distributions may have very different shapes, as illustrated.
[Figure: estimated true-score density G(ζ) plotted against ζ (0 to 1.0), together with the sample and estimated observed-score distributions F(x) and PHI(x) plotted against x (0 to 60).]
TABLE 16.11.1 (continued)

                                     True Score
Observed
Score    .35  .40  .45  .50  .55  .60  .65  .70  .75  .80  .85  .90  .95  1.00
42 9 12 15 20 29 44 58 66 68 69 69 69 69 69
41 9 12 15 20 29 43 55 61 62 62 62 62 62 62
40 9 12 15 20 29 42 52 56 56 56 56 56 56 56
39 9 12 15 20 29 40 48 50 51 51 51 51 51 51
38 9 12 15 20 28 38 44 45 45 45 45 45 45 45.5
37 9 12 15 20 27 36 39 40 40 40 40 40 40 40.6
36 9 12 15 19 26 33 35 36 36 36 36 36 36 36.1
35 9 12 15 19 25 30 32 32 32 32 32 32 32 32.1
34 9 12 15 19 24 27 28 28 28 28 28 28 28 28.5
33 9 12 15 18 22 25 25 25 25 25 25 25 25 25.4
32 9 12 15 18 21 22 23 23 23 23 23 23 23 22.7
31 9 12 14 17 19 20 20 20 20 20 20 20 20 20.4
30 9 11 14 16 18 18 18 18 18 18 18 18 18 18.4
29 9 11 14 15 16 17 17 17 17 17 17 17 17 16.7
28 9 11 13 14 15 15 15 15 15 15 15 15 15 15.2
27 9 11 13 13 14 14 14 14 14 14 14 14 14 13.9
26 8 11 12 12 13 13 13 13 13 13 13 13 13 12.7
25 8 10 11 11 12 12 12 12 12 12 12 12 12 11.6
24 8 10 10 10 11 11 11 11 11 11 11 11 11 10.6
23 8 9 9 10 10 10 10 10 10 10 10 10 10 9.7
22 7 8 9 9 9 9 9 9 9 9 9 9 9 8.7
reject all students with x ≤ 38 (refuse to graduate them from high school), the
right-hand column shows that we shall be rejecting .045 of all students. The table
entry at (.60, 38) shows that .038 of all students lie at or below ζ = .60 and also
at or below x = 38; these students are all rightly rejected.
From the foregoing numbers we can compute the following 2 × 2 table:

              unqualified    qualified
  accepted       .008          .947
  rejected       .038          .007
This shows that .008 of the total group were accepted even though they were
really unqualified and that .007 of the total group were rejected even though they
were really qualified. These two proportions are useful for summarizing the
effectiveness of the minimum qualifications reading test, since they represent the
proportion of students erroneously classified. The foregoing procedure is described and implemented by Livingston (1978).
16.12. ESTIMATING ITEM TRUE-SCORE REGRESSION

An item-test regression (Section 3.1) can be computed for each observed score x
as follows: Divide the number of examinees at x who answer the item correctly
by the total number of examinees at x. An item-true-score regression can in
principle be obtained similarly. If gi(ui, ζ) denotes the bivariate density function of ui (item score) and ζ (proportion-correct true score), and if g(ζ) is the (marginal) density of ζ, then the item-true-score regression may be found from

𝓔(ui|ζ) = gi(1, ζ) / g(ζ).   (16-15)

The denominator on the right of (16-15) can be estimated by (16-12). If we apply (16-12) to the subgroup of examinees who answer item i correctly, we obtain an estimate of gi(ζ|ui = 1), the conditional distribution of true score for examinees who answer item i correctly. The numerator in (16-15) is gi(1, ζ) = πi gi(ζ|ui = 1), where πi is the proportion of all examinees who answer item i correctly. Since πi can be approximated by the observed proportion of correct answers in the total group, we can use (16-12) to estimate both the numerator and the denominator of (16-15) and thus to estimate the item-true-score regression.
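In outline, and assuming the densities g(ζ) and gi(ζ|ui = 1) have already been estimated (for example with a routine like the fit_true_score_density sketch above), Eq. (16-15) amounts to the following; the code and names are illustrative, not from the original.

```python
import numpy as np

def item_true_score_regression(g_all, g_given_right, pi_i):
    """Estimate E(u_i | zeta) on a grid via Eq. (16-15).

    g_all:         estimated density g(zeta) for the total group
    g_given_right: estimated density g_i(zeta | u_i = 1) for examinees
                   answering item i correctly (item i excluded from x)
    pi_i:          observed proportion answering item i correctly
    """
    g_all = np.asarray(g_all, dtype=float)
    numerator = pi_i * np.asarray(g_given_right, dtype=float)   # g_i(1, zeta)
    with np.errstate(divide="ignore", invalid="ignore"):
        reg = np.where(g_all > 0, numerator / g_all, np.nan)
    return reg
```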
Let ζn denote true score on an n-item test; let ζn−1 denote true score on the same test excluding item i. This use of (16-12) is appropriate only if item i is excluded from the items used to determine number-right score x. Thus, (16-12) and (16-15) yield an estimate of the regression of ui on ζn−1.
𝓔(ui|ζ) = gi(1, ζ) / [gi(1, ζ) + gi(0, ζ)].   (16-17)

The distribution gi(0, ζ) is estimated by applying (16-12) to the group of examinees who answered item i incorrectly.
The relation

ζn−1 ≡ ζn−1(θ) ≡ (1/(n − 1)) Σ_{j≠i} Pj(θ)   (16-18)

transforms ζn−1 to θ. Since Pi(θ) ≡ 𝓔(ui|θ), (16-18) can be used to convert an estimated item-true-score regression into an estimated item response function (regression of item score on ability). Thus the item response function can be written
Results obtained by this method are illustrated by the dashed curves in Fig. 2.3.1. The solid curves are three-parameter logistic functions computed by Eq. (2-1) from maximum likelihood estimates âi, b̂i, and ĉi. The agreement between the two methods of estimation is surprisingly close, especially so when one considers that the methods of this chapter are based on data and on assumptions very different from the data and assumptions used to obtain the logistic curves (solid lines) in Fig. 2.3.1. An explicit listing and contrasting of the data and assumptions used by the two methods is given in Lord (1970), along with further details of the procedure used. Assuming they are confirmed on other sets of data, results such as those shown in Fig. 2.3.1 suggest that the three-parameter logistic function is quite effective for representing the response functions of items in published tests.
REFERENCES
Craven, P., & Wahba, G. Smoothing noisy data with spline functions: Estimating the correct degree
of smoothing by the method of generalized cross-validation. Technical Report No. 445. Madison,
Wis.: Department of Statistics, University of Wisconsin, 1977.
Franklin, J. N. Well-posed stochastic extensions of ill-posed linear problems. Journal of Mathematical Analysis and Applications, 1970, 31, 682-716.
Franklin, J. N. On Tikhonov's method for ill-posed problems. Mathematics of Computation, 1974,
28, 889-907.
Gavurin, M. K., & Rjabov, V. M. Application of Čebyšev polynomials in the regularization of ill-posed and ill-conditioned equations in Hilbert space. (In Russian) Žurnal Vyčislitel'noĭ Matematiki i Matematičeskoĭ Fiziki, 1973, 13, 1599-1601, 1638.
Jordan, C. Calculus of finite differences (2nd ed.). New York: Chelsea, 1947.
Kendall, M. G., & Stuart, A. The advanced theory of statistics (Vol. 1). New York: Hafner, 1969.
Krjanev, A. V. An iteration method for the solution of ill-posed problems. (In Russian) Žurnal Vyčislitel'noĭ Matematiki i Matematičeskoĭ Fiziki, 1974, 14, 25-35, 266.
Livingston, S. Reliability of tests used to make pass-fail decisions: Answering the right questions.
Paper presented at the meeting of the National Council on Measurement in Education, Toronto,
March 1978.
Lord, F. M. A theory of test scores. Psychometric Monograph No. 7. Psychometric Society, 1952.
Lord, F. M. Estimating true-score distributions in psychological testing (An empirical Bayes estimation problem). Psychometrika, 1969, 34, 259-299.
Lord, F. M. Item characteristic curves estimated without knowledge of their mathematical form—a
confrontation of Birnbaum's logistic model. Psychometrika, 1970, 35, 43-50.
Shaw, C. B., Jr. Best accessible estimation: Convergence properties and limiting forms of the direct
and reduced versions. Journal of Mathematical Analysis and Applications, 1973, 44, 531-552.
Stocking, M., Wingersky, M. S., Lees, D. M., Lennon, V., & Lord, F. M. A program for
estimating the relative efficiency of tests at various ability levels, for equating true scores, and for
predicting bivariate distributions of observed scores. Research Memorandum 73-24. Princeton,
N.J.: Educational Testing Service, 1973.
Varah, J. M. On the numerical solution of ill-conditioned linear systems with applications to ill-posed
problems. SIAM Journal on Numerical Analysis, 1973, 10, 257-267.
Wahba, G. Practical approximate solutions to linear operator equations when the data are noisy.
SIAM Journal on Numerical Analysis, 1977, 14, 651-667.
17 Estimated True-Score
Distributions for Two Tests
This chapter considers problems involving two or more tests of the same trait. In every discussion of tests x and y here, it is assumed that the ability θ is the same for both tests.
The trivariate distribution of x, y, and θ for any population may be written [compare Eq. (16-1)]

φ(x, y, θ) = g*(θ)h₁*(x|θ)h₂*(y|θ),   (17-1)

where g* is the distribution of θ and h₁* and h₂* are the conditional distributions of observed scores x and y for given θ. The bivariate distribution of x and y is thus

φ(x, y) = ∫_{−∞}^{∞} g*(θ)h₁*(x|θ)h₂*(y|θ) dθ.   (17-2)

Now, the proportion-correct true scores ζ and η are related to θ by the formulas

ζ ≡ (1/nx) Σ_{i=1}^{nx} Pi(θ),   η ≡ (1/ny) Σ_{j=1}^{ny} Pj(θ),   (17-3)

where i indexes the nx items in test x, and j indexes the ny items in test y. Thus after a transformation of variables, (17-2) can now be written [compare Eq. (16-2)]

φ(x, y) = ∫₀¹ g(ζ)h₁(x|ζ)h₂[y|η(ζ)] dζ,   (17-4)
where g(ζ) is the same as in Chapter 16, h₁(x|ζ) is the same as h(x|ζ) in Chapter 16, h₂(y|η) is the conditional distribution of y, and η ≡ η(ζ) is the transformation relating η to ζ, obtained from (17-3) by elimination of θ.
If the item parameters are known, then h₁* and h₂* are known and it should be possible in principle to estimate g*(θ) from φ(x, y) using (17-2); equivalently, it should be possible to estimate g(ζ) from φ(x, y) using (17-4). Full-length numerical procedures for doing this would be complicated and have not been implemented. Some short-cut procedures (using a series approximation to the generalized binomial) are the subject of this chapter. Illustrative results are presented.
If x and y are parallel test forms, then ζ and η are identical and also h₁ and h₂ are identical. In this case, (17-4) becomes

φ(x, y) = ∫₀¹ g(ζ)h(x|ζ)h(y|ζ) dζ.   (17-5)
If test x and test y are different measures of the same trait, their proportion-correct true scores, ζ and η, have a mathematical relationship. This relation
TABLE 17.2.1
Estimated Joint Cumulative Distribution of Number-Right
Observed Scores on Two Parallel Test Forms, x and y, of the Basic
Skills Assessment Reading Test
y=23 26 29 32 35 38 41 44 47 50 53 56 59 62 65
x
65 10 13 17 23 32 45 62 83 108 143 192 265 392 663 1000
62 10 13 17 23 32 45 62 83 108 143 192 264 377 550 663
59 10 13 17 23 32 45 62 83 108 143 190 250 319 377 392
56 10 13 17 23 32 45 62 83 108 140 179 220 250 264 265
53 10 13 17 23 32 45 62 82 105 132 159 179 190 192 192
50 10 13 17 23 32 45 62 80 99 118 132 140 143 143 143
47 10 13 17 23 32 45 60 76 89 99 105 108 108 108 108
44 10 13 17 23 32 43 56 67 76 80 82 83 83 83 83
41 10 13 17 22 30 40 49 56 60 62 62 62 62 62 62
38 10 13 16 21 27 34 40 43 45 45 45 45 45 45 45
35 10 12 16 20 24 27 30 32 32 32 32 32 32 32 32
32 9 12 15 17 20 21 22 23 23 23 23 23 23 23 23
29 9 11 13 15 16 16 17 17 17 17 17 17 17 17 17
26 9 10 11 12 12 13 13 13 13 13 13 13 13 13 13
23 8 9 9 9 10 10 10 10 10 10 10 10 10 10 10
This equation asserts that ζ₀ and η₀ ≡ η(ζ₀) have identical percentile ranks in their respective distributions. Numerical values of the function η(ζ₀) for given values of ζ₀ are found in practice from (17-6) by numerical integration and inverse interpolation. The result is an estimated true-score equating of ζ and η. This method of equating does not make use of the responses of each examinee to each item, as do the methods of Sections 13.5 and 13.6.
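The percentile-rank matching described here can be sketched as follows: for a chosen ζ₀, find the η₀ whose cumulative probability under q(η) equals the cumulative probability of ζ₀ under g(ζ). The code is illustrative only, not from the original.

```python
import numpy as np

def equate_true_scores(zeta_grid, g, eta_grid, q, zeta0):
    """Return eta0 = eta(zeta0) by matching cumulative distributions.

    g and q are estimated true-score densities on zeta_grid and eta_grid.
    Cumulatives come from trapezoidal integration; eta0 from inverse
    interpolation (adequate for a smooth, strictly increasing cumulative).
    """
    def cdf(grid, dens):
        c = np.concatenate(([0.0],
                            np.cumsum(np.diff(grid) * (dens[1:] + dens[:-1]) / 2)))
        return c / c[-1]                      # normalize to 1

    G = cdf(np.asarray(zeta_grid, dtype=float), np.asarray(g, dtype=float))
    Q = cdf(np.asarray(eta_grid, dtype=float), np.asarray(q, dtype=float))
    p = np.interp(zeta0, zeta_grid, G)        # percentile rank of zeta0
    return np.interp(p, Q, eta_grid)          # inverse interpolation in Q
```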
Figure 17.3.1 shows two estimates of the equating function η(ζ) relating true scores on two verbal tests, P and Q. Since P and Q are randomly parallel, being produced by randomly splitting a longer test, the relation η(ζ) should be nearly linear, but not precisely linear, as would be the case if P and Q were strictly parallel.
The relation η(ζ) was estimated by the method of this section from two different groups of examinees. Each curve in the figure runs from the first to the ninety-ninth percentile of the distribution of ζ for the corresponding group. The
FIG. 17.3.1. Estimates from two different groups of the line of relationship equating true scores ζ and η for two randomly parallel tests, P and Q. (From F. M. Lord, A strong true-score theory, with applications. Psychometrika, 1965, 30, 239-270.)
two estimated relations agree well with each other and are appropriately nearly
linear.
17.4. BIVARIATE DISTRIBUTION

Suppose g(ζ) and q(η) have been independently estimated by the method of Chapter 16 and then η(ζ) has been estimated by the method of the preceding section. The bivariate distribution of number-right observed scores x and y can now be estimated from (17-4) by numerical integration.
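A sketch of that numerical integration follows. It is not from the original text; it assumes a binomial form for h₁ and h₂ and takes the η(ζ) values on the grid as given (for instance from a routine like the equating sketch above).

```python
import numpy as np
from scipy.stats import binom

def bivariate_observed(zeta_grid, g, eta_of_zeta, nx, ny):
    """Estimate phi(x, y) from Eq. (17-4) by trapezoidal quadrature.

    g:           estimated true-score density on zeta_grid
    eta_of_zeta: array of eta values corresponding to zeta_grid
    Returns an (nx+1) x (ny+1) matrix of joint probabilities.
    """
    zeta = np.asarray(zeta_grid, dtype=float)
    g = np.asarray(g, dtype=float)
    h1 = np.array([binom.pmf(x, nx, zeta) for x in range(nx + 1)])
    h2 = np.array([binom.pmf(y, ny, np.asarray(eta_of_zeta)) for y in range(ny + 1)])
    phi = np.empty((nx + 1, ny + 1))
    for x in range(nx + 1):
        for y in range(ny + 1):
            phi[x, y] = np.trapz(g * h1[x] * h2[y], zeta)
    return phi
```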
An early version of this method was used1 to predict 16 different bivariate
¹The remainder of this section is taken by permission from F. M. Lord, A strong true-score theory, with applications. Psychometrika, 1965, 30, 239-270.
FIG. 17.4.1 Actual frequencies (upper) and predicted frequencies (lower) for Tests H and J.
[FIG. 17.4.2. Theoretical regressions of J on H and H on J (H plotted vertically, 0 to 25; J plotted horizontally, 5 to 50).]
from .02 to .05 lower than the predicted correlation, whereas for the remaining 10 bivariate distributions the observed correlation was in every case a trifle higher than the predicted correlation.
Figure 17.4.1 compares predicted and observed bivariate distributions (N =
2000) for Tests H and J, the hard and easy vocabulary tests. Figure 17.4.2 shows
for the same data the theoretical regressions of J on H and H on J, as well as
those row means and column means of the observed distribution based on five or
more cases. To the naked eye, the fit in these two figures seems rather good; the
chi square is significant at the 5% level, however. If the two tests measure
slightly different psychological traits, as suggested above, then significant chi
squares are to be expected. The analysis carried out is in fact just the analysis that
could be used to investigate whether the tests actually are or are not measuring
the same dimension.
For the remaining 12 pairs of distributions studied, it is more plausible that
both tests are measures of the same trait. For these, the model appears to be very
effective: 11 of the 12 chi-squares are nonsignificant at the 5% level.
Table 17.5.1 shows the true-score distribution for a rejected group of examinees
(x ≤ 38), as discussed in Section 16.11. The estimated distribution of true scores
TABLE 17.5.1
Estimated Population Observed-Score (Noncumulative) Distribution of Failing Students (x ≤ 38) Compared with Their Estimated True-Score Distribution and with Their Estimated Observed-Score Distribution on Parallel Test y

Score | Observed score on x | True score | Observed score on y
52 0 0 .1
51 0 0 .1
50 0 0 .1
49 0 0 .1
48 0 0 .3
47 0 0 .5
46 0 0 .6
45 0 .1 .7
44 0 .2 .9
43 0 .6 1.1
42 0 1.1 1.3
41 0 1.6 1.7
40 0 2.1 2.0
39 0 2.5 2.1
38 4.9 2.9 2.1
37 4.5 3.4 2.2
36 4.0 3.6 2.1
35 3.6 3.0 2.1
34 3.1 2.4 2.0
33 2.7 1.9 1.9
32 2.3 1.6 1.8
31 2.0 1.3 1.7
30 1.7 1.1 1.6
29 1.5 1.1 1.4
28 1.3 1.0 1.2
27 1.2 1.0 1.1
26 1.1 1.0 1.1
25 1.0 1.0 1.0
24 1.0 1.0 1.0
23 1.0 1.0 1.0
22 .9 1.0 1.0
21 .9 1.0 1.0
20 .9 1.0 .9
19 .9 1.0 .9
18 .9 1.0 .9
17 .8 1.0 .8
16 .8 1.0 .7
(continued)
17.5. CONSEQUENCES OF SELECTING ON OBSERVED SCORE 261
TABLE 17.5.1
{continued)
15 .7 1.0 .6
14 .6 1.0 .5
13 .5 0 .4
12 .4 0 .3
11 .2 0 .2
10 .2 0 .1
9 .1 0 .1
8 0 0 .1
7 0 0 .1
.
.
Total 45.5 45.5 998
ĝ(ζ) having been obtained by the methods of Chapter 16. A disadvantage of this
result is that there is no way to check its validity.
From Table 17.2.1, we can write the estimated (noncumulative) observed-score distribution on form y for those examinees who are rejected by form x. The estimated distribution of form y observed scores for examinees rejected by test x is given by the formula

f₀(y | x ≤ x₀) = Σ_{x=0}^{x₀} φ(x, y) / Σ_{x=0}^{x₀} Σ_{y=0}^{ny} φ(x, y),   (17-8)

φ(x, y) having been estimated by substituting ĝ(ζ) into (17-5).
is shown in Table 17.5.1 for comparison with the other distributions there. This
distribution could be checked against actual test data if we could administer both
form x and form y to the same examinees without practice effect.
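Given an estimated joint distribution such as the one from the earlier sketch, (17-8) reduces to a few lines (illustrative, not from the original).

```python
import numpy as np

def rejected_group_y_distribution(phi, x0):
    """Observed-score distribution on form y for examinees with x <= x0, Eq. (17-8)."""
    phi = np.asarray(phi, dtype=float)      # phi[x, y] from Eq. (17-4)/(17-5)
    top = phi[: x0 + 1, :].sum(axis=0)      # sum over x = 0..x0 for each y
    return top / top.sum()                  # divide by the double sum
```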
Selection need not necessarily involve a cutting score. Given f(x) examinees
at observed score x, we can select a proportion px of these at random (x = 0,
1 , . . . , n ) . The true-score distribution for the selected group will then be given by
gp(ζ) = g(ζ) Σ_{x=0}^{n} px (ⁿₓ) ζ^x (1 − ζ)^(n−x).   (17-9)

The observed-score distribution on form y for the selected group will be given by

fp(y) = Σ_{x=0}^{n} px φ(x, y)   (y = 0, 1, ..., n).   (17-10)
Not only does this last equation allow us to estimate fp(y) when the selection procedure p ≡ {px} is given, but it also can be used to find the selection procedure p that will produce a required distribution fp of y. If the left-hand side of (17-10) is given for y = 0, 1, ..., n, we have n + 1 linear equations in the n + 1 unknowns p₀, p₁, ..., pn. Since the matrix ‖φ(x, y)‖ will normally be nonsingular, values of p₀, p₁, ..., pn can be found satisfying (17-10) when the left-hand side is given.
To provide a meaningful solution to the problem stated, each value of px thus
determined from (17-10) must satisfy the inequalities 0 ≤ px ≤ 1. In practical
work, it is likely that these inequalities will not always be satisfied, in which case
some approximation will be required.
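Both uses of (17-10) can be sketched as follows: computing fp(y) from a given selection rule, and solving for the rule that yields a target distribution. The code is illustrative; as noted above, the solved px must be checked against 0 ≤ px ≤ 1.

```python
import numpy as np

def selected_y_distribution(phi, p):
    """f_p(y) for a given selection rule p[x], Eq. (17-10)."""
    return np.asarray(p, dtype=float) @ np.asarray(phi, dtype=float)

def selection_rule_for_target(phi, f_target):
    """Solve Eq. (17-10) for p[x] given a desired distribution f_target(y).

    Assumes phi is square, i.e., both forms have the same score range 0..n.
    f_target = phi^T p, so p = (phi^T)^(-1) f_target.
    """
    phi = np.asarray(phi, dtype=float)
    return np.linalg.solve(phi.T, np.asarray(f_target, dtype=float))
```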
Suppose that a test publisher wishes to norm his test on a nationally representative norms group. If he selects a representative sample of schools and asks them
to administer the test, he may receive many refusals. If so, any norms finally
collected will be of doubtful value: the schools that finally agree to administer his
test may be unrepresentative.
Suppose that the publisher can avoid refusals if he asks to administer only a
10-minute short form of the regular test. Our problem is then to estimate from
their scores on the short form what the total norms sample would have done on
the regular test.
Denote the short form by x and the regular form by y. We do not wish to go so
far as to assume that y is simply a lengthened version of x; we assume only that
both forms measure the same psychological dimension.
The relation η(ζ) between true score η on test y and true score ζ on test x can be found from (17-6). To find this relation, the publisher need only administer test y and test x separately to different random samples from any convenient population. The publisher does not need nationally representative samples for this purpose, since true-score equating is independent of the group tested (see Chapter 13).
In addition to determining the relation η(ζ) from some convenient sample, as just described, the publisher must estimate g(ζ), the true-score distribution for test x in the nationally representative sample, by the methods of Chapter 16. The η̂(ζ) from the convenient sample and the ĝ(ζ) from the national sample can then be substituted into (17-4) to estimate φ(x, y), the bivariate distribution of x and y for the national sample. The estimated national norms distribution f*(y), say, for the full-length test y, is then obtained by summing on x across the estimated bivariate distribution:

f*(y) = Σ_{x=0}^{n} φ(x, y).   (17-11)
ANSWERS TO EXERCISES

Chapter 4
Chapter 5
Chapter 6
Chapter 8
1. .6, .4.
2. .086, .351, .314, .249.
3.  y = ½    1    1½    2
    φ = .086  .314  .351  .249
4. .415, .585; .234, .293, .351, .123.
    y = ½    1    1½    2
    φ = .234  .351  .293  .123
Chapter 9
1. −1.92, −.40, 0, .40, .84, 1.44.
2. 1.83, .69, .64, .63, .66, .76.
3.  θ \ x    0-1    2-3    4-5    6-7
     2        0      0     .03    .97
     0       .02    .27    .55    .16
    −2       .46    .47    .06     0
Chapter 11
1.
x ≥ 2 3 4 5
α = .69 .34 .10 .01
β = .01 .07 .29 .70
C = .70 .41 .39 .71
2. 24.1, 4.7, .91, .18, .03, .01; examinees scoring x ≥ 4.
3. .31, .26, .18.
4. 1.84, 1.64, 1.27.
5. 3.48, 3.12, 2.91; 3.
6. .93, .83, .65.
7. 1.76, 1.58, 1.48; 1.5.
Chapter 12
1. MLE for θ is θ̂ = 0.
2. MLE for θ* is θ̂* = e^0 = 1.
3. BME for θ is θ̂ = 0.
4. BME for θ* is θ̂* = e^(−5) = .0067.
Author Index
Numbers in italics indicate the page on which the complete reference appears.
B

Bennett, J. A., 12, 25
Betz, N. E., 127, 127, 146, 148, 161
Bianchini, J. C., 96, 105
Birnbaum, A., 63, 64, 65, 67, 72, 80, 152, 160, 162, 173, 176, 186, 191
Blot, W., 14, 25
Bock, R. D., 12, 21, 25, 189, 191
Brogden, H. E., 12, 25

C

Chambers, E. A., 14, 25
Charles, J. W., 107, 112

D

Dahm, P. A., 12, 25
David, C. E., 160, 162, 176
Deal, R., 131, 148
DeGraff, M. H., 107, 112
DeWitt, L. J., 160
Diamond, J., 227, 230
Dyer, A. R., 14, 25

E

Ebel, R. L., 107, 108, 112, 113, 227, 230
Evans, W., 227, 230
Subject Index
Correlation (cont.)
  item-test, see Correlation, biserial
  spurious, 41
  tetrachoric, 19-21, 39, 41-42
Cramer-Rao inequality, 71
Cutting score, 255, 262
  mastery test, 162-175

D

Decision rule, 163-169
Delta, 34
Difficulty, see Item difficulty; Test difficulty
Dimensionality, 19-21, 35, 68
Discriminating power
  and bimodality, 245
  definition, 13
  effects of, 40-41
  and information, 152
  item bias in, 217
  and item-test biserial, 33-43
  and item weight, 23, 75
  tailored test, 159

E

Efficiency, relative, 23, 83-104, 110
  approximation, 91-101
Equating, 76, 193-211, 236
  with an anchor test, 200-205
  equipercentile, 92, 203, 207
  for raw scores, 202
  second-stage tests, 140
  true-score, 199-205, 210, 256
Equipercentile relationship, 194-211, 256, see also Equating, equipercentile
Error of measurement, 4-7, 235-236, see also Standard error of measurement
Examinees, low ability, 37, 75, 103, 110, 183

F

Factor, common, as ability, 19, 39
Factor analysis of items, 20-21
Flexilevel test, 115-127
Formula score, 226-230
Free-response item, 43, see also Guessing

G

Guessing, 31, see also c
  correction for, 102
  effect on estimation of bi, 37
  effect on information, 103
  effects of, 40-43, 244
  flexilevel test, 124-126
  and the item response function, 17
  no sufficient statistic, 58
  omits and formula scoring, 226-229
  and optimal item difficulty, 108-112, 152
  random, not assumed, 30
  and scoring weights, 23, 75, 77
  and tetrachoric correlation, 20
  two-stage tests, 138-139

I

Independence, local, 19
Indeterminacy of item parameters, 36-38, 184
Information function, 65-80
  flexilevel test, 122-126
  item, 21-23, 72-73
  in tailored testing, 151-153
  maximum, 112, 151-153
  target, 23, 72
  test, 21-23, 71-73
  for transformed ability, 84-90
  on true score, 89
  two-stage tests, 132-148
Information matrix, 180
Integral equation, 236
Invariance of item parameters, 34-38
iosr, 19, 27-30, 236, 251
Item
  analysis, 27-43
  bias, 212-223
  calibration, 154, 205
  choices, 17
    number of, 106-112
  difficulty
    corrected, 216
    effect of, 23, 102-104
    in IRT, 12-14
    and maximal information, 152
    optimal, 172
    proportion correct, 33-38, 213
    and scoring weight, 76
    standard error of, 185
Parallel forms
K bivariate score distribution of, 236, 255-264
in classical test theory, 3, 6
Kuder-Richardson lengthening a test, 65
formula 8, 20, 245 Parameters, unidentifiable, 184
formula 8,21 Path analysis, 6
Phi coefficient, 9, 41-43
Preequating, 205
Pseudo-chance score level, 203, 210, 244
L