
APPLICATIONS OF

ITEM RESPONSE THEORY


TO PRACTICAL TESTING PROBLEMS

FREDERIC M. LORD
Educational Testing Service
ROUTLEDGE
Taylor & Francis Group

NEW YORK AND LONDON


First published by
Lawrence Erlbaum Associates
10 Industrial Avenue
Mahwah, New Jersey 07430

Transferred to Digital Printing 2009 by Routledge


270 Madison Ave, New York NY 10016
2 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN

Copyright © 1980 by Lawrence Erlbaum Associates, Inc.


All rights reserved. No part of this book may be reproduced in
any form, by photostat, microform, retrieval system, or any other
means, without the prior written permission of the publisher.

Copyright is claimed until 1990. Thereafter all portions of this work
covered by this copyright will be in the public domain.

This work was developed under a contract with the National Institute of
Education, Department of Health, Education, and Welfare. However, the
content does not necessarily reflect the position or policy of that
Agency, and no official endorsement of these materials should be inferred.

Reprinted 2008 by Routledge

Routledge
Taylor and Francis Group
270 Madison Avenue
New York, NY 10016

Routledge
Taylor and Francis Group
2 Park Square
Milton Park, Abingdon
Oxon OX14 4RN

Library of Congress Cataloging in Publication Data


Lord, Frederic M 1912-
Applications of item response theory to practical
testing problems.
Bibliography: p.
Includes index.
1. Examinations. 2. Examinations—Evaluation.
I. Title.
LB3051.L64 371.2'6 79-24186
ISBN 0-89859-006-X

Publisher's Note
The publisher has gone to great lengths to ensure the quality of this reprint
but points out that some imperfections in the original may be apparent.
Contents

Preface xi

PART I: INTRODUCTION TO ITEM RESPONSE THEORY

1. Classical Test Theory—Summary and Perspective 3


1.1. Introduction 3
1.2. True Score 4
1.3. Uncorrelated Errors 6
1.4. Parallel Test Forms 6
1.5. Envoi 7
Appendix 7

2. Item Response Theory—Introduction and Preview 11


2.1. Introduction 11
2.2. Item Response Functions 12
2.3. Checking the Mathematical Model 15
2.4. Unidimensional Tests 19
2.5. Preview 21


3. Relation of Item Response Theory to


Conventional Item Analysis 27
3.1. Item-Test Regressions 27
3.2. Rationale for Normal Ogive Model 30
3.3. Relation to Conventional Item Statistics 33
3.4. Invariant Item Parameters 34
3.5. Indeterminacy 36
3.6. A Sufficient Condition for the
Normal Ogive Model 39
3.7. Item Intercorrelations 39
3.8. Illustrative Relationships Among
Test and Item Parameters 40
Appendix 41

4. Test Scores and Ability Estimates


as Functions of Item Parameters 44
4.1. The Distribution of Test Scores for
Given Ability 44
4.2. True Score 45
4.3. Standard Error of Measurement 46
4.4. Typical Distortions in
Mental Measurement 49
4.5. The Joint Distribution of Ability
and Test Scores 51
4.6. The Total-Group Distribution of
Number-Right Score 51
4.7. Test Reliability 52
4.8. Estimating Ability From Test Scores 52
4.9. Joint Distribution of Item Scores
for One Examinee 54
4.10. Joint Distribution of All Item Scores
on All Answer Sheets 55
4.11. Logistic Likelihood Function 56
4.12. Sufficient Statistics 57
4.13. Maximum Likelihood Estimates 58
4.14. Maximum Likelihood Estimation for
Logistic Items with ci = 0 59
4.15. Maximum Likelihood Estimation for
Equivalent Items 59
4.16. Formulas for Functions of the
Three-Parameter Logistic Function 60
4.17. Exercises 61
Appendix 63

5. Information Functions and Optimal Scoring Weights 65


5.1. The Information Function for a Test Score 65
5.2. Alternative Derivation of the
Score Information Function 68
5.3. The Test Information Function 70
5.4. The Item Information Function 72
5.5. Information Function for a
Weighted Sum of Item Scores 73
5.6. Optimal Scoring Weights 74
5.7. Optimal Scoring Weights Not Dependent on 0 76
5.8. Maximum Likelihood Estimate of Ability 77
5.9. Exercises 77
Appendix 78

PART II: APPLICATIONS OF ITEM RESPONSE THEORY

6. The Relative Efficiency of Two Tests 83


6.1. Relative Efficiency 83
6.2. Transformations of the Ability Scale 84
6.3. Effect of Ability Transformation on
the Information Function 84
6.4. Effect of Ability Transformation on
Relative Efficiency 88
6.5. Information Function of
Observed Score on True Score 89
6.6. Relation Between Relative Efficiency and
True-Score Distribution 90
6.7. An Approximation for Relative Efficiency 92
6.8. Desk Calculator Approximation for
Relative Efficiency 94
6.9. Relative Efficiency
of Seven Sixth-Grade Vocabulary Tests 96
6.10. Redesigning a Test 101
6.11. Exercises 104

7. Optimal Number of Choices Per Item 106


7.1. Introduction 106
7.2. Previous Empirical Findings 107
7.3. A Mathematical Approach 107
7.4. Grier's Approach 108
7.5. A Classical Test Theory Approach 108
7.6. An Item Response Theory Approach 110
7.7. Maximizing Information at a Cutting Score 112

8. Flexilevel Tests 114


8.1. Introduction 114
8.2. Flexilevel Tests 115
8.3. Scoring 116
8.4. Properties of Flexilevel Tests 117
8.5. Theoretical Evaluation
of Novel Testing Procedures 119
8.6. Conditional Frequency Distribution of
Flexilevel Test Scores 120
8.7. Illustrative Flexilevel Tests, No Guessing 122
8.8. Illustrative Flexilevel Tests, with Guessing 124
8.9. Conclusion 126
8.10. Exercises 127

9. Two-Stage Procedures and Multilevel Tests 128


9.1. Introduction 128
9.2. First Two-Stage Procedure—Assumptions 129
9.3. Scoring 130
9.4. Conditional Distribution of Test Score for Given θ 131
9.5. Illustrative 60-Item Two-Stage Tests,
No Guessing 132
9.6. Discussion of Results for 60-Item Tests
with No Guessing 135
9.7. Illustrative 15-Item Two-Stage Tests
with No Guessing 136
9.8. Illustrative 60-Item Two-Stage Tests
with Guessing 138
9.9. Converting a Conventional Test to a
Multilevel Test 140
9.10. The Relative Efficiency of a Level 141
9.11. Dependence of the Two-Stage Test on
its Levels 142
9.12. Cutting Points on the Routing Test 144
9.13. Results for Various Two-Stage Designs 144
9.14. Other Research 146
9.15. Exercises 146
Appendix 147

10. Tailored Testing 150


10.1. Introduction 150
10.2. Maximizing Information 151
10.3. Administering the Tailored Test 153

10.4. Calibrating the Test Items 154


10.5. A Broad-Range Tailored Test 154
10.6. Simulation and Evaluation 156
10.7. Results of Evaluation 157
10.8. Other Work on Tailored Tests 159

11. Mastery Testing 162


11.1. Introduction 162
11.2. Definition of Mastery 163
11.3. Decision Rules 163
11.4. Scoring the Test: The Likelihood Ratio 164
11.5. Losses 166
11.6. Cutting Score for the Likelihood Ratio 166
11.7. Admissible Decision Rules 168
11.8. Weighted Sum of Item Scores 169
11.9. Locally Best Scoring Weights 170
11.10. Cutting Point for Locally Best Scores 170
11.11. Evaluating a Mastery Test 171
11.12. Optimal Item Difficulty 172
11.13. Test Length 173
11.14. Summary of Mastery Test Design 174
11.15. Exercises 175

PART III: PRACTICAL PROBLEMS AND FURTHER APPLICATIONS

12. Estimating Ability and Item Parameters 179


12.1. Maximum Likelihood 179
12.2. Iterative Numerical Procedures 180
12.3. Sampling Variances of Parameter Estimates 181
12.4. Partially Speeded Tests 182
12.5. Floor and Ceiling Effects 182
12.6. Accuracy of Ability Estimation 183
12.7. Inadequate Data and
Unidentifiable Parameters 184
12.8. Bayesian Estimation of Ability 186
12.9. Further Theoretical Comparison of Estimators 187
12.10. Estimation of Item Parameters 189
12.11. Addendum on Estimation 189
12.12. The Rasch Model 189
12.13. Exercises 190
Appendix 191

13. Equating 193


13.1. Equating Infallible Measures 193
13.2. Equity 195
13.3. Can Fallible Tests be Equated? 196
13.4. Regression Methods 198
13.5. True-Score Equating 199
13.6. True-Score Equating
with an Anchor Test 200
13.7. Raw-Score "Equating"
with an Anchor Test 202
13.8. Illustrative Example 203
13.9. Preequating 205
13.10. Concluding Remarks 207
13.11. Exercises 207
Appendix 208

14. Study of Item Bias 212


14.1. Introduction 212
14.2. A Conventional Approach 213
14.3. Estimation Procedures 217
14.4. Comparing Item Response Functions
Across Groups 218
14.5. Purification of the Test 220
14.6. Checking the
Statistical Significance Test 221
Appendix 223

15. Omitted Responses and Formula Scoring 225


15.1. Dichotomous Items 225
15.2. Number-Right Scoring 225
15.3. Test Directions 226
15.4. Non-Reached Responses 226
15.5. Omitted Responses 226
15.6. Model for Omits Under Formula Scoring 227
15.7. The Practical Meaning of
an Item Response Function 227
15.8. Ignoring Omitted Responses 228
15.9. Supplying Random Responses 228
15.10. Procedure for Estimating Ability 229
15.11. Formula Scores 229

PART IV: ESTIMATING TRUE-SCORE DISTRIBUTIONS

16. Estimating True-Score Distributions 235


16.1. Introduction 235
16.2. Population Model 236
16.3. A Mathematical Solution
for the Population 237
16.4. The Statistical Estimation Problem 239
16.5. A Practical Estimation Procedure 239
16.6. Choice of Grouping 241
16.7. Illustrative Application 242
16.8. Bimodality 244
16.9. Estimated Observed-Score Distribution 245
16.10. Effect of a Change in Test Length 245
16.11. Effects of Selecting on Observed Score:
Evaluation of Mastery Tests 247
16.12. Estimating Item True-Score Regression 251
16.13. Estimating Item Response Functions 252

17. Estimated True-Score Distributions for Two Tests 254


17.1. Mathematical Formulation 254
17.2. Bivariate Distribution of
Observed Scores on Parallel Tests 255
17.3. True-Score Equating 255
17.4. Bivariate Distribution of
Observed Scores on Nonparallel Tests 257
17.5. Consequences of Selecting on
Observed Score 259
17.6. Matching Groups 262
17.7. Test Norms 263

Answers to Exercises 265

Author Index 267

Subject Index 269


Preface

The purpose of this book is to make it possible for measurement specialists to


solve practical testing problems by use of item response theory. This theory
expresses all the properties of the test, as a measuring instrument, in terms of the
properties of the test items. Practical applications include

1. The estimation of invariant parameters describing each test item; item


banking.
2. Estimating the statistical characteristics of a test for any specified group.
3. Determining how the effectiveness of a test varies across ability levels.
4. Comparing the effectiveness of different methods of scoring a test.
5. Selecting items to build a conventional test.
6. Redesigning a conventional test.
7. Design and evaluation of mastery tests.
8. Designing and evaluating novel testing methods, such as flexilevel tests,
two-stage tests, multilevel tests, tailored tests.
9. Equating and preequating.
10. Study of item bias.

The topics, organization, and presentation are those used in a 4-week seminar
held each summer for the past several years. The material is organized primarily
to maintain the reader's interest and to facilitate understanding; thus all related
topics are not always packed into the same chapter. Some knowledge of classical
test theory, mathematical statistics, and calculus is helpful in reading this mate-
rial.
Chapter 1, a perspective on classical test theory, is perhaps not essential for


the reader. Chapter 2, an introduction to item response theory, is easy to read.


Some of Chapter 3 is important only for those who need to understand the
relation of item response theory to classical item analysis. Chapter 4 is essential
to any real understanding of item response theory and applications. The reader
who takes the trouble to master the basic ideas of Chapter 4 will have little
difficulty in learning what he wants from the rest of the book. The information
functions of Chapter 5, basic to most applications of item response theory, are
relatively easy to understand.
The later chapters are mostly independent of each other. The reader may
choose those that interest him and ignore the others. Except in Chapter 11 on
mastery testing and Chapters 16 and 17 on estimating true-score distributions, the
reader can usually skip over the mathematics in the later chapters, if that suits his
purpose. He will still gain a good general understanding of the applications under
discussion provided he has previously understood Chapter 4.
The basic ideas of Chapters 16 and 17, on estimated true-score distributions,
are important for the future development of mental test theory. These chapters
are not a basic part of item response theory and may be omitted by the general
reader.
Reviewers will urge the need for a book on item response theory that does not
require the mathematical understanding required here. There is such a need; such
books will be written soon, by other authors (see Warm, 1978).
Journal publications in the field of item response theory, including publica-
tions on the Rasch model, are already very numerous. Some of these publications
are excellent; some are exceptionally poor. The reader will not find all important
publications listed in this book, but he will find enough to guide him in further
search (see also Cohen, 1979).
I am very much in debt to Marilyn Wingersky for her continual help in the
theoretical, computational, mathematical, and instructional work underlying this
book. I greatly appreciate the help of Martha Stocking, who read (and checked) a
semifinal manuscript; the errors in the final publication were introduced by me
subsequent to her work. I thank William H. Angoff, Charles E. Davis, Ronald
K. Hambleton, and Hariharan Swaminathan and their students, Huynh Huynh,
Samuel A. Livingston, Donald B. Rubin, Fumiko Samejima, Wim J. van der
Linden, Wendy M. Yen, and many of my own students for their helpful com-
ments on part or all of earlier versions of the manuscript. I am especially indebted
to Donna Lembeck who typed innumerable revisions of text, formulas, and
tables, drew some of the diagrams, and organized production of the manuscript. I
would also like to thank Marie Davis and Sally Hagen for proofreading numerous
versions of the manuscript and Ann King for editorial assistance.
Most of the developments reported in this book were made possible by the
support of the Personnel and Training Branch, Office of Naval Research, in the
form of contracts covering the period 1952-1972, and by grants from the
Psychobiology Program of the National Science Foundation covering the period

1972-1976. This essential support was gratefully acknowledged in original jour-
nal publications; it is not detailed here. The publication of this book was made
possible by a contract with the National Institute of Education. All the work in
this book was made possible by the continued generous support of Educational
Testing Service, starting in 1948. Data for ETS tests are published here by
permission.

References
Cohen, A. S. Bibliography of papers on latent trait assessment. Evanston, Ill.: Region V Technical
Assistance Center, Educational Testing Service Midwestern Regional Office, 1979.
Warm, T. A. A primer of item response theory. Technical Report 941078. Oklahoma City, Okla.:
U.S. Coast Guard Institute, 1978.

FREDERIC M. LORD
I INTRODUCTION TO ITEM
RESPONSE THEORY
1 Classical Test Theory—
Summary and Perspective

1.1. INTRODUCTION

This chapter is not a substitute for a course in classical test theory. On the
contrary, some knowledge of classical theory is presumed. The purpose of this
chapter is to provide some perspective on basic ideas that are fundamental to all
subsequent work.
A psychological or educational test is a device for obtaining a sample of
behavior. Usually the behavior is quantified in some way to obtain a numerical
score. Such scores are tabulated and counted. Their relations to other variables of
interest are studied empirically.
If the necessary relationships can be established empirically, the scores may
then be used to predict some future behavior of the individuals tested. This is
actuarial science. It can all be done without any special theory. On this basis, it is
sometimes asserted from an operationalist viewpoint that there is no need for any
deeper theory of test scores.
Two or more "parallel" forms of a published test are commonly produced.
We usually find that a person obtains different scores on different test forms.
How shall these be viewed?
Differences between scores on parallel forms administered at about the same
time are usually not of much use for describing the individual tested. If we want a
single score to describe his test performance, it is natural to average his scores
across the test forms taken. For usual scoring methods, the result is effectively
the same as if all forms administered had been combined and treated as a single
test.
The individual's average score across test forms will usually be a better


measurement than his score on any single form, because the average score is
based on a larger sample of behavior. Already we see that there is something of
deeper significance than the individual's score on a particular test form.

1.2. TRUE SCORE


In actual practice we cannot administer very many forms of a test to a single
individual so as to obtain a better sample of his behavior. Conceptually, how­
ever, it is useful to think of doing just this, the individual remaining unchanged
throughout the process.
The individual's average score over a set of postulated test forms is a useful
concept. This concept is formalized by a mathematical model. The individual's
score X on a particular test form is considered to be a chance variable with some,
usually unknown, frequency distribution. The mean (expected value) of this
distribution is called the individual's true score T. Certain conclusions about true
scores T and observed scores X follow automatically from this model and defini­
tion.
Denote the discrepancy between T and X by
E ≡X - T; (1-1)
E is called the error of measurement. Since by definition the expected value of X
is T, the expectation of E is zero:
μE|T ≡ μ(X - T)|T ≡ μX|T - μT|T = T - T = 0, (1-2)

where μ denotes a mean and the subscripts indicate that T is fixed.


Equation (1-2) states that the errors of measurement are unbiased. This
follows automatically from the definition of true score; it does not depend on any
ad hoc assumption. By the same argument, in a group of people,
μT ≡ μX - μE ≡ μX.

Equation (1-2) gives the regression of E on T. Since mean E is constant regard­


less of T, this regression has zero slope. It follows that true score and error are
uncorrelated in any group:
ρET = 0. (1-3)
Note, again, that this follows from the definition of true score, not from any
special assumption.
From Eq. (1-1) and (1-3), since T and E are uncorrelated, the observed-score
variance in any group is made up of two components:
σ2X ≡ σ2(T+E) ≡ σ2T + σ2E. (1-4)
The covariance of X and T is
σXT ≡ σ(T+E)T = σ2T + σET = σ2T. (1-5)

An important quantity is the test reliability, the squared correlation between X


and T, by (1-5),

ρ2XT ≡ σ2XT/(σ2Xσ2T) = σ2T/σ2X = 1 - σ2E/σ2X. (1-6)

If ρXT were nearly 1.00, we could safely substitute the available test score X for
the unknown measurement of interest T.
Equations (1-2) through (1-6) are tautologies that follow automatically from
the definition of T and E.
What has our deeper theory gained for us? The theory arises from the realiza­
tions that T, not X, is the quantity of real interest. When a job applicant leaves
the room where he was tested, it is T, not X, that determines his capacity for
future performance.
We cannot observe T, but we can make useful inferences about it. How this is
done becomes apparent in subsequent sections (also, see Section 4.2).
An example will illustrate how true-score theory leads to different conclusions
than would be reached by a simple consideration of observed scores. An
achievement test is administered to a large group of children. The lowest scoring
children are selected for special training. A week later the specially trained
children are retested to determine the effect of the training.
True-score theory shows that a person may receive a very low test score either
because his true score is low or because his error score E is low (he was
unlucky), or both. The lowest scoring children in a large group most likely have
not only low T but also low E. If they are retested, the odds are against their being
so unlucky a second time. Thus, even if their true scores have not increased, their
observed scores will probably be higher on the second testing. Without true-score
theory, the probable observed-score increase would be credited to the special
training. This effect has caused many educational innovations to be mistakenly
labeled "successful."
It is true that repeated observations of test scores and retest scores could lead
the actuarial scientist to the observation that in practice, other things being equal,
initially low-scoring children tend to score higher on retesting. The important
point is that true-score theory predicts this conclusion before any tests are given
and also explains the reason for this odd occurrence. For further theoretical
discussion, see Linn and Slinde (1977) and Lord (1963). In practical applica­
tions, we can determine the effects of special training for the low-scoring chil­
dren by splitting them at random into two groups, comparing the experimental
group that received the training with the control group that did not.
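The point can be seen in a small simulation (a sketch added here for illustration, not part of the original text; the normal distributions and numerical values are arbitrary assumptions). True scores are held fixed between the two testings, yet the lowest-scoring examinees improve on retest simply because their first-occasion errors were unusually negative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
true = rng.normal(50, 10, n)          # true scores T (arbitrary scale)
x1 = true + rng.normal(0, 5, n)       # first testing:  X1 = T + E1
x2 = true + rng.normal(0, 5, n)       # second testing: X2 = T + E2, same T, new errors

low = x1 < np.quantile(x1, 0.10)      # the lowest-scoring tenth, "selected for training"
print(x1[low].mean())                 # low, partly because E1 is unusually negative here
print(x2[low].mean())                 # noticeably higher on retest, though T never changed
print(true[low].mean())               # close to the retest mean, as true-score theory predicts
```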
Note that we do not define true score as the limit of some (operationally
impossible) process. The true score is a mathematical abstraction. A statistician
doing an analysis of variance components does not try to define the model

parameters as if they actually existed in the real world. A statistical model is


chosen, expressed in mathematical terms undefined in the real world. The ques­
tion of whether the real world corresponds to the model is a separate question to
be answered as best we can. It is neither necessary nor appropriate to define a
person's true score or other statistical parameter by real world operational proce-
dures.

1.3. UNCORRELATED ERRORS

Equations (1-1) through (1-6) cannot be disproved by any set of data. These
equations do not enable us to estimate σ2T, σ2E, or ρXT, however. To estimate
these important quantities, we need to make some assumptions. Note that no
assumption about the real world has been made up to this point.
It is usual to assume that errors of measurement are uncorrelated with true
scores on different tests and with each other: For tests X and Y,
ρ(EX, EY) = 0, ρ(EX, TY) = 0 (X ≠ Y). (1-7)
Exceptions to these assumptions are considered in path analysis (Hauser &
Goldberger, 1971; Milliken, 1971; Werts, Linn, & Jöreskog, 1974; Werts,
Rock, Linn, & Jöreskog, 1977).

1.4. PARALLEL TEST FORMS

If a test is constructed by random sampling from a pool or "universe" of items,


then σ2E, σ2T, and ρXT can be estimated without building any parallel test forms
(Lord & Novick, 1968, Chapter 11). But perhaps we do not wish to assume that
our test was constructed in this way. If three or more roughly parallel test forms
are available, these same parameters can be estimated by the theory of nominally
parallel tests (Lord & Novick, 1968, Chapter 8; Cronbach, Gleser, Nanda, &
Rajaratnam, 1972), an application of analysis of variance components.
In contrast, classical test theory assumes that we can build strictly parallel test
forms. By definition, every individual has (1) the same true score and (2) the
same conditional error variance σ2(E|T) on all strictly parallel forms:
T = T', σ2(E|T) = σ2(E'|T'), (1-8)
where the prime denotes a (strictly) parallel test. It follows that σ2X = σ2X'.
When strictly parallel forms are available, the important parameters of the
latent variables T and E can be estimated from the observed-score variance and
from the intercorrelation between parallel test forms by the following familiar
equations of classical test theory:
ρ2XT (= ρ2X'T') = ρXX', (1-9)

σ2T (= σ2T') = σ2X ρXX', (1-10)

σ2E (= σ2E') = σ2X(1 - ρXX'). (1-11)
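As a minimal computational illustration (a sketch, not from the book; the scores below are invented), Eq. (1-9) through (1-11) turn the correlation between two strictly parallel forms into estimates of the reliability, the true-score variance, and the error variance:

```python
import numpy as np

# Observed scores of the same ten examinees on two strictly parallel forms (invented data)
x = np.array([23, 31, 28, 35, 19, 27, 30, 25, 33, 22], dtype=float)
x_prime = np.array([25, 29, 30, 34, 21, 26, 28, 27, 31, 20], dtype=float)

rho_xx = np.corrcoef(x, x_prime)[0, 1]   # parallel-forms reliability, Eq. (1-9)
var_x = x.var()                          # observed-score variance
var_t = var_x * rho_xx                   # true-score variance, Eq. (1-10)
var_e = var_x * (1 - rho_xx)             # error variance, Eq. (1-11)
print(rho_xx, var_t, var_e)
```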

1.5. ENVOI

In item response theory (as discussed in the remaining chapters of this book) the
expected value of the observed score is still called the true score. The discrep­
ancy between observed score and true score is still called the error of measure­
ment. The errors of measurement are thus necessarily unbiased and uncorrelated
with true score. The assumptions of (1-7) will be satisfied also; thus all the
remaining equations in this chapter, including those in the Appendix, will hold.
Nothing in this book will contradict either the assumptions or the basic con­
clusions of classical test theory. Additional assumptions will be made; these will
allow us to answer questions that classical theory cannot answer. Although we
will supplement rather than contradict classical theory, it is surprising how little
we will use classical theory explicitly.
Further basic ideas and formulas of classical test theory are summarized for
easy reference in an appendix to this chapter. The reader may skip to Chapter 2.

APPENDIX

Regression and Attenuation


From (1-9), (1-10), (1-11) we obtain formulas for the linear regression coeffi­
cients:
βXT = 1, βTX = ρXX'. (1-12)

Let ξ and η be the true scores on tests X and Y, respectively. As in (1-5), σξη =
σXY. From this and (1-10) we find the important correction for attenuation,

ρξη = σξη/(σξση) = σXY/(σXσY√(ρXX'ρYY')) = ρXY/√(ρXX'ρYY'). (1-13)

From this comes a key inequality:

√ρXX' ≥ ρXY. (1-14)
This says that test validity (correlation of test score X with any criterion Y) is
never greater than the square root of the test reliability.
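For example (with illustrative numbers, not from the book), if ρXY = .48, ρXX' = .64, and ρYY' = .81, then (1-13) gives ρξη = .48/√(.64 × .81) = .48/.72 = .67, while (1-14) is satisfied since √.64 = .80 ≥ .48.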

Composite Tests
Up to this point, there has been no assumption that our test is composed of
subtests or of test items. If the test score X is a sum of subtest or item scores Yi,

so that

X = Σi Yi (i = 1, . . . , n),

then certain tautologies follow:

σ2X = Σi σ2i + Σ Σi≠j σij, (1-15)

where σi ≡ σ(Yi) and σij ≡ σ(Yi, Yj). Similarly,

ρXX' = Σi Σi' σii' / σ2X, (1-16)

where i' indexes the items in test X'. If all subtests are parallel,

ρXX' = nρYY' / [1 + (n - 1)ρYY'], (1-17)

the Spearman-Brown formula.
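For example (illustrative numbers only), lengthening a test to n = 5 parallel subtests, each of reliability ρYY' = .50, gives by (1-17) ρXX' = 5(.50)/[1 + 4(.50)] = .83.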


Coefficient alpha (α) is obtained from (1-15) and (1-16) and from the
Cauchy-Schwarz inequality:

ρ2XT = ρXX' ≥ [n/(n - 1)][1 - Σi σ2i/σ2X] ≡ α. (1-18)

Alpha is not a reliability coefficient; it is a lower bound.
If items are scored either 0 or 1, α becomes the Kuder-Richardson formula-20
coefficient ρ20: from (1-18) and (1-23),

ρ2XT = ρXX' ≥ [n/(n - 1)][1 - Σi πi(1 - πi)/σ2X] = ρ20, (1-19)

where πi is the proportion of correct answers (Yi = 1) for item i. Also,

ρ20 ≥ [n/(n - 1)][1 - μX(n - μX)/(nσ2X)] = ρ21, (1-20)

the Kuder-Richardson formula-21 coefficient.
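A minimal computational sketch (not part of the original text; the tiny item-score matrix is invented) of coefficient α from (1-18), which for 0/1 item scores is the KR-20 coefficient of (1-19):

```python
import numpy as np

# Rows are examinees, columns are dichotomously scored items (1 = right, 0 = wrong)
u = np.array([[1, 1, 0, 1],
              [1, 0, 0, 0],
              [1, 1, 1, 1],
              [0, 1, 0, 1],
              [1, 1, 1, 0]], dtype=float)

n_items = u.shape[1]
item_var = u.var(axis=0)          # σ2i = πi(1 - πi) for 0/1 items, Eq. (1-23)
total_var = u.sum(axis=1).var()   # σ2X of the number-right score X
alpha = n_items / (n_items - 1) * (1 - item_var.sum() / total_var)   # Eq. (1-18)
print(alpha)
```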

Item Theory
Denote the score on item i by Yi. Classical item analysis provides various
tautologies. The variance of the test scores is

σ2X =Σi 2σσρ


3
i j ij = σX
2σρ
i
i iX , (1-21)

where ρij and ρix are Pearson product moment correlation coefficients. If Yi is
always 0 or 1, then X is the number-right score, the interitem correlation ρij is a
APPENDIX 9

phi coefficient, and ρix is an item-test point biserial correlation. Classical item
analysis theory may deal also with the biserial correlation between item score and
test score and with the tetrachoric correlations between items (see Lord &
Novick, 1968, Chapter 15). In the case of dichotomously scored items (Yi = 0 or
1), we have
μX = Σi πi, (1-22)

σ2i = πi(1 - πi). (1-23)


From (1-18) and (1-21), coefficient α is
α = [n/(n - 1)][1 - Σi σ2i / Σi Σj σiσjρij]. (1-24)

If C is an outside criterion, the test validity coefficient is

ρXC = Σi σiρiC / √(Σi Σj σiσjρij). (1-25)

These two formulas provide the two paradoxical classical rules for building a
test:

1. To maximize test reliability, choose test items that correlate as high as


possible with each other.
2. To maximize validity, choose test items that correlate as high as possible
with the criterion and as low as possible with each other.

Overview
Classical test theory is based on the weak assumptions (1-7) plus the assumption
that we can build strictly parallel tests. Most of its equations are unlikely to be
contradicted by data. Equations (1-1) through (1-13) are unlikely to be falsified,
since they involve the unobservable variables T and E. Equations (1-15), (1-16),
and (1-20)-(1-25) cannot be falsified because they are tautologies.
The only remaining equations of those listed are (1-14) and (1-17)-(1-19).
These are the best known and most widely used practical outcomes of classical
test theory. Suppose when we substitute sample statistics for parameters in
(1-17), the equality is not satisfied. We are likely to conclude that the discrep­
ancies are due to sampling fluctuations or else that the subtests are not really
strictly parallel.
The assumption (1-7) of uncorrelated errors is also open to question, however.
Equations (1-7) can sometimes be disproved by path analysis methods. Similar
comments apply to (1-14), (1-18), and (1-19).

Note that classical test theory deals exclusively with first and second moments:
with means, variances, and covariances. An extension of classical test theory
to higher-order moments is given in Lord and Novick (1968, Chapter 10). With-
out such extension, classical test theory cannot investigate the linearity or non-
linearity of a regression, nor the normality or nonnormality of a frequency
distribution.

REFERENCES

Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. The dependability of behavioral
measurements: Theory of generalizability for scores and profiles. New York: Wiley, 1972.
Hauser, R. M., & Goldberger, A. S. The treatment of unobservable variables in path analysis. In H.
L. Costner (Ed.), Sociological methodology, 1971. San Francisco: Jossey-Bass, 1971.
Linn, R. L., & Slinde, J. A. The determination of the significance of change between pre- and
posttesting periods. Review of Educational Research, 1977, 47, 121-150.
Lord, F. M. Elementary models for measuring change. In C. W. Harris (Ed.), Problems in measur-
ing change. Madison: University of Wisconsin Press, 1963.
Lord, F. M., & Novick, M. R. Statistical theories of mental test scores. Reading, Mass.: Addison-
Wesley, 1968.
Milliken, G. A. New criteria for estimability for linear models. The Annals of Mathematical Statis-
tics, 1971, 42, 1588-1594.
Werts, C. E., Linn, R. L., & Jöreskog, K. G. Intraclass reliability estimates: Testing structural
assumptions. Educational and Psychological Measurement, 1974, 34, 25-33.
Werts, C. E., Rock, D. A., Linn, R. L., & Jöreskog, K. G. Validating psychometric assumptions
within and between several populations. Educational and Psychological Measurement, 1977,
37, 863-872.
2 Item Response Theory—
Introduction and Preview

2.1. INTRODUCTION

Commonly, a test consists of separate items and the test score is a (possibly
weighted) sum of item scores. In this case, statistics describing the test scores of
a certain group of examinees can be expressed algebraically in terms of statistics
describing the individual item scores for the same group [see Eq. (1-21) to
(1-25)]. As already noted, classical item theory (which is only a part of classical
test theory) consists of such algebraic tautologies.
Such a theory makes no assumptions about matters that are beyond the control
of the psychometrician. It cannot predict how individuals will respond to items
unless the items have previously been administered to similar individuals. In
practical test development work, we need to be able to predict the statistical and
psychometric properties of any test that we may build when administered to any
target group of examinees. We need to describe the items by item parameters and
the examinees by examinee parameters in such a way that we can predict prob-
abilistically the response of any examinee to any item, even if similar examinees
have never taken similar items before. This involves making predictions about
things beyond the control of the psychometrician—predictions about how people
will behave in the real world.
As an especially clear illustration of the need for such a theory, consider the
basic problem of tailored testing: Given an individual's response to a few items
already administered, choose from an available pool one item to be administered
to him next. This choice must be made so that after repeated similar choices the
examinee's ability or skill can be estimated as accurately as possible from his
responses. To do this even approximately, we must be able to estimate the


examinee's ability from any set of items that may be given to him. We must also
know how effective each item in the pool is for measuring at each ability level.
Neither of these things can be done by means of classical mental test theory.
In most testing work, our main task is to infer the examinee's ability level or
skill. In order to do this, we must know something about how his ability or skill
determines his response to an item. Thus item response theory starts with a
mathematical statement as to how response depends on level of ability or skill.
This relationship is given by the item response function (trace line, item charac­
teristic curve).
This book deals chiefly with dichotomously scored items. Responses will be
referred to as right or wrong (but see Chapter 15 for dealing with omitted
responses). Early work in this area was done by Brogden (1946), Lawley (1943),
Lazarsfeld (see Lazarsfeld & Henry, 1968), Lord (1952), and Solomon (1961),
among others. Some polychotomous item response models are treated by Ander­
sen (1973a, b), Bock (1972, 1975), and Samejima (1969, 1972). Related models
in bioassay are treated by Aitchison and Bennett (1970), Amemiya (1974a, b, c),
Cox (1970), Finney (1971), Gurland, Ilbok, and Dahm (1960), Mantel (1966),
van Strik (1960).

2.2. ITEM RESPONSE FUNCTIONS

Let us denote by θ the trait (ability, skill, etc.) to be measured. For a dichotomous
item, the item response function is simply the probability P or P(θ) of a correct
response to the item. Throughout this book, it is (very reasonably) assumed that
P(θ) increases as θ increases. A common assumption is that this probability can
be represented by the (three-parameter) logistic function

P = P(θ) = c + (1 - c)/{1 + exp[-1.7a(θ - b)]}, (2-1)

where a, b, and c are parameters characterizing the item, and e is the mathemat­
ical constant 2.71828. . . . Logistic item response functions for 50 four-choice
word-relations items are shown in Fig. 2.2.1 to illustrate the variety found in a
typical published test. This logistic model was originated and developed by Allan
Birnbaum.
Figure 2.2.2 illustrates the meaning of the item parameters. Parameter c is the
probability that a person completely lacking in ability (θ = -∞) will answer the
item correctly. It is called the guessing parameter or the pseudo-chance score
level. If an item cannot be answered correctly by guessing, then c = 0.
Parameter b is a location parameter: It determines the position of the curve
along the ability scale. It is called the item difficulty. The more difficult the item,
the further the curve is to the right. The logistic curve has its inflexion point at
θ = b. When there is no guessing, b is the ability level where the probability of a
correct answer is .5. When there is guessing, b is the ability level where the
probability of a correct answer is halfway between c and 1.0.

FIG. 2.2.1. Item response functions for SCAT II Verbal Test, Form 2B.
Parameter a is proportional to the slope of the curve at the inflexion point [this
slope actually is .425a(1 - c)]. Thus a represents the discriminating power of
the item, the degree to which item response varies with ability level.
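As a concrete sketch (added here for illustration, not part of the original; the parameter values are invented), the logistic model (2-1) is a one-line function of θ and the three item parameters:

```python
import numpy as np

def p_3pl(theta, a, b, c):
    """Three-parameter logistic item response function, Eq. (2-1)."""
    return c + (1 - c) / (1 + np.exp(-1.7 * a * (theta - b)))

theta = np.linspace(-3, 3, 7)
print(p_3pl(theta, a=1.2, b=0.5, c=0.2))   # rises from near c toward 1.0 as θ increases
```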
An alternative form of item response function is also frequently used: the
(three-parameter) normal ogive,

P ≡ P(θ) = c + (1 - c) ∫_{-∞}^{a(θ-b)} (1/√(2π)) e^{-t²/2} dt. (2-2)

Again, c is the height of the lower asymptote; b is the ability level at the point of
inflexion, where the probability of a correct answer is (1 + c)/2; a is propor-
tional to the slope of the curve at the inflexion point [this slope actually is
a(1 - c)/√(2π)].

FIG. 2.2.2. Meaning of item parameters (see text).
The difference between functions (2-1) and (2-2) is less than .01 for every set
of parameter values. On the other hand, for c = 0, the ratio of the logistic
function to the normal function is 1.0 at a(θ - b) = 0, .97 at -1, 1.4 at -2, 2.3
at -2.5, 4.5 at -3, and 34.8 at -4. The two models (2-1) and (2-2) give very
similar results for most practical work.
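The closeness of the two models is easy to verify numerically; the following sketch (an illustration assuming scipy is available, not from the book) compares (2-1) and (2-2) over a grid of θ:

```python
import numpy as np
from scipy.stats import norm

a, b, c = 1.0, 0.0, 0.2
theta = np.linspace(-4, 4, 801)
logistic = c + (1 - c) / (1 + np.exp(-1.7 * a * (theta - b)))   # Eq. (2-1)
ogive = c + (1 - c) * norm.cdf(a * (theta - b))                 # Eq. (2-2)
print(np.abs(logistic - ogive).max())   # less than .01, consistent with the text
```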
The reader may ask for some a priori justification of (2-1) or (2-2). No
convincing a priori justification exists (however, see Chapter 3). The model must
be justified on the basis of the results obtained, not on a priori grounds.
No one has yet shown that either (2-1) or (2-2) fits mental test data signifi­
cantly better than the other. The following references are relevant for any statisti­
cal investigation along these lines: Chambers and Cox (1967), Cox (1961, 1962),
Dyer (1973, 1974), Meeter, Pirie, and Blot (1970), Pereira (1977a, b), Quesen-
berry and Starbuck (1976), Stone (1977).
In principle, examinees at high ability levels should virtually never answer an
easy item incorrectly. In practice, however, such an examinee will occasionally
make a careless mistake. Since the logistic function approaches its asymptotes
less rapidly than the normal ogive, such careless mistakes will do less violence to
the logistic than to the normal ogive model. This is probably a good reason for
preferring the logistic model in practical work.
Prentice (1976) has suggested a two-parameter family of functions that in­
cludes both (2-1) and (2-2) when a = 1, b = 0, and c = 0 and also includes a
variety of skewed functions. The location, scale, and guessing parameters are
easily added to obtain a five-parameter family of item response curves, each item
being described by five parameters.

2.3. CHECKING THE MATHEMATICAL MODEL

Either (2-1) or (2-2) may provide a mathematical statement of the relation be-
tween the examinee's ability and his response to a test item. A more searching
consideration of the practical meaning of (2-1) and (2-2) is found in Section 15.7.
Such mathematical models can be used with confidence only after repeated
and extensive checking of their applicability. If ability could be measured accu-
rately, the models could be checked directly. Since ability cannot be measured
accurately, checking is much more difficult. An ideal check would be to infer
from the model the small-sample frequency distribution of some observable
quantity whose distribution does not depend on unknown parameters. This does
not seem to be possible in the present situation.
The usual procedure is to make various tangible predictions from the model
and then to check with observed data to see if these predictions are approximately
correct. One substitutes estimated parameters for true parameters and hopes to
obtain an approximate fit to observed data. Just how poor a fit to the data can be
tolerated cannot be stated exactly because exact sampling variances are not
known. Examples of this sort of check on the model are found throughout this
book. See especially Fig. 3.5.1. If time after time such checks are found to be
satisfactory, then one develops confidence in the practical value of the model for
predicting observable results.
Several researchers have produced simulated data and have checked the fit of
estimated parameters to the true parameters (which are known since they were
used to generate the data). Note that this convenient procedure is not a check on
the adequacy of the model for describing the real world. It is simply a check on
the adequacy of whatever procedures the researcher is using for parameter esti-
mation (see Chapter 12).
At this point, let us look at a somewhat different type of check on our item
response model (2-1). The solid curves in Fig. 2.3.1 are the logistic response
curves for five SAT verbal items estimated from the response data of 2862
students, using the methods of Chapter 12. The dashed curves were estimated,
almost without assumption as to their mathematical form, from data on a total
sample of 103,275 students, using the totally different methods of Section 16.13.
The surprising closeness of agreement between the logistic and the unconstrained
item response functions gives us confidence in the practical value of the logistic
model, at least for verbal items like these.
The following facts may be noted, to point up the significance of this result:

1. The solid and dashed curves were obtained from totally different assump-
tions. The solid curve assumes the logistic function, also that the test items all
measure just one psychological dimension. The dashed curve assumes only that
the conditional distribution of number-right observed score for given true score is
a certain approximation to a generalized binomial distribution.
FIG. 2.3.1. Five item characteristic curves estimated by two different methods. (From F. M. Lord, Item characteristic curves estimated without knowledge of their mathematical form—a confrontation of Birnbaum's logistic model. Psychometrika, 1970, 35, 43-50.)

2. The solid and dashed curves were obtained from different kinds of raw
data. The solid curve comes from an analysis of all the responses of a sample of
students to all 90 SAT verbal items. The dashed curve is obtained just from
frequency distributions of number-right scores on the SAT verbal test and, in a
minor way, from the variance across items of the proportion of correct answers to
the item.
3. The solid curve is a logistic function. The dashed curve is the ratio of two
polynomials, each of degree 89.
4. The solid curve was estimated from a bimodal sample of 2862 examinees,
selected by stratified sampling to include many high-ability and many low-ability
students. The dashed curve was estimated from all 103,275 students tested in a
regular College Board test administration.

Further details of this study are given in Sections 16.12 and 16.13.
These five items are the only items to be analyzed to date by this method. The
five items were chosen solely for the variety of shapes represented. If a hundred
or so items were analyzed in this way, it is likely that some poorer fits would be
found.
It is too much to expect that (2-1) or (2-2) will hold exactly for every test item
and for every examinee. If some examinees become tired, sick, or uncooperative
partway through the testing, the mathematical model will not be strictly appro-
priate for them. If some test items are ambiguous, have no correct answer, or
have more than one correct answer, the model will not fit such items. If exam-
inees omit some items, skip back and forth through the test, and do not have time
to finish the test, perhaps marking all unfinished items at random, the model
again will not apply.
A test writer tries to provide attractive incorrect alternatives for each
multiple-choice item. We may imagine examinees so completely lacking in
ability that they do not even notice the attractiveness of such alternatives and so
respond to the items completely at random; their probability of success on such
items will be 1/A, where A is the number of alternatives per item. We may also
imagine other examinees with sufficient ability to see the attractiveness of the
incorrect alternatives although still lacking any knowledge of the correct answer;
their probability of success on such items is often less than 1/A. If this occurs, the
item response function is not an increasing function of ability and cannot be fitted
by any of the usual mathematical models.
We might next imagine examinees who have just enough ability to eliminate
one (or two, or three,.. .) of the incorrect alternatives from consideration, al-
though still lacking any knowledge of the correct answer. Such examinees might
be expected to have a chance of 1/(A - 1) (or 1/(A - 2), 1/(A - 3),. . .) of
answering the item correctly, perhaps producing an item response function look-
ing like a staircase.
Such anticipated difficulties deterred the writer for many years from research
on item response theory. Finally, a large-scale empirical study of 150 five-choice
items was made to determine proportion of correct answers as a function of
number-right test score. With a total of 103,275 examinees, these proportions
could be determined with considerable accuracy. Out of 150 items, only six were
found that clearly failed to be increasing functions of total test score, and for
these the failure was so minor as to be of little practical importance. The results
for the two worst items are displayed in Figure 2.3.2; the crosses show where the
curve would have been if examinees omitting the item had chosen at random
among the five alternative responses instead. No staircase functions or other
serious difficulties were found.

FIG. 2.3.2. Proportion of correct answers to an item as a function of number-right test score. The two items shown are the two worst examples of nonmonotonicity among the 150 items studied.

2.4. UNIDIMENSIONAL TESTS

Equation (2-1) or (2-2) asserts that probability of success on an item depends on


three item parameters, on examinee ability θ, and on nothing else. If the model is
true, a person's ability θ is all we need in order to determine his probability of
success on a specified item. If we know the examinee's ability, any knowledge of
his success or failure on other items will add nothing to this determination. (If it
did add something, then performance on the items in question would depend in
part on some trait other than θ; but this is contrary to our assumption.)
The principle just stated is Lazarsfeld's assumption of local independence.
Stated formally, Prob(success on item i given θ) = Prob(success on item i given
θ and given also his performance on items j, k, . . .). If ui = 0 or 1 denotes the
score on item i, then this may be written more compactly as

P(ui = 1|θ) = P(ui = 1|θ, uj, uk, . . .) (i ≠ j, k, . . .). (2-3)
A mathematically equivalent statement of local independence is that the prob­
ability of success on all items is equal to the product of the separate probabilities
of success. For just three items i, j , k, for example,
P(ui = 1, uj = 1, uk = 1|θ) = P(ui = 1|θ)P(uj = 1|θ)P(uk = 1|θ).
(2-4)

Local independence requires that any two items be uncorrelated when θ is


fixed. It definitely does not require that items be uncorrelated in ordinary groups,
where θ varies. Note in particular that local independence follows automatically
from unidimensionality. It is not an additional assumption.
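Computationally, (2-4) means that the probability of any response pattern, given θ, is just the product of Pi for the items answered correctly and Qi = 1 - Pi for the rest. A minimal sketch (invented parameters, not from the book):

```python
import numpy as np

def p_3pl(theta, a, b, c):
    return c + (1 - c) / (1 + np.exp(-1.7 * a * (theta - b)))

a = np.array([0.8, 1.2, 1.5])    # invented parameters for three items
b = np.array([-1.0, 0.0, 1.0])
c = np.array([0.2, 0.2, 0.2])
u = np.array([1, 1, 0])          # response pattern: right, right, wrong

theta = 0.5
p = p_3pl(theta, a, b, c)
pattern_prob = np.prod(p**u * (1 - p)**(1 - u))   # Eq. (2-4) under local independence
print(pattern_prob)
```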
If the items measure just one dimension (θ), if θ is normally distributed in the
group tested, and if model (2-2) holds with c = 0 (there is no guessing), then the
matrix of tetrachoric intercorrelations among the items will be of unit rank (see
Section 3.6). In this case, we can think of θ as the common factor of the items.
This gives us a clearer understanding of what is meant by θ and what is meant by
unidimensionality.

Note, however, that latent trait theory is more general than factor analysis.
Ability θ is probably not normally distributed for most groups of examinees.
Unidimensionality, however, is a property of the items; it does not cease to exist
just because we have changed the distribution of ability in the group tested.
Tetrachoric correlations are inappropriate for nonnormal distributions of ability;
they are also inappropriate when the item response function is not a normal
ogive. Tetrachoric correlations are always inappropriate whenever there is guess­
ing. This poses a problem for factor analysts in defining what is meant by
common factor, but it does not disturb the unidimensionality of a pool of items.
It seems plausible that tests of spelling, vocabulary, reading comprehension,
arithmetic reasoning, word analogies, number series, and various types of spatial
tests should be approximately one-dimensional. We can easily imagine tests that
are not. An achievement test in chemistry might in part require mathematical
training or arithmetic skill and in part require knowledge of nonmathematical
facts.
Item response theory can be readily formulated to cover cases where the test
items measure more than one latent trait. Practical application of multidimen­
sional item response theory is beyond the present state of the art, however,
except in special cases (Kolakowski & Bock, 1978; Mulaik, 1972; Samejima,
1974; Sympson, 1977).

FIG. 2.4.1. The 12 largest latent roots in order of size for the SCAT 2A Verbal Test.
There is great need for a statistical significance test for the unidimensionality
of a set of test items. An attempt in this direction has been made by Christof-
fersson (1975), Indow and Samejima (1962), and Muthén (1977).
A rough procedure is to compute the latent roots of the tetrachoric item
intercorrelation matrix with estimated communalities placed in the diagonal. If
(1) the first root is large compared to the second and (2) the second root is not
much larger than any of the others, then the items are approximately unidimen-
sional. This procedure is probably useful even though tetrachoric correlation
cannot usually be strictly justified. (Note that Jöreskog's maximum likelihood
factor analysis and accompanying significance tests are not strictly applicable to
tetrachoric correlation matrices.)
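A sketch of this rough check (not from the book; it assumes an item intercorrelation matrix R, e.g., of tetrachorics, has already been computed elsewhere, and it estimates each communality crudely by the largest off-diagonal correlation in its row):

```python
import numpy as np

def latent_roots(R):
    """Latent roots of R after placing rough communality estimates in the diagonal."""
    R = np.asarray(R, dtype=float)
    h2 = np.max(np.abs(R - np.eye(len(R))), axis=1)   # crude communality estimates
    R_adj = R.copy()
    np.fill_diagonal(R_adj, h2)
    return np.sort(np.linalg.eigvalsh(R_adj))[::-1]

# Invented 4 x 4 intercorrelation matrix for illustration
R = [[1.00, 0.50, 0.40, 0.45],
     [0.50, 1.00, 0.55, 0.50],
     [0.40, 0.55, 1.00, 0.40],
     [0.45, 0.50, 0.40, 1.00]]
print(latent_roots(R))   # one large root, small remaining roots: roughly unidimensional
```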
Figure 2.4.1 shows the first 12 latent roots obtained in this way for the SCAT
II Verbal Test, Form 2A. This test consists of 50 word-relations items. The data
were the responses of a sample of 3000 high school students. The plot suggests
that the items are reasonably one-dimensional.

2.5. PREVIEW

In order to motivate the detailed study of item response functions in succeeding


chapters, it seems worthwhile to outline briefly just a few of the practical results
to be developed. At this point, the reader should expect only a preview, not a
detailed explanation.
For each item there is an item information function I{θ, ui} that can be
determined from the formula
I{θ, ui} = (P'i)2/(PiQi), (2-5)
where Pi = Pi(θ) is the item response function, Qi = 1 - Pi, and P'i is the
derivative of Pi with respect to θ [the formula for P'i can be written out explicitly
once a particular item response function, such as (2-1) or (2-2), is chosen]. The
item information functions for the five items (10, 11, 13, 30, 47) in Fig. 2.3.1
are shown in Fig. 2.5.1.
The amount of information given by an item varies with ability level θ. The
higher the curve, the more the information. Information at a given ability level
varies directly as the square of the item discriminating power, ai. If one informa­
tion function is twice as high as another at some particular ability level, then it
will take two items of the latter type to measure as well as one item of the former
type at that ability level.
There is also a test information function I{θ}, which is inversely proportional
to the square of the length of the asymptotic confidence interval for estimating
the examinee's ability θ from his responses. It can be shown that the test informa-
tion function I{θ} is simply the sum of the item information functions:

I{θ} = Σi I{θ, ui}. (2-6)

The test information function for the five-item test is shown in Fig. 2.5.1.

FIG. 2.5.1. Item and test information functions. (From F. M. Lord, An analysis of the Verbal Scholastic Aptitude Test using Birnbaum's three-parameter logistic model. Educational and Psychological Measurement, 1968, 28, 989-1020.)
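A minimal sketch (not the author's program; the item parameters are invented) that evaluates (2-5) for each item of a small logistic item pool and sums across items as in (2-6); under model (2-1) the derivative has the closed form P'i = 1.7ai(Pi - ci)(1 - Pi)/(1 - ci):

```python
import numpy as np

def p_3pl(theta, a, b, c):
    return c + (1 - c) / (1 + np.exp(-1.7 * a * (theta - b)))

def item_information(theta, a, b, c):
    """I{θ, ui} = (P'i)^2 / (Pi Qi), Eq. (2-5), for the logistic model (2-1)."""
    p = p_3pl(theta, a, b, c)
    p_prime = 1.7 * a * (p - c) * (1 - p) / (1 - c)   # derivative of (2-1) with respect to θ
    return p_prime**2 / (p * (1 - p))

# Invented parameters for a five-item pool
a = np.array([1.5, 0.8, 1.2, 0.6, 1.0])
b = np.array([-1.0, -0.5, 0.0, 0.5, 1.5])
c = np.full(5, 0.2)

theta = np.linspace(-3, 3, 13)[:, None]      # column of ability levels
info = item_information(theta, a, b, c)      # one column of information values per item
test_info = info.sum(axis=1)                 # Eq. (2-6): test information is the sum
print(test_info.round(2))
```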
We have in (2-6) the very important result that when item responses are
optimally weighted, the contribution of the item to the measurement effectiveness
of the total test does not depend on what other items are included in the test. This
is a different situation from that in classical test theory, where the contribution of
each item to test reliability or to test validity depends inextricably on what other
items are included in the test.

Equation (2-6) suggests a convenient and effective procedure of test construc­


tion. The procedure operates on a pool of items that have already been calibrated,
so that we have the item information curve for each item.
1. Decide on the shape desired for the test information function. The desired
curve is the target information curve.
2. Select items with item information curves that will fill the hard-to-fill areas
under the target information curve.
3. Cumulatively add the item information curves, obtaining at all times the
information curve for the part-test composed of items already selected.
4. Continue (backtracking if necessary) until the area under target informa­
tion curve is filled to a satisfactory approximation.
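One way to code this procedure is a greedy selection loop that, at each step, picks the calibrated item filling the largest remaining gap under the target curve. The following is only a sketch under invented item parameters and an arbitrary flat target, not the author's algorithm:

```python
import numpy as np

def item_information(theta, a, b, c):
    p = c + (1 - c) / (1 + np.exp(-1.7 * a * (theta - b)))
    p_prime = 1.7 * a * (p - c) * (1 - p) / (1 - c)
    return p_prime**2 / (p * (1 - p))

rng = np.random.default_rng(1)
pool_a = rng.uniform(0.5, 2.0, 200)      # a calibrated pool of 200 items (invented parameters)
pool_b = rng.uniform(-2.5, 2.5, 200)
pool_c = np.full(200, 0.2)

theta = np.linspace(-2, 2, 9)
target = np.full(theta.shape, 10.0)      # step 1: desired (here flat) target information curve

selected, current = [], np.zeros_like(theta)
while np.any(current < target) and len(selected) < len(pool_a):
    deficit = np.clip(target - current, 0.0, None)
    gains = [np.minimum(item_information(theta, pool_a[i], pool_b[i], pool_c[i]),
                        deficit).sum() if i not in selected else -1.0
             for i in range(len(pool_a))]
    best = int(np.argmax(gains))                       # steps 2-3: pick and accumulate
    selected.append(best)
    current += item_information(theta, pool_a[best], pool_b[best], pool_c[best])

print(len(selected), current.round(1))                 # step 4: target approximately filled
```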
The test information function represents the maximal amount of information
that can be obtained from the item responses by any kind of scoring method. The
linear composite Σi wi*ui of item scores ui (= 0 or 1) with weights

wi* = P'i/(PiQi) (2-7)
is an optimal score yielding maximal information. The optimal score is not
directly useful since the optimal weights wi* depend on θ, which is unknown.
Very good scoring methods can be deduced from (2-7), however.
The logistic optimal weights for the five items of Fig. 2.3.1 are shown as
functions of θ in Fig. 2.5.2. It is obvious that the relative weighting of different
items is very different at low ability levels than at high ability levels. At high
levels, optimal item weights are proportional to item discriminating power ai. At
low ability levels, on the other hand, difficult items should receive near-zero
scoring weight, regardless of ai. The reason is that when low-ability examinees
guess at random on difficult items, this produces a random result that would
impair effective measurement if incorporated into the examinee's score; hence
the need for a near-zero scoring weight.
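A short sketch of (2-7) for one easy and one hard item (invented parameters, not from the book), showing the near-zero weight that a hard item receives at low ability:

```python
import numpy as np

def optimal_weight(theta, a, b, c):
    """wi* = P'i/(Pi Qi), Eq. (2-7), for the logistic model (2-1)."""
    p = c + (1 - c) / (1 + np.exp(-1.7 * a * (theta - b)))
    p_prime = 1.7 * a * (p - c) * (1 - p) / (1 - c)
    return p_prime / (p * (1 - p))

theta = np.array([-2.0, 0.0, 2.0])
print(optimal_weight(theta, a=1.0, b=-1.5, c=0.2))   # easy item: substantial weight at every level
print(optimal_weight(theta, a=1.0, b=2.0, c=0.2))    # hard item: weight near zero at θ = -2
```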
Two tests of the same trait can be compared very effectively in terms of their
information functions. The ratio of the information function of test y to the
information function of test x represents the relative efficiency of test y with
respect to x. Figure 6.9.1 shows the relative efficiency of a STEP vocabulary test
compared to a MAT vocabulary test. The STEP test is more efficient for low-
ability examinees, but much less efficient at higher ability levels. The dashed
horizontal line shows the efficiency that would be expected if the two tests
differed only in length (number of items).
Figure 6.10.1 shows the relative efficiency of variously modified hypothetical
SAT Verbal tests compared with an actual form of the test. Curve 2 shows the
effect of adding five items just like the five easiest items in the actual test. Curve
3 shows the effect of omitting five items of medium difficulty from the actual
test. Curve 4 shows the effect of replacing the five medium-difficulty items by

FIG. 2.5.2. Optimal (logistic) scoring weight for five items as a function of
ability level. (From F. M. Lord, An analysis of the Verbal Scholastic Aptitude
Test using Birnbaum's three-parameter logistic model. Educational and Psy-
chological Measurement, 1968, 28, 989-1020.)

the five additional easy items. Curve 6 shows the effect of discarding (not
scoring) the easier half of the test. Curve 7 shows the effect of discarding the
harder half of the test; notice that the resulting half-length test is actually better
for measuring low-ability examinees than is the regular full-length SAT. Curve 8
shows a hypothetical SAT just like the regular full-length SAT except that all
items are at the same middle difficulty level.
Results such as these are useful for planning revision of an existing test,
perhaps increasing its measurement effectiveness at certain specified ability
levels and decreasing its effectiveness at other levels. These and other useful
applications of item response theory are treated in detail in subsequent chapters.
3 Relation of Item Response Theory to Conventional Item Analysis

3.1. ITEM-TEST REGRESSIONS

In conventional item analysis, it is common to compare high- and low-scoring students on their proportion of correct answers. Sometimes the students may be
divided on test score into as many as five levels and then the levels compared on
proportion of correct answers. An extension of this would be to divide the stu­
dents into as many levels as there are test scores before making the comparison.
The proportion of correct answers to a (dichotomous) item is also the mean
item score, the mean of the statistical variable ui (= 0 or 1). Thus the curve
representing proportion of correct answers as a function of test score x is also the
regression of ui on x. Such a curve is called an item-observed score regression
(iosr). Note that for dichotomous items any item response function, as defined in
Chapter 2, can be considered by the same logic to be an item-ability regression,
the regression of ui on θ.
Figure 3.1.1 shows sample iosr for several SAT verbal and math items. Each
curve is computed from the responses of 103,275 examinees. The base line is
number-right score on the verbal test or on the math test omitting the item under
study. Points based on fewer than 50 examinees are not plotted.
These curves would be empirical item-response functions if the base line were θ instead of number-right score x. Thus it is common, although incorrect (as we
shall see), to think that an iosr, like an item-ability regression, will have at least
approximately an ogive shape, like Eq. (2-1) or (2-2).
To show that iosr cannot all be approximately normal ogives, consider a test
composed of n items. Denote the iosr for item i by
μi|x = E(ui|x),


FIG. 3.1.1. Selected item-test regressions for five-choice Scholastic Aptitude Test items (crosses
show regression when omitted responses are replaced by random responses).


the expectation being taken over all individuals at score level x. Now, for any
individual, x is the sum (over items) of his item scores; that is
x = Σ_{i=1}^{n} ui.   (3-1)

If we take the expectation of (3-1) for fixed x, we have

x = E(x|x) = E(Σ_{i=1}^{n} ui | x) = Σ_{i=1}^{n} E(ui|x).

Then by definition

Σ_{i=1}^{n} μi|x ≡ Σ_{i=1}^{n} E(ui|x) = x.   (3-2)

We can understand this general result most easily by considering the special
case when all the items are statistically equivalent. In this case, μi|x is by
definition the same for all items, so (3-2) can be written
Σ_{i=1}^{n} μi|x = nμi|x = x,

from which it follows that μi|x = x/n for each item. Thus the iosr of each item is a straight line through the origin with slope 1/n. Note that for statistically equivalent items μi|x = x/n even when the items are entirely uncorrelated with each other. The iosr has a slope of 1/n even when the test does not measure anything! This is still true if each item is negatively correlated with every other item!
All this proves that we cannot as a general matter expect item-observed score
regressions to be even approximately normal ogives. We shall not make further
use of item-observed score regressions in this book. The regression of item score
on true score is considered in Section 16.12.
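The result is easy to check by simulation. In the sketch below the items are statistically equivalent and mutually independent coin flips, so the test measures nothing at all, yet the item-observed-score regression still falls on the line x/n. The item count and sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n_items, n_people = 20, 200_000
u = (rng.random((n_people, n_items)) < 0.5).astype(int)   # independent, equivalent items
x = u.sum(axis=1)                                          # number-right score

for score in (5, 10, 15):
    observed = u[x == score, 0].mean()          # empirical E(u_1 | x = score)
    print(score, score / n_items, round(observed, 3))      # observed value matches x/n
```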

3.2. RATIONALE FOR NORMAL OGIVE MODEL

The writer prefers to consider the choice of item response function, such as Eq.
(2-1) or (2-2), as a basic assumption to be justified by methods discussed in
Section 2.3 rather than by any a priori argument. This is particularly wise when
there is guessing, since one assumption often used in this case to deduce Eq.
(2-1) or (2-2) from a priori considerations is that examinees either know the
correct answer to the item or else guess at random. This assumption is totally
unacceptable and would discredit the entire theory if the theory depended on it.
The alternate, acceptable point of view is simply that Eq. (2-1) and (2-2) are
useful as versatile formulas capable of adequately representing a wide variety of
ogive-shaped functions that increase monotonically from a lower asymptote to
1.00. Justification of their use is to be sought in the results achieved, not in
further rationalizations. In this section, a rationale is provided for Eq. (2-2) in a
rather specialized situation in order to make this and similar item response
models seem plausible, not with the idea of providing a firm basis for their use.
Suppose that there is an (unobservable) latent variable Yi' that determines
examinee performance on item i. If for some examinee Yi' is greater than some
constant γi, then he answers the item correctly, so that ui = 1. Similarly, if for
some examinee Yi' < γi, then ui = 0. (There is zero probability that Yi' = γi, so
we need not discuss this case.) From the point of view of the factor analyst, Yi' is
a composite of (1) the common factor θ of the test items and (2) a specific factor
or error factor for item i, not found in other items.
Note that the foregoing supposition rules out guessing. If the correct answer
can be obtained by a partially random process, then no attribute of the examinee
can determine whether ui = 0 or 1.
Assume now that
1. The regression μi'|θ of Yi' on θ is linear.
2. The scatter of Yi' about this regression is homoscedastic; in other words, the conditional variance σ'²i.θ about the regression line is the same for all θ.
3. The conditional distribution of Yi' given θ is normal.
Conditional distributions of Yi' for given θ are illustrated in Fig. 3.2.1. We
can see from the figure that the item response function Pi ≡ Pi(θ) ≡ Prob(ui = 1|θ) ≡ Prob(Yi' > γi|θ) is equal to a standardized normal curve area. A little algebra shows that it is equal to the standardized normal curve area above (γi − μ'i|θ)/σ'i.θ, which will be denoted by −Li.
For convenience, let us choose (as we may) the scale of measurement for both
Yi' and θ so that for the entire bivariate population the unconditional means of
both variables are 0 and their unconditional standard deviations are 1. Then the
equation for the regression of Yi' on θ is simply μ'i|θ = ρ'iθ, where ρ'i is the correlation between Yi' and θ. The conditional variance about this regression is, by standard formula, σ'²i.θ = 1 − ρ'²i. Making use of these last formulas, we have

−Li = (γi − ρ'iθ) / √(1 − ρ'²i) .
Let

ai ≡ ρ'i / √(1 − ρ'²i) ,   (3-3)

bi ≡ γi / ρ'i ,   (3-4)

so that −Li = ai(bi − θ).

FIG. 3.2.1. Hypothetical conditional distribution of Yi' for three levels of ability θ, showing the regression μ'i|θ and the cutting point γi that separates right answers from wrong answers.

For symmetric distributions,

∫_{−L}^{∞} = ∫_{−∞}^{L} ,

so, finally, the item response function is seen to be

Pi(θ) = ∫_{−∞}^{ai(θ−bi)} (1/√(2π)) e^{−t²/2} dt.   (3-5)

This is the same as Eq. (2-2) for the normal ogive item response function when ci
= 0.
Note that we have not made any assumption about the distribution of ability θ
in the total group tested. In particular, contrary to some assertions in the litera­
ture, we have not assumed that ability is normally distributed in the total group.
Furthermore, if (3-5) holds for some group of examinees, selection on θ will not
change the conditional distribution of Y'i for fixed θ and hence will not change
(3-5). Thus the shape of the distribution of ability θ in the total group tested is
irrelevant to our derivation of (3-5).
Equation (3-5) has the form of a cumulative frequency distribution, as do Eq.
(2-1) and (2-2) when c = 0. In general, however, there seems to be little reason
for thinking of an item response curve as a cumulative frequency distribution.

3.3. RELATION TO CONVENTIONAL ITEM STATISTICS

Conventional item analysis deals with πi, conventionally called the item diffi­
culty, the proportion of examinees answering item i correctly. It also deals with
ρ i x , the product moment correlation between item score ui and number-right test
score x, often called the point-biserial item-test correlation, or else with ρ'ix, the
corresponding biserial item-test correlation. A general formula for the relation of
biserial correlation (ρ') to point-biserial correlation is
ρ = ρ' φ(γ) / √(π(1 − π)) ,   (3-6)

where φ(γ) is the normal curve ordinate at the point γ that cuts off area π of the
standardized normal curve.
If ability θ is normally distributed and ci = 0, then by definition the
product-moment correlation ρ'iθ (or simply ρ'i) between Y'i and θ is also the
biserial correlation between ui and θ. Such a relationship is just what is meant by
biserial correlation.
There is also a product-moment or point-biserial correlation between ui and θ,
to be denoted by ρiθ. To the extent that number-right score x is a measure of
ability θ, ρix is an approximation to ρi ≡ ρiθ and ρ'ix is an approximation to ρ'i ≡ ρ'iθ.
Combined with (3-3), this (crude) approximation yields a conceptually illuminat­
ing crude relationship between the conventional item-test correlation and the ai
parameter of item response theory, valid only for the case where θ is normally
distributed and there is no guessing:
ai ≈ ρ'ix / √(1 − ρ'²ix)   (3-7)
and

ρ'ix ≈ ai / √(1 + a²i) ,   (3-8)

where ≈ denotes approximate equality. This shows that under the assumptions made, the item discrimination parameter ai and the item-test biserial correlation
ρ'ix are approximately monotonic increasing functions of each other.
Approximations (3-7) and (3-8) hold only if the unit of measurement for θ has
been chosen so that the mean of θ is 0 and the standard deviation is 1 (see Section
3.5). Approximations (3-7) and (3-8) do not hold unless θ is normally distributed
in the group tested. They do not hold if there is guessing. In addition, the
approximations fall short of accuracy because (1) the test score x contains errors
of measurement whereas θ does not; and (2) x and θ have differently shaped
distributions (the relation between x and θ is nonlinear).
Approximations (3-7) and (3-8) are given here not for practical use but rather

to give an idea of the nature of the item discrimination parameter ai. The relation
of ai to conventional item and test parameters is illustrated in Table 3.8.1.
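Under the same restrictive assumptions (normally distributed θ, no guessing), approximations (3-7) and (3-8) are one-line computations. The sketch below simply evaluates them; the input correlations are the values used for the column headings of Table 3.8.1.

```python
import math

def a_from_biserial(rho_prime):
    """Approximate item discrimination a from a biserial item-test correlation (Eq. 3-7)."""
    return rho_prime / math.sqrt(1.0 - rho_prime ** 2)

def biserial_from_a(a):
    """Inverse relation (Eq. 3-8)."""
    return a / math.sqrt(1.0 + a ** 2)

for rho in (0.2, 0.4, 0.6, 0.7, 0.8, 0.9):
    a = a_from_biserial(rho)
    print(rho, round(a, 2), round(biserial_from_a(a), 2))   # a = 0.20, 0.44, 0.75, 0.98, 1.33, 2.06
```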
Item i is answered correctly whenever the examinee's ability Yi' is greater
than γi. If ability θ is normally distributed, then Yi' will not only be conditionally
normally distributed for fixed θ but also unconditionally normally distributed in
the total population. Since the unconditional mean and variance of Yi' have been
chosen to be 0 and 1, respectively, a simple relation between γi and πi (propor-
tion of correct answers to item i in the total group) can be written down: When θ
is normally distributed,

πi = ∫_{γi}^{∞} φ(t) dt.   (3-9)

The parameter γi is the item difficulty parameter used in certain kinds of Thurstone scaling (see Fan, 1957). It is also the same as the College Entrance
Examination Board delta (Gulliksen, 1950, pp. 368-369) except for a linear
transformation.
If (3-7) and (3-8) hold approximately, then from (3-4)

bi ≈ γi / ρ'ix .   (3-10)
If all items have equal discriminating power ai, then by (3-4) all ρi' are equal and
the difficulty parameter bi is proportional to γi, the normal curve deviate corre­
sponding to the proportion of correct answers πi. Thus when all items are equally discriminating, there is a monotonic relation between bi and πi: as πi increases,
bi and γi both decrease. When all items are not equally discriminating, the
relation between bi and γi or πi depends on ai. In general, arranging items in
order on πi is not the same as arranging them on bi.

3.4. INVARIANT ITEM PARAMETERS

As pointed out earlier, an item response function can also be viewed as the
regression of item score on ability. In many statistical contexts, regression
functions remain unchanged when the frequency distribution of the predictor
variable is changed. In the present context this should be quite clear: The proba­
bility of a correct answer to item i from examinees at a given ability level θ0
depends only on θ0, not on the number of people at θ0, nor on the number of
people at other ability levels θ1, θ2, . . . . Since the regression is invariant, its
lower asymptote, its point of inflexion, and the slope at this point all stay the
same regardless of the distribution of ability in the group tested. Thus ai, bi, and
ci are invariant item parameters. According to the model, they remain the same
regardless of the group tested.
Suppose, on the contrary, it is found that the item response curves of a set of

items differ from one group to another. This means that people in group 1 (say) at
ability level θ0 have a different probability of success on the set of items than do
people in group 2 at the same θ0. This now means that the test is able to
discriminate group 1 individuals from group 2 individuals of identical ability
level θ0. And this, finally, means that the test items are measuring some dimen­
sion on which the groups differ, a dimension other than θ. But our basic assump­
tion here is that the test items have only one dimension in common. The conclu­
sion is either that this particular test is not one-dimensional as we require or else
that we should restrict our research to groups of individuals for whom the items
are effectively one-dimensional.
The invariance of item parameters across groups is one of the most important
characteristics of item response theory. We are so accustomed to thinking of item
difficulty as the proportion (πi) of correct answers that it is hard to imagine how
item difficulty can be invariant across groups that differ in ability level. The
following illustration may help to clarify matters.
Figure 3.4.1 shows two rather different item characteristic curves. Inverted on
the baseline are the distributions of ability for two different groups of examinees.
First of all, note again: The ability required for a certain probability of success on
an item does not depend on the distribution of ability in some group; con­
sequently, the item difficulty b should be the same regardless of the group from
which it is determined.
Now note carefully the following. In group A, item 1 is answered correctly
less often than item 2. In group B, the opposite occurs. If we use the proportion
of correct answers as a measure of item difficulty, we find that item 1 is easier
than item 2 for one group but harder than item 2 for the other group.
Proportion of correct answers in a group of examinees is not really a measure
of item difficulty. This proportion describes not only the test item but also the
group tested. This is a basic objection to conventional item analysis statistics.
Item-test correlations vary from group to group also. Like other correlations,

FIG. 3.4.1. Item response curves in relation to two groups of examinees. (From F. M. Lord, A study of item bias, using item characteristic curve theory. In Y. H. Poortinga (Ed.), Basic problems in cross-cultural psychology. Amsterdam: Swets and Zeitlinger, 1977, pp. 19-29.)

item-test correlations tend to be high in groups that have a wide range of talent,
low in groups that are homogeneous.

3.5. INDETERMINACY

Item response functions Pi(θ) like Eq. (2-1) and (2-2) ordinarily are taken to be functions of ai(θ − bi). If we add a constant to every θ and at the same time add the same constant to every bi, the quantity ai(θ − bi) is unchanged and so is the response function Pi(θ). This means that the choice of origin for the ability scale is purely arbitrary; we can choose any origin we please for measuring ability as long as we use the same origin for measuring item difficulty bi.
Similarly, if we multiply every θ by a constant, multiply every bi by the same constant, and divide every ai by the same constant, the quantity ai(θ − bi) remains unchanged and so does the response function Pi(θ). This means that the choice of unit for measuring ability is also purely arbitrary.
One could decide to choose the origin and unit for measuring ability in such a
way that the first person tested is assigned θ1 = 0 and the second person tested is
assigned θ2 = 1 or — 1. Another possibility would be to choose so that for the
first item b1 = 0 and a1 = 1. Scales chosen in this way would be meaningless to
anyone unfamiliar with the first two persons tested or with the first item adminis­
tered. A more common procedure is to choose the scale so that the mean and
standard deviation of θ are 0 and 1 for the group at hand.
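The indeterminacy is easy to verify numerically: rescaling the ability metric while transforming the item parameters in the compensating way leaves every response probability unchanged. The parameter values below are arbitrary.

```python
import numpy as np

def p_3pl(theta, a, b, c, D=1.7):
    return c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))

theta = np.linspace(-3, 3, 7)
a, b, c = 1.2, 0.5, 0.2
k, m = 2.0, 100.0     # arbitrary new unit and origin for the ability scale

# theta* = k*theta + m, b* = k*b + m, a* = a/k leaves a*(theta* - b*) = a(theta - b)
print(np.allclose(p_3pl(theta, a, b, c),
                  p_3pl(k * theta + m, a / k, k * b + m, c)))   # True
```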
The invariance of item parameters, emphasized in Section 3.4, clearly holds
only as long as the origin and unit of the ability scale are fixed. This means that if
we determine the bi for a set of items from one group of examinees and then
independently from another, we should not expect the two sets of bi to be
identical. Rather we should expect them to have a linear relation to each other
(like the relation between Fahrenheit and Celsius temperature scales).
Figure 3.5.1 compares estimated bi from a group of 2250 white students with
estimated bi from a group of 2250 black students for 85 verbal items from the
College Board SAT. Most of the scatter about the line is due to sampling
fluctuations in the estimates; some of the scatter is due to failure of the model to
hold exactly for groups as different as these (see Chapter 14).
If we determine the ai for a set of items independently from two different
groups, we expect the two sets of values to be identical except for an undeter­
mined unit of measurement that will be different for the two groups. We expect
the ai to lie along a straight line passing through the origin (0, 0), with a slope
reciprocal to the slope of the line relating the two sets of bi. The slope represents
the ratio of scale units for the two sets of parameters. The two sets of ai are
related in the same way as two sets of measurements of the same physical
objects, one set expressed in inches and the other in feet.
The ci are not affected by changes in the origin and unit of the ability scale.
The ci should be identical from one group to another.


FIG. 3.5.1. Estimated difficulty parameters (b) for 85 items for blacks and for
whites. (From F. M. Lord, A study of item bias, using item characteristic curve
theory. In Y. H. Poortinga (Ed.), Basic problems in cross-cultural psychology.
Amsterdam: Swets and Zeitlinger, 1977, pp. 19-29.)

Ability parameters θ are also invariant from one test to another except for choice of origin and scale, assuming that the tests both measure the same ability, skill, or trait. For 1830 sixth-grade pupils, Fig. 3.5.2 compares the θ estimated from a 50-item Metropolitan vocabulary test with the θ estimated from a 42-item SRA vocabulary test. Both tests consist of four-choice items.
The scatter about a straight line is more noticeable here than in Fig. 3.5.1 because there each bi was estimated from the responses of 2250 students, whereas here each θ is estimated from the responses to only 42 or 50 items. Thus the estimates
of θ are more subject to sampling fluctuations than the estimates of bi. The broad
scatter at low ability levels is due to guessing, random or otherwise. A more
detailed evaluation of the implications of Fig. 3.5.2 is given in Section 12.6. It is
shown there that after an appropriate transformation is made, the transformed


FIG. 3.5.2. Ability estimates from a 50-item MAT vocabulary test are compared with ability estimates from a 42-item SRA vocabulary test for 1830 sixth-grade pupils. (Ability estimates outside the range −2.5 < θ < 2.5 are printed on the border of the table.)

estimates of θ from the two tests correlate higher than do number-right scores on
the two tests.
In conclusion, in item response theory the item parameters are invariant from
group to group as long as the ability scale is not changed; in classical item
analysis, the item parameters are not invariant from group to group, although
they are unaffected by choice of ability scale. Similarly, ability θ is invariant
across tests of the same psychological dimension as long as the ability scale is not
changed; number-right test score is not invariant from test to test, although it is
unaffected by choice of scale for measuring θ.

3.6. A SUFFICIENT CONDITION FOR THE NORMAL OGIVE MODEL

In Section 3.2, we consider a variable Y'i underlying item i. Suppose n items each have such an underlying variable, and suppose for some group of examinees all the Y'i are jointly multinormally distributed. In this case, the joint distribution of the dichotomous item responses ui is determined by the γi and by the intercorrelations ρ'ij of Y'i and Y'j (i ≠ j; i, j = 1, 2, . . . , n). It can be shown (Lord & Novick, 1968, Section 16.8) that if the joint distribution of the observed ui for any set of data is consistent with this multinormal model for some γi and ρ'ij (i ≠ j; i, j = 1, 2, . . . , n), then the data are consistent with the two-parameter normal ogive response model, including the assumption of unidimensionality (local independence). Furthermore, the ρ'ij will then have just one common factor, which may be considered as the ability θ measured by the n-item test.
The situation just described can only exist if there is no guessing and if θ is normally distributed in the group tested. This is a very restrictive situation; but if
this situation held for some group for some free-response items, the normal ogive
model would also hold for all other groups taking these same items.
The point is not that most data will fit the very restrictive conditions. They
will not. The point is rather that the normal ogive model will hold in a very large
variety of other, less restrictive situations. The restrictive conditions are suffi­
cient conditions for the normal ogive model; they are very far from being neces­
sary conditions.

3.7. ITEM INTERCORRELATIONS

Although we do not expect the restrictive model of the previous section to hold
for most actual data, some useful conclusions can be drawn from it that will help
us understand the relation of our latent item parameters to familiar quantities. It is
clear that when Y'i and Y'j are normally distributed, the product-moment correlation ρ'ij between them is, by definition, the same as the tetrachoric correlation between item i and item j. Under the restrictive model, the ρ'ij will have just one common factor, θ, so that

ρ'ij = ρ'iρ'j ,   (3-11)

where ρ'i and ρ'j are the factor loadings of items i and j. The factor loadings are also the correlations of Y'i and Y'j with θ and also the biserial correlations of ui and uj with θ.
When there is no guessing, Eq. (3-11) will allow us to infer observable item
intercorrelations from the parameters ai and aj: Under the restrictive model, the
(observable) tetrachoric correlation between item i and item j is found from (3-3)
and (3-11) to be

ρ'ij = aiaj / (√(1 + a²i) √(1 + a²j)) .

Conversely, under the restrictive model the ρ'i, and thus the ai, can be inferred from a factor analysis of tetrachoric item intercorrelations:

ρ'²i = ρ'ij ρ'ik / ρ'jk   (i ≠ j, i ≠ k, j ≠ k).   (3-12)
This is not recommended in the usual situations where there is guessing, how­
ever.
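As an illustration of how (3-11) and (3-12) would be used, the sketch below recovers factor loadings, and hence the ai through (3-3), from three tetrachoric correlations that were themselves generated from known loadings. The loadings are invented for the example.

```python
import math

def loading_squared(r_ij, r_ik, r_jk):
    """Eq. (3-12): squared factor loading of item i from three tetrachoric correlations."""
    return r_ij * r_ik / r_jk

# Tetrachoric correlations implied by hypothetical loadings .4, .6, .8 via Eq. (3-11)
r12, r13, r23 = 0.4 * 0.6, 0.4 * 0.8, 0.6 * 0.8

rho1 = math.sqrt(loading_squared(r12, r13, r23))   # recovers 0.4
a1 = rho1 / math.sqrt(1 - rho1 ** 2)               # Eq. (3-3): about 0.44
print(round(rho1, 3), round(a1, 3))
```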

3.8. ILLUSTRATIVE RELATIONSHIPS AMONG TEST AND ITEM PARAMETERS

Table 3.8.1 illustrates the relationship of the item discriminating power ai to various conventional item and test statistics and also the interrelationships among them. The illustration assumes that ability θ is normally distributed with zero mean and unit variance and also that bi = 0 for all items. The test statistics are for a 50-item test (n = 50). All quantities other than the ρ'iθ, including the ai at the head of the columns, are rounded values computed from the (exact) values shown for ρ'iθ (≡ ρ'i). All 50 items have identical ai and ci.
The most familiar quantities in the table are probably the test reliabilities (ρxx') in the bottom line. Most 50-item multiple-choice tests with which we are

TABLE 3.8.1
Relation of Item Discriminating Power ai to Various Conventional Item Parameters, and to Parameters of a 50-Item Test, when Ability Is Normally Distributed (μθ = 0, σθ = 1) and All Free-Response Items Are of 50% Difficulty (πi = .50, bi = 0)

                              ai = 0    .20     .44     .75     .98     1.33    2.06     Eq. no.

Free-Response Items (ci = 0, πi = .5)
  ρ'iθ                           0      .2      .4      .6      .7      .8      .9       (3-3)
  ρiθ                            0      .16     .32     .48     .56     .64     .72      (3-6)
  ρ'ij                           0      .040    .16     .36     .49     .64     .81      (3-11)
  ρij                            0      .025    .10     .23     .33     .44     .60      (3-13)
  ρi(x−i)                        0      .12     .29     .47     .56     .66     .77      (3-14)
  ρxx'                           0      .57     .85     .94     .96     .98     .99      (1-17)

Multiple-Choice Items (cI = .2, πI = .6)
  ρIJ                            0      .017    .07     .16     .22     .29     .40      (3-19)
  ρI(X−I)                        0      .088    .23     .38     .45     .53     .62      (3-14)
  ρIX                            .14    .19     .29     .42     .48     .56     .64      (3-16)
  ρ'IX                           .18    .24     .37     .53     .61     .70     .81      (3-6)
  σX                             3.5    4.7     7.2     10.2    11.8    13.6    15.7     (3-15)
  ρXX'                           0      .46     .79     .90     .93     .95     .97      (1-17)

familiar probably have reliabilities close to .90. If so, we should focus our attention on the column with ρXX' = .90 at the bottom and ai = .75 at the top.
The top half of the table assumes that ci = 0. This is referred to as the free-response case (although free-response items do not necessarily have ci = 0). Note that by (3-4) and (3-9), under the assumptions made, free-response items with bi = 0 will have exactly 50% correct answers (πi = .50) in the total group of examinees. The parameters shown are the biserial item-ability correlation ρ'iθ; the point-biserial (product-moment) item-ability correlation ρiθ; the tetrachoric item intercorrelation ρ'ij; the product-moment item intercorrelation ρij (phi coefficient); the item-test correlation ρi(x−i), where x − i is number-right score on the remaining 49 items; and the parallel-forms test reliability ρxx'.
The equation used to calculate each parameter is referenced in the table.
The bottom half of the table deals with multiple-choice tests. The theoretical
relation between the multiple-choice and the free-response case is discussed in
the Appendix. For the rest of this chapter only, multiple-choice items are indexed
by I and J to distinguish them from free-response items (indexed by i and j); the
number-right score on a multiple-choice test will be denoted by X to distinguish
it from the score x obtained from free-response items. The multiple-choice item
intercorrelation ρIJ (phi coefficient) is computed by (3-19) from the free-response ρij. All multiple-choice parameters in the table are computed from ρIJ.
All numbered equations except (3-19) apply equally to multiple-choice and to
free-response items.
Note in passing several things of general interest in the table:

1. A comparison of ρxx' with ρXX' indicates the loss in test reliability when low-ability examinees are able to get one-fifth of the items right without knowing any answers.
2. The standard deviation σX of number-right scores varies very sharply with item discriminating power (with item intercorrelation).
3. The usual item-test correlation ρix or ρ'ix (also ρIX or ρ'IX) is spuriously high because item i is included in x (or I in X). The amount of the spurious effect can be seen by comparing ρIX and ρI(X−I).
4. For free-response items, the item-test correlation ρi(x−i) in the last two columns of the table is higher than the item-ability correlation ρiθ. This may be viewed as due to the fact (see Section 3.1) that the item observed-score regression is more nearly linear than the item-ability regression (item response function).

APPENDIX

This appendix provides those formulas not given elsewhere that are necessary for
computing Table 3.8.1. In the top half of the table, the phi coefficient ρij was obtained from the tetrachoric ρ'ij by a special formula (Lord & Novick, 1968, Eq. 15.9.3) applicable only to items with 50% correct answers:

ρij = (2/π) arcsin ρ'ij ,   (3-13)

the arcsin being expressed in radians. The test reliability ρxx' was obtained from ρij by the Spearman-Brown formula (1-17) for the correlation between two parallel tests after lengthening each of them 50 times. The item-test correlation ρi(x−i) was obtained from a well-known closely related formula for the correlation between one test (i) and the lengthened form (y) of a parallel test (j):

ρiy = mσiρij / σy ,   (3-14)

where m is the number of times j is lengthened [for ρi(x−i) in Table 3.8.1, m = 49], y is number-right score on the lengthened test, σ²i = πi(1 − πi) is the variance (1-23) of the item score (ui = 0 or 1), and

σ²y = σ²i [m + m(m − 1)ρij]   (3-15)

is the variance of the y scores [see Eq. (1-21)].
The usual point-biserial item-test correlation ρix is computed from ρi(x−i) by a formula derived as follows:

ρix ≡ ρi[(x−i)+i] = [σi(x−i) + σ²i] / (σiσx) = [σiσx−iρi(x−i) + σ²i] / (σiσx) .

When y = x − i, we have from (3-14) that σx−iρi(x−i) = mσiρij; using these last formulas with m = n − 1, we have

ρix = [mσ²iρij + σ²i] / (σiσx) .

Finally, using Eq. (3-15) with m = n, we find from this that

ρix = [(n − 1)ρij + 1] / √(n + n(n − 1)ρij) = √(1 + (n − 1)ρij) / √n .   (3-16)
Suppose A, B, C, and D are the relative frequencies in the accompanying 2 × 2 intercorrelation table for free-response items i and j (ci = cj = 0):

                 uj = 0            uj = 1
     ui = 1        B                 A               πi
     ui = 0        D                 C               1 − πi
                 1 − πj            πj

The general formula for the phi coefficient for any such table is

ρij = (AD − BC) / √(πi(1 − πi)πj(1 − πj)) .   (3-17)

Suppose now that we change the items to multiple choice with cI = cJ = c > 0. According to Eq. (2-1) or (2-2), the effect will be that, of the people who got each free-response item wrong, a fraction c will now get the corresponding multiple-choice item right. Thus πI = πi + c(1 − πi). The new 2 × 2 table for multiple-choice items will therefore be

                 uJ = 0                          uJ = 1
     uI = 1      (1 − c)B + c(1 − c)D            A + cB + cC + c²D            πi + c(1 − πi)
     uI = 0      (1 − c)²D                       (1 − c)C + c(1 − c)D         (1 − c)(1 − πi)
                 (1 − c)(1 − πj)                 πj + c(1 − πj)

In the special free-response case where πi = πj = ½, we have B = C = ½ − A and D = A; also we find from (3-17) that

ρij = 4A − 1.   (3-18)

In this special case, the 2 × 2 table for the multiple-choice items is therefore

     uI = 1      (1 − c)(½ − A + cA)             A(1 − c)² + c                ½(1 + c)
     uI = 0      (1 − c)²A                       (1 − c)(½ − A + cA)          ½(1 − c)
                 ½(1 − c)                        ½(1 + c)
When the general formula for a phi coefficient is applied to the last 2 × 2 table, we find that for the multiple-choice items under consideration

ρIJ = [A²(1 − c)⁴ + cA(1 − c)² − (1 − c)²(½ − A + cA)²] / [¼(1 − c²)]
    = [(1 − c)/(1 + c)] (4A − 1).

Using (3-18) we find a simple relation between the free-response ρij and the multiple-choice ρIJ for the special case where πi = πj = .5:

ρIJ = [(1 − c)/(1 + c)] ρij .   (3-19)
This formula is a special case of the more general formula in Eq. (7-3).
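Putting the appendix formulas together, the short sketch below reproduces the ai = .75 column of Table 3.8.1 (to rounding). The Spearman-Brown step uses the lengthening form of (1-17) described in the text; everything else follows Eqs. (3-11), (3-13), (3-15), (3-16), and (3-19).

```python
import math

n, rho_prime_i, c = 50, 0.6, 0.2          # 50 items, rho'_i = .6, guessing parameter .2

rho_prime_ij = rho_prime_i ** 2                           # (3-11): tetrachoric, .36
rho_ij = (2 / math.pi) * math.asin(rho_prime_ij)          # (3-13): phi coefficient, about .23
rho_xx = n * rho_ij / (1 + (n - 1) * rho_ij)              # (1-17): free-response reliability, .94

rho_IJ = (1 - c) / (1 + c) * rho_ij                       # (3-19): multiple-choice phi, about .16
rho_XX = n * rho_IJ / (1 + (n - 1) * rho_IJ)              # (1-17): multiple-choice reliability, .90

var_I = 0.6 * 0.4                                         # item variance with pi_I = .6
sigma_X = math.sqrt(var_I * (n + n * (n - 1) * rho_IJ))   # (3-15) with m = n: about 10.2
rho_IX = math.sqrt(1 + (n - 1) * rho_IJ) / math.sqrt(n)   # (3-16): about .42

print(round(rho_ij, 2), round(rho_xx, 2), round(rho_IJ, 2),
      round(rho_XX, 2), round(sigma_X, 1), round(rho_IX, 2))
```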

REFERENCES

Fan, C.-T. On the applications of the method of absolute scaling. Psychometrika, 1957, 22, 175-183.
Gulliksen, H. Theory of mental tests. New York: Wiley, 1950.
Lord, F. M., & Novick, M. R. Statistical theories of mental test scores. Reading, Mass.: Addison-
Wesley, 1968.
4 Test Scores and Ability Estimates as Functions of Item Parameters

The ideas in this chapter are essential to an understanding of subsequent chapters.

4.1. THE DISTRIBUTION OF TEST SCORES FOR GIVEN ABILITY

It is sometimes asserted that item response theory allows us to answer any question that we are entitled to ask about the characteristics of a test composed of items with known item parameters. The significance of this vague statement arises from the fact that item response theory provides us with the frequency distribution Ø(x|θ) of test scores for examinees having a specified level θ of ability or skill.
For the present, let us consider the number-right score, denoted by x. If the n items in a test all had identical item response curves P = P(θ), the distribution of x for a person at ability level θ would then be the binomial distribution

Ø(x|θ) = [n!/(x!(n − x)!)] P^x Q^{n−x} ,

where Q = 1 − P. The expression (Q + P)^n is familiar as the generating function for the binomial distribution, because the binomial expansion

(Q + P)^n ≡ Q^n + nPQ^{n−1} + [n(n − 1)/2] P^2 Q^{n−2} + ... + [n!/(x!(n − x)!)] P^x Q^{n−x} + ... + P^n

gives the terms of Ø(x|θ) successively for x = 0, 1, . . . , n.
When the item response curves Pi = Pi(θ) (i = 1, 2, . . . , n) vary from item to item, as is ordinarily the case, the frequency distribution Ø(x|θ) of the number-right test score for a person with ability θ is a generalized binomial (Kendall & Stuart, 1969, Section 5.10). This distribution can be generated by the generating function

Π_{i=1}^{n} (Qi + Pi).   (4-1)

For example, if n = 3, the scores x = 0, 1, 2, 3 occur with relative frequency Q1Q2Q3, Q1Q2P3 + Q1P2Q3 + P1Q2Q3, Q1P2P3 + P1Q2P3 + P1P2Q3, and P1P2P3, respectively. The columns of Table 4.3.1 give the reader a good idea of the kinds of Ø(x|θ) encountered in practice.
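Equation (4-1) translates directly into a polynomial multiplication: each factor (Qi + Pi) is one convolution step. The sketch below is a generic implementation with made-up response probabilities, not the SAT item parameters used for Table 4.3.1.

```python
import numpy as np

def score_distribution(p):
    """Conditional distribution of number-right score for one ability level,
    built up factor by factor from the generating function (4-1)."""
    dist = np.array([1.0])                           # zero-item test: score 0 with probability 1
    for p_i in p:
        dist = np.convolve(dist, [1.0 - p_i, p_i])   # multiply by (Q_i + P_i)
    return dist                                      # dist[x] = Prob(x items correct)

p = [0.9, 0.7, 0.4]                   # hypothetical P_i(theta) for a 3-item test
print(score_distribution(p))          # relative frequencies for x = 0, 1, 2, 3
print(sum(p), sum(pi * (1 - pi) for pi in p))   # mean and variance checks, Eqs. (4-2) and (4-3)
```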
Although Ø(x|θ), the conditional distribution of number-right score, cannot be written in a simple form, its mean μx|θ and variance σ²x|θ for given θ are simply

μx|θ = Σ_{i=1}^{n} Pi(θ),   (4-2)

σ²x|θ = Σ_{i=1}^{n} PiQi .   (4-3)

The mean (4-2) can be derived from the fact that x ≡ Σi ui and the familiar fact that the mean of ui is Pi. The variance (4-3) can be derived from the familiar binomial variance σ²(ui) = PiQi by noting that

σ²x|θ = σ²(Σi ui | θ) = Σi σ²(ui) = Σi PiQi

because of local independence. Note that Ø(x|θ), μx|θ, and σx|θ refer to the distribution of x (1) for all people at ability level θ and also (2) for any given individual whose ability level is θ.
If P is the average of the Pi(θ) taken over n items, then in practice Ø(x|θ) is usually very much like the binomial distribution [n!/(x!(n − x)!)] P^x Q^{n−x}, where Q ≡ 1 − P. The main difference is that the variance Σi PiQi is always less than the binomial variance nPQ unless Pi = P for all items. The difference between the two variances is simply nσ²P|θ, where σ²P|θ is the variance of the Pi(θ) for fixed θ taken over items:

σ²x|θ = nPQ − nσ²P|θ .   (4-4)

4.2. TRUE SCORE

A person's number-right true score ξ (pronounced ksai or ksee) on a test is defined in Lord and Novick (1968, Chapter 2) as the expectation of his observed
score x. It follows immediately from (4-2) that every person at ability level θ has
the same number-right true score

ξ = Σ_{i=1}^{n} Pi(θ).   (4-5)
Since each Pi(θ) is an increasing function of θ, number-right true score is an increasing function of ability.
This is the same true score denoted by T in Section 1.2. The classical notation avoids Greek letters; the present notation emphasizes that the relation of observed score x to true score ξ is the relation of a sample observation to a population parameter.
True score ξ and ability θ are the same thing expressed on different scales of measurement. The important difference is that the measurement scale for ξ depends on the items in the test; the measurement scale for θ is independent of the items in the test (Section 3.4). This makes θ more useful than ξ when we wish to compare different tests of the same ability. Such comparisons are an essential part of any search for efficient test design (Chapter 6).

4.3. STANDARD ERROR OF MEASUREMENT

By definition, the error of measurement (e) is the discrepancy between observed score and true score: e = x − ξ. When ξ is fixed, e and x have the same standard
deviation, since in that case e and x differ only by a constant. This standard deviation is called the standard error of measurement at ξ, denoted here by σe|ξ. The squared standard error of measurement s²e.ξ of classical test theory is simply σ²e|ξ averaged over all (N) examinees:

s²e.ξ = (1/N) Σ_{a=1}^{N} σ²e|ξa .   (4-6)

When ξ is fixed, so is θ and vice versa [see Eq. (4-5)]. Thus

σe|ξ=ξ0 ≡ σe|θ=θ0 ,   (4-7)

provided ξ0 and θ0 are corresponding values satisfying (4-5). By (4-3) and (4-7), finally,

σ²e|ξ0 = Σ_{i=1}^{n} Pi(θ0)Qi(θ0).   (4-8)
i=l
Note that the standard error of measurement approaches 0 at high ability levels, where Pi(θ) → 1. At low ability levels,

Pi(θ) → ci  and  σ²e|ξ → Σ_{i=1}^{n} ci(1 − ci).
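The following sketch evaluates Eq. (4-8) on a grid of ability levels for a hypothetical 60-item test (the parameter values are invented); it also shows the low-ability limit Σ ci(1 − ci) mentioned above.

```python
import numpy as np

def sem_number_right(theta, a, b, c, D=1.7):
    """Conditional standard error of measurement of number-right score, Eq. (4-8)."""
    theta = np.asarray(theta, dtype=float)[:, None]
    p = c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))
    return np.sqrt((p * (1 - p)).sum(axis=1))

rng = np.random.default_rng(2)
a = rng.uniform(0.8, 1.6, 60)
b = rng.uniform(-2.0, 2.0, 60)
c = np.full(60, 0.2)

grid = np.array([-4.0, -2.0, 0.0, 2.0, 4.0])
print(sem_number_right(grid, a, b, c))        # shrinks toward 0 as theta increases
print(np.sqrt((c * (1 - c)).sum()))           # low-ability limit, about 3.1
```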
Table 4.3.1 shows the conditional frequency distribution of number-right
scores at equally spaced ability levels as estimated for the 60-item Mathematics
section of the Scholastic Aptitude Test. This table was computed from (4-1),
TABLE 4.3.1
Theoretical Conditional Frequency Distribution of Number-Right
Scores on SAT Mathematics Test (January 1971) for Equally Spaced
Fixed Values of θ (all frequencies multiplied by 100)

Selected Fixed Values of θ


Score
(x) -3.000 -2.625 -2.250 -1.875 -1.500 -1.125 -0.750 -0.375 0.000 +0.375 +0.750 + 1.125 + 1.500 + 1.875 +2.250 +2.625 +3.000

60 7 28 52
59 2 20 37 35
58 6 27 23 11
57 12 23 9 2
56 1 18 14 3
55 2 20 7 1
54 5 17 3
53 8 12 1
52 12 7
51 1 15 4
50 1 15 2
49 3 14 1
48 5 11
47 7 8
46 10 5
45 1 12 3
44 1 14 1
43 3 13 1
42 5 11
41 7 9
40 1 9 6
39 1 12 4
38 3 13 2
37 4 13 1
36 1 6 11 1
35 1 9 9
34 2 11 7
33 3 12 5
32 1 5 12 3
31 1 7 12 2

(continued)
TABLE 4.3.1 (continued)

Selected Fixed Values of θ

Score
(x) -3.000 -2.625 -2.250 -1.875 -1.500 -1.125 -0.750 -0.375 0.000 +0.375 +0.750 +1.125 +1.500 +1.875 +2.250 +2.625 +3.000

30 2 9 10 1
29 3 11 7
28 1 5 12 5
27 1 7 12 3
26 2 9 11 2
25 1 4 11 9 1
24 1 6 12 7 1
23 1 3 8 12 5
22 1 4 10 11 3
21 1 2 6 11 9 2
20 1 2 4 8 12 7 1
19 1 1 3 6 10 12 5
18 1 1 2 4 8 12 10 3
17 2 2 4 6 10 12 8 2
16 3 4 6 9 12 12 6 1
15 5 6 8 11 13 10 4 1
14 7 9 11 12 12 8 2
13 10 11 12 13 11 6 1
12 12 13 13 12 8 4 1
11 14 13 13 10 6 2
10 14 13 11 7 4 1
9 12 10 8 5 2
8 9 8 5 3 1
7 6 5 3 1
6 4 3 1 1
5 2 1 1
4 1
3
2

0

using estimated item parameters. All items are five-choice items. The typical ogive shape of the regression function μx|θ = Σ_{i=1}^{n} Pi(θ) is apparent in this table, and also the typical decreasing standard error of measurement and increasing skewness at high ability levels.

4.4. TYPICAL DISTORTIONS IN MENTAL MEASUREMENT

As already noted, formula (4-2) for the regression μx|θ of number-right score on ability is the same as formula (4-5) for the relation of true score to ability. This important function ξ = ξ(θ) ≡ Σ_{i=1}^{n} Pi(θ) and also the function

ζ ≡ ζ(θ) ≡ (1/n) Σ_{i=1}^{n} Pi(θ)   (4-9)
are called test characteristic functions. Either of these functions specifies the
distortion imposed on the ability scale when number-right score on a particular
set of test items is used as a measure of ability. A typical example of a test
characteristic function appears in Fig. 5.5.1.
Over ability ranges where the test characteristic curve is relatively steep,
score differences are exaggerated compared to ability differences. Over ranges
where the test characteristic curve is relatively flat, score differences are com­
pressed compared to ability differences. Since number-right scores are integers,
compression of a wide range of ability into one or two discrete score values
necessarily results in inaccurate measurement.
If all items had the same response function, clearly the test characteristic
function (4-9) would be the same function also. More generally, test characteris­
tic curves usually have ogive shapes similar to but not identical with item re­
sponse functions. Differences in difficulty among items cause a flattening of the
test characteristic curve. If all items had the same response curves except that
their difficulty parameters bi were uniformly distributed, the test characteristic
curve would be virtually a straight line except at its extremes. For a long test, the
greater the range of the bi, the more nearly horizontal the test characteristic
curve.
If a test is composed of two sets of items, one set easy and the other set
difficult, the test characteristic curve may have three relatively flat regions: It
may be flat in the middle as well as at extreme ability levels. Such a test will
compress the ability scale and provide poor measurement at middle ability levels,
as well as at the extremes.
If the distribution of ability is assumed to have some specified shape (for
example, it is bell-shaped), the effect of the distortions introduced by various
types of test characteristic functions can be visualized. If a test is much too easy

for the group tested, the point of inflection of the test characteristic curve may
fall in the lower tail of the distribution where there are no examinees. Only the
top part of the test characteristic curve may be relevant for the particular group
tested. In this case, most examinees in the group may be squeezed into a few
discrete score values at the top of the score range. The bottom part of the
available score range is unused. Measurement is poor except for the lower ability
levels of the group. An assumed bell-shaped distribution of ability is turned into a
negatively skewed, in the extreme a J-shaped, distribution of number-right
scores. Such a test may be very approproate if its main use is simply to weed out
a few of the lowest scoring examinees.
If the test is much too hard for the group tested, an opposite situation will
exist. The score distribution will be positively skewed but will not be J-shaped if
there is guessing, because zero scores are then unlikely. Such a test may be very
appropriate for a scholarship examination or for selecting a few individuals from
a large group of applicants.
If the test is not very discriminating, the test characteristic curve will be
relatively flat. If the relevant part of the characteristic curve (the part where the
examinees occur) is nearly straight, the shape of the frequency distribution of
ability will not be distorted by the test. However, a wide range of ability will be
squeezed into a few middling number-right scores, with correspondingly poor
measurement.
If the test is very discriminating, its characteristic curve will be corre­
spondingly steep in the middle. The curve cannot be steep throughout the ability
range because it is asymptotic to ζ = 1 at the right and to ζ = c̄ at the left, where

c̄ = (1/n) Σ_{i=1}^{n} ci.   (4-10)
Thus there will be good measurement in the middle but poor measurement at
the extremes. If the test difficulty is appropriate for the group tested, the middle
part of the bell-shaped distribution of ability will be spread out and the tails
squeezed together. The result in this case is a platykurtic distribution of number-
right scores. The more discriminating the items, the more platykurtic the
number-right score distribution, other things being equal. In the extreme, a
U-shaped distribution of number-right scores may be obtained.
If we wish to discriminate well among people near a particular ability level (or
levels), we should build a test that has a steep characteristic curve at the point(s)
where we want to discriminate. For example, if a test is to be used only to select a
single individual for a scholarship or prize, then the items should be so difficult
that only the top person in the group tested knows the answer to more than half of
the test items. The problem of optimal test design for such a test is discussed in
Chapters 5, 6, and 11.
An understanding of the role of the test characteristic curve is important in
designing a test for a specific purpose. Diagrams showing graphically just how

various test characteristic curves distort the ability scale, and thus the frequency
distribution of ability, are given in Lord and Novick (1968, Section 16.14).

4.5. THE JOINT DISTRIBUTION OF ABILITY AND TEST SCORES

If ability had a rectangular distribution from θ = −3 to θ = +3, Table 4.3.1 would give a good idea of the joint distribution of ability and number-right score.
Otherwise, the probabilities at each value of θ must be multiplied by its relative
frequency of occurrence in order to obtain the desired joint distribution. Formally,
the joint distribution is

Ø(x, θ) = Ø(x|θ)g*(θ),   (4-11)

where g*(θ) is the distribution (probability density) of ability in the group tested.
(The general term distribution rather than probability density or frequency func­
tion is used throughout this book. Where necessary to prevent confusion,
cumulative or noncumulative is specified.)
Usually g*(θ) is unknown. The observed distribution of estimated θ is an
approximation to g*(θ). A better approximation can often be obtained by the
methods of Chapter 16.
Given an adequate estimate of g*(θ), the joint distribution of ability and
number-right score can be determined from (4-11) and (4-1). This joint distribu­
tion contains all relevant information for describing and evaluating the properties
of the number-right score x for measuring ability θ. One such (estimated) joint
distribution is shown in Table 16.11.1.

4.6. THE TOTAL-GROUP DISTRIBUTION OF NUMBER-RIGHT SCORE

Suppose we have a group of N individuals whose ability levels θa (a = 1, 2, . . . , N) are known (in practice the θa will be replaced by estimated values θ̂a). It is apparent that the total-group or marginal distribution Ø(x) of number-right scores will be

Ø(x) = (1/N) Σ_{a=1}^{N} Ø(x|θa),   (4-12)

where Ø(x|θa) is calculated from (4-1).
Any desired moments of the total-group distribution of test score can be
calculated from the Ø(x) obtained by (4-12). The expected score for the N
examinees also can be found from (4-12) and (4-2):

Ex = (1/N) Σ_{a=1}^{N} μx|θa = (1/N) Σ_{a=1}^{N} Σ_{i=1}^{n} Pi(θa).   (4-13)

An estimate of the expected variance of the N scores can be found from an ANOVA identity relating total-group statistics to conditional statistics:

σ²x = mean of σ²x|θa + variance of μx|θa.

From this, (4-2), (4-3), and (4-13), we have the estimated total-group variance

(1/N) Σ_{a=1}^{N} Σ_{i=1}^{n} PiaQia + (1/N) Σ_{a=1}^{N} (Σ_{i=1}^{n} Pia)² − (1/N²) (Σ_{a=1}^{N} Σ_{i=1}^{n} Pia)²,   (4-14)

where Pia ≡ Pi(θa).
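Given a matrix of response probabilities Pia, Eqs. (4-13) and (4-14) are two lines of code. The ability sample and item parameters below are simulated stand-ins, not data from the text.

```python
import numpy as np

rng = np.random.default_rng(3)
N, n = 1000, 40
theta = rng.normal(size=N)[:, None]
b = rng.uniform(-2.0, 2.0, n)
P = 0.2 + 0.8 / (1 + np.exp(-1.7 * (theta - b)))   # P[a, i] = P_i(theta_a), hypothetical items

mu = P.sum(axis=1)                                  # conditional means, Eq. (4-2), one per examinee
expected_score = mu.mean()                          # Eq. (4-13)
total_variance = (P * (1 - P)).sum(axis=1).mean() + mu.var()   # Eq. (4-14)
print(round(expected_score, 2), round(total_variance, 2))
```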

4.7. TEST RELIABILITY

Although we shall have little use for test reliability coefficients in this book, it is reassuring to have a formula relating test reliability to item and ability parameters. A conventional definition of test reliability is given by Eq. (1-6) and (1-9); written in our current notation, this definition is

ρxx' ≡ ρ²xξ ≡ 1 − s²e.ξ / σ²x .   (4-15)

For a sample of examinees, (4-15), (4-6), (4-8), and (4-14) suggest an appropriate sample reliability coefficient:

ρxx' = [Σ_{a=1}^{N} (Σ_{i=1}^{n} Pia)² − (Σ_{a=1}^{N} Σ_{i=1}^{n} Pia)²/N]
       / [Σ_{a=1}^{N} Σ_{i=1}^{n} PiaQia + Σ_{a=1}^{N} (Σ_{i=1}^{n} Pia)² − (Σ_{a=1}^{N} Σ_{i=1}^{n} Pia)²/N].   (4-16)

From (4-7), we see that (4-15) is the complement of the ratio of (averaged squared error about the regression of x on θ) to (variance of x). Reliability is therefore, by definition, equal to the correlation ratio of score x on ability θ.
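In the same notation, Eq. (4-16) reduces to the ratio of the variance of the conditional means to the total variance. A minimal sketch, with a simulated Pia matrix assumed only for illustration:

```python
import numpy as np

def sample_reliability(P):
    """Sample reliability coefficient of Eq. (4-16) from a matrix P[a, i] = P_i(theta_a)."""
    mu = P.sum(axis=1)                    # conditional mean score of each examinee
    err = (P * (1 - P)).sum(axis=1)       # conditional error variance of each examinee, Eq. (4-8)
    return mu.var() / (err.mean() + mu.var())

rng = np.random.default_rng(4)
theta = rng.normal(size=500)[:, None]
b = np.linspace(-2.0, 2.0, 40)
P = 0.2 + 0.8 / (1 + np.exp(-1.7 * (theta - b)))    # hypothetical 40-item test
print(round(sample_reliability(P), 3))
```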

4.8. ESTIMATING ABILITY FROM TEST SCORE

The following procedure clarifies the construction of confidence intervals for estimating ability θ from number-right score x. Consider the score distribution in any column in Table 4.3.1. Find the highest score level below which lies no more than 2½% of the frequency. Do this for all values of θ. If x were continuous, the points would form a smooth curve; since x is an integer, the points fall on a step

function. This step function cuts off 2½% or less of the frequency at every ability level θ. Repeat this process for the upper tails of the score distributions.
The two resulting step functions are shown in Fig. 4.8.1. No matter what the value of θ may be, in the long run at least 95% of all randomly chosen scores will lie in the region between these step functions.
Now consider a random examinee. His number-right score on the test is x0, say. We are going to assert that he is in the region between the step functions. This assertion will be correct at least 95% of the time for randomly chosen examinees. But given that this examinee's test score is x0, this assertion is in logic completely equivalent to the assertion that he lies in a certain interval on θ. In Fig. 4.8.1, the ends of an illustrative interval are denoted by θ̲ and θ̄. We shall therefore assert that his ability θ lies in the interval (θ̲, θ̄). Such assertions will be
correct in the long run at least 95% of the time for randomly chosen examinees.
An interval with this property is called a 95% confidence interval. Such
confidence intervals are basic to the very important concept of test information,
introduced in Chapter 5. It is for this reason that we consider it in such detail
here.
A point estimate of θ for given x would be provided by the regression of θ on x (see Section 12.8). Although the regression of x on θ is given by (4-2), the

FIG. 4.8.1. Confidence interval (θ̲, θ̄) for estimating ability (SAT Mathematics Test, January 1971).

regression of θ on x cannot be determined unless we know g*(θ), the distribution of ability in the group tested. The regression of θ on x is by definition given by

μθ|x = (1/Ø(x)) ∫_{−∞}^{∞} θg*(θ)Ø(x|θ) dθ   (4-17)

[compare Eq. (4-11)]. Ordinarily it cannot be written in closed form. It can be calculated numerically if the distribution of θ is known.

4.9. JOINT DISTRIBUTION OF ITEM SCORES FOR ONE EXAMINEE

Number right is not the only way, nor the best way, to score a test. For more
general results, and for other reasons, we need to know the conditional frequency
distribution of the pattern of item responses—the joint distribution of all item
responses ui (i = 1, 2 , . . . , n) for given θ.
For item i, the conditional distribution given θ of a single item response is

            Pi(θ)   if ui = 1,
L(ui|θ) =   Qi(θ)   if ui = 0,     (4-18)
            0       otherwise.

This may be written more compactly in various ways. For present purposes, we shall write

L(ui|θ) = Pi^{ui} Qi^{1−ui}.   (4-19)

The reader should satisfy himself that (4-18) and (4-19) are identical for the two
permissible values of ui.
Because of local independence, which is guaranteed by unidimensionality
(Section 2.4), success on one item is statistically independent of success on other
items. Therefore, the joint distribution of all item responses, given θ, is the
product of the distributions (4-19) for the separate items:
n
L(u|θ; a, b , c) = L(u1, u2,... ,un θ) = π PuiiQ1i–Ui, (4-20)
i=l
where u = {ui} is the column vector { u 1 , u 2 , . . . , un}' and a, b , c are vectors
of ai bi, and ci
Equation (4-20) may be viewed as the conditional distribution of the pattern u
of item responses for a given individual with ability θ and for known item
parameters a, b , and c. In this case the ui (i = 1, 2 , . . . , n) are random
variables and θ, a, b , and c are considered fixed.
If the ui for an individual have already been determined from his answer
sheet, they are no longer chance variables but known constants. In this case,
assuming the item parameters to be known from pretesting, it is useful to think of
(4-20) as a function of the mathematical variable θ, which represents the (un­
known) ability level of the examinee. Considered in this way, (4-20) is the
4.10. JOINT DISTRIBUTION OF ALL ITEM SCORES 55

0
0

1
-20
-30
LOG LIKELIHOOD
-40
-50
-eo
-70
-80
-90

-5 -4 -3 -2 -I O 2 3 4 5
ABILITY θ

FIG. 4.9.1. Logarithm of likelihood functions for estimating the ability of six
selected examinees from the SCAT II 2B Mathematics test.

likelihood function for 6. The maximum likelihood estimate 6 (see Section 4.13)
of the examinee's ability is the value of θ that maximizes the likelihood (4-20) of
his actually observed responses ui (i = I, 2, . . . , n).
Figure 4.9.1 shows the logarithm of six logistic likelihood functions com­
puted independently by (4-20) for six selected examinees taking a 100-item high
school mathematics aptitude test. The maxima of these six curves are found at
ability levels 6 = —5.6, —4.6, —1.0, . 1 , 1.0, and 3.7. These six values are the
maximum likelihood ability estimates 6 for the six examinees.

4.10. JOINT DISTRIBUTION OF ALL ITEM SCORES ON


ALL ANSWER SHEETS

If N examinees can be treated as a random sample from a very large population


of examinees, then the distributions of u for different examinees are statistically
56 4. TEST SCORES AS FUNCTIONS OF ITEM PARAMETERS

independent. Thus the joint distribution of the N different u for all examinees is
the product of the separate distributions. This joint distribution is then
N

ΠΠ
n
L(U|θ;a,b,c)≡L(ul,u2,... ,uN\θ) = Pia uia Qia1–u ia , (4-21)
a=1 n=1

where the subscript a distinguishes the contribution of examinee a (a = 1,


2,. .. , N and where Pia = Pi(θa), U is the matrix ||uia||, and θ is the vector
{θ1,θ2,...,θN}'.
When examinee responses have already been recorded on a pile of answer
sheets, (4-21) may be viewed as the likelihood function of both the ability
parameters θ1 θ2,. . . , θN and the item parameters a1, b1, c 1 , a2, b2, c2, . . . ,
an, bn, cn. The maximum likelihood estimates are the values θa (a = 1, 2 , . . . ,
N) and ai bi ci (i = 1, 2 , . . . , n) that together maximize (4-21). If the item
parameters are known, then (4-21) is simply the likelihood function of all the θa.
Does it seem unlikely that we can successfully estimate N + 3n parameters all
at the same time from just one pile of answer sheets? Actually, we do almost the
same thing in everyday practice. We obtain a score (ability estimate) for each
examinee and also do an item analysis on the same pile of answer sheets,
obtaining indices of item difficulty and item discriminating power. It is reason­
able to do this if nN, the total number of observations, is large compared to both
N and n. If n is 50 and N is 1000, we have about 43 observations for each
parameter estimated.

4.11. LOGISTIC LIKELIHOOD FUNCTION

The likelihood function (4-20) for θ for one examinee can also be written
n Pi ui n
L(u θ) = π . π Qi .
( Qi )
i=1 i=l
In general, this is not helpful; but in the case of the logistic function [Eq. (2-1)]
when Ci = 0,
l
Pi 1 + e–DLi
= eDLi , (4-22)
Qi 1
1
1 + e–DLi
where D is the constant 1.7 and

L, ≡ ai(θ - bi,). (4-23)

Substituting (4-22) in the preceding likelihood function gives the logistic likeli­
hood function
4.12. SUFFICIENT STATISTICS 57

n n
L(u|θ) = exp uiLi Qt
( ) Π
i=l i=l
n n
= exp -D aibiui eDθs (4-24)
( ) Π Qi (θ)
i=i i=l
where
n
S = s(u) ≡ aiui. (4-25)
i=l
In Appendix 4, it is shown that if

1. the item response function is logistic,


2. there is no guessing (ci = 0 for all i),
3. the item parameters are known.

then the weighted item score


n
s = i aiui
is a sufficient statistic for estimating examinee ability θ.

4.12. SUFFICIENT STATISTICS

The key property of a sufficient statistic s is that the conditional distribution of


the observations given s is independent of some parameter θ. This means that
once s is given, the data contain no further information about θ. This justifies the
usual statement that the sufficient statistic s contains all the information in the
data concerning the unknown parameter θ.
Note that s in (4-25) is a kind of test score, although different from the usual
number-right score x. Note also that s ranges from 0 to ni ai. Clearly, s is not a
consistent estimator of 0, which ranges from - ∞ to + ∞.
Since
E(ui θ) = Pi(θ) (4-26)

the expectation of s is
n
E(s θ) = aiPi (θ). (4-27)
1=1
Note that (4-27) is a kind of true score, although different from the usual
number-right score ξ . A consistent estimator 0 of 6 is found by solving for 0
the equation.

iaiPiθ) = s. (4-28)
58 4. TEST SCORES AS FUNCTIONS OF ITEM PARAMETERS

It is shown in Section 4.14 that the θ obtained from Eq. (4-28) is also the
maximum likelihood estimator of 6 under the logistic model with all ci = 0.
If s is sufficient for 0, so is any monotonic function of s. It is generally agreed
that when a sufficient statistic s exists for θ, any statistical inference for 0 should
be based on some function of s and not on any other statistic.
The three conditions stated at the end of the preceding section are the most
general conditions for the existence of a sufficient statistic for 6. There is no
sufficient statistic when the item response function is a normal ogive, even
though the normal ogive and logistic functions are empirically almost indistin­
guishable. There is no sufficient statistic when there is guessing, that is, when
Ci ≠ 0. This means that there is no sufficient statistic in cases, frequently reported
in the literature, where the Rasch model (see Wright, 1977) is (improperly) used
when the items can be answered correctly by guessing.

4.13. MAXIMUM LIKELIHOOD ESTIMATES

When no sufficient statistic exists, the statistician uses other estimation methods,
such as maximum likelihood. As already noted in Section 4.10, the maximum
likelihood estimates θa (a = 1, 2, . . . , N) and ai, bi, and ci (i = 1, 2 , . . . , n)
are by definition the parameter values that maximize (4-21) when the matrix of
observed item responses U ≡ ||uia|| is known. In practice, the maximum likeli­
hood estimates are found by taking derivatives of the logarithm of the likelihood
function, setting the derivatives equal to zero, and then solving the resulting
likelihood equations.
The natural logarithm of (4-21), to be denoted by l, is
N n
l= ln L , ( U | θ ; a , b , c ) = [u ia ln Pia + (1 - uia) ln Q i a ] . (4-29)
a=l i=l
If X represents θa , aj, bj, or Cj, the derivative of the log likelihood with respect
to X is

l N n P'ia
uia — (1 – u ia ) Pia
X [ Pia Qia ]
a=l i=1
N n P ia
uia – P ia ) (4-30)
P ia Qia
a=l i=1
where P ia ≡ Pia / x. An explicit expression for P'ia can be written as soon as
'

the mathematical form of Pia is specified, as by Eq. (2-1) or (2-2). The result for
the three-parameter logistic model is given by Eq. (4-40). Some practical proce­
dures for solving the likelihood equations (4-29) are discussed in Chapter 12.
When a, b , c are known from pretesting, the likelihood equation for estimat­
ing the ability of each examinee is obtained by setting (4-30) equal to zero:
4.15. MAXIMUM LIKELIHOOD ESTIMATION FOR EQUIVALENT ITEMS 59

n P' ia
(uia – P ia ) = 0. (4-31)
P'ia Qia
i= 1
This is a nonlinear equation in just one unknown, θa. The maximum likelihood
estimate θa of the ability of examinee a is a root of this equation. The roots of
(4-31) can be found by iterative numerical procedures, once the mathematical
form for Pia is specified.
If the number of items is small, (4-31) may have more than one root
(Samejima, 1973). This may cause difficulty if the number of test items n is 2 or
3, as in Samejima's examples. Multiple roots have not been found to occur in
practical work with n ≥ 20.
If the number of items is large enough, the long test being formed by combin­
ing parallel subtests, the uniqueness of the root 6 of the likelihood equation
(4-31) is guaranteed by a theorem of Foutz (1977). The unique root is a consis­
tent estimator; that is, it converges to the true parameter value as the number of
parallel subtests becomes large.

4.14. MAXIMUM LIKELIHOOD ESTIMATION FOR


LOGISTIC ITEMS WITH ci = 0

If Pia is logistic [Eq. (2-1)] and each ci = 0, it is found that

Pia (4-32)
= DaiPia Qia .
θa
Substituting this for P'ia in and rearranging gives the likelihood equation
n n
ai Pia (θa) = ai uia . (4-33)
i= 1 i=1
This again in a nonlinear equation in a single unknown, θa.
Note that its root θa, the maximum likelihood estimator, is a function of the
sufficient statistic (4-25). Thus (4-33) is the same as (4-28). It is a general
property that a maximum likelihood estimator will be a function of sufficient
statistics whenever the relevant sufficient statistics exist.

4.15. MAXIMUM LIKELIHOOD ESTIMATION FOR


EQUIVALENT ITEMS

Suppose all items have the same response function P(θ). We shall call this the
case of equivalent items. This is not likely to occur in practice, but it is a limiting
case that throws some light on practical situations.
60 4. TEST SCORES AS FUNCTIONS OF ITEM PARAMETERS

In the case of equivalent items, the likelihood equation (4-31) for estimating θ
becomes {P'lPQ) i(ui – P ) = 0 or

1 n
P(θ) = Ui ≡ Z, (4-34)
n
i=l
where z ≡ xln is the proportion of items answered correctly. The maximum
likelihood estimator θ is found by solving (4-34) for θ:

θ = P–1(z), (4-35)
–1
where P ( ) is the inverse function to P( ), whatever the item response
function may be.
Note that when all items are equivalent, a sufficient statistic for estimating
ability 6 is s = i au i = a iui = ax. Thus, in this special case, both the
number-right score x and the proportion-correct score z are sufficient statistics
for estimating ability.

E x e r c i s e 4.15.1
Suppose that P(6) is given by Eq. (2-1) and all items have ai = a, bi = b, and
ci = c, where a, b, and c are known. Show that the maximum likelihood
estimator 0 of ability is given by
1
θ In z — c + b. (4-36)
Da 1 -z
(Here and throughout this book, " I n " denotes a natural logarithm.) If c = 0,θ is
a linear function of the logarithm of the odds ratio (probability of success)/
(probability of failure).

4.16. FORMULAS FOR FUNCTIONS OF THE


THREE-PARAMETER LOGISTIC FUNCTION

A few useful formulas involving the three-parameter logistic function [Eq. (2-1)]
are recorded here for convenient reference. These formulas do not apply to the
three-parameter normal ogive [Eq. (2-2)].

1 — Ci ci + eDLi
Pi=Ci + (4-37)
1 + e–DLi 1 + eDLi '

where D ≡ 1.7 and Li ≡ ai(0 — bi).


l -Ci
Qi ≡ l - Pi = (4-38)
1 + e DL i

Pi c, + eDLi
(4-39)
Qi 1 - ci
4.17. EXERCISES 61

dPi Dai Dai (l - ci)


P'i ≡ Qi (Pi - ci) =
eDLi + 2 + e–DLi
(4-40)
dθ 1 - Ci

P'i Dai
(4-41)
Qi 1 + e–DLi

P'i Dai Pi — Ci Dai


Wi (θ) ≡ (4-42)
Pi Qi 1 — ci Pi 1 + cie–DLi
'2
p i D2 a2I (1 – Ci)
I{θ, ui) ≡ (ci + eDLi) (1 + e–DLi)2 .
(4-43)
Pi Qi
d2Pi D2a2i
Q (P – Ci)(Qi -Pi +ct). (4-44)
dθ 2
(1 – ci)2 i i
Pi - ci 1 - Ci
(4-45)
Pi 1 + ci e-DLi

4.17. EXERCISES

4-1 Compute P(6) under the normal ogive model [Eq. (2-2)] for a = 1/1.7,
b = 0, c = .2, and θ = —3, —2, — 1 , 0 , 1, 2, 3. Compare with the results
given for item 2 in Table 4.17.1 under the logistic model. Plot the item
response function P(θ) for each item in test 1, using the values given in
Table 4 . 1 7 . 1 .

TABLE 4.17.1
Item Response Function P(θ) and Related Functions for Test 1,
Composed of n = 3 Items with Parameters a1 = a2 = a3 = 1/1.7,
b1! = - 1 , b2 = 0, b3 = + 1 , C1 = c2 = c 3 = .2

(9

Item No.*
1 2 3 P(θ) P(θ)Q(θ) P'(θ) p'2/pQ P'IPQ

3 4 5 .985611 .014182 .014130 .014078 .996350


2 3 4 .962059 .036501 .036142 .035785 .990141
1 2 3 .904638 .086268 .083994 .081780 .973646
0 1 2 .784847 .168862 .157290 .146511 .931466
-1 0 1 .6 .24 .2 .166667 .833333
-2 -1 0 .415153 .242801 .157290 .101895 .647814
-3 -2 -1 .295362 .208123 .083994 .033898 .403582
-4 -3 -2 .237941 .181325 .036142 .007204 .199318
-5 -4 -3 .214389 .168426 .014130 .001185 .083894

*For item i, enter the table with the θ values shown in column i (i = 1, 2, 3).
62 4. TEST SCORES AS FUNCTIONS OF ITEM PARAMETERS

4-2 Compute from Eq. (4-1) for examinees at 6 = 0 the frequency distribution
Φ(x\θ) of the number-right score on a test composed of n = 3 equivalent
items, given that P i (0) = .6 for each item. Compute the mean score from
the Φ(x\θ), also from (4-2). Compute the standard deviation (4-3) of
number-right scores. Compute the mean of the proportion-correct score
z = x\n.
4-3 Compute from (4-1) the frequency distribution of number-right score x on
test 1 when (9 = 0, given that P 1 (0) = .7848, P 2 (0) = .6, P 3 (0) = .4152.
C o m p u t e µx θ,µZ θ a n d σx θ. Compare with the results of Exercise 4-2.
4-4 Note that σx θ is the standard error of measurement, (4-8). Check the
value found in Exercise 4-3, using Eq. (4-4).
4-5 Compute from Table 4.3.1 the standard deviation of the conditional dis­
tribution Φ(x\θ) of number-right scores when θ = –3, 0, + 2 . 2 5 . (Be­
cause of rounding errors, the columns do not add to exactly 100; compute
the standard deviation of the distribution as tabled.)
4-6 What is the range of number-right true scores ξ on test 1 (see Table
4.17.1)?
4-7 In Table 4 . 3 . 1 , find very approximately a (equal-tailed) 94% confidence
interval for θ when x = 26.
4-8 Given that P 1 (0) = .7848, P 2 (0) = .6, P 3 (0) = .4152, as in Exercise 4-3,
compute from (4-20) the likelihood when θ = 0 of every possible pattern
of responses to this three-item test.
4-9 Given that u1 = 0, u2 = 0, and u3 = 1, compute for θ = - 3 , - 2 , —1,0,
1, 2, 3 and plot the likelihood function (4-24) for a three-item test com­
posed of equivalent items with a = 1/1.7, b = 0, and c = 0 for each
item. The necessary values of P(θ) are given in Table 4.17.2.
4-10 For Exercise 4-9, show that the right side of (4-33) exceeds the left side
when θa = — 1 but that the left side exceeds the right side when θa = 0;
consequently the maximum likelihood estimator θa satisfying (4-33) lies
between — 1 and 0.
4-11 Find from (4-36) the maximum likelihood estimate θ for the situation in
Exercises 4-9 and 4-10.

TABLE 4.17.2
Logistic Item Response Function P(θ) when a = 1/1.7, b = 0, c = 0

e -3 -2 -1 0 1 2 3

P(θ) .047426 .119203 .268941 .5 .731059 .880797 .952574


PQ .045177 .104994 .196612 .25 .196612 .104994 .045177
P' .045177 .104994 .196612 .25 .196612 .104994 .045177
APPENDIX 63

APPENDIX

Proof that au Is a Sufficient Statistic for 0


Using a line of proof provided by Birnbaum (1968, Chapter 18), we can see that
s is a sufficient statistic for estimating ability. By a familiar general formula
relating conditional, marginal, and joint distributions,
Prob(A and B) = Prob(A) • Prob (B\A). (4-46)
If 0 is fixed, this becomes
Prob(A and B θ) = Prob(A θ) • Prob(B A, θ). (4-47)
Substitute s0 for A and u 0 for B in (4-47) where the subscript indicates corre­
sponding but otherwise arbitrary values of s = s(u) and u. Rearrange to obtain
Prob(s0 and u 0 θ)
Prob(u o |s o ,0) =
Prob(s0 θ) .
Now s ≡ s(u) depends entirely on u, so Prob(s and u|θ) ≡ Prob(u|θ) and
Prob(u0 θ)
Prob(u 0 |s 0 , θ) =
Prob(u0 θ) .
From this and (4-25),

Prob(u 0 |s 0 , 6) = Prob(u0 θ)
.
Prob(u|θ)
u sa
where the summation is over all vectors u for which iaiui = s0. By (4-24),
exp(-DXiaibiuoil)eD θ s
o Π i Q i (θ)
Prob(u 0 |s 0 , θ) = Dθs
exp (-D i ai biui)e o ΠiQi(θ)
u so

exp (–D i ai b i u ioi )


= exp (–D i ai biui)
u So

The point of this result is not the formula obtained but the fact that it does not
depend on 0. In view of the definition of a sufficient statistic (see Section 4.12),
we therefore have the following: If

1. the item response function is logistic,


2. there is no guessing (ci = 0 for all i),
3. the item parameters are known,
64 4. TEST SCORES AS FUNCTIONS OF ITEM PARAMETERS

then the weighted item score


n
S ≡ i ai ui

is a sufficient statistic for estimating examinee ability θ.

REFERENCES

Birnbaum, A. Test scores, sufficient statistics, and the information structures of tests. In F. M. Lord
& M. R. Novick, Statistical theories of mental test scores. Reading, Mass.: Addison-Wesley,
1968.
Foutz, R. V. On the unique consistent solution to the likelihood equations. Journal of the American
Statistical Association, 1977, 72, 147-148.
Kendall, M. G., & Stuart, A. The advanced theory of statistics (Vol. 1, 3rd ed.). New York: Hafner,
1969.
Lord, F. M., & Novick, M. R. Statistical theories of mental test scores. Reading, Mass.: Addison-
Wesley, 1968.
Samejima, F. A comment on Birnbaum's three-parameter logistic model in the latent trait theory.
Psychometrika, 1973, 38, 221-233.
Wright, B. D. Solving measurement problems with the Rasch model. Journal of Educational Mea­
surement, 1977, 14, 97-116.
5 I n f o r m a t i o n

O p t i m a l
F u n c t i o n s

S c o r i n g
a n d

W e i g h t s

5.1. THE INFORMATION FUNCTION FOR A TEST


SCORE

The information function I{θ, y} for any score y is by definition inversely pro­
portional to the square of the length of the asymptotic confidence interval for
estimating ability 0 from score y (Birnbaum, 1968, Section 17.7). In this chap­
ter, an asymptotic result means a result that holds when the number n of items
(not the number N of people) becomes very large. In classical test theory, it is
usual to consider that a test is lengthened by adding items "like those in the
test,'' that is, by adding test forms that are strictly parallel (see Section 1.4) to the
original test. This guarantees that an examinee's proportion-correct true score
("zayta") ξ = ξ/n is not changed by lengthening the test. We shall use lengthen­
ing in this sense here and throughout this book.
Denote by z ≡ xln the observed proportion-correct score (proportion of n
items answered correctly). The regression of z on ability θ is by Eq. (4-2) and
(4-9)
1 n
µz θ Pi(θ)= ζ. (5-1)
n
i =1
This regression is not changed by lengthening the test. The variance of z for fixed
0 is seen from Eq. (4-3) and (4-4) to be
1 n l
σ2z θ = n2 Pi Qi = (PQ -σ2p θ)- (5-2)
n
i=1
This variance approaches zero as n becomes large.

65
66 5. INFORMATION FUNCTIONS AND OPTIMAL SCORING WEIGHTS

The distribution of z for fixed 0 is generated by Eq. (4-1). As n becomes


large, and σz θ 0, the conditional distribution of z shrinks toward a single
point, its mean.
Now consider Fig. 4.8.1, replacing the number-right scale for x on the verti­
cal axis by z ≡ x/n, which ranges from z = 0 to z = 1. The regression µz θ of z
on θ may be visualized as an ogive-shaped curve lying midway between the two
step functions, which approximate the 2½ and the 97½ percentiles of the distri­
butions of z given θ. As n becomes large, this regression does not change, but
σ2z |θ shrinks toward zero, so that the step functions crowd in toward the regres­
sion curve. At the same time, the number of steps increases so that the step
functions increasingly approximate smooth curves. When the conditional distri­
bution of z given 6 has become approximately normal, the distance from the
regression curve to the 2½ percentile, also to the 97½ percentile, will be about
1.96 standard deviations or (1.96/n)√ "iPiQi.
We can visualize the mathematics of the situation by imagining examining
Fig. 4.8.1 under a microscope while n becomes large. If, inappropriately, the
microscope magnifies n times, the bounds of the confidence region will still
appear as step functions. This is true because this view of the figure is equivalent
to examining µnz |θ = µx |θ, the mean number-right score for fixed θ. The num­
ber of steps will increase directly with n; no regularities will appear. The bounds
of the confidence region will appear to move away from the regression line, since
σx |θin (5-2) decreases as √n, while the magnification increases as n.
We shall avoid this by looking at √n(z — ζ) rather than at z. As n becomes
large, the conditional variance of √n(z — ζ) remains finite:

σ 2 [√n(z - ζ)|θ = PQ–σ2p,

and the conditional distribution of √n(z — ζ) approaches normality. Thus the


appropriate microscope magnifies only √n times. With this magnification, the
bounds of the confidence region will approach smooth curves that appear to
remain at a fixed distance (proportional to PQ —σ2p) from the regression line as
both n and the magnification are increased simultaneously. At sufficient mag­
nification, the portions of curves in any small region will appear to be straight
lines, as in Fig. 5.1.1. The bounds of the confidence region and the regression
line will appear parallel because σz |θ does not change appreciably over a small
range of 6.
The asymptotic confidence interval corresponding to z 0 is (θ, θ). The length
of the confidence interval, the distance AB, describes the effectiveness of the test
as a measure of ability. In the triangle ABC,

CB 2(1.96 σz|θ)
tan α = =
AB AB
or
5.1. THE INFORMATION FUNCTION FOR A TEST SCORE 67

Z≡x/n 9 7.5% ile


C

»µ|θ

2.5% ile
Z0 A a a
B

l.96σz|θ

i.96<σz|θ

e e ©
FIG. 5.1.1. Construction of a 95% asymptotic confidence interval (0, 0) for ability 0.

= 3.92σzθ
AB
tan α .
Since tan a is the slope of the regression line µz\θ , the information function, as
defined at the beginning of this chapter, for score z is proportional to
d 2
1 = µz\θ
( dθ )
.
AB2 (3.92) 2 Var(z|θ )

Figure 5.1.1 was derived for estimating ability from the proportion-correct
score z. For unidimensional tests, the same line of reasoning applies quite gener­
ally to almost all kinds of test scores in practical use. Thus Birnbaum (1968)
defines the information function for any score y to be

d 2
µy|θ )
( dθ
I{θ,y} ≡ (5-3)
Var (y\θ)

The information function for score y is by definition the square of the ratio of the
slope of the regression of y on θ to the standard error of measurement of y for
fixed d.
Now true score n, corresponding to observed score y, is fixed whenever θ is
fixed. (If this were not true, the test items would be systematically measuring
68 5. INFORMATION FUNCTIONS AND OPTIMAL SCORING WEIGHTS

some dimension other than 6, contrary to the requirement of unidimensionality.)


Thus, as in Eq. (4-7), Var (y θ) is identical to the familiar squared standard error
of measurement σ2y η.
The information provided by score y for estimating θ varies at different θ
levels. The variation comes from two distinguishable sources:

1. The smaller the standard error of measurement σy |η, the more informa­
tion y provides about θ.
2. The steeper the slope of the regression µy|θ (the more sharply y varies
with θ), the more information y provides about θ.

If two tests have the same true-score scale, their effectiveness as measuring
instruments can properly be summarized by their standard errors of measurement
at various true-score levels. If, however, the tests measure the same trait but their
true-score scales are nonlinearly related, the situation is different. This will
ordinarily be the case whenever the tests are not parallel forms (see Chapter 13).
In this case, it is not enough to compare standard errors of measurement; we must
also take the relation of their true-score scales into account. This is the reason
why the score information function depends not only on the standard error of
measurement but also on the slope of the regression of score on ability.

E x a m p l e 5.1
The use of (5-3) can be illustrated by deriving the information function for the
proportion correct score z. From (5-1),

dµz|θ 1)
= i P' i (θ),
dθ (n
where P'i(θ) is the derivative of Pi(θ) with respect to θ. From (5-3) and (5-2) we
can now write the information function for z:
n 2
P'i(θ),
[ ]
I{θ, z} = i=1
n .
Pi(θ)Qi(θ)
i=1
This result is the same as the information function for number-right score, which
is derived from a more general result in the sequel and presented as Eq. (5-13).

5.2. ALTERNATIVE DERIVATION OF THE


SCORE INFORMATION FUNCTION

A nonasymptotic derivation of (5-3) was given by Lord (1952, Eq. 57) before the
term score information function was coined and also in a different context, by
5.2. ALTERNATIVE DERIVATION OF THE SCORE INFORMATION FUNCTION 69

Mandel and Stiehler (1954). Suppose that we are using test score 3; in an effort to
discriminate individuals at θ' from individuals at θ". Figure 5.2.1 illustrates the
two frequency distributions Φ(y θ) at θ' and θ" and shows the mean µy θ of each
distribution.
A natural statistic to use to measure the effectiveness of y for this purpose is
the ratio

µy θ" – µy θ'
σy θ" ,

where the denominator is some sort of average of σy θ' and σy θ". The displayed
ratio is proportional to the difference between means divided by its standard
error, sometimes called a critical ratio.
If θ' and θ" are close together, µy θ will be an approximately linear function
of θ in the interval (θ' ,θ"). Thus the numerator of our ratio will be proportional to
the distance θ" - θ'. The coefficient of proportionality is the slope of the
regression, given by the derivative d µy θ/dθ. Over short distances, it will make
no difference whether this slope is taken at θ" or at θ'. Also, σy θ" will be close
to σy θ, so their average will differ little froma y \ e σ yθ',.Thus our ratio can be
written

(θ" –θ')(d µy θ/dθ)θ=θ'

σy θ' .
The information function (5-3) when θ = θ' is directly proportional to the
square of the ratio just derived. The coefficient of proportionality is (θ" — θ')2, a
quantity of no relevance for assessing the discriminating power of test score y at
ability level θ = θ'.
If asymptotic values of µy θ and Var (y\θ) are used in (5-3), we find an
asymptotic information function explicable in terms of the length of the asympto­
tic confidence interval. If exact values of µy θ and Var (y\θ) are used, the

µ y θ"

µy θ'

FIG. 5.2.1. Using score y to dis­


θ'
criminate two ability levels. θ'"
70 5. INFORMATION FUNCTIONS AND OPTIMAL SCORING WEIGHTS

resulting information function may be justified by the nonasymptotic derivation


of the present section. The language used need not always distinguish asymptotic
and nonasymptotic information functions, provided no serious confusion arises.

5.3. THE TEST INFORMATION FUNCTION

The maximum likelihood estimator θ is a kind of test score. Thus, we can use
(5-3) to find the information function of the maximum likelihood estimator. To
do this, we need (asymptotic) formulas for the regression µθ|θ and for the var­
iance σ2θ|θ.
There is a general theorem, under regularity conditions, satisfied here
whenever the item parameters are known from previous testing: A maximum
likelihood estimator θ of a parameter θ is asymptotically normally distributed
with mean θ0 (the unknown true parameter value) and variance

1
Var(θ|θ o ) = (5-4)
d ln L 2 ,
E
[ ( dθ ) θ0 }
where L is the likelihood function.
When the item parameters are known, we have from Eq. (5-4) and (4-30) that

1 [ n ]2
= E (ui – Pi)P'i/PiQi θo
Var (θ\θo) { }
i=1
{ [ n
= E (ui – Pi)P'i/PiQi
i=1
n ] }
(uj – Pj)P'j/PjQj |θo
J=1

n n
P'io P'jo
= E[(u i – P i )(u j – P j )|θ o ].
Pio Pjo Qio Qjo
i=l j =1

Since E(ui |θo) = Pio() , the expectation under the summation sign is a covariance.
Because of local independence, ui is distributed independently of uj for fixed θ.
Consequently the covariance is zero except when i = j , in which case it is a
variance. Thus

1 n p 'i2 n p 'io2
= 2
Var
2
(uio|θo) = pi02Qio2•
Var(θ|θo) p i Q 'i p io2 p io2
i=1 i=l
Dropping the subscript o, the formula for the asymptotic sampling variance of the
maximum likelihood estimator is thus
5.3. THE TEST INFORMATION FUNCTION 71

1
Var(θ|θ) = (5-5)
n P'i2 .
Pi Qi
t=1
Now, as already stated, θ is a consistent estimator; so asymptotically µθ|θ
θ. Thus asymptotically the numerator of the information function (5-3) for score
θ is (dµθ|θ/dθ)2 = 1. Thus the (asymptotic) information function (5-3) of the
maximum likelihood estimator of ability is the reciprocal of the asymptotic
variance (5-5):
n p' i 2
I{θ} ≡ I {θ,θ} = (5-6)
Pi Qi .
i= 1
Let us note an obvious theorem in passing:

Theorem 5.3.1. The information function for an unbiased (consistent)


estimator of ability is the reciprocal of the (asymptotic) sampling variance of the
estimator.

Equation (5-6) is of such importance that it is given a special name and


symbol. It is called the test information function and is denoted simply by I{θ}.
Information functions for ordinary published tests are usually roughly bell-
shaped. Such a test information function is shown in Fig. 5.5.1.
The importance of the test information function comes partly from the fact
that it provides an (attainable) upper limit to the information that can be obtained
from the test, no matter what method of scoring is used:

Theorem 5.3.2. The test information function I{θ} given by (5-6) is an upper
bound to the information that can be obtained by any method of scoring the test.

Proof: Suppose t is an unbiased estimator of some function θ. Denote this


function of θ by τ(θ). According to the Cramér-Rao inequality (Kendall &
Stuart, 1973, Section 17.23),
2
[τ' (θ)]
Var (t θ) ≥ (5-7)
[ d ln L ] ,
E 2
( dθ

where τ' (θ) is the derivative of τ'(θ). Since E(t\θ) ≡ τ(θ), we have from (5-3),
(5-4), and (5-7) asymptotically

[τ' (θ)]2 [ d ln L 1
I{θ, t}= ≤E 2 = = I{θ}. (5-8)
Var (t θ) ( dθ ) ] Var (θ θ)

This result holds under rather general regularity conditions on the item response
function P i ( θ ) .
72 5. INFORMATION FUNCTIONS AND OPTIMAL SCORING WEIGHTS

5.4. THE ITEM INFORMATION FUNCTION

A very important feature of (5-6) is that the test information consists entirely of
independent and additive contributions from the items. The contribution of an
item does not depend on what other items are included in the test. The contribu­
tion of a single item is P'i2/Pi Qt. This contribution is called the item information
function:
p i '2
I{θ, ui} = (5-9)
PiQi .
Item information functions for five familiar items are shown in Figure 2.5.1
along with the I{6} for the five-item test.
In classical test theory, by contrast, the validity coefficient ρxC for number-
right test score (correlation between score and criterion C) is given by Eq. (1-25)
in terms of item intercorrelations ρij and item-criterion correlations ρ i c . There is
no way to identify the contribution of a single item to test validity; the contribu­
tion of the item depends in an intricate way on the choice of items included in the
test. The same may be said of an item's contribution to coefficient alpha, as
shown by Eq. (1-24), and to other test reliability coefficients.
For emphasis and clarity, let us elaborate here Birnbaum's (1968) suggested
procedure for test construction, previewed in Chapter 2. The procedure operates
on a pool of items that have been calibrated by pretesting, so that we have the
item information curve for each item.1

1. Decide on the shape desired for the test information function. Remember
that this information function is inversely proportional to the squared length of
the asymptotic confidence interval for estimating ability from test score. What
accuracy of ability estimation is required of the test at each ability level? The
desired curve is the target information curve.
2. Select items with item information curves that will fill the hard-to-fill
areas under the target information curve.
3. Cumulatively add up the item information curves, obtaining at all times the
information curve for the part-test composed of items already selected.
4. Continue (back-tracking if necessary) until the area under target informa­
tion curve is filled up to a satisfactory approximation.

The item information curve for the three-parameter logistic model in Eq. (2-1)
can be written down from (5-9) in many forms, such as

1
These rules are reproduced with special permission from F. M. Lord, Practical applications of
item characteristic curve theory. Journal of Educational Measurement, Summer 1977, 14, No. 2,
117-138. Copyright 1977, National Council on Measurement in Education, Inc., East Lansing,
Mich.
5.5. INFORMATION FUNCTION FOR A WEIGHTED SUM OF ITEM SCORES 73

Pi – ci 2
= D a i Qi
2 2
I{θ, ui} ,
Pi ( 1 – ci
D2a2i(1 – ci)
I{θ, ui] =
(c, + e D L i ) ( 1 +e-DLi)2 '

where Li = ai(θ - bi).

5.5. INFORMATION FUNCTION FOR A WEIGHTED


S U M OF ITEM S C O R E S

Suppose the test score is the weighted composite y ≡ iWiui,, where the wi are
any set of weights. Since each ui is a (locally) independent binomial variable, we
have

µ2 wu θ = i Wi Pi, (5-10)

σ2 wu θ = i W2i Pi (5-11)
By (5-3), the information function for the weighted composite is

( iWiPi )2 (5-12)
I{θ, iWiui} = 2
i W i Pi ,
.

If the weights are all 1, y is the usual number-right score x. Thus the informa­
tion function for number-right score x is

( iP'i )2 (5-13)
I{θ, x} =
i Pi Qi ,

Note that (5-12) and (5-13) cannot be expressed as simple sums of independent
additive contributions from individual items, as in (5-6).
Figure 5.5.1 shows the estimated information function I{θ, x} for the
number-right score on a high school-level verbal test (SCAT II, Form 2A)
composed of 50 four-choice word-relations items. For comparison, the test in­
formation function I{θ} is shown and also two other information functions to be
discussed later. The test characteristic curve is given also.
We have seen that the squared slope of the test characteristic curve is the
numerator of the information function for number-right score. The inflection
point of the test characteristic curve in the figure is to the left of the maximum
information, showing the effect of the denominator (squared standard error of
measurement).
The relation shown between the number-right curve and the upper bound I{θ}
is fairly typical of plots seen by the writer. This relation is of interest since it
limits the extent to which we can hope to improve the accuracy of measurement
by improving the method of scoring the test.
74 5. INFORMATION FUNCTIONS AND OPTIMAL SCORING WEIGHTS

ABILITY
-3 -2 -I 0 1 2 3

I 8
50
TEST CHARACTERISTIC CURVE
EQUAL WEIGHTS
OPTIMAL WEIGHTS

15
40 SCORING WEIGHTS A
SCORING WEIGHTS (5-18)

12
SCORE

3.0

INFORMATION
RIGHT

9
NUMBER

20

6
10

3
0

0
-3 -2 0 1 2 3
-1 ABILITY

FIG. 5.5.1. Test characteristic curve (solid line) and various information curves
(dashed lines) for SCAT II 2A Verbal test. Item-scoring weights for the informa­
tion curves are specified in the legend.

5.6. OPTIMAL SCORING WEIGHTS

Suppose we allow the item-scoring weight wi to be a function of θ. In particular,


consider using as item-scoring weight the function wi( θ)≡ P'i/PiQi. Sub­
stituting this weight into (5-12) gives the result
p '2
P'i ui, } ( iP'i
2
Qi Qi)2 1
/ θ, = i (5-14)
{ i
PiQi
=
i Pi Qi(p'i/Pi Qi)
2 i
PiQi

We have the surprising result that the information function for the weighted
composite i (p'i/Pi Qi)ui is the same as the test information function, which is
the maximum information attainable by any scoring method. Thus
p'i(θ) (5-15)
Wi, (θ) ≡
P i (θ)Q i (θ)

is the optimal scoring weight for item i.


In practice, we do not know θ for any individual; hence we cannot know
Wi(θ). We can approximate θ and thus Wi(θ), however.
If the item response function is three-parameter logistic [Eq. (2-1)], it is easily
verified that
5.6. OPTIMAL SCORING WEIGHTS 75

Dai Qi(Pi – ci )
P'i = (5-16)
1 – ci
From (5-15) and (5-16), the optimal item-scoring weights are
Dai (Pi – ci ) Dai
Wi(θ) = (5-17)
Pi(1 – ci) 1 + cie–DLi
where Li ≡ ai{θ — bi). Note that when ci = 0, the optimal weight is 1.7ai or
since we may divide all the weights by 1.7, simply ai.
At high ability levels P i a (θ) 1, consequently Wi(θ) Dai . Thus, we see
that at high ability levels optimal scoring weights under the logistic model are
proportional to item discriminating power ai.
The optimal weights Wi(θ) under the logistic model are shown in Fig. 2.5.2
for five familiar items. Note the following facts about optimal item weights for
the logistic model, visible from the curves in the figure.2
1. As ability increases, the curve representing optimal item weight as a func­
tion of ability sooner or later becomes virtually horizontal. Thus, for sufficiently
high ability levels, the optimal item weights are virtually independent of ability
level. The optimum weight at this upper asymptote is proportional to the item
parameter ai. This occurs because there is no guessing at high ability levels.
2. As ability decreases from a very high level, the optimal weight curves for
the difficult items begin to decline. The reason is that at lower ability levels
random guessing destroys the value of these items.
3. As ability decreases further, the optimal weights for these difficult items
become virtually zero. Such items will not be wanted if the test is used only to
discriminate among examinees at low ability levels.
In summary, under the logistic model, the optimal weight to be assigned to an
item for discriminating at high ability levels depends on the general discriminat­
ing power of the item. The optimal weight to be used for discriminating at lower
ability levels depends not only on the general discriminating power of the item
but also very much on the amount of random guessing occurring on the item at
these ability levels. Thus, all moderately discriminating items are of use for
discriminating at high ability levels, whereas only the easy items are of appreci­
able use for discriminating at low ability levels.
Item-scoring weights that are optimal for a particular examinee can never be
determined exactly, since we do not know the examinee's ability θ exactly.3 A

2
The remainder of this paragraph is taken with permission from F. M. Lord, An analysis of the
Verbal Scholastic Aptitude Test using Birnbaum's three-parameter logistic model. Educational and
Psychological Measurement, 1968, 28, 989-1020.
3
The remainder of this section is adapted and reprinted with special permission from F. M. Lord,
Practical applications of item characteristic curve theory. Journal of Educational Measurement,
Summer 1977, 14, No. 2, 117-138. Copyright 1977, National Council on Measurement in Educa­
tion, Inc., East Lansing, Mich.
76 5. INFORMATION FUNCTIONS AND OPTIMAL SCORING WEIGHTS

crude procedure for obtaining item-scoring weights is to substitute the conven­


tional item difficulty pi (proportion of correct answers in the total group of
examinees) for Pi(θ) in (5-17). This crude procedure would use the resulting
weight for scoring item i on all answer sheets regardless of examinee ability
level. Since D = 1.7 is a constant, we can drop it and use the weight
ai Pi – Ci
Wi = (5-18)
1 — ci Pi .
This same item-scoring weight, except for the ai, was recommended on other
grounds by Chernoff (see Lord & Novick, 1968, p. 310).
The effect of using the crude scoring weights is illustrated for the SCAT II-2A
verbal test in Fig. 5.5.1. For this test, the crude weights are almost everywhere
better than no weights at all.
A better but more complicated procedure for determining scoring weights for
a conventional test might be somewhat as follows.

1. Score the test in the usual way.


2. Divide the examinees into three or more subgroups according to the usual
scores.
3. Separately for each subgroup, use (5-18) to find a roughly optimal scoring
weight for each item.
4. Rescore all answer sheets using the item-scoring weights from step 3
(different weights in each subgroup).
5. Equate the three score scales obtained from the three sets of scoring
weights. Conventional equating methods (Angoff, 1971) may be used for
this.
6. Use the equating to place everyone on the same score scale.

The foregoing procedure should improve measurement effectiveness, since


each answer sheet is scored with weights roughly appropriate for the examinee's
ability level. Too much should not be expected, however. If only a third of the
items in a test are useful for measuring examinees at a certain ability level, no
amount of statistical manipulation will make the test a really good one for such
examinees.

5.7. OPTIMAL SCORING WEIGHTS NOT DEPENDENT ON θ

Is there an item response function P(θ) such that the optimal weights w(θ)
actually do not depend on θ? If so, then

w(θ) ≡ P'(θ)
= A,
P(θ)Q(θ)
where A is some constant. This leads to the differential equation
5.9. EXERCISES 77

dP
= A dθ.
P(l – P)
Integrating, we have uniquely
1 – P
–In = Aθ + B,
P
where B is a constant of integration. Solving for P, we find
1 1
P ≡ P(θ) =
1 + e–Aθ–B = 1 + e-Da(θ-b) ,
where A ≡ Da and B ≡ —Dab.
In summary, when the item response function is a two-parameter logistic
function, the optimal scoring weight Wi(θ) does not depend on θ. The optimal
weight is wi = ai , the item discrimination index. The optimally weighted compos­
ite of item scores is s ≡ i a i u i , the sufficient statistic of Section 4.12. The
two-parameter logistic function, which does not permit guessing, is the most
general item response function for which the optimal item scoring weights do not
depend on θ.
Figure 5.5.1 shows the information curve obtained when the weights wi = ai
are used for the SCAT II-2A verbal test. The score iaiui is optimally efficient
at high ability levels but is less efficient than number-right score at low ability
levels. This is the result to be expected on a multiple-choice test, since wi = ai is
optimal only when there is no guessing.

5.8. MAXIMUM LIKELIHOOD ESTIMATE OF ABILITY

From Eq. (4-31) and (5-15), if the item parameters are known from pretesting,
the maximum likelihood estimate of ability is obtained by solving, for θ, the
equation

n pi(θ) n W (θ)u . (5-19)


Qi(θ) =
i i

i= 1 i=1
Thus we see that the maximum likelihood estimator θ is itself a function of the
optimally weighted composite of item scores iWi(θ)ui with θ substituted for θ.
This is true regardless of the form of the item response function Pi(θ).

5.9. EXERCISES

5-1 For test 1, compute from Table 4.17.1 the mean number-right score x at
θ = —3, — 2, —1,0, 1, 2, 3 using Eq. (4-2). Plot the regression of x on
θ. This is the test characteristic function.
78 5. INFORMATION FUNCTIONS AND OPTIMAL SCORING WEIGHTS

5-2 As in Exercise 5-1, compute the standard deviation [Eq. (4-3)] of number-
right score for integer values of θ. Plot on the same graph as the regres­
sion σx θ .
5-3 From Table 4.17.1, plot on a single graph the item information function
for each of the three items in test 1.
5-4 Compute from Table 4.17.1 the test information function (5-6) of test 1.
Plot on the same graph as µx θ andσx θ Also plot on the same graph as
the item information functions.
5-5 to 5-8 Using Table 4.17.2, repeat Exercises 5-1 to 5-4 for a three-item test
with a = 1/1.7, b = 0, c = 0 for all items. Compare with the results of
test 1.
5-9 Compute from (5-5) the variance of θ at integral values of θ for test 1.
5-10 From Table 4.17.1, compute the score information function (5-13) for the
number-right score x. Plot this and the test information function (5-6)
from Exercise 5-4 on the same graph.
5-11 For each item in test 1, plot the optimal scoring weights (5-15) as a func­
tion of θ.
5-12 For test 1, compute the optimally weighted composite score iWi(θ)ui
for examinees at ability level θ=0 responding ux = 1, u2 = 0, u3 = 0.
Repeat for ui, = 0, u2 = 1, u3 = 0; also repeat for ux = 0, u2 = 0, u3
= 1. Can you explain why the scores for the three patterns should be in
the rank order you have found?
5-13 Compute the optimal item-scoring weight (5-15) at each θ level for the
items in Table 4.17.2. Explain.

APPENDIX

Information Functions for Transformed Scores


What is the effect on the score information function (5-3) of transforming the
score scale? Let Y ≡ Y(y) denote a monotonic transformation of the score y, and
let η = E(y\θ) be the true score corresponding to y.
For present purposes, score y will be assumed to have the property that η ≡
E(y θ) does not depend on test length n. For most conventional scoring methods,
this is usually a trivial requirement. For example, if x is the number of right
answers, ξ = Ex = n Pi(θ) is not independent of test length. Instead of using x,
however, we simply use z, the proportion of right answers. The proportion-
correct true score ζ ≡ Ez = [ nPi(θ)]/n does not vary if n is increased by
adding test forms parallel to the original test (see Section 5.1).
Now y is an unbiased estimate of η, and E{y —η)2 is the sampling variance of
the estimator y. It is usual to find that such sampling variances are of order 1/n;
this means that for sufficiently large n the sampling variance can be written as
APPENDIX 79

(1/n) times a constant term (a term that does not vary with n) and also that E(y —
η)3, the third sampling moment of y, is of order n–3/2 (is a constant divided by
n3/2). We assume this in all that follows.
Expanding Y(y) by Taylor's formula, we have
Y(y) – Y(η) = Y'(η)(y - η) + ½ y"(η)(y - η)2 +
δY'"(η)(y – η)3, (5-20)
where0 <δ < 1 and Y' (η), Y"(η), Y' "(η) are derivatives of Y(η) with respect to η.
Rearranging (5-20) and taking expectations, we find that the expectation of Y is

E(Y θ) = Y(η)) + terms of order1/n . (5-21)

Squaring (5-20) and taking expectations, we find a formula for the sampling
variance of Y:
Var (Y θ) ≡ E{[Y(y) - Y(η})]2 θ} = [Y'(η)]2 Var (y\ θ) + terms
of order n–3/2. (5-22)
From (5-21),
d dη] 1

E(Y θ) = Y'(η)dθ + terms of order .
n
From this and (5-22) and (5-3), we obtain the information function for the
transformed score Y(y):
2
I{θ, Y(y)} = (dη/dθ) + neglected terms.
Var (y\θ)
Since Var (y\θ) is a constant times 1/n and the numerator is independent of n, the
fraction on the right is of order n (a constant times n). The largest neglected
terms are easily seen to be constant with respect to n. For large n, the largest
neglected terms are therefore small compared to term retained. The term retained
is seen to be the information function of the untransformed score y. Asymptoti­
cally,
/{θ, Y(y)} = l{θ,y}. (5-23)
In summary, if (1) y is a score chosen so that the corresponding true score
does not vary with n, (2) Y(y) is a monotonic transformation of 3; not involving
n, (3) Var (y θ) is of order 1/n and E[(y —η))3θ]is of order n–3/2, then the score
transformation Y(y) does not change the asymptotic score information function.
The first restriction in this summary is readily removed for most sensible
methods of scoring. The number-right true score ξ, for example, varies with n,
whereas the proportion-correct true score ζ = ξ/n does not; yet bothξ andζ have
the same information: I { θ , ξ } = I{θ, ζ}.
The invariance (5-23) of I{θ, Y(y)} is important; however, it is not surprising
80 5. INFORMATION FUNCTIONS AND OPTIMAL SCORING WEIGHTS

in view of the definition of information. A monotonic transformation of score y


should not change the confidence interval for θ.

REFERENCES

Angoff, W. H. Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.), Educational


measurement (2nd ed.). Washington, D.C.: American Council on Education, 1971.
Birnbaum, A. Some latent trait models. In F. M. Lord & M. R. Novick, Statistical theories of mental
test scores. Reading, Mass.: Addison-Wesley, 1968.
Kendall, M. G., & Stuart, A. The advanced theory of statistics (Vol. 2, 3rd ed.). New York: Hafner,
1973.
Lord, F. M. A theory of test scores. Psychometric Monograph No. 7. Psychometric Society, 1952.
Lord, F. M., & Novick, M. R. Statistical theories of mental test scores. Reading, Mass.: Addison-
Wesley, 1968.
Mandel, J., & Stiehler, R. D. Sensitivity—A criterion for the comparison of methods of test. Journal
of Research of the National Bureau of Standards, 1954, 53, 155-159.
II APPLICATIONS OF
ITEM RESPONSE THEORY
6 The Relative Efficiency of
Two Tests

6.1. RELATIVE EFFICIENCY

The relative efficiency of test score y with respect to test score x is the ratio of
their information functions:

RE {y, x} ≡ I{θ, y} (6-1)


I{θ, x} .

Scores x and y may be scores on two different tests of the same ability θ, or x and
y may result from scoring the same test in two different ways. Relative efficiency
is defined only when the θ in I{θ, y} is the same θ as in I{θ, x}. Although the
notation does not make it explicit, it should be clear that the relative efficiency of
two test scores varies according to ability level.
The dashed curve in Fig. 6.7.1 shows estimated relative efficiency of a
"regular" test compared to a "peaked" test. Both are 45-item verbal tests
composed of five-choice items. The regular test (y) consists of the even-
numbered items in a 90-item College Board SAT. The peaked test (x) consists of
45 items from the same test with difficulty parameters nearest the average bi (the
average over all 90 items).

83
84 6. THE RELATIVE EFFICIENCY OF TWO TESTS

There is considerable overlap in items between the two 45-item tests, but this
does not impair the comparison. As the figure shows, from the third percentile up
through the thirtieth, the regular test with its wide spread of item difficulty is less
than half as efficient as the peaked test. In other words, the regular test would
have to be lengthened to more than 90 items in order to be as efficient as the
45-item peaked test within this range.

6.2. TRANSFORMATIONS OF THE ABILITY SCALE

The ability scale θ is the scale on which all item response functions have the
particular mathematical form Pi(θ). This is a specified form chosen by the
psychometrician, such as Eq. (2-1) or (2-2). Except for the theoretical case where
all items are equivalent, there is no transformation of the ability scale that will
convert a set of normal ogive response functions to logistic, or vice versa.
Once we have found the scale θ on which all item response curves are (say)
logistic, it is often thought that this scale has unique virtues. This conclusion is
incorrect, however, as the following illustration shows.
Consider the transformations
Dai
θ* ≡ θ*(θ) ≡ Kekθ, b*i ≡ Kekbi, ai* ≡ ,
(6-2)
k
where K and k are any positive constants. Under the logistic model
1 — ci
Pi ≡ ci + 1 + e –Da
eDa i bi

1 — ci (6-3)
≡ ci +
1 +(b*i/θ*)ai*
Also
(b*i /θ*)ai*
Qi=(1 – ci) .
1 + (b*i /θ*)ai*
Thus
Pi – ci θ* ai*
= * (6-4)
Qi ( bi) .
This last equation relates probability of success on an item to the ratio of exam­
inee ability θ * to item difficulty b*i. The relation is so simple and direct as to
suggest that the θ* scale may be better for measuring ability than is the θ scale.
By assumption, all items have logistic response curves on the θ scale; how­
ever, it is equally true that all items have response curves given by (6-3) on the θ*
scale. Thus there is no obvious reason to prefer θ to θ*.
6.3. EFFECT OF ABILITY TRANSFORMATION 85

6.3. EFFECT OF ABILITY TRANSFORMATION ON THE


INFORMATION FUNCTION

If there is no unique virtue in the θ scale for ability, we should consider how a
monotonic transformation of this scale affects our theoretical machinery. There
is nothing about our definition or derivation of the information function [Eq. (5-3)]
that requires us to use the θ scale rather than the θ* scale. If θ* is any monotonic
transformation of θ, the information function for making inferences about θ*
from y is defined by Eq. (5-3) to be

(dµy|θ* /dθ*)2
I{θ*, y} ≡ (6-5)
Var (y\θ*) .
Before proceeding, we need to clarify a notational paradox. Note that, for
every θ O ,

Var (y|θ=θ0) ≡ Var (y|θ*=θ*(θo)).


The left-hand side is usually abbreviated as Var (v|θ o ), the right-hand side as Var
(y|θ *). The abbreviated equation Var (y\θ0) = Var (v|θ* o ) appears self-contradic­
tory. This is the fault of the abbreviated notation and does not impair the validity
of the unabbreviated result. Similarly (in abbreviated notation),

µy|θo ≡ µy|θ*(θo) ≡ µy|θ*o.


By the chain rule for differentiation, d/dθ* ≡ (dθ/dθ*)(d/dθ). Substituting
the last three equations into (6-5) and dropping the subscript o, we have the
important result
(dµy|θ* /dθ*)2 dθ 2
I{θ*, y} ≡
Var (y\θ*) ( dθ* )
= (dµy|θ /dθ)2 dθ 2
Var (y\θ ( dθ* )
I{θ, y}
= (6-6)
dθ* 2
( dθ )

This result states: When we transform θ monotonically to θ*(θ), the information


function is divided by the square of the derivative of the transformation.
This is as it should be. The confidence interval (θ, θ) in Fig. 5.1.1 transforms
into the confidence interval (θ*, θ*). Asymptotically, the length of the latter
interval will be dθ*/dθ times the length of the former interval. Thus the informa­
tion function I{θ*, x} will equal I{θ, x} divided by (dθ*/dθ)2.
When dθ*/dθ varies along the ability scale, the shape of the information
86 6. THE RELATIVE EFFICIENCY OF T W O TESTS

THETA SCALE OF ABILITY

30

s
c
o
R
20 E

I
N
F
0
R
M
A
T
0
N
F
U
N
C
T
i
0
10 N

0
2 10 25 50 75 90 98 %ile

FIG. 6.3.1. Score information function for measuring ability θ, SAT Mathematics
test. Taken with permission from F. M. Lord, The 'ability' scale in item charac­
teristic curve theory. Psychometrika, 1975, 40, 205-217.

function may be drastically altered by the transformation. Worse yet, the ability
level at which a test provides maximum information may be totally different
when ability is measured by θ* rather than by θ. Or /{θ, x} may have one
maximum, whereas I{θ*, x} has two separate maxima. Actually, any single-
valued continuous information function on θ may be transformed to any other
such information function by a suitably chosen monotonic transformation θ*(θ).
Figure 6.3.1 shows the information function I{θ, x} for number-right score
on a 60-item College Board mathematics aptitude test. The baseline, representing
ability, is marked off in terms of estimated percentile rank on ability for the
group tested rather than in terms of θ values. Figure 6.3.2 shows a rather mild
transformation θ*(θ) ≡ ω(θ). Figure 6.3.3 shows the resulting information func-
6.3. EFFECT OF ABILITY TRANSFORMATION 87

THETA SCALE OF ABILITY


2.0

1.5

1.0

0.5 O
M
E
G
A

0.0 s
cA
L
E
O
-0.5 F
A
B
I
L
-1.0
ITY

-1.5

•2.0

-2.5
-3.5 -3.0 -2.5 -2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 '.5

FIG. 6.3.2. Relation of the ω scale of ability to the usual 6 scale. Taken with
permission from F. M. Lord, The 'ability' scale in item characteristic curve
theory. Psychometrika, 1975, 40, 205-217.

tion I{ω, x} on the Ω scale for the same number-right score. The information
functions for the same score x on the two different ability scales bear little re­
semblance to each other.
Clearly information is not a pure number; the units in terms of which informa­
tion is measured depend on the units used to measure ability. This must be true,
since information is defined by the length of a confidence interval, and this
length is expressed in terms of the units used to measure ability. If we are
uncertain what units to use to quantify ability, then to the same extent we do not
know how to quantify information.
We cannot draw any useful conclusions from the shape of a single information
function unless we assert that the ability scale we are using is unique except for a
88 6. THE RELATIVE EFFICIENCY OF TWO TESTS

OMEGA SCALE OF ABILITY

30

s
c
O
R
E
20 I
N
F
O
R
M
A
T
I
O
N
F
U
N
C
T
I
O
N
10

0
2 10 25 50 75 90 98 % ile

FIG. 6.3.3. Score information function for measuring ability ω, SAT Mathematics
test. Taken with permission from F. M. Lord, The 'ability' scale in item charac­
teristic curve theory. Psychometrika, 1975, 40, 205-217.

linear transformation. Most important we cannot know at what ability level the
test or test score discriminates best, unless we have an ability scale that is not
subject to challenge.
Even though a single information curve may not be readily interpretable,
comparisons between two or more information curves are not impaired by doubt
about the ability scale. This important fact is easily proved in Section 6.4.

6.4. EFFECT OF ABILITY TRANSFORMATION ON


RELATIVE EFFICIENCY

Suppose we transform the ability scale monotonically to θ*(θ) and then compute
the relative efficiency of two scores, x and y (which may be scores on one test or

on two different tests), for measuring θ*. Replacing θ by θ* in (6-1) and using
(6-6), we find

RE{y, x} = I{θ*, y} / I{θ*, x} = [I{θ, y}/(dθ*/dθ)²] / [I{θ, x}/(dθ*/dθ)²] = I{θ, y} / I{θ, x}.

Comparing this with (6-1), we see that relative efficiency is invariant under any
monotonic transformation of the ability scale. It is for this reason that the symbol
θ does not appear in the notation RE{y, x}.
For the reasons outlined in Section 6.3, the practical applications of item
response theory in this book are not based on inference from an isolated informa-
tion function. We shall compare information curves, or equivalently we shall rely
on a study of relative efficiency. Such comparisons are not affected by the choice
of scale for measuring ability.
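
As an illustration of the two preceding points (this sketch is not part of the original text; the item parameters are invented), the code below computes the number-right score information functions of two hypothetical 3PL tests, rescales ability to θ* = exp(θ), and prints the relative efficiency on both scales. The shape and peak of each information function change with the rescaling, but their ratio does not.

```python
import numpy as np

D = 1.7  # scaling constant for the logistic item response function

def p3pl(theta, a, b, c):
    """Three-parameter logistic item response function."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def score_info(theta, a, b, c):
    """Information of number-right score x about theta (standard result for an
    unweighted sum of locally independent items): (sum P_i')^2 / (sum P_i Q_i)."""
    P = p3pl(theta[:, None], a, b, c)
    Pp = D * a * (P - c) * (1.0 - P) / (1.0 - c)      # derivative dP/dtheta
    return Pp.sum(axis=1) ** 2 / (P * (1.0 - P)).sum(axis=1)

theta = np.linspace(-3.0, 3.0, 61)

# Two small hypothetical tests (item parameters invented for illustration).
ax, bx, cx = np.full(10, 1.0), np.linspace(-2.0, 2.0, 10), np.full(10, 0.2)
ay, by, cy = np.full(10, 0.8), np.zeros(10), np.full(10, 0.2)

Ix, Iy = score_info(theta, ax, bx, cx), score_info(theta, ay, by, cy)

# Monotonic rescaling theta* = exp(theta); by (6-6), divide by (dtheta*/dtheta)^2.
deriv = np.exp(theta)
Ix_star, Iy_star = Ix / deriv ** 2, Iy / deriv ** 2

print("peak of I{theta, x} at theta  =", theta[Ix.argmax()])
print("peak of I{theta*,x} at theta* =", np.exp(theta)[Ix_star.argmax()])
print("RE{y, x} on theta  scale:", np.round(Iy / Ix, 3)[::20])
print("RE{y, x} on theta* scale:", np.round(Iy_star / Ix_star, 3)[::20])  # identical
```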

6.5. INFORMATION FUNCTION OF OBSERVED SCORE ON TRUE SCORE

It was noted in Section 4.2 that number-right true score ξ is a monotonic increas­
ing transformation of ability θ. What is the information function of number-right
score x for making inferences about true score ξ?
If we substitute ξ for θ* and x for y in (6-5), we find

I{ξ, x} = (dμx|ξ/dξ)² / σ²x|ξ .    (6-7)

Now the true score ξ is defined as the expectation of x. It follows that μx|ξ = ξ.
If we substitute this into the numerator of (6-7), the desired information function
is found to be

I{ξ, x} ≡ 1/σ²x|ξ .    (6-8)
When using observed score x to make inferences about the corresponding
true score ξ, the appropriate information function I{ξ , x} is the reciprocal of
the squared standard error of measurement of score x at ξ. This result will hold
for any score x, not just for number-right score, as long as ξ ≡ µx|ξ is a mono­
tonic function of θ.
Figure 6.5.1 shows I{ξ, x} for the same test represented in Fig. 6.3.1 and
6.3.3. The reader should compare these three information functions, noting once
again that information functions do not give a unique answer to the question: "At
what ability level does the test measure best?"
The reader may have been startled to find from Fig. 6.5.1 that I{ξ, x} is
greatest at high and at low ability levels and least at moderate ability levels.
Actually, similar results would be found for most tests. Examinees at very high

[Figure 6.5.1 here: score information function (0.0 to 0.5) plotted against number-right true score, 0 to 60.]

FIG. 6.5.1. Score information function for measuring the true score ξ on SAT
mathematics test.

ability levels are virtually certain to obtain a perfect score on the test. Thus for
them the standard error of measurement σx|ξ is nearly zero, their true score ξ is
very close to n, the length of the confidence interval for estimating ξ from x is
nearly zero, and consequently I{ξ, x} ≡ 1/σ²x|ξ is very large. Clearly true score
ξ can be estimated very accurately for such examinees: It is close to n. Their
ability θ cannot be estimated accurately, however: We know that their θ is high
without knowing how high. This situation is mirrored by the fact that I{ξ, x} is
very large for such examinees, whereas I{θ, x} is near zero. The reader should
understand these conclusions if he is to make proper use of information functions
(or of standard errors of measurement).
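
The contrast just described is easy to reproduce numerically. The sketch below (added here; the ten item parameters are invented) evaluates the conditional variance of number-right score, Σ PiQi, at several ability levels and prints I{θ, x} alongside I{ξ, x} = 1/σ²x|ξ; at the extreme ability levels the former is small while the latter is large.

```python
import numpy as np

D = 1.7

def p3pl(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

# Hypothetical 10-item test (parameters invented for illustration).
a, b, c = np.full(10, 1.0), np.linspace(-1.5, 1.5, 10), np.full(10, 0.2)

for theta in (-3.0, 0.0, 3.0):
    P = p3pl(theta, a, b, c)
    Pp = D * a * (P - c) * (1.0 - P) / (1.0 - c)
    var_x = np.sum(P * (1.0 - P))        # conditional variance of number-right score
    info_theta = Pp.sum() ** 2 / var_x   # I{theta, x}
    info_xi = 1.0 / var_x                # I{xi, x}, Eq. (6-8)
    print(f"theta = {theta:+.0f}   xi = {P.sum():5.2f}   "
          f"I(theta, x) = {info_theta:6.3f}   I(xi, x) = {info_xi:6.3f}")
```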

6.6. RELATION BETWEEN RELATIVE EFFICIENCY AND TRUE-SCORE DISTRIBUTION

Suppose now that we have another test measuring the same ability θ as test x.
Denote the observed score on the new test by y and the corresponding true score
by η. As in (6-8), the information function for y on η will be

I{η, y} = 1/σ²y|η .    (6-9)

For present purposes, it is not necessary that either x or y be a number-right
score. Suppose merely that ξ is some monotonic increasing function of ability θ
and that η also is. Then η is necessarily a monotonic increasing function of ξ; we
write η ≡ η(ξ). Also ξ is a monotonic increasing function of η. If x and y are
number-right scores, η(ξ) is found (numerically) by eliminating θ from the test
characteristic curves ξ = Σi Pi(θ) and η = Σj Pj(θ). Figure 13.5.1 shows how
the value of η(ξ0) may be determined graphically for any given ξ0 from ξ =
Σi Pi(θ) and η = Σj Pj(θ).
Now, we can substitute ξ for θ* and η for θ in (6-6) and then use (6-9) to write
the information function of y on ξ:

I{ξ, y} = I{η, y} / (dξ/dη)² = (dη/dξ)² / σ²y|η .    (6-10)
The efficiency of y relative to x is the ratio of (6-10) to (6-8), or

RE{y, x} = (σ²x|ξ / σ²y|η)(dη/dξ)² .    (6-11)

Similarly,

RE{x, y} = (σ²y|η / σ²x|ξ)(dξ/dη)² .    (6-12)

Equations (6-11) and (6-12) are valid regardless of the scale used to measure
ability (see Section 6.4). In particular, (6-11) and (6-12) do not assume that
ability is to be measured on the true-score scale ξ.
Denote by p(ξ) the frequency distribution (density) of true score ξ in some
population of examinees. The distribution q(η) of η = η(ξ) in this same popula-
tion is then found from

q(η) dη ≡ p(ξ) dξ .    (6-13)

Rearranging, we have

dη/dξ = p(ξ)/q(η) .

Substituting this into (6-11), we find

RE{y, x} = (σ²x|ξ / σ²y|η) · p²(ξ)/q²[η(ξ)] .    (6-14)

To our surprise, this formula shows that the relative efficiency of two tests can
be expressed directly in terms of true-score frequency distributions and standard
errors of measurement. These formulas agree with the vague intuitive notion that a
test is more discriminating at true-score levels where the scores are spread out
and less discriminating at true-score levels where the scores pile up.

6.7. AN APPROXIMATION FOR RELATIVE EFFICIENCY

If estimated item parameters are available, estimated relative efficiency can be


directly and simply computed from (6-1) and from the formula for the appro­
priate information function, such as Eq. (5-6), (5-12), or (5-13). If estimated
item parameters are not conveniently available, it may be possible to estimate
the necessary quantities on the right side of (6-14), thus approximating relative
efficiency without requiring the item parameters ai, bi, and ci. A method for
estimating true-score distributions p(ξ) and q(η) is the subject of Chapter 16.
The following application is presented here rather than in Chapter 16 because
it leads to the much simpler approximation described in Section 6.8. An approxi-
mation for the squared standard error of measurement is given by (Lord, 1965):

σ²x|ξ = (nx − 2kx) ξ(nx − ξ) / nx² ,    (6-15)

where

kx ≡ nx²(nx − 1)sp² / {2[x̄(nx − x̄) − sx² − nxsp²]} ,    (6-16)

and where x̄ and sx² are the sample mean and variance (over people) of the number-
right scores, sp² is the sample variance (over items) of the pi, and pi is the sample
proportion of correct answers to item i; σ²y|η is obtained similarly, from the same
group or from an equivalent group of examinees.
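
For readers who want to trace the arithmetic, here is a small sketch (not from the original text; the summary statistics are invented) that evaluates kx from (6-16) and the approximate standard error of measurement from (6-15):

```python
import math

def k_x(n, mean_x, var_x, var_p):
    """k of Eq. (6-16), from the mean and variance (over people) of
    number-right scores and the variance (over items) of the p_i."""
    return 0.5 * n**2 * (n - 1) * var_p / (mean_x * (n - mean_x) - var_x - n * var_p)

def sem(xi, n, k):
    """Approximate standard error of measurement at true score xi, Eq. (6-15)."""
    return math.sqrt((n - 2.0 * k) * xi * (n - xi) / n**2)

# Invented summary statistics for a hypothetical 50-item test.
n, mean_x, var_x, var_p = 50, 31.4, 86.2, 0.031
k = k_x(n, mean_x, var_x, var_p)
print("k_x =", round(k, 2))
for xi in (10, 25, 40, 48):
    print(f"xi = {xi:2d}   sigma(x|xi) = {sem(xi, n, k):.2f}")
```
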
When item parameters ai, bi, and ci have not been estimated, the relation
η(ξ) between η and ξ may be obtained as follows. Integrating (6-13), we have,
for any value ξ0,

∫−∞^η(ξ0) q(η) dη  ≡  ∫−∞^ξ0 p(ξ) dξ .    (6-17)

In more familiar terms, this equation says that η0 ≡ η(ξ0) has the same percen-
tile rank in q(η) as ξ0 does in p(ξ). Thus for any value of ξ, η = η(ξ) is to be
obtained by standard equipercentile equating. The distributions q(η) and p(ξ)
must be for the same group or for statistically equivalent groups of examinees.
Given estimates of q( ) and p(ξ), the integration and equating are done by
numerical methods by the computer (see Section 17.3).
A computer program (Stocking, Wingersky, Lees, Lennon, & Lord, 1973) is
available to compute (6-14). The program uses estimates of p(ξ) and q(η) obtained
by the methods of Chapter 16. It then uses (6-17) to find equivalent values of ξ
and η. Finally, using approximation (6-15), it computes relative efficiencies by
(6-14).
In Fig. 6.7.1, the solid curve is the approximate relative efficiency from
(6-14). The dotted curve is the ratio of information functions computed by (6-1)

[Figure 6.7.1 here: R. E. of "regular" test vs. peaked test, plotted on a logarithmic scale (.1 to 10.0) against relative true score ζ, 0 to 1.0, and percentile rank, 1 to 99.]

FIG. 6.7.1. Approximation (solid line) to relative efficiency [Eq. (6-14)] compared
with estimate (dashed line) from Eq. (6-1) and (5-13). (From F. M. Lord, The
relative efficiency of two tests as a function of ability level. Psychometrika, 1974,
39, 351-358.)

from estimated item response function parameters. The two tests under compari­
son are the regular test (y) and the peaked test (x), described in more detail in
Section 6.1. Approximation (6-14) tends to oscillate about the estimated relative
efficiency (6-1), but the approximation is adequate for the practical purpose of
comparing the effectiveness of the tests over a range of ability levels. The
agreement found here and in later sections of this chapter between relative
efficiency calculated from item parameters and relative efficiency approximated
from totally different sources is a reassuring illustration of the adequacy of item
response theory and of the procedures used for estimation of item parameters.
As noted in Section 6.4, the relative efficiency of two tests remains the same
under any monotonic transformation of the ability scale. Thus, the RE curve can
be plotted against any convenient baseline. In Fig. 6.7.1, the baseline is scaled in
terms of true score ζ = ξ/n = Σi Pi(θ)/n for the peaked test [see Eq. (4-5)].
Is it a good rule of test construction to spread the items over a wide range of
item difficulty, so as to have some items that are appropriate for each examinee?
Or will a peaked test with all items of equal difficulty be better for everyone? In
Fig. 6.7.1, the peaked test (really only partially peaked—it is hard to find 45
items that are identical in difficulty) is better than the regular (unpeaked) test for
all examinees from the first through the seventy-fifth percentile. If the peaked

test were more difficult, it might be better from perhaps the tenth percentile up
through the ninetieth.

6.8. DESK CALCULATOR APPROXIMATION FOR RELATIVE EFFICIENCY

Although the approximation of Section 6.7 avoids the need to estimate item
response function parameters, the method (see Chapter 16) for estimating p(ξ)
and q( ) is far from simple. Section 6.7 is included here because it leads to the
suggestion that a simple approximation to relative efficiency can be obtained by
substituting observed-score relative frequencies, fx and fy, say, for the true-
score densities p(ξ) and q(η).
A simple approximation to σx|ξ and σy|η is also available. If the nx items in
test x are considered as a random sample from an infinite pool of items, then the
sampling distribution of number-right score x for a particular examinee, over
successive random samples of items, is the familiar binomial distribution

C(nx, x) ζ^x (1 − ζ)^(nx − x),

where ζ is a parameter characterizing the individual. Since E(x|ζ) = nxζ = ξ
for the binomial, ζ (or ξ) is the individual's true score. [Although it may not
seem so, the fact is that the binomial model just described holds just as well
when the items are of widely varying difficulty as when they are all of the
same difficulty. A simple discussion of this fact is given by Lord (1977).]
Under the binomial model just outlined, the sampling variance of an exam-
inee's number-right score x over random samples of items is given by the
familiar formula

σ²x|ζ = nxζ(1 − ζ)    (6-18)

or, equivalently,

σ²x|ζ = ξ(nx − ξ)/nx .    (6-19)
A similar formula holds for y.


If we substitute these into (6-14) and replace ξ by x, η(ξ) by y, we have

RE{y, x} ≈ [ny x(nx − x) / nx y(ny − y)] · (fx²/fy²) .    (6-20)

This is the shortcut approximation recommended for calculating relative effi-
ciency. Note that here x and y are number-right scores with the same percentile
rank in some group of examinees (as determined by equipercentile equating); fx
and fy are relative frequencies for the same group or for equivalent groups (Σfx
= Σfy = 1). Note also that test x and test y must be measures of the same ability
or trait.
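
The shortcut can be stated in a few lines of code. The sketch below (added for illustration) evaluates (6-20) for one pair of equated scores, using the values that appear in one row of Table 6.9.1 in Section 6.9; it reproduces the tabled relative efficiency of about 1.12.

```python
def re_shortcut(x, fx, n_x, y, fy, n_y):
    """Shortcut relative efficiency of test y versus test x, Eq. (6-20).
    x and y are number-right scores with the same percentile rank;
    fx and fy are the corresponding relative frequencies."""
    return (n_y * x * (n_x - x) * fx**2) / (n_x * y * (n_y - y) * fy**2)

# One row of Table 6.9.1 (N = 10,000; n_x = 50; n_y = 30): score 18.5 on test X
# (frequency 174) is equated to score 14.73 on test Y (frequency 205.19).
print(round(re_shortcut(18.5, 174 / 10000, 50, 14.73, 205.19 / 10000, 30), 2))  # 1.12
```
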
[Figure 6.8.1 here: seven estimated true-score distributions, curves labeled 1 through 7, plotted against relative true score, 0 to 1.00.]

FIG. 6.8.1. Estimated true-score distribution for the sixth-grade data for STEP (1), MAT (2), CAT
(3), ITBS (4), Stanford (5), CTBS (6), and SRA (7).


Equation (6-20) will work best with a large sample of examinees, perhaps
several thousand. If the sample size is smaller, the equipercentile equating of x
and y will be irregular because of local irregularities in fx and fy. This can be
overcome by smoothing distributions fx and fy. Smaller samples can then be
used, but at some cost in labor.
In order to investigate the adequacy of (6-20), the relative efficiencies of the
vocabulary sections of seven nationally known reading tests were approximated
by formula (6-20) and also by the computer program (Stocking et al., 1973)
described in Section 6.7. For each test, a carefully selected representative na­
tional sample of 10,000 or more sixth graders from the Anchor Test Study
(Loret, Seder, Bianchini, & Vale, 1974) supplies the frequency distribution of
number-right vocabulary score needed for the two methods. The number of items
per vocabulary section ranges from n = 30 through n = 50.
Figure 6.8.1 shows the true-score distributions for the seven vocabulary tests
as estimated by the method of Chapter 16. These are the p(ξ) and q(η) used in
(6-14) to obtain the smooth curves in Fig. 6.9.1-6.9.6. As already noted, a test
in general tends to be less efficient where the true scores pile up and more
efficient where the true scores are spread out.

6.9. RELATIVE EFFICIENCY OF SEVEN SIXTH-GRADE VOCABULARY TESTS¹

Figures 6.9.1 to 6.9.6 show the efficiency curves for six of the tests relative to
the Metropolitan Reading Tests (1970), Intermediate Level, Form F, Word
Analysis subtest (MAT). The smooth curves are obtained from (6-14); the broken
lines are obtained from (6-20), after grouping together adjacent pairs of raw
scores in order to reduce zigzags due to sampling fluctuations.
Although (6-20) gives only approximate results, the approximation is seen to
be quite adequate for many purposes. Rough calculations using (6-20) can be
conveniently made under circumstances not permitting the use of an elaborate
computer program.
Figure 6.9.1 shows the relative efficiency of STEP (Sequential Tests of Edu­
cational Progress) Series II (1969), Level 4, Form A, Reading subtest. STEP is
more efficient than MAT for the bottom fifth or sixth of the pupils and less
efficient for the rest of the students. Between the fortieth and eightieth percen­
tiles, STEP would have to be tripled in length in order to be as effective as MAT.
STEP (n y = 30) is actually three-fifths as long as MAT (n x = 50), as shown by

¹This section is revised and printed with special permission from F. M. Lord, Quick estimates of
the relative efficiency of two tests as a function of ability level. Journal of Educational Measure­
ment, Winter 1974, 11, No. 4, 247-254. Figures 6.9.1, 6.9.6, and Table 6.9.1 are taken from the
same source. Copyright 1974, National Council on Measurement in Education, Inc., East Lansing,
Mich.

FIG. 6.9.1. Relative efficiency of STEP compared to MAT.



FIG. 6.9.2. Relative efficiency of California Achievement Tests (1970), Level 4,


Form A, Reading Vocabulary compared to MAT.


FIG. 6.9.3. Relative efficiency of Iowa Test of Basic Skills (1970), Level 12, Form
5, Vocabulary compared to MAT.

FIG. 6.9.4. Relative efficiency of Stanford Reading Tests (1964), Intermediate II,
Form W, Word Meaning compared to MAT.


FIG. 6.9.5. Relative efficiency of Comprehensive Tests of Basic Skills (1968),


Level 3, Form Q, Reading Vocabulary compared to MAT.

FIG. 6.9.6. Relative efficiency of SRA Achievement Series (1971), Green Edition,
Form E, Vocabulary compared to MAT.


the dashed line representing the ratio ny/nx. The dashed line represents the
relative efficiency that would be expected if the two tests differed only in length.
The fact that STEP is more efficient for low-ability students and less efficient
at higher ability levels is to be expected in view of the fact that this STEP (Level
4) is extremely easy for most sixth-grade pupils in the representative national
sample. It has long been known that an easy test discriminates best at low ability
levels and is less effective at higher levels than a more difficult test would be.
Similar conclusions can be drawn from the other figures. It is valuable to have
such relative efficiency curves whenever a choice has to be made between tests
measuring the same ability or trait.

Numerical Example
For illustrative purposes, Table 6.9.1 shows a method for computing RE {y, x}
for one set of data, with N = 10,000, nx = 50, ny = 30. The method illustrated
is a little rough but seems adequate for the purpose.
The raw data for the table are the frequency distributions given by the un-
italicized figures in columns fx and fy . The score ranges covered by the table are

TABLE 6.9.1
Illustrative Computations for Relative Efficiency*

                       Test X                           Test Y
Percentile
   Rank        x        fx        Fx          y         fy        Fy      RE{y, x}

             17.5                1141
             18       176                    14        203
  12.70      18.23   (175.08)                14.5       204.5     1270       1.13
  13.17      18.5     174       1317         14.73     (205.19)              1.12
             19       172                    15         206
  14.76      19.42   (166.96)                15.5       228       1476        .85
  14.89      19.5     166       1489         15.55     (230.20)               .83
             20       160                    16         250
  16.49      20.5     171       1649         16.19     (258.74)               .71
  17.26      20.92   (180.24)                16.5       273       1726        .71
             21       182
  18.31      21.5     186.5     1831         16.85     (289.10)               .69
             22       191                    17         296
  20.22      22.5     196       2022**       17.50     (313)      2022**      .67
             23       201                    18         330

*Figures obtained by linear interpolation are the non-integer score entries and the frequencies
shown in parentheses; the remaining figures are exact values obtained from the data.
**These two numbers are identical only by coincidence.

shown by the corresponding integers in the x and y columns. Computational


steps are the following:

1. Enter the half-integer scores in the x and y columns as shown.


2. Obtain from the raw data the cumulative frequency distributions shown in
columns Fx and Fy. (The frequency fx is treated as spread evenly over the
interval from x — .5 to x + .5.)
3. Compute from the Fx and Fy columns the percentile rank of each half-
integer score (the proportion of cases lying below the half-integer score) and
enter this in the first column.
4. Using linear interpolation, compute and record in the x or y column the
score (percentile) for each percentile rank listed in column 1. In the table, these
percentiles are in italics. (Note that, by definition, the score having a given
percentile rank is called the percentile.)
5. For each half-integer score, record (in italics) in the adjacent f column the
average of f for the next higher integer score and the f for the next lower integer
score.
6. For each (italicized) score obtained in step 4, compute by linear interpola-
tion and record in parentheses in the adjacent f column the f for the score.
7. For each row of the table that contains an entry in the first column,
compute RE {y, x} by (6-20).
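
The seven steps can be automated directly. The following sketch (not part of the original text) carries them out for every half-integer score of test x, using linear interpolation for the equating and for the frequencies, and returns the relative-efficiency curve given two raw frequency distributions.

```python
import numpy as np

def re_curve(freq_x, freq_y):
    """Carry out steps 1-7 above for every half-integer score of test x.
    freq_x and freq_y are frequency counts indexed by number-right score
    (index 0 = score 0).  Returns (half-integer x, equated y, RE{y, x})."""
    freq_x = np.asarray(freq_x, dtype=float)
    freq_y = np.asarray(freq_y, dtype=float)
    n_x, n_y = len(freq_x) - 1, len(freq_y) - 1
    N = freq_x.sum()

    # Step 2: cumulative frequency below each half-integer score.
    Fx, Fy = np.cumsum(freq_x), np.cumsum(freq_y)

    # Step 3: percentile rank of each half-integer score of test x.
    half_x = np.arange(n_x) + 0.5
    pr = Fx[:-1] / N

    # Step 4: the y score (percentile) with the same percentile rank, found by
    # linear interpolation in the cumulative distribution of y.
    half_y = np.arange(n_y) + 0.5
    y_eq = np.interp(pr, Fy[:-1] / N, half_y)

    # Step 5: frequency at a half-integer x = average of the two adjacent f's.
    fx_half = (freq_x[:-1] + freq_x[1:]) / 2.0

    # Step 6: frequency at the equated y, interpolated between integer scores.
    fy_eq = np.interp(y_eq, np.arange(n_y + 1.0), freq_y)

    # Step 7: relative efficiency, Eq. (6-20), with relative frequencies.
    re = (n_y * half_x * (n_x - half_x) * (fx_half / N) ** 2) / (
        n_x * y_eq * (n_y - y_eq) * (fy_eq / N) ** 2)
    return half_x, y_eq, re
```

Applied to vocabulary-test frequency distributions like those of this section, a routine of this kind yields estimates of the sort plotted as broken lines in Figs. 6.9.1 to 6.9.6, after any desired smoothing or grouping of adjacent scores.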

6.10. REDESIGNING A TEST

When a test is to be changed, we normally have response data for a typical group
of examinees. Such data are not available when a new test is designed. This
section deals chiefly with changing or redesigning an existing test. Procedures
for redesign are best explained by citing a concrete example.
Recently, it was decided to change slightly the characteristics of the College
Entrance Examination Board's Scholastic Aptitude Test, Verbal Section. It was
desired to make the test somewhat more appropriate at low ability levels without
impairing its effectiveness at high ability levels. The possibility of simultane-
ously shortening the test was also considered.
A first step was to estimate the item parameters for all items in a typical
current form of the Verbal test. A second step was to compute from the item
parameters the information curves for variously modified hypothetical forms of
the Verbal test. Each of these curves was compared to the information curve of
the actual Verbal test. The ratio of the two curves is the relative efficiency of the
modified test, which varies as a function of ability level.
Let us now consider some typical questions: How would the relative effi-
ciency of the existing test be changed by

1. Shortening the test without changing its composition?


2. Adding five more items just like the five easiest items in the existing test?
3. Cutting out five items of medium difficulty?
4. Replacing five items of medium difficulty by five very easy items?
5. Replacing all reading items by a typical set of nonreading items?
6. Discarding the easiest half of the test items?
7. Scoring only the easiest half of the test items?
8. Replacing all items by items of medium difficulty?

These questions are taken up one by one in the correspondingly numbered para-
graphs that follow, illustrating the results of various design changes on the SAT
Verbal test. In Figure 6.10.1, the horizontal scale representing the ability mea-
sured by the test has for convenience been marked off to correspond to College
Board true scaled scores.
(The test-score information curves are computed for hypothetical examinees
who omit no items. The SAT is normally scored with a "correction for guess-
ing," but when there are no omitted items, the corrected score is perfectly cor-
related with number-right score. For this reason, the relative efficiencies dis-

[Figure 6.10.1 here: relative efficiency curves, numbered 2 through 7, plotted against SAT Verbal scaled score, 200 to 700.]

FIG. 6.10.1. Relative efficiency of various modified SAT Verbal tests. (From F. M.
Lord, Practical applications of item characteristic curve theory. Journal of Educa-
tional Measurement. Summer 1977, 14, No. 2, 117-138. Copyright 1977,
National Council on Measurement in Education, Inc., East Lansing, Mich.)

cussed below are equally appropriate for corrected scores and for number-right
scores. See Section 15.11 for a detailed discussion of work with formula scores.)

1. Reducing the test from n1 to n2 items without changing its composition


produces a new test with efficiency RE = n2/n1 relative to the original test. If
such an RE were shown in Figure 6.10.1, it would appear as a horizontal straight
line.
2. The effect of adding five very easy items to the SAT Verbal test is shown
by curve 2. The efficiency of the lengthened verbal test, relative to the usual SAT
Verbal, is increased at all ability levels, as might be expected. The change in
efficiency at any particular level indicates the (relative) extent to which the five
easy items are useful for measurement at that level. These items are of little use
above the ability level represented by a (true) score of 400, because above this
level most people answer these items correctly.
3. The effect of simply cutting out five items of medium difficulty from the
SAT Verbal test is shown by curve 3. The efficiency of the shortened verbal test,
relative to the usual full-length verbal test, is decreased at all ability levels above
a (true) score of 245. It may seem strange that omitting five items actually
increases relative efficiency below 245. The reason will be pointed out in the
discussion of question 7.
4. Replacing five medium difficulty items by five very easy items simply
adds together the changes made in paragraphs 2 and 3 above. If asymptotically
optimal (maximum likelihood) scoring were used, the relative efficiency of (SAT
Verbal + easy items — medium items) would be exactly equal to (RE of SAT
Verbal + RE of easy items — RE of medium items). Since number-right scoring
is used, the foregoing relationship holds only as a useful approximation. Curve 4
in Figure 6.10.1 shows the combined effects of the changes of curves 2 and 3.
5. Curve 5 shows the result of first omitting all reading items and then
bringing the SAT Verbal test back to its original length without further changes
in composition. This curve is given purely to illustrate a kind of question that can
be answered. The particular result obtained cannot be generalized to other tests.
It also ignores the fact that reading items require more testing time than other
verbal item types. This last fact could be taken into account, if desired.
6. The effect of discarding the easiest half of the test items is shown by curve
6. The test loses most of its effectiveness except at high ability levels, where
there is a loss of about 10% in efficiency.
7. In contrast, discarding the most difficult half of the test items greatly
improves measurement efficiency for low-ability examinees. A similar but
smaller effect was observed in answer to question 3. The reason is that random
guessing by low-level examinees on the harder items adds so much "noise" to
the measuring process that it would be better simply not to score these items for
low-ability examinees. The half test actually measures better than the full-length
test at low ability levels.

8. Replacing all items by items of medium difficulty produces a "peaked"


SAT that is much more efficient than the actual SAT for the average examinee,
who scores near 500. The peaked test is more efficient than the actual SAT for
examinees from 400 to 610, less efficient for examinees at more extreme ability
levels. (Curve 8 is obtained by changing all bi to an average value, all other SAT
item response parameters being unchanged.)
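
The kind of what-if analysis just illustrated reduces to recomputing a score information function after altering item parameters. The sketch below (added here; the 60 item parameters are randomly invented stand-ins, not the actual SAT Verbal parameters) shows the computation for a change of the type in paragraph 8, replacing every bi by the average difficulty.

```python
import numpy as np

D = 1.7

def p3pl(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def score_info(theta, a, b, c):
    """Information of number-right score about theta."""
    P = p3pl(theta[:, None], a, b, c)
    Pp = D * a * (P - c) * (1.0 - P) / (1.0 - c)
    return Pp.sum(axis=1) ** 2 / (P * (1.0 - P)).sum(axis=1)

theta = np.linspace(-3.0, 3.0, 121)

# Invented 3PL parameters standing in for the items of an existing test.
rng = np.random.default_rng(0)
a = rng.uniform(0.6, 1.4, 60)
b = rng.uniform(-2.0, 2.0, 60)
c = np.full(60, 0.2)

base = score_info(theta, a, b, c)

# Redesign in the spirit of question 8: every b_i replaced by the average b,
# all other item parameters unchanged ("peaked" test).
peaked = score_info(theta, a, np.full_like(b, b.mean()), c)

print(np.round(peaked / base, 2)[::20])  # relative efficiency along the ability scale
```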

The foregoing illustrates how typical questions involved in redesigning a test


can be answered by predicting the relative efficiency of a suitably modified test.
A more general redesign problem may be stated: How can we change the present
test to produce a new test with a specified relative efficiency curve? It would be
possible to devise mathematical methods for answering this general question. For
the present, however, it may be more convenient to answer it by trial and error:
by devising a variety of redesigned tests, computing their RE curves, then choos­
ing the best of these, and modifying the tests further until the desired RE curve is
achieved.

6.11. EXERCISES

6-1 Using Table 4.17.1, compute the test information function for a three-item
test with a = 1/1.7, b = 0, c = .2 for all items. Plot it.
6-2 Using Table 4.17.2, plot on the same graph the test information function
for a three-item test with a = 1/1.7, b = 0, c = 0 (see Exercise 5-8).
6-3 Compute and plot the efficiency of the test in Exercise 6-1 relative to test 1
(see Exercise 5-4).
6-4 Compute and plot the efficiency of the test in Exercise 6-2 relative to test 1
(see Exercise 5-4).
6-5 For each item in test 1, plot P(θ) from Table 4.17.1 against θ* = e^θ. These
are the item response functions when θ* is used as the measure of ability.
Compare with Exercise 4-1.
6-6 Using Table 4.17.1, compute and plot against θ* the item information
function I{θ*, ui} for each item in test 1. Compare with Exercise 5-3.
6-7 Using Table 4.17.1, plot the information function (6-8) of number-right
observed score x on number-right true score ξ for test 1. Necessary values
were calculated in Exercise 5-2.
6-8 Using Table 4.17.1, compute for test 1 at θ = −3, −2, −1, 0, 1, 2, 3 the
σ2x|ξ of (6-19), and compare with the σ2x|ξ of Eq. (4-3). The necessary
values of ξ = ξ(θ) are calculated in Exercise 5-1. Explain the discrepancy
between the two sets of results.
6-9 Suppose test 1 is modified by replacing item 2 by an item exactly like item
1. Compute the information function for the modified test and plot its
relative efficiency with respect to test 1 (see Exercise 5-4).

REFERENCES

Lord, F. M. A strong true-score theory, with applications. Psychometrika, 1965, 30, 239-270.
Lord, F. M. Practical applications of item characteristic curve theory. Journal of Educational
Measurement, 1977, 14, 117-138.
Loret, P. G., Seder, A., Bianchini, J. C., & Vale, C. A. Anchor Test Study—Equivalence and norms
tables for selected reading achievement tests (grades 4, 5, 6). Washington, D.C.: U.S. Govern-
ment Printing Office, 1974.
Stocking, M., Wingersky, M. S., Lees, D. M., Lennon, V., & Lord, F. M. A program for
estimating the relative efficiency of tests at various ability levels, for equating true scores, and for
predicting bivariate distributions of observed scores. Research Memorandum 73-24. Princeton,
N.J.: Educational Testing Service, 1973.
7 Optimal Number of
Choices Per Item 1

7.1. INTRODUCTION

Typical multiple-choice tests have four or five alternative choices per item. What
is the optimal number?
If additional choices did not increase total testing time or add to the cost of the
test, it would seem from general considerations that the more choices, the better.
The same conclusion can be reached by examination of the formula (4-43) for the
logistic item information function: Information is maximized when c 0. An
empirical study by Vale and Weiss (1977) reaches the same conclusion.
In practice, increasing the number of choices will usually increase the testing
time. Each approach treated in this chapter makes the assumption that total
testing time for a set of n items is proportional to the number A of choices per
item. This means that nA, the total number of alternatives in the entire test, is
assumed fixed.
It seems likely that many or most item types do not satisfy this condition, but
doubtless some item types will be found for which the condition can be shown to
hold approximately. The relation of n to A for fixed testing time should be
determined experimentally for each given item type; the theoretical approaches
given here should then be modified in obvious ways to determine the optimal

¹This chapter is adapted by special permission from F. M. Lord, Optimal number of choices per
item—a comparison of four approaches. Journal of Educational Measurement, Spring 1977, 14, No.
1. Copyright 1977. National Council on Measurement in Education, Inc., East Lansing, Mich.
Research reported was supported by grant GB-41999 from the National Science Foundation.


value of A for each item type. A useful procedure for doing this is described in
Grier (1976).
In this chapter, some published empirical results, two published theoretical
approaches, and also an unpublished classical test theory approach are compared
with some new results obtained from item response theory. From some points of
view, the contrasts between the different approaches are as interesting and in-
structive as the actual answers given to the question asked.

7.2. PREVIOUS EMPIRICAL FINDINGS

Ruch and Charles (1928), Ruch, DeGraff, and Gordon (1926, pp. 54-88), Ruch
and Stoddard (1925, 1927), and Toops (1921), among others, reported data on
the relative time required to answer items with various numbers of alternatives.
Their empirical evidence regarding the optimal number of alternatives for
maximum test reliability is somewhat contradictory. Ruch and Stoddard (1927)
and Ruch and Charles (1928) concluded that because more of such items can be
administered in a given length of time, two- and three-choice items give as good
or better results than do four- and five-choice items.
More recently, Williams and Ebel (1957, p. 64) report that

For tests of equal working time . . . three-choice vocabulary test items gave a test of
equal reliability, and two-choice items a test of higher reliability, in comparison
with standard four-choice items. However, neither of the differences was signifi-
cant at the 10% level of confidence.

About 230 students were tested with each test.


Williams and Ebel eliminated choices by dropping those shown by item
analysis to be least discriminating. This is a desirable practical procedure that
should yield better results than simply eliminating distractors at random, as
assumed by the theoretical approaches discussed below.

7.3. A MATHEMATICAL APPROACH

Tversky (1964) proposed that the optimal number of choices is the value of A
that maximizes the "discrimination function" Aⁿ. He chose this function be-
cause Aⁿ is the total number of possible distinct response patterns on n A-choice
items and also for other related reasons.
Tversky easily showed that when nA = K is fixed, Aⁿ is maximized by A = e
= 2.718. For integer values of A, Aⁿ is maximized by A = 3. Tversky concludes
that when nA = K, three choices per item is optimal.
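
The calculus behind this result is brief; the following display (added here for completeness) spells it out.

```latex
% With nA = K fixed, maximizing A^n = A^{K/A} is the same as maximizing (K/A) ln A:
\frac{d}{dA}\left(\frac{K}{A}\,\ln A\right)
  = \frac{K\,(1-\ln A)}{A^{2}} = 0
  \quad\Longrightarrow\quad \ln A = 1, \qquad A = e \approx 2.718 .
```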

7.4. GRIER'S APPROACH

Grier (1975) investigated the same problem. He also found that three-choice
items are best when the total number of alternatives is fixed. Two-choice items
are next best.
Grier reached these conclusions by maximizing an approximation to the
Kuder-Richardson Formula-21 reliability coefficient. This approximation, given
as Eq. (7-2), is derived by Ebel (1969) on the assumption that the mean number-
right score x̄ is halfway between the maximum possible score n and the expected
chance score n/A and also that the standard deviation of test scores sx is one-
sixth of the difference between the maximum possible score and the expected
chance score:

x̄ = ½(n + n/A),        sx = (1/6)(n − n/A).    (7-1)

If (7-1) were true, the Kuder-Richardson Formula-21 reliability coefficient of the
test would be

r21 = [n/(n − 1)] [1 − 9(A + 1)/(n(A − 1))] .    (7-2)

This formula (as Ebel points out) is not useful for small n. When A = 3, the
value of r21 given by (7-2) is negative unless n > 18.
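
Grier's comparison is easy to reproduce from (7-2). The sketch below (added here; the total number of alternatives K is an invented example value) evaluates r21 for tests of equal total length nA = K; with this criterion A = 3 gives the largest value, consistent with Grier's conclusion.

```python
def r21(n, A):
    """Ebel's approximation (7-2) to the KR-21 reliability of an n-item,
    A-choice test."""
    return (n / (n - 1.0)) * (1.0 - 9.0 * (A + 1.0) / (n * (A - 1.0)))

K = 120  # total number of alternatives, held fixed (invented example value)
for A in (2, 3, 4, 5):
    n = K // A
    print(f"A = {A}   n = {n:3d}   r21 = {r21(n, A):.3f}")
```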

7.5. A CLASSICAL TEST THEORY APPROACH

A third approach is to use the knowledge-or-random-guessing assumption and


work out the reliabilities for hypothetical tests composed of equivalent items.
This is done below. The knowledge-or-random-guessing assumption is not ordi-
narily satisfied in practice, but it is not likely to lead to unreasonable conclusions
for the present problem. In any case, the conclusions reached from this third
approach agree to some extent with Tversky's and Grier's.
The intercorrelation between two equivalent items under the knowledge-or-
random-guessing model is given by

r′ = r / [1 + 1/((A − 1)p)] ,    (7-3)

where r′ is the product-moment intercorrelation between k-choice items when
k = A. Here p and r denote, respectively, the difficulty (proportion of correct
answers) and the product-moment intercorrelation of k-choice items when k =

∞. This formula is a generalization of Eq. (3-19) and may be derived by the same
approach.
By the Spearman-Brown formula [Eq. (1-17)], the reliability r′tt of number-
right scores on a test composed of n equivalent A-choice items is found from
(7-3) to be

r′tt = nr′ / [1 + (n − 1)r′] = nr / [(n − 1)r + 1 + 1/((A − 1)p)] .    (7-4)

Since n = K/A, Eq. (7-4) becomes

r′tt = Kr / [Kr + (1 − r)A + A/((A − 1)p)] .    (7-5)
We wish to know what value of A, the number of choices, will maximize the
reliability r′tt. The optimal value of A is the value that minimizes the de-
nominator of (7-5). The derivative of the denominator with respect to A is
1 − r − 1/[(A − 1)²p]. Setting this equal to zero and solving for A, the optimal value
is found to be

A = 1 + 1/√[(1 − r)p] .    (7-6)
It is easy to verify that this value of A provides a maximum rather than a
minimum for r'tt.
Some optimal values of A from (7-6) are shown in the following table:

              p = .20     p = .50     p = .80
  r = .10       3.36        2.49        2.18
  r = .30       3.67        2.69        2.34

For p = .5 these values agree rather well with those found by Grier (1975). Our
optimal values, however, unlike Grier's, are independent of test length, as in­
deed they should be. For p ≠ .5, the results are different from Grier's.
Some typical values of test reliability are shown in the following table for the
case where p = .5:

                                 A = 2     A = 3     A = 4     A = 5
  K = 250, r = .20,  r′tt =       .889      .902      .895      .885
  K = 150, r = .30,  r′tt =       .893      .898      .892      .882

These values vary with A less than do those in Grier's Fig. 1.
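
The table entries above can be checked directly from (7-5) and (7-6); the sketch below (added for illustration) reproduces one optimal value of A and the second row of reliabilities.

```python
import math

def optimal_A(r, p):
    """Optimal number of choices per item, Eq. (7-6)."""
    return 1.0 + 1.0 / math.sqrt((1.0 - r) * p)

def rel_tt(K, r, p, A):
    """Reliability of a test of K/A equivalent A-choice items, Eq. (7-5)."""
    return K * r / (K * r + (1.0 - r) * A + A / ((A - 1.0) * p))

print(round(optimal_A(0.10, 0.50), 2))                              # 2.49
print([round(rel_tt(150, 0.30, 0.5, A), 3) for A in (2, 3, 4, 5)])  # .893 .898 .892 .882
```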



7.6. AN ITEM RESPONSE THEORY APPROACH

A different perspective appears when item response theory is applied to this


problem. The 90-item Verbal section of the College Board Scholastic Aptitude
Test, Form TSA13, was chosen for investigation. Logistic item parameters for
this test had already been estimated in the course of another study. The median
estimated ci for the five-choice items in TSA13 is about .15. Four alternative
hypothetical tests were built from TSA13 by replacing all the ci parameters
by .200, .250, .333, or .50 while leaving the ai and bi parameters unchanged.
The parameter ci is the lower asymptote of the item response function. If very
low-level examinees responded purely at random, these hypothetical ci values
would correspond, respectively, to five-, four-, three-, and two-choice items.
The restriction nA = K could then be written n ∝ ci. Accordingly, the number of
items n in each test was specified to be proportional to its ci value, as follows:

Test:     TSA13*    ci = .20    ci = .25    ci = .333    ci = .50
n =          90        120         150          200          300

*Median ci is .15.
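
The construction just described is easy to imitate. The sketch below (not from the original study) builds hypothetical tests that differ only in the common lower asymptote c and in length n ∝ c, and compares them with the c = .333 test. As a simplification it uses the test information for optimally weighted (maximum likelihood) scoring, i.e., the sum of item information functions, rather than the number-right score information used in the text, and the 90 base item parameters are randomly invented stand-ins for the TSA13 estimates.

```python
import numpy as np

D = 1.7

def item_info(theta, a, b, c):
    """Item information function of a logistic (3PL) item."""
    P = c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))
    Pp = D * a * (P - c) * (1.0 - P) / (1.0 - c)
    return Pp ** 2 / (P * (1.0 - P))

theta = np.linspace(-3.0, 3.0, 121)
rng = np.random.default_rng(1)
base_a = rng.uniform(0.6, 1.4, 90)   # invented stand-ins for the 90 TSA13 items
base_b = rng.uniform(-1.0, 2.0, 90)

def test_info(c, n):
    """Hypothetical n-item test built by recycling the 90 base items,
    with every lower asymptote set to c."""
    idx = np.arange(n) % 90
    return sum(item_info(theta, base_a[i], base_b[i], c) for i in idx)

ref = test_info(1.0 / 3.0, 200)      # the c = .333, 200-item test
for c, n in [(0.20, 120), (0.25, 150), (0.50, 300)]:
    re = test_info(c, n) / ref
    print(f"c = {c:.2f}:  RE at theta = -2: {re[20]:.2f},  at theta = +2: {re[100]:.2f}")
```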

In actual practice, low-level examinees do less well than if they responded at


random. Thus the current investigation really compares different values of ci
(subject to n ∝ ci) rather than comparing different values of A.
The efficiency of number-right scores on each of the other tests was computed
relative to the test with ci = .333. The resulting relative efficiencies are shown in
Fig. 7.6.1 as functions of ability level. The ability scale along the baseline of the
figure is calibrated to show true College Board scaled scores on TSA13.
At high ability levels, there is little random guessing and the relative effi­
ciency of the five tests is determined by their length. Thus when ranked accord­
ing to their efficiency at an ability level of 800, the tests, in order of merit, are
ci = .50, ci = .333, ci = .25, ci = .20, and TSA13.
At low ability levels the effect of random guessing becomes of overwhelming
importance. At an ability level of 215, the rank order of test efficiency is pre­
cisely the opposite of the order at 800. Thus, for any pair of tests, one test is
better than the other over a certain range of ability but worse over the com­
plementary range of ability.
The figure makes it obvious that it is not enough just to think about overall test
reliability, as in earlier sections. The effect of decreasing the number of choices
per item while lengthening the test proportionately is to increase the efficiency of
the test for high-level examinees and to decrease its efficiency for low-level
examinees. None of the empirical studies known to the writer has taken this
effect into account when determining the optimal number of choices per item.
This effect may be offset by adjusting the difficulty level of the tests. The
median value of bi for TSA13, and hence for all the tests, was bi = .5 approxi-

[Figure 7.6.1 here: relative efficiency of the ci = .50, .333, .25, and .20 tests and of TSA13, plotted against College Board scaled score.]

FIG. 7.6.1. Relative efficiency of five SAT Verbal tests that differ only in test
length and in the value of ci.
[Figure 7.6.2 here: relative efficiency plotted against scaled score; solid curve ci = .20, dotted curve ci = .25, dashed curve ci = .50.]

FIG. 7.6.2. Efficiency of three SAT Verbal tests relative to the ci = .333 test
after the ci = .5 test has been made easier and the ci = .2 and ci = .25 tests
have been made harder.

mately. The dashed curve in Fig. 7.6.2 shows the relative efficiency of the
ci = .50 test when all its items are made slightly easier (all the item difficulty
parameters bi decreased by 0.1). The dotted curve shows the relative efficiency
of the ci = .25 test when all its items are made slightly harder (all bi increased
by 0.1). The solid curve shows a harder ci = .20 test (all bi increased 0.2). The
efficiencies are shown relative to the test with ci = .333. The test with ci = .333
is clearly superior to the others.
Comparisons using item response theory assume that ci can be changed
without affecting the item discrimination power ai. This would be true if exam­
inees either knew the answer or guessed at random. When an examinee has
partial information about an item, the value of ai is likely to change with the
number of alternatives. This effect could operate against reducing the number of
alternatives per item. The extent of this effect cannot be confidently predicted
here.

7.7. MAXIMIZING INFORMATION AT A CUTTING SCORE

If a test is to be used only to accept or to reject examinees, all items in the test
ideally should be maximally informative at the cutting score. Now, the maximum
possible information Mi obtainable from a logistic item with parameters ai and ci
is given by Eq. (10-6). If all items have ai = a and ci = c and if the number of
items that can be administered in the available testing time is proportional to c,
what value of c will maximize the test information I{θ} = nMi ∝ cMi at the
cutting score? Numerical investigation of cMi using Eq. (10-6) shows that the
optimal value of c is c = .374.

REFERENCES

Ebel, R. L. Expected reliability as a function of choices per item. Educational and Psychological
Measurement, 1969, 29, 565-570.
Grier, J. B. The number of alternatives for optimum test reliability. Journal of Educational Mea­
surement, 1975, 12, 109-112.
Grier, J. B. The optimal number of alternatives at a choice point with travel time considered. Journal
of Mathematical Psychology, 1976, 14, 91-97.
Ruch, G. M., & Charles, J. W. A comparison of five types of objective tests in elementary
psychology. Journal of Applied Psychology, 1928, 12, 398-404.
Ruch, G. M., DeGraff, M. H., & Gordon, W. E. Objective examination methods in the social
studies. New York: Scott, Foresman and Co., 1926.
Ruch, G. M., & Stoddard, G. D. Comparative reliabilities of five types of objective examinations.
Journal of Educational Psychology, 1925, 16, 89-103.
Ruch, G. M., & Stoddard, G. D. Tests and measurement in high school instruction. Chicago: World
Book, 1927.

Toops, H. A. Trade tests in education. Teachers College Contributions to Education (No. 115). New
York: Columbia University, 1921.
Tversky, A. On the optimal number of alternatives of a choice point. Journal of Mathematical
Psychology, 1964, 1, 386-391.
Vale, C. D., & Weiss, D. J. A comparison of information functions of multiple-choice and free-
response vocabulary items. Research Report 77-2. Minneapolis: Psychometric Methods Pro-
gram, Department of Psychology, University of Minnesota, 1977.
Williams, B. J., & Ebel, R. L. The effect of varying the number of alternatives per item on
multiple-choice vocabulary test items. The Fourteenth Yearbook. National Council on Mea-
surements Used in Education, 1957.
8 Flexilevel Tests1

8.1. INTRODUCTION

It is well known (see also Theorem 8.7.1) that for accurate measurement the
difficulty level of a psychological test should be appropriate to the ability level of
the examinee. With conventional tests, this goal is achievable for all examinees
only if they are fairly homogeneous in ability. College entrance examinations,
for example, could provide more reliable measurement at particular ability levels
if they did not need to cover such a wide range of examinee talent (see Section
6.10). Furthermore, in many situations it is psychologically desirable that the test
difficulty be matched to the examinee's ability: A test that is excessively difficult
for a particular examinee may have a demoralizing or otherwise undesirable
effect.
There has recently been increasing interest in "branched," "computer-
assisted," "individualized," "programmed," "sequential," or tailored testing
(Chapter 10). When carefully designed, such testing comes close to matching the
difficulty of the items administered to the ability level of the examinee. The
practical complications involved in achieving this result are great, however.
Some simplification can be obtained by simple two-stage testing—by use of a
routing test followed by the administration of one of several alternative second-
stage tests (Chapter 9). This reduces the number of items needed and eliminates

Sections 8.1 through 8.4 and Fig. 8.2.1 are taken with special permission and with some
revisions from F. M. Lord, The self-scoring flexilevel test. Journal of Educational Measurement,
Fall 1971, 8, No. 3, 147-151. Copyright 1971, National Council on Measurement in Education,
Inc., East Lansing, Mich.


the need for a computer to administer them. To obtain comparable scores from
different second-stage tests, however, expensive equating procedures based on
special large-scale administrations are required.

8.2. FLEXILEVEL TESTS

To a degree, the same result, the matching of item difficulty with ability level,
can be achieved with fewer complications. This can be done by modifying the
directions, the test booklet, and the answer sheet of an ordinary conventional
test. The modified test is called a flexilevel test.
Consider a conventional multiple-choice test in which the items are arranged
in order of difficulty. The general idea of a flexilevel test is simply that the
examinee starts with the middle item in the test and proceeds, taking an easier
item each time he gets an item wrong and a harder item each time he gets an item
right. He stops when he has answered half the items in the test.
Let us consider a concrete example, starting with a conventional test of N =
75 items. (In this chapter, the symbol N is used with this special meaning; in
other chapters, N denotes the number of examinees.) For purposes of discussion,
we assume that the items are arranged in order of difficulty; however, it is seen
later that any rough approximation to this is adequate. The middle item of the
conventional test (formerly item 38) is the first item in the flexilevel test. It is
printed in the center at the top of the first page of the flexilevel test. The page
below this, and subsequent pages, are divided in half vertically (see Fig. 8.2.1).

[the middle difficulty item, formerly item 38, appears here]

 1.* [a slightly easier item,           1.• [a slightly harder item,
      formerly item 37]                      formerly item 39]

 2.*                                    2.•

 3.*                                    3.•

  .                                      .
  .   [easier items]                     .   [harder items]
  .                                      .

37.* [the easiest item,                37.• [the hardest item,
      formerly item 1]                       formerly item 75]

*numbers printed in red        •numbers printed in blue

FIG. 8.2.1. Layout of printed flexilevel test booklet.



Items formerly numbered 39, 40, 41, . . . , 75 appear in that order in the right-
hand columns, the hardest item (formerly item 75) at the bottom of the last page.
In place of the old numbers, these items are numbered in blue as items 1, 2,
3, . . . , 37, respectively. Items formerly numbered 37, 36, 35, . . . , 1 appear in
that order in the left-hand columns, the easiest item (formerly item 1) at the
bottom of the last page. In place of the old numbers, these items are numbered in
red as items 1, 2, 3, . . . , 37, respectively (the easiest item is now at the end and
is numbered 37). The layout is indicated in Fig. 8.2.1.
The answer sheet used for a flexilevel test must inform the examinee whether
each answer is right or wrong. When the examinee chooses a wrong answer, a
red spot appears where he has marked or punched the answer sheet. When he
chooses a right answer, a blue spot appears. Answer sheets similar to this are
commercially available in a variety of designs.
In answering the test, the examinee must follow one rule. When his answer to
an item is correct, he should turn next to the lowest numbered "blue" item not
previously answered. When his answer is incorrect, he should work next on the
lowest numbered "red" item not previously answered.
Each examinee is to answer just ½(N + 1) = 38 items. One way to make it
apparent to him when he has finished the test would be to print the answer sheet
in two columns, using the same format as in Fig. 8.2.1 but with the second
column inverted. Thus, the examinee works down from the top in the first
column of the answer sheet and up from the bottom in the second column. The
examinee can be told to stop (he has completed the test) when he has responded
to one item in each row of the answer sheet.
It is now clear that the high-ability examinee who does well on the first items
he answers will automatically be administered a harder set of items than the
low-ability examinee who does poorly on the first items. Within limits, the
flexilevel test automatically adjusts the difficulty of the items to the examinee's
ability level.

8.3. SCORING

Let us first agree that when examinees answer the same items, we will be
satisfied to consider examinees with the same number-right score equal. A sur-
prising feature of the flexilevel test is that even though different examinees take
different sets of items, complicated and expensive scoring or equating procedures
to put all examinees on the same score scale are not needed. The obvious validity
of the scoring (by contrast with tailored testing) will keep examinees from feeling
that they are the victims of occult scoring methods. Finally, the test is self-
scoring—the examinee can determine his score without counting the number of
correct answers.
The score on a flexilevel test will be the number of questions answered

correctly, except that examinees who miss the last question they attempt receive
a one-half point "bonus." Justification that this scoring provides comparable
scores, as well as procedures for arriving at an examinee's score without count-
ing the number of correct answers, is given in the following section.

8.4. PROPERTIES OF FLEXILEVEL TESTS

A flexilevel test has the following properties, which the reader should verify for
himself. For convenience of exposition, we at first assume, as before, that the
items in the conventional test are arranged in order of difficulty. Later on we see
that any rough approximation will be adequate.

1. If the items were ordered by difficulty, the items answered by a single


examinee would always be a block of consecutive items.

For simplicity, assume throughout that the examinee has completed the re-
quired ½(N + 1) = 38 items (the complications arising when examinees do not
have enough time are not dealt with here). Also, assume that the examinee has
been instructed to indicate on the answer sheet the item he would have to answer
next if the test were continued. (In an exceptional case, this might be a dummy
"item 3 8 , " which need not actually appear in the test booklet, since no one will
ever reach it.) An examinee who indicates that he would next try a blue item will
be called a blue examinee; one who indicates a red item will be called a red
examinee.

2. For a blue examinee, the number of right answers is equal to the serial
number of the item that would be answered next if the test were continued.
3. For a red examinee, the number of wrong answers is equal to the serial
number of the item that would be answered next if the test were continued. The
number of right answers is obtained by subtracting this serial number from ½(N
+ 1). (A different serial numbering of the red items could give the number of
right answers directly but might confuse the examinee while he is taking the test.)
4. All blue examinees who have a given number-right score have answered
the same block of items.
5. All red examinees who have a given number-right score have answered the
same block of items.

It can now be seen that all blue examinees can properly be compared with
each other in terms of their number-right scores, even though examinees with
different scores have not taken the same test. Consider two blue examinees, A
and B, whose number-right scores differ by 1. The items answered by the two
examinees are identical except that A had one item that was harder than any of

B's and B had one item that was easier than any of A's. The higher scoring
examinee, A, is clearly the better of the two because he took the harder test.
The same reasoning shows that all red examinees can properly be compared
with each other in terms of their number-right scores:

6. Examinees of the same color are properly compared by their number-right


scores.

In the foregoing discussion, the item taken by A and not by B was far apart on
the difficulty scale from the item taken by B and not by A. Thus A still would be
considered better than B even if the difficulty levels of individual items had been
roughly estimated rather than accurately determined. It will be seen that still
simpler considerations make exact determination of difficulty levels unnecessary
for the remaining comparisons among examinees, discussed below. Thus:

7. Exact ranking of items on difficulty level is not necessary for proper


comparison among examinees.

It remains to be shown how blue examinees can be compared with red exam-
inees. Consider a red examinee with a number-right score of x. If his very last
response had been correct instead of wrong, he would have been a blue examinee
with a score of x + 1. Clearly, his actual performance was worse than this; so we
conclude that

8. A blue examinee with a number-right score of x + 1 has outperformed all


red examinees with scores of x.

Finally, we can compare a blue examinee and a red examinee, both having the
same number-right score. Suppose we hypothetically administer to each exam-
inee the item that he would normally take if the testing were continued. If both
examinees answer this item correctly, they both become blue examinees with
identical number-right scores. We have agreed that such examinees can be con-
sidered equal. In order hypothetically to reach this equality, however, the blue
examinee had to answer a hard item correctly, whereas the red examinee had
only to answer an easy item correctly. Clearly, without the hypothetical extra
item, the standing of the blue examinee is inferior to the standing of the red
examinee:

9. A red examinee has outperformed all blue examinees having the same
number-right score.

In view of this last conclusion, let us modify the scoring by adding one-half
score point to the number-right score of each red examinee. Thus, once we agree

to use number-right score for examinees answering the same block of items, we
can say that

10. On a flexilevel test, examinee performance is effectively quantified by


number-right score, except that (roughly) one-half score point should be added
to the score of each red examinee.

If desired, all scores can be doubled to avoid fractional scores.


It is clear from the foregoing that to a considerable extent the flexilevel test
matches the difficulty level of the items administered to the ability level of the
examinee. This result is not achieved without some complication of the test
administration. The complications are minor, however, compared with those
arising in other forms of tailored testing.

8.5. THEORETICAL EVALUATION OF NOVEL TESTING PROCEDURES

Item response theory is essential both for good design and for evaluation of novel
testing procedures, such as flexilevel testing. If its basic assumptions hold, item
response theory allows us to state precisely the relation between the parameters
of the test design and the properties of the test scores produced.
Although the properties of test scores depend on the design parameters, the
dependence is in general not a simple one. Item response theory will be most
easily applicable if we make some simplifying assumptions. Even then, it is hard
to state unequivocal rules for optimal test design. In the present state of the art,
the following procedure is typical.

1. Evaluate various specific test designs.
2. Compare the results, seeking empirical rules.
3. Take the better designs and vary them systematically.
4. Repeat steps 1-4, evaluating and then modifying the better designs.
5. Stop when further effort leads to little or no improvement.
6. Try to draw general conclusions from a study of the results.

In this way, 100 or 200 different designs for some novel testing procedure can be
tried out on the computer in a short time using simulated examinees.
Nothing like this could be done if 100 actual tests had to be built and adminis-
tered to statistically adequate samples of real examinees. When we have learned as
much as we can from simulated examinees, then we can design an adequate test,
build it, administer it in a real testing situation, and evaluate the results. The real
test administration is indispensable. Limits on testing time, attitudes of exam-
inees, failure to follow directions, or other violations of the assumptions of the
model may in practice invalidate all theoretical predictions.

The preliminary theoretical work and computer simulation are also important.
Without them, the test actually built is likely to be a very inadequate one.
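As an illustration of how such "simulated examinees" can be generated, the following minimal Python sketch draws right/wrong responses under the three-parameter normal-ogive model of Eq. (2-2). The parameter values and function names are illustrative assumptions, not specifications taken from the studies reported in this chapter.

import random
from statistics import NormalDist

def prob_correct(theta, a, b, c):
    """Three-parameter normal-ogive item response function, Eq. (2-2)."""
    return c + (1.0 - c) * NormalDist().cdf(a * (theta - b))

def simulate_response(theta, a, b, c, rng=random):
    """Draw a right (1) / wrong (0) response for an examinee at ability theta."""
    return 1 if rng.random() < prob_correct(theta, a, b, c) else 0

# Example: one simulated examinee at theta = 0 answering a 60-item peaked test.
rng = random.Random(0)
responses = [simulate_response(0.0, a=0.75, b=0.0, c=0.0, rng=rng) for _ in range(60)]
print(sum(responses), "items right out of 60")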

8.6. CONDITIONAL FREQUENCY DISTRIBUTION OF FLEXILEVEL TEST SCORES²

We can evaluate any given flexilevel test once we can determine ø(y|θ), the conditional frequency distribution of test scores y for examinees at ability level θ. Given some mathematical form for the function Pi = Pi(θ) = P(θ; ai, bi, ci), the value of ø(y|θ) can be determined numerically for any specified value of θ by the recursive method outlined below. In the case of flexilevel tests, the testing and scoring procedures are so fully specified that the item parameters are the only parameters involved. It is assumed that the item parameters have already been determined by pretesting.³
Assume the N test items to be arranged in order of difficulty, as measured by the parameter bi. We choose N to be an odd number. For present purposes (not for actual test administration), identify the items by the index i, taking on the values −n + 1, −n + 2, . . . , −1, 0, 1, . . . , n − 2, n − 1, respectively, when the items are arranged in order of difficulty. Thus n = (N + 1)/2 is the number of items answered by each examinee, and b0 is the median item difficulty.
Consider, for example, the sequence of right (R) and wrong (W) answers R W W R W R R R W R. Following the rules given for a flexilevel test, we see that the corresponding sequence of items answered is

i = 0, +1, −1, −2, +2, −3, +3, +4, +5, −4, (+6).
Let Iv be the random variable denoting the vth item administered (v = 1, 2, . . . , n + 1); thus Iv takes the integer values i = −n + 1, −n + 2, . . . , n − 1. The general rule for flexilevel tests is that when Iv > 0, either

$$I_{v+1} = I_v + 1 \quad \text{or} \quad I_{v+1} = I_v - v,$$

and when Iv < 0, either

$$I_{v+1} = I_v - 1 \quad \text{or} \quad I_{v+1} = I_v + v.$$
For example, if the fourth item administered is indexed by I4 = −2, the next item to be administered must be either I5 = −2 − 1 = −3 or I5 = −2 + 4 = +2, depending on whether item 4 is answered incorrectly or correctly.
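A minimal Python sketch of this branching rule follows (the starting item 0 is treated under the same branch as positive i, which is consistent with the worked example above; the names are illustrative). Running it on the response pattern R W W R W R R R W R reproduces the item sequence given above.

def next_item(i: int, v: int, correct: bool) -> int:
    """Index of the (v+1)th item, given that the vth item administered was item i."""
    if i >= 0:                       # current item at or above the median difficulty
        return i + 1 if correct else i - v
    else:                            # current item below the median difficulty
        return i + v if correct else i - 1

# Reproduce the sequence for the response pattern R W W R W R R R W R:
pattern = [1, 0, 0, 1, 0, 1, 1, 1, 0, 1]
i, items = 0, [0]
for v, u in enumerate(pattern, start=1):
    i = next_item(i, v, correct=bool(u))
    items.append(i)
print(items)   # [0, 1, -1, -2, 2, -3, 3, 4, 5, -4, 6]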
Let Pv(i′|i, θ) denote the probability that item i′ will be the next item administered when the vth item administered was item i (v = 1, 2, . . . , n).

² Sections 8.6 through 8.9 are revised and taken with permission from F. M. Lord, The theoretical study of the measurement effectiveness of flexilevel tests. Educational and Psychological Measurement, 1971, 31, 805-813.
³ The reader concerned only with practical conclusions may skip to Section 8.7.

A simple restatement of the preceding rule gives us, if i > 0,

$$P_v(i' \mid i,\,\theta) = \begin{cases} P_i(\theta) & \text{if } i' = i + 1,\\ Q_i(\theta) & \text{if } i' = i - v,\\ 0 & \text{otherwise;} \end{cases}$$

if i < 0,

$$P_v(i' \mid i,\,\theta) = \begin{cases} P_i(\theta) & \text{if } i' = i + v,\\ Q_i(\theta) & \text{if } i' = i - 1,\\ 0 & \text{otherwise.} \end{cases}$$  (8-1)

For examinees at ability level θ, let pv(i|θ) denote the probability that item i is the vth item administered (v = 1, 2, . . . , n + 1). For fixed v, the joint distribution of i and i′ is the product of the marginal distribution pv(i|θ) and the conditional distribution Pv(i′|i, θ). Summing this product over i, we obtain the overall probability that item i′ will be administered on the (v + 1)th trial:

$$p_{v+1}(i' \mid \theta) = \sum_{i=-n+1}^{n-1} p_v(i \mid \theta)\,P_v(i' \mid i,\,\theta).$$  (8-2)

The rightmost probability, Pv, is known from (8-1). The other probability on the
right, pv, can be found by the procedure described below.
The first item administered (v = 1) is always item I1 = 0, so

$$p_1(i \mid \theta) = \begin{cases} 1 & \text{if } i = 0,\\ 0 & \text{otherwise.} \end{cases}$$
Starting with this fact and with a knowledge of all the Pi(θ) (item response functions) for a specified value of θ, the values of p2(i′|θ) for each i′ can be obtained from (8-2). Drop the prime from the final result. Repetition of the same procedure now gives us p3(i|θ), the overall probability that item i (i = −n + 1, −n + 2, . . . , n − 1) will be the third item administered. Successive repetitions of the same procedure give us p4(i|θ), p5(i|θ), . . . , pn+1(i|θ).
Now we can make use of an already verified feature of flexilevel tests. Again let i′ represent the (v + 1)th item to be administered. If i′ > 0, then the number-right score x on the v items already administered was x = i′; if i′ < 0, then x = v + i′. Thus the frequency distribution of the number-right score x for examinees at ability level θ is given by pn+1(x|θ) for those examinees who answered correctly the nth (last) item administered and by pn+1(x − n|θ) for those who answered incorrectly. This frequency distribution can be computed recursively from (8-1) and (8-2).
As already noted, the actual score assigned on a flexilevel test is y = x if the
last item is answered correctly and y = x + ½ if it is answered incorrectly.
Consequently the conditional distribution of test scores is
$$\phi(y \mid \theta) = \begin{cases} p_{n+1}(y \mid \theta) & \text{if } y \text{ is an integer,}\\ p_{n+1}(y - n - \tfrac{1}{2} \mid \theta) & \text{if } y \text{ is a half-integer.} \end{cases}$$  (8-3)

For any specified test design, this conditional frequency distribution ø(y|θ) can be computed from (8-1) and (8-2) for y = ½, 1, 1½, . . . , n for various values of θ.
Such a distribution constitutes the totality of possible information relevant to evaluating the effectiveness of y as a measure of ability θ. Having computed ø(y|θ), we compute its mean μy|θ and its variance σ²y|θ. The necessary derivative dμy|θ/dθ is readily approximated by numerical methods:

$$\frac{d\mu_{y|\theta}}{d\theta} = \frac{\mu_{y|\theta+\Delta} - \mu_{y|\theta}}{\Delta}$$

approximately, when Δ is a small increment in θ. From these we compute the information function [Eq. (5-3)] for test score y.
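A minimal Python sketch of this recursion follows. It assumes the three-parameter normal-ogive form for Pi(θ); the item parameters, test length, and function names are illustrative and are not taken from the examples of Section 8.7. Evaluating the mean of the resulting distribution at θ and at θ + Δ supplies the numerical derivative needed for the information function of Eq. (5-3).

from statistics import NormalDist

def P(theta, a, b, c):
    """Item response function P_i(theta) = c + (1 - c) * Phi[a(theta - b)]."""
    return c + (1.0 - c) * NormalDist().cdf(a * (theta - b))

def flexilevel_score_distribution(theta, b, a=1.0, c=0.0):
    """Return {y: probability} for the flexilevel score y at ability theta.

    b is a dict mapping item index i (-n+1, ..., n-1) to difficulty b_i;
    item 0 is administered first and n = (N + 1)/2 items are answered.
    """
    n = (len(b) + 1) // 2
    p = {0: 1.0}                                  # p_1(i | theta): item 0 is taken first
    for v in range(1, n + 1):                     # apply (8-1) and (8-2) n times
        p_next = {}
        for i, prob in p.items():
            Pi = P(theta, a, b[i], c)
            if i >= 0:                            # branch of (8-1) for i > 0 (and the start item 0)
                up, down = i + 1, i - v
            else:                                 # branch of (8-1) for i < 0
                up, down = i + v, i - 1
            p_next[up] = p_next.get(up, 0.0) + prob * Pi
            p_next[down] = p_next.get(down, 0.0) + prob * (1.0 - Pi)
        p = p_next                                # now p = p_{v+1}(i' | theta)
    # Convert p_{n+1}(i' | theta) to the score distribution (8-3):
    # i' > 0 means the last item was answered correctly, so y = x = i';
    # i' < 0 means it was answered incorrectly,       so y = x + 1/2 = n + i' + 1/2.
    return {(i if i > 0 else n + i + 0.5): prob for i, prob in p.items()}

# Example: 5 answered items (N = 9), equally spaced difficulties, examinee at theta = 0.
b = {i: 0.5 * i for i in range(-4, 5)}
dist = flexilevel_score_distribution(0.0, b)
mean = sum(y * pr for y, pr in dist.items())
var = sum((y - mean) ** 2 * pr for y, pr in dist.items())
print(round(sum(dist.values()), 6), round(mean, 3), round(var, 3))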

8.7. ILLUSTRATIVE FLEXILEVEL TESTS, NO GUESSING

The numerical results reported here are obtained on the assumption that Pi is a
three-parameter normal ogive [Eq. (2-2)]. The results would presumably be
about the same if Pi had been assumed logistic rather than normal ogive.
To keep matters simple, we consider tests in which all items have the same
discriminating power a and also the same guessing parameter c. Results are
presented here separately for c = 0 (no guessing) and c = .2. The results are
general for any value of a > 0, since a can be absorbed into the unit of
measurement chosen for the ability scale (see the baseline scale shown in the
figures). Each examinee answers exactly n = 60 items. For simplicity, we
consider tests in which the item difficulties form an arithmetic sequence, so that
bi+1 − bi = d, say, for i = −n + 1, −n + 2, . . . , n − 1.
Figure 8.7.1 compares the effectiveness of four 60-item (n = 60, N = 119)
flexilevel tests and three bench mark tests by means of score information curves.
The "standard test" is a conventional 60-item test composed entirely of items of
difficulty b = 0, scored by counting the number of right answers. There is no
guessing, so c = 0. The values of a and c are the same for bench mark and
flexilevel tests. The average value of bi, averaged over items, is zero for all
seven tests.
The figure shows that the standard test is best for discriminating among
examinees at ability levels near θ = 0. If good discrimination is important at θ = ±2/2a or θ = ±3/2a, then a flexilevel test such as the one with d = .033/2a or d = .050/2a is better. The larger d is, the poorer the measurement at θ = 0 but the better the measurement at extreme values of θ.

Suppose the best possible measurement is required at θ = ±2, with a = .5. It might be thought that an effective conventional 60-item test for this limited purpose would consist of 30 items at b = +2 and 30 items at b = −2. The curve for this last test is shown in Fig. 8.7.1. It can be shown numerically that with
a = .5 no conventional test with items at more than one difficulty level, scored

number-right, can simultaneously measure as well both at θ = +2 and at θ = −2 as does the standard test (which has all items peaked at b = 0).

[Figure: score information curves I{θ, y} for the standard test, the four flexilevel tests (d = .033/2a, .050/2a, .067/2a, .100/2a), and the two unpeaked bench mark tests (half the items at b = −2/2a and half at +2/2a; half at b = −2.8/2a and half at +2.8/2a), plotted against θ from −3/2a to +3/2a.]

FIG. 8.7.1. Score information functions for four 60-item flexilevel tests with b0 = 0 (dotted curves) and three bench mark tests, c = 0. (From F. M. Lord, The theoretical study of the measurement effectiveness of flexilevel tests. Educational and Psychological Measurement, 1971, 31, 805-813.)
The situation is different if the best possible measurement is required at θ = ±3, with a = .5. Using dichotomously scored items, the best 60-item conventional test for this purpose consists of 30 items at b = −2.8 and 30 items at b = +2.8, approximately. The curve for this test is shown in Fig. 8.7.1.
We know from Chapter 5 that I{θ, y} for a standard test is proportional to n, the test length. We thus see that when a = .75 (a common average value) the 60-item flexilevel test with d = .033/2a = .022 gives about as effective measurement as a

58-item standard test at θ = 0,
64-item standard test at θ = ±1,
86-item standard test at θ = ±2.

At θ = ±2, the 60-item flexilevel test with d = .1/2a = .067 is as effective as a 96-item standard test.
Comparisons between flexilevel tests and a "standard" peaked test are best
understood in the light of the following.
Theorem 8.7.1. When the ai are equal and the ci are equal for all items, no n-item test, no matter how scored, can provide more information at a given ability level θ0 than does number-right score on an n-item peaked test of suitable difficulty.

Proof. Given ai = a and ci = c, the test information function Σi I{θ, ui} at fixed ability level θ0 depends only on the bi (i = 1, 2, . . . , n). Its maximum value is therefore Σi maxbi I{θ, ui}. The same value of bi will maximize each item information function I{θ, ui}. Therefore, maximum information is produced by a peaked test (all items of equal difficulty). Number-right provides optimal scoring on such a test (Section 4.15).

8.8. ILLUSTRATIVE FLEXILEVEL TESTS, WITH GUESSING

Figure 8.8.1 compares the effectiveness of three 60-item flexilevel tests with
each other and with five bench mark tests. All items have c = .2 and all have the
same discriminating power a. Numerical labels on the curves are for a = .75.
The standard test is a conventional 60-item test with all items at difficulty level
b = .5/2a, scored by counting the number of right answers.
If all the item difficulties in any test were changed by some constant amount
b, the effect would be simply to translate the corresponding curve by an amount
b along the θ-axis. The difficulty level of each bench mark test and the starting
item difficulty level b0 of each flexilevel test in Fig. 8.8.1 has been chosen so as to give maximum information somewhere in the neighborhood of θ = 0.

[Figure: information curves I{θ, y} for the standard test, three flexilevel tests (d = .022, .033, .044 for a = .75), and peaked and unpeaked bench mark tests, plotted against θ from −3/2a to +3/2a.]

FIG. 8.8.1. Information functions for three 60-item flexilevel tests (dotted curves) and five bench mark tests, c = .2. (Numerical labels on curves are for a = .75.) (From F. M. Lord, The theoretical study of the measurement effectiveness of flexilevel tests. Educational and Psychological Measurement, 1971, 31, 805-813.)
The standard test is again found to be best for discriminating among examinees at ability levels near θ = 0. At θ = ±2 the flexilevel tests are better than any of the other conventional (bench mark) tests, although the situation is less clear than before because of the asymmetry of the curves.

When a = .75, the 60-item flexilevel test with b0 = −.6 and d = .022 gives about as effective measurement as a

58-item standard test at θ = 0,
60-item standard test at θ = ±.67,
83-item standard test at θ = +2,
114-item standard test at θ = −2.

At θ = −2, the 60-item flexilevel test with b0 = .9 and d = .044 is as effective as a 137-item standard test.

8.9. CONCLUSION

Near the middle of the ability range for which the test is designed, a flexilevel test is less effective than is a comparable peaked conventional test. In the outlying half of the ability range, the flexilevel test provides more accurate measurement in typical aptitude and achievement testing situations than a peaked conventional test composed of comparable items. The advantage of flexilevel tests over conventional tests at low ability levels is significantly greater when there is guessing than when there is not.
Since most examinees lie in the center of the distribution where the peaked
conventional test is superior, a flexilevel test may not have a higher reliability
coefficient for the total group than the peaked conventional test. The flexilevel
test is designed for situations where it is important to measure well at both high
and low ability levels. As shown by the unpeaked bench mark tests in the figures,
unpeaked conventional tests cannot do as well in any part of the range as a
suitably designed flexilevel test. The most likely application of flexilevel tests is
in situations where it would otherwise be necessary to unpeak a conventional test
in an attempt to obtain adequate measurement at the extremes of the ability
range. Such situations are found in nationwide college admissions testing and
elsewhere.
Empirical studies need to answer such questions as the following:

1. To what extent are different types of examinees confused by flexilevel testing?
2. To what extent does flexilevel testing lose efficiency because of an increase in testing time per item?
3. How adequately can we score the examinee who does not have time to finish the test?
4. How can we score the examinee who does not follow directions?
5. What other serious inconveniences and complications are there in flexilevel testing?
6. Is the examinee's attitude and performance improved when a flexilevel test "tailors" the test difficulty level to match his ability level?

Empirical investigations should study tests designed in accordance with the theory used here. Otherwise, it is likely that a poor choice of d and especially b0 will result in an ineffective measuring instrument.

Several empirical studies of varied merit have already been carried out, with
various results. The reader is referred to Betz and Weiss (1975), where several of
these are discussed, to Harris and Pennell (1977), and to Seguin (1976).

8.10. EXERCISES

Suppose the items in a flexilevel test are indexed by i = −n + 1, −n + 2, . . . , 0, . . . , n − 1. Suppose for all items a = 1/1.7, c = .2, and bi = i. The examinee starts by taking item 0. Assume the examinee's true ability is θ = 0.
8-1 Using Table 4.17.1, obtain the probability that the second item the examinee takes is item 1, and also the probability that the second item he takes is item −1.
8-2 Without using the formulas developed in this chapter, compute the probability that the third item the examinee takes is item j (j = −2, −1, 1, 2).
8-3 If n = 2 for this flexilevel test, write the relative frequency distribution ø(y|θ) of the final scores for an examinee at θ = 0.
8-4 Repeat Exercises 8-1, 8-2, 8-3 for an examinee at θ = −1. Compare ø(y|−1) and ø(y|0) graphically.

REFERENCES

Betz, N. E., & Weiss, D. J. Empirical and simulation studies of flexilevel ability testing. Research Report 75-3. Minneapolis: Psychometric Methods Program, Department of Psychology, University of Minnesota, 1975.
Harris, D. A., & Pennell, R. J. Simulated and empirical studies of flexilevel testing in Air Force technical training courses. Report No. AFHRL-TR-77-51. Brooks Air Force Base, Texas: Human Resources Laboratory, 1977.
Seguin, S. P. An exploratory study of the efficiency of the flexilevel testing procedure. Unpublished doctoral dissertation, University of Toronto, 1976.
9 Two-Stage Procedures¹ and Multilevel Tests

9.1. INTRODUCTION

A two-stage testing procedure consists of a routing test followed by one of several alternative second-stage tests. All tests are of conventional type. The
choice of the second-stage test administered is determined by the examinee's
score on the routing test.
The main advantage of such a procedure lies in matching the difficulty level of
the second test to the ability level of the examinee. Since conventional tests are
usually at a difficulty level suitable for typical examinees in the group tested,
two-stage testing procedures are likely to be advantageous chiefly at the extremes
of the ability range.
One will usually want to have some way of scoring the routing test quickly, so
that testing can proceed at a single session. Various paper-and-pencil procedures
that do not involve electronic machinery are possible. For example, the answer
sheet for the routing test can automatically produce a duplicated copy. The
original copy is collected at once and later scored as the official record of the
examinee's performance. The duplicate copy is retained by the examinee, who
scores it according to directions given by the examiner. The score assigned by the
examinee determines the second-stage test administered to him forthwith.
Two-stage testing is discussed by Cronbach and Gleser (1965, Chapter 6),
using a decision theory approach. They deal primarily with a situation where
examinees are to be selected or rejected. Their approach is chiefly sequential in

¹ Sections 9.1-9.8 are revised and printed with permission from F. M. Lord, A theoretical study of two-stage testing. Psychometrika, 1971, 36, 227-242.


the special sense that the second-stage test is administered only to borderline
examinees. The advantages of this procedure come from economy in testing
time.
In contrast, the present chapter is concerned with situations where the im­
mediate purpose of the testing is measurement, not classification. Here, the total
number of test items administered to a single examinee is fixed. Any advantage
of two-stage testing appears as improved measurement.
This chapter attempts to find, under specified restrictions, some good designs
for two-stage testing. A "good" procedure provides reasonably accurate mea­
surement for all examinees including those who would obtain near-perfect or
near-zero (or near-chance-level) scores on a conventional test.
The particulars at our disposal in designing a two-stage testing procedure
include the following:

1. The total number of items given to a single examinee (n).
2. The number of alternative second-stage tests available for use.
3. The number of alternative responses per item.
4. The number of items in the routing test (n1).
5. The difficulty level of the routing test.
6. The method of scoring the routing test.
7. The cutting points for deciding which second-stage test an examinee will
take.
8. The difficulty levels of the second-stage tests.
9. The method of scoring the entire two-stage procedure.

It does not seem feasible to locate truly optimal designs. We proceed by investigating several designs, modifying the best of these in various ways, choosing the best of the modifications, and continuing in this fashion as long as any modification can be found that noticeably improves results.

Two different two-stage procedures will be considered in this chapter. Sections 9.2-9.8 deal with the first two-stage procedure. Sections 9.9-9.13 deal with the second.

9.2. FIRST TWO-STAGE PROCEDURE—ASSUMPTIONS

The mathematical model to be used assumes that Pi(θ), the probability of a correct response to item i, is a three-parameter normal ogive [Eq. (2-2)]. We rewrite this as

$$P_i \equiv P_i(\theta) = c_i + (1 - c_i)\,\Phi[a_i(\theta - b_i)],$$  (9-1)
where Φ(t) denotes the area of the standard normal curve lying below t.

For the sake of simplicity, we assume that the available items differ only in
difficulty, bi. They all have equal discrimination parameters a and equal guess­
ing parameters c. Also, we consider here only the case where the routing test
and each of the second-stage tests are peaked; that is, each subtest is composed
of items all of equal difficulty. These assumptions mean that within a subtest all
items are statistically equivalent, with item response function P ≡ P(θ). (Sections
9.9-9.13 describe an approach that avoids these restrictive assumptions.)

9.3. SCORING

For a test composed of statistically equivalent items, number-right score x is a sufficient statistic for estimating an examinee's ability θ, regardless of the form of the item characteristic curve (Section 4.15). Thus at first sight it might seem that there is no problem in scoring a two-stage testing procedure when all subtests are peaked. It is clear, however, that different estimates of θ should be used for examinees who obtain the same number-right score on different second-stage tests having different difficulty levels.

What is needed is to find a function of the sufficient statistic x that is an unbiased estimator or at least a consistent estimator of θ. The maximum likelihood estimator, to be denoted by θ̂, satisfies these requirements and will be used here. (The reader who is mainly interested in the conclusions reached may skip to Section 9.5.)
For an m-item peaked subtest, the likelihood equation (4-31) becomes

$$\frac{\partial \ln L}{\partial \theta} = \frac{P'}{PQ}\,(x - mP) = 0,$$

where P′ is the derivative of P with respect to θ. Solving, we obtain the equation

$$P(\hat\theta) = \frac{x}{m}.$$  (9-2)

Substituting (9-2) into (9-1) and solving for Φ, we have

$$\Phi[a(\hat\theta - b)] = \frac{x/m - c}{1 - c},$$

where a, b, and c describe each item of the peaked subtest. The maximum likelihood estimator [compare Eq. (4-36)] is found by solving for θ̂:

$$\hat\theta = \frac{1}{a}\,\Phi^{-1}\!\left(\frac{x/m - c}{1 - c}\right) + b,$$  (9-3)

where Φ⁻¹ is the inverse of the function Φ (Φ⁻¹ is the relative deviate corresponding to a given normal curve area).
Equation (9-3) gives a sufficient statistic that is also a consistent estimator of θ and has minimum variance in large samples. The separate use of (9-3) for the
routing test and for the second-stage test yields two such estimates, θ̂1 and θ̂2, for any given examinee. These are jointly sufficient statistics for θ. They must be combined into a single estimate. In the situation at hand, it would be inefficient to discard θ̂1 and use only θ̂2. Unfortunately, there is no uniquely best way to combine the two jointly sufficient statistics.

For present purposes, θ̂1 and θ̂2 will be averaged after weighting them inversely according to their (estimated) large-sample variances. It is well known that this weighting produces a consistent estimator with approximately minimum large-sample variance (see Graybill and Deal, 1959). Thus, an examinee's score θ̂ on the two-stage test will be proportional to

$$\frac{\hat\theta_1}{\hat V(\hat\theta_1)} + \frac{\hat\theta_2}{\hat V(\hat\theta_2)},$$

where V̂ denotes an estimate of the large-sample variance Var. We multiply this by V̂(θ̂1)V̂(θ̂2)/[V̂(θ̂1) + V̂(θ̂2)] to obtain the examinee's overall score, defined as

$$\hat\theta = \frac{\hat\theta_1\,\hat V(\hat\theta_2) + \hat\theta_2\,\hat V(\hat\theta_1)}{\hat V(\hat\theta_1) + \hat V(\hat\theta_2)}.$$  (9-4)

The multiplying factor is chosen so that θ̂ is asymptotically unbiased:

$$\mathcal{E}\hat\theta \equiv \frac{\theta\,\mathrm{Var}\,\hat\theta_2 + \theta\,\mathrm{Var}\,\hat\theta_1}{\mathrm{Var}\,\hat\theta_1 + \mathrm{Var}\,\hat\theta_2} = \theta.$$
From Eq. (5-5), for equivalent items,

$$\mathrm{Var}\,\hat\theta = \frac{PQ}{m\,P'^{\,2}}.$$  (9-5)

From (9-1),

$$P' = (1 - c)\,a\,\varphi[a(\theta - b)],$$  (9-6)

where φ(t) is the normal curve ordinate at the relative deviate t. Thus, V̂(θ̂1) or V̂(θ̂2) can be obtained by substituting θ̂1 or θ̂2, respectively, for θ in the right-hand sides of (9-5) and (9-6).
When x = m (a perfect score) or x = cm (a "pseudo-chance score"), the θ̂ defined by (9-3) would be infinite. A crude procedure will be used to avoid this. Whenever x = m, x will be replaced by x = m − ½. All scores no greater than cm will be replaced by (l + cm)/2, where l is the smallest integer above cm.
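A minimal Python sketch of this scoring procedure follows: Eq. (9-3) for each peaked subtest, the large-sample variances of (9-5) and (9-6), and the weighted average (9-4), together with the crude adjustments for perfect and pseudo-chance scores. All numerical values in the example are illustrative.

import math
from statistics import NormalDist

def theta_hat(x, m, a, b, c):
    """Maximum likelihood estimate (9-3) from number-right x on an m-item peaked subtest."""
    if x >= m:                       # perfect score
        x = m - 0.5
    if x <= c * m:                   # at or below the pseudo-chance score
        x = (math.floor(c * m) + 1 + c * m) / 2.0
    p = (x / m - c) / (1.0 - c)
    return NormalDist().inv_cdf(p) / a + b

def var_theta_hat(th, m, a, b, c):
    """Large-sample variance (9-5), with P and P' from (9-1) and (9-6) evaluated at th."""
    P = c + (1.0 - c) * NormalDist().cdf(a * (th - b))
    Pprime = (1.0 - c) * a * NormalDist().pdf(a * (th - b))
    return P * (1.0 - P) / (m * Pprime ** 2)

def two_stage_score(x1, m1, b1, x2, m2, b2, a, c):
    """Overall score (9-4): the inverse-variance-weighted average of the two estimates."""
    t1, t2 = theta_hat(x1, m1, a, b1, c), theta_hat(x2, m2, a, b2, c)
    v1, v2 = var_theta_hat(t1, m1, a, b1, c), var_theta_hat(t2, m2, a, b2, c)
    return (t1 * v2 + t2 * v1) / (v1 + v2)

# Example: 11-item routing test at b = 0 and 49-item second-stage test at b = -0.5.
print(round(two_stage_score(x1=4, m1=11, b1=0.0, x2=20, m2=49, b2=-0.5, a=0.8, c=0.2), 3))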

9.4. CONDITIONAL DISTRIBUTION OF TEST SCORE θ̂

If there are n1 items in the routing test and n2 = n − n1 items in the second-stage test, there are at most (n1 + 1)(n2 + 1) different possible numerical values for θ̂. Let θ̂xy denote the value of θ̂ when the number-right score on the routing test is x
and on the second-stage test is y. By Eq. (4-1), the frequency distribution of x for fixed θ is the binomial

$$\binom{n_1}{x} P^x Q^{n_1 - x},$$

where P is given by (9-1) with ai = a, ci = c, and bi equal to the difficulty level (b, say) of the routing test. The distribution of y is the binomial

$$\binom{n_2}{y} P_y^{\,y} Q_y^{\,n_2 - y},$$

where Py is similarly given by (9-1) with bi equal to the difficulty level of the second-stage test, this being a numerical function of x, here denoted by b(x), assigned in advance by the psychometrician.

These two binomials are independent when θ is fixed. Given numerical values for n1, n2, a, b, c and for b(x) (x = 0, 1, . . . , n1), the exact frequency distribution pxy of the score θ̂ for an examinee at any given ability level θ can be computed from the product of the two binomials:

$$p_{xy} \equiv \mathrm{Prob}(\hat\theta = \hat\theta_{xy} \mid \theta) = \binom{n_1}{x} P^x Q^{n_1 - x} \binom{n_2}{y} P_y^{\,y} Q_y^{\,n_2 - y}.$$  (9-7)

This frequency distribution contains all possible information relevant for choosing among the specified two-stage testing procedures.
In actual practice, it is necessary to summarize somehow the plethora of numbers computed from (9-7). This is done by using the information function for θ̂ given by Eq. (5-3). For given θ, the denominator of the information function is the variance of θ̂ given θ, computed in straightforward fashion from the known conditional frequencies (9-7). We have similarly for the numerator

$$\mathcal{E}(\hat\theta \mid \theta) = \sum_{x=0}^{n_1}\sum_{y=0}^{n_2} p_{xy}\,\hat\theta_{xy}.$$

Since θ̂xy is not a function of θ,

$$\frac{d\,\mathcal{E}(\hat\theta \mid \theta)}{d\theta} = \sum_{x=0}^{n_1}\sum_{y=0}^{n_2} \frac{\partial p_{xy}}{\partial \theta}\,\hat\theta_{xy}.$$

A formula for ∂pxy/∂θ is easily written from (9-7) and (9-1), from which the numerical value of the numerator of Eq. (5-3) is calculated for given θ. In this way, I{θ, θ̂} is evaluated numerically for all ability levels of interest.
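The computation just described can be sketched in a few lines of Python. The sketch below forms the product-of-binomials distribution (9-7) and evaluates the information function of Eq. (5-3) with a forward-difference approximation to the derivative. For simplicity the example scores the procedure by total number right rather than by the θ̂xy of Section 9.3, and all parameter values are illustrative.

from math import comb
from statistics import NormalDist

def P3(theta, a, b, c):
    """Three-parameter normal-ogive response probability, Eq. (9-1)."""
    return c + (1.0 - c) * NormalDist().cdf(a * (theta - b))

def score_distribution(theta, n1, n2, a, b, c, b_of_x, score):
    """Return (score_xy, probability) pairs from the product of binomials (9-7)."""
    P1 = P3(theta, a, b, c)                     # routing-test item probability
    pairs = []
    for x in range(n1 + 1):
        px = comb(n1, x) * P1 ** x * (1 - P1) ** (n1 - x)
        P2 = P3(theta, a, b_of_x(x), c)         # second-stage test chosen by x
        for y in range(n2 + 1):
            py = comb(n2, y) * P2 ** y * (1 - P2) ** (n2 - y)
            pairs.append((score(x, y), px * py))
    return pairs

def information(theta, delta=1e-3, **kw):
    """I{theta, score} = (d mean / d theta)^2 / variance, via a forward difference."""
    def moments(t):
        pairs = score_distribution(t, **kw)
        mean = sum(s * p for s, p in pairs)
        var = sum((s - mean) ** 2 * p for s, p in pairs)
        return mean, var
    m0, v0 = moments(theta)
    m1, _ = moments(theta + delta)
    return ((m1 - m0) / delta) ** 2 / v0

# Example: 7-item routing test, four 13-item second-stage tests at b = -1, -.25, .25, 1,
# scored here simply by total number right (a cruder score than the one used in the text).
kw = dict(n1=7, n2=13, a=0.8, b=0.0, c=0.0,
          b_of_x=lambda x: (-1.0, -0.25, 0.25, 1.0)[min(x // 2, 3)],
          score=lambda x, y: x + y)
print(round(information(0.0, **kw), 3))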

9.5. ILLUSTRATIVE 60-ITEM TWO-STAGE TESTS, NO GUESSING
Figure 9.5.1 shows the information functions for five different testing procedures. For c = 0 (no guessing), only information curves symmetrical about θ = b were investigated.

[Figure: information curves I{θ, θ̂} for the standard and up-and-down bench mark tests and for two-stage designs "11; ±1.25, ±.75, ±.25" and "7; ±1.125, ±.3125", plotted against θ from b − 1.5/a to b.]

FIG. 9.5.1. Information functions for some two-stage testing designs when n = 60, c = 0.

For this reason, only the left portion of each curve is shown in Fig. 9.5.1.
The two solid curves are benchmarks, with which the two-stage procedures
are to be compared. The "standard" curve shows the information function for
the number-right score on a 60-item peaked conventional test whose items all
have the same difficulty level, b, and the same discriminating power, a. The
"up-and-down" benchmark curve is the " b e s t " of those obtained by the up-
and-down method of tailored testing (see Chapter 10; the benchmark curve of
Fig. 9.5.1 here is taken with permission from Fig. 7.6 of Lord, 1970).
If an examiner wants accurate measurement for typical examinees in the group
tested and is less concerned about examinees at the extremes of the ability range,
he should use a peaked conventional test. If a two-stage procedure is to be really
valuable, it will usually be because it provides good measurement for extreme as
well as for typical examinees. For this reason, an attempt was made to find
two-stage procedures with information curves similar to (or better than) the
"up-and-down" curve shown in the figure. For Sections 9.5 and 9.6 nearly 200
different two-stage designs were simulated for this search. Obviously, empirical
investigations of 200 designs would have been out of the question.
Surprisingly, Fig. 9.5.1 shows that when there is no guessing, it is possible
for a 60-item two-stage procedure to approximate the measurement efficiency of
a good 60-item up-and-down tailored testing procedure throughout the ability
range from θ = b − 1.5/a to θ = b + 1.5/a. The effectiveness of the two-stage
procedures shown falls off rather sharply outside this ability range, but this range
is adequate or more than adequate for most testing purposes.
The label " 1 1 ; ± 1, ± . 5 " indicates that the routing test contains n1 = 11
items (at difficulty b) and that there are four alternative 49-item second-stage
tests with difficulty levels b ± 1/a and b ± .5/a. The cutting points on this
routing test are equally spaced in terms of number-right scores, x1: If x1 — 0-2,
the examinee is routed to the easiest second-stage test; if x1 = 3 - 5 , to the next
easiest; and so on.
The label " 7 ; ± 1.125, ± . 3 1 2 5 " is similarly interpreted, the examinees
being routed according to the score groupings x1 = 0 - 1 , x1 = 2 - 3 , x1 = 4 - 5 ,
and x1 = 6-7. The label " 1 1 ; ± 1 . 2 5 , ± . 7 5 , ± . 2 5 " similarly indicates a proce­
dure with six alternative second-stage procedures, assigned according to the
groupings x1, = 0 - 1 , x1 = 2 - 3 , . . . , x1 = 10-11.
A 60-item up-and-down procedure in principle requires 1830 items before
testing can start; in practice, 600 items might be adequate without seriously
impairing measurement. Two of the two-stage procedures shown in Fig. 9.5.1
require only slightly more than 200 items.
The two-stage procedures shown in Figure 9.5.1 are the "best" out of approximately sixty 60-item procedures studied with c = 0. None of the two-stage
procedures that at first seemed promising according to armchair estimates turned
out well. From this experience, it seems that casually designed two-stage tests
are likely to provide fully effective measurement only over a relatively narrow
range of ability, or possibly not at all.

9.6. DISCUSSION OF RESULTS FOR 60-ITEM TESTS WITH NO GUESSING

Table 9.6.1 shows the information at four different ability levels obtainable from
some of the better procedures. The following generalizations are plausible and
should hold in most situations.

Length of Routing Test. If the routing test is too long, not enough items are
left for the second-stage test, so that measurement may be effective near θ = b
but not at other ability levels. The test is not adaptive. If the routing test is too
short, then examinees are poorly allocated to the second-stage tests. In this case,
if the second-stage tests all have difficulty levels near b, then effective measure­
ment may be achieved near θ = b but not at other ability levels; if the second-
stage tests differ considerably in difficulty level, then the misallocation of exam­
inees may lead to relatively poor measurement at all ability levels. The results
shown in Fig. 9.5.1 and Table 9.6.1 suggest that n1 = 3 is too small and n1 = 11 is too large for the range b ± 1.5/a in the situation considered, assuming that no more than four second-stage tests are used.

TABLE 9.6.1
Information for Various 60-Item Testing Procedures with c = 0

                                     Information** at
Procedure*                 θ = b − 1.5/a   b − 1/a   b − 0.5/a      b

Up-and-down (benchmark)        33.5          34.3       34.9      35.1
7; ±1.125, ±.3125              32.5          34.4       34.5      35.1
7; ±1, ±.25                    31.1          34.2       35.1      35.8
7; ±1, ±.25*                   27.0          31.4       35.8      37.0
7; ±1.25, ±.25                 33.2          33.7       33.7      35.1
7; ±.75, ±.25                  28.0          33.7       35.9      36.5
11; ±1, ±.25                   30.4          34.1       35.5      36.8
11; ±1, ±.5                    30.6          34.8       35.6      34.9
11; ±1.25, ±.375               32.6          34.0       34.6      35.5
3; ±.75, ±.25                  27.6          32.9       34.9      35.2
3; ±.75, ±.5                   28.0          33.8       34.0      33.4
7; ±.75                        28.6          34.4       34.5      31.4
7; ±.5                         24.4          32.9       36.0      34.9
3; ±.5                         24.5          32.5       34.5      34.4

*All cutting points are equally spaced, except for the starred procedure, which has score groups x1 = 0, x1 = 1-3, x1 = 4-6, x1 = 7.
**All information values are to be multiplied by a².

Number of Second-Stage Tests. There cannot usefully be more than n1 + 1 second-stage tests. The number of such tests will also often be limited by considerations of economy. If there are only two second-stage tests, good measurement
may be obtained in the subranges of ability best covered by these tests but not
elsewhere (see "7; ± .75" in Table 9.6.1). On the other hand, a short routing
test cannot make sufficiently accurate allocations to justify a large number of
second-stage tests. In the present study, the number of second-stage tests was
kept as low as possible; however, at least four second-stage tests were required to
achieve effective measurement over the ability range considered.

Difficulty of Second-Stage Tests. If the difficulty levels of the second-stage tests are all too close to b, there will be poor measurement at extreme ability levels (see "7; ±.75, ±.25" in Table 9.6.1). If the difficulty levels are too
extreme, there will be poor measurement near θ = b.

Cutting Points on Routing Test. It is clearly important that the difficulty levels of the second-stage tests should match the ability levels of the examinees allocated to them, as determined by the cutting points used on the routing test. It
is difficult to find an optimal match by the trial-and-error methods used here.
Although many computer runs were made using unequally spaced cutting points,
like those indicated in the footnote to Table 9.6.1, equally spaced cutting points
turned out better. This matter deserves more careful study.

9.7. ILLUSTRATIVE 15-ITEM TWO-STAGE TESTS WITH NO GUESSING

Some 40-odd different procedures were tried out for the case where a total of n =
15 items with c = 0 are to be administered to each examinee. The "best" of
these—those with information curves near the up-and-down bench mark—are
shown in Fig. 9.7.1. The bench mark here is again one of the "best" up-and-
down procedures [see Stocking (1969), Fig. 2].
Table 9.7.1 shows results for various other two-stage procedures not quite so
"good" as those in Fig. 9.7.1. In general, these others either did not measure
well enough at extreme ability levels or else did not measure well enough at θ = b. The results for n = 15 seem to require no further comment, since the general
principles are the same as for n = 60.
[Figure: information curves I{θ, θ̂} for the up-and-down bench mark test and for two-stage designs "3; ±1, ±.25", "3; ±1.25, ±.5", and "3; ±1.25, ±.25", plotted against θ from b − 1.5/a to b.]

FIG. 9.7.1. Information functions for some two-stage testing designs when n = 15, c = 0.

TABLE 9.7.1
Information for Various 15-Item Testing Procedures with c = 0

                                     Information** at
Procedure*                 θ = b − 1.5/a   b − 1/a   b − 0.5/a      b

Up-and-down (benchmark)         7.6           7.9        8.1       8.2
3; ±1.25, ±.5                   7.6           7.8        8.0       8.2
3; ±1.25, ±.25                  7.4           7.8        8.0       8.5
3; ±1, ±.25                     7.0           8.0        8.4       8.7
7; ±1.25, ±.5                   6.5           7.6        8.4       8.5
5; ±1.5, ±1, ±.5                7.2           7.7        8.0       8.1
4; ±1, 0*                       7.1           8.0        8.0       8.0
2; ±1, 0                        7.2           8.0        8.0       7.9
3; ±.25                         4.8           7.1        8.7       9.1
7; ±1                           6.2           7.8        8.0       7.5

*All cutting points are equally spaced, except for the starred procedure, which has score groups x1 = 0-1, x1 = 2, x1 = 3-4.
**All information values are to be multiplied by a².

9.8. ILLUSTRATIVE 60-ITEM TWO-STAGE TESTS WITH GUESSING

About 75 different 60-item two-stage procedures with c = .20 were tried out.
The "best" of these are shown in Fig. 9.8.1 along with an appropriate bench
mark procedure (see Lord, 1970, Fig. 7.8).
Apparently, when items can be answered correctly by guessing, two-stage
testing procedures are not so effective for measuring at extreme ability levels as
are the better up-and-down procedures. Unless some really "good" two-stage
procedures were missed in the present investigation, it appears that a two-stage
test might require 10 or more alternative second stages in order to measure well
throughout the range shown in Fig. 9.8.1. Such tests were not studied here
because the cost of producing so many second stages may be excessive. Possibly
a three-stage procedure would be preferable.
When there is guessing, maximum information is likely to be obtained at an
ability level higher than θ = b, as is apparent from Fig. 9.8.1. This means that
the examiner will probably wish to choose a value of b (the difficulty level of the
routing test) somewhat below the mean ability level of the group to be tested. If a
value of b were chosen near μθ, the mean ability level of the group, as might well
be done if there were no guessing, then the two-stage procedures shown in Fig.
9.8.1 would provide good measurement for the top examinees (above θ = b +
1/a) but quite poor measurement for the bottom examinees (below θ = b − 1/a). If an examiner wants good measurement over two or three standard deviations on each side of the mean ability level of the group, he should choose the value of b for the two-stage procedures in Fig. 9.8.1 so that μθ falls near b + .75/a. In this way, the ability levels of his examinees might be covered by the range from θ = b − .75/a to θ = b + 2.25/a, for example.

[Figure: information curves I{θ, θ̂} for the standard test and for two-stage tests 68, 69, and 65e, plotted against θ from b − 1.5/a to b + 2/a; the vertical scale runs from 0 to 25a².]

FIG. 9.8.1. Information functions for some two-stage testing designs when n = 60, c = .2.
The three two-stage tests shown in Fig. 9.8.1 are as follows. Test 68 has an 11-item routing test with six score groups x1 = 0-3, 4, 5-6, 7-8, 9-10, 11, corresponding to six alternative second-stage tests at difficulty levels b2 where a(b2 − b) = −1.35, −.65, −.325, +.25, +.75, and +1.5. Test 69 has a 17-item routing test with x1 = 0-5, 6-7, 8-10, 11-13, 14-15, 16-17 and a(b2 − b) = −1.5, −.75, −.25, +.35, +.9, +1.5. Test 65e has an 11-item routing test with x1 = 0-2, 3-4, 5-6, 7-8, 9-10, 11 and a(b2 − b) = −1.5, −.9, −.3, +.2, +.6, +1.0.
A table of numerical values would be bulky and is not given here. Most of the
conclusions apparent from such a table have already been stated.

9.9. CONVERTING A CONVENTIONAL TEST TO A MULTILEVEL TEST
In many situations, the routing test can be a take-home test, answered and scored
by the examinee at his leisure. This allows efficient use of the available super-
vised testing time. In this case, the examinee's score on the routing test cannot
properly be used except for routing. If the examinee takes or scores the routing
test improperly, the main effect is simply to lower the accuracy of his final
(second-stage) score.
Our purpose in the rest of this chapter is to design and evaluate various
hypothetical two-stage versions of the College Board Scholastic Aptitude Test,
Mathematics section. College Board tests are ordinarily scored and reported in
terms of College Board Scaled Scores. These scores presumably have a mean of
500 and a standard deviation of 100 for some imperfectly known historic group
of examinees.
Instead of making detailed assumptions about the nature of the routing test,
followed by complicated deductions about the resulting conditional distribution
of routing test scores, we proceed here with some simple practical assumptions:
(1) After suitable equating and scaling, the routing test yields scores on the
College Board scale; (2) an examinee's routing test scaled score is (approxi-
mately) normally distributed about his true scaled score with a known standard
deviation (standard error of measurement). This standard error will be taken to
be 75 scaled score points, except where otherwise specified.
Since different examinees take different second-stage tests, it is necessary that
all second-stage tests be equated to each other. It does not matter for present
purposes whether the equating is done by conventional methods or by item
characteristic curve methods (Chapter 13). We assume that after proper scaling
and equating, each level of the second-stage test yields scaled scores on the
College Board scale and that the expected value of each examinee's scaled score
is the same regardless of the test administered to him. Although this goal of
scaling and equating will never be perfectly achieved, the discrepancies should
not invalidate the relative efficiency curves obtained here.
For economy of items, the second-stage tests should be overlapping. In simple
cases, the basic design of the second-stage tests can be conveniently described by
three quantities:
   L    number of second-stage tests,
   n2   number of items per second-stage test,
   n    total number of items.

Another quantity of interest is

   m    number of items common to two adjacent second-stage tests.
If, as we assume, the overlap is always the same, then
$$m = \frac{n_2 L - n}{L - 1}.$$

[Figure: five horizontal bars showing which of the 60 items, arranged in order of difficulty, are included in each of the five overlapping levels.]

FIG. 9.9.1. Allocation of 60 items, arranged in order of difficulty, to five overlapping 32-item levels (second-stage tests).
In practice, L, m, n2, and n are necessarily integers. There is no cause for
confusion, however, if for convenience this restriction is sometimes ignored in
theoretical work.
In what follows, we frequently refer to a second-stage test as a level. Figure
9.9.1 illustrates a second-stage design with L = 5, n2 = 32, n = 60, and m =
25.
If there are too many second-stage tests, the scaling and equating of these tests
becomes burdensome. In any case, it will be found that there is relatively little
gain from having more than a few second-stage tests.

9.10. THE RELATIVE EFFICIENCY OF A LEVEL

In order to determine the efficiency of a particular level, it is necessary to have quantitative information about the items in it. If this information is to be available
before the level is constructed and administered, it is necessary that the level be
described in terms of items whose parameters are known. This is most readily
done by a specification, as in the following purely hypothetical example:
Level 1 of the proposed SAT mathematics aptitude test will consist of (1) two sets
of 5 items, each set having the same item parameters as the 5 easiest items in the
published Form SSA 45; (2) 15 items having the same item parameters as the next
15 easiest items in Form SSA 45.
With such a specification, assuming the item parameters of Form SSA 45 to have
been already estimated, it is straightforward to compute from the item parameters
the efficiency of the proposed level 1, relative to Form SSA 45.
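As an illustration, the relative efficiency of a proposed level might be computed along the following lines, assuming the three-parameter normal-ogive model and taking relative efficiency at a given ability to be the ratio of the two number-right score information functions; the item parameters below are invented and are not those of Form SSA 45. Since both information functions are rescaled by the same factor under a smooth monotonic change of the ability metric, the ratio is the same whether it is plotted against θ or against true scaled score.

from statistics import NormalDist

def icc(theta, item):
    a, b, c = item
    return c + (1.0 - c) * NormalDist().cdf(a * (theta - b))

def icc_deriv(theta, item):
    a, b, c = item
    return (1.0 - c) * a * NormalDist().pdf(a * (theta - b))

def number_right_information(theta, items):
    """I{theta, x} = (sum P_i')^2 / sum P_i Q_i for number-right score x."""
    num = sum(icc_deriv(theta, it) for it in items) ** 2
    den = sum(icc(theta, it) * (1.0 - icc(theta, it)) for it in items)
    return num / den

def relative_efficiency(theta, level_items, reference_items):
    return number_right_information(theta, level_items) / \
           number_right_information(theta, reference_items)

# Hypothetical 25-item easy level versus a hypothetical 60-item reference form.
level = [(0.8, -1.5 + 0.05 * k, 0.2) for k in range(25)]
reference = [(0.8, -2.0 + 0.07 * k, 0.2) for k in range(60)]
for theta in (-2.0, -1.0, 0.0, 1.0):
    print(theta, round(relative_efficiency(theta, level, reference), 3))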
[Figure: relative efficiency curves for levels 1-4, plotted against scaled scores 200-700.]

FIG. 9.10.1. Efficiency of each of the four levels of a poorly designed two-stage test, relative to Form SSA 45.

Figure 9.10.1 shows the relative efficiency curves obtained in this way for the
four (second-stage) levels of a hypothetical two-stage mathematics aptitude test.
This four-level test was the first trial design worked out by the writer after
studying the item parameters obtained for the SAT mathematics aptitude test,
Form SSA 45. It is presented here to emphasize that armchair designs of two-
stage tests, even when based on more data than are usually available, are likely to
be very inadequate.
The high relative efficiency of level 1 at scores below 300 was a desired part
of the design. The similar high efficiency of level 2 below 300 is completely
unnecessary, unplanned, and undesirable. Level 2 is too easy and too much like
level 1.
The intention was that each level should reach above 100% relative efficiency
for part of the score scale. Level 3 falls seriously short of this. As a result, the
four-level test design would be inferior to the regular SAT for the majority of
examinees. The shortcomings of level 3 could be remedied by restricting its
range of item difficulty. Level 4 may be unnecessarily effective at the top of the
score range and beyond. It should perhaps be easier.

9.11. DEPENDENCE OF THE TWO-STAGE TEST ON ITS LEVELS

The design inadequacies apparent in Fig. 9.10.1 can be rather easily covered up
by increasing the number of levels and restricting the difficulty range of each
level. After trying about a dozen different designs, the seven-level test shown in
Fig. 9.11.1 was devised.

The solid curves in the figure are the relative efficiency curves for the seven
levels (the lower portion of each curve is not shown). The dashed lines are
relative efficiency curves for the entire seven-level two-stage test (formulas for
obtaining the relative efficiency curve of the entire two-stage test from the curves
of the individual levels are derived in the Appendix at the end of this chapter).
The lower dashed curve assumes the routing test score has a standard error of
measurement of 90 scaled-score units. The upper curve assumes the very low
value of 30. To achieve a standard error of 30, the routing test would have to be
as long or longer than the present SAT—an impractical requirement included for
its theoretical interest.
As mentioned earlier, subsequent results to be given here assume the standard
error of measurement of the routing test to be 75 scaled-score points. This value
is bracketed by the two values shown. A standard error of about 75 would be
expected for a routing test consisting of 12 mathematics items.
The relationship between the efficiency curves for the individual levels and
the efficiency curve of the entire two-stage test is direct and visually obvious
from the figure. The effect of lowering the accuracy of the routing test is also
clear and simple to visualize. The effect is less than might be expected.
Each level in Fig. 9.11.1 has only two-thirds as many items as the regular
SAT mathematics aptitude test. Thus use of a two-stage test may enable us to
increase the accuracy of measurement while reducing the official testing time for
each examinee (ignoring the time required for the self-administered routing test,
answered in advance of the regular testing).

[Figure: relative efficiency curves for the seven levels (solid) and for the entire two-stage test with routing-test standard errors σ = 30 and σ = 90 (dashed), plotted against scaled scores 200-700; the cutting scores for levels 1-7 are marked along the baseline.]

FIG. 9.11.1. Relation between the relative efficiency of a two-stage test and the relative efficiency of the individual levels.

TABLE 9.12.1
Determination of Cutting Points for Assigning Levels of the Two-Stage Test in Fig. 9.11.1

Level    Scores at which RE = 1.00    Cutting score
1                    389
                                           324
2             259*       465
                                           400
3             335        519
                                           469
4             419        577
                                           531
5             485        671
                                           616
6             561        759*
                                           685
7             611

*Values obtained by extrapolation.

9.12. CUTTING POINTS ON THE ROUTING TEST

In order to use a routing test for assigning examinees to levels, the score scale
must be divided by cutting points that determine, for each scaled score on the
routing test, the level to be assigned. There is no simple and uniquely optimal
way to determine these cutting points.
A method that seems effective is illustrated in Table 9.12.1 for the multilevel test of Fig. 9.11.1. The cutting score between two adjacent levels is taken to be the average of the two RE = 1.00 scores it appears between in the table (the upper such score for the easier level and the lower such score for the harder level). The cutting scores so obtained are indicated along the baseline of Fig. 9.11.1.
A convincing justification for this procedure is not immediately apparent. The
procedure has been found to give good results as long as the levels are reasonably
spaced and exceed a relative efficiency of 1.00 in a suitable score interval. Small
changes in the cutting scores will have little effect on the RE curve of the
two-stage test.

9.13. RESULTS FOR VARIOUS TWO-STAGE DESIGNS

Further experimentation with different designs shows that, with care, good re-
sults can be achieved with a two-stage test having only three or four (second-
stage) levels.

[Figure: relative efficiency curves for the designs with L = 3, 4, 5, and 7 levels, plotted against scaled scores 200-700.]

FIG. 9.13.1. Relative efficiency of each of four hypothetical two-stage SAT Mathematics tests.

Relative efficiency curves for four different two-stage test designs
are shown in Fig. 9.13.1. The curves were obtained in an effort to raise the
lowest point of the curve without changing its general overall shape. It is proba-
bly not possible to make any great improvement from this point of view on the
designs shown. This may account for the fact that the four curves shown differ
little from each other. It would, of course, be easy to design two-stage tests with
very differently shaped curves, if that were desired.
The identifying number on each curve in Figure 9.13.1 is its L value, the
number of levels. The four designs shown are partially described in Table
9.13.1. A full description of each two-stage test would require listing all item
parameters for each level and would not add much of value to the illustrative
examples given.

TABLE 9.13.1
Description of Two-Stage Designs Shown in Fig. 9.13.1

Number of levels   Total number of items   Number of items per level   Minimum RE
       L                     n                        n2
       3                    102                       45                  1.045
       4                    114                       45                  1.076
       5                    114                       45                  1.101
       7                    123                       41                  1.074

9.14. OTHER RESEARCH

Marco (1977) describes a "multilevel" test that resembles a two-stage test ex­
cept that there is no routing test; instead, each examinee routes himself to levels
that seem of appropriate difficulty to him. Item response theory cannot predict
how a person will route himself. The present results may nevertheless be relevant
if his errors in self-routing are similar to the errors made by some kind of routing
test.
Empirical studies of two-stage testing are reported by Linn, Rock, and Cleary
(1969) and by Larkin and Weiss (1975). Other studies are cited in these refer­
ences. A simulation study of two-stage testing is reported by Betz and Weiss
(1974).
Simulation studies in general confirm, or at least do not disagree with, conclu­
sions reached here. Empirical studies frequently do not yield clear-cut results.
This last might well be expected whenever total group reliability or validity
coefficients are used to compare two-stage tests with conventional tests.
If the conventional test contains a wide spread of item difficulties, the two-
stage test may be better at all ability levels, in which case it will have higher
total-group reliability. If the conventional test is somewhat peaked at the appro­
priate difficulty level, however, it will be better than the two-stage test at moder­
ate ability levels where most of the examinees are found; the two-stage test will
be better at the extremes of the ability range. The two-stage test will in this case
probably show lower total-group reliability than the conventional test, because
most of the group is at the ability level where the conventional test is peaked.
Two-stage tests will be most valuable in situations where the group tested has
a wider range of ability than can be measured effectively by a peaked conven­
tional test.

9.15. EXERCISES

9-1 If a = .8, b = 0, c = .2, what is the maximum likelihood estimate (9-3) of ability when the examinee answers 25% of the items correctly on the routing test? 50%? 60%? 70%? 80%? 90%?
9-2 What is the square root of the estimated sampling variance of the maximum likelihood estimates found in Exercise 9-1, as estimated by substituting θ̂ for θ in (9-5) with m = 9? Comment on your results.
9-3 Suppose the three cutting scores on a seven-item routing test divide the score range as follows: 0-1, 2-3, 4-5, 6-7. If a = .8, b = 0, and c = .2, what proportion of examinees at θ = 0 will take each of the four second-stage tests? At θ = +2? At θ = −2?

APPENDIX

Information Functions for the Two-Stage Tests of Sections 9.9-9.13
An individual's score on a two-stage test of the type discussed in Sections
9.9-9.13 is simply his score on the second stage. As already noted, it is assumed
here that (after scaling and equating) this score, to be denoted by y, is expressed
on the College Board scale. In this case, the true score for y, to be denoted by Η,
is the true (College Board) scaled score.
The scaled score y is a linear function of the number-right observed score.
(When the observed score is a formula score involving a "correction for guess­
ing," this statement will still be correct if we restrict consideration to examinees
who answer all items, as discussed in detail in Chapter 15.) Since y is a linear
function of number-right observed score, η is a linear function of number-right
true score. Thus [Eq. (4-5)], η is a monotonic increasing function of ability θ.
Instead of using an information function on the θ scale of ability, it will be
convenient here to use an information function on the η scale of ability (see
Section 6.5). To compute the required information function I{η, y}, we need
formulas for μy|η and for σy|η, where y is the (scaled) score on the level
(second-stage test) assigned to the examinee.
We have assumed that the equating is carried out so that an examinee's score
y has the same expected value regardless of the level administered to him. Let l
denote the level administered (l = 1, 2, . . . , L) and yl the score obtained on that level. By definition of true score, μ(yl|η) = η for each examinee and each
level. It follows that the expected score of an examinee across levels is μy|η = η,
also.
By a common identity from analysis of variance,

$$\sigma^2_{y|\eta} = \sum_{l=1}^{L} p_{l|\eta}\,\mathrm{Var}(y_l \mid \eta) + \text{variance across levels of } \mu(y_l \mid \eta),$$

where pl|η is the probability that an examinee with true score η will be assigned to level l. Since μ(yl|η) is constant, the last term is zero. Thus the denominator of I{η, y} is

$$\sigma^2_{y|\eta} = \sum_{l=1}^{L} p_{l|\eta}\,\mathrm{Var}(y_l \mid \eta).$$

The numerator of I{η, y} is the square of dμy|η/dη = 1. Thus the desired information function on η for the entire two-stage testing procedure is

$$I\{\eta, y\} = \frac{1}{\sum_{l=1}^{L} p_{l|\eta}\,\mathrm{Var}(y_l \mid \eta)}.$$  (9-8)

The information function on η for a single second-stage test ("level") is clearly

$$I\{\eta, y_l\} = \frac{1}{\mathrm{Var}(y_l \mid \eta)}.$$  (9-9)

Thus the information function on η for the entire two-stage testing procedure is the harmonic mean of the L information functions I{η, yl} for the L separate second-stage tests. In forming the harmonic mean, the L levels are weighted according to the probability pl|η of their occurrence.

In the particular problem at hand, it was assumed that the routing test score was normally distributed about η with known standard deviation. Once some set of cutting scores for the routing test has been chosen as, for example, in Table 9.12.1, the probabilities pl|η are readily found from normal curve tables.
Let xl denote the number-right score on second-stage test l (l = 1, 2, . . . , L). Since the scaled score yl is a linear function of xl, we write yl = Al + Blxl and ηl = Al + Blξl, where Al and Bl are the constants used to place xl on the College Board scale. The conditional variance of xl when η = η0 is

$$\mathrm{Var}(x_l \mid \eta_0) = \mathrm{Var}(x_l \mid \theta_0),$$

where, because of Eq. (4-5), θ0 is defined in terms of η0 by

$$\eta_0 = A_l + B_l\,\xi_0 = A_l + B_l \sum_i^{(l)} P_i(\theta_0),$$  (9-10)

the summation over i being over all items in test l. By Eq. (4-3),

$$\mathrm{Var}(x_l \mid \theta_0) = \sum_i^{(l)} P_i(\theta_0)\,Q_i(\theta_0).$$

Thus, once the item response parameters have been determined, we can compute I{η, yl} from (9-9), (9-10), and

$$\mathrm{Var}(y_l \mid \eta_0) = B_l^2\,\mathrm{Var}(x_l \mid \eta_0) = B_l^2 \sum_i^{(l)} P_i(\theta_0)\,Q_i(\theta_0).$$  (9-11)
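A minimal Python sketch of the computation described by Eqs. (9-8)-(9-11) follows; the item parameters, scaling constants Al and Bl, cutting score, and routing-test standard error are all invented for illustration.

from statistics import NormalDist

def var_x_given_theta(theta, items):
    """Eq. (4-3): conditional variance of number-right score, the sum of P_i Q_i."""
    ps = [c + (1 - c) * NormalDist().cdf(a * (theta - b)) for a, b, c in items]
    return sum(p * (1 - p) for p in ps)

def theta_from_eta(eta, A, B, items, lo=-6.0, hi=6.0):
    """Solve Eq. (9-10), eta = A + B * sum P_i(theta), for theta by bisection."""
    f = lambda t: A + B * sum(c + (1 - c) * NormalDist().cdf(a * (t - b)) for a, b, c in items)
    for _ in range(60):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if f(mid) < eta else (lo, mid)
    return (lo + hi) / 2

def routing_probabilities(eta, cuts, se_routing):
    """p_{l|eta}: chance the routing score falls in each interval defined by the cuts."""
    cdf = [NormalDist(eta, se_routing).cdf(c) for c in cuts]
    edges = [0.0] + cdf + [1.0]
    return [edges[i + 1] - edges[i] for i in range(len(edges) - 1)]

def two_stage_information(eta, levels, cuts, se_routing):
    """Eq. (9-8): weighted harmonic mean of the level informations (9-9)."""
    p = routing_probabilities(eta, cuts, se_routing)
    total = 0.0
    for p_l, (A, B, items) in zip(p, levels):
        theta0 = theta_from_eta(eta, A, B, items)
        total += p_l * B ** 2 * var_x_given_theta(theta0, items)   # p_l * Var(y_l | eta), Eq. (9-11)
    return 1.0 / total

# Two invented 40-item levels on a College-Board-like scale, cut at a routing score of 500.
easy = (300.0, 10.0, [(0.8, -1.5 + 0.05 * k, 0.2) for k in range(40)])
hard = (380.0, 10.0, [(0.8, -0.5 + 0.05 * k, 0.2) for k in range(40)])
print(round(two_stage_information(500.0, [easy, hard], cuts=[500.0], se_routing=75.0), 6))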

REFERENCES

Betz, N. E., & Weiss, D. J. Simulation studies of two-stage ability testing. Research Report 74-4. Minneapolis: Psychometric Methods Program, Department of Psychology, University of Minnesota, 1974.
Cronbach, L. J., & Gleser, G. C. Psychological tests and personnel decisions (2nd ed.). Urbana, Ill.: University of Illinois Press, 1965.
Graybill, F. A., & Deal, R. B. Combining unbiased estimators. Biometrics, 1959, 15, 543-550.
Larkin, K. C., & Weiss, D. J. An empirical comparison of two-stage and pyramidal adaptive ability testing. Research Report 75-1. Minneapolis: Psychometric Methods Program, Department of Psychology, University of Minnesota, 1975.
Linn, R. L., Rock, D. A., & Cleary, T. A. The development and evaluation of several programmed
testing methods. Educational and Psychological Measurement, 1969, 29, 129-146.
Lord, F. M. Some test theory for tailored testing. In W. H. Holtzman (Ed.), Computer assisted
instruction, testing, and guidance. New York: Harper and Row, 1970.
Marco, G. L. Item characteristic curve solutions to three intractable testing problems. Journal of
Educational Measurement, 1977, 14, 139-160.
Stocking, M. Short tailored tests. Research Bulletin 69-63 and Office of Naval Research Technical
Report N00014-69-C-0017. Princeton, N.J.: Educational Testing Service, 1969.
10 Tailored Testing

10.1. INTRODUCTION¹

It seems likely that in the not too distant future many mental tests will be
administered and scored by computer. Computerized instruction will be com-
mon, and it will be convenient to use computers to administer achievement tests
also.
The computer can test many examinees simultaneously with the same or with
different tests. If desired, each examinee can be allowed to answer test questions
at his own rate of speed. This situation opens up new possibilities. The computer
can do more than simply administer a predetermined set of test items. Given a
pool of precalibrated items to choose from, the computer can design a different
test for each examinee.
An examinee is measured most effectively when the test items are neither too
difficult nor too easy for him. Thus for any given psychological trait the com-
puter's main task at each step of the test administration might be to estimate
tentatively the examinee's level on the trait, on the basis of his responses to
whatever items have already been administered. The computer could then choose
the next item to be administered on the basis of this tentative estimate.
Such testing has been called adaptive testing, branched testing, individualized
testing, programmed testing, sequential item testing, response-contingent test-
ing, and computerized testing. Clearly, the procedure could be implemented

¹This section is a revised version of the introductory section in F. M. Lord, Some test theory for
tailored testing. In W. H. Holtzman (Ed.), Computer assisted instruction, testing, and guidance.
New York: Harper and Row, 1970, pp. 139-183. Used by permission.


without a computer. Here, emphasizing the key feature, we shall speak of tai­
lored testing. This term was suggested by William W. Turnbull in 1951.
It should be clear that there are important differences between testing for
instructional purposes and testing for measurement purposes. The virtue of an
instructional test lies ultimately in its effectiveness in changing the examinee. At
the end we would like him to be able to answer every test item correctly. A
measuring instrument, on the other hand, should not alter the trait being mea­
sured. Moreover (see Section 10.2), measurement is most effective when the
examinee only knows the answers to about half of the test items. The discussion
here is concerned exclusively with measurement problems and not at all with
instructional testing.

10.2. MAXIMIZING INFORMATION

Suppose we have a pool of calibrated items. Which single item from the pool will
add the most to the test information function at a given ability level?
According to Eq. (5-6), each item contributes independently to the test information
function I{θ}. This contribution is given by

    I{θ, u_i} = P_i'² / (P_i Q_i),    (5-9)

the item information function. To answer the question asked, compute Eq. (5-9)
for each item in the pool and then pick the item that gives the most information at
the required ability level θ. It is useful here to discuss the maximum of the item
information function in some detail, so as to provide background for tailored
testing applications.
Under the logistic model [Eq. (2-1)] when there is no guessing, P_i' = D a_i P_i Q_i.
The item information function [Eq. (5-9)] is thus

    I{θ, u_i} = D² a_i² P_i(θ) Q_i(θ).    (10-1)
Now, PiQi is a maximum when Pi = .5. It follows that when there is no
guessing, an item gives its maximum information for those examinees who have a
50% chance of answering correctly. When Pi(θ) = .5, we have θ = bi. Thus,
when there is no guessing, an item gives its maximum information for examinees
whose ability θ is equal to the item difficulty bi. All statements in this paragraph
may be shown to hold for the normal ogive model also.
The maximum information, to be denoted by Mi, for the logistic model with
no guessing is seen from (10-1) to be

    M_i = D² a_i² / 4 = .722 a_i².    (10-2)

For the normal ogive model with no guessing,

    M_i = 2 a_i² / π = .637 a_i².    (10-3)
Note that maximum information Mi is proportional to the square of the item
discriminating power a i . Thus an item at the proper difficulty level with ai =
1.0 is worth as much as four items with ai = .5.
On a certain item, suppose that examinees guess at random with probability p
of success whenever they do not know the correct answer. (Item response theory
does not use this assumption; it is used here only as a bench mark.) According to
this supposition, the actual proportion of correct answers to the item at ability
level θ will be Pi(θ) + pQi(θ). Accordingly, a common rule of thumb for test
design is that the average "item difficulty" ( = proportion of correct answers in
the group tested) should be .5 when there is no guessing and ½(1 + p) when
there is random guessing with chance of success p. Let us check this rule using
item information functions.
It is not difficult to show (Birnbaum, 1968, Eq. 20.4.21) for the three-
parameter logistic model that an item gives maximal information at ability level
θ = θ_i, where

    θ_i = b_i + (1 / (D a_i)) ln [(1 + √(1 + 8c_i)) / 2].    (10-4)
When ci = 0, the item gives maximal information when θ = bi. When ci ≠ 0,
θi > bi. The distance from the item difficulty level bi to the optimal θi is
inversely proportional to the item discriminating power ai.
It is readily found from (10-4) that when ability and item difficulty b_i are
optimally matched, the proportion of correct answers is

    P_i(θ_i) = ¼ (1 + √(1 + 8c_i)).    (10-5)

If we substitute ci for p in the old rule of thumb for test design and subtract the
results from (10-5), the difference vanishes for ci = 0 and for ci = 1; for all
other permissible values of c i , Pi(θi) exceeds the probability given by the rule of
thumb. Thus, under the logistic model, an item will be maximally informative
for examinees whose probability of success is somewhat greater than ½(1 + ci).
Some typical values for Pi(θi) are

    c_i:         0     .1    .15   .2    .25   .333  .5
    ½(1 + c_i):  .500  .55   .575  .60   .625  .667  .75
    P_i(θ_i):    .500  .585  .621  .653  .683  .729  .809
It can be shown by straightforward algebra that the most information that can
be provided by a logistic item with specified parameters a_i and c_i is

    M_i = [D² a_i² / (8(1 − c_i)²)] [1 − 20c_i − 8c_i² + (1 + 8c_i)^{3/2}].    (10-6)
Typical maximal values can be inferred from the following list:

    c_i:       0    .1   .167  .2   .25  .333  .5
    M_i/a_i²:  .72  .60  .52   .49  .45  .38   .26
If items of optimal difficulty are used to measure an examinee, items with ci =
.25 will give only .63 as much information as free-response items (ci = 0). Items
with ci = .50 will give only .36 as much information.
Results for the three-parameter normal ogive cannot be written in simple
form. The general picture discussed above remains unchanged, however.
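
As a numerical check on Eqs. (10-1) through (10-6), the quantities above can be computed directly. The sketch below assumes the three-parameter logistic model with D = 1.7; the item parameters are hypothetical, chosen only for illustration.

```python
import numpy as np

D = 1.7

def item_information(theta, a, b, c):
    """Three-parameter logistic item information function, Eq. (5-9)."""
    P = c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))
    P_prime = D * a * (P - c) * (1 - P) / (1 - c)    # derivative of the 3PL response function
    return P_prime ** 2 / (P * (1 - P))

def optimal_theta(a, b, c):
    """Ability level at which the item gives maximal information, Eq. (10-4)."""
    return b + np.log((1 + np.sqrt(1 + 8 * c)) / 2) / (D * a)

def max_information(a, c):
    """Maximum of the item information function, Eq. (10-6)."""
    return (D ** 2 * a ** 2 / (8 * (1 - c) ** 2)) * (1 - 20 * c - 8 * c ** 2 + (1 + 8 * c) ** 1.5)

a, b, c = 1.0, 0.5, 0.25                      # hypothetical item parameters
theta_i = optimal_theta(a, b, c)
print(theta_i)                                # optimal ability level theta_i
print(0.25 * (1 + np.sqrt(1 + 8 * c)))        # P_i(theta_i), Eq. (10-5): about .683
print(max_information(a, c))                  # about .45 * a**2, as in the list above
print(item_information(theta_i, a, b, c))     # the same value, obtained from Eq. (5-9) directly
```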

10.3. ADMINISTERING THE TAILORED TEST

Consider now the tailored testing of a single examinee. If we know nothing about
him, we may administer first an item of middle difficulty from the available item
pool. If we have information about the examinee's educational level, or some
other relevant fact, we may be able to pick a first item that is better matched to
his ability level. Unless the test is very short, a poor choice of the first item will
have little effect on the final result.
If the examinee answers the first item incorrectly (correctly), we suppose that
it is hard (easy) for him, so we choose an easier (harder) item to administer next.
If he answers this incorrectly (correctly) also, we next administer a still easier
(harder) item, and so on.
There will be no finite maximum likelihood estimate of the examinee's ability
as long as his answers are all correct or all incorrect. Such a situation will not
continue very long, however: If successive decrements (increments) in item
difficulty are sizable, as they should be, we will soon be administering items at
an extreme level of difficulty or easiness.
Once the examinee has given at least one right answer and at least one wrong
answer, it is usually possible to solve the likelihood Eq. (5-19) for θ, obtaining a
finite maximum likelihood estimate, denoted by θ̂, of the examinee's ability.
Since Eq. (5-19) is an equation in just one unknown (θ), it may be readily solved
by numerical methods.
Samejima (1973) has pointed out that in certain cases the likelihood equation
may have no finite root or may have both a finite and an infinite root (see end of
Section 4.13). If this occurs, we can follow Samejima's suggestion (1977) and
administer next an extremely easy item if θ̂ = −∞ or an extremely hard item if
θ̂ = +∞. This procedure (repeated if necessary) should quickly give a usable
ability estimate without danger of further difficulties. Such difficulties are
extremely rare, once the number of items administered is more than 10 or 15.
As soon as we have a maximum likelihood estimate θ̂ of the examinee's
ability, we can evaluate the information function of each item in the pool at θ = θ̂.
We administer next the item that gives the most information at θ̂. When the
examinee has responded to this new item, we can reestimate θ̂ and repeat the
procedure. When enough items have been administered, the final θ̂ is the
examinee's score. All such scores are on the same scale for all examinees, even
though different examinees may have taken totally different sets of items.
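
A minimal sketch of this administration rule follows. It assumes a hypothetical precalibrated three-parameter logistic item pool and uses a bounded grid search for the maximum likelihood estimate θ̂, so the all-right or all-wrong case discussed above simply yields an endpoint of the grid rather than an infinite estimate; none of the numerical choices below come from the text.

```python
import numpy as np

D = 1.7
rng = np.random.default_rng(0)

# Hypothetical calibrated pool of 200 three-parameter logistic items.
pool_a = rng.uniform(0.6, 1.6, 200)
pool_b = rng.uniform(-2.5, 2.5, 200)
pool_c = np.full(200, 0.2)

def P(theta, a, b, c):
    return c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))

def info(theta, a, b, c):
    p = P(theta, a, b, c)
    return (D * a * (p - c) * (1 - p) / (1 - c)) ** 2 / (p * (1 - p))

def mle(items, responses, grid=np.linspace(-4, 4, 401)):
    """Bounded grid search for the maximum likelihood estimate of theta."""
    a, b, c = pool_a[items], pool_b[items], pool_c[items]
    loglik = [np.sum(responses * np.log(P(t, a, b, c)) +
                     (1 - responses) * np.log(1 - P(t, a, b, c))) for t in grid]
    return grid[int(np.argmax(loglik))]

def tailored_test(true_theta, n_items=25):
    """Administer n_items, always choosing the item most informative at the current theta-hat."""
    administered, responses = [], []
    theta_hat = 0.0                       # start near the middle of the difficulty range
    for _ in range(n_items):
        remaining = [i for i in range(len(pool_a)) if i not in administered]
        nxt = max(remaining, key=lambda i: info(theta_hat, pool_a[i], pool_b[i], pool_c[i]))
        u = int(rng.random() < P(true_theta, pool_a[nxt], pool_b[nxt], pool_c[nxt]))
        administered.append(nxt)
        responses.append(u)
        theta_hat = mle(np.array(administered), np.array(responses))
    return theta_hat                      # the final theta-hat is the examinee's score

print(tailored_test(true_theta=1.0))
```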

10.4. CALIBRATING THE TEST ITEMS

The pool of items available to the computer must be very much larger than the
number n of items administered to any one examinee. If the pool contains 200 or
more items, it may be impractical to calibrate the items by administering them all
simultaneously to a single group of examinees. In certain cases, furthermore, the
range of item difficulty may be too great for administration to a single group:
Low-ability examinees, for example, who are needed to calibrate the easy items,
might find the very hard items intolerable.
When different items are calibrated on different groups of examinees, the
calibrations will in general not be comparable, because of the essential indeter­
minacy of origin and scale from group to group (see Section 3.5). There are
many special ways to design test administrations so that the data can be pieced
together to place all the estimated parameters on the same scale. A simple
design might be as follows.
Divide the entire pool of items to be calibrated into K modules. If a very wide
range of item difficulty is to be covered, modules 1, 2, . . . , ½K should increase
in difficulty from module to module; modules ½K, ½K + 1, . . . , K should decrease in
difficulty. Form a subtest by combining modules 1 and 2; another by combining
modules 2 and 3; another by combining 3 and 4; . . . ; another by combining K - 1
and K. Form a Kth subtest by combining modules K and 1. Administer the K
subtests to K nonoverlapping groups of examinees, giving a different subtest to
each group.
With this design, each item is taken by two groups of examinees. Each group
of examinees shares items with two other groups. This interlocking makes it
possible to estimate all item parameters and all ability parameters by maximum
likelihood simultaneously in a single computer run (but see Chapter 13 Appendix
for a procedure to accelerate iterative convergence). Thus all item parameters are
placed on the same scale without any inefficient piecing together of estimates
from different sources.
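
A small sketch of this interlocking module design (the number of modules and the module contents below are hypothetical):

```python
K = 6  # hypothetical number of modules
modules = {m: [f"item_{m}_{j}" for j in range(1, 4)] for m in range(1, K + 1)}

# Subtest k combines module k with module k + 1; the Kth subtest combines module K
# with module 1.  Every module then appears in exactly two subtests, and each group
# of examinees shares items with two other groups, interlocking all the data.
subtests = {k: modules[k] + modules[k % K + 1] for k in range(1, K + 1)}

for k, items in subtests.items():
    print(k, items)
```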

10.5. A BROAD-RANGE TAILORED TEST

Two parallel forms of a tailored test of verbal ability have been built, using the
principles outlined in the preceding sections. A main feature is that this test is
appropriate at any level of verbal ability from fourth grade up through graduate
school.

Many of the test items for grades 4 to 12 were obtained from the Cooperative
School and College Ability Tests and the Cooperative Sequential Tests of Educa­
tional Progress. The remaining items were obtained from the College Board's
Preliminary SAT, their regular SAT, and the Graduate Record Examination.
A total of more than 1000 verbal items were available from these sources. All
items were calibrated and put on the same scale by piecing together scraps of data
available from various regular test administrations and from various equating
studies. The resulting item parameter estimates are not so accurate as could have
been obtained by the procedure outlined in the preceding section; this would have
required a large amount of special testing, however.
A very few items with very poor discriminating power ai were discarded. A
better test could have been constructed by keeping only a few hundred of the
most discriminating items (as done by Urry, 1974). Here, this was considered
undesirable in principle because of the economic cost of discarding hundreds of
test items.
Additional items were discarded because they were found to duplicate mate­
rial covered by other items. The remaining 900 items were arranged in order of
difficulty (b i ) and then grouped on difficulty into 10 groups.
All items in the most extreme groups were retained because of a scarcity of
very difficult and very easy items. At intermediate difficulty levels, many more
items were available than were really needed for the final item pool. Although all
items could have been retained, 50 items were chosen at random from each
difficulty level for use in the final pool for the broad-range tailored test, thus
freeing the other items for other uses. Note again that this selection was made at
random and not on the basis of item discriminating power, for the reason outlined
in a preceding paragraph.
A total of 363 items were selected by this procedure. Five different item types
were represented. Within each of the 10 difficulty levels, items of a given item
type were grouped into pairs of approximately equal difficulty (bi). Two parallel
item pools were then formed by assigning one item from each pair at random to
each pool. The two pools thus provide two parallel forms for the broad-range
tailored test.
Each of the two item pools has exactly 25 items at each difficulty level, except
for extreme levels where there are insufficient items. This makes it possible to
administer 25 items to an examinee, all or most at a difficulty level appropriate
for him. In using the broad-range tailored test, exactly 25 items are administered
to each examinee.
If the items given to one examinee were selected solely on the basis of I{θ̂, u_i},
it could happen by chance, in an extreme case, that examinee A might
receive only items of type C and examinee B might receive only items of type D.
If this happened, it would cast considerable doubt on any comparison of the two
examinees' "verbal ability" scores. One good way to avoid this problem would
be to require that the first item administered to any examinee always be of type

C, the second item always be of type D, and so forth. This would assure that all
examinees take the same number of items of each type. A practical approxima­
tion to this was implemented for the broad-range tailored test. The details need
not be spelled out here.
Once a maximum likelihood estimate of the examinee's ability is available, as
described in Section 10.3, the item to be administered next is thereafter always
the item of the required item type that gives the most information at the currently
estimated ability level θ̂. If one item in the pool has optimal difficulty level
(10-4) at θ̂ but another item is more discriminating, the latter item may give more
information at θ̂ and may thus be the one selected to be administered next. Note
that this procedure tends to administer the most discriminating items (highest a_i)
first and the least discriminating items last or not at all.

10.6. SIMULATION AND EVALUATION

For the flexilevel tests and for the two-stage tests of preceding chapters, it is
possible to write formulas for computing the (conditional) mean and variance of
the final test score for people at any specified ability level. The information
function can then be evaluated from these. Some early theoretical work in tai­
lored testing was done in the same way. The up-and-down branching method
(Lord, 1970) and the Robbins-Monro branching method (Lord, 1971) both have
formulas for the small-sample conditional mean and variance of the final test
score.
The method used here for choosing items, while undoubtedly more efficient
than the up-and-down or the Robbins-Monro methods, does not appear to
permit calculation of the required mean and variance. Thus, any comparative
evaluation of procedures here must depend on Monte Carlo estimates of the
required mean and variance, obtained by computer simulation. Monte Carlo
methods are more expensive and less accurate than exact formulas. Monte Carlo
methods should be avoided whenever formulas can be obtained.
Simulated tailored testing can be carried out as follows. A set of equally
spaced ability levels are chosen for study. The following procedure is repeated
independently for each ability level θ.
Some way of selecting the first item to be administered is specified. The
known parameters of the first item administered are used to compute P_1(θ), the
probability of a correct response to item 1 for examinees at the chosen θ level. A
hypothetical observation u_{1a} = 0 or u_{1a} = 1 is drawn at random with probability
of success P_1(θ). This specifies the response of examinee a (at the specified
ability level θ) to the first item. The second item to be administered is chosen by
the rules of Section 10.3. Then P_2(θ) is computed, and a value of u_{2a} is drawn at
random with probability of success P_2(θ). The entire process is repeated until
n = 25 items have been administered to examinee a. According to the rules of
Section 10.3, this will involve the computation of many successive maximum
likelihood estimates θ̂_a of the ability of examinee a. The successive θ̂_a are used
to select the items to be administered but not in the computation of P_i(θ). The
final θ̂_a, based on the examinee's responses to all 25 items, is his final test score.
In the simulations reported here, the foregoing procedure was repeated independently
for 200 examinees at each of 13 different θ levels. At each θ level, the
mean m and variance s² of the 200 final scores were computed. In principle, if
the θ levels are not too far apart, the information function at each chosen level of
θ can be approximated from these results, using the formula (compare Section 5.2)

    I{θ, θ̂} ≈ [m(θ̂|θ_{+1}) − m(θ̂|θ_{−1})]² / [(θ_{+1} − θ_{−1})² s²(θ̂|θ_0)],    (10-7)

where θ_{−1}, θ_0, and θ_{+1} denote successive levels of θ, not too far apart. This
formula uses a common numerical approximation to the derivative of m(θ̂|θ).
Both μ(θ̂|θ) and σ²(θ̂|θ) can be estimated by the Monte Carlo method from
200 final scores with fair accuracy. The difference in the numerator of (10-7)
is a quite unstable estimator, however, because of loss of significant figures
due to cancellation. This is a serious disadvantage of Monte Carlo evaluation of
test procedures and designs.
In the present case, it was found that μ(θ̂|θ) was close to θ, showing that θ̂ is a
reasonably unbiased estimator of θ. Under such conditions [see Eq. (5-8)], the
information function is inversely proportional to the error variance σ²(θ̂|θ). Results
here are therefore presented in terms of estimated error variance (Fig.
10.7.1) or its reciprocal (Fig. 10.7.2).
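
A compact sketch of this Monte Carlo evaluation is given below. The item pool, the number of replications, and the spacing of the θ levels are hypothetical; the final lines apply the approximation (10-7) to the simulated means and variances. (With pure-Python loops the run is slow; it is meant only to make the procedure concrete.)

```python
import numpy as np

D = 1.7
rng = np.random.default_rng(1)

# Hypothetical 3PL item pool for the simulated tailored test.
a = rng.uniform(0.6, 1.6, 200)
b = rng.uniform(-2.5, 2.5, 200)
c = np.full(200, 0.2)
grid = np.linspace(-4, 4, 81)            # bounded grid for the ML estimate

def P(theta, idx):
    return c[idx] + (1 - c[idx]) / (1 + np.exp(-D * a[idx] * (theta - b[idx])))

def info(theta, idx):
    p = P(theta, idx)
    return (D * a[idx] * (p - c[idx]) * (1 - p) / (1 - c[idx])) ** 2 / (p * (1 - p))

def final_score(theta, n=25):
    """One simulated examinee at ability theta; returns the final ML estimate theta-hat."""
    used, u = [], []
    t_hat = 0.0
    for _ in range(n):
        nxt = max((i for i in range(len(a)) if i not in used), key=lambda i: info(t_hat, i))
        used.append(nxt)
        u.append(int(rng.random() < P(theta, nxt)))
        ll = [sum(ui * np.log(P(t, i)) + (1 - ui) * np.log(1 - P(t, i))
                  for i, ui in zip(used, u)) for t in grid]
        t_hat = grid[int(np.argmax(ll))]
    return t_hat

# Eq. (10-7): approximate the information at theta_0 = 0 from simulated final scores
# at three neighboring levels (100 replications per level here, for speed).
theta_levels = (-0.25, 0.0, 0.25)
scores = {t: [final_score(t) for _ in range(100)] for t in theta_levels}
m_lo, m_hi = np.mean(scores[-0.25]), np.mean(scores[0.25])
s2_mid = np.var(scores[0.0], ddof=1)
print((m_hi - m_lo) ** 2 / ((0.25 - (-0.25)) ** 2 * s2_mid))
```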

10.7. RESULTS OF EVALUATION

A proper choice of starting point was often an important concern in previous
studies of tailored testing (Lord, 1970, 1971). If the difficulty b_i of the first item
is too far from the ability level of the examinee, up-and-down and Robbins-Monro
branching processes sometimes waste many items before finding the proper ability level.
This is especially true whenever a low-ability examinee by lucky guessing an­
swers correctly the first four or five items administered.
Figure 10.7.1² shows the effect of starting point on the standard error of
measurement s(θ̂|θ) for the broad-range tailored test. The points marked + were
obtained when the difficulty level of the first item administered was near −1.0 on
the horizontal scale—about fifth-grade level. The small dots represent the results
when the difficulty level of the first item was near 0—about ninth-grade level.
For the hexagons, it was near .75—near the average verbal ability level of college
applicants taking the College Entrance Examination Board's Scholastic
Aptitude Test. For the points marked by an x, it was near 1.5. For any given
ability level, the standard error of measurement varies surprisingly little, considering
the extreme variation in starting item difficulty. We see that the difficulty
level of the first item administered is not likely to be a serious problem for the
kind of tailored testing recommended here.

[Figure 10.7.1 here: standard error of measurement (vertical axis, 0.0 to 0.3) plotted against ability (horizontal axis, −1.0 to 2.0, with grade levels V through XII marked).]

FIG. 10.7.1. The standard error of measurement at 13 different ability levels
for four different starting points for the 25-item broad-range tailored test.

²Figures 10.7.1 and 10.7.2, also the accompanying explanations, are taken with permission from
F. M. Lord, A broad-range tailored test of verbal ability. In C. L. Clark (Ed.), Proceedings of the
First Conference on Computerized Adaptive Testing. Washington, D.C.: United States Civil Service
Commission, 1976, pp. 75-78; also Applied Psychological Measurement, 1977, 1, 95-100.
It is important to compare the broad-range tailored test with a conventional
test. Let us compare it with a 25-item version of the Preliminary Scholastic
Aptitude Test of the College Entrance Examination Board. Figure 10.7.2 shows
the information function for the Verbal score on each of three forms of the PSAT
adjusted to a test length of just 25 items and also the approximate information
function for the verbal score on the broad-range tailored test, which administers
just 25 items to each examinee. The PSAT information functions are computed
from estimated item parameters; the tailored test information function is the
reciprocal of the squared standard error of measurement from Monte Carlo re-
sults. The tailored test shown in Fig. 10.7.2 corresponds to the hexagons of Fig.
10.7.1.
This tailored test is at least twice as good as a 25-item conventional PSAT at
almost all ability levels. This is not surprising: At the same time that we are
tailoring the test to fit the individual, we are taking advantage of the large item
pool, using the best 25 items available within certain restrictions on item type.
Because we are selecting only the best items, the comparison may be called
unfair to the PSAT. It is not clear, however, how a "fair" evaluation of the
tailored test is to be made.

[Figure 10.7.2 here: information (vertical axis, 0 to 8.0) plotted against ability (horizontal axis, −1.25 to 1.75).]

FIG. 10.7.2. Information function for the 25-item tailored test, also for three
forms of the Preliminary Scholastic Aptitude Test (dotted lines) adjusted to a
test length of 25 items.

Part of the advantage of the tailored test is due to matching the difficulty of the
items administered to the ability level of the examinee. Part is due to selecting the
most discriminating items. A study of a hypothetical broad-range tailored test
composed of items all having the same discriminating power would throw light
on this problem. It would show how much gain could be expected solely from
matching item difficulty to ability level.
The advantages of selecting the best items from a large item pool are made
clear by the following result. Suppose each examinee answers just 25 items, but
these are selected from the combined pool of 363 items rather than from the pool
of 183 items used for Fig. 10.7.2. Monte Carlo results show that the tailored test
with the doubled item pool will give at least twice as much information as the
25-item tailored test of Fig. 10.7.2. Selecting the best items from a 363-item pool
gives a better set of 25 items than selecting from a 183-item pool.
If it is somehow uneconomical to make heavy use of the most discriminating
items in a pool, one could require that item selection should be based only on
item difficulty and not on information or discriminating power. If this restriction
is not accepted, it is not clear how adjustment should be made for size of item
pool when comparing different tailored tests.

10.8. OTHER WORK ON TAILORED TESTS

It is not appropriate here to give a complete list of references on tailored testing.


Deserving special attention are work for the U.S. Civil Service Commission by
Urry and others and also work for the Office of Naval Research by Weiss and

others. Selected reports from these sources and others are included in the list of
references. For good recent reviews of tailored testing, see McBride (1976) and
Killcross (1976).

REFERENCES

Birnbaum, A. Estimation of an ability. In F. M. Lord & M. R. Novick, Statistical theories of mental
test scores. Reading, Mass.: Addison-Wesley, 1968.
Clark, C. (Ed.). Proceedings of the First Conference on Computerized Adaptive Testing. Profes-
sional Series 75-6. Washington, D.C.: U.S. Government Printing Office, 1976.
Cliff, N., Cudeck, R. A., & McCormick, D. Implied orders as a basis for tailored testing. Technical
Report No. 6. Los Angeles: University of Southern California, 1978.
Davis, C. E., Hickman, J., & Novick, M. R. A primer on decision analysis for individually
prescribed instruction. Technical Bulletin No. 17. Iowa City, Iowa: Research and Development
Division, The American College Testing Program, 1973.
DeWitt, L. J., & Weiss, D. J. A computer software system for adaptive ability measurement.
Research Report 74-1. Minneapolis: Psychometric Methods Program, Department of Psychol-
ogy, University of Minnesota, 1974.
Gorham, W. A. (Chair). Computers and testing: Steps toward the inevitable conquest. Professional
Series 76-1. Washington, D.C.: Research Section, Personnel Research and Development Center,
U.S. Civil Service Commission, 1976.
Green, B. F., Jr. Comments on tailored testing. In W. H. Holtzman (Ed.), Computer-assisted
instruction, testing, and guidance. New York: Harper and Row, 1970.
Killcross, M. C. A review of research in tailored testing. Report APRE No. 9/76. Farnborough,
Hants, England: Ministry of Defense, Army Personnel Research Establishment, 1976.
Koch, W. R., & Reckase, M. D. A live tailored testing comparison study of the one- and three-
parameter logistic models. Research Report 78-1. Columbia, Mo.: Tailored Testing Research
Laboratory, Educational Psychology Department, University of Missouri, 1978.
Lord, F. M. Some test theory for tailored testing. In W. H. Holtzman (Ed.), Computer-assisted
instruction, testing, and guidance. New York: Harper and Row, 1970.
Lord, F. M. Robbins-Monro procedures for tailored testing. Educational and Psychological Measurement,
1971, 31, 3-31.
McBride, J. R. Research on adaptive testing, 1973-1976: A review of the literature. Unpublished
report. Minneapolis: University of Minnesota, 1976.
McBride, J. R. An adaptive test of arithmetic reasoning. Paper prepared for the Nineteenth Annual
Conference of the Military Testing Association, San Antonio, Texas, 1977.
McBride, J. R., & Weiss, D. J. A word knowledge item pool for adaptive ability measurement.
Research Report 74-2. Minneapolis: Psychometric Methods Program, Department of Psychol-
ogy, University of Minnesota, 1974.
McBride, J. R., & Weiss, D. J. Some properties of a Bayesian adaptive ability testing strategy.
Research Report 76-1. Minneapolis: Psychometric Methods Program, Department of Psychol-
ogy, University of Minnesota, 1976.
Mussio, J. J. A modification to Lord's model for tailored tests. Unpublished doctoral dissertation,
University of Toronto, 1973.
Samejima, F. A comment on Birnbaum's three-parameter logistic model in the latent trait theory.
Psychometrika, 1973, 38, 221-233.
Samejima, F. A use of the information function in tailored testing. Applied Psychological Measure­
ment, 1977, 1, 233-247.
Urry, V. W. Computer assisted testing: The calibration and evaluation of the verbal ability bank.
Technical Study 74-3. Washington, D.C.: Research Section, Personnel Research and Develop-
ment Center, 1974.
Urry, V. W. Tailored testing: A successful application of latent trait theory. Journal of Educational
Measurement, 1977, 14, 181-186.
Vale, C. D., & Weiss, D. J. A simulation study of stradaptive ability testing. Research Report 75-6.
Minneapolis: Psychometric Methods Program, Department of Psychology, University of Min-
nesota, 1975.
Weiss, D. J. Strategies of adaptive ability measurement. Research Report 74-5. Minneapolis:
Psychometric Methods Program, Department of Psychology, University of Minnesota, 1974.
Weiss, D. J. (Ed.). Computerized adaptive trait measurement: Problems and prospects. Research
Report 75-5. Minneapolis: Psychometric Methods Program, Department of Psychology, Univer-
sity of Minnesota, 1975.
Weiss, D. J. (Ed.). Applications of computerized adaptive testing. Research Report 77-1. Min-
neapolis: Psychometric Methods Program, Department of Psychology, University of Minnesota,
1977.
Weiss, D. J., & Betz, N. E. Ability measurement: Conventional or adaptive? Research Report 73-1.
Minneapolis: Psychometric Methods Program, Department of Psychology, University of Min-
nesota, 1973.
11 Mastery Testing

11.1. INTRODUCTION

A primary purpose of mastery testing is to determine whether each examinee has


reached a certain required level of achievement. This chapter deals with the
problems of (1) designing a test; (2) scoring the test; (3) setting the cutting score,
so as to determine mastery or nonmastery for each examinee as accurately as
possible. The problem of how to evaluate a mastery test is necessarily covered as
part of this development.
It is instructive to see how these problems can be solved in a clear-cut and
compelling way, given the assumptions of item response theory. Thus, the line of
reasoning is presented here in some detail, for its inherent interest. The develop-
ment is based on Birnbaum's work (1968, Chapter 19).
When similar groups of examinees are tested year after year, the psychometri-
cian knows, in advance of testing, the approximate distribution of ability in the
group to be tested. In this case, a Bayesian approach is appropriate. Any testing
procedure may then be evaluated in terms of two numbers: (1) the proportion of
rejected masters (those misclassified as nonmasters) and (2) the proportion of
accepted nonmasters (those misclassified as masters). Other more complicated
evaluation statistics may also be used. The reader is referred to Birnbaum (1969),
Davis, Hickman, and Novick (1973), Huynh (1976), Subkoviak (1976), and
Swaminathan, Hambleton, and Algina (1975).
In many cases, on the other hand, a single mastery test is to be used by a
variety of institutions, each with a different distribution of ability. If the value of
the testing procedure to the various institutions depends on a variety of ability
distributions unknown to the psychometrician at the time he is developing the
test, how can he design a single test that will be appropriate for every institution?


What is needed is a way of evaluating the testing procedure that does not
depend on the unknown distributions of ability in the groups to be tested. The
approach in this chapter makes no use of these unknown ability distributions.
We gain flexibility and generality when we do not require knowledge of the
ability distribution in the group tested. At the same time, we necessarily pay a
price: An evaluation based on incomplete information cannot have every virtue
of an evaluation based on complete information.

11.2. DEFINITION OF MASTERY

We denote level of achievement by θ. It is common to think that there is some


cutting point θ0 that divides masters from nonmasters. In practice, when we try
to specify the numerical value of such a θo, we find it impossible to choose a value
that will satisfy everyone and perhaps impossible to choose a value that will
satisfy anyone (see Glass, 1978; van der Linden, 1978).
An alternative is to specify two values, θ2 and θ1. Let θ2 be some relatively
low level of mastery, preferably the lowest level satisfactory to all judges; let θ1
denote some relatively high level of nonmastery, preferably the highest level that
all judges can agree upon. We consider the problem of discriminating examinees
at θ2 from those at θ1. Since the judges cannot agree how individuals between θ2
and θ1 should be classified, we shall not be concerned with these individuals.
If some test that measures ability θ has already been built and administered, it
will be easier to specify mastery levels in terms of true score ξ on this test, rather
than in terms of the less familiar θ scale of ability. In this case the judges will
choose levels ξ_1 and ξ_2; the required values of θ_1 and θ_2 will then be found from
the familiar relationship ξ = Σ_i P_i(θ).

11.3. DECISION RULES

The practical use of any mastery test involves some rule for deciding which
examinees are to be classified as masters and which are not. In the case of a test
composed of n dichotomous items, each decision is necessarily based on the
examinee's n responses. For any one examinee, these responses are denoted by
the vector u = {u1, u2, . . . ,un}. Since the decision d depends on u, we may
write it as the function d ≡ d(u). The decision to classify an examinee as a
master will be denoted by d = 1 or "accept"; as a nonmaster, by d = 0 or
"reject."
The decision rule d may produce two kinds of errors:
d(u) = 1 (accept) when θ = θ1,
d(u) = 0 (reject) when θ = θ2.

Let α and β denote, respectively, the probabilities of these two kinds of errors, so
that
    α ≡ Prob[d(u) = 1 | θ_1],
    β ≡ Prob[d(u) = 0 | θ_2].    (11-1)

We can reduce the probability of error α to 0 by rejecting all examinees


regardless of their test performance; but this automatically increases the probabil­
ity of error β to 1. On the other hand, we could reduce β to 0, but only at the cost
of increasing α to 1.

11.4. SCORING THE TEST: THE LIKELIHOOD RATIO¹

Note that by the definitions already given,

    α = Σ_{d(u)=1} Prob(u|θ_1),
    β = Σ_{d(u)=0} Prob(u|θ_2).    (11-2)

The conditional probability of accepting an individual at θ_2 is

    1 − β = Σ_{d(u)=1} Prob(u|θ_2).

This can be written

    1 − β = Σ_{d(u)=1} λ(u) Prob(u|θ_1),    (11-3)

where λ(u) is the likelihood ratio

    λ(u) = Prob(u|θ_2) / Prob(u|θ_1).    (11-4)
This ratio compares the likelihood of the observed response vector u when the
examinee is at θ2 with the likelihood when he is at θ1. This is the ratio commonly
used to test the hypothesis θ = θ1 versus the hypothesis θ = θ2. Note that the
likelihood ratio for any u is known from item response theory [see Eq. (4-20) and
(11-12)] as soon as the item parameters and the form of the item response
function are specified.
The expected value of the likelihood ratio for given θ1, given the decision to
accept, is
¹The problem of how to score the test can be solved by application of the Neyman-Pearson
Theorem (R. V. Hogg & A. T. Craig, Introduction to mathematical statistics (3rd ed.). New York:
Macmillan, 1970, Chapter 9). An explicit proof is given here in preference to a simple citation of the
theorem.

    E[λ(u) | θ = θ_1, d = 1] = Σ_{d(u)=1} λ(u) Prob(u|θ_1) / Σ_{d(u)=1} Prob(u|θ_1).

From this, (11-3), and (11-2), we have

    1 − β = α E[λ(u) | θ = θ_1, d(u) = 1].    (11-5)

This is a basic equation for our purposes, since it specifies the essential relationship
between the decision rule and the error rates α and β.
Equation (11-5) specifies the restriction that prevents β from going to 0 when
α is fixed. To minimize β, for fixed α, we must find the decision rule d(u) that
maximizes the right side of (11-5).
For fixed α, maximizing (11-5) means maximizing E[λ(u)|θ = θ_1, d(u) = 1].
Now, the value of λ(u) depends only on u, not on d(u) or on θ. Thus we can
maximize E[λ(u)|θ = θ_1, d(u) = 1] by defining the acceptance rule d(u) = 1 so
as to accept only those response patterns u for which the values of λ(u) are
largest. This is the way to maximize the expectation of λ(u) when d(u) = 1.
We now have an order of preference for accepting the response patterns u.
How many patterns should we accept, starting with the most preferred? The
answer is given by the first equation in (11-2). Since α is fixed, we must continue
accepting response patterns u until

    Σ_{d(u)=1} Prob(u|θ_1) = α.

For given u, the probabilities on the left are known from Eq. (4-20).
The decision rule for finding the d(u) that minimizes β for any specified α
may thus be stated as follows:

1. List the λ(u) in order of magnitude.
2. Accept item response patterns u starting with the pattern with the largest
λ(u).
3. Continue down the list until the combined probability, given θ_1, of all
accepted patterns is equal to α.
4. If the combined probability does not exactly equal α, the last pattern must
be accepted only a fraction of the time, acceptance being at random and the
fraction being chosen so that Prob(d = 1|θ_1) = α.

To apply this decision rule for fixed α in practical testing, proceed as follows:

1. Score each answer sheet to obtain λ(u), the likelihood ratio (11-4) for the
pattern of responses given by the examinee [see Eq. (11-12)].
2. Accept each examinee whose score λ(u) is above some cutting score λ_{0α}
that depends on α, as specified in steps 2 and 3 above.
3. If λ(u) = λ_{0α} for some examinees, choose among these at random as in step
4 above.

Note that the examinee's score actually is the likelihood ratio for the responses on
his answer sheet. The optimal scoring procedure does not require that we know
the distribution of ability in the group tested. Simplified scoring methods are
considered in Section 11.8.
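
A sketch of this acceptance procedure for a tiny hypothetical test follows; it enumerates all response patterns, orders them by the likelihood ratio (11-4), and accumulates their probabilities at θ_1 until the specified α is reached (the randomization of step 4 at the boundary pattern is omitted). All item parameters and the value of α are illustrative only.

```python
import itertools
import numpy as np

D = 1.7
# Hypothetical three-item logistic test (c_i = 0) and the two reference ability levels.
a = np.array([1.0, 0.8, 1.2])
b = np.array([-0.5, 0.0, 0.5])
theta1, theta2 = -1.0, 1.0
alpha = 0.10            # allowed probability of accepting an examinee who is at theta1

def P(theta):
    return 1.0 / (1.0 + np.exp(-D * a * (theta - b)))

def prob_pattern(u, theta):
    """Prob(u | theta) for a response pattern u, as in Eq. (4-20)."""
    p = P(theta)
    u = np.asarray(u)
    return float(np.prod(np.where(u == 1, p, 1 - p)))

patterns = list(itertools.product([0, 1], repeat=len(a)))
lam = {u: prob_pattern(u, theta2) / prob_pattern(u, theta1) for u in patterns}   # Eq. (11-4)

# Accept patterns in decreasing order of lambda(u) until the cumulative probability
# at theta1 reaches alpha.
accepted, cum = [], 0.0
for u in sorted(patterns, key=lam.get, reverse=True):
    p1 = prob_pattern(u, theta1)
    if cum + p1 > alpha:
        break
    accepted.append(u)
    cum += p1

beta = 1.0 - sum(prob_pattern(u, theta2) for u in accepted)
print(accepted)          # accepted response patterns
print(cum, beta)         # realized alpha (at most .10) and the resulting beta
```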

11.5. LOSSES

Suppose that A' ≥ 0 is the loss of (erroneously) accepting an individual at θ1 and


B' ≥ 0 is the loss of rejecting an individual at θ2. If there are N1 examinees at θ1
and N2 examinees at θ2, the situation is as illustrated in the diagram, which
shows the frequencies in the relevant 2 x 2 table.

              θ_1            θ_2
    d = 1     N_1 α          N_2 (1 − β)
    d = 0     N_1 (1 − α)    N_2 β
    Total     N_1            N_2

The expected loss, to be denoted by C, is N_1 α A' + N_2 β B'. Van Ryzin and
Susarla (1977) give a practicable empirical Bayes procedure that, in effect,
estimates N_1 and N_2 while minimizing C for fixed A' and B'. See also Snijders
(1977). We shall not consider such procedures here.
Define A = N_1 A' and B = N_2 B'; then the expected loss is

    C = Aα + Bβ.    (11-6)
Suppose, first, that we are given A and B. We shall find a decision rule that
minimizes C.

11.6. CUTTING SCORE FOR THE LIKELIHOOD RATIO

If α is given, Section 11.4 provides the solution to the decision problem posed.
The present section deals with the case where α is not known but A and B are
specified instead.
For any given α, it is obvious from (11-6) that the expected loss C is
minimized by making β as small as possible. Thus here we again want to use the
likelihood ratio λ(u) as the examinee's score, accepting only the highest scoring
examinees. In Section 11.4, we determined the cutting score so as to satisfy the
first equation in (11-2) for given α. Here the cutting score that minimizes C, to
be denoted now simply by λ°, will be shown to have a simple relation to A and B.
Let r = 1, 2, . . . , R index all numerically different scores λ(u), arranging
them in order so that λ_R > λ_{R−1} > . . . > λ_1. Let r* denote the lowest score to
be "accepted" under the decision rule. Consider first the case where it is unnecessary
to assign any examinees at random. The expected loss can then be written

    C_{r*} = A Σ_{r≥r*} Prob(λ_r|θ_1) + B Σ_{r<r*} Prob(λ_r|θ_2).    (11-7)
We wish to choose α to minimize C. This is the same as choosing r* to
minimize C. Since C ≥ 0, and since each summation in (11-7) lies between 0
and 1, C must have a minimum, to be denoted by C°, on 0 ≤ α ≤ 1, or
(equivalently) on 1 ≤ r* ≤ R. Denote by r° the value of r* that minimizes C_{r*}.
If C° is a minimum, we must have C_{r°+1} − C° ≥ 0 and C_{r°−1} − C° ≥ 0. (If α
= 0 or α = 1, only one of these inequalities is required; for simplicity, we ignore
this trivial case.) Substituting into these inequalities from (11-7), we find

    C_{r°+1} − C° = −A Prob(λ_{r°}|θ_1) + B Prob(λ_{r°}|θ_2) ≥ 0,
    C_{r°−1} − C° = A Prob(λ_{r°−1}|θ_1) − B Prob(λ_{r°−1}|θ_2) ≥ 0.

Since A and B are positive, these can be rewritten

    Prob(λ_{r°}|θ_2) / Prob(λ_{r°}|θ_1) ≥ A/B ≥ Prob(λ_{r°−1}|θ_2) / Prob(λ_{r°−1}|θ_1).    (11-8)
If there is only one pattern of scores that has the likelihood ratio λ_{r°} ≡ λ_{r°}(u),
we can denote this pattern by u_{r°}. In this case, Prob(λ_{r°}|θ) = Prob(u_{r°}|θ) and the
left side of (11-8) is the likelihood ratio λ_{r°}. If there is only one pattern of scores
that has the likelihood ratio λ_{r°−1}, the right side of (11-8), similarly, is the
likelihood ratio λ_{r°−1}. Thus (11-8) becomes

    λ_{r°} ≥ A/B ≥ λ_{r°−1}.    (11-9)
This same result is easily found also for the case where there may be several u
with the same λ(u). The conclusion is that expected loss is minimized by choosing
the cutting score to be

    λ° = A/B.    (11-10)
This conclusion was reached for the special case where all examinees with
score λ_{r°} can be accepted (none have to be assigned at random). We now remove
this limitation by showing the following: If any examinees have scores exactly
equal to the cutting score λ° = A/B, there will be no difference in expected loss
however these examinees are assigned.
Consider examinees whose score pattern u° is such that

    λ(u°) = Prob(u°|θ_2) / Prob(u°|θ_1) = A/B.

If the examinee is accepted, his contribution to the expected loss C will be
A Prob[λ(u°)|θ_1]; if he is rejected, his contribution to the expected loss will be
B Prob[λ(u°)|θ_2]. We now show that these two contributions are equal:

    B Prob[λ(u°)|θ_2] = B {Prob[λ(u°)|θ_2] / Prob[λ(u°)|θ_1]} Prob[λ(u°)|θ_1]
                      = B (A/B) Prob[λ(u°)|θ_1]
                      = A Prob[λ(u°)|θ_1].    (11-11)
The fraction on the first right-hand side is evaluated by noting that it is the
likelihood ratio for the score pattern u° whose likelihood ratio is A/B. Equation
(11-11) shows that when the examinee's score is λ(u°) = A/B, the expected loss
C will be the same no matter how the examinee is assigned.
In summary, to minimize expected loss:

1. The score assigned to each examinee is the likelihood ratio for his response
pattern u.
2. Accept the examinee if his score exceeds the ratio of the two costs, A/B;
reject him if his score is less than A/B.
3. If his score equals A/B, either decision is optimal.

Theorem 11.6.1. The expected loss C is minimized by the decision rule d(u) if
and only if

    d(u) = accept when the examinee's score λ(u) > A/B,
           reject when the examinee's score λ(u) < A/B.

11.7. ADMISSIBLE DECISION RULES

Usually we do not know the losses A', B', or the weights A and B. In such cases,
we can fall back on the fact that any decision rule d(u) obtained from Theorem
11.6.1 for any A and any B is an admissible decision rule. In the present context,
this means that no other rule d*(u) can have smaller error probability α unless it
also has a larger β, nor can it have a smaller β unless it also has a larger α.
To prove that d(u) of Theorem 11.6.1 is admissible, suppose to the contrary
that α* = Prob(d* = 1|θ_1) < α and that at the same time β* ≡ Prob(d* = 0|θ_2)
≤ β. It would follow that C* = Aα* + Bβ* < Aα + Bβ = C. But this
contradicts the theorem, which states that d minimizes C; hence the supposition
must be false. Any decision rule defined by Theorem 11.6.1 is an admissible
rule; there is no other rule that is better both at θ_1 and at θ_2.

If we do not know A and B, we may have to make a somewhat arbitrary


decision as to where the cutting score λ0 should be placed. This is not a new
difficulty—we are accustomed to making arbitrary decisions as to the dividing
line between mastery and nonmastery. Note, however, that we have gained
something important: We have found the optimal method of scoring the test. The
optimal scoring method depends on θ1 and θ2 but does not depend on knowing
α, β, A, or B.

11.8. WEIGHTED SUM OF ITEM SCORES

It will make no difference if we use the logarithm of the likelihood ratio as the
examinee's score instead of the ratio itself. From Eq. (4-20), the likelihood ratio
for response pattern u is

    λ(u) = Π_{i=1}^n [P_i(θ_2)/P_i(θ_1)]^{u_i} [Q_i(θ_2)/Q_i(θ_1)]^{1−u_i}.    (11-12)

The logarithm of this is

    ln λ(u) = Σ_{i=1}^n u_i [ln (P_i(θ_2)/Q_i(θ_2)) − ln (P_i(θ_1)/Q_i(θ_1))] + K,    (11-13)

where

    K = Σ_i ln [Q_i(θ_2)/Q_i(θ_1)].    (11-14)

Thus, the examinee's score y (say) may be taken to be a weighted sum of item
scores:

    y = y(u) = Σ_{i=1}^n w_i(θ_1, θ_2) u_i,    (11-15)

where the scoring weights are given by

    w_i(θ_1, θ_2) = ln [P_i(θ_2)/Q_i(θ_2)] − ln [P_i(θ_1)/Q_i(θ_1)].    (11-16)

When θ_1, θ_2, the item parameters, and the form of the item response function are
known, K is a known constant and so are the item-scoring weights w_i(θ_1, θ_2),
which appear in brackets in (11-13).
By Theorem 11.6.1, all admissible decision rules now have the form

    accept if score y > y_0,
    reject if score y < y_0,

where

    y_0 = ln A − ln B − K.    (11-17)

It does not matter what decision is made if y = y0.


Note that these conclusions hold regardless of the form of the item response
function. The form must be known in order to compute the scoring weights
w_i(θ_1, θ_2), however. If the item response function is logistic with all c_i = 0, the
optimal scoring weights are not only independent of α, β, A, and B but also are
independent of θ_1 and θ_2. (The reader may prove this as an exercise.)
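
A small numerical sketch of Eqs. (11-12) through (11-17) follows, using hypothetical three-parameter logistic items, hypothetical ability levels θ_1 and θ_2, and hypothetical losses A and B.

```python
import numpy as np

D = 1.7
# Hypothetical 3PL item parameters and the two reference ability levels theta1 < theta2.
a = np.array([1.0, 0.8, 1.2, 0.9])
b = np.array([-0.5, 0.0, 0.3, 0.8])
c = np.array([0.2, 0.2, 0.2, 0.2])
theta1, theta2 = -0.5, 0.5
A_loss, B_loss = 2.0, 1.0            # hypothetical losses A and B

def P(theta):
    return c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))

P1, P2 = P(theta1), P(theta2)
Q1, Q2 = 1 - P1, 1 - P2

w = np.log(P2 / Q2) - np.log(P1 / Q1)     # scoring weights, Eq. (11-16)
K = np.sum(np.log(Q2 / Q1))               # the constant K of Eq. (11-14)
y0 = np.log(A_loss) - np.log(B_loss) - K  # cutting score, Eq. (11-17)

u = np.array([1, 0, 1, 1])                # one examinee's response pattern
y = float(np.dot(w, u))                   # weighted score, Eq. (11-15)

# The weighted score reproduces the log likelihood ratio of Eq. (11-12):
lam = np.prod((P2 / P1) ** u * (Q2 / Q1) ** (1 - u))
print(y + K, np.log(lam))                 # these two numbers agree, by Eq. (11-13)
print("accept" if y > y0 else "reject")
```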

11.9. LOCALLY BEST SCORING WEIGHTS

In general, the item-scoring weights w_i(θ_1, θ_2) depend on the choice of θ_1 and θ_2.
If we divide all w_i(θ_1, θ_2) by θ_2 − θ_1 and make a corresponding change in the
cutting score y_0, this does not change any decision. Let us relabel θ_1 as θ_0 and
define the locally best scoring weight as the limit of w_i(θ_0, θ_2)/(θ_2 − θ_0) when θ_2 → θ_0:

    w_{0i} ≡ w_i(θ_0) ≡ lim_{θ_2→θ_0} [w_i(θ_0, θ_2) / (θ_2 − θ_0)].    (11-18)

By the definition of a derivative,

    w_{0i} = w_i(θ_0) = lim_{θ_2→θ_0} {ln [P_i(θ_2)/Q_i(θ_2)] − ln [P_i(θ_0)/Q_i(θ_0)]} / (θ_2 − θ_0)
           = (d/dθ_0) ln [P_i(θ_0)/Q_i(θ_0)] = P_i'(θ_0) / [P_i(θ_0) Q_i(θ_0)].    (11-19)
If we wish to discriminate near some ability level θo, then the "locally best"
item-scoring weights are given by (11-19).
The scoring weights obtained here are similar to those that maximized the
information function in Section 5.6. In the context of Section 5.6, the scoring
weights vary from individual to individual, depending on his unknown ability
level; this is a serious handicap to practical use there. Here, on the contrary, the
scoring weights are the same for everyone. Here, we determine from outside
considerations the ability level θo that we consider to be the dividing line be­
tween mastery and nonmastery; this ability level θo determines the locally best
scoring weights. These are the weights to be used for all examinees, regardless
of ability level.

11.10. CUTTING POINT FOR LOCALLY BEST SCORES

The locally best weighted sum Y of item scores is seen from (11-19) to be

    Y = Σ_{i=1}^n w_i(θ_0) u_i = Σ_{i=1}^n P_i'(θ_0) u_i / [P_i(θ_0) Q_i(θ_0)].    (11-20)

If the examinee's score is Y, what is the appropriate cutting score Y_0?
The locally best score Y (11-20) is obtained from the optimally weighted sum
y (11-15) by the relation

    Y = lim_{θ_2→θ_1} [y / (θ_2 − θ_1)] |_{θ_0}.    (11-21)

The cutting score Y_0 is found from (11-21) and (11-17) to be

    Y_0 = lim_{θ_2→θ_1} [(ln A − ln B − K) / (θ_2 − θ_1)] |_{θ_0}.

If A ≠ B, Y_0 becomes positively (negatively) infinite. This means that all
examinees will be accepted (rejected). This case is not of practical use. If A = B,
however, we can obtain a useful result:

    Y_0 = lim_{θ_2→θ_1} [−K / (θ_2 − θ_1)] |_{θ_0}
        = lim_{θ_2→θ_1} Σ_i [−ln Q_i(θ_2) + ln Q_i(θ_0)] / (θ_2 − θ_0)
        = −(d/dθ_0) Σ_{i=1}^n ln Q_i(θ_0)
        = Σ_{i=1}^n P_i'(θ_0) / Q_i(θ_0).    (11-22)

If the cost of accepting nonmasters who are just below θ_0 is equal to the cost
of rejecting masters who are just above θ_0, then we can use Y_0 in (11-22) as the
cutting score against which each person's score Y is compared.
If an examinee's score Y is exactly at the cutting score Y_0, we have from
(11-20) and (11-22)

    Y = Σ_{i=1}^n P_i'(θ_0) u_i / [P_i(θ_0) Q_i(θ_0)] = Σ_{i=1}^n P_i'(θ_0) / Q_i(θ_0).    (11-23)

Comparing this with Eq. (5-19), we see that this is the same as the likelihood
equation for estimating the examinee's ability from his responses u. This means
that if a person's score Y is at the cutting point, then the maximum likelihood
estimate of his ability is exactly θ_0, the ability level that divides masters from
nonmasters. This result further clarifies the choice of Y_0 as a cutting score.
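
A sketch of the locally best weights and cutting score, Eqs. (11-19), (11-20), and (11-22), for hypothetical three-parameter logistic items with an assumed mastery level θ_0:

```python
import numpy as np

D = 1.7
a = np.array([1.0, 0.8, 1.2, 0.9])      # hypothetical item discriminations
b = np.array([-0.3, 0.0, 0.2, 0.5])     # hypothetical item difficulties
c = np.array([0.2, 0.2, 0.2, 0.2])      # hypothetical lower asymptotes
theta0 = 0.0                            # assumed ability level dividing masters from nonmasters

P = c + (1 - c) / (1 + np.exp(-D * a * (theta0 - b)))
Q = 1 - P
P_prime = D * a * (P - c) * Q / (1 - c) # derivative of the 3PL response function at theta0

w0 = P_prime / (P * Q)                  # locally best scoring weights, Eq. (11-19)
Y0 = np.sum(P_prime / Q)                # cutting score when A = B, Eq. (11-22)

u = np.array([1, 1, 0, 1])              # one examinee's item responses
Y = float(np.dot(w0, u))                # locally best weighted score, Eq. (11-20)
print(w0, Y0, Y, "accept" if Y > Y0 else "reject")
```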

11.11. EVALUATING A MASTERY TEST

If we are concerned with two specified ability levels θ1 and θ2, as in Sections
11.2-11.8, we obviously should evaluate the mastery test in terms of the ex­
pected loss (11-6). For this, we need to know A, B, α, and β.

Given θ_1, θ_2, the item parameters, the form of the item response function, and
the cutting score y_0, we can determine misclassification probabilities α and β
from the frequency distribution of the weighted sum of item scores
y = Σ_i w_i(θ_1, θ_2) u_i.
The required frequency distribution f_y = f_y(y) for any given θ is provided by the
generating function

    Σ_y f_y t^y = Π_{i=1}^n [Q_i(θ) + P_i(θ) t^{w_i(θ_1,θ_2)}]    (11-24)

[compare Eq. (4-1)]. In other words, the frequency f_y of any score y appears as
the coefficient of t^y on the right side of (11-24) after expansion. To obtain α (or
β), (11-24) must be evaluated at θ = θ_1 (or θ = θ_2) and the frequencies cumulated
as required by (11-2) and by Theorem 11.6.1. For example, if n = 2,
w_1(θ_1, θ_2) = .9, w_2(θ_1, θ_2) = 1.2, then (11-24) becomes

    Q_1 Q_2 + P_1 Q_2 t^{.9} + Q_1 P_2 t^{1.2} + P_1 P_2 t^{2.1}.

Thus f_y(0) = Q_1 Q_2, f_y(.9) = P_1 Q_2, f_y(1.2) = Q_1 P_2, f_y(2.1) = P_1 P_2.
If A and B are known, the expected loss can be computed from (11-6). If A
and B are not known a priori, but some cutting score λ° has somehow been
chosen, the ratio of A to B can be found from the equation λ° = A/B. Together
with α and β, this ratio is all that is needed to determine the relative effectiveness
of different mastery tests.
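
The expansion of the generating function (11-24) amounts to convolving the score distribution item by item. The sketch below does this for hypothetical logistic items with c_i = 0 and an arbitrarily chosen cutting score; the items, ability levels, and cutting score are not from the text.

```python
import numpy as np
from collections import defaultdict

D = 1.7
a = np.array([1.0, 0.8, 1.2])
b = np.array([-0.3, 0.2, 0.6])
theta1, theta2 = -0.5, 0.5          # hypothetical nonmastery and mastery levels

def P(theta):                        # logistic items with c_i = 0, for simplicity
    return 1.0 / (1.0 + np.exp(-D * a * (theta - b)))

P1, P2 = P(theta1), P(theta2)
w = np.log(P2 / (1 - P2)) - np.log(P1 / (1 - P1))   # scoring weights, Eq. (11-16)

def score_distribution(Pvec):
    """Expand the generating function (11-24): returns {y: f_y(y)} at the given theta."""
    dist = {0.0: 1.0}
    for Pi, wi in zip(Pvec, w):
        new = defaultdict(float)
        for y, f in dist.items():
            new[round(y, 6)] += f * (1 - Pi)          # item answered wrong: y unchanged
            new[round(y + wi, 6)] += f * Pi           # item answered right: y increases by w_i
        dist = dict(new)
    return dist

y0 = 0.5 * float(np.sum(w))   # an arbitrary illustrative cutting score; in practice y0 = ln A - ln B - K
alpha = sum(f for y, f in score_distribution(P1).items() if y > y0)
beta = sum(f for y, f in score_distribution(P2).items() if y <= y0)
print(alpha, beta)
```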
Although the expected loss can be determined as just described, the procedure
is not simple and the formulas are not suitable for needed further mathematical
derivations. In view of this, we shall often evaluate the effectiveness of a mastery
test by the test information I{θ} at the ability level θ = θo that separates mastery
from nonmastery. The test information at ability θo is
n P't2 |
I, {θ} = (11-25)
i=1
PiQi e = θ0

As discussed in Section 6.3, the value of the information function (11-25)


depends on the choice of metric for measuring ability. As in Section 6.4, how­
ever, the relative effectiveness of two mastery tests, as measured by (11-25),
will not be affected by the choice of metric.

11.12. OPTIMAL ITEM DIFFICULTY

What values of bi (i = 1, 2 , . . . , n) will maximize Io{θ}? We note in Section


5.4 that the contribution of each item to I{θ} is independent of the contribution
of every other item. Thus, I{θ} will be maximized by separately maximizing the

contribution of each item. This contribution is by definition the item information


function, I{θ, ui}.
Many item response functions, including all those considered in Chapter 2,
depend on θ and on b_i only through their difference θ − b_i. For such item
response functions, the value of bi that maximizes I{θ, ui} can be found
directly or from the formula for the ability level θi that maximizes I{θ, ui}.
Some such formulas for θi are given in Section 10.2. We can replace θi by θ0 in
these and solve for bi.
For the three-parameter logistic function we find from Eq. (10-4), for example,
that the item difficulty b_i that maximizes the item information function at θ = θ_0 is

    b_i = θ_0 − (1 / (D a_i)) ln [(1 + √(1 + 8c_i)) / 2].    (11-26)
If all items have the same ai and ci, then the optimal mastery test will have all
items of equal difficulty. Otherwise, the optimal item difficulties bi will not all
be the same.

11.13. TEST LENGTH


How long does a mastery test need to be? If we are going to evaluate our test
in terms of the error rates α and β, a rough answer to this question can be given by
using a normal approximation to the distribution of the weighted item score y.
This approach is used by Birnbaum (1968, Section 19.5).
If we are going to use the test information at θo to evaluate our test, it is
natural to think in terms of the length of the asymptotic confidence interval for
estimating θ when θ = θo. Suppose our unit of measurement is chosen (see
Section 3.5) in some convenient way—for example, so that the standard devia­
tion of θ is 1.0 for some convenient group. Then we can perhaps decide what
length confidence interval will be adequate for our purposes.
Alternatively, we may prefer to deal with the more familiar scale of number-right
true score ξ. In this case, we can decide on an adequate confidence interval
for estimating ξ. We can then transform the endpoints of this interval to the θ scale
by the relation ξ = Σ_i P_i(θ) and thus find approximately the length of confidence
interval needed.
The required information I_0{θ} is then (see Section 5.1) the squared reciprocal
of this length multiplied by [2Φ^{-1}{(1 − γ)/2}]², where γ is the confidence
level and Φ^{-1}{ } is the inverse of the cumulative normal distribution function:

    I_0{θ} = [2Φ^{-1}{(1 − γ)/2} / (length of confidence interval)]².    (11-27)

For example, if the confidence level is .95, Φ^{-1}{(1 − γ)/2} is the familiar
quantity −1.96 that cuts off .025 of the normal curve frequency.

For the three-parameter logistic case, if all items have the same a_i = a and c_i = c
and if the b_i are optimal (11-26), then the required number n_0 of items is
found by dividing the chosen value of I_0{θ} by the maximum item information
M_i given by Eq. (10-6). This gives

    n_0 = 8(1 − c)² I_0{θ} / {D² a² [1 − 20c − 8c² + (1 + 8c)^{3/2}]}.    (11-28)

If c = 0, the required test length is

    n_0 = 1.384 I_0{θ} / a².    (11-29)
The number of items required is inversely proportional to the square of the
item discriminating power a. If a certain mastery test requires 100 items with
c = 0, we find from (11-28) that it will require 138 items with c = .167, 147
items with c = .20, 162 items with c = .25, 191 items with c = .333, or 277
items with c = .50.
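
A sketch of the test-length calculation of Eqs. (11-27) through (11-29), under an assumed confidence level and interval length and hypothetical values of a and c:

```python
import numpy as np
from scipy.stats import norm

D = 1.7

def required_information(interval_length, confidence=0.95):
    """Eq. (11-27): information needed for a confidence interval of the given length on theta."""
    z = norm.ppf((1 - confidence) / 2)          # e.g. -1.96 for a .95 interval
    return (2 * z / interval_length) ** 2

def required_items(info_needed, a, c):
    """Eq. (11-28): number of equal-a, equal-c items at optimal difficulty giving info_needed."""
    max_item_info = (D ** 2 * a ** 2 / (8 * (1 - c) ** 2)) * \
                    (1 - 20 * c - 8 * c ** 2 + (1 + 8 * c) ** 1.5)   # Eq. (10-6)
    return info_needed / max_item_info

I0 = required_information(interval_length=0.5)   # assumed: a .95 interval of length 0.5 on theta
print(I0)                                        # about 61.5
for c in (0.0, 0.2, 0.25, 0.5):
    print(c, required_items(I0, a=1.0, c=c))     # the ratios match the counts quoted above
```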

11.14. S U M M A R Y OF MASTERY TEST DESIGN

According to the approach suggested here, the design of a mastery test for a
unidimensional skill could proceed somewhat as follows.

1. Obtain a pool of items for measuring the skill of interest.


2. Calibrate the items on some convenient group by determining the parame­
ters ai, bi, ci for each item.
3. Consider the entire item pool as a single test; determine what true-score
level ξ_0, or levels ξ_1 and ξ_2, will be used to define mastery. This decision is a
matter of judgment for the subject-matter specialist.
4. Using the item parameters obtained in step 2, find θ_0 (or θ_1 and θ_2) from
ξ_0 (or from ξ_1 and ξ_2) by means of the relation ξ = Σ_i P_i(θ).

If a single cutting point θ_0 is used:

5. Compute P_i(θ_0) for each item.
6. Evaluate I{θ, u_i} = P_i'²/(P_i Q_i) at θ_0 for each item.
7. Decide what length confidence interval for θ will be adequate at θ_0. Find
the required I_0{θ} from (11-27).
8. Select items with the highest I{θ, u_i} at θ_0. Continue selecting until the
sum Σ_i I{θ_0, u_i} equals the required I_0{θ}.
9. Compute scoring weights w_{0i} = P_i'/(P_i Q_i) |_{θ=θ_0} for each selected item.
10. For each examinee, compute the weighted sum of item scores Y =
Σ_i w_{0i} u_i. (In practice, an unweighted score may be adequate.)

11. Compute the cutting score Y_0 = Σ_{i=1}^n P_i'(θ_0)/Q_i(θ_0).
12. Accept each examinee whose score Y exceeds Y_0; reject each examinee
whose score is less than Y_0.

The foregoing procedure is appropriate if erroneous acceptance and erroneous


rejection of examinees are about equally important for examinees near θ0. If this
is not the case and if the relative importance of such errors can be quantified by
some ratio A/B, then steps 5-12 should be replaced by

5. Compute Pᵢ(θ₁) and Pᵢ(θ₂) for each item.
6. Compute the item-scoring weights (see the sketch after this list)

wᵢ(θ₁, θ₂) = ln [Pᵢ(θ₂)Qᵢ(θ₁) / Pᵢ(θ₁)Qᵢ(θ₂)].

7. Select the available items with the highest scoring weights.
8. For each examinee, compute the weighted sum of item scores y = Σᵢ wᵢ(θ₁, θ₂)uᵢ.
9. Compute the cutting score y₀ = ln A − ln B − K.
10. Accept each examinee whose score y exceeds y₀; reject each examinee whose score is less than y₀.
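The following minimal sketch (Python, three-parameter logistic model with D = 1.7, hypothetical item parameters) illustrates steps 6, 8, 9, and 10. Taking K = Σᵢ ln[Qᵢ(θ₂)/Qᵢ(θ₁)] is an assumption consistent with a likelihood-ratio decision rule; Eq. (11-17) in the text is the authoritative definition.

import math

def logistic_p(theta, a, b, c, D=1.7):
    # Three-parameter logistic item response function.
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def mastery_weights(items, theta1, theta2):
    # Step 6: optimal (log likelihood ratio) scoring weights.
    weights = []
    for a, b, c in items:
        p1, p2 = logistic_p(theta1, a, b, c), logistic_p(theta2, a, b, c)
        weights.append(math.log(p2 * (1.0 - p1) / (p1 * (1.0 - p2))))
    return weights

def cutting_score(items, theta1, theta2, A, B):
    # Step 9: y0 = ln A - ln B - K, with K taken here (an assumption) as
    # the sum over items of ln[Q_i(theta2)/Q_i(theta1)].
    K = sum(math.log((1.0 - logistic_p(theta2, a, b, c)) /
                     (1.0 - logistic_p(theta1, a, b, c))) for a, b, c in items)
    return math.log(A) - math.log(B) - K

items = [(1.0, -0.5, 0.2), (0.8, 0.0, 0.2), (1.2, 0.5, 0.2)]   # hypothetical (a, b, c)
w = mastery_weights(items, theta1=-1.0, theta2=1.0)
y0 = cutting_score(items, -1.0, 1.0, A=1.0, B=1.0)
u = [1, 1, 0]                                                  # one examinee's responses
y = sum(wi * ui for wi, ui in zip(w, u))
print("accept" if y > y0 else "reject")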

11.15. EXERCISES

11-1 Suppose a mastery test consists of n = 5 items exactly like item 2 in test 1 and also that θ₁ = −1, θ₂ = +1. What is α if we accept only examinees who score x = 5? (Use Table 4.17.1.) What is β? What if we accept x ≥ 4? x ≥ 3? x ≥ 2? If A = B = 1, what is the expected loss C of accepting x = 5? x ≥ 4? x ≥ 3? x ≥ 2?
11-2 What is the score (likelihood ratio) of an examinee who gets all five items right (x = 5) in Exercise 11-1? What is his score if x = 4? 3? 2? 1? 0? Which examinees will be accepted if A = 5, B = 2?
11-3 Suppose test 1 (see Table 4.17.1) is used as a mastery test with θ₁ = −1, θ₂ = +1. What is the score (likelihood ratio) of an examinee with u = {1, 0, 0}? {0, 1, 0}? {0, 0, 1}? Do these scores arrange these examinees in the order that you would expect? Explain the reason for the ordering obtained.
11-4 What is the optimal scoring weight (11-16) for each of the three items in
the test in Exercise 11-3 (be sure to use natural logarithms)? Why does
item 3 get the least weight?
11-5 What is the optimally weighted sum of item scores (11-15) for an exam­
inee in Exercise 11-3 with response pattern u = {1, 1, 0}? {1, 0, 1}? {0,
1, 1}? If A = B, what is the cutting score (11-17) for these optimally
weighted sums of item scores?

11-6 What are the locally best scoring weights (11-19) for each of the three
items in the test in Exercise 11-3 when θ0 = 0? Compare with the weights
found in Exercise 11-4. Are the differences important?
11-7 What is the locally best weighted sum of item scores (11-20) for an
examinee in Exercise 11-3 with response pattern u = {1, 1, 0}? {1,0,
1}? {0, 1, 1}? If A = B, what is the locally best cutting score (11-22) for
these scores? Compare with the results of Exercise 11-5 and comment on
the comparison.

REFERENCES

Birnbaum, A. Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord
and M. R. Novick, Statistical theories of mental test scores. Reading, Mass.: Addison-Wesley,
1968.
Birnbaum, A. Statistical theory for logistic mental test models with a prior distribution of ability.
Journal of Mathematical Psychology, 1969, 6, 258-276.
Davis, C. E., Hickman, J., & Novick, M. R. A primer on decision analysis for individually
prescribed instruction. Technical Bulletin No. 17. Iowa City, Ia.: Research and Development
Division, The American College Testing Program, 1973.
Glass, G. V. Standards and criteria. Journal of Educational Measurement, 1978, 15, 237-261.
Huynh, H. On the reliability of decisions in domain-referenced testing. Journal of Educational
Measurement, 1976, 13, 253-264.
Snijders, T. Complete class theorems for the simplest empirical Bayes decision problems. The
Annals of Statistics, 1977, 5, 164-171.
Subkoviak, M. J. Estimating reliability from a single administration of a criterion-referenced test.
Journal of Educational Measurement, 1976, 13, 265-276.
Swaminathan, H., Hambleton, R. K., & Algina, J. A Bayesian decision-theoretic procedure for use
with criterion-referenced tests. Journal of Educational Measurement, 1975, 12, 87-98.
van der Linden, W. J. Forgetting, guessing, and mastery: The Macready and Dayton models revisited
and compared with a latent trait approach. Journal of Educational Statistics, 1978, 3, 305-317.
van Ryzin, J., & Susarla, V. On the empirical Bayes approach to multiple decision problems. The
Annals of Statistics, 1977, 5, 172-181.
III PRACTICAL PROBLEMS AND
FURTHER APPLICATIONS
12 Estimating Ability and
Item Parameters

12.1. MAXIMUM LIKELIHOOD

In its simplest form, the parameter estimation problem is the following. We are
given a matrix U = ||uᵢₐ|| consisting of the responses (uᵢₐ = 0 or 1) of each of N
examinees to each of n items. We assume that these responses arise from a
certain model such as Eq. (2-1) or (2-2). We need to infer the parameters of the
model: ai, bi, ci (i = 1, 2, . . . , n) and θa (a = 1 , 2 , . . . , N).
As noted in Section 4.10 and illustrated for one θ in Fig. 4 . 9 . 1 , the maximum
likelihood estimates are the parameter values that maximize the likelihood L ( U | θ ;
a, b , c) given the observations U. Maximum likelihood estimates are usually
found from the roots of the likelihood equations (4-30), which set the derivatives
of the log likelihood equal to zero. The likelihood equations (4-30) are

Σᵢ₌₁ⁿ [(uᵢₐ − Pᵢₐ)/(PᵢₐQᵢₐ)] ∂Pᵢₐ/∂θₐ = 0   (a = 1, 2, . . . , N),   (12-1a)

Σₐ₌₁ᴺ [(uᵢₐ − Pᵢₐ)/(PᵢₐQᵢₐ)] ∂Pᵢₐ/∂aᵢ = 0   (i = 1, 2, . . . , n),

Σₐ₌₁ᴺ [(uᵢₐ − Pᵢₐ)/(PᵢₐQᵢₐ)] ∂Pᵢₐ/∂bᵢ = 0   (i = 1, 2, . . . , n),   (12-1b)

Σₐ₌₁ᴺ [(uᵢₐ − Pᵢₐ)/(PᵢₐQᵢₐ)] ∂Pᵢₐ/∂cᵢ = 0   (i = 1, 2, . . . , n).
For the three-parameter logistic model,


∂Pᵢₐ/∂θₐ = DaᵢQᵢₐ(Pᵢₐ − cᵢ)/(1 − cᵢ),

∂Pᵢₐ/∂aᵢ = D(θₐ − bᵢ)Qᵢₐ(Pᵢₐ − cᵢ)/(1 − cᵢ),   (12-2)

∂Pᵢₐ/∂bᵢ = −DaᵢQᵢₐ(Pᵢₐ − cᵢ)/(1 − cᵢ),

∂Pᵢₐ/∂cᵢ = Qᵢₐ/(1 − cᵢ).

A similar set of likelihood equations can be obtained in the same way for the
three-parameter normal ogive model.
These formulas are given here to show their particular character. The reader need not be concerned with the details. The important characteristic of (12-1a) is that when the item parameters are known, the ability estimate θ̂ₐ for examinee a is found from just one equation out of the N equations (12-1a). The estimate θ̂ₐ does not depend on the other θ̂. When the examinee parameters are known, the three parameters for item i are estimated by solving just three equations out of (12-1b). The estimates for item i do not depend on the parameters of the other items.
This suggests an iterative procedure in which we treat the trial values of θₐ (a = 1, 2, . . . , N) as known while solving (12-1b) for the estimates âᵢ, b̂ᵢ, ĉᵢ (i = 1, 2, . . . , n); we then treat all item parameters (i = 1, 2, . . . , n) as known while solving (12-1a) for new trial values θ̂ₐ (a = 1, 2, . . . , N). This is repeated until the numerical values converge. Because of the independence within each set of parameter estimates when the other set is fixed, this procedure is simpler and quicker than solving for all parameters at once.

12.2. ITERATIVE NUMERICAL PROCEDURES

The likelihood equations (12-1) are of the form Sᵣ ≡ ∂ ln L/∂xᵣ = 0, where xᵣ is an arbitrary parameter (r = 1, 2, . . . , R). The following modification of the standard Newton-Raphson iterative method for solving such equations is very effective in statistical work (Kale, 1962).
Let Iqr ≡ E SqSᵣ (q, r = 1, 2, . . . , R), and let a superscript zero distinguish functions of xᵣ (r = 1, 2, . . . , R) evaluated at trial values xᵣ⁰. Solve for Δ⁰ the linear equations ‖Iqr⁰‖Δ⁰ = S⁰, where S⁰ is the vector of the Sᵣ⁰ and ‖Iqr⁰‖ is the matrix of the Iqr⁰. Then x¹ = x⁰ + Δ⁰ is a vector of improved estimates of the true parameter vector x.
When the item parameters are fixed (treated as known), the parameter vector x is simply {θ₁, θ₂, . . . , θN} and the information matrix ‖Iqr‖ is found to be a diagonal matrix. By Eq. (5-4), the diagonal elements of ‖Iqr‖ are the reciprocals of Var(θ̂|θ). Thus by Eq. (5-5),

Iᵣᵣ = Σᵢ₌₁ⁿ P′ᵢₐ²/(PᵢₐQᵢₐ)   (r ≡ a = 1, 2, . . . , N).   (12-3)

From Eq. (4-30),

Sᵣ = Σᵢ₌₁ⁿ (uᵢₐ − Pᵢₐ) P′ᵢₐ/(PᵢₐQᵢₐ)   (r ≡ a = 1, 2, . . . , N).   (12-4)
When the item parameters are fixed, the correction Δᵣ⁰ to θᵣ⁰ is simply Δᵣ⁰ = Sᵣ⁰/Iᵣᵣ⁰. Thus θ̂ₐ (a = 1, 2, . . . , N) can readily be found by the iterative method of the preceding paragraph.
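A minimal sketch of this examinee-side step, assuming the three-parameter logistic model with D = 1.7 and hypothetical item parameters; each cycle applies the correction S/I (damped here for stability, a practical safeguard not discussed in the text):

import math

def p3pl(theta, a, b, c, D=1.7):
    # Three-parameter logistic item response function.
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def estimate_theta(u, items, theta=0.0, n_iter=25, D=1.7):
    # Scoring iterations for one examinee; item parameters treated as known.
    # u[i] is the 0/1 response to item i; items[i] = (a_i, b_i, c_i).
    info = 1.0
    for _ in range(n_iter):
        S = 0.0      # score statistic of Eq. (12-4)
        info = 0.0   # information of Eq. (12-3)
        for ui, (a, b, c) in zip(u, items):
            P = p3pl(theta, a, b, c, D)
            Q = 1.0 - P
            Pprime = D * a * Q * (P - c) / (1.0 - c)   # dP/dtheta, from Eq. (12-2)
            S += (ui - P) * Pprime / (P * Q)
            info += Pprime * Pprime / (P * Q)
        theta += max(-1.0, min(1.0, S / info))         # damped correction S/I
    return theta, 1.0 / info   # 1/I approximates the sampling variance (Section 12.3)

items = [(1.0, -1.0, 0.2), (1.0, 0.0, 0.2), (1.2, 0.5, 0.2), (0.8, 1.0, 0.2)]
print(estimate_theta([1, 1, 0, 1], items))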
When the ability parameters are fixed (treated as known), the parameter vector x is {a₁, b₁, c₁; a₂, b₂, c₂; . . . ; aₙ, bₙ, cₙ}. Formulas (see Appendix 12) for the Iqr are obtained by the same method used to find Eq. (5-5). The information matrix ‖Iqr‖ is a diagonal supermatrix whose diagonal elements are 3 × 3 matrices, one for each item. The 3 × 3 matrices are not diagonal. The corrections Δᵣ are obtained separately for each item, by solving three linear equations in three unknown Δ's.

12.3. SAMPLING VARIANCES OF PARAMETER ESTIMATES

If the true item parameters were known, then the asymptotic sampling variance of θ̂ₐ would be approximated by 1/Iₐₐ evaluated at θₐ = θ̂ₐ. This is readily obtained from the modified Newton-Raphson procedure after convergence [see Eq. (5-4)]. Ability estimates θ̂ₐ and θ̂b are uncorrelated when a ≠ b.
A large sampling variance occurs when the likelihood function has a relatively
flat maximum, as in the two curves on the left and the one on the right of Fig.
4.9.1. A small sampling variance occurs when the likelihood function has a
well-determined maximum, as in the middle three curves of Fig. 4.9.1.
If the true ability parameters were known, the asymptotic sampling variance-covariance matrix of the âᵢ, b̂ᵢ, and ĉᵢ would be approximated by the inverse of the 3 × 3 matrix of the Iqr for item i, evaluated at aᵢ, bᵢ, and cᵢ. Parameter estimates for item i are uncorrelated with estimates for item j when i ≠ j.
In practice, estimated sampling variances and covariances of parameter esti­
mates are obtained by substituting estimated parameter values for parameters
assumed known. When all parameters must be estimated, this substitution under­
estimates the true sampling fluctuations.
Andersen (1973) argues that when item and person parameters are estimated
simultaneously, the estimates do not converge to their true values as the number
of examinees becomes large. The relevant requirement, however, is convergence
when the number of people and the number of items both become large together.
A proof of convergence for this case has been given for the Rasch model by

Haberman (1977). It seems likely that convergence will be similarly proved for
the three-parameter model also.

12.4. PARTIALLY SPEEDED TESTS

If an examinee does not respond to the last few items in a test because of lack of
time, his (lack of) behavior with respect to these items is not described by current
item response theory. Unidimensional item response theory deals with actual
responses; it does not predict whether or not a response will occur. Many low-
ability students answer all the items in a test well ahead of the allotted time limit.
Ability to answer test items rapidly is thus only moderately correlated, if at all,
with ability to answer correctly. Item response theory currently deals with the
latter ability and not at all with the former. [Models for speeded tests have been
developed by Meredith (1970), Rasch (1960, Chapter 3), van der Ven (1976),
and others.]
If we knew which items the examinee did not have time to consider, we would
ignore these items when estimating his ability. This is appropriate because of the
fundamental property that the examinee's ability θ is the same for all items in a
unidimensional pool. Except for sampling fluctuations, our estimate of θ will be
the same no matter what items in the pool are used to obtain it. Note that our
ability estimate for the individual represents what he can do on items that he has
time to reach and consider. It does not tell us what he can do in a limited testing
time.
In practice, all consecutively omitted items at the end of his answer sheet are
ignored for estimating an examinee's ability. Such items are called not reached
items. If the examinee did not read and respond to the test items in serial order,
we may be mistaken in assuming that he did not read such a "not reached" item.
We may also be mistaken in assuming that he did read all (earlier) items to which
he responded. The assumption made here, however, seems to be the best practi­
cal assumption currently available.

12.5. FLOOR AND CEILING EFFECTS

If an examinee answers all test items correctly, the maximum likelihood estimate
of his ability is θ̂ = +∞. In this case, Bayesian methods would surely give a
more plausible result. In maximum likelihood estimation, such examinees may
be omitted from the data if desired. Their inclusion, however, will not affect the
estimates obtained for the item parameters nor for other ability parameters.
If an examinee answers all items incorrectly, the maximum likelihood estimate of his ability is θ̂ = −∞. Most examinees answer at least a few items correctly, however, if only by guessing. By Eq. (2-1) or (2-2), Pᵢ(θ) ≥ cᵢ. Thus by Eq. (4-5) the number-right true score ξ is always greater than or equal to Σᵢ cᵢ. An examinee's number-right observed score may be less than Σᵢ cᵢ because he is unlucky in his guessing; in this case we are likely to find an estimate of θ̂ = −∞ for him.
On the other hand, an examinee with a very low number-right score may still have a finite θ̂ provided he has answered the easiest items correctly. This occurs because, as seen in Section 5.6, hard items receive very little scoring weight in determining θ̂ for low-ability examinees. Correspondingly, an examinee with a number-right score above Σᵢⁿ cᵢ may still obtain θ̂ = −∞. This will happen if he
answers some very easy items incorrectly. A person who gets easy items wrong
cannot be a high-ability person.

12.6. ACCURACY OF ABILITY ESTIMATION

Figure 3.5.2 compares ability estimates θ for 1830 sixth-grade pupils from a
50-item MAT vocabulary test with estimates obtained independently from a
42-item SRA vocabulary test. Values of θ outside the range - 2 . 5 to +2.5 are
plotted on the perimeter of the figure. Many of the points on the left and lower
boundaries of the figure represent pupils for whom θ̂ = −∞ on one or both tests.
It appears from the figure, as we would anticipate, that θ values between 0 and
1 show much smaller sampling errors than do extreme θ values. Although not
apparent from the figure, a pupil might have a θ of —10 on one test and a θ of
— 50 on the other, simply because of sampling fluctuations. For many practical
purposes, the difference between θ = —10 and θ = — 50 for a sixth-grade pupil is
unimportant, even if numerically large. In the usual frame of reference, it may
not be necessary to distinguish between these two ability levels. If we did wish to
distinguish them, we would have to administer a much easier test.
Since some of the θ are negatively infinite in Fig. 3.5.2, we cannot compute a
correlation coefficient for this scatterplot. If we simply omit all infinite esti­
mates, the correlation coefficient might be dominated by the large scatter of a
few extreme θ. If we omit these θ also, the correlation obtained will depend on
just which θ we choose to omit and which we retain.
It is helpful to transform the θ scale to a more familiar scale that better represents our interests. A convenient and meaningful scale to use is the proportion-correct score scale. We can transform all θ̂ on to this scale by the familiar transformation [Eq. (4-9)]:

ζ ≡ ζ(θ) ≡ (1/n) Σᵢ₌₁ⁿ Pᵢ(θ).
Figure 12.6.1 is the same as Fig. 3.5.2 except that the points are now plotted
on the proportion-correct score scale. The θ̂ obtained from the SRA items have been transformed into SRA proportion-correct estimated true scores; the θ̂ from MAT, to MAT estimated true scores. As noted often before, proportion-correct true scores cannot fall below Σᵢⁿ cᵢ/n. The product-moment correlation between
SRA and MAT values of ζ is found to be .914. This is notably higher than the

[Scatterplot: SRA estimated true scores (vertical axis, 0.0 to 1.0) plotted against MAT estimated true scores (horizontal axis, 0.0 to 1.0).]

FIG. 12.6.1. Estimated true scores on a 50-item MAT vocabulary test and a 42-
item SRA vocabulary test for 1830 sixth-grade pupils.

reported correlation for these data of .8998 between number-right observed


scores on the two tests.
When two tests are at different difficulty levels, their true scores will not be
linearly related. Because of this, cubic regressions were fitted to the data in Fig.
12.6.1. The results show that the curvilinear correlation of MAT ζ on SRA ζ is
about .920; the curvilinear correlation of SRA on MAT is about .923.

12.7. INADEQUATE DATA AND UNIDENTIFIABLE PARAMETERS

If a parameter value is in principle indeterminate even when we are given the


entire population of observable values, then the parameter is called unidentifiable.

Actually, all θ, ai, and bi (but not ci) are unidentifiable until we agree on some
arbitrary choice of origin and unit of measurement (see Section 3.5). Once this
choice is made, all θ and all item parameters will ordinarily be identifiable in a
suitable infinite population of examinees and infinite pool of test items.
Just as his θ cannot be estimated from the responses of an examinee who answers all n test items correctly, similarly bᵢ cannot be estimated from the responses of a sample of N examinees all of whom answer item i correctly. This does not mean that bᵢ is unidentifiable; it only means that the data are inadequate for our purpose. We need a larger sample of examinees (some examinees will surely get the item wrong if the sample is large enough) or, better, a sample of examinees at lower ability levels.
If only a few examinees in a large sample answer item i correctly, b̂ᵢ will have a large sampling error. To make this clearer, consider a special case where cᵢ = 0, aᵢ = 1/1.7 = .588, and all examinees are at θ = 0. The standard error of the proportion p̂ᵢ of correct answers in a sample of N such examinees is given by the usual binomial formula:

SE(p̂ᵢ) = √(PᵢQᵢ/N).

Under the logistic model, for the special case considered,

Pᵢ = (1 + e^bᵢ)⁻¹

or

bᵢ = ln(−1 + 1/Pᵢ).

The asymptotic standard error of b̂ᵢ = ln(−1 + 1/p̂ᵢ) is easily found by the delta method (Kendall & Stuart, 1969, Chapter 10) to be

SE(b̂ᵢ) = 1/√(NPᵢQᵢ).

The two standard errors are compared in the following listing for N = 1000.

Pᵢ or Qᵢ     .001      .01       .05       .2        .5
bᵢ           ±6.91     ±4.60     ±2.94     ±1.39     0
SE(p̂ᵢ)       .001      .003      .007      .013      .016
SE(b̂ᵢ)       1.000*    .318*     .145      .079      .063

*N = 1000 may not be large enough for the asymptotic formula to apply when Pᵢ or Qᵢ is ≤ .01; for large N, the standard errors will be proportional to those shown here, however.
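The listing is easily reproduced; a minimal sketch:

import math

N = 1000
for p in (0.001, 0.01, 0.05, 0.2, 0.5):
    q = 1.0 - p
    b = math.log(q / p)                # b_i = ln(-1 + 1/p_i) when a = 1/1.7, c = 0, theta = 0
    se_p = math.sqrt(p * q / N)        # binomial standard error of the sample proportion
    se_b = 1.0 / math.sqrt(N * p * q)  # delta-method standard error of b-hat
    print(p, round(b, 2), round(se_p, 3), round(se_b, 3))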

The listing shows that b̂ᵢ is unstable when Pᵢ or Qᵢ is small. The situation is much worse when both aᵢ and bᵢ must be estimated simultaneously. If only a few examinees answer item i incorrectly, it is obviously impossible to estimate aᵢ with any accuracy.

The problem is even more obvious for ci, which represents the performance
of low-ability examinees. If we have no such examinees in our sample, we
cannot estimate ci. This is the fault of the sample and not the fault of the
estimation method. In such a sample, any reasonable value of ci will be able to
fit the data about as well as any other. If we arbitrarily assign some plausible
value of ci and then estimate ai and bi accordingly, we shall obtain a good
description of our data.
We need not forego practical applications of item response theory just because
some of the parameters cannot be estimated accurately from our data, as long as
we restrict our conclusions to ranges for which our data are relevant. If we wish
to predict far outside these ranges, we must gather data relevant to our problem.

12.8. BAYESIAN ESTIMATION OF ABILITY

In work with published tests, it is usual to test similar groups of examinees year
after year with parallel forms of the same test. When this happens, we can form a
good picture of the frequency distribution of ability in the next group of exam­
inees to be tested. Such "prior" information can be used to advantage to improve
parameter estimation, providing it can be conveniently quantified and conve­
niently processed numerically.
Suppose each examinee tested is known to be randomly drawn from a popula­
tion in which the distribution of ability is g(θ). The joint distribution of examinee
ability θ and item response vector u for a randomly chosen examinee is equal to
the conditional probability (4-20) multiplied by g(θ):

L(u, θ|a, b, c) ≡ L(u|θ; a, b, c) g(θ) = g(θ) Πᵢ₌₁ⁿ [Pᵢ(θ)]^uᵢ [Qᵢ(θ)]^(1−uᵢ).   (12-5)

The marginal distribution of u for a randomly chosen examinee is obtained from the joint distribution by integrating out θ:

L(u|a, b, c) = ∫₋∞^∞ g(θ) Πᵢ₌₁ⁿ [Pᵢ(θ)]^uᵢ [Qᵢ(θ)]^(1−uᵢ) dθ.   (12-6)
The conditional distribution of θ for given u is obtained by dividing (12-5) by
(12-6). This last is the posterior distribution of θ given the item response vector
u. Since (12-6) is not a function of θ, we can say that the posterior distribution of
θ is proportional to (12-5). This distribution contains all the information we have
for inferring the ability θ of an examinee whose item responses are u.
If we want a point estimate of θ for a particular examinee, we can use the
mean of the posterior distribution (see Birnbaum, 1969) or its mode. There is no
convenient mathematical expression for either mean or mode, but both can be
determined numerically.
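A minimal numerical sketch of such a computation, assuming a normal prior g(θ) (the choice of prior is illustrative, not prescribed by the text), the three-parameter logistic model, and a simple grid approximation to the posterior (12-5):

import math

def p3pl(theta, a, b, c, D=1.7):
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def posterior_summary(u, items, prior_mean=0.0, prior_sd=1.0, lo=-6.0, hi=6.0, m=1201):
    # Posterior mean and mode of theta over a grid; the posterior is the
    # likelihood of the response pattern u times the (unnormalized) prior.
    grid = [lo + (hi - lo) * k / (m - 1) for k in range(m)]
    post = []
    for t in grid:
        g = math.exp(-0.5 * ((t - prior_mean) / prior_sd) ** 2)
        like = 1.0
        for ui, (a, b, c) in zip(u, items):
            P = p3pl(t, a, b, c)
            like *= P if ui else 1.0 - P
        post.append(g * like)
    total = sum(post)
    mean = sum(t * w for t, w in zip(grid, post)) / total
    mode = grid[max(range(m), key=post.__getitem__)]
    return mean, mode

items = [(1.0, -0.5, 0.2), (1.0, 0.0, 0.2), (1.0, 0.5, 0.2)]   # hypothetical items
print(posterior_summary([1, 0, 1], items))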

Use of the posterior mean θ̄ to estimate θ from u is the same as using the regression (conditional expectation) of θ on u. Suppose we have a population of individuals, and for each individual we calculate the posterior mean θ̄. Let σθ̄ denote the standard deviation of these posterior means. By a standard definition, the correlation ratio of θ on u is

ηθu ≡ σθ̄/σθ,

where σθ² is the prior or marginal variance of θ. Since θ̄ correlates only imperfectly with θ, ηθu < 1 and σθ̄ < σθ; thus in any group the distribution of the estimates θ̄ will have a smaller variance than does the distribution of true ability θ. The θ̄ exhibit regression toward the mean.
If we define θ* = θ̄/ηθu, then σθ* = σθ: These estimates have the same variance as the true values. But θ* is not a type of estimate usually used by Bayesian statisticians. The estimate θ̄ minimizes the mean square error of estimation over all examinees, but, as we have seen, it does not have the same variance as θ; the estimate θ* has the same variance as θ, but it is a worse estimate in terms of mean square error.
Is θ̄ or θ* unbiased for a particular person? If an estimator of θ is unbiased for every individual in a population of individuals, then its error of estimation is uncorrelated with θ, so that in the population the variance of the estimates equals σθ² plus the error variance, which exceeds σθ². Thus, unbiased estimates have a larger variance than the true ability parameters. They also have a larger variance than θ̄ or θ*. Thus neither θ̄ nor θ* is unbiased.
The foregoing problems are simply a manifestation of the basic fact that the
properties of estimates are never exactly the same as the properties of the true
values.
The mode of the posterior distribution of θ is called the Bayesian modal
estimator. If the posterior distribution is unimodal and symmetric, then the
Bayesian modal estimator will be the same as the posterior mean θ̄, whose
properties as an estimator have already been discussed.

12.9. FURTHER THEORETICAL COMPARISON OF ESTIMATORS

If g(θ) is uniform, then [see Eq. (12-5)] the posterior distribution of θ is propor­
tional to the likelihood function (4-20). Thus when g(θ) is uniform, the
maximum likelihood estimator θ that maximizes (4-20) is the same as the Bayes­
ian modal estimator that maximizes (12-5). Since any bell-shaped g(θ) is surely
nearer the truth than a uniform distribution of θ, it has been argued, the Bayesian
modal estimator (BME) computed from a suitable bell-shaped prior, g(θ), must
surely be better than the maximum likelihood estimator (MLE), which (it is
asserted) assumes a uniform prior.

The trouble with this argument is that it tacitly assumes the conclusion to be
proved. If the BME were a faultless estimation procedure, then this line of
reasoning would show that the MLE is inferior whenever g(θ) is not uniform. On
the other hand, if the BME is less than perfect as an estimator, then the MLE
cannot be criticized on the grounds that under an implausible assumption (uni­
form distribution of θ) it happens to coincide with the BME.
The MLE is invariant under any continuous one-to-one transformation of the parameter. The same likelihood equations will result whether we estimate θ or θ* ≡ Ke^(kθ), as in Eq. (6-2), or ξ ≡ Σᵢ₌₁ⁿ Pᵢ(θ). Thus if θ̂ is the MLE of θ, the MLE of θ* will be Ke^(kθ̂) and the MLE of ξ will be Σᵢ₌₁ⁿ Pᵢ(θ̂).
If the cited argument proved that the MLE assumes a uniform distribution of
θ, then the same argument would prove that the MLE assumes a uniform distribu­
tion of θ* and also of ξ. This is self-contradictory, since if any one of these is
uniformly distributed, the others cannot be.
The absurd conclusion stems from the fact that BME is not invariant. It yields
a substantively different estimate depending on whether we estimate θ, θ*, or ξ.
The proof of this statement follows.
In simplified notation, the posterior distribution is proportional to L(u|Θ = θ)g(θ). The BME is the mode of the posterior distribution. If θ* = θ*(θ) rather than θ is the parameter of interest, then the prior distribution of θ* is

g*(θ*) = g(θ) dθ/dθ*.

The posterior distribution of θ* is thus proportional to

L(u|Θ* = θ*)g*(θ*) ≡ L(u|Θ = θ)g(θ) dθ/dθ*.

The posterior for θ* differs from the posterior for θ by the factor dθ/dθ*. Thus the two posterior distributions will in general have different modes and will therefore yield different BME's.
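A small numerical illustration of this non-invariance, assuming a single logistic item form, a hypothetical two-item response pattern, a standard normal prior, and a crude grid search (all of these are assumptions, not taken from the text):

import math

def p(theta):                        # logistic response function with a = 1/1.7, b = 0, c = 0
    return 1.0 / (1.0 + math.exp(-theta))

def loglik(theta, u):                # two equivalent items, responses u (hypothetical data)
    return sum(math.log(p(theta)) if ui else math.log(1.0 - p(theta)) for ui in u)

def prior(theta):                    # assumed N(0, 1) prior, unnormalized
    return math.exp(-0.5 * theta * theta)

def grid_argmax(f, lo, hi, m=20001):
    grid = [lo + (hi - lo) * k / (m - 1) for k in range(m)]
    return max(grid, key=f)

u = [1, 0]
# Mode of the posterior for theta itself.
mode_theta = grid_argmax(lambda t: loglik(t, u) + math.log(prior(t)), -5.0, 5.0)
# Mode of the posterior for theta* = exp(theta); the prior picks up the factor dtheta/dtheta* = 1/theta*.
mode_star = grid_argmax(lambda s: loglik(math.log(s), u) + math.log(prior(math.log(s)) / s),
                        math.exp(-5.0), math.exp(5.0))
print(mode_theta, math.log(mode_star))   # the two modes disagree once both are on the theta scale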
As stated at the beginning, the purpose of this section is not to fault the BME
but to point out a fallacy in a plausible line of reasoning, which superficially
appears to show that the MLE assumes a uniform distribution of the parameter
estimated. As pointed out at the end of Section 12.8, no point estimator, whether
Bayesian or non-Bayesian, has all possible desirable properties. If g(θ) is known
approximately, then any inference about θ may properly be based on the pos­
terior distribution of θ given u (but not necessarily on the mode of this distribu­
tion). The MLE, on the other hand, is of interest in situations where we cannot or
do not wish to restrict our attention to any particular g(θ).
It is often argued in other applications of Bayesian methods that the choice of
prior distribution, here g(θ), does not matter much when the number n of
observations is large. This fact is not helpful here, since n here is the number of
items. Mental test theory exists only because the observed score on n items
differs nonnegligibly from the true score that would be found if n were infinite.

12.10. ESTIMATION OF ITEM PARAMETERS

Bayesian estimation of item parameters might start by assuming that bί has a


normal or logistic prior distribution, that aί has a gamma distribution, and that cί
has a beta distribution. Assuming all item parameters are distributed indepen­
dently (there is some evidence to the contrary), one could try to work out
convenient formulas for Bayesian estimation. So far, this seems not to have been
attempted.
The following approach to item parameter estimation has been devised and
used successfully by Bock and Lieberman (1970). Equation (12-6) gives the
(marginal) probability of response vector u. Denote this probability by πᵤ and the corresponding sample frequency of the response pattern (vector) by fᵤ. There are 2ⁿ possible different response patterns u. The joint distribution of all possible pattern frequencies (response vectors) is the multinomial distribution

L ≡ [N!/Πᵤ fᵤ!] Πᵤ (πᵤ)^fᵤ.   (12-7)
Since the πᵤ are functions of the item parameters (but not of any θ), maximum
likelihood estimates may be obtained by finding the item parameter values that
maximize (12-7). Further details are given by Bock and Lieberman (1970).
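A minimal sketch of this marginal likelihood, assuming the three-parameter logistic model, a normal ability density approximated by crude quadrature, and hypothetical item parameters and pattern frequencies; maximizing the returned quantity over the item parameters would be the estimation step described by Bock and Lieberman.

import math
from itertools import product

def p3pl(theta, a, b, c, D=1.7):
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def pattern_probability(u, items, points, weights):
    # Marginal probability pi_u of response pattern u, Eq. (12-6), by quadrature.
    total = 0.0
    for t, w in zip(points, weights):
        like = 1.0
        for ui, (a, b, c) in zip(u, items):
            P = p3pl(t, a, b, c)
            like *= P if ui else 1.0 - P
        total += w * like
    return total

def log_marginal_likelihood(freqs, items, points, weights):
    # Log of Eq. (12-7), dropping the constant multinomial coefficient;
    # freqs maps each of the 2^n patterns to its sample frequency f_u.
    return sum(f * math.log(pattern_probability(u, items, points, weights))
               for u, f in freqs.items() if f > 0)

pts = [-4.0 + 0.1 * k for k in range(81)]                       # rectangle rule for N(0, 1)
wts = [0.1 * math.exp(-0.5 * t * t) / math.sqrt(2 * math.pi) for t in pts]
items = [(1.0, -0.5, 0.2), (1.0, 0.5, 0.2)]                     # hypothetical item parameters
freqs = dict(zip(product((0, 1), repeat=2), (30, 20, 15, 35)))  # hypothetical pattern counts
print(log_marginal_likelihood(freqs, items, pts, wts))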

12.11. ADDENDUM ON ESTIMATION

It is likely that new and better methods will be found for estimating both item
parameters and ability parameters. Illustrative data in this book have mostly been
obtained by certain modified maximum likelihood methods (Wood & Lord,
1976; Wood, Wingersky, & Lord, 1976). It is not the purpose of this book to
recommend any particular estimation method, however, since such a recom­
mendation is likely to become quickly out of date. The practical applications
outlined in these chapters are useful regardless of whatever effective estimation
method is used.
Anderson (1978) and Maurelli (1978) report studies comparing maximum
likelihood estimates with Bayesian estimates. The interested reader is referred to
Urry (1977) for a description of an alternative estimation procedure.

12.12. THE RASCH MODEL

Rasch's item response theory (Rasch 1966a, 1966b; Wright, 1977) assumes that
all items are equally discriminating and that items cannot be answered correctly
by guessing. The Rasch model is the special case of the three-parameter logistic
model arising when cᵢ = 0 and aᵢ = a for all items. If the Rasch assumptions are

satisfied for some set of data, then sufficient statistics (Section 4.12) are avail­
able for estimating both item difficulty and examinee ability. If, as is usually the
case, however, the Rasch assumptions are not met, then use of the Rasch model
does not provide estimators with optimal properties. This last statement seems
obvious, but it is often forgotten.
In any comparison of results from use of the Rasch model with results from
the use of the three-parameter logistic model, it is important to remember the
following. If the Rasch model holds, we are comparing the results of two statisti­
cal estimation procedures; we are not comparing two different models, since the
Rasch model is included in the three-parameter model. If the Rasch model does
not hold, then its use must be justified in some way. If sample size is small, for
example, Rasch estimates may be more accurate than three-parameter-model
estimates, even when the latter model holds and the Rasch model does not.

12.13. EXERCISES

12-1 Given that a = 1/1.7, b = 0, c = 0, the logistic item response function is P(θ) = (1 + e^(−θ))⁻¹. If u = {1, 0} on a test with just n = 2 equivalent items, the likelihood function is L(u|Θ = θ) = P(θ)Q(θ). By trial and error, find numerically the MLE of θ, the value of θ that maximizes P(θ)Q(θ).
12-2 If θ* = e^θ, the likelihood function for the situation in Exercise 12-1 is
P(ln θ*)Q(ln θ*). Find numerically the MLE of θ*. Compare with the
MLE of θ.
12-3 Suppose g(θ) = .1 for −5 ≤ θ ≤ 5, and g(θ) = 0 elsewhere. If u = {1, 0} on a test with just n = 2 equivalent items, the posterior distribution of θ is proportional to

L(u|Θ = θ)g(θ) = .1 P(θ)Q(θ) for −5 ≤ θ ≤ 5, and 0 elsewhere.
Find the BME of θ for the test in Exercise 12-1. Compare with the MLE
found there.
12-4 If θ* = e^θ, the posterior distribution of θ* for the situation in Exercise 12-3 is proportional to

.1 P(θ)Q(θ)/e^θ for e⁻⁵ ≤ θ* ≤ e⁵ (where θ ≡ ln θ*), and 0 elsewhere.

Show that P(θ)Q(θ)/e^θ ≡ Q²(θ). Find numerically the BME of θ*. What is the corresponding value of θ ≡ ln θ*? Compare this with the estimates obtained in Exercises 12-2 and 12-3.

APPENDIX

Listed here for convenient reference are formulas for Iqr (q, r = a, b, c) for the
three-parameter logistic function. These formulas are used in the modified
Newton-Raphson iterations (Section 12.2) and for computing sampling var­
iances of maximum likelihood estimators (Section 12.3).
Iaa = [D²/(1 − cᵢ)²] Σₐ₌₁ᴺ (θₐ − bᵢ)² (Pᵢₐ − cᵢ)² Qᵢₐ/Pᵢₐ,   (12-8)

Ibb = [D²aᵢ²/(1 − cᵢ)²] Σₐ₌₁ᴺ (Pᵢₐ − cᵢ)² Qᵢₐ/Pᵢₐ,   (12-9)

Icc = [1/(1 − cᵢ)²] Σₐ₌₁ᴺ Qᵢₐ/Pᵢₐ,   (12-10)

Iab = −[D²aᵢ/(1 − cᵢ)²] Σₐ₌₁ᴺ (θₐ − bᵢ)(Pᵢₐ − cᵢ)² Qᵢₐ/Pᵢₐ,   (12-11)

Iac = [D/(1 − cᵢ)²] Σₐ₌₁ᴺ (θₐ − bᵢ)(Pᵢₐ − cᵢ) Qᵢₐ/Pᵢₐ,   (12-12)

Ibc = −[Daᵢ/(1 − cᵢ)²] Σₐ₌₁ᴺ (Pᵢₐ − cᵢ) Qᵢₐ/Pᵢₐ.   (12-13)
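A minimal sketch of these formulas, assuming the three-parameter logistic model with D = 1.7 and hypothetical known abilities; the off-diagonal signs follow from the derivatives in Eq. (12-2). Inverting the returned 3 × 3 matrix approximates the sampling variance-covariance matrix of the item parameter estimates (Section 12.3).

import math

def p3pl(theta, a, b, c, D=1.7):
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def item_information_matrix(thetas, a, b, c, D=1.7):
    # 3 x 3 information matrix for one item, Eqs. (12-8) to (12-13),
    # with the abilities in `thetas` treated as known.
    Saa = Sbb = Scc = Sab = Sac = Sbc = 0.0
    for t in thetas:
        P = p3pl(t, a, b, c, D)
        r = (1.0 - P) / P                      # Q/P
        Saa += (t - b) ** 2 * (P - c) ** 2 * r
        Sbb += (P - c) ** 2 * r
        Scc += r
        Sab += (t - b) * (P - c) ** 2 * r
        Sac += (t - b) * (P - c) * r
        Sbc += (P - c) * r
    k = 1.0 / (1.0 - c) ** 2
    Iaa = D * D * k * Saa
    Ibb = D * D * a * a * k * Sbb
    Icc = k * Scc
    Iab = -D * D * a * k * Sab
    Iac = D * k * Sac
    Ibc = -D * a * k * Sbc
    return [[Iaa, Iab, Iac],
            [Iab, Ibb, Ibc],
            [Iac, Ibc, Icc]]

thetas = [-1.5, -0.5, 0.0, 0.5, 1.5]           # hypothetical known abilities
for row in item_information_matrix(thetas, a=1.0, b=0.0, c=0.2):
    print(row)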

REFERENCES

Andersen, E. B. Conditional inference for multiple-choice questionnaires. The British Journal of


Mathematical and Statistical Psychology, 1973, 26, 31-44.
Anderson, M. R. The robustness of two parameter estimation methods for latent trait models.
Doctoral dissertation, University of Kansas, 1978.
Birnbaum, A. Statistical theory for logistic mental test models with a prior distribution of ability.
Journal of Mathematical Psychology, 1969, 6, 258-276.
Bock, R. D., & Lieberman, M. Fitting a response model for n dichotomously scored items.
Psychometrika, 1970, 35, 179-197.
Haberman, S. J. Maximum likelihood estimates in exponential response models. The Annals of
Statistics, 1977, 5, 815-841.
Kale, B. K. On the solution of likelihood equations by iteration processes. The multiparametric case.
Biometrika, 1962, 49, 479-486.
Kendall, M. G., & Stuart, A. The advanced theory of statistics (Vol. 1) (3rd ed.). New York:
Hafner, 1969.
Maurelli, V. A., Jr. A comparison of Bayesian and maximum likelihood scoring in a simulated
stradaptive test. Master's thesis, St. Mary's University, San Antonio, Texas, 1978.
Meredith, W. Poisson distributions of error in mental test theory. British Journal of Mathematical
and Statistical Psychology, 1970, 24, 49-82.
Rasch, G. Probabilistic models for some intelligence and attainment tests. Copenhagen: Nielsen and
Lydiche, 1960.

Rasch, G. An individualistic approach to item analysis. In P. F. Lazarsfeld & N. W. Henry (Eds.),


Readings in mathematical social science. Chicago: Science Research Associates, 1966. (a)
Rasch, G. An item analysis which takes individual differences into account. The British Journal of
Mathematical and Statistical Psychology, 1966, 19, 49-57. (b)
Urry, V. Tailored testing: A successful application of latent trait theory. Journal of Educational
Measurement, 1977, 14, 181-196.
van der Ven, Ad H. G. S. An error score model for time-limit tests. Tijdschrift voor Onderwijsresearch, 1976, 1, 215-226.
Wood, R. L., & Lord, F. M. A user's guide to LOGIST. Research Memorandum 76-4. Princeton,
N.J.: Educational Testing Service, 1976.
Wood, R. L., Wingersky, M. S., & Lord, F. M. LOGIST: A computer program for estimating examinee
ability and item characteristic curve parameters. Research Memorandum 76-6. Princeton, N.J.:
Educational Testing Service, 1976.
Wright, B. D. Solving measurement problems with the Rasch model. Journal of Educational Mea-
surement, 1977, 14, 97-116.
13 Equating

13.1. EQUATING INFALLIBLE MEASURES

Consider a situation where many people are being selected to do typing. All
applicants have been tested before applying for employment. Some come with
test record x showing typing speed in words per second; others come with test
record y showing typing speed in seconds per word. We assume for this section
that all typing tests are perfectly reliable ("infallible").
There is a one-to-one correspondence between the two measures x and y of
typing speed. The hiring official will undoubtedly wish to express all typing
speeds in the same terms for easy comparison of applicants. Perhaps he will
replace all y scores by their reciprocals. In general, we denote such a transforma­
tion by xy ≡ x(y). In the illustration, x(y) = 1/y. Clearly x and xy are compara­
ble values.
[Note on notation: In practical test work, a new test is commonly equated to
an old test. It seems natural to call the old test x and the new test y. The function
xy ≡ x(y) may seem awkward, since we habitually think of y as a function of x.
An alternative notation would write y*(y) instead of x(y); this would fail to
emphasize the key fact that xy ≡ x(y) is on the same scale as x.]
In mental testing, we may believe that two tests, x and y, measure the same
trait without knowing the mathematical relation between the score scale for x and
the score scale for y. Suppose the hiring officer knew that both x and y were
perfectly reliable measures of typing speed but did not know how each was
expressed. Could he without further testing find the mathematical relation x(y)
between x and y, so as to use them for job selection? If many applicants have
both x and y, it is easy to find x(y) approximately. But suppose that the hiring
officer never obtains both x and y on the same individual.


The true situation, not known to the hiring officer, is illustrated schematically
in Fig. 13.1.1. The points falling along the curve represent corresponding values
of x and y for various individuals. The frequency distributions of x and y for the
combined population (of all applicants regardless of test taken) are indicated
along the two axes (the y distribution is shown upside down).
It should be clear from Fig. 13.1.1 that when y and x have a one-to-one monotonic relation, any cutting score Y₀ on y implies a cutting score X₀ = x(Y₀) on x. Moreover, the people who lie to the right of Y₀ are the same people as the people who lie below X₀. Thus the percentile rank of Y₀, counted from right to left in the distribution of y, is the same as the percentile rank of X₀ counting upward in the distribution of x. The cutting scores X₀ and Y₀ are said to have an equipercentile relationship.
If G(y) denotes the cumulative distribution of y cumulated from right to left and F(x) denotes the cumulative distribution of x cumulated from low to high, then F(X₀) = G(Y₀) for any pair of corresponding cutting points (Y₀, X₀). Since F is monotonic, we can solve this equation for X₀, obtaining X₀ = F⁻¹[G(Y₀)], where F⁻¹ is the inverse function of F. Thus, the transformation x(y) is given by

xy ≡ x(y) ≡ F⁻¹[G(y)].   (13-1)


It is generally more convenient to express this relationship by the parametric
equations

[Schematic plot: the curve x or x(y), showing a cutting score Y₀ on the y scale, the corresponding X₀ = x(Y₀) on the x scale, and the frequency distributions of x and y along the axes, with F(X₀) and G(Y₀) marked.]

FIG. 13.1.1. Equipercentile relation of X₀ to Y₀.



F(xy) = p,
G(y) = p,   (13-2)

where p is the percentile rank of both y and xy. This last pair of equations is a direct statement of the equipercentile relationship of xy to y.
In the present application where y = 1/x, y decreases as x increases. In most
typical applications, x and y increase together, in which case we define G(y) in
the usual way, cumulated from left to right. If x and y increase together, (13-2)
still applies, using the appropriate definition of G(y).
Suppose, now, that the hiring officer in our illustration knows that applicants
with x are a random sample from the same population as applicants with y. This
information will allow him to estimate the mathematical relation of x to y. His
sample cumulative distribution of y values is an estimate of G(y); his sample
distribution of x values is an estimate of F(x). He can therefore estimate the
relationship x(y) from (13-1) or (13-2).
When this has been done, the transformed or "equated" score xy is on the x
scale of measurement, and test y is said to have been equated to test x. To
summarize, if (1) x(y) is a one-to-one function of y and (2) the X group has the
same distribution of ability as the Y group, then (3) equipercentile equating will
transform y to the same scale of measurement as x; and (4) the transformed y,
denoted by xy ≡ x(y), will have the same frequency distribution as x. Two
perfectly reliable tests measuring the same trait can be equated by administering
them to equivalent populations of examinees and carrying out an equipercentile
equating by (13-2). The equating will be the same no matter what the distribution
of ability in the two equivalent populations.
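A crude sample version of this equipercentile equating (Python; it assumes the two groups are samples from equivalent populations, that x and y increase together, and it uses simple empirical percentile ranks rather than the smoothed mid-percentile ranks used in operational work):

def equipercentile_table(x_scores, y_scores):
    # For each observed y value, find the x value with the same percentile rank
    # in its own group -- a crude sample version of Eq. (13-2).
    xs = sorted(x_scores)
    ys = sorted(y_scores)
    table = {}
    for y in sorted(set(ys)):
        p = sum(1 for v in ys if v <= y) / len(ys)              # percentile rank of y
        k = max(0, min(len(xs) - 1, round(p * len(xs)) - 1))    # matching rank in x
        table[y] = xs[k]
    return table

x_group = [3, 5, 5, 6, 7, 8, 8, 9, 10, 12]                      # made-up score samples
y_group = [14, 15, 17, 17, 18, 20, 21, 21, 23, 25]
print(equipercentile_table(x_group, y_group))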

13.2. EQUITY

If an equating of tests x and y is to be equitable to each applicant, it must be a


matter of indifference to applicants at every given ability level θ whether they are to take test x or test y. More precisely, equity requires that for applicants at every ability level θ, the (conditional) frequency distribution fx|θ of score x should be the same as the (conditional) frequency distribution fx(y)|θ of the (transformed) score x(y):

fx|θ ≡ fx(y)|θ,   (13-3)

where x(y) is a one-to-one function of y. Note that if fx|θ [or fx(y)|θ] has nonzero variance, as will be assumed hereafter, this implies that x (or y) is an imperfectly reliable ("fallible") measurement.
It is not sufficient for equity that fx|θ and fx(y)|θ have the same means.
Suppose they have the same means but different variances. Competent applicants
can be confident that a test score with small conditional variance will make their

competence evident; a score with large conditional variance (large errors of


measurement) may not do so. Thus, if the variances are unequal, it is not a matter
of indifference to them whether they take test x or test y.
Note that if the equity requirement holds over a certain range of ability, it
necessarily holds for all groups within that range of ability. This statement is true
no matter how the groups are defined; for example, by sex, by race, or by
educational or social characteristics.
If an X group and a Y group have the same distribution of ability, g*(θ), then the unconditional distribution of x in the X group is

φX(x) = ∫₋∞^∞ g*(θ) fx|θ dθ;   (13-4)

the unconditional distribution of x(y) in the Y group is

φY(xy) = ∫₋∞^∞ g*(θ) fx(y)|θ dθ.

By (13-3), if equity holds, the conditional frequency distributions are the same, so that

φX( ) ≡ φY( ).   (13-5)
Consequently, x and y have an equipercentile relationship; equipercentile equat­
ing using (13-2) will discover the relation x(y).
A similar result is shown in the preceding section for perfectly reliable scores;
the present result applies to imperfectly reliable scores provided the equity re­
quirement holds. Unfortunately, as we shall see, the equity requirement cannot
hold for fallible tests unless x and y are parallel tests, in which case there is no
need for any equating at all.

13.3. CAN FALLIBLE TESTS BE EQUATED?

From Eq. (4-1), the conditional frequency distribution fx|θ of number-right test score x is given by the identity

Σₓ₌₀ⁿ fx|θ tˣ ≡ Πᵢ₌₁ⁿ (Qᵢ + Pᵢt),   (13-6)

where the symbol t serves only to determine the proper grouping of terms on the right. For n = 3, for example, (13-6) becomes

f0|θ + f1|θ t + f2|θ t² + f3|θ t³ ≡ Q₁Q₂Q₃ + (Q₁Q₂P₃ + Q₁P₂Q₃ + P₁Q₂Q₃)t + (Q₁P₂P₃ + P₁Q₂P₃ + P₁P₂Q₃)t² + P₁P₂P₃t³,

which holds for all values of t.

If m is the number of items in test y, we have similarly for the conditional distribution gy|θ of number-right score y

Σ_{y=0}^m gy|θ tʸ ≡ Πⱼ₌₁ᵐ (Qⱼ + Pⱼt).

It should not matter which test is labeled x, so we choose the labeling to make n ≥ m. Since x(y) is a function of y, it follows that the conditional distribution hx(y)|θ of the equated score x(y) is given by

Σ_{y=0}^m hx(y)|θ tʸ ≡ Πⱼ₌₁ᵐ (Qⱼ + Pⱼt).   (13-7)
Equity, however, requires that the distribution of the function x(y) be the same as that of x. Substituting x(y) for x in (13-6), we have

Σ_{x(y)} fx(y)|θ t^{x(y)} ≡ Πᵢ₌₁ⁿ (Qᵢ + Pᵢt).   (13-8)

Since each fx(y)|θ > 0, the hx(y)|θ in (13-7) must be the same as the fx(y)|θ in (13-8). Thus m = n. From (13-3), (13-7), and (13-8),

Πᵢ₌₁ⁿ (Qᵢ + Pᵢt) ≡ Πⱼ₌₁ⁿ (Qⱼ + Pⱼt)   (13-9)

for all θ and for all t.
We now prove (under regularity conditions) that (13-9) will hold only if tests x and y are strictly parallel. Since t in (13-9) is arbitrary, replace t by t + 1 to obtain Πᵢ(1 + Pᵢt) ≡ Πⱼ(1 + Pⱼt). Taking logarithms, we have Σᵢ ln(1 + Pᵢt) ≡ Σⱼ ln(1 + Pⱼt). Expanding each logarithm in a power series, we have for t² < 1 and for all θ that

Σᵢ(Pᵢt − ½Pᵢ²t² + ⅓Pᵢ³t³ − . . .) ≡ Σⱼ(Pⱼt − ½Pⱼ²t² + ⅓Pⱼ³t³ − . . .).   (13-10)

After dividing by n, this may be rewritten

Σᵣ₌₁^∞ μᵣ(−t)ʳ/r ≡ Σᵣ₌₁^∞ νᵣ(−t)ʳ/r,   (13-11)

where (for any given θ) μᵣ ≡ n⁻¹ Σᵢⁿ Pᵢʳ is the rth (conditional) moment about the origin of the Pᵢ for the n items in test x, and νᵣ is the rth (conditional) moment about the origin of the Pⱼ for the n items in test y. Because a convergent Taylor series is unique, it follows that μᵣ = νᵣ. Since the distribution of a bounded variable is determined by its moments [Kendall & Stuart, 1969, Section

4.22(c)], it follows under realistic regularity conditions¹ that for each item in test
x there is an item in test y with the same item response function P(θ), and vice
versa.
Since it contradicts common thinking and practice, it is worth stating this
result as a theorem:
Theorem 13.3.1. Under realistic regularity conditions, scores x and y on two
tests cannot be equated unless either (1) both scores are perfectly reliable or (2)
the two tests are strictly parallel [in which case x(y) ≡ y].

13.4. REGRESSION METHODS

Since test users are frequently faced with a real practical need for equating tests
from different publishers, what can be done in the light of Theorem 13.3.1,
which states that such tests cannot be equated? A first reaction is typically to try
to use some prediction approach based on regression equations.
If we were to try to predict x from y, we would clearly be doing the wrong
thing. From the point of view of the examinee, x and y are symmetrically
related. A basic requirement of equating is that the result should be the same no
matter which test is called x and which is called y. This requirement is not
satisfied when we predict one test from the other.
Suppose we have some criterion, denoted by ω, such as grade-point average
or success on the job. Denote by Rx(ω|x) the value of ω predicted from x by the
usual (linear or nonlinear) regression equation and by Ry(ω|y) the value pre­
dicted from y. A sophisticated regression approach will determine x(y) so that
Rx [ω|x(y)] = Ry(ω|y). For example, if ω is academic grade-point average, a per­
son scoring y on test y and a person scoring x on test x will be treated as equal
whenever their predicted grade-point averages are equal.
By Theorem 13.3.1, however, it is clear that such an x(y) will ordinarily not
satisfy the equity requirement. Other difficulties follow. We state here a general
conclusion reached in the appendix at the end of this chapter: Suppose x(y) is
defined by Rx [ω|x(y)] ≡ Ry(ω|y). The transformation x(y) found typically will
vary from group to group unless x and y are equally correlated with the criterion
ω. This is not satisfactory: An equating should hold for all subgroups of our total
group (for men, women, blacks, whites, math majors, etc.).
Suppose that x is an accurate predictor of ω and that y is not. Competent

¹The need for regularity conditions has been pointed out by Charles E. Davis. For example, let
test x consist of the two items illustrated in Fig. 3.4.1. Let θ+ denote the value of θ where these two
response functions cross. Let the first (second) item in test y have the same response function as item
1 (2) in test x up to θ+ and the same response function as item 2 (1) in test x above θ+. Test x can be
equated to this specially contrived test y. Since such situations are not realistic, the mathematical
regularity conditions required to eliminate them are not detailed here.

examinees are severely penalized if they take test y: Their chance of selection
may be little better than under random selection. Regression methods may op­
timize selection from the point of view of the selecting institution; they may not
yield a satisfactory solution to the equity problem from the point of view of the
applicants.

13.5. TRUE-SCORE EQUATING

Three important requirements are mentioned in preceding sections for equating


two unidimensional tests that measure the same ability:

1. Equity: For every θ, the conditional frequency distribution of x(y) given θ


must be the same as the conditional frequency distribution of x.
2. Invariance across groups: x(y) must be the same regardless of the popula­
tion from which it is determined.
3. Symmetry: The equating must be the same regardless of which test is
labeled x and which is labeled y.

We have seen that in practice these seemingly indispensable requirements in


general cannot be met for fallible test scores. An equating of true scores, on the
other hand, can satisfy the listed requirements. We therefore proceed to consider
how item response theory can be used to equate true scores.
If test x and test y are both measures of the same ability θ, then their
number-right true scores are related to θ by their test characteristic functions [Eq.
(4-5)]:
ξ ≡ Σᵢ₌₁ⁿ Pᵢ(θ),   η ≡ Σⱼ₌₁ᵐ Pⱼ(θ).   (13-12)

Equations (13-12) are parametric equations for the relation between η and ξ. A
single equation for the relationship is found (in principle) by eliminating θ from
the two parametric equations. In practice, this relationship can be estimated by
using estimated item parameters to approximate the Pi(θ) and Pj(θ) and then
substituting a series of arbitrary values of θ into (13-12) and computing ξ and η
for each θ. The resulting paired values define ξ as a function of η (or vice versa)
and constitute an equating of these true scores. Note that the relation between ξ
and η is mathematical and not statistical (not a scatterplot). Since ξ and η are
each monotonic increasing functions of θ, it follows that ξ is a monotonic
increasing function of η.
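A minimal sketch of this true-score equating, assuming the three-parameter logistic model and hypothetical item parameters standing in for the two tests; the paired values traced over a θ grid define the equating curve.

import math

def p3pl(theta, a, b, c, D=1.7):
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def true_score(theta, items):
    # Test characteristic function, Eq. (13-12): number-right true score at theta.
    return sum(p3pl(theta, a, b, c) for a, b, c in items)

def true_score_equating(items_x, items_y, lo=-4.0, hi=4.0, steps=161):
    # Paired true scores (eta on test y, xi on test x) over a theta grid.
    pairs = []
    for k in range(steps):
        t = lo + (hi - lo) * k / (steps - 1)
        pairs.append((true_score(t, items_y), true_score(t, items_x)))
    return pairs

# Hypothetical short "tests" standing in for the two forms to be equated.
items_x = [(1.0, b / 2.0, 0.2) for b in range(-4, 5)]           # 9 items
items_y = [(0.8, b / 2.0 - 0.5, 0.2) for b in range(-5, 6)]     # 11 items
for eta, xi in true_score_equating(items_x, items_y)[::40]:
    print(round(eta, 2), round(xi, 2))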
Figure 13.5.1 shows the estimated test characteristic curves of two calculus
tests: AP with 45 five-choice items and CLEP with 50 five-choice items. The
broken line in Fig. 13.5.1 graphically illustrates the meaning of true-score equat­
ing. The broken line shows that a true score of 35 on CLEP is equivalent to a true

[Figure: estimated test characteristic curves for AP and CLEP plotted against ability (horizontal axis, −3 to +3); the vertical axes show true scores from 0 to 45 (AP) and 0 to 50 (CLEP), and a broken line traces a CLEP true score to the corresponding AP true score.]

FIG. 13.5.1. Equating AP and CLEP using test characteristic curves.

score of approximately 24 on AP. The estimated equating relation between true


scores on CLEP and AP, obtained in this way from Fig. 13.5.1 or directly from
(13-12), is shown by the dashed curve in Fig. 13.5.2.

13.6. TRUE-SCORE EQUATING WITH AN ANCHOR TEST

If test x and test y are both given to the same examinees, the test administered second is not being given under typical conditions because of practice effects, fatigue, and the like. If, on the other hand, each of the two tests is given to a
different sample of examinees, the equating is impaired by differences between
the samples.
Differences between the two samples of examinees can be measured and
controlled by administering to each examinee an anchor test measuring the same
ability as x and y. When an anchor test is used, equating may be carried out even
when the x group and the y group are not at the same ability level. The anchor
test may be a part of both test x and test y; such an anchor test is called internal.

[Figure: two estimated lines of relationship between AP (vertical axis, 0 to 45) and CLEP (horizontal axis, 0 to 50); one curve is labeled "true-score equating" and the other "equipercentile."]

FIG. 13.5.2. Two estimates of the line of relationship between AP and CLEP.

If the anchor test is a separate test, it is called external. An external anchor


test is administered second, after x or y, to avoid any practice effect on the scores
being equated. If the difference between the two groups is small, any difference
between test x and test y in their practice effect on the anchor test is a second-
order effect, assumed to be negligible.
Data from such an administration are readily used for the equating procedure
of the preceding section. All item response data of all examinees are analyzed
simultaneously to obtain estimates of all item parameters. The fact that each
examinee takes only part of the items is no problem for the parameter estimation
procedure; the procedure simply ignores items not administered to a particular
examinee, just as the "not reached" items are ignored in Section 12.4. This is
appropriate and effective, in view of the presumed invariance of an examinee's ability θ from test to test.
In the case of Fig. 13.5.2, the two groups did differ substantially in ability.
The equating was made possible by the fact that 17 items appeared in both tests,
providing an internal anchor test.

If the anchor test is external to the two regular tests, the item parameters for
the anchor items are not used in the equating procedure of Section 13.5. They
have served their purpose by tying the data together so that all parameters are
expressed on the same scale. Without such anchor items, equating would be
impossible unless the x group and the y group had the same distribution of
ability.

13.7. RAW-SCORE "EQUATING" WITH AN ANCHOR TEST

The true-score equating of the preceding section is straightforward and effective.


How else could you estimate the curvilinear relation between true scores on two
nonparallel tests when no person has taken both?
The problem is that there is no really appropriate way to make use of the
true-score equating obtained. We do not know an examinee's true score. We can
estimate his true score from his responses: ξ̂ = Σᵢ Pᵢ(θ̂) is such an estimate.
However, an estimated true score does not have the properties of true scores; an
estimated true score, after all, is just another kind of fallible observed score. As
proved in Section 13.3, fallible scores cannot be strictly equated unless the two
tests are strictly parallel.
We consider here a possible approximate equating for raw scores. We start with an estimate γ̂(θ) of the distribution γ(θ) of θ in some specified group (often this group will be all the examinees in the equating study). The actual distribution of the θ̂ₐ in the group is an approximation to γ(θ); a better approximation is given in Chapter 16.
As in Eq. (4-12), the distribution of observed score x for the specified group can be estimated by

φ̂ₓ(x) = (1/N) Σₐ₌₁ᴺ φₓ(x|θ̂ₐ),   (13-13)

where a = 1, 2, . . . , N indexes the examinees in the specified group. If γ̂(θ) is continuous as in Chapter 16, however, the estimated φₓ(x) is obtained by numerical integration from

φ̂ₓ(x) = ∫₋∞^∞ φₓ(x|θ) γ̂(θ) dθ.   (13-14)

The function φₓ(x|θ̂ₐ) is given by Eq. (4-1), using estimated item parameters for test x items in place of their true values.
Similar equations apply for test y. Furthermore, since x and y are independently distributed when θ is fixed, the joint distribution of scores x and y for the specified group is estimated by

φ̂(x, y) = (1/N) Σₐ₌₁ᴺ φₓ(x|θ̂ₐ) φy(y|θ̂ₐ),   (13-15)

or by

φ̂(x, y) = ∫₋∞^∞ φₓ(x|θ) φy(y|θ) γ̂(θ) dθ.   (13-16)

Note that, thanks to the anchor test, it is possible to estimate the joint distribution of x and y even though no examinee has taken both tests.
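A minimal sketch of Eq. (13-15), assuming the three-parameter logistic model and hypothetical item parameters and ability estimates; the conditional distribution φx(x|θ) of Eq. (4-1) is built up item by item, and the joint distribution is the average of the product of the two conditional distributions over the θ̂ₐ.

import math

def p3pl(theta, a, b, c, D=1.7):
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def conditional_score_dist(theta, items):
    # phi(x | theta) for number-right score x, built up item by item
    # (the compound binomial of Eq. 4-1).
    dist = [1.0]                              # distribution over 0 items
    for a, b, c in items:
        P = p3pl(theta, a, b, c)
        new = [0.0] * (len(dist) + 1)
        for x, pr in enumerate(dist):
            new[x] += pr * (1.0 - P)          # item answered wrong
            new[x + 1] += pr * P              # item answered right
        dist = new
    return dist

def joint_score_dist(theta_hats, items_x, items_y):
    # Eq. (13-15): average the product of the two conditional distributions
    # over the ability estimates of the specified group.
    N = len(theta_hats)
    nx, ny = len(items_x) + 1, len(items_y) + 1
    joint = [[0.0] * ny for _ in range(nx)]
    for t in theta_hats:
        fx = conditional_score_dist(t, items_x)
        fy = conditional_score_dist(t, items_y)
        for x in range(nx):
            for y in range(ny):
                joint[x][y] += fx[x] * fy[y] / N
    return joint

items_x = [(1.0, -1.0, 0.2), (1.0, -0.5, 0.2), (1.0, 0.0, 0.2), (1.0, 0.5, 0.2), (1.0, 1.0, 0.2)]
items_y = [(0.8, -0.8, 0.2), (0.8, 0.0, 0.2), (0.8, 0.4, 0.2), (0.8, 1.2, 0.2)]
theta_hats = [-1.5, -0.5, 0.0, 0.6, 1.4]       # hypothetical ability estimates
joint = joint_score_dist(theta_hats, items_x, items_y)
print(sum(map(sum, joint)))                    # should be 1.0 (up to rounding)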
The integrand of (13-16) is the trivariate distribution of θ, x, and y. Since θ determines the true scores ξ and η, this distribution also represents the joint distribution of the four variables ξ, η, x, and y. The joint distribution contains all possible information about the relation of x to y. Yet, by Section 13.3, it cannot provide an adequate equating of x and y unless the two tests are already parallel.
A plausible procedure is to determine the equipercentile relationship between x and y from (13-15) or (13-16) and to treat this as an approximate equating. Is this better than applying the true-score equating of Section 13.5 to observed scores x and y or to estimated true scores ξ̂ = Σᵢ Pᵢ(θ̂) and η̂?
At present, we have no criterion for evaluating the degree of inadequacy of an imperfect equating. Without such a criterion, the question cannot be answered. At least the equipercentile equating of x and y covers the entire range of observed scores, whereas the equating of ξ and η cannot provide any guide for scores below the "chance" levels represented by Σᵢ cᵢ.
The solid line of relationship in Fig. 13.5.2 is obtained by the methods of this
section. The result agrees very closely with the true-score equating of Section
13.6. Further comparisons of this kind need to be made before we can safely
generalize this conclusion.

13.8. ILLUSTRATIVE EXAMPLE²

Figure 13.5.2 is presented because it deals with a practical equating problem. For
just this reason, however, there is no satisfactory way to check on the accuracy of
the results obtained. The results obtained by conventional methods cannot be
justified as a criterion.
The following example was set up so as to have a proper criterion for the
equating results. Here, unknown to the computer procedure, test X and test Y are
actually the same test. Thus we know in advance what the line of relation should be.

²This section is taken with special permission from F. M. Lord, Practical applications of item
characteristic curve theory. Journal of Educational Measurement, Summer 1977, 14, No. 2, 117–
138. Copyright 1977, National Council on Measurement in Education, Inc., East Lansing, Mich.

Test X is the 85-item verbal section of the College Board SAT, Form XSA,
administered to a group of 2802 college applicants in a regular SAT administra-
tion. Test Y is the same test administered to a second group of 2763 applicants.
Both groups also took a 39-item verbal test mostly, but not entirely, similar to the
regular 85-item test. The 39-item test is used here as an anchor test for the
equating.
The two groups differed in ability level (otherwise the outcome of the equat-
ing would be a foregone conclusion). The proportion of correct answers given to
typical items is lower by roughly 0.10 in the first group than in the second.
The equating was carried out exactly as described in Section 13.6. One
computer run was made simultaneously for all 85 + 85 + 39 = 209 items and for
all 5565 examinees. The resulting line of relationship between true-scores on test
X and test Y is shown by the crosses in Fig. 13.8.1. It agrees very well with the
45-degree line, also shown, that should be found when a test is equated to itself.

[FIG. 13.8.1. Estimated equating (crosses) between "Test X" and "Test Y," which are actually identical. Both axes run from 0 to 80 (Form XSA5038).]
Results like this have been obtained for many sets of data. It is the repeated
finding of such results that encourages confidence in the item response function
model used.

13.9. PREEQUATING

Publishers who annually produce several forms of the same test have a continual
need for good equating methods. Conventional methods may properly require
1000 or more examinees for each test equated. If the test to be equated is a secure
test used for an important purpose, a special equating administration is likely to
impair its security. The reason is that coaching schools commonly secure detailed
advance information about the test questions from such equating administrations.
In preequating, by use of item response theory, each new test form is equated
to previous forms before it is administered. In preequating, a very large pool of
calibrated test items is maintained. New forms are built from this pool. The
method of Section 13.5 is used to place all true scores on the same score scale.
Scaled observed scores on the various test forms are then treated as if they were
interchangeable. A practical study of preequating is reported by Marco (1977).
Preequating eliminates special equating administrations for new test forms. It
requires instead special calibration administrations for new test items to be in-
cluded in the pool. Each final test form is drawn from so many calibration
administrations that its security is not seriously compromised.
Figure 13.9.1 shows a plan for a series of administrations for item calibration.
Rows in the table represent sets of items; columns represent groups of exam-
inees. An asterisk represents the administration of an item set to a particular
group. The column labeled n shows the number of items in each item set; the row
labeled n shows the number of items taken by each group. The row labeled N
shows the number of examinees in each group; the column labeled N shows the
number of examinees taking each item set.
A total of 399 items are to be precalibrated using a total of 15,000 examinees.
The total number of responses is approximately 938,000; thus an examinee takes
about 62 items on the average, and an item is administered to about 2350
examinees on the average. The table is organized to facilitate an understanding of
the adequacy of linkages tying the data together.
Item set F72 consists of previously calibrated items taken from the precali-
brated item pool. The remaining items are all new items to be calibrated. The item
parameters of the new items will be estimated from these data while the bi for the
20 precalibrated items are held fixed at their precalibrated values. This will place
all new item parameters on the same scale as the precalibrated item pool. All new
item and examinee parameters will be estimated simultaneously by maximum like-
lihood from the total data set consisting of all 938,000 responses (see Appendix).
[FIG. 13.9.1. Item calibration plan: 14 item sets (399 items in all, drawn from forms X1–X23 together with the precalibrated set F72) administered in overlapping blocks to 14 groups of examinees (total N = 15,000; about 938,000 responses). For each item set the plan lists the forms and item numbers included, the number of items n, and the number of examinees N taking it; for each group it lists the group size and the number of items taken.]

13.10. CONCLUDING REMARKS

Since practical pressures often require that tests be "equated" at least approxi-
mately, the procedures suggested in Sections 13.5-13.7 may be used. What is
really needed is a criterion for evaluating approximate procedures, so as to be
able to choose from among them. If you can't be fair (provide equity) to
everyone, what is the next best thing?
There is a parallel here to the problem of determining an unbiased selection
procedure (Hunter & Schmidt, 1976; Thorndike, 1971). Some procedures are
fair from the point of view of the selecting institutions. Usually, however, no
procedure can be simultaneously fair, even from a statistical point of view, both
to the selecting institutions and to various subgroups of examinees.
In the present problem, the equating needs of a particular selecting institution
could be satisfied by regression methods (Section 13.4). If regression methods of
"equating" are used, however, examinees could properly complain that they had
been disadvantaged (denied admission to college, for example) because they had
taken test y instead of test x or test x instead of test y. It seems important to avoid
this.
An equipercentile "equating" of raw scores has the convenient property that
when a cutting score is used, the proportion of selected examinees will be the
same for those taking test x and for those taking test y, except for sampling
fluctuations. This will be true regardless of where the institution sets its cutting
score. Thus equipercentile "equating" of raw scores gives an appearance of
being fair to everyone.
Most practical equatings are carried out between "parallel" test forms. In
such cases, forms x and y are so nearly alike that equipercentile equating, or
even conventional mean-and-sigma equating, should yield excellent results. This
chapter does not discourage such practical procedures. This chapter tries to
clarify the implications of equating as a concept. Such clarification is especially
important for any practical equating of tests from two different publishers or of
tests at two different educational levels.
The reader is referred to Angoff (1971) for a detailed exposition of conven-
tional equating methods. Woods and Wiley (1978) give a detailed account of
their application of item response theory to a complicated practical equating
problem involving the equating of 60 different reading tests, using available data
from 31 states and the District of Columbia.

13.11. EXERCISES

13-1 The test characteristic function for test 1 was computed in Exercise 5.9.1.
Compute for θ = -3, -2, -1, 0, 1, 2, 3 the test characteristic function of
a test composed of n = 3 items just like the items in Table 4.17.2. From

these two test characteristic functions, determine equated true scores for
these two tests. Plot seven points on the equating function x(y) and
connect by a smooth curve.
13-2 Suppose that test x is a perfectly reliable test with scores x ≡ T. Suppose
test y is a poorly reliable test with scores y ≡ T + E, where E is a random
error of measurement, as in Section 1.2. Make a diagram showing a
scatterplot for x and y and also the regressions of y on x and of x on y.
Discuss various functions x(y) that might be used to try to "equate"
scores on test y to scores on test x.

APPENDIX

"Equating" by Regression Methods Is Not Invariant


Across Groups
This appendix points out one disadvantage of determining x(y) so that the regres­
sion Rx of criterion ω on x(y) is the same as the regression Ry of the criterion on
y (see Section 13.4):
Rx[ω|x(y)] = Ry(ω|y). (13-17)
We shall work with the usual linear regression coefficients β_xω, β_ωx, β_yω, and
β_ωy. Suppose that all regressions are actually linear; that the standard error of
estimate of x on ω, denoted here by σ_x·ω, is the same for all values of ω, so that
σ_x|ω = σ_x·ω; and that σ_y|ω = σ_y·ω likewise.
A standard formula for the effect of explicit selection on ω (Gulliksen, 1950,
Chapter 11) shows how the correlation ρ_xω (likewise ρ_yω) changes as the variance
of ω is changed:

$$\rho'^{2}_{x\omega} = \frac{1}{1 + \dfrac{\sigma^{2}_{\omega}}{\sigma'^{2}_{\omega}}\cdot\dfrac{1 - \rho^{2}_{x\omega}}{\rho^{2}_{x\omega}}}, \qquad (13\text{-}18)$$

where the prime denotes a statistic for the selected group. For any group,

$$\beta_{x\omega}\beta_{\omega x} = \rho^{2}_{x\omega}. \qquad (13\text{-}19)$$
From (13-18) and (13-19) we have

$$\beta'_{x\omega}\beta'_{\omega x} = \frac{1}{1 - (\sigma^{2}_{\omega}/\sigma'^{2}_{\omega})\left[1 - (1/\rho^{2}_{x\omega})\right]} \qquad (13\text{-}20)$$

and similarly for y,

$$\beta'_{y\omega}\beta'_{\omega y} = \frac{1}{1 - (\sigma^{2}_{\omega}/\sigma'^{2}_{\omega})\left[1 - (1/\rho^{2}_{y\omega})\right]}. \qquad (13\text{-}21)$$

If the equating (13-17) is to hold for the selected group, we must have R'_x[ω|x(y)]
≡ R'_y(ω|y) and consequently β'_ωx = β'_ωy. Dividing (13-21) by (13-20) to eliminate
β'_ωy = β'_ωx, we have

$$\frac{\beta'_{y\omega}}{\beta'_{x\omega}} = \frac{\sigma'^{2}_{\omega} - \sigma^{2}_{\omega}\left[1 - (1/\rho^{2}_{x\omega})\right]}{\sigma'^{2}_{\omega} - \sigma^{2}_{\omega}\left[1 - (1/\rho^{2}_{y\omega})\right]}. \qquad (13\text{-}22)$$

We assume, as is usual, that the (linear) regressions on ω are the same before and
after selection: β'_xω = β_xω and β'_yω = β_yω. Thus, finally,

$$\frac{\beta_{y\omega}}{\beta_{x\omega}} = \frac{\sigma'^{2}_{\omega} - \sigma^{2}_{\omega}\left[1 - (1/\rho^{2}_{x\omega})\right]}{\sigma'^{2}_{\omega} - \sigma^{2}_{\omega}\left[1 - (1/\rho^{2}_{y\omega})\right]}. \qquad (13\text{-}23)$$

Consider what happens when σ'_ω varies from group to group. All unprimed
statistics in (13-23) refer to a fixed group and do not vary. The ratio on the left
stays the same, but the ratio on the right can stay the same only if ρ_yω = ρ_xω.
This is an illustration of a more general conclusion:
Suppose x(y) is defined by Rx[ω|x(y)] ≡ Ry(ω|y). The transformation x(y)
that is found will typically vary from group to group unless x and y are equally
correlated with the criterion ω.
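This conclusion can be checked numerically. The following Python sketch (simulated data; the correlations of .9 and .6 with the criterion are arbitrary) computes the slope β_ωy/β_ωx that defines the regression-based transformation x(y), once for the total group and once for a group explicitly selected on ω. The two slopes differ noticeably; they would agree only if x and y were equally correlated with ω.

import numpy as np

rng = np.random.default_rng(0)
n = 200_000
omega = rng.normal(size=n)                                         # the criterion
x = 0.9 * omega + rng.normal(scale=np.sqrt(1 - 0.9**2), size=n)   # rho_x,omega = .9
y = 0.6 * omega + rng.normal(scale=np.sqrt(1 - 0.6**2), size=n)   # rho_y,omega = .6

def equating_slope(sel):
    # Slope of x(y) implied by matching R(omega | x(y)) with R(omega | y).
    o, xs, ys = omega[sel], x[sel], y[sel]
    def slope(u, v):                                   # least-squares slope of u on v
        return np.mean((u - u.mean()) * (v - v.mean())) / np.var(v)
    return slope(o, ys) / slope(o, xs)                 # beta_{omega y} / beta_{omega x}

print(equating_slope(np.full(n, True)))                # about 0.67 in the total group
print(equating_slope(omega > 1.0))                     # noticeably smaller after selection on omega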

Numerical Estimation Procedures to Accelerate Convergence
If all item and examinee parameters were estimated simultaneously, the estima­
tion problem for Fig. 13.9.1 would not differ significantly from typical problems
discussed in Chapter 12, except for the very large number of parameters to be
estimated. Actually, item parameters and person parameters must be estimated
alternately (see Section 12.1) rather than simultaneously, in order to keep the
information matrix ||Iqr|| diagonal, or nearly so. In practice, convergence is slow
whenever, as illustrated in Fig. 13.9.1, the item-person matrix is poorly internally
interconnected (that is, when several changes of direction may be required to get
from one part of the matrix to another following the marked straight paths from
asterisk to asterisk).
In such cases, the iterative estimation process reaches a condition where the
estimates no longer fluctuate up and down; instead, each estimate moves consis­
tently in the same direction iteration after iteration. The estimates will converge,
but only slowly. The following extrapolation procedure, suggested by Max
Woodbury, has been found very successful in such cases. One application is
often sufficient.
Let X_t denote the tth approximation to the maximum likelihood estimate X̂ ≡
X_∞ of parameter X. Assume that the discrepancy X_t − X_∞ is proportional to some
positive constant r raised to the tth power:

$$X_t - X_\infty = kr^{t}, \qquad (13\text{-}24)$$

where k is the constant of proportionality. Then

$$X_{t-1} - X_t = kr^{t-1}(1 - r) \qquad (13\text{-}25)$$

and

$$1 - r = \frac{X_{t-1} - X_t}{X_{t-1} - X_\infty},$$

so that

$$X_\infty = \frac{X_t - rX_{t-1}}{1 - r}. \qquad (13\text{-}26)$$

The rate r can thus be found from

$$r = \frac{X_t - X_{t-1}}{X_{t-1} - X_{t-2}}. \qquad (13\text{-}27)$$

In practice, r is computed from (13-27), using the results of three successive
iterations. Then (13-26) provides an extrapolated approximation to the maximum
likelihood estimate X_∞.
In the situation illustrated by Fig. 13.9.1, the bi for all items in each subtest
may be averaged and this average substituted for X in (13-27) to find the rate r.
The same rate r may then be used in (13-26) separately for each item to approxi­
mate the maximum likelihood estimator bi.
These bi for all items are then held fixed and the θa are estimated iteratively
for all individuals. The θa are then held fixed while reestimating all item parame­
ters by ordinary estimation methods. Additional applications of (13-26) and
(13-27) may be carried out after further iterations that provide new values of
Xt-2, Xt-1, and Xt. One application of (13-26) and (13-27), however, will often
sufficiently accelerate convergence.
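A minimal Python sketch of the extrapolation; the iterates shown are artificial numbers that approach their limit geometrically, as (13-24) assumes.

def woodbury_extrapolate(x_tm2, x_tm1, x_t):
    # Eq. (13-27): estimated rate of convergence from three successive iterates.
    r = (x_t - x_tm1) / (x_tm1 - x_tm2)
    # Eq. (13-26): extrapolated approximation to the limit X_infinity.
    return (x_t - r * x_tm1) / (1.0 - r)

# Iterates creeping toward 2.0 with rate r = 0.8: 1.2, 1.36, 1.488.
iterates = [2.0 - 0.8 ** t for t in (1, 2, 3)]
print(woodbury_extrapolate(*iterates))        # recovers 2.0, the limit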

True-Score "Equating" Below the Chance Level


The raw-score "equating" of Section 13.7 is rather complicated to be done
routinely in the absence of any clear indication that it is superior to true-score
equating. Yet, we cannot use true-score equating of Section 13.5 without some
way to deal with observed scores below Σ_j c_j. This appendix suggests a conve­
nient practical procedure.
Consider applying the method of Section 13.7 just to a hypothetical group of
examinees all at ability level θ = −∞. According to Eq. (4-2) and (4-3), the
observed scores for such a group of examinees have a mean of Σ_j c_j and a
variance of Σ_j c_j(1 − c_j). For y scores below Σ_j c_j, let us take the equating
function x(y) to be a linear function of y chosen so that both x and x(y) have the
same mean and also the same variance in our hypothetical subgroup of examinees.
This means we shall use a conventional "mean and sigma" linear equating
based on this subgroup of examinees. This equating requires that

$$\frac{x(y) - \sum_i c_i}{\sqrt{\sum_i c_i(1 - c_i)}} = \frac{y - \sum_j c_j}{\sqrt{\sum_j c_j(1 - c_j)}},$$
where i indexes the items in test x and j indexes the items in test y. The desired
equating function x(y) is thus seen to be

$$x(y) = \sqrt{\frac{\sum_i c_i(1 - c_i)}{\sum_j c_j(1 - c_j)}}\left(y - \sum_j c_j\right) + \sum_i c_i. \qquad (13\text{-}28)$$

We use (13-28) for test y scores below Σ_j c_j; we use true-score equating
(13-12) above Σ_j c_j. The equating relationship so defined is continuous: When y
= Σ_j c_j, we find that x(y) = Σ_i c_i whether we use the true-score equating curve
of (13-12) or the raw-score "equating" line of (13-28). We cannot defend
(13-28) as uniquely correct, but it is a good practical solution to an awkward
problem.
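The piecewise rule can be sketched as follows (Python with NumPy). The true-score equating of (13-12) is represented by a placeholder function supplied by the caller, since only the below-chance segment (13-28) is written out here.

import numpy as np

def equate(y, c_x, c_y, true_score_equating):
    # Map a test-y score to the test-x scale.
    chance_x, chance_y = np.sum(c_x), np.sum(c_y)
    if y >= chance_y:
        return true_score_equating(y)                # Eq. (13-12), assumed given
    slope = np.sqrt(np.sum(c_x * (1 - c_x)) / np.sum(c_y * (1 - c_y)))
    return slope * (y - chance_y) + chance_x         # Eq. (13-28)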

REFERENCES

Angoff, W. H. Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.). Washington, D.C.: American Council on Education, 1971.
Gulliksen, H. Theory of mental tests. New York: Wiley, 1950.
Hunter, J. E., & Schmidt, F. L. Critical analysis of the statistical and ethical implications of various
definitions of test bias. Psychological Bulletin, 1976, 83, 1053-1071.
Kendall, M. G., & Stuart, A. The advanced theory of statistics (Vol. 1, 3rd ed.). New York: Hafner,
1969.
Marco, G. L. Item characteristic curve solutions to three intractable testing problems. Journal of
Educational Measurement, 1977, 14, 139-160.
Thorndike, R. L. Concepts of cultural-fairness. Journal of Educational Measurement, 1971, 8,
63-70.
Woods, E. M., & Wiley, D. E. An application of item characteristic curve equating to item sampling
packages or multi-form tests. Paper presented at the annual meeting of the American Educational
Research Association, Toronto, March 1978.
14 Study of Item Bias

14.1. INTRODUCTION

It is frequently found that certain disadvantaged groups score poorly on certain


published cognitive tests. This raises the question whether the test items may be
unfair or biased against these groups.
Suppose a set of items measures one ability or skill for one group and a
different ability or skill for another. Such a test would in general be unfair, since
one ability or skill will in general be more relevant for the purposes of the test
than the other. Such a situation is best detected by factor analytic methods and is
not considered further here. Instead, we consider a situation where most of the
test items measure about the same dimension for all groups tested, but the
remaining items may be biased against one group or another.
If each test item in a test had exactly the same item response function in every
group, then people at any given level 0 of ability or skill would have exactly the
same chance of getting the item right, regardless of their group membership.
Such a test would be completely unbiased. This remains true even though some
groups may have a lower mean θ, and thus lower test scores, than another group.
In such a case, the test results would be reflecting an actual group difference and
not item bias.
If, on the other hand, an item has a different item response function for one
group than for another, it is clear that the item is biased. If the bias is substantial,
the item should be omitted from the test.
If the item response function for one group is above the function for another
group at all θ levels, then people in the first group at a given ability level have a
better chance of answering the item correctly than people of equal ability in the


other group. This situation is the simplest and most commonly considered case of
item bias.
If the item response functions for the two groups cross, as is frequently found
in practice, the bias is more complicated. Such an item is clearly biased for and
against certain subgroups.
It seems clear from all this that item response theory is basic to the study of
item bias. Mellenbergh (1972) reports an unsuccessful early study of this type,
using the Rasch model. A recent report, comparing item response theory and
other methods of study, is given by Ironson (1978). Before applying item re-
sponse theory here, let us first consider a conventional approach in current use.

14.2. A CONVENTIONAL APPROACH1

For illustrative purposes, we shall compare the responses of about 2250 whites
with the responses of about 2250 blacks on the 85-item Verbal section of the
April 1975 College Board SAT.2 Each group is about 44% male. All items are
five-choice items.
For each item, Fig. 14.2.1 plots pi, the proportion of correct answers, for
blacks against p_i for whites. Items (crosses) falling along the diagonal (dashed)
line in the figure are items that are as easy for blacks as for whites. Items below
this line are easier for whites. The solid oblique line is a straight line fitted to the
scatter of points. The solid line differs from the diagonal line because whites
score higher on the test than blacks. If all the items fell directly on the solid line,
we could say that the items are all equally biased or, conceivably, equally
unbiased.
It has been customary to look at the scatter of items about the solid line and to
pick out the items lying relatively far from the line and consider them as atypical
and undesirable. In the middle of Fig. 14.2.1 there is one item lying far below the
line that appears to be strongly biased in favor of whites and also another item far
above the line that favors blacks much more than other items. A common judg-
ment would be that both of these items should be removed from the test.
In Fig. 14.2.1 the standard error of a single proportion is about .01, or less.
Thus, most of the scattering of points is not attributable to sampling fluctuations.
Unfortunately, the failure to fall along a straight line is not necessarily attributa-
ble to differences among items in bias. This statement is true for several different
reasons, discussed below.

1
Most of this section is taken, by permission, from F. M. Lord, A study of item bias, using item
characteristic curve theory. In Y. H. Poortinga (Ed.), Basic problems in cross-cultural psychology.
Amsterdam: Swets and Zeitlinger, 1977, pp. 19-29.
2
Thanks are due to Gary Marco and to the College Entrance Examination Board for permission to
present some of their data and results here.

[FIG. 14.2.1. Proportion of right answers to 85 items, for blacks and for whites. Horizontal axis: p (whites); vertical axis: p (blacks); both run from 0.0 to 1.0.]

In the first place, we should expect the scatter in Fig. 14.2.1 to fall along a
curved and not a straight line. If an item is easy enough, everyone will get it
right, and the item will fall at (1, 1). If the item is hard enough, everyone will
perform at some "chance" level c, so the item will fall at (c, c). Logically, the
items must fall along some line passing through the points (c, c) and (1, 1). If the
groups performed equally well on the test, the points could fall along the
diagonal line. But since one group performs better than the other, most of the
points must lie to one side of the diagonal line, so the relationship must be
curved.
Careful studies attempt to avoid this curvature by transforming the propor-
tions. If an analysis of variance is to be done, the conventional transformation is
the arcsine transformation. The real purpose of the arcsine transformation is to
equalize sampling variance. Whatever effect it may have in straightening the line
of relationship is purely incidental.

The transformation usually used to straighten the line of relationship is the


inverse normal transformation. The proportion of correct answers is replaced by
the relative deviate that would cut off the same upper proportion of the area under
the standard normal curve. The result of this transformation is shown in Fig.
14.2.2. Indeed, the points in Fig. 14.2.2 fall about a line that is more nearly
straight than is the case in Fig. 14.2.1. Unfortunately, there are theoretical
objections to the inverse normal transformation.
[FIG. 14.2.2. Inverse normal transformation of the proportion of correct answers for 85 items; transformed values for blacks plotted against those for whites.]

Let π_i denote the proportion of correct answers to item i in the population: π_i
≡ Prob(u_i = 1). A superscript o will be used to denote the special case where
there is no guessing. By a standard formula,

$$\pi_i = \mathcal{E}_\theta \operatorname{Prob}(u_i = 1 \mid \theta) \equiv \mathcal{E}_\theta P_i(\theta), \qquad (14\text{-}1)$$

the expectation being taken over θ. Now, by Eq. (2-2), P_i(θ) ≡ c_i + (1 − c_i)P_i^o(θ), so, by (14-1),

$$\pi_i = c_i + (1 - c_i)\,\mathcal{E}_\theta P_i^{o}(\theta) = c_i + (1 - c_i)\pi_i^{o}. \qquad (14\text{-}2)$$

Solving for π_i^o, we find

$$\pi_i^{o} = \frac{\pi_i - c_i}{1 - c_i}. \qquad (14\text{-}3)$$

If Pi(θ) is the three-parameter normal-ogive response function (2-2), and if θ


happens to be normally distributed, then by Eq. (3-9) (adding a superscript o to
accord with the present notation),
Πi() = 1 - (Γi) ≡ (- γi), (14-4)
,
where is the cumulative normal distribution function and γi = ρ'ibi where p'i
is the biserial correlation between item score and ability. Substituting this into
(14-2), we have
πi = ct + (1 - ci) (- ρ' i b i ). (14-5)
It appears from (14-5) that an inverse normal transformation of π_i (as
done for Fig. 14.2.2) does not seem to yield anything that is theoretically meaningful.
More interesting results are obtained by applying the inverse normal transformation
to the "corrected item difficulty" (14-3). By (14-4), since γ_i = ρ'_i b_i, the
useful result is that

$$\Phi^{-1}\!\left(\frac{\pi_i - c_i}{1 - c_i}\right) = \Phi^{-1}\left[\Phi(-\gamma_i)\right] = -\rho'_i b_i. \qquad (14\text{-}6)$$
Now all the b_i are invariant from group to group except for an indeterminate
origin and unit of measurement. Suppose now that ρ'_i is approximately the
same for all items. In this special case, the inverse normal transformation of the
corrected conventional item difficulty is invariant from group to group except for
an undetermined linear transformation.
It seems as if Fig. 14.2.2 should have been based on "corrected item difficulties"
(14-3) rather than on actual π_i; but there are obvious reasons why this is not
totally satisfactory either:

1. Equation (14-6) holds only if θ is normally distributed in both groups.
2. The value of c_i must be known for each item.
3. The value of ρ'_i must be the same for each item.
4. Because of sampling fluctuations, and because examinees sometimes do
systematically worse than chance on some items, sample estimates of π_i^o are too
often zero or negative, in which case the transformation cannot be carried out.

In practice, items differ from each other in discriminating power (ρ'i or,
equivalently, ai). Use of (14-2) may make items of the same discriminating
power lie along the same straight line; but items of a different discriminating
power will then lie along a different straight line. The more discriminating items
will show more difference between blacks and whites than do the less dis­
criminating items; thus use of (14-2) cannot make all items fall along a single
straight line. All this is a reflection of the fact, noted earlier (Section 3.4), that πi
is really not a proper measure of item difficulty. Thus the πi, however trans­
formed, are not really suitable for studying item bias.
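The transformations discussed in this section can be sketched briefly (Python with SciPy; the proportions correct and the common guessing level are hypothetical). Proportions at or below c_i leave the corrected deviate undefined, which is objection 4 above.

import numpy as np
from scipy.stats import norm

p_group1 = np.array([0.85, 0.60, 0.35, 0.18])     # proportions correct, group 1
p_group2 = np.array([0.75, 0.45, 0.25, 0.19])     # proportions correct, group 2
c = np.full(4, 0.20)                              # assumed common guessing levels

def corrected_deviate(p, c):
    corrected = (p - c) / (1 - c)                 # Eq. (14-3)
    with np.errstate(invalid="ignore"):
        return norm.ppf(corrected)                # inverse normal transform, cf. (14-6); undefined when p <= c

print(corrected_deviate(p_group1, c))
print(corrected_deviate(p_group2, c))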

14.3. ESTIMATION PROCEDURES

Suppose we plan to study item bias with respect to several groups of examinees.
A possible practical procedure is as follows:

1. Estimate approximately the item parameters for all groups combined,


standardizing on the bi and not on θ (see below).
2. Fixing the ci at the values obtained in step 1, reestimate ai and bi sepa­
rately for each group, standardizing on the bi.
3. For each item, compare across groups the item response functions or
parameters obtained in step 2.

Standardizing on the bi means that the scale is chosen so that the mean of the
bi is 0 and the standard deviation is 1.0 (see Section 3.5). Except for sampling
fluctuations, this automatically places all parameters for all groups on the same
scale. If the usual method of standardizing on θ were used, the item parameters
for each group would be on a different scale.
Before standardizing on the bi, it would be best to look at all bi values and
exclude very easy and very difficult items both from the mean and the standard
deviation. Items with low ai should also be omitted. The reason in both cases is
that the bi for such items have large sampling errors. Such items are omitted only
from the mean and standard deviation used for standardization; they are treated
like other items for all other purposes.
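A minimal sketch of this standardization in Python; the retention rule shown (dropping extreme b̂_i and low â_i from the mean and standard deviation only) is merely illustrative.

import numpy as np

def standardize_on_b(a, b, theta, keep=None):
    # Rescale so the retained difficulty estimates have mean 0 and s.d. 1;
    # a_i and theta move to the same scale, and every item keeps its rescaled values.
    if keep is None:
        keep = (np.abs(b) < 3.0) & (a > 0.3)      # drop poorly estimated items from m and s only
    m, s = b[keep].mean(), b[keep].std()
    return a * s, (b - m) / s, (theta - m) / s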
Following the outlined procedure, a given item response function will be
compared across groups on âi and bi only. We are acting as if a given item has
the same ci in all groups. The reason for doing this is that many ĉ's are so
indeterminate (see Chapter 12) that they are simply set at a typical or average
value; this makes tests of statistical significance among ĉi impossible in many or
most cases. If there are differences among groups in ci, they cannot be found by
the recommended procedure; however, this should not prevent us from observing
differences in a_i and b_i. The null hypothesis states that a_i, b_i, and c_i do not
vary across groups. If the recommended procedure discovers significant dif-
ferences, it is clear that the null hypothesis must be rejected.

14.4. COMPARING ITEM RESPONSE FUNCTIONS ACROSS GROUPS

Figure 14.4.1 compares estimated item response functions for an antonym item.
The data are the same as for Fig. 14.2.1 and 14.2.2. The top and bottom 5% of
individuals in each group are indicated by individual dots, except that the lowest
5% of the black group fall outside the limits of the figure. Clearly, this item is
much more discriminating among whites than it is among blacks.
Figure 14.4.2 shows an item on which blacks as a whole do worse than
whites; nevertheless at every ability level blacks do better than whites! Such
results are possible because there are more whites than blacks at high values of θ
and more blacks than whites at low values of θ. The item is a reading comprehen­
sion item from the SAT. This is the only item out of 85 for which the item
response function of blacks is consistently so far above that of whites. The reason
for this result will be suggested by the following excerpts from the reading
passage on which the item is based:
[FIG. 14.4.1. Black (dashed) and white (solid) item response curves for item 8, plotting probability of a correct answer against ability. The figure reproduces the antonym directions ("Each question below consists of a word in capital letters, followed by five lettered words or phrases. Choose the word or phrase that is most nearly opposite in meaning to the word in capital letters. Since some of the questions require you to distinguish fine shades of meaning, consider all the choices before deciding which is best.") and the item itself: GEL: (A) glaze (B) debase (C) corrode (D) melt (E) infect. (From F. M. Lord, Test theory and the public interest. In Proceedings of the 1976 ETS Invitational Conference—Testing and the Public Interest. Princeton, N.J.: Educational Testing Service, 1977.)]

[FIG. 14.4.2. Item response curves for item 59, plotting probability of a correct answer against ability. (From F. M. Lord, Test theory and the public interest. Proceedings of the 1976 ETS Invitational Conference—Testing and the Public Interest. Princeton, N.J.: Educational Testing Service, 1977.)]

American blacks have been rebelling in various ways against their status since
1619. Countless Africans committed suicide on the passage to America. . . . From
1955 to the present, the black revolt has constituted a true social movement.

It is often difficult or impossible to judge from figures like the two shown
whether differences between two response functions may be due entirely to
sampling fluctuations. A statistical significance test is very desirable. An obvious
procedure is to compare, for a given item, the difference between the black and
the white bi with its standard error
SE (bi1 - bi2) = √ V a rbi1+ Var bi2 . (14-17)

The same can be done with the âi.


If the b̂_i and the â_i are maximum likelihood estimates, the necessary sampling
variances can be approximated by standard asymptotic formulas (see Section
12.3); if the b_i are the only parameters estimated,

$$\operatorname{Var}\hat{b}_i \approx \frac{1}{\mathcal{E}\left[\left(\dfrac{\partial \ln L}{\partial b_i}\right)^{2}\right]},$$

and similarly for â_i. As shown for Eq. (5-5), we can carry out the differentiation
and expectation operations. In the case of the three-parameter logistic function,
after substituting estimated parameters for their unknown true values, we obtain

$$\operatorname{Var}\hat{b}_i \approx \left[\frac{D^{2}\hat{a}_i^{2}}{(1 - \hat{c}_i)^{2}}\sum_{a=1}^{N}(P_{ia} - \hat{c}_i)^{2}\,\frac{Q_{ia}}{P_{ia}}\right]^{-1}, \qquad (14\text{-}8)$$

$$\operatorname{Var}\hat{a}_i \approx \left[\frac{D^{2}}{(1 - \hat{c}_i)^{2}}\sum_{a=1}^{N}(\theta_a - \hat{b}_i)^{2}(P_{ia} - \hat{c}_i)^{2}\,\frac{Q_{ia}}{P_{ia}}\right]^{-1}. \qquad (14\text{-}9)$$

The summation is only over the N_i examinees who reached item i.


Equations (14-8) and (14-9) assume that the ability parameters θa are known.
Although the θa are actually estimated, available evidence shows that this usu­
ally increases Var (bi) and Var (âi) only slightly.
Using (14-7) and (14-8) or (14-9), and assuming asymptotic normality of bi,
one can readily make a simple asymptotic significance test of the null hypothesis
that bi1 = b i2 . A separate significance test can be made of the null hypothesis
that ai1 = ai2. It is preferable, however, to test both these hypotheses simulta-
neously. This can be done by a chi-square test. The method used is described in
the Appendix at the end of this chapter.
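These separate tests can be sketched as follows (Python with NumPy), assuming the three-parameter logistic model with the usual scaling constant D = 1.7 and treating the θ_a of the examinees who reached the item as known.

import numpy as np

D = 1.7

def p3pl(theta, a, b, c):
    return c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))

def var_b_hat(theta, a, b, c):
    P = p3pl(theta, a, b, c)
    info = (D**2 * a**2 / (1 - c)**2) * np.sum((P - c)**2 * (1 - P) / P)
    return 1.0 / info                              # Eq. (14-8)

def var_a_hat(theta, a, b, c):
    P = p3pl(theta, a, b, c)
    info = (D**2 / (1 - c)**2) * np.sum((theta - b)**2 * (P - c)**2 * (1 - P) / P)
    return 1.0 / info                              # Eq. (14-9)

def z_for_b(theta1, est1, theta2, est2):
    # Normal deviate for H0: b_i1 = b_i2; est = (a_hat, b_hat, c_hat) for one group.
    se = np.sqrt(var_b_hat(theta1, *est1) + var_b_hat(theta2, *est2))   # Eq. (14-7)
    return (est1[1] - est2[1]) / se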

14.5. PURIFICATION OF THE TEST

If many of the items are found to be seriously biased, it appears that the items are
not strictly unidimensional: The θ obtained for blacks, for example, is not strictly
comparable to the θ obtained for whites. This casts some doubt on the results
obtained when all items are analyzed together. A solution (suggested by Gary
Marco) is

1. Analyze the total test, as described in the preceding sections.


2. Remove all items that have significantly different response functions in the
groups under study. The remaining items may now be considered to be a uni­
dimensional pool, even when the groups are combined.
3. Combine all groups and estimate θ for each individual. These θ will all be
comparable.
4. For each group separately, while holding θ fixed for all individuals at the
values obtained in step 3, estimate the ai and the bi for each item. Do this for all
items, including those previously removed.
5. Compare estimated item response functions or parameters by the methods
of Section 14.4.

The resulting comparisons should be legitimate since an appropriate θ scale


is presumably being used across all groups. Table 14.5.1 shows some illustrative

TABLE 14.5.1
Approximate Significance Test for the Hypothesis That Blacks and
Whites Have Identical Item Response Functions

              a_i                   b_i
Item    Whites   Blacks      Whites   Blacks      Chi Square   Significance Level

  1       .87      .66        -1.5     -1.6          14.6            .00
  2       .28      .02        -3.3    -31.3          51.6            .00
  3       .63      .45        -1.1     -1.4           9.7            .01
  4       .85      .61          .1       .2          12.7            .00
  5       .35      .24        -1.9     -1.3          50.6            .00
  6       .82      .74         -.4      -.4           1.6            .45
  7       .56      .67         -.6      -.2          18.6            .00
  8       .50      .17         1.2      1.8          40.1            .00
  9      1.43     1.74          .5       .5           4.8            .09
 10      1.09      .88          .7       .8           3.4            .18
 11      1.64     1.39         1.7      1.7            .9            .63
 12       .49      .41         1.9      1.9           2.8            .24
 13      1.63     2.17         1.7      1.7           1.2            .55
 14      1.27     1.04         2.6      2.9            .2            .91
 15       .68      .89         3.4      2.7           1.5            .47

final results for the first 15 verbal items for the data described in Section 14.2.3
Does the SAT measure the same psychological trait for blacks as for whites?
If it measured totally different traits for blacks and for whites, Fig. 14.2.2 would
show little or no relationship between the item difficulty indices for the two
groups. In view of this, the study shows that the test does measure approximately
the same skill for blacks and whites.
The item characteristic curve techniques used here can pick out certain atypi-
cal items that should be cut out from the test. It is to be hoped that careful study
will help us understand better why certain items are biased, why certain groups of
people respond differently than others on certain items, and what can be done
about this.

14.6. CHECKING THE STATISTICAL SIGNIFICANCE TEST

The significance test used in Section 14.5 has been questioned on the grounds
that if some items are biased, unidimensionality and local independence are

3
The remainder of this section is taken, by permission, from F. M. Lord, A study of item bias,
using item characteristic curve theory. In Y. H. Poortinga (Ed.), Basic problems in cross-cultural
psychology. Amsterdam: Swets and Zeitlinger, 1977, pp. 19-29.

violated; hence the item parameter estimates are not valid. This objection is not
compelling if the test has been purified. Statistical tests of a null hypothesis (in
this case the hypothesis of no bias) are typically made by assuming that the null
hypothesis holds and then looking to see if the data fit this assumption. If not,
then the null hypothesis is rejected.
The statistical significance tests used are open to several other criticisms,
however:

1. They are asymptotic.


2. They assume that the θa are known rather than estimated.
3. They only apply to maximum likelihood estimates. Data containing omit­
ted responses (see Chapter 15) do not fit the usual dichotomous item response
model; this difficulty is dealt with by modifying the estimation procedures.

In view of these difficulties, it is wise to make some check on the adequacy of the
statistical method, such as that described below.
For the data discussed in this chapter, an empirical check was carried out. All
4500 examinees, regardless of color, were divided at random into two groups,
"reds" and "blues." The entire item bias study was repeated step by step for
these two new groups.

TABLE 14.6.1
Distribution of Significance Levels Testing the Difference Between the
Item Response Functions for Two Randomly Selected Groups of Subjects*

Significance Level    No. of Items

     .00- .05               9
     .05- .10               3
     .10- .20               6
     .20- .30               6
     .30- .40              11
     .40- .50               9
     .50- .60               7
     .60- .70               5
     .70- .80              12
     .80- .90              10
     .90-1.00              10

*Taken, by permission, from F. M. Lord, A


study of item bias, using item characteristic curve
theory. In Y. H. Poortinga (Ed.), Basic problems in
cross-cultural psychology. Amsterdam: Swets and
Zeitlinger, 1977, pp. 19-29.

Table 14.6.1 shows the 85 significance levels obtained for the 85 SAT Verbal
items. Since the groups were random groups, the significance levels should be
approximately rectangularly distributed from 0 to 1, with about 8½ items for
each interval of width .10. The actual results are very close to this.
Although Table 14.6.1 is not a complete proof of the adequacy of the statisti­
cal procedures of Section 14.4, a comparison with the complete SAT results
abstracted for Table 14.5.1 makes it very clear that blacks and whites are quite
different from random groups for present purposes. The final SAT results show a
significant difference at the 5% level between blacks and whites for 38 out of 85
items; this is quite different from the 9 out of 85 shown in Table 14.6.1 for a
comparison of random groups.
It is not claimed here that the suggested statistical significance tests are opti­
mal, nor that the parameter estimates are valid for those items that are biased. A
good additional check on the foregoing statistical analysis could be obtained by
repeating the entire comparison separately for independent samples of blacks
and whites.

APPENDIX

This appendix describes the chi-square test used in Section 14.4 for the null
hypothesis that for given i both b_{i1} = b_{i2} and a_{i1} = a_{i2}. The procedure is based
on the chi-square statistic

$$\chi_i^{2} \equiv \mathbf{v}_i'\,\boldsymbol{\Sigma}_i^{-1}\,\mathbf{v}_i, \qquad (14\text{-}10)$$

where v'_i is the vector {b̂_{i1} − b̂_{i2}, â_{i1} − â_{i2}} and Σ_i^{-1} is the inverse of the
asymptotic variance-covariance matrix for b̂_{i1} − b̂_{i2} and â_{i1} − â_{i2}.
Since â_{i1} and b̂_{i1} for whites are independent of â_{i2} and b̂_{i2} for blacks, we
have

$$\boldsymbol{\Sigma}_i = \boldsymbol{\Sigma}_{i1} + \boldsymbol{\Sigma}_{i2}, \qquad (14\text{-}11)$$

where Σ_{i1} is the sampling variance-covariance matrix of â_{i1} and b̂_{i1} in group 1,
and similarly for Σ_{i2}. These latter matrices are found for maximum likelihood
estimators from the formulas Σ_{i1} = I_{i1}^{-1} and Σ_{i2} = I_{i2}^{-1}, where I_i is the 2 × 2
information matrix for â_i and b̂_i [Eq. (12-8), (12-9), (12-11)]. The diagonal
elements of I_i are the reciprocals of (14-8) and (14-9).
The significance test is carried out separately for each item by computing χ_i^2
and looking up the result in a table of the chi-square distribution. If the null
hypothesis is true, χ_i^2 has a chi-square distribution with 2 degrees of freedom
(Morrison, 1967, p. 129, Eq. 1).
When there are more than two groups, a simultaneous significance test for
differences across groups on ai and bi can be made by multivariate analysis of
variance.
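A sketch of this test in Python (NumPy and SciPy), treating the θ values as known. The 2 × 2 information matrices are built from the standard three-parameter logistic derivatives ∂P/∂a = D(θ − b)(P − c)Q/(1 − c) and ∂P/∂b = −Da(P − c)Q/(1 − c); their squared terms reproduce the reciprocals of (14-8) and (14-9), and the vector of differences is ordered (â, b̂) to match.

import numpy as np
from scipy.stats import chi2

D = 1.7

def info_matrix(theta, a, b, c):
    # 2 x 2 Fisher information for (a, b), with c treated as fixed.
    P = c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))
    Q = 1 - P
    dPda = D * (theta - b) * (P - c) * Q / (1 - c)
    dPdb = -D * a * (P - c) * Q / (1 - c)
    g = np.vstack([dPda, dPdb]) / np.sqrt(P * Q)   # scaled gradient, one column per examinee
    return g @ g.T

def item_bias_chi_square(theta1, est1, theta2, est2):
    # est = (a_hat, b_hat, c_hat); returns the chi-square (14-10) and its significance level.
    sigma = (np.linalg.inv(info_matrix(theta1, *est1))
             + np.linalg.inv(info_matrix(theta2, *est2)))      # Eq. (14-11)
    v = np.array([est1[0] - est2[0], est1[1] - est2[1]])        # differences in a_hat and b_hat
    x2 = float(v @ np.linalg.inv(sigma) @ v)                    # Eq. (14-10)
    return x2, chi2.sf(x2, df=2)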

REFERENCES

Ironson, G. H. A comparative analysis of several methods of assessing item bias. Paper presented at
the annual meeting of the American Educational Research Association, Toronto, March 1978.
Mellenbergh, G. J. Applicability of the Rasch model in two cultures. In L. J. Cronbach & P. J. D.
Drenth (Eds.), Mental tests and cultural adaptation. The Hague: Mouton, 1972.
Morrison, D. F. Multivariate statistical methods. New York: McGraw-Hill, 1967.
15 Omitted Responses and
Formula Scoring

15.1. DICHOTOMOUS ITEMS

The simpler item response theories consider only two kinds of response to an
item. Such theories are not directly applicable if the item response can be right,
wrong, or omitted.
More complex theories deal with cases where the item response may be A, B,
C, D, or E, for example. Although these more complex theories have sometimes
been used to deal with omitted responses, it is not always obvious that the
mathematical models used are appropriate or effective for this use.

15.2. NUMBER-RIGHT SCORING

When test score is number of right answers, it is normally to the examinee's


advantage to answer every item, even if his response must be at random. If
examinees are convincingly instructed as to their best strategy, as they properly
should be, and if they act in their own best interests, there will then be no omitted
or "not-reached" responses on such tests.
Although superficially this situation seems appropriate for dichotomous-item
response theory, it is actually inappropriate whenever examinees do not have
time to finish the test. In his own best interest, an examinee who runs short of
time on a number-right scored test should quickly answer all unread items at
random. Such responses violate the item response theory model: They do not
depend on the examinee's θ.
In principle, item response theory can be applied to number-right scored tests


only if they are unspeeded. In practice, some deviation from this rule can doubt­
less be tolerated.

15.3. TEST DIRECTIONS

If a test is to measure fairly or effectively, examinees at a given ability level must


all follow the same strategy. Thus, test directions must convincingly explain to
the examinee how to act in his own self-interest. In particular, in the case of a
number-right scored test, we should not tell the examinee never to respond at
random, since it is clearly in his best interest to do so.
When a test is formula scored, the examinee's score is
$$y \equiv x - \frac{w}{A - 1}, \qquad (15\text{-}1)$$
where x is the number-right score, w is the number of wrong answers, and A is
the number of alternative choices per item. When formula scoring is used, we
may discourage the examinees from responding purely at random, since in the
long run their scores will not be improved by purely random guessing. If an
examinee does not have time to finish the test, he can, if he wishes, increase the
random error in his score by answering unread items at random. If examinees
understand the test directions fully, however, many will probably refrain from
doing this.

15.4. NOT-REACHED RESPONSES

If most examinees read and respond to items in serial order, a practical procedure
for formula-scored tests is to ignore the "not-reached" responses of each exam­
inee when making statistical inferences about examinee and item parameters.
Such treatment of not-reached responses is discussed in Section 12.4.
To summarize: If item response theory is to be applied, tests should be
unspeeded. If many examinees do not have time to finish the test, purely random
responses may be discouraged by using formula scoring and giving appropriate
directions to the examinee. The not-reached responses that appear in formula-
scored tests should be ignored during parameter estimation.

15.5. OMITTED RESPONSES

If (1) number-right scores are used, (2) proper test directions are given, (3) the
examinees understand the directions, and (4) they act in their own self-interest,
then there will be no omitted responses. If formula scoring is used with appro-

priate directions, there will be a scattering of omitted responses by many exam­


inees, in addition to any not-reached responses at the end of the test.
For the remainder of this chapter, the term omitted response, or simply omit,
implies that the examinee read the item and decided not to answer it, so that omit
and not reached are mutually exclusive categories. As a practical expedient, we
assume that omitted and not-reached responses can be distinguished by the fact
that all not-reached responses fall in a block at the end of the test. The remainder
of the chapter is concerned with the treatment of omitted responses.

15.6. MODEL FOR OMITS UNDER FORMULA SCORING

If an examinee's chance of answering a particular item correctly is better than


1/A, under formula scoring he should answer the item—he should not omit it.
The test directions to the examinee should be constructed to ensure this. In the
remainder of this chapter, we assume formula scoring, together with appropriate
test directions, unless otherwise specified.
If an examinee's chance of answering a particular item correctly is only 1/A,
we cannot predict whether he will respond or not. We can make inferences in the
other direction, however. If the test directions are effective, we can say that when
an examinee omits an item, his chance of success, were he to answer the item,
should be approximately \IA.
Of course, an examinee may not be a completely accurate judge of his own
chance of success. Empirical studies bearing on this assumption have been car­
ried out by Ebel (1968), Sax and Collet (1968), Slakter (1969), Traub and
Hambleton (1972), Waters (1967), and others. None of these studies is really
definitive. Other references are cited by Diamond and Evans (1973).

15.7. THE PRACTICAL MEANING OF AN ITEM RESPONSE FUNCTION

Before proceeding, we need to clarify a definition whose inadequacy may have


escaped the reader's attention. Suppose examinee S and examinee T both omit
the same item. If forced to respond, each would have a probability of 1/A of
answering correctly. Since Prob(ui = 1|θS) = Prob(ui = 1|θT) = 1/A, it
apparently follows that θS = θT.
But this is absurd in practice. Two examinees who omit the same item need
not be of equal ability, even approximately. Where is the fallacy?
Another paradox arising from the same source is the following. Suppose item
i and item j measure the same ability θ and have identical item response functions
with ci = cj = 0. Suppose examinee A knows the answer to item i but does not
know the answer to item j ; examinee B knows item j but does not know item i.

Such situations occur constantly in practice. Since apparently Pi(θA) > Pi(θB),
θA must be greater than θB. But also apparently Pj(θA) < Pj (θB), so θA must be
less than θB. What is the source of this absurdity?
The trouble comes from an unsuitable interpretation of the practical meaning
of the item response function Pi(θA) = Prob(uiA = 1|θ A ). If we try to interpret
Pi(θA) as the probability that a particular examinee A will answer a particular
item i correctly, we are likely to reach absurd conclusions. To obtain useful
results, we may properly

1. Interpret Pi(θA) as the probability that a particular examinee A will give


the right answer to a randomly chosen item whose parameters are ai, bi, and c i .
2. Interpret Pi(θA) as the probability that a randomly chosen examinee at
ability level θA will answer a particular item i correctly.
3. Make both of these interpretations simultaneously.

15.8. IGNORING OMITTED RESPONSES

A complete mathematical model for item response data with omits would involve
many new parameters: for example, a parameter for each examinee, representing
his behavior when faced with a choice of omitting or guessing at random. Such
complication might make parameter estimation impractical; we therefore avoid
all such complicated models here.
Since "not-reached" responses can be ignored in parameter estimation, why
not ignore omitted responses? Two lines of reasoning make it clear that we
cannot do this:

1. "Not-reached" responses contain no readily quantifiable information


about the examinee's ability θ. On the other hand, according to our model,
omitted responses specifically imply that the examinee's ability is limited: If the
examinee were required to answer, his responses to a group of such items would
be correct only 1/A of the time in the long run.
2. If we ignore omitted responses, the examinee can obtain as high a θ as he
pleases, simply by answering only those items he is sure he can answer correctly
and omitting all others.

The last argument is a compelling one.

15.9. SUPPLYING RANDOM RESPONSES

A simple solution to our problem might seem to be to require each examinee to


answer every item. If an examinee failed to follow this requirement, we could

presumably supply random responses in place of those he could have chosen.


(Note that not-reached items should not be assigned random responses: The
examinee would often do better than random if he had time to read and respond to
such items.)
If number-right scoring is used, as already noted, omitted responses should
not occur. If formula scores are used, as we assume here, a requirement to
answer all items would be unfair and unreasonable. Suppose a student, needing a
formula score of 7 to pass a course, finds that he knows the answers to exactly 8
of the 10 true-false final examination questions. We cannot properly force him to
answer the other two questions, thereby running a substantial risk of flunking the
course with a formula score of 8 - 2 = 6. Neither can we properly supply the
random responses ourselves, for the same reason.

15.10. PROCEDURE FOR ESTIMATING ABILITY

Supplying random responses in place of omits does not introduce a bias into the
examinee's formula score: His expected formula score is the same whether he
omits or responds at random. The objection to requiring him to respond is that the
required (random) responses would reduce the accuracy of measurement.
Although we can obtain unbiased estimates of ability by supplying random
responses in place of omits, introduction of random error degrades the data.
There should be some way to obtain unbiased estimates of the same parameters
without degrading the data.
A method for doing this is described in Lord (1974). The usual likelihood function

$$\prod_i \prod_a P_{ia}^{\,u_{ia}} Q_{ia}^{\,1 - u_{ia}} \qquad (4\text{-}21)$$

is replaced by

$$\prod_i \prod_a P_{ia}^{\,v_{ia}} Q_{ia}^{\,1 - v_{ia}}, \qquad (15\text{-}2)$$

where v_{ia} = u_{ia} if the examinee responds to the item and v_{ia} = 1/A if the
examinee omits the item. The product over a is to be taken only over the examinees
who actually reached item i. It should be noted that (15-2) is not a likelihood
function. Nevertheless, if the item parameters are known, the value of θa that
maximizes (15-2) is a better estimate of θ than the maximum likelihood estimate
obtained from Eq. (4-21) after replacing omits by random responses.
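A rough Python sketch of the procedure, assuming known item parameters and using a simple grid search in place of a proper numerical maximizer; it is an illustration of (15-2), not Lord's 1974 program.

import numpy as np

def p3pl(theta, a, b, c, D=1.7):
    return c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))

def estimate_theta(v, a, b, c, grid=np.linspace(-4, 4, 801)):
    # v holds u_ia for answered items, 1/A for omits, and np.nan for not-reached items.
    reached = ~np.isnan(v)
    v, a, b, c = v[reached], a[reached], b[reached], c[reached]
    def log_pseudo_likelihood(theta):
        P = p3pl(theta, a, b, c)
        return np.sum(v * np.log(P) + (1 - v) * np.log(1 - P))   # log of (15-2)
    values = np.array([log_pseudo_likelihood(t) for t in grid])
    return grid[np.argmax(values)]

# Hypothetical five-choice items; the examinee omits item 3 and never reaches item 5.
a = np.array([1.0, 0.8, 1.2, 0.9, 1.1]); b = np.array([-1.0, -0.3, 0.2, 0.8, 1.5])
c = np.full(5, 0.2)
v = np.array([1.0, 1.0, 1 / 5, 0.0, np.nan])
print(estimate_theta(v, a, b, c))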

15.11. FORMULA SCORES

If an examinee answers all n items, the number of wrong answers is w = n - x,


and his formula score (15-1) can be rewritten
$$y \equiv x - \frac{n - x}{A - 1} = \frac{Ax}{A - 1} - \frac{n}{A - 1}. \qquad (15\text{-}3)$$
If there are no omits, formula score y is a specified linear transformation of
number-right score x.
There are two ways we can predict a person's formula score from his ability θ
and from the item parameters. If we know which items examinee a answered, his
number-right true score is

$$\xi_a = \sum^{(a)} P_i(\theta_a), \qquad (15\text{-}4)$$

where the summation is over the items answered by examinee a. From this and
(15-1), the examinee's true formula score η_a is

$$\eta_a = \sum^{(a)} P_i(\theta_a) - \frac{\sum^{(a)} Q_i(\theta_a)}{A - 1}. \qquad (15\text{-}5)$$
We can estimate the examinee's observed formula score from his θ̂_a by substituting
ŷ_a for η_a and θ̂_a for θ_a in (15-5).
If examinee a answered all the items in the test, (15-5) becomes

$$\eta_a = \frac{A\sum_{i=1}^{n} P_i(\theta_a) - n}{A - 1}. \qquad (15\text{-}6)$$

This can also be derived directly from (15-3). Again, the examinee's formula
score can be estimated from his θ̂_a by substituting θ̂_a for θ_a and ŷ_a for η_a in
(15-6).
An examinee's formula score has the same expected value whether he omits
items or whether he answers them at random. If we do not know which items the
examinee omitted, we cannot use (15-5) but we can still use (15-6) if the exam­
inee finished the test.
If the examinee did not finish the test, we can use (15-5) or (15-6) to estimate his
actual formula score on the partly speeded test from his θ, provided we know
which items he did not reach: The not-reached items are simply omitted from the
summations in (15-5) and (15-6). If we do not know which items he reached, we
can still use (15-6) to estimate the formula score that he would get if given time to
finish the test.
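A small Python sketch of (15-5) and (15-6); `answered` is a hypothetical index or mask marking the items the examinee answered, and the three-parameter logistic function stands in for the P_i(θ).

import numpy as np

def predicted_formula_score(theta_hat, a, b, c, A, answered=None, D=1.7):
    P = c + (1 - c) / (1 + np.exp(-D * a * (theta_hat - b)))
    if answered is not None:
        P = P[answered]                           # Eq. (15-5): sum only over items answered
    return np.sum(P) - np.sum(1 - P) / (A - 1)    # equals Eq. (15-6) when all items are answered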

REFERENCES

Diamond, J., & Evans, W. The correction for guessing. Review of Educational Research, 1973, 43,
181-191.
Ebel, R. L. Blind guessing on objective achievement tests. Journal of Educational Measurement,
1968, 5, 321-325.
Lord, F. M. Estimation of latent ability and item parameters when there are omitted responses.
Psychometrika, 1974, 39, 247-264.
REFERENCES 231

Sax, G., & Collet, L. The effects of differing instructions and guessing formulas on reliability and
validity. Educational and Psychological Measurement, 1968, 28, 1127-1136.
Slakter, M . J . Generality of risk taking on objective examinations. Educational and Psychological
Measurement, 1969,29, 115-128.
Traub, R. E., & Hambleton, R. K. The effect of scoring instructions and degree of speededness on
the validity and reliability of multiple-choice tests. Educational and Psychological Measurement,
1972, 32, 737-758.
Waters, L. K. Effect of perceived scoring formula on some aspects of test performance. Educational
and Psychological Measurement, 1967,27, 1005-1010.
IV ESTIMATING TRUE-SCORE
DISTRIBUTIONS
16 Estimating True-Score
Distributions 1

16.1. INTRODUCTION

We have already seen [Eq. (4-5) or (4-9)] that true score ζ or ξ on a test is simply
a monotonic transformation of ability θ. The transformation is different from test
to test. If we know the distribution g(ζ) of true score, the joint distribution of true
score and observed score is

$$\phi(x, \zeta) = g(\zeta)h(x \mid \zeta), \qquad (16\text{-}1)$$

where h(x|ζ) is the conditional distribution of observed score for given true
score. The form of the conditional distribution h(x|ζ) is usually known [see Eq.
(4-1), (11-24)]; its parameters (the a_i, b_i, and c_i) can be estimated. If we can
estimate g(ζ) also, then we can estimate the joint distribution of true score and
observed score. As noted in Section 4.5, this joint distribution contains all
relevant information for describing and evaluating the properties of observed
score x as a measure of true score ζ or as a measure of ability θ. An estimated
true-score distribution is thus essential to understanding the measurement process,
the effects of errors of measurement, and the properties of observed scores
as fallible measurements.
In addition, an estimated true-score distribution can be used for many other
purposes, to be explained in more detail:

1a. To estimate the population frequency distribution of observed scores.
1b. To smooth the sample frequency distribution of observed scores.

1
Much of the material in this chapter was first presented in Lord (1969).


2. To estimate the frequency distribution of observed scores for a shortened


or lengthened form of the test.
3. To estimate the bivariate distribution of observed scores on two parallel
test forms when only one has been administered. This is useful in many ways, for
example, for determining how many of the people who failed one test form
would have passed the other.
4. To estimate the bivariate distribution of observed scores on two different
tests of the same ability.
5. To estimate national norms for a full-length test when only a short form
has been administered to the norms sample.
6. To estimate the effect of selecting on observed score instead of on true
score.
7. By using observed scores, to match two groups on true score.
8. To equate two tests.
9. To investigate whether two tests measure the same psychological trait.
10. To estimate item-test regressions and item response functions.

If h(x|ζ) is given by Eq. (4-1) or (11-24), then the estimation of a true-score


distribution is a branch of item response theory. In this chapter, we use approxi­
mations to Eq. (4-1) that do not depend on item-response-function parameters.
The latent trait theory developed here is closely related to many problems en­
countered in earlier chapters, as is apparent from the foregoing list. The theory
and its applications are presented here because of their relevance to practical
applications of item response theory.

16.2. POPULATION MODEL

If we integrate (16-1) over all true scores, we obtain the marginal distribution of
observed scores:

Φ(x) = ∫10 g(ζ)h(x|ζ) dζ (x = 0, 1, 2, .. . , n). (16-2)

Our first problem is to infer the unknown g(ζ) from Φ(x), the distribution of
observed scores in the population, presumed known, and from h(x|ζ), also
known.
If observed score x were a continuous variable, (16-2) would be a Fredholm
integral equation of the first kind. In this case it may be possible to solve (16-2)
and determine g(ζ) uniquely. Here we deal only with the usual case where x is
number-right score, so that (16-2) need hold only for x = 0, 1, 2,. . . , n.
Suppose temporarily that h(x|ζ) is binomial (see Section 4.1). Let us multiply
both sides of (16-2) by x^{[r]} ≡ x(x − 1) · · · (x − r + 1), where r is a positive
integer. Summing over all x, we have

$$\sum_{x=0}^{n} x^{[r]}\phi(x) = \int_0^1 g(\zeta)\sum_{x=0}^{n} x^{[r]}h(x \mid \zeta)\,d\zeta.$$

Now, the sum on the left is by definition the rth factorial moment of φ(x), to be
denoted by M_{[r]}; the sum on the right is the rth factorial moment of the binomial
distribution, which is known (Kendall & Stuart, 1969, Eq. 5.8) to be n^{[r]}ζ^r. The
foregoing equation can now be written

$$\frac{M_{[r]}}{n^{[r]}} = \int_0^1 \zeta^{r} g(\zeta)\,d\zeta \equiv \mu'_r \qquad (r = 1, 2, \ldots, n), \qquad (16\text{-}3)$$

where μ'r is the rth ordinary moment of the true-score distribution g(ζ). This
equation shows that when h(x|ζ) is binomial, the first n moments of g(ζ) can be
easily determined from the first n moments of the distribution of observed
scores. This last statement is still true when h(x|ζ) has the generalized binomial
distribution [Eq. (4-1)] appropriate for item response theory.
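To make the moment relationship concrete, the following sketch (Python with NumPy; the function name and the example data are illustrative assumptions, not part of any published program) computes the factorial moments M[r] of Ø(x) and converts them to the ordinary moments μ′_r of g(ζ) by Eq. (16-3), taking h(x|ζ) as binomial.

```python
import numpy as np
from math import comb

def true_score_moments(phi, n_items, n_moments=4):
    """Ordinary moments mu'_r of g(zeta) from Eq. (16-3):
    mu'_r = M_[r] / n^[r], with h(x|zeta) taken as binomial."""
    x = np.arange(n_items + 1)
    moments = []
    for r in range(1, n_moments + 1):
        x_fact = np.ones(n_items + 1)        # factorial power x^[r]
        for k in range(r):
            x_fact *= (x - k)
        M_r = np.sum(x_fact * phi)           # rth factorial moment of Ø(x)
        n_fact = 1.0
        for k in range(r):
            n_fact *= (n_items - k)          # n^[r]
        moments.append(M_r / n_fact)
    return moments

# Check: a 10-item test whose Ø(x) is binomial with ζ = .7 for everyone,
# so mu'_r should equal .7 ** r.
n = 10
phi = np.array([comb(n, k) * .7**k * .3**(n - k) for k in range(n + 1)])
print(np.round(true_score_moments(phi, n), 4))   # [.7, .49, .343, .2401]
```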
Since only n mathematically independent quantities can be determined from
the n mathematically independent values Ø(1), Ø(2), . . . , Ø(n), it follows that
the higher moments, above order n, of the true-score distribution cannot be
determined from Ø(x). Indeed, any g(ζ) with appropriate moments up through
order n will satisfy (16-3) exactly, regardless of the value of its higher moments.
Since the frequency distribution of a bounded integer-valued variable (x) is
determined by its moments, it follows that any g(ζ) with the appropriate
moments up through order n will be a solution to (16-2). Thus, the true-score
distribution for an infinite population of examinees in principle cannot be deter­
mined exactly from their Ø(x) and h(x|ζ) (x = 0, 1,. . . , n).
If two different true-score distributions have the same moments up through
order n, they have the same best fitting polynomial of degree n in the least-
squares sense (Kendall & Stuart, 1969, Section 3.34). If the distributions oscil­
late more than n times about the best fitting polynomial, they could differ
noticeably from each other. If the true-score distributions are reasonably smooth,
however, without many peaks and valleys, they will be closely fitted by the same
degree n polynomial. Since they each differ little from the polynomial, they
cannot differ much from each other. Thus any smooth g(ζ) with the required
moments up through order n will be a good approximation to the true g(ζ)
whenever the latter is smooth.
It is common experience in many diverse areas that sample frequency distri­
butions of continuous variables become smooth as sample size is increased. We
therefore assume here that the true g(ζ) is smooth.

16.3. A MATHEMATICAL SOLUTION FOR THE POPULATION

   There is no generally accepted unique way of measuring smoothness. Mathemat-
ical measures of smoothness generally depend on a constant and on a weight
function to be specified by the user.

   A function with many sharp local fluctuations is not well fitted by any smooth
function. Thus we could take

   ∫ [g(ζ) − γ(ζ)]² dζ

as a convenient measure of smoothness, where γ(ζ) is some smooth density
function specified by the user. Actually, we shall use instead the related measure

   ∫₀¹ [g(ζ) − γ(ζ)]²/γ(ζ) dζ.   (16-4)

This measure of smoothness is the same as an ordinary chi square between g(ζ)
and γ(ζ) except that summation is replaced by integration.
   The need for the user to choose γ(ζ) may seem disturbing. For most practical
purposes, however, it has been found satisfactory to choose γ(ζ) ≡ 1 or γ(ζ) ∝
ζ(1 − ζ). The choice usually makes little difference in practice. Remember that
we are finding one among many g(ζ), all of which produce an exact fit to the
population Ø(x). Any smooth solution to our problem will be very close to any
other smooth solution.
Given γ(ζ), h(x|ζ), and Ø(x), what we require is to find the g(ζ) that
minimizes (16-4) subject to the restriction that g(ζ) must satisfy (16-2) exactly
for x = 0, 1, 2 , . . . , n. This is a problem in the calculus of variations. The
solution (Lord, 1969) is
   g(ζ) = γ(ζ) Σ_{X=0}^{n} λ_X h(X|ζ),   (16-5)

the values of the λ_X being chosen so that (16-2) is satisfied for x = 0, 1, 2, . . . , n.
   To find the λ_X, substitute (16-5) in (16-2):

   Σ_{X=0}^{n} λ_X ∫₀¹ γ(ζ)h(X|ζ)h(x|ζ) dζ = Ø(x)   (x = 0, 1, . . . , n).   (16-6)

These are n + 1 simultaneous linear equations in the n + 1 unknowns λ_X. If
h(x|ζ) is binomial and if γ(ζ) is constant or a beta distribution with integer
parameters, then the integral in (16-6) can be evaluated exactly for X, x = 0, 1,
2, . . . , n. If h(x|ζ) is the generalized binomial of Eq. (4-1), we replace it by a
two- or four-term approximation (see Lord, 1969), after which the integral in
(16-6) can again be evaluated exactly. The required values of λ_X are then found
by inverting the resulting matrix of coefficients and solving the linear equations
(16-6).
To be a valid solution, the g(ζ) found from (16-5) in this way must be
nonnegative for 0 ≤ ζ ≤ 1. This requirement could be imposed as part of the
calculus of variations problem; however, the resulting solution might still be
intuitively unsatisfactory because of its angular character. A practical way of
dealing with this condition is suggested at the end of Section 16.5.
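A minimal numerical sketch of this population solution follows (Python with NumPy; γ(ζ) ≡ 1 and a binomial h(x|ζ) are assumed, the integrals in (16-6) are done by the trapezoidal rule rather than exactly, and the function name is hypothetical).

```python
import numpy as np
from math import comb

def smoothest_g(phi, n):
    """Solve the linear equations (16-6) for lambda_X and return the
    population solution g(zeta) of Eq. (16-5), taking gamma(zeta) = 1."""
    zeta = np.linspace(0.0, 1.0, 2001)
    # h[X, :] = binomial probability of score X given zeta
    h = np.array([comb(n, X) * zeta**X * (1 - zeta)**(n - X)
                  for X in range(n + 1)])
    # A[x, X] = integral of h(X|zeta) h(x|zeta) d zeta
    A = np.trapz(h[:, None, :] * h[None, :, :], zeta, axis=2)
    lam = np.linalg.solve(A, phi)
    return zeta, lam @ h                      # g(zeta) on the grid

# Example: a 5-item test whose Ø(x) comes from a uniform g(zeta),
# so Ø(x) = 1/6 for every x; the recovered g should be close to 1 everywhere.
n = 5
phi = np.full(n + 1, 1.0 / (n + 1))
zeta, g = smoothest_g(phi, n)
print(np.round(g[::500], 3))
```

Because the population Ø(x) is used here, the solution is well behaved; Section 16.4 explains why the same computation applied to sample frequencies usually produces negative values of g(ζ).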

16.4. THE STATISTICAL ESTIMATION PROBLEM

The problem solved in the last section has no direct practical application since we
never know the Ø(x) exactly. Instead, we have sample frequencies f(X) that are
only rough approximations to the Ø(x). In most statistical work, the substitution
of sample values for population values provides an acceptable approximation,
but not in the present case, as we shall see.
It is clear from (16-2) that Ø(x) is a weighted average of h(x|ζ), averaged over
0 ≤ ζ ≤ 1 with weight g(ζ). Likewise, the first difference ΔØ(x) ≡ Ø(x + 1) —
Ø(x) is a weighted average of conditional first differences h(x + 1|ζ) — h(x|ζ).
Since an average is never more than its largest component, ΔØ(x) can never be
greater than max_ζ [h(x + 1|ζ) − h(x|ζ)]. This proves that sufficiently sharp
changes in Ø(x) are incompatible with (16-2). A similar argument holds for
second- and higher-order differences. Thus any sample frequency distribution
f(x) may be incompatible with (16-2) simply because of local irregularities due to
sampling fluctuations. In such cases, any g(ζ) obtained by the methods of Sec-
tion 16.3 is negative somewhere in the range 0 ≤ ζ ≤ 1 and thus is not an
acceptable solution to our problem. This is what usually happens when f(x) is
substituted for Ø(x) in (16-5) and (16-6).
The statistical estimation problem under discussion is characterized by the fact
that a small change in the observed data produces a large change in the solution.
Such problems are important in many areas of science where the scientist is
trying to infer unobservable, causal variables from their observed effects. Re­
cently developed methods for dealing with this class of problems are discussed
by Craven and Wahba (1977), Franklin (1970, 1974), Gavurin and Rjabov
(1973), Krjanev (1974), Shaw (1973), Varah (1973) and Wahba (1977).

16.5. A PRACTICAL ESTIMATION PROCEDURE

   The difficulties discussed in Section 16.4 arise because of sampling irregularities
in the observed-score frequency distribution. The simplest way to reduce such
irregularities is to group the observed scores into class intervals. When this has
irregularities is to group the observed scores into class intervals. When this has
been done, the observations are now the grouped frequencies

   f_u ≡ Σ_{x:u} f(x)   (u = 1, 2, . . . , U),   (16-7)

where the notation indicates that the summation is to be taken over all integers x
in class interval u.
The reasoning of Section 16.3 can now be applied to the grouped frequencies.
The basic equation specifying the model is now
   Ø_u = ∫₀¹ g(ζ) Σ_{x:u} h(x|ζ) dζ   (u = 1, 2, . . . , U).   (16-8)

The "smoothest" solution to this equation is now


U
g(ζ) = γ(ζ) λu h(X|ζ). (16-9)
u=1 X:u

   The λu are the parameters of the observed-score distribution Øu. If we follow
the reasoning in Section 16.3, the U observed values of fu are just enough to
determine the U unknown parameters λu exactly. If the λu were determined from
the Øu in this way, the model (16-8) would fit the grouped sample frequencies fu
exactly (u = 1 , 2 , . . . , U). The ungrouped sample frequencies f(x), of course,
would not be fitted exactly.
   A still better procedure is to use all n + 1 sample frequencies f(x) to estimate
the parameters λu (u = 1, 2, . . . , U). The f(x) (x = 0, 1, . . . , n) are jointly
multinomially distributed: Their likelihood function is proportional to

   Π_{x=0}^{n} [Ø(x)]^{f(x)}.   (16-10)

   Assuming that the true-score distribution is given by (16-9), substitution of
(16-9) into (16-2) expresses each Ø(x) as a known function of the parameters λu:

   Ø(x) = Σ_{u=1}^{U} λ_u a_xu   (x = 0, 1, . . . , n),

where                                                               (16-11)

   a_xu ≡ Σ_{X:u} ∫₀¹ γ(ζ)h(X|ζ)h(x|ζ) dζ.

   If γ(ζ) is a constant or a beta distribution with integer exponents and h(x|ζ) is
binomial or a suitable approximation to a generalized binomial, the integration in
(16-11) can be carried out algebraically.
It is now a straightforward matter (see Lord, 1969; Stocking, Wingersky,
Lees, Lennon, & Lord, 1973) to find the values of λu (u = 1 , 2 , . . . ,U) that
maximize the likelihood (16-10) for any given observed frequencies f(x) (x = 0,
1,. . . , n). The maximizing values are the maximum likelihood estimators λu
(u = 1, 2 , . . . , U). Notice that the λu are the parameters of the true-score distri­
bution (16-9) as well as of the observed-score distribution (16-11).
Given an appropriate grouping into class intervals, our estimated true-score
distribution is therefore
   ĝ(ζ) ≡ γ(ζ) Σ_{u=1}^{U} λ̂_u Σ_{X:u} h(X|ζ),   (16-12)

provided ĝ(ζ) ≥ 0 for 0 ≤ ζ ≤ 1. If γ(ζ) is constant or a beta distribution, then
the estimated true-score distribution is a weighted sum of beta distributions.
If ĝ(ζ) turns out to be negative in the range 0 ≤ ζ ≤ 1, it is not an acceptable
solution to our problem. In principle, maximization of the likelihood should be
carried out subject to the restriction that ĝ(ζ) ≥ 0 for 0 ≤ ζ ≤ 1. Instead, it seems
to be satisfactory and is much simpler to require that λu ≥ 0 for all u. This
requirement is too restrictive, but it seems beneficial in practice. It automatically
guarantees that ĝ(ζ) will be nonnegative for 0 ≤ ζ ≤ 1.
Our estimated observed-score distribution is simply
   Ø̂(x) ≡ Σ_{u=1}^{U} λ̂_u a_xu   (x = 0, 1, . . . , n).   (16-13)
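The estimation just described can be sketched as follows (Python, assuming NumPy and SciPy's general-purpose optimizer; the grouping argument, the binomial h, the numerical integration, and the renormalization step are simplifying assumptions, and this is not the program of Stocking, Wingersky, Lees, Lennon, and Lord, 1973).

```python
import numpy as np
from math import comb
from scipy.optimize import minimize

def fit_grouped_lambdas(freq, groups, n):
    """Maximum likelihood estimates of lambda_u (Eqs. 16-10 and 16-11) with
    the restriction lambda_u >= 0; gamma(zeta) is taken as constant.
    `freq` holds the sample frequencies f(x); `groups` lists the scores X
    belonging to each class interval u."""
    zeta = np.linspace(0.0, 1.0, 2001)
    h = np.array([comb(n, X) * zeta**X * (1 - zeta)**(n - X)
                  for X in range(n + 1)])
    kernels = np.array([h[list(g)].sum(axis=0) for g in groups])  # sum over X:u
    # a[x, u] of Eq. (16-11), integrals by the trapezoidal rule
    a = np.trapz(h[:, None, :] * kernels[None, :, :], zeta, axis=2)

    def neg_loglik(lam):
        phi = a @ lam
        phi = phi / phi.sum()                # keep Ø(x) a proper distribution
        return -np.sum(freq * np.log(phi + 1e-12))

    res = minimize(neg_loglik, np.ones(len(groups)),
                   bounds=[(0.0, None)] * len(groups))
    lam = res.x / (a @ res.x).sum()          # rescale so that sum_x Ø̂(x) = 1
    phi_hat = a @ lam                        # Eq. (16-13)
    g_hat = lam @ kernels                    # Eq. (16-12) with gamma = 1
    return lam, phi_hat, zeta, g_hat
```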

16.6. CHOICE OF GROUPING

The main problem not already dealt with is the choice of class intervals for
grouping the number-right scores. Arbitrary grouping frequently fails to provide
a good fit to the data, as measured by a chi square between actual and estimated
frequencies f(x) and Ø(x). A possible automatic method for finding a successful
grouping is as follows.

1. Group the tails of the sample distribution to avoid very small values of
f(x).
2. Arbitrarily group the remainder of the score range, starting in the middle
and keeping the groups as narrow as possible, until the total number of groups is
reduced to some practical number, perhaps U = 25.
3. Estimate λu (u = 1, 2 , . . . , 25) by maximum likelihood, subject to the
restriction that λu ≥ 0.
4. As a by-product of step 3, obtain the asymptotic variance-covariance
matrix of the nonzero λu (the inverse of the Fisher information matrix).
5. Compute Ø̂(x) from (16-13).
6. Compute the empirical chi square comparing Ø̂(x) with f(x):

   X² = Σ_x [f(x) − Ø̂(x)]² / Ø̂(x).   (16-14)

7. Determine the percentile rank of X2 in a standard chi square table with
U* — U degrees of freedom, where U* is the number of class intervals at the end
of step 1.
8. If λ_u and λ_{u+1} were identical for some u, it would make no difference if
we combined intervals u and u + 1 into a single class interval (the reader may
check this assertion for himself). If λ_u and λ_{u+1} are nearly the same, it makes
little difference if we combine the two intervals.
    (a) For each u = 1, 2, . . . , U − 1, compute the asymptotic variance of
λ_{u+1} − λ_u.
    (b) Divide λ_{u+1} − λ_u (u = 1, 2, . . . , U − 1) by its asymptotic standard
error.
(c) Find the value of u for which the resulting quotient is smallest (u =
1, 2 , . . . , U — 1); for this value of u, combine interval u and interval
u + 1 into a single interval.
9. Repeat steps 3 through 8, reducing U by 1 at each repetition.
10. At first, the percentile rank of X2 will decrease at each repetition, due to
the increase in degrees of freedom. When the percentile rank of X2 no longer
decreases, stop the process and use ĝ(ζ) from (16-12) as the estimated true-score
distribution.

Under the procedure suggested above, the grouping is determined by the data.
Thus, strictly speaking, the resulting X2 no longer has a chi square distribution
with U* - U degrees of freedom. If an accurate chi square test of significance is
required, the data should be split into random halves and the grouping deter­
mined from one half as described above. A chi-square test of significance can
then be properly carried out, using this grouping, on the other half of the data.
Chi-square significance levels quoted in this chapter and in the next chapter
are computed as in step 10. Thus the significance levels quoted are only nominal;
they are numerically larger (less "significant") than they should be.
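Steps 6 through 8 of the foregoing procedure can be sketched as follows (Python; SciPy's chi-square distribution is assumed for the percentile rank, the covariance matrix of the estimated λ's is assumed to be available from step 4, and the "smallest quotient" of step 8 is interpreted here as smallest in absolute value).

```python
import numpy as np
from scipy.stats import chi2

def chi_square_percentile(freq, phi_hat_counts, df):
    """Steps 6-7: X-square of Eq. (16-14) and its nominal percentile rank.
    freq and phi_hat_counts are observed and fitted counts at each score x."""
    keep = phi_hat_counts > 0
    x2 = np.sum((freq[keep] - phi_hat_counts[keep])**2 / phi_hat_counts[keep])
    return x2, 100.0 * chi2.cdf(x2, df)

def interval_to_merge(lam, cov):
    """Step 8: the u for which (lam[u+1] - lam[u]) / SE is smallest, using the
    asymptotic covariance matrix of the estimated lambdas."""
    diffs = np.diff(lam)
    var_diffs = np.array([cov[u, u] + cov[u + 1, u + 1] - 2.0 * cov[u, u + 1]
                          for u in range(len(lam) - 1)])
    return int(np.argmin(np.abs(diffs) / np.sqrt(var_diffs)))
```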

16.7. ILLUSTRATIVE APPLICATION

   Seven estimated true-score distributions, obtained by the methods of this chapter,
are seen in Fig. 6.8.1. A different illustration is presented and considered here in
more detail. Figure 16.7.1 shows the estimated true-score distribution and the
estimated observed-score distribution obtained for a nationally known test of
vocabulary administered to a nationally representative sample of 1715 sixth-
grade pupils. The test consists of 42 four-choice items. These data were chosen
for the illustration because of the interesting results. For most data, the true-score
distribution is similar in shape to the observed-score distribution, except that it
has a slightly smaller variance, because observed scores contain errors of mea­
surement. These data show an interesting exception.
Before discussing the true-score distribution, some special features in the
treatment of the data will be noted. There is no provision in the mathematical
model (16-2) for dealing with omitted item responses or with examinees who fail
to finish the test. Each omitted item response in the present data was replaced by
a randomly chosen response. This procedure is adequate for dealing with items
that the examinee has considered and then omitted. It is not appropriate for
dealing with items that the examinee did not reach in the allowed testing time; for
this reason, examinees who did not answer the last item in the test were simply
excluded from the data.
FIG. 16.7.1. Sample observed-score distribution (irregular polygon), estimated population true-
score and observed-score distributions, sixth-grade vocabulary test, N = 1715. [Figure: the two
estimated distributions and the sample polygon plotted against proportion-right score, 0.0 to 1.0.]

   We know from item response theory that true score ζ can never be less than
Σ_{i=1}^{n} c_i/n. In the case of Fig. 16.7.1, estimated ĉ_i were available. These values
were utilized by setting a lower limit of Σ_{i=1}^{n} ĉ_i/n = .225755 to the range of ζ,
requiring that g(ζ) = 0 when ζ < .225755. This requirement was imposed by
replacing the lower limit 0 in the integrals of (16-8) and (16-11) by .225755. The
integral in (16-11), now an incomplete beta function, can be evaluated by recur­
sive procedures (Jordan, 1947, Section 25, Eq. 5) without using approximate
methods, as long as γ(ζ) is either a constant or a beta function with integer
exponents.
In the case of Fig. 16.7.1, γ(ζ) was taken as constant. The figure shows the
resulting estimated true- and observed-score distributions. The estimated true-
score distribution has U — 1 = 3 independent parameters λu. The chi square
(16-14) is 23.5; nominally, the degrees of freedom are 30, suggesting a good fit.
The results are considered further in the next section.

16.8. BIMODALITY

   The sample observed-score distribution shown in Fig. 16.7.1 has an unusual
shape. One wonders if this shape results from a rectangular or bimodal distribu-
tion of ability θ in the group tested. (Note that unimodality of the distribution of θ
does not necessarily imply unimodality of the true-score distribution; the relation
between the two distributions depends on the test characteristic curve.)
   It is intuitively obvious, regardless of the distribution of θ, that a "peaked"
test, consisting of items all of equal difficulty, can produce a bimodal distribution
of observed scores providing the items are sufficiently highly intercorrelated (see
Section 4.4). It is an important question in general whether a bimodal observed-
score distribution should be attributed to the characteristics of the group tested or
simply to distortions introduced by the measuring instrument (Section 4.4).
   When there is no guessing and the ability θ is normally distributed, the
tetrachoric item intercorrelations must be at least .50 to produce a bimodal
observed-score distribution, according to the normal ogive item characteristic
curve model (Lord, 1952, Section D). This is a much higher correlation than is
ordinarily ever attained for multiple-choice items. When there is guessing, how­
ever, as in the present situation where the test is composed of four-choice items,
it is not so easy to reach a conclusion.
   Some computer runs, using the three-parameter normal ogive model with all
c_i = .25, throw light on this matter. The computer runs simulate the administra-
tion of various medium-difficulty tests, each composed of 40 four-choice items
all of equal difficulty, to a group of examinees in which ability θ is normally
distributed. The various tests differ only in a_i, assumed to be the same for all
items within a test. Figure 16.8.1 shows the frequency distribution of number-
right scores obtained for three such tests. As item-ability correlation increases,
FIG. 16.8.1. Frequency distribution and reliability (r) of number-right observed
score for three hypothetical peaked tests differing only in item discriminating
power (a): r = .882 for a = .8; r = .898 for a = .9; r = .910 for a = 1.0.
[Figure: three frequency polygons plotted against raw score, 0 to 40.]

bimodality appears when the KR-20 reliability of number-right scores reaches
r = .895 (a_i = .9), approximately.
When adjusted to a standard length of 40 items, the actual Kuder-Richardson
formula-20 test reliability computed from the data used to obtain Fig. 16.7.1 was
r = .925. It thus appears that the bimodal distribution in that figure may be
attributable to the measuring instrument rather than to some special characteristic
of the group tested.

16.9. ESTIMATED OBSERVED-SCORE DISTRIBUTION

   The estimated observed-score distribution Ø̂(x) is sometimes of interest for its
own sake. Equation (16-13) is a complicated but effective way of smoothing a
sample distribution of observed scores. Unlike many other methods, it has the
advantage that the smoothing (1) does not introduce negative frequencies;
(2) preserves a total relative frequency of exactly 1; (3) does not introduce any
frequencies outside the permissible range 0 ≤ x ≤ n; and (4) is compatible with
relevant mental test theory. Equations (16-8), (16-9), and (16-13) can be used
also to estimate ungrouped frequencies of test scores when the only available data
are grouped.

16.10. EFFECT OF A CHANGE IN TEST LENGTH

If a test is lengthened by adding parallel forms of the test, the true score of each
person remains unchanged; thus g(ζ) is also unchanged. Any change in test
length n changes h(x|ζ) in a known way. Thus the theoretical effect of test length
on Ø(x) can be determined from (16-2).

FIG. 16.10.1. Estimated population true- and observed-score distributions, sixth-grade vocabulary
test, N = 1715. [Figure: the true-score distribution and the observed-score distributions for test
lengths n = 5, 10, 20, 40, 80, and 160, plotted against proportion-right score.]
In practical applications, we have the estimated true-score distribution (16-
12). In this case, the effect of test length on Ø(x) can be determined by varying n
in (16-11) and (16-13). The a_xu defined by (16-11) must be recomputed each
time n is changed; the estimates of λu are supposed to be unaffected
by changes in n.
Figure 16.10.1 shows estimated proportion-correct observed-score frequency
distributions when the 42-item vocabulary test of Fig. 16.7.1 is shortened or
lengthened to n = 5, 10, 20, 40, 80, 160, or ∞. As n becomes large, the
distribution of proportion-correct score z ≡ x/n approaches g(ζ). For small n,
observed- and true-score distributions may have very different shapes, as illus­
trated.
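The computation behind Fig. 16.10.1 can be sketched as follows (Python with NumPy; h(x|ζ) is taken as binomial instead of the generalized binomial, and the example ĝ(ζ) is made up for illustration).

```python
import numpy as np
from math import comb

def shortened_or_lengthened(zeta, g_hat, n_new):
    """Observed-score distribution for a test of length n_new, obtained by
    integrating the fixed ĝ(ζ) against a binomial h(x|ζ), as in (16-2)."""
    phi = np.array([np.trapz(g_hat * comb(n_new, x) * zeta**x
                             * (1 - zeta)**(n_new - x), zeta)
                    for x in range(n_new + 1)])
    return phi / phi.sum()                    # absorb small quadrature error

# Illustration with a made-up smooth ĝ(ζ):
zeta = np.linspace(0.0, 1.0, 2001)
g_hat = zeta**2 * (1.0 - zeta)
g_hat /= np.trapz(g_hat, zeta)
for n_new in (5, 10, 40):
    print(n_new, np.round(shortened_or_lengthened(zeta, g_hat, n_new)[:5], 3))
```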

16.11. EFFECTS OF SELECTING ON OBSERVED SCORE:
EVALUATION OF MASTERY TESTS

   Typically, we would like to select individuals on true score rather than on
observed score. What is the effect of selecting on observed score? This question
can be answered using the estimated joint distribution (16-1) of true score ζ and
observed score x, determining from this distribution the effect of selecting on x.
The general principles are illustrated by discussing the evaluation of a mastery
test. Chapter 11 develops a theory of mastery testing without requiring knowl­
edge of the distribution of ability or of true scores in the group tested. Such a
theory is particularly useful when a particular test with a predetermined cutting
score is to be used in many different groups having different distributions of
ability. In contrast, we see here a little of what can be done once the frequency
distribution of true scores has been estimated for a particular group.
Figure 16.11.1 shows the observed-score distribution f(x) for a nationwide
sample of 2395 high school seniors taking a 65-item Basic Skills Reading test.
The figure also shows the corresponding estimated population observed-score
distribution Ø̂(x) and the estimated true-score distribution ĝ(ζ). The chi square is
6.33 with 12 degrees of freedom, showing a good fit of the model to the data.
Since the test is intended to determine whether high school seniors can do the
kind of reading required in adult life (reading medicine labels, guarantees, em­
ployment application forms, and so forth), it is not surprising that most high
school seniors obtain high scores on the test.
Table 16.11.1 shows the estimated bivariate cumulative frequency distribu­
tion of true score and observed score for all students tested. The cumulative
frequencies are shown only for values of ζ that are multiples of .05. The table is
obtained by applying the trapezoidal rule to ordinates of the noncumulative
distribution (16-1) of ζ and x and then cumulating across rows. The table entry at
(ζ0, x0) shows the number of cases out of 1000 for whom ζ ≤ ζ0 and x ≤ x0.
FIG. 16.11.1. Sample observed-score distribution and estimated population
observed-score and true-score distributions for the Basic Skills Reading test.
[Figure: f(x) and Ø̂(x) plotted against observed score x, 0 to 60; ĝ(ζ) plotted against ζ, 0.0 to 1.0.]

   Table 17.5.1 compares observed-score and true-score distributions for a re-
jected group (x ≤ 38). The Test x column of Table 17.5.1 is the same as the
lower part of the last column of Table 16.11.1 except that Table 17.5.1 is
lower part of the last column of Table 16.11.1 except that Table 17.5.1 is
noncumulative. The number-right true-score distribution shown in Table 17.5.1
corresponds to the cumulative distribution of proportion-correct true scores in
row 38 of Table 16.11.1. Table 17.5.1 illustrates the effect of selecting on
observed score instead of on true score. Because of the regression effect, the two
distributions are rather different in this example.
To illustrate another use of Table 16.11.1, suppose it is decided that a true
score above ζ = .60 represents satisfaction of minimal qualifications and that a
true score below ζ = .60 represents failure to meet minimal qualifications. The
top row of the table shows that about .046 of all students are unqualified. If we
TABLE 16.11.1
Students at or below a Given Observed Score and a Given True
Score (Proportion of All Students Multiplied by 1000) Basic Skills
Assessment Program, Reading Test

True Score
Observed
Score .35 .40 .45 .50 .55 .60 .65 .70 .75 .80 .85 .90 .95 1.00

65 9 12 15 20 30 46 66 89 117 160 228 325 584 1000


64 9 12 15 20 30 46 66 89 117 160 228 325 582 911
63 9 12 15 20 30 46 66 89 117 160 228 325 571 788
62 9 12 15 20 30 46 66 89 117 160 228 325 542 663
61 9 12 15 20 30 46 66 89 117 160 228 323 497 552
60 9 12 15 20 30 46 66 89 117 160 228 317 441 462
59 9 12 15 20 30 46 66 89 117 160 227 308 386 392
58 9 12 15 20 30 46 66 89 117 160 225 294 338 339
57 9 12 15 20 30 46 66 89 117 160 221 276 297 298
56 9 12 15 20 30 46 66 89 116 159 215 255 265 265
55 9 12 15 20 30 46 66 89 116 158 207 233 237 237
54 9 12 15 20 30 46 66 89 116 155 196 212 213 213
53 9 12 15 20 30 46 66 89 116 152 183 192 192 192
52 9 12 15 20 30 46 66 89 115 147 169 173 174 174
51 9 12 15 20 30 46 66 89 113 140 155 157 157 157
50 9 12 15 20 30 46 66 88 111 133 142 143 143 143
49 9 12 15 20 30 46 66 88 108 124 129 130 130 130
48 9 12 15 20 30 46 66 87 104 115 118 118 118 118
47 9 12 15 20 30 46 66 85 99 107 108 108 108 108
46 9 12 15 20 30 46 65 83 94 98 99 99 99 99
45 9 12 15 20 30 46 64 79 88 90 91 91 91 91
44 9 12 15 20 30 45 63 76 81 83 83 83 83 83
43 9 12 15 20 30 45 61 71 75 76 76 76 76 76

(continued)
TABLE 16.11.1
(continued)

True Score
Observed
Score .35 .40 .45 .50 .55 .60 .65 .70 .75 .80 .85 .90 .95 1.00

42 9 12 15 20 29 44 58 66 68 69 69 69 69 69
41 9 12 15 20 29 43 55 61 62 62 62 62 62 62
40 9 12 15 20 29 42 52 56 56 56 56 56 56 56
39 9 12 15 20 29 40 48 50 51 51 51 51 51 51
38 9 12 15 20 28 38 44 45 45 45 45 45 45 45.5
37 9 12 15 20 27 36 39 40 40 40 40 40 40 40.6
36 9 12 15 19 26 33 35 36 36 36 36 36 36 36.1
35 9 12 15 19 25 30 32 32 32 32 32 32 32 32.1
34 9 12 15 19 24 27 28 28 28 28 28 28 28 28.5
33 9 12 15 18 22 25 25 25 25 25 25 25 25 25.4
32 9 12 15 18 21 22 23 23 23 23 23 23 23 22.7
31 9 12 14 17 19 20 20 20 20 20 20 20 20 20.4
30 9 11 14 16 18 18 18 18 18 18 18 18 18 18.4
29 9 11 14 15 16 17 17 17 17 17 17 17 17 16.7
28 9 11 13 14 15 15 15 15 15 15 15 15 15 15.2
27 9 11 13 13 14 14 14 14 14 14 14 14 14 13.9
26 8 11 12 12 13 13 13 13 13 13 13 13 13 12.7
25 8 10 11 11 12 12 12 12 12 12 12 12 12 11.6
24 8 10 10 10 11 11 11 11 11 11 11 11 11 10.6
23 8 9 9 10 10 10 10 10 10 10 10 10 10 9.7
22 7 8 9 9 9 9 9 9 9 9 9 9 9 8.7

reject all students with x ≤ 38 (refuse to graduate them from high school), the
right-hand column shows that we shall be rejecting .045 of all students. The table
entry at (.60, 38) shows that .038 of all students lie at or below ζ = .60 and also
at or below x = 38; these students are all rightly rejected.
   From the foregoing numbers we can compute the following 2 × 2 table, the rows
split at observed score x = 38.5 and the columns at true score ζ = .60:

                      unqualified               qualified                total
accepted     (.046 − .038 =) .008     (.954 − .007 =) .947    (1 − .045 =) .955
rejected                     .038     (.045 − .038 =) .007                 .045
total                        .046     (1 − .046 =)    .954

This shows that .008 of the total group were accepted even though they were
really unqualified and that .007 of the total group were rejected even though they
were really qualified. These two proportions are useful for summarizing the
effectiveness of the minimum qualifications reading test, since they represent the
proportion of students erroneously classified. The foregoing procedure is de­
scribed and implemented by Livingston (1978).
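The arithmetic of the 2 × 2 table can be reproduced directly from three entries of Table 16.11.1, as in the short sketch below (the variable names are illustrative, not the book's).

```python
# Proportions read from Table 16.11.1 (entries divided by 1000):
p_unqualified = 0.046   # row 65, column zeta = .60:  P(zeta <= .60)
p_rejected    = 0.045   # row 38, last column:        P(x <= 38)
p_both        = 0.038   # entry (.60, 38):            P(zeta <= .60 and x <= 38)

false_accept = p_unqualified - p_both     # unqualified but accepted = .008
false_reject = p_rejected - p_both        # qualified but rejected   = .007
print(false_accept, false_reject)         # the two misclassification rates
```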

16.12. ESTIMATING ITEM TRUE-SCORE REGRESSION

An item-test regression (Section 3.1) can be computed for each observed score x
as follows: Divide the number of examinees at x who answer the item correctly
by the total number of examinees at x. An item-true-score regression can in
principle be obtained similarly. If g_i(u_i, ζ) denotes the bivariate density function
of u_i (item score) and ζ (proportion-correct true score), and if g(ζ) is the (margi-
nal) density of ζ, then the item-true-score regression may be found from

   ξ(u_i|ζ) = g_i(1, ζ)/g(ζ).   (16-15)
The denominator on the right of (16-15) can be estimated by (16-12). If we
apply (16-12) to the subgroup of examinees who answer item i correctly, we
obtain an estimate of g_i(ζ|u_i = 1), the conditional distribution of true score for
examinees who answer item i correctly. The numerator in (16-15) is g_i(1, ζ) =
π_i g_i(ζ|u_i = 1), where π_i is the proportion of all examinees who answer item i cor-
answers in the total group, we can use (16-12) to estimate both the numerator and
the denominator of (16-15) and thus to estimate the item-true-score regression.
   Let ζ_n denote true score on an n-item test; let ζ_{n−1} denote true score on the
same test excluding item i. This use of (16-12) is appropriate only if item i is
excluded from the items used to determine number-right score x. Thus, (16-12)
and (16-15) yield an estimate of the regression of u_i on ζ_{n−1}.

   If the item response function parameters are known, then ζ_n is a known
monotonic function of ζ_{n−1}. This functional relation is given by the parametric
equations

   ζ_n ≡ (1/n) Σ_j P_j(θ),
                                                              (16-16)
   ζ_{n−1} ≡ [1/(n − 1)] Σ_{j≠i} P_j(θ).

By eliminating θ from (16-16) numerically, any value of ζ_{n−1} can be transformed
to the corresponding value of ζ_n. Thus any regression of u_i on ζ_{n−1} can be used
to write the regression of u_i on ζ_n. This is done simply by replacing ζ_{n−1} in
ξ(u_i|ζ_{n−1}) by the corresponding value of ζ_n.
If the numerator and denominator of (16-15) are each independently estimated
by (16-12), chance fluctuations may allow the estimate of the numerator to be
larger than the estimate of the denominator when ζ is near 1.0. This can result in
an estimated item-true-score regression that is larger than 1.0 when ζ is near 1.0.
Such an awkward result of sampling fluctuations can be avoided by estimating
the item-true-score regression using the following equivalent of (16-15):

gi (1, ζ) (16-17)
Ξ(ui|Ζ) =
g i (1, ζ) + gi(0, ζ)
The distribution gi(0, ζ) is estimated by applying (16-12) to the group of exam­
inees who answered item i incorrectly.
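A pointwise sketch of (16-17) follows (Python; it assumes that ĝ(ζ) has already been fitted separately, by the methods of this chapter, to the right-answering and wrong-answering subgroups, and that both estimates are available on a common grid of ζ values).

```python
import numpy as np

def item_true_score_regression(g_right, g_wrong, pi_i):
    """Eq. (16-17): estimated regression of item score u_i on true score.
    g_right and g_wrong are ĝ(ζ) for the correct and incorrect subgroups on a
    common ζ grid; pi_i is the observed proportion answering item i correctly."""
    g1 = pi_i * g_right                # estimate of g_i(1, ζ)
    g0 = (1.0 - pi_i) * g_wrong        # estimate of g_i(0, ζ)
    return g1 / (g1 + g0 + 1e-12)      # stays between 0 and 1 by construction
```

Because numerator and denominator share the term g_i(1, ζ), this estimate cannot exceed 1, which is the advantage of (16-17) over (16-15) noted above.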

16.13. ESTIMATING ITEM RESPONSE FUNCTIONS

The relation

   ζ_{n−1} ≡ ζ_{n−1}(θ) ≡ [1/(n − 1)] Σ_{j≠i} P_j(θ)   (16-18)

transforms ζ_{n−1} to θ. Since P_i(θ) ≡ ξ(u_i|θ), (16-18) can be used to convert an
estimated item-true-score regression into an estimated item response function
(regression of item score on ability). Thus the item response function can be
written

   P_i(θ) = g_i[1, ζ_{n−1}(θ)] / {g_i[1, ζ_{n−1}(θ)] + g_i[0, ζ_{n−1}(θ)]}.   (16-19)
In practice, the item-true-score regression is estimated by (16-17). Then the
base scale is transformed from true score to ability, using (16-18), to obtain the
estimated item response function (16-19).
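A sketch of the conversion follows (Python; the three-parameter logistic curves used for the other items, the θ grid, and the made-up regression are all illustrative assumptions).

```python
import numpy as np

def irf_from_regression(theta, zeta_grid, regression, other_item_curves):
    """Eqs. (16-18) and (16-19): evaluate ζ_{n-1}(θ) from the known response
    functions of the other n − 1 items, then read the estimated regression of
    u_i on ζ_{n-1} at those values to obtain the item response function."""
    zeta_of_theta = np.mean([P(theta) for P in other_item_curves], axis=0)
    return np.interp(zeta_of_theta, zeta_grid, regression)

# Illustration with hypothetical three-parameter logistic curves (D = 1.7):
def logistic_3pl(a, b, c):
    return lambda t: c + (1.0 - c) / (1.0 + np.exp(-1.7 * a * (t - b)))

others = [logistic_3pl(1.0, b, 0.2) for b in (-1.0, -0.5, 0.0, 0.5, 1.0)]
theta = np.linspace(-3.0, 3.0, 61)
zeta_grid = np.linspace(0.0, 1.0, 101)
regression = zeta_grid**1.5           # a made-up regression of u_i on ζ_{n-1}
print(np.round(irf_from_regression(theta, zeta_grid, regression, others)[:3], 3))
```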

Results obtained by this method are illustrated by the dashed curves in Fig.
2.3.1. The solid curves are three-parameter logistic functions computed by Eq.
(2-1) from maximum likelihood estimates â_i, b̂_i, and ĉ_i. The agreement be-
tween the two methods of estimation is surprisingly close, especially so when
one considers that the methods of this chapter are based on data and on assump­
tions very different from the data and assumptions used to obtain the logistic
curves (solid lines) in Fig. 2.3.1. An explicit listing and contrasting of the data
and assumptions used by the two methods is given in Lord (1970), along with
further details of the procedure used. Assuming they are confirmed on other sets
of data, results such as shown in Fig. 2.3.1 suggest that the three-parameter
logistic function is quite effective for representing the response functions of items
in published tests.

REFERENCES

Craven, P., & Wahba, G. Smoothing noisy data with spline functions: Estimating the correct degree
of smoothing by the method of generalized cross-validation. Technical Report No. 445. Madison,
Wis.: Department of Statistics, University of Wisconsin, 1977.
Franklin, J. N. Well-posed stochastic extensions of ill-posed linear problems. Journal of Mathemati­
cal Analysis and Applications, 1970, 31, 682-716.
Franklin, J. N. On Tikhonov's method for ill-posed problems. Mathematics of Computation, 1974,
28, 889-907.
Gavurin, M. K., & Rjabov, V. M. Application of Čebyšev polynomials in the regularization of
   ill-posed and ill-conditioned equations in Hilbert space. (In Russian) Žurnal Vyčislitel'noî
   Matematiki i Matematičeskoî Fiziki, 1973, 13, 1599-1601, 1638.
Jordan, C. Calculus of finite differences (2nd ed.). New York: Chelsea, 1947.
Kendall, M. G., & Stuart, A. The advanced theory of statistics (Vol. 1). New York: Hafner, 1969.
Krjanev, A. V. An iteration method for the solution of ill-posed problems. (In Russian) Žurnal
   Vyčislitel'noî Matematiki i Matematičeskoî Fiziki, 1974, 14, 25-35, 266.
Livingston, S. Reliability of tests used to make pass-fail decisions: Answering the right questions.
Paper presented at the meeting of the National Council on Measurement in Education, Toronto,
March 1978.
Lord, F. M. A theory of test scores. Psychometric Monograph No. 7. Psychometric Society, 1952.
Lord, F. M. Estimating true-score distributions in psychological testing (An empirical Bayes estima­
tion problem). Psychometrika, 1969, 34, 259-299.
Lord, F. M. Item characteristic curves estimated without knowledge of their mathematical form—a
confrontation of Birnbaum's logistic model. Psychometrika, 1970, 35, 43-50.
Shaw, C. B., Jr. Best accessible estimation: Convergence properties and limiting forms of the direct
and reduced versions. Journal of Mathematical Analysis and Applications, 1973, 44, 531-552.
Stocking, M., Wingersky, M. S., Lees, D. M., Lennon, V., & Lord, F. M. A program for
estimating the relative efficiency of tests at various ability levels, for equating true scores, and for
predicting bivariate distributions of observed scores. Research Memorandum 73-24. Princeton,
N.J.: Educational Testing Service, 1973.
Varah, J. M. On the numerical solution of ill-conditioned linear systems with applications to ill-posed
problems. SIAM Journal on Numerical Analysis, 1973, 10, 257-267.
Wahba, G. Practical approximate solutions to linear operator equations when the data are noisy.
SIAM Journal on Numerical Analysis, 1977, 14, 651-667.
17 Estimated True-Score
Distributions for Two Tests

17.1. MATHEMATICAL FORMULATION

This chapter considers problems involving two or more tests of the same trait. In
every discussion of tests x and y here, it is assumed that the ability θ is the same
for both tests.
   The trivariate distribution of x, y, and θ for any population may be written
[compare Eq. (16-1)]

   Ø(x, y, θ) = g*(θ)h₁*(x|θ)h₂*(y|θ),   (17-1)

where g* is the distribution of θ and h₁* and h₂* are the conditional distributions
of observed scores x and y for given θ. The bivariate distribution of x and y is
thus

   Ø(x, y) = ∫_{−∞}^{∞} g*(θ)h₁*(x|θ)h₂*(y|θ) dθ.   (17-2)

   Now, the proportion-correct true scores ζ and η are related to θ by the formulas

   ζ ≡ (1/n_x) Σ_{i=1}^{n_x} P_i(θ),   η ≡ (1/n_y) Σ_{j=1}^{n_y} P_j(θ),   (17-3)

where i indexes the n_x items in test x, and j indexes the n_y items in test y. Thus
after a transformation of variables, (17-2) can now be written [compare Eq.
(16-2)]

   Ø(x, y) = ∫₀¹ g(ζ)h₁(x|ζ)h₂[y|η(ζ)] dζ,   (17-4)

where g(ζ) is the same as in Chapter 16, h₁(x|ζ) is the same as h(x|ζ) in Chapter
16, h₂(y|η) is the conditional distribution of y, and η ≡ η(ζ) is the transforma-
tion relating η to ζ, obtained from (17-3) by elimination of θ.
If the item parameters are known, then h1* and h2* are known and it should be
possible in principle to estimate g*(θ) from Ø(x, y) using (17-2); equivalently, it
should be possible to estimate g(ζ) from Ø(x, y) using (17-4). Full-length nu­
merical procedures for doing this would be complicated and have not been
implemented. Some short-cut procedures (using a series approximation to the
generalized binomial) are the subject of this chapter. Illustrative results are
presented.

17.2. BIVARIATE DISTRIBUTION OF OBSERVED
SCORES ON PARALLEL TESTS

   If x and y are parallel test forms, then ζ and η are identical and also h₁ and h₂ are
identical. In this case, (17-4) becomes

   Ø(x, y) = ∫₀¹ g(ζ)h(x|ζ)h(y|ζ) dζ.   (17-5)

   As in Chapter 16, the conditional distribution h is considered known: It is
binomial or the generalized binomial of Section 4.1. When h is known or
approximated and g is estimated by the methods of Chapter 16, then the bivariate
distribution of x and y can be obtained from (17-5) by numerical integration.
Thus the bivariate distribution of observed scores on two parallel forms can be
deduced from a single administration of just one of the forms.
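A numerical sketch of (17-5) follows (Python with NumPy; a binomial h is used in place of the generalized binomial, and ĝ(ζ) is assumed already available on a grid).

```python
import numpy as np
from math import comb

def parallel_forms_joint(zeta, g_hat, n):
    """Eq. (17-5): joint distribution of number-right scores x and y on two
    parallel forms, by numerical integration of ĝ(ζ)h(x|ζ)h(y|ζ)."""
    h = np.array([comb(n, x) * zeta**x * (1 - zeta)**(n - x)
                  for x in range(n + 1)])
    phi_xy = np.trapz(g_hat * h[:, None, :] * h[None, :, :], zeta, axis=2)
    return phi_xy / phi_xy.sum()
```

Cumulating the result over rows and columns gives a table of the kind shown in Table 17.2.1.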
Table 17.2.1 shows part of the estimated bivariate cumulative distribution of
observed scores on two parallel forms of the Basic Skills Reading test discussed
in Section 16.11. It would be desirable to check this estimated distribution
against actual frequencies of scores on two parallel forms. This has not been done
since actual scatterplots for parallel forms are not available.
Reading down the right-hand column in Table 17.2.1, we see that an esti­
mated 45 examinees out of 1000 will be rejected by a cutting score of 38.5 on
form x. Reading down column 38, we see that an estimated 34 of these 45
examinees would have been rejected if they had taken parallel test form y instead
of x and 11 of these 45 examinees would not have been rejected by form y. This
provides one way to describe the consistency of basic skills assessment without
having to talk about unobservable true scores.

17.3. TRUE-SCORE EQUATING

If test x and test y are different measures of the same trait, their proportion-
correct true scores, ζ and η, have a mathematical relationship. This relation

TABLE 17.2.1
Estimated Joint Cumulative Distribution of Number-Right
Observed Scores on Two Parallel Test Forms, x and y, of the Basic
Skills Assessment Reading Test

y=23 26 29 32 35 38 41 44 47 50 53 56 59 62 65
x
65 10 13 17 23 32 45 62 83 108 143 192 265 392 663 1000
62 10 13 17 23 32 45 62 83 108 143 192 264 377 550 663
59 10 13 17 23 32 45 62 83 108 143 190 250 319 377 392
56 10 13 17 23 32 45 62 83 108 140 179 220 250 264 265
53 10 13 17 23 32 45 62 82 105 132 159 179 190 192 192
50 10 13 17 23 32 45 62 80 99 118 132 140 143 143 143
47 10 13 17 23 32 45 60 76 89 99 105 108 108 108 108
44 10 13 17 23 32 43 56 67 76 80 82 83 83 83 83
41 10 13 17 22 30 40 49 56 60 62 62 62 62 62 62
38 10 13 16 21 27 34 40 43 45 45 45 45 45 45 45
35 10 12 16 20 24 27 30 32 32 32 32 32 32 32 32
32 9 12 15 17 20 21 22 23 23 23 23 23 23 23 23
29 9 11 13 15 16 16 17 17 17 17 17 17 17 17 17
26 9 10 11 12 12 13 13 13 13 13 13 13 13 13 13
23 8 9 9 9 10 10 10 10 10 10 10 10 10 10 10

could be determined by the method of Eq. (6-17) if the univariate frequency
distribution of both true scores were known for some population of examinees.
   If test x and test y have been administered to separate random samples of
examinees from the same population, their true-score distributions g(ζ) and q(η)
can be estimated by the methods of Chapter 16. The relation η(ζ) can then be
estimated from ĝ(ζ) and q̂(η) using the equipercentile relationship [Eq. (6-17)]:

   ∫_{−∞}^{η(ζ₀)} q̂(η) dη ≡ ∫_{−∞}^{ζ₀} ĝ(ζ) dζ.   (17-6)

This equation asserts that ζ₀ and η₀ ≡ η(ζ₀) have identical percentile ranks in
their respective distributions. Numerical values of the function η(ζ₀) for given
values of ζ₀ are found in practice from (17-6) by numerical integration and
inverse interpolation. The result is an estimated true-score equating of ζ and η.
This method of equating does not make use of the responses of each examinee to
each item, as do the methods of sections 13.5 and 13.6.
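The equipercentile computation of (17-6) can be sketched as follows (Python with NumPy; the trapezoidal cumulation and the assumption of strictly positive densities on the grids are simplifications).

```python
import numpy as np

def cumulate(density, grid):
    """Cumulative distribution by the trapezoidal rule, normalized to 1."""
    steps = np.diff(grid) * 0.5 * (density[1:] + density[:-1])
    cdf = np.concatenate(([0.0], np.cumsum(steps)))
    return cdf / cdf[-1]

def equate_true_scores(zeta, g_hat, eta, q_hat):
    """Eq. (17-6): for each ζ on its grid, the η with the same percentile rank,
    found by numerical integration and inverse interpolation."""
    G = cumulate(g_hat, zeta)
    Q = cumulate(q_hat, eta)
    return np.interp(G, Q, eta)      # inverse interpolation in the Q scale
```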
   Figure 17.3.1 shows two estimates of the equating function η(ζ) relating true
scores on two verbal tests, P and Q. Since P and Q are randomly parallel, being
produced by randomly splitting a longer test, the relation η(ζ) should be nearly
linear, but not precisely linear, as would be the case if P and Q were strictly
parallel.
   The relation η(ζ) was estimated by the method of this section from two
different groups of examinees. Each curve in the figure runs from the first to the
ninety-ninth percentile of the distribution of ζ for the corresponding group. The
FIG. 17.3.1. Estimates from two different groups of the line of relationship
equating true scores ζ and η for two randomly parallel tests, P and Q. (From
F. M. Lord, A strong true-score theory, with applications. Psychometrika, 1965,
30, 239-270.) [Figure: η plotted against ζ, each axis running from .3 to 1.0.]

two estimated relations agree well with each other and are appropriately nearly
linear.

17.4. BIVARIATE DISTRIBUTION OF OBSERVED
SCORES ON NONPARALLEL TESTS

   Suppose g(ζ) and q(η) have been independently estimated by the method of
Chapter 16 and then η(ζ) has been estimated by the method of the preceding
section. The bivariate distribution of number-right observed scores x and y can
now be estimated from (17-4) by numerical integration.
   An early version of this method was used¹ to predict 16 different bivariate

¹The remainder of this section is taken by permission from F. M. Lord, A strong true-score
theory, with applications. Psychometrika, 1965, 30, 239-270.

frequency distributions involving three different groups of examinees and eight
different vocabulary tests composed of five-choice items. The N's for the three
groups were 1000, 2000, and 2523.
For one pair of tests, H and J, the four chi squares obtained were all signifi­
cant at the 5% level. The writer's conclusion is that a difficult vocabulary test
like H, which uses such unusual key words as limnetic, eclogue, newel, serice­
ous, measures something slightly different from an easy vocabulary test such as
J, which includes such key words as renegade, clemency, irritability. This
viewpoint, persuasive by itself, tends to be substantiated by the fact that for
these four bivariate distributions the observed product-moment correlation was

FIG. 17.4.1. Actual frequencies (upper) and predicted frequencies (lower) for Tests H and J.
[Figure: bivariate frequency table for scores on Test J (horizontal) and Test H (vertical).]
FIG. 17.4.2. Theoretical regression of H on J (solid line) and J on H (dotted
line) with actual column means (dots) and actual row means (crosses).
[Figure: vertical axis H, 0 to 25; horizontal axis J, 5 to 50.]

from .02 to .05 lower than the predicted correlation, whereas for the remaining
10 bivariate distributions the observed correlation was in every case a trifle
higher than the predicted correlation.
Figure 17.4.1 compares predicted and observed bivariate distributions (N =
2000) for Tests H and J, the hard and easy vocabulary tests. Figure 17.4.2 shows
for the same data the theoretical regressions of J on H and H on J, as well as
those row means and column means of the observed distribution based on five or
more cases. To the naked eye, the fit in these two figures seems rather good; the
chi square is significant at the 5% level, however. If the two tests measure
slightly different psychological traits, as suggested above, then significant chi
squares are to be expected. The analysis carried out is in fact just the analysis that
could be used to investigate whether the tests actually are or are not measuring
the same dimension.
For the remaining 12 pairs of distributions studied, it is more plausible that
both tests are measures of the same trait. For these, the model appears to be very
effective: 11 of the 12 chi-squares are nonsignificant at the 5% level.

17.5. CONSEQUENCES OF SELECTING ON
OBSERVED SCORE

Table 17.5.1 shows the true-score distribution for a rejected group of examinees
(x ≤ 38), as discussed in Section 16.11. The estimated distribution of true scores

TABLE 17.5.1
Estimated Population Observed-Score (Noncumulative)
Distribution of Failing Students (x ≤ 38) Compared with
Their Estimated True-Score Distribution and with Their
Estimated Observed-Score Distribution on Parallel Test y

              Estimated Population Frequency* Distribution for

Number-Right        Test            True            Test
   Score              x             Score              y

52 0 0 .1
51 0 0 .1
50 0 0 .1
49 0 0 .1
48 0 0 .3
47 0 0 .5
46 0 0 .6
45 0 .1 .7
44 0 .2 .9
43 0 .6 1.1
42 0 1.1 1.3
41 0 1.6 1.7
40 0 2.1 2.0
39 0 2.5 2.1
38 4.9 2.9 2.1
37 4.5 3.4 2.2
36 4.0 3.6 2.1
35 3.6 3.0 2.1
34 3.1 2.4 2.0
33 2.7 1.9 1.9
32 2.3 1.6 1.8
31 2.0 1.3 1.7
30 1.7 1.1 1.6
29 1.5 1.1 1.4
28 1.3 1.0 1.2
27 1.2 1.0 1.1
26 1.1 1.0 1.1
25 1.0 1.0 1.0
24 1.0 1.0 1.0
23 1.0 1.0 1.0
22 .9 1.0 1.0
21 .9 1.0 1.0
20 .9 1.0 .9
19 .9 1.0 .9
18 .9 1.0 .9
17 .8 1.0 .8
16 .8 1.0 .7

(continued)

TABLE 17.5.1
(continued)

              Estimated Population Frequency* Distribution for

Number-Right        Test            True            Test
   Score              x             Score              y

15 .7 1.0 .6
14 .6 1.0 .5
13 .5 0 .4
12 .4 0 .3
11 .2 0 .2
10 .2 0 .1
9 .1 0 .1
8 0 0 .1
7 0 0 .1
.
.
Total 45.5 45.5 998

*Number of students per 1000 students taking the test.

for rejected examinees (x ≤ x₀) was found by the formula

   ĝ(ζ|x ≤ x₀) = [ĝ(ζ) Σ_{x=0}^{x₀} (ⁿₓ) ζ^x (1 − ζ)^{n−x}] / [Σ_{x=0}^{x₀} Ø̂(x)],   (17-7)
ĝ(ζ) having been obtained by the methods of Chapter 16. A disadvantage of this
result is that there is no way to check its validity.
From Table 17.2.1, we can write the estimated (noncumulative) observed-
score distribution on form y for those examinees who are rejected by form x. The
estimated distribution of form y observed scores for examinees rejected by test x
is given by the formula
   f̂(y|x ≤ x₀) = [Σ_{x=0}^{x₀} Ø(x, y)] / [Σ_{x=0}^{x₀} Σ_{y=0}^{n_y} Ø(x, y)],   (17-8)

Ø(x, y) having been estimated by substituting ĝ(ζ) into (17-5). This distribution
is shown in Table 17.5.1 for comparison with the other distributions there. This
distribution could be checked against actual test data if we could administer both
form x and form y to the same examinees without practice effect.

   Selection need not necessarily involve a cutting score. Given f(x) examinees
at observed score x, we can select a proportion px of these at random (x = 0,
1, . . . , n). The true-score distribution for the selected group will then be given by

   g_p(ζ) = g(ζ) Σ_{x=0}^{n} p_x (ⁿₓ) ζ^x (1 − ζ)^{n−x}.   (17-9)

The observed-score distribution on form y for the selected group will be given by

   f_p(y) = Σ_{x=0}^{n} p_x Ø(x, y)   (y = 0, 1, . . . , n).   (17-10)

Not only does this last equation allow us to estimate fp(y) when the selection
procedure p ≡ {px} is given but it also can be used to find the selection
procedure p that will produce a required distribution fp of y. If the left-hand side
of (17-10) is given for y = 0, 1,. . . , n, we have n + 1 linear equations in the n
+ 1 unknowns p0, p1, . . . , pn. Since the matrix ||Ø(x, y)|| will normally be
nonsingular, values of p0, p1, . . . , pn can be found satisfying (17-10) when the
left-hand side is given.
To provide a meaningful solution to the problem stated, each value of px thus
determined from (17-10) must satisfy the inequalities 0 ≤ px ≤ 1. In practical
work, it is likely that these inequalities will not always be satisfied, in which case
some approximation will be required.
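In matrix form, (17-10) is a square linear system, so the sketch below (Python with NumPy; the function name is illustrative) recovers the p_x and reports whether they fall in the admissible interval [0, 1].

```python
import numpy as np

def selection_proportions(phi_xy, target_fy):
    """Solve the n + 1 equations (17-10) for p_x, given Ø(x, y) and a required
    form-y distribution f_p(y); phi_xy[x, y] holds Ø(x, y)."""
    p = np.linalg.solve(phi_xy.T, target_fy)   # equations are indexed by y
    feasible = bool(np.all((p >= 0.0) & (p <= 1.0)))
    return p, feasible
```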

17.6. MATCHING GROUPS

   Suppose two populations have distinctly different distributions of observed score
f(x). The matching problem is to select a subpopulation from each population so
that the subpopulations are matched on ability (or true score).
Suppose subpopulations are chosen so as to have identical distributions of
observed score x. This procedure ordinarily will not produce subpopulations that
are matched on true score or ability, for the following reason.
Suppose that the unselected population A has considerably more ability than
the unselected population B. If we match on observed score, we are mostly
selecting the lower scoring people from group A and the higher scoring people
from group B. Since observed score x equals true score plus error, when we
select low values of x in group A, we tend to obtain a subgroup with negative
errors of measurement. This means that the true scores of the subgroup selected
from A are mostly higher than their observed scores. Similarly, the true scores of
the subgroup selected from B are mostly lower than their observed scores. Thus
selected subgroups matched on observed score are usually not matched on true
score.
Since there is an infinite variety of possible true scores but only n + 1
different possible observed scores, it is, strictly speaking, theoretically impossi-
ble to select on observed score in such a way as to produce a subgroup having an


arbitrary distribution of true scores. It may be possible, however, as we saw in
the last section, to use (17-10) to select on observed score x so as to obtain a
subgroup having a specified score distribution fp(y) on parallel test form y. If we
can do this for both population A and population B, obtaining the same fp(y) for
both subpopulations, then the subpopulations are matched with respect to y. This
means that the true-score distributions of the two selected subpopulations must
have identical moments up through order n. For all practical purposes, this
would constitute a satisfactory matching on true score.
This procedure will be effective if all the px found from (17-10) turn out to lie
between 0 and 1. This is unlikely to happen for an arbitrary fp(y). In our
problem, however, we are free to choose any fp(y) that we wish. Let us choose
fp(y) so that the px found from (17-10) will lie between 0 and 1, both for
population A and for population B.
The problem of finding such an fp(y) is simply the problem of finding a
feasible point in a linear programming problem. If any fp(y) exists satisfying our
requirements, it can be found in a finite number of steps by standard procedures
for finding a starting point for the iterative solution of a linear programming
problem.
Note that it is not necessary to find the optimal point or to solve the linear
programming problem; it is only necessary to find a feasible point. If desired,
however, we could proceed further, using linear programming to find the fp(y)
for which Σ_x p_x f(x), the size of the selected subpopulation, is as large as possible.
The procedure described in this section has not as yet been implemented. Thus
no illustrative examples can be shown here. It would be desirable for some
researcher to carry out the procedure and then actually administer test y to the
selected subpopulations. This would provide a good check on the accuracy of the
predictions made.
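One way the feasibility search might be set up is sketched below (Python, assuming SciPy's linprog; the formulation, with the two selection vectors stacked and the total selected size maximized, is only one of several possibilities and is not a published procedure).

```python
import numpy as np
from scipy.optimize import linprog

def match_by_selection(phi_A, phi_B, f_A, f_B):
    """Find selection proportions p_x (0 <= p_x <= 1) for populations A and B
    so that, by Eq. (17-10), the selected subgroups have the same expected
    frequency distribution on form y.  phi_A and phi_B hold Ø(x, y) as expected
    counts for the two populations; f_A and f_B are the f(x) counts.  Among the
    feasible points, the one selecting the most examinees is returned."""
    nA, nB = phi_A.shape[0], phi_B.shape[0]
    A_eq = np.hstack([phi_A.T, -phi_B.T])    # one equality for every score y
    b_eq = np.zeros(phi_A.shape[1])
    c = -np.concatenate([f_A, f_B])          # maximize total number selected
    res = linprog(c, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0.0, 1.0)] * (nA + nB), method="highs")
    if not res.success:
        return None
    return res.x[:nA], res.x[nA:]
```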

17.7. TEST NORMS

Suppose that a test publisher wishes to norm his test on a nationally representa­
tive norms group. If he selects a representative sample of schools and asks them
to administer the test, he may receive many refusals. If so, any norms finally
collected will be of doubtful value: the schools that finally agree to administer his
test may be unrepresentative.
Suppose that the publisher can avoid refusals if he asks to administer only a
10-minute short form of the regular test. Our problem is then to estimate from
their scores on the short form what the total norms sample would have done on
the regular test.
Denote the short form by x and the regular form by y. We do not wish to go so
far as to assume that y is simply a lengthened version of x; we assume only that
both forms measure the same psychological dimension.

   The relation η(ζ) between true score η on test y and true score ζ on test x can
be found from (17-6). To find this relation, the publisher need only administer
test y and test x separately to different random samples from any convenient
population. The publisher does not need nationally representative samples for
this purpose, since true-score equating is independent of the group tested (see
Chapter 13).
   In addition to determining the relation η(ζ) from some convenient sample, as
just described, the publisher must estimate g(ζ), the true-score distribution for
test x in the nationally representative sample, by the methods of Chapter 16. The
η(ζ) from the convenient sample and the ĝ(ζ) from the national sample can then
be substituted into (17-4) to estimate Ø(x, y), the bivariate distribution of x and y
for the national sample. The estimated national norms distribution f*(y), say, for
the full-length test y, is then obtained by summing on x across the estimated
bivariate distribution:
   f*(y) = Σ_{x=0}^{n} Ø(x, y).   (17-11)

   In any practical application of this procedure, x and y should be unspeeded (if
they are speeded, they should be "equally" speeded, if this is possible). This
requirement has discouraged but need not prevent practical application of this
norming procedure. In some cases, this procedure may be the only way that a
representative national group can be obtained for norming purposes.
Answers to Exercises

Chapter 4

1. .231, .296, .423, .6, .777, .904, .969.


2. .064, .288, .432, .216; 1.8; .8485; .6.
3. .050, .295, .459,.195; 1.8, .6, .8073.
5. 2.88, 3.26, 1.51.
6. .6 to 3.
7. Approximately −1 to +.5.
8. .050,.184, .075,.036,.275,.130,.054,.195.
9. .043,.092,.144,.125,.053,.013, .002.
11. −.69.

Chapter 5

1. .75, .95, 1.31, 1.8, 2.29, 2.65, 2.85.


2. .747, .795, .831, .807, .704, .540, .370.
4. .042,.143,.302,.415,.395,.264,.132.
7. .045,.105,.197,.25,.197,.105,.045.
9. 23.6, 7.0, 3.3, 2.4, 2.5, 3.8, 7.6.
10. .032, .122, .282, .406, .393, .264, .132.
12. .931,.833, .648.

Chapter 6

1. .022, .102, .306, .5, .440, .245, .107.


2. .135,.315,.590, .75,.590,.315,.135.
3. .51,.71, 1.01, 1.20, 1.11,.93,.81.


4. 3.2, 2.2, 2.0, 1.8, 1.5, 1.2, 1.0.


8. .561,.649,.738,.72,.542,.308,.140.
9. .069, .211, .367, .395, .330, .218, .110; 1.6, 1.5, 1.2, .95, .84, .83, .83.

Chapter 8

1. .6, .4.
2. .086,.351,.314,.249.

3.  y  =  1/2     1     1 1/2     2
    Ø  = .086   .314   .351    .249

4. .415, .585; .234, .293, .351, .123.
    y  =  1/2     1     1 1/2     2
    Ø  = .234   .351   .293    .123

Chapter 9
1. -1.92, - . 4 0 , 0, .40, .84, 1.44.
2. 1.83,.69,.64,.63,.66,.76.
3.   θ      x = 0-1   2-3   4-5   6-7
     2          0      0    .03   .97
     0         .02    .27   .55   .16
    −2         .46    .47   .06    0

Chapter 11

1.
x ≥ 2 3 4 5
α = .69 .34 .10 .01
β = .01 .07 .29 .70
C = .70 .41 .39 .71

2. 24.1, 4.7,.91,.18,.03,.01;
examinees scoring x ≥ 4.
3. .31,.26,.18.
4. 1.84, 1.64, 1.27.
5. 3.48, 3.12, 2.91; 3.
6. .93,.83,.65.
7. 1.76, 1.58, 1.48; 1.5.

Chapter 12

1. MLE for θ is θ̂ = 0.
2. MLE for θ* is θ̂* = e⁰ = 1.
3. BME for θ is θ̂ = 0.
4. BME for θ* is θ̂* = e⁻⁵ = .0067.
Author Index

Numbers in italics indicate the page on which the complete reference appears.

A Christoffersson, A., 21, 25


Clark C , 157, 160
Aitchison, J., 12, 25 Cleary, T. A., 146, 149
Algina, J., 162, 776 Cliff, N., 160
Amemiya, T., 12, 25 Collet, L., 227, 231
Andersen, E. B., 12, 25, 181, 191 Cox, D. R., 12, 14, 25
Anderson, M. R., 189, 191 Craven, P., 239, 253
Angoff, W. H., 76, 80, 207, 211 Cronbach, L. J., 6, 10, 128, 148
Cudeck, R. A., 160

B
D
Bennett, J. A., 12, 25
Betz, N. E., 127, 127, 146, 148, 161 Dahm, P. A., 12, 25
Bianchini, J. C , 96, 105 David, C. E., 160, 162, 176
Birnbaum, A., 63, 64, 65, 67, 72, 80, 152, Deal, R., 131, 148
160, 162, 173, 176, 186, 191 DeGraff, M. H., 107, 112
Blot, W., 14, 25 DeWitt, L. J., 160
Bock, R. D., 12, 21, 25, 189, 191 Diamond, J., 227, 230
Brogden, H. E., 12, 25 Dyer, A. R., 14, 25

C E

Chambers, E. A., 14, 25 Ebel, R. L., 107, 108, 112, 113, 227, 230
Charles, J. W., 107, 112 Evans, W., 227, 230


F
Fan, C.-T., 34, 43
Finney, D. J., 12, 25
Foutz, R. V., 59, 64
Franklin, J. N., 239, 253

G
Gavurin, M. K., 239, 253
Glass, G. V., 163, 176
Gleser, G. C., 6, 10, 128, 148
Goldberger, A. S., 6, 10
Gordon, W. E., 107, 112
Gorham, W. A., 160
Graybill, F. A., 131, 148
Green, B. F., Jr., 160
Grier, J. B., 107, 108, 109, 112
Gulliksen, H., 34, 43, 208, 211
Gurland, J., 12, 25

H
Haberman, S. J., 182, 191
Hambleton, R. K., 162, 176, 227, 231
Harris, D. A., 127, 127
Hauser, R. M., 6, 10
Henry, N. W., 12, 25
Hickman, J., 160, 162, 176
Hunter, J. E., 207, 211
Huynh, H., 162, 176

I
Ilbok, J., 12, 25
Indow, T., 21, 25
Ironson, G. H., 213, 224

J, K
Jordan, C., 244, 253
Jöreskog, K. G., 6, 10
Kale, B. K., 180, 191
Kendall, M. G., 45, 64, 71, 80, 185, 191, 197, 211, 237, 253
Killcross, M. C., 160, 160
Koch, W. R., 160
Kolakowski, D., 21, 25
Krjanev, A. V., 239, 253

L
Larkin, K. C., 146, 148
Lawley, D. N., 12, 25
Lazarsfeld, P. F., 12, 25
Lees, D. M., 92, 96, 105, 240, 253
Lennon, V., 92, 96, 105, 240, 253
Lieberbaum, M., 189, 191
Linn, R. L., 5, 6, 10, 146, 149
Livingston, S., 251, 253
Lord, F. M., 5, 6, 9, 10, 10, 12, 25, 39, 42, 43, 45, 51, 64, 68, 76, 80, 92, 94, 96, 102, 105, 134, 138, 149, 150, 156, 157, 160, 189, 192, 229, 230, 235, 238, 240, 244, 253, 253
Loret, P. G., 96, 105

M
Mandel, J., 69, 80
Mantel, N., 12, 25
Marco, G. L., 146, 149, 205, 211
Maurelli, V. A., Jr., 189, 191
McBride, J. R., 160, 160
McCormick, D., 160
Meeter, D., 14, 25
Mellenbergh, G. J., 213, 224
Meredith, W., 182, 191
Milliken, G. A., 6, 10
Morrison, D. F., 223, 224
Mulaik, S. A., 21, 26
Mussio, J. J., 160
Muthén, B., 21, 26

N
Nanda, H., 6, 10
Novick, M. R., 6, 9, 10, 10, 39, 42, 43, 45, 51, 64, 76, 80, 160, 162, 176

P, Q
Pennell, R. J., 127, 127
Pereira, B. deB., 14, 26
Pirie, W., 14, 25
Prentice, R. L., 14, 26
Quesenberry, C. P., 14, 26

R
Rajaratnam, N., 6, 10
Rasch, G., 182, 189, 191, 192
Reckase, M. D., 160
Rjabov, V. M., 239, 253
Rock, D. A., 6, 10, 146, 149
Ruch, G. M., 107, 112

S
Samejima, F., 12, 21, 25, 26, 59, 64, 153, 160
Sax, G., 227, 231
Schmidt, F. L., 207, 211
Seder, A., 96, 105
Seguin, S. P., 127, 127
Shaw, C. B., Jr., 253
Slakter, M. J., 227, 231
Slinde, J. A., 5, 10
Snijders, T., 166, 176
Solomon, H., 12, 26
Starbuck, R. R., 14, 26
Stiehler, R. D., 69, 80
Stocking, M., 92, 96, 105, 136, 149, 240, 253
Stoddard, G. D., 107, 112
Stone, M., 14, 26
Stuart, A., 45, 64, 71, 80, 185, 191, 197, 211, 237, 253
Subkoviak, M. J., 162, 176
Susarla, V., 166, 176
Swaminathan, H., 162, 176
Sympson, J. B., 21, 26

T
Thorndike, R. L., 207, 211
Toops, H. A., 107, 113
Traub, R. E., 227, 231
Tversky, A., 107, 113

U, V, W
Urry, V. W., 155, 160, 161, 189, 192
Vale, C. D., 96, 105, 106, 113, 161
van der Linden, W. J., 163, 176
van der Ven, Ad H. G. S., 182, 192
van Ryzin, J., 166, 176
van Strik, R., 12, 26
Varan, J. M., 239, 253
Wahba, G., 239, 253
Waters, L. K., 227, 231
Weiss, D. J., 106, 113, 127, 127, 146, 148, 160, 161
Werts, C. E., 6, 10
Wiley, D. E., 207, 211
Williams, B. J., 107, 113
Wingersky, M. S., 92, 96, 105, 189, 192, 240, 253
Wood, R. L., 189, 192
Woods, E. M., 207, 211
Wright, B. D., 58, 64, 189, 192
Subject Index

A
a, 12-14, see also Discriminating power
Ability
  distribution
    Bayesian inference, 186
    mastery testing, 162
    normal, 32
  maximum likelihood estimate, 59-60, 70-71, 77
    in mastery testing, 171
    for peaked tests, 130
    for tailored tests, 153-157
    variance, 70
  posterior distribution of, 186-188
  regression on score, 53
  scale, 84-90
  and score, joint distribution, 51
  transformation, 84-90
    in Bayesian estimation, 188
    to true scores, 46, 49, 183, see also Test characteristic function
Alpha, 8, 72
Attenuation, correction for, 7

B
b, 12-14, see also Item difficulty
Bayesian estimation, 183, 186-189
  mastery testing, 162
Bayesian modal estimator, 187-188
Beta distribution of true scores, 238, 240
Bimodality, 244
Binomial
  distribution of scores, 44, 94, 132, 236, 255
  generalized, 45, 237-238, 255
Bioassay, 12

C
c, 12-14, 110, see also Guessing
  estimation of, 186
Chance score level, see Pseudo-chance score level
Chi-square test
  for item bias, 223
  for score distribution, 241
Competency test, 50, 251
Confidence interval
  for ability, 52-54
  and information function, 21, 65-69, 85-86
  for true score, 90
  used to fix test length, 173
Consistent estimator, 59
Correlation
  biserial, 9, 33-39, 41-42
  interitem, 39-43, see also Correlation, tetrachoric
  item-ability, 33-41


Correlation (cont.)
  item-test, see Correlation, biserial
  spurious, 41
  tetrachoric, 19-21, 39, 41-42
Cramer-Rao inequality, 71
Cutting score, 255, 262
  mastery test, 162-175

D
Decision rule, 163-169
Delta, 34
Difficulty, see Item difficulty; Test difficulty
Dimensionality, 19-21, 35, 68
Discriminating power
  and bimodality, 245
  definition, 13
  effects of, 40-41
  and information, 152
  item bias in, 217
  and item-test biserial, 33-43
  and item weight, 23, 75
  tailored test, 159

E
Efficiency, relative, 23, 83-104, 110
  approximation, 91-101
Equating, 76, 193-211, 236
  with an anchor test, 200-205
  equipercentile, 92, 203, 207
  for raw scores, 202
  second-stage tests, 140
  true-score, 199-205, 210, 256
Equipercentile relationship, 194-211, 256, see also Equating, equipercentile
Error of measurement, 4-7, 235-236, see also Standard error of measurement
Examinees, low ability, 37, 75, 103, 110, 183

F
Factor, common, as ability, 19, 39
Factor analysis of items, 20-21
Flexilevel test, 115-127
Formula score, 226-230
Free-response item, 43, see also Guessing

G
Guessing, 31, see also c
  correction for, 102
  effect on estimation of bi, 37
  effect on information, 103
  effects of, 40-43, 244
  flexilevel test, 124-126
  and the item response function, 17
  no sufficient statistic, 58
  omits and formula scoring, 226-229
  and optimal item difficulty, 108-112, 152
  random, not assumed, 30
  and scoring weights, 23, 75, 77
  and tetrachoric correlation, 20
  two-stage tests, 138-139

I
Independence, local, 19
Indeterminacy of item parameters, 36-38, 184
Information function, 65-80
  flexilevel test, 122-126
  item, 21-23, 72-73
    in tailored testing, 151-153
  maximum, 112, 151-153
  target, 23, 72
  test, 21-23, 71-73
  for transformed ability, 84-90
  on true score, 89
  two-stage tests, 132-148
Information matrix, 180
Integral equation, 236
Invariance of item parameters, 34-38
iosr, 19, 27-30, 236, 251
Item
  analysis, 27-43
  bias, 212-223
  calibration, 154, 205
  choices, 17
    number of, 106-112
  difficulty
    corrected, 216
    effect of, 23, 102-104
    in IRT, 12-14
    and maximal information, 152
    optimal, 172
    proportion correct, 33-38, 213
    and scoring weight, 76
    standard error of, 185
  not reached, 182, 225-230
  parameters, 12-14
  response function, 12-17, 30-32
    estimated, 252
    interpretation, 227
    and item bias, 218-219
  response model fit, 15-21
  response, polychotomous, 12
  scores, distribution of, 54-57
  selection, see Test construction
  theory, classical, 8

K
Kuder-Richardson
  formula 20, 8, 245
  formula 21, 8

L
Latent trait theory, 236
Latent variable, 31-41
Latent variables, 6
Likelihood
  equations, 58-60, 179-181
  function, 55-57
  ratio, 164-169
Logistic function, 12-17
  related formulas, 60, 180

M
Mastery test, 247
  design, 172-175
Mastery testing, 162-175
Matching groups, 236, 262
Maximum likelihood
  ability estimate, see Ability
    information function for, 240
  and Bayesian estimation, 187
  estimate
    infinite, 182-186
    variance of, 181
  estimation procedures, 179-189, 209-210
  theory, 55-60
Measurement effectiveness, 50

N, O
Newton-Raphson method, 180
Neyman-Pearson lemma, 164
Normal ogive, 13, 20, 27-41, 58, 84, 122, 129
Norms, 236, 263
Not reached responses, see Item, not reached
Omitted responses, 226-230, 242

P
Parallel forms
  bivariate score distribution of, 236, 255-264
  in classical test theory, 3, 6
  lengthening a test, 65
Parameters, unidentifiable, 184
Path analysis, 6
Phi coefficient, 9, 41-43
Preequating, 205
Pseudo-chance score level, 203, 210, 244

R
Rasch model, 58, 181, 189-190, 213
Regression
  item-ability, 27, 34
  item-observed score, see iosr
  item-test, see iosr
  item-true score, 251
Reliability, 5
  and bimodality, 245
  item contribution to, 22, 72
  lower bound, 8
  relation to item response theory, 40-42, 52
  two-stage test, 146

S
Scholarship examination, 50
Score
  distribution, 44-52
    bimodal, 244
    determined by test characteristic function, 50
    in equating, 202
    flexilevel test, 120-122
    in relation to true-score distribution, 235-264

Score (cont.)
  mean, 45, 52, 66
  regression on ability, 49-51, 66-70
  transformed, information functions for, 78
  variance, 45, 52
Scoring, see Weights; Ability; Maximum likelihood estimate
Scores, bivariate distribution of, 202, 257-264
Selection, 236, 247-251, 261
Simulation, 120-127, 156-159
Smoothing a distribution of scores, 245
Smoothness, 238
Spearman-Brown formula, 8, 42, 109
Standard error of measurement, 7, 46-49, 89-90
  and information, 68, 89-90, 92
  multilevel tests, 140-148
Sufficient statistic, 57-64
  and optimal scoring, 57, 77
  for peaked tests, 130

T
Test
  characteristic function, 49-51, 73
    use for equating, 199-202
  construction, 23, 72
  design, 101-104, 119-127, 162, see also Test, redesigning
  difficulty
    effect of, 50, 110
    flexilevel test, 114-127
    two-stage test, 128-148
  lengthened, 103, 236, 245
    definition, 65
  multilevel, 140-148
  peaked
    flexilevel test compared to, 124-126
    SAT, 83, 93, 104
    score distribution for, 244
    in two-stage testing, 130-132
    two-stage tests compared to, 134-139
  redesigning, 101-104
  routing, 128-139
  speeded, 182, 226, 230, 264
  tailored, 11, 114, 134, 150-160
  theory, classical, 3-12
  two-stage, 128-139
Transformation, see also Ability, transformation; Scores, transformed
  arcsine, 214
  inverse normal, 215
True score
  and ability, 45
  classical theory, 4-10
  definition in item response theory, 45-51
  distribution
    estimation of, 235-264
    examples of, 96
  effect of different scales, 68
  information function on, 89-90
  in mastery test design, 163
  moments, 237

U, V, W
Unidimensionality, see Dimensionality
Validity, 9, 22
  item contribution to, 72
Weights, scoring, 23, 73-77
  for mastery test, 169-175
