0% found this document useful (0 votes)

23 views289 pages

Frederic M. Lord - Applications of Item Response Theory To Practical Testing Problems (1980)

Uploaded by

mpessoasilva

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

23 views289 pages

Frederic M. Lord - Applications of Item Response Theory To Practical Testing Problems (1980)

Uploaded by

mpessoasilva

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 289

APPLICATIONS OF

ITEM RESPONSE THEORY

TO PRACTICAL TESTING PROBLEMS

FREDERIC M. LORD
Educational Testing Service
ROUTLEDGE

Routledge
Taylor &. Francis Group

NEW YORK AND LONDON

First published by
Lawrence Erlbaum Associates
10 Industrial Avenue
Mahwah, New Jersey 07430

Transferred to Digital Printing 2009 by Routledge

270 Madison Ave, New York NY 10016
2 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN

Copyright © 1980 by Lawrence Erlbaum Associates, Inc.

All rights reserved. No part of this book may be reproduced in
any form, by photostat, microform, retrieval system, or any other
means, without the prior written permission of the publisher.

Copyright is claimed until 1990. Thereafter all portions of this work

covered by this copyright will be in the public domain.
This work was developed under a contract with the National Institute of
Education, Department of Health, Education, and Welfare. However, the
content does not necessarily reflect the position or policy of that
Agency, and no official endorsement of these materials should be inferred

Reprinted 2008 by Routledge

Routledge Routledge
Taylor and Francis Group Taylor and Francis Group
270 Madison Avenue 2 Park Square
New York, NY 10016 Milton Park, Abingdon
Oxon OX14 4RN

Library of Congress Cataloging in Publication Data

Lord, Frederic M 1912-
Applications of item response theory to practical
testing problems.
Bibliography: p.
Includes index.
1. Examinations. 2. Examinations—Evaluation.
I. Title.
LB3051.L64 371.2'6 79-24186
ISBN 0-89859-006-X

Publisher's Note
The publisher has gone to great lengths to ensure the quality of this reprint
but points out that some imperfections in the original may be apparent.
Contents

Preface xi

PART I: INTRODUCTION TO ITEM RESPONSE THEORY

1. Classical Test Theory—Summary and Perspective 3

1.1. Introduction 3
1.2. True Score 4
1.3. Uncorrelated Errors 6
1.4. Parallel Test Forms 6
1.5. Envoi 7
Appendix 7

2. Item Response Theory—Introduction and Preview 11

2.1. Introduction 11
2.2. Item Response Functions 12
2.3. Checking the Mathematical Model 75
2.4. Unidimensional Tests 19
2.5. Preview 21

iii
IV CONTENTS

3. Relation of Item Response Theory to

Conventional Item Analysis 27
3.1. Item-Test Regressions 27
3.2. Rationale for Normal Ogive Model 30
3.3. Relation to Conventional Item Statistics 33
3.4. Invariant Item Parameters 34
3.5. Indeterminacy 36
3.6. A Sufficient Condition for the
Normal Ogive Model 39
3.7. Item Intercorrelations 39
3.8. Illustrative Relationships Among
Test and Item Parameters 40
Appendix 41

4. Test Scores and Ability Estimates

as Functions of Item Parameters 44
4.1. The Distribution of Test Scores for
Given Ability 44
4.2. True Score 45
4.3. Standard Error of Measurement 46
4.4. Typical Distortions in
Mental Measurement 49
4.5. The Joint Distribution of Ability
and Test Scores 51
4.6. The Total-Group Distribution of
Number-Right Score 51
4.7. Test Reliability 52
4.8. Estimating Ability From Test Scores 52
4.9. Joint Distribution of Item Scores
for One Examinee 54
4.10. Joint Distribution of All Item Scores
on All Answer Sheets 55
4.11. Logistic Likelihood Function 56
4.12. Sufficient Statistics 57
4.13. Maximum Liklihood Estimates 58
4.14. Maximum Likelihood Estimation for
Logistic Items with ct = 0 59
4.15. Maximum Likelihood Estimation for
Equivalent Items 59
4.16. Formulas for Functions of the
Three-Parameter Logistic Function 60
4.17. Exercises 61
Appendix 63
CONTENTS V

5. Information Functions and Optimal Scoring Weights 65

5.1. The Information Function for a Test Score 65
5.2. Alternative Derivation of the
Score Information Function 68
5.3. The Test Information Function 70
5.4. The Item Information Function 72
5.5. Information Function for a
Weighted Sum of Item Scores 73
5.6. Optimal Scoring Weights 74
5.7. Optimal Scoring Weights Not Dependent on 0 76
5.8. Maximum Likelihood Estimate of Ability 77
5.9. Exercises 77
Appendix 78

PART II: APPLICATIONS OF ITEM RESPONSE THEORY

6. The Relative Efficiency of Two Tests 83

6.1. Relative Efficiency 83
6.2. Transformations of the Ability Scale 84
6.3. Effect of Ability Transformation on
the Information Function 84
6.4. Effect of Ability Transformation on
Relative Efficiency 88
6.5. Information Function of
Observed Score on True Score 89
6.6. Relation Between Relative Efficiency and
True-Score Distribution 90
6.7. An Appproximation for Relative Efficiency 92
6.8. Desk Calculator Approximation for
Relative Efficiency 94
6.9. Relative Efficiency
of Seven Sixth-Grade Vocabulary Tests 96
6.10. Redesigning a Test 101
6.11. Exercises 104

7. Optimal Number of Choices Per Item 106

7.1. Introduction 106
7.2. Previous Empirical Findings 107
7.3. A Mathematical Approach 107
7.4. Grier's Approach 108
7.5. A Classical Test Theory Approach 108
7.6. An Item Response Theory Approach 110
7.7. Maximizing Information at a Cutting Score 112
vi CONTENTS

8. Flexilevel Tests 114

8.1. Introduction 114
8.2. Flexilevel Tests 115
8.3. Scoring 116
8.4. Properties of Flexilevel Tests 117
8.5. Theoretical Evaluation
of Novel Testing Procedures 119
8.6. Conditional Frequency Distribution of
Flexilevel Test Scores 120
8.7. Illustrative Flexilevel Tests, No Guessing 122
8.8. Illustrative Flexilevel Tests, with Guessing 124
8.9. Conclusion 726
8.10. Exercises 127

9. Two-Stage Procedures and Multilevel Tests 128

9.1. Introduction 128
9.2. First Two-Stage Procedure—Assumptions 129
9.3. Scoring 130
9.4. Conditional Distribution of Test Score θ 131
9.5. Illustrative 60-Item Two-Stage Tests,
No Guessing 132
9.6. Discussion of Results for 60-Item Tests
with No Guessing 135
9.7. Illustrative 15-Item Two-Stage Tests
with No Guessing 136
9.8. Illustrative 60-Item Two-Stage Tests
with Guessing 138
9.9. Converting a Conventional Test to a
Multilevel Test 140
9.10. The Relative Efficiency of a Level 141
9.11. Dependence of the Two-Stage Test on
its Levels 142
9.12. Cutting Points on the Routing Test 144
9.13. Results for Various Two-Stage Designs 144
9.14. Other Research 146
9.15. Exercises 146
Appendix 147

10. Tailored Testing 150

10.1. Introduction 150
10.2. Maximizing Information 151
10.3. Administering the Tailored Test 153
CONTENTS VII

10.4. Calibrating the Test Items 154

10.5. A Broad-Range Tailored Test 154
10.6. Simulation and Evaluation 156
10.7. Results of Evaluation 157
10.8. Other Work on Tailored Tests 159

11. Mastery Testing 162

11.1. Introduction 162
11.2. Definition of Mastery 163
11.3. Decision Rules 163
11.4. Scoring the Test: The Likelihood Ratio 164
11.5. Losses 166
11.6. Cutting Score for the Likelihood Ratio 166
11.7. Admissible Decision Rules 168
11.8. Weighted Sum of Item Scores 169
11.9. Locally Best Scoring Weights 170
11.10. Cutting Point for Locally Best Scores 170
11.11. Evaluating a Mastery Test 171
11.12. Optimal Item Difficulty 172
11.13. Test Length 173
11.14. Summary of Mastery Test Design 174
11.15. Exercises 175

PART III: PRACTICAL PROBLEMS AND FURTHER APPLICATIONS

12. Estimating Ability and Item Parameters 179

12.1. Maximum Likelihood 179
12.2. Iterative Numerical Procedures 180
12.3. Sampling Variances of Parameter Estimates 181
12.4. Partially Speeded Tests 182
12.5. Floor and Ceiling Effects 182
12.6. Accuracy of Ability Estimation 183
12.7. Inadequate Data and
Unidentifiable Parameters 184
12.8. Bayesian Estimation of Ability 186
12.9. Further Theoretical Comparison of Estimators 187
12.10. Estimation of Item Parameters 189
12.11. Addendum on Estimation 189
12.12. The Rasch Model 189
12.13. Exercises 190
Appendix 191
Viii CONTENTS

13. Equating 193

13.1. Equating Infallible Measures 193
13.2. Equity 195
13.3. Can Fallible Tests be Equated? 196
13.4. Regression Methods 198
13.5. True-Score Equating 199
13.6. True-Score Equating
with an Anchor Test 200
13.7. Raw-Score "Equating"
with an Anchor Test 202
13.8. Illustrative Example 203
13.9. Preequating 205
13.10. Concluding Remarks 207
13.11. Exercises 207
Appendix 208

14. Study of Item Bias 212

14.1. Introduction 212
14.2. A Conventional Approach 213
14.3. Estimation Procedures 217
14.4. Comparing Item Response Functions
Across Groups 218
14.5. Purification of the Test 220
14.6. Checking the
Statistical Significance Test 221
Appendix 223

15. Omitted Responses and Formula Scoring 225

15.1. Dichotomous Items 225
15.2. Number-Right Scoring 225
15.3. Test Directions 226
15.4. Non-Reached Responses 226
15.5. Omitted Responses 226
15.6. Model for Omits Under Formula Scoring 227
15.7. The Practical Meaning of
an Item Response Function 227
15.8. Ignoring Omitted Responses 228
15.9. Supplying Random Responses 228
15.10. Procedure for Estimating Ability 229
15.11. Formula Scores 229
CONTENTS IX

PART IV: ESTIMATING TRUE-SCORE DISTRIBUTIONS

16. Estimating True-Score Distributions 235

16.1. Introduction 235
16.2. Population Model 236
16.3. A Mathematical Solution
for the Population 237
16.4. The Statistical Estimation Problem 239
16.5. A Practical Estimation Procedure 239
16.6. Choice of Grouping 241
16.7. Illustrative Application 242
16.8. Bimodality 244
16.9. Estimated Observed-Score Distribution 245
16.10. Effect of a Change in Test Length 245
16.11. Effects of Selecting on Observed Score:
Evaluation of Mastery Tests 247
16.12. Estimating Item True-Score Regression 251
16.13. Estimating Item Response Functions 252

17. Estimated True-Score Distributions for Two Tests 254

17.1. Mathematical Formulation 254
17.2. Bivariate Distribution of
Observed Scores on Parallel Tests 255
17.3. True-Score Equating 255
17.4. Bivariate Distribution of
Observed Scores on Nonparallel Tests 257
17.5. Consequences of Selecting on
Observed Score 259
17.6. Matching Groups 262
\1.1. Test Norms 263

Answers to Exercises 265

Author Index 267

Subject Index 269

Preface

The purpose of this book is to make it possible for measurement specialists to

solve practical testing problems by use of item response theory. This theory
expresses all the properties of the test, as a measuring instrument, in terms of the
properties of the test items. Practical applications include

1. The estimation of invariant parameters describing each test item; item

banking.
2. Estimating the statistical characteristics of a test for any specified group.
3. Determining how the effectiveness of a test varies across ability levels.
4. Comparing the effectiveness of different methods of scoring a test.
5. Selecting items to build a conventional test.
6. Redesigning a conventional tests.
7. Design and evaluation of mastery tests.
8. Designing and evaluating novel testing methods, such as flexilevel tests,
two-stage tests, multilevel tests, tailored tests.
9. Equating and preequating.
10. Study of item bias.

The topics, organization, and presentation are those used in a 4-week seminar
held each summer for the past several years. The material is organized primarily
to maintain the reader's interest and to facilitate understanding; thus all related
topics are not always packed into the same chapter. Some knowledge of classical
test theory, mathematical statistics, and calculus is helpful in reading this mate-
rial.
Chapter 1, a perspective on classical test theory, is perhaps not essential for

XI
Xii PREFACE

the reader. Chapter 2, an introduction to item response theory, is easy to read.

Some of Chapter 3 is important only for those who need to understand the
relation of item response theory to classical item analysis. Chapter 4 is essential
to any real understanding of item response theory and applications. The reader
who takes the trouble to master the basic ideas of Chapter 4 will have little
difficulty in learning what he wants from the rest of the book. The information
functions of Chapter 5, basic to most applications of item response theory, are
relatively easy to understand.
The later chapters are mostly independent of each other. The reader may
choose those that interest him and ignore the others. Except in Chapter 11 on
mastery testing and Chapters 16 and 17 on estimating true-score distributions, the
reader can usually skip over the mathematics in the later chapters, if that suits his
purpose. He will still gain a good general understanding of the applications under
discussion provided he has previously understood Chapter 4.
The basic ideas of Chapters 16 and 17, on estimated true-score distributions,
are important for the future development of mental test theory. These chapters
are not a basic part of item response theory and may be omitted by the general
reader.
Reviewers will urge the need for a book on item response theory that does not
require the mathematical understanding required here. There is such a need; such
books will be written soon, by other authors (see Warm, 1978).
Journal publications in the field of item response theory, including publica-
tions on the Rasch model, are already very numerous. Some of these publications
are excellent; some are exceptionally poor. The reader will not find all important
publications listed in this book, but he will find enough to guide him in further
search (see also Cohen, 1979).
I am very much in debt to Marilyn Wingersky for her continual help in the
theoretical, computational, mathematical, and instructional work underlying this
book. I greatly appreciate the help of Martha Stocking, who read (and checked) a
semifinal manuscript; the errors in the final publication were introduced by me
subsequent to her work. I thank William H. Angoff, Charles E. Davis, Ronald
K. Hambleton, and Hariharan Swaminathan and their students, Huynh Huynh,
Samuel A. Livingston, Donald B. Rubin, Fumiko Samejima,Wim J. van der
Linden, Wendy M. Yen, and many of my own students for their helpful com-
ments on part or all of earlier versions of the manuscript. I am especially indebted
to Donna Lembeck who typed innumerable revisions of text, formulas, and
tables, drew some of the diagrams, and organized production of the manuscript. I
would also like to thank Marie Davis and Sally Hagen for proofreading numerous
versions of the manuscript and Ann King for editorial assistance.
Most of the developments reported in this book were made possible by the
support of the Personnel and Training Branch, Office of Naval Research, in the
form of contracts covering the period 1952-1972, and by grants from the
Psychobiology Program of the National Science Foundation covering the period
PREFACE xiii

1972-1976. This essential support was gratefully acknowledged in original jour-

nal publications; it is not detailed here. The publication of this book was made
possible by a contract with the National Institute of Education. All the work in
this book was made possible by the continued generous support of Educational
Testing Service, starting in 1948. Data for ETS tests are published here by
permission.

References
Cohen, A. S. Bibliography of papers on latent trait assessment. Evanston, Ill.: Region V Technical
Assistance Center, Educational Testing Service Midwestern Regional Office, 1979.
Warm, T. A. A primer of item response theory. Technical Report 941078. Oklahoma City, Okla.:
U.S. Coast Guard Institute, 1978.

FREDERIC M. LORD
I INTRODUCTION TO ITEM
RESPONSE THEORY
1 Classical Test Theory—
Summary and Perspective

1.1. INTRODUCTION

This chapter is not a substitute for a course in classical test theory. On the
contrary, some knowledge of classical theory is presumed. The purpose of this
chapter is to provide some perspective on basic ideas that are fundamental to all
subsequent work.
A psychological or educational test is a device for obtaining a sample of
behavior. Usually the behavior is quantified in some way to obtain a numerical
score. Such scores are tabulated and counted. Their relations to other variables of
interest are studied empirically.
If the necessary relationships can be established empirically, the scores may
then be used to predict some future behavior of the individuals tested. This is
actuarial science. It can all be done without any special theory. On this basis, it is
sometimes asserted from an operationalist viewpoint that there is no need for any
deeper theory of test scores.
Two or more "parallel" forms of a published test are commonly produced.
We usually find that a person obtains different scores on different test forms.
How shall these be viewed?
Differences between scores on parallel forms administered at about the same
time are usually not of much use for describing the individual tested. If we want a
single score to describe his test performance, it is natural to average his scores
across the test forms taken. For usual scoring methods, the result is effectively
the same as if all forms administered had been combined and treated as a single
test.
The individual's average score across test forms will usually be a better

3
4 1. CLASSICAL TEST THEORY—SUMMARY AND PERSPECTIVE

measurement than his score on any single form, because the average score is
based on a larger sample of behavior. Already we see that there is something of
deeper significance than the individual's score on a particular test form.

1.2. TRUE SCORE

In actual practice we cannot administer very many forms of a test to a single
individual so as to obtain a better sample of his behavior. Conceptually, how
ever, it is useful to think of doing just this, the individual remaining unchanged
throughout the process.
The individual's average score over a set of postulated test forms is a useful
concept. This concept is formalized by a mathematical model. The individual's
score X on a particular test form is considered to be a chance variable with some,
usually unknown, frequency distribution. The mean (expected value) of this
distribution is called the individual's true score T. Certain conclusions about true
scores T and observed scores X follow automatically from this model and defini
tion.
Denote the discrepancy between T and X by
E ≡X - T; (1-1)
E is called the error of measurement. Since by definition the expected value of X
is T, the expectation of E is zero:
μE|T ≡ μ(X - T)|T ≡ μX|T - μT|T = T - T = 0, (1-2)

where μ denotes a mean and the subscripts indicate that T is fixed.

Equation (1-2) states that the errors of measurement are unbiased. This
follows automatically from the definition of true score; it does not depend on any
ad hoc assumption. By the same argument, in a group of people,
μT ≡ μX - μE ≡ μX.

Equation (1-2) gives the regression of E on T. Since mean E is constant regard

less of T, this regression has zero slope. It follows that true score and error are
uncorrelated in any group:
ΡET = 0. (1-3)
Note, again, that this follows from the definition of true score, not from any
special assumption.
From Eq. (1-1) and (1-3), since T and E are uncorrelated, the observed-score
variance in any group is made up of two components:
σ2X ≡ σ 2 T + E ≡ σ2T + σ2E. (1-4)
The covariance of X and T is
=
σXT ≡ σ(T + E)T σ2T + σET =
σ2T . (1-5)
1.2. TRUE SCORE 5

An important quantity is the test reliability, the squared correlation between X

and T, by (1-5),

Σ2XT Σ2T
Ρ2XT ≡
σ2Xσ2T σ2X
2
= 1 - σ2 E (1-6)
σ X

If ΡXT were nearly 1.00, we could safely substitute the available test score X for
the unknown measurement of interest T.
Equations (1-2) through (1-6) are tautologies that follow automatically from
the definition of T and E.
What has our deeper theory gained for us? The theory arises from the realiza
tions that T, not X, is the quantity of real interest. When a job applicant leaves
the room where he was tested, it is T, not X, that determines his capacity for
future performance.
We cannot observe T, but we can make useful inferences about it. How this is
done becomes apparent in subsequent sections (also, see Section 4.2).
An example will illustrate how true-score theory leads to different conclusions
than would be reached by a simple consideration of observed scores. An
achievement test is administered to a large group of children. The lowest scoring
children are selected for special training. A week later the specially trained
children are retested to determine the effect of the training.
True-score theory shows that a person may receive a very low test score either
because his true score is low or because his error score E is low (he was
unlucky), or both. The lowest scoring children in a large group most likely have
not only low T but also low E. If they are retested, the odds are against their being
so unlucky a second time. Thus, even if their true scores have not increased, their
observed scores will probably be higher on the second testing. Without true-score
theory, the probable observed-score increase would be credited to the special
training. This effect has caused many educational innovations to be mistakenly
labeled ''successful.''
It is true that repeated observations of test scores and retest scores could lead
the actuarial scientist to the observation that in practice, other things being equal,
initially low-scoring children tend to score higher on retesting. The important
point is that true-score theory predicts this conclusion before any tests are given
and also explains the reason for this odd occurrence. For further theoretical
discussion, see Linn and Slinde (1977) and Lord (1963). In practical applica
tions, we can determine the effects of special training for the low-scoring chil
dren by splitting them at random into two groups, comparing the experimental
group that received the training with the control group that did not.
Note that we do not define true score as the limit of some (operationally
impossible) process. The true score is a mathematical abstraction. A statistician
doing an analysis of variance components does not try to define the model
6 1. CLASSICAL TEST THEORY—SUMMARY AND PERSPECTIVE

parameters as if they actually existed in the real world. A statistical model is

chosen, expressed in mathematical terms undefined in the real world. The ques
tion of whether the real world corresponds to the model is a separate question to
be answered as best we can. It is neither necessary nor appropriate to define a
person's true score or other statistical parameter by real world operational proce
dures .

1.3. UNCORRELATED ERRORS

Equations (1-1) through (1-6) cannot be disproved by any set of data. These
equations do not enable us to estimate σ2T|, σ2E|, or ρXT, however. To estimate
these important quantities, we need to make some assumptions. Note that no
assumption about the real world has been made up to this point.
It is usual to assume that errors of measurement are uncorrelated with true
scores on different tests and with each other: For tests X and Y,
ρ(E X ,E Y ) = 0, ρ(E X ,T Y ) = 0 (X≠Y). (1-7)
Exceptions to these assumptions are considered in path analysis (Hauser &
Goldberger, 1971; Milliken, 1971; Werts, Linn, & Jöreskog, 1974; Werts,
Rock, Linn, & Jöreskog, 1977).

1.4. PARALLEL TEST FORMS

If a test is constructed by random sampling from a pool or "universe" of items,

then σ2E, σ2T, and ρXT can be estimated without building any parallel test forms
(Lord & Novick, 1968, Chapter 11). But perhaps we do not wish to assume that
our test was constructed in this way. If three or more roughly parallel test forms
are available, these same parameters can be estimated by the theory of nominally
parallel tests (Lord & Novick, 1968, Chapter 8; Cronbach, Gleser, Nanda, &
Rajaratnam, 1972), an application of analysis of variance components.
In contrast, classical test theory assumes that we can build strictly parallel test
forms. By definition, every individual has (1) the same true score and (2) the
same conditional error variance σ2(E|T) on all strictly parallel forms:
T = T', σ2(E|T) = σ 2 (E'|T'), (1-8)
where the prime denotes a (strictly) parallel test. It follows that σ2X = σ2X'.
When strictly parallel forms are available, the important parameters of the
latent variables T and E can be estimated from the observed-score variance and
from the intercorrelation between parallel test forms by the following familiar
equations of classical test theory:
Ρ2XT ( = Ρ2X'T') = ΡXX', (1-9)
APPENDIX 7

σ2T ( = σ2T) = σ 2 X ρxx', (1-10)

2 2 2
σ E ( = σ E') = σ X(1 - ρXX’). (1-11)

1.5. ENVOI

In item response theory (as discussed in the remaining chapters of this book) the
expected value of the observed score is still called the true score. The discrep
ancy between observed score and true score is still called the error of measure
ment. The errors of measurement are thus necessarily unbiased and uncorrelated
with true score. The assumptions of (1-7) will be satisfied also; thus all the
remaining equations in this chapter, including those in the Appendix, will hold.
Nothing in this book will contradict either the assumptions or the basic con
clusions of classical test theory. Additional assumptions will be made; these will
allow us to answer questions that classical theory cannot answer. Although we
will supplement rather than contradict classical theory, it is surprising how little
we will use classical theory explicitly.
Further basic ideas and formulas of classical test theory are summarized for
easy reference in an appendix to this chapter. The reader may skip to Chapter 2.

APPENDIX

Regression and Attenuation

From (1-9), (1-10), (1-11) we obtain formulas for the linear regression coeffi
cients:
ΒXT = 1, ΒTX = ρxx'. (l-l2)
Let ξ and η be the true scores on tests X and Y, respectively. As in (1-5), σξη =
(ΣXY. From this and (1-10) we find the important correction for attenuation,
σξη σXY ρXY
Pen = (1-13)
σξση σXσY √ΡXX’ρYY' √ρXX’ρYY’
From this comes a key inequality:
√ρXX’ ≥ ρXY. (1-14)
This says that test validity (correlation of test score X with any criterion Y) is
never greater than the square root of the test reliability.

Composite Tests
Up to this point, there has been no assumption that our test is composed of
subtests or of test items. If the test score X is a sum of subtest or item scores Yi,
8 1. CLASSICAL TEST THEORY—SUMMARY AND PERSPECTIVE

so that
n
X =
Σ
i=1
Yi,

then certain tautologies follow:

n n(n-l)
σ2X = 2 σ2i + 22 σij, (1-15)
i=1 i≠j

where σi ≡ σ(Y i ) and σij ≡ σ(Yi, Y j ). Similarly,

Σ Σ σii'
i i'
ρXX' = 2
(1-16)
σ X

where V indexes the items in test X'. If all subtests are parallel,
nρYY'+ (1-17)
ρXX' = ,
1 + (n - l)ρYY'

the Spearman-Brown formula.

Coefficient alpha (a) is obtained from (1-15) and (1-16) and from the
Cauchy-Schwartz inequality:

n Σ σ2i
Ρ2XT = ρXX'≥ 1- ≡ α. (1-18)
n - 1 ( σ2X )
Alpha is not a reliability coefficient; it is a lower bound.
If items are scored either 0 or 1, α becomes the Kuder-Richardson formula-20
coefficient ρ20: from (1-18) and (1-23),

n Σπi (1 - πi)
Ρ2XT = ρXX'≥
n - 1
1 -
σ2
X } = ρ20, (1-19)

where πi is the proportion of correct answers (Yi = 1) for item i. Also,

n μX(n - μX)
ρ20 ≥
n - 1 [1 -
nσ2X } = ρ21 , (1-20)

the Kuder-Richardson formula-21 coefficient.

Item Theory
Denote the score on item i by Yi. Classical item analysis provides various
tautologies. The variance of the test scores is

σ2X =Σi 2σσρ

3
i j ij = σX
2σρ
i
i iX , (1-21)

where ρij and ρix are Pearson product moment correlation coefficients. If Yi is
always 0 or 1, then X is the number-right score, the interitem correlation ρij is a
APPENDIX 9

phi coefficient, and ρix is an item-test point biserial correlation. Classical item
analysis theory may deal also with the biserial correlation between item score and
test score and with the tetrachoric correlations between items (see Lord &
Novick, 1968, Chapter 15). In the case of dichotomously scored items (Yi = 0 or
1), we have
n
μx = πii, (1-22)
Σ
i=l

σ2i = πi(l - πi). (1-23)

From (1-18) and (1-21), coefficient α is
Σσ2i
α = n 1 - (1-24)
n - 1 ( Σ Σ σiσjρij ).
If C is an outside criterion, the test validity coefficient is

Σ σiρiC
i
ρxc = √ Σ Σ σ σ ρ . (1-25)
i j ij
i 3

These two formulas provide the two paradoxical classical rules for building a
test:

1. To maximize test reliability, choose test items that correlate as high as

possible with each other.
2. To maximize validity, choose test items that correlate as high as possible
with the criterion and as low as possible with each other.

Overview
Classical test theory is based on the weak assumptions (1-7) plus the assumption
that we can build strictly parallel tests. Most of its equations are unlikely to be
contradicted by data. Equations (1-1) through (1-13) are unlikely to be falsified,
since they involve the unobservable variables T and E. Equations (1-15), (1-16),
and (l-20)-(l-25) cannot be falsified because they are tautologies.
The only remaining equations of those listed are (1-14) and (1-17)—(1-19).
These are the best known and most widely used practical outcomes of classical
test theory. Suppose when we substitute sample statistics for parameters in
(1-17), the equality is not satisfied. We are likely to conclude that the discrep
ancies are due to sampling fluctuations or else that the subtests are not really
strictly parallel.
The assumption (1-7) of uncorrelated errors is also open to question, however.
Equations (1-7) can sometimes be disproved by path analysis methods. Similar
comments apply to (1-14), (1-18), and (1-19).
10 1. CLASSICAL TEST THEORY—SUMMARY AND PERSPECTIVE

Note that classical test theory deals exclusively with first and second moments:
with means, variances, and covariances. An extension of classical test theory
to higher-order moments is given in Lord and Novick (1968, Chapter 10). With-
out such extension, classical test theory cannot investigate the linearity or non-
linearity of a regression, nor the normality or nonnormality of a frequency
distribution.

REFERENCES

Cronbach, L. J., Gleser, G. C , Nanda, H., & Rajaratnam, N. The dependability of behavioral
measurements: Theory of generalizability for scores and profiles. New York: Wiley, 1972.
Hauser, R. M., & Goldberger, A. S. The treatment of unobservable variables in path analysis. In H.
L. Costner (Ed.), Sociological methodology, 1971. San Francisco: Jossey-Bass, 1971.
Linn, R. L., & Slinde, J. A. The determination of the significance of change between pre- and
posttesting periods. Review of Educational Research, 1977, 47, 121-150.
Lord, F. M. Elementary models for measuring change. In C. W. Harris (Ed.), Problems in measur-
ing change. Madison: University of Wisconsin Press, 1963.
Lord, F. M., & Novick, M. R. Statistical theories of mental test scores. Reading, Mass.: Addison-
Wesley, 1968.
Milliken, G. A. New criteria for estimability for linear models. The Annals of Mathematical Statis-
tics, 1971, 42, 1588-1594
Werts, C. E., Linn, R. L., & Jöreskog, K. G. Intraclass reliability estimates: Testing structural
assumptions. Educational and Psychological Measurement, 1974, 34, 25-33.
Werts, C. E., Rock, D. A., Linn, R. L., & Jöreskog, K. G. Validating psychometric assumptions
within and between several populations. Educational and Psychological Measurement, 1977,
37, 863-872.
2 Item Response Theory—
Introduction and Preview

2.1. INTRODUCTION

Commonly, a test consists of separate items and the test score is a (possibly
weighted) sum of item scores. In this case, statistics describing the test scores of
a certain group of examinees can be expressed algebraically in terms of statistics
describing the individual item scores for the same group [see Eq. (1-21) to
(1-25)]. As already noted, classical item theory (which is only a part of classical
test theory) consists of such algebraic tautologies.
Such a theory makes no assumptions about matters that are beyond the control
of the psychometrician. It cannot predict how individuals will respond to items
unless the items have previously been administered to similar individuals. In
practical test development work, we need to be able to predict the statistical and
psychometric properties of any test that we may build when administered to any
target group of examinees. We need to describe the items by item parameters and
the examinees by examinee parameters in such a way that we can predict prob-
abilistically the response of any examinee to any item, even if similar examinees
have never taken similar items before. This involves making predictions about
things beyond the control of the psychometrician—predictions about how people
will behave in the real world.
As an especially clear illustration of the need for such a theory, consider the
basic problem of tailored testing: Given an individual's response to a few items
already administered, choose from an available pool one item to be administered
to him next. This choice must be made so that after repeated similar choices the
examinee's ability or skill can be estimated as accurately as possible from his
responses. To do this even approximately, we must be able to estimate the

11
12 2. ITEM RESPONSE THEORY—INTRODUCTION AND PREVIEW

examinee's ability from any set of items that may be given to him. We must also
know how effective each item in the pool is for measuring at each ability level.
Neither of these things can be done by means of classical mental test theory.
In most testing work, our main task is to infer the examinee's ability level or
skill. In order to do this, we must know something about how his ability or skill
determines his response to an item. Thus item response theory starts with a
mathematical statement as to how response depends on level of ability or skill.
This relationship is given by the item response function (trace line, item charac
teristic curve).
This book deals chiefly with dichotomously scored items. Responses will be
referred to as right or wrong (but see Chapter 15 for dealing with omitted
responses). Early work in this area was done by Brogden (1946), Lawley (1943),
Lazarsfeld (see Lazarsfeld & Henry, 1968), Lord (1952), and Solomon (1961),
among others. Some polychotomous item response models are treated by Ander
sen (1973a, b), Bock (1972, 1975), and Samejima (1969, 1972). Related models
in bioassay are treated by Aitchison and Bennett (1970), Amemiya (1974a, b, c),
Cox (1970), Finney (1971), Gurland, Ilbok, and Dahm (1960), Mantel (1966),
van Strik (1960).

2.2. ITEM RESPONSE FUNCTIONS

Let us denote by 6 the trait (ability, skill, etc.) to be measured. For a dichotomous
item, the item response function is simply the probability P or P(θ) of a correct
response to the item. Throughout this book, it is (very reasonably) assumed that
P(d) increases as 6 increases. A common assumption is that this probability can
be represented by the (three-parameter) logistic function

P = p(θ) = c + 1 - c
, (2-1)
1 + e-l.7a(θ-b)

where a, b, and c are parameters characterizing the item, and e is the mathemat
ical constant 2.71828. . . . Logistic item response functions for 50 four-choice
word-relations items are shown in Fig. 2.2.1 to illustrate the variety found in a
typical published test. This logistic model was originated and developed by Allan
Birnbaum.
Figure 2.2.2 illustrates the meaning of the item parameters. Parameter c is the
probability that a person completely lacking in ability (θ = —∞)will answer the
item correctly. It is called the guessing parameter or the pseudo-chance score
level. If an item cannot be answered correctly by guessing, then c = 0.
Parameter b is a location parameter: It determines the position of the curve
along the ability scale. It is called the item difficulty. The more difficult the item,
the further the curve is to the right. The logistic curve has its inflexion point at
θ= b. When there is no guessing, b is the ability level where the probability of a
2.2. ITEM RESPONSE FUNCTIONS 13

1.0
P(θ)

0.9

0.8

0.7

0.6

0.5

0.4

0.3

0.2

O. I

0.0
-2.5 -2.0 - I .5 - I .0 -0,5 0.0 0.5 I .0 I .5 2.0 2.5
θ

FIG. 2.2.1. Item response functions for SCAT II Verbal Test, Form 2B.

correct answer is .5. When there is guessing, b is the ability level where the
probability of a correct answer is halfway between c and 1.0.
Parameter a is proportional to the slope of the curve at the inflexion point [this
slope actually is .425^(1 — c)]. Thus a represents the discriminating power of
the item, the degree to which item response varies with ability level.
An alternative form of item response function is also frequently used: the
(three-parameter) normal ogive,

P ≡ P(θ) = c + (1 - c)
i a(θ-b)
- 0 0
1
√2π
e-t2/2dt. (2-2)

Again, c is the height of the lower asymptote; b is the ability level at the point of
inflexion, where the probability of a correct answer is (1 + c)/2; a is propor-
14 2. ITEM RESPONSE THEORY—INTRODUCTION AND PREVIEW

P(θ)
10

Inf l e x ion

0 9
b
FIG. 2.2.2. Meaning of item parameters (see text).

tional to the slope of the curve at the inflexion point [this slope actually is a(l —
C)/√2Π].
The difference between functions (2-1) and (2-2) is less than .01 for every set
of parameter values. On the other hand, for c = 0, the ratio of the logistic
function to the normal function is 1.0 at a(θ - b) = 0, .97 at - 1, 1.4 at - 2, 2.3
at - 2.5, 4.5 at - 3, and 34.8 at - 4. The two models (2-1) and (2-2) give very
similar results for most practical work.
The reader may ask for some a priori justification of (2-1) or (2-2). No
convincing a priori justification exists (however, see Chapter 3). The model must
be justified on the basis of the results obtained, not on a priori grounds.
No one has yet shown that either (2-1) or (2-2) fits mental test data signifi
cantly better than the other. The following references are relevant for any statisti
cal investigation along these lines: Chambers and Cox (1967), Cox (1961, 1962),
Dyer (1973, 1974), Meeter, Pirie, and Blot (1970), Pereira (1977a, b) Quesen-
berry and Starbuck (1976), Stone (1977).
In principle, examinees at high ability levels should virtually never answer an
easy item incorrectly. In practice, however, such an examinee will occasionally
make a careless mistake. Since the logistic function approaches its asymptotes
less rapidly than the normal ogive, such careless mistakes will do less violence to
the logistic than to the normal ogive model. This is probably a good reason for
preferring the logistic model in practical work.
Prentice (1976) has suggested a two-parameter family of functions that in
cludes both (2-1) and (2-2) when a = 1, b = 0, and c = 0 and also includes a
variety of skewed functions. The location, scale, and guessing parameters are
easily added to obtain a five-parameter family of item response curves, each item
being described by five parameters.
2.3. CHECKING THE MATHEMATICAL MODEL 15

2.3. CHECKING THE MATHEMATICAL MODEL

Either (2-1) or (2-2) may provide a mathematical statement of the relation be-
tween the examinee's ability and his response to a test item. A more searching
consideration of the practical meaning of (2-1) and (2-2) is found in Section 15.7.
Such mathematical models can be used with confidence only after repeated
and extensive checking of their applicability. If ability could be measured accu-
rately, the models could be checked directly. Since ability cannot be measured
accurately, checking is much more difficult. An ideal check would be to infer
from the model the small-sample frequency distribution of some observable
quantity whose distribution does not depend on unknown parameters. This does
not seem to be possible in the present situation.
The usual procedure is to make various tangible predictions from the model
and then to check with observed data to see if these predictions are approximately
correct. One substitutes estimated parameters for true parameters and hopes to
obtain an approximate fit to observed data. Just how poor a fit to the data can be
tolerated cannot be stated exactly because exact sampling variances are not
known. Examples of this sort of check on the model are found throughout this
book. See especially Fig. 3.5.1. If time after time such checks are found to be
satisfactory, then one develops confidence in the practical value of the model for
predicting observable results.
Several researchers have produced simulated data and have checked the fit of
estimated parameters to the true parameters (which are known since they were
used to generate the data). Note that this convenient procedure is not a check on
the adequacy of the model for describing the real world. It is simply a check on
the adequacy of whatever procedures the researcher is using for parameter esti-
mation (see Chapter 12).
At this point, let us look at a somewhat different type of check on our item
response model (2-1). The solid curves in Fig. 2.3.1 are the logistic response
curves for five SAT verbal items estimated from the response data of 2862
students, using the methods of Chapter 12. The dashed curves were estimated,
almost without assumption as to their mathematical form, from data on a total
sample of 103,275 students, using the totally different methods of Section 16.13.
The surprising closeness of agreement between the logistic and the unconstrained
item response functions gives us confidence in the practical value of the logistic
model, at least for verbal items like these.
The following facts may be noted, to point up the significance of this result:

1. The solid and dashed curves were obtained from totally different assump-
tions. The solid curve assumes the logistic function, also that the test items all
measure just one psychological dimension. The dashed curve assumes only that
the conditional distribution of number-right observed score for given true score is
a certain approximation to a generalized binomial distribution.
2. The solid and dashed curves were obtained from different kinds of raw
1.0 1.0

.8 .8

P
.5 .5
P

.2 .2

0 0
-2 -I 0 1 2 -2 -I 0 1 2
θ θ
1.0 1.0

.8 .8
10
13
P

.5 5

.2 .2

0 0
-2 -I 0 1 2 -2 -I 0 1 2

θ θ
1.0

30
P

0
-2 -1 0 1 2

θ
FIG. 2.3.1. Five item characteristic curves estimated by two different methods.
(From F. M. Lord, Item characteristic curves estimated without knowledge of
their mathematical form—a confrontation of Birnbaum's logistic model. Psycho-
metrika, 1970, 35, 43-50.)

16
2.3. CHECKING THE MATHEMATICAL MODEL 17

data. The solid curve comes from an analysis of all the responses of a sample of
students to all 90 SAT verbal items. The dashed curve is obtained just from
frequency distributions of number-right scores on the SAT verbal test and, in a
minor way, from the variance across items of the proportion of correct answers to
the item.
3. The solid curve is a logistic function. The dashed curve is the ratio of two
polynomials, each of degree 89.
4. The solid curve was estimated from a bimodal sample of 2862 examinees,
selected by stratified sampling to include many high-ability and many low-ability
students. The dashed curve was estimated from all 103,275 students tested in a
regular College Board test administration.

Further details of this study are given in Sections 16.12 and 16.13.
These five items are the only items to be analyzed to date by this method. The
five items were chosen solely for the variety of shapes represented. If a hundred
or so items were analyzed in this way, it is likely that some poorer fits would be
found.
It is too much to expect that (2-1) or (2-2) will hold exactly for every test item
and for every examinee. If some examinees become tired, sick, or uncooperative
partway through the testing, the mathematical model will not be strictly appro-
priate for them. If some test items are ambiguous, have no correct answer, or
have more than one correct answer, the model will not fit such items. If exam-
inees omit some items, skip back and forth through the test, and do not have time
to finish the test, perhaps marking all unfinished items at random, the model
again will not apply.
A test writer tries to provide attractive incorrect alternatives for each
multiple-choice item. We may imagine examinees so completely lacking in
ability that they do not even notice the attractiveness of such alternatives and so
respond to the items completely at random; their probability of success on such
items will be 1/A, where A is the number of alternatives per item. We may also
imagine other examinees with sufficient ability to see the attractiveness of the
incorrect alternatives although still lacking any knowledge of the correct answer;
their probability of success on such items is often less than 1/A. If this occurs, the
item response function is not an increasing function of ability and cannot be fitted
by any of the usual mathematical models.
We might next imagine examinees who have just enough ability to eliminate
one (or two, or three,.. .) of the incorrect alternatives from consideration, al-
though still lacking any knowledge of the correct answer. Such examinees might
be expected to have a chance of 1/(A - 1) (or 1/(A - 2), 1/(A - 3),. . .) of
answering the item correctly, perhaps producing an item response function look-
ing like a staircase.
Such anticipated difficulties deterred the writer for many years from research
on item response theory. Finally, a large-scale empirical study of 150 five-choice
18
100
100

80
80

60
60

RIGHT
RIGHT

%
%

40
40

20
20

0
0

0 30 60 90 6 30 60 90
SCORE SCORE

FIG. 2.3.2. Proportion of correct answers to an item as a function of number-right test score. The
two items shown are the two worst examples of nonmonotonicity among the 150 items studied.
2.4. UNIDIMENSIONAL TESTS 19

items was made to determine proportion of correct answers as a function of

number-right test score. With a total of 103,275 examinees, these proportions
could be determined with considerable accuracy. Out of 150 items, only six were
found that clearly failed to be increasing functions of total test score, and for
these the failure was so minor as to be of little practical importance. The results
for the two worst items are displayed in Figure 2.3.2; the crosses show where the
curve would have been if examinees omitting the item had chosen at random
among the five alternative responses instead. No staircase functions or other
serious difficulties were found.

2.4. UNIDIMENSIONAL TESTS

Equation (2-1) or (2-2) asserts that probability of success on an item depends on

three item parameters, on examinee ability 0, and on nothing else. If the model is
true, a person's ability 0 is all we need in order to determine his probability of
success on a specified item. If we know the examinee's ability, any knowledge of
his success or failure on other items will add nothing to this determination. (If it
did add something, then performance on the items in question would depend in
part on some trait other than 0; but this is contrary to our assumption.)
The principle just stated is Lazarsfeld's assumption of local independence.
Stated formally, Prob(success on item i given θ) = Prob(success on item i given
θ and given also his performance on items j , k, . . .). If ut = 0 or 1 denotes the
score on item i, then this may be written more compactly as
P(ui = l|θ) = P(ui = l|θ, uj, uk,.. .) (i ≠ j , k, . . .). (2-3)
A mathematically equivalent statement of local independence is that the prob
ability of success on all items is equal to the product of the separate probabilities
of success. For just three items i, j , k, for example,
P(ui = 1, uj = 1, uk = 1|θ) = P(ui = 1|θ)P(uj = 1|θ)P(uk = 1|θ).
(2-4)

Local independence requires that any two items be uncorrelated when θ is

fixed. It definitely does not require that items be uncorrelated in ordinary groups,
where θ varies. Note in particular that local independence follows automatically
from unidimensionality. It is not an additional assumption.
If the items measure just one dimension (θ), if θ is normally distributed in the
group tested, and if model (2-2) holds with c = 0 (there is no guessing), then the
matrix of tetrachoric intercorrelations among the items will be of unit rank (see
Section 3.6). In this case, we can think of θ as the common factor of the items.
This gives us a clearer understanding of what is meant by θ and what is meant by
unidimensionality.
20 2. ITEM RESPONSE THEORY—INTRODUCTION AND PREVIEW

Note, however, that latent trait theory is more general than factor analysis.
Ability θ is probably not normally distributed for most groups of examinees.
Unidimensionality, however, is a property of the items; it does not cease to exist
just because we have changed the distribution of ability in the group tested.
Tetrachoric correlations are inappropriate for nonnormal distributions of ability;
they are also inappropriate when the item response function is not a normal
ogive. Tetrachoric correlations are always inappropriate whenever there is guess
ing. This poses a problem for factor analysts in defining what is meant by
common factor, but it does not disturb the unidimensionality of a pool of items.
It seems plausible that tests of spelling, vocabulary, reading comprehension,
arithmetic reasoning, word analogies, number series, and various types of spatial
tests should be approximately one-dimensional. We can easily imagine tests that
are not. An achievement test in chemistry might in part require mathematical
training or arithmetic skill and in part require knowledge of nonmathematical
facts.
Item response theory can be readily formulated to cover cases where the test
items measure more than one latent trait. Practical application of multidimen
sional item response theory is beyond the present state of the art, however,
10
8
SIZE OF ROOT

6
4
2
0

0 1 2 3 4 5 6 7 8 9 10 II 12

RANKING

FIG. 2.4.1. The 12 largest latent roots in order of size for the SCAT 2A Verbal
Test.
2.5. PREVIEW 21

except in special cases (Kolakowski & Bock, 1978; Mulaik, 1972; Samejima,
1974; Sympson, 1977).
There is great need for a statistical significance test for the unidimensionality
of a set of test items. An attempt in this direction has been made by Christof-
fersson (1975), Indow and Samejima (1962), and Muthén (1977).
A rough procedure is to compute the latent roots of the tetrachoric item
intercorrelation matrix with estimated communalities placed in the diagonal. If
(1) the first root is large compared to the second and (2) the second root is not
much larger than any of the others, then the items are approximately unidimen-
sional. This procedure is probably useful even though tetrachoric correlation
cannot usually be strictly justified. (Note that Jöreskog's maximum likelihood
factor analysis and accompanying significance tests are not strictly applicable to
tetrachoric correlation matrices.)
Figure 2.4.1 shows the first 12 latent roots obtained in this way for the SCAT
II Verbal Test, Form 2A. This test consists of 50 word-relations items. The data
were the responses of a sample of 3000 high school students. The plot suggests
that the items are reasonably one-dimensional.

2.5. PREVIEW

In order to motivate the detailed study of item response functions in succeeding

chapters, it seems worthwhile to outline briefly just a few of the practical results
to be developed. At this point, the reader should expect only a preview, not a
detailed explanation.
For each item there is an item information function I{θ, ui} that can be
determined from the formula
P'i2
I{θ, ui,} = (2-5)
PiQi ,
where Pi = Pi(θ) is the item response function, Qi = 1 - Pi, and P'i is the
derivative of Pi with respect to θ [the formula for P'i can be written out explicitly
once a particular item response function, such as (2-1) or (2-2), is chosen]. The
item information functions for the five items (10, 11, 13, 30, 47) in Fig. 2.3.1
are shown in Fig. 2.5.1.
The amount of information given by an item varies with ability level θ. The
higher the curve, the more the information. Information at a given ability level
varies directly as the square of the item discriminating power, ai. If one informa
tion function is twice as high as another at some particular ability level, then it
will take two items of the latter type to measure as well as one item of the former
type at that ability level.
There is also a test information function I{θ}, which is inversely proportional
to the square of the length of the asymptotic confidence interval for estimating
22 2. ITEM RESPONSE THEORY—INTRODUCTION AND PREVIEW

4.0

3.0

10
Information

2.0

1.0

30
II

47
0.0

Ability

FIG. 2.5.1. Item and test information functions. (From F. M. Lord, An analysis
of the Verbal Scholastic Aptitude Test using Birnbaum's three-parameter logistic
model. Educational and Psychological Measurement, 1968, 28, 989-1020.)

the examinee's ability θ from his responses. It can be shown that the test informa
tion function I{θ} is simply the sum of the item information functions:

/{θ} = Σ I{θ,
i
ui}. (2-6)

The test information function for the five-item test is shown in Fig. 2.5.1.
We have in (2-6) the very important result that when item responses are
optimally weighted, the contribution of the item to the measurement effectiveness
of the total test does not depend on what other items are included in the test. This
is a different situation from that in classical test theory, where the contribution of
each item to test reliability or to test validity depends inextricably on what other
items are included in the test.
2.5. PREVIEW 23

Equation (2-6) suggests a convenient and effective procedure of test construc

tion. The procedure operates on a pool of items that have already been calibrated,
so that we have the item information curve for each item.
1. Decide on the shape desired for the test information function. The desired
curve is the target information curve.
2. Select items with item information curves that will fill the hard-to-fill areas
under the target information curve.
3. Cumulatively add the item information curves, obtaining at all times the
information curve for the part-test composed of items already selected.
4. Continue (backtracking if necessary) until the area under target informa
tion curve is filled to a satisfactory approximation.
The test information function represents the maximal amount of information
that can be obtained from the item responses by any kind of scoring method. The
linear composite Σi wi*ui of item scores ui (= 0 or 1) with weights

wi* = Pi' (2-7)

PiQi
is an optimal score yielding maximal information. The optimal score is not
directly useful since the optimal weights wi* depend on θ, which is unknown.
Very good scoring methods can be deduced from (2-7), however.
The logistic optimal weights for the five items of Fig. 2.3.1 are shown as
functions of θ in Fig. 2.5.2. It is obvious that the relative weighting of different
items is very different at low ability levels than at high ability levels. At high
levels, optimal item weights are proportional to item discriminating power ai. At
low ability levels, on the other hand, difficult items should receive near-zero
scoring weight, regardless of ai. The reason is that when low-ability examinees
guess at random on difficult items, this produces a random result that would
impair effective measurement if incorporated into the examinee's score; hence
the need for a near-zero scoring weight.
Two tests of the same trait can be compared very effectively in terms of their
information functions. The ratio of the information function of test y to the
information function of test x represents the relative efficiency of test y with
respect to x. Figure 6.9.1 shows the relative efficiency of a STEP vocabulary test
compared to a MAT vocabulary test. The STEP test is more efficient for low-
ability examinees, but much less efficient at higher ability levels. The dashed
horizontal line shows the efficiency that would be expected if the two tests
differed only in length (number of items).
Figure 6.10.1 shows the relative efficiency of variously modified hypothetical
SAT Verbal tests compared with an actual form of the test. Curve 2 shows the
effect of adding five items just like the five easiest items in the actual test. Curve
3 shows the effect of omitting five items of medium difficulty from the actual
test. Curve 4 shows the effect of replacing the five medium-difficulty items by
24 2. ITEM RESPONSE THEORY—INTRODUCTION AND PREVIEW

4.0

3.0

13
Optimum Weight

2.0

1.0

SAT Scaled Score

380 460 540 710
0.0

Ability

FIG. 2.5.2. Optimal (logistic) scoring weight for five items as a function of
ability level. (From F. M. Lord, An analysis of the Verbal Scholastic Aptitude
Test using Birnbaum's three-parameter logistic model. Educational and Psy-
chological Measurement, 1968, 28, 989-1020.)

the five additional easy items. Curve 6 shows the effect of discarding (not
scoring) the easier half of the test. Curve 7 shows the effect of discarding the
harder half of the test; notice that the resulting half-length test is actually better
for measuring low-ability examinees than is the regular full-length SAT. Curve 8
shows a hypothetical SAT just like the regular full-length SAT except that all
items are at the same middle difficulty level.
Results such as these are useful for planning revision of an existing test,
perhaps increasing its measurement effectiveness at certain specified ability
levels and decreasing its effectiveness at other levels. These and other useful
applications of item response theory are treated in detail in subsequent chapters.
REFERENCES 25

REFERENCES

Aitchison, J., & Bennett, J. A. Polychotomous quantal response by maximum indicant. Biometrika,
1970, 57, 253-262.
Amemiya, T. Qualitative response models. Technical Report No. 135. Stanford, Calif.: Institute for
Mathematical Studies in the Social Sciences, Stanford University, 1974. (a)
Amemiya, T. The maximum likelihood estimator vs. the minimum chi-square estimator in the
general qualitative response model. Technical Report No. 136. Stanford, Calif.: Institute for
Mathematical Studies in the Social Sciences, Stanford University, 1974. (b)
Amemiya, T. The equivalence of the nonlinear weighted least squares method and the method of
scoring in the general qualitative response model. Technical Report No. 137. Stanford, Calif.:
Institute for Mathematical Studies in the Social Sciences, Stanford University, 1974. (c)
Andersen, E. B. Conditional inference and models for measuring. Copenhagen: Mentalhygiejnisk
Forlag, 1973. (a)
Andersen, E. B. Conditional inference for multiple-choice questionnaires. British Journal of
Mathematical and Statistical Psychology, 1973, 26, 31-44. (b)
Bock, R. D. Estimating item parameters and latent ability when responses are scored in two or more
nominal categories. Psychometrika, 1972, 37, 29-51.
Bock, R. D. Multivariate statistical methods in behavioral research. New York: McGraw-Hill,
1975.
Brogden, H. E. Variation in test validity with variation in the distribution of item difficulties, number
of items, and degree of their intercorrelation. Psychometrika, 1946, 11, 197-214.
Chambers, E. A., & Cox, D. R. Discrimination between alternative binary response models.
Biometrika, 1967, 54, 573-578.
Christoffersson, A. Factor analysis of dichotomized variables. Psychometrika, 1975, 40, 5-32.
Cox, D. R. Tests of separate families of hypotheses. In J. Neyman (Ed.), Proceedings of the Fourth
Berkeley Symposium on Mathematical Statistics and Probability (Vol. 1). Berkeley: University of
California Press, 1961.
Cox, D. R. Further results on tests of separate families of hypotheses. Journal of the Royal Statistical
Society, 1962, 24, 406-424.
Cox, D. R. The analysis of binary data. London: Methuen, 1970.
Dyer, A. R. Discrimination procedures for separate families of hypotheses. Journal of the American
Statistical Association, 1973, 68, 970-974.
Dyer, A. R. Hypothesis testing procedures for separate families of hypotheses. Journal of the
American Statistical Association, 1974, 69, 140-145.
Finney, D. J. Probit analysis (3rd ed.). New York: Cambridge University Press, 1971.
Gurland, J., Ilbok, J., & Dahm, P. A. Polychotomous quantal response in biological assay. Biomet-
rics, 1960, 16, 382-398.
Indow, T., & Samejima, F. LIS measurement scale for non-verbal reasoning ability. Tokyo:
Nihon-Bunka Kagakusha, 1962. (In Japanese)
Kolakowski, D., & Bock, R. D. Multivariate generalizations of probit analysis. Unpublished manu-
script, 1978.
Lawley, D. N. On problems connected with item selection and test construction. Proceedings of the
Royal Society of Edinburgh, 1943, 61, 273-287.
Lazarsfeld, P. F., & Henry, N. W. Latent structure analysis. Boston: Houghton-Mifflin, 1968.
Lord, F. M. A theory of test scores. Psychometric Monograph No. 7. Psychometric Society, 1952.
Mantel, N. Models for complex contingency tables and polychotomous dosage response curves.
Biometrics, 1966, 22, 83-95.
Meeter, D., Pirie, W., & Blot, W. A comparison of two model discrimination criteria. Technomet-
rics, 1970, 12, 457-470.
26 2. ITEM RESPONSE THEORY—INTRODUCTION AND PREVIEW

Mulaik, S. A. A mathematical investigation of some multidimensional Rasch models for psychologi-

cal tests. Paper presented at the Spring meeting of the Psychometric Society, Princeton, N.J.,
1972.
Muthén, B. Statistical methodology for structural equation models involving latent variables with
dichotomous indicators. Unpublished doctoral dissertation, Uppsala University, 1977.
Pereira, B. de B. Discriminating among separate models: A bibliography. International Statistical
Review, 1977, 45, 163-172. (a)
Pereira, B. de B. A note on the consistency and on the finite sample comparisons of some tests of
separate families of hypotheses. Biometrika, 1977, 64, 109-113. (b)
Prentice, R. L. A generalization of the probit and logit methods for dose response curves. Biomet-
rics, 1976, 32, 761-768.
Quesenberry, C. P., & Starbuck, R. R. On optimal tests for separate hypotheses and conditional
probability integral transformations. Communications in Statistics, 1976, A5, 507-524.
Samejima, F. Estimation of latent ability using a response pattern of graded scores. Psychometric
Monograph No. 17. Psychometric Society, 1969.
Samejima, F. A general model for free-response data. Psychometric Monograph Supplement, No.
18, 1972.
Samejima, F. Normal ogive model on the continuous response level in the multidimensional latent
space. Psychometrika, 1974, 39, 111-121.
Solomon, H. (Ed.). Studies in item analysis and prediction. Stanford, Calif.: Stanford University
Press, 1961.
Stone, M. An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion.
Journal of the Royal Statistical Society, Series B, 1977, 39, 44-47.
Sympson, J. B. A model for testing with multidimensional items. Paper presented at the Com-
puterized Adaptive Testing Conference, Minneapolis, 1977.
van Strik, R. A method of estimating relative potency and its precision in the case of semi-
quantitative responses. Symposium on Quantitative Methods in Pharmacology, 1960. Amster-
dam: N. Holland Publishing Company.
3 Relation of
Item Response Theory to
Conventional Item Analysis

3.1. ITEM-TEST REGRESSIONS

In conventional item analysis, it is common to compare high- and low-scoring

students on their proportion of correct answers. Sometimes the students may be
divided on test score into as many as five levels and then the levels compared on
proportion of correct answers. An extension of this would be to divide the stu
dents into as many levels as there are test scores before making the comparison.
The proportion of correct answers to an (dichotomous) item is also the mean
item score, the mean of the statistical variable ui (= 0 or 1). Thus the curve
representing proportion of correct answers as a function of test score x is also the
regression of ui on x. Such a curve is called an item-observed score regression
(iosr). Note that for dichotomous items any item response function, as defined in
Chapter 2, can be considered by the same logic to be an item-ability regression,
the regression of ui on θ.
Figure 3.1.1 shows sample iosr for several SAT verbal and math items. Each
curve is computed from the responses of 103,275 examinees. The base line is
number-right score on the verbal test or on the math test omitting the item under
study. Points based on fewer than 50 examinees are not plotted.
These curves would be empirical item-response functions if the base line were
0 instead of number-right score x. Thus it is common, although incorrect (as we
shall see), to think that an iosr, like an item-ability regression, will have at least
approximately an ogive shape, like Eq. (2-1) or (2-2).
To show that iosr cannot all be approximately normal ogives, consider a test
composed of n items. Denote the iosr for item i by
μi|x = E(ui|x),

27
28
20 40 60 0 30 60 90
0 SCORE SCORE
100
100

8.0
80

60
60

% RIGHT
% RIGHT

40
40

20
20

0
0

0 30 60 90 0 30 60 90
SCORE SCORE

FIG. 3.1.1. Selected item-test regressions for five-choice Scholastic Aptitude Test items (crosses
show regression when omitted responses are replaced by random responses).

29
30 3. RELATION TO CONVENTIONAL ITEM ANALYSIS

the expectation being taken over all individuals at score level x. Now, for any
individual, x is the sum (over items) of his item scores; that is
n
(3-1)
x = Σu
i=l
i .

If we take the expectation of (3-1) for fixed x, we have

n n
x = E(x|x) = E ui|x
( Σ
i=1
)
=
Σ
i= 1
E(u ).
i|x

Then by definition
n n
(3-2)
Σ
i=1
μi|x ≡
Σ E(u )
i=1
i|x = x.

We can understand this general result most easily by considering the special
case when all the items are statistically equivalent. In this case, μi|x is by
definition the same for all items, so (3-2) can be written
n

Σμ
i= 1
i|x = nμi|x = x,

from which it follows that μi|x - x/n for each item. Thus the iosr of each item is
a straight line through the origin with slope 1/n. Note that for statistically equiva
lent items μi|x = x/n even when the items are entirely uncorrected with each
other. The iosr has a slope of 1/n even when the test does not measure anything!
This is still true if each item is negatively correlated with every other item!
All this proves that we cannot as a general matter expect item-observed score
regressions to be even approximately normal ogives. We shall not make further
use of item-observed score regressions in this book. The regression of item score
on true score is considered in Section 16.12.

3.2. RATIONALE FOR NORMAL OGIVE MODEL

The writer prefers to consider the choice of item response function, such as Eq.
(2-1) or (2-2), as a basic assumption to be justified by methods discussed in
Section 2.3 rather than by any a priori argument. This is particularly wise when
there is guessing, since one assumption often used in this case to deduce Eq.
(2-1) or (2-2) from a priori considerations is that examinees either know the
correct answer to the item or else guess at random. This assumption is totally
unacceptable and would discredit the entire theory if the theory depended on it.
The alternate, acceptable point of view is simply that Eq. (2-1) and (2-2) are
useful as versatile formulas capable of adequately representing a wide variety of
3.2. RATIONALE FOR NORMAL OGIVE MODEL 31

ogive-shaped functions that increase monotonically from a lower asymptote to

1.00. Justification of their use is to be sought in the results achieved, not in
further rationalizations. In this section, a rationale is provided for Eq. (2-2) in a
rather specialized situation in order to make this and similar item response
models seem plausible, not with the idea of providing a firm basis for their use.
Suppose that there is a (unobservable) latent variable Yi' that determines
examinee performance on item i. If for some examinee Yi' is greater than some
constant γi, then he answers the item correctly, so that ui = 1. Similarly, if for
some examinee Yi' < γi, then ui = 0. (There is zero probability that Yi' = γi, so
we need not discuss this case.) From the point of view of the factor analyst, Yi' is
a composite of (1) the common factor 6 of the test items and (2) a specific factor
or error factor for item i, not found in other items.
Note that the foregoing supposition rules out guessing. If the correct answer
can be obtained by a partially random process, then no attribute of the examinee
can determine whether ui = 0 or 1.
Assume now that
1. The regression μi'|θ of Yi' on θ is linear.
2. The scatter of Yi' about this regression is homoscedastic; in other words,
the conditional variance σi|'2θ = σi2.θ about the regression line is the same
for all θ.
3. The conditional distribution of Yi' given θ is normal.
Conditional distributions of Yi' for given θ are illustrated in Fig. 3.2.1. We
can see from the figure that the item response function Pi ≡ Pi(θ) ≡ Prob(ui =
l|θ) ≡ Prob(Yi' > γi|θ) is equal to a standardized normal curve area. A little
algebra shows that it is equal to the standardized normal curve area above (γi —
μi'|θ)/σ'i.θ, which will be denoted by — Li.
For convenience, let us choose (as we may) the scale of measurement for both
Yi' and θ so that for the entire bivariate population the unconditional means of
both variables are 0 and their unconditional standard deviations are 1. Then the
equation for the regression of Yi' on θ is simply μi'|θ =ρi'θ, where ρi' is the
correlation between Yi' and θ. The conditional variance about this regression is,
by standard formula, σi.θ = 1 – ρi'2. Making use of these last formulas, we have
γi - ρi'θ
-Li =
√1-ρi'θ
Let
ρi'
ai ≡ (3-3)
√1 -ρi'2
γi (3-4)
bi ≡
ρi' ,
so that — Li = ai(bi — θ).
32 3. RELATION TO CONVENTIONAL ITEM ANALYSIS

Pi(θ)
u =1
i

Pi(θ)
Pi(θ)

γ
i

Q i (θ)

Q i (θ)
Ui=0
μi|θ

θ
FIG. 3 . 2 . 1 . Hypothetical conditional distribution of Yi' for three levels of ability
θ, showing the regression μi|'θ and the cutting point γi that separates right an
swers from wrong answers.

For symmetric distributions,

∫∞-L =∫L-∞ ,

so, finally, the item response function is seen to be

ai(θ-bi.)
Pi (θ) =
∫ — ∞ √2π e
1 -t2/2
dt. (3-5)

This is the same as Eq. (2-2) for the normal ogive item response function when ci
= 0.
Note that we have not made any assumption about the distribution of ability θ
in the total group tested. In particular, contrary to some assertions in the litera
ture, we have not assumed that ability is normally distributed in the total group.
Furthermore, if (3-5) holds for some group of examinees, selection on θ will not
change the conditional distribution of Y'i for fixed θ and hence will not change
(3-5). Thus the shape of the distribution of ability θ in the total group tested is
irrelevant to our derivation of (3-5).
Equation (3-5) has the form of a cumulative frequency distribution, as do Eq.
(2-1) and (2-2) when c = 0. In general, however, there seems to be little reason
for thinking of an item response curve as a cumulative frequency distribution.
3.3. RELATION TO CONVENTIONAL ITEM STATISTICS 33

3.3. RELATION TO CONVENTIONAL ITEM STATISTICS

Conventional item analysis deals with πi, conventionally called the item diffi
culty, the proportion of examinees answering item i correctly. It also deals with
ρ i x , the product moment correlation between item score ui and number-right test
score x, often called the point-biserial item-test correlation, or else with ρ'ix, the
corresponding biserial item-test correlation. A general formula for the relation of
biserial correlation (ρ') to point-biserial correlation is
φ(γ)
ρ = ρ' , (3-6)
√Π(1 - π)

where φ(γ) is the normal curve ordinate at the point γ that cuts off area π of the
standardized normal curve.
If ability θ is normally distributed and ci = 0, then by definition the
product-moment correlation ρ'iθ or simply ρ'i) between Y'i and θ is also the
biserial correlation between ui and θ. Such a relationship is just what is meant by
biserial correlation.
There is also a product-moment or point-biserial correlation between ui and θ,
to be denoted by ρiθ. To the extent that number-right score x is a measure of
ability θ,ρ'ix is an approximation to ρ'i ≡ ρ'iθ and ρ'ix. is an approximation to ρ'iθ.
Combined with (3-3), this (crude) approximation yields a conceptually illuminat
ing crude relationship between the conventional item-test correlation and the ai
parameter of item response theory, valid only for the case where θ is normally
distributed and there is no guessing:
ρ'ix
ai ≡ (3-7)
√1- ρi'2x
and

ai
ρ'ix ≡ (3-8)
√1 + a2i ,

where ≡ denotes approximate equality. This shows that under the assumptions
made, the item discrimination parameter ai and the item-test biserial correlation
ρ'ix are approximately monotonic increasing functions of each other.
Approximations (3-7) and (3-8) hold only if the unit of measurement for θ has
been chosen so that the mean of θ is 0 and the standard deviation is 1 (see Section
3.5). Approximations (3-7) and (3-8) do not hold unless θ is normally distributed
in the group tested. They do not hold if there is guessing. In addition, the
approximations fall short of accuracy because (1) the test score x contains errors
of measurement whereas θ does not; and (2) x and θ have differently shaped
distributions (the relation between x and θ is nonlinear).
Approximations (3-7) and (3-8) are given here not for practical use but rather
34 3. RELATION TO CONVENTIONAL ITEM ANALYSIS

to give an idea of the nature of the item discrimination parameter ai. The relation
of ai to conventional item and test parameters is illustrated in Table 3.8.1.
Item i is answered correctly whenever the examinee's ability Yi' is greater
than γi. If ability θ is normally distributed, then Yi' will not only be conditionally
normally distributed for fixed θ but also unconditionally normally distributed in
the total population. Since the unconditional mean and variance of Yi' have been
chosen to be 0 and 1, respectively, a simple relation between γi and πi (propor-
tion of correct answers to item i in the total group) can be written down: When θ
is normally distributed,
∞

πi = φ(t) dt. (3-9)

∫γ i

The parameter γi is the item difficulty parameter used in certain kinds of

Thurstone scaling (see Fan, 1957). It is also the same as the College Entrance
Examination Board delta (Gulliksen, 1950, pp. 368-369) except for a linear
transformation.
If (3-7) and (3-8) hold approximately, then from (3-4)

bi ≡ γi (3-10)
ρix .
If all items have equal discriminating power ai, then by (3-4) all ρi' are equal and
the difficulty parameter bi is proportional to γi, the normal curve deviate corre
sponding to the proportion of correct answers πi . Thus when all items are equally
discriminating, there is a monotonic relation between bi and πi: AS πi increases,
bi and γi both decrease. When all items are not equally discriminating, the
relation between bi and γi or πi depends on ai. In general, arranging items in
order on πi is not the same as arranging them on bi.

3.4. INVARIANT ITEM PARAMETERS

As pointed out earlier, an item response function can also be viewed as the
regression of item score on ability. In many statistical contexts, regression
functions remain unchanged when the frequency distribution of the predictor
variable is changed. In the present context this should be quite clear: The proba
bility of a correct answer to item i from examinees at a given ability level θ0
depends only on θ0, not on the number of people at θ0, nor on the number of
people at other ability levels θ1, θ2, . . . . Since the regression is invariant, its
lower asymptote, its point of inflexion, and the slope at this point all stay the
same regardless of the distribution of ability in the group tested. Thus ai, bi, and
ci are invariant item parameters. According to the model, they remain the same
regardless of the group tested.
Suppose, on the contrary, it is found that the item response curves of a set of
3.4. INVARIANT ITEM PARAMETERS 35

items differ from one group to another. This means that people in group 1 (say) at
ability level θ0 have a different probability of success on the set of items than do
people in group 2 at the same θ0. This now means that the test is able to
discriminate group 1 individuals from group 2 individuals of identical ability
level θ0. And this, finally, means that the test items are measuring some dimen
sion on which the groups differ, a dimension other than θ. But our basic assump
tion here is that the test items have only one dimension in common. The conclu
sion is either that this particular test is not one-dimensional as we require or else
that we should restrict our research to groups of individuals for whom the items
are effectively one-dimensional.
The invariance of item parameters across groups is one of the most important
characteristics of item response theory. We are so accustomed to thinking of item
difficulty as the proportion (π i ) of correct answers that it is hard to imagine how
item difficulty can be invariant across groups that differ in ability level. The
following illustration may help to clarify matters.
Figure 3.4.1 shows two rather different item characteristic curves. Inverted on
the baseline are the distributions of ability for two different groups of examinees.
First of all, note again: The ability required for a certain probability of success on
an item does not depend on the distribution of ability in some group; con
sequently, the item difficulty b should be the same regardless of the group from
which it is determined.
Now note carefully the following. In group A, item 1 is answered correctly
less often than item 2. In group B, the opposite occurs. If we use the proportion
of correct answers as a measure of item difficulty, we find that item 1 is easier
than item 2 for one group but harder than item 2 for the other group.
Proportion of correct answers in a group of examinees is not really a measure
of item difficulty. This proportion describes not only the test item but also the
group tested. This is a basic objection to conventional item analysis statistics.
Item-test correlations vary from group to group also. Like other correlations,

P
1
10

05
FIG. 3.4.1. Item response curves in
relation to two groups of examinees.
(From F. M. Lord, A study of item
bias, using item characteristic curve b1
theory. In Y. H. Poortinga (Ed.), 0.0 θ
Basic problems in cross-cultural A
psychology. Amsterdam: Swets and B
Zeitlinger, 1977, pp. 19-29.)
36 3. RELATION TO CONVENTIONAL ITEM ANALYSIS

item-test correlations tend to be high in groups that have a wide range of talent,
low in groups that are homogeneous.

3.5. INDETERMINACY

Item response functions Pi(θ) like Eq. (2-1) and (2-2) ordinarily are taken to be
functions of a i (θ — bi). If we add a constant to every θ and at the same time add
the same constant to every bi, the quantity a i (θ — bi) is unchanged and so is the
response function Pi(θ). This means that the choice of origin for the ability scale
is purely arbitrary; we can choose any origin we please for measuring ability as
long as we use the same origin for measuring item difficulty bt.
Similarly, if we multiply every θ by a constant, multiply every bi by the same
constant, and divide every ai by the same constant, the quantity a i (θ - bi)
remains unchanged and so does the response function Pi(θ). This means that the
choice of unit for measuring ability is also purely arbitrary.
One could decide to choose the origin and unit for measuring ability in such a
way that the first person tested is assigned θ1 = 0 and the second person tested is
assigned θ2 = 1 or — 1. Another possibility would be to choose so that for the
first item b1 = 0 and a1 = 1. Scales chosen in this way would be meaningless to
anyone unfamiliar with the first two persons tested or with the first item adminis
tered. A more common procedure is to choose the scale so that the mean and
standard deviation of θ are 0 and 1 for the group at hand.
The invariance of item parameters, emphasized in Section 3.4, clearly holds
only as long as the origin and unit of the ability scale is fixed. This means that if
we determine the bi for a set of items from one group of examinees and then
independently from another, we should not expect the two sets of bi to be
identical. Rather we should expect them to have a linear relation to each other
(like the relation between Fahrenheit and Celsius temperature scales).
Figure 3.5.1 compares estimated bi from a group of 2250 white students with
estimated bi from a group of 2250 black students for 85 verbal items from the
College Board SAT. Most of the scatter about the line is due to sampling
fluctuations in the estimates; some of the scatter is due to failure of the model to
hold exactly for groups as different as these (see Chapter 14).
If we determine the ai for a set of items independently from two different
groups, we expect the two sets of values to be identical except for an undeter
mined unit of measurement that will be different for the two groups. We expect
the ai to lie along a straight line passing through the origin (0, 0), with a slope
reciprocal to the slope of the line relating the two sets of bi. The slope represents
the ratio of scale units for the two sets of parameters. The two sets of ai are
related in the same way as two sets of measurements of the same physical
objects, one set expressed in inches and the other in feet.
The ci are not affected by changes in the origin and unit of the ability scale.
The ci should be identical from one group to another.
3.5. INDETERMINACY 37

5
4
3
2
b (BLACKS)
1
0
-I
-2
-3

-3 -2 -I 0 1 2 3 4 5
b (WHITES)

FIG. 3.5.1. Estimated difficulty parameters (b) for 85 items for blacks and for
whites. (From F. M. Lord, A study of item bias, using item characteristic curve
theory. In Y. H. Poortinga (Ed.), Basic problems in cross-cultural psychology.
Amsterdam: Swets and Zeitlinger, 1977, pp. 19-29.)

Ability parameters 6 are also invariant from one test to another except for
choice of origin and scale, assuming that the tests both measure the same ability,
skill, or trait. For 1830 sixth-grade pupils, Fig. 3.5.2 compares the 9 estimated
from a 50-item Metropolitan vocabulary test with the 0 estimated from a 42-item
SRA vocabulary test. Both tests consist of four-choice items.
The scatter about a straight line is more noticeable here than in Fig. 3.5.1
because there each bi was estimated from the responses of 2250 students, here
each 0 is estimated from the responses to only 42 or 50 items. Thus the estimates
of θ are more subject to sampling fluctuations than the estimates of bi. The broad
scatter at low ability levels is due to guessing, random or otherwise. A more
detailed evaluation of the implications of Fig. 3.5.2 is given in Section 12.6. It is
shown there that after an appropriate transformation is made, the transformed
38 3. RELATION TO CONVENTIONAL ITEM ANALYSIS

2.5
2.0
1 .5
1 .0
0.5
0.0
SRA

-0.5
-I .0
-I .5
-2.0
-2.5

-2.5 -2.0 - I .5 - I .0 -0.5 0.0 0.5 1 .0 1 .5 2.0 2.5

MAT

FIG. 3.5.2. Ability estimates from a 50 item MAT vocabulary test are com
pared with ability estimates from a 42-item SRA vocabulary test for 1830 sixth-
grade pupils. (Ability estimates outside the range - 2 . 5 < θ < 2.5are printed on
the border of the table.)

estimates of θ from the two tests correlate higher than do number-right scores on
the two tests.
In conclusion, in item response theory the item parameters are invariant from
group to group as long as the ability scale is not changed; in classical item
analysis, the item parameters are not invariant from group to group, although
they are unaffected by choice of ability scale. Similarly, ability 6 is invariant
across tests of the same psychological dimension as long as the ability scale is not
changed; number-right test score is not invariant from test to test, although it is
unaffected by choice of scale for measuring θ.
3.7. ITEM INTERCORRELATIONS 39

3.6. A SUFFICIENT CONDITION FOR THE

NORMAL OGIVE MODEL

In Section 3.2, we consider a variable Y'i underlying item i. Suppose n items

each have such an underlying variable, and suppose for some group of examinees
all the Y'i are jointly multinormally distributed. In this case, the joint distribution
of the dichotomous item responses ui is determined by the yi and by the intercor-
relations p'ij of Y'i and Y'i (i ≠ j ; i, j = 1, 2 , . . . , n). It can be shown (Lord &
Novick, 1968, Section 16.8) that if the joint distribution of the observed ui for
any set of data is consistent with this multinomial model for some γi and p'ij (i ≠
j ; i, j = 1 , 2 , . . . , n), then the data are consistent with the two-parameter normal
ogive response model, including the assumption of unidimensionality (local in
dependence). Furthermore, the p'i j will then have just one common factor, which
may be considered as the ability 0 measured by the n-item test.
The situation just described can only exist if there is no guessing and if 0 is
normally distributed in the group tested. This is a very restrictive situation; but if
this situation held for some group for some free-response items, the normal ogive
model would also hold for all other groups taking these same items.
The point is not that most data will fit the very restrictive conditions. They
will not. The point is rather that the normal ogive model will hold in a very large
variety of other, less restrictive situations. The restrictive conditions are suffi
cient conditions for the normal ogive model; they are very far from being neces
sary conditions.

3.7. ITEM INTERCORRELATIONS

Although we do not expect the restrictive model of the previous section to hold
for most actual data, some useful conclusions can be drawn from it that will help
us understand the relation of our latent item parameters to familiar quantities. It is
clear that when Y'i and Y'j are normally distributed, the product-moment correla
tion p'i j between them is, by definition, the same as the tetrachoric correlation
between item i and item j. Under the restrictive model, the p'i j will have just one
common factor, θ, so that

P'i j = P'iP'j , (3-11)

where p'i and p'j are the factor loadings of items i and j . The factor loadings are
also the correlations of Y'i and Y'j with θ and also the biserial correlations of ui
and Uj with θ.
When there is no guessing, Eq. (3-11) will allow us to infer observable item
intercorrelations from the parameters ai and aj: Under the restrictive model, the
(observable) tetrachoric correlation between item i and item j is found from (3-3)
and (3-11) to be
40 3. RELATION TO CONVENTIONAL ITEM ANALYSIS

aiaj
p'ij = √1 + a2i √1+ a2j

Conversely, under the restrictive model the p ' i , and thus the ai, can be inferred
from a factor analysis of tetrachoric item intercorrelations:

P' i j p' i k
P' i 2 = (i ≠ j , i ≠ k, j ≠ k). (3-12)
P' j k
This is not recommended in the usual situations where there is guessing, how
ever.

3.8. ILLUSTRATIVE RELATIONSHIPS A M O N G

TEST AND ITEM PARAMETERS

Table 3.8.1 illustrates the relationship of the item discriminating power ai to

various conventional item and test statistics and also the interrelationships among
them. The illustration assumes that ability θ is normally distributed with zero
mean and unit variance and also that bi = 0 for all items. The test statistics are
for a 50-item test (n = 50). All quantities other than the p'iθ, including the ai at
the head of the columns, are rounded values computed from (exact) values shown
for p'iθ (≡p' i ). All 50 items have identical ai and C i .
The most familiar quantities in the table are probably the test reliabilities
( ρ x x ' ) m l n e bottom line. Most 50-item multiple-choice tests with which we are

TABLE 3.8.1
Relation of Item Discriminating Power at to Various
Conventional Item Parameters, and to Parameters of a 50-ltem Test,
when Ability Is Normally Distributed(µθ=0,σθ=1=1) and All
Free-Response Items Are of 50% Difficulty (πi = .50, bi = 0)

Eq.
ai = 0 .20 .44 .75 .98 1.33 2.06 no.

ρ'iθ 0 .2 .4 .6 .7 .8 .9 (3-3)
Free-Response

ρiθ 0 .16 .32 .48 .56 .64 .72 (3-6)

πi = .5)
= .2, (Ci = 0,

0 .040 .16 .36 .49 .64 .81 (3-11)

Items

ρ'ij
ρij 0 .025 .10 .23 .33 .44 .60 (3-13)
ρ i(x-i) 0 .12 .29 .47 .56 .66 .77 (3-14)
ρxx' 0 .57 .85 .94 .96 .98 .99 (1-17)
0 .017 .07 .16 .22 .29 .40 (3-19)
= .6)

ρIJ
Multiple-Choice

ρ I(X-I) 0 .088 .23 .38 .45 .53 .62 (3-14)

ρIX .14 .19 .29 .42 .48 .56 .64 (3-16)
Items

ρ'IX .18 .24 .37 .53 .61 .70 .81 (3-6)

(cI

7.2 (3-15)
πl

σx 3.5 4.7 10.2 11.8 13.6 15.7

ρxx' 0 .46 .79 .90 .93 .95 .97 (1-17)
APPENDIX 41

familiar probably have reliabilities close to .90. If so, we should focus our
attention on the column with ρ x x , — .90 at the bottom and ai = .75 at the top.
The top half of the table assumes that ci = 0 . This is referred to as the
free-response case (although free-response items do not necessarily have ci =
0). Note that by (3-4) and (3-9), under the assumptions made, free-response
items with bi = 0 will have exactly 50% correct answers (πi = .50) in the total
group of examinees. The parameters shown are the biserial item-ability correla
tion ρ'iθ; the point-biserial (product-moment) item-ability correlation ρiθ; the
tetrachoric item intercorrelation p' ij ; the product-moment item intercorrelation
Pij (phi coefficient); the item-test correlation pix-i), where x — i is number-
right score on the remaining 49 items; and the parallel-forms test reliability pxx,.
The equation used to calculate each parameter is referenced in the table.
The bottom half of the table deals with multiple-choice tests. The theoretical
relation between the multiple-choice and the free-response case is discussed in
the Appendix. For the rest of this chapter only, multiple-choice items are indexed
by I and J to distinguish them from free-response items (indexed by i and j); the
number-right score on a multiple-choice test will be denoted by X to distinguish
it from the score x obtained from free-response items. The multiple-choice item
intercorrelation pIJ (phi coefficient) is computed by (3-19) from the free-
response pij. All multiple-choice parameters in the table are computed fromp IJ .
All numbered equations except (3-19) apply equally to multiple-choice and to
free-response items.
Note in passing several things of general interest in the table:

1. A comparison of pxx' with pxx, indicates the loss in test reliability when
low-ability examinees are able to get one-fifth of the items right without knowing
any answers.
2. The standard deviation σx of number-right scores varies very sharply with
item discriminating power (with item intercorrelation).
3. The usual item-test correlation ρix or ρ'ix (also ρIX or ρ'IX) is spuriously
high because item i is included in x (or I in X). The amount of the spurious effect
can be seen by comparing ρIX and ρI(x_1).
4. For free-response items, the item-test correlation Pi(x-i) in the last two
columns of the table is higher than the item-ability correlation piθ. This may be
viewed as due to the fact (see Section 3.1) that the item observed-score regres
sion is more nearly linear than the item-ability regression (item response func
tion) .

APPENDIX

This appendix provides those formulas not given elsewhere that are necessary for
computing Table 3.8.1. In the top half of the table, the phi coefficient pij was
42 3. RELATION TO CONVENTIONAL ITEM ANALYSIS

obtained from the tetrachoric p'ij by a special formula (Lord & Novick, 1968,
Eq. 15.9.3) applicable only to items with 50% correct answers:
2
ρi j = arcsin p ' i j (3-13)
π

the arcsin being expressed in radians. The test reliability pxx ' was obtained from
ρij by the Spearman-Brown formula (1-17) for the correlation between two
parallel tests after lengthening each of them 50 times. The item-test correlation
ρ i ( x - i ) was obtained from a well-known closely related formula for the correla
tion between one test (/) and the lengthened form (y) of a parallel test (j):
mσi ρ i j
Piy , (3-14)
σy
where m is the number of times j is lengthened [for ρ i ( x - i ) in Table 3.8.1, m =
49], y is number-right score on the lengthened test, a2i = πi(l — πi) is the
variance (1-23) of the item score (ui = 0 or 1), and
σ2y = σ2i [m + m(m – l)pij] (3-15)
is the variance of the y scores [see Eq. (1-21)].
The usual point-biserial item-test correlation pix is computed from pi(x-i) by
a formula derived as follows:

σ i ( ( x - i ) + σ2i σi σx–i ρ i ( x - i ) + σ2i

ρ ix ≡ ρi[(x-i)+i] =
σi σx σi σx .
When y = x — i, we have from (3-14) that σ x _ i ρ i(x-i) = mσiρij; using these
last formulas with m = n — 1, we have
mσ2i ρi j + σ2i
Pix = .
σiσx
Finally, using Eq. (3-15) with m = n, we find from this that
(n - 1)ρij + 1 √l + (n – 1)Pij
Pix = . (3-16)
√n + n(n - 1)ρij √n
Suppose A, B, C, and D are the
relative frequencies in the accompany B A πi
ing 2 x 2 intercorrelation table for
free-response items i and j (ci = Cj =
0). The general formula for the phi D C 1 — πi
coefficient for any such table is
1 — πj πj
AD - BC
ρ ij = √Π (1 — π )π (1 - π ) . (3-17)
i i j j

Suppose now that we change the items to multiple choice with cI = CJ = c >
REFERENCES 43

0. According to Eq. (2-1) or (2-2), the effect will be that of the people who got
each free-response item wrong, a fraction c will now get the corresponding
multiple-choice item right. Thus π = Πi + c I (l —π i ). The new 2 x 2 table for
multiple-choice items will therefore be

(1 - c)B + c(1 - c)D A + cB + cC + c2D π + C(l — π)

(1 - c)2D (1 - c)C + c(1 - c)D (1 - C ) ( l -πi,)

(1 – C)(l – πj) π + C(l – π)

In the special free-response case where πi = Πj — ½, we have B = C = ½ —

A and D = A; also we find from (3-17) that

ρij =4A – 1. (3-18)

In this special case, the 2 x 2 table for the multiple-choice items is therefore

(1 – c)(½ – A + cA) A(1 – c)2 + c ½(1 + c)

(1 - c)2A (1 - c)(½ – A + cA) ½(1- c)

½ (l – c) ½ (1 + c)
When the general formula for a phi coefficient is applied to the last 2 x 2 table,
we find that for the multiple-choice items under consideration
A 2(1 - c) 4 + cA{\ - c)2 - (1 - c) 2( ½ - A + cA)2
PIJ =
¼(1 –c2 )
1 -c
(4A – 1).
1 +c
Using (3-18) we find a simple relation between the free-response pij and the
multiple-choice pIJ for the special case where πi = π = .5:

PIJ =
1 -c (3-19)
P
1 + c ij
This formula is a special case of the more general formula in Eq. (7-3).

REFERENCES

Fan, C.-T. On the applications of the method of absolute scaling. Psychometrika, 1957, 22,175–183.
Gulliksen, H. Theory of mental tests. New York: Wiley, 1950.
Lord, F. M., & Novick, M. R. Statistical theories of mental test scores. Reading, Mass.: Addison-
Wesley, 1968.
4 Test Scores

Ability

of Item
Estimates
and

Parameters
as Functions

The ideas in this chapter are essential to an understanding of subsequent chapters.

4.1. THE DISTRIBUTION OF TEST SCORES

FOR GIVEN ABILITY

It is sometimes asserted that item response theory allows us to answer any

question that we are entitled to ask about the characteristics of a test composed of
items with known item parameters. The significance of this vague statement
arises from the fact that item response theory provides us with the frequency
distribution Ø(x θ) of test scores for examinees having a specified level 6 of
ability or skill.
For the present, let us consider the number-right score, denoted by x. If the n
items in a test all had identical item response curves P = P(θ), the distribution of
x for a person at ability level 0 would then be the binomial distribution
n x n-x
Ø(x θ) (x )p Q ,
where Q = 1 – P. The expression (Q + P)n is familiar as the generating
function for the binomial distribution, because the binomial expansion

(Q + P)n ≡ Qn + nPQn-1 + ( n2)p2Qn-2

x n-x . . .
+… + ( n) p Q + + Pn
x
gives the terms of Ø(x θ) successively for x = 0, 1,. . . , n.
When the item response curves Pi = Pi(θ) (i = 1 , 2 , . . . , n) vary from item

44
4.2. TRUE SCORE 45

to item, as is ordinarily the case, the frequency distribution Ø(x θ) of the number-
right test score for a person with ability θ is a generalized binomial (Kendall &
Stuart, 1969, Section 5.10). This distribution can be generated by the generating
function
n
π (Qi + Pi)- (4-1)
i=1
For example, if n = 3, the scores x = 0, 1, 2, 3 occur with relative frequency
Q1Q2Q3, Q1Q2P3 + Q1P2Q3 + P1Q2Q3, Q 1 P 2 P 3 + P1Q2P3 + P1P2Q3, and
P1P2P3, respectively. The columns of Table 4.3.1 give the reader a good idea of
the kinds of Ø(x θ) encountered in practice.
Although Ø(x θ), the conditional distribution of number-right score, cannot be
written in a simple form, its mean µx\θ and varianceσ2 x\ θ for given θ are simply
n
µx\θ = Pi(θ), (4-2)
i=i
σ2x\θ = n
PiQi . (4-3)
i=i
The mean (4-2) can be derived from the fact that x≡ iui and the familiar fact
that the mean of ui is Pi. The variance (4-3) can be derived from the familiar
binomial variance σ2(ui) = PiQi by noting that
σ2x\θ= σ2( iui θ = iσ2(ui)) = iPi Qi
because of local independence. Note that Ø(x|θ), µx|θ, and σx|θ refer to the
distribution of x (1) for all people at ability level θ and also (2) for any given
individual whose ability level is 6.
If P is the average of the Pi(θ) taken over n items, then in practice Ø(x|θ) is
usually very much like the binomial distribution (xn) Px Qn–x where Q ≡ 1 —
P. The main difference is that the variance Pi Qi is always less than the
binomial variance nPQ unless Pi = P for all items. The difference between the
two variances is simply nσ2p|θ, where σ2p|θ is the variance of the Pi(θ) for fixed
θ taken over items:

σ2x\θ = nPQ – nσ2p\θ (4-4)

4.2. TRUE SCORE

A person's number-right true score ξ (pronounced ksai or ksee) on a test is

defined in Lord and Novick (1968, Chapter 2) as the expectation of his observed
score x. It follows immediately from (4-2) that every person at ability level θ has
the same number-right true score
46 4. TEST SCORES AS FUNCTIONS OF ITEM PARAMETERS

n
ξ = Pi{θ). (4-5)
i=i
Since each Pi(θ) is an increasing function of 0, number-right true score is an
increasing function of ability.
This is the same true score denoted by T in Section 1.2. The classical notation
avoids Greek letters, the present notation emphasizes that the relation of ob
served score x to true score ξ is the relation of a sample observation to a
population parameter.
True score ξ and ability 6 are the same thing expressed on different scales of
measurement. The important difference is that the measurement scale for ξ
depends on the items in the test; the measurement scale for 6 is independent of
the items in the test (Section 3.4). This makes 6 more useful than £ when we wish
to compare different tests of the same ability. Such comparisons are an essential
part of any search for efficient test design (Chapter 6).

4.3. S T A N D A R D E R R O R O F M E A S U R E M E N T

By definition, the error of measurement (e) is the discrepancy between observed

score and true score: e = x — ξ. When ξ is fixed, e and x have the same standard
deviation, since in that case e and x differ only by a constant. This standard
deviation is called the standard error of measurement at ξ, denoted here by σelξ
The squared standard error of measurement s2e.ξ of classical test theory is simply
σ2elξ averaged over all (N) examinees:

1 N 2
s2e.ξ = σ elξ (4-6)
N
When ξ is fixed, so is θ and vice versa [see Eq. (4-5)]. Thus

σe ξ=ξ0 ≡ (σe\ θ=θ0 (4-7)

provided ξu and θO are corresponding values satisfying (4-5). By (4-3) and (4-7),
finally,
n
σ2elξo= Pi(θo)Qi(θo). (4-8)
i=l
Note that the standard error of measurement approaches 0 at high ability levels,
where Pi(θ) 1. At low ability levels,

Pi(θ) Ci and σ2elξ n

ci(l – Ci).
Table 4.3.1 shows the conditional frequency distribution of number-right
scores at equally spaced ability levels as estimated for the 60-item Mathematics
section of the Scholastic Aptitude Test. This table was computed from (4-1),
TABLE 4.3.1
Theoretical Conditional Frequency Distribution of Number-Right
Scores on SAT Mathematics Test (January 1971) for Equally Spaced
Fixed Values of 0 (all frequencies multiplied by 100)

Selected Fixed Values of 0

Score
(x) -3.000 -2.625 -2.250 -1.875 -1.500 -1.125 -0.750 -0.375 0.000 +0.375 +0.750 + 1.125 + 1.500 + 1.875 +2.250 +2.625 +3.000

60 7 28 52
59 2 20 37 35
58 6 27 23 11
57 12 23 9 2
56 1 18 14 3
55 2 20 7 1
54 5 17 3
53 8 12 1
52 12 7
51 1 15 4
50 1 15 2
49 3 14 1
48 5 11
47 7 8
46 10 5
45 1 12 3
44 1 14 1
43 3 13 1
42 5 11
41 7 9
40 1 9 6
39 1 12 4
38 3 13 2
37 4 13 1
36 1 6 11 1
35 1 9 9
34 2 11 7
33 3 12 5
32 1 5 12 3
31 1 7 12 2

47
(continued)
TABLE 4.3.1 (continued)

48
Selected Fixed Values of θ

Score
(x) -3.000 -2.625 -2.250 -1.875 -1.500 -1.125 -0.750 -0.375 0.000 +0.375 +0.750 +1.125 +1.500 +1.875 +2.250 +2.625 +3.000

30 2 9 10 1
29 3 11 7
28 1 5 12 5
27 1 7 12 3
26 2 9 11 2
25 1 4 11 9 1
24 1 6 12 7 1
23 1 3 8 12 5
22 1 4 10 11 3
21 1 2 6 11 9 2
20 1 2 4 8 12 7 1
19 1 1 3 6 10 12 5
18 1 1 2 4 8 12 10 3
17 2 2 4 6 10 12 8 2
16 3 4 6 9 12 12 6 1
15 5 6 8 11 13 10 4 1
14 7 9 11 12 12 8 2
13 10 11 12 13 11 6 1
12 12 13 13 12 8 4 1
11 14 13 13 10 6 2
10 14 13 11 7 4 1
9 12 10 8 5 2
8 9 8 5 3 1
7 6 5 3 1
6 4 3 1 1
5 2 1 1
4 1
3
2

0
4.4. TYPICAL DISTORTIONS IN MENTAL MEASUREMENT 49

using estimated item parameters. All items are five-choice items. The typical
ogive shape of the regression function µx\θ = nPi(θ) is apparent in this table
and also the typical decreasing standard error of measurement and increasing
skewness at high ability levels.

4.4. TYPICAL DISTORTIONS IN MENTAL

MEASUREMENT

As already noted, formula (4-2) for the regression µx\θ of number-right score on
ability is the same as formula (4-5) for the relation of true score to ability. This
important function ξ = ξ(θ) ≡ nPi(θ) and also the function

1 n
ζ ≡ ζ(θ) ≡ Pi(θ) (4-9)
n
i=1
are called test characteristic functions. Either of these functions specifies the
distortion imposed on the ability scale when number-right score on a particular
set of test items is used as a measure of ability. A typical example of a test
characteristic function appears in Fig. 5.5.1.
Over ability ranges where the test characteristic curve is relatively steep,
score differences are exaggerated compared to ability differences. Over ranges
where the test characteristic curve is relatively flat, score differences are com
pressed compared to ability differences. Since number-right scores are integers,
compression of a wide range of ability into one or two discrete score values
necessarily results in inaccurate measurement.
If all items had the same response function, clearly the test characteristic
function (4-9) would be the same function also. More generally, test characteris
tic curves usually have ogive shapes similar to but not identical with item re
sponse functions. Differences in difficulty among items cause a flattening of the
test characteristic curve. If all items had the same response curves except that
their difficulty parameters bi were uniformly distributed, the test characteristic
curve would be virtually a straight line except at its extremes. For a long test, the
greater the range of the bi, the more nearly horizontal the test characteristic
curve.
If a test is composed of two sets of items, one set easy and the other set
difficult, the test characteristic curve may have three relatively flat regions: It
may be flat in the middle as well as at extreme ability levels. Such a test will
compress the ability scale and provide poor measurement at middle ability levels,
as well as at the extremes.
If the distribution of ability is assumed to have some specified shape (for
example, it is bell-shaped), the effect of the distortions introduced by various
types of test characteristic functions can be visualized. If a test is much too easy
50 4. TEST SCORES AS FUNCTIONS OF ITEM PARAMETERS

for the group tested, the point of inflection of the test characteristic curve may
fall in the lower tail of the distribution where there are no examinees. Only the
top part of the test characteristic curve may be relevant for the particular group
tested. In this case, most examinees in the group may be squeezed into a few
discrete score values at the top of the score range. The bottom part of the
available score range is unused. Measurement is poor except for the lower ability
levels of the group. An assumed bell-shaped distribution of ability is turned into a
negatively skewed, in the extreme a J-shaped, distribution of number-right
scores. Such a test may be very approproate if its main use is simply to weed out
a few of the lowest scoring examinees.
If the test is much too hard for the group tested, an opposite situation will
exist. The score distribution will be positively skewed but will not be J-shaped if
there is guessing, because zero scores are then unlikely. Such a test may be very
appropriate for a scholarship examination or for selecting a few individuals from
a large group of applicants.
If the test is not very discriminating, the test characteristic curve will be
relatively flat. If the relevant part of the characteristic curve (the part where the
examinees occur) is nearly straight, the shape of the frequency distribution of
ability will not be distorted by the test. However, a wide range of ability will be
squeezed into a few middling number-right scores, with correspondingly poor
measurement.
If the test is very discriminating, its characteristic curve will be corre
spondingly steep in the middle. The curve cannot be steep throughout the ability
range because it is asymptotic to ζ = 1 at the right and to ξ = c at the left, where
1 n
c = Ci. (4-10)
n i=1
Thus there will be good measurement in the middle but poor measurement at
the extremes. If the test difficulty is appropriate for the group tested, the middle
part of the bell-shaped distribution of ability will be spread out and the tails
squeezed together. The result in this case is a platykurtic distribution of number-
right scores. The more discriminating the items, the more platykurtic the
number-right score distribution, other things being equal. In the extreme, a
U-shaped distribution of number-right scores may be obtained.
If we wish to discriminate well among people near a particular ability level (or
levels), we should build a test that has a steep characteristic curve at the point(s)
where we want to discriminate. For example, if a test is to be used only to select a
single individual for a scholarship or prize, then the items should be so difficult
that only the top person in the group tested knows the answer to more than half of
the test items. The problem of optimal test design for such a test is discussed in
Chapters 5, 6, and 11.
An understanding of the role of the test characteristic curve is important in
designing a test for a specific purpose. Diagrams showing graphically just how
4.6. THE TOTAL-GROUP DISTRIBUTION OF NUMBER-RIGHT SCORE 51

various test characteristic curves distort the ability scale, and thus the frequency
distribution of ability, are given in Lord and Novick (1968, Section 16.14).

4.5. THE JOINT DISTRIBUTION OF ABILITY A N D

TEST SCORES

If ability had a rectangular distribution from θ = —3 to θ = + 3, Table 4.3.1

would give a good idea of the joint distribution of ability and number-right score.
Otherwise, the probabilities at each value of θ must be multiplied by its relative
frequency of occurrence in order to obtain the desired joint distribution. Formally,
the joint distribution is

Ø(x, θ) = Ø(x θ)g*(θ), (4-11)

where g*(d) is the distribution (probability density) of ability in the group tested.
(The general term distribution rather than probability density or frequency func
tion is used throughout this book. Where necessary to prevent confusion,
cumulative or noncumulative is specified.)
Usually g*(θ) is unknown. The observed distribution of estimated θ is an
approximation to g*(θ). A better approximation can often be obtained by the
methods of Chapter 16.
Given an adequate estimate of g*(θ), the joint distribution of ability and
number-right score can be determined from (4-11) and (4-1). This joint distribu
tion contains all relevant information for describing and evaluating the properties
of the number-right score x for measuring ability θ. One such (estimated) joint
distribution is shown in Table 16.11.1.

4.6. THE TOTAL-GROUP DISTRIBUTION OF

NUMBER-RIGHT SCORE

Suppose we have a group of N individuals whose ability levels θa (a = 1,

2 , . . . , N) are known (in practice the θa will be replaced by estimated values
θ a ). It is apparent that the total-group or marginal distribution Ø(X) of number-
right scores will be

1 N
Ø(x) = Ø(x θa), (4-12)
N
a= 1
where Ø(x θa) is calculated from (4-1).
Any desired moments of the total-group distribution of test score can be
calculated from the Ø(x) obtained by (4-12). The expected score for the N
examinees also can be found from (4-12) and (4-2):
52 4. TEST SCORES AS FUNCTIONS OF ITEM PARAMETERS

1 N 1 N n
Ex = µx\θa = Pi(θ). (4-13)
N N
a=1 a=1 i=i
An estimate of the expected variance of the N scores can be found from an
ANOVA identity relating total-group statistics to conditional statistics:

σ2x= mean of σ2x\θa + variance of µx\θa.

From this, (4-2), (4-3), and (4-13), we have the estimated total-group variance

1 N n 1 N n 1 N n
N PiaQia +
N Pia 2 N2 ( Pia 2 (4-14)
( ) )
a=l i=l a=1 i = 1 a=\i=l
where Pia ≡ Pi(θa).

4.7. TEST RELIABILITY

Although we shall have little use for test reliability coefficients in this book, it is
reassuring to have a formula relating test reliability to item and ability parame
ters. A conventional definition of test reliability is given by Eq. (1-6) and (1-9):
Written in our current notation, this definition is

σ2xξ
ρ x x ' ≡ ρ2xξ ≡ 1 - σ2 (4-15)
x

For a sample of examinees, (4-15), (4-6), (4-8), and (4-14) suggest an appro
priate sample reliability coefficient:

N n N n
( Pia ( pia)2/N
Pxx' (4-16)
N n N n N n
Pia Qia + ( P ia ) 2
– ( Pia)2/N

From (4-7), we see that (4-15) is the complement of the ratio of (averaged
squared error about the regression of x on θ) to (variance of x). Reliability is
therefore, by definition, equal to the correlation ratio of score x on ability 6.

4.8. ESTIMATING ABILITY FROM TEST SCORE

The following procedure clarifies the construction of confidence intervals for

estimating ability 0 from number-right score x. Consider the score distribution in
any column in Table 4.3.1. Find the highest score level below which lies no more
than 2½% of the frequency. Do this for all values of θ. If x were continuous, the
points would form a smooth curve; since x is an integer, the points fall on a step
4.8. ESTIMATING ABILITY FROM TEST SCORE 53

function. This step function cuts off 2½% or less of the frequency at every ability
level 6. Repeat this process for the upper tails of the score distributions.
The two resulting step functions are shown in Fig. 4.8.1. No matter what the
value of θ may be, in the long run at least 95% of all randomly chosen scores will
lie in the region between these step functions.
Now consider a random examinee. His number-right score on the test is x0,
say. We are going to assert that he is in the region between the step functions.
This assertion will be correct at least 95% of the time for randomly chosen
examinees. But given that this examinee's test score is x0, this assertion is in
logic completely equivalent to the assertion that he lies in a certain interval on 6.
In Fig. 4.8.1, the ends of an illustrative interval are denoted by θ and θ. We shall
therefore assert that his ability 6 lies in the interval (θ, θ). Such assertions will be
correct in the long run at least 95% of the time for randomly chosen examinees.
An interval with this property is called a 95% confidence interval. Such
confidence intervals are basic to the very important concept of test information,
introduced in Chapter 5. It is for this reason that we consider it in such detail
here.
A point estimate of 6 for given x would be provided by the regression of 6 on
x (see Section 12.8). Although the regression of x on 6 is given by (4-2), the
60
50
40
NUMBER RIGHT SCORE

Xo
30
20
10
0

-3 -2 -1 0 i 1 θ 2 3
THETA

FIG. 4.8.1. Confidence interval (θ, θ) for estimating ability (SAT Mathematics
Test, January 1971).
54 4. TEST SCORES AS FUNCTIONS OF ITEM PARAMETERS

regression of θ on x cannot be determined unless we know g*(θ), the distribution

of ability in the group tested. The regression of θ on x is by definition given by
1 ∞
µθ\x θg*(θ)Ø (x θ) dθ (4-17)
Ø(x) ∫-∞
[compare Eq. (4-11)]. Ordinarily it cannot be written in closed form. It can be
calculated numerically if the distribution of θ is known.

4.9. JOINT DISTRIBUTION OF ITEM SCORES FOR ONE EXAMINEE

Number right is not the only way, nor the best way, to score a test. For more
general results, and for other reasons, we need to know the conditional frequency
distribution of the pattern of item responses—the joint distribution of all item
responses ui (i = 1, 2 , . . . , n) for given θ.
For item i, the conditional distribution given θ of a single item response is

Pi(θ) if u i = 1,
L(ui θ) = Qi(θ) if ui = 0, (4-18)
{ 0 otherwise.

This may be written more compactly in various ways. For present purposes, we
shall write

L(ut θ) = Puii Q1i–ui (4-19)

The reader should satisfy himself that (4-18) and (4-19) are identical for the two
permissible values of ui.
Because of local independence, which is guaranteed by unidimensionality
(Section 2.4), success on one item is statistically independent of success on other
items. Therefore, the joint distribution of all item responses, given θ, is the
product of the distributions (4-19) for the separate items:
n
L(u|θ; a, b , c) = L(u1, u2,... ,un θ) = π PuiiQ1i–Ui, (4-20)
i=l
where u = {ui} is the column vector { u 1 , u 2 , . . . , un}' and a, b , c are vectors
of ai bi, and ci
Equation (4-20) may be viewed as the conditional distribution of the pattern u
of item responses for a given individual with ability θ and for known item
parameters a, b , and c. In this case the ui (i = 1, 2 , . . . , n) are random
variables and θ, a, b , and c are considered fixed.
If the ui for an individual have already been determined from his answer
sheet, they are no longer chance variables but known constants. In this case,
assuming the item parameters to be known from pretesting, it is useful to think of
(4-20) as a function of the mathematical variable θ, which represents the (un
known) ability level of the examinee. Considered in this way, (4-20) is the
4.10. JOINT DISTRIBUTION OF ALL ITEM SCORES 55

0
0

1
-20
-30
LOG LIKELIHOOD
-40
-50
-eo
-70
-80
-90

-5 -4 -3 -2 -I O 2 3 4 5
ABILITY θ

FIG. 4.9.1. Logarithm of likelihood functions for estimating the ability of six
selected examinees from the SCAT II 2B Mathematics test.

likelihood function for 6. The maximum likelihood estimate 6 (see Section 4.13)
of the examinee's ability is the value of θ that maximizes the likelihood (4-20) of
his actually observed responses ui (i = I, 2, . . . , n).
Figure 4.9.1 shows the logarithm of six logistic likelihood functions com
puted independently by (4-20) for six selected examinees taking a 100-item high
school mathematics aptitude test. The maxima of these six curves are found at
ability levels 6 = —5.6, —4.6, —1.0, . 1 , 1.0, and 3.7. These six values are the
maximum likelihood ability estimates 6 for the six examinees.

4.10. JOINT DISTRIBUTION OF ALL ITEM SCORES ON

ALL ANSWER SHEETS

If N examinees can be treated as a random sample from a very large population

of examinees, then the distributions of u for different examinees are statistically
56 4. TEST SCORES AS FUNCTIONS OF ITEM PARAMETERS

independent. Thus the joint distribution of the N different u for all examinees is
the product of the separate distributions. This joint distribution is then
N

ΠΠ
n
L(U|θ;a,b,c)≡L(ul,u2,... ,uN\θ) = Pia uia Qia1–u ia , (4-21)
a=1 n=1

where the subscript a distinguishes the contribution of examinee a (a = 1,

2,. .. , N and where Pia = Pi(θa), U is the matrix ||uia||, and θ is the vector
{θ1,θ2,...,θN}'.
When examinee responses have already been recorded on a pile of answer
sheets, (4-21) may be viewed as the likelihood function of both the ability
parameters θ1 θ2,. . . , θN and the item parameters a1, b1, c 1 , a2, b2, c2, . . . ,
an, bn, cn. The maximum likelihood estimates are the values θa (a = 1, 2 , . . . ,
N) and ai bi ci (i = 1, 2 , . . . , n) that together maximize (4-21). If the item
parameters are known, then (4-21) is simply the likelihood function of all the θa.
Does it seem unlikely that we can successfully estimate N + 3n parameters all
at the same time from just one pile of answer sheets? Actually, we do almost the
same thing in everyday practice. We obtain a score (ability estimate) for each
examinee and also do an item analysis on the same pile of answer sheets,
obtaining indices of item difficulty and item discriminating power. It is reason
able to do this if nN, the total number of observations, is large compared to both
N and n. If n is 50 and N is 1000, we have about 43 observations for each
parameter estimated.

4.11. LOGISTIC LIKELIHOOD FUNCTION

The likelihood function (4-20) for θ for one examinee can also be written
n Pi ui n
L(u θ) = π . π Qi .
( Qi )
i=1 i=l
In general, this is not helpful; but in the case of the logistic function [Eq. (2-1)]
when Ci = 0,
l
Pi 1 + e–DLi
= eDLi , (4-22)
Qi 1
1
1 + e–DLi
where D is the constant 1.7 and

L, ≡ ai(θ - bi,). (4-23)

Substituting (4-22) in the preceding likelihood function gives the logistic likeli
hood function
4.12. SUFFICIENT STATISTICS 57

n n
L(u|θ) = exp uiLi Qt
( ) Π
i=l i=l
n n
= exp -D aibiui eDθs (4-24)
( ) Π Qi (θ)
i=i i=l
where
n
S = s(u) ≡ aiui. (4-25)
i=l
In Appendix 4, it is shown that if

1. the item response function is logistic,

2. there is no guessing (ci = 0 for all i),
3. the item parameters are known.

then the weighted item score

n
s = i aiui
is a sufficient statistic for estimating examinee ability θ.

4.12. SUFFICIENT STATISTICS

The key property of a sufficient statistic s is that the conditional distribution of

the observations given s is independent of some parameter θ. This means that
once s is given, the data contain no further information about θ. This justifies the
usual statement that the sufficient statistic s contains all the information in the
data concerning the unknown parameter θ.
Note that s in (4-25) is a kind of test score, although different from the usual
number-right score x. Note also that s ranges from 0 to ni ai. Clearly, s is not a
consistent estimator of 0, which ranges from - ∞ to + ∞.
Since
E(ui θ) = Pi(θ) (4-26)

the expectation of s is
n
E(s θ) = aiPi (θ). (4-27)
1=1
Note that (4-27) is a kind of true score, although different from the usual
number-right score ξ . A consistent estimator 0 of 6 is found by solving for 0
the equation.

iaiPiθ) = s. (4-28)
58 4. TEST SCORES AS FUNCTIONS OF ITEM PARAMETERS

It is shown in Section 4.14 that the θ obtained from Eq. (4-28) is also the
maximum likelihood estimator of 6 under the logistic model with all ci = 0.
If s is sufficient for 0, so is any monotonic function of s. It is generally agreed
that when a sufficient statistic s exists for θ, any statistical inference for 0 should
be based on some function of s and not on any other statistic.
The three conditions stated at the end of the preceding section are the most
general conditions for the existence of a sufficient statistic for 6. There is no
sufficient statistic when the item response function is a normal ogive, even
though the normal ogive and logistic functions are empirically almost indistin
guishable. There is no sufficient statistic when there is guessing, that is, when
Ci ≠ 0. This means that there is no sufficient statistic in cases, frequently reported
in the literature, where the Rasch model (see Wright, 1977) is (improperly) used
when the items can be answered correctly by guessing.

4.13. MAXIMUM LIKELIHOOD ESTIMATES

When no sufficient statistic exists, the statistician uses other estimation methods,
such as maximum likelihood. As already noted in Section 4.10, the maximum
likelihood estimates θa (a = 1, 2, . . . , N) and ai, bi, and ci (i = 1, 2 , . . . , n)
are by definition the parameter values that maximize (4-21) when the matrix of
observed item responses U ≡ ||uia|| is known. In practice, the maximum likeli
hood estimates are found by taking derivatives of the logarithm of the likelihood
function, setting the derivatives equal to zero, and then solving the resulting
likelihood equations.
The natural logarithm of (4-21), to be denoted by l, is
N n
l= ln L , ( U | θ ; a , b , c ) = [u ia ln Pia + (1 - uia) ln Q i a ] . (4-29)
a=l i=l
If X represents θa , aj, bj, or Cj, the derivative of the log likelihood with respect
to X is

l N n P'ia
uia — (1 – u ia ) Pia
X [ Pia Qia ]
a=l i=1
N n P ia
uia – P ia ) (4-30)
P ia Qia
a=l i=1
where P ia ≡ Pia / x. An explicit expression for P'ia can be written as soon as
'

the mathematical form of Pia is specified, as by Eq. (2-1) or (2-2). The result for
the three-parameter logistic model is given by Eq. (4-40). Some practical proce
dures for solving the likelihood equations (4-29) are discussed in Chapter 12.
When a, b , c are known from pretesting, the likelihood equation for estimat
ing the ability of each examinee is obtained by setting (4-30) equal to zero:
4.15. MAXIMUM LIKELIHOOD ESTIMATION FOR EQUIVALENT ITEMS 59

n P' ia
(uia – P ia ) = 0. (4-31)
P'ia Qia
i= 1
This is a nonlinear equation in just one unknown, θa. The maximum likelihood
estimate θa of the ability of examinee a is a root of this equation. The roots of
(4-31) can be found by iterative numerical procedures, once the mathematical
form for Pia is specified.
If the number of items is small, (4-31) may have more than one root
(Samejima, 1973). This may cause difficulty if the number of test items n is 2 or
3, as in Samejima's examples. Multiple roots have not been found to occur in
practical work with n ≥ 20.
If the number of items is large enough, the long test being formed by combin
ing parallel subtests, the uniqueness of the root 6 of the likelihood equation
(4-31) is guaranteed by a theorem of Foutz (1977). The unique root is a consis
tent estimator; that is, it converges to the true parameter value as the number of
parallel subtests becomes large.

4.14. MAXIMUM LIKELIHOOD ESTIMATION FOR

LOGISTIC ITEMS WITH ci = 0

If Pia is logistic [Eq. (2-1)] and each ci = 0, it is found that

Pia (4-32)
= DaiPia Qia .
θa
Substituting this for P'ia in and rearranging gives the likelihood equation
n n
ai Pia (θa) = ai uia . (4-33)
i= 1 i=1
This again in a nonlinear equation in a single unknown, θa.
Note that its root θa, the maximum likelihood estimator, is a function of the
sufficient statistic (4-25). Thus (4-33) is the same as (4-28). It is a general
property that a maximum likelihood estimator will be a function of sufficient
statistics whenever the relevant sufficient statistics exist.

4.15. MAXIMUM LIKELIHOOD ESTIMATION FOR

EQUIVALENT ITEMS

Suppose all items have the same response function P(θ). We shall call this the
case of equivalent items. This is not likely to occur in practice, but it is a limiting
case that throws some light on practical situations.
60 4. TEST SCORES AS FUNCTIONS OF ITEM PARAMETERS

In the case of equivalent items, the likelihood equation (4-31) for estimating θ
becomes {P'lPQ) i(ui – P ) = 0 or

1 n
P(θ) = Ui ≡ Z, (4-34)
n
i=l
where z ≡ xln is the proportion of items answered correctly. The maximum
likelihood estimator θ is found by solving (4-34) for θ:

θ = P–1(z), (4-35)
–1
where P ( ) is the inverse function to P( ), whatever the item response
function may be.
Note that when all items are equivalent, a sufficient statistic for estimating
ability 6 is s = i au i = a iui = ax. Thus, in this special case, both the
number-right score x and the proportion-correct score z are sufficient statistics
for estimating ability.

E x e r c i s e 4.15.1
Suppose that P(6) is given by Eq. (2-1) and all items have ai = a, bi = b, and
ci = c, where a, b, and c are known. Show that the maximum likelihood
estimator 0 of ability is given by
1
θ In z — c + b. (4-36)
Da 1 -z
(Here and throughout this book, " I n " denotes a natural logarithm.) If c = 0,θ is
a linear function of the logarithm of the odds ratio (probability of success)/
(probability of failure).

4.16. FORMULAS FOR FUNCTIONS OF THE

THREE-PARAMETER LOGISTIC FUNCTION

A few useful formulas involving the three-parameter logistic function [Eq. (2-1)]
are recorded here for convenient reference. These formulas do not apply to the
three-parameter normal ogive [Eq. (2-2)].

1 — Ci ci + eDLi
Pi=Ci + (4-37)
1 + e–DLi 1 + eDLi '

where D ≡ 1.7 and Li ≡ ai(0 — bi).

l -Ci
Qi ≡ l - Pi = (4-38)
1 + e DL i

Pi c, + eDLi
(4-39)
Qi 1 - ci
4.17. EXERCISES 61

dPi Dai Dai (l - ci)

P'i ≡ Qi (Pi - ci) =
eDLi + 2 + e–DLi
(4-40)
dθ 1 - Ci

P'i Dai
(4-41)
Qi 1 + e–DLi

P'i Dai Pi — Ci Dai

Wi (θ) ≡ (4-42)
Pi Qi 1 — ci Pi 1 + cie–DLi
'2
p i D2 a2I (1 – Ci)
I{θ, ui) ≡ (ci + eDLi) (1 + e–DLi)2 .
(4-43)
Pi Qi
d2Pi D2a2i
Q (P – Ci)(Qi -Pi +ct). (4-44)
dθ 2
(1 – ci)2 i i
Pi - ci 1 - Ci
(4-45)
Pi 1 + ci e-DLi

4.17. EXERCISES

4-1 Compute P(6) under the normal ogive model [Eq. (2-2)] for a = 1/1.7,
b = 0, c = .2, and θ = —3, —2, — 1 , 0 , 1, 2, 3. Compare with the results
given for item 2 in Table 4.17.1 under the logistic model. Plot the item
response function P(θ) for each item in test 1, using the values given in
Table 4 . 1 7 . 1 .

TABLE 4.17.1
Item Response Function P(θ) and Related Functions for Test 1,
Composed of n = 3 Items with Parameters a1 = a2 = a3 = 1/1.7,
b1! = - 1 , b2 = 0, b3 = + 1 , C1 = c2 = c 3 = .2

Item No.*
1 2 3 P(θ) P(θ)Q(θ) P'(θ) p'2/pQ P'IPQ

3 4 5 .985611 .014182 .014130 .014078 .996350

2 3 4 .962059 .036501 .036142 .035785 .990141
1 2 3 .904638 .086268 .083994 .081780 .973646
0 1 2 .784847 .168862 .157290 .146511 .931466
-1 0 1 .6 .24 .2 .166667 .833333
-2 -1 0 .415153 .242801 .157290 .101895 .647814
-3 -2 -1 .295362 .208123 .083994 .033898 .403582
-4 -3 -2 .237941 .181325 .036142 .007204 .199318
-5 -4 -3 .214389 .168426 .014130 .001185 .083894

*For item i, enter the table with the θ values shown in column i (i = 1, 2, 3).
62 4. TEST SCORES AS FUNCTIONS OF ITEM PARAMETERS

4-2 Compute from Eq. (4-1) for examinees at 6 = 0 the frequency distribution
Φ(x\θ) of the number-right score on a test composed of n = 3 equivalent
items, given that P i (0) = .6 for each item. Compute the mean score from
the Φ(x\θ), also from (4-2). Compute the standard deviation (4-3) of
number-right scores. Compute the mean of the proportion-correct score
z = x\n.
4-3 Compute from (4-1) the frequency distribution of number-right score x on
test 1 when (9 = 0, given that P 1 (0) = .7848, P 2 (0) = .6, P 3 (0) = .4152.
C o m p u t e µx θ,µZ θ a n d σx θ. Compare with the results of Exercise 4-2.
4-4 Note that σx θ is the standard error of measurement, (4-8). Check the
value found in Exercise 4-3, using Eq. (4-4).
4-5 Compute from Table 4.3.1 the standard deviation of the conditional dis
tribution Φ(x\θ) of number-right scores when θ = –3, 0, + 2 . 2 5 . (Be
cause of rounding errors, the columns do not add to exactly 100; compute
the standard deviation of the distribution as tabled.)
4-6 What is the range of number-right true scores ξ on test 1 (see Table
4.17.1)?
4-7 In Table 4 . 3 . 1 , find very approximately a (equal-tailed) 94% confidence
interval for θ when x = 26.
4-8 Given that P 1 (0) = .7848, P 2 (0) = .6, P 3 (0) = .4152, as in Exercise 4-3,
compute from (4-20) the likelihood when θ = 0 of every possible pattern
of responses to this three-item test.
4-9 Given that u1 = 0, u2 = 0, and u3 = 1, compute for θ = - 3 , - 2 , —1,0,
1, 2, 3 and plot the likelihood function (4-24) for a three-item test com
posed of equivalent items with a = 1/1.7, b = 0, and c = 0 for each
item. The necessary values of P(θ) are given in Table 4.17.2.
4-10 For Exercise 4-9, show that the right side of (4-33) exceeds the left side
when θa = — 1 but that the left side exceeds the right side when θa = 0;
consequently the maximum likelihood estimator θa satisfying (4-33) lies
between — 1 and 0.
4-11 Find from (4-36) the maximum likelihood estimate θ for the situation in
Exercises 4-9 and 4-10.

TABLE 4.17.2
Logistic Item Response Function P(θ) when a = 1/1.7, b = 0, c = 0

e -3 -2 -1 0 1 2 3

P(θ) .047426 .119203 .268941 .5 .731059 .880797 .952574

PQ .045177 .104994 .196612 .25 .196612 .104994 .045177
P' .045177 .104994 .196612 .25 .196612 .104994 .045177
APPENDIX 63

APPENDIX

Proof that au Is a Sufficient Statistic for 0

Using a line of proof provided by Birnbaum (1968, Chapter 18), we can see that
s is a sufficient statistic for estimating ability. By a familiar general formula
relating conditional, marginal, and joint distributions,
Prob(A and B) = Prob(A) • Prob (B\A). (4-46)
If 0 is fixed, this becomes
Prob(A and B θ) = Prob(A θ) • Prob(B A, θ). (4-47)
Substitute s0 for A and u 0 for B in (4-47) where the subscript indicates corre
sponding but otherwise arbitrary values of s = s(u) and u. Rearrange to obtain
Prob(s0 and u 0 θ)
Prob(u o |s o ,0) =
Prob(s0 θ) .
Now s ≡ s(u) depends entirely on u, so Prob(s and u|θ) ≡ Prob(u|θ) and
Prob(u0 θ)
Prob(u 0 |s 0 , θ) =
Prob(u0 θ) .
From this and (4-25),

Prob(u 0 |s 0 , 6) = Prob(u0 θ)
.
Prob(u|θ)
u sa
where the summation is over all vectors u for which iaiui = s0. By (4-24),
exp(-DXiaibiuoil)eD θ s
o Π i Q i (θ)
Prob(u 0 |s 0 , θ) = Dθs
exp (-D i ai biui)e o ΠiQi(θ)
u so

exp (–D i ai b i u ioi )

= exp (–D i ai biui)
u So

The point of this result is not the formula obtained but the fact that it does not
depend on 0. In view of the definition of a sufficient statistic (see Section 4.12),
we therefore have the following: If

1. the item response function is logistic,

2. there is no guessing (ci = 0 for all i),
3. the item parameters are known,
64 4. TEST SCORES AS FUNCTIONS OF ITEM PARAMETERS

then the weighted item score

n
S ≡ i ai ui

is a sufficient statistic for estimating examinee ability θ.

REFERENCES

Birnbaum, A. Test scores, sufficient statistics, and the information structures of tests. In F. M. Lord
& M. R. Novick, Statistical theories of mental test scores. Reading, Mass.: Addison-Wesley,
1968.
Foutz, R. V. On the unique consistent solution to the likelihood equations. Journal of the American
Statistical Association, 1977, 72, 147-148.
Kendall, M. G., & Stuart, A. The advanced theory of statistics (Vol. 1, 3rd ed.). New York: Hafner,
1969.
Lord, F. M., & Novick, M. R. Statistical theories of mental test scores. Reading, Mass.: Addison-
Wesley, 1968.
Samejima, F. A comment on Birnbaum's three-parameter logistic model in the latent trait theory.
Psychometrika, 1973, 38, 221-233.
Wright, B. D. Solving measurement problems with the Rasch model. Journal of Educational Mea
surement, 1977, 14, 97-116.
5 I n f o r m a t i o n

O p t i m a l
F u n c t i o n s

S c o r i n g
a n d

W e i g h t s

5.1. THE INFORMATION FUNCTION FOR A TEST

SCORE

The information function I{θ, y} for any score y is by definition inversely pro
portional to the square of the length of the asymptotic confidence interval for
estimating ability 0 from score y (Birnbaum, 1968, Section 17.7). In this chap
ter, an asymptotic result means a result that holds when the number n of items
(not the number N of people) becomes very large. In classical test theory, it is
usual to consider that a test is lengthened by adding items "like those in the
test,'' that is, by adding test forms that are strictly parallel (see Section 1.4) to the
original test. This guarantees that an examinee's proportion-correct true score
("zayta") ξ = ξ/n is not changed by lengthening the test. We shall use lengthen
ing in this sense here and throughout this book.
Denote by z ≡ xln the observed proportion-correct score (proportion of n
items answered correctly). The regression of z on ability θ is by Eq. (4-2) and
(4-9)
1 n
µz θ Pi(θ)= ζ. (5-1)
n
i =1
This regression is not changed by lengthening the test. The variance of z for fixed
0 is seen from Eq. (4-3) and (4-4) to be
1 n l
σ2z θ = n2 Pi Qi = (PQ -σ2p θ)- (5-2)
n
i=1
This variance approaches zero as n becomes large.

65
66 5. INFORMATION FUNCTIONS AND OPTIMAL SCORING WEIGHTS

The distribution of z for fixed 0 is generated by Eq. (4-1). As n becomes

large, and σz θ 0, the conditional distribution of z shrinks toward a single
point, its mean.
Now consider Fig. 4.8.1, replacing the number-right scale for x on the verti
cal axis by z ≡ x/n, which ranges from z = 0 to z = 1. The regression µz θ of z
on θ may be visualized as an ogive-shaped curve lying midway between the two
step functions, which approximate the 2½ and the 97½ percentiles of the distri
butions of z given θ. As n becomes large, this regression does not change, but
σ2z |θ shrinks toward zero, so that the step functions crowd in toward the regres
sion curve. At the same time, the number of steps increases so that the step
functions increasingly approximate smooth curves. When the conditional distri
bution of z given 6 has become approximately normal, the distance from the
regression curve to the 2½ percentile, also to the 97½ percentile, will be about
1.96 standard deviations or (1.96/n)√ "iPiQi.
We can visualize the mathematics of the situation by imagining examining
Fig. 4.8.1 under a microscope while n becomes large. If, inappropriately, the
microscope magnifies n times, the bounds of the confidence region will still
appear as step functions. This is true because this view of the figure is equivalent
to examining µnz |θ = µx |θ, the mean number-right score for fixed θ. The num
ber of steps will increase directly with n; no regularities will appear. The bounds
of the confidence region will appear to move away from the regression line, since
σx |θin (5-2) decreases as √n, while the magnification increases as n.
We shall avoid this by looking at √n(z — ζ) rather than at z. As n becomes
large, the conditional variance of √n(z — ζ) remains finite:

σ 2 [√n(z - ζ)|θ = PQ–σ2p,

and the conditional distribution of √n(z — ζ) approaches normality. Thus the

appropriate microscope magnifies only √n times. With this magnification, the
bounds of the confidence region will approach smooth curves that appear to
remain at a fixed distance (proportional to PQ —σ2p) from the regression line as
both n and the magnification are increased simultaneously. At sufficient mag
nification, the portions of curves in any small region will appear to be straight
lines, as in Fig. 5.1.1. The bounds of the confidence region and the regression
line will appear parallel because σz |θ does not change appreciably over a small
range of 6.
The asymptotic confidence interval corresponding to z 0 is (θ, θ). The length
of the confidence interval, the distance AB, describes the effectiveness of the test
as a measure of ability. In the triangle ABC,

CB 2(1.96 σz|θ)
tan α = =
AB AB
or
5.1. THE INFORMATION FUNCTION FOR A TEST SCORE 67

Z≡x/n 9 7.5% ile

»µ|θ

2.5% ile
Z0 A a a
B

l.96σz|θ

i.96<σz|θ

e e ©
FIG. 5.1.1. Construction of a 95% asymptotic confidence interval (0, 0) for ability 0.

= 3.92σzθ
AB
tan α .
Since tan a is the slope of the regression line µz\θ , the information function, as
defined at the beginning of this chapter, for score z is proportional to
d 2
1 = µz\θ
( dθ )
.
AB2 (3.92) 2 Var(z|θ )

Figure 5.1.1 was derived for estimating ability from the proportion-correct
score z. For unidimensional tests, the same line of reasoning applies quite gener
ally to almost all kinds of test scores in practical use. Thus Birnbaum (1968)
defines the information function for any score y to be

d 2
µy|θ )
( dθ
I{θ,y} ≡ (5-3)
Var (y\θ)

The information function for score y is by definition the square of the ratio of the
slope of the regression of y on θ to the standard error of measurement of y for
fixed d.
Now true score n, corresponding to observed score y, is fixed whenever θ is
fixed. (If this were not true, the test items would be systematically measuring
68 5. INFORMATION FUNCTIONS AND OPTIMAL SCORING WEIGHTS

some dimension other than 6, contrary to the requirement of unidimensionality.)

Thus, as in Eq. (4-7), Var (y θ) is identical to the familiar squared standard error
of measurement σ2y η.
The information provided by score y for estimating θ varies at different θ
levels. The variation comes from two distinguishable sources:

1. The smaller the standard error of measurement σy |η, the more informa
tion y provides about θ.
2. The steeper the slope of the regression µy|θ (the more sharply y varies
with θ), the more information y provides about θ.

If two tests have the same true-score scale, their effectiveness as measuring
instruments can properly be summarized by their standard errors of measurement
at various true-score levels. If, however, the tests measure the same trait but their
true-score scales are nonlinearly related, the situation is different. This will
ordinarily be the case whenever the tests are not parallel forms (see Chapter 13).
In this case, it is not enough to compare standard errors of measurement; we must
also take the relation of their true-score scales into account. This is the reason
why the score information function depends not only on the standard error of
measurement but also on the slope of the regression of score on ability.

E x a m p l e 5.1
The use of (5-3) can be illustrated by deriving the information function for the
proportion correct score z. From (5-1),

dµz|θ 1)
= i P' i (θ),
dθ (n
where P'i(θ) is the derivative of Pi(θ) with respect to θ. From (5-3) and (5-2) we
can now write the information function for z:
n 2
P'i(θ),
[ ]
I{θ, z} = i=1
n .
Pi(θ)Qi(θ)
i=1
This result is the same as the information function for number-right score, which
is derived from a more general result in the sequel and presented as Eq. (5-13).

5.2. ALTERNATIVE DERIVATION OF THE

SCORE INFORMATION FUNCTION

A nonasymptotic derivation of (5-3) was given by Lord (1952, Eq. 57) before the
term score information function was coined and also in a different context, by
5.2. ALTERNATIVE DERIVATION OF THE SCORE INFORMATION FUNCTION 69

Mandel and Stiehler (1954). Suppose that we are using test score 3; in an effort to
discriminate individuals at θ' from individuals at θ". Figure 5.2.1 illustrates the
two frequency distributions Φ(y θ) at θ' and θ" and shows the mean µy θ of each
distribution.
A natural statistic to use to measure the effectiveness of y for this purpose is
the ratio

µy θ" – µy θ'
σy θ" ,

where the denominator is some sort of average of σy θ' and σy θ". The displayed
ratio is proportional to the difference between means divided by its standard
error, sometimes called a critical ratio.
If θ' and θ" are close together, µy θ will be an approximately linear function
of θ in the interval (θ' ,θ"). Thus the numerator of our ratio will be proportional to
the distance θ" - θ'. The coefficient of proportionality is the slope of the
regression, given by the derivative d µy θ/dθ. Over short distances, it will make
no difference whether this slope is taken at θ" or at θ'. Also, σy θ" will be close
to σy θ, so their average will differ little froma y \ e σ yθ',.Thus our ratio can be
written

(θ" –θ')(d µy θ/dθ)θ=θ'

σy θ' .
The information function (5-3) when θ = θ' is directly proportional to the
square of the ratio just derived. The coefficient of proportionality is (θ" — θ')2, a
quantity of no relevance for assessing the discriminating power of test score y at
ability level θ = θ'.
If asymptotic values of µy θ and Var (y\θ) are used in (5-3), we find an
asymptotic information function explicable in terms of the length of the asympto
tic confidence interval. If exact values of µy θ and Var (y\θ) are used, the

µ y θ"

µy θ'

FIG. 5.2.1. Using score y to dis

θ'
criminate two ability levels. θ'"
70 5. INFORMATION FUNCTIONS AND OPTIMAL SCORING WEIGHTS

resulting information function may be justified by the nonasymptotic derivation

of the present section. The language used need not always distinguish asymptotic
and nonasymptotic information functions, provided no serious confusion arises.

5.3. THE TEST INFORMATION FUNCTION

The maximum likelihood estimator θ is a kind of test score. Thus, we can use
(5-3) to find the information function of the maximum likelihood estimator. To
do this, we need (asymptotic) formulas for the regression µθ|θ and for the var
iance σ2θ|θ.
There is a general theorem, under regularity conditions, satisfied here
whenever the item parameters are known from previous testing: A maximum
likelihood estimator θ of a parameter θ is asymptotically normally distributed
with mean θ0 (the unknown true parameter value) and variance

1
Var(θ|θ o ) = (5-4)
d ln L 2 ,
E
[ ( dθ ) θ0 }
where L is the likelihood function.
When the item parameters are known, we have from Eq. (5-4) and (4-30) that

1 [ n ]2
= E (ui – Pi)P'i/PiQi θo
Var (θ\θo) { }
i=1
{ [ n
= E (ui – Pi)P'i/PiQi
i=1
n ] }
(uj – Pj)P'j/PjQj |θo
J=1

n n
P'io P'jo
= E[(u i – P i )(u j – P j )|θ o ].
Pio Pjo Qio Qjo
i=l j =1

Since E(ui |θo) = Pio() , the expectation under the summation sign is a covariance.
Because of local independence, ui is distributed independently of uj for fixed θ.
Consequently the covariance is zero except when i = j , in which case it is a
variance. Thus

1 n p 'i2 n p 'io2
= 2
Var
2
(uio|θo) = pi02Qio2•
Var(θ|θo) p i Q 'i p io2 p io2
i=1 i=l
Dropping the subscript o, the formula for the asymptotic sampling variance of the
maximum likelihood estimator is thus
5.3. THE TEST INFORMATION FUNCTION 71

1
Var(θ|θ) = (5-5)
n P'i2 .
Pi Qi
t=1
Now, as already stated, θ is a consistent estimator; so asymptotically µθ|θ
θ. Thus asymptotically the numerator of the information function (5-3) for score
θ is (dµθ|θ/dθ)2 = 1. Thus the (asymptotic) information function (5-3) of the
maximum likelihood estimator of ability is the reciprocal of the asymptotic
variance (5-5):
n p' i 2
I{θ} ≡ I {θ,θ} = (5-6)
Pi Qi .
i= 1
Let us note an obvious theorem in passing:

Theorem 5.3.1. The information function for an unbiased (consistent)

estimator of ability is the reciprocal of the (asymptotic) sampling variance of the
estimator.

Equation (5-6) is of such importance that it is given a special name and

symbol. It is called the test information function and is denoted simply by I{θ}.
Information functions for ordinary published tests are usually roughly bell-
shaped. Such a test information function is shown in Fig. 5.5.1.
The importance of the test information function comes partly from the fact
that it provides an (attainable) upper limit to the information that can be obtained
from the test, no matter what method of scoring is used:

Theorem 5.3.2. The test information function I{θ} given by (5-6) is an upper
bound to the information that can be obtained by any method of scoring the test.

Proof: Suppose t is an unbiased estimator of some function θ. Denote this

function of θ by τ(θ). According to the Cramér-Rao inequality (Kendall &
Stuart, 1973, Section 17.23),
2
[τ' (θ)]
Var (t θ) ≥ (5-7)
[ d ln L ] ,
E 2
( dθ

where τ' (θ) is the derivative of τ'(θ). Since E(t\θ) ≡ τ(θ), we have from (5-3),
(5-4), and (5-7) asymptotically

[τ' (θ)]2 [ d ln L 1
I{θ, t}= ≤E 2 = = I{θ}. (5-8)
Var (t θ) ( dθ ) ] Var (θ θ)

This result holds under rather general regularity conditions on the item response
function P i ( θ ) .
72 5. INFORMATION FUNCTIONS AND OPTIMAL SCORING WEIGHTS

5.4. THE ITEM INFORMATION FUNCTION

A very important feature of (5-6) is that the test information consists entirely of
independent and additive contributions from the items. The contribution of an
item does not depend on what other items are included in the test. The contribu
tion of a single item is P'i2/Pi Qt. This contribution is called the item information
function:
p i '2
I{θ, ui} = (5-9)
PiQi .
Item information functions for five familiar items are shown in Figure 2.5.1
along with the I{6} for the five-item test.
In classical test theory, by contrast, the validity coefficient ρxC for number-
right test score (correlation between score and criterion C) is given by Eq. (1-25)
in terms of item intercorrelations ρij and item-criterion correlations ρ i c . There is
no way to identify the contribution of a single item to test validity; the contribu
tion of the item depends in an intricate way on the choice of items included in the
test. The same may be said of an item's contribution to coefficient alpha, as
shown by Eq. (1-24), and to other test reliability coefficients.
For emphasis and clarity, let us elaborate here Birnbaum's (1968) suggested
procedure for test construction, previewed in Chapter 2. The procedure operates
on a pool of items that have been calibrated by pretesting, so that we have the
item information curve for each item.1

1. Decide on the shape desired for the test information function. Remember
that this information function is inversely proportional to the squared length of
the asymptotic confidence interval for estimating ability from test score. What
accuracy of ability estimation is required of the test at each ability level? The
desired curve is the target information curve.
2. Select items with item information curves that will fill the hard-to-fill
areas under the target information curve.
3. Cumulatively add up the item information curves, obtaining at all times the
information curve for the part-test composed of items already selected.
4. Continue (back-tracking if necessary) until the area under target informa
tion curve is filled up to a satisfactory approximation.

The item information curve for the three-parameter logistic model in Eq. (2-1)
can be written down from (5-9) in many forms, such as

1
These rules are reproduced with special permission from F. M. Lord, Practical applications of
item characteristic curve theory. Journal of Educational Measurement, Summer 1977, 14, No. 2,
117-138. Copyright 1977, National Council on Measurement in Education, Inc., East Lansing,
Mich.
5.5. INFORMATION FUNCTION FOR A WEIGHTED SUM OF ITEM SCORES 73

Pi – ci 2
= D a i Qi
2 2
I{θ, ui} ,
Pi ( 1 – ci
D2a2i(1 – ci)
I{θ, ui] =
(c, + e D L i ) ( 1 +e-DLi)2 '

where Li = ai(θ - bi).

5.5. INFORMATION FUNCTION FOR A WEIGHTED

S U M OF ITEM S C O R E S

Suppose the test score is the weighted composite y ≡ iWiui,, where the wi are
any set of weights. Since each ui is a (locally) independent binomial variable, we
have

µ2 wu θ = i Wi Pi, (5-10)

σ2 wu θ = i W2i Pi (5-11)
By (5-3), the information function for the weighted composite is

( iWiPi )2 (5-12)
I{θ, iWiui} = 2
i W i Pi ,
.

If the weights are all 1, y is the usual number-right score x. Thus the informa
tion function for number-right score x is

( iP'i )2 (5-13)
I{θ, x} =
i Pi Qi ,

Note that (5-12) and (5-13) cannot be expressed as simple sums of independent
additive contributions from individual items, as in (5-6).
Figure 5.5.1 shows the estimated information function I{θ, x} for the
number-right score on a high school-level verbal test (SCAT II, Form 2A)
composed of 50 four-choice word-relations items. For comparison, the test in
formation function I{θ} is shown and also two other information functions to be
discussed later. The test characteristic curve is given also.
We have seen that the squared slope of the test characteristic curve is the
numerator of the information function for number-right score. The inflection
point of the test characteristic curve in the figure is to the left of the maximum
information, showing the effect of the denominator (squared standard error of
measurement).
The relation shown between the number-right curve and the upper bound I{θ}
is fairly typical of plots seen by the writer. This relation is of interest since it
limits the extent to which we can hope to improve the accuracy of measurement
by improving the method of scoring the test.
74 5. INFORMATION FUNCTIONS AND OPTIMAL SCORING WEIGHTS

ABILITY
-3 -2 -I 0 1 2 3

I 8
50
TEST CHARACTERISTIC CURVE
EQUAL WEIGHTS
OPTIMAL WEIGHTS

15
40 SCORING WEIGHTS A
SCORING WEIGHTS (5-18)

12
SCORE

3.0

INFORMATION
RIGHT

9
NUMBER

6
10

3
0

0
-3 -2 0 1 2 3
-1 ABILITY

FIG. 5.5.1. Test characteristic curve (solid line) and various information curves
(dashed lines) for SCAT II 2A Verbal test. Item-scoring weights for the informa
tion curves are specified in the legend.

5.6. OPTIMAL SCORING WEIGHTS

Suppose we allow the item-scoring weight wi to be a function of θ. In particular,

consider using as item-scoring weight the function wi( θ)≡ P'i/PiQi. Sub
stituting this weight into (5-12) gives the result
p '2
P'i ui, } ( iP'i
2
Qi Qi)2 1
/ θ, = i (5-14)
{ i
PiQi
=
i Pi Qi(p'i/Pi Qi)
2 i
PiQi

We have the surprising result that the information function for the weighted
composite i (p'i/Pi Qi)ui is the same as the test information function, which is
the maximum information attainable by any scoring method. Thus
p'i(θ) (5-15)
Wi, (θ) ≡
P i (θ)Q i (θ)

is the optimal scoring weight for item i.

In practice, we do not know θ for any individual; hence we cannot know
Wi(θ). We can approximate θ and thus Wi(θ), however.
If the item response function is three-parameter logistic [Eq. (2-1)], it is easily
verified that
5.6. OPTIMAL SCORING WEIGHTS 75

Dai Qi(Pi – ci )
P'i = (5-16)
1 – ci
From (5-15) and (5-16), the optimal item-scoring weights are
Dai (Pi – ci ) Dai
Wi(θ) = (5-17)
Pi(1 – ci) 1 + cie–DLi
where Li ≡ ai{θ — bi). Note that when ci = 0, the optimal weight is 1.7ai or
since we may divide all the weights by 1.7, simply ai.
At high ability levels P i a (θ) 1, consequently Wi(θ) Dai . Thus, we see
that at high ability levels optimal scoring weights under the logistic model are
proportional to item discriminating power ai.
The optimal weights Wi(θ) under the logistic model are shown in Fig. 2.5.2
for five familiar items. Note the following facts about optimal item weights for
the logistic model, visible from the curves in the figure.2
1. As ability increases, the curve representing optimal item weight as a func
tion of ability sooner or later becomes virtually horizontal. Thus, for sufficiently
high ability levels, the optimal item weights are virtually independent of ability
level. The optimum weight at this upper asymptote is proportional to the item
parameter ai. This occurs because there is no guessing at high ability levels.
2. As ability decreases from a very high level, the optimal weight curves for
the difficult items begin to decline. The reason is that at lower ability levels
random guessing destroys the value of these items.
3. As ability decreases further, the optimal weights for these difficult items
become virtually zero. Such items will not be wanted if the test is used only to
discriminate among examinees at low ability levels.
In summary, under the logistic model, the optimal weight to be assigned to an
item for discriminating at high ability levels depends on the general discriminat
ing power of the item. The optimal weight to be used for discriminating at lower
ability levels depends not only on the general discriminating power of the item
but also very much on the amount of random guessing occurring on the item at
these ability levels. Thus, all moderately discriminating items are of use for
discriminating at high ability levels, whereas only the easy items are of appreci
able use for discriminating at low ability levels.
Item-scoring weights that are optimal for a particular examinee can never be
determined exactly, since we do not know the examinee's ability θ exactly.3 A

2
The remainder of this paragraph is taken with permission from F. M. Lord, An analysis of the
Verbal Scholastic Aptitude Test using Birnbaum's three-parameter logistic model. Educational and
Psychological Measurement, 1968, 28, 989-1020.
3
The remainder of this section is adapted and reprinted with special permission from F. M. Lord,
Practical applications of item characteristic curve theory. Journal of Educational Measurement,
Summer 1977, 14, No. 2, 117-138. Copyright 1977, National Council on Measurement in Educa
tion, Inc., East Lansing, Mich.
76 5. INFORMATION FUNCTIONS AND OPTIMAL SCORING WEIGHTS

crude procedure for obtaining item-scoring weights is to substitute the conven

tional item difficulty pi (proportion of correct answers in the total group of
examinees) for Pi(θ) in (5-17). This crude procedure would use the resulting
weight for scoring item i on all answer sheets regardless of examinee ability
level. Since D = 1.7 is a constant, we can drop it and use the weight
ai Pi – Ci
Wi = (5-18)
1 — ci Pi .
This same item-scoring weight, except for the ai, was recommended on other
grounds by Chernoff (see Lord & Novick, 1968, p. 310).
The effect of using the crude scoring weights is illustrated for the SCAT II-2A
verbal test in Fig. 5.5.1. For this test, the crude weights are almost everywhere
better than no weights at all.
A better but more complicated procedure for determining scoring weights for
a conventional test might be somewhat as follows.

1. Score the test in the usual way.

2. Divide the examinees into three or more subgroups according to the usual
scores.
3. Separately for each subgroup, use (5-18) to find a roughly optimal scoring
weight for each item.
4. Rescore all answer sheets using the item-scoring weights from step 3
(different weights in each subgroup).
5. Equate the three score scales obtained from the three sets of scoring
weights. Conventional equating methods (Angoff, 1971) may be used for
this.
6. Use the equating to place everyone on the same score scale.

The foregoing procedure should improve measurement effectiveness, since

each answer sheet is scored with weights roughly appropriate for the examinee's
ability level. Too much should not be expected, however. If only a third of the
items in a test are useful for measuring examinees at a certain ability level, no
amount of statistical manipulation will make the test a really good one for such
examinees.

5.7. OPTIMAL SCORING WEIGHTS NOT DEPENDENT ON θ

Is there an item response function P(θ) such that the optimal weights w(θ)
actually do not depend on θ? If so, then

w(θ) ≡ P'(θ)
= A,
P(θ)Q(θ)
where A is some constant. This leads to the differential equation
5.9. EXERCISES 77

dP
= A dθ.
P(l – P)
Integrating, we have uniquely
1 – P
–In = Aθ + B,
P
where B is a constant of integration. Solving for P, we find
1 1
P ≡ P(θ) =
1 + e–Aθ–B = 1 + e-Da(θ-b) ,
where A ≡ Da and B ≡ —Dab.
In summary, when the item response function is a two-parameter logistic
function, the optimal scoring weight Wi(θ) does not depend on θ. The optimal
weight is wi = ai , the item discrimination index. The optimally weighted compos
ite of item scores is s ≡ i a i u i , the sufficient statistic of Section 4.12. The
two-parameter logistic function, which does not permit guessing, is the most
general item response function for which the optimal item scoring weights do not
depend on θ.
Figure 5.5.1 shows the information curve obtained when the weights wi = ai
are used for the SCAT II-2A verbal test. The score iaiui is optimally efficient
at high ability levels but is less efficient than number-right score at low ability
levels. This is the result to be expected on a multiple-choice test, since wi = ai is
optimal only when there is no guessing.

5.8. MAXIMUM LIKELIHOOD ESTIMATE OF ABILITY

From Eq. (4-31) and (5-15), if the item parameters are known from pretesting,
the maximum likelihood estimate of ability is obtained by solving, for θ, the
equation

n pi(θ) n W (θ)u . (5-19)

Qi(θ) =
i i

i= 1 i=1
Thus we see that the maximum likelihood estimator θ is itself a function of the
optimally weighted composite of item scores iWi(θ)ui with θ substituted for θ.
This is true regardless of the form of the item response function Pi(θ).

5.9. EXERCISES

5-1 For test 1, compute from Table 4.17.1 the mean number-right score x at
θ = —3, — 2, —1,0, 1, 2, 3 using Eq. (4-2). Plot the regression of x on
θ. This is the test characteristic function.
78 5. INFORMATION FUNCTIONS AND OPTIMAL SCORING WEIGHTS

5-2 As in Exercise 5-1, compute the standard deviation [Eq. (4-3)] of number-
right score for integer values of θ. Plot on the same graph as the regres
sion σx θ .
5-3 From Table 4.17.1, plot on a single graph the item information function
for each of the three items in test 1.
5-4 Compute from Table 4.17.1 the test information function (5-6) of test 1.
Plot on the same graph as µx θ andσx θ Also plot on the same graph as
the item information functions.
5-5 to 5-8 Using Table 4.17.2, repeat Exercises 5-1 to 5-4 for a three-item test
with a = 1/1.7, b = 0, c = 0 for all items. Compare with the results of
test 1.
5-9 Compute from (5-5) the variance of θ at integral values of θ for test 1.
5-10 From Table 4.17.1, compute the score information function (5-13) for the
number-right score x. Plot this and the test information function (5-6)
from Exercise 5-4 on the same graph.
5-11 For each item in test 1, plot the optimal scoring weights (5-15) as a func
tion of θ.
5-12 For test 1, compute the optimally weighted composite score iWi(θ)ui
for examinees at ability level θ=0 responding ux = 1, u2 = 0, u3 = 0.
Repeat for ui, = 0, u2 = 1, u3 = 0; also repeat for ux = 0, u2 = 0, u3
= 1. Can you explain why the scores for the three patterns should be in
the rank order you have found?
5-13 Compute the optimal item-scoring weight (5-15) at each θ level for the
items in Table 4.17.2. Explain.

APPENDIX

Information Functions for Transformed Scores

What is the effect on the score information function (5-3) of transforming the
score scale? Let Y ≡ Y(y) denote a monotonic transformation of the score y, and
let η = E(y\θ) be the true score corresponding to y.
For present purposes, score y will be assumed to have the property that η ≡
E(y θ) does not depend on test length n. For most conventional scoring methods,
this is usually a trivial requirement. For example, if x is the number of right
answers, ξ = Ex = n Pi(θ) is not independent of test length. Instead of using x,
however, we simply use z, the proportion of right answers. The proportion-
correct true score ζ ≡ Ez = [ nPi(θ)]/n does not vary if n is increased by
adding test forms parallel to the original test (see Section 5.1).
Now y is an unbiased estimate of η, and E{y —η)2 is the sampling variance of
the estimator y. It is usual to find that such sampling variances are of order 1/n;
this means that for sufficiently large n the sampling variance can be written as
APPENDIX 79

(1/n) times a constant term (a term that does not vary with n) and also that E(y —
η)3, the third sampling moment of y, is of order n–3/2 (is a constant divided by
n3/2). We assume this in all that follows.
Expanding Y(y) by Taylor's formula, we have
Y(y) – Y(η) = Y'(η)(y - η) + ½ y"(η)(y - η)2 +
δY'"(η)(y – η)3, (5-20)
where0 <δ < 1 and Y' (η), Y"(η), Y' "(η) are derivatives of Y(η) with respect to η.
Rearranging (5-20) and taking expectations, we find that the expectation of Y is

E(Y θ) = Y(η)) + terms of order1/n . (5-21)

Squaring (5-20) and taking expectations, we find a formula for the sampling
variance of Y:
Var (Y θ) ≡ E{[Y(y) - Y(η})]2 θ} = [Y'(η)]2 Var (y\ θ) + terms
of order n–3/2. (5-22)
From (5-21),
d dη] 1
dθ
E(Y θ) = Y'(η)dθ + terms of order .
n
From this and (5-22) and (5-3), we obtain the information function for the
transformed score Y(y):
2
I{θ, Y(y)} = (dη/dθ) + neglected terms.
Var (y\θ)
Since Var (y\θ) is a constant times 1/n and the numerator is independent of n, the
fraction on the right is of order n (a constant times n). The largest neglected
terms are easily seen to be constant with respect to n. For large n, the largest
neglected terms are therefore small compared to term retained. The term retained
is seen to be the information function of the untransformed score y. Asymptoti
cally,
/{θ, Y(y)} = l{θ,y}. (5-23)
In summary, if (1) y is a score chosen so that the corresponding true score
does not vary with n, (2) Y(y) is a monotonic transformation of 3; not involving
n, (3) Var (y θ) is of order 1/n and E[(y —η))3θ]is of order n–3/2, then the score
transformation Y(y) does not change the asymptotic score information function.
The first restriction in this summary is readily removed for most sensible
methods of scoring. The number-right true score ξ, for example, varies with n,
whereas the proportion-correct true score ζ = ξ/n does not; yet bothξ andζ have
the same information: I { θ , ξ } = I{θ, ζ}.
The invariance (5-23) of I{θ, Y(y)} is important; however, it is not surprising
80 5. INFORMATION FUNCTIONS AND OPTIMAL SCORING WEIGHTS

in view of the definition of information. A monotonic transformation of score y

should not change the confidence interval for θ.

REFERENCES

Angoff, W. H. Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.), Educational

measurement (2nd ed.). Washington, D.C.: American Council on Education, 1971.
Birnbaum, A. Some latent trait models. In F. M. Lord & M. R. Novick, Statistical theories of mental
test scores. Reading, Mass.: Addison-Wesley, 1968.
Kendall, M. G., & Stuart, A. The advanced theory of statistics (Vol. 2, 3rd ed.). New York: Hafner,
1973.
Lord, F. M. A theory of test scores. Psychometric Monograph No. 7. Psychometric Society, 1952.
Lord, F. M., & Novick, M. R. Statistical theories of mental test scores. Reading, Mass.: Addison-
Wesley, 1968.
Mandel, J., & Stiehler, R. D. Sensitivity—A criterion for the comparison of methods of test. Journal
of Research of the National Bureau of Standards, 1954, 53, 155-159.
II APPLICATIONS OF
ITEM RESPONSE THEORY
6 The Relative Efficiency of
Two Tests

6.1. RELATIVE EFFICIENCY

The relative efficiency of test score y with respect to test score x is the ratio of
their information functions:

RE {y, x} ≡ I{θ, y} (6-1)

I{θ, x} .

Scores x and y may be scores on two different tests of the same ability θ, or x and
y may result from scoring the same test in two different ways. Relative efficiency
is defined only when the θ in I{θ, y} is the same θ as in I{θ, x}. Although the
notation does not make it explicit, it should be clear that the relative efficiency of
two test scores varies according to ability level.
The dashed curve in Fig. 6.7.1 shows estimated relative efficiency of a
"regular" test compared to a "peaked" test. Both are 45-item verbal tests
composed of five-choice items. The regular test (y) consists of the even-
numbered items in a 90-item College Board SAT. The peaked test (x) consists of
45 items from the same test with difficulty parameters nearest the average bi (the
average over all 90 items).

83
84 6. THE RELATIVE EFFICIENCY OF TWO TESTS

There is considerable overlap in items between the two 45-item tests, but this
does not impair the comparison. As the figure shows, from the third percentile up
through the thirtieth, the regular test with its wide spread of item difficulty is less
than half as efficient as the peaked test. In other words, the regular test would
have to be lengthened to more than 90 items in order to be as efficient as the
45-item peaked test within this range.

6.2. TRANSFORMATIONS OF THE ABILITY SCALE

The ability scale θ is the scale on which all item response functions have the
particular mathematical form Pi(θ). This is a specified form chosen by the
psychometrician, such as Eq. (2-1) or (2-2). Except for the theoretical case where
all items are equivalent, there is no transformation of the ability scale that will
convert a set of normal ogive response functions to logistic, or vice versa.
Once we have found the scale θ on which all item response curves are (say)
logistic, it is often thought that this scale has unique virtues. This conclusion is
incorrect, however, as the following illustration shows.
Consider the transformations
Dai
θ* ≡ θ*(θ) ≡ Kekθ, b*i ≡ Kekbi, ai* ≡ ,
(6-2)
k
where K and k are any positive constants. Under the logistic model
1 — ci
Pi ≡ ci + 1 + e –Da
eDa i bi
iθ

1 — ci (6-3)
≡ ci +
1 +(b*i/θ*)ai*
Also
(b*i /θ*)ai*
Qi=(1 – ci) .
1 + (b*i /θ*)ai*
Thus
Pi – ci θ* ai*
= * (6-4)
Qi ( bi) .
This last equation relates probability of success on an item to the ratio of exam
inee ability θ * to item difficulty b*i. The relation is so simple and direct as to
suggest that the θ* scale may be better for measuring ability than is the θ scale.
By assumption, all items have logistic response curves on the θ scale; how
ever, it is equally true that all items have response curves given by (6-3) on the θ*
scale. Thus there is no obvious reason to prefer θ to θ*.
6.3. EFFECT OF ABILITY TRANSFORMATION 85

6.3. EFFECT OF ABILITY TRANSFORMATION ON THE

INFORMATION FUNCTION

If there is no unique virtue in the θ scale for ability, we should consider how a
monotonic transformation of this scale affects our theoretical machinery. There
is nothing about our definition or derivation of the information function [Eq. (5-3)]
that requires us to use the θ scale rather than the θ* scale. If θ* is any monotonic
transformation of θ, the information function for making inferences about θ*
from y is defined by Eq. (5-3) to be

(dµy|θ* /dθ*)2
I{θ*, y} ≡ (6-5)
Var (y\θ*) .
Before proceeding, we need to clarify a notational paradox. Note that, for
every θ O ,

Var (y|θ=θ0) ≡ Var (y|θ=θ(θo)).

The left-hand side is usually abbreviated as Var (v|θ o ), the right-hand side as Var
(y|θ *). The abbreviated equation Var (y\θ0) = Var (v|θ* o ) appears self-contradic
tory. This is the fault of the abbreviated notation and does not impair the validity
of the unabbreviated result. Similarly (in abbreviated notation),

µy|θo ≡ µy|θ(θo) ≡ µy|θo.

By the chain rule for differentiation, d/dθ* ≡ (dθ/dθ*)(d/dθ). Substituting
the last three equations into (6-5) and dropping the subscript o, we have the
important result
(dµy|θ* /dθ*)2 dθ 2
I{θ*, y} ≡
Var (y\θ*) ( dθ* )
= (dµy|θ /dθ)2 dθ 2
Var (y\θ ( dθ* )
I{θ, y}
= (6-6)
dθ* 2
( dθ )

This result states: When we transform θ monotonically to θ*(θ), the information

function is divided by the square of the derivative of the transformation.
This is as it should be. The confidence interval (θ, θ) in Fig. 5.1.1 transforms
into the confidence interval (θ*, θ*). Asymptotically, the length of the latter
interval will be dθ*/dθ times the length of the former interval. Thus the informa
tion function I{θ*, x} will equal I{θ, x} divided by (dθ*/dθ)2.
When dθ*/dθ varies along the ability scale, the shape of the information
86 6. THE RELATIVE EFFICIENCY OF T W O TESTS

THETA SCALE OF ABILITY

s
c
o
R
20 E

I
N
F
0
R
M
A
T
0
N
F
U
N
C
T
i
0
10 N

0
2 10 25 50 75 90 98 %ile

FIG. 6.3.1. Score information function for measuring ability θ, SAT Mathematics
test. Taken with permission from F. M. Lord, The 'ability' scale in item charac
teristic curve theory. Psychometrika, 1975, 40, 205-217.

function may be drastically altered by the transformation. Worse yet, the ability
level at which a test provides maximum information may be totally different
when ability is measured by θ* rather than by θ. Or /{θ, x} may have one
maximum, whereas I{θ*, x} has two separate maxima. Actually, any single-
valued continuous information function on θ may be transformed to any other
such information function by a suitably chosen monotonic transformation θ*(θ).
Figure 6.3.1 shows the information function I{θ, x} for number-right score
on a 60-item College Board mathematics aptitude test. The baseline, representing
ability, is marked off in terms of estimated percentile rank on ability for the
group tested rather than in terms of θ values. Figure 6.3.2 shows a rather mild
transformation θ*(θ) ≡ ω(θ). Figure 6.3.3 shows the resulting information func-
6.3. EFFECT OF ABILITY TRANSFORMATION 87

THETA SCALE OF ABILITY

2.0

1.5

1.0

0.5 O
M
E
G
A

0.0 s
cA
L
E
O
-0.5 F
A
B
I
L
-1.0
ITY

-1.5

•2.0

-2.5
-3.5 -3.0 -2.5 -2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 '.5

FIG. 6.3.2. Relation of the ω scale of ability to the usual 6 scale. Taken with
permission from F. M. Lord, The 'ability' scale in item characteristic curve
theory. Psychometrika, 1975, 40, 205-217.

tion I{ω, x} on the Ω scale for the same number-right score. The information
functions for the same score x on the two different ability scales bear little re
semblance to each other.
Clearly information is not a pure number; the units in terms of which informa
tion is measured depend on the units used to measure ability. This must be true,
since information is defined by the length of a confidence interval, and this
length is expressed in terms of the units used to measure ability. If we are
uncertain what units to use to quantify ability, then to the same extent we do not
know how to quantify information.
We cannot draw any useful conclusions from the shape of a single information
function unless we assert that the ability scale we are using is unique except for a
88 6. THE RELATIVE EFFICIENCY OF TWO TESTS

OMEGA SCALE OF ABILITY

s
c
O
R
E
20 I
N
F
O
R
M
A
T
I
O
N
F
U
N
C
T
I
O
N
10

0
2 10 25 50 75 90 98 % ile

FIG. 6.3.3. Score information function for measuring ability ω, SAT Mathematics
test. Taken with permission from F. M. Lord, The 'ability' scale in item charac
teristic curve theory. Psychometrika, 1975, 40, 205-217.

linear transformation. Most important we cannot know at what ability level the
test or test score discriminates best, unless we have an ability scale that is not
subject to challenge.
Even though a single information curve may not be readily interpretable,
comparisons between two or more information curves are not impaired by doubt
about the ability scale. This important fact is easily proved in Section 6.4.

6.4. EFFECT OF ABILITY TRANSFORMATION ON

RELATIVE EFFICIENCY

Suppose we transform the ability scale monotonically to θ*(θ) and then compute
the relative efficiency of two scores, x and y (which may be scores on one test or
6.5. INFORMATION FUNCTION OF OBSERVED SCORE ON TRUE SCORE 89

on two different tests), for measuring θ*. Replacing θ by 0* in (6-1) and using
(6-6), we find
2
RE {y, x} =
I{θ*, y} I{θ, y} (dθ*/dθ) I{θ, y}
2
I{θ*, x} I{θ, x} (dθ*/dθ) I{θ, x}
Comparing this with (6-1), we see that relative efficiency is invariant under any
monotonic transformation of the ability scale. It is for this reason that the symbol
6 does not appear in the notation RE {y, x}.
For the reasons outlined in Section 6.3, the practical application of item
response theory in this book are not based on inference from an isolated informa
tion function. We shall compare information curves, or equivalently we shall rely
on a study of relative efficiency. Such comparisons are not affected by the choice
of scale for measuring ability.

6.5. INFORMATION FUNCTION OF OBSERVED SCORE

ON TRUE SCORE

It was noted in Section 4.2 that number-right true score ξ is a monotonic increas
ing transformation of ability θ. What is the information function of number-right
score x for making inferences about true score ξ?
If we substitute ξ for θ* and x for y in (6-5), we find
(dµx|ξ/dξ)2
I{ξ, x} = (6-7)
σ2x|ξ
Now the true score ξ is defined as the expectation of x. It follows that µx|ξ = ξ.
If we substitute this into the numerator of (6-7), the desired information function
is found to be

1 (6-8)
I{ξ, x} ≡
σ2x|ξ
When using observed score x to make inferences about the corresponding
true score ξ, the appropriate information function I{ξ , x} is the reciprocal of
the squared standard error of measurement of score x at ξ. This result will hold
for any score x, not just for number-right score, as long as ξ ≡ µx|ξ is a mono
tonic function of θ.
Figure 6.5.1 shows I{ξ, x} for the same test represented in Fig. 6.3.1 and
6.3.3. The reader should compare these three information functions, noting once
again that information functions do not give a unique answer to the question: ' 'At
what ability level does the test measure best?"
The reader may have been startled to find from Fig. 6.5.1 that I{ξ, x} is
greatest at high and at low ability levels and least at moderate ability levels.
Actually, similar results would be found for most tests. Examinees at very high
90 6. THE RELATIVE EFFICIENCY OF TWO TESTS

0.5
0.4
FUNCTION
0.3
INFORMATION
0.2
SCORE
0. I
0.0

0 20 40 60
NUMBER-RIGHT TRUE SCORE

FIG. 6.5.1. Score information function for measuring the true score ξ on SAT
mathematics test.

ability levels are virtually certain to obtain a perfect score on the test. Thus for
them the standard error of measurement σx|ξ is nearly zero, their true score ξ is
very close to n, the length of the confidence interval for estimating ξ from x is
nearly zero, and consequently I{ξ, x} ≡ l/σ2x|ξ is very large. Clearly true score
ξ can be estimated very accurately for such examinees: It is close to n. Their
ability 0 cannot be estimated accurately; however: We know that their 6 is high
without knowing how high. This situation is mirrored by the fact that I{ξ, x} is
very large for such examinees, whereas I{θ, x} is near zero. The reader should
understand these conclusions if he is to make proper use of information functions
(or of standard errors of measurement).

6.6. RELATION BETWEEN RELATIVE EFFICIENCY AND

TRUE-SCORE DISTRIBUTION

Suppose now that we have another test measuring the same ability 8 as test x.
Denote the observed score on the new test by y and the corresponding true score
by . As in (6-8), the information function for y on will be
1 (6-9)
I{ , y} = σ2
y|
6.7. AN APPROXIMATION FOR RELATIVE EFFICIENCY 91

For present purposes, it is not necessary that either x or y be a number-right

score. Suppose merely that ξ is some monotonic increasing function of ability 0
and that also is. Then ξ is necessarily a monotonic increasing function of ξ; we
write ≡ (ξ). Also £ is a monotonic increasing function of . If x and 3; are
number-right scores, (Ξ) is found (numerically) by eliminating 0 from the test
characteristic curves Ξ = iPi(θ) and = jPj(θ). Figure 13.5.1 shows how
the value of (ξ 0 ) may be determined graphically for any given ξ0 from ξ =
iPi(θ) and = jPj(θ).
Now, we can substitute ξ for 0* and for 0 in (6-6) and then use (6-9) to write
the information function of y on ξ:
2
d
I{ξ,y} =
I{ ,y}
=
( )
—
dξ
. (6-10)
dξ σ2y|
(—d )
2

The efficiency of y relative to x is the ratio of (6-10) to (6-8), or

2 d 2
RE {y, x} = σ2 y|ξ
σ y| ( dξ ) . (6-11)

Similarly,
2
RE {x, y} = σ2y| dξ 2
σ y|ξ d ( )
. (6-12)

Equations (6-11) and (6-12) are valid regardless of the scale used to measure
ability (see Section 6.4). In particular, (6-11) and (6-12) do not assume that
ability is to be measured on the true-score scale ξ.
Denote by P(ξ) the frequency distribution (density) of true score ξ in some
population of examinees. The distribution Q( ) of = (ξ) in this same popula
tion is then found from
q( )d( ) ≡ p(ξ)dξ (6-13)
Rearranging, we have
d
= P(ξ) .
dξ q( )
Substituting this into (6-11), we find

RE {y, x} =
σ2x|ξ
P2(ξ) (6-14)
2
σ-y| ξ q [ (ξ)]
To our surprise, this formula shows that the relative efficiency of two tests can
be expressed directly in terms of true-score frequency distributions and standard
errors of measurement. The formulas agree with the vague intuitive notion that a
test is more discriminating at true-score levels where the scores are spread out
and less discriminating at true-score levels where the scores pile up.
92 6. THE RELATIVE EFFICIENCY OF TWO TESTS

6.7. AN APPROXIMATION FOR RELATIVE EFFICIENCY

If estimated item parameters are available, estimated relative efficiency can be

directly and simply computed from (6-1) and from the formula for the appro
priate information function, such as Eq. (5-6), (5-12), or (5-13). If estimated
item parameters are not conveniently available, it may be possible to estimate
the necessary quantities on the right side of (6-14), thus approximating relative
efficiency without requiring the item parameters ai, bi, and ci. A method for
estimating true-score distributions p(ζ) and q( ) is the subject of Chapter 16.
The following application is presented here rather than in Chapter 16 because
it leads to the much simpler approximation described in Section 6.8. An approx
imation for the standard error of measurement is given by (Lord, 1965; Eq. 9,34):

σx|ξ = (nx - 2kx)ξ(nx - ξ)

(6-15)
n2x ,
½n 2 x (n x - 1)s2p
kx ≡ (6-16)
[x(n x - x) - s2x - nxs2p] ,
where x and s2x are the sample mean and variance (over people) of the number-
right scores, s2p is the sample variance (over items) of the pi, and pi is the sample
proportion of correct answers to item i; σy| is obtained similarly, from the same
group or from an equivalent group of examinees.
When item parameters ai , bi , and ci have not been estimated, the relation
(ξ) between and ξ may be obtained as follows. Integrating (6-13), we have,
for any value ξ0,
- (ξ )
q( ) d ≡ p(ξ) dξ. (6-17)
∫-∞ ∫ξ0
-∞

In more familiar terms, this equation says that 0 ≡ (ξ0) has the same percen
tile rank in q( ) as ξ0 does in p(ξ). Thus for any value of ξ, = (ξ) is to be
obtained by standard equipercentile equating. The distributions q( ) and p(ξ)
must be for the same group or for statistically equivalent groups of examinees.
Given estimates of q( ) and p(ξ), the integration and equating are done by
numerical methods by the computer (see Section 17.3).
A computer program (Stocking, Wingersky, Lees, Lennon, & Lord, 1973), is
available to compute (6-14). The program uses estimates (£) and q( ) obtained
by the methods of Chapter 16. It then uses (6-17) to find equivalent values of ξ
and . Finally, using approximation (6-15), it computes relative efficiencies by
(6-14).
In Fig. 6.7.1, the solid curve is the approximate relative efficiency from
(6-14). The dotted curve is the ratio of information functions computed by (6-1)
6.7. AN APPROXIMATION FOR RELATIVE EFFICIENCY 93

10.0
9.0
8.0
7.0
6.0
5.0
4.0
R. E. of "Regular" Test vs. Peaked Test

3.0

2.0

1.0
.9
.8
.7
.6
.5
.4

0 .2 .4 .6 .8 1.0 ξ
.1
1 5 10 25 50 75 90 95 9 9 %

FIG. 6.7.1. Approximation (solid line) to relative efficiency [Eq. (6-14)] compared
with estimate (dashed line) from Eq. (6-1) and (5-13). (From F. M. Lord, The
relative efficiency of two tests as a function of ability level. Psychometrika, 1974,
39, 351-358.)

from estimated item response function parameters. The two tests under compari
son are the regular test (y) and the peaked test (x), described in more detail in
Section 6.1. Approximation (6-14) tends to oscillate about the estimated relative
efficiency (6-1), but the approximation is adequate for the practical purpose of
comparing the effectiveness of the tests over a range of ability levels. The
agreement found here and in later sections of this chapter between relative
efficiency calculated from item parameters and relative efficiency approximated
from totally different sources is a reassuring illustration of the adequacy of item
response theory and of the procedures used for estimation of item parameters.
As noted in Section 6.4, the relative efficiency of two tests remains the same
under any monotonic transformation of the ability scale. Thus, the RE curve can
be plotted against any convenient baseline. In Fig. 6.7.1, the baseline is scaled in
terms of true score ζ = ξ/n = iPi(θ)/n for the peaked test [see Eq. (4-5)].
Is it a good rule of test construction to spread the items over a wide range of
item difficulty, so as to have some items that are appropriate for each examinee?
Or will a peaked test with all items of equal difficulty be better for everyone? In
Fig. 6.7.1, the peaked test (really only partially peaked—it is hard to find 45
items that are identical in difficulty) is better than the regular (unpeaked) test for
all examinees from the first through the seventy-fifth percentile. If the peaked
94 6. THE RELATIVE EFFICIENCY OF TWO TESTS

test were more difficult, it might be better from perhaps the tenth percentile up
through the nintieth.

6.8. DESK CALCULATOR APPROXIMATION FOR

RELATIVE EFFICIENCY

Although the approximation of Section 6.7 avoids the need to estimate item
response function parameters, the method (see Chapter 16) for estimating p(ξ)
and q( ) is far from simple. Section 6.7 is included here because it leads to the
suggestion that a simple approximation to relative efficiency can be obtained by
substituting observed-score relative frequencies, fx and fy, say, for the true-
score densities p(ξ) and q( ).
A simple approximation to σx|ξ and σy| is also available. If the nx items in
test x are considered as a random sample from an infinite pool of items, then the
sampling distribution of number-right score x for a particular examinee, over
successive random samples of items, is the familiar binomial distribution
(nxx)ζx(1 - ζ)nx-x,
where ζ is a parameter characterizing the individual. Since ξ(x|ζ = nxζ = ξ
for the binomial, ζ or ξ is the individual's true score. [Although it may not
seem so, the fact is that the binomial model just described holds just as well
when the items are of widely varying difficulty as when they are all of the
same difficulty. A simple discussion of this fact is given by Lord (1977).]
Under the binomial model just outlined, the sampling variance of an exam
inee's number-right score x over random samples of items is given by the
familiar formula

σ2x|ζ = n x ζ(1 - ζ) (6-18)

or, equivalently,

σ2x|ζ = ξ(nx - ξ) (6-19)

A similar formula holds for y.

If we substitute these into (6-14) and replace ξ by x, ) by y, we have

RE {y, x} ≡ nyx(nx — x) f2x

(6-20)
nxy(ny - y) f2y

This is the shortcut approximation recommended for calculating relative effi

ciency. Note that here x and y are number-right scores with the same percentile
rank in some group of examinees (as determined by equipercentile equating); fx
and fy are relative frequencies for the same group or for equivalent groups ( fx
= fy = 1). Note also that test x and test y must be measures of the same ability
or trait.
1

2
3
6

5 4
7

0 .20 .40 TRUE SCORE .60 .80 100

FIG. 6.8.1. Estimated true-score distribution for the sixth-grade data for STEP (1), MAT (2), CAT
(3), ITBS (4), Stanford (5), CTBS (6), and SRA (7).

95
96 6. THE RELATIVE EFFICIENCY OF TWO TESTS

Equation (6-20) will work best with a large sample of examinees, perhaps
several thousand. If the sample size is smaller, the equipercentile equating of x
and y will be irregular because of local irregularities in fx and fv. This can be
overcome by smoothing distributions fx and fy. Smaller samples can then be
used, but at some cost in labor.
In order to investigate the adequacy of (6-20), the relative efficiencies of the
vocabulary sections of seven nationally known reading tests were approximated
by formula (6-20) and also by the computer program (Stocking et al., 1973)
described in Section 6.7. For each test, a carefully selected representative na
tional sample of 10,000 or more sixth graders from the Anchor Test Study
(Loret, Seder, Bianchini, & Vale, 1974) supplies the frequency distribution of
number-right vocabulary score needed for the two methods. The number of items
per vocabulary section ranges from n = 30 through n = 50.
Figure 6.8.1 shows the true-score distributions for the seven vocabulary tests
as estimated by the method of Chapter 16. These are the p(ξ) and q( ) used in
(6-14) to obtain the smooth curves in Fig. 6.9.1-6.9.6. As already noted, a test
in general tends to be less efficient where the true scores pile up and more
efficient where the true scores are spread out.

6.9. RELATIVE EFFICIENCY OF SEVEN SIXTH-GRADE

VOCABULARY TESTS1

Figures 6.9.1 to 6.9.6 show the efficiency curves for six of the tests relative to
the Metropolitan Reading Tests (1970), Intermediate Level, Form F, Word
Analysis subtest (MAT). The smooth curves are obtained from (6-14); the broken
lines are obtained from (6-20), after grouping together adjacent pairs of raw
scores in order to reduce zigzags due to sampling fluctuations.
Although (6-20) gives only approximate results, the approximation is seen to
be quite adequate for many purposes. Rough calculations using (6-20) can be
conveniently made under circumstances not permitting the use of an elaborate
computer program.
Figure 6.9.1 shows the relative efficiency of STEP (Sequential Tests of Edu
cational Progress) Series II (1969), Level 4, Form A, Reading subtest. STEP is
more efficient than MAT for the bottom fifth or sixth of the pupils and less
efficient for the rest of the students. Between the fortieth and eightieth percen
tiles, STEP would have to be tripled in length in order to be as effective as MAT.
STEP (n y = 30) is actually three-fifths as long as MAT (n x = 50), as shown by

1
This section is revised and printed with special permission from F. M. Lord, Quick estimates of
the relative efficiency of two tests as a function of ability level. Journal of Educational Measure
ment, Winter 1974, 11, No. 4, 247-254. Figures 6.9.1, 6.9.6, and Table 6.9.1 are taken from the
same source. Copyright 1974, National Council on Measurement in Education, Inc., East Lansing,
Mich.
6.3
4.0
2.5
I .6
EFFICIENCY
1.0
RELATIVE
.63

NY
NX
.40
.25
16