Developing and Validating Test Items
Since test items are the building blocks of any test, learning how to develop and validate test
items has always been critical to the teaching–learning process. As they grow in importance and
use, testing programs increasingly supplement the use of selected-response (multiple-choice)
items with constructed-response formats. This trend is expected to continue. As a result, a new
item-writing book is needed, one that provides comprehensive coverage of both types of items
and of the validity theory underlying them.
This book is an outgrowth of the co-author’s previous book, Developing and Validating
Multiple-Choice Test Items, 3rd Edition (Haladyna, 2004). That book achieved distinction as the
leading source of guidance on creating and validating selected-response test items. As with its
predecessor, the content of this new book is based both on an extensive review of the literature and on its authors' long experience in the testing field. It is very timely in this era of burgeoning
testing programs, especially when these items are delivered in a computer-based environment.
Key features include:
Comprehensive and Flexible—No other book so thoroughly covers the field of test item devel-
opment and its various applications.
Based on Theory and Research—A comprehensive review and synthesis of existing research
runs throughout the book and complements the expertise of its authors.
By Thomas M. Haladyna and Michael C. Rodriguez
First published 2013
by Routledge
711 Third Avenue, New York, NY 10017
Simultaneously published in the UK
by Routledge
2 Park Square, Milton Park, Abingdon, Oxon OX14 4RN
Routledge is an imprint of the Taylor & Francis Group, an informa business
© 2013 Taylor & Francis
The right of Thomas M. Haladyna and Michael C. Rodriguez to be identified
as authors of this work has been asserted by them in accordance with sections
77 and 78 of the Copyright, Designs and Patents Act 1988.
All rights reserved. No part of this book may be reprinted or reproduced
or utilised in any form or by any electronic, mechanical, or other means,
now known or hereafter invented, including photocopying and recording,
or in any information storage or retrieval system, without permission in
writing from the publishers.
Trademark notice: Product or corporate names may be trademarks or
registered trademarks, and are used only for identification and
explanation without intent to infringe.
Library of Congress Cataloging in Publication Data
Haladyna, Thomas M., author.
Developing and validating test items / Thomas M. Haladyna,
Michael C. Rodriguez.
pages cm
Includes bibliographical references and index.
1. Educational tests and measurements—Design and construction.
I. Rodriguez, Michael C. II. Title.
LB3051.H297 2013
371.26—dc23
2012037236
Preface
Although the scholarly study of item development has been ongoing for quite some time, critics
have often noted that this study does not match the effort we give to statistical theories and meth-
ods in testing. This book documents the progress we have made in the science of item develop-
ment but, at the same time, issues warnings and offers suggestions about future efforts.
Our goal is to provide readers with a comprehensive, authoritative volume on how to develop
all kinds of test items and how to ensure that when an item is used in a test it performs as it
should. We refer to the process of getting an item ready for a test as item validation.
This book is the product of a long collaboration that has evolved over many years.
Earlier versions of this book were mainly aimed at the selected-response formats. This book examines constructed-response formats and also includes some special topics that seem justified by their popularity and importance in testing. We have attempted to provide up-to-date information from the most authoritative sources while also drawing from our collective experiences.
The book is organized into six parts. The first part presents foundation information about
validity, the item development process, the challenging problem of defining item content and
cognitive demand, and the equally difficult choice of item formats. The second part deals exclu-
sively with the selected-response format, with the exception of chapter 9. This chapter was added
to provide guidance on the development of survey items. The third part is complementary to the
second part. This part offers a variety of constructed-response formats, guidelines, and informa-
tion on scoring. Part IV deals with three unique areas. Each chapter addresses the problems,
research, and approaches to measurement in one of these areas. The areas are writing, credential-
ing and exceptionalities. We could have expanded this part of the book, as we think these three
are very important and challenging for item and test developers. Each chapter provides useful
information and recommendations for both research and more effective item development.
Part V deals with item validation. The four chapters are intended to be complementary, but
some overlap is intentional because of the importance of procedures and statistical study of item
development and item responses. Part VI has a single chapter—a prospective appraisal of where
we are and where we need to go to advance the science of item development and validation.
We hope this volume meets your needs, and future editions will only expand on these efforts.
T. Haladyna
M. Rodriguez
Acknowledgments
We thank the many students in our advising roles, colleagues in our research roles, and clients in
our consulting roles over the years who have challenged our thinking on test item development
and validation.
We also would like to thank several students at the University of Minnesota who contrib-
uted to the search for example items throughout the book. They include Anthony Albano, Anica
Bowe, Okan Bulut, Julio Cabrera, Danielle Dupuis, Yoo Jeong Jang, Brandon LeBeau, Amanuel
Medhanie, Mario Moreno, Jose Palma, Mao Thao, Luke Stanke, and Yi (Kory) Vu.
I
A Foundation for Developing and
Validating Test Items
Part I covers four important, interrelated concerns in item development and validation.
This first chapter provides definitions of basic terms and distinctions useful in identifying
what is going to be measured. The first chapter also discusses validity and the validation proc-
ess as it applies to item development. The second chapter presents the essential steps in item
development and validation. The third chapter presents information on the role of content and
cognitive demand in item development and validation. The fourth chapter presents a taxonomy
of selected-response (SR) and constructed-response (CR) test item formats for certain types of content and
cognitive demands.
1
The Role of Validity in Item Development
Overview
This chapter provides a conceptual basis for understanding the important role of validity in item
development. First, basic terms are defined. Then the content of tests is differentiated. An argu-
ment-based approach to validity is presented that is consistent with current validity theory. The
item development process and item validation are two related steps that are integral to item
validity. The concept of item validity is applied throughout all chapters of this book.
A test item is the basic unit of observation in any test. The most fundamental distinction for the
test item is whether the test taker chooses an answer (selected-response: SR) or creates an answer
(constructed-response: CR). The SR format is often known as multiple-choice. The CR format
also has many other names including open-ended, performance, authentic, and completion. This
SR–CR distinction is the basis for the organization of chapters in this book. The response to any
SR or CR item is scorable. Some items can be scored dichotomously, one for right and zero for
wrong, or polytomously using a rating scale or some graded series of responses. Refined distinc-
tions in item formats are presented in greater detail in chapter 4.
Thorndike (1967) advised item and test developers that the more effort we put into building
better test items, the better the test is likely to be. To phrase it in terms of validity: the greater the effort expended to improve the quality of test items in the item bank, the greater the degree of validity
we are likely to attain. As item development is a major step in test development, validity can be
greatly affected by a sound, comprehensive effort to develop and validate test items.
Toward that end, we should develop each test item to represent a single type of content and a
single type of cognitive behavior as accurately as is humanly possible. For a test item to measure
multiple types of content and cognitive behavior goes well beyond our ability to understand the meaning of a test taker's response to such an item.
To each construct there corresponds a set of operations involved in its scientific use. To
know these operations is to understand the construct as fully as science requires; without
knowing them, we do not know what the scientific meaning of the construct is, not even
whether it has scientific meaning. (Kaplan, 1963, p. 40)
With an operational definition, we have no surplus meaning or confusion about the construct. We
can be very precise in the measurement of an operationally defined construct. We can eliminate or
reduce random or systematic error when measuring any operationally defined construct. Instances
of operationally defined constructs include time, volume, distance, height, speed, and weight. Each
can be measured with great precision because the definition of each of these constructs is specific
enough. Test development for any construct that is operationally defined is usually very easy.
However, many constructs in education and psychology are not amenable to operational defini-
tion. Validity theorists advise that the alternative strategy is one of defining and validating constructs.
By doing so, we recognize that the construct is too complex to define operationally (Cronbach &
Meehl, 1955; Kane 2006b; Kaplan, 1963; Messick, 1989). As previously noted, constructs include
reading and writing. Also, each profession or specialty in life is a construct. For example, baseball ability, financial analysis, quilt-making, and dentistry are constructs that have usefulness in society. Each construct is very complex. Each requires the use of knowledge and skills in complex ways. Often we can conceive of a construct as a domain of tasks to be performed.
For every construct, we can identify some aspects that can be operationally defined. For
instance, in writing, we have spelling, punctuation, and grammatical usage that is operationally
defined and easily measured. In mathematics, computation can be operationally defined. In most
professions, we can identify sets of tasks that are either performed or not performed. Each of
these tasks is operationally defined. However, these examples of operational definition within a
construct represent the minority of tasks that comprise the construct. We are still limited to con-
struct measurement and the problems it brings due to the construct’s complexity and the need
for expert judgment to evaluate performance.
Because constructs are complex and abstractly defined, we employ a strategy known as construct
validation. This investigative process is discussed later in this chapter and used throughout this
book. The investigation involves many important steps, and it leads to a conclusion about validity.
In light of Table 1.1, some subtleties exist that are useful later when observing how test takers
respond to test items. If we have a change in cognitive behavior that we can attribute to teaching
or training, then we might infer that achievement has occurred. Factors that are not relevant to teaching or learning, such as cheating, may also account for changes. Items will reflect change due to learning in a pre-to-post comparison. If a student lacks an instructional history for some
domain of content or some ability, then lack of instruction is the inference to make regarding
the item’s performance. That is, the item will perform as if it is invalid, when, in fact, the testing
situation is inappropriate. If a student has received reasonable instruction and fails to perform as
anticipated or hoped for, then something else has to account for that level of performance. What
is probably accounting for test performance is not achievement but intelligence or lack of motiva-
tion to respond. Thus, the role of instruction or training and instructional history is an important
consideration in deciding whether a test or test item reflects achievement or intelligence.
Validity
The most important concern in this book and for any test score or test item response interpreta-
tion is validity. Throughout this book and in each chapter, the essential concepts and principles
underlying validity are woven into presentations. Test items are developed for use in tests, but
every item is also subject to an evaluation of its validity. As we will see, the development of test
items is an integral part of an argument for item validity. This idea will be made clearer in the
next section of this chapter.
The Standards for Educational and Psychological Testing (1999) state that validity is “the degree
to which evidence and theory support the interpretations of test scores entailed by proposed uses
of tests” (American Educational Research Association, American Psychological Association, &
National Council on Measurement in Education, 1999, p. 9). We will refer many times through-
out the book to this resource, hereinafter referred to as the testing Standards (AERA, APA, &
NCME, 1999). What process do we follow to enable us to assess the validity of a test score inter-
pretation or use, or, in this book, of an item response?
In this chapter, we will employ an argument-based approach to validity advocated by Kane
(1992; 2002; 2006a, 2006b) and, recently, successfully applied in a national testing program
(Chapelle, Enright, & Jamieson, 2010). This argument-based approach accomplishes valida-
tion without a necessity for construct definition, which has been problematic in the past. The
argument-based approach is not without precedent. Theorists have argued that validity should
be conceptualized in the form of plausible arguments and challenges to those arguments (Cronbach,
1971, 1988; Messick, 1989). The idea of construct validation is based on inferences and assump-
tions presented in a logical framework.
Cronbach and Meehl (1955) provided three steps in testing: the definition of the construct, the
explication of the construct—this means test development—and the validation. Kane (2006b)
prefers a two-stage process: the development stage and the appraisal stage. With the argument-
based approach we will highlight some major features of validity and validation as it currently
exists, and in the final section of this chapter apply it to item development and validation.
Two Types of Constructs
The idea of two ways to approach measuring a construct can be traced to Messick (1994). Each
construct type has a logical rationale and a set of procedures that enable validation. Both include
domains from which a test is a representative sample.
The first of these two constructs is more traditional. It is a domain of knowledge and skills
that is usually represented by a set of instructional objectives. Messick uses the term construct-
referenced. A visit to any state education department website will reveal a set of instructional
objectives organized by subject matter domains. These domains are the targets of instruction.
The cognitive demand of the knowledge and skills is not often defined and used. Commonly,
some tasks to be performed are more complex than knowledge and skills and resemble some-
thing else—a cognitive ability. This kind of domain has its origin in behaviorism and the crite-
rion-referenced test movement so notable in the 1970s and later (Roid & Haladyna, 1982).
The second, less traditional, is a domain of tasks representing a developing cognitive ability,
such as reading or writing. Messick uses the term task-driven for this type of construct. Writ-
ing SMEs will assert that the only way to measure writing is to obtain a sample of one’s writing
performance. Therefore, a domain of writing tasks might comprise the basis for testing of one’s
writing ability. The distinction between these two types of constructs is treated more fully in
chapter 3. The main idea presented here is that the second type of construct is emerging due to
the efforts of cognitive psychologists and others who are promoting authentic assessment. This
second type of construct is preferred to the first type because it focuses on complex learning that
is often overlooked with the use of the first type of construct. Both Messick (1994) and Lane and
Stone (2006) articulate the rationale and progress needed to promote the valid use of test scores
from this second type of construct.
Target Domain
Whether the construct is conceptualized as a domain of knowledge and skills or a domain of tasks
representing a cognitive ability, we have a hypothetical domain of tasks to be performed. A simple
example of a domain is what a first-year dental student must learn—the universal coding system
for teeth. We have 32 teeth in the human dentition and 20 in the primary dentition. Given the tooth
number or letter, the dental student must name the tooth. Given the name of the tooth, the dental
student must give the identifying number. With 52 teeth and two variations, the domain has 104
different behaviors. This domain is a very simple target domain. A test of this target domain might
entail 20 or 30 items that adequately sample from this domain. A target domain for
dentistry is quite complex but readily identifiable. For instance, consider the 63 competencies
required in a domain of knowledge and skills for dental licensing (Kramer & Neumann, 2003).
This is only one part of a three-part licensing requirement. These competencies are complicated by
the fact that each dental patient differs in many ways, such as age, patient problem, complications,
and emergency situations. Table 1.2 shows examples of competencies for patient treatment in
dentistry. The target domain for dentistry is organized around the following major competencies:
practice and profession, patient management, diagnosis and treatment planning, treatment.
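To make the arithmetic of the simple tooth-naming domain concrete, here is a minimal sketch, in Python, of enumerating the 104 identification behaviors and drawing a 30-item sample. The task labels and the simple random sampling are illustrative assumptions rather than a prescription for test design; the sketch assumes the universal system's numbering (1 to 32 for permanent teeth) and lettering (A to T for primary teeth).

```python
import random

# Hypothetical enumeration of the tooth-identification target domain:
# 32 permanent teeth (numbered 1-32) and 20 primary teeth (lettered A-T),
# each tested in two directions (name given the code, code given the name).
permanent = [str(n) for n in range(1, 33)]                  # 32 codes
primary = [chr(c) for c in range(ord("A"), ord("T") + 1)]   # 20 codes
codes = permanent + primary                                  # 52 teeth in all

domain = []
for code in codes:
    domain.append(("name_tooth_given_code", code))   # e.g., name tooth 8
    domain.append(("give_code_given_name", code))    # e.g., give the code for ...

assert len(domain) == 104  # 52 teeth x 2 task variations

# A 30-item test form drawn as a simple random sample from the domain.
random.seed(2013)
test_form = random.sample(domain, k=30)
for task_type, code in test_form[:5]:
    print(task_type, code)
```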
So a target domain contains many tasks to be performed that should be grouped or organized
using a taxonomy. These tasks cannot be realistically performed in a testing situation. As we can
see, the term target is appropriate for helping us understand the meaning of the construct. A
target domain is a heuristic device. Although the target domain may not provide realistic tasks to
test, organizing it into a framework of content and cognitive demand that best reflects the collec-
tive judgments of SMEs is important. This framework should be our set of item and test specifica-
tions. This document includes a test blueprint, or two-way grid, which will guide us in test design.
Although the target domain can be abstract in nature, hypothesizing that a target score exists is
useful. It would be the score received if a student or candidate for certification were administered
all of the tasks in this domain. In the example of writing, the target score would entail a lifetime
of writing assignments assembled and scored accurately to achieve the target score—quite unre-
alistic. In dentistry, similarly, a candidate for licensure would have to show proficiency in all 63
competencies to achieve a target score. This is possible but very impractical. The target score has
some implications for validity as we move to the universe of generalization (see Table 1.2).
Table 1.2 Target Domain and Universe of Generalization for Example Professional Competencies in Dentistry
38. Anticipate, diagnose, and provide initial treatment and follow-up management for medical emergencies that may
occur during dental treatment.
39. Perform basic cardiac life support.
40. Recognize and manage acute pain, hemorrhage, trauma, and infection of the orofacial complex.
41. Manage patients with pain and anxiety by the use of non-pharmacological methods.
42. Select and administer or prescribe pharmacological agents in the treatment of dental patients.
43. Anticipate, prevent, and manage complications arising from the use of therapeutic and pharmacological agents
employed in patient care.
44. Provide patient education to maximize oral health.
45. Manage preventive oral health procedures.
Source: http://www.jdentaled.org/content/68/7/742.full.pdf
Universe of Generalization
Realistically, with the help of its SMEs, a test developer can create a domain of tasks that can be
performed by test takers practically in a testing situation. For educational achievement testing
programs, as proposed in any state's content standards, these tasks should have a considerably close connection to the target domain.
Fidelity is the closeness of any task in the universe of generalization to the target domain
(Loevinger, 1957). In other words, fidelity is the resemblance of tasks in the universe of generali-
zation to the tasks in the target domain. The judgment of fidelity is best made by SMEs. We have a
very well-developed technology for matching tasks in the universe of generalization to the target
domain. See Raymond and Neustel (2006) for examples dealing with professional competence
type tests. See Webb (2006) for examples dealing with school achievement test content.
Another important characteristic of this universe of generalization is its organizational scheme.
It should resemble or be identical to the target domain. For example, recently a research study
on states’ writing testing programs necessitated the development of a taxonomy of prompt types
(Jeffery, 2009). Table 1.3 shows several tasks from a hypothetical target domain, and on the right
side of the table a taxonomy of prompt types is given as an organizing structure for test develop-
ment. Chapter 13 provides more discussion on how writing might be conceived in the framework
of argument-based validity.
Referring to Table 1.2, the target domain consisting of 63 dental competencies is also the basis
for the universe of generalization. By this fact, the fidelity of subsequent test specifications and
tests is very high. The challenge to test developers is to create a universe of generalization consist-
ing of testable tasks that accurately reflect these 63 competencies.
As with the concept of the target score, we have a universe score. This score is one achieved
by a targeted test taker who is administered all items in the universe of generalization. It is what
we know in classical test theory as the true score. However, the true score is not really true unless
the target domain and target score have a great deal of fidelity with the universe of generalization
and the universe score. Table 1.4 simplifies this discussion into the constituent elements in this
content-related validity evidence assumption.
Table 1.4 Validity Features as Connected by the Target, Universe, and Sample
Validity feature | Target | Universe | Sample
Tasks | Hypothetical set of tasks | Testable domain of tasks | Test
Domain definition | Target domain | Universe of generalization | Sample from the universe of generalization
Score type | Target score | Universe score | Test score
With any test score, we have several inferences affecting validity: the test score adequately reflects the universe of generalization, and the universe of generalization has high fidelity with the target domain. In effect, the target score, universe score, and test score should be perfectly
correlated.
In effect, the interpretive argument is the blueprint for gathering validity evidence that addresses
questions, assumptions, and issues that affect the assessment of validity. The validity argument
supplies answers to the questions that are subject to an assessment by a validator—a critical judg-
ment about the degree of validity. Because validation is an exercise designed to confirm validity,
this approach supports validity.
However, Cronbach (1988) and other validity theorists have also argued that we should
examine threats to validity and weak links in the chain of inferences (Crooks, Kane, & Cohen,
1996). For instance, if all indicators for validity are sound, but reliability is very low, the valida-
tion fails. We have several good reasons for seeking evidence that may be disconfirming of valid-
ity. First, validation is a search for truth. One needs to examine forces supporting and refuting
validity. Second, discovering threats to validity can only improve validity because subsequent
remedial action can eliminate or reduce each threat to validity. So those engaged in validation
should always seek validity evidence that might undermine validity. By doing so, a great service
is afforded any testing program. In this book, the focus of validity and validation is both with
test scores and item responses, simply because we interpret and use item responses just as we
interpret and use test scores. Because items and item responses are subunits of tests and test
scores, validity is also important for both item responses and test scores. The validity evidence
we gather to support interpreting an item response is also part of the validity evidence we use
to support the interpretation of a test score. However, some validity evidence may support a
counterargument—that validity is not attained. Toward the end of looking for evidence that
may undermine validity, Messick (1989) identified two major sources of threats to validity that
should be considered.
evidence is fundamental to the meaning of a test score. The target domain may (a) misrepresent, (b) overrepresent, or (c) underrepresent the construct. The difference in faithful representation
is based on the collective, consensus judgment of SMEs.
First, evidence should be assembled showing that SMEs have developed an adequate target
domain. A reconciliation is needed for the fidelity of the target domain to the fundamental quali-
ties of the construct. For instance, with writing, if we had a domain of writing tasks comprising
our target domain, how closely does this domain resemble writing as it exists in our society?
Second, evidence should be assembled showing that SMEs have developed an adequate organiza-
tion or structure for tasks. The tasks in the target domain should be arranged in a taxonomy, but
the job is not finished. When the universe of generalization is developed displaying the test tasks,
the SMEs must determine how much correspondence exists between the content of the target
domain and the content of the universe of generalization. In the instances of writing as shown in
Table 1.3, what percentage of representation does each prompt mode have in the assessment of
any student’s writing? Documentation of correspondence of test content to construct content is
critical. To put this in the language of the argument-based approach and current view of valid-
ity, the target domain reflects the criterion tasks to be performed. Given that such performance
testing is unrealistic, the universe of generalization assumes the ability to simulate the tasks in the
target domain. The fidelity of tasks in the universe of generalization and the target domain is a
critical feature of construct representation.
Validation
As we know, validation is an investigative process that has several important steps: the develop-
ment of our interpretive argument and the development of the validity argument. The test pro-
vides a test score that is an estimate of the universe score. The test score, universe score, and
target score should be perfectly correlated if our content-related validity evidence for the test is in
order. Thus, we assemble evidence to support many issues that affect the interpretation and use
of test scores, and, in this volume, item responses. The assembling of evidence is very systematic.
The body of evidence must be assessed to support the claim for validity or the evidence may have
weak links that cast doubt about validity.
In this final section of this chapter we apply the concepts and principles of validation discussed
previously to item validation. For an item response to be valid, we have a set of assumptions as
questions. If the conditions represented by each question can be documented as having been
met, both the interpretive and validity arguments for item validation can be satisfied.
As noted previously, the 63 competencies for dentistry are organized in a hierarchy: practice and profes-
sion, patient management, diagnosis and treatment planning, and treatment (Kramer & Neu-
mann, 2003). Writing can be thought of as existing in six distinct prompt modes. Chapter 3 pro-
vides more information about this type of item validity information. Kane (2006a) discusses con-
tent-related validity evidence as a major concern in test development. Organizing target domains
is a vital step in achieving a body of content-related validity evidence.
4. How much fidelity is there between the target domain and the universe of
generalization?
With item validation, we seek the consensus judgment of SMEs that the universe of generaliza-
tion that contains our item bank has high fidelity with the tasks in the target domain. We are very
concerned with the degree of judged fidelity. As noted previously, with the dentistry competen-
cies, fidelity is very good. With the writing domain, the fidelity of the six prompt modes to the set
of writing tasks employed in our society is unknown. The proposed taxonomy of prompt modes
may reflect common practices in the United States, but there is no reference to a target domain
of writing tasks widely practiced throughout this nation. Chapter 3 presents more information
about this type of item validity evidence.
extinct or useless and require new items. Therefore, a periodic review of items by SMEs is neces-
sary to refresh the item pool.
Although the item may appear to be too easy, the item may represent important content expected
to be learned. Thus, the SME wants the item to stay in the test. Or even though an item is too dif-
ficult and fails to discriminate, it stays because it tests something important that is not taught very
well or is not learned by candidates for licensure. We support the judgment of the SME panel, but
with the proviso that psychometric criteria are revealed to this panel and considered. Using solely
psychometric criteria to decide the future of the item may improve reliability but at the expense
of content-related validity evidence.
Technical Documentation
This chapter has focused on the need to build an interpretive argument and satisfy a validity
argument for item validity. A primary device for documenting this validity evidence is the tech-
nical report for any testing program. Thus, it is strongly recommended that the periodic techni-
cal report provide all the validity evidence suggested in Table 1.5. This advice is consistent with
many testing experts who have consistently advocated documenting validity evidence responsi-
bly (Becker & Pomplun, 2006; Ferrara, 2006; Haladyna, 2002b).
Summary
In this first chapter, test item and test were defined. A major theme throughout this chapter
has been the role that validity plays in making test score interpretations and uses as truthful as
possible. As test items are the essential building blocks of a test, validating item response inter-
pretations and uses is just as appropriate as validating test score interpretations and uses. An
argument-based approach is used. Many propositions were provided that comprise essential
steps in item development. These 16 questions support an interpretive argument for item vali-
dation. The validity argument requires that evidence be collected and organized. Such evidence
might appear in a report concerning item quality or as part of a technical report. As we docu-
ment these steps, validity evidence is displayed in our technical report, and, by that, validity is
improved. Weaknesses in item validation can be very deleterious to test score validity, which is
why considerable attention should be given to item validation.
2
Developing the Test Item
Overview
This chapter presents the essential steps required to develop any test item. The chapter is intended
for those who are interested in developing a validated item bank for a testing program. Whereas
subsequent chapters deal with developing items for different types of item formats, this chapter
is intended to outline the steps involved in developing any test item.
Planning
At the beginning of item development for a new testing program, the item bank will be empty. At
this point, a plan should be created for filling the bank. Even for existing testing programs with
extant items, a plan is always useful. The plan will include many procedures and issues, which
include the number of items needed, the types of item formats to be used, the rationale for using
these item formats, the type of content and cognitive demand intended, the personnel responsi-
ble for item development and validation, and, most important, a schedule.
For selected-response (SR) items, a rule of thumb is that the item bank should be 2.5 times the
size of a test. This number is purely a subjective value that is based on opinion and experience.
For a 100-item test, a reasonable goal is to validate 250 items that match your needs as specified
in your item and test specifications. If multiple test forms are being used annually, the number of
items in the bank should be proportionately higher.
For constructed-response (CR) items, the determination of how many items are needed is very
difficult to ascertain because CR tests vary considerably in the number of items. This number will
be governed by the type of test for which items are being developed. Many subsequent chapters
are devoted to developing CR items with unique qualities, so better guidance is found in those
chapters.
Because many tests appear in an SR format, we will limit discussion to this format, but the steps
apply equally to CR formats as well.
The inventory is a private planning document shared between test developers and the test
sponsor. The inventory resembles the item and test specifications (which includes the test blue-
print) but shows the supply of validated items available for test construction. The inventory will
also reveal what content and cognitive demand categories are deficient with respect to the desired
levels of items in each cell of the test specifications.
Table 2.1 shows a test blueprint (also known as a two-way grid) for topics and cognitive demand
that shows the ideal inventory for a hypothetical item bank. This is a 100-item imaginary certifi-
cation test for professional meal planners.
Table 2.1 Ideal Number of Items in the Item Bank
Topics % Coverage of topics Knowledge Skill Ability Desired (by topic)
Basic principles 20% 15 15 20 50
Planning 25% 19 19 25 63
Preparation 20% 15 15 20 50
Presentation 20% 15 15 20 50
Clean-up 15% 11 11 15 37
Desired (by cognitive demand) 100% 30% 30% 40% 250 (100%)
As shown in Table 2.1, an ideal number of validated items in this bank is 250. The number
of items in each cell shows the proportion of items desired. The percentages show the emphasis
for each topic and cognitive demand. These percentages were recommended by subject-matter
experts (SMEs) who examined the results of a practice analysis.
Table 2.2 shows the actual number of items in the item bank, based on the structure of the same
test blueprint (two-way table). For the first topic, basic principles, we only have 40 items and need
10 more. However, we need three more knowledge items and nine more skill items, and we have
an excess of ability items (which require the use of knowledge and skills).
Table 2.2 Actual Number of Items in Item Bank
Topics % Coverage Knowledge Skill Ability Total
Basic principles 20% 12 6 22 40
Planning 25% 50 12 28 90
Preparation 20% 11 22 10 43
Presentation 20% 13 11 31 55
Clean-up 15% 9 12 9 30
Total 100% 95 63 100 258
A periodic inventory is easy to complete and provides a very accurate account of how many
validated items are available for test design. Moreover, in planning for future item development,
assignments to SMEs should be made with the inventory results in mind. That is, if a meal prepa-
ration expert is especially highly qualified in meal presentation, that specialist should be assigned
to write items in that category. As Tables 2.1 and 2.2 show, knowledge and skill items are needed,
and no ability items are needed. In planning for future item development and validation, these
two tables provide excellent guidance.
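Because the inventory in Tables 2.1 and 2.2 is simple bookkeeping over the blueprint, it is easy to automate. The following sketch uses hypothetical data structures that mirror the two tables; it derives the ideal cell counts from the blueprint percentages and the 2.5 rule of thumb and then reports the shortfall in each cell. It illustrates the arithmetic only and is not part of the authors' procedures.

```python
TEST_LENGTH = 100
BANK_FACTOR = 2.5                              # rule-of-thumb multiplier from the planning discussion
BANK_SIZE = int(TEST_LENGTH * BANK_FACTOR)     # 250 validated items desired

# Blueprint weights (hypothetical structures mirroring Table 2.1).
topic_weights = {"Basic principles": 0.20, "Planning": 0.25, "Preparation": 0.20,
                 "Presentation": 0.20, "Clean-up": 0.15}
demand_weights = {"Knowledge": 0.30, "Skill": 0.30, "Ability": 0.40}

# Actual counts in the bank (mirroring Table 2.2).
actual = {
    "Basic principles": {"Knowledge": 12, "Skill": 6,  "Ability": 22},
    "Planning":         {"Knowledge": 50, "Skill": 12, "Ability": 28},
    "Preparation":      {"Knowledge": 11, "Skill": 22, "Ability": 10},
    "Presentation":     {"Knowledge": 13, "Skill": 11, "Ability": 31},
    "Clean-up":         {"Knowledge": 9,  "Skill": 12, "Ability": 9},
}

for topic, t_w in topic_weights.items():
    for demand, d_w in demand_weights.items():
        ideal = round(BANK_SIZE * t_w * d_w)      # e.g., 250 x .20 x .30 = 15
        shortfall = ideal - actual[topic][demand]
        if shortfall > 0:
            print(f"{topic} / {demand}: need {shortfall} more item(s)")
```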
Item Bank
One aspect of planning is to decide where test items are kept. The item bank is a modern, indis-
pensable tool for housing test items and associated information about the performance of the
item with test takers. The most informative and complete discussion of item banking can be
found in Vale (2006). The history and technology supporting computerized item banking is both
interesting and useful. However, item banking is a developing technology and new and more
effective products are constantly being released for public use. Many test companies have propri-
etary item banking systems that are not available to the public.
The item bank has two major functions. One, it should keep the validated item in camera-ready
format ready for placement on the test. Two, the item bank should contain a history of the item
including its difficulty (p-value), discrimination index, and other relevant information. If the item
is in a SR format, it should have a frequency of response for each option. If performance and a rat-
ing scale are used to score a response, it should have a frequency for each rating category. If more
sophisticated scaling methods are used, such as those involving item response theory, appropriate
statistics from the various item response theory models should also be recorded in the item bank.
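The classical statistics mentioned above are straightforward to compute from raw response records. The sketch below is a hypothetical illustration: it computes an item's difficulty (p-value), one common discrimination index (the point-biserial correlation with the total test score), and the frequency of each chosen option for a single SR item keyed "B". The response data and helper function are invented for the example.

```python
from collections import Counter
from statistics import mean, pstdev

# Hypothetical response records for one four-option SR item keyed "B":
# each tuple is (option chosen, total test score for that test taker).
responses = [("B", 34), ("A", 21), ("B", 40), ("C", 25), ("B", 31),
             ("D", 18), ("B", 37), ("A", 24), ("B", 29), ("C", 22)]
KEY = "B"

item_scores = [1 if choice == KEY else 0 for choice, _ in responses]
total_scores = [total for _, total in responses]

# Difficulty (p-value): proportion of test takers answering correctly.
p_value = mean(item_scores)

# Discrimination: point-biserial correlation between the 0/1 item score
# and the total test score.
def point_biserial(scores, totals):
    sd = pstdev(totals)
    if sd == 0:
        return 0.0
    m1 = mean(t for s, t in zip(scores, totals) if s == 1)
    m0 = mean(t for s, t in zip(scores, totals) if s == 0)
    p = mean(scores)
    return (m1 - m0) / sd * (p * (1 - p)) ** 0.5

# Option frequencies: how often each response option was chosen.
option_counts = Counter(choice for choice, _ in responses)

print(f"p-value: {p_value:.2f}")
print(f"point-biserial: {point_biserial(item_scores, total_scores):.2f}")
print("option frequencies:", dict(option_counts))
```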
Of particular interest is a typology of items that Vale (2006) has introduced. He conceives of
test items in a social order. The value of using a social order is to avoid the placement of items on
a test in inappropriate ways that may threaten validity. A brief sketch of how such relationships might be checked follows the list below.
• Friends are items that must appear together because of some similarity. For example, in a
subsection of a mathematics test, we might enjoy presenting items on geometry as a set.
• Close friends. In some item formats, items must be grouped together due to the depend-
ence on a stimulus, such as in reading comprehension where a passage is presented before
the test items. This type of format has been referred to as an item set or testlet (Haladyna,
2004).
• Snobs. Some items must appear in a specific way in proximity to other items. There are no
exceptions with snob items. Vale uses the example of punctuation in a series of sentences
comprising a paragraph. If the items were reordered and did not conform to the order of
the example, the items would confuse test takers.
• Dependents. Some items need supporting material. A typical example would be the item
set, where each item cannot be answered unless a passage, vignette, photograph, or another
stimulus precedes the set of items. These items are truly dependent.
• Supporters. These items have no interaction with other items but support other items. We
might conceptualize supporters to be critical features of an item set on problem-solving
where one item is NOT a cue for another item.
• Antagonists. Any item that presents a cue for another item is an antagonist. Chapter 6
presents information on different types of cuing for SR items.
• Enemies are simply items that cannot appear on the same test. These might be items testing
the exact content.
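To show how such a social order might be used operationally, here is a minimal sketch in which a few of Vale's relationship types are recorded for hypothetical item identifiers and a draft test form is checked against them. The data structure and function are illustrative assumptions, not Vale's (2006) implementation.

```python
# Hypothetical relationship records for a small bank: each entry names a
# relationship from Vale's social order and the item IDs it involves.
relationships = [
    {"type": "enemies",     "items": {"GEO-014", "GEO-087"}},   # test the same content
    {"type": "dependents",  "items": {"RC-201", "RC-202", "RC-203"},
     "requires": "PASSAGE-12"},                                  # item set needs its passage
    {"type": "antagonists", "items": {"ALG-033", "ALG-051"}},    # one item cues the other
]

def check_form(form_items, form_stimuli, relationships):
    """Flag relationship violations in a proposed test form (a rough sketch)."""
    problems = []
    selected = set(form_items)
    for rel in relationships:
        on_form = rel["items"] & selected
        if rel["type"] in ("enemies", "antagonists") and len(on_form) > 1:
            problems.append(f"{rel['type']}: {sorted(on_form)} cannot appear together")
        if rel["type"] == "dependents" and on_form and rel["requires"] not in form_stimuli:
            problems.append(f"dependents: {sorted(on_form)} need stimulus {rel['requires']}")
    return problems

# Example: a draft form that violates the enemies rule and omits a passage.
draft = ["GEO-014", "GEO-087", "RC-201", "ALG-033"]
for issue in check_form(draft, form_stimuli=set(), relationships=relationships):
    print(issue)
```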
Table 2.3 lists properties of items that an item bank might contain. The decision of which proper-
ties are appropriate is based on many factors, such as type of test, types of test items, varieties of
item formats, and type of scaling use (featuring item response theory or other methods).
As noted previously in this chapter, choosing software for item banking can be challenging. We have com-
mercially available products and test companies have proprietary item banking software that is
not publicly available. If you have contracted with a test company, each company can describe
how they bank items and their item-banking capabilities. If you have to bank your items with-
out the benefit of a test company, your only option is to purchase one of several commercially
available item banking systems or create your own homemade system. Table 2.4 provides a list
of commercially available item-banking systems. A homemade item banking system
would use a word processing program such as WordPerfect or Word for the camera-ready image
of the item. A spreadsheet can be used for recording item history with a common identification
code to connect the word processing file with the spreadsheet. A proviso about commercially available
item-banking software is that these products have a dynamic nature and are continually chang-
ing. Also, with the introduction of new item formats as shown in different chapters of this book,
competing software will have the same capabilities. So, as a potential user of any software system,
matching up the needs of the testing program with the software is important.
Another possibility for item banking is any relational database, such as Access, widely dis-
seminated with Microsoft Office or similar office packages. Relational databases allow for the
creation of multiple tables linked through a common identifier. For example, a test item
can be created with a unique identification and, with that identification, the item can exist in
multiple tables, including a table that contains the item itself, a table containing the history of
the item development, a table containing associated graphics, a table containing item statistics,
a table containing item content and cognitive demand information, etc. The tables are linked
Table 2.4 Examples of Popular Commercially Available Software Providing Comprehensive Testing Services That Include Item Banking
Name Web Address
FastTEST 2.0 fasttestweb.com
FastTEST Pro 2.0 fasttestweb.com
FastTEST Web fasttestweb.com
Perception questionmark.com
Random Test Generator PRO hirtlesoftware.com
Test Creator centronsoftware.com
Test Generator testshop.com
Testdesk aditsoftware.com
Unitest System sight2k.com
through the unique item identification. Any subset of information related to a specific item or set
of items can be obtained through the development of specific queries (a summary table generated
on demand to report on information from multiple tables for a specific item).
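As a minimal sketch of the relational approach just described, the example below uses Python's built-in sqlite3 module (rather than Access) with hypothetical table and column names: an items table and an item_statistics table share the item identifier, and a query joins them on demand.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a throwaway in-memory bank for illustration
cur = conn.cursor()

# One table holds the item itself; a second holds its statistical history.
# Both are linked through the common item identifier.
cur.execute("""CREATE TABLE items (
                   item_id TEXT PRIMARY KEY,
                   stem TEXT,
                   item_format TEXT,
                   content_code TEXT,
                   cognitive_demand TEXT)""")
cur.execute("""CREATE TABLE item_statistics (
                   item_id TEXT REFERENCES items(item_id),
                   administration TEXT,
                   p_value REAL,
                   discrimination REAL)""")

cur.execute("INSERT INTO items VALUES (?, ?, ?, ?, ?)",
            ("MP-0042", "Which garnish best complements ...?", "SR",
             "Presentation", "Skill"))
cur.execute("INSERT INTO item_statistics VALUES (?, ?, ?, ?)",
            ("MP-0042", "2012 Spring", 0.64, 0.31))

# A query (in the sense used above) pulls the item and its history together.
cur.execute("""SELECT i.item_id, i.content_code, s.administration, s.p_value, s.discrimination
               FROM items i JOIN item_statistics s ON i.item_id = s.item_id
               WHERE i.item_id = ?""", ("MP-0042",))
print(cur.fetchone())
conn.close()
```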
For more information about computerized item banking, the reader is directed to Vale’s (2006)
chapter 11 in the Handbook of Test Development.
Item-Writing Guide
An item-writing guide is the official document used in the training of item writers and used by
item writers to help them in the item-writing process. The item-writing guide should contain the
item and test specifications, some information about the inventory so that they can understand
their role in writing items, formats to be used and not used, and item-writing guidelines, such as
suggested by Haladyna and Downing (1989a, 1989b), Haladyna, Downing, and Rodriguez (2002)
and in this book in chapter 6.
One of the most comprehensive item-writing guides can be found on the website of the
National Board of Medical Examiners (http://www.nbme.org/publications). This item-writing
guide is the longest you will encounter, but it shows that detail and thoroughness are possible. On the
other hand, SMEs are not likely to read these longer item-writing guides as thoroughly as you
might want, so shorter item-writing guides might be more effective for getting SMEs started. (See
Table 2.5.)
Table 2.5 Outline for an Item-Writing Guide
1. Brief description of the testing program
2. Description of the item classification system for content and cognitive demand
3. Instructions on how to prepare and transfer items and other logistical concerns
4. Item formats to be used and not used
5. Examples of well-written and poorly written items
6. Item-writing form/template (electronic versions preferred)
7. Guidelines for writing items (DOs and DON’Ts)
Item-Writing Training
The training of item writers is an important event. Conducting this training constitutes evidence
for item validation and for test score validation. Most untrained item writers have predetermined
habits for writing test items that will not produce validated items; others have almost no experience writing test items at all. Therefore, training provides each SME with an
opportunity to develop knowledge, a set of skills, and some strategies that comprise their item-
writing ability. The training uses the item-writing guide and begins with a didactic introduction.
A typical session might include the outline provided in Table 2.6.
One of the most useful, valuable activities in item-writing training is the post-session group
discussion. To hear colleagues discuss your item and offer constructive advice is valuable both
for improving the item and for learning how to write better items. The length of the session can
be as little as a few hours or can be extended to several days if the group of SMEs is on a produc-
tion schedule. The editing of items is often done by a professional editor, but preliminary editing
helps. The reviewing of items is an ongoing activity over several months.
CR: objectively scored. Some of these CR item formats require a simple word or sentence
response, scored right/wrong, or require a performance of a simple skill. In both instances, a
scorer must determine if the item is correctly or incorrectly answered. Judgment is needed,
unless the response is objectively determined.
CR: subjectively scored. One of the best examples of this type of item scoring is a writing
prompt where a rubric (descriptive rating scale) is used. Chapters 10 and 11 provide specific
information about these kinds of items and how to design these types of items.
The scoring of these items is subjective, and requires the judgment(s) of SMEs as described in
chapter 12. Unlike objective scoring, where there is only one answer or a set of agreed-upon correct answers, the judgment here is made in terms of degrees. To aid the SMEs in making judgments,
benchmark performance examples might be provided in a training session to find out if the SME
is performing in accurate and consistent ways.
Estimating the Cognitive Demand for the Targeted Learner for Each Item
As chapter 3 is devoted to the topic of content and cognitive demand, the task of labeling each
item with an expected cognitive demand gets extensive treatment there. As test developers con-
tinue to lament the lack of test items with the more desirable higher cognitive demand, we have
a considerable challenge ahead. Each item should be given a designation for the intended cogni-
tive complexity that the typical test taker has to undergo to select or construct the right answer.
The system for organizing items by cognitive complexity is a matter of considerable concern as
chapter 3 shows.
validity evidence. Each type of review is complementary. A series of reviews is strongly recom-
mended. Each is intended to ward off a threat to validity. Each is independent of the others.
Fairness
Although fairness has been a concern of test developers and test users for many years, we have no
widely accepted definition (AERA, APA, & NCME, 1999, p. 80). One definition that works for
item validity is that any characteristics of items that affect test scores and are unrelated to what is
being measured are unfair. That is, the item elicits construct-irrelevant variance, which is a major
threat to validity (Haladyna & Downing, 2004).
One of the most significant efforts to date on fairness comes from the Educational Testing
Service Fairness Review Guidelines and the efforts of Zieky (2006). Fairness review is highly
recommended. First, guidelines provide a concrete basis for determining what is fair and unfair.
Subjective judgments of fairness are less likely. If the guidelines are universally shared and pub-
lished, we are less likely to have unsuitable content in tests. From his research of similar guide-
lines on fairness, Zieky lists six guidelines that one might consider:
Zieky also recommends various ways to adjudicate fairness reviewers’ disagreements. Generally
he favors a third party, whether it is a dispassionate expert or a large committee.
Aside from fairness review, we have other remedies that do not fit in this category. For instance,
differential item functioning is a statistical technique that uncovers an empirical basis for unfair-
ness—where an item provides an advantage for one group over another. Another major area
involving fairness is accommodations. Chapter 15 deals with item development for students with
exceptionalities. Chapter 16 provides more information about fairness.
Language Complexity
With growing awareness of how English language learners take tests written in English, we see more research on the nature of language complexity in item development and how it
might present instances of construct-irrelevant variance. As with fairness, the linguistic complex-
ity of test items may lower a test taker’s score unfairly.
The central issue is that if a test does not measure reading comprehension, we do not want
reading comprehension to be a factor in determining the performance on the test. That is, a test is
supposed to measure one construct alone and not many constructs. There is extensive and grow-
ing research that the degree of linguistic complexity does affect test performance (Abedi, 2006).
When the linguistic complexity of test items is simplified in an appropriate way, test performance increases for
some test takers. We think that such accommodation is fair, as reading comprehension should
not interfere with the performance on a test. In theory, we think it is the cognitive demand of
reading comprehension that influences test performance. Abedi’s chapter provides the most
extensive treatment of this subject to date.
What are some features of linguistic complexity that should concern us?
1. Word frequency and familiarity. Words high on the word frequency list are more likely to
be read and understood than low-frequency words.
Editorial
Depending upon the size of the testing program, editing is done professionally by a highly trained
and well-qualified specialist or it is done informally by someone with good knowledge and skills
of editing. In either situation, the goals of the editor are to (a) revise items to improve clarity but NEVER change content, and (b) correct grammatical, spelling, punctuation, and capitalization errors. The editor also ensures that the item is presented in the correct format, so that when
the item goes into the item bank, it is ready for use in a test.
Another important activity of the editor is proofing. Although proofing may be a shared
responsibility, the editor is best trained and qualified to proof. The standard for publication of
any test is perfection.
A chapter by Baranowski (2006) provides much useful information about editing. She consid-
ers the editorial review to be a type of qualitative validity evidence, a view that is consistent with this book as well. An editorial style guide is a useful document for the editor.
Summary
This chapter gives a brief overview of the many important steps involved in item develop-
ment. Many of these steps by virtue of their completion present validity evidence (Downing &
Haladyna, 1997). This is procedural evidence: events recorded that show that vital actions were
taken to validate items. Because these events comprise an important source of item validity evi-
dence, documenting when these events were held and completed is important. Other chapters
provide great detail about some of these procedures as noted in this chapter.
3
Content and Cognitive Demand of Test Items
Overview
Content-related validity evidence is a major consideration in any validation of a test score inter-
pretation or use (Kane, 2006a). Content also plays a very important role in item development and
validation. A panel of highly qualified subject-matter experts (SMEs) is critical in establishing a
body of content-related validity evidence. Their expertise and judgment comprise the basis for the
validity argument supporting the content of a test. Thus, the focus of this chapter is on developing
content-related item validity evidence as part of item validation. Two major issues in this quest for
content-related validity evidence and item validation are content and cognitive demand.
Content refers to knowledge, skills, and abilities, which were briefly defined in chapter 1. The
use of knowledge and skills in performing a complex behavior is characteristic of a cognitive abil-
ity. In chapter 1, we used writing in the language arts curriculum, and dentistry, a professional
competence, as examples to illustrate two very different cognitive abilities.
Cognitive demand refers to the expected mental complexity involved when a test item is admin-
istered to a typical test taker. Recalling knowledge is the simplest form of cognitive demand.
Comprehending or understanding knowledge is a slightly higher cognitive demand. Tasks that
involve the complex use of knowledge and skills are the highest type of cognitive demand. Exam-
ples include solving a problem, writing poetry, or completing a science project. Any cognitive
demand depends on the nature of the task and the instructional history of the test taker. Thus,
no item or objective has an absolute cognitive demand. The cognitive demand that we assign to
a test item is only a best guess based on the speculation about the cognitive process needed to
respond to the test item correctly. This speculation considers the typical test taker, not necessar-
ily the very advanced learner or someone who has not yet learned that content.
The first part of this chapter discusses how cognitive psychology and measurement theorists led
an effort to improve the measurement of cognitive abilities, such as that found in schools and pro-
fessions. Then we discuss limitations of the most popular cognitive taxonomy for classifying types of
cognitive demand. In the next section, we present a simplified cognitive taxonomy that draws from
our understanding of knowledge, skills, and abilities. When we have organized the target domain
and the universe of generalization for content and cognitive demand, a set of item and test speci-
fications are created. Other terms used for the item and test specifications document are two-way
grid or test blueprint (e.g. Gronlund & Waugh, 2009; Linn & Miller, 2005; Thorndike & Thorndike-
Christ, 2010). This chapter’s final section identifies the content-related validity evidence needed in
item validation. Naturally, content and cognitive demand are the mainstays of this evidence.
1. Developing a learner's cognitive abilities is the goal of most K-12 instruction and professional education. We want students who have adequately developed reading, writing, speaking, listening, and mathematical and scientific problem-solving abilities. Other abilities are developed in schools, including critical/analytical thinking and various creative abilities. In the training of any profession, candidates for licensure or certification must acquire knowledge and skills and put them to work on the more complex tasks they encounter in their profession.
2. We have two different kinds of domains of learning. The first is more traditional. It con-
sists of knowledge and skills organized into a hierarchy (Messick, 1994; Sugrue, 1995).
This domain is based on behavioral learning theory. The terms criterion-referenced and domain-referenced have often been used to describe tests designed to sample from this
domain. The second type of domain is a collection of performed tasks that require the
complex use of knowledge and skills. Messick referred to this type as task-based and the
former as construct-based. Cognitive learning theory identifies closely with this second
type of domain. In chapter 1 and this chapter, we have used two kinds of cognitive abili-
ties, writing and dental competence, to illustrate features of this kind of domain.
3. Regarding this second type of learning domain, Kane (2006a, 2006b) stated that a
target domain is the reification of the domain of tasks to be performed. For instance, a
target domain for writing includes all possible writing tasks that we might encounter in
a lifetime. A target score is the score a test taker might achieve if all items/tasks in this
domain were administered. The target domain is a heuristic device, and the target score
is hypothetical.
4. Realistically, the universe of generalization represents those test tasks/items that might
be included on any test of that construct. The universe score is the true score—the score
obtained if all items in the universe of generalization were administered. Ideally and hypo-
thetically, the universe score is perfectly correlated with the target score. The judged corre-
spondence between the tasks in the target domain and the tasks in the universe of generali-
zation is an important outcome in the development of content-related validity evidence.
5. Valid interpretation of a test score requires a test designed as a representative sample from the universe of generalization. The item and test specifications document is the device we use to this end.
6. Without doubt, sponsors of virtually all types of tests have wanted to engage test takers
in test items with higher cognitive demand. Research and surveys continually show that
tests have far too many items calling for low cognitive demand. Thus, test developers are
concerned about including objectives that have a higher cognitive demand and items that
match their content standards or competencies.
7. As stated previously, the cognitive demand of any item is a function of the develop-
mental level of the test taker and the design of the item. An expert learner works mainly
from memory by recalling an experience that fits a task/test item demand. The cognitive
demand is relatively simple. Perhaps it is recall or recognition. A novice learner may have
to invoke more complex strategies to respond to the task/test item because of a lack of
knowledge and skills and experience with the task. Thus, the test item will have a high
cognitive demand for a novice learner. We can only speculate about cognitive demand for
any test taker, unless we use a think-aloud procedure to see what the test taker was think-
ing when responding to the item.
8. Identifying the content of each item is crucial. Knowing its cognitive demand for test tak-
ers of an average or desired developmental level is also important. Not only is instruction
improved by knowing about content and cognitive demand, but testing is focused exactly
on what the construct represents.
The main issue in this chapter is to show that the universe of generalization has a high degree
of fidelity with the target domain. Then another goal is to ensure that any test is an adequate,
representative sample of test items from this universe of generalization. The item and test speci-
fications document is the current technology for designing this kind of test. However, cognitive
psychologists have been working on alternative methods for test design that avoid test specifica-
tions and focus on the cognitive ability itself (see Mislevy, 2006, for an example of a cognitively
based approach to test design).
Both Messick (1989) and Kane (2006b) have argued that content-related validity evidence
should address important concepts. These include content relevance, content representativeness,
dimensionality/structure, adequacy of the item pool, and internal and external aspects of item
response and test score structure. Later in this chapter, we address the assembly of evidence on content and cognitive demand to support item validation.
The next section deals with the vexing problem of ascertaining the cognitive demand represented in a test item, given the repeated caveat that the cognitive demand of any item naturally varies across test takers.
As a final assessment of the validity of the claims concerning the psychological properties of
the taxonomy, it is perhaps fairest to say that the picture is uncertain. No one has been able
to demonstrate that these properties do not exist. (Seddon, 1978, p. 321)
hierarchical structure of the data. A study by Miller, Snowman, and O’Hara (1979) using the same
data and different methods of analysis concluded that the structure of the data resembled fluid
and crystallized intelligence. Another reanalysis of the Stoker/Kropp data by Hill and McGaw
(1981) using another method of analysis concluded some support for higher-order thinking.
Kunen, Cohen, and Solman (1981) found some evidence for a hierarchy but thought that the
evaluation category was misplaced. We have no more current studies to report of the structure of
data generated for the cognitive taxonomy.
The study of internal structure in test item response data is hopelessly muddled by the fact that
the cognitive demand of any item is a joint function of the item and the developmental level of the
student being tested. As noted previously in this chapter, a novice level test taker is likely to be a
low performer and will encounter the task in the test item with a more complex form of cognitive
behavior. An expert test taker who performs at a high level will simply recall from memory. Thus,
it is no wonder that these studies fail to produce item response patterns resembling the cognitive
taxonomy. If the learning history of each test taker were known, such studies might reveal more
about the veracity of the cognitive taxonomy.
origins. In a study by Gierl (1997) involving seventh-graders’ mathematics achievement test items,
the correspondence between the expected classifications as judged by item writers and students
was 54%. He concluded that the taxonomy does not help item writers to anticipate the cognitive
demand of students. A very interesting study by Solano-Flores and Li (2009) used cognitive inter-
views with different cultural groups in the fifth grade for mathematics. They found that students
had different perspectives for each item that influenced the way they approached each item.
What studies are beginning to show is that students can generate a correct answer using
patterns of thought that are unrelated to knowledge and skills targeted by the test item.
Genuine domain mastery or competence can be usurped by testwise strategies and alterna-
tive knowledge not specifically targeted by the test item. (Leighton & Gierl, 1997, p. 5)
Thus, when a team of SMEs makes a claim about an item's cognitive demand, the actual thought processes employed by test takers still vary considerably. Every test item has a personal context for each student, and that context influences performance. Sugrue (1995)
also provided greater understanding of the problem of determining the cognitive demand in a
student’s response to a test item. As students vary in their instructional history, mathematics
problem-solving cannot be treated as a unified construct. Her analysis led her to conclude that
students need a more personalized approach to something as complex as problem-solving.
Critical Analyses
Critical analyses of the cognitive taxonomy were of great interest long ago (Furst, 1981; Poole,
1971, 1972). Furst’s findings generally supported the use of the taxonomy, although he acknowl-
edged and cited instances where users had simplified the taxonomy to three categories for con-
venience of use. Cognitive learning and constructivist theorists consider the cognitive taxonomy
to be outdated (e.g. Bereiter & Scardamalia, 1998). These authors have adopted the position supported by student think-aloud studies: cognitive demand depends on the developmental learning level of the student. No item has a natural cognitive demand. Other philosophers and
cognitive psychologists have been critical of the cognitive taxonomy and presented arguments
in favor of the interpretation of cognitive demand based on the interaction of each student and
the item (Ennis, 1989, 1993; Lewis & Smith, 1993; Mayer, 2002). Much of higher-level thinking
consists of structured and ill-structured problems that require the application of knowledge and
skills. A cognitive task analysis performed by SMEs or as reported by students performing the
task is the best way to uncover a cognitive demand. Clearly, the direction taken by contemporary
learning theorists is away from the mental-filing-cabinet approach of the cognitive taxonomy.
Because we now understand more about complex thinking, it is clear that behaviorism does not explain it well and that cognitive psychology is better suited to modeling it.
“conceptual swamp.” To illustrate, we have a plethora of terms taken from various sources that
convey higher-level thinking:
We use terminology loosely and without adequate definition. It is no wonder that no taxonomy will work until each category of higher-level thinking is more adequately defined and terminology is standardized. This collection of ill-defined terms is not the fault of behavioral learning theorists; it is a problem for all learning theorists and for the testing specialists who implement the various taxonomic schemes.
In its current form, the cognitive taxonomy is far too complex to be workable in the organiza-
tion of curriculum, professional competence, or a cognitive ability such as writing.
The idea of simplifying the cognitive taxonomy has merit. There is more consensus about its first two levels, and the application of knowledge and skills in complex ways can serve as a third, catch-all category of cognitive demand.
Knowledge
Borrowing from the definition of knowledge proposed by David Merrill (1994), all knowledge
can be classified as a fact, concept, principle, or procedure. All knowledge is subject to three types of cognitive demand: recall or recognition, comprehension or understanding, and application in a more complex task.
Fact A fact is a truth known by experience or observation. A fact is a statement that is indisputable.
For test content, facts are established by SMEs. Most facts are committed to memory and can be recalled or recognized in a test. Facts should be distinguished from opinions, which have different points of view and rationales. Most elementary school curricula help learners determine the difference between a fact and an opinion. Regarding cognitive demand, facts may be recalled or understood. Facts may also be used in a more complex task
as part of an argument, in the development of a solution to a problem, or in some creative way.
For writing ability some facts are:
Concept A concept is an idea of something formed by mentally combining all its characteristics
or particulars. A concept has a name, distinguishing characteristics, and examples and non-exam-
ples. As with any fact, we can recall the definition of a concept, understand or comprehend the
meaning of a concept, or apply a concept for a more complex task. A concept can be defined liter-
ally by recalling or recognizing a written definition. Or a concept can be understood by presenting
it in a paraphrased version and asking the learner to identify or create a response that exhibits
understanding. Another way to test for understanding/comprehension is to present examples and
non-examples that have not been previously introduced. In the performance of a complex task, the
learner may use the concept with other concepts to solve a problem, think critically, or analyze.
For writing ability, some concepts are:
1. Persuasive writing
2. Punctuation
3. Spelling
4. Grammar
5. Word
1. The chance of fatal injury when a passenger has a fastened seatbelt is less than if the
passenger had an unfastened seatbelt.
2. The origin of humans is a complex evolutionary story.
3. Smoking causes respiratory and other illnesses.
4. A paragraph should begin with a topic sentence.
5. When water evaporates, it carries heat with it.
Procedure A procedure is an observable physical or mental course of action that has an intended
result. Although procedures are associated with skills, a fundamental requirement of performing
any skill is knowledge of the procedure. Examples of procedures include knowing how to:
1. Unlock a door.
2. Wash a car.
3. Estimate the amount of paint to buy to paint a bedroom.
4. Water your vegetable garden.
5. Turn on your computer.
Use of Knowledge in Developing an Ability and in Test Design This four-category organization of knowledge helps item writers better understand the variety of knowledge content available when creating a test item. An item need not be classified as either fact, concept, princi-
ple, or procedure. However, as a construct is being defined, the atomistic analysis of knowledge
should lead to more precise teaching and more effective learning that is guided by more valid
testing for formative and summative purposes.
Table 3.1 shows the three types of cognitive demand for different content.
These examples show that for different kinds of content we essentially have three cognitive demands: recall of knowledge, understanding of knowledge, and the use of knowledge for a more complex task. This organization is very much like the traditional cognitive taxonomy, except that the single application category subsumes the four types of higher-level thinking found in that taxonomy.
Skill
A skill is a performed act. The structure of any skill is simple. Some skills consist of a singular act
whereas other skills are procedures involving two or more steps. The distinction between a skill and the performance of a complex task representing an ability is somewhat arbitrary; the latter reflects one task in the universe of generalization for a cognitive ability. The performance
of these complex tasks involves more than just performing a skill. A committee of SMEs is best
suited to judge whether a performance is an instance of a skill or an instance of a task from the
universe of generalization for an ability.
Any skill has three types of cognitive demand: (a) recall or recognition of the procedure for performing the skill, (b) comprehension or understanding of that procedure, and (c) performance of the skill itself.
Table 3.2 shows the progression of knowledge of a skill to the performance of the skill in an iso-
lated way to the use of a skill in a more complex task.
Cognitive Ability
As presented in chapter 1, a cognitive ability is a mental capacity to perform any task from a
domain of tasks. Each cognitive ability is represented by a target domain that contains a popu-
lation of complex tasks. In mathematics education, this target domain might be all problems
requiring mathematics that we encounter in our daily lives. Any complex task in mathematics will
require that knowledge and skill be applied in a complex way. Evidence of its complexity comes
from a cognitive task analysis conducted by SMEs or via discussions with targeted learners.
Some tasks in a target domain are well-structured problems that can be generated by an algorithm; the following mathematics item illustrates this kind of task.
How many different double-topping pizzas can you make with ____ different toppings?
Explain your thinking.
Explain how you got your answer.
Comment: The blank can be replaced with the following numbers: 3, 4, 5, 6, 7, 8, 9, 10, 11,
12, which yields 10 well-structured items from the same algorithm.
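The item-generating algorithm described in this comment can be sketched in a few lines of code. The sketch below is ours, not part of any testing program; the function name and output format are hypothetical, and it assumes that a double-topping pizza uses two different toppings, so the keyed answer for n toppings is the number of ways to choose 2 of them.

from math import comb

def generate_pizza_items(topping_counts=range(3, 13)):
    """Generate the 10 well-structured items and their keys (illustrative sketch)."""
    items = []
    for n in topping_counts:
        prompt = (f"How many different double-topping pizzas can you make with {n} "
                  f"different toppings? Explain your thinking.")
        key = comb(n, 2)  # assumes the two toppings differ and order does not matter
        items.append({"prompt": prompt, "key": key})
    return items

for item in generate_pizza_items():
    print(item["key"], "|", item["prompt"])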
In the opposite case, we have ill-structured problems that exist in any domain. These tasks are
less easily categorized. A team of SMEs can identify tasks that do not lend themselves to algorith-
mic variation to produce many new items. The best we can hope for is to create complex tasks
that are worthy of learning and agree that knowledge and skills are to be applied. Figure 3.2 shows
an ill-structured problem from a writing prompt.
People tell us that we need exercise to stay healthy. Write a paper to convince your reader
to join you in an activity that will be fun and healthy.
A patient comes to your office complaining of hallucinations and feelings of fear when
around cars.
Although scoring is standardized, the cognitive demand of this writing prompt is unique com-
pared to other writing prompts.
Examples of Instructional Objectives Representing Cognitive Abilities In Table 3.3 are exam-
ples taken from a typical state’s content standards that illustrate these complex tasks. SMEs can
use their expertise and personal experience to create scenarios or vignettes that make complex
tasks more meaningful to learners and test takers.
Table 3.4 takes a different tack as it shows competencies in professions that represent tasks
from the target domain that might be included in the universe of generalization for a credential-
ing test (for licensure or certification).
Table 3.4 Examples of Competencies in Various Professions That Represent a Cognitive Ability
Accountancy: To maintain and broaden public confidence, members should perform all professional responsibilities with the highest sense of integrity.
Dentistry: Continuously analyze the outcomes of patient treatment to improve that treatment.
Nursing: Supervise/evaluate activities of assistive personnel.
Pharmacy: Evaluate information about pharmacoeconomic factors, dosing regimens, dosage forms, delivery systems and routes of administration to identify and select optimal pharmacotherapeutic agents for patients.
Physical Therapy: Complete documentation related to physical therapy practice in an appropriate, legible, and timely manner that is consistent with all applicable laws and regulatory requirements.
Table 3.5 Summary of Cognitive Demands for Knowledge, Skills and Abilities
Knowledge (types: fact, concept, principle, procedure): recall/recognize; comprehend/understand; application.
Skill (types: mental, physical): recall/recognition of the procedure for performing the skill; comprehension/understanding of that procedure; performing the skill.
Ability (a collection of structured and ill-structured tasks): use knowledge and skills in the performance of each task.
As noted several times in this chapter, and as is worth repeating, any cognitive demand evaluation done by an SME or a team of SMEs is only speculation. The demand an item places on a learner is a joint function of the learner's history and background and the complexity of the task in the item.
Item and test specifications and cognitive modeling both carry content from construct definition to item development to test design and then to scoring and reporting. The former is a well-established technology; the latter appears to be emerging as an alternative to item and test specifications. An item and test specifications document typically addresses the following:
1. The types of test items to be used and the rationale for the selections are described. Chap-
ter 4 provides useful information for this important step in item development.
2. Guidelines are provided for how to create these items. Sometimes, these items might be
selected from extant items. Guidelines for developing items are presented throughout this
book. Item and test specifications include information about item style, cognitive demand,
graphics, and item-writing principles to follow.
3. A classification system is created and used for developing the item bank. Items are classified
by content. In some test specifications, items are also classified by cognitive demand.
4. The test blueprint provides the basis for designing a test; it serves as a recipe that specifies the ingredients in precise amounts. The blueprint also provides an inventory of the items in the item bank (the universe of generalization) so that test developers know how many items are available and how many more are needed for the various content categories and cognitive demands (a brief sketch of such an inventory appears after this list).
5. The item and test specifications are useful to item developers and reviewers. The same test specifications should be made publicly available so that consumers of test information are aware of the high standards employed to ensure that test content represents the target domain. Also, those preparing for any test should know what content will be tested and for what content they are accountable.
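The inventory described in point 4 can be illustrated with a brief sketch. The item bank, content categories, and required counts below are entirely hypothetical; the point is only to show how items classified by content and cognitive demand can be counted against a blueprint to reveal shortfalls.

from collections import Counter

# Hypothetical item bank: each item is classified by content area and cognitive demand.
item_bank = [
    {"id": 101, "content": "Punctuation", "demand": "recall"},
    {"id": 102, "content": "Punctuation", "demand": "application"},
    {"id": 103, "content": "Persuasive writing", "demand": "application"},
]

# Hypothetical blueprint: number of items required in each (content, demand) cell.
blueprint = {
    ("Punctuation", "recall"): 2,
    ("Punctuation", "application"): 1,
    ("Persuasive writing", "application"): 3,
}

inventory = Counter((item["content"], item["demand"]) for item in item_bank)
for cell, required in blueprint.items():
    available = inventory[cell]
    print(cell, "available:", available, "still needed:", max(0, required - available))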
We have many examples of item and test specifications. These examples vary from a single page
to a small volume devoted to many important topics in item development and test design. Table
3.6 provides a generic outline for a document containing item and test specifications. Tables
3.7 and 3.8 list websites that provide examples of test specifications in the professions and in
state achievement testing programs. Among those listed in these tables, the states of Florida and
Oregon have especially comprehensive, well-developed test specifications.
Test developers seeking item validation should develop a set of item and test specifications much like those developed by these states' achievement testing programs. The item and test specifications used in the professions do not yet appear to reach the same standard.
Beyond Test Specifications Cognitive psychologists and measurement theorists have partnered
in developing new ways to define content for testing that include complex cognitive demand. For
instance, Bryant and Wooten (2006) reported a study involving the use of Embretson’s work on
a cognitive design system. As the goals of measurement are defined, the test designer uses a cog-
nitive or information-processing model to generate test items with predicted properties, which
are tested using an item response model. Claesgens, Scalise, Wilson, and Stacy (2008) reported
Table 3.6 A Generic Outline of a Set of Item and Test Specifications for an Achievement or Competency Construct
Introduction: The role that the item and test specifications document plays in test design and validation.
Background: Description of the test program, its content, the kinds of cognitive demand desired, test length, and other salient details.
Content: If the testing program represents a profession, usually a list of tasks performed or competencies is provided, prioritized via a practice analysis. If the testing program represents a curriculum for a school ability (such as reading), a published list of content standards is apropos.
Cognitive Demand: Classification of items by intended levels of cognitive demand is highly desirable, but the trust we can place in the accuracy of these classifications is low because the cognitive demand students experience depends on their instructional histories.
Item Specifications: This section can be quite lengthy. It includes criteria for items, item formats to be used, item-writing guidelines, a style guide for editing items, and criteria for each specific item format. For instance, if a reading passage is presented as part of a testlet (context-dependent item set), then the readability, length, and content of the passage should be specified with boundaries or limits.
Review Processes: A series of reviews is recommended. Chapter 16 discusses the nature and importance of these reviews. Also, chapters by Abedi (2006) and Zieky (2006) are especially helpful in describing the nature and extent of some of these review processes.
Weights: The test blueprint provides the percentage of items allocated to each content area and to each cognitive demand.
Table 3.8 Examples of Test Specifications From State and National Testing Programs
Florida: http://fcat.fldoe.org/fcat2/itemspecs.asp
New Jersey: http://www.nj.gov/education/assessment/ms/5-7/specs/math6.htm
Minnesota: http://education.state.mn.us/MDE/EdExc/Testing/TestSpec
Oregon: http://www.ode.state.or.us/search/page/?id=496
Washington: http://www.k12.wa.us/Reading/Assessment/pubdocs/ReadingTestItemSpecs2010.pdf
Common Core State Standards: http://www.pearsonassessments.com/pai/ai/Products/NextGeneration/ItemDevelopment.htm (primarily contains guidance, as the Common Core State Standards are currently under development)
National Assessment of Educational Progress: http://www.nagb.org/publications/frameworks.htm (provides access to frameworks and specifications for all subject areas)
a study using a cognitive framework for mapping student understanding of chemistry and veri-
fying results using an item response model. More information about this kind of approach to
defining content without test specifications is presented in chapter 7. The scope and significance of this kind of approach to identifying test content and cognitive demand are growing rapidly, and the approach is moving toward practical implementation.
Gorin (2006) described the way cognitive models define the construct. Verbal protocols (think-
aloud) are used to assemble qualitative data to help the test developer in item development. Thus,
the traditional item and test specifications are eschewed in favor of a cognitive learning theory
approach. The process has features that include item development and validation usually using item
response models. Regarding item development, Gorin emphasizes newer formats versatile enough for more complex cognitive demand, along with item-generating methods that limit item-writer freedom but increase the speed of item production and improve item quality. Gierl (2010)
describes another approach. A construct map specifies the knowledge and skills required to perform
complex tasks. A cognitive model like the one he proposes can be used for diagnosis. A consequence
of using this kind of model is a hierarchy of tasks. SMEs are used to judge the content.
According to Ferrera (2006), several guiding principles underlie the cognitive psychology approach to replacing item and test specifications. For most testing programs, item and test specifications should continue to be used. However, the promise held out by cognitive psychologists and their measurement partners is that cognitive modeling will eventually produce a system for test design that exceeds our current expectations for validity, although not soon.
Summary
In this chapter, we have asserted that cognitive learning theory is the dominant basis for explain-
ing content and cognitive demand. Regarding the assessment of cognitive demand, the traditional
cognitive taxonomy has been found inadequate. In more than 50 years since its introduction, the
paucity of research supporting the validity of the taxonomy is greatly at odds with its popularity.
A set of recommendations was presented for classifying content and cognitive demand for two
kinds of achievement domains. The recommendations draw from the cognitive taxonomy but
simplify the classification of higher-level thinking to a single category.
Item and test specifications are very desirable and a major source of content-related validity
evidence. Exemplars for item and test specifications were presented from several large-scale test-
ing programs.
All testing programs should create the item and test specifications document for many good
reasons. First, it stands as an important piece of content-related validity evidence. Second, it
drives item development. Third, it is the basis for test design.
4
Choosing an Item Format
Overview
One of the most important steps in the design of any test is the choice of item formats. Although
most test designers use a single type of item format, a test can have a variety of item formats. The
choice of an item format is based on the capability of a particular format to cover the intended content and elicit a specific cognitive behavior. Sometimes, economy and efficiency enter this consideration,
but there is always a consequence. Fortunately, we have a large body of research bearing on many
issues related to item format to inform us and help us with making the best choice.
The topic of choosing an item format is formidable because we have a large variety of selected-
response (SR) and constructed-response (CR) formats from which to choose. The number of
item formats is increasing as we move from paper-and-pencil testing to computer-based testing.
For instance, Sireci and Zeniskey (2006) presented a variety of innovative SR formats for com-
puterized testing.
A primary consideration in choosing an item format is fidelity, which is the closeness of any test
task to a criterion behavior in the target domain. Another term commonly used in testing literature
is directness (Lane & Stone, 2006). A direct measure closely resembles a task in the target domain
and an indirect measure has lower or very little fidelity with the task in the target domain.
Another significant characteristic of item formats is complexity. Any item can vary in com-
plexity in two ways: (a) instructions to the test taker and (b) conditions for responding.
Messick (1994) claimed that an item's complexity can vary independently in its instructions to test takers and its conditions for performance, but some generalizations about the relationship between the two still hold. That is, an
SR item generally has a brief instruction and few conditions for responding. Some CR item for-
mats also can be very briefly presented and require an answer consisting of a single word, phrase,
or brief paragraph. Other CR item formats have very complex instructions and conditions for
responding. For instance, having a simply worded item that requires a precise written response
that is subjectively scored by trained subject-matter experts (SMEs) is very common in writing
performance testing. The scoring guide is the descriptive rating scale. Complexity may be related
to cognitive demand. More complex instructions to the test taker coupled with more complex
scoring involving SMEs will usually elicit a higher cognitive demand. Item formats that are brief
and direct are more likely to elicit a lower cognitive demand.
This chapter presents a simple taxonomy of item formats. As you will see, each item format
distinguishes itself as to its anatomical structure. However, the more important idea is what
content and cognitive demand can be measured with any item format. The next section in this
chapter discusses some criteria that may influence your choice. Then, research is reviewed that
addresses validity arguments for different item formats. Finally, recommendations are offered
for best choices when measuring knowledge, skills, or abilities.
We have three fundamental types of item formats, as described in other chapters. Rodriguez (2002) described three salient distinctions among item formats, and Messick (1994) suggested a fourth:
1. Objective versus subjective scoring. The former type of scoring is clerical and carries a very small degree of error. The latter requires a human judge using a descriptive rating scale. This kind of judgment usually contains a higher degree of random error.
2. Selection versus production. With the SR format, the test taker recognizes and selects the
answer. With the CR format, the test taker produces the response. Producing a response
usually implies higher fidelity to the task in the target domain.
3. Fixed-response versus free-response. Some CR items are written to offer the test taker more
freedom to express oneself, whereas other CR items are more focused on generating a
structured response.
4. Product versus performance. Some CR items have a product evaluated. Usually the product is a written document, but it could be a model, an invention, or another palpable object. A performance, by contrast, is analyzed for qualities related to a predetermined process; the interest is in technique rather than in an outcome.
Using these distinctions, we appear to have three distinctly different formats from which to
choose. The SR format is one type. Directions are simple. Scoring is objective.
One type of CR format has test taker responses that are objectively scored (OS). Therefore, this
CR format will be designated as CROS. This format requires no inference because the test item
elicits observable performance.
The second type of CR format requires an inference by a judge/rater because the skill or ability
being measured has an abstract nature. To score performance on this kind of CR item, we need
a descriptive rating scale (also known as a rubric or scoring guide). Sometimes we have to use a
set of rating scales representing critical traits. Because of subjective scoring (SS), this format is
designated CRSS. The three formats can be expressed theoretically, as shown in Table 4.1.
Note to Table 4.1: The true score is what a test taker would obtain if all the items in the universe of generalization were administered. Another equivalent term is domain score.
Theoretically, each format has the same first two components, a true score and a random error component. The degree of random error is a critical component of reliability. Scoring error is simply a lack of scoring consistency: objective scoring is highly consistent, whereas subjective scoring is often less consistent. Construct-irrelevant variance (CIV) is a well-established threat to the validity of CRSS items. CIV is quantifiable; if it is suspected, it can be evaluated via validity studies (Haladyna & Downing, 2004). Rater effects arising from subjective scoring are well documented as a CIV threat to validity for writing tests involving CRSS items (Haladyna & Olsen, submitted for publication).
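One plausible way to write out these components, using our own labels rather than reproducing the notation of Table 4.1, is

\[
X_{\mathrm{SR}} = T + E_{\mathrm{rand}}, \qquad
X_{\mathrm{CROS}} = T + E_{\mathrm{rand}}, \qquad
X_{\mathrm{CRSS}} = T + E_{\mathrm{rand}} + E_{\mathrm{score}} + \mathrm{CIV},
\]

where X is the observed score, T is the true (domain) score, E_rand is random error, E_score is error introduced by inconsistent subjective scoring, and CIV is construct-irrelevant variance.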
A more important distinction about item formats comes from Messick (1994). He argued that
SR and CR do not form a dichotomy but a continuum both of stimulus complexity and response
complexity. Any format can vary according to each continuum. More structured tasks can be
modeled with SR, CROS, and CRSS formats, but less structured tasks might be easier to model
with CRSS formats with the accompanying deficit of expensive and subjective scoring that may
detract from validity.
The following is an example of an SR item:
1. For what purpose was Alexander Hamilton's economic program primarily designed?
A. Prepare for war with Great Britain.
B. Provide a platform for the 1792 election.
C. Establish financial stability.
D. Ensure dominance over the southern states.
We have many variations of the SR format, which are illustrated in chapter 5. Chapters 5 to 8
provide extensive information on developing SR items.
The SR format is most suitable for measuring knowledge of any cognitive demand (recall,
comprehension, or application). This format can be used for measuring mental skills, although
with less fidelity than the CROS format. Some types of SR formats are very good for measuring
application of knowledge and skills intended for measuring a complex task that reflects an ability.
Chapter 5 provides more information on the capability of these SR formats for applying knowl-
edge and skill in a complex way.
Item development for the SR format is not easy. Consequently, the cost of a professionally developed SR item can run between $800 and $1,200 (Haladyna, 2004); more current estimates inflate this value considerably. SR tests are group-administered, which can be very efficient. The scoring of SR tests is dichotomous. For most testing programs, SR scoring is automated, although rapid, accurate scoring can also be done with a scoring template for testing
programs with few test takers. As a result, the cost of scoring is very low with the SR format.
Reliability of test scores can be very high with this format. Scaling for comparability is usually
done with very easy-to-use test designs that link test forms to a common scale. Seeing why the SR
format is so desirable is easy. As stated previously, the SR format is not appropriate for measuring
the performance of physical skills.
Knowledge item:
Give an example of onomatopoeia in poetry.
Define onomatopoeia.
Describe how to write a poem.
Mental skill item:
Extend a given pattern occurring in a sequence of numbers.
Copy the 26 letters of the alphabet.
Differentiate works of fiction from nonfiction.
Physical skill item:
Attach the bicycle pump to the bicycle tire.
Turn on your computer.
Run 880 yards in less than five minutes.
Figure 4.2 Examples of tasks for knowledge, mental skill, and physical skill CROS items.
It is unlikely that this format could be used for measuring an ability, because an ability requires the use of knowledge and skills in complex ways. However, for skills that are important in the performance of complex tasks, the CROS is a good choice. With mental skills, the SR may serve as an efficient alternative to
the CROS although with lower fidelity. As there is such a high degree of correspondence between
SR and CROS test scores when measuring knowledge, the SR is preferable due to its greater effi-
ciency (Rodriguez, 2004).
The development of CROS items is less time-consuming than comparable SR items for
measuring knowledge. The development of CROS items for measuring mental or physical
skills is easy. CROS items can be group-administered in knowledge tests, but when perform-
ance is required, some CROS items need to be individually administered, which can be very
expensive. Scoring for CROS items is usually dichotomous (zero–one, yes–no, performed–not
performed). The cost of scoring is higher than with SR tests, because of the need for human
scoring. Because scoring is objective, scoring errors are small. Scoring a CROS test is a clerical
activity and does not require an SME. Rater inconsistency is not a problem with the CROS for-
mat. Reliability can be very high if the number of tasks is also high, as with an SR test. Scaling
for comparability is usually not a problem with the CROS format. Often, CROS item scoring
can be automated (see chapter 12).
Below in Figure 4.3 is an example of a CRSS item for mathematical problem-solving for third-
grade students. This item cleverly has two scoring options. The first is the correct answer to the
question. The second is the subjectively scored process for how the student arrived at the right
answer. All items from this testing program measuring mathematical problem-solving have the same design: one objective result and several traits that require subjective scoring.
The conditions are usually more involved than with the SR or the CROS because the test taker’s
response is usually more complex. The example in Figure 4.3 comes from the topic probability
and statistics. The scoring guide for this problem is an accurate answer that is objectively scored
(CROS), but four descriptive rating scales are used that evaluate critical aspects of mathematical
problem-solving (conceptual understanding, processes and strategies, verification, and commu-
nication). The item calls for knowledge and skills to be used to solve the problem. The conditions
for performance ask the student to (a) interpret the concepts of the task and translate each into
mathematical principles, (b) develop and choose a strategy to solve the problem, (c) verify that
the student’s answer is correct, and (d) communicate the solution to the reader using pictures,
symbols, or words. This example is rare in a field of testing where the SR format is usually used. CRSS items are especially well suited for measuring the complex tasks residing in the universe of generalization for a cognitive ability, such as writing.
Four classes had a booth at a fair. One class sold hats for $3.00 each. Another class sold
pepperoni sticks for $0.75 each. The third class sold popcorn for $1.00 a bag, while the
last class sold pickles for $0.50 each. The chart shows the number of items each class
sold. ONE SYMBOL STANDS FOR 12 ITEMS. How much money did the four classes make?
Figure 4.3 Examples of a CRSS test item that also serves as a CROS item.
Source: http://www.ode.state.or.us/search/page/?id=503
Used with permission from the Oregon Department of Education Assessment Program.
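A minimal sketch of how a response to an item like the one in Figure 4.3 might be recorded for scoring is shown below. The field names and the rating values are our own assumptions; only the four trait labels come from the description above.

from dataclasses import dataclass, field

@dataclass
class ProblemSolvingScore:
    """One student's record on an item with an objective result plus rated traits (illustrative)."""
    answer_correct: bool                               # objectively scored result (CROS component)
    trait_ratings: dict = field(default_factory=dict)  # subjectively scored traits (CRSS component)

record = ProblemSolvingScore(
    answer_correct=True,
    trait_ratings={
        "conceptual_understanding": 4,  # hypothetical rating scale values
        "processes_and_strategies": 5,
        "verification": 3,
        "communication": 4,
    },
)
print(record)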
CRSS items are very difficult to develop, but tests tend to include fewer CRSS items than items in other formats. So the cost of development may not exceed the cost of developing SR and CROS tests, although the content of these tests also differs. The cost
of scoring is greater than the cost for scoring SR and CROS items. Usually, test takers perform a
task that is subject to scoring by an SME. Training is needed to hone each scorer’s ability to score
consistently and accurately. The type of administration is usually by group, but in some circum-
stances individual administration is needed, which can make the test very expensive. As noted
previously, scoring is done with a descriptive rating scale. However, whether one uses a holistic
or analytic trait rubric seems to affect test scores and interpretation of test scores. Research shows
that SMEs rate differently as a function of the type of rubric (Haladyna & Olsen, submitted for
publication; Lane & Stone, 2006).
Also, Lane and Stone cited many examples of SMEs rating on irrelevant features rather than
the content intended. Length of a writing passage is a well-documented irrelevant feature that
seems to elicit higher scores. The counterargument is that a long, well-written passage is usually better than a short one. Yet some researchers have controlled for that feature and still found that passage length yields higher scores, so there is evidence that passage length is a threat to validity for CRSS items requiring written responses. The cost of scoring is very high because
at least one or possibly two or more SMEs are needed to score test results. Because scoring is
subjective, a family of threats to validity is possible. These include rater severity/leniency, cen-
tral tendency, idiosyncrasy, indifference to rating responses by the SME, response set, and halo.
Consequently, monitoring scoring of SMEs is very important. Reliability of test scores tends to be
lower than desired. Scaling for comparability is very challenging. Another challenge is in estimat-
ing the difficulty of a test item for purposes of equating.
Table 4.2 Salient Distinctions Among SR, CROS, and CRSS Formats (entries are listed in the order SR; CROS; CRSS)
Chapters in book: 5 to 8; 10 to 12; 10 to 12
Content best suited: knowledge; skill; ability
Ease of item development: difficult; less difficult; very difficult, but fewer items are needed
Type of administration: group; group or individual; group or individual
Scoring: right/wrong; dichotomous; rating scale or rubric
Cost of scoring: low; moderate; high
Type of scoring: automated/clerical; automated/clerical; SME
Rater effects/consistency: none; none; a threat to validity
Reliability: can be very high; can be very high; usually a problem
Scaling for comparability: usually very good; can be very good but not done very often; poses some problems
For a task from the domain of complex tasks representing a cognitive ability, the best choice is the
CRSS. A caveat is that the SR testlet is a good substitute for the CRSS if one is willing to sacrifice
some fidelity. The other factors in this section might affect your decision, but content should be
your primary concern.
Table 4.3 Item Format Effects: Topics and Research Hypotheses (Questions)
Prediction: If a test score is used to predict some external criterion, does item format make a difference?
Content equivalence: If SR and CROS item formats purport to measure knowledge, does correlational research support the hypothesis?
Proximity: If two different item formats have varying degrees of fidelity, does it matter if we use the measure of lower fidelity that is also more efficient?
Differential format functioning and contamination: Do gender and other construct-irrelevant variables interact with item format to produce CIV, a threat to validity?
Cognitive demand: Do item formats elicit unique cognitive demands?
Influence on teaching and learning: Does the use of any item format affect the way teachers teach and students learn?
Prediction
Generally, student grades in college or graduate school are predicted from earlier achieve-
ment indicators such as grades or test scores. The ACT Assessment (American College Testing
Program) and SAT (College Board) are given to millions of high school students to guide and
support college admission decisions. The Graduate Record Examination (Educational Testing
Service) is widely administered to add information to graduate school admission decisions. The
predictive argument is the simplest to understand. We have a criterion (designated Y) and pre-
dictors (designated as Xs). The extent to which a single X or a set of Xs correlates with Y deter-
mines the predictive validity coefficient. Prediction is purely statistical. If one item format leads to test scores that provide better statistical prediction, then the question of which item format is preferable is resolved.
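In standard notation (a textbook formulation, not one taken from the programs named above), the predictive validity coefficient for a single predictor X and criterion Y is their correlation, and with several predictors it is the multiple correlation between Y and the regression-based prediction:

\[
r_{XY} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y}, \qquad
R = \operatorname{corr}\!\left(Y, \hat{Y}\right),
\]

where \(\hat{Y}\) is the value of Y predicted from the set of Xs.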
Downing and Norcini (1998) reviewed studies involving the prediction for SR and CR item
formats to a criterion. Instead of using an exhaustive approach, they selected exemplary stud-
ies. All studies reviewed favor the SR format over the CR format, except one in which the CR
test consisted of high-fidelity simulations of clinical problem-solving in medicine. In this study,
the two measures were not construct-equivalent. A counterargument offered by Lane and Stone
(2006) is that if one corrects for attenuation, the CR test will have a higher correlation with the
criterion (Y). However, correction for attenuation is hypothetical rather than factual. The CRSS test will typically have lower reliability. Correction for attenuation may show that the CRSS item has higher fidelity, but it does not lead to better prediction because of that lower reliability. The data reported by Downing and Norcini seem to favor SR tests even when the predictive validity correlations are nearly equivalent, because SR measures are easier to obtain and usually more reliable.
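The correction for attenuation mentioned in this argument is the classical Spearman formula; stated generally (and not as applied in any particular study reviewed here), it estimates the correlation between true scores from the observed correlation and the reliabilities of the two measures:

\[
\hat{r}_{T_X T_Y} = \frac{r_{XY}}{\sqrt{r_{XX'}\, r_{YY'}}},
\]

where \(r_{XY}\) is the observed predictor-criterion correlation and \(r_{XX'}\) and \(r_{YY'}\) are the score reliabilities. Because the denominator shrinks as reliability falls, a low-reliability CRSS measure can show a large corrected coefficient without any gain in observed prediction, which is the sense in which the correction is hypothetical rather than factual.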
Content Equivalence
As noted at the beginning of this chapter, since the SR format was introduced in the early part of
the 20th century, an active, ongoing debate has involved what the SR and CR items measure. This
section draws mainly from a comprehensive, integrative review and meta-analysis by Rodriguez
(2002, 2004). Simply stated, the issue is:
If a body of knowledge, set of skills, or a cognitive ability is being measured, does it matter if we
use a SR, CROS, or CRSS format?
Rodriguez’ review provides clear answers. When the items are stem-equivalent and the cogni-
tive task is recognition versus generation of the answer and the content is the same, correlations
between SR and CROS test scores appear to approach unity.
If the stems are not equivalent, but the content is intended to be the same, correlations remain
quite high after removing the attenuation due to reliability. When items are not content-equiva-
lent by design but appear to measure the same content, correlations after correction for attenu-
ation are high. When SR items are correlated with CRSS items measuring the same content, the
correlations are moderately high. With any CRSS item, we may have several cognitive abilities
embedded. For instance, the test taker has to read the item's instructions and produce a complete answer. The CRSS test score may include reading and writing abilities. Thus, the CRSS represents
more than just content that the SR is supposed to measure. Verbal ability may be a component
of the content when using a CRSS format. Other studies provide additional evidence that SR and
CROS item formats yield similar results (Bacon, 2003; DeMars, 2000; Lawrence & Singhania,
2004). Besides the meta-analysis, Rodriguez reported that 32 other studies used other methods to evaluate construct equivalence. When the content to be measured is held constant, whether one uses the SR or CROS format seems not to matter.
Table 4.4 also shows a sequential ordering of measures of teaching competence by their varying degrees of fidelity. The difference between one measure and another is proximity. If a low-fidelity measure has good proximity to a high-fidelity measure that is very expensive and inefficient to use, would the less expensive, more efficient, lower-fidelity measure suffice?
There is no research to report on the proximity of one measure to another measure of the same
construct as a function of fidelity. The reasoning process for the assessment of fidelity causes ten-
sion for test developers who must decide when the more efficient SR format suffices in place of
the less efficient and sometimes less reliable CRSS format.
Females generally perform better than males on the language measures, regardless of
assessment format; and males generally perform better than females on the mathematics
measures, also regardless of format. All of the differences, however, are quite small in an
absolute sense. These results suggest that there is little or no format effect and no format-by-
subject interaction. (Ryan & DeMark, 2002, p. 14)
Thus, their results clearly show small differences between boys and girls that may be real and
not a function of item formats. Ryan and DeMark (2002) offered a validity framework for future
studies of item format that should be useful in parsing the results of past and future studies on CR
and SR item formats. Table 4.5 captures four categories of research that they believe can be used
to classify all research of this type.
The first category is justified for abilities where the use of CR formats is obvious. In writing, for
example, the use of SR to measure writing ability seems nonsensical, though SR test scores might
predict writing ability performance. The argument we use here to justify the use of a CR format
is fidelity to a criterion.
The second category is a subtle one, where writing ability is interwoven with the ability being measured. This situation may be very widespread and include many fields and disciplines where
writing is used to advance arguments, state propositions, review or critique issues or perform-
ances, or develop plans for solutions to problems. This second category uses CR testing in a com-
plex way that involves verbal expression. Critical thinking may be another ability required in this
performance. Thus, the performance item format is multidimensional in nature.
The third category is a source of bias in testing. This category argues that verbal ability should
not get in the way of measuring something else. One area of the school curriculum with this tendency is the measurement of mathematics ability, where CR items that rely on verbal ability are used; this reliance biases results. Constructs falling into this third category
seem to favor using SR formats, whereas constructs falling into the first or second categories seem
to favor CR formats.
The fourth category includes no reliance on verbal ability. In this instance, the result may be so
objectively oriented that a simple CROS item format with a right and wrong answer may suffice. In
these circumstances, SR makes a good proxy for CR, because SR is easily and objectively scored.
A study of Advanced Placement history tests nicely expressed two of the important findings of
the Ryan and Franz review (Breland, Danos, Kahn, Kubota, & Bonner, 1994). They found gender differences favoring men on the SR scores, which they attributed to men's greater knowledge of history, whereas the scores for men and women on the CR test were about the same. Attention in this study was drawn to potential biases in scoring CR writing. Modern high-
quality research such as this study reveals a deeper understanding of the problem and the types
of inferences drawn from test data involving gender differences. In another study, Wightman
(1998) examined the consequential aspects of differences in test scores. She found no bias due to
format effects on a law school admission test. A study by DeMars (2000) of students in a statewide
assessment revealed very little difference in performance despite format type. Although format-
by-gender interactions were statistically significant, the practical significance of the differences
was very small.
A study of students from different countries by Beller and Gafni (2000) found reversed gen-
der–format interactions in two different years. Upon closer analysis, they discovered that the difficulty of the CR items interacted with gender to produce differential results. Garner
and Engelhard (1999) also found an interaction between format and gender in mathematics for
some items. Hamilton (1998) found one CR item that displayed differential item functioning. She
found that gender differences were accentuated for items requiring visualization and knowledge
acquired outside school. Lane, Wang, and Magone (1996) studied differences between boys and
girls in a middle-school mathematics performance test. In comparable samples, girls outperformed boys, owing to better communication of their solutions and more comprehensive responses. Their findings point to a critical issue. Is this construct of math-
ematical problem-solving defined so that it emphasizes communication skills (verbal ability) in
performing a mathematics task? If so, then there is no argument supporting DFF.
In their study of gender differences on a graduate admissions test, Gallagher, Levin, and Caha-
lan (2000) concluded that performance seemed to be based on such features of test items as
problem-setting, multiple pathways to getting a correct answer, and spatially based shortcuts
to the solution. Their experimentation with features of item formats points the way toward designing items that remove gender-related, construct-irrelevant factors during item design. This theme recurs: the design of an item strongly affects the item's statistical performance, namely its difficulty and discrimination, yet the design of items should focus on a format's capability for content and cognitive demand. The quest for satisfactory difficulty and discrimination is secondary to content and cognitive demand.
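For readers who want the classical meaning of difficulty and discrimination made concrete, the sketch below uses a made-up 0/1 response matrix (not data from any study cited here): difficulty is the proportion of test takers answering an item correctly, and discrimination is estimated with the correlation between the item score and the rest of the test score.

import numpy as np

# Hypothetical scored responses: rows are test takers, columns are items (1 = correct).
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
])

total = responses.sum(axis=1)
for j in range(responses.shape[1]):
    item = responses[:, j]
    difficulty = item.mean()                              # p-value: proportion answering correctly
    rest_score = total - item                             # exclude the item to avoid part-whole inflation
    discrimination = np.corrcoef(item, rest_score)[0, 1]  # corrected point-biserial correlation
    print(f"Item {j + 1}: difficulty = {difficulty:.2f}, discrimination = {discrimination:.2f}")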
This research and the research reviewed by Ryan and DeMark (2002) should not lead to the conclusion that formats interact with gender so much as that CROS and CRSS formats demand verbal ability beyond the content of the construct being measured. For the reported interaction
of gender and formats, effect sizes are very small. Research should continue to search for sources
of bias. The most important outcome of their study is the evolution of the taxonomy of types of
studies. As stated repeatedly in this chapter, knowing more about the construct being measured
has everything to do with choosing the correct item format.
Cognitive Demand
As noted in chapter 3, defining cognitive demand is very challenging. Having SMEs judge the cognitive demand of any item is virtually impossible because test takers differ in their instructional histories and cultural backgrounds.
Nonetheless, a persistent belief is that the CRSS can elicit more complex cognitive behavior
whereas the SR is limited to simple recognition. Of course research, experience, and examples
presented in this book show that all three formats have capability for measuring complex cogni-
tive behavior associated with ability. When it comes to assessing the fidelity of a task from a test
to a comparable task in the target domain, the CRSS seems better suited. In this section, research
is reviewed that informs us about the unique capabilities of the SR, CROS, and CRSS formats.
A set of studies and review of research is very informative about the possibility of variations in
cognitive demand as a function of item formats (Martinez, 1990, 1993, 1999; Martinez & Katz,
1996). These studies led to the conclusion that considerable variety exists between CROS and
CRSS formats as to the kinds of cognitive behavior elicited. These studies suggest that under a
variety of conditions and for different subject matters and using different research methods, SR,
CROS, and CRSS formats can elicit higher-level cognitive demand. Martinez (1999) concluded
that the CRSS formats have greater potential for the full range of complex cognitive behavior, but
SR formats can elicit many types of complex behavior as well.
Other more recent studies also offer more perspective. For instance, Palmer and Devitt (2007)
evaluated SR and CROS items in a medical education setting. They found a tendency for most
items to measure the recall of knowledge and claimed no advantage for either format. Haynie
(1994) examined delayed retention using SR and CROS. He found SR to be superior in measur-
ing delayed retention of knowledge. van den Bergh (1990) argued from his testing of the reading
comprehension of third graders that format made little difference in test score interpretation.
His theoretical orientation provided a stronger rationale for the validity of his findings than prior
studies. A study by Hamilton, Nussbaum, and Snow (1997) involved 41 high-school students
who were interviewed after taking a test involving SR and CR formats. The SR items performed
quite well as to higher-level thinking, as did the CR items. What surprised the researchers was the
wide range of findings. They concluded:
Our interviews suggest that the MC format forced students to think about scientific con-
cepts and that the lack of structure in the CR items invited responses based more on
everyday knowledge and non-scientific explanations. (Hamilton, Nussbaum, & Snow,
1997, p. 191)
Singh and Rosengrant (2003) experimented with a set of unusually designed physics SR items that probed complex concepts and principles. They also interviewed students and discovered that the students' qualitative explanations of concepts and principles were more naive than an inference drawn from their SR responses alone would suggest. This kind of study might also be done with CR
items with the same results. Nonetheless, these researchers make a good point that the fidelity
of comparable CR items provides a truer picture than simply selecting an answer from a list of
options. In a medical education setting, Coderre, Harasym, Mandin, and Fick (2004) reported
the efficacy of two SR formats for measuring medical problem-solving. Although a difference
was found between the two formats, the researchers concluded that SR formats were successful
in measuring problem-solving. Their study involved think-aloud impressions from novices and
experts.
A persistent premise is that the format emphasized in a test will shape instruction, leading teachers to focus on that format. Heck and Crislip (2001) examined this premise with a large, representative sample of third-grade students in writing. While girls outperformed boys on both SR and CRSS measures, the CRSS measures showed smaller differences.
One line of research tests the hypothesis that taking any SR test may aid learning but also pro-
mote false knowledge when students choose wrong answers (Marsh, Roediger, Bjork, & Bjork,
2007; Roediger & Marsh, 2005). Although negative effects were detected, the researchers argued
that the overall net effect was positive. Thus, SR tests may provide a stimulus for learning. Other
researchers have also pursued this principle by providing feedback when correct or incorrect
answers are chosen. A review and meta-analysis of this line of work reported 29 studies showing a positive
effect and only six studies showing a negative effect. Another study examined the effects of feed-
back on SR test takers (Butler & Roediger, 2008). They found that immediate and delayed feed-
back helped future test performance over a no-feedback condition. Thus, SR testing also becomes
a method of teaching. A study by Dihoff, Brosvic, Epstein, and Cook (2004) also found significant
benefits from feedback during test preparation. These studies support the time-honored princi-
ple that feedback from SR tests can be very beneficial for future learning. Thus far, these claims
have not been made or verified by research with CR formats.
An important line of research relates to the diagnostic information that can be mined from
SR items (Tsai & Chou, 2002). Although such approaches have been theorized often (Roid &
Haladyna, 1980), and research has been done (Haladyna, 2004), these approaches have not led to
operational testing programs that accomplish this worthy end. More about diagnostic testing’s
potential is discussed in the final chapter of this book.
Endorsements from educators may provide another source of evidence. Lane and Stone (2006) reported several studies of the consequences of using performance-type measures in statewide testing. They noted that teachers' endorsements of CRSS items corresponded with improvements in test scores, although the improvements were small. If the introduction of CRSS items with high cognitive demand leads to improved instruction, then future research findings might validate the increased use of the CRSS format.
One response to the concern about the influence an item format may have on learning comes from the American Educational Research Association (2000). One of its guidelines for high-stakes testing encourages test preparation that includes practice on a variety of formats rather than only those used in the test. Such test preparation, along with the appropriate use of a variety of item formats, may be a good remedy for the threat to validity described in this section.
Methodological Issues
Most of the research reported thus far shows a high degree of correlation between SR and both
types of CR measures where the construct is an ability (e.g., reading, writing, mathematical prob-
lem solving). However, we have instances where format differences seem to exist, although to a
small degree. Many of these researchers have commented on methodological issues and concerns
that may affect these research results. This section discusses methodological issues that future
research should address.
As Cole (1991) remarked in her presidential address at an American Educational Research Asso-
ciation meeting, educators have not done a good job of defining educational constructs. The most
basic concern about any item format’s properties starts with the definition of the construct. For
instance, Haladyna and Olsen (submitted for publication) concluded, after an extensive review,
that writing ability has many challenges in its definition that limit validity. They identified 15 factors
affecting validity that may be resolved with an improved construct definition. Both Martinez (1990)
and Rodriguez (2002) favor a theoretical analysis that involves construct definition and an
understanding of the capabilities of these item formats to measure the tasks in our target domain.
Reliance on strictly psychometric analyses might be a mistake. One such approach that has been
championed by cognitive psychologists is item modeling. Such approaches involve a cognitive
task analysis that identifies the knowledge and skills needed to perform the tasks of interest. Evidence-centered assessment design is a construct-centered approach (Mislevy, 2006). In this approach, expertise in task/item design, instruction, and psychometrics is used to create test items. It has been argued in this book and by others that the use of SMEs in such deliberations is an important aspect of item
validity evidence. Psychometric expertise is also part of this recipe. Chapter 8 discusses item
generation procedures. Chapter 20 provides a glimpse into the future of item development from
a cognitive psychology perspective.
Dimensionality is a methodological issue with these studies. Martinez (1999) warned us not to
be seduced by strictly psychometric evidence. Studies reviewed by Thissen, Wainer, and Wang
(1994) and Lukhele, Thissen, and Wainer (1994) provided convincing evidence that in many
circumstances, CR and SR items lead to virtually identical interpretations due to unidimen-
sional findings following factor analysis. Earlier studies by Martinez (1990, 1993) offer evidence
that different formats may yield different types of student learning. However, when content is
intended to be similar, SR and CROS item scores are highly related (Rodriguez, 2002, 2004).
Wainer and Thissen (1993) commented that measuring a construct somewhat less faithfully but more reliably can be better than measuring it more faithfully but less reliably. In other words, because of its proximity to a CRSS test, the SR test might serve as a more reliable proxy.
This idea may explain why SR items are used in a writing test. Writing performance items have
the highest fidelity with the target domain, but writing performance test scores are less reliable
than a test composed of SR items.
The design of items has been identified as a factor in how items perform (Hamilton, 1998).
Ryan and DeMark (2002) argued that the properties of test items may depend on the way the
construct is defined and be less influenced by their structural anatomy. Rodriguez (2002) also
noted that item writing practices and item design may influence test scores in unintended ways.
Interviews with students show that the design of items and students’ testwise strategies figure
into performance. Martinez (1999) stated that the development of options in SR items relates
to the cognitive demands on test takers. Hamilton, Nussbaum, and Snow (1997) also concluded
that the interviews with students that they conducted exposed nuances in item development that
would improve the writing of both SR and CR items. One of the most highly complex studies
involved science performance items that received a high degree of attention in scoring (Stecher,
Klein, Solano-Flores, McCaffery, Robyn, Shavelson, & Haertel, 2000). Although they used an
item-writing algorithm that was theory-based, the results were disappointing because performance across supposedly parallel items lacked the expected similarity. Nonetheless, high-quality studies such as this one show that more attention needs to be given to item design. This theme recurs in the studies cited here and applies to all item formats.
Studies involving reading comprehension offer a unique setting for studies of item format
effects. Traditionally, the reading passage is connected to a set of test items. Hannon and Dane-
man (2001) studied whether test performance varies when students read the passage first or read
the items first. They found systematic differences in performance of a complex nature. Research
by Katz and Lautenschlager (2001) found that variation in performance at the item level may be
attributed to test-taking skills and students' prior knowledge. Another study with similar intent, which used student interviews, led Rupp, Ferne, and Choi (2006) to conclude that the item format seems to elicit complex cognitive strategies that may not be consistent with the intended construct, in this case reading comprehension. Their suggestion is to develop response-processing
models that are consistent with the item format. Methods for exploring sources of contamination
in item formats have improved significantly and now involve the direct questioning of students
(Ercikan, Arim, & Law, 2010).
We have discovered that the SR format should not be rejected but can be used appropriately with careful design principles. We should not conclude that SR or CR formats possess fixed, unique properties; rather, both formats can generate the desired results if item writing is improved. Toward this end, we favor theoretical
approaches to item-writing that have been validated and guidelines that have research support
(see Haladyna & Downing, 1989a, 1989b; Haladyna, Downing, & Rodriguez, 2004). Katz and
Lautenschlager (2000) experimented with passage and no-passage versions of a reading com-
prehension test. From their results, they argued that students had outside knowledge and could
answer items without referring to the passage. This research and earlier research they cite shed
light on the intricacies of writing and validating items for reading comprehension. They con-
cluded that a science for writing reading comprehension items does not yet exist, and that we can
do a better job of validating items by doing a better analysis of field-test data and more experi-
mentation with the no-passage condition.
Part of the problem with the study of the content and cognitive demand of SR and CR items
concerns the CRSS format. First, item development is not based on scientific grounds. We have
no extant item development theories or technologies. Most CRSS item development approaches
are prescriptive and based on experience of test developers. We have no systematic taxonomy
of rubrics. Many researchers have studied the cognitive processes underlying CRSS scoring (e.g.
DeRemer, 1998; Penny, 2003; Weigle, 1999; Wolfe, 1999). Their findings reveal that many factors affect ratings of student performance. For instance, longer written responses get higher scores than shorter ones (Powers, 2005). Do longer responses reflect quality or simply wordiness? Raters commit many systematic errors in scoring that bias test scores. Among these errors are severity/leniency, central tendency, halo, idiosyncrasy, and logical error. These errors are
well documented (Engelhard, 2002; Hoyt, 2000; Myford & Wolfe, 2003). Chapters 12, 13, and
18 provide more information on threats to validity arising from rater bias. The CRSS format can
measure complex learning, but this format also presents many challenges in design and many
threats to validity.
Another factor that has troubled researchers and others who have studied the issue of for-
mat effects is format familiarity. As students become more familiar with formats that they have
not used in the past, performance improves (Fuchs, Fuchs, Karns, Hamlett, Dutka, & Katzaroff,
2000). Thus, the improvement is not actual learning but increased experience with an unfamiliar
format.
A final comment on methodological problems relates to the quality of research. For instance,
in studies where CR and SR formats were compared and a low degree of relationship was reported, item difficulty was used as the criterion for comparison (see Kuechler & Simkin, 2010), yet CR and SR difficulties are on different scales and thus not directly comparable. When CR and SR scores are correlated, reliabilities
should be reported, and the correlation should be corrected for attenuation to detect the true,
theoretical relationship. Also, CR test scores have construct-irrelevant factors embedded such
as writing ability and scorer bias. These factors, if not considered in analysis and interpretation
of results, may account for a low correlation between CR and SR scores of the same presumed
construct.
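As a minimal illustration of the disattenuation step, consider the classical Spearman correction; the observed correlation of .55 and the reliabilities of .70 and .90 used below are hypothetical values chosen for illustration, not results from any study cited here.

import math

def disattenuated_correlation(r_xy, rel_x, rel_y):
    # Spearman's correction for attenuation: estimate the correlation between
    # true scores from the observed correlation and the two score reliabilities.
    return r_xy / math.sqrt(rel_x * rel_y)

# Hypothetical values: observed CR-SR correlation .55, CR reliability .70
# (depressed by rater error), SR reliability .90.
print(round(disattenuated_correlation(0.55, 0.70, 0.90), 2))  # prints 0.69

An observed correlation in the .50s can thus understate a substantially stronger true-score relationship when one of the measures is scored unreliably.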
A conclusion we can draw from these methodological concerns is that these studies have
shown that SR, CROS, and CRSS formats can measure knowledge, skills, and abilities very
effectively. We can do a better job of defining the construct before measuring it. What we have
learned is that the greater the effort put into item development, the better the item performs,
whatever its format. Another important finding is that interviews with test takers are most
revealing about the cognitive demand capabilities of all test items. Statistical methods, includ-
ing factor analysis, are not sufficient evidence for dimensionality when different item formats
are being compared. A theoretical analysis should always precede statistical analysis. By conducting think-aloud interviews, we gain a greater understanding of how to design items that elicit valid responses. Future researchers of item-format capabilities need to be informed of these issues and to design studies that probe more precisely into similarities and differences.
In summary, the choice among formats follows from what is to be measured.
1. If knowledge is to be measured and the design of the test calls for sampling from a domain of knowledge, which item format gives the best sampling from the domain? SR is superior to the CROS. Whether the cognitive demand is recall or understanding, SR seems justified.
2. If cognitive skills are to be measured, the CROS has very high fidelity and yields high
reliability. However, the SR format provides a lower-fidelity alternative that has a large
advantage in scoring efficiency but a large inefficiency in item development. So a trade-off
exists between these two options.
3. If a physical skill is to be measured, the CROS is the right choice. Usually the physical skill
can be objectively observed. If the physical skill requires graded, subjective judgment, then
CRSS must be used as it contains a rating scale and the rater/judge must infer how much
performance was observed.
4. If a cognitive ability is measured, we have a target domain of tasks and a universe of gener-
alization of test-like tasks from which we compose a test. The logical choice is a CRSS. The
performance is observed by a rater/judge who must decide the degree of the performance
using a descriptive rating scale/rubric or a set of these rating scales. It is conceivable that
such tasks might be objectively scored, but finding instances of this for a complex ability
is very hard.
II
Developing Selected-Response Test Items
Overview
In previous chapters, the groundwork has been established as follows. The content of tests consists of
knowledge, cognitive and physical skills, and abilities. Three item format types have been identi-
fied: selected-response (SR), constructed-response, objective scoring (CROS), and constructed-
response, subjective scoring (CRSS). The SR format is very appropriate for measuring knowledge
and skills and has limited application for the kinds of complex tasks characteristic of ability.
Ordinarily, the CRSS is used to measure these complex tasks. Sometimes, we are willing to sac-
rifice some fidelity and use an SR item instead of a CRSS item to increase reliability and obtain a
better sampling of content.
This chapter addresses a variety of SR formats. For each of the SR formats presented, examples
are provided and advice is offered about how effectively each format can be used. Two formats
are not recommended. In each instance, the reasoning and research behind each recommenda-
tion are provided.
The correct option is undeniably the one and only right answer. In the question format, the cor-
rect choice can be a word, phrase, or sentence. With the incomplete stem, the second part of the
sentence is the option, and one of these options is the right answer.
Distractors are the most difficult part of the test item to write. Distractors are unquestionably
wrong answers. Each distractor must be plausible to test takers who have not yet learned the
knowledge or skill that the test item is supposed to measure. To those who possess the knowl-
edge asked for in the item, the distractors are clearly wrong choices. Distractors should resemble
the correct choice in grammatical form, style, and length. Subtle or blatant clues that give away
the correct choice should always be avoided. Good distractors should be based on common errors
of students who are learning. Distractors should never be deceptively correct (tricky).
Question Format
Three examples are presented, each for a different type of content. The first shows this format used
for a knowledge-based item requiring comprehension. Figure 5.1 shows a CMC item in the ques-
tion format.
1. The student misbehaved. What does the word “misbehaved” mean in this sentence?
A. Behaved well
B. Behaved quietly
C. Behaved noisily
D. Behaved badly
Instead of traveling over primitive roads to the South, people used the easier and cheaper
waterways.
The next item requires the application of knowledge and skill to select the correct answer. This
type of MC is intended to simulate a complex task (see Figure 5.3).
3. Kim needs $6.00 to go to a movie. She has $3.30 in her coat. In her desk she finds six
quarters, four dimes, and two nickels. How much money does Mom need to give her
so Kim can go with her friends?
A. $0.70
B. $2.70
C. $4.30
D. $7.30
Statman (1988) asserted that with the completion format, the test taker has to retain the stem
in short-term memory while completing this stem with each option. The test taker must evaluate
the truthfulness of each option. If short-term memory fails, the test taker has to go back and forth
from the stem to each option, making a connection and evaluating the truth of that connection.
The use of short-term memory may provoke test anxiety. The mental steps involved in answer-
ing a completion item also take more time, which is undesirable. Nevertheless, research has shown no appreciable difference when these two formats are compared (Rodriguez, 1997, 2002, 2004). Our experience with this format shows that if the item is well written, it functions just as
well as the question format CMC.
5. Which is the most effective safety feature in your car for a front-end crash?
A. Seat belt
B. Front air bag
C. Side air bag
D. An alert driver
Implicit in this format is that all four choices have merit, but when a criterion or a set of criteria
is used, one of these choices is clearly the best.
Figure 5.6 Example of a CMC item with blanks inserted in the stem.
7. Draw four samples randomly from a distribution with a mean of 50 and a standard
deviation of 10. Find the standard deviation of your sample of four.
A B C D E F G H I
1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
The generation of numbers for distractors is easy. Because writing distractors is the hardest step in writing a CMC item, this variation can be very effective for quantitative items. In fact, the example above can be used to generate many similar test items. This format also reduces the tendency for students to guess the right answer, and some testing researchers suspect that the conventional CMC provides too many clues in the options. With this format, students work the problem and then choose the option closest to their answer.
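A minimal sketch of how such numeric option rows might be generated is given below; the function name, the keyed value of 5.0, and the 0.5 spacing are illustrative assumptions rather than details taken from the item above.

import random
import string

def numeric_option_row(key, step, n_options=9, rng=random):
    # Build an evenly spaced row of lettered numeric options that contains the
    # keyed value at a random position, so its location does not cue the answer.
    key_slot = rng.randrange(n_options)      # position of the keyed value
    start = key - step * key_slot            # left end of the row
    values = [round(start + i * step, 2) for i in range(n_options)]
    letters = string.ascii_uppercase[:n_options]
    return dict(zip(letters, values)), letters[key_slot]

# Hypothetical use: a keyed value of 5.0 with options spaced 0.5 apart.
options, keyed_letter = numeric_option_row(5.0, 0.5)
print(options, "key =", keyed_letter)

Varying the stimulus numbers and regenerating the option row in this way is one route to producing many parallel quantitative items, which is the efficiency claimed for this variation.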
Fajardo and Chan (1993) gave an example using a key word or key phrase list in the hundreds.
The student is expected to read an item stem and search the list for the correct answer. Guessing is
virtually eliminated. These items have good qualities: namely, they provide diagnostic information
about failure to learn (Fenderson, Damjanov, Robeson, Veloski, & Rubin, 1997). Test designers can
study patterns of response and determine what wrong choices students are making and study why
they are making these wrong choices. The uncued MC also tends to be more discriminating at the
lower end of the test score scale and yields higher reliability than CMC. These researchers argue that
the writing of distractors for many items is eliminated once the key word list is generated.
Figure 5.8 Example of a CMC item with both and neither options.
The Three-Option MC
As a matter of theory and empirical research, the number of distractors required for the CMC
item is no longer controversial (Ebel, 1981, 1982; Haladyna & Downing, 1993; Haladyna, 2004;
Lord, 1977; Rodriguez, 2005). Nearly all theorists and researchers have advocated a three-option
MC. Apart from theory and research, personal experience in item development has supported
this opinion. In this section we briefly review the theory, research, and practical considerations
that lead to the recommendation that the three-option MC format is better than the four-option
or five-option CMC format.
Theory
Theoretical studies of this problem have led to the same conclusion: three options are optimal (Grier, 1975, 1976; Lord, 1944, 1977; Tversky, 1964). Lord's study is the most informative because he compares Grier's and Tversky's theoretical findings with third and fourth approaches to the desirable number of options. “The effect of decreasing the number of choices
per item while lengthening the test proportionately is to increase the efficiency of the test
for high-level examinees and to decrease its efficiency for low-level examinees” (Lord, 1977,
p. 36).
In effect, for high-performing test takers, the two-option format seems to work quite well,
because most potential options are implausible and high performers do not guess very much.
For average test takers, three options are appropriate. For lower-performing test takers who are
prone to random guessing, four or five options for an item seem to work well. Lord commented
that no studies he reviewed considered the performance level of test takers. From a theoretical
perspective, it would appear that if precision is sought in the lower end of the scale, then four-
and five-option CMC items are desirable. If precision is more important in the middle and upper
parts of the scale, then two- and three-option MC items are better. Lord’s conclusion is based
strictly on precision and not on other factors such as item development costs and feasibility,
which are significant issues in this argument about the number of options.
Research
The most comprehensive study of distractor functioning included more than 1,100 items from
four standardized tests with different content and purposes (Haladyna & Downing, 1993). They
defined three types of distractors: (a) discriminating, meaning low scorers choose it and high scorers avoid it; (b) non-discriminating; and (c) seldom chosen, which indicates implausibility. They found that when the non-discriminating distractors were counted and
removed, most items had only two or three options. They concluded that three options (a right
answer and two distractors) were optimal. Few items had three functioning distractors. A meta-
analysis and evaluation of the extensive theoretical and empirical literature and narrative reviews
surrounding this issue were done by Rodriguez (2005). After a painstaking and comprehensive
study of this issue, he drew this conclusion: “Based on this synthesis, MC items should consist of
three options, one correct option and two plausible distractors. Using more options does little to
improve item and test score statistics and typically results in implausible distractors” (Rodriguez,
2005, p. 11).
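The kind of distractor screening these studies describe can be sketched very simply; the classification rules and the 5% choice-rate threshold below are illustrative choices, not the procedures used by Haladyna and Downing (1993) or Rodriguez (2005).

def classify_distractors(choices, total_scores, key, min_rate=0.05):
    # Rough sketch of a distractor functioning analysis for one item.
    # choices: option chosen by each examinee; total_scores: their total scores;
    # key: the correct option; min_rate: choice rate below which an option is
    # treated as seldom chosen (implausible).
    n = len(choices)
    report = {}
    for label in sorted(set(choices) - {key}):
        chose = [s for c, s in zip(choices, total_scores) if c == label]
        others = [s for c, s in zip(choices, total_scores) if c != label]
        if len(chose) / n < min_rate:
            report[label] = "seldom chosen (implausible)"
        elif sum(chose) / len(chose) < sum(others) / len(others):
            report[label] = "discriminating"   # examinees choosing it score lower overall
        else:
            report[label] = "non-discriminating"
    return report

# Hypothetical responses to one item (key = "A") and total test scores.
choices = ["A", "B", "A", "C", "A", "B", "A", "A", "B", "A"]
scores = [38, 22, 35, 30, 40, 25, 36, 39, 20, 37]
print(classify_distractors(choices, scores, key="A"))

Operational programs use stronger statistics (chapter 17 describes methods for analyzing distractors), but the logic of flagging non-discriminating and seldom-chosen options is the same.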
Practical Considerations
From a practical standpoint, these arguments are presented in favor of the three-option format.
1. SR item-writing is a very expensive process that uses considerable time of SMEs. Those
who have to write four-option and five-option CMC items report anecdotally that devel-
oping the fourth and fifth option is not only time-consuming but also futile. Creating
plausible fourth and fifth options based on common student errors is very hard. As previ-
ously noted, Haladyna and Downing (1993) discovered that fourth and fifth options were
usually non-functioning. Thus, item development time of these SMEs is wasted on devel-
oping these fourth and fifth options.
2. Item development cost for three-option items is less than the cost for four-option and five-
option items.
3. If three-option items replace four-option and five-option items, administration time for a fixed-length test will be reduced. To fill this unallocated time, more three-option items can be added, which improves the sampling of content and test score reliability. Thus, content-related validity evidence and reliability are improved (a brief numerical sketch follows this list).
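The reliability gain mentioned in point 3 can be projected with the Spearman-Brown prophecy formula; the starting reliability of .85 and the 25% increase in test length below are illustrative numbers only.

def spearman_brown(reliability, length_factor):
    # Spearman-Brown prophecy: projected reliability when a test is lengthened
    # (or shortened) by the given factor, assuming the added items are parallel.
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# Illustrative numbers: a 40-item test with reliability .85 gains time for
# 10 more three-option items of comparable quality (length factor 50/40).
print(round(spearman_brown(0.85, 50 / 40), 3))  # prints 0.876

The gain is modest in this illustration, but it comes at no cost in testing time, and the added items also broaden the sampling of content.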
The main argument for four-option and five-option conventional CMC is that guessing contrib-
utes construct-irrelevant variance. The lucky or unlucky guesser will get an undeserved higher
or lower score. However, this argument is specious. Random error from guessing can be positive or negative, but it is small, and on the proportion-correct scale the error attributable to guessing shrinks toward zero as the test gets longer. The chance-score floor for a test of three-option items is 33%, whereas for four-option items the floor is 25% and for five-option items it is 20%. Few testing programs are concerned with scores that low. Low-scoring test takers are the most likely to guess randomly, and for them such variation is likely to be inconsequential.
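A minimal sketch of this point, under the simplifying assumption of blind guessing on every item (real test takers rarely guess blindly, as the next paragraph notes), uses the binomial model for the chance component of the proportion-correct score.

import math

def chance_score_sd(n_items, n_options):
    # Standard deviation of the proportion-correct score produced by blind
    # random guessing on every item, under a binomial model.
    p = 1 / n_options
    return math.sqrt(p * (1 - p) / n_items)

# Illustrative values for three-option items: the chance component shrinks
# as the test gets longer.
for n in (25, 50, 100):
    print(n, round(chance_score_sd(n, 3), 3))   # 0.094, 0.067, 0.047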
Guessing is a complex concept. As Budescu and Bar-Hillel (1993) and others have also noted,
a guess on an SR item can reflect complete ignorance (a random guess) or a testwise strategy in which implausible distractors are eliminated, making a correct guess easier. As
most test takers have an option elimination strategy that is part of their testwiseness, random
guessing in the presence of complete ignorance is very rare. If options are implausible or non-
discriminating, these four-option and five-option items are by default two- or three-option items
anyway. Consequently, guessing is much overrated as a threat to validity.
Below in Figure 5.9 is an example of an item written for educators that shows the futility of
four- or five-option item-writing.
9. You are reporting the typical price of homes in a neighborhood, but several homes
have very high prices and the other homes are moderately priced. Which measure of
central tendency is appropriate?
A. Mean
B. Median
C. Mode
D. Mediatile
E. Interquartile range
The test item appears to ask the test taker to apply knowledge to choose the correct measure of
central tendency. The first three options are expected and typical. Options D and E are obvious add-
ons. These fourth and fifth options are implausible and obviously not right. Nevertheless, the item satisfies the requirement for five options. This is a typical way to expand an item's options for no benefit.
Recommendation
Without any reservation, the three-option MC is superior to the four- and five-option CMC.
Four- and five-option CMC should not be used, unless the options are logically based on com-
mon student errors, and an item analysis reveals that all distractors are working as intended. That
is, each distractor needs to be evaluated by SMEs and in pilot testing should perform as predicted.
Previous studies have shown convincingly that if such meticulous analysis of four-option and
five-option CMC items were done, we would discover that many distractors are not working as
expected. Methods for analyzing distractors are described in chapter 17. Because research continues to show that fourth and fifth options usually do not perform, and item writers consistently report frustration in creating them, writing fourth and fifth options seems pointless.
The Alternate-Choice (AC) Format
The two-option, alternate-choice (AC) format has been reviewed favorably (Downing, 1992; Ebel, 1981, 1982); these reviews concluded that the AC format is viable. As noted previously, Lord (1977) argued that
for testing high-achievers, most CMC items have only two working options. Levine and Drasgow
(1982) and Haladyna and Downing (1993) provided further support and evidence for the two-
option format. Many four-option or five-option conventional CMC items have one or more non-
functioning distractors. If distractors were evaluated and those not performing were removed,
AC testing would be very prominent.
Evidence for high reliability of a test composed of AC items is abundant (Burmester & Olson,
1966; Ebel, 1981, 1982; Hancock, Thiede, & Sax, 1992; Maihoff & Mehrens, 1985). Also, AC items
have a history of exhibiting satisfactory discrimination (Ruch & Charles, 1928; Ruch & Stoddard,
1925; Williams & Ebel, 1957).
Figure 5.10 shows a simple example of an AC item that measures knowledge at a comprehen-
sion/understanding cognitive demand for students who have been studying differences between
similes and metaphors.
This item is not a memory type unless the two examples have been presented to a learner
before a test. The best way to test for comprehension/understanding is to provide novel content.
Figure 5.11 gives an example of a set of AC items that tries to model editing skill in writing.
Although actual editing of an essay has high fidelity, the AC item does a good job of simulating
actual editing decisions in a short administration time. This is why the AC format is so useful for
these kinds of skills.
11. (A-Providing, B-Provided) that all homework is done, you may go to the movie.
12. It wasn’t very long (A-before, B-until) Earl called Keisa.
13. Knowledge of (A-preventative, B-preventive) medicine will lengthen your life.
14. All instructions should be written, not (A-oral, B-verbal).
15. She divided the pizza (A-between, B-among) the three boys.
16. The (A-exact, B-meticulous) calculation of votes is required.
17. I make (A-less, B-fewer) mistakes now than previously.
18. The climate of Arizona is said to be very (A-healthful, B-healthy).
In the example in Figure 5.11, note that these items have sentences that are not designed to
tap memory but to provide previously unencountered sentences needing the choice of a correct
word. Also note that the items are compactly presented, easy to respond to, and provide eight
score points. A test composed of AC items can be very briefly presented yet have considerable test
length, which will often generate very high reliability.
Although the AC format is a slimmer version of a CMC item, it is NOT a true–false (TF) item.
AC offers a comparison between two choices, whereas the TF format does not provide an explicit
comparison between two choices. With the TF format, the test taker must mentally create the
counterexample and choose accordingly.
The only limitation is the fear that guessing will inflate a test score. As argued previously, random
guessing will not greatly distort a test score because random guessing is governed by principles of
probability. The floor of an AC test is 50%, so standards for interpreting a test score need to recog-
nize this fact. For instance, a score of 55% is very low. If item response theory is used to scale an AC test, creating a score scale containing highly discriminating items that fit the model is simplified. If a student's true score is at the floor of the scale, what is the probability that random guessing alone will earn a score of 60% or 70% on a test of 50 items or more? Small for 60% on a 50-item test and nearly zero for 70%, and both probabilities shrink rapidly as the test gets longer.
Recommendation
Downing (1992) recommended the AC format for formal testing programs, because AC has been
found comparable to three- or four-option items, if properly constructed (Burmester & Olson,
1966; Maihoff & Phillips, 1988). As many CMC items are actually AC items with two or three
useless distractors, this recommendation is easy to support.
The True–False (TF) Format
19. The first thing to do with an automatic transmission that does not work is to check the
transmission fluid. (A)
20. The major cause of tire wear is poor wheel balance. (B)
21. The usual cause of clutch “chatter” is in the clutch pedal linkage. (A)
22. The distributor rotates at one half the speed of the engine crankshaft. (B)
The TF format has been well established for classroom testing but seldom used in standardized
testing programs. Haladyna, Downing, and Rodriguez (2002) found that for a contemporary set
of educational measurement textbooks, all 26 recommended TF items.
However, there is evidence to warrant some concern with its use (Downing, 1992; Grosse &
Wright, 1985; Haladyna, 1992b). Like other SR formats, the TF format can be misused. The most
common misuse is excessive testing of the recall of trivial knowledge, but this misuse can be found
with any item format. Peterson and Peterson (1976) investigated the error patterns of positively
and negatively worded TF questions that were either true or false. Errors were not evenly distrib-
uted among the four possible types of TF items. Although this research is not damning, it does
warn item writers that the difficulty of the item can be controlled by its design.
Figure 5.13 shows how a simple chart converts to a TF format consisting of 12 responses. This
example is not a conventional TF format, but one that has a theme that groups items in a homo-
geneous way.
Place an “X” beneath each structure for which the characteristic is true.
Characteristic Structure
Root Stem Leaf
23. Growing point protected by a cap
24. May possess a pithy center
25. Epidermal cells hair-like
26. Growing region at tip
Hsu (1980) pointed out that the design of the item and the format for presentation as shown
above are likely to cause differential results. An advocate of the TF format, Ebel (1970) opposed
the grouping of items in this manner. However, there is no research to support or refute grouping TF items in this way.
Grosse and Wright (1985) argued that TF has a large error component due to guessing, a
finding that other research supports (Frisbie, 1973; Haladyna & Downing, 1989b; Oosterhof
& Glasnapp, 1974). Grosse and Wright claimed that if a test taker’s response style favors true
instead of false answers in the face of ignorance, the reliability of the test score may be seriously
undermined. A study comparing CMC, AC, and TF formats showed very poor performance for TF with respect to reliability (Pinglia, 1994).
As with AC, Ebel (1970) advocated the use of TF. The chapter on TF testing by Ebel and Frisbie
(1991) remains an authoritative work. Ebel's (1970) argument runs as follows: command of useful knowledge is important; all verbal knowledge can be stated as propositions; each proposition can be stated truly or falsely; and we can therefore measure student knowledge by determining the degree to which each student can judge the truth or falsity of such propositions. Frisbie and Becker (1991) synthe-
sized the advice of 17 textbook sources on TF testing.
The advantages and criticisms of TF items have been summarized by these authors. As noted, some of the criticisms have been answered. The more important issue is: Can TF
items be written to measure nontrivial content? Ebel and Frisbie (1991) provided an unequivocal
“yes” to this question.
Recommendation
Given widespread support among testing experts, TF is recommended for instructional testing
with the caveat that it be done well. For standardized testing programs, other formats described in this chapter are more useful and carry less negative research evidence.
The Complex Multiple-Choice Format
27. Which actors are most likely to appear in the 2015 movie Avatar 3: Who are the aliens?
1. Sigourney Weaver
2. Meryl Streep
3. Nicole Kidman
A. 1 and 2
B. 2 and 3
C. 1 and 3
D. 1, 2, and 3
The Educational Testing Service first introduced this format, and the National Board of Medi-
cal Examiners later adopted it for use in medical testing (Hubbard, 1978). Because many items
used in medical and health professions testing programs had more than one right answer, com-
plex MC permits the use of one or more correct options in a single item. Because each item is
scored either right or wrong, it seems sensible to set out combinations of right and wrong answers
in a CMC format where only one choice is correct.
This format was very popular in formal testing programs, but its popularity is justifiably wan-
ing. This format has received evaluation of its qualities (Albanese, 1993; Haladyna, 1992b; Haladyna & Downing, 1989b). These evaluations document several reasons for not using this format.
Studies by Case and Downing (1989), Dawson-Saunders, Nungester, and Downing (1989), and
Shahabi and Yang (1990) provided additional evidence of the inferiority of the complex MC.
However, Subhiyah and Downing (1993) provided evidence that no difference exists, that com-
plex MC items have about the same item difficulty and item discrimination qualities as CMC.
Recommendation
This format should not be used. A better alternative is the format presented next.
The Multiple True–False (MTF) Format
Note that the example in Figure 5.15 is not true or false but another dichotomy (absurd/realistic). In fact, the MTF format is applicable to any dichotomy. The MTF format has several advantages:
1. This format avoids the disadvantages of the complex MC format and is a good alternative
to it.
2. Researchers have established that the MTF format produces higher reliability estimates
when compared with the CMC items (Albanese, Kent, & Whitney, 1977; Downing et al.,
1995; Frisbie & Druva, 1986; Frisbie & Sweeney, 1982; Hill & Woods, 1974).
3. Frisbie and Sweeney (1982) reported that students preferred the MTF compared with
CMC. Oddly enough, Hill and Woods (1974) reported that the MTF items seemed harder,
but several students anecdotally reported that the MTF items were better tests of their
understanding.
4. The MTF is very efficient in item development, examinee reading time, and the number
of items that can be asked in a fixed time. For instance, placing 30 MTF items on a page is
possible and administering more than 100 items per 50-minute testing period is feasible.
Given that the test score scale ranges from 50% to 100%, interpretation of results should
recognize this fact.
Recommendation
The MTF format is an effective substitute for the complex MC. Because the MTF has inherently good characteristics for measuring knowledge and some skills, it is underutilized and should be used more widely, in instructional testing as well as in standardized testing programs.
The Matching Format
In the matching format, the stems are presented on the left and the options are presented on the right. The instructions
that precede the options and stems tell the test taker how to respond and where to mark answers.
Figure 5.16 presents an example.
Directions: On the line next to each author in Column A, place the letter of the type of writ-
ing in Column B for which the author is best known. Choices in Column B may be used
once, more than once, or not at all.
Column A Column B
______38. Janet Evanovich A. History
______39. Ray Bradbury B. Tragedy
______40. Bill Bryson C. Humor
______41. Robert Frost D. Mystery
______42. Gore Vidal E. Poetry
______43. John Irving F. Science Fiction
G. Adventure
We could easily expand the list of six statements in Figure 5.16 into a longer list, which would
make the set of items more comprehensive.
In a survey of measurement textbooks, Haladyna, Downing, and Rodriguez (2002) discovered
that every measurement textbook surveyed recommended the matching format. Interestingly,
there is no cited research on this format in any of these textbooks or prior reviews of research on
item formats. This format is seldom used in standardized testing programs. One major reason is that a set of matching items usually measures very specific content and so does not lend itself to the broad sampling of content such programs require.
Linn and Gronlund (2000) and Nitko (2001) both offered excellent instruction on design-
ing effective matching items. The former authors suggest the following contexts for matching
items: persons and achievements, dates and events, terms and definitions, rules and examples,
symbols and concepts, authors and books, English and non-English equivalent words, machines
and uses, plants or animals and classification, principles and illustrations, objects and names of
objects, parts and functions. As you can see, these strategies can lead to items with a cognitive
demand higher than just recognition. Also, the cognitive demand of matching items can be recall or understanding. Accomplishing the latter requires novel presentation of the stems or the options. For example, content may be presented one way in a textbook or in instruction, but
the stems or options should be paraphrased in the matching item.
The matching format has many advantages:
6. The options do not have to be repeated. If we reformatted this into the CMC, then it would
require the repeating of the five options for each stem.
Two common faults in writing matching items are to:
1. write as many items as there are options, so that the test takers match up item stems to
options. For instance, we might have five items and five options. This item design invites
cuing of answers. Making the number of options unequal to the number of item stems can
avoid this problem.
2. mix the content of options: for instance, have several choices be people and several choices
be places. The problem is non-homogeneous options. It can be solved by ensuring that the
options are part of a set of things, such as all people or all places.
Recommendation
Matching items seem well suited for instructional (classroom) testing of understanding of con-
cepts and principles. Matching does not seem suitable for standardized tests where content is
broadly sampled.
The Extended-Matching (EM) Format
Lead-in:
For each patient, select the mechanism that explains the edema.
Each option can be used once, more than once, or not at all.
Sample Items:
44. Leucocytes, hydro-thorax and hydro-pericardium in a 10-year-old dog with
glomerulonephritis.
45. A 6-year-old Shorthorn cow developed marked pulmonary edema and dyspnea eight
days after being moved from a dry summer range to a lush pasture of young grasses
and clover.
Figure 5.17 Theme, options, lead-in, and a few items from an EM item set.
Adapted from: http://scholar.lib.vt.edu/ejournals/JVME/V20 3/wilson.html
This format is widely used in medicine and related fields both in the United States and the United
Kingdom. In fact, Alcolado and Mir (2007) have published a book containing 200 extended
matching item sets for the allied health sciences. This format has versatility for a variety of situa-
tions. An excellent instructional source for this format can be found in Case and Swanson (2001),
also available online at http://www.nbme.org/.
Recommendation
This format seems suitable for instructional testing and has application in large-scale standard-
ized testing programs. The cognitive demand possible with this format is a very attractive feature.
Although there is little research on the EM format, this research supports its use.
The Testlet
The testlet is a mini-test. Its anatomy is very simple. First a stimulus is presented. This stimulus
might be a reading passage, poem, work of art, photograph, music, chart, graph, table, article,
essay, cartoon, problem, scenario, vignette, experiment, narrative, or reference to an event, per-
son, or object (Haladyna, 1992a). Customarily, SR items are used in testlets. However, testlets can
also be presented using open-ended questions that require expert scoring. For instance, the State of Colorado's assessment program uses such items (http://www.cde.state.co.us/cdeassess/documents/csap/2010/GR_4_Reading_Released_Items.pdf). But these CR examples are the exception to the rule. Other terms for the testlet include interpretive exercise, scenario, vignette, item bundle, problem set, super-item, and context-dependent item set.
The principal advantage of the testlet using SR formats is its capacity to model the kind of
complex thinking found in a CRSS item that measures a task from the universe of generalization
for any ability. Using SR item formats provides greater efficiency for test administration and
objective scoring.
The testlet is increasingly used in national, standardized testing programs. Some examples of
testing programs where the testlet is used include The Uniform CPA Examination (http://www.
aicpa.org), the Medical College Admission Test (http://www.aamc.org/students/mcat/), the CFA Institute examinations (http://www.cfainstitute.org/), the National Board Dental
Examinations Parts I and II, and the licensing examination of the National Council of Architec-
tural Registration Boards. Nearly all standardized tests of reading comprehension use the testlet.
The testlet is increasingly being used for mathematical and scientific problem-solving. The testlet
can even be used to measure editing skills for writing ability in a simulated natural context.
Due to the considerable length of testlets, sample testlets are presented in appendices A through D of this chapter: reading comprehension, mathematical and scientific problem-solving, interlinear, and figural/graphical.
Reading Comprehension
The typical way to measure reading comprehension is to provide a written passage in a testlet
format. Some passages may be three or more pages long and three to 12 items may follow. The
two reading comprehension examples in Appendix A both entail multiple pages. As testlets go,
these are relatively compact. Despite the use of four-option items, the testlet exhibits other good
features of testlet design. Most of the items have a generic quality that can be applied to other
reading passages. The items are independent of one another. If items were dependent, one item
might cue answers to other items. Dependency among items in a testlet is a major problem. SMEs
are directed to ensure that items are not dependent on one another.
Reading comprehension passages can be classified as long, medium, or short or may come in
dual reading passages with test items that relate one passage to the other as Appendix A shows.
A useful source for classifying reading comprehension item types is Sparks Educational Publish-
ing (2005). They provide seven categories of item types, which are presented in Table 5.1. These
categories may be useful in developing reading comprehension test items for specific purposes
related to well-stated instructional objectives. The table presents item stems that suggest a spe-
cific type of content and cognitive demand.
Pyrczak (1972) observed that students often respond to items without making reference to the
passage. Subsequently, others have made the same observation and have been troubled by the way
students respond to passage-dependent reading comprehension test items. We may argue that
reading comprehension is not tested as intended because students tend to use test-taking strate-
gies instead of actually reading the passage first. Katz and Lautenschlager (2000) found that some
students used prior knowledge to respond to test items instead of referring to the passage.
Other research reported in the previous chapter shows how the cognitive demand of testlet
reading comprehension items may direct the cognitive demand of test takers in unpredictable
and varied ways. Clearly, more attention needs to be paid to the cognitive demand of test takers
for reading comprehension testlets.
Table 5.1 Item Stems for Seven Categories of Item Types for Reading Comprehension Testlets

Author's Main Idea
What is the main idea of this passage?
What is the primary purpose of this passage?
One of the author's main points is …
The main purpose of the article is …

Attitude and Tone of Author
Is it positive, negative, neutral? What is the author's state of mind?
The point of view from which the passage is told can best be described as that of:
How does the author/writer feel about …?
Where would this article be found in the library?
Why did the author write this article?

Specific Information (explicit reference to sentences or concepts in the passage)
Which of the following statements is supported by the passage?
Where does the story take place?

Implied Information (requires inference on the part of the test taker)
It can reasonably be inferred from the passage that …?
Which of the following would the author of the passage be LEAST likely to recommend?
How did _______ feel …?
What would happen if …?

Themes and Arguments (author's opinion or arguments)
Which of the following sentences best summarizes the first paragraph?
One of the main points in the last paragraph is …
Which statement from the article below expresses an opinion?
Which persuasive technique does the author use in this article?
Which of these statements is an opinion?

Technique (items dealing with alliteration, allusion, assonance, caricature, cliché, epiphany, foreshadowing, hyperbole, idiom, imagery, irony, metaphor, motif, onomatopoeia, oxymoron, paradox, personification, pun, rhetorical question, sarcasm, simile, symbol, theme, thesis, tone)
The use of the word __________ refers to …
Which sentence is an example of _______?
Match the examples on the left with the literary technique on the right.

Words in Context
As it is used in line 65, the term _____ refers to ________.
In line x, what does the word _______ mean?
In line x, what does the phrase _______ mean?
In paragraph x, what does the word _____ mean?
Which definition of pitch best fits the meaning as it is used in the story?
In paragraph x, the pronoun _____ refers to …

Problem-Solving
Problem-solving is a universal concept in all aspects of education and training, in all professions, and in all of life's activities. Appendix B shows the first example, which is a mathematical problem-solving testlet. The variation is that a set of test items captures a thorough problem-solving exercise. Note that each item measures a different step in the solution process. Item 1 requires the test taker to find the total cost by multiplying, deducting a 10% discount, and correctly adding the handling charge. Distractors should represent common student errors in solving this difficult problem. Item 2 requires careful reading and the adding of the ticket price and the handling charge. Item 3 requires the test taker to compute the amount of the 10% discount for each ticket and multiply it by four. The second example comes from the ACT Science Assessment. It shows how a testlet can work to test scientific problem-solving. There is little doubt about the complexity of the problem.
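To make that decomposition concrete, the arithmetic chain such a testlet samples can be sketched as follows; the ticket price, discount rate, and handling charge are hypothetical stand-ins, not the values used in the Appendix B testlet.

# Hypothetical values; the actual testlet's numbers appear in Appendix B.
ticket_price = 12.50      # price of one ticket
n_tickets = 4             # tickets purchased
discount_rate = 0.10      # 10% discount on each ticket
handling_charge = 3.00    # flat handling charge per order

discount_per_ticket = ticket_price * discount_rate          # the discount on one ticket
total_discount = discount_per_ticket * n_tickets            # ... multiplied by four
one_ticket_with_handling = ticket_price + handling_charge   # ticket price plus handling
total_cost = ticket_price * n_tickets - total_discount + handling_charge  # the full chain

print(total_discount, one_ticket_with_handling, total_cost)  # 5.0 15.5 48.0

Each item in such a testlet isolates one of these steps, so distractors can be built from the errors a test taker would make at that step.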
Interlinear
The interlinear testlet is simply an opportunity for a student to edit some written material in a
multiple true–false format. In the Appendix C example, each paired set of words is either right or wrong. This kind of format is easy to produce since pairs of words or phrases can
be presented in context as either right or wrong. The test taker has the opportunity to correct the
text. A total editing score is possible from the testlet.
Figural/Graphical
Appendix D presents a table-based item set and a graph-based item set. Both examples show the
versatility of this format for a variety of stimuli. Questioning strategies are crucial to a testlet’s
validity. With charts and graphs and other illustrative material, space is a problem. Such presen-
tations usually require more space than simple, stand-alone items.
Recommendation
For measuring a complex cognitive demand associated with tasks representing an ability, the
testlet seems to have the capability to model complex thinking. The examples in the four appen-
dices at the end of this chapter show possibilities. Some excellent examples of testlets can be
viewed on the ACT website at http://www.actstudent.org/sampletest/.
Calculators
The use of calculators in tests requiring the application of mathematics has been a controversial
issue that has received much study. The central issue is whether the cognitive demand of a test or
test item requires the test taker to calculate with or without the assistance of a calculator.
The National Council of Teachers of Mathematics (NCTM) issued a policy statement regard-
ing the role of calculators in mathematics instruction (NCTM, 2005). They showed that both
calculation skills and skill in using a calculator are needed (retrieved from http://www.nctm.
org/about/content.aspx?id=6358). Therefore, it seems that some items require calculation by
test takers without using calculators and some items benefit from using calculators. However,
some research studies provide context for the role of calculators in tests and accompanying
instruction.
A meta-analysis of 54 studies on the effectiveness of using calculators by Ellington (2003) pro-
vided clear results that using calculators improves operational and problem-solving skills. Loyd
(1991) observed that using calculators with these SR item formats will likely diminish calculation
errors and provide for greater opportunity to include items with greater cognitive demand. She
made an important distinction that some items benefit from using calculators whereas other
items do not benefit from having this aid. Thus, both studies provide complementary support
for using calculators. To add to these arguments, in most aspects of daily life the most sensible way to calculate is with a calculator. The use of calculators should be as routine for calculation as the use of word processors is for writing.
Other researchers have found that performance on concepts, calculation, and problem-solv-
ing changes under conditions of calculators and no calculators, depending on the type of mate-
rial tested (Lewis & Hoover, 1981). Some researchers reported that calculators have little or
no effect on test performance because the construct tested is not affected by using calculators
(Ansley, Spratt, & Forsyth, 1988). However, this observation was based on using calculators
where they were not needed. A study by Cohen and Kim (1992) showed that the use of calcula-
tors for college-age students actually changed the objective that the item represented. These
researchers argued that even the type of calculator can affect item performance. Poe, Johnson,
and Barkanic (1992) reported a study using a nationally normed standardized achievement test
where calculators had been experimentally introduced several times at different grade levels.
Both age and ability were found to influence test performance when calculators were per-
mitted. Bridgeman, Harvey, and Braswell (1995) reported a study of 275 students who took
Scholastic Assessment Test (SAT) mathematics questions, and the results favored the use of
calculators. In fact, Bridgeman et al. (1995) reported that one national survey showed that
98% of all students have family-owned calculators and 81% of twelfth-grade students regularly
use calculators. Scheuneman et al. (2002) evaluated performance on the SAT by students who either brought and used calculators or did not bring them. Results showed that higher-performing students benefitted from using calculators. However, a cause-and-effect argument cannot be made from this descriptive study.
The universality of calculators coupled with the ecological validity of using calculators naturally
to solve mathematics problems seems to weigh heavily in favor of calculator usage in mathemati-
cal problem-solving. Bridgeman et al. (1995) concluded that the use of calculators may increase
validity but test developers need to be very cautious about the nature of the problems where cal-
culators are used. Thus, the actual format of the SR item is not the issue in determining whether
or not a calculator should be used. Instead, we need to study the cognitive demand required by
the item before deciding whether a calculator can be used. There is little doubt that using calcula-
tors helps performance in many items requiring calculation. Finally, the term authentic has been
used often to reflect the concept of fidelity of a test item to its target domain partner. If calcula-
tors are part of daily use in computation, then why should calculators not be used in any and all
mathematics test items?
Dangerous Answers
The goal of any licensing/certification test is to pass competent candidates and fail incompe-
tent candidates. Another goal is to protect the public from incompetent practitioners. In the
health professions, one line of promising research has been the use of dangerous answers, distractors that, if chosen, would have harmful effects on the patient portrayed in the problem. The inference
is that a physician who chooses a dangerous answer potentially endangers his or her patients.
The use of dangerous distractors in such tests would assist in the identification of dangerously
incompetent practitioners. A useful distinction is harmful choices versus choices that may lead
to a fatality (Skakun & Gartner, 1990). Research shows that such items can be written successfully and that content review committees of professional practitioners judged their inclusion to be content relevant. Slogoff and Hughes (1987) found that passing candidates chose an average of 1.6 dangerous answers and failing candidates chose an average of 3.4.
In a follow-up of 92 passing candidates who chose four or more dangerous answers, a review
of their clinical practices failed to reveal any abnormalities that would raise concern over their
competence. They concluded that the use of such answers was not warranted. Perhaps the best
use of dangerous answers is in formative testing during medical education and training in
other professions.
Most studies of dangerous answers were done long ago. One recent study involved compu-
ter-simulated cases for physicians (Harik, O’Donovan, Murray, Swanson, & Clauser, 2009). The
computer-based case simulations were used in the United States Medical Licensing Examination
Step 3. These researchers found more than 20% of test takers’ choices involved practices danger-
ous to patients. Their sample was 25,283, so this is a very substantial finding.
Although placing dangerous answers on professional licensing and certification tests may seem attractive, no consensus has emerged that validates their use to eliminate candidates from a profession.
APPENDIX TO CHAPTER 5
4. What happens after the main character and his mother walk back to the house?
A. They shovel the snow.
B. They make hot chocolate.
C. They eat ice cream cones.
D. They take a ride on the sled.
5. Which of these is MOST LIKELY true about the main character’s mother?
A. She likes to play outside.
B. She likes to eat ice cream.
C. She wishes that they had not moved.
D. She wishes that school were not closed.
6. Which of these BEST explains why the main character’s mother sleds down the hill?
A. because the hill is icy
B. because the hill is bumpy
C. because the main character is tired
D. because the main character is nervous
7. Which BEST describes the main idea of the passage?
A. Baking cookies is fun.
B. Making new friends is easy.
C. Playing in the snow can be fun.
D. Moving somewhere new can be hard.
8. How will the main character MOST LIKELY feel the next time it snows?
A. proud
B. lonely
C. excited
D. nervous
9. Which is an antonym of “sparkling” as it is used in the sentence? The snow was very light.
It was crisp and sparkling white.
A. dull
B. fresh
C. pretty
D. heavy
10. What is the meaning of the word “raced” as it is used in the sentence? The sled raced down
the hill as if it were on ice skates. The wind blew through our hair. The cold air burned our
cheeks.
A. moved quickly
B. moved strangely
C. moved sideways
D. moved backwards
1. What is Poe referring to when he speaks of “the entire orb of the satellite”?
A. The sun
B. The moon
C. His eye
2. What is a “tarn”?
A. A small pool
B. A bridge
C. A marsh
3. How did the house fall?
A. It cracked into two pieces.
B. It blew up.
C. It just crumpled.
4. How did the speaker feel as he witnessed the fall of the House of Usher?
A. afraid
B. awestruck
C. pleased
5. What does the speaker mean when he said “his brain reeled?”
A. He collected his thoughts.
B. He felt dizzy.
C. He was astounded.
3. If she works like this for 12 months, how much can she earn?
A. $ 48.00
B. $912.00
C. More than $1,000
4. Tammy wants to save 40% of her earnings for a new bike that costs $360. How many
months will she have to work to save enough money for that bike?
A. 11
B. 12
C. More than 12 months
Passage I
Unmanned spacecraft taking images of Jupiter’s moon Europa have found its surface to be very
smooth with few meteorite craters. Europa’s surface ice shows evidence of being continually res-
moothed and reshaped. Cracks, dark bands, and pressure ridges (created when water or slush is
squeezed up between 2 slabs of ice) are commonly seen in images of the surface. Two scientists
express their views as to whether the presence of a deep ocean beneath the surface is responsible
for Europa’s surface features.
Scientist 1
A deep ocean of liquid water exists on Europa. Jupiter’s gravitational field produces tides within
Europa that can cause heating of the subsurface to a point where liquid water can exist. The
numerous cracks and dark bands in the surface ice closely resemble the appearance of thawing
ice covering the polar oceans on Earth. Only a substantial amount of circulating liquid water
can crack and rotate such large slabs of ice. The few meteorite craters that exist are shallow and
have been smoothed by liquid water that oozed up into the crater from the subsurface and then
quickly froze.
Jupiter’s magnetic field, sweeping past Europa, would interact with the salty, deep ocean
and produce a second magnetic field around Europa. The spacecraft has found evidence of this
second magnetic field.
Scientist 2
No deep, liquid water ocean exists on Europa. The heat generated by gravitational tides is
quickly lost to space because of Europa’s small size, as shown by its very low surface temperature
(–160°C). Many of the features on Europa’s surface resemble features created by flowing glaciers
on Earth. Large amounts of liquid water are not required for the creation of these features. If a
thin layer of ice below the surface is much warmer than the surface ice, it may be able to flow and
cause cracking and movement of the surface ice. Few meteorite craters are observed because of
Europa’s very thin atmosphere; surface ice continually sublimes (changes from solid to gas) into
this atmosphere, quickly eroding and removing any craters that may have formed.
1. Which of the following best describes how the two scientists explain how craters are
removed from Europa’s surface?
Scientist 1 Scientist 2
A. Sublimation Filled in by water
B. Filled in by water Sublimation
C. Worn smooth by wind Sublimation
D. Worn smooth by wind Filled in by water
2. According to the information provided, which of the following descriptions of Europa
would be accepted by both scientists?
F. Europa has a larger diameter than does Jupiter.
G. Europa has a surface made of rocky material.
H. Europa has a surface temperature of 20°C.
J. Europa is completely covered by a layer of ice.
3. With which of the following statements about the conditions on Europa or the evolution
of Europa’s surface would both Scientist 1 and Scientist 2 most likely agree? The surface of
Europa:
A. is being shaped by the movement of ice.
B. is covered with millions of meteorite craters.
C. is the same temperature as the surface of the Arctic Ocean on Earth.
D. has remained unchanged for millions of years.
4. Which of the following statements about meteorite craters on Europa would be most con-
sistent with both scientists’ views?
F. No meteorites have struck Europa for millions of years.
G. Meteorite craters, once formed, are then smoothed or removed by Europa’s surface
processes.
H. Meteorite craters, once formed on Europa, remain unchanged for billions of years.
J. Meteorites frequently strike Europa’s surface but do not leave any craters.
5. Scientist 2 explains that ice sublimes to water vapor and enters Europa’s atmosphere. If
ultraviolet light then broke those water vapor molecules apart, which of the following gases
would one most likely expect to find in Europa’s atmosphere as a result of this process?
A. Nitrogen
B. Methane
C. Chlorine
D. Oxygen
6. Based on the information in Scientist 1’s view, which of the following materials must be
present on Europa if a magnetic field is to be generated on Europa?
F. Frozen nitrogen
G. Water ice
H. Dissolved salts
J. Molten magma
Source: http://www.actstudent.org/sampletest/science/sci_01.html
Used with permission from ACT.
[Two bar graphs, "Summer survey" and "Fall survey," each plotting the percentage of respondents choosing Superman, Catwoman, and Wonder Woman.]
Overview
This chapter provides guidance on how to write SR test items. The chapter is intended for test-
ing programs where item banks are used to construct test forms. However, the development of
these guidelines originated from studies of classroom testing practices. Therefore, these guide-
lines have the dual benefit of helping testing program personnel develop test items and helping those planning tests and quizzes for instructional learning. For standardized testing programs, these
guidelines should be part of an item-writing guide and be used consistently for the development
of all SR items.
The guidelines are organized into categories of content, format, and style concerns, among others, with some redundancy. These guidelines are valuable to item writers and developers of item-writing guides.
Content Concerns
Format Concerns
7. Format each item vertically instead of horizontally.
Style Concerns
8. Edit and proof items.
9. Keep linguistic complexity appropriate to the group being tested.
10. Minimize the amount of reading in each item. Avoid window dressing.
21. Make all distractors plausible. Use typical errors of test takers to write distractors.
22. Avoid the use of humor.
The test taker has to know what an antonym is and then know the definition of forbidden. Thus,
the item calls for two types of content. If the student chooses the right answer, we might infer that
the student knows what an antonym is and knows what the word forbidden means. If a student
makes the wrong choice, we do not know whether the student knows what an antonym is or
whether the student knows what forbidden means.
Better: The examples below separate the two objectives. The first item tests the meaning of an
antonym. The second item uses the multiple true–false format to test very effectively the distinc-
tion between antonym and synonym.
2. What is an antonym?
A. A word that has the opposite meaning of another word.
B. A word that has the same meaning as another word.
C. A word that is against some cause or movement.
3. Chilly/warm
4. Allowed/prevented
5. Rest/activity
6. Charity/miserly
7. Advancing/retreating
For more complex learning, a mathematics item might follow a familiar form but use different numbers. For instance:
13. Marilee has 24 acres of land. She will get a tax break of 10% if she plants 20% of her land
in trees. Her tax bill last year was $3,440. How much money does she stand to save if she
plants 20 trees on 20% of her land?
A. $344
B. $688
C. 24 × 20% × $3,440
The numbers can be changed to make the problem similar in cognitive demand. Also, this vignette
could be transformed into a testlet.
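Here is a minimal sketch, not the authors' procedure, of how such number variation could be mechanized to produce parallel versions of item 13; the value sets and the function name make_tax_item are hypothetical.

```python
import random

def make_tax_item(rng=random.Random(42)):
    """Generate a parallel version of item 13 by varying its numbers (illustrative only)."""
    acres = rng.choice([24, 36, 48])
    tax_bill = rng.choice([2880, 3440, 5200])
    break_pct = rng.choice([10, 15])
    savings = tax_bill * break_pct / 100  # the tax break applies to last year's bill
    stem = (f"Marilee has {acres} acres of land. She will get a tax break of {break_pct}% "
            f"if she plants trees on 20% of her land. Her tax bill last year was ${tax_bill:,}. "
            f"How much money does she stand to save?")
    return stem, f"${savings:,.0f}"

stem, key = make_tax_item()
print(stem)
print("Key:", key)
```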
4. Test important content. Avoid overly specific and overly general content.
Very specific and very general content can be infuriatingly trivial to test takers. SMEs must make
good judgments about the importance of content. Simply matching an item to an objective is not
enough. Imagine a continuum of love like this one:
Neither extreme qualifies well as an adequate definition or example of love. Most items should
probably be written with this kind of continuum in mind. For example, in literature, a test item
might appear like this one:
Overly specific knowledge in an item is often trivial and hard to justify in any curriculum.
The other extreme is too general. The problem with general knowledge is that sometimes the
generality is not true or it has many exceptions, and the question becomes ambiguous. The dan-
ger in being too general is that no answer is truly satisfactory.
Each item writer must decide how specific or how general each item must be to reflect adequately
the content topic and type of mental behavior desired. Moderation in this continuum is highly
desirable. The use of either extreme also creates a feeling of anxiety among test takers because such
extreme items are very hard to answer.
On the related issue of importance, the SME committee is the best judge of what is and is not
important. Generally, such committees have checks and balances that avoid a violation of this
guideline.
The former item seems indefensible, whereas the second item is defensible because it is qualified
and the test taker has presumably had exposure to the Film Institute of New York.
irrelevant variance, which lowers a test score unfairly. Roberts clarified the topic by distinguish-
ing between two types of trick items: those items deliberately intended by the item writer to
mislead test takers, and those items that accidentally trick test takers. The latter type of trick items
will be treated under guideline 20 from Table 6.1. The reason is that we have a set of conditions
that provide clues to test takers. All these poor item-writing practices involve options. The item
writer’s intention appears to be to deceive, confuse, or mislead test takers. Here are some intentionally humorous examples:
Yes, there is a Fourth of July in England. All months have 28 days. It was Noah, not Moses, who loaded animals on the ark in the Biblical story. The butcher weighs meat. Panama hats originate from Ecuador; Panama is a port city where the hats are shipped. Items like these are meant to
deceive test takers and not to measure knowledge. Roberts encouraged more work on defining trick
items. His research has made a much-needed start on this topic. A negative aspect of trick items is
that such questioning strategies, if frequent enough, build an attitude in the test taker characterized
by distrust and potential lack of respect for the testing process. We have enough problems in testing
without contributing more by using trick items. As Roberts pointed out, one of the best defenses
against trick items is to allow test takers opportunities to challenge test items and allow them to
provide alternative interpretations. Toward that end, some researchers have recommended answer
justification (Dodd & Leal, 1988). This technique offers test takers the opportunity to argue why
their choice is correct or to clarify that they have the requisite knowledge or skill measured by the
item. They write out an appeal or the appeal can be made orally in a classroom. This kind of answer
justification cannot be used in standardized testing programs. However, we have strongly recom-
mended in other chapters that all items be given to small groups of representative test takers where
their oral or written comments can be noted to clarify problems with items. This technique is more
common in cognitive psychology where probing into the minds of test takers is valued.
From the cognitive psychologist’s perspective, such test items are called semantic illusions.
Research on this phenomenon has been programmatic. According to Hannon and Daneman
(2001), misleading students via questioning has consistently lowered test performance. In their study, more than 40% of trick items were missed, despite warnings to the test takers. They state that the level of cognitive processing of students taking this test accounts for performance differences. Some students have strong working and long-term memory and the strategic thinking to ward off the trickiness, whereas others have difficulty due to weaker long-term memory.
The importance of their work is not to defeat poor item-writing practices but to improve the
measurement of reading comprehension. Nonetheless, they have shed more light on the issue of
trick questioning. Fortunately, such test items are rare in high-quality testing programs.
There is no justification for writing or using such items in any cognitive test.
Format Concerns
25a. You draw a card from a deck of 52 cards. What is the chance you will draw a card with
an odd number on it?
A. 36/52
B. 32/52
C. About one half
25b. You draw a card from a deck of 52 cards. What is the chance you will draw a card with
an odd number on it?
A. 36/52 B. 32/52 C. About one half
The advantage of horizontal formatting is that it occupies less space on a page and is therefore
more efficient as to printing cost. On the other hand, cramped options affect the look of the test.
If appearance is important, horizontal formatting should be avoided. With younger or test-anx-
ious test takers, the horizontal format may be more difficult to read, thus needlessly lowering
test performance. The vertical format is recommended. Most testing programs present test items
formatted vertically.
Style Concerns
Editing items entails improving the clarity of the task/demand of the test taker without affect-
ing content. Editing also includes the correction of spelling, grammatical, capitalization, and
punctuation errors. Sentence structure is simplified to improve clarity. Spellcheckers and gram-
mar checkers found in word processing programs are very useful aids for an editor.
We have many sources for editorial guidelines (Baranowski, 2006). These include The Chicago
Manual of Style (University of Chicago Press Staff, 2003), Strunk and White’s Elements of Style
(Strunk & White, 2000), and the Publication Manual of the American Psychological Association
(2001). However, these are general editorial references that have limited usefulness.
Test editors should develop a style guide for a testing program. These guides are not publicly
available, but a style guide is usually a single page or two of notes. First, the style guide should specify how each item is to be formatted. Acronyms and how they are presented would be included. Special vocabulary that is specific to a testing program might be there. DOs and DON’Ts should
be included in the style guide—guidelines that appear in this chapter. Ancillary information
about each item should be the responsibility of the editor, and the style guide can remind the edi-
tor of what information should be added to the item bank besides the item. Because every item is
put through a series of reviews, the editor should keep track of what has been done to each item
and what needs to be done. Item development is a process of continuous checking and polishing.
As we can see, the editor has enormous responsibilities for ensuring that each item appears in the
item bank ready for use on a future test.
The purpose of proofing is to ensure that the test and all test items are perfectly presented. A
rule-of-thumb among editors is that if you find three proofing errors in a test or a large set of items
intended for deposit in an item bank, there is probably another that you missed. We should never
overlook the opportunity to improve each item by thorough proofing. Computer-based software
does a good job catching typographical and grammatical errors. A word processing spell checker
and grammar checker can be very useful. Exception dictionaries can be kept on a computer to ensure
that special vocabulary is spelled correctly or abbreviated properly. The World Wide Web provides
another opportunity to look up an unfamiliar word to check on its meaning and spelling.
To summarize, the validity of test score interpretations can be improved by competent edito-
rial work on items and thorough proofing of final test items. One should never overlook this
opportunity to ensure that items are presented as they should be presented.
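To illustrate the kind of mechanical screening that can supplement a spell checker and a program style guide, here is a small sketch; the specific checks and the function name proof_item are illustrative assumptions, not a prescribed checklist.

```python
import re

def proof_item(stem, options):
    """Run a few illustrative mechanical checks before an item enters the item bank."""
    problems = []
    if re.search(r"\s{2,}", stem):
        problems.append("stem contains doubled spaces")
    if stem != stem.strip():
        problems.append("stem has stray leading or trailing whitespace")
    end_marks = {opt.rstrip()[-1] for opt in options if opt.strip()}
    if len(end_marks) > 1:
        problems.append("options end with inconsistent punctuation")
    if len({opt.strip().lower() for opt in options}) < len(options):
        problems.append("duplicate options")
    return problems

print(proof_item("What is  the capital of France? ", ["Paris.", "Lyon", "Nice."]))
# flags doubled spaces, stray whitespace, and inconsistent end punctuation
```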
Original:
26a. The weights of three objects were compared using a pan balance. Two comparisons were
made …
Revised:
26b. Sandra compared the weights of three objects using a pan balance. She made two
comparisons …
The first was abstract and used a passive voice. The second included a student, which is more
concrete and more like a story. It makes the test taker more willing to identify with the issue or
problem being tested.
For testing in the professions, the reading comprehension load is also important. Some can-
didates for certification or licensure have a primary language other than English, and tests with
unnecessary linguistic complexity pose a serious threat to validity for this population. Chapter 16
discusses linguistic complexity in greater detail.
10. Minimize the amount of reading in each item. Avoid window dressing.
Items that require extended reading lengthen test administration time. Because items are usually weighted equally in scoring, a wordy item counts the same as a briefly stated item.
One benefit of reducing test taker reading time is that the number of items one can ask in a
fixed time is increased. Because the number of items given in a fixed time directly affects the
reliability of test scores and the adequacy of sampling of content, items need to be as briefly
presented as possible. Because reliability and validity are very important, we should try to reduce
reading time. Unless we can show that lengthy reading is necessary, such as with some complex
problem-solving exercises, items with high reading demand are not used. This advice applies to
both the stem and the options.
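One standard way to quantify the link between the number of items and reliability is the Spearman–Brown projection; the formula is not presented in this chapter, so the computation below is offered only as an illustration with hypothetical numbers.

```python
def spearman_brown(reliability, length_factor):
    """Project reliability when test length is multiplied by length_factor (items of comparable quality)."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# Hypothetical example: a 40-item test with reliability .80 lengthened to 50 items
# of similar quality (length_factor = 50 / 40 = 1.25).
print(round(spearman_brown(0.80, 1.25), 3))  # 0.833
```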
Here is an example of a test item, which is just awful with respect to verbosity (adapted from
Mouly & Walton, 1962, p. 188):
27. Which of the following represents the best position the vocational counselor can take in view
of the very definite possibility of his being in error in his interpretations and prognoses?
A. He must always take the risk or possibly lose the client’s respect and cooperation.
B. He should couch his statement in terms of probability and emphasize that they may
not apply to this client.
C. He should emphasize what the client should not do since negative guidance can be
more accurate than positive guidance.
D. He should never hazard a prognosis unless he is certain of being right.
E. He should give his best professional advice without pointing to the risk of error and
thereby creating doubts in the mind of the client.
This item has other item-writing faults. One of the most obvious and often observed problems in
writing test items is window dressing. This problem involves the use of excessive description that
is unrelated to the content of the stem. Consider words from the Lloyd Price song Stagger Lee and
then the test item.
28. The night was clear. The moon was yellow. And the leaves came tumbling down.
Who shot Stagger Lee?
A. Billy
B. Stagger Lee
C. Two men gamblin’ in the dark
The opening three sentences have nothing to do with the question.
There are times when verbiage in the stem may be appropriate. For example, where the test
taker sorts through relevant and irrelevant information to solve a problem, more information
is necessary. (Note that the phrase window dressing is used exclusively for situations where use-
less information is embedded in the stem without any purpose or value.) In this latter instance,
the purpose of more information is to see if the test taker can separate useful from useless
information.
29. A compact disc was offered on a website for $9.00. In the local store, it sells for $12.00. This
weekend, it was marked at a 30% discount. Sales tax is 6%. Tina had $9.00 in her wallet
and no credit card. Does Tina have enough money to buy this compact disc at her local
store?
In this item, the student needs to compute the discount price, figure out the actual sales price,
compute the sales tax, and add the tax to the actual sale price. The $9.00 website price is irrelevant information, and the student is supposed to ignore this fact in the problem-solving effort. This is not window
dressing.
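A minimal worked check of the arithmetic item 29 requires, using only the store price, discount, and tax stated in the item, shows that Tina does have enough money and that the website price never enters the solution.

```python
# Worked check of item 29 (store purchase only; the $9.00 website price is never used).
store_price = 12.00
discount = 0.30
sales_tax = 0.06
cash_on_hand = 9.00

sale_price = store_price * (1 - discount)   # 12.00 * 0.70 = 8.40
total_cost = sale_price * (1 + sales_tax)   # 8.40 * 1.06 = 8.904, about $8.90
print(round(total_cost, 2), total_cost <= cash_on_hand)  # 8.9 True
```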
11. State the central idea clearly and concisely in the stem and not in the options.
One common fault in item writing is to have a brief stem and most of the content in the options.
The item below shows the unfocused stem.
This stem provides no direction or idea about what the item writer wants to know. Unfocused
stems are a frequent type of error made by novice item writers. The test taker does not under-
stand the intent of the item until reading the options. The second item in the example below takes
the same idea and provides a more focused stem.
According to Harasym, Doran, Brant, and Lorscheider (1992), a better way to phrase such an
item is to remove the NOT and make the item a multiple true–false (MTF) with more options:
Another benefit of this transformation is that because the options now become items, more items
can be added, which may increase test score reliability.
If a negative term is used, it should be stressed or emphasized by placing it in bold type, capital-
izing it, or underlining it, or all of these.
The reason is that the student might not process the meaning of NOT and might forget to reverse
the logic of the relation being tested. This is why the use of NOT is not recommended for item
stems.
13. Use only options that are plausible and discriminating. Three options are usually
sufficient.
As reported previously, the three-option MC is sufficient for most testing purposes. Research
and experience show consistently that writing fourth and fifth options is futile. It is a waste of
precious resources—the time of your SMEs.
A good distractor should be selected by low achievers and ignored by high achievers. We have
a statistical method of analyzing options that informs us about each option’s operating charac-
teristic. In developing SR test items, more than three options might be written, but after field-testing, the analysis will typically show that only two or three options survive the evaluation.
Chapter 17 provides a comprehensive discussion of distractor evaluation.
One disclaimer is offered. If great precision is needed in the lower-third of the test score dis-
tribution, four or five options might actually work better than three options, but this emphasis
is usually not desired.
14. Make sure that only one of these options is the right answer.
Although an SME writes the item and chooses the correct answer, inadvertently some items end
up with more than one right answer or no right answer. The way to prevent such embarrass-
ment is to have other SMEs verify the right answer. After the item is field-tested, the results
should show that the right answer has a response pattern that is consistent with expectations.
That is, low-scoring test takers choose wrong answers, and high-scoring test takers choose the
right answer. If an item has two right answers or no right answer, it is revised and field-tested again. Also, the committee of SMEs should agree that the right answer is
correct. If there is disagreement, the item is flawed.
15. Vary the location of the right answer according to the number of options.
Most testing specialists will advise that the key should have approximately the same position dis-
tribution. If three options are used for a 100-item test, the distribution of right answers might be
A—33%, B—34%, and C—33%. Testwise students are always looking for clues to right answers.
Lack of balance in the key might be a clue. Or a pattern of right answers might offer false hope
to a test taker.
This issue has become more complex thanks to research by Attali and Bar-Hillel (2003). Given
the overwhelming support for key balancing as stated in the previous paragraph, these two
researchers have discovered that test takers tend to make guesses in the middle of the choices
offered and avoid the extremes. These researchers call this phenomenon edge aversion. They cited
previous research and their own research with many test items showing that not only is edge
aversion a real tendency of test takers, but the positioning of the correct answer has an effect
on an item’s difficulty and discrimination. Their remedy to this source of construct-irrelevant
variance is a complex one for large testing programs: The reordering of options should be done
in such a way that an item is presented in all possible key orderings for a large group of test
takers. Not many testing programs have this capability, but a computer-based testing program
could reorder options on each item administration, which should dismiss this small threat to
validity.
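A minimal sketch of such per-administration reordering appears below; the function is hypothetical, and a production system would also have to respect the numerical and logical ordering rules discussed under the next guideline.

```python
import random

_rng = random.Random(2013)

def randomize_options(options, key_index, rng=_rng):
    """Shuffle an item's options and return the shuffled list with the new key position."""
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    return shuffled, order.index(key_index)

# Tally where the key lands across many simulated administrations of a three-option item.
positions = [randomize_options(["option A", "option B", "option C"], key_index=0)[1]
             for _ in range(3000)]
print({pos: positions.count(pos) for pos in sorted(set(positions))})  # roughly 1,000 per position
```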
WRONG
39. What is the cost of an item that normally sells for $9.99 that is discounted 25%?
A. $5.00
B. $7.50
C. $2.50
D. $6.66
RIGHT
40. What is the cost of an item that normally sells for $9.99 that is discounted 25%?
A. $2.50
B. $5.00
C. $6.66
D. $7.50
Answers should always be arranged in ascending or descending numerical order. Every item
should measure knowledge or skill in a direct fashion. Another point about numbers is that items
should always be presented in correct decimal alignment:
Logical ordering is more difficult to illustrate, but some examples offer hints at what this guide-
line means. The following example illustrates illogical ordering.
42. What are the three most important concerns in fixing a recalcitrant thermofropple?
A. O-ring integrity, wiring, lubricant
B. Positioning, O-ring integrity, wiring
C. Lubricant, wiring, positioning
Although we may criticize such a questioning strategy for other reasons, this popular format
is additionally and unnecessarily confusing because the four possible terms (Lubricant, O-ring
integrity, positioning, wiring) are presented in an inconsistent order. A more logical ordering
and presentation is:
43. What are the three most important concerns in fixing a recalcitrant thermofropple?
A. Lubricant, O-ring integrity, wiring
B. O-ring integrity, positioning, wiring
C. Lubricant, positioning, wiring
If the correct answer is 25, then both B and C are correct. Numerical problems that have ranges
that are close make the item more difficult. This careless error can be simply corrected by devel-
oping ranges that are distinctly different. The avoidance of overlapping options will also prevent
embarrassing challenges to test items.
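Because the overlapping-range example itself is not reproduced here, the sketch below uses hypothetical ranges to show how a simple overlap check can catch this careless error before an item reaches review.

```python
def ranges_overlap(a, b):
    """True if two closed numeric ranges (low, high) share any value."""
    return a[0] <= b[1] and b[0] <= a[1]

# Hypothetical option ranges for an item whose key is 25; B and C both contain 25,
# so two options are defensible and the ranges need to be made distinctly different.
options = {"A": (0, 10), "B": (10, 25), "C": (25, 40)}
pairs = [("A", "B"), ("A", "C"), ("B", "C")]
print([(x, y) for x, y in pairs if ranges_overlap(options[x], options[y])])
# [('A', 'B'), ('B', 'C')]
```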
18. Avoid using the options None of the Above, All of the Above, and I Don’t Know.
Research has increased controversy over this guideline, particularly the first part, dealing with
the option none of the above. Studies by Knowles and Welch (1992) and Rodriguez (1997) do
not completely concur with the use of this rule as suggested by Haladyna and Downing (1989b).
Gross (1994) argued that logical versus empirical arguments should determine the validity of an
item writing guideline. For this reason, and because most textbook authors sup-
port this guideline, none of the above is still not recommended.
Perhaps the most obvious reason for not using this format is that a correct answer obviously
exists and should be used in the item. No advantage exists for omitting the right answer from the
list of options. One argument favoring using none of the above in quantitative test items is that
it forces the student to solve the problem rather than choose the right answer. In these circum-
stances, the student may work backward, using the options to test a solution. In these instances,
a CROS format should be used.
The use of the choice all of the above has been controversial (Haladyna & Downing, 1989a).
Some textbook writers recommend and use this choice. One reason may be that in writing a test
item, it is easy to identify one, two, or even three right answers. The use of the choice all of the
above is a good device for capturing this information. However, the use of this choice may help
testwise test takers. For instance, if a test taker has partial information (knows that two of the
three options offered are correct), that information can clue the student into correctly choosing
all of the above. Because the purpose of a SR test item is to test knowledge or cognitive skill, using
all of the above seems to draw students into test-taking strategies more than directly testing for
knowledge and skills. One alternative to the all-of-the-above choice is the use of the MTF format.
Another alternative is to simply avoid all of the above and ensure that there is one and only one
right answer. For these reasons, this option should be avoided.
The intention of using I don’t know as a choice is to minimize the role of guessing the correct
choice. Unfortunately, not all children or adults treat this choice the same way. Sherman (1976)
studied patterns of response for children answering items that had the I don’t know choice. Dif-
ferences existed for region, gender, personality variables, and ethnic background. Nnodim (1992)
also studied this option but in the context of scoring that considered whether a student chose it
or not. His results showed no advantage for higher-achieving students over lower-achieving stu-
dents, as the Sherman study contended. However, the use of this option does not seem justified
until the rules for scoring are clearly stated and research shows a decided advantage for I don’t
know. With respect to Sherman’s results, why would anyone want to use such a choice knowing
that it benefits some groups of test takers at the expense of others? In other words, the I don’t
know choice appears to have great potential for producing bias in test scores. Therefore, it should
be avoided.
19. Word the options positively; avoid negative words, such as NOT.
The use of negatives such as NOT and EXCEPT should be avoided in options and also the stem.
This is an extension of guideline 12. The use of negative words in the stem is potentially problem-
atic; in the options, a serious error.
A. Length of options. One common fault in item writing is to make the correct answer the long-
est. This may happen very innocently. The item writer writes the stem and the right answer, and
in the rush to complete the item adds two or three hastily written wrong answers that are shorter
than the right answer. The clue is created inadvertently, and the testwise test taker can see it.
The remedy is to ensure that all options are about equal in length.
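A rough automated screen can flag items whose keyed option runs conspicuously longer than the distractors; the threshold and the sample options below are illustrative, and any flagged item still needs an editor's judgment.

```python
def flag_long_key(options, key_index, tolerance=1.3):
    """Flag an item whose keyed option is noticeably longer than the average distractor."""
    key_length = len(options[key_index])
    distractor_lengths = [len(opt) for i, opt in enumerate(options) if i != key_index]
    return key_length > tolerance * (sum(distractor_lengths) / len(distractor_lengths))

# Hypothetical item: a carefully qualified key next to two hastily written distractors.
opts = ["He should couch his statements in terms of probability and note their limits.",
        "He must always take the risk.",
        "He should never hazard a prognosis."]
print(flag_long_key(opts, key_index=0))  # True
```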
B. Specific determiner. A specific determiner is a distractor that is so extreme that seldom can it
be the correct answer. Specific determiners include such terms as always, never, totally, absolutely,
and completely. A specific determiner may occasionally be the right answer. In these instances,
its use is justified if such terms appear consistently in both right and wrong answers. However, if a specific
determiner is used to deceive a test taker, it could be a trick test item.
46. Which of the following is most likely to produce the most student learning over a school
year?
A. Never assign homework on Fridays or before a holiday.
B. Assign homework that is consistent with that day’s class learning.
C. Always evaluate homework the next day to ensure quick feedback.
A restatement of this guideline: Never use never (or other extreme words).
C. Clang associations. Sometimes, a word or phrase that appears in the stem will also appear in
the list of options, and that word or phrase will be the correct answer.
If a clang association exists and the word or phrase is NOT the correct answer, then the item may
be a trick item.
The Hundred Years’ War was a series of separate wars for the French throne lasting from 1337 to
1453 between two royal houses. The clang association was supposed to trick you into choosing
the obvious, wrong choice.
D. Pairs or triplets of options. Sometimes an item contains highly related options that provide
clues to the test taker that the pair or triplet of highly related terms is not the correct choice.
49. What belief is defensible about a cultural core and a cultural pattern?
A. The two are synonymous.
B. The two are opposite.
C. Few people follow cultural patterns.
D. The former is biological.
The first two are paired and the right answer seems to reside with one of these choices. The next
two seem implausible add-on options.
E. Blatantly absurd, ridiculous options. When writing that fourth or fifth option there is a
temptation to develop a ridiculous choice either as humor or out of desperation. In either case,
the ridiculous option will seldom be chosen and is therefore useless. You may not know the per-
son in the second choice (B), but you know that it is the right answer, because the other two are
absurd. If A or C is correct, then the item is a trick question.
F. Option homogeneity. The use of options that are heterogeneous in content and grammar
is also often a cue to the student. Such cues are not inherent in the intent of the item but an
unfortunate accident. Fuhrman (1996) suggested that if the correct answer is more specific, stated in different language, or more or less technical than the other options, these tendencies
might make the item easier. A standard practice is keeping options homogeneous in content.
One study is very informative about option homogeneity (Ascalon, Meyers, Davis, & Smits,
2007). They developed a measure of option similarity for a driver’s license test. They found
that when an item’s options have a high degree of similarity, that type of item is about .12
easier than items with options that are dissimilar. Easier does not mean less discriminating.
Another finding was that option similarity was highly correlated with distractor plausibil-
ity. Thus, another item-writing guideline (make distractors plausible) is supported by this
guideline. The following item illustrates both homogeneous and heterogeneous options.
HOMOGENEOUS OPTIONS
51. The way to make salsa really hot is by adding
A. habanero chili peppers.
B. anaheim chili peppers.
C. jalapeno chili peppers.
HETEROGENEOUS OPTIONS
52. What makes salsa hottest?
A. Adding the seeds of peppers
B. Using spices
C. Blending the mixture very slowly
21. Make all distractors plausible; use typical errors of test takers to write distractors.
Plausibility is an abstract concept. We know that the right answer must be right, and the wrong
answers must clearly be wrong. Plausibility refers to the idea that the item should be correctly
answered by those who possess a high degree of knowledge and incorrectly answered by those
who possess a low degree of knowledge. Thus, a plausible distractor will look like a right answer
to those who lack this knowledge. Chapter 17 discusses ways to evaluate distractor performance.
The most effective way to develop plausible distractors is to either obtain or know what typical
learners will be thinking when the stem of the item is presented to them. We refer to this concept
as a common error. Knowing common errors can come from a good understanding of teaching
and learning for a specific grade level; it can come from think-aloud studies with students; or it
can come from student responses to a constructed-response format version of the item without
options.
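Where responses to an open-ended tryout version of an item are available, a simple tally of the most frequent wrong answers can suggest distractors grounded in common errors; the sketch and the tryout responses below are hypothetical.

```python
from collections import Counter

def suggest_distractors(responses, correct_answer, n=3):
    """Return the n most common wrong answers from an open-ended tryout as candidate distractors."""
    wrong = [r.strip() for r in responses if r.strip() != correct_answer]
    return [answer for answer, count in Counter(wrong).most_common(n)]

# Hypothetical tryout responses to "What is 25% off $9.99?" (key: $7.49).
tryout = ["$7.49", "$2.50", "$7.49", "$2.50", "$9.74", "$7.50", "$2.50", "$7.49"]
print(suggest_distractors(tryout, "$7.49"))  # ['$2.50', '$9.74', '$7.50']
```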
The example below exemplifies excellent item development with SME justification for each
option (adapted from http://www.actstudent.org/sampletest/math):
For each option, a justification can be given. For example, for option C, we might provide
an explanation that the x-intercept is the location on the x-axis where y = 0, not where x = 0.
This website for the ACT Assessment provides many items of this type, where each option is
given a justification. It is very rare to find such thoroughness given to distractors. If justifications
were given for all distractors written, four- and five-option items might be more effective than
they are currently.
sometimes highly anxious test takers react in negative ways. Humor detracts from the purpose of
the test. The safe practice is to avoid humor.
55. In Phoenix, Arizona, you cannot take a picture of a man with a wooden leg. Why not?
A. Because you have to use a camera to take a picture.
B. A wooden leg does not take pictures.
C. That’s Phoenix for you.
Because of the humor element of this example item, there are possibly multiple correct answers!
For classroom testing, if humor fits the personality of the instructor and class, it could be used but
probably very sparingly and with care taken to ensure there are no adverse consequences.
Balance the Number of True and False Statements. Key balancing is important in any kind of
objectively scored test. This guideline refers to the balance between true and false statements,
but it also applies to negative and positive phrasing. So, it is actually key balancing as applied to
true–false items.
Use Simple Declarative Sentences. A true–false item should be a simple rather than a complex sentence.
It should state something in a declarative rather than interrogative way. It should not be an ellip-
tical sentence.
Write Items in Pairs. Pairs of items offer a chance to detect ambiguity. One statement can be
true and another false. One would never use a pair of items in the same test, but the mere fact
that a pair of items exists offers the item writer a chance to analyze the truth and falsity of related
statements.
56a. Overinflated tires will show greater wear than underinflated tires. (false)
56b. Underinflated tires will show greater wear than overinflated tires. (true)
Make Use of an Explicit Comparison Rather Than an Implicit Comparison. When writing the
pair of items, if comparison or judging is the mental activity, write the item so that we clearly state
the comparison in the item. Examples are provided:
Desirable: 57a. In terms of durability, oil-based paint is better than latex-based paint.
Undesirable: 57b. Oil-based paint is better than latex-based paint.
Take the Position of an Uninformed Test Taker. This pair of items reflects two common errors.
The first example is a common misunderstanding among students learning about testing. The
second item is another common misinterpretation of the concept of percentile rank.
58. A percentile rank of 85 means that 85% of items were correctly answered. (false)
59. A percentile rank of 85 means that 15% of test takers have scores lower than people at that
percentile rank. (false)
The number of MTF items per cluster may vary within a test. Balance in many aspects of a test
is something to strive for, as it provides for a manageable experience for the test taker. However,
there is no reason to strive for balance in the number of items of different formats or the number
of elements within an item across items. Similarly, although we argue that three-option items
are optimal, the real point is that options should be plausible and represent likely errors of stu-
dents; there is no psychometric reason why each item should have the same number of options
—although there are practical reasons to avoid errors in responding on bubble response forms,
for example.
The primary goal in test design is content coverage. Given the item and test specifications
discussed in chapter 3, an MTF item set must be a fair representation of the desired content
(guideline 4). It is more important to include the important instructionally relevant or stand-
ards-based content in the MTF than to worry about balancing the number of items within each
MTF cluster. However, there should be balance in the number of true and false items within a
cluster.
Use MC items as a basis for writing MTF items. Good advice is to take a poor-functioning MC
item and convert it to several MTF items. Observe the examples below:
60. Which of the following are ways to increase the mileage of modern automobiles?
A. Use a higher-premium gas.
B. Increase tire inflation to a maximum allowed.
C. Shift gears if possible.
D. Lose weight.
E. Increase highway driving.
The items that might be extracted from the original item are as follows:
Which actions listed below will improve gas mileage in your car?
Mark A if it tends to improve gas mileage, mark B if not.
61. Use a higher-premium gas.
62. Increase tire inflation to a maximum allowed.
63. Shift gears if possible.
64. Lose weight.
65. Increase highway driving.
Notice how the original item is expanded via the MTF format to increase the breadth of testing
the understanding of this principle.
No strict guidelines exist about how many true and false items appear in a cluster, but expect-
ing a balance between the number of true and false items per set seems reasonable. The limit for
the number of items in a cluster may be as few as three or as many as would fit on a single page
(approximately 30–35).
Use Algorithms if Possible. An algorithm is a standard testlet scenario with a fixed number
of items. The scenario can be varied according to several dimensions, producing many useful
items. Haladyna (1991) presented examples for teaching statistics and art history. The strategy
involves developing a set of item stems and options that apply equally to vignettes. Each vignette
is altered systematically to capture subtle variations in the cognitive demand of the problem to
be solved. Chapter 8 provides illustrations and examples of these. With any testlet, conventional
MC, matching, alternative-choice, and MTF items can be used. The testlet encourages consider-
able creativity in developing the stimulus and using these various formats. Even CR item formats,
such as short-answer essays, can be used.
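A minimal sketch of such an algorithm appears below; the statistics vignette, its dimensions, and the stems are hypothetical, chosen only to show how crossing a few dimensions yields many scenarios that reuse the same item stems.

```python
from itertools import product

# Hypothetical dimensions for a statistics testlet algorithm: each combination
# produces a scenario to which the same fixed set of stems applies.
shapes = ["symmetric", "positively skewed", "negatively skewed"]
sample_sizes = [25, 100, 400]
stems = [
    "Which measure of central tendency best describes this distribution?",
    "What happens to the standard error of the mean if the sample size is doubled?",
]

testlets = []
for shape, n in product(shapes, sample_sizes):
    scenario = f"A researcher collects a {shape} distribution of {n} test scores."
    testlets.append((scenario, list(stems)))

print(len(testlets))      # 9 scenarios, each reusing the same stems
print(testlets[0][0])
```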
Develop Stimulus Material That Resembles or Mimics Your Target Domain. For reading com-
prehension items, passages are sought that fit the population being tested. Some passages can be
original and have reading levels and vocabulary that are appropriate for the target population.
Other passages can come from literature or from popular magazines, such as those covering news
or science. Such passages have the hazard of having reading levels or vocabulary that may not be
appropriate. Some passages are written originally for a testlet and therefore require considerable analysis of readability for the intended target group of test takers, as well as all the item reviews for linguistic complexity and fairness.
For problem-solving, developing vignettes that come from the target domain is very desirable.
For instance, mathematics and science problems should address everyday events where mathe-
matics is needed or the problem is one encountered in life that has a scientific principle involved.
For a test in the professions, the SME can usually develop cases representing the target domain.
For instance, in a professional test, such as facial plastic surgery, a vignette might include a patient
with a severed facial nerve followed by a set of items that probe into the appropriate treatment
and rehabilitation. For a science vignette, one might discuss or deal with the sweeping of a puddle
after rainfall. Which method is most effective for increasing evaporation of the puddle?
Format the Testlet So All Items Are on a Single Page or Opposing Pages of the Test Booklet. This
step will ensure easy reading of the stimulus material and easy reference to the item. When lim-
ited to two pages, the total number of items ranges from two to twelve. If the MTF or AC
formats are used with the testlet, then many more items can be used on one or two pages.
Overview
To steal ideas from one person is plagiarism. To steal from many is research.
(Author unknown)
This chapter features the results of our research on exemplary and innovative selected-response
(SR) item formats. As Hambleton (2004) noted, we have witnessed an explosion in the develop-
ment of item formats. All of the formats presented in this chapter are intended for paper-and-
pencil testing. Some of these formats come from an interesting archive published
in the 1930s at the University of Chicago.
Many new item formats have been created uniquely for use in computer-based testing (Scalise,
2010; Scalise & Gifford, 2006; Sireci & Zenisky, 2006; Zenisky & Sireci, 2002). Another excellent
source for new, innovative, computer-based test item formats is the Educational Testing Service,
where extensive item development activities are ongoing (http://www.ets.org/s/research/pdf/
CBALlasampleitems.pdf).
Innovative item formats provide opportunities to increase the choice of item formats to meas-
ure various content at different cognitive demands. Another benefit is the capability for diag-
nosing learning as some of these formats are intended to provide formative feedback to test
takers/learners. Some of these item formats have a generative capacity. That is, we can create
similar items with very little effort. Although the next chapter features item-generating theories
and technology, a few examples of item formats that can be manipulated to create many more
items are presented. Another benefit is that some of these formats test content and different cog-
nitive demands very efficiently.
No matter how each item format is perceived, all of these item formats are subject to the same
standard for item validation as are traditional item formats. Validity studies are essential. Fre-
quently, these item formats have not received the same degree of scholarly study as the formats
previously presented that represent the mainstream of item formats.
A disclaimer for this chapter is that its contents are hardly comprehensive or representative of
all new, innovative, and exemplary item formats. We have an abundance of new item formats.
References are provided to other examples throughout this chapter.
This chapter begins with the introduction of a taxonomy of item formats proposed by Scal-
ise and Wilson (2006). Next, items selected from this archive at the University of Chicago are
featured. After that, many examples of unusual SR item formats are presented. All these formats
require further study, but testing programs should also experiment with new, innovative item
formats, because of the many benefits that accrue.
Language Translation
The teaching of any language compels instructors and test developers to use performance meas-
ures as these item formats have high fidelity with the target domain. Learning any language natu-
rally engages learners in reading, writing, speaking, and listening in the language to be learned.
However, we have useful formats that efficiently measure one’s ability to translate from the lan-
guage to be learned into English. This item comes from The Technical Staff (1937, p. 31):
As seen above, each item has a short stem and three options and occupies four lines. The reading
demand is very light. These items are easy to write if the item writer can develop two plausible
distractors. The number of items administered per hour has to be very high, but no research is
available. About 10 items can be presented on a single page, so the test is compact. There is much
to like with this format.
The Technical Staff (1937, p. 32) suggested that foreign language vocabulary be tested in the
following way:
Word           A              B        C           D         E
2. dein        insignificant  ours     smooth      yours     honest
3. einzeln     one            plural   one-by-one  first     zeal
4. scheiden    absent         cut      depart      tear      presume
5. steur       expensive      new      fire        tax       few
6. wahr        was            true     beg         faithful  war
The format of these items is unambiguous. The use of five options in this instance seems plausible
and justifiable. These items may be easy to develop. Perhaps fewer options might be more effec-
tive. As many as 30 items can be presented on a single page, which should yield highly reliable
test scores for vocabulary. Many items can be administered in a very short time due to the low
demand for reading and the compactness of this format. However, there is no research to report
on this dated but appealing format.
We have generic options that apply to all items, so this is a matching format. This testlet occupies
very little space and has items that are not dependent—a chronic problem with the testlet format.
Reading demand is low. This format seems adaptable to other subject areas.
10. It is difficult for many persons to understand why Lafcadio Hearn, the writer who left Amer-
ican and went to the orient.
11. That he was a person of unusual temperament and the strangeness of his outlook on life have
caused causal observors to condemn him.
12. He was a wanderer, therefore his interests were transient and unsettled.
Many sentences can be provided, each containing a specific kind of error. At least one in nine
sentences should be correctly written. This kind of matching set of test items seems very challeng-
ing for someone learning to be an English teacher or editor.
Triple Matching
In this odd variation of the matching format, we have three matching categories. Consider the
following example:
The synopsis of the work is presented. The student must identify the correct author, work, and
type.
13. Achilles and Agamemnon have a mighty quarrel one day, after which Achilles deserts the
Greeks and returns home …
With each synopsis, we have 6 × 6 × 6 (216) possible response patterns with only one response pattern
being correct. In effect, if we use 0–1 (wrong–right) scoring, each synopsis has three scorable
items. This format would be useful in many other subject matters where linkages are sought
among conjunctive concepts. In every instance, a higher cognitive demand can be achieved by
using actual passages, lines, photographs, or other media that need to be recognized. This kind of
format might be used in art, music, or other fields:
Once the set of three concepts is created, the item writers develop the variations for each concept
and then synopses can be written. Using the last set of concepts, the following synopses were
created.
For each synopsis, we might have these crops and more, different climate conditions, and dif-
ferent fertilizer configurations. For example, one fertilizer is triple-16 (nitrogen, phosphorus,
potassium).
A. inversely proportional
B. directly proportional
C. numerically equal to
D. independent of
E. inversely proportional to the square of
F. proportional to the square of
Like other extended-matching formats, items are easy to generate, guessing is minimized, and
the items are very compactly presented. Administration time for 50 items appears to be very
brief.
These examples selected from The Technical Staff (1937) illustrate how uniquely we can test
for more than recall using compact, efficiently administered items. Unfortunately, these formats
have not been researched, and there is no evidence that any of these formats have been used or
even reported in standard textbooks on testing. A general reluctance exists to use innovative
formats such as the ones presented in this section, even though they date from long ago. We hope
this reluctance will not persist in the future.
Generic Options
Chapter 8 features recent progress on item generation. Generic options are very attractive because,
once conceived, an option set such as that shown below can be used for many items. Bejar and his
colleagues (2003, p. 8) presented this generic option set:
The objective of this option set was to measure a map-reading skill. The stem of the item can vary because we
have two quantities to compare, one in column A and one in column B. One example they pro-
vide involves map-reading. A map is presented with a legend, and a series of items are adminis-
tered that involve reading the map. This instruction is given:
This map is drawn to scale. One centimeter equals 30 kilometers.
Two columns are presented, and the correct relationship comes from choosing one of the four
generic options above. For example:
Column A Column B
19. Two cities are 2,000 kilometers apart. 30 centimeters
The student has to compare 2,000 kilometers with the distance represented by 30 centimeters. In
generative testing, we can vary values to obtain parallel items. Given any two points on the map,
a test item can be generated in several ways that thoroughly test this map-reading skill.
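To make the generative idea concrete, a minimal sketch follows. It is not taken from Bejar and his colleagues; the scale, the distance values, and the wording of the four generic options (assumed here to be the familiar quantitative-comparison set) are illustrative only.

import random

GENERIC_OPTIONS = {
    "A": "The quantity in Column A is greater.",
    "B": "The quantity in Column B is greater.",
    "C": "The two quantities are equal.",
    "D": "The relationship cannot be determined.",   # applies when information is insufficient
}

def generate_map_item(km_per_cm=30, rng=random):
    """Create one Column A / Column B comparison item from an assumed map scale."""
    km_apart = rng.randrange(100, 3001, 100)   # Column A: distance between two cities
    cm_on_map = rng.randrange(1, 101)          # Column B: a length measured on the map
    cm_as_km = cm_on_map * km_per_cm           # convert Column B to kilometers
    if km_apart > cm_as_km:
        key = "A"
    elif km_apart < cm_as_km:
        key = "B"
    else:
        key = "C"
    stem = ("This map is drawn to scale. One centimeter equals "
            f"{km_per_cm} kilometers.\n"
            f"Column A: Two cities are {km_apart:,} kilometers apart.\n"
            f"Column B: {cm_on_map} centimeters")
    return stem, key

stem, key = generate_map_item()
print(stem)
print("Key:", key, "-", GENERIC_OPTIONS[key])

Each run of the sketch yields a parallel item; varying the scale and the replacement values traces out the domain of items that the generic option set supports.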
A second example involves a geometric figure (square, octagon, pentagon, etc.) as shown in
Figure 7.1. The area is given. With the triangle, the length of one leg is presented
in column A, and a variable name is given for the other leg. The student can
apply some algebra and the formula for the area of a triangle. The four options apply again.
This example can also be used to generate 28 items using the same four options, as partially
shown below.
Column A Column B
20. Square Octagon
21. Square Pentagon
22. Square Rectangle
23. Square Hexagon
24. Square Triangle
25. Square Circle
26. Square Hexagon
27. Octagon Pentagon
28. Octagon Rectangle
29. Octagon Triangle
30. Octagon Octagon
Figure 7.1 Geometric shapes: square, octagon, pentagon, rectangle, and triangle.
For each geometric figure, some clues are offered to help the test taker compute the area of the
figure. For instance, the square's side is 4 cm. The octagon's side is 3 cm. The base of the triangle is 6 cm
and the height is 8 cm. The circle has a radius of 5 cm (or, in another variation, a diameter of 7 cm). The numbers can
be varied, resulting in many more items generated. The learner has to use the clues to calculate
the area of each geometric shape and then select one of the four generic options.
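A rough sketch of this kind of generation appears below. The clue values follow the paragraph above, a regular octagon is assumed, and only four shapes are included; if eight distinct shapes were available, the 28 items mentioned above would correspond to all two-shape combinations.

import math
from itertools import combinations

def areas(square_side=4, octagon_side=3, tri_base=6, tri_height=8, circle_radius=5):
    """Areas computed from the clue values; a regular octagon is assumed."""
    return {
        "Square": square_side ** 2,
        "Octagon": 2 * (1 + math.sqrt(2)) * octagon_side ** 2,
        "Triangle": 0.5 * tri_base * tri_height,
        "Circle": math.pi * circle_radius ** 2,
    }

def compare(shape_a, shape_b, table):
    """Select the generic option: A greater, B greater, or the two areas are equal."""
    if math.isclose(table[shape_a], table[shape_b]):
        return "C"
    return "A" if table[shape_a] > table[shape_b] else "B"

table = areas()
for a, b in combinations(table, 2):   # each distinct pair of shapes yields one item
    print(f"Column A: {a:8s} Column B: {b:8s} Key: {compare(a, b, table)}")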
A third example involves an essay. A set of declarative statements is presented in an essay. The
four options are:
Because writing options is the most difficult part of item writing, the use of these generic options
provides an important service. However, some ingenuity is needed to create a situation that sup-
ports a set of generic options. The three examples provided show that this format has applicabil-
ity to a variety of content.
Generic options are a subset of matching and extended-matching formats. The main differ-
ence is that generic options are applicable to a variety of situations, whereas conventional match-
ing and extended-matching option sets are usually unique to one situation. Generic options seem
to have a bright future, if researchers and test developers are more willing to use matching and
extended-matching in testing programs.
The difficulty with this format is creating the yes-no justifications. Each option has to have plau-
sibility for a particular expression. Possibly, these justifications can be generic or based on com-
mon student errors.
DIRECTIONS: In the passage that follows, certain words and phrases are underlined and num-
bered. In the right-hand column, you will find alternatives for the underlined part. In most cases,
you are to choose the one that best expresses the idea, makes the statement appropriate for stand-
ard written English, or is worded most consistently with the style and tone of the passage as a
whole. If you think the original version is best, choose “NO CHANGE.” In some cases, you will
find in the right-hand column a question about the underlined part. You are to choose the best
answer to the question.
You will also find questions about a section of the passage, or about the passage as a whole.
These questions do not refer to an underlined portion of the passage, but rather are identified by
a number or numbers in a box.
For each question, choose the alternative you consider best and fill in the corresponding oval
on your answer document. Read the passage through once before you begin to answer the ques-
tions that accompany it. For many of the questions, you must read several sentences beyond the
question to determine the answer. Be sure that you have read far enough ahead each time you
choose an alternative.
I grew up with buckets, shovels, and nets waiting by the back door [36]; hip-waders hanging
in the closet; tide table charts covering the refrigerator door; and a microscope was sitting [37]
on the kitchen table.
36.
A. NO CHANGE
B. waiting, by the back door,
C. waiting by the back door,
D. waiting by the back door
37.
F. NO CHANGE
G. would sit
H. sitting
J. sat
Consider the reasoning for the answers to item #36.
The best answer is A. It provides the best punctuation for the underlined portion. The phrase
“waiting by the back door” describes the noun nets and is essential because it tells which nets
the narrator “grew up with.” Therefore, no comma should be placed after nets. The semico-
lon after the word door is appropriate because semicolons are used between items in a series
when one or more of these items include commas.
The best answer is NOT B because the first comma after waiting is unnecessary. In addition,
the appropriate punctuation after door should be a semicolon (not a comma). Semicolons
are used between items in a series when one or more of these items include commas.
The best answer is NOT C because the appropriate punctuation after door should be a semi-
colon and not a comma. Semicolons are used between items in a series when one or more of
these items include commas.
The best answer is NOT D because the punctuation, in this case a semicolon, is missing after
the word door. It is needed to set off the first of this sentence’s three items in a series.
The second example comes from science. This testlet has a passage on a science topic and seven
conventional MC items. Only two items are presented in a modified format for this chapter.
Interested readers should consult the website for a complete and accurate presentation.
DIRECTIONS: The passage in this test is followed by several questions. After reading the passage,
choose the best answer to each question and fill in the corresponding oval on your answer docu-
ment. You may refer to the passage as often as necessary.
You are NOT permitted to use a calculator on this test.
Passage I
Unmanned spacecraft taking images of Jupiter’s moon Europa have found its surface to be very
smooth with few meteorite craters. Europa’s surface ice shows evidence of being continually res-
moothed and reshaped. Cracks, dark bands, and pressure ridges (created when water or slush is
squeezed up between 2 slabs of ice) are commonly seen in images of the surface. Two scientists
express their views as to whether the presence of a deep ocean beneath the surface is responsible
for Europa’s surface features.
Scientist 1
A deep ocean of liquid water exists on Europa. Jupiter’s gravitational field produces tides within
Europa that can cause heating of the subsurface to a point where liquid water can exist. The
numerous cracks and dark bands in the surface ice closely resemble the appearance of thawing
ice covering the polar oceans on Earth. Only a substantial amount of circulating liquid water can
crack and rotate such large slabs of ice. The few meteorite craters that exist are shallow and have
been smoothed by liquid water that oozed up into the crater from the subsurface and then quickly
froze. Jupiter’s magnetic field, sweeping past Europa, would interact with the salty, deep ocean
and produce a second magnetic field around Europa. The spacecraft has found evidence of this
second magnetic field.
Scientist 2
No deep, liquid water ocean exists on Europa. The heat generated by gravitational tides is
quickly lost to space because of Europa’s small size, as shown by its very low surface tempera-
ture (–160°C). Many of the features on Europa's surface resemble features created by flowing
glaciers on Earth. Large amounts of liquid water are not required for the creation of these
features. If a thin layer of ice below the surface is much warmer than the surface ice, it may
be able to flow and cause cracking and movement of the surface ice. Few meteorite craters are
observed because of Europa's very thin atmosphere; surface ice continually sublimes (changes
from solid to gas) into this atmosphere, quickly eroding and removing any craters that may
have formed.
38. Which of the following best describes how the 2 scientists explain how craters are removed
from Europa’s surface?
Scientist 1 Scientist 2
A. Sublimation Filled in by water
B. Filled in by water Sublimation
C. Worn smooth by wind Sublimation
D. Worn smooth by wind Filled in by water
A is not the best answer. Scientist 1 says that the craters are smoothed by liquid water that
oozes up into the craters from the subsurface and then quickly freezes. Scientist 2 says that
ice sublimates, eroding and removing any craters that form.
B is the best answer. Scientist 1 says that the craters are smoothed by liquid water that oozes
up into the craters from the subsurface and then quickly freezes. Scientist 2 says that when
ice sublimates, the craters are eroded and smoothed.
C is not the best answer. Scientist 1 says that the craters are smoothed by liquid water that
oozes up into the craters from the subsurface and then quickly freezes.
D is not the best answer. Scientist 1 says that the craters are smoothed by liquid water that
oozes up into the crater from the subsurface and then quickly freezes. Scientist 2 says that ice
sublimates, eroding and removing any craters that form.
Although the future of item writing may become more automated through item generation
theories and technology, as the next chapter suggests, the highest standard for item develop-
ment for today's testing programs is exemplified by ACT's items. The cognitive demand and
answer justification features are a model for item development. If tests are to provide forma-
tive feedback to learners, this kind of use of SR items represents a very high level of help to
learners.
Multiple-Mark Version
39. Which of the following writers is (are) known for their outstanding contributions in the
20th century?
A. F. Scott Fitzgerald
B. Upton Sinclair
C. Samuel Clemens (Mark Twain)
D. Jack London
MTF Version
The following writers are known for their contributions in the 20th century. True or False?
40. F. Scott Fitzgerald T F
41. Upton Sinclair T F
42. Samuel Clemens (Mark Twain) T F
43. Jack London T F
In the multiple-mark format, the MTF is used, but the instructions to test takers are to mark true
answers and leave other options blank. Scoring is based on a partial credit formula. Instead of
scoring 0–1 as with the CMC item, scores can range between 0 and 1 with gradations at .25, .50,
and .75. Partial credit scoring tends to improve reliability. It also yields information about the
degree of learning for a concept, principle, or procedure. Pomplun and Omar (1997) reported
that the multiple-mark format has many positive characteristics to recommend its use. A prob-
lem was reported with orienting young test takers to this format. However, as with any new for-
mat, practice and familiarity can solve this problem.
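A minimal sketch of the partial-credit idea follows; it is not Pomplun and Omar's exact formula, only the proportion-correct logic described above applied to a four-option MTF item.

def multiple_mark_score(marked, key):
    """marked and key are sets of option letters; key holds the true options."""
    options = {"A", "B", "C", "D"}
    # An option is judged correctly if it is marked when true or left blank when false.
    judged_correctly = sum(1 for opt in options if (opt in marked) == (opt in key))
    return judged_correctly / len(options)   # 0, .25, .50, .75, or 1.0

# Example: the key is {A, B, D}; a test taker who marks A and D scores .75.
print(multiple_mark_score({"A", "D"}, {"A", "B", "D"}))   # 0.75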
The GRE revised general test now employs multiple-mark MC items. In the verbal reasoning
section, for example, there are some MC questions that are “select one or more answer choices.”
In these items, there are three options. The test taker must select all options that are correct: one,
two, or all three options may be correct. To gain credit for this item, all correct options must be
selected. Similar question types are found in the quantitative reasoning section, but may include
a larger number of options.
“Uncued” Multiple-Choice
This item format consists of eight dichotomously scored items arranged in two four-item sets. As
shown in Figure 7.2, the graph shows Marisa’s bicycle trip over 80 minutes. The first set requires
the learner to indicate which phase of the trip was designated A, B, C, or D. The second set of
four items does the same thing. Scalise and Gifford (2006) stated that the uncued aspect is that,
although each item is specific, the learner must work through the entire chart before attempting
the sets of items.
Figure 7.2 Graph of Marisa's bicycle trip over time (0 to 80 minutes).
There is a dependency among choices that cues the learner. This item set can be administered
in a paper-and-pencil format or via a computer. Other items could be added to this set that ask
the learner about the average speed, distance traveled, and other factors that require chart read-
ing and comprehension.
Scoring is another issue. The entire eight-item set can be scored so the points range between 0 and
8, or the result can be scored 0 (incorrect) versus 1 (correct). As there is cuing potential, conducting a field
test of such items is important to decide whether they validly measure what they purport to measure.
Ordered Multiple-Choice
This format is specifically part of a diagnostic assessment system where each choice is linked to a
developmental level of the student (Briggs, Alonzo, Schwab, & Wilson, 2006). The promise with
this kind of theoretical development and technology is that SR test items provide students and
their teachers with greater information about their strengths and weaknesses in learning. Cur-
rently, distractors for SR items are not used that way. Briggs et al. provided this example:
44. Which is the best explanation for why it gets dark at night?
A. The Moon blocks the Sun at night. [Level 1 response]
B. The Earth rotates on its axis once a day. [Level 4 response]
C. The Sun moves around the Earth once a day. [Level 2 response]
D. The Earth moves around the Sun once a day. [Level 3 response]
E. The Sun and Moon switch places to create night. [Level 2 response]
Although the item appears to be a five-option CMC, each option represents a developmental
level. Technical aspects of scoring using item response theory are a strong feature of this format
(Wilson, 1992). They propose a method of scoring that uses information from distractors. This
kind of option ordering and scoring is one of several viable option-weighting methods (Hala-
dyna, 1990; Sympson & Haladyna, 1993). All option-weighting methods actually work quite well
but require considerable preliminary analysis and very complex scoring. Testing programs seem
reluctant to employ such schemes due to the heavy demand on item development.
If item development could become more efficient, these kinds of test items would not only pro-
vide information to students and their teachers but also improve score reliability.
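The sketch below illustrates only the underlying idea of option weighting for item 44, assuming partial credit proportional to the developmental level of the chosen option. Briggs et al. (2006) and the option-weighting literature use item response theory rather than this simple ratio.

# Hypothetical option-weighted scoring for item 44: option letter -> developmental level.
LEVELS = {"A": 1, "B": 4, "C": 2, "D": 3, "E": 2}
MAX_LEVEL = max(LEVELS.values())

def ordered_mc_score(choice):
    """Return a score between 0 and 1 reflecting the level of the chosen option."""
    return LEVELS[choice] / MAX_LEVEL

print(ordered_mc_score("B"))   # 1.0, the Level 4 (correct) response
print(ordered_mc_score("D"))   # 0.75, a Level 3 response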
One factor that may limit the use of this format is the set of preconditions that must be met. First, a
construct map is needed. Its development requires theoretical development and research. Not
all achievement constructs are amenable to construct mapping. Second, distractor writing is very
demanding. Third, teacher training is needed. Finally, although such items have diagnostic value,
CR items that are subjectively scored provide better information.
a deeper type of learning than surface learning that often results from using the SR format for
recognition of facts.
The two-tiered item has a content item followed by a reasoning response. The authors had a
process for item development that included the development of a concept map for each chemi-
cal reaction. Each map was validated by two SMEs. The next step in this process was to ask stu-
dents to respond to items in writing so that a set of common student errors could be generated
for each chemical reaction. Besides the students’ misconceptions, previous research was used to
supplement this list. SMEs were employed regularly in item development to validate the items.
As a result, they produced 33 SR items that ranged from two to four options. In chapter 5, it was
argued that conventional MC items usually have three functional options. All items were field-
tested and interviews were conducted with some students to reveal their cognitive demand. From
this process, they assembled a final test of 15 items (each item has two parts—actually two items).
The authors offer additional information on a website (see, for example, Interactive Courseware
for Chemistry, Acids and Bases, and Qualitative Analysis available online at http://www.cool-sci-
ence.net). Here are three examples:
45. Dilute sulfuric acid is added to some black copper(II) oxide powder and warmed. The
copper(II) oxide disappears producing a blue solution. Why is a blue solution produced?
A. The copper(II) oxide dissolves in the acid producing a blue solution.
B. Copper(II) oxide reacts with dilute sulfuric acid, producing a soluble salt, copper(II)
sulfate.
C. Copper(II) oxide is anhydrous. When the acid is added the copper(II) oxide becomes
hydrated and turns blue.
46. What is the reason for my answer?
A. The ions in copper(II) sulfate are soluble in water.
B. Cu2+ ions have been produced in the chemical reaction.
C. Hydrated salts contain molecules of water of crystallization.
D. Cu2+ ions originally present in insoluble copper(II) oxide are now present in soluble
copper(II) sulfate.
47. When powdered zinc is added to blue aqueous copper(II) sulfate and the mixture shaken,
the blue color of the solution gradually fades and it becomes colorless. At the same time a
reddish-brown deposit is produced. The chemical equation for the reaction that occurs is,
Zn(s) + CuSO4(aq) → ZnSO4(aq) + Cu(s), while the ionic equation is, Zn(s) + Cu2+(aq) →
Zn2+(aq) + Cu(s). Why did the solution finally become colorless?
A. Copper has formed a precipitate.
B. Zinc is more reactive than copper(II) sulfate.
C. The copper(II) sulfate has completely reacted.
D. Zinc has dissolved, just like sugar dissolves in water.
48. What is the reason for my answer?
A. Zinc ions are soluble in water.
B. Zinc loses electrons more readily than copper.
C. Soluble, blue Cu2+ ions have formed insoluble, reddish-brown copper atoms.
D. In aqueous solution Cu2+ ions produce a blue solution, while Zn2+ ions produce a colorless solution.
Another item set was adapted by Tsai and Chou (2002, p. 18). The item shows two light bulbs.
One is bare and the other is covered by a glass shell.
49. On the earth, there is a light bulb that gives out heat. We cover the light bulb with a glass
shell and extract the air inside, so the pressure within the shell is in a vacuum state. If our
face is pressed close to the shell, will we be able to see the light and feel the heat?
A. We can only see the light, but cannot feel the heat.
B. We can only feel the heat, but cannot see the light.
C. We can both see the light and feel the heat.
D. We can neither see the light, nor feel the heat.
50. What is the cause of this phenomenon?
A. The light must be propagated by the air; and, the heat can be propagated via radia-
tion under a vacuum state.
B. The light need not be propagated by the air; and, the heat cannot be propagated via
radiation under a vacuum state.
C. The light must be propagated by the air; and, the heat cannot be propagated via
radiation under a vacuum state.
D. The light need not be propagated by the air; and, the heat can be propagated via
radiation under a vacuum state.
E. The light need not be propagated by the air; and, the heat can be propagated via
convection under a vacuum state.
The use of five options is justified because each option represents a defensible alter-
native conception of a scientific principle. Knowing which option was chosen by a learner is
informative.
The motivation for this kind of research and item and test development comes from the need
to understand what students are thinking when they choose correct or incorrect options.
Tamir (1971) is credited with making the observation that distractors should include student
alternative conceptions where students can justify their choice. Tamir (1989) noted that justifica-
tion was a very effective way to improve learning and validly score the classroom test. Haladyna
(2004) recommends such practice in any instructional testing as a way to evaluate test items and
also help students learn. The cost of developing such items must be considerable. If a technology
could be developed for more rapid and efficient item development, this would be helpful, as these
kinds of items are much needed in formative testing for student learning.
A fit-looking 35-year-old office worker smokes cigarettes. He comes to you to complain of cramps
in his left leg when playing tennis.
For item #51, a list of eight acceptable answers is provided as a key. For item #52, a list of 16
acceptable questions is provided as a key.
New Research
Rabinowitz and Hojat (1989) compared correlations of the MEQ and a SR test to clinical perform-
ance ratings. Although the SR test scores had a higher correlation to the clinical performance rat-
ings, the authors concluded that the MEQ must measure something unique that the SR does not.
A more recent study by Palmer and Devitt (2007) with undergraduate students provided more
insight into the MEQ. About one half of the MEQ items tested recall. Ironically, the rationale for
MEQs was to avoid recall testing so popularly associated with SR item formats. These authors
concluded that well-designed CMC items perform better than the MEQ. Given the same sce-
nario, today’s item writers preparing to measure clinical problem-solving might augment clinical
observation with a SR test that includes a multiple true–false testlet of the following type:
A fit-looking 35-year-old office worker smokes cigarettes. He comes to you to complain of cramps
in his left leg when playing tennis. Which of the following are likely causes of his cramps?
This testlet is easily scored 0–1. The items are not interdependent. The number of items can be
expanded to provide many true and false items. Palmer and Devitt reported low reliability for
the MEQ, but by adding more scoring points, reliability is predicted to be higher. The testlet is
presented very compactly on the page for easy administration and reading. The administration
time is shorter.
The MEQ is a historical note in item formats where a perceived liability of SR formats caused a
migration to a complex vignette-based CR that then migrated to testlets that seem to improve on
the MEQ. The developmental history is informative and useful in showing how the quest for item
formats that have the capability to model complex learning is continuous and evolving. Ironi-
cally, the MEQ is a predecessor for more viable SR formats. In the next chapter, item generation
is a field that capitalizes on this historical epoch that emphasizes the SR format as opposed to the
CR format.
NORT, BERL, and SAMP are nonsense terms; JET and CAR are real terms; speed and weight
are semantic features. The paragraph is followed by a series of TF items; half are true and half
are false. The statements involve four components of reading comprehension: text memory, text
inference, knowledge access, and knowledge integration. Because there is no prior knowledge,
this item format eliminates the possibility that prior knowledge may interfere in the measure in
a construct-irrelevant way. Katz and Lautenschlager (1991) showed that students who are highly
skilled readers perform quite well on these reading comprehension items.
In their appendix, Hannon and Daneman (2001, p. 125) present a complete exposition of the
initial paragraph (shown above), along with many statements that serve as test items with different
cognitive demands.
The statements tested inferences about information presented explicitly in the paragraph; no prior
knowledge of the terms exists. For instance, a learner can infer that a SAMP is slower than a CAR
because a SAMP is slower than a BERL and a BERL is slower than a CAR. Paragraphs with two
semantic features had two true and two false text-inferencing statements; paragraphs with three
semantic features had three true and three false text-inferencing statements, and paragraphs with
four semantic features had four true and four false text-inferencing statements.
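As an illustration only, and not Hannon and Daneman's procedure, the sketch below generates true and false text-inference statements from an assumed speed ordering of the terms (SAMP slower than BERL, BERL slower than CAR).

from itertools import combinations

def inference_statements(order):
    """order lists terms from slowest to fastest; returns (statement, is_true) pairs."""
    items = []
    for i, j in combinations(range(len(order)), 2):
        slower, faster = order[i], order[j]
        items.append((f"A {slower} is slower than a {faster}.", True))
        items.append((f"A {faster} is slower than a {slower}.", False))
    return items

for statement, answer in inference_statements(["SAMP", "BERL", "CAR"]):
    print("T" if answer else "F", statement)

Because the terms are nonsense, every true statement can be verified only by inference from the paragraph, and the set contains equal numbers of true and false statements, as the text describes.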
Supporting Arguments—Reasoning
This next example comes from the Educational Testing Service (retrieved from http://www.ets.
org/s/research/pdf/CBALlasampleitems.pdf). A series of arguments are presented that are either
for an issue, against an issue, or off-topic. The generic form is like this:
Then a series of statements are presented. In this example the argument is:
This example was presented originally in a computer-based format, but it is actually a matching
format with a high cognitive demand: each statement has not been presented before, and the test
taker has to evaluate whether the statement argues for the issue, argues against it, or is neutral or
simply off topic. No evaluation is associated with this item.
Assertion-Reason Multiple-Choice
This format uses true/false aspects in a sentence where an assertion is made followed by a reason.
The subject matter in which this format was tested was college-level economics. An example
provided by Williams (2006, p. 292) is:
ASSERTION: In a small open economy, if the prevailing world price of a good is lower than the
domestic price, the quantity supplied by the domestic producer will be greater than the domestic
quantity demanded, increasing domestic producer surplus.
REASON: BECAUSE in a small, open economy, any surplus in the domestic market will be
absorbed by the rest of the world. This increases domestic consumer surplus.
Williams administered a survey to his students regarding aspects of this item format. Most stu-
dents reported very favorable opinions about this format. No research findings were reported
about item difficulty or discrimination or the cognitive demand of such items. Nonetheless, the
structure of this CMC format appears to capture what Williams thinks is reasoning.
Summary
This chapter has featured innovative SR item formats of various types including some formats
created a long time ago and buried in the archives at the University of Chicago.
Figure 7.3 Computer-based sorting item set from CBAL (Mislevy, 2006). The on-screen directions
ask the test taker to move the names of people who OPPOSE school uniforms into the left column,
the names of people who SUPPORT school uniforms into the right column, and the names of
people who neither support nor oppose school uniforms into the middle column, by clicking a
name and then clicking the empty space where it belongs. Used with permission. Copyright ©
2012 Educational Testing Service.
The formats
included here were mostly intended for paper-and-pencil testing. However, it is recognized
that computer-based testing has created opportunities for many innovative SR formats to be
introduced. Programs of research should support the introduction of any new item format. All
item formats need to pass a test that includes measuring specific content and a specific cognitive
demand. These formats also have to be as efficient as or better than conventional item formats.
All item formats should be validated before being used in a testing program.
8
Automatic Item Generation
Overview
Traditional selected-response (SR) item-writing has remained largely unchanged since the
introduction of conventional multiple-choice (CMC) items in the early 1900s. We know that the major
expense in most testing programs is item development (Case, Holzman, & Ripkey, 2001). Accord-
ing to Wainer (2002), the cost of developing and validating any item for a high-quality testing
program can exceed $1,000 per item. More current estimates greatly exceed that figure. One
source reports that the current cost of a professionally developed item exceeds $2,000 (Joseph
Ryan, personal communication). We recommend that item pools exceed the length of a test form
by at least a factor of 2.5. At that cost per item, the replacement value of a bank of
250 items would exceed $500,000.
As we know, item-writing is a subjective process. Items are written to cover specific content
with a certain cognitive demand. Despite guidelines for writing items, great variability exists in
the quality of the items produced. As a result, not all original items survive the item development/
validation process. In our experience as many as 40% of new items will be discarded or revised
during a rigorous item development and validation process. The item bank needs validated items
in sufficient numbers to create operational test forms and at least one backup test form. As items
grow old and retire, new items are needed that match the content and cognitive demand of retired
items. Also, some testing programs routinely release items to the public or their constituency.
When items are exposed due to lapses in test security, new items are needed to replace the exposed
items. Consequently, we have persistent pressure to develop and validate new items.
Automatic item generation (AIG) offers hope that new items can be created more efficiently
and effectively than traditional item generation without the threats to validity that come from
human subjectivity. The ultimate goal of AIG is the production of many items with predeter-
mined difficulty and discrimination. All items must satisfy the content and cognitive demand
required in item and test specifications. Two benefits of AIG are: (a) we can save money by hav-
ing item-writing automated and (b) creating parallel test forms is possible on-the-fly so that test
takers can have an equivalent testing experience without any threats to security (Bejar, Lawless,
Morley, Wagner, Bennett, & Revuelta, 2003).
AIG is a new science. It is growing rapidly, in step with advances in computer-based and
computer-assisted testing and in cognitive psychology. There is no doubt that validity, construct
definition, cognitive psychology, and emerging psychometric methods converge to aid the
development of AIG. Currently, AIG has problems and limitations that
remain to be studied and solved (Gierl & Haladyna, 2012). Nevertheless, the hope is that contin-
ued research and development will realize the lofty goals for AIG.
We currently have two branches of AIG. The first branch is item generation of a practical
nature. These are remedies currently in use that speed up the item-writing process for human
item writers. The basis of these practical item-generating methods is expedience. We need devices
to help item writers produce items more efficiently. The second branch of AIG is an item-writ-
ing science that is based on cognitive learning theory. This second branch is not quite ready for
widespread use in testing programs or classroom assessment, but the hope is that it will emerge to
replace the need for human, subjective item-writing with all its attendant problems.
This chapter is organized as follows:
Prose-Based AIG
One of the first item-generating methods was based on the theory proposed by Bormuth (1970).
This theory involved algorithmic transformations of key sentences in prose passages. The goal
was to make test item development for reading comprehension tests fully automated. Roid and
Haladyna (1978) streamlined the algorithm with the assistance of Patrick Finn (Roid & Finn,
1977). This research showed the feasibility of algorithms for passage-based test items. Unfortu-
nately, the items had a low cognitive demand—recall. Thus, this theory has since been abandoned.
However, Bormuth’s pioneering work showed that item-writing can be automated. Prose-based
item generation remains the most challenging area for item-generation researchers.
Item Form
The concept of an item form was first introduced by Osborn (1968), but it was the work of Hively
and his colleagues that spawned future item generation work (Bejar, 1993; Hively, 1974; Hively,
Patterson, & Page, 1968).
Item forms are used to define domains of content. The term domain-referenced testing emerged
from this early work by Osborn and Hively. This concept is still relevant in modern-day validity
(Chapter 3; Kane, 2006a, 2006b). The target domain is a set of tasks representing an ability; the
universe of generalization is our operationalization of that domain. The universe of generaliza-
tion is our item bank.
The item form generates items using a fixed syntactic structure. It contains one or more vari-
ables, and it defines a class of sentences (Osborn, 1968). The development of a set of item forms
operationally defines the domain. Extant items were often the basis for creating the item form.
An example of an item form is as follows:
If you travel X miles and use Y gallons of gas, your fuel consumption is Z.
X is a replacement set that varies between 200 and 1,000 miles in whole-number values.
Y is a replacement set that varies between 1 and 50 gallons in whole and one-place decimal values.
Solve for Z.
Distractors are not an explicit aspect of the item form, but they could be. The challenge is to sup-
ply a generic distractor for a common student error. Otherwise, the distractor is not plausible.
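A minimal sketch of instantiating this item form follows. The correct answer is X divided by Y, in miles per gallon; the single distractor is a hypothetical common error (the inverted ratio), in the spirit of the generic-distractor idea just described.

import random

def fuel_item(rng=random):
    """Instantiate the item form with values drawn from its replacement sets."""
    x = rng.randrange(200, 1001)            # X: miles, whole numbers
    y = round(rng.uniform(1, 50), 1)        # Y: gallons, to one decimal place
    correct = round(x / y, 1)               # Z: miles per gallon
    inverted = round(y / x, 3)              # hypothetical common error: Y / X
    stem = (f"If you travel {x} miles and use {y} gallons of gas, "
            "what is your fuel consumption in miles per gallon?")
    return stem, correct, inverted

stem, key, distractor = fuel_item()
print(stem)
print("Key:", key, " Distractor (inverted ratio):", distractor)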
By selecting items strategically, the mapping of a content domain might be sufficient. Sub-
ject matters containing symbols, codes, tables, quantities, calculations, data, or problems involv-
ing quantitative variables are very adaptable to item forms. Item forms do not seem as useful
for prose-based content such as reading, history, psychology, or clinical problem-solving in the
health professions.
Mapping Sentences
Guttman’s mapping sentences are similar to item forms (Guttman, 1969). The mapping sentence
is the tool for generating test items. It has content and statistical properties. An example of a map-
ping sentence with four facets is given below:
When a student is given a test item presented in (FACET 1: figural, numerical, verbal) language and
it requires (FACET 2: inference, application) of a rule (FACET 3: exactly like, similar to, unlike)
one taught in one of the student’s courses, he or she is likely to answer the item (FACET 4: cor-
rectly or incorrectly).
The four facets create 36 combinations (test items). To make the item alternative-choice, the
last facet would be held constant and become two options. Mapping sentences have many
advantages:
requires intensive efforts by SMEs. The content is limited to objectively observed phenomena. As
this chapter reports, the mapping sentence and item forms have contributed to the field of AIG
in profound ways but there is much more work to do.
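For readers who want to see the combinatorics, the short sketch below enumerates the 36 combinations defined by the mapping sentence above; the facet labels are taken from the example, and each combination is a template for one item.

from itertools import product

FACETS = {
    "language": ["figural", "numerical", "verbal"],
    "process": ["inference", "application"],
    "similarity": ["exactly like", "similar to", "unlike"],
    "outcome": ["correctly", "incorrectly"],
}

all_combinations = list(product(*FACETS.values()))
print(len(all_combinations))     # 36
print(all_combinations[0])       # ('figural', 'inference', 'exactly like', 'correctly')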
be necessary, such as we see in a testlet. Ideally, the development of generic options representing
common student errors makes the item template the highest form of AIG, whereas the practical
examples presented previously in this chapter do not offer specific options. Lai and his colleagues
presented 15 examples in different content areas, including reading comprehension, algebra, and
mathematics word problems.
Item Model 1
1. Ann has paid $1,525 for planting her lawn. The cost of the lawn planting is $45/m2. Given
the shape of the lawn is square, what is the side length of Ann’s lawn?
A. 4.8
B. 5.8
C. 6.8
D. 7.3
Item Model Variable Elements
Stem: Ann paid I1 for planting her lawn. The cost of lawn planting is I2. Given that the
shape of the lawn is square, what is the side length of Ann's lawn?
Elements: I1: Value range: 1,525 to 1,675 by 75.
I2: 30 and 40
Options: A. (I1/I2)**0.5
B. (I1/I2)**0.5 + 1.0
C. (I1/I2)**0.5 – 1.0
D. (I1/I2)**0.5 + 1.5
Key: A
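The sketch below renders Item Model 1 in code, using the element ranges and constrained-option formulas listed above. It is offered only to show the logic; it is not the IGOR software discussed later in the chapter.

def render_item_model_1(i1, i2):
    """Render one item from Item Model 1; options follow the constrained formulas above."""
    side = (i1 / i2) ** 0.5
    options = {
        "A": round(side, 1),          # key: (I1/I2)**0.5
        "B": round(side + 1.0, 1),
        "C": round(side - 1.0, 1),
        "D": round(side + 1.5, 1),
    }
    stem = (f"Ann paid ${i1:,} for planting her lawn. The cost of lawn planting is "
            f"${i2}/m2. Given that the shape of the lawn is square, what is the side "
            "length of Ann's lawn?")
    return stem, options, "A"

for i1 in range(1525, 1676, 75):      # I1: 1,525 to 1,675 by 75
    for i2 in (30, 40):               # I2: 30 and 40
        stem, options, key = render_item_model_1(i1, i2)
        print(options, "Key:", key)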
Stem Types
Gierl and his colleagues described four stem types:
1. Independent elements have no bearing on one another but simply create opportunities for
variation. These independent elements are incidentals. One element may be a radical, but
it is independent of the other elements.
2. Dependent elements can be incidentals or radicals but these dependent elements exist in
pairs or triplets. They provide for a richer and more complex vignette than simple inde-
pendent elements.
3. Mixed independent and dependent elements are also possible.
4. Fixed elements would have a single stem with no variation.
Option Types
They also described three option types:
1. Randomly selected options. A pool of options is created. The right answer is chosen and
the distractors are randomly selected from the pool.
2. Constrained options refer to a situation where the correct answer and a set of distractors
are generated via certain rules.
3. Fixed options are invariant for all option variations.
Item Model 2
2. Four students finished in a foot race at their campsite near Jasper. John finished 5 seconds
behind Ryan. Sheila finished 3 seconds behind John. Danielle was 6 seconds in front of Sheila.
In what order, from first to last, did the students finish?
A. Ryan, Danielle, Sheila, John
B. Ryan, John, Danielle, Sheila
C. Ryan, Sheila, John, Danielle
D. Ryan, Danielle, John, Sheila
Stem: Four (S1) had a (S2) at their (S3). John finished (I1) (S4) behind Ryan. Sheila
finished (I2) (S4) behind John. Danielle was (I3)(S4) in front of Sheila. In what
order, from first to last, did the S1 finish?
Elements: S1: Range: Students, kids, children
S2: Range: Foot race, bike race, raffle, miniature golf, swimming, bingo
S3: Range: School, campsite, community center
S4: Range: Points, seconds, minutes
I1: 3 to 6 by 1
I2: 2 to 5 by 1
I3: I2 + 2
When S2 is a foot race, bike race, or swimming, S4 is seconds.
When S2 is a raffle, miniature golf, or bingo, S4 is points.
Distractors: All orderings of the four contestants.
Key: D
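A comparable sketch for Item Model 2 appears below. The finish times are computed from the dependent elements (with I3 set to I2 + 2), the key is the resulting order from first to last, and the distractors are other orderings of the same names. The default values are illustrative.

from itertools import permutations

def foot_race_item(i1=5, i2=3):
    """Compute the key ordering for Item Model 2; I3 is the dependent element I2 + 2."""
    i3 = i2 + 2
    times = {"Ryan": 0, "John": i1}                 # John finished I1 seconds behind Ryan
    times["Sheila"] = times["John"] + i2            # Sheila finished I2 seconds behind John
    times["Danielle"] = times["Sheila"] - i3        # Danielle was I3 seconds in front of Sheila
    key = tuple(sorted(times, key=times.get))       # order from first to last
    distractors = [p for p in permutations(times) if p != key][:3]
    return key, distractors

key, distractors = foot_race_item()
print("Key:", key)                 # ('Ryan', 'Danielle', 'John', 'Sheila')
print("Distractors:", distractors)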
Item model 3 below has dependent elements in the stem and constrained options.
Item Model 3
3. The thermostat on the oven malfunctioned. First, the temperature dropped 5 degrees C, then
it increased 7 degrees C, fell to 12 degrees C, and finally stabilized at 185 degrees C. What
was the original temperature?
A. 131 degrees C
B. 145 degrees C
C. 235 degrees C
D. 239 degrees C
Stem: The thermostat of an oven malfunctioned. First the temperature dropped (I1) degrees
(S1), then it increased (I2) degrees (S1), fell (I3) degrees (S1), and finally decreased
a further (I4) degrees (S1) before it stabilized at (I5) degrees (S1). What was the
original temperature?
Elements:
S1= C for Centigrade S1= F for Fahrenheit
I1 Value range 3 to 18 by 3 15 to 30 by 3
I2 Value range 2 to 20 by 2 10 to 30 by 2
I3 Value range 5 to 15 by 1 21 to 30 by 1
I4 Value range 10 to 40 by 4 50 to 60 by 5
I5 Value range 100 to 200 by 5 200 to 300 by 5
Options:
A. I1 + I2 + I3 + I4 + I5
B. I1 – I2 + I3 + I4 + I5
C. I1 + I2 – I3 – I4 + I5
D. I1 + I2 – I3 – I4 + I5
Auxiliary information: Picture of an oven
Key: B
The article by Gierl et al. (2008) provides many examples of item models containing all of the
possible variations of stem and option combinations sans the two that are not feasible. The full
item models are displayed that illustrate all 10 possible combinations of stem and option element
conditions. The authors also discuss a computerized item generator, IGOR (Item GeneratOR).
Readers are directed to this article for more information about the taxonomy, the item models
illustrating the taxonomy, and the software IGOR. A more current source for IGOR can be found
in a chapter by Yazdchi, Mortimer, and Stroulia (2012).
This first facet identifies six major settings involving patient encounters. The weighting of these
settings may be done through studies of the profession or through professional judgment about
the criticalness of each setting.
This second facet provides the array of possible physician activities in sequential order. The last
activity, applying scientific concepts, is somewhat disjointed from the others. It connects patient
conditions with diagnostic data and disease or injury patterns and their complications. In other
words, it is the complex step in treatment that the other categories do not conveniently describe.
This third facet provides five types of patient encounters, in three discrete categories with two
variations in each of the first two categories. A sample question illustrates the application of these
three facets:
4. A 19-year-old archeology student comes to the student health service complaining of severe
diarrhea, with 15 large-volume watery stools per day for 2 days. She has had no vomiting,
hematochezia, chills or fever, but she is very weak and very thirsty. She has just returned from
a 2-week trip to a remote Central American archeological research site. Physical examina-
tion shows a temperature of 37.2 degrees Centigrade (99.0 degrees Fahrenheit), pulse 120/min,
respirations 12/min and blood pressure 90/50 mm Hg. Her lips are dry and skin turgor is
poor. What is the most likely cause of the diarrhea?
A. Anxiety and stress from traveling
B. Inflammatory disease of the large bowel
C. An osmotic diarrheal process
D. A secretory diarrheal process
E. Poor eating habits during her trip
This item has the following facets: Facet One: Setting-2. Scheduled appointment; Facet Two: Phy-
sician Task-3. Formulating most likely diagnosis; Facet Three: Case Cluster-1a. Initial workup of
new patient, new problem.
Although the item requires a diagnosis, it also requires the successful completion of tasks in the
first two facets. The scenario could be transformed to a testlet that includes all six physician tasks.
The genesis of the patient problem comes from the clinical experience of the physician/expert,
but systematically fits into the faceted scenario so that test specifications can be satisfied.
This approach to item generation has many virtues.
5. Ulla had a big wooden ball and an ordinary little door key made of metal. She wanted to see
if they could float on water. The ball was heavier than the key. Yet the ball floated while the
key sank to the bottom. Why did the little key sink?
A. The key is made of metal and metal always sinks.
B. The key cannot float on the water because it is too light.
C. The key sank because the water weighed down the key.
D. The key is too heavy for the water, so the water cannot carry it.
E. The key is heavier than the same volume of water.
The 48 items they created were distributed among three grade levels (three, four, and seven).
These items showed acceptable item characteristics as evaluated by the Rasch item response
model. What is most useful about this example is its ability to model conceptual thinking. AIG
methods typically model quantitative content.
Haladyna, 1980, 1982; Wesman, 1971). Because some items do not survive the item validation
process, the yield from human item-writing is often disappointing. Two item writers working
from the same item and test specifications with the same intent for content and cognitive demand
should write items with similar characteristics including difficulty and discrimination. In one
experiment, Roid and Haladyna (1977) used the same specifications for writing test items for a
science passage. They found a 10 percentage-point difference in the difficulty of their items. They attributed this
result to the subjectivity and non-standardization found typically in item writers, although the
content and item-writing practices were standardized. Current AIG theories and research prom-
ise to eliminate this subjectivity and increase the efficiency of item development and validation.
In this section, theory, research, and an emerging technology for AIG are described. Concepts
and principles are introduced and defined, and research is reviewed dating from roughly 1998 to
the present. The choice of 1998 corresponds to a symposium on AIG held at the Educational Test-
ing Service. This meeting led to the publication of Item generation for test development
(Irvine & Kyllonen, 2002).
Features of AIG
A diverse range of theorists and researchers have contributed to this growing science. Despite
variations in terminology and theoretical differences, six features seem prominent in AIG.
1. Framework
For some time, cognitive psychologists and measurement specialists have spoken of and par-
ticipated in partnerships to produce more valid measurement of student achievement (Snow
& Lohman, 1999; Mislevy, 2006a). A framework is a complex device that entails many steps in
test development, including item development and, within it, item generation methods. Two
frameworks come to mind as useful devices for modern item generation: evidence-centered design
(ECD; Mislevy & Riconscente, 2006; Mislevy, Winters, Bejar, Bennett, & Haertel, 2010) and
assessment engineering (AE; Gierl & Leighton, 2010;
Luecht, 2006a, 2006b, 2007, 2012; Luecht, Burke, & Devore, 2009).
ECD is both a research program and a development project that weds cognitive and instruc-
tional science. It focuses on complex performance and uses a reasoning model very much like the
one advocated by Kane (2006a, 2006b). One example of the use of ECD was reported by Sheehan,
Kostin, and Futagi (2007). They helped item writers more effectively develop passages for testlets.
Also, they developed an assessment framework to link task features of test content.
AE integrates construct definition, test design, item development, test assembly and scoring to
support formative and summative uses of tests. Lai, Gierl, and Alves (2010) provide examples of
AIG in an AE framework. Although the focus of this chapter is item generation, both frame-
works provide a place for item generation in a conceptual structure that is compatible with
the goals of AIG. A presentation on task modeling was provided by Luecht, Burke, and Devore
(2009). Instead of using item and test specifications, they used integrated item difficulty density,
cognitive characteristics of responses, and content. Their approach was to use these task mod-
els, and results confirmed the success of these efforts. Masters and Luecht (2010) also presented
examples of item templates in the framework of AE.
Readers interested in a more comprehensive treatment of ECD and AE should consult Gierl
and Haladyna (2012), Gierl and Leighton (2010), Luecht (2012), Mislevy and Riconscente
(2006), and Mislevy, Winters, Bejar, Bennett, and Haertel (2010).
2. Construct Definition and Analysis
AIG depends on an explicit construct definition. Construct definition has been a problem in edu-
cational achievement testing then and now (Chapter 3; Cole, 1980). In certification and licensing
testing, construct definition is less of a problem because each professional society defines the
domain of tasks to be performed by a person in that profession using practice analysis (Raymond
& Neustel, 2006). With AIG, the universe of generalization is operationally defined by the item-
generating methods developed. Thus, the argument for validity is based on how well these item-
generating methods map the target domain. This mapping can be a strength and also a weak-
ness. Examples of AIG are seldom sufficient to represent the kind of construct we want, such as
reading, writing, speaking, and listening. Standardization in item-writing is desirable, but if the
construct is not fully explicated via item generation, then item validation may prove futile.
3. Measurement Paradigms
A useful distinction is that AIG exists in three perspectives (Irvine, 2002). The first paradigm is
achievement testing where the traditional item bank represents the universe of generalization,
which is designated as an R-model. This paradigm is suitable for measuring scholastic achievement
or professional competence. The L-paradigm deals with latencies in item responses. The L-model
deals with fast and slow performances. This applies to speed tests, such as keyboarding speed. The
D-paradigm involves repeated measurement, such as a keyboarding accuracy test. Whereas the
latter two models have salience in ability/aptitude testing, the R-model is retained in this chapter
as the focus for AIG for measuring scholastic achievement or professional competence.
5. Generative Testing
Generative testing refers to a capability to generate test items automatically and rapidly (Bejar,
2002). Bejar described three types of generative testing. A low level involves methods presented
in section 4 of this chapter. Instead of focusing on a construct definition and construct analysis,
items are generated to model instructional objectives, which have been identified by SMEs. A
higher level of generative testing, called model-based, requires a construct analysis. Bejar cited
the work of Enright and Sheehan (2006) and Mislevy and his colleagues (Mislevy, Steinberg,
Almond, & Lukas, 2006) as employing model-based item-generation procedures. This kind of
AIG requires extensive work with establishing a network of knowledge and skills leading to the
development of a cognitive ability, a very daunting task. A third type of generative testing is
grammatical (Bejar & Yocom, 1991; Revuelta & Ponsoda, 1999; Embretson, 2006; Gorin, 2005).
This kind of generative testing is specifiable and verifiable. This approach comes closest to
realizing Bormuth’s concept of item generation from written material, but it will require more
theoretical development and research before a technology emerges that enters mainstream item
development.
6. Formative/Diagnostic Testing
As AIG becomes an effective force in item development, the ability to provide formative tests for
learners and to diagnose learning shortcomings becomes greater. It is very difficult to develop
summative achievement tests and also to produce in-depth tests of specific domains of knowledge
and skills representing current learning. If the potential of AIG is achieved, formative testing
with diagnostic prescriptions will become a reality. One recent example comes from the work of
Roberts and Gierl (2010). They propose a framework for cognitive diagnostic assessment called
the attribute hierarchy method. Although AIG is not an explicit feature of their method, AIG plays
a strong role in supplying items for such methods. Examples of diagnostic testing in the frame-
work of AE involving AIG were presented by Luecht, Gierl, Tan, and Huff (2006). Graf (2008)
reviewed recent research in this area and the different approaches proposed.
Recent Research
The research reviewed here dates from 1998 to the present. This research can be classified into
two strands. The first involves studies where researchers manipulate features of reading com-
prehension passages or actual items to produce predictable difficulty and discrimination. The
second strand involves item-generation strategies and research.
item development from a very different perspective (Sheehan, Kostin, & Futagi, 2007). Instead
of a pure form of AIG, their approach was to help item writers more efficiently and effectively
create reading passages for testlets. The research on language complexity for English language
learners also informs us about how grammar affects test performance (Abedi, 2006). Although
Abedi's work is not intended to improve AIG, his research points to specific variables that would
be consequential to studies that attempt to control item characteristics.
Future Research
The challenge ahead is to work within a framework and completely map a construct. This mapping
would employ item models so that items can be generated on demand with known difficulty and
discrimination for formative and summative evaluation of learners. The research reported shows
the potential to produce items with known item characteristics and to produce items in large
numbers. The greatest limitation with the item models is that all items essentially look the same.
Most tests have test items with extensive variety in content and cognitive demand, which suggests
that the construct is more refined than what is reflected when using a limited set of item shells to
define a complex construct. When a framework is more widely adopted and cognitive psychology
principles are in place, AIG will assist item writers in the short term and replace them later.
Haladyna and Shindoll (1989) defined an item shell as a hollow item containing a syntactic
structure that is useful for writing sets of similar items. Each item shell is a generic CMC test item.
A simple item shell is shown here:
The major limitation of item shells is that when the technique is used too often, the set of test items
produced looks the same. The same is true with most item-generating techniques. The solution to
this problem is to use a variety of item shells, and, when the SMEs have gained confidence, allow
more freedom in phrasing the stem and developing options. In other words, the item shell is a
device to get SMEs started writing items but it is not a mainstream item-generating device.
There are generally two ways to develop item shells. The first is the easier to use. The second
is more time-consuming but may lead to more effective shells. Or both methods might be used
in concert.
The first method is to use generic item shells shown in Table 8.2. These shells are nothing more
than item stems taken from successfully performing items. In that table, the shells are organized
around a classification system suggested by Haladyna (2004). However, these item shells can be
used in other ways using the classification system suggested in Chapter 3.
The second method involves a series of steps. First an item must be identified as a successful
performer. Second, the type of cognitive behavior represented by the item must be identified.
Third, the content that the item tests must be identified. Fourth, the stem is stripped of a critical
feature, which then remains blank.
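For readers who wish to automate that final stripping-and-filling step, the idea can be sketched in a few lines of code. The example below is only an illustration, not a tool described in this chapter; the shell wording and the content lists are hypothetical and would be supplied by SMEs.

```python
# Illustrative sketch only: treating an item shell as a stem template with blanked slots.
# The shell wording and the content lists are hypothetical; SMEs supply real content.
from itertools import product

SHELL = "Which of the following is the most likely cause of {finding} in {setting}?"

findings = ["postoperative fever", "acute shortness of breath"]   # the blanked critical feature
settings = ["an elderly patient", "a pediatric patient"]          # a second variable element

def generate_stems(shell, findings, settings):
    """Fill the blanked slots of a shell to produce draft stems for SME review."""
    for finding, setting in product(findings, settings):
        yield shell.format(finding=finding, setting=setting)

for stem in generate_stems(SHELL, findings, settings):
    print(stem)   # each draft stem still needs options and a key written by an SME
```

The shell constrains only the syntax; the substantive variation, the options, and the key remain the responsibility of SMEs, which is consistent with treating the shell as a temporary aid rather than a mainstream generating device.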
Morrison and Free (2001) developed item shells to measure clinical problem-solving in medi-
cine. Table 8.3 presents a partial listing of some of their item shells chosen from the complete
table they present in their article. As shown there, the item shells have much more detail and are
more specific about the content of the stem. Also, the shells provide some guidance for the varia-
tions possible, including patient problems, settings, and treatment alternatives. There is much to admire in these item shells, which capture one of the most difficult and elusive qualities to measure: professional health-care clinical problem-solving.
The basis for item shell development is empirical—it derives from successfully performing
items and the expertise of SMEs who develop these shells. Once developed, the shells provide
help for novice item writers. Once the item writers gain more experience and confidence in item-
writing, the item shells can be abandoned or modified to avoid developing too many items that
have the same appearance. Thus, the item shell is only a temporary remedy for the item writer.
Generic Testlets
As chapter 5 reported and illustrated, the testlet is a useful SR format for modeling complex
thinking as represented in tasks that reflect a cognitive ability. To generate testlets rapidly that
have desirable content and cognitive demand is a very valuable item-writing technology. A natu-
ral transition of the item-shell technique is found with generic testlets. This work is based on
research by Haladyna (1991) but also has roots in the earlier theories of Guttman and Hively
(discussed in Roid & Haladyna, 1982; Haladyna, 2004).
The key idea in developing generic testlets is the establishment of a generic scenario, which
is a short story containing relevant information to solve a problem. The scenario is very much
like a Guttman mapping sentence. It contains two or more variable elements that constitute the
important variation. Sometimes the scenario can contain irrelevant information that requires the
test taker to sort through information. Haladyna (2004) provided this example for the teaching
of statistics:
Given a situation where bivariate correlation is to be used, the student will (1) state or iden-
tify the research question/hypothesis, (2) identify the constructs (Y and X) to be measured, (3)
write or identify the statistical null and alternative hypotheses, or directional, if indicated in the
problem, (4) identify the criterion and predictor variables, (5) assess the power of the statistical
test, (6) determine alpha, (7) when given results draw a conclusion regarding the null/alterna-
tive hypotheses, (8) determine the degree of practical significance, (9) discuss the possibility of
Type I and Type II errors in this problem, and (10) draw a conclusion regarding the research
question/hypothesis.
The above example involved one statistical method, product-moment correlation. A total of 18
common statistical methods are taught. With the use of each method, four statistical results vari-
ations exist: (a) statistical and practical significance, (b) statistical but no practical significance,
(c) no statistical but potentially practical significance, and (d) neither statistical nor practical sig-
nificance. Thus, the achievement domain contains 72 possibilities. Once a scenario is generated,
the four conditions may be created with a single scenario. For example:
Two researchers studied 42 men and women for the relationship between amount of sleep each
night and calories burned on an exercise bike. They obtained a correlation of .28, which has a
two-tailed probability of .08.
Several variables can be employed to create more scenarios. The sample size can be increased or decreased. The nature of the study can be varied according to the SMEs' personal experience
or imagination. The size of the correlation can be systematically varied. The probability can be
manipulated. With each scenario, a total of 10 test items is possible. With the development of
this single scenario and its variants, the item writer has created a total of 40 test items. Some item
sets can be used in an instructional setting for practice, while others should appear on formative
quizzes and summative tests. For formal testing programs, item sets can be generated in large
quantities to satisfy needs without great expense. Table 8.4 shows a fully developed testlet based
on this method.
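The arithmetic of this achievement domain (18 methods by four result conditions, with 10 generic questions per scenario) lends itself to simple enumeration. The sketch below is a hypothetical illustration for the correlation example only, not the authors' generator; the template wording and the values being varied are assumptions, and scipy is used to keep the reported probability consistent with the correlation and sample size.

```python
# Illustrative sketch: enumerating scenario variants for the correlation testlet.
# The template wording and the values being varied are hypothetical assumptions.
from itertools import product
from math import sqrt
from scipy import stats

TEMPLATE = ("Two researchers studied {n} men and women for the relationship between amount "
            "of sleep each night and calories burned on an exercise bike. They obtained a "
            "correlation of {r:.2f}, which has a two-tailed probability of {p:.2f}.")

def two_tailed_p(r, n):
    """Two-tailed p value for a Pearson correlation r with sample size n."""
    t = r * sqrt((n - 2) / (1 - r ** 2))
    return 2 * stats.t.sf(abs(t), df=n - 2)

sample_sizes = [42, 120]                 # the sample size can be increased or decreased
correlations = [0.08, 0.28, 0.65]        # the size of the correlation can be varied

scenarios = [TEMPLATE.format(n=n, r=r, p=two_tailed_p(r, n))
             for n, r in product(sample_sizes, correlations)]

# Each scenario supports the 10 generic questions listed above, so the six scenarios
# generated here would yield about 60 draft items for SME review.
print(f"{len(scenarios)} scenarios -> about {len(scenarios) * 10} draft items")
```

A fuller version would loop over the other statistical methods and the four result conditions in the same way, but the principle is the same: one scenario template plus a small set of systematically varied elements.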
The generic testlet provides a basis for testing complex, multistep thinking that is usually sce-
nario-based. In most circumstances, an open-ended performance test may seem justified, but
scoring is usually subjective, and that presents a threat to validity. The generic item set makes no
assumption about which test item format to use. However, the generic item set technique is very
well suited to simulating complex thinking with the SR format.
The generic testlet has a structure much like Guttman’s mapping sentences and Hively’s item
forms. On the other hand, the item writer has the freedom to write interesting scenarios and
identify factors within each scenario that may be systematically varied. Writing the generic questions is also a creative endeavor, but once they are developed they can be reused across variations of the scenario. Writing the correct answer is fairly straightforward, but writing the distractors requires some inventiveness. Once options have been developed, they can be used repeatedly with different scenarios of the same structure, as the example in Table 8.4 shows.
The generic item set seems to apply well to quantitative subjects, like the statistics examples. However, how well it applies to non-quantitative content remains an open question.
Testlet Conversions
In various testing programs, we often encounter stand-alone items built around scenarios. These test items resemble testlets but lack the additional items that make the testlet format so popular. The reading time for such an item is large compared with simpler stand-alone items, yet the yield from the item is a single score point. A simple strategy is to identify the scenario in a stand-alone item and convert it to a testlet containing many items. The goal is to model the
complex thinking that leads to a correct solution to a problem. Two examples are provided and
discussed. Table 8.5 presents one of these conversions.
Thus, testlet conversion is actually a recycling of a perfectly good stand-alone item that con-
tains a scenario. The scenario is retained by the item writer, who then develops a series of items
that probe into steps perceived to be logical and sequential in arriving at the correct solution.
Occasionally, the series of items might be generic and applied to all scenarios of a certain set.
Testlet conversion is highly recommended as a practical remedy for generating test items of a
high cognitive demand that simulate complex problem-solving performance. Although a per-
formance type test item might have higher fidelity to the target domain, the testlet offers good
fidelity in an objectively scorable format.
Despite this progress, several issues and challenges remain for AIG.
1. Construct definition, analysis, and modeling. Construct definition has been a shortcom-
ing in scholastic achievement. In the health professions, construct definition is more sys-
tematic and centered on tasks performed by professionals that require knowledge and
skills. AIG can produce test items in quantitative scholastic subjects to a limited degree,
but an item model is needed for each outcome, and we have literally hundreds of out-
comes for which item models are needed. ECD and AE concern construct definition and
analysis. Refinements to either approach are much needed so that common subject matters like reading and writing can be adequately defined. This is a daunting task.
Until scholastic constructs have a target domain consisting of tasks requiring knowledge
and skills that are hierarchically structured, AIG will be limited to augmenting human
item-writing.
2. For what kind of content can we generate items? AIG seems very useful for certain types of
content but not all content. Item forms and mapping sentences seem best suited to quan-
tities and least suited to logical relationships or prose transformations, such as those the Bormuth theory attempted to produce.
3. A danger—lopsided AIG. As Wainer (2002) pointed out, because AIG is suited for quanti-
tative content, the tendency is to develop a lopsided technology and forgo the more expen-
sive and challenging item generation for types of complex thinking that are non-quantita-
tive. The challenge here is for learning that involves principally prose passages.
4. Another danger—construct misrepresentation. A point well made by Gierl and Leighton
(2004) is that using item models runs a risk of limiting our operational definition of the
construct only to those items produced by the models. If the item models inadequately represent our target domain, then what harm is done to validity? For instance, for fourth-grade reading, the Arizona content standards list 35 performance objectives (http://www.ade.state.az.us/standards/language-arts/articulated.asp). How many item models should be produced? How do we ensure that an item model is an adequate representation of the construct?
5. Still another danger—are isomorphs isomorphic? The concept of isomorphs seems increas-
ingly empirically established. However, we know from interviews with learners that prior knowledge and learning history relate to item performance. Without the assessment of
the learning histories of learners, isomorphic items may not be all that is claimed for them:
identical content and cognitive demand and predictable item characteristics.
6. Security and transfer. Morley, Bridgeman, and Lawless (2004) bring up a threat to validity
that comes from test takers recognizing item models and thereby compromising the security
of test items. Once an item model is known and shared, can test takers take advantage of
this knowledge and increase performance? In their study, they found a transfer effect that
seems to compromise security.
7. Terminology. One of the most difficult tasks is sorting out the meaning of terms used by
theorists and researchers of AIG. A standard terminology does not exist. As the science of
item-writing and AIG matures, standardization is very likely as a central paradigm emerges.
8. Feasibility. A fully explicated construct consisting of item models may not exist. The only
documented example is the architecture examination and the item modeling led by Bejar.
However, because it is a secure licensing examination, the extent to which the explication of the construct is complete or adequate is difficult to judge without extensive public disclosure. A construct is easiest to explicate fully in a profession where the competencies are established as a target domain and the universe of generalization is clearly, logically, and empirically established (see Raymond & Neustel, 2006).
Summary
This chapter presented two very disparate aspects of AIG. The first aspect involves theory,
research, and technology that intend to make AIG a science of item development and validation.
The second aspect includes practical methods shown to improve item-writing. These methods
help item writers generate more items. These practical methods have current-day applicabil-
ity but, as the technology for AIG continues to grow and improve, we will see the time when
items will be generated by computer, probably for adaptive testing sessions. Although significant
progress has been made, this field has a long way to go to achieve the lofty goals it has set out for
itself. The merging of cognitive science and measurement theory has afforded theorists a better
understanding of the constructs being defined. However, current methods seem limited to con-
tent that is decidedly quantitative.
9
Formats and Guidelines for Survey Items
Overview
The measurement of attitudes, behaviors, preferences, beliefs, opinions, and many other non-
cognitive traits is often accomplished via a survey. We have many books, chapters, and journal
articles that provide guidance, but few are grounded in theoretically relevant and empirically
validated models. Much of the guidance provided in this book for the development of selected-
response (SR) and constructed-response (CR) items also applies to survey item development.
This chapter builds on that guidance.
Just as with a test, a survey item elicits a response by a respondent, thus allowing analysis
and interpretation and in some cases informing decision-making. During the item development
process, we assemble evidence supporting the validity of our interpretation of the survey results.
The process of assembling this evidence is presented in chapter 1.
First, we identify other useful resources that guide item development for surveys and the
design of surveys and analysis of data. Second, we present a taxonomy of survey item formats.
Finally, we present guidelines for survey item development. These guidelines are divided into
two sections, one on CR items and the other on SR items.
Social Exchange Theory describes the conditions under which respondents answer questions with honesty and effort. These principles play a role in every step of the survey process and apply to many item-writing guidelines.
1. Do you disagree or agree that courses delivered entirely online meet the same quality stand-
ards as classroom courses?
{ Disagree
{ Tend to disagree
{ Tend to agree
{ Agree
{ Do not know
2. Would you consider registering for an online course if the topic was of interest to you?
{ Yes
{ No
3. I use the following sources to stay up to date on the upcoming national election.
Internet
Magazines
Newspapers
Radio
Campaign signs
Television
Word of mouth
Another form of SR survey item is the ranking item, which requires respondents to enter or select
a ranking for each option. We recommend against this format because other formats provide the
same information with less effort (guideline 29).
Constructed-Response Formats
We have many CR survey item formats. These include (a) a numeric response, (b) a single short
response, (c) a list of items, and (d) a description or elaboration. These four CR survey item types
are illustrated below.
4. For how many years have you lived in your current residence?
If less than 1 year, enter 0.
years
5. Who is your favorite author?_____________________________________
6. List three important characteristics of an academic advisor.
Characteristic #1: _________________________________________________
Characteristic #2: _________________________________________________
Characteristic #3: _________________________________________________
7. Describe one thing that you would like to change about your school.
CR items typically take the form of a question or prompt with a blank space or text box where sur-
vey respondents provide their responses, which vary from a single word or number to an extended
response. This CR format works well when the intent is to obtain descriptive information or when
the number of plausible options is very large. One criticism of the SR survey item is that the options
available for selection for any given item force respondents to respond in a way they may not wish or
that the available options influence responses. In addition, when the survey item developer is inves-
tigating a topic for which little is known, CR items allow the survey developer to explore the realm
of possibilities as perceived or experienced by the respondents. Both require careful early planning
in the item development stage, placing intended content and cognitive demand at the forefront. CR
responses can also take the form of a drawn figure, diagram, map, or other graphical images.
Consider the three age items below: one CR item and two SR versions. The first SR format has seven age categories and the second has four age categories. The need for information
should be determined by the purpose of the survey. Perhaps specific age information is not required
for the intended uses of the survey data. Also, as explained by Social Exchange Theory, obtaining
precise age from respondents who are unwilling to provide personal information may be difficult.
8. What is your age in years?
years
9. Please indicate your age within the following ranges.
{ 16–25 years
{ 26–35 years
{ 36–45 years
{ 46–55 years
{ 56–65 years
{ 66–75 years
{ 76 or more years
10. Please indicate your age within the following ranges.
{ 16–35 years
{ 36–55 years
{ 56–75 years
{ 76 or more years
Many of the costs and benefits for CR formats in survey items are similar to those with test items
(see chapter 12 on scoring). The choice of item format is also similar to the SR/CR format deci-
sions in achievement and ability tests, which were discussed in chapter 4.
General Guidelines
Table 9.2 presents these general guidelines, which apply to all survey items.
1. Every item is important and requires a response. The item should apply to all
respondents, unless filter questions are used to exclude a respondent.
This guideline has important implications for the item developer. First, the survey item should
apply to everyone. If you start a question with the word If, you are likely to leave out some
respondents. A filter question should precede the question to make sure the question applies.
For example, if you intend to ask questions about decision-making in a previous election, you
first need to ask whether the respondent voted in the election. If the respondent indicates that they did vote, then through branching, either in an online survey or on a paper survey, the respondent will be asked only the questions that apply to them.
In this example, a filter question is presented. In a paper-and-pencil form, navigation devices
must be presented (skip directions or arrows or indenting questions within filter questions). With
online surveys, the skipping can be automated, as a function of the response to the filter question.
Yes No
A. To change my government { {
B. To encourage decisions about specific issues { {
C. To exercise my responsibilities { {
D. To exercise my rights { {
E. To support my political party { {
F. To voice my opinions { {
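To make the branching concrete, the sketch below shows automated skip logic for a filter question. It is a minimal, hypothetical illustration: the question wording, the list of reasons, and the function names are assumptions rather than features of any particular survey platform.

```python
# Minimal sketch of automated skip logic for a filter question in an online survey.
# Question wording, reason list, and function names are hypothetical.
def administer(ask):
    """Run the voting block; `ask` is any function that poses a question and returns the answer."""
    voted = ask("Did you vote in the most recent national election? (yes/no)").strip().lower()
    if voted != "yes":
        return {"voted": False}          # follow-up questions are skipped; they do not apply
    reasons = {}
    for reason in ["To change my government",
                   "To exercise my rights",
                   "To support my political party"]:
        answer = ask(f"Was this a reason you voted: {reason}? (yes/no)")
        reasons[reason] = answer.strip().lower() == "yes"
    return {"voted": True, "reasons": reasons}

# Example run with canned answers standing in for a live respondent:
answers = iter(["yes", "no", "yes", "yes"])
print(administer(lambda prompt: next(answers)))
```

If the respondent answers no to the filter question, the follow-up items are never presented, which is exactly the behavior the guideline asks for.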
Moreover, the questions that apply to everyone should be important and directly relevant to the
survey purpose. The questions must be relevant and appear important. Otherwise, respondents
will not respond to the survey. This guideline is consistent with test item-writing guideline 4, to
survey important content. Trivial information should be avoided. Particularly now that online surveys are commonplace, respondents are becoming more selective about which surveys they choose to answer.
With the second statement, do respondents consider whether the staff member knows either or
both the hotel facilities and the surrounding area? There is no way to tell from a response to such
a statement. In a similar customer service survey, a government agency asked customers to rate
their experiences.
Sometimes, statements contain two very similar or related conditions or characteristics. Courte-
ous and pleasant are related but not synonymous. Efficient and valuing your time are not the
same. To avoid confusion and miscommunication, each item should convey a single concept.
This is consistent with SR item-writing guideline 1.
16. How likely or unlikely are you to support each funding source for the new campus football
stadium?
If state law restricts the use of tuition revenue to cover institutional instructional costs only,
then this item is technically not accurate. In other words, do not suggest something that is
impossible.
Although the above items appear simple, each item invites error. The gender item lists options
horizontally and the answer box between Male and Female might be chosen in error because of
its proximity to both options (see guideline 20). Age is ambiguous and could be answered many
ways (e.g., 20s, middle-aged, or who knows). Number of years in MN is similarly vague. Dillman
et al. (2009) found that incomplete sentences like County: result in erroneous responses, includ-
ing a response like USA. The word county differs from the word country by a single letter. By
using questions or complete sentences in the survey, you provide an important source of consistency and avoid ambiguity. In this sense, remembering the principles of Social Exchange Theory
is important: we need to ask questions in a way that engenders respect.
Original:
25a. How useful, if at all, did you find the online discussion board (e.g., posting comments,
questions, and responses)?
Revision:
25b. How useful, if at all, was the online discussion board?
It is possible that a respondent used the discussion board to keep up with ongoing discussions,
but never posted. For the purpose of learning, the online discussion board was very useful, but for
posting, it was not useful. The challenge here is that some respondents will use the examples (i.e.,
posting) to define the intent of the question. Examples might unintentionally restrict the content
of the survey question.
Original:
26a. How important, if at all, is it for a beginning teacher to know the practices of classroom
management (e.g., developing classroom procedures, defining standards of conduct,
creating a positive learning environment)?
Revision:
26b. How important, if at all, is it for beginning teachers to understand these classroom
management practices?
Not at all important    Slightly important    Somewhat important    Very important
A. Developing classroom procedures { { { {
B. Defining standards of conduct { { { {
C. Creating a positive learning environment { { { {
When it seems important to provide examples, consider providing each example as a separate
item. If the general notion of classroom management is what is important, then no examples
should be provided to avoid over-specification or leading contexts.
6. Use simple, familiar words; avoid technical terms, jargon, and slang.
Unless the point of the survey item is to evaluate understanding or particular use of technical
terms, jargon, or slang, these types of words should be avoided. There are many examples of
words that appear in casual conversation or writing, but for which there are more simple and
direct options. A thesaurus is an important tool for the survey item writer. Dillman et al. (2009)
offered a handful of examples of complex/simple pairs of words, including exhausted/tired, lei-
sure/free time, employment/work, responses/answers, extracurricular activities/after-school
activities. The best way to capture these words is through a comprehensive pilot study and think-
aloud interviews with those you plan to survey. A few examples are presented below.
27a. What type of school would your child attend if you decided to leave the public schools?
{ Private, parochial school
{ Private, non-parochial school
{ Charter school
{ Cyber school
{ Home school
A simple way to remedy the technical nature of the options above is to provide brief definitions
of each option. Also notice that the item is not technically accurate (guideline 3). Charter schools
are also public schools. Virtual high schools and even home schools could be considered public.
Also, notice the complex grammatical structure and tense of the question: would … if … A better
example is presented below.
27b. Which type of school is the best alternative to the local public school district?
{ Private, parochial school (religious affiliation)
{ Private, non-parochial school (non-religious affiliation)
{ Charter school (publicly funded independent school)
{ Cyber school (classes provided entirely online via the World Wide Web)
{ Home school (classes provided at home by parent, tutors, or online network)
The issue of slang or jargon is often audience-specific. Knowing your audience is the most impor-
tant key to successful survey item writing, just as knowing the subject-matter content is the most
important key to successful test item writing. Consider these survey questions.
28. How many apps do you download in an average month?
29. How often do you use an MP3 player?
30. How far is the nearest Wi-Fi connection from your residence?
31. In what ways can homeowners improve property value through sweat equity?
The underlined terms might be considered jargon. In some respects, the use of jargon or slang
expresses a common understanding and knowledge. It represents important images and conveys
important meaning to your audience. However, causing confusion or misunderstanding among some respondents is a risk not worth taking. Making connections with your respondents through clear and unambiguous communication is just as important. If jargon or
slang is important to include, then offer simple definitions.
32. How many apps (software applications) do you download in an average month?
33. How often do you use a digital music player (MP3 or iPod)?
34. How far is the nearest wireless internet connection (Wi-Fi) from your residence?
35. In what ways can homeowners improve the property value of their home through sweat
equity (non-financial investments)?
7. Use specific, concrete words to specify concepts clearly; avoid words that are ambigu-
ous or words with multiple or regional meanings.
Poorly designed survey items often have ambiguous language. Princeton University houses a
website called WordNet (http://wordnet.princeton.edu). This website provides a lexical database
for English. One interesting use of this website for the survey item writer is its ability to identify
synonyms. For example, entering the word run in the online word search, we find 16 different noun meanings, including a score in baseball, an experimental trial, a
race, a streak of good luck, a short trip, a political campaign, a discharge, and unraveled stitches.
There are 41 different verb meanings. For instance, one is to move fast using one’s feet. Another
is to control operations or a business.
In a survey from an insurance agency, customers are asked to rate their agency on Being acces-
sible. To some, accessible means the agent is there when I stop by or I can call the agency and talk
to the agent when I need to or the agent responds to e-mail within 24 hours. To others, accessible
means I can get my wheelchair into the office easily or there is an elevator in the building or there
is free parking or the office is on my bus route.
Unless the purpose of the item is to investigate understanding or perceptions of words with
regional meaning, such words should be avoided. Uncovering multiple or regional meanings of
words without some review of draft survey items by knowledgeable others is difficult. Again, an
important step to uncover these challenging words is the think-aloud process.
In the next examples, items are positively worded: I experience …, I come to class …, and Our
principal is …. However, the context of each is negative (difficulties, incomplete, disrespectful).
A complex example from the National Survey of Student Engagement (Indiana University,
2012) employs a negative term in the stem (not assigned), but the item has a positive direc-
tion. Also note the complexity introduced by the word or. There is a general question followed
by several statements, one of which is provided here as an example (Indiana University, 2012,
Question 3b. Used with permission). We note that this item was dropped from the 2013 version
of the NSSE.
41. During the current school year, about how much reading and writing have you done?
A. Number of books read on your own (not assigned) for personal enjoyment or academic
enrichment. (The five response options range from None to More than 20.)
Research suggests that items that are worded negatively or that are connotatively inconsistent
are not simply the opposite or the simple reversal of positively worded items (Chang, 1995). In
studies of response structures, Chang reported that connotatively inconsistent items tend to load
on a separate factor from the connotatively consistent, positively worded items. For example,
optimism is not a simple reflection of pessimism, where, in some constructions, optimism and
pessimism may exist on a continuum, or optimism and pessimism may coexist at any level as two
separate constructs. A simpler example is the argument that being not happy is different from
sad, or worse, not being not happy is different from happy. Moreover, some respondents are likely
to miss the negative word in the item or have difficulty understanding the item because of the
increased complexity due to the negation, and thus provide erroneous responses.
42. Four out of five dentists use brand X toothpaste. Which brand of toothpaste do you buy
for your family?
{ Brand X
{ Brand Y
{ Brand Z
{ None of the above
In most surveys, items are written in a specific, positive direction. The respondent is asked to
agree or disagree. Because the statement is usually directional, the item must be balanced, which
means that both ends of the response scale are opposite (guideline 11). Unfortunately, creating
balance in the statements to be rated is nearly impossible. Consider another example, which has
an imbalanced stem.
Ignoring the fact that the statement is vague regarding degree of importance and that the options
are not balanced (three agree and one disagree), the larger problem is that the item implies that
one’s job is important. The respondent is led either to agree or to acquiesce. A better construction
provides for balance in the stem and provides information about magnitude (degree or amount)
and direction (positive or negative).
44. How important, if at all, is your job to the mission of the organization?
{ Not at all important
{ Slightly important
{ Moderately important
{ Very important
To consider a less direct form of leading, items can be subtle regarding how loaded they are to force
an undesired interpretation. Consider the following example from a survey of college faculty.
45. For each item, indicate whether you agree or disagree with the characteristics that describe
your department chair.
Agree Disagree
A. Takes the time to learn about my career goals. { {
B. Cares about whether or not I achieve my goals. { {
C. Gives me helpful feedback about my performance. { {
The difficulty with interpreting agreement ratings with such items is whether the agreement is
for the direction or the magnitude of the statement. This kind of statement appears to indicate a
positive behavior, such that agreement is a positive response. However, the statement also con-
veys magnitude, which is vaguely stated, the time to learn conveying more than casual interest
or helpful feedback conveying more than a simple good job or nice work. One could plausibly respond by disagreeing either if the feedback was not helpful or if the feedback was more than helpful, perhaps invaluable or significantly more important than just helpful.
If the item is about whether the administrator provides feedback, then leave it at that (Gives
me feedback about my performance. True, False). Then, if there is interest in the nature of that
feedback, another question could be used to gauge the degree to which the feedback was helpful
(To what extent, if at all, is the feedback helpful? Not at all, Slightly, Somewhat, Very).
CR Survey Items
30. Clearly define expectations for response demands.
31. Specify the length of response or number of responses desired in the item stem.
32. Design response spaces that are sized appropriately and support the desired response.
33. Provide labels with the answer spaces to reinforce the type of response requested.
34. Provide space at the end of the survey for comments.
The point here is to balance the item by including the full range of responses. This action pro-
motes a complete and thoughtful response. With a one-sided item, the respondent has an initial
impression focused on one side of the response scale.
48. To what extent do you agree or disagree with uses of e-books and reading devices?
In this example, the items are behavioral: will use or am using. These items could be answered
with a yes/no response. Whether one agrees or tends to agree is not relevant. The point here
is that not enough consideration was given to what is being measured or what information is
needed. When asking questions about what will be done, many responses are possible, including
the yes/no response or a likelihood (unlikely to very likely). When asking about what a respond-
ent is doing, response options could include yes/no or frequency (never to frequently). There are
other options for both types depending on the kind of information desired.
Consider an item requesting frequency information:
What are the issues with this item? First, the stem is unbalanced and indirect. It might be better
phrased:
How frequently, if at all, are these tools and techniques used in your course?
Then, the response scale needs to be reconsidered. The scale points are vague with respect to
frequency of instructional practices. For example, what does it mean to use textbook examples
all of the time? Is it possible to use more than one instructional technique all of the time? Never
and seldom may work, but the last three categories refer to of the time, which is not defined. Can
we replace of the time with days or weeks, for example: some days, most days, every day? More
thought must be given as to what the plausible options are in the population (see guideline 24)
and what information is desired or relevant given the purpose of the survey.
Finally, a single item example is given where the response options represent an ordinal scale,
regarding frequency, mixed with qualitatively different options that are not on the same response
scale.
50. To what extent did you use the course textbook as a resource to prepare to teach this course?
{ I did not use the textbook.
{ I referenced only a few things in the textbook.
{ I used the textbook a moderate amount.
{ I used the textbook a great deal.
{ The textbook was my primary resource.
In this item, we have four different response scales in the five options. (a) The first option (did not
use) is an appropriate response to a yes/no question. (b) The second option is in response to ref-
erencing the textbook. (c) The third and fourth options are more in line with the stem, regarding
extent of using the textbook. (d) Finally the fifth option is in response to whether it was a primary
resource or not, but does not refer to the extent to which it was used. Another result of the incon-
sistent reference to the stem is a set of options that are not mutually exclusive (guideline 23).
In a similar sense, it is important that the category labels match the metric in the item stem.
Consider an item where the category labels do not match the information requested in the
stem:
Notice that the response options do not describe percentages, but vague quantities. As an alter-
native, categories of percent ranges could be asked, within the actual percent distribution found
in the population (guideline 24). Since the national average is about 14% (National Center for
Education Statistics, 2012), this might be a good place to center the categories.
Miller's classic work is often invoked when considering the number of rating scale categories. He introduced the idea of seven pieces of information, plus or minus two.
He was also concerned about the length of lists of items that could be repeated. Contemporary
reviews of this classic work suggest the limit of working memory is closer to three or four (Far-
rington, 2011). Memory capacity also appears to depend on the information being stored and
may not be consistent. Cowan (2001) provided evidence in many settings that the limit of cogni-
tion is closer to four units of information.
With the flexibility of the web-based survey, a new format for rating scales is available. One
of these is the graphic or visual analog rating scale. This allows respondents to use a continuous
scale for responding using a slide-bar or by simply clicking in the continuous space between two
endpoints. Cook et al. (2001) experimentally studied the use of an evaluation survey associated
with research library initiatives, where 41 items were rated on a one to nine-point scale with the
traditional radio-button response options (circles clicked on or off) and a slide-bar rating on a con-
tinuum of 100 points. They found the highest reliability resulting from the radio-button discrete
responses rather than the slide-bar responses, and, as expected, the slide-bar responses took more
time. Such graphic rating scales can take many forms, including the examples in Figure 9.1 (for instance, a 10-point strongly disagree to strongly agree scale, a scale running from no pain through moderate pain to worst possible pain, and a four-point strongly disagree to strongly agree scale).
A uniquely successful graphic is the FACES Pain Rating Scale (Figure 9.2). The developers initially used a five-point pain scale, but because healthcare providers were so familiar with the 10-point pain scale, they found this version to be more functional for them. They suggest that the scale is designed for people who find it hard to use numbers for self-assessment and remind us
that the numbers on the Wong-Baker FACES Pain Rating Scale are really for the caregivers, not
for the person experiencing pain (Connie M. Baker, Executive Director, personal communica-
tion, July 19, 2012).
Couper, Tourangeau, and Conrad (2006) studied the use of graphic rating scales and experi-
mentally controlled the use of standard radio-button entry, numeric entry in a text box, the use
of a midpoint, and numbered versus unnumbered scale points. The slide-bar was not visible on
the graphic scale until the respondent clicked on the scale, so as not to influence their starting
point. The survey presented eight vignettes regarding health and lifestyle behaviors, requesting
respondents to identify the degree to which the given outcome was a result of genetics or the
environment. They found that the graphic rating scale responses had higher rates of missing
responses and non-completion and longer response times. There were no advantages in response
distributions from the use of graphic rating scales. Because of these findings with large samples,
we recommend against the use of the graphic or visual analog rating scale.
Health professionals often use the 10-point pain scale with patients. Because of this, a 10-
point rating scale is clearly within their cognitive capacity—we expect that health professionals
can distinguish among 10 points on a scale consistently across many items. In some settings,
the 10-point pain scale includes the value of 0, showing the absence of pain, and thus includes a
midpoint (5).
Based on a global review of the evidence, we believe that the four-point rating scale is generally
the most useful. The overriding rule for such a decision is to know your respondents (Halpin,
Halpin, & Arbet, 1994). With younger survey respondents, the yes/no options or the happy/neu-
tral/sad faces shown above are useful. The real point is to capture reliable information, given the
ability of respondents to locate their position on a scale consistently.
53. Should the instructor spend less time, the same time, or more time discussing course readings?
{ Less time
{ About the same time
{ More time
Here, About the same time is conceptually relevant to the item and can be interpreted with cer-
tainty; it is not an ambiguous neutral position. Other examples include just as likely, the same
amount, or others.
15. Provide balanced scales where categories are relatively equal distances
apart conceptually.
The challenge is to identify labels that capture a consistent, relatively equal distance between
categories. We recognize that more than just the labels we use will influence respondents,
including the ordering of the labels, the number of categories, and the space between labels and
response spaces (distance between response circles). How these elements work in concert is
best addressed through a study with a sample of respondents, such as a think-aloud, where we
learn how respondents interpret the response options. Nevertheless, to maintain logical con-
sistency in ratings, the psychological distance between options experienced by the respondent
should be equal.
As described in guideline 18, we suggest labeling every category. In an attempt to provide a
label for every category, it is important to generate words that conceptually imply an equal dis-
tance. However, this is a subjective process.
Note in this example that there is a large gap between Always and Sometimes, whereas a very
small gap exists between Rarely and Never. In addition, regarding guideline 17, the generic fre-
quency ratings are perhaps not the best rating scale conceptually for this type of item, regarding
classroom reading activities: What does it mean for students to always engage in silent reading?
A better rating scale might be:
This rating scale is not a perfect alternative, but it is more concrete. Perhaps a more consistent
equal-distance set of categories would be five days a week, four days a week, three days a week,
two days a week, once a week, and less than once a week. Note that the response categories are
positioned on the page in columns of equal width as well, as space between categories also con-
veys distance information (guideline 16).
Wakita, Ueshima, and Noguchi (2012) measured the distance between adjacent rating scale
points and found greater distortion in the psychological distance with the seven-point scale than
with the four- or five-point scales. They attributed this result to the use of the neutral position. In
particular, this result came from items regarding socially negative content, which may be influ-
enced by social desirability bias. These findings also support rating scales with fewer points and
no midpoint.
Consider a question regarding income. Because of the sensitivity of such a question, categories
are often used where respondents can select a range. Note in the example, the three middle cat-
egories include ranges of $5.00. It makes no sense to continue with $5.00 ranges all the way up to the largest hourly wage in a specific population. We might only be interested in separating
those who have wages within the most common ranges, those with less than $20.00 per hour (see
guideline 24). Where we do have specific values, they are equally distant.
An easy way to create a four-point scale is to select the differentiating descriptors and balance
them with the opposites of the desired characteristic. For example, we might be able to use the
descriptors very and somewhat with any number of characteristics, using their opposites: see
Table 9.5.
Table 9.5 Examples of Balanced Rating Scale Labels
Very weak, Somewhat weak, Somewhat strong, Very strong
Very slow, Somewhat slow, Somewhat fast, Very fast
Very cold, Somewhat cold, Somewhat hot, Very hot
Very satisfied, Somewhat satisfied, Somewhat unsatisfied, Very unsatisfied
Very useful, Somewhat useful, Somewhat useless, Very useless
Very comfortable, Somewhat comfortable, Somewhat uncomfortable, Very uncomfortable
Very old, Somewhat old, Somewhat new, Very new
In Table 9.6 are example scale-point bidirectional and unidirectional labels for scales of vary-
ing length.
Table 9.6 Examples of Rating Scale Labels for Rating Scales of Various Lengths
5-points in reference to attribute strength: very weak, somewhat weak, adequate, somewhat strong, very strong
6-points in reference to satisfaction: very unsatisfied, moderately (somewhat) unsatisfied, slightly unsatisfied, slightly
satisfied, moderately (somewhat) satisfied, very satisfied
6-points in reference to amount: none, very little, little, some, a good amount, a great deal
7-points in reference to performance level: very basic, somewhat basic, slightly basic, adequate, slightly advanced,
somewhat advanced, very advanced
7-points in reference to a normative rating based on average experience or expectation: far below, somewhat below,
just below, average (or met expectation), just above, somewhat above, far above
10-points in reference to a general construct, such as satisfaction, importance, or relevance: completely dissatisfied,
extremely dissatisfied, largely dissatisfied, moderately dissatisfied, slightly dissatisfied, slightly satisfied, moderately sat-
isfied, largely satisfied, extremely satisfied, completely satisfied.
Any number of category labels can be obtained by reducing the number of labels from the
10- or seven-point scales. Five- and seven-point scales can be transformed to four- and six-
point scales by removing the middle label. Bidirectional scales can also be turned into unidirectional scales by taking one end of the continuum and grounding it with not at all (or whatever amount fits the context): not at all satisfied, slightly satisfied, somewhat satisfied, very satisfied.
Avoid absolutes like never, completely, always, unless these are plausible options. They will
rarely be selected and often reflect a very different distance from the adjacent categories.
16. Maintain spacing between response categories that is consistent with measurement
intent.
As an extension of the previous guideline, more than just the category labels influences inter-
pretations of response categories. Consider the following example where spacing is a function of
category label length.
56. To what extent is advisor support important, if at all, to the following aspects of your
research training?
The first row contains the width of the column in inches. This may seem like a subtle or irrelevant
feature, but the Somewhat important category is the widest and the Not important category is the
narrowest, potentially conveying information about relevance.
Make sure the columns containing response labels and response options are equal width. Con-
firm that the font and font size are the same across labels. Also, achieve balance in the length of
response category labels. In the example above, each label contains two words. Here is an exam-
ple where the columns are equal width, but the labels are not equal in length.
57. How often, if ever, do hall monitors discuss problem solution strategies following each
event?
Every time such an event occurs    Sometimes or occasionally    Rarely    Never
A. Students arguing { { { {
B. Students fighting { { { {
58. Based on your experiences in the department, please check the appropriate box to indicate
your level of agreement or disagreement with each statement.
The survey item might rank-order students well with respect to their belief that communication is
good within the department. However, each specific item is then less useful in understanding the
quality of the communication. It may be more informative, and serve the function of separating different aspects of communication, to ask about the quality of the communication directly.
59. Based on your experiences in the department, please rate the nature of the communication
among and between faculty and students.
How is the communication in the department?    Very poor    Somewhat poor    Somewhat good    Very good    No basis for judgment
A. between faculty and students { { { { {
B. among the faculty { { { { {
C. among the students { { { { {
20. Align response options vertically in one column (single item) or horizontally in one
row (multiple items).
Two issues are present here, one including the placement of the rating categories and response
spaces, the other regarding the distance between response spaces on the page. For a single item, it
makes sense to list the options vertically, as done in the following example.
Original:
60a. The assistance my counselor gives me is supportive.
{ Disagree { Tend to disagree { Tend to agree { Agree
Revision:
60b. The assistance my counselor gives me is supportive.
{ Disagree
{ Tend to disagree
{ Tend to agree
{ Agree
Better revision, using a response scale cognitively consistent with the stem:
60c. To what extent, if at all, is the assistance received from your counselor supportive?
{ Not at all supportive
{ Slightly supportive
{ Somewhat supportive
{ Very supportive
Placement in this manner avoids the problem of ambiguity in the association between response
category and response space. When placed horizontally, the response categories and spaces can
be misinterpreted, as in the original format. This guideline was previously presented for conven-
tional SR test items for ability and achievement. In these instances, two-column presentation of
items may save space and present items in a more compact form.
21. Place non-substantive options at the end of the scale; separate them from substantive
options.
If for some respondents a plausible option is not in the rating scale, a separated alternative
response must be offered. Some of these responses include: does not apply, do not know, unsure,
no basis for judgment, no experience, and no opinion. We do not want to force respondents to
locate themselves on the response scale if the item does not apply to them. It is important to sepa-
rate these alternative responses so that they are not considered to be part of the response scale.
Consider the following example.
61. Based on your experiences and knowledge of other students’ experiences in the program,
please click the appropriate button to indicate your level of agreement or disagreement with
each statement.
Disagree    Tend to Disagree    Tend to Agree    Agree    No basis for judgment
A. There is good communication between faculty and students in the department.    { { { { {
62. Thinking of your most recent contact with your child’s teacher, how did you make
contact?
{ Phone
{ Written note
{ E-mail
{ In person
63. When is the MOST convenient time for you to attend professional development training
courses?
{ Weekdays
{ Weekday evenings
{ Weekends
{ Anytime is convenient
{ No time is convenient
Sometimes, the options are limited naturally. For example, when asking questions about prefer-
ence for one program over another, the response options will naturally reflect those programs
that are either available or those that could be adopted. In such cases, stating the nature of the
selection process in the question itself is usually helpful.
64. Considering the after-school programs currently available, which is most useful to improv-
ing student achievement?
{ Homework hotline
{ Monday/Wednesday volunteer tutors
{ Tuesday/Thursday library help sessions
65. On average, how many hours of unpaid volunteer work do you do in a month?
{ None
{ 1 to 5 hours
{ 5 to 10 hours
{ 10 hours or more
The three final response options overlap with a common limit (5 or 10 hours). Even a small
degree of overlap might cause distress in a respondent. Here is a common example of a question
regarding income, where there is no overlap among categories due to their separation of $1.00.
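Overlap of numeric categories is easy to catch with a quick check before a survey is fielded. The sketch below is only an illustration, under the assumption that categories are recorded as inclusive whole-unit bounds; the specific ranges mirror the flawed volunteer-hours item above and one non-overlapping alternative.

```python
# Sketch: check that numeric response categories do not overlap and leave no gaps.
# Bounds are inclusive and assume whole-unit (e.g., whole-hour) categories; values are illustrative.
def check_categories(ranges):
    """Return a list of problems (overlaps or gaps) between adjacent categories."""
    problems = []
    ordered = sorted(ranges)
    for (lo1, hi1), (lo2, hi2) in zip(ordered, ordered[1:]):
        if lo2 <= hi1:
            problems.append(f"overlap: {lo1}-{hi1} and {lo2}-{hi2}")
        elif lo2 > hi1 + 1:
            problems.append(f"gap: {lo1}-{hi1} and {lo2}-{hi2}")
    return problems

flawed = [(1, 5), (5, 10), (10, 1000)]   # "1 to 5", "5 to 10", "10 or more" hours
fixed = [(1, 5), (6, 10), (11, 1000)]    # "1 to 5", "6 to 10", "11 or more" hours
print(check_categories(flawed))          # flags the shared limits at 5 and at 10
print(check_categories(fixed))           # reports no problems
```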
There is another consideration for categories that appear to be mutually exclusive but might in
fact be simultaneously true. Consider the following example:
Depending on the purpose of the survey, these categories might be considered mutually exclusive,
but for other purposes, they present overlap. For example, the fact that there is a set of multiple
categories within one (e.g., married/partnered and separated/divorced) may result in difficulty in
interpretation. In another case, it may be possible to be a widower and currently single or mar-
ried, or separated. Or any other combination is possible. Perhaps a widow is previously divorced
or is currently partnered.
A common scenario regards implementation of a new program or initiative. Consider this set
of options.
68. To what extent are you prepared to implement Disarm in your advising sessions?
{ I am ready to implement Disarm immediately.
{ I need more practice to implement Disarm.
{ I still have basic questions about implementing Disarm.
{ I already use Disarm in my advising sessions.
The first and last options are plausible responses to the question about preparedness to imple-
ment the new program. However, options two and three are responses to a different question,
regarding needs for additional training or information. Even the first option includes a compo-
nent that is unrelated to the question, regarding the term immediately. The question was about
preparedness, making each of the four options overlap. It is possible that a single individual could
find that all four options apply.
24. Response categories should approximate the actual distribution of the characteristic
in the population.
Knowing your audience is important. If we are asking for age information of middle school stu-
dents, it is unlikely that we need to include age categories of 20 years old or greater (maybe a
catch-all category such as 16 or older). Similarly, it makes no sense to include highly specific
annual income categories for adults (such as increments of $10,000), whereas including smaller
categories for teens might be more important.
In the example below, the first set of categories has two extremes. Always and never are seldom
chosen. Vague quantities are not useful to respondents. In the second set, the four choices are
intended to represent what is actually experienced with this population.
69. How often did you attend soccer games this year?
{ Always { Five or more times
{ Sometimes { Three or four times
{ Seldom { One or two times
{ Never { Never
Other examples of this idea are present in some of the other guidelines. In a recent alumni survey for a private college, the following income categories were offered as alternatives to the question. From this set of categories, we surmise that this college is interested in separating the number of alumni with very high salaries. Five of the eight categories include amounts of $100,000 or greater, whereas 25% of U.S. families earn this much; fewer than 11% earn more than $150,000 annually (U.S. Census Bureau, 2012). We should expect a greater percentage of families with college degrees to earn above this threshold, but the implication is that the college expects graduates to be in the highest income categories among U.S. families. Although these categories are not equal in width, they may represent the actual distribution of graduates.
71. Which degree requirement do you and your doctoral advisor talk about most often?
{ Required courses
{ Elective courses
{ Pre-dissertation research project
{ Written comprehensive exam
{ Dissertation
Consider the results contained in the parentheses following each option; based on 200 respond-
ents, 3% reported Don’t know. The data analyst may delete these responses and consider them
missing. This action will not eliminate the bias introduced with the other option.
What are the appropriate inferences from such results? The results clearly show the most
popular burger places, but the 17% for Smash Burger as listed from the other option distorts our
interpretation.
From a statistical perspective, making comparisons across prespecified and other options is
not appropriate. Not all respondents were given the opportunity to respond to the other options.
This item is badly flawed due to the omission of Smash Burger.
When deciding item format or content, we must consider the purpose of the survey. If some-
one is surveying for information regarding opening a new franchise, then the options should be
more suitable for those being surveyed and for the purpose of the survey. The item should be
rephrased.
The other option is a natural way to finish a list of options on a pilot version of an item. How-
ever, after the pilot test, the other option should be eliminated.
73. In which activities have you worked with your advisor? (Check all that apply)
Designing a research project
Analyzing data
Writing a research report
Writing a journal article
Writing a grant proposal
Presenting at a scholarly meeting
This SR format includes the challenge of how to handle the options not selected, since it is impos-
sible to know if they were skipped intentionally. In addition, no preference is shown among the
options selected. To avoid unintended response sets and ambiguity introduced through non-
response, it is important to require a response to every item (guideline 1). For the above item, it
is preferable to ask respondents how necessary, how useful, or how often each activity was done
with an advisor. For each activity, a response is required.
74. For online courses, do you consider these forms of technical support to be necessary or not?
                                                                      Necessary    Not Necessary
A. Assistance by telephone available during regular office hours         {               {
B. Assistance by telephone available 24 hours a day (or a time frame close to that)         {               {
C. Assistance by e-mail available during regular office hours            {               {
28. Use differently shaped response spaces to help respondents distinguish between
single-response (circles) and multiple-response items (squares).
A design standard is now in place regarding the use of circles or squares (radio buttons or check boxes in online surveys): circles are used when a single response is required and squares when multiple responses are requested. However, since we recommend avoiding multiple-response items (guideline 27), the use of squares for survey responses should rarely occur.
76. Which are true for you? (Check all that apply)
My advisor informs me about new publications in the field.
I have done research with my advisor.
I have written one or more papers with my advisor.
I receive grant writing support from my advisor.
29. Avoid ranking items; if necessary, ask respondents to rank only a few items at once.
Respondents find ranking difficult and often do not follow the rules. It may be difficult for respondents to avoid ties or to make distinctions between similarly desirable options. When we ask respondents to rank a set of statements, we may be asking them to do work that we could do more easily ourselves. If respondents instead rate the objects in a survey, ranking information is easily obtained by the data analyst, as sketched below.
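For example, if respondents rate each type of technical support on a 1-to-5 necessity scale, the analyst can derive a ranking from the mean ratings. A minimal sketch in Python follows; the support types and ratings shown are hypothetical, for illustration only.

# Derive a ranking from mean ratings rather than asking respondents to rank.
# The support types and ratings below are hypothetical illustration data.
ratings = {
    "Telephone assistance, office hours": [4, 5, 3, 4, 4],
    "Telephone assistance, 24 hours":     [5, 5, 4, 5, 4],
    "E-mail assistance, office hours":    [3, 4, 3, 3, 2],
}

means = {item: sum(r) / len(r) for item, r in ratings.items()}

# Sort from highest to lowest mean rating; the position + 1 is the derived rank.
ranked = sorted(means.items(), key=lambda kv: kv[1], reverse=True)
for rank, (item, mean) in enumerate(ranked, start=1):
    print(f"{rank}. {item}: mean rating = {mean:.2f}")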
Original:
77a. Please rank the following types of technical support available to students in an online course
in terms of how NECESSARY each is, from 1 = MOST necessary to 6 = LEAST necessary.
___ Assistance by telephone at regular office hours
___ Assistance by telephone at 24 hours (or a time frame close to that)
Revised:
77b. Please identify the three most important types of technical support available to students in
an online course. From the list, write the letter associated with each type of support.
A. Assistance by telephone at regular office hours
B. Assistance by telephone at 24 hours (or a time frame close to that)
C. Assistance by e-mail at regular office hours
D. Assistance by e-mail at 24 hours
E. University provided required software
Open-ended items provide an opportunity to explore possibilities and to gain fuller access to respondent thinking. These items must address important topics that are relevant to the respondent, and they should not appear too soon in a survey. As motivation builds through the process of taking the survey and learning more about the topic being covered, the respondent becomes more likely to respond to these open-ended items.
Perhaps the most difficult aspect of using open-ended items comes in the analysis phase.
Qualitative analysis methods are available to maximize the value and meaning derived from
open-ended responses. Some methods are aided through computer software that helps identify,
code, classify, and associate response features. Most statistical software programs produce results
through standardized routines, without requiring an assessment of the validity of underlying
assumptions. Far too often, survey researchers attempt to code, classify, count, and rank order
responses. Unfortunately, when we move from qualitative to quantitative methods using open-
ended responses, we lose track of the nature of the responses themselves. It is difficult to justify
counting and rank ordering responses when respondents were not able to react to the responses
of others—the same problem we find with the other option (guideline 26).
The ability to code, combine, and order responses is possible when the sampling structure
of the responses allows such manipulation, not because we have a system or software tool to do
so. We find no justification for suggesting that the most frequently reported comment is more
important or more relevant than any other, simply because more respondents thought to produce
it. We know from experimental research that if these responses were provided to all respondents,
a different ordering of importance, prominence, or relevance would result (Tourangeau, Rips, &
Rasinski, 2000). Moreover, if our identification and coding of themes is too concise, we may lose
the richness and depth of the responses our audience took the time and effort to provide to us.
Open-ended items are important tools for uncovering complex beliefs, preferences, opinions,
behaviors, and contexts. Careful, theoretically driven classification is often helpful, particularly
when faced with volumes of responses. However, quantitative manipulation such as counting,
ranking or ordering of such data is generally unwarranted. Such quantitative inferences are not
defensible. To value the responses truly, each response deserves equal treatment.
Original:
78a. How long have you lived in Minnesota? ____________________________
Revised:
78b. For how many years have you lived in Minnesota? If less than 1 year, enter 0.
years
In the first example, respondents have no idea what is expected of them. One could plausibly
respond: Way too long! Here are two examples with clear expectations.
79a. Including yourself, how many principals has your current school had in the past 10 years?
Principals
79b. What is the ideal temperature of your home during winter heating months?
Degrees Fahrenheit
There are additional examples provided in the remaining guidelines below, including examples
requesting mileage, and others with the labels preceding the response space.
31. Specify the length of response or number of responses desired in the item stem.
The stem of the CR item should explicitly state what is expected in the response, including the
length of response and number of responses desired. As in guideline 30, the CR item should be
explicit in what is being requested. Do you want a paragraph, a sentence, a single word? Do you
want one, two, three or more examples? Do not leave the respondent guessing, as this will cause
frustration in some and simply inspire others to skip the question completely.
Original:
80a. Please list two or three things you learned from teaching in the Global MBA program?
__________________________________________________________________
__________________________________________________________________
Revised:
80b. Please list three things you learned from teaching in the Global MBA program.
1. __________________________________________________________________
2. __________________________________________________________________
3. __________________________________________________________________
Additional examples of items with specific directions about the length or number of responses
are provided here.
32. Design response spaces that are sized appropriately and support the desired response.
Evidence suggests that the size and design of response spaces influence the responses provided (Dillman et al., 2009).
For example, consider the two questions below. Both are CR items with approximately the same
response space, but clearly they have different requirements for what should constitute a com-
plete response. The first question does not require much space, since it should be no more than
a number. The second question might require multiple lines since a teacher’s relationship with a
principal could be complicated.
84. For how many years have you worked at your current school? __________________ years
85. Please describe your relationship with your school principal? _______________________
With online surveys, text boxes are used for the entry of responses. Two options are available. One is a fixed-size box that allows entry of only a single line or a specific number of characters. The other is a scrollable box, which provides additional lines that scroll as the text is entered. Scrollable boxes also have a finite number of characters that can be entered, but they can be very flexible, allowing for hundreds of characters. As suggested by guideline 30, it is imperative that clear instructions be provided to respondents so that they know the text is being entered in a scrollable box that allows for more lines than can be seen.
Figure 9.3 shows two scrollable text boxes. The first has only two lines of text entered, so the
scroll bars do not appear. Once the third line of text is entered, the scroll bars appear.
[Two screenshots: a scrollable text box with only two lines of text, so the scroll bars do not appear, and a scrollable text box with multiple lines of text, where the scroll bars appear.]
Figure 9.3 Scrollable text entry boxes.
The question regarding how much space should be provided in a paper-and-pencil survey is
challenging, as space is at a premium and in large part determines survey costs, such as paper,
printing, and postage. The best way to decide how much space is required for a CR item is to pilot
the form. However, even before the pilot, the survey designer should know the audience and what
might be reasonable for the nature of the question. It is safe to err on the side of slightly more space
than needed, to accommodate those respondents who might have more to say.
33. Provide labels with the answer spaces to reinforce the type of response requested.
When an answer space is provided following a CR item, it helps respondents to know exactly
what is expected of them if the space includes a label to reinforce what should be entered in the
space. There are examples of this associated with guidelines 30 to 32. In guideline 30, the example
of a question about age is given. Immediately following the response space, years is included, to
reinforce the idea that age is being requested in years.
Many other examples could be given of this type, which primarily focus on the provision of a
numeric answer. Consider the following questions.
86. How many miles do you drive one-way to your place of employment?
Miles, one-way
87. How many laptop and desktop computers are in your household, if any?
88. Number of laptop computers:
89. Number of desktop computers:
In guideline 31, to reinforce the idea that three courses are requested, the spaces are preceded by
the labels: Course #1, Course #2, and Course #3. This action may seem unnecessary, but it helps
guide the respondent in at least three ways: (a) exactly where the response should be written, (b) how long the response should be, and (c) how many responses were requested. In a similar
example, having two response spaces helps ensure that you will receive two characteristics; how-
ever, the lines are too short (guideline 32).
90. What are the two most important characteristics of a school principal?
First most important characteristic: _________________
Second most important characteristic: _________________
Surveys often close with a general invitation for comments, such as the following examples.
Comments:
Please write your comments here:
What else can you tell us?
Anything else you would like to add?
These invitations are not particularly inviting. The last question only requires a yes/no response.
A couple of these statements are not even complete sentences and do not meet the item-writing
guidelines presented here, mostly regarding the clarity of the directions or expectations for a
response. All of these statements suffer from being indirect and vague. Improved invitations for
final comments include the following examples.
Do you have any final comments about your expectations or experiences with advising?
Please describe any issues related to your online course experiences that we might have missed.
Please tell us about any other experiences you had in the program that we should consider.
Once a set of items has been thoroughly reviewed and edited, a pilot test is necessary. The pilot is needed to select items for the final survey and to support validity. Statistical properties of survey items are discussed in chapter 18.
Finally, item order is an important consideration. The first item must be easy to respond to, must apply to everyone, and must be interesting enough to focus the respondents' thinking directly on the topic of the survey. The first question informs the respondent what the survey is really about. The remaining topics should be organized by issue, with items having similar response options grouped together to facilitate responding. For example, a set of items might use the same rating scale. The more important or salient issues should be presented at the beginning of the survey and the less salient issues later. Personal questions or background information should be reserved for the end of the survey. As trust develops over the course of the survey, respondents are more likely to divulge personal information at the end. CR items are also more likely to be answered at the end of the survey, because respondents have invested a great deal by that time and may be more willing to fulfill that investment with responses requiring more effort.
III
Developing Constructed-Response Test Items
As reported in previous chapters in this volume, item formats are classified into three broad cate-
gories: (a) selected-response (SR), (b) constructed-response (CR) with objective scoring (CROS),
and (c) constructed-response with subjective scoring (CRSS). This organization of formats was based on four conditions discussed in those earlier chapters.
Chapter 10 presents the anatomy of CR items and presents various CR formats. Examples are
provided, and recommendations are made for the design, validation, and use of each format.
Characteristics of each CR format are discussed and recommendations are made regard-
ing whether a format is suitable for a standardized testing program or limited to classroom or
instructional use.
Chapter 11 provides guidelines for designing and validating CR items that are either objec-
tively or subjectively scored. These guidelines come from many sources. The guidelines mainly
focus on content, directions to the test taker, and conditions for performance.
Chapter 12 presents scoring guidelines and procedures for CROS and CRSS items. The prin-
ciples derive from experience and are less reliant on theory or research. These guidelines are best
practices. However, there are some threats to validity to consider that do have a body of research.
Rater consistency is an important topic for CRSS items because it relates to reliability.
These three chapters are a coordinated set that should help those designing and validating
CR test items for standardized testing programs. Those interested in designing CR items for
classroom assessment have many resources to use, as there are many books on this topic (see
Haladyna, Downing, & Rodriguez, 2004). However, this set of chapters provides a very compre-
hensive set of guidelines and procedures that those involved in classroom/instructional assess-
ment may also find useful.
A difficulty encountered in developing this coordinated set of chapters is a paucity of research.
With the SR format, there has been a steady stream of research that provides more of a founda-
tion for formats, guidelines, and scoring. For CR items, the scientific basis for the accumulated
guidance offered in these chapters is not as substantial.
Other chapters in this volume provide CR formats that are particular to certain types of testing.
For instance, chapter 13 deals with the measurement of writing ability. This chapter features many CR
item formats and identifies problems with the subjective scoring formats. Chapter 14 presents
CR formats for measuring competence for certification and licensing. Some unique formats are
presented in that chapter. Chapter 15 deals with the measurement of students with disabilities
and the challenges faced. Some formats included in that chapter are CR. Standards for educa-
tional testing are an important feature in that chapter. There is considerable variety in these CR
formats but very little research on the effectiveness and capabilities of these formats to measure
desirable learning with a high cognitive demand.
10
Constructed-Response Item Formats
Overview
This chapter will present and address general characteristics of CR item formats. The many issues
in choosing an item format were covered earlier in chapter 4 and by Rodriguez (2002). In the first part of this chapter, some important distinctions among CR item formats are presented, followed by background on the history of the CR format and the research supporting its use. Then the CR item formats themselves are presented. Where possible, research is cited.
As noted previously and often in this volume, the cognitive demand of any CR item should be higher than can be achieved with an SR item. If the content is to be subjectively scored, the SR format is eliminated and a scoring guide (a rubric/descriptive rating scale or set of scales) must be used. Evidence of this cognitive demand must come from subject-matter expert (SME) analysis or from interviews with test takers showing that knowledge and skills are used in combination in a unique way.
The instruction may be a question or a command to the test taker. The instruction may involve
a single sentence or several pages of directions on the nature and scope of the performance being
undertaken. As noted previously, many CR tests consist of a single item. Increasingly, tests con-
sist of both CR and SR items. For instance, current versions of the NAEP use both CR and SR
formats of many types.
The conditions for performance provide the details to the test taker that answer all pos-
sible questions, so there can be no doubt about the performance. In some circumstances, the
conditions can be very simple and in other circumstances very complex. Conditions are
developed by SME item developers with the idea of providing clear and complete directions.
These conditions may also specify the administration procedure that will be followed. For most
standardized tests, time allowed and materials available are part of these conditions. Chap-
ter 11 provides guidance about the instruction and conditions for performance vital to any
CR item.
The criteria for scoring should be part of the item, as this informs the test taker about the value assigned to aspects of performance. With objective scoring, the criterion is usually right/wrong with point values. For subjective scoring, the criteria may include one or more rubrics with point values for discriminable levels of performance. Chapter 12 provides extensive information about scoring criteria.
One distinction concerns what the test taker is expected to do versus how the task is to be completed. This is the difference
between Fill in the blank (what to do) and by selecting the correct response from the options below
(how to do it).
In their edited volume, Bennett and Ward (1993) provided a series of chapters addressing a
wide range of issues in CR tasks that included performance tests and portfolios. Two chapters
attempted to provide classification schemes for CR item formats. Bennett (1993) described the
label constructed response as a superordinate classification including a variety of formats, and
repeated the structure of the earlier seven-category framework. In the same volume, Snow (1993)
presented a continuum of CR item formats based on one facet of test design, the distinction
between selecting a response (SR items) and constructing a response. He also argued that many
other facets could enter a larger taxonomy of item formats, including how tasks are administered,
how responses are produced, and many others. In his continuum, he included SR formats; SR with some construction (e.g., providing a reason for a selection); simple completion or cloze forms; short answer or complex completion (e.g., generating a sentence or paragraph); problem exercise (e.g., solving a problem or providing an explanation); teach-back procedure (e.g., explaining a concept or procedure); long essay, demonstration, or project; and a collection of any of the earlier tasks, as in a portfolio.
The first two dimensions of the taxonomy describe the nature of the cognitive processes employed (the cognitive demand of the task) and could relate to instructional objectives from simple to complex. The first dimension concerns the type of reasoning employed, which ranges from low-level factual recall to predictive reasoning.
The second dimension involves the nature of the cognition employed, on a continuum from
convergent to divergent thinking. Convergent thinking is based on a comparison among avail-
able information in a process of narrowing or clarifying understanding; whereas divergent think-
ing begins with a premise and an exploration for alternative contexts where it may be applied,
in an expansion of thinking, rather than a focusing of thought. Osterlind and Merz were not
able clearly to differentiate formats between these two ends of the spectrum and recognized that
students could use either form of thinking to respond to many items, although not both forms
simultaneously. Although we see value in recognizing these forms of cognition, we would not
restrict the application of one or the other to specific CR item formats, so the utility of this dimension in a taxonomy of CR formats is unclear.
The third dimension describes the types of response process required. These include closed-
ended products or open-ended products. The format permits few or many responses. Based on
our description of characteristics of CR formats, this dimension aligns closely to the degree to
which responses can be objectively or subjectively scored.
Notice the taxonomy does not rest on any other feature of the items or tasks themselves, leav-
ing open the widest possible range of formats—the distinguishing features are the cognitive and
response processes. Also, Osterlind and Merz do not suggest that a specific CR format will fit
within one cell of the taxonomy, but that items themselves, depending on their location on the
three dimensions, will fit within the taxonomy. It is not a method of categorizing item formats,
but items themselves.
As we review the many formats in the CR arena, there are few characteristics of the formats
that distinguish one from the other. More than anything, the nature of the response is one pos-
sible distinguishing characteristic, and the open/closed product is one way to classify the differ-
ences. Other characteristics, like the reasoning competency required or the cognitive demand,
are functions of the specific item and can vary within any given CR format.
The first, objective versus subjective scoring, is the most fundamental distinction among CR formats. Whether a task is objectively or subjectively scored depends on how the outcome is defined. An operationally defined response is objectively scored, whereas an abstractly defined trait is subjectively scored. Spelling is objectively scored, whereas the organization of one's writing is judged along a continuum.
The second, product versus performance, makes the important distinction in any CR item
of whether the process of performance or the end product is the focus of the measurement. For
instance, how a certified public accountant (CPA) does an audit may emphasize the perform-
ance/process and the result is the product—did the CPA get it right? Product and performance
continue to be a vital distinction in designing and validating any CR item.
The third, open-ended versus closed-ended, emphasizes whether the product is evaluated by a judge according to well-defined guidelines or criteria or whether the product is to be creative.
Osterlind and Merz (1994) referred to this characteristic as unconstrained and constrained. Writ-
ing a short story is an open-ended product. Writing an informative report, such as a newspaper
article on a story, is closed-ended. Most essays intended to measure writing ability are closed-
ended, where open-ended content is not evaluated.
The 21 CR item formats presented in this section of the chapter are intended to be compre-
hensive. Most of these formats are better suited for classroom/instructional testing and not for
testing programs. With each format, examples are provided, research may be cited regarding its
validity, and a recommendation is made regarding its appropriate and valid use for testing pur-
poses outside the classroom. Table 10.4 lists these CR item formats and provides indications of
how each might be scored. The table shows whether the outcome is product-oriented or perform-
ance-oriented, whether the outcome is open-ended or closed-ended, and whether the outcome
measures knowledge, skill, or a task representing an ability.
Anecdotal Record
Mostly used in classroom settings, anecdotal records are brief accounts of student behavior,
understanding, performance, and any number of non-cognitive characteristics that may inform
educational decisions like selection, placement, instruction, and intervention. Anecdotal records
can be informal or quite formal. They are best suited to describing a single incident or behavior and should contain factual, observable information and a description of the context of the incident. These
formats are important in settings where other forms of testing are less common, including
preschool and early childhood education settings and special education. A publication of the
International Reading Association provides some guidance to enhance anecdotal records as a
tool for standards-based authentic assessment (Boyd-Batstone, 2004). The anecdotal record
might also be suitable information to include in a portfolio to provide additional insight into
what the portfolio is measuring. In the evaluation of professional performance such as in teach-
ing, medicine, or a sport, anecdotal records and observational notes in anecdotal form provide
information. Such information is valid if it corresponds to or contributes to other data used to make
an assessment. Sometimes, such information is useful in supporting reassignment or disciplinary
action, as with professional practice.
1. During morning exploration time, Jose pretended to read his favorite book, the Dinosaur
Book, to Luke. Although he does not read yet, he paged through the book correctly and
recalled some ideas associated with most of the pages.
2. After having his toy truck taken by Melissa, Robert hit Melissa on the back with an open
hand. Melissa began crying. When he was asked to explain why he hit Melissa, Robert
responded by saying “I do not like her.”
3. Following his afternoon nap, Larry immediately went to the kitchen looking for his snack. Upon reaching the kitchen and discovering that his snack had not been put out yet, Larry threw a tantrum until his snack was given to him.
Cloze
The cloze procedure is a technique for measuring reading comprehension and fluency. A reading
passage is identified, which is typically curriculum-based. Words from the passage are replaced
with a blank line typically according to a word-count formula, such as every seventh word. Other
methods of word deletion can be used, for example, removing words that convey quantity. Some-
times, choices are supplied for each blank in the passage. The task requires sufficient vocabulary and recognition of meaning from context. The cloze procedure was first introduced by
Taylor (1953).
4. When making pancakes, there are a _______ (few) common mistakes to avoid. First,
use _______ (measuring) tools to get the right amount _______ (of) ingredients in the
mix. Next, be _______ (sure) to crack the eggs in a _______ (separate) bowl so that you
avoid getting _______ (eggshells) in the mix. Finally, don’t mix _______ (the) batter too
much; small lumps help _______ (make) the pancakes fluffy and light. [Every 7th word
deleted.]
5. When planting a garden, many things need ____(to) be planned beforehand. First, decid-
ing how _____(much) space is available for the garden in _____(order) to plant the right
amount of ______(vegetables) for the garden. Next, vegetables need to be _____(planted)
at the right time so they can ____(have) optimal growing conditions. Lastly, having a mix
____(of) early harvest vegetables and late harvest vegetables ____(can) help increase the
yield. [Every 8th word deleted.]
6. Basketball is a team sport. ____(two/2) teams of ____(five/5) players each try to score by
shooting a ball through a hoop elevated ____(10) feet above the ground. The game is played
on a rectangular floor called the court, and there is a hoop at each end. The court is divided
into ___(two/2) main sections by the mid-court line. When a team makes a basket, they
score ____(two/2) points and the ball goes to the other team. If a basket, or field goal, is made
outside of the ____(three/3)-point arc, then that basket is worth ___(three/3) points. A free
throw is worth ____(one/1) point. [Every number word deleted.]
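A minimal sketch of how an every-nth-word cloze passage of this kind might be generated is shown below in Python; the function name, the deletion interval, and the sample passage (drawn from example 4) are illustrative assumptions, not a prescribed procedure.

def make_cloze(passage: str, n: int = 7):
    """Replace every nth word with a blank and return the cloze text plus the answer key."""
    words = passage.split()
    answers = []
    for i in range(n - 1, len(words), n):
        token = words[i]
        core = token.strip(".,;:!?")              # the word the reader must supply
        trailing = token[len(core):] if token.startswith(core) else ""
        answers.append(core)
        words[i] = "_______" + trailing           # keep any trailing punctuation
    return " ".join(words), answers

passage = ("When making pancakes, there are a few common mistakes to avoid. First, use "
           "measuring tools to get the right amount of ingredients in the mix.")
cloze_text, answer_key = make_cloze(passage, n=7)
print(cloze_text)
print(answer_key)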
The cloze procedure has been and continues to be a very effective research tool. For example,
Torres and Roig (2005) used the cloze technique as a tool for detecting plagiarism. However,
the cloze procedure is not customarily used in testing programs and is not recommended for
measuring knowledge, skill, or abilities because we have more effective ways to measure aspects
of reading.
Demonstration
A demonstration is a performance test where the individual is given an opportunity to demon-
strate a skill or ability within a specified context. Many will recall having to do a demonstration
in speech class or demonstrating specific techniques in art, music, or physical education. The
focus here is on the comprehensiveness of the process as explained and/or demonstrated by the
individual. Unlike performance, described below, the focus is much more on the process than
the product. An objectively scored checklist of important steps and elements of the skill is typi-
cally used to score a demonstration. This format might be used in a certification or licensing test
in many fields where a candidate has to show evaluators how something is done.
Discussion
A written or oral discussion-based test is an effective method for assessing critical thinking. Often
in a discussion task, there is no right answer. Responses are open-ended. This discussion could be
one-sided or part of a debate. Individuals are asked to discuss an issue or problem that requires
evaluation, prediction, synthesis, or other higher-order thinking. The format requires the individual to think critically; knowledge of an issue is less important than the critical thinking exhibited. As can be imagined, scoring a discussion is a complex task. This could
occur in many formats, including two or more test takers discussing a topic in person or via
computer simultaneously or over time, a test taker discussing a topic with a rater, or a test taker
discussing points of a topic alone or in writing (paper or computer).
Essay
An essay provides an opportunity to measure knowledge, skills, and abilities. Often, the essay
test consists of a single item. The essay item encompasses a variety of possible tasks, with responses ranging from a single sentence to five or more paragraphs. The purpose of an essay test is to elicit
a response to a prompt, which is a stimulus. The prompt can be a question or command. The
focus of the test in an essay task is content. We typically expect the response to an essay prompt
to be more than a single sentence (which is more likely to be a short-answer item). This format
should be distinguished from the writing prompt, which is intended to elicit a sample of writing
ability (discussed later in this chapter under the writing sample).
14. Describe the effects of global warming on mammals on the coastline of Alaska.
15. Who are the major contributors to 20th-century American literature? Why?
16. What is the origin of a black hole?
In these examples arising from instruction, the intent is for the test taker to respond correctly.
The first item has three well-known, valid effects. The second item has five widely agreed major
contributors who have been the target of instruction. The essay can also be used to evaluate a
candidate for certification or licensure. For instance, in a bar examination, the candidate lawyer
is presented with a case problem and must discuss how the case would be handled. Chapter 13
presents more information about the issues and problems related to essay formats. One issue is
the content of an essay versus writing mechanics and other writing traits, such as organization.
Exhibition
An exhibition is a commonly used measuring device in the humanities, particularly studio or
performing arts, and other fields where creativity and technical skill are displayed in the design
and construction of products. An exhibition, then, is a display of a collection of work or products.
A common form of an exhibition is an artist’s exhibition of paintings or sculptures. An exhibi-
tion may consist of interpretive objects, such as those recounting an event or epoch in history.
For instance, it may be an exhibition covering the events in the Lewis and Clark exploration to
the Pacific Northwest. Individuals or groups of individuals place their work on display. One or
more judges will then evaluate the body of work on several prespecified criteria, typically by using
subjectively scored rating scales. The exhibition has very much the same objective as a portfolio,
which is also presented in this chapter. The product of an exhibition is unconstrained, as the process of putting the exhibition together is not the objective of measurement.
17. Create three display boards (of a prespecified size) illustrating three spaces, each with one
perspective and description regarding the space. Each display must include a complete
architectural illustration of a home living space.
18. Prepare a collection of four poems you have written for a website.
19. Create a five-page collection highlighting the best sketches or drawings of the same inani-
mate object from differing perspectives. A minimum of two drawings or sketches must be in
color.
Experiment
An experiment can be the basis for a test item in a variety of ways. The items can be open-ended
or closed-ended with subjective or objective scoring. The experiment may include any combi-
nation of events including designing or setting up an experiment, conducting an experiment,
recording the results, and recording the findings or writing a lab report. In fact, the experiment
may have both subjective and objective aspects in its scoring.
In typical educational settings, an experiment is carefully specified under standard conditions,
so that the performance can be scored objectively. This requires carefully specified instructions;
otherwise, an experiment that largely depends on the innovation of the test taker will require
subjective scoring. Some experiments can be exploratory, with much less specified, allowing open-ended products and innovative procedures. Consider the examples below.
20. Does how much dye is in candle wax affect how quickly the candle burns? Identify three
levels of dye in three different candles, noting how much dye is contained in each candle. Set
up and conduct the timing experiment. Record the results and provide a working hypoth-
esis about the relation between the amount of dye and burn time.
21. Does the current produced by a fruit battery depend on the type of fruit used? Create a
fruit battery using a variety of citrus fruit (lemon, lime, orange, grapefruit), zinc and cop-
per nails, a section of holiday lights, and a multimeter. Conduct an experiment to test the
strength of the current from different citrus fruit. Monitor and record results and produce
a working hypothesis about the relation.
Fill-in-the-Blank
The fill-in-the-blank format is much like the cloze procedure, except it usually involves a single
sentence with one or two blanks. This typically measures the recall of knowledge. If used, the
blank should appear at or near the end of the sentence, which makes it function much like a
short-answer item.
This format is NOT recommended. It seems limited to measuring knowledge at a very low cog-
nitive demand; if needed, SR formats seem to provide a more suitable alternative for measuring
basic knowledge.
Grid-in Response
The grid-in response requires test takers to write their response in a grid (typically small squares
much like grid paper). The primary benefit of this format is that test takers cannot use SR options
to work backwards in solving a problem. Grid-in formatted items force the test taker to solve the
problem and provide the answer.
The Florida End-of-Course Examination item and test specifications provide detailed information for the development of grid-in response items (http://fcat.fldoe.org/eoc/). The specifications state that grid-in items should take an average of 1.5 minutes to complete. For example, equivalent fractions and decimals are acceptable for grid-in items if each form of the correct response can be recorded in the grid. Grid-in items should include instructions that specify the appropriate unit for the responses. In grades four and five, for instance, currency grids are preceded by a dollar sign, and in grids requiring decimal responses, a fixed decimal point is provided in the third column of the six-column grid. Items are written with consideration for the number of columns in the grid, and the computer-based forms use a standard seven-column grid. In grades six and seven, a six-column grid is used that includes the digits zero to nine and two symbols, a decimal point (.) and a fraction bar (/), for gridding fractions.
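As an illustration only (not the actual Florida validation rules), a simple check that a gridded response uses only the allowed symbols and fits the available columns might look like the following Python sketch; the function name and column count are assumptions for the example.

ALLOWED = set("0123456789./")  # digits, decimal point, fraction bar

def fits_grid(response: str, columns: int = 6) -> bool:
    """Check that a grid-in response uses only allowed characters and fits the grid."""
    response = response.strip()
    if not response or len(response) > columns:
        return False
    return all(ch in ALLOWED for ch in response)

# Illustrative checks against a hypothetical six-column grid.
print(fits_grid("25/50"))      # True: equivalent fraction form
print(fits_grid("0.5"))        # True: decimal form
print(fits_grid("0.4166667"))  # False: too many columns
print(fits_grid("1 1/2"))      # False: a space is not a grid symbol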
Grid items can also be administered in paper-and-pencil tests, and even machine-readable
forms—for example, where a numeric answer is written in a grid and the bubbles associated
with each value are filled. Given the increasing adoption of the computer for test administra-
tion, this is becoming a more common format, particularly for mathematics test items that
require respondents to produce a numeric value. This is a good alternative to the SR item,
where test takers can estimate a response and select the closest option. Instead, the options
are removed and the test taker must supply the result (to some specified level of precision).
This format (numeric entry questions) is being adopted for the revised Graduate Record
Examination.
Consider this example: find the decimal equivalent of (4 + 3 × 7) / (100 ÷ 2).
[A five-column grid-in answer area with machine-readable bubbles for a decimal point and the digits 0 through 9 in each column.]
The test taker must first find the product of 3 × 7, sum 4 and 21 to get the numerator, divide 100 by 2 to get the denominator, form the new fraction 25/50, and finally find the decimal equivalent. The answer 0.5 is entered in the grid-in boxes and the appropriate circles are filled in on the machine-readable form.
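Written out as a single computation, the expected work is

\[ \frac{4 + 3 \times 7}{100 \div 2} = \frac{4 + 21}{50} = \frac{25}{50} = 0.5 \]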
26. The normal alveolar partial pressure of CO2 is 40 mm Hg. A drug depresses alveolar ven-
tilation to 50% of its normal value without changing CO2 production. After the drug is
delivered, what will be the alveolar partial pressure of CO2, in mm Hg?
[Grid-in response boxes showing the entered answer 8 0, followed by the label mm Hg]
In this example, the test taker must understand the association between partial pressure and
ventilation depression to solve the problem. This grid-in is computer-enabled, so only the response boxes are available to fill in, rather than the bubble sheets used to designate the entered values in the first example. Notice that this response box includes the metric (mm Hg)
as a guide to the test taker.
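The reasoning rests on the standard alveolar ventilation relation (not stated in the item itself): alveolar partial pressure of CO2 is proportional to CO2 production divided by alveolar ventilation, so halving ventilation at constant CO2 production doubles the partial pressure.

\[ P_{A\mathrm{CO_2}} \propto \frac{\dot{V}_{\mathrm{CO_2}}}{\dot{V}_A} \qquad\Rightarrow\qquad P_{A\mathrm{CO_2}} = 40 \times \frac{1}{0.5} = 80\ \mathrm{mm\ Hg} \]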
27. A small engine shop charges a flat rate of $25 to fix lawnmowers plus an additional $12.50 per hour for labor. A function f(h) can be used to estimate the total cost from the number of hours of labor, h: f(h) = 25 + 12.5(h)
If the total cost for lawnmower repair is 45.50, what was the total number of hours of
labor on the job?
[Grid-in response boxes showing the entered answer: 1.64]
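Solving the function for the hours of labor:

\[ 45.50 = 25 + 12.5h \quad\Rightarrow\quad h = \frac{45.50 - 25}{12.5} = \frac{20.50}{12.5} = 1.64\ \text{hours} \]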
Many states now employ end-of-course (EOC) exams, and item samplers for these exams can be found online. A quick search finds such item sampler documents in states including Florida, Indiana, Texas, Louisiana, North Carolina, Missouri, Arkansas, Iowa, and many others. Many of
these EOC exams employ the numeric grid-in response types. Item #27 is an example that is simi-
lar to those found on EOC algebra tests. Solving functions is an important task in algebra courses.
We note that the precision of the response expected is not specified, nor is the metric.
28. At Business M, the value of inventory for May was what percent of the value of inventory
for June? Give your answer to the nearest 0.1 percent.
%
Answer: 89.3
Used with permission. Copyright © 2012 Educational Testing Service. www.ets.org
This example illustrates the new grid-in format used by the GRE (http://www.ets.org/gre/
revised_general/about/content/quantitative_reasoning). We note that the response expectation
is also clearly described (to the nearest 0.1 percent). The guidance provided to the test takers
states that items of this type require the test taker to enter the answer as an integer or a decimal in
a single answer box or to enter it as a fraction in two separate boxes, one each for the numerator
and denominator (Educational Testing Service, 2012).
Interview
As a test item format, the interview can be formal/informal or structured/unstructured. Stiggins
(1994) supports the use of interviews as an informal information-gathering process that more
closely involves the student. The interview provides an opportunity for unique student input
and responses. Interviews can give individuals an opportunity to explain thinking. The interview
gives the examiner an opportunity to probe responses more deeply. However, because of the sub-
jective nature of interpersonal interaction, interviews should be used carefully and be much more
structured if used for high-stakes purposes such as assigning grades or for placement. AERA
(1999) endorses the practice of using collateral information when making high-stakes decisions,
so techniques like anecdotal records and interviews have a place in a holistic, comprehensive
assessment of any educational or training program.
29. The Oral Proficiency Interview of the American Council on the Teaching of Foreign Lan-
guages (ACTFL, http://www.actfl.org) is a 20 to 30 minute face-to-face or telephone inter-
view to assess how well a person speaks a language. The interview is interactive where the
tester adapts to the interests and abilities of the speaker. Performance criteria are defined
in the ACTFL proficiency guidelines.
30. The Oral Proficiency Test at Wright State University is a test given to all international
graduate teaching assistants and lab assistants. In one section of the test, candidates are
asked to respond to a series of questions, where scores are based on the candidate’s ability
to express ideas and opinions. Example questions include:
A. Describe your home town. You can talk about size, location, climate, population, or
any other characteristics of your choice.
B. If you were trying to convince someone to enter your field of study, what particularly
attractive aspects of your major would you point out?
C. Since you have been in America, what differences between American students and
students in your country have surprised you?
Source: http://www.wright.edu/cola/Dept/eng/esl/optsample.html
Used with permission.
Observation
One of the most common methods is observation. Teachers, trainers, and supervisors regularly
engage in observation to evaluate the degree to which individuals behave as expected.
31. In a science laboratory, observations come in many forms. One of the most basic is simply
following directions for an experiment.
I read the directions.
I performed the steps in the experiment in the correct order.
I completed all the steps.
I recorded the results of the experiment in the scientist’s journal.
I cleaned up my mess after the experiment.
I put away all the equipment in the cabinet.
32. Students use a geoboard to construct two different shapes with different attributes (side lengths and angle measures).
33. The nurse acknowledges patients upon entry into the emergency room and addresses the
medical needs, assessing their need for immediate medical attention.
Oral Examination
Oral examinations have had a long presence in the history of testing. For instance, oral examina-
tions in law were conducted at the University of Bologna in the early 1200s. It was not until the
1800s that the fairness of oral examinations was formally challenged, causing a shift to written
examinations (DuBois, 1970). The tradition of the oral report has been upheld in some high-
stakes contexts, including the dissertation defense. Oral examinations are more commonplace
in credentialing examinations. Chapter 16 presents a section on the validity of oral examination
score interpretations and uses.
Compared with the discussion format, which focuses primarily on critical thinking, oral reports
allow the examiner to focus on complex tasks and problems that require knowledge, skills, and
more complex cognitive behaviors. Because of the wide range of abilities that can be observed in
an oral report, scoring is subjective and potentially more complex.
In the example in Table 10.5, the speaker is being evaluated for speaking ability. The item calls
for a 10-minute presentation. The list of criteria is very specific and comprehensive. The items
represent characteristics of the presentation and presenter that are either subjectively scored or
objectively scored.
Performance Tasks
Performance tests have been in use for centuries. Each test can consist of a single or many
performance tasks. Evidence suggests that as early as 2200 BC, Chinese emperors set up per-
formance-based civil service examinations. Classroom teachers have continued to employ per-
formance-based tests in both formative and summative evaluation of students. In the 1980s
and 1990s, performance test items resurfaced in large-scale testing programs, particularly in
statewide achievement testing programs. The technical qualities of the results have not met cur-
rent professional standards for making high-stakes decisions about individuals. There has been
limited success in the use of performance tests in some large-scale programs, but the evidence
has not been overwhelming in support of these activities in preference to SR formats. There are
many potential uses for performance tests, and we have learned a great deal about their benefits
and costs. The literature in this area is rich, as is the technology for performance test development. It is important to be systematic about the conceptualization, design, implementation, and analysis of performance tests as employed in both small-scale (e.g., classroom) and large-scale (e.g., statewide and national testing program) settings. The design stage alone is wide-ranging, encompassing building test specifications, training task developers, designing tasks and scoring criteria, reviewing tasks for sensitivity and bias, piloting tasks, analyzing pilot results, and finally selecting tasks. There are several examples of performance tests in large-scale settings, including
some Advanced Placement tests and performance-based test systems for students with severe
cognitive/physical impairments. Chapters 13, 14, and 15 provide many examples of perform-
ance test items that are objectively and subjectively scored. Dental licensing testing in most
states involves a clinical performance test that is very comprehensive (http://www.crdts.org/,
http://www.wreb.org/).
A performance test is a broad umbrella term that could include all of the formats listed
within this chapter. In professional licensure and certification tests, performance tests are an
important source of evidence of mastery (see chapter 14). Lane (2013) argues that performance
tests can measure higher-level cognitive demand in ways that help connect teaching and learn-
ing, providing information to teachers regarding what is important to teach and to students
regarding what is important to learn. As such, performance tasks have the potential power
of illustrating to students the kinds of tasks that are relevant to a specific field and what the
field values as important. Such tasks are also called authentic because they represent real tasks
undertaken by individuals in the field represented by the test. In our technical jargon, the tar-
get domain represents an ideal set of tasks, and an authentic task on a test resembles the tasks
in that target domain (Kane, 2006). We hope that the test task has high fidelity with the task in
the target domain.
Because of the wide range of formats performance test tasks might take, no single set of guide-
lines can provide complete guidance. The guidelines presented in chapter 11 will support the
design of most performance tasks. Performance tasks require carefully designed scoring guide-
lines or rubrics, which detail the aspects of performance valued and successful or exemplary levels
of performance. Such rubrics are most effective when the test taker knows what the expectations
are and knows the elements of performance that will be rated. Guidance on design of scoring
methods and scoring is presented in chapter 12.
These are examples of performance tests from an infinite domain of possibilities. What these
items need are context and criteria. The context of the task needs to be specified clearly, and the
test taker must be aware of the criteria used to evaluate performance. Then, and only then, will
these become performance test tasks.
Portfolio
A portfolio is often mentioned as an example of a performance test. However, the characteristics
of portfolios are unique. The portfolio usually consists of a set of entries or complementary per-
formances. Arter and Spandel (1992) provide a definition of a portfolio as a purposeful collection
of work that captures relevant aspects of effort, progress, or achievement. The objective of any
portfolio is the comprehensive measurement of an ability, such as writing. Or with a professional
test for certification or licensing, the portfolio may contain the masterwork of a candidate as
noted in the performing arts, literature, or medicine. In these professional contexts, this may also
include capabilities, accomplishments, and potential.
Portfolios can be used at any level, including elementary- and secondary-level classrooms and
professional certification and licensure. Perhaps the most highly structured large-scale portfolio
system is one used by the National Board of Professional Teaching Standards (NBPTS) to cer-
tify accomplished teachers (http://www.nbpts.org/). Foundational work in the specifications of
high-quality portfolios was developed by LeMahieu, Gitomer, and Eresh (1995) through the ETS
Center for Performance Assessment.
Portfolios can be composed of multiple entries in multiple formats, given the purpose of the
test and the nature of the knowledge, skills, and abilities to be assessed. In elementary and sec-
ondary settings, portfolios can be used in writing, science, the arts, and indeed most subject areas.
They can also be used as alternative tests for students with severe cognitive or physical impair-
ments, providing for a mechanism to collect evidence of relevant knowledge, skills, and abilities.
In professional certification, as with the NBPTS portfolios, entries include written descriptions
of teaching, classroom artifacts and reflection, and videos of instructional segments with written
commentary.
A general description is available for the portfolio of the NBPTS. The specific characteristics
and features of the portfolio entries are tailored for each subject area and level. Chapters 13 and
14 provide more information on this interesting and highly effective testing program.
Each entry requires some direct evidence of teaching or school counseling as well as a com-
mentary describing, analyzing, and reflecting on this evidence.
Source: http://www.nbpts.org/for_candidates/the_portfolio
Reprinted with permission from the National Board for Professional Teaching Standards,
www.nbpts.org. All rights reserved.
California is the first state in the nation to employ a dental-school-based portfolio examination to
obtain initial licensure. The portfolio allows students to collect and display evidence of completed
clinical experiences as well as competency exams. The Hybrid Portfolio Pathway Examination to Qualify for a California Dental License was authorized in 2010 (http://www.dbc.ca.gov/applicants/portfolio_concept.shtml). The hybrid portfolio model is based on
the existing dental school evaluation of students according to a standard set of criteria, where
candidates prepare a portfolio of documented proof of competency evaluations of specific pro-
cedures (Comira, 2009).
43. The Hybrid Portfolio consists of sequential candidate evaluation and passing a Compe-
tency Exam utilizing a patient record in each of the following areas:
Oral Diagnosis and Treatment Planning: Completed case
Periodontics: Diagnosis, Scaling and Root Planing procedures
Direct Restorative: Class II amalgam or composite, and Class III composite
Indirect Restorative: Fixed Prosthodontics, Crown and Bridge Procedures
Endodontics: Completed case
Removable Prosthetics: Completed case
Much like the wide variety of tasks that can be found in performance tests, portfolios can be
constructed in many ways. Again, the guidelines presented in chapter 11 will be generally help-
ful. However, additional guidance is needed to complete the design of portfolios (see Tombari
& Borich, 1999, for a more comprehensive description). These elements include determining
who will select the entries, how many entries will be selected, how the entries will be stored
or recorded, whether analytic or holistic scoring will be used, and how many raters will be
involved.
Project
Projects are common in classrooms and courses in undergraduate and graduate education in
nearly every subject area. Projects give students opportunities to engage in individual or group
work and practice important skills in an applied context. Projects can include one or more of the
many CR formats described in this chapter. They may also include activities in non-classroom
contexts, including in the home, community, or work site. The project itself can be composed of a multimedia presentation or web page, or it may simply consist of a written summary of activities and a
reflection. Because projects tend to be relatively unstructured and completed in multiple stages,
they provide excellent instructional feedback that can be used for formative purposes. As such,
projects are more often employed in classroom assessment rather than large-scale test settings.
However, structured variants of projects appropriate for large-scale settings are typically formu-
lated as performance tasks.
44. Students organize an after school potluck for Día de los Muertos, including traditional
foods, art, and activities. They create a traditional altar and collage.
Research Papers
Research papers are typically introduced in high school and quite common in post-secondary
education programs. Perhaps the ultimate research paper is embodied in a graduate thesis
or dissertation. To some extent, even elementary school students can write components of
a research paper. These days, a great deal of research takes place online. Except for ancient
manuscripts and some earlier research journals not yet digitized, one can find extensive online
information on just about any topic. Research papers themselves can also be multimedia,
through online resource tools, including interactive documents containing pictures, interac-
tive illustrations, and videos.
Much like other performance test tasks, the components of a research paper that make it an
effective test format include the clarity of the context and criteria provided to the test taker. There
are many resources available that provide guidance for structuring a research report. These are
mostly provided in classroom assessment, but such a report might be an important addition to a
portfolio. The research paper follows a model that we use to measure writing ability. This model includes selecting a topic, creating an outline, pre-writing or drafting, writing the beginning, middle, and end, and revising the final draft. (See Table 10.6.)
Review (Critique)
A review provides an opportunity for an individual to display critical thinking ability. The review
consists of a single item. Typically, a review will result in a written summary. Besides critical
thinking, the review also requires good writing ability that includes such skills as organization,
sentence fluency, voice, and mechanics. The review is then judged based on a checklist or a rating
scale. To maximize the success of a review as a test item format, the context and criteria must be
clearly defined.
Some examples of the targets of reviews include a special event, an action, a policy, a law, a musical performance, a work of art, a book, a musical work, a new product, a movie, a television program, a business, a new car, a restaurant, a poem, professional conduct, a play, a project, and a proposal for action.
Self/Peer Tests
A self-test can be developed as part of self-reflection. This encourages metacognitive activity, requiring individuals to think about their own level of knowledge, skills, and abilities and their performances in specific contexts. Self-testing can be done in many ways, including journaling, running records and checklists, and other forms of noting reflections on one's own performance. This is most effective as part of a formative test process, since it encourages self-regulation and self-monitoring.
45. In a course on item writing, students will write 10 test items and exchange their items with
two classmates. Each classmate will review the items with respect to the item writing guide-
lines provided in class. The student will then have the opportunity to revise their items prior
to submitting them to the instructor for evaluation.
The self-test has limited benefit in classroom assessment and is not used in standardized testing programs. The peer test is not recommended.
Short-Answer
The short-answer item is usually scored objectively, although a set of short-answer items sometimes requires some judgment to decide whether an answer is correct. These items usually measure recall of knowledge. As with the completion format, short-answer items are structured much like an SR item without options. Thus, this format measures the same content but at a much higher scoring cost. We find little to recommend about this item format. However, if used to elicit a cognitive demand higher than recall, the format may be useful.
46. What is one potential effect on validity-related inferences from dropping poor test items on
the basis of item analysis alone? (There is potential to change the content coverage, affecting
validity of resulting inferences regarding content knowledge.)
47. In what year did Wellington defeat Napoleon?
48. Describe one common characteristic of the leadership of Mahatma Gandhi and Genghis Khan.
Writing Samples
Writing samples have an organizational structure similar to essays, except that in a testing situ-
ation, a writing sample is intended to be a measure of writing ability itself. These are typically
obtained in an on-demand setting, whereas an essay is more likely to be developed over time,
potentially through multiple iterations or revisions. A writing sample, as an on-demand test,
provides a snapshot of writing ability. Again, because of the popularity and prominence of writ-
ing ability in education, this item format is addressed in chapter 13. Three examples of a writing
prompt are provided below. Explicit instructions regarding the task demands would accompany
the writing prompt, as further described in chapters 11 and 13.
49. What was the nicest thing anyone has ever done for me?
50. If I could change one thing about this school, what would I change?
51. If I were an animal, what animal would I like to be?
Video-Based Tasks
Technology has revolutionized communications and networking. Among those tools is the abil-
ity to create videos. Online resources such as YouTube, operated by Google, where two billion videos are viewed each day, allow individuals to upload and share videos. These resources have given everyone an unprecedented audience and voice. Educators have begun using this resource as a learning and testing tool. Short videos are now being used both as test prompts or stimuli and as test products to be created and submitted by
students. In a creative endeavor, interactive videos are also being explored. Similar resources and
innovative item types enabled by computers and online tools are discussed more completely in
chapter 7.
In the arena of recruiting and hiring, live and recorded video interviews are becoming more
common. Recruitment for businesses and government agencies has taken advantage of social
networking, mobile recruiting, and blogging; however, the opportunity to interview from any-
where is becoming common practice (Sullivan, 2009).
Another example of a large-scale use of video is in the college admissions process. An example
product, LikeLive (www.likelive.com), provides colleges and universities with an online tool that
allows applicants to go online and record and submit their interviews. Interview questions can be
posed at the time the student logs on to record the interview or can be provided ahead of time for
students to prepare responses.
The National Board for Professional Teaching Standards portfolio assessment system to certify accomplished teachers includes four entries: a classroom-based entry with accompanying student work; a documented accomplishments entry that provides evidence of accomplishments outside the classroom that affect student learning; and two classroom-based entries requiring video
recordings of interactions between the candidate and students. Instructions provided to candidates
regarding the video entries include (National Board for Professional Teaching Standards, 2011):
In two or more of the portfolio entries required for National Board Certification, you are asked to
submit video recordings of your teaching. The purpose of the video-recorded entries is to provide as
authentic and complete a view of your teaching as possible. National Board assessors are not able to
visit your classes; therefore, a video recording is the only illustration of these key practices:
• how you interact with students and how they interact with you and with each other
• the climate you create in the classroom
• the ways in which you engage students in learning
Your video-recorded entries convey to assessors how you practice your profession, the decisions
you make, and your relationships with students. (p. 38)
Source: http://www.nbpts.org/userfiles/file/Part1_general_portfolio_instructions.pdf
Reprinted with permission from the National Board for Professional Teaching Standards,
www.nbpts.org. All rights reserved.
Summary
An important principle is that several indicators of any ability are needed to make an accurate
assessment of any student or candidate for a credential. The value of this chapter is to identify
sources for these indicators. Some of these sources are admittedly very subjective. For instance,
anecdotal reports and interviews are the most difficult to support with validity evidence.
As described across the chapters covering both SR and CR items, we find many formats and
uses of test items. The format for an item or task should always be selected because it has the highest fidelity with the target domain of tasks representing an important ability, such as reading, writing, or mathematical problem solving. The item should be developed consistently with the most rigorous guidelines. SMEs should agree regarding the appropriateness and accuracy of its contents. A suitable scoring guide should be developed in a way that maintains coherence between what the item is intended to capture and the resulting inference. Items should be field-tested, and test takers should be appropriately prepared to interact with the item meaningfully; they should know what the item expects of them. These principles apply to both SR and CR item formats, but
because of the variety of CR item types and the potential for innovative versions, such guidance
becomes even more important.
11
Guidelines for Writing Constructed-Response Items
Overview
Although constructed-response (CR) item formats predate selected-response (SR) formats, vali-
dated guidelines for writing CR items are lacking. Several taxonomies have been proposed for
constructing CR items that have not had broad appeal. Most testing companies have developed
guidelines for the design of CR items. However, these guidelines are proprietary and, therefore,
not in the public domain.
This chapter presents a new set of guidelines for the construction of CR items. These guide-
lines are a distillation of many prior efforts to organize and improve guidance for writing CR
items. The sources for these guidelines are diverse as reported in the next section.
We have several books devoted to aspects of item-writing: Writing Test Items to Evaluate Higher
Order Thinking (Haladyna, 1997), Developing and Validating Multiple-Choice Test Items (Haladyna,
2004), and Construction Versus Choice in Cognitive Measurement (Bennett & Ward, 1993). All these
books provide many useful concepts, principles, and procedures that address CR item-writing.
The formats described in chapter 10 provided examples of how items should be developed.
Unfortunately, many CR items are very lengthy and are not well adapted to presentation in vol-
umes such as this one. Thus, only skeletal aspects of some CR item formats are presented. The
most explicit guidelines are provided by Educational Testing Service (ETS) (Baldwin, Fowles,
& Livingston, 2005; Gitomer, 2007; Hogan & Murphy, 2007; Livingston, 2009). ETS supports
several large-scale testing programs that employ CR items. Some of these programs are the
National Assessment of Education Progress (NAEP), the Advanced Placement program of the
College Board, the Test of English as a Foreign Language, and the Graduate Record Examination.
Through these programs, ETS has produced a large body of research on the quality of CR item
development and scoring. The ETS Guidelines for Constructed-Response and Other Performance
Assessments (Baldwin, Fowles, & Livingston, 2005; Livingston, 2009) provide a well-con-
ceived, comprehensive approach to planning the test, creating item and test specifications, guid-
ing the writing of the item, test design, and administration. At the item level, this publication
gives advice on reviewing tasks, developing scoring criteria, pretesting items, and final scoring.
All these sources stress the importance of planning, which includes clarifying the purpose of the
test and the intended uses. Furthermore, the ETS guidelines provide task review criteria, includ-
ing the following questions to be asked of each task:
1. Is the task appropriate to the purpose of the test, the population of test takers, and the
specifications of the test?
2. Does the test as a whole (including all item formats) represent an adequate and appropri-
ate sampling of the domain to be measured?
3. Are the directions to each task clear, complete, and appropriate?
4. Is the phrasing of each task clear, complete, and appropriate?
Overall, these guidelines are based on a set of responsibilities of the test designers: that they (a) include among the test designers individuals who represent the populations to be assessed; (b) provide relevant information about the test early in the development stages to individuals who support, instruct, and train potential test takers; and (c) provide relevant information to test takers regarding the purpose of the test, describing its content, format, and scoring criteria.
These responsibilities help to ensure good test design. Other goals to achieve in the design of CR
tests are improvement of access, fairness, and equity. All of this is accomplished by means of tasks
that represent the target domain, well-designed scoring guides, and sample responses to aid rat-
ing performances (Baldwin, Fowles, & Livingston, 2005).
In a comprehensive review of test design models, Ferrara and DeMauro (2006) examined the core elements of several conceptualizations of design that integrate aspects of cognitive psychology. They assert that little evidence exists regarding the integration of these models with opera-
tional elementary and secondary achievement testing programs. Their review included aspects of
test design that they characterize as construct-driven, scientifically principled, cognitive-design-
system-based, and evidence-centered. Among these design models, they identified a common
goal “to enable interpretation of performance on achievement tests as a process of reasoning
from evidence … about test taker status in relation to a well-specified achievement construct”
(Ferrara & DeMauro, 2006, p. 606). Their review of current CR item construction practices called for
making desired inferences explicit for each task and connecting score interpretations with the
definition of the ability being measured. This plea is consistent with Gitomer’s (2007) call for
coherence in CR item design.
Hogan and Murphy (2007) reviewed 25 textbooks and chapters on educational measurement
from 1960 to 2007. They examined authors’ advice about preparing and scoring CR items. They
found 124 statements on preparing CR items and 121 statements on scoring. They also refer-
enced empirical research on these guidelines. They found that most guidance for CR items is not
based on empirical evidence.
There are some inconsistencies among previous CR item-writing guidelines. The guidelines of Hogan and Murphy are limited by the differing degrees of specificity among the textbook authors they reviewed. For example, the ETS guidelines recommend providing a choice of task where appropriate, whereas the Hogan/Murphy guidelines recommend against choice. The ETS guidelines also include providing assurances that personal or other information extraneous to the response not be available to scorers, so that it cannot unduly bias scoring. However, with video or other observational methods, separating response information from its greater context may not be possible. The Hogan/Murphy guidelines include a guideline to look for evidence that is really a recommendation regarding scoring, as it directs the scorer to look for evidence supporting the specific position taken by the test taker, particularly on opinion-based or controversial topics. They included it in the list of item-preparation guidelines because this is where they found the recommendation among the four textbook authors who discussed it. These inconsistencies may exist because one source is aimed at improving classroom assessment of student learning while the other is dedicated to preparing highly effective items for an operational testing program.
With respect to the intended cognitive functions in the NAEP, CR items were defined as:
well-defined tasks that assess subject matter achievement and that ask test takers to dem-
onstrate understanding through the generation of representations that are not prespecified
and that are scored via judgments of quality. (Gitomer, 2007, p. 2)
Gitomer argues that task demands are only clear in the context of the rubric. Also, the meaning
of the rubric is only clear in the context of the associated scoring process. Similarly, students
must understand what is being asked of them in the task, the response requirements, and the
scoring system. This is the substance he refers to as coherence among the various parts of a CR
item.
Gitomer argues that these requirements are typically satisfied for SR items, assuming the task
is clearly stated and students understand what is expected. However, for some CR tasks, these
requirements are not satisfied. He presents a framework for CR item design that is intended
to secure these requirements in a coherent way. The intent is to ensure that test takers under-
stand what is being asked of them and that scorers know how to interpret student responses
appropriately.
Gitomer (2007) argued that to obtain valid inferences about student learning, all CR items must ensure that (a) the student understands what is being asked by the task, (b) the response requirements are clearly described, and (c) the scoring system is structured consistently to interpret the student response. To support these goals, he suggested that CR task design include three components, a task, a rubric, and a scoring apparatus, that work in a coherent manner. That is, the connection among these three critical components of a validated CR task must be very clear, as illustrated in Figure 11.1. The definition of the ability to be measured is the unifying framework that supports coherence among the task, rubric, and scoring apparatus. Task and rubric clarity are ensured through an effective set
of CR item/task-writing guidelines. Scoring effectiveness is ensured through careful selection of
scorers and task response exemplars followed by an effective, structured training process. Scoring
is the main topic in the next chapter.
[Diagram: the task, the rubric, and the scoring apparatus are linked by coherence, all grounded in the construct definition]
Figure 11.1 Components of Gitomer (2007) CR task design model. Used with permission.
Context Concerns
9. Consider cultural and regional diversity and accessibility.
10. Ensure that the linguistic complexity is suitable for the intended population of test takers.
Content Concerns
Before the integration of cognitively based models for item and task development, the focus in CR
item design was on the task itself (Ferrara & DeMauro, 2006). Task-driven approaches do little to
secure coherence among various aspects of test design and frequently limit the ability to connect
responses to important learning targets. The focus on content concerns elevates the importance of identifying the knowledge, skills, or abilities to be measured. These guidelines recognize the importance of justifying the use of CR formats (Gitomer, 2007). They are also intended to ensure that CR items elicit cognitive demands that are not easily elicited using SR item formats (Rodriguez, 2002, 2003).
Most of the content concerns are addressed through the development of a test blueprint, with
detailed item specifications. The item specifications should provide the item writer with infor-
mation regarding the precise domain of knowledge and skills to be assessed, information about
the appropriate formats of intended cognitive demand, the appropriate level at which the items
are targeted, and guidance to produce construct comparability across tasks. Because CR tasks
can invite novel, innovative responses, one challenge is to ensure comparability of the intended
construct being measured across tasks.
States adopt content standards in reading, mathematics, and science, among other subjects, at each grade. These content standards are used to develop curricu-
lar guides that specify the scope and sequence of instructional content. Current efforts to develop
common core state standards are underway, with standards in place concerning language arts
and mathematics (www.corestandards.org). The common core standards are intended to pro-
vide consistent and clear understandings of what students are expected to learn and reflect the
knowledge and skills that young people need to be successful in college and careers. The common
core standards are an attempt at consistency across states as to expectations for proficiency in
these subject areas. The standards are embedded within clusters of related standards describing
aspects of larger domains within each subject area.
Standard #1: Know and apply the properties of integer exponents to generate equivalent numerical expressions.
Note that this standard specifies the content (properties of integer exponents) and the cognitive skill to be performed (generate equivalent numerical expressions). The skill is appropriate for CR items because it requires students to apply properties to generate a response rather than select a response, as in an SR item calling for recall. Such a standard can lead to tasks asking students to generate equivalent expressions such as the following:
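(The expression below is our illustration, patterned on the example that accompanies the standard itself; it applies the product rule for integer exponents and then rewrites the negative exponent as a reciprocal.)

3² × 3⁻⁵ = 3⁻³ = 1/3³ = 1/27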
Similarly, consider the second standard in this cluster of working with radicals and integer
exponents:
Standard #2: Use square root and cube root symbols to represent solutions to equations of the form x² = p and x³ = p, where p is a positive rational number.
Again, the content is specified explicitly: be able to use square root and cube root symbols. The skill in this second standard is also clearly specified: to represent solutions to equations. Such explicit statements of content are strong aids to item writers, giving them sufficient information to develop tasks that are more direct measures of the intended content.
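As an illustration only (these two tasks are ours, not items from an operational test), CR tasks aligned with this standard might read:

Solve x² = 25 and write the positive solution using a square root symbol: x = √25 = 5.
Solve x³ = 64 and write the solution using a cube root symbol: x = ∛64 = 4.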
NAEP results are sometimes used to make comparisons among states. NAEP covers 12 broad
content areas (e.g., the arts, civics, economics, foreign languages). Each content area is repre-
sented by a framework that provides a basis for the test content, directions for the kinds of tasks
that should be included in the test, how the tasks should be designed, and how responses should
be scored.
NAEP first measured the economics knowledge and skills of students in grade 12 in 2006. They
defined economic literacy as the ability to identify, analyze, and evaluate the consequences of
individual decisions and public policy (National Assessment Governing Board, 2006). The eco-
nomics test’s three content areas were the market economy (45%), the national economy (40%),
and the international economy (15%). These three areas were equally distributed across three
cognitive categories: knowing, applying, and reasoning. Here is a sample of a standard from the national economy content area:
2.12.4 Real interest rates affect saving and borrowing behavior.
Note that in this sample from standard 12, which has six components in total, this statement addresses aspects of interest rates for the national economy and explicitly describes the relation between real interest rates and saving and borrowing behavior. Other standards statements address the
additional contexts of businesses and the public sector. A sample item generated based on this
standard is the following pair, a poor version and a better version:
Poor:
2a. Describe the relation between real interest rates and borrowing behavior.
Better:
2b. How will an increase in real interest rates affect the amount of money that people will borrow? Explain why this will occur.
These are CR tasks directly addressing standard 2.12.4. The poor version is an ambiguous attempt
to elicit understanding of the standard without providing guidance about what aspect of the
association should be described and without context (individual behavior, business behavior,
governmental behavior). The better version provides the specific context of individual behavior, specifies the relation of interest, and requires correct reasoning (e.g., an increase in real interest rates will make it more expensive for people to borrow).
2. Ensure that the format is appropriate for the intended cognitive demand.
CR items should be designed to test for a high cognitive demand, not simple recall of knowledge.
The quest for high cognitive demand is a challenge. We recognize the additional expense in using
CR items in tests, which includes additional costs in item development, piloting, and scoring. So
to justify the increased expense, each CR item should measure content and cognitive demand
that is not easily measured by an SR item.
Constructed-response objectively scored items (CROS) are well suited for testing knowledge,
comprehension, some computational skills, and the answers to problems. As described in chap-
ter 10, these formats include the cloze, fill-in-the-blank, grid-in, and short answer items. A ben-
efit of these formats is that they can be objectively and easily scored. Again, the challenge is to develop such items so that constructing the response requires a high cognitive demand.
The open-product subjectively scored item (CRSS) is a good option for testing high cognitive
demand. This format includes demonstration, discussion, essay, a writing sample, performance,
and portfolios. This item format usually elicits an extensive response. This format can elicit the
complex cognitive demand that reflects an ability and allows for creative thinking.
NAEP uses all three types of item formats (SR, CROS, and CRSS). The CROS and CRSS items
make up about 40% of the testing time, in part recognizing the importance of these formats for
measuring complex thinking.
For example, note that in standard 12, which is an example for guideline 1, the standard state-
ments provided in the economics framework address aspects of knowledge, application, and
reasoning that NAEP uses. This provides many options for the task developer to address the
intended cognitive processes. Consider the example task provided above in guideline 1:
2b. How will an increase in real interest rates affect the amount of money that people will bor-
row? Explain why this will occur.
This task requires a response that contains evidence of reasoning. If the direction to “Explain
why this will occur” was not provided, a simple plausible answer based on knowledge or perhaps
application could be “People will borrow less.” Such a response contains no reasoning. Reasoning
in this task could include the idea that “an increase in real interest rates makes it more expensive
for people to borrow money.”
As we have stated often in this volume, SMEs’ input is critical in many stages of item develop-
ment. One of their most important contributions is determining the appropriate level of tasks given predetermined content standards and the desired cognitive demand to be meas-
ured. This also informs the appropriate grade or age level or, more broadly, the appropriate level
given the expected stage of learning in a subject area.
National Council of Teachers of Mathematics (2012) standards for grades three to five in
geometry include the following: make and test conjectures about geometric properties and rela-
tionships and develop logical arguments to justify conclusions.
Consider the following items that might be found on a fifth-grade mathematics test:
Poor:
3a. How many sides does a triangle have?
Better:
3b. The figure below is a parallelogram. What is the relation between angles a and b?
Consider the standards within the domain of Expressions and Equations across grades from
the Common Core State Standards Initiative (2011). This domain first appears in grade six and
continues through grade eight. The domains change into conceptual categories at the high school
level. (See Figure 11.2.)
We note again the importance of SMEs. The appropriateness of the level of the tasks is criti-
cal if CR tasks are to be meaningful representations of important knowledge, skills, and abilities.
Reviewing this specific domain in mathematics in the Common Core State Standards helps us to
understand this guideline. The domain of Expressions and Equations does not even appear until
grade six and, up through grade eight, the specific standard areas in this domain become more
advanced and complex. In the high school years, the content of expressions and equations is spread across the conceptual categories of Number and Quantity, Algebra, Functions, Modeling, Geometry, and Statistics and Probability.
Current form:
4a. Describe how you could best determine, from the data given on page 6, the speed of the Earth
in kilometers per day as it moves in its orbit around the Sun.
Revised:
4b. Plug in the appropriate data from page 6 into a formula for speed in order to describe how
you could best determine the speed of the Earth in kilometers per day as it moves in its orbit
around the Sun.
Source: 2000 NAEP Science, Grade Twelve, Block S9, #9.
For this item, the test taker must recognize that speed is distance divided by time and that in one complete revolution the distance is the circumference, 2πr. Time is the amount of time it takes for a complete revolution around the Sun. The data given in the test booklet provide the distance from the Sun to the Earth, which is the radius, and one orbit around the Sun takes 365 days, which is the time. However, the task does not specify that the test taker must use the data from page 6 to show how to compute the distance traveled in one revolution. So students who simply state that "speed is distance divided by time" are not awarded full credit; they must also state that the circumference is computed by 2πr.
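A response earning full credit might therefore reason along the following lines (the distance from the Sun to the Earth used here is our assumed value for illustration; an operational response would use the data supplied on page 6 of the booklet):

distance traveled in one orbit = 2πr ≈ 2 × 3.14 × 150,000,000 km ≈ 942,000,000 km
speed = distance ÷ time ≈ 942,000,000 km ÷ 365 days ≈ 2,600,000 km per day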
Current form:
5a. Suppose that you have been given a ring and want to determine if it is made of pure gold.
Design a procedure for determining the density of the ring. Explain the steps you would fol-
low, including the equipment that you would use, and how you would use this equipment to
determine the ring’s density.
Revised:
5b. Design a procedure for determining the density of the ring. Explain the steps you would fol-
low, including the equipment that you would use, and how you would use each of the values
resulting from these measures to determine the ring’s density.
Source: 2000 NAEP Science, Grade Twelve, Block S11, #12.
In the second item, the test taker must identify a way to measure the density of the ring. The test
taker must measure volume. This is accomplished by noting the water displacement in a gradu-
ated cylinder. The test taker must know that density is mass divided by volume. Here, however,
the student is not explicitly required to state how mass or volume is found. The student must list the steps involved in determining density and explain each step. Finally, the student must explain how the values obtained lead to the solution.
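A complete response might therefore follow a chain of reasoning such as this (the measurements are hypothetical and serve only to illustrate the steps): weigh the ring on a balance to obtain its mass, say 10.0 g; submerge the ring in a graduated cylinder and record the rise in the water level, say from 50.0 mL to 50.6 mL, giving a volume of 0.6 cm³; then compute density = mass ÷ volume = 10.0 g ÷ 0.6 cm³ ≈ 16.7 g/cm³. Because pure gold has a density of about 19.3 g/cm³, this hypothetical ring would not be pure gold.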
In part, we recognize that one element of the inconsistency in measurement with these two
items is based on the scoring guide for the responses. However, the task demands themselves
must be clear enough to provide consistent and explicit instructions to the test taker. The test
taker must be given the opportunity to respond completely, fully exposing their understanding of
the problem. In the modest revisions to these tasks, the test taker is given directions to be explicit
about the calculation of density and speed, to show how each value obtained is used to determine
the result.
CR items may also permit unanticipated solution strategies that are not identified until tasks are piloted. Guideline 5 in chapter 12 discusses this point. Rhoades and Madaus (2003) reported that in 2002, a high school student responded to a mathematics item on the Massachusetts Comprehensive Assessment System using a spatial solution strategy instead of a numeric strategy, which led to a different right answer.
Quality control is just as important as the use of research-based item-writing and test design
guidelines. Errors can occur in many ways. Many errors result because supervision is limited,
procedures and practices are not well documented, standards and expectations are not commu-
nicated, time to develop items is often limited, and training of item writers is insufficient. Quality
control can be viewed as a formal systematic process designed to ensure that expected standards
are understood and followed during each phase of testing (Allalouf, 2007).
Some may think that CR items are prone to fewer errors than SR items. We have more oppor-
tunities for errors in SR items because of the presence of options, which we know are very dif-
ficult to write. However, many factors contribute to errors with CROS and CRSS items. These
errors include poor grammar and incorrect punctuation in the item, and lack of clarity in the
scoring guidelines. Another error is a lack of logical coherence between the item and the scoring
guideline.
A sample of 993 items written by 15 technology education teachers was reviewed. They were
developed for the North Carolina State Department of Public Instruction item bank (Haynie,
1992). From this review, researchers found 10% of the items had spelling errors, 26% had punc-
tuation errors, 39% had stem clarity problems, and 15% had questionable representation of the
correct cognitive domain.
OCR is a certification agency in the United Kingdom providing a wide range of tests. One of
these is the General Certificate of Education. In one report from their examiners (OCR, 2011),
examiners expressed concerns about many items.
There was an error in this part of the question [6.ii.]; this was very unfortunate and we apol-
ogise for this. The marking scheme was adjusted to take account of this error and further
steps were taken at the award to ensure that candidates were not penalised.
Because of the omission of the repeated AB+FG=1.9 km in the expressions on the ques-
tion paper all reasonable attempts to apply the route inspection algorithm were marked,
including crossed-out work that had been replaced. However, attempts that were just lists
of specific cases were not usually given any credit as they did not demonstrate the use of the
route inspection method. (OCR, 2011, p. 49)
Harrison (2011) reported that 6,800 high school students struggled with the item that had no
solution. The item asked students to find the shortest route between two points given several
conditions. The values in the supplied conditions were incorrectly printed, which made the solu-
tion ambiguous. This item was worth just more than 10% of the total points on the test. Harrison noted that the event received national news coverage and quoted several comments from students who took this test. One test taker commented:
Having spent a long time on this question I resorted to crossing out all of my working-out.
The amount of time I spent meant I wasn’t able to answer the rest of the exam paper to the
best of my ability. The only logical option I could see for OCR is to put out another exam
paper quickly or my application to university will be extremely hindered due to this being
33% of my A-Level grade. It’s ridiculous, how can the highest marked question on the paper
not be double-/triple-checked? (Harrison, 2011)
Standard 3.7: The procedures used to develop, review, and try out items, and to select items from
the item pool should be documented.
Although it seems that piloting is a standard practice, it remains surprising that so many errors
in items continue to plague the testing industry. Rigorous item development with the recom-
mended reviews and thorough pilot testing should have uncovered many of these errors before
these items appeared on operational tests.
6. Find x.
[Figure: a triangle with sides labeled 3 m and 5 m and an unknown side labeled x; the test taker's response reads simply "Here it is," indicating the letter x.]
Ambiguity in an item will promote a variety of inappropriate responses that fail to elicit the
cognitive demand desired.
The California Standardized Testing and Reporting (STAR) Program is the state's testing program. Regarding test directions and administration, a 2002 STAR test contained an error: the cover directions instructed students to open the booklets and write a story but did not indicate that additional directions inside the cover provided specific information about what to write (Rhoades & Madaus, 2003).
The testing Standards (AERA et al., 1999) require clear instructions to test takers:
Standard 3.20: The instructions presented to the test takers should contain sufficient detail so
that test takers can respond to a task in the manner that the test developer intended.
Standard 5.5: Instructions to test takers should clearly indicate how to make responses.
6. Clearly define directions, expectations for response format, and task demands.
As a way to complete the development of CR items, all aspects of the test should be clearly speci-
fied to the test taker. This effort includes directions, allowable response formats, the response
length, time limits, other relevant conditions of testing, and related task demands and features
of items. Usually, item writers assume students are familiar with the item format and response
demands. Consider the common but poorly constructed instructions in Figure 11.4.
Poor:
You may use a calculator for this test. Stop working on this test when directed.
Better:
You may use a scientific calculator (no graphing capabilities) for this test. You will have
exactly 45 minutes. Place your pencil on your desk after the time is up. You must show
your work to get full credit. If you need to use scratch paper, attach the scratch paper to
this test. If you finish before the 45 minutes is up, hand this test to your professor and you
are free to leave.
The better instructions are more explicit and informative. Such directions avoid confusion,
errors, and frustration by test takers.
There are at least two major forms of this guideline. For open-product CR items, the task
demands that clarify what is expected must be clearly described. These instructions should inform test takers about what qualifies as a strong response, the features that must be included in a response, whether the content of the response should be restricted to information presented in the task or whether external or prior knowledge can be used, and the expected length of the response.
A second form of defining response format expectations concerns items that have restricted
response formats, like grid-in items. Grid-in items must provide for all possible options for
responses and clearly describe how those options should be used. When the grid-in item is admin-
istered by computer, this can be controlled through entry controls—only certain characters can
be entered. In paper-based grid-in items, this is much more challenging. Consider the following
example. This may be an appropriate mathematics item addressing probability at the high school
level, but the response format is likely to introduce complexity that is construct-irrelevant.
The grid-in response box in Figure 11.5 provides for responses that are negative, fractions, or decimals. However, it is not clear to the test taker whether these conventions are even needed for this item. Notice also that some grid-in item specifications require shading alternate columns to help test takers distinguish the columns; this should be done carefully to avoid creating problems for test takers with visual impairments.
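To make the idea of entry controls concrete, the following sketch (ours alone; it assumes no particular testing platform, and the permitted characters would come from the item specifications) shows how a computer-delivered grid-in field might accept only digits, a single decimal point or fraction slash, and an optional leading minus sign:

```python
import re

# Hypothetical entry-control rule for a grid-in response field (illustration only):
# an optional leading minus sign followed by either a decimal number or a simple
# fraction such as 3/4. The characters and formats actually permitted would be
# dictated by the item specifications, not by this sketch.
GRID_IN_PATTERN = re.compile(r"^-?(\d+(\.\d+)?|\d+/\d+)$")

def is_valid_grid_in(response: str) -> bool:
    """Return True if the keyed response uses only the allowed grid-in conventions."""
    return bool(GRID_IN_PATTERN.match(response.strip()))

# Example checks
print(is_valid_grid_in("0.25"))   # True: a decimal response
print(is_valid_grid_in("-3/4"))   # True: a negative fraction
print(is_valid_grid_in("1..5"))   # False: malformed decimal
print(is_valid_grid_in("x+2"))    # False: characters outside the allowed set
```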
As shown in chapter 10, CR formats range from closed-product, objectively scored items to open-product, subjectively scored items. The choice of item format should be directly connected to
the desired cognitive demand as described in the item and test specifications. Sometimes, the
response mode is specified by the item format. For example, in a cloze item, a single word is
required to complete the task. For grid-in response items, a single, correct numeric response is
required. In short-answer items, responses are typically restricted to a few words or sentences.
Occasionally, a short-answer task could be answered with a diagram or drawing. Alternative-
response modes are more common in alternative assessments for students with physical or cog-
nitive impairments. Innovative item types are now being explored through computer-enabled
testing modes, which introduce a wide range of possibilities.
In open-product, subjectively scored formats, such as demonstrations, essays, writing samples,
performance tasks and others, the nature of the response mode and medium of response need to
be specified. The response mode and medium should be consistent with the purpose of the test
and the content and cognitive demand being tested.
In the example in Figure 11.6, students are expected to navigate from school to the post office,
taking the safest route possible. This navigation involves crossing streets where there are cross-
walks with traffic lights. The natural response mode here is a drawing of the route. Another
example is provided in Figure 11.7. This item requires both a drawing and written response.
One source of confusion for test takers arises when multiple tasks are requested simultaneously or when multiple features of a problem must be considered. To support the test taker, the directions in NAEP example item #9 require students to make a diagram and to write an explanation. In addition, two places are provided for the
responses. Test takers are instructed to draw the diagram in the box and are then provided with
four lines for an explanation.
NAEP provides an example of conditions for each section of their tests. In a sample ques-
tion for reading at grade four, here is an excerpt from the booklet directions (http://nces.ed.gov/
nationsreportcard/about/booklets.asp). Such directions avoid ambiguity in responses.
8. The fourth grade class of Rockfield School is going to visit the Rockfield post office.
They will leave school and walk using only sidewalks and crosswalks. Draw on the
map of Rockfield the safest route the class can take.
[Map of the Town of Rockfield, including streets such as South St.]
Figure 11.6 Example item #8 from 2010 NAEP Geography, Grade Four, Block G3.
9. Use the terms above [cows, sun, grass, people] to make a diagram of a food chain in
a simple ecosystem. Put your diagram in the box below.
Then write an explanation telling how your ecosystem works.
You will be asked to respond to two types of questions. The first type of question requires you to
choose the best answer and fill in the oval of that answer in your booklet. Some questions of this
type will ask you about the meaning of the word as it is used in the passage.
The other type of question requires you to write your answer on the blank lines in your booklet.
Some questions of this type will ask you to write a short answer and some questions will ask you
to write a longer answer.
This section of the test consists of two open-response item assignments that appear on the fol-
lowing pages. You will be asked to prepare a written response of approximately 150–300 words
(1–2 pages) for each assignment. You should use your time to plan, write, review, and edit your
response for each assignment (retrieved from http://www.mtel.nesinc.com/MA_PT_opener.asp).
These examples provide important information to the test taker regarding different features of the items, offer guidance about how to allocate effort across tasks, and clarify response expectations. They help secure optimal responses and help focus the test taker on the relevant
cognitive demand.
Sample Prompt and Directions from the Minnesota writing composition test.
10. Choose a character from a story, movie, or book who inspires you. Explain why this charac-
ter inspires you. Use specific examples to help your reader understand your choice. Remem-
ber you are writing for an adult reader.
REMINDERS:
Write as neatly as possible.
Make sure your composition has the following:
________ a clear, focused central idea
________ supporting details (reasons, examples)
________ a logical organization (beginning, middle, end)
________ correct spelling, grammar and punctuation
________ complete sentences
Source: http://www.mnstateassessments.org/resources/ItemSamplers/WC_GRAD_Item_
Sampler_Prompt-1.pdf
Used with permission from the Minnesota Department of Education.
Consider the guidance given to students in a Minnesota written composition test required for
a high school diploma (Minnesota Department of Education, 2011). This comprises a sample
item prompt and directions to the student. We note that the reminders contain each element of
the scoring rubric.
In this example, response expectations are clearly described, and these features parallel ele-
ments of the scoring guide. No specific content is provided.
Examples of NAEP items provide information about what is required to obtain a high score
and the item response spaces are structured to ensure opportunity to perform. In the short-
answer CR item illustrated in item #11, we note that only two lines are provided. This shows that
the answer should be brief. The response demands are clear (one reason) and the two parts are
clearly distinguished by underlining the key words in favor and against.
11. There are differences of opinion about using chemicals called pesticides to kill insects.
A. Give one reason in favor of using pesticides.
_____________________________________________________________
_____________________________________________________________
B. Give one reason against using pesticides.
_____________________________________________________________
_____________________________________________________________
In other cases, the entire scoring rubric is provided to test takers. The National Board for Professional Teaching Standards operates the largest portfolio assessment system in the world. This
system certifies accomplished teachers in many specific and generalist areas. It includes four
portfolio entries and assessment center exercises. These complex entries include such things as
videos of instruction, student work, and reflections from teachers. An example rubric is provided
in Figure 12.4 in chapter 12. Giving candidates the entire set of rubrics is the best way to clarify the
task demands and provide full opportunities to perform. In this way, the candidates know what
the scoring criteria are and can develop portfolio entries to meet those criteria. Since the criteria
are directly based on the standards for accomplished teachers, the portfolio entries are more
likely to be mapped to those standards—the target domain.
Providing explicit and clear information about the scoring criteria is an important method
for securing construct-relevant responses. This is a validity issue. The extent to which a complex
response contains evidence of relevant knowledge, skills, and abilities directly supports the valid-
ity of resulting interpretations and uses of test scores. The connection among the target domain, the test item, the scoring method, and the rubric determines the degree to which coherence
can be achieved. (See Figure 11.1.)
Consider a set of action verbs proposed as useful for writing instructional objectives (Anderson & Krathwohl, 2001). Although these action verbs are organized in terms of the cognitive taxonomy associated with Bloom, the concept of action verbs is very sound for item design. For instance, they recommend words like list, name, and repeat for simple recall. For understanding (comprehension), they use terms like describe, express, give in your own words, and
restate. For higher level cognitive demand, they use terms like operate, conclude, evaluate, experi-
ment, solve, diagnose, measure, revise, produce, propose, and devise.
As a poor demonstration task with ambiguous language, consider the following prompt:
Original:
12a. Demonstrate how to care for your pets.
This is vague at best, with unclear context. There is no formal process for caring for your pets, and the possible options for care probably depend on the kind of pet one might have. A revision provides context and avoids implicit assumptions on behalf of the test taker:
Revised:
12b. Demonstrate how to brush a golden retriever.
In another example, an unintended inference is required regarding the definition of annual. The inference about the mathematics skill of multiplication depends on understanding that annual most likely means 12 months. In some cases, annual may not indicate the
same amount of time, for example, if you are a teacher on a nine-month contract.
Original:
13a. Mary earns $2,500 per month. What is her annual salary?
Revised:
13b. Mary earns $2,500 per month. Compute her 12-month salary.
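With the revision, the intended computation is unambiguous: $2,500 per month × 12 months = $30,000.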
Consider a NAEP item that has a three-level performance scoring guide. A complete response is one in which students correctly identify one or more distinguishing characteristics. A partial response is one in which students do not fully describe such characteristics. An unsatisfactory response contains incorrect characteristics or vague general statements.
14. Viet and Andrea were using a microscope to look at a slide of some cells. They looked at
some interesting cells that Viet thought were plant cells. Andrea thought they looked more
like animal cells. If you looked at these same cells, how could you tell whether they were
plant cells or animal cells?
Source: 2005 NAEP Science, Grade Eight, Block S13, #14.
Gitomer (2007) argued that such items simply ask for recognition of conceptual features rather
than require explanation or synthesis of information. Such questions could be presented in SR
format. To take advantage of the expanded range of cognitive skills assessable by CR items, the
task could require students to illustrate both types of cells and label distinguishing characteristics,
noting their correct location.
Here we find a NAEP item in Figure 11.8 that taps the content area of knowing and doing sci-
ence, with a focus on conceptual understanding.
The scoring guide awards complete credit to responses that state animals get (a) both energy and
nutrients, (b) energy and a specific nutrient, or (c) two specific nutrients. Partial credit is awarded
for a response that mentions (a) energy only, (b) one specific nutrient only, or (c) nutrients and
names one specific nutrient only. In the sample student responses, we find the response:
[Illustration: a pond ecosystem showing sunlight and organisms such as large fish]
15. Each of the animals in the pond needs food. What are two things that the animals get
from their food that keep them alive?
Figure 11.8 Example item #2 from 2000 NAEP Science, Grade Eight, Block S9.
The animals get protien [sic] from their food and they also get nutrients from their food
This response is awarded partial credit since protein is a nutrient. Another partial credit response
was:
They get food from eating plants, algae, bacteria, and insects. They eat these, and in turn get
energy to survive.
The scorer comments suggested that both of these responses provided only one thing animals get
from their food. Gitomer (2007) argued that the lack of specificity in the task demands results in different interpretations by students and forces scorers to make inferences about the intent and meaning of the students' responses. Does the student mean nutrients when suggesting that they get food from eating plants …? Moreover, it is not clear that the illustration is required to answer the item, and the item could be effectively formatted as an SR item.
The following prompt shown in Figure 11.9 is a focused item. The test takers are not encour-
aged to pull in elements of personal experience and knowledge from other domains in order to
respond to the CR item. This item, from the National Conference of Bar Examiners (2011), precisely targets the domain of knowledge and skills for the test taker. Test takers are instructed to respond in a way that shows understanding of the facts, recognition of the issues presented, knowledge of the applicable principles of law, and the reasoning through which they arrive at a conclusion. They are told not to assume facts that are not given in the scenario.
Customer went to Star Computers (Star) to buy a refurbished computer. Upon arrival,
Customer was approached by Owner, who identified himself as the owner of Star. Owner
directed Customer to a refurbished desktop computer and told Customer, “We have the
best refurbished computers in town. We send used computers to a computer technician
who always installs new hard drives and replaces any defective parts.” Owner made these
claims because Owner believed that they would be effective in persuading Customer to
buy a refurbished computer. In fact, Customer was persuaded by Owner’s claims and
purchased a computer for $250 cash.
At the time of this transaction, Owner did not believe that Star had the best refurbished
computers in town. Owner was aware of at least two other computer stores in town and
believed that the refurbished computers sold by these other stores were better than those
sold by Star. Owner also thought it was very likely that the computer technician used by
Star did not actually install new hard drives in the refurbished computers. Owner had never
raised the issue with the technician because the technician offered much faster service
and lower rates than those of any other technician in the area.
After Customer’s purchase, a local news station conducted an investigation into the
computer technician used by Star and reported that the technician did not install new hard
drives in any of the computers she refurbished. After the report aired, the computer techni-
cian acknowledged that no new hard drives had been installed in the computers she had
refurbished for Star.
Owner has been charged with larceny by false pretenses in connection with the com-
puter sale to Customer.
Several examples were provided here to illustrate weak and strong items regarding the presence
of task-irrelevant features. Again, the core message of the CR item-writing guidelines is to clearly
communicate to the test taker what is required in the response.
Context Concerns
The two context concerns guidelines are not limited solely to CR item-writing. These guidelines
are similar to SR item-writing guidelines, but worth reiterating here in a slightly different way
and setting. Elements of both of these context concerns are addressed subtly in other chapters,
including chapters 1 to 4 regarding the development and validation of items and the relevant fea-
tures in deciding between item formats. These context concerns are also discussed more directly
in chapter 15 regarding improving item accessibility for individuals with exceptionalities.
We present two context concerns specific to CR items because it is much more difficult to separate the person of the test taker from features of the response. In some cases, such as performance tests, the person is an important feature of the performance. In other cases, as in writing tests, personal voice and experience will inherently be embedded in the response. To avoid undue bias and barriers to optimal performance, it is important to consider cultural and regional diversity and accessibility. Similarly, we must ensure that the linguistic complexity is
suitable for the intended population of test takers.
9. Consider cultural and regional diversity and accessibility.
Decades ago, critics called for a moratorium on the testing of minority children (Williams, 1970). This was not a reaction to poorly designed tests as much as to inappropriate test use. Today we concern ourselves with fairness and accessibility. The Testing Standards (AERA et al., 1999) devote two chapters to these topics and strongly advise that a representative sample of individuals from the testing population be involved in all aspects of item and test development. This includes item tryouts, pilots,
and think-aloud studies to investigate item quality. It is critically important to include the widest
possible diversity from the target audience in the item development process. Techniques related to
improving item and test accessibility are described comprehensively in chapter 15. Here you will
find many examples of items in original versions and more accessible versions.
Regarding the cultural and regional contexts, there have been recent attempts to help us
understand how these contextual issues play a role in item development. Solano-Flores and his
colleagues (Kachchaf & Solano-Flores, 2012; Solano-Flores & Li, 2009a, 2009b) have undertaken
a wide agenda addressing issues of cultural relevance in item development. Some of that work
is related to the complexities of testing English language learners. This includes the influence
of rater language background, the role of dialectical differences across non-English native lan-
guages, and the use of cognitive interviews across cultural groups (Solano-Flores & Li, 2009).
Recently, the use of illustrations in science tests has been investigated with culturally diverse
populations. Solano-Flores and Wang (2011) developed a framework for examining science test
item illustrations. They combined contemporary theories of cognitive science, linguistics, and
sociocultural theory to identify features of test item illustrations that convey different concepts,
essentially introducing CIV. The dimensions of test item illustrations include (a) representation
of objects and background; (b) metaphorical visual language; (c) text present in illustrations; (d)
representation of variables, constants, and functions; and (e) illustration-text interactions.
In a recent study of these illustration features with American and Chinese college students,
Wang and Solano-Flores (2011) found that students’ interpretations of the scientific concepts
present in the illustrations were more accurate for items that were originally developed in the
students’ own culture, rather than those items developed in the other culture. We look forward
to future work in this direction as it improves our understanding of item features across cultures,
which we believe helps eliminate CIV and improve validity.
10. Ensure that the linguistic complexity is suitable for the intended population
of test takers.
This is a common theme throughout this volume. The language used in test directions, items, and
associated item stimuli must be appropriate for the target audience. It must not introduce CIV.
This is the same as guideline 9 for SR items.
The role of linguistic complexity is discussed often in this volume, because it is so important.
We have several useful guidelines for appropriately minimizing linguistic complexity. These
are presented most succinctly in chapter 15. The most important concepts are discussed by Abedi (2006), Abedi et al. (2012), and Hess, McDivitt, and Fincher (2008).
An important tool in understanding the role of linguistic complexity is the think-aloud. Gath-
ering item pilot information from a widely diverse sample of the target audience will provide
relevant information to satisfy the guidelines regarding context concerns.
Performance Test Considerations
The first three steps are covered well by the CR item/task guidelines described previously in
this chapter. The fourth step will be more fully described in chapter 12. The design of performance test task prompts tends to be more challenging to test takers. This challenge is made more
significant when trying to accommodate the target characteristics described above. Within this
step of task design, Stiggins suggested three components. First, the form of the exercise must be
selected, which may include structured tasks or natural events. Second, the degree to which the
task is on-demand or obtrusive and whether the test taker is aware of the test must be specified.
To be consistent with the goal of ensuring that test takers understand what is being asked of
them and ensuring clearly described response requirements, the test taker should be aware that
they are being assessed, particularly in high-stakes contexts. The third component requires the
specification of the amount of evidence to be gathered, both in terms of the number of samples of
behaviors and the amount of time given for the assessment.
Only two chapters on performance testing appear after 1995 (Lane & Stone, 2006; Welch,
2006). All references directly concerning item or task development predate 2000. Few recent
advances have occurred with respect to performance test item design. One exception is recent
work in evidence-centered design. A special issue of Applied Measurement in Education (2010,
Volume 23, Issue 4) deals with evidence-centered assessment design in practice.
Portfolio Considerations
Nearly all of the characteristics and conditions for performance tests are applicable to the port-
folio. That is because the portfolio is often a systematic collection of CR items and other infor-
mation that focuses on accomplishment, whether it is the outcome of education, training, or
professional preparation. Tombari and Borich (1999) provided a portfolio system checklist that
specifies elements of the portfolio that should be clarified for effective scoring. Many of the 17
components in the checklist are covered by the CR item/task guidelines. Elements of design that
are unique to portfolios include the following questions:
How are entries in the portfolio combined for overall scoring (if necessary)?
Does the test taker have a choice over content categories to be included?
Who selects the entries (test taker, teacher, supervisor, others)?
Other aspects of the portfolio design have analogs in the CR item/task guidelines, but are slightly
different in application to portfolios. For example, specific elements must be defined, including
the number of portfolio entries, allowable formats for various entries, and aspects of the portfolio
that might be evaluated beyond specific entries, including variety of entries, organization, overall
presentation, and growth across entries selected over time.
Summary
This chapter featured a set of guidelines for developing CR items. These guidelines were based on
earlier efforts from a variety of sources. The most important guidelines have to do with content
and cognitive demand. We hope that items achieve the highest fidelity with tasks in the target
domain. With CROS items, scoring is typically simple, but the content of the items is restricted
to tap knowledge with a relatively low cognitive demand. With CRSS items, the challenges are
greater due to the more complex responses in the areas of skills and abilities with higher cognitive
demand and the resulting subjectivity of scoring.
12
Scoring Constructed-Response Items
Overview
Chapter 10 presented a variety of constructed-response (CR) formats. One category of CR for-
mats is objectively scored (designated CROS). The other category is subjectively scored (desig-
nated CRSS). Chapter 11 presented CR item-writing guidelines. Although these guidelines in
chapter 11 addressed many aspects of designing CR items, guidelines for scoring are presented in
this chapter. This separation is necessary because of the complexity of scoring CR items. Therefore,
this chapter focuses on issues of scoring both CROS and CRSS items. Chapter 17 deals with sta-
tistical analysis of CROS item responses. Chapter 18 treats the analysis of ratings for CRSS items.
A good source of advice about scoring CR items can be found in a chapter by Lane (2010) and
another chapter by Lane and Stone (2006). Chapter 13 also points out some scoring problems
related to writing performance tests. Chapter 14 discusses scoring issues with CR item formats
for credentialing tests.
This chapter only briefly discusses scoring CROS items, because such scoring is simple and does not require a subject-matter expert (SME). Because CRSS items are usually designed to measure complex cognitive behavior, their scoring is more complicated. Much of this complication comes from the fact that two or more SMEs judge performance on some
task using a rubric (also known as a descriptive rating scale or scoring guide). As noted previ-
ously, we have emphasized that there should be a clear connection between the item and the
intended content and cognitive demand elicited by the item. The rubric must match the targeted
task. That is, the extent to which an item measures the intended complex cognitive behavior must be represented in the scoring.
Gitomer (2007) identified a core set of scoring problems that result from faulty task design for
CRSS items. First, he argued that judges of performance spend extra effort to resolve ambiguities
often found in responses that result from unclear task demands. Respondents do not know what
is expected given ambiguous task features and give responses that are similarly ambiguous. Sec-
ond, a rubric can be ambiguous. Thus, the rubric is unable to give SMEs sufficient information to
make accurate judgments. Finally, SME training and selection of exemplars are generally based
on assumptions that may not be consistent with the intent of the item developer. As described in
this chapter, Gitomer’s model requires prompt, task, and rubric clarity to ensure effective
scoring.
This chapter has the following sections. First, characteristics of CROS and CRSS scoring are
presented and discussed. The second and main part of this chapter involves CRSS scoring guide-
lines. As noted in this chapter, subjective scoring is very challenging and poses many threats to validity that must be avoided. With CRSS items, rater consistency is also a major issue because it affects reliability. Bias (rater effects) is another major issue with CRSS items because it affects valid-
ity. The chapter ends with a discussion of automated scoring, and administration issues affecting
scoring. Throughout this chapter, recommendations are made for designing highly effective CR
items that are either objectively or subjectively scored. Clearly, the greater challenge is with CRSS
items.
1. The table above shows the scores for 12 students on a science test.
What is the average (mean) score of the group, to the nearest whole number?
Answer: _______________
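The key for this CROS item is simply the arithmetic mean of the 12 scores, rounded to the nearest whole number; stated in our own notation (not part of the original item):

\[
\bar{x} = \operatorname{round}\left(\frac{1}{12}\sum_{i=1}^{12} x_i\right)
\]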
An example of a numeric entry item from the revised Graduate Record Examination (GRE)
General Test is presented in Figure 12.2.
2. A university admitted 100 students who transferred from other institutions. Of these
students, 34 transferred from two-year community colleges, 25 transferred from pri-
vate four-year institutions, and the rest transferred from public four-year institutions. If
two different students are to be selected at random from the 100 students, what is the
probability that both students selected will be students who transferred from two-year
community colleges?
In this item, the test taker must type in the responses for the numerator and the denominator.
In this example, the answer is 17/150. Equivalent fractions may also be accepted as correct answers.
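The keyed answer can be verified by multiplying the two selection probabilities; this is our worked check, not part of the GRE materials:

\[
P(\text{both from two-year colleges}) = \frac{34}{100} \times \frac{33}{99} = \frac{1122}{9900} = \frac{17}{150} \approx 0.113
\]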
A third example (Figure 12.3) is a multiple numeric entry item from the 2003 National Assess-
ment of Educational Progress (NAEP) mathematics test. The CROS item has partial credit scor-
ing because the item has two scorable parts. The possible number of correct responses is finite
(seven combinations), and the scoring rule is unambiguous. The test taker must get both a and b
correct to receive full credit, get only one correct to receive half credit, or get neither correct for
no credit. The possible resulting scores are 2, 1, and 0.
3.
A school yard contains only bicycles and wagons like those in the figure above.
On Tuesday the total number of wheels in the school yard was 24. There are several ways
this could happen.
a. How many bicycles and how many wagons could there be for this to happen?
Number of bicycles ________
Number of wagons ________
b. Find another way that this could happen.
Number of bicycles ________
Number of wagons ________
Scoring Guide
Solution:
Any two of the following correct responses:
0 bicycles, 6 wagons
2 bicycles, 5 wagons
4 bicycles, 4 wagons
6 bicycles, 3 wagons
8 bicycles, 2 wagons
10 bicycles, 1 wagon
12 bicycles, 0 wagons
2 – Correct
Two correct responses
1 – Partial
One correct response, either for part a or part b
OR
same correct response in both parts
0 – Incorrect
Incorrect responses
Objectively scored CR items are easy to score, and expertise is not needed to score right and
wrong answers. We often refer to this type of scoring as clerical.
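To make the 2–1–0 rule for the bicycles-and-wagons item concrete, here is a minimal sketch of how such clerical scoring might be automated. It is illustrative only; the function names and response format are our assumptions, not NAEP's scoring code.

```python
# Illustrative sketch of the 2-1-0 partial-credit rule for the bicycles-and-wagons
# item above; not NAEP's scoring code. The response format is an assumption.
VALID_PAIRS = {(b, w) for b in range(13) for w in range(7) if 2 * b + 4 * w == 24}
# the seven correct (bicycles, wagons) combinations listed in the scoring guide

def score_response(part_a, part_b):
    """Return 2, 1, or 0 for the two (bicycles, wagons) answers."""
    a_ok, b_ok = part_a in VALID_PAIRS, part_b in VALID_PAIRS
    if a_ok and b_ok and part_a != part_b:
        return 2        # two different correct combinations
    if a_ok or b_ok:
        return 1        # one correct combination, or the same one given twice
    return 0

print(score_response((2, 5), (8, 2)))   # 2
print(score_response((2, 5), (2, 5)))   # 1 (same correct response in both parts)
print(score_response((3, 5), (1, 1)))   # 0
```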
Automated Scoring
As we know, scoring is a time-consuming and very expensive activity for large-scale testing pro-
grams. Scoring involves the identification and training of raters; the development of systems
for sorting, routing, and collecting papers for scoring; monitoring the accuracy of the raters;
and resolving rater differences. These processes quickly become more complex as the number of
responses increases. For an extensive treatment of this topic, see Williamson, Mislevy, and Bejar
(2006).
There are many automated scoring programs available to improve these processes. Educa-
tional Testing Service has developed four computer programs for automated scoring. Livingston
(2009) briefly described each, and many papers address their design, functionality, and accuracy. The natural language processing models underlying the four programs (e-rater, c-rater, m-rater, and SpeechRater) are described at http://www.ets.org/research/topics/as_nlp/.
E-rater (essay-rater) analyzes the linguistic features of an essay. It is trained to develop
weights for the relevant features identified by the programmers to best predict the human score.
This engine scores the quality of the writing to the extent that the human scores also indicate writing quality.
Figure 12.4 Level 4 NBPTS sample rubric.
THE LEVEL 4 performance provides clear, consistent, and convincing evidence that the
teacher is able to foster active engagement of students, with the teacher and with each
other, in a small group exploration of a significant English language arts topic that is part
of a learning sequence effectively integrating reading, writing, listening, speaking, and/or
viewing.
• that the teacher has established a safe, inclusive, and challenging environment that
promotes active student engagement in the activities and substance of English lan-
guage arts instruction.
• that the teacher draws on a detailed knowledge of students’ backgrounds, needs,
abilities, interests, and his or her knowledge of English language arts in selecting high,
worthwhile, and attainable goals and in selecting instructional approaches that support
those goals.
• that the teacher integrates reading, writing, listening, speaking, and/or viewing activi-
ties that are connected to the learning goals and that the instruction is sequenced and
structured so that students can achieve those goals.
• of the teacher’s understanding of the dynamics of small-group discussion.
• of the teacher’s ability to foster student engagement and learning and of the teacher’s
skill in using open-ended questions, listening, and feedback to support active learning
in a small-group environment.
• that the teacher encourages students to explore, clarify, and challenge each other’s
ideas in a respectful and fair manner.
• that the teacher uses appropriate, rich, and thought-provoking instructional resources
to engage students in learning important English language arts content as part of the
small group exploration.
• that the teacher is able to describe his or her practice accurately, analyze it fully and
thoughtfully, and reflect insightfully on its implications for future teaching.
Overall, there is clear, consistent, and convincing evidence of the fostering of active
engagement of students, with the teacher and with each other, in the small-group discus-
sion of a significant English language arts topic that is part of a learning sequence effec-
tively integrating reading, writing, listening, speaking, and/or viewing.
It requires a wide range of responses to assign effective weights so that the full
range of human scores can be adequately predicted. Research has demonstrated the ability of
e-rater to assign scores that are very consistent with human scores; in some cases, e-rater matches human scores more consistently than a second human rater does. E-rater produces holistic
scores and provides diagnostic information about the quality of several features of the writing.
It has been used as a quality control tool in the GRE testing program and the Test of English as a
Foreign Language Internet-Based Test (TOEFL iBT) test.
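The idea of weighting linguistic features to predict human scores can be illustrated with a toy sketch. This is a conceptual illustration only, not e-rater: the features, the training data, and the use of ordinary least squares are all our simplifying assumptions.

```python
# Conceptual illustration only: a toy feature-weighting model in the spirit of
# automated essay scoring. This is NOT e-rater; the features, training data,
# and use of ordinary least squares are simplifying assumptions.
import numpy as np

def essay_features(text):
    """Extract a few crude, hypothetical linguistic features from an essay."""
    words = text.split()
    n_words = len(words)
    n_sentences = max(text.count(".") + text.count("!") + text.count("?"), 1)
    avg_word_len = sum(len(w) for w in words) / max(n_words, 1)
    avg_sent_len = n_words / n_sentences
    return np.array([1.0, n_words, avg_word_len, avg_sent_len])  # 1.0 acts as an intercept

# "Training": fit weights that best predict human holistic scores (least squares).
train_essays = ["A short sample essay. It has two sentences.",
                "Another sample essay with a single longer sentence for illustration."]
human_scores = np.array([4.0, 2.0])          # invented ratings
X = np.vstack([essay_features(e) for e in train_essays])
weights, *_ = np.linalg.lstsq(X, human_scores, rcond=None)

def predict_score(text):
    """Apply the learned feature weights to a new response."""
    return float(essay_features(text) @ weights)

print(round(predict_score("A brand new essay to be scored. It is brief."), 2))
```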
C-rater (content-rater) analyzes the content of the response, rather than the quality of the writ-
ing. This engine is ideal for analytically scored short-answer items from a few to about 100 words.
The programmer can indicate all possible words and phrases that would be found in a correct
response, and even statements that might result in the subtraction of points. C-rater can be used
to award any number of points to a response. C-rater can also generate statements of similar meaning to capture novel and innovative responses that are nonetheless correct.
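A crude sketch of analytic short-answer scoring by phrase matching conveys the general idea, though c-rater itself relies on far more sophisticated natural language processing; every pattern and point value below is hypothetical.

```python
# Crude illustration of analytic short-answer scoring by phrase matching.
# This is NOT c-rater; every pattern and point value below is hypothetical.
import re

CREDIT_PHRASES = {            # pattern -> points awarded if present
    r"photosynthesis": 1,
    r"light energy .* chemical energy": 1,
}
PENALTY_PHRASES = {           # pattern -> points subtracted if present
    r"plants eat soil": 1,
}

def score_short_answer(response, max_points=2):
    text = response.lower()
    points = sum(p for pat, p in CREDIT_PHRASES.items() if re.search(pat, text))
    points -= sum(p for pat, p in PENALTY_PHRASES.items() if re.search(pat, text))
    return max(0, min(points, max_points))   # keep the score in range

print(score_short_answer("Photosynthesis converts light energy into chemical energy."))  # 2
print(score_short_answer("Plants eat soil to grow."))                                    # 0
```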
M-rater (math-rater) analyzes numeric responses, algebraic expressions, and graphical dis-
plays and geometric figures generated on the computer. The m-rater engine evaluates the equiva-
lence of the response to the correct response identified by the programmer. It looks for algebraic equivalence, converts a line to an equation to check its accuracy, or checks the accuracy of many points on a curved line.
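Checking algebraic equivalence of a typed response against a keyed expression can be sketched with a symbolic algebra library such as sympy; this is an illustration of the general technique, not m-rater's actual engine.

```python
# Minimal sketch of algebraic-equivalence checking with the sympy library;
# an illustration of the general technique, not m-rater's actual engine.
import sympy as sp

def equivalent(response, reference):
    """True if the student's expression simplifies to the keyed expression."""
    try:
        diff = sp.simplify(sp.sympify(response) - sp.sympify(reference))
    except (sp.SympifyError, TypeError):
        return False   # unparseable responses are treated as incorrect
    return diff == 0

print(equivalent("2*(x + 1)", "2*x + 2"))         # True
print(equivalent("x**2 - 1", "(x - 1)*(x + 1)"))  # True
print(equivalent("x + 1", "x + 2"))               # False
```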
Speech-rater evaluates ability to communicate in English. Similar to E-rater, it analyzes lin-
guistic features of spoken language and predicts the human score. This software has been used to
support the TOEFL test program.
Many other models are available throughout the industry. A recent innovative scoring pro-
gram was developed by researchers at the University of Utah (http://ruready.net). They devel-
oped a mathematical expression parser to score mathematics CR items automatically. The parser
compares a response with an instructor-provided reference expression and identifies the correct-
ness or absence of each required element. For a review of this work, see Livne, Livne, and Wight
(2007).
Researchers from Educational Testing Service, Pearson, and College Board collaborated on a
white paper with the goal of creating methods and computer applications that reduce the cost
and effort involved in using human graders while improving the validity of test results (Williamson et al., 2010). Their paper was written in response to the demands presented by the assessment of the Common Core State Standards. It is a strong resource that addresses the state of the art in automated scoring, ensuring the accuracy of automated scoring systems, and deploying automated scoring.
Pearson Education has developed automated scoring programs for use in primary, secondary, and post-secondary education settings, as well as by government agencies, publishers, and other organizations engaged in testing and assessment (Streeter, Bernstein, Foltz, & DeLand, 2011). They
presented issues related to automated scoring of writing, speaking, and mathematics, particularly
in the context of current work on Common Core State Standards assessments. They addressed
two key questions in high-stakes automated scoring posed by Williamson et al. (2010):
1. Are automated scoring methods disclosed in sufficient detail to determine their merit?
2. Are automated scores consistent with consensus scores from human graders?
The types of CR items being scored through automated scoring at Pearson include essays, short
text responses to content-related questions, numeric and graphical responses to mathematics
questions, and spoken responses. The automated scoring through Pearson is in operational use
with College Board’s ACCUPLACER test and the Pearson Test of English, practice essays for
publishers and test preparation companies, and short-answer scoring for the Maryland science
tests, among others.
Pearson’s Knowledge Technologies group hosts the Intelligent Essay Assessor (IEA), an Inter-
net-based tool for automated scoring of essay quality. IEA evaluates essays and short answer
responses for substantive content and mechanics, grammar, and style. IEA is a flexible platform
that can be trained to meet the unique scoring needs of a testing program, including, for exam-
ple, the six traits of writing. Research on the IEA tool has demonstrated its capacity to evaluate
content in multiple areas, including science, history, social studies, and language arts, as well as text in other languages.
Another Pearson product line includes the Versant Testing System, including a writing test,
and speaking tests in English, Spanish, Arabic, Aviation English, and others. All involve auto-
mated scoring. The speaking tests apply advances in speech recognition, employing scoring
based on linguistic units incorporated into statistical models built from actual performance of
native and non-native speakers. The scoring can be used on a variety of speaking item types
including reading aloud, repeating sentences, short-answer responses to standardized questions,
brief storytelling, elements of conversations, and comprehension of reading passages. The writ-
ing tasks suitable for automated scoring on Versant tests include activities like typing, complet-
ing sentences, dictation, writing emails, and other functional tasks. Some research is available on
the quality of these automated scoring tools (Pearson, 2008).
Computational and applied linguistics have been invaluable in scoring elements of speech communication. Bernstein and Cheng (2007) reviewed the logic of automated scoring of spoken English and presented initial validity evidence, comparing automated scoring to the standard oral proficiency interview (OPI) across several forms of evidence.
Above all, for automated scoring, a core necessity is the accumulation of validity-related evidence
to support the scoring system. This includes human scoring and automated machine scoring.
Before operational use, the scoring routines must be calibrated and evaluated given multiple
sources of agreement, across humans, between humans and machines, and between humans,
machines, and expert scorers.
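As a minimal sketch of what such calibration checks might look like, the following computes exact agreement and correlation between pairs of score vectors; the ratings shown are invented, and operational programs would add more refined indices such as quadratic-weighted kappa.

```python
# Minimal sketch of pre-operational calibration checks: exact agreement and
# correlation between score vectors. Ratings are invented; operational programs
# would add more refined indices such as quadratic-weighted kappa.
import numpy as np

def exact_agreement(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(np.mean(a == b))

def correlation(a, b):
    return float(np.corrcoef(a, b)[0, 1])

human_1 = [4, 3, 5, 2, 4, 3]   # hypothetical human ratings
human_2 = [4, 3, 4, 2, 5, 3]   # a second human rater
machine = [4, 3, 5, 2, 4, 4]   # an automated engine

print("human-human:", exact_agreement(human_1, human_2), round(correlation(human_1, human_2), 2))
print("human-machine:", exact_agreement(human_1, machine), round(correlation(human_1, machine), 2))
```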
In a validity study examining concurrent criterion-related evidence, scores from Versant tests
of language in Spanish, Dutch, Arabic, and English were correlated with standard interview-
based tests of language (Bernstein, Van Moere, & Cheng, 2010). Strong correlations were found
between automated and interview-based scores, as were test-retest correlations. These research-
ers also carefully explored the domain-specific interpretive argument for scores regarding lan-
guage proficiency across languages.
A search of the World Wide Web will also identify additional for-profit automated scoring
services and programs: Pacific Metrics, Knowledge Analysis Technologies, and Vantage Learn-
ing, among others, and of course systems developed for specific testing programs, including the
National Board of Medical Examiners and the National Council of Architectural Registration
Boards.
The future of automated scoring is very bright due to three qualities. First, it will be more eco-
nomical than human scoring. Second, rater consistency will be ensured. Third, systematic error
caused by human scorers can be eliminated. This argues for continued research and development
for automated scoring.
Effective scoring begins with the intended content and cognitive demand. Scoring must stick to the goal of task clarity and focus attention on the
right features of the response in the right balance to support the intended inference. The scoring
process should not introduce construct-irrelevant variance (CIV) that may allow rater bias or
personal preferences to enter into judgments and scores. The scoring process also serves to evolve
the meaning of the construct. What is a good response to a CR task is shaped by how exemplars,
or benchmark responses, are selected (Gitomer, 2007, p. 9).
Two messages surface in these considerations. First, the CR item must be designed in a way
that captures the intended attribute or behavior. Second, the scoring rules must be consistent
with that intent, capturing the relevant aspects of the response that inform our understanding
of the level, quantity, or quality of the attribute without bias. These are not simple messages and
require careful planning. This is why scoring rules must be developed at the time the item or task
is developed. Task demands are only clear in a rubric. The meaning of the rubric is only clear in
the scoring process (Gitomer, 2007, p. 22).
As presented in the previous chapter, valid inferences about learning require that the test taker
understands what is being asked and the response requirements, and the scoring rules must allow
the raters to interpret the student response consistently.
We have a few studies that illustrate the important connection between task design, scor-
ing, and inferences. One study involved a review of student responses to a statewide reading
test (Goldberg & Kapinus, 1993). There were frequent examples of how student responses were
inconsistent with the intended demands of the task, many of which appeared to be due to lack
of task clarity. Consider the following example from that study. After reading a short story, stu-
dents were asked a question intended to elicit a comparison utilizing text-based information.
These researchers found that students simply identified a similar or contrasting experience but
provided no construction of meaning to the text. They argued that an implied connection to the
text resulted in surface-oriented responses. They recommended a revision to request an explicit
comparison or contrast between personal experience and text-based contexts. Figure 12.5 shows
two versions of the item.
Original version
4a. Think about how Eric’s friends helped him in the story and then explain below how an
experience of yours was like or unlike Eric’s. (Goldberg & Kapinus, 1993, p. 288)
Recommended revision
4b. Describe an experience you had in which someone helped you and explain how your
experience was like or unlike Eric’s. (Goldberg & Kapinus, 1993, p. 289)
In other cases, responses to questions intended to elicit text-based evidence were provided that
were plausible, but not text-based. For example, questions such as “What was the main purpose
of this recipe?” or “Would the story be as effective without illustrations?” (Goldberg & Kapinus,
p. 290) could be answered without reference to the text. Responses such as “to teach someone
how to cook something” or “No, the illustrations help you to visualize what’s happening” (p. 290)
presented challenges to scorers.
Similarly, Wolfe and Gitomer (2001) used a complex performance assessment of teachers
from the National Board for Professional Teaching Standards. They examined the effect of design
changes intended to reduce the ambiguity of task demands to test takers and judges. The three
changes involved giving the highest level of the scoring rubric to test takers when preparing
responses, giving test takers explicit guidance on what to focus on and what to avoid, and break-
ing down the complex task into smaller questions. These design features resulted in scores with
higher reliability, nearly equivalent to doubling the number of tasks. The point is that the quality
of task design and the quality of scoring are interdependent. When coordinated, scoring is more effective, and the
results more valid. In addition, a set of questions was developed to help structure how scorers
would review evidence presented in portfolio entries. The following questions are examples of
those presented to scorers to help them focus their judgments (p. 98):
1. Are the goals of the lesson worthwhile and appropriate, even if they are not goals I would
choose for my students?
2. Is the teacher demonstrating knowledge of his or her students, as individuals or as a devel-
opmental or social group, even if the teacher’s approach is different from one I would
take?
3. Is the teacher showing command of the content, making connections, even if they are not
the connections I would make?
4. Are the students engaged in the lesson, even if it is not a way I am used to?
CR Scoring Guidelines
These scoring guidelines are organized by three major activities: (1) Clarifying the construct
with respect to content and cognitive demand, (2) developing an appropriate and useful scoring
guide, and (3) developing scoring procedures to secure meaningful results. The scoring guide-
lines encompass all of the recommendations from Hogan and Murphy (2007), the integrated
design principles of Gitomer (2007), and other item-writing guidelines presented in this book.
In the guidelines that follow, we use the term task to refer to all types of CR items, including the
extended-response essay, the portfolio, and performance tests. A summary of the guidelines is
presented in Table 12.1.
SCORING PROCESS
6. Qualify raters.
7. Train raters.
8. Rate consistently.
9. Minimize bias.
10. Obtain multiple ratings.
11. Monitor ratings.
CONTENT CONCERNS
1. Clarify the intended content and cognitive demand of the task as targets for scoring.
The test item should meld elements of the definition of the content being measured, intended
inferences, item/task design, scoring, and resulting inferences from the score assigned. This clari-
fication is also consistent with the content concern guidelines from SR items. It is an absolute
necessity to decide in advance what qualities are to be considered in judging the adequacy of a
response to each task. This step is achieved when designing the task. It enables the first step in
developing scoring guides. SMEs must reach consensus for validating each CRSS item. We start
with the assumption that the CRSS task requires thinking and response features that cannot be
easily elicited through SR formats. The completion of this activity provides an important piece of
item validity evidence.
For example, in responses requiring a writer to take a position for or against an issue, we rec-
ommend separating the quality of response and argument from which position was taken. The
intent of the task is to elicit an argument for or against a particular position, to contrast positions,
or to develop a position. The score should reflect these intended task demands, not whether the
response is consistent with the actual position of the rater.
The recent efforts by the College Board to support advances in the Advanced Placement (AP)
program have led to a formal specification of redesign methodology. AP provides an interest-
ing context because the tests are course-specific with a clear curriculum and explicit learning
outcomes. Four major steps were defined for the redesign effort, including (1) identify the main
ideas in the discipline; (2) identify the enduring understandings and supporting understandings
for each main idea, which include necessary knowledge and skills; (3) make connections among
these understandings explicit; and (4) specify the evidence needed to demonstrate mastery of
the knowledge, skills, and abilities (KSAs). To support this work, they used Evidence Centered
Design (Mislevy & Riconscente, 2006) as the overall framework and the approach of Wiggins and
McTighe (2006) to clarify and prioritize KSAs.
Four historical thinking skills were identified to structure the AP World History curriculum.
This included a definition of the skill component, a description of the desired proficiency, and
an explanation of how the skill could be addressed instructionally. These guides provide excel-
lent frameworks for test item writers as items are written to specific content across the curricu-
lum—as detailed in the AP World History Course and Exam Description (College Board, 2011).
All four skills are assessed in the AP examination.
Consider an AP World History test item appearing in Figure 12.6. This sample document-
based question was administered in the 2010 AP World History Examination (College Board,
2011, pp. 112–113). This task, the document-based essay question, is designed to evaluate stu-
dents’ ability to apply the four historical thinking skills to formulate an answer from documentary evidence. It is not an attempt to test knowledge of the subject matter. Notice also that the directions are explicit about what is expected in a response; they reflect the points in the
scoring guide.
Another example of a typical, essay-based test item is taken from the Minnesota Test of Writ-
ten Composition used for high school graduation. In their Item Sampler Guide, they stated that
the purpose of the test was to measure writing ability as displayed at a given point in time. A
passing score of 3 or higher is required for a diploma from a Minnesota public high school. “It
is important that the writing be an example of what the student is able to produce without the
assistance of teachers, peers, or writing resources” (Minnesota Department of Education, 2011,
p. 1a). They further clarify the target of measurement:
Figure 12.6 Sample document-based question from the 2010 AP World History Examination (College Board, 2011).
Directions: The following question is based on the accompanying Documents 1–5. (The
documents have been edited for the purpose of this exercise.) This question is designed to
test your ability to work with and understand historical documents. Write an essay that:
• Has a relevant thesis and supports that thesis with evidence from the documents.
• Uses all of the documents.
• Analyzes the documents by grouping them in as many appropriate ways as possible.
Does not simply summarize the documents individually.
• Takes into account the sources of the documents and analyzes the authors’ points of
view.
• Identifies and explains the need for at least one additional type of document.
You may refer to relevant historical information not mentioned in the documents.
5. Using the following documents, analyze similarities and differences in the mechaniza-
tion of the cotton industry in Japan and India in the period from the 1880s to the 1930s.
Identify an additional type of document and explain how it would help your analysis of
the mechanization of the cotton industry.
Students may use an expository/informational mode with narrative and persuasive elements
to develop this composition. Students may choose to respond to prompts in many ways and
may draw from a variety of personal experiences. Students should make effective choices in
the organization of their writing. They should include details to illustrate and elaborate their
ideas and use appropriate conventions of the English language. (Minnesota Department of
Education, 2011, p. 2a)
Finally, advice is given to teachers, stating that the best way to prepare students for the content
aspects of the test is to be comprehensive in curriculum and instructional planning. Teachers are
encouraged to review the Minnesota Academic Standards and the graduation test specifications.
A related question is what a response must contain to earn full credit. Is it enough for the response to contain the necessary information in the presence of additional incorrect information? Does the presence of incorrect information cast doubt on what is being tested?
Revisiting the AP World History example item presented previously (Figure 12.6), one element
from the directions is shown below in Figure 12.7. This instruction tells test takers to avoid simple summarization of the documents, and the scoring guide denies credit to such responses.
Write an essay that does not simply summarize the documents individually.
For the Collegiate Learning Assessment (CLA) performance tasks, directions are frequently
given to raters to keep prompt-specific issues in mind and to ignore other characteristics of
responses that seem content-related, but unrelated to the task demand—the construct being
measured.
In the CLA task called Crime Reduction (Council for Aid to Education, 2011), students must
agree or disagree with a character in the scenario who argues that hiring additional police will
lead to more crime, based on a correlation between number of police in a community and the
number of crimes. Raters are instructed to give credit if the student disagrees with the state-
ment, but a strong response conveys that correlation does not imply causation, even if the words correlation and causation are not used. The point is that the student can express this
principle.
An example CLA analytic writing task is the Make-an-Argument task that focuses on logic and
argumentation, rather than the specific position taken (Council for Aid to Education, 2011). Stu-
dents are instructed “to plan and write an argument on the topic on the next screen. You should
take a position to support or oppose the statement. Use examples taken from your reading, course
work, or personal experience to support your position” (p. 13). Raters are instructed to identify the
position taken in the response and assess the reasoning, examples, and considerations made for the
complexity of the issue. Students can argue for one position or the other, or they can argue that both
or neither of the positions have merit. Whatever the argument, it must be clearly stated and supported.
Referring to the Minnesota Written Composition Test described earlier, specific guidance is
given regarding the role of prior knowledge and the intent of the test, to clarify what the test is
intended not to measure:
It is important the writing be an example of what the student is able to produce without the
assistance of teachers, peers, or writing resources. (Minnesota Department of Education,
2011, p. 1a)
In addition, a two-stage scoring process is in place. The first stage uses a rubric with scores ranging from 0 to 6, where a score of 3 or higher is considered passing.
This first stage addresses the degree to which the response is on-topic; is focused and structured with a beginning, middle, and end; contains supporting detail; and demonstrates control of lan-
guage and knowledge of grammar and mechanics. The highest score point of 6 is reserved for
exceptionally skillful compositions. The score point of 3 is reserved for a basic passing composi-
tion. The lowest score point of 0 is reserved for a below-basic composition.
Scores of 0 to 2 (not passing) result in the use of a secondary rubric called the Domain Reviewing
rubric. This rubric categorizes each writing domain as either Developing Skills or Minimal Skills,
including the domains of composing, style, sentence formation, usage/grammar, and mechan-
ics/spelling. This provides additional diagnostic information for non-passing compositions.
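The two-stage routing described above can be sketched as a simple decision rule; this is our illustrative rendering of the flow, not Minnesota's operational scoring system, and the data structure is assumed.

```python
# Illustrative sketch of the two-stage scoring flow described above; not the
# Minnesota Department of Education's operational system. Labels follow the text.
DOMAINS = ["composing", "style", "sentence formation", "usage/grammar", "mechanics/spelling"]

def score_composition(holistic_score, domain_judgments=None):
    """holistic_score: 0-6 on the primary rubric; 3 or higher passes.
    domain_judgments: optional dict mapping each domain to 'Developing Skills'
    or 'Minimal Skills', used only when the composition does not pass."""
    result = {"holistic": holistic_score, "passing": holistic_score >= 3}
    if not result["passing"]:
        judgments = domain_judgments or {}
        # Secondary Domain Reviewing rubric adds diagnostic information.
        result["domains"] = {d: judgments.get(d, "not reviewed") for d in DOMAINS}
    return result

print(score_composition(4))
print(score_composition(2, {"style": "Developing Skills", "composing": "Minimal Skills"}))
```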
Table 12.2 Four types of rubrics with examples from this chapter.
Analytic, task-specific: Connecticut Academic Performance Test (Figure 12.9)
Analytic, generic: Nebraska Grade 8 Descriptive Writing (Figure 12.8)
Holistic, task-specific: Level 4 NBPTS Sample Rubric (Figure 12.4); Massachusetts Comprehensive Assessment System Rubric (Figure 12.21)
Holistic, generic: Minnesota Writing Composition Rubric (Figures 12.17 and 12.18)
For testing programs, the first distinction is the most important. The choice is analytic or
holistic. The analytic scoring system uses a set of rubrics that breaks down the test taker's response into the essential components or elements required to fully describe the qualities being assessed. A holistic rubric is an overall description of the entire performance, in which analytic traits are integrated to represent a complete picture of performance. The perspec-
tive for a holistic rubric is that performance is more than the sum of its parts.
Task-specific models are specifically designed to address the unique content or cognitive
demands of the specific task. A generic rubric can be applied across similar tasks. The use of task-spe-
cific rubrics is highly desirable for classroom instruction because it affords a high level of clarity
for students, but in a testing program, task-specific rubrics are difficult to create and difficult to
scale for comparability across testing occasions. In most instances in this chapter, generic rubrics
are used and recommended. Additional discussion of rubrics and examples are also provided in
chapter 13, which is specifically devoted to measuring writing ability.
Analytic Rubric The analytic rubric is actually a set of coordinated rubrics that completely
describes an ability—such as writing. The idea of the model answer is broken down into charac-
teristics. These characteristics should be independent and separable, and therefore separately scorable. Develop
an outline or list of major traits that students should include in the ideal answer, then determine
the number of points to award for each element. Focus on one characteristic of a response at a
time.
Advantages: Analytic scoring can yield very reliable scores when used by conscientious raters.
The process of preparing the detailed answer will provide another opportunity to examine rel-
evant features of the CR task, bringing attention to errors like faulty wording, extreme difficulty
and/or complexity of the task and unrealistic time limits. An analytic framework for scoring also
provides a stronger basis on which to discuss performance with test takers, including the identi-
fication of strengths and weaknesses. By scoring analytically, you can look over all the papers to
see which elements of the answer gave students the most trouble and therefore need to be further
developed. By weighting some elements of the response, a strong indicator can be provided to test
takers regarding what aspects of responses are valued. This means that test takers must be given
information about scoring before the examination. Thus, analytic scoring can be very useful for
formative evaluation of learning.
Disadvantages: Analytic scoring can be time-consuming. In attempting to identify various
characteristics of a response, giving undue attention to superficial aspects of the response is also
possible. For some tasks, identifying well-defined elements to include in the scoring guide may
be difficult. If the characteristics being scored in the analytical rubric are highly related, scores
across dimensions will be similar, providing no additional information about the performance
at a higher cost.
There are examples of analytic scoring guides in chapter 13 on the measurement of writing
ability. A typical example is found in the Nebraska Department of Education (2010) analytic
scoring guide for descriptive writing. Four analytic traits are ideas/content (35%), organization
(25%), voice/word choice (20%), and sentence fluency/conventions (20%). As an example, Figure
12.8 gives descriptions of the highest level (4 points) for the first two components and the lowest
level (1 point) for the second two components.
Figure 12.8 Nebraska Analytic Scoring Guide for Grade 8 Descriptive Writing
Source: http://www.education.ne.gov/assessment/pdfs/FINAL analytic rubric.descriptive.pdf
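To illustrate how analytic trait scores might be combined using weights like those reported above, here is a hedged sketch; the combination rule and the 1 to 4 trait scale are our assumptions, not Nebraska's reporting method.

```python
# Hedged sketch: combining analytic trait scores into a weighted composite.
# The weights mirror those reported for the Nebraska descriptive-writing guide;
# the combination rule and 1-4 trait scale are illustrative assumptions only.
TRAIT_WEIGHTS = {
    "ideas_content": 0.35,
    "organization": 0.25,
    "voice_word_choice": 0.20,
    "sentence_fluency_conventions": 0.20,
}

def weighted_composite(trait_scores):
    """trait_scores maps each trait name to a 1-4 rating from the analytic rubric."""
    return sum(TRAIT_WEIGHTS[t] * s for t, s in trait_scores.items())

print(round(weighted_composite({
    "ideas_content": 4,
    "organization": 3,
    "voice_word_choice": 4,
    "sentence_fluency_conventions": 2,
}), 2))  # 3.35 on the 1-4 metric
```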
Another example below shows a task with a specific set of rubrics. The score is based on per-
formance levels defined by the sufficiency of evidence in the written work submitted by the stu-
dents on multiple criteria. Figure 12.9 provides directions and the scoring guide. This exam-
ple item is from an early version of items being piloted during the 1990s for the Connecticut
Common Core of Learning Performance Assessment Project (CAPT) in Science. It was retrieved
online from the Performance Assessment Links in Science (PALS, http://pals.sri.com/).
Holistic Rubric The holistic rubric incorporates all analytic traits into a single rating. The rater
makes a single judgment about the overall quality of a response. The ideal answer may be a
composite of many analytic traits, but the composite is more important than the quality of indi-
vidual features. All factors are taken into account in forming the judgment about the quality of
the response. Holistic scoring is less specific than the analytic method.
Advantages: Holistic scoring is efficient compared with analytic scoring because there is
only one score instead of a set of scores. This kind of scoring enables a comprehensive view of a performance.
6. In this activity you are going to investigate some of the chemical and physical effects
of a car wash on its environment. Your performance on this activity, alone, and as a
member of a group, will be assessed by your teacher, based on your written materials
and your group’s oral presentation. Therefore, make sure that everything you do is well
documented. Keep a careful record of your experimental plans, all the data you gather,
analyze and display (computation, graphing, etc.), and your final conclusions.
Figure 12.9 Early example item and rubric from the Connecticut Common Core of Learning Performance Assessment Project in Science.
Source: http://pals.sri.com/
Used with permission, Connecticut State Department of Education.
This method can be used with tasks that are not divisible into separate components
and may convey to the respondent that the overall performance is what is valued.
Disadvantages: Selecting anchor papers may be difficult, especially if the sample responses are homogeneous. Holistic scores do not provide explicit information about strengths
and weaknesses. However, feedback can be provided through comments recorded by the rater.
Raters’ personal biases and errors can be masked by the overall score. Holistic scores tend to result
in a lower reliability estimate due to fewer observations. Also, reliability is difficult to estimate.
A good example of a generic, holistic rubric is found in the writing test of the SAT (College
Board; http://sat.collegeboard.org/scores/sat-essay-scoring-guide) shown in Figure 12.10. This is
the description for the highest score attained.
SAT Writing Test Score of 6: An essay in this category demonstrates clear and consistent
mastery, although it may have a few minor errors. A typical essay:
• Effectively and insightfully develops a point of view on the issue and demonstrates
outstanding critical thinking, using clearly appropriate examples, reasons and other
evidence to support its position
• Is well organized and clearly focused, demonstrating clear coherence and smooth
progression of ideas
• Exhibits skillful use of language, using a varied, accurate and apt vocabulary
• Demonstrates meaningful variety in sentence structure
• Is free of most errors in grammar, usage and mechanics.
Referring again to the Minnesota Grade Nine Test of Writing Composition previously pre-
sented, two layers of rubrics were listed. The first is a holistic rubric that assesses the overall
quality of the composition. Scores range from 0 to 6, and a score of 3 is passing. An analytic
rubric reserved for compositions receiving scores below 3 (not passing) provides more informa-
tion regarding the five domains of writing based on Minnesota state standards for writing. In
Figure 12.11 we provide the description of a score point of 2, just below passing, on the holistic
rubric. We also provide a sample of the domain-specific analytic rubric for the domain of style for
the two levels of developing versus minimal skills (Minnesota Department of Education, 2011).
Analytic and holistic rubrics appear to be evenly divided in their use in various states and in
the nation with respect to writing (Haladyna & Olsen, submitted for publication). The choice of
a rubric type depends on which type’s strengths and weaknesses best fit the needs of the evaluation of CRSS responses.
A scoring guide should also ensure that different responses earning the same score level are cognitively equivalent. Finally, if anticipated responses are specified in great detail, little room is left for innovative or unique responses, resulting in scoring difficulties and greater inconsistency.
4.a. Clarify distinctions across score points. To have a functional scoring system where raters can
score responses consistently, and to support validity, the nature of the evidence required to achieve
each score point must be clearly defined. This also requires that the score labels used to differentiate the points in the scoring guide be parallel and based on the same feature, characteristic, or
metric of the quality differentiating score points. The Virginia Grade Level Alternative (VGLA)
assessments for students with disabilities (Virginia Department of Education, 2011) illustrate a
reference to evidence without being specific about the evidence. The VGLA is holistically scored at
the standard level, based on the collection of evidence (COE) submitted by the student. The COE
is scored on a 0 to 4 rubric based on the degree to which the evidence demonstrates student achievement
of the standard. These points or levels of evidence are rated with respect to the demonstration of
knowledge and skills addressed in the specific standard being assessed (see Figure 12.12).
0 = no evidence
1 = little evidence
2 = some evidence
3 = adequate evidence
4 = ample evidence
Figure 12.12 The Virginia Grade Level Alternative Assessment holistic rubric score points.
This rubric is based on an alternative test for students with disabilities or limited English pro-
ficiency (Virginia Department of Education, 2011). The points on this rubric differentiate score
points as a function of how much evidence is presented in the response. Students and teachers
select graded work samples, which include audio, video, or interview evidence and which must be accompanied by statements of accuracy. The challenge here is a subtle change from an amount
of evidence to the adequacy of the amount. The first three points are clearly amounts (although
vague amounts) as shown in Figure 12.13.
0 = no evidence
1 = little evidence
2 = some evidence
Then there is a shift in the characteristic of the score label, implying a criterion regarding the amount of evidence, as shown in Figure 12.14.
3 = adequate evidence
4 = ample evidence
An alternative would be to focus on the adequacy of the evidence, where adequate would have to be clearly defined and evidence understood as appropriate supporting facts and information, as shown in Figure 12.15.
0 = no evidence
1 = inadequate evidence (very little)
2 = partially adequate evidence (some, but not enough)
3 = adequate evidence (enough or sufficient)
4 = ample evidence (more than enough)
The Minnesota Test of Written Composition provides a clear description of composition quality
at each level. The descriptions are parallel as to content across the score points. Each score point
description addresses the same characteristics of the composition, but describes a different level of
quality. Consider the descriptions (Minnesota Department of Education, 2011) of score points 4 (a
competent composition) and 3 (a basic passing composition) presented in Figure 12.16.
Figure 12.16 Comparisons and contrasts between adjacent score points on the rating scale.
Source: Minnesota Department of Education (2011). Used with permission.
Upon reviewing the descriptions of each score point, each is parallel as to content. Between
each score point, the characteristics of the composition that relate to overall quality are equally
described.
4.b. Define clear justifications within score points. The justifications provided within a score
point should be coherent and focused on the construct being measured. Secondly, within a given
score point on the rubric, if there are different ways to achieve that score level, justification should
be clear so that different responses achieving the same score level are equivalent. Figure 12.17
contains an example of a score point of 6 (exceptionally skillful composition) from the Minnesota
GRAD writing test for grades 9–12 (Minnesota Department of Education, 2011).
The composition:
Figure 12.17 Minnesota GRAD writing test score point 6 for an exceptionally skillful composition.
Source: Minnesota Department of Education (2011). Used with permission.
In this rubric (Figure 12.17), we see that the elements of a skillful composition are described
with parallel features to obtain the score of 6, where multiple features or characteristics of the
writing are provided as options for achieving the same score. Consider the third bullet, which
provides for parallel features of “ample, selected supporting detail” or “elaboration that clarifies
and expands the central idea.” Consider the fourth bullet, which provides for parallel features
of “uses transitional devices, parallel structure, or other unifying devices” with the same goal of
providing a progression of ideas. How the progression of ideas is produced is not as important as
the provision of “a clear, unified progression of ideas.”
The Minnesota Writing Composition Test example is a clear description of composition quality
at each level. Consider the score point 4 descriptor of a skillful composition (MN Department of
Education, 2011), presented in Figure 12.18.
The composition:
Figure 12.18 Minnesota GRAD writing test score point 4 for a skillful composition.
Source: Minnesota Department of Education (2011). Used with permission.
Upon reviewing the descriptions of each score point, they are parallel in terms of content,
so that within each score point, the characteristics of the composition that relate to overall
quality are equally described. There are words in this rubric that do need clarification, which
usually comes as exemplars or benchmark papers illustrating each score point. Words that are
initially vague and require some clarification and training include clearly expressed, sufficient development, and minor obstacles. Other aspects of the description are much clearer and
behavioral.
4.c. Do not over-specify expected responses. If anticipated responses are specified in great
detail, it does not leave room for innovative or unique responses, resulting in scoring difficulties
and greater inconsistency.
In the CLA performance task example introduced previously in this chapter, the student is
asked to agree or disagree with the statement that more police lead to more crime. A strong
response demonstrates understanding that correlation does not imply causation and suggests an
alternative reason for the relation between number of police and crime in a community. How-
ever, the student is required to express uncertainty because no evidence of alternative reasons is
presented, where raters are instructed to look for words such as might or could. Weak responses
are ones stated with certainty, with words such as obviously or clearly. In the scoring guide, exam-
ple words are provided and raters are instructed to consider similar language that conveys the
same intent.
The Massachusetts Comprehensive Assessment System high school biology examination pro-
vides a sample scoring guide for a CRSS item in anatomy and physiology (question 12, provided
online). This addresses a content standard to explain how the digestive system converts macro-
molecules from food into smaller molecules that the cells can use, provided in Figure 12.19.
7. The digestive enzymes in the table function in some organs to perform the chemical
digestion of food. The major organs of the digestive system are the esophagus, large
intestine, mouth, pharynx, small intestine, and stomach.
A. List these six organs in the order in which food passes through them.
B. Identify which of these organs is primarily responsible for absorbing nutrients from
digested food.
C. Describe the functions of two of the organs listed other than the one you identified in
part (b).
4.d. Expectations for the same cognitive demand should be the same across similar tasks and
scoring rules. To maintain coherence in the test and expectations across tasks for test takers and
raters, establishing a consistent set of expectations regarding the cognitive demand for different
tasks is important. This is especially important when task-specific rubrics are used. The rubrics
must be parallel in terms of the expectations for scoring features of responses to different tasks.
From the set of example items provided by the Massachusetts Comprehensive Assessment
System, there are several opportunities to observe a consistent set of task-specific rubrics across
several tasks in several domains. Consider the descriptions for the highest score point of 4 for
different tasks in Figure 12.21. The rubric designers carefully structured each description to be
parallel in terms of wording and specificity. Each description states that the “response demon-
strates a thorough understanding” of the key concept in the task and that the response correctly
lists, describes, explains, or identifies biology features based on the task demands.
The challenge in these rubrics is in determining whether the cognitive demands are simi-
lar across tasks. It is unclear whether the tasks of describing, explaining, or identifying are
similar and, if they are different, if they are similarly weighted across tasks. Where the rubric
expects the students to identify one disadvantage correctly, does this imply a description of the
disadvantage or an explanation of why it is a disadvantage? Is the task explain what the data show similar to the task describe the function of? Terms such as describe and explain are often used
interchangeably, but may represent different cognitive demands and expectations for students,
teachers, and raters.
A potential improvement in this set of rubrics would be a careful balancing of the tasks, so that
each task requires students to list or identify features of biological phenomena and describe some
feature or characteristic specific to the task with an explanation of its role, importance, origin,
or whatever fits the focus of the task. This provides a framework to ensure similar cognitive-demand expectations across scoring rules for similar tasks, so that the scoring provides exchange-
able evidence of knowledge, skills, and abilities whatever the specific task. With the clarification
and systematic item scoring guidance provided here, examples provided throughout this chapter
can be used to improve similar efforts. Massachusetts has many excellent examples moving in
these directions, including the score descriptors in Figure 12.21. Here we see important consist-
ency within a common score point across tasks in different standards categories.
We can only know how students will interpret and respond to a task by examining a sample of their responses.
If students are responding in ways that were not anticipated, there could be misinterpretation
of the task or prompt. By reviewing the task or prompt and the materials used during instruc-
tion or instructions provided to the test taker, the origins of the misunderstanding may become
clearer. Are the instructions explicit? Is the task clear? Are practice materials inappropriate or
misleading? When reviewing a sample of responses, the diversity of those responses might also serve as an indicator of task ambiguity. The SMEs' scoring of this small sample establishes a basis for training raters as well as a basis for monitoring rater performance at the beginning of and throughout the scoring.
Once the scoring has begun, the scoring guide should remain consistent from paper to paper
and from rater to rater. If a change in scoring guide needs to be made in the midst of scoring, the
entire set of papers must be rescored to maintain consistency.
The NAEP test item-scoring process involves four stages: (a) develop scoring guides, (b) score
pilot tasks, (c) score the results of the first operational administration, and (d) score ongoing
operational administrations. During the second stage of scoring pilot tasks, content experts
check students’ papers against the scoring guides and go through a process of refining the scoring
guides. These are then reviewed by the NAEP contractors and expert content committees (http://
nces.ed.gov/nationsreportcard/contracts/pdf/NAEPScoring.pdf). This process occurs again dur-
ing the third stage, before formal scoring occurs in operational administration of the NAEP.
These reviews are done to ensure that the scoring guides make appropriate distinctions among
levels of performance and that scores can be assigned objectively, consistently, and accurately.
Scoring Process
6. Qualify raters.
Raters should meet qualifications before being selected for operational rating in any testing pro-
gram. There are three types of qualifications that should be monitored and met, including initial
selection criteria, training and evaluation (Scoring Process guideline 7), and operational-scoring
monitoring (Scoring Process guideline 11). All three types produce evidence that supports the
argument for valid score interpretation. Such documentation should be summarized in the rel-
evant technical manual in a section on rater qualification. A nomination process could be used to identify appropriate individuals for the task of scoring CR responses, employing school systems, higher education institutions, states, or related professional organizations to identify
candidates. If the raters are scoring a credentialing examination, the raters should have qualifica-
tions showing advanced competence in the area being tested.
The first stage of rater qualification is defining the essential characteristics of qualified raters.
These typically involve relevant background, training or level of education, and substantive exper-
tise or experience in the field. These raters are truly SMEs. This recruitment typically involves
individuals who also have direct experience with the intended population of test takers. Such
experience supports the ability of raters to discern the correctness of responses, as a response to
a CR item includes elements of the task requirements and characteristics of the test taker. There
also may be a need to secure a certain level of representativeness to support fairness goals, includ-
ing geographic representativeness, and representation of various personal characteristics includ-
ing gender, race, and ethnicity, and perhaps language background.
7. Train raters.
The second stage is training. Training must be completed to familiarize raters with the test pur-
pose, content, and formats. The type of item to be scored must be reviewed, along with review of and practice with sample items and the scoring guides. In most high-stakes settings, a
qualification evaluation must be completed, which may include a test of the ability of the rater
to use the scoring guides as intended. This involves a complete assessment of the rater’s ability
to assign appropriate scores to model responses. The focus of the training is on the accurate and
consistent application of the scoring guides.
In most if not all standardized testing programs involving CRSS items, raters are required to
attend scoring sessions and pass a structured training program that focuses on rater consistency
and bias. For instance, the scoring of teacher portfolios for the National Board of Professional
Teaching Standards certification of accomplished teaching has such a program. The training
provides a set of tools to help raters identify and uncover personal biases so that they can be
monitored and controlled during the scoring process. This includes identifying our own personal
preferences and expectations so that they can be separated from the intentions and demands of
the task and the performance standards set forth in the scoring guide.
Does training make a difference? Extensive training of raters can lessen tendencies toward inconsistent and biased scoring, but it cannot eliminate these threats to validity. One study showed
that threats to validity can be detected statistically and training can lessen the tendency to rate
with bias (Wolfe & McVay, 2010). Haladyna and Olsen (submitted for publication) reviewed the
extensive research on training effectiveness for scoring CRSS items, focusing specifically on writ-
ing performance tests. They found a long history of research showing that training is effective in increasing rater consistency and improving accuracy, but that other remedies are needed because training alone is seldom sufficient. Many professional examination boards have
extensive training programs for raters (e.g., Western Regional Licensing Board, 2011, for dental
and dental hygienist licensing performance testing).
8. Rate consistently.
Consistent rating is necessary in all testing programs where CRSS items are used. There is a direct
relationship between consistent rating and reliability. Generalizability theory is a useful frame-
work for studying rater consistency (Brennan, 2001). Chapter 18 presents elementary statistical
concepts for the study of rater consistency. In this section, we discuss some important concepts
associated with rater consistency including intra-rater and inter-rater consistency. The goal of
consistent rating is to reduce random error, which is related to reliability. If two raters’ ratings are
used to compute a total score, consistency directly affects reliability. The example below in Figure
12.22 shows that for two raters, the reliability estimate for scores increases as a function of this
agreement (correlation) between two raters (based on the Spearman-Brown formula):
Figure 12.22 Predicted reliability estimates based on two raters and the correlation between ratings.
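The relation illustrated in Figure 12.22 follows directly from the Spearman-Brown formula. As a rough sketch (in Python, using illustrative correlation values of our own rather than the specific values in the figure), the predicted reliability of the average of two ratings can be computed as follows:

def spearman_brown(r_single, k=2):
    # Predicted reliability of the composite of k parallel ratings,
    # given the reliability (inter-rater correlation) of a single rating.
    return (k * r_single) / (1 + (k - 1) * r_single)

# Illustrative values only: predicted two-rater reliability rises with agreement.
for r in (0.5, 0.6, 0.7, 0.8, 0.9):
    print("correlation = %.1f -> predicted reliability = %.2f" % (r, spearman_brown(r)))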
Generally speaking, at least two raters constitute a basis for score reliability. More raters are
very desirable to increase reliability, but the expense of adding more raters is usually very high.
Here we introduce the concepts of intra-rater consistency and inter-rater agreement. Many more
approaches to these issues are described in chapter 18.
Intra-rater Consistency The primary goal of rater training is to build intra-rater consistency
in the use of a rubric. Can a rater score the same response consistently? This type of consistency
includes agreement with expert ratings of pre-scored responses and agreement with ratings on
the same response over time. Once competence in intra-rater consistency is achieved, production
scoring should begin.
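One simple way to quantify agreement with expert ratings of pre-scored responses is an exact-agreement rate. The sketch below is only illustrative; the data layout and the score values are our own assumptions, and operational programs typically rely on the richer indices discussed in chapter 18.

def exact_agreement_rate(rater_scores, expert_scores):
    # Proportion of pre-scored (expert-rated) responses on which the rater's
    # score exactly matches the expert score.
    matches = sum(r == e for r, e in zip(rater_scores, expert_scores))
    return matches / len(expert_scores)

# Hypothetical check of one rater against ten pre-scored responses.
print(exact_agreement_rate([3, 4, 2, 4, 1, 3, 4, 2, 3, 4],
                           [3, 4, 2, 3, 1, 3, 4, 2, 2, 4]))  # 0.8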
DeCarlo (2005) presented an approach to model rater behavior based on a latent class exten-
sion to signal detection theory (SDT), which addresses the cognitive process underlying rat-
ing behavior. The model also includes measures of rater precision and classification accuracy,
as applied to essay grading. He and others challenge the notion that a continuous quantitative characteristic underlies ratings, arguing instead that the underlying categories are qualitatively ordered. SDT considers the
function of each rater as discriminating between latent classes of essays, such that the model
can separate the rater’s perception of essay quality from the rater’s use of response criteria. In a
four-point rubric scoring of college essays, DeCarlo found considerable variability in rater pre-
cision, where at least two of the eight raters appeared to employ different response criteria. In
additional comparisons, he found the latent class and latent trait models to perform similarly.
In 2011, DeCarlo, Kim, and Johnson introduced an extension to the earlier SDT model, through
a hierarchical model, where rater scores are indicators of essay quality and the (latent) essay qual-
ity is an indicator of test taker proficiency. The latent class SDT model can provide information
about rater precision and other rater effects. Such a model also accounts for the dependence created when common raters score an item across test takers; for example, one set of raters scores the first CR item and a different set of raters scores the second CR item. When applied to large-scale tests requiring written essays, they showed how a typical item response theory model confounds rater effects with item effects. Their hierarchical rater model separates rater and item effects so that rater behavior can be evaluated independently of items.
Congdon and McQueen (2000) reviewed the rater consistency research and found that intra-
rater consistency was difficult to attain. They used a state writing test with a six-point rubric and
two ratings per response with 16 raters across responses to examine rater consistency. Using a
uniform training program and scoring routine, they found significant rater variation on each of the
nine days of scoring, with more variation early in the scoring. It took three days of training to sta-
bilize this day-to-day variation. One aspect of this problem is drift, which is discussed in chapter
19 and represents a threat to validity.
Inter-Rater Agreement Here the concern is agreement among two or more raters rating the
same responses. As predicted from generalizability theory, reliability depends on how well mul-
tiple raters agree on ratings for task performances. This type of rater consistency is often indexed
by a product-moment correlation between a pair of raters or an index that can be used for more
than two raters. Generally, this index is a form of coefficient alpha, which is a measure of internal
consistency. More about this topic is presented in chapter 18. Inter-rater consistency may be
important to monitor to secure uniformity in scoring both concurrently and over time.
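As a minimal sketch of such an index, the following Python fragment computes coefficient alpha while treating each rater as an "item." The ratings matrix is hypothetical, and an operational analysis would follow the fuller treatment in chapter 18.

import numpy as np

def coefficient_alpha(ratings):
    # ratings: rows = test takers, columns = raters.
    ratings = np.asarray(ratings, dtype=float)
    k = ratings.shape[1]                          # number of raters
    rater_vars = ratings.var(axis=0, ddof=1)      # variance of each rater's scores
    total_var = ratings.sum(axis=1).var(ddof=1)   # variance of the summed scores
    return (k / (k - 1)) * (1 - rater_vars.sum() / total_var)

# Hypothetical ratings from three raters on five responses.
print(round(coefficient_alpha([[4, 4, 3], [2, 3, 2], [5, 5, 5], [3, 3, 4], [1, 2, 1]]), 2))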
Improving Rater Consistency There are several things we can do to train raters to be more con-
sistent in rating practices. Consider the case of the rating of essays. This is a significant concern
because raters might be influenced by the first few responses they read, either negatively or posi-
tively. Some things we can do to maintain consistency in rating include continuously checking
scores against the rubric and against papers already graded. Another step is to periodically rescore previously scored papers. These monitoring procedures help secure evidence of consistent
and appropriate rating, particularly over time.
When possible, raters should read all responses to a given item without interruption. Variation
in scoring can occur over time, particularly when scoring extends over long periods or involves long delays. However, providing short breaks during a scoring period is important, both to refresh
thinking and to avoid fatigue. When more than one day is needed for scoring, the rater should
reread earlier scored responses to reaffirm conceptualization of the rating standards.
9. Minimize bias.
Bias is a natural product of experience and preference, a part of being human. Bias is also part of
the lens we use to view and interact with the world. In part, bias is a component of those things
that are important to us, disagreements we have with others, and how we judge our own conduct
and that of others. Bias becomes a problem when it interferes with the application of a specified
standard or criteria to evaluate performance—particularly when the standard or performance
criteria are different from what we would employ. We need to set aside our biases when it comes
to scoring the responses and performance of others to support intended inferences and uses of
rating results and scores.
Statistically, bias concerns systematic error. In rating responses to CRSS items, any test taker's score based on responses to a single CRSS item or a set of CRSS items depends on three components: the quality of the responses themselves (the test taker's true level of performance), systematic error introduced by the rater or the scoring process (bias), and random error.
Thus, validity studies are intended to search for bias because it is a threat to validity. When
detected, it should be eliminated or reduced significantly.
Types of Systematic Rating Bias Raters can commit any of a family of systematic errors that
bias test scores. These errors include severity/leniency, central tendency, halo, idiosyncrasy, and
logical. These errors are well documented (Engelhard, 2002; Hoyt, 2000; Myford & Wolfe, 2003).
Severity/leniency is potentially the most troublesome of these rater effects. For instance, Engelhard
and Myford (2003) reported that if scores had been adjusted for severity, about 30% of student
essays would have received higher scores. For many students whose scores are near the cut point
and have severe raters, the decision to adjust scores is consequential. In one study where raters’
tendencies were studied over years, Fitzpatrick, Ercikan, and Wen (1998) found raters to become
more consistent, but rater severity seemed constant. Another related error is drift in rating over
time, as raters may become increasingly severe or lenient (Hoskens & Wilson, 2001). Many
researchers have recommended that when rater effects are detected, scores should be adjusted to
eliminate CIV (Braun, 1988; Fitzpatrick et al., 1998; Houston, Raymond, & Svec, 1991; Longford,
1994). Despite a well-developed technology for adjusting scores, we have no evidence that rater
effects are studied in any writing testing program or whether scores are adjusted to correct for
these sources of CIV.
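A crude way to screen for severity or leniency, offered here only as a sketch under the assumption that raters have scored comparable sets of responses, is to compare each rater's mean awarded score with the overall mean; the modeling and score-adjustment approaches cited above are far more refined.

from collections import defaultdict

def rater_severity(ratings):
    # ratings: iterable of (rater_id, score) pairs.
    # Returns each rater's mean score minus the overall mean; negative values
    # suggest severity, positive values leniency.
    ratings = list(ratings)
    by_rater = defaultdict(list)
    for rater, score in ratings:
        by_rater[rater].append(score)
    overall = sum(score for _, score in ratings) / len(ratings)
    return {rater: sum(s) / len(s) - overall for rater, s in by_rater.items()}

# Hypothetical ratings from three raters.
print(rater_severity([("A", 3), ("A", 2), ("A", 3),
                      ("B", 4), ("B", 5), ("B", 4),
                      ("C", 3), ("C", 4), ("C", 3)]))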
Early research on halo effects suggests that such effects can be reduced through training
(e.g., Bernardin & Walter, 1977). In a carefully controlled study of eighth-grade mathematics
responses, employing two forms of training, including self-trained (received sample responses
with generic rubrics) and director-trained (received direct training, item-specific rubrics, super-
vised practice), Ridge (2000) found no main effect due to the form of training and no significant
halo effect on a series of five items with responses designed to elicit a halo effect on the fifth item.
However, the self-trained raters were less accurate in scoring the fifth item compared to the direc-
tor-trained raters.
Combating Bias In some circumstances, knowing the person producing a response for a CRSS
item might bias scoring. If at all possible, raters should not know who provided a response. In a
public performance, such as acting or musical performance, knowing the performer may influ-
ence the outcome. As much as is possible, scorers should not know the person performing. In oral
examinations, this is unavoidable. In cases where the rater knows the performer, raters should be
removed from rating this person to prevent bias.
The order of scoring is often cited as critical. If multiple performances are scored at one time,
or two or more analytic traits are being scored on a performance, raters should score responses to
one item or one trait only before scoring other items or other traits. Moving from item to item or
trait to trait invites systematic error. With trait-to-trait scoring, halo bias is the usual result: the correlation among scores on the analytic traits is usually very high, with no distinction among what the individual traits measure.
When responses are complex or more than a few words, it helps to rate only one trait at a
time. This reduces the potential of a halo bias, where the quality of responses to other questions
influences the rater’s evaluation of the response quality to a later item. This also helps the rater to
become completely familiar with the scoring criteria for the item or task being rated. It is impor-
tant to maintain a clear picture of the standards or objectives being addressed. Another option is
to record ratings on a separate sheet to avoid being influenced by seeing the ratings of other items
on the same page.
Score Resolution Score resolution is a widely used testing industry practice. When two raters
disagree by more than one point, a referee’s judgment is used to reconcile the difference
(Johnson, Penny, & Gordon, 2000; Johnson, Penny, Fischer, & Kuhs, 2003). Score resolution is a
very important way to reduce random error and improve rater consistency—which contributes
to increasing reliability. However, if two raters are very severe or very lenient and as a result they
agree, score resolution will not detect this systematic error. Although score resolution is very
useful and should be practiced, the detection of severity and leniency and its correction is also
important. Score resolution does not resolve all rater scoring problems.
Resolving the Threat of Rater Bias—A Summary Human subjective scoring is the testing
industry standard for high-stakes scoring. Computers are being programmed to score more complex CRSS responses, as described earlier in the section on automated scoring. Although score resolution is strongly recommended when multiple scores on a single response differ, it is not a solution for rater bias. Multiple ratings of responses and monitoring are two important
remedies for eliminating or reducing rater bias. Adjusting scores is also a remedy for bias when
it is severe.
10. Obtain multiple ratings and use the average as the final score.
Because of the subjectivity inherent in some forms of CR response scoring, which leads to reduced
consistency in scoring, a useful recommendation is to include two or more independent ratings
of the response. Two different raters may see different forms of evidence to support the intended
task demands and provide different scores. This may be due, in part, to rater bias, described
in guideline 9 above. In most large-scale testing programs where CRSS items are used, scoring
involves two raters, in large part to avoid the bias possibly introduced by a single rater. When
two scores are within one point, the standard is to take the average. The average of the independ-
ent scores is more stable than a single score. This is particularly important for high-stakes tests,
where the scoring decision is crucial or may have a significant impact on decision making for the
individual or a group. When raters differ by more than one point, or some other criterion set by
the testing program, a third rater or adjudicator will score the CRSS item and that score will be
used.
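The combination rule described above can be expressed as a short sketch. The one-point threshold and the use of the adjudicator's score follow the general practice described here; individual programs set their own criteria, so the details below are assumptions.

def resolve_score(rating1, rating2, adjudicator=None, threshold=1):
    # Average the two independent ratings when they are within the threshold;
    # otherwise use the adjudicating (third) rating.
    if abs(rating1 - rating2) <= threshold:
        return (rating1 + rating2) / 2
    if adjudicator is None:
        raise ValueError("Ratings differ by more than the threshold; an adjudicating rating is required.")
    return float(adjudicator)

print(resolve_score(4, 5))                  # 4.5: within one point, so average
print(resolve_score(2, 5, adjudicator=4))   # 4.0: adjudicated score is used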
In some large-scale settings, computer-based scoring systems have been developed to support
this goal. Before August 2011 in the GRE general test, the automated scoring engine e-rater was
used as a confirmatory rater for the analytical writing test (described in the section on automated
scoring). If the e-rater score differed from the human score by more than a set amount, the response was scored by a second human rater. The resulting score was the average of the two human scores (e-rater scores were not used to compute final scores). The GRE revised general test employs two human
raters for each essay. Once the automated scoring models can be trained for the revised general
test, e-rater will again be used. In other programs, such as TOEFL, also from the Educational
Testing Service, the score produced by the automated scoring engine is averaged with a human
score to compute the final score. The SAT writing test is scored by two readers on a scale of 1 to
6 and the two scores are summed. If the two readers differ by more than one point, a third reader
scores the essay. In the Law School Admissions Test (LSAT) of the Law School Admissions
Council, a writing sample is obtained during the exam. The writing sample is not scored, but
digitally imaged and sent to admissions offices with LSAT scores for inclusion in applications to
law schools. The Graduate Management Admissions Test analytical writing essays are scored by a human rater and by an automated essay-scoring engine. The two scores are averaged. If the scores differ
by more than one point, a second human rater resolves the score.
Administration Considerations
Although we focus primarily on item development and validation in this book, there are many
other considerations to be addressed regarding the scoring of CR items. These include admin-
istrative issues, such as the methods used for securing scores, including distributed, regional, or
local scoring. Distributed scoring is often accomplished by allowing scorers to access responses
online from anywhere, which may promote greater inclusion of participants across a larger geo-
graphic region and can be adapted for any form of a test. Regional scoring is accomplished by
establishing regional centers where scorers gather for the process, which is more common when
intensive and in-person training is desired. Regional scoring can also accommodate any form of
a test, but generally costs more and may restrict participation to those who can travel. Local scor-
ing is generally thought of as scoring within schools or even within classrooms. This is an option
for low stakes tests and generally a positive experience for teachers where the test is intended for
formative purposes, giving teachers immediate feedback for use in instruction. However, local
scoring is unlikely to secure uniformity in scoring and may be subject to more bias. Thus, local
scoring is not useful for high-stakes testing.
With scoring in a school or school district, teachers are the most able candidates for rating stu-
dent work. There are different models here that allow for different needs and desired outcomes.
Teachers can rate the work of their own students, which is most helpful to support formative use of test results but may require some training or monitoring to secure fairness. Teachers can rate the work of other students or a random anonymous sample of work, which helps alleviate some risk
of bias where teachers may see a wider range of performance. There are also moderated forms
of scoring, where teachers moderate their scoring through a review or double-scoring process, or even where teacher scores are audited randomly by professional raters, although this is more costly and requires a higher level of management. Wherever teachers are involved in
scoring, practice has shown such involvement to be a valuable and appreciated form of profes-
sional development.
Summary
This chapter describes scoring procedures for CR tasks. The scoring of CROS items is straightforward and received little treatment here. The scoring of CRSS items is complex and challenging. The goal is to support intended inferences through highly effective CR item design and scoring that is consistent with the item intent. The rubric is a very important part of the CRSS item because it clarifies the task demand. Effective scoring of CRSS items requires
a highly coordinated effort that addresses content concerns, scoring guide development, and a
highly effective scoring process.
IV
Unique Applications for Selected-Response and
Constructed-Response Formats
13
Developing Items to Measure Writing Ability
Overview
Writing is a primary cognitive ability. Like reading and speaking, writing is used in performing
tasks in other cognitive abilities, such as mathematical and scientific problem-solving, critical
thinking, and creative abilities. Writing is present in all educational settings, including elementary and secondary schools, colleges and universities, graduate education, professional education, and even in the professions as part of licensing requirements. Writing is also required in college admissions tests and post-secondary testing programs, including the ACT Assessment, the SAT and GRE (of ETS), the National Assessment of Educational Progress (NAEP), the Graduate Management Admissions Test, and the Law School Admissions Test. In recent years, graduation
testing, requirements for accountability in No Child Left Behind legislation, and other high-
stakes uses of test scores have proliferated. Writing is also part of everyday life for most of us.
Because of writing’s importance, this chapter is devoted to developing and validating test items
that measure writing ability.
The measurement of writing ability has many challenges to overcome to achieve a high degree
of validity (Haladyna & Olsen, submitted for publication). Foremost is that a definition of the
construct of writing is essential. We have many choices of item formats and test designs. Then
there is the issue of scoring the response to the item. Although this chapter does not consider test
design, some caveats are warranted in the presentation of item formats and scoring methods for
test items that measure writing ability. This chapter provides a comprehensive examination of
item development and validation for measuring writing ability.
This chapter is organized as follows. First a definition of writing is provided as a basis for item
development and validation. The second section deals with the more important constructed-
response (CR) item format that includes what we call prompts and the rubric(s). Sometimes a
single, holistic rubric is used, and, in other instances, several rubrics are used. The third section is
admittedly less important but presents selected-response (SR) items that are often used in writ-
ing testing programs. The rationale for and against the use of SR items is presented.
The open-class capacity of writing refers to a creative/content aspect. Some test developers have emphasized a
six-trait model for writing (Spandel, 2001). In a carefully controlled experiment, Shaefer, Gagne,
and Lissitz (2005) found that raters of student essays can distinguish between content and writ-
ing ability. So the planning and measurement of both open and closed-class capacities of writing
seem feasible.
As stated in the first chapter, most educational and psychological constructs are cognitive
abilities (Lohman, 1993; Messick, 1989; Sternberg, 1998). Writing is a primary cognitive ability.
Considering cognitive learning theory as a basis for learning, writing develops slowly over a lifetime via school and out-of-school experiences. Knowledge and skills play a supportive role
in the development of writing, but the primary evidence of the quality of writing comes from
samples of the individual’s writing. Paul Diederich (1974, p. 1) stated: “As a test of writing ability,
no test is as convincing to teachers of English … as actual samples of each student’s writing....” The
quintessential writing test elicits a sample of writing. The content of writing can be very explicitly
stated using the concept that an ability can be best operationalized by defining a domain of tasks
(Kane, 2006a, 2006b). The target domain is the complete collection of tasks that a writer could
perform. These tasks can range from writing a text message on a cell phone to organizing and
writing a three-volume compendium on a complex topic. Many of these tasks are unsuitable for a
writing performance test, so Kane suggested the development of a universe of generalization that
includes writing tasks that are suitable for testing for appropriate audiences. No such domain yet
exists in reality, but in theory such a domain would enable writing to be more validly measured,
interpreted, and used. The main idea of such a domain of writing tasks is that a writing test must
be a representative sample from that domain.
Another aspect of writing is motivation. Not only must test takers be capable of writing, they
must see the value in writing and want to write effectively. Any assessment of a test taker’s writing
ability should consider this emotional component. Therefore, a source of construct-relevant vari-
ance may be how a test taker views a writing task: interesting or relevant or not. A related issue is
whether students are assigned or choose a prompt. Pros and cons of this issue are a worthy topic
in writing test design (see Allen, Holland, & Thayer, 2005; Wainer & Thissen, 1994; Wainer,
Wang, & Thissen, 1994).
The Joint Task Force of the International Reading Association and the National Council of
Teachers of English (2009, pp. 16–18) recognized that education has shifted from knowledge
transmission to higher-level thinking processes. Writing has its basis in the domain of writing
tasks. Their fourth assessment standard addresses a need for critical inquiry. Unfortunately, in a
survey of state high school writing programs, Jeffery (2009) found that only 4% of the prompts
she reviewed required higher-level thinking. This definition of writing prospectively includes the consideration of higher cognitive demands in prompts, even though current writing testing programs do not emphasize such demands.
Prompts
The prompt might be a single sentence or as long as a paragraph. A good prompt will indicate a process to follow, give a reasonable time limit, and suggest the extent of the response: short, medium, or extended. Often, the CR writing test will present a series of instructions to the student. Space limitations prevent publishing examples here, but Table 13.2 offers five websites that provide good examples of the accompanying material for a prompt. As shown there and in most
states, test administration instructions and student instructions for the writing prompt are quite
detailed and thorough.
The prompt is a brief statement consisting of a single sentence or a short paragraph that elicits
a written response from a student.
What is the nicest thing anyone has ever done for me?
If I were the teacher, what is the first thing I would do?
What is the coolest technology I know?
Describe a perfect day in your life.
As shown in Figure 13.1, the prompt can be presented as a question or it can be a declarative
statement as an instruction. All prompts are presented in a test with administrator and student
instructions as Table 13.2 shows.
As noted previously, we have extended prompts and short-answer prompts. The distinction
between these two is very subjective. For the most part, large-scale testing programs prefer a
single extended-essay prompt. The National Assessment of Educational Progress (NAEP) offers
both types. For purposes of balance, both are featured in this section. A taxonomy of prompts is
presented in Table 13.3 based on research by Jeffery (2009). Examples and discussion are pro-
vided for each of the prompt modes.
Table 13.3 A Taxonomy of Prompt Modes
Mode Name Brief Description
Persuasive A persuasive prompt supports a general proposition with concrete evidence, and the essay is intended
for a specific audience.
Argumentative An argumentative prompt calls for support or opposition without designating an audience. The prompt
calls for more abstract reasoning than the persuasive prompt.
Narrative A narrative prompt requires a student to share a personal experience.
Explanatory An explanatory prompt requires students to tell why something is so for that student.
Informative An informative prompt requires students to provide information on a topic or issue.
Analytic An analytic prompt requires critical thinking, as it has the highest cognitive demand.
Persuasive Prompt
This type of prompt is the most often used in state and national testing programs. The example in
Figure 13.2 was taken from the NAEP 1998 eighth-grade released items (Block W19, #1).
Many people think that students are not learning enough in school. They want to shorten
most school vacations and make students spend more of the year in school. Other people
think that lengthening the school year and shortening vacations is a bad idea because
students use their vacations to learn important things outside of school.
What is your opinion?
Write a letter to your school board either in favor of or against lengthening the school year.
Give specific reasons to support your opinion that will convince the school board to agree
with you.
Figure 13.2 Example item from 1998 NAEP Writing, Grade Eight, Block W19, #1.
The prompt calls for the writer to persuade the school board of the need for a longer or shorter
school year. The response to a persuasive prompt tries to convince an audience using concrete
evidence and a good argument. The successful essay will cite research, economic factors, and
other concrete reasons supporting the proposition.
Argumentative Prompt
The argumentative prompt does not have a specific audience and argues from abstraction. Principles may be invoked in support of the argument. The argumentative prompt may present both sides of an argument with an invitation for the test taker to choose a side. The argumentative prompt has a greater cognitive demand because it exacts from the writer multiple perspectives and reasons for and against. (See Figure 13.3.)
Should high school graduates take a year off before entering college?
Should we have more holidays and longer vacations?
Should tobacco products be outlawed?
Should all cars in a city with pollution be powered by electricity?
Should financial incentives be offered to high school students who perform well on stan-
dardized tests?
Narrative Prompt
A narrative prompt asks the test taker to write a true or imagined story. Such verbs as tell, describe,
or write are used to designate the narrative prompt. The writing is usually of a personal nature,
and sometimes the prompt will offer themes to stimulate the student’s writing. Although the
narrative prompt seems to emphasize the open aspect of writing, the scoring guide is critical to
tapping this characteristic of test taker writing. Typically, however, the open aspect of writing is not
considered in rubrics. (See Figure 13.4.)
Tell a story about a time or an event that you would like to remember.
Think about the one person who has helped you. Write a journal entry telling about how
that person helped you.
Choose a special event when you were a child. Describe the event.
Explanatory Prompt
Explanatory prompts require test takers to explain how or why something is so. An explanatory
prompt may seem like an argument, but it is more focused on explaining a reason than arguing.
One implication of allowing test takers choice in their stance toward a topic is that by choosing a prompt, the test taker might give the best possible example of writing. Explanatory prompts
frequently contain language signaling argumentative or persuasive demands (e.g., support your
ideas, illustrate your claims). (See Figure 13.5.)
Here is a quotation from an ancient Greek philosopher: “A word to the wise is sufficient.”
What does this saying mean? Write a one-page essay explaining the meaning. Use
examples.
Write a paragraph in which you describe your favorite season of the year and tell why it is
your favorite.
If you could go to any restaurant, which one would it be? What would you order?
How can your school be improved? Explain what that change should be.
Choose one fad or trend that is popular and explain why it is popular.
Persuasive and argumentative prompts require a position and the use of abstract or concrete
facts and principles supporting the writer’s point of view. With the explanatory prompt, the cog-
nitive demand is not as great. There is no right or wrong answer or winning point of view.
Informative Prompt
An informative prompt is similar to an explanatory prompt but differs in an important way. The
informative prompt involves specific information. Jeffery (2009) uses an example of an informa-
tive prompt that calls for procedures for learning to swim. Consider the example in Figure 13.6
from the 2007 NAEP.
Open the envelope labeled E that you have been given. Take out the letter from Rina and
read it. Rina, who wrote the letter, is coming to a school in America for the first time and
needs to know what a backpack is.
Write a letter back to Rina. In your letter, include a clear description of a backpack and
explain in detail what she should keep in it. Remember, the more information Rina has, the
better prepared she’ll be to start eighth grade.
Figure 13.6 Example item #1 from 2007 NAEP Writing, Grade Eight, Block W14.
Analytic Prompts
The analytic prompt has the highest cognitive demand. Presently, this kind of prompt seems
to be exclusively aimed at literary analysis, although this kind of writing could certainly be
applied to other subjects, such as history, economics, psychology, and science. According to
Jeffery’s survey of high school writing programs, this type of prompt is the least used (only
4% of the prompts she analyzed were judged to be analytic). In its current version, an analytic prompt includes a specific work of literature and instructions to the test taker to analyze the work in terms of its literary elements and connections to other literature. Sometimes a theme is proposed, such as
grandparents or the role of chance in one’s life, in reference to a literary work or several works.
(See Figure 13.7.)
You will have 45 minutes to write a response. Your essay should be no more than 1,000
words.
Often in works of literature, a character learns or discovers something that changes his or
her life. From a work of literature you have read in or out of school, select a character who
learns or discovers something that changes his or her life. In a well-developed composi-
tion, identify the character, describe what the character learns or discovers, and explain
how the discovery relates to the work as a whole.
The analytic prompt may be difficult to score and will require several rubrics that include
closed- and open-ended aspects of writing.
Short-Essay Prompt
The length of any writing performance should be expressed in the directions to the test taker.
Generally, a standard can be set for the ability level. For instance, an essay of 300 words or fewer might constitute the response to a short-essay prompt. Thus, the prompt alone does not designate the length of the response; the cover instructions instead provide the scope of the response. In all instances, the scoring guide will require the same elements of good writing: an introduction, a body, and a conclusion.
Literary-Essay Prompt
One type of writing is intended to measure reading comprehension or critical thinking related
to literary analysis. It is not intended to measure writing as a cognitive task or skill set. However,
the two constructs are closely related and a deficiency in writing ability might affect performance
on a reading comprehension test that involves writing ability. This kind of bias has been referred
to in the past as construct-irrelevant because writing ability may have a great influence on a test
taker’s reading comprehension score. An effective way to eliminate this bias is to score the essay
using a rubric for its intended purpose (reading comprehension/literary analysis) and to score
the essay again for writing using an appropriate writing rubric or set of rubrics.
However, if you were developing your own prompt, here are some steps that may help you
produce and validate the prompt.
Figure 13.8 provides an example taken from the Ohio Department of Education website.
In this activity, you will use the facts on the graphic organizer to write a report about but-
terflies. You may look back at the graphic organizer as much as necessary to complete
your report. Remember to use the Revising Checklist and the Editing Checklist at the end
of the writing activity to check your work.
It is also a good idea to indicate to the test taker how long a response is expected to be (CR item-writing guideline 6). Wordiness is a problem in scoring. If test takers know how long a response should be, wordiness might be less of a threat to validity. Also, test takers need to know how the essay will be scored (CR item-writing guideline 7). Giving the test taker an indication of how the essay will be scored, or the actual rubric, can be of considerable help in getting the test taker to perform at their best level.
Rubric
A rubric goes by several names. Measurement experts might use the term descriptive rating scale
because each point on the rating scale has some description. The term scoring guide is also used
to mean the same thing. Chapter 12 presents comprehensive guidance for scoring CR items for
a wide range of purposes. Since the assessment of writing is a unique challenge, which crosses
subject areas and purposes, we recognize the unique characteristics of rubrics for the measure-
ment of writing.
We have some interesting variations in rubrics for writing and typically experience a lack of
standardization. Moreover, we have some disturbing research to report that makes the task of
choosing or developing writing rubrics very problematic. Nonetheless, in this chapter we attempt
to bring some order to the business of defining and using a rubric to score writing performance.
As with the taxonomy of prompts, Jeffery (2009) has performed the same service for rubrics.
Table 13.4 below presents her taxonomy. A technology for developing such rubrics does not
exist. Future research needs to focus on which of the rubrics is most promising for advancing the
teaching, learning, and assessment of writing.
For the present, the science of rubrics remains in a simplified taxonomy where examples are
plentiful. One dimension contrasts holistic rubrics with analytic-trait rubrics. Another dimension contrasts generic with task-specific (genre mastery) rubrics.
Holistic Rubric
A holistic writing rubric captures the entire writing experience in a single rating. A good example
is a recently introduced holistic rubric for the Arizona statewide testing program. (See Figure
13.9.)
There are many variations of holistic rubrics for writing that cannot be presented here. A
search on the World Wide Web will identify hundreds of examples and advice on how to create
and use rubrics. The scientific basis for rubric design does not yet exist, but researchers are begin-
ning to explore the facets of rubric design and use (Haladyna & Olsen, submitted for publication;
Lane & Stone, 2006).
Analytic-Trait Rubric
1. Ideas/Content: Writer identifies the theme and gives appropriate supporting details that
develop the theme. Writing is clear, comprehensive, and well developed. Writing is appro-
priate to the audience.
2. Organization: The essay is organized logically. Paragraphs are also organized in a way
to connect paragraphs to provide a central meaning. The order of presentation is a key
issue with this trait. The structure of the essay should be appropriate for the topic and the
intended audience. An opening and closing statement should tie together the essay.
3. Voice: Voice of the essay can be formal or casual, distant or intimate. The voice depends
on the intended audience for the essay.
4. Word Choice: This trait measures the writer’s use of words and phrases in an effective way
to convey a message. The use of words should seem precise yet natural for the intended
audience.
5. Sentence Fluency: Sentence structure should be varied and effective for the intended audience. A rhythm is achieved by varying sentence length and structure.
6. Conventions: This trait concerns mechanics—spelling, capitalization, punctuation, gram-
mar, and paragraph breaks.
Analytic traits can be presented in many ways. A review of documents from state and national
testing programs and a search of the World Wide Web will produce hundreds of variations. Some of these variations can be very lengthy, such as one page per trait. An example of a single analytic-trait (word-choice) rubric from Arizona is found in Figure 13.10.
6 Words convey the intended message in an exceptionally interesting, precise, and nat-
ural way appropriate to audience and purpose. The writer employs a rich, broad range
of words, which have been carefully chosen and thoughtfully placed for impact. The
writing is characterized by (1) accurate, strong, specific words; powerful words ener-
gize the writing, (2) fresh, original expression; slang, if used, seems purposeful and is
effective, (3) vocabulary that is striking and varied, but that is natural and not overdone,
(4) ordinary words used in an unusual way, and (5) words that evoke strong images;
figurative language may be used.
5 Words convey the intended message in an interesting, precise, and natural way appro-
priate to audience and purpose. The writer employs a broad range of words which have
been carefully chosen and thoughtfully placed for impact. The writing is character-
ized by (1) accurate, specific words; word choices energize the writing, (2) fresh, vivid
expression; slang, if used, seems purposeful and is effective, (3) vocabulary that may
be striking and varied, but that is natural and not overdone, (4) ordinary words used in
an unusual way, (5) words that evoke clear images; figurative language may be used.
4 Words effectively convey the intended message. The writer employs a variety of words
that are functional and appropriate to audience and purpose. The writing is character-
ized by (1) words that work but do not particularly energize the writing, (2) expression
that is functional; however, slang, if used, does not seem purposeful and is not particu-
larly effective, (3) attempts at colorful language that may occasionally seem overdone,
(4) occasional overuse of technical language or jargon, and (5) rare experiments with lan-
guage; however, the writing may have some fine moments and generally avoid clichés.
3 Language is quite ordinary, lacking interest, precision and variety, or may be inap-
propriate to audience and purpose in places. The writer does not employ a variety of
words, producing a sort of “generic” paper filled with familiar words and phrases. The
writing is characterized by (1) words that work, but that rarely capture the reader’s
interest. (2) expression that seems mundane and general; (3) slang, if used, does not
seem purposeful and is not effective, (4) attempts at colorful language that seem over-
done or forced. (5) words that are accurate for the most part, although misused words
may occasionally appear, technical language or jargon may be overused or inappro-
priately used and (6) reliance on clichés and overused expressions.
2 Language is monotonous and/or misused, detracting from the meaning and impact.
The writing is characterized by (1) words that are colorless, flat or imprecise, (2) monot-
onous repetition or overwhelming reliance on worn expressions that repeatedly dis-
tract from the message and (3) images that are fuzzy or absent altogether.
1 The writing shows an extremely limited vocabulary or is so filled with misuses of words
that the meaning is obscured. Only the most general kind of message is communicated
because of vague or imprecise language. The writing is characterized by (1) general, vague
words that fail to communicate, (2) an extremely limited range of words, and (3) words
that simply do not fit the text; they seem imprecise, inadequate, or just plain wrong.
We note that Arizona no longer uses the six-trait scoring rubric for word choice for summative assessment, but encourages educators to use the analytic model for classroom purposes. The state now uses a holistic rubric for summative assessment (see the current rubrics at http://www.azed.gov/standards-development-assessment/aims/aims-writing/).
The task of evaluating student writing using six analytic traits with each trait entailing a single
page of description may seem daunting. Essay raters are chosen on the basis of their experience
and expertise and receive extensive training and monitoring for errant rating. Nonetheless, the
task of knowing a trait, analyzing student writing, and providing an accurate rating for each trait
is challenging.
A natural conclusion is that analytic and holistic rubrics seem to relate to different aspects of writing. The choice of a holistic or analytic-trait scoring guide seems to make a difference. Although more research is needed to illuminate the nature of the difference and its implications for scoring, the choice of rubric type seems to influence scoring. Thus, item development should make clear what the definition of writing is and how it applies to the rubric type.
Cognitive Processing
Researchers have been very curious about what goes on in the minds of raters of student writing
(e.g., DeRemer, 1998; Penny, 2003; Weigle, 1999; Wolfe, 1999). Penny (2003) provided a first-
person account of how he thinks his personality affects his scoring. A case study of three raters’
cognitive processing when scoring essays revealed distinctly different patterns of task elabora-
tions (DeRemer, 1998). A task elaboration is the way a rater approaches scoring a sample of
student writing. Although each rater formed a general impression of a student’s writing, that
impression was conceptualized in different ways as a function of a unique cognitive style. Moreo-
ver, rater agreement was low, and there was little commonality in task elaboration for each essay
encountered. Wolfe reviewed an extensive literature and detected different cognitive process-
ing in scoring narrative essays as a function of how proficient a rater was. This type of research
exposes another rater effect that has not been studied enough—what goes on in the mind of a
rater when reading an essay. The think-aloud procedure is one way to uncover these cognitive
processes. Until we know more, the scoring of responses to prompts carries a high degree of ambiguity about what exactly raters think when they rate an essay.
Rater Effects
Rater effects are systematic errors committed by raters that contribute CIV. These errors include
severity/leniency, central tendency, halo, idiosyncrasy, and logical. All these errors are well
documented (Engelhard, 2002; Hoyt, 2000; Myford & Wolfe, 2003). Severity/leniency is rater
overrating or underrating. This kind of error is the most troublesome of these rater effects. Engel-
hard and Myford (2003) reported that if scores had been adjusted for severity, about 30% of stu-
dent essays would have received higher scores. For many students whose scores are near the cut
point and have severe raters, the decision to adjust scores is consequential. In one study where
raters’ tendencies were studied over years, Fitzpatrick, Ercikan, and Wen (1998) found raters to
become more consistent, but rater severity was constant. Another related error is drift in rating
over time, as raters may become increasingly severe or lenient (Hoskens & Wilson, 2001). Many
researchers have recommended that when rater effects are detected, scores should be adjusted to
eliminate CIV (Braun, 1988; Fitzpatrick et al., 1998; Houston, Raymond, & Svec, 1991; Longford,
1994). Despite a well-developed technology for adjusting scores, we have no evidence that rater
effects are studied in any writing testing program or whether scores are adjusted to correct for
these sources of CIV.
Score Resolution
Score resolution is highly recommended and widely practiced in scoring essays. If two raters
disagree by more than one point, a referee’s judgment is used to reconcile the difference (John-
son, Penny, & Gordon, 2000; Johnson, Penny, Fischer, & Kuhs, 2003). Score resolution improves
rater consistency—which contributes to increasing reliability. However, if two raters are very
severe or very lenient and they agree, score resolution will not detect this problem. As we can see,
score resolution solves one problem but does not help with the detection and correction of rater
leniency or severity.
Training
Training of raters is a primary tool for improving the scoring of essays. Unfortunately, training
is not a failsafe remedy. One study raises an important issue in how essays are scored (Moon &
Hughes, 2002). A scoring anomaly aroused curiosity, and investigation showed that how papers were scored may have accounted for differences in test scores. Thus, not only are rater effects a potential
source of CIV, but how raters are trained to score papers may be another source. Weigle (1999)
found in her study differences between experienced and less experienced raters and with some
prompts, differences in rater severity could be reduced through effective training. She argued that
some prompts make greater demands on raters thus making the difference between experienced
and inexperienced raters less resistant to training. Historically, training has not had a profound
effect on rater effects (e.g., Bernardin, 1988; Borman, 1979). Nonetheless, effective training is one
remedy for improving scoring.
Wordiness
Essay length influences the rating a test taker receives. Are longer essays better or is verbosity a
source of undesired influence on the score a wordy writer receives? Powers (2005) reviewed the literature on the relationship between essay length and total score; his review gives ample evidence of a relation between essay length and rating. If this result is publicly known, students and their teachers could manipulate essay length and, by doing so, potentially increase their test scores.
However, there is also evidence that quality of the essay is correlated with essay length (Miller,
2003). If students or their teachers manipulate essay length to achieve a higher score, the scoring
of writing is subverted.
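As a minimal illustration of the kind of relation Powers reviewed, the correlation between essay length and holistic score can be computed directly. The numbers below are invented for illustration and are not data from any study.

import numpy as np

# Hypothetical data: essay length in words and the holistic score awarded.
lengths = np.array([120, 250, 310, 415, 530, 610])
scores = np.array([2, 3, 3, 4, 5, 5])

r = np.corrcoef(lengths, scores)[0, 1]  # Pearson correlation
print(round(r, 2))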
Scoring Errors
Scoring errors in large-scale writing testing programs have been reported. Quality
control in scoring a test taker’s response to a prompt is an ever present threat to validity. One very
positive remedy is to compare a test taker’s score on an essay with the body of work in writing
assembled previously. Major deviations in the pattern of achievement should signal the possibil-
ity of a scoring error.
Poor:
Which sentence has correct punctuation?
A. Do you know who did this painting
B. Do you know who did this painting!
C. Do you know who did this painting?
D. Do you know who did this painting.
Better:
Which of the following sentences are correctly written?
Mark A if correctly written and mark B if incorrectly written.
1. Mark done the painting.
2. Do you know who did this painting?
3. Do you know who was supposed to do this painting?
4. Elise was supposed to do this painting.
5. The paint was boughten at the hardware store.
6. The paint was easy to use.
A student is writing a letter requesting more healthful food choices in the school cafeteria.
Which is the best way to end this letter?
A. As obesity is a major health problem, we need more healthy food choices.
B. I really like healthy food.
C. Students will have to eat what is served so why not serve healthy food.
D. Healthy food choices can be very appealing and economical.
For the following sentences, identify the correct spelling of the word as it is used in the
sentence.
1. He {A. seized; B. siezed} the opportunity and quickly scored the winning point.
2. My {A. neighborhood; B. naborhood; C. nieghborhood} is in the south side of the city.
3. The car stopped running and it was stuck in the mud. It was not {A. movable; B.
moveable}.
4. This math problem is {A. too; B. two; C. to} hard.
5. Can you {A. alter; B. altor; C. altar} your plans for tomorrow?
Read the paragraph. It has mistakes that need to be corrected. Use the paragraph to
answer questions 1 through 3.
1 Meteors are rocks that fall from space. 2 Meteors look like streaks of light. 3 They can be
seen on clear, dark nights. 4 Meteors are also known as “shooting stars.” 5 Some meteors
are made of stone. 6 Others are made of iron. 7 When meteors enter Earth’s atmosphere,
they burn and glow. 8 Those large enough to reach the ground are called meteorites.
Summary
Writing is a slow-growing cognitive ability that involves knowledge, skills, motivation, and ways
to use knowledge and skills in complex ways to write for a specific audience. A specifiable domain
of writing tasks (called the universe of generalization) is the basis for test design. The quintessential
test item consists of a prompt and a scoring rubric. A caveat was presented about test
design: a single-item performance writing test is not a sufficient sample from the domain of
writing tasks. Only a portfolio of writing has the potential of adequately sampling from a domain
of writing tasks. Prompts exist in many modes with varying cognitive demands. Rubrics are scor-
ing guides (descriptive rating scales) used to score writing. We have arguments for and against
the use of the SR format for writing tests. If an SR format is desired, this chapter supplies examples
and references to other published work that provides examples of well-written SR items. Devel-
opers were encouraged to adopt and adapt items from the public domain as opposed to creating
prompts. With the SR formats, item modeling may be a very useful device for generating items
measuring writing knowledge and skills. Despite many shortcomings in the validity of writing
tests, a technology exists for creating validated writing test items in both CR and SR formats.
14
Developing Items for Professional
Credentialing
Overview
This chapter addresses item development and validation for any test used for a credentialing
purpose. By the term credentialing, we mean earning (a) a certificate indicating high achievement
in a profession or (b) a license to practice in a profession usually in a state.
The first part of this chapter concerns defining the construct of professional practice. A con-
tinuum of fidelity is introduced and explained; it is essential to understanding why cer-
tain test item formats are chosen for any credentialing testing program. The second part of this
chapter describes different item formats used in credentialing tests. This chapter features mostly
innovative and exemplary item formats and traditional formats that have been successful in the
past.
282
Developing Items for Professional Credentialing • 283
1. The content domain must be rooted in behavior with a generally accepted meaning.
2. The content domain must be defined unambiguously.
3. The content domain must be relevant to the purposes of measurement.
4. Qualified judges must agree that the domain has been adequately represented.
5. The response content must be reliably observed and evaluated. (Kane, 2006, p. 149)
It follows that a practice analysis should be published and disseminated, and its results used to
develop item and test specifications that motivate item development and validation. The com-
mittee of SMEs needs to be highly qualified, and their work should achieve a consensus about this
domain of tasks. The results of their work should be published for all to see. The published report
is an important source of validity evidence bearing on item quality.
Organizing Content
In chapter 3, we discussed the problem of organizing content and the limitations of the cognitive
taxonomy introduced by Bloom and his colleagues (Bloom et al., 1956). An increasingly popular
method used in the health science credentialing field is the Miller Pyramid (Miller, 1990) illus-
trated in Figure 14.1. The Bloom cognitive taxonomy categorizes cognitive behavior into six
distinct categories, whereas the Miller Pyramid has a simpler structure. Knowledge is the founda-
tion for performance. Knowing how to do something is necessary for the performance of any skill
or a complex task. Actually performing a task is the most accepted way to show competence as
judged by SMEs and the community of concerned parties involved in credentialing testing. At the
highest level in the pyramid is professional practice. The value of Miller’s Pyramid is that it shows
how the structure of cognitive behavior is organized for professional competence. This structure
is a natural lead-in to another important concept in item development and validation for items in
professional credentialing tests—fidelity.
Figure 14.1 The Miller Pyramid: Knowledge, Competence, Performance, and Action (bottom to top).
each problem. Or we could present a series of problems and ask the plumber what he or she
would do. Finally, at the lower end of fidelity, we might present a series of objectively-scored test
items that include knowledge and skills important to plumbing.
Once the construct of professional competence has been adequately defined via this target
domain, item development occurs to create the universe of generalization. Part Two of this
chapter presents outstanding, high-fidelity examples from credentialing testing programs.
Statistical methods for evaluating test score and test item responses are difficult to apply due to the lack of variability
in item responses.
As stated repeatedly, practice analysis is a necessary first step for developing item and test spec-
ifications and developing and validating items in these tests. Technical reports and independent
evaluations are also available on their websites for public review. These factors contribute to item
validity and validation.
As we can see, removing calculus from targeted teeth counts the most in the dental hygienist
examination. After the patient is treated, the three examiners examine the patient independ-
ently. The examiners must agree that the candidate has failed to remove the targeted
calculus before points are deducted. Five candidate errors result in a 30-point deduction, which
leads to automatic failure. As noted above, penalties can also be assessed using a separate
scoring model that also requires examiner consensus. Usually, agreement among examiners
is extremely high. Bias is also monitored and reports are provided to examiners. Examiners
exhibiting repeated inconsistency and/or bias are not retained. The development of items and
scoring protocols has evolved over many years, and the result is a well-organized and sophis-
ticated clinical performance testing experience leading to licensure decisions by states for the
successful candidates.
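A sketch of this consensus-based deduction logic follows. The error codes and the point values other than the 30-point automatic-failure threshold are hypothetical placeholders, and the 100-point scale is an assumption; the sketch only illustrates the two rules stated above: deductions require agreement among the three independent examiners, and a total deduction of 30 points means automatic failure.

def score_clinical_exam(examiner_findings, deduction_points, max_score=100, fail_threshold=30):
    """examiner_findings: a list of three sets, each holding the error codes
    recorded by one examiner. A deduction is applied only when all three
    examiners independently agree that the error occurred; a total deduction
    of fail_threshold points or more is an automatic failure.
    deduction_points: dict mapping error code to points deducted (hypothetical values)."""
    agreed_errors = set.intersection(*examiner_findings)
    total_deduction = sum(deduction_points[e] for e in agreed_errors)
    failed = total_deduction >= fail_threshold
    return max(max_score - total_deduction, 0), failed

# Hypothetical error codes and point values, for illustration only.
points = {"residual_calculus": 30, "tissue_trauma": 10, "missed_surface": 5}
findings = [{"tissue_trauma"}, {"tissue_trauma", "missed_surface"}, {"tissue_trauma"}]
print(score_clinical_exam(findings, points))  # (90, False): only the agreed-upon error counts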
Evaluation
The idea of having a professional person practice under supervision as a capstone test appears
to have the highest fidelity to the criterion—actual practice. In medicine and in dentistry, super-
vised practice in clinics is common in professional training and in internships and apprentice-
ships. However, a formal test is different. The sampling of tasks from the target domain should
represent the profession. In such tests, there is appropriate sampling; in internships and appren-
ticeships, there is no guarantee of adequate sampling from the domain of professional tasks.
Criticisms of this item format include the risk to patients that treatment might not be satisfac-
tory, that patients may present more or less challenging problems and thus tests may not provide
the same level of difficulty, and that examiners may be inconsistent or biased in their judgment.
Critics have also stated that live-patient examinations lack validity; this criticism is largely based on sur-
veys of graduates of dental schools who experienced this licensing process (Gerrow, Murphy,
Boyd, & Scott, 2006). The experience in Canada led its licensing authorities to move to a lower-fidelity clinical
examination with high reliability and objective scoring.
Noting these criticisms, the resolution of any debate about the best way to test for competence
is to ask the SMEs which type of test is most defensible after considering the arguments and
research pro and con. With respect to patient-based testing, these patients are closely monitored
and procedures are in place to protect patients. Difficulty is taken into consideration during
assessment of the patient’s treatment. Studies show that examiners are well trained and highly
consistent and accurate in their judgments.
History
Although the SP method has an earlier history, it became popular in the 1980s, and in the 1990s
the Medical Council of Canada was the first to use it in a licensure test. Barrows (1993) provided
a useful account of this history and the basis for the SP up to that date. Just before the turn of
the century, in 1998, the Clinical Skills Assessment Examination was introduced; it later became
part of the United States Medical Licensing Examination (USMLE). So not only do medical students in
the United States and Canada train using SPs, but they are tested for licensure in part with the
USMLE and the Medical Council of Canada’s Medical Licensing Test.
Since its introduction in medical education, the SP has grown phenomenally, and research
probing into its validity has also expanded. No review of this extensive body of research is
intended here. However, some highlights are presented that are informative. Most of the research
involves using the SP in medical education.
Research
The SP method has received considerable study over many years. This research involves pres-
entations of new developments and subtle variations in how SPs are trained and perform, and
how scoring is done—objectively or subjectively. Nearly all research bears on education/training
settings in many fields in the health and medical professions.
Correlations with other measures of competence. The most fundamental type of study with any
new method is to correlate measures with other like measures of competence. The reasoning
is that like measures of professional competence should be highly related, as indicators of the
same construct. Research by Stillman et al. (1991) epitomizes this type of study. SP scores were
correlated with other measures of competency during training and afterwards. Some correlation
research relates the SP with the Objective Structured Clinical Examination (OSCE) (Simon, Vol-
kan, Hamann, Duffey, & Fletcher, 2002). Generally, such correlations are positive but low. These
kinds of studies show that competence has many dimensions that are not necessarily highly cor-
related. Given that the measures of these dimensions may not be highly reliable, high corre-
lations may not be possible. Because of these conditions, correlation research on multidimen-
sional measures of competence may be futile. In some professions, with professional competence
being a cognitive ability, sub-abilities do not necessarily correlate highly. Research probing into
the structure of competence is needed that is more sophisticated than has been reported. This
research begins with a clear definition of competence and hypotheses about the structure of these
sub-abilities.
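The classical correction for attenuation makes this point concrete: under classical test theory, the observed correlation between two measures cannot exceed the square root of the product of their reliabilities. The short sketch below uses illustrative numbers, not values from the studies cited here.

def attenuated_r(true_r, rel_x, rel_y):
    """Observed correlation implied by classical test theory when two measures
    with reliabilities rel_x and rel_y have a true-score correlation of true_r."""
    return true_r * (rel_x * rel_y) ** 0.5

# Even a strong true relationship looks modest when both measures are unreliable.
print(attenuated_r(0.80, 0.60, 0.60))   # approximately 0.48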
Fidelity. In the judgment of SMEs, the SP method has high fidelity to the target domain. Thus,
content-related validity evidence might be considered very strong. All applications of SP,
whether high-stakes certification/licensing or training, employ CROS and CRSS item formats.
Most testing specialists will recommend CROS formats because they virtually eliminate exam-
iner inconsistency and bias in scoring, and reliability will be higher than with CRSS for-
mats measuring the same ability. However, as critics rightly point out, the objectively observed
behaviors appearing on a checklist may not have such high fidelity with the target domain.
Sometimes, the best approach is a task that requires a rating scale and subjective professional
judgment. The fidelity is higher but two threats to validity are rater inconsistency and bias.
This problem is not unique to SP. It applies to all CRSS item formats presented. Fortunately,
this issue has been studied extensively with the SP method. Cohen, Colliver, and Marcy (1996)
examined the properties of checklist and subjective ratings of performance in a SP application.
They also reviewed the research on this issue. Their findings, and those of others, are complex.
Checklists can be constructed that have high-fidelity items that are objectively scorable. How-
ever, rating scales involving subjective judgments also work well. In both instances, a tendency
exists to find trivial content that seems to fit a particular method (CRSS or CROS), but there is
dissatisfaction with the results. They concluded that SP subjective ratings can be more efficient
and reliable than objectively scored checklists for some aspects of competence. They also found
that global subjective ratings work quite well. The lesson to be learned here is that whether one uses
CROS or CRSS formats for the SP, test items must be well designed to reflect the target domain
realized in the practice analysis.
Evaluation
The SP program is highly successful in training in the health professions and medicine, espe-
cially where there is a doctor–patient interaction. The technique has research and development
to support its validity. Moreover, the format has been accepted and validated for use in a medical
licensing test. The SP format has also been successfully applied in the next examination format,
the OSCE. Technical reports that provide validity evidence would strengthen the argument for
using this format more extensively. It would be interesting to see if this format could be applied
to the non-health-related professions involving clients and customers, such as law, accountancy,
or social work. The SP offers valid assessment of clinical ability in a high-fidelity way. Examiner
consistency can be high and bias as a threat to validity can be controlled. Might other professions
and related credentialing testing programs benefit from this method?
Evaluation
The OSCE used in Canada for dental and medical licensing is an excellent example of blending
many principles of valid assessment of competence in a structured and objective way. The design
and validation efforts have resulted in a high-quality testing program that may be spreading to
other professional examination programs. The interesting blend of SPs and SR testing is part of
the uniqueness. The multiple perspectives found in the stations add to the evidence for good sampling
from the universe of generalization. The fidelity of tasks supports the argument that performance
at stations is related to the target domain. These characteristics constitute validity evidence.
Simulations
A simulation is a type of high-fidelity performance test that is supposed to model actual practice.
The SP technique is a type of simulation. However, according to Downing and Yudkowsky (2009),
the term simulation is used to connote a wide range of performances. For instance, they mention human
patient simulators and manikins and devices that mimic heart sounds. Sireci and Zeniskey (2006)
present examples of how computers are used to simulate complex thinking in many objectively
scorable formats. An effective way to illustrate simulations is to provide some examples.
One research study stands out in support of simulations over the oral examination (Savoldelli
et al., 2006). They reported that patient simulator scores were moderately correlated with oral exami-
nation scores, but interactions with the modality of testing suggested that the simulation was bet-
ter for assessing showing how, whereas the oral examination was better for assessing telling how.
Flight Simulation
One of the most common and well-known examples is flight simulation. The objective in this
kind of simulation is to recreate the conditions of flight in different aircraft realistically. Such
factors as air density, precipitation, cloud coverage and visibility, and turbulence can be pro-
grammed to vary, alone and in combination. Like SPs, flight simulations can be done as part of training
or for actual licensing. Flight simulations are offered for recreational purposes as well.
The software for flight simulation varies extensively from in-home low-fidelity simulation to
high-fidelity cockpit realism. A visit to the Federal Aviation Administration website (http://www.
faa.gov/) reveals the extensiveness of the role of simulation in flight training and certification.
Given the gravity of flight for commercial and military purposes, this is perhaps the most exten-
sive, pervasive, and well-researched type of simulation existing today.
NCARB
The National Council of Architectural Registration Boards (NCARB) sponsors the Architect Registra-
tion Examination (ARE), which is a national licensing test. NCARB claims that the test has high
fidelity with tasks that architects generally have to perform. The ARE results are accepted by 54
U.S. and Canadian architectural associations. Aside from answering SR-formatted items, candi-
dates must also participate in simulations presented on a computer. Candidates are allowed to
manipulate visual material and construct a detailed answer. Not only are tasks performed, but
judgment is required. This program has a total of seven tests, and each has to be passed in order
for a candidate to be eligible for licensure. Their website provides excellent examples of each test
in this examination program.
Vignettes provide a realistic problem that architects encounter. Each candidate must read the
vignette and study visual materials. The candidate is allowed to manipulate some visual material
via the computer and prepare an answer. Each response is scored by a well-trained rater. More
can be found at http://www.ncarb.org/~/media/Files/PDF/ARE-Exam-Guides/BDCS_Exam_
Guide.pdf
Medical Boards of the United States & the National Board of Medical Examiners, 2011, p. 23). For
a complete description of the Step 3 examination and specifically this simulation, see the USMLE
website: http://www.usmle.org/pdfs/step-3/2011content_step3.pdf
Evaluation
Simulations are an effective way to achieve high fidelity in a credentialing test. Simulations vary
from complex to simple and vary with respect to fidelity. With any simulation we usually have a
high development cost and long developmental period without any guarantee that it will work.
Thus, there is an investment that may or may not be realized. The examples provided in this sec-
tion show successful attempts at simulation. Technology and its continued growth provide hope
that simulations will increase in effectiveness and by that improve the credentialing examina-
tion process. The quest with simulations is a high-fidelity performance test with accurate and
consistent scoring. The tasks performed must be sampled from the universe of generalization
and resemble the target domain. As with any high-fidelity examination, professional judgment
of highly trained and well-qualified examiners using subjective rating scales is needed. In all of
these instances requiring professional judgment, we have the same threats to validity: unreliability
and the construct-irrelevant variance that may arise from subjective rating (severity/leniency, halo,
restriction of range, and central tendency).
Oral Examination
The oral examination in credentialing testing is a traditional approach by which examiners and
the candidate meet for a testing session. A very informative review of the oral examination his-
tory, criticism of this format, and its extensive use in credentialing was provided by Raymond
and Luciw-Dubas (2010). They reported that 14 of 24 medical specialties currently use the oral
examination format. This format has been used extensively in doctoral granting programs and
many other graduate school and professional training settings.
A credentialing board can use the oral examination in a variety of ways.
In the traditional oral examination for credentialing in medicine, the candidate is presented with
several patient problems. The candidate is asked a series of questions. The candidate may also
be supplied with patient information or may request that information, including patient history,
radiographs, and photographs. If the answers are inadequate, the examiners have the liberty to
probe to obtain more information. After the session, the examiners rate performance on descrip-
tive rating scales for traits they consider to be part of the target domain. There is a heavy reliance
on the professional judgment of the examiners. Independence of judgment of each examiner is
strongly recommended. However, in well-designed oral examinations, model answers are pro-
vided, and examiners are trained to follow the scoring guide and not to freelance. If examiners
discuss the case before rating, a consensus may be reached or bias introduced that is not
faithful to the candidate's true level of performance. In other words, there are many threats
to validity that are attendant in this process. Highly effective oral examination programs guard
against such threats. Several examples are also provided.
to show that not only is the item well designed but, also, can be used with other patient problems
using the same problem-solving steps.
This kind of examination is routine. Scoring involves a panel of examiners—all of whom are
highly qualified experts in ophthalmology. The scoring is subjective.
Evaluation
Evaluation and criticism of this examination item format have been offered for quite some time
as noted previously. The oral examination has lower fidelity than other formats presented thus
far in this chapter. Nonetheless, the oral examination has many assets: (a) good sampling of
patient cases, (b) reasonable reliability, and (c) reasonable fidelity with actual practice. However,
stating how to treat a patient and treating a patient are not necessarily highly related. In Miller’s
Pyramid, knowing and doing are related but distinctly different.
The oral examination format should continue to be used. Its validity is highly dependent on
conditions repeatedly discussed in this volume: conducting a practice analysis, developing a tar-
get domain, careful item development using CRSS or CROS format, whichever is appropriate,
selection of highly qualified examiners, effective examiner training, monitoring of examiners,
and validation. Chapter 18 provides greater detail on the diagnosis and treatment of threats to
validity related to examiner inconsistency and bias. There is no sound reason to abandon this
time-honored format. Of course, other formats presented in this chapter may have higher fidelity
to the target domain, but may have shortcomings that limit their use.
Portfolio
We have many types of portfolios used in a variety of ways (Haladyna, 1997). For example, the
showcase portfolio is used when applying for a professional position or establishing qualifications
for a special assignment. The showcase portfolio highlights the knowledge, skills, and abilities of
a person. This kind of portfolio shows a person’s breadth and depth of experience. An evaluation
portfolio is used in some high-stakes testing situations to measure a wide range of performances.
This kind of portfolio is created by the candidate but follows a structure and organization dic-
tated by the credentialing agency. Two examples are briefly described to show how these portfo-
lios are designed and used. Portfolios are also described in chapters 10 and 11.
Teacher Certification
As noted in previous chapters, the National Board for Professional Teaching Standards (NBPTS)
has been providing certification for accomplished teachers since 1987 (http://www.nbpts.org).
This Board provides a national voluntary program for teachers to receive their certification based
on each teacher’s professional accomplishments. Interestingly, each state chooses its teacher
licensing testing program: no single national program exists. So this certification is the only
national teacher testing program, but certification should not be confused with licensing. The
latter is the formidable barrier to teaching in a state, whereas the NBPTS is a certification recog-
nizing an accomplished (outstanding) teacher.
Each teacher seeking NBPTS certification who pays the application fee must submit four
entries and participate in an assessment center including six computer-based CR exercises. The
portfolio entries include:
Evaluation
The portfolio is a time-honored device for obtaining a more comprehensive description of a
candidate’s ability in a profession. However, it has threats to validity that should be recognized
and studied. A portfolio is an advocate’s biased view of competence. The person who submits a
portfolio for certification does not submit evidence that runs contrary to the goal of certification.
Also, as most portfolio scoring involves subjective rating, we have the same attendant threats to
validity: rater inconsistency and rater bias. Finally, reliability needs to be very high. The portfolio
would have to have many scored observations that are moderately intercorrelated to achieve suf-
ficiently high reliability.
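The Spearman-Brown prophecy formula shows just how many observations that can require. The sketch below is illustrative only; the average intercorrelation of .30 is an assumed value standing in for "moderately intercorrelated," not a figure from any particular portfolio program.

def spearman_brown(avg_r, k):
    """Projected reliability of a composite of k scored observations whose
    average intercorrelation is avg_r (Spearman-Brown prophecy formula)."""
    return k * avg_r / (1 + (k - 1) * avg_r)

# With moderately intercorrelated observations (average r = .30), a single
# observation is far too unreliable, and roughly 20 scored observations are
# needed before the composite approaches .90.
for k in (1, 5, 10, 20):
    print(k, round(spearman_brown(0.30, k), 2))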
An analyst gathered the following information about a company and that company’s
common stock:
Weighted average cost of capital 12%
Intrinsic value per share $30
Market value per share $40
The company’s investment projects for the year are expected to return 14 to 18 per-
cent. Which of the following best characterizes the company and the company’s common
stock?
A. Growth company and growth stock
B. Growth company and speculative stock
C. Speculative company and growth stock
D. Speculative company and speculative stock
Figure 14.5. A sample item from the 2007 Level 1 CFA Examination. Beginning in 2009, CFA Institute provides three answer choices for
selected response items on all CFA exams.
Source: https://www.cfainstitute.org
Copyright (2007), CFA Institute. Reproduced and republished with permission from CFA Institute. All rights reserved.
CFA items using the testlet format are among the most complex that anyone will encounter in
a credentialing test. Examples can be found on their website: http://www.cfainstitute.org/cfaprogram/courseofstudy/Documents/sample_level_II_itemset_questions.pdf
A typical testlet can have two pages of explanatory material as a vignette/scenario involving
a financial situation. Charts, tables, and graphs might be included. Then a series of SR items is
presented addressing aspects of the problem. As a service to candidates, answers are provided to
sample items with justifications for each correct answer.
A wife retained an attorney to advise her in negotiating a separation agreement with her
husband. Even though he knew that his wife was represented by the attorney, the hus-
band, who was not a lawyer, refused to obtain counsel and insisted on acting on his own
behalf throughout the protracted negotiations. The attorney never met or directly com-
municated in any way with the husband during the entire course of the negotiations. After
several months, the wife advised the attorney that the parties had reached agreement and
presented the attorney with the terms. The attorney then prepared a proposed agreement
that contained all of the agreed upon terms. The attorney mailed the proposed agreement
to the husband, with a cover letter stating the following:
“As you know, I have been retained by your wife to represent her in this matter. I enclose
two copies of the separation agreement negotiated by you and your wife. Please read it
and, if it meets with your approval, sign both copies before a notary and return them to
me. I will then have your wife sign them and furnish you with a fully executed copy.” Is the
attorney subject to discipline?
A. Yes, because the attorney did not suggest that the husband seek the advice of inde-
pendent counsel before signing the agreement.
B. Yes, because the attorney directly communicated with an unrepresented person.
C. No, because the attorney acted only as a scrivener.
D. No, because the attorney’s letter did not imply that the attorney was disinterested.
Evaluation
Many credentialing boards continue to employ high-fidelity SR item formats for the valid reason
that these objectively-scored formats are very effective in eliciting the cognitive demand desired
and also testing important concepts, principles, and procedures. Moreover, high reliability is
obtainable, which makes the accuracy of pass/fail decisions more defensible. Systematic error
that comes from subjective, human scoring is not present with this format. SMEs might correctly
argue that actual professional performance tasks are more desirable because of higher fidelity.
Often, as with the CFA Institute and the National Conference of Bar Examiners credentialing
examinations, the compromise to use SR items seems very effective.
Written Essay
Previously in this chapter, the oral examination format was presented and discussed. Despite some
shortcomings associated with the CRSS format, the oral examination has been very effectively
used as a high-fidelity testing method. The final item format in this chapter is the written essay.
An important distinction in essay writing for any credentialing examination is whether the
examining board wants to measure writing ability or a candidate’s ability to solve problems or
think critically in a professional setting. Specifically, the written essay might ask a candidate to
organize and make arguments for or against an issue, justify a choice, analyze a situation, or solve
a problem. A good example of essay testing is provided by the CFA Institute
(https://www.cfainstitute.org).
The Level III exam morning session presents 10 to 15 essay items of a complex structure to
candidates. A Council of Examiners Essay Style Guide (May 2006) provides guidance to essay item
writers on conventions followed, style, and adherence to the curriculum of the CFA Institute.
Each item has a specific structure that includes a preamble—which is basically a scenario/vignette
that specifies the problem or issue. Then the item writer is invited to use command words to
phrase the question to the candidate. Command words include identify, discuss, list, explain,
state, calculate, prepare, create, recommend, justify, use, and cite. Their item-writing guide has an
appendix of command words for different types of cognitive behavior. They also provide a check-
list for item writers to help determine the adequacy of each essay item. It is also the item writer’s
responsibility to write a model answer and a scoring model, which they call a grading key. This
document describes how partial scoring is conducted and how calculation errors are treated. It
also helps essay scorers evaluate inconsistent or illogical responses. An example of an essay ques-
tion from the 2009 examination year is presented.
This item fills two pages of the examination booklet and is one of the shorter examples that
can be shown. Such sophistication in item design and scoring is rarely seen in other testing
situations. (See Figure 14.7.)
The vignette design is very interesting. The scenario is presented followed by three questions.
Then more information is presented followed by two more questions. On their website and in
their publications, considerable effort is dedicated to correctly scoring essays.
Evaluation
When a credentialing examination program desires high fidelity to the target domain but cannot
provide the high-fidelity performance test, the written essay seems like a viable alternative. Essay
items can be done very poorly and result in low reliability with the potential of many threats to
validity caused by raters (as Chapter 13 shows). The example in this section shows a very rigorous
level of essay item-writing and scoring. Candidates are given extensive advice on how to prepare
and how to write an essay (see http://www.cfainstitute.org/cfaprogram/exams/format/Pages/
cfa_exam_question_formats.aspx).
The meticulous attention to item design, cognitive complexity, high-fidelity problems encoun-
tered by financial analysts, and scoring makes this kind of essay testing a strong model for other
testing programs.
A. Prepare FF’s return objective for next year. Show your calculations. (4 minutes)
B.
1. Determine whether FF or the Wirth-Moore pension plan has greater ability to take risk. Justify your determination with one reason.
2. Determine whether FF or the Wirth-Moore pension plan has greater willingness to take risk. Justify your determination with one reason. (6 minutes)
C. Formulate the following investment policy constraints for FF:
1. Liquidity. Show your calculations.
2. Time horizon. Justify your response with one reason. (6 minutes)
FF presently bases its annual spending on the average market value of its assets each year.
Noland Reichert, a member of FF’s board, is concerned about recent market volatility. Reichert
proposes a spending rule based on a rolling three-year average market value. In response to
Reichert’s proposal, Joy recommends a geometric spending rule, where spending is based on a
geometrically declining average of trailing endowment values. FF’s external tax counsel advises
that there would be no adverse tax consequence from adopting either smoothing rule.
D. Explain the effect on FF’s spending of adopting Joy’s smoothing rule rather than Reichert’s
smoothing rule. (4 minutes)
Reichert also serves on the board of Headwaters University Foundation, an endowment with more
than USD 1 billion in assets. Headwaters recently invested in a private equity venture based on the
recommendation of its internal investment staff. The venture requires a USD 2.5 million minimum
investment by each participant, with a five-year lock-up provision. The private equity venture is not
expected to generate income, but has the potential to increase in value at a rate of 20% per year over
the next five years. Reichert recommends that FF should participate in this private equity venture.
E. Justify, with two reasons, why Reichert’s recommendation is inappropriate for FF. (4 minutes)
Figure 14.7 CFA Institute Essay Item. Beginning in 2009, CFA Institute provides three answer choices for selected response items on all
CFA exams.
Source: http://www.cfainstitute.org/cfaprogram/courseofstudy/Documents/level_III_essay_questions_2009.pdf
Copyright (2009), CFA Institute. Reproduced and republished with permission from CFA Institute. All rights reserved.
Summary
Tests designed to measure professional practice are among the best developed anywhere. As the
stakes for such tests are incredibly high, testing agencies have to invest considerable resources
in test design, item development and validation, administration, scoring, and validation. Many
of these testing agencies have high examination fees and generate considerable income, which is
invested in test development, maintenance, and validation. So it is no surprise that many of these
testing programs, including those featured in this chapter, are so excellent.
The sample of testing programs and item formats used is diverse. As we can see, SR, CROS, and
CRSS formats are all used with success. These testing agencies are leaders in item format innova-
tion and produce validly interpreted test scores. We have much to learn from their work.
A body of research exists about the threats to validity from using subjective scoring. Discussion
of these threats can be found in many chapters in this book. Chapter 18 provides a discussion of
two of these threats: rater inconsistency as it affects reliability, and rater bias. So it is important
that all credentialing testing programs that feature CRSS items consider these two
major threats to validity. For every credentialing testing program, technical reports are vital to
establishing the validity evidence of any method. None of the testing programs reviewed here
are exceptions. Technical reports available to the public are rare. These technical reports should
provide information about the threats to validity and the evidence needed to support the validity
of each item’s interpretation and use.
As regards which of the item formats presented in this chapter should be preferred, Epstein
(2007) had the best response. All methods have strengths and weaknesses that are well known.
Effective design and validation are strategies that render most of these methods effective. How-
ever, professional competence is complex, and no single method seems sufficient. Multi-source
information about competence is the best approach. Well-thought-out elements of competence
have to be delineated before choosing specific formats like those reviewed in this chapter.
15
Developing Items for Accessibility by
Individuals With Exceptionalities
Overview
The field of educational measurement serves a wide audience including those who design, admin-
ister, interpret, use, and take tests across the entire span of human development, from infancy to
the golden years. Across such a spectrum, we find individuals with unique learning needs and
means of interacting and communicating with others. These unique needs challenge the item
writer to maximize each individual's access to the intent of the item. Such unique needs might
stem from cognitive, physical, health-related, behavioral, or emotional impairments. Second-language
learners face the same challenges. Gifted students also face unique challenges, as giftedness is
highly diverse. Finally, any individual might be multiply challenged by a combination of these conditions.
These human conditions are sometimes accompanied by cultural and learning style differ-
ences that are beyond the scope of this book. We focus on the role of the item writer to improve
and maximize accessibility at the item level for the widest possible range of exceptionalities. We
do not introduce an additional set of guidelines, but illustrate how accessibility guidelines are
consistent with the SR and CR item-writing guidelines presented in this book.
We begin by introducing a general theory for test accessibility, review the characteristics of
exceptionalities with special attention to English language learners and individuals with disabili-
ties, and review accessibility research supporting the existing SR and CR item-writing guidelines.
We also present a few special topics in accessibility research concerning universal design (UD),
development and application of the Test Accessibility and Modification Inventory, and a sum-
mary of findings from item modification studies.
Beddow and his colleagues argue that any relevant inference from a test score presupposes
accessibility, which is consistent with the validity framework presented in chapters 1 and 3. In
the lexicon of Kane’s (1992, 2006) validity as argument, accessibility exists as an assumption of
the interaction between test takers and test items. When a test taker responds to an item, there is
an assumption that an interaction took place, that the person engaged the intended content and
the cognitive demand as presented by the item. As described by Beddow, the argument sup-
ports the assertion that we can make a valid inference regarding content and cognitive demand.
If the person is unable to access the item fully because of one barrier or another, the interaction
assumption is not viable. Consequently, doubt is created about any inferences about content or
cognitive demand. More generally, the validity of the ability being measured is weakened.
The model presented by Beddow (2012) is based on a disaggregation of sources of error vari-
ance or construct-irrelevant variance. Such sources of error can be observed in the test event
itself, in the interaction between test taker and test items. This, of course, is part of a larger model
of testing, where there is a period of instruction, training, or learning, followed by the test admin-
istration, resulting in a score used to draw inferences and make decisions. He then offers a model
of test error resulting from limited accessibility, including physical, perceptive, receptive, emo-
tive, and cognitive sources. Beddow argues that each of these sources can be linked to test or item
features, making them malleable factors in test and item development. We briefly describe these
sources of construct-irrelevant error variance in item development and validation, which is con-
sistent with Beddow’s model.
Physical sources of error variance are largely due to impairments that reduce mobility, agility,
fine motor skills, vision, and hearing. These sources are typically addressed through test accom-
modations and UD. The item-writing guidelines used to improve accessibility at the item level
are largely intended for the remaining sources.
Perception, including whichever sensory skills are required given the nature of the test and test
items, allows the test taker to interact with the item. It is through the senses that most test takers
gain access to the task demands in the item, which may include reading or listening to directions,
prompts, illustrations and graphical or other visual displays, and the items themselves. This per-
ception is closely followed by reception, which allows the test taker to read and comprehend or
listen for understanding.
Cognition involves accessing working memory and long-term memory and the mental proc-
esses required by the task demands in the item. Cognitive load theory as presented by Sweller
(2010) provides a means to identify and reduce construct-irrelevant features by addressing three
categories of cognitive load: minimize extraneous load (irrelevant features of the item or task),
isolate intrinsic load (that which the item addresses directly), and manage germane load (interac-
tions with long-term memory). Beddow provided the theoretical underpinnings of this focus
and distilled the implications for item development into practical guidance.
Finally, emotion captures a full range of affective experiences stemming from the interaction of
the test taker with the test items. Beddow reviewed research showing the potential of negative
emotional experiences interfering with cognition (see Chen & Chang, 2009). Emotion is an area
often overlooked by test and item developers, but in some ways is addressed in the current set of
item-writing guidelines. In part, the guidelines regarding avoidance of opinion, trick, or humor-
ous items, and other guidelines focusing on economy and clarity of expression, are intended to
improve cognition but also minimize test-taker frustration.
Many current models of test and item development discussed in chapter 4 refer to the assess-
ment triangle as a theoretical model of the ecology of testing. The three points of the triangle are
cognition, observation, and interpretation (Pellegrino, Chudowsky, & Glaser, 2001). The trian-
gle rests on the function of cognition, which involves the knowledge, skills, and abilities (KSAs)
employed when interacting with an item and the way knowledge is developed and learning
progresses. This cognition is observed through the individual’s interaction with the test, which is
followed by appropriate interpretations. Ketterlin-Geller (2008) proposed a slight modification,
where interpretation is the apex of the triangle to emphasize the unique nature of the interaction
of cognition and observation and the potential sources of construct-irrelevant variance uniquely
affecting students with exceptionalities. She argued that cognition represents not only the under-
lying model of learning and cognition but also the cognitive characteristics of the test taker.
Exceptionality
In a very general sense, the term exceptionality describes a broad class of unique individual char-
acteristics concerning educational needs of students. Exceptionalities include physical, cognitive,
and emotional or behavioral impairments, and also conditions affecting gifted and talented stu-
dents. Exceptionalities may affect learning and communication, and in our context, interaction
with tests and assessments.
In its twentieth year, the journal Exceptionality is published “to provide a forum for the presen-
tation of current research in special education. Manuscripts accepted for publication will repre-
sent a cross section of all areas of special education and exceptionality and will attempt to further
the knowledge base and improve services to individuals with disabilities and gifted and talented
behavior” (http://www.tandf.co.uk/journals/authors/hexcauth.asp).
Exceptional Children, published by the Council for Exceptional Children (CEC), is “dedicated
to improving the educational success of individuals with disabilities and/or gifts and talents”
(http://www.cec.sped.org/). Their website on assessment provides information on many issues
affecting the exceptionalities community. On this website is a description of Response to Interven-
tion (RtI), which is a process to identify and address emerging learning difficulties in children
early and directly. RtI also includes methods for identification, selection, and placement, and
progress monitoring assessment tools used to support students with exceptionalities. This web-
site also contains information on alternative assessments used by school districts and states. This
information includes dynamic assessment as an interactive tool employing a series of scaffolded
prompts to uncover KSAs. Also included are family assessments, portfolios, and issues related to
accommodations and modifications to improve accessibility to standardized assessments, among
many others.
Another important resource is the National Center on Educational Outcomes (NCEO), which
was established in 1990 at the University of Minnesota. Its mission is “to provide national leader-
ship in designing and building educational assessments and accountability systems that appro-
priately monitor educational results for all students, including students with disabilities and Eng-
lish Language Learners (ELLs)” (http://www.cehd.umn.edu/nceo/About/). It serves as a strong
policy advocate for the promotion of evidence-based practice and provides important dissemina-
tion and technical assistance through state and national needs assessments and policy reviews. It
is a clearing house and contributor to research regarding educational outcomes for students with
disabilities (SWDs) and ELLs. Among the topic areas with multiple resources at their website, we
find accommodations, alternative assessments, universally designed assessments, and assessment of
English language proficiency. These topics, among others, are relevant to our discussion of item-
writing for test accessibility.
More general legislation includes the Americans with Disabilities Act (see http://www.ada.gov/
cguide.htm for more general information).
a person who has a physical or mental impairment that substantially limits one or more major
life activities, a person who has a history or record of such an impairment, or a person who is
perceived by others as having such an impairment. The ADA does not specifically name all of
the impairments that are covered. (US Department of Justice, 2009)
The issues related to meeting the unique learning needs of individuals with disabilities and
exceptionalities more broadly become even more complicated when individuals exhibit multiple
exceptionalities. Sometimes, individuals with disabilities may also be gifted in many ways and
are referred to as twice-exceptional (see Nicpon, Allmon, Sieck, & Stinson, 2011, for a review of
research). The ability to identify and support ELLs with disabilities is also a significant challenge
(Abedi, 2009). The biggest issues with this population are in identification and service provision,
as both are much more complex. The issues related to test accessibility are no more or less com-
plicated and the approaches described in this chapter apply to all cases.
Our interest in this area with respect to developing and validating test items is to illustrate how
item-writing guidelines can be and have been used to improve test accessibility for individuals
with exceptionalities. This is a validity issue at its core, as it bears directly on the interpretation
and use of test scores. Special educators, English language educators, and measurement special-
ists have contributed often and effectively to increase our understanding of how individuals with
exceptionalities interact with tests and test items. We have much guidance to support test devel-
opment with these principles in mind. Unfortunately, we are unable to include test and item
design relevant to individuals with the most significant impairments—typically instructionally
embedded observational systems employing portfolios, performance assessments, or rating scales
(Elliott & Roach, 2007). The guidelines presented in chapter 11 apply equally to these cases as
well. We highlight the work of Abedi (2006) and Elliott and his colleagues (Elliott et al., 2010;
Elliott et al., 2011), which embodies elements of the general theory of test accessibility, including
principles of good item writing, UD, and cognitive load theory.
Content Concerns
1. Base each item on one type of content and cognitive demand.
The clarity of the construct being measured facilitates accessibility by focusing on construct-
relevant features (Thurlow et al., 2009).
3. Keep the content of items independent of one another.
This also supports the recommendation to limit cognitive load appropriately.
Format Concerns
7. Format each item vertically instead of horizontally.
This guideline is expanded in the item accessibility literature to include additional white
space throughout the item, including between stimulus materials, visuals, the item stem, and the
options. The spatial layout of items is an important accessibility tool (Elliott et al., 2010; Kettler
et al., 2011).
Style Concerns
9. Keep linguistic complexity appropriate to the group being tested.
This style concern is an area with substantial research evidence with respect to testing ELLs
and SWDs. Some aspects of linguistic complexity that affect accessibility include (a) word fre-
quency and familiarity, (b) word length, (c) sentence length, (d) the structure of discourse, (e) the
use of passive voice, (f) length of noun phrases, (g) multiple prepositional phrases, (h) the use of
subordinate, conditional, or relative clauses, (i) number of unique words and total word count in
an associated passage, (j) excessive use of pronouns, adjectives, and adverbs, and (k) specialized
vocabulary that is not construct-specific (Abedi, 2006, 2012; Hess, McDivitt, & Fincher, 2008).
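Several of these surface features can be screened automatically during item review. The sketch below is a rough illustrative screen, not a validated readability index: it computes total and unique word counts, average sentence length, and average word length for an item's text. Features such as passive voice, clause structure, or word familiarity would require more sophisticated natural-language processing tools.

import re

def linguistic_screen(text):
    """Return a few simple linguistic-complexity indicators for an item's text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    return {
        "total_words": len(words),
        "unique_words": len({w.lower() for w in words}),
        "avg_sentence_length": len(words) / max(len(sentences), 1),
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
    }

# A hypothetical item stem, used only to demonstrate the output.
stem = ("The committee, having deliberated at considerable length regarding the proposal, "
        "ultimately determined that additional documentation would be required. "
        "What should the applicant do next?")
print(linguistic_screen(stem))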
10. Minimize the amount of reading in each item. Avoid window dressing.
Another common element in accessibility research is the principle of minimizing cognitive
load, consistent with the intended construct being measured (Thurlow et al., 2009). Some recom-
mend chunking or segmenting the text and interspersing test items so they are found near the
relevant text (Hess, McDivitt, & Fincher, 2008; Roach, Beddow, Kurz, Kettler, & Elliott, 2010).
Content Concerns
1. Clarify the domain of knowledge and skill(s) to be tested.
This guideline is also consistent with the SR guideline 1. It is a central component of UD, as the
first two principles are to consider an inclusive assessment population in all stages of item and
test development and secure precisely defined constructs (Thurlow et al., 2009; Welch, 2006).
Context Concerns
9. Consider cultural and regional diversity and accessibility.
Here, special attention is given to issues of exceptionalities when using CR items (Welch, 2006).
It is imperative, as called for in UD, to consider an inclusive test-taking population at the point
of item development, to facilitate full accessibility. Usually, researchers who conduct studies on
item formats with special populations recommend that individuals with exceptionalities be rep-
resented in the development and field-testing of items. Individuals with exceptionalities should
be considered from the initial phases of item development (Thurlow et al., 2009).
10. Ensure that the linguistic complexity is suitable for the intended population of test takers.
The principles presented above on the role of language structures apply equally to SR and CR
items (Abedi, 2006, 2012; Hess, McDivitt, & Fincher, 2008).
Original
Near the end of the story, Jimmy gave his brother his bike. When he gave it to him, what was his
first reaction?
Improved
When Jimmy gave the bike to his brother Mark, what was Mark’s first reaction?
2. Use of negatives (guidelines 12, 19): negative terms can obscure the message and unnecessarily
confuse the reader. Sometimes the point of the passage has a negative orientation (for example,
why someone does not want to do something). Yet often there is a positive side as well (e.g., why
someone wants to do something).
Original
Which of the following is not a reason the student did not want to disappoint her teacher?
Improved
Why did Maria NOT want to disappoint her teacher, Mr. Stone?
Better
Why did Maria want to impress Mr. Stone?
3. Vocabulary load (guideline 9): when vocabulary is not the target of measurement, the vocabu-
lary should be appropriate for the test takers and the content of the test. Consider an example
NAEP item (2011 NAEP Mathematics Test, Grade Four, Block M8, item #18).
Original
4 quarts = 1 gallon. Amy wants to put 8 gallons of water into her aquarium. She has a 2-quart
pitcher to carry water from the sink. How often will she need to fill her pitcher?
Improved
Amy needs 8 gallons of water for her fish tank. She has a 2-quart pail to carry water. How often
will she need to fill her pail to fill the fish tank?
Two vocabulary words may be appropriate for fourth-graders, yet they may still interfere with
understanding the task and thus with access to the mathematical problem presented in the text.
The words aquarium and pitcher both carry complex meanings, and the mathematics problem is a
conversion and multiplication problem, not a vocabulary problem (a worked check of the arithmetic
appears after this list).
4. Non-construct subject area language or specialized vocabulary (guidelines 9, 10): the example
item above fits in this context as well. The use of terminology related to aquariums in this
mathematics item potentially introduces non-construct subject area language.
Considering both #3 and #4, vocabulary should be appropriate for the test taker and should be
construct-relevant.
5. Complex sentences, dense text (guidelines 1, 9): too many adjectives and adverbs modifying
the head noun and the use of multiple propositions are likely to introduce significant complexity
that may be construct-irrelevant. Consider the example below, modified for illustration purposes
with additional adjectives and propositions not found in the original NAEP item (2011 NAEP
Reading Test, Grade Four, Block M12, item #12).
Original
Joe slowly rode his new bicycle on a special bicycle path from his new house to his friend’s house
on the other side of the city park. He rode his new bicycle 1.7 miles along the special path as seen
in the artistic figure below …
Improved
Joe rode his bike on a path from his house to his friend’s house, 1.7 miles away …
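As a quick check of the intended construct in the fish-tank item from concern 3 above, the task reduces to a unit conversion followed by a division. The short sketch below (a minimal illustration; the quantities are taken from the item itself, and the variable names are ours) verifies the keyed result of 16 fills.

```python
# Check of the arithmetic behind the fish-tank item (concern 3 above):
# 8 gallons are needed, there are 4 quarts per gallon, and the pail holds 2 quarts.
quarts_per_gallon = 4
gallons_needed = 8
pail_capacity_quarts = 2

total_quarts = gallons_needed * quarts_per_gallon    # 8 x 4 = 32 quarts
fills_needed = total_quarts // pail_capacity_quarts  # 32 / 2 = 16 fills

print(fills_needed)  # 16
```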
Schafer and Liu (2006) identified six phases of test development and applied the principles of uni-
versal design to each. With respect to item development, they argued that a specific set of elements
should be in place to ensure successful application of UD.
Johnstone, Thompson, Bottsford-Miller, and Thurlow (2008) offer advice on item review with
UD principles in mind, suggesting that there are at least three points where it should be employed.
The first is an early sensitivity review shortly after the item is first drafted. This review provides
a structured process to eliminate unintended barriers. A second is the use of think-aloud during
the field-testing stages of item development. This activity allows for deeper understanding of the
person–item interaction. The third is statistical analysis of item functioning from operational
administration with low-incidence populations. This analysis includes those students with sig-
nificant visual, hearing, or cognitive impairments. These reviews can actually be employed at
various stages during the life span of an item. Johnstone and colleagues offered specific guidance
on elements to include at various stages.
assessment and testing decisions are made. States define eligibility for participation in regular
education and alternative assessments differently. Many states have tried to develop alternative
assessments for two groups of students, as specified by current federal guidelines. For the first
group, alternative assessments based on alternative academic achievement standards (AA-AAS)
are used. For the second group, alternative assessments based on modified academic achievement
standards (AA-MAS) are used. Generally,
AA-AAS are intended to meet the unique learning and assessment needs for students with the
most severe cognitive impairments and AA-MAS are intended for students with moderate to
severe cognitive impairments (persistent academic difficulties).
Theoretical and empirical work on these assessments has provided insights into item func-
tioning that have shown improvements in accessibility. Most of this work has come through
the efforts of researchers involved in item modification for AA-MAS, as AA-AAS tend to be
qualitatively different forms of assessments including observational and performance-based
tasks. In addition, there has been much more attention to AA-MAS, likely because of the larger
population. A special issue of Peabody Journal of Education (2009) on AA-MAS provided several
updates on the current state of affairs. Rodriguez (2009b) reviewed the psychometric
considerations raised across this special issue. In that volume, Kettler, Elliott, and Beddow
(2009) reviewed their work on a theory-guided and empirically based tool for guiding test modi-
fication to enhance accessibility. The Test Accessibility and Modification Inventory (TAMI) and
the associated Accessibility Rating Matrix (ARM; available at http://peabody.vanderbilt.edu/
tami.xml) are guided by the principles of UD, test accessibility, cognitive load theory, test fair-
ness, test accommodations, and item-writing research (also see Elliott, Kettler, Beddow, & Kurz,
2011). The TAMI was developed through the Consortium for Alternate Assessment Validity
and Experimental Studies (CAAVES) and the ARM was developed through the Consortium for
Modified Alternate Assessment Development and Implementation (CMAADI), funded through
the United States Department of Education.
Because the appropriate modifications depend on the characteristics of each item, it makes no sense to apply any one modification across items uniformly. Such a systematic
modification strategy is inconsistent with good item-writing practice. Moreover, items should
be developed initially with maximizing accessibility as a goal, where TAMI is used as an effective
guide for item writers.
Fewer items on a page also means more pages for the test. However, this is likely to improve valid-
ity through greater accessibility, allowing test takers to focus on one item at a time.
Any visual, graphic, chart, table, or other display should serve a necessary function for the item
(Crisp & Sweiry, 2006). Usually the visual will contain relevant information needed to respond
to the item correctly. The goal is to make the relevant features of any visual prominent. To maxi-
mize accessibility, the fonts, lines, shading/patterns, and other features should be large enough to
observe, even for test takers with some visual impairments, such as color-blindness or extreme
farsightedness. The font should be at least as large as the font used for the item stems and options.
Consistency across visuals throughout the test is another goal.
When the visual is an essential element of the item, as with a graph or chart that contains the
required information to which the item refers, the use of visuals should enhance accessibility to
the item.
A challenge in this area is the isolation of the effect of each modification. Typically, a package
of modifications is used, depending on the specific needs of each item. Kettler (2011) provides a
good review of research on modification packages. To illustrate some of these modifications, items
modeled after those used in the CAAVES and CMAADI studies are presented here, along with item
statistics from the original items (Rodriguez, 2009a). Common modifications include the elimination
of one distractor and the addition of white space around the item, graphical displays, and options.
Examples of modifications intended to increase accessibility to test items are provided below.
In the CAAVES and CMAADI experimental studies of item formats, such changes were associated
with improved item functioning: the items became somewhat easier and more discriminating overall,
and the remaining two distractors functioned better, were selected by a more even proportion of
students, and were more highly discriminating. The four items
presented here are modeled after successful items from those studies.
Associated with each item is a brief description of the modifications made and changes in item
functioning, based on item difficulty, item discrimination, and distractor discrimination. These
are summarized in Table 15.1. These item statistics are based on the original items, from which
the following examples are modeled—they differ in the specifics of the content but the item for-
mat and nature of the problem are identical.
In Table 15.1, we see shifts in item statistics. In Item 1, option A was eliminated in the modi-
fied form (additional modifications are described below). This was the least effective distractor.
In this item, distractor A had a p-value of .04 (only 4% selected this option), with a point-biserial
correlation (r) of –.17. We notice that the item did not change in difficulty because of the
modification (% correct went from 69% to 68%); however, the discrimination of each remaining
option improved (the item discrimination went from .38 to .40), and the distractor discrimination
values became stronger (–.23 to –.25 and –.18 to –.27).
Table 15.1 Item Statistics for Four Example Items Pre- and Post-Modification

Option   Item 1                  Item 2                  Item 3                  Item 4
         %        r              %        r              %        r              %        r
A        4 (–)    –.17 (–)       60 (68)  .37 (.35)      35 (53)  .40 (.47)      24 (10)  –.27 (–.46)
B        69 (68)  .38 (.40)      10 (–)   –.25 (–)       21 (23)  –.28 (–.41)    10 (–)   –.37 (–)
C        16 (21)  –.23 (–.25)    17 (19)  –.21 (–.34)    22 (24)  –.19 (–.14)    9 (8)    –.23 (–.36)
D        11 (11)  –.18 (–.27)    13 (14)  –.08 (–.10)    22 (–)   .01 (–)        57 (82)  .59 (.61)

Note: For each item, the table shows the percentage of test takers choosing each option (%) and the option–total correlation (r); for the keyed option this is the item discrimination, and for the distractors it is the distractor discrimination. Post-modification statistics appear in parentheses; a dash indicates an option removed by the modification.
The other three items had similar results. Item 2 became slightly easier, whereas Items 3 and 4
became much easier. Of the 12 retained options, 10 became more discriminating. This improvement
in option (correct option and distractor) discrimination improves the measurement quality of the
items and aggregates to the test score level: improved item discrimination yields improved score
reliability and greater precision for the resulting inferences.
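The classical statistics reported in Table 15.1 can be computed from a matrix of raw option choices. The sketch below is a minimal illustration, not the analysis used in the CAAVES and CMAADI studies; the response data are invented, and each option's point-biserial is computed against the rest-of-test score.

```python
import numpy as np

# Hypothetical raw responses: rows are test takers, columns are items;
# entries are the chosen option. The key lists the correct option per item.
# All values are invented for illustration.
responses = np.array([
    ["A", "B", "D", "C"],
    ["A", "C", "D", "C"],
    ["B", "B", "A", "C"],
    ["A", "B", "D", "D"],
    ["C", "D", "A", "C"],
    ["A", "B", "D", "C"],
])
key = np.array(["A", "B", "D", "C"])

scored = (responses == key).astype(float)   # 1 = correct, 0 = incorrect
total = scored.sum(axis=1)                  # total score per test taker

for item in range(scored.shape[1]):
    rest = total - scored[:, item]          # rest-of-test score (excludes this item)
    print(f"Item {item + 1}: difficulty (p-value) = {scored[:, item].mean():.2f}")
    for option in np.unique(responses[:, item]):
        chose = (responses[:, item] == option).astype(float)
        r_pb = np.corrcoef(chose, rest)[0, 1]   # option point-biserial vs. rest score
        print(f"  option {option}: chosen by {chose.mean():.0%}, r = {r_pb:.2f}")
```

For the keyed option, r corresponds to the item discrimination reported in Table 15.1; for the other options, it corresponds to the distractor discrimination.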
C.   2 | 1, 3, 7
     3 | 2, 3, 3, 6, 8, 9
     4 | 0, 1, 4, 5, 5, 6
     5 | 1, 1, 2, 4, 5

D.   2 | 1, 3, 7
     3 | 2, 3, 6, 8, 9
     4 | 0, 1, 4, 5, 6
     5 | 1, 2, 4, 5
The recommended modifications for item 1.1 included removal of the first distractor, which was
the least plausible, a reduction in the number of data points with the resulting data points con-
tained on a single line, and additional white space, particularly between the stem-and-leaf plots.
The modified version is item 1.2.
2.1 Mark earns $100 per month cutting grass in the neighborhood. Review the pie chart that
shows how Mark spends his money each month. If Mark stays on budget, how much will he
spend on food and fun each month?
A. $50
B. $40
C. $30
D. $20
[Pie chart of Mark's Budget: Food 20%, Saving 40%, Fun 30%, Gas 10%]
The modifications on item 2.2 include simplification of the stem from 39 words to 20 words,
removal of the least plausible distractor, the addition of patterns to make the pie pieces more
distinguishable, and an increase in the size of the graph with more white space around the options.
2.2 Mark earns $100 per month. Based on the graph, how much can he spend on food and fun
each month?
A. $50
B. $30
C. $20
[Pie chart of Mark's Budget, enlarged and with patterned slices: Food 20%, Saving 40%, Fun 30%, Gas 10%]
3.1 A candy store sold some of its candy for $4 per bag. If every bag contained 9 pieces of candy
and the candy store made $336 from the sale, how would you determine the number of
candies that were sold?
A. Divide 336 by 4 and then multiply by 9.
B. Multiply 9 by 4 and then add 336.
C. Multiply 336 by 4 and then divide by 9.
D. Divide 336 by 9 and then multiply by 4.
The modifications for item 3.1 included simplification of the stem from 40 to 28 words, inclu-
sion of an illustration (not shown), and simplification of the options from sentences to numeric
mathematical statements with additional white space.
3.2 A store sold candy for $4 per bag. Every bag contained 9 pieces of candy.
The store made $336. How can you determine the number of candies sold?
A. (336 ÷ 4) × 9
B. (9 × 4) + 336
C. (336 × 4) ÷ 9
Consider a fictional reading passage. It is a narrative story told in the voice of a child who is
experiencing her first day in a new school. She has one friend, Maria, in her class on that first day.
The interesting twist is that the school principal is her aunt. However, the fact that the narrator
is the principal’s niece is never explicitly stated—only that the principal is her aunt. Item 4.1 is a
question about the narrator’s point of view from such a story.
4.1 From whose point of view is this passage written?
A. Maria’s
B. the principal’s
C. the teacher’s
D. the niece’s
The modifications included revision of the remaining distractors and elimination of one distractor.
In the passage, the relationship is stated only as the principal being the narrator's aunt, never as
the narrator being the principal's niece, so the keyed option (the niece's) requires an inference about
the other side of the relationship. However, the target of the item is point of view, not inferences
regarding relationships among the story's characters.
Summary
The issues related to measurement of achievement of individuals with exceptionalities are com-
plex. We have described those characteristics most related to item development and validation.
Much of the recent research and development has focused on accessibility for students with
disabilities. This body of work includes research on and design of alternative assessments based
on modified academic achievement standards (AA-MAS) and on alternative academic achievement
standards (AA-AAS). This work largely stems from federal legislation regarding school-based
accountability (Weigert, 2011). Psychometric considerations for AA-MAS were reviewed by
Rodriguez (2009b) in a special issue of the Peabody Journal of Education (2009). A special issue
of Applied Measurement in Education (2010) was also devoted
to measurement issues with special populations.
The point we have focused on here is that careful attention to good item-writing practices
grounded in the guidelines provided in this book will enhance accessibility of the test items and
test to the widest possible audience. The use of item-writing guidelines improves validity
for the intended interpretation of results and allows us to evaluate the appropriateness of the
intended interpretation for all test takers in the target audience. In this effort, it is essential that
newly developed items be tried out in field tests. During the field-testing of new items, capture
a sufficient sample of ELLs and SWDs to allow for analysis of their performance on the items.
When possible, these tryouts can be complemented with cognitive interviews (think-aloud) or
focus groups to secure direct student insights. In this way, we can evaluate important features of
the items or tasks and the experience of students with exceptionalities (Pitoniak et al., 2009).
Some of the questions we might ask during the item-tryout phase include:
1. Are there timing restrictions that do not allow all test takers to complete the test?
2. Are the instructions clear to all test takers?
3. Are there differential item completion and skip rates?
4. Are test takers interpreting the items as intended?
5. Does the item elicit the appropriate cognitive demand?
6. For ELLs, what is the role of their native language in responding to English language items?
7. Can item responses be easily, accurately, and consistently scored?
8. Does the rating scale or rubric address the full possible range of responses?
Chapter 16 provides a comprehensive framework for item validation in the context of item devel-
opment. Several standards in the Standards for Educational and Psychological Testing (Ameri-
can Educational Research Association, American Psychological Association, and the National
Council on Measurement in Education, 1999) directly address issues related to item development
with attention to exceptionalities, including:
Standard 10.3: Where feasible, tests that have been modified for use with individuals with dis-
abilities should be pilot-tested on individuals who have similar disabilities to investigate the
appropriateness and feasibility of the modification.
Standard 10.4: If modifications are made or recommended by test developers for test takers
with specific disabilities, the modifications as well as the rationale for the modifications should
be described in detail in the test manual and evidence of validity should be provided whenever
available.
The last few questions address the scoring of CR items. This is also an area where there is very
little research. Do ELLs and SWDs provide scorable responses, or do their responses have features
or characteristics that were not anticipated in the rubric design process?
The scoring guidelines presented in chapter 12 attempt to capture this challenge, particularly in
the first two guidelines:
Clarify the intended content and cognitive demand of the task as targets for scoring.
Specify factors in scoring that are irrelevant to the task demands.
These offer an opportunity to identify aspects of responses that are relevant to scoring. For exam-
ple, does English language proficiency affect the reading of a mathematics problem or a writ-
ten response for an ELL student? Such challenges may be irrelevant to the task demands—for
example, is English proficiency necessary to predict an outcome from part of a story or to solve
an algebraic equation? To support the full use of these guidelines and enhance accessibility, the
characteristics of SWDs and ELLs must be taken into consideration when developing scoring
guidelines and rubrics, recruiting scorers, and training scorers (Pitoniak et al., 2009).
The full range of activities in item development, tryout, administration, and scoring should
reflect the full range of characteristics of the intended audience, with special attention to test
takers with exceptionalities. In this way, the item and test developer will make important gains
in improving test accessibility for the target audience and enhance the validity of intended
inferences and uses of test scores. Such inferences and uses are just as important, if not more
important, for individuals with limited access to some aspects of their world, which the rest of
us take for granted. Decision-making for individuals with exceptionalities is thus more demand-
ing and requires additional validation, which is the real intent of the broad principles of test
accessibility.
V
Validity Evidence Arising From Item Development and Item Response Validation
Item development involves many important steps. Performing these steps helps ensure that each
item is of high quality and also provides validity evidence. These item development procedures
also conform to the Standards for Educational and Psychological Testing (AERA, APA, & NCME,
1999). In these chapters, we emphasize two types of item validation evidence. The first type
involves those procedures designed to develop and improve test items. The second type comes
from statistical studies of item responses. Validity studies can also contribute to this body of
evidence by addressing important questions about issues that threaten validity. The next four
chapters are complementary, all directed toward the end of polishing test items so that they are
worthy of being placed in an item bank for operational testing. We have referred to this set of
activities as item validation. The mix of evidence collected and organized regarding item quality
supports validity.
Chapter 16 reviews the essential steps in item development and identifies evidence that can
be used to support item quality. These steps include recommended procedures that are usually
performed in high-quality testing programs (Downing, 2006).
Chapter 17 covers the topic of item analysis for selected-response (SR) and constructed-response
objectively scored (CROS) items. This chapter includes traditional and new methods for studying
item responses, with emphases on graphical methods and evaluating distractors.
Chapter 18 addresses item analysis in the context of constructed-response subjectively scored
(CRSS) items. Because a rating scale is used, unique statistical issues arise concerning how the
ratings affect reliability and construct-irrelevant variance.
Chapter 19 contains a variety of topics dealing with item responses. Each topic presents an
issue affecting validity. Validity studies are highly recommended for exposing the utility or futil-
ity of a procedure affecting an item format and item responses or identifying threats to validity
and what might be done to reduce or eliminate a threat (Haladyna, 2006).
Testing program personnel should document these activities and ensure that a section of a
technical report presents all evidence bearing on item response validity and test score validity.
Validity studies should be published in some form. Documentation of all validity evidence is an
essential action in any testing program (Becker & Pomplun, 2006; Haladyna, 2002).
16
Validity Evidence From Item Development Procedures
Overview
As noted previously in this volume and more extensively discussed by Kane (2006a, 2006b),
validation is an investigative process. It involves a definition of the construct being measured,
arguments affecting validity, a claim for validity by the test developers, the assembling and
integration of validity evidence, and an evaluation of the arguments and evidence vis-à-vis
the claim. According to Kane, two coordinated aspects in any testing program are develop-
ment and validation. In this chapter, we examine procedures that comprise item validity evi-
dence. Validation is also an ongoing effort to improve a testing program. As test items are the
most basic ingredient of any test, we focus our time, energy, and resources on validating test
items.
Part I addresses some important concepts of item development and validation. Part II describes
the specific activities needed to improve and validate items.
Validating Items
According to Kane (2006a), validation involves the establishment of evidence to support a pro-
posed interpretation or use of a test score. Validation also involves the evaluation of this evidence
vis-à-vis a claim.
The basis for item validation is a process very similar to the process of validation for the inter-
pretation and use of test scores. Applied to test items and item responses, if an item measures the
content it is supposed to measure and the item responses follow a pattern suggesting effective
measurement, we claim the item is valid for being included in a test.
Kane also describes the interpretive argument that is part of validation. The interpretive argu-
ment involves how the item performs as compared to how it is supposed to perform. That is, we
have outlined a meticulous process of item development that begins with a name for the trait
being measured, the identification of content in a specific way, the development of items, and the
observation of each item's performance in a test.
According to Kane, the interpretive argument has two overlapping stages: development and
appraisal. In the development stage, items are created along with the interpretive argument. In
the appraisal stage, the evidence we have gathered and integrated is evaluated and threats to
validity are considered. Part of the argument presented in this chapter is that in item develop-
ment we want to ensure that procedures that are part of item development contribute to making
each item as effective as possible. To go along with this claim, we want item performance to fol-
low a pattern suggesting the item is doing what it is supposed to do in contributing to a validly
interpreted test score.
Therefore, for our purposes in item validation, the interpretive argument comes from the many
careful steps in item development: naming the construct, specifying content, and writing the test
items that will become the universe of generalization. We also gather two types of validity
evidence: procedural and empirical/statistical. This chapter addresses procedural evidence, and
the next chapter addresses empirical/statistical evidence. Chapter 18 addresses factors that might
undermine or improve validity.
for professional credentialing examinations. The latter is used often in elementary and second-
ary student achievement testing. Kane (2006a, 2006b) provides an extensive discussion of the
role of content in test score validation. The identification of the target domain and the universe
of generalization are key components in test development and test design regardless of the way
content is defined.
To review, the target domain identifies the tasks ideally performed to show proficiency in the
construct. The universe of generalization is a domain of performance tasks that can be feasibly
tested. These tasks may come from the target domain or be hypothetical. If a task is hypothetical,
then the judgment of fidelity to the target domain becomes more important. Chapter 14 contains
a discussion of fidelity in credentialing examinations. In theory, a test score derived from a test
taker taking all tasks in the target domain is perfectly correlated to a test score from the admin-
istration of all items in the item bank—the universe of generalization. Item validation presents
evidence that our SMEs have ensured a high degree of fidelity between the two domains: target
and universe of generalization.
The basis for item development is the establishment of a target domain from which the universe
of generalization is an operational item pool for test design. As with all other types of validity evi-
dence, documentation of the process for developing content and the target domain is important.
From a practical standpoint, a set of content specifications involves the defining of a con-
struct (such as dental competence or third-grade mathematics). Raymond and Neustel (2006)
and Webb (2006) have described processes for taking an abstract construct from its initial nam-
ing to item and test specifications. For our purposes, the next section is the crux of what we
need for item development. The other activities listed in this part follow from the item and test
specifications.
Besides specification of the content, for example, the test domain or universe to be assessed,
a number of other attributes of the test needs to be defined. These attributes are identi-
fied in the test specifications in addition to specifying the content of the test. As noted test
specifications provide a detailed description of the number or proportions of items that
assess each content and process or skill area, the format of items, responses, scoring rubrics
and procedures; and the desired psychometric properties of the items and tests such as the
distribution of item difficulty and discrimination index, as suggested in the Standards for
Educational and Psychological Testing (AERA et al., 1999, pp. 176–177)
The work of SMEs beginning with a construct definition and the identification of a target domain
results in item and test specifications. Examples of item and test specifications were cited in chap-
ter 3. Table 3.6 presents an ideal generic outline for this document (see chapter 3). The item
and test specifications document is necessary for two important test development activities: item
development and test design. Item and test specifications should be dated and published. This
document is used often in test and item development and is referenced throughout this chapter.
Item Development
As we know, a primary form of validity evidence comes from item development activities. These
activities comprise one major part of item validity evidence. Chapter 2 presents the steps required
to develop test items. Planning is an essential first step. Many steps in item development pre-
sented in that chapter are part of the argument and evidence for item validity.
From a practical perspective, a goal in item development is to create enough items to design as
many test forms as are needed annually. For testing programs with one form, an item pool should
consist of at least 2.5 times the number of items found on one test. For a testing program with two
or more forms, the determination of a minimum number of items is complex. Item banks need
to be large. In computer-adaptive testing, the item bank needs to be very large and have specific
characteristics by which a tailor-made test form is created for each test taker (Davy & Pitoniak,
2006). The goal should be to develop enough test items for any contingency. A large item bank
is insurance: if a test form is exposed, lost, or destroyed, another form can be assembled from
the well-stocked bank.
Regular inventories of test items need to be conducted. The results of an inventory should be
used annually to identify item needs for content categories and types of cognitive demand. Once
the needs are identified in this inventory, item writers can be assigned to develop new items to
fill these needs. For purposes of security, such inventories are not normally disclosed. These are
working documents that are internal to the testing program.
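As a simple illustration of how an inventory might drive item-writing assignments, the sketch below compares hypothetical on-hand counts per content category against target counts taken from assumed item and test specifications; every category name and number is invented.

```python
# Hypothetical item-bank inventory compared against target counts taken from
# (assumed) item and test specifications; all names and numbers are invented.
targets = {"Number sense": 120, "Algebra": 90, "Geometry": 90, "Data analysis": 60}
on_hand = {"Number sense": 95, "Algebra": 88, "Geometry": 40, "Data analysis": 61}

shortfalls = {category: max(target - on_hand.get(category, 0), 0)
              for category, target in targets.items()}

# Largest shortfalls first, so writing assignments can target the thinnest categories.
for category, needed in sorted(shortfalls.items(), key=lambda kv: -kv[1]):
    if needed:
        print(f"{category}: assign writers to draft {needed} new items")
```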
To reiterate, a plan for item development is recommended. The plan should be written
and published and drive all item development activities. Chapter 2 provided the basis for this
planning.
Item-Writing Guidelines
Chapters 6 and 11 presented item-writing guidelines and examples of the use or misuse of each
guideline. Every test item should be subjected to a review to decide if items were properly writ-
ten. Generally, the item-writing guide has a list of acceptable formats and a set of guidelines. As
noted previously, several sets of guidelines exist. Test developers should select those guidelines
that are most important to follow and ensure that the item-writing guide features these guidelines
and examples of violations. The SR guidelines used in this book come from reviews of existing
evidence and research (Haladyna & Downing, 1989a, 1989b; Haladyna, Downing, & Rodriguez,
2002; Haladyna, 2004). The guidelines presented in chapter 11 are new and address CR items.
Many guidelines apply to all items no matter what the format is.
Does using a set of item-writing guidelines make a difference in the performance of test items?
There is a body of research to support an affirmative answer. Downing (2005) and Stagnaro-Green
and Downing (2006) provide evidence on the consequences of violating item-writing
guidelines. Reviews by Haladyna and Downing (1989a, 1989b) and Haladyna, Downing, and
Rodriguez (2002) provide substantial evidence of the effects of item-writing violations. One con-
sequence is increased item difficulty. Another consequence is lower discrimination that dimin-
ishes reliability.
Adopting a set of guidelines and applying these guidelines in item-writing is good advice for
any testing program. Justifying the set of guidelines used based on expert judgment of test spe-
cialists and research is also encouraged. Two types of criteria for supporting any guideline are (a)
consensus from the community of test specialists who work in item development or who write
about this topic, and (b) research. The guidelines recommended by Haladyna et al. (2002) meet
these criteria.
Item-Writing Guide
As noted in chapter 2, the item-writing guide is an official document used by item writers to
help them complete their item-writing assignments. One of the most exemplary item-writing
guides was developed at the National Board of Medical Examiners (http://www.nbme.org/Publications).
This guide is publicly available. It is much more comprehensive than most item-writing
guides. Examples of item-writing guides may be found on the World Wide Web, but most item-
writing guides have serious shortcomings. Many use outdated item formats and list only a few
item-writing guidelines. Some present irrelevant information that may be of general interest but
not pertinent to the task of improving items.
An idealized item-writing guide is presented in chapter 2 in Table 2.5. The item-writing
guide is very useful for training item writers and should also be used as a reference. Describ-
ing the testing program is an important first step. Security is critical, as items should never be
exposed to people outside the testing program. Items should always be secured and transmit-
ted safely. Item writers need to know about the item-writing cycle and the schedule. They need
to follow the schedule. These item writers need to know how to submit items. Acceptable and
unacceptable formats should be explicitly described. In these instances, examples of each are
important. Statements about language complexity are important. Language should be appro-
priate for the group being tested. Cognitive demand is a difficult topic because the cognitive
demand for any test item depends on who is responding to the test item. Nonetheless, the
assignment of a cognitive demand is based on a consensus of SME judgment for a general
population of test takers (not necessarily novices or experts). The item-writing guide should
provide examples of well-written and poorly written items. At the end of the guide, having a
short section on how to use item analysis results to evaluate item performance is advisable,
with decision-making criteria.
Item-Writing Training
In chapter 2, we presented advice on item-writing training in an outline (see Table 2.6). The
training should be conducted for each new cadre of item writers. In some circumstances, hav-
ing the training conducted before an item-writing session is desirable. Item-writing sessions can
be organized in many ways: teams may write together, or item writers may write individually and
then review one another's items as a group. Group reviews of items can take a long time, but they
also provide an important, comprehensive review. Often, item writers
work at home and submit their work electronically. With the use of the World Wide Web, the
speedy and secure transmission of items from writer to reviewers to test sponsor is very effec-
tive and efficient.
Item-Writing
Once item writers have been recruited and trained, each item writer should be given an assign-
ment based on their specific area of expertise and with attention to the item and test specifica-
tions. With any testing program, the needs dictated by the inventory of items in the item bank
motivate the assignments. Item writers usually write in their area(s) of expertise.
As noted previously, many item writers, particularly new ones, experience writer’s block. Rem-
edies have been proposed, such as item shells (Haladyna, 2003; Haladyna & Shindoll, 1989).
As described in chapter 8, item-generating methods are emerging as a technology (see Gierl &
Haladyna, 2012). Any aids that can be created, adopted, or adapted to help new item writers get
started will pay large dividends, as item-writing can be a slow and tortuous process for many
SMEs. These item-writing devices speed up the process, but overuse of these devices ends up
producing many items that look alike. Eventually, these item-writing aids should be abandoned
once the SME gains confidence to write more original items.
Summary
Part I has reviewed concepts, principles, and procedures that address important aspects of item
development and validation. The next section deals with a set of coordinated, specific activities
that we do in high-quality item development and validation. As these reviews are done, items are
improved, and evidence for item validation amasses.
1. Content Review
As noted in chapters 1 and 3 and previously in this chapter, content can exist as a domain of tasks
or a domain of knowledge and skills and application of knowledge and skills in complex ways.
No matter how content is conceptualized, we have two types of content review. Each type is inde-
pendent of the other; each provides very important evidence.
The first type was identified by Messick (1989):
Judgments of the relevance of test items or tasks to the intended score interpretation should
take into account all aspects of the testing procedure that significantly affect test perform-
ance. These include, as we have seen, specification of the construct domain of reference as
to topical content, typical behaviors, and underlying processes. Also needed are test speci-
fications regarding stimulus formats and response alternatives, administration conditions
(such as examinee instructions or time limits), and criteria for item scoring. (p. 276)
In content review, the SMEs must determine if the item truly reflects the content to be tested in
an important way. Some test items may measure trivial knowledge or simple skills, whereas other
test items seem more suitable for the objective of the test. Overall, modern testing programs
seek test items that are more engaging with respect to cognitive demand. Achieving consensus
in judging an item's relevance to the universe of generalization, and thus its membership in the
item bank, is important.
The second type involves the classification of each item according to the content categories
of the item and test specifications. As items are written, each item writer assigns the item to a
content category based on the item classification system. Later in this process, other SMEs review
these items and confirm or reclassify the item given its content.
Professional Testing
As noted in chapter 14, with a credentialing test of a profession, the domain involves tasks per-
formed by a professional as part of everyday practice. Raymond and Neustel (2006) describe how
a survey is conducted to poll professionals about these tasks and their usefulness, importance,
and frequency of performance in the profession. After the practice analysis is conducted, SMEs
direct the development of the item and test specifications that become the basis for classifying
items by content categories.
Achievement Testing
Similarly, in testing programs involving scholastic achievement, Webb (2006) provided an in-
depth discussion of content specification for these kinds of tests. He recognizes that the way stu-
dents learn can be conceptualized in different ways (including cognitive and behavioral learning
theories). He presents four criteria for organizing the review of content for a test:
Webb provides extensive treatment of these four criteria and how item and test specifications are
achieved.
1. Can the reviewers make accurate judgments regarding the content category of each item?
2. How much agreement exists among content reviewers?
3. What factors affect the accuracy of content judgments by the reviewers?
Rigorous research on the content classification consistency of SMEs is rare. Such research would
strengthen our validity evidence for items that enter the item bank.
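If classification data from two reviewers were collected, a simple agreement index such as Cohen's kappa could quantify that consistency. The sketch below is a generic illustration with invented category judgments; it is not a procedure prescribed in this chapter.

```python
from collections import Counter

# Hypothetical content-category judgments for ten items by two SME reviewers.
reviewer_1 = ["Algebra", "Geometry", "Algebra", "Data", "Geometry",
              "Algebra", "Data", "Geometry", "Algebra", "Data"]
reviewer_2 = ["Algebra", "Geometry", "Data", "Data", "Geometry",
              "Algebra", "Data", "Algebra", "Algebra", "Data"]

n = len(reviewer_1)
observed = sum(a == b for a, b in zip(reviewer_1, reviewer_2)) / n

# Chance agreement from each reviewer's marginal category proportions.
p1, p2 = Counter(reviewer_1), Counter(reviewer_2)
expected = sum((p1[c] / n) * (p2[c] / n) for c in set(p1) | set(p2))

kappa = (observed - expected) / (1 - expected)
print(f"exact agreement = {observed:.2f}, Cohen's kappa = {kappa:.2f}")
```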
actual cognitive demand of any item can be discovered by conducting an exercise known as a
think-aloud, which was discussed in chapter 3 and is discussed more extensively at the end of this
chapter. Having a sample of test takers of varying levels of ability talk about their thinking as they
take a test is one effective way to learn about an item’s typical cognitive demand.
We have advocated a simple system for classifying items. For the general population of test
takers, the SMEs judge whether the item requires (a) memory, (b) understanding, or (c) applica-
tion of knowledge and skills in some complex way. Memory items require test takers to recall
facts, concepts, principles, and procedures as presented elsewhere. Understanding requires test
takers to respond to an item in a way that shows comprehension instead of recall. Usually, a
concept, principle, or procedure is presented in some novel way. The application of knowledge
and skills is the highest type of cognitive demand. In Bloom’s cognitive taxonomy, there are four
types of higher-level cognitive demand—application, analysis, synthesis, and evaluation (Bloom
et al., 1956).
Useful validity evidence would reveal the consistency with which SMEs classify items accord-
ing to their cognitive demand. Other evidence might attest to the existence of these categories of
higher-level thinking, such as is done with statistical methods like factor analysis.
Although it may seem a mundane activity, proofing is the most effective method for ensuring that the test in its final
form is perfect.
There are many good reasons for the editorial review. First, edited test items present the cogni-
tive task involved in the item in a clearer fashion than unedited test items. Items should clearly
and accurately present the performance task and options. Second, editorial errors tend to distract
test takers. Because great concentration is needed on the test by the test taker, such errors result
in increased test anxiety. Third, errors reflect badly on the test developer. If there are many edito-
rial errors, the test takers are likely to think that the test falls short in the more important areas of
content and item-writing quality. Thus, the test maker loses the respect of test takers.
There are several areas of concern for the editorial review, shown in Table 16.2 below.
A valuable aid in testing programs is an editorial guide, which is normally several pages of
guidelines about acceptable formats, abbreviations, style conventions, and other details of item
preparation, such as type font and size, margins, etcetera. There are some excellent references
that should be part of the library of a test maker, whether professional or amateur. These appear
in Table 16.3.
A spellchecker on a word processing program is also very handy. Spellcheckers have resident
dictionaries for checking the correct spelling of many words. However, the best feature is the
opportunity to develop an exception spelling list, where specialized words not in the spellcheck-
er's dictionary can be added. Of course, many of these words have to be verified from another
source before they are added. For example, if one works in medicine
or in law, the spelling of various medical terms can be checked in a specialized dictionary, such
as Stedman’s Medical Dictionary or Black’s Law Dictionary. The former has more than 68,000
medical terms and the latter uses more than 16,000 legal terms.
As Baranowski (2006) noted, the Standards are silent about the virtues of editing. We think
editing is a critical step in improving items and should always be done. Poor performance of test
items can be a function of lack of editing. As we can see from other item reviews, the combination
of reviews is a collective effort to produce high-quality, performing items.
5. Fairness
Fairness is a topic that is receiving more and more attention in item development and in test
design for many good reasons. Items are sometimes perceived as unfair to a segment of the popu-
lation for which the test is intended. For example, a test item that describes the mechanics of
using a “block heater” on your car in subzero weather in Minnesota may not be effective in
Phoenix, Arizona.
Background
An excellent treatment of this topic is found in a chapter by Zieky (2006). This chapter draws
from his work and the work of others at the Educational Testing Service (ETS, 2002, 2003, 2004).
ETS has been a leader in the development of policies and procedures for fairness in testing. As he
pointed out, the testing industry paid little attention to fairness in the first half of the twentieth
century, but the civil rights movement is credited with increasing awareness of the importance
of fairness in testing. The Standards for Educational and Psychological Testing (APA, 1985) pro-
vided in standard 3.10 a single paragraph of commentary that explains the nature of item bias and
the responsibility of test makers:
When previous research indicates the need for studies of item or test performance differ-
ences for a particular kind of test for members of certain age, ethnic, cultural, and gender
groups in the population of test takers, such studies should be conducted as soon as is feasi-
ble. Such research should be designed to detect and eliminate aspects of test design, content,
or format that might bias test scores for particular groups. (p. 27)
The current Standards (AERA et al., 1999) devote chapter 7 to fairness and have many more
standards addressing fairness. The latest revision of the Standards, to be published in 2013, gives
greater coverage to fairness. So we are witnessing an appropriate increase in attention to fairness
in testing.
Definition
There is no standard definition of fairness yet. One definition relates to differences in the
probabilities of correctly responding to an item for two groups of equal proficiency on the
construct being measured. This difference is differential item functioning (DIF), which involves
statistical study and is treated in chapter 17. There is an extensive literature on DIF analysis.
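As a preview of the statistical study treated in chapter 17, the sketch below computes a Mantel-Haenszel common odds ratio for a single item, matching reference- and focal-group test takers on total-score strata. The counts are invented, and the ETS delta transformation (-2.35 times the log odds ratio) is shown only for orientation.

```python
import math

# Hypothetical stratified counts for one item, matched on total-score level.
# Each stratum: (reference correct, reference incorrect, focal correct, focal incorrect)
strata = [
    (30, 20, 22, 28),   # low scorers
    (45, 15, 35, 25),   # middle scorers
    (55,  5, 48, 12),   # high scorers
]

num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
alpha_mh = num / den                      # common odds ratio across strata
mh_d_dif = -2.35 * math.log(alpha_mh)     # ETS delta-scale transformation

print(f"MH odds ratio = {alpha_mh:.2f}, MH D-DIF = {mh_d_dif:.2f}")
```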
Another definition relates to tests used to select persons. If a test over-predicts or under-pre-
dicts performance, this bias in prediction leads to invalid selection or classification of those being
tested. As Zieky pointed out, prediction and selection relate to test scores and not test items.
Nonetheless, this form of bias is construct-irrelevant and should be addressed.
A third definition concerns the appearance of items. An item is unfair if a test taker is emotion-
ally affected by its content. This judgment is purely subjective and has little empirical support.
Nonetheless, in the last 30 years, the concept of reviewing items for their emotional effects on test
takers has taken hold and is important in item validation.
A fourth perspective offered by Zieky is that fairness is linked to validity. If an item is unfair to
a group of test takers, it lowers validity. In our jargon, it is construct-irrelevant difficulty (ETS,
2003). All test takers should have equal opportunity for success. If two groups have the same abil-
ity, their mean scores should be equal.
1. Treat people with respect. This includes employing a diverse set of proper names for the
characters in items; they should represent the population of test takers. Ethnocentrism
should be avoided. Test items should acknowledge the diversity inherent in each sub-
group. By achieving each element of fairness presented here, the item writer makes great
strides in treating test takers with respect.
2. Reduce the effects of construct-irrelevant variance (CIV). Haladyna and Downing (2004)
provide an extensive treatment of this topic. CIV is systematic error. Any factor that
increases or decreases the difficulty of the item for a subgroup of those being tested pro-
duces CIV. Such influences should be removed or the item should be retired. These unto-
ward influences might include irrelevant or difficult-to-understand charts or graphs, dif-
ficult or inappropriate vocabulary, specialized vocabulary, technical language, inappro-
priate or undocumented acronyms, religious or cultural references, and the like. The ETS
guidelines (ETS, 2003) provide more specific instances.
3. Avoid emotionally charged content. This advice is based on subjective judgments and
requires the use of SMEs with sensitivity to potentially upsetting material. Some offensive
topics to avoid include abortion, cultural taboos, genocide, Halloween, murder, political
view, rape, sexual orientation, supernatural events, suicide, religion, and torture.
4. Use appropriate terminology. This fairness review category is aimed at distinctions we
draw in describing people: African-American, Black, Negro, or Colored? Native Ameri-
can or Indian? In most instances, such designations for people are irrelevant. However, if
these designations are used, they should be applied with respect for the group of people being
identified. Maintaining parallelism is also important. For example, characters in test item
vignettes should be of equal status when it comes to gender, race, and ethnic background.
General Mills in the U.S. Army can be a man or a woman. However, it would not be proper
to use terms such as a male nurse or woman scientist. Male-dominated terms such as man-
made are unacceptable.
5. Avoid stereotypes. Although we often forget, we should not assume that a character in a
vignette is of one gender category or one ethnic classification—for example, that all people
in child care are women, all teachers are women, all firefighters are men, all dental hygi-
enists are women. As stated above, test items, reading passages, and associated materials
should acknowledge the diversity within subgroups.
6. Represent diversity. Historically, we have depicted characters in test items as coming from
one race or social class in an innocent, idealistic way. Fairness in testing suggests that more
diversity should be represented. Characters in test items should come from many social
climes and ethnic backgrounds. If a person’s gender, race, education, or social standing is
relevant, the reference should be proportional to the population and fair.
Some final comments on fairness. The principles and intent of fairness review are universal in test
development. We have addressed item validation, but the issue is more pervasive than that. The
SMEs who participate in fairness review should represent a wide constituency. Accommodations
are another issue. Chapter 15 provides an extensive discussion of item development and valida-
tion for test takers with exceptionalities. The quest for fairness is never-ending. As it is subjective,
it is imperfect but necessary.
1. The gap between English language learners (ELLs) and other students can be reduced by
reducing low-frequency vocabulary and simplifying sentence structure.
2. All test items should use clear language and provide sufficient time for a response.
3. Tests should be developed with linguistic complexity in mind, and not as an
afterthought.
4. The education of ELLs should consider language complexity at the time of their instruc-
tion and not after.
What are the linguistic features that may affect comprehension? According to Abedi, we have 14
features. These are briefly summarized in Table 16.4. A more comprehensive discussion of lin-
guistic complexity can be found in a chapter by Abedi (2006). Implications for item development
and the connections to item-writing guidelines are presented in chapter 15.
The table provides only a brief glimpse into the science of linguistic complexity as it affects
ELLs. As test items are being developed for an item bank, a set of principles needs to be embedded
into item-writing training and the item-writing guide to ensure that the language complexity is
appropriate for the population being tested. Documentation of these efforts constitutes validity
evidence.
7. Key Verification
Of all the reviews suggested in this section, none is as important as ensuring that the right answer
is right. That said, the process for getting right answers right is not so easy sometimes. However,
these steps are recommended for ascertaining that the right answer is right.
SR Items
With SR items, the key verification process is straightforward. As an SR item is developed, the
SMEs review the key and ensure that the correct choice is correct and no other choices can be
considered correct. After the item undergoes many reviews and is placed in the item bank, the
item resides with its correct key. This item is field-tested. The key is verified at this stage. At
the operational stage, the test is given and an item analysis is done. The test analyst discovers
that some items are misperforming. Why is it necessary to check the key after all these efforts?
Because several possibilities exist after the test is given and the items are statistically analyzed:
What should be done if any of these three circumstances exist after the test is given? In the unlikely
event of the first situation, all examinees should be given credit for the item or the item should
be removed from scoring. The principle at stake is that no test taker should be penalized for the
test maker’s error. Another possibility is to omit that item from the test and rescale the test—the
practice of rescaling means that the shorter test will constitute the “valid” version.
If the second or third conditions exist, right answers should be rekeyed and the test results
should be rescored to correct any errors created by the problems. The alternate action is to
remove the item, as suggested in the previous paragraph. These kinds of drastic actions can be
avoided through a thorough, conscientious key check.
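One way an item analysis can support such a key check, assuming option-level statistics are available, is to flag items whose keyed option discriminates poorly while a distractor discriminates well. The sketch below is an illustrative screen, not a substitute for SME review; the statistics and the 0.10 screening threshold are invented.

```python
# Illustrative option-level statistics for one SR item: for each option, the
# proportion choosing it and its point-biserial with the rest-of-test score.
option_stats = {
    "A": (0.18, 0.21),
    "B": (0.52, -0.05),   # keyed option
    "C": (0.20, -0.10),
    "D": (0.10, -0.12),
}
keyed_option = "B"

key_r = option_stats[keyed_option][1]
best_distractor_r = max(r for option, (_, r) in option_stats.items()
                        if option != keyed_option)

# Flag the item for a key check when the keyed option discriminates poorly or
# when any distractor discriminates better than the key.
if key_r <= 0.10 or best_distractor_r > key_r:
    print(f"Check the key: keyed r = {key_r:.2f}, "
          f"best distractor r = {best_distractor_r:.2f}")
```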
CROS Items
With CROS items, we have two kinds of right answers. Some CROS items have a single right
answer. Consider the question: What is the capital city of Illinois? Springfield should be the only
correct response. Other CROS items have several right answers that are similar or actually may be
very different. Consider the question: What is the meaning of forte? Scoring CROS items can be
very consistent, as consistent as SR items. However, when shades of right answers exist that call
for judgment or acumen of the scorer, random error can be introduced that will lower reliability.
A remedy and a source of item validity evidence is some check of scoring consistency. For the
one-answer CROS, this level of consistency will be very high, but for the multi-answer CROS or
the shades-of-meaning CROS, this level of consistency will be lower. How much lower may be
important to know and document.
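Assuming two independent scorings of the same responses are available, a simple exact-agreement rate provides a documentable check of scoring consistency. The sketch below uses invented 0/1 scores; a chance-corrected index could be substituted for the multi-answer or shades-of-meaning case.

```python
# Two independent 0/1 scorings of the same set of CROS responses (invented data).
scorer_1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
scorer_2 = [1, 1, 0, 1, 1, 1, 1, 0, 1, 0]

agreements = sum(a == b for a, b in zip(scorer_1, scorer_2))
exact_agreement = agreements / len(scorer_1)
print(f"exact agreement = {exact_agreement:.2f}")  # 0.80 with these data
```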
CRSS Items
With CRSS items, there is considerable variety in possible responses. By the very nature of this
format, rater consistency and rater bias are threats to validity. These two threats are discussed in great detail in
the next chapter (as well as chapter 12 on CR scoring). An important piece of validity evidence
is the development of model answers if the response is written or a checklist of desirable charac-
teristics if the CRSS item calls for a product or performance that needs evaluation. The rubric is
often the key device for rating performance. The rubric’s development is very important. Docu-
mentation of how the rubric was developed is also important. Another remedy is to produce
examples of test takers' responses, also known as benchmarks, that represent each performance
level. That way, through training, the SMEs who score these items know what kind of perform-
ance is expected compared to what is being exhibited.
In the evaluation of a tennis player, for example, there are seven levels (with half-point incre-
ments). Each level has a set of characteristics that a qualified judge (SME) is supposed to apply;
self-ratings are also possible to encourage matching oneself with compatible opponents (see
Table 16.5). Not only is the rubric itself an important source of validity evidence, but how it was
developed and how it was applied are equally important.
Overall, the best strategy for CRSS scoring is to ensure that raters are properly trained and
tested on their rating ability and that the scoring devices have been well developed and thor-
oughly evaluated. Evidence of this kind of thorough item development can contribute greatly
to overall item validation. Documentation of rater training with the rubric and all related tools
(benchmarks) constitutes validity evidence.
8. Answer Justification
The keying of an answer for an SR or CROS item, or the development of a model answer for a CRSS
item, is a process intended to provide the most accurate way to score the item. However, as was
noted in the previous section, there are many threats to validity that arise if the correct answer is
not objectively observed. If SME judgment or ratings come into play, we have doubts about the
veracity of the resulting item score.
With SR and CROS formats, SMEs decide the right answer. With SR items, some options may
be similar to the chosen right answer. With CROS formats, the right answer may have variations
or we may have equally right answers. If the SMEs do not generate all variations, the test taker is in
danger of having the item scored as incorrect. With the CRSS format, the item may elicit different
cognitive demands leading test takers to take very different tacks than intended by the SMEs.
With all this being said about ambiguity in item development and identifying the right answer,
we have a group of experts who should be consulted—the test takers themselves. There has been an emerg-
ing science that uses the responses of test takers to evaluate items and reduce ambiguity, and the
testing Standards recognize information resulting from think-aloud procedures as an important source of
validity evidence regarding response processes.
Think-Aloud
The think-aloud procedure has been used to study the thought (cognitive) processes of test tak-
ers. To put it in the context of a testing program, when a group of items is administered to a
representative, very small sample of test takers, we call this a developmental field test. Its purpose
is to uncover latent ambiguity and improve the item. Think-aloud procedures can take several
forms, including interviews with a single student or focus groups with a small group of students,
and can occur concurrently, while the test is being taken, or retrospectively, immediately following the
test. Usually, the researcher does very little probing, often limited to “keep talking ...” or “what
are you thinking?” However, in other cases, probing the thinking process is an important tool to
elicit the deeper thinking processes. The challenge is to elicit verbal reports of the actual thinking
process, rather than motivating the test taker to change their thinking processes because of being
asked to think-aloud.
The basis for the think-aloud procedure comes from studies of cognitive behavior. Norris
(1990) provided an excellent review of both the history and the rationale for verbal reports of
test-taking behavior. The think-aloud method has found many applications. For instance, it is a
mainstay in research in cognitive science (Bannert & Mengelkamp, 2008). Think-aloud has been
helpful in evaluating e-learning (Cotton & Gresty, 2005). It has been useful as a supplemental,
complementary source of data in research studies (Young, 2005).
However, seeing the link to validity, test specialists have recommended this practice (Hala-
dyna, 2003). Indeed, test taker reports of perceived thought processes involved in answer selec-
tion or answer creation can be very revealing about the actual thought processes involved. Norris
provides a useful taxonomy of elicitation levels, shown in Table 16.6.
Although this method is time-consuming and logistically challenging, it seems well worth the
effort if one is serious about validating test items.
One conclusion that Norris drew from his experimental study of college students is that the use
of the six levels of elicitation of verbal reports did not affect cognitive test behavior. Some benefits of
this kind of probing, he claimed, include detecting misleading expressions, implicit clues, unfamil-
iar vocabulary, and alternative justifiable answers. A study by Farr, Pritchard, and Smitten (1990)
involved reading comprehension of passages using testlets. Various critics of using MC to measure
reading comprehension claim that this format encourages test takers to perform surface reading
as opposed to the more desired in-depth reading. Four distinctly different strategies were identi-
fied for answering these context-dependent passages. The most popular of these strategies was to
read the passage, then read each question, then search for the answer in the passage. Without any
doubt, all test takers manifested question-answering behaviors. In other words, they were focused
on answering questions, as opposed to reading the passage for surface or deep meaning. Although
these researchers concluded that the SR format is a special type of reading comprehension task,
it seems to have general value to the act of reading comprehension. These researchers concluded
that the findings answer critics who contend that surface thinking only occurs in this kind of test.
Further, they say that the development of items (tasks) actually determines the types of cognitive
behaviors being elicited. Their sample included highly able adult readers who used effective read-
ing skills in finding the correct answers. Descriptive studies like this one are rare but they help us
understand better the underlying cognitive processes actually used to answer questions.
As Norris (1990) summarized, verbal reports of test takers serve important functions:
Verbal reports of thinking would be useful in the validation of multiple-choice critical think-
ing tests, if they could provide evidence to judge whether good thinking was associated with
choosing keyed answers and poor thinking was associated with unkeyed answers. (p. 55)
9. Security
The most pernicious influence in testing programs, especially those with the highest stakes,
relates to security. There is a worldwide effort to compromise testing programs. One way is to
copy secure test items and release or sell these items to future test takers. Evidence of the scope
of this problem can be observed through blogs on cheating in the news (see http://www.caveon.
com, for example). Testing companies have extensive forensic methods for detecting cheating
in testing. The copying of test items is the most common means of compromising the
validity of test score interpretations. Technology affords considerable opportunities to steal test
items. Every testing program should have in place a policy regarding security at its central office
and any auxiliary sites. Not only does the theft of test items threaten validity, but the replace-
ment cost can exceed $1,000 per stolen item. For comprehensive treatment of test
security in the context of cheating and the detection of cheating, see Cizek (1999). The point here
is that documented procedures to secure tests and enhance their security constitute validity
evidence.
Summary
This chapter has sought to combine many activities in item development into a cohesive frame-
work for producing evidence to support item validation. As repeatedly stated in this volume, doc-
umentation of these activities, preferably in a technical report, provides validity evidence that can
be employed in a validation of test score interpretations or uses. Any testing program should have
in an archive a series of documents showing that these activities were performed
and reporting the results. As all of this information is part of item validation, the overall validation
process is improved through efforts to document better how items were developed and validated.
These activities are summarized in Table 16.7.
17
Validity Evidence From Statistical Study of
Objectively-Scored Test Items

Overview
Chapters 17 and 18 are complementary. This chapter features the statistical study of item
responses for selected-response (SR) and constructed-response objectively scored (CROS) items.
Chapter 18 does the same for constructed-response subjectively scored (CRSS) items.
The statistical study of item responses provides many insights into item quality and into factors
that threaten validity. First, statistical study provides us with information for evaluating and improv-
ing test items. After that study, we decide to retain, retire, or revise the item. Second, these studies
provide validity evidence for item quality. Third, knowing the characteristics of item responses,
we can scale a test properly. If multiple test forms exist, we want the scale to be the same which-
ever test form is used.
Dimensionality
For example, is writing a collection of related sub-abilities or a unitary ability known as writing? This is a key issue in how the ability is defined. Is the abil-
ity being measured intended to be unidimensional or multidimensional? With any ability, an
argument is needed that is the consensus of SMEs. Once the argument is stated, we can study
item responses to evaluate the dimensionality of item responses either to support or refute the
argument.
A related issue is whether one desires subscores or not. Subscores imply that a measure is
multidimensional, and evidence should be generated at the item response level supporting the
validity of subscore interpretations. Haladyna and Kramer (2003) studied subscore validity for a creden-
tialing test. They found that although a test may seem unidimensional, faint signals for subscores
can be detected and a body of evidence can be assembled to argue for valid interpretation of some subscores. In
theory, an ability may be composed of several highly related sub-abilities. A test measuring this
ability may seem unidimensional, but one can tease out validly interpreted subscores. The need
to validate these subscores based on item responses is challenging. Dimensionality is discussed
in greater detail in chapters 18 and 19. What SMEs contribute to the study of dimensionality is
the hypothesis that these sub-abilities are highly correlated. Upon analyzing data, we ascertain
if the hypothesis is correct. Then, we can investigate ways to extract subscore information that
might be validly interpreted. An excellent discussion of dimensionality can be found in a chapter
by Tate (2002).
Item Characteristics
This chapter is concerned with the statistical study of several item characteristics. The two most com-
mon item characteristics are difficulty and discrimination. Guessing is an item characteristic that
is more difficult to define, study, and measure; there is some controversy about its definition.
The extent of guessing and how to estimate its effect are major issues in item analysis, as is
whether guessing threatens validity.
Sample size is important in the study of item characteristics because with small samples the
standard error of estimate for any descriptive statistic such as item difficulty and discrimination
is large. A general rule of thumb is that for estimating a statistic such as item difficulty or dis-
crimination, 30 test takers is an arbitrary minimum (probably only appropriate for field-testing
items) and, of course, more is better. With large samples, the standard error of estimate is very
small, and we can have greater confidence about an item’s difficulty or discrimination. Some
statistical models, such as item response theory, require much larger samples.
Item Analysis
After responses to SR and CROS items are scored, these scores are subject to item analysis. The
objective is to know each item’s level of difficulty and its ability to discriminate among test takers
with varying levels of ability.
The purposes of item analysis are (a) to evaluate the item during its initial tryout, (b) to verify
the key—correct answer, and (c) to correctly estimate difficulty and discrimination for scaling
for comparability, especially if multiple test forms are used. Knowing an item’s difficulty and
discrimination is a great help in test design.
As mentioned previously, whether the scaling involves CTT or IRT, simple descriptive statis-
tics suffice for evaluating items. However, if scaling involves IRT, item parameter estimates must be used. The results
will be nearly identical to CTT-produced test scores, but scaling for comparability is done more
conveniently with IRT.
Item Difficulty
The most fundamental measure of item difficulty is the proportion (or percentage) of test tak-
ers responding correctly. This statistic is known as the item p-value. The p-value has a floor and
ceiling. For SR items, the floor depends on the number of options. For a three-option item, the
theoretical floor is 33%. That is, if one guesses randomly, the expected test score is 33%. For a
true–false (or two-option) item the floor is 50%. For CROS items, the floor is zero. The ceiling is
always 100%.
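As a minimal sketch of the p-value computation (the data and variable names below are hypothetical), the proportion correct for each item is simply the column mean of a matrix of scored responses:

```python
import numpy as np

# Hypothetical scored responses: rows are test takers, columns are items,
# 1 = correct, 0 = incorrect.
scores = np.array([
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
])

# The p-value of each item is the proportion of test takers answering correctly.
p_values = scores.mean(axis=0)
print(p_values)  # [0.6 0.4 0.6 1.0]
```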
Every item has a natural difficulty. This value is based on the performance of all persons we
intend to test. This p-value is very difficult to estimate accurately unless a very representative
group of test takers is being tested. This is one reason that CTT is criticized. The estimation of the
p-value is potentially biased by the sample. If the sample contains well-instructed, highly trained,
or highly able people, then the test and its items appear very easy, potentially with p-values
above .90, showing that more than 90% answer correctly. If the sample contains uninstructed,
untrained, or low-ability test takers, then the test and the items appear very hard, usually with
p-values below .40. In both instances, a false conclusion about item difficulty might be drawn.
Item difficulty is sample-specific.
With IRT, item difficulty can be estimated without consideration of exactly who is tested.
The claim is that IRT provides sample-free (sample invariant) estimation. This is generally true,
except in instances where instruction occurs and some content is well taught and other content
is not well taught and learned (Haladyna, 1974). In the IRT perspective, the difficulty of the item
reflects the ability level needed to
have a reasonable probability of correctly responding to the item. The estimation of difficulty is based
on this idea, and the natural scale for difficulty in IRT is stated in log-odds units (logits)
that roughly vary between –3.00 and +3.00, with negative values being interpreted as easy
(requiring less ability) and higher values being interpreted as hard (requiring more ability).
For convenience, test scores are on the same scale.
The correlation between p-values and IRT difficulty estimates is nearly perfect. Figure 17.1
provides a scatter plot of item difficulty estimates for a scholastic achievement test. The correla-
tion is perfect if one uses a nonlinear correlation method because the regression line is nonlinear.
Note that the CTT difficulty scale is directly interpretable whereas the IRT scale (“Rasch” in Fig-
ure 17.1) is inverse. Moreover, IRT estimates are typically expressed on a logistic scale, which is
difficult to interpret. Percentage correct is easy to interpret.
Figure 17.1 Scatter plot of CTT item difficulty (PVALUE) and Rasch (IRT) difficulty estimates for a scholastic achievement test.
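The inverse, nonlinear relation between the two scales can be sketched numerically. Under the Rasch model, the probability of a correct response at ability θ for an item of difficulty b is exp(θ − b)/(1 + exp(θ − b)); for a group centered near θ = 0, a rough logit difficulty implied by a p-value is ln((1 − p)/p). The code below is only an illustration of that mapping, not an estimation procedure, and the values used are hypothetical.

```python
import math

def rasch_prob(theta: float, b: float) -> float:
    """Rasch model: probability of a correct response at ability theta
    for an item with difficulty b (both in logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def approx_logit_difficulty(p: float) -> float:
    """Crude logit difficulty implied by a p-value for a group centered at theta = 0."""
    return math.log((1.0 - p) / p)

for p in (0.9, 0.7, 0.5, 0.3):
    b = approx_logit_difficulty(p)
    print(f"p-value {p:.2f} -> approx. Rasch difficulty {b:+.2f} logits "
          f"(P(correct) at theta = 0: {rasch_prob(0.0, b):.2f})")
```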
The easiest way to understand item difficulty is the item p-value (proportion correct). Know-
ing the sample upon which the item difficulty is based is important. The IRT scale is confusing
to many and does not provide easily interpretable results. Moreover, sample sizes should be sub-
stantial to obtain an accurate estimate (Reckase, 1978).
However, under extreme circumstances, the ability to accurately estimate item difficulty
using IRT fails. When the sample consists of only individuals who are highly trained, untrained,
instructed or uninstructed, scores might be so extreme as to prevent adequate estimation of item
difficulties (Haladyna & Roid, 1983). IRT models have additional assumptions that must be met
before IRT item parameters can be interpreted, including an assumption of unidimensionality.
Unidimensionality is required because the item difficulty parameter is about a single ability—in
the context of IRT, item difficulty tells us about the ability level required to have a given prob-
ability of correctly responding to the item.
If we better understood the causes of item difficulty during item
development, we would have more intelligent control of item difficulty and, perhaps, overall item
quality as well.
Another possible influence on a p-value is the extent to which instruction, training, or development
has occurred with those being tested. An item can appear easy, moderate, or difficult based on the
group of test takers who take the test. This topic is instructional sensitivity, and it is discussed later
in this section, because it relates to item discrimination.
Chapter 8 introduced automatic item generation. As reported in that chapter, item develop-
ment theorists are working on theories and an emerging technology for generating large num-
bers of SR items with predictable item difficulty. In fact, a claim is made that such items need not
be field-tested because the item’s difficulty and discrimination may be stable because of the item
generation method. However, the technology for this kind of control over item characteristics is
in a very early stage (Gierl & Haladyna, 2012).
Recommendation
For purposes of item analysis of SR and CROS items, the p-value provides the most useful and
universally understood measure of item difficulty. However, an important caveat is to consider
and understand the sample upon which the p-value is based. The sample must represent the
population being tested.
Item Discrimination
Item discrimination describes the item’s ability to measure individual differences that truly exist
among test takers. As each item mimics the total test, we expect the item response to be highly
correlated to total test scores. If we know test takers may differ in their ability, then each test item
should mirror this tendency for test takers to be different. Thus, with a highly discriminating test
item, those choosing the correct answer must necessarily differ in total score from those choosing
any wrong answer. This is a characteristic of any measuring instrument where test items are used.
Those possessing more of the trait should do better on the items comprising the test than those
possessing less of that trait.
When one variable is dichotomous (the item score) and the other
is continuous (test score), the name of the product-moment correlation is point-biserial. The
formula is provided below to increase understanding of how this index discriminates. All
computer and item analysis programs compute this index, as it is the Pearson product-moment
correlation. However, since the item score is part of the total score, these correlations are spuri-
ously high because of the common component. When item discrimination is estimated based on
item-total score correlations, the correlation should be corrected by removing the item score from the
total score prior to estimating the correlation.
Point-biserial discrimination index:  r_pb = [(M1 − M0) / s] × √(n1 n0 / n²)

where
M1 is the mean total score of all test takers who answer the item correctly,
M0 is the mean total score of all test takers who answer the item incorrectly,
s is the standard deviation of the total scores,
n1 is the number of test takers in the right-answer group,
n0 is the number of test takers in the wrong-answer group, and
n is the total number of test takers.
As the formula shows, the key feature is the difference in the means of test takers who answer
correctly and incorrectly. A large mean difference (M1 – M0) indicates high discrimination, and
a small mean difference indicates low discrimination. This coefficient can be positive or nega-
tive and ranges between –1.00 and +1.00. Test items with low or negative coefficients are very
undesirable.
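A minimal sketch of the computation follows, using the corrected (item-removed) total score as the criterion; the data and names are hypothetical, and the standard deviation is the population (n-denominator) form used in the formula above.

```python
import numpy as np

def corrected_point_biserial(item: np.ndarray, total: np.ndarray) -> float:
    """Point-biserial between a 0/1 item score and the total score with the
    item's own contribution removed (to avoid the spurious common component)."""
    rest = total - item                      # corrected total score
    m1 = rest[item == 1].mean()              # mean of those answering correctly
    m0 = rest[item == 0].mean()              # mean of those answering incorrectly
    n1, n0 = (item == 1).sum(), (item == 0).sum()
    n = item.size
    s = rest.std()                           # population SD of the corrected totals
    return (m1 - m0) / s * np.sqrt(n1 * n0 / n**2)

rng = np.random.default_rng(0)
scores = (rng.random((200, 20)) < 0.6).astype(int)   # hypothetical 20-item test
totals = scores.sum(axis=1)
print(round(corrected_point_biserial(scores[:, 0], totals), 3))
```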
Before interpreting a coefficient, a product-moment correlation is subject to a test for statisti-
cal significance. That test should be directional (one-tailed), because these indexes should be
positive. However, most testing programs have many test takers so testing for statistical signifi-
cance is seldom a problem. Even the smallest coefficient will be statistically significant. However,
if the sample size is small, less than 50, statistical significance is an important condition that
needs to be met before interpreting a discrimination index. That is why a one-tailed test should
be used. It is customary when rejecting a null statistical hypothesis to establish an effect size. For
item discrimination, the effect size is an arbitrary value. For an operational testing program, the
minimum acceptable discrimination may be .10 or .15 or .20.
The biserial correlation is sometimes used as the discrimination index. Its computation is
more complex, and it offers no better information than the point-biserial. Attali and Fraenkel
(2000) pointed out that item discrimination indexes are highly interrelated, so it matters little
which one is used. The point-biserial is most directly related to reliability estimation (see Ebel,
1967; Nunnally, 1967; Nunnally & Bernstein, 1994). An important warning is that the test should
measure a single ability and not many abilities or sub-abilities. An important principle is that
tests should never be built only on the discriminating ability of their items to maximize reliability,
because content matters more. Tests should be designed to satisfy the item and test specifications
primarily. Secondarily, we want these items to be highly discriminating to maximize reliability.
Sometimes, we ignore discrimination because an item represents some content that was learned
by everyone or nearly everyone in the sample. Its p-value will be very high, and SMEs will rightly
argue that the item is valid.
For item one in Table 17.2, the difference between the mean total score of those choosing the right answer and the mean of those
choosing any wrong answer or responding incorrectly is 30%; the first item is highly discriminating. Item two shows a 10% difference
between these two groups. The second item has moderate discrimination. Item three has no dif-
ference between the two groups. The third item is non-discriminating. Item four has a negative
discrimination. Item four is very rare in testing because negative discrimination would suggest a
scoring error, such as the wrong key. As Table 17.2 shows, the data are very easy to understand
and require no statistical inference.
Another tabular method is useful for studying the discrimination across the full test score scale
of those being tested. This method has implications for distractor analysis, which is discussed
later in this chapter, and for the study of discrimination with IRT. Table 17.3 presents 10 ordered
score groups with 200 test takers in each group. In the low group (score group 1), 70% (140)
missed this item and 30% (60) answered correctly. In the highest group (10), 72% (144) got the
item right and only 56 (28%) missed the item. This item is discriminating but also difficult.
Table 17.3 Frequency of Correct and Incorrect Responses for 10 Score Groups (N = 2,000)

Item Score    Total Score Group (1 is lowest-score group, 10 is highest-score group)
                1     2     3     4     5     6     7     8     9    10
    0         140   138   130   109    90    70    63    60    58    56
    1          60    62    70    91   110   130   137   140   142   144
The choice of the number of score groups is arbitrary, but knowing the number of test takers
helps decide if that number should be five score groups or 10. Having more than 10 score groups
does not provide better information and is also more difficult to present in tabular form. Also,
more score groups require very large samples.
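A sketch of how such a table can be built from raw data (hypothetical responses; roughly equal-sized groups formed by ranking on total score):

```python
import numpy as np

def score_group_table(item: np.ndarray, total: np.ndarray, n_groups: int = 10):
    """Count incorrect (0) and correct (1) responses within ordered score groups."""
    order = np.argsort(total, kind="stable")      # rank test takers by total score
    groups = np.array_split(order, n_groups)      # roughly equal-sized groups
    table = []
    for g in groups:                              # group 1 = lowest scores
        correct = int(item[g].sum())
        table.append((len(g) - correct, correct))  # (n incorrect, n correct)
    return table

rng = np.random.default_rng(1)
ability = rng.normal(size=2000)
total = (ability[:, None] + rng.normal(size=(2000, 40)) > 0).sum(axis=1)  # fake totals
item = (ability + rng.normal(size=2000) > 0.3).astype(int)                # fake item
for i, (wrong, right) in enumerate(score_group_table(item, total), start=1):
    print(f"group {i:2d}: wrong={wrong:3d}  right={right:3d}")
```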
Table 17.2 is a very simple presentation for item discrimination in tabular form. Table 17.3
is not a very effective presentation, but it is useful for the next way to display discrimination—
graphical.
Figure 17.2 Frequencies of wrong and right responses to an item across ten score groups (Low-1 to High-10).
IRT Discrimination
The basis for understanding discrimination in item response theory is the trace line introduced
in Figure 17.2. It is a plot of performance of test takers as a function of their total score. The
pattern is the same for item-level data. Consider the trace line from Winsteps for a SR item,
here called the ICC in Figure 17.3. For any test score level (measure), the probability that the
test taker will respond correctly to the SR or CROS item is given (score on item from 0 to 1.0).
Low scoring test takers have a low probability of responding correctly; high-scoring test takers
have a high probability of answering correctly. In IRT, the discrimination is technically defined
at the point of inflection of this curve, where the slope is steepest. With the Rasch model (as
estimated by Winsteps), this occurs at the ability level where the test taker has a 50% chance of
correctly responding to the item. This ability is associated with the CTT item p-value. The actual
index of discrimination is a function of the slope of the line at this point. Unlike the point-bise-
rial discrimination index, which is a correlation, the IRT discrimination index tells us where on
the ability continuum the item provides the most information, the ability at which it maximally
discriminates.
Figure 17.3 Item characteristic curve (ICC) for a SR item (q14) from Winsteps: score on item (0 to 1.0) as a function of measure.
The ICC provides an illustration of the continuous relation between level of ability (measure)
and probability of correct response. The trace line illustrates how the item discriminates for each
score group. The tabular method also shows the same item response pattern.
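The point-of-inflection idea can be illustrated numerically. For the Rasch ICC, the slope at ability θ equals P(θ)(1 − P(θ)), which is largest (0.25) where P = .50, that is, where θ equals the item difficulty. The sketch below (with a hypothetical difficulty value) simply evaluates the curve and its slope:

```python
import math

def icc(theta: float, b: float) -> float:
    """Rasch item characteristic curve: P(correct | theta) for difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def icc_slope(theta: float, b: float) -> float:
    """Slope of the Rasch ICC; steepest at theta = b, where P = .50."""
    p = icc(theta, b)
    return p * (1.0 - p)

b = 0.5  # hypothetical item difficulty in logits
for theta in (-2.0, -1.0, 0.0, 0.5, 1.0, 2.0):
    print(f"theta={theta:+.1f}  P={icc(theta, b):.2f}  slope={icc_slope(theta, b):.3f}")
```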
The two-parameter and three-parameter binary scoring IRT models yield different discrimi-
nation parameters, which is a bit confusing. The two-parameter item discrimination is highly
correlated with the point-biserial discrimination index. It appears that the two procedures pro-
duce about the same information. However, the three-parameter discrimination is affected by the
floor of the item scale. The third parameter in the three-parameter IRT model is pseudo-chance,
which affects the estimate of discrimination. Thus, three-parameter discrimination is slightly
different from the point-biserial and the two-parameter discrimination.
Fit
A concern among test analysts is how well the data fit the model being used. The interest is in
how well the IRT model represents the data at the person or item levels. We have many methods
of study of fit. Embretson and Reise (2000, pp. 233–238) provide a useful and detailed discussion
of approaches to studying fit. They stated that “there is no procedure that results in a researcher
stating definitively that a particular model does or does not fit, or is not appropriate” (p. 233).
The one-parameter Rasch model requires that item responses fit the model. The fit statistic
estimates the extent to which item responses follow the trace line for the right answer. Items
having statistically significant misfit are subject to further study to decide whether the item is
retained, revised, or retired. IRT users recommend that the fit statistic be used as a tool for evalu-
ating items, much like we use difficulty and discrimination in classical item response analysis. A
misfitting item is a signal to SMEs to review the item and determine if it presents any threat to
validity. The fit statistics are not related to item discrimination, but the fit statistic gives SMEs a
different perspective for studying the way an item behaves.
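As a rough sketch of the idea behind one common fit statistic, the unweighted (outfit) mean square, squared standardized residuals of observed responses from the trace line are averaged; values near 1.0 indicate responses that follow the model, and values far above 1.0 flag misfit. The abilities, difficulty, and data below are hypothetical and assumed to be already estimated.

```python
import numpy as np

def outfit_mean_square(x: np.ndarray, theta: np.ndarray, b: float) -> float:
    """Unweighted (outfit) mean-square fit for one Rasch item.

    x     : 0/1 responses to the item
    theta : estimated abilities of the same test takers (logits)
    b     : estimated item difficulty (logits)
    """
    p = 1.0 / (1.0 + np.exp(-(theta - b)))   # model-expected probability (trace line)
    z2 = (x - p) ** 2 / (p * (1.0 - p))      # squared standardized residuals
    return float(z2.mean())                  # about 1.0 when responses follow the model

rng = np.random.default_rng(2)
theta = rng.normal(size=1000)
b = 0.0
x_fit = (rng.random(1000) < 1 / (1 + np.exp(-(theta - b)))).astype(int)  # model-consistent
x_misfit = rng.integers(0, 2, size=1000)                                 # random responses
print(round(outfit_mean_square(x_fit, theta, b), 2))     # close to 1.0
print(round(outfit_mean_square(x_misfit, theta, b), 2))  # noticeably larger than 1.0
```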
Items need to have a common factor to function appropriately. The lack of a common factor
clouds our interpretation of the test score. As previously stated in this chapter and throughout
this volume, the study of dimensionality is critical to item development and validation. The defi-
nition of content and the work of SMEs is vital for establishing understanding of the dimension-
ality of the ability being measured.
Item discrimination is most accurate when the criterion (test score) is related to item perform-
ance. If an ability is multidimensional, the total score is an aggregate that is usually weighted. A
standard item analysis will accurately estimate difficulty but discrimination will likely be very low
and disappointing. In the IRT framework, item difficulty is equally challenging to define when a
score is multidimensional. In IRT, the item difficulty is a function of the ability required to cor-
rectly respond to the item—but if the measure is multidimensional, which ability is being used to
estimate item difficulty? An item may have multiple difficulties. Consider a mathematical word
problem—it may be a relatively easy reading task, but a difficult mathematics task. In a different
challenge, a science test composed of biology and chemistry content will confound biology-based
and chemistry-based abilities.
The remedy is to first identify the sub-abilities of the construct. Item and test specifications
should clearly show the sub-abilities and their appropriate weighting. The subscore interpreta-
tion should be validated (Haladyna & Kramer, 2004). After items are developed and field-tested,
the item analysis should be conducted using each subscore as the criterion instead of the total
score. The resulting coefficients may resemble the coefficients when the total score was used, but
these subscore discriminations will be higher and more accurate. The computer program ITE-
MAN (previously cited in Table 17.1) does an excellent subscore item analysis for SR, CROS, and
survey items using rating scales.
Instructional Sensitivity
Discrimination has been defined as the relationship between item and test performance. Instruc-
tional sensitivity is the tendency of items to discriminate with respect to the effectiveness of
instruction. An uninstructed group of students will find an item based on the content of this
instruction to be very hard, and an instructed group will find this item very easy. An index of
instructional sensitivity is the pre-to-post difference index (PPDI). Figure 17.4 shows an idealized dis-
tribution of test scores for an instructional setting. As items are the building blocks of tests, item
responses should mimic test scores. We have a variety of instructional sensitivity indexes at the
item level as well (Haladyna, 1974; Haladyna & Roid, 1981; Herbig, 1976; Polikoff, 2010).
Instructional sensitivity is especially helpful for analyzing several important instructional con-
ditions. After the PPDI index is computed, there are competing hypotheses to consider, as the examples below illustrate.
The interpretation of PPDI is best done by SMEs, preferably those very close to instruction, like
teachers. For instance, the first group can be typical students who have not yet received instruc-
tion, whereas the second group can be typical students who have received instruction.
Table 17.4 is a simple illustration that suggests that the item is moderately difficult (60%)
for a typical SR item. The change in difficulty for the two conditions represents an estimate of
how much learning occurred from the pre-instruction to the post-instruction. This item displays
instructional sensitivity.
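A minimal sketch of the PPDI computation from pre- and post-instruction item responses (hypothetical data):

```python
import numpy as np

def ppdi(pre_responses: np.ndarray, post_responses: np.ndarray) -> float:
    """Pre-to-post difference index: post-instruction p-value minus
    pre-instruction p-value for the same item (0/1 scored responses)."""
    return float(post_responses.mean() - pre_responses.mean())

# Hypothetical item: 30% correct before instruction, 85% correct after.
pre = np.array([1] * 30 + [0] * 70)
post = np.array([1] * 85 + [0] * 15)
print(ppdi(pre, post))   # 0.55 -> instructionally sensitive
```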
A single item is an undependable measure of overall learning. Also, a single item is biased by its
intrinsic difficulty. Aggregating several items measuring the same content to make an inference
about instructional effectiveness or growth is far better. Other conditions exist for this index that
provide useful descriptive information about item performance.

Figure 17.4 Idealized distributions of test scores (frequency by score) before and after instruction.
The data in Table 17.5 suggest ineffective instruction or lack of opportunity to learn. A second
plausible and rival explanation is that the item is so difficult that few students can answer it
correctly, despite the effectiveness of instruction. A third plausible hypothesis is that the item is
unrelated to the content being taught. Therefore, no amount of instruction is relevant to per-
formance on the item. The item could be unrelated to the instructional or learning objectives.
The instructional designer and test designer must be careful to consider other, more plausible
hypotheses and reach a correct conclusion. Often this conclusion is augmented by studying the
performance patterns of clusters of items. Having a single item perform like the previous one
is weak evidence of successful instruction, but having all items perform as just shown is strong
evidence for effective learning and validly interpreted test scores. A single item may be unneces-
sarily difficult, but if all items perform similarly, then the problem may lie with instruction, or the
entire test may not reflect the desired content.
Table 17.6 shows a PPDI of zero just like the previous example. Unlike the previous example,
however, the performance of pre-instruction and post-instruction samples is high. Several rival-
ing hypotheses may explain this performance. First, the material may have already been learned,
which is why both uninstructed and instructed groups perform so well. Second, the item may
have a fault that is cuing the correct answer; therefore, most students are picking the right answer
no matter whether they have learned the content represented by the item. Third, the item is
inherently easy for everyone. The item fails to reflect the influence of instruction, because the
item has an obvious right answer that instruction does not affect.
These three examples show the interplay of items in an instructional setting. Knowing how
students perform before and after instruction informs the test designer about the effectiveness of
the items and also instruction.
Polikoff (2010) provided a good discussion of many instructional sensitivity indexes. His
recent review of research leads us to think that PPDI is a good choice, but other indexes provide
similar information. Such information should never be considered without knowing the context
for these item responses. Thus, knowing what instruction preceded testing and if the test items
are measuring the content and cognitive demand identified in the item and test specifications is
very important. We might consider each student’s instructional history as we evaluate perform-
ance on the total test and each item contributing to the total test score.
Table 17.7 Some Arbitrary Standards for Evaluating Item Statistics for SR and CROS Items
Type Difficulty Disc. Comment
1 .60 to .90 > .15 Ideal item. Moderate difficulty and high discrimination
2 .60 to .90 < .15 Poor discrimination
3 Above .90 Disregard High performance item; usually not very discriminating
4 < .60 > .15 Difficult but very discriminating
5 < .60 < .15 Difficult and non-discriminating
6 < .60 < .15 Identical to type 5 except that one of the distractors has a pattern like type 1,
which signifies a key error
Type 2 items should be avoided. These items do not contribute much to reliability. Type 3
items are correctly answered by most test takers, and thus not much information is gained. If the
content of a type 3 item is deemed important by SMEs, however, such items are kept in the item
bank. Type 4 items are retained if the SMEs can overlook the extreme difficulty of the item. Using
many items of this type will keep reliability high but the distribution of test scores may be in the
middle of the scale. Type 5 items are very undesirable. Such items should be retired or revised.
Type 6 items are likely to be key errors.
Guessing
Guessing has been a source of concern for testing involving SR items. With the recommendation
in this volume to restrict conventional SR items to three options, the threat of guessing may seem
increased. We have two hypotheses about guessing and we can use probability to decide to what
extent guessing is a threat.
The first hypothesis is that random guessing is characteristic of low-scoring test takers, and
high-scoring test takers guess very infrequently (Lord, 1977). Random guessing introduces ran-
dom error into testing, which decreases reliability (Lord, 1964). For low-scoring test takers, how
serious can random guessing be as a source of random error? For a three-option SR item, the
random error is constrained about a true score of 33.3%. Thus, a test taker with a true score at
or near 33.3% for that domain will randomly guess and by chance may score higher or lower
than the expected 33.3%. The amount of random error due to random guessing is a function of
the number of items in a test. The suspicion is that a random guesser might obtain a high score
simply by guessing. This result is very implausible. For a four-option, 10-item test, the probability
of getting 10 correct random guesses is .0000009. For a 10-item true–false test, the probability
of scoring 10 is .001. As we predict from probability theory, random error from guessing is very
small and inconsequential. In other words, a random guesser has very little chance of getting a
high test score. On longer tests, you could equate such a result to winning a lottery.
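The improbability of a high score from random guessing can be checked directly with the binomial distribution. The sketch below computes the chance of guessing at least k items correctly on an n-item test with m equally attractive options per item; the test lengths used are hypothetical.

```python
from math import comb

def p_at_least_k_by_guessing(n_items: int, n_options: int, k: int) -> float:
    """Probability of at least k correct answers from purely random guessing."""
    p = 1.0 / n_options
    return sum(comb(n_items, j) * p**j * (1 - p)**(n_items - j)
               for j in range(k, n_items + 1))

print(p_at_least_k_by_guessing(10, 4, 10))   # about 9.5e-07 (all 10 correct, 4 options)
print(p_at_least_k_by_guessing(10, 2, 10))   # about 0.001 (10-item true-false test)
print(p_at_least_k_by_guessing(50, 3, 35))   # a 70% score on a 50-item, 3-option test
```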
The second hypothesis, and the more plausible one, is that test takers have partial knowledge of
the content of the test item. With SR items, distractors should be plausible. However, research has
shown repeatedly that many distractors are dysfunctional (Haladyna & Downing, 2003; chapter
5 this volume). Thus, the typical low-scoring test taker, not being sure of the right answer but
having some degree of testwiseness or some limited information related to the test content, can
eliminate some distractors as very implausible. Support for the partial knowledge hypothesis
is easy to generate. For any item analysis, the computing of the mean for each option chosen
will show that some options have high or low choice means. In other words, the test takers as
a group are discriminating among the options. Random guessing plays little or no role in these
instances. A very sensitive analysis of distractor performance should reveal exactly what Lord
(1977) observed: high-scoring test takers seldom guess, and only one distractor is usually chosen
because it is the most plausible. With low-scoring test takers, by contrast, all distractors are in play
and random guessing is more frequent.
Correction for guessing formulae applied to SR items adds a degree of risk to test taking that
introduces gender bias (males are more inclined to guess than females). In other words, cor-
rection for guessing is an instance of construct-irrelevant variance and a threat to validity. As
guessing has little consequence on a true score and the degree of random error is small, the use of
two-option or three-option SR items should be considered and retained for testing even in high-
stakes situations. As we know the floor of the scale for a two-option item is 50% and the floor
of the scale for three-option items is 33%, this information can be used for standard-setting and
interpretation of a test score.
Distractor Discrimination
A fundamental principle in developing SR items is that each distractor should be a plausible
wrong answer and, ideally, distractors should be equally plausible. The empirical basis for plausi-
bility is that a distractor should appeal to a low scorer and be avoided by a high scorer. Distractors
that are seldom chosen by any test taker are useless and should be removed or replaced. These
kinds of distractors are likely to be so implausible to all test takers that hardly anyone would
choose them.
More than 50 years of continuous research has revealed a relationship between distractor
choice and total test score (Haladyna, 2004; Levine & Drasgow, 1983; Nishisato, 1980; Thissen,
1976). Figure 17.2 shows the reciprocity between the discriminating ability of the right answer
and the collective discrimination of the distractors. Distractor discrimination is the relationship
between distractor choice and total test score. A measure of discrimination informs us as to the
effectiveness of the distractor to “distract” students with low ability.
The first steps in evaluating distractor discrimination involve SMEs and their work in item
development and the many reviews of item features in the previous chapter. Expert, consensus
judgment is critical in determining the rightness and wrongness of each option. But research
shows that this is not enough (Haladyna & Downing, 1993). Distractor analysis provides insights
into potential errors of judgment and inadequate performances of distractors that SMEs did not
anticipate. Distractors that fail to perform should be revised, replaced, or removed. Thus, the
objective is to detect poorly performing distractors and then take remedial action.
As with item discrimination, three different yet complementary ways exist to study responses
to distractors. First, there is the tabular presentation of a frequency table that provides a tabula-
tion of option choices as a function of ordinal score groups (Levine & Drasgow, 1983; Wainer,
1989). Second, there is a trace line for each option. Third, a family of statistical indexes can be
used to decide which distractors are working.
Tabular Presentation
Any set of test scores can be divided into score groups that consist of approximately the same
number of test takers. In the hypothetical example shown in Table 17.8, a 100-point test was given to 500 test takers; the score groups are formed by ranking all 500 test takers and selecting the
first 100 for the first group, the next 100 for the next group, and so on. Most computer programs
with item analysis features can supply this information (e.g., ITEMAN, LERTAP, TESTFACT),
as can most general statistical packages.
Option A is the correct answer. It discriminates very well: the higher the score group, the
higher the rate of selection of this option. The low-scoring group chooses the correct answer 20%
of the time, whereas the high-scoring group chooses it 78% of the time. Options B and C have the
reverse trend. Higher score groups have a lower frequency of choosing these distractors. This is
a very healthy-performing item. If the item has gone through systematic review by SMEs as dis-
cussed in the previous chapter, the item is validated and ready to be included in the operational
item bank for use on future tests.
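A sketch of how an option-by-score-group tabulation like Table 17.8 can be produced from raw option choices (all data below are hypothetical):

```python
import numpy as np

def option_by_group_table(choices, totals, options=("A", "B", "C", "D"), n_groups=5):
    """Percent choosing each option within ordered total-score groups."""
    choices = np.asarray(choices)
    order = np.argsort(totals, kind="stable")
    groups = np.array_split(order, n_groups)          # group 1 = lowest scores
    rows = {}
    for opt in options:
        rows[opt] = [round(100 * np.mean(choices[g] == opt), 1) for g in groups]
    return rows

rng = np.random.default_rng(3)
totals = rng.integers(20, 100, size=500)              # hypothetical total scores
# In this fake data, higher-scoring test takers choose "A" (the key) more often.
p_correct = (totals - totals.min()) / (totals.max() - totals.min())
choices = np.where(rng.random(500) < p_correct, "A",
                   rng.choice(["B", "C", "D"], size=500))
for opt, pcts in option_by_group_table(choices, totals).items():
    print(opt, pcts)
```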
Trace lines for options A, B, C, and D: number of test takers choosing each option in each of five score groups.
Trace lines provide an easy-to-read summary of the performance of the option choices so that
SMEs can evaluate how the item is performing. The accompanying tabular presentation provides
the basis for the trace lines and provides a complementary view of the same phenomena.
Statistical Methods
Several statistical methods can be used to study distractor performance. These methods do not
necessarily measure the same characteristic (Downing & Haladyna, 1997). Among these meth-
ods, two have serious shortcomings and probably should not be used, whereas the last, which is
derived from the tabular presentation of option choices, is the best.
The first is the product-moment correlation between distractor performance and total test
score. This is the point-biserial discrimination index that is found in most item analysis compu-
ter programs. This index considers the average performance of those selecting the distractor ver-
sus the average of those not selecting the distractor. A statistically significant positive correlation
is expected for a correct choice, whereas a statistically significant negative correlation is expected
for a distractor. Low-response distractors are eliminated from this analysis; the low response
would suggest that such distractors are so implausible that only a few random guessers would
select this option. Decreasing distractors would produce negative correlations, but are subject to
the test for statistical significance. Nonmonotonically decreasing distractors result in a correla-
tion that is more likely not to be statistically significant, because the nonmonotonic trace line
mimics the trace line of a correct answer in part. Also, because the number of test takers choosing
distractors is likely to be small, the statistical tests lack the results to reject the null hypothesis that
the population correlation is zero. To increase the power, a directional test should be used and
alpha should be set at .10.
Distractors should be negatively correlated with total test score, and correct choices should be
positively correlated with total test score. A bootstrap method is suggested for overcoming any
bias introduced by the nature of the sample (de Gruijter, 1988), but this kind of extreme measure
points out an inherent flaw in the use of this index. It should be noted that the discrimination
index is not robust. If item difficulty is high or low, the index is attenuated. It maximizes when
difficulty is moderate. The sample composition has much to do with the estimate of discrimina-
tion. Distractors tend to be infrequently chosen, particularly when item difficulty exceeds .75.
Thus, the point-biserial correlation is often based on only a few observations, which is a seri-
ous limitation. Henrysson (1971) provided additional insights into the inadequacy of this index
for the study of distractor performance. Because of these many limitations, this index probably
should not be used. A more recent critical review by Attali and Fraenkel (2000) focused on the
limitation of the point-biserial for distractor discrimination. Instead, they proposed a statistic, based on
choice means, that compares each distractor to the correct choice. Clearly, the point-
biserial has a limitation for measuring distractor discrimination.
Another related method is the choice mean for each distractor. As noted earlier in this chapter,
discrimination can be assessed as the difference between the choice mean of the correct answer
and the choice mean of the distractors. Referring to Table 17.9, note that the choice mean for
each distractor differs from the choice mean of the correct answer in both items. The difference in
these choice means can serve as a measure of distractor effectiveness: the lower the choice mean,
the better the distractor. This difference can be standardized by using the standard deviation of
test scores, if a standardized effect-size measure is desired.
For item 32, option C has the lowest choice mean and because of that seems to be the best
distractor. Option D is very close to the correct answer and should be revised or removed. In this
way, this becomes a new item with an anticipated improvement in discrimination. The analysis
of variance shows that the choice means differ significantly. However, the R-squared statistic is a
useful index of practical significance. It shows that 12% of the variance in test scores comes from
variation of option means. For item 45, the item is easy. Casual inspection of the means suggests
very little discrimination. The F-test for the analysis of variance of these option means leads to
failure to reject the null hypothesis. This appears to be random variance. This item has distractors
that do not seem to be working.
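A sketch of the choice-mean approach (hypothetical data): compute the mean total score of the test takers selecting each option, along with an R-squared (eta-squared) giving the proportion of test-score variance associated with option choice.

```python
import numpy as np

def choice_means_and_eta_squared(choices: np.ndarray, totals: np.ndarray):
    """Mean total score per option and eta-squared (between-option variance
    as a proportion of total test-score variance)."""
    options = np.unique(choices)
    grand_mean = totals.mean()
    means = {opt: totals[choices == opt].mean() for opt in options}
    ss_between = sum((choices == opt).sum() * (means[opt] - grand_mean) ** 2
                     for opt in options)
    ss_total = ((totals - grand_mean) ** 2).sum()
    return means, ss_between / ss_total

# Hypothetical item: option A is the key; its choice mean should be highest.
choices = np.array(["A"] * 60 + ["B"] * 20 + ["C"] * 15 + ["D"] * 5)
rng = np.random.default_rng(4)
totals = np.concatenate([rng.normal(loc, 8, n)
                         for loc, n in [(75, 60), (60, 20), (55, 15), (62, 5)]])
means, eta2 = choice_means_and_eta_squared(choices, totals)
print({k: round(v, 1) for k, v in means.items()}, round(eta2, 2))
```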
The third statistical approach to evaluating distractor performance is directly related to the
tabular presentation of data and the trace line. Haladyna and Downing (1993) also showed that
trace lines reveal more about an option’s performance than a choice mean. Whereas choice means
reveal the average performance of all examinees choosing any option, the trace line accurately
characterizes the functional relationship between item and total test performance. Distractor dis-
crimination is the relationship between option and total test performance. Table 17.10 presents
a contingency table for option performance. When a chi-square test is applied to these categorical fre-
quencies, a statistically significant result signals a trace line that is not flat. In the case shown in Table 17.10,
the trace line is monotonically increasing, which is characteristic of a correct answer.
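A sketch of the contingency-table test follows, using hypothetical frequencies of one option's choice across five score groups; scipy's chi-square test of independence is assumed to be available.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical: how often one option was chosen (vs. not chosen) in each of
# five ordered score groups of 100 test takers.
chose = np.array([20, 32, 45, 61, 78])
not_chose = 100 - chose
table = np.vstack([chose, not_chose])      # 2 x 5 contingency table

chi2, p_value, dof, expected = chi2_contingency(table)
print(round(chi2, 1), round(p_value, 4), dof)
# A significant result signals a trace line that is not flat; the rising
# frequencies here are characteristic of a correct answer.
```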
Often, with four- and five-option SR items, one or two options are infrequently chosen. These
tabular, graphical, and statistical methods do not work well. These options apparently are not
plausible because they are so seldom chosen. You can set an arbitrary standard for determining
that an option is too seldom chosen. For example, Haladyna and Downing (1993) chose 5%.
Many options were eliminated based on this standard of 5%.
Summary
There are four major areas of validity evidence from statistical study of responses to objectively-scored
items.
1. For item difficulty, the p-value seems sufficient if the sample used to obtain the p-value is
representative of the population for which the test is intended.
2. For item discrimination, the point-biserial correlation between item and total test score
seems best. Moreover, the direct relationship of the point-biserial to test score reliability is
a very useful feature.
3. Graphical methods for item responses are very useful and should be applied for audiences
less sophisticated in statistics.
4. Distractors need to be evaluated. Not only does this evaluation improve item perform-
ance, but it also contributes to the growing argument for, and research in support of, reducing the
number of options typically used in SR tests.
We have excellent computer software for item analysis with many useful options. There is much
to appreciate about item analysis. Some useful references on this topic include Lord and Novick
(1968), Livingston (2006), and Nunnally and Bernstein (1994). Most important, all sources of
evidence supporting the functioning of objectively scored items are validity evidence.
18
Validity Evidence From Statistical Study of
Subjectively-Scored Test Items
Overview
This chapter presents basic concepts, principles, and procedures concerning item analysis for
constructed-response subjectively-scored (CRSS) items. The previous chapter did the same for
selected-response (SR) and constructed-response objectively scored (CROS) items. This chapter
is organized into five sections.
1. The nature of item responses is revisited. Some unique characteristics distinguish CRSS
from SR and CROS items.
2. Three different CRSS formats are presented and discussed. Each presents a unique
approach to studying item characteristics and threats to validity.
3. Rater consistency is a major topic with CRSS items. Its relationship to reliability is direct
and important. Ways to study and improve rater consistency are essential to validity.
4. Rater bias (also known as rater effects) is a family of variables that represents systematic
error that weakens validity.
5. The last section summarizes and makes recommendations for studying threats to validity
and gathering validity evidence to support test score interpretations.
Product or Performance
One important distinction with any CRSS item is whether the actual performance or a product
is being evaluated. When a student writes an essay, the writing process may be important, but
it is the result that is evaluated by raters. As part of the definition of any construct, it should be
clear to the rater exactly what is expected: performance or the product. A good example comes
from the Oregon State Assessment Program. Several years ago, they implemented a mathemati-
cal problem-solving performance test consisting of a single item. One of the four analytic traits
evaluated was processes and strategies. Each student had to choose a problem-solving strategy
that worked and carry it out (http://www.ode.state.or.us/search/page/?id=32). Raters rated the
extent to which the student's solution satisfied the description on the rating scale for processes and strategies.
The focus of this rating was not solely on the result but also on how the result was obtained. The
state’s mathematics educators were very concerned with the process of mathematical problem-
solving as well as the result. Independently of this rating of four traits, the answer to the problem
was also scored objectively.
Cognitive Demand
CRSS formats are usually selected because item and test developers think that it is the best way to
elicit the kind of higher-level cognitive demands required. For instance, referring again to the Ore-
gon mathematics problem-solving test, the CRSS item not only had fidelity to mathematics prob-
lem-solving, but the scoring guide is generic in terms of the aspects of complex thinking sought.
However, as stated previously in chapter 3 and elsewhere, the cognitive demand of test items
varies considerably in actual use as a function of the developmental level of the test taker. Whether
the intended cognitive demand is realized is determined by how the typical test taker actually engages the
item. The best way to determine that is through the think-aloud procedure, in which students
are interviewed as they self-administer items. Chapter 16 discussed this approach.
Dimensionality
Dimensionality is an important consideration with CRSS items as it is with CROS and SR items.
If more than one item is used and analytic traits are being rated by two or more raters, dimen-
sionality should be a concern. Tate (2002, pp. 181–182) provided four reasons why.
1. If we seek a unidimensional test score, our item and test specifications should be sup-
ported by empirical data. Empirical analyses add convergent evidence to the assertion of
subject-matter experts (SMEs) about the meaning of a total score.
2. The structure of the data may suggest different approaches to reliability estimation.
3. The study of item bias through differential item functioning can be
done in the context of dimensionality analysis.
4. The study of dimensionality can be a great aid in equating of alternate test forms. Writ-
ing performance tests have great difficulty in maintaining stable scales from year to year.
There are many factors that may intervene that disturb the comparability of a scale (Hala-
dyna & Olsen, submitted for publication).
The structure of the data is hypothesized on the basis of the construct definition by SMEs.
Generally, a measure is intended to be unidimensional, because a total score is what is interpreted. That is why
we gather empirical evidence to confirm this hypothesis. As noted previously, for any cogni-
tive ability, there is considerable interest in sub-abilities. Can subscores be validated for use in
reporting performance? A total test score is an estimate of status with respect to that ability, and
a subtest score is an estimate of status with respect to a sub-ability of that ability. Both ability and
sub-abilities must be validated.
With CRSS items, the construct definition should drive the thinking of SMEs about whether
traits associated with an ability are highly related or not. For instance, the six-trait writing model
posits that writing features six distinctive traits. If two or more raters rate these traits for each stu-
dent, the resulting ratings should reveal patterns in the data supporting the six analytic traits. The
same holds true for the four-trait mathematical problem-solving CRSS items featured previously
in the Oregon mathematics problem-solving test. If two or more raters score a set of student
papers, the four traits should emerge as highly related but distinguishable.
Random Error
A test consisting only of one or several CR items usually has a high degree of random error. This
result occurs due to one or more contributing factors. Definition of the construct may be less
than clear, and the resulting scoring guide cannot produce consistent results because each rater
creates a personal definition of what constitutes high and low performance for that ability. The
rating scale may not be well designed. Too few items and too little range in the rating scale may
also increase random error. Thus, reliability is likely to be lower than desired.
One strategy to overcome low reliability is to augment the CRSS test with SR items. How-
ever, as Kane and Case (2004) warned, the test designer may be reducing content-related validity
evidence for the sake of improving reliability. SR items do not have high fidelity with the tar-
get domain of writing—which consists of high-fidelity writing tasks. The extent to which raters
disagree also reflects random error. If raters rate inconsistently, random error will be large and
reliability will be low. That is why extensive training of raters is so important—to ensure that
random error is as small as possible.
Systematic Error
Some factors that have no relationship to the ability being measured produce systematic error.
For example, a bathroom scale may add 10 pounds to any person’s weight. The technical name
for systematic error is construct-irrelevant variance (CIV) (Haladyna & Downing, 2004; Messick,
1989). Much of the rest of this chapter is devoted to studying and reducing systematic error in
CRSS items. The more common term for this kind of error is bias.
Theoretically, any test taker's observed score (Y) can be expressed as follows:

Y = true score (T) + systematic error + random error
As we know from theory, random error can be large or small and positive or negative. Random
error is assessed via the reliability estimate of test scores. Systematic error like random error
can be large or small and positive or negative. The worst aspect of systematic error is that it is
directional, either positively or negatively. When not detected, test takers’ scores are uniformly
increased or decreased. Even worse, because so many variables can contribute CIV, systematic
error is cumulative. Subjective scoring elicits systematic error in many ways. Additional ways to
deal with systematic error focused on item responses are presented in chapter 19.
Computer Programs
Fortunately, we have many computer programs that accommodate scoring for CRSS items. Table
17.1 in the previous chapter lists these programs. Many of these programs provide analyses based
on item response theory (IRT).
Item Type 1 is a single-item test that requires a performance or a product with a single, holistic
rating scale used by one or more raters.
Item Type 2 is a single-item test that also requires a performance or a product but is scored with
several analytic trait rating scales by one or more raters.
Item Type 3 is a multi-item test. A survey might have a set of items answered via a rating
scale, usually scored by a single rater—the respondent. For instance, an attitude scale might con-
sist of four or five rating scale items. A total score indicates the extent to which an attitude exists
toward an object. There might also exist tests with two or more CRSS items and multiple raters.
Table 18.1 summarizes these three item types. Item analysis for each of these is different. How-
ever, all share the same problems concerning systematic error.
The characteristics of item scores will be discussed next. Simple descriptive statistics are used
to illustrate the basic ideas. However, readers should note that there is an increasing interest in
and technology for IRT item analysis and scaling and the use of generalizability theory (Brennan,
2001). Also, more advanced multivariate statistical procedures are sometimes used to analyze
item scores. However, these topics are well beyond the scope of this chapter. Those interested
in IRT applications for CRSS items should consult any of a variety of books on this topic (de
Ayala, 2009; DeMars, 2010; Embretson & Reise, 2000; Ostini & Nering, 2006; van der Linden &
Hambleton, 2010).
Item Type 1: The One-Item Test Scored With a Holistic Rating Scale
When item type 1 has only one rater, item analysis is very restricted. Also, several threats to valid-
ity are associated with this item type. Ideally, two or more raters should be used to ascertain rater
consistency, improve reliability, and study and defend against threats to validity that come from
systematic error.
Discrimination A shortcoming of the one-item CRSS with a holistic rating scale is the lack of an
estimate of discrimination. Because discrimination is defined as the relationship between the item
score and the test score, and with this type of item the two are the same, discrimination
cannot be estimated.
Item Type 2: The One-Item Product/Performance Scored With Analytic Trait Scales
As with the first item type, having only a single rater is a very ineffective way to estimate item dif-
ficulty and discrimination. The ideal situation is where the one-item test is scored independently
by two or more raters using analytic trait rating scales.
An example is a writing performance test for 2,684 elementary level students. The six-trait
writing model was used with two independent scores for each trait. The rating scale varies from
1 to 6, but scores of 1 are very infrequent.
There are two levels of analyses to consider: First, a total score is computed by summing all
trait scores. Trait scores are summed across two or more raters. Having one rater for each trait
score is very risky due to systematic error. Both total scores and trait scores need separate valida-
tion. It should never be assumed that one validation is a proxy for the other.
Frequency Distribution An item analysis or statistical computer program can produce a fre-
quency distribution of item ratings as shown below. An inspection of such results is helpful in
detecting problems. Statistics computer programs routinely compute descriptive statistics for
distributions like those shown below in Table 18.2.
Table 18.2 Frequency Distribution for an Analytic Trait for a Single Observation
Rating 1 2 3 4 5 6 7
Typical 2% 12% 20% 41% 14% 8% 3%
Negative skew 0 0 0 0 1% 11% 88%
Positive skew 78% 20% 2% 0 0 0 0
Leptokurtic 1% 3% 25% 44% 24% 2% 1%
Sparseness 0 24% 25% 31% 19% 1% 0
Assuming a relatively normal distribution, a typical result is expected. A negative skew might
be expected for a highly able sample, such as graduated medical or dental students on a licensing
test. A positive skew is usually an indication that the test takers have very little ability or the test
is simply too difficult. A leptokurtic distribution is an indication of a serious CIV threat, where
raters tend to rate in the middle of the scale. Sparseness is an occasional problem where raters
are reluctant to use the extremes of the rating scale. In effect, the rating scale reduces from seven
points to five, which has a negative effect on reliability. We recognize that these character-
istics of distributions are typically attributed to interval or ratio measurement scales as opposed
to binary or ordinal variables. The line between ordinal and interval variables, and the appropriate
mathematical and statistical operations for each, remains under reasonable debate (Fife-Schaw, 2006).
We do not add to this debate. We find the examination of the distribution of rating data useful
and employ the standard language of distributions to describe those results. The point is that the
examination of frequencies for ratings is an important tool to examine the quality of the CRSS
item.
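As an illustration of this kind of inspection, here is a minimal sketch (with hypothetical ratings, not the data in Table 18.2) of how the frequency distribution and shape statistics for a single rated trait might be tabulated.

```python
# Illustrative sketch: tabulating a frequency distribution of ratings on a
# 1-7 scale and describing its shape. The ratings below are hypothetical.
import pandas as pd
from scipy import stats

ratings = pd.Series([4, 5, 4, 3, 6, 4, 2, 4, 5, 3, 4, 7, 4, 5, 1])

# Percentage of ratings at each scale point (compare with the rows of Table 18.2)
freq = (ratings.value_counts(normalize=True)
               .reindex(range(1, 8), fill_value=0)
               .sort_index() * 100)
print(freq.round(1))

# Shape statistics: skewness and excess kurtosis help flag skewed, leptokurtic,
# or sparse distributions worth closer inspection.
print("skewness:", stats.skew(ratings))
print("kurtosis:", stats.kurtosis(ratings))
```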
Difficulty With analytic traits, we have two categories of difficulty: total and trait. Total item dif-
ficulty is simply the sum of the trait ratings for a set of rated performances.
Table 18.3 shows trait difficulty ratings for first and second ratings of a team of raters. On this
rating scale, these trait observations are remarkably similar.
Table 18.3 Item Difficulty Estimates for First and Second Ratings of a Single Trait
Traits 1 2 3 4 5 6 Total
First rating 4.10 4.30 4.00 4.10 4.30 4.00 4.12
Second rating 4.00 4.00 3.90 4.00 4.00 3.90 3.97
Total 4.05 4.15 3.95 4.05 4.15 3.95 4.05
Are the traits at the same level with this sample of students? It would seem so. We would expect no difference in first and second
ratings as these ratings are independent and based on expertly trained raters who have seen and
use benchmark papers to guide them in scoring.
As shown in the above table, the marginal means provide information that can be used to
evaluate item difficulty. The overall rating is 4.05. The first ratings were slightly higher than the
second ratings. If one were to perform a statistical test for differences between the first and sec-
ond ratings, for a large sample, the result would be statistically significant (p <.05), but practically
speaking a difference of .15 of a rating scale point is very small. This is most likely random error.
Trait means vary between 3.95 and 4.15. If one runs a statistical test such as one-way analysis of
variance for repeated measures, the result is statistically significant (p <.05), but again the practi-
cal differences among these means are very small. What can be taken from Table 18.3 is that we can
compute the mean of the one-item task (4.05) and the means of the six traits. However, there is
much more work to do to establish validity of total score and trait score interpretations. Haladyna
and Kramer (2004) provided some theoretical rationale and methods for investigating total score
and trait score validity. The next chapter treats this topic more adequately.
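To make this difficulty analysis concrete, the sketch below (using simulated ratings, not the data behind Table 18.3) computes trait and occasion means and contrasts statistical significance with the practical size of the first-versus-second difference.

```python
# Illustrative sketch with simulated data: trait means for first and second
# ratings, plus a paired comparison for one trait.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_students, n_traits = 500, 6
first = rng.integers(3, 7, size=(n_students, n_traits)).astype(float)          # first ratings
second = np.clip(first + rng.normal(-0.1, 0.5, first.shape).round(), 1, 6)     # second ratings

print("trait means, first rating: ", first.mean(axis=0).round(2))
print("trait means, second rating:", second.mean(axis=0).round(2))
print("overall mean:", np.mean([first.mean(), second.mean()]).round(2))

# With a large sample, even a trivial mean difference tends to be "significant";
# the practical size of the difference matters more.
t, p = stats.ttest_rel(first[:, 0], second[:, 0])
print(f"trait 1, first vs. second: t = {t:.2f}, p = {p:.4f}, "
      f"mean difference = {(first[:, 0] - second[:, 0]).mean():.2f}")
```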
Discrimination The indexes in Table 18.4 are product-moment correlations between item rat-
ing and total score. An important note is that a discrimination index should be based on the cor-
relation of the item score to the total score (with that item removed from scoring). Otherwise, the
use of that item inflates discrimination. Here, each trait score is therefore correlated with a total
score based on the other five trait ratings—not the entire set of six trait ratings. The indexes
below are extremely high. These results suggest that item scores are highly interrelated and highly
related to the total score. Rater consistency and reliability are topics very closely related to dis-
crimination. Both are treated in subsequent sections of this chapter.
Table 18.4 Item Discrimination Estimates for First and Second Ratings of a Single Trait
Traits 1 2 3 4 5 6
First rating .74 .76 .78 .77 .73 .78
Second rating .78 .78 .75 .78 .76 .75
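A minimal sketch of this corrected item-total computation follows; the function and the demonstration data are hypothetical, not the program or data behind Table 18.4.

```python
# Illustrative sketch: corrected item-total (discrimination) correlations, i.e.,
# each trait score correlated with the total of the remaining traits so that the
# trait being examined does not inflate its own discrimination index.
import numpy as np

def corrected_item_total(ratings: np.ndarray) -> np.ndarray:
    """ratings is a (students x traits) array of trait scores."""
    n_traits = ratings.shape[1]
    total = ratings.sum(axis=1)
    discriminations = []
    for j in range(n_traits):
        rest = total - ratings[:, j]                 # total score with trait j removed
        r = np.corrcoef(ratings[:, j], rest)[0, 1]
        discriminations.append(r)
    return np.array(discriminations)

# Hypothetical ratings for 100 students on six traits
rng = np.random.default_rng(1)
demo = rng.integers(1, 7, size=(100, 6)).astype(float)
print(corrected_item_total(demo).round(2))
```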
Sub-type 1: Stand-Alone Survey Items Stand-alone items simply provide descriptive informa-
tion and, as such, are not subject to the computation of difficulty and discrimination.
A frequency of response is the only relevant information obtainable. In some instances, it may be
important to disaggregate the information. Although the frequency of response for each choice is
informative, when data are disaggregated by another variable, such as drivers/non-drivers, more
refined information is obtained. Drivers do not like speed bumps but non-drivers like speed
bumps a lot (see Table 18.5).
The degree to which a relationship exists between these two variables can be tested—as
described below.
Sub-type 2: Survey Items Intended to Measure a Trait If survey items are designed to measure
a human trait, the item analysis has characteristics we associate with SR or CROS tests. Consider
the simple attitude-toward-school variable. Table 18.6 shows three faces representing three dif-
ferent emotional states of elementary school children.
Positive 3 2 1
Negative 1 2 3
Each face is arbitrarily assigned a point value based on whether the item presents in a positive
or negative way the trait being measured. Although these data appear categorical, we often take
the liberty of assigning a value to each category so that we can compute an index—representing a
degree of attitude. Students choose the face that best represents their attitude toward something.
Although we assign values to each face, the values change depending on the polarity of the item,
as shown above. Chapter 9 presents additional graphical rating scale formats. Here is a set of
items measuring attitude toward school. They display different directions in wording or conno-
tation. Sometimes items can be worded positively (no negative terms) but have a connotation or
meaning that is in the negative direction.
Difficulty Difficulty is not a relevant variable here, because the items represent ordinal measures
of attitudes. As in the ratings of a single performance or observation above, the frequency distri-
bution is the most appropriate summary of ratings.
Relationships Among Items Rather than stressing the importance of item responses to a total
score, the more important descriptive statistic is the relationship of responses to other responses.
Note that a person who likes school will respond happy, happy, sad, sad. And a person who does
not like school will respond sad, sad, happy, happy. We have multiple ways to examine relation-
ships among items. We present three approaches here.
First, the correlations among these items should be high and should follow the expected posi-
tive/negative pattern shown in Table 18.7. This not only confirms the expected direction of the items (positive versus
negative), but assesses the magnitude of association among the items.
Table 18.7 Correlations Among Common Items
Item 1 Item 2 Item 3
Item 2 .45 —
Item 3 –.42 –.52 —
Item 4 –.38 –.49 .55
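The brief sketch below (with made-up responses, and assuming items 3 and 4 are the negatively worded ones) shows how such a correlation matrix and reverse-scored totals might be computed.

```python
# Illustrative sketch: checking the positive/negative correlation pattern among
# attitude items and reverse-scoring negatively worded items before summing.
# The data and the assignment of items 3 and 4 as negative items are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "item1": [3, 2, 3, 1, 2, 3],
    "item2": [3, 2, 2, 1, 2, 3],
    "item3": [1, 2, 1, 3, 2, 1],   # negatively worded
    "item4": [1, 1, 2, 3, 2, 1],   # negatively worded
})

print(df.corr().round(2))          # expect a +/- pattern like Table 18.7

# Reverse-score negative items (scale maximum + 1 minus rating) before totals
reversed_df = df.copy()
for col in ["item3", "item4"]:
    reversed_df[col] = 4 - reversed_df[col]
print(reversed_df.sum(axis=1))     # total attitude score per respondent
```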
In a recent study examining the relation between feelings about going to school and school plans,
Warshawsky, Rodriguez, et al. (2012) found a strong association. This was an analysis of large-
scale survey data from the Minnesota Student Survey administered in 2010 to nearly all ninth-
and twelfth-grade students in public schools. In Table 18.8, you can see, along or near the diago-
nal, the largest volume of respondents agreed with corresponding levels of both feelings about
school and school plans. To test this relation, a chi-square test (χ²) was conducted, and Cramér's
phi, a general effect-size measure for associations among categorical variables, was reported. The
results indicated significance: χ²(16) = 12,000, p < .001, Cramér's phi = .190—a small but statistically
significant association.
Table 18.8 Cross tabulation of Feelings about School and School Plans.
How do you feel about going to school?
School Plans Like very much Like quite a bit Like a little Don’t like very much Hate Total
Graduate school 5209 11592 7109 2118 1024 27052
19% 43% 26% 8% 4% 100%
College 4438 16045 17228 6620 2803 47134
9% 34% 37% 14% 6% 100%
Vocational school 190 705 1395 768 511 3569
5% 20% 39% 22% 14% 100%
High school 274 633 1284 987 870 4048
7% 16% 32% 24% 22% 100%
Quit school 106 61 94 136 695 1092
10% 6% 9% 13% 64% 100%
Total 10217 29036 27110 10629 5903 82895
12% 35% 33% 13% 7% 100%
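For readers who want to reproduce this kind of test, here is a hedged sketch that computes the chi-square statistic and Cramér's V for a cross tabulation; the counts are hypothetical and the table is smaller than Table 18.8.

```python
# Illustrative sketch: chi-square test of association and Cramér's V for a
# contingency table of counts (hypothetical data, not the Minnesota Student Survey).
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([
    [520, 1160, 710, 210, 100],
    [440, 1600, 1720, 660, 280],
    [ 20,   70,  140,  77,  51],
])

chi2, p, dof, expected = chi2_contingency(table)
n = table.sum()
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
print(f"chi2({dof}) = {chi2:.1f}, p = {p:.3g}, Cramér's V = {cramers_v:.3f}")
```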
A third strategy for evaluating the effectiveness of these items is factor analysis. If the intent
is to sum these items to create a total score or trait scores to indicate a person’s level of the trait
being measured, we need to provide evidence that it makes sense to combine the items. The items
must be intercorrelated, related as a set, and have a factor structure that supports the total score.
The analysis should reveal whether the dimensions of the survey are independent or dependent. Does
the survey measure a general attitude, or are specific aspects of the survey reflected in respondents'
answers? Chapter 19 provides a discussion of other issues related to
dimensionality and factor analysis. See Nardi (2005) for a comprehensive approach to survey
data analysis.
Creating a Scale From Survey Items To support more sophisticated modeling of item responses
and trait scores, survey items can be scaled through a measurement model, for example the
Rasch model. When the survey researcher hopes to measure a trait, like school climate, several
questions can be asked in an attempt to sample the domain of items from a construct we might
define as school climate. These items can be viewed like test items (a school climate measure) and
respondents can be asked to rate their responses on a rating scale, like the three-point graphical
rating scale above. The items can be combined into a total score.
Most survey researchers will simply sum the ratings from the set of items measuring some
trait. But since these ratings are ordinal, a simple sum may not be defensible. An interval scale
can be obtained for a set of items if the items are scaled through a measurement model. The Rasch
model is the most commonly used for survey data (Green & Frantom, 2002; Reeve & Fayers,
2005). Through the Rasch scaling of survey items, we can obtain information about the perform-
ance of each item and a continuous score for each person. We will briefly present an example,
based on a set of items from the Minnesota Student Survey that may measure school climate. The
set of items can be analyzed using a Rasch measurement model. This resulting score provides us
with the trait level for each person on the scale for school climate. However, large enough sam-
ples must be obtained to justify such scaling—more than 100 respondents is a common minimum
for Rasch scaling. The necessary assumptions should also be assessed, primarily the assumption
that the measure is unidimensional, although Tate (2002) notes that the Rasch model
is robust to minor violations of this assumption.
The results of the Rasch scaling also provide an estimate of scale score reliability, providing an
indicator of how internally consistent the item responses are that make up the total score scale.
Another strong result of Rasch scaling is the Item Map, where items are located relative to people.
This is possible because IRT models place persons and items on the same scale. In this way, we
obtain a more accurate estimate of the trait level of each item—which indicates which items are
easier to endorse (more accepted or more common) or more difficult to endorse (more severe or
more rare). It provides a true rank-ordering of items from low to high. In the example provided
in Figure 18.1, the Item Map includes five items, briefly described here. The two items from #15
are rated on a five-point scale from none to all. The three items from #17 are rated on a four-point
scale from strongly disagree to strongly agree.
The Item Map in Figure 18.1 shows us the locations of the items (labeled 15a, etc.) and the
respondents (labeled with . and #). We can see a distribution of respondents on the left side of the
scale, centered around 1.7 or so (where the M is located on the scale). This scale is a logit scale,
the Rasch (IRT) metric; it is centered at zero, the average location of the items, and persons are
located on that same scale. Since the mean (M) of the persons is near 1.7, we can say that
the average person is located in a more positive direction than the average item—students are
generally positive overall regarding school climate.
What we see in the Item Map is the location of the thresholds of the items, the half-points. For
example, the most difficult item to agree with at a high level is 15a (“All” students are friendly).
In the Item Map this is labeled as “15a.45” because it is the location of the change from a rating of
4 (Most) to 5 (All)—so the code indicates 4.5. At the opposite point, the easiest item to endorse
at the lowest level is also 15a (“None” of the students are friendly). This is labeled as “15a.15”
because it is at level 1.5, halfway between 1 and 2.
The Item Map provides a picture of the ordering of the items and the scale points. Notice that
all of the 1.5 scale points are near the bottom of the scale (the lower or more negative area of
attitudes regarding school climate) and the 3.5s and 4.5s are near the top (the more positive area
of school climate). The rating scale points are ordered as expected. The Rasch model’s scaling
provides us with many useful analytical tools, among them measures of item fit and
person fit. The person-fit topic is discussed in chapter 19 (Topic 7).
Describing the Sample Finally, a more critical aspect of the analysis of these items is to describe
the sample in terms of the relevant variables of such a survey. For instance, in a survey of students
in grades three through six, an attitude survey should include those demographic variables that
might be important context for understanding results. For instance, boys and girls differ in their
attitudes. Other variables may also be important, such as parents’ education and whether students
are at risk. Such variables as Title I status, English language learner status, and whether
special services are provided for a disability might also be relevant for describing the
sample. In addition, the variables describing the sample might provide substantive answers to research
questions. So identifying a set of variables describing the sample serves two purposes: (a) it helps
describe the respondents to address the representativeness of the sample, and (b) it provides a basis
for answering substantive research questions.
An example is used to illustrate the intricacies and challenges in estimating rater consist-
ency. The example comes from a project in which each of the portfolios of 594 teachers was evalu-
ated by four highly trained raters. For each of the four items (four portfolio entries), the rating
scale runs from .00 to 1.00 in increments of .25. In effect, this is a five-point rating scale. The maximum
total competency cluster score is 4.00. Because each portfolio is intended to document outstanding performance,
ratings are typically very high, and the descriptive statistics reveal this fact. Total scores are very
negatively skewed. Despite this non-normal distribution, the data clearly show three types of
rater consistency.
Table 18.9 shows ratings for a competency cluster (four items) for the four independent raters.
The low reliability estimate is a function of two factors. First, the range of scores is very restricted,
and thus reliability is said to be attenuated (weakened). If the mean rating were in the middle of
the scale and the distribution were normal, the reliability would likely be higher. Second, it is very
hard to get four independent raters to agree on the level of performance using a five-point rating
scale. If the ratings were more consistent, reliability would be higher.
Table 18.9 Ratings for a Competency Cluster for Four Independent Raters
Item 1 2 3 4
Sample 594 594 594 594
Low Score 2.00 1.50 2.00 2.00
High Score 4.00 4.00 4.00 4.00
Mean 3.87 3.84 3.78 3.83
Standard Deviation 0.28 0.34 0.38 0.33
Skewness –2.70 –2.70 –2.00 –2.20
Reliability .30 .35 .37 .35
PRODUCT-MOMENT CORRELATION AMONG PAIRS OF RATERS. As noted in Table 18.10, correlations among
raters are positive and statistically significant. All coefficients are in the low-to-moderate
range, but one should not be fooled into thinking that the relationships are truly low. Correlation
is bounded by a ceiling that is the square root of the product of the reliabilities of the two measures.
As noted in the previous table, reliability estimates were low. For instance, the correlation of total
competency cluster scores for raters 3 and 4 is only .21, while their internal consistency reliabil-
ity estimates are .37 and .35. If there were no random error, the correlation between these two ratings
would be .58. Considering that the range of scores is very restricted, this degree of relationship
actually shows high rater consistency.
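The .58 figure follows from the standard correction for attenuation; a quick worked check of the arithmetic:

```python
# Worked illustration of the ceiling/attenuation argument above: with reliabilities
# of .37 and .35, an observed correlation of .21 corresponds to a disattenuated
# (error-free) correlation of about .58.
import math

r_observed = 0.21
rel_x, rel_y = 0.37, 0.35

ceiling = math.sqrt(rel_x * rel_y)          # maximum observable correlation, about .36
r_disattenuated = r_observed / ceiling      # about .58
print(round(ceiling, 2), round(r_disattenuated, 2))
```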
Table 18.10 Relationship of Competency Cluster Ratings for the First Competency (Four Independent Raters)
cc11 cc12 cc13
cc11 — — —
cc12 .38 — —
cc13 .29 .35 —
cc14 .28 .20 .20
A point made earlier is worth repeating. Frequency distributions of ratings provide a useful per-
spective for a situation like this one. In the example below, the correlation between two items’ rat-
ings (.00, .50, 1.00) is .05, which is very low. However, because of the ceiling effect, we can note
that there is 86.7% perfect agreement between ratings. Only .8 percent of the ratings were
seriously discrepant (.00 and 1.00). This is a very high correspondence. Thus, correlation fails to
reveal a high degree of rater consistency, whereas a frequency tabulation gives more information
(see Figure 18.2).
Rater A \ Rater B 0 .5 1 Total
0 0 1 3 4
0.5 0 9 24 33
1 14 227 1754 1995
Total 14 237 1781 2032
COEFFICIENT ALPHA. Another indicator of rater consistency for the entire team of raters is simply
coefficient alpha, which is a measure of internal consistency. As the four raters are evaluating the
same performances using the same rating scale, coefficient alpha captures the consistency of the
team of four raters. For competency cluster one, the internal consistency of the ratings of the four
judges for 594 performances is .61. Again, by normal standards this coefficient may appear low;
the mitigating factor is that the mean score is very high. This restriction in the range of ratings does
not mean that reliability is low. Strictly speaking, it means that the ratio of error variance to total
score variance is 39%. The standard error of measurement (SEM) is a way to better understand
how much random error is associated with a true score. For instance, if the SEM is .02 and a teacher’s
cluster 1 portfolio score is .922, the margin of error around a true score (estimated from observed
scores) of .922 is very small (.902 to .942). However, coefficient alpha is not recommended for
small numbers of items or tasks.
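A minimal sketch of the SEM arithmetic described here follows; the standard deviation is a hypothetical value chosen only to reproduce an SEM of about .02.

```python
# Illustrative sketch: SEM = SD * sqrt(1 - reliability), and an observed score of
# .922 with SEM = .02 gives a band of roughly .902 to .942 (plus or minus 1 SEM).
# The SD value is hypothetical.
import math

reliability = 0.61
sd = 0.032                          # hypothetical standard deviation of scores
sem = sd * math.sqrt(1 - reliability)
score = 0.922
print(round(sem, 3), (round(score - sem, 3), round(score + sem, 3)))
```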
TABULAR REPORT. The degree of rater consistency can also be presented in tabular form. Practitioners
who are less inclined to understand correlations and coefficient alpha prefer tabular results like
the one in Table 18.11. The table provides the basis for computing the rater consistency index.
The shaded diagonal shows agreement, and adjacent cells show one half-point disagreement on
this rating scale. A few observations are very disparate. These outliers greatly affect reliability in a
negative way.
Rater Resolution
One way to reduce rater inconsistency is through rater resolution (Johnson, Penny, & Gordon,
2000; Johnson, Penny, Fisher, & Kuhs, 2003). If the difference between two raters exceeds a
boundary value (often one or two rating scale points), an adjudicator—a third party—
independently scores the performance. Then a rule is applied to determine
the most appropriate score to assign. Such remedies are highly recommended. This procedure
is done for the sake of fairness, but it also resolves disagreements and thereby increases rater
consistency and reliability.
This table can also be simplified for more effective presentation of agreements and disagree-
ments for one pair of judges. Table 18.12 presents the above data in a simpler format for easier
review by SMEs.
Table 18.12 Percentage of Agreement & Disagreements in
Classification by Differences Between Raters
Perfect Agreements 65.1%
0.5 Disagreements 27.4%
1.0 Disagreements 6.3%
1.5 Disagreements 0.8%
2.0 Disagreements 0.2%
Another tabular report considers the extent of agreement and disagreement among all four
raters (six combinations) for a total of 3,564 pairs of ratings (not illustrated here). So far, rater
consistency has been limited to pairs of raters and the extent to which they agree. A useful index
of rater consistency for the whole team is coefficient alpha for the four raters, which also serves as the reliability estimate.
Coefficient alpha is a measure of internal consistency: if the four raters agree, alpha should
be large. The coefficient estimated is .609. Although this coefficient may seem small, given the
skewness and restricted range of the scores, this coefficient is very high. The SEM
is .046. The mean of these ratings is 3.83 on this four-point scale. If a true score were 3.83, we
might expect a rating to fall between 3.78 and 3.88 about 67% of the time due to random error.
Another index of rater consistency is kappa (Cohen, 1960). This coefficient includes an adjust-
ment for chance ratings, but it may be severe, since chance is defined as raters not knowing how to
rate a response or performance and simply guessing. Kappa is therefore thought of as a conservative
measure of agreement. It is based on the difference between relative observed agreement and hypo-
thetical chance agreement: κ = [P(a) – P(e)]/[1 – P(e)], where P(a) is the probability of observed
agreement between raters (the proportion of perfect agreement between two raters) and P(e) is the hypo-
thetical probability of chance agreement, based on the marginal frequencies. If there is perfect agree-
ment between raters, κ = 1.00.
When more than two raters are involved, Fleiss’ Kappa is used (Fleiss, 1971). However, these
indexes are only appropriate for categorical ratings (binary or nominal), and not ordered rating
scales.
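As a hedged illustration (not part of the book's example), Cohen's kappa for two raters' categorical classifications could be computed as follows; scikit-learn's cohen_kappa_score implements the formula above, and the rating vectors are hypothetical.

```python
# Illustrative sketch: Cohen's kappa for two raters assigning categorical ratings.
from sklearn.metrics import cohen_kappa_score

rater_1 = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
rater_2 = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]

kappa = cohen_kappa_score(rater_1, rater_2)
print(round(kappa, 2))   # 1.00 would indicate perfect agreement beyond chance
```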
Generalizability of Scores Inherent in the analysis of raters is the question of being able to gen-
eralize scores beyond the raters employed to obtain those scores. Are scores dependent on raters?
A more complete model for assessing rating agreement and the effect of raters on scores is gen-
eralizability theory (G-Theory), which is based on the analysis of variance. The theory uses par-
titioning of variance in rating scores into facets determined by the measurement procedure. A
classic G-Theory model is one of partitioning score variance due to persons, items, raters, and all
possible interactions. Although a full treatment is beyond the scope of this book, Table 18.13 briefly describes
the variance components in a classic G-Theory model. The point is to assess the degree
to which facets of the measurement procedure (for example items and raters) affect scores. See
Brennan (2001) for a comprehensive treatment of G-Theory and its applications.
Table 18.13 G-Theory Variance Components in a Simple Persons X Items X Raters Design
Variance Component Description
Person This is universe score variance, similar to true-score variance. We want this to be the largest
source of score variance—scores vary because persons vary, not because items or raters vary.
Item This is due to differences in item/task difficulty. We expect this to be relatively large to the
extent that item difficulty varies.
Rater This is due to differences in rater severity. We prefer this to be very small, as raters should be
consistent.
Person X Item This is where item difficulty varies across persons (or person score depends on the item). This
is a troublesome source of score variance since a person’s score should not be a function of
which items or tasks they perform; it should reflect their ability.
Person X Rater This is where rater severity varies across persons (or person score depends on the rater). This
is a troublesome source of score variance since raters should rate a given person consistently.
Item X Rater This is where item difficulty varies across raters (or rater severity depends on the item). This is
a troublesome source of score variance since raters should rate a given item consistently.
Person X Rater X Item This is a three-way interaction including all residual random error.
With the flexible G-Theory model, many indices of measurement error can be computed.
Inter-rater agreement (αIR) can be estimated as Person variance divided by Total variance, where
Total variance is Person variance plus Person x Rater variance divided by number of raters:
α_IR = σ²_P / (σ²_P + σ²_PR / n_R).
Similarly, coefficient alpha can be estimated as Person variance divided by Total variance, where
Total variance is Person variance plus Person x Item variance divided by number of items:
α = σ²_P / (σ²_P + σ²_PI / n_I).
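To make the two formulas concrete, here is a small sketch with hypothetical variance components; in practice these components would be estimated from a G-study (for example, a persons x items x raters ANOVA).

```python
# Illustrative sketch: computing the two coefficients above from estimated
# variance components. The component values are hypothetical.
var_person = 0.50
var_person_x_rater = 0.10
var_person_x_item = 0.20
n_raters, n_items = 4, 6

alpha_ir = var_person / (var_person + var_person_x_rater / n_raters)   # inter-rater agreement
alpha = var_person / (var_person + var_person_x_item / n_items)        # coefficient alpha
print(round(alpha_ir, 3), round(alpha, 3))
```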
Any remedy for increasing rater consistency is recommended. Improved training is often said
to be the best strategy. Monitoring of raters before and during the process can be done to study
and improve rater consistency. Reporting to raters about their performance is a useful remedy.
Another idea is to certify raters. Those failing to rate consistently to a standard are dismissed
from rating. Another remedy is to improve the rating scale that is being used. Several methods
of preparing raters and monitoring their performance are presented in chapter 12, regarding the
scoring of CR items.
To review briefly what was presented at the beginning of this chapter, rater bias is an instance
of method variance (Campbell & Fiske, 1959). From Lord and Novick (1968, p. 42), we can rep-
resent systematic error as a component of the observed score, consistent with the decomposition given earlier: Y = T + e_r + e_s.
A true score is unknown. Any observed score is supposed to be an unbiased estimate of that true
score. However, both kinds of error affect this estimate. Reliability can be estimated, and it helps
estimate the degree of random error in a set of test scores. The amount of random error estab-
lishes some doubt about the location of the true score.
Systematic error has many sources (Haladyna & Downing, 2004). These include such factors
as fatigue, lack of scale comparability among test forms, lack of test taker motivation, inaccurate
scoring, and test taker cheating. This section presents several important sources of rater bias.
Engelhard (2002) identified several of these rater biases that need to be studied and resolved.
Table 18.14 provides an overview of these threats to validity. The example is a performance test
item that is evaluated for five traits (Trait1, Trait2, Trait3, Trait4, and Trait5). The rating scale
varies from 1 to 5.
Table 18.14 A Hypothetical Set of Ratings Representing Types of Systematic Error Arising from Ratings
Trait 1 Trait 2 Trait 3 Trait 4 Trait 5
True Score 4 3 2 3 4
1 Leniency 5 4 3 4 5
Harshness 3 2 1 2 3
2 Restriction in range 3 3 3 3 3
3 Halo 4 5 4 5 4
4 Idiosyncratic 5 1 5 1 5
1. Leniency/harshness. These are the two poles of a bipolar rating problem. A lenient rater
consistently overrates, whereas a severe rater consistently underrates. The term severity
will be used to denote this rater bias.
2. Restriction in range is similar to central tendency except that the ratings vary little but
can be found also at the top or bottom of the rating scale. Restriction in range can be
thought of as a combination of leniency/harshness and central tendency.
3. Halo is the tendency for a rater to form a first impression and ignore the trait definitions
that should be used to evaluate performance. The first impression carries over to all other
ratings. That is, there is a similarity of all other ratings to the first rating.
4. Idiosyncratic ratings are simply outliers. These ratings are so far out of line that each
must be wrong. Such ratings should be identified and another set of independent ratings
should be done to determine the accuracy of that rating.
5. Not illustrated in Table 18.14 is the tendency for raters to let an adjacent trait in the list of
traits being rated influence a rating. This type of error is known as proximity.
6. Also not illustrated in Table 18.14 is a logical error. This type of error is difficult to detect
because the definition of the trait to be rated is personal to the rater. In other words, the
definition of the trait and relevant training are ignored, and the rater freelances.
Each of these rater biases will be discussed. Research will be reviewed, and methods for detec-
tion and elimination will be presented. These methods are discussed as they pertain directly to
CRSS items.
1. Leniency/Harshness
Early studies identified the problem and sought ways to eliminate or reduce it (Kingsbury,
1922; Kneeland, 1929; Rugg, 1922; Thorndike, 1920). Many subsequent researchers reported on
rater severity (e.g., Bridgeman, Morgan, & Wang, 1997; Ito & Sykes, 1998; Lunz & O’Neill, 1997;
McNamara & Adams, 1991; Raymond & Viswesvaran, 1993). From these studies, it should be
noted that the influence of rater severity can be very low, around 5% of the variance of ratings, or
as high as 12% (Engelhard & Myford, 2003, p. 20). Little doubt should exist that rater severity is
pervasive in judged performance test items.
2. Restriction in Range
Restriction in range is another type of rater bias. Some raters are reluctant to use the extremes of
a rating scale and stay comfortably within a tight zone on the scale. Restriction in range is easy
to detect: one only has to calculate the standard deviation of all ratings for each rater. Referring
to Table 18.15, note the standard deviations of the 16 raters. Some raters tend to rate all perform-
ances in the middle of the scale instead of using the full range of the rating scale. Although there
is no research on this kind of error, it is very easy to spot. In Table 18.15, standard
deviations for these 16 raters ranged from .12 to
.35. Raters 1, 3, and 6 seem the most likely suspects for restricting the range of their ratings; their
standard deviations are small relative to those of other raters. Raters who restrict their
range of ratings tend to contribute a higher proportion of random error in their ratings.
Central tendency is a specific instance of restriction of range. Central tendency rating is char-
acterized by a small standard deviation of ratings coupled with a rating mean near the mean of all
ratings. Rater 12 appears to be an instance of central tendency. A test for homogeneity of variance
will reveal if these variations in ratings are statistically significant. In Table 18.15, the variation
is statistically significant. The next step is to determine how serious these central tendency
ratings are.
Saal, Downey, and Lahey (1980) were early researchers of restriction of range. They completed
a very comprehensive review of many rater biases. They recognized that restriction in range is the
general rater bias, and central tendency is a unique aspect of restriction in range. Engelhard (1994)
has stated that researchers often combine central tendency and restriction in range together as
if they were one rater bias, but he argues that they should be separated. Much of the research
on restriction in range comes from clinical evaluation in the professions, psychology, personnel
evaluation, and other fields outside of the measurement of cognitive abilities. Nonetheless, these
studies not only provide evidence for the extent of this rater bias but also ways to study it.
The consequences of restriction in range are many. If a cut score or multiple cut scores are
used, central tendency ratings can seriously undermine accurate classifications, particularly if the
cut scores are high or low on the test score scale. Restriction in range also reduces rater consist-
ency, which, in turn, negatively affects reliability.
3. Halo
A halo rater bias is characterized by a pattern of consistently similar ratings across a set of traits
to be rated. In other words, the ratings are expected to vary, but the ratings are remarkably simi-
lar. Example 3 in Table 18.14 represents a halo rating, given that the true ratings are known. The
example could also be interpreted as the work of a lenient rater whose ratings show restriction of range.
However, there is a substantial amount of research on halo ratings, and it begins with theory.
Theories of and Research on Halo Rating Bias Three different definitions of halo rater bias have
been proposed (Fisicaro & Lance, 1990).
1. The General Impression Model, as its name implies, holds that an overall impression is used
as the basis for trait ratings. In other words, the descriptions of each trait are ignored, and
this overall, holistic impression motivates all ratings. As a result, trait correlations are very
high. A large body of research reviewed by Fisicaro and Lance (1990) supports the psychological
process underpinning this theory.
2. The Salient Dimension Model. After the rater is trained, the rater keys in on the dimension
that seems to be salient to the ability/construct being measured. That salient dimension
drives the other ratings. Some studies favor this interpretation of halo rater
bias.
3. The Inadequate Discrimination Model recognizes that some raters simply do not under-
stand what the traits mean and therefore scramble around to find meaning. This tendency
is said to be due to poor definition, training, or design of the trait rating scales.
Other researchers have suggested that the cognitive demand of using traits in a meaningful way
may be too hard for raters and the similarity among traits may motivate raters to use the same
rating.
Research by Lance, LaPointe, and Stewart (1994) led them to conclude that the first model (GI)
is the correct one. This conclusion is generally supported by other research they reviewed. The
research literature on halo rater bias is extensive and spans many fields, such as personnel
evaluation, clinical assessment in medicine, advertising, psychology, and self-rating of personal-
ity traits, among many others. Research on halo rater bias for performance on cognitively based
tests is rare. In a study by Dennis (2007), halo was detected in scoring school projects by faculty.
Dennis concluded that the general impression model did not explain all the halo effects detected.
There is no doubt that halo is a pervasive problem in all fields where raters rate traits.
Regarding future research on the extent of halo in ratings of CRSS item performance, a more
direct and effective research strategy would be to have raters rate under experimental conditions
and then interview them to ascertain which halo rater bias model is active. Having clear definitive
findings about halo is the first step in treating the problem. Knoch (2009) reported an experi-
mental study showing that specific descriptors for each rating scale point, combined with effective
training, reduced the tendency toward halo and improved rater accuracy.
Consequences of Halo Error The most serious consequence of halo rater bias is that traits do
not get evaluated. Therefore, any diagnostic information that might be obtained is lost. Also,
an unwary test analyst may interpret the high degree of relationship among ratings to be a sign
of high reliability. There is no doubt that halo scoring will inflate coefficient alpha—one rating
repeated for each trait. Another consequence is that ratings are inaccurate. Something other than
traits is motivating these ratings.
Statistical Detection We have many simple statistical techniques to detect halo errors. Fortu-
nately, these techniques generally agree, so the choice among them matters little.
1. ROOT MEAN SQUARE. Using the previously reported writing data, the root mean square (RMS) was
used, which is the sum of the squared differences between each rating and the mean of all six rat-
ings. Evidence of a halo rating error is a RMS of zero, which shows no variation among ratings.
Evidence of discrimination among the traits is a high RMS. Of the 2,684 student essays rated on six
traits (Table 18.16), 478 (18%) had no variation. Another 702 (26%) had very small variation—an RMS
of .833, meaning the rating string had a single one-point deviation (e.g., 444344). In other words, more than 44% of
this sample showed a very strong halo rater effect.
Table 18.16 Root Mean Square Statistics for Individual Halo Rater Errors
Students Median Range RMS = 0 RMS = .833
First scoring 2,684 1.33 .00 to 10.83 478 (17.8%) 702 (26.2%)
Second scoring 2,684 1.33 .00 to 7.50 552 (20.6%) 688 (25.6%)
The RMS statistic shows an incredibly high degree of halo error when each rater serves as his or
her own control. A lack of variation in ratings is consistent with two explanations: the traits are truly
represented at equal levels in the writing, or halo error is present; attributing flat ratings to halo
assumes the former, plausible hypothesis can be ruled out. The correlation
between RMSs for the first and second scorings is .197, which is very low.
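A small sketch of this screening statistic, as defined above (the sum of squared deviations of the six trait ratings from their mean), might look like this:

```python
# Illustrative sketch of the halo screening statistic described in the text:
# 0 indicates a perfectly flat profile; .833 corresponds to a single one-point
# deviation in a string such as 4 4 4 3 4 4.
import numpy as np

def halo_statistic(rating_string):
    ratings = np.asarray(rating_string, dtype=float)
    return float(((ratings - ratings.mean()) ** 2).sum())

print(halo_statistic([4, 4, 4, 4, 4, 4]))               # 0.0   -> flat profile (possible halo)
print(round(halo_statistic([4, 4, 4, 3, 4, 4]), 3))     # 0.833 -> single one-point deviation
```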
2. COEFFICIENT ALPHA. Another way to study this tendency is to simply compute coefficient alpha,
which is a measure of internal consistency as well as a reliability estimate for a unidimensional
set of item scores. Writing traits are hypothesized to be moderately to highly related, so we would
expect alpha to be moderate to high for first and second scorings. Alpha for the six traits was .918
for the first reading and .919 for the second reading. These results should not be confused with
reliability. Reliability can be estimated in various ways. Using the Spearman-Brown formula for
reliability for two separate observations, the reliability coefficient is .778. This coefficient is closer
to being realistic. Coefficient alpha for first and second scorings is a concise way to show the extent
of halo rater bias.
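A brief worked note on the Spearman-Brown step mentioned above: with a correlation of about .64 between the first and second total scorings (an assumed value; the book does not report it directly), the stepped-up reliability for the two-observation total is roughly .78.

```python
# Worked illustration of the Spearman-Brown formula for two observations.
r_between_scorings = 0.64                          # assumed correlation between the two scorings
spearman_brown = 2 * r_between_scorings / (1 + r_between_scorings)
print(round(spearman_brown, 3))                    # about 0.78
```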
3. MULTI-TRAIT, MULTI-METHOD CORRELATIONS. Another way to study this tendency is via a correlation
matrix. The matrix should be arranged in a traditional multi-trait, multi-method framework to
isolate trait validity and method bias (Campbell & Fiske, 1959). Trait integrity should show that
judges generally discriminate among the six traits. Method bias would show that a factor exists for
each judge, and that factor is halo. Presenting a large correlation matrix (66 correlations) can be
cumbersome and hard to interpret. A table can be used that summarizes the findings. Three major
observations are summarized.
First, we have six correlations involving raters and traits representing trait validity (e.g. R1T1
and R2T1, R1T2 and R2T2, R1T3 and R2T3, R1T4 and R2T4, R1T5 and R2T5, and R1T6 and
R2T6). If traits can be validly interpreted, we expect this set of six correlations to be the highest
in the table below. Second, we have within-rater correlations. If halo is present, these correlations
are higher than between-rater correlations. Third, we have between-rater correlations absent the
first set of trait validity correlations. As Table 18.17 shows, trait validity shows faint signs of a trait,
but method variance (within rater halo) is very dominant. Between-rater correlations are expect-
edly lower than trait validity coefficients, but the overall, holistic writing ability seems to emerge.
4. FACTOR ANALYSIS. A fourth way is via a factor analysis with a standard varimax rotation (an orthog-
onal rotation where each factor is estimated to be independent). We have many research articles
on the use of exploratory and confirmatory factor analysis for establishing trait and method fac-
tors in data. Keeping with the theme of using simple, descriptive statistics, a very standard factor
analysis can be used to evaluate the presence of a halo effect or the extent to which method (rater)
variance is presence in ratings. Table 18.18 shows this fact. These results agree with the other factor
analytic methods and rotations used. Further discussion on the use of exploratory and confirma-
tory factor analysis methods is given in chapter 19. Here the exploratory factor analysis is com-
pleted to investigate the presence of a methods factor.
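As a hedged illustration of the idea (using simulated ratings with a built-in rater factor, and assuming a scikit-learn version that supports rotation in FactorAnalysis), an exploratory factor analysis of the twelve rater-by-trait scores might look like this:

```python
# Illustrative sketch: exploratory factor analysis of 12 rater-by-trait scores
# (R1T1 ... R2T6) to look for a method (rater) factor. Data are simulated with a
# deliberate rater (halo) factor; assumes scikit-learn >= 0.24 for rotation.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(2)
n = 1000
rater1_halo = rng.normal(size=(n, 1))
rater2_halo = rng.normal(size=(n, 1))
trait = rng.normal(size=(n, 6)) * 0.3            # weak trait-specific signal shared by both raters

scores = np.hstack([
    rater1_halo + trait,                         # R1T1..R1T6: dominated by rater 1 impression
    rater2_halo + trait,                         # R2T1..R2T6: dominated by rater 2 impression
])

fa = FactorAnalysis(n_components=2, rotation="varimax")
fa.fit(scores)
# High loadings of the first six columns on one factor and the last six on the
# other would indicate method (rater) variance rather than six trait factors.
print(fa.components_.round(2))
```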
Another piece of evidence supporting halo in this rating is an analysis of the means of each
trait (see Table 18.19).
Statistically, we can test for mean differences using ANOVA, but with large samples the results
will be statistically significant. An effective strategy is to determine effect sizes (differences meas-
ured in standard deviation units) to ascertain if these trait means follow patterns that SMEs can
interpret. The above table does not suggest much of a pattern, except for trait 6 (conventions),
which is the only objectively measurable trait of the six presented. As noted, there is hardly any
variation in these ratings. The differences between first and second ratings can be seen in the sec-
ond decimal place. Although these differences are statistically significant, the differences account
for only 4.7% of all variance. This is a very small effect. If traits were distinguishable, first and
second ratings would show systematic differences in the group being tested in terms of each trait.
Instead, there is considerable homogeneity in ratings. More evidence to support this conclusion
about halo scoring is the standard deviations of these ratings. There is no systematic pattern for
each trait with variation in ratings (see Table 18.20).
Table 18.20 Standard Deviations of Ratings for Each Trait (First and Second Scorings)
Trait 1 2 3 4 5 6
First .862 .829 .812 .781 .818 .877
Second .837 .842 .777 .746 .800 .845
Halo rating is a pervasive problem that is not often studied or reported in operational testing
programs. Using basic descriptive statistics, halo can be studied from a variety of perspectives.
These different methods should be convergent regarding the existence of halo rater bias. The
RMS seems to be the most direct way to detect individual rater halo bias. Factor analysis does
work very well to summarize these halo patterns, and whether one uses exploratory or confirma-
tory procedures hardly matters. If the halo pattern is pronounced, as it was in these data, it will
emerge in an exploratory factor analysis.
4. Proximity Error
The proximity rater bias is subtle but distinguishable. This error results when a rater has a series
of traits to rate after viewing a performance or product. Adjacent traits on the scoring sheet tend
to have higher correlations than non-adjacent traits. The further the separation of traits on the
rating form, the lower the correlation among the traits. An example is provided from an anony-
mous writing testing program.
Using the same writing data, proximity error is clearly illustrated when the
trait scores are formed by combining the first and second scorings for each trait. Note in Table
18.21 that the correlation of T1 with T2 is very high (.821) but progressively lower for the correla-
tions of T1 with T3, T4, T5, and T6. Very high correlations are observed for adja-
cent pairs (T2 and T3, T3 and T4, T4 and T5, and T5 and T6). In other words, it would seem that
raters are influenced by the previous rating rather than by the unique properties of each trait.
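A minimal sketch of one way to screen for this pattern follows: average the inter-trait correlations at each separation distance on the scoring sheet and look for a steady decline. The data are simulated so that adjacent columns correlate more strongly.

```python
# Illustrative sketch: mean inter-trait correlation by separation distance.
# `trait_scores` is assumed to be a (students x 6) array ordered as the traits
# appear on the rating form; a decline with distance suggests proximity error.
import numpy as np

def mean_correlation_by_distance(trait_scores: np.ndarray) -> dict:
    corr = np.corrcoef(trait_scores, rowvar=False)
    n_traits = corr.shape[0]
    by_distance = {}
    for d in range(1, n_traits):
        vals = [corr[i, i + d] for i in range(n_traits - d)]
        by_distance[d] = float(np.mean(vals))
    return by_distance

rng = np.random.default_rng(3)
demo = np.cumsum(rng.normal(size=(500, 6)), axis=1)   # induces higher adjacent correlations
print(mean_correlation_by_distance(demo))
```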
5. Idiosyncrasy
Idiosyncratic ratings are simply outliers. Where a rating might be predicted on the basis of other
raters’ ratings of the same performance or on the basis of previous rated traits, this rating is sim-
ply out of line. A good example of this comes from a story presented in chapter 13, taken from
the New York Times, about a rater reading an essay who was offended by the topic and gave the
writer a low score even though the writing was excellent (Farley, New York Times, September 27,
2009, p. A21). Such ratings are easy to identify.
Using the same database, two examples are presented in Table 18.22.
Table 18.22 Ratings of Two Raters on Two Essays across the Six Trait Writing Scores
Trait 1 Trait 2 Trait 3 Trait 4 Trait 5 Trait 6
Essay R1 R2 R1 R2 R1 R2 R1 R2 R1 R2 R1 R2
1 5 5 5 5 5 5 5 5 5 5 5 2
2 4 6 4 6 5 3 5 6 4 6 4 6
The RMS was used earlier to identify examples of halo scoring: a value of zero or .833 indicated flat scoring
across traits. A high RMS, in contrast, indicates a serious deviation from a pattern. A high RMS is
the opposite of halo scoring: one or two scores deviate markedly from an otherwise
consistent pattern across the two scorings.
6. Logical Error
Personal experience with medical examiners during oral examinations revealed that some raters
tend to use their own criteria to evaluate CRSS performance. These examiners are SMEs, so it
should not come as a surprise that they regard their own view of a professional ability, such as a medical specialty,
as better than any training, benchmark performance, or rating scale description. Eckes (2008)
identified profiles of raters who are selective in choosing criteria. His research confirms that
raters naturally select criteria that they think are most important. The consequences of logical
errors are difficult to assess, but clearly the suspicion that raters commit a logical error seems
confirmed in at least one study. Interviewing raters is one way to ascertain whether logical errors
occur.
More Effective Training Bernardin (1978) reviewed the extensive research on the effective-
ness of training to reduce rater severity. He concluded that while training reduced rater severity,
the types of training and the criteria used are important issues to consider. Saal, Downey, and
Lahey (1980) also reviewed the extensive literature in psychology to that date. They called for
greater clarity and consistency in dealing with severity and other problems associated with rat-
ings. Latham, Wexley, and Pursell (1975) reported a high degree of success in training raters to
reduce errors. Borman’s (1979) experimental study on the effectiveness of training showed that
halo errors were reduced by training but accuracy was not improved. Bernardin and Pence (1980)
reported success at reducing severity in ratings, but they also found evidence that training may
be nothing more than replacing one rater effect with another rater effect. McIntyre, Smith, and
Hassett (1984) had success in training raters to eliminate most types of rating errors. Hedge and
Kavanaugh (1988) reported an experimental study of training, finding that, while some rater
effects can be eliminated, other problems arise that continue to raise concerns about the validity
of ratings. Lumley and McNamara (1995) reached the conclusion that training has consistently
been found to reduce this problem but not eliminate it. The study of training effectiveness con-
tinues in many fields, particularly in the health sciences and medicine, where clinical perform-
ance is the preferred method of testing (e.g. Cook et al., 2009).
Monitoring Raters High-quality testing programs often provide feedback to their raters as a
means for helping them calibrate (apply the rubric appropriately). For instance, CRDTS (Central
Regional Dental Testing Service) and WREB (Western Regional Examining Board, a national
dental testing agency) use such systems (Haladyna, 2010, 2011). Such monitoring can also be
used to retain or dismiss raters. A technology has developed for monitoring raters employing
item response theory (e.g., Engelhard, 2002). For instance, Congdon and McQueen (2005) stud-
ied the stability of rater severity in a large-scale testing program. They found day-to-day fluctua-
tions in severity that led them to conclude that one-time monitoring and calibration may not be
effective. Research has also focused on specific elements of rater bias, such as drifting and uneven
uses of rating scale categories (Myford, 2009). However, Knoch (2011) reviewed the research on
monitoring, including a recent study showing that raters appreciate feedback, but the feedback does
not seem to improve their rating performance.
Rater Resolution (Revisited) Previously in this chapter, the method for resolving differences
between raters using an adjudicator was introduced and discussed as it affects rater consistency. It
is important to note that leniency or harshness (severity) is hidden when two raters are compared
who are very lenient or very harsh, since they agree. Monitoring agreement will not identify the
fact that they are lenient or harsh. Rater resolution will not detect leniency/harshness. Although
score resolution is an important and useful step, the detection of leniency and harshness should
also be considered in the process of monitoring raters. Score resolution alone will not resolve all
rater scoring problems.
Statistical Adjustment Should adjustments be made to test scores where rater bias has been
detected? The arguments that follow mainly address the very serious problem of rater leniency/harshness,
but they actually apply to all of the rater biases presented and discussed so far. However,
some of these rater biases are more difficult to detect and have less research to report regarding
their frequency and severity, and methods for adjustment may not be well developed. Nonethe-
less, the problem with leniency/harshness is serious enough to favor statistical adjustment.
1. Fairness. The first argument is that, in the Standards for Educational and Psychological Testing
(AERA, APA, & NCME, 1999), Chapter 7 is devoted exclusively to fairness in testing.
Standard 7.2 addresses fairness from the standpoint that some groups may be advantaged
or disadvantaged. Standard 7.3 provides that when empirical evidence is presented, we
should eliminate factors causing systematic error. Standard 7.10 addresses the problem
of observing group differences where no differences should appear. If raters commit rater
bias, such that the performances of groups differ in severity, then statistical adjustment
seems justified to eliminate the unfairness arising from harsh (or lenient) ratings.
2. Equating. According to the Standards (AERA, APA, & NCME, 1999), equating places
scores from alternate test forms on the same scale. Peterson, Kolen, and Hoover (1989)
stated that the purpose of equating is to establish equivalence in raw scores. They listed
four conditions for equating that we can apply to raters’ ratings: (a) both raters are rating
the same ability; (b) equity, that the conditional frequencies on the respective ratings for
test takers are the same; (c) the transformation from one scale to the other is the same; and
(d) the mapping is invertible. A widely held assumption by measurement specialists is that
“measures share the underlying supposition that items, raters, and judges are equivalent”
(Drewes, 2000, p. 214). That is, the underlying construct is common because two or more
raters are evaluating the same student writing using a common scoring guide. If one rater
is harsh and the other is lenient, equating is challenged.
3. Accuracy of Test Scores. It is obvious that harsh and lenient ratings and other rater
biases distort estimates of true scores. The accuracy of test scores is reduced when systematic error is
introduced into a scoring system. Adjusting for these rater biases makes test scores more
accurate.
4. Reliability and Rater Consistency. Although random and systematic errors are theoretically uncorrelated, the presence of systematic error increases the deviation of an observed score from its true score. Eliminating a systematic error such as rater severity should remove some of the error variance we attribute to rater inconsistency. We would expect some increase in reliability after correcting for rater severity or any other source of CIV. While this increase may be small, improved reliability should improve the accuracy of our pass/fail decisions.
5. Consequences. For test takers in high-stakes pass/fail situations whose scores fall at or
close to a cut score, accuracy is very important. Knowing that systematic error exists in test
scores and not doing anything about it is unconscionable. “What proportion of examinees
would receive an altered grade if the scores for the essay part of the test used in the opera-
tion were adjusted?” (Longford, 1995, p. 63). This is a very salient question that deserves
an answer and a just response to avoid negative consequences.
EXAMPLE 1. This performance was rated on five traits using a seven-point rating scale, with a single rater rating each trait. The score mean is 3.4 and the median is 3.0. The median is more representative of the scores.
EXAMPLE 2. This performance is based on the ratings of three raters. The mean is 4 but the median is 3. With small numbers of observations, the median is the appropriate statistic.
As with skewed distributions, the median is the appropriate statistic and the mean is not. The
point is to establish a clear policy statement on scoring, combining scores, and reporting results.
These should be consistent with the purpose of the test and support the validity of score inter-
pretation and use.
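The contrast between the mean and the median in Examples 1 and 2 can be reproduced with a few lines of code. The ratings below are hypothetical values chosen only to match the summary statistics given above.

from statistics import mean, median

example_1 = [2, 3, 3, 3, 6]   # five trait ratings from a single rater (Example 1)
example_2 = [3, 3, 6]         # ratings from three raters (Example 2)

for label, ratings in (("Example 1", example_1), ("Example 2", example_2)):
    # The medians stay at 3 while the means are pulled upward by the one high rating.
    print(label, "mean =", round(mean(ratings), 1), "median =", median(ratings))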
Wordiness
This topic was presented in chapter 13 in relation to scoring essay performances that are intended
to measure writing. However, the threat to validity posed by wordiness is worth mentioning here
because it is systematic error and it is prevalent. Longer essays get higher ratings than shorter
essays. A review of research on wordiness led to this conclusion:
Over all of the studies reviewed, the correlations of response length with response quality
range from a low of 0.13 to a high of 0.80. Correlations in the 0.50s to the 0.70s are common,
suggesting that response length alone can very often account for up to half the variation in
constructed-response scores. (Of course, response length may be a proxy for other more
relevant factors.) (Powers, 2005, p. 6)
Powers goes on to say that the lone outlier (.13) was an instance where writers were constrained.
Thus, there is ample evidence of an association between wordiness and total score. An important question is whether well-written essays are simply longer or whether wordiness contributes CIV to the measurement of the performance. Word counts on essays can be done as part of the validation effort. If a correlation exists between word count and total score, then raters need to ascertain whether the essay was rated on the basis of quality or the number of
words used in the essay. This requires some research and additional training and scoring, but if
wordiness is a potential threat to validity, then it is well worth the effort and expense.
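As a sketch of the word-count check described above, the few lines below correlate essay length with assigned scores. The essays and scores are hypothetical stand-ins for operational scoring records.

from statistics import correlation  # available in Python 3.10 and later

scored_essays = [
    ("Short response about the prompt.", 2),
    ("A somewhat longer response that develops one idea with an example.", 3),
    ("A much longer response that develops several ideas, offers examples, "
     "and restates the thesis at the end.", 5),
]

word_counts = [len(text.split()) for text, _ in scored_essays]
scores = [score for _, score in scored_essays]

# A large positive correlation would prompt a closer look at whether quality or
# sheer length is driving the ratings.
print(round(correlation(word_counts, scores), 2))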
Another aspect of this issue concerns coaching: when test takers are being prepared for a high-stakes test, are they encouraged to be wordy, knowing that wordy essays earn higher scores?
Handwriting/Printed Material
As reported previously in this volume and elsewhere (Haladyna & Olsen, submitted for publica-
tion) whether written material is submitted as handwritten or typewritten makes a difference.
Raters rating both versions (handwritten and typewritten) tend to give higher ratings to hand-
written essays. First, as part of the construct definition, it should be stated whether each format is construct-relevant or construct-irrelevant. Second, in scoring these items, an essay should not be scored in both formats as if the two versions were equal; they clearly are not. For the most part, modern writing involves computers and software that produce printed essays. It seems logical, then, that when essays are evaluated by raters, the essays should appear in printed form.
Bluffing
With the CROS and CRSS item formats, it is possible to bluff correct answers. A testwise writer
may write longer passages, paragraphs, or sentences. The bluffer may answer a different question very well instead of responding to the item that was presented. The bluffer may use key words that are sure to inflate a rater's judgments. Or the bluffer may slant the writing to impress a rater rather than respond to the cognitive demand of the task. The bluffer may write illegibly and hope for the benefit of the doubt. The goal of bluffing is to inflate a score, which is an instance of systematic error. As in the case of SR and CROS items, the effects of bluffing can be minimized with well-designed CRSS items, descriptive rating scales, and highly effective training of raters. Interestingly, there is no research to report on the extent of bluffing or its characteristics. Nonetheless, bluffing
is within the realm of factors that threaten validity.
Overview
This chapter presents a variety of topics involving item responses that affect validity. Each topic has
received persistent theoretical study and research through many years, so there is both interest and
justification for including each topic in this chapter. Table 19.1 presents a brief overview of each
topic. The topics are presented in alphabetical order. Some topics are interrelated and others are
independent of other topics. Treatment of each topic is intentionally brief. The concept or problem
is presented, theory and research are reviewed, and a recommendation is made for what might be
done to improve validity. Each of these topics would benefit from continued research.
Table 19.1 Overview of Topics

1. Answer Changing: First Response or Reconsider. Should a test taker stay with a first response to an SR test item or reconsider and change it later in the testing session?
2. Differential Item Functioning (DIF). If different groups of test takers with the same level of ability perform differentially, is this difference construct-relevant or construct-irrelevant?
3. Dimensionality. When defining an ability, is it unidimensional or multidimensional?
4. Edge Aversion. Researchers have found a tendency of test takers to avoid the edges of a list of options; for four-option items, A and D are avoided. Is this a threat to validity?
5. Item and Rater Drift. Items tend to get easier over time, which we call item drift. Raters tend to drift in their ratings over time, which we call rater drift. Is drift a threat to validity?
6. Omits and Not Reached Responses. When a test taker skips an item or quits a test, should omitted and not-reached items be scored as incorrect or simply not attempted? Which alternative in scoring produces the most validly interpreted test score?
7. Person Fit: Differential Person Functioning (DPF). If a test taker has an unusual pattern of responses, is this a signal for an invalidly interpreted test score?
8. Subscore Validity. Assuming a test is unidimensional, can validly interpreted subscores be reported?
9. Weighting SR Options. Should we weight SR options to optimize reliability and take advantage of differential information offered by distractors?
1. Answer Changing: First Response or Reconsider
A persistent myth holds that your first response to an SR test item is most likely your best choice. Interestingly, this myth existed long ago (Mathews, 1929). According to the myth, any reconsideration of that choice will usually lead to an incorrect choice. In a survey
of college students, 64% believed that answer changing will lead to lower total scores, and 36%
believed that it did not matter if one changed an initial answer (Mueller & Shwedel, 1975). No
student thought that answer changing helped. Fortunately, considerable research has been done
on the benefits or deficits of answer changing that informs us.
Research
The evidence is overwhelmingly in favor of answer changing. Study after study involving many
subject matters and levels of schooling show the advantage of reconsidering the initial answer
and making a change when the test taker has a reason to do so (Copeland, 1972; Higham &
Gerrard, 2005; Mathews, 1929; Mueller & Shwedel, 1977; Mueller & Wasser, 1975). Smith et al.
(1979) wondered if the cognitive demand of the item had something to do with answer changing.
Although they found no difference in gains for answer changing for higher and lower cogni-
tive demand items, they did confirm past findings that most college students think that answer
changing should NOT be done. Mueller and Shwedel (1975) found gender differences in gains
for those changing answers. A study by Vidler and Hansen (1980) found results similar to the
results reported by Mueller and Wasser. There was a preponderance of answer changing with this
college age sample, and most students benefitted from changing answers. They also found that
the more difficult items were the subject of answer changing. Another important finding was a
slight tendency for items with lower discrimination to have answers changed more frequently.
We might infer that items that are more ambiguous are prone to answer changing.
A study with elementary school age students showed that students at all levels profited from
answer changing on a reading comprehension test (Casteel, 1991). Geiger (1997) studied college
student testwiseness in relation to answer switching. He found that students with training and skill
at taking tests know to switch answers when justified. A study by Heidenberg and Layne (2000)
involved the most data of all these studies (1,819 students and 123,548 items). They reported that
most students changed at least one answer, most changes were positive, and there were no gender
differences. They also found a tendency for low-scoring test takers to change answers more often
but clearly the higher scorers were more successful when changing answers. They asked test tak-
ers to code the reasons for changing answers using a six-category taxonomy. The most frequent
reasons for changing answers concerned clerical errors and cued (badly written) items. This kind
of study shows the potential for probing deeply into the reasoning behind answer changing. Milia (2007) experimented with college-level students in Australia. Roughly 1.7% to 2.4% of items had answers changed; of the test takers who changed answers, 50% increased their scores and 25% decreased them. No gender differences in gains were found, although women were more inclined to change answers. As in other studies, higher-scoring students benefitted most from answer changing.
Psychologists have probed into aspects of answer changing. For instance, Rabbitt (1966) exam-
ined response latencies for answering SR items, to explore time spent considering and reconsid-
ering responses as related to right and wrong changes. Higham and Gerrard (2005) examined
experimentally how time limits and speededness affected answer-changing. A useful finding from
this study was that if answer changing is considered part of the accurate measurement of a student's cognitive ability, then sufficient time to respond to all items should be provided.
Recommendation
For SR test preparation, all test takers should understand that reconsidering the first choice is a good test-taking strategy. Generally, students benefit from reconsidering some answers, and generally the change is for the good. The more thought given to a test item, the
better the results in thinking and performance. This research also points to the need to allow for
more time on tests. When test takers are rushed, they tend not to reconsider the initial choices
and thus score lower than they should. This conclusion assumes that the test is a power test.
As most cognitive ability tests are not speed tests, providing ample time for test taker response
is important. This need is more urgent with test takers who are taking the test in a nonnative
language.
Importance of DIF
The Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1999) dedi-
cated chapter 7 to fairness. Standard 7.3 in that chapter very specifically advises that DIF should
be investigated and eliminated. Thus, there is no doubt about the importance of DIF studies for
test items.
Methods of Study
Table 19.2 provides a sample of computer programs that provide DIF statistics.
The results of any DIF study should converge with other evidence before item bias is claimed. First, the fairness review is a mandatory review that looks for potential item bias. Second is DIF analysis. DIF has received extensive theoretical development and research in the past three decades (Embretson & Reise, 2000; Holland & Thayer, 1988; Holland & Wainer, 1993; Zumbo, 2007).
We have an abundance of methods for studying item response patterns to detect possible DIF.
Although simply defined here, DIF has become a very sophisticated science. Zumbo (2007)
describes three historical, developmental stages in the study of DIF.
1. In the first generation, as high-stakes testing increased, test developers were concerned
about racial and gender differences in performance. Were these differences construct-
relevant or construct-irrelevant? Studies focused on whether specific items were harder
or easier for targeted subgroups of test takers. Then, discussions centered on whether DIF
existed and if DIF had impact on test takers. The term item bias was replaced by the more
specific and technical term DIF. One offshoot of concern was the process of reviewing
items for sensitivity (Zieky, 2006; chapters 2, 3, and 16 this volume).
2. In the second generation, tremendous interest existed in methods of study. The Mantel-Haenszel method was the most commonly recommended (Holland & Thayer, 1988); a sketch of this method appears after this list. DIF methods have proliferated and are widely applied to a variety of item formats and situations. Zumbo (2007) characterizes this generation as focusing on three unique frameworks: (a) IRT, (b) contingency tables or regression models, and (c) multidimensional methods. The most popular of the three seems to be IRT (Embretson & Reise, 2000).
3. In the third generation, Zumbo characterizes a shift in thinking that includes the test
situation or purpose as a source of DIF. The focus moves to item formats and content
as potential causes of DIF. In the measurement of writing, such a concern is well docu-
mented and justified by research (see chapter 13 this volume, Haladyna & Olsen, submit-
ted for publication). With performance testing, such as that involving writing, there are
numerous construct-irrelevant influences, such as whether writing is done by a computer
or is handwritten, or the number of words used in an essay.
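As a rough illustration of the Mantel-Haenszel approach mentioned in the second generation above, the sketch below computes the common odds ratio across total-score strata and expresses it on the ETS delta scale. The counts are hypothetical, and an operational analysis would be run with one of the programs in Table 19.2.

from math import log

# For each total-score stratum: (reference correct, reference incorrect,
#                                focal correct, focal incorrect)
strata = [
    (40, 10, 30, 20),
    (55, 15, 45, 25),
    (70, 10, 60, 20),
]

numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)

alpha_mh = numerator / denominator   # common odds ratio across strata
mh_d_dif = -2.35 * log(alpha_mh)     # ETS delta scale; values far from 0
                                     # (commonly beyond about 1.5 in absolute
                                     # value) are treated as large DIF
print(round(alpha_mh, 2), round(mh_d_dif, 2))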
Zumbo (2007) offered five important reasons for future DIF analyses. First, we need to know not only whether subgroups of test takers perform differentially on some items but also why they differ. DIF originated because of this concern. Second, DIF can be used as part
of a research agenda to isolate variables that may affect item performance. For instance, some
writing prompts interact with gender, producing DIF. Third, when items are translated, does
each item retain its meaning and performance characteristics? Fourth, cognitive researchers are
using IRT to increase understanding of the cognition behind a DIF finding. Fifth, when an item
is rejected in an IRT analysis due to lack of fit, DIF analysis can uncover potential reasons for this
rejection.
Recommendation
DIF studies are necessary. The choice of a method and computer software may not be critical, as all methods strive to accomplish the same end. The number of offending items actually detected and treated may be small, but the mere fact that this safeguard has been applied provides a kind of item validity evidence: a threat to validity has been investigated and, when detected, removed.
3. Dimensionality
This topic has been presented in other chapters due to its importance in various stages of item
development and validation. The focus for our concern about dimensionality is content-related
validity evidence in test score validation. The Standards for Educational and Psychological Testing
(AERA, APA, & NCME, 1999) list many standards for content-related validity evidence. Many
essays have contributed to the realization that dimensionality is a major issue in assembling con-
tent-related validity evidence for any test (Hattie, 1985; McDonald, 1999; Messick, 1989, 1995a,
1995b; Nunnally & Bernstein, 1994; Tate, 2002, 2003). Ironically, dimensionality is given very
brief treatments in successive editions of Educational Measurement (Brennan, 2006; Linn, 1989)
and the Handbook of Test Development (Downing & Haladyna, 2006). This section serves to
emphasize the importance of dimensionality for many good reasons.
Defining Dimensionality
Dimensionality refers to the minimum number of factors needed to explain an ability. Messick (1989) stated that if a single score is used to measure an ability, the implication is that it represents a single dimension. However, it is the responsibility of those defining the ability to first determine
the structure of the ability to be measured. As SMEs have determined the content and its struc-
ture and organization, they should opine regarding the fundamental issue of single or multiple
dimensions. This content can be in a curriculum or a domain of tasks to be performed. As items
are developed following item and test specifications and then field-tested, data can be analyzed to
decide how many factors constitute this ability. The important point is that the definition of the
ability and item development that follows comes from a thorough and complete understanding
of the ability by SMEs. They initially decide which factors make up an ability. Empirical investigation is a means of confirming the SMEs' judgment concerning dimensionality.
Many theorists and testing experts have stated that any test we develop should
be unidimensional. Classical test theory (CTT) and traditional IRT models assume unidimen-
sionality among item responses. The underlying assumption is that the items measuring that
ability or sub-ability are homogeneous (McDonald, 1985). Messick (1989) stated that a single
score on a test implies a single dimension. A similar observation was by Nunnally:
Each test should be homogeneous in content, and consequently the items on each test should
correlate substantially with one another. (Nunnally, 1977, p. 247)
The decision about dimensionality has several implications for test development and validation.
1. The conclusion of SMEs initially drives test development. If they hypothesize a single dimension for an ability, test development is straightforward. If they hypothesize multiple dimensions for an ability, test development should proceed separately for each dimension, as theory supports.
2. This decision about dimensionality will affect the development of item and test specifica-
tions. A multidimensional conclusion will result in one set of item and test specifications
for each dimension. Item development and validation will be based on this decision. Item
development will be more extensive to support reliable determination of each dimension.
3. DIF studies should be conducted for a multidimensional test. One never knows if a sub-
group of examinees is being advantaged or disadvantaged in one or more of these sub-
abilities that comprise an ability.
4. How scores are combined for a multidimensional ability is an issue. Will scoring be compensatory or conjunctive? Compensatory scoring simply adds up the scores on each dimension in a weighted or unweighted way. Conjunctive scoring insists that each dimension be passed in a high-stakes setting (a sketch contrasting the two appears after this list).
5. Scaling is a challenge for a multidimensional test. We need test forms that measure accu-
rately throughout the scale and over time for each dimension.
6. Diagnostic and evaluative subscores need to be validated for a multidimensional ability.
7. The cost of test development for a multidimensional ability is greater than for a unidimen-
sional test.
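The sketch below contrasts the two scoring rules named in point 4. The subscores, weights, and cut scores are hypothetical.

subscores = {"dimension_1": 62, "dimension_2": 48, "dimension_3": 71}
weights   = {"dimension_1": 0.4, "dimension_2": 0.3, "dimension_3": 0.3}
cuts      = {"dimension_1": 50, "dimension_2": 50, "dimension_3": 50}

# Compensatory: a weighted sum, so strength on one dimension can offset weakness on another.
composite = sum(weights[d] * subscores[d] for d in subscores)

# Conjunctive: every dimension must meet its own cut score.
passes_all = all(subscores[d] >= cuts[d] for d in subscores)

print(round(composite, 1), passes_all)   # about 60.5; fails the conjunctive rule on dimension_2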
On the other hand, Tate (2002) stated that even when there is theory and evidence supporting a
multidimensional approach to testing, unidimensional test theory methods work quite well. This
is due to the often observed tendency for dimensions of a single ability to be highly interrelated.
IRT scales are said to be robust to violations of the assumption of unidimensionality.
Methods of Study
As stated previously, any study of dimensionality should begin with a hypothesis by the com-
mittee of SMEs about the nature of the ability being measured and the supposed sub-abilities.
After items are developed and field-tested, the empirical phase of the study of dimensionality can
generate questions that are answered via statistical analysis. These methods contribute to better
understanding of the dimensionality of any ability to be measured.
1. Internal consistency among items. If all item scores are highly intercorrelated, coefficient alpha will be very high, which is one piece of evidence supporting unidimensionality (a sketch of the alpha computation appears after this list). Defective items will lower reliability, so internal consistency estimates must be predicated on items having been evaluated thoroughly. Also, highly skewed data will mask internal consistency, and the frequency distributions recommended in the previous chapter will be more informative.
2. Correlations among logically defined content subscores should be computed. If subscores are based on few items, the correlations might be corrected for attenuation. By removing the random error of each subscore, we can estimate the true relationship among the subscores. If this true relationship is very high, the evidence supports unidimensionality. This evidence should concur with internal consistency estimates of reliability.
3. Factor analysis is the primary means for studying item response structure. About the
time of a review by Hattie (1985), it was widely understood that traditional linear factor
analysis was inappropriate for analyzing item responses. Its limitation resides mainly
with the use of phi or point-biserial correlations among items. Factor analysis based
on these correlation coefficients often results in a difficulty factor or too many factors
due to the range of item difficulties. A tetrachoric correlational approach is preferred
because it assumes that the underlying trait in dichotomously scored items is continu-
ous, not binary. This topic is rich with continuous theoretical development and research.
Sources that provide more substantial treatments of this topic include Brown (2004),
Gorsuch (1983), Hattie (1985), McDonald (1985), Tate (2002, 2003), and Thompson
(2004).
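As referenced in point 1 above, coefficient alpha can be computed directly from a matrix of scored item responses. The sketch below uses a tiny hypothetical data set; operational analyses would of course use far more examinees and items.

from statistics import pvariance

# Rows are test takers, columns are dichotomously scored items (hypothetical data).
responses = [
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 0, 1],
]

k = len(responses[0])
item_variances = [pvariance([row[i] for row in responses]) for i in range(k)]
total_variance = pvariance([sum(row) for row in responses])

# Coefficient alpha: k/(k - 1) * (1 - sum of item variances / total-score variance)
alpha = (k / (k - 1)) * (1 - sum(item_variances) / total_variance)
print(round(alpha, 2))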
Tate (2002) discussed item factor analysis as being characterized by linear and nonlinear methods
that are further delineated by exploratory and confirmatory procedures. A nonlinear approach
is justified on theoretical grounds, and the confirmatory approach is justified by the fact that
SMEs have created item and test specifications that very clearly specify the structure of the data.
The confirmatory procedures are a special case of structural equation modeling. However, these
more sophisticated procedures are intended for very complex structures involving hierarchical
relationships. Table 19.3 provides a short list of factor analysis programs that provide linear and
nonlinear and exploratory and confirmatory approaches to item factor analysis. Not listed in
Table 19.3 is IRTPRO 2.1, which was listed in Table 19.2 as also useful for DIF analysis.
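Once a tetrachoric (or other appropriate) correlation matrix has been estimated with software such as that listed in Table 19.3, a quick first screen of dimensionality can come from its eigenvalues. The matrix below is hypothetical, and eigenvalue inspection is only a rough screen, not a substitute for the exploratory or confirmatory analyses discussed above.

import numpy as np

# Hypothetical tetrachoric correlation matrix for six items.
R = np.array([
    [1.00, 0.62, 0.58, 0.21, 0.18, 0.24],
    [0.62, 1.00, 0.60, 0.19, 0.22, 0.20],
    [0.58, 0.60, 1.00, 0.23, 0.17, 0.21],
    [0.21, 0.19, 0.23, 1.00, 0.59, 0.61],
    [0.18, 0.22, 0.17, 0.59, 1.00, 0.57],
    [0.24, 0.20, 0.21, 0.61, 0.57, 1.00],
])

eigenvalues = np.linalg.eigvalsh(R)[::-1]   # largest first
print(np.round(eigenvalues, 2))
# More than one eigenvalue well above 1 suggests more than one dominant dimension.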
Recommendation
The study of dimensionality is a critical step in test development because it affects (a) content-related validity evidence, (b) test design, (c) item development and validation, (d) scaling, (e) standard setting, and (f) the estimation of reliability. Given this overarching influence on validity,
the study involves SMEs and empirical studies. The study begins with the definition of the ability
being measured and ends with empirical studies that test hypotheses about the structure of the
item response data. The consequences of any study will determine how content-related validity
evidence is used in validation and how the test is designed.
4. Edge Aversion
Simply stated, edge aversion is a human tendency to select middle items from a serial list and
avoid the first and last items on that list (Huber, 1983). This initial hypothesis and study did not
involve tests and test items, but it stimulated research on edge aversion in testing. Interestingly,
psychologists have long known about edge aversion through animal studies, and edge aversion
has implications for consumer marketing. However, it was not applied to educational testing
until researchers at the Educational Testing Service gave it a closer look.
Attali and Bar-Hillel (2003) discovered that although high-quality, professionally developed
tests use balanced keys so that the right answer appears equally often in all positions, test takers have a tendency to select middle options. When a test taker is unsure of the right answer, these researchers found, middle options are preferred at a ratio of about 3:1 to 4:1. In four-option items, about 55% of wrong choices occurred in the second and third options. This bias in guessing has consequences. These researchers recommended that multiple forms be created by shuffling items and options so that no two tests have the same order; in this way, edge aversion would be eliminated. Because such shuffling is easy to perform in computerized test administration, edge aversion can be eliminated there.
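The shuffling recommended above is straightforward in computer-based administration. The sketch below shuffles the options of a single hypothetical item for each administration while keeping track of where the keyed answer lands.

import random

item = {
    "stem": "Which statistic is least affected by an extreme rating?",
    "options": ["mean", "median", "range", "standard deviation"],
    "key": "median",
}

def shuffled_presentation(item, seed):
    # Return the item with its options in a random order for one administration.
    rng = random.Random(seed)
    options = item["options"][:]
    rng.shuffle(options)
    return {"stem": item["stem"],
            "options": options,
            "key_position": options.index(item["key"])}

print(shuffled_presentation(item, seed=1)["options"])
print(shuffled_presentation(item, seed=2)["options"])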
Golda (2011) reviewed the extensive literature on shuffling items and its effects on item
characteristics. She reported that in some prior studies better students got lower scores on
shuffled test items, whereas lower-performing students were unaffected. The benefit of shuf-
fling options on her tests was to prevent student cheating. Her study involved the shuffling of
options in a random way to avoid the edge aversion tendency and to decide if this shuffling
had any effects on test form difficulty. Her results suggested that the different, shuffled forms
were equivalent.
Recommendations
Edge aversion bias may be very small. It can be combated by shuffling the order of options on computer-administered tests. Test preparation programs should also incorporate edge-aversion theory and research, making test takers aware that avoiding the edge options when guessing biases their results.
Item Drift
In any SR testing program where we use equated test forms, item drift may be a problem. Equat-
ing links are minitests used in these alternate test forms to ensure that the test score scale is uni-
form from one test form to another. We have noticed that items reused tend to get easier over
time. We would like to think that an item’s difficulty will remain the same over repeated use, but
too often the item’s difficulty drifts. If some items drift and other items remain stable, we likely
have a serious threat to validity.
We should not confuse item drift with changes in item performance that are due to human growth or development or to teaching or training effects. We expect a set of test items to reflect such gains in a uniform way. Item drift is a unique change in an item's difficulty over time, unlike the changes in other items in a test.
The most common cause of drift is item exposure. That is, repeated testing with the same
test takers results in familiarity with certain items, and these items are correctly answered more
frequently than in the past. Another cause of drift is teaching to the test (Popham, 2001). If
teachers know that specific content is tested repeatedly, they may teach specific content and, as
a result, the items measuring that content have higher performance (drift). Another reason for
drift is that items are stolen and shown to a large group of test takers. The disclosure results in
a jump in the performance on the item. These possibilities are construct-irrelevant influences
that weaken the validity of test score interpretations and uses. One good example comes from
a study of student achievement scores in a statewide testing program involving the Stanford
Achievement Test (Haladyna, 2004). The scores on this test were shown to increase annually.
This trend would signal great advances in student learning in reading and mathematics. How-
ever, such gains were not validated by results for students in the same state on the National
Assessment of Educational Progress (NAEP). There was anecdotal evidence that teachers in
the state, as in most other states, teach content that they know will be tested. The items repre-
senting this content become easier as a result. In theory, if instruction or training is effective, all items on a test should show instructional sensitivity in a uniform way. Where some items drift and others do not, one should be suspicious about the validity of the drifting items.
Fortunately, we have many ways to study item drift. Han, Wells, and Sireci (2012) showed that
different methods for correcting for drift produce different results. Studies like this one show that
drift can occur in different directions (positive or negative), that such drift can distort scores, and that corrections for these distortions should be applied cautiously.
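Before any formal correction, a very simple screen for item drift is to compare classical p-values for the same linking items across administrations and flag items whose change departs from the overall trend. The p-values and flagging threshold below are hypothetical.

# p-values for the same linking items on two administrations (hypothetical).
p_old = {"item_01": 0.52, "item_02": 0.61, "item_03": 0.48, "item_04": 0.70}
p_new = {"item_01": 0.54, "item_02": 0.74, "item_03": 0.50, "item_04": 0.71}

changes = {item: p_new[item] - p_old[item] for item in p_old}
typical_change = sum(changes.values()) / len(changes)   # overall, possibly construct-relevant, gain

THRESHOLD = 0.05   # an arbitrary flagging rule for this illustration
flagged = [item for item, delta in changes.items()
           if abs(delta - typical_change) > THRESHOLD]
print(flagged)     # item_02 stands out as a possible drifting item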
Rater Drift
For CRSS items, a rater might gradually drift toward higher or lower ratings during a scoring session or over several days of scoring. Upward or downward trends are undesirable because they represent systematic error. Rater drift includes
all of the rater effects discussed in chapters 12 and 18. Thus, all types of rater drift represent a
source of CIV.
In one study involving a credentialing examination scored by raters, adjustments in scoring
were useful initially but failed to produce accurate adjustments several months later (Harik et al.,
2009). Another interesting study, with a national writing test in the United Kingdom, showed no systematic trend in rater severity, but the authors did find a central tendency effect (Leckie & Baird, 2011). A disturbing finding was that rater severity was nonetheless unstable over time; thus, monitoring and correcting for it might be very difficult. An excellent review of the conditions of rater drift can be found in
Hoskens and Wilson (2001).
Recommendations
As drift is a threat to validity, vigilance is required to monitor, identify, measure, and combat it.
The strategies for item and rater drift are different.
Item drift. The monitoring of item difficulty for items used for different, equated test forms
should be a standard practice. If items are drifting, equating will produce biased results. Thus,
offending items are often removed. Moreover, the reasons for drift should be known. If there is a
general increase in performance attributed to construct-relevant factors, such as effective teach-
ing or training, then drift is not a problem. If teachers are teaching to the test or items are being
taken and disclosed to large groups of test takers, we have another kind of serious problem that is
well beyond the scope of this book but one that is worthy of our attention.
Rater drift. Rater effects should be monitored. Where drift occurs, the problem can be
treated by counseling offending raters, dismissing them, or making statistical adjustments. As
rater drift is a specific type of rater effect where ratings over time are trending undesirably, the
detection and treatment of rater drift is very difficult. Nonetheless, such monitoring is neces-
sary to preserve validity. Methods for the study and correction of each source of systematic
error due to drift need more research and implementation. One of the most effective means
for combating rater drift is through automated essay scoring (Shermis & Burstein, 2002). This approach removes the rater from the operational scoring, although expert raters are still used to build and calibrate the scoring system. The benefit of automated scoring is that drift is no longer a problem. This kind of scoring removes nearly all rater effects if the preparation of the scoring system uses expert raters wisely and effectively.
6. Omits and Not Reached Responses
Consider a test taker's string of scored responses to a 25-item test, where 1 = correct, 0 = incorrect, O = omitted, and N = not reached:

1101111000110O1011001NNNN

Strict right–wrong scoring treats O and N responses as wrong; this test taker would earn a score of 48%. Under a second scoring strategy, Os and Ns do not count: we omit these responses from
scoring, and the test taker receives a score of 60%. With IRT scoring, a person’s ability can be
based on responses to the items they attempted. The missing responses should not affect their
ability estimate, but would result in less precision because fewer items were answered. Such a
score would have a larger standard error of measurement than the score obtained by taking a
longer test.
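A quick check of the two scoring strategies, computed directly from the response string above:

responses = "1101111000110O1011001NNNN"   # 1 = correct, 0 = incorrect, O = omitted, N = not reached

n_correct = responses.count("1")
n_total = len(responses)
n_attempted = n_total - responses.count("O") - responses.count("N")

print(f"Strict right-wrong scoring: {n_correct / n_total:.0%}")       # 48%
print(f"Os and Ns not counted:      {n_correct / n_attempted:.0%}")   # 60%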
Theoretical Analysis
High-stakes tests include those for graduation, certification, placement, admission, advance-
ment, and the like. The frequency of O and N for high-stakes tests is very small and negligible. For
an anonymous medical certification test involving 1,081 candidates and 150 items, 73 omits were
observed. This is 0.045% of all item responses, which is minuscule. Such a rate is typical of high-stakes tests where test takers
are performing maximally. There should be little concern about O and N in these circumstances.
O and N are signs of very poor testwiseness by test takers.
Although the frequency of O and N for low-stakes tests can be small, it can also be consequen-
tial. Therefore, the remainder of this section on O and N refers to low-stakes testing.
We have many explanations of what causes O and N responses:
1. The test taker may be less motivated and may skip a few hard items and then quit.
2. The test taker may have simply run out of time. Some test takers are plodders and do not
manage time very well. Lu and Sireci (2007) presented an analysis of problems associated
with time limits on power tests and how they affect many factors, including Os and Ns.
3. Some test takers may be testwise and know that if they do not answer, they will not be penalized. Therefore, they skip harder items and earn a higher score. Such an outcome is a
paradox; if test takers only answered those items for which they had great confidence, they
could attain a high score if Os and Ns were ignored in scoring.
4. English language learners (ELLs) often lack the verbal ability to read test items fluently
and thus take a longer time to respond. As a result, they skip some items, may stop taking
the test, or simply run out of time.
What are the consequences of Os and Ns for validity? This is a central question, and the answer determines whether Os and Ns constitute a threat to validity.
Table 19.4 Comparison of Item and Test Characteristics for Item Responses With and Without Os and Ns

                        Os and Ns Included    Os and Ns Excluded
Sample Size             1,500                 1,213
Mean                    18.2                  19.0
Standard deviation      5.9                   5.8
Lowest score            2                     2
Highest score           31                    31
Mean p-value            .57                   .59
Mean discrimination     .32                   .31
Coefficient alpha       .814                  .811
When reporting a test result for any demographic or subgroup of the population, are test scores
with a high incidence of Os and Ns valid? Based on this one example, it appears that removing
287 student scores did change results for this demographic group. The result of including Os and
Ns was a slightly lower mean score. The difference in difficulties is statistically significant. The
effect size is .13, which is small but consequential.
Did the removal of this many Os and Ns make a difference in reliability? Evidently not. Variability was not affected, and item difficulty and item discrimination were not greatly affected when comparing the two samples.
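The effect size reported above can be checked from the summary values in Table 19.4. With a simple standardized mean difference and the rounded table values, the result is roughly .13 to .14.

mean_included, sd_included = 18.2, 5.9
mean_excluded, sd_excluded = 19.0, 5.8

pooled_sd = ((sd_included ** 2 + sd_excluded ** 2) / 2) ** 0.5
effect_size = (mean_excluded - mean_included) / pooled_sd
print(round(effect_size, 2))   # about 0.14 with these rounded values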
Research on O and N
Haladyna, Osborn Popp, and Weiss (2003) looked at nonresponse in item responses from the
NAEP. They found that English language learners (ELLs) were the most likely to omit and not
attempt items. As they may have both a language deficiency and less motivation to perform on
this low-stakes testing situation, scoring O and N as missed items lowers the score of ELLs. Levine
and Rubin (1979) made the same observation earlier. Several test researchers have expressed res-
ervations about the effect of Os and Ns on IRT parameter estimation (Lord, 1974, 1977; Mislevy
& Wu, 1996). De Ayala, Plake, and Impara (2006) examined the accuracy of ability estimation
under the condition of omitted responses. They concluded that score estimation was the worst
when Os and Ns were treated as incorrect, and the best estimation occurred when Os and Ns were
given a scoring weight of .5. Yamamoto also reported similar findings. A point worth making again was made by Mislevy and Wu: missingness (Os and Ns) can be ignored under some IRT scoring approaches, and a fractionally correct index works, but the reason for a missing response is not ignorable. Research needs to reveal students' motivations for Os and Ns.
One of the most comprehensive studies on Os and Ns was reported by Koretz, Lewis, Skewes-
Cox, and Burstein (1993). The NAEP has a significant non-response problem due to many fac-
tors, such as its low stakes and time limits. Also, item formats vary, and format affects whether students give up on the test. Haladyna et al. (2003) found the same tendency with NAEP
data. O rates also vary by content categories. This finding suggests that as some content is harder,
students omit the items rather than guess answers. Ludlow and O’Leary (1999) reported a study
with a small number of Irish students who had a high degree of Os and Ns. They experimented
with different scoring schemes that counted Os and Ns as incorrect and also not scorable. They
showed the impact of these different scoring methods on individual students. Their study shows
in a very direct way the consequences for how Os and Ns are treated in scoring.
Recommendations
Some test takers tend to O or N some items on SR tests. The first step is to detect how many test
takers have omitted or not attempted items. If the number is large, some caution is urged in sev-
eral regards.
1. Consider that test takers with high numbers of O and N may be providing invalid test
scores. Because they have failed to follow the administration protocol, their scores might
not be reported as real (valid) scores. ELL students tend to be frequent offenders, perhaps
due to their low language proficiency for the test being given. It makes sense to not count
Os and Ns if they do not have time to finish a test.
2. The reporting of results for a demographic group with a high degree of Os and Ns may
also be misleading, simply because there are too many invalid scores.
3. In conducting an item analysis, there may be a slight misestimation of item difficulty and
discrimination due to O and N. Removing lines of test taker data containing Os and Ns
makes for a more accurate item analysis. Counting the frequency of O and N responses is
an important item analysis tool, particularly for field-testing new items.
4. Fortunately, reliability might not be greatly influenced by Os and Ns.
The determination of how many Os and Ns should concern an item response analyst is arbitrary.
Koretz et al. (1993, p. 8) suggested that if 10% of the sample omitted and 15% of the sample did not reach some items, these test takers should be considered non-respondents. As a precau-
tion, analyses might be conducted with Os and Ns present in one data set and removed in another
data set to see if there is any difference. Such differences should be interpreted in terms of valid
or invalid test-taking.
Finally, Levine and Rubin (1979) raised the possibility that Os and Ns might be useful for study-
ing person fit. This topic is next.
7. Person Fit
With any set of selected-response (SR) or constructed-response objectively scored (CROS) item
responses, there is the possibility of an unusual response pattern. A low-scoring test taker may
correctly answer a difficult item or a high-scoring test taker may miss an easy item. Or, the test
taker may have an unpredictable pattern of answers. Person fit is a statistical method for detect-
ing invalid test scores by evaluating item responses. What constitutes an invalid test score has to
be defined first.
This section has three parts. The first gives examples of aberrant responders. The second
presents theory and research on person fit. The third briefly identifies methods of study. The sec-
tion ends with recommendations for future research and the use of person fit in item analysis.
For those interested in more extensive treatment of this topic, there is a wealth of useful sources.
An issue of Applied Measurement in Education was devoted to this topic in 1996. For more exten-
sive discussion of person fit theory, research, and technology consult these references (Drasgow,
Levine, & Zickar, 1996; Meijer, Muijtjens, & van der Vleuten, 1996; Wright & Stone, 1979). A
more recent article discusses person fit for CROS items (Glas & Dagohoy, 2005). Another article
by von Davier and Molenaar (2003) discussed person fit with a test that combines SR, CROS, and
CRSS formats. Engelhard (2009) introduced a person fit application when testing students with
disabilities. Also, person fit statistics have become a tool for cognitive psychologists in their study
of student learning (Cui & Leighton, 2009).
1. Cheating Cheating inflates estimates of knowledge or ability, thus leading to invalid inter-
pretations and uses of test scores. In many circumstances, the use of test scores inflated by cheat-
ing may have harmful effects on the public. In licensing testing, a passing score obtained by an
incompetent physician, pharmacist, nurse, architect, or automotive mechanic might negatively
affect the public. The problem of cheating is significant. Two websites are particularly informa-
tive about cheating throughout the world (caveon.com and fairtest.org).
Historically, early research on patterns suggesting cheating showed that cheating is extensive
and that error-similarity pattern analysis could be used (Bellezza & Bellezza, 1989). Two perspectives on this problem are that adjacent test takers show unusual pattern similarity and that cheating test takers answer correctly, far too often, items that they would normally not answer successfully.
Another aspect of detection is when a low-scoring test taker answers a difficult item correctly. Such an event may signal cheating. It is possible to determine whether this kind of aberrant response
happens often in a test, and the accumulated observations of aberrant responses might provide
evidence of cheating.
2. Test Anxiety Test anxiety is an impairment of goal-directed behavior (test taking) by direct-
ing attention to the threat presented in a test (poor performance). Test anxiety is an omnipresent
problem in all types of testing. Hill and Wigfield (1984) estimated that about 25% of the popula-
tion has some form of test anxiety. Test anxiety is treatable. One way is to prepare adequately for
the test. Another strategy is to provide good test-taking skills that include psychological prepara-
tion and time-management strategies.
3. Inattention Test takers who are not well motivated or easily distracted may choose SR
answers carelessly (Wise, 2006). The term sleeper has been used to denote such a test taker
(Wright, 1977). If sleeper patterns are identified, test scores might be invalidated. Low-stakes
achievement tests given to elementary and secondary students may result in sleeper scores. Some
students have little reason or motivation to sustain a high level of concentration demanded on
these tests. This point was also well made by Paris, Lawton, Turner, and Roth (1991) in their
analysis of the effects of standardized testing on children. They pointed out that older children
tend to think that such tests have less importance, thus increasing the possibility for inattention.
Detection of this problem can be investigated by examining a test taker’s response time on a
computer-based test (Wise & Kong, 2005).
4. Idiosyncratic Answering Under conditions where the test does not have important con-
sequences for the test taker or the test taker is emotionally upset, a peculiar response pattern may result. Such
behavior produces a negative bias in scores, affecting both individual and group performances.
Some examples could be pattern marking, for example, ABCD ABCD ABCD..., or BBB CCC BBB
CCC... The identification and removal of offending scores help improve the accuracy of group
results. Tests without serious consequences to older children will be more subject to idiosyncratic
pattern marking. A tendency among school-age children to mark idiosyncratically has been doc-
umented in several studies (e.g., Paris et al., 1991). Thus, the problem seems significant in situa-
tions where the test takers have little reason to do well. This kind of misfitting item responding
may be seen from individuals with exceptionalities as well.
5. Plodding Under the condition of a timed test, some students may not have enough time
to answer all items due to their plodding nature. Plodders are very careful and meticulous
in responding to each item. They may also lack time management strategies that are a normal part
of testwiseness. Thus, they do not answer items at the end of the test. The result is a lower score.
Extending the time limit for most standardized tests is not possible; therefore, the prevention of
the problem lies in better test-taking training. Another strategy is to consider the not-reached
responses as unscorable. Thus, the plodder receives a score based on the items answered.
6. Coaching and Test Preparation In all high-stakes testing, there is a technology of coaching
and test preparation that includes practice tests, testwiseness training, test-taking strategies, and
learning the content upon which the test is based. The most defensible test preparation is the last
one—learning. However, we might think of all coaching and test preparation on an ethical scale
(Haladyna, Nolen, & Haas, 1991). Having test takers practice items on the test is unethical. Using
similar items narrows the learning to what is on the test—most tests are a sample from a domain
of knowledge and skills.
Another perspective is that most coaching and test preparation is really not very advantageous
(Becker, 1990; Hausknecht et al., 2007; Linn, 1990). Most coaching gains are of a small nature,
usually less than one-fifth of a standard deviation. However, such a small gain might be conse-
quential if performance is near a cut score. The distinction that Linn made was that any change
in scores might be interpreted incorrectly if the coaching and test preparation change what was
being measured to what is on the test. If coaching involved item-specific strategies, then interpre-
tation of any gain should be that test behavior does not generalize to the larger domain that the
test score represents. When we compare coached and uncoached test takers, does that difference
reflect ability or the result of some unethical practice? Haladyna, Nolen, and Haas (1991) called
this practice test score pollution, arguing that such coaching may boost test performance without
substantially affecting the domain that a test score represents.
7. Creative Test Takers Test items of any format will contain some degree of ambiguity that
will elicit responses interpreted as incorrect. Upon deeper and further investigation, a test tak-
er’s justification may verify the accuracy of their response. Unfortunately, for SR items such
justifications are not permitted. With CROS items alternate key answers are permitted, but these
are generated by SMEs, not test takers. For CRSS items, a creative response might not be credited
or appreciated. Consider the instance of the student who wrote an essay on an X-rated movie
and received a low score in writing ability for his creative response (Farley, 2009). Although
his writing ability was excellent, he was downgraded due to his offensive content, which was not
supposed to be evaluated.
8. Language Deficiency Test takers may have a high degree of knowledge about a domain
but fail to show this knowledge because their primary language is not English. In these
instances, any interpretation or use of a test score should be declared invalid. Standard 6.10 in
the Standards for Educational and Psychological Testing urged caution in test score interpretation
and use when the language of the test exceeds the linguistic abilities of test takers. The problem
of language deficiency might also fall into two previously discussed categories, non-response
and plodding. Nevertheless, because we have so many second-language learners in the United
States, emphasizing the problem this way seems justified. Testing policies seldom recognize that
language deficiency introduces bias in test scores and leads to faulty interpretations of student
knowledge or ability.
9. Marking or Alignment Errors Test responses are often made on optically scannable answer
sheets. Sometimes, in the midst of an anxiety-provoking testing situation, test takers may mark in
the wrong places on the answer sheet. This may include marking across instead of down, or down
instead of across, or skipping one place and marking in all other places, so that all answers are off
by one or more positions. Such detection is possible. The policy to deal with the problem is again
another issue. Mismarked answer sheets produce invalid test scores. Therefore, it seems reason-
able that these mismarked sheets must be detected and removed from the scoring and reporting
process, and the test taker might be given an opportunity to correct the error if obtaining a validly
interpreted score is important.
Early person-fit research focused on the statistical detection of aberrant response patterns. Levine and Rubin (1979) showed that such detection was achievable, and
since then there has been a steady progression of studies involving several theoretical models
(Drasgow, 1982; Drasgow, Levine, & Zickar, 1996). These studies were initially done using the
three-parameter item response model, but later studies involved polytomous item response mod-
els (Drasgow et al., 1985). Drasgow et al. (1996) provided an update of their work. With more
flexible and faster computer programs, more extensive research can be conducted, and testing
programs might consider employing these methods to identify test takers whose results should
not be reported, interpreted, or used.
Nonparametric person fit statistics have proliferated, which has resulted in many choices and
little research to guide us. Karabatsos (2006) evaluated the utility of 36 different person-fit statistics and reported findings for several indexes. Van der Flier (1982) derived a person-fit statistic, U3, and studied its characteristics. The premise of U3 is a comparison of the likelihood of the observed item-score pattern with the likelihood of the Guttman pattern expected for the same number-correct score. The index is zero if the student's responses follow a Guttman pattern. Meijer, Molenaar, and Sijtsma (1994) evaluated U3,
finding it to be extremely useful for detecting item response problems. In a series of studies by
Meijer and his associates, many positive findings were reported for U3. One important finding
was that this method works best under conditions of higher reliability, longer tests, and situations
where a high proportion of examinees have aberrant patterns. U3 can also be applied to group
statistics of person fit. Meijer and Sijtsma (1995) concluded that U3 was the best among many
proposed indices available because the sampling distribution is known, easing interpretation of
results.
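The sketch below illustrates the logic of a U3-type index as commonly described: items are ordered from easiest to hardest by proportion correct, responses are weighted by the log-odds of those proportions, and the observed pattern is located between the Guttman pattern (U3 = 0) and the reversed Guttman pattern (U3 = 1) for the same number-correct score. The item proportions and response vectors are hypothetical, and an operational analysis would use specialized person-fit software.

from math import log

def u3(responses, p_values):
    # U3-type person-fit index: 0 for a Guttman pattern, 1 for a reversed
    # Guttman pattern with the same number-correct score.
    # (This sketch assumes the number correct is neither 0 nor the test length.)
    order = sorted(range(len(p_values)), key=lambda i: p_values[i], reverse=True)
    x = [responses[i] for i in order]                           # easiest to hardest
    w = [log(p_values[i] / (1 - p_values[i])) for i in order]   # log-odds weights

    r = sum(x)                                     # number-correct score
    observed = sum(xi * wi for xi, wi in zip(x, w))
    guttman = sum(w[:r])                           # the r easiest items correct
    reversed_guttman = sum(w[-r:])                 # the r hardest items correct
    return (guttman - observed) / (guttman - reversed_guttman)

p = [0.90, 0.80, 0.70, 0.55, 0.40, 0.25]           # item proportions correct
print(round(u3([1, 1, 1, 1, 0, 0], p), 2))         # Guttman-like pattern: 0.0
print(round(u3([0, 0, 1, 0, 1, 1], p), 2))         # aberrant pattern: close to 1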
8. Subscore Validation
Most tests of student achievement and professional competence may seem multidimensional.
An examination of test and item specifications will show a proposed structure that SMEs have
created with various topics that seem distinctive. Reading and mathematics come to mind as
two subject matters that appear multidimensional. Most professional tests have content that
has the appearance of many dimensions. However, these dimensions are often so highly related
that analysis of item responses often leads to a conclusion that these topics represent a single
dimension.
In most testing programs, constituencies request subscores for two reasons. Those who fail
want to know their strengths and weaknesses so they can prepare for the retest. Sponsors of train-
ing or educational programs where an ability is developed want to know in what areas their students succeeded and in what areas they failed. The Standards for Educational and Psychological Testing
also call for evidence supporting the interpretation of any subscores that are reported.
Therefore, the issue is whether subscores can be validly reported for individuals and groups.
Theory
A theoretical analysis of the content of a test will lead to a hypothesis about dimensionality. As
noted by Haladyna and Kramer (2004), the validation of subscores is a process that combines
logical analysis and empirical findings. If a subscore is to be validly interpreted, many conditions should be met, and each should be documented.
1. Subscales should be clearly defined and be clearly stated in item and test specifications.
2. Items should be developed with the specific content of that subscale.
3. Items should be validated with respect to the content of that subscale.
4. The test is designed to provide enough content representation and reliability for reporting
that subscale.
5. Subscores representing different content should have mean differences that are consistent
from test form to test form and across time. That is, if subscale 1 exceeds subscale 2 on
form A, the same should be true for form B. If not, this is negative validity evidence.
6. Correlations among subscales should not approach or reach unity (r = 1.00) after correction for attenuation (a sketch of this check appears below). In other words, the information obtained from subscales should differ from the information we get from a total score.
7. A confirmatory factor analysis should confirm the fit between the subscale structure and
the observed data for the subscales.
8. Item analysis should be conducted using the total subscore instead of the total score. Items
should be evaluated for subscore validity not total score validity.
9. Subscores should be reported along with their uncertainty, based on their reliability and the resulting standard error of measurement.
These many conditions comprise the argument for validity. Both the procedures for item devel-
opment and validation and resulting analyses suggested above constitute validity evidence for
subscores.
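Condition 6 in the list above can be checked with the classical correction for attenuation, which divides the observed correlation between two subscores by the square root of the product of their reliabilities. The values below are hypothetical; a disattenuated correlation near 1.00 signals that the subscores carry essentially the same information.

def disattenuated_r(r_observed, rel_x, rel_y):
    # Estimated true-score correlation between two subscores.
    return r_observed / (rel_x * rel_y) ** 0.5

r_observed = 0.78          # observed correlation between subscale 1 and subscale 2
rel_1, rel_2 = 0.82, 0.79  # subscale reliabilities

print(round(disattenuated_r(r_observed, rel_1, rel_2), 2))   # about 0.97: near unity,
                                                             # so the subscales add little
                                                             # beyond the total score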
Research
Fortunately, we have witnessed a surge in research on subscore validity. Apparently, testing
organizations are recognizing that if subscores are reported, each needs to be validated. Some
studies are reported chronologically to show the extent and flow of this research.
Wainer, Sheehan, and Wang (2000) experimented with augmented subscores for a teacher
competency test. Where a paucity of items exists for a subscore, information from other sub-
scores is used to strengthen subscore reliability. They concluded that with a unidimensional set
of item responses, no subscore stood out. Some researchers have approached this problem using
a multidimensional scaling method using IRT (de la Torre & Patz, 2005). However, their results
were very much like the results reported by Wainer et al. (2000). Skorupski and Carvajal (2010)
continued this line of research using augmentation. They found that the procedure increases
reliability of subscores but the resulting subscores were very highly intercorrelated (r = .97); by
other criteria stated previously, this is not an effective means for producing validly interpretable
subscores.
For a very long credentialing test, Haladyna and Kramer (2004) studied the conditions for
concluding that this test represented a unidimensional ability—dentistry. The results were subtle
and complex. Coefficient alpha was very high for this 400-item test. Subscale means differed
significantly, which supports subscale validity. As each subscale consisted of 100 items, reliability
was very high. Correlations among subscales were very high after correction for attenuation.
However, if SMEs think that these subscales should be highly intercorrelated, then a high alpha
and high subscale intercorrelations should not be taken as negative evidence for subscale valid-
ity. A factor analysis showed a common factor (overall competence) and four very weak content
factors that represented the subscales. The item analysis using the subscale total score and the
test total score showed significant differences. The logical choice for item analysis is the subscale
score. A study by de la Torre and Patz (2005) applied the principle of augmenting subscores via
a hierarchical Bayesian framework. Although they claimed more precision in the estimation of
subscores, they provided no evidence in terms of descriptive statistics that this was the case.
A study by Sinharay, Haberman, and Puhan (2007) used classical test theory to study the valid-
ity of individual and group subscale reports. If subscores are to be validly interpreted, each sub-
scale must be shown to have non-redundant information. In other words, is the subscore proven
to be more accurate than the total score for describing some trait? They reported that “subscores
for the test did not provide any added value over the total test score” (p. 27). These results applied
both to individual and group subscores. Their results resembled those of similar studies done earlier with different methodologies. Ling (2009) conducted extensive analyses for a
business test and found evidence supporting a single dimension and very little evidence sup-
porting logical subscores. Another study using classical, easy-to-follow statistical procedures by Haberman, Sinharay, and Puhan (2009) with teacher certification data also failed to support subscore validity at either the individual or the institutional level. Haberman and
Sinharay (2010) experimented with multidimensional item response theory and reported that it
provided slightly more accurate results than conventional methods. However, these researchers
did not provide descriptive statistics supporting their conclusion. Also, the calculations are very
demanding and time-consuming. What still needs to be reported is the practical import of these gains in accuracy in terms of conventional criteria: mean differences among subscores, reliability, correlations among subscores, and factor structure.
The most important finding is that it is not easy to have subscores that have added value.
Based on the results here, the subscores have to consist of at least about 20 items and have
to be sufficiently distinct from each other to have any hope of having added value. Several
practitioners believe that subscores consisting of a few items may have added value if they
are sufficiently distinct from each other. However, the results in this study provide evidence
that is contrary to that belief. Subscores with 10 items were not of any added value even for
a realistically extreme (low) disattenuated correlation of .7. The practical implication of this
finding is that the test developers have to work hard (to make the subtests long and distinct)
if they want subscores that have added value. (Sinharay, 2010, p. 169)
Recommendation
If subscores are wanted, purists like Nunnally (1967) demand that a test be developed for each
subscale. Given the impracticality of that, we seek subscores that have the characteristics listed in
the preceding theoretical analysis. Following the advice of Sinharay (2010), test developers need to build subtests that provide valid subscales. This requires planning and the same steps one uses
in test development. Before engaging in statistical study of item responses, one should always
determine the dimensionality of the test in the most refined way possible. If subscores exist, item
analysis is better served by using the total subscore, rather than the total score, as the criterion for estimating item discrimination.
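This recommendation can be made concrete with a small item-analysis sketch. The Python code below is a hypothetical illustration, assuming a 0/1-scored response matrix and a mapping from each item to the columns of its subscale; it computes each item's p-value and its corrected point-biserial discrimination against both criteria.

```python
import numpy as np

def item_analysis(scores, subscale_items):
    """p-value and corrected point-biserial for each item, computed against
    the total score and against the item's own subscale score.

    scores: examinees-by-items matrix of 0/1 item scores
    subscale_items: dict mapping each item index to the column indices
                    of the items in its subscale (including itself)
    """
    scores = np.asarray(scores, dtype=float)
    total = scores.sum(axis=1)
    results = {}
    for item in range(scores.shape[1]):
        sub_total = scores[:, subscale_items[item]].sum(axis=1)
        results[item] = {
            "p_value": scores[:, item].mean(),
            # The item is removed from each criterion to avoid spurious inflation
            "r_total": np.corrcoef(scores[:, item], total - scores[:, item])[0, 1],
            "r_subscale": np.corrcoef(scores[:, item], sub_total - scores[:, item])[0, 1],
        }
    return results
```

Items whose discrimination differs sharply under the two criteria are the ones for which the choice of criterion matters most.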
9. Weighting SR Options
SR items are usually scored 0–1, right or wrong. With the two-parameter or three-parameter IRT
model, each right answer is weighted, whereas all wrong choices are equally weighted. In other
words, distractors have equal weight.
It is very easy to prove that there is differential information in wrong answers (Haladyna &
Downing, 1993; Levine & Drasgow, 1983; Thissen, 1976; Thissen, Steinberg, & Mooney, 1989). In
fact, in chapter 17, trace lines were featured as a means for evaluating distractors for SR items. In
a full-information item analysis of any cognitive test, it is apparent that an item’s distractors will
have differential information. The simplest way to demonstrate this is with the choice mean described in chapter 17. The choice mean is the average total score of the test takers who chose a particular option. The choice means within any four- or five-option item will vary: some choices will have higher means, whereas others will have lower means.
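A choice mean is simple to compute; the sketch below is a minimal Python illustration in which the response labels, answer key, and total scores are hypothetical.

```python
import numpy as np

def choice_means(responses, key, total_scores):
    """Mean total score of the test takers selecting each option of a single item."""
    responses = np.asarray(responses)
    total_scores = np.asarray(total_scores, dtype=float)
    summary = {}
    for option in np.unique(responses):
        chose = responses == option
        summary[option] = {
            "n": int(chose.sum()),
            "choice_mean": float(total_scores[chose].mean()),
            "is_key": bool(option == key),
        }
    return summary

# Hypothetical data: option chosen and total score for eight test takers
print(choice_means(["A", "C", "B", "C", "C", "D", "B", "C"], "C",
                   [12, 27, 18, 31, 25, 9, 15, 29]))
```

In this hypothetical example the keyed option C has the highest choice mean, and the three distractors differ from one another, which is the differential information at issue.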
Theoretical Analysis
The theory of reciprocal averages is the theoretical basis for option weighting (Guttman, 1941).
The theory states, simply, that coefficient alpha is maximized when each option is weighted by its choice mean and the scoring and weighting are iterated reciprocally. The iteration process increases coefficient alpha to its maximum. Lord (1958) showed that this theory
is related to item factor analysis. Haladyna (1990) discovered that a single iteration provides
a sufficient statistic for reciprocal averages and that many other option weighting values have
virtually identical results.
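A minimal sketch of a single weighting pass follows, under the assumption that each option's weight is set to the mean standardized score of the test takers who chose it and that test takers are then rescored with those weights; the data structures are hypothetical.

```python
import numpy as np

def option_weights(response_matrix, initial_scores):
    """One reciprocal-averages pass: weight each option of each item by the mean
    standardized initial score of the test takers who selected that option."""
    responses = np.asarray(response_matrix)
    scores = np.asarray(initial_scores, dtype=float)
    z = (scores - scores.mean()) / scores.std(ddof=1)
    weights = []
    for item in range(responses.shape[1]):
        column = responses[:, item]
        weights.append({opt: float(z[column == opt].mean()) for opt in np.unique(column)})
    return weights

def rescore(response_matrix, weights):
    """Rescore each test taker with the option weights from the weighting pass."""
    responses = np.asarray(response_matrix)
    return np.array([sum(weights[i][resp] for i, resp in enumerate(row)) for row in responses])
```

Iterating the two steps to convergence yields the reciprocal-averages solution; as noted above, Haladyna (1990) reported that a single pass is essentially sufficient.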
Research
The research on option weighting has a long history with varying methodology. Haladyna (1990) reviewed and summarized that research. Not only does option weighting work, but reliability is improved. However, upon closer examination, we see that the gain in precision is concentrated in the lower third of the test score distribution. That is, if precision is desired in that sector of the test score scale, then option weighting is very desirable. Fortunately, important decisions about test takers are made in the upper third of the test score scale, where option weighting does the least good.
Recommendation
If SR items are streamlined to three options, option weighting might be feasible and beneficial in
gaining precision for lower-scoring test takers, if that is important. The choice of a method should
be a matter of convenience. Simple statistics or the use of IRT can provide satisfactory test scores.
Part VI contains a single chapter—a capstone for this book. This chapter begins with a brief his-
tory of testing that features the role of item development and validation. The second part of this
chapter explores four areas of inquiry that will affect item development’s future. These areas are
(a) the importance and role of theory, (b) the development of new item formats, (c) research, and
(d) the influence of technology. All are inextricably related, but each represents a specific schol-
arly focus that constantly needs attention if the science of item development is to flourish.
20
The Future of Item Development and Validation
The scholarly study of item development has been generally perceived to be deficient when com-
pared with other aspects of test development (Cronbach, 1970; Haladyna, 2004; Nitko, 1985;
Roid & Haladyna, 1982). As the test item is the most basic building block of any test, item devel-
opment continues to be one of the most important activities in test development. The Handbook
of Test Development (Downing & Haladyna, 2006) devotes 9 of its 32 chapters to item develop-
ment. Chapters in various editions of Educational Measurement have featured item development
(Coffman, 1971; Ebel, 1951; Millman & Greene, 1999; Wesman, 1971). We know that the cost of
developing an item bank and validating its items is considerable. Although the scholarly study of item development has increased, urgent needs remain. This chapter describes
some of these needs.
This closing chapter features two sections. The first is a brief history of testing that provides a
context. The second section discusses four areas of inquiry needed to advance the science of item
development.
A Brief History of Testing
By the late nineteenth century, written examinations were in common use for measuring student achievement. In 1905, the Binet Intelligence Test was introduced. It con-
sisted of a battery of performance tasks. The introduction of the conventional multiple-choice
(CMC) format led to an explosion of large-scale testing in the US. The Army Mental Test was a
mixture of short-answer, constructed-response (CR) and CMC items (Yoakum & Yerkes, 1920).
The test scores were highly correlated with the Binet Intelligence Test results, so the test appar-
ently measured intelligence as defined and measured then.
The introduction of the CMC format sparked a flood of scholarly studies on the merits and
deficits of the SR and CR formats (Eurich, 1931; Hurd, 1932; Kinney, 1932; Meyer, 1934, 1935;
O’Dell, 1928; Patterson, 1926; Ruch, 1929). For the most part, these studies favored the use of
CMC format mainly for the benefits of efficiency in scoring and reliability. One of the earliest
scholarly publications on testing is a book by Thorndike (1904) entitled An Introduction to the
Theory of Mental and Social Measurements. This provided the basis for the development work
that followed.
In 1936, high-speed scanning of test answer sheets was introduced. This technology made large-scale testing more efficient and reduced scoring error. When the Stanford Achievement Test
(First Edition) was introduced (Kelley, Ruch, & Terman, 1933), it led to a proliferation of other
standardized achievement tests given to school-age students. Interestingly, The Stanford One
consisted of three-option, completion-type CMC items. Most items involved recall of facts. Later,
the Scholastic Aptitude Test and the American College Test were introduced and became the
most famous and largest testing programs in the United States. All these tests featured the CMC
format, which has endured to the present day. Nonetheless, discontent within the testing community over the CMC format and its low cognitive demands led to the reintroduction of CR formats. As a result, the constructed-response objectively scored (CROS) and constructed-response subjectively scored (CRSS) formats now join the SR formats in the toolbox for item
development for most testing programs discussed in this volume and widely in use throughout
the world.
Today, standardized testing is a huge, worldwide enterprise. The variety of standardized test-
ing programs includes tests of elementary and secondary achievement sponsored by school dis-
tricts, states, and testing companies, and one national test—the National Assessment of Educa-
tional Progress (NAEP). We have college and graduate school admissions tests, intelligence and
specific ability tests, language proficiency tests, psychological tests, licensing, certification, and
training tests, and armed forces qualification tests, among many others. It is a growing industry
as the public’s need for test score information continues to increase. For every one of these testing
programs, item development continues to be the most expensive and arguably the most impor-
tant activity. Without validated test items, valid test score interpretations are not possible.
Theory
Theorizing is a very important human activity because a theory is a symbolic representation of our collective experience.
A theory is in constant need of validation or revision. That is the nature of inquiry. Often a well-
developed theory will lead to the development of a technology, which is the sum of the ways in
which we provide the products and services needed in our civilization. Item and test development
are related technologies. The first is an integral aspect of the second.
We have four distinctly different theoretical influences on item development. The first is learn-
ing theory. The second is validity theory. The third is test score theory. The fourth is item development
theory.
Learning Theory As presented previously in chapters 1 and 2, behavioral learning theory domi-
nated test and item development mostly in the early and mid-twentieth century. Later in that
century, cognitive learning theory emerged (Messick, 1989; Mislevy, 2006). Current-day achieve-
ment testing practices involve a peculiar amalgam of these two competing theories. Criterion-
referenced testing and instructional objectives derive from behavioral learning theory. A domain
of knowledge and skills is the target of testing. Cognitive learning theory involves cognitive proc-
esses underlying learning. Cognitive learning theory has not yet led to a well-established unified
approach to teaching, learning, and testing, but recent efforts are very promising (see Ferrara,
Svetina, Skucha, & Davidson, 2012; Mislevy & Riconscentes, 2006). Theorists and researchers
are paying more attention to construct definition and parsing the subtleties of complex cogni-
tive demands required in curricula and instruction in educational programs, occupations, and
recreation.
A major idea taken from contemporary cognitive learning theory is that learning consists of
the development of cognitive abilities (Lohman, 1993; Messick, 1984; Sternberg, 1999). In test-
ing for the professions, an ability is often thought of as a domain of tasks performed by qualified
persons in that profession (Raymond & Neustel, 2006). By focusing on knowledge and skills as
existing in an achievement domain, we pay homage to behavioral learning theory. The more
useful domain should specify the tasks to be performed that require knowledge and skills used in
complex ways. In reading, we have many purposes for reading that comprise this target domain.
In writing, there is a domain of writing tasks that also comprises the target domain. In any pro-
fession, there are tasks performed that represent professional practice. They may involve clients,
patients, or simply problems that we encounter in everyday life.
In time, we should see a continued shift away from behavioral learning theory to cognitive
learning theory. Despite the diversity within cognitive psychology, unification is necessary if
there is to be a learning science that is the basis of teaching and testing. Where there is more
interaction and consensus-building, all of teaching, learning, and testing will benefit.
Cognitive psychologists and measurement specialists have long championed partnerships that
combine their complementary abilities for the more valid measurement of abilities (Mislevy,
2006; Snow & Lohman, 1999). Such partnerships are happening and should continue. The merit
of this work has yet to show up in testing industry practice in significant ways. We have few
instances of testing programs guided or influenced by cognitive learning theory. The redesign of
AP science tests is one strong example (see chapter 12, guideline 1).
Validity Theory Validity theory has evolved steadily since the concept was introduced by Cronbach and Meehl (1955). Its most current version features ideas embodied throughout this
volume (Kane, 2006a, 2006b). The target domain is the focus and basis for defining the construct.
The target domain identifies the tasks performed under ideal conditions. The universe of gener-
alization is the operational representation of the target domain. We create an interpretive argu-
ment in support of valid test score interpretation and use. We make a claim for valid test score
interpretation and use, and validity evidence can support or weaken that claim. Validation is the
quest for validity. The claim for validity is subject to critical evaluation. How much validity exists
is a judgment made after arguments, claims, and evidence are assessed.
We also have evidence-centered test design, which provides a rational basis for content analy-
sis and item and test design leading to interpretations of validity (Mislevy & Riconscentes, 2006).
Another approach is assessment engineering (Luecht, 2012), which incorporates automatic item
generation (AIG) into the test development process. We might consider these alternatives to test
design and construction as competing, and only time will tell which idea has the most usefulness
to test developers or to what extent these approaches to testing have common elements.
Test Score Theory Classical test theory and the related generalizability theory still play important
roles in test development (Brennan, 2001; Lord & Novick, 1968). Validity notwithstanding, reliability
is the most important result of the use of either theory. Discriminating items directly and positively
contribute to reliability. Thus, it is necessary to design and use highly discriminating test items.
Item response theory (IRT) has enormous appeal because of its capability to make scaling more
efficient for standardized tests. Scaling for comparability continues to be a major activity in stand-
ardized testing and a challenge where CRSS items are used. IRT also has important implications for
test design and creating tests focused on the ability level of the test taker. The important implication
is that a test can be tailored (customized) for each test taker (Davey & Pitoniak, 2006). Tests of the
future will employ IRT test design principles that will reduce the need for items that are too hard
or too easy to be administered to each test taker. Such test items will map to the cognitive learning
theory appropriate for the given target domain. Item development will have a different objective—
items need to be designed with various levels of difficulty while retaining high discrimination.
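As a sketch of this design principle, the two-parameter item information function, I(theta) = a^2 * P(theta) * [1 - P(theta)], shows where on the ability scale an item is most useful. The Python code below is illustrative only; the item bank and parameter values are hypothetical assumptions, not taken from any program described here.

```python
import numpy as np

def information_2pl(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a ** 2 * p * (1.0 - p)

# Hypothetical item bank: (discrimination a, difficulty b)
bank = [(0.8, -1.5), (1.2, 0.0), (1.6, 0.4), (0.9, 1.8)]

def most_informative(theta, bank):
    """Index of the bank item with the most information at the current ability
    estimate, the core idea behind tailoring a test to each test taker."""
    return max(range(len(bank)), key=lambda i: information_2pl(theta, *bank[i]))

print(most_informative(0.3, bank))  # picks the highly discriminating item with b near 0.3
```

Highly discriminating items concentrate their information near their difficulty value, which is why an item bank needs such items spread across the difficulty range.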
Item Development Theory Proposals for item generation point toward a theory of item development in which construct analysis provides the rationale and basis for each item. The goal is predictable item difficulty and high discrimination for items
computer-generated in large numbers. Item generation resides in a specific context where (a) the
ability construct is clearly defined, (b) a target domain is identified, and (c) item models explicate
the target domain. Each item model can generate many items automatically. Although we have
seen substantial progress, there is a long way to go before item generation is a mature science.
Item generation has great potential but one challenge has to do with the nature of content. For
well-structured content, item generation seems feasible. For ill-structured content, item genera-
tion is daunting. With the estimated cost of validated SR test items approaching or exceeding
$2,000, item generation may play a supporting or major role in item development in the future
for the simple advantage of economy. Is it more valid? As item generation is integrally linked to
construct definition, it would appear that content is better linked to items when compared with
methods that employ committees of subject-matter experts (SMEs).
There is no unified item development theory yet. We have proposals for item generation, and
we have an extant technology for item development cobbled from the collective wisdom of many
and codified via a consensus with some supporting research (Haladyna, 2004; this volume). A
long-term objective is to develop, refine, and validate an item development theory that will lead
to item development that is not only validated but inexpensive.
Item Formats
Performance testing is several thousand years old. Modern standardized testing appears to
have received its impetus from the introduction of SR item formats—namely the CMC format.
For the most part, the CMC format was favored for many good reasons. However, an inspec-
tion of the Stanford Achievement Test (First Edition) will show that most CMC test items had
a low cognitive demand (Kelley, Ruch, & Terman, 1933). The continued use of SR formats for
measuring memory-type content caused unrest among educators and measurement specialists, leading to a resurgence of CR testing, mostly involving subjective scoring. As
this volume has shown, we have many SR and CR formats that can be employed for a variety
of content and cognitive demands. Fortunately, new formats are being developed for use on a
computer. Research will always be needed to establish the efficacy of these new formats. As we
have noted, there is a peculiar, illogical adherence to four- and five-option CMC items when
theory, research, and practicality point to using fewer options, as presented in chapter 5. The
result of further scrutiny of the CMC format should lead to the use of CMC items with only
two or three options. Also, new item formats will continue to be introduced that require study
and evaluation. The criteria for evaluating formats will include the types of content measured,
the cognitive demand elicited by each item format, and the standard statistical properties of
difficulty and discrimination.
Throughout this book, the use of all formats has been advocated as long as the content and
cognitive demand expected are within the capability of a format. Fortunately, we have considerable
research that reveals the capability of different formats. For example, the edited book by Bennett
and Ward (1993) made an important contribution to our understanding of item formats. Other efforts
have contributed to this expanding knowledge base on item formats (Haladyna, 2004; Sireci &
Zeniskey, 2006).
As item development theory expands and is refined, all formats will be included in item develop-
ment and test design but a careful analysis of each format’s potential and costs will be part of the
decision-making before employing any format. A key issue is the fidelity of the item format to
the target domain. The test item task that most closely resembles some criterion behavior is the
most desirable, but such test items usually have higher costs and less efficiency. So there is always
a tradeoff between fidelity and cost/efficiency.
The fundamental item-format categories are SR, CROS, and CRSS, but within each category
there is considerable variety. Test developers and researchers have yet to explore the potential of
all the formats presented in this volume. As research continues on item formats, we expect to find
that we have many to choose from for highly effective and efficient testing. Validity will be served
by using the full array of these format choices.
Research
Research on item development typically involves item formats and guidelines for writing items; it began, roughly, with the introduction of SR formats in the early 1900s. As noted in the brief
history of testing, researchers debated the merits of CR versus SR item formats. For the most
part, item development was a technology that was based on advice from testing specialists. Since
then, there has been a steady, but not overwhelming, stream of innovation and research involving
item formats, guidelines for writing items, and criteria for evaluating items (Haladyna, 2004, this
volume). Depending upon where theory takes us, future research will either support a sustaining theory of item development or continue the study of new and old formats.
We should consider what kinds of problems future research on item development should
solve. The research agenda implied here involves better ways to define constructs that we want
to measure, content specification methods that lead to item and test specifications, guidelines for
developing all types of items, and methods for evaluating performance on these items.
Figure 20.1 gives an example from the Common Core State Standards. These are two Grade 5
standards in the domain of Number & Operations in Base Ten (NBT). This section of the domain
concerns understanding the place value system.
5.NBT.1. Recognize that in a multi-digit number, a digit in one place represents 10 times
as much as it represents in the place to its right and 1/10 of what it represents in the place
to its left.
5.NBT.2. Explain patterns in the number of zeros of the product when multiplying a num-
ber by powers of 10, and explain patterns in the placement of the decimal point when a
decimal is multiplied or divided by a power of 10. Use whole-number exponents to denote
powers of 10.
Figure 20.1 Example of specific objectives from the Common Core Standards.
Source: http://www.corestandards.org/the-standards/mathematics
How the standards and the specific outcome statements in Figure 20.1 lead to the development
of specific, high-fidelity items is a significant challenge. The way the content is specified enables
item developers to create items of high fidelity that can be validated. In one respect, those work-
ing on item-generation theories might enable such an item development project by developing item models whose algorithms generate additional items, as sketched below.
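A hypothetical item model for standard 5.NBT.2 in Figure 20.1 might parameterize the decimal value and the power of 10 and derive distractors from predictable decimal-point errors. The sketch below is entirely illustrative; the template, value ranges, and distractor rules are assumptions, not part of the Common Core materials or any operational program.

```python
import random

def nbt2_item_model(rng=None):
    """Hypothetical item model for 5.NBT.2: multiply a decimal by a power of 10.
    Distractors reflect common decimal-point misplacement errors."""
    rng = rng or random.Random()
    decimal = round(rng.uniform(1.1, 9.9), 1)          # e.g., 3.6
    exponent = rng.randint(1, 3)                       # power of 10
    key = round(decimal * 10 ** exponent, 4)
    distractors = {
        round(decimal * 10 ** (exponent - 1), 4),      # shifted one place too few
        round(decimal * 10 ** (exponent + 1), 4),      # shifted one place too many
        round(decimal + 10 ** exponent, 4),            # added instead of multiplied
    }
    distractors.discard(key)
    options = sorted(distractors) + [key]
    rng.shuffle(options)
    return {"stem": f"What is {decimal} x 10^{exponent}?", "options": options, "key": key}

# Each call generates another parallel item from the same model.
print(nbt2_item_model(random.Random(42)))
```

Each generated item is traceable to the standard that defined the model, which is the sense in which item generation ties content to items.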
For other content, such as for professional credentials (certification or licensing), the content
might be based on a curriculum or a content outline or a set of competencies. Cognitive learning
theory implies that professional competence is best represented by a set of tasks performed in a
profession. A test is a high-fidelity sample from that target domain of tasks. However, the tasks chosen for testing must be feasible and practical to administer. For instance, a test for electrophysiologists of
highest fidelity involves treating patients with heart arrhythmia. It may be impractical to allow a
cardiologist seeking such certification to diagnose and treat several patients as part of the certifi-
cation process. Thus, a set of test-worthy tasks performed under competent supervision may be
implemented.
One continuing, vexing issue is the cognitive demand desired for any content to be measured.
As we have no widely accepted and validated taxonomy of cognitive complexity, such statements
of learning outcomes as illustrated in Figure 20.1 are silent as to the cognitive demand.
Ultimately, the most useful outcome of the study of content and its curriculum that drives
instruction is greater specificity. And this will come with a well-constructed theory of learning
that incorporates the concept of validity and the need for validation, and an item development
theory that makes item development more objective and standardized and less susceptible to
individual subjective judgment.
Guidelines Guidelines for writing items mostly came from the wisdom of testing specialists as
captured in many textbooks dating from the early 1930s to the present. Haladyna and Downing
(1989a) evaluated the content of these textbooks and distilled this accumulated wisdom into
guidelines for preparing SR items. An examination of the research supporting these guidelines
shows that few guidelines have been studied adequately and other guidelines remain common
sense or neglected (Haladyna & Downing, 1989a, 1989b; Haladyna, Downing, & Rodriguez,
2002). There is widespread agreement for many guidelines, and research shows that violating
guidelines lowers item quality. New research and technology point to improved guidelines for item development (Ferrara et al., 2012). An urgent need exists to improve guidelines for devel-
oping CR items. Continued study of guidelines and research on guidelines will only serve to
unify and validate the guidance we have in writing test items. In this volume, we presented a
set of guidelines for CR item design and scoring in chapters 11 and 12 that was based on several
sources.
Research on theories of item development will also progress. In this volume and this chapter,
we have recognized the developing field of item generation and its potential for guiding item development in
a different way. Instead of using this extant technology with its formats and guidelines, construct
analysis will lead to item models that will automate many aspects of item development. The con-
ceptual work of item design is done at the beginning with content analysis.
Item Analysis The concept of item analysis has very old roots and has not changed much (Guil-
ford, 1954; Lord, 1952). We are always concerned with item difficulty (the p-value) and item
discrimination regardless of the item format. With CRSS, item analysis has additional foci (rater
consistency and bias). We have a wealth of recent research on item analysis methods, which is
reported in chapters 17 and 18. The most important recent trend is to capture all activities in
the polishing and improving of items as item validation (Downing & Haladyna, 1997), which
is a strong feature in this book. IRT offers some help in this direction, but not to the extent that
difficulty and discrimination are replaced by item parameters having the same descriptive func-
tions. The importance of item analysis is in the context of item validation along with many other
activities that continuously seek to improve the item. More properly said, validation of test score
interpretation is what validity is all about. At the item level, validation of item response interpre-
tation is just as important. Thus, we have emphasized item validation as a central concern in this
book.
Scoring With the introduction of the SR format in the early 1900s in the United States, CR for-
mats were less used in standardized testing due to many attendant faults involved with scoring
(Coffman, 1966). As many scholars and researchers have observed, the main problem has always
been the consistency of raters viewing a performance or a constructed-response. Another prob-
lem is rater bias, which takes many forms. Despite the systematic study of rater consistency and
rater bias, testing programs are not very attentive to the threats to validity created by using the
CRSS format. More research that unifies our understanding of these threats to validity should
lead to a technology that increases rater consistency and decreases or eliminates all forms of rater
bias. Technical reporting for tests using CRSS formats should routinely report rater consistency
and the many forms of rater bias that may undermine validity.
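Such reporting can begin with simple indices of rater consistency. The sketch below computes exact agreement and Cohen's (1960) kappa for two raters scoring the same set of responses; the Python code and rubric scores are hypothetical illustrations, not a reporting standard.

```python
import numpy as np

def rater_consistency(rater1, rater2):
    """Exact agreement and Cohen's kappa for two raters' scores on the same responses."""
    rater1, rater2 = np.asarray(rater1), np.asarray(rater2)
    categories = np.union1d(rater1, rater2)
    observed = np.mean(rater1 == rater2)
    # Chance agreement from each rater's marginal category proportions
    p1 = np.array([np.mean(rater1 == c) for c in categories])
    p2 = np.array([np.mean(rater2 == c) for c in categories])
    expected = float(np.dot(p1, p2))
    kappa = (observed - expected) / (1 - expected)
    return {"exact_agreement": float(observed), "kappa": float(kappa)}

# Hypothetical 0-4 rubric scores from two raters on ten essays
print(rater_consistency([3, 2, 4, 1, 3, 2, 0, 4, 3, 2],
                        [3, 2, 3, 1, 3, 1, 0, 4, 3, 2]))
```

Indices such as these, reported routinely, make rater consistency visible to score users.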
With the SR item, studies that focus on the benefits of distractors should continue to reveal, as in the past, that many distractors fail to perform as expected and should not be used. The conclusion we have
drawn is that the quest for four or five well-functioning options per item is futile.
Technology
As noted previously in this final chapter, technology has two connotations: (a) accumulated wis-
dom that leads to practice and (b) inventions that make the business of psychological and educa-
tional measurement more efficient or more valid.
Research drives the establishment of accumulated wisdom. As this volume shows, we have
increased our understanding of how to develop and validate CRSS and CROS items. At the
same time we have increased efforts to develop and validate SR items. This volume provides
summaries of this accumulated wisdom for constructing and validating items in any format.
Future efforts should continue to build our knowledge base supporting the technology for test
item development.
Computer-based testing has made many improvements in what gets delivered online and how
(Mortimer, Stroulia, & Yazdchi, 2012; Sireci & Zeniskey, 2006). The future is very clearly in favor
of computer-adaptive testing, computer-based testing, and Internet-based testing. Along with
these advances in computer-based testing comes a responsibility for continued research that
emphasizes validity. Does every innovation serve to improve validity? Or, at the least, do these
innovations make item development and administration more efficient?
One of the most promising technological developments is automated essay scoring (Sher-
mis & Burstein, 2004). We have an extant technology for scoring essays and the spoken word
using intelligent systems. The shift from human scoring to automated scoring is happening but
progress is slow, perhaps due to a reservation that automated scoring can miss performances that
are inferior due to deception or idiosyncrasy. As automated essay scoring becomes refined, it
will replace human scoring for the most part, but automated scoring depends initially on expert
human scoring as a preliminary step to production scoring. Automated scoring will reduce rater
inconsistency and eliminate rater bias, so continued research and development on this technol-
ogy will greatly improve the use of the CRSS format in future testing.
Automated test assembly and automated adaptive testing are other technologies that are rap-
idly replacing older technologies. Although these technologies are not specifically centered on
item development, item developers need to take heed of the conditions surrounding these tech-
nologies in the design of item banks.
In Conclusion
Standardized testing continues to be a major event in the lives of students and others throughout
the world. Classroom testing is also an enormous enterprise that is less visible but very important
to learners. Both formative and summative aspects of classroom testing employ the same theories
and technology that we use in standardized testing. However, the resources devoted to classroom
testing are minuscule in comparison to the cost of developing and administering large-scale test-
ing programs.
Item development and validation continue to be a central aspect of test development. The
scholarly effort to improve items is reaping rewards but there is much work ahead. The greatest
needs are (a) unification of learning theory, (b) item-writing theory that leads to an item devel-
opment technology, (c) more effective use of item formats, (d) validated guidelines for item development, and (e) continued growth and use of technologies that enable and assist item
development and validation. With unification of these diverse yet related efforts, the science
of item development moves from a revolutionary, immature one to a mainstream science that
benefits all of us in test development.
References
Abedi, J. (2006). Language issues in item development. In S. M. Downing & T. M. Haladyna (Eds.) Handbook of test
development (pp. 377–398). Mahwah, NJ: Lawrence Erlbaum Associates.
Abedi, J. (2009). English language learners with disabilities: Classification, assessment, and accommodation issues. Journal of
Applied Testing Technology, 10(2). Retrieved from http://www.testpublishers.org/journal-of-applied-testing-technology
Abedi, J., Bayley, R., Ewers, N., Mundhenk, K., Leon, S., Kao, J., & Herman, J. (2012). Accessible reading assessments for
students with disabilities. International Journal of Disability, Development and Education, 59(1), 81–95.
Abedi, J., Leon, S., & Mirocha, J. (2001). Impact of students’ language background on standardized achievement test results:
Analyses of extant data. Los Angeles: University of California, National Center for Research on Evaluation, Stan-
dards, and Student Testing.
Abedi, J., & Lord, C. (2001). The language factor in mathematics tests. Applied Measurement in Education, 14, 219–234.
Abedi, J., Lord, C., Hofstetter, C., & Baker, E. (2000). Impact of accommodation strategies on English language learners’
test performance. Educational Measurement: Issues and Practice, 19(3), 16–26.
Adelson, J. L., & McCoach, D. B. (2010). Measuring the mathematical attitudes of elementary students: The effects of a
4-point or 5-point Likert-type scale. Educational and Psychological Measurement, 70(5), 796–807.
Albanese, M. (1993). Type K and other complex multiple-choice items: An analysis of research and item properties.
Educational Measurement: Issues and Practice, 12(1), 28–33.
Albanese, M. A., Kent, T. A., & Whitney, D. R. (1977). A comparison of the difficulty, reliability, and validity of complex
multiple-choice, multiple response, and multiple true-false items. Annual Conference on Research in Medical Edu-
cation, 16, 105–110.
Albanese, M. A., & Sabers, D. L. (1988). Multiple true-false items: A study of interitem correlations, scoring alternatives,
and reliability estimation. Journal of Educational Measurement, 25, 111–124.
Alcolado, J., & Afzal Mir, M. (2007). Extended-matching questions for finals. Oxford, United Kingdom: Churchill
Livingstone.
Allalouf, A. (2007). Quality control procedures in the scoring, equating, and reporting of test scores. Educational Mea-
surement: Issues and Practice, 26(1), 36–43.
Allen, N., Holland, P., & Thayer, D. (2005). Measuring the benefits of examinee-selected questions. Journal of Educa-
tional Measurement, 42(1), 27–34.
American Educational Research Association (2000). Position statement on high-stakes testing. Retrieved from http://
www.aera.net/AboutAERA/AERARulesPolicies/tabid/10198/
American Educational Research Association, American Psychological Association & National Council on Measurement
in Education. (1999). Standards for Educational and Psychological Testing. Washington, DC: American Educational
Research Association.
American Psychological Association. (2009). Publication manual of the American Psychological Association (6th ed.).
Washington, DC: Author.
Anastakis, D. J., Cohen, R., & Reznick, R. K. (1991). The structured oral examination as a method for assessing surgical
residents. The American Journal of Surgery, 162(1), 67–70.
Anderson, L. W., & Krathwohl, D. R. (Eds.). (2001). A Taxonomy for Learning, Teaching, and Assessing: A Revision of Bloom’s
Taxonomy of Educational Objectives. New York, NY: Longman.
Anderson, L. W., Krathwohl, D. R., Airasian, P. W., Cruikshank, K. A., Mayer, R. E., Pintrich, P. R., Raths, J., & Wittrock,
M. C. (Eds.). (2001). A taxonomy for learning, teaching, and assessing: A revision of Bloom’s taxonomy of educational
objectives. New York, NY: Longman.
Anderson, L. W., & Sosniak, L. A. (Eds.). (1994). Bloom’s taxonomy: A forty-year retrospective. Ninety-third yearbook of
the National Society for the Study of Education, Pt.2., Chicago, IL: University of Chicago Press.
Andrich, D. (1988). Rasch models for measurement. Thousand Oaks, CA: Sage.
Ansley, T. N., Spratt, K. F., & Forsyth, R. A. (1988, April). An investigation of the effects of using calculators to reduce the
computational burden on a standardized test of mathematics problem solving. Paper presented at the annual meeting
of the American Educational Research Association, New Orleans.
Arter, J. A., & Spandel, V. (1992). Using portfolios of student work in instruction and test. Educational Measurement:
Issues and Practice, 11(1), 36–44.
Ascalon, M. E., Meyers, L. S., Davis, B. W., & Smits, N. (2007). Distractor similarity and item-stem structure: Effects on
item difficulty. Applied Measurement in Education, 20(2), 153–170.
Attali, Y., & Bar-Hillel, M. B. (2003). Guess where: The position of correct answers in multiple-choice test items as a
psychometric variable. Journal of Educational Measurement, 40(2), 109–128.
Attali, Y., & Fraenkel, T. (2000). The point-biserial as a discrimination index for distractors in multiple-choice items:
Deficiencies in usage and an alternative. Journal of Educational Measurement, 37, 77–86.
Ayers, S. F. (2001). Developing quality multiple-choice tests for physical education. Journal of Physical Education, Recre-
ation & Dance, 72(6), 23–28.
Bacon, D. R. (2003). Assessing learning outcomes: A comparison of multiple-choice and short-answer questions in a
marketing context. Journal of Marketing Education, 25(1), 31–36.
Baker, F. (2001). The basis of item response theory. Clearinghouse on Assessment and Evaluation. Retrieved from http://
info.worldbank.org/etools/docs/library/117765/Item%20Response%20Theory%20-%20F%20Baker.pdf
Baldwin, D., Fowles, M., & Livingston, S. (2005). Guidelines for constructed-response and other performance assessments.
Princeton, NJ: Author.
Bannert, M., & Mengelkam, C. (2008). Assessment of metacognitive skills by means of instruction to think aloud and
reflect when prompted. Does the verbalisation method affect learning? Metacognition and Learning, 3(1), 39–58.
Baranowski, R. A. (2006). Item editing and editorial review. In S. M. Downing & T. M. Haladyna (Eds.) Handbook of test
development (pp. 349–357). Mahwah, NJ: Lawrence Erlbaum Associates.
Barnette, J. J. (2000). Effects of stem and Likert response option reversals on survey internal consistency: If you feel the
need, there is a better alternative to using those negatively worded stems. Educational and Psychological Measure-
ment, 60(3), 361–370.
Barrows, H. (1993). An overview of the uses of standardized patients for teaching and evaluating clinical skills. Academic
Medicine 68(6), 443–453.
Becker, B. J. (1990). Coaching for Scholastic Aptitude Test: Further synthesis and appraisal. Review of Educational
Research, 60, 373–418.
Becker, D. F., & Pomplun, M. R. (2006). Technical reporting and documentation. In S. M. Downing & T. M. Haladyna
(Eds.), Handbook of test development (pp. 711–723). Mahwah, NJ: Lawrence Erlbaum Associates.
Beddow, P. A. (2012). Accessibility theory for enhancing the validity of test results for students with special needs. Inter-
national Journal of Disability, Development and Education, 59(1), 97–111.
Beddow, P. A., Kurz, A., & Frey, J. R. (2011). Accessibility theory: Guiding the science and practice of test item design
with the test-taker in mind. In S. N. Elliott, R. J. Kettler, P. A. Beddow, & A. Kurtz (Eds.), Handbook of accessible
achievement tests for all students: Bridging the gaps between research, practice, and policy (pp. 163–182). New York,
NY: Springer.
Bejar, I. I. (1993). A generative approach to psychological and educational measurement. In N. Frederiksen (Ed.), Test
theory for a new generation of tests (pp. 323–357). Hillsdale, NJ: Lawrence Erlbaum Associates.
Bejar, I. I. (2002). Generative testing: From conception to implementation. In S. H. Irvine & P. C. Kyllonen (Eds.), Item
generation for test development (pp. 199–218). Mahwah, NJ: Lawrence Erlbaum Associates.
Bejar, I. I. (2012). Item generation: Implications for a validity argument. In M. J. Gierl & T. M. Haladyna (Eds.) Automatic
item generation (pp. 40–54). New York: Routledge.
Bejar, I. I., Lawless, R. R., Morley, M. E., Wagner, M. E., Bennett, R. E., & Revuelta, J. (2003). A feasibility study of on-the-
fly item generation in adaptive testing. Journal of Learning, Technology and Assessment 2(3), 2–29.
Bejar, I. I., & Yocom, P. (1991). A generative approach to the modeling of isomorphic hidden figure items. Applied Psy-
chological Measurement, 15(2), 129–137.
Beller, M., & Gafni, N. (2005). Can item formats (multiple-choice vs. open-ended) account for gender differences in
mathematics achievement? Behavioral Science, 42(1–2), 1–21.
Bellezza, F. S., & Bellezza, S. F. (1989). Detection of cheating on multiple-choice tests by using error-similarity analysis.
Teaching of Psychology, 16, 151–155.
Bennett, R. E. (1993). On the meaning of constructed response. In R. E. Bennett & W. C. Ward (Eds.) Construction versus
choice in cognitive measurement: Issues in constructed response, performance testing, and portfolio assessment (pp.
1–27). Hillsdale, NJ: Lawrence Erlbaum Associates.
Bennett, R. E. (1993). On the meaning of constructed response. In R. E. Bennett, W. C. Ward, D. A. Rock, & C. LaHart
(Eds.), Toward a framework for constructed-response items (pp. 1–27). Princeton, NJ: Educational Testing Service.
Bennett, R. E., Morley, M., Quardt, D., Rock, D. A., Singley, M. K., Katz, I. R., & Nhouyvanisvong, A. (1999). Psychomet-
ric and cognitive functioning of an under-determined computer-based response type for quantitative reasoning.
Journal of Educational Measurement, 36, 233–252.
Bennett, R. E., & Ward, W. C. (Eds.) (1993). Construction versus choice in cognitive measurement: Issues in constructed
response, performance testing, and portfolio assessment. Hillsdale, NJ: Lawrence Erlbaum Associates.
Bennett, R. E., Ward, W. C., Rock, D. A., & LaHart, C. (1990). Toward a framework for constructed-response items. Princ-
eton, NJ: Educational Testing Service.
Bereiter, C., & Scardamalia, M. (1998). Beyond Bloom’s taxonomy: Rethinking knowledge for the knowledge age. In
A. Hargreaves, A. Lieberman, M. Fullan, & D. Hopkins (Eds.), International handbook of educational change (pp.
675–692). Boston, MA: Kluwer Academic.
Bergling, B. M. (1998). Constructing items measuring logical operational thinking: Facet design-based item construction
using multiple categories scoring. European Journal of Psychological Assessment, 14(2), 172–187.
Bernardin, H. J. (1978). Effects of rater training on leniency and halo errors in student ratings of instruction. Journal of
Applied Psychology, 63, 301–308.
Bernardin, H. J., & Pence, E. C. (1980). Effects of rater training: creating new response sets and decreasing accuracy.
Journal of Applied Psychology, 65, 60–66.
Bernardin, H. J., & Walter, C. S. (1977). Effects of rater training and diary-keeping on psychometric error in ratings.
Journal of Applied Psychology, 62, 64–69.
Bernstein, J., & Cheng, J. (2007). Logic and validation of fully automatic spoken English test. In M. Holland, & F. P. Fisher
(Eds.), The path of speech technologies in computer assisted language learning (pp. 174–194). Florence, KY: Routledge.
Bernstein, J., Van Moere, A., & Cheng, J. (2010). Validating automated speaking tests. Language Testing, 27(3),
355–377.
Beullens, J., Van Damme, B., Jaspaert, H., & Janssen, P. J. (2002). Are extended-matching multiple-choice items appropri-
ate for a final test in medical education? Medical Teacher, 24(4), 390–395.
Bloom, B. S., Engelhart, M. D., Furst, E. J., Hill, W. H., & Krathwohl, D. R. (1956). Taxonomy of educational objectives. New
York, NY: Longmans Green.
Borman, W. C. (1979). Format and training effects on rating accuracy and rater errors. Journal of Applied Psychology,
64, 410–421.
Bormuth, J. R. (1970). On a theory of achievement test items. Chicago, IL: University of Chicago Press.
Braun, H. (1988). Understanding score reliability: Experiments in calibrating essay readers. Journal of Educational Sta-
tistics, 13(1), 1–18.
Breland, H. M., Danos, D. O., Kahn, H. D., Kubota, M. Y., & Bonner, M. W. (2005). Performance versus objective testing
and gender: An exploratory study of an Advanced Placement History Examination. Journal of Educational Mea-
surement, 31(4), 275–293.
Brennan, R. L. (2001). Generalizability theory. New York, NY: Springer-Verlag.
Brennan, R. L. (Ed.) (2006). Educational measurement (4th ed.). Westport, CT: American Council on
Education/Praeger.
Bridgeman, B. (1992). A comparison of quantitative questions in open-ended and multiple-choice formats. Journal of Educational Measurement, 29(3), 253–271.
Bridgeman, B., Harvey, A., & Braswell, J. (1995). Effects of calculator use on scores on a test of mathematical reasoning.
Journal of Educational Measurement, 32(4), 323–340.
Bridgeman, B., Morgan, R. L., & Wang, M. (1997). Choice among essay topics: Impact on performance and validity.
Journal of Educational Measurement, 34(3), 273–286.
Bridgeman, B., & Rock D. A. (1993). Relationship among multiple-choice and open-ended analytical questions. Journal
of Educational Measurement, 30(4), 313–329.
Briggs, D. C., Alonzo, A. C., Schwab, C., & Wilson, M. (2006). Diagnostic assessment with ordered multiple-choice items.
Educational Assessment, 11(1), 33–63.
Brown, T. A. (2004). Confirmatory factor analysis for applied research. New York, NY: Guilford Press.
Bryant, D. U., & Wooten, W. (2006). Developing an essentially unidimensional test with cognitively designed items. International Journal of Testing, 6(3), 205–228.
Budescu, D., & Bar-Hillel, M. (1993). To guess or not to guess: A decision-theoretic view of formula scoring. Journal of Edu-
cational Measurement, 30(4), 277–291.
Burchard, K. W., Rowland-Morin, P., Coe, N. P. W., & Garb, J. L. (1995). A surgery oral examination: Interrater agree-
ment and the influence of rater characteristics. Academic Medicine, 70(11), 1044–1046.
Burmester, M. A., & Olson, L. A. (1966). Comparison of item statistics for items in a multiple-choice and alternate-
response form. Science Education, 50, 467–470.
Burstein, J., & Marcu, D. (2003). Automated evaluation of discourse structure in student essays. In M. D. Shermis and
J. C. Burstein (Eds.). Automated essay scoring: a cross disciplinary approach (pp. 209–230). Mahwah, NJ: Lawrence
Erlbaum Associates.
Butler, A. C., & Roediger, H. L. III. (2008). Feedback enhances the positive effects and reduces the negative effects of
multiple-choice testing. Memory and Cognition, 36(3), 604–616.
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix.
Psychological Bulletin, 56(2), 81–105.
Case, S. M., & Downing, S. M. (1989). Performance of various multiple-choice item types on medical specialty examina-
tions: Types A, B, C, K, and X. Philadelphia, PA: National Board of Medical Examiners.
Case, S. M., Holzman, K., & Ripkey, D. R. (2001). Developing an item pool for CBT: A practical comparison of three
models of item writing. Academic Medicine, 76(10), S111–S113.
Case, S. M., & Swanson, D. B. (2001). Constructing written test questions for the basic and clinical sciences (3rd ed.). Phila-
delphia, PA: National Board of Medical Examiners.
Case, S. M., Swanson, D. B., & Becker, D. F. (1996). Verbosity, window dressing, and red herrings: Do they make a better
test item. Academic Medicine, 71, 10.
Case, S. M., Swanson, D. B., & Ripkey, D. R. (1994). Comparison of items in five-option and extended-matching formats
for assessment of diagnostic skills. Academic Medicine, 69(10, Suppl.), S1–S3.
Casteel, C. A. (1991). Changing on multiple-choice test items among eighth-grade readers. The Journal of Experimental
Education, 59(4), 300–309.
Chandrasegaran, A. L., Treagust, D. F., & Mocerino, M. (2007). The development of a two-tier multiple-choice diagnostic
instrument for evaluating secondary school students’ ability to describe and explain chemical reactions using mul-
tiple levels of representation. Chemistry Education Research and Practice, 8(3), 293–307.
Chang, L. (1995). Connotatively inconsistent test items. Applied Measurement in Education, 8(3), 199–209.
Chapelle, C. A., Enright, M. K., & Jamieson, J. (2010). Does an argument-based approach to validity make a difference?
Educational Measurement: Issues and Practice, 29(1), 3–13.
Chen, I., & Chang, C. C. (2009). Cognitive load theory: An empirical study of anxiety and task performance in language
learning. Electronic Journal of Research in Educational Psychology, 7, 729–746.
Cicchetti, D. V., Showalter, D., & Tyrer, P. M. (1985). The effect of number of rating scale categories on levels of interrater
reliability: A Monte Carlo investigation. Journal of Applied Psychology, 9, 31–36.
Cizek, G. J. (1991, April). The effect of altering the position of options in a multiple-choice examination. Paper presented at
the annual meeting of the National Council on Measurement in Education, Chicago, IL.
Cizek, G. J. (1999). Cheating on tests: How to do it, detect it, and prevent it. Mahwah, NJ: Lawrence Erlbaum.
Claesgens, J., Scalise, K., Wilson, M., & Stacy, A. (2008). Mapping student understanding in chemistry: The perspectives
of chemists. Science Education.
Coderre, S. P., Harasym, P., Mandin, H., & Fick, G. (2004). The impact of two multiple-choice question formats on prob-
lem-solving strategies used by novices and experts. Medical Education, 4(23). Retrieved from http://www.biomed-
central.com/1472–6920/4/23.
Coffman, W. E. (1971). Essay examinations. In R. L. Thorndike (Ed.), Educational Measurement (2nd ed., pp. 271–302).
Washington, DC: American Council on Education.
Cohen, A. S., & Kim S. (1992). Detecting calculator effects on item performance. Applied Measurement in Education, 5,
303–320.
Cohen, D. S., Colliver, J. A., & Marcy, M. S. (1996). Psychometric properties of a standardized-patient checklist. Academic
Medicine, 71(1), 1–6, S1–21.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1),
37–46.
Cole, N. S. (1990). Conceptions of educational achievement. Educational Researcher, 19, 2–7.
College Board (2011). AP world history: Course and exam description (Effective fall 2011). Retrieved from http://apcen-
tral.collegeboard.com/apc/public/repository/AP_WorldHistoryCED_Effective_Fall_2011.pdf
Collins, J. (2006). Writing multiple-choice questions for continuing medical education activities and self-assessment
modules. RadioGraphics, 26, 543–551.
Comira. (2009). Alternative pathways for initial licensure for general dentists. Folsom, CA: Authors. Retrieved from http://
www.dbc.ca.gov/formspubs/pub_portfolio_final.pdf
Common Core State Standards Initiative (2011). Common core state standards for mathematics. Washington DC: Coun-
cil of Chief State School Officers. Retrieved from http://www.corestandards.org/assets/CCSSI_Math Standards.pdf
Congdon, P. J., & McQueen, J. (2005). The stability of rater severity in large-scale assessment programs. Journal of Edu-
cational Measurement, 37(2), 163–178.
Cook, C., Heath, F., Thompson, R. L., & Thompson, B. (2001). Reliability in web or internet-based surveys: Unnumbered
graphic rating scales versus Likert-type scales. Educational and Psychological Measurement, 61(4), 697–706.
Cook, D. A., Dupras, D. M., Beckman, T. J., Thomas, K. G., & Pankratz, V. S. (2009). Effect of rater training on reliability and
accuracy of mini-CEX scores: A randomized, controlled trial. Journal of General Internal Medicine, 24(1), 74–79.
Copeland, D. A. (1972). Should chemistry students change answers on multiple-choice tests? Journal of Chemical Educa-
tion, 49(4), 258.
Cotton, D., & Gresty, K. (2005). Reflecting on the think-aloud method for evaluating e-learning. British Journal of Edu-
cational Technology, 37(1), 45–54.
Couper, M. P., Tourangeau, R., & Conrad, F. G. (2006). Evaluating the effectiveness of visual analog scales: A web experi-
ment. Social Science Computer Review, 24(2), 227–245.
Cowan, N. (2001). The magical number 4 in short-term memory: A reconsideration of mental storage capacity. Behav-
ioral and Brain Sciences, 24(1), 87–114.
Crisp, V., & Sweiry, E. (2006). Can a picture ruin a thousand words? The effects of visual resources in exam questions.
Educational Research, 48(2), 139–154.
Cromley, J. G., & Azevedo, R. (2007). Testing and refining the direct and inferential mediation model of reading compre-
hension. Journal of Educational Psychology, 99(2), 311–325.
Cronbach, L. J. (1970). [Review of On the theory of achievement test items]. Psychometrika, 35, 509–511.
Cronbach, L. J. (1971). Validation. In R. L. Thorndike (Ed.). Educational measurement (2nd ed., pp. 443–507). Washing-
ton, DC: American Council on Education.
Cronbach, L. J. (1988). Five perspectives of the validity argument. In H. Wainer & H. I. Braun (Eds.), Test validity (pp.
3–18). Hillsdale, NJ: Lawrence Erlbaum Associates.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281–302.
Crooks, T. J., Kane, M. T., & Cohen, A. S. (1996). Threats to valid use of assessments. Assessment in Education, 3(3),
265–285.
Crowe, A., Dirks, C., & Wenderoth, M. P. (2008). Biology in Bloom: Implementing Bloom’s taxonomy to enhance student
learning. Cell Biology Education: Life Sciences Education, 7, 368–381.
Cui, Y., & Leighton, J. P. (2009). The hierarchy consistency index: Evaluating person fit for cognitive diagnostic assess-
ment. Journal of Educational Measurement, 46(4), 429–449.
Cunnington, P. W., Norman, G. R., Norman, R., Blake, J. M., Dauphinee, W. D., & Blackmore, D. E. (1996). Applying
learning taxonomies to test items: Is a fact an artifact? Academic Medicine, 71(10), S31–S33.
Davey, T., & Pitoniak, M. J. (2006). Designing computerized adaptive tests. In S. M. Downing & T. M. Haladyna (Eds.),
Handbook of test development (pp. 543–574). Mahwah, NJ: Lawrence Erlbaum Associates.
Dawson-Saunders, B., Nungester, R. J., & Downing, S. M. (1989). A comparison of single best answer multiple-choice items
(A-type) and complex multiple-choice (K-type). Philadelphia, PA: National Board of Medical Examiners.
Dawson-Saunders, B., Reshetar, R., Shea, J. A., Fierman, C. D., Kangilaski, R., & Poniatowski, P. A. (1992, April). Altera-
tions to item text and effects on item difficulty and discrimination. Paper presented at the annual meeting of the
National Council on Measurement in Education, San Francisco, CA.
Dawson-Saunders, B., Reshetar, R., Shea, J. A., Fierman, C. D., Kangilaski, R., & Poniatowski, P. A. (1993, April). Changes
in difficulty and discrimination related to altering item text. Paper presented at the annual meeting of the National
Council on Measurement in Education, Atlanta, GA.
de Gruijter, D. N. M. (1988). Evaluating an item and option statistic using the bootstrap method. Tijdschrift voor Onder-
wijsresearch, 13, 345–352.
de la Torre, J., & Patz, R. J. (2005). Making the most of what we have: A practical application of multidimensional IRT in
test scoring. Journal of Educational and Behavioral Statistics, 30, 295–311.
deAyala, R. J. (2009). The theory and practice of item response theory. New York, NY: Guilford Press.
deAyala, R. J., Plake, B. S., & Impara, J. S. (2001). The impact of omitted responses on the accuracy of ability estimation
in item response theory. Journal of Educational Measurement, 38(3), 213–234.
DeCarlo, L. T. (2005). A model of rater behavior in essay grading based on signal detection theory. Journal of Educational
Measurement, 42(1), 53–76.
DeCarlo, L. T., Kim, Y. K., & Johnson, M. S. (2011). A hierarchical rater model for constructed responses, with a signal
detection rater model. Journal of Educational Measurement, 48(3), 333–356.
DeMars, C. (2000). Test stakes and item format interaction. Applied Measurement in Education, 13, 55–78.
DeMars, C. (2010). Item response theory. Oxford, England: Oxford University Press.
Dennis, I. (2007). Halo effects in grading student projects. Journal of Applied Psychology, 92(4), 1169–1176.
DeRemer, M. L. (1998). Writing assessment: Raters’ elaboration of the rating task. Assessing Writing, 5(1), 7–29.
Diederich, P. B. (1974). Measuring growth in English. Urbana, IL: National Council of Teachers of English.
Dihoff, R. E., Brosvic, G. M., Epstein, M. L., & Cook, M. J. (2004). Provision of feedback during preparation for academic
testing: Learning is enhanced by immediate feedback. Psychological Record, 54(2), 207–231.
Dillman, D. A. (1978). Mail and telephone surveys: The total design method. New York, NY: Wiley.
Dillman, D. A., Smyth, J. D., & Christian, L. M. (2009). Internet, mail, and mixed-mode surveys: The tailored design
method (3rd ed.). Hoboken, NJ: John Wiley & Sons, Inc.
Dodd, D. K., & Leal, L. (1988). Answer justification: Removing the “trick” from multiple-choice questions. Teaching of
Psychology, 15(1), 37–38.
Dowling, J., Murphy, S. E., & Wang, B. (2007). The effects of the career ladder program on student achievement. Phoenix,
AZ. Retrieved from http://www.ade.az.gov/asd/careerladder/CareerLadderReport.pdf
Downing, S. M. (1992). True-false, alternate-choice, and multiple-choice items. Educational Measurement: Issues and
Practice, 11(3), 27–30.
Downing, S. M. (2002). Construct-irrelevant variance and flawed test questions: Do multiple choice item writing prin-
ciples make any difference? Academic Medicine, 77, 103–104.
Downing, S. M. (2005). The effects of violating standard item writing principles on tests and students: The consequences
of using flawed test items on achievement examinations in medical education. Advances in Health Sciences Educa-
tion, 10, 133–143.
Downing, S. M. (2006). Twelve steps for effective test development. In S. M. Downing & T. M. Haladyna (Eds.), Hand-
book of test development (pp. 3–25). Mahwah, NJ: Lawrence Erlbaum Associates.
Downing, S. M., Baranowski, R. A., Grosso, L. J., & Norcini, J. J. (1995). Item type and cognitive ability measured: The
validity evidence for multiple true-false items in medical specialty certification. Applied Measurement in Education,
8(2), 187–197.
Downing, S. M., & Haladyna, T. M. (1997). Test item development: Validity evidence from quality assurance procedures.
Applied Measurement in Education, 10, 61–82.
Downing, S. M., & Haladyna, T. M. (Eds.) (2006). Handbook of test development. Mahwah, NJ: Lawrence Erlbaum
Associates.
Downing, S. M., & Norcini, J. J. (1998, April). Constructed response or multiple-choice: Does format make a difference
for prediction? In T. M. Haladyna (Chair), Construction versus choice: A research synthesis. Symposium conducted
at the annual meeting of the American Educational Research Association, San Diego, CA.
Downing, S. M., & Yudkowsky, R. (2009). Assessment in health professions education. New York, NY: Routledge.
Drasgow, F. (1982). Choice of test model for appropriateness measurement. Applied Psychological Measurement, 6,
297–308.
Drasgow, F., Levine, M. V., & Williams, E. A. (1985). Appropriateness measurement with polychotomous item response models and standardized indices. British Journal of Mathematical and Statistical Psychology, 38, 67–86.
Drasgow, F., Levine, M. V., & Zickar, M. J. (1996). Optimal identification of mismeasured individuals. Applied Measure-
ment in Education, 9(1), 47–64.
DuBois, P. H. (1970). A history of psychological testing. Boston: Allyn & Bacon.
Ebel, R. L. (1951). Writing the test item. In E. F. Lindquist (Ed.), Educational Measurement (1st ed., pp. 185–249). Wash-
ington, DC: American Council on Education.
Ebel, R. L. (1967). The relationship of item discrimination to test reliability. Journal of Educational Measurement, 4(3),
125–128.
Ebel, R. L. (1970). The case for true-false test items. School Review, 78, 373–389.
Ebel, R. L. (1981, April). Some advantages of alternate-choice test items. Paper presented at the annual meeting of the
National Council on Measurement in Education, Los Angeles.
Ebel, R. L. (1982). Proposed solutions to two problems of test construction. Journal of Educational Measurement, 19,
267–278.
Ebel, R. L., & Frisbie, D. A. (1991). Essentials of educational measurement (5th ed.). Englewood Cliffs, NJ: Prentice-Hall.
Eckes, T. (2008). Rater types in writing performance assessments: A classification approach to rater variability. Language Testing, 25(2), 155–185.
Educational Testing Service. (2010). The official guide to the GRE revised general test. New York, NY: McGraw-Hill.
Educational Testing Service. (2012). GRE revised general test: Quantitative reasoning question types. Princeton, NJ:
Authors. Retrieved from http://www.ets.org/gre/revised_general/about/content/
Ellington, A. (2003). A meta-analysis of the effects of calculators on students’ achievement and attitude levels in precol-
lege mathematics classes. Journal for Research in Mathematics Education, 34(5), 433–463.
Elliott, S. N., & Roach, A. T. (2007). Alternate assessments of students with significant disabilities: Alternative approaches,
common technical challenges. Applied Measurement in Education, 20(3), 301–333.
Elliott, S. N., Kettler, R. J., Beddow, P. A., & Kurz, A. (Eds.). (2011). Handbook of accessible achievement tests for all stu-
dents: Bridging the gaps between research, practice, and policy. New York, NY: Springer.
Elliott, S. N., Kettler, R. J., Beddow, P. A., Kurz, A., Compton, E., McGrath, D., Bruen, C., Hinton, K., Palmer, P., Rodri-
guez, M. C., Bolt, D., & Roach, A. T. (2010). Effects of using modified items to test students with persistent academic
difficulties. Exceptional Children, 76(4), 475–495.
Ellsworth, R. A., Dunnell, P., & Duell, O. K. (1990). Multiple-choice test items: What are textbook authors telling teach-
ers? The Journal of Educational Research, 83(5), 289–293.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Erlbaum Publishers.
Engelhard, G. Jr. (1994). Examining rater errors in the assessment of written composition with a many-faceted Rasch
model. Journal of Educational Measurement, 31(2), 93–112.
Engelhard, G. Jr. (2002). Monitoring raters in performance assessments. In G. Tindal & T. M. Haladyna (Eds.) Large-scale
assessment programs for all students: Validity, technical adequacy, and implementation (pp. 261–288). Mahwah, NJ:
Lawrence Erlbaum Associates.
Engelhard, G. Jr. (2009). Using item response theory and model–data fit to conceptualize differential item and person
functioning for students with disabilities. Educational and Psychological Measurement, 69(4), 585–602.
Engelhard, G. Jr., & Myford, C. M. (2003). Monitoring rater performance in the Advanced Placement English Literature
and Composition Program with a many-faceted Rasch model. New York, NY: College Board.
Ennis, R. H. (1989). Critical thinking and subject specificity: Clarification and needed research. Educational Researcher,
18(3), 4–10.
Ennis, R. H. (1993). Critical thinking assessment. Theory into Practice, 32(3), 179–186.
Enright, M. K., Morley, M., & Sheehan, K. M. (2002). Items by design: The impact of systematic feature variation on item statistical characteristics. Applied Measurement in Education, 15(1), 49–74.
Epstein, R. M. (2007). Assessment in medical education. New England Journal of Medicine, 356, 387–396.
Ercikan, K., Arim, R., & Law, D. (2010). Application of think-aloud protocols for examining and confirming sources of
differential item functioning identified by expert reviews. Educational Measurement: Issues and Practice, 29(2),
24–35.
Eurich, A. C. (1931). Four types of examination compared and evaluated. Journal of Educational Psychology, 26,
268–278.
Evans, L. R., Ingersoll, R. W., & Smith, E. J. (1966). The reliability, validity, and taxonomic structure of the oral examina-
tions. Journal of Medical Education, 41, 651–657.
Fajardo, L. L., & Chan, K. M. (1993). Evaluation of medical students in radiology written testing using uncued multiple-
choice questions. Investigative Radiology, 28(10), 964–968.
Farley, T. (September 27, 2009). Reading incomprehension. New York Times. Retrieved from http://www.nytimes.
com/2009/09/28/opinion/28farley.html
Farrington, J. (2011). Seven plus or minus two. Performance Improvement Quarterly, 23(4), 113–116.
Federation of State Medical Boards of the United States and the National Board of Medical Examiners (2011). Step 2
Clinical Skills Content Description and General Information. Philadelphia, PA: Author.
Felleti, G. I. (1980). Reliability and validity studies on the modified essay questions. Journal of Medical Education, 55,
933–941.
Fenderson, B. A., Damjanov, I., Robeson, M. R., Veloski, J. J., & Rubin, E. (1997). The virtues of extended matching and
uncued tests as alternatives to multiple-choice questions. Human Pathology, 28(5), 526–532.
Ferrara, S. (2006). Toward a psychology of large-scale educational achievement testing: Some features and capabilities.
Educational Measurement: Issues and Practice, 25(4), 1–75.
Ferrara, S., & DeMauro, G. E. (2006). Standardized assessment of individual achievement in K-12. In R. L. Brennan (Ed.),
Educational measurement (4th ed., pp. 579–621). Westport, CT: American Council on Education/Praeger.
Ferrara, S., Svetina, D., Skucha, S., & Davidson, A. H. (2012). Test development with performance standards and achieve-
ment growth in mind. Educational Measurement: Issues and Practice, 30(4), 3–15.
Ferrell, C. M., & Daniel, L. G. (1995). A frame of reference for understanding behaviors related to the academic miscon-
duct of undergraduate teacher education students. Research in Higher Education, 36, 345–375.
Fife-Shaw, C. (2006). Levels of measurement. In G. M. Breakwell, S. Hammond and C. Fife-Shaw (Eds.), Research meth-
ods in psychology (2nd ed., pp. 147–157). London: Sage.
Fisicaro, S. A., & Lance, C. E. (1990). Implications of three causal models for the measurement of halo error. Applied
Psychological Measurement, 14, 419–429.
Fitzpatrick, A. R., Ercikan, K., & Yen, W. M. (1998). The consistency between raters in scoring in different test years. Applied Measurement in Education, 11(2), 195–208.
Flanagan, J. C. (1954). The critical incident technique. Psychological Bulletin, 51, 327–358.
Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5), 378–382.
Fletcher, D. (Dec. 2009). A brief history of standardized testing. Time Magazine. Retrieved from http://www.time.com/
time/nation/article/0,8599,1947019,00.html
Foster, J. T., Abrahamson, S., Lass, S., Girard, R., & Garris, R. (1969). Analysis of an oral examination used in specialty
board certification. Journal of Medical Education, 44, 951–954.
Fraser, C. (1998). NOHARM: A Fortran program for fitting unidimensional and multidimensional normal ogive models of
latent trait theory [Computer program manual]. Armidale, Australia: The University of New England, Center for
Behavioral Studies.
Frederiksen, J. R., & Collins, A. (1989). A systems approach to educational testing. Educational Researcher, 18(9),
27–32.
Frederiksen, N. (1984). The real test bias: Influences of testing on teaching and learning. American Psychologist, 39,
193–202.
Frederiksen, N., Mislevy, R. J., & Bejar, I. I. (Eds.). (1983). Test theory for a new generation of tests. Hillsdale, NJ: Law-
rence Erlbaum.
Frey, B. B., Petersen, S., Edwards, L. M., Pedrotte, J. T., & Peyton, V. (2005). Item-writing rules: Collective wisdom. Teach-
ing and Teacher Education, 21, 357–364.
Frisbie, D. A. (1973). Multiple-choice versus true false: A comparison of reliabilities and concurrent validities. Journal of
Educational Measurement, 10, 297–304.
Frisbie, D. A. (1992). The status of multiple true-false testing. Educational Measurement: Issues and Practice, 5, 21–26.
Frisbie, D. A., & Becker, D. F. (1991). An analysis of textbook advice about true–false tests. Applied Measurement in
Education, 4, 67–83.
Frisbie, D. A., & Druva, C. A. (1986). Estimating the reliability of multiple-choice true-false tests. Journal of Educational
Measurement, 23, 99–106.
Frisbie, D. A., & Sweeney, D. C. (1982). The relative merits of multiple true-false achievement tests. Journal of Educational
Measurement, 19, 29–35.
Fuchs, L. S., Fuchs, D., Karns, K., Hamlett, C. L., Dutka, S., & Katzaroff, M. (2000). The importance of providing back-
ground information on the structure and scoring of performance assessments. Applied Measurement in Education,
13(1), 1–34.
Fuhrman, M. (1996). Developing good multiple-choice tests and test questions. Journal of Geoscience Education, 44,
379–384.
Furst, E. J. (1981). Bloom’s taxonomy of educational objectives for the cognitive domain: Philosophical and educational
issues. Review of Educational Research, 51(4), 441–453.
Gallagher, A., Levin, J., & Cahalan, C. (2002). GRE research: Cognitive patterns of gender differences on mathematics
admissions tests (ETS Report No. 02-19). Princeton, NJ: Educational Testing Service.
Garland, R. (1991). The mid-point on a rating scale: Is it desirable? Marketing Bulletin, 2, 66–70.
Garner, M., & Engelhard, G. Jr. (1999). Gender differences in performance on multiple-choice and constructed-
response mathematics items. Applied Measurement in Education, 12(1), 29–51.
Geiger, M. (1997). An examination of the relationship between answer changing, testwiseness, and examination performance. Journal of Experimental Education, 66(1), 49–60.
Gerrow, J. D., Murphy, H. J., Boyd, M. A., & Scott, D. A. (2003). Concurrent validity of written and OSCE components of
the Canadian dental certification examinations. Journal of Dental Education, 67(8), 896–901.
Gerrow, J. D., Murphy, H. J., Boyd, M. A., & Scott, D. A. (2006). An analysis of the contribution of a patient-based com-
ponent to a clinical licensure examination. Journal of the American Dental Association, 137(10), 1434–1439.
Gierl, M. J. (1997). Comparing cognitive representations of test developers and students on a mathematics test with
Bloom’s taxonomy. The Journal of Educational Research, 91(1), 26–32.
Gierl, M. J., & Cui, Y. (1998). Defining characteristics of diagnostic classification models and the problem of retrofitting
in cognitive diagnostic assessment. Measurement, 6, 263–275.
Gierl, M. J., & Haladyna, T. M. (2012). Automatic item generation: Theory and practice. New York, NY: Routledge.
Gierl, M. J., & Leighton, J. P. (2004). Review of item generation for test development. Journal of Educational Measure-
ment, 41, 69–72.
Gierl, M. J., & Leighton, J. P. (2010, April) Developing cognitive models and constructed maps to promote assessment
engineering, Paper presented at the annual meeting of the National Council on Measurement in Education,
Denver, CO.
Gierl, M. J., Zhou, J., & Alves, C. (2008). Developing a taxonomy of item model types to promote assessment engineering. Journal of Technology, Learning, and Assessment, 7(2). Retrieved from http://www.jtla.org
Gitomer, D. H. (2007). Design principles for constructed response tasks: Assessing subject-matter understanding in NAEP.
Unpublished manuscript. Princeton, NJ: Educational Testing Service.
Glas, C. A. W., & Dagohoy, A. V. T. (2005). A person fit test for IRT models for polytomous items. Psychometrika, 72(2), 159–180.
Godshalk, F. I., Swineford, E., & Coffman, W. E. (1966). The measurement of writing ability. College Board Research
Monographs, No. 6. New York, NY: College Entrance Examination Board.
Golda, S. D. (2011). A case study on multiple-choice testing in anatomical sciences. Anatomical Sciences Education, 4(1),
44–48.
Goldberg, G. L., & Kapinus, B. (1993). Problematic response to reading performance assessment tasks: Sources and implica-
tions. Applied Measurement in Education, 6(4), 281–305.
Gorin, J. S. (2005). Manipulating processing difficulty of reading comprehension questions: The feasibility of verbal item
generation. Journal of Educational Measurement, 42(4), 351–373.
Gorin, J. (2006). Test design with cognition in mind. Educational Measurement: Issues and Practice, 25, 21–35.
Gorin, J., & Embretson, S. (2012). Using cognitive psychology to generate items and predict item characteristics. In M.
Gierl & T. M. Haladyna (Eds.) Automatic item generation (pp. 136–156). New York: Routledge.
Gorin, J. S., & Svetina, D. (2012). Cognitive psychometric models as a tool for reading assessment engineering. In J.
Sabatini & L. Albro (Eds.) Assessing reading in the 21st century: Aligning and applying advances in the reading and
measurement sciences (pp. 169–184). Lanham, MD: Rowman & Littlefield Education.
Gorin, J. S., & Svetina, D. (2011). Test design with higher order cognition in mind. In G. Schraw (Ed.). Current perspec-
tives on cognition, learning, and instruction: Assessment of higher order thinking skills. Lanham, MD: Rowman &
Littlefield Education.
Gorsuch, R. L. (1983). Factor analysis. Hillsdale, NJ: Lawrence Erlbaum Associates.
Graf, E. A. (2008). Approaches to the design of diagnostic item models (Research Report No. RR-08-07). Princeton, NJ: Educational Test-
ing Service.
Graf, E. A., Peterson, S., Steffen, M., & Lawless, R. (December 2005). Psychometric and cognitive analysis as a basis for
the design and revision of quantitative item models. Research Report 05–25. Princeton, NJ: Educational Testing
Service.
Green, K. E., & Frantom, C. G. (2002, November). Survey development and validation with the Rasch model. Paper pre-
sented at the International Conference on Questionnaire Development, Evaluation, & Testing, Charleston, SC.
Retrieved from http://www.jpsm.umd.edu/qdet/final_pdf_papers/green.pdf
Green, K. E., & Smith, R. M. (1987). A comparison of two methods of decomposing item difficulties. Journal of Educa-
tional Statistics, 12, 369–381.
Grier, J. B. (1975). The number of alternatives for optimum test reliability. Journal of Educational Measurement, 12,
109–112.
Grier, J. B. (1976). The optimal number of alternatives at a choice point with travel time considered. Journal of Math-
ematical Psychology, 14, 91–97.
Gronlund, N. E., & Waugh, C. K. (2009). Assessment of student achievement (9th ed.). Upper Saddle River, NJ: Pearson
Education.
Gross, L. J. (1994). Logical versus empirical guidelines for writing test items. Evaluation and the Health Professions, 17(1),
123–126.
Grosse, M., & Wright, B. D. (1985). Validity and reliability of true-false tests. Educational and Psychological Measure-
ment, 45, 1–13.
Guilford, J. P. (1954). Psychometric methods. New York, NY: McGraw-Hill.
Guttman, L. (1941). The quantification of a class of attributes: A theory and method of scale construction. In P. Horst, P.
Wallin, & L. Guttman, (Eds.), The prediction of personal adjustment (pp. 321–345). New York, NY: Social Science
Research Council.
Haberman, S. J., & Sinharay, S. (2009). Reporting subscores for institutions. British Journal of Mathematical and Statisti-
cal Psychology, 62(1), 79–95.
Haberman, S. J., & Sinharay, S. (2010). Reporting of subscores using multidimensional item response theory. Psy-
chometrika, 75(2), 209–227.
Haladyna, T. M. (1974). Effects of different samples on item and test characteristics of criterion-referenced tests. Journal
of Educational Measurement, 11, 93–100.
Haladyna, T. M. (1990). Effects of empirical option weighting on estimating domain scores and making pass/fail deci-
sions. Applied Measurement in Education, 3, 231–244.
Haladyna, T. M. (1991). Generic questioning strategies for linking teaching and testing. Educational Technology: Research
and Development, 39, 73–81.
Haladyna, T. M. (1992a). Context-dependent item sets. Educational Measurement: Issues and Practice, 11, 21–25.
Haladyna, T. M. (1992b). The effectiveness of several multiple-choice formats. Applied Measurement in Education, 5,
73–88.
Haladyna, T. M. (1997). Writing test items to evaluate higher order thinking. Boston, MA: Allyn & Bacon.
Haladyna, T. M. (1999, April). When should we use a multiple-choice format? Paper presented at the annual meeting of
the American Educational Research Association, Montreal, Canada.
Haladyna, T. M. (2002a). Essentials of standardized achievement testing: Validity and accountability. Needham Heights,
MA: Allyn & Bacon.
Haladyna, T. M. (2002b). Supporting documentation: Assuring more valid test score interpretations. In G. Tindal & T. M. Haladyna (Eds.), Large-scale assessment programs for all students (pp. 89–108). Mahwah, NJ: Lawrence Erlbaum
Associates.
Haladyna, T. M. (2004). The condition of assessment of student learning in Arizona: 2004. In A. Molnar (Ed.), The condi-
tion of Pre-K-12 education in Arizona: 2004 (Doc. # EPSL-0405-102-AEPI). Tempe, AZ: Arizona Education Policy
Initiative, Education Policy Studies Laboratory, Arizona State University.
Haladyna, T. M. (2004). Developing and validating multiple-choice test items (3rd ed.). Mahwah, NJ: Lawrence Erlbaum
Associates.
Haladyna, T. M. (2006). Roles and importance of validity studies in test development. In S. M. Downing & T. M. Hala-
dyna (Eds.) Handbook of test development (pp. 739–758). Mahwah, NJ: Lawrence Erlbaum Associates.
Haladyna, T. M., & Downing, S. M. (1989a). A taxonomy of multiple-choice item-writing rules. Applied Measurement in
Education, 2, 37–50.
Haladyna, T. M., & Downing, S. M. (1989b). The validity of a taxonomy of multiple-choice item-writing rules. Applied
Measurement in Education, 2, 51–78.
Haladyna, T. M., & Downing, S. M. (1993). How many options is enough for a multiple-choice test item? Educational and
Psychological Measurement, 53(4), 999–1010.
Haladyna, T. M., & Downing, S. M. (2004). Construct-irrelevant variance in high-stakes testing. Educational Measure-
ment: Issues and Practice, 23(1), 17–27.
Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for
classroom assessment. Applied Measurement in Education, 15(3), 309–334.
Haladyna, T. M., & Kramer, G. A. (2004). The validity of subscores for a credentialing examination. Evaluation in the
Health Professions, 27(4), 349–368.
Haladyna, T. M., Nolen, S. B., & Haas, N. S. (1991). Raising standardized achievement test scores and the origins of test
score pollution. Educational Researcher, 20, 2–7.
Haladyna, T. M., & Olsen, R. M. (Submitted for publication). Three significant problems with writing performance tests.
Haladyna, T. M., Osborn Popp, S., & Weiss, M. (2003). Non-response in large-scale assessment. Paper presented at the
annual meeting of the American Educational Research Association, Montreal, Canada.
Haladyna, T. M., & Roid, G. H. (1981). The role of instructional sensitivity in the empirical review of criterion-referenced
test items. Journal of Educational Measurement, 18, 39–53.
Haladyna, T. M., & Roid, G. H. (1983). A comparison of two item selection procedures for constructing criterion-refer-
enced tests. Journal of Educational Measurement, 20, 271–282.
Haladyna, T. M., & Shindoll, R. R. (1989). Item shells: A method for writing effective multiple-choice test items. Evalua-
tion and the Health Professions, 12, 97–104.
Halpin, G., Halpin, G., & Arbet, S. (1994). Effects of number and type of response choices on internal consistency reli-
ability. Perceptual and Motor Skills, 79, 928–930.
Hambleton, R. K. (2004). Theory, methods, and practices in testing for the 21st century. Psicothema, 16(4), 696–701.
Hamilton, L. S. (1998). Gender differences on high school science achievement tests: Do format and content matter?
Educational Evaluation and Policy Analysis, 20(3), 179–195.
Hamilton, L. S., Nussbaum, E. M., & Snow, R. S. (1997). Interview procedures for validating science assessments. Applied
Measurement in Education, 10(2), 181–200.
Han, K. T., Wells, C. S., & Sireci, S. G. (2012). The impact of multidirectional item and parameter drift on IRT scaling
coefficients and proficiency estimates. Applied Measurement in Education, 25(2), 97–117.
Hancock, G. R., Thiede, K. W., & Sax, G. (1992, April). Reliability of comparably written two-option multiple-choice and
true-false test items. Paper presented at the annual meeting of the National Council on Measurement in Education,
Chicago, IL.
Hannon, D., & Daneman, M. (2001). Susceptibility to semantic illusions: An individual-differences perspective. Memory
and Cognition, 29(3), 449–461.
Hannon, D., & Daneman, M. (2001). A new tool for measuring and understanding individual differences in the compo-
nent processes of reading comprehension. Journal of Educational Psychology, 93(1) 103–128.
Harasym, P. H., Doran, M. L., Brant, R., & Lorscheider, F. L. (1992). Negation in stems of single-response multiple-choice
items. Evaluation and the Health Professions, 16(3), 342–357.
Harik, P., Clauser, B. E., Grabovsky, I., Nungester, R. J., Swanson, D., & Nandakumar, R. (2009). An examination of rater
drift within a generalizability theory framework. Journal of Educational Measurement, 46(1), 43–58.
Harik, P., Cuddy, M., O’Donovan, S., Murray, C., Swanson, D., & Clauser, B. (2009). Assessing potentially dangerous
medical actions with the computer-based case simulation portion of the USMLE Step 3 Examination. Academic
Medicine, 84(10), S79–S82.
Harrison, A. (2011, June 2). AS-level maths error: Students set impossible question. BBC News. Retrieved from http://
www.bbc.co.uk/news/education-13627415
Hattie, J. A. (1985). Methodological review: Assessing unidimensionality of tests and items. Applied Psychological Mea-
surement, 9, 139–164.
Hausknecht, J. P., Halpert, J. A., Di Paolo, N. T., & Moriarty Gerrard, M. O. (2007). Retesting in selection: A meta-analysis
of coaching and practice effects for tests of cognitive ability. Journal of Applied Psychology, 92(2), 373–385.
Haynie, W. J. (1992). Post hoc analysis of test items written by technology education teachers. Journal of Technology
Education, 4(1). Retrieved from http://scholar.lib.vt.edu/ejournals/JTE/v4n1/html/haynie.html
Haynie, W. J. III (1994). Effects of multiple-choice and short-answer tests on delayed retention learning. Journal of Tech-
nology Education, 6(1). Retrieved from http://scholar.lib.vt.edu/ejournals/JTE/v6n1/
He, Y. (2011). Evaluating equating properties for mixed-format tests. Doctoral dissertation, University of Iowa. http://
ir.uiowa.edu/etd/981
Heck, R. H., & Crislip, M. (2001). Direct and indirect writing assessments: Examining issues of equity and equality. Edu-
cational Evaluation and Policy Analysis, 23(3), 275–292.
Hedge, J. W., & Kavanagh, M. J. (1988). Improving the accuracy of performance evaluations: Comparison of three meth-
ods of performance appraiser training. Journal of Applied Psychology, 73, 68–73.
Heidenberg, A. J., & Layne, B. H. (2000). Answer changing: A conditional argument. College Student Journal, 34(3),
440–450.
Henrysson, S. (1971). Analyzing the test item. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 130–159).
Washington, DC: American Council on Education.
Herbig, M. (1976). Item analysis by use in pre-test and post-test: A comparison of different coefficients. PLET, 13,
49–54.
Hess, K., McDivitt, P., & Fincher, M. (2008). Who are the 2% students and how do we design items and assessments
that provide greater access for them? Results from a pilot study with Georgia students. Paper presented at the 2008
CCSSO National Conference on Student Assessment, Orlando, FL. Retrieved from http://www.nciea.org/publications/CCSSO_KHPMMF08.pdf
Hibbison, E. P. (1991). The ideal multiple-choice question: A protocol analysis. Forum for Reading, 22(2), 36–41.
Higham, P. A., & Gerrard, C. (2005). Not all errors are created equal: Metacognition and changing answers on multiple-
choice tests. Canadian Journal of Experimental Psychology, 59(1), 28–34.
Hill, G. C., & Woods, G. T. (1974). Multiple true-false questions. Education in Chemistry, 11, 86–87.
Hill, K. T., & Wigfield, A. (1984). Test anxiety: A major educational problem and what can be done about it. Elementary
School Journal, 85, 105–126.
Hill, P. W., & McGaw, B. (1981). Testing the simplex assumption underlying Bloom’s taxonomy. American Educational
Research Journal, 18(1), 93–101.
Hively, W. (1974). Introduction to domain-referenced testing. Educational Technology, 14(6), 5–10.
Hively, W., Patterson, H. L., & Page, S. H. (1968). A “universe-defined” system of arithmetic achievement tests. Journal
of Educational Measurement, 5(4), 274–290.
Hockberger, R. S., LaDuca, A., Orr, N. A., Reinhart, M. A., & Sklar, D. P. (2003). Creating the model of a clinical practice: The case of emergency medicine. Academic Emergency Medicine, 10(2), 161–168.
Hogan, T. P., & Murphy, G. (2007). Recommendations for preparing and scoring constructed-response items: What the
experts say. Applied Measurement in Education, 20(4), 427–441.
Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel-Haenszel procedure. In H. Wainer
& H. Braun (Eds.), Test validity (pp. 129–145). Hillsdale, NJ: Lawrence Erlbaum Associates.
Holland, P. W., & Wainer, H. (Eds.). (1993). Differential item functioning. Hillsdale, NJ: Lawrence Erlbaum
Associates.
Hoskens, M., & Wilson, M. (2001). Real-time feedback on rater drift in constructed-response items: An example from the
Golden State Examination. Journal of Educational Measurement, 38(2), 121–145.
Hoyt, W. T. (2000). Rater bias in psychological research: When is it a problem and what can we do about it? Psychological
Methods, 5(1), 64–86.
Hubal, R. C., Kizakevich, P. N., Guinn, C. I., Merino, K. D., & West, S. L. (2000). The virtual standardized patient. Simulated patient-practitioner dialog for patient interview. Studies in Health Technology and Informatics, 70, 133–138.
Huber, J. (1983). The effect of set composition on item choice: Separating attraction, edge aversion and substitution effects. In R. P. Bagozzi & A. M. Tybout (Eds.), Advances in Consumer Research, 10, 298–304.
Hurd, A. W. (1932). Comparison of short answer and multiple-choice tests covering identical subject content. Journal of
Educational Research, 26, 28–30.
Indiana University. (2012). National Survey of Student Engagement 2007. Bloomington, IN: Author. Retrieved from
http://nsse.iub.edu/html/survey_instruments.cfm?survey_year=2012
Irvine, S. H. (2002). The foundations of item generation for mass testing. In S. H. Irvine & P. C. Kyllonen (Eds.) Item
generation for test development (pp. 3–34). Mahwah, NJ: Lawrence Erlbaum Associates.
Irvine, S. H., & Kyllonen, P. C. (Eds.) (2002). Item generation for test development. Mahwah, NJ: Lawrence Erlbaum
Associates.
Jacoby, J., & Matell, M. S. (1971). Three-point Likert scales are good enough. Journal of Marketing Research, 8,
495–500.
Jeffery, J. V. (2009). Constructs of writing proficiency in US state and national writing assessments: Exploring variability.
Assessing Writing, 14(1), 3–24.
Jodoin, M. G. (2003). Measurement efficiency of innovative item formats in computer-based testing. Journal of Educa-
tional Measurement, 40(1), 1–15.
Johnson, R., Penny, J., & Gordon, B. (2000). The relation between score resolution methods and interrater reliability: An
empirical study of an analytic scoring rubric. Applied Measurement in Education, 13(2), 121–138.
Johnson, R., Penny, J., Fisher, S., & Kuhs, T. (2003). Score resolution: An investigation of reliability and validity of resolved
scores. Applied Measurement in Education, 16(4), 299–322.
Johnstone, C. J., Thompson, S. J., Bottsford-Miller, N. A., & Thurlow, M. L. (2008). Universal design and multimethod
approaches to item review. Educational Measurement: Issues and Practice, 27(1), 25–36.
Joint Task Force of the International Reading Association & National Council of Teachers of English (2009). Standards
for the assessment of reading and writing. Urbana, IL: National Council of Teachers of English.
Jozefowicz, R. F., Koeppen, B. M., Case, S., Galbraith, R., Swanson, D., & Glew, H. (2002). The quality of in-house medical
school examinations. Academic Medicine, 77, 156–161.
Kachchaf, R., & Solano-Flores, G. (2012). Rater language background as a source of measurement error in the testing of
English language learners. Applied Measurement in Education, 25(2), 162–177.
Kane, M. T. (1992). An argument-based approach to validity. Psychological Bulletin, 112, 527–535.
Kane, M. T. (2002). Current concerns in validity theory. Journal of Educational Measurement, 38(4), 319–342.
Kane, M. T. (2004). Certification testing as an illustration of argument-based validation. Measurement, 2(3), 135–170.
Kane, M. T. (2006a). Content-related validity evidence. In S. M. Downing & T. M. Haladyna (Eds.) Handbook of test
development (pp. 131–154). Mahwah, NJ: Lawrence Erlbaum Associates.
Kane, M. T. (2006b). Validation. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 17–64). Westport, CT:
American Council on Education/Praeger.
Kane, M., & Case, S. (2004). The reliability and validity of weighted composite scores. Applied Measurement in Education,
17(3), 221–240.
Kang, H. W. (1974). Institutional borrowing: The case of the Chinese Civil Service Examination System in early Koryo. The Journal of Asian Studies, 34(1), 109–125.
Kaplan, A. (1963). The conduct of inquiry. New York, NY: Harper & Row.
Karabatsos, G. (2003). Comparing the aberrant response detection of thirty-six person-fit statistics. Applied Measurement
in Education, 16(4), 277–298.
Katz, I., Lipps, A., & Trafton, J. (2002). Factors affecting difficulty in the generating examples item type (GRE Board Profes-
sional Rep. No. 97–18P). Princeton, NJ: ETS.
Katz, S., & Lautenschlager, G. J. (2001). The contribution of passage and no-passage factors in item performance on the
SAT reading task. Educational Assessment, 7(2), 165–176.
Kelley, T. L., Ruch, G. M., & Terman, L. M. (1933). Stanford achievement test (1st ed.). Yonkers, NY: World Book Company. Retrieved from http://blog.seattlepi.com/chalkboard/files/library/1933test.pdf on March 4, 2012.
Ketterlin-Geller, L. R. (2008). Testing students with special needs: A model for understanding the interaction between
assessment and student characteristics in a universally designed environment. Educational Measurement: Issues
and Practice, 27(3), 3–16.
Kettler, R. J. (2011). Effects of modification packages to improve test and item accessibility: Less is more. In S. N. Elliott,
R. J. Kettler, P. A. Beddow, & A. Kurz (Eds.), Handbook of accessible achievement tests for all students: Bridging the
gaps between research, practice, and policy (pp. 231–242). New York, NY: Springer.
Kettler, R. J., Elliott, S. N., & Beddow, P. A. (2009). Modifying achievement test items: A theory-guided and data-based
approach for better measurement of what students with disabilities know. Peabody Journal of Education, 84,
529–551.
Kettler, R. J., Rodriguez, M. C., Bolt, D. M., Elliott, S. N., Beddow, P. A., & Kurz, A. (2011). Modified multiple-choice
items for alternate assessments: Reliability, difficulty, and differential boost. Applied Measurement in Education,
24(3), 210–234.
Kettler, R. J., Russell, M., Camacho, C., Thurlow, M., Ketterlin-Geller, L., Godin, K., McDivitt, P., Hess, K., & Bechard,
S. (2009). Improving reading measurement for alternate assessment: Suggestions for designing research on item and
test alternations. White paper based on the Invitational Research Symposium on Alternate Assessments Based on
Modified Achievement Standards in Reading, Arlington, VA. Retrieved from http://alternateassessmentdesign.sri.
com/additionalresources_info.html
Kuechler, W. L., & Simkin, M. G. (2010). Why is performance on multiple-choice tests and constructed-response tests
not more closely related? Theory and an empirical test. Decision Sciences Journal of Innovative Education, 8(1),
55–73.
Kingsbury, F. A. (1922). Analyzing ratings and training raters. Journal of Personnel Research, 1, 377–383.
Kinney, L. B. (1932). Summary of investigations comparing different types of tests. School and Society, 36, 540–544.
Kneeland, N. (1929). The lenient tendency in rating. Personnel Journal, 7, 356–366.
Knoch, U. (2011). Investigating the effectiveness of individualized feedback to rating behavior—a longitudinal study.
Language Testing, 28(2) 179–200.
Knowles, S. L., & Welch, C. A. (1992). A meta-analytic review of item discrimination and difficulty in multiple-choice
items using none-of-the-above. Educational and Psychological Measurement, 52, 571–577.
Kolen, M. J., & Lee, W-C. (2011). Psychometric properties of raw and scale scores on mixed-format tests. Educational
Measurement: Issues and Practice, 30(2), 15–24.
Komorita, S. S., & Graham, W. K. (1965). Number of scale points and the reliability of scales. Educational and Psychologi-
cal Measurement, 25, 987–995.
Koretz, D., Lewis, E., Skewes-Cox, T., & Burstein, L. (1993). Omitted and not-reached items in Mathematics in the 1990
National Assessment of Educational Progress. CSE Technical Report 357. Los Angeles: Center for Research on Evalu-
ation, Standards and Student Testing.
Kramer, G. A., & Neumann L. M. (2003). Confirming the validity of Part II of the National Board Dental Examinations:
A practice analysis. Journal of Dental Education, 67(12), 1286–1298.
Kreitzer, A. E., & Madaus, G. F. (1994). Empirical investigation of the hierarchical structure of the taxonomy. In L.
W. Anderson & L. A. Sosniak (Eds.) Bloom’s taxonomy: a forty-year retrospective. Ninety-third yearbook of the
National Society for the Study of Education, (Pt. 2., pp. 64–81). Chicago, IL: University of Chicago Press.
Kropp, R. P., & Stoker, H. W. (1966). The construction and validation of tests of the cognitive processes as described in the
taxonomy of educational objectives. Tallahassee, FL: Florida State University.
Krosnick, J. A. (1999). Survey research. Annual Review of Psychology, 50, 537–567.
Krosnick, J. A., & Presser, S. (2010). Question and questionnaire design. In P. V. Marsden & J. D. Wright (Eds.), Hand-
book of survey research (2nd ed., pp. 263–313). United Kingdom: Emerald.
Kunen, S., Cohen, R., & Solman, R. (1981). A levels-of-processing analysis of Bloom’s taxonomy. Journal of Educational
Psychology, 73(2), 202–211.
LaDuca, A. (1994). Validation of professional licensure examinations: Professions theory, test design, and construct
validity. Evaluation & the Health Professions, 17(2), 178–197.
LaDuca, A., Downing, S. M., & Henzel, T. R. (1995). Test development: Systematic item writing and test construction.
In J. C. Impara & J. C. Fortune (Eds.), Licensure examinations: Purposes, procedures, and practices (pp. 117–148).
Lincoln, NE: Buros Institute of Mental Measurements.
LaDuca, A., Staples, W. I., Templeton, B., & Holzman, G. B. (1986). Item modeling procedure for constructing content-
equivalent multiple-choice questions. Medical Education, 20, 53–56.
Lai, H., Gierl, M., & Alves, C. (2010, April). Generating items under the assessment engineering framework. Paper pre-
sented at the annual meeting of the National Council on Measurement in Education. Denver, CO.
Lance, C. E., LaPointe, J. A., & Stewart, A. M. (1994). A test of the context dependency of three causal models of halo rater
error. Journal of Applied Psychology, 79(3), 332–340.
Lane, S. (2004). Validity of high-stakes assessment: Are students engaged in complex thinking? Educational Measurement: Issues and Practice, 23, 6–14.
Lane, S. (2010). Performance assessment: The state of the art. (SCOPE Student Performance Assessment Series). Stanford,
CA: Stanford University, Stanford Center for Opportunity Policy in Education.
Lane, S. (2011). Performance assessment: The state of the art. In L. Darling-Hammond (Ed.), Performance tests: Measuring
student achievement so that students succeed.
Lane, S. (2013). Performance assessment in education. In K. F. Geisinger (Ed.), APA handbook of testing and assessment
in psychology. Washington DC: American Psychological Association.
Lane, S., & Stone, C. A. (2006). Performance assessment. In R. L. Brennan (Ed.) Educational measurement (4th ed., pp.
387–431). Westport, CT: American Council on Education/Praeger.
Lane, S., Wang, N., & Magone, M. (1996). Gender related differential item functioning on a middle school mathematics
performance assessment. Educational Measurement: Issues and Practice, 15(4), 21–27, 31.
Latham, G. P., Wexley, K. N., & Pursell, E. D. (1975). Training managers to minimize rating errors in the observation of
behavior. Journal of Applied Psychology, 60, 550–555.
Lawrence, J. A., & Singhania, R. P. (2006). A study of teaching and testing strategies for a required statistics course for
undergraduate business students. Journal of Business Education, 79(6), 333–338.
Leckie, G., & Baird, J.-A. (2011). Rater effects on essay scoring: A multi-level analysis of severity drift, central tendency,
and rater experience. Journal of Educational Measurement, 48(4), 399–418.
Leighton, J. P. (2012). Learning sciences, cognitive models, and automatic item generation. In M. J. Gierl & T. M. Hala-
dyna (Eds.) Automatic item generation (pp. 121–135). New York, NY: Routledge.
Leighton, J. P., & Gierl, M. J. (2007). Defining and evaluating models of cognition used in educational measurement to
make inferences about examinees’ thinking processes. Educational Measurement: Issues and Practice, 26(2), 3–15.
LeMahieu, P. G., Gitomer, D. H., & Eresh, J. T. (1995). Portfolios beyond the classroom: Data quality and qualities (Center
for Performance Test MS #94-1). Princeton, NJ: Educational Testing Service.
Levine, M. V., & Drasgow, F. (1983). The relation between incorrect option choice and estimated ability. Educational and
Psychological Measurement, 43, 675–685.
Levine, M. V., & Rubin, D. B. (1979). Measuring the appropriateness of multiple-choice test scores. Journal of Educational
and Behavioral Statistics, 4(4), 269–290.
Lewis, A., & Smith, D. (1993). Defining higher-order thinking. Theory into Practice, 32(3), 131–137.
Lewis, J. C., & Hoover, H. D. (1981, April). The effect of pupil performance from using hand-held calculators during
standardized mathematics achievement tests. Paper presented at the annual meeting of the National Council on
Measurement in Education, Los Angeles, CA.
Ling, G. (2009). Individual test-takers’ reliability and construct validity evidence. Paper presented at the annual meeting of
the American Educational Research Association.
Linn, R. L. (Ed.). (1989). Educational measurement (3rd ed.). New York, NY: American Council on Education and
Macmillan.
Linn, R. L. (1990). Admissions testing: Recommended uses, validity, differential prediction, and coaching. Applied Mea-
surement in Education, 3(4), 297–318.
Linn, R. L., & Gronlund, N. E. (2000). Measurement and assessment in teaching (8th ed.). Des Moines, IA: Prentice-Hall.
Linn, R. L., & Miller, M. D. (2005). Measurement and assessment in teaching (9th ed.). Upper Saddle River, NJ: Pearson
Education.
Livingston, S. A. (2006). Item analysis. In S. M. Downing & T. M. Haladyna (Eds.) Handbook of test development (pp.
421–441). Mahwah, NJ: Lawrence Erlbaum Associates.
Livingston, S. A. (2009). Constructed-response test questions: Why we use them; how we score them. R&D Connections,
11. Princeton, NJ: Educational Testing Service.
Livne, N. L., Livne, O. E., & Wight, C. A. (2007). Can automated scoring surpass hand grading of students’ constructed
responses and error patterns in mathematics? MERLOT Journal of Online Learning and Teaching, 3(3), 295–306.
Retrieved from http://jolt.merlot.org/vol3no3/livne.pdf
Loevinger, J. (1957). Objective tests as instruments of psychological theory. Psychological Reports, Monograph Supple-
ment, 3, 635–694.
Lohman, D. F. (1993). Teaching and testing to develop fluid abilities. Educational Researcher, 22, 12–23.
Longford, N. T. (1993). Reliability of essay rating and score adjustment. Program Statistics Research Technical Report No.
93–36. Princeton, NJ: Educational Testing Service.
Longford, N. T. (1994). Reliability of essay rating and score adjustment. Journal of Educational and Behavioral Statistics,
19(3), 171–200.
Longford, N. T. (1995a). Adjusting for reader rating behaviors in the Test of Written English (Research Report 95-36). Princeton, NJ: Educational Testing Service.
Longford, N. T. (1995b). Models for uncertainty in educational testing. New York, NY: Springer-Verlag.
Lord, F. M. (1944). Reliability of multiple-choice tests as a function of the number of choices per item. Journal of Educa-
tional Psychology, 35, 175–180.
Lord, F. M. (1952). A theory of test scores. Psychometrika Monograph No. 7.
Lord, F. M. (1958). Some relations between Guttman’s principal components of scale analysis and other psychometric
theory. Psychometrika, 23, 291–296.
Lord, F. M. (1964). The effect of random guessing on test validity. Educational and Psychological Measurement, 4,
745–747.
Lord, F. M. (1974). Estimation of latent ability and item parameters when there are omitted responses. Psychometrika,
39(2), 247–264.
Lord, F. M. (1977). Practical applications of item characteristic curve theory. Journal of Educational Measurement, 14(2), 117–138.
Lord, F. M. (1977). Optimal number of choices per item—A comparison of four approaches. Journal of Educational
Measurement, 14, 33–38.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Loyd, B. H. (1991). Mathematics test performance: The effects of item type and calculator use. Applied Measurement in
Education, 4(1), 11–22.
Lu, Y., & Sireci, S. G. (2007). Validity issues in test speededness. Educational Measurement: Issues and Practice, 26(4),
29–37.
Ludlow, L. H., & O’Leary, M. (1999). Scoring omitted and not-reached items: Practical data analysis implications. Educa-
tional and Psychological Measurement, 59(4), 615–630.
Luecht, R. M. (2006a, May). Engineering the test: From principled item design to automated test assembly. Paper presented
at the annual meeting of the Society for Industrial and Organizational Psychology, Dallas, TX.
Luecht, R. M. (2006b, September). Assessment engineering: An emerging discipline. Paper presented in the Centre for
Research in Applied Measurement and Evaluation, University of Alberta, Edmonton, AB, Canada.
Luecht, R. M. (2007, April). Assessment engineering in language testing: From data models and templates to psychometrics.
Invited paper presented at the annual meeting of the National Council on Measurement in Education, Chicago,
IL.
Luecht, R. M. (2012). An introduction to assessment engineering for automatic item generation. In M. J. Gierl & T. M.
Haladyna (Eds.) Automatic item generation (pp. 59–76). New York, NY: Routledge.
Luecht, R. M., Burke, M., & Devore, R. (2009). Task modeling of complex computer-based performance exercises. Paper
presented at the annual meeting of the National Council on Measurement in Education. San Diego, CA.
Luecht, R. M., Gierl, M. J., Tan, X., & Huff, K. (2006, April). Scalability and the development of useful diagnostic scales.
Paper presented at the annual meeting of the National Council on Measurement in Education, San Francisco, CA.
Lukhele, R., Thissen, D., & Wainer, H. (1993). On the relative value of multiple-choice, constructed-response, and exam-
inee-selected items on two achievement tests. Journal of Educational Measurement, 31(3), 234–250.
Lumley, T., & McNamara, T. F. (1995). Rater characteristics and rater bias: Implications for training. Language Testing,
12(1), 54–71.
Lunz, M. E., & O’Neill, T. R. (1997). A longitudinal study of judge leniency and consistency. Paper presented at the
annual meeting of the American Educational Research Association, Chicago, IL.
Maihoff, N. A., & Mehrens, W. A. (1985, April). A comparison of alternate-choice and true-false item forms used in class-
room examinations. Paper presented at the annual meeting of the National Council on Measurement in Education,
Chicago, IL.
Maihoff, N. A., & Phillips, E. R. (1988, April). A comparison of multiple-choice and alternate-choice item forms on class-
room tests. Paper presented at the annual meeting of the National Council on Measurement in Education, San
Francisco, CA.
Marsh, E., Roediger, H. L. III, Bjork, R. A., & Bjork, E. (2007). The memorial consequences of multiple-choice guessing.
Psychonomic Bulletin & Review, 14(2), 194–199.
Martinez, M. E. (1991). A comparison of multiple-choice and constructed figural response items. Journal of Educational
Measurement, 28(2), 131–145.
Martinez, M. E. (1993). Cognitive processing requirements of constructed figural response and multiple-choice items in
architecture assessment. Applied Measurement in Education, 6, 167–180.
Martinez, M. E. (1999). Cognition and the question of test item format. Educational Psychologist, 34(4), 207–218.
Martinez, M. E., & Katz, I. R. (1995). Cognitive processing requirements of constructed figural response and multiple-
choice items in architecture assessment. Educational Assessment, 3(1), 83–98.
Masters, J. (2010). A comparison of traditional test blueprinting and item development to assessment engineering in a licen-
sure context. Greensboro, NC: University of North Carolina at Greensboro.
Masters, J., & Luecht, R. M. (2010, April). Assessment engineering quality assurance steps: Analyzing sources of variation
in task models and templates. Paper presented at the annual meeting of the National Council on Measurement in
Education, Denver, CO.
Matell, M. S., & Jacoby, J. (1971). Is there an optimal number of alternatives for Likert scale items? Study I: Reliability and
validity. Educational and Psychological Measurement, 31(3), 657–674.
Mathews, C. O. (1929). Erroneous first impressions on objective tests. Journal of Educational Psychology, 20, 280–286.
Mayer, R. E. (2002). A taxonomy for computer-based assessment of problem solving. Computers in Human Behavior, 18(6), 623–632.
McCoy, K. M. (2010). Impact of item parameter drift on examinee ability measures in a computer-adaptive environment (Doctoral dissertation). Chicago, IL: University of Illinois at Chicago.
McDonald, R. P. (1985). Factor analysis and related methods. Hillsdale, NJ: Erlbaum.
McDonald, R. P. (1999). Test theory. Mahwah, NJ: Lawrence Erlbaum Associates.
McGuire, C. H. (1966). The oral examination as a measure of professional competence. Academic Medicine, 41,
267–274.
Meijer, R. R., & Sijtsma, K. (1995). Detection of aberrant item score patterns: A review of recent developments. Applied
Measurement in Education, 8(3), 261–272.
Meijer, R. R., Molenaar, I. W., & Sijtsma, K. (1994). Influence of person and group characteristics on nonparametric
appropriateness measurement. Applied Psychological Measurement, 18, 111–120.
Meijer, R. R., Muijtjens, A. M. M., & van der Vleuten, C. P. M. (1996). Nonparametric person-fit research: Some theoreti-
cal issues and an empirical evaluation. Applied Measurement in Education, 9(1), 77–90.
Merrill, M. D. (1994). Instructional design theory. Englewood Cliffs, NJ: Educational Technology Publications.
Messick, S. (1984). The psychology of educational measurement. Journal of Educational Measurement, 21, 215–237.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–104). New York, NY: Ameri-
can Council on Education and Macmillan.
Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23(2), 13–23.
Messick, S. (1995a). Validity of psychological assessment: Validation of inferences from persons’ responses and perfor-
mances as scientific inquiry into score meaning. American Psychologist, 50, 741–749.
Messick, S. (1995b). Standards of validity and the validity of standards in performance assessment. Educational Measure-
ment: Issues and Practice, 14(4), 5–8.
Meyer, G. (1934). An experimental study of the old and new types of examination: I. The effect of the examination set on
memory. Journal of Educational Psychology, 25, 641–661.
Meyer, G. (1935). An experimental study of the old and new types of examination: II. Methods of study. Journal of Edu-
cational Psychology, 26, 30–40.
Milia, L. D. (2007). Benefitting from multiple-choice exams: The positive impact of answer switching. Educational Psychology, 27(5), 607–615.
Miller, G. (1990). The assessment of clinical skills/competence/performance. Academic Medicine, 65, S63–67.
Miller, G. A. (1956). The magical number seven, plus or minus two: Some limits on our capacity for processing informa-
tion. Psychological Review, 63(2), 81–97.
Miller, W. G., Snowman, J., & O’Hara, T. (1979). Application of alternative statistical techniques to examine the hierar-
chical ordering in Bloom’s taxonomy. American Educational Research Journal, 16(3), 241–248.
Millman, J., & Greene, J. (1999). The specification and development of tests of ability and achievement. In R. L. Linn
(Ed.). Educational Measurement (3rd ed., pp. 335–366). New York, NY: American Council on Education & Mac-
millan Publishing.
Minnesota Department of Education (2011). Graduation-required assessment for diploma: Written composi-
tion item sampler. Roseville, MN: Author. Retrieved from http://www.mnstateassessments.org/resources/
ItemSamplers/WC_GRAD_Item_Sampler_Prompt-1.pdf
Mislevy, R. (2006). Cognitive psychology and educational assessment. In R. L. Brennan (Ed.), Educational measurement
(4th ed., pp. 257–305). Westport, CT: American Council on Education/Praeger.
Mislevy, R. (2008). How cognitive science challenges the educational measurement tradition. College Park, MD: University
of Maryland.
Mislevy, R. J., & Riconscente, M. M. (2006). Evidence-centered assessment design. In S. M. Downing & T. Haladyna
(Eds.), Handbook of test development (pp. 61–90). Mahwah, NJ: Lawrence Erlbaum Associates.
Mislevy, R. J., Steinberg, L. S., Almond, R. G., & Lukas, J. (2006). Concepts, terminology, and basic models of evidence-centered design. In D. M. Williamson, R. J. Mislevy, & I. Bejar (Eds.), Automatic scoring of complex tasks in computer-based testing (pp. 15–41). Mahwah, NJ: Lawrence Erlbaum Associates.
Mislevy, R. J., Winters, F. I., Bejar, I. I., Bennett, R. E., & Haertel, G. D. (2010). Technology supports for assessment design. In B. McGaw, E. Baker, & P. Peterson (Eds.), International encyclopedia of education (3rd ed., pp. 56–65). Oxford: Elsevier.
Mislevy, R. J., & Wu, P.-K. (1996). Missing responses and IRT ability estimation: Omits, choice, time limits, and adaptive
testing. (Research Report RR-96-30-ONR). Princeton, NJ: Educational Testing Service.
Moon, T. R., & Hughes, K. R. (2002). Training and scoring issues involved in large-scale writing assessments. Educational
Measurement: Issues and Practice, 21(2), 15–19.
Moreno, R., Martinez, R. J., & Muniz, J. (2006). New guidelines for developing multiple-choice items. Methodology, 2(2), 65–72.
Morley, M. E., Bridgeman, B., & Lawless, R. R. (2004). Transfer between variants of quantitative items (GRE Board Rep.
No. 00-06R). Princeton, NJ: Educational Testing Service.
Morrison, S., & Free, K. W. (2001). Writing multiple-choice test items that promote and measure critical thinking. Jour-
nal of Nursing Education, 40(1), 17–24.
Mortimer, T., Stroulia, E., & Yazdchi, M. V. (2012). IGOR: A web-based automated assessment generation tool. In M. J.
Gierl & T. M. Haladyna (Eds.) Automatic item generation (pp. 217–230). New York, NY: Routledge.
Mouly, G. J., & Walton, L. E. (1962). Test items in education. New York, NY: McGraw-Hill.
Mueller, D. J., & Shwedel, A. (1975). Some correlates of net gain resultant from answer changing on objective achieve-
ment tests. Journal of Educational Measurement, 12(4), 251–254.
Mueller, D. J., & Wasser, V. (1977). Implications of changing answers on objective test items. Journal of Educational
Measurement, 14(1), 9–13.
Myford, C. (2009). Monitoring rater performance over time: A framework for detecting differential accuracy and dif-
ferential scale category use. Journal of Educational Measurement, 46(4), 371–389.
Myford, C., & Wolfe, E. W. (2003). Detecting and measuring rater effects using the many-faceted Rasch model. Journal
of Applied Measurement, 4, 386–422.
Nardi, P. M. (2005). Doing survey research: A guide to quantitative methods (2nd ed.). Boston, MA: Allyn & Bacon.
National Assessment Governing Board (2006). Economics framework for the 2006 National Assessment of Educational
Progress. Washington DC: Authors.
National Board for Professional Teaching Standards. (2009). Scoring guide for candidates. Adolescence and young
adulthood English language arts. Arlington, VA: Authors. Retrieved from http://www.nbpts.org/userfiles/File/
AYA_ELA_Scoring_Guide.pdf
National Board for Professional Teaching Standards. (2011). General portfolio instructions. Arlington, VA: Authors.
Retrieved from http://www.nbpts.org/userfiles/file/Part1_general_portfolio_instructions.pdf
National Center for Education Statistics. (2012). Digest of Education Statistics, 2010 (NCES 2011-015). Washington DC:
US Department of Education. Retrieved from http://nces.ed.gov/fastfacts/display.asp?id=64
National Conference of Bar Examiners (2011). 2012 multistate essay examination information booklet. Madison, WI:
Authors.
National Council of Teachers of Mathematics (2012). Geometry standards. Retrieved from http://www.nctm.org/standards/content.aspx?id=314
National Dental Examining Board of Canada. (2008). Technical Manual for The National Dental Examining Board of
Canada Written Examination and Objective Structured Clinical Examination. Ottawa, Ontario Canada: Author.
NCES (2011). The nation’s report card: Reading 2011 (NCES 2012–457). Washington DC: National Center for Education
Statistics, Institute of Education Sciences, US Department of Education.
NCES (2012). Fast facts: Students with disabilities. Washington DC: National Center for Education Statistics, Institute of Education Sciences, US Department of Education. Retrieved from http://nces.ed.gov/fastfacts/display.asp?id=64
Nesi, H., & Meara, P. (1991). How using dictionaries affects performance in multiple-choice ESL tests. Reading in a For-
eign Language, 8(1), 631–643.
Nichols, P., & Sugrue, B. (1999). The lack of fidelity between cognitively complex constructs and test development. Edu-
cational Measurement: Issues and Practice, 18(2), 18–29.
Nicpon, M. F., Allmon, A., Sieck, B., & Stinson, R. D. (2011). Empirical investigation of twice-exceptionality: Where have
we been and where are we going? Gifted Child Quarterly, 55(1), 3–17.
Nishisato, S. (1980). Analysis of categorical data: Dual scaling and its applications. Toronto, Canada: University of
Toronto.
Nitko, A. J. (1985). [Review of Roid and Haladyna's A technology for test item writing]. Journal of Educational Measure-
ment, 21, 201–204.
Nitko, A. J. (2001). Educational assessment of students. Des Moines, IA: Prentice-Hall.
Nnodim, J. O. (1992). Multiple-choice testing in anatomy. Medical Education, 26, 301–309.
Norcini, J., Swanson, J., Grosso, D., Shea, J., & Webster, G. A. (1984). A comparison of knowledge, synthesis and clinical
judgment multiple choice questions in the assessment of physician competence. Evaluation in the Health Profes-
sions, 7, 485–499.
Nunnally, J. C. (1967). Psychometric theory (1st ed.). New York, NY: McGraw-Hill.
Nunnally, J. C. (1977). Psychometric theory (2nd ed.). New York, NY: McGraw-Hill.
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York, NY: McGraw-Hill.
OCR (2011). GCE mathematics examiners’ reports. Nottingham, UK: OCR Publications. Retrieved from http://www.ocr.
org.uk/download/rep_11/ocr_61115_rep_11_gce_june.pdf
O’Neill, K. (1986, April). The effect of stylistic changes on item performance. Paper presented at the annual meeting of the
American Educational Research Association, San Francisco, CA.
Oosterhof, A. C., & Glasnapp, D. R. (1974). Comparative reliabilities and difficulties of the multiple-choice and true-false
formats. The Journal of Experimental Education, 42, 62–64.
Osterlind, S. J., & Merz, W. R. (1994). Building a taxonomy for constructed-response test items. Educational Assessment, 2(2), 133–147.
Ostini, R., & Nering, M. L. (2006). Polytomous item response theory models. Thousand Oaks, CA: Sage.
Palmer, E. J., & Devitt, P. G. (2007). Assessment of higher order cognitive skills in undergraduate education: Modified essay or multiple-choice questions. BMC Medical Education, 7, 49. Retrieved from http://www.biomedcentral.com/1472-6920/7/49/
Paolo, A., & Bonaminio, G. A. (2003). Measuring outcomes of undergraduate medical education: Residency directors' ratings of first-year residents. Academic Medicine, 78(1), 90–95.
Paris, S. G., Lawton, T. A., Turner, J. C., & Roth, J. L. (1991). A developmental perspective on standardized achievement testing. Educational Researcher, 20(5), 12–20.
Patterson, D. G. (1926). Do new and old type examinations measure different mental functions? School and Society, 24, 246–248.
Pearson, Inc. (2008). Versant featured research papers and presentations. Retrieved from http://www.versanttest.com/technology/research.jsp
Peitzman, S. J., Nieman, L. Z., & Gracely, E. J. (1990). Comparison of “fact recall” and “higher order” questions in mul-
tiple-choice examinations as predictors of clinical performance. Academic Medicine, 65(9 suppl.), S59–S60.
Pellegrino, J. W. (1988). Mental models and mental tests. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 49–59).
Hillsdale, NJ: Lawrence Erlbaum Associates.
Pellegrino, J. W., Chudowsky, N., & Glaser, R. (2001). Knowing what students know: The science and design of educational
assessment. Washington DC: National Academy Press.
Pennsylvania Department of Education (2010). The Pennsylvania System of School Assessment–Grade 8 PSSA Writing. Harrisburg, PA: Author.
Penny, J. A. (2003). Reading high-stakes writing samples: My life as a reader. Assessing Writing, 8, 192–215.
Petersen, N. S., Kolen, M. J., & Hoover, H. D. (1989). Scaling, norming, and equating. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 221–262). New York, NY: Macmillan.
Peterson, C. C., & Peterson, J. L. (1976). Linguistic determinants of the difficulty of true-false test items. Educational and
Psychological Measurement, 36, 161–164.
Pinglia, R. S. (1994). A psychometric study of true–false, alternate-choice, and multiple-choice item formats. Indian
Psychological Review, 42(1–2), 21–26.
Pitoniak, M. J., Young, J. W., Martiniello, M., King, T. C., Buteux, A., & Ginsburgh, M. (2009). Guidelines for the assess-
ment of English-language learners. Princeton, NJ: Educational Testing Service. Retrieved from http://www.ets.
org/about/fairness/
Poe, N., Johnson, S., & Barkanic, G. (1992, April). A reassessment of the effect of calculator use in the performance of stu-
dents taking a test of mathematics applications. Paper presented at the annual meeting of the National Council on
Measurement in Education, San Francisco, CA.
Polikoff, M. S. (2010). Instructional sensitivity as a psychometric property of assessments. Educational Measurement:
Issues and Practice, 29(4), 3–14.
Pomplun, M., & Omar, M. H. (1997). Multiple-mark items: An alternative to objective item formats. Educational and
Psychological Measurement, 57(6), 949–962.
Poole, R. L. (1971). Characteristics of the taxonomy of educational objectives: Cognitive domain. Psychology in the
Schools, 8, 379–385.
Popham, W. J. (2001). Teaching to the test. Educational Leadership, 58(6), 16–20.
Potts, G. R., & Peterson, S. B. (1985). Incorporation versus compartmentalization in memory for discourse. Journal of
Memory and Language, 24, 107–118.
Powers, D. A. (2005). Wordiness: A selective review of its influence, and suggestions for investigating its relevance in tests
requiring extended written responses. (Research Report RM-04-08). Princeton, NJ: Educational Testing Service.
Presser, S., Rothgeb, J. M., Couper, M. P., Lessler, J. T., Martin, E., Martin, J., & Singer, E. (2004). Methods for testing and
evaluating survey questionnaires. New York, NY: Wiley.
Preston, C. C., & Colman, A. M. (2000). Optimal number of response categories in rating scales: Reliability, validity,
discriminating power, and respondent preferences. Acta Psychologica, 104, 1–15.
Pyrczak, F. (1972). Objective evaluation of the quality of multiple-choice test items designed to measure comprehension
of reading passages. Reading Research Quarterly, 8(1), 62–71.
Rabbitt, P. M. (1966). Errors and error correction in choice-response tasks. Journal of Experimental Psychology, 71(2),
264–272.
Rabinowitz, H. K., & Hojat, M. (1989). A comparison of the modified essay question and multiple-choice question for-
mats: Their relationship to clinical performance. Family Medicine, 21(5), 364–367.
Raymond, M., & Luciw-Dubas, U. A. (2010). The second time around: Accounting for retest effects on oral examinations.
Evaluation & the Health Professions, 33(3), 38–403.
Raymond, M., & Neustel, S. (2006). Determining the content of credentialing examinations. In S. M. Downing & T. M. Haladyna (Eds.), Handbook of test development (pp. 181–223). Mahwah, NJ: Lawrence Erlbaum Associates.
Raymond, M., & Viswesvaran, C. (1993). Least squares models to correct for rater effects in performance assessment.
Journal of Educational Measurement, 30, 253–268.
Reckase, M. D. (1978). A comparison of the one- and three-parameter logistic models for item calibration. Arlington,
VA: Office of Naval Research. Also, paper presented at the annual meeting of the American Educational Research
Association, Toronto, Canada.
Reeve, B. B., & Fayers, P. (2005). Applying item response theory modeling for evaluating questionnaire item and scale
properties. In P. M. Fayers & R. D. Hays (Eds.), Assessing quality of life in clinical trials: Methods and practice
(2nd ed., pp. 55–73). New York, NY: Oxford University Press. Retrieved from http://fds.oup.com/www.oup.
co.uk/pdf/0-19-852769-1.pdf
Rhoades, K., & Madaus, G. (2003). Errors in standardized tests: A systemic problem. (National Board on Educational
Testing and Public Policy Monograph). Boston, MA: Boston College. Retrieved from http://www.bc.edu/research/
nbetpp/statements/M1N4.pdf
Ridge, K. (2001, April). Rater halo error and accuracy in a mathematics performance assessment. Paper presented at the
annual meeting of the American Educational Research Association, Seattle, WA.
Rifkin, W. D., & Rifkin, A. (2005). Correlation between house staff performance on the United States Medical Licensing
Examination and standardized patient encounters. The Mount Sinai Journal of Medicine, 72(1), 47–49.
Roach, A. T., Beddow, P. A., Kurz, A., Kettler, R. J., & Elliott, S. N. (2010). Incorporating student input in developing
alternate assessments based on modified academic achievement standards. Exceptional Children, 77(1), 61–80.
Roberts, D. M. (1993). An empirical study on the nature of trick questions. Journal of Educational Measurement, 30,
331–344.
Roberts, M. R., & Gierl, M. J. (2010). Developing score reports for cognitive diagnostic assessments. Educational Mea-
surement: Issues and Practice, 29(3), 25–38.
Rodriguez, M. C. (1997, April). The art and science of item writing: A meta-analysis of multiple-choice format effects. Paper
presented at the annual meeting of the American Educational Research Association, Chicago, IL.
Rodriguez, M. C. (2002). Choosing an item format. In G. Tindal & T. M. Haladyna (Eds.), Large-scale assessment programs for all students: Validity, technical adequacy, and implementation (pp. 213–231). Mahwah, NJ: Lawrence Erlbaum Associates.
Rodriguez, M. C. (2003). Construct equivalence of multiple-choice and constructed-response items: A random effects
synthesis of correlations. Journal of Educational Measurement, 40(2), 163–184.
Rodriguez, M. C. (2005). Three options are optimal for multiple-choice items: A meta-analysis of 80 years of research.
Educational Measurement: Issues and Practice, 24(2), 3–13.
Rodriguez, M. C. (2009a). Psychometric considerations for alternate assessments based on modified academic achieve-
ment standards. Peabody Journal of Education, 84(4), 595–602.
Rodriguez, M. C. (2009b). Examining distractor effectiveness in modified items for students with disabilities. Presentation
at the Council of Chief State School Officers’ National Conference on Student Assessment, Los Angeles, CA.
Roediger, H. L., III, & Marsh, E. J. (2005). The positive and negative consequences of multiple-choice testing. Journal of Experimental Psychology: Learning, Memory, and Cognition, 31(5), 1155–1159.
Roid, G. H. (1994). Patterns of writing skills derived from cluster analysis of direct writing assessments. Applied Measure-
ment in Education, 7(2), 159–170.
Roid, G. H., & Finn, P. (1977). Algorithms for developing test questions from sentences in instructional materials. Interim
Report. San Diego, CA: Navy Personnel Research and Development Center.
Roid, G. H., & Haladyna, T. M. (1977). Measurement problems in the formative evaluation of instructional systems.
Improving Human Performance, 6, 30–44.
Roid, G. H., & Haladyna, T. M. (1978). The use of domains and item forms in the formative evaluation of instructional
materials. Educational and Psychological Measurement, 38, 19–28.
Roid, G. H., & Haladyna, T. M. (1980). Toward a technology of test item writing. Review of Educational Research, 50,
293–314.
Roid, G. H., & Haladyna, T. M. (1982). Toward a technology of test-item writing. New York, NY: Academic Press.
Ruch, G. M. (1929). The objective or new type examination. New York, NY: Scott Foresman.
Ruch, G. M., & Charles, J. W. (1928). A comparison of five types of objective tests in elementary psychology. Journal of
Applied Psychology, 12, 398–403.
Ruch, G. M., & Stoddard, G. D. (1925). Comparative reliabilities of objective examinations. Journal of Educational Psy-
chology, 12, 89–103.
Rudner, L. M., Bracey, G., & Skaggs, G. (1996). The use of person-fit statistics with one high-quality achievement test.
Applied Measurement in Education, 9(1), 91–109.
Rugg, H. (1922). Is the rating of human character practical? Journal of Educational Psychology, 13, 30–42.
Rupp, A. A., Ferne, T., & Choi, H. (2006). How assessing reading comprehension with multiple-choice questions shapes
the construct: A cognitive processing perspective. Language Testing, 23(4), 441–474.
Ryan, J. M., & DeMark, S. (2002). Variation in achievement test scores related to gender, item format, and content area
tests. In G. Tindal & T. M. Haladyna (Eds.), Large-scale assessment programs for all students: Validity, technical adequacy, and implementation (pp. 67–88). Mahwah, NJ: Lawrence Erlbaum Associates.
Saal, F. E., Downey, R. G., & Lahey, M. A. (1980). Rating the ratings: Assessing the psychometric quality of rating data.
Psychological Bulletin, 88, 413–438.
Sanders, N. M. (1966). Classroom questions. What kinds? New York, NY: Harper & Row.
Savoldelli, G. L., Naik, V. N., Joo, H. S., Houston, P. L., Graham, M., Yee, B., & Hamstra, S. J. (2006). Evaluation of patient
simulator performance as an adjunct to the oral examination for senior anesthesia residents. Anesthesiology, 104(3),
475–481.
Scalise, K. (2010). Innovative item types: New results on intermediate constraint questions and tasks for computer-based
testing using NUI objectives. A paper presented at the annual meeting of the National Council on Measurement in
Education, Denver, CO.
Scalise, K., & Gifford, B. (2006). Computer-based assessment in E-learning: A framework for constructing “intermediate
constraint” questions and tasks for technology platforms. The Journal of Technology, Learning, and Assessment,
4(6).
Scalise, K., & Wilson, M. (2010). Examining student reasoning with bundles in criterion-referenced assessment. Paper
presented at the annual meeting of the National Council on Measurement in Education, Denver, CO.
Schafer, W. D., & Liu, M. (2006). Universal design in educational assessments. College Park, MD: University of Maryland,
Maryland Assessment Research Center for Education Success. Retrieved from http://marces.org/Completed.htm
Scheuneman, J. C., Camara, W. J., Cascallar, A. S., Wendler, C., & Lawrence, I. (2002). Calculator access, use, and type
in relation to performance on the SAT I: Reasoning Test in mathematics. Applied Measurement in Education, 15(1),
95–112.
Schmeiser, C. B., & Welch, C. J. (2006). Test development. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp.
307–353). Westport, CT: American Council on Education and Praeger Publishers.
Schuman, H., & Presser, S. (1981). Items and answers in attitude surveys. New York, NY: Academic Press.
Scouller, K. (2004). The influence of assessment method on students’ learning approaches: Multiple choice question
examination versus assignment essay. Higher Education, 35(4), 453–472.
Seddon, G. M. (1978). The properties of Bloom’s taxonomy of educational objectives for the cognitive domain. Review of
Educational Research, 48, 303–323.
Shahabi, S., & Yang, L. (1990, April). A comparison between two variations of multiple-choice items and their effects on dif-
ficulty and discrimination values. Paper presented at the annual meeting of the National Council on Measurement
in Education, Boston, MA.
Shea, J. A., Poniatowski, P. A., Day, S. C., Langdon, L. O., LaDuca, A., & Norcini, J. (1992). An adaptation of item modeling for developing test-item banks. Teaching and Learning in Medicine, 4(1), 19–24.
Sheehan, K. M., Kostin, I., & Futagi, Y. (2007). Supporting efficient evidence-centered item development for the GRE verbal
measure. (ETS RR-07-29). Princeton, NJ: Educational Testing Service.
Sheehan, K., & Ginther, A. (2001, April). What do multiple choice verbal reasoning items really measure? An analysis of
the cognitive skills underlying performance on a standardized test of reading comprehension skill. Paper presented
at the annual meeting of the National Council on Measurement in Education, Seattle, WA.
Shepard, L. (2000). The role of assessment in a learning culture. Educational Researcher, 29(7), 4–14.
Sherman, S. W. (1976, April). Multiple-choice test bias uncovered by the use of an “I don’t know” alternative. Paper pre-
sented at the annual meeting of the American Educational Research Association, Chicago, IL.
Shermis, M. D., & Burstein, J. C. (Eds.). (2003). Automated essay scoring: A cross disciplinary approach. Mahwah, NJ:
Lawrence Erlbaum Associates.
Simon, S. R., Volkan, K., Hamann, C., Duffey, C., & Fletcher, S. W. (2002). The relationship between second-year medical
students’ OSCE scores and USMLE Step 1 scores. 24(5) 535–539.
Singh, C., & Rosengrant, D. (2003). Multiple-choice test of energy and momentum concepts. American Journal of Physics,
71(6), 607–617.
Sinharay, S. (2010). How often do subscores have added value? Results from operational and simulated data. Journal of
Educational Measurement, 47(2), 150–174.
Sinharay, S., Haberman, S., & Puhan, G. (2007). Subscores based on classical test theory: To report or not to report. Edu-
cational Measurement: Issues and Practice, 26(4), 21–28.
Sireci, S. G., & Zenisky, A. L. (2006). Innovative item formats in computer-based testing: In pursuit of improved construct
representation. In S. M. Downing & T. M. Haladyna (Eds.), Handbook of test development (pp. 329–347). Mahwah,
NJ: Lawrence Erlbaum Associates.
Skakun, E. N., & Gartner, D. (1990, April). The use of deadly, dangerous, and ordinary items on an emergency medical tech-
nicians-ambulance registration examination. Paper presented at the annual meeting of the American Educational
Research Association, Boston, MA.
Skorupski, W. P., & Carvajal, J. (2010). A comparison of approaches for improving the reliability of objective level scores.
Educational and Psychological Measurement, 70(3), 357–375.
Slogoff, S., & Hughes, F. P. (1987). Validity of scoring “dangerous answers” on a written certification examination. Jour-
nal of Medical Education, 62, 625–631.
Smith, E. V., Wakely, M. B., de Kruif, R. E. L., & Swartz, C. W. (2003). Optimizing rating scales for self-efficacy (and other) research. Educational and Psychological Measurement, 63(3), 369–391.
Smith, M., III, White, K. P., & Coop, R. H. (1979). The effect of item type on the consequences of changing answers on multiple-choice tests. Journal of Educational Measurement, 16(3), 203–208.
Smith, R. M., & Kramer, G. A. (1990, April). An investigation of components influencing the difficulty of form-devel-
opment items. Paper presented at the annual meeting of the National Council on Measurement in Education,
Boston, MA.
Snow, R. E. (1993). Construct validity and constructed-response tests. In R. E. Bennett & W. C. Ward (Eds.), Construction
versus choice in cognitive measurement: Issues in constructed response, performance testing, and portfolio assessment
(pp. 45–60). Hillsdale, NJ: Lawrence Erlbaum Associates.
Snow, R. E., & Lohman, D. F. (1989). Implications of cognitive psychology for educational measurement. In R. L. Linn
(Ed.), Educational measurement (3rd ed., pp. 263–332). New York, NY: American Council on Education and
MacMillan.
Solano-Flores, G., & Li, M. (2009a). Generalizability of cognitive interview-based measures across cultural groups. Edu-
cational Measurement: Issues and Practice, 28(2), 9–18.
Solano-Flores, G., & Li, M. (2009b). Language variation and score variation in the testing of English language learners,
native Spanish speakers. Educational Assessment, 14, 180–194.
Solano-Flores, G., & Wang, C. (2011, April). Conceptual framework for analyzing and designing illustrations in science
assessment: Development and use in the testing of linguistically and culturally diverse populations. Paper presented at
the annual meeting of the National Council on Measurement in Education, New Orleans, LA.
Spandel, V. (2001). Creating writers through 6-trait writing assessment and instruction (3rd ed.). Boston, MA: Allyn
& Bacon.
Stagnaro-Green, A., & Downing, S. M. (2006). Use of flawed multiple-choice items by the New England Journal
of Medicine for continuing medical education. Medical Teacher, 28(6), 566–568.
State Medical Boards of the United States & the National Board of Medical Examiners. (2011). Step 3: Content description
and general information. Philadelphia, PA: Author. Retrieved from http://www.usmle.org/pdfs/step-3/2011content_step3.pdf
Statman, S. (1988). Ask a clear question and get a clear answer: An enquiry into the question/answer and the sentence
completion formats of multiple-choice items. System, 16, 367–376.
Stecher, B. M., Klein, S. P., Solano-Flores, G., McCaffery, D., Robyn, A., Shavelson, R. J., & Haertel, E. (2000). The effect
of content, format, and inquiry level on science performance scores. Applied Measurement in Education, 13(2),
139–160.
Sternberg, R. J. (1998). Abilities are forms of developing expertise. Educational Researcher, 27(3), 11–20.
Stiggins, R. J. (1987). Design and development of performance assessments. Educational Measurement: Issues and Prac-
tice, 6(3), 33–42.
Stiggins, R. J. (1994). Student-centered classroom assessment. Columbus, OH: Macmillan.
Stillman, P. et al. (1991). Assessment of clinical skills of residents utilizing standardized patients: A follow-up study and
recommendations for application. Annals of Internal Medicine, 114, 393–401.
Stoker, H. W., & Kropp, R. P. (1964). Measurement of cognitive processes. Journal of Educational Measurement, 1(1), 39–42.
Streeter, L., Bernstein, J., Foltz, P., & DeLand, D. (2011). Pearson’s automated scoring of writing, speaking, and mathemat-
ics. White Paper. Retrieved from http://www.pearsonassessments.com/research
Strunk, W., & White, E. B. (2011). The elements of style. Geneva, NY: The Elements of Style Press.
Subhiyah, R. G., & Downing, S. M. (1993, April). K-type and A-type items: IRT comparisons of psychometric characteristics
in a certification examination. Paper presented at the annual meeting of the National Council on Measurement in
Education, Atlanta, GA.
Sugrue, B. (1995). A theory-based framework for assessing domain-specific problem-solving ability. Educational Mea-
surement: Issues and Practice, 14(3), 29–36.
Sullivan, J. (2009, June 29). Interviews from anywhere: Live video interviews are now a best practice. ERE Daily. Retrieved
from http://www.ere.net/2009/06/29/
Sweller, J. (2010). Element interactivity and intrinsic, extraneous, and germane cognitive load. Educational Psychology
Review, 22, 123–138.
Sympson, J. B., & Haladyna, T. M. (1993). An evaluation of polyweighting in domain-referenced testing. San Diego, CA:
Navy Personnel Research and Development Center.
Tamir, P. (1971). An alternative approach to the construction of multiple-choice test items. Journal of Biological Education, 5, 305–307.
Tamir, P. (1989). Some issues related to the use of justifications to multiple-choice answers. Journal of Biological Education, 23, 285–292.
Tang, R., Shaw, W. M., & Vevea, J. L. (1999). Journal of the American Society for Information Sciences, 50(3), 254–264.
Tarrant, M., & Ware, J. (2008). Impact of item-writing flaws in multiple-choice questions on student achievement in
high-stakes nursing assessments. Medical Education, 42, 198–206.
Tate, R. (2002). Test dimensionality. In G. Tindal & T. M. Haladyna (Eds.), Large-scale assessment programs for all students:
Validity, technical adequacy, and implementation (pp. 181–211). Mahwah, NJ: Lawrence Erlbaum Associates.
Tate, R. (2003). A comparison of selected empirical methods for assessing the structure of responses to test items. Applied
Psychological Measurement, 27(3), 159–203.
Taylor, W. L. (1953). Cloze procedure: A new tool for measuring readability. Journalism Quarterly, 30, 415–433.
The Technical Staff (1933). Manual for examination development (1st ed.). Chicago: University of Chicago Bookstore.
The Technical Staff (1937). Manual for examination development (2nd ed.). Chicago: University of Chicago Bookstore.
Thissen, D. M. (1976). Information in wrong responses to the Raven Progressive Matrices. Journal of Educational Mea-
surement, 14, 201–214.
Thissen, D., Steinberg, L., & Mooney, J. A. (1989). Trace lines for testlets: A use of multiple-categorical-response models.
Journal of Educational Measurement, 26, 247–260.
Thissen, D., Wainer, H., & Wang, X. (1994). Are tests comprising both multiple-choice and free-response items neces-
sarily less unidimensional than multiple-choice tests? An analysis of two tests. Journal of Educational Measurement,
31(2), 113–123.
Thompson, B. (2004). Exploratory and confirmatory factor analysis: Understanding concepts and applications. Washing-
ton, DC: American Psychological Association.
Thompson, S. J., Johnstone, C. J., & Thurlow, M. L. (2002). Universal design applied to large scale assessments (Synthesis
Report 44). Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes.
Thorndike, E. L. (1904). An introduction to the theory of mental and social measurement. New York, NY: The Science
Press.
Thorndike, R. L. (1967). The analysis and selection of test items. In S. Messick & D. Jackson (Eds.), Problems in human
assessment (pp. 201–216). New York, NY: McGraw-Hill.
Thorndike, R. M., & Thorndike-Christ, T. (2010). Measurement and evaluation in psychology and education (8th ed.).
Upper Saddle River, NJ: Pearson Education.
Thurlow, M. L., Laitusis, C. C., Dillon, D. R., Cook, L. L., Moen, R. E., Abedi, J., & O’Brien, D. G. (2009). Accessibility
principles for reading assessments. Minneapolis, MN: National Accessible Reading Assessment Projects.
Tombari, M., & Borich, G. (1999). Authentic assessment in the classroom: Applications and practice. Upper Saddle River,
NJ: Prentice-Hall Inc.
Torres, M., & Roig, M. (2005). The Cloze procedure as a test of plagiarism: The influence of text readability. The Journal
of Psychology: Interdisciplinary and Applied, 139(3), 221–232.
Tourangeau, R., Rips, L., & Rasinski, K. (2000). The psychology of survey response. Cambridge: Cambridge University
Press.
Traub, R. E. (1993). On the equivalence of traits assessed by multiple-choice and constructed-response tests. In R. E.
Bennett & W. C. Ward (Eds.), Construction versus choice in cognitive measurement: Issues in constructed response,
performance testing, and portfolio assessment (pp. 1–27). Hillsdale, NJ: Lawrence Erlbaum Associates.
Traub, R. E., & Fisher, C. W. (1977). On the equivalence of constructed response and multiple-choice tests. Applied Psy-
chological Measurement, 1, 355–370.
Treagust, D. F. (1988). Development and use of diagnostic tests to evaluate students’ misconceptions in science. Interna-
tional Journal of Science Education, 10, 159–169.
Triola, M., et al. (2006). A randomized trial of teaching clinical skills using virtual and live standardized patients. Journal
of General Internal Medicine, 21(5), 424–429.
Tsai, C. C., & Chou, C. (2002). Diagnosing students’ alternative conceptions in science. Journal of Computer Assisted
Learning, 18, 157–165.
Tversky, A. (1964). On the optimal number of alternatives as a choice point. Journal of Mathematical Psychology, 1,
386–391.
U.S. Census Bureau. (2012). Statistical abstract of the US: Income, expenditures, poverty and wealth. Retrieved from http://
www.census.gov/compendia/statab/2012/tables/12s0695.pdf
University of Chicago Press Staff. (2003). The Chicago manual of style. Chicago, IL: University of Chicago Press.
US Department of Justice. (2009). A guide to disability rights laws. Washington DC: US Department of Justice, Civil
Rights Division, Disability Rights Section. Retrieved from http://www.ada.gov/cguide.htm
Vale, C. D. (2006). Computerized item banking. In S. M. Downing & T. M. Haladyna (Eds.), Handbook of test development
(pp. 261–285). Mahwah, NJ: Lawrence Erlbaum Associates.
Van der Flier, H. (1982). Deviant response patterns and comparability of test scores. Journal of Cross-Cultural Psychology,
13, 267–298.
van der Linden, W. J., & Hambleton, R. K. (Eds.). (2010). Handbook of modern item response theory (2nd ed.). New York,
NY: Springer.
van Hoeij, M. J., Haarhuis, J. C., Wierstra, R. F., & van Beukelen, P. (2004). Journal of Veterinary Medical Education,
31(3), 261–267.
Vidler, D., & Hansen, R. (1980). Answer-changing on multiple-choice tests. Journal of Experimental Education, 49(1),
18–20.
Virginia Department of Education. (2011). Virginia grade level alternative: Implementation manual, 2011–2012. Rich-
mond, VA: Author.
von Davier, M., & Molenaar, I. W. (2003). A person-fit index for polytomous Rasch models, latent class models, and their
mixture generalizations. Psychometrika, 68(2), 213–228.
Wainer, H. (1989). The future of item analysis. Journal of Educational Measurement, 26, 191–208.
Wainer, H. (2002). On the automatic generation of test items. In S. H. Irvine & P. C. Kyllonen (Eds.), Item generation for
test development (pp. 287–305). Mahwah, NJ: Lawrence Erlbaum Associates.
Wainer, H., & Thissen, D. (1993). Combining multiple-choice and constructed response test scores: Toward a Marxist
theory of test construction. Applied Measurement in Education, 6, 103–118.
Wainer, H., & Thissen, D. (1994). On examinee choice in educational testing. Review of Educational Research, 64(1), 159–195.
Wainer, H., Sheehan, K., & Wang, X. (2000). Some paths toward making Praxis scores more useful. Journal of Educational
Measurement, 37, 113–140.
Wainer, H., Wang, X., & Thissen, D. (1994). How well can we compare scores on test forms that are constructed by exam-
inees’ choice? Journal of Educational Measurement, 31(3), 183–199.
Wakita, T., Ueshima, N., & Noguchi, H. (2012). Psychological distance between categories in the Likert scale: Comparing
different numbers of options. Educational and Psychological Measurement, 72, 533–546.
Wang, C., & Solano-Flores, G. (2011, April). Illustrations with graphic devices in large-scale assessments: An exploratory
cross-cultural study of students’ interpretation. Paper presented at the annual meeting of the National Council on
Measurement in Education, New Orleans, LA.
Warshawsky, A. J., Rodriguez, M. C., Cabrera, J. C., Palma, J., Albano, A. D., & Vue, Y. (2012, April). Attitudes toward
school and school plans given levels of family alcohol, substance, and physical abuse. Paper presented at the annual
meeting of the American Educational Research Association, Vancouver, Canada.
Washington, W. N., & Godfrey, R. R. (1974). The effectiveness of illustrated items. Journal of Educational Measurement,
11(2), 121–124.
Webb, L. C., & Heck, W. L. (1991, April). The effect of stylistic editing on item performance. Paper presented at the annual
meeting of the National Council on Measurement in Education, Chicago, IL.
Webb, N. (2006). Identifying content for student achievement tests. In S. M. Downing & T. M. Haladyna (Eds.), Hand-
book of test development (pp. 155–180). Mahwah, NJ: Lawrence Erlbaum Associates.
Weigert, S. C. (2011). U.S. policies supporting inclusive assessments for students with disabilities. In S. N. Elliott, R. J.
Kettler, P. A. Beddow, & A. Kurz (Eds.), Handbook of accessible achievement tests for all students: Bridging the gaps
between research, practice, and policy (pp. 19–32). New York, NY: Springer.
Weigle, S. C. (1999). Investigating rater/prompt interactions in writing assessment: Quantitative and qualitative
approaches. Assessing Writing, 6(2), 145–178.
Welch, C. J. (2006). Item and prompt development in performance testing. In S. M. Downing & T. M. Haladyna (Eds.),
Handbook of test development (pp. 303–327). Mahwah, NJ: Lawrence Erlbaum Associates.
Weng, L. (2004). Impact of the number of response categories and anchor labels on coefficient alpha and test-retest reli-
ability. Educational and Psychological Measurement, 64(4), 956–972.
Weng, L., & Cheng, C. (2000). Effects of response order on Likert-type scales. Educational and Psychological Measure-
ment, 60(6), 908–924.
Wesman, A. G. (1971). Writing the test item. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 99–111).
Washington, DC: American Council on Education.
Wiggins, G., & McTighe, J. (2005). Understanding by design (2nd ed.). Alexandria, VA: Association for Supervision and
Curriculum Development.
Wightman, L. F. (1998). An examination of sex differences in LSAT scores from the perspective of social consequences.
Applied Measurement in Education, 11(3), 255–277.
Williams, B. J., & Ebel, R. L. (1957). The effect of varying the number of alternatives per item on multiple-choice vocabu-
lary test items. The 14th Yearbook of the National Council on Measurement in Education (pp. 63–65). Washington,
DC: National Council on Measurement in Education.
Williams, J. B. (2006). Assertion-reason multiple-choice testing as a tool for deeper learning: A qualitative analysis.
Assessment and Evaluation in Higher Education, 31(3), 287–301.
Williams, R. I. (1970). Black pride, academic relevance, and individual achievement. The Counseling Psychologist, 2(1),
18–22.
Williamson, D. M., Bennett, R. E., Lazer, S., Bernstein, J., Foltz, P. W., Landauer, T. K., Rubin, D. P., Way, W. D., &
Sweeney, K. (2010). Automated scoring for the assessment of common core standards. Princeton, NJ: Educational Testing Service.
Author Index
Abedi, J. 15, 25–6, 41, 97, 144, 229, 306–8, 332–3 Beckman, T.J. 379
Abrahamson, S. 294 Beddow, P.A. 302–3, 306–7, 311–12
Adelson, J.L. 168 Bejar, I.I. 29, 116, 132–3, 135, 141–2, 144, 151, 235, 410
Afzal Mir, M. 76 Beller, M. 51, 53, 58
Airasian, P.W. 31 Bellezza, F.S. 396
Albanese, M.A. 71, 73 Bellezza, S.F. 396
Albano, A.D. 364 Bennett, R.E. 49, 116, 132, 141, 143–4, 191–2, 237
Alcolado, J. 76 Bereiter, C. 33
Alex, S. 324 Bergling, B.M. 140
Allalouf, A. 219 Bernardin, H.J. 257, 277, 379
Allen, N. 264 Bernstein, I.H. 340, 344, 347, 356, 388
Allmon, A. 306 Bernstein, J. 237, 238
Almond, R.G. 142 Beullens, J. 75
Alonzo, A.C. 123 Bjork, E. 55
Alves, C. 135–8, 141–2 Bjork, R.A. 55
American Educational Research Association (AERA) 7, Blackmore, D.E. 32
25, 190, 200, 220–1, 229, 283, 316, 319, 322–3, 330, 380, Blake, J.M. 32
386, 388, 400 Bloom, B.S. 31, 283, 328
American Psychological Association (APA) 7, 25, 97, Bolt, D.M. 306–7, 311–12
190, 200, 220–1, 229, 283, 316, 319, 322–3, 329–30, 380, Bonaminio, G.A. 289
386, 388, 400 Bonner, M.W. 52
Anastakis, D.J. 294 Borich, G. 204, 230
Anderson, L.W. 31, 226 Borman, W.C. 277, 379
Andrich, D. 340 Bormuth, J.R. 133, 140, 142, 150, 410
Ansley, T.N. 79 Bottsford-Miller, N.A. 310
Arbet, S. 168 Boyd, M.A. 287, 290
Arim, R. 57 Bracey, G. 398
Arter, J.A. 203 Brant, R. 100
Ascalon, M.E. 105 Braswell, J. 79–80
Attali, Y. 101, 344, 354, 390 Braun, H. 257, 277
Ayers, S.F. 34 Breland, H.M. 52
Azevedo, R. 128 Brennan, R.L. 4, 255, 360, 370, 388, 410,
Bridgeman, B. 79–80, 150, 191, 373
Bacon, D.R. 50 Briggs, D.C. 123
Baird, J-A. 392 Brosvic, G.M. 55
Baker, E. 332 Brown, T.A. 389
Baker, F. 340 Bruen, C. 306–7, 311–12
Baldwin, D. 210 Bryant, D.U. 40
Bannert, M. 336 Budescu, D. 67
Baranowski, R.A. 14, 26, 73, 97, 328–9 Burchard, K.W. 294
Bar-Hillel, M.B. 67, 101, 390 Burke, M. 141
Barkanic, G. 79 Burmester, M.A. 68–69
Barnette, J.J. 160 Burstein, J.C. 263, 392, 415
Barrows, H. 287 Burstein, L. 395
Bayley, R. 229 Buteux, A. 305, 316–17
Bechard, S. 308 Butler, A.C. 55
Becker, B.J. 397
Becker, D.F. 16, 70, 80, 107, 319 Cabrera, J.C. 364
Livingston, S.A. 210, 235, 356 Minnesota Department of Education 224, 241–3, 247,
Livne, N.L. 237 249–50
Livne, O.E. 237 Mirocha, J. 332
Loevinger, J. 9, 264 Mislevy, R.J. 6, 29–30, 56, 129, 135, 141–2, 235, 241, 395,
Lohman, D.F. 6, 29, 141–2, 264, 409 409–10
Longford, N.T. 257, 277, 380 Mocerino, M. 123
Lord, C. 332 Moen, R.E. 306–8
Lord, F.M. 65–6, 68, 340, 351, 356, 372, 394–5, 402, 410, Molenaar, I.W. 396, 399
414 Moon, T.R. 277
Lorscheider, F.L. 100 Mooney, J.A. 402
Loyd, B.H. 79 Moreno, R. 89
Lu, Y. 393 Morgan, R.L. 373
Luciw-Dubas, U.A. 293 Moriarty G. 397
Ludlow, L.H. 395 Morley, M.E. 116, 132, 143–4, 150
Luecht, R.M. 141–3, 410 Morrison, S. 146
Lukas, J. 142 Mortimer, T. 414
Lukhele, R. 56 Mouly, G.J. 98
Lumley, T. 379 Mueller, D.J. 385
Lunz, M.E. 373 Muijtjens, A.M.M. 396
Mundhenk, K. 229
Madaus, G.F. 31, 218–19, 221 Muniz, J. 89
Magone, M. 53 Murphy, G. 191, 209–12, 240
Maihoff, N.A. 68–69 Murphy, H.J. 287, 290
Mandin, H. 54 Murphy, S.E. 296
Marcu, D. 263 Murray, C. 81
Marcy, M.S. 288 Myford, C.M. 57, 257, 276–7, 373, 379
Marsh, E.J. 55
Martin, E. 152 Naik, V.N. 292, 294
Martin, J. 152 Nandakumar, R. 289, 392
Martinez, M.E. 53–6, 191 Nardi, P.M. 365
Martinez, R.J. 89 National Assessment Governing Board (NAGB) 214
Martiniello, M. 305, 316–17 National Board for Professional Teaching Standards
Masters, J. 141, 144 (NBPTS) 204, 207, 235–6, 295–6
Matell, M.S. 166 National Board of Medical Examiners (NBME) 289
Mathews, C.O. 385 National Center for Education Statistics (NCES) 166,
Mayer, R.E. 31, 33 305
McCaffery, D. 56 National Conference of Bar Examiners (NCBE) 227–8,
McCoach, D.B. 168 298
McCoy, K.M. 391 National Council of Teachers of Mathematics
McDivitt, P. 229, 307–8 (NCTM) 79
McDonald, R.P. 388–9 National Council on Measurement in Education
McGaw, B. 32 (NCME) 7, 25, 190, 200, 220–1, 229, 283, 316, 319,
McGrath, D. 306–7, 311–12 322–3, 330, 380, 386, 388, 400
McGuire, C.H. 294 National Dental Examining Board of Canada
McNamara, T.F. 379 (NDEBC) 290
McQueen, J. 256, 379 Nering, M.L. 360
McTighe, J. 241 Nesi, H. 80
McVay, A. 255 Neumann L.M. 8, 13
Meara, P. 80 Neustel, S. 9, 13, 142, 151, 283, 323, 327, 409
Meehl, P.E. 5, 7, 29, 410 Nhouyvanisvong, A. 143
Meghan O. 397 Nichols, P. 14, 29
Mehrens, W.A. 68 Nicpon, M.F. 306
Meijer, R.R. 396, 398–9 Nieman, L.Z. 32
Mengelkam, C. 336 Nishisato, S. 352
Merino, K.D. 289 Nitko, A.J. 74, 140, 407
Merrill, M.D. 34 Nnodim, J.O. 103
Merz, W.R. 191–3 Noguchi, H. 166, 169
Messick, S. 5–7, 10–11, 29–30, 43–5, 49, 264, 326, 359, Nolen, S.B. 397
388–9, 409 Norcini, J.J. 32, 50, 73, 138
Meyer, G. 54, 408 Norman, G.R. 32
Meyers, L.S. 105, 172 Norman, R. 32
Milia, L.D. 385 Novick, M.R. 340, 356, 372, 410
Miller, G.A. 166, 278, 283 Nungester, R.J. 72, 289, 392
Miller, M.D. 28 Nunnally, J.C. 340, 344, 347, 356, 388, 401
Miller, W.G. 32 Nussbaum, E.M. 54, 56
Millman, J. 407
O’Brien, D.G. 306–8 Raymond, M. 9, 13, 142, 151, 283, 293, 323, 327, 373, 409
O’Donovan, S. 81 Reckase, M.D. 342
O’Hara, T. 32 Reeve, B.B. 365
O’Leary, M. 395 Reinhart, M.A. 138
O’Neill, K. 96 Reise, S.F. 340, 347, 360, 386–7
O’Neill, T.R. 373 Reshetar, R. 96
OCR 219 Revuelta, J. 116, 132, 144
Olsen, R.M. 13, 45, 48, 55, 247, 255, 263, 265, 271–2, 358, Reznick, R.K. 294
360, 382, 387 Rhoades, K. 218–19, 221
Olson, L.A. 68–69 Riconscente, M.M. 135, 141–2, 241, 409–10
Omar, M.H. 121 Ridge, K. 257
Oosterhof, A.C. 70 Rifkin, A. 289
Orr, N.A. 138 Rifkin, W.D. 289
Osborn Popp, S. 394–5 Ripkey, D.R. 75, 132
Osterlind, S.J. 191–3 Rips, L. 152, 181
Ostini, R. 360 Roach, A.T. 306–7, 311–12
Roberts, D.M. 94–5
Page, S.H. 133 Roberts, M.R. 143
Palma, J. 364 Robeson, M.R. 65
Palmer, E.J. 54, 126 Robyn, A. 56
Palmer, P. 306–7, 311–12 Rock, D.A. 143, 191
Pankratz, V.S. 379 Rodriguez, M.C. 22, 44, 46, 49–51, 55–9, 63, 65–6, 69,
Paolo, A. 289 74, 81, 89, 99–100, 102, 187, 189–90, 209, 213, 306–7,
Paris, S.G. 397 311–12, 316, 324, 364, 407, 413
Patterson, D.G. 49, 408 Roediger, H.L.III. 55
Patterson, H.L. 133 Roid, G.H. 7, 55, 59, 133, 140–1, 143, 146, 150, 265, 342,
Patz, R.J. 400–1 348, 407, 410
Pearson, Inc. 237–8 Roig, M. 195
Pedrotte, J.T. 89 Rosengrant, D. 54
Peitzman, S.J. 32 Roth, J.L. 397
Pellegrino, J.W. 304, 412 Rothgeb, J.M. 152
Pence, E.C. 379 Rowland-Morin, P. 294
Pennsylvania Department of Education 266 Rubin, D.B. 395, 399
Penny, J.A. 57, 258, 276–7, 369 Rubin, D.P. 237
Petersen, S. 89 Rubin, E. 65
Peterson, C.C. 69 Ruch, G.M. 49, 68, 408, 411
Peterson, J.L. 69 Rudner, L.M. 398
Peterson, N.S. 380 Rugg, H. 373
Peterson, S. 135, 144 Rupp, A.A. 56
Peterson, S.B. 127 Russel, M. 308
Peyton, V. 89 Ryan, J.M. 51–3, 56, 132
Phillips, E.R. 69
Pinglia, R.S. 70 Saal, F.E. 374, 379
Pintrich, P.R. 31 Sabers, D.L. 73
Pitoniak, M.J. 305, 316–17, 350, 410 Sanders, N.M. 31
Plake, B.S. 395 Savoldelli, G.L. 292, 294
Poe, N. 79 Sax, G. 68, 71
Polikoff, M.S. 348, 350 Scalise, K. 40, 111–12, 118, 121–2, 129
Pomplun, M.R. 16, 121, 319 Scardamalia, M. 33
Poniatowski, P.A. 96, 138 Schafer, W.D. 310
Poole, R.L. 33 Scheuneman, J.C. 80
Popham, W.J. 391 Schmeiser, C.B. 209, 220
Potts, G.R. 127 Schuman, H. 152
Powers, D.A. 57, 277–8, 381 Schwab, C. 123
Presser, S. 152, 155 Scott, D.A. 287, 290
Preston, C.C. 166 Scouller, K. 54
Puhan, G. 401 Seddon, G.M. 31, 33
Pursell, E.D. 379 Shahabi, S. 72
Pyrczak, F. 77 Shavelson, R.J. 56
Shaw, W.M. 166
Quardt, D. 143 Shea, J.A. 32, 96, 138
Sheehan, K.M. 141, 143–4, 400
Rabbitt, P.M. 385 Shepard, L. 54
Rabinowitz, H.K. 126 Sherman, S.W. 103
Rasinski, K. 152, 181 Shermis, M.D. 392, 415
Raths, J. 31 Shindoll, R.R. 145, 326
Subject Index
achievement 5–6, 14, 20, 29, 51, 141–3, 147, 150, 191, 327 58, 62, 68, 75, 77, 79, 91, 116, 214–16, 241, 247, 279–81,
American Educational Research Association (AERA) 7, 283
25, 55, 190, 200, 220–1, 229, 283, 316, 319, 322–3, 330, credentialing tests: content 39, 283–5; defining the
380, 386, 388, 400 construct 282; examples including objectively
American Psychological Association (APA) 7, 25, 190, structured clinical examination 290–1; oral
220, 319, 329, 316 examination 201, 293–5; performance using
answer changing 384–386 live patients 285–7; portfolio 295–7; selected-
answer justification 94–5, 106, 118–20, 125, 247, 249, response 297–8; simulation 292; standardized
334–5, 397–8 patient 287–90; written essay 299–300; practice
automatic item generation: evaluation 150–1; analysis 41, 142, 283, 285–6, 287, 289–90, 327
features 141–3; history 133–5; item model 133, 135– differential format functioning 49, 51
144; practical methods 144–9; research 143–4 differential item functioning 25, 53, 316, 330, 341, 358,
386–7
bias see construct-irrelevant variance dimensionality 15, 52, 56, 58, 288, 338–9, 341–2, 347–8,
bluffing 382 358, 384, 388–90, 399–400
distractors 45, 62, 64–5, 91, 100–7, 134; and
calculators 79–80 discrimination 352–5; elimination 307;
clang association 91, 104 evaluation 118, 125, 345, 351; modification 311–15;
cognitive demand (process) 6–8, 12, 14, 18, 20, 24, 28, number of 65–67; one distractor 68–9; weighting 402
30–4, 36–42, 43, 45, 49, 53–5, 56–8, 63–8, 70, 74, 75,
79–81, 90–1, 109, 111, 125, 127–130, 133–5, 149, edge aversion 390–391
192–3,198, 202, 213, 215–17, 226, 251–2, 267, 269, 325, editing items 10, 23, 26, 37, 91, 96–7, 213, 218–19, 328–30
327–8, 358, 408, 409, 410, 413 editorial or style guide 13, 23, 41
cognitive taxonomy 28–4, 34–9, 283–4, 327–8 exceptionalities 304–6; and examples 208–9;
construct 3–8, 412–13 guidelines for constructed-response items 307–8;
constructed-response items 46–8, 189–90, 193–194, 230, guidelines for selected-response items 306–7; item
238–40 modification 312–16; theory of testing 302–4;
constructed-response guidelines 212, 413–14; and universal design 310–12
content concerns 213–18; context concerns 228–230;
format and style concerns 218–20; writing the field testing items 15, 26, 310, 339, 395
directions/stimulus 220–8
constructed-response formats: anecdotal 194–5; history of standardized testing 407–8
cloze 195; demonstration 196; discussion 196; humor 91, 95, 105, 106–7, 304
essay 196–7; exhibition 197; experiment 197–8;
fill-in-the-blank 198; grid-in-response 198– intelligence 5–6, 32
200; interview 200; observation 200–1; oral item analysis 340, 414 see item characteristics
examination 201; performance 202; portfolio 203, item bank 18–21, 90, 96–7, 324
230–1, 295–297; project 204; research paper 205; item characteristics: computer programs 340–1;
review (critique) 205; self/peers 205–6; short- difficulty 53, 70, 80, 90, 96, 101, 132, 135, 141, 143–4,
answer 206; writing sample 206–7; video-based 207 294, 313, 323, 324, 331, 339–343, 350–1, 360–2, 364,
constructed-response scoring: administration 259–60; 391–2, 395, 414; discrimination 15, 19, 20, 24, 53,
automated 235–8; content 241–44; guidelines 240; 80, 90, 96, 101, 132, 135, 141, 143–4, 313, 323, 324,
scoring guided development 244–254; scoring 340, 343–348, 350–1, 356, 360–362, 375, 395, 414;
process 254–259 drift 256–7, 259, 277, 379, 384, 391–2; guessing 65–7,
content: ability 4, 6, 28–30, 37–9, 42, 134; knowledge 69–70, 115, 280, 339, 351; instructional sensitivity 343,
3–5, 28–31, 34–6, 37, 39, 42, 45, 48, 50–1, 58, 142, 198, 348–350, 392; non response 384, 393–5, 398; omitted
283–4, 327; skill 5–6, 23, 28–9, 34, 36–7, 39, 45–6, 48, response 103,384, 393–6; pseudo-chance 347; sample
item characteristics: computer programs (cont.): content 90–5; format 95–6; none-of-the-above
size and composition 229–30, 253, 264–5, 282, 289, option 94, 102–3; origination 89; specific
328, 339, 341–2, 344–5, 353–4, 356, 360–2, 366, 368, determiners 91, 104; specific selected-response
377; standards for evaluating 350–1; subscores 15, formats 107–10; style 96–9; trick items 62, 91, 94–5,
338–9, 348, 358, 389, 399–402 104–5; unfocused stem 90, 99; window dressing 90–1,
item development 324–6, 410–11; and assignments 98–9, 158, 307; writing the options 100–7; writing the
to item writers 8, 18, 21, 325–6; classifying items stem 99–100
for cognitive demand 24, 31, 33, 36–9, 53–4, 328, selected-response issues: calculators 79–80; dangerous
358; classifying items for content 14, 24, 34–39, answers 80–81; dictionaries 80, 97, 329; uncued 64–5,
328; drafting test items 23; editorial and style 121–2
guide 13, 22–23, 90; identifying item writers 21; Standards for Educational and Psychological Testing
item-writing guide 13, 22, 90; keying the item 23–4; (AERA, APA, & NCME) 7,190, 220–1,229, 283, 319,
planning 17–18; recruiting item writers 21; training 322, 330, 335, 380, 386, 388, 398–9
item writers 22–3, 32, 325–6, 332 statistical test score theories: classical 9, 340, 388,
item format 411–12; and corrupts teaching 54–5; 410; generalizability 255–6, 360, 370–1, 410; item
recommendations 58; research 49–58; taxonomy 44– response 19–20, 338–41, 346, 359, 379, 388, 401, 410
8, 112 structured and ill-structured problems 33, 38–9, 411
item response 357–9, 338 survey item: formats 153–5; resources 152–3
item shells 23, 144–6 survey item guidelines: general 155–163; nominal
item validation 11–16, 22, 26, 28, 40, 321–3, 328, 337, 414 constructed-response 180–4; nominal selected-
item weighting 123, 338, 402–3 response 177–80; specific 163–177
systematic error 339, 360–1, 371–9; and halo 374–7;
key balancing 101, 107, 390 idiosyncracy 378; leniency/harshness 373; logical 378;
key check (validation verification) 12, 14, 20, 23–24, proximity 378; restriction in range 374;strategies for
333–5, 340, 350–1 reducing 379–80
learning theory: behavioral learning theory 7, 29, 31, 33, technical documentation 11, 16, 21, 254, 275, 289, 319,
327; cognitive learning theory 29–30, 33, 142, 150, 210, 325–6, 328, 332, 334, 337, 373
327, 409 test item:definition 3–4; item bank 18–19; planning 17
test specifications (item and test specifications, blueprint,
multiple-choice see selected-response two-way grid) 8–9, 13–14, 17–18, 22, 28, 30–1, 40–42,
96, 210, 212–13, 218, 283, 322–3, 326–7, 350, 358, 388,
National Council on Measurement in Education 399–400
(NCME), 7, 25, 190, 220, 319, 322, 380, 386, 388, 400 think-aloud (cognitive lab) 14, 26, 30, 33, 54, 58, 106,
negative phrasing 99–100, 107, 143, 155, 160–4, 307, 309, 160, 169, 229–30, 276, 310, 328, 335–7, 358
363 trace line (option characteristic curve) 341, 345–7, 352–5,
402
operational definition 4–5 true score 9, 30, 44–5, 69, 257, 351, 359, 369, 371–2, 380, 394
opinion-based items 35, 91, 94
option characteristic curve see trace line validity 6–10, 410; and argument-based approach 12–16,
266, 321–2, 324, 400; construct representation 10–11,
person fit 396–399 150, 209, 322, 400; construct-irrelevant variance
proofreading items 14, 26, 91, 96–7, 213, 218–19, 328–30 (bias) 11, 25, 42, 44–5, 49, 51–3, 79–81,168, 225, 229,
proximity of two measures 51, 56 239, 257, 270, 276–7, 298, 303–4, 308–9, 311–12, 330–
random error 11, 44–5, 66, 255, 257–8, 277, 340, 347,351, 3, 340, 351, 359–61, 381–2, 386–7, 391–3; fidelity 9–10,
359, 362, 367–72, 374, 382, 389 13–14, 30, 43–45, 49–56, 58, 80, 190–1, 202, 213, 264,
reviewing test items 24–26, 326–337; and adherence 284, 411, 413; interpretive argument 10, 11–12, 321–2,
to guidelines 328; answer justification 334–337; 410; target domain 8–13, 30, 37, 39–40, 43–4, 51, 53,
cognitive demand 327; content 326–7; editing and 56, 58, 80, 109, 112, 125, 134–5, 142, 150–151, 190–1,
proofing 26, 328–30; fairness 25, 330–331; key 202, 212–13, 225, 264, 282–5, 287–291, 294–5, 322–3,
verification 333–334; language complexity 25–6; 409–10; universe of generalization 8–9, 13, 30, 39–40,
linguistic complexity 332–3; security 132–144, 324, 51, 134, 142, 264–5, 284, 293, 322–3, 410; validity
325, 337 argument 10–12, 28, 44, 266
validity evidence 320; and content-related 322–3;
selected-response formats 45–6; and alternate-choice procedural 26, 322; statistical 15, 26, 58, 255, 257–9,
(two options) 67–9; conventional multiple-choice 62– 310, 322, 389
5; complex multiple-choice 71–2; computer validation 5, 11, 409–10; and item response vii, 3, 11–16,
formats 129–130; extended matching 75–6; generic 22–26, 28, 40–2, 111, 132, 141, 263, 282, 286,316,
items and item sets 116–17, 145–7, 289; matching 73– 321–3, 330–1, 334, 337, 348, 388–9, 398–9, 414;
5; modified essay 125–6; multiple mark 121; multiple subscore 399–402; test score 7, 10–11, 28, 285, 361,
true-false 72–3; ordered multiple-choice 123; testlet 373, 409, 413
(context dependent item set) 76–9, 82–8, 127–8; true- weighting options 402
false 69–71; three-option multiple-choice 65–7; two- writing performance tests: format choice 264–5;
tiered diagnostic 123–5; uncued 122–3 prompts 266, 270–1; rubrics 271–5; selected-response
selected-response guidelines 89–90; and all-of-the- formats 278–81; test design 264–5; threats to
above 90–91, 103; consequences of violations 90; validity 275–8