Developing and Validating
Multiple-Choice Test Items
Third Edition
Thomas M. Haladyna
Arizona State University West
Haladyna, Thomas M.
Developing and validating multiple-choice test items / Thomas M.
Haladyna.—3rd ed.
p. cm.
Includes bibliographical references and index.
ISBN 0-8058-4661-1
1. Multiple-choice examinations—Design and construction.
2. Multiple-choice examinations—Validity. I. Title.
LB3060.32.M85H35 2004
371.26—dc22
2003060112
CIP
Introduction
Revising a book for the second time requires some motivation. Four factors fu-
eled this effort.
First, readers continue to show an interest in a comprehensive treatment
of MC item development and item response validation. These readers surface
in my life in countless ways. Some readers point out an error, share an idea or a
new MC format, ask a question, or simply offer support to the effort to im-
prove test items.
Second, a scientific basis for test item writing has been slow to develop
(Cronbach, 1970; Haladyna & Downing, 1989a, 1989b; Haladyna, Downing, &
Rodriguez, 2002; Nitko, 1985; Roid & Haladyna, 1982). These critics have
pointed out the paucity of research on item development. This book responds to
that criticism.
The third factor is the short yet rich history of efforts to improve MC item
writing. This history dates back to the early 20th century when MC was intro-
duced. Along the way, many testing specialists and educators have contributed
to this book by sharing their ideas, experiences, and collective wisdom in es-
says, textbooks, research, and in other ways. This book draws from this history.
Finally, my more than 30 years of experience in the business of planning, ad-
ministering, and evaluating testing programs and teaching at the elementary,
undergraduate, and graduate levels has helped me better understand the pro-
cess and benefits of well-designed test items and the importance of validating
item responses.
INTENDED AUDIENCE
This book is intended for anyone seriously interested in developing test items
for achievement testing. Students in graduate-level courses in educational
measurement may find this book helpful for better understanding two impor-
tant phases in test development, item development and item response valida-
tion. Those directly involved in developing tests may find this book useful as a
source of new material to enhance their present understanding and their item
development and item response validation practices.
A major premise of this book is that there is indeed a place for MC testing
in the classroom, large-scale assessment of student learning, and tests of
competence in any profession. The public and many test developers and us-
ers need to be more aware of the Standards for Educational and Psychological
Testing (American Educational Research Association [AERA], American
Psychological Association [APA], and National Council on Measurement
in Education [NCME], 1999) and the Guidelines for High Stakes Testing is-
sued by the AERA (2000). In this third edition, these standards and guide-
lines are often linked to recommended item development and item
response validation practices. Once test users are clear about how test re-
sults should and should not be used, we can increase the quality of tests by
sensible and effective item development and item response validation pro-
cedures found in this book.
This third edition has undergone more changes than occurred in the second
edition. For continuity, the book continues to be organized into four sections.
Part I contains three chapters that provide a foundation for writing MC
items. Chapter 1 addresses the most important and fundamental value in any
test: validity. A parallel is drawn between test score validation and item re-
sponse validation. The logical process we use in test score validation also ap-
plies to item response validation because the item response is the most
fundamental unit of measurement in composing the test score. The process of
item development resides in validity with the emphasis on documenting valid-
ity evidence addressing item quality. Chapter 2 addresses the content and cog-
nitive process of test items. Knowledge, skills, and abilities are the three
interrelated categories of content discussed. This chapter principally draws
from cognitive psychology. Existing testing programs also provide some guid-
ance about what we need to measure and the different types of cognitive be-
haviors our items need to represent. Chapter 3 presents a taxonomy of test item
formats that include both constructed response (CR) and MC methods. Chap-
ter 3 also addresses claims for CR and MC formats for measuring different types
of content and cognitive processes. With this foundation in place, the reader is
prepared for the writing of MC items discussed in part II.
Part II is devoted to developing MC items. Chapter 4 presents a variety of
MC item formats and claims about the types of content and cognitive processes
that these formats can measure. Chapter 5 presents guidelines for developing
MC items of various formats. Chapter 6 contains a casebook of exemplary and
innovative MC items with supporting narrative about why these items were
chosen. Chapter 7 provides new and improved guidance on item-generation
techniques.
Part III addresses the complex idea of item response validation. Chapter 8
reports on the rationale and procedures involved in a coordinated series of
activities intended to improve each test item. Chapter 9 deals with item anal-
ysis for evaluating and improving test items. A theoretical perspective is of-
fered that fits within the frameworks of both classical or generalizability
theory and item response theory. Chapter 10 offers information about how
the study of item responses can be used to study specific problems encoun-
tered in testing.
Part IV contains chapter 11, which deals with the trends in item writing and
validation. In contrast with current practice, cognitive theorists are working
on better ways to define what we teach and measure, and test theorists are
developing item-writing theories and item response models that are more ap-
propriate to measuring complex behavior. The fruition of many of these theo-
ries will change the face of education and assessment of learning in profound
ways in the future.
In closing, development of test items and validation of test item responses
remain two critical, related steps in test development. This book intends
to help readers understand the concepts, principles, and procedures available
to construct better MC test items that will lead to more validly interpreted and
used test scores.
—Tom Haladyna
I
A Foundation for
Multiple-Choice Testing
The three chapters in part I provide preparation and background for writing
multiple-choice (MC) test items. These chapters are interdependent.
This first chapter addresses the most important consideration in testing,
which is validity. In validating a specific test score interpretation or use, a body
of validity evidence comes from item development. Another body of evidence
resides with studies of item responses. The two bodies of evidence are shown to
be vital to the validity of any test score interpretation or use.
The second chapter discusses the types of content and cognitive processes
that we want to measure in achievement tests. The organization of content and
cognitive processes is straightforward and easy to follow.
The third chapter presents a taxonomy of MC and constructed-response (CR)
test item formats with links to the content and cognitive processes dis-
cussed in chapter 2. Arguments are presented for what different formats can
accomplish regarding the age-old problem of measuring higher level thinking.
At the end of part I, you should have a good understanding of the role of
item development and item response validation as an integral aspect of va-
lidity, the types of content and cognitive processes we want to measure, and
the variety of formats available to you. You should also understand when to
use an MC format and which formats to use for certain content and cognitive
processes.
1
The Importance of Item
Development for Validity
OVERVIEW
This chapter provides a conceptual basis for understanding the important role
of validity in item development. First, basic terms are defined. Validity refers to
a logical process we follow in testing where what we measure is defined, mea-
sures are created, and evidence is sought and evaluated pertaining to the valid-
ity of interpreting a test score and its subsequent use. This logical process
applies equally to tests and the units that make up tests, namely, test items. This
process involves the validity of test score interpretations and uses and the va-
lidity of item response interpretations and uses. In fact, a primary source of evi-
dence in validating a test score interpretation or use comes from item
development. Thus, we should think of the validation of item responses as a
primary and fundamental source of validity evidence for test score interpreta-
tion and use.
A test item is the basic unit of observation in any test. A test item usually con-
tains a statement that elicits a test taker response. That response is scorable,
usually 1 for a correct response and 0 for an incorrect response, or the response
might be placed on a rating scale from low to high. More sophisticated scoring
methods for item responses are discussed in chapters 9 and 10.
Thorndike (1967) wrote that the more effort we put into building better
test items, the better the test is likely to be. Toward that end, one can design
test items to represent many different types of content and cognitive behav-
iors. Each test item is believed to represent a single type of content and a sin-
gle type of cognitive behavior. For a test item to measure multiple content and
cognitive behaviors goes well beyond the ability of a test item and our ability
to understand the meaning of an item response. A total score on a test repre-
sents some aggregate of performance across all test items for a specific ability
or domain of knowledge. As defined in this book, the test item is intended to
measure some aspect of human ability generally related to school learning or
training. However, this definition of a test item is not necessarily limited to
human ability. It can apply to other settings outside of education or training.
However, the focus in this book is learning. Therefore, we are mainly con-
cerned with achievement tests.
A fundamental dichotomy in item formats is whether the answer is selected
or created. Although a test item is the most basic element of any test, a test item
can seldom stand by itself as a test. Responses to a single test item are often too
fallible. Also, most cognitive abilities or achievement domains measured by a
test are too complex to be represented adequately by a single item. That is why
we score and aggregate item responses to form the test score. The design of any
test to cover something complex is usually extensive because the knowledge,
skills, or abilities we want to measure dictate a complex test design.
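To make the scoring and aggregation of item responses concrete, here is a minimal sketch in Python; the answer key, item identifiers, and responses are hypothetical, and simple dichotomous (1/0) scoring is assumed.

```python
# Dichotomous (1/0) scoring of MC item responses and aggregation into a
# total score. Item IDs, keys, and responses are hypothetical illustrations.

answer_key = {"item01": "B", "item02": "D", "item03": "A", "item04": "C"}

def score_item(response, key):
    """Score one MC response: 1 if it matches the key, 0 otherwise."""
    return 1 if response == key else 0

def total_score(responses, answer_key):
    """Aggregate scored item responses into a single test score."""
    return sum(score_item(responses.get(item), key)
               for item, key in answer_key.items())

examinee = {"item01": "B", "item02": "A", "item03": "A", "item04": "C"}
print(total_score(examinee, answer_key))  # 3 of 4 items answered correctly
```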
DEFINING A TEST
This section contains three distinctions that you might find useful as you think
about what tests and test items measure. Germane to the goal of this book, how
might MC items meet your needs in developing a test or assessing student
learning in the classroom?
Although this book is concerned with MC testing, oddly enough the most
important constructs are not best measured with MC item formats. Never-
theless, MC tests play a vital role in measuring many important aspects of
most constructs. When it comes to the measurement of knowledge and many
cognitive skills, MC is the logical choice. This point and its rationale are fea-
tured in chapter 3.
Achievement
The context for this book is the measuring of achievement that is the goal of in-
struction or training. Achievement is usually thought of as planned changes in
cognitive behavior that result from instruction or training, although certainly
achievement is possible because of factors outside of instruction or training.
All achievement can be defined in terms of content. This content can be repre-
sented as knowledge, skills, or cognitive abilities. Chapter 2 refines the distinc-
tions among these three concepts, and chapter 3 links different item formats to
knowledge, skills, and abilities.
Knowledge is a fundamental type of learning that includes facts, concepts,
principles, and procedures that can be memorized or understood. Most student
learning includes knowledge. Knowledge is often organized into operationally
defined domains. Consider what a dentist-in-training has to learn about dental
anatomy. We have 20 teeth in the juvenile dentition and 32 teeth in the adult
dentition. A dentist has to know the tooth name and the corresponding tooth
number for all 52 teeth. Given the number, the dentist must state the name.
Given the name, the dentist must state the number. These two statements op-
erationally generate 104 test items. This is the entire domain. The MC format
is generally acknowledged as the most useful and efficient way to measure
knowledge. As you can see, if knowledge can be defined in terms of a domain,
the measurement is made easier. Any achievement test is a representative sam-
ple of items from that domain.
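As an illustration of how such an operationally defined domain generates items, the sketch below builds both item types for each tooth in a deliberately abbreviated tooth table and then draws a representative sample; a full table of all 52 teeth would yield the 104 items described above. The tooth names and numbers shown are placeholders, not the dental nomenclature itself.

```python
import random

# Hypothetical mapping of tooth numbers to names; a complete table would
# list all 52 teeth (20 juvenile + 32 adult), giving 2 x 52 = 104 items.
tooth_names = {1: "third molar", 8: "central incisor", 22: "canine"}

def build_domain(tooth_names):
    """Generate both item types (number-to-name, name-to-number) per tooth."""
    items = []
    for number, name in tooth_names.items():
        items.append(f"Given tooth number {number}, state its name. (key: {name})")
        items.append(f"Given the name '{name}', state its number. (key: {number})")
    return items

domain = build_domain(tooth_names)
test = random.sample(domain, k=4)  # a representative sample from the domain
print(len(domain), test)
```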
Skills are learned, observable, performed acts. They are easily recognized in
virtually all settings. In writing, spelling, punctuation, and grammar are observ-
able, performed acts. In mathematics, adding, subtracting, multiplying, and di-
viding are also observable, performed acts. The development of domains of
knowledge can also be applied to skills. Take spelling, for example. It is easy to
identify a domain of words that a learner must correctly spell. The same is true
in mathematics. Because skills are so numerous, any test of student learning
should involve some representative sampling from the domain of items repre-
senting these skills.
Abilities are also learned, but the process is long and involved, perhaps span-
ning an entire lifetime. Abilities require the use of both knowledge and skills in
a complex way. Abilities even have an emotional component. Most abilities are
too complex for operational definition; therefore, we have to resort to CR per-
formance tests that require expert judgment to score. The items we use to mea-
sure an ability often consist of ill-structured problems. It is difficult to explicate
a domain that consists of ill-structured problems. Consider, for example, the
many naturally occurring encounters you have in life that require mathemat-
ics. How many problems exist? What are their form and structure? In limited
ways, MC can serve as a useful proxy for the cumbersome performance tests.
However, any argument for using MC formats instead of performance formats
for a complex ability should be presented and evaluated before an MC format is
used. Chapter 3 provides these arguments and the evidence supporting the
limited use of MC items for measuring abilities.
Intelligence
TABLE 1.1
A Continuum of Cognitive Behavior
What is probably accounting for test performance is not achievement but intel-
ligence. Thus, the role of instruction or training and instructional history is an
important consideration in deciding if a test or test item reflects achievement
or intelligence.
VALIDITY
Validity is "the degree to which accumulated evidence and theory support spe-
cific interpretations of test scores entailed by proposed uses" (American Edu-
cational Research Association [AERA], American Psychological
Association [APA], and National Council on Measurement in Education
[NCME], 1999, p. 84). For every testing program there is a purpose. To fulfill
this purpose, a test score has a clearly stated interpretation and an intended
use. The sponsor of the testing program creates a logical argument and assem-
bles validity evidence supporting that argument. Validity is the degree of sup-
port enabled by the logical argument and validity evidence upholding this
argument. In some instances, the validity evidence works against the argu-
ment and lessens validity. In these instances, the testing organization should
seek and take remedies to reverse the gravity of this negative kind of evi-
dence. The investigative process of creating this argument and collecting va-
lidity evidence testing this argument is validation.
Validity is much like what happens in a court of law. A prosecutor builds an
argument against an accused person concerning the commission of a crime.
The prosecutor collects and organizes evidence to support the argument
against the accused. The defense attorney creates another plausible argument
for the accused and uses evidence to support this argument. One argument op-
poses the other. The jury decides which argument is valid and to what degree
the argument is valid. We do the same thing in testing. We hope that the posi-
tive evidence greatly outweighs the negative evidence and that our argument is
also plausible.
Messick (1989) pointed out that a specific interpretation or use of test re-
sults is subject to a context made up of value implications and social conse-
quences. Thus, thinking of construct validation as merely the systematic
collection of evidence to support a specific test score interpretation or use is in-
sufficient. We must also think of the context that may underlie and influence
this interpretation or use. A good example of consequences comes from well-
documented practices in schools where publishers' standardized achievement
test scores are used as a criterion for educational accountability. Because of ex-
ternal pressure to raise test scores to show educational improvement, some
school personnel take extreme measures to increase test scores. Nolen,
Haladyna, and Haas (1992) showed that a variety of questionable tactics are
used to raise scores that may not increase student learning. The use of a test
that is poorly aligned with the state's curriculum and content standards, cou-
pled with test-based accountability, results in test scores that may not be validly
interpreted or used.
In this book, the focus of validity and validation is both with test scores
and item responses, simply because we interpret and use item responses just
as we interpret and use test scores. Because items and item responses are
subunits of test and test scores, validity is important for both item responses
and test scores.
However, the validity evidence we gather to support interpreting an item re-
sponse is also part of the validity evidence we use to support the interpreting of
a test score.
The test developer should set forth clearly how test scores are intended to be in-
terpreted and used. The population(s) for which a test is appropriate should be
clearly delimited, and the construct that the test is intended to assess should be
clearly described. (AERA et al., 1999, p. 17)
studies, must be clear enough for test developers to construct variables that be-
have according to the ideas about our constructs, as Fig. 1.1 illustrates. Two
constructs are defined. The first is the quality of instruction, and the second is
writing ability that instruction is supposed to influence. In the first phase, both
instruction and writing ability are abstractly defined. It is hypothesized that
quality of instruction influences writing ability. A correlation between mea-
sures of quality of instruction and writing ability tells us to what extent our pre-
diction is borne out by the data. One could conduct formal experiments to
establish the same causal relation.
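As a rough illustration of this kind of correlational evidence, the sketch below computes a Pearson correlation between hypothetical measures of quality of instruction and writing ability; the data are invented solely for illustration.

```python
from statistics import correlation  # Python 3.10+

# Hypothetical scores for a small group of classrooms: a measure of
# quality of instruction and a measure of writing ability.
quality_of_instruction = [3.1, 2.4, 4.0, 3.6, 2.9, 3.8]
writing_ability = [71, 62, 88, 80, 66, 84]

# A positive correlation is consistent with (but does not prove) the
# hypothesis that quality of instruction influences writing ability.
r = correlation(quality_of_instruction, writing_ability)
print(round(r, 2))
```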
In explication, measures of each construct are identified or created. Gen-
erally, multiple measures are used to tap more adequately all the aspects of each
construct. The most direct measure would be a performance-based writing
prompt. MC items might measure knowledge of writing or knowledge of writ-
ing skills, but they would not provide a direct measure. In explication, Messick
(1989) identified a threat to validity: construct underrepresentation.
Frederiksen (1984) argued that the overreliance on MC may have contributed
to overemphasis on learning and testing knowledge at the expense of the more
difficult-to-measure cognitive abilities.
In validation, evidence is collected to confirm our hopes that an achieve-
ment test score can be interpreted and used validly. This evidence includes
empirical studies and procedures (Haladyna, 2002). The evidence should be
well organized and compelling in support of the plausible argument regarding
the validity of the meaning of the test score and the validity of its use. The val-
idation process also includes a summary judgment of the adequacy of this evi-
dence in support of or against the intended interpretation or use.
Messick (1995a, 1995b) provided a structure for thinking about this validity
evidence, and the Standards for Educational and Psychological Testing (AERA et
al., 1999) provide a useful description of the sources of validity evidence.
1. The content of the test, including its relevance to the construct and
the representativeness of the sampling, is a source of validity evidence.
2. The connection of test behavior to the theoretical rationale behind
test behavior is another type of evidence. Claims about what a test measures
should be supported by evidence of cognitive processes underlying perfor-
mance (Martinez, 1998). Implications exist in this category for the choice of
an item format.
3. The internal structure of test data involves an assessment of fidelity of
item formats and scoring to the construct interpretation (Haladyna, 1998).
Messick (1989) referred to this as "structural fidelity." Therefore, a crucial
concern is the logical connection between item formats and desired inter-
pretations. For instance, an MC test of writing skills would have low fidelity
to actual writing. A writing sample would have higher fidelity. Another facet
of internal structure is dimensionality, which is discussed in chapter 10.
4. The external relationship of test scores to other variables is another
type of evidence. We may examine group differences that are known to exist
and we seek confirmation of such differences, or we may want to know if like
measures are more correlated than unlike measures. Another type of rela-
tionship is the test-criterion relationship. The patterns among item responses should clearly
support our interpretations. Evidence to the contrary works against valid in-
terpretation.
5. We hope that any measure of a construct generalizes to the whole of
the construct and does not underrepresent that construct. The general-
izability aspect relates to how test scores remain consistent across different
samples. One aspect of this is differential item functioning (DIF) and bias, a
topic treated in chapter 10. This aspect of validity evidence also refers to de-
velopment of an ability over time.
6. Finally, the consequences of test score interpretations and uses must
be considered, as discussed previously with misuses and misinterpretations
of standardized achievement test scores.
3.6. The types of items, the response formats, scoring procedures, and test
administration procedures should be selected based on the purposes of the test, the
domain to be measured, and the intended test takers. To the extent possible, test
content should be chosen to ensure that intended inferences from test scores are equally
valid for members of different groups of test takers. The test review process should
include empirical analyses and, when appropriate, the use of expert judges to review
items and response formats. The qualifications, relevant experiences, and demographic
characteristics of expert judges should also be documented.
3.7. The procedures used to develop, review, and try out items, and to select items
from the item pool should be documented. If the items were classified into different
categories or subtests according to the test specifications, the procedures used for the
classification and the appropriateness and accuracy of the classification should also be
documented.
3.8. When item tryouts or field tests are conducted, the procedures used to select the
sample(s) of test takers for item tryouts and the resulting characteristics of the sample(s)
should be documented. When appropriate, the sample(s) should be as representative as
possible of the populations for which the test is intended.
3.9. When a test developer evaluates the psychometric properties of items, the
classical or item response theory (IRT) model used for evaluating the psychometric
properties of items should be documented. The sample used for estimating item
properties should be described and should be of adequate size and diversity for the
procedure. The process by which items are selected and the data used for item selection,
such as item difficulty, item discrimination, and/or item information, should also be
documented. When IRT is used to estimate item parameters in test development, the
item response models, estimation procedures, and evidence of model fit should be
documented.
7.3. When credible research reports that differential item functioning exists across
age, gender, racial/ethnic, cultural, disability, and/or linguistic groups in the population
of test takers in the content domain measured by the test, test developers should
conduct appropriate studies when feasible. Such research should seek to detect and
eliminate aspects of test design, content, and format that might bias test scores for
particular groups.
7.4. Test developers should strive to identify and eliminate language, symbols, words,
phrases, and content that are generally regarded as offensive by members of racial,
ethnic, gender, or other groups, except when judged to be necessary for adequate
representation of the domain.
7.7. In testing applications where the level of linguistic or reading ability is not part of
the construct of interest, the linguistic or reading demands of the test should be kept to
the minimum necessary for the valid assessment of the intended construct.
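Standard 3.9 above mentions item difficulty, discrimination, and item information. The sketch below illustrates how the first two might be computed under a classical framework from a hypothetical 0/1 response matrix; item information, an IRT-based quantity, is not computed here.

```python
from statistics import correlation, mean  # Python 3.10+

# Hypothetical 0/1 response matrix: rows are examinees, columns are items.
responses = [
    [1, 0, 1, 1],
    [1, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
]

def item_difficulty(matrix, item):
    """Classical difficulty: proportion of examinees answering correctly."""
    return mean(row[item] for row in matrix)

def item_discrimination(matrix, item):
    """Simple discrimination index: correlation of item scores with total scores."""
    item_scores = [row[item] for row in matrix]
    totals = [sum(row) for row in matrix]
    return correlation(item_scores, totals)

for item in range(len(responses[0])):
    print(item, round(item_difficulty(responses, item), 2),
          round(item_discrimination(responses, item), 2))
```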
This last section discusses the item-development process. Table 1.3 gives a
short summary of the many important steps one follows in developing test
items for a testing program. This section gives the reader a more complete un-
derstanding of the care and detail needed to produce an item bank consisting of
operational items that are ready to use on future tests.
TABLE 1.3
The Item-Development Process
The Plan
The Schedule
The schedule should be realistic and provide a list of tasks and persons who
will be responsible for completing each task. Sometimes schedules can be unre-
alistic, expecting that items can be written in a short time. Experience will
show that developing a healthy item bank may take more than one or two years,
depending on the resources available.
Inventory
The test specifications should be documented, along with their rationale and the
process by which they were developed. The test specifications should define the
content of the test, the proposed number of items and item formats, the de-
sired psychometric properties of the items, and the item and section arrange-
ment. They should also specify the amount of time for testing, directions to the
test takers, procedures to be used for test administration and scoring, and other rel-
evant information.
By knowing the number of items in the test and other conditions affecting
test design, the test developer can ascertain the number of items that need to
be developed. Although it depends on various circumstances, we try to have
about 250% of the items needed for any one test in our item bank. But this esti-
mate may vary depending on these circumstances. The inventory is the main
way that we find out what items are needed to keep our supply of items ade-
quate for future needs.
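A minimal sketch of the inventory arithmetic just described; the 250% target comes from the text, and all other counts are hypothetical.

```python
# Inventory arithmetic: how many new items must be written to keep the
# bank at roughly 250% of the number needed for any one test form.
# The counts below are hypothetical.

items_per_form = 120          # items called for by the test specifications
target_ratio = 2.5            # about 250% of the items needed for one test
usable_items_in_bank = 210    # operational items that survived review and field testing

target_bank_size = int(items_per_form * target_ratio)        # 300
shortfall = max(0, target_bank_size - usable_items_in_bank)  # 90 items to assign

print(target_bank_size, shortfall)
```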
The quality of items depends directly on the skill and expertise of the item writ-
ers. No amount of editing or the various reviews presented and discussed in
chapter 8 will improve poorly written items. For this reason, the recruitment of
item writers is a significant step in the item-development process. These item
writers should be subject-matter experts (SMEs), preferably in a specialty area
for which they will be assigned items. Because these SMEs will be writing items,
they will need to document each item's content authenticity and verify that
there is one and only one right answer. They will also become expert reviewers
of colleagues' items.
An item-writing guide should be developed and given to all item writers. The
guide should be specific about all significant aspects of item writing. At a mini-
mum, the guide should tell item writers which item formats are to be used and
which should be avoided. The guide should have many examples of model
items. Guidelines for writing items such as presented in chapter 5 should be
presented. One feature that is probably not prevalent in most item-writing
guides but is greatly needed is a set of techniques for developing items rapidly. Chap-
ter 6 provides many model items, and chapter 7 provides some techniques to
make item writing easier and faster.
An excellent example of an item-writing guide can be found in Case and
Swanson (2001). The guide is used in training item writers for the national
board examinations in medicine. It is in its third edition and can be found by
going to the National Board of Medical Examiners' web page, www.nbme.org.
Item-Writing Training
Any testing program that is serious about validity should engage all item
writers in item-writing training. The way training and item writing is con-
ducted may seem mundane, but the question arises: Does one type of train-
ing produce better items than other types of training? One study by Case,
Holtzman, and Ripkey (2001), involving the United States Medical Licensing
Examination, addressed this question. In an evaluation of three ap-
proaches to writing items, they used number of items written, quality of
items, and cost as factors. The traditional
training method involved a committee with a chair, formal item-writing
training, assignments to these item writers to write items targeted by con-
tent and cognitive processes, an iteration of reviews and reactions between
editors and authors of items, and an item review meeting. The second type
was a one-time task force that met once, received training, wrote items, and
reviewed each other's items. The third type was an item-harvesting ap-
proach in which a group was asked to write some items, was sent the
item-writing guide described previously, and submitted items for evaluation. The yield
of items was small for the latter two methods, and the quality was
lower. Case et al. preferred the traditional method but acknowledged that
for low-budget testing programs, the latter two methods have merit for pro-
ducing high-quality items.
Item-Writing Assignments
As was stated in the recruitment of item writers, each item writer should be
chosen for a particular subject matter expertise, and considering the inven-
tory, each item writer should be assigned to develop items that will potentially
improve the item bank and eventually make it into future tests. Therefore,
the assignments should be made thoughtfully. Unless item writers are com-
pensated, item writing can be a difficult thing to do if the item writer is a busy
professional, which is often the case. Usually someone is responsible for mon-
itoring item writers and making sure that the assignment is completed on
time, according to the schedule that was adopted.
Conduct Reviews
When items are drafted, they are typically subjected to many complemen-
tary reviews. This is the subject of chapter 8. These reviews are intended to
take these initially drafted items and polish them. The reviews are con-
ducted by different personnel, depending on the nature of the review. One
of the most important reviews is by other SMEs for a judgment of the quality
of the item.
When an item has been properly written and has survived all of these reviews,
the next important step is to try this item out on an operational test. It is im-
portant to assess the item relative to other items on the test, but it is also im-
portant not to use each field test item in obtaining the final test score. If the
item passes this final hurdle and performs adequately, the item can be placed
in the item bank where it can be used in future tests. Chapter 9 provides infor-
mation about the criteria used to evaluate item performance.
Because the test score and the item response have a logical connection, the
process that is defined for validating test score interpretations and uses also ap-
plies to item responses. We can define what an item is supposed to measure and
the type of cognitive behavior it elicits. We can write the item, which is the ex-
plication step in construct validation, and we can study the responses to the
item to determine whether it behaves the way we think it should behave. Table
1.4 shows the parallelism existing between test score validation and item re-
sponse validation.
SUMMARY
In this chapter a major theme is the role validity plays toward making test score
interpretations and uses as truthful as possible. A parallelism exists between
tests and test items and between test scores and item responses. The logic and
validation process applied to tests equally applies to test items, and the validity
evidence obtained at the item level contributes to the validation of test scores.
TABLE 1.4
Three Steps in Construct Validation
1. Formulation
   Test score validation: Define the construct.
   Item response validation: Define the basis for the item in terms of its content
   and cognitive behavior related to the construct.
2. Explication
   Test score validation: The test.
   Item response validation: The item.
3. Validation
   Test score validation: Evidence bearing on the interpretation and use of test
   scores for a specific purpose.
   Item response validation: Evidence bearing on the interpretation and use of an
   item response with other item responses in creating a test score that can be
   validly interpreted or used.
2
Content
and Cognitive Processes
OVERVIEW
As noted in chapter 1, all test scores and associated test item responses have in-
tended interpretations. Both test scores and item responses are also subject to
validation. Although the types of evidence may vary for test score and item re-
sponse validations, the logic and process of validation are the same.
The test specifications that assist us in the design of a test call for selection of
items according to the item's content and the cognitive process thought to be
elicited when a test taker responds to the item. This claim for connection be-
tween what is desired in test specification and the content and cognitive pro-
cess of each test item is fundamental to validity. Therefore, each test item
should be accurately classified according to its content and intended cognitive
process. This chapter is devoted to the related topics of item content and cog-
nitive process, sometimes referred to as cognitive demand.
The first part of this chapter provides a discussion of issues and problems af-
fecting content and cognitive process. The second part presents a simple classi-
fication system for test items that includes natural, generic categories of
content and cognitive processes. Examples appearing in this chapter draw from
familiar content areas: reading, writing, and mathematics. These subjects are
prominent in all national, state, and local school district testing programs.
BACKGROUND
What Is Cognition?
Cognition is the act or process of knowing something. It is perception. Be-
cause cognition involves human thought, it is a private event. In a contrived
This section deals with four issues and problems related to any classification
system involving cognitive process: (a) the distinction between theoreti-
cally based and prescriptive cognitive process taxonomies, (b) the limita-
tions of current prescriptive taxonomies, (c) the ultimate dilemma with
measuring any cognitive process, and (d) the emergence of construct-cen-
tered measurement.
Sanders (1966) provided many examples of test items based on this cognitive
process taxonomy. Anderson and Sosniak (1994) edited a volume of contribu-
tions dealing with aspects of the cognitive taxonomy. Contributing authors dis-
cussed the value and standing of the taxonomy as a means for increasing
concern about the development and measurement of different cognitive pro-
cesses. Despite the taxonomy's widespread popularity, Seddon (1978) reported
in his review of research that evidence neither supports nor refutes the taxon-
omy. A research study by Miller, Snowman, and O'Hara (1979) suggested that
this taxonomy represents fluid and crystallized intelligences. A study by
Dobson (2001) in a college-level class used this taxonomy and found differ-
ences in difficulty. Kreitzer and Madaus (1994) updated Seddon's review and
drew a similar conclusion. Higher level test performance was more difficult and
did not show improvement. However, studies such as this one are too few to
provide evidence that the taxonomy is viable.
Although the Bloom cognitive taxonomy is an imperfect tool and
studies of its validity are seldom up to the task, the taxonomy has contin-
ued to influence educators, psychologists, and testing specialists in their think-
ing about the need to define, teach, and assess higher level achievement.
Although the Bloom taxonomy continues to be an impressive marker in the
history of the study of student achievement, it does not provide the most effec-
tive guidance in test and item design. Most testing programs in my experience
use simpler cognitive classification systems that mainly include the first two
levels of this cognitive taxonomy.
Authors of textbooks on educational measurement routinely offer advice on
how to measure higher level thinking in achievement tests. For example, Linn
and Gronlund (2001) in their eighth edition of this popular textbook suggested
a simple three-category taxonomy, which includes the first two types of learn-
ing in the Bloom cognitive taxonomy and lists application as the third type of
learning. This third category involves the complex use of knowledge and skills.
This simpler approach to defining levels of cognitive behavior is currently the
most popular and easy to use.
Hopes of resolving the dilemma of finding a useful, prescriptive taxonomic
system for classifying items by cognitive process fall to professional organiza-
tions heavily invested in curriculum. Within each organization or through joint
efforts of associated organizations, content standards have emerged in reading,
writing, mathematics, science, and social studies.
The National Council of Teachers of English (NCTE) (www.ncte.org) has a
set of reading standards that emerged in partnership with the International
Reading Association (IRA; www.reading.org) that are widely recognized. Most
states design their own content standards according to NCTE standards
(http://www.ode.state.or.us/tls/english/reading/). Table 2.1 lists the reading
content standards. As we see in Table 2.1, the content standards are broader
than educational objectives and seem to address cognitive ability rather than
specific knowledge and skills that are most often associated with instructional
objectives and teaching and learning.
TABLE 2.1
National Council of Teachers of English (NCTE) and International Reading
Association (IRA) Reading Content Standards
Students learn and effectively apply a variety of reading strategies for comprehending,
interpreting, and evaluating a wide range of texts including fiction, nonfiction, classic,
and contemporary works.
Contextual analysis: Recognize, pronounce, and know the meaning of words in text by
using phonics, language structure, contextual clues, and visual cues.
Phonetic analysis: Locate information and clarify meaning by skimming, scanning, close
reading, and other reading strategies.
Comprehension: Demonstrate literal comprehension of a variety of printed materials.
Inference: Demonstrate inferential comprehension of a variety of printed materials.
Evaluation: Demonstrate evaluative comprehension of a variety of printed materials.
Connections: Draw connections and explain relationships between reading selections
and other texts, experiences, issues, and events.
Table 2.2 provides examples of writing content standards from the State of
California. Like reading, the focus is not on knowledge that is seen as prerequi-
site to skill or abilities but more on abilities. The writing of essays seems to en-
tail many abilities, including writing, creative and critical thinking, and even
problem solving. Like reading, these content standards reflect the teaching and
learning of knowledge and skills and the application of knowledge and skills in
complex ways.
The National Council of Teachers of Mathematics (NCTM) and the Na-
tional Assessment of Educational Progress (NAEP) have similar mathematics
content dimensions, as shown in Table 2.3 (nces.ed.gov/nationsreportcard).
Each standard contains many clearly stated objectives with a heavy emphasis
on skill development and mathematical problem solving in a meaningful con-
text. Isolated knowledge and skills seem to have little place in modern concep-
tions of mathematics education.
The National Research Council (http://www.nap.edu) also has developed
content standards in response to a perceived need. The standards are volun-
tary guidelines emphasizing the learning of knowledge and skills students need
to make everyday life decisions and become productive citizens. The standards
TABLE 2.2
Draft Writing Content Standards From California
Writing Strategies
Students write words and brief sentences that are legible.
Students write clear and coherent sentences and paragraphs that elaborate a central
impression, using stages of the writing process.
Students write clear, coherent, and focused essays that exhibit formal introductions,
bodies of supporting evidence, and conclusions, using stages of the writing process.
Students write coherent and focused essays that convey a well-defined perspective
and tightly reasoned argument, using stages of the writing process.
Writing Applications
Students write texts that describe and explain objects, events, and experiences that
are familiar to them, demonstrating command of standard English and the drafting,
research and organizational strategies noted previously.
Students write narrative, expository, persuasive, and literary essays (of at least 500 to
700 words), demonstrating command of standard English and the drafting research
and organizational strategies noted previously.
Students combine rhetorical strategies (narration, exposition, argumentation,
description) to produce essays (of at least 1,500 words when appropriate),
demonstrating command of standard English and the drafting, research and
organizational strategies noted previously.
TABLE 2.3
Mathematics Content Standards
NCTM                             NAEP
Number and operations            Number sense, properties, and operations
Algebra                          Algebra and functions
Geometry                         Geometry and spatial sense
Data analysis and probability    Data analysis, statistics, and probability
Measurement                      Measurement
When the validation rests in part on the appropriateness of test content, the pro-
cedures followed in specifying and generating test content should be described
and justified in reference to the construct the test is intended to measure or the
domain it is intended to represent. If the definition of the content sampled incor-
porates criteria such as importance, frequency, or criticality, these criteria should
also be clearly explained and justified. (AERA et al., 1999, p. 18)
If the rationale for a test use or score interpretation depends on premises about
the psychological processes or cognitive operations used by examinees, then the-
oretical or empirical evidence in support of those premises should be provided.
When statements about the processes employed by the observers or scorers are
part of the argument for validity, similar information should be provided. (AERA
et al., 1999, p. 19)
Each test item is designed to measure a specific type of content and an in-
tended cognitive process. Although each student responds to a test item, no
one really knows the exact cognitive process used in making a choice in an MC
test or responding to a CR item. For any test item, the item may appear to
elicit higher level thinking, but in actuality the test
taker may be remembering identical statements or ideas presented before, per-
haps verbatim in the textbook or stated in class and carefully copied into the
student's notes.
Mislevy (1993) provided an example of a nuclear medicine physician who
at one point in his or her career might detect a patient's cancerous growth in a
computerized tomography (CT) scan using reasoning, but at a later time in
his or her career would simply view the scan and recall the patient's problem.
The idea is that an expert works from memory, whereas a novice has to em-
ploy more complex strategies in problem solving. In fact, the change from a
high cognitive demand to a lower cognitive demand for the same complex
task is a distinguishing characteristic between experts and novices. The ex-
pert simply uses a well-organized knowledge network to respond to a complex
problem, whereas the novice has to employ higher level thought processes to
arrive at the same answer. This is the ultimate dilemma with the measure-
ment of cognitive process with any test item. Although a consensus of con-
tent experts may agree that an item appears to measure one type of cognitive
process, it may measure an entirely different type of cognitive process simply
because the test taker has a different set of prior experiences than other test
takers. This may also explain our failure to isolate measures of different cogni-
tive processes, because test items intended to reflect different cognitive pro-
cesses are often just recall to a highly experienced test taker. Therefore, no
empirical or statistical technique will ever be completely satisfactory in ex-
posing subscales reflecting cognitive process. Whatever tests we develop will
only approximate what we think the test taker is thinking when answering
the test item.
Frisbie, Miranda, and Baker (1993) reported a study of tests written to re-
flect material in elementary social studies and science textbooks. Their find-
ings indicated that most items tested isolated facts. These findings are
confirmed in other recent studies (e.g., Stiggins, Griswold, & Wikelund, 1989).
That the content of achievement tests in the past has focused on mostly low-
level knowledge is a widely held belief in education and training that is also
supported by studies.
The legacy of behaviorism for achievement testing is a model that sums per-
formance of disassociated bits of knowledge and skills. Sometimes, this sum of
learning is associated with a domain, and the test is a representative sample
from that domain. The latter half of the 20th century emphasized domain defi-
nition and sampling methods that yielded domain interpretations. Clearly, the
objective of education was the aggregation of knowledge, and the achievement
test provided us with samples of knowledge and skills that were to be learned
from this larger domain of knowledge and skills.
Although recalling information may be a worthwhile educational objective,
current approaches to student learning and teaching require more complex
outcomes than recall (Messick, 1984; NCTM, 1989; Nickerson, 1989; Snow,
1989; Snow & Lohman, 1989; Sternberg, 1998; Stiggins et al., 1989). School
reformers call for learning in various subject matter disciplines to deal with
life's many challenges (What Works, 1985). Constructivists argue that all learn-
ing should be meaningful to each learner. Little doubt exists in this era of test
Conclusions
knowledge and skills. In the next part of this chapter, we see the supportive
role of knowledge and skills in developing cognitive abilities such as reading,
writing, speaking, listening, mathematical and scientific problem solving, crit-
ical thinking, and creative enterprises.
This book's orientation for MC item writing is in the context of two types of stu-
dent learning. These two types are interrelated. In fact the second type is sup-
ported by the first type.
This first type of student learning is any well-defined domain of knowledge
and skills. Writing has a large body of knowledge and skills that must be learned
before students learn to write. This includes knowledge of concepts, such as
the different modes of writing (narrative and persuasive, for example), and
writing skills, such as spelling, punctuation, and grammar. Mathematics has a
well-defined domain of knowledge and skills. The four operations applied to
whole numbers, fractions, and decimals alone define a large domain of skills in
mathematics.
The second type of learning is any construct-centered ability, for which a
complex CR test item seems most appropriate. This idea is briefly discussed in
chapter 1 and is expanded in this chapter. Some specific cognitive abilities of
concern are reading, writing, and mathematical problem solving.
The taxonomy presented here is an outgrowth of many proposals for classify-
ing items. Current learned societies and testing programs use a similar system of
classification. The Bloom taxonomy is very much linked to the taxonomy
offered here, but the taxonomy offered here is much simpler.
An organizing dimension is that learning and associated items all can be
classified into three categories: knowledge, skills, and abilities. Because these
are distinctly different categories, it is important to distinguish among them for
organizing instruction and testing the effects of instruction, which we call
achievement.
For the measurement of a domain of knowledge, the test specifications di-
rect a test designer to select items on the basis of content and cognitive process.
The knowledge category contains four content categories and two cognitive
process categories.
The skill category is simple, containing only two types: mental and physical.
Often, skills are grouped with knowledge because we can conveniently test for
cognitive skills using an MC format. Thus, it is convenient to think of a domain
of knowledge and skills as instructionally supportive, and tests are often
thought of as a representative sample of knowledge and skills from a larger do-
main of knowledge and skills.
Knowledge
There are many definitions of knowledge. One that seems to best fit this situa-
tion of designing items to measure achievement is: the body of truths accumu-
lated over time. We reveal a person's knowledge by asking questions or
prompting the person to talk and by listening and evaluating what they say.
Achievement testing allows us to infer knowledge through the use of the test.
But as pointed out in the first part of this chapter, knowing someone's cognition
seems to be a never-ending quest to understanding ourselves. Achievement
testing is limited in inferring true cognition.
We can conceive of knowledge in two dimensions: content and cognitive
process, as the title of this chapter implies. Test specifications commonly call
for all items to be so classified. The validity of the interpretation of a test score
rests on a plausible argument and validity evidence. Some of this evidence co-
mes from good test design that shows that items have been correctly classified
so that the test designer can choose items that conform to the test specifica-
tions. The classification system for knowledge has two dimensions that, not
surprisingly, are content and cognitive process.
As Table 2.4 shows, all knowledge can be identified as falling into one of
these eight categories. An important distinction is the process dimension. First,
it has been asserted by many critics of instruction, training, and testing that re-
TABLE 2.4
The recalling of knowledge requires that the test item ask the test taker to
reproduce or recognize some content exactly as it was presented in a class or
training or in reading. Somewhere in each student's instructional history,
the content must be recovered verbatim. The testing of recall is often asso-
ciated with trivial content. Indeed, trivial learning probably involves the
memory of things that don't need to be learned or could be looked up in
some reference.
The understanding of knowledge is a more complex cognitive process be-
cause it requires that knowledge being tested is presented in a novel way. This
cognitive process involves the paraphrasing of content or the providing of ex-
amples and nonexamples that have not been encountered in previous instruc-
tion, training, or reading.
This important distinction in cognitive processes is expanded in the next
section with examples coming from familiar instructional contexts.
For our purposes, we can condense all knowledge into four useful content
categories: facts, concepts, principles, and procedures. Each test item intended
to measure knowledge will elicit a student behavior that focuses on one of these
four types of content. Both cognitive processes, recalling and understanding,
can be applied to each type of content, as Table 2.4 shows.
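Because the taxonomy simply crosses the two dimensions, the eight categories can be enumerated mechanically, as the following illustrative sketch shows; it is an illustration of the cross-classification described in the text, not a reproduction of Table 2.4.

```python
from itertools import product

# Four content categories crossed with two cognitive processes yield
# the eight categories of knowledge described in the text.
content = ["fact", "concept", "principle", "procedure"]
process = ["recalling", "understanding"]

for cont, proc in product(content, process):
    print(f"{proc} a {cont}")  # e.g., "recalling a fact", "understanding a fact"
```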
TABLE 2.5
Student Learning of Facts
Which of the following numbers is a prime number?
A. 4
B. 5
C. 15
D. 16
The student is provided with four plausible answers. Of course, choosing the
correct answer depends on the plausibility of the other choices and luck, if the
student is guessing. The student has remembered that 5 is a prime number. To
understand why 5 is a prime number requires an understanding of the concept
of prime number.
TABLE 2.6
Student Learning of Concepts
Explain the concepts related to units of measure and show how to measure with
nonstandard units (e.g., paper clips) and standard metric and U.S. units (concepts are
inches, feet, yards, centimeters, meters, cups, gallons, liters, ounces, pounds, grams,
kilograms).
Identify two-dimensional shapes by attribute (concepts are square, circle, triangle,
rectangle, rhombus, parallelogram, pentagon, hexagon).
Define allusion, metaphor, simile, and onomatopoeia.
Identify the components of a personal narrative using your own words or ideas.
TABLE 2.7
Student Learning of Principles
Predict events, actions, and behaviors using prior knowledge or details to comprehend a
reading selection.
Evaluate written directions for sequence and completeness.
Determine cause-and-effect relationships.
Evaluate the reasonableness of results using a variety of mental computation and
estimation techniques.
Apply the correct strategy (estimating, approximating, rounding, exact calculation)
when solving a problem.
Draw conclusions from graphed data.
Predict an outcome in a probability experiment.
associated with a skill. But a skill is much more than simply a procedure. Be-
cause we observe a skill performed, does it make sense to think of knowledge of
a procedure? We might think that before one learns to perform a skill, the
learner needs to know what to do. Therefore, we can think of testing for knowl-
edge of procedures as a memory task or an understanding task that comes be-
fore actually performing the skill.
Mental procedures abound in different curricula. Adding numbers, find-
ing the square root of a number, and finding the mean for a set of numbers are
mental procedures. As with physical procedures, the focus here is asking a
student for knowledge of the procedure. How do you add numbers? How do
you find the square root of a number? How do you determine the mean for a
set of numbers? Table 2.8 provides some examples of student learning of
knowledge of procedures.
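As an illustration of the distinction, one of these mental procedures (finding the mean) is written out below as explicit steps; an item on knowledge of the procedure asks what the steps are, whereas performing them is the skill itself. The sketch is purely illustrative.

```python
# Knowledge of a procedure: the steps for finding the mean of a set of
# numbers, written out explicitly.

def mean(numbers):
    total = 0
    for n in numbers:        # Step 1: add the numbers together
        total += n
    count = len(numbers)     # Step 2: count how many numbers there are
    return total / count     # Step 3: divide the sum by the count

print(mean([4, 8, 15, 16]))  # 10.75
```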
Unlike a mental procedure, a physical procedure is directly observable. Ex-
amples are: cutting with scissors, sharpening a pencil, and putting a key in a lock.
TABLE 2.8
Student Learning of Procedures
Skill
TABLE 2.9
Student Learning of Cognitive Skills
Reading
Identify main characters in a short story.
Identify facts from nonfiction material.
Differentiate facts from opinions.
Writing
Spell high-frequency words correctly.
Capitalize sentence beginnings and proper nouns.
Preserve the author's perspective and voice in a summary of that author's work.
Mathematics
Add and subtract two- and three-digit whole numbers.
State the factors for a given whole number.
Sort numbers by their properties.
Ability
A prevailing theme in this book and in cognitive learning theory is the develop-
ment of cognitive abilities. Different psychologists use different names.
Lohman (1993) called them fluid abilities. Messick (1984) called them develop-
ing abilities. Sternberg (1998) called them learned abilities. Each of these terms
is effective in capturing the idea that these complex mental abilities can be de-
veloped over time and with practice. These cognitive abilities are well known
to us and constitute most of the school curricula: Reading, writing, speaking,
and listening constitute the language arts. Problem solving, critical thinking,
and creative thinking cut across virtually all curricula and are highly prized in
our society. In mathematics, the NCTM makes clear that problem solving is a
central concern in mathematics education.
Any cognitive ability is likely to rely on a body of knowledge and skills, but the demonstration of a cognitive ability involves a complex task that requires the student to use knowledge and skills in a unique combination to accomplish
a complex outcome. Psychologists and measurement specialists have resorted
to cognitive task analysis to uncover the network of knowledge and skills
needed to map out successful and unsuccessful performance in an ill-struc-
tured problem. This task analysis identifies the knowledge and skills needed to
be learned before completing each complex task. But more is needed. The stu-
dent needs to know how to select and combine knowledge and skills to arrive at
a solution to a problem or a conclusion to a task. Often, there is more than one
way to combine knowledge and skills for a desirable outcome.
Another aspect of cognitive abilities that Snow and Lohman (1989) believed to be important is the conative, or emotional, aspect of human cognitive behavior.
might consider abilities in this way. Test items are important ingredients in the
development of measures of abilities. Such tests can measure the growth of
these abilities on a developmental scale.
One of the most fundamental aspects of cognitive abilities and one that is
most recognizable to us is knowledge. Educational psychologists call this de-
clarative knowledge. As discussed in subsequent chapters, all testable knowl-
edge falls into one of these categories: facts, concepts, principles, or
procedures. The general assumption behind testing for knowledge is that it is
foundational to performing skills or more complex forms of behaviors. In the
analysis of any complex behavior, it is easy to see that we always need knowl-
edge. The most efficient way to test for knowledge is with the MC format.
Thus, MC formats have a decided advantage over CR formats for testing
knowledge. Chapter 3 discusses the rationale for this more completely and
provides much documentation and references to the extensive and growing
literature on this topic.
Chapters 5 and 6 provide many examples of MC items intended to reflect
important aspects of cognitive abilities. However, MC formats have limitations
with respect to testing cognitive abilities. Not all cognitive abilities lend them-
selves well to the MC format. Usually, the most appropriate measure of a cogni-
tive ability involves performance of a complex nature.
Knowledge is always fundamental to developing a skill or a cognitive ability.
Sometimes MC can be used to measure application of knowledge and skills in
the performance of a cognitive ability, but these uses are rare. If we task analyze
a complex task, we will likely identify knowledge and skills needed to complete
that task successfully.
Skills are also fundamental to a cognitive ability. By its nature, a skill is reflected in performance. Skills are often thought of as singular acts. Punctuation, spelling,
capitalization, and abbreviation are writing skills. Skills are critical aspects of
complex performances, such as found in critical thinking, creative thinking,
and problem solving. The most direct way to measure a skill is through a perfor-
mance test. But there are indirect ways to measure skills using MC that corre-
late highly with the direct way. Thus, we are inclined to use the indirect way
because it saves time and gives us good information. For example, we could give
a test of spelling knowledge or observe spelling in student writing. We need to
keep in mind the fundamental differences in interpretation between the two.
But if the two scores are highly correlated, we might use the MC version be-
cause it is usually easier to obtain and provides a more reliable test score. In a
high-stakes situation in life, such as life-threatening surgery, knowledge of a
surgical procedure is not a substitute for actual surgical skill, and both knowl-
edge and skills tests are not adequate measures of surgical ability. In low-stakes
settings, we might be willing to substitute the more efficient MC test of knowl-
edge for the less efficient performance test of skill because we know the two are
highly correlated. The risk of doing this is clear: Someone may know how to perform a skill but be unable to actually perform it.
TABLE 2.10
Student Learning of Cognitive Abilities
Reading
Analyze selections of fiction, nonfiction, and poetry.
Evaluate an instructional manual.
Compare and contrast historical and cultural perspectives of literary selections.
Writing
Create a narrative by drawing, telling, or emergent writing.
Write a personal experience narrative.
Write a report that conveys a point of view and develops a topic.
Mathematics
Predict and measure the likelihood of events and recognize that the results of an
experiment may not match predicted outcomes.
Draw inferences from charts and tables that summarize data from real-world
situations.
from the elbow to the tips of the fingers. This specialty involves tissue, bones, and nerves. The limits of the relevant knowledge and skills, and the range of problems encountered, can be identified with some precision. Unfortunately, not all abilities are so easy to define.
Summary
This chapter identifies and defines three types of student learning that are in-
terrelated and complementary: knowledge, skills, and cognitive ability. As you
can see, the defining, teaching, learning, and measuring of each is important in
many ways. However, the development of cognitive ability is viewed as the ulti-
mate purpose of education and training. Knowledge and skills play important
but supportive roles in the development of each cognitive ability. Knowledge
and skills should be viewed as enablers for performing more complex tasks that
we associate with these cognitive abilities.
3
Item Formats
OVERVIEW
One of the most fundamental steps in the design of any test is the choice of one
or more item formats to employ in a test. Because each item is intended to mea-
sure both content and a cognitive process that is called for in the test specifica-
tions, the choice of an item format has many implications and presents many
problems to the test developer.
This chapter presents a simple taxonomy of item formats that is connected
to knowledge, skills, and abilities that were featured in the previous chapter.
Claims and counterclaims have been made for and against the uses of various
formats, particularly the MC format. In the second part of this chapter, five va-
lidity arguments are presented that lead to recommendations for choosing an
item format.
A fundamental principle in the choice of an item format is that measuring
the content and the cognitive process should be your chief concern. The item
format that does the best job of representing content and the cognitive process
intended is most likely to be the best choice. However, other factors may come
into play that may cause you to choose another format. You should know what
these other factors are before you choose a particular format.
Every item format includes (a) something presented to the test taker, (b) some conditions governing the response, and (c) a scoring
procedure. This chapter attempts to help you sort out differences among item
formats and select an item format that best fits your needs. As you will see,
item formats distinguish themselves in terms of their anatomical structure as
well as the kind of student learning each can measure. Each format competes
with other formats in terms of criteria that you select. As you consider choos-
ing a format, you should determine whether the outcome involves knowl-
edge, skills, or abilities. Then you can evaluate the costs and benefits of
rivaling formats.
Another dimension of concern is the consequences of using a particular
item format (Frederiksen, 1984; Shepard, 2000). The choice of a single for-
mat may inadvertently elicit a limited range of student learning that is not
necessarily desired. Ideally, a variety of formats are recommended to take full
advantage of each format's capability for measuring different content and
cognitive processes.
The most fundamental distinction among item formats is whether the underly-
ing student learning that is being measured is abstract or concrete. This differ-
ence is discussed in chapter 1 and is expanded here in the context of item
formats. Table 3.1 provides a set of learning outcomes in reading, writing, and
mathematics in the elementary and secondary school curriculum that reflect
this fundamental distinction. The learning outcomes in the left-hand column
are abstractly defined, and the learning outcomes on the right-hand column
are concretely defined.
With abstractly defined learning, because we do not have clear-cut consen-
sus on what we are measuring, we rely on logic that requires us to infer from test
taker behavior a degree of learning. We rely on the judgments of trained SMEs
to help us measure an abstract construct. This is construct-centered measure-
ment. With operationally defined student learning, we have consensus about
what is being observed. Expert judgment is not needed. The student behavior is
either correct or incorrect. Table 3.1 provides a comparison of the essential dif-
ferences between abstractly defined and concretely defined learning outcomes
in terms of item formats. We use the term high inference to designate abstractly
defined student learning.
Because most school abilities are abstractly defined, high-inference item
formats are the logical choice. Some abstractly defined skills also match well to
high-inference formats. Any skill where judgment comes into play suggests the
use of the high-inference format.
Low-inference formats seem ideally suited to knowledge and most mental
and physical skills that can be concretely observed. In chapter 1, the term oper-
TABLE 3.1
High-Inference and Low-Inference Learning Outcomes
ing assessment may be group administered, but it is a more extended task that
takes more student time. Some low-inference skills have to be observed indi-
vidually, and this too takes much time.
TABLE 3.2
Attributes of High-Inference and Low-Inference Item Formats
Conclusion
All item formats can be classified according to whether the underlying objec-
tive of measurement is abstractly or operationally defined. The type of infer-
ence is the key. Measuring cognitive abilities usually requires a high-inference item format. Most knowledge and mental and physical skills can be observed using a low-inference item format. Some skills require subjective evaluation by trained judges and thereby fall into the high-inference category. For instance, a
basketball coach may evaluate a player's free throw shooting technique. The
coach's experience can be used as a basis for judging the shooting form. This is
high-inference observation. However, the actual shooting percentage of the
player is a low-inference observation.
tested. Of the eight offered, three are noncognitive (attitudes, anxiety, and
motivation) and only the eighth is psychometric in nature. Later in this chap-
ter, a section is devoted to this perspective, drawing from a research review by
Martinez (1998).
Bennett (1993), like Snow (1993) and many others, believed that the adop-
tion of the unified approach to validity has salience for the study of this prob-
lem. Bennett emphasized values and consequences of test score interpretations
and use. We have seldom applied these criteria in past studies of item format
differences.
In summary, the study of item format differences has continued over most of
this century. The earliest studies were focused on format differences using sim-
ple correlation methods to study equivalence of content measured via each for-
mat. As cognitive psychology evolved, our notion of validity sharpened to
consider context, values, consequences, and the noncognitive aspects of test
behavior. Improvements in methodology and the coming of the computer
made research more sophisticated. Nevertheless, the basic issue seems to have
remained the same. For a domain of knowledge and skills or for a cognitive abil-
ity, which format should we use? This section tries to capture the main issues of
this debate and provide some focus and direction for choosing item formats.
Each of the next six sections of this chapter draws from recent essays and research that shed light on the viability of MC formats in various validity contexts. The term argument is used here because in validation we assert a principle based on a plausible argument and collect evidence to build our case that a specific test score use or interpretation is valid. The validity of using the MC item format is examined in six contexts, building an argument that leads to a supportable conclusion about the role of MC formats for specific test interpretations and uses.
Generally, student grades in college or graduate school are predicted from ear-
lier achievement indicators such as previous grades or test scores. The
well-known ACT (American College Test) and SAT I (Scholastic Assessment Test I) are given to millions of high school students as part of the ritual for col-
lege admissions, and the Graduate Record Examination is widely administered
to add information to assist in graduate school admission decisions. The pre-
dictive argument is the simplest to conceptualize. We have a criterion (desig-
nated Y) and predictors (designated as Xs). The extent to which a single X or a
set of Xs correlates with Y determines the predictive validity coefficient. Unlike
other validity arguments, prediction is the most objective. If one item format
leads to test scores that provide better prediction, we find the answer to the
question of which item format is preferable.
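As a minimal sketch of this argument, assuming Python with numpy and wholly invented data, a predictive validity coefficient can be computed by regressing the criterion Y on the predictors X and correlating predicted with observed criterion scores (the multiple correlation). The predictor labels, sample size, and weights below are hypothetical.

import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))   # two hypothetical predictors, e.g., an admission test score and prior grades
Y = 0.5 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(size=n)   # simulated criterion, e.g., first-year grades

# Least-squares regression of Y on X, with an intercept column.
X_design = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(X_design, Y, rcond=None)
Y_hat = X_design @ beta

# Predictive validity coefficient: the correlation between predicted and observed Y.
R = np.corrcoef(Y_hat, Y)[0, 1]
print(round(R, 2))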
Downing and Norcini (1998) reviewed studies involving the predictive va-
lidity coefficients of CR and MC items for various criteria. Instead of using an
exhaustive approach, they selected research that exemplified this kind of re-
search. All studies reviewed favor MC over CR, except one in which the CR
test consisted of high-fidelity simulations of clinical problem solving in medi-
cine. The authors concluded that adding CR measures does little or nothing to improve prediction, even when a CR criterion resembles the CR predictor.
These authors concluded that although there may be many good reasons for us-
ing CR items in testing, there is no good reason to use CR items in situations
where prediction of a criterion is desired.
The challenge to researchers and test developers is to identify or develop
new item formats that tap important dimensions of student learning that in-
crease predictive coefficients. Whether these formats are MC or CR seems ir-
relevant to the need to increase prediction. Which formats improve predictive coefficients can only be determined empirically.
This validity argument concerns the interpretability of test scores when either
CR or MC formats are used. In other words, a certain interpretation is desired
based on some definition of a construct, such as writing or reading comprehen-
sion. The interpretation may involve a domain of knowledge and skills or a do-
main of ill-structured performance tasks that represent a developing, cognitive
ability, such as writing. This section draws mainly from a comprehensive, inte-
grative review and meta-analysis by Rodriguez (2002) on this problem. Simply stated, the issue is whether the choice of format matters for what is being measured.
If it does not matter, the MC format is desirable because it has many advan-
tages. Some of these are efficient administration, objective scoring, automated
scoring, and higher reliability. With knowledge and skills, MC items usually give
more content coverage of a body of knowledge or a range of cognitive skills, when
compared with short-answer items, essays, or other types of CR items.
The answer to the question about when to use MC is complicated by the fact
expressed in Table 3.3 that a variety of CR item formats exist. Martinez (1998)
pointed out that CR formats probably elicit a greater range of cognitive behav-
iors than the MC, an assertion that most of us would not challenge. Rodriguez
mentioned another complicating factor, which is that the MC and CR scales
underlying an assumed common construct may be curvilinearly related be-
cause of difficulty and reliability differences. The nature of differences in CR
and MC scores of the assumed same construct are not easy to ascertain. Marti-
nez provided a useful analysis of the cognition underlying test performance and
its implications for valid interpretations. Rodriguez's meta-analysis addressed many methodological issues in studying this problem that bear on future studies of item format differences. Among the issues facing test designers are the advice given by testing specialists and cognitive psychologists, cost considerations, and the politics of item format selection, which sometimes runs contrary to these other factors.
Dimensionality is a major issue with these studies. Whereas Martinez
(1998) warned us not to be seduced by strictly psychometric evidence, studies
reviewed by Thissen, Wainer, and Wang (1994) and Lukhele, Thissen, and
Wainer (1993) provided convincing evidence that in many circumstances CR
and MC items lead to virtually identical interpretations because unidimen-
sional findings follow factor analysis. Bennett, Rock, and Wang (1990) con-
cluded that "the evidence presented offers little support for the stereotype of
MC and free-response formats as measuring substantially different constructs
(i.e., trivial factual recognition vs. higher-order processes)" (p. 89). Some of
Martinez's earlier studies (e.g., Martinez, 1990, 1993) offered evidence that
different formats may yield different types of student learning. However, when
content is intended to be similar, MC and CR item scores tend to be highly re-
lated, as Rodriguez's review shows.
A final point was made by Wainer and Thissen (1993) in their review of this
problem, from their study of advanced placement tests where CR and MC
items were used: Measuring a construct not as accurately but more reliably is
much better than measuring the construct more accurately but less reliably. In
other words, an MC test might serve as a reliable proxy for the fundamentally
better but less reliable CR test. Their advice applies to the third finding by Ro-
driguez (2002) in Table 3.3, where content is not equivalent, but MC may be a
better choice simply because it approximates the higher fidelity CR that may
have a lower reliability.
TABLE 3.3
General Findings About Multiple-Choice (MC) and Constructed-Response (CR)
Item Formats in Construct Equivalence Settings
Earlier, it was stated that the issue between CR and MC resides with knowledge and skills. The conclusion was that when the object of measurement is a well-defined domain of knowledge and skills, the choice is inescapably MC. The alternative, the essay format, has too many shortcomings.
This third validity argument examines the issue of the viability of MC for
measuring an ability. Mislevy (1996a) characterized a criterion as:
If we can define this constellation of complex tasks, the complexity of any crite-
rion challenges us to design test items that tap the essence of the criterion. At
the same time, we need some efficiency and we need to ensure high reliability of
the test scores. To facilitate the study of the problem of criterion measurement,
two ideas are introduced and defined: fidelity and proximity.
TABLE 3.4
A Continuum of Indirect Measures of a Criterion for Medical Competence
Supervised patient practice has high fidelity to actual patient practice but
falls short of being exactly like actual practice. As noted previously in this
chapter and by many others (Linn, Baker, & Dunbar, 1991), this high-fidelity
measure may suffer from many technical and logistical limitations. Such mea-
surement can be incredibly expensive and rest almost totally on the expertise
of trained judges. An alternative to live patient examination is a standardized
patient, where there is an actor who is trained to play the role of a patient with
a prespecified disorder, condition, or illness. The cognitive aspects of patient
treatment are simulated, but actual patient treatment is not done. Scoring of
such complex behavior is only experimental and is in development at this
time. Thus, this is not yet a viable testing format. An alternative with less fi-
delity is the patient management problem (PMP). These paper-and-pencil
problems have been computerized, but success with these has been disap-
pointing, and active projects promoting their use have all but disappeared.
Scenario-based MC item sets are popular (Haladyna, 1992a). Although item
sets provide less fidelity than other MC formats just described, they have ma-
jor advantages. Scoring can be simple and highly efficient. But some problems
exist with item sets that warrant caution. Namely, responses are locally de-
pendent. Thus, the coefficient of reliability for a test containing item sets is
likely to be inflated. The attractive aspect of this item format is efficiency
over the other higher fidelity options. The testing approach that has the least
fidelity involves conventional MC items that reflect knowledge related to the
definition of competence. The test specifications may require recall or under-
standing of knowledge. Candidates must choose an answer from a list, and
usually the choice reflects nothing more than knowledge in the profession.
This option has the lowest fidelity although currently it dominates certifica-
tion and licensing testing.
sionality of various item formats that putatively measured writing skills. She
concluded that MC item formats that measure writing skills were effective.
These formats included select the correction, find and correct the error, and
find the error. These item formats are illustrated in chapter 6. This is the ben-
efit of using an effective but lower fidelity item format in place of a higher fi-
delity item format for measuring the same thing: writing skills.
Haladyna (1998) reported the results of a review of studies of criterion mea-
surement involving CR and MC items. Conclusions from that study are pre-
sented in Table 3.5.
The arguments presented thus far are psychometric in nature and argue that
higher fidelity testing is desirable but sometimes proximate measures, such as
the indirect MC test of writing skills, may be used. However, Heck and Crislip
(2001) argued that lack of attention to issues of equity may undermine the ben-
efits of using higher fidelity measures. They provided a comprehensive study of
direct and less direct measures of writing in a single state, Hawaii. They con-
TABLE 3.5
Conclusions About Criterion Measurement
cluded that the higher fidelity writing assessment not only assessed a more di-
verse range of cognitive behavior but was less susceptible to external influences
that contaminate test score interpretations. Also, direct writing assessments are more in line with schools' attempts to reform curriculum and teach writing directly instead of emphasizing the writing skills that appear on standardized MC achievement tests. This research makes a strong statement in favor of CR test-
ing for measuring writing ability.
Differences in performance between boys and girls have often been noted in reading, writing, and mathematics. Are these differences real or the by-product
of a particular item format? Does item format introduce construct-irrelevant
variance into test scores, thereby distorting our interpretation of achievement?
Part of the argument against MC has been a body of research pointing to
possible interaction of gender with item formats. Ryan and DeMark (2002) re-
cently integrated and evaluated this research, and this section draws princi-
pally from their observations and conclusions, as well as from other excellent
studies (Beller & Gafni, 2000; DeMars, 1998; Garner & Engelhard, 2001;
Hamilton, 1999; Wightman, 1998). Ryan and DeMark approached the prob-
lem using meta-analysis of 14 studies and 178 effects. They reached the follow-
ing conclusion:
Females generally perform better than males on the language measures, regard-
less of assessment format; and males generally perform better than females on
the mathematics measures, also regardless of format. All of the differences, how-
ever, are quite small in an absolute sense. These results suggest that there is little
or no format effect and no format-by-subject interaction. (p. 14)
Thus, their results speak clearly about the existence of small differences be-
tween boys and girls that may be real and not a function of item formats. Ryan
and DeMark (2002) offered a validity framework for future studies of item for-
mat that should be useful in parsing the results of past and future studies on CR
and MC item formats. Table 3.6 captures four categories of research that they
believed can be used to classify all research of this type.
The first category is justified for abilities where the use of CR formats is obvi-
ous. In writing, for example, the use of MC to measure writing ability seems nonsensical, even though MC test scores might predict writing performance. The argument we use here to justify CR is fidelity to criteria.
The second category is a subtle one, where writing ability is interwoven with the ability being measured. This situation may be widespread and includes many fields and disciplines where writing is used to advance arguments, state propo-
TABLE 3.6
A Taxonomy of Types of Research on Gender-by-Item Format

Criterion-related CR: A CR format is intended for measuring something that is appropriate, that is, high fidelity, such as a writing prompt for writing ability.

Verbal ability is part of the ability being measured: In these CR tests, verbal ability is required in performance and is considered vital to the ability being measured. An example is advanced placement history, where students read a historical document and write about it.

Verbal ability is correlated to the construct but not part of it: CR tests of knowledge might call for recall or recognition of facts, concepts, principles, or procedures, and writing ability might influence this measurement. This is to be avoided.

Verbal ability is uncorrelated to the construct being measured: In many types of test performance in mathematics and in science, verbal ability may not play an important role in CR test performance.
men and women but attributed the higher scoring by men to more knowledge
of history, whereas the scores for men and women on CR were about the same.
Much attention in this study was drawn to potential biases in scoring CR writ-
ing. Modern high-quality research such as this study reveals a deeper under-
standing of the problem and the types of inferences drawn from test data
involving gender differences. In another recent study, Wightman (1998) ex-
amined the consequential aspects of differences in test scores. She found no
bias due to format effects on a law school admission test. A study by DeMars
(1998) of students in a statewide assessment revealed little difference in per-
formance despite format type. Although Format X Gender interactions were
statistically significant, the practical significance of the differences was small.
DeMars also presented evidence suggesting the MC and CR items measured
the same or nearly the same constructs. Beller and Gafni (2000) approached
this problem using the International Assessment of Educational Progress in-
volving students from several countries. Gender X Format interactions in the two assessments (1988 and 1991) appear to have reversed gender effects. On closer analysis, they discovered that the difficulty of the CR items interacted with gender to produce differential results. Garner and Engelhard
(2001) also found an interaction between format and gender in mathematics
for some items, pointing out the importance of validity studies of DIF, a topic further discussed in chapter 10. Hamilton (1999) found one CR item that displayed DIF. She found that gender differences were accentuated for items re-
quiring visualization and knowledge acquired outside of school. This
research and the research review by Ryan and DeMark (2002) do not put to
rest the suspicion about the influence of item format on performances by gen-
der. But if effects do exist, they seem to be small. Research should continue to
uncover sources of bias, if they exist. The most important outcome of their
study is the evolution of the taxonomy of types of studies. As stated repeat-
edly in this chapter, knowing more about the construct being measured has
everything to do with choosing the correct item format. Gallagher, Levin, and Cahalan (2002), in their study of gender differences on a graduate admissions test, concluded that performance seems to be based on such features of test items as problem setting, multiple pathways to a correct answer, and spatially based shortcuts to the solution. Their experimentation with features of item formats leads the way in designing items that accommodate gender differences attributable to construct-irrelevant factors, which need to be removed during test item design.
As you can see from this recent review of research, the gender-by-format is-
sue is by no means resolved. It remains a viable area for future research, but one
that will require more sophisticated methods and a better understanding of
cognitive processes involved in selecting answers. Perhaps a more important
connection for the gender-by-format validity argument is posed in the next sec-
tion on cognitive demand.
formance at the item level may be attributed to test-taking skills and stu-
dents' prior knowledge. Campbell (2000) devised an experiment and
think-aloud procedures to study the cognitive demands of stem-equiva-
lent reading comprehension items. He found differences favoring the CR
format over the MC format but saw a need for both formats to be used in a
reading comprehension test. Skakun and Maguire (2000) used
think-aloud procedures with medical school students in an effort to un-
cover the cognitive processes arising from the use of different formats.
They found that students used MC options as provisional explanations
that they sought to prove or disprove. With CR items, no such provisional
explanations were available and students had to generate their own provi-
sional explanations. However, they also found that the cognitive process-
ing was more complex than simply being a function of the format used.
With items of varying quality, whether MC or CR, the cognitive demand
varied. When items are well written and demand declarative knowledge, it
does not matter which format is used. When items are poorly written, MC
and CR may have different cognitive demands. Katz, Bennett, and Berger
(2000) studied the premise that the cognitive demands of stem-equivalent
MC and CR items might differ for a set of 10 mathematics items from the
SAT. As with other studies like this one, students were asked to think
aloud about their solution strategies. These authors concluded:
The psychometric literature claims that solution strategy mediates effects of
format on difficulty. The results of the current study counter this view: on some
items, format affected difficulty but not strategy; other items showed the reverse
effect. (Katz et al., 2000, p. 53)
Katz et al. also concluded that reading comprehension mediates format
effects for both strategy and difficulty. Thus, reading comprehension may be
a stronger overarching influence on item performance than item format. In
chapter 1, the discussion centered on construct definition and the need to
theorize about the relation of test behavior to the abstract construct defini-
tion. Indeed, if we can give more thought to the nature of cognition in test-
ing, test items might improve.
5. Response-elimination strategies may contribute to construct-irrelevant
variation in MC testing. This criticism is aimed at faults in the item-writing
process. Good item writers follow guidelines, such as developed by
Haladyna, Downing, and Rodriguez (2002). Most formal testing programs
do not have flaws in items that allow examinees the opportunity to eliminate
options and increase the chances of guessing the right answer. Ironically, re-
cent research points to the fact that most MC items have only two or three
working options (Haladyna & Downing, 1993). Kazemi (2002) reported re-
search involving 90 fourth graders who were interviewed after responding to
MC items in mathematics. These students tended to evaluate the choices
A persistent claim has been that the use of a particular item format has an effect
on student learning and test preparation. Frederiksen (1984) and later
Shepard (2002) should be credited for advancing this idea. There has been in-
creasing support for the idea that the overuse or exclusive use of one kind of for-
mat might corrupt student learning and its measurement. Heck and Crislip
(2001) examined this premise with a large, representative sample of
third-grade students in writing. Although girls outperformed boys in CR and
MC measures, the CR measures showed fewer differences for format compari-
sons. If students or other examinees are given a choice, they are more likely to
choose an MC format over the CR format (Bennett et al., 1999).
One benefit of this concern for the influence that an item format may have
on learning is the AERA (2000) guide for high-stakes testing that encourages
test preparation to include practice on a variety of formats rather than simply
those used in a criterion test. Such test preparation and the appropriate use of a
variety of item formats may be a good remedy to remove this threat to validity.
The choice of an item format mainly depends on the kind of learning outcome
you want to measure. As Beller and Gafni (2000) concluded, "It is believed that
the first priority should be given to what is measured rather than how it is mea-
sured" (p. 18). In other words, our focus should be content and cognitive pro-
cess. In chapter 2, it was established that student learning can include
knowledge, mental or physical skills, and cognitive abilities. Knowledge can be
recalled, understood, or applied. Given that content and cognitive processes will direct us in the choice of an item format first and foremost, what have we learned about item formats that will help us choose the most appropriate format for student learning?
SUMMARY
This chapter has provided information about item formats to measure knowl-
edge, skills, and abilities. An important distinction was made between abstractly defined and concretely defined student learning. Each type of learning requires a different type of item format. The former is subjectively
scored by a content expert; the latter is objectively scored by a trained observer.
Five validity arguments were used as a basis for choosing the appropriate for-
mat. At the end of this chapter, three recommendations were offered regarding
the choice of a format.
II
Developing MC Test Items
Thorndike (1967) noted that constructing good test items is probably the most
demanding type of creative writing imaginable. Not only must the item writer understand the content measured by the item, but he or she must also determine whether the cognitive demand will involve recall, understanding, or application. Original-
ity and clarity are key features of well-written test items. The set of four chap-
ters in part II of this book is comprehensive with respect to writing MC items.
Chapter 4 presents and illustrates many MC formats and discusses some impor-
tant issues related to using these formats. Chapter 5 presents a validated list of
guidelines to follow when writing MC items. These guidelines derive from past
and current research (e.g., Haladyna et al., 2002). Chapter 5 contains many ex-
amples of MC items, most of which violate item-writing guidelines. Chapter 6
provides examples of test items taken from various sources. These items are ex-
emplary because of their innovative format, content measured, mental pro-
cesses represented, or some other feature. The purpose of chapter 6 is to give
you a broad sampling of the effectiveness of the MC format for many types of
content and cognitive processes, including the kind of thinking associated with
the measurement of abilities. Chapter 7 is devoted to item generation. This chapter provides both older and newer ideas about how to rapidly prepare many items for different types of content and mental processes.
4
MC Formats
OVERVIEW
In this chapter eight MC formats are presented. Examples are given for each
format. Claims are made about the types of content and cognitive processes
that each format can elicit. This chapter shows the versatility of the MC format
for measuring the recall or understanding of knowledge, some cognitive skills,
and many types of complex mental behavior that we associate with abilities.
Two main contexts apply to MC formats. The first is classroom testing, where the objective of an MC test is to obtain a measure of student learning efficiently. This measure is helpful when a teacher assigns a grade at the end of the grading period, and it has value to teachers and students for giving students feedback and assistance in future learning or for reteaching and relearning content that has not been learned. The second context is a
large-scale testing program. The purposes of this large-scale testing program
might be graduation, promotion, certification, licensure, evaluation, place-
ment, or admission. In this second context, MC is chosen because it is effi-
cient and provides a useful summary of student learning of knowledge and
cognitive skills.
MC ITEM FORMATS
Conventional MC
Question Format
Who is John Galt? stem
A. A rock star foil or distractor
B. A movie actor foil or distractor
C. A character in a book correct choice
Best Answer
Which is the most effective safety feature in your car?
A. Seat belt
B. Front air bag
C. Anti-lock braking system
Stem. The stem is the stimulus for the response. The stem should provide
a complete idea of the knowledge to be indicated in selecting the right answer.
The first item in Example 4.1 shows the question format. The second item
shows the incomplete stem (a partial sentence) format. The third item shows
the best answer format.
Correct Choice. The correct choice is undeniably the one and only right
answer. In the question format, the correct choice can be a word, phrase, or
sentence. In some rare circumstances, it can be a paragraph or even a drawing
or photograph (if the distractors are also paragraphs, drawings, or photo-
graphs). However, the use of paragraphs, drawings, photographs, and the like makes the administration of the item inefficient. With the incomplete stem, the
second part of the sentence is the option, and one of these is the right answer.
With the best-answer format, all the options are correct, but only one is un-
arguably the best.
Distractors. Distractors are the most difficult part of the test item to
write. A distractor is an unquestionably wrong answer. Each distractor must
be plausible to test takers who have not yet learned the knowledge or skill
that the test item is supposed to measure. To those who possess the knowledge
asked for in the item, the distractors are clearly wrong choices. Each
distractor should resemble the correct choice in grammatical form, style, and
length. Subtle or blatant clues that give away the correct choice should al-
ways be avoided.
The number of distractors required for the conventional MC item is a mat-
ter of some controversy (Haladyna & Downing, 1993). When analyzing a vari-
ety of tests, Haladyna and Downing (1993) found that most items had only one
or two "working" distractors. They concluded that three options (a right an-
swer and two distractors) was natural. Few items had three working distractors.
In chapter 5, this issue is revisited. In this book, most of the examples contain
three options because both theory and research suggest that for conventional
MC three options works well.
read. Such items also require more time to administer and reduce the time
spent productively answering other items. For these many reasons, the use of
internal or beginning blanks in completion-type items should be avoided. Ex-
ample 4.2 shows the blankety-blank item format.
A. aggressive; structural
B. emotional; psychological
C. structural; emotional
A B C D E F G H I
items. The matching format begins with a set of options at the top followed by a
set of stems below. The instructions that precede the options and stems tell the
test taker how to respond and where to mark answers. As shown in Example
4.4, we have five options and six statements. We could easily expand the list of
six statements into a longer list, which makes the set of items more comprehen-
sive in testing student learning. In a survey of current measurement textbooks,
Haladyna et al. (2002) discovered that every measurement textbook they sur-
veyed recommended the matching format. It is interesting that there is no cited
A. Minnesota
B. Illinois
C. Wisconsin
D. Nebraska
E. Iowa
Among the few limitations of this format are the following tendencies:
1. Write as many items as there are options, so that the test takers match
up item stems to options. For instance, we might have five items and five op-
tions. This item design invites cuing of answers. Making the number of op-
tions unequal to the number of item stems can avoid this problem.
2. Mix the content of options, for instance, have several choices be
people and several choices be places. The problem is nonhomogeneous op-
tions. This can be solved by ensuring that the options are part of a set of
things, such as all people or all places. In Example 4.4, the options are all
states.
Extended Matching
Theme
Options: A, B, C ...
Lead-in Statement
Stems: 1, 2, 3, ...
The lead-in statement might be a scenario or a vignette. This puts the prob-
lem in a real-life context. Finally, the set of stems should be independently an-
swered. Each set of EM items should have this lead-in statement. Otherwise,
the test taker may find the set of items ambiguous. The set of items must have at
least two stems. Case and Swanson (1993, 1998) support the use of this format because it is easy to use and generates a large number of items that test for understanding of knowledge and cognitive skills. In fact, they showed how some
item sets can involve vignettes or scenarios that suggest higher levels of test be-
havior that we might associate with an ability. Their example reflects medical
problem solving.
Example 4.6 presents an EM set from the Royal College of Psychiatry in the
United Kingdom. The EM format is highly recommended for many good reasons.
good distractors; thus, guessing a right answer is more likely with conven-
tional MC. Case and her colleagues have researched the EM format with
favorable results (Case & Swanson, 1993; Case, Swanson, & Ripkey,
1994).
This format is widely used in medicine and related fields in both the United
States and the United Kingdom. It has much potential for classroom and
large-scale assessments. As chapter 6 shows, this format has versatility for a va-
riety of situations. An excellent instructional source for this format can be
found in Case and Swanson (2001), also available on the web at http://
www.nbme.org/.
Alternate Choice
A. Intermittent praise
B. Consistent praise
Although the AC item may not directly measure writing skills, Example 4.8
shows the potential for the AC format to approximate the measurement of a
writing skill.
Although AC is a downsized version of conventional MC, it is not a
true-false (TF) item. AC offers a comparison between two choices, whereas
the TF format does not provide an explicit comparison among choices. With
the TF format, the test taker must mentally create the counterexample and
choose accordingly.
The AC has several attractive characteristics and some limitations:
1. The most obvious advantage is that the AC item is easy to write. The item writer only has to think of a right answer and one plausible distractor.
2. The efficiency of the use of this format with respect to printing costs,
ease of test construction, and test administration is high.
3. Another advantage is that if the item has only two options, one can as-
sign more AC items to a test per testing period than with conventional MC
items. Consequently, the AC format provides better coverage of the content
domain.
4. AC items are not limited to recall but can be used to measure under-
standing, some cognitive skills, and even some aspects of abilities (Ebel,
1982).
5. Ebel (1981, 1982) argued that AC is more reliable than MC because more AC items can be asked in a fixed time. Because test length is functionally related to reliability, using valid AC items makes sense (see the Spearman-Brown sketch following this list). Research on AC items supports Ebel's contention (Burmester & Olson, 1966; Ebel, 1981, 1982; Ebel & Williams, 1957; Hancock, Thiede, & Sax, 1992; Maihoff & Mehrens, 1985; Sax & Reiter, n.d.). Also, AC items have a history of exhibiting satisfactory discrimination (Ruch & Charles, 1928; Ruch & Stoddard, 1925; Williams & Ebel, 1957).
6. Lord (1977) suggested another advantage: A two-option format is
probably most effective for high-achieving students because of their ten-
dency to eliminate other options as implausible distractors. Levine and
Drasgow (1982) and Haladyna and Downing (1993) provided further sup-
port for such an idea. When analyzing several standardized tests, they
found that most items contained only one or two plausible distractors.
Many of these items could have been easily simplified to the AC format. If
this is true, two options should not only be sufficient in many testing situa-
tions but also a natural consequence when useless distractors are removed
from an item containing four or five options.
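The reliability advantage claimed for longer AC tests in point 5 can be illustrated with the Spearman-Brown prophecy formula. The sketch below is in Python; the starting reliability of .75 and the length factor of 1.5 are illustrative assumptions, not values reported in the studies cited above.

def spearman_brown(reliability, length_factor):
    # Projected reliability when test length is multiplied by length_factor.
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# If a conventional MC test has reliability .75 and the same testing time
# permits 1.5 times as many AC items of comparable quality:
print(round(spearman_brown(0.75, 1.5), 2))   # about .82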
Place an "X" beneath each structure for which each statement is true.
These items occupy a small space but provide a complete analysis of plant
anatomy.
However, there are subtle and serious problems with the TF format. For
example, Peterson and Peterson (1976) investigated the error patterns of
positively and negatively worded TF questions that were either true or false.
Errors were not evenly distributed among the four possible types of TF
items. Although this research is not damning, it does warn item writers that
the difficulty of the item can be controlled by its design. Hsu (1980) pointed
out a characteristic of TF items when they are presented as a group using the
generic stem in Example 4.11. Such a format is likely to interact with the
ability of the group being tested in a complex way. Both the design of the
item and the format for presentation are likely to cause differential results.
Ebel (1978), a proponent of TF items, was opposed to the grouping of items
in this manner.
Grosse and Wright (1985) described a more serious threat to the usefulness
of TF. They argued that TF has a large error component due to guessing, a find-
ing that other research supports (Frisbie, 1973; Haladyna & Downing, 1989b;
Oosterhof & Glasnapp, 1974). Grosse and Wright claimed that if a test taker's
response style favors true instead of false answers in the face of ignorance, the
We can refute some of these criticisms. The reputation for testing trivial
content is probably deserved, but only because item writers write items measur-
ing trivial content. This practice is not a product of the item format. Trivial
content can be tested with any format. The more important issue is: Can TF items be written to measure nontrivial content? A reading of the chapter on TF testing in the book by Ebel and Frisbie (1991) provided an unequivocal yes to
this question. The issue of testing for understanding instead of recall is also an-
swered by better item-writing techniques. As with AC, guessing is not much of
a factor in TF tests, for the same reasons offered in the previous section. If one
keeps in mind that the floor of the scale for a TF test is 50% and the ceiling is 100%, our interpretations can be made in that light. Exceeding 60% on these tests is difficult for a random guesser when the test length is substantial, say 50 or 100 items. This is the same argument that applies to AC.
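That claim about guessing can be checked with the binomial distribution. The following minimal sketch in Python, offered as an illustration rather than a computation from the text, gives the probability that a pure guesser (chance of .5 per item) scores above 60% on TF tests of 50 and 100 items.

from math import comb

def prob_guesser_exceeds(n_items, cutoff=0.60, p=0.5):
    # Probability of scoring strictly above cutoff * n_items by guessing alone.
    threshold = int(cutoff * n_items)
    return sum(comb(n_items, k) * p**k * (1 - p)**(n_items - k)
               for k in range(threshold + 1, n_items + 1))

for n in (50, 100):
    print(n, round(prob_guesser_exceeds(n), 3))
# Roughly .06 for a 50-item test and about .02 for a 100-item test, consistent
# with the point that a random guesser rarely exceeds 60% on a long TF test.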
Given its widespread support from textbook writers, TF is recommended for
classroom assessment. For large-scale assessments, we have other formats de-
scribed in this chapter that are more useful and have less negative research.
Complex MC
This item format offers test takers three choices regrouped into four options,
as shown in Example 4.12. The Educational Testing Service first introduced
this format, and the National Board of Medical Examiners later adopted it for
use in medical testing (Hubbard, 1978). Because many items used in medical
and health professions testing programs had more than one right answer,
complex MC permits the use of one or more correct options in a single item.
Because each item is scored either right or wrong, it seems sensible to set out
combinations of right and wrong answers in an MC format where only one
choice is correct.
1. Mel Gibson
2. Danny Glover
3. Vin Diesel
A. 1 and 2
B. 2 and 3
C. 1 and 3
D. 1, 2, and 3
MTF
The Lion, the Witch, and the Wardrobe by C. S. Lewis can best be
summarized by saying:
highly correlated with complex measures of competence than MTF. They concluded that MTF in this study seemed to reflect more basic knowledge.
Videos Rented
1. The video store makes more money from VHS than from DVD.
2. DVDs and VHSs are more expensive on the weekdays.
3. The video store sells more DVDs in a week than VHSs.
4. DVDs are more expensive than VHSs.
5. The video store rents more videos on Friday, Saturday, and Sunday than on the weekdays.
6. Customers rent about the same number of DVDs and VHSs on the weekdays.
7. The video store rents more VHSs than DVDs on the weekends.
The MTF format is an effective substitute for the complex MC. Because the
MTF has inherently good characteristics for testing knowledge, it should be
more widely used.
items of any MC format. Creativity is much needed in shaping the item set.
Terms used to describe item sets include interpretive exercises, scenarios, vignettes,
item bundles, problem sets, super items, and testlets.
Although this format has a long history, it is only recently becoming more
popular. One reason is the need to create items that measure higher level
thinking. Another reason is that scoring methods have improved. The item
set seems well suited to testing cognitive abilities or aspects of an ability in-
volving complex thinking, such as is found in problem solving or critical
thinking.
Little research has been reported on the item set (Haladyna, 1992a, 1992b).
Although this format appears in many standardized achievement tests and
some professional licensing and certification examinations, the scientific basis
for its design and scoring is neonatal. In 1997, the National Council on Architectural Registration Boards adopted vignettes for its Building Design Test. One study, by Case, Swanson, and Becker (1996), addressed the relative difficulty and discrimination of medical licensing test items. They contrasted items with no stimulus material and items with short and longer scenarios (vignettes). Although they found little or no difference in discrimination across two studies, long vignette items tended to be slightly
more difficult. The researchers concluded that vignette-based item sets will
continue to be used due to the higher cognitive demands that their format elic-
its. Another factor supporting the use of vignette-based items is acceptance by
candidates that these test items have a greater fidelity with the implicit crite-
rion of medical competence.
Wainer and Kiely (1987) and Thissen, Steinberg, and Mooney (1989) in-
troduced and described testlets as bundles of items with a variety of scorable
predetermined paths for responding. This is a more complex idea than pre-
sented here, but the technical issues addressed by these authors offer some
guidance in future research on item sets. Like the MTF format, context ef-
fects or interitem dependence is a threat. In fact, the MTF format is a type of
item set. If items are interdependent, the discriminative ability of the items,
and the reliability of scores, will be diminished (Sireci, Thissen, & Wainer,
1991). Wainer and Kiely (1987) explored methods for scoring these item bun-
dles, as applied to computerized adaptive testing, but these methods can apply
to conventional fixed-length testing. They also explored hierarchical testlets.
We can overcome the problem of dependency if we score item sets as mini-
tests (testlets; Rosenbaum, 1988). Thissen and Wainer (2001) have several
chapters that address scoring testlets.
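A common way to act on that advice is to sum the items within each set and treat the testlet scores as the units of analysis when estimating internal consistency. The following is a minimal sketch, assuming Python with numpy; the response matrix, number of examinees, and testlet groupings are hypothetical, and coefficient alpha is used only as a convenient reliability estimate.

import numpy as np

def coefficient_alpha(parts):
    # Cronbach's alpha computed over the columns of `parts` (examinees x parts).
    k = parts.shape[1]
    part_variances = parts.var(axis=0, ddof=1).sum()
    total_variance = parts.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - part_variances / total_variance)

rng = np.random.default_rng(1)
responses = rng.integers(0, 2, size=(300, 12))            # 300 examinees, 12 scored items
testlets = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]   # three hypothetical 4-item sets

# Score each item set as a minitest (testlet) and estimate alpha over testlet
# scores rather than over the locally dependent individual items.
testlet_scores = np.column_stack([responses[:, idx].sum(axis=1) for idx in testlets])
print(round(coefficient_alpha(responses), 2))
print(round(coefficient_alpha(testlet_scores), 2))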
Several types of item sets are featured here, each intended for a certain type
of cognitive activity: (a) reading comprehension, (b) problem solving, (c) pic-
torial, and (d) interlinear. Each type is briefly discussed to provide the essence
of what type of content and cognitive process is being measured. Examples of
each type are presented.
"The radiance was that of full, setting, and blood-red moon, which
now shone vividly through that once barely discernible fissure of
which I have before spoken as extending from the roof of the
building, in a zigzag direction, to the base. While I gazed this
fissure rapidly widened—there came a fierce breath of the whirl-
wind—the entire orb of the satellite burst at once upon my sight—
my brain reeled as I saw the mighty walls rushing asunder—there
was a long, tumultuous shouting sound like the voice of a
thousand waters—and the deep and dank tarn at my feet closed
sullenly and silently over the fragments of the House of Usher."
1. What is Poe referring to when he speaks of "the entire orb of
the satellite" ?
A. The sun
B. The moon
C. His eye
2. What is a "tarn"?
A. A small pool
B. A bridge
C. A marsh
3. How did the house fall?
A. It cracked into two pieces.
B. It blew up.
C. It just crumpled.
4. How did the speaker feel as he witnessed the fall of the House
of Usher?
A. Afraid
B. Awestruck
C. Pleased
5. What does the speaker mean when he said "his brain
reeled?"
A. He collected his thoughts.
B. He felt dizzy.
C. He was astounded.
items to a page. Therefore, the two-page item set might contain as many as 10
to 12 items, allowing for a brief introductory passage on the first page. Read-
ing comprehension item sets are common in standardized tests. One page is
devoted to a narrative or descriptive passage or even a short story, and the op-
posing page is devoted to MC items measuring understanding of the passage.
The items might take a generic form and an item set structure is established.
Some items might systematically ask for the meaning of words, phrases, or the
entire passage. Some items might ask for prediction (e.g., what should happen
next?). Other items might analyze characters or plot. Once the set of items
is drafted and used, it can be reapplied to other passages, making the testing
of comprehension easy. Chapter 7 presents a set of reading comprehension
item shells.
Katz and Lautenschlager (1999) experimented with passage and no-pas-
sage versions of a reading comprehension test. Based on their results, they
stated that students with outside knowledge could answer some items with-
out referring to the passage. This research and earlier research they cite shed
light on the intricacies of writing and validating items for reading compre-
hension. They concluded that a science for writing reading comprehension
items does not yet exist. We can do a better job of validating items by doing a
better analysis of field test data and more experimentation with the no pas-
sage condition.
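As a rough illustration of such a field-test analysis (the data, item labels, and the 0.10 threshold below are hypothetical, not from Katz and Lautenschlager), one might compare each item's proportion correct with and without the passage:

# Hypothetical field-test analysis: compare proportion correct (p-value)
# for each item when the passage is present versus withheld.

# Each list holds 0/1 scores for one item across examinees in that condition.
with_passage = {"item_1": [1, 1, 0, 1, 1], "item_2": [1, 0, 1, 1, 0]}
no_passage   = {"item_1": [1, 1, 1, 0, 1], "item_2": [0, 0, 1, 0, 0]}

def p_value(scores):
    """Proportion of examinees answering the item correctly."""
    return sum(scores) / len(scores)

for item in with_passage:
    p_with = p_value(with_passage[item])
    p_without = p_value(no_passage[item])
    # A small drop without the passage suggests the item may be
    # answerable from prior knowledge alone (threshold is arbitrary).
    flag = "check passage dependence" if p_with - p_without < 0.10 else "ok"
    print(f"{item}: with={p_with:.2f} without={p_without:.2f} -> {flag}")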
Problem Solving. Example 4.16 contains an item set in science. The stim-
ulus is a scientific experiment involving a thermos bottle and some yeast, sugar,
and water. The questions involve the application of principles of science. Al-
though we try to write these items so that they are independent, dependency
seems unavoidable. It is important to note that each item should test a different
step in problem solving. Item 1 asks the student to apply a principle to predict
what happens to the temperature of the water. Item 2 gives the reason for this
result. All four options were judged to be plausible. Item 3 calls for a prediction
based on the application of a principle. Item 4 addresses possible changes in
sugar during a chemical reaction. Item 5 tests another prediction based on this
chemical reaction.
Example 4.17 illustrates a patient problem from a nursing licensing exami-
nation. This item set has an interesting variation; the stimulus presents the
problem and after an item is presented a change in the scenario introduces a
new problem, with an accompanying item. With this format, we test a student's
ability to work through a patient problem.
Pictorial. The pictorial item set is based on a table of sports injuries and participation by sport. The items test one's understanding of the data and inferences that can be made from
these data. The test items reflect reading the table and evaluating the data
presented. Some items require that a ratio of injuries to participants be cre-
ated for each sport, to evaluate the rate of injury. Chapter 7 provides more va-
rieties of this item format.
[Table (partially recovered): sports such as football and weightlifting are listed with columns for number of injuries and number of participants.]
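The rate calculation those items call for is simple; here is a minimal sketch with placeholder figures, since only fragments of the original table survived:

# Hypothetical data standing in for the sports-injury table:
# (injuries, participants) per sport. The item asks for the injury rate.
sports = {
    "football": (453_648, 13_300_000),
    "swimming": (97_500, 25_700_000),
}

for sport, (injuries, participants) in sports.items():
    rate = injuries / participants          # injuries per participant
    print(f"{sport}: {rate * 1000:.1f} injuries per 1,000 participants")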
Interlinear. This MC format tests one's writing skills in a very efficient way. Also, this format
can be used to generate additional items so that other interlinear item sets can
give teachers practice items for students who want to know how proficient they
are with these writing skills.
Computer-Based MC Testing
Many standardized tests and credentialing tests use graphs, tables, illustra-
tions, or photographs as part of the item. There is some research and many pros
and cons to consider before choosing to use accompanying material like this.
Primary among the reasons for using such material is that it completes the presen-
tation of the problem to be solved. In many testing situations it is inconceivable
that we would not find such material. Imagine certification tests in medicine
for plastic surgery, ophthalmology, dermatology, orthopedic surgery, and oto-
laryngology that would not have items that present patient diseases, injuries, or
congenital conditions in as lifelike a manner as possible. Tests in virtually any
subject matter can be enhanced by visual material. However, are such items
better than items with no visual material? Washington and Godfrey (1974) re-
ported a study on a single military test where the findings provided a scant mar-
gin of advantage for illustrated items. Lacking descriptive statistics, this study
can hardly be taken as conclusive.
The argument against using illustrated items is that they require more
space and take more time to read. One would have to have a strong rationale
for using these items. That is, the test specifications or testing policies would
have to justify illustrated items. The main advantage might be face validity.
Dictionaries
To the extent that students have additional aids in taking tests there may be an
improvement or decrement in the validity of test score interpretations. Calcu-
lators are one of these aids, and dictionaries are another that may prove useful
in tests where the language used is not native to the examinee. Nesi and Meara
(1991) studied the effect of dictionary usage in a reading test, citing an earlier
study where the use of dictionaries did not affect test scores or administration
time. In this study, they found similar results, but noted that dictionaries in
both studies did not necessarily provide information useful to students. It
seems that the provision of any aid would have to be justified on the grounds
that it reduces or eliminates a construct-irrelevant influence on test perfor-
mance. Research by Abedi and his colleagues (Abedi, Lord, Hofstetter, &
Baker, 2000) has uncovered the importance of using language on MC tests that
is understandable to students, particularly when reading comprehension is not
the object of the test. Having a student glossary and extra time seemed helpful
to students who traditionally score low on tests given in English where their na-
tive language is not English. Like calculators, the issue is more complex than it
seems on the surface.
Dangerous Answers
SUMMARY
This chapter presented and evaluated eight types of MC item formats. The
conventional MC, AC, matching, EM, TF, and MTF formats are clearly useful
for testing the recall and understanding of knowledge and many cognitive
skills. The complex MC is not recommended. The item set is the most promis-
ing because it seems well suited to testing for the application of knowledge and
skills in complex settings. Scoring item sets presents a challenge as well. How-
ever, with significant interest in testing cognitive abilities, the item set seems to have a promising future.
TABLE 4.1
Multiple-Choice Item Formats and the Content They Can Measure

[Table partially recovered: rows for alternate choice, matching, extended matching, true-false, and multiple true-false carry X marks indicating the types of content each format can measure; the column headings were lost in extraction.]
OVERVIEW
Table 5.1 presents a list of general item-writing guidelines that can be applied to
all the item formats recommended in chapter 4. These guidelines are organized
by categories. The first category includes advice about content that should be
addressed by the SME item writer. The second category addresses style and for-
matting concerns that might be addressed by an editor. The third category is
writing the stem, and the fourth category is writing the options including the
right answer and the distractors. In the rest of this section, these guidelines are
discussed and illustrated when useful.
Content Concerns
1. Every Item Should Reflect Specific Content and a Single Specific Cogni-
tive Process. Every item has a purpose on the test, based on the test specifica-
tions. Generally, each item has a specific content code and cognitive demand
TABLE 5.1
General Item-Writing Guidelines
Content Guidelines
1. Every item should reflect specific content and a single specific cognitive process, as
called for in the test specifications (table of specifications, two-way grid, test
blueprint).
2. Base each item on important content to learn; avoid trivial content.
3. Use novel material to measure understanding and the application of knowledge
and skills.
4. Keep the content of an item independent from content of other items on the test.
5. Avoid overspecific or overgeneral content.
6. Avoid opinion-based items.
7. Avoid trick items.
Style and Format Concerns
8. Format items vertically instead of horizontally.
9. Edit items for clarity.
10. Edit items for correct grammar, punctuation, capitalization, and spelling.
11. Simplify vocabulary so that reading comprehension does not interfere with testing
the content intended.
12. Minimize reading time. Avoid excessive verbiage.
13. Proofread each item.
Writing the Stem
14. Make directions as clear as possible.
15. Make the stem as brief as possible.
16. Place the main idea of the item in the stem, not in the choices.
17. Avoid window dressing (excessive verbiage) in the stem.
18. Avoid negative words in the stem.
Writing Options
19. Develop as many effective options as you can, but two or three may be sufficient.
20. Vary the location of the right answer according to the number of options. Assign the
position of the right answer randomly.
21. Place options in logical or numerical order.
22. Keep options independent; choices should not be overlapping.
23. Keep the options homogeneous in content and grammatical structure.
24. Keep the length of options about the same.
25. None of the above should be used sparingly.
26. Avoid using all of the above.
code. The content code can come from a topic outline or a list of major topics.
In chapter 2, it was stated that all content can essentially be reduced to facts,
concepts, principles, or procedures. But generally, topics subsume this distinc-
tion. The cognitive demand is usually recall or understanding. But if the intent
of the item is to infer status to an ability, such as problem solving, the applica-
tion of knowledge and skills is assumed.
The use of examples and nonexamples for testing concepts such as similes,
metaphors, analogies, homilies, and the like is easy. You can generate lists of
each and mix them into items as needed.
The following questions come from the story Stones from Ybarra.
A danger in being too specific is that the item may be measuring trivial con-
tent, the memorization of a fact. The judgment of specificity and generality is
subjective. Each item writer must decide how specific or how general each item
must be to reflect adequately the content topic and type of mental behavior de-
sired. Items also should be reviewed by others, who can help judge the specific-
ity and generality of each item.
7. Avoid Trick Items. Trick items are intended to deceive the test taker
into choosing a distractor instead of the right answer. Trick items are hard to
illustrate. In a review and study, Roberts (1993) found just a few references in
the measurement literature on this topic. Roberts clarified the topic by distin-
guishing between two types of trick items: items deliberately intended by the
item writer, and items that accidentally trick test takers. Roberts's students
reported that in tests where more tricky items existed, these tests tended to be
more difficult. Roberts's study revealed seven types of items that students
perceived as tricky, including the following:
The open-ended items in Example 5.6 are trick items. Yes, there is a Fourth of July in England, as there is around the world. All months have at least 28 days. It was Noah, not Moses, who loaded animals on the ark. The butcher weighs meat. Items such as these are meant to deceive you, not to measure your knowledge.
Trick items often violate other guidelines stated in Table 5.1. Roberts encour-
aged more work on defining trick items. His research has made a much-needed
start on this topic.
A negative aspect of trick items is that, if they are frequent enough, they foster in test takers an attitude of distrust and a potential lack of respect for the testing process. There are enough problems in testing without
contributing more by using trick items. As Roberts (1993) pointed out, one of
the best defenses against trick items is to allow students opportunities to chal-
lenge test items and to allow them to provide alternative interpretations. Dodd
and Leal (2002) argued that the perception that MC items are "tricky" may in-
crease test anxiety. They employed answer justification, which eliminates both the perception and the reality of trick items. If all students have equal access to appeal-
ing a trick item, this threat to validity is eliminated. Such procedures are dis-
cussed in more detail in chapter 8.
9. Edit Items for Clarity. Early in the development of an item, that item
should be subject to scrutiny by a qualified editor to determine if the central idea
is presented as clearly as possible. Depending on the purpose of the test and the
time and other resources devoted to testing, one should always allow for editing.
Editing for clarity does not guarantee a good item. However, we should never
overlook the opportunity to improve each item using editing for clarity.
We should note a caution here. Cizek (1991) reviewed the research on editing test items. He reported findings suggesting that if an item is already being used effectively, editorial changes for improving clarity may disturb the performance characteristics of that item. Therefore, the warning is that editing should not be done on an operational item that performs adequately. On the other hand, O'Neill (1986) and Webb and Heck (1991) reported no differences between items that had been edited and items that had not.
Acronyms may be used, but their use should be handled carefully. Generally, acronyms are explained in the test before being reused.
Dawson-Saunders et al. (1992, 1993) experimented with a variety of al-
terations of items. They found that reordering options along with other edi-
torial decisions may affect item characteristics. A prudent strategy would be
to concentrate on editing the item before instead of after its use. If editing
does occur after the first use of the item, these authors suggested that one
consider content editing versus statistical editing. The former suggests that
content changes are needed because the information in the item needs to be
improved or corrected. Statistical alteration would be dictated by informa-
tion showing that a distractor did not perform and should be revised or re-
placed. The two kinds of alterations may lead to different performances of
the same item. Expert test builders consider items that have been statisti-
cally altered as new. Such items would be subject to pretesting and greater
scrutiny before being used in a test. Reordering options to affect key balanc-
ing should be done cautiously.
reliability of test scores. For these many good reasons, we try to write MC items
that are as brief as possible without compromising the content and cognitive
demand we require. This advice applies to both the stem and the options.
Therefore, as a matter of writing style, test items should be crisp and lean. They
should get to the point in the stem and let the test taker choose among plausi-
ble options that are also as brief as possible. Example 5.8 shows an item with
repetitious wording. The improved version eliminates this problem.
15. Make the Stem as Brief as Possible. As noted with Guideline 12,
items that require extended reading lengthen the time needed for students to
complete a test. This guideline urges item writers to keep the stem as brief as
possible for the many good reasons offered in Guideline 12. Example 5.10 illus-
trates both lengthy and brief stems.
16. Place the Main Idea in the Stem, Not in the Choices. Guideline 12
urges brief items, and Guideline 15 urges a brief stem. Sometimes, the stem
might be too brief and uninformative to the test taker. The item stem should al-
ways contain the main idea. The test taker should always know what is being
asked in the item after reading the stem. When an item fails to perform as in-
tended with a group of students who have received appropriate instruction,
there are often many reasons. One reason may be that the stem did not present
the main idea. Example 5.11 provides a common example of a stem that is too
brief and uninformative. This item-writing fault is called the unfocused stem.
As you can see, the unfocused stem fails to provide adequate information to
address the options. The next item in Example 5.11 is more direct. It asks a
question and provides three plausible choices.
[Example 5.11 (partially recovered): it contrasts an unfocused stem with a focused stem; the options shown were A. Immediately, B. Several days to a week, C. Several months, and D. Seldom ever.]
17. Avoid Window Dressing in the Stem. Material is sometimes added to a stem to make the item look more lifelike or realistic, to provide some substance to it. We use the term window dressing to imply that an item has too much decoration and not enough substance. Example 5.12 shows window dressing. For many good reasons discussed in Guidelines 9, 11, 12, 14, and 15, window dressing is not needed.
Window Dressing
High temperatures and heavy rainfall characterize a humid
climate. People in this kind of climate usually complain of heavy
perspiration. Even moderately warm days seem uncomfortable.
Which climate is described?
A. Savanna
B. *Tropical rainforest
C. Tundra
However, there are times when verbiage in the stem may be appropriate. For
example, in problems where the test taker sorts through information and distin-
guishes between relevant and irrelevant information to solve a problem, exces-
sive information is necessary. Note that the phrase window dressing is used
exclusively for situations where useless information is embedded in the stem
without any purpose or value. In this latter instance, the purpose of excessive in-
formation is to see if the examinee can separate useful from useless information.
In Example 5.13, the student needs to compute the discount, figure out the
actual sales price, compute the sales tax, add the tax to the actual sale price,
and compare that amount to $9.00. The $12.00 is irrelevant, and the student is
supposed to ignore this fact in the problem-solving effort. This is not window
dressing because the objective in the item is to have the student discriminate
between relevant and irrelevant information.
18. Avoid Negative Words in the Stem. We have several good reasons
for supporting this guideline. First, we have a consensus of experts in the field of
testing who feel that the use of negative words in the stem has negative effects
on students and their responses to such items (Haladyna et al., 2002). Some re-
search on the use of negative words also suggests that students have difficulty
understanding the meaning of negatively phrased items. A review of research
by Rodriguez (2002) led to his support of this guideline. Tamir (1993) cited re-
search from the linguistic literature that negatively phrased items require
about twice as much working memory as equivalent positively phrased forms of
the same item. Negative words appearing both in the stem and in one or more
options might require four times as much working memory as a positively
phrased equivalent item. Tamir's study led to a conclusion that for items with
low cognitive demand, negative phrasing had no effect, but that for items with
high cognitive demand, negatively phrased items were more difficult. Tamir
also found that the difference between the positive and negative forms of an item varied as a
function of the type of cognitive processing required. Taking into account the
various sources of evidence about negative items, it seems reasonable that we
should not use negative wording in stems or in options.
Example 5.14 shows the use of the EXCEPT format, where all answers meet
some criterion except one. Although this is a popular format and it may per-
form adequately, this kind of item puts additional strain on test takers in terms
of working memory. Consequently, it probably should be avoided.
With the NOT format, the student often reads through the NOT and forgets to reverse the logic of
the relation being tested. This is why the use of NOT is not recommended for
item stems.
19. Use as Many Choices as Possible, but Three Seems to Be a Natural Limit.
A growing body of research supports the use of three options for conven-
tional MC items (Andres & del Castillo, 1990; Bruno & Dirkzwager, 1995;
Haladyna & Downing, 1993; Landrum, Cashin, & Theis, 1993; Lord, 1977; Rodriguez, 1997; Rogers & Harley, 1999; Sax & Reiter, n.d.; Trevisan, Sax, &
Michael, 1991, 1994). To summarize this research on the optimal number of
options, evidence suggests a slight advantage to having more options per test
item, but only if each distractor is discriminating. Haladyna and Downing
(1993) found that many distractors do not discriminate. Another implication
of this research is that three options may be a natural limit for most MC items.
Thus, item writers are often frustrated in finding a useful fourth or fifth option
because they typically do not exist.
The advice given here is that one should write as many good distractors as
one can but should expect that only one or two will really work as intended. It
does not matter how many distractors one produces for any given MC item, but
it does matter that each distractor performs as intended. This advice runs
counter to what is practiced in most standardized testing programs. However,
both theory and research support the use of one or two distractors in the design
of a test item. In actuality, when we use four or five options for a conventional
MC test item, the existence of nonperforming distractors is nothing more than
window dressing. Thus, test developers have the dilemma of producing unnec-
essary distractors, which do not operate as they should, for the appearance of
the test, versus producing tests with fewer options that are more likely to do
what they are supposed to do.
One criticism of using fewer instead of more options for an item is that guess-
ing plays a greater role in determining a student's score. The use of fewer dis-
tractors will increase the chances of a student guessing the right answer.
However, the probability that a test taker will increase his or her score signifi-
cantly over a 20-, 50-, or 100-item test by pure guessing is infinitesimal. The
floor of a test containing three options per item for a student who lacks knowl-
edge and guesses randomly throughout the test is 33% correct. Therefore, ad-
ministering more test items will reduce the influence of guessing on the total
test score. This logic is sound for two-option items as well, because the floor of
the scale is 50% and the probability that random guessing will carry a score far above that floor on a 20-, 50-, or 100-item test is very close to zero. In other words, the threat of
guessing is overrated.
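This claim is easy to verify with the binomial distribution. The following sketch (mine, not the book's) computes the probability that pure guessing on three-option items yields a score of 60% or better:

from math import comb

def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): at least k lucky guesses in n items."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

p = 1 / 3  # chance of a correct blind guess on a three-option item
for n in (20, 50, 100):
    target = int(0.60 * n)  # scoring 60% correct purely by guessing
    print(f"{n} items: P(score >= 60%) = {prob_at_least(target, n, p):.6f}")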
20. Vary the Location of the Right Answer According to the Number of
Options. Assign the Position of the Correct Answer Randomly. The ten-
dency to mark in the same response category is response set. Also, testwise stu-
dents are always looking for clues that will help them guess. If the first item is
usually the correct answer, the testwise student will find this pattern and when
in doubt choose A. Therefore, we vary the location of the right answer to ward
off response set and testwise test takers. If we use a three-option format, each of A, B, and C should be the right answer about 33% of the time.
Recent research indicates that this guideline about key balancing may have
some subtle complications. Attali and Bar-Hillel (2003) and Bar-Hillel and
Attali (2002) posited an edge aversion theory that the right answer is seldom
found in the first and last options, thus offering an innocent clue to test takers
to guess middle options instead of "edge" options. Guessing test takers have a
preference for middle options as well. Balancing the key so that correct answers are equally distributed across the option positions therefore leaves a slight bias due to edge aversion, which affects estimates of difficulty and
discrimination. They concluded that correct answers should be randomly as-
signed to the option positions to avoid effects of edge aversion.
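In practice, random key assignment is easy to automate. The sketch below is a minimal illustration (the function name and item structure are my own, and the sample item reuses the discount example that follows):

import random

def randomize_key(stem, correct, distractors, rng=random):
    """Shuffle the options so the keyed position is assigned at random."""
    options = [correct] + list(distractors)
    rng.shuffle(options)                      # random position for the key
    labels = "ABCDE"[:len(options)]
    key = labels[options.index(correct)]
    return {"stem": stem, "options": dict(zip(labels, options)), "key": key}

item = randomize_key(
    "What is the cost of an item that normally sells for $9.99 and is discounted 25%?",
    correct="$7.50",
    distractors=["$2.50", "$5.00", "$6.66"],
)
print(item["key"], item["options"])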
Wrong:
What is the cost of an item that normally sells for $9.99 that is
discounted 25%?
A. $5.00
B. *$7.50
C. $2.50
D. $6.66

Right:
What is the cost of an item that normally sells for $9.99 that is
discounted 25%?
A. $2.50
B. $5.00
C. $6.66
D. *$7.50
You are dividing the bill of $19.45 equally among four of us who had
lunch together. But your lunch item was $9.55. What is your fair share of
the bill?

First presentation:          Second presentation:
A. .250                      A. 0.049
B. 0.491                     B. 0.250
C. .50                       C. 0.500

The example shows the same item with answers aligned two ways. Notice that the second way is easier to follow than the first: the decimal point is aligned in the second version for easy reading. Also, try to keep the number of decimal places constant for uniformity.
Logical ordering is more difficult to illustrate, but some examples offer hints
at what this guideline means. Example 5.18 illustrates an item where options
were alphabetically ordered.
There are instances where the logical ordering relates to the form of the an-
swers instead of the content. In Example 5.19, answers should be presented in
order of length, short to long.
If a value contained in overlapping options is correct, the item may have two
or more correct answers. Example 5.20 illustrates this problem. Numerical
problems that have ranges that are close make the item more difficult. More
important in this example, Options A, B, C, and D overlap slightly. If the an-
swer is age 25, one can argue that both C and D are correct though the author
of the item meant C. This careless error can be simply corrected by developing
ranges that are distinctly different. The avoidance of overlapping options also
will prevent embarrassing challenges to test items.
24. Keep the Length of Choices about the Same. One common fault in
item writing is to make the correct answer the longest. This may happen inno-
cently. The item writer writes the stem and the right answer, and in the rush to
complete the item adds two or three hastily written wrong answers that are
shorter than the right answer. Example 5.22 shows this tendency.
(2002) surveyed current textbooks and found authors split on this guideline.
Frary (1993) supported this format, but with some caution. An argument fa-
voring using none of the above in some circumstances is that it forces the stu-
dent to solve the problem rather than choose the right answer. In these
circumstances, the student may work backward, using the options to test a so-
lution. In a study of none of the above by Dochy, Moerkerke, De Corte, and
Segers (2001) with science problems requiring mathematical ability, their re-
view, analysis, and research point to using none of the above because it is a
plausible and useful distractor, and they argue that students can generate
many incorrect answers to a problem. Thus, none of the above serves a useful
function in these complex problems with a quantitative answer. For items
with a lower cognitive demand, none of the above probably should not be used.
When none of the above is used, it should be the right answer an appropriate
number of times.
26. Avoid Using All of the Above. The use of the choice all of the above
has been controversial (Haladyna & Downing, 1989a). Some textbook
writers have recommended and have used this choice. One reason may be
that in writing a test item, it is easy to identify one, or two, or even three
right answers. The use of the choice all of the above is a good device for cap-
turing this information. However, the use of this choice may help testwise
test takers. For instance, if a test taker has partial information (knows that
two of the three choices offered are correct), that information can clue the
student into correctly choosing all of the above. Because the purpose of a MC
test item is to test knowledge, using all of the above seems to draw students
into test-taking strategies more than directly testing for knowledge. One al-
ternative to the all of the above choice is the use of the MTF format. Another
alternative is simply to avoid all of the above and ensure that there is one and
only one right answer.
28. Avoid Options That Give Clues to the Right Answer. We have a
family of clues that tip off test takers about the right answer. One such clue is the specific determiner, which includes words such as always, never, totally, absolutely, and completely. A specific determiner may occasionally be the right answer. In these instances, its use is justified if the distractors also contain other specific determiners. In Example 5.23, Option A uses the specific determiner never and Option C uses the specific determiner always.
28. What is the purpose of the TAX table? To help you determine
A. your gross income.
B. the amount of TAX you owe.
C. your net earnings.
D. your allowable deductions.
28. Three objects are thrown in the water. Object A floats on top
of the water. Object B is partially submerged. Object C sinks.
All three objects have the same volume. Which object weighs
the most?
A. A
B. B
C. C
You may not know the person in the second option (B), but you know
that it is the right answer because the other two are absurd. If A or C is cor-
rect, the item is a trick question.
30. Use Typical Errors of Students When You Write Distractors. One
suggestion is that if we gave completion items (open-ended items without
choices), students would provide the correct answer and plausible wrong an-
swers that are actually common student errors. In item writing, the good plausi-
ble distractor comes from a thorough understanding of common student errors.
In the example in Example 5.30, Distractor A is a logical incorrect answer for
someone learning simple addition.
29. 77 + 34 =
A. 101
B. 111
Generally, the set of choices for a matching item set is homogeneous as to content, because the benefit of a matching format is the measurement of understanding.
1. The number of MTF items per cluster may vary within a test.
2. Conventional MC or complex MC items convert nicely to MTF items.
3. No strict guidelines exist about how many true and false items appear in a
cluster, but expecting a balance between the number of true and false
items per set seems reasonable.
4. The limit for the number of items in a cluster may be as few as 3 or as many
as would fit on a single page (approximately 30 to 35).
TABLE 5.2
Guidelines for the Matching Format
1. Provide clear directions to the students about how to select an option for each stem.
2. Provide more stems than choices.
3. Make choices homogeneous.
4. Put choices in logical or numerical order.
5. Keep the stems longer than the options.
6. Number stems and use letters for options (A, B, C, etc.).
7. Keep all items on a single page or a bordered section of the page.
Format the Item Set So All Items Are on a Single Page or Opposing Pages of
the Test Booklet. This step ensures easy reading of the stimulus material and
easy reference to the item. When limited to two pages, the total number of
items ranges from 7 to 12. If the MTF or AC formats are used with the item set,
many more items can be used.
Use Any Format That Appears Suitable With the Item Set. With any
item set, conventional MC, matching, AC, and MTF items can be used. The
item set encourages considerable creativity in developing the stimulus and us-
ing these various formats. Even CR item formats, such as short-answer essays,
can be used.
SUMMARY
OVERVIEW
National Assessment—Reading
The first item comes from the NAEP's 1994 reading assessment and is shown in
Example 6.1. This item is based on a reading passage about the Anasazi Indians
of the Southwest United States. The passage is not presented here because of
space limitations, but it is customary to use reading passages to test reading
comprehension. In chapter 7, advice is given on how to generate a large num-
ber of reading comprehension items using "clone" item stems.
After reading the passage, the student must choose among four plausible op-
tions. Understanding of the passage is essential. The four options use language
that cannot be found verbatim in the passage. Thus, the options present in a
novel way what the author portrayed about the Anasazi Indians. Those looking
for a more complete collection of examples of high-quality reading comprehen-
sion items should consult the web page of the National Center for Education
Statistics (http://nces.ed.gov/nationsreportcard/).
EM
Example 6.2 shows how the EM format discussed in chapter 4 can be effec-
tively used to measure a student's understanding. Each patient is described in
terms of a mental disorder. The long list of disorders is given at the right.
Each learner can memorize characteristics of a disorder, but understanding is
Patient

1. Mr. Ree anxiously enters every room left foot first.
2. Suzy enters the room excited. She walks quickly around chattering. Later she is normal.
3. Bob saw the archangel of goodness followed by Larry, Curly, and Moe, the three saints of comedy.
4. Muriel cannot pronounce words, and she has no paralysis of her vocal cords.
5. "I am the Queen of Sheba," the old lady muttered.
6. After Julie broke up with Jake, she remarked, "There are many fish in the sea."
7. Norman was thin and tired. Nothing was important to him. He felt useless and inferior. He wanted to escape.
8. Good clothes and good looks did not get Maria the attention she wanted, so she excelled in sports.

Disorder

A. Neurasthenia
B. Dementia
C. Regression
D. Alexia
E. Sublimation
F. Bipolar
G. Compulsion
H. Rationalization
I. Masochism
J. Hallucination
K. Hypnotism
L. Delusional
needed when each learner is confronted with a new patient who demon-
strates one or more symptoms of a disorder. Of course, psychotherapy is not as simplistic as this example suggests, but in learning about disorders, learners
should understand each disorder rather than simply memorize characteristics
of a disorder.
Anyone who has written MC items for a while has experienced the frustration
of thinking up that third or fourth option. Because distractors have to be plausi-
ble and reflect common student errors, it is hard to come up with more than
one or two really good distractors. The combinatorial format makes that effort
easier. You simply write an AC item (two options) and add two more generic op-
tions. The first example comes from the National Board Dental Examination
Testing Programs.
As shown in Example 6.3, we have two statements. We have four combina-
tions for these two statements: true-true, true-false, false-true, false-false.
As shown in Example 6.4, we have two plausible answers that complete the
stem. The student must evaluate whether the first answer only is right, the sec-
ond answer only is right, or whether both answers are correct or both incorrect.
The nice thing about this format is that item writers never have to think of that
third or fourth option; they simply create two MC options. However, they have
to be careful to ensure that, across the test, the right answer is evenly distributed among the four choices.
The MTF format is also useful for testing understanding. Example 6.5 shows
how teaching the characteristics of vertebrates can be cleverly tested using de-
scriptions of animals. The animals in the list are not described or presented in a
textbook or during lecture. The student encounters an animal description and
must decide based on its characteristics if it is absurd or realistic. The number of
items may vary from 1 or 2 to more than 30. The decision about the length of
this item set depends on the type of test being given and the kind of coverage
needed for this content.
Example 6.6 has an efficient presentation and method of scoring. Imagine that we
are learning about three governments. You select the letter corresponding to
the government that reflects the statement on the left. This example only has 4
statements, but we could easily have 20 to 30 statements. We have 12 scorable
units in this example, but with 30 statements this would be equivalent to a 90-
item TF test.
This item may be measuring recall of facts, but if the statements are pre-
sented in a novel way, these items might be useful measures of student under-
standing.
TESTING SKILLS
Vocabulary—Reading
Find the word that most nearly means the same as the word on
the left.
Writing Skills
Example 6.9 shows the breadth of writing skills that can be tested. As noted
in this example, all items have two choices; therefore, guessing is a factor.
However, if enough items are presented to students, guessing becomes less of a
factor.
A good point to make about these items is that all the distinctions listed in
the examples and by Technical Staff (1933, 1937) can be re-presented with
new sentences that students have not seen. Thus, we are testing the applica-
tion of writing skill principles to new written material.
Example 6.10 is presented in generic format because of space consider-
ations. As you can see, the number of sentences in the stimulus condition can
be long. In fact, this list might range from 5 to 50 or more sentences. The stu-
dent is expected to detect eight distinctly different errors in writing. Such a test
has high fidelity for anyone who is learning how to correct and revise writing.
The main idea in using MC to measure writing skills is to use real examples
presented to test takers, allowing their choices to provide insight
into their writing skills. Virtually every writing skill can be converted into an
MC format because most of these skills can be observed naturally as a student
writes or can be assessed artificially in a test using items that appear in this sec-
tion. Although these MC formats can appear artificial or contrived, is the low-
ering of fidelity to true editing a tolerable compromise? Editing student writing
is one way to measure these writing skills, but these examples provide a more
standardized way.
Mathematics Skills
Mathematics skills can also be tested easily using MC formats. Example 6.11
shows a conversion involving fractions, decimals, and percents. The student
learning objective might require that students find equivalents when pre-
sented with any fraction, decimal, or percent. Example 6.11 shows the use of
the MTF format, which permits a more thorough testing of the procedure for
converting from one form to another. As we see from this example, fractions
can vary considerably. We can create many items using this structure. Options
should include the right answer and common student errors.
Example 6.12 shows a simple area problem that might be appropriate for a
fifth-grade mathematics objective. Problems like these are designed to reflect
real-world-type problems that most of us encounter in our daily lives. The cre-
ating of test items that students can see have real-world relevance not only
makes the problems more interesting but promotes the idea that this subject
matter is important to learn.
You are painting one wall and want to know its area. The wall is 8
feet high and 12 feet wide. What is the area?
A. 20 feet
B. 20 square feet
C. 40 square feet
D. 96 square feet
Example 6.13 involves a more complex skill where two numbers have to be
multiplied. This item represents a two-stage process: (a) recognize that multi-
plication is needed, and (b) multiply correctly. When a student misses this
item, we cannot ascertain whether it was a failure to do (a) or (b) that resulted
in the wrong answer.
Our orchard contains 100 trees. We know from previous years that
each tree produces about 30 apples. About how many apples
should be expected this year at harvest time?
A. 130
B. 300
C. 3,000
D. Cannot say. More information is needed.

Language
In testing a second language, we can select words or phrases that we think are part of the student's body of knowledge to master and
then provide the correct translation and two or three plausible incorrect trans-
lations. Generating test items for practice testing and for summative testing or
for testing programs can be easily accomplished.
Er nimmt Platz:
A. He occupies a position.
B. He waits at a public square.
C. He seats himself.
Any reading passage can be presented in one language and the test items
measuring comprehension of the passage can be presented in another lan-
guage. Some state student achievement testing programs have experimented
with side-by-side passages in English and another language to assist those examinees whose native language is not English.
This section contains items that are purported to prompt test takers to apply
knowledge and skills to address a complex task. The items range in MC for-
mats and include conventional MC, conventional MC with context-depend-
ent graphic material, conventional MC with generic (repeatable) options,
MTF and multiple-response (MR) formats, networked two-tier item sets,
and combinatorial MC items. These examples should show that writing MC
items can measure more than recall. Although reliance on CR performance
items that require judged scoring is always desirable, many of the examples
shown in this section should convince us that these types of MC items often
serve as good proxies for the performance items that require more time to ad-
minister and human scoring that is fraught with inconsistency and bias.
We have literally hundreds of testing programs that require test items mea-
suring knowledge, skills, and abilities in professions. Item banks are hard to
develop, and these item banks must be updated each year, as most profes-
sions continuously evolve. Old items are retired and new items must replace
these old items. Item writing in this context is expensive. SMEs may be paid
or may volunteer their valuable time. Regardless, the items must not only
look good but they must perform. Example 6.16 shows a situation encoun-
tered by a Chartered Financial Analyst (CFA) where several actions are
possible and only one is ethical, according to the ethical standards of the
Association for Investment Management and Research (AIMR). Although
the ethical standards are published, the test taker must read and under-
stand the real-life situation that may be encountered by a practicing CFA
and take appropriate action. Inappropriate action may be unethical and
lead to negative consequences. Thus, not only are such items realistic in ap-
pearance but such items measure important aspects of professional knowl-
edge. Note that Option A has a negative term, which is capitalized so that
the test taker clearly understands that one of these options is negatively
worded and the other three options are positively worded.
Most certification and licensing boards desire test items that call for the appli-
cation of knowledge and skills to solve a problem encountered in their profes-
sion. In medicine, we have a rich tradition for writing high-quality MC items
that attempt to get at this application of knowledge and skills. Example 6.17
provides a high-quality item that is typically encountered in certification tests
in the medical specialties.
These items often derive from an experienced SME who draws the basis for
the item from personal experience. It is customary for every item to have a ref-
erence in the medical literature that verifies the correctness of content and the
selected key.
A. Radioimmunoassay of tri-iodothyronine
B. Resin tri-iodothyronine uptake test
C. Thyroid scan
D. Thyroid-stimulating hormone test
E. Free thyroxine index

This item set requires the student to read and interpret a graph showing the
savings of four children (see Example 6.18). The graph could be used to test
other mathematics skills, such as correctly reading the dollar values saved by
each child. Other comparisons can be made. Or a probability prediction could
be made about who is likely to save the most or least next year. These types of
items are easily modeled. In fact, chapter 7 shows how item models can be cre-
ated using an item like this one. An item model is a useful device for generating
like or similar items rapidly.
Whether your graphical material addresses a single item (stand alone) or a set
of items, most high-quality testing programs use graphical materials because they add a touch of realism to the context for the item and usually enable the
testing of application of knowledge and skills. In these two examples, tables are
used. These tables require the test taker to read and understand the data pro-
vided and take some action, as called for in the item stem.
Example 6.19 requires the student to add points and read the chart to deter-
mine Maria's grade. The item reflects meaningful learning because most stu-
dents want to know their grades and must perform an exercise like this one to
figure out their grade. This item can be varied in several ways to generate new
items. The points in the stem can be changed, and the grading standards can be
changed. As the next chapter shows, we have many techniques for generating
new items from old items that make item development a little easier.
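As a check on Example 6.19, the computation the student must perform can be sketched as follows (the code is illustrative; the fallback grade below the lowest band is an assumption, because the example does not show one):

# Sum Maria's points and look the total up in the teacher's grading standards.
scores = [345, 400, 122, 32]
standards = [          # (minimum total, grade), highest band first
    (920, "A"),
    (850, "B"),
    (800, "C"),
    (750, "D"),
]

total = sum(scores)    # 899
grade = next((g for cutoff, g in standards if total >= cutoff), "F")  # "F" is assumed
print(total, grade)    # 899 -> B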
Example 6.20, also from the AIMR's certification testing program, nicely shows
how a table can be used to test for the complex application of knowledge and skills.
As with virtually all items of a quantitative nature, this format can be used
and reused with new data in the table and appropriate revisions in the stem and
options. The potential for item models that allow you to generate additional
Beth, Bob, Jackie, and Tom had savings programs for the
year. How many more dollars did Beth save than Tom?
A. $ 2.50
B. $ 5.00
C. $11.00
D. $21.00
items is great. As mentioned several other times in this chapter, chapter 7 is de-
voted to this idea of item generation.
Example 6.21 shows how MC can be used to test the logical thinking that is necessary in subjects such as science, social studies, and mathematics. The first statement is assumed to be true and factual. The second statement is connected to the
first and can be true, false, or indeterminate. By writing pairs of statements,
we can test the student's logical analysis concerning a topic taught and
learned. The number of pairs can be extensive. The options remain the same
for every pair. Thus, writing items can be streamlined.

Maria's teacher has the following grading standards in mathematics.
Maria wants to know what her grade this grading period will be. Her
scores from quizzes, portfolio, and homework are 345, 400, 122, and 32.

Total Points     Grade
920 to 1000      A
850 to 919       B
800 to 849       C
750 to 799       D

A. A
B. B
C. C
D. D

Another aspect of
this item set is that the first two items are general and conceptual, whereas
the latter two items are specific to something that has been taught and
learned. Thus, we can test for general knowledge and understanding and
then apply it to specific instances.
MTF or MR Formats
The MC item set in Example 6.22 is presented using the MTF format but it
could also be presented in an MR format. Thus, aspects of problem solving and
critical thinking can be tested with a scenario-based format without using the
conventional MC format, which requires more time to write the items. Note
that this format can have more than 5 items (statements). In fact, as many as 30
statements can fit on a single page; therefore, the amount of testing derived
from a single problem scenario can be extensive. Guessing is not a problem with
this format because of the abundance of items that can be generated to test the
student's approach to this problem.
AC is another format that resembles the previous format, is from the same
source, and has a generic nature that can be applied to many subject matters
and situations. AC has two parts. The stimulus contains a series of observa-
tions, findings, or statements about a theme. The response contains a set of
plausible conclusions. The following example shows a set of true statements
about student learning and then a series of logical and illogical conclusions are
drawn. The student must choose between the two choices for each conclusion.
Although only five items are presented in Example 6.23, you can see that the
list can be increased considerably.
performance on achievement tests. Tsai and Chou (2002) developed the net-
worked two-tier test as a means for studying students' prior knowledge and mis-
conceptions about science. The test is administered over the World Wide Web,
which makes it more accessible for its purpose, which is diagnosis.
The two-tier test is a two-item MC item set based on a science problem, usu-
ally accompanied with visual material. The first tier (first item) explores the
child's knowledge of the phenomenon being observed. The second tier (second
item) explores the basis in reasoning for the first choice. Such items are devel-
oped after student interviews. Thus, the content for the items is carefully devel-
oped from actual student encounters with items, as opposed to using students'
perceptions afterward to gain insights. Example 6.24 illustrates a two-tier item.
Two astronauts were having a fight in space. One struck the other.
The one who struck weighed 40 kilograms, but the one who was struck weighed 80 kilograms.
The first-tier item is presented. The student responds. Then the second-tier
item is given, with the first-tier item being retained on screen. The second item is
presented only after the student has made a choice on the first item, so that the response to the first item is not influenced by the second. The sequencing effect of items pro-
vides for inferences to be made about stages of learning so that teaching inter-
ventions can identify and redirect students into more logical, correct patterns.
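A bare-bones sketch of this two-tier sequencing is shown below; the item content, option wording, and console interface are hypothetical stand-ins for Tsai and Chou's Web delivery:

# Hypothetical two-tier administration: the second item appears only after
# the student has committed to an answer on the first item.
first_tier = {
    "stem": "Which astronaut moves backward faster after the strike?",
    "options": {"A": "The 40-kg astronaut", "B": "The 80-kg astronaut"},
}
second_tier = {
    "stem": "What is the reason for your answer?",
    "options": {"A": "Equal forces act, so the lighter mass accelerates more",
                "B": "The striking astronaut always moves faster"},
}

def administer(item):
    print(item["stem"])
    for label, text in item["options"].items():
        print(f"  {label}. {text}")
    return input("Your choice: ").strip().upper()

tier1 = administer(first_tier)        # response recorded first
tier2 = administer(second_tier)       # only then is the reasoning tier shown
print("Responses:", tier1, tier2)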
This item format coupled with the work of Tsai and Chou (2002) exemplifies
the recent emphasis on studying the cognitive processing underlying some item formats.
The first item calls for the use of a principle to make a prediction. The sec-
ond item uses causal reasoning to explain the rationale for the prediction.
Premise-Consequence
Combinatorial Items
Another item from the National Board Dental Examinations employs another
strategy for systematic variation that makes writing options easier. Example
6.27 shows this technique.
How does the fluoride ion affect the size and solubility of the
hydroxyapatite crystal?
Crystal Size Solubility
A. Increases Increases
B. Decreases Decreases
C. Increases Decreases
D. Decreases Increases
Example 6.28 is another good example that comes from the Uniform Cer-
tified Public Accountant Examination with the use of simple yes and no an-
swers to combinatorial conditions.
The options are easier to develop. If the item writer can develop the stem so
that the four options can systematically have paired variations, the item writ-
ing is simplified.
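The systematic pairing behind combinatorial options is easy to generate automatically. The sketch below (mine, not from either testing program) enumerates the four combinations for two facets of change:

from itertools import product

# Systematically pair the two directions of change to form the option set
# for a combinatorial item (e.g., crystal size and solubility).
facets = {
    "Crystal size": ["Increases", "Decreases"],
    "Solubility": ["Increases", "Decreases"],
}

labels = "ABCD"
for label, combo in zip(labels, product(*facets.values())):
    description = ", ".join(f"{facet} {value.lower()}"
                            for facet, value in zip(facets, combo))
    print(f"{label}. {description}")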
SUMMARY
The purpose of this chapter was to show that MC formats come in a larger va-
riety than presented in chapter 4. Not only is there a variety in MC formats,
but this chapter shows that these MC formats can measure knowledge, skills,
and the application of knowledge and skills in many content areas. You are
OVERVIEW
Whether you are developing MC test items for a class you teach or for a testing
program, the pressure to produce a large collection of high-quality test items is
omnipresent. New items are always needed because old items based on old content may need to be retired. Case et al. (2001) reported on new-item development for
several medical credentialing examinations. They stated that a significant por-
tion of the budget for test development is given to creating new items. Item
writing is a costly enterprise. Item generation refers to any procedure that
speeds up this item-writing process. Because new and better items are always
needed, any strategy to increase both the quality of items and the rate of pro-
duction is welcome.
This chapter features five sections. The first section covers item shells,
which is a very straightforward, item-generating technology that is easy to em-
ploy but is limited to items that mainly reflect knowledge and skills. The second
section is item modeling, which has more potential for measuring complex cog-
nitive behavior. The third section is key features, which has potential for mea-
suring clinical problem solving in a profession, which is a central interest in
professional credentialing tests. The fourth section discusses generic item sets,
where variable facets are introduced and generic items provide a basis for writ-
ing stems. The fifth section shows how to transform an existing complex perfor-
mance item into one or more MC items that reflect the complex behavior
elicited in the performance item.
These five approaches to item generation represent practical technologies
for item generation. However, there is an emerging science of item generation
that promises to improve our ability to rapidly generate items. The vision is that
computers will someday produce items on demand for testing programs where
new tests and test results are needed quickly. But the basis for creating these
computer programs will come from teams of experts, including SMEs, whose
judgments will always be needed.
Wainer (2002) made several important points about item generation in the
future. He argued that item generation is urgently needed in the context of
computerized testing, particularly computer adaptive testing. He pointed out
that computerized testing is probably not desirable for large-scale assessments
of course work, tests that are given on an annual basis, and large-scale perfor-
mance assessments. Computerized testing is feasible in low-stakes settings such
as placement, and when test results are needed quickly, as in credentialing testing programs. Wainer concluded that item generation may be best suited for di-
agnostic testing and for things that are easy to codify, but for measuring
high-level complex thinking that school-based assessments desire, these item-
generation theories have not yet been successful.
Whether we use item-generation theories for developing a technology or
use the techniques described in this chapter, the need to write good items for all
testing programs is always there. In the interim period before such theories be-
come operational and provide the kinds of items we desire, we turn to the pro-
cedures in this chapter because they can help item writers accelerate the slow,
painful process of writing new items.
ITEM SHELLS
The item shell technique is primarily intended for item writers who lack formal
item-writing training and experience in MC item writing. These item writers
often have great difficulty in starting to write the MC item, though they have considerable knowledge, skill, and experience in the subject matter for which they are preparing items. As its name suggests, the item shell is a skeletal item.
The item shell provides the syntactic structure of a MC item. The item writer
has to supply his or her content, but the stem or partial stem is supplied to give
the item writer a start in the right direction.
As reported earlier in this book, attempts to make item writing a science have
not yet been fruitful. An ambitious endeavor was Bormuth's (1970) algorith-
mic theory of item writing. He suggested a complex, item-writing algorithm
that transformed prose into MC test items. His theory of achievement test
item writing made item development more scientific and less subject to the
caprice and whims of idiosyncratic item writers. The problem, however, was
that the algorithm had too many steps that made its use impractical. Others
have tried similar methods with a similar lack of success, including facet theory and designs, item forms, and amplified objectives (see Roid & Haladyna, 1982).
The item shell was created out of a need for a more systematic method of
MC item writing in the direction of these earlier efforts. However, the item
shell also permits item writers freedom that, in turn, permits greater creativ-
ity in designing the item. The item shell is also seen as a more efficient pro-
cess for writing MC items than presently exists. The method simplifies
writing items.
One could take this item shell and substitute almost any concept or princi-
ple from any subject matter. Writing the stem is only one part of MC item writ-
ing, but often it is the most difficult part. Writing a correct option and several
plausible distractors is also difficult. Once we write the stem, an important part
of that item-writing job is done.
A limitation of the item shell technique is that you may develop an
abundance of items that all have the same syntactic structure. For instance, if you used only the shell in Example 7.1, every item would share that structure. Some test makers and test takers may perceive
this situation negatively. We want more variety in our items. The solu-
tion is to use a variety of item shells instead of generating many items
from a single shell.
Another limitation of the item shell is that it does not apply equally well
to all content. There are many instances where the learning task is specific
enough so that generalization to sets of similar items is simply not possible.
In these instances other techniques presented in this chapter may be more
fruitful.
There are two ways to develop item shells. The first and easiest way is to adopt
the generic shells presented in Example 7.2. These shells are nothing more
than item stems taken from successfully performing items. The content expert
should identify the facts, concepts, principles, or procedures being tested and
the type of cognitive behaviors desired (recalling, understanding, or applying
knowledge or skills).
Understanding
What are the main symptoms of...?
Comment: This item shell provides for the generation of a multitude
of items dealing with the symptoms of patient illnesses.
Predicting
What is the most common (cause or symptom) of a (patient
problem)?
Comment: This general item shell provides for a variety of
combinations that mostly reflects anticipating consequences or
cause-and-effect relationships arising from principles.
Understanding of concepts is also important for successful
performance on such items.
Applying Knowledge and Skills
Patient illness is diagnosed. Which treatment is likely to be most
effective?
Comment: This item shell provides for a variety of patient illnesses,
according to some taxonomy or typology of illnesses and
treatment options. Simply stated, one is the best. Another
questioning strategy is to choose the reason a particular
treatment is most effective.
Applying Knowledge and Skills
Information is presented about a patient problem. How should the
patient be treated?
Comment: The item shell provides information about a patient
disease or injury. The completed item will require the test taker to
make a correct diagnosis and to identify the correct treatment
protocol, based on the information given.
Steps 4 through 7 can be repeated for writing a set of items dealing with a
physician's treatment of people coming to the emergency department of a
hospital. The effectiveness of the item comes with the writing of plausible
distractors. However, the phrasing of the item, with the three variations,
makes it possible to generate many items covering a multitude of combina-
tions of ages, trauma injuries and complications, and types of injuries. The
item writer need not be concerned with the "trappings" of the item but can
instead concentrate on content. For instance, an experienced physician
who is writing test items for a credentialing examination might draw heavily
from clinical experience and use the item shell to generate a dozen different
items representing the realistic range of problems encountered in a typical
medical practice. In these instances, the testing events can be transformed
into context-dependent item sets.
An item shell for eighth-grade science is developed to illustrate the pro-
cess. The unit is on gases and their characteristics. The steps are as follows:
Oxygen
The last word in the stem can be replaced by any of a variety of gases, easily
producing many item stems. The difficult task of choosing a right answer and
several plausible distractors, however, remains.
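To make the mechanics of substitution concrete, the short Python sketch below fills a single gas-related stem shell with different gases. The stem wording, the gas list, and the function name are illustrative assumptions rather than material from the book's example; the point is only that one shell plus a substitution list yields a family of stems.

```python
# A minimal sketch of item-shell substitution. The stem wording, the gas
# list, and the function name are illustrative assumptions, not the book's
# example; the point is that one shell plus a substitution list yields a
# family of stems.
STEM_SHELL = "Which of the following is a characteristic of {gas}?"

GASES = ["oxygen", "hydrogen", "carbon dioxide", "helium", "nitrogen"]

def generate_stems(shell: str, substitutions: list[str]) -> list[str]:
    """Fill the shell with each substitution to produce a family of stems."""
    return [shell.format(gas=g) for g in substitutions]

for stem in generate_stems(STEM_SHELL, GASES):
    print(stem)
# The item writer must still supply a correct option and plausible
# distractors for each generated stem.
```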
Although the process of developing item shells may seem laborious, as illus-
trated in the preceding discussion, keep in mind that many of these seven steps
become automatic. In fact, once a good item shell is discovered, several steps
can be performed simultaneously.
Item shells can be used formally, as part of a careful item-development
effort, or informally for classroom testing. Clearly, the value of the item
shell lies in its versatility for generating items for different types of content
(facts, concepts, principles, and procedures) and cognitive operations.
A culling of stems from passage-related item sets that purport to measure com-
prehension of a poem or story can yield nice sets of item stems that have a ge-
neric quality. Rather than write original item stems, these generic stems can be
used to start your passage-related reading comprehension test items. Example
7.4 provides a list of stems that have been successfully used in MC items that
measure reading comprehension.
Poetry
What is the main purpose of this poem?
What is the theme of this poem?
Which of the following describes the mood of the poem?
What is the main event in this poem?
Which poetic device is illustrated in the following line? Possible
options include: allusion, simile, metaphor, personification.
Which of the following describes the basic metric pattern of the poem?
What does the language of the poem suggest?
What is the meaning of this line from the poem?
{Select a critical line}
What is the meaning of {select a specific term or
phrase from a line in the poem}?
Which describes the writing style of this poem/passage? Possible
answers include: plain, colorful, complex, conversational.
Which of the following would be the most appropriate title for this
poem/passage?
Which term best describes this poem? Possible answers include
haiku, lyric, ballad, sonnet.
As you increase your experience with these item stems, you may generate
new stems that address other aspects of understanding reflecting the curricu-
lum, what's taught, and, of course, what's tested.
According to Haladyna and Shindoll (1989), the item shell has several attrac-
tive features:
1. The item shell helps inexperienced item writers phrase the item in an ef-
fective manner because the item shell is based on a previously used and
successfully performing item.
Reading Passage/Story/Narrative
What is the main purpose of this selection?
What is the theme of this story?
What is the best title for this story?
What is the conflict in this story?
Which best describes the writing style of this story?
Which best describes the conflict in this story?
Which best summarizes this story?
Which statement from this story is a fact or opinion? {Choose
statements}
What is the meaning of the following {word, sentence, paragraph}?
Which best describes the ending of this story?
How should ... be defined?
Which point of view is expressed in this story? {first person, second
person, third person, omniscient}
Which best summarizes the plot?
Which best describes the setting for this story?
What literary element is represented in this passage? Possible
answers include foil, resolution, flashback, foreshadowing.
Who is the main character?
How does the character feel? {Select a character}
What does the character think? {Select a character}
How are {character A and character B} alike? different?
After {something happened}, what happened next?
What will happen in the future?
Why do you think the author wrote this story {part of the story}?
In summary, the item shell is a very useful device for writing MC items be-
cause it has an empirical basis and provides the syntactic structure for the con-
tent expert who wishes to write items. The technique is flexible enough to
allow a variety of shells fitting the complex needs of both classroom and large-
scale testing programs.
ITEM MODELING
Item modeling is a general term for a variety of technologies both old and new.
In chapter 11, the future of item modeling is discussed. Much of the theoretical
work currently under way may lead to validated technologies that will en-
able the rapid production of MC items. In this chapter, we deal with practical
methods of item modeling.
An item model provides the means for generating a set of items with a com-
mon stem for a single type of content and cognitive demand. An item model
not only specifies the form of the stem but in most instances also provides a ba-
sis for the creation of the correct answer and the distractors. The options con-
form to well-specified rules. With a single item model we can generate a large
number of similar items.
One rationale for item modeling comes from medical training and evalua-
tion. LaDuca (1994) contended that in medical practice we have used a be-
havioral-based, knowledge-skills model for discrete learning of chunks of
information. Traditional tests of medical ability view cognitive behavior as
existing in discrete parts. Each test item systematically samples a specific class
of behaviors. Thus, we have domain-referenced test score interpretations
that give us information about how much learning has occurred. Mislevy
(1993) referred to this mode of construct definition and the resulting tests as
representing low to high proficiency. Cognitive learning theorists maintain
that this view is outmoded and inappropriate for most professions (Shepard,
1991; Snow, 1993).
This point of view is consistent with modern reform movements in educa-
tion calling for greater emphasis on higher level thinking (Nickerson, 1989).
For nearly two decades, mathematics educators have promoted a greater em-
phasis on problem-solving ability, in fact arguing that problem solving is the
main reason for studying mathematics (Prawat, 1993). Other subject matters
are presented as fertile ground for problem-solving teaching and testing. In summary,
the impetus of school reform, coupled with advances in cognitive psychology,
is calling for a different view of learning and, in this setting, competence.
LaDuca (1994) submitted that competent practice resides in appropriate re-
sponses to the demands of the encounter.
LaDuca (1994) proposed that licensure tests for a profession ought to be
aimed at testing content that unimpeachably relates to effective practice. The
nature of each patient encounter presents a problem that needs an effective so-
lution to the attending physician. Conventional task analysis and role delinea-
tion studies identify knowledge and skills that are tangentially related to
competence, but the linkage is not so direct. In place of this approach is prob-
lem-solving behavior that hinges on all possible realistic encounters with pa-
tient problems.
LaDuca's (1994) ideas apply directly to professional credentialing testing,
but they may be adaptable to other settings. For instance, item modeling might
be used in consumer problems (e.g., buying a car or appliance, food shopping,
painting a house or a room, remodeling a house, fixing a car, or planning land-
scaping for a new home).
Facet 1: Setting
1. Unscheduled patients/clinic visits
2. Scheduled appointments
3. Hospital rounds
4. Emergency department
This first facet identifies five major settings involving patient encoun-
ters. The weighting of these settings may be done through studies of the pro-
fession or through professional judgment about the criticality of each
setting.
The second facet provides the array of possible physician activities, which
are presented in sequential order. The last activity, applying scientific concepts,
is disjointed from the others because it connects patient conditions with diag-
nostic data as well as disease or injury patterns and their complications. In
other words, it is the complex step in treatment that the other categories do not
conveniently describe.
The third facet provides four types of patient encounters, in three dis-
crete categories with two variations in each of the first two categories. Ex-
ample 7.6 is the resulting item showing the application of these three
facets.
The item in Example 7.6 has the following facets: (a) Facet 1: Setting—2.
Scheduled appointment; (b) Facet 2: Physician task—3. Formulating most
likely diagnosis; (c) Facet 3: Case cluster—1a. Initial work-up of new patient,
new problem. It is interesting that the item pinpoints a central task of diag-
nosis but necessarily involves the successful completion of the first two tasks.
Item modeling works best in areas that are quantifiable. Virtually all types
of mathematics content can be modeled. Example 7.7 presents an item
model that deals with probability from an elementary grade mathematics
curriculum.
Example 7.8 shows an item from this model where we have two red, four
yellow, and six blue pieces of candy in a bag. The context (jelly beans in a jar,
candy in a bag, marbles, or any object) can be specified as part of the model. As
you can see, the numbers of red, yellow, and blue objects can vary but probably
should not be equal. The options, including distractors, are created once num-
bers are chosen. These options involve the correct relationship as well as logi-
cal, plausible, but incorrect actions. The range of positive integers can be
varied as desired or needed for the developmental level of the students. The com-
plexity of the probability calculation can be increased by picking more than one
object or by including more than one color.
A bag contains two red, four yellow, and six blue pieces of candy.
What is the probability of reaching in the bag and picking a yellow
piece of candy (without peeking)?
A. 1/12
B. 1/4
C. 4/12
D. 4/6
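As an illustration of how such a model can be mechanized, the Python sketch below generates the stem, the key, and a few distractors for the candy problem. The distractor rules and all names in the code are assumptions introduced for this sketch, not the model as specified in Example 7.7.

```python
# A sketch of the probability item model: counts of colored objects are
# chosen, the key is the correct probability, and distractors follow simple
# but plausible error rules. The distractor rules and names here are
# illustrative assumptions, not the model as specified in the book.
from fractions import Fraction

def probability_item(red: int, yellow: int, blue: int, target: str = "yellow"):
    counts = {"red": red, "yellow": yellow, "blue": blue}
    total = sum(counts.values())
    key = Fraction(counts[target], total)                   # correct relationship
    distractors = {
        Fraction(1, total),                                 # ignores how many targets exist
        Fraction(counts[target], total - counts[target]),   # odds instead of probability
        Fraction(1, len(counts)),                           # "one color out of three"
    }
    distractors.discard(key)                                # never duplicate the key
    stem = (f"A bag contains {red} red, {yellow} yellow, and {blue} blue pieces "
            f"of candy. What is the probability of reaching in the bag and "
            f"picking a {target} piece of candy (without peeking)?")
    return stem, key, sorted(distractors)

stem, key, distractors = probability_item(2, 4, 6)
print(stem)
print("Key:", key, "Distractors:", distractors)
```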
Example 7.9 has a family of eight statements that constitute the item model.
We might limit this family to single-digit integers for a, b, x, and y. We would
not use zero as a value. Within these eight models, some items will be harder or
easier than others.
| 4 - 2 | - | 2 - 5 |=
A. 13
B. 5
C. -1
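A minimal Python sketch of this kind of model appears below, assuming the family consists of expressions of the form |a - b| - |x - y| with nonzero single-digit integers; the distractor rules are illustrative assumptions. For the printed item, these rules happen to reproduce 5 and 13 alongside the key of -1.

```python
# A sketch of the Example 7.9 model, assuming the family is built from
# expressions of the form |a - b| - |x - y| with nonzero single-digit
# integers. The distractor rules are illustrative assumptions.
import random

def abs_value_item(rng: random.Random):
    a, b, x, y = (rng.randint(1, 9) for _ in range(4))
    stem = f"|{a} - {b}| - |{x} - {y}| ="
    key = abs(a - b) - abs(x - y)
    distractors = {
        abs(a - b) + abs(x - y),      # adds instead of subtracting
        (a - b) - (x - y),            # ignores the absolute values
        a + b + x + y,                # treats every symbol as addition
    }
    distractors.discard(key)          # never duplicate the key
    return stem, key, sorted(distractors)

rng = random.Random(7)                # fixed seed keeps the sketch reproducible
for _ in range(3):
    print(abs_value_item(rng))
```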
KEY FEATURES
answer (write-in) or short menu, which involves choosing the answer from a
long list of possible right answers.
Step 4. Select Key Features for Each Problem. A key feature is a critical
step that will likely produce a variety of different choices by physicians. Some of
these choices will be good for the patient and some choices will not be good.
Not all patient problems will necessarily have key features. The key feature
must be difficult or likely to produce a variety of effective or ineffective choices.
Although the key feature is identified by one expert, other SMEs have to agree
about the criticality of the key feature. Key features vary from two to five for
each problem. Each key feature has initial information and an assigned task.
Example 7.11 gives key features for two problems.
Step 5. Select Case and Write Case Scenario. Referring back to the five
clinical situations stated in Step 3, the developer of the problem selects the
clinical situation and writes the scenario. The scenario contains all relevant in-
formation and includes several questions. As noted previously, the items can be
in an MC or CR format.
Step 6. Develop Scoring for the Results. Scoring keys are developed
that have a single right answer or multiple right answers. In some instances, candidates
can select from a list where some of their choices are correct or incorrect. The
SME committee develops the scoring weight and scoring rules for each case
scenario.
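As a sketch of what such a scoring rule might look like in practice, the following Python function awards weighted credit for correct selections and a penalty for incorrect ones. The weights, the penalty, the floor at zero, and the example history elements are all assumptions for illustration; they are not the scoring rules described by Page et al. (1995).

```python
# A sketch of one possible scoring rule for a key-feature case: weighted
# credit for correct selections, a penalty for incorrect ones, and a floor
# at zero. The weights, penalty, and history elements are assumptions for
# illustration, not the rules described by Page et al. (1995).
def score_key_feature(selected: set[str], correct: dict[str, float],
                      penalty: float = 0.5) -> float:
    """Sum weights of correct selections, subtract a penalty per incorrect
    selection, and never let the feature score fall below zero."""
    earned = sum(weight for option, weight in correct.items() if option in selected)
    wrong = len(selected - set(correct))
    return max(0.0, earned - penalty * wrong)

# Hypothetical key feature with three weighted correct history elements.
correct_elements = {"onset of pain": 1.0, "radiation of pain": 1.0,
                    "history of smoking": 0.5}
print(score_key_feature({"onset of pain", "recent travel"}, correct_elements))
# -> 0.5 (one correct selection earns 1.0; one incorrect selection costs 0.5)
```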
Step 7. Conduct Pilot Testing. As with any high-stakes test, pilot testing
is critical. This information is used to validate the future use of the case in a for-
mal testing situation.
Step 8. Set Standards. As with any high-stakes test with a pass-fail deci-
sion, standards should be set. Page et al. (1995) recommended a variety of stan-
dard-setting techniques to be used.
Example 7.12 is an example of a key features item from Page et al. (1995). The
problem comes from the domain of patient problems identified in Step 1. The
key features for this problem are given in Example 7.11. The scenario has two
related questions. The first question addresses preliminary diagnoses. The
medical problem solver should be able to advance three plausible hypotheses
about the origin of the patient's complaint. Note that the diagnosis item is in an
open-ended format, but this could be easily converted into an MC format. The
second item is in an MR format where the candidate must select up to seven el-
ements of history to help in the diagnosis.
According to Page et al. (1995), the key feature approach has many benefits. First, patient
problems are chosen on the basis of their criticality. Each is known to pro-
duce a difficult, discriminating key feature that will differentiate among can-
didates for licensure of varying ability. Second, the key feature problems are
short so that many can be administered. Because reliability is a primary type
of validity evidence, key feature items should be numerous and highly
intercorrelated. Third, there is no restriction or limitation to item format. We
can use MC or CR formats. If cuing is a threat to validity, the CR format can
be used. Scoring guides can be flexible and adaptable to the situation. Finally,
the domain of patient problems is definable in absolute ways so that candi-
dates for licensure can be trained and tested for inference to the large intended
domain of patient problems.
The Clinical Reasoning Skills test of the Medical Council of Canada provides a
nice key feature example on its web page: www.mcc.ca. The key feature item-genera-
tion process is intended for testing programs or educational systems where clini-
cal problem solving is the main activity. This process is not for measuring
knowledge and skills. A strong commitment is needed to having SMEs define the
domain, develop the examination blueprint, identify the clinical situations to be
assessed, identify the key features that are relevant to the problem, and select a
case scenario to represent the problem. Examples of key features applied to other
professions would encourage others to experiment with this method. A strong,
related program of research that validates the resulting measures would increase
the use of the key feature approach. For instance, Hatala and Norman (2002) adapted key fea-
tures for a clinical clerkship program. The 2-hour examination produced a reli-
ability estimate of 0.49, which is disappointing. These researchers found low
correlations with other criteria even when corrected for unreliability.
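For readers who want the arithmetic behind such a correction, the standard correction for attenuation is shown below; the symbols are generic rather than taken from Hatala and Norman's report, and the numerical illustration that follows uses assumed values.

```latex
% Correction for attenuation: the observed correlation r_{xy} between the
% examination (x) and a criterion (y), adjusted for the unreliability of both.
r_{x'y'} = \frac{r_{xy}}{\sqrt{r_{xx}\, r_{yy}}}
```

With a reliability of .49 for the examination and an assumed criterion reliability of .80, an observed correlation of .20 corrects only to about .32, so the adjustment does not rescue a weak observed relationship.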
Overall, the key feature approach has to be a strong contender among other
approaches to modeling the higher level thinking that is sought in testing competence.
Chapter 4 presents and illustrates the item set as a means for testing various
types of complex thinking. The item set format is becoming increasingly popu-
lar because of its versatility. Testing theorists are also developing new models
for scoring item sets (Thissen & Wainer, 2001). The item set appears to have a
bright future in MC testing because it offers a good opportunity to model vari-
ous types of higher level thinking that are much desired in achievement testing
programs.
This section uses the concept of item shells in a more elaborate format, the
generic item set. This work is derived principally from Haladyna (1991) but
also has roots in the earlier theories of Guttman and Hively, which are dis-
cussed in Roid and Haladyna (1982). The scenario or vignette is the key to the
item set, and like the item model suggested by LaDuca (1994), if this scenario
or vignette has enough facets, the set of items flows naturally and easily from
each scenario or vignette.
The method is rigid in the sense that it has a structure. But this is important
in facilitating the development of many relevant items. On the other hand, the
item writer has the freedom to write interesting scenarios and identify factors
within each scenario that may be systematically varied. The generic questions
also can be a creative endeavor, but once they are developed they can be used
for variations of the scenario. The writing of the correct answer is straightfor-
ward, but the writing of distractors requires some inventiveness.
As noted earlier and well worth making the point again, item sets have a
tendency for interitem cuing. In technical terms, this is called local depend-
ence (Hambleton, Swaminathan, & Rogers, 1991), and the problem is signifi-
cant for item sets (see Haladyna, 1992a; Thissen et al., 1989). Item writers
have to be careful when writing these items to minimize the tendency for
examinees to benefit from other items appearing in the set. This is why it is rec-
ommended that not all possible items in the set should be used for each set at
any one time.
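One rough way to screen for this problem, offered here only as an illustrative sketch and not as the formal local dependence statistics cited above, is to correlate the residuals of a pair of items after removing what the rest of the test predicts. The response data, function name, and simple regression approach below are assumptions for demonstration.

```python
# A rough, illustrative screen for interitem cuing within an item set: for a
# pair of items, remove what the rest-score predicts and correlate the
# residuals. This is only a crude stand-in for the formal local dependence
# statistics cited in the text, and the response data here are invented.
import numpy as np

def residual_correlation(scores: np.ndarray, i: int, j: int) -> float:
    """scores: an examinees-by-items matrix of 0/1 item scores."""
    rest = scores.sum(axis=1) - scores[:, i] - scores[:, j]

    def residual(col: np.ndarray) -> np.ndarray:
        slope, intercept = np.polyfit(rest, col, 1)
        return col - (slope * rest + intercept)

    return float(np.corrcoef(residual(scores[:, i]), residual(scores[:, j]))[0, 1])

rng = np.random.default_rng(0)
data = (rng.random((200, 6)) < 0.6).astype(int)     # invented 0/1 responses
print(round(residual_correlation(data, 0, 1), 3))   # near zero when no cuing exists
```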
The generic item set seems to apply well to quantitative subjects, such as sta-
tistics. But like item modeling, it does not seem to apply well to
nonquantitative content. These item sets have been successfully used in na-
tional licensing examinations in accountancy, medicine, nursing, and phar-
macy, among others. Haladyna (1991) provided an example in art history.
Therefore, there seems to be potential for other types of content.
The production of test items that measure various types of higher level think-
ing is problematic. Item shells presented in this chapter lessen this problem.
With the problem-solving-type item set introduced in chapter 4, a systematic
method for producing large numbers of items for item sets using shell-like struc-
tures has been developed (Haladyna, 1991). This section provides the concept
and methods for developing item shells for item sets.
Generic Scenario
With each scenario, a total of 10 test items is possible. With the develop-
ment of this single scenario and its four variants, the item writer has created a
total of 40 test items. Some item sets can be used in an instructional setting for
practice, whereas others should appear on formative quizzes and summative
tests. For formal testing programs, item sets can be generated in large quantities
to satisfy needs without great expense.
Example 7.15 presents a fully developed item set. This set is unconventional
because it contains a subset of MTF items. Typically not all possible items from
an item set domain would be used in a test for several reasons. One, too many
items are possible, and the number might exceed what is called for in the test spec-
ifications. Two, item sets are best confined to a single page or facing pages in a
test booklet. Three, item sets are known to have interitem cuing, so that the
use of all possible items may increase undesirable cuing. With the scenario pre-
sented in Example 7.15, you can see that introducing small variations in the
sample size, the correlation coefficient, and its associated probability, and using
a directional test can essentially create a new problem.
An item set can be created for a small business where items are sold for
profit. For instance, Sylvia Vasquez has a cell phone business at the
Riverview Mall. Help her figure out how her business is doing. Example 7.16
provides a data table for this hypothetical small business. Note that the
product can vary in many ways. For example, the name can be changed, and
the owner of the business can sell caps, ties, earrings, candy, magazines, or
tee shirts. All the numbers can be adjusted by SMEs to create profitable and
unprofitable situations.
Columns: A. Type of Cell Phone; B. Number Bought; C. Unit Cost; D. Selling
Price; E. Number Sold; F. Amount Received
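The sketch below shows how such a data table might be generated and varied. The product names and numbers are invented, and the rules that amount received equals selling price times number sold and that profit equals amount received minus the cost of units bought are assumptions about the intended arithmetic, not material from Example 7.16.

```python
# A sketch of how the data table for this item set might be generated and
# varied. The product names and numbers are invented, and the rules that
# amount received = selling price x number sold and profit = amount received
# minus the cost of units bought are assumptions about the intended arithmetic.
def business_rows(products):
    rows = []
    for name, bought, unit_cost, price, sold in products:
        amount_received = price * sold
        profit = amount_received - unit_cost * bought
        rows.append((name, bought, unit_cost, price, sold, amount_received, profit))
    return rows

products = [
    ("Basic flip phone", 40, 25.00, 49.00, 31),
    ("Camera phone", 25, 60.00, 99.00, 12),
]
for row in business_rows(products):
    print(row)
# SMEs can tune these numbers to create profitable and unprofitable lines,
# which supports a family of items (profit, loss, best seller, and so on).
```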
Generic item sets have a potential for modeling higher level thinking that flows
more directly from the complex performance item. Despite the advocacy in this
book for MC, the generic item set makes no assumption about the test item for-
mat. Certainly, a CRformat could be used for these vignettes. However, the ge-
neric item set technique is well suited to simulating complex thinking with the
objectively scorable MC items. The adaptability of MC to scenario-based item
sets may be the chief reason so many credentialing testing programs are using
item sets. Item writers with their rich background and experience can draw
from this resource to write a scenario and then develop or adapt existing test
items to determine whether a candidate for certification or licensure knows
what to do to achieve the outcome desired. For course material, the generic
item sets provide a good basis for generating large numbers of test items for vari-
ous purposes: formative and summative testing and even test preparation or
homework.
In most circumstances, SMEs may rightfully assert that the complex perfor-
mance item has greater fidelity to what exactly we want to test, but we are will-
ing to sacrifice a little fidelity for greater efficiency. If we are willing to make this
compromise, we can take a perfectly good performance item and convert it into
an MC format.
Example 7.18 was taken from a fourth-grade 2000 NAEP item block in science
(http://nces.ed.gov/nationsreportcard). This question measures basic knowl-
edge and understanding of the following: Look at the two containers of water
with a thermometer in each one. Because this is a basic knowledge question that
tests the mental behavior of understanding, it converts nicely into an item set.
Example 7.19 is easy to replicate. A label from any food product can be ob-
tained. A set of items can be written by SMEs that probe the student's reading
comprehension or application of knowledge to solve a problem. Items can be
written to address health and diet issues that may be part of another curriculum
because modern education features integrated learning units that are cross-
disciplinary.
SUMMARY
Item shells, item modeling, key features, generic item sets, and CR item format
conversions are discussed, illustrated, and evaluated. These methods have
much in common because each is intended to speed up item development and
provide a systematic basis for creating new MC test items.
The item shell technique is merely prescriptive. It depends on using existing
items. Item modeling has the fixed structure of item stems that allows for a do-
main definition of encounters, but each item model tests only one type of con-
tent and cognitive operation. Key features depends on an expert committee
and has a systematic approach that links training with licensure testing. The
generic item set approaches item modeling in concept but has a fixed question-
ing structure. Item format conversions provide a basis for taking CR items and
creating MC items that appear to have the same or a similar cognitive demand.
The advantage is that the latter is objectively scorable. Thus, we give up a little
fidelity for greater efficiency.
Each item-generating method in this chapter has potential for improving
the efficiency of item writing as we know it. The greatest reservation in using
any item-generation method is the preparation required at the onset of
item-writing training. SMEs need to commit to an item-generation method
and use their expertise to develop the infrastructure needed for item genera-
tion, regardless of which item-generation method is used. Although item shells
and item modeling have much in common, further developments will probably
One hot, sunny day Sally left two
buckets of water out in the sun.
The two buckets were the same
except that one was black and one
was white. At the end of the day,
Sally noticed that the water in the
black bucket felt warmer than the
water in the white bucket. Sally
wondered why this happened, so
the next day she left the buckets of
water out in the hot sun again. She
made sure that there was the same
amount of water in each bucket. This time she carefully measured
the temperature of the water in both buckets at the beginning of the
day and at the end of the day. The pictures show what Sally found.
1. Which of the two containers has the hottest water before
sitting in the sun?
A. Black
B. White
C. They are both the same temperature.
2. Which of the two containers has the hottest water after sitting
in the sun?
A. Black
B. White
C. They are both the same.
Jay had two bags of chips. He's concerned about his diet. So he
looked at the label on one of these bags. Nutrition Facts: Serving
size: 1 bag—28 grams. Amount Per Serving: Calories 140 calories
from fat 70. Ingredients: potatoes, vegetable oil, salt
Total Fat 8 g. 12%
Saturated Fat 1.5 g. 8%
Cholesterol 0 mg. 0%
Sodium 160 mg. 7%
Total Carbohydrates 16 g. 5%
Dietary Fiber 1 g. 4%
Sugars 0 g.
Protein 2 g.
favor item modeling because of its inherent theoretical qualities that strike at
the foundation of professional competence. Key features have potential for
item generation but in a specific context, such as patient treatment and clinical
problem solving. Its applicability to other professions and general education
remains to be shown. Generic item sets work well in training or education, es-
pecially for classroom testing. Their applicability to testing programs may be lim-
ited because too many item sets appear repetitious and may cue test takers.
Adapting CR items to MC formats is a simple, direct way to make scoring ob-
jective while keeping the higher cognitive demand intended.
As the pressure to produce high-quality test items that measure more than
recall increases, we will see increased experimentation and new developments
in item generation. Wainer (2002) estimated the cost of developing a new item
for a high-quality testing program as high as $1,000. With more computer-
based and computer-adaptive testing, we will see heavier demands for
high-quality MC items. Item generation will have a bright future if items can be
created with the same quality as or better than those produced by item writers.
Test content that has a rigid structure can be more easily transformed via item-
generation methods, as the many methods discussed in Irvine and Kyllonen
(2002) show. Theories of item writing that feature automated item generation are
much needed for content involving the ill-structured problems that we commonly
encounter in all subject matters and professions. Until the day that such theo-
ries are transformed into technologies that produce items that test problem
solving in ill-structured situations, the simpler methods of this chapter should
help item writers generate items more efficiently than the traditional way of
grinding out one item after another.
III
Validity Evidence Arising
From Item Development and
Item Response Validation
A central premise in this book is that item response interpretations or uses are
subject to validation in the same way that test scores are subject to validation.
A parallelism exists between validation pertaining to test scores and validation
pertaining to item responses. Because item responses are aggregated to form
test scores, validation should occur for both test scores and item responses.
Also germane, a primary source of validity evidence supporting any test
score interpretation or use involves test items and responses to test items.
Thus, the study of items and item responses becomes an important part of test
score validation. Part of this validity evidence concerning items and item re-
sponses should be based on the quality of test items and the patterns of item re-
sponses that are elicited by these items during a testing session (Downing &
Haladyna, 1997). The three chapters in this section address complementary
aspects of this item response validation process. Chapter 8 discusses the kinds
of validity evidence that come from following well-established procedures in
test development governing item development. Chapter 9 discusses the study of
item responses that is commonly known as item analysis. Chapter 10 provides
more advanced topics in the study of item responses. The procedures of chapter
8 coupled with the studies described in chapters 9 and 10 provide a body of evi-
dence that supports this validity argument regarding test score interpretation
and use. Thus, the collecting and organizing of evidence supporting the validity of
item responses seems crucial to the overall evaluation of validity that goes on in
validation.
8
Validity Evidence Coming From
Item Development Procedures
OVERVIEW
TABLE 8.1
Standards Applying to Item Development
3.6. The types of items, the response formats, scoring procedures, and test
administration procedures should be selected based on the purposes of the test,
the domain to be measured, and the intended test takers. To the extent possible,
test content should be chosen to ensure that intended inferences from test scores
are equally valid for members of different groups of test takers. The test review
process should include empirical analyses and, when appropriate, the use of expert
judges to review items and response formats. The qualifications, relevant
experiences, and demographic characteristics of expert judges should also be
documented.
3.7. The procedures used to develop, review, and try out items, and to select items
from the item pool should be documented. If items were classified into different
categories or subtests according to the test specifications, the procedures used for
the classification and the appropriateness and accuracy of the classification should
be documented.
3.8. When item tryouts or field tests are conducted, the procedures used to select
the sample(s) of test takers for item tryouts and the resulting characteristics of
the sample(s) should be documented. When appropriate, the sample(s) should
be as representative as possible of the population(s) for which the test is
intended.
3.11. Test developers should document the extent to which the content domain of a
test represents the defined domain and test specifications.
6.4. The population for whom the test is intended and the test specification should be
documented. If applicable, the item pool and scale development procedures
should be described in the relevant test manuals.
7.4. Test developers should strive to identify and eliminate language, symbols, words,
phrases, and content that are generally regarded as offensive by members of racial,
ethnic, gender, or other groups, except when judged to be necessary for adequate
representation of the domain.
In the first part of this chapter, several overarching concerns and issues are
discussed. These are content definition, test specifications, item writer train-
ing, and security. In the second part, seven complementary item review activ-
ities are recommended for any testing program. These include the following:
(a) adhering to a set of item-writing guidelines, (b) assessing the cognitive de-
mand of each item, (c) assessing the content measured by each item, (d) edit-
ing the item, (e) assessing potential sensitivity or unfairness of each item, (f)
checking the correctness of each answer, and (g) conducting a think-aloud,
where test takers provide feedback about each item.
Content Definition
The term content validity was traditionally used to draw attention to the impor-
tance of content definition and the many activities ensuring that the content
of each item is systematically related to this definition. Messick (1989) has ar-
gued that because content is a property of tests rather than of test scores, content
validity has no relevance. Content-related evidence seems a more appropriate
perspective (Messick, 1995b).
Therefore, content is viewed as an important source of validity evidence.
The Standards for Educational and Psychological Testing (AERA et al., 1999)
make many references to the importance of content in the validation of any
test score interpretation or use. The parallelism between test scores and items
is made in chapter 1 and is carried out here. Each item has an important con-
tent identity that conforms to the test specification. Expert judgment is needed
to ensure that every item is correctly classified by content.
Classroom Testing. For this type of testing, the instructional objective has
long served as a basis for both defining learning and directing the content of
tests. States have developed content standards replete with lists of perfor-
mance objectives. Terminology may vary. Terms such as objectives, instructional
objectives, behavioral objectives, performance indicators, amplified objectives, and
learner outcomes are used. The quintessential Mager (1962) objective is shown
in Table 8.2.
Some interest has been expressed in developing cognitive abilities, such as
reading and writing. Whereas there is still heavy reliance on teaching and
testing atomistic aspects of the curriculum that the objective represents, the
measurement of a cognitive ability requires integrated performance that may
involve reading, writing, critical thinking, problem solving, and even creative
thinking. MC items may not be able to bear the load for such complex behav-
ior. But arguments have been made, and examples are presented in many chapters of
this book, showing attempts to measure complex cognitive behavior using the
MC format.
TABLE 8.2
Anatomy of an Objective
Large-Scale Testing. Testing programs may have different bases for defin-
ing content. In professional certification and licensing testing programs,
knowledge and skills are identified on the basis of surveys of the profession.
These surveys are often known as role delineation, job analysis, task analysis, or
professional practice analysis (Raymond, 2001). Respondents rate the impor-
tance or criticality of knowledge and skills to professional practice. Although
candidates for certification and licensure must meet many criteria to earn a
board's credential, an MC test is often one of these criteria. These tests typi-
cally measure professional and basic science knowledge related to professional
practice. The source of content involves expert judgment. No matter what
type of testing, consensus among SMEs is typically used for establishing the
content for a test.
Test Specifications
A systematic process is used to take test content and translate it into test speci-
fications stating how many items will be used and which content topics and
cognitive processes will be tested. Kane (1997) described ways that we can
establish the content and cognitive process dimensions of our test specifica-
tions. Generally, the effort to create test specifications again rests on expert
judgment. As Messick (1995b) expressed, test specifications provide bound-
aries for the domain to be sampled. This content definition is operationalized
through test specifications. Most measurement textbooks discuss test specifi-
cations. They generally have two dimensions: content and cognitive processes.
Chapter 2 discusses a simple classification system for cognitive processes that is
consistent with current testing practices.
Item-Writing Guide
For a testing program, it is a common practice to have a booklet that every item
writer receives that discusses the formats that will be used, the guidelines that
will be followed, examples of well- and poorly written items, a classification sys-
tem for items, directions on submitting and reviewing items, and other salient
information to help future item writers.
For testing programs, the expertise of item writers is crucial to the testing pro-
gram's success. Downing and Haladyna (1997) argued that a key piece of valid-
ity evidence is this expertise. The credentials of these item writers should
enhance the reputation of the testing program. Generally, volunteers or paid
item writers are kept for appointed terms, which may vary from 1 to 3 years.
Once they are trained, their expertise grows; therefore, it is advantageous to
have these item writers serve for more than 1 year.
For any kind of test, item writers should be trained in the principles of item writ-
ing, as expressed throughout this book and consistent with the item-writing
guide. This training need not take a long time, as content experts can learn the
basics of item writing in a short time. Trainees should learn the test specifica-
tions and the manner in which items are classified by content and cognitive be-
havior. Participants in this training should have supervised time to write items
and engage in collegial review.
Security
REVIEWING ITEMS
In this part of the chapter, seven interrelated, complementary reviews are de-
scribed that are highly recommended for all testing programs. Performing
each activity provides a piece of validity evidence that can be used to support
the validity of interpreting and using both test scores and item responses.
whether items were properly written. The guidelines are really advice based on
a consensus of testing experts; therefore, we should not think of these guide-
lines as rigid laws of item writing but as friendly advice. However, in any
high-stakes testing program, it is important to adopt a set of guidelines and ad-
here to them strictly. Once these guidelines are learned, the detection of
item-writing errors is a skill that can be developed to a high degree of profi-
ciency. Items should be revised accordingly. Violating these guidelines usually
results in items that fail to perform (Downing, 2002). Following these guide-
lines should result in a test that not only looks better but is more likely to per-
form according to expectations.
Table 5.1 (in chapter 5) summarizes these guidelines. A convenient and ef-
fective way to use Table 5.1 in reviewing items is to use each guideline number
as a code for items that are being reviewed. The people doing the review can
read each item and enter the code on the test booklet containing the offending
item. Such information can be used by the test developers to consider redraft-
ing the item, revising it appropriately, or retiring the item. As mentioned previ-
ously, these guidelines are well grounded in an expert opinion consensus, but,
curiously, research is not well enough advanced to cover many of these guide-
lines. Thus, the validity of each guideline varies.
Review 3: Content
The central issue in content review is relevance. In his influential essay on va-
lidity, Messick (1989) stated:
Judgments of the relevance of test items or tasks to the intended score interpre-
tation should take into account all aspects of the testing procedure that signifi-
cantly affect test performance. These include, as we have seen, specification of
the construct domain of reference as to topical content, typical behaviors, and
underlying processes. Also needed are test specifications regarding stimulus for-
mats and response alternatives, administration conditions (such as examinee in-
structions or time limits), and criteria for item scoring. (p. 276)
As Popham (1993) pointed out, the expert judgment regarding test items has
dominated validity studies. Most classroom testing and formal testing pro-
grams seek a type of test score interpretation related to some well-defined con-
tent (Fitzpatrick, 1981; Kane, 2002; Messick, 1989). Under these conditions,
content is believed to be definable in terms of a domain of knowledge (e.g., a set
of facts, concepts, principles, or procedures). Under these circumstances, each
test is believed to be a representative sample of the total domain of knowledge.
As Messick (1989) noted, the chief concerns are clear construct definition, test
specifications that call for the sample of content desired, and attention to the
test item formats and response conditions desired. He further added adminis-
tration and scoring conditions to this area of concern.
As noted by Popham (1993) previously, the technology involves the use
of content experts, persons intimate with the content who are willing to re-
view items to ensure that each item represents the content and level of cog-
nitive behavior desired. The expert or panel of experts should ensure that
each item is relevant to the domain of content being tested and is properly
identified as to this content. For example, if auto mechanics' knowledge of
brakes is being tested, each item should be analyzed to figure out if it belongs
to the domain of knowledge for which the test is designed and if it is cor-
rectly identified.
Although this step may seem tedious, it is sometimes surprising to see items
misclassified by content. With classroom tests designed to measure student
achievement, students can easily identify items that are instructionally irrele-
vant. In formal testing programs, many detection techniques inform users
about items that may be out of place. This chapter discusses judgmental con-
tent review, whereas chapters 9 and 10 discuss statistical methods.
Methods for performing the content review were suggested by Rovinelli and
Hambleton (1977). In selecting content reviewers, these authors made the fol-
lowing excellent points:
1. Can the reviewers make valid judgments regarding the content of items?
2. Is there agreement among reviewers?
3. What information is sought in the content review?
4. What factors affect the accuracy of content judgments of the reviewers?
5. What techniques can be used to collect and analyze judgments?
Regarding the last point, the authors strongly recommended using the simplest
method available.
Toward that end, the review of test items can be done in formal testing pro-
grams by asking each content specialist to classify the item according to an item
classification guide. Rovinelli and Hambleton (1977) recommended a simple
3-point rating scale:
Topics
Behavior Watering Fertilizing Soil Total
Recalling knowledge 15% 15% 10% 40%
Understanding knowledge 10% 10% 10% 30%
Applying knowledge 15% 5% 10% 30%
Total 40% 30% 30% 100%
FIG. 8.1. Test specifications for the Azalea Growers' Certification Test.
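As a simple illustration of how a blueprint like Fig. 8.1 is operationalized, the Python sketch below converts the percentages into item counts; the 60-item test length and the rounding rule are assumptions for the example, not part of the figure.

```python
# A sketch of turning the Fig. 8.1 percentages into item counts. The 60-item
# test length and the rounding rule are assumptions for the illustration.
blueprint = {
    ("Recalling", "Watering"): 0.15, ("Recalling", "Fertilizing"): 0.15,
    ("Recalling", "Soil"): 0.10,
    ("Understanding", "Watering"): 0.10, ("Understanding", "Fertilizing"): 0.10,
    ("Understanding", "Soil"): 0.10,
    ("Applying", "Watering"): 0.15, ("Applying", "Fertilizing"): 0.05,
    ("Applying", "Soil"): 0.10,
}
TEST_LENGTH = 60
assert abs(sum(blueprint.values()) - 1.0) < 1e-9       # weights must total 100%
counts = {cell: round(weight * TEST_LENGTH) for cell, weight in blueprint.items()}
print(counts)     # e.g., 9 recalling-of-watering items, 3 applying-fertilizing items
```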
Reviewers
Item Original Classification #1 #2 #3
82 Watering 3 3 3
83 Fertilizer 1 1 1
84 Soil 1 2 1
85 Soil 2 3 2
86 Light 1 1 1
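A brief Python illustration of "the simplest method available" applied to the ratings above follows; because the anchors of the 3-point scale are not reproduced in this excerpt, the sketch only summarizes the mean rating and reviewer agreement for each item rather than judging congruence.

```python
# An illustrative tally of the reviewer ratings above, in the spirit of using
# "the simplest method available." The anchors of the 3-point scale are not
# reproduced in this excerpt, so the sketch only summarizes the mean rating
# and whether the reviewers agree; it does not decide which items are congruent.
ratings = {
    82: ("Watering", [3, 3, 3]),
    83: ("Fertilizer", [1, 1, 1]),
    84: ("Soil", [1, 2, 1]),
    85: ("Soil", [2, 3, 2]),
    86: ("Light", [1, 1, 1]),
}
for item, (classification, rs) in ratings.items():
    mean_rating = sum(rs) / len(rs)
    note = "unanimous" if len(set(rs)) == 1 else "reviewers disagree; discuss"
    print(f"Item {item} ({classification}): mean = {mean_rating:.2f}, {note}")
```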
The science of content review has been raised beyond merely expert judg-
ment and simple descriptive indicators of content agreement. Crocker, Llabre,
and Miller (1988) proposed a more sophisticated system of study of content rat-
ings involving generalizability theory. They described how the theory can be used to
generate a variety of study designs that not only provide indexes of content-
rater consistency but also identify sources of inconsistency. In the context of a
high-stakes testing program, procedures such as the one they recommend are
more defensible than simplistic content review procedures.
Content review has been a mundane aspect of test design. As Messick
(1989) noted, although most capable test development includes these impor-
tant steps, we do not have much systematic information in the literature that
informs us about what to use and how to use it. Hambleton (1984) provided a
comprehensive summary of methods for validating test items.
Review 4: Editorial
No matter the type of testing program or the resources available for the devel-
opment of the test, having each test professionally edited is desirable. The edi-
tor is someone who is usually formally trained in the canons of English grammar
and composition.
There are several good reasons for editorial review. First, edited test items
present the cognitive tasks in a clearer fashion than unedited test items. Editors
pride themselves on being able to convert murky writing into clear writing
without changing the content of the item. Second, grammatical, spelling, and
punctuation errors tend to distract test takers. Because great concentration is
needed on the test, such errors detract from the basic purpose of testing, to find
the extent of knowledge of the test taker. Third, these errors reflect badly on
the test maker. Face validity is the tendency for a test to look like a test. If there
are many errors in the test, the test takers are likely to think that the test falls
short in the more important areas of content and item-writing quality. Thus,
the test maker loses the respect of test takers. Several areas of con-
cern in the editorial review are shown in Table 8.3.
A valuable aid in testing programs is an editorial guide. This document is
normally several pages of guidelines about acceptable formats, accepted abbre-
viations and acronyms, style conventions, and other details of item prepara-
tion, such as type font and size, margins, and so on. For classroom testing,
consistency of style is important.
TABLE 8.3
Areas of Concern in the Editorial Review
There are some excellent references that should be part of the library of a
test maker, whether professional or amateur. These appear in Table 8.4.
A spelling checker on a word processing program is also handy. Spelling
checkers have resident dictionaries for checking the correct spelling of many
words. However, the best feature is the opportunity to develop an exception
spelling list, where specialized words not in the spelling checker's dictionary
can be added. Of course, many of these types of words have to be verified first
from another source before each word can be added. For example, if one
works in medicine or in law, the spelling of various medical terms can be
checked in a specialized dictionary, such as Stedman's Medical Dictionary for
which there is a Web site (http://www.stedmans.com/), which also has a
CD-ROM that checks more than half a million medical phrases and terms.
Another useful reference is Black's Law Dictionary (Garner, 1999).
Fairness has been an important issue in test development and in the use of test
scores. Chapter 7 in the Standards for Educational and Psychological Testing is de-
voted to fairness. Standard 7.4 in that chapter asserts:
Test developers should strive to identify and eliminate language, symbols, words,
phrases, and content that are generally regarded as offensive by members of ra-
cial, ethnic, gender, or other groups, except when judged to be necessary for adequate
representation of the domain. (AERA
et al., 1999, p. 82)
TABLE 8.4
References on Grammar, Composition, & Style
Gibaldi, J. (1999). The MLA handbook for writers of research papers (5th ed.). New
York: Modern Language Association of America.
The American Heritage Book of English usage: A practical and authoritative guide to
contemporary English. (1996). Boston: Houghton Mifflin.
American Psychological Association. (2001). Publication manual of the American
Psychological Association (5th ed.). Washington, DC: Author.
Strunk, W., Jr., & White, E. B. (2000). The elements of style (4th ed.). Boston: Allyn &
Bacon and Longman.
The Chicago manual of style (14th ed.). (1993). Chicago: University of Chicago Press.
Warriner, J. E. (1988). English grammar and composition: Complete course. New York:
Harcourt, Brace, & Jovanovich.
Fairness review generally refers to two activities. The first is a sensitivity re-
view aimed at test items that potentially contain material that is sexist, rac-
ist, or otherwise potentially offensive or negative to any group. The second
is an analysis of item responses to detect differential item functioning,
which is discussed in chapter 10. We should think of fairness as a concern for
all of testing.
This section focuses on this first fairness review, often referred to as the sensi-
tivity review. Chapter 10 has a section on the second type of fairness activity,
DIE The sensitivity review concerns stereotyping of groups and language that
may be offensive to groups taking the test.
The Educational Testing Service has recently issued a new publication on
fairness (2003) (http://www.ets.org/fairness/download.html). Since 1980, Ed-
ucational Testing Service has led the testing industry by issuing and continu-
ously updating guidelines. It has exerted a steadying influence on the testing
industry to be more active in guarding the content of tests in an effort
to avoid negative consequences that arise from using content that might offend
test takers.
We have many good reasons for being concerned about fairness and sensitiv-
ity. First and foremost, Zoref and Williams (1980) noted a high incidence of
gender and ethnic stereotyping in several prominent intelligence tests. They
cited several studies done in the 1970s where similar findings existed for
achievement tests. To what extent this kind of bias exists in other standardized
tests currently can only be a matter of speculation. However, any incidence of insensitive
content should be avoided.
Second, out of humanistic concern, all test makers should ensure that items do
not stereotype diverse elements of our society. Stereotyping is inaccurate be-
cause of overgeneralization. Stereotyping may cause adverse reactions from
test takers during the test-taking process.
Table 8.5 provides some criteria for judging item sensitivity, adapted from
Zoref and Williams (1980). Ramsey (1993) urged testing personnel to identify
committees to conduct sensitivity reviews of test items and to provide training
to committee members.
He recommended four questions to pose to committee members:
• Is there a problem?
• If there is, which guideline is violated?
• Can a revision be offered?
• Would you sign off on the item if no revision was made? In other words,
how offensive is the violation?
TABLE 8.5
A Typology for Judgmental Item Bias Review
Gender

Representation: Items should be balanced with respect to gender representations. Factors to consider include clothes, length of hair, facial qualities, and makeup. Nouns and pronouns should be considered (he/she, woman/man).

Characterization: Two aspects of this are role stereotyping (RS) and apparel stereotyping (AS). Male examples of RS include any verbal or pictorial reference to qualities such as intelligence, strength, vigor, ruggedness, historical contributions, mechanical aptitude, professionalism, and/or fame. Female examples of RS include depicting women in domestic situations, passiveness, weakness, general activity, excessive interest in clothes or cosmetics, and the like.

AS is viewed as the lesser of the two aspects of characterization. AS refers to clothing and other accouterments that are associated with men and women, for example, neckties and cosmetics. This latter category is used to support the more important designation of the former category in identifying gender bias in an item.

Race/Ethnic

Representation: Simply stated, if the racial or ethnic identity of characters in test items is present, it should resemble the demographics of the test-taking population.

Characterization: White characters in test items may be stereotypically presented in leadership roles, wealthy, professional, technical, intelligent, academic, and the like. Minority characters are depicted as unskilled, subservient, undereducated, poor, or in professional sports.
1. Treat people with respect in test materials. All groups and types of people
should be depicted in a wide range of roles in society. Representation of
groups and types of people should be balanced, not one-sided. Never mock
or hold in low regard anyone's beliefs. Avoid any hint of ethnocentrism or
group comparisons. Do not use language that is exclusive to one group of
people. Educational Testing Service used an example to illustrate this point:
"All social workers should learn Spanish." This implies that most people in
need of welfare are Spanish speaking.
2. Minimize construct-irrelevant knowledge. As a nation, we are fond of
figures of speech, idioms, and challenging vocabulary. We need to avoid in
our tests specialized political words, regional references, religious terms, eso-
teric terms, sports, and the like. The intent is to ensure that prior knowledge
does not in some way increase or decrease performance.
3. With most test items, one's race or ethnicity is seldom important to the
content of the item. Therefore, it is best not to use such labels unless justi-
fied in the opinion of the committee doing the sensitivity review.
4. Refer to men and women in parallel ways. Never refer to the appearance
of a person in terms of gender. Be careful about the use of boys and girls;
those terms are reserved for persons below the age of 18. When de-
picting characters in test items, include men and women equally. Avoid ge-
neric terms such as he or man. Avoid references to a person's sexual
preference. Avoid references to the age of a person unless it is important.
5. Avoid stereotypes. We should avoid using terms that may be part of our
normal parlance but are really stereotypes. The term Indian giver is one that
conveys a false image. "You throw the ball like a girl" is another stereotype
image that should be avoided. Even though we may want to stereotype a
group in a positive way, it is best to avoid stereotyping.
in the test items being reviewed. The review procedure should be docu-
mented and should become part of the body of validity evidence. Chal-
lenged items should be reviewed in terms of which guideline is potentially
violated. Other members should decide on the outcome of the item. Chal-
lenged items should never be released to the public. As you can see, sensi-
tivity review will continue to be done and to be an important aspect of the
item development process.
When an item is drafted, the author of the item usually chooses one of the MC
options as the key (correct answer). The key check is a method for ensuring
that there is one and only one correct answer. Checking the key is an important
step in item development. The key check should never be done superficially or
casually. Why is it necessary to check the key? Because several possibilities exist
after the test is given and the items are statistically analyzed:
What should be done if any of these three circumstances exist after the test
is given? In any testing program where important decisions are made based on
test scores, the failure to deal with key errors is unfair to test takers. In the un-
likely event of the first situation, the item should be removed from the test and
not be used to compute the total score. The principle at stake is that no test
taker should be penalized for the test maker's error. If the second or third condi-
tions exist, right answers should be rekeyed and the test results should be re-
scored to correct any errors created by either situation. These actions can be
avoided through a thorough, conscientious key check.
Performing the Key Check. The key check should always be done by a
panel of SMEs. These experts should agree about the correct answer. The ex-
perts should self-administer each item and then decide if their response
matched the key. If it fails to match the key, the item should be reviewed, and
through consensus judgment, the key should be determined. If a lack of con-
sensus exists, the item is inherently ambiguous or otherwise faulty. The item
should be revised so that consensus is achieved about the key, or the item
should be retired.
Another way to validate a key is to provide a reference to the right answer
from an authoritative source, such as a textbook or a journal. This is a common
practice in certification and licensing testing programs. The practice of providing
references for test items also ensures a faithfulness to content that may be
part of the test specifications.
The next class period following a classroom test should be spent discussing
test results. The primary purpose is to help students learn from their errors. If
learning is a continuous process, a posttest analysis can be helpful in subse-
quent learning efforts. A second purpose, however, is to detect items that fail to
perform as intended. The expert judgment of classroom learners can be mar-
shaled for exposing ambiguous or misleading items.
After a classroom test is administered and scored, it is recommended that
students have an opportunity to discuss each item and provide alternative rea-
soning for their wrong answers. Sometimes, they may prove the inherent weak-
ness in the item and the rationale for their answer. In these circumstances, they
deserve credit for their responses. Such informal polling also may determine
that certain items are deficient because the highest scoring students are chron-
ically missing the item or the lowest scoring students are chronically getting an
item right. Standard item analysis also will reveal this, but the informal polling
method is practical and feasible. In fact, it can be done immediately after the
test is given, if time permits, or at the next class meeting. Furthermore, there is
instructional value to the activity because students have the opportunity to
learn what they did not learn before being tested. An electronic version of this
polling method using the student-problems (S-P) chart is reported by Sato
(1980), but such a technique would be difficult to carry out in most instruc-
tional settings because of cost. On the other hand, the informal polling method
can simulate the idea behind the S-P process and simultaneously provide ap-
peals for the correct scoring of the test and provide some diagnostic teaching
and remedial learning.
An analysis for any single student can reveal the nature of the problem.
Sometimes, a student may realize that overconfidence, test anxiety, lack of
study or preparation, or other factors legitimately affected performance, or the
analysis may reveal that the items were at fault. In some circumstances, a stu-
dent can offer a correct line of reasoning that justifies an answer that no one
else in the class or the teacher thought was right. In these rarer circumstances,
credit could be given. This action rightfully accepts the premise that item writ-
ing is seldom a perfect process and that such corrective actions are sometimes
justified.
Another device for obtaining answer justification is the use of a form where
the student writes out a criticism of the item or the reasoning used to select his
or her response (Dodd & Leal, 2002). The instruction might read:
Present any arguments supporting the answer you chose on the test.
Nield and Wintre (2002) have been using this method in their introductory
psychology classes for several years with many positive results. In a survey of
their students, 41% used the answer justification option. They reported that
student anxiety may have been lessened and that they gained insight into their
teaching as well as identified troublesome items. They observed that few stu-
dents were affected by changes in scoring, but they also noted that higher
achieving students were more likely to gain score points as a result of the an-
swer justification process.
Naturally, students like this technique. Dodd and Leal (2002) reported in
their study of answer justification that 93% of their students thought the proce-
dure should be used in other classes. They cited many benefits for answer
justification.
Review 8: Think-Aloud
For think-aloud, students are grouped around a table and asked to respond to a
set of MC items. During that time, the test administrator sits at the table with
the students and talks to the students as they encounter each item. They are
encouraged to talk about their approach to answering the item. The adminis-
trator often probes to find out what prompted certain answers. The setting is
friendly. Students talk with the administrator or each other. The administrator
takes notes or audio- or videotapes the session. We should consider the value of
think-aloud in two settings, research and item response validation.
TABLE 8.6
Descriptions of Elicitation Levels
One conclusion that Norris drew from his experimental study of college students
is that the use of the six levels of elicitation of verbal reports did not affect cogni-
tive test behavior. Some benefits of this kind of probing, he claimed, include de-
tecting misleading expressions, implicit clues, unfamiliar vocabulary, and
alternative justifications. Skakun and Maguire (2000) provided an updated re-
view of think-aloud research. As noted in chapter 3, Hibbison (1991) success-
fully used think-aloud to induce students to describe the cognitive processes they
used in answering MC items. Tamir (1993) used this technique in research on
the validity of guidelines for item writing. Skakun et al. (1994) found that their
medical school students approached MC items in 5 ways, processed options in 16
ways, and used 4 cognitive strategies to make a correct choice. Consequently, we
should see the potential of the think-aloud procedure in research on the cogni-
tive processes elicited by various item formats and the validity of item-writing
guidelines. Seeing the link between think-aloud and construct validity, test spe-
cialists have recommended this practice.
SUMMARY
In the first part of this chapter, some issues are discussed as each affects the pro-
cess we use to review and polish test items. It is crucial to ensure that the con-
struct being measured is well defined and that test specifications are logically
created to reflect this definition. Item writers need to be trained to produce
new, high-quality items. Security surrounds this process and ensures that items
are not exposed or in some other way compromised.
In the second part, eight interrelated, complementary item-review activi-
ties are recommended. Table 8.7 provides a summary of these review activities.
Performing these reviews provides an important body of evidence supporting
both the validity of test score interpretation and uses and the validity of item
response interpretations and uses.
TABLE 8.7
Item Review Activities
9
Statistical Study
of Item Responses
OVERVIEW
As is frequently noted in this book, the quality of test items depends on two
complementary activities, the item review procedures featured in chapter 8
and the statistical study of item responses featured in this chapter. The docu-
mentation of these review procedures and the statistical studies discussed in
this chapter provide two important sources of validity evidence desired in a
validation.
Once we have completed the item review activities, we may field test the
item. The item is administered but not included in scoring. We turn to item
analysis results to help us make one of three decisions about the future of
each item:
• Accept the item as is and add it to our item bank and continue to use the
item on future tests.
• Revise the item to improve the performance of the item.
• Reject and retire the item because of undesirable performance.
In this chapter, first, we consider the nature of item responses and explore the
rationale for studying and evaluating item responses. In this treatment of item
responses, we encounter context for MC testing, statistical theories we will use,
and tools that help us in the study and evaluation of item responses. Second, we
examine three distinctly different yet complementary ways to study item re-
sponses: tabular, graphical, and statistical.
Every MC item has a response pattern. Some patterns are desirable and
other patterns are undesirable. In this chapter we consider different meth-
ods of study, but some foundation should be laid concerning the patterns of
item responses.
The study of item responses provides a primary type of validity evidence
bearing on the quality of test items. Item responses should follow patterns
that conform with our idea about what the item measures and how
examinees with varying degrees of knowledge or ability should encounter
these items.
Examinees with high degrees of knowledge or ability tend to choose the
right answer, and examinees with low degrees of knowledge or ability tend to
choose the wrong answers. It is that simple. But as you will see, other consider-
ations come into play that make the study of item response patterns more com-
plex. With this increasing complexity of the study of item responses comes a
better understanding of how to improve these items so that they can serve in
the operational item bank.
We can study item response patterns using a statistical theory that explains
item response variation, or we can study item responses in a theory-free
context, which is more intuitive and less complex. Although any statistical
theory of test scores we use may complicate the study of item responses,
should we avoid the theory? The answer is most assuredly no. In this book
we address the problem of studying and evaluating item response patterns
using classical test theory (CTT) but then shift to item response theory
(IRT). Both theories handle the study of item response patterns well. Al-
though these rival theories have much in common, they have enough
differences to make one arguably preferable to the other, although which
one is preferable is a matter of continued debate. Some statistical methods
are not theory based but are useful in better understanding the dynamics of
item responses. This chapter employs a variety of methods in the interest of
providing comprehensive and balanced coverage, but some topics require
further study in other sources.
We have many excellent technical references to statistical theories of test
scores (Brennan, 2001; Crocker & Algina, 1986; Embretson & Reise, 2000;
Gulliksen, 1987; Hambleton, 1989; Hambleton, Swaminathan, & Rogers,
1991; Lord, 1980; Lord & Novick, 1968; McDonald, 1999; Nunnally &
Bernstein, 1994).
TABLE 9.1
Computer Programs Offering Item and Test Information
BILOG3 (Mislevy & Bock, 2002). An IRT program that provides classical and IRT
item analysis information and much more.
BILOG-MG (Zimowski, Muraki, Mislevy, & Bock, 2003). An enhanced version of
BILOG that also provides differential item functioning, item drift, and other more
complex features.
ITEMAN (Assessment Systems Corporation, 1995). A standard test and item analysis
program that is easy to use and interpret. It provides many options including subscale
item analysis.
RASCAL (Assessment Systems Corporation, 1992). This program does a Rasch (one-
parameter) item analysis and calibration. It also provides some traditional item
characteristics. This program shares the same format platform as companion
programs, ITEMAN and XCALIBRE, which means that it is easy to use and
completes its work quickly.
RUMM2010 (Rasch Unidimensional Models for Measurement; Andrich, Lyne,
Sheridan, & Luo, 2001). This program provides a variety of classical and item
response theory item analyses including person fit analysis and scaling. The use of
trace lines for multiple-choice options is exceptional.
TESTFACT4 (Bock et al., 2002). This program conducts extensive studies of item
response patterns including classical and IRT item analysis, distractor analysis, full
information factor analysis, suitable for dichotomous and polytomous item responses.
XCALIBRE (Assessment Systems Corporation). This program complements ITEMAN
and RASCAL by providing item calibrations for the two- and three-parameter model.
The program is easy to use and results are easily interpreted.
CONQUEST (Wu, Adams, & Wilson, 1998). ConQuest fits several item response
models to binary and polytomous data, and produces traditional item analysis.
A complication we face is the choice of the scoring method. For nearly a cen-
tury, test analysts have treated distractors and items equally. That is, all right
answers score one point each and all distractors score no points. Should some
options have more weight than others in the scoring process? If the answer is
no, the appropriate method of scoring is zero-one or binary. With the coming of
dichotomous IRT, the weighting of test items is realized with the use of the two-
and three-parameter logistic response models.
When comparing unweighted and weighted item scoring, the sets of scores
are highly correlated. The differences between unweighted and weighted
scores are small and usually observed in the upper and lower tails of the distri-
bution of test scores. Weighted scoring may be harder to explain to the public,
which is why, perhaps, most test sponsors prefer to use simpler scoring models
that the public more easily understands and accepts. When we use the two- or
three-parameter logistic response models, we are weighting items. Several
examinees with identical raw scores may have slightly different scale scores be-
cause their patterns of responses vary.
MC Option-Weighted Scoring
ITEM CHARACTERISTICS
With CTT, the estimation of difficulty can be biased. If the sample is restricted
but scores are observed in the center of the test score scale, this bias may be less
serious.
ITEM DIFFICULTY
The first characteristic of item responses is item difficulty. The natural scale for
item difficulty is percentage of examinees correctly answering the item. The
ceiling of any MC item is 100% and the probability of a correct response deter-
mines the floor when examinees with no knowledge are randomly guessing.
With a four-option item, the floor is 25%, and with a three-option item the
floor is 33%. A commonly used technical term for item difficulty is p value,
which stands for the proportion or percentage of examinees correctly answer-
ing the item.
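To make the p value concrete, the short sketch below tallies it from a matrix of scored responses. The data are hypothetical, and the calculation simply divides the number of examinees answering each item correctly by the total number of examinees.

# A minimal sketch using hypothetical data: each row is an examinee,
# each column an item, scored 1 = correct and 0 = incorrect.
responses = [
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
]
n_examinees = len(responses)
n_items = len(responses[0])

# p value for each item: proportion of examinees answering correctly.
p_values = [sum(row[j] for row in responses) / n_examinees for j in range(n_items)]
print(p_values)  # [0.6, 0.6, 0.8, 0.8]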
Every item has a natural difficulty, one that is based on the performance of
all persons for whom we intend the test. This p value is difficult to estimate ac-
curately unless a representative group of test takers is being tested. This is one
reason CTT is criticized, because the estimation of the p value is potentially bi-
ased by the sample on which the estimate of item difficulty is based. If the sam-
ple contains well-instructed, highly trained, or high-ability examinees, the
tests and their items appear easy, usually above .90. If the sample contains unin-
structed, untrained, or low-ability people, the test and the items appear hard,
usually below .40, for instance.
IRT allows for the estimation of item difficulty without consideration of exactly
who is tested. With CTT, as just noted, the knowledge or ability level of the
sample strongly influences the estimation of difficulty. With IRT, item difficulty
can be estimated without bias. The difficulty of the item in the IRT perspective
is governed by the true difficulty of the item and the achievement level of the
person answering the item. The estimation of difficulty is based on this idea,
and the natural scale for scaling difficulty in IRT is logarithm units (logits) that
generally vary between -3.00 and +3.00, with the negative values being inter-
preted as easy and the higher values being interpreted as hard.
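As a rough illustration of the logit scale, the sketch below evaluates the one-parameter (Rasch) item response function for two hypothetical items, one easy (b = -1.5 logits) and one hard (b = +1.5 logits). The ability values are also assumptions chosen only for illustration.

import math

def rasch_p(theta, b):
    # Probability of a correct response under the Rasch model, where theta is
    # the person's ability and b is the item difficulty (both in logits).
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# An easy item (b = -1.5) and a hard item (b = +1.5) at three ability levels.
for theta in (-2.0, 0.0, 2.0):
    print(theta, round(rasch_p(theta, -1.5), 2), round(rasch_p(theta, 1.5), 2))

At an ability of 0 logits, for example, the easy item is answered correctly about 82% of the time and the hard item about 18% of the time, which conveys the sense in which negative difficulties are easy and positive difficulties are hard.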
There are many IRT models. Most are applicable to large testing programs
involving 200 or more test takers. If a testing program is that large and the con-
tent domain is unidimensional, IRT can be effective for constructing tests that
are adaptable for many purposes and types of examinees. The one-, two-, and
three-parameter binary-scoring IRT models typically lead to similar estimates
of difficulty. These estimates are highly correlated to classical estimates of diffi-
culty. The ability to estimate parameters accurately, such as difficulty, provides
a clear advantage for IRT over CTT. IRT is also favored in equating studies and
scaling, and most computer programs listed in Table 9.1 enable IRT equating
and scaling.
ITEM DISCRIMINATION
Tabular Method
The most fundamental tabular method involves the mean of those choosing
the correct answer and the mean of those choosing any incorrect answer. A
good way to understand item discrimination is to note that those who chose the
correct answer should have a high score on the test and those who chose any
wrong answer should have a low score on the test. In Table 9.2, the first item has
good discrimination, the second item has lower discrimination, the third item
fails to discriminate, and the fourth item discriminates in a negative way. Such
an item may be miskeyed.
Another tabular method is to divide the sample of examinees into 10 groups
reflecting the rank order of scores. Group 1 will be the lowest scoring group and
Group 10 will be the highest scoring group. In the example in Table 9.3 we have
2,000 examinees ranked by scores from low to high divided into 10 groups with
200 examinees in each score group. Group 1 is the lowest scoring group, and
Group 10 is the highest scoring group. Note that the number of examinees
choosing the right answer (1) is small in the lowest groups but grows steadily
in the higher scoring groups. Also, note that the number of examinees choosing
the wrong answer (0) is large in the lowest scoring groups but declines in the
highest scoring groups.
TABLE 9.2
Examples of the Average (Mean) Scores of Those Answering the Item Correctly
and Incorrectly for Four Types of Item Discrimination
These tabular methods are fundamental to understanding the nature of
item responses and discrimination. These methods provide the material for
other methods that enhance this understanding of item discrimination.
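A minimal sketch of the tabular method appears below. The total scores and item responses are simulated rather than real, and the 10 equal-sized score groups mirror the layout of Table 9.3.

import random
random.seed(0)

# Hypothetical data: (total score, 0/1 response to the item of interest) for 2,000 examinees.
examinees = [(score, int(random.random() < score / 50))
             for score in [random.randint(10, 50) for _ in range(2000)]]

examinees.sort(key=lambda pair: pair[0])      # rank examinees by total score
group_size = len(examinees) // 10             # 200 examinees per score group
for g in range(10):                           # Group 1 = lowest scorers, Group 10 = highest
    block = examinees[g * group_size:(g + 1) * group_size]
    correct = sum(resp for _, resp in block)
    print(f"Group {g + 1}: correct = {correct}, incorrect = {len(block) - correct}")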
Graphical Method
Taking the tabular results of Table 9.3, we can construct graphs that display the
performance of examinees who selected the correct and incorrect responses.
Figure 9.1 illustrates a trace line (also known as an option characteristic curve)
for the correct choice and all incorrect choices taken collectively. The trace
line can be formed in several ways. One of the easiest and most direct methods
uses any computer graphing program. In Fig. 9.1, the trace lines were taken
from results in Table 9.3.
The trace line for the correct answer is monotonically increasing, and the
trace line for the collective incorrect answers is monotonically decreasing.
Note that the trace line for correct answers is a mirror image of the trace line for
incorrect answers. Trace lines provide a clear way to illustrate the discriminat-
ing tendencies of items. Flat lines are undesirable because they indicate a failure
to discriminate.
TABLE 9.3
Frequency of Correct and Incorrect Responses for 10 Score Groups
Score group      1    2    3    4    5    6    7    8    9    10
Incorrect (0)    140  138  130  109  90   70   63   60   58   56
Correct (1)      60   62   70   91   110  130  137  140  142  144
FIG. 9.1. Trace lines for the correct answer and collectively all wrong answers.
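Trace lines such as those in Fig. 9.1 can be drawn with any graphing tool. The sketch below uses the frequencies from Table 9.3 and assumes the matplotlib plotting library is available; it is offered only as one convenient way to produce the two curves.

import matplotlib.pyplot as plt

# Frequencies from Table 9.3 (200 examinees in each of 10 score groups).
groups = list(range(1, 11))
correct = [60, 62, 70, 91, 110, 130, 137, 140, 142, 144]
incorrect = [140, 138, 130, 109, 90, 70, 63, 60, 58, 56]

# Convert counts to proportions and plot the two trace lines.
plt.plot(groups, [c / 200 for c in correct], marker="o", label="Correct answer")
plt.plot(groups, [w / 200 for w in incorrect], marker="s", label="All wrong answers")
plt.xlabel("Score group (low to high)")
plt.ylabel("Proportion choosing")
plt.legend()
plt.show()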
Statistical Methods
When instruction is effective and student effort is strong, the discrimination
index is greatly underestimated. In fact, if all students answered an item
correctly, the discrimination index would be zero. Nevertheless, this result is
misleading. If the
sample included nonlearners, we would find out more about the ability of the
item to discriminate. One can obtain an unbiased estimate of discrimination
in the same way one can obtain an unbiased estimate of difficulty—by obtain-
ing a representative sample that includes the full range of behavior for the
trait being measured. Restriction in the range of this behavior is likely to af-
fect the estimation of discrimination.
With IRT, we have a variety of traditional, dichotomous scoring models
and newer polytomous scoring models from which to choose. The one-pa-
rameter item response model (referred to as the Rasch model) is not con-
cerned with discrimination, as it assumes that all items discriminate equally.
The Rasch model has one parameter—difficulty. The model is popular be-
cause applying it is simple, and it provides satisfactory results despite this im-
plausible assumption about discrimination. Critics of this model
appropriately point out that the model is too simplistic and ignores the fact
that items do vary with respect to discrimination. With the two- and three-
parameter models, item discrimination is proportional to the slope of the op-
tion characteristic curve at the point of inflexion (Lord, 1980). This shows
that an item is most discriminating in a particular range of scores. One item
may discriminate very well for high-scoring test takers, whereas another item
may discriminate best for low-scoring test takers.
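The link between the discrimination parameter and the slope of the curve can be illustrated with a small sketch of the two-parameter logistic model. The discrimination and difficulty values are hypothetical, and the slope formula (a/4 at the point of inflection) applies to the logistic form written here without the 1.7 scaling constant.

import math

def p_2pl(theta, a, b):
    # Two-parameter logistic model: a = discrimination, b = difficulty.
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def slope_at_inflection(a):
    # For this logistic form, the slope of the curve at theta = b equals a / 4.
    return a / 4.0

# Hypothetical items with equal difficulty (b = 0) but different discrimination.
for a in (0.5, 1.0, 2.0):
    print(a, round(p_2pl(0.5, a, 0.0), 3), round(slope_at_inflection(a), 3))

Items with larger a values have steeper curves near their difficulty and therefore separate examinees more sharply in that region of the score scale.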
A popular misconception is that a fit statistic is a substitute for discrimina-
tion. Fit statistics do not measure discrimination. Fit statistics in IRT answer a
question about the conformance of data to a hypothetical model, the item
characteristic curve. One of the best discussions of fit can be found in Hambleton
(1989, pp. 172-182). If items do not fit, some claims for IRT about sample-free
estimation of examinee achievement are questionable.
A third method used to estimate item discrimination is the eta coefficient.
This statistic can be derived from the one-way analysis of variance
(ANOVA), where the dependent variable is the average score of persons se-
lecting that option (choice mean) and the independent variable is the op-
tion choice. In ANOVA, three estimates of variance are obtained: sums of
squares between, sums of squares within, and sums of squares total. The ratio
of the sums of squares between and the sums of squares total is the squared
eta coefficient. In some statistical treatments, this ratio is also the squared
correlation between two variables (R²). The eta coefficient is similar to the
traditional product-moment discrimination index. In practice, the eta co-
efficient differs from the product-moment correlation coefficient in that
the eta considers the differential nature of distractors, whereas the prod-
uct-moment correlation makes no distinction among item responses re-
lated to distractors.
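The sketch below carries out this ANOVA decomposition for a single hypothetical item. The option choices and total scores are invented for illustration; the eta coefficient is the square root of the ratio of the between-option sum of squares to the total sum of squares, as described above.

# Eta coefficient for one item, computed from sums of squares.
choices = ["A", "A", "A", "B", "B", "C", "C", "D", "A", "B"]   # option each examinee chose
scores  = [38,  35,  33,  24,  22,  27,  25,  26,  36,  23]    # each examinee's total score

grand_mean = sum(scores) / len(scores)
ss_total = sum((s - grand_mean) ** 2 for s in scores)

# The between-option sum of squares uses the choice mean for each option.
ss_between = 0.0
for option in set(choices):
    group = [s for c, s in zip(choices, scores) if c == option]
    group_mean = sum(group) / len(group)
    ss_between += len(group) * (group_mean - grand_mean) ** 2

eta_squared = ss_between / ss_total     # also interpretable as R squared
print(round(eta_squared ** 0.5, 3))     # the eta coefficient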
The first item in Table 9.4 illustrates a high discrimination index but a low eta
coefficient. Notice that its choice means are closely bunched. The second item
also has a high discrimination index, but it has a high eta coefficient as well
because the choice means of its distractors are more separated. In dichotomous scoring,
point-biserial item response-total test score correlation or the two- or three-
parameter discrimination may serve as a discrimination index. However, with
polytomous scoring, the eta coefficient provides different information that is
appropriate for studying item performance relative to polytomous scoring.
Chapter 10 treats this subject in more detail.
What we can learn from this section is that with dichotomous scoring, one
can obtain approximately the same information from using the classical dis-
crimination index (the product-moment correlation between item and test
performance) or the discrimination parameter from the two- or three-parame-
ter item response models. But with polytomous scoring these methods are inap-
propriate, and the eta coefficient provides a unique and more appropriate index
of discrimination.
TABLE 9.4
Point-Biserial and Eta Coefficients for Two Items
Item 1 Item 2
Point-biserial .512 .552
Eta coefficient .189 .326
Choice Mean          Choice Mean
Option A—correct     33.9%        33.4%
Option B—incorrect   23.5%        24.8%
Option C—incorrect   27.0%        29.6%
Option D—incorrect   26.4%        30.7%
When two or more attributes are present in the test, item discrimination has no
criterion on which to fix.
With IRT, unidimensionality is a prerequisite of test data. Hattie (1985)
provided an excellent review of this issue, and Tate (2002) provided a timely
update of this earlier work. When using the two- or three-parameter logistic
response model, the computer program will fail to converge if multidimen-
sionality exists. With the use of classical theory, discrimination indexes, ob-
tained from product-moment correlations or biserial correlations, will be
lower than expected and unstable from sample to sample. Thus, one has to be
cautious that the underlying test data are unidimensional when estimating
discrimination. A quick-and-dirty method for studying dimensionality is to
obtain a KR-20 (Kuder-Richardson 20) internal consistency estimate of reli-
ability. If it is lower than expected for the number of items in the test, this is a
clue that the data may be multidimensional. A more dependable method is to
conduct a full-information, confirmatory item factor analysis. Chapter 10
provides more discussion of this problem and its implications for estimating
discrimination.
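For reference, the quick KR-20 check mentioned above can be computed as in the sketch below. The scored response matrix is hypothetical, and the formula is the standard KR-20: (k / (k - 1)) multiplied by (1 minus the sum of the item p(1 - p) values divided by the variance of total scores).

# KR-20 internal consistency estimate from a 0/1 response matrix (hypothetical data).
responses = [
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [0, 1, 1, 1, 1],
    [1, 1, 1, 0, 1],
    [0, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
]
n = len(responses)                       # number of examinees
k = len(responses[0])                    # number of items
totals = [sum(row) for row in responses]
mean_total = sum(totals) / n
var_total = sum((t - mean_total) ** 2 for t in totals) / n

pq_sum = 0.0
for j in range(k):
    p = sum(row[j] for row in responses) / n
    pq_sum += p * (1 - p)

kr20 = (k / (k - 1)) * (1 - pq_sum / var_total)
print(round(kr20, 3))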
        XX                          XX
       XXX                         XXX
     XXXXXX                      XXXXXXX
   XXXXXXXXXX                  XXXXXXXXXX
Low performance              High performance
Before instruction           After instruction
In this setting, the validity of these conclusions is not easy to prove based on
statistical results alone. Analysis of performance patterns requires close obser-
vation of the instructional circumstances and the judicious use of item and test
scores to draw valid conclusions. Instructional sensitivity is a useful combina-
tion of information about item difficulty and discrimination that contributes to
the study and improvement of items designed to test the effects of teaching or
training.
GUESSING
With the use of MC test items, an element of guessing exists. Any test taker
encountering an item either knows the right answer, has partial knowledge that
allows for the elimination of implausible distractors and a guess among the
remaining choices, or simply guesses in the absence of any knowledge.
In CTT, one can usually ignore the influence of guessing. To see why, consider
the laws of probability that influence the degree to which guessing might be
successful. The probability of getting a higher than deserved score by guessing
gets smaller as the test gets longer. For example, even on a four-option, 10-
item test, the probability of getting 10 correct random guesses is .0000009. An-
other way of looking at this is to realize the probability of scoring 70% or higher
on a 10-item MC test by random guessing is less than .004. Increase that test to
25 items, and that probability of getting a score higher than 70% falls to less
than .001.
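These probabilities follow directly from the binomial distribution, as the short sketch below verifies; the three calls reproduce, approximately, the figures cited in this paragraph.

from math import comb

def p_at_least(k, n, p):
    # Probability of at least k correct random guesses on n items,
    # each with success probability p.
    return sum(comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(k, n + 1))

print(p_at_least(10, 10, 0.25))   # all 10 correct on a 10-item test: about .00000095
print(p_at_least(7, 10, 0.25))    # 70% or better on 10 items: about .0035
print(p_at_least(18, 25, 0.25))   # 70% or better on 25 items: far below .001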
The third parameter of the three-parameter item response model, although often
referred to as the "guessing parameter," is actually a pseudochance level
(Hambleton et al., 1991). This parameter is not intended to model the psychological process of
guessing but merely to establish that a reasonable floor exists for the difficulty
parameter. This third parameter is used along with item difficulty and discrimi-
nation to compute a test taker's score. The influence of this third parameter is
small in relation to the influence of the discrimination parameter. Several
polytomous scoring models that also use correct and incorrect responses also
incorporate information about guessing into scoring procedures (Sympson,
1983, 1986; Thissen & Steinberg, 1984).
Depending on test instructions and other conditions, examinees may omit re-
sponses. One serious form of omitted response is items that are not tried.
Usually, a string of nonresponses at the end of a set of responses signals items
that are not reached. In the study of item responses, it is important to tabulate
omitted and not-reached responses.
DISTRACTOR EVALUATION
Thissen, Steinberg, and Fitzpatrick (1989) stated that test users and analysts
should consider the distractor as an important part of the item. Indeed, nearly
50 years of continuous research has revealed a patterned relationship between
distractor choice and total test score (Haladyna & Sympson, 1988; Levine &
Drasgow, 1983; Nishisato, 1980). The following are five reasons for studying
distractor performance for MC items:
• Slimming down fat items. Haladyna and Downing (1993) provided both
theoretical arguments and empirical results suggesting that most test items
contain too many options. They argue that if we systematically evaluated
distractors, we would discover that many distractors are not performing as
intended. In chapter 5, a guideline was presented that suggested MC items
should have as many good options as is possible, but three is probably a good
target. The research cited there provides a good basis for that guideline and
the observation that most items usually have only one or two really well-
functioning distractors. By trimming the number of options for MC items,
item writers are relieved of the burden of writing distractors that seldom
work, and examinees take shorter tests. Or, if we can shorten the length of
items, we can increase the number of items administered per hour and by
that increase the test length, which may improve the sampling of a content
domain and increase the reliability of test scores.
• Improving test items. The principal objective of studying and evaluating
item responses is to improve items. This means getting difficulty in line and
improving discrimination. Item analysis provides information to SMEs
about the performance of items so that items can be retained in the opera-
tional item pool, revised, or retired. Information about distractors can be
used in this revision process.
As you can see, not only are these reasons compelling, but the effort put into
studying and improving distractors contributes important validity evidence to
the overall validation process. The next three sections present and discuss
three unique ways to study distractor performance. Although the methods dis-
cussed are diverse in nature and origin, they should provide convergent informa-
tion about the distractibility of distractors.
TABLE 9.5
Frequency Tables for Two 4-Option Multiple-Choice Items
Item 1 Options
Score group          A (correct)    B      C      D
80-99 percentile 17% 1% 0% 2%
60-79 percentile 14% 2% 0% 4%
40-59 percentile 10% 2% 0% 4%
20-39 percentile 8% 9% 1% 2%
1-19 percentile 6% 13% 3% 0%
Total 55% 27% 4% 14%
Item 2 Options
Score group          A (correct)    B      C      D
80-99 percentile 19% 1% 0% 0%
60-79 percentile 14% 3% 1% 2%
40-59 percentile 8% 4% 2% 6%
20-39 percentile 8% 9% 1% 2%
1-19 percentile 6% 12% 1% 1%
Total 55% 29% 5% 11%
Option B, a distractor, has a low response rate for the higher groups and a
higher response rate for the lower groups. This is a desirable pattern for a
well-performing distractor. As described earlier, all distractors should have a
pattern like this.
Option C illustrates a low response rate across all five score groups. Such
distractors are useless, probably because of extreme implausibility. Such
distractors should either be removed from the test item or replaced.
Option D illustrates an unchanging performance across all score groups. No
orderly relation exists between this distractor and total test performance. We
should remove or replace such a distractor from the test item because it is not
working as intended.
The second item exhibits a distractor pattern that presents problems of in-
terpretation and evaluation. Option D is more often chosen by the middle
group and less often chosen by the higher and lower groups. This pattern is
nonmonotonic in the sense that it increases as a function of total test score and
then decreases. Is this pattern a statistical accident or does the distractor at-
tract middle achievers and not attract high and low achievers? Distractors are
not designed to produce such a pattern because the general intent of a
distractor is to appeal to persons who lack knowledge. The nonmonotonic pat-
tern shown in Option D implies that the information represented by Option D
is more attractive to middle performers and less attractive to high and low per-
formers. The nonmonotonic pattern appears to disrupt the orderly relation be-
tween right and wrong answers illustrated in Options A and B. For this reason,
nonmonotonic trace lines should be viewed as undesirable.
This tabular method is useful for obtaining the basic data that show the per-
formance of each distractor. A trained evaluator can use these tables with con-
siderable skill, but these data are probably more useful for creating graphical
presentations of distractor functioning. The computer program TESTFACT
(Bock et al., 2002) provides useful tables called fractiles that provide tabular
option responses.
The trace line we use in Fig. 9.1 for the correct answer and for the collective
distractors can be used for each distractor. Figure 9.3 shows four trace lines. A
four-option item can have up to five trace lines. One trace line exists for each
option and one trace line can be created for omitted responses.
As noted in Fig. 9.1, an effectively performing item contains a trace line for
the correct choice that is monotonically increasing, as illustrated in Fig. 9.1
and again in Fig. 9.3. These figures show that the probability or tendency to
choose the right answer increases with the person's ability. The collective per-
formance of distractors must monotonically decrease in opposite correspond-
ing fashion, as illustrated in Fig. 9.3. That figure shows that any examinee's
tendency to choose any wrong answer decreases with the person's ability or
achievement.
If the ideal trace line for all distractors is monotonically decreasing, each
trace line should exhibit the same tendency. Any other pattern should be in-
vestigated, and the distractor should be retained, revised, or dropped from
the test item. Referring to Fig. 9.3, the first trace line has the characteristic of
the correct answer, whereas the second trace line has the characteristic of a
plausible, well-functioning distractor. The third trace line shows a flat perfor-
mance across the 10 score groups. This option simply does not discriminate in
the way it is expected to discriminate. This kind of option probably has no use
in the item or should be revised. The fourth type of trace line shows low re-
sponse rates for all score groups. This kind of distractor is one that is probably
implausible and should be removed or replaced. Research on graphical item
analysis has supported its use in studying and evaluating distractors. The pri-
mary advance is that practitioners can easily read and understand these option
performance graphs.
These statistical methods can be grouped into three categories: (a) tradi-
tional, (b) nonparametric, and (c) parametric. The number of proposed
methods has increased recently, but research that compares the effectiveness
of these methods has not kept pace. The speculation is offered that because
these methods have the same objective, they probably provide similar results.
These results should follow logically from an inspection of a frequency table
and graphical results. In fact, these statistical methods should confirm what
we observe from viewing tabular results shown in Table 9.5 and the trace lines
shown in Fig. 9.3.
Traditional item analysis relies on the relationship between item and test per-
formance. The most direct method is the simple product-moment (point-
biserial) correlation between item and test performance. Applied to a
distractor, however, the point-biserial coefficient can be estimated incor-
rectly (Attali & Fraenkel, 2000). If the standard formula for point-biserial is
used, the responses to other distractors are grouped with responses to the right
answer, and the resulting discrimination index for the distractor is underesti-
mated. Attali and Fraenkel (2000) pointed out that the correlation of
distractor to total score should be independent of other distractors, and they
showed how discrimination indexes can be corrupted by the casual use of the
point-biserial coefficient. They also pointed out that this coefficient has the
advantage of being an effect size measure. The squared point-biserial is the
percentage of criterion variance accounted for by choosing that distractor.
Also, because a distractor may be chosen by only a few examinees, this index can
be unreliable. Therefore, it is recommended that if the traditional point-
biserial is used to evaluate a distractor, the appropriate test of statistical
significance be applied with a directional hypothesis, because the coefficient
is assumed to be negative. Attali and Fraenkel suggested power tables found
in Cohen (1988). A bootstrap method is suggested for overcoming any bias
introduced by limitations of the sample (de Gruijter, 1988), but this kind of
extreme measure points out an inherent flaw in the use of this index. It
should be noted that the discrimination index is not robust. If item difficulty
is high or low, the index is attenuated. It maximizes when difficulty is mod-
erate. The composition of the sample also affects the estimate of discrimination.
Distractors tend to be infrequently chosen, particularly when item diffi-
culty exceeds 0.75. Thus, the point-biserial correlation is often based on
only a few observations, which is a serious limitation. Henrysson (1971)
provided additional insights into the inadequacy of this index for the study
of distractor performance. Because of these many limitations, this index
probably should not be used.
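One plausible way to carry out the correction described above is to correlate the choice of a distractor with total score using only the examinees who chose either that distractor or the keyed answer, so that responses to the other distractors are not lumped with the right answer. The sketch below follows that reading; the data are hypothetical, and the code is not offered as Attali and Fraenkel's exact procedure.

def point_biserial(binary, scores):
    # Product-moment correlation between a 0/1 variable and total scores.
    n = len(binary)
    mx = sum(binary) / n
    my = sum(scores) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(binary, scores)) / n
    sx = (sum((x - mx) ** 2 for x in binary) / n) ** 0.5
    sy = (sum((y - my) ** 2 for y in scores) / n) ** 0.5
    return cov / (sx * sy)

# Hypothetical data: option chosen and total score for each examinee.
choices = ["A", "B", "A", "C", "B", "A", "D", "A", "B", "A"]
scores  = [40,  22,  37,  28,  20,  35,  26,  39,  24,  33]
key, distractor = "A", "B"

# Restrict the sample to examinees who chose either the distractor or the key.
pairs = [(1 if c == distractor else 0, s)
         for c, s in zip(choices, scores) if c in (distractor, key)]
print(round(point_biserial([b for b, _ in pairs], [s for _, s in pairs]), 3))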
Choice Mean
For any option, we can calculate the mean of all examinees who chose that op-
tion. For the right answer, this mean will typically be higher than the means of
any wrong answer. We can analyze the relationship of the choice mean to total
score or the choice of the option to total score. The first is a product-moment
correlation between the choice mean and total score, where the choice mean is
substituted for the distractor choice. This coefficient shows the overall working
of an item to tap different levels of achievement through its distractors. An
item with a high coefficient would have different choice means for its
distractors. This may be viewed as an omnibus index of discrimination that in-
cludes the differential nature of distractors. The second type of index is the eta
coefficient, where the independent variable is option choice and the depend-
ent variable is total score. This index also represents an item's ability to discrim-
inate at different levels of achievement. The traditional point-biserial applied
to any item disregards the unique contributions of each distractor. When the
choice mean is used, an effect size can be calculated for the difference in choice
means for the distractor and the correct choice. The greater the difference in
standard deviation units, the more effective the distractor. Referring to Table
9.6, however, we note that the choice mean for each distractor differs from the
choice mean of the correct answer. The difference in these choice means can
serve as a measure of distractor effectiveness; the lower the choice mean, the
better the distractor. This difference can be standardized by using the standard
deviation of test scores, if a standardized effect size measure is desired.
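The sketch below computes choice means and the standardized difference just described for a small set of hypothetical responses; option A is assumed to be the keyed answer.

import statistics

# Hypothetical data: option chosen and total test score for each examinee.
choices = ["A", "A", "B", "C", "A", "D", "B", "A", "C", "A", "D", "B"]
scores  = [45,  42,  28,  31,  40,  25,  27,  44,  30,  39,  26,  29]
key = "A"

sd_total = statistics.pstdev(scores)
choice_means = {opt: statistics.mean([s for c, s in zip(choices, scores) if c == opt])
                for opt in set(choices)}

for opt in sorted(choice_means):
    if opt == key:
        print(f"{opt} (correct): choice mean = {choice_means[opt]:.1f}")
    else:
        # Standardized difference between the key's choice mean and the distractor's.
        effect = (choice_means[key] - choice_means[opt]) / sd_total
        print(f"{opt} (distractor): choice mean = {choice_means[opt]:.1f}, effect size = {effect:.2f}")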
TABLE 9.6
Choice Means for Two Items From a Test
The choice mean seems useful for studying distractors. The lower the choice
mean, the more effective the distractor. Yet, a bias exists in this procedure, be-
cause when the right answer is chosen by most high-scoring test takers, the
low-scoring test takers divide their choices among the three distractors plus the
correct answer. Therefore, distractors will always have lower choice means,
and statistical tests will always reveal this condition. Any exception would sig-
nal a distractor that is probably a correct answer.
As indicated earlier, the trace line has many attractive characteristics in the
evaluation of item performance. These characteristics apply equally well to
distractor analysis. Haladyna and Downing (1993) also showed that trace lines
reveal more about an option's performance than a choice mean. Whereas
choice means reveal the average performance of all examinees choosing any
option, the trace line accurately characterizes the functional relationship be-
tween each option and total test performance for examinees of different
achievement levels.
IRT Methods
Item analysts are increasingly turning to IRT methods to investigate the work-
ings of distractors (Samejima, 1994; Wang, 2000). Wang (2000) used the gen-
eral linear model and grouping factors (items) as the independent variables.
Distractibility parameters are estimated and used. His results in a simulation
study and with actual test data show the promise of this technique as confirmed
by graphical procedures. He also pointed out that low-frequency distractors are
not especially well estimated by this technique, nor by any other technique.
Methods like this one need to be compared with more conventional methods to
decide which are most and least effective.
Up to this point, the trace line has not been evaluated statistically. Haladyna
and Downing (1993) showed that the categorical data on which the trace line
is based can be subjected to statistical criteria using a chi-square test of inde-
pendence. Table 9.7 illustrates a contingency table for option performance.
Applying a chi-square test to these categorical frequencies, a statistically sig-
nificant result signals a trace line that is not flat. In the preceding case, it is
monotonically increasing, which is characteristic of a correct answer. Thus,
with the notion of option discrimination for the right choice, we expect mono-
tonically increasing trace lines, positive point-biserial discrimination indexes,
positive discrimination parameters with the two- and three-parameter models,
and choice means that exceed the choice means for distractors. For the wrong
choice, we expect monotonically decreasing trace lines, negative discrimina-
tion, negative discrimination parameters for the two- and three-parameter
models (which are unconventional to compute), and choice means that are
lower than the choice mean for the correct option.
The trace line appears to offer a sensitive and revealing look at option per-
formance. Trace lines can be easily understood by item writers who lack the
statistical background needed to interpret option discrimination indexes.
TABLE 9.7
Contingency Table for Chi-Square Test for an Option
First Score Second Score Third Score Fourth Score Fifth Score
Group Group Group Group Group
Expected 20% 20% 20% 20% 20%
Observed 6% 14% 20% 26% 34%
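Using the percentages in Table 9.7 as counts (an assumed 100 examinees choosing this option, purely for illustration), the chi-square statistic can be computed directly, as sketched below.

# Chi-square test of a flat trace line, using the frequencies implied by Table 9.7.
observed = [6, 14, 20, 26, 34]        # one count per score group
expected = [20, 20, 20, 20, 20]       # a flat (nondiscriminating) trace line

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
degrees_of_freedom = len(observed) - 1
# chi_square = 23.2 with 4 df, well beyond the .05 critical value of 9.49,
# so the trace line is judged not to be flat.
print(chi_square, degrees_of_freedom)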
The other statistical methods have limitations that suggest that they should
not be used.
Most testing programs have guidelines for evaluating MC items. As they eval-
uate item performance, particularly with SMEs, these guidelines are used to
identify items that can be used with confidence in the future and items that
need to be revised or retired. Table 9.8 provides a generic set of guidelines.
Depending on the overall difficulty of items and other conditions, values are
used to replace words in this table. For example, a moderate item might have
a difficulty (p value) between 40% and 90%. An easy item would have a p
value above 90%. Satisfactory discrimination might be .15 or higher. Unsatis-
factory discrimination would be lower than .15. Negative discrimination
would signal a possible key error.
TABLE 9.8
Generic Guidelines for Evaluating Test Items
SUMMARY
This chapter focuses on studying and evaluating item responses with the ob-
jective of keeping, revising, or retiring each item. A variety of perspectives
and methods are described and illustrated. Tabular methods provide clear
summaries of response patterns, but graphical methods are easier to under-
stand and interpret. Statistical indexes with tests of statistical significance
are necessary to distinguish between real tendencies and random variation.
The chapter ends with a table providing a generic set of guidelines for evalu-
ating items. All testing programs would benefit by adopting guidelines and
studying item response patterns. Doing item response studies and taking ap-
propriate action is another primary source of validity evidence, one that bears
on item quality.
10
Using Item Response
Patterns to Study
Specific Problems
OVERVIEW
ITEM BIAS
We know that test results may be used in many ways, including placement,
selection, certification, licensing, or advancement. These uses have both
personal and social consequences. Test takers are often affected by test
score uses. In licensing and certification, we run a risk by certifying or li-
censing incompetent professionals or by not certifying or licensing compe-
tent professionals.
Bias is a threat to valid interpretation or use of test scores because bias fa-
vors one group of test takers over another. Bias also has dual meanings. Bias is
a term that suggests unfairness or an undue influence. In statistics, bias is sys-
tematic error as opposed to random error. A scale that "weighs heavy" has this
statistical bias. Although bias has these two identities, the public is most
likely to identify with the first definition of bias rather than the second
(Dorans & Potenza, 1993). Although the discussion has been about bias in
test scores, in this section, the concern is with bias in item responses, thus the
term item bias.
As discussed in chapter 8, sensitivity review involves a trained committee
that subjectively identifies and questions items on the premise that test tak-
ers might be distracted or offended by the item's test content. Therefore, sen-
sitivity item review is concerned with the first meaning of item bias.
Differential item functioning (DIF) refers to a statistical analysis of item
responses that intends to reveal systematic differences among groups for a set of
responses to a test item that are attributable to group membership instead of
true differences in the construct being measured. In other words, the hypothesis
is that the groups do not differ. If DIF analysis finds differences in item
performance, items displaying DIF are suspected of being biased. Removal of
offending items reduces the differences between these groups to zero and removes
this threat to validity.
Several important resources contributed to this section. The first is an ed-
ited volume by Holland and Wainer (1993) on DIF. This book provides a
wealth of information about this rapidly growing field of item response analysis.
Camilli and Shepard (1994) also provided useful information on DIF. Another
source is an instructional module on DIF by Clauser and Mazor (1998).
Readers looking for more comprehensive discussions of DIF should consult
these sources and other references provided here.
A Brief History
A barbering examination in Oregon in the late 1800s is one of the earliest ex-
amples of testing for a profession. Since then, test programs for certification,
licensure, or credentialing have proliferated (Shapiro, Stutsky, & Watt, 1989).
These kinds of testing programs have two significant consequences. First, per-
sons taking the test need to pass to be certified or licensed to practice. Second,
these tests are intended to filter competent and incompetent professionals, as-
suring the public of safer professional practice.
Well-documented racial differences in test scores led to widespread discon-
tent, culminating in a court case, the Golden Rule Insurance Company versus
Mathias case in the Illinois Appellate Court in 1980. Federal legislation led to
policies that promoted greater monitoring of Black-White racial difference in
test performance. The reasoning was that if a Black-White difference in item
difficulty was greater than the observed test score difference, this result would
suggest evidence of DIF.
TABLE 10.1
Commercially Available Computer Programs for the Study
of Differential Item Functioning (DIF)
Research shows that many methods for detecting DIF share much in com-
mon. We can delineate the field of DIF using four discrete methods with the
understanding that items detected using one method are likely to be identified
using other methods.
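As one concrete illustration (not necessarily one of the four methods referred to above), the Mantel-Haenszel procedure compares correct-response rates for a reference group and a focal group within strata of examinees matched on total score. The sketch below computes the Mantel-Haenszel common odds ratio for a single item from simulated, hypothetical data.

import random
random.seed(1)

# Hypothetical records: (group, total score, item correct), "R" = reference, "F" = focal.
data = [("R" if random.random() < 0.5 else "F",
         random.randint(10, 40),
         random.randint(0, 1)) for _ in range(500)]

def stratum(score):
    return score // 10            # crude score bands serve as matching strata

num, den = 0.0, 0.0
for k in set(stratum(s) for _, s, _ in data):
    cell = [(g, r) for g, s, r in data if stratum(s) == k]
    a = sum(1 for g, r in cell if g == "R" and r == 1)   # reference, correct
    b = sum(1 for g, r in cell if g == "R" and r == 0)   # reference, incorrect
    c = sum(1 for g, r in cell if g == "F" and r == 1)   # focal, correct
    d = sum(1 for g, r in cell if g == "F" and r == 0)   # focal, incorrect
    n = a + b + c + d
    if n:
        num += a * d / n
        den += b * c / n

alpha_mh = num / den   # values near 1.0 suggest little or no DIF for this item
print(round(alpha_mh, 2))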
The fact that an examinee takes a test does not guarantee that the resulting re-
sponses are validly scorable. There are many reasons examinees may respond to
a test in ways that misinform us about their true achievement.
This section discusses nonresponse and aberrant responses. First, types of
responses are discussed and the problem of nonresponse is described. Then im-
putation is discussed as one approach to reducing the seriousness of this prob-
lem. Second, hypotheses are presented that explain the origins of aberrant
examinee responses. Then, statistical methods are discussed that address some
of these problems.
In this section, two related topics are discussed. The first topic involves types of
responses that arise from taking an MC or CR item. The second topic is how we
treat two of these types of responses.
TABLE 10.2
A Taxonomy of Item Responses
Domain of Responses to Constructed-Response (CR)
or Multiple-Choice (MC) Items                              Description
Correct answer The student knows the correct answer and either selects or
creates it.
An uneducated guess The student does not know the correct answer but makes an
uneducated guess. The probability of making a correct guess
on MC is 1/(number of options). With an open-ended item, this
probability is indeterminate but is probably very small.
An educated guess The student does not know the correct answer but makes an
educated guess using partial knowledge, clues, or eliminates
implausible distractors. With constructed-response items, the
student may bluff. In both instances, the probability of
obtaining a correct answer is higher than the uneducated
guess.
Omitted response The student omits a response.
Not reached The student makes one or more responses to a block of test
items and then leaves no responses following this string of
attempted responses.
For the uneducated guess, prior knowledge is probably zero. With this second
type of response, it is possible to obtain a correct answer, but that answer is
obtained in complete ignorance of the right answer.
The third response type is an educated guess. For an MC item, partial
knowledge comes into play. The test taker is believed to know something about
the topic and may be able to eliminate one or more distractors as implausible
and then guess using the remaining plausible options as a basis for the choice.
As a result, the probability of obtaining a correct answer is greater than chance.
For the CR format, the tendency to bluff may be based on some strategy that
the test taker has learned and used in the past. For instance, using vocabulary
related to the topic or writing long complex sentences may earn a higher score
than deserved. In this instance, a correct answer is more probable but the test
taker also has a greater level of proficiency.
The fourth response type is an omitted response. The response string in the
test may show responses, but occasionally the test taker may choose for some
unknown reason to omit a response. An omitted response may occur with ei-
ther an MC or CR item.
The fifth response type is not-reached. The test taker may attempt one or
more items and then quit responding. Although the unattempted items may be
considered omitted, these items are classified as not reached because we want
to distinguish between the conscious act of omitting a response versus quitting
the test entirely.
Haladyna, Osborn Popp, and Weiss (2003) have shown that omitted and
not-reached rates for MC and CR items on the NAEP reading assessment are
unequal. Students have a greater tendency to omit CR items. For not-reached
responses, students have a tendency to stop responding after encountering a
CR item. Omit and not-reached rates are associated with students with educa-
tional disadvantage.
The second part of this section deals with response patterns for the test or a
block of items administered. From IRT, the probability of a correct response is a
function of the difficulty of the item and the achievement level of the test
taker. If a test taker responds in a way that does not follow this expected pat-
tern, we raise suspicions that the resulting test score might be invalid. This
discussion about aberrant response patterns is conceptual in origin but
informs about the wide range of psychological factors that may produce aber-
rant item response patterns. This discussion draws from many other sources
(Drasgow, Levine, & Zickar, 1996; Haladyna & Downing, in press; Meijer,
1996; Meijer, Muijtjens, & van der Vleuten, 1996; Wright, 1977). We have at
least nine distinctly different psychological processes that may produce aber-
rant response patterns.
Creative Test Taking. Test takers may find test items so easy or ambiguous
that they will reinterpret and provide answers that only they can intelligently
justify. These test takers may also provide correct answers to more difficult
items. This pattern also resembles inattentive test takers who might "cruise"
through an easy part of a test until challenged. In chapter 8, answer justifica-
tion was discussed as a means of allowing test takers an opportunity to provide
an alternative explanation for their choice of an option. As we know from re-
search on cognitive processes, students taking the same test and items may be
using differing cognitive strategies for choosing an answer. Although their choice
may not agree with the consensus correct choice, their reasoning process for
choosing another answer may be valid. Unless appeals are offered to test takers
with SME adjudication, such test-taking patterns go unrewarded. Research
where answer justification or think-aloud procedures are used should increase
our understanding of the potential to credit justified answers to test items that
do not match the keyed response.
Inappropriate Coaching. Test preparation that focuses on the test itself
rather than on the domain it represents is flawed. Haladyna and Downing (in
press) argued that this type of test preparation is a CIV and a threat to validity.
The detection of inappropriate coaching can be done using any of the tech-
niques identified and discussed in the section on DIP in this chapter. The neces-
sary precondition to using these techniques is to identify two groups, one
inappropriately coached and one uncoached. Items displaying DIP provide evi-
dence of the types of items, content, and cognitive demand that affect test
scores. But research of this type about coaching effects is difficult to find. Becker
(1990) opined that the quality of most research on coaching is inadequate.
Inattention. Test takers who are not well motivated or easily distracted
may choose MC answers carelessly. Wright (1977) called such test takers
"sleepers." A sleeper might miss easy items and later correctly answer hard
items. This unusual pattern signals the inattentive test taker. If sleeper pat-
terns are identified, test scores might be invalidated instead of reported and
interpreted. The types of tests that come to mind that might have many inat-
tentive test takers are standardized achievement tests given to elementary
and secondary students. Many students have little reason or motivation to
sustain the high level of concentration demanded on these lengthy tests. This
point was demonstrated in a study by Wolf and Smith (1995) with college stu-
dents. When consequences were attached to a course test, student motivation and
performance were higher than in a comparable no-consequences condition. This point
was also well made by Paris et al. (1991) in their analysis of the effects of stan-
dardized testing on children. They pointed out that older children tend to
think that such tests have less importance, thus increasing the possibility for
inattention.
This problem is widespread among students who are English language learn-
ers (ELLs). Research by Abedi et al. (2000) showed that simplifying the lan-
guage in tests helps improve the performance of ELLs. This research showed
that reading comprehension is a source of CIV that is one of several threats to
validity.
The Standards for Educational and Psychological Testing (AERA et al., 1999)
provide an entire chapter containing discussions and standards addressing the
problems of students with diverse language backgrounds. In general, caution is
urged in test score interpretation and use when the language of the test exceeds
the linguistic abilities of test takers. Because we have so many ELLs in the
United States, emphasizing the problem seems justified. Testing policies sel-
dom recognize that reading comprehension introduces bias in test scores and
leads to faulty interpretations of student knowledge or ability.
Summary. Table 10.3 summarizes the nine types of aberrant response pat-
terns discussed here. Research is needed on the frequency of these aberrant re-
sponse patterns and their causes in achievement tests with varying stakes.
TABLE 10.3
Aberrant Response Patterns
Despite the statistical science that has emerged, there is little research on the ex-
tensiveness of aberrant response patterns. We need to know more about the
frequency of these patterns, their causes, and their treatment. The next section
discusses the statistical science associated with aberrant responding.
Person fit is a fairly young science of the study of aberrant item response patterns
on a person-by-person basis. Another term used is appropriateness measurement.
If an examinee's item response pattern does not conform to an expected, plau-
sible item response pattern, we have reason to be cautious about how that re-
sulting score is interpreted and used. The objective of person fit is statistical
detection of invalid test scores. An entire issue of Applied Measurement in Edu-
cation (Meijer, 1996) was devoted to person-fit research and applications.
Readers looking for more comprehensive discussions should consult the con-
tributing authors' articles and the many references they provided as well as
those provided in this section. As a science of the study of aberrant examinee
item response patterns, person fit follows traditional IRT methods (Drasgow et
al., 1996). An alternative method of study uses nonparametric methods
(Meijer et al., 1996).
IRT Solutions to Person Fit. Although IRT methods are effective for
studying person fit, large samples of test takers are needed. The chief char-
acteristic of these methods is the use of an explicit statistical IRT model.
The context or purpose for an index for person fit is important. Drasgow and
Guertler (1987) stated that several subjective judgments are necessary. For
example, if one is using a test to make a pass-fail certification decision, the
location of a dubious score relative to the passing score and the relative risk
one is willing to take have much to do with these decisions. Other factors to
consider in using these indexes are (a) the cost of retesting, (b) the risk of
misclassification, (c) the cost of misclassification, and (d) the confidence or
research evidence supporting the use of the procedure. According to
Drasgow, Levine, and Williams (1985), aberrant response patterns are iden-
tified by first applying a model to a set of normal responses and then using a
measure of goodness of fit, an appropriateness index, to find out the degree
to which anyone deviates from normal response patterns. Levine and Rubin
(1979) showed that such detection was achievable, and since then there
has been a steady progression of studies involving several theoretical mod-
els (Drasgow, 1982; Drasgow et al., 1996; Levine & Drasgow, 1982, 1983).
These studies were initially done using the three-parameter item response
model, but later studies involved polytomous item response models
(Drasgow et al., 1985). Drasgow et al. (1996) provided an update of their
work. They indicated that appropriateness measurement is most powerful
because it has a higher rate of error detection when compared with other
methods. With the coming of better computer programs, more extensive re-
search can be conducted, and testing programs might consider employing
these methods to identify test takers whose results should not be reported,
interpreted, or used.
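As an illustration of how an appropriateness index might be computed, the sketch below implements the standardized log-likelihood statistic lz, one common person-fit index, for a single examinee. The item probabilities are assumed to come from an already fitted IRT model, and the numerical example is entirely hypothetical; large negative values flag aberrant response patterns.

import numpy as np

def lz_person_fit(responses, p):
    # responses: 0/1 item responses for one examinee.
    # p: model-implied probabilities of a correct response for that examinee
    #    on each item (e.g., from a fitted 2PL evaluated at theta-hat).
    u = np.asarray(responses, dtype=float)
    p = np.asarray(p, dtype=float)
    q = 1.0 - p

    l0 = np.sum(u * np.log(p) + (1 - u) * np.log(q))      # observed log-likelihood
    e_l0 = np.sum(p * np.log(p) + q * np.log(q))          # its expectation
    var_l0 = np.sum(p * q * np.log(p / q) ** 2)           # and its variance
    return (l0 - e_l0) / np.sqrt(var_l0)

# Hypothetical illustration: an examinee who misses the easy items but
# answers the hard ones correctly receives a markedly negative lz.
p_hat = np.array([.9, .85, .8, .7, .6, .5, .4, .3, .2, .1])
aberrant = np.array([0, 0, 0, 1, 0, 1, 1, 1, 1, 1])
print(round(lz_person_fit(aberrant, p_hat), 2))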
array of choices. According to Meijer et al. (1996), three methods that stand
out are the Sato (1980) caution index, the modified caution index, and the
U3 statistic.
Sato (1975) introduced a simple pattern analysis for a classroom based
on the idea that some scores deserve a cautious interpretation. Like appro-
priateness measurement, the caution index and its derivatives have a broad
array of applications, but this section is limited to problems discussed ear-
lier. The focus of pattern analysis is the S-P chart, which displays the right
and wrong answers for a class. Table 10.4 is adapted from Tatsuoka and Linn
(1983) and contains the right and wrong responses to 10 items for 15 stu-
dents. Not only does the S-P chart identify aberrant scores, but it also iden-
TABLE 10.4
Students-Problems (S-P) Chart for a Class of 15 Students on a 10-Item Test

                              Items
Person      1    2    3    4    5    6    7    8    9   10   Total
   1        1    1    1    1    1    1    1    1    1    1     10
   2        1    1    1    1    1    1    1    1    1    0      9
   3        1    1    1    1    1    0    1    1    0    1      8
   4        1    0    1    1    1    1    0    1    0    0      6
   5        1    1    1    1    0    1    0    0    1    0      6
   6        1    1    1    0    1    0    1    0    1    0      6
   7        1    1    1    1    0    0    1    0    0    0      5
   8        1    1    1    0    1    1    0    0    0    0      5
   9        1    0    0    1    0    1    0    1    1    0      5
  10        1    1    0    1    0    0    1    0    0    1      5
  11        0    1    1    1    1    0    0    0    0    0      4
  12        1    0    0    0    1    1    0    0    0    0      3
  13        1    1    0    0    0    1    0    0    0    0      3
  14        1    0    1    0    0    0    0    0    0    0      2
  15        0    1    0    0    0    0    0    0    0    0      1
Number
correct    13   11   10    9    8    8    6    5    5    4
p value   .87  .73  .67  .60  .53  .53  .40  .33  .23  .27
tifies items with aberrant item response patterns. The S-P chart is based on
two boundaries, the S curve and the P curve, and a student-item matrix of
item responses. Students are ordered by scores, and items are placed from
easy on the left side of the chart to hard on the right. The S curve is con-
structed by counting the number correct for each student and drawing the
boundary line just to the right of that many item responses in that student's row. For the
15 students, there are 15 boundary lines that are connected to form the S
curve. If a student answers items correctly outside of the S curve (to the
right of the S curve), this improbable result implies that the score should be
considered cautiously. Similarly, if a student misses an item inside of the S
curve (to the left of the S curve), this improbable result implies that the stu-
dent failed items that a student of this achievement level would ordinarily
answer correctly. In the first instance, the student passed items that would
normally be failed. Referring to Table 10.4, Student 9 answered Items 6, 8,
and 9 correctly, which would normally be missed by students at this level of
achievement. Student 9 also missed two easy items. A total of 5 improbable
responses for Student 9 points to a potential problem of interpretation for
this student score of 5 out of 10 (50%). The P curve is constructed by count-
ing the number right in the class for each item and drawing a boundary line
below that item response in the matrix. For example, the first item was cor-
rectly answered by 13 of 15 students so the P curve boundary line is drawn
below the item response for the 13th student. Analogous to the S curve, it is
improbable to miss an item above the P curve and answer an item below the
P curve correctly. Item 6 shows that three high-scoring students missed this
item whereas three low-scoring students answered it correctly. Item 6 has
an aberrant response pattern that causes us to look at it more closely. A vari-
ety of indexes is available that provide numerical values for each student
and item (see Meijer et al., 1996; Tatsuoka & Linn, 1983).
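The boundary logic just described can be expressed compactly in code. The sketch below sorts a response matrix as in an S-P chart and flags improbable responses; it is a simplified illustration under the assumptions stated in the comments and does not reproduce any particular published caution index.

import numpy as np

def sp_chart_flags(X):
    # Sort a persons-by-items 0/1 matrix as in an S-P chart and flag
    # improbable responses: a wrong answer inside the student's S-curve
    # boundary or a correct answer outside it.  Ties are broken arbitrarily.
    X = np.asarray(X)
    rows = np.argsort(-X.sum(axis=1), kind="stable")    # high scorers first
    cols = np.argsort(-X.mean(axis=0), kind="stable")   # easy items first
    X = X[rows][:, cols]

    flags = np.zeros_like(X, dtype=bool)
    for i, row in enumerate(X):
        k = int(row.sum())            # the S-curve boundary for this student
        flags[i, :k] = row[:k] == 0   # missed an item inside the boundary
        flags[i, k:] = row[k:] == 1   # passed an item outside the boundary
    return X, flags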
One method that appears in the person-fit literature is U3. The underlying
assumption for this method is that for a set of examinees with a specific total
score, their item response patterns can be compared. If number correct is
identical, examinees with aberrant score patterns are subject to further con-
sideration for misfitting. Van der Flier (1982) derived this person-fit statistic
and studied its characteristics. The premise of U3 is the comparison of proba-
bilities of an item score pattern in conjunction with the probability of the pat-
tern of correct answers. The index is zero if the student responses follow a
Guttman pattern. An index of one indicates a completely reversed Guttman pattern. Meijer,
Molenaar, and Sijtsma (1994) evaluated U3, finding it to be useful for detect-
ing item response problems. In a series of studies by Meijer and his associates
(Meijer, 1996; Meijer & Sijtsma, 1995; Meijer, Molenaar, & Sijtsma, 1999;
Meijer, Muijtjens, & van der Vleuten, 1996), a number of positive findings
were reported for U3. One important finding was that this method works best
under conditions of higher reliability, longer tests, and situations where a
Factors that contribute to aberrant response patterns have not been adequately
studied. Such studies should involve procedures such as the think-aloud methods discussed
in chapters 4 and 8. Test takers could be interviewed and the reasons for their pat-
terns of response documented. Ultimately, we should be willing to invalidate a test
score based on aberrant performance.
The statistical science of person fit appears to lack a conceptual or theo-
retical basis that comes from an understanding of what aberrant test takers
do. Rudner, Bracey, and Skaggs (1996) suggested that person fit was
nonproductive when applied to a high-quality testing program. In their
study, only 3% of their sample had person-fit problems. This percentage
seems small. Such a result may generate rival hypotheses. Is person fit in
these data not much of a problem or are the methods used insensitive to the
range of aberrant response patterns that may exist? The statistical methods
for studying person fit do not seem sufficient for detecting unusual patterns
arising from cheating, inappropriate coaching, and other problems dis-
cussed in Table 10.3. We need a more systematic study of this problem with
better use of research methods that explore the psychological basis for aber-
rant response patterns.
DIMENSIONALITY
Defining Dimensionality
Messick (1989) stated that a single score on a test implies a single dimension. If
a test contains several dimensions, a multidimensional approach should be
used and one score for each dimension would be justified. A total test score
from a multidimensional test is subject to misinterpretation or misuse because
differential performance in any dimension might be overlooked when forming
this composite score. An examinee might score high in one dimension and low
in another dimension, and a total score would not reveal this kind of differen-
tial performance. Reporting only the composite also implies that a low score
on one dimension does not matter. In credentialing testing, low scores can have negative
consequences for future professional practice.
As we know, one of the most fundamental steps in the development of any
test is construct formulation where the trait to be measured is defined clearly.
That definition needs to state whether a single score is intended to describe the
trait or several scores are needed that differentiate critical aspects of the trait.
The underlying structure of item responses is fundamental to this definition
and eventually to the validity of interpreting and using test scores.
According to McDonald (1985), the history of cognitive measurement fo-
cused on making a test consisting of items that share a common factor or di-
mension. Simply put:
Each test should be homogeneous in content, and consequently the items on each
test should correlate substantially with one another. (Nunnally, 1977, p. 247)
The Standards for Educational and Psychological Testing (AERA et al., 1999) lists
specific standards pertaining to content-related validity evidence (1.2, 1.6, 3.2,
3.3, 3.5, 3.11, 7.3, 7.11, 13.5, 13.8, 14.8, 14.9, 14.10, 14.14). Essays by Messick
(1989, 1995a, 1995b) furnished further support for the importance of con-
tent-related evidence. Hattie (1985), McDonald (1981, 1985, 1999),
Nunnally and Bernstein (1994), and Tate (2002) all stressed the importance of
studies that provide validity evidence for the dimensionality of a test's item re-
sponses. As we build an argument for the validity of any test score interpreta-
tion or use, content-related evidence is primary. Studies that provide such
evidence are essential to the well-being of any testing program.
What are the implications for validity that arise from a study of dimensionality?
TABLE 10.5
Evidence Supporting the Independence of Traits
correlation matrix for two subscores measured in two tests. In terms of ob-
served correlations, median correlations among like traits should exceed the
median correlation among unlike traits.
Trait 1 is more highly correlated with Trait 1 on Test B than Trait 1 is
correlated with Trait 2 on both Test A and Test B. If we think that Trait 1
and Trait 2 are unique, we expect the correlation between Trait 1-Test A
and Trait 1-Test B to be higher than between trait correlations. We also
expect the correlation between Trait 2-Test A and Trait 2-Test B to be
higher than between trait correlations. Correlation coefficients are lim-
ited by the reliabilities of the two variables used to compute the correla-
tion coefficient. In the diagonal of the correlation matrix, the reliability
estimates are given. If we correct for unreliability, the corrected correla-
tion coefficients to the right of the reliability estimates give us an estimate
of the true relationship with the attenuating effect of measurement error removed. Note that
the "true" correlation of Trait 1-Test A and Trait 1-Test B is .89, which is
high. Thus, we can conclude that there is some evidence that Tests A and
B seem to measure the same trait, Trait 1. Also, note that the corrected correla-
tion coefficient for Trait 2-Test A and Trait 2-Test B is 1.00, which sug-
gests that the two tests are both measuring Trait 2. Table 10.5 shows good
convergent and discriminant validity evidence for Traits 1 and 2 for these
two tests, A and B.
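The correction for unreliability mentioned above is the familiar correction for attenuation: the observed correlation is divided by the square root of the product of the two reliabilities. The small sketch below uses hypothetical values rather than the actual entries behind Table 10.5.

from math import sqrt

def disattenuate(r_xy, rel_x, rel_y):
    # Correct an observed correlation for unreliability in both measures.
    return r_xy / sqrt(rel_x * rel_y)

# Hypothetical values: an observed cross-test correlation of .72 with
# reliabilities of .80 and .82 implies an estimated true-score
# correlation of about .89.
print(round(disattenuate(0.72, 0.80, 0.82), 2))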
Table 10.6 provides another matrix with different results. In this instance,
the correlations among traits are high, regardless of trait or test designation.
The corrected correlation coefficients to the right of the reliability estimates
are all high. In this instance, the validity evidence points to the convergence
of all measures on a single dimension. In other words, there is no indication
that Traits 1 and 2 are independent. Discriminant evidence is lacking. This
evidence points to a single dimension.
TABLE 10.6
Evidence Supporting Convergence of Traits for Method A
TABLE 10.7
Evidence Supporting the Independence of Each Method as Measuring a Trait
As Tate (2002) noted, any achievement test is likely to have some degree of
multidimensionality in its item responses. We need to determine whether that
degree is serious enough to undermine the validity of interpretations. Fortu-
nately, most scaling methods tolerate some multidimensionality. Factor analysis
and other methods reviewed here provide the evidence for asserting that a set
of item responses is sufficiently unidimensional.
It is strongly recommended that a study of dimensionality be routinely con-
ducted to confirm what the construct definition probably intended, a single
score that is sufficient to describe the pattern of item responses. If the construct
definition posits several dimensions, confirmatory factor analysis is recom-
mended, and the results should confirm this thinking.
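As a first, rough check of the kind recommended here, one might inspect the eigenvalues of the inter-item correlation matrix before moving to a confirmatory model. The sketch below is only that rough check and rests on the assumptions noted in its comments; tetrachoric correlations and formal confirmatory factor analysis are preferable for binary item responses.

import numpy as np

def eigenvalue_check(X):
    # Crude unidimensionality screen for a persons-by-items 0/1 matrix:
    # a first eigenvalue that dwarfs the second is often read as evidence
    # of one dominant dimension.  Pearson correlations are used here only
    # for simplicity.
    R = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    eigenvalues = np.sort(np.linalg.eigvalsh(R))[::-1]
    return eigenvalues, eigenvalues[0] / eigenvalues[1]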
In many circumstances where a single dimension is hypothesized, subscores
are thought to exist. We have the means for studying and validating item re-
sponses supporting subscore interpretations. In some instances, it is possible to
have a unidimensional interpretation with supporting empirical evidence for
subscore interpretation, as the study by Haladyna and Kramer (2003) showed.
However, establishing the validity of subscore interpretations in the face of
unidimensionality can be challenging. Gulliksen (1987) provided some guid-
ance on how to establish other validity evidence for subscore validity.
MC items are usually scored in a binary fashion, zero for an incorrect choice
and one for a correct choice. A total score is the sum of correct answers.
With the one-parameter IRT model, there is a transformation of the total
score to a scaled score. With the two- and three-parameter models, the
transformation to a scaled score is more complex because items are
weighted so that any raw score can have different scaled scores based on the
pattern of correct answers. With the traditional binary-scoring IRT models,
no recognition is given to the differential nature of distractors. This section
deals with the potential of using information from distractors for scoring
MC tests. The use of distractor information for test scoring is believed to in-
crease the reliability of test scores, which in turn should lead to more accu-
rate decisions in high-stakes pass-fail testing.
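The point that a single raw score can map to different scaled scores under the two- and three-parameter models is easy to demonstrate. The sketch below estimates theta by maximum likelihood under a 2PL model for two hypothetical examinees with the same number-correct score but different response patterns; the item parameters are invented for illustration.

import numpy as np

def theta_mle_2pl(responses, a, b, grid=np.linspace(-4, 4, 801)):
    # Maximum-likelihood theta under a 2PL model, found by grid search.
    # a and b are item discrimination and difficulty parameters, assumed
    # to have been estimated elsewhere.
    u = np.asarray(responses, dtype=float)
    p = 1.0 / (1.0 + np.exp(-a * (grid[:, None] - b)))       # grid points x items
    loglik = (u * np.log(p) + (1 - u) * np.log(1 - p)).sum(axis=1)
    return grid[np.argmax(loglik)]

# Invented item parameters: two examinees with the same raw score of 3
# receive different scaled scores because they answered different items.
a = np.array([0.6, 0.8, 1.0, 1.4, 1.8])
b = np.array([-1.5, -0.5, 0.0, 0.8, 1.5])
print(theta_mle_2pl([1, 1, 1, 0, 0], a, b))   # correct on the three easiest items
print(theta_mle_2pl([0, 0, 1, 1, 1], a, b))   # correct on the three hardest items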
The answer is yes. Traditional methods for studying distractor functioning are
convincing of this fact (Haladyna & Downing, 1993; Haladyna & Sympson,
1988; Levine & Drasgow, 1983; Thissen, 1976; Thissen, Steinberg, &
Fitzpatrick, 1989; Thissen, Steinberg, & Mooney, 1989; Wainer, 1989).
As indicated in chapter 9, one of the best ways to study distractor perfor-
mance for a test item is using trace lines. Effective distractors have a mono-
tonically decreasing trace line. A flat trace line indicates a nondiscriminating
distractor. A trace line close to the origin has a low frequency of use, which sig-
nals that the distractor may be so implausible that even low-achieving exam-
inees do not select it.
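Empirical trace lines of this kind can be approximated by grouping examinees on total score and computing the proportion choosing each option within each group. The sketch below does so; the option labels and the grouping into quintiles are illustrative choices, not features of any particular program cited here.

import numpy as np

def option_trace_lines(choices, total_score, options=("A", "B", "C", "D"), n_groups=5):
    # Proportion of examinees choosing each option within ordered score groups.
    # Plotting these proportions from the lowest to the highest group traces
    # each option; distractor lines should decrease.
    choices = np.asarray(choices)
    total_score = np.asarray(total_score)
    cuts = np.quantile(total_score, np.linspace(0, 1, n_groups + 1))
    group = np.clip(np.searchsorted(cuts, total_score, side="right") - 1, 0, n_groups - 1)

    lines = {}
    for opt in options:
        props = []
        for g in range(n_groups):
            members = choices[group == g]
            props.append(float(np.mean(members == opt)) if members.size else float("nan"))
        lines[opt] = props
    return lines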
quire this iterative feature, experience shows that a single estimation is close to
the iterative result (Haladyna, 1990). In the framework of a certification or li-
censing examination, the reciprocal averages produced positive results but the
computational complexity brings to our attention a major limitation of this
method. Schultz (1995) also provided results showing that option weighting
performs better than simple dichotomous scoring with respect to alpha reliabil-
ity and decision-making consistency.
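The sketch below shows a single pass of option weighting in the spirit of the reciprocal-averages approach: each option is weighted by the mean criterion score of the examinees who chose it, and examinees are then rescored with those weights. The variable names and the use of number correct as the initial criterion are assumptions for illustration; the full procedure iterates this step until the weights stabilize, although, as noted above, a single estimation is typically close to the iterative result.

import numpy as np

def option_weighted_scores(choices, key):
    # choices: persons-by-items array of selected options.
    # key: keyed option per item, used only to form the initial
    #      number-correct criterion.
    choices = np.asarray(choices)
    key = np.asarray(key)
    criterion = (choices == key).sum(axis=1).astype(float)   # initial scores

    scores = np.zeros(choices.shape[0])
    for j in range(choices.shape[1]):
        for opt in np.unique(choices[:, j]):
            chose = choices[:, j] == opt
            scores[chose] += criterion[chose].mean()          # the option's weight
    return scores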
lower half of the test score distribution. If such precision is desirable, poly-
tomous scoring of MC item responses should be done. If, however, the need for
precision is in the upper half of the test score distribution, polytomous scoring
will not be very helpful.
SUMMARY
This chapter focuses on four problems that affect the validity of test score in-
terpretations and uses. All four problems involve item responses. As we think
of each of these four problems, studies related to each become part of the valid-
ity evidence we can use to support interpretations and uses of test scores. Re-
search on these problems also addresses threats to validity that are not often
considered. By examining each threat to validity and taking remedial action
where justified, we can strengthen the overall argument for each test score in-
terpretation and use.
IV
The Future of Item
Development and Item
Response Validation
11
New Directions
in Item Writing and
Item Response Validation
OVERVIEW
This book focuses on two important activities in test development, the devel-
opment of the test item and the validation of responses to the item.
In this chapter, these two interrelated topics are evaluated in terms of their
pasts and their futures.
In this final chapter, the science of item development is discussed in the con-
texts that affect its future. These contexts include (a) the role of policy at na-
tional, state, and local levels, politics, and educational reform; (b) the unified
approach to validity; (c) the emergence of cognitive psychology as a prevailing
learning theory and the corresponding retrenchment of behaviorism; and (d)
changes in the way we define outcomes of schooling and professional training.
These four contexts will greatly influence the future of item development.
Item response validation has rested on statistical theories of test scores;
therefore, fewer changes have occurred recently. The progress of polytomous
IRTs in recent years and computer software that applies these theories repre-
sent a significant advance.
in Roid and Haladyna (1982) did not result in further research and develop-
ment. In fact, these theories have been virtually abandoned. Bennett and
Ward (1993) published a set of papers that extended our understanding of the
similarities and differences between MC and CR item formats. In Test Theory
for a New Generation of Tests, Frederiksen et al. (1993) provided us with a
promising set of theories that linked item development to cognitive learning
theory. This effort has been followed by more extensive study of item formats
and their cognitive demands, as chapter 3 in this book shows. Irvine and
Kyllonen (2002) introduced us to more recent item development theories.
An important feature of this new work is that it includes both MC and CR for-
mats. Another important feature of this new work is that cognitive science is
strongly linked to these efforts. Where these new theories take us will depend
on these contextual factors.
testing in the national, state, and local school districts, the education platforms
of political parties have a major influence on the testing policies and practices
in each jurisdiction.
School reform appears to have received its impetus from the report A Nation
At Risk (National Commission on Educational Excellence, 1983). The legisla-
tion known as the No Child Left Behind Act of 2001 has provided sweeping in-
fluence over student learning, achievement testing, and accountability.
Another significant movement is restructuring of schools, which is more sys-
temic and involves decentralized control of schools by parents, teachers, and
students. Charter schools are one result of this movement.
One of many forces behind the reform movement has been the misuse of
standardized test scores. In recent years, test scores have been used in ways
unimagined by the original developers and publishers of these tests
(Haladyna, Haas, & Allison, 1998; Mehrens & Kaminski, 1989; Nolen et al.,
1992). The need for accountability has also created a ruthless test score im-
provement industry where vendors and educators employ many questionable
practices to raise test scores in high-stakes achievement tests (Cannell, 1989;
Nolen et al., 1992). This unfortunate use of test scores has led to the issuing of
guidelines governing the use of test scores in high-stakes testing programs by
the AERA (2000).
With respect to school reform, traditional ideas and practices will be reex-
amined and reevaluated. This reform movement will lead to new testing para-
digms where some of these traditional ideas and practices will survive, but
others will not. Indeed, this change is already under way. Performance testing
has affected educational testing in the nation, in states, and in classrooms, as
well as teaching itself.
MC testing has enjoyed a renaissance as policymakers and educators realize
that the foundation of most education and training is acquisition of knowledge.
MC is still the best way to measure knowledge. Also, MC is useful in approxi-
mating many types of higher level thinking processes. As we get better in using
new MC formats to measure more complex cognitive behavior, our ability to
design better MC tests is increasing.
Validity
The unified view of validity has overtaken the traditional way of studying valid-
ity, thanks to the important work of Messick (1984, 1989, 1995a, 1995b) and
many others. This view is articulated in chapter 1 and is linked to virtually ev-
ery chapter in this book. The future of item development is strongly linked to
the idea that what we do in item development yields a body of validity evidence
that adds to the mix of evidence we evaluate when making judgments about
the validity of any test score interpretation or use.
As the test item is the most basic unit of measurement, it matters greatly
that we address the issue of validity evidence at the item and item response
levels. Not only is this body of validity evidence relevant to items but it is also
relevant to the body of evidence we use to support validity for test score inter-
pretation or use.
Cognitive Psychology
Although we are far from having a unified learning theory, the groundwork
is being laid. Given these 11 qualities of this emerging unified, cognitive the-
ory of school learning, present-day teaching and testing practices seem al-
most obsolete. The future of item development and item response validation
in measuring student learning should be quite different from current prac-
tices as illustrated in this volume.
Two related but different barriers exist that affect the future of item develop-
ment. The first barrier is the lack of construct definition. Cognitive psycholo-
gists and others have used a plethora of terms representing higher level
thinking, including metacognition, problem solving, analysis, evaluation,
comprehension, conceptual learning, critical thinking, reasoning, strategic
knowledge, schematic knowledge, and algorithmic knowledge, to name a
few. The first stage in construct validity is construct definition. These terms
are seldom adequately defined so that we can identify or construct items that
measure these traits. Thus, inadequate attention to the most basic step in construct validity, construct
definition, continues to inhibit both the development of many higher level
thinking behaviors and their measurement. As the focus changes from
knowledge and skills to these learnable, developing cognitive abilities, we
will have to identify and define these abilities better than we have in the past,
as Cole (1990) observed.
The second barrier is the absence of a validated taxonomy of complex cogni-
tive behavior. Studies of teachers' success with using higher level thinking
questions lead to inconclusive findings because of a variety of factors, including
methodological problems (Winne, 1979). Many other studies and reports at-
test to the current difficulty of successfully measuring higher level thinking
with the kind of scientific rigor required in construct validation. Royer et al.
(1993) proposed a taxonomy of higher level behavior and reviewed research on
its validity. This impressive work is based on a cognitive learning theory pro-
posed by Anderson (1990). Although the taxonomy is far from being at the im-
plementation stage, it provides a reasonable structure that invites further study
and validation.
Item writing in the current environment cannot thrive because of these
two barriers. Advances in cognitive learning theory should lead to better con-
struct definitions and organization of types of higher level thinking that will
sustain more productive item development, leading to higher quality
achievement tests.
Once constructs are defined and variables are constructed, testing provides
one basis for the empirical validation of test score interpretations and uses. In
this context, a statistical theory of test scores is adopted, and this theory can be
applied to item responses with the objective of evaluating and improving items
until they display desirable item response patterns.
Classical test theory has its roots in the early part of the 20th century and has
grown substantially. It is still widely accepted and used in testing programs de-
spite the rapid and understandable emergence of IRTs. For many reasons enu-
merated in chapter 8 and in other sources (e.g., Hambleton & Jones, 1993;
Hambleton, Swaminathan, & Rogers, 1991), classical theory has enough defi-
ciencies to limit its future use. Nonetheless, its use is encouraged by its familiar-
ity to the mainstream of test users.
Generalizability theory is a neoclassical theory that gives users the ability to
study sources of error in cognitive measurement using familiar analysis of vari-
ance techniques. Cronbach, Nanda, Rajaratnam, and Gleser (1972) formu-
lated a conceptual framework for generalizability. Brennan (2001) showed how
generalizability theory can be used to study the sources of measurement error in
many settings involving both MC and CR formats.
IRTs have developed rapidly in recent years, largely due to the efforts of the-
orists such as Rasch, Birnbaum, Lord, Bock, Samejima, and Wright, to name a
few. These theories are increasingly applied in large-scale testing programs.
Computer software is user friendly. Although IRT does not make test score in-
terpretations more valid (Linn, 1990), it provides great ability to scale test
scores to avoid CIV that arises from nonequivalent test forms.
In Test Theory for a New Generation of Tests, Frederiksen et al. (1993) as-
sembled an impressive and comprehensive treatment of ongoing theoretical
work, representing a new wave of statistical test theory. This collection of pa-
pers is aimed at realizing the goal of unifying cognitive and measurement per-
spectives with emphasis on complex learning. Mislevy (1993) distinguished
much of this recent work as departing from low-to-high proficiency testing, in
which a total score has meaning, toward pattern scoring, where wrong answers have
diagnostic value. In this setting, the total score does not inform us about how a
learner reached the final answer to a complex set of activities.
An appropriate analysis of patterns of responses may inform us about the
effectiveness of a process used to solve a problem. In other words, patterns
of responses, such as derived from the context-dependent item set, may lead
to inferences about optimal and suboptimal learning. Theoretical develop-
ments by Bejar (1993), Embretsen (1985), Fischer (1983), Haertel and
Wiley (1993), Tatsuoka (1990), and Wilson (1989) captured the rich array
of promising new choices. Many of these theorists agree that traditional
CTT and even present-day IRTs may become passé because they are inade-
quate for handling complex cognitive behavior. As with any new theory, ex-
tensive research leading to technologies will take considerable time and
resources.
These new statistical theories have significant implications for item re-
sponse validation. Traditional item analysis was concerned with estimating
item difficulty and discrimination. Newer theories will lead to option-re-
sponse theories, where right and wrong answers provide useful information,
and patterns of responses provide information on the success of learning
complex tasks.
In this section, two topics are addressed. First, the status of item writing is de-
scribed. Second, the characteristics of future item development are identified
and described. A worthwhile goal should be to abandon the current prescrip-
tive method for writing items and work within the framework of an item-writ-
ing theory that integrates with cognitive learning theory.
Critics have noted that item writing is not a scholarly area of testing (e.g.,
Cronbach, 1970; Nitko, 1985). Item writing is characterized by the collective
wisdom and experience of measurement experts who often convey this knowl-
edge in textbooks (Ebel, 1951). Another problem is that item writing is not es-
pecially well grounded in research. Previous discussions of item development
in Educational Measurement (Lindquist, 1951; Linn, 1989; Thorndike, 1970)
have treated item writing in isolation from other topics, such as validity, reliability,
and item analysis. Cronbach (1971), in his classic chapter
on validation, provided scant attention to the role of items and item responses
in test validation. Messick (1989), on the other hand, referred to the impor-
tance of various aspects of item development and item response validation on
construct validity. The current unified view of validity explicitly unites many
aspects of item development and item response validation with other critical
aspects of construct validation. But this is only a recent development.
Downing and Haladyna (1997) emphasized the role of item development and
validation in test score validation.
The criterion-referenced testing movement brought sweeping reform to test
constructors at all levels by focusing attention on instructional objectives.
Each item needed to be linked to an instructional objective. Test items were
painstakingly matched to objectives, and collections of items formed tests that
putatively reflected these objectives. The integration of teaching and testing
produced predictable results: a high degree of learning, provided that time for learn-
ing was flexible enough to accommodate slow learners. The dilemma was how specific
to make the objective. Objectives too specific limited the degree to which we
could generalize; objectives too vague produced too much inconsistency in
item development, resulting in disagreement among content experts about the
classifications of these items. No single test item or even small sample of test
items was adequate for measuring an objective. The widespread use of instruc-
tional objectives in education and training is remarkable. But the criticism of
this approach is that learning can seem fragmented and piecemeal. Too often,
students do not learn to use knowledge and skills to perform
some complex cognitive operation.
The current reform movement and the current emphasis on performance
testing have caused a reconsideration of the usefulness of the instructional ob-
jective. Because criterion-referenced testing is objective driven, it may be re-
This section addresses some characteristics that these new item-writing the-
ories must possess to meet the challenge of measuring complex behavior.
These characteristics draw heavily from current thinking in cognitive science
but also rely on this item-writing legacy.
Computers now present examinees with tasks of a complex nature, with in-
teractive components that simulate real-life, complex decision making.
Scoring can offer several pathways to correct answers, and scoring can be au-
tomated. The fidelity of such creative testing is being demonstrated in com-
puter-delivered licensing tests in architecture. Mislevy (1996b) made a good
point about this emerging technology. If the information provided is no better
than that provided by conventional MC, the innovation seems pointless. These
innovations must provide something well beyond what is available using for-
mats presented in chapter 4.
Inference Networks
Item-Generating Ability
As more testing programs offer tests via computers and the format is adap-
tively administered, the need for validated items grows. Testing programs will
have to have large supplies of these items to adapt tests for each examinee on
a daily basis.
Present-day item writing is a slow process. Item writers require training.
They are assigned the task of writing items. We expect these items to go
through a rigorous battery of reviews. Then these items are administered, and
if the performance is adequate, the item is deposited in the bank. If the item
fails to perform, it is revised or discarded. We can expect about 60% of our
items to survive. This state of affairs shows why item-writing methods need to
improve.
Ideally any new item-writing theory should lead to the easy generation of
many content-relevant items. A simple example shows how item-generating
schemes can benefit item writing. In dental education, an early skill is learning
to identify tooth names and numbers using the Universal Coding System. Two
objectives can be used to quickly generate 104 test items:
Because there are 32 teeth in the adult dentition, a total of 64 items defines the
domain. Because the primary dentition has 20 teeth, 40 more items are possi-
ble. Each item can be MC, or we can authentically assess a dental student's ac-
tual performance using a patient. Also, a plaster or plastic model of the adult or
child dentition can be used. If domain specifications were this simple in all edu-
cational settings, the problems of construct definition and item writing would
be trivial.
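A small generator makes the arithmetic concrete: two stems per tooth across the 32 permanent and 20 primary teeth yield the 104 items mentioned above. The stem wording below is invented for illustration and is not drawn from an actual dental examination; the tooth-name lookup is deliberately omitted.

def generate_tooth_items():
    # Two stems per tooth: name the tooth from its code, and give the code
    # from its name.  An operational item would state the tooth's actual name.
    adult = [str(n) for n in range(1, 33)]                        # 32 permanent teeth
    primary = [chr(c) for c in range(ord("A"), ord("T") + 1)]     # 20 primary teeth

    items = []
    for code in adult + primary:
        items.append(f"Name the tooth designated {code} in the Universal Coding System.")
        items.append(f"Give the Universal Coding System designation of the tooth named <tooth {code}>.")
    return items

print(len(generate_tooth_items()))   # 104 = 64 adult-dentition items + 40 primary-dentition items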
In Item Generation for Test Development, Irvine and Kyllonen (2002) assem-
bled an impressive set of papers by scholars who have proposed or developed
new ways to generate items. Beginning where Roid and Haladyna (1982) left
off, this volume reports the efforts of many dating from the mid-1980s to the
present.
Irvine (2002) characterized current item-generation efforts as falling
into three categories: R, L, and D models. The R model is traditional and in-
volves item development as depicted in this volume. Item writers replenish
item banks. The machinery of CTT or IRT is used to produce equated tests
so that construct-irrelevant difficulty or easiness is not a threat to validity.
The L model, which has failed, emphasizes latency in responding. In a few in-
stances, speed of responding is importantly related to a cognitive ability, but
for the most part, L models do not have a history that supports their continu-
ance. The D model offers continuous testing during the learning period.
Items and tests must be independent, and change is recorded on an individ-
ual basis toward a goal.
Irvine (2002) also saw technology as one of the most influential factors in fu-
ture item generation. Computer-based and computer-adaptive testing includes
variations in display, information, and response modes to consider.
With respect to specific, promising item-writing theories, Bejar (1993)
proposed the response generative model (RGM) as a form of item writing that is
superior to these earlier theories because it has a basis in cognitive theory,
whereas these earlier generative theories have behavioristic origins. The
RGM generates items with a predictable set of parameters, from which clear
interpretations are possible. Bejar presented evidence from a variety of re-
searchers, including areas such as spatial ability, reasoning, and verbal ability.
The underlying rationale of the RGM is that item writing and item response
are linked predictably. Every time an item is written, responses to that item
can confirm the theory. Failure to confirm would destroy the theory's credi-
bility. Bejar maintained that this approach is not so much an item-writing
method, a content-specification scheme, or a cognitive theory but a philoso-
phy of test construction and response modeling that is integrative.
The RGM has tremendous appeal to prove or disprove itself as it is used.
It has the attractive qualities of other generative item-writing theories,
namely, (a) the ability to operationalize a domain definition, (b) the abil-
ity to generate objectively sufficient numbers of items, and (c) the ease
with which relevant tests are created with predictable characteristics. Ad-
ditionally, RGM provides a basis for validating item responses and test
scores at the time of administration. What is not provided in Bejar's the-
ory thus far are the detailed specifications of the use of the theory and the
much-needed research to transform theory into technology. Like earlier
theories, significant research will be needed to realize the attractive
claims for this model.
Misconception Strategies
Conclusion
This section discusses the future of item writing. Item writing lacks the rich
theoretical tradition that we observe with statistical theories of test scores.
Item analysis has been a stagnant field in the past, limited to the estimation of
item difficulty and discrimination using CTT or IRT, and the counting of re-
sponses to each distractor. Successive editions of Educational Measurement
(Lindquist, 1951; Linn, 1989; Thorndike, 1970) documented this unremark-
able state of affairs. The many influences described in this chapter, coupled
with growth in cognitive and item-response theories, have provided an oppor-
tunity to unify item development and item response validation in a larger con-
text of the unified approach to validity. The tools and understanding that are
developing for more effective treatment of item responses have been charac-
terized in this book as item response validation. The future of item response
validation will never be realized without significant progress in developing a
workable theory of item writing.
Chapter 9 discusses item response validation, and chapter 10 presents
methods to study specific problems. An important linkage is made between
item response validation and construct validation. Three important aspects of
item response validation that should receive more attention in the future are
distractor evaluation, a reconceptualization of item discrimination, and pat-
tern analysis. Because these concepts are more comprehensively addressed in
the previous chapter, the following discussion centers on the relative impor-
tance of each in the future.
Distractor Evaluation
The topic of distractor evaluation has been given little attention in the past.
Even the most current edition of Educational Measurement provides a scant
three paragraphs on this topic (Millman & Greene, 1989). However, Thissen,
Steinberg, and Fitzpatrick (1989) supported the study of distractors. They
stated that any item analysis should consider the distractor as an important
part of the item. Wainer (1989) provided additional support, claiming that the
graphical quality of the trace line for each option makes the evaluation of an
item response more complex but also more complete. Because trace lines are
pictorial, they are less daunting to item writers who may lack the statistical
background needed to deal with option discrimination indexes.
The traditional item discrimination index provides a useful and conve-
nient numerical summary of item discrimination, but it tends to overlook the
relative contributions of each distractor. Because each distractor contains a
plausible incorrect answer, item analysts are not afforded enough guidance
about which distractors might be revised or retired to improve the item per-
formance. Changes in distractors should lead to improvements in item per-
formance, which in turn should lead to improved test scores and more valid
interpretations.
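One simple way to examine the contribution of each distractor is an option-level point-biserial, correlating the choice of each option with the total score. The sketch below computes these values; the option labels and data layout are illustrative assumptions rather than the output of any particular item analysis program.

import numpy as np

def option_point_biserials(choices, total_score, options=("A", "B", "C", "D")):
    # Correlation between choosing each option and the total score.  The keyed
    # option should correlate positively; a distractor with a near-zero or
    # positive correlation is a candidate for revision or retirement.  An
    # option chosen by no one yields an undefined (nan) value.
    choices = np.asarray(choices)
    total_score = np.asarray(total_score, dtype=float)
    return {opt: float(np.corrcoef((choices == opt).astype(float), total_score)[0, 1])
            for opt in options}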
There are at least three good reasons for evaluating distractors. First, the
distractor is part of the test item and should be useful. If it is not useful, it
should be removed. Useless distractors have an untoward effect on item dis-
crimination. Second, with polytomous scoring, useful distractors contrib-
ute to more effective scoring, which has been proven to affect positively test
score reliability. Third, as cognitive psychologists lead efforts to develop
distractors that pinpoint misconceptions, distractor evaluation techniques
will permit the empirical validation of distractor responses and thereby im-
prove our ability to provide misconception information to instructors and
students.
Item Discrimination
Complex behavior requires many mental steps. New theories propose to model
cognitive behavior using statistical models that examine patterns of responses
among items, as opposed to traditional item analysis that merely examines the
pattern of item response in relation to total test score (Frederiksen et al., 1993;
Mislevy, 1993).
Some significant work is currently being done with context-dependent item
sets. Wainer and Kiely (1987) conceptualized item sets as testlets. Responses to
testlets involve the chaining of response, and specific patterns have more value
than others. Although this pattern analysis does not fulfill the promise of cogni-
tive psychologists regarding misconception analysis, testlet scoring takes a ma-
jor first step into the field of item analysis for multistep thinking and the
relative importance of each subtask in a testlet. Chapters 9 and 10 discuss item
response models and computer software that exist for studying various scoring
methods. As cognitive psychologists develop constructs to the point that item
writing can produce items reflecting multistep thinking, response pattern anal-
ysis will become more statistically sophisticated and useful.
SUMMARY
(1971) prophesied that item writing will become automated to eliminate the
caprice and whims of human item writers. When this objectivity is realized,
achievement testing will improve. Creativity will be needed at an earlier stage
with content specification procedures, such as inference networks, that will
automate the item-writing process, but individual creativity associated with
item writers will disappear.
With item response validation, the advent of polytomous IRT has made it
more likely that we will explore the potential for developing distractors that in-
crease the likelihood of polytomous scoring of MC item responses. Conse-
quently, more attention will be given to distractor response patterns that
diagnose wrong thinking in a complex behavior, and the trace line will be a use-
ful and friendly device to understand the role that each distractor plays in
building a coherent item. Both item writing and item response validation are
important steps in test development and validation. As cognitive psychologists
better define constructs and identify the constituent steps in complex thinking,
item development and item response validation should evolve to meet the
challenge. Both item writing and item response validation will continue to play
an important role in test development. Both steps in test development will re-
quire significant research in the context of this unified theory involving both
cognitive science and statistical test score theory.
Finally, it would be remiss not to point out the increasing role of CR perfor-
mance testing in testing cognitive abilities. The CR format has received much
less scholarly attention and research than the MC format. Item writing will cer-
tainly be a unified science of observation where MC and CR assume appropri-
ate roles for measuring aspects of knowledge, skills, and abilities. The road to
better item development and item response validation will be long, as there is
still much to accomplish.
References
Abedi, J., Lord, C., Hofstetter, C., & Baker, E. (2000). Impact of accommodation strate-
gies on English language learners' test performance. Educational Measurement: Issues
and Practice, 19(3), 16-26.
Adams, R., Wu, M., & Wilson, M. (1998). ConQuest [Computer program]. Camber-
well: Australian Council for Educational Research.
Alagumalai, S., & Keeves, J. P. (1999). Distractors—Can they be biased too? Journal of
Outcome Measurement, 3(1), 89-102.
Albanese, M. A. (1992). Type K items. Educational Measurement: Issues and Practices, 12,
28-33.
Albanese, M. A., Kent, T. A., & Whitney, D. R. (1977). A comparison of the difficulty,
reliability, and validity of complex multiple-choice, multiple response, and multiple
true-false items. Annual Conference on Research in Medical Education, 16, 105-110.
Albanese, M. A., & Sabers, D. L. (1988). Multiple true-false items: A study of interitem
correlations, scoring alternatives, and reliability estimation. Journal of Educational
Measurement, 25, 111-124.
American Educational Research Association (2000). Position statement of the Ameri-
can Educational Research Association concerning high-stakes testing in pre K-12
education. Educational Researcher, 29, 24-25.
American Educational Research Association, American Psychological Association, &
National Council on Measurement in Education. (1999). Standards for educational
and psychological testing. Washington, DC: American Educational Research
Association.
American Psychological Association, American Educational Research Association, &
National Council on Measurement in Education. (1985). Standards for educational
and psychological testing. Washington, DC: American Psychological Association.
Anderson, J. R. (1990). The adaptive character of thought. Hillsdale, NJ: Lawrence
Erlbaum Associates.
Anderson, L., & Krathwohl, D. (2001). A taxonomy for learning, teaching and assessing: A
revision of Bloom's taxonomy of educational objectives. New York: Longman.
Anderson, L. W, & Sosniak, L. A. (Eds.). (1994). Bloom's taxonomy: A forty-year retro-
spective. Ninety-third Yearbook of the National Society for the Study of Education. Part II.
Chicago: University of Chicago Press.
Andres, A. M., & del Castillo, J. D. (1990). Multiple-choice tests: Power, length, and op-
timal number of choices per item. British Journal of Mathematical and Statistical Psy-
chology, 45, 57-71.
Andrich, D., Lyne, A., Sheridan, B., & Luo, G. (2001). RUMM2010: A Windows-
based computer program for Rasch unidimensional models for measurement
[Computer program]. Perth, Western Australia: Murdoch University, Social Mea-
surement Laboratory.
Andrich, D., Styles, I., Tognolini, J., Luo, G., & Sheridan, B. (1997, April). Identifying in-
formation from distractors in multiple-choice items: A routine application of IRT hypothe-
ses. Paper presented at the annual meeting of the National Council on Measurement
in Education, Chicago.
Angoff, W. H. (1974). The development of statistical indices for detecting cheaters.
Journal of the American Statistical Association, 69, 44-49.
Angoff, W. H. (1989). Does guessing really help? Journal of Educational Measurement,
26(4), 323-336.
Ansley, T. N., Spratt, K. E., & Forsyth, R. A. (1988, April). An investigation of the effects of
using calculators to reduce the computational burden on a standardized test of mathematics
problem solving. Paper presented at the annual meeting of the American Educational
Research Association, New Orleans, LA.
Assessment Systems Corporation. (1992). RASCAL (Rasch analysis program) [Com-
puter program]. St. Paul, MN: Author.
Assessment Systems Corporation. (1995). ITEMAN: Item and test analysis [Computer
program]. St. Paul, MN: Author.
Attali, Y., & Bar-Hillel, M. (2003). Guess where: The position of correct answers in mul-
tiple-choice test items as a psychometric variable. Journal of Educational Measure-
ment, 40, 109-128.
Attali, Y., & Fraenkel, T. (2000). The point-biserial as a discrimination index for
distractors in multiple-choice items: Deficiencies in usage and an alternative. Journal
of Educational Measurement, 37(1), 77-86.
Bar-Hillel, M., & Attali, Y. (2002). Seek whence: Answer sequences and their conse-
quences in key-balanced multiple-choice tests. The American Statistician, 56,299-303.
Bauer, H. (1991). Sore finger items in multiple-choice tests. System, 19(4), 453-458.
Becker, B. J. (1990). Coaching for Scholastic Aptitude Test: Further synthesis and ap-
praisal. Review of Educational Research, 60, 373-418.
Bejar, I. (1993). A generative approach to psychological and educational measurement.
In N. Frederiksen, R. J. Mislevy, & I. Bejar (Eds.). Test theory for a new generation of
tests (pp. 297-323). Hillsdale, NJ: Lawrence Erlbaum Associates.
Bejar, I. (2002). Generative testing: From comprehension to implementation. In S. H.
Irvine & P. C. Kyllonen (Eds.), Item generation for test development (pp. 199-217).
Mahwah, NJ: Lawrence Erlbaum Associates.
Beller, M., & Gafni, N. (2000). Can item format (multiple-choice vs. open-ended) ac-
count for gender differences in mathematics achievement? Sex Roles, 42 (1/2), 1-22.
Bellezza, F. S., & Bellezza, S. F. (1989). Detection of cheating on multiple-choice tests by
using error-similarity analysis. Teaching of Psychology, 16, 151-155.
Bennett, R. E. (1993). On the meaning of constructed response. In R. E. Bennett & W.
C. Ward (Eds.), Construction versus choice in cognitive measurement: Issues in con-
structed response, performance testing, and portfolio assessment (pp. 1-27). Hillsdale, NJ:
Lawrence Erlbaum Associates.
Bennett, R. E., Morley, M., Quardt, D., Rock, D. A., Singley, M. K., Katz, I. R., et al.
(1999). Psychometric and cognitive functioning of an under-determined com-
puter-based response type for quantitative reasoning. Journal of Educational Measure-
ment, 36(3), 233-252.
Bennett, R. E., Rock, D. A., & Wang, M. D. (1990). Equivalence of free-response and
multiple-choice items. Journal of Educational Measurement, 28, 77-92.
Bloom, B. S., Engelhart, M. D., Furst, E. J., Hill, W. H., & Krathwohl, D. R. (1956). Tax-
onomy of educational objectives. New York: Longmans Green.
Bock, R. D. (1972). Estimating item parameters and latent ability when responses are
scored in two or more nominal categories. Psychometrika, 37, 29-51.
Bock, R. D., Wood, R., Wilson, D. T., Gibbons, R., Schilling, S. G., & Muraki, E. (2003).
TESTFACT 4: Full information item factor analysis and item analysis. Chicago: Scien-
tific Software, International.
Bordage, G., Carretier, H., Bertrand, R., & Page, G. (1995). Academic Medicine, 70(5),
359-365.
Bordage, G., & Page, G. (1987). An alternate approach to PMPs: The key features con-
cept. In I. Hart & R. Harden (Eds.), Further developments in assessing clinical compe-
tence (pp. 57-75). Montreal, Canada: Heal.
Bormuth, J. R. (1970). On a theory of achievement test items. Chicago: University of Chi-
cago Press.
Breland, H. M., Danes, D. O., Kahn, H. D., Kubota, M. Y., & Bonner, M. W. (1994). Per-
formance versus objective testing and gender: An exploratory study of an advanced
placement history examination. Journal of Educational Measurement, 31, 275-293.
Breland, H. M., & Gaynor, J. (1979). A comparison of direct and indirect assessments of
writing skills. Journal of Educational Measurement, 6, 119-128.
Brennan, R. L. (2001). Generalizability theory. New York: Springer-Verlag.
Bridgeman, B., Harvey, A., & Braswell, J. (1995). Effects of calculator use on scores on a
test of mathematical reasoning. Journal of Educational Measurement, 32 (4), 323-340.
Bruno, J. E., & Dirkzwager, A. (1995). Determining the optimal number of alternatives
to a multiple-choice test item: An information theoretical perspective. Educational
and Psychological Measurement, 55, 959-966.
Burmester, M. A., & Olson, L. A. (1966). Comparison of item statistics for items in a
multiple-choice and alternate-response form. Science Education, 50, 467-470.
Camilli, G., & Shepard, L. (1994). Methods for identifying biased test items. Thousand
Oaks, CA: Sage Publications.
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the
multitrait-multimethod matrix. Psychological Bulletin, 56, 81-105.
Campbell, J. R. (2000). Cognitive processes elicited by multiple-choice and con-
structed-response questions on an assessment of reading comprehension. Disserta-
tion Abstracts International, Section A: Humanities and Social Sciences, 60(1-A), 2428.
Cannell, J. J. (1989). How public educators cheat on standardized achievement tests. Albu-
querque, NM: Friends for Education.
Carroll, J. B. (1963). A model for school learning. Teachers College Record, 64, 723-733.
Case, S. M., & Downing, S. M. (1989). Performance of various multiple-choice item
types on medical specialty examinations: Types A, B, C, K, and X. In Proceedings of the
Twenty-Eighth Annual Conference of Research in Medical Education (pp. 167-172).
Case, S. M., Holtzman, K., & Ripkey, D. R. (2001). Developing an item pool for CBT: A prac-
tical comparison of three models of item writing. Academic Medicine, 76(10), S111-S113.
Dawson-Saunders, B., Reshetar, R., Shea, J. A., Fierman, C. D., Kangilaski, R., &
Poniatowski, R A. (1992, April). Alterations to item text and effects on item difficulty and
discrimination. Paper presented at the annual meeting of the National Council on
Measurement in Education, San Francisco.
Dawson-Saunders, B., Reshetar, R., Shea, J. A., Fierman, C. D., Kangilaski, R., &
Poniatowski, R A. (1993, April). Changes in difficulty and discrimination related to alter-
ing item text. Paper presented at the annual meeting of the National Council on Mea-
surement in Education, Atlanta.
DeAyala, R. J., Plake, B. S., & Impara, J. C. (2001). The impact of omitted responses on
the accuracy of ability estimation in item response theory. Journal of Educational Mea-
surement, 38(3), 213-234.
de Gruijter, D. N. M. (1988). Evaluating an item and option statistic using the bootstrap
method. Tijdschrift voor Onderwijsresearch, 13, 345-352.
DeMars, C. E. (1998). Gender differences in mathematics and science on a high school
proficiency exam: The role of response format. Applied Measurement in Education,
11(3), 279-299.
Dibello, L. V., Roussos, L. A., & Stout, W. F. (1993, April). Unified cognitive/psychometric
diagnosis foundations and application. Paper presented at the annual meeting of the
American Educational Research Association, Atlanta, GA.
Dobson, C. (2001). Measuring higher cognitive development in anatomy and physiol-
ogy students. Dissertation Abstracts International: Section-B: The Sciences and Engi-
neering, 62(5-B), 2236.
Dochy, E, Moekerke, G., De Corte, E., &Segers, M. (2001). The assessment of quantita-
tive problem-solving with "none of the above"-items (NOTA items). European Jour-
nal of Psychology of Education, 26(2), 163-177.
Dodd, D. K., & Leal, L. (2002). Answer justification: Removing the "trick" from multi-
ple-choice questions. In R. A. Griggs (Ed.), Handbook for teaching introductory psy-
chology (Vol. 3, pp. 99-100). Mahwah, NJ: Lawrence Erlbaum Associates.
Dorans, N. J., & Holland, P. W. (1993). DIF detection and description: Mantel-Haenszel and standardization. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 35-66). Hillsdale, NJ: Lawrence Erlbaum Associates.
Dorans, N. J., & Potenza, M. T. (1993, April). Issues in equity assessment for complex re-
sponse stimuli. Paper presented at the annual meeting of the National Council on
Measurement in Education, Atlanta, GA.
Downing, S. M. (1992). True-false and alternate-choice item formats: A review of research. Educational Measurement: Issues and Practices, 11, 27-30.
Downing, S. M. (2002a). Construct-irrelevant variance and flawed test questions: Do
multiple-choice item-writing principles make any difference? Academic Medicine,
77(10), S103-S104.
Downing, S. M. (2002b). Threats to the validity of locally developed multiple-choice tests in medical education: Construct-irrelevant variance and construct underrepresentation. Advances in Health Sciences Education, 7, 235-241.
Downing, S. M., Baranowski, R. A., Grosso, L. J., & Norcini, J. J. (1995). Item type
and cognitive ability measured: The validity evidence for multiple true-false
items in medical specialty certification. Applied Measurement in Education, 8(2),
187-197.
Downing, S. M., & Haladyna, T. M. (1997). Test item development: Validity evidence from quality assurance procedures. Applied Measurement in Education, 10(1), 61-82.
Eurich, A. C. (1931). Four types of examination compared and evaluated. Journal of Ed-
ucational Psychology, 26, 268-278.
Fajardo, L. L., & Chan, K. M. (1993). Evaluation of medical students in radiology written testing using uncued multiple-choice questions. Investigative Radiology, 28(10), 964-968.
Farr, R., Pritchard, R., & Smitten, B. (1990). A description of what happens when an
examinee takes a multiple-choice reading comprehension test. Journal of Educational
Measurement, 27, 209-226.
Fenderson, B. A., Damjanov, I., Robeson, M. R., Veloski, J. J., & Rubin, E. (1997). The
virtues of extended matching and uncued tests as alternatives to multiple-choice
questions. Human Pathology, 28(5), 526-532.
Fischer, G. H. (1983). Logistic latent trait models with linear constraints. Psychometrika,
48, 3-26.
Fitzpatrick, A. R. (1981). The meaning of content validity. Applied Psychological Mea-
surement, 7, 3-13.
Forster, F. (1974). Sample size and stable calibration. Unpublished paper.
Frary, R. B. (1993). Statistical detection of multiple-choice test answer copying: Review
and commentary. Applied Measurement in Education, 6, 153-165.
Frederiksen, N. (1984). The real test bias: Influences of testing on teaching and learning. American Psychologist, 39, 193-202.
Frederiksen, N., Mislevy, R. J., & Bejar, I. (Eds.). (1993). Test theory for a new generation of tests. Hillsdale, NJ: Lawrence Erlbaum Associates.
Frisbie, D. A. (1973). Multiple-choice versus true-false: A comparison of reliabilities and concurrent validities. Journal of Educational Measurement, 10, 297-304.
Frisbie, D. A. (1992). The status of multiple true-false testing. Educational Measurement: Issues and Practices, 5, 21-26.
Frisbie, D. A., & Becker, D. F. (1991). An analysis of textbook advice about true-false tests. Applied Measurement in Education, 4, 67-83.
Frisbie, D. A., & Druva, C. A. (1986). Estimating the reliability of multiple-choice
true-false tests. Journal of Educational Measurement, 23, 99-106.
Frisbie, D. A., Miranda, D. U., & Baker, K. K. (1993). An evaluation of elementary textbook tests as classroom assessment tools. Applied Measurement in Education, 6, 21-36.
Frisbie, D. A., & Sweeney, D. C. (1982). The relative merits of multiple true-false
achievement tests. Journal of Educational Measurement, 19, 29-35.
Fuhrman, M. (1996). Developing good multiple-choice tests and test questions. Journal of Geoscience Education, 44, 379-384.
Gagne, R. M. (1968). Learning hierarchies. Educational Psychologist, 6, 1-9.
Gallagher, A., Levin, J., & Cahalan, C. (2002). Cognitive patterns of gender differences on mathematics admissions test. ETS Research Report 2-19. Princeton, NJ: Educational Testing Service.
Gardner, H. (1986). The mind's new science: A history of the cognitive revolution. New York:
Basic Books.
Gardner, H., & Hatch, T. (1989). Multiple intelligences go to school. Educational Researcher, 18, 4-10.
Garner, B. A. (Ed.). (1999). Black's Law Dictionary (7th ed.). St. Paul, MN: West Group.
Garner, M., & Engelhard, G., Jr. (2001). Gender differences in performance on multi-
ple-choice and constructed response mathematics items. Applied Measurement in Ed-
ucation, 12(1), 29-51.
Gitomer, D. H., & Rock, D. (1993). Addressing process variables in test analysis. In N.
Frederiksen, R. J. Mislevy, & I. J. Bejar (Eds.), Test theory for a new generation of tests
(pp. 243-268). Hillsdale, NJ: Lawrence Erlbaum Associates.
Glaser, R., & Baxter, G. P. (2002). Cognition and construct validity: Evidence for the nature of cognitive performance in assessment situations. In H. I. Braun, D. N. Jackson, & D. E. Wiley (Eds.), The role of constructs in psychological and educational measurement (pp. 179-192). Mahwah, NJ: Lawrence Erlbaum Associates.
Godshalk, F. I., Swineford, F., & Coffman, W. E. (1966). The measurement of writing ability. College Board Research Monographs, No. 6. New York: College Entrance Examination Board.
Goleman, D. (1995). Emotional intelligence. New York: Bantam Books.
Green, K. E., & Smith, R. M. (1987). A comparison of two methods of decomposing
item difficulties. Journal of Educational Statistics, 12, 369-381.
Gross, L. J. (1994). Logical versus empirical guidelines for writing test items. Evaluation
and the Health Professions, 17(1), 123-126.
Grosse, M., & Wright, B. D. (1985). Validity and reliability of true-false tests. Educa-
tional and Psychological Measurement, 45, 1-13.
Guilford, J. P. (1967). The nature of human intelligence. New York: McGraw-Hill.
Gulliksen, H. (1987). Theory of mental tests. Hillsdale, NJ: Lawrence Erlbaum Associates.
Guttman, L. (1941). The quantification of a class of attributes: A theory and method of scale construction. In P. Horst (Ed.), Prediction of personal adjustment (pp. 321-345). [Social Science Research Bulletin 48].
Haertel, E. (1986). The valid use of student performance measures for teacher evalua-
tion. Educational Evaluation and Policy Analysis, 8, 45-60.
Haertel, E. H., & Wiley, D. E. (1993). Representations of ability structures: Implications for testing. In N. Frederiksen, R. J. Mislevy, & I. Bejar (Eds.), Test theory for a new generation of tests (pp. 359-384). Hillsdale, NJ: Lawrence Erlbaum Associates.
Haladyna, T. M. (1974). Effects of different samples on item and test characteristics of
criterion-referenced tests. Journal of Educational Measurement, 11, 93-100.
Haladyna, T. M. (1990). Effects of empirical option weighting on estimating domain scores and making pass/fail decisions. Applied Measurement in Education, 3, 231-244.
Haladyna, T. M. (1991). Generic questioning strategies for linking teaching and testing.
Educational Technology: Research and Development, 39, 73-81.
Haladyna, T. M. (1992a). Context-dependent item sets. Educational Measurement: Issues and Practices, 11, 21-25.
Haladyna, T. M. (1992b). The effectiveness of several multiple-choice formats. Applied
Measurement in Education, 5, 73-88.
Haladyna, T. M. (1998, April). Fidelity and proximity in the choice of a test item format.
In T. M. Haladyna (Chair), Construction versus choice: A research synthesis. Sympo-
sium conducted at the annual meeting of the American Educational Research Asso-
ciation, San Diego, CA.
Haladyna, T. M. (2002). Supporting documentation: Assuring more valid test score in-
terpretations and uses. In G. Tindal & T. M. Haladyna (Eds.), Large-scale assessment
for all students: Validity, technical adequacy, and implementation (pp. 89-108). Mahwah,
NJ: Lawrence Erlbaum Associates.
Haladyna, T. M., & Downing, S. M. (1989a). A taxonomy of multiple-choice item-writ-
ing rules. Applied Measurement in Education, 1, 37-50.
Haynie, W. J. (1994). Effects of multiple-choice and short-answer tests on delayed retention learning. Journal of Technology Education, 6(1), 32-44.
Heck, R., & Crislip, M. (2001). Direct and indirect writing assessments: Examining issues of equity and utility. Educational Evaluation and Policy Analysis, 23(3), 275-292.
Henrysson, S. (1971). Analyzing the test item. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 130-159). Washington, DC: American Council on Education.
Herbig, M. (1976). Item analysis by use in pre-test and post-test: A comparison of different coefficients. PLET (Programmed Learning and Educational Technology), 13, 49-54.
Hibbison, E. P. (1991). The ideal multiple choice question: A protocol analysis. Forum for Reading, 22(2), 36-41.
Hill, G. C., & Woods, G. T. (1974). Multiple true-false questions. Education in Chemistry, 11, 86-87.
Hill, K., & Wigfield, A. (1984). Test anxiety: A major educational problem and what
can be done about it. The Elementary School Journal, 85, 105-126.
Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel-Haenszel procedure. In H. Wainer & H. Braun (Eds.), Test validity (pp. 129-145). Hillsdale, NJ: Lawrence Erlbaum Associates.
Holland, P. W., & Wainer, H. (Eds.). (1993). Differential item functioning. Hillsdale, NJ: Lawrence Erlbaum Associates.
Holtzman, K., Case, S. M., & Ripkey, D. (2002). Developing high quality items quickly,
cheaply, consistently-pick two. CLEAR Exam Review, 16-19.
House, E. R. (1991). Big policy, little policy. Educational Researcher, 20, 21-26.
Hsu, L. M. (1980). Dependence of the relative difficulty of true-false and grouped true-
false tests on the ability levels of examinees. Educational and Psychological Measure-
ment, 40, 891-894.
Hubbard, J. P. (1978). Measuring medical education: The tests and experience of the National Board of Medical Examiners (2nd ed.). Philadelphia: Lea and Febiger.
Huff, K. L., & Sireci, S. G. (2001). Validity issues in computer-based testing. Educational Measurement: Issues and Practices, 20, 16-25.
Hurd, A. W. (1932). Comparison of short answer and multiple-choice tests covering
identical subject content. Journal of Educational Research, 26, 28-30.
Irvine, S. H., & Kyllonen, P. C. (Eds.). (2002). Item generation for test development. Mahwah, NJ: Lawrence Erlbaum Associates.
Johnson, B. R. (1991). A new scheme for multiple-choice tests in lower division mathematics. The American Mathematical Monthly, 98, 427-429.
Joint Commission on National Dental Examinations. (1996). National Board Dental Hy-
giene Pilot Examination. Chicago: American Dental Association.
Jozefowicz, R. F., Koeppen, B. M., Case, S., Galbraith, R., Swanson, D., & Glew, R. H. (2002). The quality of in-house medical school examinations. Academic Medicine, 77(2), 156-161.
Kane, M. T. (1992). An argument-based approach to validity. Psychological Bulletin, 112,
527-535.
Kane, M. T. (1997). Model-based practice analysis and test specifications. Applied Measurement in Education, 10(1), 5-18.
Kane, M. T. (2002). Validating high-stakes testing programs. Educational Measurement: Issues and Practices, 21(1), 31-41.
Katz, I. R., Bennett, R. E., & Berger, A. L. (2000). Effects of response format on difficulty
of SAT-mathematics items: It's not the strategy. Journal of Educational Measurement,
37(1), 39-57.
Katz, S., & Lautenschlager, G. J. (1999). The contribution of passage no-passage item performance on the SAT I reading task. Educational Assessment, 7(2), 165-176.
Kazemi, E. (2002). Exploring test performance in mathematics: The questions children's answers raise. Journal of Mathematical Behavior, 21(2), 203-224.
Kromrey, J. D., & Bacon, T. P. (1992, April). Item analysis of achievement tests based on small numbers of examinees. Paper presented at the annual meeting of the American Educational Research Association, San Francisco.
Knowles, S. L., & Welch, C. A. (1992). A meta-analytic review of item discrimination
and difficulty in multiple-choice items using none-of-the-above. Educational and
Psychological Measurement, 52, 571-577.
Kreitzer, A. E., & Madaus, G. F. (1994). Empirical investigations of the hierarchical structure of the taxonomy. In L. W. Anderson & L. A. Sosniak (Eds.), Bloom's taxonomy: A forty-year retrospective. Ninety-third yearbook of the National Society for the Study of Education, Part II (pp. 64-81). Chicago: University of Chicago Press.
LaDuca, A. (1994). Validation of professional licensure examinations: Professions theory, test design, and construct validity. Evaluation in the Health Professions, 17(2), 178-197.
LaDuca, A., Downing, S. M., & Henzel, T. R. (1995). Test development: Systematic item writing and test construction. In J. C. Impara & J. C. Fortune (Eds.), Licensure examinations: Purposes, procedures, and practices (pp. 117-148). Lincoln, NE: Buros Institute of Mental Measurements.
LaDuca, A., Staples, W. I., Templeton, B., & Holzman, G. B. (1986). Item modelling
procedure for constructing content-equivalent multiple-choice questions. Medical
Education, 20, 53-56.
Landrum, R. E., Cashin, J. R., & Theis, K. S. (1993). More evidence in favor of three-option multiple-choice tests. Educational and Psychological Measurement, 53, 771-778.
Levine, M. V., & Drasgow, F. (1982). Appropriateness measurement: Review, critique, and validating studies. British Journal of Mathematical and Statistical Psychology, 35, 42-56.
Levine, M. V., & Drasgow, F. (1983). The relation between incorrect option choice and estimated ability. Educational and Psychological Measurement, 43, 675-685.
Levine, M. V., & Drasgow, F. (1988). Optimal appropriateness measurement. Psychometrika, 53, 161-176.
Levine, M. V., & Rubin, D. B. (1979). Measuring the appropriateness of multiple-choice test scores. Journal of Educational Statistics, 4, 269-289.
Lewis, J. C., & Hoover, H. D. (1981, April). The effect on pupil performance from using hand-held calculators during standardized mathematics achievement tests. Paper presented at the annual meeting of the National Council on Measurement in Education, Los Angeles.
Lindquist, E. F. (Ed.). (1951). Educational measurement (1st ed.). Washington, DC:
American Council on Education.
Linn, R. L. (Ed.). (1989). Educational measurement (3rd ed.). New York: American
Council on Education and Macmillan.
Linn, R. L. (2000). Assessments and accountability. Educational Researcher, 29(2), 4-16.
Linn, R. L., Baker, E. L., & Dunbar, S. B. (1991). Complex, performance-based assessments: Expectations and validation criteria. Educational Researcher, 20, 15-21.
Linn, R. L., & Gronlund, N. (2001). Measurement and assessment in teaching (7th ed.).
Columbus, OH: Merrill.
Lohman, D. F. (1993). Teaching and testing to develop fluid abilities. Educational Re-
searcher, 22, 12-23.
Lohman, D. F., & Ippel, M. J. (1993). Cognitive diagnosis: From statistically-based assessment toward theory-based assessment. In N. Frederiksen, R. J. Mislevy, & I. Bejar (Eds.), Test theory for a new generation of tests (pp. 41-71). Hillsdale, NJ: Lawrence Erlbaum Associates.
Lord, F. M. (1958). Some relations between Guttman's principal components of scale
analysis and other psychometric theory. Psychometrika, 23, 291-296.
Lord, F. M. (1977). Optimal number of choices per item—A comparison of four ap-
proaches. Journal of Educational Measurement, 14, 33-38.
Lord, F. M. (1980). Applications of item response theory to practical testing problems.
Hillsdale, NJ: Lawrence Erlbaum Associates.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Love, T. E. (1997). Distractor selection ratios. Psychometrika, 62(1), 51-62.
Loyd, B. H. (1991). Mathematics test performance: The effects of item type and calcula-
tor use. Applied Measurement in Education, 4, 11-22.
Luce, R. D. (1959). Individual choice behavior. New York: Wiley.
Lukhele, R., Thissen, D., & Wainer, H. (1993). On the relative value of multiple-choice,
constructed-response, and examinee-selected items on two achievement tests. Jour-
nal of Educational Measurement, 31(3), 234-250.
McDonald, R. P. (1981). The dimensionality of tests and items. British Journal of Mathematical and Statistical Psychology, 34, 100-117.
McDonald, R. P. (1985). Factor analysis and related methods. Hillsdale, NJ: Lawrence Erlbaum Associates.
McDonald, R. P. (1999). Test theory. Mahwah, NJ: Lawrence Erlbaum Associates.
Mager, R. F. (1962). Preparing instructional objectives. Palo Alto, CA: Fearon.
Maihoff, N. A., & Mehrens, W. A. (1985, April). A comparison of alternate-choice and
true-false item forms used in classroom examinations. Paper presented at the annual
meeting of the National Council on Measurement in Education, Chicago.
Martinez, M. E. (1990). A comparison of multiple-choice and constructed figural re-
sponse items. Journal of Educational Measurement, 28, 131-145.
Martinez, M. E. (1993). Cognitive processing requirements of constructed figural re-
sponse and multiple-choice items in architecture assessment. Applied Measurement
in Education, 6, 167-180.
Martinez, M. E. (1998, April). Cognition and the question of test item format. In T. M.
Haladyna (Chair), Construction versus choice: A research synthesis. Symposium con-
ducted at the annual meeting of the American Educational Research Association,
San Diego, CA.
Martinez, M. E., & Katz, I. R. (1996). Cognitive processing requirements of constructed figural response and multiple-choice items in architecture assessment. Educational Assessment, 3, 83-98.
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149-174.
McMorris, R. F., Boothroyd, R. A., & Pietrangelo, D. J. (1997). Humor in educational testing: A review and discussion. Applied Measurement in Education, 10, 269-297.
Mehrens, W. A., & Kaminski, J. (1989). Methods for improving standardized test scores: Fruitful, fruitless, or fraudulent? Educational Measurement: Issues and Practices, 8, 14-22.
Phelps, R. P. (2000). Trends in large-scale testing outside the United States. Educational
Measurement: Issues and Practice, 19(1), 11-21.
Pinglia, R. S. (1994). A psychometric study of true-false, alternate-choice, and multiple-choice item formats. Indian Psychological Review, 42(1-2), 21-26.
Poe, N., Johnson, S., & Barkanic, G. (1992, April). A reassessment of the effect of calculator
use in the performance of students taking a test of mathematics applications. Paper pre-
sented at the annual meeting of the National Council on Measurement in Educa-
tion, San Francisco.
Pomplun, M., & Omar, H. (1997). Multiple-mark items: An alternative objective item format? Educational and Psychological Measurement, 57, 949-962.
Popham, W. J. (1993). Appropriate expectations for content judgments regarding
teacher licensure tests. Applied Measurement in Education, 5, 285-301.
Prawat, R. S. (1993). The value of ideas: Problems versus possibilities in learning. Educa-
tional Researcher, 22, 5-16.
Ramsey, P. A. (1993). Sensitivity reviews: The ETS experience as a case study. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 367-388). Hillsdale, NJ: Lawrence Erlbaum Associates.
Raymond, M. (2001). Job analysis and the specification of content for licensure and cer-
tification examinations. Applied Measurement in Education, 14(4), 369-415.
Reckase, M. D. (2000, April). The minimum sample size needed to calibrate items using the
three-parameter logistic model. Paper presented at the annual meeting of the American
Educational Research Association, New Orleans, LA.
Richardson, M. W., & Kuder, G. F. (1933). Making a rating scale that measures. Personnel Journal, 12, 36-40.
Richichi, R. V. (1996). An analysis of test bank multiple-choice items using item response theory. ERIC Document 405367.
Roberts, D. M. (1993). An empirical study on the nature of trick questions. Journal of Ed-
ucational Measurement, 30, 331-344.
Rodriguez, M. (2002). Choosing an item format. In G. Tindal & T. M. Haladyna
(Eds.), Large-scale assessment programs for all students: Validity, technical ade-
quacy, and implementation issues (pp. 211-229). Mahwah, NJ: Lawrence
Erlbaum Associates.
Rodriguez, M. C. (2003). Construct equivalence of multiple-choice and constructed-re-
sponse items: A random effects synthesis of correlations. Journal of Educational Mea-
surement, 40(2), 163-184.
Rogers, W. T., & Harley, D. (1999). An empirical comparison of three-choice and four-choice items and tests: Susceptibility to testwiseness and internal consistency reliability. Educational and Psychological Measurement, 59(2), 234-247.
Roid, G. H. (1994). Patterns of writing skills derived from cluster analysis of direct-writ-
ing assessments. Applied Measurement in Education, 7, 159-170.
Roid, G. H., & Haladyna, T. M. (1982). Toward a technology of test-item writing. New York:
Academic Press.
Rosenbaum, P. R. (1988). Item bundles. Psychometrika, 53, 63-75.
Rothstein, R. (2002, September 18). How U.S. punishes states with higher standards. The New York Times. http://www.nytimes.com/2002/09/18
Rovinelli, R. J., & Hambleton, R. K. (1977). On the use of content specialists in the as-
sessment of criterion-referenced test item validity. Dutch Journal of Educational Re-
search, 2, 49-60.
Royer, J. M., Cisero, C. A., & Carlo, M. S. (1993). Techniques and procedures for assess-
ing cognitive skills. Review of Educational Research, 63, 201-243.
Ruch, G. M. (1929). The objective or new type examination. New York: Scott Foresman.
Ruch, G. M., & Charles, J. W. (1928). A comparison of five types of objective tests in elementary psychology. Journal of Applied Psychology, 12, 398-403.
Ruch, G. M., & Stoddard, G. D. (1925). Comparative reliabilities of objective examinations. Journal of Educational Psychology, 12, 89-103.
Rudner, L. M., Bracey, G., & Skaggs, G. (1996). The use of person-fit statistics with one
high-quality achievement test. Applied Measurement in Education, 9(1), 91-109.
Ryan, J. M., & DeMark, S. (2002). Variation in achievement test scores related to gender, item format, and content area tests. In G. Tindal & T. M. Haladyna (Eds.), Large-scale assessment programs for all students: Validity, technical adequacy, and implementation (pp. 67-88). Mahwah, NJ: Lawrence Erlbaum Associates.
Samejima, F. (1979). A new family of models for the multiple-choice item (Office of Naval Research Report 79-4). Knoxville: University of Tennessee.
Samejima, F. (1994). Nonparametric estimation of the plausibility functions of the distractors of vocabulary test items. Applied Psychological Measurement, 18(1), 35-51.
Sanders, N. M. (1966). Classroom questions: What kinds? New York: Harper & Row.
Sato, T. (1975). The construction and interpretation of S-P tables. Tokyo: Meiji Tosho.
Sato, T. (1980). The S-P chart and the caution index. Computer and communications systems
research laboratories. Tokyo: Nippon Electronic.
Sax, G., & Reiter, E. B. (n.d.). Reliability and validity of two-option multiple-choice and comparably written true-false items. Seattle: University of Washington.
Schultz, K. S. (1995). Increasing alpha reliabilities of multiple-choice tests with linear
polytomous scoring. Psychological Reports, 77, 760-762.
Seddon, G. M. (1978). The properties of Bloom's taxonomy of educational objectives for
the cognitive domain. Review of Educational Research, 48, 303-323.
Serlin, R., & Kaiser, H. F. (1978). A method for increasing the reliability of a short multi-
ple-choice test. Educational and Psychological Measurement, 38, 337-340.
Shahabi, S., & Yang, L. (1990, April). A comparison between two variations of multiple-
choice items and their effects on difficulty and discrimination values. Paper presented at
the annual meeting of the National Council on Measurement in Education,
Boston.
Shapiro, M. M., Stutsky, M. H., & Watt, R. F. (1989). Minimizing unnecessary differ-
ences in occupational testing. Valparaiso Law Review, 23, 213-265.
Shea, J. A., Poniatowski, P. A., Day, S. C., Langdon, L. O., LaDuca, A., & Norcini, J. J. (1992). An adaptation of item modeling for developing test-item banks. Teaching and Learning in Medicine, 4, 19-24.
Shealy, R., & Stout, W. F. (1996). A model-based standardization approach that separates true bias/DIF from group differences and detects bias/DIF as well as item bias/DIF. Psychometrika, 58, 159-194.
Shepard, L. A. (1991). Psychometrician's beliefs about learning. Educational Researcher,
20, 2-9.
Shepard, L. A. (1993). The place of testing reform in educational reform—A reply to
Cizek. Educational Researcher, 22, 10-13.
Shepard, L. A. (2000). The role of assessment in a learning culture. Educational Researcher, 29(7), 4-14.
Shonkoff, J., & Phillips, D. (Eds.). (2000). The science of early childhood development.
Washington, DC: National Research Council Institute of Medicine, National Acad-
emy Press.
Simon, H. A. (1973). The structure of ill-structured problems. Artificial Intelligence, 4,
181-201.
Sireci, S. G., Thissen, D., & Wainer, H. (1991). On the reliability of testlet-based tests.
Journal of Educational Measurement, 28, 237-247.
Skakun, E. N., & Gartner, D. (1990, April). The use of deadly, dangerous, and ordinary
items on an emergency medical technicians-ambulance registration examination. Paper
presented at the annual meeting of the American Educational Research Association,
Boston.
Skakun, E. N., & Maguire, T. (2000, April). What do think-aloud procedures tell us about medical students' reasoning on multiple-choice and equivalent constructed-response items? Paper presented at the annual meeting of the National Council on Measurement in Education, New Orleans, LA.
Skakun, E. N., Maguire, T., & Cook, D. A. (1994). Strategy choices in multiple-choice
items. Academic Medicine Supplement, 69(10), S7-S9.
Slogoff, S., & Hughes, F. P. (1987). Validity of scoring "dangerous answers" on a written certification examination. Journal of Medical Education, 62, 625-631.
Smith, R. M. (1986, April). Developing vocabulary items to fit a polychotomous scoring
model. Paper presented at the annual meeting of the American Educational Research
Association, San Francisco.
Smith, R. M., & Kramer, G. A. (1990, April). An investigation of components influencing
the difficulty of form-development items. Paper presented at the annual meeting of the
National Council on Measurement in Education, Boston.
Snow, R. E. (1989). Toward assessment of cognitive and conative structures in learning. Educational Researcher, 18, 8-14.
Snow, R. E. (1993). Construct validity and constructed-response tests. In R. E. Bennett
& W. C. Ward (Eds.), Construction versus choice in cognitive measurement: Issues in con-
structed response, performance testing, and portfolio assessment (pp. 45-60). Hillsdale,
NJ: Lawrence Erlbaum Associates.
Snow, R. E., & Lohman, D. F. (1989). Implications of cognitive psychology for educational measurement. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 263-332). New York: American Council on Education and Macmillan.
Statman, S. (1988). Ask a clear question and get a clear answer: An inquiry into the
question/answer and the sentence completion formats of multiple-choice items. Sys-
tem, 16, 367-376.
Sternberg, R. J. (1985). Beyond IQ: A triarchic theory of human intelligence. New York:
Cambridge University Press.
Sternberg, R. J. (1998). Abilities are forms of developing expertise. Educational Researcher, 27(3), 11-20.
Stiggins, R. J., Griswold, M. M., & Wikelund, K. R. (1989). Measuring thinking skills
through classroom assessment. Journal of Educational Measurement, 26, 233-246.
Stout, W., Nandakumar, R., Junker, B., Chang, H., & Steidinger, D. (1993). DIMTEST: A FORTRAN program for assessing dimensionality of binary item responses. Applied Psychological Measurement, 16, 236.
Stout, W., & Roussos, L. (1995). SIBTEST manual (2nd ed.). Unpublished manuscript. Urbana-Champaign: University of Illinois.
Williams, B. J., & Ebel, R. L. (1957). The effect of varying the number of alternatives per
item on multiple-choice vocabulary test items. In The 14th yearbook of the National
Council on Measurement in Education (pp. 63-65). Washington, DC: National Coun-
cil on Measurement in Education.
Wilson, M. R. (1989). Saltus: A psychometric model of discontinuity in cognitive devel-
opment. Psychological Bulletin, 105, 276-289.
Winne, P. H. (1979). Experiments relating teachers' use of higher cognitive questions to
student achievement. Review of Educational Research, 49, 13-50.
Wolf, L. F., & Smith, J. K. (1995). The consequence of consequence: Motivation, anxiety, and test performance. Applied Measurement in Education, 8(3), 227-242.
Wright, B. D. (1977). Solving measurement problems with the Rasch model. Journal of
Educational Measurement, 14, 97-116.
Zimowski, M. F., Muraki, E., Mislevy, R. J., & Bock, R. D. (2003). BILOG-MG 3: Item analysis and test scoring with binary logistic models [Computer program]. Chicago: Scientific Software.
Zoref, L., & Williams, P. (1980). A look at content bias in IQ tests. Journal of Educational Measurement, 17, 313-322.
Author Index
Linn, R. L., 22, 53, 57, 239, 244, 245, 265, 266, 272
Llabre, M., 190
Lohman, D. F., 26, 35, 36, 37, 262, 263, 271
Lord, C., 95
Lord, F. M., 76, 112, 203, 212
Lorscheider, F. L., 111
Love, T. E., 225
Loyd, B. H., 92
Luce, R. D., 226
Lukhele, R., 50
Luo, G., 204, 226
Lyne, A., 204
M
McDonald, R. P., 203, 247
Madaus, G. F., 22
Mager, R. F., 185
Maguire, T., 59, 60, 200
Maihoff, N. A., 76
Martinez, M. E., 48, 49, 50, 58, 59
Masters, G. N., 255
Mazor, K. M., 234
McMorris, R. F., 121
Mead, A. D., 226
Meara, P., 94
Mehrens, W. A., 76, 238, 261
Meijer, R. R., 237, 243, 244, 245, 246
Messick, S., 9, 11, 12, 14, 26, 35, 185, 186, 188, 189, 190, 246, 247, 261, 266
Michael, W. B., 112
Miller, M. D., 190
Miller, W. G., 22
Millman, J., 272
Minnaert, A., 61
Miranda, D. U., 26
Mislevy, R. J., 21, 25, 51, 52, 61, 159, 204, 262, 265, 267, 268, 269, 274
Moerkerke, G., 117
Molenaar, I. W., 245
Mooney, J. A., 85, 254
Muijtjens, A. M. M. M., 237, 245
Mukerjee, D. R., 61
N
Nanda, H., 265
Nandakumar, R., 252, 265
Neisser, U., 7
Nesi, H., 94
Nickerson, R. S., 26, 159
Nield, A. E., 199
Nishisato, S., 218
Nitko, A. J., vii, 72, 164, 266
Nolen, S. B., 9, 238, 261
Norcini, J. J., 49, 81
Norman, G. R., 165, 169
Norris, S. P., 199
Novick, M. R., 203
Nungester, R. J., 81
Nunnally, J. C., 203, 211, 213, 247
O
O'Dell, C. W., 47, 63
O'Hara, T., 22
O'Neill, K., 105
Oden, M., 7
Olson, L. A., 76
Omar, H., 84
Oosterhof, A. C., 78
Osborn Popp, S., 236
P
Page, G., 165, 166, 168, 169
Paris, S. G., 239, 240
Patterson, D. G., 47
Perkhounkova, E., 53, 132
Peterson, C. C., 78
Peterson, J. L., 78
Phelps, R., ix
Phillips, D., 7
Pietrangelo, D. J., 121
Pinglia, R. S., 79
Plake, B. S., 218
Poe, N., 92
Pomplun, M., 84
Popham, W. J., 188, 189
Potenza, M. T., 231
Prawat, R. S., 160
Pritchard, R., 59
R
Rajaratnam, N., 265
Ramsey, P. A., 193, 234
Raymond, M., 186, 248
V
van Batenburg, T. A., 222
van den Bergh, H., 59
Van der Flier, H., 245
van der Vleuten, C. P. M., 237, 245
Vargas, J., 215
Veloski, J. J., 70
W
Wainer, H., 50, 52, 85, 150, 170, 180, 217, 222, 231, 233, 247, 254, 273, 274
Wang, M. D., 50
Wang, W., 226
Wang, X., 50, 226, 227
Ward, W. C., 260
Washington, W. N., 94
Watt, R. F., 232
Webb, L. C., 105
Weiss, M., 236
Welch, C. A., 116
Whitney, D. R., 82, 265
Wigfield, A., 69, 237
Wiggins, G., 67
Wightman, L. F., 54, 57
Wikelund, K. R., 20
Wiley, D. E., 265
Williams, B., 226
Williams, B. J., 76
Williams, E. A., 243
Williams, P., 193, 194
Wilson, M. R., 204
Winne, P. H., 264
Wintre, M. G., 199
Wolf, L. F., 240
Wood, R. D., 204
Woods, G. T., 82
Wright, B. D., 77, 78, 84, 237, 240
Wu, M., 204
Y
Yang, L., 81
Z
Zickar, M. J., 237
Zimowski, M. F., 204, 237
Zoref, L., 193, 194
Subject Index
A
Abilities (cognitive, developing, fluid, learned), 6-7, 8, 35-40
Achievement, 6-7
All of the above option, 117
American Educational Research Association (AERA), x, 10, 15, 25, 62, 94, 183, 185, 234, 241, 247, 261
American Psychological Association, x
Answer justification, 197-199
Appropriateness measurement, 243
Assessment Systems Corporation, 204
C
Calculators, 91-93
Clang associations, 118
Clues to answers, 117-120
Cognition, 19
Cognitive demand (process), 19, 25
  cognitive taxonomies, 20-25, 28-40
  construct-centered measurement, 26-27
Construct definition, 5-6
Constructed-response (CR) item formats, 42-47
Converting constructed response to multiple-choice, 176-177, 178, 179, 180
D
Differential item functioning, 231-234
Dimensionality, 213-214, 246-253
  defining, 246-247
  methods, 249-253
Distractor evaluation, 218-228, 272-273
E
Editing items, 105-106
Educational Testing Service, 193, 195
Emotional intelligence, 8
F
Future of item development, 259-272
  factors affecting, 259-265
  new theories, 267-272
Future of item-response validation, 272-275
G
Generic item sets, 170-176
  definition, 170
  evaluation, 176
  generic scenario, 171-175
Guessing, 217
H
Higher level thinking, 35-40
  examples of multiple-choice items, 137-147
Humor in items, 121
I
Instructional sensitivity, 205, 214-217