
Developing and Validating
Multiple-Choice Test Items

Third Edition

Thomas M. Haladyna
Arizona State University West

LAWRENCE ERLBAUM ASSOCIATES, PUBLISHERS


2004 Mahwah, New Jersey London
Copyright © 2004 by Lawrence Erlbaum Associates, Inc.
All rights reserved. No part of this book may be reproduced in any form,
by photostat, microform, retrieval system, or any other means, without
prior written permission of the publisher.

Lawrence Erlbaum Associates, Inc., Publishers


10 Industrial Avenue
Mahwah, New Jersey 07430

Cover design by Sean Trane Sciarrone

Library of Congress Cataloging-in-Publication Data

Haladyna, Thomas M.
Developing and validating multiple-choice test items / Thomas M.
Haladyna.—3rd ed.
p. cm.
Includes bibliographical references and index.
ISBN 0-8058-4661-1
1. Multiple-choice examinations—Design and construction.
2. Multiple-choice examinations—Validity. I. Title.

LB3060.32.M85H35 2004
371.26—dc22
2003060112
CIP

Books published by Lawrence Erlbaum Associates are printed on acid-free
paper, and their bindings are chosen for strength and durability.

Printed in the United States of America


1 0 9 8 7 6 5 4 3 2 1
Contents

Introduction vii

I A Foundation for Multiple-Choice Testing

1 The Importance of Item Development for Validity 3
2 Content and Cognitive Processes 19
3 Item Formats 41

II Developing MC Test Items

4 MC Formats 67
5 Guidelines for Developing MC Items 97
6 A Casebook of Exemplary Items and Innovative Item Formats 127
7 Item Generation 148

III Validity Evidence Arising From Item Development and Item Response Validation

8 Validity Evidence Coming From Item Development Procedures 183
9 Validity Evidence Coming From Statistical Study of Item Responses 202
10 Using Item Response Patterns to Study Specific Problems 230

IV The Future of Item Development and Item Response Validation

11 New Directions in Item Writing and Item Response Validation 259

References 277
Author Index 297
Subject Index 303
Introduction

This third edition provides a comprehensive treatment of the development of mul-
tiple-choice (MC) test items and the study of item responses for the improvement
or continued use of these items. This third edition contains some significant revi-
sions that I hope will be an improvement over the previous two editions.

MOTIVATION FOR THIS THIRD EDITION

Revising a book for the second time requires some motivation. Four factors fu-
eled this effort.
First, readers continue to show an interest in a comprehensive treatment
of MC item development and item response validation. These readers surface
in my life in countless ways. Some readers point out an error, share an idea or a
new MC format, ask a question, or simply offer support to the effort to im-
prove test items.
Second, a scientific basis for test item writing has been slow to develop
(Cronbach, 1970; Haladyna & Downing, 1989a, 1989b; Haladyna, Downing, &
Rodriguez, 2002; Nitko, 1985; Roid & Haladyna, 1982). These critics have
pointed out the paucity of research on item development. This book responds to
that criticism.
The third factor is the short yet rich history of efforts to improve MC item
writing. This history dates back to the early 20th century when MC was intro-
duced. Along the way, many testing specialists and educators have contributed


to this book by sharing their ideas, experiences, and collective wisdom in es-
says, textbooks, research, and in other ways. This book draws from this history.
Finally, my more than 30 years of experience in the business of planning, ad-
ministering, and evaluating testing programs and teaching at the elementary,
undergraduate, and graduate levels has helped me better understand the pro-
cess and benefits of well-designed test items and the importance of validating
item responses.

INTENDED AUDIENCE

This book is intended for anyone seriously interested in developing test items
for achievement testing. Students in graduate-level courses in educational
measurement may find this book helpful for better understanding two impor-
tant phases in test development, item development and item response valida-
tion. Those directly involved in developing tests may find this book useful as a
source of new material to enhance their present understanding and their item
development and item response validation practices.

LIMITATIONS OF THIS BOOK

Although this book intends to provide a comprehensive treatment of MC item
development and item response validation, it is limited in several ways.
Statistical theories for dealing with item responses and forming scales
abound and are an active field of research and development. The technology
for item writing presented in this book is not based on any current item-writ-
ing theory. I hope theoretical advancements in item writing will make future
books on item development and item response validation more aligned with
cognitive learning theory in a unified way. In theory, there should be a logical
progression from construct conceptualization that flows smoothly and
seamlessly to item development, field testing of each item, and evaluation of
item responses. Ideas about human intellect are constantly undergoing reex-
amination and change. Renewed interest in measuring cognitive abilities has
motivated us to consider new ways to measure the most desirable outcomes of
schooling or training. The way we view human learning relates to how we
teach and test. Three popular alternate views of learning are behavioral, so-
cial-constructivist, and cognitive. Behavior learning theory has thrived with
the widespread use of instructional objectives, mastery learning, and crite-
rion-referenced testing. Social-constructivist learning is a recent interest in
measurement and is gaining more acceptance (Shepard, 2000). The cogni-
tive approach involves a more encompassing idea. Cognitive abilities are
slowly developed over a learner's lifetime. Cognitive psychologists and statis-
tical test theorists are beginning to work in partnerships to measure these

slow-growing cognitive abilities. In this book, cognitive ability provides a use-
ful paradigm for defining learning and its measurement. Until greater accep-
tance exists for the concept of cognitive abilities, or the paradigm for
explaining learning shifts to the cognitive viewpoint as it is doing, there are
still conflicts about how best to teach and test. The legacy of behavioral
learning theory and behavioral instruction persists in sharp contrast to the
uniquely different cognitive learning and social-constructivist theories. The
latter two seem to share more in common when compared with the behavior-
ist approach. Consequently, for those of us who teach, the way we assess stu-
dent learning is partly a function of the learning theory we use.
It is unlikely that any of these conditions affecting item writing will be re-
solved soon. Item development and item response validation are still new, dy-
namic processes greatly in need of a unified theory for writing items and
validating research. Without doubt, item writing continues to be a highly cre-
ative enterprise.

THE CURRENT STATUS OF MC TESTING


Perhaps, MC testing was given far too much emphasis in the past, prompting
critics such as Frederiksen (1984) and Shepard (1993, 2000) to contend that
MC testing lends itself to MC teaching. Frederiksen pointed out that MC for-
mats may do much unintended harm to learners by emphasizing the use of MC
formats for easy-to-measure student learning at the expense of complex
hard-to-measure content linked to performance tests. Shepard's point seems
to be that excessive attention to memorizing and testing for knowledge may
cause us to overlook the learning and assessment of more important aspects of
learning involving the application of knowledge and skills in real-life situations
that call for problem solving or critical thinking.
Certainly, analyses of what teachers teach and test reinforce this idea that
much of the content of education may be at the memory level. However, this is
not the fault of the MC format. As this book emphasizes, curriculum, teaching,
and student learning should focus on complex cognitive outcomes that show
cognitive abilities, such as reading, writing, speaking, listening, mathematical
and scientific problem solving, critical thinking, and creative enterprise. MC
formats have an important role to play here.
Despite attacks on MC testing, it has thrived in recent years. The need to in-
form policymakers and evaluators is great enough to continue to support test-
ing programs. Little doubt should exist that testing is a major enterprise that
directly or indirectly affects virtually everyone in the United States, and testing
is increasing both in the United States and worldwide (Phelps, 1998, 2000).
MC tests are used in many ways: placement, selection, awards, certification,
licensure, course credit (proficiency), grades, diagnosis of what has and has not
been learned, and even employment.

A major premise of this book is that there is indeed a place for MC testing
in the classroom, large-scale assessment of student learning, and tests of
competence in any profession. The public and many test developers and us-
ers need to be more aware of the Standards for Educational and Psychological
Testing (American Educational Research Association [AERA], American
Psychological Association [APA], and National Council on Measurement
in Education [NCME], 1999) and the Guidelines for High Stakes Testing is-
sued by the AERA (2000). In this third edition, these standards and guide-
lines are often linked to recommended item development and item
response validation practices. Once test users are clear about how test re-
sults should and should not be used, we can increase the quality of tests by
sensible and effective item development and item response validation pro-
cedures found in this book.

ORGANIZATION OF THE BOOK

This third edition has undergone more changes than occurred in the second
edition. For continuity, the book continues to be organized into four sections.
Part I contains three chapters that provide a foundation for writing MC
items. Chapter 1 addresses the most important and fundamental value in any
test: validity. A parallel is drawn between test score validation and item re-
sponse validation. The logical process we use in test score validation also ap-
plies to item response validation because the item response is the most
fundamental unit of measurement in composing the test score. The process of
item development resides in validity with the emphasis on documenting valid-
ity evidence addressing item quality. Chapter 2 addresses the content and cog-
nitive process of test items. Knowledge, skills, and abilities are the three
interrelated categories of content discussed. This chapter principally draws
from cognitive psychology. Existing testing programs also provide some guid-
ance about what we need to measure and the different types of cognitive be-
haviors our items need to represent. Chapter 3 presents a taxonomy of test item
formats that include both constructed response (CR) and MC methods. Chap-
ter 3 also addresses claims for CR and MC formats for measuring different types
of content and cognitive processes. With this foundation in place, the writing
of MC items discussed in part II is enabled.
Part II is devoted to developing MC items. Chapter 4 presents a variety of
MC item formats and claims about the types of content and cognitive processes
that these formats can measure. Chapter 5 presents guidelines for developing
MC items of various formats. Chapter 6 contains a casebook of exemplary and
innovative MC items with supporting narrative about why these items were
chosen. Chapter 7 provides new and improved guidance on item-generation
techniques.

Part III addresses the complex idea of item response validation. Chapter 8
reports on the rationale and procedures involved in a coordinated series of
activities intended to improve each test item. Chapter 9 deals with item anal-
ysis for evaluating and improving test items. A theoretical perspective is of-
fered that fits within the frameworks of both classical or generalizability
theory and item response theory. Chapter 10 offers information about how
the study of item responses can be used to study specific problems encoun-
tered in testing.
Part IV contains chapter 11, which deals with the trends in item writing and
validation. Unlike what we experience today, cognitive theorists are working
on better ways for defining what we teach and measure, and test theorists are
developing item-writing theories and item response models that are more ap-
propriate to measuring complex behavior. The fruition of many of these theo-
ries will change the face of education and assessment of learning in profound
ways in the future.
In closing, development of test items and validation of test item responses
still remain as two critical, related steps in test development. This book intends
to help readers understand the concepts, principles, and procedures available
to construct better MC test items that will lead to more validly interpreted and
used test scores.

—Tom Haladyna
I
A Foundation for
Multiple-Choice Testing

The three chapters in part I provide preparation and background for writing
multiple-choice (MC) test items. These chapters are interdependent.
This first chapter addresses the most important consideration in testing,
which is validity. In validating a specific test score interpretation or use, a body
of validity evidence comes from item development. Another body of evidence
resides with studies of item responses. The two bodies of evidence are shown to
be vital to the validity of any test score interpretation or use.
The second chapter discusses the types of content and cognitive processes
that we want to measure in achievement tests. The organization of content and
cognitive processes is straightforward and easy to follow.
The third chapter presents a taxonomy of MC and constructed-response (CR)
test item formats with links to the content and cognitive processes dis-
accomplish regarding the age-old problem of measuring higher level thinking.
At the end of part I, you should have a good understanding of the role of
item development and item response validation as an integral aspect of va-
lidity, the types of content and cognitive processes we want to measure, and
the variety of formats available to you. You should also understand when to
use a MC format and which formats to use for certain content and cognitive
processes.
1
The Importance of Item
Development for Validity

OVERVIEW

This chapter provides a conceptual basis for understanding the important role
of validity in item development. First, basic terms are defined. Validity refers to
a logical process we follow in testing where what we measure is defined, mea-
sures are created, and evidence is sought and evaluated pertaining to the valid-
ity of interpreting a test score and its subsequent use. This logical process
applies equally to tests and the units that make up tests, namely, test items. This
process involves the validity of test score interpretations and uses and the va-
lidity of item response interpretations and uses. In fact, a primary source of evi-
dence in validating a test score interpretation or use comes from item
development. Thus, we should think of the validation of item responses as a
primary and fundamental source of validity evidence for test score interpreta-
tion and use.

DEFINING THE TEST ITEM

A test item is the basic unit of observation in any test. A test item usually con-
tains a statement that elicits a test taker response. That response is scorable,
usually 1 for a correct response and 0 for an incorrect response, or the response
might be placed on a rating scale from low to high. More sophisticated scoring
methods for item responses are discussed in chapters 9 and 10.
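
To make the 0/1 scoring rule and the later aggregation of item scores into a test score concrete, here is a minimal sketch. It is illustrative only and not taken from the book; the item labels, answer key, and responses are hypothetical.

```python
# Illustrative sketch (not from the book): dichotomous (0/1) scoring of MC
# item responses against an answer key, then summing the item scores to
# form a total test score.

answer_key = {"item1": "B", "item2": "D", "item3": "A"}  # hypothetical items

def score_item(response, correct_option):
    """Return 1 for a correct response and 0 for an incorrect one."""
    return 1 if response == correct_option else 0

def total_score(responses, key):
    """Aggregate the 0/1 item scores into a test score."""
    return sum(score_item(responses.get(item), correct)
               for item, correct in key.items())

examinee = {"item1": "B", "item2": "C", "item3": "A"}  # one test taker's responses
print(total_score(examinee, answer_key))               # -> 2
```
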
Thorndike (1967) wrote that the more effort we put into building better
test items, the better the test is likely to be. Toward that end, one can design
test items to represent many different types of content and cognitive behav-
iors. Each test item is believed to represent a single type of content and a sin-
gle type of cognitive behavior. For a test item to measure multiple content and
cognitive behaviors goes well beyond the ability of a test item and our ability
to understand the meaning of an item response. A total score on a test repre-
sents some aggregate of performance across all test items for a specific ability
or domain of knowledge. As defined in this book, the test item is intended to
measure some aspect of human ability generally related to school learning or
training. However, this definition of a test item is not necessarily limited to
human ability. It can apply to other settings outside of education or training.
However, the focus in this book is learning. Therefore, we are mainly con-
cerned with achievement tests.
A fundamental dichotomy in item formats is whether the answer is selected
or created. Although a test item is the most basic element of any test, a test item
can seldom stand by itself as a test. Responses to a single test item are often too
fallible. Also, most cognitive abilities or achievement domains measured by a
test are too complex to be represented adequately by a single item. That is why
we score and aggregate item responses to form the test score. The design of any
test to cover something complex is usually extensive because the knowledge,
skills, or abilities we want to measure dictate a complex test design.

DEFINING A TEST

A test is a measuring device intended to describe numerically the degree or
amount of learning under uniform, standardized conditions. In educational
testing, most tests contain a single item or set of test items intended to measure
a domain of knowledge or skills or a cognitive ability. In the instance of the lat-
ter, a single test item might be a writing prompt or a complex mathematics
problem that may be scored by one or more judges using one or more traits and
associated rating scales. Responses to a single test item or a collection of test
items are scorable. The use of scoring rules helps create a test score that is based
on the test taker's responses to these test items.

WHAT DO TESTS AND TEST ITEMS MEASURE?

This section contains three distinctions that you might find useful as you think
about what tests and test items measure. Germane to the goal of this book, how
might MC items meet your needs in developing a test or assessing student
learning in the classroom?

Operational Definitions and Constructs

In defining any human cognitive characteristic that we would like to mea-
sure, a dilemma we face is whether we all agree on a definition of the charac-
teristic we want to measure, or whether the characteristic is sufficiently
abstract to prevent such consensus. The technical terms we use for this dis-
tinction are operational definition and construct. The decision about what we
want to measure points us down a path. One path, the operational definition,
makes our measurement job somewhat easy. The other path, the construct
definition, involves a longer and more involved set of procedures. We might
argue that the operational definition is desirable, but, unfortunately, too
many important things in education that we desire to measure are abstractly
defined and thus we are led down the construct path.
Operational definitions are commonly agreed on by those responsible for
measuring the characteristics. In other words, we have consensus. Traits de-
fined by operational definitions are objectively and directly measured. We
have good examples of operational definitions for time, volume, distance,
height, and weight. In other words, the definitions are specific enough to en-
able precise measurement without the difficulties encountered with con-
structs. Operational definitions abound in education, but cognitive
behaviors directly measured via operational definition are typically very sim-
ple. These behaviors are found in all curricula. We tend to see operational
definitions in early childhood education or the beginning stages of learning
any ability. We also find operational definitions with reading, writing, and
mathematics skills. Most word attack skills are practiced in reading. Spelling,
grammar, punctuation, and capitalization skills can be operationally defined
and readily observed in any student writing. In mathematics, most skills are
also operationally defined and easily observed. Thus, operational definitions
abound in the language arts, mathematics, social studies, and science. We
also have operational definitions in professional and industrial training,
where domains of knowledge and concrete, observable skills are heavily em-
phasized. Most operationally defined types of learning can be directly ob-
served or observed using a measuring instrument, such as a clock, ruler, scale,
or the human eye. But MC has a role to play when we can operationally define
domains of knowledge or skills, particularly if the skills are cognitive.
A construct is opposite to an operational definition. A construct is both
complex and abstract. Constructs include such highly prized abilities as read-
ing, writing, speaking, listening, problem solving, critical thinking, and creative
activity. If we could operationally define any of these abilities, we could use sim-
ple, direct, bias-free, reliable methods associated with operational definitions.
Some aspects of these abilities can be operationally defined. In writing, gram-
mar, punctuation, and spelling can be operationally defined. But none of these
skills constitutes a useful or direct measure of writing ability. As you can see, the
simple things we can easily observe are operationally defined, but the most
complex and prized things are not as easily observable and require expert judg-
ment. With a construct, we resort to subjective observation of student perfor-
mance by highly trained and skilled judges.

Although this book is concerned with MC testing, oddly enough the most
important constructs are not best measured with MC item formats. Never-
theless, MC tests play a vital role in measuring many important aspects of
most constructs. When it comes to the measurement of knowledge and many
cognitive skills, MC is the logical choice. This point and its rationale are fea-
tured in chapter 3.

Achievement

The context for this book is the measuring of achievement that is the goal of in-
struction or training. Achievement is usually thought of as planned changes in
cognitive behavior that result from instruction or training, although certainly
achievement is possible because of factors outside of instruction or training.
All achievement can be defined in terms of content. This content can be repre-
sented as knowledge, skills, or cognitive abilities. Chapter 2 refines the distinc-
tions among these three concepts, and chapter 3 links different item formats to
knowledge, skills, and abilities.
Knowledge is a fundamental type of learning that includes facts, concepts,
principles, and procedures that can be memorized or understood. Most student
learning includes knowledge. Knowledge is often organized into operationally
defined domains. Consider what a dentist-in-training has to learn about dental
anatomy. We have 20 teeth in the juvenile dentition and 32 teeth in the adult
dentition. A dentist has to know the tooth name and the corresponding tooth
number for all 52 teeth. Given the number, the dentist must state the name.
Given the name, the dentist must state the number. These two statements op-
erationally generate 104 test items. This is the entire domain. The MC format
is generally acknowledged as the most useful and efficient way to measure
knowledge. As you can see, if knowledge can be defined in terms of a domain,
the measurement is made easier. Any achievement test is a representative sam-
ple of items from that domain.
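
The dental anatomy example lends itself to a small sketch of how an operationally defined domain can be generated and then sampled. This is illustrative only; the tooth "names" are placeholders following the book's count of 52 teeth, and the sampling is a simple random draw rather than a full test design.

```python
# Illustrative sketch (not from the book): operationally generating the
# 104-item dental anatomy domain described above (52 teeth x 2 stems)
# and drawing a representative sample for one test form.
import random

# Placeholder labels; a real domain would pair each tooth number with its
# actual anatomical name.
teeth = {n: f"tooth name {n}" for n in range(1, 53)}   # 52 teeth in all

domain = []
for number, name in teeth.items():
    domain.append(f"Given tooth number {number}, state its name.")
    domain.append(f"Given the tooth named '{name}', state its number.")

print(len(domain))                      # -> 104, the entire item domain

test_form = random.sample(domain, 20)   # a 20-item representative sample
```
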
Skills are learned, observable, performed acts. They are easily recognized in
virtually all settings. In writing, spelling, punctuation, and grammar are observ-
able, performed acts. In mathematics, adding, subtracting, multiplying, and di-
viding are also observable, performed acts. The development of domains of
knowledge can also be applied to skills. Take spelling, for example. It is easy to
identify a domain of words that a learner must correctly spell. The same is true
in mathematics. Because skills are so numerous, any test to student learning
should involve some representative sampling from the domain of items repre-
senting these skills.
Abilities are also learned, but the process is long and involved, perhaps span-
ning an entire lifetime. Abilities require the use of both knowledge and skills in
a complex way. Abilities even have an emotional component. Most abilities are
too complex for operational definition; therefore, we have to resort to CR per-

formance tests that require expert judgment to score. The items we use to mea-
sure an ability often consist of ill-structured problems. It is difficult to explicate
a domain that consists of ill-structured problems. Consider, for example, the
many naturally occurring encounters you have in life that require mathemat-
ics. How many problems exist? What are their form and structure? In limited
ways, MC can serve as a useful proxy for the cumbersome performance tests.
However, any argument for using MC formats instead of performance formats
for a complex ability should be presented and evaluated before a MC format is
used. Chapter 3 provides these arguments and the evidence supporting the
limited use of MC items for measuring abilities.

Intelligence

Intelligence is another important cognitive construct. Other terms used syn-
onymously for intelligence are scholastic aptitude and mental ability. Although
the dominant theory about intelligence treats intelligence as unitary (one fac-
tor), research has shown that intelligence consists of three highly interrelated
cognitive abilities: verbal, quantitative, and analytical. These three abilities
have been found to be useful in a variety of settings and professions.
Historically, the Spearman one-factor theory of intelligence has been well
supported by research, including the famous Terman longitudinal studies of
giftedness (Terman & Oden, 1959). However, the one-factor view of intelli-
gence has been periodically challenged. In the 1930s, Thurstone (1938) for-
mulated his primary mental abilities, and his test was widely used. In the
1960s and 1970s, Guilford's (1967) structure of the intellect model was sup-
ported by research, but interest in this model waned. Gardner (1986) posited
a theory of multiple intelligences, and Sternberg (1985) introduced a compo-
nential theory of human abilities. Both theories have received considerable
attention. Although enthusiasm for multiple intelligence has been renewed
by these scholars, this century-long history of the study of human intelligence
in the United States has shown that scientific revolutions of this kind are
hard to sustain. The cumulative body of evidence continues to support a
one-factor theory of intelligence.
There is also emerging evidence that intelligence can be developed through
better nutrition, nurturing family life, and rich schooling experiences (Neisser,
1998; Rothstein, 2000; Shonkoff & Phillips, 2000). This emerging research has
argued that intelligence is susceptible to environment influences, particularly
at prenatal stages and in early childhood. Intelligence can actually increase
over a lifetime under favorable conditions.
Table 1.1 tries to capture the complexity of intelligence and achievement as
a hierarchical entity. At the bottom of this continuum, we have the memoriza-
tion and recall of knowledge, which is easy to teach, learn, and measure. At the

TABLE 1.1
A Continuum of Cognitive Behavior

Intelligence: Verbal, quantitative, analytical
↑
Developing, fluid, learned abilities
↑
Skills: Simple cognitive or psychomotor acts
↑
Knowledge: Understanding of facts, concepts, principles, and procedures
↑
Knowledge: Recall of facts, concepts, principles, and procedures

next level, we have the understanding of knowledge, which is more difficult to
teach and learn, and its measurement is more difficult. Above this level, we
have skills, which require knowledge and can be taught and measured effec-
tively. Most of what goes on in schools and in professional training involves
knowledge and skills. At the next level come what cognitive psychologists call
developing, fluid, or learned abilities. These are slow-growing clusters of
knowledge and skills and strategies for applying knowledge and skills in com-
plex ways to accomplish an end. At the top of this continuum, we have intelli-
gence, which we have said is largely three general cognitive abilities that our
society highly values: verbal, quantitative, and analytical.
Goleman (1995) provided a compelling, popular description of emotional
intelligence, which he and some scientists believe accounts for successes and
failures that intelligence fails to explain. Emotional intelligence can be viewed
as a complementary type of intelligence that also is highly valued in our society.
Intelligence is not a dominant theme in this book, simply because item writ-
ing in this book is focused on developing cognitive abilities that are amenable
to teaching or training. But intelligence plays an important role in how well and
how fast students learn. Table 1.1 summarizes the idea that some types of learn-
ing can be accomplished quickly and easily, whereas other types of learning are
slow growing largely because of their complexity.
In light of Table 1.1, a subtle yet important difference distinguishes achieve-
ment and intelligence that might be helpful. If we have change in cognitive be-
havior that we can reasonably attribute to teaching or training, achievement
has occurred. If a student lacks an instructional history for some domain of con-
tent or some ability, something else has to account for that level of behavior.

What is probably accounting for test performance is not achievement but intel-
ligence. Thus, the role of instruction or training and instructional history is an
important consideration in deciding if a test or test item reflects achievement
or intelligence.

VALIDITY

Validity is "the degree to which accumulated evidence and theory support spe-
cific interpretations of test scores entailed by proposed uses" (American Edu-
cational Research Association [AERA], American Psychological
Association [APA], and National Council on Measurement in Education
[NCME], 1999, p. 84). For every testing program there is a purpose. To fulfill
this purpose, a test score has a clearly stated interpretation and an intended
use. The sponsor of the testing program creates a logical argument and assem-
bles validity evidence supporting that argument. Validity is the degree of sup-
port enabled by the logical argument and validity evidence upholding this
argument. In some instances, the validity evidence works against the argu-
ment and lessens validity. In these instances, the testing organization should
seek and take remedies to reverse the gravity of this negative kind of evi-
dence. The investigative process of creating this argument and collecting va-
lidity evidence testing this argument is validation.
Validity is much like what happens in a court of law. A prosecutor builds an
argument against an accused person concerning the commission of a crime.
The prosecutor collects and organizes evidence to support the argument
against the accused. The defense attorney creates another plausible argument
for the accused and uses evidence to support this argument. One argument op-
poses the other. The jury decides which argument is valid and to what degree
the argument is valid. We do the same thing in testing. We hope that the posi-
tive evidence greatly outweighs the negative evidence and that our argument is
also plausible.
Messick (1989) pointed out that a specific interpretation or use of test re-
sults is subject to a context made up of value implications and social conse-
quences. Thus, thinking of construct validation as merely the systematic
collection of evidence to support a specific test score interpretation or use is in-
sufficient. We must also think of the context that may underlie and influence
this interpretation or use. A good example of consequences comes from well-
documented practices in schools where publishers' standardized achievement
test scores are used as a criterion for educational accountability. Because of ex-
ternal pressure to raise test scores to show educational improvement, some
school personnel take extreme measures to increase test scores. Nolen,
Haladyna, and Haas (1992) showed that a variety of questionable tactics are
used to raise scores that may not increase student learning. The use of a test
that is poorly aligned with the state's curriculum and content standards, cou-
pled with test-based accountability, results in test scores that may not be validly
interpreted or used.
In this book, the focus of validity and validation is both with test scores
and item responses, simply because we interpret and use item responses just
as we interpret and use test scores. Because items and item responses are
subunits of test and test scores, validity is important for both item responses
and test scores.
However, the validity evidence we gather to support interpreting an item re-
sponse is also part of the validity evidence we use to support the interpreting of
a test score.

THREE STEPS IN THE PROCESS OF VALIDATION

According to Cronbach (1971), three essential, sequential steps in validation
are formulation, explication, and validation. The first two steps are part of the
process of theorizing, leading to creation of a test. The third step is the process
that involves collecting the validity evidence, supporting the interpretation
and use of test scores.
In formulation, a construct is identified, named, and defined. The Standards
for Educational and Psychological Testing (AERA et al., 1999) uses the term con-
struct broadly to represent operationally defined domains as well as abstract
concepts. However, these new standards make clear the importance of defining
the construct:

The test developer should set forth clearly how test scores are intended to be in-
terpreted and used. The population(s) for which a test is appropriate should be
clearly delimited, and the construct that the test is intended to assess should be
clearly described. (AERA et al., 1999, p. 17)

As a society, we are greatly interested in the development of cognitive abilities
of its citizens. These cognitive abilities include reading, writing, speaking, lis-
tening, mathematical and scientific problem solving, and critical and creative
thinking. In our daily activities, these abilities are constantly called into action.
Writing is well defined in school curricula, in local and state content stan-
dards, and by national learned societies. Writing's connection to other aspects
of schooling and life is obvious and commonly accepted without argument.
Roid (1994) provided a clear definition of writing ability. Although we have
concrete, observable writing skills, such as spelling and punctuation, the as-
sessment of writing is usually done by trained judges evaluating student writing
using rating scales.
Through this process of formulation, the definition and connectedness of
any achievement construct, such as writing, to other constructs, such as social

studies, must be clear enough for test developers to construct variables that be-
have according to the ideas about our constructs, as Fig. 1.1 illustrates. Two
constructs are defined. The first is the quality of instruction, and the second is
writing ability that instruction is supposed to influence. In the first phase, both
instruction and writing ability are abstractly defined. It is hypothesized that
quality of instruction influences writing ability. A correlation between mea-
sures of quality of instruction and writing ability tells us to what extent our pre-
diction is borne out by the data. One could conduct formal experiments to
establish the same causal relation.
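
The correlation-based prediction just described can be illustrated with a brief sketch. This is a minimal, hypothetical example: the scores are invented and the computation is not taken from the book; it simply shows the kind of evidence the correlation in Fig. 1.1 would provide.

```python
# Hypothetical sketch of the construct validation step: correlate a measure
# of quality of instruction with a measure of writing ability.
from statistics import correlation  # Pearson's r; available in Python 3.10+

quality_of_instruction = [3.1, 4.2, 2.5, 4.8, 3.9, 2.2]  # e.g., observation ratings
writing_ability        = [2.8, 4.0, 2.9, 4.6, 3.5, 2.1]  # e.g., rated writing samples

r = correlation(quality_of_instruction, writing_ability)
print(round(r, 2))  # a sizable positive r is one piece of validity evidence
```
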
In explication, measures of each construct are identified or created. Gen-
erally, multiple measures are used to tap more adequately all the aspects of each
construct. The most direct measure would be a performance-based writing
prompt. MC items might measure knowledge of writing or knowledge of writ-
ing skills, but they would not provide a direct measure. In explication, Messick
(1989) identified a threat to validity: construct underrepresentation.
Frederiksen (1984) argued that the overreliance on MC may have contributed
to overemphasis on learning and testing knowledge at the expense of the more
difficult-to-measure cognitive abilities.
In validation, evidence is collected to confirm our hopes that an achieve-
ment test score can be interpreted and used validly. This evidence includes
empirical studies and procedures (Haladyna, 2002). The evidence should be
well organized and compelling in support of the plausible argument regarding
the validity of the meaning of the test score and the validity of its use. The val-
idation process also includes a summary judgment of the adequacy of this evi-
dence in support or against the intended interpretation or use.
Messick (1995a, 1995b) provided a structure for thinking about this validity
evidence, and the Standards for Educational and Psychological Testing (AERA et
al., 1999) provide a useful description of the sources of validity evidence.

Construct definition: Quality of teaching is defined and hypothesized to affect the development of a fluid ability, such as writing. Writing ability is defined and is hypothesized to be affected by the quality of teaching.

Construct explication: A measure of quality of teaching is developed to reflect the construct of quality of teaching. A measure of writing ability is developed to reflect the construct of writing ability.

Construct validation: The two measures are correlated. The size of this correlation can be used as evidence, along with other evidence, showing that the quality of teaching affects the development of ability.

FIG. 1.1. The logic of construct validation.



1. The content of the test, including its relevance to the construct and
the representativeness of the sampling, is a source of validity evidence.
2. The connection of test behavior to the theoretical rationale behind
test behavior is another type of evidence. Claims about what a test measures
should be supported by evidence of cognitive processes underlying perfor-
mance (Martinez, 1998). Implications exist in this category for the choice of
an item format.
3. The internal structure of test data involves an assessment of fidelity of
item formats and scoring to the construct interpretation (Haladyna, 1998).
Messick (1989) referred to this as "structural fidelity." Therefore, a crucial
concern is the logical connection between item formats and desired inter-
pretations. For instance, an MC test of writing skills would have low fidelity
to actual writing. A writing sample would have higher fidelity. Another facet
of internal structure is dimensionality, which is discussed in chapter 10.
4. The external relationship of test scores to other variables is another
type of evidence. We may examine group differences that are known to exist
and we seek confirmation of such differences, or we may want to know if like
measures are more correlated than unlike measures. Another type of rela-
tionship is test-criterion. The patterns among item responses should clearly
support our interpretations. Evidence to the contrary works against valid in-
terpretation.
5. We hope that any measure of a construct generalizes to the whole of
the construct and does not underrepresent that construct. The general-
izability aspect relates to how test scores remain consistent across different
samples. One aspect of this is differential item functioning (DIF) and bias, a
topic treated in chapter 10. This aspect of validity evidence also refers to de-
velopment of an ability over time.
6. Finally, the consequences of test score interpretations and uses must
be considered, as discussed previously with misuses and misinterpretations
of standardized achievement test scores.

Haladyna (2002) showed how classes of validity evidence link to specific
AERA, APA, and NCME standards. Table 1.2 shows the specific validity evi-
dence associated with test items.
Validity involves a subjective judgment of this validity argument and its
validity evidence. We take this evidence collectively as supporting or not sup-
porting interpretations or uses to some degree. Although Cronbach (1988)
and Kane (1992) described this process as the building of an argument sup-
porting interpretation of test scores, four types of problems can undermine
validity:

1. Failure to define constructs adequately (inadequate formulation), a
problem that has troubled education for some time.
TABLE 1.2
Standards That Refer to Item Development and Item Response

3.6. The types of items, the response formats, scoring procedures, and test
administration procedures should be selected based on the purposes of the test, the
domain to be measured, and the intended test takers. To the extent possible, test
content should be chosen to ensure that intended inferences from test scores are equally
valid for members of different groups of test takers. The test review process should
include empirical analyses and, when appropriate, the use of expert judges to review
items and response formats. The qualifications, relevant experiences, and demographic
characteristics of expert judges should also be documented.
3.7. The procedures used to develop, review, and tryout items, and to select items
from the item pool should be documented. If the items were classified into different
categories or subtests according to the test specifications, the procedures used for the
classification and the appropriateness and accuracy of the classification should also be
documented.
3.8. When item tryouts or field tests are conducted, the procedures used to select the
sample(s) of test takers for item tryouts and the resulting characteristics of the sample(s)
should be documented. When appropriate, the sample(s) should be as representative as
possible of the populations for which the test is intended.
3.9. When a test developer evaluates the psychometric properties of items, the
classical or item response theory (IRT) model used for evaluating the psychometric
properties of items should be documented. The sample used for estimating item
properties should be described and should be of adequate size and diversity for the
procedure. The process by which items are selected and the data used for item selection,
such as item difficulty, item discrimination, and/or item information, should also be
documented. When IRT is used to estimate item parameters in test development, the
item response models, estimation procedures, and evidence of model fit should be
documented.
7.3. When credible research reports that differential item functioning exists across
age, gender, racial/ethnic, cultural, disability, and/or linguistic groups in the population
of test takers in the content domain measured by the test, test developers should
conduct appropriate studies when feasible. Such research should seek to detect and
eliminate aspects of test design, content, and format that might bias test scores for
particular groups.
7.4. Test developers should strive to identify and eliminate language, symbols, words,
phrases, and content that are generally regarded as offensive by members of racial,
ethnic, gender, or other groups, except when judged to be necessary for adequate
representation of the domain.
7.7. In testing applications where the level of linguistic or reading ability is not part of
the construct of interest, the linguistic or reading demands of the test should be kept to
the minimum necessary for the valid assessment of the intended construct.


2. Failure to identify or create measures of the aspects of each construct (an
inadequate explication), which Messick (1989) referred to as construct
underrepresentation.
3. Failure to assemble adequate evidence supporting predictions made from
our theorizing (inadequate validation).
4. Discovering sources of construct-irrelevant variance (CIV; Messick,
1989). This problem exists when we find systematic error in test scores.
Haladyna and Downing (in press) identify many sources of CIV and pro-
vide documentation of their seriousness. Validation that fails to support
the validity of interpreting and using test scores is contrary. CIV repre-
sents this threat to validation.

THE ITEM-DEVELOPMENT PROCESS

This last section discusses the item-development process. Table 1.3 gives a
short summary of the many important steps one follows in developing test
items for a testing program. This section gives the reader a more complete un-
derstanding of the care and detail needed to produce an item bank consisting of
operational items that are ready to use on future tests.

TABLE 1.3
The Item-Development Process

1. Make a plan for how items will be developed.
2. Create a schedule for item development.
3. Conduct an inventory of items in the item bank.
4. Identify the number of items needed in each of these areas.
5. Identify and recruit qualified subject matter experts for developing new items.
6. Develop an item-writing guide.
7. Distribute the guide to the item writers.
8. Conduct item-writing training for these item writers.
9. Make assignments to item writers based on the inventory and the evaluation of
needs.
10. Conduct reviews discussed in chapter 8 leading to one of three decisions: keep,
revise, retire.
11. Field test surviving items.
12. Evaluate the performance of items.
13. Place surviving items in the operational item bank.

The Plan

As simple as it sounds, a good plan is essential to creating and maintaining an
item bank. The plan should detail the steps found in Table 1.3: the schedule,
the resources needed, and personnel responsible. One of the primary costs of
any testing program is item development. As you can see, the process is not
short and simple, but involved.

The Schedule

The schedule should be realistic and provide a list of tasks and the persons who
will be responsible for completing each task. Sometimes schedules can be unre-
alistic, expecting that items can be written in a short time. Experience will
show that developing a healthy item bank may take more than one or two years,
depending on the resources available.

Inventory

Test specifications (test blueprint or table of specifications) show the test
developers how many items are to be selected for the test, the types of con-
tent being tested, and the types of cognitive behaviors required of test tak-
ers when responding to each item. Items are selected based on these
specifications and other technical considerations, such as item difficulty
and discrimination. The standards (AERA et al., 1999, p. 42) stated in
Standard 3.3:

The test specifications should be documented, along with its rationale and the
process by which it was developed. The test specifications should define the
content of the test, the proposed number of items, and item formats, the de-
sired psychometric properties of the items, and the item and section arrange-
ment. It should also specify the amount of time for testing, directions to the test
takers, procedures to be used for test administration and scoring, and other rel-
evant information.

By knowing the number of items in the test and other conditions affecting
test design, the test developer can ascertain the number of items that need to
be developed. Although it depends on various circumstances, we try to have
about 250% of the items needed for any one test in our item bank. But this esti-
mate may vary depending on these circumstances. The inventory is the main
way that we find out what items are needed to keep our supply of items ade-
quate for future needs.
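
As a rough illustration of the inventory step and the "about 250%" guideline mentioned above, the following sketch estimates how many new items an assignment should call for. The function name and the numbers are hypothetical.

```python
# Illustrative sketch: applying the rough 250% guideline to estimate how
# many new items an inventory calls for. Numbers are hypothetical.

def items_to_write(items_per_form, usable_items_in_bank, bank_ratio=2.5):
    """Return how many new items are needed to reach the target bank size."""
    target_bank_size = round(items_per_form * bank_ratio)
    return max(0, target_bank_size - usable_items_in_bank)

# A 100-item test form with 180 usable items already in the bank:
print(items_to_write(100, 180))   # -> 70 more items to develop
```
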

Recruitment of Item Writers

The quality of items depends directly on the skill and expertise of the item writ-
ers. No amount of editing or the various reviews presented and discussed in
chapter 8 will improve poorly written items. For this reason, the recruitment of
item writers is a significant step in the item-development process. These item
writers should be subject-matter experts (SMEs), preferably in a specialty area
for which they will be assigned items. Because these SMEs will be writing items,
they will need to document each item's content authenticity and verify that
there is one and only one right answer. They will also become expert reviewers
of colleagues' items.

Develop an Item-Writing Guide

An item-writing guide should be developed and given to all item writers. The
guide should be specific about all significant aspects of item writing. At a mini-
mum, the guide should tell item writers which item formats are to be used and
which should be avoided. The guide should have many examples of model
items. Guidelines for writing items such as presented in chapter 5 should be
presented. One feature that is probably not prevalent in most item-writing
guides but is greatly needed are techniques for developing items rapidly. Chap-
ter 6 provides many model items, and chapter 7 provides some techniques to
make item writing easier and faster.
An excellent example of an item-writing guide can be found in Case and
Swanson (2001). The guide is used in training item writers for the national
board examinations in medicine. It is in its third edition and can be found by
going to the National Board of Medical Examiners' web page, www.nbme.org.

Item-Writing Training

Any testing program that is serious about validity should engage all item
writers in item-writing training. The way training and item writing is con-
ducted may seem mundane, but the question arises: Does one type of train-
ing produce better items than other types of training? One study by Case,
Holtzman, and Ripkey (2001) addressed this question, which involved the
United States Medical Licensing Examination. In an evaluation of three ap-
proaches to writing items, they used number of items written, quality of
items, and cost as factors in evaluating the three approaches. The traditional
training method involved a committee with a chair, formal item-writing
training, assignments to these item writers to write items targeted by con-
tent and cognitive processes, an iteration of reviews and reactions between

editors and authors of items, and an item review meeting. The second type
was a one-time task force that met once, received training, wrote items, and
reviewed each other's items. The third type was an item-harvesting ap-
proach in which a group was asked to write some items and was sent the ex-
cellent item-writing guide, and it submitted items for evaluation. The yield
of items per type was small for the latter two methods, and the quality was
lower. Case et al. preferred the traditional method but acknowledged that
for low-budget testing programs, the latter two methods have merit for pro-
ducing high-quality items.

Item-Writing Assignments

As was stated in the recruitment of item writers, each item writer should be
chosen for a particular subject matter expertise, and considering the inven-
tory, each item writer should be assigned to develop items that will potentially
improve the item bank and eventually make it into future tests. Therefore,
the assignments should be made thoughtfully. Unless item writers are com-
pensated, item writing can be a difficult thing to do if the item writer is a busy
professional, which is often the case. Usually someone is responsible for mon-
itoring item writers and making sure that the assignment is completed on
time, according to the schedule that was adopted.

Conduct Reviews

When items are drafted, they are typically subjected to many complemen-
tary reviews. This is the subject of chapter 8. These reviews are intended to
take these initially drafted items and polish them. The reviews are con-
ducted by different personnel, depending on the nature of the review. One
of the most important reviews is by other SMEs for a judgment of the quality
of the item.

Field Test and Subsequent Evaluation

When an item has been properly written and has survived all of these reviews,
the next important step is to try this item out on an operational test. It is im-
portant to assess the item relative to other items on the test, but it is also im-
portant not to use each field test item in obtaining the final test score. If the
item passes this final hurdle and performs adequately, the item can be placed
in the item bank where it can be used in future tests. Chapter 9 provides infor-
mation about the criteria used to evaluate item performance.

HOW DOES ITEM DEVELOPMENT LINK TO VALIDITY?

Because the test score and the item response have a logical connection, the
process that is defined for validating test score interpretations and uses also ap-
plies to item responses. We can define what an item is supposed to measure and
the type of cognitive behavior it elicits. We can write the item, which is the ex-
plication step in construct validation, and we can study the responses to the
item to determine whether it behaves the way we think it should behave. Table
1.4 shows the parallelism existing between test score validation and item re-
sponse validation.

SUMMARY

In this chapter a major theme is the role validity plays toward making test score
interpretations and uses as truthful as possible. A parallelism exists between
tests and test items and between test scores and item responses. The logic and
validation process applied to tests equally applies to test items, and the validity
evidence obtained at the item level contributes to the validation of test scores.

TABLE 1.4
Three Steps in Construct Validation

1. Formulation
   Test score: Define the construct.
   Item response: Define the basis for the item in terms of its content and cognitive behavior related to the construct.
2. Explication
   Test score: The test.
   Item response: The item.
3. Validation
   Test score: Evidence bearing on the interpretation and use of test scores for a specific purpose.
   Item response: Evidence bearing on the interpretation and use of an item response with other item responses in creating a test score that can be validly interpreted or used.
2
Content
and Cognitive Processes

OVERVIEW

As noted in chapter 1, all test scores and associated test item responses have in-
tended interpretations. Both test scores and item responses are also subject to
validation. Although the types of evidence may vary for test score and item re-
sponse validations, the logic and process of validation are the same.
The test specifications that assist us in the design of a test call for selection of
items according to the item's content and the cognitive process thought to be
elicited when a test taker responds to the item. This claim for connection be-
tween what is desired in test specification and the content and cognitive pro-
cess of each test item is fundamental to validity. Therefore, each test item
should be accurately classified according to its content and intended cognitive
process. This chapter is devoted to the related topics of item content and cog-
nitive process, sometimes referred to as cognitive demand.
The first part of this chapter provides a discussion of issues and problems af-
fecting content and cognitive process. The second part presents a simple classi-
fication system for test items that includes natural, generic categories of
content and cognitive processes. Examples appearing in this chapter draw from
familiar content areas: reading, writing, and mathematics. These subjects are
prominent in all national, state, and local school district testing programs.

BACKGROUND

What Is Cognition?
Cognition is the act or process of knowing something. It is perception. Be-
cause cognition involves human thought, it is a private event. In a contrived

setting we call a test, we observe someone's responses to test items. By evalu-


ating each response, we infer that the person has a certain degree of knowl-
edge. As we further explore cognition, we realize that knowledge is just one
aspect of cognition. Skill is a performed cognitive or physical act that requires
knowledge. Skill is easily observable because it has a simple, unitary appear-
ance. Ability is something more complex than knowledge and skill. The mea-
surement of a cognitive ability usually requires a complex application of
knowledge and skills. Although the study and classification of test items by
content are better understood and more easily done, the study and classifica-
tion of test items by cognitive process have proven difficult. In this first part of
the chapter we explore issues and problems involving cognitive process.

Issues and Problems With Cognitive Process

This section deals with four issues and problems related to any classification
system involving cognitive process: (a) the distinction between theoreti-
cally based and prescriptive cognitive process taxonomies, (b) the limita-
tions of current prescriptive taxonomies, (c) the ultimate dilemma with
measuring any cognitive process, and (d) the emergence of construct-cen-
tered measurement.

The Distinction Between Theoretically Based and Prescriptive Cognitive Process Taxonomies

All taxonomies contain content and cognitive process dimensions. The


structure and organization of the content dimension seem fairly simple and
straightforward. We generally have some topics, and items are classified ac-
cordingly by SMEs. The second dimension is cognitive process, which seems
more difficult to reconcile. A distinguishing characteristic of the cognitive
process dimension in taxonomies is whether each is based on a theory of cog-
nition or is simply prescriptive. Theoretically based methods for defining and
measuring cognitive process involve theoretical terms, statements of
cause-effect relation, and principles governing how various cognitive pro-
cesses are developed. Such an approach is more comprehensive than simply
listing and defining categories of mental behavior along a continuum of com-
plexity. Cognitive learning theories provide a holistic treatment of student
learning from the identification of content and cognitive process, principles
of instructional design, and principles of assessment involving diagnosis and
remediation, among other aspects. Gagne's (1968) hierarchy is one example,
but there is little evidence of its construct validation or widespread use. An-
other, more recent cognitive process taxonomy was proposed by Royer,
Cisero, and Carlo (1993). It is a good example of theory-driven cognitive pro-

cesses based on the learning theory of Anderson (1990). Their description is


comprehensive with regard to how knowledge and skill are defined; how
knowledge is obtained, organized, and used; and how mental models work.
Although this promising work has a theoretical basis and impressive research
supporting its use, it does not seem ready for implementation.
Gitomer and Rock (1993) addressed the problem of cognitive process in test
design using a hierarchical cognitive demand model. They also presented and
discussed related work. Their interest was in improving cognitive process rep-
resentation for diagnostic purposes, one of the main reasons for giving achieve-
ment tests. They explored systematic ways to track complex cognitive process
in mathematics.
Cognitive psychology is a loosely organized field with no central para-
digm driving theory. Unlike behaviorism, cognitive psychology has no uni-
versally accepted way of thinking about learning and how to deal with the
practical problem of classifying student behavior. On the other hand, there
is substantial progress to report. One volume, Test Theory for a New Genera-
tion of Tests (Frederiksen, Mislevy, & Bejar, 1993), has provided one of the
best accounts of emerging thinking about cognitive process and test item
design. These approaches and other more recent developments are dis-
cussed in chapter 11.
Prescriptive methods are born from the necessity of providing practitio-
ners with methods they can readily apply. We implicitly know that there are
complex forms of behavior beyond recall, but how do we develop test items
of different cognitive demand and what classification system should we use?
Prescriptive taxonomies provide simple nontheoretical descriptions of cog-
nitive behavior that hopefully have achieved consensus among users of the
taxonomy.

The Limitations of Prescriptive Taxonomies

Although prescriptive taxonomies are commonly used in educational test-


ing, theoretically based taxonomies have to be the ultimate goal because a uni-
fied cognitive learning theory has the potential to provide a comprehensive
approach to defining content and cognitive processes in a curriculum, provid-
ing instruction with remedial branches, and accurately assessing outcomes.
Prescriptive approaches are too limited in this vision.
The best-known approach to classifying student learning objectives and test
items reflecting these objectives is the Bloom cognitive taxonomy (Bloom,
Engelhart, Furst, Hill, & Krathwohl, 1956). This book is one of the most influ-
ential in education, a standard reference for more than half a century. The con-
tribution of leading test specialists of the time went into the development of
this taxonomy. The cognitive taxonomy currently appears in a revised version
(Anderson & Krathwohl, 2001). In his interesting book, Classroom Questions,

Sanders (1966) provided many examples of test items based on this cognitive
process taxonomy. Anderson and Sosniak (1994) edited a volume of contribu-
tions dealing with aspects of the cognitive taxonomy. Contributing authors dis-
cussed the value and standing of the taxonomy as a means for increasing
concern about the development and measurement of different cognitive pro-
cesses. Despite the taxonomy's widespread popularity, Seddon (1978) reported
in his review of research that evidence neither supports nor refutes the taxon-
omy. A research study by Miller, Snowman, and O'Hara (1979) suggested that
this taxonomy represents fluid and crystallized intelligences. A study by
Dobson (2001) in a college-level class used this taxonomy and found differ-
ences in difficulty. Kreitzer and Madaus (1994) updated Seddon's review and
drew a similar conclusion. Higher level test performance was more difficult and
did not show improvement. However, studies such as this one are too few to
provide evidence that the taxonomy is viable.
Acknowledging that the Bloom cognitive taxonomy is an imperfect tool and
that studies of its validity are seldom up to the task, the taxonomy has contin-
ued to influence educators, psychologists, and testing specialists in their think-
ing about the need to define, teach, and assess higher level achievement.
Although the Bloom taxonomy continues to be an impressive marker in the
history of the study of student achievement, it does not provide the most effec-
tive guidance in test and item design. Most testing programs in my experience
use simpler cognitive classification systems that mainly include the first two
levels of this cognitive taxonomy.
Authors of textbooks on educational measurement routinely offer advice on
how to measure higher level thinking in achievement tests. For example, Linn
and Gronlund (2001) in their eighth edition of this popular textbook suggested
a simple three-category taxonomy, which includes the first two types of learn-
ing in the Bloom cognitive taxonomy and lists application as the third type of
learning. This third category involves the complex use of knowledge and skills.
This simpler approach to defining levels of cognitive behavior is currently the
most popular and easy to use.
Hopes of resolving the dilemma of finding a useful, prescriptive taxonomic
system for classifying items by cognitive process fall to professional organiza-
tions heavily invested in curriculum. Within each organization or through joint
efforts of associated organizations, content standards have emerged in reading,
writing, mathematics, science, and social studies.
The National Council of Teachers of English (NCTE; www.ncte.org) has a widely recognized set of reading standards that emerged in partnership with the International Reading Association (IRA; www.reading.org). Most states design their own content standards according to NCTE standards
(http://www.ode.state.or.us/tls/english/reading/). Table 2.1 lists the reading
content standards. As we see in Table 2.1, the content standards are broader
than educational objectives and seem to address cognitive ability rather than

TABLE 2.1
National Council of Teachers of English (NCTE) and International Reading
Association (IRA) Reading Content Standards

Students learn and effectively apply a variety of reading strategies for comprehending,
interpreting, and evaluating a wide range of texts including fiction, nonfiction, classic,
and contemporary works.
Contextual analysis: Recognize, pronounce, and know the meaning of words in text by using phonics, language structure, contextual clues, and visual cues.
Phonetic analysis: Locate information and clarify meaning by skimming, scanning, close reading, and other reading strategies.
Comprehension: Demonstrate literal comprehension of a variety of printed materials.
Inference: Demonstrate inferential comprehension of a variety of printed materials.
Evaluation: Demonstrate evaluative comprehension of a variety of printed materials.
Connections: Draw connections and explain relationships between reading selections and other texts, experiences, issues, and events.

specific knowledge and skills that are most often associated with instructional
objectives and teaching and learning.
Table 2.2 provides examples of writing content standards from the State of
California. Like reading, the focus is not on knowledge that is seen as prerequi-
site to skill or abilities but more on abilities. The writing of essays seems to en-
tail many abilities, including writing, creative and critical thinking, and even
problem solving. Like reading, these content standards reflect the teaching and
learning of knowledge and skills and the application of knowledge and skills in
complex ways.
The National Council of Teachers of Mathematics (NCTM) and the Na-
tional Assessment of Educational Progress (NAEP) have similar mathematics
content dimensions, as shown in Table 2.3 (nces.ed.gov/nationsreportcard).
Each standard contains many clearly stated objectives with a heavy emphasis
on skill development and mathematical problem solving in a meaningful con-
text. Isolated knowledge and skills seem to have little place in modern concep-
tions of mathematics education.
The National Research Council (http://www.nap.edu) also has developed
content standards in response to a perceived need. The standards are volun-
tary guidelines emphasizing the learning of knowledge and skills students need
to make everyday life decisions and become productive citizens. The standards
TABLE 2.2
Draft Writing Content Standards From California

Writing Strategies
Students write words and brief sentences that are legible.
Students write clear and coherent sentences and paragraphs that elaborate a central
impression, using stages of the writing process.
Students write clear, coherent, and focused essays that exhibit formal introductions,
bodies of supporting evidence, and conclusions, using stages of the writing process.
Students write coherent and focused essays that convey a well-defined perspective
and tightly reasoned argument, using stages of the writing process.

Writing Applications
Students write texts that describe and explain objects, events, and experiences that
are familiar to them, demonstrating command of standard English and the drafting,
research and organizational strategies noted previously.
Students write narrative, expository, persuasive, and literary essays (of at least 500 to
700 words), demonstrating command of standard English and the drafting research
and organizational strategies noted previously.
Students combine rhetorical strategies (narration, exposition, argumentation,
description) to produce essays (of at least 1,500 words when appropriate),
demonstrating command of standard English and the drafting, research and
organizational strategies noted previously.

TABLE 2.3
Mathematics Content Standards

NCTM NAEP
Number and operations Number sense, properties, and operations
Algebra Algebra and functions
Geometry Geometry and spatial sense
Data analysis and probability Data analysis, statistics, and probability
Measurement Measurement

Note. NCTM = National Council of Teachers of Mathematics; NAEP = National Assessment of


Educational Progress.


are impressively comprehensive including: (a) science content, (b) classroom


activities, (c) professional development, (d) classroom assessment methods,
(e) components of effective high-quality science programs, and (f) a concep-
tion of the broader system in which a science education program exists.
In social studies, the National Council for the Social Studies (www.ncss.org/) has developed content standards organized around 10 thematic strands. This organization also has developed teacher standards, including standards for preparing teachers of social studies.
Professionally developed national standards provide a model for virtually all achievement testing, including testing for certification, licensing, and proficiency. One of the most fundamental of the testing standards, Standard 1.6, states:

When the validation rests in part on the appropriateness of test content, the pro-
cedures followed in specifying and generating test content should be described
and justified in reference to the construct the test is intended to measure or the
domain it is intended to represent. If the definition of the content sampled incor-
porates criteria such as importance, frequency, or criticality, these criteria should
also be clearly explained and justified. (AERA et al., 1999, p. 18)

Standard 1.8 states:

If the rationale for a test use or score interpretation depends on premises about
the psychological processes or cognitive operations used by examinees, then the-
oretical or empirical evidence in support of those premises should be provided.
When statements about the processes employed by the observers or scorers are
part of the argument for validity, similar information should be provided. (AERA
et al., 1999, p. 19)

A Dilemma in Measuring Any Cognitive Process

Each test item is designed to measure a specific type of content and an in-
tended cognitive process. Although each student responds to a test item, no
one really knows the exact cognitive process used in making a choice in an MC
test or responding to a CR item. For any test item, the test taker may appear to be engaged in the higher level thinking the item is intended to elicit, but in actuality the test
taker may be remembering identical statements or ideas presented before, per-
haps verbatim in the textbook or stated in class and carefully copied into the
student's notes.
Mislevy (1993) provided an example of a nuclear medicine physician who
at one point in his or her career might detect a patient's cancerous growth in a
computerized tomography (CT) scan using reasoning, but at a later time in
his or her career would simply view the scan and recall the patient's problem.

The idea is that an expert works from memory, whereas a novice has to em-
ploy more complex strategies in problem solving. In fact, the change from a
high cognitive demand to a lower cognitive demand for the same complex
task is a distinguishing characteristic between experts and novices. The ex-
pert simply uses a well-organized knowledge network to respond to a complex
problem, whereas the novice has to employ higher level thought processes to
arrive at the same answer. This is the ultimate dilemma with the measure-
ment of cognitive process with any test item. Although a consensus of con-
tent experts may agree that an item appears to measure one type of cognitive
process, it may measure an entirely different type of cognitive process simply
because the test taker has a different set of prior experiences than other test
takers. This may also explain our failure to isolate measures of different cogni-
tive processes, because test items intended to reflect different cognitive pro-
cesses are often just recall to a highly experienced test taker. Therefore, no
empirical or statistical technique will ever be completely satisfactory in ex-
posing subscales reflecting cognitive process. Whatever tests we develop will
only approximate what we think the test taker is thinking when answering
the test item.

Emergence of Construct-Centered Measurement of Abilities

Frisbie, Miranda, and Baker (1993) reported a study of tests written to re-
flect material in elementary social studies and science textbooks. Their find-
ings indicated that most items tested isolated facts. These findings are
confirmed in other recent studies (e.g., Stiggins, Griswold, & Wikelund, 1989).
That the content of achievement tests in the past has focused on mostly low-
level knowledge is a widely held belief in education and training that is also
supported by studies.
The legacy of behaviorism for achievement testing is a model that sums per-
formance of disassociated bits of knowledge and skills. Sometimes, this sum of
learning is associated with a domain, and the test is a representative sample
from that domain. The latter half of the 20th century emphasized domain defi-
nition and sampling methods that yielded domain interpretations. Clearly, the
objective of education was the aggregation of knowledge, and the achievement
test provided us with samples of knowledge and skills that were to be learned
from this larger domain of knowledge and skills.
Although recalling information may be a worthwhile educational objective,
current approaches to student learning and teaching require more complex
outcomes than recall (Messick, 1984; NCTM, 1989; Nickerson, 1989; Snow,
1989; Snow & Lohman, 1989; Sternberg, 1998; Stiggins et al., 1989). School
reformers call for learning in various subject matter disciplines to deal with
life's many challenges (What Works, 1985). Constructivists argue that all learn-
ing should be meaningful to each learner. Little doubt exists in this era of test

reform that the measurement of these cognitive abilities will be preeminent.


Thus, we are seeing a shift away from testing fragmented knowledge and skills
that have existed in such tests for most of the 20th century to construct-cen-
tered cognitive ability measurement.
At the center of this emergence is an understanding of cognitive process in-
volving "the coordination of knowledge and skills in a particular domain and
the associated cognitive activities that underlie competent performance"
(Glaser & Baxter, 2002, p. 179). Glaser and Baxter (2002) also discussed a
"content-process space" needed in the completion of a complex assessment
task reflecting any of these abilities. Rather than having learners merely accrete
knowledge and skills, the tone of their writing and that of others emphasizes the
goal-directed use of knowledge and skills to solve an ill-structured problem or to
engage in a critical thinking or creative enterprise.

Conclusions

Based on the preceding discussion, several conclusions seem justified.

• No current cognitive process taxonomy seems validated by adequate
theoretical development, research, and consensus. Evidence bearing on this
statement comes from how consistently any cognitive process taxonomy is
actually used across testing programs and from the accuracy, as recorded in
research or technical reports, with which users of a taxonomy classify items by
cognitive process.
• In the absence of this validated cognitive process taxonomy, we con-
tinue to resort to prescriptive methods that help us define our content and
cognitive processes and to ensure that our tests are representative of this
content and these cognitive processes, as our testing standards suggest. The
work of learned societies has greatly advanced the definition of content
and cognitive processes. We need to rely more on this work.
• Test takers do not reveal the type of cognition they possess by an-
swering a test item. We can't make an inference about the type of cognition
because we don't know their instructional history, and most items are im-
perfectly written to elicit the cognitive behavior they should elicit. Any infer-
ence we make is guesswork based on our best intention. Therefore,
cognitive process classifications will not be as accurate or useful as we
would like.
• Construct-centered measurement has finally emerged as a useful para-
digm that should endure. The object in modern education and testing is not
the development of knowledge and skills but the development of cognitive
abilities, which emphasize the application of knowledge and skills in com-
plex ways. This fact, however, does not diminish the importance of learning

knowledge and skills. In the next part of this chapter, we see the supportive
role of knowledge and skills in developing cognitive abilities such as reading,
writing, speaking, listening, mathematical and scientific problem solving, crit-
ical thinking, and creative enterprises.

A TAXONOMY FOR CONTENT AND COGNITIVE PROCESS

This book's orientation for MC item writing is in the context of two types of stu-
dent learning. These two types are interrelated. In fact the second type is sup-
ported by the first type.
This first type of student learning is any well-defined domain of knowledge
and skills. Writing has a large body of knowledge and skills that must be learned
before students learn to write. This includes knowledge of concepts such as
knowledge of different modes of writing, such as narrative and persuasive, and
writing skills, such as spelling, punctuation, and grammar. Mathematics has a
well-defined domain of knowledge and skills. The four operations applied to
whole numbers, fractions, and decimals alone defines a large domain of skills in
mathematics.
The second type of learning is any construct-centered ability, for which a
complex CR test item seems most appropriate. This idea is briefly discussed in
chapter 1 and is expanded in this chapter. Some specific cognitive abilities of
concern are reading, writing, and mathematical problem solving.
The taxonomy presented here is an outgrowth of many proposals for classify-
ing items. Current learned societies and testing programs use a similar system of
classification. The taxonomy offered here is closely linked to the Bloom taxonomy
but is much simpler.
An organizing dimension is that learning and associated items all can be
classified into three categories: knowledge, skills, and abilities. Because these
are distinctly different categories, it is important to distinguish among them for
organizing instruction and testing the effects of instruction, which we call
achievement.
For the measurement of a domain of knowledge, the test specifications di-
rect a test designer to select items on the basis of content and cognitive process.
The knowledge category contains four content categories and two cognitive
process categories.
The skill category is simple, containing only two types: mental and physical.
Often, skills are grouped with knowledge because we can conveniently test for
cognitive skills using an MC format. Thus, it is convenient to think of a domain
of knowledge and skills as instructionally supportive, and tests are often
thought of as a representative sample of knowledge and skills from a larger do-
main of knowledge and skills.

The cognitive abilities category represents a unique category that necessarily


involves knowledge and skills. The tasks we use when measuring cognitive abili-
ties directly do not focus on knowledge and skills but emphasize the use of knowl-
edge and skills in complex and often unique ways. Ill-structured problems
constitute a domain that is hard to define. By their very nature, ill-structured problems seem to occur naturally in profusion and are hard to link to one another.
Although it would be desirable to develop algorithms and rigorously define com-
plex learning, we have difficulty defining what knowledge and skills are neces-
sary to learn in the performance of a complex task. Gitomer and Rock (1993)
reported some success in classifying items by cognitive demand using a five-cate-
gory classification that ranges from recall and routine types of learning to ingenu-
ity or insight and the applying of knowledge and skills in complex ways.

Knowledge and Its Cognitive Processes

There are many definitions of knowledge. One that seems to best fit this situa-
tion of designing items to measure achievement is: the body of truths accumu-
lated over time. We reveal a person's knowledge by asking questions or
prompting the person to talk and by listening and evaluating what they say.
Achievement testing allows us to infer knowledge through the use of the test.
But as pointed out in the first part of this chapter, knowing someone's cognition
seems to be a never-ending quest to understand ourselves. Achievement
testing is limited in inferring true cognition.
We can conceive of knowledge in two dimensions: content and cognitive
process, as the title of this chapter implies. Test specifications commonly call
for all items to be so classified. The validity of the interpretation of a test score
rests on a plausible argument and validity evidence. Some of this evidence co-
mes from good test design that shows that items have been correctly classified
so that the test designer can choose items that conform to the test specifica-
tions. The classification system for knowledge has two dimensions that, not
surprisingly, are content and cognitive process.
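Because every item must carry both classifications, many item banks simply store the two codes with the item. The following is a minimal sketch in Python of such record keeping; the field names and example items are hypothetical and do not represent any particular item-banking system.

    from dataclasses import dataclass

    CONTENT = {"fact", "concept", "principle", "procedure"}
    PROCESS = {"recalling", "understanding"}

    @dataclass
    class Item:
        stem: str
        content: str   # one of CONTENT
        process: str   # one of PROCESS

        def __post_init__(self):
            assert self.content in CONTENT and self.process in PROCESS

    bank = [
        Item("Which of the following is a prime number?", "fact", "recalling"),
        Item("Which of these numbers, none used in class, is prime?", "concept", "understanding"),
    ]

    # Pull the items that fill one cell of the content-by-process grid shown in Table 2.4.
    cell = [i for i in bank if i.content == "concept" and i.process == "understanding"]
    for item in cell:
        print(item.stem)

Classifications stored in this way can later be checked against the test specifications when items are selected for a test form.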
As Table 2.4 shows, all knowledge can be identified as falling into one of
these eight categories. An important distinction is the process dimension. First,
it has been asserted by many critics of instruction, training, and testing that re-

TABLE 2.4

Cognitive Process Content


Recalling Fact Concept Principle Procedure
Understanding Fact Concept Principle Procedure

calling has been overemphasized at the expense of understanding. We need to


place greater emphasis, even priority, on the teaching, learning, and measure-
ment of understanding over recall. MC has an important role to play in the
measurement of both the recalling and understanding of knowledge.

Two Types of Cognitive Process Involving Knowledge

The recalling of knowledge requires that the test item ask the test taker to
reproduce or recognize some content exactly as it was presented in a class or
training or in reading. Somewhere in each student's instructional history,
the content must be recovered verbatim. The testing of recall is often asso-
ciated with trivial content. Indeed, trivial learning probably involves the
memory of things that don't need to be learned or could be looked up in
some reference.
The understanding of knowledge is a more complex cognitive process be-
cause it requires that the knowledge being tested be presented in a novel way. This
cognitive process involves the paraphrasing of content or the providing of ex-
amples and nonexamples that have not been encountered in previous instruc-
tion, training, or reading.
This important distinction in cognitive processes is expanded in the next
section with examples coming from familiar instructional contexts.

Four Types of Knowledge Content

For our purposes, we can condense all knowledge into four useful content
categories: facts, concepts, principles, and procedures. Each test item intended
to measure knowledge will elicit a student behavior that focuses on one of these
four types of content. Both cognitive processes, recalling and understanding,
can be applied to each type of content, as Table 2.4 shows.

Fact. A fact is known by truth or experience. There is consensus about a


fact. Of course, all facts have a social context. But the meaning of a fact should
be undeniable and unarguable in a society. Drawing from generic writing and
mathematics curricula, Table 2.5 provides a list of student learning outcomes
involving facts.
Any test item intended to elicit student behavior about knowledge of facts
can measure this knowledge in a very direct way. A student either knows or
doesn't know a fact. Although the learning of facts may be necessary, most edu-
cators might argue that we tend to teach and test too many facts at the expense
of other content. As Table 2.5 shows, the learning of facts is usually associated
with recall. Associating a fact with the cognitive process of understanding
seems very challenging and, perhaps, impossible. Example 2.1 shows an MC
item calling for the recalling of a fact.

TABLE 2.5
Student Learning of Facts

A is a letter of the alphabet.


A period ( . ) is ending punctuation for a declarative sentence.
The sum of the interior angles of a triangle is 180 degrees.
7 is a prime number.

Which of the following is a prime number?

A. 4
B. 5
C. 15
D. 16

EXAMPLE 2.1. Testing for recalling a fact.

The student is provided with four plausible answers. Of course, choosing the
correct answer depends on the plausibility of the other choices and luck, if the
student is guessing. The student has remembered that 5 is a prime number. To
understand why 5 is a prime number requires an understanding of the concept
of prime number.

Concept. A concept is a class of objects or events that shares a common


set of characteristics. For example, a chair has the intended function of seat-
ing a person and usually has four legs, a flat surface, and a backrest. The con-
cept chair is noted by these distinguishing characteristics and other
characteristics that may not be as important. A table might resemble a chair
but lacks the backrest, although teenagers may use a table as a chair. We
could provide a list of objects, some of which are chairs and some of which
are not chairs. We can distinguish between chairs and nonchairs and by do-
ing so show our knowledge of the concept chair. Concepts can be abstract or
concrete. Love is an abstract concept and weight is a concrete concept.
Other examples of concepts from reading, writing, and mathematics are
given in Table 2.6. With each of these examples, we might test for recall of
definitions or identifying examples and nonexamples presented in class or
in reading, or we can test for understanding by providing a paraphrased defi-

TABLE 2.6
Student Learning of Concepts

Explain the concepts related to units of measure and show how to measure with
nonstandard units (e.g., paper clips) and standard metric and U.S. units (concepts are
inches, feet, yards, centimeters, meters, cups, gallons, liters, ounces, pounds, grams,
kilograms).
Identify two-dimensional shapes by attribute (concepts are square, circle, triangle,
rectangle, rhombus, parallelogram, pentagon, hexagon).
Define allusion, metaphor, simile, and onomatopoeia.
Identify the components of a personal narrative using your own words or ideas.

nition not presented previously in class or in reading, or by providing a set of
examples and nonexamples not previously encountered.

Principle. A principle is a statement of relationship, usually between two


or more concepts. A principle often takes the form: "If ..., then ...." Principles
come in two forms: immutable law and probable event. For instance, it is im-
mutable that hot air rises on our planet and cold air sinks. Many immutable
principles of science are laws. On the other hand, principles exist that have ei-
ther exact probabilities or subjective probabilities (guesses). A very tall basket-
ball player blocks more shots than a very short basketball player. Driving
without a seat belt fastened is more likely to result in serious personal injury
than driving with the seat belt fastened. With more data or a statistical model,
we can estimate the probability of an event. A set of student learning outcomes
involving principles is given in Table 2.7. Sometimes it is difficult to see how
principles are embedded in such outcomes. The testing of principles can be at a
recall level, which may seem trivial, but the emphasis in this book and in mod-
ern education is understanding principles and applying these principles in
ill-structured problems or situations. Chapter 6 provides examples of exem-
plary items that address principles. A key point in designing items that measure
understanding of a principle is the use of ideas such as predict or evaluate. Invariably,
the cognitive demand requires students to apply a principle to a novel situation
or to select which principle applies to a given, novel situation. In some circum-
stances, students are asked to evaluate something using criteria provided or im-
plicit criteria. The process of evaluating involves some relational situation,
where the criteria are applied in a novel situation.

Procedure. A procedure is a series of related actions with an objective or


desired result. The actions may be mental or physical. A procedure is normally

TABLE 2.7
Student Learning of Principles

Predict events, actions, and behaviors using prior knowledge or details to comprehend a
reading selection.
Evaluate written directions for sequence and completeness.
Determine cause-and-effect relationships.
Evaluate the reasonableness of results using a variety of mental computation and
estimation techniques.
Apply the correct strategy (estimating, approximating, rounding, exact calculation)
when solving a problem.
Draw conclusions from graphed data.
Predict an outcome in a probability experiment.

associated with a skill. But a skill is much more than simply a procedure. If a skill
is something we observe being performed, does it make sense to think of knowledge of
a procedure? We might think that before one learns to perform a skill, the
learner needs to know what to do. Therefore, we can think of testing for knowl-
edge of procedures as a memory task or an understanding task that comes be-
fore actually performing the skill.
Mental procedures abound in different curricula. Adding numbers, find-
ing the square root of a number, and finding the mean for a set of numbers are
mental procedures. As with physical procedures, the focus here is asking a
student for knowledge of the procedure. How do you add numbers? How do
you find the square root of a number? How do you determine the mean for a
set of numbers? Table 2.8 provides some examples of student learning of
knowledge of procedures.
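To make this distinction between knowing and performing a procedure concrete, the fragment below simply writes out the steps of one familiar mental procedure, finding the mean of a set of numbers. An item testing knowledge of the procedure asks about these steps; an item testing the skill asks the student to carry them out. The code is an illustration of the steps only; the numbers are hypothetical.

    def mean(numbers):
        # Step 1: add the numbers.
        total = 0
        for n in numbers:
            total += n
        # Step 2: count how many numbers there are.
        count = len(numbers)
        # Step 3: divide the sum by the count.
        return total / count

    print(mean([4, 8, 15, 16, 23]))  # prints 13.2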
Unlike a mental procedure, a physical procedure is directly observable. Ex-
amples are: cutting with scissors, sharpening a pencil, and putting a key in a

TABLE 2.8
Student Learning of Procedures

Describe the steps in writing a story.


Identify key elements in writing a summary representing an author's position.
Delineate the procedures to follow in writing a persuasive essay.
Define and give examples of transitional devices (e.g., conjunctive adverbs,
coordinating conjunctions, subordinating conjunctions).

keyhole. Each example constitutes a physical act with a mental component.


Each requires knowledge of a procedure that must be learned before the physi-
cal act is performed. The performing of a physical act is often termed psycho-
motor because the mind is involved in performing the physical act. The focus in
this section is the measurement of knowledge of a physical procedure. How do
you cut with scissors? How do you sharpen a pencil? How do you assemble a
pendulum?
For each of these examples, students can either recall or understand proce-
dures. Procedures can be presented in a test item in verbatim language to class
presentation, reading, or some other source, or the content can be presented in
a novel way to elicit understanding. The subtleties of testing for recall or under-
standing can be better appreciated in chapter 6 where items are presented in
various contexts that attempt to show these differences.

Skill

The second type of student learning involves the performance of a mental or


physical act. A skill can be performed and should be observed to verify that the
learner has learned the skill. Thus, the natural format for any skill is CR. The
performance is either rated if the skill is judged to be abstractly defined or ob-
served as a dichotomous event (yes-no, right-wrong, 1-0) if the skill is judged
to be operationally defined.
The previous section discusses knowledge of procedures, which touches on
the distinction between knowledge of procedures and the actual performance
of a skill. In this section, these distinctions are further discussed and more ex-
amples are given.
For the most part, the kind of skills we are interested in testing are mental
skills. For most achievement tests, knowledge and skills are often grouped to-
gether. We can imagine a domain of knowledge and skills, and these tests are
representative samples of knowledge and skills. Most standardized achieve-
ment tests, such as the Iowa Test of Basic Skills or the Stanford Achievement
Test are designed with this in mind, a mixture of knowledge and skills repre-
senting a large domain of knowledge and skills.
Table 2.9 gives a list of student learning of skills. This list shows that skills
can be unitary in nature, involving a simple act, or can be thought of as a set
of steps in a procedure. There is no cognitive process dimension to skills.
The difficulty of some skills can be scaled because some of the performances
are more difficult than others. Take the addition example for two- and
three-digit numbers. Some skills are more abstract in nature and we choose
to rate performance. For example, in writing, we might be interested in
knowing how well a student uses transitional devices to sharpen the focus
and clarify the meaning of the writing. Although we can note instances of

TABLE 2.9
Student Learning of Cognitive Skills

Reading
Identify main characters in a short story.
Identify facts from nonfiction material.
Differentiate facts from opinions.
Writing
Spell high-frequency words correctly.
Capitalize sentence beginnings and proper nouns.
Preserve the author's perspective and voice in a summary of that author's work.
Mathematics
Add and subtract two- and three-digit whole numbers.
State the factors for a given whole number.
Sort numbers by their properties.

conjunctive adverbs, coordinating conjunctions, and subordinating conjunctions,
the overall impression of a trained evaluator is often taken as evi-
dence that a student has learned this skill. This impression is recorded as a
rating on a numerical rating scale because the skill is abstractly defined in-
stead of operationally defined.
As you can see from the discussion and the examples in Table 2.9, cogni-
tive skills can range from basic, elemental, and objectively observable to
complex and not directly observable. In some instances, the MC formats
might work well, but in instances when the skill is judgmental, MC does not
work at all.

Ability

A prevailing theme in this book and in cognitive learning theory is the develop-
ment of cognitive abilities. Different psychologists use different names.
Lohman (1993) called them fluid abilities. Messick (1984) called them develop-
ing abilities. Sternberg (1998) called them learned abilities. Each of these terms
is effective in capturing the idea that these complex mental abilities can be de-
veloped over time and with practice. These cognitive abilities are well known
to us and constitute most of the school curricula: Reading, writing, speaking,
and listening constitute the language arts. Problem solving, critical thinking,
and creative thinking cut across virtually all curricula and are highly prized in

our society. In mathematics, the NCTM makes clear that problem solving is a
central concern in mathematics education.

Examples of Cognitive Abilities

We have literally thousands of cognitive abilities that abound in our world.


Many of these abilities reside in professions. Medical ability is possessed by li-
censed physicians. A pediatrician has a highly specialized medical ability. Ac-
counting ability goes with being a certified public accountant. Architects,
automotive repair specialists, dentists, dental hygienists, dieticians, electri-
cians, financial analysts, plumbers, police officers, social workers, and teachers
all have developed abilities in their chosen profession.
Cognitive abilities are useful for all of us to apply to our occupations and in
other roles we play as citizen, homemaker, parent, and worker. Reading, writing,
speaking, listening, mathematical and scientific problem solving, critical think-
ing, and creative thinking abilities pervade every aspect of our daily lives. All
sports and recreation represent forms of ability. All visual and performing arts are
abilities, including poetry, play writing, acting, film, sculpting, and architecture.

Constituent Parts of a Cognitive Ability

All of these abilities have the following in common:

• A complex structure that includes a large domain of knowledge and skills


• An emotional component that motivates us to persevere in developing
this ability
• A domain of ill-structured problems or situations that are commonly en-
countered in performing this ability

Any cognitive ability is likely to rely on a body of knowledge and skills, but
the demonstration of a cognitive ability involves a complex task that requires
the student to use knowledge and skills in a unique combination to accomplish
a complex outcome. Psychologists and measurement specialists have resorted
to cognitive task analysis to uncover the network of knowledge and skills
needed to map out successful and unsuccessful performance in an ill-struc-
tured problem. This task analysis identifies the knowledge and skills needed to
be learned before completing each complex task. But more is needed. The stu-
dent needs to know how to select and combine knowledge and skills to arrive at
a solution to a problem or a conclusion to a task. Often, there is more than one
way to combine knowledge and skills for a desirable outcome.
Another aspect of cognitive abilities that Snow and Lohman (1989) believe
to be important is conative, the emotional aspect of human cognitive behavior.

This emotional aspect is also becoming more formalized as an important aspect


of any cognitive ability, termed emotional intelligence (Goleman, 1995). The
overriding idea about each cognitive ability is the tendency to apply knowledge
and skills to a novel situation in a way that produces a favorable result.

The Development of a Cognitive Ability

Cognitive abilities grow slowly over a lifetime, influenced by maturation,


learning, practice, and other experiences. Schooling represents a primary in-
fluence in the development of many cognitive abilities (Lohman, 1993).
Graduate and professional schools reflect advanced education where cogni-
tive abilities are extended. Special academies are formed to concentrate on
specific cognitive abilities. Talented individuals spend lifetimes perfecting
their abilities.
Abilities influence one another. The cognitive abilities of problem solving,
critical thinking, and creative thinking seem universally important to the de-
velopment of other cognitive abilities, and are often mentioned in this book.
Take a familiar cognitive ability, writing. Aspects of writing ability include
different writing modes, such as narrative, expository, persuasive, and cre-
ative. Writing is evaluated based on various analytic traits, such as conven-
tions, organization, word choice, and style. The development of writing
ability begins with simple behaviors mostly involving knowledge and skills.
Writing ability grows slowly over a lifetime. And good writers have a passion
for writing that motivates them to spend long hours practicing and improving
this ability. Writing ability influences other abilities such as critical thinking,
problem solving, or creative endeavors, such as journalism, play writing, and
writing novels.
Naturally, most abilities are heavily influenced by other abilities. A great
novelist, such as John Irving, must have great writing ability but must also have
great creative thinking ability. And he must have a passion for writing. An out-
standing athlete, such as Tiger Woods, must have considerable golfing ability
but also must have problem solving and critical thinking ability to perform at
the highest level. The emotional element needed in each ability is always evi-
dent with respect to motivation, attitude, perseverance, self-confidence, and
self-esteem.
These abilities also dominate certification and licensing testing. The under-
lying competence in any profession is much more than simply knowledge and
skills. Professions require the use of knowledge and skills and emotional ele-
ments in complex performance usually involving critical thinking, creative
thinking, or problem solving.
All these cognitive abilities are teachable and learnable. The development
of our abilities is our most important lifelong occupation. In this book, and I
hope in your process of developing tests for educational achievement, you

might consider abilities in this way. Test items are important ingredients in the
development of measures of abilities. Such tests can measure the growth of
these abilities on a developmental scale.

The Role of Knowledge in a Cognitive Ability

One of the most fundamental aspects of cognitive abilities and one that is
most recognizable to us is knowledge. Educational psychologists call this de-
clarative knowledge. As discussed in subsequent chapters, all testable knowl-
edge falls into one of these categories: facts, concepts, principles, or
procedures. The general assumption behind testing for knowledge is that it is
foundational to performing skills or more complex forms of behaviors. In the
analysis of any complex behavior, it is easy to see that we always need knowl-
edge. The most efficient way to test for knowledge is with the MC format.
Thus, MC formats have a decided advantage over CR formats for testing
knowledge. Chapter 3 discusses the rationale for this more completely and
provides much documentation and references to the extensive and growing
literature on this topic.
Chapters 5 and 6 provide many examples of MC items intended to reflect
important aspects of cognitive abilities. However, MC formats have limitations
with respect to testing cognitive abilities. Not all cognitive abilities lend them-
selves well to the MC format. Usually, the most appropriate measure of a cogni-
tive ability involves performance of a complex nature.
Knowledge is always fundamental to developing a skill or a cognitive ability.
Sometimes MC can be used to measure application of knowledge and skills in
the performance of a cognitive ability, but these uses are rare. If we task analyze
a complex task, we will likely identify knowledge and skills needed to complete
that task successfully.

The Role of Skills in a Cognitive Ability

Skills are also fundamental to a cognitive ability. A skill's nature reflects per-
formance. Skills are often thought of as singular acts. Punctuation, spelling,
capitalization, and abbreviation are writing skills. Skills are critical aspects of
complex performances, such as found in critical thinking, creative thinking,
and problem solving. The most direct way to measure a skill is through a perfor-
mance test. But there are indirect ways to measure skills using MC that corre-
late highly with the direct way. Thus, we are inclined to use the indirect way
because it saves time and gives us good information. For example, we could give
a test of spelling knowledge or observe spelling in student writing. We need to
keep in mind the fundamental differences in interpretation between the two.
But if the two scores are highly correlated, we might use the MC version be-

cause it is usually easier to obtain and provides a more reliable test score. In a
high-stakes situation in life, such as life-threatening surgery, knowledge of a
surgical procedure is not a substitute for actual surgical skill, and both knowl-
edge and skills tests are not adequate measures of surgical ability. In low-stakes
settings, we might be willing to substitute the more efficient MC test of knowl-
edge for the less efficient performance test of skill because we know the two are
highly correlated. The risk of doing this is clear: Someone may know how to
perform a skill but is unable to perform the skill.
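The claim that the two scores are highly correlated is an empirical one and should be checked. The sketch below shows the kind of check that might be made with a small set of hypothetical paired scores: an MC spelling-knowledge score and a rating of spelling in a writing sample for the same students. The data are invented for illustration only.

    # Hypothetical paired scores for eight students.
    mc_knowledge = [12, 15, 9, 18, 14, 7, 16, 11]   # MC spelling-knowledge test
    performance = [3, 4, 2, 5, 4, 2, 4, 3]          # rated spelling in a writing sample

    def pearson(x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sxx = sum((a - mx) ** 2 for a in x)
        syy = sum((b - my) ** 2 for b in y)
        return sxy / (sxx * syy) ** 0.5

    print(f"r = {pearson(mc_knowledge, performance):.2f}")   # prints r = 0.98 for these data

A high correlation supports, but does not by itself justify, substituting the MC measure for the performance measure, for the reasons just given.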

Examples of Student Learning Outcomes Suggesting the Performance of an Ability

Table 2.10 provides examples of student learning outcomes of a complex


nature that reflect a cognitive ability. As you can see with the few examples
provided in reading, writing, and mathematics, the variety is considerable.
There is no rigid, structured domain of possible tasks. The universe of possi-
ble tasks that measure each ability is seemingly infinite and without clear-
cut patterns. However, all abilities involve knowledge and skills. In highly de-
fined and narrow fields of study or competency, we have examples that can be
delimited, and thus the measurement of ability can be refined and specific. A
branch of surgery is hand surgery. A physician who specializes in surgery can
subspecialize in hand surgery, which involves the part of the human anatomy

TABLE 2.10
Student Learning of Cognitive Abilities

Reading
Analyze selections of fiction, nonfiction, and poetry.
Evaluate an instructional manual.
Compare and contrast historical and cultural perspectives of literary selections.
Writing
Create a narrative by drawing, telling, or emergent writing.
Write a personal experience narrative.
Write a report that conveys a point of view and develops a topic.
Mathematics
Predict and measure the likelihood of events and recognize that the results of an
experiment may not match predicted outcomes.
Draw inferences from charts and tables that summarize data from real-world
situations.

from the elbow to the tips of the fingers. This specialty involves tissue, bones,
and nerves. The limits of the necessary knowledge and skills can be established, and the range of prob-
lems encountered can be identified with some precision. Unfortunately, not
all abilities are so easy to define.

Summary

This chapter identifies and defines three types of student learning that are in-
terrelated and complementary: knowledge, skills, and cognitive ability. As you
can see, the defining, teaching, learning, and measuring of each is important in
many ways. However, the development of cognitive ability is viewed as the ulti-
mate purpose of education and training. Knowledge and skills play important
but supportive roles in the development of each cognitive ability. Knowledge
and skills should be viewed as enablers for performing more complex tasks that
we associate with these cognitive abilities.
3
Item Formats

OVERVIEW

One of the most fundamental steps in the design of any test is the choice of one
or more item formats to employ in a test. Because each item is intended to mea-
sure both content and a cognitive process that is called for in the test specifica-
tions, the choice of an item format has many implications and presents many
problems to the test developer.
This chapter presents a simple taxonomy of item formats that is connected
to knowledge, skills, and abilities that were featured in the previous chapter.
Claims and counterclaims have been made for and against the uses of various
formats, particularly the MC format. In the second part of this chapter, five va-
lidity arguments are presented that lead to recommendations for choosing an
item format.
A fundamental principle in the choice of an item format is that measuring
the content and the cognitive process should be your chief concern. The item
format that does the best job of representing content and the cognitive process
intended is most likely to be the best choice. However, other factors may come
into play that may cause you to chose another format. You should know what
these other factors are before you choose a particular format.

HIGH- AND LOW-INFERENCE ITEM FORMATS

What Is an Item Format?


The item format is a device for obtaining a student response. This response is
subsequently scored using a scoring rule. We have many types of item formats.
Virtually all types have the same components: (a) a question or command to


the test taker, (b) some conditions governing the response, and (c) a scoring
procedure. This chapter attempts to help you sort out differences among item
formats and select an item format that best fits your needs. As you will see,
item formats distinguish themselves in terms of their anatomical structure as
well as the kind of student learning each can measure. Each format competes
with other formats in terms of criteria that you select. As you consider choos-
ing a format, you should determine whether the outcome involves knowl-
edge, skills, or abilities. Then you can evaluate the costs and benefits of
rivaling formats.
Another dimension of concern is the consequences of using a particular
item format (Frederiksen, 1984; Shepard, 2000). The choice of a single for-
mat may inadvertently elicit a limited range of student learning that is not
necessarily desired. Ideally, a variety of formats are recommended to take full
advantage of each format's capability for measuring different content and
cognitive processes.

The Most Fundamental Distinction Among Item Formats

The most fundamental distinction among item formats is whether the underly-
ing student learning that is being measured is abstract or concrete. This differ-
ence is discussed in chapter 1 and is expanded here in the context of item
formats. Table 3.1 provides a set of learning outcomes in reading, writing, and
mathematics in the elementary and secondary school curriculum that reflect
this fundamental distinction. The learning outcomes in the left-hand column
are abstractly defined, and the learning outcomes on the right-hand column
are concretely defined.
With abstractly defined learning, because we do not have clear-cut consen-
sus on what we are measuring, we rely on logic that requires us to infer from test
taker behavior a degree of learning. We rely on the judgments of trained SMEs
to help us measure an abstract construct. This is construct-centered measure-
ment. With operationally defined student learning, we have consensus about
what is being observed. Expert judgment is not needed. The student behavior is
either correct or incorrect. Table 3.1 provides a comparison of the essential dif-
ferences between abstractly defined and concretely defined learning outcomes
in terms of item formats. We use the term high inference to designate abstractly
defined student learning.
Because most school abilities are abstractly defined, high-inference item
formats are the logical choice. Some abstractly defined skills also match well to
high-inference formats. Any skill where judgment comes into play suggests the
use of the high-inference format.
Low-inference formats seem ideally suited to knowledge and most mental
and physical skills that can be concretely observed. In chapter 1, the term oper-

TABLE 3.1
High-Inference and Low-Inference Learning Outcomes

High-Inference Learning
• Compare real-life experiences to events, characters, and conflicts in a literary selection.
• Summarize the main points.
• Analyze complex texts.
• Write a report in your own words giving a point of view.
• Write a response to a literary selection.
• Write a persuasive essay.
• Solve problems using a variety of mental computations and estimations (explaining your solution).
• Formulate predictions from a given set of data and justify predictions.

Low-Inference Learning
• Sequence a series of events from a reading selection.
• Follow a set of written directions.
• Identify root words.
• Identify facts from opinions.
• Copy 26 letters of the alphabet.
• Record observations.
• Spell correctly.
• Punctuate correctly.
• Apply rules of capitalization.
• Construct equivalent forms of whole numbers.
• Add and subtract two three-digit whole numbers.
• Identify the greatest common factor.
• Construct a Venn diagram.

In chapter 1, the term operational definition is used to designate instances in which there is general consen-


sus and you can see whether the performance was done or not done. With the
complex cognitive abilities, the development of low-inference formats may be
desirable but hard to achieve.

Ease of Item Construction. A high-inference performance item is diffi-


cult to construct. It usually consists of a set of instructions to the student with
performance conditions and one or more rating scales (rubrics) on which the
performance will be judged by one or more qualified content experts. The cre-
ation of each item is a significant effort. Although the response to a high-infer-
ence item may be brief, obtaining a useful measure of an ability usually requires
an extended performance.
A low-inference performance item is usually easy to construct and has a sim-
ple structure. We tend to use many of these items to measure low-inference
knowledge and skills.

Types of Test Administration. Both high-inference and low-inference


test items can be group or individually administered. Group administration is
generally efficient, whereas individual administration is time consuming and
hence costly. MC is a low-inference format that is group administered. A writ-

ing assessment may be group administered, but it is a more extended task that
takes more student time. Some low-inference skills have to be observed indi-
vidually, and this too takes much time.

Cost of Scoring. The cost of scoring high-inference performance is high


because it requires one or two trained content experts. The cost of scoring
low-inference outcomes is usually less because the performance does not have
to be judged, merely observed. Observers don't have to be content experts, but
it helps sometimes. The use of optical scanners or scoring templates makes the
low-inference MC format an efficient, low-cost choice.

Type of Scoring. High-inference testing requires subjective scoring,


which involves the use of trained judges who are SMEs. Low-inference testing
uses objective scoring, which is usually dichotomous: right-wrong, yes-no, or
present-absent. Scorers do not have to be trained. In fact, objective scoring can
often be automated.

Rater Effects. With subjective scoring associated with high-inference


testing, we have two important threats to validity. Rater effects is one of these
threats. There are many types of rater effects, including rater severity, halo or
logical errors, and the error of central tendency (Engelhard, 2002). Another
threat is rater inconsistency. With the rating of abstract things, raters tend to
disagree more than we would like.

Reliability. Reliability depends on many factors. Subjective scoring tends


to detract from reliability because judges tend to rate less consistently than we
would like. With low-inference item formats, the objective scoring eliminates
rater effects and inconsistency. Moreover, with low-inference formats the num-
ber of scorable units and the variation of each unit can be sufficient to ensure
high reliability. Achieving high reliability with high-inference item formats is
challenging.
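One way to see why the number of scorable units matters is the Spearman-Brown relation, a standard psychometric result offered here only as an illustration (the symbols r and k are generic and not taken from this chapter): if a single scorable unit has reliability r, then a score built from k comparable units has reliability

\[
r_{kk} \;=\; \frac{k\,r}{1 + (k - 1)\,r}.
\]

For example, at r = .20 per unit, a 40-item MC test reaches a reliability of approximately .91, whereas a high-inference task scored with only a handful of ratings cannot gain nearly as much from aggregation.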

Summary. Table 3.2 summarizes these comparisons of high-inference and


low-inference formats. Although the two formats can be compared, it is seldom
true that you have a choice between these two types of formats. The choice of
any format should be dictated by the content and cognitive process desired.
However, if you do have a choice, the low-inference format is usually preferable
to the high-inference format.

Anatomy of a High-Inference Format

A high-inference item format has three components:



TABLE 3.2
Attributes of High-Inference and Low-Inference Item Formats

Construct measured
High-inference formats: Complex abilities, skills of an abstract nature.
Low-inference formats: Knowledge, cognitive, and psychomotor skills of a concrete nature; abilities, but only in a very limited way.

Ease of item construction
High-inference formats: Design of the item is usually complex, involving a command or question, conditions for performance, and a set of subjectively scorable descriptive rating scales.
Low-inference formats: Design of items is not as complex as high inference, involving a command or question, conditions for performance, and a simple, objective scoring rule.

Type of administration
High-inference formats: Can be group or individually administered.
Low-inference formats: Can be group or individually administered.

Cost of scoring
High-inference formats: Can be very expensive to score because trained subject matter experts must evaluate student work using descriptive rating scales.
Low-inference formats: Usually is not expensive to score because scoring can be done by machine, a scoring template, or an untrained observer.

Type of scoring
High-inference formats: Subjective.
Low-inference formats: Objective.

Rater effects
High-inference formats: Rater effects is a threat to validity.
Low-inference formats: Rater effects is not a threat to validity.

Reliability
High-inference formats: Reliability is a problem with this kind of testing because of rater effects, rater inconsistency, and lack of variation in scores.
Low-inference formats: Results can be very reliable because units of observation are numerous and errors due to rater effects and inconsistency are small or nonexistent.

1. A command, question, or set of instructions to the test taker that summarizes the nature of the task to be completed.
2. A set of performance conditions, including time limits, scope of response, mode of presentation, schedule for completion, decision of whether consulting or collaboration is to be allowed, and decision of whether the work can be revised later. This set of conditions is usually detailed.
3. The scoring entails a single holistic rating scale or a set of complementary analytic trait rating scales. Another name for rating scale is scoring guide or rubric. Trained judges should be proficient in the use of these rating scales.

The scope of high-inference items is usually extensive. Because the object of


a high-inference item is the ability itself and not some isolated knowledge or a
single skill, the item usually directs the student to perform extensively, as the
learning outcomes in Table 3.1 suggest.

Anatomy of Low-Inference Formats

The low-inference format simply involves observation because there is some


behavior or answer in mind that is either present or absent. Writing conven-
tions can be measured by noting misspelled words, capitalization and punctua-
tion errors, and poor grammar. These writing skills are directly observable in
student writing. In mathematics, most skills can be directly observed. With
low-inference measurement, we have some variety in formats, as follows:

1. Simple observation. We can observe whether knowledge is possessed or


not possessed and whether a skill is performed or not performed in a simple
observation. This observation can be scored 1 for successful performance
and 0 for no performance, or as correct or incorrect.
2. Simple observation with a measuring instrument. We can also observe
whether knowledge is possessed or not possessed and whether a skill is per-
formed or not performed in a simple observation that involves a measuring
instrument. Any timed test provides a good example of this type of item. In
some circumstances, the outcome may be weighed or its volume may be cal-
culated. The focus in this type of low-inference measurement is the use of a
measuring instrument such as a timing instrument, ruler, scale, or some
other measuring device.
3. Checklist. We can observe the performance of a process or characteris-
tics of a product using a series of coordinated or connected observations that
much resemble simple observation. The key feature of the checklist is that
the series of simple observations are correlated and the evaluation of perfor-
mance or of the product is based on all items on the checklist.
4. MC. With the MC format we usually measure knowledge or a cogni-
tive skill. Scoring is objective. The inference we make from a test score is
usually to a domain of knowledge, skills, or both.
5. Essay. With the essay item intended to measure knowledge or a cogni-
tive skill, scoring can be objective. The student provides the right answer
and it is scored right or wrong, just like the MC item. In most instances, one
of the preceding low-inference techniques is used to score an essay. For in-
stance, a checklist can be used to determine whether certain features were
given in the essay answer. Another type of essay item is actually a high-infer-
ence format. This is the instance when the type of learning is abstract and
requires the use of judges.

Conclusion

All item formats can be classified according to whether the underlying objec-
tive of measurement is abstractly or operationally defined. The type of infer-
ence is the key. Measuring cognitive abilities usually requires a high-inference
item format. Most knowledge and mental and physical skills can be observed
using a low-inference item format. Some skills require subjective evaluation by
trained judges and thereby fall into the high-inference category. For instance, a
basketball coach may evaluate a player's free throw shooting technique. The
coach's experience can be used as a basis for judging the shooting form. This is
high-inference observation. However, the actual shooting percentage of the
player is a low-inference observation.

EVALUATING THE ITEM FORMAT ISSUE USING


VALIDITY AS A BASIS

The first part of this chapter focuses on differences between high-inference


and low-inference item formats. For measuring abstractly defined abilities,
high-inference formats seem useful. For measuring concretely defined knowl-
edge and many mental and physical skills, low-inference formats seem suit-
able. Because the MC format was introduced in the early part of the 20th
century, an active, ongoing debate has ensued to the present about the choice
of item formats (Eurich, 1931; Godshalk, Swineford, & Coffman, 1966;
Hurd, 1932; O'Dell, 1928; Patterson, 1926; Ruch, 1929; Tiegs, 1931; Traub
& Fisher, 1977). Fortunately, the issue about the choice of item formats has
been an active field of study and there is much to report to help us gain a
better understanding of this issue. Many new perspectives have emerged that
enrich the debate and provide guidance.
Traub (1993) provided an appraisal of research and surrounding issues. He
identified flaws in earlier research that made these studies less useful. He also
pointed to methods of study that would overcome the flaws of earlier studies
and help in the next generation of studies. His brief review of nine exemplary
studies on the item format issue was inconclusive, leading him to argue that a
better approach to the study of this problem is a theory of format effects. This
emerging theory suggests that the choice of format influences the measure-
ment of the construct of interest.
Snow (1993) considered the problem of item format differences not from a
purely psychometric perspective but from a psychological perspective that in-
cludes cognitive processing demands on the examinee. Snow also stated that
the study of performance on contrasting item formats should include noncog-
nitive aspects as well. This psychological perspective is often missing from
studies of item format differences. Snow suggested a multifaceted approach
that includes a variety of conditions and a set of working hypotheses to be

tested. Of the eight offered, three are noncognitive (attitudes, anxiety, and
motivation) and only the eighth is psychometric in nature. Later in this chap-
ter, a section is devoted to this perspective, drawing from a research review by
Martinez (1998).
Bennett (1993), like Snow (1993) and many others, believed that the adop-
tion of the unified approach to validity has salience for the study of this prob-
lem. Bennett emphasized values and consequences of test score interpretations
and use. We have seldom applied these criteria in past studies of item format
differences.
In summary, the study of item format differences has continued over most of
this century. The earliest studies were focused on format differences using sim-
ple correlation methods to study equivalence of content measured via each for-
mat. As cognitive psychology evolved, our notion of validity sharpened to
consider context, values, consequences, and the noncognitive aspects of test
behavior. Improvements in methodology and the coming of the computer
made research more sophisticated. Nevertheless, the basic issue seems to have
remained the same. For a domain of knowledge and skills or for a cognitive abil-
ity, which format should we use? This section tries to capture the main issues of
this debate and provide some focus and direction for choosing item formats.
Each of the next six sections of this chapter draws from recent essays and re-
search that shed light on the viability of MC formats in various validity con-
texts. The term argument is used here because in validation we assert a principle
based on a plausible argument and collect evidence to build our case that a spe-
cific test score use or interpretation is valid. The validity of using the MC item
format is examined in six contexts, building an argument that results in a sup-
portable conclusion about the role of MC formats in a test for a specific test
interpretation and use.

Validity Argument 1: Prediction

Generally, student grades in college or graduate school are predicted from ear-
lier achievement indicators such as previous grades or test scores. The
well-known ACT (American College Test) and SAT I (Scholastic Assessment
Test I) are given to millions of high school students as part of the ritual for col-
lege admissions, and the Graduate Record Examination is widely administered
to add information to assist in graduate school admission decisions. The pre-
dictive argument is the simplest to conceptualize. We have a criterion (desig-
nated Y) and predictors (designated as Xs). The extent to which a single X or a
set of Xs correlates with Y determines the predictive validity coefficient. Unlike
other validity arguments, prediction is the most objective. If one item format
leads to test scores that provide better prediction, we find the answer to the
question of which item format is preferable.
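To make the notation concrete (this is a generic statistical sketch, not taken from the studies discussed in this section), the predictive validity coefficient for a single predictor is the correlation between predictor and criterion scores, and for several predictors it is the multiple correlation between Y and the least-squares combination of the Xs:

\[
r_{XY} \;=\; \frac{\operatorname{cov}(X, Y)}{\sigma_X\,\sigma_Y},
\qquad
R_{Y \cdot X_1 \cdots X_k} \;=\; \operatorname{corr}\!\bigl(Y,\; b_0 + b_1 X_1 + \cdots + b_k X_k\bigr).
\]

Under this argument, comparing item formats amounts to asking which format yields predictor scores that raise these coefficients.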

Downing and Norcini (1998) reviewed studies involving the predictive va-
lidity coefficients of CR and MC items for various criteria. Instead of using an
exhaustive approach, they selected research that exemplified this kind of re-
search. All studies reviewed favor MC over CR, except one in which the CR
test consisted of high-fidelity simulations of clinical problem solving in medi-
cine. The authors concluded that adding CR measures do little or nothing for
improving prediction, even when a CR criterion resembles the CR predictor.
These authors concluded that although there may be many good reasons for us-
ing CR items in testing, there is no good reason to use CR items in situations
where prediction of a criterion is desired.
The challenge to researchers and test developers is to identify or develop
new item formats that tap important dimensions of student learning that in-
crease predictive coefficients. Whether these formats are MC or CR seems ir-
relevant to the need to increase prediction. The choice of formats to improve
predictive coefficients can only be answered empirically.

Validity Argument 2: Content Equivalence

This validity argument concerns the interpretability of test scores when either
CR or MC formats are used. In other words, a certain interpretation is desired
based on some definition of a construct, such as writing or reading comprehen-
sion. The interpretation may involve a domain of knowledge and skills or a do-
main of ill-structured performance tasks that represent a developing, cognitive
ability, such as writing. This section draws mainly from a comprehensive, inte-
grative review and meta-analysis by Rodriguez (2002) on this problem. Simply
stated, the issue is:

If a body of knowledge, set of skills, or a cognitive ability is being measured, does


it matter if we use a CR or MC format?

If it does not matter, the MC format is desirable because it has many advan-
tages. Some of these are efficient administration, objective scoring, automated
scoring, and higher reliability. With knowledge and skills, MC items usually give
more content coverage of a body of knowledge or a range of cognitive skills, when
compared with short-answer items, essays, or other types of CR items.
The answer to the question about when to use MC is complicated by the fact
expressed in Table 3.3 that a variety of CR item formats exist. Martinez (1998)
pointed out that CR formats probably elicit a greater range of cognitive behav-
iors than the MC, an assertion that most of us would not challenge. Rodriguez
mentioned another complicating factor, which is that the MC and CR scales
underlying an assumed common construct may be curvilinearly related be-
cause of difficulty and reliability differences. The nature of differences in CR

and MC scores of the assumed same construct is not easy to ascertain. Marti-
nez provided a useful analysis of the cognition underlying test performance and
its implications for valid interpretations. Rodriguez's meta analysis addressed
many methodological issues in studying this problem intended for future stud-
ies of item format differences. Among the issues facing test designers is the ad-
vice given by testing specialists and cognitive psychologists, cost consider-
ations, and the politics of item format selection, which sometimes runs con-
trary to these other factors.
Dimensionality is a major issue with these studies. Whereas Martinez
(1998) warned us not to be seduced by strictly psychometric evidence, studies
reviewed by Thissen, Wainer, and Wang (1994) and Lukhele, Thissen, and
Wainer (1993) provided convincing evidence that in many circumstances CR
and MC items lead to virtually identical interpretations because unidimen-
sional findings follow from factor analysis. Bennett, Rock, and Wang (1990) con-
cluded that "the evidence presented offers little support for the stereotype of
MC and free-response formats as measuring substantially different constructs
(i.e., trivial factual recognition vs. higher-order processes)" (p. 89). Some of
Martinez's earlier studies (e.g., Martinez, 1990, 1993) offered evidence that
different formats may yield different types of student learning. However, when
content is intended to be similar, MC and CR item scores tend to be highly re-
lated, as Rodriguez's review shows.
A final point was made by Wainer and Thissen (1993) in their review of this
problem, from their study of advanced placement tests where CR and MC
items were used: Measuring a construct not as accurately but more reliably is
much better than measuring the construct more accurately but less reliably. In
other words, an MC test might serve as a reliable proxy for the fundamentally
better but less reliable CR test. Their advice applies to the third finding by Ro-
driguez (2002) in Table 3.3, where content is not equivalent, but MC may be a
better choice simply because it approximates the higher fidelity CR that may
have a lower reliability.

TABLE 3.3
General Findings About Multiple-Choice (MC) and Constructed-Response (CR)
Item Formats in Construct Equivalence Settings

Type of Test Design General Findings


Stem-equivalent MC and CR Very high correlations
Content-equivalent MC and CR Very high correlations, slightly below
stem-equivalent findings
Not content-equivalent MC and CR High correlations, but distinctly below
content-equivalent MC and CR
Essay-type items and MC Moderate correlations

Several conclusions seem justifiable:

• If a construct is known to be knowledge based, the use of both a CR or


MC format will result in highly correlated scores. In these circumstances,
the MC format is superior.
• If a construct is known to be skill based, CR items have a greater fidel-
ity. However, MC items might serve better because they correlate highly
with the truer fidelity measure and have greater efficiency. With physical
skills, MC does not seem plausible.
• If a construct is a cognitive ability, such as writing, CR items of a more com-
plex nature seem appropriate. MC items typically lack the kind of judged fi-
delity that a complex CR item has. However, the item set MC format comes
closest to modeling aspects of some of these abilities. These formats appear in
the next two chapters.

Validity Argument 3: Proximity to Criterion

Earlier, it was stated that the issue between CR and MC resides with knowledge
and skills. The conclusion was that when the object of measurement is a
well-defined domain of knowledge and skills, the conclusion is inescapably
MC. The alternative, the essay format, has too many shortcomings.
This third validity argument examines the issue of the viability of MC for
measuring an ability. Mislevy (1996a) characterized a criterion as:

Any assessment task stimulates a unique constellation of knowledge, skills,


strategies, and motivation within each examinee. (p. 392)

If we can define this constellation of complex tasks, the complexity of any crite-
rion challenges us to design test items that tap the essence of the criterion. At
the same time, we need some efficiency and we need to ensure high reliability of
the test scores. To facilitate the study of the problem of criterion measurement,
two ideas are introduced and defined: fidelity and proximity.

Fidelity. Fidelity is concerned with the logical, judged relationship be-


tween a criterion measure and the criterion. The judges are experts in the
content being measured. Given that a criterion is unobtainable, some mea-
sures have more in common with the criterion than others. We can con-
struct a hypothetical continuum of fidelity for a set of measures of any
ability. In doing so, we can argue that some tests have greater fidelity to a hy-
pothetical construct than others. The continuum begins with the actual cri-
terion as an abstraction and then a series of measures that have varying
fidelity to the criterion. Tests of highest fidelity come closest to the criterion

for cognitive and affective characteristics believed to be defined in that


fluid ability.
Writing prompts are used to infer the extent to which a student has the
fluid ability of writing. Breland and Gaynor (1979) stated that the first formal
writing assessment program was started by the College Board in 1901. How-
ever, it was later that we experimented with MC measures of knowledge of
writing skills. A common reference to writing elicited from prompts is direct
assessment, whereas MC items used to measure knowledge of writing skills are
referred to as indirect assessment. In this book, direct assessments are viewed
as having high fidelity for measuring writing. However, writing prompts are
contrived experiences used to elicit writing samples. We should not argue
that the use of prompts elicits criterion measures of writing, because real writ-
ing is natural and not elicited by the types of prompts seen in typical writing
assessments. In fact, there is some evidence that the type of prompt affects
the measurement of the ability (Wainer & Thissen, 1994). In some assess-
ment programs, choices of prompts are offered to give students a better
chance of showing their writing ability. Knowledge of writing and writing
skills provides a foundation for the ability of writing. But it would be difficult
to make the logical argument that an MC test of writing knowledge is a
high-fidelity measure of writing ability. However, an issue of practical impor-
tance that arises for test policymakers and test designers is the fidelity that ex-
ists between different types of measures and their ultimate criterion measure.
Fidelity can be addressed through analysis of the cognitive processes involved
in criterion behavior. Recent efforts at the Educational Testing Service have
shed light on the process of cognitive task analysis with a measurement sys-
tem that attempts to tap criterion behavior (Mislevy, 1996a). Other methods
rely on judgment of surface features of each test to the mythical criterion. Ta-
ble 3.4 presents a continuum of fidelity for medical competence.

TABLE 3.4
A Continuum of Indirect Measures of a Criterion for Medical Competence

Fidelity to Criterion Criterion: Medical Competence


Very high fidelity Supervised and evaluated patient treatment
High fidelity Standardized patient
Moderate fidelity Patient management problem
Lower fidelity MC context-dependent item set based on a patient scenario
Lowest fidelity MC tests of knowledge that are thought to be part of the
competence needed to treat patients safely

Note. MC = multiple choice.



Supervised patient practice has high fidelity to actual patient practice but
falls short of being exactly like actual practice. As noted previously in this
chapter and by many others (Linn, Baker, & Dunbar, 1991), this high-fidelity
measure may suffer from many technical and logistical limitations. Such mea-
surement can be incredibly expensive and rest almost totally on the expertise
of trained judges. An alternative to live patient examination is a standardized
patient, where there is an actor who is trained to play the role of a patient with
a prespecified disorder, condition, or illness. The cognitive aspects of patient
treatment are simulated, but actual patient treatment is not done. Scoring of
such complex behavior is only experimental and is in development at this
time. Thus, this is not yet a viable testing format. An alternative with less fi-
delity is the patient management problem (PMP). These paper-and-pencil
problems have been computerized, but success with these has been disap-
pointing, and active projects promoting their use have all but disappeared.
Scenario-based MC item sets are popular (Haladyna, 1992a). Although item
sets provide less fidelity than other MC formats just described, they have ma-
jor advantages. Scoring can be simple and highly efficient. But some problems
exist with item sets that warrant caution. Namely, responses are locally de-
pendent. Thus, the coefficient of reliability for a test containing item sets is
likely to be inflated. The attractive aspect of this item format is efficiency
over the other higher fidelity options. The testing approach that has the least
fidelity involves conventional MC items that reflect knowledge related to the
definition of competence. The test specifications may require recall or under-
standing of knowledge. Candidates must choose an answer from a list, and
usually the choice reflects nothing more than knowledge in the profession.
This option has the lowest fidelity although currently it dominates certifica-
tion and licensing testing.

Proximity. Proximity is simply a measure of the relation among measures


of varying fidelity. With the consideration of proximity, we have to establish
that two item formats are measures of the same construct but may differ in
terms of judged fidelity. These correlations representing proximity are flawed
by the fact that their reliabilities attenuate our estimation of the true relation.
Disattenuated correlations answer the question: Do two measures tap the
same abstractly defined construct? The amount of common variance pro-
vides an estimate of proximity of two measures to one another. Proximity does
not replace content analysis or cognitive task analysis where the constituent
knowledge, skills, and other abilities required in criterion performance are
identified. The implication with proximity is that when two measures of a cri-
terion have good proximity, the more efficient measure may be a reasonable
choice. But when two measures of varying fidelity have low proximity, the one
with higher fidelity may be the most justifiable. Perkhounkova (2002) pro-
vided a good example of this in a study in which she examined the dimen-

sionality of various item formats that putatively measured writing skills. She
concluded that MC item formats that measure writing skills were effective.
These formats included select the correction, find and correct the error, and
find the error. These item formats are illustrated in chapter 6. This is the ben-
efit of using an effective but lower fidelity item format in place of a higher fi-
delity item format for measuring the same thing: writing skills.
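For reference, the disattenuated correlation mentioned above is the classical correction for attenuation; the formula is standard psychometric background rather than part of Perkhounkova's analysis:

\[
\hat{\rho}_{T_X T_Y} \;=\; \frac{r_{XY}}{\sqrt{r_{XX'}\,r_{YY'}}},
\]

where r_XY is the observed correlation between the two measures and r_XX' and r_YY' are their reliability coefficients. A disattenuated value near 1.0 indicates high proximity, that is, the two formats appear to tap the same construct once measurement error is set aside; a value well below 1.0 suggests they measure somewhat different things despite surface similarity.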
Haladyna (1998) reported the results of a review of studies of criterion mea-
surement involving CR and MC items. Conclusions from that study are pre-
sented in Table 3.5.
The arguments presented thus far are psychometric in nature and argue that
higher fidelity testing is desirable but sometimes proximate measures, such as
the indirect MC test of writing skills, may be used. However, Heck and Crislip
(2001) argued that lack of attention to issues of equity may undermine the ben-
efits of using higher fidelity measures. They provided a comprehensive study of
direct and less direct measures of writing in a single state, Hawaii. They con-

TABLE 3.5
Conclusions About Criterion Measurement

Knowledge: Most MC formats provide the same information as essay, short answer, or completion formats. Given the obvious benefits of MC, use MC formats.

Critical thinking ability: MC formats involving vignettes or scenarios (item sets) provide a good basis for many forms of critical thinking. In many respects this MC format has good fidelity to the more realistic open-ended behavior elicited by some CR formats.

Problem-solving ability: MC testlets provide a good basis for testing problem solving. However, research is lacking on the benefits or deficits of using MC problem-solving item sets.

Creative thinking ability: It is hard to imagine a MC format for this. Many have spoken of this limitation (see Martinez, 1998).

School abilities (e.g., writing, reading, mathematics): CR formats have the highest fidelity to criterion for these school abilities. MC is good for measuring knowledge and some mental skills.

Professional abilities (e.g., in professions such as physician or teacher): Certain high-fidelity CR items seem best suited for this purpose. Some MC formats can tap more basic aspects of these abilities such as knowledge and the elements of professional practice including critical thinking and problem solving.

Note. MC = multiple choice; CR = constructed response.



cluded that the higher fidelity writing assessment not only assessed a more di-
verse range of cognitive behavior but was less susceptible to external influences
that contaminate test score interpretations. Also, direct writing assessments
are more in line with schools' attempt to reform curriculum and teach direct
writing instead of emphasizing writing skills that appear on standardized MC
achievement tests. This research makes a strong statement in favor of CR test-
ing for measuring writing ability.

Validity Argument 4: Gender and Item Format Bias

Differences in performance between boys and girls have often been noted in
reading, writing, and mathematics. Are these differences real or the by-product
of a particular item format? Does item format introduce construct-irrelevant
variance into test scores, thereby distorting our interpretation of achievement?
Part of the argument against MC has been a body of research pointing to
possible interaction of gender with item formats. Ryan and DeMark (2002) re-
cently integrated and evaluated this research, and this section draws princi-
pally from their observations and conclusions, as well as from other excellent
studies (Beller & Gafni, 2000; DeMars, 1998; Garner & Engelhard, 2001;
Hamilton, 1999; Wightman, 1998). Ryan and DeMark approached the prob-
lem using meta-analysis of 14 studies and 178 effects. They reached the follow-
ing conclusion:

Females generally perform better than males on the language measures, regard-
less of assessment format; and males generally perform better than females on
the mathematics measures, also regardless of format. All of the differences, how-
ever, are quite small in an absolute sense. These results suggest that there is little
or no format effect and no format-by-subject interaction, (p. 14)

Thus, their results speak clearly about the existence of small differences be-
tween boys and girls that may be real and not a function of item formats. Ryan
and DeMark (2002) offered a validity framework for future studies of item for-
mat that should be useful in parsing the results of past and future studies on CR
and MC item formats. Table 3.6 captures four categories of research that they
believed can be used to classify all research of this type.
The first category is justified for abilities where the use of CR formats is obvi-
ous. In writing, for example, the use of MC to measure writing ability seems
nonsensical, even though MC test scores might predict writing ability perfor-
mance. The argument we use here to justify CR is fidelity to criteria.
The second category is a subtle one, where writing ability is interwoven with
the ability being measured. This situation may be widespread and include many
fields and disciplines where writing is used to advance arguments, state propo-

TABLE 3.6
A Taxonomy of Types of Research on Gender-by-Item Format

Criterion-related CR: A CR format is intended for measuring something that is appropriate, that is, high fidelity, such as a writing prompt for writing ability.

Verbal ability is part of the ability being measured: In these CR tests, verbal ability is required in performance and is considered vital to the ability being measured. An example is advanced placement history, where students read a historical document and write about it.

Verbal ability is correlated to the construct but not part of it: CR tests of knowledge might call for recall or recognition of facts, concepts, principles, or procedures, and writing ability might influence this measurement. This is to be avoided.

Verbal ability is uncorrelated to the construct being measured: In many types of test performance in mathematics and in science, verbal ability may not play an important role in CR test performance.

Note. CR = constructed response.

sitions, review or critique issues or performances, or develop plans for solutions


to problems. This second category supports CR testing in a complex way that
depends on verbal expression. Critical thinking ability may be another ability re-
quired in this performance. Thus, the performance item format is multidimen-
sional in nature.
The third category is a source of bias in testing. This category argues that
verbal ability should not get in the way of measuring something else. One area
of the school curriculum that seems likely to fall into this trap is the measure-
ment of mathematics ability where CR items are used that rely on verbal ability.
This verbal ability tends to bias results. Constructs falling into this third cate-
gory seem to favor using MC formats, whereas constructs falling into the first or
second categories seem to favor CR formats.
The fourth category includes no reliance on verbal ability. In this instance,
the result may be so objectively oriented that a simple low-inference perfor-
mance test with a right and wrong answer may suffice. In these circumstances,
MC makes a good proxy for CR because MC is easily scorable.
A study of advanced placement history by Breland, Danos, Kahn, Kubota,
and Bonner (1994) supported the important findings of the Ryan and De-
Mark (2002) review. They found gender differences in MC and CR scores of

men and women but attributed the higher scoring by men to more knowledge
of history, whereas the scores for men and women on CR were about the same.
Much attention in this study was drawn to potential biases in scoring CR writ-
ing. Modern high-quality research such as this study reveals a deeper under-
standing of the problem and the types of inferences drawn from test data
involving gender differences. In another recent study, Wightman (1998) ex-
amined the consequential aspects of differences in test scores. She found no
bias due to format effects on a law school admission test. A study by DeMars
(1998) of students in a statewide assessment revealed little difference in per-
formance despite format type. Although Format X Gender interactions were
statistically significant, the practical significance of the differences was small.
DeMars also presented evidence suggesting the MC and CR items measured
the same or nearly the same constructs. Beller and Gafni (2000) approached
this problem using the International Assessment of Educational Progress in-
volving students from several countries. Gender X Format interactions in
two assessments (1988 and 1991) appeared to have reversed gender effects. On
closer analysis, they discovered that the difficulty of the CR items interacted
with gender to produce differential results. Garner and Engelhard
(2001) also found an interaction between format and gender in mathematics
for some items, pointing out the importance of validity studies of DIF, a topic
further discussed in chapter 10. Hamilton (1999) found one CR item that dis-
played DIF. She found that gender differences were accentuated for items re-
quiring visualization and knowledge acquired outside of school. This
research and the research review by Ryan and DeMark (2002) do not put to
rest the suspicion about the influence of item format on performances by gen-
der. But if effects do exist, they seem to be small. Research should continue to
uncover sources of bias, if they exist. The most important outcome of their
study is the evolution of the taxonomy of types of studies. As stated repeat-
edly in this chapter, knowing more about the construct being measured has
everything to do with choosing the correct item format. Gallagher, Levin, and
Cahalan (2002) in their study of gender differences on a graduate admissions
test concluded that performance seems to be based on such features of test
items as problem setting, multiple pathways to getting a correct answer, and
spatially based shortcuts to the solution. Their experimentation with features
of item formats leads the way in designing items that accommodate gender differ-
ences that may reflect construct-irrelevant factors, which need to be re-
moved during test item design.
As you can see from this recent review of research, the gender-by-format is-
sue is by no means resolved. It remains a viable area for future research, but one
that will require more sophisticated methods and a better understanding of
cognitive processes involved in selecting answers. Perhaps a more important
connection for the gender-by-format validity argument is posed in the next sec-
tion on cognitive demand.

Validity Argument 5: Cognitive Demand

As noted in the history of study of item formats, a recent, emerging interest is in


the mental state of examinees when engaged in a test item. Do CR and MC
items elicit different mental behaviors? At the lowest level, is recall really dif-
ferent from recognition? With higher level behaviors, does format really make a
difference in interpretation, or can we feel comfortable with the more efficient
MC for measuring various types of higher level thinking?
Martinez's studies (1990, 1993; Martinez & Katz, 1996) and his recent re-
view (Martinez, 1998) provided greater understanding about the nature and
role of cognition in test performance. To be sure, other studies are contributing
to this growing understanding. Martinez offered 10 propositions that seem wor-
thy of review. Paraphrases of his propositions are provided in italics with com-
mentary following.

1. Considerable variety exists among CR formats in terms of the kinds of be-


havior elicited. This is certainly true. Consider for example, the range of CR
item formats that measure knowledge, skills, and abilities. We can use CR
formats for virtually any kind of student learning.
2. MC items elicit lower levels of cognitive behavior. Two studies are cited
showing a tendency for MC to elicit recognition and similar forms of lower
level behaviors, but this criticism has been aimed at item writers not the
test format. MC is certainly capable of better things. For example,
Hibbison (1991) interviewed five first-year composition students after
they completed his 40-item MC test. To his surprise, he detected 27 types
of inferences that he attributed to metacognitive, cognitive, and affective
interactions. He qualified his findings by stating that these items were in-
tended to tap complex understanding of passages and were not aimed at
low-level learning. Another factor mitigating this perception that MC for-
mats have a tendency to measure low-level learning is that MC items elicit-
ing complex behavior are difficult to write. However, this is not the fault of
the MC format but of item writers in general. With adequate training and
practice, item writers can successfully write MC items with high cognitive
demand, as Hibbison has shown. For the most part, most tests suffer from
the malady of testing recall or recognition. This is not a function of item
format but limited ability to elicit higher levels of thinking in both teach-
ing and testing. Few would argue with the idea that the range of CR item
formats for testing higher levels of cognition is greater than the range of
MC formats.
3. MC can elicit complex behavior, but the range of complex behavior elicited
by CR item formats is greater. Although MC seems well suited to testing
knowledge and some mental skills, CR can be applied to the measurement
of knowledge, skills, and abilities, regardless of whether each is abstractly
ITEM FORMATS 59

or concretely defined. Two studies have addressed the cognitive complex-


ity of MC and CR item formats that have similar results. Skakun, Maguire,
and Cook (1994) used think-aloud procedures for 33 medical school stu-
dents who were given conventional MC items that covered a variety of
medical practices. They listed five ways in which students vary in how they
read a MC item. They also listed 16 response-elimination strategies.
Finally, they listed four distinctly different problem-solving activities that
students used to respond to items. Farr, Pritchard, and Smitten (1990) ex-
perimented with 26 college students using a reading comprehension test
and planned probes to obtain verbal reports of their thinking processes.
Four distinctly different strategies were identified for answering these con-
text-dependent passages that corresponded to Skakun et al.'s findings
with medical students. The most popular of these strategies was to read the
passage, then read each question, then search for the answer in the pas-
sage. Without any doubt, all test takers manifested question-answering be-
haviors. In other words, they were focused on answering questions, as
opposed to reading the passage for surface or deep meaning. They con-
cluded that the development of items (tasks) actually determines the types
of cognitive behaviors being elicited. These studies and others show that
the range of complex cognitive behaviors for MC and CR item formats is
considerable, and the design of the item seems to control the cognitive
complexity instead of the type of item format, MC or CR.
4. CR and MC items may or may not have similar psychometric properties,
depending on the conditions evoked. This issue is the object of the review by
Rodriguez (2002) and is the most complex problem involving item for-
mats. Martinez (1998) argued that the development of options in MC test-
ing relates to the cognitive demands on examinees. The problem stated in
the item stem also has an important role. Limiting evaluation to
psychometric criteria, Martinez believed, is a mistake. A theoretical analy-
sis should precede the choice of an item format; van den Bergh (1990)
agreed with this point. In his interesting study, he argued from his testing of
the reading comprehension of third graders that format made little differ-
ence in test score interpretation. His theoretical orientation provided a
stronger rationale for findings than prior studies have done. Daneman and
Hannon (2001) examined the validity of reading comprehension tests in
terms of cognitive processing demands and found that reading the passage
changes the cognitive process in contrast to not reading the passage and
attempting to answer items. They concluded that MC reading compre-
hension measures can succeed if the MC items are truly dependent on
reading the passage and not highly susceptible to prior knowledge. The op-
posite conclusion comes from the extensive research by Katz and
Lautenschlager (1999) into the validity of MC reading comprehension
tests. This research led them to conclude that much of the variation in per-

formance at the item level may be attributed to test-taking skills and stu-
dents' prior knowledge. Campbell (2000) devised an experiment and
think-aloud procedures to study the cognitive demands of stem-equiva-
lent reading comprehension items. He found differences favoring the CR
format over the MC format but saw a need for both formats to be used in a
reading comprehension test. Skakun and Maguire (2000) used
think-aloud procedures with medical school students in an effort to un-
cover the cognitive processes arising from the use of different formats.
They found that students used MC options as provisional explanations
that they sought to prove or disprove. With CR items, no such provisional
explanations were available and students had to generate their own provi-
sional explanations. However, they also found that the cognitive process-
ing was more complex than simply being a function of the format used.
With items of varying quality, whether MC or CR, the cognitive demand
varied. When items are well written and demand declarative knowledge, it
does not matter which format is used. When items are poorly written, MC
and CR may have different cognitive demands. Katz, Bennett, and Berger
(2000) studied the premise that the cognitive demands of stem-equivalent
MC and CR items might differ for a set of 10 mathematics items from the
SAT. As with other studies like this one, students were asked to think
aloud about their solution strategies. These authors concluded:
The psychometric literature claims that solution strategy mediates effects of
format on difficulty. The results of the current study counter this view: on some
items, format affected difficulty but not strategy; other items showed the reverse
effect. (Katz et al., 2000, p. 53)
Katz et al. also concluded that reading comprehension mediates format
effects for both strategy and difficulty. Thus, reading comprehension may be
a stronger overarching influence on item performance than item format. In
chapter 1, the discussion centered on construct definition and the need to
theorize about the relation of test behavior to the abstract construct defini-
tion. Indeed, if we can give more thought to the nature of cognition in test-
ing, test items might improve.
5. Response-elimination strategies may contribute to construct-irrelevant
variation in MC testing. This criticism is aimed at faults in the item-writing
process. Good item writers follow guidelines, such as developed by
Haladyna, Downing, and Rodriguez (2002). Most formal testing programs
do not have flaws in items that allow examinees the opportunity to eliminate
options and increase the chances of guessing the right answer. Ironically, re-
cent research points to the fact that most MC items have only two or three
working options (Haladyna & Downing, 1993). Kazemi (2002) reported re-
search involving 90 fourth graders who were interviewed after responding to
MC items in mathematics. These students tended to evaluate the choices

instead of engaging in problem solving. Thus, the MC item presented many


students with a uniquely different thought process than would be encoun-
tered with an open-ended item with a similar stem. Thus, there is a possibil-
ity that students will evaluate each option and try to eliminate those they
find are implausible or simply wrong instead of engaging in a linear prob-
lem-solving procedure that might be expected. The study by Skakun et al.
(1994) clearly showed that medical students engaged in 16 distinctly differ-
ent response elimination strategies. A perspective to response elimination
was offered by Coombs (1953), who argued that response elimination can be
attributed to a student's partial knowledge. By eliminating implausible dis-
tractors, the probability of guessing among the more plausible, remaining
options is higher, as it should be, because the student has partial knowledge.
Response-elimination strategies may be an important part of the reasoning
process that enters into test performance. These strategies do not seem un-
desirable.
6. Test anxiety can influence CR performance. Test anxiety indeed can be a
powerful influence in all forms of testing, especially when the stakes are
high, such as in employment, certification, licensing, and graduation test-
ing. Minnaert (1999) studied college-level students' reading comprehen-
sion and found that test anxiety was more likely to affect the MC version of
reading comprehension than the performance version. Thus, the MC for-
mat seems to introduce construct-irrelevant variance into test score inter-
pretations. Evidence is cited for higher anxiety in CR testing, but when the
cognitive demand is great and students have not yet had enough experience
with complex CR formats, greater anxiety is to be expected.
7. CR formats have greater potential for diagnosis of student learning and
program effects. Evidence is growing to support this statement. For exam-
ple, Mukerjee (1991) examined reading comprehension test results for
children in a Cloze format, which requires a CR, and in MC formats, and
found useful diagnostic information from both. His conclusion was that
both formats had something to contribute to deepening understanding
about reading comprehension. At the same time, work at the Educational
Testing Service with inference networks and cognitive task analysis
(Mislevy, 1996a) promises to increase our ability to diagnose learning
problems in an MC format.
8. CR might contribute to richer anticipatory learning. This is a primary
claim of test reformers such as Wiggins (1989), among others. This also fol-
lows from the fact that most of educational testing concentrates on basic
knowledge and skills of a very low cognitive level, often involving recall or
recognition. As students prepare for CR or MC tests, differences in learning
may appear, but much has to do with the design of the tests and what cogni-
tion they elicit. As most testing specialists point out, both formats can elicit
higher levels of cognition.

9. Policy decisions about CR and MC formats can be assisted by research but


should not be prescribed. Simplistic comparisons between CR and MC for-
mats using psychometric criteria such as item difficulty, item discrimination,
and reliability are helpful but often misleading. If construct interpretations
differ, such discussions are pointless. Researchers including Traub (1993)
have emphasized validity (test interpretation and use) rather than simplistic
criteria. Of course, cost and efficiency are powerful factors in policy deci-
sions. Legislators, school boards, and licensing and certification authorities
and boards all need to be more sophisticated in their appraisal of test inter-
pretations and uses as they pertain to item formats.
10. We need more research. Many more important questions remain to
be answered via theoretical development and research. The area of cogni-
tion and testing is relatively new. As cognitive psychologists address these
questions and testing specialists continue to be involved, the discussion of
item formats may be less important as computerized scoring replaces judg-
ment-based scoring. Several good examples exist of research on cognition
in testing. Haynie (1994) examined delayed retention using short-answer
CR and MC. He found MC to be superior in measuring delayed retention
of knowledge.

Validity Argument 6: Instrumentality

A persistent claim has been that the use of a particular item format has an effect
on student learning and test preparation. Frederiksen (1984) and later
Shepard (2002) should be credited for advancing this idea. There has been in-
creasing support for the idea that the overuse or exclusive use of one kind of for-
mat might corrupt student learning and its measurement. Heck and Crislip
(2001) examined this premise with a large, representative sample of
third-grade students in writing. Although girls outperformed boys in CR and
MC measures, the CR measures showed fewer differences for format compari-
sons. If students or other examinees are given a choice, they are more likely to
choose an MC format over the CR format (Bennett et al., 1999).
One benefit of this concern for the influence that an item format may have
on learning is the AERA (2000) guide for high-stakes testing that encourages
test preparation to include practice on a variety of formats rather than simply
those used in a criterion test. Such test preparation and the appropriate use of a
variety of item formats may be a good remedy to remove this threat to validity.

CONCLUSIONS ABOUT CHOOSING AN ITEM FORMAT

The choice of an item format mainly depends on the kind of learning outcome
you want to measure. As Beller and Gafni (2000) concluded, "It is believed that

the first priority should be given to what is measured rather than how it is mea-
sured" (p. 18). In other words, our focus should be content and cognitive pro-
cess. In chapter 2, it was established that student learning can include
knowledge, mental or physical skills, and cognitive abilities. Knowledge can be
recalled, understood, or applied. Given that content and cognitive processes
will direct us in the choice of an item format first and foremost, what have we
learned about item formats that will help us choose the most appropriate for-
mat for student learning?

1. If a domain of knowledge or skill is conceptualized, the main validity


concern is the adequacy of the sample of test items from this domain.
What type of item format gives you the best sampling from the domain?
MC is superior to the CR format simply because you can obtain more units
of measurement from MC and only a few units of measurement from the CR
format, even if short-answer essays are used. Whether the cognitive de-
mand that is required is recall or understanding, MC seems justified. The
CR essay formats described in this chapter are weak alternatives to MC.
The low-inference essay format is nearly equivalent to the MC format but
is inefficient and less reliable. The high-inference essay format is the weak-
est alternative simply because it requires subjective scoring that adds bias
as a threat to validity, but it also suffers from lower reliability when com-
pared with MC. From a historical perspective, the MC formats were intro-
duced as an efficient replacement for essay-type testing (Eurich, 1931;
Godshalk et al., 1966; Hurd, 1932; O'Dell, 1928; Patterson, 1926; Ruch,
1929; Tiegs, 1931; Traub & Fisher, 1977). MC formats continue to serve us
well for measuring the recall and understanding of knowledge and many
cognitive skills.
2. The most direct way to measure cognitive skill is a performance
test item. In most instances, any of the low-inference, objectively scorable
formats should be used. However, some cognitive skills lend themselves
nicely to some MC formats. Chapters 4 and 5 provide some examples. If
the skill is psychomotor, MC cannot be used. Some skills require expert
judgment and the high-inference, subjectively scorable item formats
should be used.
3. When measuring a cognitive ability, its complexity favors a
high-inference CR item format. Simple completion or short-answer CR
formats are not satisfactory. In some circumstances, the MC item set may
serve as a useful proxy for a CR item in the measurement of a cognitive
ability, particularly involving problem solving or critical thinking. Chap-
ter 4 provides a fuller discussion of the item set and other formats that
have higher cognitive demands. In some circumstances, the item set and
case-based items provide a useful alternative to the costly and inefficient
CR format.

SUMMARY

This chapter has provided information about item formats to measure knowl-
edge, skills, and abilities. An important distinction was made between ab-
stractly-defined and concretely-defined student learning. Each type of
learning requires a different type of item format. The former is subjectively
scored by a content expert; the latter is objectively scored by a trained observer.
Six validity arguments were used as a basis for choosing the appropriate for-
mat. At the end of this chapter, three recommendations were offered regarding
the choice of a format.
II
Developing MC Test Items

Thorndike (1967) noted that constructing good test items is probably the most
demanding type of creative writing imaginable. Not only must the item writer
understand the content measured by the item but also determine whether the
cognitive demand will involve recall, understanding, or application. Original-
ity and clarity are key features of well-written test items. The set of four chap-
ters in part II of this book is comprehensive with respect to writing MC items.
Chapter 4 presents and illustrates many MC formats and discusses some impor-
tant issues related to using these formats. Chapter 5 presents a validated list of
guidelines to follow when writing MC items. These guidelines derive from past
and current research (e.g., Haladyna et al., 2002). Chapter 5 contains many ex-
amples of MC items, most of which violate item-writing guidelines. Chapter 6
provides examples of test items taken from various sources. These items are ex-
emplary because of their innovative format, content measured, mental pro-
cesses represented, or some other feature. The purpose of chapter 6 is to give
you a broad sampling of the effectiveness of the MC format for many types of
content and cognitive processes, including the kind of thinking associated with
the measurement of abilities. Chapter 7 is devoted to item generation. This
chapter provides both older and newer ideas about how to prepare many items
for different types of content and mental processes rapidly.
4
MC Formats

OVERVIEW

In this chapter eight MC formats are presented. Examples are given for each
format. Claims are made about the types of content and cognitive processes
that each format can elicit. This chapter shows the versatility of the MC format
for measuring the recall or understanding of knowledge, some cognitive skills,
and many types of complex mental behavior that we associate with abilities.

CONTEXTS IN WHICH MC FORMATS ARE USED

Two main contexts apply to MC formats. The first is classroom testing, where
the objective of an MC test is to obtain a measure of student learning effi-
ciently. This measure is helpful when a teacher assigns a grade at the end of
the grading period. This measure has value to teachers and students for giving
students feedback and assistance in future learning or for reteaching and re-
learning content that has not been learned. The second context is a
large-scale testing program. The purposes of this large-scale testing program
might be graduation, promotion, certification, licensure, evaluation, place-
ment, or admission. In this second context, MC is chosen because it is effi-
cient and provides a useful summary of student learning of knowledge and
cognitive skills.

MC ITEM FORMATS

This chapter presents a variety of recommended MC formats. One format is
not recommended, and a recommended format serves as its replacement.


Conventional MC

The most common MC format is conventional. We have three variations. Each
is shown in Example 4.1. Each variation has three parts: (a) a stem; (b)
the correct choice; and (c) several wrong answers, called foils, misleads, or
distractors.

Question Format
Who is John Galt? stem
A. A rock star foil or distractor
B. A movie actor foil or distractor
C. A character in a book correct choice

Incomplete Stem (Partial Sentence)
John Galt is a character in an Ayn Rand novel who is remembered
for his
A. integrity.
B. romantic tendencies.
C. courage.

Best Answer
Which is the most effective safety feature in your car?
A. Seat belt
B. Front air bag
C. Anti-lock braking system

EXAMPLE 4.1. Three variations of conventional multiple-choice items.

Stem. The stem is the stimulus for the response. The stem should provide
a complete idea of the knowledge to be indicated in selecting the right answer.
The first item in Example 4.1 shows the question format. The second item
shows the incomplete stem (a partial sentence) format. The third item shows
the best answer format.

Correct Choice. The correct choice is undeniably the one and only right
answer. In the question format, the correct choice can be a word, phrase, or
sentence. In some rare circumstances, it can be a paragraph or even a drawing
or photograph (if the distractors are also paragraphs, drawings, or photo-
graphs). However, the use of paragraphs, drawings, photographs, and the like
makes the administration of the item inefficient. With the incomplete stem, the
second part of the sentence is the option, and one of these is the right answer.
With the best-answer format, all the options are correct, but only one is un-
arguably the best.

Distractors. Distractors are the most difficult part of the test item to
write. A distractor is an unquestionably wrong answer. Each distractor must
be plausible to test takers who have not yet learned the knowledge or skill
that the test item is supposed to measure. To those who possess the knowledge
asked for in the item, the distractors are clearly wrong choices. Each
distractor should resemble the correct choice in grammatical form, style, and
length. Subtle or blatant clues that give away the correct choice should al-
ways be avoided.
The number of distractors required for the conventional MC item is a mat-
ter of some controversy (Haladyna & Downing, 1993). When analyzing a vari-
ety of tests, Haladyna and Downing (1993) found that most items had only one
or two "working" distractors. They concluded that three options (a right an-
swer and two distractors) was natural. Few items had three working distractors.
In chapter 5, this issue is revisited. In this book, most of the examples contain
three options because both theory and research suggest that for conventional
MC three options works well.
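Whether a distractor is "working" can be checked from item response data. The sketch below is a minimal illustration, not the procedure used in the studies cited; the 5% threshold, the toy data, and the function name are assumptions chosen only for the example.

```python
from collections import Counter

def distractor_report(responses, options, key, threshold=0.05):
    """Proportion choosing each option; distractors chosen by fewer than
    `threshold` of examinees are flagged as non-working (a common rule
    of thumb, not a fixed standard)."""
    counts = Counter(responses)
    n = len(responses)
    report = {}
    for opt in options:
        p = counts.get(opt, 0) / n
        if opt == key:
            role = "key"
        elif p >= threshold:
            role = "working distractor"
        else:
            role = "non-working distractor"
        report[opt] = (round(p, 3), role)
    return report

# Ten examinees answering one three-option item keyed C
print(distractor_report(list("CCACBCCACC"), ["A", "B", "C"], key="C"))
```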

Controversy About the Conventional MC Formats. Some controversy
exists about the second variation, the incomplete stem (Gross, 1994). Statman
(1988) provided a logical analysis of the issue. She asserted that with the com-
pletion format, the test taker has to retain the stem in short-term memory while
completing it with each option and evaluating the truthfulness of each option;
if short-term memory fails, the test taker has to range back and forth between
the stem and each option, making a connection and evaluating the truth of that
connection. Testing is anxiety provoking, and the added stress of the completion
format may contribute to test anxiety, a problem that already troubles about
one in four test takers according to Hill and Wigfield (1984). The mental steps in-
volved in answering a completion item also take more time, which is undesir-
able. But research has not shown any appreciable difference when these two
formats are compared (Rodriguez, 2003).
Test takers with limited English proficiency taking a test presented in the
English language run a greater risk of having item format affect their perfor-
mance. For this reason, the more direct question format seems better suited for
these kinds of test takers.
Another issue is the use of blanks in the middle of the stem or question. The
guideline to follow is never to leave a blank in the middle or at the be-
ginning of the stem. These blankety-blank formats are difficult for students to
read. Such items also require more time to administer and reduce the time
spent productively answering other items. For these many reasons, the use of
internal or beginning blanks in completion-type items should be avoided. Ex-
ample 4.2 shows the blankety-blank item format.

Child abuse is an example of _______ violence, whereas
sexism is an example of _______ violence.

A. aggressive; structural
B. emotional; psychological
C. structural; emotional

EXAMPLE 4.2. Embedded blank-type items.

Several creative innovations in conventional MC have added to the variety
presented in this chapter. The first is a format that is easy to prepare and avoids
the tendency for students to use supplied options in mathematics to decide the
correct answer. In other words, some testing researchers suspect that the con-
ventional MC provides too many clues in the options. Johnson (1991) sug-
gested a standard set of numbers from low to high as options. The students code
the option that is closest to their answer. That way guessing or elimination
strategies do not work. The generation of numbers for distractors is easy, and
because this is one of the hardest steps in writing MC items, this variation can
be effective for quantitative items.
Another variation is the uncued MC (Fajardo & Chan, 1993), which is a de-
terrent to option-elimination strategies. By providing a key word or key phrase
list in the hundreds, we expect the student to read an item stem and search the
list for the correct answer. Guessing is virtually eliminated. These items have
good qualities. Namely, these items provide diagnostic information about fail-
ure to learn (Fenderson, Damjanov, Robeson, Veloski, & Rubin, 1997). Exam-
ple 4.3 shows an example.
Test designers can study patterns of response and determine what wrong
choices students are making and study why they are making these wrong
choices. The uncued MC also tends to be more discriminating at the lower end
of the test score scale and yields higher reliability than conventional MC. Re-
searchers argue that the writing of distractors for many items is eliminated once
the key word list is generated.
Matching
A popular variation of the conventional MC is the matching format. We use
this format when we have a set of options that seems useful for two or more

Draw four samples randomly from a distribution with a mean of 50
and a standard deviation of 10. Find the standard deviation of
your sample of four.

A B C D E F G H I

1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0

EXAMPLE 4.3. Uncued multiple choice.

items. The matching format begins with a set of options at the top followed by a
set of stems below. The instructions that precede the options and stems tell the
test taker how to respond and where to mark answers. As shown in Example
4.4, we have five options and six statements. We could easily expand the list of
six statements into a longer list, which makes the set of items more comprehen-
sive in testing student learning. In a survey of current measurement textbooks,
Haladyna et al. (2002) discovered that every measurement textbook they sur-
veyed recommended the matching format. It is interesting that there is no cited

Mark your answer on the answer sheet.
For each item select the correct answer from options provided below.

A. Minnesota
B. Illinois
C. Wisconsin
D. Nebraska
E. Iowa

1. Home state of the Hawkeyes
2. Known for its cheese heads
3. Land of many lakes
4. Cornhuskers country
5. The largest of these states
6. Contains Cook County

EXAMPLE 4.4. Simple matching format.


research on this format in any of these textbooks or prior reviews of research on
item formats.
Linn and Gronlund (1995) and Nitko (2001) both offered excellent instruc-
tion on designing effective matching items. Linn and Gronlund suggested the
following contexts for matching items: persons and achievements, dates and
events, terms and definitions, rules and examples, symbols and concepts, au-
thors and books, English and non-English equivalent words, machines and
uses, plants or animals and classification, principles and illustrations, objects
and names of objects, parts and functions. As you can see, matching has many
applications. Also, the cognitive demand for matching items can be recall or
understanding. To accomplish the latter needs the use of novel presentation of
stems or options. For example, content may be presented one way in a textbook
or in instruction, but the stems or options should be paraphrased in the match-
ing item.
The matching format has many advantages:

1. Matching items are easy to construct.
2. The presentation of items is compact. The example just provided could
be expanded to produce as many as 30 items on a single page.
3. This format is popular and widely accepted.
4. Matching lends itself nicely to testing understanding of concepts, princi-
ples, and procedures.
5. Matching is efficient based on the amount of student testing time needed
to answer a set of matching test items.
6. The options do not have to be repeated. If we reformatted this into con-
ventional MC, it would require the repeating of the five options for
each stem.

Among the few limitations of this format are the following tendencies:

1. Write as many items as there are options, so that the test takers match
up item stems to options. For instance, we might have five items and five op-
tions. This item design invites cuing of answers. Making the number of op-
tions unequal to the number of item stems can avoid this problem.
2. Mix the content of options, for instance, have several choices be
people and several choices be places. The problem is nonhomogeneous op-
tions. This can be solved by ensuring that the options are part of a set of
things, such as all people or all places. In Example 4.4, the options are all
states.

Matching items seem well suited for testing understanding of concepts,
principles, and procedures. Matching items are useful in classroom testing but
infrequently seen in large-scale testing programs.

Extended Matching

An extended-matching (EM) format is an MC variation that uses a long list of
options linked to a long list of item stems. According to Case and Swanson
(1998), a set of EM items has four components: (a) a theme, (b) a set of options,
(c) a lead-in statement, and (d) a set of stems. The theme focuses the test taker
in a context. Example 4.5 shows a generic EM format. The options are possible
right answers. This list of options can be lengthy. In fact, the list of options
might be exhaustive of the domain of possible right answers. The list of options
for an EM item set must also be homogeneous in content.

Theme
Options: A, B, C ...
Lead-in Statement
Stems: 1, 2, 3, ...

EXAMPLE 4.5. A generic extended-matching item.

The lead-in statement might be a scenario or a vignette. This puts the prob-
lem in a real-life context. Finally, the set of stems should be independently an-
swered. Each set of EM items should have this lead-in statement. Otherwise,
the test taker may find the set of items ambiguous. The set of items must have at
least two stems. Case and Swanson (1993, 1998) support the use of this format
because using it is easy and it generates a large number of items that test for un-
derstanding of knowledge and cognitive skills. In fact, they showed how some
item sets can involve vignettes or scenarios that suggest higher levels of test be-
havior that we might associate with an ability. Their example reflects medical
problem solving.
Example 4.6 presents an EM set from the Royal College of Psychiatry in the
United Kingdom. The EM format is highly recommended for many good reasons.

1. Items are easy to write.
2. Items can be administered quickly.
3. The cognitive process may be understanding and, in some instances, ap-
plication of knowledge that we associate with problem solving.
4. These items seem less resilient to cuing, whereas with conventional MC
one item can cue another.
5. EM items are more resilient to guessing. Moreover, Haladyna and
Downing (1993) showed that conventional MC items seldom have many
Theme: Neuropsychological tests
Options

A. Cognitive Estimates Test
B. Digit Span
C. Go-No Go Test
D. Mini Mental State Examination
E. National Adult Reading Test
F. Raven's Progressive Matrices
G. Rivermead Behavioural Memory Test
H. Stroop Test
I. Wechsler Memory Scale
J. Wisconsin Card Sorting Test
Lead-in: A 54-year-old man has a year's history of steadily
progressive personality changes. He has become increasingly
apathetic and appears depressed. His main complaint is increasing
frontal headaches. On examination, he has word finding difficulties.
EEG shows frontal slowing that is greater on the left.
Which test should you consider?
Stems:

1. You are concerned that he may have an intracranial
space-occupying lesion.
2. Test indicates that his current performance IQ is in the low
average range.
3. The estimate of his premorbid IQ is 15 points higher than his
current performance IQ. It is recommended that he has a full
WAIS IQ assessment to measure both performance and
verbal IQ. On the WAIS, his verbal IQ is found to be impaired
over and above his performance IQ.
Which test is part of the WAIS verbal subtests?
4. An MRI scan shows a large meningioma compressing dorso-
lateral prefrontal cortex on the left. Which test result is most
likely to be impaired?

EXAMPLE 4.6. Extended-matching format for clinical problem-solving skills.
Adapted with the permission of the Royal College of Psychiatry in the United Kingdom.

good distractors; thus, guessing a right answer is more likely with conven-
tional MC. Case and her colleagues have researched the EM format with
favorable results (Case & Swanson, 1993; Case, Swanson, & Ripkey,
1994).

This format is widely used in medicine and related fields in both the United
States and the United Kingdom. It has much potential for classroom and
large-scale assessments. As chapter 6 shows, this format has versatility for a va-
riety of situations. An excellent instructional source for this format can be
found in Case and Swanson (2001), also available on the web at http://
www.nbme.org/.

Alternate Choice

Alternate Choice (AC) is a conventional MC with only two options. Ebel
(1981, 1982), a staunch advocate of this format, argued that many items in
achievement testing are either-or, lending themselves nicely to the AC format.
Downing (1992) reviewed the research on this format and agreed that AC is vi-
able. Haladyna and Downing (1993) examined more than 1,100 items from
four standardized tests and found many items have a correct answer and only
one working distractor. The other distractors were nonfunctioning. They con-
cluded that many of these items were naturally in the AC format. Example 4.7
shows a simple AC item.

What is the most effective way to motivate a student?

A. Intermittent praise

B. Consistent praise

EXAMPLE 4.7. Alternate-choice item.

Although the AC item may not directly measure writing skills, Example 4.8
shows the potential for the AC format to approximate the measurement of a
writing skill.
Although AC is a downsized version of conventional MC, it is not a
true-false (TF) item. AC offers a comparison between two choices, whereas
the TF format does not provide an explicit comparison among choices. With
the TF format, the test taker must mentally create the counterexample and
choose accordingly.
The AC has several attractive characteristics and some limitations:

1. (A-Providing, B-Provided) that all homework is done, you
may go to the movie.

2. It wasn't very long (A-before, B-until) Earl called Keisa.

3. Knowledge of (A-preventative, B-preventive) medicine will
lengthen your life.

4. All instructions should be written, not (A-oral, B-verbal).

5. Mom divided the pizza (A-between, B-among) her three boys.

EXAMPLE 4.8. Alternate-choice items measuring writing skills.

1. The most obvious advantage is that the AC item is easy to write.
The item writer only has to think of a right answer and one plausible distractor.
2. The efficiency of the use of this format with respect to printing costs,
ease of test construction, and test administration is high.
3. Another advantage is that if the item has only two options, one can as-
sign more AC items to a test per testing period than with conventional MC
items. Consequently, the AC format provides better coverage of the content
domain.
4. AC items are not limited to recall but can be used to measure under-
standing, some cognitive skills, and even some aspects of abilities (Ebel,
1982).
5. Ebel (1981, 1982) argued that AC is more reliable than MC because
more AC items can be asked in a fixed time. Because test length is func-
tionally related to reliability (see the sketch following this list), using
valid AC items makes sense. Research on AC items supports Ebel's
contention (Burmester & Olson, 1966; Ebel,
1981, 1982; Ebel & Williams, 1957; Hancock, Thiede, & Sax, 1992;
Maihoff & Mehrens, 1985; Sax & Reiter, n.d.). Also, AC items have a his-
tory of exhibiting satisfactory discrimination (Ruch & Charles, 1928; Ruch
& Stoddard, 1925; Williams & Ebel, 1957).
6. Lord (1977) suggested another advantage: A two-option format is
probably most effective for high-achieving students because of their ten-
dency to eliminate other options as implausible distractors. Levine and
Drasgow (1982) and Haladyna and Downing (1993) provided further sup-
port for such an idea. When analyzing several standardized tests, they
found that most items contained only one or two plausible distractors.
Many of these items could have been easily simplified to the AC format. If
this is true, two options should not only be sufficient in many testing situa-
tions but also a natural consequence when useless distractors are removed
from an item containing four or five options.
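The relation between test length and reliability mentioned in point 5 is usually expressed with the Spearman-Brown prophecy formula. The sketch below uses assumed, purely illustrative numbers rather than results from the studies cited.

```python
def spearman_brown(reliability, k):
    """Projected reliability when a test is lengthened by a factor k
    with items of comparable quality (Spearman-Brown prophecy formula)."""
    return k * reliability / (1 + (k - 1) * reliability)

# If a conventional MC test yields reliability .80 in a testing period, and
# the shorter AC format allows 1.5 times as many items in the same time:
print(round(spearman_brown(0.80, 1.5), 3))  # roughly .86
```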

The most obvious limitation of the AC format is that guessing is a factor—
the test takers may choose the correct answer even if they do not know the an-
swer. The probability of randomly guessing the right answer is 50% for one item.
By recognizing the floor and ceiling of a test score scale consisting of AC items,
we overcome this limitation. For instance, the lowest probable score for a
30-item AC test is 50% if random guessing happens. The ceiling of the test is, of
course, 100%. A score of 55% on such a test is very low, whereas a score of 75%
is in the middle of this scale. Given that guessing is a larger factor in AC items
when compared with conventional MC, one only has to make an interpreta-
tion in keeping with the idea that 50% is about as low a score as can be ex-
pected. Any passing standard or other evaluative criteria used should be
consistent with the effective range of the AC test score scale, which is from
50% to 100%.
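The effective-range argument can be made concrete with a little binomial arithmetic. A minimal sketch using assumed numbers (a 30-item AC test and an observed score of 75%):

```python
import math

n, p = 30, 0.5                   # 30 AC items, 50% chance correct by blind guessing
expected = n * p                 # expected raw score for a pure guesser
sd = math.sqrt(n * p * (1 - p))  # binomial standard deviation
print(f"Chance score: {expected / n:.0%}, give or take {sd / n:.1%}")

# Mapping an observed score onto the effective 50%-100% range:
observed = 0.75
position = (observed - 0.5) / (1.0 - 0.5)
print(f"A raw score of {observed:.0%} sits {position:.0%} of the way up the effective scale")
```

This is consistent with the point above that a score of 75% lies near the middle of the usable AC scale.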
TF
The TF format has been well established for classroom assessment but seldom
used in standardized testing programs. Haladyna et al. (2002) found that for a
contemporary set of educational measurement textbooks, all 26 books recom-
mended TF items. However, there has been evidence to suggest using TF with
caution or not at all (Downing, 1992; Grosse & Wright, 1985; Haladyna,
1992b). Like other two-option formats, TF is subject to many abuses. The most
common may be a tendency to test recall of trivial knowledge. Example 4.9
shows the use of TF for basic knowledge.

Mark A on your answer sheet if true and B if false.

1. The first thing to do with an automatic transmission that does
not work is to check the transmission fluid. (A)

2. The major cause of tire wear is poor wheel balance. (B)

3. The usual cause of clutch "chatter" is in the clutch pedal
linkage. (A)

4. The distributor rotates at one half the speed of the engine
crankshaft. (B)

EXAMPLE 4.9. Examples of true-false items.


Example 4.10 presents an effective, although unconventional, use of this format.

Place an "X" beneath each structure for which each statement is true.

Characteristic                          Root    Stem    Leaf
Growing point protected by a cap
May possess a pithy center
Epidermal cells hair-like
Growing region at tip

EXAMPLE 4.10. Unusual example of a true-false format.

These items occupy a small space but provide a complete analysis of plant
anatomy.
However, there are subtle and serious problems with the TF format. For
example, Peterson and Peterson (1976) investigated the error patterns of
positively and negatively worded TF questions that were either true or false.
Errors were not evenly distributed among the four possible types of TF
items. Although this research is not damning, it does warn item writers that
the difficulty of the item can be controlled by its design. Hsu (1980) pointed
out a characteristic of TF items when they are presented as a group using the
generic stem in Example 4.11. Such a format is likely to interact with the
ability of the group being tested in a complex way. Both the design of the
item and the format for presentation are likely to cause differential results.
Ebel (1978), a proponent of TF items, was opposed to the grouping of items
in this manner.

Which of the following statements are true?

EXAMPLE 4.11. A generic stem for true-false items.

Grosse and Wright (1985) described a more serious threat to the usefulness
of TF. They argued that TF has a large error component due to guessing, a find-
ing that other research supports (Frisbie, 1973; Haladyna & Downing, 1989b;
Oosterhof & Glasnapp, 1974). Grosse and Wright claimed that if a test taker's
response style favors true instead of false answers in the face of ignorance, the
reliability of the test score may be seriously undermined. A study comparing
conventional MC, AC, and TF showed poor performance for TF in terms of re-
liability (Pinglia, 1994).
As with AC, Ebel (1970) advocated the use of TF. The chapter on TF testing
by Ebel and Frisbie (1991) remains an authoritative work. Ebel's (1970) argu-
ments are that the command of useful knowledge is important. We can state all
verbal knowledge as propositions, and each proposition can be truly or falsely
falsely stated. We can measure student knowledge by determining the degree to
which each student can judge the truth or falsity of knowledge. Frisbie and
Becker (1991) synthesized the advice of 17 textbook sources on TF testing.
The advantages of TF items are as follows:

1. TF items are easy to write.
2. TF items can measure important content.
3. TF items can measure different cognitive processes.
4. More TF items can be given per testing period than conventional MC items.
5. TF items are easy to score.
6. TF items occupy less space on the page than other MC formats, therefore
minimizing the cost of production.
7. The judgment of a proposition as true or false is realistic.
8. We can reduce reading time.
9. Reliability of test scores is adequate.

The disadvantages are as follows:

1. Items tend to reflect trivial content.
2. TF items tend to promote the testing of recall.
3. Guessing is too influential.
4. The TF format is resistant to detecting degrees of truth or falsity.
5. TF tests tend to be slightly less reliable than comparable MC tests.
6. There are differences between true TF items and false TF items, which
have caused some concern.
7. TF items are not as good as AC items (Hancock et al., 1992).

We can refute some of these criticisms. The reputation for testing trivial
content is probably deserved, but only because item writers write items measur-
ing trivial content. This practice is not a product of the item format. Trivial
content can be tested with any format. The more important issue is: Can TF
items be written to measure nontrivial content? A reading of the chapter on TF
testing in the book by Ebel and Frisbie (1991) provided an unequivocal yes to
this question. The issue of testing for understanding instead of recall is also an-
swered by better item-writing techniques. As with AC, guessing is not much of
a factor in TF tests, for the same reasons offered in the previous section. If one
keeps in mind that the floor of the scale for a TF test is 50% and the ceiling is
100%, our interpretations can be made in that light. Exceeding 60% on these
tests is difficult for a random guesser when the test length is substantial, say 50
or 100 items. This is the same argument that applies to AC.
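The claim that a random guesser rarely exceeds 60% on a long TF test can be checked with the binomial distribution. A minimal sketch (the cut score and test lengths come from the passage; the function name is an assumption):

```python
from math import ceil, comb

def chance_of_reaching(n_items, cut_proportion, p=0.5):
    """Probability that blind guessing reaches at least the cut proportion
    correct on a test of n_items two-option items."""
    k_min = ceil(cut_proportion * n_items)
    return sum(comb(n_items, k) * p**k * (1 - p)**(n_items - k)
               for k in range(k_min, n_items + 1))

for n in (50, 100):
    print(n, "items:", round(chance_of_reaching(n, 0.60), 3))
```

The probabilities come out to roughly .10 for 50 items and under .03 for 100 items, which supports the point that longer two-option tests keep blind guessing in check.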
Given its widespread support from textbook writers, TF is recommended for
classroom assessment. For large-scale assessments, we have other formats de-
scribed in this chapter that are more useful and have less negative research.

Complex MC

This item format offers test takers three choices regrouped into four options,
as shown in Example 4.12. The Educational Testing Service first introduced
this format, and the National Board of Medical Examiners later adopted it for
use in medical testing (Hubbard, 1978). Because many items used in medical
and health professions testing programs had more than one right answer,
complex MC permits the use of one or more correct options in a single item.
Because each item is scored either right or wrong, it seems sensible to set out
combinations of right and wrong answers in an MC format where only one
choice is correct.

Which actors appeared in the movie Lethal Weapon 10?

1. Mel Gibson
2. Danny Glover
3. Vin Diesel

A. 1 and 2
B. 2 and 3
C. 1 and 3
D. 1,2, and 3

EXAMPLE 4.12. Complex multiple-choice item.

Complex MC was popular in formal testing programs, but its popularity is
justifiably waning. Albanese (1992), Haladyna (1992b), and Haladyna and
Downing (1989b) gave several reasons to recommend against its use:

1. Complex MC items may be more difficult than comparable single-best-
answer MC.
2. Having partial knowledge, knowing that one option is absolutely correct
or incorrect, helps the test taker identify the correct option by eliminat-
ing distractors. Therefore, test-taking skills have a greater influence on
test performance than intended.
3. This format produces items with lower discrimination, which in turn low-
ers test score reliability.
4. The format is difficult to construct and edit.
5. The format takes up more space on the page, which increases the page
length of the test.
6. The format requires more reading time, thus reducing the number of
items of this type one might put in a test. Such a reduction negatively af-
fects the sampling of content, therefore reducing the validity of interpre-
tations and uses of test scores.

Studies by Case and Downing (1989), Dawson-Saunders, Nungester, and
Downing (1989), and Shahabi and Yang (1990) provided additional evidence
of the inferiority of the complex MC. Subhiyah and Downing (1993) provided
evidence that no difference exists. Complex MC items have about the same
qualities as conventional MC. Furthermore, this format fills a need when
"list-type" questioning is needed. Fortunately, multiple true-false (MTF) is a
viable alternative to the complex MC format.

MTF

The MTF, which is sometimes referred to as Type X, has much in com-
mon with the TF format. The distinguishing characteristic between the
two formats is that TF items should be nonhomogeneous in content
and cognitive demand, whereas MTF items have much in common and usu-
ally derive their commonality from a lead-in statement, as with the
EM format.
Example 4.13 names a book read by the class, and five statements are offered
that may be applicable to the book. Each student has to link the statement to
the book plausibly. Some statements are true and others are false.
Generally, the numbers of true and false answers are balanced. The MTF for-
mat is really an item set. The list of items can be lengthy, as many as 30. This is
an attractive feature of the MTF, the ability to administer many items in a short
time. Example 4.14 is a more complex MTF item set.
Frisbie (1992) reviewed research on the MTF format and supported its use.
However, he stated that one detriment to its use is a lack of familiarity by item
writers. Downing, Baranowski, Grosso, and Norcini (1995) compared MTF
and conventional MC in a medical testing setting. They found that MTF items
yielded more reliable scores, but they found conventional MC to be more
The Lion, the Witch, and the Wardrobe by C. S. Lewis can best be
summarized by saying:

1. A penny saved is a penny earned.
2. If you give them an inch, they will take a mile.
3. Good will always overcome evil.
4. Do not put off tomorrow what you can do today.
5. Do not put all your eggs in one basket.

EXAMPLE 4.13. Multiple true-false item set for a book.

highly correlated with complex measures of competence than MTF. They con-
cluded that MTF in this study seemed to reflect more basic knowledge.

The advantages of the MTF format are as follows:

1. This format avoids the disadvantages of the complex MC format.
2. Recent research has shown that the MTF item format is effective with respect to
reliability and validity (Frisbie, 1992). Several researchers have established
that the MTF format produces higher reliability estimates when compared
with the conventional MC items (Albanese, Kent, & Whitney, 1977;
Downing et al., 1995; Frisbie & Druva, 1986; Frisbie & Sweeney, 1982; Hill
& Woods, 1974).
3. Frisbie and Sweeney (1982) reported that students perceived the
MTF items to be easier and preferred them to conventional MC. Oddly enough,
Hill and Woods (1974) reported that the MTF items seemed harder, but sev-
eral students anecdotally reported that the MTF items were better tests of
their understanding.
4. This format is efficient in item development, examinee reading
time, and the number of questions that can be asked in a fixed time. For
instance, placing nearly 30 MTF items on a page is possible, and admin-
istering more than 100 questions per 50-minute testing period is feasi-
ble. Given that guessing can play a strong role in such items, the
effective range of scores for such a test will range from 50% to 100%. As
with AC and TF, guessing will not greatly influence scores if enough
items are used.

There are some potential limitations to this format:

1. The MTF format appears limited to testing the understanding of
concepts by listing examples and nonexamples, characteristics and non-
Your video store rents VHS for $2.00 on weekdays and $3.00 on
weekends. You also rent DVDs for $3.00 on weekdays and $4.00
for weekends. Here is a weekly summary of rentals.

                Videos Rented    DVDs Rented
Monday               38               35
Tuesday              31               28
Wednesday            40               45
Thursday             47               49
Friday               55               52
Saturday             63               60
Sunday               75               68

Mark A if true or B if false.

1. The video store makes more money from VHS than from
DVD.
2. DVDs and VHSs are more expensive on the weekdays.
3. The video store sells more DVDs in a week than VHS.
4. DVDs are more expensive than VHS.
5. The video store rents more videos Friday, Saturday, Sunday
than on the weekdays.
6. Customers rent about the same number of DVDs and VHSs
on the weekdays.
7. The video store rents more VHS than DVDs on the
weekends.

EXAMPLE 4.14. Complex multiple true-false item set. Written by an anonymous teacher education student.


characteristics. Although MTF items are further illustrated in chapters 5
and 6, the variety of content seems limited.
2. One technical problem that might arise with the MTF format is that
of estimating reliability. Generally, MC test items (including the MTF for-
mat) are assumed to be independent of one another with respect to re-
sponses. Dependence occurs when one item cues another item. The
technical term for this is local independence. Dependency among items of a
single MTF item set would make that set of items operate as one MC item.
Frisbie and Druva (1986) and Albanese and Sabers (1988) established
that no dependence existed with their test data. Nonetheless, local de-
pendency will result in an overestimation of reliability and is a caution
with this format (see the sketch following this list).
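One practical guard against the overestimation described in point 2 is to collapse each item set into a single testlet score before estimating reliability, in the spirit of the testlet scoring discussed later in this chapter. The sketch below is illustrative only; the toy data, function names, and the use of coefficient alpha are assumptions, not a procedure taken from the sources cited.

```python
import numpy as np

def cronbach_alpha(parts):
    """Coefficient alpha for a persons-by-parts score matrix."""
    parts = np.asarray(parts, dtype=float)
    k = parts.shape[1]
    return k / (k - 1) * (1 - parts.var(axis=0, ddof=1).sum()
                          / parts.sum(axis=1).var(ddof=1))

def testlet_scores(item_scores, testlet_sizes):
    """Sum 0/1 item scores within each testlet so that reliability is
    estimated over (more nearly) locally independent units."""
    item_scores = np.asarray(item_scores)
    cuts = np.cumsum(testlet_sizes)[:-1]
    return np.column_stack([chunk.sum(axis=1)
                            for chunk in np.split(item_scores, cuts, axis=1)])

# Toy data: 4 examinees, two 3-item MTF sets (columns 0-2 and 3-5)
X = np.array([[1, 1, 1, 0, 1, 0],
              [1, 0, 1, 1, 1, 1],
              [0, 0, 0, 0, 1, 0],
              [1, 1, 0, 1, 1, 1]])
print("alpha treating items as independent:", round(cronbach_alpha(X), 2))
print("alpha over testlet scores:          ",
      round(cronbach_alpha(testlet_scores(X, [3, 3])), 2))
```

With these toy numbers the item-level estimate (about .67) is noticeably larger than the testlet-level estimate (about .37), illustrating how treating dependent items as independent can overstate reliability.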

The MTF format is an effective substitute for the complex MC. Because the
MTF has inherently good characteristics for testing knowledge, it should be
more widely used.

An MTF Variation: The Multiple Mark (Multiple-Multiple-Choice)

According to Pomplun and Omar (1997), the multiple-mark variation has a
history (Cronbach, 1941; Dressel & Schmid, 1953), but it has been neglected
until recently. With this variation, students mark the choice if it is correct or
true and do not mark it or leave it blank if it is not correct. With MTF, students
mark each option. Pomplun and Omar found that when students guess with
this MTF variation, they tend to make an error of omission. With MTF, they
tend to guess true. This research, along with the study by Grosse and Wright
(1985), calls our attention to problems with the TF format and guessing strate-
gies that might introduce bias into test scores. Both MTF and the multiple-
mark formats get good grades in terms of performance when compared with
other MC formats. As research continues to explore the MTF and multiple
mark, we will see this format more widely used both in classroom testing and in
formal, standardized testing programs. The economy of presenting many items
in a short time is the main attraction of this format.

Context-Dependent Item Sets

The context-dependent item set has an introductory stimulus and usually 2 to
12 test items related to this stimulus. The stimulus for any item set might be a
work of art, photograph, chart, graph, figure, table, written passage, poem,
story, cartoon, problem, experiment, narrative, or reference to an event, per-
son, or object. Once this stimulus has been created, we then create 2 to 10 test
items of any MC format. Creativity is much needed in shaping the item set.
Terms used to describe item sets include interpretive exercises, scenarios, vignettes,
item bundles, problem sets, super items, and testlets.
Although this format has a long history, it is only recently becoming more
popular. One reason is the need to create items that measure higher level
thinking. Another reason is that scoring methods have improved. The item
set seems well suited to testing cognitive abilities or aspects of an ability in-
volving complex thinking, such as is found in problem solving or critical
thinking.
Little research has been reported on the item set (Haladyna, 1992a, 1992b).
Although this format appears in many standardized achievement tests and
some professional licensing and certification examinations, the scientific basis
for its design and scoring is neonatal. The 1997 revision of the National Coun-
cil on Architectural Registration Boards adopted vignettes for its Building De-
sign Test. One study by Case, Swanson, and Becker (1996) addressed the
relative performance of medical licensing test items as to difficulty and dis-
crimination. They contrasted items with no stimulus material and items with
short and longer scenarios (vignettes). Although they found little or no differ-
ences in two studies in discrimination, long vignette items tended to be slightly
more difficult. The researchers concluded that vignette-based item sets will
continue to be used due to the higher cognitive demands that their format elic-
its. Another factor supporting the use of vignette-based items is acceptance by
candidates that these test items have a greater fidelity with the implicit crite-
rion of medical competence.
Wainer and Kiely (1987) and Thissen, Steinberg, and Mooney (1989) in-
troduced and described testlets as bundles of items with a variety of scorable
predetermined paths for responding. This is a more complex idea than pre-
sented here, but the technical issues addressed by these authors offer some
guidance in future research on item sets. Like the MTF format, context ef-
fects or interitem dependence is a threat. In fact, the MTF format is a type of
item set. If items are interdependent, the discriminative ability of the items,
and the reliability of scores, will be diminished (Sireci, Thissen, & Wainer,
1991). Wainer and Kiely (1987) explored methods for scoring these item bun-
dles, as applied to computerized adaptive testing, but these methods can apply
to conventional fixed-length testing. They also explored hierarchical testlets.
We can overcome the problem of dependency if we score item sets as mini-
tests (testlets; Rosenbaum, 1988). Thissen and Wainer (2001) have several
chapters that address scoring testlets.
Several types of item sets are featured here, each intended for a certain type
of cognitive activity: (a) reading comprehension, (b) problem solving, (c) pic-
torial, and (d) interlinear. Each type is briefly discussed to provide the essence
of what type of content and cognitive process is being measured. Examples of
each type are presented.

Reading Comprehension. The item set shown in Example 4.15 presents
a literary passage for language arts students and asks questions to
measure student understanding of the passage. Typically, one can get 6 MC

"The radiance was that of full, setting, and blood-red moon, which
now shone vividly through that once barely discernible fissure of
which I have before spoken as extending from the roof of the
building, in a zigzag direction, to the base. While I gazed this
fissure rapidly widened—there came a fierce breath of the whirl-
wind—the entire orb of the satellite burst at once upon my sight—
my brain reeled as I saw the mighty walls rushing asunder—there
was a long, tumultuous shouting sound like the voice of a
thousand waters—and the deep and dank tarn at my feet closed
sullenly and silently over the fragments of the House of Usher."
1. What is Poe referring to when he speaks of "the entire orb of
the satellite"?
A. The sun
B. The moon
C. His eye
2. What is a "tarn"?
A. A small pool
B. A bridge
C. A marsh
3. How did the house fall?
A. It cracked into two pieces.
B. It blew up.
C. It just crumpled.
4. How did the speaker feel as he witnessed the fall of the House
of Usher?
A. Afraid
B. Awestruck
C. Pleased
5. What does the speaker mean when he said "his brain
reeled?"
A. He collected his thoughts.
B. He felt dizzy.
C. He was astounded.

EXAMPLE 4.15. Comprehension type item set.


items to a page. Therefore, the two-page item set might contain as many as 10
to 12 items, allowing for a brief introductory passage on the first page. Read-
ing comprehension item sets are common in standardized tests. One page is
devoted to a narrative or descriptive passage or even a short story, and the op-
posing page is devoted to MC items measuring understanding of the passage.
The items might take a generic form and an item set structure is established.
Some items might systematically ask for the meaning of words, phrases, or the
entire passage. Some items might ask for prediction (e.g., what should happen
next?). Other items might analyze characterization or plot. Once the set of items
is drafted and used, it can be reapplied to other passages, making the testing
of comprehension easy. Chapter 7 presents a set of reading comprehension
item shells.
Katz and Lautenschlager (1999) experimented with passage and no-pas-
sage versions of a reading comprehension test. Based on their results, they
stated that students with outside knowledge could answer some items with-
out referring to the passage. This research and earlier research they cite shed
light on the intricacies of writing and validating items for reading compre-
hension. They concluded that a science for writing reading comprehension
items does not yet exist. We can do a better job of validating items by doing a
better analysis of field test data and more experimentation with the no pas-
sage condition.

Problem Solving. Example 4.16 contains an item set in science. The stim-
ulus is a scientific experiment involving a thermos bottle and some yeast, sugar,
and water. The questions involve the application of principles of science. Al-
though we try to write these items so that they are independent, dependency
seems unavoidable. It is important to note that each item should test a different
step in problem solving. Item 1 asks the student to apply a principle to predict
what happens to the temperature of the water. Item 2 gives the reason for this
result. All four options were judged to be plausible. Item 3 calls for a prediction
based on the application of a principle. Item 4 addresses possible changes in
sugar during a chemical reaction. Item 5 tests another prediction based on this
chemical reaction.
Example 4.17 illustrates a patient problem from a nursing licensing exami-
nation. This item set has an interesting variation; the stimulus presents the
problem and after an item is presented a change in the scenario introduces a
new problem, with an accompanying item. With this format, we test a student's
ability to work through a patient problem.

Pictorial. The pictorial variation of the context-dependent item set of-
fers considerable opportunity to ask questions in interesting and effective
ways. Example 4.18 provides an example of a table showing the number of
participants and number of injuries for 10 sports. Test items can be written to
A thermos bottle is filled with a mixture of yeast, sugar, and water
at 15 degrees C and the contents are examined 24 hours later.
1. What happens to the temperature?
A. Increases
B. Stays the same
C. Decreases
2. What is the reason for that result?
A. Yeast plants respire.
B. Yeast plants do not respire.
C. Yeast plants absorb heat in order to live.
D. Heat cannot be conducted into or out of the thermos
bottle.
3. What has happened to the number of yeast plants?
A. Increased
B. Decreased
C. Remained about the same
4. What about the sugar?
A. Increased
B. Decreased
C. Remained about the same
5. What has happened to the content?
A. Increased in oxygen
B. Deceased in oxygen
C. Increased in carbon dioxide
D. Decreased in carbon dioxide

EXAMPLE 4.16. Problem-solving item set.

test one's understanding of the data and inferences that can be made from
these data. The test items reflect reading the table and evaluating the data
presented. Some items require that a ratio of injuries to participants be cre-
ated for each sport, to evaluate the rate of injury. Chapter 7 provides more va-
rieties of this item format.

Interlinear. This unique format is illustrated in Example 4.19. As you can
see, the item set does not take up a great amount of space but cleverly gets the
test taker to choose between correct and incorrect grammar, spelling, capital-
ization, and punctuation. Although the highest fidelity of measurement for
writing skills is actual editing of one's own writing or someone else's writing,
Ms. Mary Petel, 28 years old, is seen by her physician for
complaints of muscular weakness, fatigue, and a fine tremor of the
hands. Hyperthyroidism is suspected and her prescriptions
include a radioactive iodine uptake test.
1. The nurse should explain to Ms. Petel that the chief purpose of
a radioactive iodine uptake test is to
A. ascertain the ability of the thyroid gland to produce
thyroxine.
B. measure the activity of the thyroid gland.
C. estimate the concentration of the thyrotropic hormone in
the thyroid gland.
D. determine the best method of treating the thyroid condition.
The results of the diagnostic tests confirm a diagnosis of
hyperthyroidism. Ms. Petel consents to surgery on a future date.
Her current prescriptions include propylthiouracil.
2. The nurse should explain to Ms. Petel that the propylthiouracil
initially achieves its therapeutic effect by which of the
following actions?
A. Lowering the metabolic rate
B. Inhibiting the formation of thyroxine
C. Depressing the activity of stored thyroid hormone
D. Reducing the iodide concentration in the thyroid gland
Two months later, Ms. Petel is admitted to the hospital and
undergoes a subtotal thyroidectomy.
3. During the immediate postoperative period, the nurse should
assess Ms. Petel for laryngeal nerve damage. Which of the
following findings would indicate the presence of this
problem?
A. Facial twitching
B. Wheezing
C. Hoarseness
D. Hemorrhage

EXAMPLE 4.17. Problem-solving item set in professional testing.

SPORT                     INJURIES    PARTICIPANTS1

1. Basketball              646,678        26.2
2. Bicycle riding          600,649        54.0
3. Baseball, softball      459,542        35.1
4. Football                453,648        13.3
5. Soccer                  150,449        10.0
6. Swimming                130,362        66.2
7. Volleyball              129,839        22.6
8. Roller skating          113,150        26.5
9. Weightlifting            86,398        39.2
10. Fishing                 84,115        47.0

Source: National Safety Council's Consumer Product Safety
Commission, National Sporting Goods Association.
1 Reported in millions.

1. Which sport has the greatest number of participants?
A. Basketball
B. Bicycle riding
C. Soccer
2. Which sport in the list has the least number of injuries?
A. Gymnastics
B. Ice hockey
C. Fishing
3. Which of the following sports has the highest injury rate,
considering numbers of participants?
A. Basketball
B. Bicycle riding

EXAMPLE 4.18. Item set based on a table of data.
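For an item like number 3 in Example 4.18, the intended reasoning is a simple rate calculation, injuries divided by participants. A minimal sketch of that arithmetic (figures copied from the table; participants are in millions):

```python
# Injuries and participants (millions) for the two options in item 3
sports = {
    "Basketball":     (646_678, 26.2),
    "Bicycle riding": (600_649, 54.0),
}

for name, (injuries, participants_millions) in sports.items():
    rate = injuries / (participants_millions * 1_000_000)
    print(f"{name}: about {rate:.1%} of participants injured")
```

Basketball works out to roughly 2.5% and bicycle riding to roughly 1.1%, so the keyed answer to item 3 is option A.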


For each numbered, underlined pair of choices, choose the letter
next to the correct spelling of the word and fill in your answer
sheet with that letter next to the number of the item.
There (1. A. our or B. are) many ways to invest money. You can
earn (2. A. intrest or B. interest) by buying savings bonds. Or you
can (3. A. bye or B. buy or C. by) corporate bonds. Or you can
become a (4. A. part-owner or B. partowner) of a company by
owning stock in a company. As a shareholder in a company, you
can share in company (5. A. profits or B. prophets).

EXAMPLE 4.19. Interlinear item set.

this MC format tests one's writing skills in a very efficient way. Also, this format
can be used to generate additional items so that other interlinear item sets can
give teachers practice items for students who want to know how proficient they
are with these writing skills.

Summary. The context-dependent item set is one of several effective
ways to measure complex thinking. With more experience and experimenta-
tion, we should identify more varieties to use to address the need to measure
higher level thinking with the more efficient MC format.

MC ITEM FORMAT ISSUES

This section contains discussions of topics involving the design of future MC
items. These topics are the role of calculators in MC testing, the use of com-
puter-based testing, the use of visual materials as suggested with item sets, the
use of dictionaries during a test, and the placement of dangerous answers in
items in credentialing tests for the professions.

Calculators and MC Testing

The use of inexpensive, simple electronic calculators became part of the MC
testing experience. The NCTM (1989, 2000) strongly encourages the use of
calculators both during instruction and for testing.
Electronic technologies—calculators and computers—are essential tools for
teaching, learning, and doing mathematics. They furnish visual images of mathe-
matical ideas, they facilitate organizing and analyzing data, and they compute ef-
ficiently and accurately. They can support investigation by students in every area
of mathematics, including geometry, statistics, algebra, measurement, and num-
ber. When technological tools are available, students can focus on decision mak-
ing, reflection, reasoning, and problem solving. (NCTM, 2000, p. 25)
The emphasis on higher level thinking has promoted the use of calculators
and computers as aids in the teaching, learning, and testing processes.
Some standardized testing programs have recently introduced calculators
into the testing situation (e.g., the SAT I and the Uniform Certified Public Ac-
countancy Examination). However, the use of calculators may affect test re-
sults or redefine what we are trying to measure via our test items. Calculators
can be used in the testing process but with the understanding that the use of
calculators may change the performance characteristics of items intended for
use without calculators.
Loyd (1991) made some noteworthy observations about using calculators
with these item formats. Although calculation errors will likely diminish with
the use of calculators, time needed for administration of a test consisting of
calculator items may actually increase because the nature of the task being
tested becomes more complex. Actual performance changes under condi-
tions of calculators and no calculators, depending on the type of material
tested (e.g., concepts, computation, problem solving) and grade level, are
complex (Lewis & Hoover, 1981). Some researchers reported that calculators
have little or no effect on test performance because the construct tested is not
affected by using calculators (Ansley, Spratt, & Forsyth, 1988). Loyd further
reported that these studies showed that in an item-by-item analysis of the use
of calculators, some items requiring calculation have improved performance
because of calculators, whereas other items are impervious to the use of cal-
culators. A study by Cohen and Kim (1992) showed that the use of calcula-
tors for college-age students actually changed the objective that the item
represented. These researchers noted that even the type of calcula-
tor used can have an untoward effect on item performance. Poe, Johnson, and
Barkanic (1992) reported a study using a nationally normed standardized
achievement test where calculators had been experimentally introduced sev-
eral times at different grade levels. Both age and ability were found to influ-
ence test performance when calculators were permitted. Bridgeman, Harvey,
and Braswell (1995) reported a study of 215 students who took SAT mathe-
matics items, and the results favored the use of calculators. In fact, Bridge-
man et al. reported that one national survey indicated that 98% of all
students have family-owned calculators and 81% of 12th-grade students reg-
ularly use calculators. The universality of calculators coupled with the eco-
logical validity of using calculators to solve mathematics problems seems to
weigh heavily in favor of calculator usage in mathematical problem-solving
tests. Bridgeman et al. concluded that the use of calculators may increase va-
lidity but test developers need to be cautious about the nature of the problems
where calculators are used.
Therefore, research shows that the use of calculators should be governed by
the nature of the task at hand and the role that calculators are supposed to play
in answering the question. Thus, the actual format of the item (e.g., MC or TF)
is not the issue in determining whether a calculator should be used. Instead, we
need to study the mental task required by the item before making the decision
to use a calculator.
Calculators should be used with test items if the intent is to facilitate compu-
tations as part of the response to the test items. With standardized tests, calcu-
lators should be used in such a way as to minimize the variability of experience
in using calculators, and interpretations should be made cautiously in this light.
This is analogous to giving a test in English to a non-English speaker and draw-
ing the conclusion that the person cannot read. Calculators should not be used
if the standardized test was normed under conditions where calculators were
not used. Thus, using calculators may provide an advantage that will bias the
reporting and use of test scores. If the test is classroom specific, the use of calcu-
lators can be integrated with instruction, and any novelty effect of calculator
use can be avoided.

Computer-Based MC Testing

Technology is having a profound effect on testing. The advent of com-
puter-based forms of testing is changing not only how MC items are being ad-
ministered but even the design of test items. Although chapter 11 discusses
some trends and problems we face with computerization in testing, in this sec-
tion we address threats to validity that arise from using the computer to ad-
minister MC items. One of the primary issues here is whether computer
administration of an MC item produces the same result as paper-and-pencil
administration. Although computer-based testing is on the increase, pub-
lished studies of the equivalence of item performance as a function of com-
puter or traditional administration are scarce. Such studies may be part of a testing program's more mundane validation work, but there is little publicly available
information about equivalence of performance across these differing admin-
istration types.
Huff and Sireci (2001) raised several issues related to computer-adminis-
tered testing. Computer platform familiarity is one factor that may affect stu-
dent performance. The efficiency or ease of use for the user interface is another
issue. The time spent on each item is still another issue. Do students have am-
ple time to finish each item or even review the item, as they would with a pa-
per-and-pencil test? The role of test anxiety with computer-administered tests
is another issue.
Some testing programs use computerized adaptive testing, which requires
many items of mid-difficulty. Therefore, the demand to increase the item pool
is greater. Overexposure of items poses a threat to the validity of item re-
sponses. Even worse, items are being copied from tests such as the Graduate
Record Examination and posted on the World Wide Web (http://www.ets.org/news/03012101.html). Indeed, Standard 12.19 of the Standards for Educational
and Psychological Testing (AERA et al., 1999) provides caution about threats to
validity involving computerized testing.

On the Value of Accompanying Graphs, Tables, Illustrations, and Photographs

Many standardized tests and credentialing tests use graphs, tables, illustra-
tions, or photographs as part of the item. There is some research and many pros
and cons to consider before choosing to use accompanying material like this.
Primary among the reasons for using material is that it completes the presen-
tation of the problem to be solved. In many testing situations it is inconceivable
that we would not find such material. Imagine certification tests in medicine
for plastic surgery, ophthalmology, dermatology, orthopedic surgery, and oto-
laryngology that would not have items that present patient diseases, injuries, or
congenital conditions in as lifelike a manner as possible. Tests in virtually any
subject matter can be enhanced by visual material. However, are such items
better than items with no visual material? Washington and Godfrey (1974) re-
ported a study on a single military test where the findings provided a scant mar-
gin of advantage for illustrated items. Lacking descriptive statistics, this study
can hardly be taken as conclusive.
The argument against using illustrated items is that they require more
space and take more time to read. One would have to have a strong rationale
for using these items. That is, the test specifications or testing policies would
have to justify illustrated items. The main advantage might be face validity.

Dictionaries

To the extent that students have additional aids in taking tests there may be an
improvement or decrement in the validity of test score interpretations. Calcu-
lators are one of these aids, and dictionaries are another that may prove useful
in tests where the language used is not native to the examinee. Nesi and Meara
(1991) studied the effect of dictionary usage in a reading test, citing an earlier
study where the use of dictionaries did not affect test scores or administration
time. In this study, they found similar results, but noted that dictionaries in
both studies did not necessarily provide information useful to students. It
seems that the provision of any aid would have to be justified on the grounds
that it reduces or eliminates a construct-irrelevant influence on test perfor-
mance. Research by Abedi and his colleagues (Abedi, Lord, Hofstetter, &
Baker, 2000) has uncovered the importance of using language on MC tests that
is understandable to students, particularly when reading comprehension is not
the object of the test. Having a student glossary and extra time seemed helpful
to students who traditionally score low on tests given in English where their na-
tive language is not English. As with calculators, the issue is more complex than it
seems on the surface.

Dangerous Answers

The purpose of any licensing or certification test is to pass competent candidates and fail incompetent candidates, to protect the public from incompetent
practitioners. In the health professions, one line of promising research has been
the use of dangerous answers, distractors that if chosen would have seriously
harmful effects on patients portrayed in the problem. The inference is that a
physician who chooses a dangerous answer potentially endangers his or her pa-
tients. The use of dangerous distractors in such tests assists in the identification
of dangerously incompetent practitioners.
Skakun and Gartner (1990) provided a useful distinction. Dangerous an-
swers are choices of actions that cause harm to patients, whereas deadly an-
swers are fatal actions. Their research showed that items can be successfully
written and that the inclusion of such items was judged to be content relevant by
appropriate content review committees of professional practitioners. The
study by Slogoff and Hughes (1987), however, provided a more thorough anal-
ysis. First, they found that passing candidates chose 1.6 dangerous answers and
failing candidates chose 3.4 dangerous answers. In a follow-up of 92 passing
candidates who chose 4 or more dangerous answers, a review of their clinical
practices failed to reveal any abnormalities that would raise concern over their
competence. They concluded that the use of such answers was not warranted.
Perhaps the best use of dangerous answers is in formative testing during medi-
cal education and training in other professions.

SUMMARY

This chapter presented and evaluated eight types of MC item formats. The
conventional MC, AC, matching, EM, TF, and MTF formats are clearly useful
for testing the recall and understanding of knowledge and many cognitive
skills. The complex MC is not recommended. The item set is the most promis-
ing because it seems well suited to testing for the application of knowledge and
skills in complex settings. Scoring item sets presents a challenge as well. How-
ever, with significant interest in testing cognitive abilities, the item set seems to
be the most valued member of the family of MC formats. Some experimental
item types were briefly discussed, but these formats need more research before
they are recommended for classroom or large-scale testing. Table 4.1 summa-
rizes the perceived value of these formats for measuring knowledge, cognitive
skills, and aspects of abilities.

TABLE 4.1
Multiple-Choice Item Formats and the Content They Can Measure

Format                           Knowledge    Cognitive Skills    Ability

Conventional multiple choice         X               X
Alternate choice                     X               X
Matching                             X               X
Extended matching                    X               X               X
True-false                           X               X
Complex multiple choice              X               X
Multiple true-false                  X               X
Pictorial item set                   X               X               X
Problem-solving item set                                             X
Vignette or scenario item set                                        X
Interlinear item set                                                 X


5
Guidelines for Developing
MC Items

OVERVIEW

As noted in chapter 1, the item-development process for any testing program
involves many steps. Despite our best efforts, the number of items that sur-
vive after all item-development activities, checks, and reviews may be only
around 50% (Holtzman, Case, & Ripkey, 2002). With such a low survival rate
for new items, we want to use all the strategies possible to make the test item
as good as it can be. This chapter presents guidelines for developing MC items
and provides many examples of good and bad item-writing practices. A set of
guidelines such as provided in this chapter should be adopted for use for any
testing program. All new items should be subjected to a review for adherence
to guidelines.
Despite the existence of the guidance found in this chapter on writing MC
items, Bormuth (1970), among many others, observed that item writing is not
yet a science. We have ample evidence that when items are written without
regard for item-writing guidelines that are featured in this chapter, the conse-
quences can be negative. Richichi (1995) experimented with an instructor's
set of test items for introductory psychology and analyzed the items using
item response theory (IRT). He found items that violated item-writing guide-
lines to be harder and less discriminating than nonflawed items. Training
item writers matters. Jozefowicz et al. (2002) studied the quality of items writ-
ten by trained and untrained item writers and found substantial differences in
quality. Downing (2002a, 2002b) reported that items written by untrained
item writers for evaluating student learning typically have item-writing flaws
described in this chapter. He found that students most likely to perform
poorly on flawed items were the low-achieving students. Katz and
Lautenschlager (1999) experimented with providing and not providing the
reading passage to items found on the SAT. They found that some students
could answer the items without the passage because of their out-of-school experience and testwiseness,
thus casting some doubt on the capability of MC formats for measuring read-
ing comprehension. The problem with measuring reading comprehension is
not with the format used but with writing items that are truly passage dependent rather than answerable independently of the passage.
The basis for the guidelines presented in this chapter draws from several
sources. One source is research by Haladyna and Downing (1989a, 1989b).
This work began in the mid-1980s. The first study involved an analysis of 46
textbooks and other sources on how to write MC test items (Haladyna &
Downing, 1989a). The result was a list of 43 item-writing guidelines. Author
consensus existed for many of these guidelines. Yet for other guidelines, a lack
of consensus was evident. A second study by Haladyna and Downing (1989b)
involved an analysis of more than 90 research studies on the validity of these
item-writing guidelines. Only a few guidelines received extensive study. Nearly
half of these 43 guidelines received no study at all. Since the appearance of
these two studies and the 43 guidelines, Haladyna et al. (2002) reprised this
study. They examined 27 new textbooks and more than 27 new studies of these
guidelines. From this review, the original 43 guidelines were reduced to a leaner
list of 31 guidelines.
This chapter has a set of MC item-writing guidelines that apply to all MC
formats recommended in the previous chapter and specific guidelines that
uniquely apply to specific MC formats. Item writers should apply these guide-
lines judiciously but not rigidly, as the validity of some guidelines still may be in
question.

GENERAL ITEM-WRITING GUIDELINES

Table 5.1 presents a list of general item-writing guidelines that can be applied to
all the item formats recommended in chapter 4. These guidelines are organized
by categories. The first category includes advice about content that should be
addressed by the SME item writer. The second category addresses style and for-
matting concerns that might be addressed by an editor. The third category is
writing the stem, and the fourth category is writing the options including the
right answer and the distractors. In the rest of this section, these guidelines are
discussed and illustrated when useful.

Content Concerns

1. Every Item Should Reflect Specific Content and a Single Specific Cognitive Process. Every item has a purpose on the test, based on the test specifica-
tions. Generally, each item has a specific content code and cognitive demand
TABLE 5.1
General Item-Writing Guidelines*

Content Guidelines
1. Every item should reflect specific content and a single specific cognitive process, as
called for in the test specifications (table of specifications, two-way grid, test
blueprint).
2. Base each item on important content to learn; avoid trivial content.
3. Use novel material to measure understanding and the application of knowledge
and skills.
4. Keep the content of an item independent from content of other items on the test.
5. Avoid overspecific or overgeneral content.
6. Avoid opinion-based items.
7. Avoid trick items.
Style and Format Concerns
8. Format items vertically instead of horizontally.
9. Edit items for clarity.
10. Edit items for correct grammar, punctuation, capitalization, and spelling.
11. Simplify vocabulary so that reading comprehension does not interfere with testing
the content intended.
12. Minimize reading time. Avoid excessive verbiage.
13. Proofread each item.
Writing the Stem
14. Make directions as clear as possible.
15. Make the stem as brief as possible.
16. Place the main idea of the item in the stem, not in the choices.
17. Avoid irrelevant information (window dressing).
18. Avoid negative words in the stem.
Writing Options
19. Develop as many effective options as you can, but two or three may be sufficient.
20. Vary the location of the right answer according to the number of options. Assign the
position of the right answer randomly.
21. Place options in logical or numerical order.
22. Keep options independent; choices should not be overlapping.
23. Keep the options homogeneous in content and grammatical structure.
24. Keep the length of options about the same.
25. None of the above should be used sparingly.
26. Avoid using all of the above.
27. Avoid negative words such as not or except.
28. Avoid options that give clues to the right answer.
29. Make distractors plausible.
30. Use typical errors of students when you write distractors.
31. Use humor if it is compatible with the teacher; avoid humor in a high-stakes test.

*These guidelines apply to the multiple-choice, alternate-choice, matching, extended-matching, true-false, multiple true-false, and item set formats. Some of these guidelines may not directly apply to the true-false format.

code. The content code can come from a topic outline or a list of major topics.
In chapter 2, it was stated that all content can essentially be reduced to facts,
concepts, principles, or procedures. But generally, topics subsume this distinc-
tion. The cognitive demand is usually recall or understanding. But if the intent
of the item is to infer status to an ability, such as problem solving, the applica-
tion of knowledge and skills is assumed.

2. Base Each Item on Something Important to Learn; Avoid Trivial Content.


The judgment of the importance of content and the cognitive demand is sub-
jective. Fortunately, in large-scale testing programs, we have checks and bal-
ances in the review processes. Other content experts can help decide if content
is too trivial. In the classroom, the teacher can survey students to enlist their
help in deciding if a particular test item measures something that does not seem
very important to learn.
Example 5.1 shows the difference between trivial and important content for
a hypothetical class of preteenagers. The first stem asks for a fact, a percentage,
that may be meaningless to most students. The second question addresses a
major health problem in the world and seems more relevant to this class. The
distinction drawn here can only be made by an SME, and it is a subjective de-
termination. In testing programs, committees work together in deciding
whether content is trivial or important.

What is the nicotine content of a typical cigarette?


To which disease has cigarette smoking been linked?

EXAMPLE 5.1. Trivial and important content.


3. Use Novel Material to Test for Understanding and Other Forms of
Higher Level Learning. As suggested in chapter 2 and emphasized through-
out this book, the testing of understanding instead of recall is important and
can be done using some strategies where a concept, principle, or procedure is
tested in a novel way. To achieve novelty, the content presented in a textbook
or during instruction is not reproduced in the test item. Instead, we ask the stu-
dent to identify an example of something, or we paraphrase a definition and see
if the student can link the paraphrased definition to a concept, principle, or
procedure. For more complex types of learning, we present scenarios or vi-
gnettes that ask for critical thinking or problem solving. Each scenario or vi-
gnette is new; therefore, recall is not tested. Example 5.2 shows two items. The
first item provides dictionary definitions that are likely to be memorized. The
second item provides examples of writing, and the student who understands a
metaphor is supposed to select the correct example.

Which is the best definition of a metaphor?

A. Metaphors describe something as if it were something else.


B. Metaphors make comparisons to other things.
C. Metaphors are trite, stereotyped expressions.

Which of the following is a metaphor?

A. At the breakfast buffet, I ate like a pig.


B. My cat has fur like knotted wool.
C. She is like a rose, full of thorns and smelly.

EXAMPLE 5.2. Familiar and novel material.

The use of examples and nonexamples for testing concepts such as similes,
metaphors, analogies, homilies, and the like is easy. You can generate lists of
each and mix them into items as needed.

4. Keep the Content of an Item Independent of the Content of Other Items on the Test. A tendency when writing sets of items is to provide information
in one item that helps the test taker answer another item. For example, con-
sider a line of questioning focusing on main ideas of a novel, as shown in Exam-
ple 5.3. Once a student correctly answers Item 1, this testwise student will look
for clues in the next item. If Roxie is correct for Item 1, it must be incorrect for
Item 2. Kate and Roxie were mentioned in Items 1 and 2, whereas Sara was not
mentioned in Item 1. Might Sara be the right answer? Yes.

The following questions come from the story Stones from Ybarra.

1. Who was Lupe's best friend?


A. Kate
B. Dolores
C. *Roxie

2. Who was quarreling with Lupe?


A. Kate
B. *Sara
C. Roxie

EXAMPLE 5.3. Dependent items.

Testwise students use these kinds of strategies to select answers to items.


Therefore, it is testwiseness, not learning, that determines whether they choose
the right answer. In writing sets of items from a common stimulus, care must be
exercised to avoid this kind of cuing.

5. Avoid Overspecific and Overgeneral Content. The concept of specificity of knowledge refers to a continuum that ranges from too specific to too
general. Most items should probably be written with this continuum in mind.
We should avoid the extremes of this continuum. Overspecific knowledge
tends to be trivial to the domain of knowledge intended. General knowledge
may have many exceptions, and the items are ambiguous. The two examples in
Example 5.4 illustrate these two extremes. The first item is very specific,
whereas the second item is very general.

5. Who wrote the Icon of Seville?


A. Lorca
B. Ibanez
C. Rodriguez

5. Which is the most serious problem in the world?


A. Hunger
B. Lack of education
C. Disease

EXAMPLE 5.4. Too specific and too general content.


A danger in being too specific is that the item may be measuring trivial con-
tent, the memorization of a fact. The judgment of specificity and generality is
subjective. Each item writer must decide how specific or how general each item
must be to reflect adequately the content topic and type of mental behavior de-
sired. Items also should be reviewed by others, who can help judge the specific-
ity and generality of each item.

6. Avoid Opinion-Based Items. This advice derives from the value that items should reflect well-known and publicly supported facts, con-
cepts, principles, and procedures. To test a student on an opinion about any
content seems unfair, unless the opinion is qualified by some logical analy-
sis, evidence, or presentation during instruction. The items in Example 5.5
show an unqualified opinion and a qualified opinion. The former item
seems indefensible, whereas the second item is probably more defensible. In
other words, the criteria for judging "best" in Item 1 are unclear. Items like
these need qualifiers.

Unqualified Opinion Item

6. Which is the best comedy film ever made?


A. Abbott and Costello Go to Mars
B. Young Frankenstein
C. A Day at the Races

Qualified Opinion Item

6. According to the American Film Institute, which is the greatest


American film?
A. It Happened One Night
B. Citizen Kane
C. Gone With the Wind
D. Star Wars

EXAMPLE 5.5. Unqualified and qualified opinion items.

7. Avoid Trick Items. Trick items are intended to deceive the test taker
into choosing a distractor instead of the right answer. Trick items are hard to
illustrate. In a review and study, Roberts (1993) found just a few references in
the measurement literature on this topic. Roberts clarified the topic by distin-
guishing between two types of trick items: items deliberately intended by the
item writer, and items that accidentally trick test takers. Roberts's students
reported that in tests where more tricky items existed, these tests tended to be
more difficult. Roberts's study revealed seven types of items that students
perceived as tricky, including the following:

1. The item writer's intention appeared to deceive, confuse, or mislead test takers.
2. Trivial content was represented (which violates one of our item-writing
guidelines).
3. The discrimination among options was too fine.
4. Items had window dressing that was irrelevant to the problem.
5. Multiple correct answers were possible.
6. Principles were presented in ways that were not learned, thus deceiving
students.
7. Items were so highly ambiguous that even the best students had no idea
about the right answer. This type of trick item may also reflect a violation
of Guideline 2.

The open-ended items in Example 5.6 are trick items. Yes, there is a fourth
of July in England as there is around the world. All months have 28 days. It was
Noah not Moses who loaded animals on the ark. The butcher weighs meat.
Items such as these are meant to deceive you not to measure your knowledge.
Trick items often violate other guidelines stated in Table 5.1. Roberts encour-
aged more work on defining trick items. His research has made a much-needed
start on this topic.

Is there a fourth of July in England?


Some months have 31 days. How many have 28?
How many animals of each gender did Moses bring on his ship?
A butcher in the meat market is six feet tall. What does he weigh?

EXAMPLE 5.6. Some trick items.

A negative aspect of trick items is that if they are frequent enough, they
build in test takers an attitude of distrust and a potential lack of respect for the testing process. There are enough problems in testing without
contributing more by using trick items. As Roberts (1993) pointed out, one of
the best defenses against trick items is to allow students opportunities to chal-
lenge test items and to allow them to provide alternative interpretations. Dodd
and Leal (2002) argued that the perception that MC items are "tricky" may in-
crease test anxiety. They employ answer justification that eliminates both the
perception and reality of trick items. If all students have equal access to appeal-
ing a trick item, this threat to validity is eliminated. Such procedures are dis-
cussed in more detail in chapter 8.

Style and Format Concerns

8. Format Items Vertically Instead of Horizontally. Example 5.7 presents the same item formatted horizontally and vertically. The advantage of
horizontal formatting is economy; you can fit more items on a page. If appear-
ance is important, vertical formatting looks less cramped and has a better visual
appeal. With students who may have test anxiety, horizontal formatting may be
harder to read, thus confusing students and lowering test scores.

8. You draw a card from a deck of 52 cards. What is the chance you will draw a card with an odd number on it?
A. 36/52 B. 32/52 C. About one half

8. You draw a card from a deck of 52 cards. What is the chance you will draw a card with an odd number on it?
A. 36/52
B. 32/52
C. About one half

EXAMPLE 5.7. Item formatted horizontally and vertically.

9. Edit Items for Clarity. Early in the development of an item, that item
should be subject to scrutiny by a qualified editor to determine if the central idea
is presented as clearly as possible. Depending on the purpose of the test and the
time and other resources devoted to testing, one should always allow for editing.
Editing for clarity does not guarantee a good item. However, we should never
overlook the opportunity to improve each item using editing for clarity.
We should note a caution here. Cizek (1991) reviewed the research on editing
test items. He reported findings that suggested that if an item is already being
effectively used, editorial changes for improving clarity may disturb the perfor-
mance characteristics of those test items. Therefore, the warning is that editing
should not be done on an operational item that performs adequately. On the
other hand, O'Neill (1986) and Webb and Heck (1991) reported no differ-
ences between edited and unedited items.

10. Edit Items for Correct Grammar, Punctuation, Capitalization, and Spelling. Later in the test-development process, editing to ensure that each
item has correct grammar, punctuation, capitalization, and spelling is also im-
portant. Acronyms may be used, but they should be used carefully. Gen-
erally, acronyms are explained in the test before being reused.
Dawson-Saunders et al. (1992, 1993) experimented with a variety of al-
terations of items. They found that reordering options along with other edi-
torial decisions may affect item characteristics. A prudent strategy would be
to concentrate on editing the item before instead of after its use. If editing
does occur after the first use of the item, these authors suggested that one
consider content editing versus statistical editing. The former suggests that
content changes are needed because the information in the item needs to be
improved or corrected. Statistical alteration would be dictated by informa-
tion showing that a distractor did not perform and should be revised or re-
placed. The two kinds of alterations may lead to different performances of
the same item. Expert test builders consider items that have been statisti-
cally altered as new. Such items would be subject to pretesting and greater
scrutiny before being used in a test. Reordering options to affect key balanc-
ing should be done cautiously.

11. Simplify Vocabulary. The purpose of most MC achievement tests is to measure knowledge and skills that were supposed to be learned. In some
circumstances, the MC format may be good for measuring aspects of problem
solving, critical thinking, or other cognitive abilities. Other than a reading
comprehension test, a test taker's level of reading comprehension should not
affect test performance. Therefore, vocabulary should be simple enough for
the weakest readers in the tested group. If reading comprehension is con-
founded with the achievement being measured, the test score will reflect a
mixture of reading comprehension ability and the knowledge or ability you
intended to measure.
Abedi et al. (2000) reported results of experimental studies done with lim-
ited-English-proficient and English-proficient students where the language
of the test was presented in regular and simplified English forms. The lim-
ited-English-proficient students performed better when the language was
simplified. Given that many test takers are learning to read, write, speak, and
listen in a new language, a challenging vocabulary coupled with complex sen-
tence structures can add an unfair burden on these students and result in un-
deserved lower scores. Also, students with low reading comprehension are
equally at risk for a low score because of this disability rather than low
achievement.

12. Minimize Reading Time. Items may be unnecessarily wordy. Verbosity is also an enemy to clarity. If test items are characteristically long and require
extensive reading, administration time will be longer. If a set time is used for
testing, verbose items limit the number of items we can ask in that set time,
which has a negative effect on the adequacy of sampling of content and on the
reliability of test scores. For these many good reasons, we try to write MC items
that are as brief as possible without compromising the content and cognitive
demand we require. This advice applies to both the stem and the options.
Therefore, as a matter of writing style, test items should be crisp and lean. They
should get to the point in the stem and let the test taker choose among plausi-
ble options that are also as brief as possible. Example 5.8 shows an item with
repetitious wording. The improved version eliminates this problem.

12. Effective student grading should probably minimize


A. the student's present status.
B. the student's progress against criteria stated in the
syllabus.
C. the student's status relative to his or her ability.
D. the student's status relative to the class.
E. the student's progress relative to state standards.

12. Which should determine a student's grade?


A. Achievement against stated criteria
B. Status relative to other class members
C. Progress relative to his or her ability

EXAMPLE 5.8. Repetitious wording in the options and an improved version.

13. Proofread Each Item. A highly recommended procedure in the production of any test is proofreading. Despite many reviews and checks and bal-
ances, errors will appear on a test. A good rule of thumb from expert editors is
that if you spot three errors in the final proofing phase of test development, you
have probably missed one error.
Errors suggest carelessness and negligence, perhaps the tip of a great iceberg:
poor test development. You do not want to convey this impression to test tak-
ers. Another issue is that such errors are often distracting to test takers, partic-
ularly those who have test anxiety. By failing to be more careful in proofing,
errors may cause test takers to perform more poorly and score lower than they
would have had the errors not been there. Finally, such errors may reduce the
clarity of expression in the item.

Writing the Stem

14. Make Directions as Clear as Possible. The stem should be written in a way that the test taker knows immediately what the focus of the item is.
When we phrase the item, we want to ensure that each student has a reason-
ably good chance of knowing what situation or problem is presented in the
stem. Example 5.9 presents two examples of directions.
In the bad example in Example 5.9, the student has to guess what happened
to the flower pot by looking at the options. In the good example, the turning of
the pot and the passing of a week are more specific about the expectation that the plant is phototropic and grows toward the light source.

Bad Example:

14. A plant in a flower pot fell over. What happened?

Improvement:

14. A plant growing in a flower pot was turned on its side. A week later, what would you expect to see?

EXAMPLE 5.9. Unclear and clear directions in the stem.

15. Make the Stem as Brief as Possible. As noted with Guideline 12,
items that require extended reading lengthen the time needed for students to
complete a test. This guideline urges item writers to keep the stem as brief as
possible for the many good reasons offered in Guideline 12. Example 5.10 illus-
trates both lengthy and brief stems.

16. Place the Main Idea in the Stem, Not in the Choices. Guideline 12
urges brief items, and Guideline 15 urges a brief stem. Sometimes, the stem
might be too brief and uninformative to the test taker. The item stem should al-
ways contain the main idea. The test taker should always know what is being
asked in the item after reading the stem. When an item fails to perform as in-
tended with a group of students who have received appropriate instruction,
there are often many reasons. One reason may be that the stem did not present
the main idea. Example 5.11 provides a common example of a stem that is too
brief and uninformative. This item-writing fault is called the unfocused stem.
As you can see, the unfocused stem fails to provide adequate information to
address the options. The next item in Example 5.11 is more direct. It asks a
question and provides three plausible choices.

17. Avoid Irrelevant Information (Window Dressing). Some items contain words, phrases, or entire sentences that have nothing to do with the
problem stated in the stem. One reason for doing this is to make the item
Destruction of certain cortical tissue will lead to symptoms affecting our behavior. However, this destruction may lead to
symptoms that are due to withdrawal of facilitation in other cortical
areas. Thus, the tissue damage affects both directly and indirectly
cortical functioning. While such effects may be temporary, what is
the typical recovery time?

A. Immediately
B. Several days to a week
C. Several months
D. Seldom ever

Alternative wording to achieve brevity:


What is the typical recovery time for cortical functioning when
tissue is destroyed?

EXAMPLE 5.10. Lengthy and briefer stems.

Unfocused Stem

15. Corporal punishment


A. has been outlawed in many states.
B. is psychologically unsound for school discipline.
C. has many benefits to recommend its use.

Focused Stem

15. What is corporal punishment?


A. A psychologically unsound form of school discipline
B. A useful disciplinary technique if used sparingly
C. An illegal practice in our nation's schools

EXAMPLE 5.11. Focused and unfocused stems.

look more lifelike or realistic, to provide some substance to it. We use the
term window dressing to imply that an item has too much decoration and not
enough substance. Example 5.12 shows window dressing. For many good
reasons discussed in Guidelines 9, 11, 12, 14, and 15, window dressing is not
needed.

Window Dressing
High temperatures and heavy rainfall characterize a humid
climate. People in this kind of climate usually complain of heavy
perspiration. Even moderately warm days seem uncomfortable.
Which climate is described?
A. Savanna
B. *Tropical rainforest
C. Tundra

Window Dressing Removed


Which term below describes a climate with high temperatures and
heavy rainfall?
A. Savanna
B. *Tropical rainforest
C. Tundra

EXAMPLE 5.12. Window dressing.

However, there are times when verbiage in the stem may be appropriate. For
example, in problems where the test taker sorts through information and distin-
guishes between relevant and irrelevant information to solve a problem, exces-
sive information is necessary. Note that the phrase window dressing is used
exclusively for situations where useless information is embedded in the stem
without any purpose or value. In this latter instance, the purpose of excessive in-
formation is to see if the examinee can separate useful from useless information.
In Example 5.13, the student needs to compute the discount, figure out the
actual sales price, compute the sales tax, add the tax to the actual sale price,
and compare that amount to $9.00. The $12.00 is irrelevant, and the student is
supposed to ignore this fact in the problem-solving effort. This is not window
dressing because the objective in the item is to have the student discriminate
between relevant and irrelevant information.

A compact disc at the music store was specially priced at $9.99, but typically sells at $12.00. This weekend, it was marked at a 20%
discount from this special price. Sales tax is 6%. Tina had $9.00 in
her purse and no credit card. Does Tina have enough money to
buy this compact disc?

EXAMPLE 5.13. No window dressing.
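A worked solution is offered here only as an illustration of the computation the item demands, with amounts rounded to the nearest cent:

\[
\$9.99 \times (1 - 0.20) = \$7.99, \qquad \$7.99 \times 1.06 \approx \$8.47, \qquad \$8.47 < \$9.00,
\]

so Tina does have enough money, and the $12.00 regular price never enters the calculation.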


18. Avoid Negative Words in the Stem. We have several good reasons
for supporting this guideline. First, we have a consensus of experts in the field of
testing who feel that the use of negative words in the stem has negative effects
on students and their responses to such items (Haladyna et al., 2002). Some re-
search on the use of negative words also suggests that students have difficulty
understanding the meaning of negatively phrased items. A review of research
by Rodriguez (2002) led to his support of this guideline. Tamir (1993) cited re-
search from the linguistics literature showing that negatively phrased items require
about twice as much working memory as equivalent positively phrased forms of
the same item. Negative words appearing both in the stem and in one or more
options might require four times as much working memory as a positively
phrased equivalent item. Tamir's study led to a conclusion that for items with
low cognitive demand, negative phrasing had no effect, but that for items with
high cognitive demand, negatively phrased items were more difficult. Tamir
also found that the difference between positive and negative forms of an item varied as a
function of the type of cognitive processing required. Taking into account the
various sources of evidence about negative items, it seems reasonable that we
should not use negative wording in stems or in options.
Example 5.14 shows the use of the EXCEPT format, where all answers meet
some criterion except one. Although this is a popular format and it may per-
form adequately, this kind of item puts additional strain on test takers in terms
of short-term working memory. Consequently, it probably should be avoided.

17. Each of the following drugs is appropriate for the treatment of


cardiac arrhythmia EXCEPT one. Which one is the exception?
A. Phenytoin
B. Lidocaine
C. Quinidine
D. Propranolol
E. Epinephrine

EXAMPLE 5.14. Use of negative word in an item.

According to Harasym, Doran, Brant, and Lorscheider (1992), a better way to phrase such an item is to remove the NOT and make the item an MTF with
more options. Example 5.15 shows this transformation to MTF.
Another benefit of this transformation is that because the options now be-
come items, we have more scorable units, which is likely to increase test score
reliability.
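The familiar Spearman-Brown projection, cited here only as a general illustration and under the assumption that the new true-false units function about as well as the original items, shows why adding scorable units tends to raise reliability:

\[
\rho_k = \frac{k\rho_1}{1 + (k - 1)\rho_1},
\]

where \(\rho_1\) is the reliability at the original test length and \(k\) is the factor by which the number of scorable units grows. For example, if \(\rho_1 = .70\) and the conversion doubles the number of scorable units (\(k = 2\)), the projected reliability is \(2(.70)/1.70 \approx .82\). Because items within a cluster share a stem, the actual gain is usually somewhat smaller than this projection.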
If a negative term is used, it should be stressed or emphasized by placing it
in bold type, capitalizing it, or underlining it, or all of these. The reason is that

For treating cardiac arrhythmia, which of the following drugs are appropriate? Mark A if true or B if false.
1. Phenytoin
2. Lidocaine
3. Quinidine
4. Propranolol
5. Epinephrine

EXAMPLE 5.15. Use of the multiple true-false format as a replacement.

the student often reads through the NOT and forgets to reverse the logic of
the relation being tested. This is why the use of NOT is not recommended for
item stems.

Writing the Choices

19. Use as Many Choices as Possible, but Three Seems to Be a Natural Limit.
A growing body of research supports the use of three options for conven-
tional MC items (Andres & del Castillo, 1990; Bruno & Dirkzwager, 1995;
Haladyna & Downing, 1993; Landrum, Cashin, & Theis, 1993; Lord, 1977;
Rodriguez, 1997; Rogers & Harley, 1999; Sax & Reiter, n.d.; Trevisan, Sax, &
Michael, 1991, 1994). To summarize this research on the optimal number of
options, evidence suggests a slight advantage to having more options per test
item, but only if each distractor is discriminating. Haladyna and Downing
(1993) found that many distractors do not discriminate. Another implication
of this research is that three options may be a natural limit for most MC items.
Thus, item writers are often frustrated in finding a useful fourth or fifth option
because such options typically do not exist.
The advice given here is that one should write as many good distractors as
one can but should expect that only one or two will really work as intended. It
does not matter how many distractors one produces for any given MC item, but
it does matter that each distractor performs as intended. This advice runs
counter to what is practiced in most standardized testing programs. However,
both theory and research support the use of one or two distractors in the design
of a test item. In actuality, when we use four or five options for a conventional
MC test item, the existence of nonperforming distractors is nothing more than
window dressing. Thus, test developers have the dilemma of producing unnec-
essary distractors, which do not operate as they should, for the appearance of
the test, versus producing tests with fewer options that are more likely to do
what they are supposed to do.

One criticism of using fewer instead of more options for an item is that guess-
ing plays a greater role in determining a student's score. The use of fewer dis-
tractors will increase the chances of a student guessing the right answer.
However, the probability that a test taker will increase his or her score signifi-
cantly over a 20-, 50-, or 100-item test by pure guessing is infinitesimal. The
floor of a test containing three options per item for a student who lacks knowl-
edge and guesses randomly throughout the test is 33% correct. Therefore, ad-
ministering more test items will reduce the influence of guessing on the total
test score. This logic is sound for two-option items as well, because the floor of
the scale is 50% and the probability of a student making 20, 50, or 100 success-
ful randomly correct guesses is very close to zero. In other words, the threat of
guessing is overrated.
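The arithmetic behind this claim is easy to check; the figures below are illustrative computations, not results reported in the studies cited. With three options, the expected score under pure random guessing is one third of the items, and the binomial probability of guessing every item correctly collapses as the test gets longer:

\[
\left(\tfrac{1}{3}\right)^{20} \approx 2.9 \times 10^{-10}, \qquad \left(\tfrac{1}{2}\right)^{50} \approx 8.9 \times 10^{-16}.
\]

Even with two-option items, a blind guesser is overwhelmingly likely to stay near the 50% floor rather than drift far above it on a test of reasonable length.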

20. Vary the Location of the Right Answer According to the Number of
Options. Assign the Position of the Correct Answer Randomly. The ten-
dency to mark in the same response category is response set. Also, testwise stu-
dents are always looking for clues that will help them guess. If the first item is
usually the correct answer, the testwise student will find this pattern and when
in doubt choose A. Therefore, we vary the location of the right answer to ward
off response set and testwise test takers. If we use a three-option format, about
33% of the time each of A, B, and C will be the right answer.
Recent research indicates that this guideline about key balancing may have
some subtle complications. Attali and Bar-Hillel (2003) and Bar-Hillel and
Attali (2002) posited an edge aversion theory that the right answer is seldom
found in the first and last options, thus offering an innocent clue to test takers
to guess middle options instead of "edge" options. Guessing test takers have a
preference for middle options as well. Balancing the key so that correct answers are equally distributed across the MC options creates a slight bias because of edge aversion, and this affects estimates of difficulty and
discrimination. They concluded that correct answers should be randomly as-
signed to the option positions to avoid effects of edge aversion.
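A minimal Python sketch of the random assignment this guideline recommends is shown below; the function name, the alphabet of option letters, and the example item are illustrative assumptions rather than anything prescribed in this chapter.

    import random

    def randomize_key(options, correct):
        """Randomly assign the position of the correct answer.

        `options` is the full list of choices (correct answer included);
        `correct` is the text of the correct answer.  Returns the shuffled
        options and the letter of the keyed position after shuffling.
        """
        shuffled = options[:]        # work on a copy of the option list
        random.shuffle(shuffled)     # every ordering is equally likely
        key_letter = "ABCDEFGH"[shuffled.index(correct)]
        return shuffled, key_letter

    # Illustration with a three-option item
    shuffled, key = randomize_key(
        ["Achievement against stated criteria",
         "Status relative to other class members",
         "Progress relative to his or her ability"],
        "Achievement against stated criteria",
    )
    print(shuffled)
    print("Keyed answer:", key)

Over a long test, this approach leaves the keyed position free of any pattern a testwise examinee could exploit.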

21. Place Options in Logical or Numerical Order. In the formatting of test items for a test, the options should always appear in either logical or numer-
ical order. Example 5.16 shows two versions of the same item with the numeri-
cal order being wrong then right.
Answers should always be arranged in ascending or descending numeri-
cal order. Remember that the idea of the item is to test for knowledge in a di-
rect fashion. If a student has to hunt for the correct answer, we unnecessarily increase the stress level of the test and we waste the test
taker's time.
Another point concerns the placement of decimal points in quantitative answers.
Decimal points should be aligned for easy reading. Example 5.17 shows the

Wrong:
What is the cost of an item that normally sells for $9.99 that is discounted 25%?
A. $5.00
B. *$7.50
C. $2.50
D. $6.66

Right:
What is the cost of an item that normally sells for $9.99 that is discounted 25%?
A. $2.50
B. $5.00
C. $6.66
D. *$7.50

EXAMPLE 5.16. Numerical order of options.

You are dividing the bill of $19.45 equally among four of us who had lunch together. But your lunch item cost $9.55. What is your fair share of the bill?

First way:
A. .250
B. 0.491
C. .50

Second way:
A. 0.049
B. 0.250
C. 0.500

EXAMPLE 5.17. Two ways of expressing decimals.

same item with answers aligned two ways. Notice that the second way is easier
to follow than the first. Notice that the decimal point is aligned in the second
example for easy reading. Also, try to keep the number of decimal places con-
stant for uniformity.
Logical ordering is more difficult to illustrate, but some examples offer hints
at what this guideline means. Example 5.18 illustrates an item where options
were alphabetically ordered.
There are instances where the logical ordering relates to the form of the an-
swers instead of the content. In Example 5.19, answers should be presented in
order of length, short to long.

22. Keep Choices Independent; Choices Should Not Be Overlapping.


This item-writing fault is much like interitem cuing discussed in Guideline 4. If
options are overlapping, these options are likely to give a clue to the test taker
about the correct answer and the distractors.

21. Which is the most important consideration in preparing a waxajet?
A. *Lubricant
B. O-ring integrity
C. Positioning
D. Wiring

EXAMPLE 5.18. Arranged in alphabetical order.

21. When an item fails to perform on a test, what is the most common cause?
A. *The item is faulty.
B. Instruction was ineffective.
C. Student effort was inadequate.
D. The objective failed to match the item.

EXAMPLE 5.19. Options organized by length.

If a value contained in overlapping options is correct, the item may have two
or more correct answers. Example 5.20 illustrates this problem. Numerical
problems that have ranges that are close make the item more difficult. More
important in this example, Options A, B, C, and D overlap slightly. If the an-
swer is age 25, one can argue that both C and D are correct though the author
of the item meant C. This careless error can be simply corrected by developing
ranges that are distinctly different. The avoidance of overlapping options also
will prevent embarrassing challenges to test items.

22. What age range represents the physical "peak" of life?


A. 11 to 15 years of age
B. 13 to 19 years of age
C. 18 to 25 years of age
D. 24 to 32 years of age
E. over 32 years of age

EXAMPLE 5.20. Overlapping options.



23. Keep Choices Homogeneous in Content and Grammatical Structure.


The use of options that are heterogeneous in content is often a cue to the stu-
dent. Such cues are not inherent in the intent of the item but an unfortunate
accident. Therefore, the maintenance of homogeneous options is good advice.
Fuhrman (1996) suggested another way to view the issue of option homogene-
ity. If the correct answer is shorter or more specific or stated in other language,
perhaps more technical or less technical, these tendencies might make the item
easier. A standard practice of keeping options homogeneous avoids the possi-
bility of giving away a right answer. Example 5.21 illustrates three homogeneous options and one heterogeneous option. This odd combination may be a cue that
D is the right answer.

23. What reason best explains the phenomenon of levitation?


A. Principles of physics
B. Principles of biology
C. Principles of chemistry
D. Metaphysics

EXAMPLE 5.21. Lack of homogeneous options as a clue.

24. Keep the Length of Choices about the Same. One common fault in
item writing is to make the correct answer the longest. This may happen inno-
cently. The item writer writes the stem and the right answer, and in the rush to
complete the item adds two or three hastily written wrong answers that are
shorter than the right answer. Example 5.22 shows this tendency.

24. What effect does rehydroxy have on engine performance?


A. Increases the engine speed.
B. Less wear on pistons
C. Being the joint function of a transducer and piston
potential, it increases torque without loss in fuel economy

EXAMPLE 5.22. Which answer is correct?

25. None of the Above Should Be Used Sparingly. As a last option, none of the above is easy to construct. Research has increased controversy over
this guideline. Studies by Knowles and Welch (1992) and Rodriguez (2002)
did not completely concur about the use of none of the above. Haladyna et al.
(2002) surveyed current textbooks and found authors split on this guideline.
Frary (1993) supported this format, but with some caution. An argument fa-
voring using none of the above in some circumstances is that it forces the stu-
dent to solve the problem rather than choose the right answer. In these
circumstances, the student may work backward, using the options to test a so-
lution. In a study of none of the above by Dochy, Moerkerke, De Corte, and
Segers (2001) with science problems requiring mathematical ability, their re-
view, analysis, and research point to using none of the above because it is a
plausible and useful distractor, and they argue that students can generate
many incorrect answers to a problem. Thus, none of the above serves a useful
function in these complex problems with a quantitative answer. For items
with a lower cognitive demand, none of the above probably should not be used.
When none of the above is used, it should be the right answer an appropriate
number of times.

26. Avoid Using AH of the Above. The use of the choice all of the above
has been controversial (Haladyna & Downing, 1989a). Some textbook
writers have recommended and have used this choice. One reason may be
that in writing a test item, it is easy to identify one, or two, or even three
right answers. The use of the choice all of the above is a good device for cap-
turing this information. However, the use of this choice may help testwise
test takers. For instance, if a test taker has partial information (knows that
two of the three choices offered are correct), that information can clue the
student into correctly choosing all of the above. Because the purpose of a MC
test item is to test knowledge, using all of the above seems to draw students
into test-taking strategies more than directly testing for knowledge. One al-
ternative to the all of the above choice is the use of the MTF format. Another
alternative is simply to avoid all of the above and ensure that there is one and
only one right answer.

27. Avoid Negative Words Such as Not or Except. We should phrase stems positively, and the same advice applies to options. The use of negatives
such as not and except should also be avoided in options as well as the stem. Oc-
casionally, the use of these words in an item stem is unavoidable. In these cir-
cumstances, we should boldface, capitalize, italicize, or underline these words
so that the test taker will not mistake the intent of the item.

28. Avoid Options That Give Clues to the Right Answer. We have a
family of clues that tip off test takers about the right answer. They are as follows:

• Specific determiners. Specific determiners are so extreme that seldom are they the correct answers. Specific determiners include such terms as always, never, totally, absolutely, and completely. A specific determiner may oc-
casionally be the right answer. In these instances, their use is justified if the
distractors also contain other specific determiners. In Example 5.23, Option
A uses the specific determiner never and Option C uses the specific determiner always.

28. Which of the following does research on homework support?


A. Never assign homework on Fridays.
B. Homework should be consistent with class learning.
C. Always evaluate homework the next day.

EXAMPLE 5.23. Specific determiner clues.

• Clang associations. Sometimes, a word or phrase that appears in the item stem will also appear in the list of choices, and that word or phrase will
be the correct answer. If a clang association exists and the word or phrase is
not the correct answer, the item may be a trick question. Example 5.24
shows a clang association. The word TAX is capitalized to show that its ap-
pearance in the options clues the test taker.

28. What is the purpose of the TAX table? To help you determine
A. your gross income.
B. the amount of TAX you owe.
C. your net earnings.
D. your allowable deductions.

EXAMPLE 5.24. Example of clang association (tax).

• Options should be homogeneous with respect to grammar. Sometimes a grammatical error in writing options may lead the test taker to the right an-
swer, as shown in Example 5.25.
For the learner of tennis, all the options may make sense, but only B is
grammatically consistent with the partial-sentence stem.
• Options should be homogeneous with respect to content. If the options are
not homogeneous as shown in Example 5.26, the testwise student is likely to
choose the heterogeneous option. If D is the correct answer, it is tipped off by
the similarity among distractors. If another option is correct, this item might
be a trick item.

28. The most effective strategy in playing winning tennis for a beginner is
A. more pace in ground strokes.
B. to keep the ball in play.
C. volley at the net as often as possible.
D. hit the ball as hard as possible.

EXAMPLE 5.25. Example of distractors with grammatical inconsistency.

28. Three objects are thrown in the water. Object A floats on top
of the water. Object B is partially submerged. Object C sinks.
All three objects have the same volume Which object weighs
the most?
A. A
B. B
C. C

EXAMPLE 5.26. Example of homogeneous options.

Example 5.27 shows heterogeneous options. The plausible characteristics should be homogeneous in terms of content and grammatical form.

28. Which is most characteristic of a catamaran?


A. Fast sailboat.
B. It was discovered in Katmandu.
C. Its main feature is two hulls.
D. More expensive than an ordinary sailboat.

EXAMPLE 5.27. Heterogeneous options.

• Blatantly absurd, ridiculous options. When writing that third or fourth option there is a temptation to develop a ridiculous choice either as humor
or out of desperation. In either case, the ridiculous option will seldom be
chosen and is therefore useless. Example 5.28 gives two ridiculous options
that give away the correct answer.

28. Who is best known for contributions to microelectronics?


A. Comedian Jay Leno
B. Robert Sveum
C. Actor Bruce Willis

EXAMPLE 5.28. Example of ridiculous distractors.

You may not know the person in the second option (B), but you know
that it is the right answer because the other two are absurd. If A or C is cor-
rect, the item is a trick question.

29. Make All Distractors Plausible. As we know, in most settings, MC is used to measure knowledge and cognitive skills. Therefore, the right an-
swer must be right, and the wrong answers must clearly be wrong. The key to
developing wrong answers is plausibility. Plausibility refers to the idea that
the item should be correctly answered by those who possess a high degree of
knowledge and incorrectly answered by those who possess a low degree of
knowledge. A plausible distractor will look like a right answer to those who
lack this knowledge. The effectiveness of a distractor can be statistically ana-
lyzed, as chapter 9 shows. Example 5.29 shows an item where only 3% of the
students tested chose Option B. We might conclude that this option is very
implausible. Options C and D seem more plausible as judged by the frequency
of response. Writing plausible distractors comes from hard work and is the
most difficult part of MC item writing.

29. The Emperor seems to view the Great Wall as a


A. protector of his way of life. (73%)*
B. popular tourist attraction. (3%)
C. symbol of the human spirit. (14%)
D. way to prevent people from escaping. (9%)

EXAMPLE 5.29. Example of plausible and implausible distractors.

*Source: NAEP Grades 8 and 12, 1993 Reading Assessment.
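How such option-choice frequencies are obtained can be sketched very simply; the short Python program below is only an illustration, and the 5% flagging threshold, the function name, and the simulated response data approximating Example 5.29 are assumptions of this sketch rather than procedures from the chapter. Chapter 9 treats distractor analysis more fully.

    from collections import Counter

    def option_frequencies(responses):
        """Print and return the proportion of examinees choosing each option.

        `responses` is a list of the option letters selected by the examinees,
        e.g. ["A", "C", "A", ...].  Options chosen by fewer than 5% of the
        group are flagged as possibly implausible distractors.
        """
        counts = Counter(responses)
        total = len(responses)
        proportions = {}
        for option in sorted(counts):
            p = counts[option] / total
            proportions[option] = p
            flag = "  <-- possibly implausible" if p < 0.05 else ""
            print(f"Option {option}: {p:.0%}{flag}")
        return proportions

    # Simulated responses roughly matching the percentages in Example 5.29
    # (the published percentages sum to 99 because of rounding)
    responses = ["A"] * 73 + ["B"] * 3 + ["C"] * 14 + ["D"] * 9
    option_frequencies(responses)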

30. Use Typical Errors of Students When You Write Distractors. One
suggestion is that if we gave completion items (open-ended items without
choices), students would provide the correct answer and plausible wrong an-
swers that are actually common student errors. In item writing, the good plausi-
ble distractor comes from a thorough understanding of common student errors.
In the example in Example 5.30, Distractor A is a logical incorrect answer for
someone learning simple addition.

29. 77 + 34 =
A. 101
B. 111

EXAMPLE 5.30. Example of an alternate-choice item with a plausible, common student error.

31. Use Humor if It Is Compatible With the Teacher; Avoid Humor in a Formal Testing Situation. McMorris, Boothroyd, and Pietrangelo (1997)
extensively studied the issue of humor in testing. Their conclusion was that hu-
mor is probably harmless in classroom assessment if it flows naturally from the
teacher. Thus, the use of humor would be compatible with the classroom learn-
ing environment.
Although humor may be useful to cut tension in the classroom and im-
prove the learning environment, in any formal testing situation, humor may
work against the purpose of testing. Items containing humor can reduce the
number of plausible distractors and therefore make the item artificially easier.
Humor also might encourage the student to take the test less seriously.
Limited research on the use of humor shows that, in theory, humor should re-
duce anxiety, but sometimes highly anxious test takers react in negative ways.
The use of humor detracts from the purpose of the test. The prudent thing to
do is to avoid humor.

GUIDELINES FOR SPECIFIC MC FORMATS

The preceding sections of this chapter focus on general item-writing advice.


Many of these guidelines apply equally to the various formats presented in
chapter 4, including AC, matching, MTF, and item sets. However, special
guidelines are needed that are unique to some of these MC formats. The next
section provides some specific guidance to item writers for these other formats.

Advice for Writing Matching Items

Generally, the set of choices for a matching item set is homogeneous as to content. Because the benefit of a matching format is the measurement of understanding of a single learner outcome, the homogeneity of content is a characteristic of a set of matching items.
Also, the number of choices should not equal the number of items. The basis
for this advice is that test takers may try to match up items to choices believing
in a one-to-one correspondence. If this is true, there is interitem cuing. If this is
not true, students will be confused. Table 5.2 provides seven guidelines for writ-
ing matching items.

Advice for Writing AC Items

Because AC is a short form of conventional MC, no unique guidelines appear in this section. It is important to ensure that the single distractor is the most
common student error if this format is to work properly. Therefore, special ef-
fort should be given to writing the distractor for each AC item.

Advice for Writing MTF Item Clusters

1. The number of MTF items per cluster may vary within a test.
2. Conventional MC or complex MC items convert nicely to MTF items.
3. No strict guidelines exist about how many true and false items appear in a
cluster, but expecting a balance between the number of true and false
items per set seems reasonable.
4. The limit for the number of items in a cluster may be as few as 3 or as many
as would fit on a single page (approximately 30 to 35).

Guidelines for TF Testing Items

Although many experts currently do not recommend the TF format, a body of knowledge exists on the writing of these items. In the interest of providing a balanced presentation of guidelines for various formats, this section is included.

TABLE 5.2
Guidelines for the Matching Format

1. Provide clear directions to the students about how to select an option for each stem.
2. Provide more stems than choices.
3. Make choices homogeneous.
4. Put choices in logical or numerical order.
5. Keep the stems longer than the options.
6. Number stems and use letters for options (A, B, C, etc.).
7. Keep all items on a single page or a bordered section of the page.

Frisbie and Becker (1991) surveyed 17 textbooks and extracted 22 common guidelines for writing TF items. Most of the guidelines are similar if not identi-
cal with those presented earlier in this chapter. One thing to keep in mind,
however, is that most of these guidelines fail to reach consensus from writers of
textbooks or from research. Nonetheless, Frisbie and Becker provided many
excellent insights into TF item writing that are now reviewed and discussed.

Balance the Number of TF Statements. Key balancing is important in any kind of objectively scored test. This guideline refers to the balance between
true and false statements, but it also applies to negative and positive phrasing.
So, it is actually key balancing as applied to TF items.

Use Simple Declarative Sentences. A TF item should be a simple, noncomplex sentence. The item should state something in a declarative rather
than interrogative way. It should not be an elliptical sentence. Example 5.31
shows a single-idea declarative sentence and a compound idea that should be
avoided.

Desirable: The principal cause of lung cancer is cigarette smoking.

Undesirable: The principal causes of lung cancer are cigarette smoking and smog.

EXAMPLE 5.31. Simple declarative sentence and a compound idea.

Write Items in Pairs. Pairs of items offer a chance to detect ambiguity.


One statement can be true and another false. One would never use a pair of
items in the same test, but the mere fact that a pair of items exists offers the item
writer a chance to analyze the truth and falsity of related statements. Examples
are provided in Example 5.32.

Make Use of an Internal Comparison Rather Than an Explicit Comparison.


When writing the pair of items, if comparison or judging is the mental activity,
write the item so that the comparison is clearly stated in the item. In Example
5.33, the first item qualifies the evaluation of oil-based paint, whereas the sec-
ond item does not qualify the evaluation. The second item is ambiguous.

Take the Position of an Uninformed Test Taker. Example 5.34 contains a true statement and two common misinterpretations.

Overinflated tires will show greater wear than underinflated tires. (false)

Underinflated tires will show greater wear than overinflated tires. (true)

EXAMPLE 5.32. Benefit of writing true-false items in pairs.

Desirable: In terms of durability, oil-based paint is better than latex-based paint.

Undesirable: Oil-based paint is better than latex-based paint.

EXAMPLE 5.33. Qualified and unqualified declarative statements.

A percentile rank of 85 indicates that 85% of the sample tested scored lower than the equivalent test score for this percentile. (true)

A percentile rank of 85 means that 85% of items were correctly answered. (false)

A percentile rank of 85 means that 15% of test takers have scores lower than people at that percentile rank. (false)

EXAMPLE 5.34. True-false variations of a concept.

Use MC Items as a Basis for Writing TF Items. Good advice is to take a poor-functioning MC item and convert it to several TF items. Example 5.35
shows how a poorly operating conventional MC item can be transformed into
an MTF format. This conversion also produces five scorable items that have a
positive effect on reliability.
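As a hedged aside (the formula is standard in reliability theory and is not part of this chapter's guidelines), the gain from added scorable units can be anticipated with the Spearman-Brown formula for a test lengthened by a factor of k:

\[
\rho_{\text{new}} = \frac{k\,\rho_{\text{old}}}{1 + (k - 1)\,\rho_{\text{old}}}
\]

For example, with \(\rho_{\text{old}} = .70\) and k = 2, the predicted reliability is about .82, assuming the added items are of comparable quality.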

Advice for Writing Item Sets

Little research exists on the writing or effectiveness of item sets (Haladyna, 1992a), despite its existence in the testing literature for more

Conventional Multiple-Choice Format


The best way to improve the reliability of test scores is to
A. increase the length of the test.
B. improve the quality of items on the test.*
C. increase the difficulty of the test.
D. decrease the difficulty of the test.
E. increase the construct validity of the test.
Which actions listed below improve the reliability of test scores?
Mark A if it tends to improve reliability, mark B if not.
1. Increase the length of the test. (A)
2. Improve the discriminating quality of the items. (A)
3. Substitute less difficult items with more difficult items. (B)
4. Increase the construct validity of the test. (B)
5. Decrease the difficulty of the test items. (B)

EXAMPLE 5.35. Converting a multiple-choice item into a series of multiple true-false items.

than 50 years. Nonetheless, some advice is offered regarding certain aspects of the item set.

Format the Item Set So All Items Are on a Single Page or Opposing Pages of
the Test Booklet. This step ensures easy reading of the stimulus material and
easy reference to the item. When limited to two pages, the total number of
items ranges from 7 to 12. If the MTF or AC formats are used with the item set,
many more items can be used.

Use Item Models if Possible. An algorithm is a standard item set scenario with a fixed number of items. The scenario can be varied according to several dimensions, producing many useful items. Haladyna (1991) presented examples for teaching statistics and art history. Chapter 7 provides illustrations and
examples of these.

Use Any Format That Appears Suitable With the Item Set. With any
item set, conventional MC, matching, AC, and MTF items can be used. The
item set encourages considerable creativity in developing the stimulus and us-
ing these various formats. Even CR item formats, such as short-answer essays,
can be used.

SUMMARY

This chapter presents item-writing guidelines that represent a consensus of authors' treatments on item writing and empirical research. Future studies may lead to further revision of these guidelines.
6
A Casebook of Exemplary Items and Innovative Item Formats

OVERVIEW

This chapter contains a collection of exemplary and innovative items in various MC formats. The chapter's purpose is to provide readers with ideas about how new MC items might be created to accommodate different types of content and cognitive behaviors. This chapter was inspired by the Manual of Examination Methods (Technical Staff, 1933, 1937). Many examples appearing in this
chapter come directly or were adapted from the items found in these volumes.
For each item presented in this chapter, there is a brief introduction and some commentary assessing the intended content and cognitive process. Occasionally, criticism is offered as a means of showing that
items can always be improved.
The chapter is organized into three sections. The first section includes items
that purportedly measure understanding of a concept, principle, or procedure.
The second section presents items purported to measure a skill. The third sec-
tion contains items that purportedly measure different types of higher level
thinking that require the application of knowledge and skills.

ITEMS TESTING UNDERSTANDING

When testing understanding, the stem must present a concept, principle, or procedure in a novel way that has not previously been presented in the test taker's instructional history. The item should not come directly from previously assigned readings, course presentations, or lectures. The idea in testing understanding is to see if the student truly understands the concept, principle,
or procedure being learned instead of memorizing a definition or identifying a
previously presented example.

National Assessment—Reading

The first item comes from the NAEP's 1994 reading assessment and is shown in
Example 6.1. This item is based on a reading passage about the Anasazi Indians
of the Southwest United States. The passage is not presented here because of
space limitations, but it is customary to use reading passages to test reading
comprehension. In chapter 7, advice is given on how to generate a large num-
ber of reading comprehension items using "clone" item stems.

The reading passage is "The Lost People of Mesa Verde" by Elsa Marston.

7. The Anasazi's life before 1200 A.D. was portrayed by the author as being
A. dangerous and warlike.
B. busy and exciting.
C. difficult and dreary.
D. productive and peaceful.

EXAMPLE 6.1. Taken from the 1994 reading assessment of the National Assessment of Educational Progress.

After reading the passage, the student must choose among four plausible options. Understanding of the passage is essential. The four options use language
that cannot be found verbatim in the passage. Thus, the options present in a
novel way what the author portrayed about the Anasazi Indians. Those looking
for a more complete collection of examples of high-quality reading comprehen-
sion items should consult the web page of the National Center for Education Statistics (http://nces.ed.gov/nationsreportcard/).

EM

Example 6.2 shows how the EM format discussed in chapter 4 can be effec-
tively used to measure a student's understanding. Each patient is described in
terms of a mental disorder. The long list of disorders is given at the right.
Each learner can memorize characteristics of a disorder, but understanding is

Patient
1. Mr. Ree anxiously enters every room left foot first.
2. Suzy enters the room excited. She walks quickly around chattering. Later she is normal.
3. Bob saw the archangel of goodness followed by Larry, Curly, and Moe, the three saints of comedy.
4. Muriel cannot pronounce words, and she has no paralysis of her vocal cords.
5. "I am the Queen of Sheba," the old lady muttered.
6. After Julie broke up with Jake, she remarked, "There are many fish in the sea."
7. Norman was thin and tired. Nothing was important to him. He felt useless and inferior. He wanted to escape.
8. Good clothes and good looks did not get Maria the attention she wanted, so she excelled in sports.

Disorder
A. Neurasthenia
B. Dementia
C. Regression
D. Alexia
E. Sublimation
F. Bipolar
G. Compulsion
H. Rationalization
I. Masochism
J. Hallucination
K. Hypnotism
L. Delusional

EXAMPLE 6.2. Adapted from Technical Staff (1937, p. 72).

needed when each learner is confronted with a new patient who demon-
strates one or more symptoms of a disorder. Of course, psychotherapy is not as simplistic as this example suggests, but in learning about disorders, learners
should understand each disorder rather than simply memorize characteristics
of a disorder.

Combinatorial Formats: TF, Both-Neither, MTF

Anyone who has written MC items for a while has experienced the frustration
of thinking up that third or fourth option. Because distractors have to be plausi-
ble and reflect common student errors, it is hard to come up with more than
one or two really good distractors. The combinatorial format makes that effort

easier. You simply write an AC item (two options) and add two more generic op-
tions. The first example comes from the National Board Dental Examination
Testing Programs.
As shown in Example 6.3, we have two statements. We have four combina-
tions for these two statements: true-true, true-false, false-true, false-false.

The dentist should suspect that the patient's primary mandibular right second molar is nonvital. MOST primary molar abscesses appear at the apices.
A. Both statements are TRUE.
B. Both statements are FALSE.
C. The first statement is TRUE; the second statement is FALSE.
D. The first statement is FALSE; the second statement is TRUE.

EXAMPLE 6.3. Item 40 from the 1993 National Board Dental Examination, published by the Joint Commission on National Dental Examinations (released test).

As shown in Example 6.4, we have two plausible answers that complete the
stem. The student must evaluate whether the first answer only is right, the sec-
ond answer only is right, or whether both answers are correct or both incorrect.
The nice thing about this format is that item writers never have to think of that
third or fourth option; they simply create two MC options. However, they have
to be careful to ensure that an equal number of times in the test the right an-
swer is evenly distributed among the four choices.

According to a recent American Cancer Society report, the most common cause of lung cancer is:
A. cigarette smoking.
B. living in a polluted atmosphere.
C. both A and B.
D. neither A nor B.

EXAMPLE 6.4. Combinatorial multiple-choice with two plausible options and two generic options.

The MTF format is also useful for testing understanding. Example 6.5 shows
how teaching the characteristics of vertebrates can be cleverly tested using descriptions of animals. The animals in the list are not described or presented in a
textbook or during lecture. The student encounters an animal description and
must decide based on its characteristics if it is absurd or realistic. The number of
items may vary from 1 or 2 to more than 30. The decision about the length of
this item set depends on the type of test being given and the kind of coverage
needed for this content.

Mark A if absurd or B if realistic.


1. An aquatic mammal.
2. A fish with a lung.
3. A single-celled metazoan.
4. A flatworm with a skeleton.
5. A coelenterate with a mesoderm.

EXAMPLE 6.5. Adapted from Technical Staff (1937, p. 47).

Efficient Dichotomous Format

Example 6.6 has an efficient presentation and method of scoring. Imagine that we
are learning about three governments. You select the letter corresponding to
the government that reflects the statement on the left. This example only has 4
statements, but we could easily have 20 to 30 statements. We have 12 scorable
units in this example, but with 30 statements this would be equivalent to a 90-
item TF test.

Description U.S. U.K. France


Has a document known as the "constitution."
It is federal in form.
Its leader is elected.
All judges are appointed.

EXAMPLE 6.6. Adapted from Technical Staff (1937, p. 77).

This item may be measuring recall of facts, but if the statements are pre-
sented in a novel way, these items might be useful measures of student under-
standing.

TESTING SKILLS

MC formats can be usefully applied to testing skills. In this section, we show the use of MC to measure reading, writing, mathematics, and language trans-
lation skills.

Vocabulary—Reading

The testing of vocabulary is prominent in many achievement tests. Example 6.7 is one of the leanest, most efficient ways to test for vocabulary. The num-
ber of options may vary from three to five, but writing options should not be
difficult.

Find the word that most nearly means the same as the word on
the left.

1. Accept: A. Admit B. Adopt C. Allow D. Approve


2. Meander: A. Travel B. Wander C. Maintain D. Dislike
3. Allege: A. Maintain B. Pretend C. Claim D. Accuse
4. Expansive: A. Costly B. Extensive C. Large D. Flexible

EXAMPLE 6.7. Testing vocabulary.

Writing Skills

The measurement of writing skills using MC items is briefly introduced in chapter 4. In this section, this idea is amplified. Without any doubt, MC for-
mats can be effectively used to measure student writing skills (Perkhounkova,
2002). For instance, Bauer (1991) experimented with items that put gram-
mar and other rules of writing in context but retained an MC format. This for-
mat resembles the interlinear item set, but each item stands alone. Bauer
claimed that this item format contextualizes writing and brings MC closer to
realistic editing in the writing process. He offered other examples of items
dealing with text idioms and vocabulary.
Example 6.8 focuses on a single skill, discriminating between active and
passive voice. In writing, writers are often encouraged to use, where possi-
ble, active voice. Note that the item set in Example 6.8 contains only 4
items, but we could easily increase the length of this item set to 10, 20, or
even 30 statements.

Which of the following verbs is passive? Mark A if passive or B if active.
1. The car is being repaired.
2. The mechanic replaced the thermafropple.
3. It malfunctioned yesterday.
4. The car needs new tires as well.

EXAMPLE 6.8. Multiple true-false items measuring whether the student knows active from passive voice.

Example 6.9 shows the breadth of writing skills that can be tested. As noted
in this example, all items have two choices; therefore, guessing plays a role.
However, if enough items are presented to students, guessing becomes less of a
factor.
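To make the claim concrete (the arithmetic is mine, not the author's), the probability of answering all n two-option items correctly by blind guessing is

\[
P = \left(\tfrac{1}{2}\right)^{n},
\]

which is already below 1% at n = 7 and about 1 in 1,000 at n = 10, so a reasonably long set of two-choice items leaves little room for a high score by guessing alone.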

1. The calculation of people attending the event was (A-exact or B-meticulous).
2. Words that are identical in form are (A-synonyms, B-homonyms, C-antonyms).
3. After all that practice, she makes (A-less, B-fewer) mistakes
than before.
4. The car (A-lies, B-lays) on its side.
5. Four (A-people, B-persons) were on the boat.
6. Arizona's climate is very (A-healthy, B-healthful).
7. The data (A-is, B-are) very convincing.
8. Let's keep this a secret (A-between, B-among) the three of us.

EXAMPLE 6.9. Examples of multiple-choice testing of vocabulary and writing skills. Based on Technical Staff (1937).

A good point to make about these items is that all the distinctions listed in
the examples and by Technical Staff (1933, 1937) can be re-presented with
new sentences that students have not seen. Thus, we are testing the applica-
tion of writing skill principles to new written material.
Example 6.10 is presented in generic format because of space consider-
ations. As you can see, the number of sentences in the stimulus condition can
be long. In fact, this list might range from 5 to 50 or more sentences. The stu-

dent is expected to detect eight distinctly different errors in writing. Such a test
has high fidelity with anyone who is learning how to correct and revise writing.

A series of numbered sentences in a long paragraph containing many writing errors.
For the numbered sentences above, identify the type of error.
A. Fragmentary or incomplete sentence.
B. Comma fault.
C. Dangling or hanging modifier.
D. Nonparallel construction.
E. Error in tense, mode, or voice.
F. Lack of subject/verb agreement.
G. Vague pronoun reference.
H. Misplaced modifier.
I. Correctly written.

EXAMPLE 6.10. Adapted from Technical Staff (1937, p. 51).

The main idea in using MC to measure writing skills is to use real examples
presented to test takers, allowing them to select the choices to provide insight
into their writing skills. Virtually every writing skill can be converted into an
MC format because most of these skills can be observed naturally as a student
writes or can be assessed artificially in a test using items that appear in this sec-
tion. Although these MC formats can appear artificial or contrived, is the low-
ering of fidelity to true editing a tolerable compromise? Editing student writing
is one way to measure these writing skills, but these examples provide a more
standardized way.

Mathematics Skills

Mathematics skills can also be tested easily using MC formats. Example 6.11
shows a conversion involving fractions, decimals, and percents. The student
learning objective might require that students find equivalents when pre-
sented with any fraction, decimal, or percent. Example 6.11 shows the use of
the MTF format, which permits a more thorough testing of the procedure for
converting from one form to another. As we see from this example, fractions
can vary considerably. We can create many items using this structure. Options
should include the right answer and common student errors.

Mark A if equal, B if unequal


Which of the following is equal to 1/2?
1. 0.50
2. 50%
3. 0.12
4. 0.25

EXAMPLE 6.11. Simple conversions involving fractions, decimals, and percents.

Example 6.12 shows a simple area problem that might be appropriate for a
fifth-grade mathematics objective. Problems like these are designed to reflect
real-world-type problems that most of us encounter in our daily lives. Creating test items that students can see have real-world relevance not only makes the problems more interesting but also promotes the idea that this subject matter is important to learn.

You are painting one wall and want to know its area. The wall is 8
feet high and 12 feet wide. What is the area?
A. 20 feet
B. 20 square feet
C. 40 square feet
D. 96 square feet

EXAMPLE 6.12. Simple area problem.

Example 6.13 involves a more complex skill where two numbers have to be
multiplied. This item represents a two-stage process: (a) recognize that multi-
plication is needed, and (b) multiply correctly. When a student misses this
item, we cannot ascertain whether it was a failure to do (a) or (b) that resulted
in the wrong answer.

Language

MC formats can be useful in language learning. Phrases are presented in one language on the left, and the alternative, plausible translations are presented
on the right, as shown in Example 6.14. Such items are easy to write. We write

Our orchard contains 100 trees. We know from previous years that
each tree produces about 30 apples. About how many apples
should be expected this year at harvest time?
A. 130
B. 300
C. 3,000
D. Cannot say. More information is needed.

EXAMPLE 6.13. More complex skill involving two steps.

phrases that we think are part of the student's body of knowledge to master and
then provide the correct translation and two or three plausible incorrect trans-
lations. Generating test items for practice testing and for summative testing or
for testing programs can be easily accomplished.

Er nimmt Platz:

A. He occupies a position.
B. He waits at a public square.
C. He seats himself.

EXAMPLE 6.14. Language equivalency.

Example 6.15 is simple vocabulary translation. As with the previous example, we can identify many words that require translation and provide the exact
literal translation or use a synonym to test for a higher level of learning.

1. Teil: A. Hammer B. Particular C. Part D. Offer E. Glue

EXAMPLE 6.15. Vocabulary translation can be literal or figurative.

Any reading passage can be presented in one language and the test items
measuring comprehension of the passage can be presented in another lan-
guage. Some state student achievement testing programs have experimented
with side-by-side passages in English and another language to assist those

learning English to better perform in a content-based achievement test.


Thus, a mathematics story problem could be presented in English and Span-
ish, and the student whose native language may be Spanish can choose be-
tween the alternative presentations.

TESTING FOR THE APPLICATION OF KNOWLEDGE AND SKILLS IN A COMPLEX TASK

This section contains items that are purported to prompt test takers to apply
knowledge and skills to address a complex task. The items span a range of MC formats, including conventional MC, conventional MC with context-depend-
ent graphic material, conventional MC with generic (repeatable) options,
MTF and multiple-response (MR) formats, networked two-tier item sets,
and combinatorial MC items. These examples should show that writing MC
items can measure more than recall. Although reliance on CR performance
items that require judged scoring is always desirable, many of the examples
shown in this section should convince us that these types of MC items often
serve as good proxies for the performance items that require more time to ad-
minister and human scoring that is fraught with inconsistency and bias.

Conventional MC for a Certification Examination

We have literally hundreds of testing programs that require test items mea-
suring knowledge, skills, and abilities in professions. Item banks are hard to
develop, and these item banks must be updated each year, as most profes-
sions continuously evolve. Old items are retired and new items must replace
these old items. Item writing in this context is expensive. SMEs may be paid
or may volunteer their valuable time. Regardless, the items must not only
look good but they must perform. Example 6.16 shows a situation encoun-
tered by a Chartered Financial Analyst (CFA) where several actions are
possible and only one is ethical, according to the ethical standards of the
Association for Investment Management and Research (AIMR). Although
the ethical standards are published, the test taker must read and under-
stand the real-life situation that may be encountered by a practicing CFA
and take appropriate action. Inappropriate action may be unethical and
lead to negative consequences. Thus, not only are such items realistic in ap-
pearance but such items measure important aspects of professional knowl-
edge. Note that Option A has a negative term, which is capitalized so that
the test taker clearly understands that one of these options is negatively
worded and the other three options are positively worded.

Wilfred Clark, CFA, accumulates several items of nonpublic information through contacts with computer companies. Although
none of the information is "material" individually, Clark concludes,
by combining the nonpublic information, that one of the computer
companies will have unexpectedly high earnings in the coming year.
According to AIMR Standards of Professional Conduct, Clark may:

A. NOT use the nonpublic information.


B. may use the nonpublic information to make investment
recommendations and decisions.
C. must make reasonable efforts to achieve immediate public
dissemination of the nonpublic information.
D. may use the nonpublic information but only after gaining approval from a
supervisory analyst attesting to its nonmateriality.

EXAMPLE 6.16. Adapted from Chartered Financial Analysts: 1999 CFA Level I Candidate Readings: Sample Exam and Guideline Answers.

Medical Problem Solving

Most certification and licensing boards desire test items that call for the appli-
cation of knowledge and skills to solve a problem encountered in their profes-
sion. In medicine, we have a rich tradition for writing high-quality MC items
that attempt to get at this application of knowledge and skills. Example 6.17
provides a high-quality item that is typically encountered in certification tests
in the medical specialties.
These items often derive from an experienced SME who draws the basis for
the item from personal experience. It is customary for every item to have a ref-
erence in the medical literature that verifies the correctness of content and the
selected key.

Conventional MC with Accompanying Graphical Material

This item set requires the student to read and interpret a graph showing the
savings of four children (see Example 6.18). The graph could be used to test
other mathematics skills, such as correctly reading the dollar values saved by
each child. Other comparisons can be made. Or a probability prediction could
be made about who is likely to save the most or least next year. These types of

A 48-year-old man becomes depressed three months after total laryngectomy, left hemithyroidectomy, and postoperative radiation therapy (5,000 rads). During evaluation, a low-normal thyroxine level is noted. What test is most useful in detecting subclinical hypothyroidism?

A. Radioimmunoassay of tri-iodothyronine
B. Resin tri-iodothyronine uptake test
C. Thyroid scan
D. Thyroid-stimulating hormone test
E. Free thyroxine index.

EXAMPLE 6.17. Item 73 from the 1985 Annual Otolaryngology Examination (Part 2), American Academy of Otolaryngology—Head and Neck Surgery Inc.

items are easily modeled. In fact, chapter 7 shows how item models can be cre-
ated using an item like this one. An item model is a useful device for generating
like or similar items rapidly.

Conventional MC Using a Table

Whether your graphical material addresses a single item (stand alone) or a set
of items, most high-quality testing programs use graphical materials because they add a touch of realism to the context for the item and usually enable the testing of the application of knowledge and skills. In these two examples, tables are
used. These tables require the test taker to read and understand the data pro-
vided and take some action, as called for in the item stem.
Example 6.19 requires the student to add points and read the chart to deter-
mine Maria's grade. The item reflects meaningful learning because most stu-
dents want to know their grades and must perform an exercise like this one to
figure out their grade. This item can be varied in several ways to generate new
items. The points in the stem can be changed, and the grading standards can be
changed. As the next chapter shows, we have many techniques for generating
new items from old items that makes item development a little easier.
Example 6.20, also from the AIMR's certification testing program, nicely shows how a table can be used to test for the complex application of knowledge and skills.
As with virtually all items of a quantitative nature, this format can be used
and reused with new data in the table and appropriate revisions in the stem and
options. The potential for item models that allow you to generate additional

Beth, Bob, Jackie, and Tom had savings programs for the year. How many more dollars did Beth save than Tom?
A. $ 2.50
B. $ 5.00
C. $11.00
D. $21.00

EXAMPLE 6.18. Graph-interpretation item; the accompanying graph of the four children's savings is not reproduced here.

items is great. As mentioned several other times in this chapter, chapter 7 is de-
voted to this idea of item generation.

Logical Analysis Using Generic MC Options in a Matching Format

Example 6.21 shows how MC can be used to test logical thinking, such as we find in science, social studies, and mathematics. The first state-
ment is assumed to be true, factual. The second statement is connected to the
first and can be true, false, or indeterminate. By writing pairs of statements,
we can test the student's logical analysis concerning a topic taught and
learned. The number of pairs can be extensive. The options remain the same for every pair. Thus, writing items can be streamlined. Another aspect of
Maria's teacher has the following grading standards in mathematics. Maria wants to know what her grade this grading period will be. Her scores from quizzes, portfolio, and homework are 345, 400, 122, and 32.

Total Points      Grade
920 to 1000       A
850 to 919        B
800 to 849        C
750 to 799        D

A. A
B. B
C. C
D. D

EXAMPLE 6.19. Real-life mathematics problem suitable for fifth grade.
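As a check on the key (the arithmetic is mine, not supplied with the item), Maria's total is

\[
345 + 400 + 122 + 32 = 899,
\]

which falls in the 850 to 919 band, so the keyed answer is B.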

A three-asset portfolio has the following characteristics:

Asset    Expected Return    Standard Deviation    Weight
X        0.15               0.22                  0.50
Y        0.10               0.08                  0.40
Z        0.06               0.03                  0.10

The expected return on this three-asset portfolio is:


A. 0.3%
B. 11.0%
C. 12.1%
D. 14.8%

EXAMPLE 6.20. Adapted from Chartered Financial Analysts: 1999 CFA Level I Candidate Readings: Sample Exam and Guideline Answers.
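For readers verifying the key, the expected portfolio return is the weighted sum of the assets' expected returns (a standard definition, not restated in the item):

\[
E(R_p) = \sum_i w_i\,E(R_i) = (0.50)(0.15) + (0.40)(0.10) + (0.10)(0.06) = 0.121,
\]

or 12.1%, which is option C.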

A. The second statement must be true.


B. The second statement cannot be true.
C. The second statement may or may not be true.

1. All language consists of arbitrary symbols.


All arbitrary symbols that are in use are parts of a language.
2. Every culture is different from every other culture.
There are universal patterns in cultures.
3. The Banyore are a Bantu tribe of Uganda.
The Banyore have some type of family organization.

EXAMPLE 6.21. Adapted from Technical Staff (1937, p. 28).

this item set is that the first two items are general and conceptual, whereas
the latter two items are specific to something that has been taught and
learned. Thus, we can test for general knowledge and understanding and
then apply it to specific instances.

MTF or MR Formats

The MC item set in Example 6.22 is presented using the MTF format but it
could also be presented in an MR format. Thus, aspects of problem solving and
critical thinking can be tested with a scenario-based format without using the

A young plant weighing two pounds was planted in a pot containing 100 pounds of dry earth. The pot was regularly watered for two years. The plant was removed and weighed 70 pounds, and the earth weighed 99.9 pounds.

Mark A if true and B if false.


1. The increase in plant weight was due to the contribution of soil
and the watering.
2. The increased weight is partly due to assimilation of oxygen.
3. The data are incorrect.
4. The plant is not normal and healthy.
5. The plant absorbed something from the atmosphere.

EXAMPLE 6.22. Adapted from Technical Staff (1937, p. 23).



conventional MC format, which requires more time to write the items. Note
that this format can have more than 5 items (statements). In fact, as many as 30
statements can fit on a single page; therefore, the amount of testing derived
from a single problem scenario can be extensive. Guessing is not a problem with
this format because of the abundance of items that can be generated to test the
student's approach to this problem.
AC is another format that resembles the previous format, is from the same
source, and has a generic nature that can be applied to many subject matters
and situations. AC has two parts. The stimulus contains a series of observa-
tions, findings, or statements about a theme. The response contains a set of
plausible conclusions. The following example shows a set of true statements about student learning, from which a series of logical and illogical conclusions are drawn. The student must choose between the two choices for each conclusion.
Although only five items are presented in Example 6.23, you can see that the
list can be increased considerably.

• Students with limited English proficiency (LEP) usually get


lower than average test scores on standardized achievement
tests.
• Students with disabilities who are on an IEP usually get lower
than average scores on standardized achievement tests.
• Students whose families live in poverty usually get lower than
average scores on standardized tests.

Which conclusions are A-logically supported or B-not logically supported?

1. Language facility may influence test score performance.


2. Teachers fail to teach LEP students effectively.
3. Students with disabilities should be given accommodations
and modifications to remove any construct-irrelevant
impairment to their test performance.
4. Poverty is a known cause of low performance in schools.
5. Teachers whose students have low test scores have failed.

EXAMPLE 6.23. Alternate-choice items requiring inference.

Networked Two-Tier Items

In science education, a continuing interest has been misconceptions or alternative conceptions. Students' prior knowledge often influences learning and

performance on achievement tests. Tsai and Chou (2002) developed the net-
worked two-tier test as a means for studying students' prior knowledge and mis-
conceptions about science. The test is administered over the World Wide Web,
which makes it more accessible for its purpose, which is diagnosis.
The two-tier test is a two-item MC item set based on a science problem, usu-
ally accompanied with visual material. The first tier (first item) explores the
child's knowledge of the phenomenon being observed. The second tier (second
item) explores that basis in reasoning for the first choice. Such items are devel-
oped after student interviews. Thus, the content for the items is carefully devel-
oped from actual student encounters with items, as opposed to using students'
perceptions afterward to gain insights. Example 6.24 illustrates a two-tier item.

Two astronauts were having a fight in space. One struck the other.
The one who struck weighed 40 kilograms, but the one who was struck weighed 80 kilograms.

1. What happened to them?


A. The one who struck would have moved away at a higher
velocity than the other astronaut.
B. The one who was struck would have moved away at a
higher velocity.
C. The two would have moved away at the same velocity.

2. What is your reason?


A. Under the same force, the one with less mass would
move in higher acceleration.
B. There was no force on the one who struck, but on the
stricken one.
C. Velocity had nothing to do with force, so the two would
have moved away.

EXAMPLE 6.24. Networked two-tier item. Used with permission from Chin-Chung Tsai, National Chiao Tung University (Taiwan).

The first-tier item is presented. The student responds. Then the second-tier
item is given, with the first-tier item being retained on screen. The second item is
presented only after the student has made a choice on the first item, so that the
student is not influenced by the first item. The sequencing effect of items pro-
vides for inferences to be made about stages of learning so that teaching inter-
ventions can identify and redirect students into more logical, correct patterns.
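The sequencing rule is simple enough to sketch in code. The fragment below is illustrative only; it is not drawn from Tsai and Chou's system, and the function, argument, and field names are hypothetical.

def administer_two_tier(item, present):
    """Hypothetical sketch of the two-tier sequencing rule: the reason tier is
    shown only after the answer tier has been committed, the first tier stays
    visible, and both choices are stored as a learning-path record."""
    first_choice = present(item["answer_tier"])
    second_choice = present(item["reason_tier"], keep_on_screen=item["answer_tier"])
    return {"answer": first_choice, "reason": second_choice}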
This item format, coupled with the work of Tsai and Chou (2002), exemplifies the recent emphasis on studying the cognitive processing underlying some item formats. These researchers are interested in studying cognitive processes elicited by MC items and, along the way, devising new ways to format MC items to accom-
plish this end. Tsai and Chou think that further refinement of two-tier items can be
used both diagnostically and instructionally. They also believe that use of technol-
ogy can greatly assist the two-tier system.

When designed as an interactive, multimedia learning environment, the networked [two-tier system can serve as an] instructional tool, helping students overcome their alternative conceptions. Finally, the networked system can record students' learning paths when navigating the system. (Tsai & Chou, 2002, p. 164)

Example 6.25 is another illustration of a two-tier item set.

A man breathing air that is 20% oxygen and 5% carbon dioxide enters an atmosphere that is 40% oxygen and 10% carbon dioxide.

Which result is most plausible?


A. Respiratory rate increases.
B. Respiratory rate decreases.
C. Respiratory rate remains unchanged.
Which explains this result?
A. Primary stimulus is carbon dioxide.
B. Primary stimulus is oxygen.
C. The increase in amount of oxygen and carbon dioxide
did not change their proportions.

EXAMPLE 6.25. Adapted from Technical Staff (1937, p. 33).

The first item calls for the use of a principle to make a prediction. The sec-
ond item uses causal reasoning to explain the rationale for the prediction.

Premise-Consequence

Example 6.26 is based on a premise for which there is a consequence.


Students must know the relationship between nominal and real gross national
product (GNP) and apply it to a situation that probably has not been encoun-
tered in previous reading or in the textbook. The complexity of this item can be
improved by adding one or more premises.
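The relationship being exercised is the standard approximation linking the two growth rates (my gloss, not text supplied with the item):

\[
\%\Delta\,\text{real GNP} \approx \%\Delta\,\text{nominal GNP} - \text{inflation rate}.
\]

With 8% nominal growth, the keyed choice therefore depends on the inflation figure stated in the premise; an inflation rate of 6%, for instance, would make a 2% rise in real GNP the correct response.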

Combinatorial Items

An item from the National Board Dental Examinations employs another strategy for systematic variation that makes writing options easier. Example 6.27 shows this technique.

If nominal gross national product (GNP) increases at a rate of 8% per year, then real GNP:
A. remains constant.
B. rises by 10%.
C. falls by 8%.
D. rises by 2%.

EXAMPLE 6.26. Reprinted by permission of Georgeanne Cooper, Director of the Teaching Effectiveness Program, University of Oregon.

How does the fluoride ion affect size and solubility of the
hydroxyapatite crystal?
Crystal Size Solubility
A. Increases Increases
B. Decreases Decreases
C. Increases Decreases
D. Decreases Increases

EXAMPLE 6.27. Item 62 from the released National Board Dental Hygiene Pilot Examination (1996), published by the Joint Commission on National Dental Examinations.

Example 6.28 is another good example that comes from the Uniform Cer-
tified Public Accountant Examination with the use of simple yes and no an-
swers to combinatorial conditions.
The options are easier to develop. If the item writer can develop the stem so
that the four options can systematically have paired variations, the item writ-
ing is simplified.

SUMMARY

The purpose of this chapter was to show that MC formats come in a larger va-
riety than presented in chapter 4. Not only is there a variety in MC formats,
but this chapter shows that these MC formats can measure knowledge, skills,
and the application of knowledge and skills in many content areas. You are

11. Pell is the principal and Astor is the agent in an agency coupled with an interest. In the absence of a contractual provision relating to the duration of the agency, who has the right to terminate the agency before the interest has expired?

      Pell     Astor
A.    Yes      No
B.    No       Yes
C.    No       No
D.    Yes      Yes

EXAMPLE 6.28. Item 11 from business law. Taken from the Uniform CPA Examination, May 1989, Questions and Unofficial Answers. New York: American Institute of Certified Public Accountants.

encouraged to experiment with these MC formats and other MC formats that you encounter. As you will see, the MC format is open to innovation, and the
results may provide you with more tools to measure difficult content and cog-
nitive operations.
7
Item Generation

OVERVIEW

Whether you are developing MC test items for a class you teach or for a testing
program, the pressure to produce a large collection of high-quality test items is
omnipresent. New items are always needed because old items based on old con-
tent may be retiring. Case et al. (2001) reported on new-item development for
several medical credentialing examinations. They stated that a significant por-
tion of the budget for test development is given to creating new items. Item
writing is a costly enterprise. Item generation refers to any procedure that
speeds up this item-writing process. Because new and better items are always
needed, any strategy to increase both the quality of items and the rate of pro-
duction is welcome.
This chapter features five sections. The first section covers item shells, a straightforward item-generating technology that is easy to employ but is limited to items that mainly reflect knowledge and skills. The second
section is item modeling, which has more potential for measuring complex cog-
nitive behavior. The third section is key features, which has potential for mea-
suring clinical problem solving in a profession, which is a central interest in
professional credentialing tests. The fourth section discusses generic item sets,
where variable facets are introduced and generic items provide a basis for writ-
ing stems. The fifth section shows how to transform an existing complex perfor-
mance item into one or more MC items that reflect the complex behavior
elicited in the performance item.
These five approaches to item generation represent practical technologies
for item generation. However, there is an emerging science of item generation
that promises to improve our ability to rapidly generate items. The vision is that
computers will someday produce items on demand for testing programs where
new tests and test results are needed quickly. But the basis for creating these

computer programs will come from teams of experts, including SMEs, whose
judgments will always be needed.

A BRIEF ACKNOWLEDGMENT TO FUTURE ITEM-WRITING THEORIES

This emerging science of item generation was well documented in an edited volume by Irvine and Kyllonen (2002) entitled Item Generation for Test Develop-
ment. This book contains chapters reviewing current item-generation theories
and research. Earlier, Roid and Haladyna (1982) had written about item-writ-
ing theories current to that date. Theories and the technologies that follow are
much desired. This recent activity signals the beginning of a new era of item
generation. Although this chapter does not draw directly from the new theo-
retical work, it is important to briefly review the recent progress in theories of
item writing as a context for this chapter.
New item-generation theories can be characterized as (a) having a strong
foundation in cognitive psychology; (b) focusing on narrow, well-defined do-
mains of cognitive behavior; and (c) aiming more at aptitude than achieve-
ment testing. The most urgent need in achievement testing is for proven technologies, derived from these item-generating theories, that produce items quickly and efficiently.
Some cognitive achievement domains are well structured. They can be de-
fined as having clear goals and definable limits, and being adaptable to domain
specifications. The tooth-coding system in dentistry is finite. We have 32 teeth
in the adult dentition. Given the tooth name, the dental student must give the
code. Given the code, the dental student must name the tooth. The domain of
items is well structured: 64 open-ended items.
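Because the domain is finite and fully specified, its items can even be enumerated mechanically. The sketch below is illustrative only; the two sample dictionary entries and the variable names are my own assumptions, offered to show the idea rather than to reproduce any examination's actual coding system.

# Illustrative sketch: enumerating a well-structured domain as open-ended items.
tooth_codes = {
    "permanent maxillary right central incisor": "8",   # sample entry, for illustration
    "permanent mandibular left first molar": "19",      # sample entry, for illustration
    # ...the remaining permanent teeth would be entered here
}

items = []
for name, code in tooth_codes.items():
    items.append(f"What is the code for the {name}?")            # name given, code required
    items.append(f"Which tooth is designated by code {code}?")   # code given, name required

# With all 32 teeth entered, len(items) would be 64, matching the domain size noted above.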
However, we also face ill-structured problems in all aspects of life. They are
not well defined, lack clear goals, and are resistant to domain specifications
such as what we see with dental anatomy or computation in mathematics. To
solve an ill-structured problem, one needs to define the problem; generate al-
ternative, viable solutions; evaluate these alternatives through argumentation
or experimentation; select the most viable among these alternatives; observe
the results; and draw a conclusion. The term ill-structured may have originated
from the work of Simon (1973). By their nature, ill-structured problems may have multiple solutions. Thus, we might have more than one correct answer, or one
solution might be better than another.
With the testing of these cognitive abilities, we tend to focus on problems,
vignettes, or other situations that can best be described as ill structured. The
fact that so many testable phenomena are ill-structured situations seems to
work against automated item generation. Yet, there is some hope as evinced in
the recent volume by Irvine and Kyllonen (2002).

Wainer (2002) made several important points about item generation in the
future. He argued that item generation is urgently needed in the context of
computerized testing, particularly computer adaptive testing. He pointed out
that computerized testing is probably not desirable for large-scale assessments
of course work, tests that are given on an annual basis, and large-scale perfor-
mance assessments. Computerized testing is feasible in low-stakes settings such
as for placement, when test results are needed quickly, as in credentialing test-
ing program. Wainer concluded that item generation may be best suited for di-
agnostic testing and for things that are easy to codify, but for measuring
high-level complex thinking that school-based assessments desire, these item-
generation theories have not yet been successful.
Whether we use item-generation theories for developing a technology or
use the techniques described in this chapter, the need to write good items for all
testing programs is always there. In the interim period before such theories be-
come operational and provide the kinds of items we desire, we turn to the pro-
cedures in this chapter because they can help item writers accelerate the slow,
painful process of writing new items.

ITEM SHELLS

The item shell technique is primarily intended for item writers who lack formal item-writing training and experience in MC item writing. These item writers
they were preparing items. As its name suggests, the item shell is a skeletal item.
The item shell provides the syntactic structure of a MC item. The item writer
has to supply his or her content, but the stem or partial stem is supplied to give
the item writer a start in the right direction.

Origin of the Item Shell

As reported earlier in this book, attempts to make item writing a science have not yet been fruitful. An ambitious endeavor was Bormuth's (1970) algorith-
item writing made item development more scientific and less subject to the
caprice and whims of idiosyncratic item writers. The problem, however, was
that the algorithm had too many steps that made its use impractical. Others
have tried similar methods with similar lack of success, including facet theory
and designs, item forms, amplified objectives, among others (see Roid &
Haladyna, 1982).

The item shell was created out of a need for a more systematic method of
MC item writing in the direction of these earlier efforts. However, the item shell also allows item writers a freedom that, in turn, permits greater creativ-
ity in designing the item. The item shell is also seen as a more efficient pro-
cess for writing MC items than presently exists. The method simplifies
writing items.

Defining an Item Shell

According to Haladyna and Shindoll (1989), an item shell is a "hollow" item containing a syntactic structure that is useful for writing sets of similar items.
Each item shell is a generic MC test item. All item shells are derived from exist-
ing items that are known to perform as expected. Example 7.1 gives a simplistic
item shell.

Which is an example of (any concept)?


A. Example
B. Plausible nonexample
C. Plausible nonexample
D. Plausible nonexample

EXAMPLE 7.1. Generic item shell.

One could take this item shell and substitute almost any concept or princi-
ple from any subject matter. Writing the stem is only one part of MC item writ-
ing, but often it is the most difficult part. Writing a correct option and several
plausible distractors is also difficult. Once we write the stem, an important part
of that item-writing job is done.
A limitation of the item shell technique is that you may develop an
abundance of items that all have the same syntactic structure. For in-
stance, if you used the shell in Example 7.1, all items might have the
same syntactic structure. Some test makers and test takers may perceive
this situation negatively. We want more variety in our items. The solu-
tion is to use a variety of item shells instead of generating many items
from a single shell.
Another limitation of the item shell is that it does not apply equally well
to all content. There are many instances where the learning task is specific
enough so that generalization to sets of similar items is simply not possible.
In these instances other techniques presented in this chapter may be more
fruitful.

Developing Item Shells

There are two ways to develop item shells. The first and easiest way is to adopt
the generic shells presented in Example 7.2. These shells are nothing more
than item stems taken from successfully performing items. The content expert
should identify the facts, concepts, principles, or procedures being tested and
the type of cognitive behaviors desired (recalling, understanding, or applying
knowledge or skills).

Which is the definition of...?


Which is the best definition of...?
Which is the meaning of...?
Which is synonymous with ...?
Which is like ...?
Which is characteristic of...?
What distinguishes ...?
Which is the reason for...?
Which is the cause of...?
What is the relationship between ... and ...?
Which is an example of the principle of...?
What would happen if...?
What is the consequence of...?
What is the cause of...?
Which is the most or least important, significant, effective ...?
Which is better, worse, higher, lower, farther, nearer, heavier, lighter,
darker, lighter...?
Which is most like, least like ...?
What is the difference between ... and ...?
What is a similarity between ... and ...?
Which of the following principles best applies to ...?
Which of the following procedures best applies to the problem of
...?
What is the best way to ...?
How should one ...?

EXAMPLE 7.2. Item shells derived from a variety of successfully performing items.

A second way is to transform highly successful items into item shells. To do so, one should follow certain steps. Example 7.3 shows a variety of item shells
for medical problem solving. To transform items into shells, several condi-
tions must be met. First an item must be identified as a successful performer.
Chapter 9 discusses the criteria for item performance. Second, the type of
cognitive behavior represented by the item must be identified. Third, the
content that the item tests must be identified. Fourth, a series of item-writing
steps must be followed.

Understanding
What are the main symptoms of...?
Comment: This item shell provides for the generation of a multitude
of items dealing with the symptoms of patient illnesses.
Predicting
What is the most common (cause or symptom) of a (patient
problem)?
Comment: This general item shell provides for a variety of
combinations that mostly reflects anticipating consequences or
cause-and-effect relationships arising from principles.
Understanding of concepts is also important for successful
performance on such items.
Applying Knowledge and Skills
Patient illness is diagnosed. Which treatment is likely to be most
effective?
Comment: This item shell provides for a variety of patient illnesses,
according to some taxonomy or typology of illnesses and
treatment options. Simply stated, one is the best. Another
questioning strategy is to choose the reason a particular
treatment is most effective.
Applying Knowledge and Skills
Information is presented about a patient problem. How should the
patient be treated?
Comment: The item shell provides information about a patient
disease or injury. The completed item will require the test taker to
make a correct diagnosis and to identify the correct treatment
protocol, based on the information given.

EXAMPLE 7.3. Examples of item shells for a medical


problem.

These steps are as follows:

1. Identify the stem of a successfully performing item.

A 6-year-old child is brought to the hospital with contusions over


the abdomen and chest as a result of an automobile accident.
What should be the initial treatment?

2. Underline key words or phrases representing the content of the item.

A 6-year-old child is brought to the hospital with contusions over


the abdomen and chest because of an automobile accident. What
should be the initial treatment?

3. Identify variations for each key word or phrase.

Age of person: infant, child (ages 3-12), adolescent (ages 13-18),
young adult (ages 19-31), middle age (ages 32-59), elderly (ages
60 and over).
Trauma injury and complications: Cuts, contusions, fractures, in-
ternal injuries.
Type of accident: Automobile, home, industrial, recreational.

4. Select an age, trauma injury or complication, and type of accident from


personal experience.

Infant Abrasion Home

5. Write the stem.

An infant is brought to the hospital with severe abrasions follow-


ing a bicycle accident involving the mother. What should initial
treatment be?

6. Write the correct answer.

A. Conduct a visual examination.



7. Write the required number of distractors, or as many plausible distractors


as you can with a limit of four because most automated scoring permits up
to five options comfortably.

B. Treat for infection.


C. Administer pain killers to calm the infant.
D. Send for laboratory tests.
E. Clean the wounds with an antiseptic.

Steps 4 through 7 can be repeated for writing a set of items dealing with a
physician's treatment of people coming to the emergency department of a
hospital. The effectiveness of the item comes with the writing of plausible
distractors. However, the phrasing of the item, with the three variations,
makes it possible to generate many items covering a multitude of combina-
tions of ages, trauma injuries and complications, and types of injuries. The
item writer need not be concerned with the "trappings" of the item but can
instead concentrate on content. For instance, an experienced physician
who is writing test items for a credentialing examination might draw heavily
from clinical experience and use the item shell to generate a dozen different
items representing the realistic range of problems encountered in a typical
medical practice. In these instances, the testing events can be transformed
into context-dependent item sets.
An item shell for eighth-grade science is developed to illustrate the pro-
cess. The unit is on gases and their characteristics. The steps are as follows:

1. Identify the stem.

Which is the distinguishing characteristic of hydrogen?

2. Underline the key word or phrase.

Which is the distinguishing characteristic of hydrogen?

3. Identify variations for each key word or phrase.

Which is the distinguishing characteristic of (gases studied in this
unit)?

4. Select an instance from the range of variations.

Oxygen

5. Write the stem.

Which is the distinguishing characteristic of oxygen?

6. Write the correct answer.

A. It is the secondary element in water.

7. Write the distractors.

B. It has a lower density than hydrogen.


C. It can be fractionally distilled.
D. It has a lower boiling point than hydrogen.

The last word in the stem can be replaced by any of a variety of gases, easily
producing many item stems. The difficult task of choosing a right answer and
several plausible distractors, however, remains.
Although the process of developing item shells may seem laborious, as illus-
trated in the preceding discussion, keep in mind that many of these seven steps
become automatic. In fact, once a good item shell is discovered, several steps
can be performed simultaneously.
Item shells have the value of being used formally or informally, as part of a
careful item-development effort, or informally for classroom testing.
Clearly the value of the item shell is its versatility to generate items for dif-
ferent types of content (facts, concepts, principles, and procedures) and
cognitive operations.

Another Approach to Generating Item Shells

A culling of stems from passage-related item sets that purport to measure com-
prehension of a poem or story can yield nice sets of item stems that have a ge-
neric quality. Rather than write original item stems, these generic stems can be
used to start your passage-related reading comprehension test items. Example
7.4 provides a list of stems that have been successfully used in MC items that

measure comprehension of poetry. Example 7.5 provides a complementary list


of items for measuring reading comprehension.

Poetry
What is the main purpose of this poem?
What is the theme of this poem?
Which of the following describes the mood of the poem?
What is the main event in this poem?
Which poetic device is illustrated in the following line? Possible
options include: allusion, simile, metaphor, personification.
Which of the following describes the basic metric pattern of the
poem? What does the language of the poem suggest?
What is the meaning of this line from the poem?
{Select a critical line}
What is the meaning of {select a specific term or
phrase from a line in the poem}?
Which describes the writing style of this poem/passage? Possible
answers include: plain, colorful, complex, conversational.
Which of the following would be the most appropriate title for this
poem/passage?
Which term best describes this poem? Possible answers include
haiku, lyric, ballad, sonnet.

EXAMPLE 7.4. Generic item shells for poetry


comprehension items.

As you increase your experience with these item stems, you may generate
new stems that address other aspects of understanding reflecting the curricu-
lum, what's taught, and, of course, what's tested.

Evaluation of Item Shells

According to Haladyna and Shindoll (1989), the item shell has several attrac-
tive features:

1. The item shell helps inexperienced item writers phrase the item in an ef-
fective manner because the item shell is based on previously used and
successfully performing items.

Reading Passage/Story/Narrative
What is the main purpose of this selection?
What is the theme of this story?
What is the best title for this story?
What is the conflict in this story?
Which best describes the writing style of this story?
Which best describes the conflict in this story?
Which best summarizes this story?
Which statement from this story is a fact or opinion? {Choose
statements}
What is the meaning of the following {word, sentence, paragraph}?
Which best describes the ending of this story?
How should ... be defined?
Which point of view is expressed in this story? {first person, second
person, third person, omniscient}
Which best summarizes the plot?
Which best describes the setting for this story?
What literary element is represented in this passage? Possible
answers include foil, resolution, flashback, foreshadowing.
Who is the main character?
How does the character feel? {Select a character}
What does the character think? {Select a character}
How are {character A and character B} alike? different?
After {something happened}, what happened next?
What will happen in the future?
Why do you think the author wrote this story {part of the story}?

EXAMPLE 7.5. Generic item stems for measuring


passage-related reading comprehension.

2. Item shells can be applied to a variety of types of content (facts, concepts,


principles, and procedures), types of cognitive behaviors (recalling, un-
derstanding, and applying), and various subject matters.
3. Item shells are easily produced and lead to the rapid development of use-
ful items.
4. Item shells can be used in item-writing training as a teaching device.
5. Item shells can be used to help item writers take a good idea and convert
it to an item. Once they have ideas, they can select from generic shells, as

Example 7.2 shows, or from a specially prepared set of shells, as Examples


7.3 and 7.4 show.
6. Item shells complement traditional methods of item writing so that a va-
riety of item formats exists in the operational item bank.
7. Finally, item shells help crystallize our ideas about the content of a test.

In summary, the item shell is a very useful device for writing MC items be-
cause it has an empirical basis and provides the syntactic structure for the con-
tent expert who wishes to write items. The technique is flexible enough to
allow a variety of shells fitting the complex needs of both classroom and large-
scale testing programs.

ITEM MODELING

Item modeling is a general term for a variety of technologies both old and new.
In chapter 11, the future of item modeling is discussed. Much of the theoretical
work currently under way may lead to validated technologies that will en-
able the rapid production of MC items. In this chapter, we deal with practical
methods of item modeling.
An item model provides the means for generating a set of items with a com-
mon stem for a single type of content and cognitive demand. An item model
not only specifies the form of the stem but in most instances also provides a ba-
sis for the creation of the correct answer and the distractors. The options con-
form to well-specified rules. With a single item model we can generate a large
number of similar items.

A Rationale for Item Modeling

One rationale for item modeling comes from medical training and evalua-
tion. LaDuca (1994) contended that in medical practice we have used a be-
havioral-based, knowledge-skills model for discrete learning of chunks of
information. Traditional tests of medical ability view cognitive behavior as
existing in discrete parts. Each test item systematically samples a specific class
of behaviors. Thus, we have domain-referenced test score interpretations
that give us information about how much learning has occurred. Mislevy
(1993) referred to this mode of construct definition and the resulting tests as
representing low to high proficiency. Cognitive learning theorists maintain
that this view is outmoded and inappropriate for most professions (Shepard,
1991; Snow, 1993).
This point of view is consistent with modern reform movements in educa-
tion calling for greater emphasis on higher level thinking (Nickerson, 1989).

For nearly two decades, mathematics educators have promoted a greater em-
phasis on problem-solving ability, in fact, arguing that problem solving is the
main reason for studying mathematics (Prawat, 1993). Other subject matters
are presented as fertile for problem-solving teaching and testing. In summary,
the impetus of school reform coupled with advances in cognitive psychology
are calling for a different view of learning and, in this setting, competence.
LaDuca (1994) submitted that competent practice resides in appropriate re-
sponses to the demands of the encounter.
LaDuca (1994) proposed that licensure tests for a profession ought to be
aimed at testing content that unimpeachably relates to effective practice. The
nature of each patient encounter presents a problem that needs an effective so-
lution to the attending physician. Conventional task analysis and role delinea-
tion studies identify knowledge and skills that are tangentially related to
competence, but the linkage is not so direct. In place of this approach is prob-
lem-solving behavior that hinges on all possible realistic encounters with pa-
tient problems.
LaDuca's (1994) ideas apply directly to professional credentialing testing,
but they may be adaptable to other settings. For instance, item modeling might
be used in consumer problems (e.g., buying a car or appliance, food shopping,
painting a house or a room, remodeling a house, fixing a car, or planning land-
scaping for a new home).

An Example of Item Modeling

This section briefly presents the structural aspects of LaDuca's (1994)


item-modeling procedures. (Readers interested in the fuller discussion should
refer to LaDuca, 1994; LaDuca, Downing, & Henzel, 1995; LaDuca, Staples,
Templeton, & Holzman, 1986; Shea et al., 1992.)
For clinical encounters, several faceted dimensions exist for the develop-
ment of the vignette that involves a clinical encounter driving the content of the
item. These facets are used by the expert physician in writing a content-ap-
propriate test item. The existence of these facets makes item writing more
systematic.

Facet 1: Setting
1. Unscheduled patients/clinic visits
2. Scheduled appointments
3. Hospital rounds
4. Emergency department

This first facet identifies five major settings involving patient encoun-
ters. The weighting of these settings may be done through studies of the pro-
fession or through professional judgment about the criticalness of each
setting.

Facet 2: Physician Tasks


1. Obtaining history and performing physical examination
2. Using laboratory and diagnostic studies
3. Formulating most likely diagnosis
4. Evaluating the severity of patient's problem(s)
5. Managing the patient
6. Applying scientific concepts

The second facet provides the array of possible physician activities, which
are presented in sequential order. The last activity, applying scientific concepts,
is disjointed from the others because it connects patient conditions with diag-
nostic data as well as disease or injury patterns and their complications. In
other words, it is the complex step in treatment that the other categories do not
conveniently describe.

Facet 3: Case Cluster


1a. Initial work up of new patient, new problem
1b. Initial work up of known patient, new problem
2a. Continued care of known patient, old problem
2b. Continued care of known patient, worsening old problem
3. Emergency care

The third facet provides five types of patient encounters, in three dis-
crete categories with two variations in each of the first two categories. Ex-
ample 7.6 is the resulting item showing the application of these three
facets.
The item in Example 7.6 has the following facets: (a) Facet 1: Setting—2.
Scheduled appointment; (b) Facet 2: Physician task—3. Formulating most
likely diagnosis; (c) Facet 3: Case cluster—la. Initial work up of new patient,
new problem. It is interesting that the item pinpoints a central task of diag-
nosis but necessarily involves the successful completion of the first two tasks

A 19-year-old archeology student comes to the student health


service complaining of severe diarrhea, with 15 large-volume
watery stools per day for 2 days. She has had no vomiting,
hematochezia, chills or fever, but she is very weak and very thirsty.
She has just returned from a 2-week trip to a remote Central
American archeological research site. Physical examination shows
a temperature of 37.2 degrees Centigrade (99.0 degrees Fahrenheit),
pulse 120/min, respirations 12/min, and blood pressure 90/50 mm
Hg. Her lips are dry and skin turgor is poor. What is the most likely
cause of the diarrhea?

A. Anxiety and stress from traveling


B. Inflammatory disease of the large bowel
C. An osmotic diarrheal process
D. A secretory diarrheal process*
E. Poor eating habits during her trip

EXAMPLE 7.6. Item produced from three facets.

in the task taxonomy. The vignette could be transformed into a context-de-


pendent item set that includes all six physician tasks. The genesis of the pa-
tient problem comes from the rich experience of the physician or SME, but
systematically fits into the faceted vignette so that test specifications can be
satisfied.

Examples of Item Models in Mathematics

Item modeling works best in areas that are quantifiable. Virtually all types
of mathematics content can be modeled. Example 7.7 presents an item
model that deals with probability from an elementary grade mathematics
curriculum.
Example 7.8 shows an item from this model where we have two red, four
yellow, and six blue pieces of candy in a bag. The context (jelly beans in a jar,
candy in a bag, marbles, or any object) can be specified as part of the model. As
you can see, the number of red, yellow, and blue objects can vary but probably
should not be equal. The options, including distractors, are created once num-
bers are chosen. These options involve the correct relationship as well as logi-
cal, plausible, but incorrect actions. The range of positive integers can be
varied as desired or needed for the developmental level of the students. The com-
plexity of the probability calculated can be increased by picking more than one
object or by including more than one color.

A {container} holds x red objects, y yellow objects, and z blue


objects. If we select one object from the container, what is the
probability that the object is {red, yellow, or blue}?
A. 1/n-plausible but wrong
B. 1/{x, y, or z}-plausible but wrong
C. {x, y, or z}/{x+y+z}-correct
D. {x, y, or z}/{the sum of the numbers of colored objects not
chosen}-perhaps not as plausible as other distractors

EXAMPLE 7.7. Simple mathematical item model with options


fixed by the model.

A bag contains two red, four yellow, and six blue pieces of candy.
What is the probability of reaching in the bag and picking a yellow
piece of candy (without peeking)?
A. 1/12
B. 1/4
C. 4/12
D. 4/6

EXAMPLE 7.8. An item generated from the item model.
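Because the options in this model follow fixed rules, an entire item (stem, key, and distractors) can be assembled mechanically once the numbers are chosen. The short Python sketch below is merely an illustration of that idea under the rules of Example 7.7; the function name, the exact stem wording, and the decision to leave the fractions unreduced are illustrative assumptions rather than part of the model.

# Illustrative generator for the probability item model in Example 7.7.
# Option rules follow the model: the key is count/total; the distractors
# are plausible but wrong fractions (1/total, 1/count, count/not-chosen).

def probability_item(container, counts, target):
    """counts maps color -> number of objects; target is the color asked about."""
    total = sum(counts.values())
    n = counts[target]
    color_list = ", ".join(f"{v} {k}" for k, v in counts.items())
    stem = (f"A {container} holds {color_list} objects. If we select one object, "
            f"what is the probability that the object is {target}?")
    options = {
        "A": f"1/{total}",        # plausible but wrong: 1 over the total
        "B": f"1/{n}",            # plausible but wrong: 1 over the target count
        "C": f"{n}/{total}",      # correct: target count over the total
        "D": f"{n}/{total - n}",  # target count over the objects not chosen
    }
    return stem, options, "C"

stem, options, key = probability_item("bag", {"red": 2, "yellow": 4, "blue": 6}, "yellow")
print(stem)
for letter, value in options.items():
    print(f"{letter}. {value}")
print("Key:", key)

With the counts of Example 7.8, this sketch reproduces options A through C exactly; option D follows the rule stated in Example 7.7.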

Example 7.9 has a family of eight statements that constitute the item model.
We might limit this family to single-digit integers for a, b, x, and y. We would
not use zero as a value. Within these eight models, some items will be harder or
easier than others.

|a+b| + |x+y|    |a+b| + |x-y|    |a+b| - |x+y|    |a+b| - |x-y|
|a-b| + |x+y|    |a-b| + |x-y|    |a-b| - |x+y|    |a-b| - |x-y|

EXAMPLE 7.9. Family of eight item models.

Example 7.10 presents a sample item based on a member of this family of


eight item models. Any distractor might be based on a logical analysis. For in-
stance, Option A in the sample item is a simplistic solution where all integers
are added. Option B does some of the subtracting correctly but gets confused

about the signs of the differences. Another way to develop distractors is to


have a small panel of students with varying mathematical ability think aloud
as they work through several of these items. Their wrong responses will give
you clues about how to write the rules for distractors.

|4 - 2| - |2 - 5| =
A. 13
B. 5
C. -1

EXAMPLE 7.10. An item derived from the family of eight item


models.
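The same mechanical treatment extends to the family of eight models: once values for a, b, x, and y are drawn, the key is computed directly, and the distractors can encode the error patterns just described (adding every integer, or mishandling the sign of the second difference). The Python sketch below is one possible implementation under those assumptions; the function name and the particular error rules are illustrative, not prescribed by the text.

import random

# Illustrative generator for one member of the |a-b| - |x-y| family.
# Distractor rules encode common errors: adding all four integers, and
# dropping the absolute value on the second difference (sign confusion).

def abs_difference_item(a, b, x, y):
    stem = f"|{a} - {b}| - |{x} - {y}| ="
    key = abs(a - b) - abs(x - y)
    add_all = a + b + x + y              # simplistic error: adds every integer
    sign_error = abs(a - b) - (x - y)    # forgets the second absolute value
    options = sorted({key, add_all, sign_error})  # a set drops accidental duplicates
    return stem, options, key

random.seed(7)
a, b, x, y = (random.randint(1, 9) for _ in range(4))
stem, options, key = abs_difference_item(a, b, x, y)
print(stem, options, "key:", key)

With the values printed in Example 7.10 (a = 4, b = 2, x = 2, y = 5), these rules yield 13, 5, and the key -1, which match the three options shown.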

Summary of What We Know About Item Modeling

1. An item model provides an operational definition of content. The ability


to define a domain consisting of all encounters is at the heart of item
modeling. For instructional content, item modeling seems best suited to
subject matter content that is quantifiable.
2. Item modeling seems flexible and adaptive to many settings and situa-
tions, as LaDuca's (1994) work shows.
3. The method has a high degree of credibility because it rests on the judg-
ments of SMEs in a field of study or profession.
4. Item modeling accelerates the item writer's ability to write test items,
something that nonprofessional item writers greatly need.
5. In its most sophisticated form, distractors are systematically created. This
saves much time in item development.
6. Item specifications are created that are standardized and uniform.
7. The method can provide a basis for instruction as well as formative test-
ing in the classroom because the item model can be used in teaching just
as easily as in large-scale testing. This is helpful in integrating curricu-
lum, instruction, and assessment, as Nitko (1989) and others have long
championed.
8. Although not explicit, item modeling can be extended to the item set for-
mat that more closely models multistep thinking. But such models are
not presented here and remain a challenge to future theorists and re-
searchers.

However, item modeling appears to be restricted in its applications. Reading


comprehension provides a challenge for item modeling. Also, aspects of critical
thinking required in fields such as social studies and science would be difficult
to represent in item models. Reading, writing, and mathematics skills may be
more amenable to item writing, but creative efforts by SMEs in these areas are
needed. Defining the content and cognitive demand required in a precise way
seems to be at the heart of an item model. We need creative efforts to develop
item models for nonquantifiable content that currently seems resistant to item
modeling.

KEY FEATURES

A persistent problem in professional training is the measurement of prob-


lem-solving ability that is part of most professional practice. Whether
the profession is medicine, law, teaching, engineering, dentistry, nursing, or
social work, the licensed professional must deal with a domain of typical
problems. For instance, in medicine, the graduating physician who enters
professional practice must, when encountering a patient with a problem, en-
gage in a complex thought process that leads to successful resolution of the
patient's problem.
Item modeling suggested by LaDuca and colleagues (LaDuca, 1994;
LaDuca et al, 1995) provides one way to approach the complex measurement
of competence in a profession, but the work of Page and Bordage and their col-
leagues (Bordage, Carretier, Bertrand, & Page, 1995; Hatala & Norman, 2002;
Page & Bordage, 1995; Page, Bordage, & Allen, 1995) provides another per-
spective. Their item-generating approach has been well established in training,
research, and licensure testing in Canada.
The rationale for key features came from a frustration in medical education
to measure physicians' clinical ability to treat patient problems. Traditional ap-
proaches such as the PMP failed to generate sufficient intercorrelations among
tasks to provide high enough reliability to use these test scores for important
decisions or evaluation.
In the early 1980s in Canada, the key features idea emerged. The main idea
of a key feature is either a difficult step in the thought process in treating a pa-
tient problem or a step in this process where an error is most likely to occur that
reduces the effectiveness of patient treatment. This step is called a key feature
because it helps discriminate among candidates with varying degrees of competence. Unlike
the LaDuca (1994) item model, where many features are identified, the objec-
tive is to identify those features that are most likely to discriminate among can-
didates with varying degrees of competence.
A key features problem usually has a brief stem followed by several questions
requesting actions from the candidate being tested. The test items may be short

answer (write-in) or short menu, which involves choosing the answer from a
long list of possible right answers.

Steps in Developing Key Feature Problems

Step 1. Define the Domain of Clinical Problems to Be Sampled. Medi-


cine has sought to define domains of patient problems that persons in
medical training should be competent to treat. This domain of problems also
has a list of patient complaints and a list of correct diagnoses. Note that the em-
phasis is placed here on defining clearly and specifically the problems, com-
plaints, and diagnoses. This domain can be defined by preexisting curriculum
guides or surveys that identify the type of patient problems to be treated. For
example, Page and Bordage (1995) gave pediatricians an example of the prob-
lems that might be encountered: near drowning, enuresis, dehydration, glom-
erulonephritis, adolescent diabetes, or a foreign body aspiration. Any resulting
test is a representative sample from this domain.

Step 2. Provide Examination Blueprint. Once the domain is defined,


the test specifications typically help in selecting items for a test. In this in-
stance, it is used to select the problems from the domain of clinical problems.
Page and Bordage (1995) stated that this blueprint can be multidimensional
and refer to many relevant factors, such as medical specialty (e.g., pediatrics),
body systems (e.g., respiratory), and clinical setting (e.g., ambulatory, in pa-
tient). They also mention basing the domain on a single dimension, such as
life span.

Step 3. Present Clinical Situations. Each problem can be presented in


various ways. Page et al. (1995) reported that five clinical situations were iden-
tified: (a) undifferentiated problems or patient complaints, (b) a single typical
or atypical problem, (c) a multiple problem or multisystem involvement, (d) a
life-threatening problem, and (e) preventive care and health promotion.

Step 4. Select Key Features for Each Problem. A key feature is a critical
step that will likely produce a variety of different choices by physicians. Some of
these choices will be good for the patient and some choices will not be good.
Not all patient problems will necessarily have key features. The key feature
must be difficult or likely to produce a variety of effective or ineffective choices.
Although the key feature is identified by one expert, other SMEs have to agree
about the criticality of the key feature. Key features vary from two to five for
each problem. Each key feature has initial information and an assigned task.
Example 7.11 gives key features for two problems.

Problem 1: Four Associated Key Features


For a pregnant woman experiencing third-trimester bleeding with
no abdominal pain, the physician (or the graduating medical
student) should
1. generate placenta previa as the leading diagnosis,
2. avoid performing a pelvic examination (may cause fatal
bleeding),
3. avoid discharging from an outpatient clinic or emergency
department, and
4. order coagulation tests and cross-match.

Problem 2: Three Associated Key Features


For an adult patient complaining of a painful, swollen leg, the
physician should:
1. include deep venous thrombosis in the differential diagnosis,
2. elicit risk factors for deep venous thrombosis through the
patient's history, and
3. order a venogram as a definitive test for deep venous
thrombosis.

EXAMPLE 7.11. Key features for two problems.

Step 5. Select Case and Write Case Scenario. Referring back to the five
clinical situations stated in Step 3, the developer of the problem selects the
clinical situation and writes the scenario. The scenario contains all relevant in-
formation and includes several questions. As noted previously, the items can be
in an MC or CR format.

Step 6. Develop Scoring for the Results. Scoring keys are developed
that have a single or multiple right answers. In some instances, candidates
can select from a list where some of their choices are correct or incorrect. The
SME committee develops the scoring weight and scoring rules for each case
scenario.

Step 7. Conduct Pilot Testing. As with any high-stakes test, pilot testing
is critical. This information is used to validate the future use of the case in a for-
mal testing situation.

Step 8. Set Standards. As with any high-stakes test with a pass-fail deci-
sion, standards should be set. Page et al. (1995) recommended a variety of stan-
dard-setting techniques to be used.

Example of a Key Features Item

Example 7.12 is an example of a key features item from Page et al. (1995). The
problem comes from the domain of patient problems identified in Step 1. The

Paul, a 56-year-old man, consults you in the outpatient clinic


because of pain in his left leg, which began 2 days ago and has
been getting progressively worse. He states his leg is tender
below the knee and swollen around the ankle. He has never had
similar problems. His other leg is fine.
Question 1: What diagnosis would you consider? List up to three.
Question 2: With respect to your diagnosis, what elements of his
history would you particularly want to elicit? Select up to seven.
1. Activity at the onset of symptoms
2. Alcohol intake
3. Allergies
4. Angina pectoris
5. Anti-inflammatory therapy
6. Cigarette smoking
7. Color of stools
8. Cough
9. Headache
10. Hematemesis
11. Hormone therapy
12. Impotence
13. Intermittent claudication
14. Low back pain
15. Nocturia
16. Palpitations
17. Paresthesia
18. Paroxysmal nocturnal dyspnea
19. Polydipsia
20. Previous knee problems
21. Previous back problem
22. Previous neoplasia
23. Previous urinary tract infection
24. Recent dental procedure
25. Recent immobilization
26. Recent sore throat
27. Recent surgery
28. Recent work environment
29. Wounds on foot
30. Wounds on hand

EXAMPLE 7.12. Example of a key features item.


Adopted from Page and Bordage (1995, p. 197)
with permission of Academic Medicine.

key features for this problem are given in Example 7.11. The scenario has two
related questions. The first question addresses preliminary diagnoses. The
medical problem solver should be able to advance three plausible hypotheses
about the origin of the patient's complaint. Note that the diagnoses item is in an
open-ended format, but this could be easily converted into an MC format. The
second item is in an MR format where the candidate must select up to seven el-
ements of history to help in the diagnosis.

Benefits of Key Feature Items

According to Page et al. (1995), the key features approach has many benefits. First, patient
problems are chosen on the basis of their criticality. Each is known to pro-
duce a difficult, discriminating key feature that will differentiate among can-
didates for licensure of varying ability. Second, the key feature problems are
short so that many can be administered. Because reliability is a primary type
of validity evidence, key feature items should be numerous and highly
intercorrelated. Third, there is no restriction or limitation to item format. We
can use MC or CR formats. If cuing is a threat to validity, the CR format can
be used. Scoring guides can be flexible and adaptable to the situation. Finally,
the domain of patient problems is definable in absolute ways so that candi-
dates for licensure can be trained and tested for inference to a large domain of
patient problems intended.
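Because scoring guides for key feature problems can be flexible, a short-menu question such as Question 2 in Example 7.12 can be scored by machine once the SME committee has fixed the keyed elements and weights. The Python sketch below illustrates one hypothetical partial-credit rule (one point per keyed element selected, with selections capped at seven); the keyed element numbers are invented for illustration and are not the actual key for Example 7.12.

def score_short_menu(selected, keyed, max_select=7):
    # Hypothetical rule: one point per keyed element selected, ignoring any
    # selections beyond the allowed maximum. Real programs set their own
    # weights and rules through the SME committee.
    counted = list(selected)[:max_select]
    return sum(1 for choice in counted if choice in keyed)

# Invented key and candidate response, using option numbers from the long menu.
keyed_elements = {1, 11, 22, 25, 26, 27, 29}
candidate_response = [1, 6, 11, 25, 27]
print(score_short_menu(candidate_response, keyed_elements), "of", len(keyed_elements))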

Evaluation of Key Feature

The Clinical Reasoning Skills test of the Medical Council of Canada provides a
nice key feature example on its web page: www.mcc.ca. The key feature item-genera-
tion process is intended for testing programs or educational systems where clini-
cal problem solving is the main activity. This process is not for measuring
knowledge and skills. A strong commitment is needed to having SMEs define the
domain, develop the examination blueprint, identify the clinical situations to be
assessed, identify the key features that are relevant to the problem, and select a
case scenario to represent the problem. Examples of key features applied to other
professions would encourage others to experiment with this method. A strong,
related program of research that validates the resulting measures would increase
the use of key features. For instance, Hatala and Norman (2002) adapted key fea-
tures for a clinical clerkship program. The 2-hour examination produced a reli-
ability estimate of 0.49, which is disappointing. These researchers found low
correlations with other criteria even when corrected for unreliability.
Overall, the key feature approach has to be a strong contender among other
approaches to modeling higher level thinking that is sought in testing compe-

tence in every profession. However, systematic validation involving reliability


and other forms of validity evidence is needed to persuade future users that
the key feature approach will accomplish its elusive goal: measuring clinical
problem solving.

GENERIC ITEM SETS

Chapter 4 presents and illustrates the item set as a means for testing various
types of complex thinking. The item set format is becoming increasingly popu-
lar because of its versatility. Testing theorists are also developing new models
for scoring item sets (Thissen & Wainer, 2001). The item set appears to have a
bright future in MC testing because it offers a good opportunity to model vari-
ous types of higher level thinking that are much desired in achievement testing
programs.
This section uses the concept of item shells in a more elaborate format, the
generic item set. This work is derived principally from Haladyna (1991) but
also has roots in the earlier theories of Guttman and Hively, which are dis-
cussed in Roid and Haladyna (1982). The scenario or vignette is the key to the
item set, and like the item model suggested by LaDuca (1994), if this scenario
or vignette has enough facets, the set of items flows naturally and easily from
each scenario or vignette.
The method is rigid in the sense that it has a structure. But this is important
in facilitating the development of many relevant items. On the other hand, the
item writer has the freedom to write interesting scenarios and identify factors
within each scenario that may be systematically varied. The generic questions
also can be a creative endeavor, but once they are developed they can be used
for variations of the scenario. The writing of the correct answer is straightfor-
ward, but the writing of distractors requires some inventiveness.
As noted earlier and well worth making the point again, item sets have a
tendency for interitem cuing. In technical terms, this is called local depend-
ence (Hambleton, Swaminathan, & Rogers, 1991), and the problem is signifi-
cant for item sets (see Haladyna, 1992a; Thissen et al., 1989). Item writers
have to be careful when writing these items to minimize the tendency for
examinees to benefit from other items appearing in the set. This is why it is rec-
ommended that not all possible items in the set should be used for each set at
any one time.
The generic item set seems to apply well to quantitative subjects, such as sta-
tistics. But like item modeling, it does not seem to apply well to
nonquantitative content. These item sets have been successfully used in na-
tional licensing examinations in accountancy, medicine, nursing, and phar-
macy, among others. Haladyna (1991) provided an example in art history.
Therefore, there seems to be potential for other types of content.

Item Shells for Item Sets

The production of test items that measure various types of higher level think-
ing is problematic. Item shells presented in this chapter lessen this problem.
With the problem-solving-type item set introduced in chapter 4, a systematic
method for producing large numbers of items for item sets using shell-like struc-
tures has been developed (Haladyna, 1991). This section provides the concept
and methods for developing item shells for item sets.

Generic Scenario

The generic scenario is a key element in the development of these items. A


scenario (or vignette) is a short story containing relevant information to
solve a problem. Sometimes the scenario can contain irrelevant information
if the intent is to have the examinee discriminate between relevant and irrel-
evant information.
These scenarios can have a general form, as shown in Example 7.13 for a be-
ginning graduate-level statistics course.

Given a situation where bivariate correlation is to be used, the


student will (1) state or identify the research question/hypothesis;
(2) identify the constructs (Y and X) to be measured; (3) identify
the variables (y and x) representing the construct (Y and X); (4)
write or identify the statistical null and alternate hypotheses, or
directional, if indicated in the problem; (5) assess the power of the
statistical test; (6) determine alpha for deciding whether to reject
or accept the null hypothesis; (7) draw a conclusion regarding the
null/alternate hypothesis, when given results; (8) determine the
degree of practical significance that the result indicates; (9)
discuss the possibility of Type I and Type II errors in this problem;
and (10) draw a conclusion regarding the research
question/hypothesis.

EXAMPLE 7.13. Expected skills of students when


encountering any vignette involving bivariate
relationships of interval or ratio scales.

This example involves one statistical test, product-moment correlation. A


total of 18 common statistical tests are taught and tested. With the use of each
test, four variations exist: (a) statistical and practical significance are present,

(b) statistical significance is present but no practical significance is indicated,


(c) no statistical significance is indicated but potentially practical signifi-
cance may be present, and (d) neither statistical nor practical significance is
present. Thus, the achievement domain contains 72 possibilities. Once a sce-
nario is generated, the four conditions may be created with a single scenario.
Example 7.14 shows a simple correlation problem that is varied four ways. The
four scenarios provide the complete set of variations involving statistical and
practical significance. (A technical note for statistically oriented readers: The
third scenario is like the fourth except that the student should recognize that
the small sample size may be contributing to the dilemma of obtaining a high
correlation coefficient that is not statistically significant.)

Statistical and Practical Significance


Two researchers studied 42 men and women for the relationship
between amount of sleep each night and calories burned on an
exercise bike. They obtained a correlation of .28, which has a
two-tailed probability of .08. They used a directional hypothesis and
chose alpha for determining statistical significance at .05.

Statistical Significance but No Practical Significance


Two researchers studied 1,442 men and women for the relationship
between amount of sleep each night and calories burned on an
exercise bike. They obtained a correlation of .11, which has a two-
tailed probability of .08. They used a directional hypothesis and
chose alpha for determining statistical significance at .05.

No Statistical Significance but Potential Practical Significance


Two researchers studied 12 men and women for the relationship
between amount of sleep each night and calories burned on an
exercise bike. They obtained a correlation of .68, which has a
two-tailed probability of .12. They used a directional hypothesis and
chose alpha for determining statistical significance at .05.

No Statistical or Practical Significance


Two researchers studied 42 men and women for the relationship
between amount of sleep each night and calories burned on an
exercise bike. They obtained a correlation of .13, which has a
two-tailed probability of .28. They used a directional hypothesis and
chose alpha for determining statistical significance at .05.

EXAMPLE 7.14. Four logical variations of a single scenario.
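Because the four variants differ only in sample size, correlation, and probability, the cell that any scenario occupies can be determined mechanically, which is what makes a domain of 72 possibilities manageable. The Python sketch below is purely illustrative: the alpha of .05 and the directional test come from the scenarios above, but the practical-significance cutoff of .25 and the function name are hypothetical choices, not values given in the text.

# Illustrative classifier for the four scenario variants.
# ALPHA and the directional test follow the scenarios above; PRACTICAL_R
# is a hypothetical cutoff an SME would set for practical significance.

ALPHA = 0.05
PRACTICAL_R = 0.25

def classify(r, p_two_tailed, directional=True):
    p = p_two_tailed / 2 if directional else p_two_tailed
    statistically_significant = p < ALPHA
    practically_significant = abs(r) >= PRACTICAL_R
    return statistically_significant, practically_significant

scenarios = [
    ("n = 42,    r = .28, p = .08", 0.28, 0.08),
    ("n = 1,442, r = .11, p = .08", 0.11, 0.08),
    ("n = 12,    r = .68, p = .12", 0.68, 0.12),
    ("n = 42,    r = .13, p = .28", 0.13, 0.28),
]
for label, r, p in scenarios:
    stat, prac = classify(r, p)
    print(f"{label} -> statistical: {stat}, practical: {prac}")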



With each scenario, a total of 10 test items is possible. With the develop-
ment of this single scenario and its four variants, the item writer has created a
total of 40 test items. Some item sets can be used in an instructional setting for
practice, whereas others should appear on formative quizzes and summative
tests. For formal testing programs, item sets can be generated in large quantities
to satisfy needs without great expense.
Example 7.15 presents a fully developed item set. This set is unconventional
because it contains a subset of MTF items. Typically not all possible items from
an item set domain would be used in a test for several reasons. One, too many
items are possible and it might exceed the need that is called for in the test spec-
ifications. Two, item sets are best confined to a single page or facing pages in a
test booklet. Three, item sets are known to have interitem cuing, so that the
use of all possible items may enhance undesirable cuing. With the scenario pre-

What is the relationship between amount of sleep each night and
the number of calories (energy) burned on an exercise bike each
day? The researchers limited this study to 42 women between
ages 25 and 52. They obtained a correlation of .28, which has a
two-tailed probability of .08.
1. Which is an example of a properly written research question?
A. Is there a relationship between amount of sleep and
energy expended?
B. Does amount of sleep correlate with energy used?
C. What is the cause of energy expended?
D. What is the value of rho?
What is the correct term for the variable 'amount of sleep'?
Mark A if correct or B if incorrect.
2. Criterion (A)
3. Independent (B)
4. Dependent (A)
5. Predictor (B)
6. y (A)
7. x (B)
8. What is the correct statistical hypothesis?
A. There is no correlation between sleep and energy
expended.
B. Rho equals zero.*
C. r equals zero.
D. Rho equals r.

9. If power is a potentially serious problem in this study, what


remedies should you take?
A. Set alpha to .10 and do a directional test.*
B. Set alpha to .05 and do a directional test.
C. Set alpha to .01 and do a nondirectional test.
D. Set alpha to .05 and do a nondirectional test.
10. What conclusion should you draw regarding the null
hypothesis?
A. Reject*
B. Accept
C. Cannot determine without more information
11. What is the size of the effect?
A. Zero
B. Small*
C. Moderate
D. Large
12. What are the chances of making a Type I error in this
problem?
A. .05*
B. Very small
C. Large
D. Cannot determine without more information

EXAMPLE 7.15. A fully developed scenario-based item set


for beginning statistics class.

sented in Example 7.15, you can see that introducing small variations in the
sample size, the correlation coefficient, and its associated probability, and using
a directional test can essentially create a new problem.

An Item Set Based on a Generic Table

An item set can be created for a small business where items are sold for
profit. For instance, Sylvia Vasquez has a cell phone business at the
Riverview Mall. Help her figure out how her business is doing. Example 7.16
provides a data table for this hypothetical small business. Note that the
product can vary in many ways. For example, the name can be changed, and
the owner of the business can sell caps, ties, earrings, candy, magazines, or
tee shirts. All the numbers can be adjusted by SMEs to create profitable and
unprofitable situations.

   A              B          C         D          E          F
   Type of        Number     Unit      Selling    Number     Amount
   Cell Phone     Bought     Cost      Price      Sold       Received

   Economy        400        $28       $36        127        ?
   Better         250        $67       $84        190        ?
   Best           125        $125      $275       15         ?

EXAMPLE 7.16. Basis for generating an item set.

As part of the mathematics curriculum or in another course, we may be in-


terested in providing practice or assessing a student's understanding of pric-
ing, sales, profit, and strategies for future profits. Instead of generating a
generic set of test items, the potential of this table is shown in Example 7.17
by listing item stems that tap specific concepts, principles, and procedures
and more complex strategies that require reasoning and the application of
knowledge and skills.

1. Which type of cell phone sells best?


2. Which type of cell phone is {most, least} profitable per unit?
3. Which type of cell phone is {most, least} profitable in terms
of total sales?
4. What is the profit margin per unit for the {economy, better,
best} cell phone?
5. What are the gross revenues for the {economy, better, best}
cell phone?
6. What are the total gross revenues for all cell phones?
7. Based on current sales and the profit margin, which order
makes the most sense for next month?
8. Assuming revenues for the last month and an overhead of
92%, what is your profit margin?

EXAMPLE 7.17. Sample generic item stems based on table


in Example 7.16.
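When the numbers in such a table are adjusted or generated electronically, the answer keys for stems like those in Example 7.17 follow directly from the columns. The Python sketch below is a hypothetical illustration of that bookkeeping; the variable names and the simple profit formulas (selling price times number sold, selling price minus unit cost) are simplifying assumptions and ignore the overhead that stem 8 would require.

# Illustrative answer-key calculations for the generic cell-phone table.
# Field names and formulas are simplified assumptions for this sketch.

inventory = [
    # (type, number bought, unit cost, selling price, number sold)
    ("Economy", 400, 28, 36, 127),
    ("Better", 250, 67, 84, 190),
    ("Best", 125, 125, 275, 15),
]

for name, bought, cost, price, sold in inventory:
    amount_received = price * sold          # fills in column F
    profit_per_unit = price - cost          # basis for stems 2 and 4
    profit_on_sales = profit_per_unit * sold
    print(f"{name:8} received ${amount_received:>6}  "
          f"margin/unit ${profit_per_unit:>4}  profit on sales ${profit_on_sales:>6}")

best_seller = max(inventory, key=lambda row: row[4])[0]  # answers stem 1
print("Best-selling type:", best_seller)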

Evaluation of Generic Item Sets

Generic item sets have a potential for modeling higher level thinking that flows
more directly from the complex performance item. Despite the advocacy in this
book for MC, the generic item set makes no assumption about the test item for-
mat. Certainly, a CR format could be used for these vignettes. However, the ge-
neric item set technique is well suited to simulating complex thinking with the
objectively scorable MC items. The adaptability of MC to scenario-based item
sets may be the chief reason so many credentialing testing programs are using
item sets. Item writers with their rich background and experience can draw
from this resource to write a scenario and then develop or adapt existing test
items to determine whether a candidate for certification or licensure knows
what to do to achieve the outcome desired. For course material, the generic
item sets provide a good basis for generating large numbers of test items for vari-
ous purposes: formative and summative testing and even test preparation or
homework.

CONVERTING CR ITEMS TO MC ITEMS

The last item-generating procedure to be discussed in this chapter is a practical


strategy based on the understanding that a complex performance item that is
scored by humans is probably more desirable than an MC item that mimics the
complexity of this performance item, but for many valid reasons we may want to use
the MC format. What are these reasons?

1. The complex performance item takes a long time to administer. An MC


version might take less time.
2. The complex performance item has to be scored by one or two judges, de-
pending on the importance of the test. This scoring is expensive, and MC
provides a cost savings.
3. There is a growing body of research reviewed in chapter 3 that suggests
that MC often provides a good proxy for complex performance items.
Certainly, examples provided in chapters 4 and 6 and in this chapter give
ample evidence of this.
4. We know that scoring performance items is fraught with threats to valid-
ity that include rater effects, such as severity, and rater inconsistency that
affects reliability.

In most circumstances, SMEs may rightfully assert that the complex perfor-
mance item has greater fidelity to what exactly we want to test, but we are will-
ing to sacrifice a little fidelity for greater efficiency. If we are willing to make this
compromise, we can take a perfectly good performance item and convert it into

an MC format. In doing this, we try to incorporate the complex thinking under-


lying the complex performance. Thus, if the score interpretation is the same,
regardless of which format is used, we argue that converting a performance
item to a set of equivalent MC items is a good idea.

Example From the NAEP

Example 7.18 was taken from a fourth-grade 2000 NAEP item block in science
(http://nces.ed.gov/nationsreportcard). This question measures basic knowl-
edge and understanding of the following prompt: Look at the two containers of
water with a thermometer in each one. Because this is a basic knowledge question
that tests the mental behavior of understanding, it converts nicely into an item set.
Example 7.19 is easy to replicate. A label from any food product can be ob-
tained. A set of items can be written by SMEs that probe the student's reading
comprehension or application of knowledge to solve a problem. Items can be
written to address health and diet issues that may be part of another curriculum
because modern education features integrated learning units that are cross-
discipline.

SUMMARY

Item shells, item modeling, key features, generic item sets, and CR item format
conversions are discussed, illustrated, and evaluated. These methods have
much in common because each is intended to speed up item development and
provide a systematic basis for creating new MC test items.
The item shell technique is merely prescriptive. It depends on using existing
items. Item modeling has the fixed structure of item stems that allows for a do-
main definition of encounters, but each item model tests only one type of con-
tent and cognitive operation. Key features depends on an expert committee
and has a systematic approach that links training with licensure testing. The
generic item set approaches item modeling in concept but has a fixed question-
ing structure. Item format conversions provide a basis for taking CR items and
creating MC items that appear to have the same or a similar cognitive demand.
The advantage is that the latter is objectively scorable. Thus, we give up a little
fidelity for greater efficiency.
Each item-generating method in this chapter has potential for improving
the efficiency of item writing as we know it. The greatest reservation in using
any item-generation method is the preparation required at the onset of
item-writing training. SMEs need to commit to an item-generation method
and use their expertise to develop the infrastructure needed for item genera-
tion, regardless of which item-generation method is used. Although item shells
and item modeling have much in common, further developments will probably
One hot, sunny day Sally left two
buckets of water out in the sun.
The two buckets were the same
except that one was black and one
was white. At the end of the day,
Sally noticed that the water in the
black bucket felt warmer than the
water in the white bucket. Sally
wondered why this happened, so
the next day she left the buckets of
water out in the hot sun again. She
made sure that there was the same
amount of water in each bucket. This time she carefully measured
the temperature of the water in both buckets at the beginning of the
day and at the end of the day. The pictures show what Sally found.
1. Which of the two containers has the hottest water before
sitting in the sun?
A. Black
B. White
C. They are both the same temperature.

2. Which of the two containers has the hottest water after sitting
in the sun?
A. Black
B. White
C. They are both the same.

Which of the following reasons support your answer?


Mark A if true and B if false.

3. Black soaks up the sun's rays.


4. White soaks up the sun's rays.
5. The sun's rays bounce off black.
6. The sun's rays bounce off white.
Key: 1. C, 2. A, 3. True, 4. False, 5. False, 6. True.

EXAMPLE 7.18. Adapted from the National Assessment of


Educational Progress (http://nces.ed.gov/nationsreportcard/).

Jay had two bags of chips. He's concerned about his diet. So he
looked at the label on one of these bags. Nutrition Facts: Serving
size: 1 bag (28 grams). Amount Per Serving: Calories 140; Calories
from Fat 70. Ingredients: potatoes, vegetable oil, salt.
Total Fat 8 g. 12%
Saturated Fat 1.5 g. 8%
Cholesterol 0 mg. 0%
Sodium 160mg. 7%
Total Carbohydrates 16g. 5%
Dietary Fiber 1 g. 4%
Sugars 0 g.
Protein 2g.

Percent of Daily Allowance


Vitamin A 0% Vitamin C 8%
Calcium 0% Iron 2%

1. How many calories did he have?


A. 70
B. 140
C. 280

2. His daily allowance of sodium is 2400 mg. Did he have too


much sodium?
A. Yes
B. No
C. Not enough information given

3. His daily allowance of fat grams is 65. By having two bags of


potato chips, how is he doing?
A. More than his allowance
B. Way less than his allowance
C. Cannot say from information given


4. How much vitamin C did he get toward his daily allowance?


A. 0%
B. 2%
C. 8%
D. 16%

5. What is the primary ingredient in this package?


A. Potatoes
B. Vegetable oil
C. Salt

EXAMPLE 7.19. Example of an item set created


from a food label.

favor item modeling because of its inherent theoretical qualities that strike at
the foundation of professional competence. Key features have potential for
item generation but in a specific context, such as patient treatment and clinical
problem solving. Its applicability to other professions and general education
remains to be shown. Generic item sets work well in training or education, es-
pecially for classroom testing. Their applicability to testing programs may be lim-
ited because too many item sets appear repetitious and may cue test takers.
Adapting CR items to MC formats is a simple, direct way to make scoring ob-
jective while keeping the higher cognitive demand intended.
As the pressure to produce high-quality test items that measure more than
recall increases, we will see increased experimentation and new developments
in item generation. Wainer (2002) estimated the cost of developing a new item
for a high-quality testing program as high as $1,000. With more computer-
based and computer-adaptive testing, we will see heavier demands for
high-quality MC items. Item generation will have a bright future if items can be
created that have the same quality or better than are produced by item writers.
Test content that has a rigid structure can be more easily transformed via item-
generation methods, as the many methods discussed in Irvine and Kyllonen
(2002) show. Theories of item writing that feature automated item generation are
much needed for content involving ill-structured problems that we commonly
encounter in all subject matters and professions. Until the day that such theo-
ries are transformed into technologies that produce items that test problem
solving in ill-structured situations, the simpler methods of this chapter should
help item writers generate items more efficiently than the traditional way of
grinding out one item after another.
III
Validity Evidence Arising
From Item Development and
Item Response Validation

A central premise in this book is that item response interpretations or uses are
subject to validation in the same way that test scores are subject to validation.
A parallelism exists between validation pertaining to test scores and validation
pertaining to item responses. Because item responses are aggregated to form
test scores, validation should occur for both test scores and item responses.
Also germane, a primary source of validity evidence supporting any test
score interpretation or use involves test items and responses to test items.
Thus, the study of items and item responses becomes an important part of test
score validation. Part of this validity evidence concerning items and item re-
sponses should be based on the quality of test items and the patterns of item re-
sponses that are elicited by these items during a testing session (Downing &
Haladyna, 1997). The three chapters in this section address complementary
aspects of this item response validation process. Chapter 8 discusses the kinds
of validity evidence that comes from following well-established procedures in
test development governing item development. Chapter 9 discusses the study of
item responses that is commonly known as item analysis. Chapter 10 provides
more advanced topics in the study of item responses. The procedures of chapter
8 coupled with the studies described in chapters 9 and 10 provide a body of evi-
dence that supports this validity argument regarding test score interpretation
and use. Thus, collecting and organizing evidence supporting the validity of
item responses seems crucial in the overall evaluation of validity that goes on in
validation.
8

Validity Evidence Coming From Item Development Procedures

OVERVIEW

After an item is written, several item improvement activities should be un-


dertaken. Both research and experience have shown that many MC items are
flawed in some way at the initial stage of item development, so these activities
are much needed. The time invested in these activities will reap many re-
wards later. The more polish applied to new items, the better these items be-
come. However, some of these activities are more important than others and
deserve more attention. We can view the processes described in this chapter
as part of the validation process. Documentation of these activities consti-
tutes an important source of validity evidence (Downing & Haladyna, 1997;
Haladyna, 2002). Table 8.1 lists six standards addressing qualities of test
items that come from the Standards for Educational and Psychological Testing
(AERA et al., 1999).
These standards are not as comprehensive in coverage as what appears
in this chapter. Nonetheless, the standards show the importance of ensur-
ing that the basic scoring unit of any test, the test item, is also subjected to
validation.
In any high-quality testing program, the activities recommended in this
chapter are essential. For items used with students as part of classroom assess-
ment, the activities prescribed in this chapter are desirable but impractical.
Nonetheless, the improvement of achievement testing hinges on the ability
of test makers to develop highly effective test items. To accomplish this goal,
all items need to be reviewed.

TABLE 8.1
Standards Applying to Item Development

3.6. The types of items, the response formats, scoring procedures, and test
administration procedures should be selected based on the purposes of the test,
the domain to be measured, and the intended test takers. To the extent possible,
test content should be chosen to ensure that intended inferences from test scores
are equally valid for members of different groups of test takers. The test review
process should include empirical analyses and, when appropriate, the use of expert
judges to review items and response formats. The qualifications, relevant
experiences, and demographic characteristics of expert judges should also be
documented.
3.7. The procedures used to develop, review, and try out items, and to select items
from the item pool should be documented. If items were classified into different
categories or subtests according to the test specifications, the procedures used for
the classification and the appropriateness and accuracy of the classification should
be documented.
3.8. When item tryouts or field tests are conducted, the procedures used to select
the sample(s) of test takers for item tryouts and the resulting characteristics of
the sample(s) should be documented. When appropriate, the sample(s) should be
as representative as possible of the population(s) for which the test is intended.
3.11. Test developers should document the extent to which the content domain of a
test represents the defined domain and test specifications.
6.4. The population for whom the test is intended and the test specification should be
documented. If applicable, the item pool and scale development procedures
should be described in the relevant test manuals.
7.4. Test developers should strive to identify and eliminate language, symbols, words,
phrases, and content that are generally regarded as offensive by members of racial,
ethnic, gender, or other groups, except when judged to be necessary for adequate
representation of the domain.

In the first part of this chapter, several overarching concerns and issues are
discussed. These are content definition, test specifications, item writer train-
ing, and security. In the second part, eight complementary item review activities
are recommended for any testing program. These include the following:
(a) adhering to a set of item-writing guidelines, (b) assessing the cognitive demand
of each item, (c) assessing the content measured by each item, (d) editing the item,
(e) assessing potential sensitivity or unfairness of each item, (f) checking the
correctness of each answer, (g) obtaining answer justifications from test takers,
and (h) conducting a think-aloud, where test takers provide feedback about each item.

GENERAL CONCERNS AND ISSUES

Content Definition

The term content validity was traditionally used to draw attention to the impor-
tance of content definition and the many activities ensuring that the content
of each item is systematically related to this definition. Messick (1989) has ar-
gued that because content is not a property of tests but of test scores, content
validity has no relevance. Content-related evidence seems a more appropriate
perspective (Messick, 1995b).
Therefore, content is viewed as an important source of validity evidence.
The Standards for Educational and Psychological Testing (AERA et al., 1999)
make many references to the importance of content in the validation of any
test score interpretation or use. The parallelism between test scores and items
is made in chapter 1 and is carried out here. Each item has an important con-
tent identity that conforms to the test specification. Expert judgment is needed
to ensure that every item is correctly classified by content.

Classroom Testing. For this type of testing, the instructional objective has
long served as a basis for both defining learning and directing the content of
tests. States have developed content standards replete with lists of perfor-
mance objectives. Terminology may vary. Terms such as objectives, instructional
objectives, behavioral objectives, performance indicators, amplified objectives, and
learner outcomes are used. The quintessential Mager (1962) objective is shown
in Table 8.2.
Some interest has been expressed in developing cognitive abilities, such as
reading and writing. Whereas there is still heavy reliance on teaching and
testing atomistic aspects of the curriculum that the objective represents, the
measurement of a cognitive ability requires integrated performance that may
involve reading, writing, critical thinking, problem solving, and even creative
thinking. MC items may not be able to bear the load for such complex behav-
ior. However, many chapters in this book argue for and present examples of attempts to measure complex cognitive behavior using the MC format.

TABLE 8.2
Anatomy of an Objective

Anatomical Feature Example


TSW (The student will) TSW
Action verb Identify examples of invertebrates.
Conditions for performance Animals will be described in terms of characteristics,
habitats, and behaviors. Some will be invertebrate.

Large-Scale Testing. Testing programs may have different bases for defin-
ing content. In professional certification and licensing testing programs,
knowledge and skills are identified on the basis of surveys of the profession.
These surveys are often known as role delineation, job analysis, task analysis, or
professional practice analysis (Raymond, 2001). Respondents rate the impor-
tance or criticality of knowledge and skills to professional practice. Although
candidates for certification and licensure must meet many criteria to earn a
board's credential, an MC test is often one of these criteria. These tests typi-
cally measure professional and basic science knowledge related to professional
practice. The source of content involves expert judgment. No matter what
type of testing, consensus among SMEs is typically used for establishing the
content for a test.

Test Specifications

A systematic process is used to take test content and translate it into test speci-
fications stating how many items will be used and which content topics and
cognitive processes will be tested. Kane (1997) described ways that we can
establish the content and cognitive process dimensions of our test specifica-
tions. Generally, the effort to create test specification again rests on expert
judgment. As Messick (1995b) expressed, test specifications provide bound-
aries for the domain to be sampled. This content definition is operationalized
through test specifications. Most measurement textbooks discuss test specifi-
cations. They generally have two dimensions: content and cognitive processes.
Chapter 2 discusses a simple classification system for cognitive processes that is
consistent with current testing practices.
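
As a concrete illustration, a two-way test specification can be represented as a simple table of percentages that is converted into item counts once a test length is chosen. The sketch below is hypothetical: the topic names, percentages, and 100-item test length are assumptions for illustration, not values taken from any particular program.

```python
# Minimal sketch: converting a two-way test specification (percentages of the
# test) into item counts for a hypothetical 100-item test.
# All topic names, percentages, and the test length are illustrative assumptions.

spec = {
    ("Recalling knowledge", "Topic A"): 0.15,
    ("Recalling knowledge", "Topic B"): 0.25,
    ("Understanding knowledge", "Topic A"): 0.20,
    ("Understanding knowledge", "Topic B"): 0.10,
    ("Applying knowledge", "Topic A"): 0.15,
    ("Applying knowledge", "Topic B"): 0.15,
}

test_length = 100  # assumed total number of items

counts = {cell: round(weight * test_length) for cell, weight in spec.items()}

for (process, topic), n_items in counts.items():
    print(f"{process:25s} {topic:10s} {n_items:3d} items")

print("Total items:", sum(counts.values()))
```

Each cell count then tells item writers how many items of each kind the specification calls for, which is one reason the specification should be settled before item writing begins.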

Item-Writing Guide

For a testing program, it is a common practice to have a booklet that every item
writer receives that discusses the formats that will be used, the guidelines that
will be followed, examples of well and poorly written items, a classification sys-
tem for items, directions on submitting and reviewing items, and other salient
information to help future item writers.

Recruiting Item Writers

For testing programs, the expertise of item writers is crucial to the testing pro-
gram's success. Downing and Haladyna (1997) argued that a key piece of valid-
ity evidence is this expertise. The credentials of these item writers should
enhance the reputation of the testing program. Generally, volunteers or paid

item writers are kept for appointed terms, which may vary from 1 to 3 years.
Once they are trained, their expertise grows; therefore, it is advantageous to
have these item writers serve for more than 1 year.

Item Writer Training

For any kind of test, item writers should be trained in the principles of item writ-
ing, as expressed throughout this book and consistent with the item-writing
guide. This training need not take a long time, as content experts can learn the
basics of item writing in a short time. Trainees should learn the test specifica-
tions and the manner in which items are classified by content and cognitive be-
havior. Participants in this training should have supervised time to write items
and engage in collegial review.

Security

In high-stakes testing programs, there is an active effort to obtain copies of tests or test items for the express purpose of increasing performance. This kind of
zeal is evident in standardized testing in public schools, where a variety of tac-
tics are used to increase performance. Although the problem with such testing
may be misinterpretation and misuse of test scores by policy makers, including
legislators and school boards, lack of test security makes it possible to obtain
and compromise legitimate uses of the test. In high-stakes certification and li-
censing tests, poor security may lead to exposed items that weaken the valid in-
terpretation and uses of test scores.
Downing and Haladyna (1997) recommended a test security plan that de-
tails how items are prepared and guarded. If security breaches occur, are re-
placement items available? If so, a replacement test needs to be assembled to
replace the compromised test. As they pointed out, the test security issue cuts
across all other activities mentioned in this chapter because test security is an
overarching concern in test development, administration, and scoring.

REVIEWING ITEMS

In this part of the chapter, eight interrelated, complementary reviews are de-
scribed that are highly recommended for all testing programs. The performing
of each activity provides a piece of validity evidence that can be used to support
both the validity of interpreting and using test scores and item responses.

Review 1: Adherence to Item-Writing Guidelines

Chapter 5 presents item-writing guidelines and examples of the use or misuse of each guideline. Every test item should be subjected to a review to decide

whether items were properly written. The guidelines are really advice based on
a consensus of testing experts; therefore, we should not think of these guide-
lines as rigid laws of item writing but as friendly advice. However, in any
high-stakes testing program, it is important to adopt a set of guidelines and ad-
here to them strictly. Once these guidelines are learned, the detection of
item-writing errors is a skill that can be developed to a high degree of profi-
ciency. Items should be revised accordingly. Violating these guidelines usually
results in items that fail to perform (Downing, 2002). Following these guide-
lines should result in a test that not only looks better but is more likely to per-
form according to expectations.
Table 5.1 (in chapter 5) summarizes these guidelines. A convenient and ef-
fective way to use Table 5.1 in reviewing items is to use each guideline number
as a code for items that are being reviewed. The people doing the review can
read each item and enter the code on the test booklet containing the offending
item. Such information can be used by the test developers to consider redraft-
ing the item, revising it appropriately, or retiring the item. As mentioned previ-
ously, these guidelines are well grounded in an expert opinion consensus, but,
curiously, research is not advanced well enough to cover many of these guide-
lines. Thus, the validity of each guideline varies.
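
To make the coding idea concrete, the guideline codes assigned by reviewers can be tallied per item so that flagged items are routed back for revision. The sketch below is only illustrative; the item identifiers and guideline numbers are hypothetical, and a real program would use the codes from its own item-writing guide.

```python
# Minimal sketch: tallying reviewer-assigned guideline codes per item.
# Item IDs and guideline numbers below are hypothetical examples.

from collections import defaultdict

# Each tuple: (item_id, guideline code flagged by a reviewer)
review_flags = [
    (101, 5), (101, 12),   # item 101 flagged on two different guidelines
    (102, 5),              # item 102 flagged once
    (104, 19), (104, 19),  # two reviewers flagged item 104 on the same guideline
]

violations = defaultdict(list)
for item_id, code in review_flags:
    violations[item_id].append(code)

for item_id, codes in sorted(violations.items()):
    print(f"Item {item_id}: guideline codes {sorted(set(codes))} -> return for revision")
```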

Review 2: Cognitive Process

Chapter 2 provides a simple basis for classifying items: recall or understanding of knowledge, skills, and the application of knowledge and skill in some com-
plex way. Any classification system rests on the ability of content experts to
agree independently about the kind of behavior elicited by its test takers.

Review 3: Content

The central issue in content review is relevance. In his influential essay on va-
lidity, Messick (1989) stated:

Judgments of the relevance of test items or tasks to the intended score interpre-
tation should take into account all aspects of the testing procedure that signifi-
cantly affect test performance. These include, as we have seen, specification of
the construct domain of reference as to topical content, typical behaviors, and
underlying processes. Also needed are test specifications regarding stimulus for-
mats and response alternatives, administration conditions (such as examinee in-
structions or time limits), and criteria for item scoring. (p. 276)

As Popham (1993) pointed out, the expert judgment regarding test items has
dominated validity studies. Most classroom testing and formal testing pro-
grams seek a type of test score interpretation related to some well-defined con-

tent (Fitzpatrick, 1981; Kane, 2002; Messick, 1989). Under these conditions,
content is believed to be definable in terms of a domain of knowledge (e.g., a set
of facts, concepts, principles, or procedures). Under these circumstances, each
test is believed to be a representative sample of the total domain of knowledge.
As Messick (1989) noted, the chief concerns are clear construct definition, test
specifications that call for the sample of content desired, and attention to the
test item formats and response conditions desired. He further added adminis-
tration and scoring conditions to this area of concern.
As noted by Popham (1993) previously, the technology involves the use
of content experts, persons intimate with the content who are willing to re-
view items to ensure that each item represents the content and level of cog-
nitive behavior desired. The expert or panel of experts should ensure that
each item is relevant to the domain of content being tested and is properly
identified as to this content. For example, if auto mechanics' knowledge of
brakes is being tested, each item should be analyzed to figure out if it belongs
to the domain of knowledge for which the test is designed and if it is cor-
rectly identified.
Although this step may seem tedious, it is sometimes surprising to see items
misclassified by content. With classroom tests designed to measure student
achievement, students can easily identify items that are instructionally irrele-
vant. In formal testing programs, many detection techniques inform users
about items that may be out of place. This chapter discusses judgmental con-
tent review, whereas chapters 9 and 10 discuss statistical methods.
Methods for performing the content review were suggested by Rovinelli and
Hambleton (1977). In selecting content reviewers, these authors made the fol-
lowing excellent points:

1. Can the reviewers make valid judgments regarding the content of items?
2. Is there agreement among reviewers?
3. What information is sought in the content review?
4. What factors affect the accuracy of content judgments of the reviewers?
5. What techniques can be used to collect and analyze judgments?

Regarding the last point, the authors strongly recommended using the simplest
method available.
Toward that end, the review of test items can be done in formal testing pro-
grams by asking each content specialist to classify the item according to an item
classification guide. Rovinelli and Hambleton (1977) recommended a simple
3-point rating scale:

1. Item is correctly classified.


2. Uncertain.
3. Item is incorrectly classified.

Rovinelli and Hambleton (1977) also provided an index of congruence between the original classification and the content specialists' classification. The
index can be used to identify items having content classification problems. A
simpler index might be any item with a high frequency of ratings of 2 or 3 as de-
termined from the preceding scale. If the cognitive level of each item is of con-
cern, the same kind of rating can be used.
Figure 8.1 provides test specification for the mythical Azalea Growers' Certifi-
cation Test. The first dimension is the topic dimension for content. The second
dimension, at the left, is cognitive process, which has three types: recall, under-
standing, and application. Figure 8.2 provides a hypothetical set of ratings from
three azalea-growing experts regarding three items they were evaluating.

Topics
Behavior Watering Fertilizing Soil Total
Recalling knowledge 15% 15% 10% 40%
Understanding knowledge 10% 10% 10% 30%
Applying knowledge 15% 5% 10% 30%
Total 40% 30% 30% 100%

FIG. 8.1. Test specifications for the Azalea Growers' Certification Test.

Reviewers
Item Original Classification #1 #2 #3
82 Watering 3 3 3
83 Fertilizer 1 1 1
84 Soil 1 2 1
85 Soil 2 3 2
86 Light 1 1 1

FIG. 8.2. Excerpt of reviews from three content reviewers.
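
A minimal way to apply the simpler index described above is to count, for each item, how many reviewers gave a rating of 2 (uncertain) or 3 (incorrectly classified) and to flag items that exceed some threshold. The sketch below uses the hypothetical ratings in Fig. 8.2; the flagging threshold is an assumption that each program would set for itself.

```python
# Minimal sketch: flagging items whose content classification is questioned by
# reviewers. Ratings follow the 3-point scale above:
# 1 = correctly classified, 2 = uncertain, 3 = incorrectly classified.
# Data are the hypothetical ratings shown in Fig. 8.2.

ratings = {
    82: [3, 3, 3],
    83: [1, 1, 1],
    84: [1, 2, 1],
    85: [2, 3, 2],
    86: [1, 1, 1],
}

flag_threshold = 2  # assumed: flag when two or more reviewers rate 2 or 3

for item_id, item_ratings in ratings.items():
    n_questioned = sum(1 for r in item_ratings if r >= 2)
    if n_questioned >= flag_threshold:
        print(f"Item {item_id}: {n_questioned} of {len(item_ratings)} reviewers "
              "question the classification -> recheck the content code")
```

Running the sketch flags items 82 and 85, which matches what a visual scan of Fig. 8.2 suggests.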

The science of content review has been raised beyond merely expert judg-
ment and simple descriptive indicators of content agreement. Crocker, Llabre,
and Miller (1988) proposed a more sophisticated system of study of content rat-
ings involving generalizability theory. They described how theory can be used to
generate a variety of study designs that not only provide indexes of content-
rater consistency but also identify sources of inconsistency. In the context of a

high-stakes testing program, procedures such as the one they recommend are
more defensible than simplistic content review procedures.
Content review has been a mundane aspect of test design. As Messick
(1989) noted, although most capable test development includes these impor-
tant steps, we do not have much systematic information in the literature that
informs us about what to use and how to use it. Hambleton (1984) provided a
comprehensive summary of methods for validating test items.

Review 4: Editorial

No matter the type of testing program or the resources available for the devel-
opment of the test, having each test professionally edited is desirable. The edi-
tor is someone who is usually formally trained in the canons of English grammar
and composition.
There are several good reasons for editorial review. First, edited test items
present the cognitive tasks in a clearer fashion than unedited test items. Editors
pride themselves on being able to convert murky writing into clear writing
without changing the content of the item. Second, grammatical, spelling, and
punctuation errors tend to distract test takers. Because great concentration is
needed on the test, such errors detract from the basic purpose of testing, to find
the extent of knowledge of the test taker. Third, these errors reflect badly on
the test maker. Face validity is the tendency for a test to look like a test. If there
are many errors in the test, the test takers are likely to think that the test falls
short in the more important areas of content and item-writing quality. Thus,
the test maker loses the respect of test takers. Several areas of concern for the editorial review are shown in Table 8.3.
A valuable aid in testing programs is an editorial guide. This document is
normally several pages of guidelines about acceptable formats, accepted abbre-
viations and acronyms, styles conventions, and other details of item prepara-

TABLE 8.3
Areas of Concern in the Editorial Review

Areas of Concern Aspects of the Review


1. Clarity Item stem clearly presents the problem, and options provide
coherent and plausible responses
2. Mechanics Spelling, abbreviations and acronyms, punctuation, and
capitalization
3. Grammar Complete sentences, correct use of pronouns, correct form and
use of verbs, and correct use of modifiers
4. Style Active voice, conciseness, positive statements of the problem in
the stem, consistency

tion, such as type font and size, margins, and so on. For classroom testing,
consistency of style is important.
There are some excellent references that should be part of the library of a
test maker, whether professional or amateur. These appear in Table 8.4.
A spelling checker on a word processing program is also handy. Spelling
checkers have resident dictionaries for checking the correct spelling of many
words. However, the best feature is the opportunity to develop an exception
spelling list, where specialized words not in the spelling checker's dictionary
can be added. Of course, many of these types of words have to be verified first
from another source before each word can be added. For example, if one
works in medicine or in law, the spelling of various medical terms can be
checked in a specialized dictionary, such as Stedman's Medical Dictionary for
which there is a Web site (http://www.stedmans.com/), which also has a
CD-ROM that checks more than half a million medical phrases and terms.
Another useful reference is Black's Law Dictionary (Garner, 1999).

Review 5: Sensitivity and Fairness

Fairness has been an important issue in test development and in the use of test
scores. Chapter 7 in the Standards for Educational and Psychological Testing is de-
voted to fairness. Standard 7.4 in that chapter asserts:

Test developers should strive to identify and eliminate language, symbols, words,
phrases, and content that are generally regarded as offensive by members of racial,
ethnic, gender, or other groups, except when judged to be necessary for adequate
representation of the domain. (AERA et al., 1999, p. 82)

TABLE 8.4
References on Grammar, Composition, & Style

Gibaldi, J. (1999). The MLA handbook for writers of research papers (5th ed.). New
York: Modern Language Association of America.
The American Heritage Book of English usage: A practical and authoritative guide to
contemporary English. (1996). Boston: Houghton Mifflin.
American Psychological Association. (2001). Publication manual of the American
Psychological Association (5th ed.). Washington, DC: Author.
Strunk, W., Jr., & White, E. B. (2000). The elements of style (4th ed.). Boston: Allyn &
Bacon and Longman.
The Chicago manual of style (14th ed.). (1993). Chicago: University of Chicago Press.
Warriner, J. E. (1988). English grammar and composition: Complete course. New York:
Harcourt, Brace, & Jovanovich.

Fairness review generally refers to two activities. The first is a sensitivity re-
view aimed at test items that potentially contain material that is sexist, rac-
ist, or otherwise potentially offensive or negative to any group. The second
is an analysis of item responses to detect differential item functioning,
which is discussed in chapter 10. We should think of fairness as a concern for
all of testing.
This section focuses on this first fairness review, often referred to as the sensi-
tivity review. Chapter 10 has a section on the second type of fairness activity,
DIF. The sensitivity review concerns stereotyping of groups and language that
may be offensive to groups taking the test.
The Educational Testing Service has recently issued a new publication on
fairness (2003) (http://www.ets.org/fairness/download.html). Since 1980, Ed-
ucational Testing Service has led the testing industry by issuing and continu-
ously updating guidelines. It has exerted a steadying influence on the testing
industry to be more active in watch guarding the content of test in an effort not
to avoid negative consequences that arise from using content that might offend
test takers.
We have many good reasons for being concerned about fairness and sensitiv-
ity. First and foremost, Zoref and Williams (1980) noted a high incidence of
gender and ethnic stereotyping in several prominent intelligence tests. They
cited several studies done in the 1970s where similar findings existed for
achievement tests. The extent to which this kind of bias exists in other standardized tests today is a matter of speculation. However, any incidence of insensitive
content should be avoided.
Second, for humanistic reasons, all test makers should ensure that items do
not stereotype diverse elements of our society. Stereotyping is inaccurate be-
cause of overgeneralization. Stereotyping may cause adverse reactions from
test takers during the test-taking process.
Table 8.5 provides some criteria for judging item sensitivity, adapted from
Zoref and Williams (1980). Ramsey (1993) urged testing personnel to identify
committees to conduct sensitivity reviews of test items and to provide training
to committee members.
He recommended four questions to pose to committee members:

• Is there a problem?
• If there is, which guideline is violated?
• Can a revision be offered?
• Would you sign off on the item if no revision was made? In other words,
how offensive is the violation?

Educational Testing Service (2003) recommended the following set of stan-


dards concerning sensitivity:

TABLE 8.5
A Typology for Judgmental Item Bias Review

Gender
Representation: Items should be balanced with respect to gender representations. Factors to consider include clothes, length of hair, facial qualities, and makeup. Nouns and pronouns should be considered (he/she, woman/man).
Characterization: Two aspects of this are role stereotyping (RS) and apparel stereotyping (AS). Male examples of RS include any verbal or pictorial reference to qualities such as intelligence, strength, vigor, ruggedness, historical contributions, mechanical aptitude, professionalism, and/or fame. Female examples of RS include depicting women in domestic situations, passiveness, weakness, general activity, excessive interest in clothes or cosmetics, and the like. AS is viewed as the lesser of the two aspects of characterization. AS refers to clothing and other accouterments that are associated with men and women, for example, neckties and cosmetics. This latter category is used to support the more important designation of the former category in identifying gender bias in an item.

Race/Ethnic
Representation: Simply stated, if the racial or ethnic identity of characters in test items is present, it should resemble the demographics of the test-taking population.
Characterization: White characters in test items may stereotypically be presented in leadership roles or as wealthy, professional, technical, intelligent, academic, and the like. Minority characters are depicted as unskilled, subservient, undereducated, poor, or in professional sports.

Note. Adapted from Zoref and Williams (1980).

1. Treat people with respect in test materials. All groups and types of people
should be depicted in a wide range of roles in society. Representation of
groups and types of people should be balanced, not one-sided. Never mock
or hold in low regard anyone's beliefs. Avoid any hint of ethnocentrism or
group comparisons. Do not use language that is exclusive to one group of
people. Educational Testing Service used an example to illustrate this point:
"All social workers should learn Spanish." This implies that most people in
need of welfare are Spanish speaking.
2. Minimize construct-irrelevant knowledge. As a nation, we are fond of
figures of speech, idioms, and challenging vocabulary. We need to avoid in
our tests specialized political words, regional references, religious terms, eso-
teric terms, sports, and the like. The intent is to ensure that prior knowledge
does not in some way increase or decrease performance.

3. Avoid inflammatory or controversial material. Some topics that may creep into a test might make reference to abortion, child abuse, evolution,
euthanasia, genocide, hunting, Jesus, Satanism, slavery, sex crimes, suicide,
and the like. Tests should never advocate positions for or against, as might
be taken in any of the preceding references.
4. Use appropriate terminology. Avoid labeling people, but if labels must be
used, use appropriate labels, such as listed in Example 8.1.

Appropriate Adjectives Inappropriate Adjectives

African American or Black Negro or Colored


Asian American,
Pacific Island American,
Asian/Pacific Island American Oriental
Native American, American Indian Eskimo, Indian
White, Caucasian, European American

EXAMPLE 8.1. Appropriate and inappropriate ethnic and racial labels.

With most test items, one's race or ethnicity is seldom important to the
content of the item. Therefore, it is best not to use such labels unless justi-
fied in the opinion of the committee doing the sensitivity review.
References to men and women should be parallel. Never refer to the appearance of a person in terms of gender. Be careful about the use of boys and girls; those terms are reserved for persons below the age of 18. When de-
picting characters in test items, include men and women equally. Avoid ge-
neric terms such as he or man. Avoid references to a person's sexual
preference. Avoid references to the age of a person unless it is important.
5. Avoid stereotypes. We should avoid using terms that may be part of our
normal parlance but are really stereotypes. The term Indian giver is one that
conveys a false image. "You throw the ball like a girl" is another stereotype
image that should be avoided. Even though we may want to stereotype a
group in a positive way, it is best to avoid stereotyping.

Sensitivity reviews are essential to testing programs. The sensitivity review provides a useful complement to statistical studies of item responses in
chapter 10. A panel should be convened of persons who will do the sensitiv-
ity review. Educational Testing Service (2003) recommended that a sensi-
tivity review committee have specific training for its task and have no stake

in the test items being reviewed. The review procedure should be docu-
mented and should become part of the body of validity evidence. Chal-
lenged items should be reviewed in terms of which guideline is potentially
violated. Other members should decide on the outcome of the item. Chal-
lenged items should never be released to the public. As you can see, sensitivity review will continue to be an important aspect of the
item development process.

Review 6: Key Check (Verification of the Correct Answer)

When an item is drafted, the author of the item usually chooses one of the MC
options as the key (correct answer). The key check is a method for ensuring
that there is one and only one correct answer. Checking the key is an important
step in item development. The key check should never be done superficially or
casually. Why is it necessary to check the key? Because several possibilities exist
after the test is given and the items are statistically analyzed:

1. There may be no right answer.


2. There may be a right answer, but it is not the one that is keyed.
3. There may be two or more right answers.

What should be done if any of these three circumstances exist after the test
is given? In any testing program where important decisions are made based on
test scores, the failure to deal with key errors is unfair to test takers. In the un-
likely event of the first situation, the item should be removed from the test and
not be used to compute the total score. The principle at stake is that no test
taker should be penalized for the test maker's error. If the second or third condi-
tions exist, right answers should be rekeyed and the test results should be re-
scored to correct any errors created by either situation. These actions can be
avoided through a thorough, conscientious key check.

Performing the Key Check. The key check should always be done by a
panel of SMEs. These experts should agree about the correct answer. The ex-
perts should self-administer each item and then decide if their response
matched the key. If it fails to match the key, the item should be reviewed, and
through consensus judgment, the key should be determined. If a lack of con-
sensus exists, the item is inherently ambiguous or otherwise faulty. The item
should be revised so that consensus is achieved about the key, or the item
should be retired.
Another way to validate a key is to provide a reference to the right answer
from an authoritative source, such as a textbook or a journal. This is a common
practice in certification and licensing testing programs. The practice of provid-

ing references for test items also ensures a faithfulness to content that may be
part of the test specifications.

Review 7: Answer Justification

Answer Justification for a Testing Program. One of the best sources of information about the correct answer of any MC test item is the person for
whom the test item is intended. Whether that person is a candidate for certifi-
cation or licensure or a student who is developing reading, writing, or mathe-
matical problem-solving ability, their analysis of test items can provide
important and useful insights about the quality of each item. Fortunately, there
has been increased interest in this topic and some research to increase our un-
derstanding of answer justification.
Answer justification is a systematic study of correct answers from the stand-
point of those who are going to or have taken the test. Chapter 9 shows how we
use item response patterns to gain insight into how students perform on each
test item. This is one type of validity evidence. But another type of validity evi-
dence is the consensus that builds from affirmation by those taking the test.
Thus, the survey of test takers provides another piece of validity evidence that
accumulates, providing us with support to use the item with confidence.
However, there is another important value in answer justification. If for
some reason the items are in some way flawed, answer justification may un-
cover these flaws. A process that allows answer justification is rarely used in
high-stakes testing programs but can be useful as a deterrent against ambiguous
items. If a candidate in some high-stakes testing program is close to the cut
score, a single answer justification in favor of the candidate may make a differ-
ence between passing and failing. How we incorporate answer justification in
some high-stakes testing program is yet to be determined. But the desirability
should be evident.

Answer Justification in the Classroom. As we know, the test items prepared for classroom use do not have the same quality control as we see with
testing programs. Also, most writers of MC items for classroom use are not
particularly well trained or experienced in item writing. An important safe-
guard against poorly written MC items is answer justification, as described in
the following:

1. The answer justification review provides the instructor or test developer useful information about how well the item works. Therefore, the in-
formation is part of the evaluation of item performance. This information
complements the statistical analysis of item performance discussed in
chapter 9.

2. As briefly mentioned in chapter 5, such review can defuse the threat that trick items are included. By having students or test takers think aloud
about how they chose their answer, we can gain insight into the trickiness or
ambiguity of our items.

The next class period following a classroom test should be spent discussing
test results. The primary purpose is to help students learn from their errors. If
learning is a continuous process, a posttest analysis can be helpful in subse-
quent learning efforts. A second purpose, however, is to detect items that fail to
perform as intended. The expert judgment of classroom learners can be mar-
shaled for exposing ambiguous or misleading items.
After a classroom test is administered and scored, it is recommended that
students have an opportunity to discuss each item and provide alternative rea-
soning for their wrong answers. Sometimes, they may prove the inherent weak-
ness in the item and the rationale for their answer. In these circumstances, they
deserve credit for their responses. Such informal polling also may determine
that certain items are deficient because the highest scoring students are chron-
ically missing the item or the lowest scoring students are chronically getting an
item right. Standard item analysis also will reveal this, but the informal polling
method is practical and feasible. In fact, it can be done immediately after the
test is given, if time permits, or at the next class meeting. Furthermore, there is
instructional value to the activity because students have the opportunity to
learn what they did not learn before being tested. An electronic version of this
polling method using the student-problems (S-P) chart is reported by Sato
(1980), but such a technique would be difficult to carry out in most instruc-
tional settings because of cost. On the other hand, the informal polling method
can simulate the idea behind the S-P process and simultaneously provide ap-
peals for the correct scoring of the test and provide some diagnostic teaching
and remedial learning.
An analysis for any single student can reveal the nature of the problem.
Sometimes, a student may realize that overconfidence, test anxiety, lack of
study or preparation, or other factors legitimately affected performance, or the
analysis may reveal that the items were at fault. In some circumstances, a stu-
dent can offer a correct line of reasoning that justifies an answer that no one
else in the class or the teacher thought was right. In these rarer circumstances,
credit could be given. This action rightfully accepts the premise that item writ-
ing is seldom a perfect process and that such corrective actions are sometimes
justified.
Another device for obtaining answer justification is the use of a form where
the student writes out a criticism of the item or the reasoning used to select his
or her response (Dodd & Leal, 2002). The instruction might read:

Present any arguments supporting the answer you chose on the test.

Nield and Wintre (2002) have been using this method in their introductory
psychology classes for several years with many positive results. In a survey of
their students, 41% used the answer justification option. They reported that
student anxiety may have been lessened and that they gained insight into their
teaching as well as identified troublesome items. They observed that few stu-
dents were affected by changes in scoring, but they also noted that higher
achieving students were more likely to gain score points as a result of the an-
swer justification process.
Naturally, students like this technique. Dodd and Leal (2002) reported in
their study of answer justification that 93% of their students thought the proce-
dure should be used in other classes. They cited many benefits for answer justi-
fication, including the following:

1. Makes guessing less of an issue.


2. Eliminates the problems associated with ambiguous items.
3. Creates a healthy dialogue between student and instructor.
4. Eliminates the need to rescore the test.
5. Rewards students who can justify their choice.
6. Gives students higher scores that they deserve.
7. Improves the relationship with the instructor.
8. Eliminates the issue of trick items being present in the test.

Answer justification seems like an excellent strategy for classroom instruction or training where student learning is the goal and fairness in testing is val-
ued. Therefore it is enthusiastically recommended for instructors in any subject
matter or educational level where MC items are used.

Review 8: Think-Aloud

For think-aloud, students are grouped around a table and asked to respond to a
set of MC items. During that time, the test administrator sits at the table with
the students and talks to the students as they encounter each item. They are
encouraged to talk about their approach to answering the item. The adminis-
trator often probes to find out what prompted certain answers. The setting is
friendly. Students talk with the administrator or each other. The administrator
takes notes or audio- or videotapes the session. We should consider the value of
think-aloud in two settings, research and item response validation.

As a Research Method. As noted in chapter 3, the basis for the think-aloud procedure comes from studies of cognitive behavior. Norris (1990) pro-
vided an excellent review of both the history and the rationale for think-aloud.
He provided a useful taxonomy of elicitation levels, which is shown in Table 8.6.

TABLE 8.6
Descriptions of Elicitation Levels

Elicitation Level Description


Think-aloud Participants were instructed to report all they were thinking as
they worked through the items and to mark their answers on a
standardized answer sheet.
Immediate recall Participants were instructed to mark their answers to each item
on a standardized answer sheet and to tell immediately after
choosing each answer why they chose it.
Criteria probe Participants were instructed to mark their answers to each item
on a standardized answer sheet and were asked immediately after
marking each answer whether a piece of information pointed out
in the item had made any difference to their choice.
Principle probe Participants were treated as in the criteria probe group with
an additional question asking whether their answer choice was
based on particular general principles.
No elicitation Participants were not interviewed but were instructed to work
alone and to mark their answers on a standardized sheet.

One conclusion that Norris drew from his experimental study of college students
is that the use of the six levels of elicitation of verbal reports did not affect cogni-
tive test behavior. Some benefits of this kind of probing, he claimed, include de-
tecting misleading expressions, implicit clues, unfamiliar vocabulary, and
alternative justifications. Skakun and Maguire (2000) provided an updated re-
view of think-aloud research. As noted in chapter 3, Hibbison (1991) success-
fully used think-aloud to induce students to describe the cognitive processes they
used in answering MC items. Tamir (1993) used this technique in research on
the validity of guidelines for item writing. Skakun et al. (1994) found that their
medical school students approached MC items in 5 ways, processed options in 16
ways, and used 4 cognitive strategies to make a correct choice. Consequently, we
should see the potential of the think-aloud procedure in research on the cogni-
tive processes elicited by various item formats and the validity of item-writing
guidelines. Seeing the link between think-aloud and construct validity, test spe-
cialists have recommended this practice.

Testing Programs. The think-aloud procedure can be helpful in verifying the content and cognitive processes intended for each item. Think-aloud can
also be used to verify the right answer and the effectiveness of distractors.

These outcomes of think-aloud provide validity evidence concerning both content and cognitive behavior believed to be elicited when responding to the
item. Although the think-aloud method is time consuming and logistically dif-
ficult to conduct, it seems well worth the effort if one is serious about validating
test results. Unfortunately, we see too few reports of this kind of validity evi-
dence in all achievement testing programs.

SUMMARY

In the first part of this chapter, some issues are discussed as each affects the pro-
cess we use to review and polish test items. It is crucial to ensure that the con-
struct being measured is well defined and that test specifications are logically
created to reflect this definition. Item writers need to be trained to produce
new, high-quality items. Security surrounds this process and ensures that items
are not exposed or in some other way compromised.
In the second part, eight interrelated, complementary item-review activi-
ties are recommended. Table 8.7 provides a summary of these review activities.
Performing these reviews provides an important body of evidence supporting
both the validity of test score interpretation and uses and the validity of item
response interpretations and uses.

TABLE 8.7
Item Review Activities

1. Item-writing review: Checks items against guidelines for violations.


2. Cognitive demand review: Checks item to see if it elicits the cognitive process
intended.
3. Content review: Checks for accuracy of content classification.
4. Editorial review: Checks items for clarity and any grammar, spelling, punctuation, or
capitalization errors.
5. Sensitivity and fairness review: Checks items for stereotyping of persons or insensitive
use of language.
6. Key check: Checks items for accuracy of correct answer. Ensures that there is only one
right answer.
7. Answer justification: Listens to test takers' alternative explanations for their choices
and gives them credit when justified.
8. Think-aloud: During the field test, subject each item to a round-table discussion by
test takers. The results should inform test developers about the quality of the item for
its intended content and cognitive process. Think-aloud is also a good research
method.
9
Validity Evidence Coming
From Statistical Study
of Item Responses

OVERVIEW
As is frequently noted in this book, the quality of test items depends on two
complementary activities, the item review procedures featured in chapter 8
and the statistical study of item responses featured in this chapter. The docu-
mentation of these review procedures and the statistical studies discussed in
this chapter provide two important sources of validity evidence desired in a
validation.
Once we have completed the item review activities, we may field test the
item. The item is administered but not included in scoring. We turn to item
analysis results to help us make one of three decisions about the future of
each item (a brief screening sketch follows this list):

• Accept the item as is and add it to our item bank and continue to use the
item on future tests.
• Revise the item to improve the performance of the item.
• Reject and retire the item because of undesirable performance.
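
The three decisions are usually guided by statistics introduced later in this chapter, such as item difficulty and discrimination. The sketch below shows how such a screening rule might be coded; the thresholds and item data are illustrative assumptions, not rules from this book, and every program sets its own criteria.

```python
# Minimal sketch: screening field-test items into accept / revise / reject bins.
# The item data and the thresholds are illustrative assumptions only; each
# testing program sets its own criteria.

field_test_items = [
    # (item_id, p_value, point-biserial discrimination) -- hypothetical results
    ("A1", 0.72, 0.34),
    ("A2", 0.95, 0.05),
    ("A3", 0.28, 0.21),
    ("A4", 0.55, -0.10),
]

def screen(p_value, discrimination):
    if discrimination < 0:
        return "reject"    # negative discrimination: retire the item
    if 0.30 <= p_value <= 0.90 and discrimination >= 0.20:
        return "accept"    # add to the operational item bank
    return "revise"        # usable content, but the item needs another look

for item_id, p, d in field_test_items:
    print(item_id, screen(p, d))
```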

In this chapter, first, we consider the nature of item responses and explore the
rationale for studying and evaluating item responses. In this treatment of item
responses, we encounter context for MC testing, statistical theories we will use,
and tools that help us in the study and evaluation of item responses. Second, we
examine three distinctly different yet complementary ways to study item re-
sponses: tabular, graphical, and statistical.


THE NATURE OF ITEM RESPONSES

Every MC item has a response pattern. Some patterns are desirable and
other patterns are undesirable. In this chapter we consider different meth-
ods of study, but some foundation should be laid concerning the patterns of
item responses.
The study of item responses provides a primary type of validity evidence
bearing on the quality of test items. Item responses should follow patterns
that conform with our idea about what the item measures and how
examinees with varying degrees of knowledge or ability should encounter
these items.
Examinees with high degrees of knowledge or ability tend to choose the
right answer, and examinees with low degrees of knowledge or ability tend to
choose the wrong answers. It is that simple. But as you will see, other consider-
ations come in to play that make the study of item response patterns more com-
plex. With this increasing complexity of the study of item responses comes a
better understanding of how to improve these items so that they can serve in
the operational item bank.

Statistical Theories of Test Scores

We can study item response patterns using a statistical theory that explains
item response variation, or we can study item responses in a theory-free
context, which is more intuitive and less complex. Although any statistical
theory of test scores we use may complicate the study of item responses,
should we avoid the theory? The answer is most assuredly no. In this book
we address the problem of studying and evaluating item response patterns
using classical test theory (CTT) but then shift to item response theory
(IRT). Both theories handle the study of item response patterns well. Al-
though these rivaling theories have much in common, they have enough
differences to make one arguably preferable to the other, although which
one is preferable is a matter of continued debate. Some statistical methods
are not theory based but are useful in better understanding the dynamics of
item responses. This chapter employs a variety of methods in the interest of
providing comprehensive and balanced coverage, but some topics require
further study in other sources.
We have many excellent technical references to statistical theories of test
scores (Brennan, 2001; Crocker & Algina, 1986; Embretson & Reise, 2000;
Gulliksen, 1987; Hambleton, 1989; Hambleton, Swaminathan, & Rogers,
1991; Lord, 1980; Lord & Novick, 1968; MacDonald, 1999; Nunnally &
Bernstein, 1994).

Computer Programs for Studying Item Responses

We have a considerable and growing market of computer programs that help us study item response patterns and estimate test and item characteristics. Table 9.1
provides a list of some of these programs with brief descriptions and web ad-
dresses. Two companies in particular provide many useful computer item analysis
programs. These are Assessment Systems Corporation (http://www.assess.com) and Scientific Software International (http://www.ssicentral.com/). Standard
statistical computer programs such as SAS, SPSS, and SYSTAT provide many of
the analyses needed for the study of item responses. All of these computer pro-
grams now operate in a friendly Windows-based environment with short execu-
tion times, and they have capacities for large data sets. All programs provide a
variety of options for the most discriminating users.

TABLE 9.1
Computer Programs Offering Item and Test Information

BILOG3 (Mislevy & Bock, 2002). An IRT program that provides classical and IRT
item analysis information and much more.
BILOG-MG (Zimowski, Muraki, Mislevy, & Bock, 2003). An enhanced version of
BILOG that also provides differential item functioning, item drift, and other more
complex features.
ITEMAN (Assessment Systems Corporation, 1995). A standard test and item analysis
program that is easy to use and interpret. It provides many options including subscale
item analysis.
RASCAL (Assessment Systems Corporation, 1992). This program does a Rasch (one-
parameter) item analysis and calibration. It also provides some traditional item
characteristics. This program shares the same format platform as companion
programs, ITEMAN and XCALIBRE, which means that it is easy to use and
completes its work quickly.
RUMM2010 (Rasch Unidimensional Models for Measurement; Andrich, Lyne,
Sheridan, & Luo, 2001). This program provides a variety of classical and item
response theory item analyses including person fit analysis and scaling. The use of
trace lines for multiple-choice options is exceptional.
TESTFACT4 (Bock et al., 2002). This program conducts extensive studies of item
response patterns including classical and IRT item analysis, distractor analysis, full
information factor analysis, suitable for dichotomous and polytomous item responses.
XCALIBRE (Assessment Systems Corporation). This program complements ITEMAN
and RASCAL by providing item calibrations for the two- and three-parameter model.
The program is easy to use and results are easily interpreted.
CONQUEST (Wu, Adams, & Wilson, 1998). ConQuest fits several item response
models to binary and polytomous data, and produces traditional item analyses.

Weighted or Unweighted Scoring

A complication we face is the choice of the scoring method. For nearly a cen-
tury, test analysts have treated distractors and items equally. That is, all right
answers score one point each and all distractors score no points. Should some
options have more weight than others in the scoring process? If the answer is
no, the appropriate method of scoring is zero-one or binary. With the coming of
dichotomous IRT, the weighting of test items is realized with the use of the two-
and three-parameter logistic response models.
When comparing unweighted and weighted item scoring, the sets of scores
are highly correlated. The differences between unweighted and weighted
scores are small and usually observed in the upper and lower tails of the distri-
bution of test scores. Weighted scoring may be harder to explain to the public,
which is why, perhaps, most test sponsors prefer to use simpler scoring models
that the public more easily understands and accepts. When we use the two- or
three-parameter logistic response models, we are weighting items. Several
examinees with identical raw scores may have slightly different scale scores be-
cause their patterns of responses vary.
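
The point that examinees with identical raw scores can receive slightly different weighted scores can be illustrated with a toy example. The sketch below simply weights each correct response by an invented item discrimination index; it is a rough stand-in for the model-based weighting that two- and three-parameter IRT scoring actually performs, and all of the numbers are made up.

```python
# Minimal sketch: number-correct (unweighted) versus weighted scoring.
# This only illustrates the idea; operational weighted scoring is done through
# an IRT model, not by summing discrimination indices.

import numpy as np

# Hypothetical discrimination weights for five items
weights = np.array([0.8, 1.2, 0.5, 1.5, 1.0])

# Two examinees with the same number-correct score but different patterns
examinee_1 = np.array([1, 1, 1, 0, 0])
examinee_2 = np.array([0, 0, 1, 1, 1])

for label, responses in [("Examinee 1", examinee_1), ("Examinee 2", examinee_2)]:
    raw = responses.sum()
    weighted = (responses * weights).sum()
    print(f"{label}: raw score = {raw}, weighted score = {weighted:.1f}")
```

Both examinees answer three items correctly, but the weighted scores differ (2.5 versus 3.0) because their patterns of responses differ.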

MC Option-Weighted Scoring

In both CTT and IRT frameworks, theoreticians have developed polytomous scoring models that consider the differential information offered in MC
distractors (Haladyna & Sympson, 1988). Methods described in this chapter
show that distractors differ in information about the knowledge or ability being
measured. Can this information be used to improve the accuracy of test scores?
The answer is yes. But the procedures for doing this are complex and the gains
in accuracy may not be worth the extra effort. Chapter 10 provides a discussion
of this possibility.

Instruction or Training Context

To understand how to use item response patterns to evaluate items, we must place the item in a context, which is what the test is supposed to measure
and how test scores will be used. One important context is instruction or
training. In either setting, a test is intended to measure student learning.
With instruction or training, a certain domain of knowledge is to be learned.
An MC test is a sample of that domain. If the test measures status in a do-
main for certifying competence in a profession, licensing a professional, cer-
tifying completion of a course of study, or the like, we are again interested in

accomplishment relevant to a domain. Item response patterns vary as a function of the developmental level of the sample of students or trainees
taking the test. In other words, estimating item characteristics can be slippery in either context. This inability to get a fix on item difficulty when the
sample may be highly accomplished or low achieving is one reason CTT is
criticized. This criticism is overcome by careful sampling. If the sample con-
tains a sample of students representing the full range of achievement, diffi-
culty can be adequately estimated.

ITEM CHARACTERISTICS

In this section, we address several important characteristics of item responses: difficulty, discrimination, guessing or pseudochance, and omitted
responses. As you will see, statistical theories, scoring methods, and instruc-
tional or training contexts come into play in considering the study of these
item characteristics.

INFLUENCE OF SAMPLE SIZE

Before any discussion of calculating and interpreting item characteristics such as difficulty and discrimination, a word of caution is needed about sample size.
With small samples, less than 200 examinees, we should proceed with caution.
Kromrey and Bacon (1992) studied the role of sample size as it affects the accu-
racy of estimation. They found that with sample sizes as small as 20, one could
obtain reasonably good item difficulty estimates, assuming a heterogeneous,
representative sample of examinees; however, with discrimination, large sam-
ple sizes are needed.
In an unpublished paper, Forster (circa 1974) drew random samples of students for whom he calibrated item difficulty for the one-parameter Rasch
model. The population value was 5.95, and his values for samples ranging from
41 to 318 were within 0.47 of a score point. The correlation of student scores
based on these smaller samples ranged from .955 to .989. When samples
dropped below 200, average discrepancies increased sizably. Reckase (2000)
studied the minimum size needed for the three-parameter model, which in-
cludes difficulty, discrimination, and a pseudochance level. He concluded that
300 was a good minimum sample size for estimating the three parameters.

HOMOGENEITY AND HETEROGENEITY OF THE SAMPLE

With CTT, the estimation of difficulty can be biased. If the sample is restricted
but scores are observed in the center of the test score scale, this bias may be less

pronounced or nonexistent. Item discrimination can also be biased with a homogeneous sample, no matter where the scores are located on the test score
scale. A good rule of thumb for item analysis and the study of item characteris-
tics is that the sample should be heterogeneous and representative for the pop-
ulation being tested.
With IRT, the basis for estimating difficulty and discrimination is not de-
pendent on the sample. However, if the sample obtained is greatly restricted in
some way, even IRT will fail in estimating difficulty and discrimination.
As a practical matter, a good investigative technique is to compute the full array
of descriptive statistics for a set of scores including skewness and kurtosis. By
knowing about restriction of range or severe skewness, we can better under-
stand how and why some items fail to perform as expected.

ITEM DIFFICULTY

The CTT Perspective

The first characteristic of item responses is item difficulty. The natural scale for
item difficulty is the percentage of examinees correctly answering the item. The
ceiling of any MC item is 100% and the probability of a correct response deter-
mines the floor when examinees with no knowledge are randomly guessing.
With a four-option item, the floor is 25%, and with a three-option item the
floor is 33%. A commonly used technical term for item difficulty is p value,
which stands for the proportion or percentage of examinees correctly answer-
ing the item.
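As a small illustration (not taken from the text), p values can be computed directly from a matrix of scored item responses. The data and variable names below are hypothetical, and the sketch assumes dichotomous (0/1) scoring.

```python
import numpy as np

# Hypothetical scored responses: rows are examinees, columns are items,
# 1 = correct, 0 = incorrect.
responses = np.array([
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [1, 1, 1, 0],
])

# The p value for each item is the proportion of examinees answering it correctly.
p_values = responses.mean(axis=0)
print(p_values)  # [0.75 0.5  0.75 0.75]
```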
Every item has a natural difficulty, one that is based on the performance of
all persons for whom we intend the test. This p value is difficult to estimate ac-
curately unless a representative group of test takers is being tested. This is one
reason CTT is criticized, because the estimation of the p value is potentially bi-
ased by the sample on which the estimate of item difficulty is based. If the sam-
ple contains well-instructed, highly trained, or high-ability examinees, the
tests and their items appear easy, usually above .90. If the sample contains unin-
structed, untrained, or low-ability people, the test and the items appear hard,
usually below .40, for instance.

The IRT Perspective

IRT allows for the estimation of item difficulty without consideration of exactly
who is tested. With CTT, as just noted, the knowledge or ability level of the
sample strongly influences the estimation of difficulty. With IRT, item difficulty
can be estimated without bias. In the IRT perspective, performance on the item

is governed by the true difficulty of the item and the achievement level of the
person answering the item. The estimation of difficulty is based on this idea,
and the natural scale for difficulty in IRT is log-odds units (logits) that
generally vary between -3.00 and +3.00, with negative values interpreted as
easy and positive values interpreted as hard.
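For reference, the logistic function that underlies this logit scale can be written in its one-parameter (Rasch) form, using conventional notation that is not taken from the text:

P(X = 1 | \theta, b) = \frac{e^{\theta - b}}{1 + e^{\theta - b}}

where \theta is the examinee's achievement and b is the item difficulty, both expressed in logits. When \theta = b, the probability of a correct response is .50, which is one way to interpret an item's location on this scale.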
There are many IRT models. Most are applicable to large testing programs
involving 200 or more test takers. If a testing program is that large and the con-
tent domain is unidimensional, IRT can be effective for constructing tests that
are adaptable for many purposes and types of examinees. The one-, two-, and
three-parameter binary-scoring IRT models typically lead to similar estimates
of difficulty. These estimates are highly correlated with classical estimates of diffi-
culty. The ability to estimate parameters accurately, such as difficulty, provides
a clear advantage for IRT over CTT. IRT is also favored in equating studies and
scaling, and most computer programs listed in Table 9.1 enable IRT equating
and scaling.

Controlling Item Difficulty

What causes an item to be difficult or easy? Studies of factors that control
item difficulty are scarce. Green and Smith (1987), Smith (1986), and Smith
and Kramer (1990) conducted some interesting experiments on controlling
item difficulty. This aspect of item design is a promising research topic. Bejar
(2002) has had the greatest success in developing items and controlling diffi-
culty. He used a generative approach involving an analysis of the construct
being measured, the development of task models, and related tasks. This is a
highly refined item-generation approach intended to produce equivalent items.
Enright and Sheehan (2002) also reported on their research involving small
item sets. The production of items at a known level of difficulty provides an
advantage over the current method typically used. These researchers showed
that understanding and controlling item difficulty is within our reach. Their
research should lead us to more general methods for writing items of prespeci-
fied difficulty. The traditional approach to item writing produces item diffi-
culties that range considerably.
Another possible cause of a p value is the extent to which instruction, train-
ing, or development has occurred with those being tested. Consider an item
with a p value of .66. Was it the composition of the group being tested or the ef-
fectiveness of instruction or training that caused this p value? One clue is to ex-
amine test performance of instructed or uninstructed, trained or untrained,
and developed or undeveloped individuals and groups. This is the idea of in-
structional sensitivity, and it is more fully discussed in a subsequent section of
this chapter.
Another possible cause of a p value is that the item is really not relevant to
the knowledge domain being tested. In this circumstance, we would expect the

item performance pattern to be unintelligible or the p value to be very low
because the item does not relate to instruction or training.

ITEM DISCRIMINATION

The general notion of item discrimination is that high-achieving students tend
to choose the right answer and low-achieving students tend to choose the
wrong answer. Item discrimination is an item characteristic that describes the
item's ability to measure sensitively the individual differences that truly exist
among test takers. If we know test takers to differ in their knowledge or ability,
each test item should mirror the tendency for test takers to be different. Thus,
with a highly discriminating item, those choosing the correct answer must nec-
essarily differ in total score from those choosing any wrong answer. This is a
characteristic of any measuring instrument where repeated trials (items) are
used. Those possessing more of the trait should do better on the items consti-
tuting the test than those possessing less of that trait.
Discrimination of any MC item can be studied from three complementary
perspectives: tabular, graphical, and statistical. The first two are theory free.
These methods are intuitive and provide useful perspectives about item re-
sponses. Statistical methods are highly valued and have the greatest influence
in providing validity evidence. This is because although the tabular and graphi-
cal methods provide basic understanding, statistical methods provide greater
assurance about each item's discriminating ability.

Tabular Method

The most fundamental tabular method involves the mean of those choosing
the correct answer and the mean of those choosing any incorrect answer. A
good way to understand item discrimination is to note that those who chose the
correct answer should have a high score on the test and those who chose any
wrong answer should have a low score on the test. In Table 9.2, the first item has
good discrimination, the second item has lower discrimination, the third item
fails to discriminate, and the fourth item discriminates in a negative way. Such
an item may be miskeyed.
Another tabular method is to divide the sample of examinees into 10 groups
reflecting the rank order of scores. Group 1 will be the lowest scoring group and
Group 10 will be the highest scoring group. In the example in Table 9.3, we have
2,000 examinees ranked by scores from low to high and divided into 10 groups
with 200 examinees in each score group. Note that examinees choosing the right
answer (1) are few in the lowest group but increasingly more

TABLE 9.2
Examples of the Average (Mean) Scores of Those Answering the Item Correctly
and Incorrectly for Four Types of Item Discrimination

Those Answering:      Item 1        Item 2       Item 3    Item 4
Correctly             90%           70%          65%       65%
Incorrectly           30%           60%          65%       75%
Discrimination        Very high     Moderate     Zero      Negative

numerous with higher scoring groups. Also, note that examinees choosing the
wrong answer (0) are numerous in the lowest scoring group but less frequent in
the highest scoring group.
These tabular methods are fundamental to understanding the nature of
item responses and discrimination. These methods provide the material for
other methods that enhance this understanding of item discrimination.
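A frequency table of this kind can be assembled with standard data-analysis tools. The sketch below uses simulated data (all names and values are hypothetical) to divide examinees into 10 ordered score groups and cross-tabulate correct and incorrect responses, in the spirit of Table 9.3.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 2,000 total scores and one item scored 1 (correct) or 0 (incorrect),
# simulated so that higher scorers are more likely to answer correctly.
rng = np.random.default_rng(0)
total_scores = rng.normal(30, 8, size=2000)
p_correct = 1 / (1 + np.exp(-(total_scores - 30) / 8))
item = (rng.random(2000) < p_correct).astype(int)

# Divide examinees into 10 score groups (Group 1 = lowest, Group 10 = highest)
# and cross-tabulate item responses by score group, as in Table 9.3.
groups = pd.qcut(total_scores, q=10, labels=list(range(1, 11)))
print(pd.crosstab(item, groups))
```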

Graphical Method

Taking the tabular results of Table 9.3, we can construct graphs that display the
performance of examinees who selected the correct and incorrect responses.
Figure 9.1 illustrates a trace line (also known as an option characteristic curve)
for the correct choice and all incorrect choices taken collectively. The trace
line can be formed in several ways. One of the easiest and most direct methods
uses any computer graphing program. In Fig. 9.1, the trace lines were taken
from results in Table 9.3.
The trace line for the correct answer is monotonically increasing, and the
trace line for the collective incorrect answers is monotonically decreasing.
Note that the trace line for correct answers is a mirror image of the trace line for
incorrect answers. Trace lines provide a clear way to illustrate the discriminat-
ing tendencies of items. Flat lines are undesirable because they indicate a fail-

TABLE 9.3
Frequency of Correct and Incorrect Responses for 10 Score Groups

Score group:     1     2     3     4     5     6     7     8     9     10
0 (incorrect)   140   138   130   109    90    70    63    60    58    56
1 (correct)      60    62    70    91   110   130   137   140   142   144

FIG. 9.1. Trace lines for the correct answer and collectively all wrong answers.

ure to discriminate. Nonmonotonic lines are difficult to explain in terms of
student behavior. Therefore, these trace lines are undesirable.
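Trace lines such as those in Fig. 9.1 can be drawn with almost any graphing program. The sketch below plots the proportions implied by Table 9.3 (200 examinees per score group); the plotting choices are illustrative only.

```python
import matplotlib.pyplot as plt

# Frequencies of correct responses from Table 9.3 (200 examinees per score group).
score_groups = list(range(1, 11))
correct = [60, 62, 70, 91, 110, 130, 137, 140, 142, 144]
p_correct = [f / 200 for f in correct]
p_incorrect = [1 - p for p in p_correct]

plt.plot(score_groups, p_correct, marker="o", label="Correct answer")
plt.plot(score_groups, p_incorrect, marker="s", label="All incorrect answers")
plt.xlabel("Score group (low to high)")
plt.ylabel("Proportion choosing")
plt.legend()
plt.show()
```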

Statistical Methods

We have an increasing number of methods for studying item discrimination.
Pitfalls exist in its estimation. Some major differences exist depending on
which statistical theory of test scores is used. In CTT, for instance, item dis-
crimination is usually associated with the product-moment (point-biserial)
correlation between item and test performance. As we know, this index can
be positive or negative, can vary between 0.00 and 1.00, and is subject to tests
of statistical significance. The biserial correlation (a sister to the point-
biserial) may also be used, but it has a less direct relationship to coefficient al-
pha, the reliability coefficient typically used for MC test scores. The size of
the discrimination index is informative about the relation of the item to the
total domain of knowledge or ability, as represented by the total test score. It
can be shown both statistically and empirically that test score reliability de-
pends on item discrimination (Nunnally, 1977, p. 262). The weakness of using
classical item discrimination in testing students or trainees is that when
instruction is effective and student effort is strong, the range of scores in the

sample of examinees is restricted and the discrimination index is greatly
underestimated. In fact, if all students answered an item correctly, the
discrimination index would be zero. Nevertheless, this is misleading. If the
sample included nonlearners, we would find out more about the ability of the
item to discriminate. One can obtain an unbiased estimate of discrimination
in the same way one can obtain an unbiased estimate of difficulty—by obtain-
ing a representative sample that includes the full range of behavior for the
trait being measured. Restriction in the range of this behavior is likely to af-
fect the estimation of discrimination.
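As a minimal sketch (not the author's procedure), the classical discrimination index can be computed as the product-moment correlation between a dichotomous item and total scores; a corrected version that removes the item from the total score, a common variant, is also shown. The function names are hypothetical.

```python
import numpy as np

def point_biserial(item, total_scores):
    """Product-moment (point-biserial) correlation between a 0/1 item and total scores."""
    return np.corrcoef(np.asarray(item, dtype=float),
                       np.asarray(total_scores, dtype=float))[0, 1]

def corrected_point_biserial(item, total_scores):
    """Same index, but with the item removed from the total so it does not
    correlate with itself (the corrected item-total correlation)."""
    item = np.asarray(item, dtype=float)
    rest = np.asarray(total_scores, dtype=float) - item
    return np.corrcoef(item, rest)[0, 1]
```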
With IRT, we have a variety of traditional, dichotomous scoring models
and newer polytomous scoring models from which to choose. The one-pa-
rameter item response model (referred to as the Rasch model) is not con-
cerned with discrimination, as it assumes that all items discriminate equally.
The Rasch model has one parameter—difficulty. The model is popular be-
cause applying it is simple, and it provides satisfactory results despite this im-
plausible assumption about discrimination. Critics of this model
appropriately point out that the model is too simplistic and ignores the fact
that items do vary with respect to discrimination. With the two- and three-
parameter models, item discrimination is proportional to the slope of the item
characteristic curve at the point of inflexion (Lord, 1980). This shows
that an item is most discriminating in a particular range of scores. One item
may discriminate very well for high-scoring test takers, whereas another item
may discriminate best for low-scoring test takers.
A popular misconception is that a fit statistic is a substitute for discrimina-
tion. Fit statistics do not measure discrimination. Fit statistics in IRT answer a
question about the conformance of data to a hypothetical model, the item
characteristic curve. One of the best discussions of fit can be found in Hambleton
(1989, pp. 172-182). If items do not fit, some claims for IRT about sample-free
estimation of examinee achievement are questionable.
A third method used to estimate item discrimination is the eta coefficient.
This statistic can be derived from the one-way analysis of variance
(ANOVA), where the dependent variable is the average score of persons se-
lecting that option (choice mean) and the independent variable is the op-
tion choice. In ANOVA, three estimates of variance are obtained: sums of
squares between, sums of squares within, and sums of squares total. The ratio
of the sums of squares between and the sums of squares total is the squared
eta coefficient. In some statistical treatments, this ratio is also the squared
correlation between two variables (R²). The eta coefficient is similar to the
traditional product-moment discrimination index. In practice, the eta co-
efficient differs from the product-moment correlation coefficient in that
the eta considers the differential nature of distractors, whereas the prod-
uct-moment correlation makes no distinction among item responses re-
lated to distractors.
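A sketch of the eta coefficient computed from option choices and total scores follows; the function and example data are hypothetical, and the calculation uses the sums-of-squares definition given above.

```python
import numpy as np

def eta_coefficient(option_choices, total_scores):
    """Eta for option choice predicting total score:
    the square root of (sums of squares between) / (sums of squares total)."""
    choices = np.asarray(option_choices)
    scores = np.asarray(total_scores, dtype=float)
    grand_mean = scores.mean()
    ss_total = ((scores - grand_mean) ** 2).sum()
    ss_between = sum(
        scores[choices == opt].size * (scores[choices == opt].mean() - grand_mean) ** 2
        for opt in np.unique(choices)
    )
    return np.sqrt(ss_between / ss_total)

# Hypothetical option choices for one item and total scores for five examinees.
print(eta_coefficient(["A", "B", "A", "C", "A"], [38, 22, 35, 25, 40]))
```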

Table 9.4 illustrates a high discrimination index but a low eta coefficient.
Notice that the choice means are closely bunched. The second item also has a
high discrimination index, but it has a high eta coefficient as well, because the
choice means of the distractors are more separated. In dichotomous scoring,
point-biserial item response-total test score correlation or the two- or three-
parameter discrimination may serve as a discrimination index. However, with
polytomous scoring, the eta coefficient provides different information that is
appropriate for studying item performance relative to polytomous scoring.
Chapter 10 treats this subject in more detail.
What we can learn from this section is that with dichotomous scoring, one
can obtain approximately the same information from using the classical dis-
crimination index (the product-moment correlation between item and test
performance) or the discrimination parameter from the two- or three-parame-
ter item response models. But with polytomous scoring these methods are inap-
propriate, and the eta coefficient provides a unique and more appropriate index
of discrimination.

Dimensionality and Discrimination. A problem that exists with estimating
discrimination is the dimensionality of the set of items chosen for a test.
Generally, any test should be as unidimensional as possible with the present
theories and methods in use. Nunnally (1967) described the problem as funda-
mental to validity. Any test should measure an attribute of student learning,
such as a specific body of knowledge or a cognitive ability. Items need to share
a common attribute to function appropriately. When they do not, this lack of a
common attribute clouds our interpretation of the test score.
With the existence of several factors on the test, item response patterns are
likely to be odd or nondiscriminating. Deciding which items are working well is
difficult because item discrimination is defined relative to a common attribute. With

TABLE 9.4
Point-Biserial and Eta Coefficients for Two Items

Item 1 Item 2
Point-biserial .512 .552
Eta coefficient .189 .326
Choice Mean Choice Mean
Option A—correct 33.9% 33.4%
Option B—incorrect 23.5% 24.8%
Option C—incorrect 27.0% 29.6%
Option D—incorrect 26.4% 30.7%

two or more attributes present in the test, item discrimination has no criterion
on which to fix.
With IRT, unidimensionality is a prerequisite of test data. Hattie (1985)
provided an excellent review of this issue, and Tate (2002) provided a timely
update of this earlier work. When using the two- or three-parameter logistic
response model, the computer program will fail to converge if multidimen-
sionality exists. With the use of classical theory, discrimination indexes, ob-
tained from product-moment correlations or biserial correlations, will be
lower than expected and unstable from sample to sample. Thus, one has to be
cautious that the underlying test data are unidimensional when estimating
discrimination. A quick-and-dirty method for studying dimensionality is to
obtain a KR-20 (Kuder-Richardson 20) internal consistency estimate of reli-
ability. If it is lower than expected for the number of items in the test, this is a
clue that the data may be multidimensional. A more dependable method is to
conduct a full-information, confirmatory item factor analysis. Chapter 10
provides more discussion of this problem and its implications for estimating
discrimination.
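A quick check of this kind might look like the following sketch, which computes KR-20 from a 0/1 response matrix; the variable names are hypothetical.

```python
import numpy as np

def kr20(responses):
    """Kuder-Richardson 20 for a 0/1 response matrix (rows = examinees, columns = items)."""
    responses = np.asarray(responses, dtype=float)
    k = responses.shape[1]
    p = responses.mean(axis=0)                      # item p values
    total_var = responses.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - (p * (1 - p)).sum() / total_var)
```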

How Instruction or Training Affects Discrimination. As mentioned previously
in this chapter, instruction and training are a context that needs to be
considered when studying item response patterns. The difficulty of any test
item is affected by the achievement level of the sample of examinees and their
instructional history.
In evaluating discrimination in the context of instruction or training, one
must be careful in sampling examinees. If instruction or training has been suc-
cessful, people instructed or trained should perform at the upper end of the test
score scale, whereas those not instructed or trained should perform at the lower
end of the scale, as illustrated in Fig. 9.2. The uninstructed group displays low
performance on a test and its items, and the instructed group displays high per-
formance on a test and its items. This idealized performance pattern shows ef-
fective instruction, good student effort, and a test that is sensitive to this
instruction. Other terms used to describe this phenomenon are instructional

[Two idealized score distributions: one clustered at the low end of the scale
(low performance, before instruction) and one at the high end (high
performance, after instruction).]

FIG. 9.2. Idealized performance of instructed and uninstructed students.



sensitivity or opportunity to learn (Haladyna & Roid, 1981). Instructional sensi-


tivity can be estimated using CTT or IRT. The concept of instructional sensi-
tivity incorporates the concepts of item difficulty and item discrimination
(Haladyna, 1974; Haladyna & Roid, 1981; Herbig, 1976).
Item difficulty varies because the group of students tested has received dif-
ferential instruction. Higher achieving students perform well on an item,
whereas lower achieving students do not perform very well. Therefore, it is pos-
sible to observe several conditions involving item difficulty to help us find
which items are working as predicted and which items have performance prob-
lems that require closer analysis.
The simplest of the instructional sensitivity indexes is now used to illustrate
several possible conditions. Then, we can see how instructional sensitivity can
be measured in several ways. Instructional sensitivity is a helpful concept in an-
alyzing several important instructional conditions. These conditions include
effective instruction, ineffective or absent instruction, and unneeded instruction
or an item that is too easy. With each condition, several plausible, alternative
explanations exist. The index must be interpreted by someone who is intimate
with the instructional setting.

Pre-to-Post Difference Index (PPDI). This index, introduced by Cox
and Vargas (1966), provides the simple difference in item difficulty based on
two samples of test takers known to differ with respect to instruction. For in-
stance, the first group can be typical students who have not yet received in-
struction and the second group can be typical students who have received
instruction.

Pre-Instruction      Post-Instruction      PPDI
40%                  80%                   40
This illustration suggests that the item is moderately difficult (60%) for a
typical four-option MC item, when the sample has an equal number of in-
structed and uninstructed students. The change in difficulty for the two condi-
tions represents how much learning was gained from instruction, as reflected
by a single item.
Because a single item is an undependable measure of overall learning, and
because a single item is biased by its intrinsic difficulty, aggregating several
items across the test to make an inference about instructional effectiveness or
growth is far better. Other conditions exist for this index that provide useful
descriptive information about item performance, as shown in the following:

Pre-Instruction      Post-Instruction      PPDI
40%                  40%                   0

This kind of performance suggests ineffective instruction or lack of treatment
of the content on which the item was based. A second plausible and rival-
ing explanation is that the item is so difficult that few can answer it correctly,
despite the effectiveness of instruction. A third plausible hypothesis is that the
item is unrelated to the purposes of the test. Therefore, no amount of instruc-
tion is relevant to performance on the item. The instructional designer and test
designer must be careful to consider other, more plausible hypotheses and
reach a correct conclusion. Often this conclusion is augmented by studying the
performance patterns of clusters of items. Having a single item perform like the
previous one is one matter, but having all items perform as just shown is an en-
tirely different matter. A single item may be unnecessarily difficult, but if all
items perform similarly, the problem may lie with instruction or the entire test
may not reflect the desired content.

Pre-Instruction      Post-Instruction      PPDI
90%                  90%                   0
Like the previous example, the PPDI is 0, but unlike the previous example, the
performance of both samples is high. Several rivaling hypotheses explain this
performance. First, the content may have already been learned, and both unin-
structed and instructed groups perform well on the item. Second, the item may
have a fault that is cuing the correct answer; therefore, most students are picking
the right answer regardless of whether they have learned the content repre-
sented by the item. Third, the item is inherently easy for everyone. The item
fails to reflect the influence of instruction because, being so easy, it cannot
discriminate among levels of the content being measured.
These three examples show the interplay of instruction with items specifi-
cally designed or chosen to match instruction. Knowing how students perform
before and after instruction informs the test designer about the effectiveness of
the items as well as instruction.
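A simple sketch of the PPDI aggregated across items is shown below; the p values are hypothetical, and the aggregate is just the mean pre-to-post difference, in the spirit of the aggregation suggested above.

```python
import numpy as np

def ppdi(pre_p_values, post_p_values):
    """Pre-to-post difference in item difficulty for each item, plus the mean across items."""
    pre = np.asarray(pre_p_values, dtype=float)
    post = np.asarray(post_p_values, dtype=float)
    differences = post - pre
    return differences, differences.mean()

# Hypothetical p values for five items before and after instruction.
diffs, average = ppdi([0.40, 0.35, 0.90, 0.50, 0.45], [0.80, 0.40, 0.90, 0.85, 0.75])
print(diffs, average)
```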

Other Indexes. Obtaining a sample of test behavior from a preinstructed
or uninstructed group is often impractical. Thus, the PPDI is not an easy in-
dex to obtain. Haladyna and Roid (1981) examined a set of other instruc-
tional sensitivity indexes, including one derived from the Rasch model and a
Bayesian index. They found a high degree of relation among these indexes.
They also found that the postinstruction difficulty is a dependable predictor
of PPDI, but this shortcut is limited because the use of this index will be in-
correct in the condition reported earlier where pre- and postinstruction per-
formance of test takers is uniformly high. Thus, this shortcut method for
estimating PPDI is useful but one should always consider this inherent weak-
ness in analyzing the instructional sensitivity of items by using postin-
struction difficulty alone.

In this setting, the validity of these conclusions is not easy to prove based on
statistical results alone. Analysis of performance patterns requires close obser-
vation of the instructional circumstances and the judicious use of item and test
scores to draw valid conclusions. Instructional sensitivity is a useful combina-
tion of information about item difficulty and discrimination that contributes to
the study and improvement of items designed to test the effects of teaching or
training.

GUESSING

With the use of MC test items, an element of guessing exists. Any test taker
encountering the item either knows the right answer, has partial
knowledge that allows for the elimination of implausible distractors and a
guess among the remaining choices, or simply guesses in the absence of any
knowledge.
In CTT, one can ignore the influence of guessing. To do so, one should con-
sider the laws of probability that influence the degree to which guessing might
be successful. The probability of getting a higher than deserved score by guess-
ing gets smaller as the test gets longer. For example, even in a four-option, 10-
item test, the probability of getting 10 correct random guesses is .0000009. An-
other way of looking at this is to realize the probability of scoring 70% or higher
on a 10-item MC test by random guessing is less than .004. Increase that test to
25 items, and that probability of getting a score higher than 70% falls to less
than .001.
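These probabilities follow directly from the binomial distribution. The sketch below reproduces the calculations, assuming purely random guessing on four-option items; the results should approximate the figures cited above.

```python
from scipy.stats import binom

# Probability of 10 correct random guesses on a 10-item, four-option test.
print(binom.pmf(10, n=10, p=0.25))    # about 9.5e-07

# Probability of scoring 70% or higher on a 10-item test by random guessing.
print(binom.sf(6, n=10, p=0.25))      # P(7 or more correct), about .0035

# Probability of scoring above 70% on a 25-item test by random guessing.
print(binom.sf(17, n=25, p=0.25))     # P(18 or more correct), far below .001
```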
The third parameter in the three-parameter item response model, although
often referred to as the "guessing parameter," is actually a pseudochance level (Hambleton et al.,
1991). This parameter is not intended to model the psychological process of
guessing but merely to establish that a reasonable floor exists for the probability
of a correct response. This third parameter is used along with item difficulty and discrimi-
nation to compute a test taker's score. The influence of this third parameter is
small in relation to the influence of the discrimination parameter. Several
polytomous scoring models that also use correct and incorrect responses also
incorporate information about guessing into scoring procedures (Sympson,
1983, 1986; Thissen & Steinberg, 1984).
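For reference, the three-parameter logistic model is conventionally written as follows (standard notation, not taken from the text), which makes clear that the third parameter is a lower asymptote rather than a model of guessing behavior:

P(X = 1 | \theta) = c + (1 - c)\,\frac{e^{a(\theta - b)}}{1 + e^{a(\theta - b)}}

where a is the discrimination, b the difficulty, and c the pseudochance level of the item.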

OMITTED AND NOT REACHED RESPONSES

Depending on test instructions and other conditions, examinees may omit re-
sponses. One serious form of omitted response is items that are not tried.
Usually, a string of nonresponses at the end of a set of responses signals items that
are not reached. In the study of item responses, it is important to tabulate

omits and not-reached responses, because extensive nonresponse is a threat
to valid test score interpretations or use. In some test-
ing programs, examinees are discouraged from guessing and thus tend to omit
many items. We address this problem in the next section in a different way. It
matters greatly in the evaluation of MC item performance how omits and
not-reached responses are counted. A science of imputation has grown,
where such responses can be predicted, and in scoring, these imputed re-
sponses replace omitted or not-reached responses. DeAyala, Plake, and
Impara (2001) provided an excellent review of the issues and methods for im-
putation and a study on the effectiveness of various imputation methods.
This topic is treated more extensively in the next chapter as it presents a
threat to validity.

DISTRACTOR EVALUATION

Thissen, Steinberg, and Fitzpatrick (1989) stated that test users and analysts
should consider the distractor as an important part of the item. Indeed, nearly
50 years of continuous research has revealed a patterned relationship between
distractor choice and total test score (Haladyna & Sympson, 1988; Levine &
Drasgow, 1983; Nishisato, 1980). The following are five reasons for studying
distractor performance for MC items:

• Slimming down fat items. Haladyna and Downing (1993) provided both
theoretical arguments and empirical results suggesting that most test items
contain too many options. They argue that if we systematically evaluated
distractors, we would discover that many distractors are not performing as
intended. In chapter 5, a guideline was presented that suggested MC items
should have as many good options as is possible, but three is probably a good
target. The research cited there provides a good basis for that guideline and
the observation that most items usually have only one or two really well-
functioning distractors. By trimming the number of options for MC items,
item writers are relieved of the burden of writing distractors that seldom
work, and examinees take shorter tests. Or, if we can shorten the length of
items, we can increase the number of items administered per hour and thereby
increase the test length, which may improve the sampling of a content
domain and increase the reliability of test scores.
• Improving test items. The principal objective of studying and evaluating
item responses is to improve items. This means getting difficulty in line and
improving discrimination. Item analysis provides information to SMEs
about the performance of items so that items can be retained in the opera-
tional item pool, revised, or retired. Information about distractors can be
used in this revision process.

• Detecting reasons for performance problems. The study of distractors can
lead to conclusions by SMEs about why an item is not performing as ex-
pected. Because distractors are often common student errors, the perfor-
mance of these distractors can help these SMEs decide which distractors
should remain and which should be revised. Such analysis often has implications for
future instruction or training.
• Augmenting studies of cognitive processes. As reported in chapters 4 and
8, it is increasingly common to talk to test takers to learn about the cognitive
processes that may be needed to answer an item correctly. Such information
can be used with distractor analysis to help SMEs improve items and better
understand what kind of cognitive behaviors their items are eliciting.
• Differential distractor functioning. Distractors are differentially attrac-
tive and can provide the basis for improving the scoring of item responses. In
the next chapter the theory and research involved with scoring distractors
are presented and discussed.

As you can see, not only are these reasons compelling, but the effort put into
studying and improving distractors contributes important validity evidence to
the overall validation process. The next three sections present and discuss
three unique ways to study distractor performance. Although the methods
discussed are diverse in nature and origin, they should provide convergent informa-
tion about the distractability of distractors.

TABULAR METHOD FOR STUDYING DISTRACTOR PERFORMANCE

The frequency table for all options of an MC item is a distribution of responses
for each option according to score groups. Each score group represents an or-
dered fractional part of the test score distribution. Table 9.5 shows the fre-
quency tables for two items. In this table, there are 5 score groups, representing
five distinctly ordered ability levels. For small samples of test takers, 4 or 5 score
groups can be used, whereas with larger samples, 10 to 20 score groups might
prove useful. The sum of frequencies (in percent) for each score group is the
fractional equivalent of the number of test takers in that score group to the to-
tal sample. Because we have 5 score groups, each row equals about 20%, one
fifth of the total sample. (Sometimes, because more than one person received
the same score, having exactly 20% in each score group is not possible.) The column
totals represent the frequency of response to each option.
For the first item, the correct answer, Option A, was chosen 55% of the time,
17% by the highest score group, 14% by the next highest score group, and 6%
by the lowest score group. This is a typical and desirable pattern of response for
a correct answer.

TABLE 9.5
Frequency Tables for Two 4-Option Multiple-Choice Items

Item 1 Options
Score group(a)      A(b)     B      C      D
80-99 percentile 17% 1% 0% 2%
60-79 percentile 14% 2% 0% 4%
40-59 percentile 10% 2% 0% 4%
20-39 percentile 8% 9% 1% 2%
1-19 percentile 6% 13% 3% 0%
Total 55% 27% 4% 14%
Item 2 Options
Score" group Ab B C D
80-99 percentile 19% 1% 0% 0%
60-79 percentile 14% 3% 1% 2%
40-59 percentile 8% 4% 2% 6%
20-39 percentile 8% 9% 1% 2%
1-19 percentile 6% 12% 1% 1%
Total 55% 29% 5% 11%

"In percentile ranks.


b
Correct answer.

Option B, a distractor, has a low response rate for the higher groups and a
higher response rate for the lower groups. This is a desirable pattern for a
well-performing distractor. As described earlier, all distractors should have a
pattern like this.
Option C illustrates a low response rate across all five score groups. Such
distractors are useless, probably because of extreme implausibility. Such
distractors should either be removed from the test item or replaced.
Option D illustrates an unchanging performance across all score groups. No
orderly relation exists between this distractor and total test performance. We
should remove or replace such a distractor from the test item because it is not
working as intended.
The second item exhibits a distractor pattern that presents problems of in-
terpretation and evaluation. Option D is more often chosen by the middle
group and less often chosen by the higher and lower groups. This pattern is
STATISTICAL STUDY OF ITEM RESPONSES 221

nonmonotonic in the sense that it increases as a function of total test score and
then decreases. Is this pattern a statistical accident or does the distractor at-
tract middle achievers and not attract high and low achievers? Distractors are
not designed to produce such a pattern because the general intent of a
distractor is to appeal to persons who lack knowledge. The nonmonotonic pat-
tern shown in Option D implies that the information represented by Option D
is more attractive to middle performers and less attractive to high and low per-
formers. The nonmonotonic pattern appears to disrupt the orderly relation be-
tween right and wrong answers illustrated in Options A and B. For this reason,
nonmonotonic trace lines should be viewed as undesirable.
This tabular method is useful for obtaining the basic data that show the per-
formance of each distractor. A trained evaluator can use these tables with con-
siderable skill, but these data are probably more useful for creating graphical
presentations of distractor functioning. The computer program TESTFACT
(Bock et al., 2002) provides useful tables called fractiles that provide tabular
option responses.

GRAPHICAL METHOD FOR STUDYING DISTRACTOR PERFORMANCE

The trace line we use in Fig. 9.1 for the correct answer and for the collective
distractors can be used for each distractor. Figure 9.3 shows four trace lines. A
four-option item can have up to five trace lines. One trace line exists for each
option and one trace line can be created for omitted responses.
As noted in Fig. 9.1, an effectively performing item contains a trace line for
the correct choice that is monotonically increasing, as illustrated in Fig. 9.1
and again in Fig. 9.3. These figures show that the probability or tendency to
choose the right answer increases with the person's ability. The collective per-
formance of distractors must monotonically decrease in opposite correspond-
ing fashion, as illustrated in Fig. 9.3. That figure shows that any examinee's
tendency to choose any wrong answer decreases with the person's ability or
achievement.
If the ideal trace line for all distractors is monotonically decreasing, each
trace line should exhibit the same tendency. Any other pattern should be in-
vestigated, and the distractor should be retained, revised, or dropped from
the test item. Referring to Fig. 9.3, the first trace line has the characteristic of
the correct answer, whereas the second trace line has the characteristic of a
plausible, well-functioning distractor. The third trace line shows a flat perfor-
mance across the 10 score groups. This option simply does not discriminate in
the way it is expected to discriminate. This kind of option probably has no use
in the item or should be revised. The fourth type of trace line shows low re-
sponse rates for all score groups. This kind of distractor is one that is probably

FIG. 9.3. Four types of trace lines.

implausible and therefore is typically not chosen. It too should be revised or
dropped from the test item.
One trace line that is not presented here is the nonmonotonic trace line. A
purely statistical view of item responses seems to treat nonmonotonic trace
lines as acceptable and interpretable: that a subgroup of the examinees finds
a distractor more attractive than two or more other subgroups of the distribu-
tion of examinees. There is no educational or psychological explanation for
such a pattern of item responses, and there is no acceptable scoring guide for
such a distractor. If SMEs have determined the distractor is a wrong answer,
the existence of a nonmonotonic trace line should cause the SMEs to reject
the distractor and replace it with a distractor that has a monotonic decreasing
trace line.
Trace lines can be constructed using some standard computer graphics pro-
grams, such as found with word processing programs. Statistical packages also
are useful for constructing trace lines. Some of these computer programs have
the option of taking the data from the frequency tables and providing
smoothed curves for easier interpretation of the trace lines. An item analysis
and scaling program, RUMM, introduced by Andrich et al. (2001), provides
trace lines for both MC items and rating scales. Wainer (1989) and Thissen,
Steinberg, and Fitzpatrick (1989) favored using trace lines. They argued that
trace lines make item analysis more meaningful and interpretable. van
Batenburg & Laros (2002) provided an in-depth discussion of graphical item

analysis and supported its use in studying and evaluating distractors. The pri-
mary advance is that practitioners can easily read and understand these option
performance graphs.

STATISTICAL METHODS FOR STUDYING DISTRACTOR PERFORMANCE

These statistical methods can be grouped into three categories: (a) tradi-
tional, (b) nonparametric, and (c) parametric. The number of proposed
methods has increased recently, but research that compares the effectiveness
of these methods has not kept pace. The speculation is offered that because
these methods have the same objective, they probably provide similar results.
These results should follow logically from an inspection of a frequency table
and graphical results. In fact, these statistical methods should confirm what
we observe from viewing tabular results shown in Table 9.5 and the trace lines
shown in Fig. 9.3.

TRADITIONAL STATISTICAL METHODS FOR EVALUATING DISTRACTORS

In this section we examine the traditional discrimination index applied to
distractors. Other approaches for evaluating distractors have a dual value: (a)
they provide information about the uniqueness of a distractor and (b) they cap-
ture the performance of a set of distractors found in an item.

Classical Point-Biserial Discrimination

Traditional item analysis relies on the relationship between item and test per-
formance. The most direct method is the simple product-moment (point-
biserial) correlation between item and test performance. Applied to a
distractor, however, the point-biserial coefficient can be estimated incor-
rectly (Attali & Fraenkel, 2000). If the standard formula for point-biserial is
used, the responses to other distractors are grouped with responses to the right
answer, and the resulting discrimination index for the distractor is underesti-
mated. Attali and Fraenkel (2000) pointed out that the correlation of
distractor to total score should be independent of other distractors, and they
showed how discrimination indexes can be corrupted by the casual use of the
point-biserial coefficient. They also pointed out that this coefficient has the
advantage of being an effect size measure. The squared point-biserial is the
percentage of criterion variance accounted for by choosing that distractor.

Also, because a distractor may be chosen by only a few examinees, this index can
be unreliable. Therefore, it is recommended that if the traditional point-
biserial is correctly used to evaluate a distractor, the appropriate test of statis-
tical significance is used with a directional hypothesis because the coefficient
is assumed to be negative. Attali and Fraenkel suggested power tables found
in Cohen (1988). A bootstrap method is suggested for overcoming any bias
introduced by limitations of the sample (de Gruijter, 1988), but this kind of
extreme measure points out an inherent flaw in the use of this index. It
should be noted that the discrimination index is not robust. If item difficulty
is high or low, the index is attenuated. It maximizes when difficulty is mod-
erate. The sample composition relates to the estimate of discrimination.
Distractors tend to be infrequently chosen, particularly when item diffi-
culty exceeds 0.75. Thus, the point-biserial correlation is often based on
only a few observations, which is a serious limitation. Henrysson (1971)
provided additional insights into the inadequacy of this index for the study
of distractor performance. Because of these many limitations, this index
probably should not be used.

Choice Mean

For any option, we can calculate the mean of all examinees who chose that op-
tion. For the right answer, this mean will typically be higher than the means of
any wrong answer. We can analyze the relationship of the choice mean to total
score or the choice of the option to total score. The first is a product-moment
correlation between the choice mean and total score, where the choice mean is
substituted for the distractor choice. This coefficient shows the overall working
of an item to tap different levels of achievement through its distractors. An
item with a high coefficient would have different choice means for its
distractors. This may be viewed as an omnibus index of discrimination that in-
cludes the differential nature of distractors. The second type of index is the eta
coefficient, where the independent variable is option choice and the depend-
ent variable is total score. This index also represents an item's ability to discrim-
inate at different levels of achievement. The traditional point-biserial applied
to any item disregards the unique contributions of each distractor. When the
choice mean is used, an effect size can be calculated for the difference in choice
means for the distractor and the correct choice. The greater the difference in
standard deviation units, the more effective the distractor. Referring to Table
9.6, however, we note that the choice mean for each distractor differs from the
choice mean of the correct answer. The difference in these choice means can
serve as a measure of distractor effectiveness; the lower the choice mean, the
better the distractor. This difference can be standardized by using the standard
deviation of test scores, if a standardized effect size measure is desired.

TABLE 9.6
Choice Means for Two Items From a Test

Options Item 32 Item 45


A 66% 88%
B 54% 86%
C 43% 84%
D 62% 85%
F ratio 22.44 1.04
Probability .000 .62
R²              .12        .02
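A sketch of how choice means and a standardized effect size might be computed follows; the function and data names are hypothetical, and the effect size is simply the correct option's choice mean minus each distractor's choice mean, divided by the standard deviation of total scores, as described above.

```python
import numpy as np

def choice_means(option_choices, total_scores, correct_option):
    """Mean total score of examinees choosing each option, plus a standardized
    effect size for each distractor relative to the correct option."""
    choices = np.asarray(option_choices)
    scores = np.asarray(total_scores, dtype=float)
    sd = scores.std(ddof=1)
    means = {opt: scores[choices == opt].mean() for opt in np.unique(choices)}
    effects = {opt: (means[correct_option] - m) / sd
               for opt, m in means.items() if opt != correct_option}
    return means, effects

# Hypothetical responses to one item and total scores for eight examinees.
means, effects = choice_means(list("AABACDBA"), [40, 38, 25, 42, 30, 22, 27, 36], "A")
print(means, effects)
```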

The choice mean seems useful for studying distractors. The lower the choice
mean, the more effective the distractor. Yet, a bias exists in this procedure, be-
cause when the right answer is chosen by most high-scoring test takers, the
low-scoring test takers divide their choices among the three distractors plus the
correct answer. Therefore, distractors will always have lower choice means,
and statistical tests will always reveal this condition. Any exception would sig-
nal a distractor that is probably a correct answer.
As indicated earlier, the trace line has many attractive characteristics in the
evaluation of item performance. These characteristics apply equally well to
distractor analysis. Haladyna and Downing (1993) also showed that trace lines
reveal more about an option's performance than a choice mean. Whereas
choice means reveal the average performance of all examinees choosing any
option, the trace line accurately characterizes the functional relationship be-
tween each option and total test performance for examinees of different
achievement levels.

Nonparametric Statistical Methods

Nonparametric methods make no assumption about the shape of the option
characteristic curve, except that it is monotonically increasing with the right
answer and decreasing with the wrong answer. One approach to studying
distractors, proposed by Love (1997), capitalizes on the idea of trace lines.
The concept is the rising selection ratio, which posits that the ratio of the pro-

portion of examinees choosing the correct response and the proportion of
examinees choosing the wrong response increases monotonically as a func-
tion of achievement level. Love based this idea on earlier work of Luce
(1959), who formed ratios of the probability of choosing one alternative to
the probability of choosing another alternative. For right answers, this ratio
should be monotonically increasing; for wrong answers, this ratio should be
monotonically decreasing. Love's selection ratio is always monotonically in-
creasing. Love gave several examples of how this index works.
Samejima (1994) proposed a nonparametric method, the simple sum pro-
cedure of the conditional probability density function combined with the
normal approach. These plausibility functions were thought to be useful for
estimating the underlying latent trait. The use of distractors in scoring item
responses is discussed in the next chapter. Samejima favored this method
over parametric methods because no assumption is made about the specific
shape of the trace line. Also, estimation can be done with smaller data sets
than is found with parametric approaches. Despite the benefits of this ap-
proach, her results show that most right answers approximate the normal OC
(operating characteristic). She also found that most distractors were similar
rather than different in their response pattern. Distractors that differ she
termed informative distractors, as opposed to equivalent distractors. In dichotomous scoring,
equivalent distractors may be used, but in polytomous scoring, we want infor-
mative distractors. The more traditional coefficients discussed, including the
eta coefficient for item response and total score, also make this distinction be-
tween informative and equivalent distractors.

Parametric Statistical Methods

Experiments by Drasgow, Levine, Tsien, Williams, and Mead (1995) with a
variety of polytomous IRT methods resulted in some models emerging as
more robust than others. Andrich, Styles, Tognolini, Luo, and Sheridan
(1997) used partial credit with MC with some promising results. Wang
(1998) showed how the Rasch (one-parameter) model can be used to study
distractor functioning. Through simulation, he showed that parameter re-
covery was good. He concluded that the analyses provided unique informa-
tion that would be useful in revising items. He properly warned that when
sample sizes for distractors are small, these estimates are unstable. In these in-
stances, this method should not be used or should be used cautiously. An-
other approach to studying distractors involves the use of the general linear
model, which treats each item as an independent variable. This method can
also be used to study DIF at the distractor level. As polytomous IRT methods
become more accessible, the scoring of MC responses using distractor infor-
mation may become more commonplace.

IRT Methods

Item analysts are increasingly turning to IRT methods to investigate the work-
ings of distractors (Samejima, 1994; Wang, 2000). Wang (2000) used the gen-
eral linear model and grouping factors (items) as the independent variables.
Distractability parameters are estimated and used. His results in a simulation
study and with actual test data show the promise of this technique as confirmed
by graphical procedures. He also pointed out that low-frequency distractors are
not especially well estimated by this technique, nor by any other technique.
Methods like this one need to be studied with more conventional methods to
decide which of these is most and least effective.

Categorical Analysis of the Trace Line

Up to this point, the trace line has not been evaluated statistically. Haladyna
and Downing (1993) showed that the categorical data on which the trace line
is based can be subjected to statistical criteria using a chi-square test of inde-
pendence. Table 9.7 illustrates a contingency table for option performance.
When a chi-square test is applied to these categorical frequencies, a statistically sig-
nificant result signals a trace line that is not flat. In the preceding case, it is
monotonically increasing, which is characteristic of a correct answer. Thus,
with the notion of option discrimination for the right choice, we expect mono-
tonically increasing trace lines, positive point-biserial discrimination indexes,
positive discrimination parameters with the two- and three-parameter models,
and choice means that exceed the choice means for distractors. For the wrong
choice, we expect monotonically decreasing trace lines, negative discrimina-
tion, negative discrimination parameters for the two- and three-parameter
models (which are unconventional to compute), and choice means that are
lower than the choice mean for the correct option.
The trace line appears to offer a sensitive and revealing look at option per-
formance. Trace lines can be easily understood by item writers who lack the
statistical background needed to interpret option discrimination indexes.

TABLE 9.7
Contingency Table for Chi-Square Test for an Option

First Score Second Score Third Score Fourth Score Fifth Score
Group Group Group Group Group
Expected 20% 20% 20% 20% 20%
Observed 6% 14% 20% 26% 34%

The other statistical methods have limitations that suggest that they should
not be used.
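As an illustration, the sketch below applies a chi-square goodness-of-fit test to the percentages in Table 9.7 (a simplification of the test of independence described above), assuming a hypothetical 500 examinees chose the option; a flat trace line corresponds to the uniform expected frequencies.

```python
from scipy.stats import chisquare

# Observed frequencies across five score groups (6%, 14%, 20%, 26%, 34% of 500 choosers).
observed = [30, 70, 100, 130, 170]
# Expected frequencies if the trace line were flat (20% in each score group).
expected = [100, 100, 100, 100, 100]

result = chisquare(f_obs=observed, f_exp=expected)
print(result.statistic, result.pvalue)   # a significant result signals a non-flat trace line
```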

GUIDELINES FOR EVALUATING ITEMS

Most testing programs have guidelines for evaluating MC items. As they eval-
uate item performance, particularly with SMEs, these guidelines are used to
identify items that can be used with confidence in the future and items that
need to be revised or retired. Table 9.8 provides a generic set of guidelines.
Depending on the overall difficulty of items and other conditions, values are
used to replace words in this table. For example, a moderate item might have
a difficulty (p value) between 40% and 90%. An easy item would have a p
value above 90%. Satisfactory discrimination might be. 15 or higher. Unsatis-
factory discrimination would be lower than .15. Negative discrimination
would signal a possible key error.

TABLE 9.8
Generic Guidelines for Evaluating Test Items

Item Type Difficulty Discrimination Discussion


1 Moderate Satisfactory Ideal type of item. An item bank should
contain Type 1 items.
2 Moderate Low or negative Item does not discriminate and does not
contribute significantly to reliability. Item
should be retired or revised.
3 High Irrelevant Item is very easy. Such items can be
retained if the subject matter expert
believes the item measures essential
material.
4 Low Satisfactory Although the item is very hard, it does
discriminate. Such items can be retained
in an operational item bank but should be
used sparingly in a future test.
5 Low Low This item performs so poorly that it
should be retired or revised.
6 Low Low This item performs just like the previous
item type but one of the distractors
performs just like a Type 1 item. This
signifies a key error.

SUMMARY

This chapter focuses on studying and evaluating item responses with the ob-
jective of keeping, revising, or retiring each item. A variety of perspectives
and methods are described and illustrated. Tabular methods provide clear
summaries of response patterns, but graphical methods are easier to under-
stand and interpret. Statistical indexes with tests of statistical significance
are necessary to distinguish between real tendencies and random variation.
The chapter ends with a table providing a generic set of guidelines for evalu-
ating items. All testing programs would benefit by adopting guidelines and
studying item response patterns. Doing item response studies and taking ap-
propriate action is another primary source of validity evidence, one that bears
on item quality.
10
Using Item Response
Patterns to Study
Specific Problems

OVERVIEW

Chapter 8 summarizes an important source of validity evidence that comes
from item development. Chapter 9 deals with another important source of
validity evidence that comes from the statistical analysis of item responses.
This chapter examines several problems involving item responses that
threaten validity. Ignoring these problems runs the risk of undermining the
argument you build in validation and countering the validity evidence you
have assembled supporting the validity of test score interpretations and uses.
The first problem is item bias (equity) and the statistical procedure known
as DIF (i.e., differential item functioning). The observance of significant DIF
in item responses diminishes the validity of test score interpretations and
uses. The study of DIF is essential for any test with significant consequences.
A second problem is the study and detection of each test taker's item re-
sponse pattern. An aberrant pattern may result in an invalid test score inter-
pretation. One aspect of this problem is missing responses and one possible
solution is imputation. Another aspect of this problem is person fit (appropriate-
ness measurement). We have a growing body of theory and methodology for
studying person fit and many indications that aberrant student responses un-
dermine valid interpretations and uses of test scores.
A third problem is the dimensionality of a set of items proposed for a test.
Modern test scoring methods mainly work under the assumption that a test's
items are unidimensional. Whether a set of item responses meets criteria for


unidimensionality greatly affects the interpretation of test scores. Studies es-
tablishing dimensionality of item responses are highly desirable.
A fourth problem deals with the limitation of using dichotomous scoring for
MC items when we have polytomous scoring, which provides more reliable test
results. Information contained in distractors is differential and can be advanta-
geous in computing test scores. Ignoring this information in distractors may
lower reliability, which is a primary type of validity evidence.
Each of the four sections of this chapter provides a brief treatment of the
topic, references and a review of this literature, and recommendations for stud-
ies that testing programs might undertake to evaluate threats to validity or to
add important validity evidence.

ITEM BIAS

We know that test results may be used in many ways, including placement,
selection, certification, licensing, or advancement. These uses have both
personal and social consequences. Test takers are often affected by test
score uses. In licensing and certification, we run a risk by certifying or li-
censing incompetent professionals or by not certifying or licensing compe-
tent professionals.
Bias is a threat to valid interpretation or use of test scores because bias fa-
vors one group of test takers over another. Bias also has dual meanings. Bias is
a term that suggests unfairness or an undue influence. In statistics, bias is sys-
tematic error as opposed to random error. A scale that "weighs heavy" has this
statistical bias. Although bias has these two identities, the public is most
likely to identify with the first definition of bias rather than the second
(Dorans & Potenza, 1993). Although the discussion has been about bias in
test scores, in this section, the concern is with bias in item responses, thus the
term item bias.
As discussed in chapter 8, sensitivity review involves a trained committee
that subjectively identifies and questions items on the premise that test tak-
ers might be distracted or offended by the item's test content. Therefore, sen-
sitivity item review is concerned with the first meaning of item bias.
DIF refers to a statistical analysis of item responses that intends to reveal
systematic differences among groups for a set of responses to a test item that is
attributable to group membership instead of true differences in the construct
being measured. In other words, the hypothesis is that the groups do not differ.
If a DIF analysis finds differences in item performance, items displaying DIF
are suspected of being biased. Removal of offending items reduces the differences
between these groups to zero and removes this threat to validity.
Several important resources contributed to this section. The first is an ed-
ited volume by Holland and Wainer (1993) on DIF. This book provides a

wealth of information about this rapidly growing field of item response analysis.
Camilli and Shepard (1994) also provided useful information on DIF. Another
source is an instructional module on DIF by Clauser and Mazor (1998).
Readers looking for more comprehensive discussions of DIF should consult
these sources and other references provided here.

A Brief History

A barbering examination in Oregon in the late 1800s is one of the earliest examples
of testing for a profession. Since then, test programs for certification,
licensure, or credentialing have proliferated (Shapiro, Stutsky, & Watt, 1989).
These kinds of testing programs have two significant consequences. First, persons
taking the test need to pass to be certified or licensed to practice. Second,
these tests are intended to distinguish competent from incompetent professionals,
assuring the public of safer professional practice.
Well-documented racial differences in test scores led to widespread discon-
tent, culminating in a court case, the Golden Rule Insurance Company versus
Mathias case in the Illinois Appellate Court in 1980. Federal legislation led to
policies that promoted greater monitoring of Black-White racial differences in
test performance. The reasoning was that if a Black-White difference in item
difficulty was greater than the observed test score difference, this result would
suggest evidence of DIF.

Methods for Studying DIF. Methods for studying DIF have proliferated. Table 10.1 provides a brief list of computer programs that are commercially available for the study of DIF.

TABLE 10.1
Commercially Available Computer Programs for the Study
of Differential Item Functioning (DIF)

Name of Program and Brief Description (Source)

BILOG-MG 3. A new version of a popular and versatile program that provides DIF and many other test statistics. (www.ssicentral.com)
DIFPACK. An integrated package of programs that includes SIBTEST, POLY-SIBTEST, DIFCOMP, and DIFSIM. (www.assess.com)
CONQUEST and QUEST. Provide a variety of item statistics including DIF; based on the one-parameter Rasch model. (www.assess.com)
MULTILOG 7. This program is a research instrument and provides DIF statistics. (www.ssicentral.com)
PARSCALE 4. Provides DIF statistics for rating scale data. (www.ssicentral.com)

Research shows that many methods for detecting DIF share much in common.
We can delineate the field of DIF using four discrete methods with the
understanding that items detected using one method are likely to be identified
using other methods.

1. IRT methods. This class of methods features comparisons of trace lines
or item parameters for different groups of examinees. Normally, there is a
reference group and a focal group from which to make comparisons. A statistically
significant difference implies that the item is differentially favoring
one group over another (Thissen & Wainer, 2001). However, it matters
which IRT model is used because the simplest model (Rasch) involves only
difficulty, whereas DIF may occur with the discrimination or guessing (pseudo-chance)
parameters as well. These methods have large sample size
requirements, and users must have a good understanding of how parameters
are estimated and compared. Stout and his colleagues at the IRT
Modeling Lab at the University of Illinois have extensive experience
(Shealy & Stout, 1996; Stout & Roussos, 1995). Table 10.1 provides the
source for their family of computer programs.
2. Mantel-Haenszel (MH) statistic. Introduced by Holland and Thayer
(1988), this method is one of the most popular. Frequency counts are done
using a contingency table based on the reference and focal groups. This
statistic considers the odds of two different groups correctly answering an
item when the ability of the groups is already statistically controlled; a
computational sketch appears after this list. Specific details on its calculation
and uses can be found in Holland and Thayer. The statistic is evaluated using
a chi-square test. Holland and Thayer suggested that the MH statistic is similar
to the procedure for DIF associated with the Rasch model. This fact also
supports the idea that the many DIF statistics proposed share much in
common. However, the MH statistic is considered one of the most powerful
for detecting DIF, which explains its popularity.
3. Standardization. This method is the simplest to understand and ap-
ply and has the most common appeal among practitioners. However, this
method lacks a statistical test. Dorans and Holland (1993) provided a
good account of the development of this method, which is based on em-
pirical trace line analysis and a standardized difference in difficulty for
the focal and reference groups. They pointed out the proximity of this
method to the MH.
4. Logistic regression. One might consider this method as a unifying idea
for the other methods provided. It operates from a contingency table but
uses total score as the criterion measure. Although it approximates results
found with MH, it is superior for nonuniform DIF situations. Thus, this
method is more adaptable to a greater variety of situations.
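To make the contingency-table logic of the MH method concrete, the following minimal Python sketch estimates the MH common odds ratio for a single item from correct and incorrect counts in reference and focal groups, stratified by total score. The data, function name, and the reporting of the result on the ETS delta scale are illustrative assumptions; operational programs such as those listed in Table 10.1 add smoothing, significance testing, and classification rules.

import math
from collections import defaultdict

def mantel_haenszel_odds_ratio(scores, groups, correct):
    # scores:  total test scores (the stratifying, ability-matching variable)
    # groups:  'R' for reference group, 'F' for focal group
    # correct: 1 if the studied item was answered correctly, else 0
    tables = defaultdict(lambda: [[0, 0], [0, 0]])   # one 2 x 2 table per score level
    for s, g, c in zip(scores, groups, correct):
        row = 0 if g == 'R' else 1
        col = 0 if c == 1 else 1
        tables[s][row][col] += 1
    numerator, denominator = 0.0, 0.0
    for (a, b), (c, d) in ((t[0], t[1]) for t in tables.values()):
        n = a + b + c + d                 # a, b = reference correct/incorrect
        if n == 0:                        # c, d = focal correct/incorrect
            continue
        numerator += a * d / n
        denominator += b * c / n
    return numerator / denominator if denominator > 0 else float('nan')

# An odds ratio near 1.0 suggests little DIF; results are often reported
# on the ETS delta scale as MH D-DIF = -2.35 * ln(odds ratio).
alpha = mantel_haenszel_odds_ratio(
    scores=[5, 5, 5, 5, 6, 6, 6, 6],
    groups=['R', 'R', 'F', 'F', 'R', 'R', 'F', 'F'],
    correct=[1, 0, 1, 0, 1, 1, 0, 1])
print(round(alpha, 2), round(-2.35 * math.log(alpha), 2))   # 3.0, about -2.58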

Distractor DIF. DIF is widely recognized as a problem of item responses
whereby two or more groups believed to be equal with respect to what the test
measures show unequal performance on the item. This definition of DIF
can be extended to distractor selection as well (Alagumalai & Keeves, 1999).
This work explores three DIF methods and supports the study of item responses
in a more detailed, in-depth manner. Although they provide some interesting
results concerning distractor DIF, there is no explanation or
confirmation of why gender differences exist. For distractor DIF to be useful,
we have to use other information, such as that derived from think-aloud procedures,
to confirm why distractors perform differentially.

Conclusions and Recommendations. DIF is a healthy and actively growing
field of study. The emerging DIF technology, assisted by user-friendly software,
gives test users important and necessary tools to improve test items and
therefore improve the validity of test score interpretations and uses. For formal
testing programs, especially when the stakes for test takers and the test
sponsor are moderate to high, DIF studies are essential validity evidence. The
Standards for Educational and Psychological Testing (AERA et al., 1999) discusses
the need to detect and eliminate DIF. Therefore, DIF studies of item
responses seem essential because they address a threat to validity.
Clauser and Mazor (1998) provided a comprehensive discussion of these
methods and situations when one might be preferred to another. They described
several conditions to be considered before choosing a DIF method. Although
these methods have much in common, detailed discussion of these
conditions is recommended before choosing a method. Overall, it seems the
choice of any method could be justified.
As Ramsey (1993) pointed out, the use of DIF requires human judgment.
Statistics alone will not justify the inclusion or rejection of any item. Thus, the
study of item bias using DIF is a more involved process that combines judgment
with the use of one of these DIF methods.

NONRESPONSE AND ABERRANT RESPONSE PATTERNS

The fact that an examinee takes a test does not guarantee that the resulting re-
sponses are validly scorable. There are many reasons examinees may respond to
a test in ways that misinform us about their true achievement.
This section discusses nonresponse and aberrant responses. First, types of
responses are discussed and the problem of nonresponse is described. Then im-
putation is discussed as one approach to reducing the seriousness of this prob-
lem. Second, hypotheses are presented that explain the origins of aberrant
examinee responses. Then, statistical methods are discussed that address some

of these aberrant patterns. The detection of aberrant response patterns potentially invalidates test score interpretations and uses.

Nonresponse and Imputation

In this section, two related topics are discussed. The first topic involves types of
responses that arise from taking an MC or CR item. The second topic is how we
treat two of these types of responses.

Types of Responses. When any examinee takes an MC or CR test item, several types of responses are possible, and these are listed in Table 10.2.
The most prevalent response type is the correct answer. This type of re-
sponse assumes that the test taker knows the correct answer.
The second response type is an uneducated guess that may result in a correct
or incorrect answer. The test taker does not know the answer and may choose
any MC option. With the CR format, the test taker may choose to bluff. The
probability of making a correct guess for an MC item is one divided by the
number of options.

TABLE 10.2
A Taxonomy of Item Responses

Domain of Responses for Multiple-Choice (MC) or Constructed-Response (CR) Items

Correct answer. The student knows the correct answer and either selects or creates it.

An uneducated guess. The student does not know the correct answer but makes an uneducated guess. The probability of making a correct guess on MC is 1/number of options. With an open-ended item, this probability is indeterminate but is probably very small.

An educated guess. The student does not know the correct answer but makes an educated guess using partial knowledge, clues, or the elimination of implausible distractors. With constructed-response items, the student may bluff. In both instances, the probability of obtaining a correct answer is higher than for an uneducated guess.

Omitted response. The student omits a response.

Not reached. The student makes one or more responses to a block of test items and then leaves no responses following this string of attempted responses.

For a CR item, the probability of a successful bluff given no prior knowledge is probably zero. With this second type of response, it is possible to obtain a correct answer, but this answer was obtained in complete ignorance of the right answer.
The third response type is an educated guess. For an MC item, partial
knowledge comes into play. The test taker is believed to know something about
the topic and may be able to eliminate one or more distractors as implausible
and then guess using the remaining plausible options as a basis for the choice.
As a result, the probability of obtaining a correct answer is greater than chance.
For the CR format, the tendency to bluff may be based on some strategy that
the test taker has learned and used in the past. For instance, using vocabulary
related to the topic or writing long complex sentences may earn a higher score
than deserved. In this instance, a correct answer is more probable but the test
taker also has a greater level of proficiency.
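As a simple illustration of these probabilities: for a four-option item, an uneducated guess is correct with probability 1/4 = .25, whereas an educated guess that first eliminates two implausible distractors is correct with probability 1/2 = .50.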
The fourth response type is an omitted response. The response string in the
test may show responses, but occasionally the test taker may choose for some
unknown reason to omit a response. An omitted response may occur with ei-
ther an MC or CR item.
The fifth response type is not-reached. The test taker may attempt one or
more items and then quit responding. Although the unattempted items may be
considered omitted, these items are classified as not reached because we want
to distinguish between the conscious act of omitting a response and quitting
the test entirely.
Haladyna, Osborn Popp, and Weiss (2003) have shown that omitted and
not-reached rates for MC and CR items on the NAEP reading assessment are
unequal. Students have a greater tendency to omit CR items. For not-reached
responses, students have a tendency to stop responding after encountering a
CR item. Omit and not-reached rates are also associated with educational
disadvantage.

Imputation. Imputation refers to the practice of estimating what an examinee might have done on an item if he or she had attempted it, based on the existing pattern of responses. Of course, any method of imputation rests on assumptions, and many methods of imputation exist.
When a test taker omits a response or quits taking the test and leaves a string
of blank responses, we have several options that we may choose to exercise,
some of which involve imputation. The most obvious action is to simply score
all blank responses as wrong. In competitive or high-stakes testing, this action
seems defensible because it is assumed that these test takers are sufficiently mo-
tivated and attentive, and well versed in test-taking strategies to respond to all
items. In low-stakes testing, lack of motivation or other factors may contribute
to nonresponse. We might be more willing to consider nonresponse as items
that are not administered. Thus, we would not score nonresponses as wrong.

DeAyala et al. (2001) provided the most up-to-date review of imputation
methods for omitted responses. They showed that some imputation methods
are more effective than others, but all imputation methods rest on the dilemma
of whether the test taker should or should not accept responsibility
for omitting one or more responses or quitting the test. Test-scoring
computer programs, for example, BILOG-MG 3 (Zimowski et al., 2003), provide
for imputed scores. One conclusion that DeAyala et al. drew is that scoring
omitted responses as wrong is probably the worst practice. The imputation
methods they studied provided useful remedies to omitted responses.
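The short Python sketch below contrasts two simple scoring conventions for nonresponse: treating omitted items as wrong versus imputing a chance-level expected score for them, while leaving not-reached items unscored. It is only an illustration of the dilemma just described, not one of the specific imputation methods DeAyala et al. (2001) evaluated, and the coding of responses (None for an omit, 'NR' for a not-reached item) is an assumption made for the example.

def score_with_imputation(responses, key, n_options=4, impute_omits=True):
    # responses: the examinee's chosen options, with None for an omitted item
    #            and 'NR' for an item the examinee never reached
    # key:       the list of correct options
    total = 0.0
    for resp, correct in zip(responses, key):
        if resp == 'NR':
            continue                       # not reached: left unscored here
        if resp is None:
            if impute_omits:
                total += 1.0 / n_options   # impute a chance-level expected score
            # else: the omitted item simply counts as wrong
        elif resp == correct:
            total += 1.0
    return total

key = ['A', 'C', 'B', 'D', 'A']
responses = ['A', None, 'B', 'NR', 'NR']
print(score_with_imputation(responses, key))                      # 2.25
print(score_with_imputation(responses, key, impute_omits=False))  # 2.0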

Conclusion. Omits and not-reached responses should be assessed as part of
scoring. A procedure for dealing with these kinds of nonresponse should exist.
The procedure should be well conceived and based on existing research. For
omitted items, regardless of the stakes involved, some method of imputation is
desirable. Depending on the stakes of the test, the decision whether to score
not-reached items may differ. For low-stakes tests, not-reached items should
not be scored. Research on the factors underlying nonresponse is much needed,
as are remedies for getting test takers to complete all responses as directed, so
that nonresponse can be eliminated as a threat to validity.

Types of Item Response Patterns

The second part of this section deals with response patterns for the test or a
block of items administered. From IRT, the probability of a correct response is a
function of the difficulty of the item and the achievement level of the test
taker. If a test taker responds in a way that does not follow this expected pattern,
we raise suspicions that the resulting test score might be invalid. This
discussion of aberrant response patterns is conceptual in origin but
informs about the wide range of psychological factors that may produce aberrant
item response patterns. It draws from many other sources
(Drasgow, Levine, & Zickar, 1996; Haladyna & Downing, in press; Meijer,
1996; Meijer, Muijtjens, & van der Vleuten, 1996; Wright, 1977). We have at
least nine distinctly different psychological processes that may produce aberrant
response patterns.

Anxiety. A persistent problem in any standardized or classroom testing setting is anxiety that depresses test performance. Hill and Wigfield (1984) estimated that about 25% of the population has some form of test anxiety. Test anxiety is treatable. One way is to prepare adequately for a test. Another is to teach good test-taking skills and strategies. Test anxiety reduces performance, leading us to misinterpret a test score.

An aberrant response pattern related to anxiety might be a string of incorrect responses followed by a normal-looking response string. This normalcy may happen after initial test anxiety is overcome during a testing session. It would be helpful to know what other kinds of item response patterns are linked to test anxiety and to know whether MC or CR items differentially affect responses by test-anxious examinees.

Cheating. Cheating inflates estimates of achievement scores, leading to
invalid interpretations and uses of test scores. The problem of cheating is significant.
Cheating takes two forms: institutional and personal. With the former,
there is a systematic group error caused by a test administrator or someone
who is not taking the test. Personal cheating usually occurs in high-stakes settings
and varies according to the individual. Cannell (1989), Haladyna, Nolen,
and Haas (1991), and Mehrens and Kaminski (1989) discussed aspects and
ramifications of cheating in standardized testing. They reported on the extensiveness
of this problem in American testing. Cizek (1999) devoted an entire
volume to this topic. He provided extensive documentation of cheating at both
the institutional and individual levels.
According to Frary (1993), methods to combat cheating in high-stakes test-
ing programs involve scrambling of test items from form to form; multiple test
forms, each consisting of different sets of items; or careful monitoring of test
takers during test administration. Given that these methods may fail to prevent
cheating, test administrators need to identify potential instances of cheating
and obtain evidence in support of an accusation.
An extensive literature exists for the detection of patterns of answer copying
by test takers. For example, Bellezza and Bellezza (1989) reported in their re-
view of this problem that about 75% of undergraduate college students resort
to some form of cheating. They suggested an error-similarity pattern analysis
based on binomial probabilities. Bellezza and Bellezza's index resembles earlier
indexes suggested by Angoff (1974) and Cody (1985). Their method identifies
outliers: performances so similar with respect to wrong answers that they may
have occurred through copying. A computer program, SCRUTINY (http://
www.assess.com/), is designed to screen test results for possible cheating. This
program has a report called a suspicious similarities report that identifies
examinees who may have copied someone else's answers.
It is important to note that the study of patterns of right answers may be mis-
leading because it is possible for two persons studying together to have similar
patterns of right answers, but it is unlikely that wrong answer patterns will be
similar because distractors have differential attractiveness and most tests have
three or four distractors per item.
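The following minimal Python sketch illustrates an error-similarity screen in the spirit of the indexes described above: it counts the items two examinees both answered incorrectly, tallies how many of those wrong answers are identical, and asks how surprising that agreement would be under a simple binomial model. The assumed match probability and the example data are illustrative; the sketch is not the published Bellezza and Bellezza, Angoff, or Cody index.

from math import comb

def error_similarity(resp_a, resp_b, key, match_prob=1.0 / 3.0):
    # match_prob is the assumed chance that two independent examinees choose
    # the same wrong option (about 1/3 for an item with three distractors).
    joint_wrong = [(a, b) for a, b, k in zip(resp_a, resp_b, key)
                   if a != k and b != k]
    n = len(joint_wrong)                          # items both got wrong
    m = sum(1 for a, b in joint_wrong if a == b)  # identical wrong choices
    # Upper-tail binomial probability of m or more matches among n jointly wrong items.
    p_value = sum(comb(n, x) * match_prob ** x * (1 - match_prob) ** (n - x)
                  for x in range(m, n + 1))
    return n, m, p_value

key    = list("ABCDABCDAB")
resp_1 = list("ABCDBBCDCC")   # wrong on the 5th, 9th, and 10th items
resp_2 = list("ABCDBBCDCC")   # identical wrong answers on all three
print(error_similarity(resp_1, resp_2, key))   # (3, 3, about .037)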

Creative Test Taking. Test takers may find test items so easy or ambiguous
that they will reinterpret and provide answers that only they can intelligently

justify. These test takers may also provide correct answers to more difficult
items. This pattern also resembles inattentive test takers who might "cruise"
through an easy part of a test until challenged. In chapter 8, answer justifica-
tion was discussed as a means of allowing test takers an opportunity to provide
an alternative explanation for their choice of an option. As we know from re-
search on cognitive processes, students taking the same test and items may be using differing cognitive strategies for choosing an answer. Although their choice
may not agree with the consensus correct choice, their reasoning process for
choosing another answer may be valid. Unless appeals are offered to test takers
with SME adjudication, such test-taking patterns go unrewarded. Research
where answer justification or think-aloud procedures are used should increase
our understanding of the potential to credit justified answers to test items that
do not match the keyed response.

Idiosyncratic Answering. Under conditions where the test does not have
important consequences for the test taker, some test takers may mark in
peculiar patterns. Such behavior produces a negative bias in test scores, affecting
individual and even group performances in some circumstances. An
example is pattern marking: ABCDABCDABCD ..., or
BBBCCCBBBCCC.... The identification and removal of offending scores
helps improve the accuracy of group results. Tests without serious consequences
to older children will be more subject to idiosyncratic pattern marking.
A tendency among school-age children to mark idiosyncratically has
been documented (e.g., Paris, Lawton, Turner, & Roth, 1991). Little is known
about the extensiveness of such aberrant response patterns. Thus, the problem
seems significant in situations where the test takers have little reason to
do well. Its detection should cause us to question the validity of scoring and
reporting these test results.

Inappropriate Coaching. In testing situations where the outcomes are especially important, such as licensing examinations, there are many test coach-
ing services that provide specific content instruction that may be articulated
with part of the test. Another context for coaching is with college admissions
testing. Reviews of the extant research on admissions testing coaching by
Becker (1990) and Linn (1990) provided evidence that most coaching gains
are of a small nature, usually less than one fifth of a standard deviation. Linn
made the important point that the crucial consideration is not how much
scores have changed, but how much the underlying trait that the test purport-
edly measures has changed. If coaching involves item-specific strategies, any gain should be interpreted as evidence that test behavior does not generalize to the
larger domain that the test score represents. If coached test takers are com-
pared with uncoached test takers, the subsequent interpretations might be

flawed. Haladyna and Downing (in press) argued that this type of test prepara-
tion is a CIV and a threat to validity.
The detection of inappropriate coaching can be done using any of the tech-
niques identified and discussed in the section on DIF in this chapter. The necessary precondition to using these techniques is to identify two groups, one inappropriately coached and one uncoached. Items displaying DIF provide evi-
dence of the types of items, content, and cognitive demand that affect test
scores. But research of this type about coaching effects is difficult to find. Becker
(1990) opined that the quality of most research on coaching is inadequate.

Inattention. Test takers who are not well motivated or easily distracted
may choose MC answers carelessly. Wright (1977) called such test takers
"sleepers." A sleeper might miss easy items and later correctly answer hard
items. This unusual pattern signals the inattentive test taker. If sleeper pat-
terns are identified, test scores might be invalidated instead of reported and
interpreted. The types of tests that come to mind that might have many inat-
tentive test takers are standardized achievement tests given to elementary
and secondary students. Many students have little reason or motivation to
sustain the high level of concentration demanded on these lengthy tests. This
point was demonstrated in a study by Wolf and Smith (1995) with college stu-
dents. With consequences attached to a course test, student motivation and performance were higher than in a comparable no-consequences condition. This point
was also well made by Paris et al. (1991) in their analysis of the effects of stan-
dardized testing on children. They pointed out that older children tend to
think that such tests have less importance, thus increasing the possibility for
inattention.

Low Reading Comprehension. A prevalent problem in American testing
is the influence of reading comprehension on test performance when
the construct being measured is not reading comprehension. As discussed
in chapter 4, Ryan and DeMark (2002) described a distinction that is not often
made with achievement tests regarding the importance of reading comprehension
in defining the construct. Some constructs make heavy
demands on reading comprehension and other constructs make less of a de-
mand. The issue is: To what extent should reading comprehension influ-
ence test performance? For instance, a mathematics test may not make a
large demand on reading comprehension, but if the demand is more than
test takers' capabilities for reading comprehension, their low reading com-
prehension interferes with their performance on the test. We might infer
that each student has a low mathematics achievement level when, in fact,
students actually have low reading comprehension that interfered with the
testing of the subject matter.

This problem is widespread among students who are English language learn-
ers (ELLs). Research by Abedi et al. (2000) showed that simplifying the lan-
guage in tests helps improve the performance of ELLs. This research showed
that reading comprehension is a source of CIV that is one of several threats to
validity.
The Standards for Educational and Psychological Testing (AERA et al., 1999)
provide an entire chapter containing discussions and standards addressing the
problems of students with diverse language backgrounds. In general, caution is
urged in test score interpretation and use when the language of the test exceeds
the linguistic abilities of test takers. Because we have so many ELLs in the
United States, emphasizing the problem seems justified. Testing policies sel-
dom recognize that reading comprehension introduces bias in test scores and
leads to faulty interpretations of student knowledge or ability.

Marking or Alignment Errors. Test responses are often made on optically
scannable answer sheets. Sometimes, in the midst of an anxiety-provoking
testing situation, test takers may mark in the wrong places on the answer sheet:
marking across instead of down, or down instead of across, or skipping one
place so that all subsequent answers are off by one or more
positions. Such errors can be detected. The policy to deal with the problem is
Therefore, it seems reasonable that these mismarked sheets should be detected
and removed from the scoring and reporting process, and the test taker might
be given an opportunity to correct the error if obtaining a validly interpreted
score is important.

Plodding. As described under the topic of nonresponse, some students under conditions of a timed test may not have enough time to answer all items
because of their plodding nature. These persons are careful and meticulous in
approaching each item and may lack test-taking skills that encourage
time-management strategies. Thus, they do not answer items at the end of
the test. The result is a lower score than deserved. It is not possible to extend
the time limit for most standardized tests; therefore, the prevention of the
problem lies in better test-taking training. Plodding leads to a response pat-
tern that is similar to the not-reached problem previously discussed. Al-
though the item response pattern may be the same, the reason for the pattern
is not detectable without some other means of investigation, such as student
interviews.

Summary. Table 10.3 summarizes the nine types of aberrant response pat-
terns discussed here. Research is needed on the frequency of these aberrant re-
sponse patterns and their causes in achievement tests with varying stakes.

TABLE 10.3
Aberrant Response Patterns

Pattern and Problem. Description of Psychological Processes Underlying the Pattern.

Anxiety. Anxiety affects about 25% of the population. Patterns of responses for high-anxious students may be variable. One pattern may be a string of incorrect responses followed by a pattern more closely reflecting the true achievement of the test taker.

Cheating. Individual cheating can manifest in many ways. Pattern detection is unlikely to detect all the ways cheating occurs.

Creative test taking. The test taker may have a good justification for a wrong response. Without adjudication, a creative but well-thought-out response may go uncredited.

Idiosyncratic marking. The test taker records answers in a pattern (ABCABC ...).

Inappropriate coaching. Because of item-specific coaching, some items are likely to reflect an unusual percentage of correct responses when compared with previous performance of the item.

Inattention. An unmotivated test taker may answer some hard items but miss easy items.

Low reading comprehension. Persons with low reading comprehension, including those learning the language in which the test is given, are subject to missing items not because of a lack of achievement but because of low reading comprehension.

Marking or alignment errors. The test taker does not record answers on the answer sheet correctly.

Plodding. Slow, meticulous test taking may lead the test taker to not finish the test.

Despite the statistical science that has emerged, there is little research on the
extensiveness of aberrant response patterns. We need to know more about the
frequency of these patterns, their causes, and their treatment. The next section
discusses the statistical science associated with aberrant responding.

Study and Treatment of Aberrant Response Patterns

Person fit is a fairly young science of the study of aberrant item response patterns
on a person-by-person basis. Another term used is appropriateness measurement.
If an examinee's item response pattern does not conform to an expected, plau-
sible item response pattern, we have reason to be cautious about how that re-

sulting score is interpreted and used. The objective of person fit is statistical
detection of invalid test scores. An entire issue of Applied Measurement in Edu-
cation (Meijer, 1996) was devoted to person-fit research and applications.
Readers looking for more comprehensive discussions should consult the con-
tributing authors' articles and the many references they provided as well as
those provided in this section. As a science of the study of aberrant examinee
item response patterns, person fit follows traditional IRT methods (Drasgow et
al., 1996). An alternative method of study uses nonparametric methods
(Meijer et al., 1996).

IRT Solutions to Person Fit. Although IRT methods are effective for
studying person fit, large samples of test takers are needed. The chief char-
acteristic of these methods is the use of an explicit statistical IRT model.
The context or purpose for an index for person fit is important. Drasgow and
Guertler (1987) stated that several subjective judgments are necessary. For
example, if one is using a test to make a pass-fail certification decision, the
location of a dubious score relative to the passing score and the relative risk
one is willing to take have much to do with these decisions. Other factors to
consider in using these indexes are (a) the cost of retesting, (b) the risk of
misclassification, (c) the cost of misclassification, and (d) the confidence or
research evidence supporting the use of the procedure. According to
Drasgow, Levine, and Williams (1985), aberrant response patterns are iden-
tified by first applying a model to a set of normal responses and then using a
measure of goodness of fit, an appropriateness index, to find out the degree
to which anyone deviates from normal response patterns. Levine and Rubin
(1979) showed that such detection was achievable, and since then there
has been a steady progression of studies involving several theoretical mod-
els (Drasgow, 1982; Drasgow et al., 1996; Levine & Drasgow, 1982, 1983).
These studies were initially done using the three-parameter item response
model, but later studies involved polytomous item response models
(Drasgow et al., 1985). Drasgow et al. (1996) provided an update of their
work. They indicated that appropriateness measurement is most powerful
because it has a higher rate of error detection when compared with other
methods. With the coming of better computer programs, more extensive re-
search can be conducted, and testing programs might consider employing
these methods to identify test takers whose results should not be reported,
interpreted, or used.
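To illustrate the appropriateness-measurement idea, the Python sketch below computes the standardized log-likelihood person-fit statistic lz (Drasgow, Levine, & Williams, 1985) for one examinee, given model-based probabilities of a correct response that would ordinarily come from a fitted IRT model. The probabilities and the response pattern are invented for the example; large negative values of lz flag improbable, potentially aberrant patterns.

import math

def lz_person_fit(responses, probs):
    # responses: 0/1 item scores for one examinee
    # probs:     model-based probabilities of a correct response on each item
    l0 = sum(u * math.log(p) + (1 - u) * math.log(1 - p)
             for u, p in zip(responses, probs))
    expected = sum(p * math.log(p) + (1 - p) * math.log(1 - p) for p in probs)
    variance = sum(p * (1 - p) * math.log(p / (1 - p)) ** 2 for p in probs)
    return (l0 - expected) / math.sqrt(variance)

# A supposedly able examinee who misses the easiest items yet answers the
# hardest ones correctly produces a large negative lz.
probs     = [0.95, 0.90, 0.80, 0.60, 0.40, 0.25]
responses = [0,    0,    1,    1,    1,    1]
print(round(lz_person_fit(responses, probs), 2))   # about -4.5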

Nonparametric Person Fit. Whereas the IRT method is based on actual
and ideal response patterns using one- or three-parameter IRT models, the
nonparametric person-fit statistics derive from the use of nonparametric
models. Like the field of DIF, a proliferation of methods has resulted in a large

array of choices. According to Meijer et al. (1996), three methods that stand
out are the Sato (1980) caution index, the modified caution index, and the
U3 statistic.
Sato (1975) introduced a simple pattern analysis for a classroom based
on the idea that some scores deserve a cautious interpretation. Like appro-
priateness measurement, the caution index and its derivatives have a broad
array of applications, but this section is limited to problems discussed ear-
lier. The focus of pattern analysis is the S-P chart that is a display of right
and wrong answers for a class. Table 10.4 is adapted from Tatsuoka and Linn
(1983) and contains the right and wrong responses to 10 items for 15 stu-
dents. Not only does the S-P chart identify aberrant scores, but it also iden-

TABLE 10.4
Students-Problems (S-P) Chart for a Class of 15 Students on a 10-Item Test

                             Items
Person     1   2   3   4   5   6   7   8   9  10   Total
  1        1   1   1   1   1   1   1   1   1   1     10
  2        1   1   1   1   1   1   1   1   1   0      9
  3        1   1   1   1   1   0   1   1   0   1      8
  4        1   0   1   1   1   1   0   1   0   0      6
  5        1   1   1   1   0   1   0   0   1   0      6
  6        1   1   1   0   1   0   1   0   1   0      6
  7        1   1   1   1   0   0   1   0   0   0      5
  8        1   1   1   0   1   1   0   0   0   0      5
  9        1   0   0   1   0   1   0   1   1   0      5
 10        1   1   0   1   0   0   1   0   0   1      5
 11        0   1   1   1   1   0   0   0   0   0      4
 12        1   0   0   0   1   1   0   0   0   0      3
 13        1   1   0   0   0   1   0   0   0   0      3
 14        1   0   1   0   0   0   0   0   0   0      2
 15        0   1   0   0   0   0   0   0   0   0      1
Number
correct   13  11  10   9   8   8   6   5   5   4
p value   87  73  67  60  53  53  40  33  33  27

Note. Based on Tatsuoka and Linn (1983).



tifies items with aberrant item response patterns. The S-P chart is based on
two boundaries, the S curve and the P curve, and a student-item matrix of
item responses. Students are ordered by scores, and items are placed from
easy on the left side of the chart to hard on the right. The S curve is con-
structed by counting the number correct for any student and constructing
the boundary line to the right of the item response for that student. For the
15 students, there are 15 boundary lines that are connected to form the S
curve. If a student answers items correctly outside of the S curve (to the
right of the S curve), this improbable result implies that the score should be
considered cautiously. Similarly, if a student misses an item inside of the S
curve (to the left of the S curve), this improbable result implies that the stu-
dent failed items that a student of this achievement level would ordinarily
answer correctly. In the first instance, the student passed items that would
normally be failed. Referring to Table 10.4, Student 9 answered Items 6, 8,
and 9 correctly, which would normally be missed by students at this level of
achievement. Student 9 also missed two easy items. A total of 5 improbable
responses for Student 9 points to a potential problem of interpretation for
this student score of 5 out of 10 (50%). The P curve is constructed by count-
ing the number right in the class for each item and drawing a boundary line
below that item response in the matrix. For example, the first item was cor-
rectly answered by 13 of 15 students so the P curve boundary line is drawn
below the item response for the 13th student. Analogous to the S curve, it is
improbable to miss an item above the P curve and answer an item below the
P curve correctly. Item 6 shows that three high-scoring students missed this
item whereas three low-scoring students answered it correctly. Item 6 has
an aberrant response pattern that causes us to look at it more closely. A vari-
ety of indexes is available that provides numerical values for each student
and item (see Meijer et al., 1996; Tatsuoka & Linn, 1983).
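The core logic behind these indexes can be shown with a short Python sketch that orders the items from easiest to hardest and counts Guttman-type violations, that is, a miss on an easier item paired with a correct answer on a harder item. The count below is an illustrative tally of improbable responses, not the exact formula for the caution index, the modified caution index, or U3.

def guttman_errors(responses, item_totals):
    # responses:   0/1 item scores for one examinee, in the original item order
    # item_totals: number of examinees answering each item correctly, used to
    #              order the items from easiest to hardest
    ordered = [r for _, r in sorted(zip(item_totals, responses),
                                    key=lambda pair: -pair[0])]
    # Each (0, ..., 1) pair with the 0 on an easier item counts as one violation.
    errors = 0
    for i, score in enumerate(ordered):
        if score == 0:
            errors += sum(ordered[i + 1:])
    return errors

# Student 9 from Table 10.4 misses easy Items 2 and 3 yet answers the harder
# Items 6, 8, and 9 correctly, so many violating pairs are counted.
item_totals = [13, 11, 10, 9, 8, 8, 6, 5, 5, 4]
student_9   = [1, 0, 0, 1, 0, 1, 0, 1, 1, 0]
print(guttman_errors(student_9, item_totals))   # 13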
One method that appears in the person-fit literature is U3. The underlying
assumption for this method is that for a set of examinees with a specific total
score, their item response patterns can be compared. If number correct is
identical, examinees with aberrant score patterns are subject to further con-
sideration for misfitting. Van der Flier (1982) derived this person-fit statistic
and studied its characteristics. The premise of U3 is the comparison of proba-
bilities of an item score pattern in conjunction with the probability of the pat-
tern of correct answers. The index is zero if the student responses follow a
Guttman pattern. An index of one is the reverse Guttman pattern. Meijer,
Molenaar, and Sijtsma (1994) evaluated U3, finding it to be useful for detect-
ing item response problems. In a series of studies by Meijer and his associates
(Meijer, 1996; Meijer & Sijtsma, 1995; Meijer, Molenaar, & Sijtsma, 1999;
Meijer, Muijtjens, & van der Vleuten, 1996), a number of positive findings
were reported for U3. One important finding was that this method works best
under conditions of higher reliability, longer tests, and situations where a

high proportion of examinees have aberrant patterns. U3 can also be applied to group statistics of person fit. Meijer and Sijtsma (1995) concluded that U3
was the best among a great many proposed indexes available because the sam-
pling distribution is known, facilitating interpretation of results.

Conclusions and Recommendations

Factors that contribute to aberrant response patterns have not been adequately
studied. Such studies should involve procedures such as the think-aloud
methods discussed in chapters 4 and 8. Test takers could be interviewed and the
reasons for their response patterns learned. Ultimately, we should be willing to
invalidate a test score based on aberrant performance.
The statistical science of person fit appears to lack a conceptual or theo-
retical basis that comes from an understanding of what aberrant test takers
do. Rudner, Bracey, and Skaggs (1996) suggested that person fit was
nonproductive when applied to a high-quality testing program. In their
study, only 3% of their sample had person-fit problems. This percentage
seems small. Such a result may generate rival hypotheses. Is person fit in
these data not much of a problem or are the methods used insensitive to the
range of aberrant response patterns that may exist? The statistical methods
for studying person fit do not seem sufficient for detecting unusual patterns
arising from cheating, inappropriate coaching, and other problems dis-
cussed in Table 10.3. We need a more systematic study of this problem with
better use of research methods that explore the psychological basis for aber-
rant response patterns.

DIMENSIONALITY

In this section, dimensionality is defined, its importance is emphasized in terms
of validity, the implications of unidimensionality for other test issues are discussed,
methods of study are reviewed, and recommendations are offered concerning
dimensionality and its influence on validity.

Defining Dimensionality

Messick (1989) stated that a single score on a test implies a single dimension. If
a test contains several dimensions, a multidimensional approach should be
used and one score for each dimension would be justified. A total test score
from a multidimensional test is subject to misinterpretation or misuse because
differential performance in any dimension might be overlooked when forming

this composite score. An examinee might score high in one dimension and low
in another dimension, and a total score would not reveal this kind of differen-
tial performance. Reporting only the composite also implies that the low score does not matter. In credentialing testing, low scores can have negative
consequences for future professional practice.
As we know, one of the most fundamental steps in the development of any
test is construct formulation where the trait to be measured is defined clearly.
That definition needs to state whether a single score is intended to describe the
trait or several scores are needed that differentiate critical aspects of the trait.
The underlying structure of item responses is fundamental to this definition
and eventually to the validity of interpreting and using test scores.
According to MacDonald (1985), the history of cognitive measurement fo-
cused on making a test consisting of items that share a common factor or di-
mension. Simply put:

Each test should be homogeneous in content, and consequently the items on each
test should correlate substantially with one another. (Nunnally, 1977, p. 247)

A seminal review by Hattie (1985) provided one of the best syntheses of thinking about dimensionality to that date. Tate (2002) provided a more re-
cent, comprehensive discussion of dimensionality including methods of study
and recommendations for fruitful approaches to establishing evidence for
dimensionality. Other useful references on this topic include MacDonald
(1999), Hambleton et al. (1991), and Thissen and Wainer (2001).

Importance of Dimensionality and Test Content

The Standards for Educational and Psychological Testing (AERA et al., 1999) lists
specific standards pertaining to content-related validity evidence (1.2, 1.6, 3.2,
3.3, 3.5, 3.11, 7.3, 7.11, 13.5, 13.8, 14.8, 14.9, 14.10, 14.14). Essays by Messick
(1989, 1995a, 1995b) furnished further support for the importance of content-related
evidence. Hattie (1985), MacDonald (1981, 1985, 1999),
Nunnally and Bernstein (1994), and Tate (2002) all stressed the importance of
studies that provide validity evidence for the dimensionality of a test's item re-
sponses. As we build an argument for the validity of any test score interpreta-
tion or use, content-related evidence is primary. Studies that provide such
evidence are essential to the well-being of any testing program.
What are the implications for validity that arise from a study of dimensionality?

• Any estimate of internal consistency reliability may be affected by dimensionality. If item responses are multidimensional, internal consistency will be lower than expected and reliability will be underestimated.

• According to Messick (1989), the study of dimensionality is a search for construct-irrelevant factors that threaten validity. Low reading compre-
hension may be one of these construct-irrelevant sources. If the construct
being measured does not include reading comprehension as a vital part of its
definition, a test taker's reading comprehension level should not diminish
test performance. Anxiety, inattention, cheating, and motivation may be
other construct-irrelevant factors that affect test performance. Studies of
dimensionality may seek out these sources and determine to what extent
each threatens validity.
• The way we evaluate items may be affected by dimensionality. Using
a total score or subscore as a criterion in an item analysis may provide dif-
ferent results if data are multidimensional instead of unidimensional. In
the previous chapter, it is shown that difficulty and discrimination are
computed on the basis of a total score. If, however, there is evidence that
subscores are viable, the item analysis could be conducted using a subscore
as the criterion instead of the total score. In a study of a large licensing test
in a profession, Haladyna and Kramer (2003) showed that assessment of
dimensionality affected the evaluation of items that were going to be re-
tained for future testing, revised, or retired. Decisions about the perfor-
mance of test items were different as a function of the assessment of
dimensionality.
• A typical construct in a credentialing test is based on a large domain of
knowledge and skills that derives from a study of content in a profession
(Raymond, 2002). If the validity evidence favors a unidimensional interpre-
tation, the argument for subscore validity is difficult to make because each
subscore will be highly correlated with other subscores and diagnostic infor-
mation will not be informative. Also, if we have too many subscores, the reli-
ability of subscores may be so low and standard errors so high that we have
little assurance about the validity of these subscores. If the validity evidence
fosters a multidimensional interpretation, subscores can be informative, but
there remains a need for these subscores to be reliable.
• The way we set passing scores may be affected by dimensionality. If a
set of item responses is multidimensional, there is a possibility that one of
these dimensions might be given unfair weighting in the standard-setting
process.
• The comparability of test scores may be affected by dimensionality. If
data are sufficiently multidimensional, equating depends on the unidimen-
sionality of linking items. Careless treatment of these linking items may bias
results. Multidimensionality may disturb the accuracy of the study of trends
over time.

As we can see, the determination of dimensionality has many implications for the validity of interpreting and using test scores.

Methods for the Study of Dimensionality

Tate (2002) made an important observation about the study of dimensionality. If your expectation is that the set of item responses in a test represents a single
dimension, any of a wide variety of methods he reviewed should be adequate for
your needs. However, if there is reason to expect more than one dimension, a
confirmatory, instead of an exploratory, factor analysis should be used.
The methods for the study of dimensionality have developed rapidly in re-
cent years. Coupled with computer software that is easier to use and provides
more capability, studies of dimensionality are easily performed for testing pro-
grams and should be done as part of the routine analysis. In this section, several
indicators of dimensionality are presented and discussed, ranging from simple
to more complex. Each method and its results are a source of validity evidence
that addresses content of the test.
According to Tate (2002), any study of dimensionality is the search for the
minimum number of factors that explains the patterns of item responses. With
this in mind, we can examine internal consistency reliability, correlation pat-
terns, factor analysis, and other methods for studying dimensionality.

Internal Consistency. A simple and direct method is to calculate coefficient
alpha, which is a measure of the internal consistency of item responses.
Alpha is related to the first principal component of a factor analysis of item responses.
Although alpha may not be the best indicator of dimensionality, it is
informative. Alpha can be underestimated if the sample is restricted in terms of
its variance of test scores. Nonetheless, when the sample is inclusive of the full
range of achievement, its estimation provides primary reliability evidence and
an indication of unidimensionality. If the coefficient is high relative to the
number of items on the test, unidimensionality is indicated. If this coefficient is
lower than expected and other threats, such as restricted sample variance, have
been dismissed, multidimensionality should be suspected. In this latter circumstance,
confirmatory factor analysis is recommended.
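For readers who want to compute alpha directly, the short Python sketch below applies the usual formula, k / (k - 1) times one minus the ratio of the summed item variances to the variance of total scores, to a small persons-by-items matrix. The data are illustrative.

def coefficient_alpha(score_matrix):
    # score_matrix: one row per person, one list element per item
    def variance(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / len(values)

    k = len(score_matrix[0])                      # number of items
    item_variances = [variance([person[i] for person in score_matrix])
                      for i in range(k)]
    total_variance = variance([sum(person) for person in score_matrix])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

# Five examinees and four dichotomously scored items (illustrative data).
scores = [[1, 1, 1, 1],
          [1, 1, 1, 0],
          [1, 1, 0, 0],
          [1, 0, 0, 0],
          [0, 0, 0, 0]]
print(round(coefficient_alpha(scores), 2))   # 0.8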

Multitrait, Multimethod Correlation Analysis. Another way to look at
dimensionality is to examine possible subscores in a multitrait, multimethod
correlation framework (Campbell & Fiske, 1959). Campbell and Fiske (1959)
suggested two kinds of content-related validity evidence: convergent and
discriminant. Two measures of the same traits should correlate more highly
than a measure of one trait and a measure of another trait. This state of affairs is
convergent evidence. If one test (A) produces two measures (1 and 2) and an-
other test (B) also produces two measures (1 and 2), we would expect the corre-
lations between like measures to be higher than correlations between unlike
measures. This is discriminant evidence. Table 10.5 provides a hypothetical

TABLE 10.5
Evidence Supporting the Independence of Traits

                                Method (Test A)        Method (Test B)
Evidence for Two Traits         Trait 1    Trait 2     Trait 1    Trait 2
Method Test A   Trait 1         (.56)      .53         .89        .49
                Trait 2         .32        (.66)       .50        1.00
Method Test B   Trait 1         .52        .32         (.61)      .57
                Trait 2         .28        .62         .34        (.58)

correlation matrix for two subscores measured in two tests. In terms of ob-
served correlations, median correlations among like traits should exceed the
median correlation among unlike traits.
Trait 1 is more highly correlated with Trait 1 on Test B than Trait 1 is
correlated with Trait 2 on both Test A and Test B. If we think that Trait 1
and Trait 2 are unique, we expect the correlation between Trait 1-Test A
and Trait 1-Test B to be higher than between trait correlations. We also
expect the correlation between Trait 2-Test A and Trait 2-Test B to be
higher than between trait correlations. Correlation coefficients are lim-
ited by the reliabilities of the two variables used to compute the correla-
tion coefficient. In the diagonal of the correlation matrix, the reliability
estimates are given. If we correct for unreliability, the corrected correla-
tion coefficients to the right of the reliability estimates give us an estimate
of true relationship where measurement error is not considered. Note that
the "true" correlation of Trait 1-Test A and Trait 1-Test B is .89, which is
high. Thus, we can conclude that there is some evidence that Tests A and B seem to measure the same trait, Trait 1. Also, note that the corrected correla-
tion coefficient for Trait 2-Test A and Trait 2-Test B is 1.00, which sug-
gests that the two tests are both measuring Trait 2. Table 10.5 shows good
convergent and discriminant validity evidence for Traits 1 and 2 for these
two tests, A and B.
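These corrected values are consistent with the familiar correction for attenuation, in which the observed correlation is divided by the square root of the product of the two reliabilities. For example, the observed Trait 1 correlation of .52 corrected for reliabilities of .56 and .61 is .52 / sqrt(.56 x .61), or about .89, and the observed Trait 2 correlation of .62 corrected for reliabilities of .66 and .58 is about 1.00, matching the entries above the diagonal in Table 10.5.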
Table 10.6 provides another matrix with different results. In this instance,
the correlations among traits are high, regardless of trait or test designation.
The corrected correlation coefficients to the right of the reliability estimates
are all high. In this instance, the validity evidence points to the convergence
of all measures on a single dimension. In other words, there is no indication
that Traits 1 and 2 are independent. Discriminative evidence is lacking. This
evidence points to a single dimension.

TABLE 10.6
Evidence Supporting Convergence of Traits for Method A

                                    Method (Test A)        Method (Test B)
Evidence for a Single Dimension     Trait 1    Trait 2     Trait 1    Trait 2
Method Test A   Trait 1             (.56)      .87         .91        .88
                Trait 2             .53        (.66)       .77        .97
Method Test B   Trait 1             .52        .49         (.61)      .98
                Trait 2             .50        .60         .58        (.58)

A third possibility is shown in Table 10.7. In this instance, Test A is purported to measure two traits, 1 and 2, and Test B is purported to measure the
same two traits, 1 and 2. However, the pattern of correlations shows that
each test tends to measure one trait. The trait measured by Test A is not the
same as the trait measured by Test B. This pattern suggests that the instru-
ment defines a trait, and any distinction between Trait A and Trait B is not
supported by these data. Each test measures something uniquely different
from the other test.
As we can see, the multitrait-multimethod matrix is a simple way to evalu-
ate dimensionality of traits that are part of a test. In this chapter, however, this
kind of correlation matrix is not easy to use for tests containing many items that
are believed to measure subscores. Although the logic of convergent and dis-
crimination content-related validity evidence is well illustrated using these
correlation matrices, we need to resort to other statistical methods that exam-
ine item response patterns using the same logic of the multitrait-multimethod
matrix but provide more summative findings that we can readily interpret.

TABLE 10.7
Evidence Supporting the Independence of Each Method as Measuring a Trait

                                Method (Test A)        Method (Test B)
Evidence for Instrument Bias    Trait 1    Trait 2     Trait 1    Trait 2
Method Test A   Trait 1         (.56)      .89         .41        .44
                Trait 2         .54        (.66)       .50        .36
Method Test B   Trait 1         .24        .32         (.61)      1.01
                Trait 2         .25        .22         .60        (.58)

Factor Analysis. The study of item response patterns is appropriately handled with factor analysis. Conventional, exploratory factor analysis of item
responses may produce spurious factors that reflect item difficulty, item format
effects, or grouped items such as found with item sets. Instead of working on
product-moment correlations, a linear factor analysis of the matrix of tetra-
choric correlations overcomes some of the problems associated with traditional
factor analysis.
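As a rough, exploratory complement to the specialized programs described next, the Python sketch below inspects the eigenvalues of an inter-item correlation matrix; a first eigenvalue that dwarfs the second is a common informal sign of a dominant dimension. The use of ordinary Pearson correlations here is a simplification of the tetrachoric analysis just described, and the data are illustrative.

import numpy as np

def eigenvalue_check(score_matrix):
    # score_matrix: persons x items array of 0/1 scores
    corr = np.corrcoef(np.asarray(score_matrix, dtype=float), rowvar=False)
    eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]   # largest first
    return eigenvalues, eigenvalues[0] / eigenvalues[1]

# Illustrative data: six examinees by four items.
data = [[1, 1, 1, 1],
        [1, 1, 1, 0],
        [1, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 1, 0, 0],
        [0, 0, 0, 0]]
values, ratio = eigenvalue_check(data)
print(np.round(values, 2), round(ratio, 2))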
TESTFACT 4 (Bock et al., 2003) offers a confirmatory, full-information
factor analysis with a new feature, bifactor, which allows you to gather evi-
dence for a single dimension but also allows you to posit subscores and seek
confirming evidence that supports subscore interpretations. Haladyna and
Kramer (2003) analyzed two complementary testing programs in dentistry
with evidence for unidimensionality and some modest, supporting evidence
for subscores. Subsequent, more detailed analysis showed that even though
the item responses were highly internally consistent and a principal factor
existed, the hypothesized subscores were mainly confirmed. Moreover,
discriminative information at the individual level showed that subscores pro-
vided information about differential performance of more than 70% of all
candidates tested.
Tate (2002) provided the most up-to-date discussion of the issues found
with exploratory and confirmatory factor analysis. Interested readers should
also consult the technical manual for TESTFACT for a current discussion of
the many features and rationale for full-information item factor analysis with
this confirmatory bifactor feature.

Nonparametric Analyses of Item Covariances. Nonparametric analysis is
analogous to covariance residuals in factor analysis. Although this
method is not strictly factor analytic, it comes close. Conditional item association
involves item covariances. For any pair of items, residual covariance
can exist after the influence of a single factor has been extracted.
Although this method differs from the factor analysis methods just dis-
cussed, this method answers the same question that factor analysis an-
swers. The procedure can work within or outside IRT. DIMTEST and
POLY-DIMTEST are computer programs that provide a basis for testing
hypotheses about the structure of item response data for dichotomously
and polytomously scored tests (Stout, Nandakumar, Junker, Chang, &
Steidinger, 1993). DIMTEST is intended as the formal test of unidimen-
sionality, whereas DETECT is recommended as a follow-up procedure.
Interested readers should consult the following web page for more infor-
mation about the family of methods and computer programs used for ex-
ploring dimensionality (http://www.stat.uiuc.edu/stoutlab/programs.html).
These programs are also available from Assessment Systems Corporation
(www.assess.com).

Conclusions and Recommendations

As Tate (2002) noted, any achievement test is likely to have some degree of
multidimensionality in its item responses. We need to determine whether that
degree is serious enough to undermine the validity of interpretations. Fortu-
nately, most scaling methods tolerate some multidimensionality. Factor analysis
and other methods reviewed here provide the evidence for asserting that a set
of item responses is sufficiently unidimensional.
It is strongly recommended that a study of dimensionality be routinely con-
ducted to confirm what the construct definition probably intended, a single
score that is sufficient to describe the pattern of item responses. If the construct
definition posits several dimensions, confirmatory factor analysis is recom-
mended, and the results should confirm this thinking.
In many circumstances where a single dimension is hypothesized, subscores
are thought to exist. We have the means for studying and validating item re-
sponses supporting subscore interpretations. In some instances, it is possible to
have a unidimensional interpretation with supporting empirical evidence for
subscore interpretation, as the study by Haladyna and Kramer (2003) showed.
However, establishing the validity of subscore interpretations in the face of
unidimensionality can be challenging. Gulliksen (1987) provided some guid-
ance on how to establish other validity evidence for subscore validity.

POLYTOMOUS SCALING OF MC ITEM RESPONSES

MC items are usually scored in a binary fashion, zero for an incorrect choice
and one for a correct choice. A total score is the sum of correct answers.
With the one-parameter IRT model, there is a transformation of the total
score to a scaled score. With the two- and three-parameter models, the
transformation to a scaled score is more complex because items are
weighted so that any raw score can have different scaled scores based on the
pattern of correct answers. With the traditional binary-scoring IRT models,
no recognition is given to the differential nature of distractors. This section
deals with the potential of using information from distractors for scoring
MC tests. The use of distractor information for test scoring is believed to in-
crease the reliability of test scores, which in turn should lead to more accu-
rate decisions in high-stakes pass-fail testing.

Are MC Distractors Differentially Discriminating?

The answer is yes. Traditional methods for studying distractor functioning convincingly demonstrate this (Haladyna & Downing, 1993; Haladyna & Sympson,
1988; Levine & Drasgow, 1983; Thissen, 1976; Thissen, Steinberg, &
Fitzpatrick, 1989; Thissen, Steinberg, & Mooney, 1989; Wainer, 1989).
As indicated in chapter 9, one of the best ways to study distractor performance for a test item is to use trace lines. Effective distractors have a monotonically decreasing trace line. A flat trace line indicates a nondiscriminating distractor. A trace line close to the origin has a low frequency of use, which signals that the distractor may be so implausible that even low-achieving examinees do not select it.
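
Empirical trace lines of this kind are easy to compute. The sketch below is a minimal illustration with simulated data; the option labels and score groups are hypothetical. It tabulates, for one item, the proportion of examinees in each total-score group who selected each option.

    import numpy as np

    def option_trace_lines(choices, total_scores, n_groups=5):
        # For one item, return the proportion of examinees in each total-score
        # group who selected each option; plotting these proportions against
        # the groups gives the empirical trace line for every option.
        edges = np.quantile(total_scores, np.linspace(0, 1, n_groups + 1))
        groups = np.clip(np.searchsorted(edges, total_scores, side="right") - 1,
                         0, n_groups - 1)
        return {opt: [np.mean(choices[groups == g] == opt) for g in range(n_groups)]
                for opt in np.unique(choices)}

    # Simulated item with keyed option "A" and three distractors.
    rng = np.random.default_rng(2)
    scores = rng.normal(50, 10, size=2000)
    p_correct = 1 / (1 + np.exp(-(scores - 50) / 5))
    choices = np.where(rng.random(2000) < p_correct, "A",
                       rng.choice(["B", "C", "D"], size=2000))
    for opt, line in option_trace_lines(choices, scores).items():
        print(opt, np.round(line, 2))

In this simulation the keyed option rises across score groups while each distractor falls, which is the pattern described above.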

Polytomous Scoring of MC Items

We have two approaches to using distractors in scoring MC items. The linear
approach has a longer history and a sound theoretical base. The nonlin-
ear (IRT) approach has a more recent history but also gives promise of
improving the scoring of MC items. A fundamental limiting factor with any
research on polytomous scoring of MC items is that distractors are too nu-
merous in most current tests and many of the distractors are not discrimi-
nating. Thus, the use of these polytomous scoring methods cannot be
effective unless the items provide the differential information needed to
make polytomous scoring work.

Linear Scoring Methods. Richardson and Kuder (1933) suggested a method whereby the coefficient alpha is maximized by weighting rating scale
points. Guttman (1941) proposed this method for MC item responses. Lord
(1958) showed that this method is related to the first principal component in
factor analysis. Serlin and Kaiser (1978) provided more evidence for the valid-
ity of this method, known as reciprocal averages. Haladyna and Sympson (1988)
reviewed the research on reciprocal averages and concluded that studies gen-
erally supported the premise that methods such as reciprocal averages tend to
purify traits, eliminating CIV. Evidence for Lord's proof came from increases in
the alpha coefficient and increases in the eigenvalue of the first principal com-
ponent in factor analysis following the use of reciprocal averages. Weighting
options seems to yield a more homogeneous test score. In other words, the al-
pha reliability of the option-weighted score is higher than the alpha reliability
of the binary score.
The method of reciprocal averages involves computing, for each option, the average test score of the examinees who chose that option. These option weights are used to compute a test score. Then the procedure is repeated: a new set of weights is computed and used to compute a test score. This process continues until coefficient alpha no longer improves. A test score is simply the sum of the products of weights and responses. Cross-validation of the scoring weights is recommended. Although reciprocal averages
require this iterative feature, experience shows that a single estimation is close to the iterative result (Haladyna, 1990). In a certification or licensing examination, reciprocal averages produced positive results, but the computational complexity points to a major limitation of the method. Schultz (1995) also provided results showing that option weighting performs better than simple dichotomous scoring with respect to alpha reliability and decision-making consistency.
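
A minimal sketch of the procedure may be helpful. The code below is not the operational algorithm used in any of the studies cited; it simply starts from 0/1 scoring, replaces each option's weight with the mean standardized total score of the examinees who chose it, iterates a few times, and reports coefficient alpha for the option-weighted scores. The data are simulated and the option coding is hypothetical.

    import numpy as np

    def coefficient_alpha(item_scores):
        # Coefficient alpha for an examinees x items matrix of item scores.
        k = item_scores.shape[1]
        return k / (k - 1) * (1 - item_scores.var(axis=0, ddof=1).sum()
                              / item_scores.sum(axis=1).var(ddof=1))

    def reciprocal_averages(choices, key, n_iter=5):
        # choices: examinees x items matrix of selected options (coded 0-3);
        # key: keyed option for each item. In practice the iteration continues
        # until alpha stops improving; a few passes are usually enough.
        scores = (choices == key).astype(float)          # start from 0/1 scoring
        for _ in range(n_iter):
            total = scores.sum(axis=1)
            total = (total - total.mean()) / total.std()
            new_scores = np.zeros_like(scores)
            for j in range(choices.shape[1]):
                for opt in np.unique(choices[:, j]):
                    chose = choices[:, j] == opt
                    new_scores[chose, j] = total[chose].mean()   # option weight
            scores = new_scores
        return scores, coefficient_alpha(scores)

    # Simulated example: 500 examinees, 30 four-option items, option 0 keyed.
    rng = np.random.default_rng(3)
    theta = rng.normal(size=(500, 1))
    correct = rng.random((500, 30)) < 1 / (1 + np.exp(-theta))
    choices = np.where(correct, 0, rng.integers(1, 4, size=(500, 30)))
    weights, alpha = reciprocal_averages(choices, np.zeros(30, dtype=int))
    print(round(alpha, 2))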

Polytomous IRT Scaling of MC Responses. Polytomous IRT models proposed by Bock (1972), Masters (1982), and Samejima (1979) led to the development of promising computer programs such as ConQuest, Facets, Multilog, Parscale, and RUMM that permitted the analysis of rating scale data. But the application of these models to MC items has been slow to develop. Perhaps a major reason for slow development is the discouraging finding that polytomous scaling of MC item responses usually leads to small gains in internal consistency reliability at the high cost of a complex and cumbersome procedure (Haladyna & Sympson, 1988).
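
For readers unfamiliar with these models, Bock's (1972) nominal categories model is the simplest illustration of how distractor information can enter an IRT model: each option of an item receives a slope and an intercept, and the probability of choosing it is a multinomial logit of ability. The sketch below uses hypothetical parameter values and illustrates only the model's form, not any of the programs named above.

    import numpy as np

    def nominal_response_probs(theta, a, c):
        # Option-selection probabilities under a nominal-categories model:
        # option k has probability proportional to exp(a[k] * theta + c[k]).
        z = np.array(a) * theta + np.array(c)
        ez = np.exp(z - z.max())          # subtract the max for numerical stability
        return ez / ez.sum()

    # Hypothetical four-option item; the keyed option has the largest slope.
    a = [1.2, -0.3, -0.4, -0.5]
    c = [0.0, 0.5, 0.2, -0.7]
    for theta in (-2.0, 0.0, 2.0):
        print(theta, np.round(nominal_response_probs(theta, a, c), 2))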
The most current, comprehensive, and thorough review of IRT scaling was
done by Drasgow et al. (1995). They fitted a number of proposed models to
three large standardized cognitive tests. They concluded that fitting MC re-
sponses to these polytomous IRT models was problematic, especially when
examinees omitted responses. Andrich et al. (1997) proposed a graded re-
sponse method for scaling MC items based on distractor information, the
Rasch extended logistic model. This model is suitable for multicategory scoring
such as that seen with rating scales and MC when distractors are considered.
The computer program RUMM (Andrich et al., 2001) provides a user-friendly
method for scaling MC. Research comparing results obtained by RUMM with
other methods, such as reciprocal averages, has yet to be reported but should help us understand the influence of distractors in polytomous MC scaling and the practicality of IRT approaches to polytomous MC scaling.

Conclusions and Recommendations

Any research on MC distractor plausibility should reveal that distractors have differential attractiveness to test takers and that the number of distractors with
decreasing trace lines is usually one or two per item. This persistent finding
would argue that MC items should be leaner, perhaps containing only one or
two distractors. As noted in the previous chapter, paying more attention to the
performance of distractors should lead us to develop better MC items.
However, the main purpose of this section is to assess the potential for poly-
tomous scoring of MC items. One persistent finding and conclusion is that
polytomous scoring of MC item responses provides greater precision in the
lower half of the test score distribution. If such precision is desirable, poly-
tomous scoring of MC item responses should be done. If, however, the need for
precision is in the upper half of the test score distribution, polytomous scoring
will not be very helpful.

SUMMARY

This chapter focuses on four problems that affect the validity of test score in-
terpretations and uses. All four problems involve item responses. As we think
of each of these four problems, studies related to each become part of the valid-
ity evidence we can use to support interpretations and uses of test scores. Re-
search on these problems also addresses threats to validity that are not often
considered. By examining each threat to validity and taking remedial action
where justified, we can strengthen the overall argument for each test score in-
terpretation and use.
IV
The Future of Item Development and Item Response Validation
11
New Directions in Item Writing and Item Response Validation

OVERVIEW

This book focuses on two important activities in test development, the devel-
opment of the test item and the validation of responses to the item.
In this chapter, these two interrelated topics are evaluated in terms of their
pasts and their futures.
In this final chapter, the science of item development is discussed in the con-
texts that affect its future. These contexts include (a) the role of policy at na-
tional, state, and local levels, politics, and educational reform; (b) the unified
approach to validity; (c) the emergence of cognitive psychology as a prevailing
learning theory and the corresponding retrenchment of behaviorism; and (d)
changes in the way we define outcomes of schooling and professional training.
These four contexts will greatly influence the future of item development.
Item response validation has rested on statistical theories of test scores;
therefore, fewer changes have occurred recently. The progress of polytomous
IRTs in recent years and computer software that applies these theories repre-
sent a significant advance.

FACTORS AFFECTING THE FUTURE OF ITEM DEVELOPMENT

Item writing is characterized in this book as a science much in need of nourishing theory and research. The promising theories of item writing discussed
in Roid and Haladyna (1982) did not result in further research and develop-
ment. In fact, these theories have been virtually abandoned. Bennett and
Ward (1993) published a set of papers that extended our understanding of the
similarities and differences between MC and CR item formats. In Test Theory
for a New Generation of Tests, Frederiksen et al. (1993) provided us with a
promising set of theories that linked item development to cognitive learning
theory. This effort has been followed by more extensive study of item formats
and their cognitive demands, as chapter 3 in this book shows. Irvine and
Kyllonen (2002) introduced us to more recent item development theories.
An important feature of this new work is that it includes both MC and CR formats. Another important feature is that cognitive science is
strongly linked to these efforts. Where these new theories take us will depend
on these contextual factors.

Policy, Politics, and School Reform

Education consists of various communities. These communities provide educational opportunities to millions of people in a variety of ways and at different
levels of learning that include preschool; elementary and secondary schools;
undergraduate university and college education; graduate programs; profes-
sional, military, and business training; professional development; and adult
continuing education that reflects recreational, personal, or human develop-
ment. Policymakers represent an important community within education.
Policymakers include elected and appointed federal and state officials and
school board members. They have political philosophies, constituencies, advi-
sors, and specific objectives that affect how tests are developed and used. Their
main responsibilities are to make policy and allocate resources.
Although many of these policymakers may not be well informed about
schools, schooling, theories, research on schooling, cognitive science, or statis-
tical test score theories, they have considerable influence on educational prac-
tice. These policymakers will continue to make decisions affecting testing in
their jurisdictions.
House (1991) characterized educational policy as heavily influenced by
economic and social conditions and political philosophies. He traced recent
history regarding the status of schools, concerning our economic and social
conditions, to two rival political positions—liberal and conservative. In the
liberal view, increases in spending on education will lead to better trained peo-
ple who will be producers as opposed to consumers of our resources. In the con-
servative view, the failure of education to deal with the poor has resulted in
undisciplined masses who have contributed heavily to economic and social
woes. Thus, political education platforms and their policies affect educational
policy and, more specifically, educational testing. With respect to changes in
testing in the national, state, and local school districts, the education platforms
of political parties have a major influence on the testing policies and practices
in each jurisdiction.
School reform appears to have received its impetus from the report A Nation
At Risk (National Commission on Educational Excellence, 1983). The legisla-
tion known as the No Child Left Behind Act of 2001 has provided sweeping in-
fluence over student learning, achievement testing, and accountability.
Another significant movement is restructuring of schools, which is more sys-
temic and involves decentralized control of schools by parents, teachers, and
students. Charter schools are one result of this movement.
One of many forces behind the reform movement has been the misuse of
standardized test scores. In recent years, test scores have been used in ways
unimagined by the original developers and publishers of these tests
(Haladyna, Haas, & Allison, 1998; Mehrens & Kaminski, 1989; Nolen et al.,
1992). The need for accountability has also created a ruthless test score im-
provement industry where vendors and educators employ many questionable
practices to raise test scores in high-stakes achievement tests (Cannell, 1989;
Nolen et al., 1992). This unfortunate use of test scores has led to the issuing of
guidelines governing the use of test scores in high-stakes testing programs by
the AERA (2000).
With respect to school reform, traditional ideas and practices will be reex-
amined and reevaluated. This reform movement will lead to new testing para-
digms where some of these traditional ideas and practices will survive, but
others will not. Indeed, this change is already under way. Performance testing
has affected educational testing in the nation, in states, in classrooms, and on
teaching.
MC testing has enjoyed a renaissance as policymakers and educators realize
that the foundation of most education and training is acquisition of knowledge.
MC is still the best way to measure knowledge. Also, MC is useful in approxi-
mating many types of higher level thinking processes. As we get better in using
new MC formats to measure more complex cognitive behavior, our ability to
design better MC tests is increasing.

Validity

The unified view of validity has overtaken the traditional way of studying valid-
ity, thanks to the important work of Messick (1984, 1989, 1995a, 1995b) and
many others. This view is articulated in chapter 1 and is linked to virtually ev-
ery chapter in this book. The future of item development is strongly linked to
the idea that what we do in item development yields a body of validity evidence
that adds to the mix of evidence we evaluate when making judgments about
the validity of any test score interpretation or use.

As the test item is the most basic unit of measurement, it matters greatly
that we address the issue of validity evidence at the item and item response
levels. Not only is this body of validity evidence relevant to items but it is also
relevant to the body of evidence we use to support validity for test score inter-
pretation or use.

Cognitive Psychology

Behaviorism is well established in teaching and testing. Most varieties of systematic instruction have behaviorist origins and characteristics. Included in
this list of behaviorally based examples are objective-based learning, out-
come-based learning, mastery learning, the personalized system of instruc-
tion, competency-based instruction, and the Carroll (1963) model for school
learning. These teaching methods have the common elements of unit mas-
tery, well-defined learning outcomes, and criterion-referenced tests closely
linked to learner outcomes. The pedagogy of behaviorally based learning is
well established in American education and will probably survive. What will
change is the emphasis on the development of cognitive abilities, such as
reading, writing, problem solving, and critical thinking, and testing of these
abilities using both MC and CR formats.
During this time of transition from behavioral learning theory to cognitive
science, we realize that the focus on knowledge and skills falls short of the need
to use knowledge and skills in complex ways to solve problems, think critically,
and create.
Cognitive science has still not emerged as a unified science of human
learning. Snow and Lohman (1989) described cognitive psychology as a loose
confederation of scientists studying various aspects of cognitive behavior.
Terminology among cognitive psychologists varies considerably. For instance,
knowledge structures are variously called mental models, frames, or schemas
(Mislevy, 1993). Despite this heterogeneity in the theoretical bases for re-
search, many cognitive psychologists are working on the same problems in
much the same way with a common theoretical orientation, namely, that (a)
learners develop their working internal models to solve problems, (b) these
models develop from personal experience, and (c) these models are used to
solve other similar situations encountered in life. The most intelligent behav-
ior consists of a variety of working models (schemas, the building blocks of
cognition) that have greater generality. The issue of learning task generality
to other problems encountered is critical to learning theory and testing.
Dibello, Roussos, and Stout (1993) proposed a unified theory drawing
heavily from earlier work by Tatsuoka (1985, 1990). An emergent unified the-
ory of school learning, such as this one, hopes to explain how students find, or-
ganize, and use knowledge. An emerging theory will:

1. likely derive from current and past information processing theories.
2. incorporate ideas of declarative, procedural, and strategic knowledge, as opposed to the more traditional dichotomy of knowledge and skills. Dibello et al. (1993) also proposed schematic and algorithmic knowledge.
3. provide a basis for organizing both declarative and procedural
knowledge using schemata, and a complete understanding of how these
will lead to more effective teaching methods.
4. place emphasis on problem solving and other types of higher level
thinking. Problem solving will be more complex than we realize. In fact,
there is evidence to suggest that a variety of problem-solving methods are
content bound (see Snow & Lohman, 1989).
5. be confirmed or disconfirmed by both qualitative and quantitative inquiry.
6. focus on practical applications of principles and procedures to
classroom instruction. In this context, the instructional program be-
comes the focus; its constituent parts are curriculum, instruction, and in-
tegrated testing.
7. include a way to diagnose learning difficulties using a student's incor-
rect responses.
8. incorporate a componential conceptualization of abilities into
the curriculum. Abilities will be developed over longer periods
(Gardner & Hatch, 1989; Sternberg, 1985). Test scores reflecting these
abilities will not be dramatic in showing growth because such growth is
irregular and slow.
9. involve the idiosyncratic nature of each school learner, a condition
that has direct implications for individualized instruction and individual ed-
ucation plans.
10. recognize the context of exogenous factors. The personal or social
context of each learner has a strong influence on the quality and quantity
of learning. Factors such as test anxiety, economic status, parental support
for schooling, nutrition, personal or social adjustment, physical health,
and the like become critical aspects of both theory and technology of
school learning.
11. have a component consisting of a statistical theory of option re-
sponse patterns that will be more compatible with complex, multistep
thinking.

Although we are far from having a unified learning theory, the groundwork
is being laid. Given these 11 qualities of this emerging unified, cognitive the-
ory of school learning, present-day teaching and testing practices seem al-
most obsolete. The future of item development and item response validation
in measuring student learning should be quite different from current prac-
tices as illustrated in this volume.

Barriers to Redefining the Outcomes of Schooling and Professional Competence

Two related but different barriers exist that affect the future of item develop-
ment. The first barrier is the lack of construct definition. Cognitive psycholo-
gists and others have used a plethora of terms representing higher level
thinking, including metacognition, problem solving, analysis, evaluation,
comprehension, conceptual learning, critical thinking, reasoning, strategic
knowledge, schematic knowledge, and algorithmic knowledge, to name a
few. The first stage in construct validity is construct definition. These terms
are seldom adequately defined so that we can identify or construct items that
measure these traits. Thus, the most basic step in construct validity, construct
definition, continues to inhibit both the development of many higher level
thinking behaviors and their measurement. As the focus changes from
knowledge and skills to these learnable, developing cognitive abilities, we
will have to identify and define these abilities better than we have in the past,
as Cole (1990) observed.
The second barrier is the absence of a validated taxonomy of complex cogni-
tive behavior. Studies of teachers' success with using higher level thinking
questions lead to inconclusive findings because of a variety of factors, including
methodological problems (Winne, 1979). Many other studies and reports at-
test to the current difficulty of successfully measuring higher level thinking
with the kind of scientific rigor required in construct validation. Royer et al.
(1993) proposed a taxonomy of higher level behavior and reviewed research on
its validity. This impressive work is based on a cognitive learning theory pro-
posed by Anderson (1990). Although the taxonomy is far from being at the im-
plementation stage, it provides a reasonable structure that invites further study
and validation.
Item writing in the current environment cannot thrive because of these
two barriers. Advances in cognitive learning theory should lead to better con-
struct definitions and organization of types of higher level thinking that will
sustain more productive item development, leading to higher quality
achievement tests.

Statistical Theories of Test Scores

Once constructs are defined and variables are constructed, testing provides
one basis for the empirical validation of test score interpretations and uses. In
this context, a statistical theory of test scores is adopted, and this theory can be
applied to item responses with the objective of evaluating and improving items
until they display desirable item response patterns.
Classical test theory has its roots in the early part of the 20th century and has
grown substantially. It is still widely accepted and used in testing programs de-
spite the rapid and understandable emergence of IRTs. For many reasons enu-
merated in chapter 8 and in other sources (e.g., Hambleton & Jones, 1993;
Hambleton, Swaminathan, & Rogers, 1991), classical theory has enough defi-
ciencies to limit its future use. Nonetheless, its use is encouraged by its familiar-
ity to the mainstream of test users.
Generalizability theory is a neoclassical theory that gives users the ability to
study sources of error in cognitive measurement using familiar analysis of vari-
ance techniques. Cronbach, Nanda, Rajaratnam, and Gleser (1972) formu-
lated a conceptual framework for generalizability. Brennan (2001) showed how
generalizability theory can be used to study the sources of measurement error in
many settings involving both MC and CR formats.
IRTs have developed rapidly in recent years, largely due to the efforts of the-
orists such as Rasch, Birnbaum, Lord, Bock, Samejima, and Wright, to name a
few. These theories are increasingly applied in large-scale testing programs.
Computer software is user friendly. Although IRT does not make test score in-
terpretations more valid (Linn, 1990), it provides great ability to scale test
scores to avoid CIV that arises from nonequivalent test forms.
In Test Theory for a New Generation of Tests, Frederiksen et al. (1993) as-
sembled an impressive and comprehensive treatment of ongoing theoretical
work, representing a new wave of statistical test theory. This collection of pa-
pers is aimed at realizing the goal of unifying cognitive and measurement per-
spectives with emphasis on complex learning. Mislevy (1993) distinguished
much of this recent work as departing from low- to high-proficiency testing in
which a total score has meaning to pattern scoring where wrong answers have
diagnostic value. In this setting, the total score does not inform us about how a
learner reached the final answer to a complex set of activities.
An appropriate analysis of patterns of responses may inform us about the
effectiveness of a process used to solve a problem. In other words, patterns
of responses, such as derived from the context-dependent item set, may lead
to inferences about optimal and suboptimal learning. Theoretical develop-
ments by Bejar (1993), Embretsen (1985), Fischer (1983), Haertel and
Wiley (1993), Tatsuoka (1990), and Wilson (1989) captured the rich array
of promising new choices. Many of these theorists agree that traditional
CTT and even present-day IRTs may become passe because they are inade-
quate for handling complex cognitive behavior. As with any new theory, ex-
tensive research leading to technologies will take considerable time and
resources.
These new statistical theories have significant implications for item re-
sponse validation. Traditional item analysis was concerned with estimating
item difficulty and discrimination. Newer theories will lead to option-re-
sponse theories, where right and wrong answers provide useful information,
and patterns of responses provide information on the success of learning
complex tasks.

THE FUTURE OF ITEM DEVELOPMENT

In this section, two topics are addressed. First, the status of item writing is de-
scribed. Second, the characteristics of future item development are identified
and described. A worthwhile goal should be to abandon the current prescrip-
tive method for writing items and work within the framework of an item-writ-
ing theory that integrates with cognitive learning theory.
Critics have noted that item writing is not a scholarly area of testing (e.g.,
Cronbach, 1970; Nitko, 1985). Item writing is characterized by the collective
wisdom and experience of measurement experts who often convey this knowl-
edge in textbooks (Ebel, 1951). Another problem is that item writing is not es-
pecially well grounded in research. Previous discussions of item development
in Educational Measurement (Lindquist, 1951; Linn, 1989; Thorndike, 1970)
have treated item writing in isolation of other topics, such as validity, reliability,
and item analysis, among other topics. Cronbach (1971), in his classic chapter
on validation, provided scant attention to the role of items and item responses
in test validation. Messick (1989), on the other hand, referred to the impor-
tance of various aspects of item development and item response validation on
construct validity. The current unified view of validity explicitly unites many
aspects of item development and item response validation with other critical
aspects of construct validation. But this is only a recent development.
Downing and Haladyna (1997) emphasized the role of item development and
validation on test score validation.
The criterion-referenced testing movement brought sweeping reform to test
constructors at all levels by focusing attention on instructional objectives.
Each item needed to be linked to an instructional objective. Test items were
painstakingly matched to objectives, and collections of items formed tests that
putatively reflected these objectives. The integration of teaching and testing
produced predictable results: high degree of learning, if student time for learn-
ing was flexible to accommodate slow learners. The dilemma was how specific
to make the objective. Objectives too specific limited the degree to which we
could generalize; objectives too vague produced too much inconsistency in
item development resulting in disagreement among context experts about the
classifications of these items. No single test item or even small sample of test
items was adequate for measuring an objective. The widespread use of instruc-
tional objectives in education and training is remarkable. But the criticism of
this approach is that learning can seem fragmented and piecemeal. What fails
to happen is that students do not learn to use knowledge and skills to perform
some complex cognitive operation.
The current reform movement and the current emphasis on performance
testing has caused a reconsideration of the usefulness of the instructional ob-
jective. Because criterion-referenced testing is objective driven, it may be re-
placed by statements that convey a different focus: one on the development of fluid abilities. One example of this is evidence-centered design (Mislevy, Steinberg, & Almond, 1999).
Current knowledge about item writing was kernelized by Haladyna and
Downing (1989a) into a taxonomy of 43 item-writing rules. Haladyna et
al. (2002) updated this study and reduced the list of rules to a smaller set.
Research on item writing is still unsystematic and limited to only a few
rules. New research has shown that there is some interest in advancing the
science of item writing but the more important work is in developing theo-
ries of item writing that address the urgent need to produce MC and CR
items with high cognitive demand. Theories of item writing provide a
more systematic basis for generating items that map content domains of
ability.
A series of integrative reviews by Albanese (1992), Downing (1992), Frisbie
(1992), and Haladyna (1992a) provided guidance about the variety of MC for-
mats available for item writing. This work provided an important basis for the
use of some formats and the discontinuation of other formats, such as the com-
plex MC and TF.
This legacy of item writing is characterized by a checkered past, consisting of
many thoughtful essays and chapters in textbooks about how to write items.
Although most of this advice is good, it fails to qualify as a science.

CHARACTERISTICS OF NEW THEORIES


OF ITEM WRITING

This section addresses some characteristics that these new item-writing the-
ories must possess to meet the challenge of measuring complex behavior.
These characteristics draw heavily from current thinking in cognitive science
but also rely on this item-writing legacy.

New Kinds of Tasks and Scoring

Computers now present examinees with tasks of a complex nature, with in-
teractive components that simulate real-life, complex decision making.
Scoring can offer several pathways to correct answers, and scoring can be au-
tomated. The fidelity of such creative testing is being demonstrated in com-
puter-delivered licensing tests in architecture. Mislevy (1996b) made a good
point about this emerging technology. If the information provided is no better
than that provided by conventional MC, the innovation seems pointless. These
innovations must provide something well beyond what is available using for-
mats presented in chapter 4.

The Breakdown of Standardization in Testing

Whereas the outcomes of education or training may be uniform, the means by which those outcomes are achieved may be diverse. Mislevy (1996a) also ob-
served that in graduate education, a student might be expected to have foun-
dation knowledge, but thesis or dissertation research is creative and hardly the
same from graduate to graduate. Also, not all students have the same back-
ground experiences and capability. The generalizability of one test may be limited, yet the test may be highly relevant to an immediate goal. Thus, in future test design, more
will have to be considered than simply defining a domain of tasks and comput-
ing a test score based on a sample of these tasks. Within instruction or training,
the use of computers allows for more individualization, which may nonstand-
ardize the test but will standardize the result of instruction or training. In other
words, students will follow different paths in their instruction or training, per-
haps, reaching different ends to fit their personal educational plan. Uniform
teaching and testing might end.

Inference Networks

Traditional item writing focuses on a single behavior. The stem communicates a single task; the options provide the correct and several plausible choices.
Theorists such as Royer et al. (1993) portrayed this type of testing as represent-
ing microskills, simple cognitive behaviors that, although often important, are
not as important as macroskills. The latter represents the various types of
higher level thinking.
Although the instructional objective was the basis for writing the item in the
teaching technology in the 1970s and 1980s, defining and measuring macro-
skills using the objective is difficult, perhaps contributing to the extensive fail-
ure by practitioners to write this type of test item.
Cognitive science is working toward an opposite end. Ability constructs are
more complicated, reflecting how we learn instead of what we learn. Instead of
aggregating knowledge, like filling a storeroom, learning is viewed as more
patchwork or mosaic. The schema is the mental structure for organizing this
knowledge. Mislevy (1993) provided examples of inference networks, which
are graphical representations that reflect the cluster and connectedness of
microtasks that constitute a complex cognitive behavior. These networks have
a statistical basis, reflecting the reasoning about the causality of factors that we
can observe. The inference network may contain both MC and CR elements,
each providing for a certain kind of inference. Mislevy described both a causal
model of reasoning about observations and an appropriate statistical theory
that can be used to model student behavior during learning. This is how the
unification of cognitive science and statistical test score theory takes place.

Such inference networks can illustrate the pattern of behavior in a complex process or simple proficiency—the outcome of the process.
Inference networks provide a new way to view content and cognitive behavior in a complex type of learning. The inference network can be expanded to include the instructional strategy needed for each microskill and the formative and summative aspects of learning. Item writing becomes an interesting challenge because items must model the range of behaviors that distinguish students with respect to the trait being learned and measured. Mislevy (1993) provided several examples from different fields, illustrating that inference networks will help develop effective measures of complex behavior in a variety of settings.

Item-Generating Ability

As more testing programs offer tests via computers and the format is adap-
tively administered, the need for validated items grows. Testing programs will
have to have large supplies of these items to adapt tests for each examinee on
a daily basis.
Present-day item writing is a slow process. Item writers require training.
They are assigned the task of writing items. We expect these items to go
through a rigorous battery of reviews. Then these items are administered, and
if the performance is adequate, the item is deposited in the bank. If the item
fails to perform, it is revised or discarded. We can expect about 60% of our
items to survive. This state of affairs shows why item-writing methods need to
improve.
Ideally any new item-writing theory should lead to the easy generation of
many content-relevant items. A simple example shows how item-generating
schemes can benefit item writing. In dental education, an early skill is learning
to identify tooth names and numbers using the Universal Coding System. Two
objectives can be used to quickly generate 104 test items:

Because there are 32 teeth in the adult dentition, a total of 64 items defines the
domain. Because the primary dentition has 20 teeth, 40 more items are possi-
ble. Each item can be MC, or we can authentically assess a dental student's ac-
tual performance using a patient. Also, a plaster or plastic model of the adult or
child dentition can be used. If domain specifications were this simple in all edu-
cational settings, the problems of construct definition and item writing would
be trivial.
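
A few lines of code make the point. The sketch below assumes the two objectives are (a) given a tooth number, name the tooth and (b) given a tooth name, give its number; the tooth tables are abbreviated and hypothetical stand-ins for the full Universal Coding System, so only the mechanics matter here.

    # Abbreviated, hypothetical tooth tables; the full tables hold 32 and 20 entries.
    adult = {1: "maxillary right third molar", 2: "maxillary right second molar"}
    primary = {"A": "maxillary right second molar", "B": "maxillary right first molar"}

    def generate_items(dentition, label):
        items = []
        for number, name in dentition.items():
            items.append(f"Which {label} tooth is number {number}?")     # number -> name
            items.append(f"What is the number of the {label} {name}?")   # name -> number
        return items

    bank = generate_items(adult, "adult") + generate_items(primary, "primary")
    print(len(bank), "items generated; the full tables would yield 104.")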

Chapter 7 presents different approaches to item-generating procedures. These methods are practical but limited in the cognitive demand elicited. Better item generation methods are needed.

In Item Generation for Test Development, Irvine and Kyllonen (2002) assem-
bled an impressive set of papers by scholars who have proposed or developed
new ways to generate items. Beginning where Roid and Haladyna (1982) left
off, this volume reports the efforts of many dating from the mid-1980s to the
present.
Irvine (2002) characterized current item-generation efforts as falling
into three categories: R, L, and D models. The R model is traditional and in-
volves item development as depicted in this volume. Item writers replenish
item banks. The machinery of CTT or IRT is used to produce equated tests
so that construct-irrelevant difficulty or easiness is not a threat to validity.
The L model, which has failed, emphasizes latency in responding. In few in-
stances, speed of responding is importantly related to a cognitive ability, but
for the most part, L models do not have a history that supports its continu-
ance. The D model offers continuous testing during the learning period.
Items and tests must be independent, and change is recorded on an individ-
ual basis toward a goal.
Irvine (2002) also saw technology as one of the most influential factors in fu-
ture item generation. Computer-based and computer-adaptive testing includes
variations in display, information, and response modes to consider.
With respect to specific, promising item-writing theories, Bejar (1993)
proposed the response generative model (RGM) as a form of item writing that is
superior to these earlier theories because it has a basis in cognitive theory,
whereas these earlier generative theories have behavioristic origins. The
RGM generates items with a predictable set of parameters, from which clear
interpretations are possible. Bejar presented evidence from a variety of re-
searchers, including areas such as spatial ability, reasoning, and verbal ability.
The underlying rationale of the RGM is that item writing and item response
are linked predictably. Every time an item is written, responses to that item
can confirm the theory. Failure to confirm would destroy the theory's credi-
bility. Bejar maintained that this approach is not so much an item-writing
method, a content-specification scheme, or a cognitive theory but a philoso-
phy of test construction and response modeling that is integrative.
The RGM has tremendous appeal to prove or disprove itself as it is used.
It has the attractive qualities of other generative item-writing theories,
namely, (a) the ability to operationalize a domain definition, (b) the abil-
ity to generate objectively sufficient numbers of items, and (c) the ease
with which relevant tests are created with predictable characteristics. Ad-
ditionally, RGM provides a basis for validating item responses and test
scores at the time of administration. What is not provided in Bejar's the-
ory thus far are the detailed specifications of the use of the theory and the
much-needed research to transform theory into technology. Like earlier
theories, significant research will be needed to realize the attractive
claims for this model.

Misconception Strategies

A third characteristic of new item-writing theories will be the diagnostic value of wrong choices. Current item-writing wisdom suggests that distractors should
be based on common errors of students (Haladyna & Downing, 1989a;
Haladyna et al., 2002). Although this method of creating distractors may seem
simplistic, one has only to administer items in an open-ended format to appro-
priately instructed students to develop credible distractors. This process ap-
plies to open-ended performance testing. The scoring rubric for open-ended
tests would derive from an analysis of student errors, thus making the process
much like the design of an MC item.
Tatsuoka (1985) and her colleagues proposed a model for diagnosing cog-
nitive errors in problem solving. This impressive research used her rule space
model based on task analyses of mathematics skills. Mathematics seems the
most readily adaptable to these theoretical developments. We lack applica-
tions to more challenging subject matters, for example, biology, philosophy,
history, political science, speech, reading, literature studies, psychology, and
art. Because a desirable feature of achievement tests is diagnostic informa-
tion leading to reteaching, these misconception methods are highly desirable.
Lohman and Ippel (1993) presented a general cognitive theory that exam-
ines processes that uncover misconceptions in student learning. The nature of
complex learning compels cognitive psychologists to reject traditional test
models that focus on the meaning of total test scores. These researchers go fur-
ther to assert that even measures of components of process that are often quan-
titative may be inappropriate because step-by-step observations do not capture
the essence of what makes individuals different in the performance of a com-
plex task. Lohman and Ippel looked to understandings based on developmen-
tal psychology. Instead of using quantitative indicators in a problem-solving
process, they looked for qualitative evidence. Although this work is prelimi-
nary, it shows that cognitive psychologists are sensitive to uncovering the pre-
cise steps in correct and incorrect problem solving. This work directly affects
item writing in the future. Also, conventional item writing does not contribute
to modeling complex behavior as it emerges in these cognitive theories.
An urgent need exists to make erroneous response part of the scoring system
in testing and, at the same time, provide information to teachers and learners
about the remedial efforts needed to successfully complete complex tasks. Fu-
ture item-writing theories will need this component if we are to solve the mys-
tery of writing items for higher level thinking.

Conclusion

This section discusses the future of item writing. Item writing lacks the rich
theoretical tradition that we observe with statistical theories of test scores.
The undervaluing of item writing has resulted in a prescriptive technology instead of workable item-writing theories. Most of the current work cap-
tured in the volume by Irvine and Kyllonen (2002) involves well-specified
domains of tasks of a stable nature that may be heavily loaded on a general
intelligence factor. Few contributors to this volume have addressed the
larger field involving ill-structured problems that dominate higher level
thinking and achievement.
Item-writing theories of the future will have to feature a workable method
for construct-centered abilities. Future item-writing theory will permit the
ability to generate rapidly items that completely map ability or knowledge and
skill domains.

THE FUTURE OF ITEM RESPONSE VALIDATION

Item analysis has been a stagnant field in the past, limited to the estimation of
item difficulty and discrimination using CTT or IRT, and the counting of re-
sponses to each distractor. Successive editions of Educational Measurement
(Lindquist, 1951; Linn, 1989; Thorndike, 1970) documented this unremark-
able state of affairs. The many influences described in this chapter, coupled
with growth in cognitive and item-response theories, have provided an oppor-
tunity to unify item development and item response validation in a larger con-
text of the unified approach to validity. The tools and understanding that are
developing for more effective treatment of item responses have been charac-
terized in this book as item response validation. The future of item response
validation will never be realized without significant progress in developing a
workable theory of item writing.
Chapter 9 discusses item response validation, and chapter 10 presents
methods to study specific problems. An important linkage is made between
item response validation and construct validation. Three important aspects of
item response validation that should receive more attention in the future are
distractor evaluation, a reconceptualization of item discrimination, and pat-
tern analysis. Because these concepts are more comprehensively addressed in
the previous chapter, the following discussion centers on the relative impor-
tance of each in the future.

Distractor Evaluation

The topic of distractor evaluation has been given little attention in the past.
Even the most current edition of Educational Measurement provides a scant
three paragraphs on this topic (Millman & Greene, 1989). However, Thissen,
Steinberg, and Fitzpatrick (1989) supported the study of distractors. They
stated that any item analysis should consider the distractor as an important
part of the item. Wainer (1989) provided additional support, claiming that the
graphical quality of the trace line for each option makes the evaluation of an
item response more complex but also more complete. Because trace lines are
pictorial, they are less daunting to item writers who may lack the statistical
background needed to deal with option discrimination indexes.
The traditional item discrimination index provides a useful and conve-
nient numerical summary of item discrimination, but it tends to overlook the
relative contributions of each distractor. Because each distractor contains a
plausible incorrect answer, item analysts are not afforded enough guidance
about which distractors might be revised or retired to improve the item per-
formance. Changes in distractors should lead to improvements in item per-
formance, which in turn should lead to improved test scores and more valid
interpretations.
There are at least three good reasons for evaluating distractors. First, the
distractor is part of the test item and should be useful. If it is not useful, it
should be removed. Useless distractors have an untoward effect on item dis-
crimination. Second, with polytomous scoring, useful distractors contrib-
ute to more effective scoring, which has been shown to positively affect test
score reliability. Third, as cognitive psychologists lead efforts to develop
distractors that pinpoint misconceptions, distractor evaluation techniques
will permit the empirical validation of distractor responses and by that im-
prove our ability to provide misconception information to instructors and
students.

Item Discrimination

The concept of item discrimination has evolved. An earlier discrimination index consisted of noting the difference between the mean item performance of a high-scoring group and the mean item performance of a low-scoring group. Such high-group/low-group comparisons were computationally simple. Statistical indexes such as the biserial and point-biserial were theoretically more satis-
factory, and they were routinely produced with the coming of the computer.
However, these traditional item discrimination indexes have many deficiencies that argue against their use (Henrysson, 1971). Two- and three-parameter
binary-scoring IRTs provide discrimination that is highly related to traditional
discrimination. Like traditional discrimination, the differential discriminating
abilities of distractors are immaterial.
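
The indexes mentioned here are simple to compute. The sketch below uses simulated data and hypothetical names; it contrasts the high-group/low-group difference with the point-biserial correlation for a single dichotomously scored item.

    import numpy as np

    def discrimination_indexes(item, total):
        # item: 0/1 scores for one item; total: total test scores.
        # High-low index: item difficulty in the top 27% minus the bottom 27%.
        lo, hi = np.quantile(total, [0.27, 0.73])
        d_index = item[total >= hi].mean() - item[total <= lo].mean()
        # Point-biserial: Pearson correlation between the 0/1 item and the total
        # (operational programs typically use a corrected total that excludes the item).
        r_pbis = np.corrcoef(item, total)[0, 1]
        return d_index, r_pbis

    # Simulated example: 500 examinees, one discriminating item.
    rng = np.random.default_rng(4)
    theta = rng.normal(size=500)
    item = (rng.random(500) < 1 / (1 + np.exp(-theta))).astype(int)
    total = 20 + 10 * theta + rng.normal(0, 3, size=500)
    print(discrimination_indexes(item, total))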
In polytomous scoring, discrimination has a different conceptualization.
As discussed in chapter 10, polytomous scoring views the differential infor-
mation contained in distractors more sensitively than does binary scoring.
Because discriminating distractors are infrequent, according to studies such
as Haladyna and Downing (1993), MC items in the future may necessarily be leaner, containing only two or three distractors.
This reconceptualization of item discrimination compels item analysts to
evaluate distractors as well as consider the response pattern of each distractor
relative to one another. Items that have distractors that have similar response
patterns, unless reflecting uniquely different misconceptions, may not be use-
ful in item design.

Response Pattern Analysis

Complex behavior requires many mental steps. New theories propose to model
cognitive behavior using statistical models that examine patterns of responses
among items, as opposed to traditional item analysis that merely examines the
pattern of item response in relation to total test score (Frederiksen et al., 1993;
Mislevy, 1993).
Some significant work is currently being done with context-dependent item
sets. Wainer and Kiely (1987) conceptualized item sets as testlets. Responses to
testlets involve the chaining of responses, and specific patterns have more value than others. Although this pattern analysis does not fulfill the promise of cogni-
tive psychologists regarding misconception analysis, testlet scoring takes a ma-
jor first step into the field of item analysis for multistep thinking and the
relative importance of each subtask in a testlet. Chapters 9 and 10 discuss item
response models and computer software that exist for studying various scoring
methods. As cognitive psychologists develop constructs to the point that item
writing can produce items reflecting multistep thinking, response pattern anal-
ysis will become more statistically sophisticated and useful.

SUMMARY

A unification between cognitive science and statistical test score theory is in progress. In this new environment, item writing should cease to be prescriptive.
In other words, the existence of a taxonomy of item-writing rules developed by
Haladyna et al. (2002) offers a stopgap until more scientific methods for item
writing exist. Item writing should be part of this unified theory that involves
construct definition, test development, and construct validation both at the
item and test score units of analysis. Mislevy (2003) portrayed this unification
as grounded in validity theory, where the plausibility of the logical argument
and the quality of validity evidence contributes to the validity of any assess-
ment of student learning. Toward that end, the creative act of item writing will
probably be replaced with more algorithmic methods that speed up the item-
development process and control for difficulty at the same time. Bormuth
(1971) prophesied that item writing would become automated to eliminate the
caprice and whims of human item writers. When this objectivity is realized,
achievement testing will improve. Creativity will be needed at an earlier stage
with content specification procedures, such as inference networks, that will
automate the item-writing process, but individual creativity associated with
item writers will disappear.
With item response validation, the advent of polytomous IRT has made it
more likely that we will explore the potential for developing distractors that in-
crease the likelihood of polytomous scoring of MC item responses. Conse-
quently, more attention will be given to distractor response patterns that
diagnose wrong thinking in a complex behavior, and the trace line will be a use-
ful and friendly device to understand the role that each distractor plays in
building a coherent item. Both item writing and item response validation are
important steps in test development and validation. As cognitive psychologists
better define constructs and identify the constituent steps in complex thinking,
item development and item response validation should evolve to meet the
challenge. Both item writing and item response validation will continue to play
an important role in test development. Both steps in test development will re-
quire significant research in the context of this unified theory involving both
cognitive science and statistical test score theory.
Finally, it would be remiss not to point out the increasing role of CR perfor-
mance testing in testing cognitive abilities. The CR format has received much
less scholarly attention and research than the MC format. Item writing will cer-
tainly be a unified science of observation where MC and CR assume appropri-
ate roles for measuring aspects of knowledge, skills, and abilities. The road to
better item development and item response validation will be long, as there is
still much to accomplish.
References

Abedi, J., Lord, C., Hofstetter, C., & Baker, E. (2000). Impact of accommodation strate-
gies on English language learners' test performance. Educational Measurement: Issues
and Practice, 19(3), 16-26.
Adams, R., Wu, M., & Wilson, M. (1998). ConQuest [Computer program]. Camber-
well: Australian Council for Educational Research.
Alagumalai, S., & Keeves, J. P. (1999). Distractors—Can they be biased too? Journal of
Outcome Measurement, 3(1), 89-102.
Albanese, M. A. (1992). Type K items. Educational Measurement: Issues and Practice, 12,
28-33.
Albanese, M. A., Kent, T. A., & Whitney, D. R. (1977). A comparison of the difficulty,
reliability, and validity of complex multiple-choice, multiple response, and multiple
true-false items. Annual Conference on Research in Medical Education, 16, 105-110.
Albanese, M. A., & Sabers, D. L. (1988). Multiple true-false items: A study of interitem
correlations, scoring alternatives, and reliability estimation. Journal of Educational
Measurement, 25, 111-124.
American Educational Research Association (2000). Position statement of the Ameri-
can Educational Research Association concerning high-stakes testing in pre K-12
education. Educational Researcher, 29, 24-25.
American Educational Research Association, American Psychological Association, &
National Council on Measurement in Education. (1999). Standards for educational
and psychological testing. Washington, DC: American Educational Research
Association.
American Psychological Association, American Educational Research Association, &
National Council on Measurement in Education. (1985). Standards for educational
and psychological testing. Washington, DC: American Psychological Association.
Anderson, J. R. (1990). The adaptive character of thought. Hillsdale, NJ: Lawrence
Erlbaum Associates.
Anderson, L., & Krathwohl, D. (2001). A taxonomy for learning, teaching and assessing: A
revision of Bloom's taxonomy of educational objectives. New York: Longman.
Anderson, L. W., & Sosniak, L. A. (Eds.). (1994). Bloom's taxonomy: A forty-year retro-
spective. Ninety-third Yearbook of the National Society for the Study of Education. Part II.
Chicago: University of Chicago Press.
Andres, A. M., & del Castillo, J. D. (1990). Multiple-choice tests: Power, length, and op-
timal number of choices per item. British Journal of Mathematical and Statistical Psy-
chology, 45, 57-71.
Andrich, D., Lyne, A., Sheridan, B., & Luo, G. (2001). RUMM2010: A Windows-
based computer program for Rasch unidimensional models for measurement
[Computer program]. Perth, Western Australia: Murdoch University, Social Mea-
surement Laboratory.
Andrich, D., Styles, I., Tognolini, J., Luo, G., & Sheridan, B. (1997, April). Identifying in-
formation from distractors in multiple-choice items: A routine application of IRT hypothe-
ses. Paper presented at the annual meeting of the National Council on Measurement
in Education, Chicago.
Angoff, W. H. (1974). The development of statistical indices for detecting cheaters.
Journal of the American Statistical Association, 69, 44-49.
Angoff, W. H. (1989). Does guessing really help? Journal of Educational Measurement,
26(4), 323-336.
Ansley, T. N., Spratt, K. E, & Forsyth, R. A. (1988, April). An investigation of the effects of
using calculators to reduce the computational burden on a standardized test of mathematics
problem solving. Paper presented at the annual meeting of the American Educational
Research Association, New Orleans, LA.
Assessment Systems Corporation. (1992). RASCAL (Rasch analysis program) [Com-
puter program]. St. Paul, MN: Author.
Assessment Systems Corporation. (1995). ITEM AN: Item and test analysis. [Computer
program]. St. Paul, MN: Author.
Attali, Y., & Bar-Hillel, M. (2003). Guess where: The position of correct answers in mul-
tiple-choice test items as a pyschometric variable. Journal of Educational Measure-
ment, 40, 109-128.
Attali, Y., &. Fraenkel, T. (2000). The point-biserial as a discrimination index for
distractors in multiple-choice items: Deficiencies in usage and an alternative. Journal
of Educational Measurement, 37(1), 77-86.
Bar-Hillel, M., &. Attali, Y. (2002). Seek whence: Anser sequences and their conse-
quences in key-balanced multiple-choice tests. The American Statistician, 56,299-303.
Bauer, H. (1991). Sore finger items in multiple-choice tests. System, 19(4), 453-458.
Becker, B. J. (1990). Coaching for Scholastic Aptitude Test: Further synthesis and ap-
praisal. Review of Educational Research, 60, 373-418.
Bejar, I. (1993). A generative approach to psychological and educational measurement.
In N. Frederiksen, R. J. Mislevy, & I. Bejar (Eds.). Test theory for a new generation of
tests (pp. 297-323). Hillsdale, NJ: Lawrence Erlbaum Associates.
Bejar, I. (2002). Generative testing: From comprehension to implementation. In S. H.
Irvine & R C. Kyllonen (Eds.), Item generation for test development (pp. 199-217).
Mahwah, NJ: Lawrence Erlbaum Associates.
Beller, M., & Garni, N. (2000). Can item format (multiple-choice vs. open-ended) ac-
count for gender differences in mathematics achievement? Sex Roles, 42 (1/2), 1-22.
Bellezza, F. S., &Bellezza, S. F. (1989). Detection of cheating on multiple-choice tests by
using error-similarity analysis. Teaching of Psychology, 16, 151-155.
Bennett, R. E. (1993). On the meaning of constructed response. In R. E. Bennett &. W.
C. Ward (Eds.), Construction versus choice in cognitive measurement: Issues in con-
structed response, performance testing, and portfolio assessment (pp. 1-27). Hillsdale, NJ:
Lawrence Erlbaum Associates.
REFERENCES 279

Bennett, R. E., Morley, M., Quardt, D., Rock, D. A., Singley, M. K., Katz, I. R., et al.
(1999). Psychometric and cognitive functioning of an under-determined com-
puter-based response type for quantitative reasoning. Journal of Educational Measure-
ment, 36(3), 233-252.
Bennett, R. E., Rock, D. A., & Wang, M. D. (1990). Equivalence of free-response and
multiple-choice items. Journal of Educational Measurement, 28, 77-92.
Bloom, B. S., Engelhart, M. D., Furst, E. J., Hill, W H., & Krathwohl, D. R. (1956). Tax-
onomy of educational objectives. New York: Longmans Green.
Bock, R. D. (1972). Estimating item parameters and latent ability when responses are
scored in two or more nominal categories. Psychometrika, 37, 29-51.
Bock, R. D., Wood, R., Wilson, D. T., Gibbons, R., Schilling, S. G., &Muraki, E. (2003).
TESTFACT 4: Full information item factor analysis and item analysis. Chicago: Scien-
tific Software, International.
Bordage, G., &Carretier, H., Bertrand, R., &Page, G. (1995). Academic Medicine, 70(5),
359-365.
Bordage, G., &Page, G. (1987). An alternate approach to PMPs: The key features con-
cept. In I. Hart & R. Harden (Eds.), Further developments in assessing clinical compe-
tence (pp. 57-75). Montreal, Canada: Heal.
Bormuth, J. R. (1970). On a theory of achievement test items. Chicago: University of Chi-
cago Press.
Breland, H. M., Danes, D. O., Kahn, H. D., Kubota, M. Y., &Bonner, M. W. (1994). Per-
formance versus objective testing and gender: An exploratory study of an advanced
placement history examination. Journal of Educational Measurement, 31, 275-293.
Breland, H. M., & Gaynor, J. (1979). A comparison of direct and indirect assessments of
writing skills. Journal of Educational Measurement, 6, 119-128.
Brennan, R. L (2001). Generalizability theory. New York: Springer Verlag.
Bridgeman, B., Harvey, A., Braswell, J. (1995). Effects of calculator use on scores on a
test of mathematical reasoning. Journal of Educational Measurement, 32 (4), 323-340.
Bruno, J. E., & Dirkzwager, A. (1995). Determining the optimal number of alternatives
to a multiple-choice test item: An information theoretical perspective. Educational
and Psychological Measurement, 55, 959-966.
Burmester, M. A., & Olson, L. A. (1966). Comparison of item statistics for items in a
multiple-choice and alternate-response form. Science Education, 50, 467-470.
Camilli, D. & Shepard, L. (1994). Methods for identifying biased test items. Thousand
Oaks, CA: Sage Publications.
Campbell, D. R., & Fiske, D. W (1959). Convergent and discriminant validation by the
multi-trait-multimethod matrix. Psychological Bulletin, 56, 81-105.
Campbell, J. R. (2000). Cognitive processes elicited by multiple-choice and con-
structed-response questions on an assessment of reading comprehension. Disserta-
tion Abstracts International, Section A: Humanities and Social Sciences, 60(1-A), 2428.
Cannell, J. J. (1989). How public educators cheat on standardized achievement tests. Albu-
querque, NM: Friends for Education.
Carroll,]. B. (1963). A model for school learning. Teachers Cottege Record, 64,723-733.
Case, S. M., & Downing, S. M. (1989). Performance of various multiple-choice item
types on medical specialty examinations: Types A, B, C, K, and X. In Proceedings of the
Twenty-Eighth Annua/ Conference of Research in Medical Education (167-172).
Case, S. M., Holtzman, K, & Ripkey, D. R. (2001). Developing an item pool for CBT: A prac-
tical comparison of three models of item writing. Academic Medicine, 76(10),S111-S113.
280 REFERENCES

Case, S. M, & Swanson, D. B. (1993). Extended matching items: A practical alternative


to free response questions. Teaching and Learning in Medicine, 5(2), 107-115.
Case, S. M., & Swanson, D. (2001). Constructing written test questions for the basic and
clinical sciences (3rd ed.) Philadelphia: National Board of Medical Examiners.
Case, S. M., Swanson, D. B., & Becker, D. E (1996). Verbosity, window dressing, and red
herrings: Do they make a better test item? Academic Medicine, 71 (10), S28-S30.
Case, S. M., Swanson, D. B., & Ripkey, D. R. (1994). Comparison of items in five-option
and extended matching format for assessment of diagnostic skills. Academic Medi-
cine, 69(Suppl.), S1-S3.
Cizek, G. J. (1991, April). The effect of altering the position of options in a multipk-choice ex-
amination. Paper presented at the annual meeting of the National Council on Mea-
surement in Education, Chicago.
Cizek, G. J. (1999). Cheating on tests. Mahwah, NJ: Lawrence Erlbaum Associates.
Clauser, B. E., &Mazor, K. M. (1998). Using statistical procedures to identify differen-
tially functioning test items. Educational Measurement: Issues and Practices, 17,32-44.
Cody, R. P (1985). Statistical analysis of examinations to detect cheating. Journal of
Medical Education, 60, 136-137.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale,
NJ: Lawrence Erlbaum Associates.
Cohen, A. S., & Kim S. (1992). Detecting calculator effects on item performance. Ap-
plied Measurement in Education, 5, 303-320.
Cole, N. S. (1990). Conceptions of educational achievement. Educational Researcher, 19,
2-7.
Coombs, C. H. (1953). On the use of objective examinations. Educational and Psycholog-
ical Measurement, 13 (2), 308-310.
Cox, R. C., & Vargas, J. (1966). A comparison of item selection techniques for norm-refer-
enced and criterion-referenced tests. Pittsburgh, PA: University of Pittsburgh Learning
Research and Development Center.
Crocker, L., & Algina, J. (1986). Introduction to classical and modem test theory. New York:
Holt, Rinehart, & Winston.
Crocker, L., Llabre, M., & Miller, M. D. (1988). The generalizability of content validity
ratings. Journal of Educational Measurement, 25, 287-299.
Cronbach, L. J. (1941). An experimental comparison of the multiple true-false and mul-
tiple multiple choice test. Journal of Educational Psychology, 32, 533-543.
Cronbach, L. J. (1970). Review of On the theory of achievement test items. Psychometrika,
35,509-511.
Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.), Educational Measure-
ment (2nd ed., pp. 443-507). Washington, DC: American Council on Education.
Cronbach, L. J. (1988). Five perspectives of the validity argument. In H. Wainer &.H. I.
Braun (Eds.), Test validity (pp. 3-18). Hillsdale, NJ: Lawrence Erlbaum Associates.
Cronbach, L. J., Gleser, G. C., Nanda, H., &Rajaratnum, N. (1972). The dependability of be-
havioral measurements: Theory of generaUzability for scores and profiles. New York: Wiley.
Daneman, M., & Hannon, B. (2001). Using working memory theory to investigate the
construct validity of multiple-choice reading comprehension tests such as the SAT.
Journal of Experimental Psychology: General, 30(2), 208-223
Dawson-Saunders, B., Nungester, R. J., & Downing, S. M. (1989). A comparison of single
best answer multipk-choice items (A-type) and complex multipk -choice (K-type). Phila-
delphia: National Board of Medical Examiners.
REFERENCES 281

Dawson-Saunders, B., Reshetar, R., Shea, J. A., Fierman, C. D., Kangilaski, R., &
Poniatowski, R A. (1992, April). Alterations to item text and effects on item difficulty and
discrimination. Paper presented at the annual meeting of the National Council on
Measurement in Education, San Francisco.
Dawson-Saunders, B., Reshetar, R., Shea, J. A., Fierman, C. D., Kangilaski, R., &
Poniatowski, R A. (1993, April). Changes in difficulty and discrimination related to alter-
ing item text. Paper presented at the annual meeting of the National Council on Mea-
surement in Education, Atlanta.
DeAyala, R. J., Plake, B. S., & Impara, J. C. (2001). The impact of omitted responses on
the accuracy of ability estimation in item response theory. Journal of Educational Mea-
surement, 38(3), 213-234-
de Gruijter, D. N. M. (1988). Evaluating an item and option statistic using the bootstrap
method. Tijdschrift voor Ondenvijsresearch, 13, 345-352.
DeMars, C. E. (1998). Gender differences in mathematics and science on a high school
proficiency exam: The role of response format. Applied Measurement in Education,
II (3), 279-299.
Dibello, L. V, Roussos, L. A., & Stout, W. F. (1993, April). Unified cognitive/psychometric
diagnosis foundations and application. Paper presented at the annual meeting of the
American Educational Research Association, Atlanta, GA.
Dobson, C. (2001). Measuring higher cognitive development in anatomy and physiol-
ogy students. Dissertation Abstracts International: Section-B: The Sciences and Engi-
neering, 62(5-B), 2236.
Dochy, E, Moekerke, G., De Corte, E., &Segers, M. (2001). The assessment of quantita-
tive problem-solving with "none of the above"-items (NOTA items). European Jour-
nal of Psychology of Education, 26(2), 163-177.
Dodd, D. K., & Leal, L. (2002). Answer justification: Removing the "trick" from multi-
ple-choice questions. In R. A. Griggs (Ed.), Handbook for teaching introductory psy-
chology (Vol. 3, pp. 99-100). Mahwah, NJ: Lawrence Erlbaum Associates.
Dorans, N. J., & Holland, R W (1993). DIP detection and description: Mantel-Haenzel
and standardization. In R W. Holland &H. Wainer (Eds.), Differential item functioning
(pp. 35-66). Hillsdale, NJ: Lawrence Erlbaum Associates.
Dorans, N. J., & Potenza, M. T. (1993, April). Issues in equity assessment for complex re-
sponse stimuli. Paper presented at the annual meeting of the National Council on
Measurement in Education, Atlanta, GA.
Downing, S. M. (1992). True-false and alternate-choice item formats: A review of re-
search. Educational Measurement: Issues and Practices, I I , 27-30.
Downing, S. M. (2002a). Construct-irrelevant variance and flawed test questions: Do
multiple-choice item-writing principles make any difference? Academic Medicine,
77(10), S103-S104.
Downing, S. M. (2002b). Threats to the validity of locally developed multiple-choice
tests in medical education: Construct-irrelevant variance and construct underrep-
resentation. Advances in Health Sciences Education. 7, 235-241.
Downing, S. M., Baranowski, R. A., Grosso, L. J., & Norcini, J. J. (1995). Item type
and cognitive ability measured: The validity evidence for multiple true-false
items in medical specialty certification. Applied Measurement in Education, 8(2),
187-197.
Downing, S. M., & Haladyna, T. M. (1997). Test item development: Validity evidence
from quality assurance procedures. Applied Measurement in Education, 10 (1), 61 -82.
282 REFERENCES

Downing, S. M., &Norcini, J. J. (1998, April). Constructed response or multiple-choice:


Does format make a difference for prediction? In T. M. Haladyna (Chair), Construc-
tion versus choice: A research synthesis. Symposium conducted at the annual meeting
of the American Educational Research Association, San Diego, CA.
Drasgow, F. (1982). Choice of test model for appropriateness measurement. Applied Psy-
chological Measurement, 6, 297-308.
Drasgow, E, &Guertler, E. (1987). A decision-theoretic approach to the use of appropri-
ateness measurement for detecting invalid test and scale scores. Journal of Applied
Psychology, 72, 10-18.
Drasgow, E, Levine, M. V, Tsien, S., Williams, B., & Mead, A. D. (1995). Fitting poly-
tomous item response theory models to multiple-choice tests. Applied Psychological
Measurement, 19(2), 143-165.
Drasgow, E, Levine, M. V, & Williams, E. A. (1985). Appropriateness measurement
with polychotomous item response models and standardized indices. British Journal of
Educational Psychology, 38, 67-86.
Drasgow, E, Levine, M. V, & Zickar, M. J. (1996). Optimal identification of
mismeasured individuals. AppUed Measurement in Education, 9(1), 47-64.
Dressel, R L., &Schmid, E (1953). Some modifications of the multiple-choice item. Ed-
ucational and Psychological Measurement, 13, 574-595.
Ebel, R. L. (1951). Writing the test item. In E. F. Lindquist (Ed.), Educational measure-
ment (1st ed., pp. 185-249). Washington, DC: American Council on Education.
Ebel, R. L. (1970). The case for true-false test items. School Review, 78, 373-389
Ebel, R. L. (1978). The ineffectiveness of multiple true-false items. Educational and Psy-
chological Measurement, 38, 37-44.
Ebel, R. L. (1981, April). Some advantages of alternate-choice test items. Paper presented at the
annual meeting of the National Council on Measurement in Education, Los Angeles.
Ebel, R. L. (1982). Proposed solutions to two problems of test construction. Journal of
Educational Measurement, 19,267-278.
Ebel, R. L., & Frisbie, D. A. (1991). Essentials of educational measurement (5th ed.).
Englewood Cliffs, NJ: Prentice-Hall.
Ebel, R. L., & Williams, B. J. (1957). The effect of varying the number of alternatives per
item on multiple-choice vocabulary test items. In The fourteenth yearbook. Washing-
ton, DC: National Council on Measurement in Education.
Educational Testing Service. (2003). ETS Standards for fairness and quality. Princeton,
NJ: Author.
Embretsen, S. (1985). Multicomponent latent trait models for test design. In S. E.
Embretsen (Ed.), Test design: Developments in psychology and psychometrics (pp.
195-218). Orlando, FL: Academic Press.
Embretsen, S. E, & Reise, S. E. (2000). Item response theory for psychologists. Mahwah, NJ:
Lawrence Erlbaum Associates.
Engelhard, G., Jr. (2002). Monitoring raters in performance assessments. In G. Tindal
& T. M. Haladyna (Eds.), Large-scale assessment programs for all students: Validity,
technical adequacy, and implementation (pp. 261-288). Mahwah, NJ: Lawrence
Erlbaum Associates.
Enright, M. K., &Sheehan, K. M. (2002). Modeling the difficulty of quantitative rea-
soning items: Implications from item generation. In S. H. Irvine & P C. Kyllonen
(Eds.), Item generation for test development (pp. 129-157). Mahwah, NJ: Lawrence
Erlbaum Associates.
REFERENCES 283

Eurich, A. C. (1931). Four types of examination compared and evaluated. Journal of Ed-
ucational Psychology, 26, 268-278.
Fajardo, L. L, &Chan, K. M. (1993). Evaluation of medical students in radiology written
testing using uncued multiple-choice questions. Investigative Radiology, 28 (10), 964-968.
Farr, R., Pritchard, R., & Smitten, B. (1990). A description of what happens when an
examinee takes a multiple-choice reading comprehension test. Journal of Educational
Measurement, 27, 209-226.
Fenderson, B. A., Damjanov, I., Robeson, M. R., Veloski, J. J., & Rubin, E. (1997). The
virtues of extended matching and uncued tests as alternatives to multiple-choice
questions. Human Pathology, 28(5), 526-532.
Fischer, G. H. (1983). Logistic latent trait models with linear constraints. Psychometrika,
48, 3-26.
Fitzpatrick, A. R. (1981). The meaning of content validity. Applied Psychological Mea-
surement, 7, 3-13.
Forster, F. (1974). Sample size and stable calibration. Unpublished paper.
Frary, R. B. (1993). Statistical detection of multiple-choice test answer copying: Review
and commentary. Applied Measurement in Education, 6, 153-165.
Frederiksen, N. (1984). The real test bias. Influences of testing on teaching and learning.
American Psychologist, 39, 193-202.
Frederiksen, N., Mislevy, R. J., &Bejar, I. (Eds.). (1993). Test theory for a new generation of
tests. Hillsdale, NJ: Lawrence Erlbaum Associates.
Frisbie, D. A. (1973). Multiple-choice versus true-false: A comparison of reliabilities
and concurrent validities. Journal of Educational Measurement, JO, 297-304.
Frisbie, D. A. (1992). The status of multiple true—false testing. Educational Measurement:
Issues and Practices, 5, 21-26.
Frisbie, D. A., & Becker, D. F. (1991). An analysis of textbook advice about true—false
tests. Applied Measurement in Education, 4, 67-83.
Frisbie, D. A., & Druva, C. A. (1986). Estimating the reliability of multiple-choice
true-false tests. Journal of Educational Measurement, 23, 99-106.
Frisbie, D. A., Miranda, D. U., &. Baker, K. K. (1993). An evaluation of elementary text-
book tests as classroom assessment tools. Applied Measurement in Education, 6,21-36.
Frisbie, D. A., & Sweeney, D. C. (1982). The relative merits of multiple true-false
achievement tests. Journal of Educational Measurement, 19, 29-35.
Fuhrman, M. (1996). Developing good multiple -choice tests and test questions. Journal
ofGeoscience Education, 44, 379-384.
Gagne, R. M. (1968). Learning hierarchies. Educational Psychologist, 6, 1-9.
Gallagher, A., Levin, J., & Cahalan, C. (2002). Cognitive patterns of gender differences on
mathematics admissions test. ETS Research Report 2—19. Princeton, NJ: Educational
Testing Service.
Gardner, H. (1986). The mind's new science: A history of the cognitive revolution. New York:
Basic Books.
Gardner, H., &. Hatch, T. (1989). Multiple intelligences go to school. Educational Re-
searcher, 18, 4-10.
Gamer, B. A. (Ed.). (1999). Black's Law Dictionary (2nd ed.) New York: West Publishing Co.
Garner, B. A. (Ed.). (1999). Black's Law Dictionary (7th ed.). St. Paul, MN: West Group.
Garner, M., & Engelhard, G., Jr. (2001). Gender differences in performance on multi-
ple-choice and constructed response mathematics items. Applied Measurement in Ed-
ucation, 12(1), 29-51.
284 REFERENCES

Gitomer, D. H., & Rock, D. (1993). Addressing process variables in test analysis. In N.
Frederiksen, R. J. Mislevy, & I. J. Bejar (Eds.), Test theory for a new generation of tests
(pp. 243-268). Hillsdale, NJ: Lawrence Erlbaum Associates.
Glaser, R., & Baxter, G. R (2002). Cognition and construct validity: Evidence for the na-
ture of cognitive performance in assessment situation. In H. I. Braun, D. N. Jackson,
& D. E. Wiley (Eds.), The role of constructs in psychological and educational measure-
ment (pp. 179-192). Mahwah, NJ: Lawrence Erlbaum Associates.
Godshalk, E I., Swineford, E., & Coffman, W. E. (1966). The measurement of writing
ability. College Board Research Monographs, No. 6. New York: College Entrance Exam-
ination Board.
Goleman, D. (1995). Emotional intelligence. New York: Bantam Books.
Green, K. E., & Smith, R. M. (1987). A comparison of two methods of decomposing
item difficulties. Journal of Educational Statistics, 12, 369-381.
Gross, L. J. (1994). Logical versus empirical guidelines for writing test items. Evaluation
and the Health Professions, 17(1), 123-126.
Grosse, M., & Wright, B. D. (1985). Validity and reliability of true-false tests. Educa-
tional and Psychological Measurement, 45, 1-13.
Guilford, J. R (1967). The nature of human intelligence. New York: McGraw-Hill.
Gulliksen, H, (1987). Theory of mental tests. Hillsdale, NJ: Lawrence Erlbaum Asso-
ciates.
Guttman, L. (1941). The quantification of a class of attributes: A theory and method of
scale construction. In R Horst (Ed.), Prediction of personal adjustment (pp. 321-345).
[Social Science Research Bulletin 48].
Haertel, E. (1986). The valid use of student performance measures for teacher evalua-
tion. Educational Evaluation and Policy Analysis, 8, 45-60.
Haertel, E. H., & Wiley, D. E. (1993). Representations of ability structures: Implications
for testing. InN. Frederiksen, R. J. Mislevy, &.I. Bejar (Eds.), Test theory for a new gen-
eration of tests (pp. 359-384). Hillsdale, NJ: Lawrence Erlbaum Associates.
Haladyna, T. M. (1974). Effects of different samples on item and test characteristics of
criterion-referenced tests. Journal of Educational Measurement, 11, 93-100.
Haladyna, T. M. (1990). Effects of empirical option weighting on estimating domain
scores and making pass/fail decisions. Applied Measurement in Education, 3,231-244.
Haladyna, T. M. (1991). Generic questioning strategies for linking teaching and testing.
Educational Technology: Research and Development, 39, 73-81.
Haladyna, T. M. (1992a). Context-dependent item sets. Educational Measurement: Issues
and Practices, 11, 21—25.
Haladyna, T. M. (1992b). The effectiveness of several multiple-choice formats. Applied
Measurement in Education, 5, 73-88.
Haladyna, T. M. (1998, April). Fidelity and proximity in the choice of a test item format.
In T. M. Haladyna (Chair), Construction versus choice: A research synthesis. Sympo-
sium conducted at the annual meeting of the American Educational Research Asso-
ciation, San Diego, CA,
Haladyna, T. M. (2002). Supporting documentation: Assuring more valid test score in-
terpretations and uses. In G. Tindal &T. M. Haladyna (Eds.), Large-scale assessment
for all students: Validity, technical adequacy, and implementation (pp. 89-108). Mahwah,
NJ: Lawrence Erlbaum Associates.
Haladyna, T. M., & Downing, S. M. (1989a). A taxonomy of multiple-choice item-writ-
ing rules. Applied Measurement in Education, 1, 37-50.
REFERENCES 285

Haladyna, T. M., & Downing, S. M. (1989b). The validity of a taxonomy of multiple -


choice item-writing rules. Applied Measurement in Education, 1, 51-78.
Haladyna, T. M., & Downing, S. M. (1993). How many options is enough for a multi-
ple-choice test item. Educational and Psychological Measurement, 53, 999-1010.
Haladyna, T. M., & Downing, S. M. (in press). Construct-irrelevant variance in high
stakes testing. Educational measurement: Issues and practice.
Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multi-
ple-choice item-writing guidelines for classroom assessment. Applied Measurement in
Education, 15(3), 309-334.
Haladyna, T. M., Haas, N. S., & Allison, J. (1998). Tensions in standardized testing.
Childhood Education, 74, 262-273.
Haladyna, T M., & Kramer, G. (2003). The effect of dimensionality on item analysis and
subscore reporting for a large-scak credentialing test. Manuscript submitted for publication.
Haladyna, T. M., Nolen, S. B., & Haas, N. S. (1991). Raising standardized achievement
test scores and the origins of test score pollution. Educational Researcher, 20, 2-7.
Haladyna, T. M., Osborn Popp, S., & Weiss, M. (2003). Non response in large scale achieve-
ment testing. Unpublished manuscript.
Haladyna, T M., &Roid, G. H. (1981). The role of instructional sensitivity in the empirical
review of criterion-referenced test items Journal of Educational Measurement, 18,39-53.
Haladyna, T. M., & Shindoll, R. R. (1989). Item shells: A method for writing effective
multiple-choice test items. Evaluation and the Health Professions, 12, 97-104-
Haladyna, T. M., &Sympson, J. B. (1988, April). Empirically basedpolychotomous scoring
of multiple-choice test items: A review. In nevj development in polychotomous scoring. Sym-
posium conducted at the annual meeting of the American Educational Research As-
sociation, New Orleans, LA.
Hambleton, R. K. (1984). Validating the test scores. In R. A. Berk (Ed.), A guide to cri-
terion-referenced test construction (pp. 199-230). Baltimore: Johns Hopkins Univer-
sity Press.
Hambleton, R. K. (1989). Principles and selected applications of item response theory.
In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 147-200). New York:
American Council on Education and Macmillan.
Hambleton, R. K., & Jones, R. W. (1993). Comparison of classical test theory and item
response theory and their applications to test development. Educational Measure-
ment: Issues and Practices, 12, 38-46.
Hambleton, R. K., Swaminathan, H., & Rogers, J. (1991). Item response theory: Principles
and applications (2nd ed.). Boston: Kluwer-Nijhoff.
Hamilton, L. S. (1999). Detecting gender-based differential item functioning on a con-
structed-response science test. Applied Measurement in Education, 12(3), 211-235
Hancock, G. R., Thiede, K. W., &Sax, G. (1992, April). Reliability of comparably written
two-option multiple-choice and true-false test items. Paper presented at the annual
meeting of the National Council on Measurement in Education, Chicago.
Harasym, R H., Doran, M. L., Brant, R., & Lorscheider, F. L. (1992). Negation in stems
of single-response multiple-choice items. Evaluation and the Health Professions, 16(3),
342-357.
Hatala, R., & Norman, G. R. (2002). Adapting the key features examinations for a clini-
cal clerkship. Medical Education, 36, 160-165.
Hattie, J. A. (1985). Methodological review: Assessing unidimensionality of tests and
items. Applied Psychological Measurement, 9, 139-164.
286 REFERENCES

Haynie, W. J., (1994). Effects of multiple-choice and short-answer tests on delayed re-
tention learning. Journal of Technology Education, 6(1), 32-44.
Heck, R., StCrislip, M. (2001). Direct and indirect writing assessments: Examining
issues of equity and utility. Educational Evaluation and Policy Analysis, 23(3),
275-292
Henrysson, S. (1971). Analyzing the test item. In R. L. Thorndike (Ed.), Educational mea-
surement (2nd ed., pp. 130-159) Washington, DC: American Council on Education.
Herbig, M. (1976). Item analysis by use in pre-test and post-test: A comparison of
different coefficients. PLET (Programmed Learning and Educational Technology),
13,49-54.
Hibbison, E. E (1991). The ideal multiple choice question: A protocol analysis. Forum
for Reading, 22 (2) ,36-41.
Hill, G. C., & Woods, G. T. (1974). Multiple true-false questions. Education in Chemis-
try I], 86-87.
Hill, K., & Wigfield, A. (1984). Test anxiety: A major educational problem and what
can be done about it. The Elementary School Journal, 85, 105-126.
Holland, R W, &Thayer, D. T. (1988). Differential item performance and the Mantel-
Haenzel procedure. In H. Wainer & H. Braun (Eds.), Test validity (pp. 129-145).
Hillsdale, NJ: Lawrence Erlbaum Associates.
Holland, E W, & Wainer, H. (Eds.). (1993). Differential item functioning. Hillsdale, NJ:
Lawrence Erlbaum Associates.
Holtzman, K., Case, S. M., & Ripkey, D. (2002). Developing high quality items quickly,
cheaply, consistently-pick two. CLEAR Exam Review, 16-19.
House, E. R. (1991). Big policy, little policy. Educational Researcher, 20, 21-26.
Hsu, L. M. (1980). Dependence of the relative difficulty of true-false and grouped true-
false tests on the ability levels of examinees. Educational and Psychological Measure-
ment, 40, 891-894.
HubbardJ. E (197'8). Measuring medical education: The tests and experience of the National
Board of Medical Examiners (2nd ed.). Philadelphia: Lea and Febiger.
Huff, K. L., & Sireci, S. (2001). Validity issues in computer-based testing. Educational
Measurement: Issues and Practices, 20, 16-25.
Hurd, A. W. (1932). Comparison of short answer and multiple-choice tests covering
identical subject content. Journal of Educational Research, 26, 28-30.
Irvine, S. H., & Kyllonen, R C. (Eds.). (2002). Item generation for test development.
Mahwah, NJ: Lawrence Erlbaum Associates.
Johnson, B. R. (1991). A new scheme for multiple-choice tests in lower division mathe-
matics. The American Mathematical Monthly, 98, 427—429.
Joint Commission on National Dental Examinations. (1996). National Board Dental Hy-
giene Pilot Examination. Chicago: American Dental Association.
Jozefowicz, R. E, Koeppen, B. M., Case, S., Galbraith, R., Swanson, D., &Glew, R. H.
(2002). The quality of in-house medical school examinations. Academic Medicine, 77(2),
156-161.
Kane, M. T. (1992). An argument-based approach to validity. Psychological Bulletin, 112,
527-535.
Kane, M. T. (1997). Model-based practice analysis and test specifications. Applied Mea-
surement in Education, 10, 1, 5-18.
Kane, M. T. (2002). Validating high-stakes testing programs. Educational Measurement:
Issues and Practices, 21 (1), 31-41.
REFERENCES 287

Katz, I. R., Bennett, R. E., & Berger, A. L. (2000). Effects of response format on difficulty
of SAT-mathematics items: It's not the strategy. Journal of Educational Measurement,
37(1), 39-57.
Katz, S., & Lautenschlager, G. J. (1999). The contribution of passage no-passage item
performance on the SAT1 reading task. Educational Assessment, 7(2), 165—176.
Kazemi, E. (2002). Exploring test performance in mathematics: The questions chil-
dren's answers raise. Journal of Mathematical Behavior, 21(2), 203-224-
Komrey, J. D., & Bacon, T. P (1992, April). Item analysis of achievement tests bases on small
numbers of examinees. Paper presented at the annual meeting of the American Educa-
tional Research Association, San Francisco.
Knowles, S. L., & Welch, C. A. (1992). A meta-analytic review of item discrimination
and difficulty in multiple-choice items using none-of-the-above. Educational and
Psychological Measurement, 52, 571-577.
Kreitzer, A. E., & Madaus, G. F. (1994). Empirical investigations of the hierarchical
structure of the taxonomy. In L. W. Anderson &L. A. Sosniak (Eds.), Bloom's taxon-
omy: A forty-year retrospective. Ninety-third yearbook of the National Society for the Study
of Education. Pan II (pp. 64-81). Chicago: University of Chicago Press.
LaDuca, A. (1994). Validation of a professional licensure examinations: Professions the-
ory, test design, and construct validity. Evaluation in the Health Professions, 17(2),
178-197.
LaDuca, A., Downing, S. M., &. Henzel, T. R. (1995). Test development: Systematic
item writing and test construction. In J. C. Impara &.J. C. Fortune (Eds.), Licensure
examinations: Purposes, procedures, and practices (pp. 117-148). Lincoln, NE: Buros
Institute of Mental Measurements.
LaDuca, A., Staples, W. I., Templeton, B., & Holzman, G. B. (1986). Item modelling
procedure for constructing content-equivalent multiple-choice questions. Medical
Education, 20, 53-56.
Landrum, R. E., Cashin, J. R., &Theis, K. S. (1993). More evidence in favor of three op-
tion multiple-choice tests. EducationalandPsychologicalMeasurement, 53,771-778.
Levine, M. V, & Drasgow, F. (1982). Appropriateness measurement: Review, critique,
and validating studies. British Journal of Educational Psychology, 35, 42-56.
Levine, M. V, & Drasgow, F. (1983). The relation between incorrect option choice and
estimated ability. Educational and Psychological Measurement, 43, 675-685.
Levine, M. V, & Drasgow, F. (1988). Optimal appropriateness measurement. Psycho-
metrika, 53, 161-176.
Levine, M. V, & Rubin, D. B. (1979). Measuring the appropriateness of multiple-choice
test scores. Journal of Educational Statistics, 4, 269-289.
Lewis, J. C., &. Hoover, H. D. (1981, April). The effect of pupil performance from using
hand-held calculators during standardized mathematics achievement tests. Paper pre-
sented at the annual meeting of the National Council on Measurement in Educa-
tion, Los Angeles.
Lindquist, E. F. (Ed.). (1951). Educational measurement (1st ed.). Washington, DC:
American Council on Education.
Linn, R. L. (Ed.). (1989). Educational measurement (3rd ed.). New York: American
Council on Education and Macmillan.
Linn, R. L. (2000). Assessments and accountability. Educational Researcher, 29(2), 4-16.
Linn, R. L., Baker, E. L., &Dunbar, S. B. (1991). Complex, performance-based assess-
ments: Expectations and validation criteria. Educational Researcher, 20, 15-21.
288 REFERENCES

Linn, R. L., & Gronlund, N. (2001). Measurement and assessment in teaching (7th ed.).
Columbus, OH: Merrill.
Lohman, D. F. (1993). Teaching and testing to develop fluid abilities. Educational Re-
searcher, 22, 12-23.
Lohman, D. F, &. Ippel, M. J. (1993). Cognitive diagnosis: From statistically-based as-
sessment toward theory-based assessment. In N. Frederikesen, R. J. Mislevy, & I.
Bejar (Eds.), Test theory for a new generation of tests (pp. 41-71). Hillsdale, NJ: Law-
rence Erlbaum Associates.
Lord, F. M. (1958). Some relations between Guttman's principal components of scale
analysis and other psychometric theory. Psychometrika, 23, 291-296.
Lord, F. M. (1977). Optimal number of choices per item—A comparison of four ap-
proaches. Journal of Educational Measurement, 14, 33-38.
Lord, F. M. (1980). Applications of item response theory to practical testing problems.
Hillsdale, NJ: Lawrence Erlbaum Associates.
Lord, F. M., &Novick, M. R. (1968). Statistical theories of mental test scores. Chicago:
McGraw-Hill.
Love, T. E. (1997). Distractor selection ratios. Psychometrika, 62(1), 51-62.
Loyd, B. H. (1991). Mathematics test performance: The effects of item type and calcula-
tor use. Applied Measurement in Education, 4, 11-22.
Luce, R. D. (1959). Individual choice behavior. New York: Wiley.
Lukhele, R., Thissen, D., & Wainer, H. (1993). On the relative value of multiple-choice,
constructed-response, and examinee-selected items on two achievement tests. Jour-
nal of Educational Measurement, 31(3), 234-250.
MacDonald, R. E. (1981). The dimensionality of tests and items. British Journal of Math-
ematical and Statistical Psychology, 34, 100-117.
MacDonald, R. E (1985). Factor analysis and related methods. Hillsdale, NJ: Lawrence
Erlbaum Associates.
MacDonald, R. E (1999). Test theory. Mahwah, NJ: Lawrence Erlbaum Associates.
Mager, R. F. (1962). Preparing instructional objectives. Palo Alto, CA: Fearon.
Maihoff, N. A., & Mehrens, W. A. (1985, April). A comparison of alternate-choice and
true-false item forms used in classroom examinations. Paper presented at the annual
meeting of the National Council on Measurement in Education, Chicago.
Martinez, M. E. (1990). A comparison of multiple-choice and constructed figural re-
sponse items. Journal of Educational Measurement, 28, 131-145.
Martinez, M. E. (1993). Cognitive processing requirements of constructed figural re-
sponse and multiple-choice items in architecture assessment. Applied Measurement
in Education, 6, 167-180.
Martinez, M. E. (1998, April). Cognition and the question of test item format. In T. M.
Haladyna (Chair), Construction versus choice: A research synthesis. Symposium con-
ducted at the annual meeting of the American Educational Research Association,
San Diego, CA.
Martinez, M. E., &Katz, I. R. (1996). Cognitive processing requirements of constructed
figural response and multiple-choice items in architecture assessment. Educational
Assessment, 3, 83-98.
Masters, G.N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47,149-174.
McMorris, R. F., Boothroyd, R. A., & Pietrangelo, D. J. (1997). Humor in educational
testing: A review and discussion. Applied Measurement in Education, 10,269-297.
Mehrens, W. A., & Kaminski, J. (1989). Methods for improving standardized test scores:
Fruitful, fruitless, or fraudulent? Educational Measurement: Issues and Practices, 8,14-22.
REFERENCES 289

Meijer, R. R. (1996). Person-fit research: An introduction. Applied Measurement in Edu-


cation, 9(1), 3-8.
Meijer, R. R., Molenaar, I. W., & Sijtsma, K. (1994). Influence of person and group char-
acteristics on nonparametrie appropriateness measurement. Applied Psychological
Measurement, 8, 111-120.
Meijer, R. R., Muijtjens, A. M. M. M., & van der Vleuten, C. R M. (1996). Nonpara-
metric person-fit research: Some theoretical issues and an empirical evaluation. Ap-
plied Measurement in Education, 9(1), 77-90.
Meijer, R. R., &Sijtsma, K. (1995). Detection of aberrant item score patterns: A review
of recent developments. Applied Measurement in Education, 8(3), 261-272.
Messick, S. (1975). The standard problem: Meaning and values in measurement and
evaluation. American Psychologist, 30, 955-966.
Messick, S. (1984). The psychology of educational measurement. Journal of Educational
Measurement, 21,215-23 7.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp.
13-104). New York: American Council on Education and Macmillan.
Messick, S. (1995 a). Validity of psychological assessment: Validation of inferences from
persons' responses and performances as scientific inquiry into score meaning. Ameri-
can Psychologist, 50, 741-749.
Messick, S. (1995b). Standards of validity and the validity of standards in performance
assessment. Educational Measurement: Issues and Practice, 14(4), 5-8.
Miller, W. G., Snowman, J., &. O'Hara, T. (1979). Application of alternative statistical
techniques to examine the hierarchical ordering in Bloom's taxonomy. American Ed-
ucational Research Journal, 16, 241-248.
Millman, J., &. Greene, J. (1989). The specification and development of tests of achieve-
ment and ability. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 335-366).
New York: American Council on Education and Macmillan.
Minnaert, A. (1999). Individual differences in text comprehension as a function of text
anxiety and prior knowledge. Psychological Reports, 84, 167-177.
Mislevy, R. J. (1993). Foundations of a new test theory. In N. Frederiksen, R. J. Mislevy,
& I. Bejar (Eds.), Test theory for a new generation of tests (pp. 19-39). Hillsdale, NJ:
Lawrence Erlbaum Associates.
Mislevy, R. J. (1996a). Some recent developments in assessing student learning. Princeton,
NJ: Center for Performance Assessment at the Educational Testing Service.
Mislevy, R. J. (1996b). Test theory reconceived. Journal of Educational Measurement, 33,
379-417.
Mislevy. R. J. (2003, April). Educational assessments as evidentiary arguments: What has
changed, and what hasn't? Paper presented at the invitational conference on infer-
ence, culture, and ordinary thinking in dispute resolution, Benjamin N. Cardozo
School of Law, New York.
Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (1999). Evidence-centered assessment de-
sign. Princeton, NJ: Educational Testing Service.
Mukerjee, D. R (1991). Testing reading comprehension: A comparative analysis of a
cloze test and a multiple-choice test. Indian Educational Review, 26, 44-55.
Muraki, E., & Bock, R. D. (2003). PARSCALE 4: IRT based test scoring and item analy-
sis for graded open-ended exercises and performance tests [Computer program].
Chicago: Scientific Software, Inc.
National Commission on Educational Excellence. (1983). A nation at risk. Washington,
DC: U.S. Government Printing Office.
290 REFERENCES

National Council of Teachers of Mathematics. (1989). Curriculum and evaluation stan-


dards for school mathematics. Reston, VA: Author.
National Council of Teachers of Mathematics. (2000). Principles and standards for school
mathematics. Reston, VA: Author.
Neisser, U. (Ed). (1998). The rising curve. Long term gains in IQ and related measures.
Washington, DC: American Psychological Association.
Nesi, H., &. Meara, R (1991). How using dictionaries affects performance in multiple-
choice ESL tests. Reading in a Foreign Language, 8(1), 631-643.
Nickerson, R. S. (1989). New directions in educational assessment. Educational Re-
searcher, 18, 3-7.
Nield, A. E, & Wintre, M. G. (2002). Multiple-choice questions with an option to com-
ment: Student attitude and use. In R. A. Griggs (Ed.), Handbook for teaching introduc-
tory psychology (Vol. 3, pp. 95-99). Mahwah, NJ: Lawrence Erlbaum Associates.
Nishisato, S. (1980). Analysis of categorical data: Dual scaling and its applications. Toronto,
Canada: University of Toronto.
Nitko, A. J. (1985). Review of Roid and Haladyna's "A technology for test item writing."
Journal of Educational Measurement, 21, 201-204.
Nitko, A. ]. (1989). Designing tests that are integrated with instruction. In R. L. Linn
(Ed.), Educational measurement (3rd ed., pp. 447-474). New York: American Coun-
cil on Education and Macmillan
Nitko, A. J. (2001). Educational assessment of students. Upper Saddle River, NJ: Merrill
Prentice Hall.
Nolen, S. B., Haladyna, T. M., & Haas, N. S. (1992). Uses and abuses of achievement
test scores. Educational Measurement: Issues and Practices, 11, 9—15.
Norris, S. R (1990). Effects of eliciting verbal reports of thinking on critical thinking test
performance. Journal of Educational Measurement, 27, 41-58.
Nunnally, J. C. (1967). Psychometric theory. New York: McGraw-Hill.
Nunnally, J. C. (1977). Psychometric theory (2nd ed.). New York: McGraw-Hill.
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York:
McGraw-Hill.
O'Dell, C. W. (1928). Traditional examinations and new type tests. New York: Century.
O'Neill, K. (1986, April). The effect of stylistic changes on item performance. Paper pre-
sented at the annual meeting of the American Educational Research Association,
San Francisco.
Oosterhof, A. C., &Glasnapp, D. R. (1974). Comparative reliabilities and difficulties of the
multiple-choice and true-false formats. Journal qf Experimental Education, 42, 62-64.
Page, G., &Bordage, G., & Allen, T. (1995). Developing key-features problems and exam-
ination to assess clinical decison-making skills. Academic Medicine, 70(3), 194-201.
Paris, S. G., Lawton, T. A., Turner, J. C., & Roth, J. L. (1991). A developmental perspec-
tive on standardized achievement testing. Educational Researcher, 20, 2-7
Patterson, D. G. (1926). Do new and old type examinations measure different mental
functions? School and Society, 24, 246-248.
Perkhounkova, E. (2002) Modeling the dimensions of language achievement. Disserta-
tion Abstracts International Section A: Humanities and Social-Sciences, 62 (12-A) ,4137
Peterson, C. C., & Peterson, J. L. (1976). Linguistic determinants of the difficulty of
true-false test items. Educational and Psychological Measurement, 36, 161-164.
Phelps, R. R (1998). The demand for standardized testing. Educational Measurement: Is-
sues and Practice, 17(3), 5-19.
REFERENCES 291

Phelps, R. P. (2000). Trends in large-scale testing outside the United States. Educational
Measurement: Issues and Practice, 19(1), 11-21.
Pinglia, R. S. (1994). A psychometric study of true-false, alternate-choice, and multi-
ple-choice item formats. Indian Psychological Review, 42(1-2), 21—26.
Poe, N., Johnson, S., & Barkanic, G. (1992, April). A reassessment of the effect of calculator
use in the performance of students taking a test of mathematics applications. Paper pre-
sented at the annual meeting of the National Council on Measurement in Educa-
tion, San Francisco.
Pomplun, M., &Omar, H. (1997). Multiple-mark items: An alternative objective item
format? Educational and Psychological Measurement, 57, 949-962.
Popham, W. J. (1993). Appropriate expectations for content judgments regarding
teacher licensure tests. Applied Measurement in Education, 5, 285-301.
Prawat, R. S. (1993). The value of ideas: Problems versus possibilities in learning. Educa-
tional Researcher, 22, 5-16.
Ramsey, R A. (1993). Sensitivity reviews: The ETS experience as a case study. In P W.
Holland & H. Wainer (Eds.), Differential item functioning (pp. 367-388). Hillsdale,
NJ: Lawrence Erlbaum Associates.
Raymond, M. (2001). Job analysis and the specification of content for licensure and cer-
tification examinations. Applied Measurement in Education, 14(4), 369-415.
Reckase, M. D. (2000, April). The minimum sample size needed to calibrate items using the
three-parameter logistic model. Paper presented at the annual meeting of the American
Educational Research Association, New Orleans, LA.
Richardson, M., & Kuder, G. E (1933). Making a rating scale that measures. Personnel
Journal, 12, 36-40.
Richichi, R. V, (1996), An analysis of test bank multiple-choice items using item response the-
ory. ERIC Document 405367.
Roberts, D. M. (1993). An empirical study on the nature of trick questions. Journal of Ed-
ucational Measurement, 30, 331-344.
Rodriguez, M. (2002). Choosing an item format. In G. Tindal & T. M. Haladyna
(Eds.), Large-scale assessment programs for all students: Validity, technical ade-
quacy, and implementation issues (pp. 211-229). Mahwah, NJ: Lawrence
Erlbaum Associates.
Rodriguez, M. C. (2003). Construct equivalence of multiple-choice and constructed-re-
sponse items: A random effects synthesis of correlations. Journal of Educational Mea-
surement, 40(2), 163-184.
Rogers, W. T, & Harley, D. (1999). An empirical comparison of three-choice and four-
choice items and tests: Susceptibility to testwiseness and internal consistency reli-
ability. Educational and Psychological Measurement, 59(2), 234-247.
Roid, G. H. (1994). Patterns of writing skills derived from cluster analysis of direct-writ-
ing assessments. Applied Measurement in Education, 7, 159-170.
Roid, G. H., & Haladyna, T. M. (1982). Toward a technology of test-item writing. New York:
Academic Press.
Rosenbaum, R R. (1988). Item bundles. Psychometrika, 53, 63-75.
Rothstein, R. (2002 Sept. 18). How U. S. punishes states with higher standards. The
New York Times, http://www.nytimes.com/2002/09/18
Rovinelli, R. J., & Hambleton, R. K. (1977). On the use of content specialists in the as-
sessment of criterion-referenced test item validity. Dutch Journal of Educational Re-
search, 2, 49-60.
292 REFERENCES

Royer, J. M., Cisero, C. A., & Carlo, M. S. (1993). Techniques and procedures for assess-
ing cognitive skills. Review of Educational Research, 63, 201-243.
Ruch, G. M. (1929). The objective or new type examination. New York: Scott Foresman.
Ruch, G. M., & Charles, J. W. (1928). A comparison of five types of objective tests in ele-
mentary psychology. Journal of Applied. Psychology, 12, 398-403.
Ruch, G. M., &Stoddard, G. D. (1925). Comparative reliabilities of objective examina-
tions. Journal of Educational Psychology, 12, 89-103.
Rudner, L. M., Bracey, G., & Skaggs, G. (1996). The use of person-fit statistics with one
high-quality achievement test. Applied Measurement in Education, 9(1), 91-109.
Ryan, J. M., & DeMark, S. (2002) .Variation in achievment test scores related to gender,
item format, and content area tests. InG. Tindal &.T. M Haladyna (Eds.), Large-scale
assessment programs for all students: Validity, technical adequacy, and implementation (pp.
67-88). Mahwah, NJ: Lawrence Erlbaum Associates.
Samejima, F. (1979). A new family of models for the multipk-choice item (Office of Naval
Research Report 79-4). Knoxville: University of Tennessee.
Samejima, F. (1994). Non parametric estimation of the plausibility functions of the
distractors of vocabulary test items. Applied Psychological Measurement, 18(1),
35-51.
Sanders, N. M. (1966). Classroom questions. What kinds? New York: Harper & Row.
Sato, T. (1975). The construction and interpretation ofS-P tabks. Tokyo: Meijii Tosho.
Sato, T. (1980). The S-P chart and the caution index. Computer and communications systems
research laboratories. Tokyo: Nippon Electronic.
Sax, G., & Reiter, E B. (n.d.). Reliability and validity of two-option multiple-choice and com-
parably written true-false items. Seattle: University of Washington.
Schultz, K. S. (1995). Increasing alpha reliabilities of multiple-choice tests with linear
polytomous scoring. Psychological Reports, 77, 760-762.
Seddon, G. M. (1978). The properties of Bloom's taxonomy of educational objectives for
the cognitive domain. Review of Educational Research, 48, 303-323.
Serlin, R., & Kaiser, H. F. (1978). A method for increasing the reliability of a short multi-
ple-choice test. Educational and Psychological Measurement, 38, 337-340.
Shahabi, S., & Yang, L. (1990, April). A comparison between two variations of multiple-
choice items and their effects on difficulty and discrimination values. Paper presented at
the annual meeting of the National Council on Measurement in Education,
Boston.
Shapiro, M. M., Stutsky, M. H., & Watt, R. F. (1989). Minimizing unnecessary differ-
ences in occupational testing. Valparaiso Law Review, 23, 213-265.
Shea, J. A., Poniatowski, E A., Day, S. C., Langdon, L. O., LaDuca, A., &Norcini, J. J.
(1992). An adaptation of item modeling for developing test-item banks. Teaching and
Learning in Medicine, 4, 19-24.
Shealy, R., & Stout, W. F. (1996). A model-based standardization approach that sepa-
rates true bias/DIF from group differences and detects bias/DIF as well as item bias/
DIE Psychometrika, 58, 159-194.
Shepard, L. A. (1991). Psychometrician's beliefs about learning. Educational Researcher,
20, 2-9.
Shepard, L. A. (1993). The place of testing reform in educational reform—A reply to
Cizek. Educational Researcher, 22, 10-13.
Shepard, L.A. (2000). "The role of assessment in a learning culture." Educational Re-
searcher, 29(7), 4-14.
REFERENCES 293

Shonkoff, J., & Phillips, D. (Eds.)- (2000). The science of early childhood development.
Washington, DC: National Research Council Institute of Medicine, National Acad-
emy Press.
Simon, H. A. (1973). The structure of ill-structured problems. Artificial Intelligence, 4,
181-201.
Sireci, S. G., Thissen, D., & Wainer, H. (1991). On the reliability of testlet-based tests.
Journal of Educational Measurement, 28, 237-247.
Skakun, E. N., & Gartner, D. (1990, April). The use of deadly, dangerous, and ordinary
items on an emergency medical technicians-ambulance registration examination. Paper
presented at the annual meeting of the American Educational Research Association,
Boston.
Skakun, E. N., &Maguire, T. (2000, April). What do think aloud procedures tell us about
medical students' reasoning on multipk-choice and equivalent construct-response items?
Paper presented at the annual meeting of the National Council on Measurement in
Education, New Orleans, LA.
Skakun, E. N., Maguire, T., & Cook, D. A. (1994). Strategy choices in multiple-choice
items. Academic Medicine Supplement, 69(10), S7-S9.
Slogoff, S., &Hughes, F. P (1987). Validity of scoring "dangerous answers" on a written
certification examination. Journal of Medical Education, 62, 625-631.
Smith, R. M. (1986, April). Developing vocabulary items to fit a polychotomous scoring
model. Paper presented at the annual meeting of the American Educational Research
Association, San Francisco.
Smith, R. M., & Kramer, G. A. (1990, April). An investigation of components influencing
the difficulty of form-development items. Paper presented at the annual meeting of the
National Council on Measurement in Education, Boston.
Snow, R. E. (1989). Toward assessment of cognitive and conative structures in learning.
Educational Researcher, 18, 8-14-
Snow, R. E. (1993). Construct validity and constructed-response tests. In R. E. Bennett
& W. C. Ward (Eds.), Construction versus choice in cognitive measurement: Issues in con-
structed response, performance testing, and portfolio assessment (pp. 45-60). Hillsdale,
NJ: Lawrence Erlbaum Associates.
Snow, R. E., & Lohman, D. F. (1989). Implications of cognitive psychology for educa-
tional measurement. In R. L. Linn (Ed.), Educational measurement (3rd ed.,
263-332). New York: American Council on Education and MacMillan.
Statman, S. (1988). Ask a clear question and get a clear answer: An inquiry into the
question/answer and the sentence completion formats of multiple-choice items. Sys-
tem, 16, 367-376.
Sternberg, R. J. (1985). Beyond IQ: A triarchic theory of human intelligence. New York:
Cambridge University Press.
Sternberg, R. J. (1998). Abilities are forms of developing expertise. Educational Re-
searcher, 27(3), 11-20
Stiggins, R. J., Griswold, M. M., & Wikelund, K. R. (1989). Measuring thinking skills
through classroom assessment. Journal of Educational Measurement, 26, 233-246.
Stout, W, Nandakumar, R., Junker, B., Chang, H., &Steidinger, D. (1993). DIMTEST:
A FORTRAN program for assessing dimensionality of binary item responses. Applied
Psychological Measurement, 16,236.
Stout, W, &Roussos, L. (1995). SIBTESTmanual (2nd ed.). Unpublished manuscript.
Urbana-Champaign: University of Illinois.
294 REFERENCES

Subhiyah, R. G., & Downing, S. M. (1993, April). K-typeandA-typeitems: IRTcomparisons


of psychometric characteristics in a certification examination. Paper presented at the annual
meeting of the National Council on Measurement in Education, Atlanta, GA.
Sympson, J. B. (1983, August). A new item response theory model for calibrating multi-
ple-choice items. Paper presented at the annual meeting of the Psychometric Society,
Los Angeles.
Sympson, J. B. (1986, April). Extracting information from wrong answers in computer-
ized adaptive testing. In New developments in computerized adaptive testing. Symposium
conducted at the annual meeting of the American Psychological Association, Wash-
ington, DC.
Tamir, R (1993). Positive and negative multiple-choice items: How different are they?
Studies in Educational Evaluation, 19, 311-325.
Tate, R. (2002). Test dimensionality. In G. Tindal &T. M. Haladyna (Eds.), Large-scale
assessment programs for all students: Validity, technical adequacy, and implementation (pp.
180-211). Mahwah, NJ: Lawrence Erlbaum Associates.
Tatsuoka, K. K. (1985). Rule space: An approach for dealing with misconceptions based
on item response theory. Journal of Educational Measurement, 20, 345-354.
Tatsuoka, K. K. (1990). Toward an integration of item response theory and cognitive er-
ror diagnosis. In N. Frederiksen, R. Glaser, A. Lesgold, &M. G. Shafto (Eds.), Diag-
nostic monitoring of skill and knowledge acquisition (pp. 453-488). Hillsdale, NJ:
Lawrence Erlbaum Associates.
Tatsuoka, K. K., &Linn, R. L. (1983). Indices for detecting unusual patterns: Links be-
tween two general approaches and potential applications. Applied Psychological Mea-
surement, 7, 81-96.
Technical Staff. (1933). Manual of examination methods (1st ed.). Chicago: University of
Chicago, The Board of Examinations.
Technical Staff. (1937). Manual of examination methods (2nd ed.). Chicago: University of
Chicago, The Board of Examinations.
Terman, L. M., &Oden, M. (1959). The gifted group at mid-life. Stanford, CA: Stanford
University Press.
Thissen, D. M. (1976). Information in wrong responses to the Raven Progressive Ma-
trices. Journal of Educational Measurement, 14, 201-214.
Thissen, D., & Steinberg, L. (1984). A response model for multiple-choice items.
Psychometrika, 49, 501-519.
Thissen, D., Steinberg, L., & Fitzpatrick, A. R. (1989). Multiple-choice models: The
distractors are also part of the item. Journal of Educational Measurement, 26,
161-175.
Thissen, D., Steinberg, L., &Mooney, J. A. (1989). Trace lines for testlets: A use of mul-
tiple-categorical-response models. Journal of Educational Measurement, 26,247-260.
Thissen, D., & Wainer, H. (Eds.). (2001). Test scoring. Mahwah, NJ: Lawrence Erlbaum
Associates.
Thissen, D., Wainer, H., & Wang, X. (1994). Are tests comprising both multiple-choice
and free-response items necessarily less unidimensional than multiple-choice tests?
An analysis of two tests. Journal of Educational Measurement, 31(2), 113-123.
Thorndike, R. L. (1967). The analysis and selection of test items. In S. Messick & D.
Jackson (Eds.), Problems in human assessment. New York: McGraw-Hill.
Thorndike, R. L. (Ed.). (1970). Educational measurement (2nd ed.). Washington, DC:
American Council on Education.
Thurstone, L. L. (1938). Primary mental abilities. Chicago: University of Chicago Press. (Reprinted in 1968 by the Psychometric Society)
Tiegs, E. W. (1931). Tests and measurement for teachers. New York: Houghton Mifflin.
Traub, R. E. (1993). On the equivalence of traits assessed by multiple-choice and con-
structed-response tests. In R. E. Bennett & W. C. Ward (Eds.), Construction versus
choice in cognitive measurement: Issues in constructed response, performance testing, and
portfolio assessment (pp. 1-27). Hillsdale, NJ: Lawrence Erlbaum Associates.
Traub, R. E., & Fisher, C. W. (1977). On the equivalence of constructed response and multiple-choice tests. Applied Psychological Measurement, 1, 355-370.
Trevisan, M. S., Sax, G., & Michael, W. B. (1991). The effects of the number of options
per item and student ability on test validity and reliability. Educational and Psychologi-
cal Measurement, 51, 829-837.
Trevisan, M. S., Sax, G., & Michael, W. B. (1994). Estimating the optimum number of
options per item using an incremental option paradigm. Educational and Psychological
Measurement, 54, 86-91.
Tsai, C.-C., & Chou, C. (2002). Diagnosing students' alternative conceptions in sci-
ence. Journal of Computer Assisted Learning, 18, 157-165.
van Batenburg, T. A., & Laros, J. A. (2002). Graphical analysis of test items. Educational
Research and Evaluation, 8(3), 319-333.
van den Bergh, H. (1990). On the construct validity of multiple-choice items for reading
comprehension. Applied Psychological Measurement, 14(1), 1-12.
Van der Flier, H. (1982). Deviant response patterns and comparability of test scores.
Journal of Cross-Cultural Psychology, 13, 267-298.
Wainer, H. (1989). The future of item analysis. Journal of Educational Measurement, 26,
191-208.
Wainer, H. (2002). On the automatic generation of test items: Some whens, whys, and hows. In S. H. Irvine & P. C. Kyllonen (Eds.), Item generation for test development (pp. 287-305). Mahwah, NJ: Lawrence Erlbaum Associates.
Wainer, H., & Kiely, G. (1987). Item clusters and computerized adaptive testing: A case for testlets. Journal of Educational Measurement, 24, 185-202.
Wainer, H., & Thissen, D. (1993). Combining multiple-choice and constructed re-
sponse test scores: Toward a Marxist theory of test construction. Applied Measure-
ment in Education, 6, 103-118.
Wainer, H., & Thissen, D. (1994). On examinee choice in educational testing. Review of
Educational Research, 64, 159-195.
Wang, W. (1998). Rasch analysis of distractors in multiple-choice items. Journal of Outcome Measurement, 2(1), 43-65.
Wang, W. (2000). Factorial modeling of differential distractor functioning in multiple-choice items. Journal of Applied Measurement, 1(3), 238-256.
Washington, W. N., & Godfrey, R. R. (1974). The effectiveness of illustrated items. Journal of Educational Measurement, 11, 121-124.
Webb, L. C., & Heck, W. L. (1991, April). The effect of stylistic editing on item performance. Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago.
What works. (1985). Washington, DC: United States Office of Education.
Wiggins, G. (1989). Teaching to the (authentic) test. Educational Leadership, 76, 41-47.
Wightman, L. F. (1998). An examination of sex differences in LSAT scores from the perspective of social consequences. Applied Measurement in Education, 11(3), 255-278.
Williams, B. J., & Ebel, R. L. (1957). The effect of varying the number of alternatives per
item on multiple-choice vocabulary test items. In The 14th yearbook of the National
Council on Measurement in Education (pp. 63-65). Washington, DC: National Coun-
cil on Measurement in Education.
Wilson, M. R. (1989). Saltus: A psychometric model of discontinuity in cognitive devel-
opment. Psychological Bulletin, 105, 276-289.
Winne, P. H. (1979). Experiments relating teachers' use of higher cognitive questions to
student achievement. Review of Educational Research, 49, 13-50.
Wolf, L. F., & Smith, J. K. (1995). The consequence of consequence: Motivation, anxiety, and test performance. Applied Measurement in Education, 8(3), 227-242.
Wright, B. D. (1977). Solving measurement problems with the Rasch model. Journal of
Educational Measurement, 14, 97-116.
Zimowski, M. F., Muraki, E., Mislevy, R. J., & Bock, R. D. (2003). BILOG-MG 3: Item analysis and test scoring with binary logistic models [Computer program]. Chicago: Scientific Software.
Zoref, L., & Williams, P. (1980). A look at content bias in IQ tests. Journal of Educational Measurement, 17, 313-322.
Author Index

A
Beller, M., 54, 57, 62
Bennett, R. E., 48, 50, 60, 62, 260
Abedi, J., 95, 106, 241 Berger, A. L., 60
Adams, R., 204 Bernstein, I. H., 203, 247
Alagumalai, S., 234 Bertrand, R., 165
Albanese, M. A., 80, 82, 84, 267 Breland, H. M., 52, 56
Algina, J., 203 Brennan, R. L., 203, 265
Almond, R. G., 267 Bloom, B. S., 21
Allen, T., 165 Bock, R.D., 221, 252, 255
Allison,]., 261 Bonner, M. W., 56
Anderson, L., 21 Boothroyd, R. A., 121
Anderson, J. R., 21, 264 Bordage, G., 165, 166,168
Anderson, L. W., 22 Bormuth, J. R., 97, 150, 274
Andres, A.M., 112 Bridgeman, B., 92
Andrich, D., 204, 222, 226, 255 Bracey, G., 246
Angoff, W. H., 238 Brant, R., 111
Ansley, T. K, 92 Braswell, J., 92
Attali,Y, 113, 223 Bruno, J. E., 112
Burmester, M. A., 76
B
C
Bacon, T E, 206
Baker, E. L, 53, 95 Calahan, C., 57
Baker, K. K., 26 Camilli, D., 232
Bar-Hillel, M, 112 Campbell, D. R., 249
Baranowski, R. A., 81 Campbell, J. R., 60
Barkanic, G., 92 Cannell, J. J., 238, 261
Bauer, R, 132 Carlo, C. A., 20
Baxter, G. E, 27 Carretier, H., 165
Becker, D. E, 79, 85, 123 Carroll, J. B., 262
Becker, B. J., 240 Case, S. M., 16, 17, 73, 75, 81, 85, 148
Bejar, I., 21, 265, 270 Case, S. M., 16, 17, 73, 75, 81, 85, 148
Bellezza, E S., 238 Chan, K. M., 70
Bellezza, S. E, 238 Chang, H., 252
Charles, J. W., 76 Embretsen, S. E, 203, 265
Chou, C, 144, 145 Engelhard, G., Jr., 54, 57
Cisero, C. A., 20 Engelhart, M. D., 21
Cizek,G.J., 105, 238 Enright, M. K., 208
Clauser, B. E., 234 Eurich, A. C., 47, 63
Coffman, W. E., 47
Cody, R. E, 238
Cohen, A. S., 92, 224 F
Cohen, J., 224
Cole, N. S., 264 Fajardo, L. L., 70
Cook, D. A., 59 Farr, R., 59
Coombs, C. H., 61 Fenderson, B. A., 70
Crislip, M., 54, 62 Fischer, G. H., 265
Crocker, L, 190, 203 Fisher, C. W., 47, 63
Cronbach, L. J., vii, 10, 12, 84, 265, 266 Fiske, D. W., 249
Cox, R. C, 215 Fitzpatrick, A. R., 189, 218, 222, 254, 272
Forster, E, 206
Forsyth, R. A., 92
D Fraenkel, T., 223
Frary, R. B., 117, 238
Damjamov, I., 70 Frederiksen, N., ix, 11, 21, 42, 62, 260,
Danneman, M., 59 265, 274
Danos, D. O., 56 Frisbie, D. A., 26, 78, 79, 81, 82, 84, 123,
Dawson-Saunders, B., 81, 106 267
DeAyala, R. J., 218,237 Fuhrman, M., 116
DeCorte, E., 117 Furst, E. J., 21
del Castillo, J. D., 112
De Gruijter, D. N. M., 224
DeMars, C. E., 54, 57 G
DeMark, S., 54, 56, 57, 240
Dibello, L. V, 262, 263 Gafni, N., 54, 57, 62
Dirkzwager, A., 112 Gagne, R., 20
Dobson, C., 22 Gallagher, A., 57
Dochy,E, 117 Gardner, H., 263
Dodd, D. K., 104,198, 199 Gardner, J., 7
Doran, M. L, 111 Garner, B. A., 192
Dorans,N.J., 231,233 Garner, M., 54, 57
Downing, S. M., vii, 14, 49, 60, 69, 73, 75, Gartner, D., 95
76, 77, 78, 80, 81, 82, 97, 98, Gaynor, J., 52
112, 117,160,181,183,186, Gitomer, D. H., 21,29
187, 188, 218, 225, 227, 237, Glaser, R., 27
253, 266, 267, 271, 274 Glasnapp, D. R., 78
Drasgow, E, 76, 218, 226, 237, 243, 254, 255 Gleser, G. C., 265
Dressel, E L., 84 Godfrey, R. R., 94
Druva, C. A., 82, 84 Godshalk, F. I., 47, 63
Dunbar, S. B., 53 Goleman, D., 8, 37
Green, K. E., 208
Greene, J., 272
E Griswold, M. M., 26
Gronlund, N., 22, 72
Ebel, R. L., 75, 76, 78, 79, 266 Gross, L. J., 69
Grosse, M., 77, 78, 84 Hurd, A. W., 47, 63
Grosso, L. J., 81
Guilford, J. E, 7 I
Gulliksen, K, 203, 253
Guttman, L., 254 Impara, J. C., 218
Ippel.M.J., 271
H Irvine, S. H., 149,180, 260, 270, 272
Haas, N. S, 9, 238, 261
Hack, R., 62 J
Haertel, E., 265
Haladyna, T. M., vii, 9,11, 12, 14, 53, 54, Johnson, B. R., 70
60, 65, 69, 71, 73, 75, 76, 77, 78, Johnson, S., 92
80,85,98,111,112,116,117, Jones, R. W., 265
124,125, 149, 150, 151, 157, Jozefowicz, R. E, 97
170, 171, 183 181,186, 187, 205, Junker, B., 252
215, 216, 225, 227, 236, 237,
238, 247, 253, 254, 255, 260, K
266,267,270,271,274
Hambleton, R. K., 170, 189, 190, 191, 203, Kahn, H. G., 56
212, 218, 247, 265 Kaiser, H. E, 254
Hamilton, L. S., 55 KaminskiJ., 238, 261
Hancock, G. R., 76 Kane.M.T., 12, 186, 189
Hannon, B., 59 Katz, I. R., 58, 59, 60,87, 97
Harasym, EH., Ill Kazemi, E., 60
Harley.D., 112 Keeling, B., 95
Hatala, R., 165, 169 Keeves, J. E, 234
Hatch, T., 263 Kent, T. A., 82
Harvey, A., 92 Kiely, G., 85, 274
Hattie,J.A,214, 247 Kim, S., 92
Haynie, W. J., 62 Komrey, J. D., 206
Heck, R., 54 Knowles, S. L., 116
Heck, W. L., 105 Kramer, G. A., 208, 247, 253
Henrysson, S., 224, 273 Krathwohl, D. R., 21
Henzel, T. R., 160 Kreitzer, A. E., 22
Herbig, M., 215 Kubota, M. Y., 56
Hibbison, E. E, 58, 200 Kuder, G. E, 254
Hill, G. C., 82 Kyllonen, E C., 149, 180, 260, 270, 272
Hill, K., 69, 237
Hill, W. H., 21 L
Hofstetter, C., 95
Holland, E W., 231, 233 LaDuca, A., 159,164, 165, 170
Holtzman, K., 16, 97 Landrum, R. E., 112
Holzman, G. B., 160 Laros, J. A., 222
Hoover, H. D., 92 Lautenschlager, G. J., 59, 87, 98
House, E. R., 260 Lawson, T. A., 239
Hsu, L. M., 78 Leal, L, 104, 198, 199
Hubbard, J. E, 80 Levin, J., 57
Huff, K. L., 93 Levine, M. V, 218, 226, 237, 243, 254
Hughes, D. C., 95 Lewis, J. C., 92
Hughes, F. E, 95 Lindquist, E. E, 266, 272
Linn, R. L., 22, 53, 57, 239, 244, 245, 265, Nesi, H., 94
266, 272 Nickerson, R. S., 26, 159
Llabre, M., 190 Nield, A. E, 199
Lohman, D. E, 26, 35, 36, 37, 262, 263, 271 Nishisato, S., 218
Lord, C., 95 Nitko, A. J., vii, 72, 164, 266
Lord, P.M., 76, 112, 203, 212 Nolen, S. B., 9, 238, 261
Lorscheider, E L., Ill Norcini, J. J., 49, 81
Love, T. E., 225 Norman, G. R., 165, 169
Loyd, B. K, 92 Norris, S. E, 199
Luce, R. D., 226 Novick, M. R., 203
Lukhele, R., 50 Nungester, R. J., 81
Luo, G., 204, 226 Nunnally, J. C., 203, 211, 213, 247
Lyne, A., 204
O
M
O'Dell, C. W, 47, 63
MacDonald, R. E, 203, 247 O'Hara, T., 22
Madaus, G. E, 22 O'Neill, K., 105
Mager, R. E, 185 Oden, M., 7
Maguire, T., 59, 60, 200 Olson, L. A., 76
Maihoff, N. A., 76 Omar, H., 84
Martinez, M. E., 48, 49, 50, 58, 59 Oosterhof, A. C., 78
Masters, G. N., 255 Osborn Popp, S., 236
Mazor, K. M., 234
McMorris, R. E, 121
Mead, A. D., 226
Meara, R, 94
Mehrens, W. A., 76, 238, 261 Page, G., 165, 166, 168,169
Meijer, R. R., 237, 243, 244, 245, 246 Paris, S. G., 239, 240
Messick, S., 9, 11, 12, 14, 26, 35, 185, 186, Patterson, D. G., 47
188, 189, 190, 246, 247, 261, 266 Perkhounkova, E., 53, 132
Michael, W.B., 112 Peterson, C. C., 78
Miller, M. D., 190 Peterson, J. L., 78
Miller, W. G., 22 Phelps, R., ix
Millman, J., 272 Phillips, D., 7
Minnaert, A., 61 Pietrangelo, D. J., 121
Miranda, D. U, 26 Pinglia, R. S., 79
Mislevy, R. J., 21, 25, 51, 52, 61, 159, 204, Plake, B. S., 218
262, 265, 267, 268, 269, 274 Poe, N., 92
Mokerke, G., 117 Pomplum, M., 84
Molenaar, I. W., 245 Popham, W. J., 188, 189
Mooney, J. A., 85, 254 Potenza, M. T, 231
Muijtjens, A. M. M. M., 237, 245 Prawat, R. S., 160
Mukerjee, D. R, 61 Pritchard, R., 59

N R
Nanda, H., 265 Rajaratnam, N., 265
Nandakumar, R., 252, 265 Ramsey, R A., 193, 234
Neisser, U., 7 Raymond, M., 186, 248
Reckase, M. D., 206 Simon, H. A., 149
Reise, S. E, 203 Sireci, S. G., 85, 93
Reiter,RB., 76, 112 Skaggs, G., 246
Richardson, M., 254 Skakun, E. K, 59, 60, 61, 95, 200
Richichi, A., 97 Slogoff, S., 95
Ripkey, D. R., 16, 75, 97 Smith, J. K., 240
Roberts, D. M., 103, 104 Smith, R. M., 208
Robeson, M. R., 70 Smitten, B., 59
Rock, D. A., 21,29, 50 Snow, R. E., 26, 36, 47, 93, 159, 262, 263
Rodriguez, M. R., vii, 49, 50, 59, 60, 69, Snowman, J., 22
111,112,116 Sosniak, L. A., 22
Rogers,]., 170,265 Spratt, K. E., 92
Rogers, W. T., 112,203 Staples, W. L, 160
Roid, G. H., vii, 10,149,150, 170, 215, Statman, S., 69
216, 260, 270 Steidinger, D., 252
Roussos, L. A., 233, 262 Steinberg, L, 85, 218, 222, 254, 267, 272
Rosenbaum, E R., 85 Sternberg, R. J., 7, 26,35, 263
Roth, J. C., 239 Stiggins, R. J., 26
Rothstein, R., 7 Stoddard, G. D., 76
Rovinelli, R. J., 189, 190 Stout, W. E, 233, 252, 262
Royer, J. M., 20, 264, 268 Stutsky, M. H., 232
Rubin, D. B., 243 Styles, L, 226
Rubin, E., 70 Subhiyah, R. G., 87
Ruch, G. M., 47, 63, 76 Swaminathan, H.,170, 203, 265
Rudner, L. M., 246 Swanson, D. B., 16, 73, 75, 85
Ryan, J. M., 54, 56, 57, 240 Sweeney, D. C., 82
Swineford, E., 47
Sympson,]. B., 205, 217, 218, 253, 254, 255
S
Sabers, D. L., 84 T
Samejima, E, 226, 227, 255
Sanders, N. M., 22 Tamir, E, 111
Sato, T., 244 Tate, R., 214, 247, 249, 252, 253
Sax.G., 76, 112 Tatsuoka, K. K., 244, 245, 262, 265, 271
Schilling, S. G., 204 Templeton, B., 160
Schmid, R, 84 Terman, L. M., 7
Schultz, K. S., 255 Thayer, D. T., 233
Seddon, G. M., 22 Theis, K. S., 112
Segers, M., 117 Thiede, K. W., 76
Serlin, R., 254 Thissen, D. M., 50, 52, 85, 170, 217, 218,
Shahabi, S., 81 222, 233, 247, 254, 272
Shapiro, M. M., 232 Thorndike, R. L., 3, 65, 266, 272
Shea, J. A., 160 Thurstone, L. L., 7
Shealy, R., 233 Tiegs, E. W., 47, 63
Sheehan, K, M., 208 Togolini, J., 226
Shepard, L. A., viii, ix, 42, 159, 232 Traub, R. E., 47, 62, 63
Sheridan, B., 204, 226 Trevisan, M. S., 112
Shindoll, R. R., 151, 157 Tsai, C.-C., 144, 145
Shonkoff, J., 7 Tsien, S., 226
Sijtsma, K., 245, 246 Turner, J. C., 239
V
Wightman, L. F., 54, 57
Wikelund, K. R., 20
van Batenburg, T. A., 222 Wiley, D. E., 265
van den Bergh, H., 59 Williams, B., 226
Van der Flier, H., 245 Williams, B. J., 76
van der Vleuten, C. R M., 237, 245 Williams, E. A., 243
Vargas,]., 215 Williams, E, 193, 194
Veloski, J. J., 70 Wilson, M. R., 204
Winne, E H., 264
W Wintre, M. G, 199
Wolf, L. E, 240
Wood, R., D., 204
Wainer, H., 50, 52, 85, 150, 170, 180, 217,
Woods, G. T., 82
222,231,233,247,254,273,274
Wang, M. D., 50 Wright, B. D., 77, 78, 84, 237, 240
Wang, W, 226 Wu, M., 204
Wang, X., 50, 226, 227
Ward, W. C., 260 Y
Washington, W. N., 94
Watt, R. E, 232 Yang, L., 81
Webb, L. C., 105
Weiss, M., 236 Z
Welch, C. A., 116
Whitney, D. R., 82, 265 Zickar, M. J., 237
Wigfield, A., 69, 237 Zimowski, M. E, 204, 237
Wiggins, G., 67 Zoref, L., 193, 194
Subject Index

A
methods, 249-253
Distractor evaluation, 218-228, 272-273
Abilities (cognitive, developing, fluid,
learned), 6-7, 8, 35-40 E
Achievement, 6-7
A/1 of the above option, 117 Educational Testing Service, 193, 195
American Educational Research Associa- Emotional intelligence, 8
tion (AERA), x, 10, 15, 25, 62, Editing items, 105-106
94, 183, 185, 234, 241, 247, 261
American Psychological Association, x
Answer justification, 197-199 Future of item development, 259-272
Appropriateness measurement, 243 factors affecting, 259-265
Assessment Systems Corporation, 204 new theories, 267-272
C Future of item-response validation,
272-275
Calculators, 91-93 G
Clang associations, 118
Clues to answers, 117-120 Generic item sets, 170-176
Cognition, 19 definition, 170
Cognitive demand (process), 19, 25 evaluation, 176
cognitive taxonomies, 20-25, 28-40 generic scenario, 171-175
construct-centered measurement, 26-27 Guessing, 217
Construct definition, 5-6
Constructed-response (CR) item formats, H
42-47
Converting constructed response to multiple - Higher level thinking, 35-40
choice, 176-177,178,179,180 examples of multiple-choice items,
137-147
D Humor in items, 121
Differential item functioning, 231-234 I
Dimensionality, 213-214, 246-253
defining, 246-247 Instructional sensitivity, 205, 214-217
Intelligence (scholastic aptitude, mental ability), 7, 8
J
Item bias, 231-234 Joint Commission on National Dental Ex-
Item characteristics, 206-218 aminations, 130
difficulty, 207-209
discrimination, 209-217, 273-274 K
guessing, 217
heterogeneity, 206-207
Key balancing, 113
homogeneity, 206-207
Key features, 165-170
non response, 217-218, 235-237
developing an item set, 166-168
omitted responses, 217-218
example, 167
pseudo-chance, 217
evaluation, 169-170
sample size, 206
Key check (verification), 196-197
Item development process, 14-17
Knowledge (declarative knowledge), 6,
Item difficulty, 207-209
7-8, 29-43
Item discrimination, 209-217, 273-274
concepts, 31-32
Item format, 41^-2
facts, 30-31
high-inference, 42-44, 44^16
principles, 32
low-inference, 42^H, 46^7
procedures, 33-35
recommendations for choosing, 62-63
Item format validity arguments, 47-62
cognitive demand, 58-62
L
content equivalence, 49-51
fidelity and proximity to criterion, Learning theories, viii-ix, 26-27
51-55 behaviorism, 26
gender-format interaction, 55-57 constructivism, 26
instrumentality, 62 cognitive, 26-27
prediction, 48-49
Item modeling, 159-165 M
definition, 159
evaluation, 164-165 Multiple-choice issues
example, 160-164 calculators, 91-93
Item responses, 203-206 computer-based testing, 93-94
patterns, 237-242, 274 controversy, 69-70
Item shells, 150-159, 171 dangerous answers, 95
definition, 151 dictionaries, 94-95
developing item shells, 152-156 pictorial aids, 94
origin, 150-151 uncued, 70
evaluation, 157-159 Multiple-choice item formats, 67-69, 96
Item weighting, 205-206 alternate-choice, 75-77
Item-writing guide, 16, 186 complex multiple-choice, 80—81
Item-writing guidelines, 97-126 context-dependent item sets (interpre-
content concerns, 98, 101-105 tive exercises, item bundles,
format concerns, 105-107 scenarios, super-items,
option construction, 112-121 testlets, vignettes), 84-91
specific item formats, 121-125 interlinear, 88, 91
stem construction, 107-112 pictorial, 87-88, 138-139, 140
style concerns, 105-107 problem solving, 87, 88, 89,
Item-writing science, 149-150 139-142
Item-writing training, 16, 186-187 reading comprehension, 128
conventional (question, sentence completion, best answer), 68-69 Statistical test theories, 203
extended matching, 73-75, 128-129 T
matching, 70-72
multiple true-false, 81-84, 130-131, Technical Staff, 127, 129, 131, 133, 142
142-143 Test, 3
multiple mark, 84 Test items
multiple, multiple, 84 definition, 3-4
network two-tier, 143-145 steps in developing, 14-18
true-false, 77-80 Test score, 4
Test specifications, 15, 186
N Think aloud, 60
Trace line (option characteristic curve),
National Commission on Educational Ex- 210-211,221-223
cellence, 261 Trick items, 103-105
National Council on Teachers of Mathe-
matics, 26, 87, 92
National Council on Measurement in Edu- U
cation, x, 9
Negative phrasing, 111-112, 117 Unfocused stem, 108, 109
None of the above option, 116-117
V
O
Validity, 9-10
Opinion-based items, 103 formulation, 10-11, 18
Operational definition, 4-5 explication, 11, 18
Option characteristic curve, see Trace line validation, 9, 11-12, 18
Option ordering, 113-114 Validity evidence
Option weighting, 253-256 analyzing item responses, 202-229
characteristics of item responses,
P 206-218
difficulty, 207-209
Person fit, 242-246 dimensionality, 213-214,
Polytomous scoring, see Option weighting 246-253
Proofreading items, 107 discrimination, 209-217,
273-274
R distractor evaluation, 218-228,
272-273
guessing, 217
Response elimination, 60
item bias (differential item func-
Reviews of items, 17, 183-201
tioning), 231-234
nature of item responses,
S 203-206
person fit, 242-246
Security, 187 polytomous scoring, 253-256
Skills, 7, 8, 34-35, 132-137 computer programs, 204
Specific determiners, 117-118 guidelines for evaluating items,
Standards for Educational and Psychological 228
Testing, x, 10,12,13,15,25,62,94, procedural, 183-201
183,184,185,234,241,247,261 answer justification, 197-199
content definition, 185-186 sensitivity review, 192-196
content review, 188-191 test specifications, 186
editorial review, 191-192 think-aloud, 199-201
item-writer training, 186-187
key check (verification), 196-197
review for cognitive process, 188 W
review for violations of item-writing
guidelines, 187-188 What Works, 26
security, 187 Window dressing, 108-109
