
CHAPTER 6

Administering, Analyzing, and Improving Tests

Intended Learning Outcomes:


1. Identify appropriate test assembly, administration, and scoring practices.
2. Execute guidelines for assembling the test.
3. Apply quantitative and qualitative procedures for analyzing a test.
4. Recall debriefing guidelines for test and instructional improvement.
5. Name important components of achievement testing.

ASSEMBLING THE TEST


At this point let’s assume you have
1. Written measurable instructional objectives.
2. Prepared a test blueprint, specifying the number of items for each content
and process area.
3. Written test items that match your instructional objectives.

Once you have completed these activities you are ready to


1. Package the test.
2. Reproduce the test.

These components constitute what we are calling test assembly. Let’s consider
each a little more closely.

 Packaging the Test


There are several packaging guidelines worth remembering, including
grouping together items of similar format, arranging test items from easy to
hard, properly spacing items, keeping items and options on the same page,
placing illustrations near the descriptive material, checking for randomness in
the answer key, deciding how students will record their answers, providing
space for the test-taker’s name and the date, checking test directions for clarity,
and proofreading the test before you reproduce and distribute it.

1. Group Together All Items of Similar Format


If you have all true–false items grouped together, all completion items
together, and so on, the students will not have to “switch gears” to adjust to
new formats. This will enable them to cover more items in a given time than if
item formats were mixed throughout the test. Also, by grouping items of a
given format together, only one set of directions per format section is necessary,
which is another time-saver.

2. Arrange Test Items from Easy to Hard


Arranging test items according to level of difficulty should enable more
students to answer the first few items correctly, thereby building confidence
and, it is hoped, reducing test anxiety.

3. Space the Items for Easy Reading
If possible, try to provide enough blank space between items so that each
item is distinctly separate from others. When items are crowded together, a
student may inadvertently perceive a word, phrase, or line from a preceding or
following item as part of the item in question. Naturally, this interferes with a
student’s capacity to demonstrate his or her true ability.

4. Keep Items and Options on the Same Page


There are few things more aggravating to a test-taker than to have to turn
the page to read the options for multiple choice or matching items or to finish
reading a true–false or completion item. To avoid this awkwardness, do not
begin an item at the bottom of the page unless you have space to complete the
item. Not only will this eliminate having to carry items over to the next page, it
will also minimize the likelihood that the last line or two of the item will be cut
off when you reproduce the test.

5. Position Illustrations Near Descriptions


Place diagrams, maps, or other supporting material immediately above the
item or items to which they refer. In other words, if items 9, 10, and 11 refer to a
map of South America, locate the map above items 9, 10, and 11—not between
9 and 10 or between 10 and 11 and not below them. Also, if possible, keep any
such stimuli and related questions on the same page to save the test-taker time.

6. Check Your Answer Key


Be sure the correct answers follow a fairly random pattern. Avoid true–false
patterns such as TFTF, or TTFF, and multiple-choice patterns such as D C B A D
C B A. At the same time, check to see that your correct answers are distributed
about equally between true and false and among multiple-choice options.
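
A quick way to check both the balance and the randomness of a key is to tally it with a short script. The sketch below is only a rough aid: the answer key shown is hypothetical and the pattern check is a simple heuristic, not a substitute for your own inspection.

    from collections import Counter

    def check_answer_key(key):
        """Tally keyed options and flag simple repeating patterns (e.g., ABCDABCD)."""
        counts = Counter(key)
        print("Option counts:", dict(counts))
        expected = len(key) / len(counts)          # even share per option
        for option, n in counts.items():
            if n > 1.5 * expected or n < 0.5 * expected:
                print(f"Option {option} is keyed {n} times; consider rebalancing.")
        # Crude check for a short repeating cycle such as ABCDABCD...
        for cycle in (2, 3, 4):
            if len(key) >= 2 * cycle and all(key[i] == key[i % cycle] for i in range(len(key))):
                print(f"Key repeats every {cycle} items; rearrange for randomness.")

    # Hypothetical 12-item multiple-choice key
    check_answer_key(list("ABCDABCDABCD"))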

7. Determine How Students Record Answers


Decide whether you want to have students record their answers on the test
paper or on a separate answer sheet.
In the lower elementary grades, it is generally a good idea to have students
record answers on the test papers themselves. In the upper elementary and
secondary grades, separate answer sheets can be used to facilitate scoring
accuracy and to cut down on scoring time. Also, in the upper grades, learning to
complete separate answer sheets will make students familiar with the process
they will use when taking standardized tests.

8. Provide Space for Name and Date


Be sure to include a blank on your test booklet and/or answer sheet for the
student’s name and the date. This may seem an unnecessary suggestion, but it
is not always evident to a nervous test-taker that a name should be included on
the test. Students are much more likely to remember to put their names on
tests if space is provided.

9. Check Test Directions
Check your directions for each item format to be sure they are clear.
Directions should specify:
1. The numbers of the items to which they apply.
2. How to record answers.
3. The basis on which to select answers.
4. Criteria for scoring.

10. Proofread the Test


Proofread for typographical and grammatical errors before reproducing the
test and make any necessary corrections. Having to announce corrections to the
class just before the test or during the test will waste time and is likely to inhibit
the test-takers’ concentration.
Before reproducing the test, it’s a good idea to check off these steps. The
checklist in Figure 1 can be used for this purpose.

Figure 1. Test Assembly Checklist (Kubiszyn & Borich, 2013, p. 222)

 Reproducing the Test


Most test reproduction in the schools is done on photocopying machines. As
you well know, the quality of such copies can vary tremendously. Regardless of how
valid and reliable your test might be, poor copies will make it less so. If someone else
will do the reproducing, be sure to specify that the copies are for an important test
and not simply an enrichment exercise. Ask the clerk or aide to randomly inspect
copies for legibility while running the copies and to be alert for blank or partially
copied pages while collating, ordering, and stapling multipage tests.

Activity 6.1
Part 1. Evaluate the 15 multiple-choice items you made in Activity 5C using the
checklist statements below for test assembly. For each item, write YES or NO against each statement; write NA if a statement does not apply.

Checklist statements (answer for each of Items 1-15):
1. Are items of similar format grouped together?
2. Are items arranged from easy to hard levels of difficulty?
3. Are items properly spaced?
4. Are items and options on the same page?
5. Are diagrams, maps, and supporting material above the designated items and on the same page with the items?
6. Are answers random?
7. Will an answer sheet be used?
8. Are blanks for name and date included?
9. Have the directions been checked for clarity?
10. Has the test been proofread for errors?
11. Do items avoid racial and gender bias?

Part 2. After evaluating, reconstruct the items that need revision so that they satisfy the requirements for assembling the test.

Item #: __________________________________________________
Revised Item #: ___________________________________________

ADMINISTERING THE TEST


The test is ready. All that remains is to get the students ready and hand out the
tests. Here is a series of suggestions to help your students psychologically prepare
for the test.
1. Maintain a positive attitude
2. Maximize achievement motivation
3. Equalize advantages
4. Avoid surprises
5. Clarify the rules
6. Rotate distribution
7. Remind students to check their copies
8. Monitor students
9. Minimize distractions

10. Give time warnings
11. Collect tests uniformly

SCORING THE TEST


Some general suggestions to save scoring time and improve scoring accuracy
and consistency:
1. Prepare an answer key
2. Check the answer key
3. Score blindly
4. Check machine-scored answer sheets
5. Check scoring
6. Record scores

ANALYZING THE TEST


Just as you can expect to make scoring errors, you can expect to make errors in
test construction. No test you construct will be perfect—it will include inappropriate,
invalid, or otherwise deficient items. In the remainder of this chapter we will
introduce you to a technique called item analysis. Item analysis can be used to
identify items that are deficient in some way, thus paving the way to improve or
eliminate them, with the result being a better overall test. We will make a
distinction between two kinds of item analysis, quantitative and qualitative.
Quantitative item analysis is likely to be something new. But as you will see,
qualitative item analysis is something with which you are already familiar. Finally,
we will discuss how item analysis differs for norm- and criterion-referenced tests,
and we provide you with several modified norm-referenced analysis methods to use
with criterion-referenced tests.
Quantitative item analysis is a technique that will enable us to assess the quality
or utility of an item. It does so by identifying distractor or response options that are
not doing what they are supposed to be doing. How useful is this procedure for a
completion or an essay item? Frankly, it is not very useful for these types of items,
but qualitative item analysis is. On the other hand, quantitative item analysis is
ideally suited for examining the usefulness of multiple-choice formats. The
quantitative item analysis procedures that we will describe are most appropriate for
items on a norm-referenced test. As you will see, we are interested in spreading out
students, or discriminating among them, with such a test. When dealing with a
criterion-referenced test, qualitative and modified quantitative item analysis procedures are most appropriate.

Quantitative Item Analysis


A numerical method for analyzing test items that employs students' responses to the alternatives or options.

Empirically-based Improvement Procedures


Item-improvement using empirically-based methods is aimed at improving
the quality of an item using students’ responses to the test. Test developers
refer to this technical process as item analysis as it utilizes data obtained
separately for each item. An item is considered good when its quality indices, i.e., the difficulty index and discrimination index, meet certain characteristics. For a
norm-referenced test, these two indices are related since the level of difficulty
of an item contributes to its discriminability. An item is good if it can
discriminate between those who perform well in the test and those who do not.
However, an extremely easy item, that which can be answered correctly by
more than 85% of the group, or an extremely difficult item, that which can only
be answered correctly by 15%, is not expected to perform well as a
“discriminator”. The group will appear to be quite homogeneous with items of
this kind. They are weak items since they do not contribute to “score-based
inference”.

Difficulty Index
An item's difficulty index is obtained by calculating the p value (p), which is the proportion of students answering the item correctly. The difficulty of an item, or item difficulty, is defined as the number of students who are able to answer the item correctly divided by the total number of students:

p = R / T

where p is the difficulty index
R = total number of students answering the item right
T = total number of students answering the item

Here are two illustrative samples:


Item 1: There were 45 students in the class who responded to Item 1, and 30 answered it correctly.

p = 30/45 = 0.67

Item 1 has a p value of 0.67. Sixty-seven percent (67%) got the item right while 33% missed it.

Item 2: In the same class, only 10 responded correctly to Item 2.

p = 10/45 = 0.22

Item 2 has a p value of 0.22. Out of 45, only 10 or 22% got the item right while 35 or 78% missed it.

For a norm-referenced test: Between the two items, Item 2 appears to be a much more difficult item, since less than a fourth of the class was able to respond correctly.
For a criterion-referenced test: The class shows much better performance in Item 1 than in Item 2. It is still a long way for many to master Item 2.

Range of Difficulty Index    Interpretation      Action

0 - 0.25                     Difficult           Revise or Discard
0.26 - 0.75                  Right Difficulty    Retain
0.76 and above               Easy                Revise or Discard
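
To make the calculation concrete, here is a minimal Python sketch of the p = R/T computation and the interpretation ranges above. The function names and the sample responses are illustrative assumptions, not part of the module.

    def difficulty_index(responses):
        """p = R / T: proportion of students answering the item correctly (1 = right, 0 = wrong)."""
        return sum(responses) / len(responses)

    def interpret_difficulty(p):
        """Map p to the interpretation and action ranges in the table above."""
        if p <= 0.25:
            return "Difficult - revise or discard"
        elif p <= 0.75:
            return "Right difficulty - retain"
        return "Easy - revise or discard"

    # Item 1 from the example: 30 of 45 students answered correctly.
    item1 = [1] * 30 + [0] * 15
    p1 = difficulty_index(item1)
    print(round(p1, 2), interpret_difficulty(p1))   # 0.67 Right difficulty - retain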

Discrimination Index
The power of an item to discriminate between informed and uninformed groups
or between more knowledgeable and less knowledgeable learners is shown using
the item discrimination index (D). This is an item statistic that can reveal useful information for improving an item. Basically, an item-discrimination index shows the relationship between a student's performance on an item (i.e., right or wrong) and his/her total performance on the test, represented by the total score.
For classroom tests, the discrimination index shows if a difference exists
between the performance of those who scored high and those who scored low in an
item. As a general rule, the higher the discrimination index (D), the more marked
the magnitude of the difference is, and thus, the more discriminating the item is.
The nature of the difference, however, can take different directions:
a. Positively discriminating item - the proportion of the high-scoring group answering correctly is greater than that of the low-scoring group.
b. Negatively discriminating item - the proportion of the high-scoring group answering correctly is less than that of the low-scoring group.
c. Non-discriminating item - the proportion of the high-scoring group answering correctly is equal to that of the low-scoring group.

Calculation of the discrimination index therefore requires obtaining the difference between the proportion of the high-scoring group answering the item correctly and the proportion of the low-scoring group answering the item correctly, using this simple formula:

D = RU/nU - RL/nL

where RU and RL are the numbers of students in the upper and lower groups answering the item correctly, and nU and nL are the numbers of students in each group.

Another calculation can bring about the same result (Kubiszyn and Borich, 2010). With groups of equal size n:

D = (RU - RL) / n

As you can see, dividing each count by the group size is actually getting the p value of the item for that group. So to get D is to get the difference between the p value involving the upper half and the p value involving the lower half. The formula for the discrimination index (D) can therefore also be given as (Popham, 2011):

D = Pupper - Plower

To obtain the proportions of the upper and lower groups responding to the item
correctly, the teacher follows these steps:
1. Score the test papers using the key to correction to obtain each student's total score. The maximum score is the total number of objective items.
2. Order the test papers from highest to lowest score.
3. Split the test papers into halves: high group and low group.
 For a class of 50 or fewer students, do a 50-50 split. Take the upper half as the HIGH GROUP and the lower half as the LOW GROUP.
 For a big group of 100 or so, take the upper 25 - 27% and the lower 25 - 27%.
 Maintain equal numbers of test papers for Upper and Lower groups.
4. Obtain the p value for the upper group and p value for the lower group.

5. Get the discrimination index by getting the difference between the p-values.
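
As an illustration of these steps, here is a minimal Python sketch, assuming a small class handled with a 50-50 split; the scores and item responses are hypothetical, and ties at the median are broken simply by sort order.

    def discrimination_index(total_scores, item_correct):
        """D = p(upper) - p(lower) using a 50-50 split of papers ranked by total score.

        total_scores : each student's total test score
        item_correct : parallel list of 1 (right) / 0 (wrong) on the item in question
        """
        ranked = sorted(zip(total_scores, item_correct), key=lambda pair: pair[0], reverse=True)
        half = len(ranked) // 2
        upper, lower = ranked[:half], ranked[-half:]        # equal-sized groups
        p_upper = sum(correct for _, correct in upper) / half
        p_lower = sum(correct for _, correct in lower) / half
        return p_upper - p_lower

    # Hypothetical class of 10: high scorers tend to get the item right, so D is positive.
    scores = [48, 45, 44, 41, 40, 36, 33, 30, 27, 20]
    item   = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]
    print(round(discrimination_index(scores, item), 2))   # 0.8 - 0.2 = 0.6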
For the purpose of evaluating the discriminating power of items, Popham (2011) offers the guidelines proposed by Ebel & Frisbie (1991), shown in Table 1. These can guide teachers in selecting satisfactory items and deciding what to do to improve the test.
Table 1. Guidelines for Evaluating the Discriminating Efficiency of Items

Discrimination Index    Item Evaluation
.40 and above           Very good items
.30 - .39               Reasonably good items, but possibly subject to improvement
.20 - .29               Marginal items, usually needing improvement
.19 and below           Poor items, to be rejected or improved by revision

Items with negative discrimination indices, even when their magnitude is high, are subject right away to revision if not deletion. With multiple-choice items, a negative D is forensic evidence of errors in item writing. It suggests the possibility of:
 Wrong key - more knowledgeable students selected a distracter that is actually the correct answer but is not the keyed option
 An unclear problem in the stem leading to more than one correct answer
 Ambiguous distracters leading the more informed students to be divided in choosing among the attractive options
 An implausible keyed option which the more informed students will not choose
As you can see, awareness of item-writing guidelines can provide cues on how to improve items bearing negative or non-significant discrimination indices.

Distracter Analysis
Another empirical procedure to discover areas for item improvement utilizes an analysis of the distribution of responses across the distracters. Especially when the difficulty index and discrimination index of an item suggest that it is a candidate for revision, distracter analysis becomes a useful follow-up. It can detect differences in how the more able students respond to the distracters of a multiple-choice item compared with how the less able ones do. It can also provide an index of the plausibility of the alternatives, that is, whether they are functioning as good distracters. Distracters not chosen at all, especially by the uninformed students, need to be revised to increase their attractiveness.

To illustrate this process, consider the frequency distribution of the responses of the upper group and the lower group across the alternatives for two items. Separate counts are made of the upper- and lower-group students who chose A, B, C, and D. The data are organized in a distracter analysis table.

Table 2. Distracter Analysis Table

Item     Difficulty   Discrimination   Group    A    B    C    D   Omit
(N=40)   Index (p)    Index (D)
1        .38          -.35             Upper    2   10   *5    3
                                       Lower    2    0   12    6
2        .45          -.50             Upper    2   *4   10    4
                                       Lower    5   14    1    0

(* marks the keyed option)

Analysis:
 What kinds of items do you see based on their D?
 What does their respective D indicate? Cite the data supporting this.
 Which of the two items is more discriminating? Why?
 Which items need to be revised?
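
A distracter analysis table such as Table 2 can also be tallied with a short script. The sketch below is only an illustration: the response letters mirror Item 1 of Table 2, the upper/lower split is assumed to have been done beforehand, and the function name is made up.

    from collections import Counter

    def distracter_table(upper_responses, lower_responses, options="ABCD", key="C"):
        """Count how often each group chose each alternative; the keyed option is starred."""
        labels = ["*" + o if o == key else o for o in options]
        print("Group  " + "  ".join(f"{label:>3}" for label in labels))
        for name, responses in (("Upper", upper_responses), ("Lower", lower_responses)):
            counts = Counter(responses)
            print(f"{name:6} " + "  ".join(f"{counts.get(o, 0):>3}" for o in options))

    # Response letters mirroring Item 1 of Table 2 (20 students per group).
    upper = list("A" * 2 + "B" * 10 + "C" * 5 + "D" * 3)
    lower = list("A" * 2 + "C" * 12 + "D" * 6)
    distracter_table(upper, lower, key="C")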

Sensitivity to Instruction Index
The techniques discussed earlier make use of responses obtained from a single administration of a test. The indices obtained for difficulty, discrimination, and option plausibility are seen as helpful statistics for the improvement of norm-referenced or summative tests given after a period of instruction.
Another empirical approach for reviewing test items is to infer how sensitive an item has been to instruction. This is referred to as the sensitivity to instruction index (Si), and it signifies the change in students' performance as a result of instruction. The information is useful for criterion-referenced tests, which aim at determining whether mastery learning has been attained after a designated or prescribed instructional period. The basic question being addressed is a directional one, i.e., is student performance better after instruction is given? In the context of item performance, Si will indicate whether the p value obtained for the item in the post-test is greater than the p value in the pre-test. Consider an item where, in a class of 40, 80% answered it correctly in the post-test while only 10% did in the pre-test.
Its p value for the post-test is .80 while for the pre-test it is .10; thus Si = .70, following this calculation:
Sensitivity to instruction (Si) = Ppost - Ppre = .80 - .10 = .70

Notice that the calculation for Si carries the same concept as the discrimination index, except that the difference in proportion is obtained between a post-test and a pre-test given to the same group. Similar to the interpretation of D, the higher the Si value, the more sensitive the item is in showing change as a result of instruction. This item statistic gives additional information regarding the efficiency and validity of the item.
There could, however, be reasons why the Si of a test item does not register a meaningful difference between post-test and pre-test. Especially for a knowledge-level item, it is possible that it was not taken up at all during instruction, so the students did not have the chance to learn it, or that they already knew the content prior to instruction. The teacher should take note of these items when reviewing content coverage for the period.
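
As a small illustration, the Si calculation can be written out as follows; the counts echo the worked example above (a class of 40 with 10% correct before instruction and 80% after), and the function names are illustrative.

    def p_value(num_correct, num_students):
        """p = R / T."""
        return num_correct / num_students

    def sensitivity_to_instruction(p_pre, p_post):
        """Si = Ppost - Ppre: change in the item's p value after instruction."""
        return p_post - p_pre

    # Worked example from the text: class of 40, 10% correct before instruction, 80% after.
    si = sensitivity_to_instruction(p_value(4, 40), p_value(32, 40))
    print(round(si, 2))   # 0.7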

Activity 6.2

Task Description: This activity will test your ability to apply empirical procedures for
item-improvement. Solve and answer the following.

1. A final test in Science was administered to a Grade IV class of 50. The teacher wants to further improve the items for next year's use. Calculate a quality index from the given data and indicate the possible revision needed by some items.
Item    Number getting item correct    Index            Revision needed to be done

1       34                             ____________
2       18                             ____________
3       10                             ____________
4       46                             ____________
5       8                              ____________

2. Below are additional data collected for the same items. Calculate another quality
index and indicate what needs to be improved with the obtained index as a basis.
Item    Upper Group    Lower Group    Index            Revision needed to be done

1       25             9              ____________
2       9              9              ____________
3       2              8              ____________
4       38             8              ____________
5       1              7              ____________

3. A distracter analysis table is given for a test item given to a class of 60. Obtain
the necessary item statistics using the given data.
Item     Difficulty   Discrimination   Group    A    B   *C    D   Omit
(N=30)   Index (p)    Index (D)
1        ?            ?                Upper    2   18    5    0
                                       Lower    0   10   20    0

Write your evaluation on the following aspects of the item:


a. Difficulty of the Item - _______________________________________________

b. Discriminating power of the Item - ____________________________________

c. Plausibility of Options - ______________________________________________

d. Ambiguity of the answer - ____________________________________________

Qualitative Item Analysis


A non-numerical method for analyzing test items not employing student
responses, but considering test objectives, content validity, and technical item
quality.

Judgmentally-based improvement procedures
This approach basically makes use of human judgment in reviewing the items. The judges are the teachers themselves, who know exactly what the test is for, the instructional outcomes to be assessed, and the items' level of difficulty appropriate to their class; the teachers' peers or colleagues, who are familiar with the curriculum standards for the target grade level, the subject matter content, and the ability of the learners; and the students themselves, who can perceive difficulties based on their past experiences.
 Teachers’ Own Review (Self-review)
It is always advisable for teachers to take a second look at the assessment tools they have devised for a specific purpose. To presume perfection right away after construction may lead to a failure to detect shortcomings of the test or assessment task. There are five (5) suggestions given by Popham (2011, p. 253) for teachers to follow in exercising judgment:

1. Adherence to item-specific guidelines and general item-writing commandments. There are specific guidelines for writing the various forms of objective and non-objective constructed-response types and the selected-response types for measuring lower-level and higher-level thinking skills. Those guidelines should be used by teachers to check how well the items have been planned and written, particularly their alignment with the intended instructional outcomes.

2. Contribution to score-based inference. The teacher examines if the expected scores generated by the test can contribute to making valid inferences about the learners. Can the scores reveal the amount of learning achieved or show what has been mastered? Can the scores indicate the student's capability to move on to the next instructional level? Or do the scores obtained make no difference at all in describing or differentiating various abilities?

3. Accuracy of content. This review should especially be considered when a test developed some time ago is used again. Changes that may occur due to new discoveries or developments can redefine the content of a summative test. If this happens, the items or the key to correction may have to be revisited.

4. Absence of content gaps. This review criterion is especially useful in strengthening the score-based inference capability of the test. If the current tool misses out on important content now prescribed by a new curriculum standard, the score will likely not give an accurate description of what is expected to be assessed. The teacher should always ensure that the assessment tool matches what is currently required to be learned. This is a way to check on the content validity of the test.

5. Fairness. The discussions on item-writing guidelines always warn against unintentionally helping uninformed students obtain higher scores. This can result from inadvertent grammatical clues, unattractive distracters, ambiguous problems, and messy test instructions. Sometimes, unfairness can happen because of undue advantage received by a particular group, like those seated in front of the classroom or those coming from a particular socio-economic level. Getting rid of faulty and biased items and writing clear instructions definitely adds to the fairness of the test.

 Peer review
There are schools that encourage peer or collegial review of assessment instruments among their teachers. Time is provided for this activity, and it has almost always yielded good results for improving tests and performance-based assessment tasks. During these teacher dyad or triad sessions, those teaching the same subject area can openly review together the classroom tests and tasks they have devised against some consensual criteria. The suggestions given by test experts can actually be used collegially as the basis for a review checklist:
a. Do the items follow the specific and general guidelines in writing items
especially on:
 Being aligned to instructional objectives?
 Making the problem clear and unambiguous?
 Providing plausible options?
 Avoiding unintentional clues?
 Having only one correct answer?
b. Are the items free from inaccurate content?
c. Are the items free from obsolete content?
d. Are the test instructions clearly written for students to follow?
e. Is the level of difficulty of the test appropriate to level of learners?
f. Is the test fair to all kinds of students?

 Student Review
Engagement of students in reviewing items has become a laudable practice
for improving classroom tests. The judgement is based on the students’
experience in taking the test, their impressions and reactions during the testing
event. The process can be efficiently carried out through the use of a review
questionnaire. Popham (2011) illustrates a sample questionnaire shown in Table
3. It is better to conduct the review activity a day after taking the test so the
students still remember the experience when they see a blank copy of the test.

Table 3. Item-Improvement Questionnaire for Students.


1. If any of the items seemed confusing, which ones were they?
2. Did any items have more than one correct answer? If so, which ones?
3. Did any items have no correct answers? If so, which ones?
4. Were there words in any items that confused you? If so, which ones?
5. Were the directions for the test, or for particular subsections, unclear? If so, which ones?

Activity 6.3 Classifying Item-Improvement Approach. Below are descriptions of procedures done. Write J if a judgmental approach is used and E if an empirically-based one is used.
__________ 1. The Math Coordinator of Grade VII classes examined the periodical
tests prepared by the Math teachers to see if their items are aligned to the target
outcomes for the first quarter.
__________ 2. The alternatives of the multiple-choice items of the Social Studies
test were reviewed to discover if they only have one correct answer.
__________ 3. To determine if the items are efficiently discriminating between the more able students and the less able ones, a Biology teacher obtained the discrimination index (D) of the items.
__________ 4. A Technology Education teacher was interested to see if the
criterion-referenced test he has devised shows a difference in the items’ post-test
and pre-test p-values.
__________ 5. An English teacher conducted a session with his students to find out
if there are other responses acceptable in their literature test. He encouraged them
to rationalize their answers.

ITEM ANALYSIS MODIFICATIONS FOR THE CRITERION-REFERENCED TEST

The statistical test analysis method discussed earlier, called quantitative item
analysis, applies most directly to the norm-referenced test. The classroom teacher
will typically use criterion-referenced tests rather than norm referenced tests. Well,
then, we can just use these same procedures for our teacher-made
criterion-referenced tests. Right? Wrong!
As we will discover in later chapters, variability of scores is crucial to the
appropriateness and success of norm-referenced quantitative item analysis
procedures. In short, these procedures depend on the variability or spread of scores
(i.e., low to high) if they are to do their jobs correctly. In a typical teacher-made
criterion-referenced test, however, variability of scores would be expected to be
small, assuming instruction is effective and the test and its objectives match. Thus,
the application of quantitative item analysis procedures to criterion-referenced
measures may not be appropriate, since by definition most students will answer
these items correctly (i.e., there will be minimal variability or spread of scores). In
this section we will describe several ways in which these procedures can be modified
when a criterion-referenced, mastery approach to test item evaluation is employed.
As you will see, these modifications are straightforward and easier to use than the
quantitative procedures described earlier.

 Using Pre- and Post-tests as Upper and Lower Groups

The following approaches require that you administer the test as a pretest prior
to your instruction and as a post-test after your instruction. Ideally, in such a
situation the majority of students should answer most of your test items incorrectly
on the pretest and correctly on the post-test. By studying the difference between
the difficulty (p) levels for each item at the time of the pre- and post-tests, we can
tell if this is happening. At pretest, the p level should be low (e.g., 0.30 or lower), and
at post-test, it should be high (e.g., 0.70 or higher). In addition, we can consider the
pretest results for an item as the lower group (L) and post-test results for the item as
the upper group (U), and then we can perform the quantitative item analysis
procedures previously described to determine the discrimination direction for the
key and for the distractors.
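
Reusing the earlier discrimination idea, a minimal sketch of this approach might look as follows, with the pre-test treated as the lower group and the post-test as the upper group; the counts and the function name are hypothetical.

    def pre_post_discrimination(pre_correct, post_correct, n_students):
        """Treat the post-test group as 'upper' and the pre-test group as 'lower':
        D = p(post) - p(pre) for the keyed answer (or for any single option)."""
        return post_correct / n_students - pre_correct / n_students

    # Hypothetical item in a class of 30: 8 correct before instruction, 26 after.
    print(round(pre_post_discrimination(8, 26, 30), 2))   # 0.6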

If a criterion-referenced test item manifests these features, it has passed our
“test” and probably is a good item with little or no need for modification. Contrast
this conclusion, however, with the following item from the same test.

Thus, the item in Example 2 failed all the tests. Rather than modify the item, it
is probably more efficient to replace it with another.

 Comparing the Percentage Answering Each Item Correctly on Both Pre- and Post-test
If your test is sensitive to your objectives (and assuming you teach to your
objectives), the majority of learners should receive a low score on the test prior to
your instruction and a high score afterward. This method can be used to determine
whether this is happening. Subtract the percentage of students passing each item
before your instruction from the percentage of students passing each item after
your instruction. The more positive the difference, the more you know the item is
tapping the content you are teaching. This method is similar to the first step as
described in the preceding section. For example, consider the following percentages
for five test items:

Notice that item 3 registers no change in the percentage of students passing


from before to after instruction. In fact, a high percentage of students got the item
correct without any instruction! This item may be eliminated from the test, since
little or no instruction pertaining to it was provided and most students already knew
the content it represents.
Now, look at item 5. Notice that the percentage is negative. That is, 14% of the
class actually changed from getting the item correct before instruction to getting it
wrong after. Here, either the instruction was not related to the item or it actually
confused some students who knew the correct answer beforehand. A revision of the
item, the objective pertaining to this item, or the related instruction is in order.
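
In code, this comparison is just an item-by-item subtraction of the two percentages. The sketch below uses hypothetical percentages (chosen so that items 3 and 5 behave as described above) and flags items whose difference is zero or negative.

    def percent_change(pre_pct, post_pct):
        """Percentage passing after instruction minus percentage passing before, per item."""
        return [post - pre for pre, post in zip(pre_pct, post_pct)]

    # Hypothetical percentages passing each of five items before and after instruction.
    pre = [20, 35, 80, 10, 60]
    post = [85, 90, 80, 75, 46]
    for number, diff in enumerate(percent_change(pre, post), start=1):
        flag = "  <- review this item" if diff <= 0 else ""
        print(f"Item {number}: {diff:+d}%{flag}")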

 Determining the Percentage of Items Answered in the Expected Direction
for the Entire Test
Another, slightly different approach is to determine whether the entire test
reflects the change from fewer to more students answering items correctly from
pre- to post-test. This index uses the number of items each learner failed on the test
prior to instruction but passed on the test after instruction. Here is how it is
computed:

Step 1: Find the number of items each student failed on the pretest, prior to
instruction, but passed on the post-test, after instruction.

The asterisks indicate just the items counted in Step 1 for Bobby. This count is
then repeated for each student.
Step 2: Add the counts in Step 1 for all students and divide by the number of
students.
Step 3: Divide the result from Step 2 by the number of items on the test.
Step 4: Multiply the result from Step 3 by 100.
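
Here is a minimal Python sketch of the four steps, assuming each student's pre-test and post-test item responses are stored as parallel lists of 1s and 0s; the data below are hypothetical.

    def expected_direction_index(pre_matrix, post_matrix):
        """Percentage of items answered in the expected direction for the entire test.

        pre_matrix / post_matrix: one list of 0/1 item responses per student,
        with the same students and the same item order in both.
        """
        n_students = len(pre_matrix)
        n_items = len(pre_matrix[0])
        # Step 1: items each student failed on the pre-test but passed on the post-test.
        counts = [
            sum(1 for pre, post in zip(pre_row, post_row) if pre == 0 and post == 1)
            for pre_row, post_row in zip(pre_matrix, post_matrix)
        ]
        # Step 2: average the counts over students.
        average = sum(counts) / n_students
        # Steps 3 and 4: divide by the number of items and express as a percentage.
        return average / n_items * 100

    # Hypothetical 4-item test taken by three students before and after instruction.
    pre = [[0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 0]]
    post = [[1, 1, 1, 0], [1, 1, 1, 1], [1, 0, 1, 1]]
    print(round(expected_direction_index(pre, post), 1))   # about 66.7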

Let’s see how this would work for a 25-item test given to five students before
and after instruction.

DEBRIEFING GUIDELINES
Before handing back answer sheets or grades, you should do the following.
1. Discuss problem items.
2. Listen to student reactions.
3. Avoid on-the-spot decisions.
4. Be equitable with changes.
5. Ask students to double-check.
6. Ask students to identify problems.

THE PROCESS OF EVALUATING CLASSROOM ACHIEVEMENT


Figure 2 summarizes all of the important components of achievement testing
that we have discussed thus far. If you’ve studied and worked at these chapters, you
are ahead in the test construction game. What that means for you is better tests
that cause fewer students and parents to complain and tests that are more valid and
reliable measurements of achievement.

Figure 2. The process of measuring achievement in the classroom (Kubiszyn & Borich, 2013, p. 240)

References

[1] De Guzman, E.S., & Adamos, J.L. (2015). Assessment of learning 1. Adriana
Publishing Co., Inc.
[2] Kubiszyn, T., & Borich, G. (2013). Educational testing and measurement:
Classroom application and practice. John Wiley & Sons, Inc.
[3] Navarro, R.L., Santos, R.G., & Corpuz, B.B. (2019). Assessment of learning 1.
Lorimar Publishing, Inc.
[4] Popham, W.J. (2017). Classroom assessment: What teachers need to know.
Pearson Education, Inc.
