Validity and Reliability of Tests
RELIABILITY
Test reliability refers to the degree to which a test is consistent and stable in
measuring what it is intended to measure. Most simply put, a test is reliable if it is
consistent within itself and across time. To understand the basics of test reliability,
think of a bathroom scale that gave you drastically different readings every time
you stepped on it regardless of whether you had gained or lost weight. If such a
scale existed, it would not be considered reliable.
VALIDITY
Test validity refers to the degree to which the test actually measures what it
claims to measure. Test validity is also the extent to which inferences, conclusions,
and decisions made on the basis of test scores are appropriate and meaningful. The
2000 and 2008 studies present evidence that Ohio's mandated accountability tests
are not valid: the conclusions and decisions made on the basis of OPT performance are not based upon what the test claims to be measuring.
Reliability vs. Validity
What does it tell you?
Reliability: the extent to which the results can be reproduced when the research is repeated under the same conditions.
Validity: the extent to which the results really measure what they are supposed to measure.
WHAT IS RELIABILITY?
Reliability refers to how consistently a method measures something. If the same
result can be consistently achieved by using the same methods under the same
circumstances, the measurement is considered reliable.
You measure the temperature of a liquid sample several times under identical
conditions. The thermometer displays the same temperature every time, so the
results are reliable.
A doctor uses a symptom questionnaire to diagnose a patient with a long-term
medical condition. Several different doctors use the same questionnaire with the
same patient but give different diagnoses. This indicates that the questionnaire has
low reliability as a measure of the condition.
WHAT IS VALIDITY?
TYPES OF RELIABILITY
TYPES OF VALIDITY
QUESTION NO 2
Hughes (2003) was an early advocate of an increased level of detail, although according to him it is not to be expected that everything in the specification will always appear in the test. Bachman and Palmer (1996) and Alderson (1995) also called for more detail to be included in test specifications, with Bachman and Palmer being more detailed than Alderson.
According to Davidson and Lynch (2002:20), there is no single best format for test specifications; 'the principles are universal'. Their specification model calls for test developers to include a general description (GD), prompt attributes (PA), response attributes (RA), sample items (SI) and, if necessary, a specification supplement (SS).
VALIDITY
The term validity refers to the extent to which a test measures what it says it
measures (Heaton 1988:159).
RELIABILITY:
Reliability is concerned with ensuring the consistency of test scores.
PRACTICALITY:
Practical issues include time, resources and administrative logistics; practicality is, perhaps, one of the most important qualities of a test.
WASHBACK:
According to Buck (1988), washback refers to the effect of testing on teaching and learning.
AUTHENTICITY:
It is an important criterion for judging a test's quality; good testing should strive to use formats and tasks that mirror the types of situations in which students would authentically use the target language.
TRANSPARENCY:
SCORER RELIABILITY:
It means clearly defining the weighting for each section and describing specific criteria for marking and grading (Weir, 1993, p. 7).
RATING SCALES:
Scoring is often taken for granted in language testing. Rating scales may be:
ANALYTICAL:
A type of rating scale that requires teachers to allot separate ratings for the
different components of language ability.
HOLISTIC:
A type of rating scale that requires a single overall rating of a pupil's performance rather than separate ratings for its components.
1. LISTENING
Knowledge
Comprehension
Application
Analysis
Synthesis
Evaluation
2. SPEAKING:
Accuracy
Fluency
Appropriacy
Coherence and cohesion
Use of language functions
Managing a discussion
Task fulfillment
3. READING:
Comprehension
Application
Analysis
Synthesis
Evaluation
4. WRITING:
Accuracy
Fluency
Appropriacy
Coherence and cohesion
Use of language functions
Managing a discussion
Task fulfillment
CRITERIA
INTRODUCTION
After the overall content of the test has been established through a job analysis, the
next step in test development is to create the detailed test specifications. Test
specifications usually include a test description component and a test blueprint
component. The test description specifies aspects of the planned test such as the
test purpose, the target examinee population, the overall test length, and more. The
test blueprint, sometimes also called the table of specifications, provides a listing
of the major content areas and cognitive levels intended to be included on each test
form. It also includes the number of items each test form should include within
each of these content and cognitive areas.
TEST DESCRIPTION
TEST BLUEPRINT
The content areas listed in the test blueprint, or table of specifications, are
frequently drawn directly from the results of a job analysis. These content areas
comprise the knowledge, skills, and abilities that have been determined to be the
essential elements of competency for the job or occupation being assessed. In
addition to the listing of content areas, the test blueprint specifies the number or
proportion of items that are planned to be included on each test form for each
content area. These proportions reflect the relative importance of each content area
to competency in the occupation.
Most test blueprints also indicate the levels of cognitive processing that the
examinees will be expected to use in responding to specific items (e.g.,
Knowledge, Application). It is critical that your test blueprint and test items
include a substantial proportion of items targeted above the Knowledge-level of
cognition. A typical test blueprint is presented in a two-way matrix with the
content areas listed in the table rows and the cognitive processes in the table
columns. The total number of items specified for each column indicates the
proportional plan for each cognitive level on the overall test, just as the total
number of items for each row indicates the proportional emphasis of each content
area. The test blueprint is used to guide and target item writing as well as for test
form assembly. Use of a test blueprint improves consistency across test forms as
well as helping ensure that the goals and plans for the test are met in each
operational test. An example of a test blueprint is provided next.
In the (artificial) test blueprint for a Real Estate licensure exam given below the
overall test length is specified as 80 items. This relatively small test blueprint
includes four major content areas for the exam (e.g., Real Estate Law). Three
levels of cognitive processing are specified. These are Knowledge,
Comprehension, and Application. Each test form written to this table of
specifications will include 40% of the total test (or 32 items) in the content area of
Real Estate Law. In addressing cognitive levels, 35% of the overall test (or 28
items) will be included at the Knowledge-level. The interior cells of the table
indicate the number of items that are intended to be on the test from each content
and cognitive area combination. For example, the test form will include 16 items at
the Knowledge-level in the content area of Real Estate Law.
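As a rough illustration of how such a blueprint works, the arithmetic can be laid out as a small two-way matrix. In the Python sketch below, only the figures quoted above come from the example (80 items overall, 32 Real Estate Law items, 28 Knowledge items, and 16 Knowledge items within Real Estate Law); the remaining content-area names and cell counts are hypothetical placeholders.

    # Two-way blueprint matrix: content areas (rows) x cognitive levels (columns).
    # Only the Real Estate Law row and the quoted totals follow the example above;
    # every other name and number is a hypothetical placeholder.
    blueprint = {
        "Real Estate Law":     {"Knowledge": 16, "Comprehension": 10, "Application": 6},
        "Financing":           {"Knowledge": 5,  "Comprehension": 7,  "Application": 8},
        "Valuation":           {"Knowledge": 4,  "Comprehension": 5,  "Application": 7},
        "Property Management": {"Knowledge": 3,  "Comprehension": 4,  "Application": 5},
    }

    # Row totals show the emphasis given to each content area; column totals show
    # the emphasis given to each cognitive level, as described above.
    row_totals = {area: sum(cells.values()) for area, cells in blueprint.items()}
    col_totals = {}
    for cells in blueprint.values():
        for level, count in cells.items():
            col_totals[level] = col_totals.get(level, 0) + count

    print(row_totals)                # Real Estate Law totals 32 items (40% of 80)
    print(col_totals)                # Knowledge totals 28 items (35% of 80)
    print(sum(row_totals.values()))  # 80 items overall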
QUESTION NO 3
TEST INTERPRETATION
SCORES:
RAW SCORES:
The number of points received on a test when the test has been scored according to directions. Example:
A student got 10 out of 20 points on a 20-item quiz.
A raw score invites an immediate interpretation as a simple count of points, but on its own it does not yield a meaningful interpretation because it is just a raw score.
Thus, we have to interpret the student's score in a more descriptive and meaningful way.
SCALED SCORES:
REFERENCING FRAMEWORK
NORM GROUP
A norm group is a well-defined group of other students. Ranking the scores of students from the highest to the lowest provides a simple basis for norm-referenced interpretation. However, merely ranking raw scores is not a proper and valid way to interpret a student's performance formally; the raw scores are therefore converted to derived scores.
DERIVED SCORE
A derived score is a numerical report of test performance on a scale that has well
defined characteristics and yields normative meanings.
GRADE NORMS:
PERCENTILE NORMS
ADVANTAGES
The performance of a student is not affected by the performance of the whole class.
It promotes cooperation among the students.
All students may pass the subject or course when they meet the standard set by the teacher.
DISADVANTAGES
DEFINITION:
The standard deviation is the positive square root of the arithmetic mean of the squares of the deviations of all scores from their mean: σ = √( Σ(x − µ)² / N ).
Z-SCORE
It indicates how many standard deviations from the mean (plus or minus) the student scored:
z = (x − µ) / σ
where x is the student's test score,
µ is the mean of all test scores, and
σ is the standard deviation of the test scores.
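As a hypothetical illustration of the formula (the figures are invented for the example): a student who scores x = 70 on a test with mean µ = 60 and standard deviation σ = 5 obtains z = (70 − 60) / 5 = 2.0; the student scored two standard deviations above the mean.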
T-SCORE
It is a standard score with a mean equal to 50 and a standard deviation equal to 10:
T = 50 + 10z
where z is the z-score, that is, the number of standard deviations by which a raw score falls above or below the mean.
STANINE SCORES
The term stanine is the abbreviation of "standard nine". Stanines have a mean equal to 5 and a standard deviation equal to 2. A student whose raw score equals the test mean will obtain a stanine score of 5. A score that is 3 standard deviations above the mean is assigned a stanine of 9, not 11, because stanines are limited to a range of 1 to 9.
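The three conversions can be sketched together in a few lines of Python. The mean of 60, the standard deviation of 5 and the raw scores below are hypothetical, and the stanine is approximated by rounding 5 + 2z and limiting the result to the range 1 to 9.

    def z_score(x, mean, sd):
        # number of standard deviations the score lies above (+) or below (-) the mean
        return (x - mean) / sd

    def t_score(z):
        # standard score with a mean of 50 and a standard deviation of 10
        return 50 + 10 * z

    def stanine(z):
        # "standard nine": mean 5, standard deviation 2, limited to the range 1-9
        return max(1, min(9, round(5 + 2 * z)))

    mean, sd = 60, 5                  # hypothetical test mean and standard deviation
    for raw in (60, 70, 75):          # hypothetical raw scores
        z = z_score(raw, mean, sd)
        print(raw, round(z, 2), round(t_score(z), 1), stanine(z))

    # raw 60 (at the mean)          -> z = 0.0, T = 50, stanine = 5
    # raw 75 (3 SDs above the mean) -> z = 3.0, T = 80, stanine capped at 9, not 11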
Many standard-score scales imply a degree of precision that does not exist within educational tests. Differences of less than one-third of a standard deviation are usually not measurable; this means that differences of less than 3 points on the T-score scale and 5 points on deviation IQs are not meaningful, so both of these scales are more finely graded than the measurement warrants. They represent measures of relative standing as opposed to measures of growth. A student who progresses through school in step with peers remains at the same number of standard deviations from the mean. The constant standard score may suggest (incorrectly) that growth is not occurring.
RANKING
The position or level that something or someone has in a list comparing importance, quality or success; a list that compares the quality, success or importance of things or people.
A ranking is a relationship between a set of items such that, for any two items, the
first is either 'ranked higher than', 'ranked lower than' or 'ranked equal to' the
second.
FREQUENCY DISTRIBUTION
When a test is given to students to find out about their achievement, the raw scores serve as data. These data have not yet undergone any statistical treatment. To understand the data easily, we arrange them into groups or classes. Data so arranged are called grouped data or a frequency distribution.
The raw data are the scores of fifty students:
37, 42, 44, 51, 48, 30, 47, 56, 52, 31
64, 36, 42, 54, 49, 59, 45, 32, 38, 46
53, 54, 63, 41, 49, 51, 58, 41, 48, 48
43, 37, 52, 55, 61, 43, 46, 48, 62, 35
52, 33, 46, 60, 45, 40, 47, 51, 56, 53
STEPS:
Determine the range. Range is the difference between highest and lowest scores.
Range = 64-30 = 34
Decide the appropriate number of class intervals. There is no hard and fast formula for deciding the class intervals; the number of class intervals is usually taken between 5 and 20 depending on the length of the data. For this data the number of class intervals taken is 7.
Determine the approximate length of the class interval by dividing the range by the number of class intervals: 34 ÷ 7 = 4.86, rounded to 5.
Determine the limits of the class intervals taking the smallest scores at the bottom
of the column to the largest scores at the top.
Determine the number of scores falling in each class interval. This is done by using
a tally or score sheet.
CLASS INTERVALS
Class Interval   Tallies           Frequency
60-64            IIII              5
55-59            IIII I            6
50-54            IIII IIII         10
45-49            IIII IIII II      12
40-44            IIII III          8
35-39            IIII              5
30-34            IIII              4
                                   N = 50
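The tallying described in the steps above can also be carried out programmatically. The Python sketch below groups the fifty raw scores into the same class intervals of width 5 starting at 30; since a raw score or two may have been transcribed differently, a computed count can differ from the hand tally by a point in adjacent intervals.

    from collections import Counter

    scores = [
        37, 42, 44, 51, 48, 30, 47, 56, 52, 31,
        64, 36, 42, 54, 49, 59, 45, 32, 38, 46,
        53, 54, 63, 41, 49, 51, 58, 41, 48, 48,
        43, 37, 52, 55, 61, 43, 46, 48, 62, 35,
        52, 33, 46, 60, 45, 40, 47, 51, 56, 53,
    ]

    width = 5          # length of each class interval (34 / 7, rounded to 5)
    lowest = 30        # lower limit of the bottom class interval

    # map each score to the lower limit of its class interval, then count them
    counts = Counter(lowest + ((s - lowest) // width) * width for s in scores)

    for lower in sorted(counts, reverse=True):   # largest interval at the top
        print(f"{lower}-{lower + width - 1}: {counts[lower]}")
    print("N =", sum(counts.values()))           # 50 scores in all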
TYPES OF FREQUENCY
PICTORIAL FORM
Marks are the bases for important decisions made by the student, teachers,
counselors, parents, school administration and employers. They can also serve as
incentive for increased motivation. Presently marks and reports become the bases
for crucial decisions about the educational and occupational destiny of the student.
In terms of performance assessment, the basis for the assignment of marks is the
student’s achievement of the instructional objectives. Unfortunately, not all teachers agree that achievement of the instructional objectives should be the exclusive basis for marking. Instead they use several other bases.
They often base grades on the student’s attitude, or citizenship, or desirable
attributes of character. A student who shows a cooperative attitude,
responsible citizenship, and strength of character receives a higher mark than
a student who shows a rebellious attitude, underdeveloped citizenship, and weakness of character.
Teachers often base marks on the amount of effort the student invests in
achieving instructional objectives, whether or not these efforts meet with
success. Conceivably, the student who expends more effort and does not
succeed may receive a higher mark than does a student who expends less
effort and does succeed.
Teachers base marks on growth or how much the student has learned, even
though this amount falls short of that required by the instructional objective.
ABSOLUTE
It is sometimes called the absolute system because it assumes the possibility of the
student achieving absolute perfection. It is based on 100% mastery. Percentage
below 100 represents less mastery. According to the definition of absolute
standards of achievement, 100% could simply mean that the student has attained
the standard of acceptable performance specified in the instructional objective.
With the absolute standard, it is harder to explain the meaning of any percentage
below 100 since the student achieves or does not achieve (all or none) the required
standard. A grade of less than 100% could indicate the percentage of the total
number of instructional objectives a student has achieved at some point during or
at the end of the course.
CRITERION-RELATED
RELATIVE STANDARDS
The more popular marking system consists of the assignment of letter grades: A, B,
C, D, and F. A denotes superior, B good, C average, D fair and F failing or
insufficient achievement. It is sometimes called the relative system because the
grades are intended to describe the student’s achievement relative to that of other
students rather than to a standard of perfection or mastery. An ordinary but by no
means universal assumption of this marking system is that the grading should be
on the curve. The curve in question is the normal probability curve. A teacher may
decide that one class has earned more superior or failing grades than another class
which, in turn, has earned more average and good grades.
Essay test scoring calls for higher degrees of competence, and ordinarily takes
considerably more time, than the scoring of objective tests. In addition to this,
essay test scoring presents two special problems. The first is that of providing a
basis for judgment that is sufficiently definite, and of sufficiently general validity,
to give the scores assigned by a particular reader some objective meaning. To be
useful, his scores should not represent purely subjective opinions and personal
biases that equally competent readers might or might not share. The second
problem is that of discounting irrelevant factors, such as quality of handwriting,
verbal fluency, or gamesmanship, in appealing to the scorer’s interests and biases.
The reader’s scores should reflect unbiased estimates of the essential achievements
of the examinee.
One means of improving objectivity and relevancy in scoring essay tests is to
prepare an ideal answer to each essay question and to base the scoring on relations
between examinee answers and the ideal answer. Another is to defer assignment of
scores until the examinee answers have been sorted and resorted into three to nine
sets at different levels of quality. Scoring the test question by question through the
entire set of papers, rather than paper by paper (marking all questions on one paper
before considering the next) improves the accuracy of scoring. If several scorers
will be marking the same questions in a set of papers, it is usually helpful to plan training and practice sessions in which the scorers mark the same papers, compare their marks and strive to reach a common basis for marking.
The construction and scoring of essay questions are interrelated processes that
require attention if a valid and reliable measure of achievement is to be obtained.
In the essay test the examiner is an active part of the measurement instrument. Therefore, the variability within and between examiners affects the resulting score of the examinee. This variability is a source of error which affects the reliability of the essay test if not adequately controlled. Hence, for the essay test result to serve a useful purpose as a valid measurement instrument, a conscious effort is made to score the test objectively by using appropriate methods to minimize the effect of personal biases and idiosyncrasies on the resulting scores, and by applying standards to ensure that only the relevant factors indicated in the course objectives are scored.
In this method each answer is compared with already prepared ideal marking
scheme (scoring key) and marks are assigned according to the adequacy of the
answer. When used conscientiously, the analytic method provides a means for
maintaining uniformity in scoring between scorers and between scripts, thus
improving the reliability of the scoring.
This method is generally used satisfactorily to score Restricted Response
Questions. This is made possible by the limited number of characteristics elicited
by a single answer, which thus defines the degree of quality precisely enough to
assign point values to them. It is also possible to identify the particular weakness or
strength of each examinee with analytic scoring. Nevertheless, it is desirable to rate
each aspect of the item separately. This has the advantage of providing greater
objectivity, which increases the diagnostic value of the result.
In this method the examiner first sorts the response into categories of varying
quality based on his general or global impression on reading the response. The
standard of quality helps to establish a relative scale, which forms the basis for
ranking responses from those with the poorest quality response to those that have
the highest quality response. Usually between five and ten categories are used with the rating method, each pile representing a degree of quality and determining the credit to be assigned. For example, where five categories are used, the responses may be awarded five letter grades (A, B, C, D and E) and sorted into the corresponding five categories.
This method is ideal for the extended response questions where relative judgments
are made (no exact numerical scores) concerning the relevance of ideas,
organization of the material and similar qualities evaluated in answers to extended
response questions. Using this method requires a lot of skill and time in
determining the standard response for each quality category. It is desirable to rate
each characteristic separately. This provides for greater objectivity and increases
the diagnostic value of the results.
The following are procedures for scoring essay questions objectively to enhance
reliability.
Prepare the marking scheme or ideal answer or outline of expected answer
immediately after constructing the test items and indicate how marks are to
be awarded for each section of the expected response.
Use the scoring method that is most appropriate for the test item. That is, use
either the analytic or global method as appropriate to the requirements of the
test item.
Decide how to handle factors that are irrelevant to the learning outcomes
being measured. These factors may include legibility of handwriting,
spelling, sentence structure, punctuation and neatness. These factors should
be controlled when judging the content of the answers. Also decide in
advance how to handle the inclusion of irrelevant materials (uncalled for
responses).
Score only one item in all the scripts at a time. This helps to control the
“halo” effect in scoring.
Evaluate the responses anonymously, without knowledge of which examinee's script you are scoring. This helps in controlling bias in scoring the essay questions.
Evaluate the marking scheme (scoring key) before actual scoring by scoring
a random sample of examinees actual responses. This provides a general
idea of the quality of the response to be expected and might call for a
revision of the scoring key before commencing actual scoring.
Make comments during the scoring of each essay item. These comments act
as feedback to examinees and a source of remediation to both examinees and
examiners.
Obtain two or more independent ratings if important decisions are to be based on the results. The results of the different scorers should be compared and the ratings moderated to resolve discrepancies, giving more reliable results.
I. MANUAL SCORING
In this method of scoring, the answers to the test items are scored by direct comparison of the examinee's answers with the marking key. If the answers are recorded on the test paper itself, for instance, a scoring key can be made by marking the correct answers on a blank copy of the test. Scoring is then done by simply comparing the columns of answers on the master copy with the columns of answers on each examinee's test paper. Alternatively, the correct answers are recorded on strips of paper, and this strip key on which the columns of answers are recorded is used as the master for scoring the examinees' test papers.
II. STENCIL SCORING
Where separate answer sheets are used by examinees for recording their answers, it is most convenient to prepare and use a scoring stencil. A scoring stencil is prepared by punching holes in a blank answer sheet where the correct answers are supposed to appear. Scoring is then done by laying the stencil over each answer sheet and counting the number of answer checks appearing through the holes. At the end of this scoring procedure, each test paper is scanned to eliminate possible errors due to examinees supplying more than one answer or an item having more than one correct answer.
III. MACHINE SCORING
If the number of examinees is large, specially prepared answer sheets are used to answer the questions. The answers are normally shaded at the appropriate places assigned to the various items. These special answer sheets are then machine scored with computers and other scoring devices, using a certified answer key prepared for the test items. In scoring an objective test, it is usually preferable to count each correct answer as one point; an examinee's score is simply the number of items answered correctly.
One question that often arises is whether or not objective test scores should be
corrected for guessing. Differences of opinion on this question are much greater
and more easily observable than differences in the accuracy of the scores produced
by the two methods of scoring. If well-motivated examinees take a test that is
appropriate to their abilities, little blind guessing is likely to occur. There may be
many considered guesses, if every answer given with less than complete certainty
is called a guess. But the examinee’s success in guessing right after thoughtful
consideration is usually a good measure of his achievement.
Since the meaning of most achievement test scores is relative, not absolute—the
scores serve only to indicate how the achievement of a particular examinee
compares with that of other examinees—the argument that scores uncorrected for
guessing will be too high carries little weight. Indeed, one method of correcting for
guessing results in scores higher than the uncorrected scores.
The logical objective of most guessing correction procedures is to eliminate the
expected advantage of the examinee who guesses blindly in preference to omitting
an item. This can be done by subtracting a fraction of the number of wrong
answers from the number of right answers, using the formula S = R – W/ (k – 1)
where S is the score corrected for guessing, R is the number of right answers, W is
the number of wrong answers, and k is the number of choices available to the
examinee in each item. An alternative formula is S = R + O/k where O is the
number of items omitted, and the other symbols have the same meaning as before.
Both formulas rank any set of examinee answer sheets in exactly the same relative
positions, although the second formula yields a higher score for the same answers
than does the first.
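A minimal Python sketch of the two formulas, applied to a hypothetical four-choice test with invented answer patterns, shows that both produce the same rank order while the second always yields the higher score.

    def rights_minus_wrongs(right, wrong, k):
        # S = R - W / (k - 1)
        return right - wrong / (k - 1)

    def rights_plus_omits(right, omitted, k):
        # S = R + O / k
        return right + omitted / k

    k = 4                                  # choices per item (hypothetical)
    patterns = [                           # (right, wrong, omitted), hypothetical
        (70, 30, 0),
        (70, 20, 10),
        (55, 15, 30),
    ]

    for right, wrong, omitted in patterns:
        s1 = rights_minus_wrongs(right, wrong, k)
        s2 = rights_plus_omits(right, omitted, k)
        print(f"R={right} W={wrong} O={omitted}  S1={s1:.2f}  S2={s2:.2f}")

    # On an n-item test, S2 = n/k + ((k - 1)/k) * S1, so the two corrections rank
    # every answer sheet identically, with S2 always the larger score.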
Logical arguments for and against correction for guessing on objective tests are
complex and elaborate. But both these arguments and the experimental data point
to one general conclusion. In most circumstances a correction for guessing is not
likely to yield scores that are appreciably more or less accurate than the
uncorrected scores.
REPORTING
The most popular method of reporting marks is the report card. Most modern
report cards contain grades and checklist items. The grades describe the level of
achievement, and the checklists describe other areas such as effort, conduct,
homework, and social development.
Because the report card does not convey all the information parents sometimes seek, and to improve cooperation between parents and teachers, schools often use parent-teacher conferences. The teacher invites the parents to the school for a short
interview. The conferences allow the teacher to provide fuller descriptions of the
student’s scholastic and social development and allow parents to ask questions,
describe the home environment and plan what they may do to assist their children’s
educational development. There are inherent weaknesses in the conferences and
ordinarily they should supplement rather than replace the report card.
Despite the rather obvious limitations of validity, reliability, and interpretation,
reform of these marking systems has had only temporary appeal. Reforms
advocating the elimination of marks have failed because students, teachers, counselors, parents, administrators, and employers believe they enjoy distinct
advantages in knowing the student’s marks. Many know that marks mislead them,
but many believe that some simplified knowledge of the student’s achievement is
better than no knowledge at all.
QUESTION NO 5
Q: Write the considerations in test administration before, during and after the test.
Planning of the test is the first important step in the test construction. The main
goal of the evaluation process is to collect valid, reliable and useful data about the student. Therefore, before preparing any test, we must keep in mind:
What is to be measured?
What content areas should be included?
What types of test items are to be included?
A test can be used for different purposes in a teaching learning process. It can be
used to measure the entry performance, the progress during the teaching learning
process and to decide the mastery level achieved by the students. Tests serve as a
good instrument to measure the entry performance of the students. They answer the questions of whether the students have the requisite skills to enter the course and what previous knowledge the pupils possess. Therefore it must be decided whether the test will be used to measure the entry performance or the previous knowledge acquired by the student on the subject.
Tests can also be used for formative evaluation. Formative testing helps to carry on the teaching-learning process, to find out immediate learning difficulties and to suggest remedies. When the difficulties remain unsolved we may use diagnostic tests. Diagnostic tests should be prepared with great technical care, and specific items to diagnose specific areas of difficulty should be included in the test.
Tests are used to assign grades or to determine the mastery level of the students.
These summative tests should cover the whole instructional objectives and content
areas of the course. Therefore attention must be given towards this aspect while
preparing a test.
The second important step in test construction is to prepare the test specifications. In order to be sure that the test will measure a representative sample of the instructional objectives and content areas, we must prepare test specifications; an elaborate design is therefore necessary for test construction. One of the most commonly used devices for this purpose is the ‘Table of Specification’ or ‘Blue Print.’
There is a vast array of instructional objectives, and we cannot include them all in a single test. In a written test we cannot measure the psychomotor and affective domains; we can only measure the cognitive domain. It is also true that not all subjects contain the different learning objectives, such as knowledge, understanding, application and skill, in equal proportion. Therefore it must be planned how much weightage is to be given to the different instructional objectives. While deciding this we must keep in mind the importance of the particular objective for that subject or chapter.
For example, if we have to prepare a test in General Science for Class X, we may give the weightage to the different instructional objectives as follows:
Table showing weightage given to different instructional objectives in a test of 100 marks:
It also prevents repetition or omission of any unit. Now the question arises of how much weightage should be given to which unit. Some experts say that it should be decided by the concerned teacher, keeping the importance of the chapter in mind. Others say that it should be decided according to the area covered by the topic in the textbook. Generally it is decided on the basis of the pages of the topic, the total pages in the book and the number of items to be prepared. For example, if a test of 100 marks is to be prepared, then the weightage for the different topics will be given as follows.
WEIGHTAGE OF A TOPIC:
If a book contains 250 pages and 100 test items (marks) are to be constructed, then the weightage will be given as follows:
Table showing weightage given to different content areas:
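As a worked illustration of this rule (the topic page counts are hypothetical): weightage of a topic = (pages of the topic ÷ total pages in the book) × total marks. A unit covering 50 of the 250 pages would therefore receive (50 ÷ 250) × 100 = 20 marks, while a unit covering 25 pages would receive 10 marks.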
DETERMINING THE ITEM TYPES:
Preparation of the three-way chart is the last step in preparing the table of specification. This chart relates the instructional objectives to the content areas and the types of items. In a table of specification the instructional objectives are listed across the top of the table, the content areas are listed down the left side of the table, and under each objective the types of items are listed content-wise. Table 3.3 is a model table of specification for Class X science.
After planning, preparation of the items is the next important step in test construction. In this step the test items are constructed in accordance with the table of specification. Each type of test item needs special care in construction.
The test items should be so designed that they measure the performance described in the specific learning outcomes; that is, the test items must be in accordance with the performance described in the specific learning outcome.
For example:
Specific learning outcome—Knows basic terms.
Test item—An individual is considered obese when his weight is ____% more than the recommended weight.
TEST ITEMS SHOULD MEASURE ALL TYPES OF INSTRUCTIONAL OBJECTIVES AND THE
WHOLE CONTENT AREA:
The items in the test should be so prepared that they cover all the instructional objectives (knowledge, understanding, thinking skills) and match the specific learning outcomes and subject matter content being measured. When the items are constructed on the basis of the table of specification, the items become relevant.
Example:
Poor item—Where was Allama Iqbal born?
Better—In which city was Allama Iqbal born?
The test items should be of the proper difficulty level, so that they can discriminate properly.
If the item is meant for a criterion-referenced test its difficulty level should be as
per the difficulty level indicated by the statement of specific learning outcome.
Therefore if the learning task is easy the test item must be easy and if the learning
task is difficult then the test item must be difficult.
THE TEST ITEM MUST BE FREE FROM TECHNICAL ERRORS AND IRRELEVANT CLUES
Sometimes there are unintentional clues in the statement of an item which help the pupil to answer correctly, for example grammatical inconsistencies, verbal associations, extreme words (ever, seldom, always), and mechanical features (the correct statement being longer than the incorrect ones). Therefore, while constructing a test item, careful steps must be taken to avoid most of these clues.
TEST ITEMS SHOULD BE FREE FROM RACIAL, ETHNIC AND SEXUAL BIAS:
The items should be universal in nature. Care must be taken to make each item culture-fair. While portraying a role, all sections of society should be given equal importance. The terms used in the test item should have a universal meaning for all members of the group.
PREPARING INSTRUCTION FOR THE TEST:
This is the most neglected aspect of test construction. Generally everybody gives attention to the construction of the test items, so test makers often do not attach directions to the test items. But the validity and reliability of the test depend to a great extent upon the instructions for the test. N.E. Gronlund has suggested that the test maker should provide clear-cut directions about:
The purpose of testing.
The time allowed for answering.
The basis for answering.
The procedure for recording answers.
The methods to deal with guessing.
A written statement about the purpose of the testing maintains the uniformity of the
test. Therefore there must be a written instruction about the purpose of the test
before the test items.
Clear-cut instructions must be supplied to the pupils about the time allowed for the whole test. It is also better to indicate the approximate time required for answering each item, especially in the case of essay-type questions. The test maker should therefore carefully judge the amount of time needed, taking into account the types of items, the age and ability of the students and the nature of the learning outcomes expected. Experts are of the opinion that it is better to allow more time than to deprive a slower student of the chance to answer the questions.
The test maker should provide specific directions on the basis of which the students will answer the items. The directions must clearly state whether the students will select the answer or supply the answer. In matching items, the basis for matching the premises and responses (e.g., states with capitals or countries with products) should be given. Special directions are necessary for interpretive items. In essay-type items, clear directions must be given about the types of responses expected from the pupils.
Students should be instructed where and how to record the answers. Answers may be recorded on separate answer sheets or on the test paper itself. If they have to answer on the test paper itself, then they must be directed whether to write the correct answer or to indicate the correct answer from among the alternatives. Where separate answer sheets are used, the direction may be given either on the test paper or on the answer sheet.
Directions must be provided to the students as to whether they should guess on uncertain items, in the case of recognition-type test items. If nothing is stated about guessing, then the bold students will guess on these items while others will answer only those items of which they are confident, so the bold pupils will by chance answer some items correctly and secure a higher score. Therefore a direction must be given to make considered guesses but not wild guesses.
A scoring key increases the reliability of a test, so the test maker should provide the procedure for scoring the answer scripts. Directions must be given as to whether the scoring will be done with a scoring key (when the answer is recorded on the test paper) or with a scoring stencil (when the answer is recorded on a separate answer sheet), and how marks will be awarded to the test items.
In the case of essay-type items it should be indicated whether to score with the ‘point method’ or with the ‘rating method.’ In the point method each answer is compared with a set of ideal answers in the scoring key, and a given number of points is then assigned.
In the rating method the answers are rated on the basis of degrees of quality and
determine the credit assigned to each answer. Thus a scoring key helps to obtain a
consistent data about the pupils’ performance. So the test maker should prepare a
comprehensive scoring procedure along with the test items.
Once the test is prepared, it is time to confirm the validity, reliability and usability of the test. Try-out helps us to identify defective and ambiguous items, to determine the difficulty level of the test and to determine the discriminating power of the items.
One should follow the following principles during the test administration:
The teacher should talk as little as possible.
The teacher should not interrupt the students at the time of testing.
The teacher should not give any hints to any student who has asked about
any item.
The teacher should provide proper invigilation in order to prevent the
students from cheating.
SCORING THE TEST:
Once the test is administered and the answer scripts are obtained, the next step is to score the answer scripts. A scoring key may be used for scoring when the answer is on the test paper itself; a scoring key is a sample answer script on which the correct answers are recorded.
When the answers are on a separate answer sheet, a scoring stencil may be used for scoring the items. A scoring stencil is a sample answer sheet on which the correct alternatives have been punched. By laying the scoring stencil on the pupil's answer script, the correct answers can be marked. For essay-type items, separate instructions for scoring each learning objective may be provided.
When the pupils do not have sufficient time to answer the test, or the students are not ready to take the test, they guess the correct answers in recognition-type items.
In that case, to eliminate the effect of guessing, the correction formula S = R – W/(k – 1) described earlier is used.
But there is a lack of agreement among psychometricians about the value of the
correction formula so far as validity and reliability are concerned. In the words of
Ebel “neither the instruction nor penalties will remedy the problem of guessing.”
Guilford is of the opinion that “when the middle is excluded in item analysis the
question of whether to correct or not correct the total scores becomes rather
academic.” Little said “correction may either under or over correct the pupils’
score.” Keeping in view the above opinions, the test-maker should decide not to use
the correction for guessing. To avoid this situation he should give enough time for
answering the test item.
Evaluating the test is the most important step in the test construction process. Evaluation is necessary to determine the quality of the test and the quality of the responses. Quality of the test implies how good and dependable the test is (validity and reliability). Quality of the responses indicates which items are misfits in the test. Evaluation also enables us to judge the usability of the test in the general classroom situation.
ITEM ANALYSIS:
Item analysis is a procedure which helps us to find out the answers to the following
questions:
Whether the items function as intended?
Whether the test items have an appropriate difficulty level?
Whether the items are free from irrelevant clues and other defects?
Whether the distracters in multiple-choice items are effective?
Item analysis procedure gives special emphasis on item difficulty level and item
discriminating power.
For example, if the test is administered to 60 students, then select 16 test papers from the highest end and 16 test papers from the lowest end.
Keep aside the other test papers, as they are not required in the item analysis.
Tabulate the number of pupils in the upper and lower groups who selected each alternative for each test item. This can be done on the back of the test paper, or a separate test item card may be used.
It implies that the item has a proper difficulty level, because it is customary to follow the 25%-to-75% rule in judging item difficulty: if an item has a difficulty of more than 75% it is too easy, and if the difficulty is less than 25% the item is too difficult.
In our example (Fig. 3.1), 15 students from the upper group responded to the item correctly and 5 from the lower group responded to the item correctly.
A high positive ratio indicates high discriminating power; here .63 indicates an average discriminating power. If all 16 students from the lower group and all 16 students from the upper group answer the item correctly, then the discriminating power will be 0.00, which indicates that the item has no discriminating power. If all 16 students from the upper group answer the item correctly and all the students from the lower group answer the item incorrectly, then the item discriminating power will be 1.00, which indicates an item with maximum positive discriminating power.
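A minimal Python sketch of these computations, using the figures from the example above (16 papers in each group, 15 correct in the upper group and 5 correct in the lower group):

    def item_difficulty(upper_correct, lower_correct, group_size):
        # proportion of the two groups combined that answered the item correctly
        return (upper_correct + lower_correct) / (2 * group_size)

    def discriminating_power(upper_correct, lower_correct, group_size):
        # difference between the upper-group and lower-group success rates
        return (upper_correct - lower_correct) / group_size

    p = item_difficulty(15, 5, 16)        # 20 / 32 = 0.625, within the 25%-75% range
    d = discriminating_power(15, 5, 16)   # (15 - 5) / 16 = 0.625, reported as .63 above

    print(f"difficulty = {p:.2f}, discriminating power = {d:.2f}")
    # If both groups answer the item correctly (16 and 16), d = 0.00;
    # if only the upper group answers correctly (16 and 0), d = 1.00.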
Distracter ‘D’ attracted more pupils from the upper group than from the lower group, which indicates that ‘D’ is not an effective distracter. ‘E’ is a distracter that was not chosen by anyone, so it also needs revision. Distracters ‘A’ and ‘B’ prove to be effective, as they attract more pupils from the lower group.
Once the item analysis process is over we can get a list of effective items. Now the
task is to make a file of the effective items. It can be done with item analysis cards.
The items should be arranged according to the order of difficulty. While filing the
items the objectives and the content area that it measures must be kept in mind.
This helps in the future use of the item.
At the time of evaluation it is estimated to what extent the test measures what the test maker intends it to measure.
The evaluation process also estimates to what extent the test is consistent from one measurement to another; otherwise the results of the test cannot be dependable.
Try-out and the evaluation process indicate to what extent the test is usable in the general classroom condition, that is, how far the test is usable from the administration, scoring, time and economy points of view.