Validity and Reliability of Tests

Reliability refers to the consistency and stability of a test to measure what it is intended to measure. A reliable test will produce consistent results across time and raters. Validity refers to how well a test measures what it claims to measure. A valid test accurately reflects the property or concept it is intended to measure. Reliability is a necessary but not sufficient condition for validity - a test can be reliable but not valid if it consistently measures something different than intended. Both reliability and validity are important to evaluate the quality of a test and ensure accurate conclusions can be drawn from test results.

QUESTION NO 1

Q1: How will you define validity and reliability of a test?

TEST RELIABILITY AND VALIDITY DEFINED

RELIABILITY
      Test reliability refers to the degree to which a test is consistent and stable in
measuring what it is intended to measure. Most simply put, a test is reliable if it is
consistent within itself and across time. To understand the basics of test reliability,
think of a bathroom scale that gave you drastically different readings every time
you stepped on it regardless of whether you had gained or lost weight. If such a
scale existed, it would be considered not reliable.

VALIDITY
      Test validity refers to the degree to which the test actually measures what it
claims to measure. Test validity is also the extent to which inferences, conclusions,
and decisions made on the basis of test scores are appropriate and meaningful. For
example, the 2000 and 2008 Hoover studies present evidence that Ohio's mandated
accountability tests are not valid: the conclusions and decisions that are made on the
basis of OPT (Ohio Proficiency Test) performance are not based upon what the test
claims to be measuring.

THE RELATIONSHIP OF RELIABILITY AND VALIDITY

Test validity is requisite to test reliability. If a test is not valid, then reliability is
moot. In other words, if a test is not valid there is no point in discussing reliability,
because test validity is required before reliability can be considered in any
meaningful way. Likewise, if a test is not reliable it is also not valid. Therefore,
the two Hoover studies do not examine reliability.

RELIABILITY VS. VALIDITY: WHAT’S THE DIFFERENCE?


Reliability and validity are concepts used to evaluate the quality of research. They
indicate how well a method, technique or test measures something. Reliability is
about the consistency of a measure, and validity is about the accuracy of a
measure.
It’s important to consider reliability and validity when you are creating
your research design, planning your methods, and writing up your results,
especially in quantitative research.

RELIABILITY VS. VALIDITY

What does it tell you?
 Reliability: The extent to which the results can be reproduced when the research is
repeated under the same conditions.
 Validity: The extent to which the results really measure what they are supposed to
measure.

How is it assessed?
 Reliability: By checking the consistency of results across time, across different
observers, and across parts of the test itself.
 Validity: By checking how well the results correspond to established theories and
other measures of the same concept.

How do they relate?
 Reliability: A reliable measurement is not always valid: the results might be
reproducible, but they’re not necessarily correct.
 Validity: A valid measurement is generally reliable: if a test produces accurate
results, they should be reproducible.

UNDERSTANDING RELIABILITY VS. VALIDITY


Reliability and validity are closely related, but they mean different things. A
measurement can be reliable without being valid. However, if a measurement is
valid, it is usually also reliable.

WHAT IS RELIABILITY?
Reliability refers to how consistently a method measures something. If the same
result can be consistently achieved by using the same methods under the same
circumstances, the measurement is considered reliable.
You measure the temperature of a liquid sample several times under identical
conditions. The thermometer displays the same temperature every time, so the
results are reliable.
A doctor uses a symptom questionnaire to diagnose a patient with a long-term
medical condition. Several different doctors use the same questionnaire with the
same patient but give different diagnoses. This indicates that the questionnaire has
low reliability as a measure of the condition.

WHAT IS VALIDITY?

Validity refers to how accurately a method measures what it is intended to


measure. If research has high validity that means it produces results that
correspond to real properties, characteristics, and variations in the physical or
social world.
High reliability is one indicator that a measurement is valid. If a method is not
reliable, it probably isn’t valid.
If the thermometer shows different temperatures each time, even though you have
carefully controlled conditions to ensure the sample’s temperature stays the same,
the thermometer is probably malfunctioning, and therefore its measurements are
not valid.
If a symptom questionnaire results in a reliable diagnosis when answered at
different times and with different doctors, this indicates that it has high validity as
a measurement of the medical condition.
However, reliability on its own is not enough to ensure validity. Even if a test is
reliable, it may not accurately reflect the real situation.
The thermometer that you used to test the sample gives reliable results. However,
the thermometer has not been calibrated properly, so the result is 2 degrees lower
than the true value. Therefore, the measurement is not valid.
A group of participants take a test designed to measure working memory. The
results are reliable, but participants’ scores correlate strongly with their level of
reading comprehension. This indicates that the method might have low validity:
the test may be measuring participants’ reading comprehension instead of their
working memory.
Validity is harder to assess than reliability, but it is even more important. To obtain
useful results, the methods you use to collect your data must be valid: the research
must be measuring what it claims to measure. This ensures that your discussion of
the data and the conclusions you draw are also valid.

HOW ARE RELIABILITY AND VALIDITY ASSESSED?

Reliability can be estimated by comparing different versions of the same


measurement. Validity is harder to assess, but it can be estimated by comparing the
results to other relevant data or theory. Methods of estimating reliability and
validity are usually split up into different types.

TYPES OF RELIABILITY

Different types of reliability can be estimated through various statistical methods.

TEST-RETEST RELIABILITY
What does it assess? The consistency of a measure across time: do you get the same
results when you repeat the measurement?
Example: A group of participants complete a questionnaire designed to measure
personality traits. If they repeat the questionnaire days, weeks or months apart and
give the same answers, this indicates high test-retest reliability.

INTERRATER RELIABILITY
What does it assess? The consistency of a measure across raters or observers: do you
get the same results when different people conduct the same measurement?
Example: Based on an assessment criteria checklist, five examiners submit
substantially different results for the same student project. This indicates that the
assessment checklist has low inter-rater reliability (for example, because the criteria
are too subjective).

INTERNAL CONSISTENCY
What does it assess? The consistency of the measurement itself: do you get the same
results from different parts of a test that are designed to measure the same thing?
Example: You design a questionnaire to measure self-esteem. If you randomly split
the results into two halves, there should be a strong correlation between the two sets
of results. If the two results are very different, this indicates low internal
consistency.
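
For illustration, the sketch below shows how an internal-consistency (split-half)
estimate can be computed: it correlates two halves of a test and applies the
Spearman-Brown correction. The score lists and the hand-rolled Pearson correlation
are hypothetical illustration material, not taken from any particular instrument;
correlating scores from two testing occasions in the same way would give a
test-retest estimate.

# Minimal sketch: split-half reliability with Spearman-Brown correction.
# The score lists below are hypothetical illustration data.

def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# Each student's total on the odd-numbered and even-numbered items.
odd_half  = [12, 15, 9, 18, 14, 11, 16, 10]
even_half = [11, 16, 10, 17, 13, 12, 15, 9]

half_r = pearson_r(odd_half, even_half)
# Spearman-Brown: estimate the reliability of the full-length test.
split_half_reliability = 2 * half_r / (1 + half_r)
print(round(half_r, 2), round(split_half_reliability, 2))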

TYPES OF VALIDITY

The validity of a measurement can be estimated based on three main types of


evidence. Each type can be evaluated through expert judgment or statistical
methods.

CONSTRUCT VALIDITY
What does it assess? The adherence of a measure to existing theory and knowledge
of the concept being measured.
Example: A self-esteem questionnaire could be assessed by measuring other traits
known or assumed to be related to the concept of self-esteem (such as social skills
and optimism). Strong correlation between the scores for self-esteem and associated
traits would indicate high construct validity.

CONTENT VALIDITY
What does it assess? The extent to which the measurement covers all aspects of the
concept being measured.
Example: A test that aims to measure a class of students’ level of Spanish contains
reading, writing and speaking components, but no listening component. Experts
agree that listening comprehension is an essential aspect of language ability, so the
test lacks content validity for measuring the overall level of ability in Spanish.

CRITERION VALIDITY
What does it assess? The extent to which the result of a measure corresponds to
other valid measures of the same concept.
Example: A survey is conducted to measure the political opinions of voters in a
region. If the results accurately predict the later outcome of an election in that
region, this indicates that the survey has high criterion validity.
To assess the validity of a cause-and-effect relationship, you also need to
consider internal validity (the design of the experiment) and external validity (the
generalizability of the results).

QUESTION NO 2

What are the elements of test specifications?

THE COMPONENTS OF TEST SPECIFICATIONS


WHAT IS THE TEST COMPONENT OF SPECIFICATION?

The test specification is a written document that provides essential background
information about the planned exam program. This information is then used to focus
and guide the remaining steps in the test development process. Specifications are
generative and explanatory in nature, and a key benefit of using test specifications is
their efficiency (Alderson et al., 1995:2).
WHAT ARE SPECS?
Test specifications – usually called ‘specs’ – are generative explanatory documents
for the creation of test tasks. Specs have the role of a generative blue print, from
which many equivalent test items or tasks can be produced. Specs tell us the
rationale behind the various choices that we make. Specs tell us the nuts and bolts
of how to phrase the test items. (Davidson & Lynch, 2002)
McNamara, (2000:31), defines it as ‘a set of instructions for creating the test’ and
its purpose is to ‘make explicit the design decisions in the test and to allow new
versions to be written in the future by someone other than the test developer’.
WHAT IS SPECS GOAL IN A TEST?

The ultimate goal of a specs review is to build a stronger validity argument by


identifying issues that might undermine the validity.
Brown (1994:387) simply calls it a ‘practical outline of your test’; Brown’s
questions, when borne in mind, would aid in writing good test specifications: Are
the directions to each section absolutely clear? Does each item measure a specific
objective? Is each item stated in clear, simple language? Does the difficulty of each
item seem to be appropriate for the students? Does the sum of the items and the test
as a whole adequately reflect the learning objectives?
Ruch’s (1929) viewpoint about test specifications: he may have been the earliest
proponent of test specifications in educational and psychological assessment. He set
out detailed rules of procedure for test specifications and recognized the need for
specifications to be immediately relevant to the local context and test; he believed
that more general statements would probably be impossible.

HUGHES, BACHMAN AND PALMER VIEW ABOUT TEST


SPECIFICATIONS:

Hughes (2003) was an early advocate for an increased level of detail, although
according to him it is not to be expected that everything in the specification will
always appear in the test. Bachman and Palmer (1996) and Alderson (1995) called
for more detail to be included in test specifications, although Bachman and Palmer
were more detailed than Alderson.

DAVIDSON AND LYNCH’S ITERATIVE-BASED MODEL OF TEST


SPECIFICATIONS:

According to Davidson and Lynch (2002:20), there is no single best format for test
specifications; ‘the principles are universal’. Their specification model calls for test
developers to include a general description (GD), prompt attributes (PA), response
attributes (RA), sample items (SI) and, if necessary, a specification supplement (SS).

JUSTIFICATION FOR THE TEST SPECIFICATIONS CORNERSTONES:

Most commentators consider validity and reliability to be the most important
criteria in judging a test’s quality, but there are many other factors to take into
consideration in producing a good test; together these criteria are often called the
‘cornerstones’ of testing, although Bachman and Palmer (1996) caution that
satisfying all of them fully is an impossible task.

TEST SPECIFICATIONS CORNERSTONES:

VALIDITY
The term validity refers to the extent to which a test measures what it says it
measures (Heaton 1988:159).
RELIABILITY:

Reliability is concerned with ensuring consistency of test scores.

PRACTICALITY:

Practical issues include time, resources and administrative logistics; practicality is,
perhaps, one of the most important qualities of a test.

WASHBACK:

According to Buck (1988), washback refers to the effect of testing on teaching
and learning.

AUTHENTICITY:

It is an important criterion for judging a test’s quality; good testing should strive
to use formats and tests that mirror the types of situations in which students would
authentically use the target language.

TRANSPARENCY:

Transparency refers to the availability of clear, accurate information to students


about testing.

SCORER RELIABILITY:

It means clearly defining the weighting for each section and describing specific
criteria for marking and grading (Weir, 1993, p. 7).

CRITERIA FOR ASSESSMENT:


Assessment checks whether the response is relevant to the assigned topic or not,
and what level of thoroughness it presents.

RATING SCALES:

Scoring is often taken for granted in language tests, something the test writer should
not do. Scoring will be either:

ANALYTICAL:

A type of rating scale that requires teachers to allot separate ratings for the
different components of language ability.

HOLISTIC:

One based on an impressionistic method of scoring. (Davies 1990)

THE COMPONENTS OF TEST SPECIFICATIONS FOR TEACHING SKILLS:

1. LISTENING
 Knowledge
 Comprehension
 Application
 Analysis
 Synthesis
 Evaluation
2. SPEAKING:
 Accuracy
 Fluency
 Appropriacy
 Coherence and cohesion
 Use of language functions
 Managing a discussion
 Task fulfillment
3. READING:
 Comprehension
 Application
 Analysis
 Synthesis
 Evaluation
4. WRITING:
 Accuracy
 Fluency
 Appropriacy
 Coherence and cohesion
 Use of language functions
 Managing a discussion
 Task fulfillment

CRITERIA

INTRODUCTION

After the overall content of the test has been established through a job analysis, the
next step in test development is to create the detailed test specifications. Test
specifications usually include a test description component and a test blueprint
component. The test description specifies aspects of the planned test such as the
test purpose, the target examinee population, the overall test length, and more. The
test blueprint, sometimes also called the table of specifications, provides a listing
of the major content areas and cognitive levels intended to be included on each test
form. It also includes the number of items each test form should include within
each of these content and cognitive areas.

COMPONENTS OF THE TEST SPECIFICATIONS

TEST DESCRIPTION

The test description component of an exam program's test specifications is a


written document that provides essential background information about the
planned exam program. This information is then used to focus and guide the
remaining steps in the test development process. At a minimum, the test
description may simply indicate who will be tested and what the purpose of the
exam program is. More often, the test description will usually also include
elements such as the overall test length, the test administration time limit, and the
item types that are expected to be used (e.g., multiple choice, essays). In some
cases the test description may also specify a test administration mode (e.g., paper-
and-pencil, performance-based, computer-based). And, if the test will include any
items or tasks that will need to be scored by human raters, the test description may
also include plans for the scoring procedures and scoring rubrics.

Test Blueprint

The content areas listed in the test blueprint, or table of specifications, are
frequently drawn directly from the results of a job analysis. These content areas
comprise the knowledge, skills, and abilities that have been determined to be the
essential elements of competency for the job or occupation being assessed. In
addition to the listing of content areas, the test blueprint specifies the number or
proportion of items that are planned to be included on each test form for each
content area. These proportions reflect the relative importance of each content area
to competency in the occupation.
Most test blueprints also indicate the levels of cognitive processing that the
examinees will be expected to use in responding to specific items (e.g.,
Knowledge, Application). It is critical that your test blueprint and test items
include a substantial proportion of items targeted above the Knowledge-level of
cognition. A typical test blueprint is presented in a two-way matrix with the
content areas listed in the table rows and the cognitive processes in the table
columns. The total number of items specified for each column indicates the
proportional plan for each cognitive level on the overall test, just as the total
number of items for each row indicates the proportional emphasis of each content
area. The test blueprint is used to guide and target item writing as well as for test
form assembly. Use of a test blueprint improves consistency across test forms as
well as helping ensure that the goals and plans for the test are met in each
operational test. An example of a test blueprint is provided next.

EXAMPLE OF A TEST BLUEPRINT

In the (artificial) test blueprint for a Real Estate licensure exam given below the
overall test length is specified as 80 items. This relatively small test blueprint
includes four major content areas for the exam (e.g., Real Estate Law). Three
levels of cognitive processing are specified. These are Knowledge,
Comprehension, and Application. Each test form written to this table of
specifications will include 40% of the total test (or 32 items) in the content area of
Real Estate Law. In addressing cognitive levels, 35% of the overall test (or 28
items) will be included at the Knowledge-level. The interior cells of the table
indicate the number of items that are intended to be on the test from each content
and cognitive area combination. For example, the test form will include 16 items at
the Knowledge-level in the content area of Real Estate Law.
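
A rough sketch of how such a blueprint can be represented and checked is given
below. Only the figures stated above come from the example (80 items in total; 32
items, or 40%, in Real Estate Law; 28 items, or 35%, at the Knowledge level; 16
Knowledge-level items in Real Estate Law); the remaining content areas and cell
counts are hypothetical placeholders.

# Hypothetical blueprint matrix: rows are content areas, columns are
# (Knowledge, Comprehension, Application). Only the Real Estate Law row
# and the stated totals follow the example; other figures are invented.
blueprint = {
    "Real Estate Law":        (16, 10, 6),   # 32 items = 40% of the test
    "Contracts":              (6, 8, 6),
    "Property Valuation":     (4, 6, 4),
    "Finance and Settlement": (2, 6, 6),
}

row_totals = {area: sum(cells) for area, cells in blueprint.items()}
column_totals = [sum(col) for col in zip(*blueprint.values())]
total_items = sum(row_totals.values())

assert total_items == 80                 # overall test length
assert row_totals["Real Estate Law"] == 32
assert column_totals[0] == 28            # Knowledge level = 35% of 80
print(row_totals, column_totals, total_items)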

QUESTION NO 3

Q3: Write a note on interpreting test scores using percentages.

INTERPRETATION OF TEST SCORES

TEST INTERPRETATION

Test interpretation is the process of analyzing the scores in a test, translating
qualitative data into quantitative form and grades into numerical values. Score
interpretation is the same as test interpretation.

SCORES:

“A summary of the evidence contained in an examinee's responses to the items of a
test that are related to the construct or constructs being measured.”

Types of scores:
 Raw scores
 Scaled scores

RAW SCORES:

The number of points received on a test when the test has been scored according to
directions. Example:
 A student, Ali, got a raw score of 10 out of 20 on a 20-item quiz.
 A raw score invites only an immediate, literal reading as a count of points.
 It does not yield a meaningful interpretation on its own, because it is just a raw
score.
 Thus, we have to interpret Ali’s score in a more descriptive and meaningful
way.

SCALED SCORES:

Scaled scores are the results of a transformation of raw scores onto a consistent
scale. Examples:
 A child awarded a scaled score of 100 is judged to have met the “National
Standard” in the area judged by the test.
 A child awarded a scaled score of more than 100 is judged to have exceeded the
national standard and demonstrated higher than the expected knowledge of the
curriculum for their age.
 A child awarded a scaled score of less than 100 is judged to have not yet met the
“National Standard” and to be performing below the expectation for their age.
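
As a minimal sketch of this interpretation rule (the threshold of 100 comes from the
examples above; the wording of the labels and the helper function are illustrative
only):

# Hypothetical helper: interpret a scaled score against the expected
# standard of 100 described above.
def interpret_scaled_score(score: int) -> str:
    if score > 100:
        return "Exceeded the national standard"
    if score == 100:
        return "Met the national standard"
    return "Has not yet met the national standard"

print(interpret_scaled_score(103))  # Exceeded the national standard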

METHODS OF INTERPRETING TEST SCORES

REFERENCING FRAMEWORK

A referencing framework is a structure you can use to compare student


performance to something external to the assessment itself.
 Criterion Referencing Framework
 Norm Referencing Framework

CRITERION REFERENCING FRAMEWORK


A criterion referencing framework permits us to describe an individual’s
performance without referring to the performance of others. It infers the kind of
performance a student can do in a domain, rather than the student’s relative
standing in a norm group. The criterion is the domain of performance to which you
reference a student’s assessment results.
It is the most widely used interpretation because of its ease of computation and
because a ready transmutation table is printed on the inside back cover of the
teacher’s class record.
A criterion-referenced interpretation of a score requires a comparison of a particular
student’s score with a subjective and pre-determined performance standard (the
criteria). Criterion-referenced and standards-based interpretation of test results is
most meaningful when the test has been specifically designed for this purpose.

CRITERION REFERENCED INTERPRETATION

Describes student performance according to a specified domain of clearly defined
learning tasks. It is the concern of national examinations and other assessment
bodies and is used in the assessment of vocational and academic qualifications.
Results are given on a pass/fail, competent/not competent basis. Results are
conclusive and usually open to review.

NORM REFERENCING FRAMEWORK

A norm-referenced interpretation tells us how an individual compares with
other students who have taken the same test. How much a student knows is
determined by his standing or rank within the reference group. This means that a
student’s score is not treated individually but as part of the group to which the
student belongs.

NORM GROUP

The norm group is the well-defined group of other students with whom a student is
compared. Simply ranking the scores of students from highest to lowest provides an
immediate basis for norm-referenced interpretation. However, merely ranking raw
scores to interpret a student’s performance formally is not proper and valid: the raw
scores must be converted to derived scores.
DERIVED SCORE

A derived score is a numerical report of test performance on a scale that has well
defined characteristics and yields normative meanings.

NORM REFERENCED FRAMEWORK

Most common types are:


 Grade Norms (5.5)
 Percentile Norms (85% higher than)
 Standard scores norms (normal curve)
 Stanines (9)

GRADE NORMS:

 Name of Derived Scores -----------Grade Equivalents


 The grade in which the student’s raw score is average.
 The grade equivalent that corresponds to a particular raw score identifies the
grade level at which the typical student obtains that raw score.

PERCENTILE NORMS

Name of Derived Scores----------- Percentile Ranks


The percentage of students in the reference group who fall below the student’s raw
score.
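
One common way to compute a percentile rank follows directly from this definition;
a minimal sketch with hypothetical reference-group scores:

# Minimal sketch: percentile rank = percentage of reference-group
# scores falling below the student's raw score (hypothetical data).
def percentile_rank(score, reference_scores):
    below = sum(1 for s in reference_scores if s < score)
    return 100 * below / len(reference_scores)

group = [35, 42, 47, 51, 53, 56, 58, 60, 63, 64]
print(percentile_rank(58, group))  # 60.0 -> 60th percentile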

STANDARD SCORES NORMS

Name of Derived Scores ----------- Standard Scores

The distance of the student’s raw score above or below the mean of the reference
group, expressed in standard deviation units.
A stanine (STAndard NINE) scale is a method of scaling test scores on a nine-point
standard scale with a mean of five and a standard deviation of two.

NORM REFERENCED INTERPRETATION ADVANTAGES


 It is very easy to use.
 It is appropriate for a large group of students, that is, more than 40.
 It increases healthy competition among the students.
 The teacher easily identifies the learning criteria – the percentage of students
who receive the highest grade or the lowest grade.

DISADVANTAGES

 The performance of a student is determined not only by his own achievement,
but also by the achievement of the other students.
 It promotes intense competition among the students rather than cooperation.
 It cannot be used when the class size is smaller than 40.
 Not all students can pass the given subject or course.

CRITERION REFERENCED INTERPRETATION ADVANTAGES

 The performance of the students will not be affected by the whole class.
 It promotes cooperation among the students.
 All students may pass the subject or course when they meet the standard set by
the teacher.

DISADVANTAGES

 It is difficult to set a reasonable standard if it is not stated in the grading
policies of the institution.
 All students may not pass the subject or course when they do not meet the
standard set by the teacher or the institution.

PERCENTAGE

In mathematics, a relationship with 100 is called a percentage (denoted %). Often it
is useful to express scores in terms of percentages for comparison. Consider the
following example:

Grade     Class “A” Students     %        Class “B” Students     %
A         10                     12.50    8                      40
B         25                     31.25    6                      30
C         30                     37.50    4                      20
D         15                     18.75    2                      10
Total     80                     100      20                     100
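
Each percentage in the table is simply the grade count divided by the class total and
multiplied by 100; a minimal sketch reproducing the Class “A” column:

# Reproduce the percentage column for Class "A" (counts from the table).
grade_counts = {"A": 10, "B": 25, "C": 30, "D": 15}
total = sum(grade_counts.values())               # 80 students
for grade, count in grade_counts.items():
    print(grade, round(100 * count / total, 2))  # A 12.5, B 31.25, C 37.5, D 18.75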
STANDARD DEVIATION

DEFINITION:

The standard deviation is the positive square root of the arithmetic mean of the
squares of the deviations of all scores from their mean.
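
Written out for a small set of hypothetical scores, the definition corresponds to the
population standard deviation:

# Population standard deviation: square root of the mean squared
# deviation from the mean (hypothetical scores).
scores = [40, 45, 50, 55, 60]
mean = sum(scores) / len(scores)
variance = sum((x - mean) ** 2 for x in scores) / len(scores)
std_dev = variance ** 0.5
print(mean, std_dev)   # 50.0, ~7.07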

Z-SCORE

A z-score indicates how many standard deviations from the mean (plus or minus)
the student scored:

z = (x − µ) / σ

where
 x is the student’s test score
 µ is the mean of all test scores
 σ is the standard deviation of the test scores

T-SCORE

A T-score is a standard score with a mean equal to 50 and a standard deviation equal
to 10:

T = 50 + 10z

where z is the z-score, i.e. the number of standard deviations by which the raw score
falls above or below the mean.

STANINE SCORES

The term stanine is an abbreviation of “standard nine”. The stanine scale has a mean
equal to 5 and a standard deviation equal to 2. A student whose raw score equals the
test mean will obtain a stanine score of 5. A score that is 3 standard deviations above
the mean is assigned a stanine of 9, not 11, because stanines are limited to a range of
1 to 9.
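
Putting the three scales together, the sketch below converts a raw score into z, T and
stanine values. The mean and standard deviation are hypothetical test statistics, and
the stanine is obtained by the common approximation of rounding 2z + 5 and
clipping the result to the range 1 to 9:

# Convert a raw score to z, T, and (approximate) stanine scores.
# The mean and standard deviation below are hypothetical test statistics.
mean, sd = 50.0, 10.0
raw = 65

z = (raw - mean) / sd                          # z = (x - mean) / sd
t = 50 + 10 * z                                # T has mean 50, SD 10
stanine = min(9, max(1, round(2 * z + 5)))     # mean 5, SD 2, clipped to 1-9

print(z, t, stanine)                           # 1.5, 65.0, 8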

ADVANTAGES OF STANDARD SCORES

Standard scores divide differences in performance into equal intervals. For
instance, performances represented by T-scores of 40 to 45 and of 50 to 55 represent
approximately equal differences in whatever ability the test is measuring. This
attribute of equal intervals is not shared by percentile ranks.
Standard scores can be added, subtracted, or averaged directly; percentile ranks
should be converted to an equal-interval scale before they can be added, subtracted,
or averaged.
Standard scores can be used to compare a student’s performance across tests. For
example, if a student’s Stanine scores in math and verbal skills are 4 and 2
respectively, one can conclude that the student performed better in math.

LIMITATIONS OF STANDARD SCORES

Many standard-score scales imply a degree of precision that does not exist within
educational tests. Differences of less than one-third of a standard deviation are
usually not measurable. This means that differences of less than 3 points on the
T-score scale and 5 points in deviation IQs are not meaningful; both of these scales
imply more precision than the tests possess.
Standard scores represent measures of relative standing as opposed to measures of
growth. A student who progresses through school in step with peers remains at the
same number of standard deviations from the mean. The constant standard score
may suggest (incorrectly) that growth is not occurring.

RANKING

The position or level something or someone has in a list that compares their
importance, quality, success.
A list that compares quality, success or importance of things or people.
A ranking is a relationship between a set of items such that, for any two items, the
first is either 'ranked higher than', 'ranked lower than' or 'ranked equal to' the
second.

STRATEGIES FOR ASSIGNING RANKING

STANDARD COMPETITION RANKING:

It is a ranking in which the mathematical values that are equal are given equal rank,
and the next, lesser value is given the next highest rank.

ORDINAL RANKING:

It is a system of ordering in which each mathematical value is given a certain
position in a sequence of numbers, with no two positions equal.

FRACTIONAL RANKING:

It is a system of ordering in which the mathematical values that are equal are given
the mean of the ranking positions.
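
A minimal sketch of the three strategies, applied to a small set of hypothetical scores
already sorted from highest to lowest:

# Rank scores from highest to lowest using the three strategies above.
scores = [88, 75, 75, 60]   # hypothetical scores, sorted descending

def competition_ranks(values):
    """Equal values share a rank; the next value skips ahead (1, 2, 2, 4)."""
    return [1 + sum(1 for v in values if v > x) for x in values]

def ordinal_ranks(values):
    """Every value gets its own position, ties broken by list order."""
    return list(range(1, len(values) + 1))

def fractional_ranks(values):
    """Equal values receive the mean of the positions they occupy."""
    ordinal = ordinal_ranks(values)
    return [sum(o for o, v in zip(ordinal, values) if v == x) / values.count(x)
            for x in values]

print(competition_ranks(scores))  # [1, 2, 2, 4]
print(ordinal_ranks(scores))      # [1, 2, 3, 4]
print(fractional_ranks(scores))   # [1.0, 2.5, 2.5, 4.0]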

FREQUENCY DISTRIBUTION

When we give a test to students to learn about their achievement, the raw scores
serve as data. These data have not yet undergone any statistical treatment. To
understand the data easily, we arrange them into groups or classes. The data so
arranged are called grouped data, or a frequency distribution.

GENERAL RULES FOR CONSTRUCTION OF FREQUENCY


DISTRIBUTION

The raw data are in the form of the scores of fifty students:
37, 42, 44, 51, 48, 30, 47, 56, 52, 31
64, 36, 42, 54, 49, 59, 45, 32, 38, 46
53, 54, 63, 41, 49, 51, 58, 41, 48, 48
43, 37, 52, 55, 61, 43, 46, 48, 62, 35
52, 33, 46, 60, 45, 40, 47, 51, 56, 53

STEPS:

Determine the range. Range is the difference between highest and lowest scores.
Range = 64-30 = 34
Decide the appropriate numbers of class intervals. There is no hard and fast
formula for deciding the class intervals. The number of class intervals is usually
taken between 5 and 20 depending on the length of the data. For this data the
number of class interval to be taken is 7.
Determine the approximate length of the class interval by dividing the range by the
number of class intervals: 34 ÷ 7 = 4.8 ≈ 5.
Determine the limits of the class intervals taking the smallest scores at the bottom
of the column to the largest scores at the top.
Determine the number of scores falling in each class interval. This is done by using
a tally or score sheet.

CLASS INTERVALS

Class Interval     Tallies            Frequency
60-64              IIII               5
55-59              IIII I             6
50-54              IIII IIII          10
45-49              IIII IIII II       12
40-44              IIII III           8
35-39              IIII               5
30-34              IIII               4
                                      N = 50
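
The steps above can be mirrored in a short script; the sketch below builds a
frequency distribution from the fifty raw scores, using a class width of 5 and starting
the lowest interval at 30:

# Build a frequency distribution for the fifty raw scores above.
scores = [37, 42, 44, 51, 48, 30, 47, 56, 52, 31,
          64, 36, 42, 54, 49, 59, 45, 32, 38, 46,
          53, 54, 63, 41, 49, 51, 58, 41, 48, 48,
          43, 37, 52, 55, 61, 43, 46, 48, 62, 35,
          52, 33, 46, 60, 45, 40, 47, 51, 56, 53]

width = 5                                   # class interval length
low, high = min(scores), max(scores)        # 30 and 64, so the range is 34
intervals = [(start, start + width - 1) for start in range(low, high + 1, width)]

for lower, upper in reversed(intervals):    # largest interval at the top
    freq = sum(1 for s in scores if lower <= s <= upper)
    print(f"{lower}-{upper}: {freq}")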

TYPES OF FREQUENCY

RELATIVE FREQUENCY DISTRIBUTION

A frequency distribution in which each of the class frequencies is divided by the
total number of observations.

CUMULATIVE FREQUENCY DISTRIBUTION

A cumulative frequency distribution gives, for each class, the sum of the frequency
of that class and of all the classes below it in the frequency distribution. In other
words, you are adding up a value and all of the values that come before it.
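
Continuing the example, relative and cumulative frequencies can be derived directly
from the frequency column; the counts below are those of the tally table, listed from
the lowest class upward:

# Frequencies from the tally table, lowest class (30-34) upward.
frequencies = [4, 5, 8, 12, 10, 6, 5]
n = sum(frequencies)                      # 50 observations

relative = [f / n for f in frequencies]   # each frequency / total

cumulative = []
running = 0
for f in frequencies:
    running += f                          # add this class and all below it
    cumulative.append(running)

print(relative)     # [0.08, 0.1, 0.16, 0.24, 0.2, 0.12, 0.1]
print(cumulative)   # [4, 9, 17, 29, 39, 45, 50]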

ADVANTAGES AND LIMITATIONS OF FREQUENCY DISTRIBUTION

 Condense and summarize large amounts of data in a useful format
 Describe all variable types
 Facilitate graphic presentation of data
 Begin to identify population characteristics
 Permit cautious comparison of data sets
 A few different methods can be used in formulating class intervals, for example
5-10 classes, or open-ended intervals such as over 55 and less than 30

PICTORIAL FORM

 The information summarized by the frequency distribution can be presented
in graphical form.
 Both frequency polygons and histograms are useful in describing a set of test
scores.
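
For the graphical forms, the short sketch below draws a histogram of the fifty scores
with the same class boundaries, assuming matplotlib is available; a frequency
polygon could be drawn instead by plotting class midpoints against frequencies:

# Histogram of the raw scores using the class boundaries 30, 35, ..., 65.
import matplotlib.pyplot as plt

scores = [37, 42, 44, 51, 48, 30, 47, 56, 52, 31,
          64, 36, 42, 54, 49, 59, 45, 32, 38, 46,
          53, 54, 63, 41, 49, 51, 58, 41, 48, 48,
          43, 37, 52, 55, 61, 43, 46, 48, 62, 35,
          52, 33, 46, 60, 45, 40, 47, 51, 56, 53]

plt.hist(scores, bins=range(30, 70, 5), edgecolor="black")
plt.xlabel("Score")
plt.ylabel("Frequency")
plt.title("Distribution of test scores")
plt.show()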
QUESTION NO 4
Q: Discuss the types of test reporting and marking.
TEST-MARKING AND REPORTING

Marks are the bases for important decisions made by the student, teachers,
counselors, parents, school administration and employers. They can also serve as
incentive for increased motivation. Presently marks and reports become the bases
for crucial decisions about the educational and occupational destiny of the student.

HOW MARKS ARE THE BASES FOR THE FOLLOWING DECISIONS

 Teachers and counselors use marks to assess past accomplishments, and to


assess present ability to help the student in making educational and
vocational plans for the future.
 The student uses marks to appraise his own educational accomplishments, to
select major and minor areas of study, and to decide whether to terminate or
to continue his formal education.
 Parents use marks to determine which (if not all) of their children they should
send to a specific college, and to estimate the probability of success any one
child might have in advanced study and in a particular vocation.
 School and college administrators, faced with limited educational facilities,
use marks as the basis for admission to advanced study and as indications of
the student’s progress after admission.
 Employers use marks in selecting the applicant most likely to perform best
the service they require. Part of the hue and cry over marks and marking
systems stems from the major role marks, desirably and undesirably, play in
the lives of our students.
Marks can also serve as incentives or positive reinforcers. Incentives can increase
motivation by raising the anticipation of reaching a desired goal. Incentives can
yield learned expectancies. A student who has learned to expect good marks for
competent performances will approach most educational tasks with more vim and
vigor than will the student who has learned to expect poor marks for inadequate
performances. Marks, therefore, not only convey information for crucial decisions
but also provide important motivational influences.

THE BASES OF MARKS

In terms of performance assessment, the basis for the assignment of marks is the
student’s achievement of the instructional objectives. Unfortunately all teachers do
not agree that achievement of instructional objective should be the exclusive basis
for marking. Instead they use several other bases.
 They often base grades on the student’s attitude, or citizenship, or desirable
attributes of character. A student who shows a cooperative attitude,
responsible citizenship, and strength of character receives a higher mark than
a student who shows a rebellious attitude, under developed citizenship, and
weakness of character.
 Teachers often base marks on the amount of effort the student invests in
achieving instructional objectives, whether or not these efforts meet with
success. Conceivably, the student who expends more effort and does not
succeed may receive a higher mark than does a student who expends less
effort and does succeed.
 Teachers base marks on growth or how much the student has learned, even
though this amount falls short of that required by the instructional objective.

THE MARKING SYSTEMS

The marking system generally used is based on three types of standards:


 Absolute
 Criterion-related
 Relative standards

ABSOLUTE

It is sometimes called the absolute system because it assumes the possibility of the
student achieving absolute perfection. It is based on 100% mastery. Percentage
below 100 represents less mastery. According to the definition of absolute
standards of achievement, 100% could simply mean that the student has attained
the standard of acceptable performance specified in the instructional objective.
With the absolute standard, it is harder to explain the meaning of any percentage
below 100 since the student achieves or does not achieve (all or none) the required
standard. A grade of less than 100% could indicate the percentage of the total
number of instructional objectives a student has achieved at some point during or
at the end of the course.

CRITERION-RELATED

Criterion-related standards do not rely on absolute mastery of every objective.


Instead they depend on criteria established by the teacher, who has considered the
number and kind of objectives which must be met before the next stage of
instruction can be entered. In this way they are more meaningful because they can
recognize differential levels of competence and achievement and allow some
students to achieve above the minimum performance required for all students.
Teachers who use percentage grades rarely adopt either of these interpretations,
and their grades, despite their deceptive arithmetical appearance, convey no clear
information about student achievement.

RELATIVE STANDARDS

The more popular marking system consists of the assignment of letter grades: A, B,
C, D, and F. A denotes superior, B good, C average, D fair and F failing or
insufficient achievement. It is sometimes called the relative system because the
grades are intended to describe the student’s achievement relative to that of other
students rather than to a standard of perfection or mastery. An ordinary but by no
means universal assumption of this marking system is that the grading should be
on the curve.  The curve in question is the normal probability curve. A teacher may
decide that one class has earned more superior or failing grades than another class
which, in turn, has earned more average and good grades.

SCORING ESSAY TEST

Essay test scoring calls for higher degrees of competence, and ordinarily takes
considerably more time, than the scoring of objective tests. In addition to this,
essay test scoring presents two special problems. The first is that of providing a
basis for judgment that is sufficiently definite, and of sufficiently general validity,
to give the scores assigned by a particular reader some objective meaning. To be
useful, his scores should not represent purely subjective opinions and personal
biases that equally competent readers might or might not share. The second
problem is that of discounting irrelevant factors, such as quality of handwriting,
verbal fluency, or gamesmanship, in appealing to the scorer’s interests and biases.
The reader’s scores should reflect unbiased estimates of the essential achievements
of the examinee.
One means of improving objectivity and relevancy in scoring essay tests is to
prepare an ideal answer to each essay question and to base the scoring on relations
between examinee answers and the ideal answer. Another is to defer assignment of
scores until the examinee answers have been sorted and resorted into three to nine
sets at different levels of quality. Scoring the test question by question through the
entire set of papers, rather than paper by paper (marking all questions on one paper
before considering the next) improves the accuracy of scoring. If several scorers
will be marking the same questions in a set of papers, it is usually helpful to plan
training and practice session in which the scorers mark the same papers, compare
their marks and strive to reach a common basis for marking.
The construction and scoring of essay questions are interrelated processes that
require attention if a valid and reliable measure of achievement is to be obtained.
In the essay test the examiner is an active part of the measurement instrument.
Therefore, the variability within and between examiners affects the resulting scores
of examinees. This variability is a source of error which affects the reliability of the
essay test if not adequately controlled. Hence, for the essay test result to serve a
useful purpose as a valid measurement instrument, a conscious effort is made to
score the test objectively by using appropriate methods to minimize the effect of
personal biases and idiosyncrasies on the resulting scores, and by applying standards
to ensure that only the relevant factors indicated in the course objectives are
considered.

THE POINT OR ANALYTIC METHOD

In this method each answer is compared with already prepared ideal marking
scheme (scoring key) and marks are assigned according to the adequacy of the
answer. When used conscientiously, the analytic method provides a means for
maintaining uniformity in scoring between scorers and between scripts, thus
improving the reliability of the scoring.
This method is generally used satisfactorily to score Restricted Response
Questions. This is made possible by the limited number of characteristics elicited
by a single answer, which thus defines the degree of quality precisely enough to
assign point values to them. It is also possible to identify the particular weakness or
strength of each examinee with analytic scoring. Nevertheless, it is desirable to rate
each aspect of the item separately. This has the advantage of providing greater
objectivity, which increases the diagnostic value of the result.
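
A minimal sketch of the analytic idea: an ideal marking scheme assigns point values
to the specific elements expected in a restricted-response answer, and each script is
credited for the elements the rater finds in it. The expected elements, point values
and rater judgments below are all hypothetical:

# Hypothetical analytic scoring key: expected elements and their point values.
scoring_key = {
    "mentions reliability": 2,
    "mentions validity": 2,
    "gives an example": 1,
}

# The rater records which expected elements appear in the examinee's answer.
elements_found = {"mentions reliability": True,
                  "mentions validity": True,
                  "gives an example": False}

score = sum(points for element, points in scoring_key.items()
            if elements_found.get(element))
print(score, "out of", sum(scoring_key.values()))   # 4 out of 5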

THE GLOBAL/HOLISTIC RATING METHOD

In this method the examiner first sorts the response into categories of varying
quality based on his general or global impression on reading the response. The
standard of quality helps to establish a relative scale, which forms the basis for
ranking responses from those with the poorest quality response to those that have
the highest quality response. Usually between five and ten categories are used with
the rating method, each pile representing a degree of quality and determining the
credit to be assigned. For example, where five categories are used, the responses are
awarded five letter grades: A, B, C, D and E, and the responses are sorted into the
five categories accordingly.
This method is ideal for the extended response questions where relative judgments
are made (no exact numerical scores) concerning the relevance of ideas,
organization of the material and similar qualities evaluated in answers to extended
response questions. Using this method requires a lot of skill and time in
determining the standard response for each quality category. It is desirable to rate
each characteristic separately. This provides for greater objectivity and increases
the diagnostic value of the results.

IMPROVING OBJECTIVITY IN MARKING IN ESSAY TEST

The following are procedures for scoring essay questions objectively to enhance
reliability.
 Prepare the marking scheme or ideal answer or outline of expected answer
immediately after constructing the test items and indicate how marks are to
be awarded for each section of the expected response.
 Use the scoring method that is most appropriate for the test item. That is, use
either the analytic or global method as appropriate to the requirements of the
test item.
 Decide how to handle factors that are irrelevant to the learning outcomes
being measured. These factors may include legibility of handwriting,
spelling, sentence structure, punctuation and neatness. These factors should
be controlled when judging the content of the answers. Also decide in
advance how to handle the inclusion of irrelevant materials (uncalled for
responses).
 Score only one item in all the scripts at a time. This helps to control the
“halo” effect in scoring.
 Evaluate the responses anonymously, without knowledge of which examinee’s
script you are scoring. This helps in controlling bias in scoring the essay
questions.
 Evaluate the marking scheme (scoring key) before actual scoring by scoring
a random sample of examinees actual responses. This provides a general
idea of the quality of the response to be expected and might call for a
revision of the scoring key before commencing actual scoring.
 Make comments during the scoring of each essay item. These comments act
as feedback to examinees and a source of remediation to both examinees and
examiners.
 Obtain two or more independent ratings if important decisions are to be
based on the results. The result of the different scorers should be compared
and rating moderated to reflect the discrepancies for more reliable results.

SCORING OBJECTIVE TEST


Answers to true–false, multiple-choice, and other objective-item types can be
marked directly on the test copy. But scoring is facilitated if the answers are
indicated by position marking a separate answer sheet. For example, the examinee
may be directed to indicate his choice of the first, second, third, fourth, or fifth
alternative to a multiple-choice test item by blackening the first, second, third,
fourth, or fifth position following the item number on his answer sheet.
Answers so marked can be scored by clerks with the aid of a stencil key on which
the correct answer positions have been punched. To get the number of correct
answers, the clerk simply counts the number of marks appearing through the holes
on the stencil key. Or the answers can be scored, usually much more quickly and
accurately, by electrical scoring machines. Some of these machines, which “count”
correct answers by cumulating the current flowing through correctly placed pencil
marks, require the examinee to use special graphite pencils; others, which use
photoelectric cells to scan the answer sheet, require only marks black enough to
contrast sharply with the lightly printed guide lines. High-speed photoelectric test
scoring machines usually incorporate, or are connected to, electronic data
processing and print-out equipment.

OBJECTIVE TEST CAN BE SCORED BY VARIOUS METHODS. VARIOUS TECHNIQUES ARE


USED TO SPEED UP THE SCORING

I. MANUAL SCORING

In this method of scoring, the answers to test items are scored by direct comparison
of the examinee’s answers with the marking key. If the answers are recorded on the
test paper, for instance, a scoring key can be made by marking the correct answers
on a blank copy of the test. Scoring is then done by simply comparing the columns
of answers on the master copy with the columns of answers on each examinee’s
test paper. Alternatively, the correct answers are recorded on strips of paper, and this
strip key, on which the columns of answers are recorded, is used as a master for
scoring the examinees’ test papers.

II. STENCIL SCORING
Where separate answer sheets are used by examinees for recording their
answers, it is most convenient to prepare and use a scoring stencil. A scoring stencil
is prepared by punching holes in a blank answer sheet where the correct answers
are supposed to appear. Scoring is then done by laying the stencil over each answer
sheet and counting the number of answer marks appearing through the holes. At
the end of this scoring procedure, each test paper is scanned to eliminate possible
errors due to examinees supplying more than one answer, or to an item having more
than one correct answer.

III. MACHINE SCORING

If the number of examinees is large, specially prepared answer sheets are used to
answer the questions. The answers are normally shaded at the appropriate places
assigned to the various items. These special answer sheets are then machine scored
with computers and other scoring devices, using a certified answer key prepared for
the test items. In scoring an objective test, it is usually preferable to count each
correct answer as one point; an examinee’s score is simply the number of items
answered correctly.
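
Whatever the mechanics (manual, stencil or machine), the underlying operation is a
comparison of each response with the answer key, one point per match; a minimal
sketch with a hypothetical key and response record:

# Compare an examinee's responses with the answer key, one point per match.
# The key and responses below are hypothetical.
answer_key = ["B", "D", "A", "C", "B"]
responses  = ["B", "D", "C", "C", ""]    # "" = omitted item

score = sum(1 for key, resp in zip(answer_key, responses) if resp == key)
omitted = sum(1 for resp in responses if resp == "")
print("Right:", score, "Omitted:", omitted)   # Right: 3 Omitted: 1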

CORRECTION FOR GUESSING

One question that often arises is whether or not objective test scores should be
corrected for guessing. Differences of opinion on this question are much greater
and more easily observable than differences in the accuracy of the scores produced
by the two methods of scoring. If well-motivated examinees take a test that is
appropriate to their abilities, little blind guessing is likely to occur. There may be
many considered guesses, if every answer given with less than complete certainty
is called a guess. But the examinee’s success in guessing right after thoughtful
consideration is usually a good measure of his achievement.
Since the meaning of most achievement test scores is relative, not absolute—the
scores serve only to indicate how the achievement of a particular examinee
compares with that of other examinees—the argument that scores uncorrected for
guessing will be too high carries little weight. Indeed, one method of correcting for
guessing results in scores higher than the uncorrected scores.
The logical objective of most guessing correction procedures is to eliminate the
expected advantage of the examinee who guesses blindly in preference to omitting
an item. This can be done by subtracting a fraction of the number of wrong
answers from the number of right answers, using the formula S = R – W/ (k – 1)
where S is the score corrected for guessing, R is the number of right answers, W is
the number of wrong answers, and k is the number of choices available to the
examinee in each item. An alternative formula is S = R + O/k where O is the
number of items omitted, and the other symbols have the same meaning as before.
Both formulas rank any set of examinee answer sheets in exactly the same relative
positions, although the second formula yields a higher score for the same answers
than does the first.
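
Both formulas are simple to apply once the numbers of right, wrong and omitted
answers are known; a minimal sketch using the notation above (R, W, O, k) with
hypothetical counts:

# Correction-for-guessing formulas: S = R - W/(k - 1) and S = R + O/k.
def corrected_score_penalty(right, wrong, k):
    """Subtract a fraction of the wrong answers (k = choices per item)."""
    return right - wrong / (k - 1)

def corrected_score_bonus(right, omitted, k):
    """Credit a fraction of the omitted items instead."""
    return right + omitted / k

# Hypothetical: 60 items, 4 choices each, 42 right, 12 wrong, 6 omitted.
print(corrected_score_penalty(42, 12, 4))   # 38.0
print(corrected_score_bonus(42, 6, 4))      # 43.5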
Logical arguments for and against correction for guessing on objective tests are
complex and elaborate. But both these arguments and the experimental data point
to one general conclusion. In most circumstances a correction for guessing is not
likely to yield scores that are appreciably more or less accurate than the
uncorrected scores.

REPORTING

The most popular method of reporting marks is the report card. Most modern
report cards contain grades and checklist items. The grades describe the level of
achievement, and the checklists describe other areas such as effort, conduct,
homework, and social development.
Because the report card does not convey all the information parents sometimes
seek, and to improve the cooperation between parents and teachers, schools often
use parent-teacher conferences. The teacher invites the parents to the school for a
short interview.
interview. The conferences allow the teacher to provide fuller descriptions of the
student’s scholastic and social development and allow parents to ask questions,
describe the home environment and plan what they may do to assist their children’s
educational development. There are inherent weaknesses in the conferences and
ordinarily they should supplement rather than replace the report card.
Despite the rather obvious limitations of validity, reliability, and interpretation,
reform of these marking systems has had only temporary appeal. Reforms
advocating the elimination of marks have failed because students, teachers,
counselors, parents, administrators, and employers believe they enjoy distinct
advantages in knowing the student’s marks. Many know that marks can mislead
them, but many believe that some simplified knowledge of the student’s achievement
is better than no knowledge at all.
QUESTION NO 5
Q: Write considerations in test administration, before, during and after?

PLANNING THE TEST:

Planning of the test is the first important step in the test construction. The main
goal of evaluation process is to collect valid, reliable and useful data about the
student. Therefore before going to prepare any test we must keep in mind that:

 What is to be measured?
 What content areas should be included and
 What types of test items are to be included.

Therefore the first step includes three major considerations.

 Determining the objectives of testing.


 Preparing test specifications.
 Selecting appropriate item types.

DETERMINING THE OBJECTIVES OF TESTING:

A test can be used for different purposes in a teaching learning process. It can be
used to measure the entry performance, the progress during the teaching learning
process and to decide the mastery level achieved by the students. Tests serve as a
good instrument to measure the entry performance of the students. It answers to the
questions, whether the students have requisite skill to enter into the course or not,
what previous knowledge does the pupil possess. Therefore it must be decided
whether the test will be used to measure the entry performance or the previous
knowledge acquired by the student on the subject.

Tests can also be used for formative evaluation. It helps to carry on the teaching
learning process, to find out the immediate learning difficulties and to suggest its
remedies. When the difficulties are still unsolved we may use diagnostic tests.
Diagnostic tests should be prepared with high technique. So specific items to
diagnose specific areas of difficulty should be included in the test.

Tests are used to assign grades or to determine the mastery level of the students.
These summative tests should cover the whole instructional objectives and content
areas of the course. Therefore attention must be given towards this aspect while
preparing a test.

PREPARING TEST SPECIFICATIONS:

The second important step in the test construction is to prepare the test
specifications. In order to be sure that the test will measure a representative sample
of the instructional objectives and content areas we must prepare test
specifications. Thus an elaborate design is necessary for test construction. One of
the most commonly used devices for this purpose is ‘Table of Specification’ or
‘Blue Print.’

PREPARATION OF TABLE OF SPECIFICATION/BLUE PRINT

Preparation of table of specification is the most important task in the planning


stage. It acts, as a guide for the test construction. Table of specification or ‘Blue
Print’ is a three dimensional chart showing list of instructional objectives, content
areas and types of items in its dimensions.
It includes four major steps:
 Determining the weightage to different instructional objectives.
 Determining the weightage to different content areas.
 Determining the item types to be included.
 Preparation of the table of specification.

DETERMINING THE WEIGHTAGE TO DIFFERENT INSTRUCTIONAL OBJECTIVES

There is a vast array of instructional objectives, and we cannot include all of them in
a single test. In a written test we cannot measure the psychomotor and affective
domains; we can only measure the cognitive domain. It is also true that all the
subjects do not contain the different learning objectives, such as knowledge,
understanding, application and skill, in equal proportion. Therefore it must be
planned how much weightage is to be given to different instructional objectives.
While deciding this we must keep in mind the importance of the particular objective
for that subject or chapter.
For example if we have to prepare a test in General Science for Class—X we may
give the weightage to different instructional objectives as following:
Table Showing weightage given to different instructional objectives in a test of 100
marks:

DETERMINING THE WEIGHTAGE TO DIFFERENT CONTENT AREAS


The second step in preparing the table of specification is to outline the content
area. It indicates the area in which the students are expected to show their
performance. It helps to obtain a representative sample of the whole content area.

It also prevents repetition or omission of any unit. Now the question arises: how
much weightage should be given to which unit? Some experts say that it should be
decided by the concerned teacher, keeping the importance of the chapter in mind.
Others say that it should be decided according to the area covered by the topic in
the textbook. Generally it is decided on the basis of the pages of the topic, the total
pages in the book and the number of items to be prepared. For example, if a test of
100 marks is to be prepared, then the weightage for different topics will be given as
follows.

WEIGHTAGE OF A TOPIC:

If a book contains 250 pages and 100 test items (marks) are to be constructed, then
the weightage will be given as follows (see the sketch below).
Table showing weightage given to different content areas:
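
Following that rule, the number of items allotted to a topic is its share of the book's
pages multiplied by the total number of items; the sketch below uses the 250-page,
100-item figures above with hypothetical page counts per topic:

# Weightage of each topic = (pages of topic / total pages) * total items.
total_pages, total_items = 250, 100
topic_pages = {"Topic 1": 50, "Topic 2": 75, "Topic 3": 125}   # hypothetical

for topic, pages in topic_pages.items():
    items = round(pages / total_pages * total_items)
    print(topic, items)   # Topic 1: 20, Topic 2: 30, Topic 3: 50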
DETERMINING THE ITEM TYPES:

The third important step in preparing table of specification is to decide appropriate


item types. Items used in the test construction can broadly be divided into two
types like objective type items and essay type items. For some instructional
purposes, the objective type items are most efficient where as for others the essay
questions prove satisfactory.

Appropriate item types should be selected according to the learning outcomes to be


measured. For example, when the outcome is writing or naming, supply-type items
are useful. If the outcome is identifying a correct answer, selection-type or
recognition-type items are useful. The teacher must therefore decide and select
appropriate item types as per the learning outcomes.

PREPARING THE THREE WAY CHART:

Preparation of the three-way chart is the last step in preparing the table of
specification. This chart relates the instructional objectives to the content areas and
types of items. In a table of specification the instructional objectives are listed
across the top of the table, the content areas are listed down the left side, and under
each objective the types of items are listed content-wise. Table 3.3 is a model table
of specification for Class X science.

PREPARING THE TEST:

After planning, preparation is the next important step in test construction. In this
step the test items are constructed in accordance with the table of specification.
Each type of test item needs special care in its construction.

The preparation stage includes the following three functions:


 Preparing test items.
 Preparing instruction for the test.
 Preparing the scoring key.

PREPARING THE TEST ITEMS


Preparation of test items is the most important task in the preparation step.
Therefore care must be taken in preparing a test item. The following principles
help in preparing relevant test items.

TEST ITEMS MUST BE APPROPRIATE FOR THE LEARNING OUTCOME TO BE MEASURED:

The test items should be so designed that they measure the performance described
in the specific learning outcomes; that is, each item must be in accordance with the
performance described in the specific learning outcome.

For example:
 Specific learning outcome—Knows basic terms.
 Test item—An individual is considered obese when his weight is ______ % more
than the recommended weight.

TEST ITEMS SHOULD MEASURE ALL TYPES OF INSTRUCTIONAL OBJECTIVES AND THE
WHOLE CONTENT AREA:

The items in the test should be so prepared that they cover all the instructional
objectives—knowledge, understanding and thinking skills—and match the specific
learning outcomes and subject-matter content being measured. When the items are
constructed on the basis of the table of specification, they become relevant.

THE TEST ITEMS SHOULD BE FREE FROM AMBIGUITY:

The items should be clear. Inappropriate vocabulary and awkward sentence
structure should be avoided. The items should be so worded that all pupils
understand the task.

Example:
 Poor item—Where was Allama Iqbal born?
 Better—In which city was Allama Iqbal born?

THE TEST ITEMS SHOULD BE OF APPROPRIATE DIFFICULTY LEVEL:

The test items should be of the proper difficulty level so that they can discriminate
properly. If an item is meant for a criterion-referenced test, its difficulty level
should match the difficulty indicated by the statement of the specific learning
outcome: if the learning task is easy the test item must be easy, and if the learning
task is difficult the test item must be difficult.

In a norm-referenced test the main purpose is to discriminate among pupils
according to achievement, so the test should be designed to produce a wide spread
of test scores. Therefore the items should not be so easy that everyone answers
them correctly, nor so difficult that everyone fails to answer them. The items should
be of average difficulty level.

THE TEST ITEM MUST BE FREE FROM TECHNICAL ERRORS AND IRRELEVANT CLUES

Sometimes there are unintentional clues in the statement of an item which help the
pupil to answer correctly: for example, grammatical inconsistencies, verbal
associations, extreme words (ever, seldom, always), and mechanical features (the
correct statement being longer than the incorrect ones). Therefore, while
constructing a test item, care must be taken to avoid such clues.

TEST ITEMS SHOULD BE FREE FROM RACIAL, ETHNIC AND SEXUAL BIAS:

The items should be universal in nature. Care must be taken to make each item
culture-fair. While portraying a role, all sections of society should be given equal
importance. The terms used in the test item should have a universal meaning to all
members of the group.
PREPARING INSTRUCTION FOR THE TEST:

This is the most neglected aspect of test construction. Generally everybody gives
attention to the construction of test items, and test makers often fail to attach
directions to them.

But the validity and reliability of the test items depend to a great extent upon the
instructions for the test. N.E. Gronlund has suggested that the test maker should
provide clear-cut directions about:
 The purpose of testing.
 The time allowed for answering.
 The basis for answering.
 The procedure for recording answers.
 The methods to deal with guessing.

DIRECTION ABOUT THE PURPOSE OF TESTING:

A written statement about the purpose of the testing maintains the uniformity of the
test. Therefore there must be a written instruction about the purpose of the test
before the test items.

INSTRUCTION ABOUT THE TIME ALLOWED FOR ANSWERING:

Clear-cut instructions must be supplied to the pupils about the time allowed for the
whole test. It is also better to indicate the approximate time required for answering
each item, especially in the case of essay type questions. The test maker should
carefully judge the amount of time needed, taking into account the types of items,
the age and ability of the students, and the nature of the learning outcomes
expected. Experts are of the opinion that it is better to allow more time than to
deprive a slower student of the chance to answer the questions.

INSTRUCTIONS ABOUT BASIS FOR ANSWERING:

The test maker should provide specific directions on the basis of which the students
will answer the items. The directions must clearly state whether the students are to
select the answer or supply the answer. In matching items, the basis for matching
the premises and responses (for example, states with their capitals, or countries
with their products) should be given. Special directions are necessary for
interpretive items. In essay type items, clear directions must be given about the
types of responses expected from the pupils.

INSTRUCTION ABOUT RECORDING ANSWER:

Students should be instructed where and how to record their answers. Answers may
be recorded on a separate answer sheet or on the test paper itself. If they have to
answer on the test paper itself, they must be told whether to write the correct
answer or to mark the correct answer from among the alternatives. When a separate
answer sheet is used, the direction may be given either on the test paper or on the
answer sheet.

INSTRUCTION ABOUT GUESSING:

In the case of recognition type test items, directions must be provided to the
students about whether or not they should guess on items they are uncertain of. If
nothing is stated about guessing, the bold students will guess on these items while
others will answer only those items of which they are confident, so the bold pupils
may by chance answer some items correctly and secure a higher score. Therefore a
direction should be given to make informed guesses but not wild guesses.

PREPARING THE SCORING KEY:

A scoring key increases the reliability of a test, so the test maker should provide the
procedure for scoring the answer scripts. Directions must be given as to whether
the scoring will be done with a scoring key (when the answer is recorded on the
test paper) or with a scoring stencil (when the answer is recorded on a separate
answer sheet), and how marks will be awarded to the test items.

In the case of essay type items it should be indicated whether to score with the
‘point method’ or with the ‘rating method.’ In the point method each answer is
compared with a set of ideal answers in the scoring key, and a given number of
points is then assigned.

In the rating method the answers are rated on the basis of degrees of quality, which
determines the credit assigned to each answer. Thus a scoring key helps to obtain
consistent data about the pupils’ performance, and the test maker should prepare a
comprehensive scoring procedure along with the test items.
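As an illustration only, the point method might be sketched as below; the ideal-answer points and their marks are hypothetical and not drawn from any particular scoring key.

```python
# A minimal sketch of the 'point method' for essay scoring: each expected
# point from the ideal answer in the scoring key carries some marks, and the
# scorer records which points the pupil's answer contains. All values here
# are hypothetical.
ideal_answer_points = {            # scoring key for one essay question
    "defines reliability": 2,
    "defines validity": 2,
    "explains their relationship": 3,
    "gives an example": 3,
}

def score_essay(points_found):
    """Add up the marks for the ideal-answer points present in the response."""
    return sum(ideal_answer_points.get(p, 0) for p in points_found)

print(score_essay(["defines reliability", "gives an example"]))  # 5
```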

TRY OUT OF THE TEST:

Once the test is prepared, it is time to confirm the validity, reliability and usability
of the test. The try-out helps us to identify defective and ambiguous items, to
determine the difficulty level of the test, and to determine the discriminating power
of the items.

Try out involves two important functions:


 Administration of the test.
 Scoring the test.

ADMINISTRATION OF THE TEST:

Administration means administering the prepared test to a sample of pupils. The
effectiveness of the final form of the test depends upon fair administration.
Gronlund and Linn have stated that ‘the guiding principle in administering any
classroom test is that all pupils must be given a fair chance to demonstrate their
achievement of the learning outcomes being measured.’ This implies that the pupils
must be provided with a congenial physical and psychological environment at the
time of testing, and that any other factor that may affect the testing procedure
should be controlled.

Physical environment means a proper seating arrangement, proper light and
ventilation, and adequate space for invigilation. Psychological environment refers
to those aspects which influence the mental condition of the pupil; therefore steps
should be taken to reduce the anxiety of the students. The test should not be
administered just before or after a great occasion such as the annual sports or the
annual drama.

One should follow the following principles during the test administration:
 The teacher should talk as little as possible.
 The teacher should not interrupt the students at the time of testing.
 The teacher should not give any hints to any student who has asked about
any item.
 The teacher should provide proper invigilation in order to prevent the
students from cheating.
SCORING THE TEST:

Once the test is administered and the answer scripts are obtained, the next step is to
score them. A scoring key may be used when the answers are on the test paper
itself; the scoring key is a sample answer script on which the correct answers are
recorded.

When the answers are on a separate answer sheet, a scoring stencil may be used for
scoring the items. The scoring stencil is a sample answer sheet in which the correct
alternatives have been punched out. By placing the scoring stencil over the pupil’s
answer sheet, the correct answers can be marked. For essay type items, separate
instructions for scoring each learning objective may be provided.
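As an illustration only, scoring with a key amounts to comparing each pupil response with the keyed answer; the key and responses in this minimal sketch are hypothetical.

```python
# A minimal sketch of scoring objective items against a scoring key.
# The answer key and the pupil's responses are hypothetical.
answer_key = {1: "B", 2: "D", 3: "A", 4: "C"}        # item number -> correct option
pupil_answers = {1: "B", 2: "A", 3: "A", 4: "C"}     # item number -> chosen option

score = sum(1 for item, correct in answer_key.items()
            if pupil_answers.get(item) == correct)
print(f"Score: {score} out of {len(answer_key)}")     # Score: 3 out of 4
```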

CORRECTION FOR GUESSING:

When the pupils do not have sufficient time to answer the test, or are not ready to
take the test, they tend to guess the correct answers on recognition type items.

In that case, to eliminate the effect of guessing, the following formula is used:

Corrected score = R − W / (n − 1)

where R = the number of right answers, W = the number of wrong answers, and
n = the number of alternatives per item.
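As a rough illustration, a minimal Python sketch of applying this conventional correction follows; the right/wrong counts and the four alternatives per item are hypothetical.

```python
# A minimal sketch of the conventional correction-for-guessing formula:
# corrected score = R - W / (n - 1), where R = right answers, W = wrong
# answers (omitted items are not counted), and n = alternatives per item.
def corrected_score(right: int, wrong: int, n_alternatives: int) -> float:
    return right - wrong / (n_alternatives - 1)

# Hypothetical example: 40 right, 12 wrong on 4-option multiple-choice items.
print(corrected_score(40, 12, 4))  # 36.0
```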

But there is a lack of agreement among psychometricians about the value of the
correction formula so far as validity and reliability are concerned. In the words of
Ebel, “neither the instruction nor penalties will remedy the problem of guessing.”
Guilford is of the opinion that “when the middle is excluded in item analysis the
question of whether to correct or not correct the total scores becomes rather
academic.” Little said “correction may either under or over correct the pupils’
score.” Keeping the above opinions in view, the test maker should decide not to use
the correction for guessing; to avoid the situation altogether, he should give enough
time for answering the test items.

EVALUATING THE TEST:

Evaluating the test is the most important step in the test construction process.
Evaluation is necessary to determine the quality of the test and the quality of the
responses. Quality of the test refers to how good and dependable the test is (its
validity and reliability). Quality of the responses means identifying which items are
misfits in the test. Evaluation also enables us to judge the usability of the test in the
general classroom situation.

Evaluating the test involves the following functions:


 Item analysis.
 Determining validity of the test.
 Determining reliability of the test.
 Determining usability of the test.

ITEM ANALYSIS:

Item analysis is a procedure which helps us to find out the answers to the following
questions:
 Whether the items function as intended?
 Whether the test items have an appropriate difficulty level?
 Whether the items are free from irrelevant clues and other defects?
 Whether the distracters in multiple choice items are effective?

The item analysis data also help us:


 To provide a basis for efficient class discussion of the test results.
 To provide a basis for remedial work.
 To increase skill in test construction.
 To improve class-room instruction.

ITEM ANALYSIS PROCEDURE:

Item analysis procedure gives special emphasis on item difficulty level and item
discriminating power.

The item analysis procedure involves the following steps:

1. Rank the test papers from the highest score to the lowest.

2. Select 27% of the test papers from the highest end and 27% from the lowest end.
For example, if the test is administered to 60 students, select 16 test papers from
the highest end and 16 from the lowest end.

3. Keep aside the remaining test papers, as they are not required in the item
analysis.

4. Tabulate the number of pupils in the upper and lower groups who selected each
alternative for each test item. This can be done on the back of the test paper, or a
separate test item card may be used.

5. Calculate the item difficulty for each item by using the formula:

Item difficulty (P) = (R / T) × 100

Where R = the total number of students who got the item correct, and

T = the total number of students who tried the item.


In our example (fig. 3.1) out of 32 students from both the groups 20 students have
answered the item correctly and 30 students have tried the item.

The item difficulty is as following:

It implies that the item has a proper difficulty level. Because it is customary to
follow 25% to 75% rule to consider the item difficulty. It means if an item has a
item difficulty more than 75% then is a too easy item if it is less than 25% then
item is a too difficult item.
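The same calculation, as a minimal sketch in Python using the figures from the example above:

```python
# A minimal sketch of the item difficulty index P = (R / T) * 100 and the
# customary 25%-75% rule described above.
def item_difficulty(correct: int, tried: int) -> float:
    """Percentage of examinees who answered the item correctly."""
    return correct / tried * 100

p = item_difficulty(20, 30)        # figures from the worked example
print(round(p, 1))                 # 66.7
if p > 75:
    print("Too easy")
elif p < 25:
    print("Too difficult")
else:
    print("Proper difficulty level")
```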

6. Calculate the item discriminating power by using the following formula:

Item discriminating power (D) = (RU − RL) / (T/2)

Where RU = the number of students from the upper group who got the answer correct,
RL = the number of students from the lower group who got the answer correct, and
T/2 = half of the total number of pupils included in the item analysis.

In our example (Fig. 3.1), 15 students from the upper group and 5 students from the
lower group answered the item correctly, so

D = (15 − 5) / 16 = 0.63

A high positive ratio indicates high discriminating power; here 0.63 indicates a
good discriminating power. If all 16 students from the lower group and all 16
students from the upper group answer the item correctly, the discriminating power
will be (16 − 16) / 16 = 0.00, indicating that the item has no discriminating power.
If all 16 students from the upper group answer the item correctly and all the
students from the lower group answer it incorrectly, the discriminating power will
be (16 − 0) / 16 = 1.00, indicating an item with the maximum positive
discriminating power.
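A minimal sketch of the same discrimination calculation, using the counts from the example:

```python
# A minimal sketch of the discrimination index D = (RU - RL) / (T / 2),
# using the upper/lower-group counts from the worked example.
def discrimination_index(upper_correct: int, lower_correct: int,
                         group_size: int) -> float:
    """upper_correct and lower_correct are correct responses in the upper and
    lower groups; group_size is T/2, i.e. the size of each group."""
    return (upper_correct - lower_correct) / group_size

print(round(discrimination_index(15, 5, 16), 2))   # 0.63
print(discrimination_index(16, 16, 16))            # 0.0  (no discrimination)
print(discrimination_index(16, 0, 16))             # 1.0  (maximum positive)
```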

7. Find out the effectiveness of the distracters. A distracter is considered good
when it attracts more pupils from the lower group than from the upper group.
Distracters which are not selected at all, or are selected very rarely, should be
revised. In our example (Fig. 3.1), distracter ‘D’ attracts more pupils from the
upper group than from the lower group, which indicates that it is not an effective
distracter. ‘E’ is a distracter to which no one responded, so it also needs revision.
Distracters ‘A’ and ‘B’ prove to be effective, as they attract more pupils from the
lower group.
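As an illustration of this check, a minimal sketch with hypothetical option counts, chosen only to follow the pattern described above rather than the actual figures of Fig. 3.1:

```python
# A minimal sketch of checking distracter effectiveness: a distracter is
# working if it attracts more lower-group than upper-group pupils, and one
# chosen by nobody needs revision. The counts below are hypothetical.
option_counts = {            # option -> (upper-group count, lower-group count)
    "A": (0, 5),
    "B": (0, 6),
    "C": (15, 5),            # keyed (correct) answer, shown for completeness
    "D": (1, 0),
    "E": (0, 0),
}
keyed_answer = "C"

for option, (upper, lower) in option_counts.items():
    if option == keyed_answer:
        continue
    if upper + lower == 0:
        print(f"Distracter {option}: never chosen - needs revision")
    elif lower > upper:
        print(f"Distracter {option}: effective (attracts more lower-group pupils)")
    else:
        print(f"Distracter {option}: ineffective - needs revision")
```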

PREPARING A TEST ITEM FILE:

Once the item analysis process is over, we have a list of effective items. The task
now is to make a file of these items, which can be done with item analysis cards.
The items should be arranged in order of difficulty, and while filing them the
objectives and the content area that each item measures must be recorded. This
helps in the future use of the items.

DETERMINING VALIDITY OF THE TEST:

At the time of evaluation, it is estimated to what extent the test measures what the
test maker intends it to measure.

DETERMINING RELIABILITY OF THE TEST:

The evaluation process also estimates to what extent a test is consistent from one
measurement to another; without such consistency the results of the test cannot be
depended upon.

DETERMINING THE USABILITY OF THE TEST:

The try-out and the evaluation process indicate to what extent a test is usable in
general classroom conditions, that is, how far the test is practical from the points of
view of administration, scoring, time and economy.
