SPL-3 Unit 2
Reliability is the extent to which a test is repeatable and gives consistent scores.
Tests that are relatively free from measurement errors are said to be reliable.
Measurement errors are random. A person's test score might not reflect his or her true score because of sickness, anxiety, a noisy room, being in a hurry, etc.
The reliability of a test refers to its ability to yield consistent results from one set of measures to another; it is the extent to which the obtained test scores are free from internal defects.
Reliability also refers to the extent to which a test yields consistent results upon testing and re-testing. Reliability includes such terms as consistency, stability, replicability and repeatability.
There are four procedures in common use for computing the reliability coefficient of a
test.
These are:
1. Test-Retest (Repetition)
2. Alternate or Parallel Forms
3. Split-Half Technique
4. Rational Equivalence.
1. Test-Retest Method (Repetition):
The same test is administered twice to the same group, and the resulting test scores are correlated. This correlation coefficient provides a measure of stability, that is, it indicates how stable the test results are over a period of time; hence it is also known as a measure of stability.
Thus it means administering the same test twice with a particular time gap, say 21 to 28 days, and then calculating the correlation between the first and second administrations. If we get a high, significant correlation, we can say the test is reliable.
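As a sketch, the test-retest coefficient is simply the Pearson correlation between the two administrations; the scores below are hypothetical:

```python
# Test-retest reliability: Pearson correlation between two administrations
# of the same test to the same group (hypothetical scores).
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

first = [12, 15, 11, 18, 14, 16, 13, 17]   # scores, first administration
second = [13, 14, 12, 19, 15, 15, 13, 18]  # scores after a 21-28 day gap
r = pearson(first, second)                 # coefficient of stability
```

A value of r close to 1 indicates high stability over the interval.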
In this method the time interval plays an important role. If it is too small say a day or
two, the consistency of the results will be influenced by the carry-over effect, i.e., the
pupils will remember some of the results from the first administration to the second.
If the time interval is long say a year, the results will not only be influenced by the
inequality of testing procedures and conditions, but also by the actual changes in the
pupils over that period of time.
Advantages:
• Self-correlation, or the test-retest method, is the most commonly used way of estimating the reliability coefficient.
• It can be used conveniently in a wide range of situations.
• A test of an adequate length can be used after an interval of many days
between successive testing.
Disadvantages:
• If the test is repeated immediately, many subjects will recall their first answers and spend their time on new material, thus tending to increase their scores.
• Besides immediate memory effects, practice and the confidence induced by familiarity with the material will also affect scores when the test is taken a second time. The index of reliability so obtained is therefore less accurate.
• If the interval between tests is rather long (more than six months), growth and maturation will affect the scores and tend to lower the reliability index.
• If the test is repeated immediately or after a little time gap, there may be the
possibility of carry-over effect/transfer effect/memory/practice effect.
• Repeating the same test on the same group a second time makes the students disinterested, and they do not take part wholeheartedly.
• Sometimes uniformity of testing conditions is not maintained, which also affects the test scores.
• There are chances of the students discussing a few questions after the first administration, which may increase the scores at the second administration and affect reliability.
2. Alternate or Parallel Forms Method:
This refers to the degree to which two different forms of the same test yield similar results. Once we have a test, we develop a parallel form containing similar constructs/items but not the same items; we administer the two forms with at least a fifteen-minute gap between them, and then study the correlation to find the reliability.
Parallel tests have equal mean scores, variances and inter-correlations among items. That is, two parallel forms must be homogeneous or similar in all respects, but not a duplication of test items. The two equivalent forms should be as similar as possible in content, degree, mental processes tested, difficulty level and other aspects.
The reliability coefficient may be looked upon as the coefficient of correlation between the scores on two equivalent forms of the test. One form of the test is administered to the students and, immediately on finishing, the other form is supplied to the same group.
The scores thus obtained are correlated, which gives the estimate of reliability. The reliability so found is called the coefficient of equivalence.
Advantages:
This procedure has certain advantages over the test-retest method:
• Memory, practice, carryover effects and recall factors are minimised and they
do not affect the scores.
Limitations:
• It is difficult to construct two truly parallel forms of a test. In certain situations (e.g. the Rorschach) it is almost impossible.
• When the tests are not exactly equal in terms of content, difficulty and length, the comparison between the two sets of scores obtained from them may lead to erroneous decisions.
• Practice and carryover factors cannot be completely controlled.
• The testing conditions while administering the second form may not be the same. Besides, the testees may not be in a similar physical, mental or emotional state at both times of administration.
3. Split-Half Method:
In this method the test is divided into two halves and the relationship between the two half-scores is examined. The test is administered as a whole, but scores are computed separately for each half.
A common approach to splitting the test is to assign the odd-numbered items to one half and the even-numbered items to the other. The correlation between the two halves gives an estimate of the reliability of each half-test, but not of the reliability of the full test.
In this method the test is administered once to the sample; it is the most appropriate method for homogeneous tests. It provides a measure of the internal consistency of the test scores.
All the items of the test are generally arranged in increasing order of difficulty and
administered once on sample. After administering the test it is divided into two
comparable or similar or equal parts or halves.
The scores are arranged in two sets, obtained from the odd-numbered items and the even-numbered items separately. For example, suppose a test of 100 items is administered. The scores based on the 50 odd-numbered items (1, 3, 5, … 99) and the scores based on the 50 even-numbered items (2, 4, 6, … 100) are arranged separately: part 'A' consists of the odd-numbered items and part 'B' of the even-numbered items.
After obtaining the two scores on the odd and even-numbered items, the coefficient of correlation is calculated. It is really a correlation between two equivalent halves of scores obtained in one sitting. (To estimate the reliability of the full test, the Spearman-Brown Prophecy formula is used.)
While using this formula, it should be kept in mind that the variances of the odd and even halves should be equal. If this is not the case, Flanagan's and Rulon's formulae can be employed; these are simpler and do not involve computing the coefficient of correlation between the two halves.
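The whole procedure can be sketched as follows, with hypothetical 0/1 item data; the Spearman-Brown step-up used here is r_full = 2r_half / (1 + r_half):

```python
# Split-half reliability: correlate odd-item and even-item half scores,
# then step the half-test correlation up to full length with the
# Spearman-Brown prophecy formula (item data below are hypothetical).
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (sqrt(sum((a - mx) ** 2 for a in x)) *
                  sqrt(sum((b - my) ** 2 for b in y)))

# rows = examinees, columns = items scored 0 (wrong) / 1 (right)
items = [
    [1, 1, 0, 1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 0, 1],
    [1, 1, 0, 0, 1, 0, 1, 0],
]
odd  = [sum(row[0::2]) for row in items]  # items 1, 3, 5, 7
even = [sum(row[1::2]) for row in items]  # items 2, 4, 6, 8
r_half = pearson(odd, even)
r_full = 2 * r_half / (1 + r_half)  # Spearman-Brown correction
```

Note that r_full is always larger than r_half, reflecting the greater reliability of the longer (full) test.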
Advantages:
• Here we are not repeating the test or using a parallel form of it, so the testee (subject) is not tested twice. As such, there is no carry-over or practice effect.
• In this method, fluctuations in an individual's performance due to environmental or physical conditions are minimised.
• Because of single administration of test, day-to-day functions and problems
do not interfere.
• Difficulty of constructing parallel forms of test is eliminated.
Limitations:
• A test can be divided into two equal halves in a number of ways and the
coefficient of correlation in each case may be different.
• This method cannot be used for estimating reliability of speed tests.
• As the test is administered once, chance errors may affect the scores on the two halves in the same way, thus tending to make the reliability coefficient too high.
• This method is also not appropriate for heterogeneous tests.
In spite of all these limitations, the split-half method is considered the best of all the methods of measuring test reliability, as the data for determining reliability are obtained on one occasion, which reduces the time, labour and difficulty involved in a second or repeated administration.
Scorer Reliability:
Scorer reliability refers to the consistency with which different people who score the same test agree. It indicates how consistent test scores are if the test is scored by two or more people. For a test with a definite answer key, scorer reliability is of negligible concern. When the subject responds in his own words, handwriting and organisation of subject matter, scorer reliability matters.
Note that it can also be called inter-observer reliability when referring to observational research, where researchers observe the same behaviour independently (to avoid bias) and compare their data. If the data are similar, then they are reliable.
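A minimal sketch of this comparison, using simple percentage agreement between two observers (the behaviour codes below are hypothetical):

```python
# Inter-observer reliability as simple percentage agreement between two
# observers who independently coded the same behaviour (hypothetical codes).
obs_a = ["on-task", "off-task", "on-task", "on-task", "off-task", "on-task"]
obs_b = ["on-task", "off-task", "on-task", "off-task", "off-task", "on-task"]

agreements = sum(a == b for a, b in zip(obs_a, obs_b))
percent_agreement = agreements / len(obs_a)  # 5 of the 6 intervals agree
```

Percentage agreement is the simplest index; chance-corrected indices such as Cohen's kappa are preferred when agreement by chance is a concern.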
4. Internal Consistency Reliability (Method of Rational Equivalence):
This method is also known as 'Kuder-Richardson Reliability' or 'Inter-Item Consistency'. It is based on a single administration and on the consistency of responses to all the items.
In this method it is assumed that all items have the same difficulty value, that the correlations between items are equal, that all the items measure essentially the same ability, and that the test is homogeneous in nature.
This type of reliability refers to the degree to which each item on a test is measuring
the same thing as each other. This inter-item consistency is known as internal
consistency. (Like split-half method this method also provides a measure of internal
consistency).
(The most common way of finding inter-item consistency is through the formula developed by Kuder and Richardson (1937). This method enables one to compute the inter-correlations of the items of the test and the correlation of each item with all the items of the test. Cronbach called it the coefficient of internal consistency.)
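A sketch of the Kuder-Richardson formula 20 (KR-20) computation for dichotomously scored items; the item matrix below is hypothetical:

```python
# KR-20 for a test of dichotomous (0/1) items, computed from a single
# administration (hypothetical item data): rows = examinees, cols = items.
items = [
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 1, 0, 0, 1],
    [0, 1, 1, 1, 1],
]
k = len(items[0])                    # number of items
n = len(items)                       # number of examinees
totals = [sum(row) for row in items] # total score per examinee
mean = sum(totals) / n
var_total = sum((t - mean) ** 2 for t in totals) / n  # population variance

# p = proportion passing each item, q = 1 - p
pq_sum = 0.0
for j in range(k):
    p = sum(row[j] for row in items) / n
    pq_sum += p * (1 - p)

kr20 = (k / (k - 1)) * (1 - pq_sum / var_total)
```

Higher inter-item consistency (more homogeneous items) yields a KR-20 closer to 1.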
Advantages:
• This coefficient provides some indications of how internally consistent or
homogeneous the items of the tests are.
• The split-half method simply measures equivalence, but the rational equivalence method measures both equivalence and homogeneity.
• It is an economical method, as the test is administered only once.
• It requires neither the administration of two equivalent forms of the test nor splitting the test into two equal halves.
Limitations:
• The coefficient obtained by this method is generally somewhat lower than the coefficients obtained by other methods.
• If the items of the tests are not highly homogeneous, this method will yield
lower reliability coefficient.
• The Kuder-Richardson and split-half methods are not appropriate for speed tests.
Choosing a method:
• Test-retest: measuring a property that you expect to stay the same over time.
• Inter-rater: multiple researchers making observations or ratings about the same topic.
• Parallel forms: using two different tests to measure the same thing.
• Internal consistency: using a multi-item test where all the items are intended to measure the same variable.
Standard Error of Measurement:
The standard error of measurement provides an indication of how confident one may be that an individual's obtained score on any given test represents his or her true score.
The standard error of measurement is a statistic that indicates the variability of the
errors of measurement by estimating the average number of points by which
observed scores are away from true scores. To understand standard error of
measurement, an introduction of basic concepts in reliability theory is necessary.
Therefore, this entry first examines true scores and error of measurement and
variance components. Next, this entry discusses the standard error of measurement
and its uses. Last, this entry comments on methods for reducing measurement error.
It is calculated as:
SEM = s√(1 − R)
where s is the standard deviation of the test scores and R is the reliability coefficient of the test.
The higher the reliability coefficient, the more often a test produces consistent
scores.
There exists a simple relationship between the reliability coefficient of a test and the
standard error of measurement:
• The higher the reliability coefficient, the lower the standard error of
measurement.
• The lower the reliability coefficient, the higher the standard error of
measurement.
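This relationship can be sketched directly from the formula SEM = s√(1 − R), using illustrative values for s and R:

```python
# Standard error of measurement: SEM = s * sqrt(1 - R), where s is the
# standard deviation of the observed scores and R the reliability coefficient.
from math import sqrt

def sem(s, reliability):
    return s * sqrt(1 - reliability)

# Illustrative values with a score standard deviation of 10 points:
low_reliability_sem  = sem(10, 0.50)  # less reliable test -> larger SEM
high_reliability_sem = sem(10, 0.91)  # more reliable test -> smaller SEM (3.0)
```

As the reliability coefficient rises, the SEM shrinks, exactly as the two bullet points above state.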
Factors Affecting Reliability:
(A) Intrinsic Factors: The principal intrinsic factors (i.e. those factors which lie within the test itself) which affect reliability are:
(i) Length of the Test: One of the major factors that affect reliability is the length of
the test. A longer test provides a more adequate sample of behavior being measured
and is less disturbed by chance factors like guessing.
Reliability has a definite relation to the length of the test: the more items the test contains, the greater its reliability, and vice versa. Logically, the larger the sample of items we take from a given area of knowledge, skill and the like, the more reliable the test will be.
The number of times a test should be lengthened to get a desirable level of reliability is given by the Spearman-Brown formula:
n = r_d(1 − r_o) / [r_o(1 − r_d)]
where r_o is the obtained reliability and r_d the desired reliability.
Example: When a test has a reliability of 0.80, the number of times it has to be lengthened to get a reliability of 0.95 is estimated as:
n = 0.95 × (1 − 0.80) / [0.80 × (1 − 0.95)] = 0.19 / 0.04 = 4.75
Hence the test is to be lengthened 4.75 times. However, while lengthening the test one should see that the items added must satisfy conditions such as an equal range of difficulty, the desired discriminating power and comparability with the other test items.
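The lengthening calculation can be sketched as a one-line function (using the Spearman-Brown relation n = r_desired(1 − r_old) / [r_old(1 − r_desired)]):

```python
# Spearman-Brown: number of times a test must be lengthened (n) to raise
# its reliability from r_old to the desired r_new.
def lengthening_factor(r_old, r_new):
    return (r_new * (1 - r_old)) / (r_old * (1 - r_new))

n = lengthening_factor(0.80, 0.95)  # reproduces the 4.75 from the example
```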
(ii) Homogeneity of Items: Homogeneity of items has two aspects: item reliability and the homogeneity of the traits measured from one item to another. If the items measure different functions and the inter-correlations of the items are zero or near zero, then the reliability is zero or very low, and vice versa.
(iii) Difficulty Value of Items: The difficulty level and clarity of expression of a test item also affect the reliability of test scores. If the test items are too easy or too difficult for the group members, the test will tend to produce scores of low reliability, because such a test has a restricted spread of scores.
(iv) Reliability of the Scorer: The reliability of the scorer also influences the reliability of the test. If he is moody or of a fluctuating type, the scores will vary from one situation to another; mistakes on his part give rise to mistakes in the scores and thus lead to unreliability.
(v) Objectivity: Objectivity eliminates the biases, opinions or judgements of the person who checks the test. Socio-political beliefs should be set aside when checking the test.
(B) Extrinsic Factors: The important extrinsic factors (i.e. the factors which remain
outside the test itself) influencing the reliability are:
(i) Group variability: When the group of pupils being tested is homogeneous in
ability, the reliability of the test scores is likely to be lowered and vice-versa.
(ii) Guessing and chance errors: Guessing in a test gives rise to increased error variance and as such reduces reliability. For example, with two-alternative response options there is a 50% chance of answering an item correctly by guessing.
(iii) Heterogeneity of the students' group: Reliability is higher when test scores are spread out over a range of abilities, i.e. when the test-takers represent a variety of intellectual levels and skills.
(iv) Limited time: A test administered under a time limit tends to be more reliable than one conducted over a longer time; a longer time also increases the chances that a student might cheat.
Inter-observer reliability can be improved by:
• Training observers in the observation techniques being used and making sure everyone agrees with them.