SPL-3 Unit 2

UNIT 2: RELIABILITY OF TESTS

2.1: Reliability: Meaning, True score estimation

Reliability is the extent to which a test is repeatable and gives consistent scores.
Tests that are relatively free from measurement errors are said to be reliable.

Measurement errors are random. A person's test score might not reflect his or her true score because the person was sick, anxious, in a noisy room, in a hurry, etc.

The reliability of a test refers to its ability to yield consistent results from one set of measures to another; it is the extent to which the obtained test scores are free from internal defects.

Reliability also refers to the extent to which a test yields consistent results upon testing and re-testing. Reliability includes such terms as consistency, stability, replicability and repeatability.

2.2: Types: Test-retest, Split-half, Parallel-form and Scorer reliability

There are four procedures in common use for computing the reliability coefficient of a test.

These are:
1. Test-Retest (Repetition)
2. Alternate or Parallel Forms
3. Split-Half Technique
4. Rational Equivalence.

1. Test-Retest Method (Repetition):

Test-retest reliability is also known as "stability": the extent to which individuals tend to obtain a similar score upon retaking the same test.

To estimate reliability by means of the test-retest method, the same test is administered twice to the same group of pupils, with a given time interval between the two administrations of the test.

The resulting test scores are correlated, and this correlation coefficient provides a measure of stability; that is, it indicates how stable the test results are over a period of time.

Thus it means administering the same test with a particular time gap, say 21 to 28 days, and then calculating the correlation between the first and second administrations; if we get a high, significant correlation, we can say the test is reliable.

The estimate of reliability varies according to the length of the time interval allowed between the two administrations. The product-moment method of correlation is the usual method for estimating the reliability of two sets of scores. Thus, a high correlation between the two sets of scores indicates that the test is reliable: it shows that the scores obtained in the first administration resemble the scores obtained in the second administration of the same test.
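As a concrete illustration, here is a minimal Python sketch of this computation, using entirely hypothetical scores for ten pupils tested about three weeks apart (scipy is assumed to be available):

from scipy.stats import pearsonr

# Hypothetical scores for the same ten pupils, tested 21-28 days apart.
first_administration = [12, 15, 19, 22, 25, 28, 30, 33, 36, 40]
second_administration = [14, 14, 20, 21, 26, 27, 31, 34, 35, 41]

# The test-retest (stability) coefficient is the Pearson product-moment correlation.
r, p_value = pearsonr(first_administration, second_administration)
print(f"Test-retest coefficient: r = {r:.2f}, p = {p_value:.4f}")

A value of r close to 1 would indicate that the scores are stable over the interval.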

In this method the time interval plays an important role. If it is too small, say a day or two, the consistency of the results will be influenced by the carry-over effect, i.e., the pupils will remember some of their answers from the first administration to the second.

If the time interval is long, say a year, the results will be influenced not only by the inequality of testing procedures and conditions, but also by actual changes in the pupils over that period of time.

Test-retest reliability example

You devise a questionnaire to measure the IQ of a group of participants (a property that is unlikely to change significantly over time). You administer the test two months apart to the same group of people, but the results are significantly different, so the test-retest reliability of the IQ questionnaire is low.

Advantages:
• The self-correlation (test-retest) method is the most generally used way of estimating the reliability coefficient.
• It can be used conveniently in many different situations.
• A test of an adequate length can be used after an interval of many days
between successive testing.

Disadvantages:
• If the test is repeated immediately, many subjects will recall their first answers
and spend their time on new material, thus tending to increase their scores.
• Besides immediate memory effects, practice and the confidence induced by familiarity with the material will also affect scores when the test is taken a second time, so the index of reliability so obtained is less accurate.
• If the interval between tests is rather long (more than six months), growth and maturation will affect the scores and tend to lower the reliability index.
• If the test is repeated immediately or after a little time gap, there may be the
possibility of carry-over effect/transfer effect/memory/practice effect.
• Repeating the same test on the same group a second time makes the students disinterested, so they do not take part wholeheartedly.
• Sometimes uniformity of testing conditions is not maintained, which also affects the test scores.
• Pupils may discuss a few questions after the first administration, which can increase the scores on the second administration and so distort the reliability estimate.

2. Alternate or Parallel Forms Method:

Parallel-form reliability is also known as alternate-form reliability, equivalent-form reliability or comparable-form reliability. In this method two parallel or equivalent forms of a test are used. Estimating reliability by means of the equivalent-form method involves the use of two different but equivalent forms of the test. By parallel forms we mean that the forms are equivalent so far as the content, objectives, format, difficulty level and discriminating value of items, length of the test, etc. are concerned.

It refers to the degree to which two different forms of the same test yield similar results. Once we have a test, we develop a parallel form with similar constructs/items (but not the same items), allow at least a fifteen-minute gap between the two tests, and then compute the correlation to find the reliability.

Parallel tests have equal mean scores, variances and inter-correlations among items. That is, two parallel forms must be homogeneous or similar in all respects, but not a duplication of test items. The two equivalent forms should be as similar as possible in content, degree, mental processes tested, difficulty level and other aspects.

The reliability coefficient may be looked upon as the coefficient of correlation between the scores on two equivalent forms of a test. One form of the test is administered to the students, and immediately on finishing, the other form is given to the same group.

The scores thus obtained are correlated, which gives the estimate of reliability. The reliability so found is called the coefficient of equivalence.
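A minimal sketch of the coefficient of equivalence in Python, assuming hypothetical scores for the same ten pupils on two parallel forms, might look like this:

import numpy as np
from scipy.stats import pearsonr

# Hypothetical scores of the same pupils on Form A and Form B.
form_a = np.array([45, 50, 55, 60, 62, 68, 70, 75, 80, 85])
form_b = np.array([47, 49, 56, 58, 63, 66, 72, 74, 79, 86])

# Parallel forms should have roughly equal means and variances.
print("Means:", form_a.mean(), form_b.mean())
print("Variances:", form_a.var(ddof=1), form_b.var(ddof=1))

# The correlation between the two forms is the coefficient of equivalence.
r, _ = pearsonr(form_a, form_b)
print(f"Coefficient of equivalence: r = {r:.2f}")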

Parallel forms reliability example

A set of questions is formulated to measure financial risk aversion in a group of respondents. The questions are randomly divided into two sets, and the respondents are randomly divided into two groups. Both groups take both tests: group A takes test A first, and group B takes test B first. The results of the two tests are compared, and the results are almost identical, indicating high parallel forms reliability.

Advantages:
This procedure has certain advantages over the test-retest method:

• Here the same test is not repeated.

• Memory, practice, carryover effects and recall factors are minimised and they
do not affect the scores.

• Useful for the reliability of achievement tests.

• This method is one of the appropriate methods of determining the reliability of educational and psychological tests.

Limitations:
• It is difficult to have two parallel forms of a test. In certain situations (e.g. with the Rorschach) it is almost impossible.

• When the tests are not exactly equal in terms of content, difficulty and length, the comparison between the two sets of scores obtained from them may lead to erroneous decisions.
• Practice and carryover factors cannot be completely controlled.

• Moreover, administering two forms simultaneously creates boredom. That is why people prefer methods in which only one administration of the test is required.

• The testing conditions while administering the second form may not be the same. Besides, the testees may not be in a similar physical, mental or emotional state at both times of administration.

• Test scores on the second form of the test are generally somewhat higher.

Although difficult, carefully and cautiously constructed parallel forms would give us a reasonably satisfactory measure of reliability. For well-made standardised tests, the parallel-form method is usually the most satisfactory way of determining reliability.

3. Split-Half Method or Sub-divided Test Method:

The split-half method is an improvement over the earlier two methods, and it involves both the characteristics of stability and equivalence. The two methods of estimating reliability discussed above sometimes seem difficult: it may not be possible to use the same test twice or to obtain an equivalent form of the test. Hence, to overcome these difficulties, to reduce memory effects and to economise the test, it is desirable to estimate reliability through a single administration of the test.

In this method it is possible to divide a test into two halves and examine the relationship between the scores on the two halves. The test is administered as a whole, but test scores are computed separately for each half.

A common approach to splitting the test is to assign the odd-numbered items to one half and the even-numbered items to the other. The correlation between the two halves gives an estimate of the reliability of each half-test, but not of the reliability of the full test.

In this method the test is administered once to the sample, and it is the most appropriate method for homogeneous tests. This method provides the internal consistency of the test scores.

All the items of the test are generally arranged in increasing order of difficulty and the test is administered once to the sample. After administration the test is divided into two comparable or similar or equal parts or halves.

The scores are arranged in two sets, obtained from the odd-numbered items and the even-numbered items separately. For example, a test of 100 items is administered. Each individual's score based on the 50 odd-numbered items 1, 3, 5, … 99 and the score based on the even-numbered items 2, 4, 6, … 100 are arranged separately. Part 'A' consists of the odd-numbered items and part 'B' of the even-numbered items.
After obtaining the two scores on the odd- and even-numbered test items, the coefficient of correlation is calculated. It is really a correlation between two equivalent halves of scores obtained in one sitting. To estimate the reliability of the full test from this half-test correlation, the Spearman-Brown prophecy formula is used:

r(full test) = 2 × r(half test) / (1 + r(half test))

While using this formula, it should be kept in mind that the variances of the odd and even halves should be equal. If this is not the case, Flanagan's and Rulon's formulae can be employed; these formulae are simpler and do not involve computing the coefficient of correlation between the two halves.
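The odd-even split and the Spearman-Brown correction can be sketched in Python as follows; the 0/1 response matrix here is simulated, purely for illustration:

import numpy as np

# Simulate a hypothetical 0/1 response matrix: 50 pupils x 10 dichotomous items,
# driven by a latent ability so that the items correlate with one another.
rng = np.random.default_rng(0)
ability = rng.normal(size=50)
responses = (ability[:, None] + rng.normal(scale=0.8, size=(50, 10)) > 0).astype(int)

# Odd-numbered items (1, 3, 5, ...) are columns 0, 2, 4, ...;
# even-numbered items (2, 4, 6, ...) are columns 1, 3, 5, ...
odd_half = responses[:, 0::2].sum(axis=1)
even_half = responses[:, 1::2].sum(axis=1)

# Correlate the two halves, then apply the Spearman-Brown correction
# for the full-length test: r_full = 2r / (1 + r).
r_half = np.corrcoef(odd_half, even_half)[0, 1]
r_full = (2 * r_half) / (1 + r_half)
print(f"Half-test r = {r_half:.2f}, full-test reliability = {r_full:.2f}")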

Advantages:
• Here we are not repeating the test or using the parallel form of it and thus the
testee (subject) is not tested twice. As such, the carry over effect or practice
effect is not there.
• In this method, the fluctuations of the individual's ability because of environmental or physical conditions are minimised.
• Because of single administration of test, day-to-day functions and problems
do not interfere.
• Difficulty of constructing parallel forms of test is eliminated.

Limitations:
• A test can be divided into two equal halves in a number of ways and the
coefficient of correlation in each case may be different.
• This method cannot be used for estimating reliability of speed tests.
• As the test is administered once, the chance errors may affect the scores on
the two halves in the same way and thus tending to make the reliability
coefficient too high.
• This method is also not appropriate for heterogeneous tests.

In spite of all these limitations, the split-half method is considered the best of all the methods of measuring test reliability, as the data for determining reliability are obtained on a single occasion, reducing the time, labour and difficulty involved in a second or repeated administration.

Type of reliability and what it measures the consistency of:
• Test-retest: the same test over time.
• Inter-rater: the same test conducted by different people.
• Parallel forms: different versions of a test which are designed to be equivalent.
• Internal consistency: the individual items of a test.

4. Scorer Reliability:

Scorer reliability refers to the consistency with which different people who score the same test agree. It indicates how consistent test scores are if the test is scored by two or more people. For a test with a definite answer key, scorer reliability is of negligible concern. When the subject responds in his own words, handwriting and organization of subject matter, however, scorer reliability matters.

Note that it can also be called inter-observer reliability when referring to observational research. Here researchers observe the same behavior independently (to avoid bias) and compare their data. If the data are similar, the measure is reliable.
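A minimal sketch of scorer reliability in Python, assuming hypothetical marks awarded by two raters to the same ten essay answers:

from scipy.stats import pearsonr

# Hypothetical marks out of 10 given by two independent raters.
rater_1 = [7, 5, 8, 6, 9, 4, 7, 8, 5, 6]
rater_2 = [8, 5, 7, 6, 9, 5, 7, 8, 4, 6]

# Correlation between the raters, plus the simpler exact-agreement rate.
r, _ = pearsonr(rater_1, rater_2)
exact_agreement = sum(a == b for a, b in zip(rater_1, rater_2)) / len(rater_1)
print(f"Scorer reliability: r = {r:.2f}, exact agreement = {exact_agreement:.0%}")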
Internal Consistency Reliability (Method of Rational Equivalence):

This method is also known as "Kuder-Richardson reliability" or "inter-item consistency". It is based on a single administration and on the consistency of responses to all the items.

In this method it is assumed that all items have the same or equal difficulty value, that the correlations between the items are equal, that all the items measure essentially the same ability, and that the test is homogeneous in nature.

This type of reliability refers to the degree to which each item on a test is measuring
the same thing as each other. This inter-item consistency is known as internal
consistency. (Like split-half method this method also provides a measure of internal
consistency).

A test is internally consistent or homogeneous if an individual's responses to, or performance on, one item are related to his/her responses to all of the other items in the test.

(The most common way of finding inter-item consistency is through the formula developed by Kuder and Richardson (1937). This method enables one to compute the inter-correlations of the items of the test and the correlation of each item with all the items of the test. L. J. Cronbach called it the coefficient of internal consistency.)
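For dichotomous (0/1) items, the Kuder-Richardson formula 20 is KR-20 = (k / (k - 1)) × (1 - Σpq / σ²), where k is the number of items, p and q are the proportions passing and failing each item, and σ² is the variance of total scores. A sketch in Python, using a simulated response matrix purely for illustration:

import numpy as np

# Simulated 0/1 responses: 100 examinees x 12 dichotomous items,
# driven by a latent ability so that the items hang together.
rng = np.random.default_rng(1)
ability = rng.normal(size=100)
responses = (ability[:, None] + rng.normal(scale=0.8, size=(100, 12)) > 0).astype(int)

k = responses.shape[1]                         # number of items
p = responses.mean(axis=0)                     # proportion passing each item
q = 1 - p                                      # proportion failing each item
total_var = responses.sum(axis=1).var(ddof=1)  # variance of total scores

kr20 = (k / (k - 1)) * (1 - (p * q).sum() / total_var)
print(f"KR-20 internal consistency = {kr20:.2f}")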

Internal consistency example

A group of respondents are presented with a set of statements designed to measure optimistic and pessimistic mindsets. They must rate their agreement with each statement on a scale from 1 to 5. If the test is internally consistent, an optimistic respondent should generally give high ratings to optimism indicators and low ratings to pessimism indicators. The correlation is calculated between all the responses to the "optimistic" statements, but the correlation is very weak. This suggests that the test has low internal consistency.

Advantages:
• This coefficient provides some indications of how internally consistent or
homogeneous the items of the tests are.
• Split-half method simply measures the equivalence but rational equivalence
method measures both equivalence and homogeneity.
• Economical method as the test is administered once.
• It requires neither the administration of two equivalent forms of the test nor splitting the test into two equal halves.

Limitations:
• The coefficient obtained by this method is generally somewhat lower than the coefficients obtained by other methods.
• If the items of the test are not highly homogeneous, this method will yield a lower reliability coefficient.
• The Kuder-Richardson and split-half methods are not appropriate for speed tests.
Matching situations to types of reliability:
• Measuring a property that you expect to stay the same over time: test-retest.
• Multiple researchers making observations or ratings about the same topic: inter-rater.
• Using two different tests to measure the same thing: parallel forms.
• Using a multi-item test where all the items are intended to measure the same variable: internal consistency.

2.3: Standard error of measurement

The standard error of measurement is one of the core concepts in psychometrics. One of the primary assumptions of any assessment is that it is accurately and consistently measuring whatever it is we want to measure. We therefore need to demonstrate that it is doing so. There are a number of ways of quantifying this, and one of the most common is the SEM.

SEM is an index of the reliability of an assessment instrument, representing the variation of an individual's scores across multiple administrations of the same test. The larger the standard error of measurement, the greater the score variation across administrations.

The standard error of measurement provides an indication of how confident one may
be that an individual’s obtained score on any given measurement test represents his
or her true score.

A standard error of measurement, often denoted SEM, is a measure of how much measured test scores are spread around a "true" score for an individual when repeated measures are taken.

The standard error of measurement (SEm) is the standard deviation of the error of measurement in a test or experiment. It is closely associated with the error variance, which indicates the amount of variability in a test administered to a group that is caused by measurement error. The standard error of measurement is used to determine the effect of measurement error on individual results in a test and is a common tool in psychological research and standardized academic testing.

The standard error of measurement is a function of both the standard deviation of observed scores and the reliability of the test. When the test is perfectly reliable, the standard error of measurement equals 0. When the test is completely unreliable, the standard error of measurement is at its maximum, equal to the standard deviation of the observed scores.

The standard error of measurement serves in a complementary role to the reliability coefficient. Reliability can be understood as the degree to which a test is consistent, repeatable, and dependable. The reliability coefficient ranges from 0 to 1: when a test is perfectly reliable, all observed score variance is caused by true score variance, whereas when a test is completely unreliable, all observed score variance is a result of error. Although the reliability coefficient provides important information about the amount of error in a test measured in a group or population, it does not inform on the error present in an individual test score.

The Pearson product-moment coefficient of reliability is commonly used for the calculation of the standard error of measurement.

The standard error of measurement is a statistic that indicates the variability of the
errors of measurement by estimating the average number of points by which
observed scores are away from true scores. To understand standard error of
measurement, an introduction of basic concepts in reliability theory is necessary.
Therefore, this entry first examines true scores and error of measurement and
variance components. Next, this entry discusses the standard error of measurement
and its uses. Last, this entry comments on methods for reducing measurement error.

A standard error of measurement, often denoted SEm, estimates the variation around a "true" score for an individual when repeated measures are taken.

It is calculated as:

SEm = s√(1 - R)

where:

• s: the standard deviation of measurements
• R: the reliability coefficient of the test

Note that a reliability coefficient ranges from 0 to 1 and is calculated by administering a test to many individuals twice and calculating the correlation between their test scores.

The higher the reliability coefficient, the more often a test produces consistent
scores.
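A minimal worked example of the formula above in Python, with hypothetical values for s and R:

import math

s = 15.0   # hypothetical standard deviation of observed scores
R = 0.91   # hypothetical reliability coefficient of the test

sem = s * math.sqrt(1 - R)
print(f"SEm = {sem:.2f} score points")   # 15 x sqrt(0.09) = 4.50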

The standard error of measurement (SEm) estimates how repeated measures of a person on the same instrument tend to be distributed around his or her "true" score. The true score is always an unknown because no measure can be constructed that provides a perfect reflection of the true score. SEm is directly related to the reliability of a test; that is, the larger the SEm, the lower the reliability of the test and the less precision there is in the measures taken and scores obtained. Since all measurement contains some error, it is highly unlikely that any test will yield the same scores for a given person each time they are retested.
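In practice, the SEm is often used to put a confidence band around an observed score; the conventional 95% band is observed ± 1.96 × SEm. A sketch with hypothetical numbers:

# Hypothetical observed score and SEm (e.g. from the calculation above).
observed_score = 110
sem = 4.5

low = observed_score - 1.96 * sem
high = observed_score + 1.96 * sem
print(f"95% band for the true score: {low:.1f} to {high:.1f}")   # 101.2 to 118.8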

Reliability & Standard Error of Measurement

There exists a simple relationship between the reliability coefficient of a test and the
standard error of measurement:
• The higher the reliability coefficient, the lower the standard error of
measurement.
• The lower the reliability coefficient, the higher the standard error of
measurement.

The term standard error of measurement indicates the spread of measurement errors when estimating an examinee's true score from the observed score. The standard error of measurement is most frequently useful in test reliability. An observed score is an examinee's obtained score, or raw score, on a particular test. A true score would be determined if this particular test were given to a group of examinees 1,000 times, under identical conditions; the average of those observed scores would yield the best estimate of the examinees' true abilities. The standard deviation of those repeated scores around each examinee's average, across persons and administrations, gives the standard error of measurement. Observed score and true score can be used together to determine the amount of error:

Score observed = Score true + Score error.

2.4: Reliability- Influencing factors and improvement techniques

Reliability is the degree to which an assessment tool produces stable and consistent results. Some intrinsic and some extrinsic factors have been identified that affect the reliability of test scores.

(A) Intrinsic Factors: The principal intrinsic factors (i.e. those factors which lie
within the test itself) which affect the reliability are:

(i) Length of the Test: One of the major factors that affect reliability is the length of
the test. A longer test provides a more adequate sample of behavior being measured
and is less disturbed by chance factors like guessing.

Reliability has a definite relation with the length of the test. The more items the test contains, the greater will be its reliability, and vice-versa. Logically, the larger the sample of items we take of a given area of knowledge, skill and the like, the more reliable the test will be.

However, it is difficult to fix the maximum length of the test that ensures an appropriate value of reliability; the length of the test should not give rise to fatigue effects in the testees. Within that limit, it is advisable to use longer tests rather than shorter tests, since shorter tests are less reliable.

The number of times (n) a test should be lengthened to get a desirable level of reliability is given by the Spearman-Brown formula:

n = r(desired) × (1 - r(given)) / [r(given) × (1 - r(desired))]

Example: When a test has a reliability of 0.8, the number of times the test has to be lengthened to get a reliability of 0.95 is estimated in the following way:

n = 0.95 × (1 - 0.8) / [0.8 × (1 - 0.95)] = 0.19 / 0.04 = 4.75

Hence the test is to be lengthened 4.75 times. However, while lengthening the test one should see that the items added to increase its length must satisfy conditions such as an equal range of difficulty, the desired discriminating power and comparability with the other test items.
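The same calculation in Python, using the hypothetical figures from the example above:

# Spearman-Brown lengthening: how many times must the test be lengthened
# to move from the given reliability to the desired reliability?
r_given = 0.80
r_desired = 0.95

n = (r_desired * (1 - r_given)) / (r_given * (1 - r_desired))
print(f"Lengthen the test {n:.2f} times")   # 0.19 / 0.04 = 4.75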

(ii) Homogeneity of Items: Homogeneity of items has two aspects: item reliability
and the homogeneity of traits measured from one item to another. If the items
measure different functions and the inter-correlations of items are ‘zero’ or near to it,
then the reliability is ‘zero’ or very low and vice-versa.

(iii) Difficulty Value of Items: The difficulty level and clarity of expression of a test item also affect the reliability of test scores. If the test items are too easy or too difficult for the group members, the test will tend to produce scores of low reliability, because both kinds of test have a restricted spread of scores.

(iv) Test instructions: Clear and concise instructions increase reliability. Complicated and ambiguous directions give rise to difficulties in understanding the questions and the nature of the response expected from the testee, ultimately leading to low reliability.

(v) Reliability of the scorer: The reliability of the scorer also influences the reliability of the test. If he is moody or erratic, the scores will vary from one situation to another. Mistakes by the scorer give rise to mistakes in the score and thus lead to low reliability.

(vi) Objectivity: The biases, opinions or judgments of the person who checks the test should be eliminated. Socio-political beliefs should be set aside when checking the test.

(B) Extrinsic Factors: The important extrinsic factors (i.e. the factors which remain
outside the test itself) influencing the reliability are:

(i) Group variability: When the group of pupils being tested is homogeneous in
ability, the reliability of the test scores is likely to be lowered and vice-versa.

(ii) Guessing and chance errors: Guessing in a test gives rise to increased error variance and as such reduces reliability. For example, with two-alternative response options there is a 50% chance of answering an item correctly by guessing.

(iii) Environmental conditions: As far as practicable, the testing environment should be uniform. Arrangements should be such that light, sound and other comforts are equal for all testees; otherwise the reliability of the test scores will be affected.

(iv) Momentary fluctuations: Momentary fluctuations may raise or lower the reliability of the test scores. A broken pencil, momentary distraction by the sudden sound of a train running outside, anxiety regarding non-completion of homework, or making a mistake in an answer and having no way to change it are factors which may affect the reliability of test scores.

(v) Heterogeneity of the students' group: Reliability is higher when test scores are spread out over a range of abilities, i.e. when the test-takers represent a variety of intellectual levels and skills.
(vi) Limited time: A test taken under a fixed, limited time is more reliable than one conducted with a generous time allowance, partly because limited time reduces the chance that a student might cheat.

Reliability Improving Techniques

There are also several things that psychologists can do to improve reliability. Where observer scores do not significantly correlate, reliability can be improved by:

• Training observers in the observation techniques being used and making sure everyone agrees with them.
• Ensuring behavior categories have been operationalized, that is, objectively defined.
• Increasing the sample size.
• Controlling the testing conditions.
• Running a reliability analysis on the measurement tool.
