RELIABILITY & VALIDITY
Taking a closer look!
Facilitator: Shellon Samuels-White
Can a test be valid and not reliable?
A test can be reliable, but not valid. However, for a
test to be considered valid, it must be reliable.
Reliability provides the consistency needed to
obtain validity, and enables us to interpret
assessment results with greater confidence.
Factors such as test length and test duration can affect reliability.
A test with more items will generally have higher reliability, while insufficient testing time lowers it.
For example, if the time allowed is too short, test takers may become anxious and careless, reducing the
reliability of their scores.
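The effect of test length on reliability can be quantified with the Spearman-Brown prophecy formula, which predicts the reliability of a test after it is lengthened or shortened by a given factor. A minimal sketch; the reliability value and length factor below are purely illustrative:

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Estimate the reliability of a test lengthened (or shortened) by length_factor.

    reliability: reliability of the current test (e.g. 0.70)
    length_factor: new length / current length (e.g. 2.0 = twice as many items)
    """
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# Illustrative numbers: doubling a test whose current reliability is 0.70
print(round(spearman_brown(0.70, 2.0), 2))  # ≈ 0.82 — a longer test, higher estimated reliability
```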
RELIABILITY
Would you get the same results if you used
the assessment at different times?
If an assessment is being rated, would
different markers rate the same way?
To what extent will the results be free from
errors?
Reliability is the degree to which an assessment tool produces stable and consistent
results.
Testing for Reliability
TEST-RETEST METHOD
Reliability is obtained by administering the same test to the same
group after some time interval. The scores from Time 1 and Time 2
are then correlated to evaluate the test's stability over time. Use
this method when you are measuring something that you expect to stay
constant in your sample.
EXAMPLE:
A test designed to assess student reading levels could be given to a
group of students twice, with the second administration coming a
week after the first.
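The correlation step described above can be illustrated in a few lines of Python; the reading scores below are invented for illustration, and Pearson's r is one common choice of statistic:

```python
from scipy.stats import pearsonr

# Hypothetical reading scores for the same ten students, one week apart
time_1 = [62, 75, 81, 58, 90, 70, 66, 85, 77, 69]
time_2 = [64, 73, 83, 55, 92, 71, 68, 84, 75, 70]

r, p_value = pearsonr(time_1, time_2)
print(f"Test-retest reliability (Pearson r): {r:.2f}")  # values near 1 suggest stable scores
```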
PARALLEL FORMS METHOD (Equivalent/Alternate forms)
Reliability is obtained by administering different versions of an
assessment tool to the same group of individuals. Both versions must
contain items that probe the same skill, knowledge base, etc. The scores
from the two versions can then be correlated in order to evaluate the
consistency of results across alternate versions.
EXAMPLE
A teacher creates two versions of a fraction test (the parallel forms) and
administers them one after the other. The two sets of scores are then
compared; because they are almost identical, the test shows high
parallel-forms reliability.
INTERNAL CONSISTENCY METHOD:
Split Half Method
Internal consistency assesses the correlation between multiple items in a test that are intended to measure the same
construct.
Why it’s important
When you devise a set of questions or ratings that will be combined into an overall score, you have to make sure that all
of the items really do reflect the same thing. If responses to different items contradict one another, the test might be
unreliable.
SPLIT-HALF METHOD
This process begins by “splitting in half” all items of a test that are intended to probe the same area of knowledge. You
randomly split a set of measures into two sets. After testing the entire set on the respondents, you calculate the
correlation between the two sets of responses.
EXAMPLE
A group of respondents are presented with a set of statements designed to measure optimistic and pessimistic mindsets.
They must rate their agreement with each statement on a scale from 1 to 5. If the test is internally consistent, an
optimistic respondent should generally give high ratings to optimism indicators and low ratings to pessimism indicators.
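A rough sketch of the split-half calculation under the scenario above: the items are split at random into two halves, each respondent's half-scores are correlated, and the Spearman-Brown correction estimates the reliability of the full-length test. The ratings and the random split below are invented for illustration (pessimism items are assumed to be already reverse-scored):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-5 ratings: 8 respondents x 10 mindset items (high = more optimistic)
scores = np.array([
    [5, 4, 5, 4, 5, 4, 5, 5, 4, 5],
    [2, 1, 2, 2, 1, 2, 1, 2, 2, 1],
    [4, 4, 3, 4, 4, 3, 4, 4, 3, 4],
    [3, 3, 3, 2, 3, 3, 2, 3, 3, 3],
    [5, 5, 4, 5, 5, 4, 5, 5, 5, 4],
    [1, 2, 1, 1, 2, 1, 1, 1, 2, 1],
    [4, 3, 4, 4, 3, 4, 4, 3, 4, 4],
    [2, 2, 3, 2, 2, 3, 2, 2, 2, 3],
])

# Randomly split the 10 items into two halves and total each half per respondent
items = rng.permutation(scores.shape[1])
half_a = scores[:, items[:5]].sum(axis=1)
half_b = scores[:, items[5:]].sum(axis=1)

r_half = np.corrcoef(half_a, half_b)[0, 1]   # correlation between the two halves
r_full = (2 * r_half) / (1 + r_half)         # Spearman-Brown correction to full test length
print(f"Split-half r: {r_half:.2f}  Corrected reliability: {r_full:.2f}")
```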
◦ How are parallel forms and split-half reliability similar/different?
Parallel forms: Version A is administered at Time A and Version B at Time B (two versions of the test).
Split half: the first half and the second half of a single test are both scored from one administration at Time A.
INTER-RATER RELIABILITY METHOD
This is a measure of reliability used to assess the degree to which
different judges or raters agree in their assessment decisions. Inter-
rater reliability is useful because human observers will not
necessarily interpret answers the same way; raters may disagree as
to how well certain responses or material demonstrate knowledge
of the construct or skill being assessed.
EXAMPLE
Inter-rater reliability might be employed when different judges are
evaluating the degree to which art portfolios meet certain
standards. The correlation between the judges' different sets of
ratings is then calculated. If all the markers give similar ratings,
the portfolio task has high inter-rater reliability.
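The example above describes correlating the judges' results; when ratings are categorical (e.g. below / meets / exceeds a standard), Cohen's kappa is another widely used agreement statistic. A rough sketch with two hypothetical raters and invented judgments, using scikit-learn:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical standards-based judgments for 10 art portfolios by two raters
rater_1 = ["meets", "exceeds", "meets", "below", "meets",
           "exceeds", "below", "meets", "meets", "exceeds"]
rater_2 = ["meets", "exceeds", "meets", "below", "exceeds",
           "exceeds", "below", "meets", "meets", "meets"]

kappa = cohen_kappa_score(rater_1, rater_2)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0 = chance-level agreement
```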
IMPROVING INTER-RATER RELIABILITY
Clearly define your variables and the methods that will
be used to measure them.
Develop detailed, objective criteria for how the
variables will be rated, counted or categorized.
If multiple researchers are involved, ensure that they all
have exactly the same information and training.
Factors that lower the reliability of test scores
Test scores are based on too few items.
Remedy: Use longer tests or accumulate scores from several
short tests.
Testing conditions are inadequate.
Remedy: Arrange an opportune time for administration and
eliminate interruptions, noise and other disrupting factors.
Scoring is subjective.
Remedy: Prepare scoring keys and follow them carefully when
scoring essay answers.
The scale is said to be reliable but not a
valid measure of your weight. Can you explain
why?
VALIDITY
Validity is inferred from available evidence (not
measured).
Validity depends on many different types of evidence. If
we infer from our assessment results that students have
good ‘reasoning ability’, we would like some evidence to
support the claim that the results actually reflect that
construct.
Validity refers to the inferences drawn, not the instrument.
Types of Validity
FACE VALIDITY
Face validity considers how suitable the content of a test seems to
be on the surface. It is an informal and subjective assessment.
EXAMPLE
You create a survey to measure the regularity of people’s dietary
habits. You review the survey items, which ask questions about
every meal of the day and snacks eaten every day of the week. On
its surface, the survey seems like a good representation of what
you want to test, so you consider it to have high face validity.
CONSTRUCT VALIDITY
This is used to ensure that the measure actually measures what it is intended to
measure (i.e. the construct), and not other variables. A construct refers to a concept
or characteristic that can’t be directly observed, but can be measured by observing
other indicators that are associated with it. E.g. intelligence, obesity, job satisfaction,
or depression.
Example
If you develop a questionnaire to diagnose depression, you need to know: does the
questionnaire really measure the construct of depression? Or is it actually measuring
the respondent’s mood, self-esteem, or some other construct?
To achieve construct validity, the questionnaire must include only relevant questions
that measure known indicators of depression.
CONTENT VALIDITY
Content validity refers to the actual content within a test. A test that is valid in content should
adequately examine all aspects that define the objective. The determination of content-validity “should
include several teachers (and content experts when possible) in evaluating how well the test represents
the content taught.”
Content validity is not “tested for”. Rather, it is “assured” by the informed item selections made by
experts in the domain.
Example
A mathematics teacher develops an end-of-semester algebra test for her class. The test should cover every
form of algebra that was taught in the class. If some types of algebra are left out, then the results may not
be an accurate indication of students’ understanding of the subject. Similarly, if she includes questions
that are not related to algebra, the results are no longer a valid measure of algebra knowledge.
CRITERION VALIDITY
Criterion validity evaluates how closely the results of your test correspond to the results
of a different test. The criterion is an external measurement of the same thing. It is
usually an established or widely-used test that is already considered valid.
Example
A university professor creates a new test to measure applicants’ English writing ability.
To assess how well the test really does measure students’ writing ability, she finds an
existing test that is considered a valid measurement of English writing ability, and
compares the results when the same group of students takes both tests. If the outcomes
are very similar, the new test has high criterion validity.
What are some ways to improve validity?
Make sure your goals and objectives are clearly defined.
Expectations of students should be written down.
Match your assessment measure to your goals and objectives. Have
the test reviewed by faculty at other schools to obtain feedback
from an outside party who is less invested in the instrument.
Get students involved; have the students look over the assessment
for troublesome wording, or other difficulties.
If possible, compare your measure with other measures, or data that
may be available.
SOURCES
Brown, J. D. (1996). Testing in language programs. Upper Saddle River, NJ: Prentice Hall Regents, pp. 231-249.
Dimitrov, D. M., & Rumrill, P. D., Jr. (2003). Pretest-posttest designs and measurement of change. Work: A Journal of
Prevention, Assessment and Rehabilitation, 20(2), 159-165.
Gronlund, N., & Waugh, C. (2009). Assessment of student achievement (9th ed.). New York: Pearson.
Middleton, F. (2020, June 26). Types of reliability and how to measure them.
https://www.scribbr.com/methodology/types-of-reliability/
Middleton, F. (2020, June 26). The four types of validity. https://www.scribbr.com/methodology/types-of-validity/