Chapter 5 - Validity and Reliability
OBJECTIVES
• Describe the concepts of validity, reliability, and their importance in scientific
research.
• Compare and contrast different kinds of validity and reliability.
• Discuss the threats to internal validity and how to counter them.
Two basic questions arise for any behavioral measure:
1- Validity asks, "does the test measure what it is supposed to measure?"
2- Reliability asks, "can we be sure that if we repeated the measurement, we would get the same result?"
Validity and reliability are two important characteristics of behavioral measures and are
referred to as psychometric properties.
It is important to bear in mind that validity and reliability are not an all-or-none issue but
a matter of degree.
VALIDITY
Very simply, validity (accuracy) is the extent to which a test measures what
it is supposed to measure.
Because a test is only valid for a particular purpose and interpretation of its scores, we
cannot ask the general question "Is this a valid test?". The questions to ask are "How valid
is this test for the decision that I need to make?" or "How valid is the interpretation I
propose for the test?".
INFERENCE VALIDITY
Inference validity concerns the conclusions (inferences) drawn from a study. It has two main
forms: internal validity and external validity.
INTERNAL VALIDITY
Internal validity: Determines whether the observed effects in the study are truly
due to the variables being studied (causality) and not due to confounding
factors.
A study with high internal validity effectively minimizes the influence of
confounding variables, ensuring that the observed effect is likely due to the
independent variable rather than other factors.
Internal validity can sometimes be checked via simulation, which can tell you
whether a given theorized process can in fact yield the outcomes that you claim it
does.
The idea that X causes Y is important because internal validity is about being able
to justify the claim that X actually caused Y. We emphasize the word actually because there
are many different reasons that can make it difficult to know whether X causes Y.
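As a minimal sketch of the simulation idea (all numbers and variable names below are hypothetical), the snippet generates data in which a confounder Z drives both X and Y. Even though X has no causal effect on Y, the two end up correlated, which is exactly the situation that threatens internal validity; holding the confounder fixed makes the spurious association largely disappear.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Hypothetical data-generating process: a confounder Z influences both X and Y,
# while X has NO direct causal effect on Y.
z = rng.normal(size=n)            # confounder (e.g., stress level)
x = 0.8 * z + rng.normal(size=n)  # "treatment" variable driven by Z
y = 0.8 * z + rng.normal(size=n)  # outcome, also driven by Z (not by X)

# X and Y are clearly correlated even though X does not cause Y.
print("corr(X, Y):", np.corrcoef(x, y)[0, 1])

# Holding the confounder roughly fixed (conditioning on a narrow band of Z)
# makes the spurious X-Y association largely disappear.
band = np.abs(z) < 0.1
print("corr(X, Y | Z ~ 0):", np.corrcoef(x[band], y[band])[0, 1])
```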
EXTERNAL VALIDITY
External validity is the extent to which the findings of a study can be generalized beyond its specific sample, setting, and time.
CONSTRUCT VALIDITY
Construct validity is about how well a test measures the theoretical construct (concept) it is supposed to measure.
For example, does the Beck Depression Inventory (BDI) truly measure depression?
To check this, researchers look at studies that have used the BDI.
One way to evaluate construct validity is by comparing BDI scores of people diagnosed with
depression to those of people without depression. If the depressed group consistently scores
higher, this supports the claim that the BDI measures depression, i.e. its construct validity.
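As a sketch of this known-groups comparison (the BDI scores below are invented for illustration, and SciPy is assumed to be available), an independent-samples t-test can check whether the group with depression scores reliably higher:

```python
from scipy.stats import ttest_ind

# Hypothetical BDI scores for a group diagnosed with depression and a
# comparison group without depression.
depressed_group = [31, 27, 35, 29, 33, 26, 30, 28]
control_group   = [ 9, 12,  7, 14, 10,  8, 11, 13]

t_stat, p_value = ttest_ind(depressed_group, control_group)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A large, significant difference in the expected direction supports the
# claim that the BDI measures depression (known-groups construct validity).
```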
CONSTRUCT VALIDITY
A- Translation validity
• Content validity
• Face validity
B- Criterion-related validity
• Predictive validity
• Concurrent validity
• Convergent validity
• Discriminant validity
CONTENT VALIDITY
Content validity refers to how well a test's questions represent the full range of the construct
being measured.
It involves a systematic review of the content to ensure it covers a representative sample of the
behavior or skills within that domain.
For example, an IQ test should include items that cover all key areas of intelligence discussed
in the scientific literature, such as verbal, spatial, reasoning, and memory abilities. Consulting
experts in intelligence and psychometrics helps ensure the test has strong content validity.
FACE VALIDITY
Face validity is about whether an instrument (like a survey or test) looks like it measures
what it's supposed to measure, based on appearance alone. However, this does not
mean that it actually measures the concept accurately; it just appears to.
It’s a kind of first impression of whether the test items seem appropriate. By carefully
looking at each item, we decide if it seems to fit with the concept we want to measure
— but this is based on how it looks rather than solid evidence.
Face validity is the weakest type of validity because it doesn’t prove the test actually
measures the concept; it only suggests that it might be on the right track. It's more about
whether the test "looks right" than if it actually works right.
Content validity and face validity are similar because both check if the
questionnaire items represent the concept we want to measure. Some
researchers even suggest combining them as one aspect of validity.
Content validity is stronger than face validity because it requires careful and
detailed evaluation by experts. This ensures that the questionnaire fully covers
the concept based on established knowledge and theories.
Content validity and face validity differ in how they are evaluated. Face
validity is more general and often involves personal judgment. It might include
input from participants and doesn't always rely on established theories.
CRITERION VALIDITY
Criterion validity checks how well a measure matches a "gold standard" or a trusted,
established measurement for the same concept. It compares responses from a new
questionnaire with results from other validated tools or well-known standards.
To test criterion validity, we measure the same group with both the "gold standard"
and the new tool. We use a correlation coefficient (a statistical measure) to see how
closely the two sets of results match. This is considered one of the most practical and
objective ways to test validity.
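A minimal sketch of how that correlation is computed in practice, assuming SciPy is available; the scores below are invented for illustration:

```python
from scipy.stats import pearsonr

# Hypothetical scores for the same 8 participants on an established
# "gold standard" instrument and on the new questionnaire being validated.
gold_standard = [12, 18, 25, 30, 22, 15, 28, 20]
new_measure   = [14, 20, 27, 33, 21, 17, 30, 22]

r, p_value = pearsonr(gold_standard, new_measure)
print(f"criterion validity coefficient r = {r:.2f} (p = {p_value:.4f})")
# A high positive r suggests the new measure agrees closely with the gold standard.
```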
Concurrent and predictive validity are two types of criterion validity, distinguished by timing.
• Concurrent validity checks whether test scores match results from a standard measure taken at the same time.
• Predictive validity checks whether test scores can accurately predict results that will be measured in the future.
Example: If we want to know how well a college entrance exam predicts student success in college, we can compare students' exam scores to their final college GPA.
• High predictive validity: if students with high exam scores generally have high GPAs, the exam shows high predictive validity.
• Low predictive validity: if there is little or no relationship between exam scores and GPAs, the exam has low predictive validity.
THREATS TO INTERNAL VALIDITY
• Single-group studies
A research team wants to study whether having indoor plants on office desks boosts the productivity of nurses at a hospital. The researchers give each participating nurse a plant to place by their desk for the month-long study. All participants complete a timed productivity task before the study (pre-test) and after it (post-test).
• History: An unrelated event influences the outcomes. Example: A week before the end of the study, all nurses are told that there will be layoffs. The nurses are stressed on the date of the post-test, and performance may suffer.
• Maturation: The outcomes of the study vary as a natural result of time. Example: Most nurses are new to the job at the time of the pre-test. A month later, their productivity has improved as a result of time spent working in the position.
• Instrumentation: Different measures are used in the pre-test and post-test phases. Example: In the pre-test, productivity was measured for 15 minutes, while the post-test was over 30 minutes long.
• Testing: The pre-test influences the outcomes of the post-test. Example: Nurses showed higher productivity at the end of the study because the same test was administered twice. Due to familiarity with the test, or awareness of the study's purpose, many nurses achieved high results.
THREATS TO INTERNAL VALIDITY
• Multi-group studies
A researcher wants to compare whether a phone-based app or traditional flashcards are better for learning vocabulary for the SAT.
They divide 11th graders from one school into three groups based on baseline (pre-test) scores on vocabulary. For 15 minutes a day,
Group A uses the phone-based app, Group B uses flashcards, while Group C spends the time reading as a control. Three months later,
post-test measures of vocabulary are taken.
• Selection bias: Groups are not comparable at the beginning of the study. Example: Low-scorers were placed in Group A, while high-scorers were placed in Group B. Because there are already systematic differences between the groups at baseline, any improvements in group scores may be due to reasons other than the treatment.
• Regression to the mean: There is a statistical tendency for people who score extremely low or high on a test to score closer to the middle the next time. Example: Because participants are placed into groups based on their initial scores, it's hard to say whether the outcomes are due to the treatment or to this statistical tendency.
• Social interaction: Participants from different groups may compare notes and either figure out the aim of the study or feel resentful of others. Example: Groups B and C may resent Group A because of the access to a phone during class. As such, they could be demoralized and perform poorly.
• Attrition: Participants drop out of the study. Example: 20% of participants provided unusable data, almost all of them from Group C. As a result, it's hard to compare the two treatment groups to a control group.
RELIABILITY
Reliability (consistency) is the extent to which a measure yields the same results on repeated measurements.
Errors of measurement that affect reliability are random errors, whereas errors of
measurement that affect validity are systematic (constant) errors.
PARALLEL-FORMS (EQUIVALENT-FORMS) RELIABILITY
This method checks how consistent the results are across two equivalent versions (forms) of the
same test.
How it works:
• Create two similar tests: Both tests measure the same concept but use different questions.
• Administer both tests: Give the two test forms to the same group of people, with a short time in between.
• Correlate the scores: Calculate the correlation between the scores from the two tests.
A high correlation indicates the forms are equivalent, and the test is reliable.
Challenge: Creating two truly equivalent forms can be difficult and time-
consuming.
INTER-RATER RELIABILITY
Inter-rater reliability (IRR) is the degree of agreement among two or more raters who independently rate the same items.
Several methods exist for calculating IRR, from the simple (e.g. percent agreement) to the
more complex (e.g. Cohen's Kappa). Which one you choose largely depends on what type of
data you have and how many raters are in your model.
• Percent Agreement
The simplest way to measure inter-rater reliability is to calculate the percentage of items that
the judges agree on. This is known as percent agreement, which always ranges between 0 and 1,
with 0 indicating no agreement between raters and 1 indicating perfect agreement.
For example, suppose two judges are asked to rate the difficulty of 10 items on a test on a
scale of 1 to 3. For each item, we write "1" if the two judges agree and "0" if they don't.
If the judges agree on 7 of the 10 items, the percent agreement is 7/10 = 70%.
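A minimal sketch of this calculation (the ten ratings below are invented so that the judges agree on 7 of 10 items); Cohen's Kappa from scikit-learn is shown as well, since it corrects raw agreement for agreement expected by chance:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical difficulty ratings (scale 1-3) from two judges on 10 test items,
# constructed so that the judges agree on 7 of the 10 items.
judge_a = [1, 2, 3, 1, 2, 2, 3, 1, 2, 3]
judge_b = [1, 2, 3, 2, 2, 1, 3, 1, 3, 3]

agreements = [1 if a == b else 0 for a, b in zip(judge_a, judge_b)]
percent_agreement = sum(agreements) / len(agreements)
print(f"percent agreement = {percent_agreement:.0%}")        # 70%

# Cohen's kappa adjusts the raw agreement for agreement expected by chance.
print(f"Cohen's kappa     = {cohen_kappa_score(judge_a, judge_b):.2f}")
```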
INTERNAL CONSISTENCY RELIABILITY
Internal consistency reliability measures how consistent the results are across different items within a test,
ensuring that items designed to measure the same construct give similar scores.
Simple example: you want to find out how satisfied your customers are with the level of customer service
they receive at your call center. You send out a survey with three questions designed to measure overall
satisfaction. Choices for each question are: Strongly agree/Agree/Neutral/Disagree/Strongly disagree.
• I was satisfied with my experience.
• I will probably recommend your company to others.
• If I write an online review, it would be positive.
If the survey has good internal consistency, respondents should give similar answers to all three
questions, e.g. three "agrees" or three "strongly disagrees." If the answers diverge widely, this may
be a sign that the questions are poorly worded and are not reliably measuring customer satisfaction.
Most researchers prefer to include at least two questions that measure the same thing (the above survey
has three).
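One quick way to inspect this kind of internal consistency is to look at the correlations between the items. The sketch below uses invented responses to the three survey questions and assumes pandas is available:

```python
import pandas as pd

# Hypothetical responses from 6 customers to the three satisfaction items,
# coded 1 = Strongly disagree ... 5 = Strongly agree.
responses = pd.DataFrame({
    "satisfied":  [5, 4, 2, 5, 3, 1],
    "recommend":  [5, 4, 2, 4, 3, 1],
    "review_pos": [4, 4, 1, 5, 3, 2],
})

# High positive inter-item correlations suggest the three items are
# measuring the same underlying construct (customer satisfaction).
print(responses.corr())
```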
Several techniques exist for assessing internal consistency, including the split-halves test, the
Kuder-Richardson test, and Cronbach's Alpha. These methods help ensure that the items of a test
consistently measure the intended construct. The choice of technique depends on the subject,
the size of the data set, and the available resources.
SPLIT-HALVES TEST
The split-halves test for internal consistency reliability is a simple method that divides a test into
two halves.
For example, a questionnaire measuring job satisfaction can be split into odd and even-
numbered questions. The results from both halves are analyzed, and if there is a weak
correlation, it suggests a reliability issue.
The division must be random. While split-halves testing was once popular due to its simplicity
and speed, more advanced methods are now preferred because computers can handle
complex calculations.
The split-halves test produces a correlation score between 0 and 1, with 1 indicating perfect
correlation.
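A minimal sketch of the odd/even split described above, using invented item scores; the Spearman-Brown correction (a standard step, although not mentioned above) scales the half-test correlation up to estimate the reliability of the full-length test:

```python
import numpy as np

# Hypothetical item scores (rows = 8 respondents, columns = 6 questionnaire items).
scores = np.array([
    [4, 5, 4, 4, 5, 4],
    [2, 1, 2, 2, 1, 2],
    [3, 3, 4, 3, 3, 4],
    [5, 5, 5, 4, 5, 5],
    [1, 2, 1, 1, 2, 2],
    [4, 4, 3, 4, 4, 3],
    [2, 2, 2, 3, 2, 2],
    [5, 4, 5, 5, 4, 5],
])

# Split the test into odd- and even-numbered items and total each half.
odd_half  = scores[:, 0::2].sum(axis=1)
even_half = scores[:, 1::2].sum(axis=1)

r_half = np.corrcoef(odd_half, even_half)[0, 1]

# Spearman-Brown correction: estimate the reliability of the full-length test
# from the correlation between its two halves.
split_half_reliability = 2 * r_half / (1 + r_half)
print(f"half-test correlation r = {r_half:.2f}")
print(f"split-half reliability  = {split_half_reliability:.2f}")
```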
KUDER-RICHARDSON TEST
The Kuder-Richardson test is a more advanced version of the split-halves test for internal
consistency reliability.
It calculates the average correlation for all possible split-half combinations in a test,
providing a more accurate result than the split-halves method. Like split-halves, it
generates a correlation score between 0 and 1.
However, the Kuder-Richardson test requires each question to have a simple right or
wrong answer (0 or 1).
For tests with multiple response options, more sophisticated methods are needed to
measure internal consistency reliability.
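A minimal sketch of the Kuder-Richardson formula 20 (KR-20) on invented right/wrong data:

```python
import numpy as np

# Hypothetical right/wrong (1/0) answers: rows = 6 examinees, columns = 5 items.
answers = np.array([
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0],
])

k = answers.shape[1]                        # number of items
p = answers.mean(axis=0)                    # proportion answering each item correctly
q = 1 - p                                   # proportion answering each item incorrectly
total_variance = answers.sum(axis=1).var()  # variance of examinees' total scores

# KR-20 = k/(k-1) * (1 - sum(p*q) / variance of total scores)
kr20 = (k / (k - 1)) * (1 - (p * q).sum() / total_variance)
print(f"KR-20 = {kr20:.2f}")
```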
CRONBACH'S ALPHA TEST
Cronbach's Alpha is a more advanced test for internal consistency reliability. It averages
the correlation between all possible split-half combinations and can handle multi-level
responses.
For example, it can be used for questions where respondents rate their answers on a
scale from 1 to 5. Cronbach's Alpha produces a score between 0 and 1, with a value of
0.65 or higher typically considered acceptable reliability.
The result is also sensitive to the number of items and response options, so, all else being
equal, a 40-question test with 1-5 ratings will tend to yield a higher alpha than a 10-question
test with fewer response levels.
Despite its efficiency, Cronbach's Alpha still requires computers or statistical software to
perform the calculations accurately.
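In practice the calculation is short; the sketch below implements the standard formula, alpha = k/(k-1) × (1 - sum of item variances / variance of total scores), on invented 1-5 ratings:

```python
import numpy as np

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """Cronbach's alpha for a (respondents x items) matrix of scores."""
    k = item_scores.shape[1]
    item_variances = item_scores.var(axis=0, ddof=1)
    total_variance = item_scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical 1-5 ratings: rows = 8 respondents, columns = 4 questionnaire items.
ratings = np.array([
    [4, 5, 4, 5],
    [2, 2, 3, 2],
    [5, 5, 4, 5],
    [3, 3, 3, 4],
    [1, 2, 1, 1],
    [4, 4, 5, 4],
    [2, 3, 2, 2],
    [5, 4, 5, 5],
])

print(f"Cronbach's alpha = {cronbach_alpha(ratings):.2f}")
```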