Chapter 5: Validity and Reliability


OBJECTIVES
• Describe the concepts of validity, reliability, and their importance in scientific
research.
• Compare and contrast different kinds of validity and reliability.
• Discuss the threats to internal validity and how to counter them.

VALIDITY AND RELIABILITY


1- “how do we know that we are indeed measuring what we want to measure?”

2- “can we be sure that if we repeated the measurement, we would get the same
result?”.

The first question relates to validity and the second to reliability.

Validity and reliability are two important characteristics of behavioral measures and are referred to as psychometric properties.

It is important to bear in mind that validity and reliability are not an all-or-none issue but a matter of degree.

VALIDITY
Very simply, validity (accuracy) is the extent to which a test measures what
it is supposed to measure.

The question of validity is raised in the context of the following three points:
• the form of the test,
• the purpose of the test,
• the population for whom it is intended.

Therefore, we cannot ask the general question "Is this a valid test?". The question to ask is "How valid is this test for the decision that I need to make?" or "How valid is the interpretation I propose for the test?".

There are many different types of validity.



INFERENCE VALIDITY

Inference validity (also called "statistical conclusion validity") refers to the degree to which the conclusions or inferences drawn from a study are accurate and trustworthy.

In essence, inference validity helps assess whether the conclusions from a study are both internally consistent (internal validity) and broadly applicable (external validity).

INTERNAL VALIDITY

Internal validity: Determines whether the observed effects in the study are truly
due to the variables being studied (causality) and not due to confounding
factors.
A study with high internal validity effectively minimizes the influence of
confounding variables, ensuring that the observed effect is likely due to the
independent variable rather than other factors.
Internal validity can sometimes be checked via simulation, which can tell you
whether a given theorized process can in fact yield the outcomes that you claim it
does.
Internal validity is about being able to justify that X actually caused Y. We highlight the word actually because there are many different reasons that can make it difficult to know whether X causes Y.

EXTERNAL VALIDITY
External validity is about generalizing study results beyond the specific sample and setting studied.

Strategies for strengthening external validity:

• Sampling. Select cases from a known population via a probability sample (e.g., a simple random sample). This provides a strong basis for claiming the results apply to the population as a whole.
• Representativeness. Show the similarities between the cases you studied and the population you wish your results to apply to, then argue that the correlations found in your study should hold there as well.
• Replication. Repeat the study in multiple settings and use meta-analytic statistics to evaluate the results across studies. Although journal reviewers don't always agree, consistent results across many settings with small samples are more powerful than a large sample from a single setting.

CONSTRUCT VALIDITY

Construct validity is about how well a test measures the theoretical construct it is supposed to measure.

For example, does the Beck Depression Inventory (BDI) truly measure depression?
To check this, researchers look at studies that have used the BDI.

One way to evaluate construct validity is by comparing BDI scores of people with
depression to those without it. If people with depression score higher, it shows that
the BDI is accurately measuring depression, which supports its construct validity.

CONSTRUCT VALIDITY

A- Translation validity
• Content validity
• Face validity

B- Criterion-related validity
• Predictive validity
• Concurrent validity
• Convergent validity
• Discriminant validity

TRANSLATION VALIDITY

Translation validity refers to the subjective judgment of whether a measure accurately represents the construct it's intended to measure.

In other words, do the survey questions seem appropriate and make sense for measuring what you aim to assess?
CONTENT VALIDITY

Content validity refers to how well a test's questions represent the full range of the construct
being measured.

It involves a systematic review of the content to ensure it covers a representative sample of the
behavior or skills within that domain.

Since determining content validity is often subjective, it is typically assessed by consulting experts in the specific field. This allows them to review the test content and provide feedback on whether it adequately covers the relevant domain.

For example, an IQ test should include items that cover all key areas of intelligence discussed in the scientific literature, such as verbal, spatial, reasoning, and memory abilities. Consulting experts in intelligence and psychometrics helps ensure the test has strong content validity.

FACE VALIDITY

Face validity is about whether an instrument (like a survey or test) looks like it measures
what it's supposed to measure, based on appearance alone. However, this does not
mean that it actually measures the concept accurately; it just appears to.

It’s a kind of first impression of whether the test items seem appropriate. By carefully
looking at each item, we decide if it seems to fit with the concept we want to measure
— but this is based on how it looks rather than solid evidence.

Face validity is the weakest type of validity because it doesn’t prove the test actually
measures the concept; it only suggests that it might be on the right track. It's more about
whether the test "looks right" than if it actually works right.

CONTENT AND FACE VALIDITY

Content validity and face validity are similar because both check if the
questionnaire items represent the concept we want to measure. Some
researchers even suggest combining them as one aspect of validity.

Content validity is stronger than face validity because it requires careful and
detailed evaluation by experts. This ensures that the questionnaire fully covers
the concept based on established knowledge and theories.

Content validity and face validity differ in how they are evaluated. Face
validity is more general and often involves personal judgment. It might include
input from participants and doesn't always rely on established theories.

CRITERION VALIDITY

Criterion validity checks how well a measure matches a "gold standard" or a trusted,
established measurement for the same concept. It compares responses from a new
questionnaire with results from other validated tools or well-known standards.

To test criterion validity, we measure the same group with both the "gold standard"
and the new tool. We use a correlation coefficient (a statistical measure) to see how
closely the two sets of results match. This is considered one of the most practical and
objective ways to test validity.
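As a minimal sketch of this computation, assuming made-up scores for ten people measured with both a validated "gold standard" instrument and a new questionnaire:

    from scipy.stats import pearsonr

    gold_standard  = [12, 18, 9, 22, 15, 7, 19, 14, 11, 20]
    new_instrument = [10, 17, 11, 21, 14, 8, 18, 15, 10, 19]

    # the correlation coefficient shows how closely the two sets of results match
    r, p_value = pearsonr(gold_standard, new_instrument)
    print(f"criterion validity r = {r:.2f} (p = {p_value:.3f})")

A high r suggests the new tool tracks the established standard. The same computation serves both concurrent and predictive criterion validity; only the timing of the criterion measurement differs.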

Criterion validity is easier to confirm if there’s a clear standard to compare against.


However, for some things—like pain, which relies on personal experience—no true
"gold standard" exists. In those cases, we might have to use other indirect methods to
evaluate validity.

CRITERION VALIDITY
Concurrent and predictive validity are two types of criterion validity, based on timing.

• Concurrent validity checks whether test scores match results from a standard measure taken at the same time.
• Predictive validity checks whether test scores can accurately predict results that will be measured in the future.

When there's an already validated instrument, it can serve as the "gold standard" for comparison with a new test.

CONCURRENT CRITERION-RELATED VALIDITY.

Concurrent validity involves comparing a test with a standard measure taken at the same time. This means the test and the criterion are administered within the same general timeframe.

Concurrent validity is the most commonly used and referenced type of validity, as it helps to see if a new test can match up to an established one right away.
PREDICTIVE CRITERION-RELATED VALIDITY.

Predictive validity measures how well a test or assessment can forecast future performance or behavior. The criterion is the specific outcome or behavior we want to predict.

Example: If we want to know how well a college entrance exam predicts student success in college, we can compare students' exam scores to their final college GPA.
• High predictive validity: If students with high exam scores generally have high GPAs, the exam shows high predictive validity.
• Low predictive validity: If there's little or no relationship between exam scores and GPAs, the exam has low predictive validity.

Understanding predictive validity helps us assess the effectiveness of tests and make informed decisions about their use.
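A minimal sketch of this comparison, with invented entrance-exam scores and final GPAs for eight students:

    import numpy as np

    exam_scores = np.array([1200, 1350, 1100, 1450, 1250, 1000, 1400, 1150])
    college_gpa = np.array([3.1, 3.6, 2.8, 3.8, 3.2, 2.5, 3.7, 2.9])

    # the off-diagonal entry of the 2x2 correlation matrix is the validity coefficient
    r = np.corrcoef(exam_scores, college_gpa)[0, 1]
    print(f"predictive validity coefficient r = {r:.2f}")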

CONVERGENT AND DISCRIMINANT CRITERION-RELATED VALIDITY

Convergent validity: Checks if the measure correlates positively with other measures of the same or very similar constructs. Example:
• A new, shorter scale for measuring organizational commitment should have a strong correlation with the established, longer scale it aims to replace.

Discriminant validity: Ensures the measure correlates poorly with measures of different constructs. Examples (a code sketch follows this list):
• A test for measuring anxiety should not have a strong relationship with a test that measures general happiness, as they are different emotions.
• A test for measuring reading ability should not show a strong connection with a test that measures math skills, as they are separate areas of knowledge.
• A test measuring job satisfaction should not correlate too highly with a test measuring physical health, because they are different concepts.
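The following minimal sketch, on simulated data, shows the pattern one hopes to see: the new scale correlates strongly with an established measure of the same construct and weakly with a measure of an unrelated one (all names and values are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    established_scale = rng.normal(size=50)                        # same construct
    new_scale = established_scale + rng.normal(scale=0.5, size=50)
    unrelated_scale = rng.normal(size=50)                          # different construct

    r_convergent = np.corrcoef(new_scale, established_scale)[0, 1]   # expect high
    r_discriminant = np.corrcoef(new_scale, unrelated_scale)[0, 1]   # expect near 0
    print(f"convergent r = {r_convergent:.2f}, discriminant r = {r_discriminant:.2f}")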
THREATS TO INTERNAL VALIDITY

Single-group studies

A research team wants to study whether having indoor plants on office desks boosts the productivity of nurses at a hospital. The researchers give each participating nurse a plant to place by their desk for the month-long study. All participants complete a timed productivity task before (pre-test) and after the study (post-test).

• History: an unrelated event influences the outcomes. Example: a week before the end of the study, all nurses are told that there will be layoffs. The nurses are stressed on the date of the post-test, and performance may suffer.
• Maturation: the outcomes of the study vary as a natural result of time. Example: most nurses are new to the job at the time of the pre-test. A month later, their productivity has improved as a result of time spent working in the position.
• Instrumentation: different measures are used in the pre-test and post-test phases. Example: in the pre-test, productivity was measured for 15 minutes, while the post-test was over 30 minutes long.
• Testing: the pre-test influences the outcomes of the post-test. Example: nurses showed higher productivity at the end of the study because the same test was administered; due to familiarity, or awareness of the study's purpose, many nurses achieved high results.
THREATS TO INTERNAL VALIDITY
Multi-group studies
A researcher wants to compare whether a phone-based app or traditional flashcards are better for learning vocabulary for the SAT. They divide 11th graders from one school into three groups based on baseline (pre-test) vocabulary scores. For 15 minutes a day, Group A uses the phone-based app, Group B uses flashcards, and Group C spends the time reading as a control. Three months later, post-test measures of vocabulary are taken.

• Selection bias: groups are not comparable at the beginning of the study. Example: low scorers were placed in Group A, while high scorers were placed in Group B. Because there are already systematic differences between the groups at baseline, any improvements in group scores may be due to reasons other than the treatment.
• Regression to the mean: there is a statistical tendency for people who score extremely low or high on a test to score closer to the middle the next time. Example: because participants are placed into groups based on their initial scores, it's hard to say whether the outcomes are due to the treatment or to statistical norms.
• Social interaction: participants from different groups may compare notes and either figure out the aim of the study or feel resentful of others. Example: Groups B and C may resent Group A because of its access to a phone during class. As such, they could be demoralized and perform poorly.
• Attrition: dropout of participants. Example: 20% of participants provided unusable data, almost all of them from Group C. As a result, it's hard to compare the two treatment groups to a control group.

RELIABILITY

Reliability is the degree to which a test consistently measures whatever it measures.

Errors of measurement that affect reliability are random errors; errors of measurement that affect validity are systematic or constant errors.

Reliability can also be expressed in terms of the standard error of measurement, an estimate of how often you can expect errors of a given size.
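The slides don't give the formula; a standard textbook expression is SEM = SD * sqrt(1 - reliability), sketched here with assumed values:

    import math

    sd = 15.0           # standard deviation of the test scores (assumed)
    reliability = 0.90  # reliability coefficient (assumed)

    # SEM = SD * sqrt(1 - reliability); about 4.74 for these values
    sem = sd * math.sqrt(1 - reliability)
    print(f"standard error of measurement = {sem:.2f}")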

TEST-RETEST RELIABILITY

To estimate test-retest reliability:
• Administer the same test form to a group of examinees on two separate occasions, usually just a few days or weeks apart. This ensures the examinees' skills have not significantly changed through additional learning.
• Calculate the statistical correlation between the examinees' scores from the two administrations. This measures how similar the scores are, demonstrating the test's ability to produce stable, consistent results over time (see the sketch after this list).
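A minimal sketch of that second step, assuming made-up scores for eight examinees tested twice:

    from scipy.stats import pearsonr

    first_administration  = [85, 78, 92, 66, 74, 88, 70, 95]
    second_administration = [83, 80, 90, 64, 76, 87, 72, 93]

    r, _ = pearsonr(first_administration, second_administration)
    print(f"test-retest reliability r = {r:.2f}")

The equivalent-forms method described below reuses exactly this correlation step; only the source of the two score vectors differs.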

Potential issues with test-retest reliability include:
• Memory: Examinees may remember their responses from the first administration, affecting
their performance on the second.
• Maturation: Examinees may naturally improve or decline in the skills being tested over the
time between administrations.
• Learning: Examinees may learn or forget the material between administrations, leading to
score changes.

EQUIVALENT-FORMS OR ALTERNATE-FORMS RELIABILITY

This method checks how consistent the results are across different versions of the
same test.

How it works:
• Create two similar tests: Both tests measure the same concept but use different questions.
• Administer both tests: Give the two test forms to the same group of people, with a short time in between.
• Correlate the scores: Calculate the correlation between the scores from the two tests.

A high correlation indicates the forms are equivalent, and the test is reliable.

Challenge: Creating two truly equivalent forms can be difficult and time-
consuming.
INTER-RATER RELIABILITY

Inter-rater reliability (IRR) is the level of agreement between raters or judges.

If everyone agrees, IRR is 1 (or 100%); if everyone disagrees, IRR is 0 (0%).

Several methods exist for calculating IRR, from the simple (e.g. percent agreement) to the more complex (e.g. Cohen's Kappa). Which one you choose largely depends on what type of data you have and how many raters are in your model.
INTER-RATER RELIABILITY

Percent Agreement

• The simplest way to measure inter-rater reliability is to calculate the percentage of items that the judges agree on.
• This is known as percent agreement, which always ranges between 0 and 1, with 0 indicating no agreement between raters and 1 indicating perfect agreement between raters.
• For example, suppose two judges are asked to rate the difficulty of 10 items on a test on a scale of 1 to 3. For each question, we write "1" if the two judges agree and "0" if they don't. If the judges agree on 7 of the 10 questions, the percent agreement is 7/10 = 70%.
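A minimal sketch of this calculation, using illustrative ratings chosen so the judges agree on 7 of 10 items; Cohen's Kappa, mentioned above as a more complex alternative, is included to show a chance-corrected version:

    from collections import Counter

    # two judges rate the difficulty of 10 items on a 1-3 scale (invented data)
    judge_a = [1, 2, 3, 1, 2, 2, 3, 1, 2, 3]
    judge_b = [1, 2, 3, 2, 2, 1, 3, 1, 3, 3]
    n = len(judge_a)

    # percent agreement: share of items where the two ratings match
    p_o = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    print(f"percent agreement = {p_o:.0%}")  # 70%

    # Cohen's Kappa corrects for agreement expected by chance:
    # kappa = (p_o - p_e) / (1 - p_e)
    counts_a, counts_b = Counter(judge_a), Counter(judge_b)
    p_e = sum((counts_a[c] / n) * (counts_b[c] / n)
              for c in set(judge_a) | set(judge_b))
    kappa = (p_o - p_e) / (1 - p_e)
    print(f"Cohen's kappa = {kappa:.2f}")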

INTERNAL CONSISTENCY RELIABILITY

Internal consistency reliability measures how consistent the results are across different items in a test,
ensuring that items designed to measure the same construct give similar scores.

Simple example: you want to find out how satisfied your customers are with the level of customer service
they receive at your call center. You send out a survey with three questions designed to measure overall
satisfaction. Choices for each question are: Strongly agree/Agree/Neutral/Disagree/Strongly disagree.
• I was satisfied with my experience.
• I will probably recommend your company to others.
• If I write an online review, it would be positive.
If the survey has good internal consistency, respondents should answer the same for each question, i.e.
three “agrees” or three “strongly disagrees.” If different answers are given, this is a sign that your questions
are poorly worded and are not reliably measuring customer satisfaction.

Most researchers prefer to include at least two questions that measure the same thing (the above survey
has three).

INTERNAL CONSISTENCY RELIABILITY

There are three main techniques for measuring internal consistency reliability, each suited to different levels of complexity, scope, and the nature of the test.

These methods help ensure that the test results and the constructs being measured are accurate. The choice of technique depends on the subject, data set size, and available resources.

SPLIT-HALF RELIABILITY TEST

The split-halves test for internal consistency reliability is a simple method that divides a test into
two halves.

For example, a questionnaire measuring job satisfaction can be split into odd and even-
numbered questions. The results from both halves are analyzed, and if there is a weak
correlation, it suggests a reliability issue.

The division must be random. While split-halves testing was once popular due to its simplicity
and speed, more advanced methods are now preferred because computers can handle
complex calculations.

The split-halves test produces a correlation score between 0 and 1, with 1 indicating perfect
correlation.
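A minimal sketch on made-up 0/1 questionnaire responses, split into odd and even items as in the job-satisfaction example; the final Spearman-Brown step, a standard companion correction not mentioned in the slides, projects the half-test correlation up to full-test length:

    import numpy as np

    # rows = respondents, columns = eight questionnaire items (invented answers)
    scores = np.array([
        [1, 1, 0, 1, 1, 0, 1, 1],
        [0, 1, 0, 0, 1, 0, 0, 1],
        [1, 1, 1, 1, 1, 1, 1, 1],
        [0, 0, 0, 1, 0, 0, 1, 0],
        [1, 0, 1, 1, 1, 1, 1, 1],
    ])

    odd_half  = scores[:, 0::2].sum(axis=1)  # items 1, 3, 5, 7
    even_half = scores[:, 1::2].sum(axis=1)  # items 2, 4, 6, 8

    r_half = np.corrcoef(odd_half, even_half)[0, 1]
    # Spearman-Brown correction: estimate full-test reliability from the half-test r
    r_full = 2 * r_half / (1 + r_half)
    print(f"half-test r = {r_half:.2f}, corrected full-test reliability = {r_full:.2f}")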

KUDER-RICHARDSON TEST

The Kuder-Richardson test is a more advanced version of the split-halves test for internal
consistency reliability.

It calculates the average correlation for all possible split-half combinations in a test,
providing a more accurate result than the split-halves method. Like split-halves, it
generates a correlation score between 0 and 1.

However, the Kuder-Richardson test requires each question to have a simple right or
wrong answer (0 or 1).

For tests with multiple response options, more sophisticated methods are needed to
measure internal consistency reliability.
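The slides call this the Kuder-Richardson test; the sketch below implements the common KR-20 variant directly from its formula, on made-up right/wrong data:

    import numpy as np

    # rows = examinees, columns = items scored 1 (right) or 0 (wrong); data invented
    scores = np.array([
        [1, 1, 1, 0, 1],
        [1, 0, 1, 0, 0],
        [1, 1, 1, 1, 1],
        [0, 0, 1, 0, 0],
        [1, 1, 0, 1, 1],
        [0, 1, 1, 0, 1],
    ])

    k = scores.shape[1]
    p = scores.mean(axis=0)                     # proportion correct per item
    q = 1 - p
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of examinees' total scores

    # KR-20 = (k / (k - 1)) * (1 - sum(p * q) / total score variance)
    kr20 = (k / (k - 1)) * (1 - (p * q).sum() / total_var)
    print(f"KR-20 = {kr20:.2f}")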
CRONBACH'S ALPHA TEST

Cronbach's Alpha is a more advanced test for internal consistency reliability. It averages
the correlation between all possible split-half combinations and can handle multi-level
responses.

For example, it can be used for questions where respondents rate their answers on a scale from 1 to 5. Cronbach's Alpha produces a score between 0 and 1, with a value of 0.65 (65%) typically considered acceptable reliability.

This test also considers the sample size and number of response options, so a 40-question
test with 1-5 ratings is considered more reliable than a 10-question test with fewer
response levels.

Despite its efficiency, Cronbach's Alpha still requires computers or statistical software to
perform the calculations accurately.
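A minimal sketch computing alpha from the standard variance formula, alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores), using made-up ratings for the three-item satisfaction survey described earlier:

    import numpy as np

    # rows = respondents, columns = the three satisfaction items rated 1-5 (invented)
    ratings = np.array([
        [4, 5, 4],
        [2, 2, 3],
        [5, 5, 5],
        [3, 4, 3],
        [1, 2, 2],
        [4, 4, 5],
    ])

    k = ratings.shape[1]
    item_var_sum = ratings.var(axis=0, ddof=1).sum()  # sum of per-item variances
    total_var = ratings.sum(axis=1).var(ddof=1)       # variance of total scores

    alpha = (k / (k - 1)) * (1 - item_var_sum / total_var)
    print(f"Cronbach's alpha = {alpha:.2f}")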
