RELIABILITY AND VALIDITY

RELIABILITY

Reliability

 When a test is reliable, it provides dependable, consistent results and, for this reason, the term consistency is often given as a synonym for reliability (e.g., Anastasi, 1988).

Consistency = Reliability
 In the psychometric sense it really only refers to something that
is consistent—not necessarily consistently good or bad, but
simply consistent
Variance

 A statistic useful in describing sources of test score variability is the variance (σ²), the standard deviation squared. This statistic is useful because it can be broken into components.
 Variance from true differences is true variance, and variance from irrelevant, random sources is error variance. The total variance is the sum of the true variance and the error variance.
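 In symbols, this is the classical test theory decomposition of observed-score variance (the formula restates the bullet above; the notation is added here for reference):

\sigma^2_{total} = \sigma^2_{true} + \sigma^2_{error}

 For example, if the true variance of a set of scores were 80 and the error variance 20, the total variance would be 100.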
Reliability

 The term reliability refers to the proportion of the total variance attributed to true variance.
 The greater the proportion of the total variance attributed to true
variance, the more reliable the test.
 Because true differences are assumed to be stable, they are presumed
to yield consistent scores on repeated administrations of the same test
as well as on equivalent forms of tests.
 Because error variance may increase or decrease a test score by varying
amounts, consistency of the test score—and thus the reliability—can be
affected.
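 As a formula (a standard classical test theory expression, added for illustration), reliability is the ratio of true variance to total variance:

r_{xx} = \sigma^2_{true} / \sigma^2_{total}

 Continuing the hypothetical example above: with a true variance of 80 and a total variance of 100, reliability would be 80/100 = .80, meaning 80% of the score variability reflects true differences among examinees.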
Methods for Estimating
Reliability
 The selection of a method for estimating reliability depends on the
nature of the test.
 Each method not only entails different procedures but is also affected
by different sources of error. For many tests, more than one method
should be used
 4 types of reliability:
1. Test Re-test Reliability
2. Alternate (Equivalent, Parallel) Forms Reliability
3. Internal Consistency Reliability
4. Inter-Rater (Inter-Scorer, Inter-Observer) Reliability
Test-Retest Reliability

 The test-retest method for estimating reliability involves administering the same test to the same group of examinees on two different occasions and then correlating the two sets of scores.
 When using this method, the reliability coefficient indicates the degree
of stability (consistency) of examinees' scores over time and is also
known as the coefficient of stability.
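 A minimal sketch of how a coefficient of stability could be computed; the scores and variable names below are hypothetical, not taken from any dataset mentioned in these slides.

import numpy as np
from scipy.stats import pearsonr

# Hypothetical scores for the same six examinees on two occasions
scores_time1 = np.array([23, 31, 27, 35, 29, 40])
scores_time2 = np.array([25, 30, 28, 36, 27, 41])

# The test-retest reliability estimate is the Pearson correlation
# between the two administrations (the coefficient of stability)
r_stability, _ = pearsonr(scores_time1, scores_time2)
print(f"Test-retest reliability: {r_stability:.2f}")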
Cont..
 The primary sources of measurement error for test-retest reliability
are any random factors related to the time that passes between the
two administrations of the test.
 These time sampling factors include random fluctuations in
examinees over time (e.g., changes in anxiety or motivation) and
random variations in the testing situation.
 Memory and practice also contribute to error when they have random
carryover effects; i.e., when they affect many or all examinees but not
in the same way.
 Test-retest reliability is appropriate for determining the reliability of
tests designed to measure attributes that are relatively stable over
time and that are not affected by repeated measurement.
 It would be appropriate for a test of aptitude, which is a stable characteristic, but not for a test of mood, since mood fluctuates over time, or for a test of creativity, which might be affected by previous exposure to test items.
Alternate (Equivalent, Parallel)
Forms Reliability
 To assess a test's alternate forms reliability, two equivalent forms of
the test are administered to the same group of examinees and the
two sets of scores are correlated.
 Alternate forms reliability indicates the consistency of responding to
different item samples (the two test forms).
 The alternate forms reliability coefficient is also called the coefficient of equivalence when the two forms are administered at about the same time.
Cont..

 The primary source of measurement error for alternate forms reliability is content sampling, or error introduced by an interaction between different examinees' knowledge and the different content assessed by the items included in the two forms (e.g., Form A and Form B).
 The items in Form A might be a better match of one examinee's
knowledge than items in Form B, while the opposite is true for another
examinee.
 In this situation, the two scores obtained by each examinee will differ,
which will lower the alternate forms reliability coefficient.
Cont..

 If the same strategies required to solve problems on Form A are used to solve problems on Form B, even if the problems on the two forms are not identical, there are likely to be practice effects.
 When these effects differ for different examinees (i.e., are random),
practice will serve as a source of error.
 Although alternate forms reliability is considered by some experts to be
the most rigorous (and best) method for estimating reliability, it is not
often assessed due to the difficulty in developing forms that are truly
equivalent.
 Alternate forms reliability is not appropriate when the attribute
measured by the test is likely to be affected by repeated measurement.
Internal Consistency Reliability

 Reliability can also be estimated by measuring the internal consistency of a test.
 Split-half reliability and coefficient alpha are two methods for
evaluating internal consistency.
 Both involve administering the test once to a single group of
examinees, and both yield a reliability coefficient that is also known
as the coefficient of internal consistency.
Split-half Reliability

 To determine a test's split-half reliability, the test is split into equal halves so that each examinee has two scores (one for each half of the test).
 Scores on the two halves are then correlated.
 Tests can be split in several ways, but probably the most common
way is to divide the test on the basis of odd- versus even-numbered
items
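 A minimal sketch of an odd/even split-half estimate, assuming a small hypothetical matrix of item scores (rows = examinees, columns = items); the data are illustrative only.

import numpy as np
from scipy.stats import pearsonr

# Hypothetical item scores: 6 examinees x 10 items (1 = correct, 0 = incorrect)
items = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1, 1, 0],
    [1, 0, 0, 1, 0, 0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 1],
    [0, 1, 0, 0, 1, 0, 0, 1, 0, 0],
    [1, 1, 1, 0, 1, 1, 1, 0, 1, 1],
    [0, 0, 1, 0, 0, 1, 0, 0, 0, 1],
])

# Odd-numbered items are columns 0, 2, 4, ... (items 1, 3, 5, ...)
odd_half = items[:, 0::2].sum(axis=1)
even_half = items[:, 1::2].sum(axis=1)

# Split-half reliability is the correlation between the two half-test scores
r_half, _ = pearsonr(odd_half, even_half)
print(f"Uncorrected split-half reliability: {r_half:.2f}")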
Cont..

 A problem with the split-half method is that it produces a reliability coefficient that is based on test scores derived from only one-half of the full length of the test.
 If a test contains 30 items, each score is based on 15 items. Because
reliability tends to decrease as the length of a test decreases, the
split-half reliability coefficient usually underestimates a test's true
reliability.
 For this reason, the split-half reliability coefficient is ordinarily
corrected using the Spearman-Brown formula, which provides an
estimate of what the reliability coefficient would have been had it
been based on the full length of the test.
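 The Spearman-Brown correction for a half-length split is a standard formula, stated here for reference:

r_{SB} = \frac{2\,r_{half}}{1 + r_{half}}

 For example, if the correlation between the two halves is .70, the estimated reliability of the full-length test is 2(.70) / (1 + .70) ≈ .82.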
Coefficient alpha

 Cronbach's coefficient alpha also involves administering the test once to a single group of examinees. However, rather than splitting the test in half, a special formula is used to determine the average degree of inter-item consistency.
 One way to interpret coefficient alpha is as the average reliability that
would be obtained from all possible splits of the test.
 When test items are scored dichotomously (right or wrong), a
variation of coefficient alpha known as the Kuder-Richardson Formula
20 (KR-20) can be used.
 In contrast to KR-20, which is appropriately used only on tests with
dichotomous items, coefficient alpha is appropriate for use on tests
containing non-dichotomous items.
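 A minimal sketch of the coefficient alpha computation, written directly from the standard formula alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores); the response matrix is hypothetical.

import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha for an examinees-by-items score matrix."""
    k = items.shape[1]                         # number of items
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical Likert-type responses: 5 examinees x 4 items
responses = np.array([
    [4, 5, 4, 4],
    [2, 3, 2, 3],
    [5, 5, 4, 5],
    [3, 3, 3, 2],
    [1, 2, 2, 1],
])
# With dichotomous (0/1) items, the same computation yields KR-20
print(f"Coefficient alpha: {cronbach_alpha(responses):.2f}")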
Cont..

 Content sampling is a source of error for both split-half reliability and coefficient alpha.
 For split-half reliability, content sampling refers to the error resulting
from differences between the content of the two halves of the test (i.e.,
the items included in one half may better fit the knowledge of some
examinees than items in the other half);
 For coefficient alpha, content (item) sampling refers to differences
between individual test items rather than between test halves.
Cont..

 Coefficient alpha also has the heterogeneity of the content domain as a source of error.
 The greater the heterogeneity of the content domain, the lower the
inter-item correlations and the lower the magnitude of coefficient alpha.
 Coefficient alpha could be expected to be smaller for a 200-item test
that contains items assessing knowledge of test construction, statistics,
ethics, epidemiology, environmental health, social and behavioral
sciences, rehabilitation counseling, etc. than for a 200-item test that
contains questions on test construction only.
Cont..

 The methods for assessing internal consistency reliability are useful when
 a test is designed to measure a single characteristic
 the characteristic measured by the test fluctuates over time
 scores are likely to be affected by repeated exposure to the test
Cont..

 Coefficient alpha typically ranges in value from 0 to 1. The reason for this is that, conceptually, coefficient alpha (much like other coefficients of reliability) is calculated to help answer questions about how similar sets of data are.
 Here, similarity is gauged, in essence, on a scale from 0 (absolutely no
similarity) to 1 (perfectly identical).
 A myth about alpha is that “bigger is always better.” As Streiner (2003)
pointed out, a value of alpha above .90 may be “too high” and indicate
redundancy in the items
Inter-Rater (Inter-Scorer, Inter-
Observer) Reliability

 Inter-rater reliability is of concern whenever test scores depend on a rater's judgment.
 A test constructor would want to make sure that an essay test, a behavioral observation scale, or a projective personality test has adequate inter-rater reliability.
 This type of reliability is assessed either by calculating a correlation
coefficient (e.g., a kappa coefficient or coefficient of concordance) or
by determining the percent agreement between two or more raters.
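 A minimal sketch of two common agreement indices for two raters (percent agreement and Cohen's kappa), computed from scratch on hypothetical behavior codes; the categories and data are illustrative.

import numpy as np

# Hypothetical codes assigned by two raters to the same 10 observations
rater_a = np.array(["on-task", "off-task", "on-task", "on-task", "off-task",
                    "on-task", "on-task", "off-task", "on-task", "on-task"])
rater_b = np.array(["on-task", "off-task", "on-task", "off-task", "off-task",
                    "on-task", "on-task", "on-task", "on-task", "on-task"])

# Percent agreement: proportion of observations coded identically
p_observed = np.mean(rater_a == rater_b)

# Chance agreement: summed products of each rater's category proportions
categories = np.unique(np.concatenate([rater_a, rater_b]))
p_chance = sum(np.mean(rater_a == c) * np.mean(rater_b == c) for c in categories)

# Cohen's kappa corrects observed agreement for agreement expected by chance
kappa = (p_observed - p_chance) / (1 - p_chance)
print(f"Percent agreement: {p_observed:.0%}, kappa: {kappa:.2f}")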
Cont..

 Sources of error for inter-rater reliability include factors related to the raters (such as lack of motivation and rater biases) and characteristics of the measuring device.
 An inter-rater reliability coefficient is likely to be low, for instance, when rating categories
 are not exhaustive (i.e., don't include all possible responses or behaviors)
 are not mutually exclusive
 The inter-rater reliability of a behavioral rating scale can also be affected by consensual observer drift, which occurs when two (or more) observers working together influence each other's ratings so that they both assign ratings in a similarly idiosyncratic way.
 Consensual observer drift tends to artificially inflate inter-rater reliability.
VALIDITY
Validity

 Validity, as applied to a test, is a judgment or estimate of how well a test measures what it purports to measure in a particular context.
 Characterizations of the validity of tests and test scores are frequently
phrased in terms such as “acceptable” or “weak.”
 These terms reflect a judgment about how adequately the test measures
what it purports to measure.
 Inherent in a judgment of an instrument’s validity is a judgment of how
useful the instrument is for a particular purpose with a particular
population of people.
Cont..

 When a test is described as "valid," what is really meant is that the test has been shown to be valid for a particular use with a particular population of testtakers at a particular time.
 No test or measurement technique is “universally valid” for all time, for
all uses, with all types of testtaker populations.
 Rather, tests may be shown to be valid within what we would
characterize as reasonable boundaries of a contemplated usage. If those
boundaries are exceeded, the validity of the test may be called into
question.
 Further, to the extent that the validity of a test may diminish as the
culture or the times change, the validity of a test may have to be re-
established with the same as well as other testtaker populations.
Types of validity

 Content validity
 Construct validity
 Criterion validity
Face Validity

 Face validity relates more to what a test appears to measure to the person being tested than to what the test actually measures.
 Face validity is a judgment concerning how relevant the test items
appear to be.
 Stated another way, if a test definitely appears to measure what it
purports to measure “on the face of it,” then it could be said to be high
in face validity.
Cont..

 A paper-and-pencil personality test labeled The Introversion/Extraversion Test, with items that ask respondents whether they have acted in an introverted or an extraverted way in particular situations, may be perceived by respondents as a highly face-valid test.
 On the other hand, a personality test in which respondents are asked to
report what they see in inkblots may be perceived as a test with low
face validity. Many respondents would be left wondering how what they
said they saw in the inkblots really had anything at all to do with
personality.
Cont..

 In contrast to judgments about the reliability of a test and judgments about the content, construct, or criterion-related validity of a test, judgments about face validity are frequently thought of from the perspective of the testtaker, not the test user.
 A test’s lack of face validity could contribute to a lack of confidence in
the perceived effectiveness of the test—with a consequential decrease
in the testtaker’s cooperation or motivation to do his or her best
Content Validity

 Content validity evaluates how well an instrument (like a test) covers all relevant parts of the construct it aims to measure.
 Content validity is the degree to which a test or assessment instrument
evaluates all aspects of the topic, construct, or behavior that it is
designed to measure.
 Do the items fully cover the subject?
 High content validity indicates that the test fully covers the topic for the
target audience. Lower results suggest that the test does not contain
relevant facets of the subject matter.
Cont..

 Ideally, test developers have a clear vision of the construct being measured, and the clarity of this vision can be reflected in the content validity of the test (Haynes et al., 1995).
 To ensure content validity, test developers strive to include key
components of the construct targeted for measurement, and exclude
content irrelevant to the construct targeted for measurement.
 Although content validation is sometimes used to establish the validity
of personality, aptitude, and attitude tests, it is most associated with
achievement-type tests that measure knowledge of one or more content
domains and with tests designed to assess a well-defined behavior
domain.
Cont..

 Content validity has two aspects:
 Content relevance
 Content representation
Cont..

 Content validity is usually "built into" a test as it is constructed through a systematic, logical, and qualitative process that involves clearly identifying the content or behavior domain to be sampled and then writing or selecting items that represent that domain.
 Once a test has been developed, the establishment of content
validity relies primarily on the judgment of subject matter experts.
 If experts agree that test items are an adequate and
representative sample of the target domain, then the test is said
to have content validity.
Don’t confuse Content validity with Face
validity.

 Content validity refers to the systematic evaluation of a test by experts who determine whether or not test items adequately sample the relevant domain, while face validity refers simply to whether or not a test "looks like" it measures what it is intended to measure.

 Although face validity is not an actual type of validity, it is a desirable feature for many tests. If a test lacks face validity, examinees may not be motivated to respond to items in an honest or accurate manner. A high degree of face validity does not, however, indicate that a test has content validity.
Construct Validity

 Construct validity is about how well a test measures the concept it was designed to evaluate.
 Construct validity is a judgment about the appropriateness of inferences drawn from test scores regarding individual standings on a variable called a construct.
Construct

 A construct is an informed, scientific idea developed or hypothesized to describe or explain behavior.
 Intelligence is a construct that may be invoked to describe why a student
performs well in school.
 Other examples of constructs are job satisfaction, personality, intolerance, aptitude, depression, motivation, self-esteem, emotional adjustment, creativity, etc.
 Constructs are unobservable, presupposed (underlying) traits that a test
developer may invoke to describe test behavior or criterion
performance.
Cont..

 Traditionally, construct validity has been viewed as the unifying concept for
all validity evidence (American Educational Research Association et al.,
1999).
 All types of validity evidence, including evidence from the content- and
criterion-related varieties of validity, come under the umbrella of construct
validity.
 The researcher investigating a test’s construct validity must formulate
hypotheses about the expected behavior of high scorers and low scorers
on the test.
 These hypotheses give rise to a tentative theory about the nature of the
construct the test was designed to measure. If the test is a valid measure
of the construct, then high scorers and low scorers will behave as predicted
by the theory.
Evidence of Construct Validity

 A number of procedures may be used to provide different kinds of evidence that a test has construct validity.
 The various techniques of construct validation may provide evidence, for example,
that
 the test is homogeneous, measuring a single construct
 test scores increase or decrease as a function of age, the passage of time, or an
experimental manipulation as theoretically predicted
 test scores obtained after some event or the mere passage of time (or, posttest
scores) differ from pretest scores as theoretically predicted
 test scores obtained by people from distinct groups vary as predicted by the theory
 test scores correlate with scores on other tests in accordance with what would be
predicted from a theory that covers the manifestation of the construct in question
Types Of Construct Validity

 Convergent validity
 Divergent (discriminant) validity
Convergent validity:
 Convergent validity shows whether a test that is designed to assess a
particular construct correlates with other tests that assess the same
construct.
 We can analyze convergent validity by comparing the results of a test
with those of others that are designed to measure the same construct. If
there is a strong positive correlation between the results, then the test
can be said to have high convergent validity.
Cont..

 Discriminant validity shows whether a test that is designed to measure a particular construct does not correlate with tests that measure different constructs. This is based on the idea that we wouldn't expect to see the same results from two tests that are meant to measure different things (e.g. a math test vs a spelling test).
 We can analyze discriminant validity by comparing the results of an
assessment that measures one thing with those of a test that measures
something else altogether. If there is no correlation between the scores,
the test can be said to have high discriminant validity; a strong
correlation would indicate low discriminant validity.
 To establish high construct validity, a test must be shown to have both high convergent validity AND high discriminant validity (see the sketch below).
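 A minimal sketch of how convergent and discriminant evidence might be examined, assuming hypothetical scores on a new anxiety scale, an established anxiety scale (same construct), and a vocabulary test (different construct); all names and data are illustrative.

import numpy as np
from scipy.stats import pearsonr

# Hypothetical scores for the same eight examinees on three measures
new_anxiety = np.array([12, 18, 25, 9, 30, 22, 15, 27])
established_anxiety = np.array([14, 20, 24, 10, 28, 21, 13, 29])  # same construct
vocabulary = np.array([41, 35, 38, 44, 33, 40, 37, 36])           # different construct

# Convergent evidence: strong correlation with a measure of the same construct
r_convergent, _ = pearsonr(new_anxiety, established_anxiety)

# Discriminant evidence: weak correlation with a measure of a different construct
r_discriminant, _ = pearsonr(new_anxiety, vocabulary)

print(f"Convergent r: {r_convergent:.2f}, discriminant r: {r_discriminant:.2f}")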
Criterion validity

 Criterion validity (or criterion-related validity) evaluates how accurately a test measures the outcome it was designed to measure.
 Criterion validity (or criterion-related validity) measures how well
one measure predicts an outcome for another measure. A test
has this type of validity if it is useful for predicting performance
or behavior in another situation (past, present, or future).
Cont..

For example:
 A job applicant takes a performance test during the interview process. If
this test accurately predicts how well the employee will perform on the
job, the test is said to have criterion validity.
 A graduate student takes the GRE. The GRE has been shown to be an effective tool (i.e., it has criterion validity) for predicting how well a student will perform in graduate studies.
 The first measure (in the above examples, the job performance test and
the GRE) is sometimes called the predictor variable or the estimator. The
second measure is called the criterion variable
Types of Criterion Validity

 Two types of validity evidence are included under criterion-related validity.
 Concurrent validity is an index of the degree to which a test score
is related to some criterion measure obtained at the same time
(concurrently).
 Predictive validity is an index of the degree to which a test score predicts some criterion measure in the future.
Concurrent Validity

 If test scores are obtained at about the same time as the criterion
measures are obtained, measures of the relationship between the test
scores and the criterion provide evidence of concurrent validity.
 Statements of concurrent validity indicate the extent to which test
scores may be used to estimate an individual’s present standing on a
criterion.
 Concurrent validity measures how well a new test compares to a well-established test.
 If we create a new test for depression levels, we can compare its
performance to previous depression tests that have high validity.
Predictive Validity

 Test scores may be obtained at one time and the criterion measures
obtained at a future time, usually after some intervening event has
taken place.
 The intervening event may take varied forms, such as training,
experience, therapy, medication, or simply the passage of time.
 Measures of the relationship between the test scores and a criterion
measure obtained at a future time provide an indication of the predictive
validity of the test; that is, how accurately scores on the test predict
some criterion measure.
Cont..

 The term predictive validity refers to the extent to which it's valid to use the score on some scale or test to predict the value of some other variable in the future.
 For example, we might want to know how well some college entrance exam is able to predict the first semester grade point average (GPA) of students.
 If there is a high correlation between scores on the entrance exam and the first semester GPA, it's likely that there is predictive validity between these two variables.
 In other words, the score that a student receives on this particular college entrance exam is predictive of the GPA they're likely to receive during their first semester in college (see the sketch below).
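 A minimal sketch of a predictive validity check, assuming hypothetical entrance exam scores collected at admission and first-semester GPAs collected later; the correlation between predictor and criterion serves as the validity coefficient. Data and names are illustrative.

import numpy as np
from scipy.stats import pearsonr

# Hypothetical predictor (entrance exam) and later criterion (first-semester GPA)
exam_scores = np.array([1180, 1320, 1050, 1400, 1250, 990, 1150, 1350])
first_sem_gpa = np.array([3.1, 3.6, 2.7, 3.8, 3.3, 2.5, 3.0, 3.5])

# Predictive validity: correlation between the predictor and a criterion
# measure obtained at a future time
r_predictive, _ = pearsonr(exam_scores, first_sem_gpa)
print(f"Predictive validity coefficient: {r_predictive:.2f}")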
Summary

 Construct validity: concerns the extent to which your test or measure accurately assesses what it's supposed to.
 Content validity: Is the test fully representative of what it aims to measure?
 Face validity: Does the content of the test appear to be suitable to its aims?
 Criterion validity: Do the results accurately measure the concrete outcome they are designed to measure?
 All three types of validity evidence (content, construct, and criterion-related) contribute to a unified picture of a test's validity.
 A test user may not need to know about all three. Depending on the use to which a test is being put, one type of validity evidence may be more relevant than another.
THANK YOU!
