
Validity and Reliability

in Assessment
Measurement experts (and many educators) believe that every measurement device should possess certain qualities.

The two most common technical concepts in measurement are reliability and validity.
Reliability Definition (Consistency)

 The degree of consistency between two measures of the same thing. (Mehrens and Lehman, 1987)
 The measure of how stable, dependable, trustworthy, and consistent a test is in measuring the same thing each time. (Worthen et al., 1993)
Validity Definition (Accuracy)

 Truthfulness: Does the test measure what it purports to measure? The extent to which certain inferences can be made from test scores or other measurements. (Mehrens and Lehman, 1987)
 The degree to which tests accomplish the purpose for which they are being used. (Worthen et al., 1993)
 The term “validity” refers to the degree to which the conclusions (interpretations) derived from the results of any assessment are “well-grounded or justifiable; being at once relevant and meaningful.” (Messick S. 1995)
 “Content”: related to objectives and their sampling.
 “Construct”: referring to the theory underlying the target.
 “Criterion”: related to concrete criteria in the real world. It can be concurrent or predictive.
 “Concurrent”: correlating highly with another measure already validated.
 “Predictive”: capable of anticipating some later measure.
 “Face”: related to the test’s overall appearance.

The usual concepts of validity.

[Slides: Sources of validity in assessment, comparing the old concept with the usual concepts of validity]


All assessments in medical education require evidence of validity to be interpreted meaningfully.

In contemporary usage, all validity is construct validity, which requires multiple sources of evidence; construct validity is the whole of validity, but has multiple facets. (Downing S 2003)
Construct (concepts, ideas and notions)

- Nearly all assessments in medical education deal with constructs: intangible collections of abstract concepts and principles which are inferred from behavior and explained by educational or psychological theory.
- Educational achievement is a construct, inferred from performance on assessments: written tests over a domain of knowledge, oral examinations over specific problems or cases in medicine, or OSCEs of history-taking or communication skills.
- Educational ability or aptitude is another example of a construct, one that may be even more intangible and abstract than achievement. (Downing 2003)
Sources of validity in assessment

 Content: do instrument items completely represent the construct?
 Response process: the relationship between the intended construct and the thought processes of subjects or observers
 Internal structure: acceptable reliability and factor structure
 Relations to other variables: correlation with scores from another instrument assessing the same construct
 Consequences: do scores really make a difference?

(Downing 2003; Cook 2007)


Sources of validity in assessment

Content
- Examination blueprint
- Representativeness of test blueprint to achievement domain
- Test specifications
- Match of item content to test specifications
- Representativeness of items to domain
- Logical/empirical relationship of content tested to domain
- Quality of test questions
- Item writer qualifications
- Sensitivity review

Response process
- Student format familiarity
- Quality control of electronic scanning/scoring
- Key validation of preliminary scores
- Accuracy in combining scores from different formats
- Quality control/accuracy of final scores/marks/grades
- Subscore/subscale analyses:
  1. Accuracy of applying pass-fail decision rules to scores
  2. Quality control of score reporting

Internal structure
- Item analysis data:
  1. Item difficulty/discrimination
  2. Item/test characteristic curves
  3. Inter-item correlations
  4. Item-total correlations (PBS)
- Score scale reliability
- Standard errors of measurement (SEM) (illustrated below)
- Generalizability
- Item factor analysis
- Differential Item Functioning (DIF)

Relationship to other variables
- Correlation with other relevant variables (exams)
- Convergent correlations (internal/external): similar tests
- Divergent correlations (internal/external): dissimilar measures
- Test-criterion correlations
- Generalizability of evidence

Consequences
- Impact of test scores/results on students/society
- Consequences on learners/future learning
- Reasonableness of the method of establishing the pass-fail (cut) score
- Pass-fail consequences:
  1. P/F decision reliability-accuracy
  2. Conditional standard error of measurement
- False +ve/-ve decisions
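The internal-structure evidence above mentions score reliability and the standard error of measurement (SEM). As a small illustration only (the score statistics below are invented), SEM can be computed from the standard deviation of observed scores and a reliability estimate, and used to place an approximate error band around an individual result:

```python
import math

# Hypothetical exam statistics (invented for illustration).
sd_scores = 8.0        # standard deviation of observed scores
reliability = 0.85     # eg, Cronbach's alpha for the exam
observed_score = 72.0

# Standard error of measurement: SEM = SD * sqrt(1 - reliability).
sem = sd_scores * math.sqrt(1 - reliability)

# Approximate 95% band around the observed score (roughly +/- 2 SEM).
low, high = observed_score - 2 * sem, observed_score + 2 * sem
print(f"SEM = {sem:.1f}; score {observed_score:.0f} lies roughly in [{low:.1f}, {high:.1f}]")
```

A smaller SEM means individual scores, and pass-fail decisions based on them, carry less measurement error.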
Sources of validity: Internal structure

Statistical evidence of the hypothesized relationship between test item scores and the construct:
1. Reliability (internal consistency):
   - Test scale reliability
   - Rater reliability
   - Generalizability
2. Item analysis data (a computational sketch follows this list):
   - Item difficulty and discrimination
   - MCQ option function analysis
   - Inter-item correlations
3. Scale factor structure
4. Dimensionality studies
5. Differential item functioning (DIF) studies
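As a minimal sketch of the item analysis data listed above (assuming dichotomously scored MCQs; the simulated response matrix and all variable names are invented for illustration), item difficulty and corrected item-total (point-biserial) discrimination might be computed like this:

```python
import numpy as np

# Hypothetical 0/1 response matrix: rows = examinees, columns = MCQ items.
rng = np.random.default_rng(0)
ability = rng.normal(0, 1, (200, 1))
responses = (ability + rng.normal(0, 1, (200, 40)) > 0).astype(float)

total_scores = responses.sum(axis=1)

# Item difficulty: proportion of examinees answering each item correctly.
difficulty = responses.mean(axis=0)

# Corrected item-total (point-biserial) discrimination:
# correlate each item with the total score excluding that item.
discrimination = np.empty(responses.shape[1])
for j in range(responses.shape[1]):
    rest_score = total_scores - responses[:, j]
    discrimination[j] = np.corrcoef(responses[:, j], rest_score)[0, 1]

# Inter-item correlation matrix (useful when inspecting scale structure).
inter_item = np.corrcoef(responses, rowvar=False)

print("difficulty:", np.round(difficulty[:5], 2))
print("discrimination:", np.round(discrimination[:5], 2))
```

With real responses, items with extreme difficulty values or low item-total correlations would typically be flagged for review or revision.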
Sources of validity: Relationship to other variables

Statistical evidence of the hypothesized relationship between test scores and the construct:
- Criterion-related validity studies:
  - Correlations between test scores/subscores and other measures
  - Convergent-divergent studies (a small correlation sketch follows)
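To illustrate the criterion-related, convergent-divergent idea (all scores and measures below are simulated, not real data), a basic check is that the new test correlates strongly with an already validated measure of the same construct and weakly with a measure of an unrelated construct:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical scores for 150 examinees.
new_test = rng.normal(70, 10, 150)
# Validated measure of the same construct (convergent): related to new_test.
established_measure = 0.8 * new_test + rng.normal(0, 6, 150)
# Measure of an unrelated construct (divergent): independent noise.
unrelated_measure = rng.normal(50, 10, 150)

convergent_r = np.corrcoef(new_test, established_measure)[0, 1]
divergent_r = np.corrcoef(new_test, unrelated_measure)[0, 1]

# Convergent evidence: high correlation with the validated measure.
# Divergent (discriminant) evidence: low correlation with the unrelated measure.
print(f"convergent r = {convergent_r:.2f}, divergent r = {divergent_r:.2f}")
```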
Keys of reliability assessment

 “Stability”: related to time consistency.
 “Internal”: related to the instruments.
 “Inter-rater”: related to the examiners’ criterion.
 “Intra-rater”: related to the examiner’s criterion.

Validity and reliability are closely related. A test cannot be considered valid unless the measurements resulting from it are reliable. Likewise, results from a test can be reliable and not necessarily valid.
Sources of reliability in assessment

Internal consistency
- Question addressed: Do all the items on an instrument measure the same construct? (If an instrument measures more than one construct, a single score will not measure either construct very well.)
- Measures and definitions:
  - Split-half reliability: correlation between scores on the first and second halves of a given instrument. Rarely used, because the “effective” instrument is only half as long as the actual instrument; the Spearman-Brown† formula can adjust for this.
  - Kuder-Richardson 20: similar in concept to split-half, but accounts for all items. Assumes all items are equivalent, measure a single construct, and have dichotomous responses. We would expect high correlation between item scores measuring a single construct.
  - Cronbach’s alpha: a generalized form of the Kuder-Richardson formulas. Assumes all items are equivalent and measure a single construct; can be used with dichotomous or continuous data.
- Internal consistency is probably the most commonly reported reliability statistic, in part because it can be calculated after a single administration of a single instrument. Because instrument halves can be considered “alternate forms,” internal consistency can be viewed as an estimate of parallel forms reliability. (A computational sketch of these statistics follows.)
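A minimal computational sketch of these internal-consistency statistics, using simulated dichotomous item responses (the data, sample size, and variable names are invented; the formulas are the standard ones described above):

```python
import numpy as np

# Hypothetical dichotomous (0/1) responses: rows = examinees, columns = items.
rng = np.random.default_rng(2)
ability = rng.normal(0, 1, 300)
# Items share variance with 'ability' so the scale is roughly unidimensional.
x = (ability[:, None] + rng.normal(0, 1.2, (300, 20)) > 0).astype(float)

n_items = x.shape[1]
total = x.sum(axis=1)

# Cronbach's alpha: (k/(k-1)) * (1 - sum of item variances / variance of total).
alpha = n_items / (n_items - 1) * (1 - x.var(axis=0, ddof=1).sum() / total.var(ddof=1))

# KR-20 (special case of alpha for dichotomous items): uses p*(1-p) as item variance.
p = x.mean(axis=0)
kr20 = n_items / (n_items - 1) * (1 - (p * (1 - p)).sum() / total.var(ddof=1))

# Split-half reliability (odd vs. even items), then the Spearman-Brown correction
# for the fact that each half is only half as long as the full instrument.
half1 = x[:, 0::2].sum(axis=1)
half2 = x[:, 1::2].sum(axis=1)
r_half = np.corrcoef(half1, half2)[0, 1]
split_half_sb = 2 * r_half / (1 + r_half)

print(f"alpha={alpha:.2f}  KR-20={kr20:.2f}  split-half (SB-adjusted)={split_half_sb:.2f}")
```

For dichotomous items, KR-20 and Cronbach’s alpha give essentially the same value, and the Spearman-Brown step shows why an unadjusted split-half correlation understates the reliability of the full-length instrument.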
Sources of reliability in assessment

Temporal stability
- Question addressed: Does the instrument produce similar results when administered a second time?
- Measure: test-retest reliability; administer the instrument to the same person at different times.
- Usually quantified using correlation (eg, Pearson’s r).

Parallel forms (alternate forms reliability)
- Question addressed: Do different versions of the “same” instrument produce similar results?
- Measure: administer different versions of the instrument to the same individual at the same or different times.
- Usually quantified using correlation (eg, Pearson’s r).

Agreement (inter-rater reliability)
- Questions addressed: When using raters, does it matter who does the rating? Is one rater’s score similar to another’s? (A worked example follows.)
- Measures and definitions:
  - Percent agreement: % identical responses; does not account for agreement that would occur by chance.
  - Phi: simple correlation; does not account for chance agreement.
  - Kappa: agreement corrected for chance.
  - Kendall’s tau: agreement on ranked data; does not account for chance.
  - Intraclass correlation coefficient: ANOVA to estimate how well ratings from different raters coincide.
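To make the agreement measures concrete, here is a small hypothetical example (the pass/fail ratings are invented) contrasting raw percent agreement with Cohen’s kappa, which corrects for the agreement two raters would reach by chance:

```python
import numpy as np

# Hypothetical pass (1) / fail (0) ratings from two raters for the same 12 examinees.
rater_a = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1])
rater_b = np.array([1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1])

# Percent agreement: fraction of identical ratings (ignores chance agreement).
observed = np.mean(rater_a == rater_b)

# Cohen's kappa: agreement corrected for the agreement expected by chance,
# based on each rater's marginal pass/fail proportions.
p_a1, p_b1 = rater_a.mean(), rater_b.mean()
expected = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
kappa = (observed - expected) / (1 - expected)

print(f"percent agreement = {observed:.2f}, kappa = {kappa:.2f}")
```

In this invented example kappa is noticeably lower than raw agreement, because both raters pass most examinees and a good deal of their agreement would be expected by chance alone.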
Sources of reliability in assessment

Generalizability theory
- Question addressed: How much of the error in measurement is the result of each factor (eg, item, item grouping, subject, rater, day of administration) involved in the measurement process?
- Measure: generalizability coefficient, from a complex model that allows estimation of multiple sources of error.
- As the name implies, this elegant method is “generalizable” to virtually any setting in which reliability is assessed; for example, it can determine the relative contribution of internal consistency and inter-rater reliability to the overall reliability of a given instrument. (A simplified numerical sketch follows below.)

* “Items” are the individual questions on the instrument. The “construct” is what is being measured, such as knowledge, attitude, skill, or symptom in a specific area.
† The Spearman-Brown “prophecy” formula allows one to calculate the reliability of an instrument’s scores when the number of items is increased (or decreased).

Cook and Beckman Validity and Reliability of Psychometric Instruments (2007)
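As a rough, hypothetical sketch of the generalizability idea described above (a one-facet, persons-crossed-with-raters design with simulated scores; real G studies usually involve more facets and dedicated software), variance components can be estimated from ANOVA mean squares and combined into a generalizability coefficient:

```python
import numpy as np

# Hypothetical fully crossed design: 10 examinees (rows), each scored by 4 raters (columns).
rng = np.random.default_rng(3)
true_score = rng.normal(70, 8, (10, 1))
rater_effect = rng.normal(0, 2, (1, 4))
scores = true_score + rater_effect + rng.normal(0, 4, (10, 4))

n_p, n_r = scores.shape
grand = scores.mean()

# Mean squares from a two-way (persons x raters) ANOVA with one observation per cell.
ms_p = n_r * ((scores.mean(axis=1) - grand) ** 2).sum() / (n_p - 1)
ms_r = n_p * ((scores.mean(axis=0) - grand) ** 2).sum() / (n_r - 1)
ss_total = ((scores - grand) ** 2).sum()
ms_res = (ss_total - (n_p - 1) * ms_p - (n_r - 1) * ms_r) / ((n_p - 1) * (n_r - 1))

# Variance components.
var_residual = ms_res                  # person-by-rater interaction plus error
var_person = (ms_p - ms_res) / n_r     # "true" examinee variance
var_rater = (ms_r - ms_res) / n_p      # systematic rater leniency/severity

# Relative G coefficient for a design averaging over n_r raters.
g_coef = var_person / (var_person + var_residual / n_r)
print(f"person={var_person:.1f} rater={var_rater:.1f} residual={var_residual:.1f} G={g_coef:.2f}")
```

Increasing the divisor in the last line (averaging over more raters) would be expected to raise the coefficient, which is exactly the kind of “how many raters do we need?” question a decision study in G theory is designed to answer.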


Keys of reliability assessment

Different types of assessments require different kinds of reliability:

Written MCQs
- Scale reliability
- Internal consistency

Written essays
- Inter-rater agreement
- Rater reliability
- Generalizability theory

Oral exams
- Rater reliability
- Generalizability theory

Observational assessments
- Inter-rater agreement
- Generalizability theory

Performance exams (OSCEs)
- Rater reliability
- Generalizability theory
Keys of reliability assessment

Reliability: how high?
- Very high stakes (licensure tests): 0.90 or higher
- Moderate stakes (OSCE): at least ~0.75
- Low stakes (quiz): >0.60
Keys of reliability assessment

How to increase reliability?
For written tests:
- Use objectively scored formats
- At least 35-40 MCQs (the Spearman-Brown sketch after this list shows the effect of test length)
- MCQs that differentiate high- and low-performing students
For performance exams:
- At least 7-12 cases
- Well-trained standardized patients (SPs)
- Monitoring and quality control (QC)
For observational exams:
- Many independent raters (7-11)
- Standard checklists/rating scales
- Timely ratings
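To tie the “35-40 MCQs” guideline to the Spearman-Brown “prophecy” formula footnoted earlier, here is a small worked example (the starting reliability of 0.60 for a 20-item quiz is an invented figure) of how predicted reliability changes with test length:

```python
def spearman_brown(r_observed: float, length_factor: float) -> float:
    """Predicted reliability when the test is lengthened by `length_factor`."""
    return length_factor * r_observed / (1 + (length_factor - 1) * r_observed)

def length_factor_needed(r_observed: float, r_target: float) -> float:
    """How many times longer the test must be to reach `r_target`."""
    return r_target * (1 - r_observed) / (r_observed * (1 - r_target))

# Hypothetical: a 20-item quiz with reliability 0.60.
r20 = 0.60
print(f"40 items -> {spearman_brown(r20, 2):.2f}")        # doubling the test
factor = length_factor_needed(r20, 0.75)
print(f"items needed for 0.75: {round(20 * factor)}")     # scale up the 20-item quiz
```

Under these assumed numbers, doubling a 20-item quiz to 40 items is predicted to lift reliability from 0.60 to 0.75, consistent with the moderate-stakes target above; the prediction assumes the added items are of similar quality to the existing ones.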
Conclusion

Validity = meaning
- Evidence to aid interpretation of assessment data
- The higher the test stakes, the more evidence is needed
- Multiple sources or methods
- Ongoing research studies

Reliability
- Consistency of the measurement
- One aspect of validity evidence
- Higher reliability is always better than lower
References
- National Board of Medical Examiners. United States Medical Licensing Exam Bulletin. Produced by the Federation of State Medical Boards of the United States and the National Board of Medical Examiners. Available at: http://www.usmle.org/bulletin/2005/testing.htm.
- Norcini JJ, Blank LL, Duffy FD, Fortna GS. The mini-CEX: a method for assessing clinical skills. Ann Intern Med. 2003;138:476-481.
- Litzelman DK, Stratos GA, Marriott DJ, Skeff KM. Factorial validation of a widely disseminated educational framework for evaluating clinical teachers. Acad Med. 1998;73:688-695.
- Merriam-Webster Online. Available at: http://www.m-w.com/.
- Sackett DL, Richardson WS, Rosenberg W, Haynes RB. Evidence-Based Medicine: How to Practice and Teach EBM. Edinburgh: Churchill Livingstone; 1998.
- Wallach J. Interpretation of Diagnostic Tests. 7th ed. Philadelphia: Lippincott Williams & Wilkins; 2000.
- Beckman TJ, Ghosh AK, Cook DA, Erwin PJ, Mandrekar JN. How reliable are assessments of clinical teaching? A review of the published instruments. J Gen Intern Med. 2004;19:971-977.
- Shanafelt TD, Bradley KA, Wipf JE, Back AL. Burnout and self-reported patient care in an internal medicine residency program. Ann Intern Med. 2002;136:358-367.
- Alexander GC, Casalino LP, Meltzer DO. Patient-physician communication about out-of-pocket costs. JAMA. 2003;290:953-958.
- Pittet D, Simon A, Hugonnet S, Pessoa-Silva CL, Sauvan V, Perneger TV. Hand hygiene among physicians: performance, beliefs, and perceptions. Ann Intern Med. 2004;141:1-8.
- Messick S. Validity. In: Linn RL, editor. Educational Measurement, 3rd ed. New York: American Council on Education and Macmillan; 1989.
- Foster SL, Cone JD. Validity issues in clinical assessment. Psychol Assess. 1995;7:248-260.
- American Educational Research Association, American Psychological Association, National Council on Measurement in Education. Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association; 1999.
- Bland JM, Altman DG. Statistics notes: validating scales and indexes. BMJ. 2002;324:606-607.
- Downing SM. Validity: on the meaningful interpretation of assessment data. Med Educ. 2003;37:830-837.
- 2005 Certification Examination in Internal Medicine Information Booklet. Produced by the American Board of Internal Medicine. Available at: http://www.abim.org/resources/publications/IMRegistrationBook.pdf.
- Kane MT. An argument-based approach to validity. Psychol Bull. 1992;112:527-535.
- Messick S. Validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. Am Psychol. 1995;50:741-749.
- Kane MT. Current concerns in validity theory. J Educ Meas. 2001;38:319-342.
- American Psychological Association. Standards for Educational and Psychological Tests and Manuals. Washington, DC: American Psychological Association; 1966.
- Downing SM, Haladyna TM. Validity threats: overcoming interference in the proposed interpretations of assessment data. Med Educ. 2004;38:327-333.
- Haynes SN, Richard DC, Kubany ES. Content validity in psychological assessment: a functional approach to concepts and methods. Psychol Assess. 1995;7:238-247.
- Feldt LS, Brennan RL. Reliability. In: Linn RL, editor. Educational Measurement, 3rd ed. New York: American Council on Education and Macmillan; 1989.
- Downing SM. Reliability: on the reproducibility of assessment data. Med Educ. 2004;38:1006-1012.
- Clark LA, Watson D. Constructing validity: basic issues in objective scale development. Psychol Assess. 1995;7:309-319.
Resources
- For an excellent resource on item analysis: http://www.utexas.edu/academic/ctl/assessment/iar/students/report/itemanalysis.php
- For a more extensive list of item-writing tips: http://testing.byu.edu/info/handbooks/Multiple-Choice%20Item%20Writing%20Guidelines%20-%20Haladyna%20and%20Downing.pdf and http://homes.chass.utoronto.ca/~murdockj/teaching/MCQ_basic_tips.pdf
- For a discussion about writing higher-level multiple choice items: http://www.ascilite.org.au/conferences/perth04/procs/pdf/woodford.pdf
