
ISPUB.COM
The Internet Journal of Medical Education
Volume 1 Number 1

Evaluating and designing assessments for medical education: the utility formula
M Chandratilake, M Davis, G Ponnamperuma

Citation
M Chandratilake, M Davis, G Ponnamperuma. Evaluating and designing assessments for medical education: the utility
formula. The Internet Journal of Medical Education. 2009 Volume 1 Number 1.

Abstract
As assessment serves several important purposes in medical education, it is a vital element in the training of doctors. A rigorous assessment system, therefore, is an essential requirement for enhancing the quality and accountability of medical education. This can be achieved by considering the utility formula, which takes into account the validity, reliability, educational impact, practicability, cost-effectiveness and acceptability of assessment. All utility elements of a given assessment should be satisfactorily addressed for it to be psychometrically rigorous and sustainable in a particular context. This article discusses the utility elements in relation to medical assessments and introduces some useful measures for achieving acceptable utility.

INTRODUCTION

As assessment serves several important purposes in medical education, it is a vital element in the training of doctors:

- The ultimate aim of undergraduate, postgraduate and continuing medical education is to improve the health and the health care of the population. The outcomes of all medical education programmes, in general, are focused on this aim. Assessments should accurately measure the students' or trainees' progress towards or achievement of these outcomes at different levels of their training;

- Pass/fail decisions are taken and qualifications are awarded based on assessment results. Students who perform well in assessments receive good ranks, grades and prizes. On the other hand, poorly performing students may be offered support and additional training;

- As assessments drive student learning, they are a crucial component of the teaching/learning process.1,2,3,4,5 Therefore, assessment is an important mode of communicating to students what teachers value (i.e. intended outcomes of the programme);

- The assessment results should provide students with meaningful feedback on their strengths and weaknesses. At times, students use their assessment performance as a basis for career selection;

- Similarly, assessment results should provide useful feedback to other stakeholders in the educational process such as teachers and future employers.

A rigorous assessment system, therefore, is an essential requirement in enhancing quality and accountability of medical education. Quality enhancement agencies of many countries have formally emphasised the importance of credible assessment systems in medical education.6,7 Although students can escape from poor teaching by independent learning, they cannot escape from the effects of poor assessments; they have to pass the examination.8

Various assessment methods are used in both undergraduate and postgraduate medical education. Assessors need to consider some essential questions when implementing assessments. Are our assessments psychometrically sound? What is their educational impact? Are assessments with sound psychometric properties and positive educational impact feasible and cost-effective in our own setting, and acceptable to all involved in assessments? This article discusses the contribution of the different elements, namely psychometric properties, educational impact, practicability, cost-effectiveness and acceptability, to the utility value of our assessments,1 and the practical measures for improving each aspect.


PSYCHOMETRIC PROPERTIES

The assessments are psychometrically sound if they are valid and reliable. Validity is defined as the "extent to which a test measures what is intended to be measured and nothing else".9 Reliability is a measure of the consistency and precision with which a test measures what it is supposed to assess.9

1. VALIDITY OF THE ASSESSMENT

Major determinants of validity are: assessment of what is purported to be assessed; selection of suitable assessment instruments for the purpose; and adequate representation of the curriculum in the assessment material. These aspects need to be considered before the assessment is conducted (i.e. at the planning stage). After assessments are held, however, the validity of assessments may be reviewed by quantitative analysis of results.

ASSESSMENT OF WHAT IS PURPORTED TO BE ASSESSED

The assessments should assess what is intended by the curriculum. The purpose of the course (i.e. the intended educational message) is demonstrated by: the time allocated to each topic in teaching; and the level of thinking and competence/performance encouraged by the course objectives. For example, in an endocrine module, the curriculum expects the students to solve clinical problems related to common endocrine disorders, which they meet at first contact level. Accordingly, more teaching time is allocated to diabetes than pheochromocytoma, as in primary care settings the presentation of patients with the former is more frequent than the latter. When clinical problem solving is the level of competence required in a specific curriculum, problem-based learning is used as the main method of teaching. If the assessment mostly tests factual recall about pheochromocytoma, however, the purpose of the module is not represented in the assessment. Students, no doubt, will be driven towards memorising facts rather than solving problems, and towards learning more about pheochromocytoma than diabetes. As a result, incongruence between the time devoted to teaching (more time for diabetes), the weight of assessment content (more on pheochromocytoma) and the level assessed (factual recall) leads to undesired student learning. Therefore, the relative weight given to each topic in assessment should be proportionate to the teaching and the teaching time allocated in the planned curriculum.
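As a rough illustration of proportionate weighting (not taken from the article), the short Python sketch below apportions a paper of a given length across topics according to hypothetical teaching hours:

```python
# Minimal sketch: apportioning exam items in proportion to teaching time.
# Topic names, teaching hours and paper length are hypothetical.

teaching_hours = {"diabetes": 12, "thyroid disease": 6, "pheochromocytoma": 2}
total_items = 40  # planned number of questions in the paper

total_hours = sum(teaching_hours.values())
items_per_topic = {
    topic: round(total_items * hours / total_hours)
    for topic, hours in teaching_hours.items()
}
print(items_per_topic)  # {'diabetes': 24, 'thyroid disease': 12, 'pheochromocytoma': 4}
```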
Factual knowledge is a prerequisite for effective problem solving.10 However, 'in real professional practice, factual knowledge is mostly not a goal itself, but only a single aspect of solving professional problems'.11 One of the important principles of recent curricular changes in undergraduate medical education is the promotion of higher order thinking.12 The role of assessments in encouraging higher order thinking is vital.13

Bloom's taxonomy14 categorises knowledge into six levels: recall; comprehension; application; analysis; synthesis; and evaluation. The assessment of recall and comprehension of knowledge is essential, but if only recall and comprehension are tested, lower order thinking will be promoted. In contrast, higher order thinking is encouraged by assessing knowledge at the application, analysis, synthesis and evaluation levels. Context-free questions, i.e. questions that are not based on practical/clinical scenarios, encourage the consideration of simple answers, e.g. yes/no.15 The promotion and assessment of higher order thinking can be achieved by introducing context-rich questions, i.e. questions based on patient, practical or clinical scenarios, for knowledge assessments (Box 1).11,16

Box I — Context free and context rich questions [figure not reproduced]

SUITABILITY OF ASSESSMENT INSTRUMENTS

Miller describes four levels of assessment: knows; knows how; shows how (competence); and does (performance) (Figure 1).17 The suitability of the assessment instrument(s) can be determined by relating the objectives or outcomes assessed to the different levels of Miller's pyramid. Assessors, therefore, may require an assessment 'tool kit' rather than a single instrument to assess everything they need to assess.


Figure 1 — Examples of assessment instruments for assessing different levels of Miller's pyramid [figure not reproduced]

The use of multiple assessment instruments enhances both the validity and the reliability of results.1 The students also perceive more satisfaction and motivation with the use of multiple assessment instruments than with the use of a single instrument.15

Some assessment instruments possess more than one format, e.g. the single best response and extended matching items formats in Multiple Choice Questions (MCQs). The appropriate format should be chosen considering the content to be assessed, the training and experience of the assessor, and the psychometric properties (validity and reliability) of each format.

SAMPLING OF THE CURRICULUM FOR ASSESSMENT

One measure of ensuring validity is adequate sampling of the curriculum for the assessment (i.e. the assessment content should be representative of the curriculum content).18

a) Representativeness

As assessment drives learning,1,4,19 the representation of each topic and each curriculum objective in assessments sends a clear educational message to the students about the topics and outcomes they should master. Therefore, the sample of curriculum content in the assessment should represent the whole curriculum, and this is a primary requirement of content validity.2,12,20

Before assessment, the assessment contents should be plotted against the planned objectives (this is often referred to as "blueprinting").20 In the assessment blueprint, the columns represent the course outcomes or objectives and the rows represent the teaching/learning topics. This process helps assessors sample all topics and outcomes/objectives in the assessment materials, establishing the content validity of the assessment.2

The number of questions focused on assessing different topics and objectives in an assessment varies in congruence with the relative emphasis given to each topic and objective in the curriculum. No topic or objective/outcome, however, should be left out, as the assessment material should be a representative sample of the course content.
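A blueprint of this kind can be kept as a simple topic-by-outcome grid. The following Python sketch, with invented topics, outcomes and item counts (an illustration rather than part of the original article), flags any topic or outcome that the sampled items leave uncovered:

```python
# Minimal blueprint sketch: rows = teaching/learning topics, columns = course
# outcomes/objectives, cell values = number of planned items.
# All names and counts below are invented for illustration.

outcomes = ["knowledge", "clinical reasoning", "communication"]
topics = ["diabetes", "thyroid disease", "adrenal disorders"]

blueprint = {
    "diabetes":          {"knowledge": 6, "clinical reasoning": 4, "communication": 2},
    "thyroid disease":   {"knowledge": 3, "clinical reasoning": 2, "communication": 1},
    "adrenal disorders": {"knowledge": 2, "clinical reasoning": 1, "communication": 0},
}

# No topic or outcome should be left out of the sample.
for topic in topics:
    if sum(blueprint[topic].values()) == 0:
        print(f"Topic not sampled: {topic}")
for outcome in outcomes:
    if sum(blueprint[topic][outcome] for topic in topics) == 0:
        print(f"Outcome not sampled: {outcome}")
```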
b) Technical accuracy

The questions formulated to assess the sampled content should not contain technical errors. For example, a grammatically incorrect MCQ may not assess the students' knowledge of the intended topic, as the students may not understand what is being asked. Frequently observed technical flaws in relation to MCQs include the use of absolute terms (e.g. must, typical) and frequency terms (e.g. often, sometimes), and spelling and grammar mistakes.5 Technical flaws confuse students and directly affect the students' marks, reducing the validity of the assessment.18 Therefore, they should be eliminated in constructing any type of assessment question.

Quantitative analysis of marks

Based on the performance of students, calculating difficulty and discrimination indices, and correlating marks, may provide validity evidence.

a) The difficulty and discrimination indices

The difficulty of a test item and its discrimination power (DP) could provide supportive evidence for the validity of examinations.21,22 The difficulty index (DI) is the proportion of candidates that passes a test item (e.g. a single question in a single-best-answer type MCQ paper). It is calculated by dividing the number of candidates who passed the test item by the number who sat the examination. Thus a high DI (e.g. 0.9) may indicate an easy item and a low DI (e.g. 0.1) may indicate a hard item. The DP is the ability of a test item to distinguish between high and low performers. For example, to demonstrate a high DP, students who are more competent in clinical skills (high performers) should score higher than students who are less competent (low performers) in an OSCE station designed for the assessment of history-taking skills (test item). In calculating the DP of a test item, the candidates are ranked in descending order of their marks for the whole examination, and the number of candidates in the upper third and the lower third of the list who correctly answered the item is calculated.


The proportion of candidates who have correctly answered the item in the lower third is subtracted from the proportion of their counterparts in the upper third. The DP should be positive; a negative DP requires investigation.

Most of the medical undergraduate assessments have either criterion-referenced components (passing or failing is based on the standard achieved) or norm-referenced components (passing a percentage of candidates after ranking them based on their performance), or both. If the DI of a test item is low, the test setter may be able to observe that: the item assesses content outside the curriculum; the teaching/learning of the content area has taken place ineffectively; the item is technically flawed; or the students have not learnt the topic represented by the item.23 Obviously, the DP of items in a norm-referenced examination should be high in order to discriminate between high and low performers. Although the intention of a criterion-referenced test is not discrimination between high and low performers, the discrimination index still has a value.23 An item with a negative discrimination index (i.e. more low performers than high performers answering correctly) usually denotes a technical flaw, a mistake (e.g. a wrong answer) or a mis-key.

A DP near zero together with a high DI in a criterion-referenced test may indicate the effectiveness of the teaching/learning of the content area related to the item (i.e. both high and low performers have mastered the topic).
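As an illustration of these two indices (using invented, simulated responses rather than data from the article), a minimal Python sketch is given below:

```python
import numpy as np

# Minimal sketch of the difficulty index (DI) and discrimination power (DP)
# described above, for one dichotomously scored item. All data are simulated
# for illustration.

rng = np.random.default_rng(0)
n_candidates = 90
total_marks = rng.normal(60, 10, n_candidates)        # whole-examination marks
# Simulate an item that better candidates tend to answer correctly.
p_correct = 1 / (1 + np.exp(-(total_marks - 55) / 5))
item_correct = rng.random(n_candidates) < p_correct   # True = item answered correctly

# DI: proportion of all candidates who answered the item correctly.
di = item_correct.mean()

# DP: rank candidates by total mark and compare the upper and lower thirds.
order = np.argsort(total_marks)[::-1]                 # descending by total mark
third = n_candidates // 3
upper, lower = order[:third], order[-third:]
dp = item_correct[upper].mean() - item_correct[lower].mean()

print(f"DI = {di:.2f}, DP = {dp:.2f}")                # DP should be positive for a sound item
```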
b) Correlation coefficients

In an examination, assessors may use different assessment instruments to assess different levels of Miller's pyramid. Supportive evidence for the use of an appropriate instrument for a specified level may be obtained by correlating students' marks (using a Pearson correlation) for different assessment instruments. The correlation of the marks of two instruments which assess the same level (e.g. an MCQ and an SAQ assessing the 'knows' level) should be higher than the correlation coefficient of the marks of instruments assessing different levels (e.g. an MCQ assessing 'knows' and an OSCE assessing 'shows how').
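A minimal sketch of this comparison, using invented marks for ten candidates (the figures are illustrative only), might look as follows:

```python
import numpy as np

# Invented marks for ten candidates on three instruments, for illustration.
mcq  = np.array([55, 62, 70, 48, 81, 66, 73, 59, 64, 77], dtype=float)
saq  = np.array([58, 60, 72, 50, 79, 63, 75, 57, 61, 80], dtype=float)   # same level as the MCQ ('knows')
osce = np.array([65, 55, 60, 70, 62, 58, 68, 72, 54, 66], dtype=float)   # different level ('shows how')

r_mcq_saq  = np.corrcoef(mcq, saq)[0, 1]
r_mcq_osce = np.corrcoef(mcq, osce)[0, 1]

# Expectation described above: instruments assessing the same level should
# correlate more strongly than instruments assessing different levels.
print(f"MCQ vs SAQ:  r = {r_mcq_saq:.2f}")
print(f"MCQ vs OSCE: r = {r_mcq_osce:.2f}")
```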
2. THE RELIABILITY OF ASSESSMENT RESULTS

Reliability indicates the ability of an assessment result to be replicated given the same or similar conditions. Assessment is a measurement. As in all measurements, assessment results may not always be consistent (i.e. reliable) due to measurement errors.23,24 Exam questions and examiners, either individually or in combination, may contribute to measurement error.

The reliability of assessment results can be estimated using Classical Test Theory (CTT) and Generalisability Theory (GT).24 Both these theories examine the variance of scores.

ESTIMATING RELIABILITY USING CTT

A widely used reliability measure that uses CTT as its basis is the alpha coefficient (AC). The AC is a value between zero and one (0-1), which can be calculated using statistical software such as SPSS. For example, an AC of 0.8 means that the reproducibility is 80% and the total measurement error is 20%. However, CTT cannot be used to identify the sources of error (i.e. what contributes to the 20% of error in the example above) and their relative magnitudes, as in CTT the error is identified as a single entity.24
ESTIMATING RELIABILITY USING GT

In GT, the G-coefficient (a value between 0 and 1) also indicates the reliability of results. Different sources (e.g. items/stations, raters) can be responsible for the error component. The assessors would want to know not only the magnitude of the overall error but also the source(s) of error and their individual magnitudes.24 GT can be used to identify the sources of error and quantify their contribution to the total error, as GT analyses the variance.24 It also helps identify how to minimise the error and what is needed to achieve results that are sufficiently reliable. The G-coefficient can be calculated using statistical software packages such as GENOVA.
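For the simplest case, a fully crossed design in which every candidate attempts every item or station, the variance components and the (relative) G-coefficient can be estimated by hand. The sketch below is a minimal single-facet illustration with invented marks, not a substitute for a full generalisability analysis:

```python
import numpy as np

def g_coefficient(scores: np.ndarray) -> float:
    """Relative G-coefficient for a fully crossed candidates x items design."""
    n_p, n_i = scores.shape
    grand = scores.mean()
    person_means = scores.mean(axis=1)
    item_means = scores.mean(axis=0)

    # Two-way ANOVA mean squares (one observation per cell).
    ms_p = n_i * ((person_means - grand) ** 2).sum() / (n_p - 1)
    ms_res = (((scores - person_means[:, None] - item_means[None, :] + grand) ** 2).sum()
              / ((n_p - 1) * (n_i - 1)))

    var_p = max((ms_p - ms_res) / n_i, 0.0)  # candidate (universe-score) variance
    var_res = ms_res                         # residual error variance
    return var_p / (var_p + var_res / n_i)

# Invented marks: five candidates (rows) by four stations (columns).
scores = np.array([
    [7, 8, 6, 7],
    [5, 6, 5, 6],
    [9, 9, 8, 9],
    [4, 5, 4, 6],
    [6, 7, 6, 7],
], dtype=float)

print(f"G = {g_coefficient(scores):.2f}")
```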
In both CTT and GT, a value of more than 0.8 is considered acceptable reliability. However, in high-stakes examinations, some assessment authorities (e.g. the Postgraduate Medical Education and Training Board) recommend the achievement of 0.9. The evidence of reliability estimated by these statistical methods, however, should always be interpreted against the backdrop of the validity of the assessment. The reliability values have no meaning with poor validity.

EDUCATIONAL IMPACT

The educational message, i.e. the educationally desirable direction that teachers expect the students to follow, conveyed to the student by the assessment is referred to as educational impact. Citing many authors, van der Vleuten points out that the "assessment programme has tremendous impact on learners and students do whatever they are tested on and are not likely to do what they are not tested on".1


Although more time is allocated for learning clinical skills in wards, if students are assessed on recalling facts using an MCQ examination, they have a propensity to read books and notes in a library. Conversely, they will learn clinical skills, spending more time in clinical skills centres or wards, if their clinical skills are assessed using an OSCE.25 Therefore, the assessments should reflect the educationally desirable direction expressed in the curriculum outcomes.

It is true that high validity, reliability and positive educational impact enhance the rigor of assessments. However, the psychometric properties and educational impact of assessments should be balanced with the practicability and the cost-effectiveness of using an assessment instrument in a given context, and its acceptability to the people involved in the assessment process (e.g. exam setters, examiners, examinees).1

PRACTICABILITY

Strategies to improve validity (e.g. the use of the OSCE to assess skills) and reliability (e.g. testing with as many observers and cases or situations as possible) may not be feasible for many reasons. Ram et al,26 in their evaluation of using video observations for the assessment of general practitioners, identified that feasibility issues were related to the cost, the availability of equipment, time, the recruitment of patients and assessors, and the manpower necessary to develop infrastructure. Psychometric rigor may be very important in some high-stakes assessments (e.g. final year undergraduate examinations, national board examinations), but feasibility may be equally important for iterative in-training assessments.27 Therefore, at times, a compromise of psychometric rigor, to a certain extent, may be necessary for the assessment system to be practicable. For example, the number of summative examinations can be reduced when the number of formative examinations is increased, provided that the formative exams follow the same format as the summative examinations. Because formative assessments may not warrant such strict psychometric rigor as summative assessments, this approach may help mobilise the existing resources and make psychometrically rigorous summative examinations practicable.

COST-EFFECTIVENESS

In practice, the cost of assessment is a compromise between the information elicited and the resources required by the examination.1 However, "investing in assessment is investing in teaching and learning, as assessment drives learning", and perceived resource-intensive assessment methods may turn out to be rewarding in terms of return on cost in practice.1 Therefore, the cost-effectiveness of assessment, evaluating the benefits of a particular assessment against its cost, seems more important than the cost alone. For example, a one-from-five MCQ test may be the cheapest mode of valid and reliable assessment of the 'knows' and 'knows how' levels of Miller's pyramid.17 However, it is not suitable for assessing competence or performance. A portfolio assessment, which is costly compared to an MCQ test, may be the cost-effective method of assessing performance credibly.

ACCEPTABILITY

A test may be acceptable to some of those dealing with it and not to others.9 The beliefs and attitudes of both examiners and examinees towards assessment may not always be in line with the research and empirical evidence. Therefore, certain assessments may not be acceptable to all.1 Provision of necessary information and willingness to compromise may increase the commitment of both examiners and examinees.27 However, if the beliefs, opinions and attitudes of both examiners and examinees are not considered in choosing and designing assessments, the survival of the assessment procedure is threatened.1 For example, strongly structured assessments may not be acceptable to examiners, as the examiners have little opportunity to exploit their expertise to vary questioning from candidate to candidate.1 Therefore, a compromise between an acceptable degree of freedom for such issues and the exam structure enhances the sustainability of the assessment system.1 For example, in an OSCE, a checklist can be used together with a global rating with which the examiners can express their overall judgment on candidates, enhancing both psychometric properties and acceptability.

THE UTILITY FORMULA

Combining the utility elements of validity, reliability, educational impact, cost-effectiveness and acceptability, van der Vleuten introduced a utility formula.1

Utility = R x V x E x A x C

(R = Reliability, V = Validity, E = Educational impact, A = Acceptability, C = Cost)

However, feasibility has also been shown to be important for the utility of an assessment.27 On the other hand, in practice, the cost-effectiveness of assessment may be a better determinant of its utility than the cost alone. Therefore, we have found it helpful to modify this formula to include practicability and cost-effectiveness.


Utility = R x V x EI x P x A x CE

(R = Reliability, V = Validity, EI = Educational impact, P = Practicability, A = Acceptability, CE = Cost-effectiveness)

According to this utility formula, the utility value of the assessment becomes null and void if any of the utility factors becomes zero.
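The multiplicative nature of the formula can be illustrated with a trivial sketch (the ratings below are arbitrary values on a 0-1 scale, used only to show the effect of a zero factor):

```python
# Minimal sketch of the modified utility formula given above.
def utility(reliability, validity, educational_impact,
            practicability, acceptability, cost_effectiveness):
    # Multiplicative: if any element is zero, the overall utility is zero.
    return (reliability * validity * educational_impact *
            practicability * acceptability * cost_effectiveness)

print(utility(0.9, 0.8, 0.7, 0.8, 0.9, 0.8))  # all elements adequately addressed
print(utility(0.9, 0.8, 0.7, 0.0, 0.9, 0.8))  # impracticable assessment: utility is zero
```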
CONCLUSION

Good assessment practices in medical training, at all levels, enhance both the quality and the accountability of medical education. The utility of assessments depends on reliability, validity, educational impact, acceptability, cost-effectiveness and practicability. Although the rigor of assessments is determined by validity, reliability and educational impact, the measures employed in achieving rigor should be balanced against the practicability and cost-effectiveness of using an assessment system in a particular setting, and the acceptability of assessments to their stakeholders.

References
1. van der Vleuten C: The assessment of professional competence: developments, research and practical implications. Advances in Health Science Education; 1996; 1: 41 – 67.
2. Fowell S, Southgate L, Bligh J: Evaluating assessment: the missing link? Medical Education; 1999; 33: 276 – 28.
3. Harden R: AMEE Guide 21: curriculum mapping: a tool for transparent and authentic teaching and learning. Medical Teacher; 2001; 23: 123 – 137.
4. Eraut M: A wider perspective on assessment. Medical Education; 2004; 38: 800 – 804.
5. Boud D: Assessment and learning: contradictory or complementary? In: Knight P, ed. Assessment for Learning in Higher Education; London; Kogan; 1995: 25 – 48.
6. Postgraduate Medical Education and Training Board: Standards for Curricula and Assessment Systems; London; PMETB; 2008: 6.
7. Committee of Vice-chancellors and Directors: Quality Assurance Handbook for Sri Lankan Universities. Colombo: University Grants Commission; 2002: 105.
8. Case S, Swanson D: Constructing written test questions for basic and clinical sciences. 3rd ed. Philadelphia; National Board of Medical Examiners; 2002: 26.
9. Lowry S: Medical Education; London; BMJ Books; 1993: 46 – 47.
10. Hager P, Gonczi A: What is competence? Medical Teacher; 1996; 18: 15 – 18.
11. Schuwirth L, van der Vleuten C: Different written assessment methods: what can be said about their strengths and weaknesses. Medical Education; 2004; 38: 974 – 979.
12. Spencer J: Learner-centered approaches in medical education. British Medical Journal; 1999; 318: 1280 – 1283.
13. Wood D: Evaluating the outcomes of undergraduate medical education. Medical Education; 2003; 37: 580 – 581.
14. Bloom S, Hastings T, Modays J: Handbook of Formative and Summative Evaluation of Students Learning; New York; McGraw Hill; 1971: 103.
15. Scale F, Chapman J, Davey C: The influence of assessments on the students motivation to learn in a therapy degree course. Medical Education; 2000; 34: 614 – 621.
16. Des Marchas J, Vu V: Developing and evaluating the student assessment system in the preclinical problem-based curriculum at Sherbrook. Academic Medicine; 1996; 71: 274 – 281.
17. Miller J: The assessment of clinical skills, competence, performance. Academic Medicine; 1990; 65: s63 – s67.
18. Bridge D, Musial J, Frank R, Roe T, Sawilowsky S: Measurement practices: methods for developing content valid student examinations. Medical Teacher; 2003; 25: 414 – 421.
19. van der Vleuten C, Schuwirth L: Assessing professional competence. Medical Education; 2005; 39: 309 – 317.
20. Wass V, van der Vleuten C, Shatzer J, Jones R: Assessment of clinical competence. Lancet; 2001; 357: 945 – 949.
21. Haynes S, Richard D, Kubany E: Content validity in psychological assessment: a functional approach to concepts and methods. Psychological Assessment; 1995; 7: 238 – 247.
22. Southgate L, Cox J, David I, Hatch D, Howes A, et al: The assessment of poorly performing doctors: the development of the assessment programmes for the General Medical Council's Performance Procedures. Medical Education; 2001; 35 (s1): 28.
23. Gronlund N: Reliability and other desired characteristics. Measuring and evaluating in teaching. 3rd ed; London; Collier Macmillan; 1976: 105 – 135.
24. Boulet J: Generalizability theory: Basis. In: Everitt S, Howell C, eds. Encyclopedia of Statistics in Behavioural Science; 2nd ed; Chichester; John Wiley & Sons; 2005: 704 – 711.
25. Malik L: The attitudes of medical students to the objective structured practical examination. Medical Education; 1988; 22: 40 – 46.
26. Ram P, Grol R, Rethans J, Schouten B, van der Vleuten C, Kester A: Assessment of general practitioners by video observation of communicative and medical performance in daily practice: issues of validity, reliability and feasibility. Medical Education; 1999; 33: 447 – 454.
27. Crossley J, Humphris G, Jolly B: Assessing health professionals. Medical Education; 2002; 36: 800 – 804.


Author Information
M. N. Chandratilake
Research Officer, Centre for Medical Education, University of Dundee

M. H. Davis
Director, Centre for Medical Education, University of Dundee

G. Ponnamperuma
Senior Lecturer, Development and Research Centre, University of Colombo

