Assessing Health Status and Quality-of-Life Instruments: Attributes and Review Criteria
© 2002 Kluwer Academic Publishers. Printed in the Netherlands.
Abstract

The field of health status and quality of life (QoL) measurement – as a formal discipline with a cohesive theoretical framework, accepted methods, and diverse applications – has been evolving for the better part of 30 years. To identify health status and QoL instruments and review them against rigorous criteria as a precursor to creating an instrument library for later dissemination, the Medical Outcomes Trust in 1994 created an independently functioning Scientific Advisory Committee (SAC). In the mid-1990s, the SAC defined a set of attributes and criteria to carry out instrument assessments; 5 years later, it updated and revised these materials to take account of the expanding theories and technologies upon which such instruments were being developed. This paper offers the SAC's current conceptualization of eight key attributes of health status and QoL instruments (i.e., conceptual and measurement model; reliability; validity; responsiveness; interpretability; respondent and administrative burden; alternate forms; and cultural and language adaptations) and the criteria by which instruments would be reviewed on each of those attributes. These are suggested guidelines for the field to consider and debate; as measurement techniques become both more familiar and more sophisticated, we expect that experts will wish to update and refine these criteria accordingly.

Key words: Health status, Item response theory, Measurement, Quality of life, Reliability, Responsiveness, Validity
new instruments, translate and culturally adapt existing instruments, and facilitate research among academicians, clinicians, and health care organizations; emergence of a professional society devoted explicitly to the furtherance of this field (the International Society for Quality of Life Research [ISOQOL]); convening of numerous international colloquia and conventions on methods and issues in assessing health-related quality of life; production of numerous compilations of rating instruments and questionnaires for measuring health status, functioning, and related concepts; and publication of at least one journal whose core content relates to QoL measurement (namely, Quality of Life Research).

Today, the field is broadly international, and its leaders are at the forefront of applying both traditional and modern theories and methods from health, psychology, and related fields to the creation and validation of such instruments. This heterogeneity induces extremely productive and useful debate and methodologic advances, and for experts in the field, this diversity is acceptable and, indeed, welcome.

The Medical Outcomes Trust and its Scientific Advisory Committee (SAC)

This complex mix of nonprofit organizations, academic researchers, public sector agencies, and commercial firms has, of course, pursued no single set of objectives. In 1992, however, the Medical Outcomes Trust was incorporated with the mission of promoting the science and application of outcomes assessment, with a particular emphasis on expanding the availability and use of self- or interviewer-administered questionnaires designed to assess health and the outcomes of health care from the patients' point of view. To accomplish this mission, the trust undertook to identify such instruments, bring them into an instrument library, and disseminate them (together with appropriate users' guides and related materials) to all persons with an interest in and need for them.

In furtherance of this task, in 1994 the trust created a Scientific Advisory Committee (SAC) – an independently operating entity charged with reviewing instruments and assessing their suitability for broad distribution by the trust. The SAC determined that, to discharge its responsibilities, it would need to establish some principles and criteria, as well as procedures, by which it would acquire, review, and make assessments about instruments that came to its attention or were submitted to the trust.

Instrument review criteria

The SAC thus set about to define a set of attributes and criteria to carry out instrument assessments and, after external peer review and revisions, published and disseminated the first set of its 'instrument review criteria' in 1996 (Evaluating quality-of-life and health status assessment instruments: Development of scientific review criteria. Clin Ther 1996; 18(5): 979–992). They were re-published in the Monitor, a trust publication, in March 1997; a subsidiary set of criteria relating solely to the evaluation of translations and cultural adaptations of instruments was published in the Bulletin, a sister trust publication, in July 1997.

Within the SAC approach and criteria, the term instrument refers to the constellation of items contained in questionnaires and interview schedules along with their instructions to respondents, procedures for administration, scoring, interpretation of results, and other materials found in a users' manual. We use the term attributes to indicate categories of properties or characteristics of instruments that warrant separate, independent consideration in evaluation. Within the attributes, we specify what we denote as criteria, which are commonly understood to be conditions or facts used as a standard by which something can be judged or considered. We view the criteria as prescribing the specific information instrument developers should be prepared to provide about particular aspects of each attribute.

In general, we have used these criteria to review instruments developed in English and cultural and language adaptations based on the English language version of the given instrument, but they can and have been used to consider instruments developed in other languages as well. We apply them to instruments that measure domains of health status and quality of life (QoL) in both groups and individuals. Although we believe that the criteria apply to the 'individualized' class of measures (such as the schedule for the evaluation of individual quality of life [SEIQoL]) that do not have standardized items across respondents, we have not yet had experience applying these criteria with such measures.

Revised instrument review criteria

There matters stood for about 2 years, as the SAC applied its original set of instrument review criteria to instruments submitted from the United States, the United Kingdom, Canada, and various European countries as part of the larger trust activities. Increasingly, however, the SAC encountered two problems. One was that developers sometimes found the documents describing the criteria difficult to apply to their particular situation; the other was that the criteria were less applicable to instruments developed in accordance with the principles of modern test theory than to instruments created in line with classical psychometric rules. Thus, in the course of using the initial criteria set over several years, we determined that they required revision and expansion to address advances in the science of psychometrics and to apply to a broader range of instruments. Quite apart from the fact that more instruments are being developed on principles other than classical test theory, we recognized that being able to apply the same concepts of assessment to other types of instruments, such as screeners or instruments by which consumers might rate their satisfaction with health care and plans, is also desirable.

To address these concerns, we undertook to revise the criteria following the same process as used initially. Specifically, we determined that we would retain the basic structure of the criteria set but expand the definitions and specific criteria to reflect modern test theory principles and methods. We also revamped the presentation of the criteria, primarily to make clear the distinction between the description or definition of a specific attribute (e.g., reliability or respondent burden) and the specific pieces of information that we believe developers should try to provide about that attribute. Before publishing these revised criteria, we solicited outside peer review from six reviewers in the United States, the United Kingdom, and Denmark (see Acknowledgments) and revised the document accordingly.

We have three goals in mind in disseminating these criteria. First, we hope to enhance the appreciation of health outcomes assessment among as wide an audience as possible and to prompt yet more discussion and debate about continuous improvement in this field. Second, we want to provide a template by which others setting out to assess materials or systems (e.g., performance measurement or monitoring systems in the quality-of-care arena) might similarly undertake to state their evaluation criteria clearly and openly. Third, we aim to document the process and criteria used by the SAC within the context of the trust's mission.

Attributes and criteria

Eight attributes have served as the principal foci for SAC instrument review and are the core of this paper. They are:
1. Conceptual and measurement model
2. Reliability
3. Validity
4. Responsiveness
5. Interpretability
6. Respondent and administrative burden
7. Alternative forms
8. Cultural and language adaptations (translations)

Within these attributes, we established specific review criteria that are based on existing standards and evolving practices in the behavioral science and health outcomes fields. These criteria, which are general guidelines, reflect principles and practices of both classical and modern test theory. Table 1 summarizes the attributes and main criteria for each attribute. In general, our review criteria have been designed primarily for health status and QoL profiles; we acknowledge that for various utility or preference measures, yet other attributes and criteria may be appropriate. (At the end of this paper, readers will find a selected bibliography of seminal texts and articles that provide the conceptual and empirical base for these attributes and criteria; we judged this approach to be simpler than trying to document the numerous sources that could be cited for this material within the text itself.)

We review instruments in the context of 11 documented applications:
– assessing the health of general populations at a point in time,
– assessing the health of specific populations at a point in time,
– monitoring the health of general populations over time,
– monitoring the health of specific populations over time,
– evaluating the impact of broad-based or community-level interventions or policies,
– evaluating the efficacy and effectiveness of health care interventions,
– conducting economic evaluations of health interventions,
– using in quality improvement and quality assurance programs in health care delivery systems,
– screening for health conditions,
– diagnosing health conditions,
– monitoring the health status of individual patients.

Table 1. (Continued)

Reproducibility
Stability of an instrument over time (test–retest) and inter-rater agreement at one point in time.
– Methods employed to collect reproducibility data
– Well-argued rationale to support the design of the study and the interval between first and subsequent administration to support the assumption that the population is stable
– Information on test–retest reliability and inter-rater reliability based on intraclass correlation coefficients
– Information on the comparability of the item parameter estimates and on measurement precision over repeated administrations

3. Validity
The degree to which the instrument measures what it purports to measure.
Content-related: evidence that the domain of an instrument is appropriate relative to its intended use.
Construct-related: evidence that supports a proposed interpretation of scores based on theoretical implications associated with the constructs being measured.
Criterion-related: evidence that shows the extent to which scores of the instrument are related to a criterion measure.
– Rationale supporting the particular mix of evidence presented for the intended uses
– Clear description of the methods employed to collect validity data
– Composition of the sample used to examine validity (in detail)
– Above data for each major population of interest
– Hypotheses tested and data relating to the tests
– Clear rationale and support for the choice of criteria measures

4. Responsiveness
An instrument's ability to detect change over time.
– Evidence on the changes in scores of the instrument
– Longitudinal data that compare a group that is expected to change with a group that is expected to remain stable
– Population(s) on which responsiveness has been tested, including the time intervals of assessment, the interventions or measures involved in evaluating change, and the populations assumed to be stable

5. Interpretability
The degree to which one can assign easily understood meaning to an instrument's quantitative scores.
– Rationale for selection of external criteria of populations for purposes of comparison and interpretability of data
– Information regarding the ways in which data from the instrument should be reported and displayed
– Meaningful 'benchmarks' to facilitate interpretation of the scores

6. Burden
The time, effort, and other demands placed on those to whom the instrument is administered (respondent burden) or on those who administer the instrument (administrative burden).
Respondent burden
– Information on: (a) average and range of the time needed to complete the instrument, (b) reading and comprehension level, and (c) any special requirements or requests made of respondents

* For all review criteria entries, developers are expected to provide definitions, descriptions, explanations, or empirical information.

An instrument that works well for one purpose or in one setting or population may not do so when applied for another purpose or in another setting or population. The relative importance of the eight attributes may differ depending on the intended uses and applications specified for the instrument. Instruments may, for instance, document the health status or attitudes of individuals at a point in time, distinguish between two or more groups, assess change over time among groups or individuals, predict future status, or some combinations of these. Hence, the weight placed on one or another set of criteria may differ according to the purposes claimed for the instrument.

In reviewing instruments, the SAC aimed to be thorough without holding instruments to unrealistically high standards. For example, we accepted some instruments even though their responsiveness to change over time (attribute 4) had not been evaluated at the time of submission. In a case such as this, we would note that the instrument had been approved for group comparisons but that no data were available regarding the instrument's responsiveness. In other cases, developers may provide support for content and construct validity but not criterion validity because true gold standards are often not available for evaluating the latter. In yet other cases, reliability may be judged sufficient for comparing groups but not for evaluating individuals. In summary, we matched criteria to particular uses claimed for the instrument and accepted instruments for specific applications when evaluation of the instrument and its documentation supported these applications.

In the remainder of this paper, we present our definition of the attributes noted above and then give our current (i.e., now revised) review criteria. The criteria are offered in terms of our view of what instrument developers should 'do' (e.g., describe, provide, or discuss) in documenting the characteristics of their instruments, so the material appears largely in bulleted form. We emphasize here that our definitions and criteria are open to further discussion and evolution within the field of health status assessment, and we hope that experts around the world will be encouraged to engage in a dialogue about these issues in years to come.
Conceptual and measurement model

Definition
A conceptual model is a rationale for and description of the concepts and the populations that a measure is intended to assess and the relationship between those concepts. A measurement model operationalizes the conceptual model and is reflected in an instrument's scale and subscale structure and the procedures followed to create scale and subscale scores. The adequacy of the measurement model can be evaluated by examining evidence that: (1) a scale measures a single conceptual domain or construct; (2) multiple scales measure distinct domains; (3) the scale adequately represents variability in the domain; and (4) the intended level of measurement of the scale (e.g., ordinal, interval, or ratio) and its scoring procedures are justified.

Classical test theory approaches may employ, for example, principal components analysis, factor analyses, and related techniques for evaluating the empirical measurement model underlying an instrument and for examining dimensionality. Methods based on modern test theory may use approaches including confirmatory factor analysis, structural equation modeling, and methods based on item response theory (IRT).
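To give a concrete flavor of the classical techniques just named, the sketch below screens a scale for unidimensionality by inspecting the eigenvalues of the inter-item correlation matrix, the quantity underlying principal components analysis. This is a minimal illustration in Python; the simulated one-factor data and the function name are ours, not part of the SAC criteria, and a full evaluation would use the confirmatory methods described above.

```python
import numpy as np

def pca_eigenvalues(items: np.ndarray) -> np.ndarray:
    """Eigenvalues of the item correlation matrix, sorted in descending order.

    items: respondents x items array of scored responses (no missing data).
    """
    corr = np.corrcoef(items, rowvar=False)   # item-by-item correlation matrix
    return np.linalg.eigvalsh(corr)[::-1]     # symmetric matrix -> real eigenvalues

# Illustration: 200 simulated respondents, 10 items driven by a single factor.
rng = np.random.default_rng(0)
factor = rng.normal(size=(200, 1))
items = factor + rng.normal(scale=1.0, size=(200, 10))

ev = pca_eigenvalues(items)
print(ev[:3])  # one dominant eigenvalue suggests (but does not prove) unidimensionality
```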
Review criteria
Developers should:
– State what broad concept (or concepts) the instrument is trying to measure – for example, functional status, well-being, health-related quality of life, QoL, satisfaction with health care, or others. In addition, if the instrument is designed to assess multiple domains within a broad concept (e.g., multiple scales assessing several dimensions of health-related quality of life), then provide a listing of all domains or dimensions.
– Describe the conceptual and empirical basis for generating the instrument content (e.g., items) and for combining multiple items into a single scale score and/or multiple scale scores.
– State the methods and involvement of the target populations for obtaining the final content of the instrument and for ascertaining the appropriateness of the instrument's content for that population, for example by use of focus groups or pretesting in target population(s).
– Provide information on dimensionality and distinctiveness of multiple scales, because both classical and modern test approaches assume appropriate dimensionality (usually unidimensionality) of scales.
– Provide evidence that the scale has adequate variability in a range that is appropriate to its intended use – for example, information on central tendency and dispersion, skewness, ceiling and floor effects, and pattern of missing data (a sketch of such descriptive checks follows this list).
– State the intended level of measurement (e.g., ordinal, interval, or ratio scales) with available supportive evidence.
– Describe the rationale and procedures for deriving scale scores from raw scores and for transformations (such as weighting and standardization); for preference-weighted measures or utility measures, provide a rationale and empirical basis for the weights.
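As a sketch of the descriptive evidence the variability criterion asks for, the following Python fragment computes central tendency, dispersion, skewness, floor and ceiling percentages, and missing-data rates for a bounded scale score. The example data and scale bounds are invented for illustration.

```python
import numpy as np

def scale_descriptives(scores: np.ndarray, lo: float, hi: float) -> dict:
    """Descriptive evidence of score variability for a scale bounded by [lo, hi].

    scores may contain np.nan for missing responses.
    """
    valid = scores[~np.isnan(scores)]
    mean, sd = valid.mean(), valid.std(ddof=1)
    return {
        "n": int(valid.size),
        "mean": mean,
        "sd": sd,
        "skewness": ((valid - mean) ** 3).mean() / sd**3,  # sample skewness
        "floor_pct": 100 * (valid == lo).mean(),    # % of scores at the minimum
        "ceiling_pct": 100 * (valid == hi).mean(),  # % of scores at the maximum
        "missing_pct": 100 * np.isnan(scores).mean(),
    }

# Example: a 0-100 scale with one missing response.
scores = np.array([0, 10, 35, 50, 50, 70, 85, 100, 100, np.nan])
print(scale_descriptives(scores, lo=0, hi=100))
```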
Reliability

Definition
The principal definition of test reliability is the degree to which an instrument is free from random error. Classical approaches for examining test reliability include (a) internal consistency reliability, typically using Cronbach's coefficient α, and (b) reproducibility (e.g., test–retest or inter-observer (interviewer) reliability). The first approach requires one administration of the instrument; the latter requires at least two administrations.

In modern test theory applications, the degree of precision of measurement is commonly expressed in terms of error variance, standard error of measurement (SEM) (the square root of the error variance), or test information (the reciprocal of the error variance). Error variance (or any other measure of precision) takes on different values at different points along the scale.

Internal consistency reliability. In the classical approach, Cronbach's coefficient α provides an estimate of reliability based on all possible split-half correlations for a multi-item scale. For instruments employing dichotomous response choices, an alternative formula, the Kuder–Richardson formula 20 (KR-20), is available. Commonly accepted minimal standards for reliability coefficients are 0.70 for group comparisons and 0.90–0.95 for individual comparisons. Reliability requirements are higher when applying instrument scores for individualized use because confidence intervals of those scores are typically computed based on the SEM. The SEM is computed as the standard deviation (SD) × √(1 − reliability). Reliability coefficients lower than 0.90–0.95 provide intervals too wide (e.g., spanning more than one to two thirds of the score distribution) to be useful for monitoring an individual's score.

In the IRT approach, measurement precision is generally evaluated at one or more points on the scale. The scale's precision should be characterized over the measurement range likely to be encountered in actual research. A single value, marginal reliability, can be estimated as an analog to the classical reliability coefficient. This value is most useful for tests in which measurement precision is relatively stable across the scale.

Reproducibility. A second approach to reliability can be obtained by judging the reproducibility or stability of an instrument over time (test–retest) and inter-rater agreement at one point in time. In classical applications, the stability of an instrument is often expressed as a single value, but IRT applications describe specific levels of stability for specific levels of the scale. As with internal consistency reliability, minimal standards for reproducibility coefficients are also typically considered to be 0.70 for group comparisons and 0.90–0.95 for individual measurements over time.

Test–retest reproducibility is the degree to which an instrument yields stable scores over time among respondents who are assumed not to have changed on the domains being assessed. The influence of test administration on the second administration may overestimate reliability. Conversely, variations in health, learning, reaction, or regression to the mean may yield test–retest data that underestimate reproducibility. Bias and limits-of-agreement statistics can indicate the range within which 95% of retest scores can be expected to lie. Despite these cautions, information on test–retest reproducibility is important for the evaluation of the instrument. For instruments administered by an interviewer, test–retest reproducibility typically refers to agreement among two or more observers.
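The quantities in this definition are easy to state concretely. The sketch below computes Cronbach's coefficient α for a multi-item scale and the SEM-based confidence interval that, as noted above, governs individualized use; the simulated data are illustrative only.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's coefficient alpha for a respondents x items matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()   # sum of the item variances
    total_var = items.sum(axis=1).var(ddof=1)     # variance of the total score
    return (k / (k - 1)) * (1 - item_vars / total_var)

def sem(scores: np.ndarray, reliability: float) -> float:
    """Standard error of measurement: SD x sqrt(1 - reliability)."""
    return scores.std(ddof=1) * np.sqrt(1 - reliability)

# Simulated 8-item scale with a common factor, 100 respondents.
rng = np.random.default_rng(1)
factor = rng.normal(size=(100, 1))
items = factor + rng.normal(scale=0.8, size=(100, 8))
total = items.sum(axis=1)

alpha = cronbach_alpha(items)
half_width = 1.96 * sem(total, alpha)  # 95% CI: observed score +/- 1.96 x SEM
print(f"alpha = {alpha:.2f}; 95% CI half-width = {half_width:.1f} score points")
```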
Review criteria
Internal consistency reliability and test information. Developers should:
– Describe clearly the methods employed to collect reliability data. This should include (a) methods of sample accrual and sample size; (b) characteristics of the sample (e.g., sociodemographics, clinical characteristics if drawn from a patient population, etc.); (c) the testing conditions (e.g., where and how the instrument of interest was administered); and (d) descriptive statistics for the instrument under study (e.g., means, SDs, floor and ceiling effects).
– For classical applications, report reliability estimates and SEs for all elements of an instrument, including both the total score and subscale scores, where appropriate.
– For IRT applications, provide a plot showing the SEM over the range of the scale. In addition, the marginal reliability of each scale may be reported where this information is considered useful.
– Where developers have reason to believe that reliability estimates or SEM may differ substantially for the various populations in which an instrument is to be used, present these data for each major population of interest (e.g., different chronic disease populations, different language or cultural groups).

Reproducibility. Developers should:
– Describe clearly the methods employed to collect reproducibility data. This description should include (a) methods of sample accrual and sample size; (b) characteristics of the sample (e.g., sociodemographics, clinical characteristics if drawn from a patient population, etc.); (c) the testing conditions (e.g., where and how the instrument of interest was administered); and (d) descriptive statistics for the instrument under study (e.g., intraclass correlation coefficient, receiver operating characteristic, the test–retest mean, limits of agreement, etc.).
– Provide test–retest reproducibility information as a complement to, not as a substitute for, internal consistency.
– Give a well-argued rationale to support the design of the study and the interval between first and subsequent administrations to support the assumption that the population is stable. This can include self-report about perceived change in health over the time interval or other measures of general and specific health or functional status. Information about test and retest scores should include the appropriate central tendency and dispersion measures of both test and retest administrations.
– In classical applications for instruments yielding interval-level data, include information on test–retest reliability (reproducibility) and inter-rater reliability based on intraclass correlation coefficients (ICC, the bias statistic or test–retest mean, or limits of agreement; see the sketch after this list); for nominal or ordinal scale values, κ and weighted κ, respectively, are recommended.
– In IRT applications, also provide information on the comparability of the item parameter estimates and on measurement precision over repeated administrations.
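The bias and limits-of-agreement statistics referred to in the criteria above can be computed directly from paired administrations. The sketch below uses invented test–retest scores; as noted in the definition, roughly 95% of retest differences are expected to fall within these limits for a stable population.

```python
import numpy as np

def limits_of_agreement(test: np.ndarray, retest: np.ndarray):
    """Bland-Altman bias and 95% limits of agreement for test-retest scores."""
    diff = retest - test
    bias = diff.mean()              # systematic shift between administrations
    sd_diff = diff.std(ddof=1)
    return bias, (bias - 1.96 * sd_diff, bias + 1.96 * sd_diff)

# Illustrative paired scores on a 0-100 scale from two administrations.
test = np.array([42.0, 55.0, 63.0, 70.0, 48.0, 81.0, 59.0])
retest = np.array([45.0, 53.0, 66.0, 69.0, 50.0, 84.0, 57.0])

bias, (lo, hi) = limits_of_agreement(test, retest)
print(f"bias = {bias:.1f}; 95% limits of agreement = [{lo:.1f}, {hi:.1f}]")
```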
Validity

Definition
The validity of an instrument is defined as the degree to which the instrument measures what it purports to measure. Evidence for the validity of an instrument has commonly been classified in the three ways discussed just below. (We note that validation of a preference-based measure will need to employ constructs relating to preferences per se, not simply descriptive constructs, and these can differ from the criteria set out below for nonutility measures.)

1. Content-related: Evidence that the content domain of an instrument is appropriate relative to its intended use. Methods commonly used to obtain evidence about content-related validity include the use of lay and expert panel (clinician) judgments of the clarity, comprehensiveness, and redundancy of items and scales of an instrument. Often, the content of newly developed self-report instruments is best elicited from the population being assessed or experiencing the health condition.

2. Construct-related: Evidence that supports a proposed interpretation of scores based on theoretical implications associated with the constructs being measured. Common methods to obtain construct-related validity data include examining the logical relations that should exist with other measures and/or patterns of scores for groups known to differ on relevant variables. Ideally, developers should generate and test hypotheses about specific logical relationships among relevant concepts or constructs.

3. Criterion-related: Evidence that shows the extent to which scores of the instrument are related to a criterion measure. Criterion measures are measures of the target construct that are widely accepted as scaled, valid measures of that construct. In the area of self-reported health status assessment, criterion-related validity is rarely tested because of the absence of widely accepted criterion measures, although exceptions occur, such as testing shorter versions of measures against longer versions. For testing screening instruments, criterion validity is essential to compare the screening measure against a criterion measure of the diagnosis or condition in question, using sensitivity, specificity, and receiver operating characteristic analysis.
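For the screening case just described, sensitivity and specificity reduce to simple proportions from the two-by-two table of screener result against the criterion diagnosis. A minimal sketch with invented counts:

```python
def screening_accuracy(tp: int, fp: int, fn: int, tn: int):
    """Sensitivity and specificity of a screener against a criterion diagnosis."""
    sensitivity = tp / (tp + fn)  # proportion of true cases the screener detects
    specificity = tn / (tn + fp)  # proportion of non-cases the screener clears
    return sensitivity, specificity

# Illustrative counts: 100 criterion-positive and 200 criterion-negative respondents.
sens, spec = screening_accuracy(tp=80, fp=30, fn=20, tn=170)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")  # 0.80, 0.85
```

Varying the screener's cutoff and plotting the resulting (1 − specificity, sensitivity) pairs traces the receiver operating characteristic curve.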
Review criteria
Developers should:
– Explain the rationale that supports the particular mix of evidence presented for the intended uses.
– Provide a clear description of the methods employed to collect validity data. This should include (a) methods of sample accrual and sample size; (b) characteristics of the sample (e.g., sociodemographics, clinical characteristics if drawn from a patient population); (c) the testing conditions (i.e., where and how the instrument of interest was administered); and (d) descriptive statistics for the instrument under study (e.g., means, SDs, floor and ceiling effects).
– Describe the composition of the sample used to examine the validity of a measure in sufficient detail to make clear the populations to which the instrument applies and selective factors that might reasonably be expected to influence validity, such as gender, age, ethnicity, and language.
– When reasons exist to believe that validity will differ substantially for the various populations in which an instrument is to be used, present the above data for each major population of interest (e.g., different chronic disease populations, different language or cultural groups, different age groups). Because validity testing and use of major instruments are ongoing, we encourage developers to continue to present such data as they accumulate them.
– When presenting construct validity, provide the hypotheses tested and data relating to the tests.
– When data related to criterion validity are presented, provide a clear rationale and support for the choice of criteria measures.

Responsiveness

Definition
Sometimes referred to as sensitivity to change, responsiveness is viewed as an important part of the longitudinal construct validation process. Responsiveness refers to an instrument's ability to detect change. The criterion of responsiveness requires asking whether the measure can detect differences in outcomes, even if those differences are small. Responsiveness can also be conceptualized as the ratio of a signal (the real change over time that has occurred) to the noise (the variability in scores seen over time that is not associated with true change in status).

Assessment of responsiveness involves statistical estimation of an effect size statistic – that is, an estimate of a measure of the magnitude of change in health status (sometimes denoted the 'distance' or difference between before and after scores). No agreement or consensus exists on the preferred statistical measure. Effect size statistics translate the before-and-after changes into a standard unit of measurement; essentially they involve dividing the change score by one or another variance denominator. (Said another way, these statistics are the amount of observed change over the amount of observed variance.) In these statistics, the numerator is always a change score, but the denominator differs depending on the statistic being used (e.g., standardized response mean, responsiveness statistic, SE of the mean).

Moreover, different methods may be used to evaluate effect size. Common approaches include comparing scale scores before and after an intervention that is expected to affect the construct and comparing changes in scale scores with changes in other, related measures that are assumed to move in the same direction as the target measure.

Responsiveness, as some authors have suggested, can be construed as a 'meaningful' level of change and, accordingly, defined as the minimal change considered to be important by persons with the health condition, their significant others, or their providers. We suggest, however, that this connotation of responsiveness might better be considered an element of how the data from an instrument are interpreted. Interpretation of effects, including minimally important differences or changes, is discussed under Interpretability. We make this distinction in part because, although responsiveness and interpretability are related concepts, one focuses on performance characteristics of the instrument at hand and the other focuses on the respondents' views about the domains being studied.
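Two of the statistics named above can be written out explicitly; the notation here is ours. The effect size divides mean change by the baseline standard deviation, whereas the standardized response mean divides it by the standard deviation of the individual change scores d:

```latex
\[
\mathrm{ES} = \frac{\bar{x}_{\text{after}} - \bar{x}_{\text{before}}}{\mathrm{SD}_{\text{before}}},
\qquad
\mathrm{SRM} = \frac{\bar{d}}{\mathrm{SD}_{d}}
\]
```

Both share a change-score numerator and differ only in the variance denominator, exactly as the definition above describes.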
Review criteria
Developers should:
– For any claim that an instrument is responsive, provide evidence on the changes in scores found in field tests of the instrument. Apart from this information, change scores can also be expressed as effect sizes, standardized response means, SEM, or other relative or adjusted measures of distance between before and after scores. The methods and formulae used to calculate the responsiveness statistics should be explained.
– Preferably, cite longitudinal data that compare a group that is expected to change with a group that is expected to remain stable.
– Clearly identify the population(s) on which responsiveness has been tested, including the time intervals of assessment, the interventions or measures involved in evaluating change, and the populations assumed to be stable.

Interpretability

Definition
Interpretability is defined as the degree to which one can assign easily understood meaning to an instrument's quantitative scores. Interpretability of a measure is facilitated by information that translates a quantitative score or change in scores to a qualitative category or other external measure that has a more familiar meaning. Interpretability calls for explanation of the rationale for the external measure, the change scores, and the ways that those scores are to be interpreted in relation to the external measure.

Several types of information can aid in the interpretation of scores:
– comparative data on the distributions of scores derived from a variety of defined population groups, including, when possible, a representative sample of the general population (see the sketch of norm-based scoring after this list);
– results from a large pool of studies that have used the instrument in question and reported findings on it, thus bringing familiarity with the instrument that will aid interpretation;
– the relationship of scores to clinically recognized conditions, need for specific treatments, or interventions of known effectiveness;
– the relationship of scores or changes in scores to socially recognized life events (such as the impact of losing a job);
– the relationship of scores or changes in scores to subjective ratings of minimally important changes by persons with the condition, their significant others, or their providers; and
– how well scores predict known relevant events (such as death or need for institutional care).
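One widely used way to put the comparative (normative) data in the first item of this list to work is a norm-based linear transformation that rescales scores to a mean of 50 and an SD of 10 in a reference population, so that a score can be read immediately against the general-population benchmark. The reference mean and SD below are invented for illustration.

```python
import numpy as np

def t_score(raw: np.ndarray, norm_mean: float, norm_sd: float) -> np.ndarray:
    """Norm-based T-scores: mean 50, SD 10 in the reference population."""
    return 50 + 10 * (raw - norm_mean) / norm_sd

# Hypothetical general-population benchmarks for a 0-100 scale.
scores = np.array([35.0, 50.0, 72.0])
print(t_score(scores, norm_mean=50.0, norm_sd=20.0))  # [42.5, 50.0, 61.0]
```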
Review criteria
Developers should:
– Clearly describe the rationale for selection of external criteria or populations for purposes of comparison and interpretability of data. As with validity (attribute 3), this should include (a) rationale for selection of external criteria or comparison population; (b) methods of sample accrual and sample size; (c) characteristics of the sample; (d) the testing conditions; and (e) descriptive statistics for the instrument under study.
– Provide information regarding the ways in which data from the instrument should be (or have been) reported and displayed in order to facilitate interpretation.
– Cite meaningful 'benchmarks' (comparative or normative data) to facilitate interpretation of the scores.

Burden

Definition
Respondent burden is defined as the time, effort, and other demands placed on those to whom the instrument is administered. Administrative burden is defined as the demands placed on those who administer the instrument.

Review criteria: respondent burden
Developers should:
– Give information on the following properties:
(1) average and range of time needed to complete the instrument on a self-administered basis or, as an interviewer-administered instrument, for all population groups for which the instrument is intended;
(2) the reading and comprehension level needed for all population groups for which the instrument is intended;
(3) any special requirements or requests that might be placed on respondents, such as the need to consult health care records or copy information about medications used; and
(4) the acceptability of the instrument, for example by indicating the level of missing data and refusal rates and the reasons for both.
– For instruments that are not, on the face of it, harmless and for those that appear to have excessive rates of missing data, provide evidence that the instrument places no undue physical or emotional strain on the respondent (for instance, that it does not include questions that a significant minority of patients finds too upsetting or confrontational).
– Indicate when or under what circumstances their instrument is not suitable for respondents.

Review criteria: administrative burden
Developers should provide information about any resources required for administration of the instrument, such as the need for special or specific computer hardware or software to administer, score, or analyze the instrument.

For interviewer-administered instruments, developers should:
– Document the average time and range of time required of a trained interviewer to administer the instrument in face-to-face interviews, by telephone, or with computer-assisted formats/applications, as appropriate;
– Indicate the amount of training and level of education or professional expertise and experience needed by administrative staff to administer, score, or otherwise use the instrument;
– Indicate the availability of scoring instructions.

Alternative modes of administration

Definition
Alternative modes of administration used for the development and application of instruments can include self-report, interviewer-administered, trained observer rating, computer-assisted self-report, computer-assisted interviewer-administered, and performance-based measures. In addition, alternative modes may include self-administered or interviewer-administered versions of the original source instrument that are to be completed by proxy respondents such as parents, spouses, providers, or other substitute respondents.

Review criteria
Developers should:
– Make available evidence on reliability, validity, responsiveness, interpretability, and burden for each mode of administration;
– Provide information on the comparability of alternative modes; whenever possible, equating studies should be conducted so that scores from alternative modes can be made comparable to each other or to scores from an original instrument.

Cultural and language adaptations or translations

Definition
Many instruments are adapted or translated for applications across regional and national borders and populations. In the MOT and SAC context, cultural and language adaptations have referred to situations in which instruments have been fully adapted from original or source instruments for cultures or languages different from the original. Language adaptation might well be differentiated from translation. As a case in point: an instrument developed in Spanish or English may be adapted for different 'versions' (e.g., country- or region-specific dialects) of those basic languages, whereas an instrument developed in Swedish and translated into French or German would be quite a different matter. In any case, the SAC has held the view that the measurement properties of each cultural and/or language adaptation ought to be judged separately for evidence of reliability, validity, responsiveness, interpretability, and burden.

The cross-cultural adaptation of an instrument involves two primary steps: (1) assessment of conceptual and linguistic equivalence, and (2) evaluation of measurement properties. Conceptual equivalence refers to equivalence in the relevance and meaning of the same concepts being measured in different cultures and/or languages. Linguistic equivalence refers to equivalence of question wording and meaning in the formulation of items, response choices, and all aspects of the instrument and its applications. In all such cases, it is useful if developers provide empirical information on how items work in different cultures and languages.

Review criteria
Developers should:
– Describe methods to achieve linguistic equivalence. The commonly recommended steps are (a) at least two forward translations from the source language that yield a pooled forward translation; (b) at least one, preferably more, backward translations to the source language that result in another pooled translation; (c) a review of translated versions by lay and expert panels, with revisions; and (d) field tests to provide evidence of comparability.
Selected bibliography

4. Bjorner JB, Thunedborg K, Kristensen TS, Modvig J, Bech P. The Danish SF-36 health survey: Translation and preliminary validity studies. J Clin Epidemiol 1998; 51: 991–999.
5. Bjorner JB, Kreiner S, Ware JE Jr, Damsgaard MT, Bech P. Differential item functioning in the Danish translation of the SF-36. J Clin Epidemiol 1998; 51: 1189–1202.
6. Bowling A. Measuring Health: A Review of Quality of Life Measurement Scales. 2nd edn. London: Open University Press, 1997.
7. Carmines E. Reliability and Validity Assessment. Newbury Park, Calif.: Sage Publications, 1997.
8. Cronbach LJ. Essentials of Psychological Testing. 4th edn. New York: Harper and Row, 1984.
9. Cronbach LJ. Coefficient alpha and the internal structure of tests. Psychometrika 1951; 16: 297–334.
10. DeVellis R. Scale Development: Theory and Applications. Vol. 26, Applied Social Research Methods Series. Newbury Park, Calif.: Sage Publications, 1991.
11. Hambleton RK, Swaminathan H, Rogers HJ. Fundamentals of Item Response Theory. Newbury Park, Calif.: Sage Publications, 1991.
12. Lohr KN. Health outcomes methodology symposium: Summary and recommendations. Med Care 2000; 38(9) (Suppl. II): II194–II208.
13. Lord FM, Novick MR. Statistical Theories of Mental Test Scores. Reading, Mass.: Addison-Wesley, 1968.
14. McDonald RP. Test Theory: A Unified Treatment. Mahwah, NJ: Lawrence Erlbaum Associates, 1999.
15. McDowell I, Newell C. Measuring Health: A Guide to Rating Scales and Questionnaires. 2nd edn. New York: Oxford University Press, 1996.
16. McHorney CA, Tarlov AR. Individual-patient monitoring in clinical practice: Are available health surveys adequate? Qual Life Res 1995; 4: 293–307.
17. Nunnally JC, Bernstein IH. Psychometric Theory. 3rd edn. New York: McGraw-Hill, 1994.
18. Payne SL. The Art of Asking Questions. Princeton, NJ: Princeton University Press, 1951.
19. Patrick DL, Chiang Y-P (eds). Health outcomes methodology symposium. Med Care 2000; 38(9) (Suppl. II): II3–II208.
20. Reise SP, Widaman KF, Pugh RH. Confirmatory factor analysis and item response theory: Two approaches for exploring measurement invariance. Psychol Bull 1993; 114: 552–566.
21. Staquet M, Hays R, Fayers P. Quality of Life Assessment in Clinical Trials: Methods and Practice. New York: Oxford University Press, 1998.
22. Streiner DL, Norman GR. Health Measurement Scales. 2nd edn. Oxford: Oxford University Press, 1995.
23. Wainer H, Dorans NJ, Flaugher R, et al. Computerized Adaptive Testing: A Primer. Mahwah, NJ: Lawrence Erlbaum Associates, 1990.
24. Wilkin D, Hallam L, Doggett MA. Measures of Need and Outcome for Primary Health Care. New York: Oxford University Press, 1992.

Author for correspondence: Kathleen N. Lohr, Ph.D., Chief Scientist, Health, Social, and Economic Research, RTI International, PO Box 12194, 3040 Cornwallis Road, Research Triangle Park, NC 27709-2194, USA
Phone: +1-919-541-6512; +1-919-541-7384
E-mail: [email protected]