Instrument development and
psychometric validation
Roger Watson
Instrument development and psychometric validation
• Questionnaire design
• Questionnaire validation
• Content validation
• Screening questionnaires
• Receiver operating characteristics
• Predictive values
Questionnaire design
Designing a questionnaire
What do you want to find out?
How will you analyse the data?
Authenticity and directness
The balance between these will dictate the length and utility of your
questionnaire
If you need to ask it, ask it!
If you don’t need to ask it, don’t!
Avoid the ‘just one more question’ trap
Most items will be obvious and come early
Question every additional item
Response formats (contd.)
Points to consider:
• Have you included all the possible options where options are
provided?
• Have you provided a balanced spread of choices where choices
such as opinions are to be selected?
• Are the options mutually exclusive?
• Should you provide a neutral or mid-point response?
Standardised questions
Job satisfaction scale – 5-point Likert scale
Statements (each rated: Strongly disagree / Disagree / Neutral / Agree / Strongly agree):
• My job provides me with an opportunity to advance professionally
• My income is adequate for normal expenses
• My job provides an opportunity to use a variety of skills
• When instructions are inadequate, I do what I think is best
Demographic aspects
Gender: Male / Female
Age: _______ (Please specify)
Educational qualifications: ___________________________ (Please specify the type of degree)
Years of experience as a nurse: ______ yrs ________ mths
Current post is my ______ nursing job: first / second / other: ________________________ (Please specify)
The nature of current employment is: full time / part time: _____________________________ (Please specify the number of hours/week)
Presentation
THIS: What do you think about…?
NOT: WHAT DO YOU THINK ABOUT…?
Questionnaire validation
Instrument development and psychometric validation 030222
Reliability
the extent to which an instrument provides the same measure each time it is
used
Validity
the extent to which an instrument measures what it is supposed to measure
Establishing validity
Construct validity (unobtainable)
Construct validity is "the degree to which a test measures what it claims, or purports,
to be measuring." In the classical model of test validity, construct validity is one of
three main types of validity evidence, alongside content validity and criterion validity.
Modern validity theory defines construct validity as the overarching concern of validity
research, subsuming all other types of validity evidence.
Wikipedia
Content validity
Content validity
• Item validity
• I-CVI
• Scale validity
• S-CVI
• Content validity ratio (CVR)
Content validity index (I-CVI)
• I-CVI is computed as the number of experts giving a rating of “very
relevant” for each item divided by the total number of experts.
• Values range from 0 to 1 where:
• I-CVI > 0.79, the item is relevant
• between 0.70 and 0.79, the item needs revisions
• if the value is below 0.70 the item is eliminated
(Rodrigues IB, Adachi JD, Beattie KA & MacDermid JC, BMC Musculoskeletal Disorders 18, Article 540, 2017)
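The I-CVI computation and the thresholds above can be sketched as follows (a minimal illustration; the expert counts are hypothetical):

```python
def i_cvi(ratings):
    """Item-level content validity index: the proportion of experts
    rating the item 'very relevant' (coded True here)."""
    return sum(ratings) / len(ratings)

def interpret_i_cvi(value):
    # Thresholds from the slide: >0.79 relevant; 0.70-0.79 needs revision; <0.70 eliminate
    if value > 0.79:
        return "relevant"
    if value >= 0.70:
        return "needs revision"
    return "eliminate"

# Hypothetical panel: 8 of 10 experts rate the item 'very relevant'
ratings = [True] * 8 + [False] * 2
print(i_cvi(ratings))            # 0.8
print(interpret_i_cvi(0.8))      # relevant
```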
Content validity (S-CVI)
• Similarly, S-CVI is calculated using the number of items in a tool that have
achieved a rating of “very relevant”
• There are two methods of calculating S-CVI:
Universal Agreement (UA) among experts (S-CVI/UA):
• S-CVI/UA is calculated by adding all items with I-CVI equal to 1 divided
by the total number of items
• S-CVI/UA ≥ 0.8 = excellent content validity
Average CVI (S-CVI/Ave) (less conservative):
• S-CVI/Ave is calculated by taking the sum of the I-CVIs divided by the
total number of items
• S-CVI/Ave ≥ 0.9 = excellent content validity
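Both S-CVI methods above can be computed from the item-level I-CVIs; a minimal sketch with hypothetical values:

```python
def s_cvi_ua(i_cvis):
    """Universal agreement method: proportion of items whose
    I-CVI equals 1 (all experts rated them 'very relevant')."""
    return sum(1 for v in i_cvis if v == 1.0) / len(i_cvis)

def s_cvi_ave(i_cvis):
    """Average method (less conservative): mean of the item I-CVIs."""
    return sum(i_cvis) / len(i_cvis)

# Hypothetical I-CVIs for a 5-item scale
i_cvis = [1.0, 1.0, 0.9, 0.8, 1.0]
print(round(s_cvi_ua(i_cvis), 2))   # 0.6
print(round(s_cvi_ave(i_cvis), 2))  # 0.94
```

Note how the two methods diverge: the average method rates this scale as excellent (≥ 0.9), while universal agreement does not (< 0.8).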
Content validity ratio (CVR)
• CVR is computed to specify whether or not an item is necessary for operationalising a construct in a set of items
• For this, an expert panel rates each item on a 3-point scale: essential; useful but not essential; not necessary
• CVR = (Ne – N/2) / (N/2), where Ne is the number of panellists rating the item “essential” and N is the total number of panellists
• The numeric value of CVR ranges from -1 to 1 (Lawshe, 1975)
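Lawshe's formula is a one-liner; a sketch with hypothetical panel counts showing the full range of values:

```python
def cvr(n_essential, n_panellists):
    """Lawshe's content validity ratio: (Ne - N/2) / (N/2).
    Ranges from -1 (no one rates the item essential) to +1 (everyone does)."""
    half = n_panellists / 2
    return (n_essential - half) / half

print(cvr(8, 10))   # 0.6  (8 of 10 panellists say 'essential')
print(cvr(5, 10))   # 0.0  (exactly half)
print(cvr(0, 10))   # -1.0 (none)
```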
Criterion validity
Criterion validity
• …the extent to which an operationalization of a construct, such as a
test, relates to, or predicts, a theoretical representation of the
construct—the criterion. (Wikipedia)
Construct validity
Factorial validity
Factorial validity
• Factorial validity examines the extent to which the underlying
putative structure of a scale is recoverable in a set of test scores.
(Piedmont R.L. (2014) Factorial Validity. In: Michalos A.C. (ed.) Encyclopedia of
Quality of Life and Well-Being Research. Springer, Dordrecht.
https://doi.org/10.1007/978-94-007-0753-5_984)
Types of factor analysis
Exploratory (EFA)
• principal axis factoring
• maximum likelihood factoring
• principal components analysis (PCA)*
Confirmatory (CFA)
• structural equation modelling
* - not strictly EFA
(http://www.stat-help.com/factor.pdf)
Item response theory
Item response theory
• Item response theory (IRT), also known as latent response theory, refers to a
family of mathematical models that attempt to explain the relationship
between latent traits (unobservable characteristics or attributes) and their
manifestations (i.e. observed outcomes, responses or performance).
(https://www.publichealth.columbia.edu/research/population-health-methods/item-
response-theory)
05/04/2022 © The University of Sheffield / Department of Marketing and Communications
Item response theory (IRT)
• The unit of analysis in IRT:
• The item characteristic curve (ICC)
• Also known as:
• The item response curve (IRC)
• The item response function (IRF)
Item characteristic curves
[Figure: item characteristic curves for two items, plotting P(θ) against the latent trait θ]
• item 2 is more ‘difficult’ than item 1
• it represents more of the latent variable
• more difficult items will have lower mean scores on the latent variable
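The curves in the figure can be illustrated with a logistic item characteristic curve. The slides do not name a specific IRT model, so this sketch assumes the common two-parameter logistic (2PL) form, with hypothetical difficulty values:

```python
import math

def icc(theta, difficulty, discrimination=1.0):
    """Two-parameter logistic item characteristic curve:
    P(theta) = 1 / (1 + exp(-a * (theta - b)))
    where b is item difficulty and a is item discrimination."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))

# Item 2 is more 'difficult' (higher b), so at the same theta the
# probability of a positive response is lower than for item 1.
theta = 0.0
print(round(icc(theta, difficulty=-1.0), 3))  # item 1: 0.731
print(round(icc(theta, difficulty=1.0), 3))   # item 2: 0.269
```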
Item response theory
• Rasch analysis
• Partial credit model
• Mokken scaling
Advantages of item response theory:
• only a specific set of items produces a given score on the latent variable
• therefore, you know what the score means
Screening questionnaires
Screening questionnaires
• How does a screening questionnaire work?
Screening questionnaires
Why use them?
Screening questionnaires
• Used to find out if someone has something, or is likely to have
something
• For example:
• Depression
• Problem drinking
• Eating disorder
• Bowel cancer
• Medications management risk
Screening questionnaires
• Why use them when a diagnosis is available?
• Many reasons:
• Speed and volume (many people quickly)
• Potential to save lives and prevent morbidity
• Appropriate use of resources
• Investigate or intervene only when necessary
• Lower risk of dangerous procedures
Screening questionnaires
• But how do we decide what the questionnaire is telling us?
• We need to attach a score on the questionnaire to the level of risk or
probability of diagnosis
• Problems:
• Sometimes the questionnaire will be wrong
• Some people at risk will be screened as ‘OK’
• Some people not at risk will be screened as ‘not OK’
Screening questionnaires
• Example: Bowel cancer screening (fictitious)
• Questions:
• Bloating yes/no
• Pain yes/no
• Changed bowel habit yes/no
• Blood yes/no
• Score range 0-4
• Is someone at risk of bowel cancer at 1, 2, 3 or 4?
How do we make decisions?
• True positives
• False negatives
• True negatives
• False positives
Parameters
Sensitivity and Specificity
Sensitivity
• In medical diagnosis, test sensitivity is the ability of a test to correctly
identify those with the disease (true positive rate), whereas
test specificity is the ability of the test to correctly identify those without
the disease (true negative rate).
• Example: Sensitivity = 0.66
Specificity
• Example: Sensitivity = 0.66; Specificity = 0.52
Sensitivity and specificity
• There is a ‘trade-off’ between sensitivity and specificity
• The more sensitive a test is, the less specific it will be: you will increase
the number of negative people diagnosed as positive
• The more specific a test is, the less sensitive it will be: you will miss
more people who are really positive
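Both parameters come straight from the confusion-matrix counts. A minimal sketch, using hypothetical counts chosen to reproduce the figures quoted above:

```python
def sensitivity(tp, fn):
    """True positive rate: of those who really have the condition,
    the proportion the test flags as positive."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: of those without the condition,
    the proportion the test correctly clears."""
    return tn / (tn + fp)

# Hypothetical screening counts
tp, fn, tn, fp = 66, 34, 52, 48
print(sensitivity(tp, fn))  # 0.66
print(specificity(tn, fp))  # 0.52
```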
SO HOW DO WE DECIDE WHAT IS BEST?
Receiver operating
characteristics curves
Receiver Operating Characteristics
The ROC curve was first
developed by electrical
engineers and radar
engineers during World
War II for detecting
enemy objects in
battlefields (Wikipedia)
Receiver Operating Characteristic (ROC)
curve
• The term “Receiver Operating Characteristic” has its roots in World War II.
ROC curves were originally developed by the British as part of the “Chain
Home” radar system.
• ROC analysis was used to analyze radar data to differentiate between
enemy aircraft and signal noise (e.g. flocks of geese).
• As the sensitivity of the receiver increased, so did the number of false
positives (in other words, specificity went down).
Receiver Operating Characteristic (ROC)
curve
• The ROC allows us to combine, graphically, sensitivity AND specificity
• The ROC allows us to find the optimal level of sensitivity and specificity
Receiver Operating Characteristic (ROC) curve
ROC plot of: Sensitivity against 1 – Specificity
• The diagonal of the plot is no better than guessing; points above it are
better than guessing, points below it are worse
• Area under the curve (AUC) > 0.7 indicates an acceptable ROC
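An ROC curve is built by sweeping the cut-off over the score range and recording (1 – specificity, sensitivity) at each step; the AUC is then the area under those points. A minimal pure-Python sketch with hypothetical screening scores (0–4) and true disease labels:

```python
def roc_points(scores, labels):
    """(1 - specificity, sensitivity) pairs, one per candidate cut-off,
    classifying score >= cut-off as positive."""
    pos = sum(labels)
    neg = len(labels) - pos
    pts = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        pts.append((fp / neg, tp / pos))
    return pts

def auc(points):
    """Trapezoidal area under the ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical scores on a 0-4 screening questionnaire, with true status
scores = [4, 3, 3, 2, 1, 0, 2, 1]
labels = [1, 1, 0, 1, 0, 0, 1, 0]
print(round(auc(roc_points(scores, labels)), 2))  # 0.84
```

With these made-up data the AUC exceeds 0.7, which by the rule of thumb above would count as acceptable.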
Receiver operating
characteristics curves – Example:
PAID questionnaire
PAID – Problem areas in diabetes questionnaire
• 20 items
• 5-point scale
• e.g.: ‘Feeling depressed when you think about living with diabetes?’
• ‘Not a problem’ (0) to ‘Serious problem’ (4)
Fig. 1 ROC curve of the PAID questionnaire score for screening for clinical depression
Fig. 2 ROC curve of the PAID questionnaire score for screening for subclinical depression
How do we decide the optimal levels of sensitivity and specificity?
• Youden index (J)
• J = Sensitivity + Specificity – 1
Youden index
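In practice the Youden index is computed at every candidate cut-off and the cut-off that maximises J is chosen. A minimal sketch; the (sensitivity, specificity) pairs per cut-off are hypothetical:

```python
def youden_j(sens, spec):
    """Youden index: J = sensitivity + specificity - 1."""
    return sens + spec - 1

# Hypothetical (sensitivity, specificity) at each questionnaire cut-off
cutoffs = {1: (0.95, 0.30), 2: (0.80, 0.60), 3: (0.66, 0.52), 4: (0.40, 0.90)}
best = max(cutoffs, key=lambda c: youden_j(*cutoffs[c]))
print(best)  # 2  (J = 0.40, the largest of the four)
```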
Sensitivity and Specificity vs PPV & NPV
PPV versus Sensitivity
• The definition of Positive Predictive Value is similar to the sensitivity of
a test and the two are often confused.
• However, PPV is useful for the patient, while sensitivity is more useful
for the physician.
• Positive predictive value tells you the odds of having the disease
if you have a positive result.
(https://www.statisticshowto.com/probability-and-statistics/statistics-
definitions/sensitivity-vs-specificity-statistics/)
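PPV and NPV condition on the test result rather than on the true status, so they are read across the other axis of the confusion matrix. A minimal sketch with hypothetical counts:

```python
def ppv(tp, fp):
    """Positive predictive value: probability of having the disease
    given a positive test result."""
    return tp / (tp + fp)

def npv(tn, fn):
    """Negative predictive value: probability of not having the disease
    given a negative test result."""
    return tn / (tn + fn)

# Hypothetical screening counts
tp, fn, tn, fp = 66, 34, 52, 48
print(round(ppv(tp, fp), 3))  # 0.579
print(round(npv(tn, fn), 3))  # 0.605
```

Unlike sensitivity and specificity, PPV and NPV depend on the prevalence of the condition in the screened population, which is why they are the more useful figures in real screening situations.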
Summary
• Diagnostic or screening questionnaires:
• Only work where you have a binary outcome (yes/no)
• Sensitivity and specificity are antagonistic
• Sensitivity and specificity can be combined to find an optimal level of each
• Receiver operating characteristic curves help us to optimise diagnostic and screening
questionnaires
• PPV & NPV are helpful in real screening situations
rwatson1955@gmail.com
0000-0001-8040-7625
@rwatson1955