An Overview of Contemporary ROC Methodology in Medical Imaging and Computer-Assist Modalities

The document discusses contemporary methodology for evaluating medical imaging technologies using receiver operating characteristic (ROC) analysis. It outlines efforts toward consensus, the ROC paradigm involving sensitivity and specificity, and complications that can arise from reader variability, uncertainty in truth assessments, sample sizes, and reader vigilance. Multiple-reader multiple-case studies that resample readers and cases are presented as a way to account for these complexities when assessing new diagnostic imaging and computer-assisted modalities.


An Overview of Contemporary ROC Methodology

in Medical Imaging and Computer-Assist Modalities


Robert F. Wagner, Ph.D., OST, CDRH, FDA

ROC
Receiver Operating Characteristic
(historic name from radar studies)
Relative Operating Characteristic
(psychology, psychophysics)
Operating Characteristic
(preferred by some)

OUTLINE:
- Efforts toward consensus development on present issues
- The ROC Paradigm
- The complication of reader variability
- The multiple-reader multiple-case (MRMC) ROC paradigm
- The measurement scales:
categories; patient-management/action; probability scale
- Complications from
location uncertainty
truth uncertainty
effective sample # uncertainty
reader vigilance
- Summary

EFFORTS TOWARD CONSENSUS DEVELOPMENT
ON THE PRESENT ISSUES

- How to use classic concepts of Sensitivity, Specificity,
  and ROC analysis to assess performance
  of diagnostic imaging and computer-assist systems?
- Many new issues and levels of complexity coming
  to the fore as more complex technologies emerge

EFFORTS TOWARD CONSENSUS DEVELOPMENT
ON THE PRESENT ISSUES (II)

RSNA/SPIE/MIPS Various Workshops & Literature
- an evolving Work-in-Progress
FDA/CDRH use of multiple-reader multiple-case (MRMC) ROC
- Digital Mammography PMAs
- Computer Aid for lung nodule detection on CXR (film)
NCI Lung Image Database Consortium (LIDC) & Workshops
- Consensus seeking on many issues
- Two CDRH active members
Communication of these resources with incoming sponsors

Fundamentals of the ROC paradigm

[Figure: distributions of the test result value (or of the subjective judgement of the likelihood that a case is diseased) for non-diseased cases and for diseased cases, separated by a decision threshold.]

[Figure: more typically, the non-diseased and diseased distributions overlap along the same decision axis, so no threshold separates them cleanly.]
[Figure: the overlapping distributions with a threshold set by a less aggressive mindset; the fraction of the diseased distribution above threshold gives TPF (sensitivity), and the fraction of the non-diseased distribution above threshold gives FPF (1-specificity).]

[Figure: the same distributions with the threshold of a moderate mindset, giving a different (FPF, TPF) operating point.]

[Figure: the same distributions with the threshold of a more aggressive mindset: both TPF (sensitivity) and FPF (1-specificity) increase.]

[Figure: sweeping the threshold across the non-diseased and diseased distributions traces out the entire ROC curve of TPF (sensitivity) versus FPF (1-specificity).]
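The threshold sweep in the slides above can be sketched in a few lines of code. This is an illustrative sketch, not material from the talk: it assumes the usual convention that higher test-result values indicate disease, and the scores and threshold values are hypothetical.

```python
def operating_point(pos, neg, threshold):
    """(FPF, TPF) when every case scoring >= threshold is called diseased."""
    tpf = sum(s >= threshold for s in pos) / len(pos)   # sensitivity
    fpf = sum(s >= threshold for s in neg) / len(neg)   # 1 - specificity
    return fpf, tpf

# Hypothetical scores: diseased cases tend to score higher than non-diseased.
pos = [0.9, 0.8, 0.7, 0.6, 0.4]
neg = [0.5, 0.4, 0.3, 0.2, 0.1]

# Sweeping the threshold from aggressive (low) to conservative (high)
# traces the ROC curve from (1, 1) down to (0, 0).
curve = [operating_point(pos, neg, t) for t in (0.0, 0.35, 0.55, 0.75, 1.0)]
```

Each fixed threshold is one "mindset" from the slides; only the full sweep characterizes the test itself.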

[Figure: the entire ROC curve plotted as TPF (sensitivity) versus FPF (1-specificity); the curve lies above the chance line and bows farther toward the upper-left corner with greater reader skill and/or level of technology.]

. . . at least that's the idea . . .

. . . now to what happens in the real world . . .

The Complication of Reader Variability

In the following example from mammography,
readers were asked to set their threshold for action
at their own sense of the boundary between
category 3 and category 4 of the BIRADS scale.

[Figure: scatter plot of the (FPF, TPF) operating points of 108 US radiologists, on axes of True Positive Fraction (and False Negative Fraction) versus False Positive Fraction (and True Negative Fraction), each running from 0.0 to 1.0; the points spread over a broad region rather than clustering at a single point.]

TPF vs FPF for 108 US radiologists in study by Beam et al.

- There is no unique ROC operating point,
  i.e., no unique (TPF, FPF) point
- There is no unique ROC curve,
  i.e., there is a band or region of ROCs

. . . dozens of examples of this phenomenon exist . . .

The following is an example from
plain-film chest radiography (CXR).

[Figure: chest film study by E. James Potchen, M.D., 1999]

The Multiple-Reader Multiple-Case (MRMC) paradigm

Fully-Crossed Design
* Cases matched across modalities
  (i.e., same cases read unaided vs aided)
* Readers matched across modalities
  (i.e., same readers read unaided vs aided)
* This design has the most statistical power for
  a given number of readers and
  a given number of cases with verified truth;
  thus, it is the least demanding of these resources
  (least burdensome)

The Multiple-Reader Multiple-Case (MRMC) paradigm

Enabled by resampling strategies
- Jackknife plus ANOVA (parametric)
  (Dorfman, Berbaum, Metz: DBM 1992)
- Bootstrap the experiment of interest (nonparametric)
  - Draw random readers, random cases
  - Carry out the experiment of interest

Some possible bootstrap samples of size 15
from a dataset with 15 elements:

[14, 6, 3, 5, 12, 9, 11, 14, 4, 10, 7, 12, 3, 14, 2]
. . .
[9, 15, 11, 2, 13, 1, 6, 7, 12, 4, 8, 1, 12, 6, 14]
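A bootstrap sample like those above is simply n draws with replacement from the n indices. A minimal sketch (the function name and seed below are illustrative, not from the talk):

```python
import random

def bootstrap_sample(n, rng):
    """One bootstrap sample: n indices drawn with replacement from 1..n."""
    return [rng.randint(1, n) for _ in range(n)]

rng = random.Random(0)
sample = bootstrap_sample(15, rng)
# Same size as the original dataset, but some elements repeat
# and others are absent, exactly as in the samples listed above.
```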

The Multiple-Reader Multiple-Case (MRMC) paradigm

Enabled by resampling strategies
- Jackknife plus ANOVA (parametric)
  (Dorfman, Berbaum, Metz: DBM 1992)
- Bootstrap the experiment of interest (nonparametric)
  - Draw random readers, random cases
  - Carry out the experiment of interest
  - Obtain mean performance over readers, cases
  - Obtain error bars that account for
    variability of readers and cases
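The bootstrap recipe on this slide can be sketched as follows. This is an illustrative implementation, not the talk's software: `mrmc_bootstrap` and its data layout are hypothetical names, the ROC area is estimated with the standard Wilcoxon-Mann-Whitney statistic, and both readers and cases are resampled so the error bar reflects both sources of variability.

```python
import random
import statistics

def auc(pos, neg):
    """Wilcoxon-Mann-Whitney estimate of the area under the ROC curve."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mrmc_bootstrap(scores, n_boot=200, seed=0):
    """Mean ROC area and an error bar over random readers AND random cases.

    scores[reader] = (diseased_scores, nondiseased_scores), every reader
    scoring the same cases (fully-crossed design); case indices are drawn
    once per replicate so the resample stays matched across readers.
    """
    rng = random.Random(seed)
    readers = list(scores)
    n_pos = len(next(iter(scores.values()))[0])
    n_neg = len(next(iter(scores.values()))[1])
    reps = []
    for _ in range(n_boot):
        boot_readers = [rng.choice(readers) for _ in readers]      # random readers
        pi = [rng.randrange(n_pos) for _ in range(n_pos)]          # random diseased cases
        ni = [rng.randrange(n_neg) for _ in range(n_neg)]          # random non-diseased cases
        reps.append(statistics.mean(
            auc([scores[r][0][i] for i in pi], [scores[r][1][i] for i in ni])
            for r in boot_readers))
    return statistics.mean(reps), statistics.stdev(reps)

# Toy, perfectly separated scores for two hypothetical readers:
scores = {"reader1": ([0.9, 0.8, 0.7], [0.2, 0.3, 0.1]),
          "reader2": ([0.95, 0.6, 0.75], [0.4, 0.1, 0.2])}
mean_auc, err = mrmc_bootstrap(scores)
```

With real rating data the spread of the replicates, not a binomial formula, supplies the error bars that account for reader and case variability.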

Scales used for reporting and measurements:
- Historic ordered categories (usually 5 or 6)
  (almost definitely no . . . maybe . . . almost definitely yes)
- Action or patient-management scale
  (e.g., no action vs follow-up . . . or follow-up vs biopsy)
  . . . the BIRADS scale is the classic example . . .
- Continuous probability rating scale
  (e.g., probability of disease or probability of cancer)
  . . . actually recommended in the BIRADS document . . .

Scales used for reporting and measurements

Example of the best of both worlds:
Classification of benign vs malignant calcification clusters
(Jiang, Nishikawa, Schmidt, Metz, Giger, Doi)

The authors studied ROC curves, ROC areas . . .
and the (Sensitivity, Specificity) operating point
(means and uncertainties)

Possible reasons why we do not see more
of the best of both worlds:

ROC total area is TPF (Se) averaged over FPF (Sp)
- Var(ROC area) ~ (binomial variance)/2
- Var(Se) when Sp is known = binomial variance
- Var(Se) when Sp is estimated > binomial variance
Var(ROC area) is the least burdensome
- Best of both worlds requires consistent conventions
  . . . plus training (little documentation so far)
- May require consensus bodies to promote
  the practice
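The variance comparison above can be made concrete with the textbook binomial variance p(1-p)/N. The numbers below are purely hypothetical, and the halving for Var(ROC area) is the slide's rule of thumb, not an exact result:

```python
def binomial_var(p, n):
    """Variance of a simple proportion (e.g., sensitivity) estimated from n cases."""
    return p * (1 - p) / n

# Hypothetical illustration: Se = 0.85 measured on 50 diseased cases.
var_se_known_sp = binomial_var(0.85, 50)   # Var(Se) when Sp is known
var_auc_approx = var_se_known_sp / 2       # slide's rule of thumb for Var(ROC area)
```

The ordering is the point: the area estimate is the least noisy, Se at known Sp is binomial, and Se at an estimated Sp is noisier still.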

The most famous slides in the ROC archives . . .

Dilemma: Which modality is better?

[Figure: a single operating point for Modality A and a single operating point for Modality B in the plane of True Positive Fraction (= Sensitivity) versus False Positive Fraction (= 1 - Specificity), each axis from 0.0 to 1.0; neither point dominates the other.]

The dilemma is resolved after ROCs
are determined (one scenario):

[Figure: the full ROC curves, with Modality B's curve lying above Modality A's in the TPF-vs-FPF plane.]

Conclusion: Modality B is better:
higher TPF at same FPF, or
lower FPF at same TPF

A different scenario: Same ROC

[Figure: the operating points of Modality A and Modality B lie on one and the same ROC curve in the TPF-vs-FPF plane; the two modalities differ only in threshold, not in underlying performance.]

. . . yet another scenario:

[Figure: the full ROC curves, with Modality A's curve lying above Modality B's in the TPF-vs-FPF plane.]

Conclusion: Modality A is better:
higher TPF at same FPF, or
lower FPF at same TPF

When ROC curves cross . . .
total area under the ROC curve is not a sufficient
summary measure of performance;
. . . other summary measures may be necessary.
When this is anticipated, the study protocol
is expected to address this.

Location scoring:
- The basic ROC paradigm is an assessment of the
  decision making at the level of the patient.
- In complex imaging, assessment of
  decision making at a finer level is desired,
  i.e., assessment of localization is desired.
- Localization adds more information,
  more statistical power.

The problem of location-specific ROC or
LROC analysis
- Measurement of a "hit" depends on the localization criterion
  (thus, results are not unique)
- Monotonic relationship between ROC and LROC
  for the special case of zero or one lesion
- More elaborate models require assumptions of
  independence among multiple lesions, regions
- Lack of validated software for analysis of experiments

Region-of-interest (ROI) approach to
location-specific ROC analysis . . .
. . . only requires localization to within a quadrant . . .
. . . or some other unit . . .

Region-of-interest (ROI) approach to
location-specific ROC analysis . . .

- Disadvantages: does not correspond to the clinical task
  . . . etc. . . .
- Advantages: straightforward to account for correlations
  without additional assumptions
- The most straightforward method is simply to
  resample using the patient as the statistical unit

THE PROBLEM OF UNCERTAINTY OF TRUTH STATE

Classic paper: Revesz, Kundel, Bonitatibus (1983)
included various ways of obtaining panel-consensus truth.
The authors compared three imaging methods;
any one of the three could outperform the others,
depending on the rule used for reducing the panel to truth.
HOWEVER, TODAY THE TARGET IS THE ACTIONABLE NODULE
ACCORDING TO AN EXPERT PANEL.
The classic reference above indicates additional uncertainty is present
=> Resample the panel to assess the additional uncertainty

UNCERTAINTY OF TRUTH STATE
=> UNCERTAINTY IN EFFECTIVE SAMPLE SIZE

Uncertainty in TPF depends on the # of actually diseased cases
Uncertainty in FPF depends on the # of actually nondiseased cases
Uncertainty in total area under the ROC curve depends on the
effective number of cases:
the harmonic mean of the numbers in the two classes
. . . & is a function of the panel sample

Given: 100 patients.
What is the best split between normals and abnormals
for purposes of estimating area under the ROC?
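Using the previous slide's identification of the effective number of cases with the harmonic mean of the two class sizes, the question has a direct numerical answer. The helper name below is illustrative:

```python
def effective_cases(n_pos, n_neg):
    """Harmonic mean of the two class sizes: the slide's effective number of cases."""
    return 2 * n_pos * n_neg / (n_pos + n_neg)

# Try the 100-patient split in 10-patient increments:
splits = {(n, 100 - n): effective_cases(n, 100 - n) for n in range(10, 100, 10)}
best = max(splits, key=splits.get)   # the even split maximizes the effective N
```

An even 50/50 split gives an effective N of 50 cases, while a lopsided 10/90 split gives only 18, which is why balanced case sets are preferred for estimating the ROC area.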

. . . relaxing the panel criterion from unanimous to majority
- allows resampling to assess variability
- may increase the effective number of samples
. . . these effects may tend to cancel

THE PROBLEM OF CONTROLLING FOR READER VIGILANCE

Any measurement setting has artificial conditions
vis-à-vis actual practice:
Are readers more vigilant in unaided reading
when they're subjects in a study?
Are readers less vigilant in unaided reading
when they're not subjects in a study?
One early suggestion:
Control the time available to readers to mimic the clinic
(Chan et al., Invest. Radiol. 1990)

IN SUMMARY
These points reflect the current status of
ongoing interactions between and among
- FDA
- Academia
- Industry sponsors
- NCI and the LIDC
on the topics and issues for submissions like the present one.

Selected References
Metz CE. Basic principles of ROC analysis. Seminars in Nuclear Medicine 1978;
8: 283-298.
Metz CE. ROC methodology in radiologic imaging. Invest Radiol 1986; 21: 720-33.
Metz CE. Some practical issues of experimental design and data analysis in
radiological ROC studies. Invest Radiol 1989; 24: 234-245.
Metz CE. Fundamentals of ROC Analysis. [In] Handbook of Medical Imaging.
Vol. 1. Physics and Psychophysics. Beutel J, Kundel HL, and Van Metter RL,
Eds. SPIE Press (Bellingham WA 2000), Chapter 15: 751-769.
Swets JA and Pickett RM. Evaluation of Diagnostic Systems. Academic Press,
New York, 1982.
Wagner RF, Beiden SV, Campbell G, Metz CE, and Sacks WM. Assessment of
medical imaging and computer-assist systems: Lessons from recent
experience. Acad Radiol 2002; 9: 1264-1277.
Wagner RF, Beiden SV, Campbell G, Metz CE, and Sacks WM. Contemporary
issues for experimental design in assessment of medical imaging and
computer-assist systems. Proc. of the SPIE-Medical Imaging 2003; 5034:
213-224.
Dodd LE, Wagner RF, Armato SG, McNitt-Gray MF, et al. Assessment
methodologies and statistical issues for computer-aided diagnosis of lung
nodules in computed tomography: Contemporary research topics relevant
to . . .

Toledano AY, Gatsonis C. Ordinal regression methodology for ROC curves derived
from correlated data. Statistics in Medicine 1996, 15: 1807-1826.
Nishikawa RM and Yarusso LM. Variations in measured performance of CAD
schemes due to database composition and scoring protocol. Proc. of the
SPIE 1998; 3338: 840-844.
Giger ML. Current issues in CAD for mammography. In: Doi K, Giger ML,
Nishikawa RM, and Schmidt RA, Eds. Digital Mammography '96. Elsevier
Science B.V. 1996, 53-59.
Clarke LP, Croft BY, Staab E, Baker H, Sullivan DC, National Cancer Institute
initiative: Lung image database resource for imaging research. Acad Radiol
2001 May;8(5):447-50.
Wagner RF, Beiden SV, Metz CE. Continuous versus categorical data for ROC
analysis: Some quantitative considerations. Acad Radiol 2001; 8: 328-334.
Revesz G, Kundel HL, and Bonitatibus M. The effect of verification on the
assessment of imaging techniques. Invest. Radiol. 1983; 18: 194-198.
Beiden SV, Wagner RF, Campbell G. Components-of-variance models and
multiple-bootstrap experiments: An alternative method for random-effects
receiver operating characteristic analysis. Acad Radiol 2000; 7: 341-349.
Obuchowski NA. Multireader, multimodality receiver operating characteristic
curve studies: Hypothesis testing and sample size estimation using an
analysis of variance approach with dependent observations. Acad Radiol
1995; 2 (Supplement 1): S22-S29.
Chan HP, Doi K, Vyborny CJ et al. Improvement in radiologists' detection of
clustered microcalcifications on mammograms. Invest Radiol 1990; 25:
1102-1110.
Chakraborty DP and Winter L. Free-response methodology: Alternate analysis


and a new observer-performance experiment. Radiology 1990; 174: 873-881.
Metz CE, Starr SJ, Lusted LB. Observer performance in detecting multiple
radiographic signals: prediction and analysis using a generalized ROC
approach. Radiology 1976; 121: 337-347.
Starr SJ, Metz CE, Lusted LB, Goodenough DJ. Visual detection and localization
of radiographic images. Radiology 1975; 116: 533-538
Swensson RG. Unified measurement of observer performance in detecting and
localizing target objects on images. Medical Physics 1996; 23: 1709-1725.
Chakraborty DP. The FROC, AFROC and DROC variants of the ROC analysis. [In]
Handbook of Medical Imaging. Vol. 1. Physics and Psychophysics. Beutel J,
Kundel HL, and Van Metter RL, Eds. SPIE Press (Bellingham WA 2000), Chapter
16: 771-796.
Obuchowski NA. Multireader receiver operating characteristic studies: A
comparison of study designs. Acad Radiol 1995; 2: 709-716.
Gatsonis CA, Begg CB, Wieand S. Advances in Statistical Methods for Diagnostic
Radiology: A Symposium. Acad Radiol 1995; 2 (Supplement 1): S1-S84 (the
entire supplement is the Proceedings of the Symposium).
Beiden SV, Wagner RF, Doi K, Nishikawa RM, Freedman M, Lo S-C B, and Xu X-W.
Independent versus sequential reading in ROC studies of computer-assist
modalities: Analysis of components of variance. Acad Radiol 2002; 9: 1036-1043.


Metz CE. Evaluation of CAD Methods. In: Doi K, MacMahon H, Giger ML, and
Hoffmann KR, eds. Computer-Aided Diagnosis in Medical Imaging. Amsterdam:
Elsevier Science B.V. (Excerpta Medica International Congress Series, Vol.
1182), 1999, 543-554.
Chakraborty, DP. Statistical power in observer performance studies: Comparison
of the receiver operating characteristic and free-response methods in tasks
involving localization. Acad Radiol 2002; 9: 147-156.
Dorfman DD, Berbaum KS, Metz CE. Receiver operating characteristic rating
analysis: generalization to the population of readers and patients with the
jackknife method. Invest Radiol 1992; 27: 723-731.
Chakraborty DP and Berbaum KS: Comparing Inter-Modality Diagnostic
Accuracies in Tasks Involving Lesion Localization: A Jackknife AFROC
Approach. Supplement to Radiology, Volume 225 (P), 259, 2002.
Obuchowski NA, Lieber ML, Powell KA. Data analysis for detection and
localization of multiple abnormalities with application to mammography. Acad
Radiol 2000; 7: 516-525.
Rutter CM. Bootstrap estimation of diagnostic accuracy with patient-clustered
data. Acad Radiol 2000; 7 : 413-419.
Efron B, Tibshirani RJ. An introduction to the bootstrap. Chapman and Hall, New
York, 1993.
Beam CA, Layde PM, Sullivan DC. Variability in the interpretation of screening
mammograms by US radiologists. Arch Intern Med 1996; 156: 209-213.
Jiang Y, Nishikawa RM, Schmidt RA, Metz CE, Giger ML, Doi K. Improving breast
cancer diagnosis with computer-aided diagnosis. Acad Radiol 1999; 6: 22-33.
