Clinical

Epidemiology
The Essentials
Fifth Edition

Robert H. Fletcher, MD, MSc


Professor Emeritus
Department of Population Medicine
Harvard Medical School
Boston, Massachusetts
Adjunct Professor
Departments of Epidemiology and Social Medicine
The University of North Carolina at Chapel Hill
Chapel Hill, North Carolina

Suzanne W. Fletcher, MD, MSc


Professor Emerita
Department of Population Medicine
Harvard Medical School
Boston, Massachusetts
Adjunct Professor
Departments of Epidemiology and Social Medicine
The University of North Carolina at Chapel Hill
Chapel Hill, North Carolina

Grant S. Fletcher, MD, MPH


Assistant Professor of Medicine
The University of Washington School of Medicine
Seattle, Washington
Acquisitions Editor: Susan Rhyner
Product Manager: Catherine Noonan
Marketing Manager: Joy Fisher-Williams
Designer: Teresa Mallon
Compositor: Aptara, Inc.

Fifth Edition

Copyright © 2014, 2005, 1996, 1988, 1982 Lippincott Williams & Wilkins, a Wolters Kluwer business.

351 West Camden Street
Baltimore, MD 21201

Two Commerce Square
2001 Market Street
Philadelphia, PA 19103

Printed in China

All rights reserved. This book is protected by copyright. No part of this book may be reproduced or transmitted
in any form or by any means, including as photocopies or scanned-in or other electronic copies, or utilized
by any information storage and retrieval system without written permission from the copyright owner, except
for brief quotations embodied in critical articles and reviews. Materials appearing in this book prepared by
individuals as part of their official duties as U.S. government employees are not covered by the above-mentioned
copyright. To request permission, please contact Lippincott Williams & Wilkins at 2001 Market Street,
Philadelphia, PA 19103, via email at [email protected], or via website at lww.com (products and services).

9 8 7 6 5 4 3 2 1

Library of Congress Cataloging-in-Publication Data

Fletcher, Robert H.
Clinical epidemiology : the essentials / Robert H. Fletcher, Suzanne
W. Fletcher, Grant S. Fletcher. – 5th ed.
p. ; cm.
Includes bibliographical references and index.
ISBN 978-1-4511-4447-5 (alk. paper)
I. Fletcher, Suzanne W. II. Fletcher, Grant S. III. Title.
[DNLM: 1. Epidemiologic Methods. WA 950]

614.4–dc23
2012022346

DISCLAIMER

Care has been taken to confirm the accuracy of the information present and to describe generally accepted
practices. However, the authors, editors, and publisher are not responsible for errors or omissions or for any
consequences from application of the information in this book and make no warranty, expressed or implied,
with respect to the currency, completeness, or accuracy of the contents of the publication. Application of this
information in a particular situation remains the professional responsibility of the practitioner; the clinical
treatments described and recommended may not be considered absolute and universal recommendations.
The authors, editors, and publisher have exerted every effort to ensure that drug selection and dosage set
forth in this text are in accordance with the current recommendations and practice at the time of
publication. However, in view of ongoing research, changes in government regulations, and the constant flow of
information relating to drug therapy and drug reactions, the reader is urged to check the package insert for each
drug for any change in indications and dosage and for added warnings and precautions. This is particularly
important when the recommended agent is a new or infrequently employed drug.
Some drugs and medical devices presented in this publication have Food and Drug Administration (FDA)
clearance for limited use in restricted research settings. It is the responsibility of the health care provider to
ascertain the FDA status of each drug or device planned for use in their clinical practice.

To purchase additional copies of this book, call our customer service department at (800) 638-3030 or fax orders
to (301) 223-2320. International customers should call (301) 223-2300.

Visit Lippincott Williams & Wilkins on the Internet: http://www.lww.com. Lippincott Williams & Wilkins
customer service representatives are available from 8:30 am to 6:00 pm, EST.
Preface

This book is for clinicians (physicians, nurses, physicians’ assistants, psychologists, veterinarians, and others who care for patients) who want to understand for themselves the strength of the information base for their clinical decisions. Students of epidemiology and public health may also find this book useful as a complement to the many excellent textbooks about epidemiology itself.

To reach full potential, modern clinicians should have a basic understanding of clinical epidemiology for many reasons.

First, clinicians make countless patient care decisions every day, some with very high stakes. It is their responsibility to base those decisions on the best available evidence, a difficult task because the evidence base is vast and continually changing. At the simplest level, responsible care can be accomplished by following carefully prepared recommendations found in evidence-based guidelines, review articles, or textbooks, but patient care at its best is far more than that. Otherwise, it would be sufficient to have it done by technicians following protocols. Sometimes the evidence is contradictory, resulting in “toss-up” decisions. The evidence may be weak, but a decision still needs to be made. Recommendations that are right for patients on average need to be tailored to the specific illnesses, behaviors, and preferences of individual patients. Expert consultants may disagree, leaving the clinicians with primary responsibility for the patient in the middle. Special interests, by for-profit companies or clinicians whose income or prestige is related to their advice, may affect how evidence is summarized. For these reasons, clinicians need to be able to weigh evidence themselves in order to meet their responsibilities as health care professionals.

Second, clinical epidemiology is now a central part of efforts to improve the effectiveness of patient care worldwide. Clinicians committed to research careers are pursuing formal postgraduate training in research methods, often in departments of epidemiology. Grants for clinical research are judged largely by the principles described in this book. Clinical epidemiology is the language of journal peer review and the “hanging committees” that decide whether research reports should be published and the revisions necessary to make them suitable. Local “journal clubs” now carefully evaluate pivotal articles rather than survey the contents of several journals. The National Library of Medicine now includes search terms in MEDLINE for research methods, such as randomized controlled trial and meta-analysis. In short, clinical medicine and epidemiology are making common cause. “Healing the schism” is what Kerr White called it.

Third, the care of patients should be fun. It is not fun to simply follow everyone else’s advice without really knowing what stands behind it. It is exhausting to work through a vast medical literature, or even a few weekly journals, without a way of quickly deciding which articles are scientifically strong and clinically relevant and which are not worth bothering with. It is unnerving to make high-stakes decisions without really knowing why they are the right ones. To capture all the enjoyment of their profession, clinicians need to be confident in their ability to think about evidence for themselves, even if someone else has done the heavy lifting to find and sort that evidence, by topic and quality, beforehand. It is fun to be able to confidently participate in discussions of clinical evidence regardless of whether it is within one’s specialty (as all of us are nonspecialists in everything outside our specialty).

In this book, we have illustrated concepts with examples from patient care and clinical research, rather than use hypothetical examples, because medicine is so deeply grounded in practical decisions about actual patients. The important questions and studies have evolved rapidly and we have updated examples to reflect this change, while keeping examples that represent timeless aspects in the care of patients and classic studies.

As clinical epidemiology becomes firmly established within medicine, readers expect more from an entry-level textbook. We have, therefore, added new topics to this edition. Among them are comparative effectiveness, practical clinical trials, noninferiority trials, patient-level meta-analyses, and modern concepts in grading evidence-based recommendations. We have also discussed risk, confounding, and effect modification in greater depth.

Modern research design and analyses, supported by powerful computers, make it possible to answer clinical questions with a level of validity and generalizability not dreamed of just a few years ago. However, this often comes at the cost of complexity, placing readers at a distance from the actual data and their meaning. Many of us may be confused as highly specialized research scientists debate alternative meanings of specific terms or tout new approaches to study design and statistical analyses, some of which seem uncomfortably like black boxes no matter how hard we try to get inside them. In such situations, it is especially valuable to remain grounded in the basics of clinical research. We have tried to do just that with the understanding that readers may well want to go on to learn more about this field than is possible from an introductory textbook alone.

Clinical epidemiology is now considered a central part of a broader movement, evidence-based medicine. This is in recognition of the importance, in addition to judging the validity and generalizability of clinical research results, of asking questions that can be answered by research, finding the available evidence, and using the best of that evidence in the care of patients. We have always considered these additional competencies important, and we give them even more attention in this edition of the book.

We hope that readers will experience as much enjoyment and understanding in the course of reading this book as we have in writing it.

Robert H. Fletcher
Suzanne W. Fletcher
Grant S. Fletcher
Acknowledgment

We are fortunate to have learned clinical epidemiology from its founders. Kerr White was Bob and Suzanne’s mentor during postgraduate studies at Johns Hopkins and convinced us that what matter are “the benefits of medical interventions in relation to their hazards and costs.” Alvan Feinstein taught a generation of young clinician–scholars about the “architecture of clinical research” and the dignity of clinical scholarship. Archie Cochrane spent a night at our home in Montreal when Grant was a boy and opened our eyes to “effectiveness and efficiency.” David Sackett asserted that clinical epidemiology is a “basic science for clinical medicine” and helped the world to understand. Many others have followed. We are especially grateful for our work in common with Brian Haynes, founding editor of ACP Journal Club; Ian Chalmers, who made the Cochrane Collaboration happen; Andy Oxman, leader of the Rocky Mountain Evidence-Based Healthcare Workshop; Peter Tugwell, a founding leader of the International Clinical Epidemiology Network (INCLEN); and Russ Harris, our long-time colleague at the interface between clinical medicine and public health at the University of North Carolina. These extraordinary people and their colleagues have created an exciting intellectual environment that led to a revolution in clinical scholarship, bringing the evidence base for clinical medicine to a new level.

Like all teachers, we have also learned from our students, clinicians of all ages and all specialties who wanted to learn how to judge the validity of clinical observations and research for themselves. Bob and Suzanne are grateful to medical students at McGill University (who first suggested the need for this book), the University of North Carolina and Harvard Medical School; fellows in the Robert Wood Johnson Clinical Scholars Program, the International Clinical Epidemiology Network (INCLEN), and the Harvard General Medicine Fellowship; CRN Scholars in the Cancer Research Network, a consortium of research institutes in integrated health systems; and participants in the Rocky Mountain Evidence-Based Healthcare Workshops. They were our students and now are our colleagues; many teach and do research with us. Over the years, Grant has met many of these people and now learns from medical students, residents, and faculty colleagues as he teaches them about the care for patients at Harborview Hospital and the University of Washington.

While editors of the Journal of General Internal Medicine and Annals of Internal Medicine, Bob and Suzanne learned from fellow editors, including members of the World Association of Medical Editors (WAME), how to make reports of research more complete, clear, and balanced so that readers can understand the message with the least effort. With our colleagues at UpToDate, the electronic information source for clinicians and patients, we have been developing new ways to make the best available evidence on real-world, clinical questions readily accessible during the care of patients and to make that evidence understandable not just to academicians and investigators, but to full-time clinicians as well.

Ed Wagner was with us at the beginning of this project. With him, we developed a new course in clinical epidemiology for the University of North Carolina School of Medicine and wrote the first edition of this book for it. Later, that course was Grant’s introduction to this field, to the extent he had not already been introduced to it at home. Ed remained a coauthor through three editions and then moved on to leadership of Group Health Research Institute and other responsibilities based in Seattle. Fortunately, Grant is now on the writing team and contributed his expertise with the application of clinical epidemiology to the current practice of medicine, especially the care of very sick patients.

We are grateful to members of the team, led by Lippincott Williams & Wilkins, who translated word processed text and hand-drawn figures into an attractive, modern textbook. We got expert, personal attention from Catherine Noonan, who guided us in the preparation of this book throughout; Jonathan Dimes, who worked closely with us in preparing illustrations; and Jeri Litteral, who collaborated with us in the copy editing phase of this project.

We are especially grateful to readers all over the world for their encouraging comments and practical suggestions. They have sustained us through the rigors of preparing this, the fifth edition of a textbook first published 30 years ago.
Contents in Brief

1. Introduction
2. Frequency
3. Abnormality
4. Risk: Basic Principles
5. Risk: Exposure to Disease
6. Risk: From Disease to Exposure
7. Prognosis
8. Diagnosis
9. Treatment
10. Prevention
11. Chance
12. Cause
13. Summarizing the Evidence
14. Knowledge Management

Appendix A – Answers to Review Questions
Appendix B – Additional Readings
Index
Contents

CHAPTER 1: INTRODUCTION
Clinical Questions and Clinical Epidemiology
Health Outcomes
The Scientific Basis for Clinical Medicine
Basic Principles
Variables
Numbers and Probability
Populations and Samples
Bias (Systematic Error)
Selection Bias
Measurement Bias
Confounding
Chance
The Effects of Bias and Chance Are Cumulative
Internal and External Validity
Information and Decisions
Organization of this Book

CHAPTER 2: FREQUENCY
Are Words Suitable Substitutes for Numbers?
Prevalence and Incidence
Prevalence
Incidence
Prevalence and Incidence in Relation to Time
Relationships Among Prevalence, Incidence, and Duration of Disease
Some other Rates
Studies of Prevalence and Incidence
Prevalence Studies
Incidence Studies
Cumulative Incidence
Incidence Density (Person-Years)
Basic Elements of Frequency Studies
What Is a Case? Defining the Numerator
What Is the Population? Defining the Denominator
Does the Study Sample Represent the Population?
Distribution of Disease by Time, Place, and Person
Time
Place
Person
Uses of Prevalence Studies
What Are Prevalence Studies Good For?
What Are Prevalence Studies Not Particularly Good For?

CHAPTER 3: ABNORMALITY
Types of Data
Nominal Data
Ordinal Data
Interval Data
Performance of Measurements
Validity
Content Validity
Criterion Validity
Construct Validity
Reliability
Range
Responsiveness
Interpretability
Variation
Variation Resulting from Measurement
Variation Resulting from Biologic Differences
Total Variation
Effects of Variation
Distributions
Describing Distributions
Actual Distributions
The Normal Distribution
Criteria for Abnormality
Abnormal = Unusual
Abnormal = Associated with Disease
Abnormal = Treating the Condition Leads to a Better Clinical Outcome
Regression to the Mean

CHAPTER 4: RISK: BASIC PRINCIPLES
Risk Factors
Recognizing Risk
Long Latency
Immediate Versus Distant Causes
Common Exposure to Risk Factors
Low Incidence of Disease
Small Risk
Multiple Causes and Multiple Effects
Risk Factors May or May Not Be Causal
Predicting Risk
Combining Multiple Risk Factors to Predict Risk
Risk Prediction in Individual Patients and Groups
Evaluating Risk Prediction Tools
Calibration
Discrimination
Sensitivity and Specificity of a Risk Prediction Tool
Risk Stratification
Why Risk Prediction Tools Do Not Discriminate Well Among Individuals
Clinical Uses of Risk Factors and Risk Prediction Tools
Risk Factors and Pretest Probability for Diagnostic Testing
Using Risk Factors to Choose Treatment
Risk Stratification for Screening Programs
Removing Risk Factors to Prevent Disease

CHAPTER 5: RISK: EXPOSURE TO DISEASE
Studies of Risk
When Experiments Are Not Possible or Ethical
Cohorts
Cohort Studies
Prospective and Historical Cohort Studies
Prospective Cohort Studies
Historical Cohort Studies Using Medical Databases
Case-Cohort Studies
Advantages and Disadvantages of Cohort Studies
Ways to Express and Compare Risk
Absolute Risk
Attributable Risk
Relative Risk
Interpreting Attributable and Relative Risk
Population Risk
Taking other Variables into Account
Extraneous Variables
Simple Descriptions of Risk
Confounding
Working Definition
Potential Confounders
Confirming Confounding
Control of Confounding
Randomization
Restriction
Matching
Stratification
Standardization
Multivariable Adjustment
Overall Strategy for Control of Confounding
Observational Studies and Cause
Effect Modification

CHAPTER 6: RISK: FROM DISEASE TO EXPOSURE
Case-Control Studies
Design of Case-Control Studies
Selecting Cases
Selecting Controls
The Population Approach
The Cohort Approach
Hospital and Community Controls
Multiple Control Groups
Multiple Controls per Case
Matching
Measuring Exposure
Multiple Exposures
The Odds Ratio: An Estimate of Relative Risk
Controlling for Extraneous Variables
Investigation of a Disease Outbreak

CHAPTER 7: PROGNOSIS
Differences in Risk and Prognostic Factors
The Patients Are Different
The Outcomes Are Different
The Rates Are Different
The Factors May be Different
Clinical Course and Natural History of Disease
Elements of Prognostic Studies
Patient Sample
Zero Time
Follow-Up
Outcomes of Disease
Describing Prognosis
A Trade-Off: Simplicity versus More Information
Survival Analysis
Survival of a Cohort
Survival Curves
Interpreting Survival Curves
Identifying Prognostic Factors
Case Series
Clinical Prediction Rules
Bias in Cohort Studies
Sampling Bias
Migration Bias
Measurement Bias
Bias from “Non-differential” Misclassification
Bias, Perhaps, but does it Matter?
Sensitivity Analysis

CHAPTER 8: DIAGNOSIS
Simplifying Data
The Accuracy of a Test Result
The Gold Standard
Lack of Information on Negative Tests
Lack of Information on Test Results in the Nondiseased
Lack of Objective Standards for Disease
Consequences of Imperfect Gold Standards
Sensitivity and Specificity
Definitions
Use of Sensitive Tests
Use of Specific Tests
Trade-Offs between Sensitivity and Specificity
The Receiver Operator Characteristic (ROC) Curve
Establishing Sensitivity and Specificity
Spectrum of Patients
Bias
Chance
Predictive Value
Definitions
Determinants of Predictive Value
Estimating Prevalence (Pretest Probability)
Increasing the Pretest Probability of Disease
Specifics of the Clinical Situation
Selected Demographic Groups
Referral Process
Implications for Interpreting the Medical Literature
Likelihood Ratios
Odds
Definitions
Use of Likelihood Ratios
Why Use Likelihood Ratios?
Calculating Likelihood Ratios
Multiple Tests
Parallel Testing
Clinical Prediction Rules
Serial Testing
Serial Likelihood Ratios
Assumption of Independence

CHAPTER 9: TREATMENT
Ideas and Evidence
Ideas
Testing Ideas
Studies of Treatment Effects
Observational and Experimental Studies of Treatment Effects
Randomized Controlled Trials
Ethics
Sampling
Intervention
Comparison Groups
Allocating Treatment
Differences Arising after Randomization
Patients May Not Have the Disease Being Studied
Compliance
Cross-over
Cointerventions
Blinding
Assessment of Outcomes
Efficacy and Effectiveness
Intention-to-Treat and Explanatory Trials
Superiority, Equivalence, and Non-Inferiority
Variations on Basic Randomized Trials
Tailoring the Results of Trials to Individual Patients
Subgroups
Effectiveness in Individual Patients
Trials of N = 1
Alternatives to Randomized Controlled Trials
Limitations of Randomized Trials
Observational Studies of Interventions
Clinical Databases
Randomized versus Observational Studies?
Phases of Clinical Trials

CHAPTER 10: PREVENTION
Preventive Activities in Clinical Settings
Types of Clinical Prevention
Immunization
Screening
Behavioral Counseling (Lifestyle Changes)
Chemoprevention
Levels of Prevention
Primary Prevention
Secondary Prevention
Tertiary Prevention
Confusion about Primary, Secondary, and Tertiary Prevention
Scientific Approach to Clinical Prevention
Burden of Suffering
Effectiveness of Treatment
Treatment in Primary Prevention
Randomized Trials
Observational Studies
Safety
Counseling
Treatment in Secondary Prevention
Treatment in Tertiary Prevention
Methodologic Issues in Evaluating Screening Programs
Prevalence and Incidence Screens
Special Biases
Lead-Time Bias
Length-Time Bias
Compliance Bias
Performance of Screening Tests
High Sensitivity and Specificity
Detection and Incidence Methods for Calculating Sensitivity
Low Positive Predictive Value
Simplicity and Low Cost
Safety
Acceptable to Patients and Clinicians
Unintended Consequences of Screening
Risk of False-Positive Result
Risk of Negative Labeling Effect
Risk of Overdiagnosis (Pseudodisease) in Cancer Screening
Incidentalomas
Changes in Screening Tests and Treatments over Time
Weighing Benefits Against Harms of Prevention

CHAPTER 11: CHANCE
Two Approaches to Chance
Hypothesis Testing
False-Positive and False-Negative Statistical Results
Concluding That a Treatment Works
Dichotomous and Exact P Values
Statistical Significance and Clinical Importance
Statistical Tests
Concluding That a Treatment Does Not Work
How Many Study Patients are Enough?
Statistical Power
Estimating Sample Size Requirements
Effect Size
Type I Error
Type II Error
Characteristics of the Data
Interrelationships
Point Estimates and Confidence Intervals
Statistical Power after a Study Is Completed
Detecting Rare Events
Multiple Comparisons
Subgroup Analysis
Multiple Outcomes
Multivariable Methods
Bayesian Reasoning

CHAPTER 12: CAUSE
Basic Principles
Single Causes
Multiple Causes
Proximity of Cause to Effect
Indirect Evidence for Cause
Examining Individual Studies
Hierarchy of Research Designs
The Body of Evidence for and Against Cause
Does Cause Precede Effect?
Strength of the Association
Dose–Response Relationships
Reversible Associations
Consistency
Biologic Plausibility
Specificity
Analogy
Aggregate Risk Studies
Modeling
Weighing the Evidence

CHAPTER 13: SUMMARIZING THE EVIDENCE
Traditional Reviews
Systematic Reviews
Defining a Specific Question
Finding All Relevant Studies
Limit Reviews to Scientifically Strong, Clinically Relevant Studies
Are Published Studies a Biased Sample of All Completed Research?
How Good Are the Best Studies?
Is Scientific Quality Related to Research Results?
Summarizing Results
Combining Studies in Meta-Analyses
Are the Studies Similar Enough to Justify Combining?
What Is Combined: Studies or Patients?
How Are the Results Pooled?
Identifying Reasons for Heterogeneity
Cumulative Meta-Analyses
Systematic Reviews of Observational and Diagnostic Studies
Strengths and Weaknesses of Meta-Analyses

CHAPTER 14: KNOWLEDGE MANAGEMENT
Basic Principles
Do It Yourself or Delegate?
Which Medium?
Grading Information
Misleading Reports of Research Findings
Looking up Answers to Clinical Questions
Solutions
Clinical Colleagues
Electronic Textbooks
Clinical Practice Guidelines
The Cochrane Library
Citation Databases (PubMed and Others)
Other Sources on the Internet
Surveillance on New Developments
Journals
“Reading” Journals
Guiding Patients’ Quest for Health Information
Putting Knowledge Management into Practice

APPENDIX A: ANSWERS TO REVIEW QUESTIONS
APPENDIX B: ADDITIONAL READINGS
INDEX
Chapter 1

Introduction
We should study “the benefits of medical interventions in relation to their hazards
and costs.”
—Kerr L. White
1992

KEY WORDS
Clinical epidemiology
Clinical sciences
Population sciences
Epidemiology
Evidence-based medicine
Health services research
Quantitative decision making
Cost-effectiveness analyses
Decision analyses
Social sciences
Biologic sciences
Variables
Independent variable
Dependent variable
Extraneous variables
Covariates
Populations
Sample
Inference
Bias
Selection bias
Measurement bias
Confounding
Chance
Random variation
Internal validity
External validity
Generalizability
Shared decision making

Example

A 51-year-old man asks to see you because of chest pain that he thinks is “indigestion.” He was well until 2 weeks ago [...] and sometimes at rest. He gave up smoking one pack of cigarettes per day 3 years ago and has been told that his blood pressure is “a little high.” He is otherwise well and takes no medications, but he is worried about his health, particularly about heart disease. He lost his job 6 months ago and has no health insurance. A complete physical examination and resting electrocardiogram are normal except for a blood pressure of 150/96 mm Hg.

This patient is likely to have many questions. Am I sick? How sure are you? If I am sick, what is causing my illness? How will it affect me? What can be done about it? How much will it cost?

As the clinician caring for this patient, you have the same kinds of questions, although yours reflect greater understanding of the possibilities. Is the probability of serious, treatable disease high enough to proceed immediately beyond simple explanation and reassurance to diagnostic tests? How well do various tests distinguish among the possible causes of chest pain: angina pectoris, esophageal spasm, muscle strain, anxiety, and the like? For example, how accurate will an exercise stress test be in either confirming or ruling out coronary artery disease? If coronary artery disease is found, how long can the patient expect to have the pain? How likely is it that other complications (congestive heart failure, myocardial infarction, or atherosclerotic disease of other organs) will occur? Will the condition shorten his life? Will reduction of his risk factors for coronary artery disease (from cigarette smoking and hypertension) reduce his risk? Should other possible risk factors be sought? If medications control the pain, would a coronary revascularization procedure add benefit by preventing a future heart attack or cardiovascular death? Since the patient is unemployed and without health insurance, can less expensive diagnostic workups and treatments achieve the same result as more expensive ones?

Clinical Questions and Clinical Epidemiology

The questions confronting the patient and doctor in the example are the types of clinical questions at issue in most doctor–patient encounters: What is “abnormal”? How accurate are the diagnostic tests we use? How often does the condition occur? What are the risks for a given disease, and how do we determine the risks? Does the medical condition usually get worse, stay the same, or resolve (prognosis)? Does treatment really improve the patient or just the test results? Is there a way to prevent the disease? What is the underlying cause of the disease or condition? And how can we give good medical care most efficiently? These clinical questions and the epidemiologic methods to answer them are the bedrock of this book. The clinical questions are summarized in Table 1.1. Each is also the topic of specific chapters in the book.

Table 1.1
Clinical Issues and Questions (a)

Issue                  Question
Frequency (Ch. 2)      How often does a disease occur?
Abnormality (Ch. 3)    Is the patient sick or well?
Risk (Chs. 5 and 6)    What factors are associated with an increased risk of disease?
Prognosis (Ch. 7)      What are the consequences of having a disease?
Diagnosis (Ch. 8)      How accurate are tests used to diagnose disease?
Treatment (Ch. 9)      How does treatment change the course of disease?
Prevention (Ch. 10)    Does an intervention on well people keep disease from arising? Does early detection and treatment improve the course of disease?
Cause (Ch. 12)         What conditions lead to disease? What are the origins of the disease?

(a) Four chapters, Risk: Basic Principles (4), Chance (11), Systematic Reviews (13), and Knowledge Management (14), pertain to all of these issues.

Clinicians need the best possible answers to these kinds of questions. They use various sources of information: their own experiences, the advice of their colleagues, and reasoning from their knowledge of the biology of disease. In many situations, the most credible source is clinical research, which involves the use of past observations on other similar patients to predict what will happen to the patient at hand. The manner in which such observations are made and interpreted determines whether the conclusions reached are valid, and thus how helpful the conclusions will be to patients.

Health Outcomes

The most important events in clinical medicine are the health outcomes of patients, such as symptoms (discomfort and/or dissatisfaction), disability, disease, and death. These patient-centered outcomes are sometimes referred to as “the 5 Ds” (Table 1.2). They are the health events patients care about. Doctors should try to understand, predict, interpret, and change these outcomes when caring for the

[...] cultures, cell membranes, and genetic sequences) or in animals. Clinical epidemiology is the science used to study the 5 Ds in intact humans.

In modern clinical medicine, with so much ordering and treating of lab test results (for such things as plasma glucose levels, hematuria, troponins, etc.), it is difficult to remember that laboratory test results are not the important events in clinical medicine. It

Table 1.2
Outcomes of Disease (the 5 Ds) (a)

Death             A bad outcome if untimely
Disease (b)       A set of symptoms, physical signs, and laboratory abnormalities
Discomfort        Symptoms such as pain, nausea, dyspnea, itching, and tinnitus
Disability        Impaired ability to go about usual activities at home, work, or recreation
Dissatisfaction   Emotional reaction to disease and its care, such as sadness or anger

(a) Perhaps a sixth D, destitution, belongs on this list because the financial cost of illness (for individual patients or society) is
patients. The 5 Ds can be studied directly only in an important consequence of disease.
b
Or illness, the patient’s experience of disease.
intact humans and not in parts of humans (e.g.,
humeral transmitters, tissue
becomes easy to assume that if we can change abnormal lab tests toward normal, we have helped the patient. This is true only to the extent that careful study has demonstrated a link between laboratory test results and one of the 5 Ds.

Example
The incidence of type 2 diabetes mellitus is increasing dramatically in the United States. Diabetics’ risk of dying from

During their training, clinicians are steeped in the biology of disease, the sequence of steps that leads from subcellular events to disease and its consequences. Thus, it seemed reasonable to assume that an intervention that lowered blood sugar in diabetics would help protect against heart disease. However, although very important to clinical medicine, these biologic mechanisms cannot be substituted for patient outcomes unless there is strong evidence confirming that the two are related. (In fact, the results of studies with several different medications are raising the possibility that, in type 2 diabetes, aggressively lowering levels of blood sugar does not protect against heart disease.) Establishing improved health outcomes in patients is particularly important with new drugs because usually pharmacologic interventions have several clinical effects rather than just one.

THE SCIENTIFIC BASIS FOR CLINICAL MEDICINE
Clinical epidemiology is one of the basic sciences that clinicians rely on in the care of patients. Other health sciences, summarized in Figure 1.1, are also integral to patient care. Many of the sciences overlap with each other.
Clinical epidemiology is the science of making predictions about individual patients by counting clinical events (the 5 Ds) in groups of similar patients and using strong scientific methods to ensure that the predictions are accurate. The purpose of clinical epidemiology is to develop and apply methods of clinical observation that will lead to valid conclusions by avoiding being misled by systematic error and the play of chance. It is an important approach to obtaining the kind of information clinicians need to make good decisions in the care of patients.

Figure 1.1 ■ The health sciences and their complementary relationships.
Research field          Primary focus
Biologic sciences       Animal models; cells and transmitters; molecules; genes; drug development
Clinical sciences       Individual patients
Clinical epidemiology   Individual patient questions
Epidemiology            Populations
Health services         Health care systems
The term “clinical epidemiology” is derived from
its two parent disciplines: clinical medicine and
epidemiology. It is “clinical” because it seeks to
answer clinical questions and to guide clinical
decision making with the best available evidence. It
is “epidemiology” because many of the methods
used to answer questions about how to best care
for patients have been developed by epidemiologists
and because the care of individual patients is seen in
the context of the larger population of which the
patient is a member.
Clinical sciences provide the questions and
approach that can be used to care for individual
patients. Some biologic sciences, such as anatomy
and physiology, are “clinical” to the extent that
they provide sound information to guide clinical
decisions. For example, knowing the anatomy of
the body helps determine possibilities for diagnosis
and treatment of many symptoms.
The population sciences study large groups of
people. Epidemiology is the “study of disease occur-
rence in human populations” (4) by counting health-
related events in people in relation to the naturally
occurring groups (populations) of which they are
members. The results of many such studies are
directly applicable to the care of individual
patients. For example, epidemiology studies are
used as the basis for advice about avoiding behaviors
such as smoking and inactivity that place patients
at increased risk. Other epidemiologic studies, such
as those showing harmful effects of passive
smoking and other environmental and
occupational hazards, are the basis for public health
recommendations. Clinical epidemiology is a
subset of the population sciences useful in the care
of patients.
Clinicians have long depended on research evi-
dence to some extent, but understanding clinical
evidence is more important in modern times than
it was in the past for several reasons. An extraor-
dinary amount of information must be sorted
through. Diagnostic and therapeutic interventions
have the potential for great effectiveness, as well as
risk and cost, so the stakes in choosing among them
are high. Clinical research at its best has become
stronger and, thus, can be a sounder basis for clini-
cal decisions. Nevertheless, the credibility of clini-
cal research continues to vary from study to study,
so clinicians need to have a method for sorting out
strong from weak evidence.
Evidence-based medicine is a modern term for the application of clinical epidemiology to the care of patients. It includes formulating specific “answerable” clinical questions, finding the best available research evidence bearing on those questions, judging the evidence for its validity, and integrating the critical appraisal with the clinician’s expertise and the patient’s situation and values (5). This book deals with several aspects of evidence-based medicine, especially critically appraising the evidence about clinical questions.
In real-life clinical settings, other kinds of “evidence” compete for clinicians’ attention and can influence medical decisions. Table 1.3 describes some of them in a parody of evidence-based medicine that was published some years ago, but is still true today. Probably all clinicians have experienced at least one of these factors during their training years! Another factor, not so humorous but very relevant, has been described as level IV evidence (6). Clinicians tend to remember cases when things go terribly wrong in the care they give an individual patient and are more likely to change practice after such an experience than after reading a well-done study. Less valid alternatives to evidence-based medicine can be very compelling at the emotional level and may provide a convenient way of coping with uncertainty, but they are a weak substitute for good research evidence.

Table 1.3
Factors Other Than Evidence-Based Medicine That May Influence Clinical Decisions

Eminence-based medicine                  Senior colleagues who believe experience trumps evidence
Vehemence-based medicine                 Substitution of volume and stridency for evidence
Eloquence (or elegance)-based medicine   Sartorial elegance and verbal eloquence
Providence-based medicine                The decision is best left in the hands of the Almighty
Diffidence-based medicine                Too timid to make any medical decision
Nervousness-based medicine               Fear of litigation is a powerful stimulus to overinvestigation and overtreatment
Confidence-based medicine                Bravado

Adapted from Isaacs D, Fitzgerald D. Seven alternatives to evidence-based medicine. BMJ 1999;319:1618.

Health services research is the study of how non-biologic factors (e.g., clinical workforce and facilities, how care is organized and paid for, and clinicians’ beliefs and patients’ cooperation) affect
patients’ health. Such studies have shown, for example, that medical care differs substantially from one small geographic area to another (without corresponding differences in patients’ health); that surgery in hospitals that often perform a specific procedure tends to have better outcomes than surgery in hospitals in which the procedure is done infrequently; and that aspirin is underutilized in the treatment of acute myocardial infarction, even though this simple practice has been shown to reduce the number of subsequent vascular events by about 25%. These kinds of studies guide clinicians in their efforts to apply existing knowledge about the best clinical practices.
Other health services sciences also guide patient care. Quantitative decision making includes cost-effectiveness analyses, which describe the financial costs required to achieve a good outcome such as prevention of death or disease, and decision analyses, which set out the rational basis for clinical decisions and the consequences of choices. The social sciences describe how the social environment affects health-related behaviors and the use of health services.
Biologic sciences, studies of the sequence of biologic events that lead from health to disease, are a powerful way of knowing how clinical phenomena may play out at the human level. Historically, it was primarily the progress in the biologic sciences that established the scientific approach to clinical medicine, and they continue to play a pivotal role. Anatomy explains nerve entrapment syndromes and their cause, symptoms, and relief. Physiology and biochemistry guide the management of diabetic ketoacidosis. Molecular genetics predicts the occurrence of diseases ranging from common cardiovascular diseases and cancer to rare inborn errors of metabolism, such as phenylketonuria and cystic fibrosis.
However, understanding the biology of disease, by itself, is often not a sound basis for prediction in intact humans. Too many other factors contribute to health and disease. For one thing, mechanisms of disease may be incompletely understood. For example, the notion that blood sugar in diabetic patients is more affected by ingestion of simple sugars (sucrose or table sugar) than by complex sugars such as starch (as in potatoes or pasta) has been dispelled by rigorous studies comparing the effect of these foods on blood glucose. Also, it is becoming clear that the effects of genetic abnormalities may be modified by complex physical and social environments such as diet and exposure to infectious and chemical agents. For example, glucose-6-phosphate dehydrogenase (G6PD) is an enzyme that protects red blood cells against oxidant injury leading to hemolysis. G6PD deficiency is the most common enzyme deficiency in humans, occurring with certain mutations of the X-linked G6PD gene. However, males with commonly occurring genetic variants of G6PD deficiency are usually asymptomatic, developing hemolysis and jaundice only when they are exposed to environmental oxidant stresses such as certain drugs or infections. Finally, as shown in the example of rosiglitazone treatment for patients with type 2 diabetes, drugs often have multiple effects on patient health beyond the one predicted by studying disease biology. Therefore, knowledge of the biology of disease produces hypotheses, often very good ones, about what might happen in patients. But these hypotheses need to be tested by strong studies of intact human beings before they are accepted as clinical facts.
In summary, clinical epidemiology is one of
many sciences basic to clinical medicine. At best,
the various health-related sciences complement one
another. Discoveries in one are confirmed in
another; discoveries in the other lead to new
hypotheses in the first.

Example
In the 1980s, clinicians in San Francisco noticed unusual infect

BASIC PRINCIPLES
The purpose of clinical epidemiology is to foster methods of clinical observation and interpretation that lead to valid conclusions and better patient care. The most credible answers to clinical questions are based on a few basic principles. Two of these, that observations should address questions facing patients and clinicians and that results should include patient-centered health outcomes (the 5 Ds), have already been covered. Other basic principles are discussed below.

Variables
Researchers call the attributes of patients and clinical events variables: things that vary and can be measured. In a typical study, there are three main kinds of variables. One is a purported cause or predictor variable, sometimes called the independent variable. Another is the possible effect or outcome variable, sometimes called the dependent variable. Still other variables may be part of the system under study and may affect the relationship between the independent and dependent variables. These are called extraneous variables (or covariates) because they are extraneous to the main question, though perhaps very much a part of the phenomenon under study.

Numbers and Probability
Clinical science, like all other sciences, depends on quantitative measurements. Impressions, instincts, and beliefs are important in medicine too, but only when added to a solid foundation of numerical information. This foundation allows better confirmation, more precise communication among clinicians and between clinicians and patients, and estimation of error. Clinical outcomes, such as occurrence of disease, death, symptoms, or disability, can be counted and expressed as numbers.
In most clinical situations, the diagnosis, prognosis, and results of treatment are uncertain for an individual patient. An individual will either experience a clinical outcome or will not, and predictions can seldom be so exact. Therefore, a prediction must be expressed as a probability. The probability for an individual patient is best estimated by referring to past experience with groups of similar patients: for example, that cigarette smoking more than doubles the risk of dying among middle-aged adults, that blood tests for troponins detect about 99% of myocardial infarctions in patients with acute chest pain, and that 2% to 6% of patients undergoing elective surgery for abdominal aortic aneurysm will die within 30 days of the procedure, as opposed to 40% to 80% when emergency repair is necessary.

Populations and Samples
Populations are all people in a defined setting (such as North Carolina) or with certain defined characteristics (such as being age 65 years or having a thyroid nodule). Unselected people in the community are the usual population for epidemiologic studies of cause. On the other hand, clinical populations include all patients with a clinical characteristic such as all those with community-acquired pneumonia or aortic stenosis. Thus, one speaks of the general population, a hospitalized population, or a population of patients with a specific disease.
Clinical research is ordinarily carried out on a sample or subset of people in a defined population. One is interested in the characteristics of the defined population but must, for practical reasons, estimate them by describing the characteristics of people in a sample (Fig. 1.2). One then makes an inference, a reasoned judgment based on data, that the characteristics of the sample resemble those of the parent population.
The extent to which a sample represents its population, and thus is a fair substitute for it, depends on how the sample was selected. Methods in which every member of the population has an equal (or known) chance of being selected can produce samples that are extraordinarily similar to the parent population, at least in the long run and for large samples. An everyday example is opinion polls using household sampling based on census data. In our own clinical research, we often use a computer to select a representative sample from all patients in our large, multispecialty group practice, each of whom has the same chance of being selected. On the other hand, samples taken haphazardly or for convenience (i.e., by selecting patients who are easy to work with or happen to be visiting the clinic when data are being collected) may misrepresent their parent population and be misleading.

Figure 1.2 ■ Population and sample. (Diagram: a sample is drawn from the population by sampling; inference runs from the sample back to the population.)
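The contrast between probability samples and convenience samples described above can be sketched with a small simulation. This is only an illustration under stated assumptions: the population of ages, the sample sizes, and the rule that a "convenience" sample consists of the oldest people are all hypothetical and are not taken from the text.

```python
import random
import statistics


def make_population(size, seed=0):
    """Build a hypothetical population of ages in years (illustrative only)."""
    rng = random.Random(seed)
    return [rng.randint(18, 90) for _ in range(size)]


def random_sample(population, n, seed=1):
    """Probability sample: every member has the same chance of selection."""
    rng = random.Random(seed)
    return rng.sample(population, n)


def convenience_sample(population, n):
    """Haphazard sample: here simply the n oldest people, an assumed stand-in
    for patients who happen to be visiting the clinic most often."""
    return sorted(population, reverse=True)[:n]


population = make_population(10_000)
true_mean = statistics.mean(population)
random_mean = statistics.mean(random_sample(population, 500))
convenience_mean = statistics.mean(convenience_sample(population, 500))

print(f"population mean age:         {true_mean:.1f}")
print(f"random sample mean age:      {random_mean:.1f}")
print(f"convenience sample mean age: {convenience_mean:.1f}")
```

Under these assumptions the random sample's mean should land close to the population mean, while the convenience sample's mean is far higher, which is the sense in which a haphazard sample can misrepresent its parent population.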

Bias (Systematic Error)
Bias is “a process at any stage of inference tending to produce results that depart systematically from the true values” (7). It is “an error in the conception and design of a study, or in the collection, analysis, interpretation, publication, or review of data, leading to results or conclusions that are systematically (as opposed to randomly) different from the truth” (8).

Table 1.4
Bias in Clinical Observation

Selection bias     Occurs when comparisons are made between groups of patients that differ in determinants of outcome other than the one under study
Measurement bias   Occurs when the methods of measurement are dissimilar among groups of patients
Confounding        Occurs when two factors are associated (travel together) and the effect of one is confused with or distorted by the effect of the other

Example
Patients with inguinal hernia who get laparoscopic repair seem to have less postoperative pain and more rapid return

Observations on patients (whether for patient care or research) are particularly susceptible to bias. The process tends to be just plain untidy. As participants in a study, human beings have the disconcerting habit of doing as they please and not necessarily what would be required for producing scientifically rigorous answers. When researchers attempt to conduct an experiment with them, as one might in a laboratory, things tend to go wrong. Some people refuse to participate, whereas others drop out or choose another treatment. In addition, clinicians are inclined to believe that their therapies are successful. (Most patients would not want a physician who felt otherwise.) This attitude, which is so important in the practice of medicine, makes clinical observations particularly vulnerable to bias.
Although dozens of biases have been defined
(11), most fall into one of three broad categories
(Table 1.4).

Selection Bias
Selection bias occurs when comparisons are made between groups of patients that differ in ways other than the main factors under study, ones that affect the outcome of the study. Groups of patients often differ in many ways: age, sex, severity of disease, the presence of other diseases, the care they receive, and so on. If one compares the experience of two groups that differ on a specific characteristic of interest (e.g., a treatment or a suspected cause of disease) but are dissimilar in these other ways and the differences are themselves related to outcome, the comparison is biased and little can be concluded about the independent effects of the characteristic of interest. In the herniorrhaphy example, selection bias would have occurred if patients receiving the laparoscopic procedure were healthier than those who had open surgery.

Measurement Bias
Measurement bias occurs when the method of measurement leads to systematically incorrect results.

Example
Blood pressure levels are powerful predictors of cardiovascular disease. However, multiple studies have shown that taking

Figure 1.3 ■ White coat hypertension. Increase in systolic pressure, determined by continuous intraarterial monitoring, as the blood pressure is taken with a sphygmomanometer by an unfamiliar doctor or nurse. Axes: increase in systolic BP (mm Hg, 0 to 30) against duration of visit (minutes, 0 to 10), with separate curves for doctor and nurse. (Redrawn with permission from Mancia G, Parati G, Pomidossi G, et al. Alerting reaction and rise in blood pressure during measurement by physician and nurse. Hypertension 1987;9:209–215.)

Example
Supplements of antioxidants, such as vitamins A, C, and E, are po

Confounding
Confounding can occur when one is trying to find out whether a factor, such as a behavior or drug exposure, is a cause of disease in and of itself. If the factor of interest is associated or “travels together” with another factor, which is itself related to the outcome, the effect of the factor under study can be confused with or distorted by the effect of the other.
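A minimal simulation can make the idea of two factors "traveling together" concrete. In this sketch the exposure has no true effect on disease; physical activity (the confounder) influences both who takes the exposure and who gets the disease. Every probability below is invented for illustration and none comes from the text.

```python
import random


def simulate(n=200_000, seed=42):
    """Simulate people for whom the exposure has NO true effect on disease.

    All probabilities are hypothetical values chosen for illustration.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        active = rng.random() < 0.5                          # confounder
        takes = rng.random() < (0.7 if active else 0.2)      # exposure depends on confounder
        disease = rng.random() < (0.05 if active else 0.15)  # outcome depends only on confounder
        rows.append((active, takes, disease))
    return rows


def risk(rows):
    """Proportion of people in rows who developed the disease."""
    return sum(d for _, _, d in rows) / len(rows)


rows = simulate()

# Crude comparison: exposed versus unexposed, ignoring the confounder.
crude_rr = risk([r for r in rows if r[1]]) / risk([r for r in rows if not r[1]])

# Stratified comparison: the same ratio within each level of the confounder.
stratum_rr = {}
for active in (True, False):
    stratum = [r for r in rows if r[0] == active]
    stratum_rr[active] = (risk([r for r in stratum if r[1]])
                          / risk([r for r in stratum if not r[1]]))

print(f"crude relative risk:             {crude_rr:.2f}")
print(f"relative risk, active people:    {stratum_rr[True]:.2f}")
print(f"relative risk, inactive people:  {stratum_rr[False]:.2f}")
```

With these assumed numbers the crude relative risk falls well below 1, suggesting protection, while the relative risk within each activity stratum stays near 1. Stratifying is only one way of "controlling" for a confounder; it is used here because it is the simplest to show.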

Figure 1.4 ■ Confounding. The relationship between antioxidant intake and cardiovascular risk is potentially confounded by patient characteristics and behaviors related to both antioxidant use and development of cardiovascular disease. (Diagram labels: main question, antioxidant intake and cardiovascular disease prevention; potentially confounding factors, age, aspirin use, physical activity, body mass index, cigarette smoking, family history, and diet.)

Most clinical research studies, especially studies that observe people over time, routinely try to avoid confounding by “controlling” for possible confounding variables in the analysis (see Chapter 5). Variables such as age, sex, and race are almost always analyzed for confounding because so many health outcomes vary according to them. Studies that involve human behavior (such as taking antioxidants regularly) are especially prone to confounding because human behavior is so complex that it is difficult to analyze for all the factors that might influence it.
A variable does not have to be a cause of the disease or other condition of interest in order to be a confounding variable. It may just be related to the condition in a particular set of data at hand, because of selection bias or chance, but not related in nature. Whether just in the data or in nature, the consequences are the same: the mistaken impression that the factor of interest is a true, independent cause when it is not.
Selection bias and confounding are related. They are described separately, however, because they present problems at different points in a clinical study. Selection bias is an issue primarily when patients are chosen for investigation and it is important in the design of a study. Confounding must be dealt with during analysis of the data, once the observations have been made.
A study may involve several types of biases at the same time.

Example
Concerns have been raised that caffeine consumption during pregnancy may lead to adverse fetal outcomes. It would be unethical to determine if caffeine is dangerous to fetuses by an experiment assigning some pregnant women to drink high levels of caffeine, and others not, so researchers have usually studied what happens during pregnancy according to the amount of caffeine ingested. However, several biases have been demonstrated in many of these studies (15). Measurement bias could have occurred because most studies relied on self-reported intake of caffeine. One study demonstrated recall bias, a type of measurement bias that refers to differential recall in people with an adverse outcome compared to those with a normal outcome. An association was found between caffeine consumption and miscarriage when women were interviewed after they miscarried, but not when women were questioned about caffeine consumption before miscarriage (16). If some women were recruited for caffeine studies during prenatal visits (women who are likely to be particularly health conscious) and others recruited toward the end of their pregnancy, the different
approaches to recruitment could lead to selection bias that might invalidate the results. Finally, heavy coffee consumption is known to be associated with cigarette smoking, lower socioeconomic levels, greater alcohol consumption, and generally less health consciousness, all of which could confound any association between caffeine

The potential for bias does not mean that bias is actually present in a particular study or, if present, would have a big enough effect on the results to matter. For a researcher or reader to deal effectively with bias, it is first necessary to know where and how to look for it and what can be done about it. But one should not stop there. It is also necessary to determine whether bias is actually present and how large it is likely to be, and then decide whether it is important enough to change the conclusions of the study in a clinically meaningful way.

Chance
Observations about disease are ordinarily made on a sample of patients because it is not possible to study all patients with the disease in question. Results of unbiased samples tend to approximate the true value. However, a given sample, even if selected without bias, may misrepresent the situation in the population as a whole because of chance. If the observation were repeated on many such patient samples from the same population, results for the samples would cluster around the true value, with more of them close to, rather than far from, the true value. The divergence of an observation on a sample from the true population value, due to chance alone, is called random variation.
All of us are familiar with chance as an explanation for why a coin does not come up heads exactly 50% of the time when it is flipped, say, 100 times. The same effect, random variation, applies when comparing the effects of laparoscopic and open repair of inguinal hernia, discussed earlier. Suppose all biases were removed from a study of the effects of the two procedures. Suppose, further, that the two procedures are, in reality, equally effective in the amount of pain caused, each followed by pain in 10% of patients. Because of chance alone, a single study with small numbers of patients in each treatment group might easily find that patients do better with laparoscopy than with open surgery (or vice versa).
Chance can affect all the steps involved in clinical observations. In the assessment of the two ways of repairing inguinal hernia, random variation occurs in the sampling of patients for the study, the selection of treatment groups, and the measurements of pain and return to work.
Unlike bias, which tends to distort results in one direction or another, random variation is as likely to result in observations above the true value as below it. As a consequence, the mean of many unbiased observations on samples tends to approximate the true value in the population, even though the results of individual small samples may not. In the case of inguinal hernia repair, multiple studies, when evaluated together, have shown that laparoscopic repair results in less pain in the first few days after surgery.
Statistics can be used to estimate the extent to which chance (random variation) accounts for the results of a clinical study. Knowledge of statistics can also help reduce the role of chance by helping to create a better design and analyses. However, random variation can never be eliminated totally, so chance should always be considered when assessing the results of clinical observations. The role of chance in clinical observations will be discussed in greater depth in Chapter 11.

The Effects of Bias and Chance Are Cumulative
The two sources of error, bias and chance, are not mutually exclusive. In most situations, both are present. The relationship between the two is illustrated in Figure 1.5. The measurement of diastolic blood pressure on a single patient is taken as an example; each dot represents an observation on that patient. True blood pressure, which is 80 mm Hg for this patient, can be obtained by an intra-arterial cannula, but this method is not feasible for routine measurements. Blood pressure is ordinarily measured indirectly, using a sphygmomanometer (blood pressure cuff). As discussed in an earlier example, the simpler instrument is prone to error or deviations from the true value. In the figure, the error is represented by all of the sphygmomanometer readings falling to the right of the true value. The deviation of sphygmomanometer readings to higher values (bias) may have several explanations (e.g., the wrong cuff size, patient anxiety, or “white coat hypertension”). Individual blood pressure readings are also subject to error because of random variation in measurement, as illustrated by the spread of the sphygmomanometer readings around the mean value (90 mm Hg).

Figure 1.5 ■ Bias and chance. True blood pressure by intra-arterial cannula and clinical measurement by sphygmomanometer. (Plot of readings along a diastolic blood pressure axis in mm Hg, with marks at 80 and 90; labels: true blood pressure (intra-arterial cannula), blood pressure measurement (sphygmomanometer), chance, and bias.)
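The blood pressure example can be sketched numerically to show how bias and chance add up. The true value (80 mm Hg) and the mean cuff reading (90 mm Hg) come from the text; the normal distribution and the 3 mm Hg spread of readings are assumptions made only for this illustration.

```python
import random
import statistics

TRUE_DIASTOLIC = 80.0  # mm Hg, intra-arterial value given in the text
BIAS = 10.0            # mm Hg, systematic offset implied by the mean reading of 90
RANDOM_SD = 3.0        # mm Hg, spread of readings; an assumed value


def cuff_readings(n, seed=0):
    """Simulate n cuff readings as truth + bias + random variation."""
    rng = random.Random(seed)
    return [TRUE_DIASTOLIC + BIAS + rng.gauss(0.0, RANDOM_SD) for _ in range(n)]


for n in (5, 50, 5000):
    mean = statistics.mean(cuff_readings(n))
    print(f"n={n:5d}  mean={mean:6.2f}  error vs truth={mean - TRUE_DIASTOLIC:+.2f}")

mean_many = statistics.mean(cuff_readings(5000))
```

In this sketch, averaging more readings shrinks the scatter of the mean around 90 mm Hg but leaves it about 10 mm Hg above the true value. Repetition reduces the effect of chance, not the effect of bias.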
The main reason for distinguishing between bias ?
and chance is that they are handled differently. In ?? Chance
theory, bias can be prevented by conducting clini-
cal investigations properly or can be corrected dur- EXTERNAL CONCLUSIO
ing data analysis. If not eliminated, bias often can be VALIDITY
detected by the discerning reader. Most of this (generalizability)
book is about how to recognize, avoid, or
minimize bias.
Chance, on the other hand, cannot be eliminated, they apply to my patients as well?” Generalizability
but its influence can be reduced by proper design expresses the
of research, and the remaining effect can be
estimated by statistics. No amount of statistical
treatment can correct for unknown biases in data.
Some statisticians would go so far as to suggest that
statistics should not be applied to data that are
vulnerable to bias because of poor research design, for
fear of giving false respect- ability to fundamentally
misleading work.

Internal and External Validity


When making inferences about a population from
observations on a sample, clinicians need to make
up their minds about two fundamental questions.
First, are the conclusions of the research correct
for the people in the sample? Second, if so, does the
sam- ple represent fairly the patients the clinician
is most interested in, the kind of patients in his or
her prac- tice, or perhaps a specific patient at hand
(Fig. 1.6)?
Internal validity is the degree to which the results
of a study are correct for the sample of patients being
studied. It is “internal” because it applies to the
conditions of the particular group of patients being
observed and not necessarily to others. The
internal validity of clinical research is determined by
how well the design, data collection, and analyses
are carried out, and it is threatened by all of the
biases and ran- dom variation discussed earlier. For a
clinical observa- tion to be useful, internal validity
is a necessary but not sufficient condition.
External validity is the degree to which the
results of an observation hold true in other settings.
Another term for this is generalizability. For the
individual clinician, it is an answer to the question,
“Assuming that the results of a study are true, do
Figure 1.6 ■ Internal and external validity.
Chapter 1: Introduction 17

validity of assuming that patients in a study are


simi- lar to other patients.
Every study that is internally valid is
generaliz- able to patients very much like the
ones in the study. However, an unimpeachable
study, with high inter- nal validity, may be totally
misleading if its results are generalized to the
wrong patients.
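The distinction between bias and chance drawn earlier in this section can be illustrated with a small simulation. This is only a sketch under stated assumptions: the true pressure of 80 mm Hg, the upward bias of 10 mm Hg, and the random error of 5 mm Hg are hypothetical values chosen so that the readings center on the 90 mm Hg mean mentioned in the text. It suggests that taking more readings shrinks the effect of chance but leaves the bias untouched.

```python
import random
import statistics

# Hypothetical values, chosen only for illustration (not data from the text).
TRUE_VALUE = 80.0   # assumed true diastolic pressure, mm Hg
BIAS = 10.0         # assumed systematic upward error of the cuff, mm Hg
NOISE_SD = 5.0      # assumed random measurement error (standard deviation), mm Hg


def simulate_readings(n, seed=0):
    """Return n readings, each = true value + fixed bias + random error."""
    rng = random.Random(seed)
    return [TRUE_VALUE + BIAS + rng.gauss(0, NOISE_SD) for _ in range(n)]


few = simulate_readings(5)
many = simulate_readings(5000)

mean_few = statistics.mean(few)
mean_many = statistics.mean(many)

# With many readings the mean settles near 90 (true value + bias),
# not near the true value of 80: averaging reduces chance, not bias.
print(f"mean of 5 readings:    {mean_few:.1f} mm Hg")
print(f"mean of 5000 readings: {mean_many:.1f} mm Hg")
print(f"remaining error vs. true value: {mean_many - TRUE_VALUE:.1f} mm Hg")
```

The remaining error in the large sample is essentially the bias itself, which is why statistical treatment alone cannot correct a systematically flawed measurement.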

Example

What is the long-term death rate in anorexia nervosa, an eatin


Figure 1.7 ■ Sampling bias. Thirty-year mortality from all causes in patients with anorexia nervosa. Comparison of a synthesis of 42 published studies, mainly from referral centers, and a study of all patients with anorexia in the population. (Data from Sullivan PF. Mortality in anorexia nervosa. Am J Psychiatry 1995;152:1073–1074; and Korndorfer SR, Lucas AR, Suman VJ, et al. Long-term survival of patients with anorexia nervosa: a population-based study in Rochester, Minn. Mayo Clin Proc 2003;78:278–284.)

The generalizability of clinical observations, even those with high internal validity, is a matter of personal judgment about which reasonable people might disagree. A situation often occurs when clinicians must decide whether to use the results of a well-done study for a patient who is older than those in the study, a different gender, or sicker. It might be that a treatment that works well in young healthy men does more harm than good in older, sicker women.
Generalizability can rarely be dealt with satisfactorily in any one study. Even a defined, geographically based population is a biased sample of other populations. For example, hospital patients are biased samples of county residents, counties of states, states of regions, and so on. The best a researcher can do about generalizability is to ensure internal validity, have the study population fit the research question, describe the study patients carefully, and avoid studying patients who are so unusual that experience with them generalizes to few others. It then remains for other studies, in other settings, to extend generalizability.

INFORMATION AND DECISIONS

The primary concerns of this book are the quality of clinical information and its correct interpretation. Making decisions is another matter. True, good decisions depend on good information, but they involve a great deal more as well, including value judgments and weighing competing risks and benefits.
In recent years, medical decision making has become a valued discipline in its own right. The field includes qualitative studies of how clinicians make decisions and how the process might be biased and can be improved. It also includes quantitative methods such as decision analysis, cost-benefit analysis, and cost-effectiveness analysis that present the decision-making process in an explicit way so that its components and the consequences of assigning various probabilities and values to them can be examined.
Patients and clinicians make clinical decisions. At best, they make decisions together, a process called shared decision making, recognizing that their expertise is complementary. Patients are experts in what they hope to achieve from medical care, given their unique experiences and preferences. They may have found a lot of information about their condition (e.g., from the Internet) but are not grounded in how to sort out credible from fallacious claims. Doctors are experts in whether and how likely patients’ goals can be achieved and how to achieve them. For this, they depend on the body of research evidence and the ability, based on the principles of clinical epidemiology, to distinguish stronger from weaker evidence. Of course, clinicians also bring to the encounter experience in how disease presents and the human consequences of care, such as what it is like to be intubated or to have an amputation, with which patients may have little experience. For clinicians to play their part on this team, they need to be experts in the interpretation of clinically relevant information.
Patients’ preferences and sound evidence are the basis for choosing among care options. For example, a patient with valvular heart disease may prefer the possibility of long-term good health that surgery offers, even though surgery is associated with discomfort and risk of death in the short term. A clinician armed with critical reading and communication skills can help the patient understand how big those potential benefits and risks are and how surely they have been established.
Some aspects of decision analysis, such as evaluation of diagnostic tests, are included in this book. However, we have elected not to go deeply into medical decision making itself. Our reason is that decisions are only as good as the information used to make them, and we have found enough to say about the essentials of collecting and interpreting clinical information to fill a book.

ORGANIZATION OF THIS BOOK

In most textbooks on clinical medicine, information about each disease is presented as answers to traditional clinical questions: diagnosis, clinical course, treatment,
and the like. However, most epidemiology books are organized around research strategies such as clinical trials, surveys, case-control studies, and the like. This way of organizing a book may serve those who perform clinical research, but it is often awkward for clinicians.
We have organized this book primarily according to the questions clinicians encounter when caring for patients (Table 1.1). Figure 1.8 illustrates how these questions correspond to the book’s chapters, taking HIV infection as an example. The questions relate to the entire natural history of disease, from the time people without HIV infection are first exposed to risk, to when some acquire the disease and emerge as patients, through complications of the disease, AIDS-defining illness, to survival or death.
In each chapter, we describe research strategies used to answer that chapter’s clinical questions. Some strategies, such as cohort studies, are useful for answering several different kinds of clinical questions. For the purposes of presentation, we have discussed each strategy primarily in one chapter and have simply referred to the discussion when the method is relevant to other questions in other chapters.

Natural History: Chapter Topic (Page)
Population at risk
Risk factors (unprotected sex, sharing needles): Cause (p. 194); Risk (pp. 50, 61, 80); Prevention (p. 152)
Infection: Frequency (p. 17); Abnormality (p. 31); Diagnosis (p. 108); Prevention (p. 152)
Onset of disease: primary infection; AIDS-defining illness (Kaposi sarcoma, Pneumocystis infection, disseminated Mycobacterium avium infection)
Treatment: Treatment (p. 132)
Outcomes (death, sick with AIDS, well): Prognosis (p. 93)

Figure 1.8 ■ Organization of this book in relation to the natural history of human immunodeficiency virus (HIV) infection. Chapters 11, 13, and 14 describe cross-cutting issues related to all points in the natural history of disease.

Review Questions

Questions 1.1–1.6 are based on the following clinical scenario.

A 37-year-old woman with low back pain for the past 4 weeks wants to know if you recommend surgery. You prefer to base your treatment recommendations on research evidence whenever possible. In the strongest study you can find, investigators reviewed the medical records of 40 consecutive men with low back pain under care at their clinic: 22 had been referred for surgery, and the other 18 patients had remained under medical care without surgery. The study compared rates of disabling pain after 2 months. All of the surgically treated patients and 10 of the medically treated patients were still being seen in the clinic throughout this time. Rates of pain relief were slightly higher in the surgically treated patients.

For each of the following statements, circle the one response that best represents the corresponding threat to validity.

1.1. Because there are relatively few patients in this study, it may give a misleading impression of the actual effectiveness of surgery.
A. Selection bias
B. Measurement bias
C. Confounding
D. Chance
E. External validity (generalizability)

1.2. The results of this study may not apply to your patient, a woman, because all the patients in the study were men.
A. Selection bias
B. Measurement bias
C. Confounding
D. Chance
E. External validity (generalizability)

1.3. Fewer patients who did not have surgery remained under care at the clinic 2 months after surgery.
A. Selection bias
B. Measurement bias
C. Confounding
D. Chance
E. External validity (generalizability)

1.4. The patients who were referred for surgery were younger and fitter than those who remained under medical care.
A. Selection bias
B. Measurement bias
C. Confounding
D. Chance
E. External validity (generalizability)

1.5. Compared with patients who had medical care alone, patients who had surgery might have been less likely to report whatever pain they had and the treating physicians might have been less inclined to record pain in the medical record.
A. Selection bias
B. Measurement bias
C. Confounding
D. Chance
E. External validity (generalizability)

1.6. Patients without other medical conditions were both more likely to recover and more likely to be referred for surgery.
A. Selection bias
B. Measurement bias
C. Confounding
D. Chance
E. External validity (generalizability)
For questions 1.7–1.11, select the best answer.

1.7. Histamine is a mediator of inflammation in patients with allergic rhinitis (“hay fever”). Based on this fact, which of the following is true?
A. Drugs that block the effects of histamines will relieve symptoms.
B. A fall in histamine levels in the nose is a reliable marker of clinical success.
C. Antihistamines may be effective, and their effects on symptoms (e.g., itchy nose, sneezing, and congestion) should be studied in patients with allergic rhinitis.
D. Other mediators are not important.
E. If laboratory studies of disease are convincing, clinical research is unnecessary.

1.8. Which of the following statements about samples of populations is incorrect?
A. Samples of populations may have characteristics that differ from the population even though correct sampling procedures were followed.
B. Samples of populations are the only feasible way of studying the population.
C. When populations are correctly sampled, external validity is ensured.
D. Samples of populations should be selected in a way that every member of the population has an equal chance of being chosen.

1.9. You are making a treatment decision with a 72-year-old man with colon cancer. You are aware of several good studies that have shown that a certain drug combination prolongs the life of patients with colon cancer. However, all the patients in these studies were much younger. Which of the statements below is correct?
A. Given these studies, the decision about this treatment is a matter of personal judgment.
B. Relying on these studies for your patient is called internal validity.
C. The results in these studies are affected by chance but not bias.

1.10. A study was done to determine whether regular exercise lowers the risk of coronary heart disease (CHD). An exercise program was offered to employees of a factory, and the rates of subsequent coronary events were compared in employees who volunteered for the program and those who did not volunteer. The development of CHD was determined by means of regular voluntary checkups, including a careful history, an electrocardiogram, and a review of routine health records. Surprisingly, the members of the exercise group developed higher rates of CHD even though fewer of them smoked cigarettes. This result is least likely to be explained by which of the following?
A. The volunteers were at higher risk for developing CHD than those not volunteering before the study began.
B. The volunteers did not actually increase their exercise and the amount of exercise was the same in the two groups.
C. Volunteers got more checkups, and silent myocardial infarctions were, therefore, more likely to have been diagnosed in the exercise group.

1.11. Ventricular premature depolarizations are associated with an increased risk of sudden death from a fatal arrhythmia, especially in people with other evidence of heart disease. You have read there is a new drug for ventricular premature depolarizations. What is the most important thing you would like to know about the drug before prescribing it to a patient?
A. The drug’s mechanism of action.
B. How well the drug prevents ventricular premature depolarizations in people using the drug compared to those who do not use the drug.
C. The rate of sudden death in similar people who do and do not take the drug.

Questions 1.12–1.15 are based on the following clinical scenario.

Because reports suggested estrogens increase the risk of clotting, a study compared the frequency of oral contraceptive use among women admitted to a hospital with thrombophlebitis and a group of women admitted for other reasons. Medical records were reviewed for indication of oral contraceptive use in the two groups. Women with thrombophlebitis were found to have been using oral contraceptives more frequently than the women admitted for other reasons.
For each of the following statements, select the one response that represents the corresponding threat to validity.

1.12. Women with thrombophlebitis may have reported the use of contraceptives more completely than women without thrombophlebitis because they remembered hearing of the association.
A. Selection bias
B. Measurement bias
C. Confounding
D. Chance
E. External validity (generalizability)

1.13. Doctors may have questioned women with thrombophlebitis more carefully about contraceptive use than they did those without thrombophlebitis (and recorded the information more carefully in the medical record) because they were aware that estrogen could cause clotting.
A. Selection bias
B. Measurement bias
C. Confounding
D. Chance
E. External validity (generalizability)

1.14. The number of women in the study was small.
A. Selection bias
B. Measurement bias
C. Confounding
D. Chance
E. External validity (generalizability)

1.15. The women with thrombophlebitis were admitted to the hospital by doctors working in different neighborhoods than the physicians of those that did not have thrombophlebitis.
A. Selection bias
B. Measurement bias
C. Confounding
D. Chance
E. External validity (generalizability)

Answers are in Appendix A.

REFERENCES
1. Home PD, Pocock SJ, Beck-Nielsen H, et al. Rosiglitazone evaluated for cardiovascular outcomes in oral agent combination therapy for type 2 diabetes (RECORD): a multicentre, randomized, open-label trial. Lancet 2009;373:2125–2135.
2. Lipscombe LL, Gomes T, Levesque LE, et al. Thiazolidinediones and cardiovascular outcomes in older patients with diabetes. JAMA 2007;298:2634–2643.
3. Nissen SE, Wolski K. Effect of rosiglitazone on the risk of myocardial infarction and death from cardiovascular causes. N Engl J Med 2007;356:2457–2471.
4. Friedman GD. Primer of Epidemiology, 5th ed. New York: Appleton and Lange; 2004.
5. Straus SE, Richardson WS, Glasziou P, et al. Evidence-Based Medicine: How to Practice and Teach EBM, 4th ed. New York: Churchill Livingstone; 2011.
6. Stuebe AM. Level IV evidence: adverse anecdote and clinical practice. N Engl J Med 2011;365(1):8–9.
7. Murphy EA. The Logic of Medicine. Baltimore: Johns Hopkins University Press; 1976.
8. Porta M. A Dictionary of Epidemiology, 5th ed. New York: Oxford University Press; 2008.
9. McCormack K, Scott N, Go PM, et al. Laparoscopic techniques versus open techniques for inguinal hernia repair. Cochrane Database Systematic Review 2003;1:CD001785. Publication History: Edited (no change to conclusions) 8 Oct 2008.
10. Neumayer L, Giobbie-Hurder A, Jonasson O, et al. Open mesh versus laparoscopic mesh repair of inguinal hernia. N Engl J Med 2004;350:1819–1827.
11. Sackett DL. Bias in analytic research. J Chronic Dis 1979;32:51–63.
12. Pickering TG, Hall JE, Appel LJ, et al. Recommendations for blood pressure in humans and experimental animals. Part 1: Blood pressure measurement in humans. A statement for professionals from the Subcommittee of Professional and Public Education of the American Heart Association Council on High Blood Pressure Research. Circulation 2005;111:697–716.
13. Bjelakovic G, Nikolova D, Gluud LL, et al. Mortality in randomized trials of antioxidant supplements for primary and secondary prevention: systematic review and meta-analysis. JAMA 2007;297(8):842–857.
14. Vivekananthan DP, Penn MS, Sapp SK, et al. Use of antioxidant vitamins for the prevention of cardiovascular disease: meta-analysis of randomized trials. Lancet 2003;361:2017–2023.
15. Norman RJ, Nisenblat V. The effects of caffeine on fertility and on pregnancy outcomes. In: Basow DS, ed. UpToDate. Waltham, MA: UpToDate; 2011.
16. Savitz DA, Chan RL, Herring AH, et al. Caffeine and miscarriage risk. Epidemiology 2008;19:55–62.
17. Sullivan PF. Mortality in anorexia nervosa. Am J Psychiatry 1995;152:1073–1074.
18. Korndorfer SR, Lucas AR, Suman VJ, et al. Long-term survival of patients with anorexia nervosa: a population-based study in Rochester, Minn. Mayo Clin Proc 2003;78:278–284.

Chapter 2

Frequency
Here, it is necessary to count.
—P.C.A. Louis†
1787–1872a

KEY WORDS
Numerator
Denominator
Prevalence
Point prevalence
Period prevalence
Incidence
Duration of disease
Case fatality rate
Survival rate
Complication rate
Infant mortality rate
Perinatal mortality rate
Prevalence studies
Cross-sectional studies
Surveys
Cohort
Cohort studies
Cumulative incidence
Incidence density
Person-time
Dynamic population
Population at risk
Random sample
Probability sample
Sampling fraction
Oversample
Convenience samples
Grab samples
Epidemic
Pandemic
Epidemic curve
Endemic

Chapter 1 outlined the questions that clinicians need to answer as they care for patients. Answers are usually in the form of probabilities and only rarely as certainties. Frequencies obtained from clinical research are the basis for probability estimates for the purposes of patient care. This chapter describes basic expressions of frequency, how they are obtained from clinical research, and how to recognize threats to their validity.

Example
A 72-year-old man presents with slowly progressive urinary frequency, hesitancy, and dribbling. A digital rectal examination reveals a symmetrically enlarged prostate gland and no nodules. Urinary flow measurements show a reduction in flow rate, and his serum prostate-specific antigen (PSA) is not elevated. The clinician diagnoses benign prostatic hyperplasia (BPH). In deciding on treatment, the clinician and patient must weigh the benefits and hazards of various therapeutic options. To simplify, let us say the options are medical therapy with drugs or surgery. The patient might choose medical treatment but runs the risk of worsening symptoms or obstructive renal disease because the treatment is less immediately effective than surgery. Or he might choose surgery, gaining immediate relief of symptoms but at the risk of operative mortality and long-term urinary incontinence and impotence.

† A 19th Century physician and proponent of the “numerical method” (relying on counts, not impressions) to understand the natural history of diseases such as typhoid fever.

Decisions such as the one this patient and clinician face have traditionally relied on clinical judgment based on experience at the bedside and in the clinics. In modern times, clinical research has become sufficiently strong and extensive that it is possible to ground clinical judgment in research-based probabilities, or frequencies. Probabilities of disease, improvement, deterioration, cure, side effects, and death are the basis for answering most clinical questions. For this
patient, sound clinical decision making requires accurate estimates of how his symptoms and complications of treatment will change over time according to which treatment is chosen.

ARE WORDS SUITABLE SUBSTITUTES FOR NUMBERS?

Clinicians often communicate probabilities as words (e.g., usually, sometimes, rarely) rather than as numbers. Substituting words for numbers is convenient and avoids making a precise statement when one is uncertain about a probability. However, words are a poor substitute for numbers because there is little agreement about the meanings of commonly used adjectives describing probabilities.

Example
Physicians were asked to assign percentage values to 13 expressions of probability (1). These physicians generally agreed o

Patients also assign widely varying probabilities to word descriptions. In another study, highly skilled and professional workers outside of medicine thought “usually” referred to probabilities of 35% to 100%; “rarely” meant to them a probability of 0% to 15% (3). Thus, substituting words for numbers diminishes the information conveyed. We advocate using numbers whenever possible.

PREVALENCE AND INCIDENCE

In general, clinically relevant measures of frequency are expressed as proportions, in which the numerator is the number of patients experiencing an event (cases) and the denominator is the number of people in whom the event could have occurred (population). The two basic measures of frequency are prevalence and incidence.

Prevalence
Prevalence is the fraction (proportion or percent) of a group of people possessing a clinical condition or outcome at a given point in time. Prevalence is measured by surveying a defined population and counting the number of people with and without the condition of interest. Point prevalence is measured at a single point in time for each person (although actual measurements need not necessarily be made at the same point in calendar time for all the people in the population). Period prevalence describes cases that were present at any time during a specified period of time.

Incidence
Incidence is the fraction or proportion of a group of people initially free of the outcome of interest that develops the condition over a given period of time. Incidence refers then to new cases of disease occurring in a population initially free of the disease or new outcomes such as symptoms or complications occurring in patients with a disease who are initially free of these problems.

Figure 2.1 ■ Incidence and prevalence. Occurrence of disease in 10,000 people at risk for lung cancer, 2010 to 2012. (Timeline of 2010, 2011, and 2012 showing onset and duration for each case.)

Figure 2.1 illustrates the differences between incidence and prevalence. It shows the occurrence of

lung cancer in a population of 10,000 people over the course of 3 years (2010–2012). As time passes, individuals in the population develop the disease. They remain in this state until they either recover or die; in the case of lung cancer, they usually die. Four people already had lung cancer before 2010, and 16 people developed it during the 3 years of observation. The rest of the original 10,000 people have not had lung cancer during these 3 years and do not appear in the figure.
At the beginning of 2010, four cases of lung cancer already existed, so the prevalence at that point in time is 4/10,000. If all surviving people are examined at the beginning of each year, one can compute the prevalence at those points in time. At the beginning of 2011, the prevalence is 5/9,996 because two of the pre-2010 patients are still alive, as are three other people who developed lung cancer in 2010; the denominator is reduced by the 4 patients who died before 2011. Prevalence can be computed for each of the other two annual examinations and is 7/9,992 at the beginning of 2012 and 5/9,986 at the beginning of 2013.
To calculate the incidence of new cases developing in the population, we consider only the 9,996 people free of the disease at the beginning of 2010 and what happens to them over the next 3 years. Five new lung cancers developed in 2010, six developed in 2011, and five additional lung cancers developed in 2012. The 3-year incidence of the disease is all new cases developing in the 3 years (16) divided by the number of susceptible individuals at the beginning of the follow-up period (9,996), or 16/9,996 in 3 years. What are the annual incidences for 2010, 2011, and 2012? Remembering to remove the previous cases from the denominator (they are no longer at risk of developing lung cancer), we would calculate the annual incidences as 5/9,996 in 2010, 6/9,991 in 2011, and 5/9,985 in 2012.

Prevalence and Incidence in Relation to Time

Every measure of disease frequency necessarily contains some indication of time. With measures of prevalence, time is assumed to be instantaneous, as in a single frame from a motion picture film. Prevalence depicts the situation at that point in time for each patient, even though it may, in reality, have taken several months to collect observations on the various people in the population. However, for incidence, time is the interval during which susceptible people were observed for the emergence of the event of interest. Table 2.1 summarizes the characteristics of incidence and prevalence.
Why is it important to know the difference between prevalence and incidence? Because they answer two entirely different questions: on the one hand, “What proportion of a group of people has a condition?”; and on the other, “At what rate do new cases arise in a defined population as time passes?” The answer to one question cannot be obtained directly from the answer to the other.

Table 2.1
Characteristics of Incidence and Prevalence

Numerator
  Incidence: New cases occurring during a period of time among a group initially free of disease
  Prevalence: Existing cases at a point or period of time
Denominator
  Incidence: All susceptible people without disease at the beginning of the period
  Prevalence: All people examined, including cases and non-cases
Time
  Incidence: Duration of the period
  Prevalence: Single point or period
How measured
  Incidence: Cohort study (see Chapter 5)
  Prevalence: Prevalence (cross-sectional) study

RELATIONSHIPS AMONG PREVALENCE, INCIDENCE, AND DURATION OF DISEASE

Anything that increases the duration of disease increases the chances that the patient will be identified in a prevalence study. Another look at Figure 2.1 will confirm this. Prevalent cases are those that remain affected; to the extent that patients are cured, die of their disease, or leave the population under study, they are no longer a case in a prevalence survey. As a result, diseases of brief duration will be more likely to be missed by a prevalence study. For example, 15% of all deaths from coronary heart disease occur outside the hospital within an hour of onset and without prior symptoms of heart disease. A prevalence
study would, therefore, miss nearly all these events and underestimate the true burden of coronary heart disease in the community. In contrast, diseases of long duration are well represented in prevalence surveys, even when their incidence is low. The incidence of inflammatory bowel disease in North America is only about 2 to 14 per 100,000/year, but its prevalence is much higher, 37 to 246/100,000, reflecting the chronic nature of the disease (4).
The relationship among incidence, prevalence, and duration of disease in a steady state, in which none of the variables is changing much over time, is approximated by the following expression:

Prevalence = Incidence × Average duration of the disease

Alternatively,

Prevalence/Incidence = Duration

Example
The incidence and prevalence of ulcerative colitis were measured in Olmsted County, Minnesota, from 1984 to 1993 (5). In

Similarly, the prevalence of prostate cancer on autopsy is so much higher than its incidence that the majority of these cancers must never become symptomatic enough to be diagnosed during life.

SOME OTHER RATES

Table 2.2 summarizes some rates used in health care. Most of them are expressions of events over time. For example, a case fatality rate (or alternatively, the survival rate) is the proportion of people having a disease who die of it (or who survive it). For acute diseases such as Ebola virus infection, follow-up time may be implicit, assuming that deaths are counted over a long enough period of time (in this case, a few weeks) to account for all of them that might have occurred. For chronic diseases such as cardiovascular disease or cancer, it is more usual to specify the period of observation (e.g., the 5-year survival rate). Similarly, complication rate, the proportion of people with a disease or treatment who experience complications, assumes that enough time has passed for the complications to have occurred. These kinds of measures can be underestimations if follow-up is not really long enough. For example, surgical site infection rates have been underreported because they have been counted up to the time of hospital discharge, whereas some wound infections are first apparent after discharge (6).
Other rates, such as infant mortality rate and perinatal mortality rate (defined in Table 2.2) are approximations of incidence because the children in the numerator are not necessarily those in the denominator. In the case of infant mortality rate for a given year, some of the children who die in that year were born in the previous year; similarly, the last children to be born in that year may die in the following year. These rates are constructed in this way to make measurement more feasible, while providing a useful approximation of a true rate in a given year.
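The steady-state approximation given earlier in this chapter (prevalence divided by incidence approximates average duration) can be tried on the inflammatory bowel disease figures quoted there. The following is a minimal sketch, not a validated estimate: pairing the low incidence with the low prevalence and the high with the high is an assumption made here for illustration, and the function name is hypothetical.

```python
def implied_duration_years(prevalence_per_100k, incidence_per_100k_per_year):
    """Steady-state approximation: duration = prevalence / incidence."""
    return prevalence_per_100k / incidence_per_100k_per_year


# Figures quoted in the text for inflammatory bowel disease in North America.
# Pairing low with low and high with high is an illustrative assumption.
low_end = implied_duration_years(37, 2)
high_end = implied_duration_years(246, 14)

print(f"implied average duration: {low_end:.1f} years (low-end figures)")
print(f"implied average duration: {high_end:.1f} years (high-end figures)")
```

Both pairings imply a duration on the order of two decades, which is consistent with the text's point that a chronic disease can have a high prevalence despite a low incidence.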

Table 2.2
Some Commonly Used Rates

Case fatality rate: Proportion of patients who die of a disease
Complication rate: Proportion of patients who suffer a complication of a disease or its treatment
Infant mortality rate: (Number of deaths in a year of children <1 year of age) / (Number of live births in the same year)
Perinatal mortality rate (World Health Organization definition): Number of stillbirths and deaths in the first week of life per 1,000 live births
Maternal mortality rate: (Number of maternal deaths related to childbirth in a given year) / (Number of live births in the same population during the same year)
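The lung cancer example of Figure 2.1 can be worked through in a few lines to show why prevalence and incidence use different numerators and denominators. This is a sketch using only the counts stated in the text (10,000 people, 4 existing cases, and 5, 6, and 5 new cases in 2010, 2011, and 2012); the variable names are illustrative.

```python
POPULATION = 10_000
EXISTING_CASES_START_2010 = 4                   # cases present before 2010
NEW_CASES_BY_YEAR = {2010: 5, 2011: 6, 2012: 5}

# Prevalence at the beginning of 2010: existing cases / everyone examined.
prevalence_2010 = (EXISTING_CASES_START_2010, POPULATION)

# Incidence counts only new cases among people initially free of disease.
at_risk_start = POPULATION - EXISTING_CASES_START_2010
three_year_incidence = (sum(NEW_CASES_BY_YEAR.values()), at_risk_start)

# Annual incidence: each year's new cases leave the population at risk.
annual_incidence = {}
at_risk = at_risk_start
for year, new_cases in NEW_CASES_BY_YEAR.items():
    annual_incidence[year] = (new_cases, at_risk)
    at_risk -= new_cases

print("prevalence, start of 2010: %d/%d" % prevalence_2010)
print("3-year incidence: %d/%d" % three_year_incidence)
for year, (cases, denom) in annual_incidence.items():
    print(f"annual incidence {year}: {cases}/{denom}")
```

Keeping each measure as a (numerator, denominator) pair rather than a reduced fraction makes it easy to compare the output with the fractions quoted in the text.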
Chapter 2: Frequency 23

Defined Representative Disease/outcome


population sample present?

Population at risk

No
Sample
Yes

Figure 2.2 ■ The design of a prevalence study.

STUDIES OF PREVALENCE AND INCIDENCE

Prevalence and incidence are measured by entirely different kinds of studies.

Prevalence Studies
In prevalence studies, people in a population are examined for the presence of the condition of interest. Some members of the population have the condition at that point in time, whereas others do not (Fig. 2.2). The fraction or proportion of the population that has the condition (i.e., cases) constitutes the prevalence of the disease.

Another term for prevalence studies is cross-sectional studies because people are studied at a “cross-section” of time. Prevalence studies are also called surveys if the main measurement is a questionnaire.

The following is an example of a typical prevalence study.

Example
The World Health Organization created a research consortium to study the cross-national prevalence of depression. [...] period prevalence but a good estimate of point prevalence because of the narrow time window) ranged from a high of 4.6% in the United States to a low of 0.9% in Japan. Period prevalence was higher; for example, in the United States, the 12-month prevalence was 10.0% and the lifetime prevalence was 16.9%. The authors concluded that “major depressive episodes are a commonly occurring disorder that usually has a chronic-intermittent course”

Incidence Studies
The population under examination in an incidence study is a cohort, which is defined as a group of people having something in common when they are first assembled and are then followed over time for the development of outcome events. For this reason, incidence studies are also called cohort studies. A sample of people free of the outcome of interest is identified and observed over time to see whether an outcome event occurs. Members of the cohort may be healthy at first and then followed forward in time for the emergence of disease, for example, from being cancer-free until the onset (or not) of pancreatic cancer. Or, all of them may have a recently diagnosed disease (such as pancreatic cancer) and then be followed forward in time to outcomes such as recurrence or death. Incidence studies will be discussed in greater detail in Chapters 5 and 7.

Cumulative Incidence
To this point, the term “incidence” has been used to describe the rate of new events in a group of people of fixed size, all members of which are observed over a period of time. This is called cumulative incidence because new cases are accumulated over time.

Incidence Density (Person-Years)
Another approach to studying incidence is to measure the number of new cases emerging in an ever-changing population, one in which individuals are under study and susceptible for varying lengths of time. The incidence derived from studies of this type is called incidence density because it is, figuratively speaking, the density of new cases in time and place.

Clinical trials often use the incidence density approach. Eligible patients are enrolled over a period of time so that early enrollees are treated and followed up for longer periods than late enrollees. In an effort to keep the contribution of individual patients commensurate with their follow-up interval, the denominator of an incidence density measure is not persons at risk for a specific time period but person-time at risk for the outcome event. A patient followed for 10 years without an outcome event contributes 10 person-years, whereas one followed for 1 year contributes only 1 person-year to the denominator. Incidence density is expressed as the number of new cases per total number of person-years at risk.

The person-years approach is especially useful for estimating the incidence of disease in dynamic populations, those in which some individuals in the population are entering and others leaving it as time passes. Incidence studies in large populations typically have an accurate count of new cases in the population (e.g., from hospital records or disease registries), but the size and characteristics of the population at risk can only be estimated (from census and other records) because the people in it are entering and leaving the region continually. This approach works because the proportion of people who enter or leave is small, relative to the population as a whole (Fig. 2.3), so the population is likely to be relatively stable over short periods of time.

Figure 2.3 ■ A dynamic population. [Diagram: people enter the community population by being born in the community or moving into it, and leave it by dying or moving out.]

Example
A study of the incidence of herpes zoster infections (“shingles”) and its complications provides an example of both incid[...] all care provided to county residents was provided within the county and most residents had agreed to let their records be used for research. The population of the county was estimated from census data at approximately 175,000. Incidence of herpes zoster, adjusted to the age and sex of the U.S. adult population, was 3.6 per 1,000 person-years and rose with age. Pain after the herpes attack occurred in 18% of these patients.

Incidence of herpes zoster infection was described in person-years in a dynamic population, whereas pain after infection was a cumulative incidence in which all patients with herpes zoster were followed up.

A disadvantage of the person-years approach is that it lumps together different lengths of follow-up. A small number of patients followed for a long time can contribute as many person-years as a large number of patients followed for a short time. If patients with long follow-up are systematically different from those with short follow-up (perhaps because outcome events take a long time to develop or because patients with especially bad risk tend to leave the population), the resulting incidence density will depend on the particular combination of number of patients and follow-up times. For example, the latency period between exposure to carcinogen and onset of cancer is at least 10 years for most cancers. It might be possible to see an increase in cancer rates
in a study of 10,000 people exposed to a carcinogen and followed up for 20 years. However, a study of 100,000 people followed for 2 years would not show an increase, even though it involves the same number of person-years (200,000), because the follow-up time is too short.

Example
The world is in an obesity epidemic. What is the prevalenc[...] 59 years in 2007 through 2008 (9). The U.S. National Institutes [...]
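The person-years bookkeeping discussed above (each patient contributes only the time he or she was actually at risk, and 10,000 people followed for 20 years yield the same denominator as 100,000 people followed for 2 years) can be sketched in a few lines of Python. This is an illustrative sketch only: the `FollowUp` record, the function names, and the three-patient cohort are hypothetical and are not taken from any study cited in this chapter.

```python
from dataclasses import dataclass


@dataclass
class FollowUp:
    """One patient's follow-up (hypothetical record layout)."""
    years_at_risk: float  # time observed while still susceptible to the outcome
    had_event: bool       # True if a new case occurred during follow-up


def incidence_density(cohort):
    """New cases divided by total person-years at risk."""
    total_person_years = sum(p.years_at_risk for p in cohort)
    new_cases = sum(1 for p in cohort if p.had_event)
    return new_cases / total_person_years


def person_years(n_people, years_each):
    """Person-years when every person is followed for the same time."""
    return n_people * years_each


# Hypothetical cohort: follow-up of 10 years, 1 year, and 4 years (one event).
cohort = [FollowUp(10, False), FollowUp(1, False), FollowUp(4, True)]
rate = incidence_density(cohort)   # 1 case per 15 person-years
rate_per_1000 = 1000 * rate        # same rate expressed per 1,000 person-years

# The two study designs from the text have identical denominators.
long_study = person_years(10_000, 20)
short_study = person_years(100_000, 2)
```

Because the two designs share the same denominator, the person-years total alone cannot show that a 2-year study is too short to capture outcomes with a long latency; the distribution of follow-up times has to be examined separately.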

BASIC ELEMENTS OF FREQUENCY STUDIES
To make sense of a study reporting prevalence, one needs careful definition of both the numerator and the denominator.

What Is a Case? Defining the Numerator
Cases might be people in the general population who develop a disease or patients in clinical settings with disease who develop an outcome event such as recurrence, complication of treatment, or death. In either situation, the way in which a case is defined affects rates.

Most clinical phenomena (serum cholesterol, serum calcium, thyroid hormone levels, etc.) exist on a continuum from low to high. The cutoff point defining a case can be placed at various points and this can have large effects on the resulting prevalence. We will discuss some of the reasons why one would place a cutoff at one or another point in Chapter 3 and the consequences for diagnostic test performance in Chapter 8.

Rates may also be affected by how aggressively one looks for cases. For example, aspirin can induce asthma in some people. How often does this occur? It depends on the definition of a case. When people are simply asked whether they have a breathing problem after taking aspirin, rates are relatively low, about 3% in adults. When a case is defined more rigorously, by giving aspirin and measuring whether this was followed by bronchoconstriction, the prevalence of aspirin-induced asthma is much higher, about 21% in adults (10). The lower rate

Figure 2.4 ■ The prevalence of overweight and obesity in men, 2007 to 2008. [Plot: population (y-axis) by body mass index (x-axis, 10 to 55), with regions marked underweight, normal weight, overweight, and obese (Classes I, II, and III).] (Data from Flegal KM, Carroll MD, Ogden CL, et al. Prevalence and trends in obesity among US adults, 1999–2008. JAMA 2010;303(3):235–241.)

Example
Many cases of prostate cancer remain indolent and are not detect[...]

Table 2.3
Classification of Obesity According to the U.S. National Institutes of Health and World Health Organization

Classification                                              Body Mass Index (kg/m²)
Underweight                                                 <18.5
Normal weight                                               18.5–24.9
Overweight                                                  25.0–29.9
Obesity                                                     ≥30
  Obesity Class I                                           30.0–34.9
  Obesity Class II                                          35.0–39.9
  Obesity Class III (“severe,” “extreme,” or “morbid”)      ≥40

Data from Flegal KM, Carroll MD, Ogden CL, et al. Prevalence and trends in obesity among US adults, 1999–2008. JAMA 2010;303:235–241.
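The effect of a cutoff on prevalence can be made concrete with a short Python sketch built on the categories of Table 2.3. It assumes the conventional boundaries (below 18.5 for underweight, 30 or above for obesity, 40 or above for Class III); the function names and the eight body mass index values are hypothetical and serve only as an illustration.

```python
def classify_bmi(bmi):
    """Map a body mass index (kg/m^2) to a Table 2.3 category."""
    if bmi < 18.5:
        return "Underweight"
    if bmi < 25.0:
        return "Normal weight"
    if bmi < 30.0:
        return "Overweight"
    if bmi < 35.0:
        return "Obesity Class I"
    if bmi < 40.0:
        return "Obesity Class II"
    return "Obesity Class III"


def prevalence_at_cutoff(bmis, cutoff):
    """Proportion of the sample with BMI at or above the cutoff."""
    cases = sum(1 for b in bmis if b >= cutoff)
    return cases / len(bmis)


# Hypothetical sample of eight adults (not real survey data).
sample = [17.0, 22.4, 24.9, 27.3, 29.9, 31.0, 36.5, 41.2]

obesity_prevalence = prevalence_at_cutoff(sample, 30.0)            # 3 of 8
overweight_or_obese_prevalence = prevalence_at_cutoff(sample, 25.0)  # 5 of 8
```

In this made-up sample, moving the case definition from a cutoff of 30 to a cutoff of 25 raises the measured prevalence from 3 of 8 to 5 of 8 without any change in the people studied.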

pertains to clinical situations, whereas the higher rate tells us something about the biology of this disease.

Incidence can also change if a more sensitive way of detecting disease is introduced.

Figure 2.5 ■ Incidence depends on the intensity of efforts to find cases. Incidence of prostate cancer in the United States during the widespread use of screening with prostate-specific antigen (PSA). [Plot: age-adjusted incidence (y-axis, 0 to 250) by year of diagnosis (x-axis, 1975 to 2007), with the time of PSA approval marked.] (Redrawn with permission from Wolf AMD, Wender RC, Etzioni RB, et al. American Cancer Society guideline for the early detection of prostate cancer: Update 2010. CA Cancer Journal for Clinicians 2010;60:70–98.)

What Is the Population? Defining the Denominator
A rate is useful only to the extent that the population in which it is measured (the denominator of the rate) is clearly defined and right for the question. Three characteristics of the denominator are especially important.

First, all members of the population should be susceptible to the outcome of interest; that is, they should comprise a population at risk. If members of the population cannot experience the event or condition counted in the numerator, they do not belong in the denominator. For example, rates of cervical cancer should be assessed in women who still have a cervix; to the extent that cervical cancer rates are based on populations that include women who have had hysterectomies (or for that matter, men), true rates will be underestimated.

Second, the population should be relevant to the question being asked. For example, if we wanted to know the prevalence of HIV infection in the community, we would study a random sample of all people in a region. But if we wanted to know the prevalence of HIV infection among people who use street drugs, we would study them.

Third, the population should be described in sufficient detail so that it is a useful basis for judging to whom the results of the prevalence study apply. What is at issue here is the generalizability of rates: deciding whether a reported rate applies to the kind of patients that you are interested in. A huge gradient in rates of disease (e.g., for HIV infection) exists across practice settings from the general population to primary care practice to referral centers. Clinicians need to locate the reported rates on that spectrum if they are to use the information effectively.

Does the Study Sample Represent the Population?
As mentioned in Chapter 1, it is rarely possible to study all the people who have or might develop the condition of interest. Usually, one takes a sample so that the number studied is of manageable size. This leads to a central question: Does the sample accurately represent the parent population?

Random samples are intended to produce representative samples of the population. In a simple random sample, every individual in the population has an equal probability of being selected. A more general term, probability sample, is used when every person has a known (not necessarily equal) probability of being selected. Probability samples are useful because it is often more informative to include in the sample a sufficient number of people in particular subgroups of interest, such as ethnic minorities or the elderly. If members of these subgroups comprise only a small proportion of the population, a simple random sample of the entire population might not include enough of them. To remedy this, investigators can vary the sampling fraction, the fraction of all members of each subgroup included in the sample. Investigators can oversample low-frequency groups relative to the rest, that is, randomly select a larger fraction of them. The final sample will still be representative of the entire population if the different sampling fractions are taken into account in the analysis.

On average, the characteristics of people in probability samples are similar to those of the population from which they were selected, particularly when the sample is large. To the extent that the sample differs from the parent population, it is by chance and not because of systematic error.

Non-random samples are common in clinical research for practical reasons. They are called convenience samples (because their main virtue is that they were convenient to obtain, such as samples of patients who are visiting a medical facility, are cooperative, and are articulate) or grab samples (because the investigators just grabbed patients wherever they could find them).

Most patients described in the medical literature and encountered by clinicians are biased samples of their parent population. Typically, patients are included in research because they are under care in an academic institution, are available, are willing to be studied, are not afflicted with diseases other than the one under study, and perhaps also are particularly interesting, severely affected, or both. There is nothing wrong with this practice as long as it is understood to whom the results do (or do not) apply. However, because of biased samples, the results of clinical research often leave thoughtful clinicians with a large generalizability problem, from the research setting to their practice.

DISTRIBUTION OF DISEASE BY TIME, PLACE, AND PERSON
Epidemiology has been described as the study of the determinants of the distribution of disease in populations. Major determinants are time, place, and person. Distribution according to these factors can provide strong clues to the causes and control of disease, as well as to the need for health services.
Time
An epidemic is a concentration of new cases in time. The term pandemic is used when a disease is especially widespread, such as a global epidemic of particularly severe influenza (e.g., the one in 1918–1919) and the more slowly developing but worldwide rise in HIV infection/AIDS. The existence of an epidemic is recognized by an epidemic curve that shows the rise and fall of cases of a disease over time in a population.

Example
Figure 2.6 shows the epidemic curve for Severe Acute Respiratory Syndrome (SARS), in Beijing, People’s Republic of China [...] and signs of a febrile respiratory illness, chest radiograph changes, lack of response to antibiotics, and normal or decreased white blood cell count. Later, as more became known about this new disease, laboratory testing for the responsible coronavirus could be used to define a case. Cases were called “reported” to make clear that there was no assurance that all cases in the Beijing community were detected.

Figure 2.6 also indicates when major control measures were instituted. The epidemic declined in relation to aggressive quarantine measures involving the closing of public gathering places, identifying new cases early in their course, removing cases from the community, and isolating cases in facilities specifically for SARS. It is possible that the epidemic abated for reasons other than these control measures, but it is unlikely given that similar control measures in other places were also followed by a resolution of the epidemic. Whatever the cause, the decline in new cases allowed the World Health Organization to lift its advisory against travel to Beijing so that the city could reopen public places and resume normal international business and tourism.

Figure 2.6 ■ An epidemic curve. Probable cases of severe acute respiratory syndrome in Beijing March 2003 through May 2003, in relation to control measures. [Plot: number of probable cases (y-axis, 0 to 200) by date of hospitalization (x-axis, March 7 to May 30), annotated with the timing of control measures such as making SARS reportable, contact tracing, quarantine of close contacts, fever checks at airports, closure of schools and public venues, and designated fever clinics and hospitals.] (Adapted with permission from Pang X, Zhu Z, Xu F, et al. Evaluation of control measures implemented in the severe acute respiratory syndrome outbreak in Beijing, 2003. JAMA 2003;290:3215–3221.)

Knowledge of a local epidemic helps clinicians get the right diagnosis. For example, while serving as a primary care physician on a military base in Germany, one author saw a child with a fever and rash on the hands and feet. With only a hospital-based clerkship in pediatrics to rely on, he was perplexed. But when he and his colleagues began seeing many such children in a short time span, they recognized (with the help of a pediatric consultant) that they were in the midst of an outbreak of coxsackievirus infection (“hand, foot, and mouth syndrome”), which is a distinctive but mild infectious disease of children.

Place
The geographic distribution of cases indicates where a disease causes a greater or lesser burden of suffering and provides clues to its causes.

Example
The incidence of colorectal cancer is very different in different parts of the world. Rates, even when adjusted for di[...] in North America, Europe, and Australia and low in Africa and Asia (Fig. 2.7) (13). This observation has led to the hypothesis that environmental factors may play a large part in the development of this disease. This hypothesis has been supported by other studies showing that people moving from countries of low incidence to those of high incidence acquire higher rates of colorectal cancer during their lifetime.

Figure 2.7 ■ Colorectal cancer incidence for men according to area of the globe. (Data from Center MM, Jemal A, Smith RA, et al. Worldwide variations in colorectal cancer. CA Cancer J Clin 2009;59:366–378.)

When a disease such as iodine deficiency goiter or polio (after global efforts to eradicate it) is limited to certain places, the disease is called endemic.

Person
When disease affects certain kinds of persons at the same time and in the same places as other people who are not affected, this provides clues to causes and guidance on how health care efforts should be deployed. At the beginning of the AIDS pandemic, most cases were seen in homosexual men who had multiple sexual partners as well as among intravenous drug users. This led to the early hypothesis that the disease was caused

by an infectious agent transmitted in semen and blood. Laboratory studies confirmed this hypothesis and discovered the human immunodeficiency virus. Identification of the kinds of people most affected also led to special efforts to prevent spread of the disease in them, for example, by targeting education about safe sex to those communities, closing public bathhouses, and instituting safe-needle programs.

USES OF PREVALENCE STUDIES
Properly performed prevalence studies are the very best ways of answering some important questions and are a weak way of answering others.

What Are Prevalence Studies Good For?
Prevalence studies provide valuable information about what to expect in different clinical situations.

Example
The approach to cervical lymphadenopathy depends on where and in whom it is seen. Children with persistent cervical aden[...]

Prevalence of disease strongly affects the interpretation of diagnostic test results, as will be described in greater detail in Chapter 8.

Finally, prevalence is an important guide to planning health services. In primary care practice, being prepared for diabetes, obesity, hypertension, and lipid disorders should demand more attention than planning for Hodgkin disease, aplastic anemia, or systemic lupus erythematosus. In contrast, some referral hospitals are well prepared for just these diseases, and appropriately so.

What Are Prevalence Studies Not Particularly Good For?
Prevalence studies provide only weak evidence of cause and effect. Causal questions are inherently about new events arising over time; that is, they are about incidence. One of the other limitations of prevalence studies, for this purpose, is that it may be difficult to know whether the purported cause actually preceded or followed the effect because the two are measured at the same point in time. For example, if inpatients with hyperglycemia are more often infected, is it because hyperglycemia impairs immune function leading to infection or has the infection caused the hyperglycemia? If a risk factor (e.g., family history or a genetic marker) is certain to have preceded the onset of disease or outcome, interpretation of the cause-and-effect sequence is less worrisome.

Another limitation is that prevalence may be the result of incidence of disease, the main consideration in causal questions, or it may be related to duration of disease, an altogether different issue. With only information about prevalence, one cannot determine how much each of the two, incidence and duration, contributes. Nevertheless, cross-sectional studies can provide compelling hypotheses about cause and effect to be tested by stronger studies.

The underlying message is that a well-performed cross-sectional study, or any other research design, is not inherently strong or weak but is only in relation to the question it is intended to answer.

Example
Children living on farms are less likely to have asthma than childr[...]
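The limitation noted above, that prevalence alone cannot separate the contribution of incidence from that of duration, can be illustrated with a rough calculation. The sketch below assumes a steady state in which prevalence is approximately incidence multiplied by average duration, an approximation that holds only when the disease is uncommon; the function name and both sets of numbers are hypothetical.

```python
import math


def steady_state_prevalence(incidence_per_person_year, mean_duration_years):
    """Approximate prevalence as incidence x average duration.

    Assumes a stable population and an uncommon disease; this is a
    simplification for illustration, not a general formula.
    """
    return incidence_per_person_year * mean_duration_years


# Hypothetical disease A: relatively frequent but short-lived.
prevalence_a = steady_state_prevalence(0.010, 2.0)

# Hypothetical disease B: ten times rarer but lasting ten times longer.
prevalence_b = steady_state_prevalence(0.001, 20.0)

# Both come out at about 2%, so a prevalence survey could not tell them apart.
same_prevalence = math.isclose(prevalence_a, prevalence_b)
```

Two quite different disease processes produce the same cross-sectional picture here, which is why causal questions usually call for incidence data from cohort studies.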

Review Questions

Read the following statements and mark the best answer.

2.1. Cancer registries report 40 new cases of bladder cancer per 100,000 men per year. Cases were from a complete count of all patients who developed bladder cancer in several regions of the United States, and the number of men at risk was estimated from the census data in those regions. Which rate is this an example of?
A. Point prevalence
B. Period prevalence
C. Incidence density
D. Cumulative incidence
E. Complication rate

2.2. Sixty percent of adults in the U.S. population have a serum cholesterol 200 mg/dL (5.2 mmol/L). Which rate is this an example of?
A. Point prevalence
B. Complication rate
C. Incidence density
D. Cumulative incidence
E. Period prevalence

2.3. You are reading a study of the prevalence of uterine cervix infections and want to decide if the study is scientifically sound. Which of the following is not important?
A. Participants are followed up for a sufficient period of time for anemia to occur.
B. The study is done on a representative sample of the population.
C. All members of the population are women.
D. Cervical infection is clearly defined.
E. The study is done on a sample from a defined population.

2.4. A probability sample of a defined population:
A. Is invalidated by oversampling.
B. Is inferior to a random sample.
C. Is not representative of the population.
D. Results in a representative sample of the population only if there are enough people in the sample.

2.5. The incidence of rheumatoid arthritis is about 40/100,000/year and the prevalence is about 1/100 persons. On average, how many years does the disease persist?
A. 10
B. 25
C. 33
D. 40
E. 50

2.6. Which of the following studies is not a cohort study?
A. The proportion of patients with stomach cancer who survive 5 years
B. The risk of developing diabetes mellitus in children according to their weight
C. Complications of influenza vaccine among children vaccinated in 2011
D. The earlier course of disease in a group of patients now under care in a clinic
E. Patients admitted to an intensive care unit and followed up for whether they are still alive at the time of hospital discharge

2.7. A sample for a study of incidence of medication errors is obtained by enrolling every 10th patient admitted to a hospital. What kind of sample is this?
A. Stratified sample
B. Probability sample
C. Convenience sample
D. Random sample
E. Oversample

2.8. Cohort studies of children with a first febrile seizure have shown that they have a one in three chance of having another seizure during childhood. What kind of rate is this?
A. Point prevalence
B. Complication rate
C. Cumulative incidence
D. Period prevalence
E. Incidence density

2.9. Which of the following would not increase the observed incidence of disease?
A. More aggressive efforts to detect the disease
B. A true increase in incidence
C. A more sensitive way of detecting the disease
D. A lowering of the threshold for diagnosis of disease
E. Studying a larger sample of the population
2.10. Infection with a fungus, coccidioidomycosis, is common in the deserts of the southwestern United States and in Mexico, but uncommon elsewhere. Which of the following best describes this infection?
A. Endemic
B. Pandemic
C. Incident
D. Epidemic
E. Prevalent

2.11. Twenty-six percent of adults report having experienced back pain lasting at least a day in the prior 3 months. Which of the following best describes this rate?
A. Cumulative incidence
B. Incidence density
C. Point prevalence
D. Complication rate
E. Period prevalence

2.12. Which of the following best describes a “dynamic population”?
A. It is rapidly increasing in size.
B. It is uniquely suited for cohort studies.
C. People are continually entering and leaving the population.
D. It is the basis for measuring cumulative incidence.
E. It is the best kind of population for a random sample.

2.13. For a study of the incidence of idiopathic scoliosis (a deformity of the spine that becomes apparent after birth, most often in adolescence), which would be an appropriate cohort?
A. Children born in North Carolina in 2012 and examined for scoliosis until they are adults
B. Children who were referred to an orthopedic surgeon for treatment
C. Children who were found to have scoliosis in a survey of children in North Carolina
D. Children who have scoliosis and are available for study
E. Children who were randomly sampled from the population of North Carolina in the spring of 2012

2.14. Which of the following are prevalence studies especially useful for?
A. Describing the incidence of disease
B. Studying diseases that resolve rapidly
C. Estimating the duration of disease
D. Describing the proportion of people in a defined population with the condition of interest
E. Establishing cause and effect

2.15. Last year, 800,000 Americans died of heart disease or stroke. Which of the following best describes this statistic?
A. Incidence density
B. Point prevalence
C. Cumulative incidence
D. Period prevalence
E. None of the above

Answers are in Appendix A.

REFERENCES

1. Roberts DE, Gupta G. Letter to the editor. N Engl J Med 1987;316:550.
2. Bryant GD, Norman GR. Expressions of probability: words and numbers. N Engl J Med 1980;302:411.
3. Toogood JH. What do we mean by “usually”? Lancet 1980;1:1094.
4. Loftus EV Jr. Clinical epidemiology of inflammatory bowel disease: incidence, prevalence, and environmental influences. Gastroenterology 2004;126:1504–1517.
5. Loftus EV Jr, Silverstein MD, Sandborn WJ, et al. Ulcerative colitis in Olmsted County, Minnesota, 1940–1993: incidence, prevalence, and survival. Gut 2000;46:336–343.
6. Sands K, Vineyard G, Platt R. Surgical site infections occurring after hospital discharge. J Infect Dis 1996;173:963–970.
7. Andrade L, Caraveo-Anduaga JJ, Berglund P, et al. The epidemiology of major depressive episodes: results from the International Consortium of Psychiatric Epidemiology (ICPE) surveys. Int J Methods Psychiatr Res 2003;12:3–21.
8. Yawn BP, Saddier P, Wollan PC, et al. A population-based study of the incidence and complication rates of herpes zoster before zoster vaccine introduction. Mayo Clin Proc 2007;82:1341–1349.
9. Flegal KM, Carroll MD, Ogden CL, et al. Prevalence and trends in obesity among US adults, 1999–2008. JAMA 2010;303:235–241.
10. Jenkins C, Costello J, Hodge L. Systematic review of prevalence of aspirin induced asthma and its implications for clinical practice. BMJ 2004;328:434–437.
11. Wolf AMD, Wender RC, Etzioni RB, et al. American Cancer Society guideline for the early detection of prostate cancer: update 2010. CA Cancer J Clin 2010;60:70–98.
12. Pang X, Zhu Z, Xu F, et al. Evaluation of control measures implemented in the severe acute respiratory syndrome outbreak in Beijing, 2003. JAMA 2003;290:3215–3221.
13. Center MM, Jemal A, Smith RA, et al. Worldwide variations in colorectal cancer. CA Cancer J Clin 2009;59:366–378.
14. Ege MJ, Mayer M, Normand AC, et al. Exposure to environmental microorganisms in childhood asthma. N Engl J Med 2011;364:701–709.
Chapter 3

Abnormality
. . . the medical meaning of “normal” has been lost in the shuffle of statistics.
—Alvan Feinstein
1977

KEY WORDS abnormal. That is when skill and a conceptual


basis for deciding become important.
Nominal data Reliability Decisions about what is abnormal are most dif-
Dichotomous data Reproducibility ficult among relatively unselected patients, usually
Ordinal data Precision found outside of hospitals. When patients have
Interval data Range already been selected for special attention, as is the
Continuous data Responsiveness case in most referral centers, it is usually clear that
Discrete data Interpretability something is wrong. The tasks are then to refine
Validity Sampling fraction the diagnosis and to treat the problem. In primary
Accuracy Frequency distribution care settings and emergency departments, however,
Items
Constructs
Scales
Content validity
Criterion validity
Construct validity
Central tendency
Dispersion
Skewed distribution
Normal distribution
Regression to the mean

Clinicians spend a great deal of time distinguishing “normal” from “abnormal.” Is the thyroid normal or slightly enlarged? Is the heart murmur “innocent” (of no health importance) or a sign of valvular disease? Is a slightly elevated serum alkaline phosphatase evidence of liver disease, unrecognized Paget disease, or nothing important?
When confronted with something grossly different from the usual, there is little difficulty telling the two apart. We are all familiar with pictures in textbooks of physical diagnoses showing massive hepatosplenomegaly, huge goiters, or hands severely deformed by rheumatoid arthritis. It takes no great skill to recognize this degree of abnormality, but clinicians are rarely faced with this situation. Most often, clinicians must make subtler distinctions between normal and

patients with subtle manifestations of disease are mixed with those with the everyday complaints of basically healthy people. It is not possible to pursue all of these complaints aggressively. Which of many patients with abdominal pain have self-limited gastroenteritis and which have early appendicitis? Which patients with sore throat and hoarseness have a viral pharyngitis and which have the rare but potentially lethal Haemophilus epiglottitis? These are examples of how difficult and important distinguishing various kinds of abnormalities can be.
The point of distinguishing normal from abnormal is to separate those clinical observations that should be the basis for action from those that can be simply noted. Observations that are thought to be normal are usually described as “within normal limits,” “unremarkable,” or “non-contributory” and remain buried in the body of a medical record. The abnormal findings are set out in a problem list or under the heading “impressions” or “diagnoses” and are the basis for action.
Simply calling clinical findings normal or abnormal is undoubtedly crude and results in some misclassification. The justification for taking this approach is that it is often impractical or unnecessary to consider

the raw data in all their detail. As Bertrand Russell pointed out, to be perfectly intelligible one must be at least somewhat inaccurate, and to be perfectly accurate, one is too often unintelligible. Physicians usually choose to err on the side of being intelligible, to themselves and others, even at the expense of some accuracy. Another reason for simplifying data is that each aspect of a clinician’s work ends in a decision: to pursue evaluation or to wait, to begin a treatment or to reassure. Under these circumstances, some sort of “present/absent” classification is necessary.
Table 3.1 is an example of how relatively simple expressions of abnormality are derived from more complex clinical data. On the left is a typical problem list, a statement of the patient’s important medical problems. On the right are some of the data on which the decisions to call them problems are based. Conclusions from the data, represented by the problem list, are by no means uncontroversial. For example, the mean of the three diastolic blood pressure measurements is 92 mm Hg. Some might argue that this level of blood pressure does not justify the label “hypertension” because it is not particularly high and there are some disadvantages to telling patients they are sick and recommending drugs. Others might consider the label appropriate, considering that this level of blood pressure is associated with an increased risk of cardiovascular disease and that the risk can be reduced by treatment, and the label is consistent with guidelines. Although crude, the problem list serves as a basis for decisions (about diagnosis, prognosis, and treatment), and clinical decisions must be made, whether actively (by additional diagnostic tests and treatment) or passively (by no intervention).

Table 3.1
Summary of Clinical Data: A Patient’s Problem List and the Data on Which It Is Based

Problem List                     Raw Data
Acute myocardial infarction      Chest pain, troponin 40 g/L (99th percentile of upper reference limit), new ST elevation in leads II, III, and AVF
Hypertension                     Several blood pressure measurements (mm Hg): 145/92, 149/93, 142/91
Diabetes mellitus                Several fasting plasma sugar measurements (mg/dL): 138, 135, 129
Renal failure                    Serum creatinine 2.7 mg/dL
Obstructive pulmonary disease    Forced expiratory volume at 1 second (FEV1)/forced vital capacity (FVC) < 0.70

This chapter describes some of the ways clinicians distinguish normal from abnormal. First, we consider how biologic phenomena are measured, how they vary, and how they are summarized. Then, we discuss how these data are used as a basis for value judgments about what is worth calling abnormal.

TYPES OF DATA

Measurements of clinical phenomena yield three kinds of data: nominal, ordinal, and interval.

Nominal Data
Nominal data occur in categories without any inherent order. Examples of nominal data are characteristics that are determined by a small set of genes (e.g., ABO blood type and sex) or are dramatic, discrete events (e.g., death, dialysis, or surgery). These data can be placed in categories without much concern about misclassification. Nominal data that are divided into two categories (e.g., present/absent, yes/no, alive/dead) are called dichotomous.

Ordinal Data
Ordinal data possess some inherent ordering or rank such as small to large or good to bad, but the size of the intervals between categories is not specified. Some clinical examples include 1+ to 4+ leg edema, heart murmurs grades I (heard only with special effort) to VI (audible with the stethoscope off the chest), and muscle strength grades 0 (no movement) to 5 (normal strength). Some ordinal scales are complex. The risk of birth defects from drugs during pregnancy is graded by the U.S. Food and Drug Administration on a five-category scale ranging from A, “no adverse effects in humans”; through B, an adverse effect in animal studies not confirmed in controlled studies in women or “no effect in animals without human data”; C, “adverse effect in animals without human data or no available data from animals or humans”; and D, “adverse effects in humans, or likely in humans because of adverse effects in animals”; to X, “adverse effects in humans or animals without indication for use during pregnancy” (1).

Interval Data
For interval data, there is inherent order and the interval between successive values is equal, no matter where one is on the scale. There are two types of interval data. Continuous data can take on any
value in a continuum, regardless of whether they are reported that way. Examples include most serum chemistries, weight, blood pressure, and partial pressure of oxygen in arterial blood. The measurement and description of continuous variables may in practice be confined to a limited number of points on the continuum, often integers, because the precision of the measurement, or its use, does not warrant greater detail. For example, a particular blood glucose reading may in fact be 193.2846573 . . . mg/dL but is simply reported as 193 mg/dL. Discrete data can take on only specific values and are expressed as counts. Examples of discrete data are the number of a woman’s pregnancies and live births and the number of migraine attacks a patient has in a month.
It is for ordinal and interval data that the question arises, “Where does normal leave off and abnormal begin?” When, for example, does a large normal prostate become too large to be considered normal? Clinicians are free to choose any cutoff point. Some of the reasons for the choices are considered later in this chapter.

PERFORMANCE OF MEASUREMENTS

Whatever the type of measurement, its performance can be described in several ways.

Validity
Validity is the degree to which the data measure what they were intended to measure; that is, the degree to which the results of a measurement correspond to the true state of the phenomenon being measured. Another word for validity is accuracy.
For clinical observations that can be measured by physical means, it is relatively easy to establish validity. The observed measurement is compared with some accepted standard. For example, serum sodium can be measured on an instrument recently calibrated against solutions made up with known concentrations of sodium. Laboratory measurements are commonly subjected to extensive and repeated validity checks. For example, it is common practice for blood glucose measurements to be monitored for accuracy by comparing readings against high and low standards at the beginning of each day, before each technician begins a day, and after any changes in the techniques, such as a new bottle of reagent or a new battery for the instrument. Similarly, accuracy of a lung scan for pulmonary embolus can be measured against pulmonary angiography, in which the pulmonary artery anatomy is directly visualized. The validity of a physical examination finding can be established by comparing it to the results of surgery or radiologic examinations.
Some other clinical measurements such as pain, nausea, dyspnea, depression, and fear cannot be verified physically. In patient care, information about these phenomena is usually obtained informally by “taking a history.” More formal and standardized approaches, used in research, are structured interviews and questionnaires. Individual questions (items) are designed to measure specific phenomena (e.g., symptoms, feelings, attitudes, knowledge, beliefs) called constructs, and these items are grouped together to form scales. Table 3.2 shows one such scale, a brief questionnaire used to detect alcohol abuse and dependence.

Table 3.2
The CAGE Test for Detecting Alcohol Abuse and Dependence (a)

Have you ever felt you needed to Cut down on your drinking?
Have people Annoyed you by criticizing your drinking?
Have you ever felt Guilty about your drinking?
Have you ever felt you needed a drink first thing in the morning (Eye opener) to steady your nerves or to get rid of a hangover?

One “yes” response suggests the need for closer assessment.
Two or more “yes” responses are strongly related to alcohol abuse, dependence, or both.

(a) Other tests, such as AUDIT, are useful for detecting less severe drinking patterns that can respond to simple counseling.
Adapted from Ewing JA. Detecting alcoholism: the CAGE questionnaire. JAMA 1984;252:1905–1907.

Three general strategies are used to establish the validity of measurements that cannot be directly verified physically.

Content Validity
Content validity is the extent to which a particular method of measurement includes all of the dimensions of the construct one intends to measure and nothing more. For example, a scale for measuring pain would have content validity if it included questions about aching, throbbing, pressure, burning, and stinging, but not about itching, nausea, and tingling.

Criterion Validity
Criterion validity is present to the extent that the measurements predict a directly observable phenomenon. For example, one might see whether
responses on a scale measuring pain bear a predictable relationship to pain of known severity: mild pain from minor abrasion, moderate pain from ordinary headache and peptic ulcer, and severe pain from renal colic. One might also show that responses to a scale measuring pain are related to other, observable manifestations of the severity of pain such as sweating, moaning, writhing, and asking for pain medications.

Construct Validity
Construct validity is present to the extent that the measurement is related in a coherent way to other measures, also not physically verifiable, that are believed to be part of the same phenomenon. Thus, one might be more confident in the construct validity of a scale for depression to the extent that it is related to fatigue and headache, constructs thought to be different from but related to depression.
Validity of a scale is not, as is often asserted, either present or absent. Rather, with these strategies, one can build a case for or against its validity under the conditions in which it is used, so as to convince others that the scale is more or less valid.
Because of their selection and training, physicians tend to prefer the kind of precise measurements that the physical and biologic sciences afford and may avoid or discount others, especially for research. Yet relief of symptoms and promoting satisfaction and a feeling of well-being are among the most important outcomes of patient care and are central concerns of patients and doctors alike. To guide clinical decisions, research must include them, lest the picture of medicine painted by the research be distorted.
As Feinstein (2) put it:
The term “hard” is usually applied to data that are reliable and preferably dimensional (e.g., laboratory data, demographic data, and financial costs). But clinical performance, convenience, anticipation, and familial data are “soft.” They depend on subjective statements, usually expressed in words rather than numbers, by the people who are the observers and the observed.
To avoid such soft data, the results of treatment are commonly restricted to laboratory information that can be objective, dimensional, and reliable, but it is also dehumanized. If we are told that the serum cholesterol is 230 mg/dL, that the chest x-ray shows cardiac enlargement, and that the electrocardiogram has Q waves, we would not know whether the treated object was a dog or a person. If we were told that capacity at work was restored, that the medicine tasted good and was easy to take, and that the family was happy about the results, we would recognize a human set of responses.

Reliability
Reliability is the extent to which repeated measurements of a stable phenomenon by different people and instruments at different times and places get similar results. Reproducibility and precision are other words for this property.
The reliability of laboratory measurements is established by repeated measures (for example, of the same serum or tissue specimen), sometimes by different people and with different instruments. The reliability of symptoms can be established by showing that they are similarly described to different observers under different conditions.
The relationships between reliability and validity are shown in simple form in Figure 3.1. Measurements can be both accurate (valid) and reliable (precise), as shown in Figure 3.1A. Measurements can be very reliable but inaccurate if they are systematically off the mark, as in Figure 3.1B. On the other hand, measurements can be valid on the average but not be reliable, because they are widely scattered about the true value, as shown in Figure 3.1C. Finally, measurements can be both invalid and imprecise, as shown in Figure 3.1D. Small numbers of measurements with poor reliability are at risk of low validity because they are likely to be off the mark by chance alone. Therefore, reliability and validity are not altogether independent concepts. In general, an unreliable measurement cannot be valid and a valid measurement must be reliable.

Range
An instrument may not register very low or high values of the phenomenon being measured; that is, it has limited range, which limits the information it conveys. For example, the Basic Activities of Daily Living scale that measures patients’ ability in dressing, eating, walking, toileting, maintaining hygiene, and transferring from bed or chair does not measure ability to read, write, or play the piano (activities that might be very important to individual patients).

Responsiveness
An instrument demonstrates responsiveness to the extent that its results change as conditions change. For example, the New York Heart Association scale, Classes I to IV (no symptoms of heart failure and no limitations of ordinary physical activity, mild symptoms and slight limitation of ordinary physical activity, marked limitation of ordinary physical activity because of fatigue, palpitation or dyspnea,
and inability to carry out any physical activity, even at rest, because of symptoms), is not sensitive to subtle changes in congestive heart failure, ones that might matter to patients. However, measurements of ejection fraction by echocardiography can detect changes so subtle that patients do not notice them.

[Figure 3.1: four frequency distributions of measurements, panels A to D, arranged by validity (accuracy), high or low, and reliability (precision), high or low. Horizontal axis: measurement. Vertical axis: frequency.]
Figure 3.1 ■ Validity and reliability. A. High validity and high reliability. B. Low validity and high reliability. C. High validity and low reliability. D. Low validity and low reliability. The white lines represent the true values.

Interpretability
Clinicians learn to interpret the significance of a PCO2 of 50 or a blood sugar of 460 through experience, in which they repeatedly calibrate patients’ current conditions and clinical courses against such test results. However, scales based on questionnaires may have little intuitive meaning to clinicians and patients who do not use them regularly. To overcome this interpretability disadvantage, researchers can “anchor” scale values to familiar states. To help clinicians interpret scale values, the numbers are anchored to descriptions of everyday performance. For example, values of the Karnofsky Performance Status Scale, a measure of functional capacity commonly used in studies of cancer patients receiving chemotherapy, range from 100 (normal) to 0 (dead). Just how bad is it to have a value of 60? At a scale value of 60, patients require occasional assistance but are able to care for most of their personal needs.

VARIATION

Overall variation is the sum of variation related to the act of measurement, biologic differences within individuals from time to time, and biologic differences among individuals (Table 3.3).

Table 3.3
Sources of Variation

Source of Variation      Definition
Measurement Variation
  Instrument             The means of making the measurement
  Observer               The person making the measurement
Biologic Variation
  Within individuals     Changes in a person at different times and situations
  Between individuals    Biologic differences from person to person

Variation Resulting from Measurement
All observations are subject to variation because of the performance of the instruments and observers involved in making the measurements. The conditions
of measurement can lead to a biased result (lack of validity) or simply random error (lack of reliability). It is possible to reduce this source of variation by making measurements with great care and by following standard protocols. However, when measurements involve human judgment, rather than machines, variation can be particularly large and difficult to control.

Example
Findings on chest radiographs are used as part of the diagnosis of Acute Lung Injury and Acute Respiratory Distress Syndrom

[Figure 3.2: percentage of radiographs read positive (vertical axis) for each of the readings by 21 experts (horizontal axis).]
Figure 3.2 ■ Observer variability. Variability among 21 specialists reading chest x-rays for acute lung injury and acute respiratory distress syndrome. The percentage of radiographs read as positive for the diagnosis varied from 36% to 71% among the experts. (Data from Rubenfeld GD, Caldwell E, Granton J, et al. Interobserver variability in applying a radiographic definition for ARDS. Chest 1999;116:1347–1353.)

Variations in measurements also arise because they are made on only a sample of the phenomenon being described, which may misrepresent the whole. Often, the sampling fraction (the fraction of the whole that is included in the sample) is very small. For example, a liver biopsy represents only about 1/100,000 of the liver. Because such a small part of the whole is examined, there is room for considerable variation from one sample to another.
If measurements are made by several different methods, such as different laboratories, technicians, or instruments, some of the measurements may be unreliable or may produce results that are systematically different from the correct value, which could contribute to the spread of values obtained.

Variation Resulting from Biologic Differences
Variation also arises because of biologic changes within individuals over time. Most biologic phenomena change from moment to moment. A measurement at a point in time may not represent the usual value of these measurements.

Example
Clinicians estimate the frequency of ventricular premature beats (VPBs) to help determine the need for and effectiveness of treatment. For practical reasons, they may do so by making relatively brief observations, perhaps feeling a pulse for 1 minute or reviewing an electrocardiogram recording lasting several seconds. However, the frequency of VPBs in a given patient varies over time. To obtain a larger sample to estimate the VPB rate, a portable monitor was developed that tracks ventricular premature depolarizations (VPDs) electrocardiographically. Early studies found monitoring even for extended periods of time can be misleading. Figure 3.3 shows observations on one patient with VPDs, similar to other patients studied (4). VPDs per hour varied from 20 to 380 during a 3-day period, according to day and time of day. The
authors concluded, “To distinguish a reduction in VPB frequency attributable to therapeutic intervention rather than biologic or spontaneous variation alone required a greater than 83% reduction in VPB frequency if only two 24-hour monitoring periods were compared.” Much shorter periods of observation could be even more misleading because of biologic variation. To deal with this biologic variability, modern devices are now able to monitor cardiac rhythm for extended

[Figure 3.3: number of VPBs per hour (vertical axis, 0 to 400) by time of day (noon, 6 P.M., midnight, 6 A.M.), plotted separately for Day 1, Day 2, and Day 3.]
Figure 3.3 ■ Biologic variability. The number of ventricular premature beats (VPBs) per hour for one untreated patient on 3 consecutive days. (Data from Morganroth J, Michelson EL, Horowitz LN, et al. Limitations of routine long-term electrocardiographic monitoring to assess ventricular ectopic frequency. Circulation 1978;58:408–414.)

Total Variation
The several sources of variation are cumulative. Figure 3.4 illustrates this for the measurement of blood pressure. When looking at a population distribution, variation in measurement for individual patients is added to variation for those individuals from time to time, which in turn is added to variation among different patients. Measurement variation contributes relatively little, although it accounts for as much as 12 mm Hg among various observers. However, each patient’s blood pressure varies a great deal from moment to moment throughout the day, so any single blood pressure reading might not represent the usual for that patient. Much of this variation is not random: Blood pressure is generally higher when people are awake, excited, visiting physicians, or taking over-the-counter cold medications. Of course, we are most interested in knowing how an individual’s blood pressure compares with that of his or her peers, especially if the blood pressure level is related to complications of hypertension and the effectiveness of treatment.
Despite all these sources of variation that can distort the measurement of blood pressure, biologic differences among individuals are a predominant cause of variation in the case of blood pressure, so much so that several studies have found even a single casual blood pressure measurement can predict subsequent cardiovascular disease among a population.

Effects of Variation
Another way of thinking about variation is in terms of its net effect on the validity and reliability of a measurement and what can be done about it.
Random variation, for example, by unstable instruments or many observers with various biases that tend to balance each other out, results on average in no net misrepresentation of the true state of a phenomenon, even though individual measurements may be misleading. Inaccuracy resulting from random variation can be reduced by taking the average of a larger sample of what is being measured, for example, by counting more cells on a blood smear, examining a larger area of a urine sediment, or studying more patients. Also, the extent of random variation can be estimated by statistical methods (see Chapter 11).

[Figure 3.4: distributions of diastolic blood pressure (mm Hg, 60 to 110) under four conditions of measurement, each labeled with its source of variation. Within individual patient: simultaneous, same observer (measurement); simultaneous, 2 observers (measurement); between visits (biologic). Among patients (biologic).]
Figure 3.4 ■ Sources of variation in the measurement of diastolic (phase V) blood pressure. The dashed line indicates the true blood pressure. Multiple sources of variation, including within and among patients as well as intra- and inter-observer variation, all contribute to blood pressure measurement results.

On the other hand, biased results are systematically different from the true value, no matter how many times they are repeated. For example, all of the high values for VPDs shown in Figure 3.3 were recorded on the first day, and most of the low values on the third day. The days were biased estimates of each other because of variation in VPD rate from day to day.

DISTRIBUTIONS

Data that are measured on interval scales are often presented as a figure, called a frequency distribution, showing the number (or proportion) of a defined group of people possessing the different values of the measurement. Figure 3.5 shows the distribution of blood neutrophil counts in normal black men and women.

Describing Distributions
Presenting interval data as a frequency distribution conveys the information in relatively fine detail, but it is often convenient to summarize distributions. Indeed, summarization is imperative when a large number of distributions are presented and compared.
Two basic properties of distributions are used to summarize them: central tendency, the middle of the distribution, and dispersion, how spread out the values are. Several ways of expressing central tendency and dispersion, along with their advantages and disadvantages, are illustrated in Figure 3.5 and summarized in Table 3.4.

[Figure 3.5: frequency distribution of neutrophil count (× 10⁹/L, 0 to 10) against percent (0 to 30), with the mode, median, and mean marked and the range and 1 S.D. indicated.]
Figure 3.5 ■ Measures of central tendency and dispersion. The distribution of blood neutrophil counts in a national sample of blacks age 18 years and older. (The authors found that neutrophil counts were lower and neutropenia was more common in blacks compared to whites.) (Data from Hsieh MM, Everhart JE, Byrd-Holt DD, et al. Prevalence of neutropenia in the U.S. population: Age, sex, smoking status and ethnic differences. Ann Intern Med 2007;146:486–492.)

Actual Distributions
Distributions of clinical phenomena have many different shapes. The frequency distributions of four common blood tests (for potassium, alkaline phosphatase, glucose, and hemoglobin) are shown in Figure 3.6. In general, most of the values appear near the middle of the distribution, and except for the central part of the curves, there are no “humps” or irregularities. The high and low ends of the distributions stretch out into tails, with the tail at one end often being more elongated than the tail at the other (i.e., the curves are skewed toward the longer end). Whereas some of the distributions are skewed toward higher values, others are skewed toward lower values. In other words, all of these distributions are unimodal (have only one hump), and are roughly bell shaped, though not necessarily symmetrical. Otherwise, they do not resemble one another.
The distribution of values for many laboratory tests changes with characteristics of the patients such as age, sex, race, and nutrition. For example, the distribution of one such test, blood urea nitrogen (BUN, a test of kidney function), changes with age. A BUN of 25 mg/dL would be unusually high for a young person in her 20s, but not particularly remarkable for an 80-year-old.

Table 3.4
Expressions of Central Tendency and Dispersion

Central Tendency

Mean
  Definition: Sum of values for observations / Number of observations
  Advantages: Well suited for mathematical manipulation
  Disadvantages: Affected by extreme values
Median
  Definition: The point where the number of observations above equals the number below
  Advantages: Not easily influenced by extreme values
  Disadvantages: Not well suited for mathematical manipulation
Mode
  Definition: Most frequently occurring value
  Advantages: Simplicity of meaning
  Disadvantages: Sometimes there are no, or many, most frequent values

Dispersion

Range
  Definition: From lowest to highest value in a distribution
  Advantages: Includes all values
  Disadvantages: Greatly affected by extreme values
Standard deviation (a)
  Definition: The absolute value of the average difference of individual values from the mean (a)
  Advantages: Well suited for mathematical manipulation
  Disadvantages: For non-Gaussian distributions, does not describe a known proportion of the observations
Percentile, decile, quartile, etc.
  Definition: The proportion of all observations falling between specified values
  Advantages: Describes “unusualness” of a value. Does not make assumptions about the shape of a distribution
  Disadvantages: Not well suited for statistical manipulation

(a) $\sqrt{\dfrac{\sum (X - \bar{X})^2}{N - 1}}$, where $X$ = each observation; $\bar{X}$ = mean of all observations; and $N$ = number of observations.
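
The expressions in Table 3.4 can be computed directly for a small set of values. The following is a minimal sketch using Python's standard `statistics` module; the function name `summarize` and the seven sample values are assumptions made up for illustration and are not data from this chapter. The standard deviation uses the N - 1 denominator given in the table footnote.

```python
import statistics


def summarize(values):
    """Return the Table 3.4 expressions for a list of interval data."""
    ordered = sorted(values)
    return {
        # Central tendency
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "modes": statistics.multimode(ordered),  # may hold several values
        # Dispersion
        "range": (ordered[0], ordered[-1]),
        "sd": statistics.stdev(ordered),  # sample SD, N - 1 denominator
        "quartiles": statistics.quantiles(ordered, n=4, method="inclusive"),
    }


if __name__ == "__main__":
    # Hypothetical counts with one extreme value at the high end.
    sample = [2, 3, 3, 4, 5, 6, 12]
    for name, value in summarize(sample).items():
        print(name, value)
```

With one extreme high value in the sample, the mean is pulled above the median, which illustrates the disadvantage of the mean and the advantage of the median listed in the table.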

[Figure 3.6: four frequency distributions, percent on the vertical axis. Serum potassium (mEq/L, 3.0 to 5.0); alkaline phosphatase (units, 20 to 140); plasma glucose (mg/100 mL, 100 to 200); hemoglobin (g/100 mL, 8 to 16).]
Figure 3.6 ■ Actual clinical distributions. (Data from Martin HF, Gudzinowicz BJ, Fanger H. Normal Values in Clinical Chemistry. New York: Marcel Dekker; 1975.)

[Figure 3.7: normal curve, frequency against standard deviations from the mean (–3 to +3). Percent of area under the curve in successive bands of 1 standard deviation: 2.14, 13.59, 34.13, 34.13, 13.59, 2.14. Within 1 standard deviation of the mean: 68.26; within 2: 95.44; within 3: 99.72.]
Figure 3.7 ■ The normal (Gaussian) distribution.

The Normal Distribution
Another kind of distribution is called the normal distribution (or “Gaussian,” after the mathematician who first described it). The normal distribution, based in statistical theory, describes the frequency distribution of repeated measurements of the same physical object by the same instrument. Dispersion of values represents random variation alone. A normal curve is shown in Figure 3.7. The curve is symmetrical and bell shaped. It has the mathematical property that about two-thirds of the observations fall within
1 standard deviation of the mean, and about 95%, within 2 standard deviations.
Although clinical distributions often resemble a normal distribution, the resemblance is superficial. As summarized in a perspective, “The experimental fact is that for most physiologic variables the distribution is smooth, unimodal, and skewed, and that mean ± 2 standard deviations does not cut off the desired 95%. We have no mathematical, statistical, or other theorems that enable us to predict the shape of the distributions of physiologic measurements” (5).
The shapes of clinical distributions differ from one another because many differences among people, other than random variation, contribute to distributions of clinical measurements. Therefore, if distributions of clinical measurements resemble normal curves, it is largely by accident. Even so, it is often assumed, as a matter of convenience (because means and standard deviations are relatively easy to calculate and manipulate mathematically) that clinical measurements are “normally” distributed.

CRITERIA FOR ABNORMALITY

It would be convenient if the frequency distributions of clinical measurements for normal and abnormal people were so different that these distributions could be used to distinguish two distinct populations. This is actually the case for some abnormal genes. Sequences (genetic abnormalities coding) for the autosomal dominant condition familial adenomatous polyposis are either present or absent. People with the abnormal gene develop hundreds of polyps in their colon, whereas people without the gene rarely have more than a few, but this is the exception that proves the rule. Far more often, the various genetic abnormalities coding for the same disease produce a range of expressions. Even the expression of a specific genetic abnormality (such as substitution of a single base pair) differs substantially from one person to another, presumably related to differences in the rest of the genetic endowment as well as exposure to external causes of the disease.
Therefore, most distributions of clinical variables are not easily divided into “normal” and “abnormal.” They are not inherently dichotomous and do not display sharp breaks or two peaks that characterize normal and abnormal results. This is because disease is usually acquired by degrees, so there is a smooth transition from low to high values with increasing degrees of dysfunction. Laboratory tests reflecting organ failure, such as serum creatinine for kidney failure or ejection fraction for heart failure, behave in this way.
Another reason why normals and abnormals
are not seen as separate distributions is that even
when people with and without a disease have
substantially different frequency distributions, the
distributions almost always overlap. When the
two distributions are mixed together, as they are in
naturally occurring populations, the abnormals are
usually not seen as separate because they comprise
such a small proportion of the whole. The curve
for people with disease is “swallowed up” by the
larger curve for healthy, normal people.
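A minimal numerical sketch of this "swallowing up" follows. It mixes two normal curves, one for healthy people and one for people with disease, at a low prevalence, and asks what share of people at a given value actually have the disease. The means, standard deviations, and the 1% prevalence are hypothetical values chosen for illustration only.

```python
from statistics import NormalDist

# Hypothetical distributions and prevalence (assumptions for illustration).
healthy = NormalDist(mu=100, sigma=10)
diseased = NormalDist(mu=130, sigma=10)
prevalence = 0.01


def mixture_density(x):
    """Density of the whole population at value x."""
    return (1 - prevalence) * healthy.pdf(x) + prevalence * diseased.pdf(x)


def diseased_share(x):
    """Share of people with value x who have the disease."""
    return prevalence * diseased.pdf(x) / mixture_density(x)


# Even at the diseased group's own mean, healthy people are the majority.
share_at_diseased_mean = diseased_share(130)

# The combined curve keeps falling to the right of the healthy peak:
# the diseased group never forms a visible second peak.
single_peak = all(
    mixture_density(x) > mixture_density(x + 1) for x in range(101, 160)
)

print(f"Share diseased at x=130: {share_at_diseased_mean:.2f}")
print(f"Combined curve has no second peak: {single_peak}")
```

Under these assumed numbers, the combined curve looks like a single distribution even though the two underlying groups differ by 3 standard deviations.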

Example
Phenylketonuria (PKU) is an inherited disease characterized by
It is common practice to screen newborns for PKU with a blo
Figure 3.8 ■ Screening for phenylketonuria (PKU) in infants: dichotomous and overlapping distributions of normal and abnormal. A. Alleles coding for phenylalanine hydroxylase are either normal or mutant. B. The distributions of blood phenylalanine levels in newborns with and without PKU overlap and are of greatly different magnitude. (The prevalence of PKU, which is 1/10,000, is exaggerated so that its distribution can be seen in the figure.)
[Panel A: alleles for phenylalanine hydroxylase, normal and mutant. Panel B: blood phenylalanine (mg/dL), 0 to 10, normal and mutant.]

If there is no sharp dividing line between normal and abnormal, and the clinician can choose where the line is placed, what ground rules should be used to decide? Three criteria have proven useful: being unusual, being sick, and being treatable. For a given measurement, the results of these approaches bear no necessary relation to one another, so what is considered abnormal by one criterion might be normal by another.

Abnormal = Unusual

Normal often refers to the frequently occurring or usual condition. Whatever occurs often is considered normal, and what occurs infrequently is abnormal. This is a statistical definition, based on the frequency of a characteristic in a defined population. Commonly, the reference population is made up of people without disease, but this need not be the case. For example, we may say that it is normal to have pain after surgery or itching with eczema.
It is tempting to be more specific by defining what is unusual in mathematical terms. One commonly used way of establishing a cutoff point between normal and abnormal is to agree, somewhat arbitrarily, that all values beyond 2 standard deviations from the mean are abnormal. On the assumption that the distribution in question approximates a normal distribution, 2.5% of observations would then appear in each tail of the distribution and be considered abnormally high or abnormally low.
Of course, as already pointed out, most biologic measurements are not normally distributed. Therefore, it is better to describe unusual values, whatever the proportion chosen, as a fraction (or percentile) of the actual distribution. In this way, it is possible to make a direct statement about how infrequent a value is without making assumptions about the shape of the distribution from which it came.
Despite this, the statistical definition of normality, with the cutoff point at 2 standard deviations from the mean, is most commonly used. However, it can be ambiguous or misleading for several reasons:
■ If all values beyond an arbitrary statistical limit, say the 95th percentile, were considered abnormal, then the frequency of all diseases would be the same (if one assumed the distribution was normal, 2.5% if we consider just the extreme high or low ends of the distribution). Yet, it is common knowledge that diseases vary in frequency; diabetes and osteoarthritis are far more common than ovalocytosis and hairy cell leukemia.
■ There is no general relationship between the degree of statistical unusualness and clinical disease. The relationship is specific to the disease in question and the setting. Thus, obesity is quite common in the United States but uncommon in many developing countries. For some measurements, deviations from usual are associated with disease to an important degree only at quite extreme values, well beyond the 95th or even the 99th percentile. Failure of organs such as the liver and kidneys becomes symptomatic only when most of usual function is lost.
■ Sometimes extreme values are actually beneficial. For example, people with unusually low blood pressure are, on average, at lower risk of cardiovascular disease than people with more usual blood pressures. People with unusually high bone density are at lower than average risk of fractures.
■ Many measurements are related to risk of disease over a broad range of values, with no threshold dividing normal from increased risk. Blood pressure is an example.

Example
Figure 3.9 shows that usual systolic blood pressure is related to ischemic heart disease mortality throughout a broad r

Figure 3.9 ■ Ischemic heart disease mortality for people ages 40 to 49 years is related to systolic blood pressure throughout the range of values occurring in most people. There is no threshold between normal and abnormal. “Mortality” is presented as a multiple of the baseline rate. (Data from Prospective Studies Collaboration. Age-specific relevance of usual blood pressure to vascular mortality: a meta-analysis of individual data for one million adults in 61 prospective studies. Lancet 2002;360:1903–1913.)
[Horizontal axis: usual systolic blood pressure (mm Hg). Vertical axis: mortality.]

Abnormal = Associated with Disease

As the above example demonstrates, a sounder approach to distinguishing normal from abnormal is to call abnormal those observations that are clinically meaningful departures from good health (that is, associated with a meaningful risk of having or developing disease, disability, or death). Using this approach to defining abnormality can sometimes lead to different levels of a condition being abnormal.

Example
At what point does higher than average weight for height be

Abnormal = Treating the Condition Leads to a Better Clinical Outcome

It makes intuitive sense to define a clinical condition or finding as “abnormal” if treatment of it leads to a better outcome. This approach makes particularly good sense for asymptomatic conditions. If a condition is causing no trouble, and treatment makes no difference, why try to treat it? However, even for symptomatic patients, it is sometimes difficult to

Figure 3.10 ■ Abnormal as associated with disease and other patient outcomes. The relationship between body mass index and (A) total mortality and (B) functional decline in men age 65 and older on Medicare. Body mass index is weight in kilograms divided by height in meters squared. Mortality rates are adjusted for age and smoking. (Redrawn with permission from Wee CC, Huskey KW, Ngo LH, et al. Obesity, race and risk for death or functional decline among Medicare beneficiaries. A cohort study. Ann Intern Med 2011;154:645–655.)
[Horizontal axis, both panels: body mass index (kg/m2) in categories <18.5, 18.5–21.9, 22.0–24.9, 25.0–27.4, 27.5–29.9, 30.0–34.9, ≥35.0. Panel A: mortality rate. Panel B: men with functional decline.]
distinguish between clinical findings that will and will not improve with treatment. Modern technology, especially newer imaging techniques, is now able to detect abnormalities in patients so well that it is not always clear whether what is found is related to the patient’s complaint. The result is an increasingly common dilemma for both patients and clinicians.

Example
Magnetic resonance imaging (MRI) of the knee is frequently performed in middle-aged and elderly patients presenting

Another reason to limit the definition of “abnormal” to a condition that is treatable is that not every condition conferring an increased risk can be successfully treated: The removal of the condition may not remove risk, either because the condition itself is not a cause of disease but is only related to a cause or because irreversible damage has already occurred. To label people abnormal can cause worry and a sense of vulnerability that may not be justified for some health problems if treatment cannot improve the outlook.
What is considered treatable changes with time. At their best, therapeutic decisions are grounded on evidence from well-conducted clinical trials (Chapter 9). As new knowledge is acquired from the results of clinical trials, the level at which treatment is considered useful may change.

Example
Folic acid, a vitamin that occurs mainly in green leafy veget

REGRESSION TO THE MEAN

When clinicians encounter an unexpectedly abnormal test result, they tend to repeat the test. Often, the second test result is closer to normal. Why does this happen? Should it be reassuring?
Patients selected because they represent an extreme value in a distribution can be expected, on average, to have less extreme values on subsequent measurements. This phenomenon, called regression to the mean, occurs for purely statistical reasons, not because the patients have necessarily improved.
Regression to the mean arises in the following way (Fig. 3.11): People are selected for inclusion in a study or for further diagnosis or treatment because their initial measurement for a trait falls beyond an arbitrarily selected cutoff point in the tail of a distribution of values for all the patients examined. Some of these people will remain above the cutoff point on subsequent measurements because their true values are usually higher than average, but others who were found to have values above the cutoff point during the initial screening usually have lower values on retesting. They were selected only because they happened, through random variation, to have a high value at the time they were first measured. When the measurement is made again, these people have lower values than they had during the first screening. This phenomenon tends to drag down the mean value of the subgroup originally found to have values above the cutoff point.

Figure 3.11 ■ Regression to the mean.
[Panels: first testing of the population, with mean; patients with high values; patients retested, with mean.]

Thus, patients who are singled out from others because of a laboratory test result that is unusually high or low can be expected, on average, to be closer to the center of the distribution if the test is repeated. Moreover, subsequent values are likely to be more accurate estimates of the true value, which could be obtained if measurements were repeated for a particular patient many times. Therefore, the time-honored practice of repeating laboratory tests that are found to be abnormal and of considering the second one the correct result is not merely wishful thinking. It has a sound theoretical basis. It also has an empirical basis. For example, in a study of liver function tests (aspartate aminotransferase, alanine aminotransferase, alkaline phosphatase, gamma-glutamyltransferase, and bilirubin) in a cross-section of the U.S. population, 12% to 38% of participants who had abnormally high values on initial testing had normal values on retesting (9). However, the more extreme the initial reading was (more than two times the normal range), the more likely the repeat test would remain abnormal. For participants with normal values on initial testing, values on retesting remained normal in more than 95%.
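Regression to the mean can be reproduced with a short simulation. In the sketch below, each person has a stable true value plus independent random measurement variation at each test; people are selected because their first result exceeds a cutoff and are then retested. The population mean of 100, the standard deviations of 10, and the cutoff of 120 are hypothetical values, not data from the liver function study cited above.

```python
import random
import statistics

CUTOFF = 120.0  # hypothetical cutoff defining an "abnormally high" result


def simulate_retest(n_people=50_000, seed=1):
    """Return first and second results for people selected on a high first result."""
    rng = random.Random(seed)
    first_selected, second_selected = [], []
    for _ in range(n_people):
        true_value = rng.gauss(100.0, 10.0)          # stable true value
        first = true_value + rng.gauss(0.0, 10.0)    # first measurement
        if first > CUTOFF:
            second = true_value + rng.gauss(0.0, 10.0)  # independent retest
            first_selected.append(first)
            second_selected.append(second)
    return first_selected, second_selected


first, second = simulate_retest()
mean_first = statistics.fmean(first)
mean_second = statistics.fmean(second)
share_below_cutoff = sum(s <= CUTOFF for s in second) / len(second)

print(f"Selected on first test: {len(first)} people")
print(f"Mean first result:  {mean_first:.1f}")
print(f"Mean retest result: {mean_second:.1f}")
print(f"Share no longer above cutoff on retest: {share_below_cutoff:.0%}")
```

Nobody in this simulation is treated or changes between tests, yet the selected group's mean moves toward the population mean on retesting while staying above it, because their true values are, on average, higher than usual.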

Review Questions

For each of the numbered clinical scenarios below (3.1–3.5), select from the lettered options the most appropriate term for the type of data.

3.1. Deep tendon reflex grade 0 (no response), 1 (somewhat diminished), 2 (normal), 3 (brisker than average), and 4 (very brisk).
A. Interval (continuous)
B. Dichotomous
C. Nominal
D. Ordinal
E. Interval (discrete)

3.2. Cancer recurrent/not recurrent 5 years after initial treatment.
A. Interval (continuous)
B. Dichotomous
C. Nominal
D. Ordinal
E. Interval (discrete)

3.3. Serum sodium 139 mg/dL.
A. Interval (continuous)
B. Dichotomous
C. Nominal
D. Ordinal
E. Interval (discrete)

3.4. Three seizures per month.
A. Interval (continuous)
B. Dichotomous
C. Nominal
D. Ordinal
E. Interval (discrete)

3.5. Causes of upper gastrointestinal bleeding: duodenal ulcer, gastritis, esophageal, or other varices.
A. Interval (continuous)
B. Dichotomous
C. Nominal
D. Ordinal
E. Interval (discrete)

For questions 3.6–3.10, choose the best answer.

3.6. When it is not possible to verify measurement of a phenomenon, such as itching, by the physical senses, which of the following can be said about its validity?
A. It is questionable, and one should rely on “hard” measures such as laboratory tests.
B. It can be established by showing that the same value is obtained when the measurement is repeated by many different observers at different times.
C. It can be supported by showing that the measurement is related to other measures of phenomena such as the presence of diseases that are known to cause itching.
D. It can be established by showing that measurement results in a broad range of values.
E. It cannot be established.

3.7. A physician or nurse measures a patient’s heart rate by feeling the pulse for 10 seconds each time she comes to clinic. The rates might differ from visit to visit because of all the following except:
A. The patient has a different pulse rate at different times.
B. The measurement may misrepresent the true pulse by chance because of the brief period of observation.
C. The physician and nurse use different techniques (e.g., different degrees of pressure on the pulse).
D. The pulse rate varies among patients.
E. An effective treatment was begun between visits.

3.8. “Abnormal” is commonly defined by all of the following except:
A. The level at which treatment has been shown to be effective
B. The level at which death rate is increased
C. Statistically unusual values
D. Values that do not correspond to a normal distribution
E. The level at which there is an increased risk of symptoms

3.9. All of the following statements are true except:
A. The normal distribution describes the distribution of most naturally occurring phenomena.
B. The normal distribution includes 2.5% of people in each tail of the distribution (beyond 2 standard deviations from the mean).
C. The normal distribution is unimodal and symmetrical.
D. The normal distribution is the most common basis for defining abnormal laboratory tests measured on interval scales.

3.10. You see a new patient who is a 71-year-old woman on no medicines and without history of heart disease in herself or her family. She has never smoked and is not diabetic. Her blood pressure is 115/75 mm Hg, and she is about 15 pounds overweight. A total cholesterol test done 2 days ago was high at 250 mg/dL and the HDL cholesterol was 59 mg/dL. The Framingham risk calculator estimates that the patient’s risk of developing general cardiovascular disease in the next 10 years is 9%. You know that treating cholesterol at the level found reduces cardiovascular risk. The patient wants to know if she should start taking a statin. Which of the following statements is least correct?
A. The patient is likely to have a lower serum cholesterol the next time it is measured.
B. The estimation of a 9% probability of cardiovascular disease in the next 10 years could be influenced by chance.
C. The patient should be given a prescription for a statin to lower her risk of coronary heart disease.
Figure 3.12 ■ The distribution of birth weights of full-term babies born to non-diabetic mothers. (Redrawn with permission from Ludwig DS, Currie J. The association between pregnancy weight gain and birthweight: a within-family comparison. Lancet 2010;376:984–990.)
[Horizontal axis: birthweight (g), 0 to 8,000. Vertical axis: proportion of infants.]

Figure 3.12 shows the distribution of birthweight of more than a million full-term babies born to mothers without diabetes. Questions 3.11–3.13 relate to the figure. For each question, choose the best answer.

3.11. Which statement about the central tendency is incorrect?
A. The mean birthweight is below 4,000 g.
B. There is more than one mode birth weight.
C. Mean and median birth weights are similar.

3.12. Which statement about dispersion is most correct?
A. The range is the best way to describe the babies’ birth weights.
B. Standard deviations should not be calculated because the distribution is skewed.
C. Ninety-five percent of the birth weights will fall within about 2 standard deviations of the mean.

3.13. One standard deviation of babies’ birth weights encompasses approximately:
A. Weights from 2,000 to 4,000 g
B. Weights from 3,000 to 4,000 g
C. Weights from 2,000 to 6,000 g

Figure 3.13 ■ Observer variability. Comparing fetal heart auscultation to electronic monitoring of fetal heart rate. (Redrawn with permission from Day E, Maddem L, Wood C. Auscultation of foetal heart rate: an assessment of its error and significance. Br Med J 1968;4:422–424.)
[Panel A: monitored fetal heart rate 130–150. Panel B: monitored fetal heart rate <130. Panel C: monitored fetal heart rate >150. Horizontal axis, all panels: error (beats/min), from underestimate to overestimate.]

Figure 3.13 compares fetal heart rates measured by electronic monitoring (white middle bar) to measurements by hospital staff in three different circumstances: when the fetal heart rate by electronic monitoring, in beats per minute, was normal (130–150), low (<130), and high (>150). Questions 3.14–3.16 relate to the figure. For each question, choose the best answer.

3.14. The distribution of hospital staff measurements around the electronic monitor measurement in Panel A of Figure 3.13 could be due to:
A. Chance
B. Inter-observer variability
C. Biased preference for normal results
D. A and B
E. A and C
F. B and C
G. A, B, and C

3.15. The distribution of hospital staff measurements around the electronic monitor measurement in Panel B of Figure 3.13 could be due to:
A. Chance
B. Inter-observer variability
C. Biased preference for normal results
D. A and B
E. A and C
F. B and C
G. A, B, and C

3.16. The distribution of hospital staff measurements around the electronic monitor measurement in Panel C of Figure 3.13 could be due to:
A. Chance
B. Inter-observer variability
C. Biased preference for normal results
D. A and B
E. A and C
F. B and C
G. A, B, and C

Answers are in Appendix A.

REFERENCES
1. Sharma P, Parekh A, Uhl K. An innovative approach to determine fetal risk: the FDA Office of Women’s Health pregnancy exposure registry web listing. Womens Health Issues 2008;18:226–228.
2. Feinstein AR. The need for humanized science in evaluating medication. Lancet 1972;2:421–423.
3. Rubenfeld GD, Caldwell E, Granton J, et al. Interobserver variability in applying a radiographic definition for ARDS. Chest 1999;116:1347–1353.
4. Morganroth J, Michelson EL, Horowitz LN, et al. Limitations of routine long-term electrocardiographic monitoring to assess ventricular ectopic frequency. Circulation 1978;58:408–414.
5. Elveback LR, Guillier CL, Keating FR. Health, normality, and the ghost of Gauss. JAMA 1970;211:69–75.
6. Prospective Studies Collaboration. Age-specific relevance of usual blood pressure to vascular mortality: a meta-analysis of individual data for one million adults in 61 prospective studies. Lancet 2002;360:1903–1913.
7. Wee CC, Huskey KW, Ngo LH, et al. Obesity, race and risk for death or functional decline among Medicare beneficiaries. A cohort study. Ann Intern Med 2011;154:645–655.
8. Englund M, Guermazi A, Gale D. Incidental meniscal findings on knee MRI in middle-aged and elderly persons. N Engl J Med 2008;359:1108–1115.
9. Lazo M, Selvin E, Clark JM. Brief communication: clinical implications of short-term variability in liver function test results. Ann Intern Med 2008;148:348–352.

Chapter 4

Risk: Basic Principles


The lesson . . . is that a large number of people at a small risk may give rise to more
cases of disease than the small number who are at a high risk.
—Geoffrey Rose
1985

KEY WORDS

Risk
Risk factor
Exposure
Latency period
Immediate causes
Distant causes
Marker
Risk prediction model
Risk prediction tool
Risk stratification
Calibration
Discrimination
Concordance statistic
C-statistic
Sensitivity
Specificity
Receiver operating characteristic (ROC) curve

Risk generally refers to the probability of some untoward event. In medicine, clinicians deal with probabilities in virtually every patient encounter. They work with basic principles of risk whether they are diagnosing a complaint, describing prognosis, deciding on treatment, or discussing prevention with the patient. Patient encounters no longer deal only with the patient’s complaints, but increasingly involve checking for risk factors that might lead to poor health in the future. In medical research, more and more effort is devoted to elucidating risk factors for diseases. A major rationale for the Human Genome Project is to predict diseases that will become manifest over a person’s lifetime. Clinical journals publish epidemiologic studies investigating possible health risks. All this effort has improved our understanding about risks to health and how to study them, and has helped improve the health of patients and populations, sometimes in dramatic ways. For example, research that led to the understanding that smoking, hypertension, and hyperlipidemia increase the risk of cardiovascular disease (CVD) has helped decrease cardiovascular mortality in the United States by half over the past several decades.
People have a strong interest in their risk of disease, a concern reflected in television and newspaper headlines, and the many Web sites and popular books about risk reduction. The risk of breast and prostate cancer, heart disease and stroke, Alzheimer disease, autism, and osteoporosis are examples of topics in which the public has developed a strong interest. Discerning patients want to know their individual risks and how to reduce them.
In this chapter, we will concentrate on the underlying principles about risk, important regardless of where risk confronts the clinician and patient along the spectrum of care. As much as possible, we will deal with concepts rather than terminology and methods, covering those important details in later chapters. Chapters 5 and 6 deal with risks for adverse health effects in the distant future, often years or even decades away; they describe scientific methods used to indicate the probability that people who are exposed to certain “risk factors” will subsequently develop a particular disease or bad health outcome more often than similar people who are not exposed. In acute care settings such as intensive care units, emergency rooms, and hospital wards, the concern is about risks patients with known disease might or might not experience, termed prognosis (Chapter 7); the time horizon of prognosis spans from minutes and hours to months and years, depending on the study. Chapters 8, 9, and 10 revisit risk as it relates to diagnosis, treatment, and prevention. In each case, the approach to assessing risk is somewhat different. However, the fundamental principles of determining risks to health are similar.

RISK FACTORS

Characteristics associated with an increased risk of becoming diseased are called risk factors. Some risk factors are inherited. For example, having the haplotype HLA-B27 greatly increases one’s risk of acquiring the spondyloarthropathies. Work on the Human Genome Project has identified many other diseases for which specific genes are risk factors, including colon and breast cancer, osteoporosis, and amyotrophic lateral sclerosis. Other risk factors, such as infectious agents, drugs, and toxins, are found in the physical environment. Still others are part of the social environment. For example, bereavement after the loss of a spouse, change in daily routines, and crowding all have been shown to increase rates of disease, not only for emotional illness but for physical illness as well. Some of the most powerful risk factors are behavioral; examples include smoking, drinking alcohol to excess, driving without seat belts, engaging in unsafe sex, eating too much, and exercising too little.
Exposure to a risk factor means that a person, before becoming ill, has come in contact with or has manifested the factor in question. Exposure can take place at a single point in time, as when a community is exposed to radiation during a nuclear accident. More often, however, contact with risk factors for chronic disease takes place over a period of time. Cigarette smoking, hypertension, sexual promiscuity, and sun exposure are examples of risk factors, with the risk of disease being more likely to occur with prolonged exposure.
There are several different ways of characterizing the amount of exposure or contact with a putative risk factor: ever been exposed, current dose, largest dose taken, total cumulative dose, years of exposure, years since first exposure, and so on. Although the various measures of dose tend to be related to one another, some may show an exposure–disease relationship, whereas others may not. For example, cumulative doses of sun exposure constitute a risk factor for non-melanoma skin cancer, whereas episodes of severe sunburn are a better predictor of melanoma. If the correct measure is not chosen, an association between a risk factor and disease may not be evident. Choice of an appropriate measure of exposure to a risk factor is usually based on all that is known about the clinical and biologic effects of the exposure, the pathophysiology of the disease, and epidemiologic studies.
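The different ways of characterizing exposure can be made concrete with a small sketch. The function below takes one person's yearly dose history and derives the measures listed above. The data layout (a mapping from calendar year to dose), the function name, and the example numbers are all hypothetical, chosen only to show that the same history yields quite different summaries.

```python
def summarize_exposure(dose_by_year, as_of_year):
    """Derive several exposure measures from a yearly dose history.

    dose_by_year: dict mapping calendar year -> dose in that year
                  (hypothetical layout; units are whatever the study uses).
    as_of_year:   the year in which exposure is being assessed.
    """
    exposed_years = sorted(y for y, d in dose_by_year.items() if d > 0)
    ever_exposed = bool(exposed_years)
    return {
        "ever_exposed": ever_exposed,
        "current_dose": dose_by_year.get(as_of_year, 0),
        "largest_dose": max(dose_by_year.values(), default=0),
        "cumulative_dose": sum(dose_by_year.values()),
        "years_of_exposure": len(exposed_years),
        "years_since_first_exposure": (
            as_of_year - exposed_years[0] if ever_exposed else None
        ),
    }


# Hypothetical history: exposure in 1990, 1991, and 1993, none afterward.
history = {1990: 10, 1991: 20, 1992: 0, 1993: 5}
summary = summarize_exposure(history, as_of_year=2000)
for measure, value in summary.items():
    print(f"{measure}: {value}")
```

A person like this one counts as "ever exposed" with a cumulative dose of 35 units but has a current dose of zero, so a study that measured only current dose would classify the person as unexposed.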

RECOGNIZING RISK
Risk factors associated with large effects that occur rapidly after exposure are easy for anyone to recognize. It is not difficult to appreciate the relationship between exposure and medical conditions such as sunburn and aspirin overdose, or the poor prognosis of hypotension at the onset of myocardial infarction, because the deleterious effect follows exposure relatively rapidly and is easy to see.
The sudden increase of a rare disease, or the dramatic clinical presentation of a new disease, is also easy to recognize and invites efforts to find a cause. AIDS was such an unusual syndrome that the appearance of just a few cases raised suspicion that some new agent (as it turned out, a retrovirus) might be responsible, a suspicion confirmed relatively quickly after the first cases of the disease. A previously unidentified coronavirus was confirmed as the cause of severe acute respiratory syndrome (SARS) in a matter of weeks after the first reported cases of the highly lethal infection in 2003. Similarly, decades ago, physicians quickly noticed when several cases of carcinoma of the vagina, a very rare condition, began appearing. A careful search for an explanation was undertaken, and maternal exposure to diethylstilbestrol (a hormone used to stabilize pregnancies in women with a history of miscarriage) was found.
Most morbidity or mortality, however, is caused by chronic diseases for which the relationship between exposure and disease is far less obvious. It is usually impossible for individual clinicians, however astute, to recognize risk factors for chronic disease based on their own experiences with patients. This is true for several reasons, which are discussed in the following pages.

Long Latency
Many chronic diseases have a long latency period between exposure to a risk factor and the first manifestations of disease. Radiation exposure in childhood, for example, increases the risk for thyroid cancer in adults decades later. Similarly, hypertension precedes heart disease by decades, and calcium intake in young and middle-aged women affects osteoporosis and fracture rates in old age. When patients experience the consequence of exposure to a risk factor years later, the original exposure may be all but forgotten and the link between exposure and disease obscured.

Immediate Versus Distant Causes


The search for risk factors usually is a search for causes of disease. In clinical medicine, physicians are more interested in immediate causes of disease (infectious, physiologic, or anatomic changes leading to sickness), such as a coronavirus causing SARS or hypocalcemia leading to seizures. But distant causes, more remote causes, may be important in the causal pathway. For example, lack of maternal education is a risk factor for low-birth-weight infants. Other factors related to education, such as poor nutrition, less prenatal care, cigarette smoking, and the like are more direct causes of low birth weight. Nevertheless, studies in India have shown that improving maternal education lowers infant mortality.

Common Exposure to Risk Factors

Many risk factors, such as cigarette smoking or eating a diet high in sugar, salt, and fat, have become so common in Western societies that for many years their dangers went unrecognized. Only by comparing patterns of disease among people with and without these risk factors, using cross-national studies or investigating special subgroups (Mormons, for example, who do not smoke, or vegetarians who eat diets low in cholesterol), were risks recognized that were, in fact, large. It is now clear that about half of lifetime users of tobacco will die because of their habit; if current smoking patterns persist, it is predicted that in the 21st century, more than 1 billion deaths globally will be attributed to smoking (1).
A relationship between the sleeping position of babies and the occurrence of sudden infant death syndrome (SIDS) is another example of a common exposure to a risk factor and the dramatic effect associated with its frequency, an association that went unrecognized until relatively recently.

Example
SIDS, the sudden, unexplained death of an infant younger than 1 year of age, is a leading cause of infant mortality. Studie
40% drop in the number of SIDS cases (2). Ongoing research led to evidence that side positioning of babies also increased the risk of SIDS (3), and the American Academy of Pediatrics updated its recommendations in 2005 to make it clear that side sleeping was no longer recommended.

Low Incidence of Disease

The incidence of most diseases, even ones thought to be “common,” is actually uncommon. Thus, although lung cancer is the most common cause of cancer deaths in Americans, and people who smoke are as much as 20 times more likely to develop lung cancer than those who do not smoke, the yearly incidence of lung cancer in people who have smoked heavily for 30 years is 2 to 3 per 1,000. In the average physician’s practice, years may pass between new cases of lung cancer. It is difficult for the average clinician to draw conclusions about risks from such infrequent events.

Small Risk

The effects of many risk factors for chronic disease are small. To detect a small risk, a large number of people must be studied to observe a difference in disease rates between exposed and unexposed persons. For example, drinking alcohol has been known to increase the risk of breast cancer, but it was less clear whether low levels of consumption, such as drinking just one glass of wine or its equivalent a day, conferred risk. A study of 2,400,000 woman-years was needed to find that women who averaged a glass a day increased their risk of developing breast cancer by 15% (4). Because of the large number of woman-years in the study, chance is an unlikely explanation for the result, but even so, such a small effect could be due to bias. In contrast, it is not controversial that hepatitis B infection is a risk factor for hepatoma, because people with certain types of serologic evidence of hepatitis B infection are up to 60 times (not just 1.15 times) more likely to develop liver cancer than those without it.
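Why a small relative risk demands so many person-years can be sketched with a standard normal approximation for comparing two incidence rates. The sketch below assumes equal person-time in the exposed and unexposed groups, a two-sided alpha of 0.05, and 80% power, and tests the log of the rate ratio; the baseline incidence of 3 per 1,000 person-years is a hypothetical figure. It is an illustration of the principle only and does not reproduce the design or analysis of the study cited above.

```python
import math
from statistics import NormalDist


def events_needed_unexposed(rate_ratio, alpha=0.05, power=0.80):
    """Approximate number of cases needed in the unexposed group.

    Assumes equal person-time in both groups and a normal approximation
    for log(rate ratio), whose variance is 1/d_exposed + 1/d_unexposed.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return (
        (1 + 1 / rate_ratio)
        * (z_alpha + z_beta) ** 2
        / math.log(rate_ratio) ** 2
    )


def person_years_per_group(rate_ratio, baseline_rate, alpha=0.05, power=0.80):
    """Person-years needed in each group, given the unexposed incidence rate."""
    return events_needed_unexposed(rate_ratio, alpha, power) / baseline_rate


BASELINE_RATE = 3 / 1000  # hypothetical incidence per person-year

for rr in (1.15, 2.0, 60.0):
    events = events_needed_unexposed(rr)
    py = person_years_per_group(rr, BASELINE_RATE)
    print(f"Rate ratio {rr:>5}: about {events:,.1f} unexposed cases, "
          f"{py:,.0f} person-years per group")
```

Under these assumptions, a rate ratio of 1.15 requires hundreds of cases in each group, whereas a rate ratio as large as 60 would be apparent after a handful of cases, which is the contrast drawn in the text between alcohol and breast cancer and hepatitis B and liver cancer.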

Multiple Causes and Multiple Effects
There is usually not a close, one-to-one relationship between a risk factor and a particular disease. A given risk factor may contribute to many diseases, and a disease may have multiple causes. The relationship between hypertension and congestive heart failure is an example (Fig. 4.1). Some people with hypertension develop congestive heart failure, and many do not. Also, many people who do not have hypertension develop congestive heart failure because there are other causes. The relationship is also difficult to recognize because hypertension causes several diseases other than congestive heart failure. Thus, although hypertension is the third leading cause of congestive heart failure, physicians were not aware of this relationship until the 1970s, when adequate evidence became available after careful study of large numbers of people over many years.

[Figure 4.1 diagram labels. RISK FACTORS: hypercholesterolemia, positive family history, thiamine deficiency, valvular disease, viral infection, smoking, diabetes, alcohol, high blood pressure. DISEASES: congestive heart failure, coronary atherosclerosis, stroke, renal failure, myocardial infarction.]
Figure 4.1 ■ Relationship between risk factors and disease: hypertension and congestive heart failure. Hypertension causes many diseases, including congestive heart failure, and congestive heart failure has many causes, including hypertension.

Risk Factors May or May Not Be Causal
Just because risk factors predict disease, it does not necessarily follow that they cause disease. A risk factor may predict a disease outcome indirectly, by virtue of an association with some other variable that actually is a determinant of disease. That is, the risk factor is confounded with a truly causal factor.
A risk factor that is not a cause of disease is called a marker of disease, because it “marks” the increased probability of disease. Not being a cause does not diminish the value of a risk factor as a way of predicting the probability of disease, but it does imply that removing the risk factor might not remove the excess risk associated with it.

Example
Homocystinuria, a rare pediatric disease caused by autosomal

There are several ways of deciding whether a risk factor is a cause or merely a marker for disease. These are covered in Chapter 5.
For all these reasons, individual clinicians are rarely in a position to recognize, let alone confirm, associations between exposure and chronic diseases. They may notice an association when a dramatic
disease occurs quickly after an unusual exposure, but most diseases and most exposures do not conform to such a pattern. For accurate information about risk, clinicians must turn to the medical literature, particularly to carefully constructed studies that involve a large number of patients.

risk factor for a disease improves the ability to predict disease, that is, improves risk stratification.

Example
CVD is the most common cause of death globally. In the United

PREDICTING RISK

A single powerful risk factor, as in the case of hepatitis B and hepatocellular cancer, can be very helpful clinically, but most risk factors are not strong. If drinking a glass of wine a day increases the risk of breast cancer by 15%, and the average 10-year risk of developing breast cancer for women in their 40s is 1 in 69 (or 1.45%), having a glass of wine with dinner would increase the 10-year risk to about 1 in 60 (or 1.67%). Some women might not think such a difference in the risk of breast cancer is meaningful.
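
The arithmetic behind this example can be sketched in a few lines. The baseline risk and the 15% relative increase are the figures quoted above; the function and variable names are illustrative only.

```python
def risk_with_exposure(baseline_risk, relative_increase):
    """Absolute risk after applying a relative increase to a baseline risk."""
    return baseline_risk * (1 + relative_increase)


def as_one_in_n(risk):
    """Express a probability as '1 in N', rounded to the nearest whole N."""
    return round(1 / risk)


baseline = 0.0145                              # 10-year risk, women in their 40s
exposed = risk_with_exposure(baseline, 0.15)   # 15% relative increase

print(f"baseline: {baseline:.2%}, about 1 in {as_one_in_n(baseline)}")
print(f"exposed:  {exposed:.2%}, about 1 in {as_one_in_n(exposed)}")
print(f"absolute difference: {exposed - baseline:.2%}")
```

Writing the relative increase and the absolute difference side by side shows why a 15% rise in risk can amount to only a fraction of a percentage point for an individual woman.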

Combining Multiple Risk Factors to Predict Risk
Because most chronic diseases are caused by several relatively weak risk factors acting together, statistically combining their effects can produce a more powerful prediction of risk than considering one risk factor at a time. Statistically combining risk factors produces a risk prediction model or a risk prediction tool (also sometimes called a clinical prediction tool or a risk assessment tool). Risk prediction tools are increasingly common in clinical medicine; well-known models used for long-term predictions include the Framingham Risk Score for predicting cardiovascular events and the National Cancer Institute’s Breast Cancer Risk Assessment Tool for predicting breast cancer occurrence. Shorter-term hospital risk prediction tools include the Patient At Risk of Re-admission Scores (PARR) and the Critical Care Early Warning Scores. Prediction tools have also combined diagnostic test results, for example, to diagnose acute myocardial infarction when a patient presents with chest pain, or for diagnosing the occurrence of pulmonary embolism. The statistical methods used to combine multiple risk factors are discussed in Chapter 11.
Risk prediction models help with two important clinical activities. First, a good risk prediction model aids risk stratification, dividing groups of people into subgroups with different risk levels (e.g., low, medium, and high). Using the risk stratification approach can also help determine whether adding a newly proposed

Even if a risk factor improves a risk prediction model, a clinical trial is necessary that demonstrates lowering or removing the risk factor protects patients. In the case of CRP, such a trial has not yet been reported, so it is possible that it is a marker rather than a causal factor for CVD.

Risk Prediction in Individual Patients and Groups
Risk prediction tools are often used to predict the future for individuals, with the hope that each person will know his or her risks, a hope summarized by the term “personalized medicine.” As an example, Table 4.1 summarizes the information used in
[Figure 4.2 bar chart. Vertical axis: women correctly predicted by model, 0 to 100. Horizontal axis: cardiovascular risk over 10 years (%), in strata <5, 5 to <10, 10 to <20, and ≥20. Paired bars compare the models with CRP and without CRP. Bar values as printed: 99, 100, 100, 98, 72, 53, 35, 9. Number of women per stratum: 6,965; 633; 248; 65.]
Figure 4.2 ■ Effect of adding a new risk factor to a risk prediction model. Comparison of risk prediction models for CVD over 10 years among 7,911 non-diabetic women, with and without CRP as a risk factor. Adding CRP into the risk model improved risk stratification of the women, especially to strata at higher risk by the model without CRP. (Data from Ridker PM, Buring JE, Rifai N, et al. Development and validation of improved algorithms for the assessment of global cardiovascular risk in women. JAMA 2007;297:611–619.)
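
As a quick check on the figure, the stratum counts printed beneath it can be tallied to confirm the cohort size and to see how unevenly the women are spread across the risk strata. This is only a sketch: the counts are those listed under Figure 4.2, and the variable names are illustrative.

```python
# Number of women in each 10-year cardiovascular risk stratum (Figure 4.2).
strata = {"<5%": 6965, "5 to <10%": 633, "10 to <20%": 248, ">=20%": 65}

total = sum(strata.values())
shares = {label: count / total for label, count in strata.items()}

print(f"total women: {total:,}")
for label, share in shares.items():
    print(f"{label:>10}: {share:.1%} of the cohort")
```

The tally matches the 7,911 women named in the caption, and it shows that the great majority fall in the lowest risk stratum.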

the National Cancer Institute (NCI) Breast Cancer Risk Assessment Tool. A woman or her clinician enters information, and the tool calculates a 5-year and lifetime (to age 90) risk of developing breast cancer. However, it turns out that predicting what will happen in a single individual is much more difficult than prediction in a group of similar people.

Table 4.1
Example of a Risk Prediction Tool: The NCI Breast Cancer Risk Assessment Toolᵃ
Risk Factors Included in the Modelᵇ
1. What is the woman’s age?
2. What was the woman’s age at the time of her first menstrual period?
3. What was the woman’s age at the time of her first live birth of a child?
4. How many of the woman’s first-degree relatives—mother, sisters, daughters—have had breast cancer?
5. Has the woman ever had a breast biopsy?
5a. How many breast biopsies (positive or negative) has the woman had?
5b. Has the woman had at least one breast biopsy with atypical hyperplasia?
6. What is the woman’s race/ethnicity?
6a. What is the
ᵃ The risk assessment tool is not for women with a history of breast cancer, ductal carcinoma in situ (DCIS), or lobular carcinoma in situ (LCIS).
ᵇ A woman (or her clinician) chooses answers to each question from a drop-down menu.
Available at http://www.cancer.gov/bcrisktool/.

First, because predictions are expressed as probabilities of future events, there is a basic incompatibility between the incidence of a disease over 5 years (say, 15%) in a group of people and the chance that an individual in the group will develop the disease. A single person will either develop the disease or not. (You cannot be “somewhat pregnant.”) So, in a sense, the average of the group is always wrong for an individual because the two are expressed in different terms, a probability determined by what happened to a group in the past versus the prospective prediction of presence or absence of disease in an individual.
Second, the presence of even a strong risk factor does not necessarily mean that an individual is very likely to get the disease. As pointed out earlier in this chapter, many years of smoking can increase a smoker’s risk of lung cancer approximately 20-fold
compared with non-smokers. Even so, the smoker has about a 1 in 10 chance of developing lung cancer in the next 10 years. Most risk factors (and risk prediction tools) for most diseases are much weaker than the risk of lung cancer with smoking.

EVALUATING RISK PREDICTION TOOLS
Determining how well a particular risk prediction tool works is done by asking two questions: (i) how accurately does the tool predict the proportion of different groups of people who will develop the disease (calibration), and (ii) how accurately does it identify individuals who will and will not develop the disease (discrimination)? To answer these questions, the tool is tested on a large group of people who have been followed for several years (sometimes, decades) with known outcomes of disease for each person in the group.

Calibration
Calibration, determining how well a prediction tool correctly predicts the proportion of a group who will develop disease, is conceptually and operationally simple. It is measured by comparing the number of people in a group predicted or estimated (E) by the prediction tool to develop disease to the number who are observed (O) to develop the disease. Ratios of E/O close to 1.0 mean the risk tool is well calibrated: it predicts a proportion of people that is very close to the actual proportion that develops the disease. Evaluations of the NCI breast cancer risk assessment tool have found it is highly accurate in predicting the proportion of women in a group who will develop breast cancer in the next 5 years, with E/O ratios close to 1.0.

Discrimination
Discriminating among individuals in a group who will and will not develop disease is difficult, even for well-calibrated risk tools. The most common method used to measure discrimination accuracy is to calculate a concordance statistic (often shortened to c-statistic). It estimates how often in pairs of randomly selected individuals, one of whom went on to develop the disease of interest and one of whom did not, the risk prediction score was higher for the one who developed disease. If the risk prediction tool did not improve prediction at all, the resulting estimate would be like a coin toss and the c-statistic would be 0.50. If the risk prediction tool worked perfectly, so that in every pair the diseased individual had a higher score than the non-diseased individual, the c-statistic would be 1.0. In one study assessing discrimination of the NCI breast cancer risk tool, the c-statistic was calculated as 0.58 (9). It is clear that this is not a high c-statistic, but just what values between 0.5 and 1.0 mean is difficult to understand clinically.
The clearest (although rarest) method to understand how well a risk prediction model discriminates is to compare visually the predictions for individuals to the observed results for all individuals in the study. Figure 4.3A illustrates perfect discrimination by a hypothetical risk prediction tool; the tool completely separates people destined to develop disease from those destined not to develop disease. Figure 4.3B illustrates the ability of the NCI breast cancer risk prediction tool to discriminate between women who subsequently did and did not develop breast cancer over a 5-year period and visually shows what a c-statistic of 0.58 means. Although the average risk scores are slightly higher for the women who developed breast cancer, and their curve on the graph is slightly to the right of those who did not develop breast cancer, the individual risk prediction scores of the two groups overlap substantially; there is no place along the x-axis of risk that separates women into groups who did and did not develop breast cancer. This is so even though the calibration of the model was very good.

Sensitivity and Specificity of a Risk Prediction Tool
Yet another way to assess a risk prediction tool’s ability to distinguish who will and will not develop disease is to determine its sensitivity and specificity (a topic that will be discussed more thoroughly in Chapters 8 and 10). Sensitivity of a risk prediction tool is the ability of the tool to identify those individuals destined to develop a disease and is expressed as the percentage of people destined to develop the disease whom the tool correctly identifies. A tool’s specificity is the ability to identify individuals who will not develop the disease, expressed as the percentage of people who will not develop the disease whom the tool correctly identifies. Looking at Figure 4.3, a 5-year risk of 1.67% was chosen as a cut point between “low” and “high” risk. Using that cut point, the sensitivity was estimated as 44% (44% of women who developed breast cancer had a risk score ≥1.67%) and specificity was estimated as 66% (66% of women who did not develop breast cancer had a risk score <1.67%). In other words, the risk prediction tool missed more than half the women who developed breast cancer over a 5-year period,
while assigning about a third of women not destined to develop the disease to the high-risk group. Studies of prediction tools that include information about sensitivity and specificity often also display a receiver operating characteristic (ROC) curve, a method of analysis that combines the results of sensitivity and specificity and can be used to compare different tools. ROCs are discussed in detail in Chapter 8.

[Figure 4.3 plots. Panel A: hypothetical, non-overlapping distributions for women not developing breast cancer and women developing breast cancer; horizontal axis, 5-year risk of breast cancer diagnosis, from low risk to high risk; vertical axis, each group of women, 0.00 to 0.25. Panel B: overlapping distributions for women who did not develop breast cancer and women who developed breast cancer; horizontal axis, estimated 5-year risk of breast cancer diagnosis using the Gail et al. model, 0 to 0.1; vertical axis, each group of women, 0.00 to 0.25.]
Figure 4.3 ■ A. The ability of a hypothetically perfect breast cancer risk prediction tool to discriminate between women who did and did not develop breast cancer. The group on the left have low risk scores and did not develop breast cancer, whereas the group on the right have higher scores and did develop breast cancer. There is no overlap of the two groups and the c-statistic would be 1.0. (Redrawn with permission from Elmore JA, Fletcher SW. The risk of cancer risk prediction: “What is my risk of getting breast cancer?” J Natl Cancer Inst 2006;98:1673–1675.) B. The ability of an actual risk prediction tool to discriminate between women who did and did not develop breast cancer over a 5-year period. The risk scores of the two groups overlap substantially, with no place along the x-axis that separates women who did and did not develop breast cancer. (Redrawn with permission from Rockhill Levine B.)

Risk Stratification
As already mentioned, and as shown in Figure 4.2, risk stratification can be used to assess how well a risk prediction tool works and to determine whether adding a new risk factor improves the tool’s ability to classify people correctly into clinically meaningful risk groups. Better risk stratification improves a tool’s calibration. Risk stratification may not dramatically affect the tool’s discrimination ability. For example, examining Figure 4.2, the risk tool that included CRP correctly assigned 99% of 6,965 women to the lowest risk stratum (<5% CVD events over 10 years). The study found that CVD events occurred in 101 (1.4%) women assigned to the lowest risk stratum, a result consistent with <5%, but because the vast majority (88%) of women were assigned to the lowest risk stratum, more women in that group developed CVD (101) than in all the other risk groups combined (97). This result, similar to what happened with the breast cancer risk tool (Fig. 4.3B), is a common, frustrating occurrence with risk prediction.

Why Risk Prediction Tools Do Not Discriminate Well Among Individuals
Why is it that a risk prediction tool that predicts so well the proportion of a group of people who will develop disease does so poorly at discriminating between those individuals who will and will not develop disease? A major problem is the strength (or more correctly, the weakness) of the prediction tool. Discrimination requires a very strong risk factor (or combination of risk factors) to separate a group of people into those who will and will not develop a disease, with even moderate success. If people who will develop the disease are just two or three or even five times more likely to develop the disease, risk prediction tools will not discriminate well. They need to be many times (some authors suggested at least 200 times [10]) more likely to develop the disease. Few risk prediction rules are that powerful.
Another problem is that for most chronic diseases, risk factors are widely spread throughout the population. Thus, even people at low risk can develop the disease. Figure 4.3B shows that some women with the lowest risk score developed breast cancer. In fact, in absolute numbers more women with low scores developed breast cancer than those with high scores
because (thank goodness), in most groups of women, there are relatively few with high scores.
In summary, risk prediction models are an important way of combining individual risk factors to achieve better stratification of people into groups of graded risk, as illustrated in Figure 4.2. It is helpful for an individual patient and clinician to understand to which risk group a well-constructed risk model assigns the patient, but the limitations of the assignment should also be understood. In addition, it is important to keep in mind the counterintuitive fact that for most diseases, most of the people destined to develop a disease are not at high risk.

CLINICAL USES OF RISK FACTORS AND RISK PREDICTION TOOLS

Risk Factors and Pretest Probability for Diagnostic Testing
Knowledge of risk can be used in the diagnostic process because the presence of a risk factor increases the probability of disease. However, most risk factors (and even risk prediction tools) are limited in their ability to predict disease in symptomatic patients because they usually are not as strong a predictor of disease as are clinical findings of early disease. As stated by Geoffrey Rose (11):
Often the best predictor of future major diseases is the presence of existing minor disease. A low ventilatory function today is the best predictor of its future rate of decline. A high blood pressure today is the best predictor of its future rate of rise. Early coronary heart disease is better than all of the conventional risk factors as a predictor of future fatal disease.
As an example of Rose’s dictum, although age, male gender, smoking, hypertension, hyperlipidemia, and diabetes are important predictors for future coronary artery disease, they are far less important when evaluating a patient presenting to the emergency department with chest pain (12). The specifics of the clinical situation, such as presence and type of chest pain and results of an electrocardiogram, are the most powerful first steps in determining whether the patient is experiencing an acute myocardial infarction (13).
The absence of a very strong risk factor may help to rule out disease. Thus, it is reasonable to consider mesothelioma in the differential diagnosis of a pleural mass in a patient who is an asbestos worker. However, mesothelioma is a much less likely diagnosis for the patient who has never been exposed to asbestos.

Using Risk Factors to Choose Treatment
Risk factors have long been used in choosing (and developing) treatments. Patients with CVD who also have elevated lipids are treated with statins or other lipid lowering drugs. Specific treatments for hyperlipidemia and hypertension are highly effective treatments for diabetic patients with those conditions. In oncology, “targeted” therapies have been developed for certain cancers.

Example
The HER2 receptor is an epidermal growth factor receptor that is

Risk Stratification for Screening Programs
Knowledge of risk factors occasionally can be used to improve the efficiency of screening programs by selecting subgroups of patients at substantially increased risk. Although the risk for breast cancer associated with deleterious genetic mutations is very low in the general population, it is much higher in women with multiple close relatives who developed the disease at a relatively early age; blood tests screening for gene mutations are usually reserved for women whose family history indicates they are at substantially increased risk. Similarly, screening for colorectal cancer is recommended for the general population starting at age 50. However, people with a first-degree relative with a history of colorectal cancer are at increased risk for the disease, and expert groups suggest that screening these people should begin at age 40.

Removing Risk Factors to Prevent Disease
If a risk factor is also a cause of disease, removing it can prevent disease. Prevention can occur regardless of whether the mechanism by which the disease develops is known. Some of the classic successes in the history of epidemiology illustrate this point. Before bacteria were identified, John Snow noted in 1854 that an increased rate of cholera occurred among people drinking water supplied by a particular company, and the epidemic subsided after he cut off that supply. In the process, he established that cholera was spread by contaminated water supplies. In modern times, the same approach is used to investigate outbreaks of food-borne illnesses, to identify the source and take remedial action to stop the outbreak. Today, the biologic cause is quickly determined as well and helps to pinpoint the epidemic source. The concept of cause and its relationship to prevention is discussed in Chapter 12.

Review Questions
For questions 4.1–4.10, select the best answer.

4.1. In the mid-20th century, chest surgeons in Britain were impressed that they were operating on more men with lung cancer, most of whom were smoking. How might the surgeons’ impression that smoking was a risk factor for developing lung cancer have been wrong?
A. Smoking had become so common that more men would have a history of smoking, regardless of whether they were undergoing operations for lung cancer.
B. Lung cancer is an uncommon cancer, even among smokers.
C. Smoking confers a low risk of lung cancer.
D. There are other risk factors for lung cancer.

4.2. Risk factors are easier to recognize:
A. When exposure to a risk factor occurs a long time before the disease.
B. When exposure to the risk factor is associated with a new disease.
C. When the risk factor is a marker rather than a cause of disease.

4.3. Risk prediction models are useful for:
A. Predicting onset of disease
B. Diagnosing disease
C. Predicting prognosis
D. All of the above

4.4. Figure 4.2 shows:
A. The risk model incorporating CRP results assigned too many women to the intermediate risk strata.
B. The risk model incorporating CRP results predicts which individual women will develop CVD better than the risk model without CRP results.
C. The number of women developing CVD over 10 years is likely highest in the group with a risk of <5%.

4.5. A risk model for colon cancer estimates that one of your patients has a 2% chance of developing colorectal cancer in the next 5 years. In explaining this to your patient, which of the following statements is most correct?
A. Because colorectal cancer is the second most common non-skin cancer in men, he should be concerned about it.
B. The model shows that your patient will not develop colorectal cancer in the next 5 years.
C. The model shows that your patient is a member of a group of people in whom a very small number will develop colorectal cancer in the next 5 years.

4.6. In general, risk prediction tools are best at:
A. Predicting future disease in a given patient.
B. Predicting future disease in a group of patients.
C. Predicting which individuals will and will not develop disease.

4.7. When a risk factor is a marker for future disease:
A. The risk factor can help identify people at increased risk of developing the disease.
B. Removing the risk factor can help prevent the disease.
C. The risk factor is not confounding a true causal relationship.

4.8. A risk factor is generally least useful in:
A. The risk stratification process
B. Diagnosing a patient’s complaint
C. Preventing disease

4.9. Figure 4.3B shows that:
A. The risk model is well calibrated.
B. The risk model works well at stratifying women into different risk groups.
C. Most women developing breast cancer over 5 years are at higher risk.
D. The risk model does not discriminate very well.

4.10. It is difficult for risk models to determine which individuals will and will not develop disease for all of the following reasons except:
A. The combination of risk factors is not strongly related to disease.
B. The risk factors are common throughout a population.
C. The model is well calibrated.
D. Most people destined to develop the disease are not at high risk.

Answers are in Appendix A.

REFERENCES

1. Vineis P, Alavanja M, Buffler P, et al. Tobacco and cancer: recent epidemiological evidence. J Natl Cancer Inst 2004;96:99–106.
2. Willinger M, Hoffman HJ, Wu KT, et al. Factors associated with the transition to nonprone sleep positions of infants in the United States: The National Infant Sleep Position Study. JAMA 1998;280:329–335.
3. Li DK, Petitti DB, Willinger M, et al. Infant sleeping position and the risk of sudden infant death syndrome in California, 1997–2000. Am J Epidemiol 2003;157:446–455.
4. Chen WY, Rosner B, Hankinson SE, et al. Moderate alcohol consumption during adult life, drinking patterns and breast cancer risk. JAMA 2011;306:1884–1890.
5. Humphrey LL, Rongwei F, Rogers K, et al. Homocysteine level and coronary heart disease incidence: a systematic review and meta-analysis. Mayo Clin Proc 2008;83:1203–1212.
6. Martí-Carvajal AJ, Solà I, Lathyris D, et al. Homocysteine lowering interventions for preventing cardiovascular events. Cochrane Database Syst Rev 2009;4:CD006612.
7. Buckley DI, Fu R, Freeman M, et al. C-reactive protein as a risk factor for coronary heart disease: a systematic review and meta-analyses for the U.S. Preventive Services Task Force. Ann Intern Med 2009;151:483–495.
8. Ridker PM, Hennekens CH, Buring JE, et al. C-reactive protein and other markers of inflammation in the prediction of cardiovascular disease in women. N Engl J Med 2000;342:836–843.
9. Rockhill B, Spiegelman D, Byrne C, et al. Validation of the Gail et al. model of breast cancer risk prediction and implications for chemoprevention. J Natl Cancer Inst 2001;93:358–366.
10. Wald NJ, Hackshaw AK, Frost CD. When can a risk factor be used as a worthwhile screening test? BMJ 1999;319:1562–1565.
11. Rose G. Sick individuals and sick populations. Int J Epidemiol 1985;14:32–38.
12. Han JH, Lindsell CJ, Storrow AB, et al. The role of cardiac risk factor burden in diagnosing acute coronary syndromes in the emergency department setting. Ann Emerg Med 2007;49:145–152.
13. Panju AA, Hemmelgarn BR, Guyatt GH, et al. Is this patient having a myocardial infarction? The rational clinical examination. JAMA 1998;280:1256–1263.
14. Piccart-Gebhart MJ, Procter M, Leyland-Jones B, et al. Trastuzumab after adjuvant chemotherapy in HER2-positive breast cancer. N Engl J Med 2005;353:1659–1672.
15. Romond EH, Perez EA, Bryant J, et al. Trastuzumab plus adjuvant chemotherapy for operable HER2-positive breast cancer. N Engl J Med 2005;353:1673–1684.

Chapter 5

Risk: Exposure to Disease


From the study of the characteristics of persons who later develop coronary heart
disease (CHD) and comparison with the characteristics of those who remain free of this
disease,
it is possible many years before any overt symptoms or signs become manifest . . . to
put together a profile of those persons in whom there is a high risk of developing CHD .
. . It has seldom been possible in noninfectious disease to identify such highly
susceptible individuals years before the development of disease.
—Thomas Dawber and William Kannel
1961

KEY WORDS
Observational study
Cohort study
Cohort
Exposed group
Unexposed group
Incidence study
Prospective cohort study
Retrospective/historical cohort study
Case-cohort design
Measure of effect
Absolute risk
Attributable risk
Risk difference
Relative risk
Risk ratio
Population-attributable risk
Population-attributable fraction
Extraneous variable
Covariate
Crude measure of effect
Confounding
Intermediate outcome
Confounding variable/Confounder
Controlling
Restriction
Matching
Stratification
Standardization
Multivariable analysis
Logistic regression
Cox proportional hazard
Unmeasured confounder
Residual confounding
Effect modification
Interaction

STUDIES OF RISK
This chapter describes how investigators obtain estimates of risk by observing the relationship between exposure to possible risk factors and the subsequent incidence of disease. It describes methods used to determine risk by following groups into the future and also discusses several ways of comparing risks as they affect individuals and populations. Chapter 6 describes methods of studying risk by looking backward in time.
The most powerful way to determine whether exposure to a potential risk factor results in an increased risk of disease is to conduct an experiment in which the researcher determines who is exposed. People currently without disease are divided into groups of equal susceptibility to the disease in question. One group is exposed to the purported risk factor and the other is not, but the groups otherwise are treated the same. Later, any difference in observed rates of disease in the groups can be attributed to the risk factor. Experiments are discussed in Chapter 9.

When Experiments Are Not Possible or Ethical
The effects of most risk factors in humans cannot be studied with experimental studies. Consider some of the risk questions that concern us today: Are inactive people at increased risk for cardiovascular disease, everything else being equal? Do cellular phones cause brain cancer? Does obesity increase the risk of cancer? For such questions, it is usually not possible to conduct an experiment. First, it would be unethical

to impose possible risk factors on a group of healthy people for the purposes of scientific research. Second, most people would balk at having their diets and behaviors constrained by others for long periods of time. Finally, the experiment would have to go on for many years, which is difficult and expensive. As a result, it is usually necessary to study risk in less obtrusive ways.
Clinical studies in which the researcher gathers data by simply observing events as they happen, without playing an active part in what takes place, are called observational studies. Most studies of risk are observational studies and are either cohort studies, described in the rest of this chapter, or case-control studies, described in Chapter 6.

Cohorts
As defined in Chapter 2, the term cohort is used to describe a group of people who have something in common when they are first assembled and who are then observed for a period of time to see what happens to them. Table 5.1 lists some of the ways in which cohorts are used in clinical research. Whatever members of a cohort have in common, observations of them should fulfill three criteria if the observations are to provide sound information about risk of disease.
1. They do not have the disease (or outcome) in question at the time they are assembled.
2. They should be observed over a meaningful period of time in the natural history of the disease in question so that there will be sufficient time for the risk to be expressed. For example, if one wanted to learn whether neck irradiation during childhood results in thyroid neoplasms, a 5-year follow-up would not be a fair test of this hypothesis, because the usual time period between radiation exposure and the onset of disease is considerably longer.
3. All members of the cohort should be observed over the full period of follow-up or methods must be used to account for dropouts. To the extent that people drop out of the study and their reasons for dropping out are related in some way to the outcome, the information provided by an incomplete cohort can misrepresent the true state of affairs.

Cohort Studies
The basic design of a cohort study is illustrated in Figure 5.1. A group of people (a cohort) is assembled, none of whom has experienced the outcome of interest, but all of whom could experience it. (For example, in a study of risk factors for endometrial cancer, each member of
Table 5.1 the cohort should have an intact uterus.) Upon
Cohorts and Their Purposes entry into the study, people in the cohort are
classified according to those character- istics
Characteristic To Assess (possible risk factors) that might be related to
in Common Effect of Example outcome. For each possible risk factor, members
Age Age Life expectancy for of the cohort are classified either as exposed (i.e.,
people age 70 pos- sessing the factor in question, such as
(regardless of birth date)
hypertension) or unexposed. All the members of
Date of birth Calendar Tuberculosis rates for the cohort are then observed over time to see
time people born in 1930 which of them expe- rience the outcome, say,
Exposure Risk factor Lung cancer in people cardiovascular disease, and the rates of the outcome
who smoke events are compared in the exposed and
Disease Prognosis Survival rate for patients unexposed groups. It is then possible to see
with brain cancer whether potential risk factors are related to
Therapeutic Treatment Improvement in survival subsequent outcome events. Other names for cohort
intervention for patients with studies are incidence studies, which emphasize that
Hodgkin lymphoma patients are followed over time; prospective stud-
given combination ies, which imply the forward direction in which
chemotherapy the patients are pursued; and longitudinal studies,
Preventive Prevention Reduction in incidence which call attention to the basic measure of new
intervention of pneumonia after disease events over time.
pneumococcal The following is a description of a classic cohort
vaccination study that has made important contributions to our
understanding of cardiovascular disease risk factors
and to modern methods of conducting cohort studies.
[Figure 5.1 diagram: population; cohort without disease; exposure to risk factor (exposed, not exposed); follow-up over time; disease (yes, no)]
Figure 5.1 ■ Design of a cohort study of risk. Persons without disease are divided into two groups—
those exposed to a risk factor and those not exposed. Both groups are followed over time to determine what
proportion of each group develops disease.
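The comparison at the heart of Figure 5.1 can be sketched in a few lines of Python. The counts below are hypothetical, chosen only to illustrate the arithmetic, and the class name CohortGroup is our own. The sketch also assumes complete follow-up of every cohort member, which real cohorts rarely achieve.

```python
from dataclasses import dataclass


@dataclass
class CohortGroup:
    """One exposure group in a cohort; everyone is disease-free at entry."""
    n_at_start: int    # people free of disease when the cohort is assembled
    n_new_cases: int   # people who develop the disease during follow-up

    def incidence(self) -> float:
        """Cumulative incidence: new cases divided by people initially at risk."""
        if self.n_at_start <= 0:
            raise ValueError("group must contain at least one person")
        return self.n_new_cases / self.n_at_start


# Hypothetical counts, for illustration only.
exposed = CohortGroup(n_at_start=1_000, n_new_cases=30)
unexposed = CohortGroup(n_at_start=2_000, n_new_cases=20)

print(f"Incidence, exposed:   {exposed.incidence():.3f}")
print(f"Incidence, unexposed: {unexposed.incidence():.3f}")
```

When people drop out during follow-up, this simple proportion is no longer adequate and methods that account for time at risk are needed.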

Example
The Framingham Study (1) was begun in 1949 to identify factors associated with an increased risk of coronary heart disease ...

Prospective and Historical Cohort Studies
Cohort studies can be conducted in two ways (Fig. 5.2). The cohort can be assembled in the present and followed into the future (a prospective cohort study), or it can be identified from past records and followed forward from that time up to the present (a retrospective cohort study or a historical cohort study). The Framingham Study is an example of a prospective cohort study. Useful retrospective cohort studies are appearing increasingly in the medical literature because of the availability of large computerized medical databases.

Prospective Cohort Studies


Prospective cohort studies can assess purported risk factors not usually captured in medical records, including many health behaviors, educational level, and socioeconomic status, which have been found to have important health effects. When the study is planned before data are collected, researchers can be sure to collect information about possible confounders. Finally, all the information in a prospective cohort study can be collected in a standardized manner that decreases measurement bias.
[Figure 5.2 diagram: timeline (past, present, future). Retrospective (historical) cohort: cohort assembled in the past, follow-up to the present. Prospective cohort: cohort assembled in the present, follow-up into the future.]

Figure 5.2 ■ Retrospective and prospective cohort studies. Prospective cohorts are assembled in the present and followed forward into the future. In contrast, retrospective cohorts are made by going back into the past and assembling the cohort, for example, from medical records, then following the group forward to the present.

Example
How much leisure time physical activity is needed to achieve health benefits? Several guidelines suggest a minimum of 30 ...

Historical Cohort Studies Using Medical Databases
Historical cohort studies can take advantage of computerized medical databases and population registries that are used primarily for patient care or to track population health. The major advantages of historical cohort studies over classical prospective cohort studies are that they take less time, are less expensive, and are much easier to do. However, they cannot undertake studies of factors not recorded in computerized databases, so patients’ lifestyle, social standing, education, and other important health determinants usually cannot be included in the studies. Also, information in many databases, especially medical care information, is not collected in a standardized manner, leading to the possibility of bias in results. Large computerized databases are particularly useful for studying possible risk factors and health outcomes that are likely to be recorded in medical databases in somewhat standard ways, such as diagnoses and treatments.

Example
The incidence of autism increased sharply in the 1990s, coinciding with an increasing vaccination of young children for measles, mumps, and rubella (MMR). A report linking MMR vaccination and autism in several children caused widespread alarm that vaccination (or the vaccine preservative, thimerosal) was responsible for the increasing incidence of autism. In some countries, MMR vaccination rates among young children dropped, resulting in new outbreaks and even deaths from measles. Because of the seriousness of the situation, several studies were undertaken to evaluate MMR vaccine as a possible risk factor. In Denmark, a retrospective cohort study included all children (537,303) born from January 1991 through December 1998 (4). The investigators reviewed the children’s countrywide health records and determined that 82% received the MMR vaccine (physicians
must report vaccinations to the government in order to receive payment); 316 children were diagnosed with autism, and another 422 with autistic-spectrum disorders. The frequency of autism among children who had been vaccinated was similar (in fact, slightly less) to that among children not receiving MMR vaccine. This, along with other studies, provided strong evidence against the suggestion that MMR vaccine causes autism. Subsequently, the original study leading to alarm was investigated for fraud and conflict of interest and was retracted by The Lancet in 2010 (5).

Case-Cohort Studies
Another method using computerized medical databases in cohort studies is the case-cohort design. Conceptually, it is a modification of the retrospective cohort design that takes advantage of the ability to determine the frequency of a given medical condition in a large group of people. In a case-cohort study, all exposed people in a cohort, but only a small random sample of unexposed people, are included in the study and followed for some outcome of interest. For efficiency, the group of unexposed people is “enriched” with all those who subsequently suffer the outcome of interest (i.e., become cases). The results are then adjusted to reflect the sampling fractions used to obtain the sample. This efficient approach to a cohort study requires that frequencies of outcomes be determined in the entire group of unexposed people; thus, the need for a large, computerized, medical database.

Example
Does prophylactic mastectomy protect women who are at increased risk for breast cancer? A case-cohort study was done ... comparison group, the investigators randomly sampled a similar group of women not undergoing the procedure and enriched the sample with women who subsequently developed breast cancer. “Enrichment” was accomplished by knowing who among 666,800 eligible women developed breast cancer, through examination of the computerized database. The investigators randomly sampled about 1% of comparison women of a certain age who developed breast cancer,† but only about .01% of women who did not. Adjustments for the sampling fractions were then made during the analysis. The results showed that bilateral prophylactic mastectomy was associated with a 99% reduction in breast cancer ...

Advantages and Disadvantages of Cohort Studies
Well-conducted cohort studies of risk, regardless of type, are the best available substitutes for a true experiment when experimentation is not possible. They follow the same logic as a clinical trial and allow measurement of exposure to a possible risk factor while avoiding any possibility of bias that might occur if exposure were determined after the outcome was already known. The most important scientific disadvantage of cohort studies (in fact, all observational studies) is that they are subject to a great many more potential biases than are experiments. People who are exposed to a certain risk factor in the natural course of events are likely to differ in a great many ways from a comparison group of people not exposed to the factor. If some of these other differences are also related to the disease in question, they could confound any association observed between the putative risk factor and the disease.
The uses, strengths, and limitations of the different types of cohort studies are summarized in Table 5.2. Several of the advantages and disadvantages apply regardless of type. However, the potential

† Strictly speaking, the study was a modification of a standard case-cohort design that would have included all cases, not just 1%, of 26,800 breast cancers that developed in the comparison group of women not undergoing prophylactic mastectomy. However, because breast cancer occurs commonly, a random sample of the group sufficed.
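One simple way to picture an adjustment for sampling fractions is inverse weighting: each sampled woman stands in for 1/(sampling fraction) women in the full comparison group. The Python sketch below is an illustration under stated assumptions, not the investigators' actual analysis. The sampled counts (268 cases, 64 non-cases) are round numbers we derived from the approximate figures in the text (about 1% of 26,800 cases and about .01% of the remaining 640,000 women), and the function name weighted_count is hypothetical.

```python
def weighted_count(sampled: int, sampling_fraction: float) -> float:
    """Estimate the full-group count represented by a random sample."""
    if not 0 < sampling_fraction <= 1:
        raise ValueError("sampling fraction must be in (0, 1]")
    return sampled / sampling_fraction


# Illustrative sampled counts (assumptions, see lead-in).
sampled_cases = 268        # about 1% of women who developed breast cancer
sampled_noncases = 64      # about 0.01% of women who did not

est_cases = weighted_count(sampled_cases, 0.01)
est_noncases = weighted_count(sampled_noncases, 0.0001)

# Ignoring the sampling fractions grossly overstates risk, because cases
# were deliberately oversampled ("enriched").
naive_risk = sampled_cases / (sampled_cases + sampled_noncases)
weighted_risk = est_cases / (est_cases + est_noncases)

print(f"Naive risk in the sample: {naive_risk:.3f}")
print(f"Weighted risk estimate:   {weighted_risk:.4f}")
```

The weighted estimate recovers the proportion of cases in the full comparison group, whereas the unweighted sample, being enriched with cases, does not.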
Table 5.2
Advantages and Disadvantages of Cohort Studies

All Cohort Study Types
Advantages
■ The only way of establishing incidence (i.e., absolute risk) directly
■ Follows the same logic as the clinical question: If persons are exposed, do they get the disease?
■ Exposure can be elicited without the bias that might occur if outcome were known before documentation of exposure
■ Can assess the relationship between exposure and many diseases
Disadvantages
■ Susceptible to confounding and other biases

Prospective Cohort Studies
Advantages
■ Can study a wide range of possible risk factors
■ Can collect lifestyle and demographic data not available in most medical records
■ Can set up standardized ways of measuring exposure and degree of exposure to risk factors
Disadvantages
■ Inefficient because many more subjects must be enrolled than experience the event of interest; therefore, cannot be used for rare diseases
■ Expensive because of resources necessary to study many people over time
■ Results not available for a long time
■ Assesses the relationship between disease and exposure to only relatively few factors (i.e., those recorded at the outset of the study)

Retrospective (Historical) Cohort Studies
Advantages
■ More efficient than prospective cohort studies because data have already been collected for another purpose (i.e., during patient care or for a registry)
■ Cheaper than prospective cohort studies because resources not necessary to follow many people over time
■ Faster than prospective cohort studies because patient outcomes have already occurred
Disadvantages
■ Range of possible risk factors that can be studied is narrower than that possible with prospective cohort studies
■ Cannot examine patient characteristics not available in the data set used
■ Measurement of exposure and degree of exposure may not be standardized

Case-cohort Studies
Advantages
■ All advantages of retrospective cohort studies apply
■ Even more efficient than retrospective cohort studies because only a sample of unexposed group is analyzed
Disadvantages
■ All disadvantages of retrospective cohort studies apply
■ Difficult for readers to understand weighting procedures used in the analysis

for difficulties with the quality of data is different for the three. In prospective studies, data can be collected specifically for the purposes of the study and with full anticipation of what is needed. It is thereby possible to avoid measurement biases and some of the confounders that might undermine the accuracy of the results. However, data for historical cohorts are usually gathered for other purposes, often as part of medical records for patient care. Except for carefully selected questions, such as the relationship between vaccination and autism, the data in historical cohort studies may not be of sufficient quality for rigorous research.
Prospective cohort studies can also collect data on lifestyle and other characteristics that might influence the results, and they can do so in standard ways. Many of these characteristics are not routinely available in retrospective and case-cohort studies, and those that are usually are not collected in standard ways.
The principal disadvantage of prospective cohort studies is that when the outcome is infrequent, which is usually so in studies of risk, a large number of people must be entered in a study and remain under observation for a long time before results are available. Having to measure exposure in many people and then follow them for years is inefficient when few ultimately develop the disease. For example, the Framingham Study of cardiovascular disease (the most common cause of death in America) was the largest study of its kind when it began. Nevertheless, more than 5,000 people had to be followed for several years before the first, preliminary conclusions could be published. Only 5% of the people had experienced a coronary event during the first 8 years. Retrospective and case-cohort studies get around the problem of time but often sacrifice access to important and standardized data.
Another problem with prospective cohort studies results from the people under study usually being “free living” and not under the control of researchers. A great deal of effort and money must be expended to keep track of them. Prospective cohort studies of risk, therefore, are expensive, usually costing many millions, sometimes hundreds of millions, of dollars.
Because of the time and money required for prospective cohort studies, this approach cannot be used for all clinical questions about risk, which was a major reason for efforts to find more efficient, yet dependable, ways of assessing risk, such as retrospective and case-cohort designs. Another method, case-control studies, is discussed in Chapter 6.

WAYS TO EXPRESS AND COMPARE RISK
The basic expression of risk is incidence, which is defined in Chapter 2 as the number of new cases of disease arising during a given period of time in a defined population that is initially free of the condition. In cohort studies, the incidence of disease is compared in two or more groups that differ in exposure to a possible risk factor. To compare risks, several measures of the association between exposure and disease, called measures of effect, are commonly used. These measures represent different concepts of risk, elicit different impressions of the magnitude of a risk, and are used for different purposes. Four measures of effect are discussed in the following text. Table 5.3 summarizes the four, along with absolute risk, and Table 5.4 demonstrates their use with the risk of lung cancer among smokers and non-smokers.

Absolute Risk
Absolute risk is the probability of an event in a population under study. Its value is the same as that for incidence, and the terms are often used interchangeably. Absolute risk is the best way for individual patients and clinicians to understand how risk factors may affect their lives. Thus, as Table 5.4 shows, although smoking greatly increases the chances of dying from lung cancer, among smokers the absolute risk of dying from lung cancer each year in the population studied was 341.3 per

Table 5.3
Measures of Effect

Expression | Question | Definition^a
Absolute risk | What is the incidence of disease in a group initially free of the condition? | $I = \dfrac{\text{\# new cases over a given period of time}}{\text{\# people in the group}}$
Attributable risk (risk difference) | What is the incidence of disease attributable to exposure? | $AR = I_E - I_{\bar{E}}$
Relative risk (risk ratio) | How many times more likely are exposed persons to become diseased, relative to non-exposed persons? | $RR = \dfrac{I_E}{I_{\bar{E}}}$
Population-attributable risk | What is the incidence of disease in a population, associated with the prevalence of a risk factor? | $AR_P = AR \times P$
Population-attributable fraction | What fraction of the disease in a population is attributable to exposure to a risk factor? | $AF_P = \dfrac{AR_P}{I_T}$

^a Where $I_E$ = incidence in exposed persons; $I_{\bar{E}}$ = incidence in non-exposed persons; $P$ = prevalence of exposure to a risk factor; and $I_T$ = total incidence of disease in a population.
Table 5.4
Calculating Measures of Effect: Cigarette Smoking and Death from Lung Cancer in Men^a

Simple Risks
Death rate (absolute risk or incidence) from lung cancer in smokers | 341.3/100,000/yr
Death rate (absolute risk or incidence) from lung cancer in non-smokers | 14.7/100,000/yr
Prevalence of cigarette smoking | 32.1%
Lung cancer mortality rate in population | 119.4/100,000/yr

Compared Risks
Attributable risk = 341.3/100,000/yr − 14.7/100,000/yr = 326.6/100,000/yr
Relative risk = 341.3/100,000/yr ÷ 14.7/100,000/yr = 23.2
Population-attributable risk = 326.6/100,000/yr × 0.321 = 104.8/100,000/yr
Population-attributable fraction = 104.8/100,000/yr ÷ 119.4/100,000/yr = 0.88

^a Data from Thun MJ, Day-Lally CA, Calle EE, et al. Excess mortality among cigarette smokers: Changes in a 20-year interval. Am J Public Health 1995;85:1223–1230.
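The calculations in Table 5.4 can be reproduced directly from the definitions in Table 5.3. The following Python sketch uses the four input values reported in Table 5.4; the function and variable names are ours, and rates are expressed per 100,000 per year.

```python
def attributable_risk(i_exposed: float, i_unexposed: float) -> float:
    """Risk difference: incidence in exposed minus incidence in unexposed."""
    return i_exposed - i_unexposed


def relative_risk(i_exposed: float, i_unexposed: float) -> float:
    """Risk ratio: incidence in exposed divided by incidence in unexposed."""
    return i_exposed / i_unexposed


def population_attributable_risk(ar: float, prevalence: float) -> float:
    """Attributable risk multiplied by the prevalence of exposure."""
    return ar * prevalence


def population_attributable_fraction(par: float, i_total: float) -> float:
    """Population-attributable risk divided by total incidence."""
    return par / i_total


# Inputs from Table 5.4 (deaths per 100,000 per year).
I_SMOKERS = 341.3
I_NONSMOKERS = 14.7
PREVALENCE_SMOKING = 0.321
I_TOTAL = 119.4

ar = attributable_risk(I_SMOKERS, I_NONSMOKERS)
rr = relative_risk(I_SMOKERS, I_NONSMOKERS)
par = population_attributable_risk(ar, PREVALENCE_SMOKING)
paf = population_attributable_fraction(par, I_TOTAL)

print(f"Attributable risk:                {ar:.1f} per 100,000 per year")
print(f"Relative risk:                    {rr:.1f}")
print(f"Population-attributable risk:     {par:.1f} per 100,000 per year")
print(f"Population-attributable fraction: {paf:.2f}")
```

Keeping each measure as its own small function mirrors the way the measures build on one another: the population measures are derived from the attributable risk, not from the relative risk.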

100,000 (3 to 4 lung cancer deaths per 1,000 smokers per year).

Attributable Risk
One might ask, “What is the additional risk (incidence) of disease following exposure, over and above that experienced by people who are not exposed?” The answer is expressed as attributable risk, the absolute risk (or incidence) of disease in exposed persons minus the absolute risk in non-exposed persons. In Table 5.4, the attributable risk of lung cancer death in smokers is calculated as 326.6 per 100,000 per year. Attributable risk is the additional incidence of disease related to exposure, taking into account the background incidence of disease from other causes. Note that this way of comparing rates implies that the risk factor is a cause and not just a marker. Because of the way it is calculated, attributable risk is also called risk difference, the difference between two absolute risks.

Relative Risk
On the other hand, one might ask, “How many times more likely are exposed persons to get the disease relative to non-exposed persons?” To answer this question, relative risk, or risk ratio, is calculated as the ratio of incidence in exposed persons to incidence in non-exposed persons, estimated in Table 5.4 as 23.2. Relative risk (or an estimate of relative risk, odds ratios, discussed in Chapter 6) is the most commonly reported result in studies of risk, not only because of its computational convenience but also because it is a common metric in studies with similar risk factors but with different baseline incidence rates. Because relative risk indicates the strength of the association between exposure and disease, it is a useful measure of effect for studies of disease etiology.

Interpreting Attributable and Relative Risk
Although attributable and relative risk are calculated from the same two components (the incidence, or absolute risk, of an outcome in an exposed and an unexposed group), the resulting size of the risk may appear to be quite different depending on whether attributable or relative risk is used.

Example
Suppose a risk factor doubles the chance of dying of a certain disease ...
... below. The resulting calculations for relative risk and attributable risk would be:

Incidence
Unexposed Group | Exposed Group | Relative Risk | Attributable Risk
1/10,000 | 2/10,000 | 2.0 | 0.1/1,000
1/1,000 | 2/1,000 | 2.0 | 1/1,000
1/100 | 2/100 | 2.0 | 10/1,000
1/10 | 2/10 | 2.0 | 100/1,000

Because the calculation of relative risk cancels out incidence, it does not clarify the size of risk in discussions with patients ...
... 23.8 per 1,000 women-years) because the risk of fractures increased with age regardless of BMD (Table 5.5).

In most clinical situations, it is best simply to concentrate on the attributable risk by comparing the absolute risks in exposed and unexposed people. Ironically, because most medical research presents results as relative risk, clinicians often emphasize relative risks when advising patients.

Population Risk
Another way of looking at risk is to ask, “How much does a risk factor contribute to the overall rates of disease in groups of people, rather than individuals?” This information is useful for determining which risk factors are particularly important and which are trivial to the overall health of a community, and can inform those who must prioritize the deployment of health care resources.
To estimate population risk, it is necessary to take into account the frequency with which members of a community are exposed to a risk factor. Population-attributable risk is the product of the attributable risk and the prevalence of exposure to the risk factor in a population. It measures the excess incidence of disease in a community that is associated with a risk factor. One can also describe the fraction of disease occurrence in a population associated with a particular risk factor, the population-attributable fraction. It is obtained by dividing the population-attributable risk by the total incidence of disease in the population. In the case of cigarette smoking and lung cancer (Table 5.4), smoking annually contributes about 105 lung cancer deaths for every 100,000 men in the population (population-attributable risk), and accounts for 88% of all lung cancer deaths

Table 5.5
Comparing Relative Risk and Attributable Risk in the Relationship of Bone Mineral Density (BMD) T-scores, Fractures, and Age

Fracture incidence is the absolute risk per 1,000 person-years.

Age | Fracture Incidence, BMD T-score >–1.0 | Fracture Incidence, BMD T-score <–2.0 | Relative Risk^a | Attributable Risk^b per 1,000 Person-years
50–59 | 4.5 | 19.0 | 2.63 | 14.4
60–69 | 50.0 | 22.0 | 2.78 | 16.4
70–79 | 7.5 | 30.5 | 2.37 | 20.3
80–99 | 16.5 | 42.0 | 1.97 | 23.8

^a Adjusted.
^b Estimated from Figure 2 of reference.
Data from Siris ES, Brenneman SK, Barrett-Connor E, et al. The effect of age and bone mineral density on the absolute, excess, and relative risk of fracture in postmenopausal women aged 50–99: results from the National Osteoporosis Risk Assessment (NORA). Osteoporos Int 2006;17:565–574.
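The divergence between relative and attributable risk follows directly from the arithmetic. As a minimal sketch, the Python below takes a risk factor that doubles risk and applies it to baseline (unexposed) risks ranging from 1 in 10,000 to 1 in 10, mirroring the doubling example in the text. The variable names are ours, and exact fractions are used to avoid rounding noise.

```python
from fractions import Fraction

# Baseline (unexposed) risks, from rare to common.
baseline_risks = [
    Fraction(1, 10_000),
    Fraction(1, 1_000),
    Fraction(1, 100),
    Fraction(1, 10),
]
RELATIVE_RISK = 2  # the risk factor doubles risk at every baseline

rows = []
for unexposed in baseline_risks:
    exposed = RELATIVE_RISK * unexposed
    rr = exposed / unexposed   # ratio: the baseline cancels out
    ar = exposed - unexposed   # difference: scales with the baseline
    rows.append((unexposed, exposed, rr, ar))
    print(
        f"unexposed {float(unexposed):.4f}  exposed {float(exposed):.4f}  "
        f"RR {float(rr):.1f}  AR {float(ar * 1000):g} per 1,000"
    )
```

The relative risk is 2.0 in every row, while the attributable risk grows a thousandfold as the baseline risk rises, which is why the two measures can leave such different impressions of the same risk factor.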
[Figure 5.3 graphic: three panels sharing a blood pressure axis (mm Hg; systolic 130–139, 140–159, ≥160; diastolic 85–89, 90–99, ≥100). Panel A: excess 10-year adjusted coronary heart disease attributable to elevated blood pressure. Panel B: prevalence of elevated blood pressure at various levels in the population. Panel C: percent of excess coronary heart disease attributable to various levels of hypertension.]
Figure 5.3 ■ Relationships among attributable risk, prevalence of risk factor, and population risk for coronary heart disease (CHD) death due to hypertension. Panel A shows that the attributable risk for CHD increases as blood pressure levels increase. However, because mild and moderate hypertension are more prevalent than severe hypertension (Panel B), most excess CHD deaths caused by hypertension are not due to the highest levels of blood pressure (Panel C). (Data from Wilson PWF, D’Agostino RB, Levy D, et al. Prediction of coronary heart disease using risk factor categories. Circulation 1998;97:1837–1847.)
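The arithmetic behind Figure 5.3 can be sketched as follows. The attributable risks and prevalences below are hypothetical round numbers, not the figure's data, chosen only to mimic the pattern the caption describes; the level labels and variable names are also ours. For each level of blood pressure, the excess disease it contributes to the population is taken as the attributable risk multiplied by the prevalence of that level.

```python
# (level, attributable risk per 100 men over 10 years, prevalence in population)
# All values are hypothetical, for illustration only.
levels = [
    ("mild", 5.0, 0.20),
    ("moderate", 8.0, 0.20),
    ("severe", 13.0, 0.05),
]

# Excess cases per 100 men in the whole population, by level.
excess = {label: ar * prevalence for label, ar, prevalence in levels}
total_excess = sum(excess.values())

# Share of all excess disease contributed by each level.
shares = {label: value / total_excess for label, value in excess.items()}

for label, ar, prevalence in levels:
    print(
        f"{label:<9} AR {ar:>5.1f}  prevalence {prevalence:.0%}  "
        f"share of excess {shares[label]:.0%}"
    )
```

With these assumed inputs, the level with the highest attributable risk contributes the smallest share of excess disease, simply because few people are exposed at that level.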

in the population (population-attributable fraction). Note how important the prevalence of the risk factor (smoking) is to these calculations. As smoking rates fall, the fraction of lung cancer due to smoking also falls.
As discussed in Chapter 4, if a relatively weak risk factor is very prevalent in a community, it could account for more disease than a very strong, but rare, risk factor. Figure 5.3 illustrates this for hypertension and the development of coronary heart disease. Figure 5.3A shows the attributable (excess) risk of coronary heart disease according to various levels of hypertension among a group of about 2,500 men followed for 10 years. Risk increased with increasing blood pressure. However, few men had very high blood pressure (Fig. 5.3B). As a result, the highest level of hypertension contributed only about a quarter of excess coronary heart disease in the population (Fig. 5.3C). Paradoxically, then, physicians could save more lives with effective treatment of lower, rather than higher, levels of hypertension. This fact, so counterintuitive to clinical thinking, has been termed “the prevention paradox” (8).
Measures of population risk are less frequently encountered in the clinical literature than are measures of absolute, attributable, and relative risks, but a particular clinical practice is as much a population for the doctor as is a community for health policymakers. In addition, how the prevalence of exposure affects community risk can be important in the care of individual patients. For instance, when patients cannot give a history or when exposure is difficult for them to recognize, physicians depend on the usual prevalence of exposure to estimate the likelihood of various diseases. When considering treatable causes of cirrhosis in a North American patient, for
example, it would be more useful to consider alcohol than schistosomes, inasmuch as few North Americans are exposed to schistosomes. Of course, one might take a very different stance in the Nile Delta, where schistosomiasis is prevalent, and the people, who are mostly Muslims, rarely drink alcohol.

TAKING OTHER VARIABLES INTO ACCOUNT
Thus far, we mainly have been discussing exposure and disease in isolation as if they were the only two variables that matter. But, in fact, many other variables are part of the phenomenon being studied; these other variables can have one of two important effects on the results. They can cause an unwanted, artificial change in the observed relationship between exposure and disease (confounding), leading to incorrect conclusions about the relationship; or they can modify the magnitude of the exposure–disease relationship (effect modification), which is valuable information for clinicians. This section discusses these two effects, which are so important to the interpretation of research results, especially in observational studies.

Extraneous Variables
Extraneous variables is a general term for variables that are part of the system being studied but are not (i.e., are “extraneous” to) the exposure and disease of primary interest. For example, in a study of exercise and sudden death, the other variables that are relevant to that study include age, body mass index, coexisting diseases, all of the cardiovascular risk factors, and everything having to do with the ability to exercise. Another term favored by statisticians is covariates. Neither term is particularly apt. Extraneous variables are not at all “extraneous” because they can have important effects on the exposure–disease relationship. Also, covariates may or may not “covary” (change in relation to each other, exposure, or disease), but these are the terms that are used.

Simple Descriptions of Risk
Observational studies can disregard these other variables and simply compare the course of disease in two naturally occurring groups, one exposed to a risk or prognostic factor and the other not, without implying that exposure itself was responsible for whatever differences in outcome are observed. Crude measures of effect (not adjusted for other variables) can be useful in predicting events, without regard to causes.

Example
British investigators followed up a cohort of 17,981 children ...

Usually, investigators want to report more than crude measures of effect. They want to demonstrate how exposure is related to disease independently of all the other variables that might affect the relationship. That is, they want to come as close as possible to describing cause and effect.

CONFOUNDING
The validity of observational studies is threatened above all by confounding. We have already described confounding in conceptual terms in Chapter 1, noting that confounding occurs when exposure is associated or “travels together” with another variable, which is itself related to the outcome, so that the effect of exposure can be confused with or distorted by the effect of the other variable. Confounding causes a systematic error (a bias) in inference, whereby the effects of one variable are attributed to another. For example, in a study of whether vitamins protect against cardiovascular events, if people who choose to take vitamins are also more likely to follow a healthy lifestyle (e.g., not smoke cigarettes, exercise, eat a prudent diet, and avoid obesity), taking vitamins will be associated with lower cardiovascular disease rates regardless of whether vitamins protect against cardiovascular disease. Confounding can increase or decrease an observed association between exposure and disease.
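A small numerical sketch may make the idea of a crude association that misleads about cause concrete, using the example of vitamin use and cardiovascular events. All counts below are hypothetical, constructed so that vitamins have no effect within either lifestyle group while vitamin users are concentrated in the healthier group; the names are ours.

```python
def risk(events: int, total: int) -> float:
    """Proportion of a group that experiences the outcome."""
    return events / total


def relative_risk(exposed: tuple, unexposed: tuple) -> float:
    """Risk ratio from two (events, total) pairs."""
    return risk(*exposed) / risk(*unexposed)


# Hypothetical cohort: (cardiovascular events, people) by vitamin use,
# stratified by lifestyle.
strata = {
    "healthy lifestyle": {"users": (40, 800), "non_users": (10, 200)},
    "less healthy lifestyle": {"users": (40, 200), "non_users": (160, 800)},
}

# Stratum-specific relative risks: vitamin use compared within each lifestyle.
stratum_rr = {
    name: relative_risk(groups["users"], groups["non_users"])
    for name, groups in strata.items()
}

# Crude relative risk: pool the strata, ignoring lifestyle.
crude_users = tuple(sum(g["users"][i] for g in strata.values()) for i in (0, 1))
crude_non_users = tuple(
    sum(g["non_users"][i] for g in strata.values()) for i in (0, 1)
)
crude_rr = relative_risk(crude_users, crude_non_users)

for name, rr in stratum_rr.items():
    print(f"RR within {name}: {rr:.2f}")
print(f"Crude RR (lifestyle ignored): {crude_rr:.2f}")
```

In this constructed example, the crude relative risk suggests that vitamins halve cardiovascular risk, yet within each lifestyle group vitamin users and non-users have identical risk. The apparent protection comes entirely from the uneven distribution of lifestyle between the two groups.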
Working Definition
A confounding variable is one that is:
■ Associated with exposure
■ Associated with disease
■ Not part of the causal chain from exposure to disease
A confounding variable cannot be in the causal chain between exposure and disease; although variables that are in the chain are necessarily related to both exposure and disease, they are not initiating events. (Such variables are sometimes referred to as intermediate outcomes.) If their effects were removed, this would also remove any association that might exist between exposure and disease. For example, in a study of diet and cardiovascular disease, serum cholesterol is a consequence of diet; if the effect of cholesterol were removed, it would incorrectly diminish the association between diet and cardiovascular disease.
In practice, while confounding variables (colloquially called confounders) may be examined one at a time, usually many variables can confound the exposure–disease relationship and all are examined and controlled for concurrently.

Potential Confounders
How does one decide which variables should be considered potential confounders? One approach is to identify all the variables that are known, from other studies, to be associated with either exposure or disease. Age is almost always a candidate, as are known risk factors for the disease in question. Another approach is to screen variables in the study data for statistical associations with exposure and disease, using liberal criteria for “association” so as to err on the side of not missing potential confounding. Investigators may also consider variables that just make sense according to their clinical experience or the biology of disease, regardless of whether there are strong research studies linking them to exposure or disease. The intention is to cast a broad net so as not to miss possible confounders. This is because of the possibility that a variable may confound the exposure–disease relationship by chance, because of the particular data at hand, even though it is not a confounder in nature.

Another approach is to see if the crude relationship between exposure and disease is different when taking the potential confounder into account. The following example illustrates both approaches.

Example
As pointed out in an example in Chapter 4, several observational ...
... 0.99. The authors concluded that their study did not show an inde ...

CONTROL OF CONFOUNDING
To determine whether a factor is independently related to risk or prognosis, it is ideal to compare cohorts with and without the factor, everything else being equal. But in real life, “everything else” is usually not equal in observational studies.
What can be done about this problem? There are several possible ways of controlling† for differences

Confirming Confounding
How does one decide whether a variable that might
confound the relationship between exposure and disease.
disease actually does so? One approach is to simply
show that the variable is associated with exposure
and (separately) to show that it is associated with

Unfortunately, the term “control” also has several
Chapter 5: Risk: Exposure to Disease 73
other mean- ings: the non-exposed people in a cohort
study, the patients in a clinical trial who do not
receive the experimental treatment, and non-
diseased people (non-cases) in a case-control study.
between groups. Controlling is a general term for any process aimed at removing the effects of extraneous variables while examining the independent effects of individual variables. A variety of methods can be applied during the design or analysis of research (summarized in Table 5.6 and described in the following text). One or more of these strategies should be applied in any observational study that attempts to describe the effect of one variable independent of other variables that might affect the outcome. The basic question is, “Are the differences between groups in risk or prognosis related to the particular factor under study or to some other factor(s)?”

[Figure 5.4: diagram linking the research question (folate and stroke) to confounding variables (saturated fats, trans fatty acids, smoking, exercise), with the data source marked as this study or other studies.]
Figure 5.4 ■ Example of confounding. The relationship between folate intake and incidence of stroke was confounded by several cardiovascular risk and protective factors.

Randomization
The best way to balance all extraneous variables between groups is to randomly assign patients to groups so that each patient has an equal chance of falling into the exposed or unexposed group (see Chapter 9). A special feature of randomization is that it balances not only variables known to affect outcome and included in the study, but also unknown or unmeasured confounders. Unfortunately, it is usually not possible to study risk or prognostic factors with randomized trials.

Restriction
Patients who are enrolled in a study can be confined to only those possessing a narrow range of characteristics, a strategy called restriction. When this is done, certain characteristics can be made similar in the groups being compared. For example, the effect of prior cardiovascular disease on prognosis after acute myocardial infarction could be studied in patients who had no history of cigarette smoking or hypertension. However, this approach is limiting. Although restriction on entry to a study can certainly produce homogeneous groups of patients, it does so at the expense of generalizability. In the course of excluding

Table 5.6
Methods for Controlling Confounding

Method                          Phase of Study    Description
Randomization                   Design            Assign patients to groups in a way that gives each patient an equal chance of falling into one or the other group.
Restriction                     Design            Limit the range of characteristics of patients in the study.
Matching                        Design, Analysis  For each patient in one group, select one or more patients with the same characteristics (except for the one under study) for a comparison group.
Stratification                  Analysis          Compare rates within subgroups (strata) with otherwise similar probability of the outcome.
Simple adjustment               Analysis          Mathematically adjust crude rates for one or a few characteristics so that equal weight is given to strata of similar risk.
Multivariable adjustment        Analysis          Adjust for differences in a large number of factors related to outcome, using mathematical modeling techniques.
Best-case/worst-case analysis   Analysis          Describe how different the results could be under the most extreme (or simply very unlikely) assumption about selection bias.

Table 5.7
Example of Stratification: Hypothetical Death Rates after Coronary Bypass Surgery in
Two Hospitals, Stratified by Preoperative Risk

Hospital A Hospital B
Preoperative Risk Patients Deaths Rate (%) Patients Deaths Rate (%)
High 500 30 6 400 24 6
Medium 400 16 4 800 32 4
Low 300 2 0.67 1,200 8 0.67
Total 1,200 48 4 2,400 64 2.7
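
The figures in Table 5.7 can be checked with a short script. The following is a minimal Python sketch, not part of the original text: the names `table_5_7`, `stratum_rates`, `crude_rate`, `standardized_rate`, and `equal_weights` are illustrative assumptions, and the equal weights of 1/3 are only one possible choice of common weights. It computes the death rate in each risk stratum, the crude rate for each hospital, and a rate standardized to the same weights in both hospitals.

```python
# Patients and deaths by preoperative risk, copied from Table 5.7.
table_5_7 = {
    "A": {"high": (500, 30), "medium": (400, 16), "low": (300, 2)},
    "B": {"high": (400, 24), "medium": (800, 32), "low": (1200, 8)},
}


def stratum_rates(strata):
    """Death rate within each risk stratum."""
    return {name: deaths / patients for name, (patients, deaths) in strata.items()}


def crude_rate(strata):
    """Overall death rate; each stratum is implicitly weighted by its share of patients."""
    patients = sum(p for p, _ in strata.values())
    deaths = sum(d for _, d in strata.values())
    return deaths / patients


def standardized_rate(strata, weights):
    """Weighted sum of stratum-specific rates, using a common set of weights."""
    rates = stratum_rates(strata)
    return sum(weights[name] * rates[name] for name in rates)


# Assumed common weights: every stratum counts equally in both hospitals.
equal_weights = {"high": 1 / 3, "medium": 1 / 3, "low": 1 / 3}

for hospital, strata in table_5_7.items():
    print(
        hospital,
        round(crude_rate(strata), 4),
        round(standardized_rate(strata, equal_weights), 4),
    )
```

Under these assumptions the crude rates differ (4% versus about 2.7%), while the standardized rates agree at about 0.036, because the stratum-specific rates are the same in the two hospitals and only the mix of patients differs.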

potential subjects, cohorts may no longer be representative of most patients with the condition. Also, after restriction, it is no longer possible, in that study, to learn anything more about the effects of excluded variables.

Matching
Matching is another way of making patients in two groups similar. In its simplest form, for each patient in the exposure group, one or more patients with the same characteristics (except for the factor of interest) would be selected for a comparison group. Matching is typically done for variables that are so strongly related to outcome that investigators want to be sure they are not different in the groups being compared. Often, patients are matched for age and sex because these variables are strongly related to risk or prognosis for many diseases, but matching for other variables, such as stage or severity of disease and prior treatments, may also be useful.
Although matching is commonly done and can be very useful, it has limitations. Matching controls bias only for those variables involved in the match. Also, it is usually not possible to match for more than a few variables because of practical difficulties in finding patients who meet all of the matching criteria. Moreover, if categories for matching are relatively crude, there may be room for substantial differences between matched groups. For example, if women in a study of risk for birth of a child with Down syndrome were matched for maternal age within 10 years, there could be a nearly 10-fold difference in frequency related to age if most of the women in one group were 30 years old and most in the other 39 years old. Finally, as with restriction, once one matches on a variable, its effects on outcomes can no longer be evaluated in the study. For these reasons, although matching may be done for a few characteristics that are especially strongly related to outcome, investigators rely on other ways of controlling for bias as well.

Stratification
With stratification, data are analyzed and results presented according to subgroups of patients, or strata, of similar risk or prognosis (other than the exposure of interest). An example of this approach is the analysis of differences in hospital mortality for a common surgical procedure, coronary bypass surgery (Table 5.7). This is especially relevant today because of several high-profile examples of “report cards” for doctors and hospitals, and the concern that the reported differences may be related to patient rather than surgeon or hospital characteristics.
Suppose we want to compare the operative mortality rates for coronary bypass surgery at Hospitals A and B. Overall, Hospital A noted 48 deaths in 1,200 bypass operations (4%), and Hospital B experienced 64 deaths in 2,400 operations (2.7%).
The crude rates suggest that Hospital B is superior. But is it really superior if everything else is equal? Perhaps the preoperative risk among patients in Hospital A was higher than in Hospital B and that, rather than hospital care, accounted for the difference in death rates. To see if this possibility accounts for the observed difference in death rates, patients in each of these hospitals are grouped into strata of similar underlying preoperative risk based on age, prior myocardial function, extent of occlusive disease, and other characteristics. Then the operative mortality rates within each stratum of risk are compared.
Table 5.7 shows that when patients are divided by preoperative risk, the operative mortality rates in each risk stratum are identical in the two hospitals: 6% in high-risk patients, 4% in medium-risk patients, and 0.67% in low-risk patients. The crude rates were misleading because of important differences in the risk
characteristics of the patients treated at the two hospitals: 42% of Hospital A’s patients and only 17% of Hospital B’s patients were high risk.
An advantage of stratification is that it is a relatively transparent way of recognizing and controlling for bias.

Standardization
If an extraneous variable is especially strongly related to outcomes, two rates can be compared without bias related to this variable if they are adjusted to equalize the weight given to that variable. This process, called standardization (or adjustment), shows what the overall rate would be for each group if strata-specific rates were applied to a population made up of similar proportions of people in each stratum.
To illustrate this process, suppose the operative mortality in Hospitals A and B can be adjusted to a common distribution of risk groups by giving each risk stratum the same weight in the two hospitals. Without adjustment, the risk strata receive different weights in the two hospitals. The mortality rate of 6% for high-risk patients receives a weight of 500/1,200 in Hospital A and a much lower weight of 400/2,400 in Hospital B. The other risk strata were also weighted differently in the two hospitals. The result is a crude rate for Hospital A, which is the sum of the rate in each stratum times its weight: (500/1,200 × 0.06) + (400/1,200 × 0.04) + (300/1,200 × 0.0067) = 0.04. Similarly, the crude rate for Hospital B is (400/2,400 × 0.06) + (800/2,400 × 0.04) + (1,200/2,400 × 0.0067) = 0.027.
If equal weights were used when comparing the two hospitals, the comparison would be fair (free of the effect of different proportions in the various risk groups). The choice of weights does not matter as long as it is the same in the two hospitals. Weights could be based on those existing in either of the hospitals or any reference population. For example, if each stratum were weighted 1/3, then the standardized rate for Hospital A = (1/3 × 0.06) + (1/3 × 0.04) + (1/3 × 0.0067) = 0.036, which is exactly the same as the standardized rate for Hospital B. The consequence of giving equal weight to strata in each hospital is to remove the apparent excess risk of Hospital A.
Standardization is commonly used in relatively crude comparisons to adjust for a single variable such as age that is obviously different in groups being compared. For example, the crude results of the folate/stroke example were adjusted for age, as discussed earlier. Standardization is less useful as a stand-alone strategy to control for confounding when multiple variables need to be considered.

Multivariable Adjustment
In most clinical situations, many variables act together to produce effects. The relationships among these variables are complex. They may be related to one another as well as to the outcome of interest. The effect of one might be modified by the presence of others, and the joint effects of two or more might be greater than their individual effects taken together.
Multivariable analysis makes it possible to consider the effects of many variables simultaneously. Other terms for this approach include mathematical modeling and multivariable adjustment. Modeling is used to adjust (control) for the effects of many variables simultaneously to determine the independent effects of one. This method also can select, from a large set of variables, those that contribute independently to the overall variation in outcome. Modeling can also arrange variables in order of the strength of their contribution. There are several kinds of prototypic models, according to the design and data in the study. Cohort and case-control studies typically rely on logistic regression, which is used specifically for dichotomous outcomes. A Cox proportional hazard model is used when the outcome is the time to an event, as in survival analyses (see Chapter 7).
Multivariable analysis of observational studies is the only feasible way of controlling for many variables simultaneously during the analysis phase of a study. Randomization also controls for multiple variables, but during the design and conduct phases of a study. Matching can account for only a few variables at a time, and stratified analyses of many variables run the risk of having too few patients in some strata. The disadvantage of modeling is that for most of us it is a “black box,” making it difficult to recognize where the method might be misleading. At its best, modeling is used in addition to, not in place of, matching and stratified analysis.

Overall Strategy for Control of Confounding
Except for randomization, all ways of dealing with extraneous differences between groups share a limitation: They are effective only for those variables that are singled out for consideration. They do not deal with risk or prognostic factors that are not known at the time of the study or those that are known but not taken into account.
For this reason and the complementary strengths and weaknesses of the various methods, one should not rely on only one or another method of controlling for bias but rather use several methods together, layered one on another.

Example
In a study of whether ventricular premature contractions are associated with reduced survival in the years following acute
Restrict the study to patients who are not very old or young and who do not have unusual causes, such as arteritis or d
Match for age, a factor strongly related to prognosis but extraneous to the main question.
Using stratified analysis, examine the results separately for strata of differing clinical severity. This includes the pre
Using multivariable analysis, adjust the crude results for the effects of all the variables other than the arrhythmia, taken to

OBSERVATIONAL STUDIES AND CAUSE

The end result of a careful observational study, controlling for a rich array of extraneous variables, is to come as close as possible to describing a truly independent effect, one that is separate from all the other variables that confound the exposure–disease relationship. However, it is always possible that some important variables were not taken into account, either because their importance was not known or because they were not or could not be measured. The consequence of unmeasured confounders is residual confounding. For this reason, in single studies the results should be thought of (and investigators should describe their results) as “independent associations” and not necessarily as establishing cause. Chapter 12 describes how to build a case for a causal association.

EFFECT MODIFICATION

A very different issue from confounding is whether the presence or absence of a variable changes the effect of exposure on disease, called effect modification. As Rothman (11) puts it,

The most central difference is that, whereas confounding is a bias that the investigator hopes to prevent or, if necessary, to remove from the data, effect modification is an elaborated description of the effect itself. Effect modification is thus a finding to be reported rather than a bias to be avoided. Epidemiologic analysis is generally aimed at eliminating confounding and discovering and describing effect modification.

Statisticians call effect modification interaction, and biologists call it synergy or antagonism, depending on whether the third factor increases or decreases the effect.

[Figure 5.5: bar chart of upper gastrointestinal complications per 1,000 person-years (scale 0 to 120), with and without aspirin, by age (<50, 60–69, ≥80 years) and history of ulcers (no, yes).]
Figure 5.5 ■ Example of effect modification. The additional risk of gastrointestinal complications from aspirin is modified by age and history of peptic ulcer disease. (Data from Patrono C, Rodriguez LAG, Landolfi R, et al. Low-dose aspirin for the prevention of atherothrombosis. N Engl J Med 2005;353:2373–2383.)
Example
Aspirin has been shown to prevent cardiovascular events. Whether it should be recommended depends on a patient’s risk of cardiovascular events and of complications of aspirin, mainly upper gastrointestinal bleeding. Figure 5.5 shows that the additional rate of gastrointestinal complications (the incidence attributable to aspirin) depends on two other factors, age and prior history of peptic ulcer disease (12). In men younger than 50 years old with no history of ulcers, the rate of complications is about 1/1,000 person-years, and there is virtually no additional risk related to aspirin. Risk increases a little with age among men without a history of ulcers. However, among men with a history of ulcer disease, the rate of complications rises sharply with age, and even much more with aspirin use. At the highest level of risk, in men older than age 80 and with an ulcer history, aspirin doubles risk from 60 to 120/1,000 person-years.

This example shows that age and history of peptic ulcer disease modify the effect of aspirin on gastrointestinal complications. The additional information provided by effect modification enables clinicians to tailor their recommendations about the use of aspirin more closely to the characteristics of an individual patient.
Confounding and effect modification are independent of each other. A given variable might be a confounder, effect modifier, both, or neither, depending on the research question and data.
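
One way to see effect modification in numbers is to compute the effect of aspirin separately in each stratum and compare. The sketch below is an illustration under stated assumptions rather than an analysis from the source: the rates of 60 and 120 per 1,000 person-years are quoted in the aspirin example, but the with-aspirin rate of 1.0 in the youngest stratum is an assumed stand-in for "virtually no additional risk," and the names `rates`, `attributable_rate`, and `rate_ratio` are hypothetical.

```python
# Upper gastrointestinal complications per 1,000 person-years.
# 60 and 120 come from the aspirin example in the text; the with-aspirin
# value of 1.0 in the first stratum is an assumption standing in for
# "virtually no additional risk".
rates = {
    "age <50, no ulcer history": {"without_aspirin": 1.0, "with_aspirin": 1.0},
    "age >80, ulcer history": {"without_aspirin": 60.0, "with_aspirin": 120.0},
}


def attributable_rate(stratum):
    """Additional rate related to aspirin (rate difference)."""
    return stratum["with_aspirin"] - stratum["without_aspirin"]


def rate_ratio(stratum):
    """Rate with aspirin relative to the rate without it."""
    return stratum["with_aspirin"] / stratum["without_aspirin"]


for label, stratum in rates.items():
    print(label, attributable_rate(stratum), rate_ratio(stratum))
```

If the effect of aspirin were the same in every stratum, the two rows would agree. Here both the rate difference and the rate ratio change across strata, which is the pattern the text describes as effect modification.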

Review Questions

For question 5.1, select the best answer.

5.1. Which of the following statements is not correct for both prospective and retrospective cohort studies?
A. They measure incidence of disease directly.
B. They allow assessment of possible associations between exposure and many diseases.
C. They allow investigators to decide beforehand what data to collect.
D. They avoid bias that might occur if measurement of exposure is made after the outcome of interest is known.

Questions 5.2–5.4 are based on the following example:

A study was done examining the relationship of smoking, stroke, and age (13). The 12-year incidence per 1,000 persons (absolute risk) of stroke according to age and smoking status was:

Age      Non-smokers   Smokers
45–49    7.4           29.7
65–69    80.2          110.4

5.2. What was the relative risk of stroke of smokers compared to non-smokers in their 40s?
A. 1.4
B. 4.0
C. 22.3
D. 30.2
E. 72.8
F. 80.7

5.3. What was the attributable risk per 1,000 people of stroke among smokers compared to non-smokers in their 60s?
A. 1.4
B. 4.0
C. 22.3
D. 30.2
E. 72.8
F. 80.7

5.4. Which of the following statements about the study results is incorrect?
A. To calculate population-attributable risk of smoking among people in their 60s, additional data are needed.
B. More cases of stroke due to smoking occurred in people in their 60s than in their 40s.
C. When relative risk is calculated, the results reflect information about the incidence in exposed and unexposed persons, whereas the results for attributable risk do not.
D. The calculated relative risk is a stronger argument for smoking as cause of stroke for persons in their 40s than the calculated risk for persons in their 60s.
E. Depending on the question asked, age could be considered either a confounding variable or an effect modifier in the study.

Questions 5.5–5.11 are based on the following example:

Deep venous thrombosis (DVT) is a serious condition that occasionally can lead to pulmonary embolism and death (14). The incidence of DVT is increased by several genetic and environmental factors, including oral contraceptives (OCs) and a genetic mutation, factor V Leiden. These two risk factors, OCs and factor V Leiden, interact. Heterozygotes for factor V Leiden have 4 to 10 times the risk of DVT of the general population. In women without the genetic mutation, incidence of DVT rises from about 0.8/10,000 women/yr among those not on OCs to 3.0/10,000 women/yr for those taking the pill. The baseline incidence of DVT in heterozygotes for factor V Leiden is 5.7/10,000 women/yr, rising to 28.5/10,000 women/yr among those taking OCs. Mutations for factor V Leiden occur in about 5% of whites but are absent in Africans and Asians.

For questions 5.5–5.10, select the one best answer.

5.5. What is the absolute risk of DVT in women who do not have the mutation and do not take OCs?
A. 0.8/10,000/yr
B. 1.3/10,000/yr
C. 2.2/10,000/yr
D. 9.5/10,000/yr
E. 25.5/10,000/yr

5.6. What is the attributable risk of DVT for women taking OCs who do not carry the mutation for factor V Leiden compared to those not taking OCs and not carrying the mutation?
A. 0.8/10,000/yr
B. 1.3/10,000/yr
C. 2.2/10,000/yr
D. 9.5/10,000/yr
E. 25.5/10,000/yr

5.7. What is the attributable risk of DVT for women taking OCs who carry factor V Leiden compared to women on OCs but not carrying the mutation?
A. 0.8/10,000/yr
B. 1.3/10,000/yr
C. 2.2/10,000/yr
D. 9.5/10,000/yr
E. 25.5/10,000/yr

5.8. In a population with 100,000 white women, all of whom take OCs, what is the population-attributable risk for DVT in women who are heterozygous for factor V Leiden?
A. 0.8/10,000/yr
B. 1.3/10,000/yr
C. 2.2/10,000/yr
D. 9.5/10,000/yr
E. 25.5/10,000/yr

5.9. What is the relative risk of DVT in women taking OCs and heterozygous for factor V Leiden compared to women who take OCs but do not carry the mutation?
A. 3.8
B. 7.1
C. 9.5
D. 28.5
E. 35.6

5.10. What is the relative risk of DVT in women taking OCs and without the mutation compared to women without the mutation who are not taking OCs?
A. 3.8
B. 7.1
C. 9.5
D. 28.5
E. 35.6

5.11. Given the information in this study and calculations for questions 5.5–5.10, which of the following statements about risk of developing DVT is incorrect?
A. Factor V Leiden modifies the effect of OCs on the annual risk of developing DVT by increasing risk from about 3 per 10,000 to about 30 per 10,000 women.
B. Being heterozygous for factor V Leiden confers about twice the risk for DVT as taking OCs.
C. Women heterozygous for factor V Leiden should be advised against taking OCs because of the high relative risk for DVT among such women.

For question 5.12, select the best answer.

5.12. In a study to determine if regularly taking aspirin prevents cardiovascular death, aspirin users died as often as non-users. However, aspirin users were sicker and had illnesses more likely to be treated with aspirin. Which of the following methods is the best way to account for the propensity of people to take aspirin?
A. Calculate the absolute risk of cardiovascular death in the two groups and the risk difference attributable to using aspirin.
B. Create subgroups of aspirin users and non-users with similar indications for using the medication and compare death rates among the subgroups.
C. For each person using aspirin, match a non-user on age, sex, and comorbidity and compare death rates in the two groups.


REFERENCES
1. Dawber TR. The Framingham Study: The Epidemiology of Atherosclerotic Disease. Cambridge, MA: Harvard University Press; 1980.
2. Kannel WB, Feinleib M, McNamara PM, et al. An investigation of coronary heart disease in families. The Framingham Offspring Study. Am J Epidemiol 1979;110:281–290.
3. Wen CP, Wai JPM, Tsai MK, et al. Minimum amount of physical activity for reduced mortality and extended life expectancy: a prospective cohort study. Lancet 2011;378:1244–1253.
4. Madsen KM, Hviid A, Vestergaard M, et al. A population-based study of measles, mumps, and rubella vaccination and autism. N Engl J Med 2002;347:1477–1482.
5. The Editors of The Lancet. Retraction—ileal-lymphoid-nodular hyperplasia, non-specific colitis, and pervasive developmental disorder in children. Lancet 2010;375:445.
6. Geiger AM, Yu O, Herrinton LJ, et al. (on behalf of the CRN PROTECTS Group). A case-cohort study of bilateral prophylactic mastectomy efficacy in community practices. Am J Epidemiol 2004;159:S99.
7. Siris ES, Brenneman SK, Barrett-Connor E, et al. The effect of age and bone mineral density on the absolute, excess, and relative risk of fracture in postmenopausal women aged 55–99: results from the National Osteoporosis Risk Assessment (NORA). Osteoporosis Int 2006;17:565–574.
8. Hofman A, Vandenbroucke JP. Geoffrey Rose’s big idea. Br Med J 1992;305:1519–1520.
9. Reulen RC, Winter DL, Frobisher C, et al. Long-term cause-specific mortality among survivors of childhood cancers. JAMA 2010;304:172–179.
10. Al-Delaimy WK, Rexrode KM, Hu FB, et al. Folate intake and risk of stroke among women. Stroke 2004;35:1259–1263.
11. Rothman KJ. Modern Epidemiology. Boston: Little Brown and Co.; 1986.
12. Patrono C, Rodriguez LAG, Landolfi R, et al. Low-dose aspirin for the prevention of atherothrombosis. N Engl J Med 2005;353:2373–2383.
13. Psaty BM, Koepsell TD, Manolio TA, et al. Risk ratios and risk differences in estimating the effect of risk factors for cardiovascular disease in the elderly. J Clin Epidemiol 1990;43:961–970.
14. Vandenbroucke JP, Rosing J, Bloemenkamp KW, et al. Oral contraceptives and the risk of venous thrombosis. N Engl J Med 2001;344:1527–1535.

Chapter 6

Risk: From Disease to Exposure
“. . . take two groups presumed to be representative of persons who do and do not have
the disease and determine the percentage of each group who have the characteristic. . . .
This yields, not a true rate, but rather what is usually referred to as a relative frequency.”
—Jerome Cornfield
1952

KEY WORDS
Latency period
Case-control study
Control
Population-based case-control study
Nested case-control study
Matching
Umbrella matching
Overmatching
Recall bias
Odds ratio
Estimated relative risk
Prevalence odds ratio
Crude odds ratio
Adjusted odds ratio
Epidemic curve

Cohort studies are a wonderfully logical and direct way of studying risk, but they have practical limitations. Most chronic diseases take a long time to develop. The latency period, the period of time between exposure to a risk factor and the expression of its pathologic effects, is measured in decades for most chronic diseases. For example, smoking precedes coronary disease, lung cancer, and chronic bronchitis by 20 years or more, and osteoporosis with fractures occurs in the elderly because of diet and exercise patterns throughout life. Also, relatively few people in a cohort develop the outcome of interest, even though it is necessary to measure exposure in, and to follow up, all members of the cohort. The result is that cohort studies of risk require a lot of time and effort, not to mention money, to get an answer. The inefficiency is especially limiting for very rare diseases.
Some of these limitations can be overcome by modifications of cohort methods, such as retrospective cohort or case-cohort designs, described in the preceding chapter. This chapter describes another way of studying the relationship between a potential risk (or protective) factor and disease more efficiently: case-control studies. This approach has two main advantages over cohort studies. First, it bypasses the need to collect data on a large number of people, most of whom do not get the disease and so contribute little to the results. Second, it is faster because it is not necessary to wait from measurement of exposure until effects occur.
But efficiency and timeliness come at a cost: Managing bias is a more difficult and sometimes uncertain task in case-control studies. In addition, these studies produce only an estimate of relative risk and no direct information on other measures of effect such as absolute risk, attributable risk, and population risks, all described in Chapter 5.
The respective advantages and disadvantages of cohort and case-control studies are summarized in Table 6.1.
Despite the drawbacks of case-control studies, the trade-off between scientific strength and feasibility is often worthwhile. Indeed, case-control studies are indispensable for studying risk for very uncommon diseases, as shown in the following example.

Table 6.1
Summary of Characteristics of Cohort and Case-Control Studies

Cohort Study                                                                    Case-Control Study
Begins with a defined cohort                                                    Begins with sampled cases and controls
Exposure measured in members of the cohort                                      Exposure measured in cases and controls, sometimes after outcomes
Cases arise in the cohort during follow-up                                      Exposure occurs before samples became cases and controls
Incidence measured for exposed and non-exposed members of the cohort            Exposure measured for cases and controls
Can calculate absolute, relative, attributable, and population risks directly   Can estimate relative risk but there is no information on incidence

Example
In the mid 2000s, clinicians began reporting cases of an unusual form of femoral fracture in women. Bisphosphonates, dr

By adding a comparison group and accounting for other variables that might be related to bisphosphonate use and atypical fractures, the investigators were able to take the inference that bisphosphonates might be a cause of atypical fractures well beyond what was possible with case series alone.
This chapter, the third about risk, is titled “From Disease to Exposure” because case-control studies involve looking backward from disease to exposure, in contrast to cohort studies, which look forward from exposure to disease.

CASE-CONTROL STUDIES
The basic design of a case-control study is diagrammed in Figure 6.1. Two samples are selected: patients who have developed the disease in question and an otherwise similar group of people who have not developed the disease. The researchers then look back in time to measure the frequency of exposure to a possible risk factor in the two groups. The resulting data can be used to estimate the relative risk of disease related to a risk factor.

Example
Head injuries are relatively common among alpine skiers an
82 Clinical Epidemiology: The

EXPOSED EXPOSURE TO CASES/CONTROLS POPULATION


RISK FACTOR

YES

CCAASSEESS
(Have disease)
NO

Time

Research

YES
CCOONNTTRROOLLSS
(Do not have disease)
NO

Estimate of relative
risk
Figure 6.1 ■ Design of case-control studies.

Figure 6.2 ■ A case-control study of helmet use and head injuries among skiers and snowboarders. (Summary of Sulheim S, Holme I, Ekeland A, et al. Helmet use and risk of head injuries in alpine skiers and snowboarders. JAMA 2006;295:919–924.) [Diagram: among skiers and snowboarders at 8 major Norwegian ski resorts, cases (head injury) and controls (no head injury) are each classified as used helmet or did not use helmet, giving an estimate of relative risk. Controlled: age, sex, nationality, skill level, equipment used, ski school attendance, rented or owned equipment.]
The word control comes up in other situations, too. It is used in experimental studies to refer to people, animals, or biologic materials that have not been exposed to the study intervention. In diagnostic laboratories, “controls” refer to specimens that have a known amount of the material being tested for. As a verb, control is used to describe the process of taking into account, neutralizing, or subtracting the effects of variables that are extraneous to the main research question. Here, the term is used in the context of case-control studies to refer to people who do not have the disease or outcome under study.

DESIGN OF CASE-CONTROL STUDIES
The validity of case-control studies depends on the care with which cases and controls are selected, how well exposure is measured, and how completely potentially confounding variables are controlled.

Selecting Cases
The cases in case-control research should be new (incident) cases, not existing (prevalent) ones. The reasons are based on the concepts discussed in Chapter 2. The prevalence of a disease at a point in time is a function of both the incidence and duration of that disease. Duration is in turn determined by the rate at which patients leave the disease state (because of recovery or death) or persist in it (because of a slow course or successful palliation). It follows from these relationships that risk factors for prevalent disease may be risk factors for incidence, duration, or both; the relative contributions of the two cannot be determined. For example, if prevalent cases were studied, an exposure that caused a rapidly lethal form of the disease would result in fewer cases that were exposed, reducing relative risk and thereby suggesting that exposure is less harmful than it really is or even that it is protective.
At best, a case-control study should include all the cases or a representative sample of all cases that arise in a defined population. For example, the bisphosphonates study included all residents of Sweden in 2008 and the helmets study all skiers and snowboarders in eight major resorts in Norway (accounting for 55% of all ski runs in the country).
Some case-control studies, especially older ones, have identified cases in hospitals and referral centers where uncommon diseases are most likely to be found. This way of choosing cases is convenient, but it raises validity problems. These centers may attract particularly severe or atypical cases or those with unusual exposures, which is the wrong sample if the underlying research question in case-control studies is about ordinary occurrences of disease and exposures. Also, it is difficult in this situation to be confident that controls, however they are chosen, are truly similar to cases in all ways other than exposure, which is critical to the validity of this kind of study (see the Selecting Controls section). Fortunately, it is rarely necessary to take this scientific risk because there are many databases that make true population sampling possible.
However the cases might be identified, it should be possible for both them and controls to be exposed to the risk factor and to experience the outcome. For example, in a case-control study of exercise and sudden death, cases and controls would have to be equally able to exercise (if they chose to) to be eligible.
It goes without saying that diagnosis should be rigorously confirmed for cases (and excluded for controls), and the criteria made explicit. In the bisphosphonates study, investigators agreed on explicit criteria for atypical fractures of the femur and reviewed all radiographs, not just reports of them, to classify fracture type. One investigator then reviewed a random sample of radiographs for a second time without knowing how each had been classified, and there was complete agreement between the original and the second classifications.

Selecting Controls
Above all, the validity of case-control studies depends on the comparability of cases and controls. To be comparable, cases and controls should be members of the same base population and have an equal opportunity of being exposed. The best approach to meeting these requirements is to ensure that controls are a random sample of all non-cases in the same population or cohort that produced the cases.

The Population Approach
Studies in which cases and controls are a complete or random sample of a defined population are called population-based case-control studies. In practice, most of these populations are dynamic (that is, continually changing, with people moving in and out of the population) as described in Chapter 2 (3). This might bias the result, especially if cases and controls are sampled over a long period of time and exposure is changing rapidly during this time. This concern can be laid to rest if there is evidence that population turnover is in fact so small as to have little effect on the study results or if cases and controls are matched on calendar time, that is, controls are selected on the same date as the onset of disease in the cases.

The Cohort Approach
Another way of ensuring that cases and controls are comparable is to draw them from the same cohort. In this situation, the study is said to be a nested case-control study (it is “nested” in the cohort).
In the era of large databases and powerful computers, why not just analyze cohort data as a cohort study rather than a case-control study? After all, the inefficiency of including many exposed members of the cohort, even though few of them will experience the outcome, could be overcome by computing power. The usual reason for case-control analyses of cohort data is that some of the study variables, especially some covariates, may not be available in the cohort database and, therefore, have to be gathered from other sources for each patient in the study. Obtaining the missing information from medical records, questionnaires, genetic analyses, and linkage to other databases can be very expensive and time-consuming. Therefore, there is a practical advantage to having to assemble this information only for cases and a sample of non-cases in the cohort, not every member of the cohort.
With nested case-control studies, there is an opportunity to obtain both a crude measure of incidence from a cohort analysis and a strong estimate of relative risk, one that takes into account a rich set of covariates, from a case-control analysis. With this information one has the full set of risk measures described in Chapter 5: absolute risk for exposed and non-exposed people, relative risk, attributable risk, and population risks.
The bisphosphonate example illustrates the advantages of complementary cohort and case-control analyses. A cohort analysis, taking only age into account, showed that the increase in absolute risk of atypical fractures related to bisphosphonate use was five cases per 10,000 patient-years. Collection of data on covariates was done by linking to other databases and was presumably too resource-intensive to be done on the entire national sample. With these data for cases and controls, a much more credible estimate of relative risk was possible in the case-control analysis. The estimate of relative risk of 33 from the case-control analysis was consistent with the crude relative risk from the cohort analysis (not accounting for potential confounders other than age), which was 47. Because of the two analyses, both cohort and case-control, the authors could point out that the relative risk of atypical fracture was large but the absolute risk was small.
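
The two cohort figures quoted above can be tied together with a little arithmetic. The sketch below is illustrative only: it assumes that the risk difference (5 per 10,000 patient-years) and the crude relative risk (47) describe the same exposed versus unexposed comparison, and the function name `implied_incidences` is hypothetical. Under that assumption, the risk difference equals the unexposed incidence multiplied by (relative risk − 1), so both incidences can be backed out.

```python
def implied_incidences(risk_difference, relative_risk):
    """Back out exposed and unexposed incidence from a risk difference and a relative risk.

    Uses: risk_difference = unexposed * (relative_risk - 1).
    This sketch only handles relative_risk > 1.
    """
    if relative_risk <= 1:
        raise ValueError("this sketch assumes relative_risk > 1")
    unexposed = risk_difference / (relative_risk - 1)
    exposed = relative_risk * unexposed
    return exposed, unexposed


# Figures quoted in the text, both expressed per 10,000 patient-years.
exposed, unexposed = implied_incidences(risk_difference=5.0, relative_risk=47.0)
print(f"implied incidence in exposed:   {exposed:.2f} per 10,000 patient-years")
print(f"implied incidence in unexposed: {unexposed:.2f} per 10,000 patient-years")
```

Under these assumptions the implied incidence is roughly 5.1 per 10,000 patient-years in users and 0.1 in non-users, which is one way to see how a very large relative risk can coexist with a small absolute risk.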

Hospital and Community Controls


If population- or cohort-based sampling is not possible, a fallback position is to select controls in such a way that the selection seems to produce controls that are comparable to cases. For example, if cases are selected from a hospital ward, the controls might be selected from patients with different diseases, apparently unrelated to the exposure and disease of interest, in the same hospital. As pointed out earlier, for most risk factors and diseases, case-control studies in health care settings are more fallible than population- or cohort-based sampling because hospitalized patients are usually a biased sample of all people in the community, the people to whom the results should apply.
Another approach is to obtain controls from the community served by the hospital. However, many hospitals do not draw patients exclusively from the surrounding community; some people in the community go to other hospitals, and some people in other communities pass up their own neighborhood hospital to go to the study hospital. As a result, cases and controls may be systematically different in ways that distort the exposure-disease relationship.

Multiple Control Groups


If none of the available control groups seems ideal, one can see how choice of controls affects results by selecting several control groups with apparently complementary scientific strengths and weaknesses. Similar estimates of relative risk obtained using different control groups are evidence against bias because it is unlikely that the same biases would affect otherwise dissimilar groups in the same direction and to the same extent. If the estimates of relative risk are different, it is a signal that one or more are biased and the reasons need to be investigated.

Example

In the helmets and head injury example (2), the main control gro
Multiple Controls per Case
Having several control groups per case group should not be confused with having several controls for each case. If the number of cases is limited, as is often so with rare diseases, the study can provide more information if there is more than one control per case. More controls produce a gain in the ability to detect an increase or decrease in risk if it exists, a property of a study called “statistical power” (see Chapter 11). As a practical matter, the gain is worthwhile up to about three or four controls per case, after which little is gained by including even more controls.

Matching
If some characteristics seem especially strongly related to either exposure or disease, such that one would want to be sure that they occur similarly in cases and controls, they can be matched. With matching, for each case with a set of characteristics, the study includes one or more controls that possess the same characteristics. Researchers commonly match for age and sex, because these are often strongly related to both exposure and disease, but matching may extend beyond these demographic characteristics (e.g., to risk profile or disease severity) when other factors are known to be strongly associated with an exposure or outcome. Matching increases the useful information obtainable from a set of cases and controls by reducing differences between groups in determinants of disease other than the one being considered, thereby allowing a more powerful (sensitive) measure of association.
Sometimes, cases and controls are made comparable by umbrella matching, matching on a variable such as hospital or community that is a proxy for many other variables that could confound the exposure–disease relationship and would be difficult to measure one at a time, if that were possible at all. Examples of variables that might be captured under an umbrella include social disadvantage related to income, education, race, and ethnicity; propensity to seek health care or follow medical advice; and local patterns of health care.
Matching can be overdone, biasing study results. Overmatching can occur if investigators match on variables so closely related to exposure that exposure rates in cases and controls become more similar than they are in the population. The result is to make the observed estimate of relative risk closer to 1 (no effect). There are many reasons why the matching variable might be related to exposure. It may be part of the chain of events leading from exposure to disease. Other variables might be highly related to each other because they have similar root causes; education, income, race, and ethnicity tend to be related to each other, so if one matches on one, it will obscure effects of the others. Matching on diseases with the same treatment would result in overmatching for studies of the effects of that treatment. For example, in a study of non-steroidal anti-inflammatory drugs (NSAIDs) and renal failure, if cases and controls were matched for the presence of arthritic symptoms, which are commonly treated with NSAIDs, matched pairs would have an artificially similar history of NSAID use.
A disadvantage of matching is that once a variable is matched for, and so made similar in cases and controls, it is no longer possible to learn how it affects the exposure–disease relationship. Also, for many studies it is not possible to find matched controls for more than a few case characteristics. This can be overcome, to some extent, if the number of potential controls is huge or if the matching criteria are relaxed (e.g., by matching age within a 5-year range rather than the same year). In summary, matching is a useful way of controlling for confounding, but it can limit the questions that can be asked in the study and can cause rather than remove bias.
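
Returning to the number of controls per case: the diminishing return described under Multiple Controls per Case can be illustrated with a widely used rule of thumb that is not taken from this chapter. With m controls per case, statistical efficiency relative to an unlimited supply of controls is approximately m/(m + 1) when the true effect is near the null. The function name below is hypothetical, and the sketch is an approximation, not a formal power calculation.

```python
def relative_efficiency(controls_per_case):
    """Approximate efficiency relative to unlimited controls: m / (m + 1).

    This is a rule of thumb for effects near the null, used here only
    to illustrate diminishing returns.
    """
    m = controls_per_case
    if m < 1:
        raise ValueError("need at least one control per case")
    return m / (m + 1)


previous = 0.0
for m in range(1, 7):
    current = relative_efficiency(m)
    gain = current - previous
    print(f"{m} control(s) per case: efficiency {current:.0%}, gain {gain:.0%}")
    previous = current
```

On this approximation, four controls per case already reach about 80% of the attainable efficiency, and each further control adds only a few percentage points, which is in line with the practical limit of three or four controls per case mentioned above.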

Measuring Exposure
The validity of case-control studies also depends on avoiding misclassification when measuring exposure. The safest approach is to depend on complete, accurate records that were collected before disease developed. Examples include pharmacy records for studies of prescription drug risks, surgical records for studies of surgical complications, and stored blood specimens for studies of risk related to biomolecular abnormalities. With such records, knowledge of disease status cannot bias reporting of exposure.
However, many important exposures can only be measured by asking cases and controls or their proxies about them. Among these are exercise, diet, and over-the-counter and recreational drug use. The following example illustrates how investigators can “harden” data from interviews that are inherently vulnerable to bias.

Example

What are the risk factors for suicide in China? Investigators stu
authors noted that as with other studies that depended on a “psychological autopsy” for measurement of exposure, “interviewers were aware of the cause of death of the deceased (suicide or other injury) so we could not completely eliminate potential interviewer bias.” They went on to explain that they “tried to keep this bias to a minimum by using the same interview schedule for cases and controls, employing objective measures of potential risk factors, independently obtaining evidence from two sources (family members and close associates), and giving extensive training to interviewers.” They also chose controls who died from injuries to match for one important characteristic that might affect responses in the interview, the recent death of a family member or associate.
The study identified eight predictors of suicide: high depression symptom score, previous suicide attempt, acute stress just prior to death, low quality of life, high chronic stress, severe interpersonal conflict in the 2 days before death, a blood relative with previous suicidal behavior, and a friend or associate with previous suicidal

When cases and controls are asked to recall their previous exposures, bias can occur for several reasons. Cases, knowing they have the disease under study, may be more likely to remember whether they were exposed, a problem called recall bias. For example, parents of a child with Reye syndrome (an encephalopathy) may be more likely to recall aspirin use after widespread efforts to make parents aware of an association between aspirin use and Reye syndrome in febrile children. A man with prostate cancer might be more likely to report a prior vasectomy after stories of an association were in the news. With all the publicity surrounding the possible risks of various environmental and drug exposures, it is entirely possible that victims of disease would remember their exposures more often than people without the disease.
Investigators can limit recall bias by not telling patients the specific purpose of the study. It would be unethical not to inform participants in research about the general nature of the study question, but to provide detailed information about specific questions and hypotheses could so bias the resulting information as to commit another breach of ethics: involving subjects in a flawed research project.
Physicians may be more likely to ask about an exposure and record that information in the medical record in cases than in controls if exposure is already suspected of being a cause. Thus, a physician may be more likely to record a family history of prostate cancer in a patient with prostate cancer or to record cell phone use in a patient with brain cancer. This bias should be understandable to all students of physical diagnosis. If a resident admitting a relatively young woman with acute myocardial infarction is aware of the reported association with use of birth control pills, he or she might question the patient more intensely about birth control pill use and record this information more carefully. Protections against this kind of bias are the same as those mentioned earlier: multiple sources of information and “blinding” the data gatherers by keeping them in the dark about the specific hypothesis under study.
The existence of disease can also lead to exposure, especially when the exposure under study is a medical treatment. Early manifestations of the disease may lead to treatment, while the study question is just the other way around: whether treatment causes disease. If this problem is anticipated, it can be dealt with in the design of the study, as illustrated in the following example.

Example
Do beta-blocker drugs prevent first myocardial infarctions in pat

All that might be said about bias in measurement of exposure can also be said of confounders. Many important covariates (e.g., smoking, diet, exercise, as well as race and ethnicity) may be poorly recorded in medical records and databases and, therefore, must be obtained by interview if they are to be included in the study at all.
Multiple Exposures
Thus far, we have described case-control studies of a single, dichotomous exposure, but case-control studies are an efficient means of examining a far richer array of exposures: the effects of multiple exposures, various doses of the same exposure, and exposures that are early symptoms (not risk factors) of disease.

Example
Ovarian cancer is notoriously difficult to diagnose early in its course when treatment might be more effective. Investigators in England did a case-control study of symptoms of ovarian cancer in primary care (6). Cases were 212 women over 40 years of age diagnosed with primary ovarian cancer in 39 general practices in Devon, England, 2000–2007; 1,060 controls without ovarian cancer were matched to cases by age and practice. Symptoms were abstracted from medical records for the year before diagnosis. Seven symptoms were independently associated with ovarian cancer: abdominal distension, postmenopausal bleeding, loss of appetite, increased urinary frequency, abdominal pain, rectal bleeding, and abdominal bloating. After excluding symptoms reported in the 180 days before diagnosis (to get a better estimate of “early” symptoms), three remained independently associated with ovarian cancer: abdominal distension, urinary

THE ODDS RATIO: AN ESTIMATE OF RELATIVE RISK
Figure 6.3 shows the dichotomous classification of exposure and disease typical of both cohort and case-control studies and compares how risk is calculated differently for the two. These concepts are illustrated with the bisphosphonates study, which had both a cohort and a case-control component.

               Cases     Noncases
Exposed          a          b         a + b
Not exposed      c          d         c + d
               a + c      b + d

COHORT STUDY: Relative risk = [a/(a + b)] / [c/(c + d)]
CASE-CONTROL STUDY: Odds ratio = {[a/(a + c)] / [c/(a + c)]} / {[b/(b + d)] / [d/(b + d)]} = (a/c) / (b/d) = ad/bc

Figure 6.3 ■ Calculation of relative risk from a cohort study and odds ratio (estimated relative risk) from a case-control study.

In the cohort study, participants were divided into two groups at the outset: exposed to bisphosphonates (a + b) and not exposed to bisphosphonates (c + d). Cases of atypical fracture emerged naturally over time in the exposed group (a) and the unexposed group (c). This provides appropriate numerators and denominators to calculate the incidence of atypical fracture in the exposed
[a/(a + b)] and unexposed [c/(c + d)] members of the cohort. It was also possible to calculate the relative risk:

Relative risk = (Incidence of disease in the exposed) / (Incidence of disease in the unexposed) = [a/(a + b)] / [c/(c + d)]

The case-control study, on the other hand, began with the selection of a group of cases of atypical fracture (a + c) and a group of controls with other fractures (b + d). There is no way of knowing disease rates because these groups are determined not by nature but by the investigators’ selection criteria. Therefore, it is not possible to compute the incidence of disease among people exposed and not exposed to bisphosphonates; and it is not possible to calculate a relative risk. What does have meaning, however, is the relative frequency of exposure to bisphosphonates among cases and controls. One approach to comparing the frequency of exposure among cases and controls provides a measure of risk that is conceptually and mathematically similar to relative risk. The odds ratio is defined as the odds that a case is exposed divided by the odds that a control is exposed:

{[a/(a + c)] / [c/(a + c)]} / {[b/(b + d)] / [d/(b + d)]}

Which simplifies to:

(a/c) / (b/d),

where odds are the ratios of two probabilities, the probability of an event divided by 1 – the probability of that event.
The odds ratio can be further simplified to:

ad/bc

Referring back to Figure 6.3, the odds ratio can be obtained by multiplying diagonally across the table and dividing these cross-products.
The meaning of the odds ratio is analogous to that of relative risk obtained from cohort studies. If the frequency of exposure is higher among cases, the odds ratio will exceed 1, indicating increased risk. The stronger the association is between the exposure and disease, the higher the odds ratio. Conversely, if the frequency of exposure is lower among cases, the odds ratio will be <1, indicating protection. Because of the similarity of the information conveyed by an odds ratio and the relative risk, and the meaning more readily attached to relative risk, odds ratios are often reported as estimated relative risks.
An odds ratio is approximately equal to a relative risk when the incidence of disease is low. To see this mathematically, look at the formula for relative risk in Figure 6.3. If the number of cases in the exposed group (a) is small relative to the number of non-cases in that group (b), then a/(a + b) is approximately equal to a/b. Similarly, if the number of cases in the non-exposed group (c) is small relative to non-cases in that group (d), then c/(c + d) is approximated by c/d. Then, relative risk ≈ a/b divided by c/d, which simplifies to ad/bc, the odds ratio.
How low must the rates be for the odds ratio to be an accurate estimate of relative risk? The answer depends on the size of the relative risk (7). In general, bias in the estimate of relative risk becomes large enough to matter as disease rates in unexposed people become greater than about 1/100 or perhaps 5/100. As outcomes become more frequent, the odds ratio tends to overestimate the relative risk when it is >1 and underestimate the relative risk when it is <1. Fortunately, most diseases, particularly those examined by means of case-control studies, have considerably lower rates.
Earlier in this chapter, we described why case-control studies should be about incident (new onset) cases, not prevalent ones. Nevertheless, prevalence odds ratios are commonly calculated for prevalence studies and reported in the medical literature. The prevalence odds ratio is a measure of association but not a very informative one, not only because of difficulty distinguishing factors related to incidence versus duration but also because the rare disease assumption is less likely to be met.

CONTROLLING FOR EXTRANEOUS VARIABLES
The greatest threat to the validity of observational (cohort and case-control) studies is that the groups being compared might be systematically different in factors related to both exposure and disease; that is, there is confounding. In Chapter 5, we described
various ways of controlling for extraneous variables when looking for independent effects of exposure on disease in observational studies. All of these approaches (exclusion, matching, stratified analyses, and modeling) are also used in case-control studies, often in combination. Of course, this can only be done for characteristics that were already suspected to affect the exposure–disease relationship and were measured in the study.
Because mathematical modeling is almost always used to control for extraneous variables, in practice, calculations of odds ratios are much more complicated than the cross product of a two by two table. An odds ratio calculated directly from a 2 × 2 table is referred to as a crude odds ratio because it has not taken into account variables other than exposure and disease. After adjustment for the effects of these other variables, it is called an adjusted odds ratio.
The implicit reason for case-control studies is to find causes. However, even when extraneous variables have been controlled for by state-of-the-science methods, the possibility remains that unmeasured variables are confounding the exposure–disease relationship. Therefore, one has to settle for describing how exposure is related to disease independently of other variables included in the study and be appropriately humble about the possibility that unmeasured variables might account for the results. For these reasons, the results of observational studies are best described as associations, not causes.

INVESTIGATION OF A DISEASE OUTBREAK
Up to this point, we have described use of the case-control method to identify risk factors for chronic diseases. The same method is used to identify risk factors for outbreaks (small epidemics) of acute diseases, typically infectious diseases or poisonings. Often, the microbe or toxin is obvious early in the epidemic, after diagnostic evaluation of cases, but the mode of transmission is not. Information on how the disease was spread is needed to stop the epidemic and to understand possible modes of transmission, which might be useful in the control of future epidemics.

Example
A large outbreak of gastroenteritis, with many cases complicated by hemolytic-uremic syndrome (a potentially fatal co
renal failure, hemolytic anemia, and thrombocytopenia) occurred in Germany in May 2011 (8). During the epidemic, there were 3,816 reported cases, 845 with hemolytic-uremic syndrome. Figure 6.4 shows the epidemic curve, the number of cases over time. The immediate cause, infection with a toxin-producing strain of the bacterium Escherichia coli, was quickly identified, but the source of the infection was not. Investigators did a case-control study comparing 26 cases of hemolytic-uremic syndrome with 81 controls, matched for age and neighborhood (9). They found that 6/24 cases (25%) and 7/80 controls (9%) were exposed to sprouts, for an odds ratio of 5.8, suggesting that the infection was transmitted by eating contaminated sprouts. (Note that the odds ratio is not exactly the cross-products in this case because the calculation of odds ratio took into account the matching.) However, cucumbers and other produce were also implicated, although less strongly. To take this further, investigators did a small cohort study of people dining in groups at a single restaurant during the epidemic period. Cases were empirically defined as diners who developed bloody diarrhea or hemolytic-uremic syndrome or were found by culture to have the offending organism. Twenty percent of the cohort met these criteria, 26% of whom had hemolytic-uremic syndrome. The relative risk for sprout consumption was 14.2, and no other food was strongly associated with the disease. Sprout consumption accounted for 100% of cases. Investigators traced back the source of sprouts from the distributor that supplied the restaurant to a single producer. However, they could not culture the causal Escherichia coli from seeds in the implicated lot. Following the investigation, and after attention to the

This example also illustrates how case-control and cohort studies, laboratory studies of the responsible organism, and “shoe-leather” epidemiology during trace-back acted in concert to identify the underlying cause of the epidemic.
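
The cross-product calculation described under THE ODDS RATIO can be sketched with the sprout exposure counts quoted in the outbreak example. This is a minimal illustration, not the investigators' analysis: the function name `odds_ratio` and the cell labels are assumptions that follow the layout of Figure 6.3, and the result is a crude odds ratio that ignores the matching.

```python
def odds_ratio(a, b, c, d):
    """Crude odds ratio ad/bc for a 2 x 2 table laid out as in Figure 6.3.

    a: exposed cases      b: exposed non-cases (controls)
    c: unexposed cases    d: unexposed non-cases (controls)
    """
    return (a * d) / (b * c)


# Sprout exposure counts quoted in the example: 6/24 cases and 7/80 controls.
exposed_cases, total_cases = 6, 24
exposed_controls, total_controls = 7, 80

a = exposed_cases
b = exposed_controls
c = total_cases - exposed_cases
d = total_controls - exposed_controls

crude = odds_ratio(a, b, c, d)
print(f"crude odds ratio for sprouts: {crude:.2f}")
```

The crude value, about 3.5, differs from the reported odds ratio of 5.8, which is what the example's parenthetical note leads one to expect, because the published estimate took the matching into account.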
Figure 6.4 ■ Epidemic curve of an outbreak of Shiga-toxin-producing Escherichia coli infection in Germany. (Redrawn with permission from Frank C, Werber D, Cramer JP, et al. Epidemic profile of shiga-toxin-producing Escherichia coli O104:H4 outbreak in Germany. N Engl J Med 2011;365:1771–1780.) [Plot: number of cases (vertical axis, 0 to 250) by date of disease onset (horizontal axis, May through early July).]
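
Outbreaks such as this one, in which a fifth of the restaurant cohort became cases, are a setting where disease is not rare, so the relationship between odds ratio and relative risk described under THE ODDS RATIO is worth checking numerically. The sketch below uses made-up baseline risks and relative risks, not data from any study in this chapter, and the function names are hypothetical. It shows the odds ratio staying close to the relative risk at low baseline risk and moving away from 1 as outcomes become more frequent.

```python
def odds(probability):
    """Convert a probability (risk) to odds."""
    return probability / (1 - probability)


def odds_ratio_for(baseline_risk, relative_risk):
    """Odds ratio implied by a baseline (unexposed) risk and a true relative risk."""
    exposed_risk = baseline_risk * relative_risk
    if not 0 < exposed_risk < 1:
        raise ValueError("exposed risk must lie strictly between 0 and 1")
    return odds(exposed_risk) / odds(baseline_risk)


# Illustrative values only: baseline risk in unexposed people and true relative risk.
for baseline_risk in (0.001, 0.01, 0.05, 0.20):
    for relative_risk in (0.5, 2.0):
        estimate = odds_ratio_for(baseline_risk, relative_risk)
        print(f"baseline risk {baseline_risk:.3f}, relative risk {relative_risk}: "
              f"odds ratio {estimate:.2f}")
```

With a baseline risk of 1 per 1,000 the odds ratio is nearly identical to the relative risk; at 20% it overstates a relative risk of 2 and understates a relative risk of 0.5.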

Revie w Question s
Read the following and select the best response.

6.1. In a case-control study of oral contraceptives and myocardial infarction (heart attack), exposure to birth control pills was abstracted from medical records at the time of the myocardial infarction. Results might be biased toward finding an association by all of the following except:
A. Physicians might have asked about birth control pill use more carefully in cases.
B. Having a myocardial infarction might have led to oral contraceptive use.
C. Physicians might have been more likely to record birth control use in cases.
D. Medical record abstractors might have looked for evidence of oral contraceptive use more carefully if they knew a patient had had a myocardial infarction.
E. Patients might have recalled exposure more readily when they had a heart attack.

6.2. Investigators in Europe did a case-control study, nested in a multicountry cohort of more than 520,000 participants, of vitamin D concentration and the risk of colon cancer. They studied 1,248 cases of incident colon cancer arising in the cohort and an equal number of controls, sampled from the same cohort and matched by age, sex, and study center. Vitamin D was measured in blood samples taken years before diagnosis. Vitamin D levels were lower in patients with colon cancer, independent of a rich array of potentially confounding variables. The study results could be described by any of the following except:
A. Vitamin D levels were associated with colorectal cancer.
B. Vitamin D deficiency was a risk factor for colorectal cancer.
C. Nesting the study in a large cohort was a strength of the study.
D. The results might have been confounded with unmeasured variables related to vitamin D levels and colorectal cancer.
E. Vitamin D deficiency was a cause of colorectal cancer.

6.3. Which of the following is the most direct result of a case-control study?
A. Prevalence
B. Risk difference
C. Relative risk
D. Incidence
E. Odds ratio

6.4. The epidemic curve for an acute infectious disease describes:
A. The usual incubation period for the causal agent
B. A comparison of illness over time in exposed versus non-exposed people
C. The onset of illness in cases over time
D. The duration of illness, on average, in affected individuals
E. The distribution of time from infection to first symptoms

6.5. Which of the following is the best reason for doing a case-control analysis of a cohort study?
A. Case-control studies are a feasible way of controlling for confounders not found in the cohort dataset.
B. Case-control studies can provide all the same information more easily.
C. Case-control studies can determine incidence of disease in exposed and non-exposed members of the cohort.
D. Case-control studies are in general stronger than cohort studies.

6.6. The best way to identify cases is to obtain them from:
A. A sample from the general (dynamic) population
B. Primary care physicians’ offices
C. A community
D. A cohort representative of the population
E. A hospital

6.7. What is the best reason to include multiple control groups in a case-control study?
A. To obtain a stronger estimate of relative risk
B. There are a limited number of cases and an ample number of potential controls.
C. To control for confounding
D. To increase the generalizability of the result
E. The main control group may be systematically different from cases (other than on the exposure of interest).

6.8. Case-control studies can be used to study all of the following except:
A. The early symptoms of stomach cancer
B. Risk factors for sudden infant
death syndrome
C. The incidence of suicide in the
adult population
D. The protective effect of aspirin
E. Modes of transmission of an
infectious disease
6.9. In a case-control study of exercise and
sudden cardiac death, matching would be
useful:
A. To control for all potential
confounding variables in the study
B. To make cases and controls similar
to each other with respect to a few
major characteristics
C. To make it possible to examine the
effects of the matched variables on
estimated relative risk
D. To test whether the right controls
were chosen for the cases in the
study
E. To increase the generalizability of the study
6.10. In a case-control study of whether
prolonged air travel is a risk factor for
venous thromboembolism, 60 out of 100
cases and 40 out of 100 controls had
prolonged air travel. What was the crude
odds ratio from this study?
A. 0.44
B. 1.5
C. 2.25
D. 3.0
E. Not possible to calculate
6.11. A population-based case-control study
would be especially useful for studying:
A. The population attributable risk of disease
B. Multiple outcomes (diseases)
C. The incidence of rare diseases
D. The prevalence of disease
E. Risk factors for disease

6.12. The prevalence odds ratio of rheumatoid arthritis provides an estimate of:
A. The relative risk of arthritis
B. The attributable risk of arthritis
C. Risk factors for the duration of arthritis
D. The association between a patient
characteristic and prevalence of
arthritis
E. Risk factors for the incidence of arthritis

6.13. In an outbreak of acute gastroenteritis, a case-control study would be especially useful for identifying:
A. Characteristics of the people affected
B. The number of people affected over time
C. The microbe or toxin causing the outbreak
D. The mode of transmission
E. Where the causative agent originated

6.14. Sampling cases and controls from a defined population or cohort accomplishes which of the following?
A. It is the only way of including incident (new) cases of disease.
B. It avoids the need for inclusion and exclusion criteria.
C. It tends to include cases and controls that are similar to each other except for exposure.
D. It matches cases and controls on important variables.
E. It ensures that the results are generalizable.

6.15. Case-control studies would be useful for answering all of the following questions except:
A. Do cholesterol-lowering drugs prevent coronary heart disease?
B. Are complications more common with fiberoptic cholecystectomy than with conventional (open) surgery?
C. Is drinking alcohol a risk factor for breast cancer?
D. How often do complications occur after fiberoptic cholecystectomy?
E. How effective are antibiotics for otitis media?

6.16. In a case-control study of airplane flight and thrombophlebitis, all of the following conditions should be met for the odds ratio to be a reasonable estimate of relative risk except:
A. Controls were sampled from the same population as cases.
B. Cases and controls met the same inclusion and exclusion criteria.
C. Other variables that might be related to air travel and thrombosis were controlled for.
D. Cases and controls were equally susceptible to developing thrombophlebitis (e.g., were similar in mobility, weight, recent trauma, and previous VTE) other than for air travel.
E. The incidence of thrombophlebitis was more than 5/100.

Answers are in Appendix A.

REFERENCES
1. Schilcher J, Michaelsson K, Aspenberg P. Bisphosphonate use and atypical fractures of the femoral head. New Engl J Med 2011;364:1728–1737.
2. Sulheim S, Holme I, Ekeland A, et al. Helmet use and risk of head injuries in alpine skiers and snowboarders. JAMA 2006;295:919–924.
3. Knol MJ, Vandenbroucke JP, Scott P, et al. What do case-control studies estimate? Survey of methods and assumptions in published case-control research. Am J Epidemiol 2008;168:1073–1081.
4. Phillips MR, Yang G, Zhang Y, et al. Risk factors for suicide in China: a national case-control psychological autopsy study. Lancet 2002;360:1728–1736.
5. Psaty BM, Koepsell TD, LoGerfo JP, et al. Beta-blockers and primary prevention of coronary heart disease in patients with high blood pressure. JAMA 1989;261:2087–2094.
6. Hamilton W, Peters TJ, Bankhead C, et al. Risk of ovarian cancer in women with symptoms in primary care: population-based case-control study. BMJ 2009;339:b2998. doi:10.1136/bmj.b2998
7. Feinstein AR. The bias caused by high value of incidence for p1 in the odds ratio assumption that 1-p1 is approximately equal to 1. J Chron Dis 1986;39:485–487.
8. Frank C, Werber D, Cramer JP, et al. Epidemic profile of shiga-toxin-producing Escherichia coli O104:H4 outbreak in Germany. N Engl J Med 2011;365:1771–1780.
9. Buchholz U, Bernard H, Werber D, et al. German outbreak of Escherichia coli O104:H4 associated with sprouts. N Engl J Med 2011;365:1763–1770.

Chapter 7

Prognosis
He, who would rightly distinguish those that will survive or die, as well as those that
will be subject to disease a longer or shorter time, ought, from his knowledge and
attention, to be able to form an estimate of all symptoms, and rationally to weigh their
powers by comparison.
—Hippocrates
460–375 B.C.

KEY WORDS
Prognosis
Prognostic factors
Clinical course
Natural history
Zero time
Inception cohort
Stage migration
Event
Survival analysis
Kaplan-Meier analysis
Time-to-event analysis
Censored
Hazard ratios
Case series
Case report
Clinical prediction rules
Training set
Test set
Validation
Prognostic stratification
Sampling bias
Migration bias
Dropouts
Measurement bias
Sensitivity analysis
Best-case/worst-case analysis

When people become sick, they have a great many questions about how their illness will affect them. Is it dangerous? Could I die of it? Will there be pain? How long will I be able to continue my present activities? Will it ever go away altogether? Most patients and their families want to know what to expect, even in situations where little can be done about their illness.

Prognosis is the prediction of the course of disease following its onset. This chapter reviews the ways in which the course of disease can be described. The intention is to give readers a better understanding of a difficult but indispensable task: predicting patients’ futures as closely as possible. The objective is to avoid expressing prognoses with vagueness when unnecessary and with certainty when misleading.

Doctors and patients want to know the general course of the illness, but they want to go further and tailor this information to their particular situation as much as possible. For example, even though ovarian cancer is usually fatal in the long run, women with this cancer may live from a few months to many years, and they want to know where on this continuum their particular case is likely to fall.

Studies of prognosis are similar to cohort studies of risk. Patients are assembled who have a particular disease or illness in common, they are followed forward in time, and clinical outcomes are measured. Patient characteristics that are associated with an outcome of the disease, called prognostic factors, are identified. Prognostic factors are analogous to risk factors, except that they represent a different part of the disease spectrum, from disease to outcomes. Case-control studies of people with the disease who do and do not have a bad outcome can also estimate the relative risk associated with various prognostic factors, but they are unable to provide information on outcome rates (see Chapter 6).

DIFFERENCES IN RISK AND PROGNOSTIC FACTORS

Risk and prognostic factors differ from each other in several ways.

The Patients Are Different
Studies of risk factors usually deal with healthy people, whereas studies of prognostic factors are of sick people.

The Outcomes Are Different
For risk, the event being counted is usually the onset of disease. For prognosis, consequences of disease are counted, including death, complications, disability, and suffering.

The Rates Are Different
Risk factors are usually for low-probability events. Yearly rates for the onset of various diseases are on the order of 1/1,000 to 1/100,000 or less. As a result, relationships between exposure and disease are difficult to confirm in the course of day-to-day clinical experiences, even for astute clinicians. Prognosis, on the other hand, describes relatively frequent events. For example, several percent of patients with acute myocardial infarction die before leaving the hospital.

The Factors May be Different
Variables associated with an increased risk are not necessarily the same as those marking a worse prognosis. Often, they are considerably different for a given disease. For example, the number of well-established risk factors for cardiovascular disease (hypertension, smoking, dyslipidemia, diabetes, and family history of coronary heart disease) is inversely related to the risk of dying in the hospital after a first myocardial infarction (Fig. 7.1) (1).

[Figure 7.1 plot: adjusted OR (95% CI) on the horizontal axis, from 0.5 (reduced risk) to 2.0 (increased risk), plotted by number of risk factors.]
Figure 7.1 ■ Risk and prognostic factors for first myocardial infarction. (Redrawn with permission from Canto JC, Kiefe CI, Rogers WJ, et al. Number of coronary heart disease risk factors and mortality in patients with first myocardial infarction. JAMA 2011;306:2120–2127.)

Clinicians can often form good estimates of short-term prognosis from their own personal experience. However, they may be less able to sort out, without the assistance of research, the various factors that are related to long-term prognosis or the complex ways in which prognostic factors are related to one another.

CLINICAL COURSE AND NATURAL HISTORY OF DISEASE

Prognosis can be described as either the clinical course or natural history of disease. The term clinical course describes the evolution (prognosis) of a disease that has come under medical care and has been treated in a variety of ways that affect the subsequent course of events. Patients usually receive medical care at some time in the course of their illness when they have diseases that cause symptoms such as pain, failure to thrive, disfigurement, or unusual behavior. Examples include type 1 diabetes mellitus, carcinoma of the lung, and rabies. After such a disease is recognized, it is likely to be treated.

The prognosis of disease without medical intervention is termed the natural history of disease. Natural history describes how patients fare if nothing is done about their disease. A great many health conditions do not come under medical care, even in countries with advanced health care systems. They remain unrecognized because they are asymptomatic (e.g., many cancers of the prostate are occult and slow growing) and are, therefore, unrecognized in life. For others, such as osteoarthritis, mild depression, or low-grade anemia, people may consider their symptoms to be one of the ordinary discomforts of daily living, not a disease and, therefore, not seek medical care for them.

Example
Irritable bowel syndrome is a common condition that involves ab

[Figure 7.2 diagram: a cohort of patients with disease is grouped by exposure to a prognostic factor (exposed, not exposed) and followed over time to the disease outcome (yes, no).]
Figure 7.2 ■ Design of a cohort study of risk.
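
As a rough illustration of the design in Figure 7.2, the sketch below tabulates outcome rates in a cohort of patients grouped by exposure to a prognostic factor. The counts and the names `outcome_rates` and `cohort` are hypothetical, chosen only for illustration; they do not come from any study cited in this chapter.

```python
def outcome_rates(cohort):
    """Return the proportion experiencing the outcome in each exposure group."""
    return {group: counts["outcome"] / counts["total"]
            for group, counts in cohort.items()}


# Hypothetical cohort of patients with disease, all followed from zero time.
cohort = {
    "exposed": {"outcome": 30, "total": 120},
    "not_exposed": {"outcome": 15, "total": 120},
}

rates = outcome_rates(cohort)
# Ratio of outcome rates in patients with and without the prognostic factor.
relative_risk = rates["exposed"] / rates["not_exposed"]

for group, rate in rates.items():
    print(f"{group}: {rate:.1%} experienced the outcome")
print(f"Relative risk of the outcome: {relative_risk:.2f}")
```

Because everyone in the cohort is followed forward, both the outcome rates and their ratio can be read directly from the counts, which is what distinguishes this design from a case-control study of the same question.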

ELEMENTS OF PROGNOSTIC STUDIES

Figure 7.2 shows the basic design of a cohort study of prognosis. At best, studies of prognosis are of a defined clinical or geographic population, begin observation at a specified point in time in the course of disease, follow up all patients for an adequate period of time, and measure clinically important outcomes.

Patient Sample
The purpose of representative sampling from a defined population is to assure that study results have the greatest possible generalizability. It is sometimes possible to study prognosis in a complete sample of patients with new-onset disease in large regions. In some countries, the existence of national medical records makes population-based studies of prognosis possible.

Example
Dutch investigators studied the risk of complications of pregnancy in women with type 1 diabetes mellitus (4). The sa
less, complication rates in newborns were much higher than in the general population. Neonatal morbidity (one or more complications) occurred in 80% of infants and rates of congenital malformations and unusually large newborns (macrosomia) were three-fold to 12-fold higher than in the general population. This study suggests that good control of blood sugar alone was not sufficient to prevent complications of pregnancy in women with type 1

Even without national medical records, population-based studies are possible. In the United States, the United Network for Organ Sharing collects data on all patients with transplants, and the Surveillance, Epidemiology, and End Results (SEER) program collects incidence and survival data on all patients with new-onset cancers in several large areas of the country, comprising 28% of the U.S. population. For primary care questions, in the United States and elsewhere, individual practices in communities have banded together into “primary care research networks” to collect research data on their patients’ care.

Most studies of prognosis, especially for less common diseases, are of local patients. For these studies, it is especially important to provide the information that users can rely on to decide whether the results generalize to their own situation: patients’ characteristics (e.g., age, severity of disease, and comorbidity), the setting where they were found (e.g., primary care practices, community hospitals, or referral centers), and how they were sampled (e.g., complete, random,
or convenience sampling). Often, this information is sufficient to establish wide generalizability, for example, in studies of community-acquired pneumonia or thrombophlebitis in a local hospital.

Zero Time
Cohorts in prognostic studies should begin from a common point in time in the course of disease, called zero time, such as at the time of the onset of symptoms, diagnosis, or the beginning of treatment. If observation begins at different points in the course of disease for the various patients in a cohort, the description of their prognosis will lack precision, and the timing of recovery, recurrence, death, and other outcome events will be difficult to interpret or will be misleading. The term inception cohort is used to describe a group of patients that is assembled at the onset (inception) of their disease.

Prognosis of cancer is often described separately according to patients’ clinical stage (extent of spread) at the beginning of follow-up. If it is, a systematic change in how stage at zero time is established can result in a different prognosis for each stage even if the course of disease is unchanged for each patient in the cohort. This has been shown to happen during staging of cancer (assessing the extent of disease, with higher stages corresponding to more advanced cancer), which is done for the purposes of prognosis and choice of treatment. Stage migration occurs when a newer technology is able to detect the spread of cancer better than an older staging method. Patients who used to be classified in a lower stage are, with the newer technology, classified as being in a higher (more advanced) stage. Removal of patients with more advanced disease from lower stages results in an apparent improvement in prognosis for each stage, regardless of whether treatment is more effective or prognosis for these patients as a whole is better. Stage migration has been called the “Will Rogers phenomenon” after the humorist who said of the geographic migration in the United States during the economic depression of the 1930s, “When the Okies left Oklahoma and moved to California, they raised the average intelligence in both states” (5).

Example
Positron emission tomography (PET) scans, a sensitive test for metastases, are now used to stage non–small cell lung canc
scans were in general use and found a 5.4%

Follow-Up
Patients must be followed for a long enough period of time for most of the clinically important outcome events to have occurred. Otherwise, the observed rate will understate the true one. The appropriate length of follow-up depends on the disease. For studies of surgical site infections, the follow-up period should last for a few weeks, and for studies of the onset of AIDS and its complications in patients with HIV infection, the follow-up period should last several years.

Outcomes of Disease
Descriptions of prognosis should include the full range of manifestations of disease that would be considered important to patients. This means not only death and disease but also pain, anguish, and the inability to care for one’s self or pursue usual activities. The 5 Ds (death, disease, discomfort, disability, and dissatisfaction) are a simple way to summarize important clinical outcomes (see Table 1.2).

In their efforts to be “scientific,” physicians tend to value precise or technologically measured outcomes, sometimes at the expense of clinical relevance. As discussed in Chapter 1, clinical effects that cannot be directly perceived by patients, such as radiologic reduction in tumor size, normalization of blood chemistries, improvement in ejection fraction, or change in serology, are not clinically useful ends in themselves. It is appropriate to substitute these biologic phenomena for clinical outcomes only when the two are known to be related to each other. Thus, in patients with pneumonia, short-term persistence of abnormalities on chest radiographs may not be alarming if the patient’s fever has subsided, energy has returned, and cough has diminished.

Ways to measure patient-centered outcomes are now used in clinical research. Table 7.1 shows a simple measure of quality of life used in studies of cancer
treatment. There are also research measures for performance status, health-related quality of life, pain, and other aspects of patient well-being.

Table 7.1
A Simple Measure of Quality of Life. The Eastern Cooperative Oncology Group’s Performance Scale

Performance Status    Definition
0                     Asymptomatic
1                     Symptomatic, fully ambulatory
2                     Symptomatic, in bed less than 50% of the day
3                     Symptomatic, in bed more than 50% of the day
4                     Bedridden
5                     Dead

Adapted with permission from Oken MM, Creech RH, Tormey DC, et al. Toxicity and response criteria of the Eastern Cooperative Oncology Group. Am J Clin Oncol 1982;5:649–655.

DESCRIBING PROGNOSIS

It is convenient to summarize the course of disease as a single rate: the proportion of people experiencing an event during a fixed time period. Some rates used for this purpose are shown in Table 7.2. These rates have in common the same basic components of incidence: events arising in a cohort of patients over time.

Table 7.2
Rates Commonly Used to Describe Prognosis

Rate                          Definition (a)
5-year survival               Percent of patients surviving 5 years from some point in the course of their disease
Case fatality                 Percent of patients with a disease who die of it
Disease-specific mortality    Number of people per 10,000 (or 100,000) population dying of a specific disease
Response                      Percent of patients showing some evidence of improvement following an intervention
Remission                     Percent of patients entering a phase in which disease is no longer detectable
Recurrence                    Percent of patients who have return of disease after a disease-free interval

(a) Time under observation is either stated or assumed to be sufficiently long so that all events that will occur have been observed.

A Trade-Off: Simplicity versus More Information
Summarizing prognosis by a single rate has the virtue of simplicity. Rates can be committed to memory and communicated succinctly. Their drawback is that relatively little information is conveyed. Large differences in prognosis can be hidden within similar summary rates.

Figure 7.3 shows 5-year survival rates for patients with four conditions. For each condition, about 10% of the patients are alive at 5 years. However, the clinical courses are otherwise quite different in ways that are very important to patients. Early survival in patients with dissecting aneurysms is very poor, but if they survive the first few months, their risk of dying is much less affected by having had the aneurysm (Fig. 7.3A). Patients with locally invasive, non–small cell lung cancer experience a relatively constant mortality rate throughout the 5 years following diagnosis (Fig. 7.3B). The life of patients with amyotrophic lateral sclerosis (ALS, Lou Gehrig disease, a slowly progressive paralysis) and respiratory difficulties is not immediately threatened, but as neurologic function continues to decline over the years, the inability to breathe without assistance leads to death (Fig. 7.3C). Figure 7.3D is a benchmark. Only at age 100 years do people in the general population have a 5-year survival rate comparable to that of patients with the three diseases.

[Figure 7.3 plots: percent (vertical axis, up to 100) against years (horizontal axis, 0 to 5) in four panels: A, dissecting aneurysm; B, lung cancer; C, amyotrophic lateral sclerosis; D, age 100 years.]
Figure 7.3 ■ A limitation of 5-year survival rates: Four conditions with the same 5-year survival rate of 10%.

Survival Analysis
When interpreting prognosis, it is preferable to know the likelihood, on average, that patients with a given condition will experience an outcome at any point in time. Prognosis expressed as a summary rate does not contain this information. However, figures can show information about average time to event for any point in the course of disease. By event, we mean a dichotomous clinical outcome that can occur only once. In the following discussion, we take the common approach of describing outcomes in terms of “survival,” but the same methods apply to the reverse (time to death) and to any other outcome event such as cancer recurrence, cure of infection, freedom from symptoms, or arthritis becoming inactive.

Survival of a Cohort
The most straightforward way to learn about survival is to assemble a cohort of patients who have the

condition of interest and are at the same point in the course of their illness (e.g., onset of symptoms, diagnosis, or beginning of treatment), and then keep them all under observation until all experience the outcome or not. For a small cohort, one might then represent these patients’ clinical course, as shown in Figure 7.4A. The plot of survival against time displays steps corresponding to the death of each of the 10 patients in the cohort. If the number of patients were increased, the size of the steps would diminish. If a very large number of patients were studied, the figure would approximate a smooth curve (Fig. 7.4B). This information could then be used to predict the year-by-year, or even week-by-week, prognosis of similar patients.

[Figure 7.4 plots: survival (vertical axis, up to 100) against time in years (horizontal axis, 0 to 5) in two panels: A, 10 patients; B, 1,000 patients.]
Figure 7.4 ■ Survival of two cohorts, small and large, when all members are observed for the full period of follow-up.

Unfortunately, obtaining the information in this way is impractical for several reasons. Some of the patients might drop out of the study before the end of the follow-up period, perhaps because of another illness, a move from the study area, or dissatisfaction with the study. These patients would have to be excluded from the cohort even though considerable effort had been exerted to gather data on them until the point at which they dropped out. Also, it would be necessary to wait until all of the cohort’s members had reached each point in follow-up before the probability of surviving to that point could be calculated. Because patients ordinarily become available for a study over a period of time, at any point in calendar time, there would be a relatively long follow-up for patients who had entered the study first, but only brief experience with those who had entered more recently. The last patient who entered the study would have to reach each year of follow-up before any information on survival to that year would be available.

Survival Curves
To make efficient use of all available data from each patient in the cohort, survival analysis has been developed to estimate the survival of a cohort over time. The usual method is called Kaplan-Meier analysis, after its originators. Survival analysis can be applied to any outcomes that are dichotomous and occur only once during follow-up (e.g., time to

coronary event or to recurrence of cancer). A more general term, useful when an event other than survival is described, is time-to-event analysis.

Figure 7.5 shows a simplified survival curve. On the vertical axis is the estimated probability of surviving, and on the horizontal axis is the period of time from the beginning of observation (zero time).

The probability of surviving to any point in time is estimated from the cumulative probability of surviving each of the time intervals that preceded it. Time intervals can be made as small as necessary; in Kaplan-Meier analyses, the intervals are between each new event, such as death, and the preceding one, however short or long that is. Most of the time,

4/5 = 80%1/2 = 50%


Probability of surviving interval:
100% 100% 100%

8 at risk
Probability

20 5 at risk
1 died
3 Censored 4 still alive 2 at risk
of

10 1 died
2 Censored 1 at risk
1 still alive
0 Censored

3 4 5
Time (years)

100

80
Probability

60
of

40

20

0
1 2 3 4 5

Time (years)
Figure 7.5 ■ Example of a survival curve, with detail for one part of the curve.
Chapter 7: Prognosis 101

no one dies and the probability of surviving is 1. that the estimates on the left-hand side of the curve
When a patient dies, the probability of surviving at are sound, because more patients are at risk early
that moment is calculated as the ratio of the in follow-up. But on the right-hand side, at the tail
number of patients surviving to the number at risk of the curve, the number of patients on whom esti-
of dying at that point in time. Patients who have mates of survival are based may become relatively
already died, dropped out, or have not yet been small because deaths, dropouts, and late entrants
followed up to that point are not at risk of dying to the study, so that fewer and fewer patients are
and are, therefore, not used to estimate survival for fol- lowed for that length of time. As a result,
that time. The probability of surviving does not estimates of survival toward the end of the follow-
change during intervals in which no one dies, so it up period are imprecise and can be strongly
is recalculated only when there is a death. Although affected by what happens to relatively few
the probability at any given interval is not very patients. For example, in Figure 7.5, only one
accurate, because either nothing has happened or patient was under observation at year 5. If that one
there has been only one event in a large cohort, the remaining patient happened to die, the probability
overall probability of surviving up to each point in of surviving would fall from 8% to zero. Clearly,
time (the product of all preceding probabilities) is this would be a too literal reading of the data.
remarkably accurate. When patients are lost from Therefore, estimates of survival at the tails of
the study at any point in time, they are referred to survival curves must be interpreted with caution.
as censored and are no longer counted in the Finally, the shape of many survival curves gives
denominator from that point forward. the impression that outcome events occurs more fre-
A part of the survival curve in Figure 7.5 (from 3 to 5 years after zero time) is presented in detail to illustrate the data used to estimate survival: patients at risk, patients no longer at risk (censored), and patients experiencing outcome events at each point in time.

Variations on basic survival curves increase the amount of information they convey. Including the numbers of patients at risk at various points in time gives some idea of the contribution of chance to the observed rates, especially toward the end of follow-up. The vertical axis can show the proportion with, rather than without, the outcome event; the resulting curve will sweep upward and to the right. The precision of survival estimates, which declines with time because fewer and fewer patients are still under observation, can be shown by confidence intervals at various points in time (see Chapter 11). Tics are sometimes added to the survival curves to indicate each time a patient is censored.

Interpreting Survival Curves

Several points must be kept in mind when interpreting survival curves. First, the vertical axis represents the estimated probability of surviving for members of the cohort, not the cumulative incidence of surviving if all members of the cohort were followed up.

Second, points on a survival curve are the best estimate, for a given set of data, of the probability of survival for members of a cohort. However, the precision of these estimates depends on the number of patients on whom the estimate is based, as do all observations of samples. One can be more confident

quently early in follow-up than later on, when the slope approaches a plateau. But this impression is deceptive. As time passes, rates of survival are being applied to a diminishing number of patients, causing the slope of the curve to flatten even if the rate of outcome events did not change.

As with any estimate, Kaplan-Meier estimates of time to event depend on assumptions. It is assumed that being censored is not related to prognosis. To the extent that this is not true, a survival analysis may yield biased estimates of survival in cohorts. The Kaplan-Meier method may not be accurate enough if there are competing risks (more than one kind of outcome event) and the outcomes are not independent of each other such that one event changes the probability of experiencing the other. For example, patients with cancer who develop an infection related to aggressive chemotherapy and drop out for that reason may have had a different chance of dying of the cancer. There are other methods for estimating cumulative incidence in the presence of competing risks.

IDENTIFYING PROGNOSTIC FACTORS

Often, studies go beyond a simple description of prognosis in a homogeneous group of patients to compare prognosis in patients with different characteristics, that is, they identify prognostic factors. Multiple survival curves, one for patients with each of the characteristics, are represented on the same figure where they can be visually (and statistically) compared.
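The way a Kaplan-Meier survival curve uses patients at risk, censored patients, and outcome events can be illustrated with a minimal sketch. The function name, the follow-up times, and the event flags below are hypothetical and are not taken from any study in this chapter; the sketch assumes the usual convention that, at tied times, outcome events are counted before censoring.

```python
def kaplan_meier(times, events):
    """Return [(time, estimated survival)] at each time an outcome event occurs.

    times  -- follow-up time for each patient (hypothetical units, e.g., years)
    events -- True if the patient had the outcome event at that time,
              False if the patient was censored at that time
    """
    at_risk = len(times)   # everyone is at risk at zero time
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei)
        censored = sum(1 for ti, ei in zip(times, events) if ti == t and not ei)
        if deaths:
            # Probability of surviving this interval, applied to those still at risk
            survival *= (at_risk - deaths) / at_risk
            curve.append((t, survival))
        # Both events and censored patients leave the at-risk group
        at_risk -= deaths + censored
    return curve


# Hypothetical cohort of six patients
example_times = [1, 2, 2, 3, 4, 5]
example_events = [True, False, True, True, False, True]
example_curve = kaplan_meier(example_times, example_events)
for time, prob in example_curve:
    print(f"time {time}: estimated survival {prob:.2f}")
```

Censored patients contribute to the denominator only while they remain under observation, which is why later estimates rest on fewer patients and are less precise.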

Example

Patients with renal cell carcinoma, like those with many other cancers, have widely different chances of surviving over the several years after diagnosis. Prognosis varies according to characteristics of the cancer and patient, such as stage (how far the cancer has spread, from being limited to the kidney at one extreme to metastases to distant organs at the other), grade (how abnormal the cancer cells appear), and performance status (how well patients are able to care for themselves). A study combined these three characteristics into five prognostic groups (Fig. 7.6) (7). In the most favorable group, more than 90% of patients were alive at 8 years, whereas in the least favorable group, all were dead at 3 years. This information might be especially useful in helping patients and doctors understand what lies ahead, and it is much more informative than simply saying that “overall, 70% of patients with renal cell carcinoma survive 5 years.”

The effects of one prognostic factor relative to the effects of another can be summarized from data in a time-to-event analysis by a hazard ratio, which is analogous to a risk ratio (relative risk). Also, survival curves can be compared after taking into account other factors related to prognosis so that the independent effect of just one variable is examined.

Figure 7.6 ■ Example of prognostic stratification. Survival from surgery in a patient with renal cell cancer according to prognostic strata. Survival (0 to 100) is plotted against months (0 to 96) for Groups 1 through 5. (Redrawn with permission from Zisman A, Pantuck AJ, Dorey F, et al. Improved prognostification for renal cell carcinoma using an integrated staging system. J Clin Oncol 2001;19:1649–1657.)

CASE SERIES

A case series is a description of the course of disease in a small number of cases, a few dozen at most. An even smaller report, with fewer than 10 patients, is called a case report. Cases are typically found at a clinic or referral center and then followed forward in time to describe the course of disease and backward in time to describe what came earlier.

Such reports can make an important contribution to understanding of disease, primarily by describing experiences with newly defined syndromes or uncommon conditions. The reason for introducing case series into a chapter about prognosis is that they may masquerade as true cohort studies even though they do not have comparable strengths.

Example

Physicians in emergency departments see patients with bites from North American rattlesnakes. These bites are relatively uncommon at any one place, making it difficult to carry out large cohort studies of their clinical course, so physicians must rely mainly on case series. An example is a description of the clinical course of all 24 children managed at a children’s hospital in California during a 10-year period (8). Nineteen of the children were actually injected with venom, and they were managed with the aggressive use of antivenin. Three had surgical treatment to remove soft-tissue debris or relieve tissue pressure. There were no serious reactions to antivenin, and all patients left the hospital without functional impairment.

Physicians caring for children with rattlesnake bites would be grateful for this and other case series if there were no better information to guide their care, but that is not to say that the case series provided a complete and fully reliable picture of snakebite care. Of all children bitten in that region, some
may have been doing so well after a bite that they were not sent to the referral center. Others might have been doing so badly that they were rushed to the nearest hospital or even died before reaching a hospital at all. In other words, the case series does not describe the clinical course of all children from the time of snakebite (the inception) but rather a selected sample of children who happened to come under care at that particular hospital. In effect, case series describe the clinical course of prevalent cases, not necessarily a representative sample of incident cases, so they are “false” cohorts.

CLINICAL PREDICTION RULES

A combination of variables can provide a more precise prognosis than any of these variables taken one at a time. As discussed in Chapter 4, clinical prediction rules estimate the probability of outcomes (either prognosis or diagnosis) according to a set of patient characteristics defined by history, physical examination, and simple laboratory tests. They are “rules” because they are often tied to recommendations about further diagnostic evaluation or treatment. To make prediction rules workable in clinical settings, they depend on data that are available in the usual care of patients and on scoring, the basis for the prediction, that has been simplified.

Table 7.3
The CHADS2 Score and Risk of Stroke According to CHADS2 Score

• Diagnosis of heart failure, past or current (1 point)
• Hypertension, treated or untreated (1 point)
• Age ≥75 years (1 point)
• Diabetes mellitus (1 point)
• Secondary prevention in patients with prior ischemic stroke, transient ischemic attack, or thromboembolism (2 points)

CHADS2 score = total point count

CHADS2 Points    Stroke Risk per 100 Person-Years
0                0.49
1                1.52
2                2.50
3                5.27
4                6.02
5–6              6.88

Reproduced as published in UpToDate, Calculation: Atrial Fibrillation CHADS(2) Score for Stroke Risk, with permission from MedCalc 3000 by Foundation Internet, Pittsburgh, PA.
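The scoring in Table 7.3 is simple enough to sketch in a few lines. The function and variable names below are illustrative only; the point values and stroke rates are copied from the table, and the age criterion is taken to be 75 years or older, which is an assumption wherever the inequality sign is not legible in the table.

```python
# Stroke risk per 100 person-years by CHADS2 points, as listed in Table 7.3.
# Scores of 5 and 6 share one row in the table.
STROKE_RISK_PER_100_PERSON_YEARS = {
    0: 0.49, 1: 1.52, 2: 2.50, 3: 5.27, 4: 6.02, 5: 6.88, 6: 6.88,
}


def chads2_score(heart_failure, hypertension, age_years, diabetes, prior_stroke_or_tia):
    """Total the CHADS2 points for one patient (hypothetical helper)."""
    score = 0
    score += 1 if heart_failure else 0         # heart failure, past or current
    score += 1 if hypertension else 0          # hypertension, treated or untreated
    score += 1 if age_years >= 75 else 0       # assumed threshold: 75 years or older
    score += 1 if diabetes else 0              # diabetes mellitus
    score += 2 if prior_stroke_or_tia else 0   # prior stroke, TIA, or thromboembolism
    return score


# Hypothetical patient: 80 years old with heart failure and hypertension
example_score = chads2_score(True, True, 80, False, False)
example_risk = STROKE_RISK_PER_100_PERSON_YEARS[example_score]
print(example_score, example_risk)
```

The lookup returns a group-level rate for patients with that score, not an individual prediction.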
Example
Atrial fibrillation causes an increased risk of stroke. Clots form in the atria in the absence of regular, organized contraction
A clinical prediction rule should be developed in one setting and tested in others (with different patients, physicians, and usual care practices) to assure that predictions are good for a broad range of settings and not just for the setting where the rule was developed, because its performance there might have been the result of the particular characteristics of that setting. The data used to develop the prediction rule are called the training set, and the data used to assess its validity are called the test set.
The process of separating patients into groups
with different prognosis, as in the previous example,
is called prognostic stratification. In this case, atrial
fibrillation is the disease and stroke is the
outcome. The concept is similar to risk stratification
(Chapter 4), where patients are divided into different
strata of risk for developing disease.

BIAS IN COHORT STUDIES


In cohort studies of risk or prognosis, bias can
alter the description of the course of disease. Bias
can also create apparent differences between
groups when differences do not actually exist in
nature or obscure differences when they really do
exist. These biases have their counterparts in case-
control studies as well.

They are a separate consideration from confounding and effect modification (Chapter 5).

There are an almost infinite variety of systematic errors, many of which are given specific names, but some are more basic. They can be recognized more easily when one knows where they are most likely to occur in the course of a study. With that in mind, we describe some possibilities for bias in cohort studies and discuss them in relation to the following study of prognosis.

Example

Bell’s palsy is the sudden, one-sided, unexplained onset of weakness of the face in the area innervated by the facial ne

When thinking about the validity of this study, one should consider at least the following.

Sampling Bias

Sampling bias has occurred when the patients in a study are not like other patients with the condition. Were patients in this study like others with Bell palsy? The answer depends on the user’s perspective. Patients were “from the Copenhagen area” and apparently under the care of an ear, nose, and throat specialist, so the results generalize to other referred patients (as long as we accept that Bell palsy in Denmark is similar to this condition in other parts of the world). However, mild cases might not have been included because they were managed by local clinicians and quickly recovered or not brought to medical attention at all, which limits the applicability in primary care settings.

Sampling bias can also be misleading when prognosis is compared across groups and sampling has produced groups that are systematically different with respect to prognosis, even before the factor of interest is considered. In the Bell palsy example, older patients might have had a worse prognosis because they are the ones who had an underlying herpes virus infection, not because of their age.

Is this not confounding? Strictly speaking, it is not because the study is for the purpose of prediction, not to identify independent “causes” of recovery. Also, it is not plausible to consider such phenomena as severity of palsy or onset of recovery as causes because they are probably part of the chain of events leading from disease to recovery. But to the extent that one wants to show that prognostic factors are independent predictors of outcome, the same approaches as are used for confounding (Chapter 5) can be used to establish independence.

Migration Bias

Migration bias is present when some patients drop out of the study during follow-up and they are systematically different from those who remain. It is often the case that some members of the original cohort leave a study over time. (Patients are assured that this is their right as part of the ethical conduct of research on humans.) If dropout occurs randomly, such that the characteristics of lost patients are on average similar to patients who remain, then there would be no bias. This is so regardless of whether the number of dropouts is large or similar in the cohorts being compared, but ordinarily the characteristics of lost patients are not the same as those
who remain in a study. Dropping out tends to be
related to prognosis. For example, patients who
are doing especially well or badly with their Bell
palsy may be more likely to leave the study, as
would those who need care for other illnesses, for
whom the extra visits related to the study would
be burdensome. This would distort the main
(descriptive) results of the study—rate and
completeness of recovery. If the study also aims to
identify prognostic factors (e.g., recovery in old
versus young patients), that also could be biased
by patients dropping out, for the same reasons.
Migration bias might be seen as an example of
selection bias because patients who were still in
the study when outcomes were measured were
selected from all those who began in the study.
Migration bias might be considered an example of
measurement bias because patients who migrate out
of the study are no longer available when
outcomes are measured.

Measurement Bias

Measurement bias is present when members of the cohort are not all assessed similarly for outcome. In the Bell palsy study, all members of the cohort were examined by a common protocol every month until they were no longer improving, ruling out this possibility. If it had been left to individual patients and physicians whether, when, and how they were examined, this would have diminished confidence in the description of time to and completeness of recovery. Measurement bias also comes into play if prognostic groups are compared and patients in one group have a systematically better chance of having outcomes detected than those in another. Some outcomes, such as death, cardiovascular catastrophes, and major cancers, are so obvious that they are unlikely to be missed. But for less clear-cut outcomes, including specific cause of death, subclinical disease, side effects, or disability, measurement bias can occur because of differences in the methods with which the outcome is sought or classified.

Measurement bias can be minimized in three general ways: (i) examine all members of the cohort equally for outcome events; (ii) if comparisons of prognostic groups are made, ensure that researchers are unaware of the group to which each patient belongs; and (iii) set up careful rules for deciding if an outcome event has occurred (and follow the rules). To help readers understand the extent of these kinds of biases in a given study, it is usual practice to include, with reports of the study, a flow diagram describing how the number of participants changed as the study progressed and why. It is also helpful to compare the characteristics of patients in and out of the study after sampling and follow-up.

Bias from “Non-differential” Misclassification

Until now, we have been discussing how the results of a study can be biased when there are systematic differences in how exposure or disease groups are classified. But bias can also result if misclassification is “non-differential,” that is, it occurs similarly in the groups being compared. In this case, the bias is toward finding no effect.

Example

When cigarette smoking is assessed by simply asking people whether they smoke, there is substantial misclassification re

cigarette products in saliva. Yet in a cohort study of cigarette smoking and coronary heart disease (CHD), misclassification of smoking could not be different in people who did or did not develop CHD because the outcome was not known at the time the exposure was assessed. Even so, to the extent that smoking is incorrectly classified, it reduces whatever differences in CHD rates in smokers and nonsmokers that might have existed if all patients had been correctly classified, making a “null” effect more likely. At the extreme, if classifying smoking status were totally at random, there could be no association between smoking and CHD.

BIAS, PERHAPS, BUT DOES IT MATTER?

Clinical epidemiology is not an error-finding game. Rather, it is meant to characterize the credibility of a study so that clinicians can decide how much to rely on its results when making high-stakes decisions about patients. It would be irresponsible to ignore results of studies that meet high standards, just as clinical decisions need not be bound by the results of weak studies.

With this in mind, it is not enough to recognize that bias might be present in a study. One must go on to determine if bias is actually present in the particular study. Beyond that, one must decide whether the consequences of bias are sufficiently large that they change the conclusions in a clinically important way. If damage to the study’s conclusions is not very great, then the presence of bias is of little practical consequence and the study is still useful.

SENSITIVITY ANALYSIS

One way to decide how much bias might change the conclusions of a study is to do a sensitivity analysis, that is, to show how much larger or smaller the observed results might have been under various assumptions about the missing data or potentially biased measurements. A best-case/worst-case analysis tests the effects of the most extreme possible assumptions but is an unreasonably severe test for the effects of bias in most situations. More often, sensitivity analyses test the effects of somewhat unlikely values, as in the following example.

Example
Poliomyelitis has been eradicated in many parts of the world but late effects of infection continue. Some patients deve

If the missing members of the cohort had different rates of post-polio syndrome from those who were followed-up, how

syndrome as those who were, the true rate would have been (137 + 194)/939 = 35%. (That is, all 137 patients known to have the syndrome plus 50% of those who were not followed up divided by all members of the original cohort.) If the missing patients were half as likely to get the syndrome, the true rate would have been (137 + 48)/939 = 20%. Thus, even with an improbably large difference in post-polio syndrome rates in missing patients, the true rate would still have been in the 20% to 35% range, a useful

More or less extreme differences could have been assumed for the missing patients to explore how “sensitive” the study results were to missing members of the cohort. Sensitivity analysis is a useful way of estimating how much various kinds of bias could have affected the results of studies of all kinds: cohort and case-control studies of risk or prognosis, the accuracy of diagnostic tests, or clinical trials of the effectiveness of treatment or prevention.
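The arithmetic of the post-polio example can be written as a small sketch. The function name is hypothetical. The count of 388 missing patients is inferred from the statement that 194 is 50% of those not followed up, and the 12.5% rate is an assumption chosen to approximate the 48 additional cases used in the text.

```python
def rate_under_assumption(known_cases, n_missing, cohort_size, assumed_rate_in_missing):
    """Overall rate if missing patients had the outcome at the assumed rate."""
    assumed_cases = assumed_rate_in_missing * n_missing
    return (known_cases + assumed_cases) / cohort_size


KNOWN_CASES = 137   # patients known to have post-polio syndrome
COHORT_SIZE = 939   # all members of the original cohort
N_MISSING = 388     # inferred: 194 is stated to be 50% of those not followed up

# Two assumptions about the missing patients: a high rate and a low rate
high_estimate = rate_under_assumption(KNOWN_CASES, N_MISSING, COHORT_SIZE, 0.50)
low_estimate = rate_under_assumption(KNOWN_CASES, N_MISSING, COHORT_SIZE, 0.125)

print(f"high assumption: {high_estimate:.0%}")
print(f"low assumption:  {low_estimate:.0%}")
```

Trying a range of assumed rates in this way shows how far the overall estimate could plausibly move because of the patients who were lost to follow-up.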

Review Questions

Read the following and select the best response.

7.1. For a study of the risk of esophageal cancer in patients with Barrett esophagus (a precancerous lesion), which of the following times in the course of disease is the best example of zero time?
A. Diagnosis of Barrett esophagus for each patient
B. Death of each patient
C. Diagnosis of esophageal cancer for each patient
D. Calendar time when the first patient is enrolled in the study
E. Calendar time when no patient remains in the study

7.2. A cohort study describes the recurrence of seizure within 1 year in children hospitalized with a first febrile seizure. It compared recurrence in children who had infection versus immunization as an underlying cause for fever at the time of the first seizure. Some of the children were no longer in the study when outcome (a second seizure) was assessed at 1 year. Which of the following would have the greatest effect on study results?
A. Why the children dropped out
B. When in the course of follow-up the children left the study
C. Whether dropping out was related to prognosis
D. Whether the number of children dropping out is similar in the groups

7.3. A cohort study of prostate cancer care compares rates of incontinence in patients who were treated with surgery versus medical care alone. Incontinence is assessed from review of medical records. Which of the following is not an example of measurement bias?
A. Men were more likely to tell their surgeon about incontinence.
B. Surgeons were less likely to record complications of their surgery in the record.
C. Men who got surgery were more likely to have follow-up visits.
D. Chart abstractors used their judgment in deciding whether incontinence was present or not.
E. Rates of incontinence were higher in the men who got surgery.

7.4. A clinical prediction rule has been developed to classify the prognosis of community-acquired pneumonia. Which of the following is most characteristic of such a clinical prediction rule?
A. Calculating a score is simple.
B. The clinical data are readily available.
C. Multiple prognostic factors are included.
D. The results are used to guide further management of the patient.
E. All of the above.

7.5. A study describes the clinical course of patients who have an uncommon neurologic disease. Patients are identified at a referral center that specializes in this disease. Their medical records are reviewed for patients’ characteristics, treatments, and their current disease status. Which of the following best describes this kind of study?
A. Cohort study
B. Case-control study
C. Case series
D. Cross-sectional study
E. A randomized controlled trial

7.6. A study used time-to-event analysis to describe the survival from diagnosis of 100 patients with congestive heart failure. By the third year, 60 patients have been censored. Which of the following would not be a reason for one of these patients being censored?
A. The patient died of another cause before year 3.
B. The patient decided not to continue in the study.
C. The patient developed another disease that could be fatal.
D. The patient had been enrolled in the study for less than 3 years.

7.7. Which of the following kinds of studies cannot be used to identify prognostic factors?
A. Prevalence study
B. Time-to-event analysis
C. Case-control study
D. Cohort study

7.8. Which of the following best describes the information in a survival curve?
A. An unbiased estimate of survival even if some patients leave the study
B. The estimated probability of survival from zero time
C. The proportion of a cohort still alive at the end of follow-up
D. The rate at which original members of the cohort leave the study
E. The cumulative survival of a cohort over time

7.9. Which of the following is the most appropriate sample for a study of prognosis?
A. Members of the general population
B. Patients in primary care in the community
C. Patients admitted to a community hospital
D. Patients referred to a specialist
E. It depends on who will use the results of the study.

7.10. Investigators wish to describe the clinical course of multiple sclerosis. They take advantage of a clinical trial, already completed, in which control patients received usual care. Patients in the trial had been identified at referral centers, had been enrolled at the time of diagnosis, and had met rigorous entry criteria. After 10 years, all patients had been examined yearly and remained under observation, and 40% were still able to walk. Which of the following most limits the credibility of this study?
A. Inconsistent zero time
B. Generalizability
C. Measurement bias
D. Migration bias
E. Failure to use time-to-event methods

7.11. Many different clinical prediction rules have been developed to assess the severity of community-acquired pneumonia. Which of the following is the most important reason for choosing one to use?
A. The prediction rule classifies patients into groups with very different prognosis.
B. The prediction rule has been validated in different settings.
C. Many variables are included in the rule.
D. Prognostic factors include state-of-the-science diagnostic tests.
E. The score is calculated using computers.

7.12. In a study of patients who smoke and developed peripheral arterial disease, the hazard ratio for amputation in patients who continued to smoke, relative to those who quit smoking, is 5. Which of the following best characterizes the hazard ratio?
A. In this study, it is the rate of continued smoking divided by the rate of quitting.
B. It can be estimated from a case-control study of smoking and amputation.
C. It cannot be adjusted for the presence or absence of other factors related to prognosis.
D. It conveys information similar to relative risk.
E. It is calculated from the cumulative incidence of amputation in smokers and quitters.

7.13. In a time-to-event analysis, the event:
A. Can occur only once
B. Is dichotomous
C. Both A and B
D. Neither A nor B

Answers are in Appendix A.

REFERENCES
1. Canto JG, Kiefe CI, Rogers WJ, et al. Number of coronary heart disease risk factors and mortality in patients with first myocardial infarction. JAMA 2011;306:2120–2127.
2. Talley NJ, Zinsmeister AR, Van Dyke C, et al. Epidemiology of colonic symptoms and the irritable bowel syndrome. Gastroenterology 1991;101:927–934.
3. Ford AC, Forman D, Bailey AG, et al. Irritable bowel syndrome: A 10-year natural history of symptoms and factors that influence consultation behavior. Am J Gastroenterol 2007;103:1229–1239.
4. Evers IM, de Valk HW, Visser GHA. Risk of complications of pregnancy in women with type 1 diabetes: Nationwide prospective study in the Netherlands. BMJ 2004;328:915–918.
5. Feinstein AR, Sosin DM, Wells CK. The Will Rogers phenomenon: stage migration and new diagnostic techniques as a source of misleading statistics for survival in cancer. N Engl J Med 1985;312:1604–1608.
6. Chee KG, Nguyen DV, Brown M, et al. Positron emission tomography and improved survival in patients with lung cancer. The Will Rogers phenomenon revisited. Arch Intern Med 2008;168:1541–1549.
7. Zisman A, Pantuck AJ, Dorey F, et al. Improved prognostification for renal cell carcinoma using an integrated staging system. J Clin Oncol 2001;19:1649–1657.
8. Shaw BA, Hosalkar HS. Rattlesnake bites in children: antivenin treatment and surgical indications. J Bone Joint Surg Am 2002;84-A(9):1624.
9. Gage BF, Waterman AD, Shannon W, et al. Validation of clinical classification schemes for predicting stroke. Results from the National Registry of Atrial Fibrillation. JAMA 2001;285:2864–2870.
10. Go AS, Hylek EM, Chang Y, et al. Anticoagulation therapy for stroke prevention in atrial fibrillation. How well do randomized trials translate into clinical practice? JAMA 2003;290:2685–2692.
11. Peitersen E. Bell’s palsy: the spontaneous course of 2,500 peripheral facial nerve palsies of different etiologies. Acta Otolaryngol 2002;(549):4–30.
12. Ramlow J, Alexander M, LaPorte R, et al. Epidemiology of post-polio syndrome. Am J Epidemiol 1992;136:769–785.

Chapter 8

Diagnosis
Appearances to the mind are of four kinds. Things either are what they appear to be;
or they neither are, nor appear to be; or they are, and do not appear to be; or they
are not, yet appear to be. Rightly to aim in all these cases is the wise man’s task.
—Epictetus†
2nd century
A.D.

KEY WORDS

Diagnostic test
True positive
True negative
False positive
False negative
Gold standard
Reference standard
Criterion standard
Sensitivity
Specificity
Cutoff point
Receiver operator characteristic (ROC) curve
Spectrum
Bias
Predictive value
Positive predictive value
Negative predictive value
Posterior (posttest) probability
Accuracy
Prevalence
Prior (pretest) probability
Likelihood ratio
Probability
Odds
Pretest odds
Posttest odds
Parallel testing
Serial testing
Clinical prediction rules
Diagnostic decision-making rules

Clinicians spend a great deal of time diagnosing complaints or abnormalities in their patients, generally arriving at a diagnosis after applying various diagnostic tests. Clinicians should be familiar with basic principles when interpreting diagnostic tests. This chapter deals with those principles.

A diagnostic test is ordinarily understood to mean a test performed in a laboratory, but the principles discussed in this chapter apply equally well to clinical information obtained from history, physical examination, and imaging procedures. They also apply when a constellation of findings serves as a diagnostic test. Thus, one might speak of the value of prodromal neurologic symptoms, headache, nausea, and vomiting in diagnosing classic migraine, or of hemoptysis and weight loss in a cigarette smoker as an indication of lung cancer.

SIMPLIFYING DATA

In Chapter 3, we pointed out that clinical measurements, including data from diagnostic tests, are expressed on nominal, ordinal, or interval scales. Regardless of the kind of data produced by diagnostic tests, clinicians generally reduce the data to a simpler form to make them useful in practice. Most ordinal scales are examples of this simplification process. Heart murmurs can vary from very loud to barely audible, but trying to express subtle gradations in the intensity of murmurs is unnecessary for clinical decision making. A simple ordinal scale (grades I to VI) serves just as well. More often, complex data are reduced to a simple dichotomy (e.g., present/absent, abnormal/normal, or diseased/well). This is done particularly when test results are used to help determine treatment decisions, such as the degree of anemia that requires transfusion. For any given test result, therapeutic decisions are either/or decisions; either treatment is begun or it is withheld. When there are gradations of therapy according to the test result, the data are being treated in an ordinal fashion.

A 19th Century physician and proponent of the “numerical method” (relying on counts, not impressions) to understand the natural history of diseases such as typhoid fever.
Chapter 8: Diagnosis 109

Example

The use of blood pressure data to decide about therapy is an example of how information can be simplified for practic

THE ACCURACY OF A TEST RESULT

Diagnosis is an imperfect process, resulting in a probability rather than a certainty of being right. The doctor’s certainty or uncertainty about a diagnosis has been expressed by using terms such as “rule out” or “possible” before a clinical diagnosis. Increasingly, clinicians express the likelihood that a patient has a disease as a probability. That being the case, it behooves the clinician to become familiar with the mathematical relationships between the properties of diagnostic tests and the information they yield in various clinical situations. In many instances, understanding these issues will help the clinician reduce diagnostic uncertainty. In other situations, it may only increase understanding of the degree of uncertainty. Occasionally, it may even convince the clinician to increase his or her level of uncertainty.

A simple way of looking at the relationships between a test’s results and the true diagnosis is shown in Figure 8.1. The test is considered to be either positive (abnormal) or negative (normal), and the disease is either present or absent. There are then four possible types of test results, two that are correct (true) and two that are wrong (false). The test has given the correct result when it is positive in the presence of disease (true positive) or negative in the absence of the disease (true negative). On the other hand, the test has been misleading if it is positive when the disease is absent (false positive) or negative when the disease is present (false negative).

                      DISEASE
                      Present               Absent
TEST   Positive       True positive (a)     False positive (b)
       Negative       False negative (c)    True negative (d)

Figure 8.1 ■ The relationship between a diagnostic test result and the occurrence of disease. There are two possibilities for the test result to be correct (true positive and true negative) and two possibilities for the result to be incorrect (false positive and false negative).

The Gold Standard

A test’s accuracy is considered in relation to some way of knowing whether the disease is truly present or not, a sounder indication of the truth often referred to as the gold standard (or reference standard or criterion standard). Sometimes the standard of accuracy is itself a relatively simple and inexpensive test, such as a rapid streptococcal antigen test (RSAT) for group A streptococcus to validate the clinical impression of strep throat or an antibody test for human immunodeficiency virus infection. More often, one must turn to relatively elaborate, expensive, or risky tests to be certain whether the disease is present or absent. Among these are biopsy, surgical exploration, imaging procedures, and of course, autopsy.

For diseases that are not self-limited and ordinarily become overt over several months or even years after a test is done, the results of follow-up can serve as a gold standard. Screening for most cancers and chronic, degenerative diseases falls into this category. For them, validation is possible even if on-the-spot confirmation of a test’s performance is not feasible because the immediately available gold standard is too risky, involved, or expensive. If follow-up is used, the length of the follow-up period must be long enough for the disease to declare itself, but not so long that new cases can arise after the original testing (see Chapter 10).
11 Clinical Epidemiology: The

Because it is almost always more costly, more dangerous, or both to use more accurate ways of establishing the truth, clinicians and patients prefer simpler tests to the rigorous gold standard, at least initially. Chest x-rays and sputum smears are used to determine the cause of pneumonia, rather than bronchoscopy and lung biopsy for examination of the diseased lung tissue. Electrocardiograms and blood tests are used first to investigate the possibility of acute myocardial infarction, rather than catheterization or imaging procedures. The simpler tests are used as proxies for more elaborate but more accurate or precise ways of establishing the presence of disease, with the understanding that some risk of misclassification results. This risk is justified by the safety and convenience of the simpler tests. But simpler tests are only useful when the risks of misclassification are known and are acceptably low. This requires a sound comparison of their accuracy to an appropriate standard.

Lack of Information on Negative Tests
The goal of all clinical studies aimed at describing the value of diagnostic tests should be to obtain data for all four of the cells shown in Figure 8.1. Without all these data, it is not possible to fully evaluate the accuracy of the test. Most information about the value of a diagnostic test is obtained from clinical, not research, settings. Under these circumstances, physicians are using the test in the care of patients. Because of ethical concerns, they usually do not feel justified in proceeding with more exhaustive evaluation when preliminary diagnostic tests are negative. They are naturally reluctant to initiate an aggressive workup, with its associated risk and expense, unless preliminary tests are positive. As a result, data on the number of true negatives and false negatives generated by a test (cells c and d in Fig. 8.1) tend to be much less complete in the medical literature than data collected about positive test results.
This problem can arise in studies of screening tests because individuals with negative tests usually are not subjected to further testing, especially if the testing involves invasive procedures such as biopsies. One method that can get around this problem is to make use of stored blood or tissue banks. An investigation of prostate-specific antigen (PSA) testing for prostate cancer examined stored blood from men who subsequently developed prostate cancer and men who did not develop prostate cancer (2). The results showed that for a PSA level of 4.0 ng/mL, sensitivity over the subsequent 4 years was 73% and specificity was 91%. The investigators were able to fill in all four cells
Chapter 8: Diagnosis 111
without requiring further testing on people with negative test results. (See the following text for definitions of sensitivity and specificity.)

Lack of Information on Test Results in the Nondiseased
Some types of tests are commonly abnormal in people without disease or complaints. When this is so, the test’s performance can be grossly misleading when the test is applied to patients with the condition or complaint.

Example
Magnetic resonance imaging (MRI) of the lumbar spine is used in

Lack of Objective Standards for Disease
For some conditions, there are simply no hard-and-fast criteria for diagnosis. Angina pectoris is one of these. The clinical manifestations were described nearly a century ago, yet there is still no better way to substantiate the presence of angina pectoris than a carefully taken history. Certainly, a great many objectively measurable phenomena are related to this clinical syndrome, for example, the presence of coronary artery stenosis on angiography, delayed perfusion on a thallium stress test, and characteristic abnormalities on electrocardiograms both at rest and with exercise. All are more commonly found in patients believed to have angina pectoris, but none is so closely tied to the clinical syndrome that it can serve as the standard by which the condition is considered present or absent.
Other examples of medical conditions difficult to diagnose because of the lack of simple gold standard
tests include hot flashes, Raynaud’s disease, irritable bowel syndrome, and autism. In an effort to standardize practice, expert groups often develop lists of symptoms and other test results that can be used in combination to diagnose the clinical condition. Because there is no gold standard, however, it is possible that these lists are not entirely correct. Circular reasoning can occur: the validity of a laboratory test is established by comparing its results to a clinical diagnosis based on a careful history of symptoms and a physical examination, but once established, the test is then used to validate the clinical diagnosis gained from history and physical examination!

Consequences of Imperfect Gold Standards
Because of such difficulties, it is sometimes not possible for physicians in practice to find information on how well the tests they use compare with a thoroughly trustworthy standard. They must choose as their standard of validity another test that admittedly is imperfect, but is considered the best available. This may force them into comparing one imperfect test against another, with one being taken as a standard of validity because it has had longer use or is considered superior by a consensus of experts. In doing so, a paradox may arise. If a new test is compared with an old (but imperfect) standard test, the new test may seem worse even though it is actually better. For example, if the new test were more sensitive than the standard test, the additional patients identified by the new test would be considered false positives in relation to the old test. Similarly, if the new test is more often negative in patients who really do not have the disease, results for those patients would be considered false negatives compared with the old test. Thus, a new test can perform no better than an established gold standard test, and it will seem inferior when it approximates the truth more closely unless special strategies are used.

Example
Computed tomographic (“virtual”) colonoscopy was compared to traditional (optical) colonoscopy in screening for c
interpreting each test knowing the results of the other test. Traditional colonoscopy is usually considered the gold standard for identifying colon cancer or polyps in asymptomatic adults. However, virtual colonoscopy identified more colon cancers and adenomatous polyps (especially those behind folds in the colon) than the traditional colonoscopy. In order not to penalize the new test in comparison to the old, the investigators ingeniously created a new gold standard (a repeat optical colonoscopy after reviewing the results of both testing procedures) whenever there was disagreement between

SENSITIVITY AND SPECIFICITY
Figure 8.2 summarizes some relationships between a diagnostic test and the actual presence of disease. It is an expansion of Figure 8.1, with the addition of some useful definitions. Most of the remainder of this chapter deals with these relationships in detail.

Example
Figure 8.3 illustrates these relationships with an actual study

                    DISEASE
                Present    Absent
TEST  Positive     a          b        a + b      +PV = a/(a + b)
      Negative     c          d        c + d      –PV = d/(c + d)

Se = a/(a + c)      Sp = d/(b + d)      P = (a + c)/(a + b + c + d)
LR+ = [a/(a + c)] / [b/(b + d)]      LR– = [c/(a + c)] / [d/(b + d)]

Figure 8.2 ■ Diagnostic test characteristics and definitions. Se = sensitivity; Sp = specificity; P = prevalence; PV = predictive value; LR = likelihood ratio. Note that LR+ calculations are the same as Se/(1 – Sp) and calculations for LR– are the same as (1 – Se)/Sp.

                    DISEASE: DVT according to gold standard
                    (compression ultrasonography and/or 3-month follow-up)
                      Present    Absent
TEST: D-dimer assay for diagnosis of DVT
      Positive          55        198       253      +PV = 55/253 = 22%
      Negative           1        302       303      –PV = 302/303 = 100%

Se = 55/56 = 98%      Sp = 302/500 = 60%      P = 56/556 = 10%
LR+ = [55/(55 + 1)] / [198/(198 + 302)] = 2.5
LR– = [1/(55 + 1)] / [302/(302 + 198)] = 0.03

Se = sensitivity; Sp = specificity; P = prevalence; LR = likelihood ratio; PV = predictive value.

Figure 8.3 ■ Diagnostic characteristics of a D-dimer assay in diagnosing deep venous thrombosis (DVT). (Data from Bates SM, Kearon C, Crowther M, et al. A diagnostic strategy involving a quantitative latex D-dimer assay reliably excludes deep venous thrombosis. Ann Intern Med 2003;138:787–794.)
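The definitions in Figure 8.2 can be checked against the D-dimer counts in Figure 8.3 with a few lines of Python. This is only an illustrative sketch: the class name `TwoByTwo` and its property names are assumptions introduced here, not notation from the text; the cell labels a, b, c, and d follow Figure 8.2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Cells of a diagnostic 2 x 2 table, labeled as in Figure 8.2."""
    a: int  # test positive, disease present (true positives)
    b: int  # test positive, disease absent (false positives)
    c: int  # test negative, disease present (false negatives)
    d: int  # test negative, disease absent (true negatives)

    @property
    def total(self) -> int:
        return self.a + self.b + self.c + self.d

    @property
    def sensitivity(self) -> float:
        return self.a / (self.a + self.c)

    @property
    def specificity(self) -> float:
        return self.d / (self.b + self.d)

    @property
    def prevalence(self) -> float:
        return (self.a + self.c) / self.total

    @property
    def positive_pv(self) -> float:
        return self.a / (self.a + self.b)

    @property
    def negative_pv(self) -> float:
        return self.d / (self.c + self.d)

    @property
    def lr_positive(self) -> float:
        # Same as Se / (1 - Sp), per the note under Figure 8.2.
        return self.sensitivity / (1 - self.specificity)

    @property
    def lr_negative(self) -> float:
        # Same as (1 - Se) / Sp.
        return (1 - self.sensitivity) / self.specificity

    @property
    def accuracy(self) -> float:
        # Proportion of all results, positive and negative, that are correct.
        return (self.a + self.d) / self.total


# Counts taken from Figure 8.3 (D-dimer assay for DVT).
dvt = TwoByTwo(a=55, b=198, c=1, d=302)

for name in ("sensitivity", "specificity", "prevalence",
             "positive_pv", "negative_pv", "accuracy"):
    print(f"{name:12s} {getattr(dvt, name):.1%}")
print(f"LR+ {dvt.lr_positive:.1f}   LR- {dvt.lr_negative:.2f}")
```

Rounded to the precision used in Figure 8.3, the printed values should agree with the figure; the unrounded negative predictive value is 302/303, slightly below 100%.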
Definitions
As can be seen in Figure 8.2, sensitivity is defined as the proportion of people with the disease who have a positive test for the disease. A sensitive test will rarely miss people with the disease. Specificity is the proportion of people without the disease who have a negative test. A specific test will rarely misclassify people as having the disease when they do not.
Applying these definitions to the DVT example (Fig. 8.3), we see that 55 of the 56 patients with DVT had positive D-dimer results, for a sensitivity of 98%. However, of the 500 patients who did not have DVT, D-dimer results were correctly negative for only 302, for a specificity of 60%.

Use of Sensitive Tests
Clinicians should take the sensitivity and specificity of a diagnostic test into account when selecting a test. A sensitive test (i.e., one that is usually positive in the presence of disease) should be chosen when there is an important penalty for missing a disease. This would be so, for example, when there is reason to suspect a dangerous but treatable condition, such as tuberculosis, syphilis, or Hodgkin lymphoma, or in a patient suspected of having DVT. Sensitive tests are also helpful during the early stages of a diagnostic workup, when several diagnoses are being considered, to reduce the number of possibilities. Diagnostic tests are used in these situations to rule out diseases with a negative result of a highly sensitive test (as in the DVT example). As another example, one might choose the highly sensitive HIV antibody test early in the evaluation of lung infiltrates and weight loss to rule out an AIDS-related infection. In summary, a highly sensitive test is most helpful to the clinician when the test result is negative.

Use of Specific Tests
Specific tests are useful to confirm (or “rule in”) a diagnosis that has been suggested by other data. This is because a highly specific test is rarely positive in the absence of disease; it gives few false-positive results. (Note that in the DVT example, the D-dimer test was not specific enough [60%] to initiate treatment after a positive test. All patients with positive results underwent compression ultrasonography, a much more specific test.) Highly specific tests are particularly needed when false-positive results can harm the patient physically, emotionally, or financially. Thus, before patients are subjected to cancer chemotherapy, with all its attendant risks, emotional trauma, and financial costs, tissue diagnosis (a highly specific test) is generally required. In summary, a highly specific test is most helpful when the test result is positive.

Trade-Offs between Sensitivity and Specificity
It is obviously desirable to have a test that is both highly sensitive and highly specific. Unfortunately, this is often not possible. Instead, whenever clinical data take on a range of values, there is a trade-off between the sensitivity and specificity for a given diagnostic test. In those situations, the location of a cutoff point, the point on the continuum between normal and abnormal, is an arbitrary decision. As a consequence, for any given test result expressed on a continuous scale, one characteristic, such as sensitivity, can be increased only at the expense of the other (e.g., specificity). Table 8.1 demonstrates this interrelationship for the use of B-type natriuretic peptide (BNP) levels in the diagnosis of congestive heart failure among patients presenting to emergency departments with acute dyspnea (6). If the cutoff level of the test were set too low (≥50 pg/mL), sensitivity would be high (97%), but the trade-off would be low specificity (62%), which would require many patients without congestive heart failure to undergo further testing for it. On the other hand, if the cutoff level were set too high (≥150 pg/mL), more patients with congestive heart failure would be missed. The authors suggested that an acceptable compromise would be a cutoff level of 100 pg/mL, which has a sensitivity of 90% and a specificity of 76%. There is no way, using a BNP test alone, that one can improve both the sensitivity and specificity of the test at the same time.

Table 8.1
Trade-Off between Sensitivity and Specificity When Using BNP Levels to Diagnose Congestive Heart Failure

BNP Level (pg/mL)    Sensitivity (%)    Specificity (%)
50                   97                 62
80                   93                 74
100                  90                 76
125                  87                 79
150                  85                 83

Adapted with permission from Maisel AS, Krishnaswamy P, Nowak RM, et al. Rapid measurement of B-type natriuretic peptide in the emergency diagnosis of heart failure. N Engl J Med 2002;347:161–167.
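One way to make the trade-off in Table 8.1 concrete is to count errors at each cutoff. The sketch below assumes two hypothetical groups of 1,000 patients each, one with and one without congestive heart failure; the group size and the names `BNP_TABLE` and `errors_per_1000` are illustrative assumptions, while the percentages are those in Table 8.1.

```python
# (cutoff in pg/mL, sensitivity %, specificity %), copied from Table 8.1
BNP_TABLE = [
    (50, 97, 62),
    (80, 93, 74),
    (100, 90, 76),
    (125, 87, 79),
    (150, 85, 83),
]


def errors_per_1000(sensitivity_pct: int, specificity_pct: int) -> tuple[int, int]:
    """Return (missed cases per 1,000 diseased, false positives per 1,000 nondiseased)."""
    missed = 10 * (100 - sensitivity_pct)            # false negatives
    false_positives = 10 * (100 - specificity_pct)   # false positives
    return missed, false_positives


print("cutoff  missed/1000 with CHF  false positives/1000 without CHF")
for cutoff, se, sp in BNP_TABLE:
    missed, false_pos = errors_per_1000(se, sp)
    print(f"{cutoff:>6}  {missed:>20}  {false_pos:>32}")
```

Raising the cutoff from 50 to 150 pg/mL lowers the false positives in the nondiseased group but raises the number of missed cases in the diseased group, which is the trade-off the text describes.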
[Figure 8.4: ROC plot. Vertical axis: sensitivity (%) (true-positive rate); horizontal axis: 1 – specificity (%) (false-positive rate); opposite axes: specificity (%) and 1 – sensitivity. Plotted cutoff points: 50, 80, 100, 125, and 150 pg/mL.]

Figure 8.4 ■ A receiver operator characteristic (ROC) curve. The accuracy of B-type natriuretic peptide (BNP) in the emergency diagnosis of heart failure with various cutoff levels of BNP between dyspnea due to congestive heart failure and other causes. (Adapted with permission from Maisel AS, Krishnaswamy P, Nowak RM, et al. Rapid measurement of B-type natriuretic peptide in the emergency diagnosis of heart failure. N Engl J Med 2002;347:161–167.)
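The points of an ROC curve such as Figure 8.4 can be rebuilt from Table 8.1 by pairing each cutoff's false-positive rate (1 – specificity) with its true-positive rate (sensitivity). The sketch below also estimates the area under the curve with the trapezoid rule. Anchoring the curve at (0, 0) and (1, 1) and joining the five points with straight lines are assumptions made here for illustration, so the result is a coarse approximation and need not match any published area.

```python
# (cutoff in pg/mL, sensitivity %, specificity %), copied from Table 8.1
BNP_TABLE = [(50, 97, 62), (80, 93, 74), (100, 90, 76), (125, 87, 79), (150, 85, 83)]


def roc_points(table):
    """Return (false-positive rate, true-positive rate) pairs sorted by FPR,
    with the conventional end points (0, 0) and (1, 1) added."""
    inner = sorted(((100 - sp) / 100, se / 100) for _, se, sp in table)
    return [(0.0, 0.0)] + inner + [(1.0, 1.0)]


def trapezoid_auc(points):
    """Area under a piecewise-linear curve through the given points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


bnp_points = roc_points(BNP_TABLE)
bnp_auc = trapezoid_auc(bnp_points)
coin_flip_auc = trapezoid_auc([(0.0, 0.0), (1.0, 1.0)])  # the useless-test diagonal

print(f"approximate area under the BNP curve: {bnp_auc:.2f}")
print(f"area under the diagonal:              {coin_flip_auc:.2f}")
```

The diagonal encloses an area of 0.5, so the further a test's area rises above 0.5 toward 1, the better it discriminates.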

The Receiver Operator Characteristic (ROC) Curve
Another way to express the relationship between sensitivity and specificity for a given test is to construct a curve, called a receiver operator characteristic (ROC) curve. An ROC curve for the BNP levels in Table 8.1 is illustrated in Figure 8.4. It is constructed by plotting the true-positive rate (sensitivity) against the false-positive rate (1 – specificity) over a range of cutoff values. The values on the axes run from a probability of 0 to 1 (0% to 100%). Figure 8.4 illustrates visually the trade-off between sensitivity and specificity.
Tests that discriminate well crowd toward the upper left corner of the ROC curve; as the sensitivity is progressively increased (the cutoff point is lowered), there is little or no loss in specificity until very high levels of sensitivity are achieved. Tests that perform less well have curves that fall closer to the diagonal running from lower left to upper right. The diagonal shows the relationship between true-positive and false-positive rates for a useless test, one that gives no additional information to what was known before the test was performed, equivalent to making a diagnosis by flipping a coin.
The ROC curve shows how severe the trade-off between sensitivity and specificity is for a test and can be used to help decide where the best cutoff point should be. Generally, the best cutoff point is at or near the “shoulder” of the ROC curve, unless there are clinical reasons for minimizing either false negatives or false positives.
ROC curves are particularly valuable ways of comparing different tests for the same diagnosis. The overall accuracy of a test can be described as the area under the ROC curve; the larger the area, the better the test. Figure 8.5 compares the ROC curve for BNP to that of an older test, ventricular ejection fraction determined by echocardiography (7). BNP is both more sensitive and more specific, with a larger area
[Figure 8.5: two ROC curves with labeled cutoff points, one for BNP and one for EF. Vertical axis: sensitivity; horizontal axis: 1 – specificity (%). BNP = B-type natriuretic peptide; EF = ejection fraction.]

Figure 8.5 ■ ROC curves for the BNP and left ventricular ejection fractions by echocardiograms in the emergency diagnosis of congestive heart failure in patients presenting with acute dyspnea. Overall, BNP is more sensitive and more specific than ejection fractions, resulting in more area under the curve. (Redrawn with permission from Steg PG, Joubin L, McCord J, et al. B-type natriuretic peptide and echocardiographic determination of ejection fraction in the diagnosis of congestive heart failure in patients with acute dyspnea. Chest 2005;128:21–29.)

under the curve (0.89) than that for ejection fraction (0.78). It is also easier and faster to obtain in an emergency setting, test characteristics important in clinical situations when quick results are needed.
Obviously, tests that are both sensitive and specific are highly sought after and can be of enormous value. However, practitioners must frequently work with tests that are not both highly sensitive and specific. In these instances, they must use other means of circumventing the trade-off between sensitivity and specificity. The most common way is to use the results of several tests together, as discussed later in this chapter.

ESTABLISHING SENSITIVITY AND SPECIFICITY
Often, a new diagnostic test is described in glowing terms when first introduced, only to be found wanting later when more experience with it has accumulated. Initial enthusiasm followed by disappointment arises not from any dishonesty on the part of early investigators or unfair skepticism by the medical community later. Rather, it is related to limitations in the methods by which the properties of the test were established in the first place. Sensitivity and specificity may be inaccurately described because an improper gold standard has been chosen, as discussed earlier in this chapter. In addition, two other issues related to the selection of diseased and nondiseased patients can profoundly affect the determination of sensitivity and specificity as well. They are the spectrum of patients to which the test is applied and bias in judging the test’s performance. Statistical uncertainty, related to studying a relatively small number of patients, also can lead to inaccurate estimates of sensitivity and specificity.
Spectrum of Patients
Difficulties may arise when the patients used to describe the test’s properties are different from those to whom the test will be applied in clinical practice. Early reports often assess the test’s characteristics among people who are clearly diseased compared with people who are clearly not diseased, such as medical student volunteers. The test may be able to distinguish between these extremes very well but perform less well when differences are subtler. Also, patients with disease often differ in severity, stage, or duration of the disease, and a test’s sensitivity will tend to be higher in more severely affected patients.

Example
Ovarian cancer, the fourth most common non-skin cancer in women, has spread beyond the ovary by the time it is clinicall

Some people in whom disease is suspected may have other conditions that cause a positive test, thereby increasing the false-positive rate and decreasing specificity. In the example of guidelines for ovarian cancer evaluation, specificity was low for all cancer stages (60%). One reason for this is that levels of the cancer marker, CA-125, recommended by guidelines, are elevated by many diseases and conditions other than ovarian cancer. These extraneous conditions decreased the specificity and increased the false-positive rate of the guidelines. Low specificity of diagnostic and screening tests is a major problem for ovarian cancer and can lead to surgery on many women without cancer.
Disease spectrum and prevalence of disease are especially important when a test is used for screening instead of diagnosis (see Chapter 10 for a more detailed discussion of screening). In theory, the sensitivity and specificity of a test are independent of the prevalence of diseased individuals in the sample in which the test is being evaluated. (Work with Fig. 8.2 to confirm this for yourself.) In practice, however, several characteristics of patients, such as stage and severity of disease, may be related to both the sensitivity and the specificity of a test and to the prevalence because different kinds of patients are found in high- and low-prevalence situations. Screening for disease illustrates this point; screening involves the use of the test in an asymptomatic population in which the prevalence of the disease is generally low and the spectrum of disease favors earlier and less severe cases. In such situations, sensitivity tends to be lower and specificity higher than when the same test is applied to patients suspected of having the disease, more of whom have advanced disease.

Example
A study was made of the sensitivity and specificity of the clin

Bias
The sensitivity and specificity of a test should be established independently of the means by which the true diagnosis is established. Otherwise, there could be a biased assessment of the test’s properties. As already pointed out, if the test is evaluated using data obtained during the course of a clinical evaluation of patients suspected of having the disease in question, a positive test may prompt the clinician to continue pursuing
the diagnosis, increasing the likelihood that the disease will be found. On the other hand, a negative test may cause the clinician to abandon further testing, making it more likely that the disease, if present, will be missed.
Therefore, when the sensitivity and specificity of a test are being assessed, the test result should not be part of the information used to establish the diagnosis. In studying DVT diagnosis by D-dimer assay (discussed earlier), the investigators made sure that the physicians performing the gold standard tests (ultrasonography and follow-up assessment) were unaware of the results of the D-dimer assays so that the results of the D-dimer assays could not influence (bias) the interpretation of ultrasonography (10).
In the course of routine clinical care, this kind of bias can be used to advantage, especially if the test result is subjectively interpreted. Many radiologic imaging interpretations are subjective, and it is easy to be influenced by the clinical information provided. All clinicians have experienced having imaging studies overread because of a clinical impression or, conversely, of going back over old studies in which a finding was missed because a clinical fact was not communicated at the time and, therefore, attention was not directed to a particular area. Both to minimize and to take advantage of these biases, some radiologists prefer to read imaging studies twice, first without, then with the clinical information.
All the biases discussed tend to increase the agreement between the test and the gold standard. That is, they tend to make the test seem more accurate than it actually is.

Chance
Values for sensitivity and specificity are usually estimated from observations on relatively small samples of people with and without the disease of interest. Because of chance (random variation) in any one sample, particularly if it is small, the true sensitivity and specificity of the test can be misrepresented, even if there is no bias in the study. The particular values observed are compatible with a range of true values, typically characterized by the “95% confidence intervals” (see Chapter 12).† The width of this range of values defines the precision of the estimates of sensitivity and specificity. Therefore, reported values for sensitivity and specificity should not be taken too literally if a small number of patients is studied.
Figure 8.6 shows how the precision of estimates of sensitivity increases as the number of people on which the estimate is based increases. In this particular example, the observed sensitivity of the diagnostic test is 75%. Figure 8.6 shows that if this estimate is based on only 10 patients, by chance alone, the true sensitivity could be as low as 45% and as high as nearly 100%. When more patients are studied, the 95% confidence interval narrows and the precision of the estimate increases.

[Figure 8.6: plot of the observed sensitivity and its 95% confidence intervals. Vertical axis: sensitivity (0 to 100); horizontal axis: number of people observed (0 to 50).]

Figure 8.6 ■ The precision of an estimate of sensitivity. The 95% confidence interval for an observed sensitivity of 75%, according to the number of people observed.

PREDICTIVE VALUE
Sensitivity and specificity are properties of a test that should be taken into account when deciding whether to use the test. However, once the results of a diagnostic test are available, whether positive or negative, the sensitivity and specificity of the test are no longer relevant because these values are obtained in persons known to have or not to have the disease. But if one knew the disease status of the patient, it would not be necessary to order the test! For the clinician, the dilemma is to determine if the patient has the disease, given the results of a test. (In fact, clinicians are usually more concerned with this question than the sensitivity and specificity of the test.)

†The 95% confidence interval of a proportion is easily estimated by the following formula, based on the binomial theorem: $p \pm 2\sqrt{p(1 - p)/N}$, where p is the observed proportion and N is the number of people observed. To be more exact, multiply by 1.96 instead of 2.
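The footnote's approximation for the 95% confidence interval of a proportion can be sketched in Python as follows. The function name `approx_ci95` is an assumption introduced here, and clipping the limits to the range 0 to 1 is an added convenience that the footnote does not mention. The approximation is rough when N is small, so its limits need not reproduce those plotted in Figure 8.6.

```python
import math


def approx_ci95(p: float, n: int, z: float = 2.0) -> tuple[float, float]:
    """Approximate 95% confidence interval for an observed proportion p
    based on n people: p +/- z * sqrt(p * (1 - p) / n).

    z = 2 follows the footnote; pass z = 1.96 to be more exact.
    The limits are clipped to [0, 1] (an assumption made in this sketch).
    """
    half_width = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)


# Observed sensitivity of 75%, as in Figure 8.6, for several sample sizes.
for n in (10, 20, 30, 40, 50):
    low, high = approx_ci95(0.75, n)
    print(f"N = {n:2d}: {low:.0%} to {high:.0%}")
```

The interval narrows as N grows, which is the point of Figure 8.6: the same observed sensitivity is a much more precise estimate when it rests on 50 patients than on 10.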

Definitions
The probability of disease, given the results of a test, is called the predictive value of the test (see Fig. 8.2).
[Figure 8.7: plot of positive predictive value (vertical axis, 0 to 100) against prevalence (horizontal axis: 1/5, 1/10, 1/50, 1/100, 1/1,000, 1/10,000), with one curve for each sensitivity/specificity pair: 80/80, 90/90, and 99/99.]

Figure 8.7 ■ Positive predictive value according to sensitivity, specificity, and prevalence of disease.

Positive predictive value is the probability of disease in a patient with a positive (abnormal) test result. Negative predictive value is the probability of not having the disease when the test result is negative (normal). Predictive value answers the question, “If my patient’s test result is positive (negative), what are the chances that my patient does (does not) have the disease?” Predictive value is sometimes called posterior (or posttest) probability, the probability of disease after the test result is known. Figure 8.3 illustrates these concepts. Of the 253 patients with positive D-dimer assays, only 55 (22%) had DVT (positive predictive value). The negative predictive value of the test was much better, almost 100%.
The term accuracy is sometimes used to summarize the overall value of a test. Accuracy is the proportion of all test results, both positive and negative, that is correct. For the DVT example in Figure 8.3, the accuracy of D-dimer assays was 64%. (Calculate this for yourself.) The area under the ROC curve is another useful summary measure of the information provided by a test result. However, these summary measures usually are too crude to be useful clinically because specific information about the component parts (sensitivity, specificity, and predictive value at specific cutoff points) is lost when they are aggregated into a single number.

Determinants of Predictive Value
The predictive value of a test is not a property of the test alone. It is determined by the sensitivity and specificity of the test and the prevalence of disease in the population being tested, when prevalence has its customary meaning, the proportion of persons in a defined population at a given point in time having the condition in question. Prevalence is also called prior (or pretest) probability, the probability of disease before the test result is known. (For a full discussion of prevalence, see Chapter 2.)
The more sensitive a test is, the better will be its negative predictive value (the more confident the clinician can be that a negative test result rules out the disease being sought). Conversely, the more specific the test is, the better will be its positive predictive value (the more confident the clinician can be that a positive test confirms or rules in the diagnosis being sought). Because predictive value is also influenced by prevalence, it is not independent of the setting in which the test is used. Positive results, even for a
very specific test, when applied to patients with a low likelihood of having the disease, will be largely false positives. Similarly, negative results, even for a very sensitive test, when applied to patients with a high chance of having the disease, are likely to be false negatives. In summary, the interpretation of a positive or negative diagnostic test result varies from setting to setting, according to the prevalence of disease in the particular setting.
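The dependence of predictive value on prevalence can be worked through numerically. The sketch below computes positive and negative predictive values from sensitivity, specificity, and prevalence by filling in the expected proportions in the four cells of Figure 8.2; the function name `predictive_values` is an assumption introduced here, and the sensitivity/specificity pairs and prevalences are those plotted in Figure 8.7.

```python
def predictive_values(sensitivity: float, specificity: float,
                      prevalence: float) -> tuple[float, float]:
    """Return (positive predictive value, negative predictive value).

    Each term is the expected proportion of the tested population
    falling in one cell of the 2 x 2 table in Figure 8.2.
    """
    true_pos = sensitivity * prevalence                # cell a
    false_pos = (1 - specificity) * (1 - prevalence)   # cell b
    false_neg = (1 - sensitivity) * prevalence         # cell c
    true_neg = specificity * (1 - prevalence)          # cell d
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Sensitivity/specificity pairs and prevalences shown in Figure 8.7.
prevalences = [1 / 5, 1 / 10, 1 / 50, 1 / 100, 1 / 1_000, 1 / 10_000]
for se_sp in (0.80, 0.90, 0.99):
    row = [predictive_values(se_sp, se_sp, p)[0] for p in prevalences]
    print(f"Se = Sp = {se_sp:.0%}: " + "  ".join(f"{ppv:6.1%}" for ppv in row))
```

Holding sensitivity and specificity constant, positive predictive value falls steadily as prevalence falls, even for a test that is 99% sensitive and 99% specific. Feeding in the D-dimer values from Figure 8.3 (sensitivity 55/56, specificity 302/500, prevalence 56/556) returns the same predictive values as the table itself, 55/253 and 302/303.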
It is not intuitively obvious why prevalence should affect interpretation of a test result. For those who are skeptical, it might help to consider how a test would perform at the extremes of prevalence. Remember that no matter how sensitive and specific a test might be (short of perfection), there will still be a small proportion of patients who are misclassified by it. Imagine a population in which no one has the disease. In such a group, all positive results, even for a very specific test, will be false positives. Therefore, as the prevalence of disease in a population approaches zero, the positive predictive value of a test also approaches zero. Conversely, if everyone in a population tested has the disease, all negative results will be false negatives, even for a very sensitive test. As prevalence approaches 100%, negative predictive value approaches zero. Another way for skeptics to convince themselves of these relationships is to work with Figure 8.3, holding sensitivity and specificity constant, changing prevalence, and calculating the resulting predictive values.
Figure 8.7 illustrates the effect of prevalence on positive predictive value for a test at different but generally high levels of sensitivity and specificity. When the prevalence of disease in the population tested is relatively high (more than several percent), the test performs well. But at lower prevalences, the positive predictive value drops to nearly zero (negative predictive value improves), and the test is virtually useless. (The figure illustrates why positive predictive value of a test is so much better in diagnostic studies, which often can evaluate tests with a few hundred patients,
Chapter 8: Diagnosis 123

than in screening studies, which usually must use tens of thousands of patients, because of
the difference of underlying prevalence of disease in the two situations.) As sensitivity
and specificity fall, the influence of prevalence on predictive value becomes more
pronounced.
It is sometimes possible in clinical situations to manipulate prevalence of a disease so
that a diagnostic test becomes more useful.

Example
As indicated earlier in this chapter, DVT, a potentially dangerous cause of leg pain, is
difficult to diagnose without specialized testing. Many patients with leg pain do not have
DVT, and it is important to differentiate patients who do and do not have DVT because
treatment for it (anticoagulation) is risky. It would be helpful to have a quick and easy
diagnostic test that can rule out DVT, but, as is clear from Figure 8.3, D-dimer assays are
sensitive but not specific. However, even a test with relatively poor specificity might
work reasonably well in a group of patients with a low prevalence of disease. An effort was
made to rule out DVT and improve negative predictive value by assigning patients to groups
with differing probabilities (prevalences) of DVT before applying the diagnostic test. In a
synthesis of 14 studies evaluating the use of D-dimer assays in more than 8,000 patients,
investigators found that the overall prevalence of DVT was 19% (11). Using a simple
validated clinical rule involving several clinical findings and history items, patients
were stratified into groups of low (5%), medium (17%), and high (53%) probabilities
(prevalence) of DVT and then tested with high-sensitivity D-dimer assays. Sensitivity was
high (95% to 98%) in all the groups, but specificity varied according to DVT prevalence:
58% among patients with low probability, 41% among those with medium probability, and 36%
among those with high probability. Although specificity of 58% is not very high, when the
D-dimer assay results were negative among patients with a low clinical probability of DVT,
the negative predictive value was 99%; fewer than 1% of patients in this group were
diagnosed with DVT after at least 3 months of follow-up. Thus, the test was especially
useful for excluding DVT in patients with a low probability of DVT (about 40% of all
patients) without further testing. This is an example of how a negative result of a
nonspecific test becomes clinically useful when used in a group of patients with a very low
prevalence of disease.

As is clear from the example and Figure 8.6, prevalence is usually more important than
sensitivity and specificity in determining predictive value. One reason why this is so is
that prevalence can vary over a much wider range than sensitivity and specificity can.
Prevalence of disease can vary from 1 in 1 million to 1 in 10, depending on the age,
gender, risk factors, and clinical findings of the patient. Consider the difference in
prevalence of liver disease in healthy, young adults who do not use drugs, are not sexually
promiscuous, and consume only occasional alcohol as compared to jaundiced intravenous drug
users who have multiple sex partners. In contrast, sensitivity and specificity of
diagnostic tests usually vary over a much narrower range, from about 50% to 99%.

Estimating Prevalence (Pretest Probability)

Because prevalence of disease is such a powerful determinant of how useful a diagnostic
test will be, clinicians should consider the probability of disease before ordering a test.
But how can a doctor estimate the prevalence or probability of a particular disease in
patients? Until recently, they tended to rely on clinical observations and their experience
to estimate (usually implicitly) the pretest probability of most diseases. Studies have
shown that these estimates are often inaccurate (perhaps because doctors tend to remember
recent or remarkable patients and, consequently, give them too much weight when making
estimates). For several infectious diseases, such as influenza and methicillin-resistant
Staphylococcus aureus infection, periodic studies and tracking systems by the Centers for
Disease Control and Prevention alert clinicians about changing prevalence. Large clinical
computer databanks also provide quantitative estimates of the probability of disease, given
various combinations of clinical findings. Although the resulting estimates of prevalence
are not likely to be very precise, using estimates from the medical literature is bound to
be more accurate than implicit judgment alone.
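
The dependence of predictive value on prevalence described in this section can be checked
numerically. The following is a minimal sketch, not part of the original text: it computes
positive and negative predictive values from sensitivity, specificity, and prevalence, and
applies them to the three clinical-probability groups of the DVT example. The function and
variable names are illustrative, and a single sensitivity of 95% is an assumption (the text
reports 95% to 98% across groups).

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied to a population with the given prevalence."""
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    false_pos = (1 - specificity) * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Assumption: one sensitivity for all groups; the text gives a range of 95% to 98%.
ASSUMED_SENSITIVITY = 0.95

# (prevalence, specificity) for each clinical-probability group in the DVT example.
dvt_groups = {
    "low": (0.05, 0.58),
    "medium": (0.17, 0.41),
    "high": (0.53, 0.36),
}

for name, (prevalence, specificity) in dvt_groups.items():
    ppv, npv = predictive_values(ASSUMED_SENSITIVITY, specificity, prevalence)
    print(f"{name:>6}: prevalence {prevalence:.0%}  PPV {ppv:.1%}  NPV {npv:.1%}")
```

Under these assumptions the negative predictive value in the low-probability group comes
out above 99%, in line with the figure quoted in the example, and it falls as the
prevalence in the group rises.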

Increasing the Pretest Probability of Disease

Within the clinical setting, there are several ways in which the probability of a disease
can be increased before using a diagnostic test. Considering the relationship between the
predictive value of a test and prevalence, it is obviously to the physician's advantage to
apply diagnostic tests to patients with an increased likelihood of having the disease being
sought. In fact, diagnostic tests are most helpful when the presence of disease is neither
very likely nor very unlikely.

Specifics of the Clinical Situation

The specifics of the clinical situation are clearly a strong influence on the decision to
order tests. Symptoms, signs, and disease risk factors all raise or lower the probability
of finding a disease. For some diseases, clinical-decision rules made up of simple history
and physical examinations produce groups of patients with known prevalence or incidence of
disease, as shown in the DVT example. A young woman with chest pain is more likely to have
coronary disease if she has typical angina and hypertension and she smokes. As a result, an
abnormal ECG stress test is more likely to represent coronary disease in such a woman than
in a similar woman with nonspecific chest pain and no coronary risk factors.
The value of applying diagnostic tests to persons more likely to have a particular illness
is intuitively obvious to most doctors. Nevertheless, with the increasing availability of
diagnostic tests, it is easy to adopt a less selective approach when ordering tests.
However, the less selective the approach, the lower the prevalence of the disease is likely
to be and the lower will be the positive predictive value of the test.

Example
The probability of coronary artery disease (CAD) based on the results of an ECG exercise
(stress) test varies according to the pretest probability of CAD, which differs according
to age, gender, symptoms, and risk factors. Figure 8.8 demonstrates posttest probabilities
after stress tests among men with different ages, symptoms, and risk factors (i.e.,
different pretest probabilities of CAD). The elliptical curves show the results for
probabilities of CAD after positive and negative test results. An asymptomatic 45-year-old
man has a very low pretest probability of CAD, which rises little (to 10%) after a positive
test and decreases to practically zero after a negative test. At the other extreme, for a
55-year-old man with typical angina who has a 93% pretest probability of CAD, a positive
test raises the probability of CAD to nearly 100%, whereas a negative test reduces the
posttest probability to about 75%, hardly reassuring to either the patient or the doctor;
further testing will likely occur regardless of the test result in this patient, making the
test useless. Stress testing is most useful in a 45-year-old man with atypical chest pain
who has a 51% pretest probability of CAD. A positive test raises the posttest probability
of CAD to about 75%, which argues for more invasive and definitive testing, whereas a
negative test lowers the probability of CAD to about 10%.

Because of the prevalence effect, physicians must interpret similar stress test results
differently in different clinical situations. A positive test usually will be misleading if
it is used to search for unsuspected disease in low-prevalence situations, as sometimes has
been done among young joggers and on “executive physicals” of young healthy persons. The
opposite applies to an older man with typical angina. In this case, a negative stress test
result is too often false negative to exclude disease. As is clear in Figure 8.8, a
diagnostic test is most useful in intermediate situations, in which prevalence (pretest
probability) is neither very high nor very low.

[Figure 8.8: posttest probability of CAD (%) plotted against clinical pretest probability
of CAD (%), with one curve for a positive stress test and one for a negative stress test.
Patients marked on the plot: 45-year-old asymptomatic man with no risk factors; 45-year-old
asymptomatic man with hypercholesterolemia, hypertension, and diabetes; 45-year-old man
with atypical chest pain; 55-year-old man with typical angina.]

Figure 8.8 ■ Posttest probabilities of coronary artery disease among men with different
pretest probabilities who underwent ECG exercise (stress) tests. The tops of the light pink
bars indicate posttest probabilities of CAD after positive tests and the tops of the red
bars indicate posttest probabilities of CAD after negative tests. (Redrawn with permission
from Patterson RE, Horowitz SF. Importance of epidemiology and biostatistics in deciding
clinical strategies for using diagnostic tests: a simplified approach using examples from
coronary artery disease. J Am Coll Cardiol 1989;13:1653–1665.)

Selected Demographic Groups

In a given setting, physicians can increase the yield of diagnostic tests by applying them
to demographic groups known to be at higher risk for a disease. The pretest probability of
CAD in a 55-year-old man complaining of atypical angina chest pain is 65%; in a 35-year-old
woman with the same kind of pain, it is 12% (12). Similarly, a sickle cell test would
obviously have a higher positive predictive value among African Americans than among whites
of Norwegian descent.

Referral Process

Referral to teaching hospital wards, clinics, and emergency departments increases the
chance that significant disease will underlie patients' complaints. Therefore, relatively
more aggressive use of diagnostic tests might be justified in these settings. (The need for
speedy diagnosis also promotes quicker use of diagnostic tests.) In primary care practice,
on the other hand, and particularly among patients without complaints, the chance of
finding disease is considerably smaller, and tests should be used more sparingly.

Example
While practicing in a military clinic, one author saw hundreds of people with headache,
rarely ordered diagnostic tests, [...] of headache. (It is unlikely that important
conditions were missed because the clinic was virtually the only source of medical care for
these patients, and the soldiers remained in the military community for many months.)
However, during the first week back in a medical residency at a teaching hospital, a
patient with a headache similar to the ones managed in [...]

Because clinicians may work at different points along the prevalence spectrum at various
times in their clinical practices, they should bear in mind that the intensity of their
diagnostic evaluations may need to be adjusted to suit the specific setting.

Implications for Interpreting the Medical Literature

Published descriptions of diagnostic tests often include, in addition to sensitivity and
specificity, some conclusions about the interpretation of a positive or negative test (the
test's predictive value). This is done to provide information directly useful to
clinicians, but the data for these publications are often gathered in university teaching
hospitals where the prevalence of serious disease is relatively high. As a result,
statements about predictive value in the medical literature may be misleading when the test
is applied in less highly selected settings. Occasionally, authors compare the performance
of a test in a number of patients known to have the disease and an equal number of patients
without the disease. This is an efficient way to describe sensitivity and specificity.
However, any reported positive predictive value from such studies means little because it
has been determined for a group of patients in which the investigators artificially set the
prevalence of disease at 50%.

LIKELIHOOD RATIOS

Likelihood ratios are an alternative way of describing the performance of a diagnostic
test. They summarize the same kind of information as sensitivity and specificity and can be
used to calculate the probability of disease after a positive or negative test (positive or
negative predictive value). The main advantage of likelihood ratios is that they can be
used at multiple levels of test results.

Odds

Because the use of likelihood ratios depends on odds, to understand them, it is first
necessary to distinguish odds from probability. Odds and probability contain the same
information, but they express it differently. Probability, which is used to express
sensitivity, specificity, and predictive value, is the proportion of people in whom a
particular characteristic, such as a positive test, is present. Odds, on the other hand, is
the ratio of two probabilities, the probability of an event to that of 1 – the probability
of the event. The two can be interconverted using simple formulas:

Odds = Probability of event/(1 – Probability of event)
Probability = Odds/(1 + Odds)

These terms should be familiar to most readers because they are used in everyday
conversation. For example, we may say that the odds are 4:1 that the New England Patriots
football team will win tonight or that they have an 80% probability of winning.

Definitions

The likelihood ratio for a particular value of a diagnostic test is defined as the
probability of that test result in people with the disease divided by the probability of
the result in people without disease (see Fig. 8.2). Likelihood ratios express how many
times more (or less) likely a test result is to be found in diseased, compared with
nondiseased, people. For dichotomous results (both positive and negative), two types of
likelihood ratios describe the test's ability to discriminate between diseased and
nondiseased people. In the case of a test's positive likelihood ratio (LR+), it is the
ratio of the proportion of diseased people with a positive test result (sensitivity) to the
proportion of nondiseased people with a positive result (1 – specificity). A test's
negative likelihood ratio (LR–) is calculated when the test result is negative. In that
case, it is the proportion of diseased people with a negative test result (1 – sensitivity)
divided by the proportion of nondiseased people with a negative test result (specificity)
(see Fig. 8.2).
In the DVT example (Fig. 8.3), the data can be used to calculate likelihood ratios for DVT
in the presence of a positive or negative D-dimer assay. A positive test is about 2.5 times
more likely to be found in the presence of DVT than in the absence of it. If the D-dimer
assay was negative, the likelihood ratio for this negative test is 0.03.

Use of Likelihood Ratios

Likelihood ratios must be used with odds, not probability. Therefore, the first step is to
convert pretest probability (prevalence) to pretest odds, as outlined earlier:

Odds = Probability of event/(1 – Probability of event)

Likelihood ratios can then be used to convert pretest odds to posttest odds, by means of
the following formula:

Pretest odds × Likelihood ratio = Posttest odds

Posttest odds can, in turn, be converted back to a probability, using the formula:

Probability = Odds/(1 + Odds)

In these relationships, pretest odds contain the same information as prior or pretest
probability (prevalence), likelihood ratios the same as sensitivity/specificity, and
posttest odds the same as positive predictive value (posttest probability).

Why Use Likelihood Ratios?

Why master the concept of likelihood ratios when they are much more difficult to understand
than prevalence, sensitivity, specificity, and predictive value? The main advantage of
likelihood ratios is that they make it possible to go beyond the simple and clumsy
classification of a test result as either abnormal or normal, as is done when describing
the accuracy of a diagnostic test only in terms of sensitivity and specificity at a single
cutoff point. Obviously, disease is more likely in the presence of an extremely abnormal
test result than it is for a marginally abnormal one. With likelihood ratios, it is
possible to summarize the information contained in test results at different levels. One
can define likelihood ratios for each of an entire range of possible values. In this way,
information represented by the degree of abnormality is not discarded in favor of just the
crude presence or absence of it.
In computing likelihood ratios across a range of test results, a limitation of sensitivity
and specificity is overcome. Instead of referring to the ability of the test to identify
all individuals with a given result or worse, it refers to the ability of a particular test
result to identify people with the disease. The same is true for the calculation of
specificity. Thus, for tests with a range of results, LRs can report information at each
level. In general, tests with LRs further away from 1.0 are associated with few false
positives and few false negatives (>10 for LR+ and <0.1 for LR–), whereas those with LRs
close to 1.0 give much less accurate results (2.1 to 5.0 for LR+ and 0.5 to 0.2 for LR–).
In summary, likelihood ratios can accommodate the common and reasonable clinical practice
of putting more weight on extremely high (or low) test results than on borderline ones when
estimating the probability (or odds) that a particular disease is present.

Example
Pleural effusions are routinely evaluated when the c[...] way to differentiate the two was
to measure the protein in pleural fluid and serum; if the ratio of pleural fluid protein to
serum protein was >0.50, the pleural fluid was identified as an exudate. However, the
single cut-point may obscure important diagnostic information. This possibility was
examined in a study of 1,448 patients with pleural fluids known to be either exudates or
transudates (13). Table 8.2 shows the distribution of pleural fluid findings according to
the ratio of pleural fluid protein to serum protein, along with calculated likelihood
ratios. Not surprisingly, at high values of pleural fluid protein to serum protein ratios,
almost all specimens were exudates and the likelihood ratios were high, whereas at the
other extreme, the opposite was true. Overall, ratios close to the traditional cut-point
indicated that the distinction between exudates and transudates was uncertain. Also, using
the traditional cut-point of 0.50, there was more [...]

Table 8.2
Distribution of Ratios for Pleural Fluid Protein to Serum Protein in Patients with
Exudates and Transudates, with Calculation of Likelihood Ratios

Ratio of Pleural Fluid      Number of Patients with Test Result
Protein to Serum Protein    Exudates    Transudates    Likelihood Ratio
>0.70                       475         1              168.65
0.66–0.70                   150         1              53.26
0.61–0.65                   117         6              6.92
0.56–0.60                   102         12             3.02
0.50–0.55                   70          14             1.78
0.46–0.50                   47          34             0.49
0.41–0.45                   27          34             0.28
0.36–0.40                   13          37             0.12
0.31–0.35                   8           44             0.06
≤0.30                       19          182            0.04

Reproduced with permission from Heffner JE, Sahn SA, Brown LK. Multilevel likelihood ratios
for identifying exudative pleural effusions. Chest 2002;121:1916–1920.
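
The likelihood ratios in Table 8.2 follow directly from the definition given earlier: for
each band of test results, the proportion of exudates with that result divided by the
proportion of transudates with that result. The sketch below is an illustration added here,
not part of the original text. It uses the counts as printed in the table, takes the column
totals from those rows, and the names `table_8_2` and `multilevel_likelihood_ratios` are
hypothetical.

```python
# Counts as printed in Table 8.2: (ratio band, exudates, transudates).
table_8_2 = [
    (">0.70", 475, 1),
    ("0.66-0.70", 150, 1),
    ("0.61-0.65", 117, 6),
    ("0.56-0.60", 102, 12),
    ("0.50-0.55", 70, 14),
    ("0.46-0.50", 47, 34),
    ("0.41-0.45", 27, 34),
    ("0.36-0.40", 13, 37),
    ("0.31-0.35", 8, 44),
    ("<=0.30", 19, 182),
]


def multilevel_likelihood_ratios(rows):
    """Likelihood ratio per band: P(result | diseased) / P(result | nondiseased)."""
    total_diseased = sum(diseased for _, diseased, _ in rows)
    total_nondiseased = sum(nondiseased for _, _, nondiseased in rows)
    return {
        label: (diseased / total_diseased) / (nondiseased / total_nondiseased)
        for label, diseased, nondiseased in rows
    }


lrs = multilevel_likelihood_ratios(table_8_2)
for label, lr in lrs.items():
    print(f"{label:>10}  LR = {lr:.2f}")
```

Each band gets its own likelihood ratio, so a result far from the traditional cut-point
shifts the odds of an exudate much more than a result near it.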

The likelihood ratio has several other advantages over sensitivity and specificity as a
description of test performance. As is clear from Table 8.2, the information contributed by
the test is summarized in one number corresponding to each level of test result. Also,
likelihood ratios are particularly well suited for describing the overall odds of disease
when a series of diagnostic tests is used (see the following text).

Calculating Likelihood Ratios

Figure 8.9A demonstrates two ways of arriving at posttest probability: by calculation and
with a nomogram. Figure 8.9B makes the calculations using the DVT example in Figure 8.3 and
shows that the calculated posttest probability (22%) is the same as the positive predictive
value arrived at with the nomogram in the figure. Although the process is conceptually
simple and each individual calculation is easy, the overall effort is a bit daunting. To
make it easy, several computerized programs for diagnostic test calculators that also
construct the associated nomogram are available free on the Web. The nomogram shows the
difference between pre- and posttest odds, but it requires having the nomogram easily
available.

A) Mathematical approach
1) Convert pretest probability (prevalence) to pretest odds
   Pretest odds = prevalence/(1 – prevalence)
2) Multiply pretest odds by likelihood ratio to obtain posttest odds
   Pretest odds × likelihood ratio = posttest odds
3) Convert posttest odds to posttest probability (predictive value)
   Posttest probability = posttest odds/(1 + posttest odds)

B) Using a likelihood ratio nomogram
Place a straight edge at the correct prevalence and likelihood ratio values and read off
the posttest probability where the straight edge crosses the line.

[Nomogram with three vertical scales: pretest probability (prevalence, %), likelihood
ratio, and posttest probability (predictive value, %).]

Figure 8.9 ■ A. Formula and nomogram using test likelihood ratios to determine posttest
probability of disease.

These calculations demonstrate a disadvantage of likelihood ratios. One must use odds, not
probabilities, and most of us find thinking in terms of odds more difficult than
probabilities. Also, the conversion from probability to odds and back requires math,
Internet access, or the use of a nomogram, which can complicate calculating posttest odds
and predictive value during the routine course of patient care.
Table 8.3 displays a simplified approach. Likelihood ratios of 2, 5, and 10 increase the
probability of disease approximately 15%, 30%, and 45%, respectively, and the inverse of
these (likelihood ratios of 0.5, 0.2, and 0.1) decrease the probability of disease
similarly 15%, 30%, and 45% (14). Bedside use of likelihood ratios is easier when the three
specific likelihood ratios and their effect (multiples of 15) on posttest probability are
remembered, especially when the clinician can estimate the probability of disease in the
patient before the test is done. Using this algorithm in the DVT example (Fig. 8.3), the
probability of disease with a LR+ of 2.5 would be approximately 25% (the underlying 10%
prevalence plus about 15%). This is a little higher estimate than that obtained with the
mathematical calculation, but it is close enough to conclude that a patient with a positive
D-dimer assay needs some other test to confirm the presence of DVT.
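
The three-step calculation and the rule of thumb described above can be compared in a few
lines. The following is a minimal sketch added for illustration, not part of the original
text. The function names are hypothetical; the DVT figures (10% pretest probability, LR+ of
2.5, LR– of 0.03) come from the text, while the sensitivity and specificity passed to
`likelihood_ratios` are made-up values used only to show the definitions.

```python
def probability_to_odds(probability):
    """Odds = probability / (1 - probability)."""
    return probability / (1 - probability)


def odds_to_probability(odds):
    """Probability = odds / (1 + odds)."""
    return odds / (1 + odds)


def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-) for a dichotomous test."""
    lr_positive = sensitivity / (1 - specificity)
    lr_negative = (1 - sensitivity) / specificity
    return lr_positive, lr_negative


def posttest_probability(pretest_probability, likelihood_ratio):
    """Probability -> odds, multiply by the likelihood ratio, odds -> probability."""
    pretest_odds = probability_to_odds(pretest_probability)
    posttest_odds = pretest_odds * likelihood_ratio
    return odds_to_probability(posttest_odds)


# DVT example from the text: 10% pretest probability, LR+ 2.5, LR- 0.03.
after_positive = posttest_probability(0.10, 2.5)
after_negative = posttest_probability(0.10, 0.03)

# Rule of thumb: a likelihood ratio of about 2 adds roughly 15 percentage points.
rule_of_thumb = 0.10 + 0.15

print(f"Posttest probability after a positive test: {after_positive:.1%}")
print(f"Rule-of-thumb estimate:                     {rule_of_thumb:.0%}")
print(f"Posttest probability after a negative test: {after_negative:.2%}")

# Hypothetical sensitivity and specificity, for illustration only.
lr_pos, lr_neg = likelihood_ratios(sensitivity=0.90, specificity=0.80)
print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.3f}")
```

The exact calculation gives about 22% after a positive test, against roughly 25% from the
rule of thumb, which matches the comparison made in the text.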

A) Mathematical approach
1) Convert pretest probability (prevalence) to pretest odds
   0.10/(1 – 0.10) = 0.111
2) Multiply pretest odds by likelihood ratio of positive test
   0.111 × 2.5 = 0.278
3) Convert posttest odds to posttest probability (positive predictive value)
   0.278/(1 + 0.278) = 0.22 = 22%

B) Using a likelihood ratio nomogram
The pretest probability is 10% and the LR+ is 2.5. Place a ruler to intersect these 2
values and it crosses the posttest probability line at about 22%.

[Nomogram with three vertical scales: pretest probability (prevalence, %), likelihood
ratio, and posttest probability (predictive value, %).]

Figure 8.9 ■ B. An example: Calculating the posttest probability of a positive D-dimer
assay test for DVT (see Fig. 8.3). (Adapted with permission from Fagan TJ. Nomogram for
Bayes's theorem. N Eng J Med 1975;293:257.)

Table 8.3
Simple “Rule of Thumb” for Determining Effect of Likelihood Ratios on Disease Probability

Likelihood Ratio    Approximate Change in Disease Probability (%)
10                  45
9                   40
8                   -
7                   -
6                   35
5                   30
4                   25
3                   20
2                   15
1                   No Change
0.5                 –15
0.4                 –20
0.3                 –25
0.2                 –30
0.1                 –45

Adapted with permission from McGee S. Simplifying likelihood ratios. J Gen Intern Med
2002;17:646–649.

MULTIPLE TESTS

Because clinicians commonly use imperfect diagnostic tests with less than 100% sensitivity
and specificity and intermediate likelihood ratios, a single test frequently results in a
probability of disease that is neither high nor low enough for managing the patient (e.g.,
somewhere between 10% and 90%). Usually, it is not acceptable to stop the diagnostic
process at such a point. Would a physician or patient be satisfied with the conclusion that
the patient has even a 20% chance of having carcinoma of the colon? Or that an
asymptomatic, 45-year-old man with multiple risk factors has about a 30% chance of coronary
heart disease after a positive ECG stress test? Even for less deadly diseases, tests
resulting in intermediate posttest probabilities require more investigation. The physician
is ordinarily bound to raise or lower the probability of disease substantially in such
situations unless, of course, the diagnostic possibilities are all trivial, nothing could
be done about the result, or the risk of proceeding further is prohibitive. When these
exceptions do not apply, the clinician will want to find a way to rule in or rule out the
disease more decisively.

STRATEGY            SEQUENCE OF EVENTS                            CONSEQUENCES
Parallel testing    Test A or test B or test C is positive        Sensitivity increases, specificity decreases
Serial testing      Test A and test B and test C are positive     Sensitivity decreases, specificity increases

Figure 8.10 ■ Parallel and serial testing. In parallel testing, all tests are done at the
same time. In serial testing, each subsequent test is done only when the previous test
result is positive.

When multiple different tests are performed and all are positive or all are negative, the
interpretation is straightforward. All too often, however, some are positive and others are
negative. Interpretation is then more complicated. This section discusses the principles by
which multiple tests are applied and interpreted.
Multiple diagnostic tests can be applied in two basic ways (Fig. 8.10). They can be used in
parallel testing (i.e., all at once), and a positive result of any test is considered
evidence for disease. Or they can be done in serial testing (i.e., consecutively), with the
decision to order the next test in the series based on the results of the previous test.
For serial testing, all tests must give a positive result in order for the diagnosis to be
made because the diagnostic process is stopped with a negative result.

Parallel Testing

Physicians usually order tests in parallel when rapid assessment is necessary, as in
hospitalized or emergency patients, or for ambulatory patients who cannot return easily
because they are not mobile or have come from a long distance for evaluation.
Multiple tests in parallel generally increase the sensitivity and, therefore, the negative
predictive value for a given disease prevalence, above those of each individual test. On
the other hand, specificity and positive predictive value are lower than for each
individual test. That is, disease is less likely to be missed. (Parallel testing is
probably one reason why referral centers seem to diagnose disease that local physicians
miss.) However, false-positive diagnoses are also more likely to be made (thus, the
propensity for overdiagnosing in such centers as well). In summary, parallel testing is
particularly useful when the clinician is faced with the need for a very sensitive testing
strategy but has available only two or more relatively insensitive tests that measure
different clinical phenomena. By using the tests in parallel, the net effect is a more
sensitive diagnostic strategy. The price, however, is further evaluation or treatment of
some patients without the disease.

Example
Neither of two tests used for diagnosing ovarian cancer, CA-125 and transvaginal
ultrasound, is a sensitive test when used alone. A study was done in which 28,506 women
underwent both tests in a trial of ovarian cancer screening (15). If either test result was
abnormal (a parallel testing strategy), women were referred for further evaluation.
Positive predictive values were determined for each test individually as well as when both
tests were abnormal (a serial testing strategy). The authors did not calculate
sensitivities and specificities of the tests; to do so, they would have needed to include
any cases of ovarian cancer that occurred in the

Table 8.4
Test Characteristics of CA-125 and Transvaginal Ultrasound (TVU)

Test                         Sensitivity^a (%)   Specificity^a (%)   Positive Predictive Value (%)
Abnormal CA-125 (35 U/mL)    55.2                98.7                4.0
Abnormal TVU                 75.8                95.4                1.6
Abnormal CA-125 or TVU       100.0               94.1                1.7
Abnormal CA-125 and TVU      31.0                99.9                26.5

^a Sensitivity and specificity estimated without interval cancers (see text).
Data from Buys SS, Partridge E, Greene M, et al. Ovarian cancer screening in the Prostate,
Lung, Colorectal and Ovarian (PLCO) cancer screening trial: findings from the initial
screen of a randomized trial. Am J Obstet Gynecol 2005;193:1630–1639.

women in the interval between the first and next screening round. However, it is possible
to roughly estimate the values for sensitivity and specificity from the information in the
paper. Table 8.4 shows the positive predictive values as well as estimated sensitivities
and specificities of the tests. Using the two tests in parallel raised the estimated
sensitivity to about 100%, but the positive predictive value was lower (1.7%) than when
using the tests individually. Because follow-up evaluation of abnormal screening tests for
ovarian cancer often involves abdominal surgery, tests with high positive predictive values
are important. The low positive predictive value resulting from a parallel testing strategy
meant that almost all women who required follow-up evaluation (many of whom underwent
surgery) did not have ovarian cancer. The authors examined what would have happened if a
serial testing strategy had been used (requiring both tests to be positive before further
evaluation); positive predictive value rose to 26.5%, but at the cost of lowering the
estimated sensitivity to 31%, so low that 20 of 29 cancers would have been missed using
this strategy.

Clinical Prediction Rules

A modification of parallel testing occurs when clinicians use the combination of multiple
tests, some with positive and some with negative results, to arrive at a diagnosis.
Usually, they start by taking a history and doing a physical examination. They may also
order laboratory tests. The results of the combined testing from history, physical
examination, and laboratory tests are then used to make a diagnosis. This process, long an
implicit part of clinical medicine, has been examined systematically for an increasing
number of diagnoses. For some medical conditions, certain history items, physical findings,
and laboratory results are particularly important in their predictive power for making a
diagnosis. The resulting test combinations are called clinical prediction rules or
diagnostic decision-making rules, which divide patients into groups with different
prevalences.

Example
Pharyngitis is among the most common complaints in office p[...]

Table 8.5
An Example of a Diagnostic Decision-Making Rule: Predictors of Group A Streptococcus (GAS)
Pharyngitis. Modified Centor Score

Criteria                                   Points
Temperature >38°C (100.4°F)                1
Absence of cough                           1
Swollen, tender anterior cervical nodes    1
Tonsillar swelling or exudate              1
Age
  3–14 years                               1
  15–44 years                              0
  ≥45 years                                –1

Score    Probability of a Positive Culture for GAS (%)
0        1–2.5
1        5–10
2        11–17
3        28–35
4        51–53

Adapted with permission from McIsaac WJ, Kellner JD, Aufricht P, et al. Empirical
validation of guidelines for the management of pharyngitis in children and adults. JAMA
2004;291:1587–1595.

Serial Testing

Serial testing maximizes specificity and positive predictive value but lowers sensitivity
and the negative predictive value, as demonstrated in Table 8.4. One ends up surer that
positive test results represent disease but runs an increased risk that disease will be
missed. Serial testing is particularly useful when none of the individual tests available
to a clinician is highly specific.
Physicians most often use serial testing strategies in clinical situations where rapid
assessment of patients is not required, such as in office practices and hospital clinics in
which ambulatory patients are followed over time. (In acute care settings, time is the
enemy of a serial testing approach.) Serial testing is also used when some of the tests are
expensive or risky, with these tests being employed only after simpler and safer tests
suggest the presence of disease. For example, maternal age, blood tests, and ultrasound are
used to identify pregnancies at higher risk of delivering a baby with Down syndrome.
Mothers found to be at high risk by those tests are then offered chorionic villus sampling
or amniocentesis, both of which entail risk of fetal loss. Serial testing leads to less
laboratory use than parallel testing because additional evaluation is contingent on prior
test results. However, serial testing takes more time because additional tests are ordered
only after the results of previous ones become available. Usually, the test that is less
risky, less invasive, easier to do, and cheaper should be done first. If these factors are
not in play, performing the test with the highest specificity is usually more efficient,
requiring fewer patients to undergo both tests.

Serial Likelihood Ratios

When a series of tests is used, an overall probability can be calculated using the
likelihood ratio for each test result, as shown in Figure 8.11. The prevalence of disease
before testing is first converted to pretest odds. As each test is done, the posttest odds
of one become the pretest odds for the next.

PRETEST PROBABILITY
TEST A    Pretest odds × LRA = Posttest odds
TEST B    Pretest odds × LRB = Posttest odds
TEST C    Pretest odds × LRC = Posttest odds
POSTTEST PROBABILITY

Figure 8.11 ■ Use of likelihood ratios in serial testing. As each test is completed, its
posttest odds become the pretest odds for the subsequent test.

In the end, a new probability of disease is found that takes into
Chapter 8: Diagnosis 135

account the information contributed by all the tests test included in the decision rule contributed inde-
in the series. pendently to the diagnosis. To the degree that the two
tests are not contributing independent information,
Assumption of Independence multiple testing is less useful. For example, if two tests
are used in parallel with 60% and 80%
When multiple tests are used, the accuracy of the sensitivities, and the better test identifies all the
final result depends on whether the additional cases found by the less sensitive test, the combined
infor- mation contributed by each test is somewhat sensitivity cannot be higher than 80%. If they are
inde- pendent of that already available from the completely independent of each other, then the
preceding ones so that the next test does not sensitivity of parallel testing would be 92% (80%
simply duplicate known information. For example,  [60%  20%]).
in the diagnosis of endocarditis, it is likely that The premise of independence underlies the entire
fever (an indication of inflammation), a new heart approach to the use of multiple tests. However, it
murmur (an indication of valve destruction), and seems unlikely that tests for most diseases are fully
Osler nodes (an indication of emboli) each add independent of each other. If the assumption that the
independent, useful information. In the example of tests are completely independent is wrong, calcula-
pharyngitis, the investigators used statistical tion of the probability of disease from several tests
techniques to ensure that each diagnostic would tend to overestimate the tests’ value.

Review Questions

Questions 8.1–8.10 are based on the following example. For each question, select the best answer.

A study was made of symptoms and physical findings in 247 patients evaluated for sinusitis. The final diagnosis was made according to x-ray findings (gold standard) (17). Ninety-five patients had sinusitis, and 49 of them also had facial pain. One hundred fifty-two did not have sinusitis, and 79 of these patients had facial pain.

8.1. What is the sensitivity of facial pain for sinusitis in this study?
A. 38%
B. 48%
C. 52%
D. 61%

8.2. What is the specificity?
A. 38%
B. 48%
C. 52%
D. 61%

8.3. If the doctor thought the patient had sinusitis because the patient had facial pain, for what percent of patients would she be correct?
A. 38%
B. 48%
C. 52%
D. 61%

8.4. If the doctor thought the patient did not have sinusitis because the patient did not have facial pain, for what percent of patients would she be correct?
A. 38%
B. 48%
C. 52%
D. 61%

8.5. How common was sinusitis in this study?
A. 38%
B. 48%
C. 52%
D. 61%

8.6. What is the posttest probability of sinusitis in patients with facial pain in this study?
A. 38%
B. 48%
C. 52%
D. 61%

Both the positive and negative likelihood ratios for facial pain were 1.0. When clinicians asked several other questions about patient symptoms and took physical examination results into account, the likelihood ratios for their overall clinical impressions as to whether patients had sinusitis were as follows: “high probability,” LR 4.7; “intermediate probability,” LR 1.4, and “low probability,” LR 0.4.

8.7. What is the probability of sinusitis in patients assigned a “high probability” by clinicians?
A. 10%
B. 20%
C. 45%
D. 75%
E. 90%

8.8. What is the probability of sinusitis in patients assigned an “intermediate probability” by clinicians?
A. 10%
B. 20%
C. 45%
D. 75%
E. 90%

8.9. What is the probability of sinusitis in patients assigned a “low probability” by clinicians?
A. 10%
B. 20%
C. 45%
D. 75%
E. 90%

8.10. Given the answers for questions 8.7–8.9, which of the following statements is incorrect?
A. A clinical impression of “high probability” of sinusitis is more useful in the management of patients than one of “low probability.”
B. A clinical impression of “intermediate probability” is approximately equivalent to a coin toss.
C. Predictive value of facial pain will be higher in a clinical setting in which the prevalence of sinusitis is 10%.

For questions 8.11 and 8.12, select the best answer.

8.11. Which of the following statements is most correct?
A. Using diagnostic tests in parallel increases specificity and lowers positive predictive value.
B. Using diagnostic tests in series increases sensitivity and lowers positive predictive value.
C. Using diagnostic tests in parallel increases sensitivity and positive predictive value.
D. Using diagnostic tests in series increases specificity and positive predictive value.

8.12. Which of the following statements is most correct?
A. When using diagnostic tests in parallel or series, each test should contribute information independently.
B. When using diagnostic tests in series, the test with the lowest sensitivity should be used first.
C. When using diagnostic tests in parallel, the test with the highest specificity should be used first.

Answers are in Appendix A.

REFERENCES
1. Chobanian AV, Bakris GL, Black HR, et al. The Seventh Report of the Joint National Committee on Prevention, Detection, Evaluation, and Treatment of High Blood Pressure: The JNC 7 Report. JAMA 2003;289:2560–2571.
2. Gann PH, Hennekens CH, Stampfer MJ. A prospective evaluation of plasma prostate-specific antigen for detection of prostatic cancer. JAMA 1995;273:289–294.
3. Wheeler SG, Wipf JE, Staiger TO, et al. Approach to the diagnosis and evaluation of low back pain in adults. In: Basow DS, ed. UpToDate. Waltham, MA; UpToDate; 2011.
4. Pickhardt PJ, Choi JR, Hwang I, et al. Computed tomographic virtual colonoscopy to screen for colorectal neoplasia in asymptomatic adults. N Engl J Med 2003;349:2191–2200.
5. Bates SM, Kearon C, Crowther M, et al. A diagnostic strategy involving a quantitative latex D-dimer assay reliably excludes deep venous thrombosis. Ann Intern Med 2003;138:787–794.
6. Maisel AS, Krishnaswamy P, Nowak RM, et al. Rapid measurement of B-type natriuretic peptide in the emergency diagnosis of heart failure. N Engl J Med 2002;347:161–167.
7. Steg PG, Joubin L, McCord J, et al. B-type natriuretic peptide and echocardiographic determination of ejection fraction in the diagnosis of congestive heart failure in patients with acute dyspnea. Chest 2005;128:21–29.
8. Dearking AC, Aletti GD, McGree ME, et al. How relevant are ACOG and SGO guidelines for referral of adnexal mass? Obstet Gynecol 2007;110:841–848.
9. Bobo JK, Lee NC, Thames SF. Findings from 752,081 clinical breast examinations reported to a national screening program from 1995 through 1998. J Natl Cancer Inst 2000;92:971–976.
10. Bates SM, Grand’Maison A, Johnston M, et al. A latex D-dimer reliably excludes venous thromboembolism. Arch Intern Med 2001;161:447–453.
11. Wells PS, Owen C, Doucette S, et al. Does this patient have deep vein thrombosis? The rational clinical examination. JAMA 2006;295:199–207.
12. Garber AM, Hlatky MA. Stress testing for the diagnosis of coronary heart disease. In: Basow DS, ed. UpToDate. Waltham, MA; UpToDate; 2012.
13. Heffner JE, Sahn SA, Brown LK. Multilevel likelihood ratios for identifying exudative pleural effusions. Chest 2002;121:1916–1920.
14. McGee S. Simplifying likelihood ratios. J Gen Intern Med 2002;17:646–649.
15. Buys SS, Partridge E, Greene M, et al. Ovarian cancer screening in the Prostate, Lung, Colorectal and Ovarian (PLCO) cancer screening trial: findings from the initial screen of a randomized trial. Am J Obstet Gynecol 2005;193:1630–1639.
16. McIsaac WJ, Kellner JD, Aufricht P, et al. Empirical validation of guidelines for the management of pharyngitis in children and adults. JAMA 2004;291:1587–1595.
17. Williams JW, Simel DL, Roberts L, et al. Clinical evaluation for sinusitis. Making the diagnosis by history and physical examination. Ann Intern Med 1992;117:705–710.
Chapter 9

Treatment

Treatments should be given “not because they ought to work, but because they do work.”
L.H. Opie, 1980

KEY WORDS
Hypotheses
Treatment
Intervention
Comparative effectiveness
Experimental studies
Clinical trials
Randomized controlled trials
Equipoise
Inclusion criteria
Exclusion criteria
Comorbidity
Large simple trials
Practical clinical trials
Pragmatic clinical trials
Hawthorne effect
Placebo
Placebo effect
Random allocation
Randomization
Baseline characteristics
Stratified randomization
Compliance
Adherence
Run-in period
Cross-over
Blinding
Masking
Allocation concealment
Single-blind
Double-blind
Open label
Composite outcomes
Health-related quality of life
Health status
Efficacy trials
Effectiveness trials
Intention-to-treat analysis
Explanatory analyses
Per-protocol
Superiority trials
Non-inferiority trials
Inferiority margin
Cluster randomized trials
Cross-over trials
Trials of N = 1
Confounding by indication
Phase I trials
Phase II trials
Phase III trials
Postmarketing surveillance

After the nature of a patient’s illness has been established and its expected course predicted, the next question is, what can be done about it? Is there a treatment that improves the outcome of disease? This chapter describes the evidence used to decide whether a well-intentioned treatment is actually effective.

IDEAS AND EVIDENCE

The discovery of effective new treatments requires both rich sources of promising possibilities and rigorous ways of establishing that the treatments are, in fact, effective (Fig. 9.1).

Ideas

Ideas about what might be a useful treatment arise from virtually any activity within medicine. These ideas are called hypotheses to the extent that they are assertions about the natural world that are made for the purposes of empiric testing.
Some therapeutic hypotheses are suggested by the mechanisms of disease at the molecular level. Drugs for antibiotic-resistant bacteria are developed through knowledge of the mechanism of resistance and hormone analogues are variations on the structure of native hormones. Other hypotheses about treatments have come from astute observations by clinicians, shared with their colleagues in case reports. Others are discovered by accident: The drug minoxidil, which was developed for hypertension, was found to improve male pattern baldness; and tamoxifen, developed for contraception, was found to prevent breast cancer in high-risk women. Traditional medicines, some of which are supported by centuries of experience,

Figure 9.1 ■ Ideas and evidence. [Diagram labels: IDEAS (case reports, biology, epidemiology, clinical observation, imagination, reasoning); Hypotheses (proposed answers); Tests of hypotheses (observational studies, experimental studies); EVIDENCE-BASED INFORM (label truncated in source).]

may be effective. Aspirin, atropine, and digitalis are examples of naturally occurring substances that have become established as orthodox medicines after rigorous testing. Still other ideas come from trial and error. Some anticancer drugs have been found by methodically screening huge numbers of substances for activity in laboratory models. Ideas about treatment, but more often prevention, have also come from epidemiologic studies of populations. The Framingham Study, a cohort study of risk factors for cardiovascular diseases, was the basis for clinical trials of lowering blood pressure and serum cholesterol.

Testing Ideas

Some treatment effects are so prompt and powerful that their value is self-evident even without formal testing. Clinicians do not have reservations about the effectiveness of antibiotics for bacterial meningitis, or diuretics for edema. Clinical experience is sufficient.
In contrast, many diseases, including most chronic diseases, involve treatments that are considerably less dramatic. The effects are smaller, especially when an effective treatment is tested against another effective treatment. Also, outcomes take longer to develop. It is then necessary to put ideas about treatments to a formal test, through clinical research, because a variety of circumstances, such as coincidence, biased comparisons, spontaneous changes in the course of disease, or wishful thinking, can obscure the true relationship between treatment and outcomes.
When knowledge of the pathogenesis of disease, based on laboratory models or physiologic studies in humans, has become extensive, it is tempting to predict effects in humans on this basis alone. However, relying solely on current understanding of mechanisms without testing ideas using strong clinical research on intact humans can lead to unpleasant surprises.

Example
Control of elevated blood sugar has been a keystone in the care of patients with diabetes mellitus, in part to prevent cardio

This study illustrates how treatments that make good sense, based on what is known about the disease at the time, may be found to be ineffective when put to a rigorous test in humans. Knowledge of pathogenesis, worked out in laboratory models, may be disappointing in human studies because the laboratory studies are in highly simplified settings. They usually exclude or control for many real-world influences on disease such as variation in genetic endowment, the physical and social environment, and individual behaviors and preferences.
Clinical experience and tradition also need to be put to a test. For example, bed rest has been advocated for a large number of medical conditions. Usually, there is a rationale for it. For example, it has been thought that the headache following lumbar puncture might result from a leak of cerebrospinal fluid through the needle track causing stretching of the meninges. However, a review of 39 trials of bed rest for 15 different conditions found that outcome did not improve for any condition. Outcomes were worse with bed rest in 17 trials, including not only lumbar puncture, but also acute low back pain, labor, hypertension during pregnancy, acute myocardial infarction, and acute infectious hepatitis (2).
Of course, it is not always the case that ideas are debunked. The main point is that promising treatments have to be tested by clinical research rather than accepted into the care of patients on the basis of reasoning alone.

STUDIES OF TREATMENT EFFECTS

Treatment is any intervention that is intended to improve the course of disease after it is established. Treatment is a special case of interventions in general that might be applied at any point in the natural history of disease, from disease prevention to palliative care at the end of life. Although usually thought of as medications, surgery, or radiotherapy, health care interventions can take any form, including relaxation therapy, laser surgery, or changes in the organization and financing of health care. Regardless of the nature of a well-intentioned intervention, the principles by which it is judged superior to other alternatives are the same.
Comparative effectiveness is a popular name for a not-so-new concept, the head-to-head comparison of two or more interventions (e.g., drugs, devices, tests, surgery, or monitoring), all of which are believed to be effective and are current options for care. Comparison is not just for effectiveness, but also for all clinically important end results of the interventions, both beneficial and harmful. Results can help clinicians and patients understand all of the consequences of choosing one or another course of action when both have been considered reasonable alternatives.

Observational and Experimental Studies of Treatment Effects

Two general methods are used to establish the effects of interventions: observational and experimental studies. The two differ in their scientific strength and feasibility.
In observational studies of interventions, investigators simply observe what happens to patients who for various reasons do or do not get exposed to an intervention (see Chapters 5–7). Observational studies of treatment are a special case of studies of prognosis in general, in which the prognostic factor of interest is a therapeutic intervention. What has been said about cohort studies applies to observational studies of treatment as well. The main advantage of these studies is feasibility. The main drawback is the possibility that there are systematic differences in treatment groups, other than the treatment itself, that can lead to misleading conclusions about the effects of treatment.
Experimental studies are a special kind of cohort study in which the conditions of study (selection of treatment groups, nature of interventions, management during follow-up, and measurement of outcomes) are specified by the investigator for the purpose of making unbiased comparisons. These studies are generally referred to as clinical trials. Clinical trials are more highly controlled and managed than cohort studies. The investigators are conducting an experiment, analogous to those done in the laboratory. They have taken it upon themselves (with their patients’ permission) to isolate for study the unique contribution of one factor by holding constant, as much as possible, all other determinants of the outcome.
Randomized controlled trials, in which treatment is randomly allocated, are the standard of excellence for scientific studies of the effects of treatment. They are described in detail below, followed by descriptions of alternative ways of studying the effectiveness of interventions.

RANDOMIZED CONTROLLED TRIALS

The structure of a randomized controlled trial is shown in Figure 9.2. All elements are the same as for a cohort study except that treatment is assigned by randomization rather than by physician and patient choice. The “exposures” are treatments, and the “outcomes” are any possible end result of treatment (such as the 5 Ds described in Table 1.2).
The patients to be studied are first selected from a larger number of patients with the condition of

Figure 9.2 ■ The structure of a randomized controlled trial. [Diagram: a SAMPLE is drawn from the POPULATION and randomized to a treated group or a control group (INTERVENTION); the OUTCOME (yes or no) is recorded in each group.]

interest. Using randomization, the patients are then divided into two (or more) groups of comparable prognosis. One group, called the experimental group, is exposed to an intervention that is believed to be better than current alternatives. The other group, called a control (or comparison) group, is treated the same in all ways except that its members are not exposed to the experimental intervention. Patients in the control group may receive a placebo, usual care, or the current best available treatment. The course of disease is then recorded in both groups, and differences in outcome are attributed to the intervention.
The main reason for structuring clinical trials in this way is to avoid confounding when comparing the respective effects of two or more kinds of treatments. The validity of clinical trials depends on how well they have created equal distribution of all determinants of prognosis, other than the one being tested, in treated and control patients.
Individual elements of clinical trials are described in detail in the following text.

Ethics

Under what circumstances is it ethical to assign treatment at random, rather than as decided by the patient and physician? The general principle, called equipoise, is that randomization is ethical when there is no compelling reason to believe that either of the randomly allocated treatments is better than the other. Usually it is believed that the experimental intervention might be better than the control but that has not been conclusively established by strong research. The primary outcome must be benefit; treatments cannot be randomly allocated to discover whether one is more harmful than the other. Of course, as with any human research, patients must fully understand the consequences of participating in the study, know that they can withdraw at any time without compromising their health care, and freely give their consent to participate. In addition, the trial must be stopped whenever there is convincing evidence of effectiveness, harm, or futility in continuing.

Sampling

Clinical trials typically require patients to meet rigorous inclusion and exclusion criteria. These are intended to increase the homogeneity of patients in the study, to strengthen internal validity, and to make it easier to distinguish the “signal” (treatment effect) from the “noise” (bias and chance).
Among the usual inclusion criteria is that patients really do have the condition being studied. To be on the safe side, study patients must meet strict diagnostic criteria. Patients with unusual, mild, or equivocal manifestations of disease may be left out in the process, restricting generalizability.
Of the many possible exclusion criteria, several account for most of the losses:

1. Patients with comorbidity (diseases other than the one being studied) are typically excluded because the care and outcome of these other diseases can muddy the contrast between experimental and comparison treatments and their outcomes.
2. Patients are excluded if they are not expected to live long enough to experience the outcome events of interest.
3. Patients with contraindications to one of the treatments cannot be randomized.

4. Patients who refuse to participate in a trial are excluded, for ethical reasons described earlier in the chapter.
5. Patients who do not cooperate during the early stages of the trial are also excluded. This avoids wasted effort and the reduction in internal validity that occurs when patients do not take their assigned intervention, move in and out of treatment groups, or leave the trial altogether.

For these reasons, patients in clinical trials are usually a highly selected, biased sample of all patients with the condition of interest. As heterogeneity is restricted, the internal validity of the study is improved; in other words, there is less opportunity for differences in outcome that are not related to treatment itself. However, exclusions come at the price of diminished generalizability: Patients in the trial are not like most other patients seen in day-to-day care.

Example
Figure 9.3 summarizes how patients were selected for a randomized controlled trial of asthma management (3). Investigators invited 1,410 patients with asthma in 81 general practices in Scotland to participate. Only 458 of those invited, about one-third, agreed to participate and could be contacted. An additional 199 were excluded, mainly because they did not meet eligibility criteria, leaving 259 patients (18% of those invited) to be randomized. Although the study invited patients from community practices, those who actually participated in the trial were highly selected and perhaps unlike most patients in the community.

Because of the high degree of selection in trials, it may require considerable faith to generalize the results of clinical trials to ordinary practice settings.
If there are not enough patients with the disease of interest, at one time and place, to carry out a scientifically sound trial, then sampling can be from multiple sites with common inclusion and exclusion criteria. This is done mainly to achieve adequate sample size, but it also increases generalizability, to the extent that the sites are somewhat different from each other.
Large simple trials are a way of overcoming the generalizability problem. Trial entry criteria are simplified so that most patients developing the study condition are eligible. Participating patients have to have accepted random allocation of treatment, but their care is otherwise the same as usual, without a great deal of extra testing that is part of some trials. Follow-up is for a simple, clinically important outcome, such as discharge from the hospital alive. This approach not only improves generalizability, it also makes it easier to recruit large numbers of participants at a reasonable cost so that moderate effect sizes (large effects are unlikely for most clinical questions) can be detected.
Practical clinical trials (also called pragmatic clinical trials) are designed to answer real-world questions in the actual care of patients by including the kinds of patients and interventions found in ordinary patient care settings.

Example
Severe ankle sprains are a common problem among patients visiting emergency departments. Various treatments are in common use. The Collaborative Ankle Support Trial Group enrolled 584 patients with severe ankle sprain in eight emergency departments in the United Kingdom in a randomized trial of four commonly used treatments: tubular compression bandage and three types of mechanical support (4). Quality of ankle function at 3 months was best after a below-the-knee cast was used for 10 days and worst when a tubular compression bandage was used; the tubular compression bandage was being used in 75% of centers in the United Kingdom at the time. Two less effective forms of mechanical support were several times more expensive than the cast. All treatment groups improved over time and there was no difference in outcome among them at 9 months.

Practical trials are different from typical efficacy trials where, in an effort to increase internal validity, severe restrictions are applied to enrollment, intervention, and adherence, limiting the relevance of their results for usual patient care decisions. Large simple trials may be of practical questions too, but practical trials need not be so large.

Intervention

The intervention can be described in relation to three general characteristics: generalizability, complexity, and strength.
First, is the intervention one that is likely to be implemented in usual clinical practice? In an effort

CRITERIA FOR INCLUSION NUMBER REMAINING

Population sampled 462,526


Patients in 81 general practices

Invited to participate 1,410


Patients with asthma

Agreed to participate 458


Able to contact

Met eligibility criteria 318


Asthma for at least 1 year
Age 18 or older
Treated with inhaled corticosteroid
No asthma visit in past 2 months
Able to use peak flow meter
No serious illness
No substance abuse
Not pregnant
Other

Able and willing to cooperate 259


Gave informed consent

Randomized 259

Figure 9.3 ■ Sampling of patients for a randomized controlled trial of asthma management. (Data from Hawkins
G, McMahon AD, Twaddle S, et al. Stepping down inhaled corticosteroids in asthma: randomized controlled trial. BMJ
2003;326: 1115–1121.)

to standardize the intervention so that it can be easily described and reproduced in other settings, investigators may cater to their scientific, not their clinical colleagues by studying treatments that are not feasible in usual practice.
Second, does the intervention reflect the normal complexity of real-world treatment? Clinicians regularly construct treatment plans with many components. Single, highly specific interventions make for tidy science because they can be described precisely and applied in a reproducible way, but they may have weak effects. Multifaceted interventions, which are often more effective, are also amenable to careful evaluation as long as their essence can be communicated and applied in other settings. For example, a randomized trial of fall prevention in acute care hospitals studied the effects of a fall risk assessment scale, with interventions tailored to each patient’s specific risks (5).
Third, is the intervention in question sufficiently different from alternative managements that it is reasonable to expect that the outcome will be affected? Some diseases can be reversed by treating a single, dominant cause. Treating hyperthyroidism with radioisotope ablation or surgery is one example. However, most diseases arise from a combination of factors acting in concert. Interventions that change only one of them, and only a small amount, cannot be expected to result in strong treatment effects. If the conclusion of a trial evaluating such interventions is that a new treatment is not effective when used alone, it should come as no surprise. For this reason, the first trials of a new treatment tend to enroll those patients who are most likely to respond to treatment and to maximize dose and compliance.

Comparison Groups

The value of an intervention is judged in relation to some alternative course of action. The question is not only whether a comparison is used, but also how appropriate it is for the research question. Results can be measured against one or more of several kinds of comparison groups.

■ No Intervention. Do patients who are offered the experimental treatment end up better off than those offered nothing at all? Comparing treatment with no treatment measures the total effects of care and of being in a study, both specific and nonspecific.
■ Being Part of a Study. Do treated patients do better than other patients who just participate in a study? A great deal of special attention is directed toward patients in clinical trials. People have a tendency to change their behavior when they are the target of special interest and attention because of the study, regardless of the specific nature of the intervention they might be receiving. This phenomenon is called the Hawthorne effect. The reasons are not clear, but some seem likely: Patients want to please them and make them feel successful. Also, patients who volunteer for trials want to do their part to see that “good” results are obtained.
■ Usual Care. Do patients given the experimental treatment do better than those receiving usual care (whatever individual doctors and patients decide)? This is the only meaningful (and ethical) question if usual care is already known to be effective.
■ Placebo Treatment. Do treated patients do better than similar patients given a placebo, an intervention intended to be indistinguishable (in physical appearance, color, taste, or smell) from the active treatment but that does not have a specific, known mechanism of action? Sugar pills and saline injections are examples of placebos. It has been shown that placebos, given with conviction, relieve severe, unpleasant symptoms, such as postoperative pain, nausea, or itching, in about one-third of patients, a phenomenon called the placebo effect. Placebos have the added advantage of making it difficult for study patients to know which intervention they have received (see “Blinding” in the following text).
■ Another Intervention. The comparator may be the current best treatment. The point of a “comparative effectiveness” study is to find out whether a new treatment is better than the one in current use.

Changes in outcome related to these comparators are cumulative, as diagrammed in Figure 9.4.

Figure 9.4 ■ Total effects of treatment are the sum of spontaneous improvement (natural history) as well as nonspecific and specific responses. [Diagram: cumulative improvement, from bottom to top: natural history, Hawthorne effect, usual care, placebo effect, new intervention.]
Allocating Treatment

To study the effects of a clinical intervention free of confounding, the best way to allocate patients to treatment groups is by means of random allocation (also referred to as randomization). Patients are assigned to either the experimental or the control treatment by one of a variety of disciplined procedures (analogous to flipping a coin) whereby each patient has an equal (or at least known) chance of being assigned to any one of the treatment groups.

Random allocation of patients is preferable to other methods of allocation because only randomization has the ability to create truly comparable groups. All factors related to prognosis, regardless of whether they are known before the study takes place or have been measured, tend to be equally distributed in the comparison groups.

In the long run, with a large number of patients in a trial, randomization usually works as just described. However, random allocation does not guarantee that the groups will be similar; dissimilarities can arise by chance alone, particularly when the number of patients randomized is small. To assess whether “bad luck” has occurred, authors of randomized controlled trials often present a table comparing the frequency in the treated and control groups of a variety of characteristics, especially those known to be related to outcome. These are called baseline characteristics because they are present before randomization and, therefore, should be equally distributed in the treatment groups.

Example
Table 9.1 shows some of the baseline characteristics for a study of liberal versus restrictive blood transfusion in high-risk patients after hip surgery.

Table 9.1
Example of a Table Comparing Baseline Characteristics: A Randomized Trial of Liberal versus Restrictive Transfusion in High-Risk Patients after Hip Surgery

                              Percent with Characteristic for Each Group
Characteristics               % Liberal           % Restricted
                              (1,007 Patients)    (1,009 Patients)
Age (mean)                    81.8                81.5
Male                          24.8                23.7
Any cardiovascular disease    63.3                62.5
Tobacco use 600 mg/d          11.6                11.3
Anesthesiology risk score     3.0                 2.9
General anesthesia            54.0                56.2
Lived in nursing home         10.3                10.9

Data from Carson JL, Terrin ML, Noveck H, et al. Liberal or restrictive transfusion in high-risk patients after hip surgery. N Engl J Med 2011;365:2453–2462.

It is reassuring to see that important prognostic variables are nearly equally distributed in the groups being compared. If the groups are substantially different in a large trial, it suggests that something has gone wrong with the randomization process. Smaller differences, which are expected because of chance, can be controlled for during data analyses (see Chapter 5).

In some situations, especially small trials, to reduce the risk of bad luck, it is best to make sure that at least some of the characteristics known to be strongly associated with outcome occur equally in treated and control patients. Patients are gathered into groups (strata) that have similar levels of a prognostic factor (e.g., age for most chronic diseases) and are randomized separately within each of the strata, a process called stratified randomization (Fig. 9.5). The final groups are sure to be comparable, at least for the characteristics that were used to create the strata. Some investigators do not favor stratified randomization, arguing that whatever differences arise from bad luck are unlikely to be large and can be adjusted for mathematically after the data have been collected.
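
Stratified randomization can be illustrated with a short program. The following Python sketch is only an illustration under stated assumptions, not the procedure of any particular trial: the use of permuted blocks within each stratum, the names stratified_randomize and age_stratum, the age cut points, and the patient data are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_randomize(patients, stratum_of, block_size=4, seed=2014):
    """Assign 'T' (treated) or 'C' (control) separately within each stratum.

    Permuted blocks are an assumption of this sketch: every block holds
    equal numbers of T and C in random order, which keeps the two arms
    balanced inside each stratum.
    """
    if block_size % 2 != 0:
        raise ValueError("block_size must be even for two equal arms")
    rng = random.Random(seed)

    # Step 1 (stratification): gather patients with similar prognosis.
    strata = defaultdict(list)
    for patient in patients:
        strata[stratum_of(patient)].append(patient)

    # Step 2 (randomization): randomize separately within each stratum.
    arms = {}
    for members in strata.values():
        for start in range(0, len(members), block_size):
            block = ["T", "C"] * (block_size // 2)
            rng.shuffle(block)
            for patient, arm in zip(members[start:start + block_size], block):
                arms[patient["id"]] = arm
    return arms


def age_stratum(patient):
    """Hypothetical prognostic strata based on age."""
    if patient["age"] < 60:
        return "under 60"
    if patient["age"] < 75:
        return "60 to 74"
    return "75 and over"


# Twelve hypothetical patients, four in each age stratum.
ages = [52, 58, 49, 55, 61, 66, 70, 73, 78, 81, 85, 90]
patients = [{"id": i, "age": age} for i, age in enumerate(ages)]
arms = stratified_randomize(patients, age_stratum)

balance = Counter((age_stratum(p), arms[p["id"]]) for p in patients)
for stratum in ("under 60", "60 to 74", "75 and over"):
    print(stratum, "T:", balance[(stratum, "T")], "C:", balance[(stratum, "C")])
```

Because each stratum in this example contains exactly one block of four, every stratum ends up with two treated and two control patients whatever the seed. With simple (unstratified) randomization of so few patients, an age group could by chance fall mostly into one arm.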

Figure 9.5 ■ Diagram of stratified randomization. T = treated group, C = control group, and R = randomization. (Eligible patients are divided by stratification into strata 1, 2, and 3; randomization within each stratum produces treatment and control groups, which are combined into the final study groups.)

Differences Arising after Randomization

Not all patients in clinical trials participate as originally planned. Some are found to not have the disease they were thought to have when they entered the trial. Others drop out, do not take their medications, are taken out of the study because of side effects or other illnesses, or somehow obtain the other study treatment or treatments that are not part of the study at all. In this way, treatment groups that might have been comparable just after randomization become less so as time passes.

Patients May Not Have the Disease Being Studied

It is sometimes necessary (both in clinical trials and in practice) to begin treatment before it is certain whether the patient actually has the disease for which the treatment is designed.

Example
Hydrocortisone may improve survival from septic shock, especially in patients with abnormal adrenal response to shock, as measured by an inappropriately small rise in plasma cortisol after administration of corticotropin. Treatment must begin before the results of the test are available. Investigators in Israel and Europe studied whether hydrocortisone improved survival to 28 days in patients with septic shock (7). Patients were randomized to hydrocortisone or placebo, and treatment was begun before results of the corticotropin test were available. Of 499 patients with septic shock enrolled in the trial, 233 (47%) had adrenal insufficiency. The main analysis was of this subgroup, those who it was believed might respond to hydrocortisone, not of all patients randomized. Among patients with poor response to corticotropin, there was no difference in survival in patients treated with hydrocortisone, compared with placebo.

Because response to corticotropin was a characteristic that existed before randomization, it had been randomly allocated, so the advantages of a randomized controlled trial were preserved. However, the inefficiency of enrolling and gathering data on patients who would not contribute to the study’s results could not be avoided. Also, the number of patients in the study was reduced, making it more difficult to detect differences in survival if they existed. Nevertheless, this kind of trial has the important advantage of providing information on both the consequences of a decision that a clinician must make before all the relevant information is available and the effectiveness of treatment in the subset of patients most likely to respond (see the Intention-to-Treat and Explanatory Trials section in this chapter).

Compliance

Compliance is the extent to which patients follow medical advice. The term adherence is preferred by some people because it connotes a less subservient relationship between patient and doctor. Compliance is another characteristic that comes into play after randomization.

Although noncompliance suggests a kind of willful neglect of good advice, other factors also contribute. Patients may misunderstand which drugs and doses are intended, run out of prescription medications, confuse various preparations of the same drug, or have no money or insurance to pay for drugs. Taken together, noncompliance may limit the usefulness of treatments that have been shown to work under favorable conditions.

In general, compliance marks a better prognosis, apart from treatment. Patients in randomized trials who were compliant with placebo had better outcomes than those who were not (8).

Compliance is particularly important in medical care outside the hospital. In hospitals, many factors act to constrain patients’ personal behavior and render
them compliant. Hospitalized patients are generally sicker and more frightened. They are in strange surroundings, dependent upon the skill and attention of the staff for everything, including their life. What is more, doctors, nurses, and pharmacists have developed a well-organized system for ensuring that patients receive what is ordered for them. As a result, clinical experience and medical literature developed on the wards may underestimate the importance of compliance outside the hospital, where most patients and doctors are and where following doctors’ orders is less common.

In clinical trials, patients are typically selected to be compliant. During a run-in period, in which placebo is given and compliance monitored, noncompliant patients can be detected and excluded before randomization.

Cross-over

Patients may move from one randomly allocated treatment to another during follow-up, a phenomenon called cross-over. If exchanges between treatment groups take place on a large scale, they can diminish the observed differences in treatment effect compared to what might have been observed if the original groups had remained intact.

Cointerventions

After randomization, patients may receive a variety of interventions other than the ones being studied. For example, in a study of asthma treatment, they may receive not only the experimental drug but also different doses of their usual drugs and make greater efforts to control allergens in the home. If these occur unequally in the two groups and affect outcomes, they can introduce systematic differences between the groups that were not present when the groups were formed.

Blinding

Participants in a trial may change their behavior or reporting of outcomes in a systematic way (i.e., be biased) if they are aware of which patients are receiving which treatment. One way to minimize this effect is by blinding, an attempt to make the various participants in a study unaware of the treatment group patients have been randomized to so that this knowledge cannot cause them to act differently, and thereby diminish the internal validity of the study. Masking is a more appropriate metaphor, but blinding is the time-honored term.

Blinding can take place in a clinical trial at four levels (Fig. 9.6). First, those responsible for allocating patients to treatment groups should not know which treatment will be assigned next, making it impossible for them to break the randomization plan. Allocation concealment is a term for this form of blinding. Without it, some investigators might be tempted to enter patients in the trial out of order to ensure that individuals get the treatment that seems best for them. Second, patients should be unaware of

Figure 9.6 ■ Locations of potential blinding in randomized controlled trials. (Locations of blinding shown: treatment allocation, patients, clinicians, and measurement of outcome.)


which treatment they are taking so that they cannot change their compliance or reporting of symptoms because of this information. Third, to ensure physicians caring for patients in a study cannot, even subconsciously, manage patients differently, physicians should not know which treatment each patient is on. Finally, when the researchers who assess outcomes are unaware of which treatment individual patients have been offered, that knowledge cannot affect their measurements.

The terms single-blind (patients) and double-blind are sometimes used, but their meanings are ambiguous. It is better simply to describe what was done. A trial in which there is no attempt at blinding is called an open trial or, in the case of drug trials, an open label trial.

In drug studies, blinding is often made possible by using a placebo. However, for many important clinical questions, such as the effects of surgery, radiotherapy, diet, or the organization of medical care, blinding of patients and their physicians is difficult if not impossible.

Even when blinding appears to be possible, it is more often claimed than successful. Physiologic effects, such as lowered pulse rate with beta-blocking drugs and gastrointestinal upset or drowsiness with other drugs, may signal to patients whether they are taking the active drug or placebo.

Assessment of Outcomes

Randomized controlled trials are a special case of cohort studies, and what we have already said about measures of effect and biases in cohort studies (Chapter 5) applies to them as well, as do the dangers of substituting intermediate outcomes for clinically important ones (Chapter 1).

Example
A low level of high-density lipoprotein (HDL) cholesterol is a risk factor for cardiovascular disease. The drug niacin raises [...] HDL and lower LDL cholesterol and triglycerides than patients on simvastatin alone (9). However, rates of the primary outcome, [...]

Clinical trials may have as their primary outcome a composite outcome, a set of outcomes that are related to each other but are treated as a single outcome variable. For example, in a study of percutaneous repair versus open surgery for mitral regurgitation, the composite outcome was the absence of death, mitral valve surgery, or severe mitral regurgitation 12 months after treatment (10). There are several advantages to this approach. The individual outcomes in the composite may be so highly related to each other, biologically and clinically, that it is artificial to consider them separately. The presence of one (such as death) may prevent the other (such as severe mitral regurgitation) from occurring. With more ways to experience an outcome event, a study is better able to detect treatment effects (see Chapter 11). The disadvantage of composite outcomes is that they can obscure differences in effects for different individual outcomes. In addition, one component may account for most of the result, giving the impression that the intervention affects the others too. All of these disadvantages can be overcome by simply examining the effect on each component outcome separately as well as together.

In addition to “hard” outcomes such as survival, remission of disease, and return of function, trials sometimes measure health-related quality of life by broad, composite measures of health status. A simple quality-of-life measure used by a collaborative group of cancer researchers is shown in Table 9.2. This “performance scale” combines symptoms and function, such as the ability to walk. Others are much more extensive; the Sickness Impact Profile contains more than 100 items and a dozen categories. Still others are specifically developed for individual diseases. The main issue is that the value of a clinical trial is strengthened to the extent that such measures are reported along with hard measures such as death and recurrence of disease.
Table 9.2
A Simple Measure of Quality of Life. The Eastern Cooperative Oncology Group’s Performance Scale

Performance Status    Definition
0                     Asymptomatic
1                     Symptomatic, fully ambulatory
2                     Symptomatic, in bed <50% of the day
3                     Symptomatic, in bed >50% of the day
4                     Bedridden
5                     Dead

Table 9.3
Summarizing Treatment Effects

Summary Measure(a)             Definition
Relative risk reduction        (Control event rate - Treated event rate) / Control event rate
Absolute risk reduction        Control event rate - Treated event rate
Number needed to treat OR      1 / (Control event rate - Treated event rate)
number needed to harm

(a) For continuous data in which there are measurements at baseline and after treatment, analogous measures are based on the mean values for treated and control groups either after treatment or for the difference between baseline and posttreatment values.
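
The definitions in Table 9.3 can be worked through with a few lines of Python. This is a minimal sketch: the function name summarize_treatment_effect and the event counts are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def summarize_treatment_effect(control_events, control_n, treated_events, treated_n):
    """Compute the summary measures defined in Table 9.3 from raw counts."""
    control_rate = control_events / control_n
    treated_rate = treated_events / treated_n
    if control_rate == 0:
        raise ValueError("relative risk reduction is undefined when the control event rate is 0")

    arr = control_rate - treated_rate   # absolute risk reduction
    rrr = arr / control_rate            # relative risk reduction

    # A positive ARR (fewer events with treatment) gives a number needed to
    # treat; a negative ARR (more events) gives a number needed to harm.
    number_needed = float("inf") if arr == 0 else 1 / abs(arr)
    kind = "treat" if arr > 0 else "harm" if arr < 0 else "undefined"

    return {
        "control_event_rate": control_rate,
        "treated_event_rate": treated_rate,
        "absolute_risk_reduction": arr,
        "relative_risk_reduction": rrr,
        "number_needed": number_needed,
        "number_needed_kind": kind,
    }


# Hypothetical trial: the event occurs in 20 of 100 control patients
# and in 15 of 100 treated patients.
summary = summarize_treatment_effect(20, 100, 15, 100)
for name, value in summary.items():
    print(name, round(value, 3) if isinstance(value, float) else value)
```

With these counts, the absolute risk reduction is 0.05, the relative risk reduction is 0.25, and the number needed to treat is 20. The same 25% relative reduction applied to a control event rate of 2% would give an absolute reduction of only 0.005 and a number needed to treat of 200, which is why relative and absolute measures are best reported together.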

Options for describing effect size in clinical trials are summarized in Table 9.3. The options are similar to summaries of risk and prognosis but related to change in outcome resulting from the intervention.

EFFICACY AND EFFECTIVENESS

Clinical trials may describe the results of an intervention in ideal or in real-world situations (Fig. 9.7).

First, can treatment help under ideal circumstances? Trials that answer this question are called efficacy trials. Elements of ideal circumstances include patients who accept the interventions offered to them, follow instructions faithfully, get the best possible care, and do not have care for other diseases. Most randomized trials are designed in this way.

Second, does treatment help under ordinary circumstances? Trials designed to answer this kind of question are called effectiveness trials. All the usual elements of patient care are part of effectiveness trials. Patients may not take their assigned treatment. Some may drop out of the study, and others find ways to take the treatment they were not assigned. The doctors and facilities may not be the best. In short, effectiveness trials describe results as most patients would experience them. The difference between efficacy and effectiveness has been described as the “implementation gap,” the gap between ideal care and ordinary care, and is a target for improvement in its own right.

Figure 9.7 ■ Efficacy and effectiveness. (Efficacy trials ask whether treatment can work under ideal circumstances; their strength is internal validity. Effectiveness trials ask whether offering treatment works under ordinary circumstances; their strength is generalizability. Differences listed: compliance, types of patients, practicality, and cost.)


Efficacy trials usually precede effectiveness trials. The rationale is that if treatment under the best circumstances is not effective, then effectiveness under ordinary circumstances is impossible. Also, if an effectiveness trial were done first and it showed no effect, the result could have been because the treatment at its best is just not effective or because the treatment really is effective but was not received.

Intention-to-Treat and Explanatory Trials

A related issue is whether the results of a randomized controlled trial should be analyzed and presented according to the treatment to which the patients were randomized or according to the one they actually received (Fig. 9.8).

One question is: Which treatment choice is best at the time the decision must be made? To answer this question, analysis is according to the group to which the patients were assigned (randomized), regardless of whether these patients actually received the treatment they were supposed to receive. This way of analyzing trial results is called an intention-to-treat analysis. An advantage of this approach is that the question corresponds to the one actually faced by clinicians; they either offer a treatment or not. Also, the groups compared are as originally randomized, so this comparison has the full strength of a randomized trial. The disadvantage is that to the extent that many patients do not receive the treatment to which they were randomized, differences in effectiveness will tend to be obscured, increasing the chances of observing a misleadingly small effect or no statistical effect at all. If the study shows no difference, it will be uncertain whether the problem is the treatment itself or that it was not received.

Another question is whether the experimental treatment itself is better. For this question, the proper analysis is according to the treatment each patient actually received, regardless of the treatment

Figure 9.8 ■ Diagram of group assignment in intention-to-treat and explanatory analyses. (Intention to treat: analysis according to treatment assigned. Explanatory: analysis according to treatment received.)
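
The two groupings diagrammed in Figure 9.8 can be compared on the same data. The Python sketch below is illustrative only: the Patient record, the helper functions, and all of the counts are hypothetical. The data are constructed so that risk depends only on the treatment actually received, which would not be safe to assume in a real trial.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    assigned: str   # arm allocated by randomization
    received: str   # treatment actually taken
    event: bool     # True if the outcome event occurred


def make_patients(assigned, received, n, events):
    """Build n hypothetical patients, the first `events` of whom have the event."""
    return [Patient(assigned, received, i < events) for i in range(n)]


def event_rate(patients, arm, by):
    """Event rate in one arm, grouping by 'assigned' (intention-to-treat)
    or by 'received' (explanatory analysis)."""
    group = [p for p in patients if getattr(p, by) == arm]
    return sum(p.event for p in group) / len(group)


# 100 patients randomized to each arm; 20 assigned to the experimental
# treatment never take it, and 10 control patients obtain it anyway.
trial = (
    make_patients("experimental", "experimental", 80, 8)
    + make_patients("experimental", "control", 20, 6)
    + make_patients("control", "control", 90, 27)
    + make_patients("control", "experimental", 10, 1)
)

differences = {}
for by in ("assigned", "received"):
    control = event_rate(trial, "control", by)
    experimental = event_rate(trial, "experimental", by)
    differences[by] = control - experimental
    print(f"by {by}: control {control:.2f}, experimental {experimental:.2f}, "
          f"difference {differences[by]:.2f}")
```

Grouped by assignment, the event rates are 0.28 and 0.14 (difference 0.14); grouped by treatment received, they are 0.30 and 0.10 (difference 0.20). The intention-to-treat difference is smaller because patients who did not follow their assignment dilute the contrast, while the comparison by treatment received is no longer between randomized groups.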


to which they were randomized. Analyses done in this way are called explanatory analyses (also called per-protocol analyses) because they assess whether actually taking the treatments, rather than just being offered them, makes a difference. The problem with this approach is that unless most patients receive the treatment to which they are assigned, the study no longer represents a randomized trial; it is simply a cohort study. One must be concerned about dissimilarities among groups, other than the experimental treatment, and must use methods such as restriction, matching, stratification, or adjustment to achieve comparability, just as one would for any nonexperimental study.

In general, intention-to-treat analyses are more relevant to effectiveness questions, whereas explanatory analyses are consistent with the purposes of efficacy trials, although aspects of the trial other than how they are analyzed matter too. The primary analysis is usually intention-to-treat, but both are reported. Both approaches are legitimate, with the right one depending on the question being asked. To the extent that patients in a trial follow the treatment to which they were randomized, these two analyses will give similar results.

SUPERIORITY, EQUIVALENCE, AND NON-INFERIORITY

Until now, we have been discussing superiority trials, ones that seek to establish that one treatment is better than another, but sometimes the most important question is whether a treatment is no less effective than another. A typical example is when a new drug is safer, cheaper, or easier to administer than the established one and would, therefore, be preferable if it were as effective. In non-inferiority trials, the purpose is to show that a new treatment is unlikely to be less effective, at least to a clinically important extent, than the currently accepted treatment, which has been shown in other studies to be more effective than placebo. The question is one-directional (whether a new treatment is not worse), without regard to whether it might be better.

It is statistically impossible to establish that a treatment is not at all inferior to another. However, a study can rule out an effect that is less than a predetermined “minimum clinically important difference,” also called an inferiority margin, the smallest difference in effect that is still considered clinically important. The inferiority margin actually takes into account both this clinical difference and the statistical imprecision of the study. The following is an example of a non-inferiority trial.

Example
Yaws is an infectious disease affecting more than 500,000 [...]

Non-inferiority trials usually require a larger sample size than comparable superiority trials, especially if the inferiority margin is small or one wants to rule out small differences. Also, any aspect of the trial that tends to minimize differences between comparison groups, such as intention-to-treat analyses in trials where many patients have dropped out or crossed over or when measurements of outcomes are imprecise, artificially increases the likelihood of finding non-inferiority regardless of whether it is truly present; that is, it results in a weak test for non-inferiority.

VARIATIONS ON BASIC RANDOMIZED TRIALS

In cluster randomized trials, naturally occurring groups (“clusters”) of patients (defined by the doctors, hospitals, clinics, or communities that patients are affiliated with) are randomized, and outcome
events are counted in patients according to the treatment their cluster was assigned. Randomization of groups, not the individuals in them, may be preferable for several reasons. It may just be more practical to randomize clusters than individuals. Patients within clusters may be more similar to each other than to patients in other clusters, and this source of variation, apart from the study treatments themselves, should be taken into account in the study. If patients were randomized within clusters, they or their physicians might learn from each other about both treatments and this might affect their behaviors. For example, how successful could a physician be at randomly treating some patients with urinary tract infection one way and other patients another way? Similarly, could a hospital establish a new plan to prevent intravenous catheter infections in some of its intensive care units and not others when physicians see patients in both settings over time? For these reasons, randomizing clusters rather than patients can be the best approach in some circumstances.

There are other variations on the usual (“parallel group”) randomized controlled trials. Cross-over trials expose patients first to one of two randomly allocated treatments and later to the other. If it can be assumed that effects of the first exposure are no longer present by the time of the second exposure, perhaps because treatment is short-lived or there has been a “wash-out” period between exposures, then each patient will have been exposed to each treatment in random order. This controls for differences in responsiveness among patients not related to treatment effects.

TAILORING THE RESULTS OF TRIALS TO INDIVIDUAL PATIENTS

Clinical trials describe what happens on average. They involve pooling the experience of many patients who may be dissimilar, both to one another and to the patients to whom the trial results will be generalized. How can estimates of treatment effect be obtained that more closely match individual patients?

Subgroups

Patients in clinical trials can be sorted into subgroups, each with a specific characteristic (or combination of characteristics) such as age, severity of disease, and comorbidity that might cause a different treatment effect. That is, the data are examined for effect modification. The number of such subgroups is limited only by the number of patients in the subgroups, which has to be large enough to provide reasonably stable estimates. As long as the characteristics used to define the subgroups existed before randomization, patients in each subgroup have been randomly allocated to treatment groups. As a consequence, results in each subgroup represent, in effect, a small trial within a trial. The characteristics of a given patient (e.g., the patient might be elderly and have severe disease but no comorbidity) can be matched more specifically to those of one of the subgroups than they can to the trial as a whole. Treatment effectiveness in the matched subgroup will more closely approximate that of the individual patient and will be limited mainly by statistical risks of false-positive and false-negative conclusions, which are described in Chapter 11.

Effectiveness in Individual Patients

A treatment that is effective on average may not work on an individual patient. Therefore, results of valid clinical research provide a good reason to begin treating a patient, but experience with that patient is a better reason to continue or not continue. When managing an individual patient, it is prudent to ask the following series of questions:

■ Is the treatment known (by randomized controlled trials) to be efficacious for any patients?
■ Is the treatment known to be effective, on average, in patients like mine?
■ Is the treatment working in my patient?
■ Are the benefits worth the discomforts and risks (according to the patient’s values and preferences)?

By asking these questions and not simply following the results of trials alone, one can guard against ill-founded choice of treatment or stubborn persistence in the face of poor results.

Trials of N = 1

Rigorous clinical trials, with proper attention to bias and chance, can be carried out on individual patients, one at a time. The method, called trials of N = 1, is an improvement over the time-honored process of trial and error. A patient is given one treatment or another, such as an active treatment or placebo, in random order, each for a brief period of time. The patient and physician are blinded to which treatment is given. Outcomes, such as a simple preference for a treatment or a symptom score, are assessed after each period. After many repetitions, patterns of responses are analyzed statistically, much as one would for a more usual randomized controlled trial. This method is useful for deciding on the care of individual patients when activity of disease is unpredictable, response to treatment is prompt, and there is no carryover effect from period to period. Examples of diseases for which the method can be used include migraine headaches, asthma, and fibromyalgia. For all their intellectual appeal, however, trials of N = 1 are rarely done and even more rarely published.
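
The mechanics of a trial of N = 1 can be sketched in Python. The example below is a sketch under stated assumptions: the paired-period design, the choice of an exact sign test for the analysis, the function names, and the symptom scores are all hypothetical illustrations rather than a prescribed method.

```python
import random
from math import comb


def n_of_1_schedule(n_pairs, seed=0):
    """Return n_pairs of treatment periods, each pair in random order."""
    rng = random.Random(seed)
    schedule = []
    for _ in range(n_pairs):
        pair = ["active", "placebo"]
        rng.shuffle(pair)
        schedule.append(tuple(pair))
    return schedule


def sign_test_p_value(differences):
    """Two-sided exact sign test on paired differences (ties are dropped)."""
    nonzero = [d for d in differences if d != 0]
    n = len(nonzero)
    if n == 0:
        return 1.0
    positives = sum(d > 0 for d in nonzero)
    smaller_tail = min(positives, n - positives)
    one_sided = sum(comb(n, i) for i in range(smaller_tail + 1)) / 2 ** n
    return min(1.0, 2 * one_sided)


# Six pairs of periods; within each pair the order is decided at random
# and kept hidden from patient and physician.
schedule = n_of_1_schedule(6)
print(schedule)

# Hypothetical symptom scores (higher is worse) at the end of each period.
placebo_scores = [7, 6, 8, 5, 7, 6]
active_scores = [4, 4, 4, 4, 5, 3]
differences = [p - a for p, a in zip(placebo_scores, active_scores)]
p_value = sign_test_p_value(differences)
print(differences, p_value)
```

In this made-up record the patient does better on active treatment in all six pairs, so the two-sided sign test gives P = 2 × (1/2)^6 = 0.03125. The calculation is meaningful only if, as the text requires, there is no carryover of effect from one period to the next.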

ALTERNATIVES TO RANDOMIZED CONTROLLED TRIALS

Randomized controlled trials are the gold standard for studies of the effectiveness of interventions. Only large randomized trials can definitively eliminate confounding as an alternative explanation for observed results.

Limitations of Randomized Trials

However, the availability of several well-conducted randomized controlled trials does not necessarily settle a question. For example, after 21 randomized controlled trials over three decades (one is described in an example earlier in this chapter) the effectiveness of corticosteroids for septic shock remains controversial. The heterogeneity of trial results seems partly related to differences in dose and duration of the drug, the proportion of patients with relative adrenal insufficiency, and whether the outcome is reversal of shock or survival. That is, the trials were of the same general questions but very different specific questions.

Clinical trials also suffer from practical limitations. They are expensive; in 2011, the average cost per patient in drug trials was estimated to be nearly $50,000, and some large trials have cost hundreds of millions of dollars. Logistics can be daunting, especially in maintaining similar methods across sites in multicenter trials and in maintaining the integrity of allocation concealment. Randomization itself remains a hindrance, if not to the conduct of trials at all then to full, unbiased participation. It is particularly difficult to convince patients to be randomized when a practice has become well established in the absence of conclusive evidence of its benefit.

For these reasons, clinical trials are not available to guide clinicians in many important treatment decisions, but clinical decisions must be made nonetheless. What are the alternatives to randomized controlled trials, and how credible are they?

Observational Studies of Interventions
In the absence of a consensus favoring one mode of treatment over others, various treatments are given according to the preferences of each individual patient and doctor. As a result, in the course of ordinary patient care, large numbers of patients are treated in various ways and go on to manifest the effects. When experience with these patients is captured and properly analyzed, it can complement the information available from randomized trials, suggest where new trials are needed, and provide answers where trials are not yet available.

Example
Non–small cell lung cancer is common in the elderly. Palliative [...]

Unfortunately, it is difficult to be sure that observational studies of treatment are not confounded. Treatment choice is determined by a great many factors including severity of illness, concurrent diseases, local preferences, and patient cooperation. Patients receiving the various treatments are likely to differ not only in their treatment but in other ways as well. Especially troubling is confounding by indication (sometimes called “reverse causation”), which occurs when whatever prompted the doctor to choose a treatment (the “indication”) is a cause of the observed outcome, not just the treatment itself. For
example, patients may be offered a new surgical procedure because they are a good surgical risk or have less aggressive disease and, therefore, seem especially likely to benefit from the procedure. To the extent that the reasons for treatment choice are known, they can be taken into account like any other confounders.

Example
Influenza can cause worsening of bronchospasm in children with asthma. Influenza vaccine is recommended, but many c

Clinical Databases
Sometimes, databases are available that include baseline characteristics and outcomes for a large number of patients. Clinicians can match characteristics of a specific patient to similar patients in the database and see what their outcomes were. When use of the database is not part of formal research, there is no accounting for confounding and effect modification, but the predictions do have the advantage of being about real-world patients, not those as highly selected as in most clinical trials.

Randomized versus Observational Studies?
Are observational studies a reliable substitute for randomized controlled trials? With controlled trials as the gold standard, most observational studies of most questions get the right answer. However, there are dramatic exceptions. For example, observational studies have consistently shown that antioxidant vitamins are associated with lower cardiovascular risk, but large randomized controlled trials have found no such effect. Therefore, clinicians can be guided by observational studies of treatment effects when there are no randomized trials to rely on, but they should maintain a healthy skepticism.
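The confounding by indication described earlier in this section can be made concrete with a small numerical sketch. The counts below are entirely hypothetical (they do not come from any study cited here): a new procedure is offered mostly to good-risk patients, and within each risk group it has no effect on death. The names `Stratum`, `crude_risk_ratio`, and `mantel_haenszel_risk_ratio` are illustrative. The Mantel-Haenszel summary is used here as one standard way of taking a known confounder into account; it is not presented as the method the authors had in mind.

```python
from dataclasses import dataclass


@dataclass
class Stratum:
    """Hypothetical counts for one prognostic group (not real data)."""
    name: str
    treated_deaths: int
    treated_total: int
    untreated_deaths: int
    untreated_total: int


# Illustrative numbers only: the procedure is offered mostly to good-risk
# patients, and within each stratum the death rate is the same with or
# without treatment (10% in good-risk, 40% in poor-risk patients).
STRATA = [
    Stratum("good surgical risk", 80, 800, 20, 200),
    Stratum("poor surgical risk", 80, 200, 320, 800),
]


def crude_risk_ratio(strata):
    """Risk ratio that ignores why treatment was chosen."""
    d1 = sum(s.treated_deaths for s in strata)
    n1 = sum(s.treated_total for s in strata)
    d0 = sum(s.untreated_deaths for s in strata)
    n0 = sum(s.untreated_total for s in strata)
    return (d1 / n1) / (d0 / n0)


def mantel_haenszel_risk_ratio(strata):
    """Mantel-Haenszel summary risk ratio across the strata."""
    numerator = 0.0
    denominator = 0.0
    for s in strata:
        total = s.treated_total + s.untreated_total
        numerator += s.treated_deaths * s.untreated_total / total
        denominator += s.untreated_deaths * s.treated_total / total
    return numerator / denominator


if __name__ == "__main__":
    print(f"crude risk ratio:    {crude_risk_ratio(STRATA):.2f}")
    print(f"adjusted risk ratio: {mantel_haenszel_risk_ratio(STRATA):.2f}")
```

With these made-up counts the crude comparison suggests that treatment roughly halves the risk of death, while the stratified summary shows no effect. The adjustment works only for reasons for treatment choice that are known and measured, which is why skepticism about observational results remains warranted.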
Well-designed observational studies of interventions have some strengths that complement the limitations of usual randomized trials. They count the effects of actual treatment, not just of offering treatment, which is a legitimate question in its own right. They commonly include most people as they exist in naturally occurring populations, in either clinical or community settings, without severe inclusion and exclusion criteria. They can often accomplish longer follow-up than trials, matching the time it takes for disease and outcomes to develop. By taking advantage of treatments and outcomes as they happen, observational studies (especially case-control studies and historical cohort studies using health records) can answer clinical questions more quickly than it takes to complete a randomized trial. Of course, they are also less expensive.
Because an ideal randomized controlled trial is the standard of excellence, it has been suggested that observational studies of treatment be designed to resemble, as closely as possible, a randomized trial of the same question (15). One might ask, if the study had been a randomized trial, what would be the inclusion and exclusion criteria (e.g., excluding patients with contraindications to either intervention), how would exposure be precisely defined, and how should drop-outs and cross-overs be managed? The resulting observational study cannot be expected to avoid all vulnerabilities, but at least it would be stronger.

PHASES OF CLINICAL TRIALS


For studies of drugs, it is customary to
define three phases of trials in the order
they are undertaken.
Phase I trials are intended to identify a dose range that is well tolerated and safe (at least for high-frequency, severe side effects) and include very small numbers of patients (perhaps a dozen) without a control group. Phase II trials provide preliminary information on whether the drug is efficacious and the relationship between dose and efficacy. These trials may be controlled but include too few patients in treatment groups to detect any but the largest treatment effects. Phase III trials are randomized trials and can provide definitive evidence of efficacy and rates of common side effects. They include enough patients, sometimes thousands, to detect clinically important treatment effects and are usually published in biomedical journals.
Phase III trials are not large enough to detect differences in the rate, or even the existence, of uncommon side effects (see discussion of statistical power in Chapter 11). Therefore, it is necessary to follow up very large numbers of patients after a drug is in general use, a process called postmarketing surveillance.
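The point about uncommon side effects can be illustrated with simple arithmetic. The sketch below assumes that each treated patient independently has the same probability of the side effect; the rate of 1 in 10,000 and the trial size of 3,000 treated patients are hypothetical values chosen for illustration, and the function names are not from the text. It addresses only the chance of seeing the side effect at all, not the harder task of showing a difference in rates between groups.

```python
import math


def prob_at_least_one_event(event_rate, n_patients):
    """Chance of observing one or more events among n treated patients,
    assuming independent patients with the same per-patient event rate."""
    return 1.0 - (1.0 - event_rate) ** n_patients


def patients_needed(event_rate, confidence=0.95):
    """Smallest number of treated patients giving the stated chance of
    observing at least one event (same independence assumption)."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - event_rate))


if __name__ == "__main__":
    rate = 1 / 10_000      # hypothetical uncommon side effect
    trial_size = 3_000     # hypothetical number treated in a Phase III trial
    chance = prob_at_least_one_event(rate, trial_size)
    print(f"chance of seeing the side effect at all: {chance:.2f}")
    print(f"patients needed for a 95% chance: {patients_needed(rate)}")
```

Under these assumptions a trial of this size would most often miss the side effect entirely, and close to 30,000 treated patients would be needed to be reasonably sure of seeing even one case, which is the scale that postmarketing surveillance provides.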

Review Questions
Read the following and select the best response.

9.1. A randomized controlled trial compares two drugs in common use for the treatment of asthma. Three hundred patients were entered into the trial, and eligibility criteria were broad. No effort was made to blind patients to their treatment group after enrollment. Except for the study drugs, care was decided by each individual physician and patient. The outcome measure was a brief questionnaire assessing asthma-related quality of life. Which of the following best describes this trial?
A. Practical clinical trial
B. Large simple trial
C. Efficacy trial
D. Equivalence trial
E. Non-inferiority trial

9.2. A randomized controlled trial compared angioplasty with fibrinolysis for the treatment of acute myocardial infarction. The authors state that “analysis was by intention to treat.” Which of the following is an advantage of this approach?
A. It describes the effects of treatments that patients have actually received.
B. It is unlikely to underestimate treatment effect.
C. It is not affected by patients dropping out of the study.
D. It describes the consequences of offering treatments regardless of whether they are actually taken.
E. It describes whether treatment can work under ideal circumstances.

9.3. In a randomized trial, patients with meningitis who were treated with corticosteroids had lower rates of death, hearing loss, and neurologic sequelae. Which of the following is a randomized comparison?
A. The subset of patients who, at the time of randomization, were severely affected by the disease
B. Patients who experienced other treatments versus those who did not
C. Patients who remained in the trial versus those who dropped out after randomization
D. Patients who responded to the drug versus those who did not
E. Patients who took the drug compared with those who did not

9.4. A patient asks for your advice about whether to begin an exercise program to reduce his risk of sudden death. You look for randomized controlled trials but find only observational studies of this question. Some are cohort studies comparing sudden death rates in exercisers with rates of sudden death in sedentary people; others are case-control studies comparing exercise patterns in people who had experienced sudden death and matched controls. Which of the following is not an advantage of observational studies of treatments like these over randomized controlled trials?
A. Reported effects are for patients who have actually experienced the intervention.
B. It may be possible to carry out these studies by using existing data that was collected for other purposes.
C. The results can be generalized to more ordinary, real world settings.
D. Treatment groups would have had a similar prognosis except for treatment itself.
E. A large sample size is easier to achieve.

9.5. In a randomized controlled trial of a program to reduce lower extremity problems in patients with diabetes mellitus, patients were excluded if they were younger than age 40, were diagnosed before becoming 30 years old, took specific medication for hyperglycemia, had other serious illness or disability, or were not compliant with prescribed treatment during a run-in period. Which of the following is an advantage of this approach?
A. It makes it possible to do an intention-to-treat analysis.
B. It avoids selection bias.
C. It improves the generalizability of the study.
D. It makes an effectiveness trial possible.
E. It improves the internal validity of the study.

9.6. Which of the following is not accomplished by an intention-to-treat analysis?
A. A comparison of the effects of actually taking the experimental treatments
B. A comparison of the effects of offering the experimental treatments
C. A randomized comparison of treatment effects

9.7. You are reading a report of a randomized controlled trial and wonder whether stratified randomization, which the trial used, was likely to improve internal validity. For which of the following is stratified randomization particularly helpful?
A. The study includes many patients.
B. One of the baseline variables is strongly related to prognosis.
C. Assignment to treatment group is not blinded.
D. Many patients are expected to drop out.
E. An intention-to-treat analysis is planned.

9.8. A randomized controlled trial is analyzed according to the treatment each patient actually received. Which of the following best describes this approach to analysis?
A. Superiority
B. Intention-to-treat
C. Explanatory
D. Phase I
E. Open-label

9.9. In a randomized controlled trial, a beta-blocking drug is found to be more effective than placebo for stage fright. Patients taking the beta-blocker tended to have a lower pulse rate and to feel more lethargic, which are known effects of this drug. For which of the following is blinding possible?
A. The patients’ physicians
B. The investigators who assigned patients to treatment groups
C. The patients in the trial
D. The investigators who assess outcome

9.10. Which of the following best describes “equipoise” as the rationale for a randomized trial of two drugs?
A. The drugs are known to be equally effective.
B. One of the drugs is known to be more toxic.
C. Neither drug is known to be more effective than the other.
D. Although one drug is more effective, the other drug is easier to take with fewer side effects.

9.11. Antibiotic A is the established treatment for community-acquired pneumonia, but it is expensive and has many side effects. A new drug, antibiotic B, has just been developed for community-acquired pneumonia and is less expensive and has fewer side effects, but its efficacy, relative to drug A, is not well established. Which of the following would be the best kind of trial for evaluating drug B?
A. Superiority
B. Cross-over
C. Cluster
D. Non-inferiority
E. Equivalence

9.12. In a randomized controlled trial of two drugs for coronary artery disease, the primary outcome is a composite of acute myocardial infarction, severe angina pectoris, and cardiac death. Which of the following is the main advantage of this approach?
A. There are more outcome events than there would be for any of the individual outcomes.
B. All outcomes are equally affected by the interventions.
C. The trial has more generalizability.
D. Each of the individual outcomes is important in its own right.
E. If one outcome is infrequent, others make up for it.

9.13. In a randomized controlled trial comparing two approaches to managing children with bronchiolitis, baseline characteristics of the 200 children in the trial are somewhat different in the two randomly allocated groups. Which of the following might explain this finding?
A. “Bad luck” in randomization
B. A breakdown in allocation concealment
C. Both
D. Neither

9.14. Which of the following is usually learned from a Phase III drug trial?
A. The relationship between dose and efficacy
B. Rates of uncommon side effects
C. Efficacy or effectiveness
D. The dose range that is well tolerated

9.15. Which of the following is the main advantage of randomized controlled trials over observational studies of treatment effects?
A. Fewer ethical challenges
B. Prevention of confounding
C. Resemble usual care
D. Quicker answer
E. Less expensive

Answers are in Appendix A.

REFERENCES
1. Action to Control Cardiovascular Risk in Diabetes Study Group, Gerstein HC, Miller ME, et al. Effects of intensive glucose lowering in type 2 diabetes. N Engl J Med 2008;358:2545–2559.
2. Allen C, Glasziou P, Del Mar C. Bed rest: a potentially harmful treatment needing more careful evaluation. Lancet 1999;354:1229–1233.
3. Hawkins G, McMahon AD, Twaddle S, et al. Stepping down inhaled corticosteroids in asthma: randomized controlled trial. BMJ 2003;326:1115–1121.
4. Lamb SE, Marsh JL, Hutton JL, et al. Mechanical supports for acute, severe ankle sprain: a pragmatic, multicentre, randomized controlled trial. Lancet 2009;373:575–581.
5. Dykes PC, Carroll DL, Hurley A, et al. Fall prevention in acute care hospitals. A randomized trial. JAMA 2010;304:1912–1918.
6. Carson JL, Terrin ML, Noveck H, et al. Liberal or restrictive transfusion in high-risk patients after hip surgery. N Engl J Med 2011;365:2453–2462.
7. Sprung CL, Annane D, Keh D, et al. Hydrocortisone therapy for patients with septic shock. N Engl J Med 2008;358:111–124.
8. Avins AL, Pressman A, Ackerson L, et al. Placebo adherence and its association with morbidity and mortality in the studies of left ventricular dysfunction. J Gen Intern Med 2010;25:1275–1281.
9. The AIM-HIGH Investigators. Niacin in patients with low HDL cholesterol levels receiving intensive statin therapy. N Engl J Med 2011;365:2255–2267.
10. Feldman T, Foster E, Glower DG, et al. Percutaneous repair or surgery for mitral regurgitation. N Engl J Med 2011;364:1395–1406.
11. Mitja O, Hayes R, Ipai A, et al. Single-dose azithromycin versus benzathine benzylpenicillin for treatment of yaws in children in Papua New Guinea: an open-label, non-inferiority, randomized trial. Lancet 2012;379:342–347.
12. Chrischilles EA, Pendergast JF, Kahn KL, et al. Adverse events among the elderly receiving chemotherapy for advanced non-small cell lung cancer. J Clin Oncol 2010;28:620–627.
13. Kramarz P, DeStefano F, Gargiullo PM, et al. Does influenza vaccination exacerbate asthma. Analysis of a large cohort of children with asthma. Vaccine Safety Datalink Team. Arch Fam Med 2000;9:617–623.
14. Cates CJ, Jefferson T, Rowe BH. Vaccines for preventing influenza in people with asthma. Available at http://summaries.cochrane.org/CD000364/vaccines-for-preventing-influenza-in-people-with-asthma. Accessed July 26, 2012.
15. Feinstein AR, Horwitz RI. Double standards, scientific methods, and epidemiologic research. N Engl J Med 1982;307:1611–1617.
Chapter 10

Prevention
If a patient asks a medical practitioner for help, the doctor does the best he can. He is not responsible for defects in medical knowledge. If, however, the practitioner initiates screening procedures, he is in a very different situation. He should have conclusive evidence that screening can alter the natural history of disease in a significant proportion of those screened.
—Archie Cochrane and Walter Holland
1971

KEY WORDS
Preventive care
Immunizations
Screening
Behavioral counseling
Chemoprevention
Primary prevention
Secondary prevention
Tertiary prevention
Surveillance
Prevalence screen
Incidence screen
Lead-time bias
Length-time bias
Compliance bias
Placebo adherence
Interval cancer
Detection method
Incidence method
False-positive screening test result
Labeling effect
Overdiagnosis
Predisease
Incidentaloma
Quality adjusted life year (QALY)
Cost-effectiveness analysis

Most doctors are attracted to medicine because they look forward to curing disease. But all things considered, most people would prefer never to contract a disease in the first place, or, if they cannot avoid an illness, they prefer that it be caught early and stamped out before it causes them any harm. To accomplish this, people without specific complaints undergo interventions to identify and modify risk factors to avoid the onset of disease or to find disease early in its course so that early treatment prevents illness. When these interventions take place in clinical practice, the activity is referred to as preventive care.
Preventive care constitutes a large portion of clinical practice (1). Physicians should understand its conceptual basis and content. They should be prepared to answer questions from patients such as, “How much exercise do I need, Doctor?” or “I heard that a study showed antioxidants were not helpful in preventing heart disease. What do you think?” or “There was a newspaper ad for a calcium scan. Do you think I should get one?”
Much of the scientific approach to prevention in clinical medicine has already been covered in this book, particularly the principles underlying risk, the use of diagnostic tests, disease prognosis, and effectiveness of interventions. This chapter expands on those principles and strategies as they specifically relate to prevention.

PREVENTIVE ACTIVITIES IN CLINICAL SETTINGS
In the clinical setting, preventive care activities often can be incorporated into the ongoing care of patients, such as when a doctor checks the blood pressure of a patient complaining of a sore throat or orders pneumococcal vaccination in an older person after dealing with a skin rash. At other times, a special visit just for preventive care is scheduled; thus the terms annual physical, periodic checkup, or preventive health examination.

Types of Clinical Prevention
There are four major types of clinical preventive care: immunizations, screening, behavioral counseling (sometimes referred to as lifestyle changes), and chemoprevention. All four apply throughout the life span.

Immunization
Childhood immunizations to prevent 15 different diseases largely determine visit schedules to the pediatrician in the early months of life. Human papillomavirus (HPV) vaccination of adolescent girls has recently been added for prevention of cervical cancer. Adult immunizations include diphtheria, pertussis, and tetanus (DPT) boosters as well as vaccinations to prevent influenza, pneumococcal pneumonia, and hepatitis A and B.

Screening
Screening is the identification of asymptomatic disease or risk factors. Screening tests start in the prenatal period (such as testing for Down syndrome in the fetuses of older pregnant women) and continue throughout life (e.g., when inquiring about hearing in the elderly). The latter half of this chapter discusses scientific principles of screening.

Behavioral Counseling (Lifestyle Changes)
Clinicians can give effective behavioral counseling to motivate lifestyle changes. Clinicians counsel patients to stop smoking, eat a prudent diet, drink alcohol moderately, exercise, and engage in safe sexual practices. It is important to have evidence that (i) behavior change decreases the risk for the condition of interest, and (ii) counseling leads to behavior change before spending time and effort on this approach to prevention (see Levels of Prevention later in the chapter).

Chemoprevention
Chemoprevention is the use of drugs to prevent disease. It is used to prevent disease early in life (e.g., folate during pregnancy to prevent neural tube defects and ocular antibiotic prophylaxis in all newborns to prevent gonococcal ophthalmia neonatorum) but is also common in adults (e.g., low-dose aspirin prophylaxis for myocardial infarction, and statin treatment for hypercholesterolemia).

LEVELS OF PREVENTION
Merriam-Webster’s dictionary defines prevention as “the act of preventing or hindering” and “the act or practice of keeping something from happening” (2). With these definitions in mind, almost all activities in medicine could be defined as prevention. After all, clinicians’ efforts are aimed at preventing the untimely occurrences of the 5 Ds: death, disease, disability, discomfort, and dissatisfaction (discussed in Chapter 1). However, in clinical medicine, the definition of prevention has traditionally been restricted to interventions in people who are not known to have the particular condition of interest. Three levels of prevention have been defined: primary, secondary, and tertiary prevention (Fig. 10.1).

[Figure 10.1 diagram: a timeline running from no disease, through onset and asymptomatic disease, to clinical diagnosis and clinical course. Primary: remove risk factors. Secondary: early detection and treatment. Tertiary: reduce complications.]
Figure 10.1 ■ Levels of prevention. Primary prevention prevents disease from occurring. Secondary prevention detects and cures disease in the asymptomatic phase. Tertiary prevention reduces complications of disease.

Primary Prevention
Primary prevention keeps disease from occurring at all by removing its causes. The most common clinical primary care preventive activities involve immunizations to prevent communicable diseases, drugs, and behavioral counseling. Recently, prophylactic surgery has become more common, with bariatric surgery to prevent complications of obesity, and ovariectomy and mastectomy to prevent ovarian and breast cancer in women with certain genetic mutations.
Primary prevention has eliminated many infectious diseases from childhood. In American men, primary prevention has prevented many deaths from two major killers: lung cancer and cardiovascular disease. Lung cancer mortality in men decreased by 25% from 1991 to 2007, with an estimated 250,000 deaths prevented (3). This decrease followed smoking cessation trends among adults, without organized screening and without much improvement in survival after treatment for lung cancer. Heart disease mortality rates in men have decreased by half over the past several decades (4) not only because medical care has improved, but also because of primary prevention efforts such as smoking cessation and use of antihypertensive and statin medications. Primary prevention is now possible for cervical, hepatocellular, skin and breast cancer, bone fractures, and alcoholism.
A special attribute of primary prevention involving efforts to help patients adopt healthy lifestyles is that a single intervention may prevent multiple diseases. Smoking cessation decreases not only lung cancer but also many other pulmonary diseases, other cancers, and, most of all, cardiovascular disease. Maintaining an appropriate weight prevents diabetes and osteoarthritis, as well as cardiovascular disease and some cancers.
Primary prevention at the community level can also be effective. Examples include immunization requirements for students, no-smoking regulations in public buildings, chlorination and fluoridation of the water supply, and laws mandating seatbelt use in automobiles and helmet use on motorcycles and bicycles. Certain primary prevention activities occur in specific occupational settings (use of earplugs or dust masks), in schools (immunizations), or in specialized health care settings (use of tests to detect hepatitis B and C or HIV in blood banks).
For some problems, such as injuries from automobile accidents, community prevention works best. For others, such as prophylaxis in newborns to prevent gonococcal ophthalmia neonatorum, clinical settings work best. For still others, clinical efforts can complement community-wide activities. In smoking prevention efforts, clinicians help individual patients stop smoking, and public education, regulations, and taxes prevent teenagers from starting to smoke.

Secondary Prevention
Secondary prevention detects early disease when it is asymptomatic and when treatment can stop it from progressing. Secondary prevention is a two-step process, involving a screening test and follow-up diagnosis and treatment for those with the condition of interest. Testing asymptomatic patients for HIV and routine Pap smears are examples. Most secondary prevention is done in clinical settings.
As indicated earlier, screening is the identification of an unrecognized disease or risk factor by history taking (e.g., asking if the patient smokes), physical examination (e.g., a blood pressure measurement), laboratory test (e.g., checking for proteinuria in a diabetic), or other procedure (e.g., a bone mineral density examination) that can be applied reasonably rapidly to asymptomatic people. Screening tests sort out apparently well persons (for the condition of interest) who have an increased likelihood of disease or a risk factor for a disease from people who have a low likelihood. Screening tests are part of all secondary and some primary and tertiary preventive activities.
A screening test is usually not intended to be diagnostic. If the clinician and/or patient are not committed to further investigation of abnormal results and treatment, if necessary, the screening test should not be performed at all.

Tertiary Prevention
Tertiary prevention describes clinical activities that prevent deterioration or reduce complications after a disease has declared itself. An example is the use of beta-blocking drugs to decrease the risk of death in patients who have recovered from myocardial infarction. Tertiary prevention is really just another term for treatment, but treatment focused on health effects occurring not so much in hours and days but months and years. For example, in diabetic patients, good treatment requires not just control of blood glucose. Searches for and successful treatment of other cardiovascular risk factors (e.g., hypertension, hypercholesterolemia, obesity, and smoking) help prevent cardiovascular disease in diabetic patients as much as, and even more than, good control of blood glucose. In addition, diabetic patients need regular ophthalmologic examinations for detecting early diabetic retinopathy, routine foot care, and monitoring for urinary protein to guide use of angiotensin-converting enzyme inhibitors to prevent renal failure. All these preventive activities are tertiary in the sense that they prevent and reduce complications of a disease that is already present.

Confusion about Primary, Secondary, and Tertiary Prevention
Over the years, as more and more of clinical practice has involved prevention, the distinctions among primary, secondary, and tertiary prevention have become blurred. Historically, primary prevention was thought of as primarily vaccinations for infectious disease and counseling for healthy lifestyle behaviors, but primary prevention now includes prescribing antihypertensive medication and statins to prevent cardiovascular diseases, and performing prophylactic surgery to prevent ovarian cancer in women with certain genetic abnormalities. Increasingly, risk factors are treated as if they are diseases, even at a time when they have not caused any of the 5 Ds. This is true for a growing number of health risks, for example, low bone mineral density, hypertension, hyperlipidemia, obesity, and certain genetic abnormalities. Treating risk factors as disease broadens the definition of secondary prevention into the domain of traditional primary prevention.
In some disciplines, such as cardiology, the term secondary prevention is used when discussing tertiary prevention. “A new era of secondary prevention” was declared when treating patients with acute coronary syndrome (myocardial infarction or unstable angina) with a combination of antiplatelet and anticoagulant therapies to prevent cardiovascular death (5). Similarly, “secondary prevention” of stroke is used to describe interventions to prevent stroke in patients with transient ischemic attacks.
Tests used for primary, secondary, and tertiary prevention, as well as for diagnosis, are often identical, another reason for confusing the levels of prevention (and confusing prevention with diagnosis). Colonoscopy may be used to find a cancer in a patient with blood in his stool (diagnosis); to find an early asymptomatic colon cancer (secondary prevention); to remove an adenomatous polyp, which is a risk factor for colon cancer (primary prevention); or to check for cancer recurrence in a patient treated for colon cancer (a tertiary preventive activity referred to as surveillance).
Regardless of the terms used, an underlying reason to differentiate levels among preventive activities is that there is a spectrum of probabilities of disease and adverse health effects from the condition(s) being sought and treated during preventive activities, as well as different probabilities of adverse health effects from interventions that are used for prevention at the various levels. The underlying risk of certain health problems is usually much higher in diseased than healthy people. For example, the risk of cardiovascular disease in diabetics is much greater than in asymptomatic non-diabetics. Identical tests perform differently depending on the level of prevention. Furthermore, the trade-offs between effectiveness and harms can be quite different for patients in different parts of the spectrum. False-positive test results and overdiagnosis (both discussed later in this chapter) among people without the disease being sought are important issues in secondary prevention, but they are less important in treatment of patients already known to have the disease in question. The terms primary, secondary, and tertiary prevention are ways to consider these differences conceptually.

SCIENTIFIC APPROACH TO CLINICAL PREVENTION
When considering what preventive activities to perform, the clinician must first decide with the patient which medical problems or diseases they should try to prevent. This statement is so clear and obvious that it would seem unnecessary to mention, but the fact is that many preventive procedures, especially screening tests, are performed without a clear understanding of what is being sought or prevented. For instance, physicians performing routine checkups on their patients may order a urinalysis. However, a urinalysis might be used to search for any number of medical problems, including diabetes, asymptomatic urinary tract infections, renal cancer, or renal failure. It is necessary to decide which, if any, of these conditions is worth screening for before undertaking the test. One of the most important scientific advances in clinical prevention has been the development of methods for deciding whether a proposed preventive activity should be undertaken (6). The remainder of this chapter describes these methods and concepts.
Three criteria are important when judging whether a condition should be included in preventive care (Table 10.1):
1. The burden of suffering caused by the condition.
2. The effectiveness, safety, and cost of the preventive intervention or treatment.
3. The performance of the screening test.

Table 10.1
Criteria for Deciding Whether a Medical Condition Should Be Included in Preventive Care
1. How great is the burden of suffering caused by the condition in terms of:
   Death
   Disease
   Disability
   Discomfort
   Dissatisfaction
   Destitution
2. How good is the screening test, if one is to be performed, in terms of:
   Sensitivity
   Specificity
   Simplicity
   Cost
   Safety
   Acceptability
3. A. For primary and tertiary prevention, how good is the therapeutic intervention in terms of:
   Effectiveness
   Safety
   Cost-effectiveness
   Or
   B. For secondary prevention, if the condition is found, how good is the ensuing treatment in terms of:
   Effectiveness
   Safety
   Early treatment after screening being more effective than later treatment without screening, when the patient becomes symptomatic
   Cost-effectiveness
all the difficulties of randomized

BURDEN OF SUFFERING

Only conditions posing threats to life or health (the 5 Ds in Chapter 1) should be included in preventive care. The burden of suffering of a medical condition is determined primarily by (i) how much suffering (in terms of the 5 Ds) it causes those afflicted with the condition, and (ii) its frequency.
How does one measure suffering? Most often, it is measured by mortality rates and frequency of hospitalizations and amount of health care utilization caused by the condition. Information about how much disability, pain, nausea, or dissatisfaction a given disease causes is much less available.
The frequency of a condition is also important
in deciding about prevention. A disease may cause
great suffering for individuals who are unfortunate
enough to get it, but it may occur too rarely—
especially in the individual’s particular age group—
for screening to be considered. Breast cancer is an
example. Although it can occur in much younger
women, most breast cancers occur in women older
than 50 years of age. For 20-year-old women,
annual breast cancer incidence is 1.6 in 100,000
(about one-fifth the rate for men in their later 70s)
(7). Although breast cancer should be sought in
preventive care for older women, it is too
uncommon in average 20-year-old women and
70-year-old men for screening. Screening for very
rare diseases means not only that very few people,
at most, will benefit but also that some people will
have false-positive tests and be subject to
complications from further diagnostic evaluation.
The incidence of what is to be prevented is
especially important in primary and secondary prevention
because, regardless of the disease, the risk is low for
most individuals. Stratifying populations according
to risk and targeting the higher-risk groups can
help overcome this problem, a practice frequently
done by concentrating specific preventive activities
on certain age groups.
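
The point about rare conditions can be made concrete with a small calculation. The sketch below uses the incidence quoted in the text (1.6 per 100,000 for 20-year-old women) as a rough stand-in for the proportion of screened women who have the disease; the sensitivity and specificity values are hypothetical, chosen only for illustration, and do not describe any particular test.

```python
def screening_yield(n_screened, disease_rate, sensitivity, specificity):
    """Expected true positives, false positives, and positive predictive value
    for one round of screening in a population of n_screened people."""
    diseased = n_screened * disease_rate
    healthy = n_screened - diseased
    true_pos = diseased * sensitivity
    false_pos = healthy * (1 - specificity)
    ppv = true_pos / (true_pos + false_pos)
    return true_pos, false_pos, ppv


# Rate from the text: 1.6 cases per 100,000 women aged 20.
# Sensitivity (0.90) and specificity (0.95) are assumed, illustrative values.
tp, fp, ppv = screening_yield(100_000, 1.6 / 100_000, 0.90, 0.95)
print(f"true positives:  {tp:.2f}")
print(f"false positives: {fp:.0f}")
print(f"positive predictive value: {ppv:.5f}")
```

Under these assumptions, roughly one or two cancers would be found per 100,000 women screened, against several thousand false-positive results. Raising the disease rate (for example, by restricting screening to an older age group) shifts the balance, which is the rationale for targeting higher-risk groups.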

EFFECTIVENESS OF TREATMENT
As pointed out in Chapter 9, randomized
controlled trials are the strongest scientific evidence
for establishing the effectiveness of treatments. It
is usual practice to meet this standard for tertiary
prevention (treatments). On the other hand, to
conduct randomized trials when evaluating
primary or secondary prevention requires very large
studies on thousands, often tens of thousands, of
patients, carried out over many years, sometimes
decades, because the outcome of interest is rare and
often takes years to occur. The task is daunting;
all the difficulties of randomized controlled trials
laid out in Chapter 9 on Treatment are magnified
many-fold.
Other challenges in evaluating treatments in
prevention are outlined in the following text for
each level of prevention.

Treatment in Primary Prevention


Whatever the primary intervention
(immunizations, drugs, behavioral
counseling, or prophylactic surgery), it
should be efficacious (able to produce a
beneficial result in ideal situations) and
effective (able to produce a beneficial net
result under usual conditions, taking into
account patient compliance). Because
interventions for primary prevention are
usually given to large numbers of healthy
people, they also must be very safe.

Randomized Trials
Virtually all recommended immunizations are
backed by evidence from randomized trials,
sometimes relatively quickly when the
outcomes occur within weeks or months, as in
childhood infections. Because pharmaceuticals
are regulated, primary and
secondary preventive activities involving
drugs (e.g., treatment of hypertension and
hyperlipidemia in adults) also usually have
been evaluated by randomized trials.
Randomized trials are less common when
the proposed prevention is not regulated, as
is true with vitamins, minerals, and food
supplements, or when the intervention is
behavioral counseling.

Observational Studies
Observational studies can help clarify the
effectiveness of primary prevention when
randomization is not possible.

Example

Randomized trials have found hepatitis B virus (HBV) vaccines hig


determine if decades later it also was effective against cancer. A study
was done in Taiwan, where nationwide HBV vaccination was begun in 1984.
Rates of hepatocellular cancer were compared between people who were
immunized at birth between 1984 and 2004 and those born between 1979 and
1984, when no vaccination program existed. (The comparison was made
possible by thorough national health databases on the island.)
Hepatocellular cancer incidence decreased almost 70% among young people
in the 20 years after the introduction of HBV

As pointed out in Chapter 5, observational studies
are vulnerable to bias. The conclusion that HBV vaccine
prevents hepatocellular carcinoma is reasonable
from a biologic perspective and from the dramatic
result. It will be on even firmer ground if studies
of other populations who undergo vaccination
confirm the results from the Taiwan study.
Building the case for causation in the absence of
randomized trials is covered in Chapter 12.

Safety
With immunizations, the occurrence of adverse
effects may be so rare that randomized trials
would be unlikely to uncover them. One way to
study this question is to track illnesses in large
datasets of millions of patients and to compare the
frequency of an adverse effect linked temporally to
the vaccination among groups at different time
periods.

Example
Guillain-Barré syndrome (GBS) is a rare, serious
immune-mediated neurological disorder
characterized by ascending paralysis that can
temporarily involve respiratory muscles so that
intubation and ventilator support are required.
A vaccine developed against swine flu in the
1970s was associated with an unexpected
sharp rise in the number of cases of GBS and
contributed to suspension of the vaccination
program. In 2009, a vaccine was developed to
protect against a novel influenza A (H1N1) virus
of swine origin, and a method was needed
to track GBS incidence. One way this was
done was to utilize a surveillance system of
the electronic health databases of more
than 9 million persons. Comparing the
incidence of GBS occurring up to 6 weeks
after vaccination to that of later
(background) GBS occurrence in the same
group of vaccinated individuals, the
attributable risk of the 2009 vaccine was
estimated to be an additional five cases of
GBS per million vaccinations. Although GBS
incidence increased after vaccination, the
effect in 2009 was about half that seen in
the 1970s (9). The very low estimated
attributable risk in 2009 is reassuring. If
rare events, occurring in only a few people
per million, are to be detected in near real
time, population-based surveillance systems
are required. Even so, associations found in
surveillance systems are weak evidence
for a causal relationship because they are
observational in nature and electronic
databases often do not have information on

Counseling
U.S. laws do not require rigorous evidence of
effectiveness of behavioral counseling methods.
Nevertheless, clinicians should require scientific
evidence before incorporating routine counseling into
preventive care; counseling that does not work
wastes time, costs money, and may harm patients.
Research has demonstrated that certain counseling
methods can help patients change some health
behaviors. Smoking cessation efforts have led the
way with many randomized trials evaluating
different approaches.

questions were addressed by a panel that reviewed
all studies done on smoking cessation,
focusing on randomized trials (10). They found
43 trials that assessed amounts of counseling
contact and found a dose-response: the more
contact time, the better the abstinence rate
(Fig. 10.2). In addition, the panel found that
randomized trials demonstrated that pharmacotherapy
with bupropion (a centrally acting drug
that decreases craving), varenicline (a nicotine
receptor agonist), nicotine gum, nasal
spray, or patches were effective. Combining
counseling and medication (evaluated in 18
trials) increased the smoking cessation rate
still further. On the other hand, there was no
effect of anxiolytics, beta-blockers, or

[Figure 10.2: smoking cessation rate (%), on a scale of 0 to 30, by number
of counseling sessions (0–1, 2–3, 4–8, >8), without (–) and with (+)
medication.]
Figure 10.2 ■ Dose response of smoking cessation rates according to
the number of counseling sessions clinicians have with patients and use
of medication. (Data from Fiore MC, Jaén CR, Baker TB, et al. Treating
tobacco use and dependence: 2008 Update. Clinical Practice Guideline.
Rockville, MD: U.S. Department of Health and Human Services. Public
Health Service. May 2008.)

Treatment in Secondary Prevention
Treatments in secondary prevention are generally the
same as treatments for curative medicine. Like
interventions for symptomatic disease, they should be
both efficacious and effective. Unlike usual
interventions for disease, however, it typically takes years to
establish that a secondary preventive intervention is
effective, and it requires large numbers of people to
be studied. For example, early treatment after
colorectal cancer screening can decrease colorectal
cancer deaths by approximately one-third, but to
show this effect, a study of 45,000 people with 13
years of follow-up was required (11).
A unique requirement for treatment in
secondary prevention is that treatment of early,
asymptomatic disease must be superior to treatment
of the disease when it would have been diagnosed
in the usual course of events, when a patient seeks
medical care for symptoms. If outcome in the two
situations is the same, screening does not add
value.

Example
Lung cancer is the leading cause of cancer-related death in the
3.9 per 1,000 person-years in men not offered screening (12). Ho

Treatment in Tertiary Prevention
All new pharmaceutical treatments in the United
States are regulated by the U.S. Food and Drug
Administration, which almost always requires
evidence of efficacy from randomized clinical trials.
It is easy to assume, therefore, that tertiary
preventive treatments have been carefully evaluated.
However, after a drug has been approved, it may be
used for new, unevaluated indications. Patients
with some diseases are at increased risk for other
diseases; thus, some drugs are used not only to treat
the condition for which they are approved but also to
prevent other diseases for which patients are at
increased risk. The distinction between proven
therapeutic effects of a medicine for a given
disease and its effect in preventing other diseases is
a subtle challenge facing clinicians when considering
tertiary preventive interventions that have not been
evaluated for that purpose. Sometimes, careful
evaluations have led to surprising results.

Example
In Chapter 9, we presented an example of a study of treatment for patients with type 2 diabetes. It showed the surpris

METHODOLOGIC ISSUES IN
EVALUATING SCREENING
PROGRAMS
Several problems arise in the evaluation of screening
programs, some of which can make it appear that
early treatment is effective after screening when it is
not. These issues include the difference between
prevalence and incidence screens and three biases
that can occur in screening studies: lead-time,
length-time, and compliance biases.

Prevalence and Incidence Screens
The yield of screening decreases as screening is
repeated over time. Figure 10.3 demonstrates why this
is so. The first time that screening is carried out (the
prevalence screen), cases of the medical condition
will have been present for varying lengths of time.
During the second round of screening, most cases
found will have had their onset between the first and
second screening. (A few will have been missed by
the first screen.) Therefore, the second (and
subsequent) screenings are called incidence screens.
Thus, when a group of people are periodically
rescreened, the number of cases of disease

[Figure 10.3: cases shown as timelines ending in diagnosis (Dx) across
rounds of screening 1, 2, and 3; the counts given for the three rounds
are 5, 3, and 3.]
Figure 10.3 ■ The decreasing yield of a screening test
after the first round of screening. The first round (prevalence
screening) detects prevalent cases. The second and third rounds
(incidence screenings) detect incident cases. In this figure, it is
assumed that the test detects all cases and that all people in the
population are screened. When this is not so, cases missed in the
first round are available for detection in subsequent rounds, and
the yield would be higher. O = onset of disease; Dx = diagnosis
time, if screening were not carried out.

[Figure 10.4: timelines from onset through early and usual diagnosis (Dx)
to death for three cases: unscreened; screened, early treatment not
effective; and screened, early treatment is effective (improved
survival).]
Figure 10.4 ■ How lead time affects survival time after screening; O = onset of disease.
Pink-shaded areas indicate length of survival after diagnosis (Dx).

in the group drops after the prevalence screen. This
means that the positive predictive value for test
results will decrease after the first round of
screening.

Special Biases
The following biases are most likely to be a problem
in observational studies of screening.

Lead-Time Bias
Lead time is the period of time between the detection
of a medical condition by screening and when
it ordinarily would have been diagnosed because a
patient experienced symptoms and sought medical
care (Fig. 10.4). The amount of lead time for a given
disease depends on the biologic rate of progression
of the disease and how early the screening test can
detect the disease. When lead time is very short, as
is true with lung cancer, it is difficult to demonstrate
that treatment of medical conditions picked up
on screening is more effective than treatment after
symptoms appear. On the other hand, when lead
time is long, as is true for cervical cancer (on
average, it takes 20 to 30 years for it to progress from
carcinoma in situ into a clinically invasive disease),
treatment of the medical condition found on screening
can be very effective.
How can lead time cause biased results in a study
of the efficacy of early treatment? As Figure 10.4
shows, because of screening, a disease is found
earlier than it would have been after the patient
developed symptoms. As a result, people who are
diagnosed by screening for a deadly disease will, on
average, survive longer from the time of diagnosis
than people who are diagnosed after they develop
symptoms, even if early treatment is no more
effective than treatment at the time of clinical
presentation. In such a situation, screening would
appear to help people live longer, spuriously improving
survival rates when, in reality, they would have been
given not more “survival time” but more “disease
time.” An appropriate method of analysis to avoid
lead-time bias is to compare age-specific
mortality rates rather than survival rates in a
screened group of people and a control group of
similar people who do not get screened, as in a
randomized trial (Table 10.2). Screening for breast,
lung, and colorectal cancers is
known to be effective because studies have shown
that mortality rates of screened persons are lower
than those of a comparable group of unscreened
people.

Table 10.2
Avoiding Bias in Screening

Bias: Lead time
Effect: Appears to improve survival time but actually increases “disease time” after disease detection.
How to Avoid: Use mortality rather than survival rates.

Bias: Length time
Effect: Outcome appears better in screened group because more cancers with a good prognosis are detected.
How to Avoid: Compare outcomes in randomized controlled trial with control group and one offered screening. Count all outcomes regardless of method of detection.

Bias: Compliance
Effect: Outcome in screened group appears better due to compliance, not screening.
How to Avoid: Compare outcomes in randomized controlled trial with control group and one offered screening. Count all outcomes regardless of compliance.
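
Lead-time bias can be illustrated with a small deterministic sketch. The five patients, their ages, and the 3-year lead time below are invented for illustration only. Early treatment is assumed to have no effect, so screening moves the date of diagnosis but not the date of death.

```python
def five_year_survival(diagnosis_ages, death_ages):
    """Fraction of patients still alive 5 years after diagnosis."""
    alive = sum(1 for dx, death in zip(diagnosis_ages, death_ages)
                if death - dx >= 5)
    return alive / len(diagnosis_ages)


def deaths_by_age(death_ages, age_limit):
    """Number of deaths at or before age_limit (a mortality-style count)."""
    return sum(1 for death in death_ages if death <= age_limit)


# Hypothetical cohort: age at symptomatic diagnosis and age at death.
clinical_dx = [60, 62, 65, 67, 70]
death = [63, 66, 68, 73, 72]
lead_time = 3  # assumed number of years by which screening advances diagnosis

# Screening finds each case earlier; deaths are unchanged because early
# treatment is assumed to be ineffective.
screened_dx = [age - lead_time for age in clinical_dx]

unscreened_survival = five_year_survival(clinical_dx, death)
screened_survival = five_year_survival(screened_dx, death)
unscreened_deaths = deaths_by_age(death, 70)
screened_deaths = deaths_by_age(death, 70)

print("5-year survival, unscreened:", unscreened_survival)
print("5-year survival, screened:  ", screened_survival)
print("deaths by age 70, unscreened vs screened:",
      unscreened_deaths, "vs", screened_deaths)
```

In this toy cohort, 5-year survival appears to rise sharply with screening even though every patient dies at the same age in both scenarios, while the count of deaths by a fixed age is identical. This is why mortality rates, rather than survival from diagnosis, are the appropriate comparison.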

Length-Time Bias
Length-time bias occurs because the proportion of
slow-growing lesions diagnosed during screening is
greater than the proportion of those diagnosed during
usual medical care. As a result, length-time
bias makes it seem that screening and early
treatment are more effective than usual care.
Length-time bias occurs in the following way.
Screening works best when a medical condition
develops slowly. Most types of cancers, however,
demonstrate a wide range of growth rates. Some of
them grow slowly, some very fast. Screening tests
are likely to find mostly slow-growing tumors
because they are present for a longer period of time
before they cause symptoms. Fast-growing tumors
are more likely to cause symptoms that lead to
diagnosis in the interval between screening
examinations, as illustrated in Figures 10.5 and
10.6. Screening, therefore, tends to find tumors
with inherently better prognoses. As a result, the
mortality rates of cancers found through screening
may be better than those not found through
screening, but screening is not protective in this
situation.

[Figure 10.5: timelines of cases from onset (O) to diagnosis (Dx), shown
relative to the time of screening.]
Figure 10.5 ■ Length-time bias. Cases that progress
rapidly from onset (O) to symptoms and diagnosis (Dx) are less
likely to be detected during a screening examination.

Compliance Bias
The third major bias that can occur in prevention
studies is compliance bias. Compliant patients
tend to have better prognoses regardless of preventive
activities. The reasons for this are not completely
clear, but on average, compliant patients are more

[Figure 10.6: size/stage plotted against time from onset, with curves for
rapid growth and slow growth, a level at which detection by screening
becomes possible, the time of screening, and a higher level at which
diagnosis follows symptoms; points are marked D and S.]
Figure 10.6 ■ Relationship between length-time bias and speed of tumor growth.
Rapidly growing tumors come to medical attention before screening is performed, whereas more
slowly growing tumors allow time for detection. D = diagnosis after symptoms; S = detection
after screening.

interested in their health and are generally
healthier than non-compliant ones. For example, a
randomized study that invited people for
screening found that volunteers from the control
group who were not invited but requested
screening had better mortality rates than the
invited group, which contained both compliant
people who wanted screening and those who
refused (15). The effect of patient compliance, as
distinct from treatment effect, has primarily involved
medication adherence in the placebo group, and
has been termed placebo adherence.

Example
An analysis was done to determine if health outcomes differed in the placebo arm of a randomized trial among patients
35%) to active treatment (enalapril) or placebo. The analysis showed that after 3 years, patients randomized to placebo wh

Biases from length time and patient compliance
can be avoided by relying on studies that have
concurrent screened and control groups that are
comparable. In each group, all people experiencing
the outcomes of interest must be counted, regardless
of the method of diagnosis or degree of
participation (Table 10.2). Randomized trials are
the strongest design because patients who are
randomly allocated will have comparable numbers
of slow- and fast-growing tumors and, on average,
comparable levels of compliance. These groups then
can be followed over time with mortality rates,
rather than survival rates, to avoid lead-time bias. If
a randomized trial is not possible, results of
population-based observational studies can be valid.
In such cases, the screened and control groups
must be made up of similar populations, the control
population should not have access to screening, and
both populations must have careful follow-up to
document all cases of the outcome being studied.
Because randomized controlled trials and
prospective population-based studies are difficult to
conduct, take a long time, and are expensive,
investigators sometimes try to use other kinds of
studies, such as historical cohort studies (Chapter
5) or case control studies (Chapter 6), to investigate
preventive maneuvers.

Example
To test whether periodic screening with sigmoidoscopy reduces w
10 years among patients dying of colorectal cancer and among we

PERFORMANCE OF
SCREENING TESTS
Tests used for screening should meet the criteria
for diagnostic tests laid out in Chapter 8.
The following criteria for a good screening test
apply to all types of screening tests, whether they are
history, physical examination, or laboratory tests.

High Sensitivity and Specificity
The very nature of searching for a disease in
people without symptoms means that prevalence is
usually very low, even among high-risk groups
who were selected because of age, sex, and other
risk characteristics. A good screening test must,
therefore, have a high sensitivity so that it does not
miss the few cases of disease present. It must also
be sensitive early in the disease, when the
subsequent course can still be altered. If a
screening test is sensitive only for late-stage
disease, which has progressed too far for effective
treatment, the test would be useless. A screening test
should also have a high specificity to reduce the
number of people with false-positive results who
require diagnostic evaluation.
Sensitivity and specificity are determined for
screening tests much as they are for diagnostic tests,
with one major difference. As discussed in Chapter
8, the sensitivity and specificity of a diagnostic test
are determined by comparing the results to another
test (the gold standard). In screening, the gold
standard for the presence of disease often is not
only another, more accurate, test but also a period of
time for follow-up. The gold standard test is
routinely applied only to people with positive
screening test results, to differentiate between
true- and false-positive results. A period of follow-up
is applied to all people who have a negative
screening test result, in order to differentiate
between true- and false-negative test results.
Follow-up is particularly important in cancer
screening, where interval cancers, cancers not
detected during screening but subsequently discovered
over the follow-up period, occur. When interval
cancers occur, the calculated test sensitivity is lowered.

Example
In Chapter 8, we presented a study in which prostate-specific antigen (PSA) levels were measured in stored blood sam
cancer (18). Thirteen of 18 men who were
diagnosed with prostate cancer within a
year after the blood sample had elevated
PSA levels (4.0 ng/mL) and would have
been diagnosed after an abnormal PSA
result; the other five had normal PSA
results and developed interval cancers
during the first year after a normal PSA
test. Thus, sensitivity of PSA was calculated

A key challenge is to choose a correct period of
follow-up. If the follow-up period is too short,
disease missed by the screening test might not have a
chance to make itself obvious, so the test’s
sensitivity may be overestimated. On the other
hand, if the follow-up period is too long, disease
not present at the time of screening might be
found, resulting in a falsely low estimation of the
test’s sensitivity.

Detection and Incidence Methods
for Calculating Sensitivity
Calculating sensitivity by counting cancers detected
during screening as true positives and interval
cancers as false negatives is sometimes referred to as
the detection method (Table 10.3). The method
works well for many screening tests, but there are
two difficulties with it for some cancer screening
tests. First, as already pointed out, it requires that the
appropriate amount of follow-up time for interval
cancers be known; often, it is not known and must be
guessed. The detection method also assumes that the
abnormalities detected by the screening test would go
on to cause trouble if left alone. This is not necessarily
so for several cancers, particularly prostate
cancer.

Example
Histologic prostate cancer is common in men, especially older

Table 10.3
Calculating Sensitivity of a Cancer Screening Test According to the Detection
Method and the Incidence Method

Theoretical Example
A new screening test is introduced for pancreatic cancer. In a screening group, cancer is detected in 200 people; over the ensuing year,
another 50 who had negative screening tests are diagnosed with pancreatic cancer. In a concurrent control group with the same
characteristics and the same size, members did not undergo screening; 100 people were diagnosed with pancreatic cancer during the
year.

Sensitivity of the Test Using the Detection Method
Sensitivity = Number of screen-detected cancers / (Number of screen-detected cancers + number of interval cancers)
            = 200 / (200 + 50)
            = 0.80, or 80%

Sensitivity of the Test Using the Incidence Method
Sensitivity = 1 – (interval rate in the screening group / incidence rate in the control group)
            = 1 – (50 / 100)
            = 0.50, or 50%
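
The two calculations in Table 10.3 can be written as a short sketch. The function names are illustrative only. Because the table states that the screening and control groups are the same size, counts are used here in place of rates for the incidence method; with groups of unequal size, rates per person would be needed instead.

```python
def sensitivity_detection(screen_detected, interval_cancers):
    """Detection method: screen-detected cancers are true positives and
    interval cancers are false negatives."""
    return screen_detected / (screen_detected + interval_cancers)


def sensitivity_incidence(interval_cancers, control_cancers):
    """Incidence method: 1 minus the ratio of the interval cancer rate in the
    screened group to the incidence in the unscreened control group.
    Counts stand in for rates because the groups are assumed equal in size."""
    return 1 - interval_cancers / control_cancers


# Numbers from the theoretical example in Table 10.3.
detection = sensitivity_detection(screen_detected=200, interval_cancers=50)
incidence = sensitivity_incidence(interval_cancers=50, control_cancers=100)

print(f"detection method: {detection:.0%}")
print(f"incidence method: {incidence:.0%}")
```

The gap between the two estimates comes from the 200 screen-detected cancers in a population where only 100 would have been diagnosed without screening; the detection method counts all of them as true positives, whereas the incidence method ignores them.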

The incidence method calculates sensitivity
by using the incidence in persons not undergoing
screening and the interval cancer rate in persons
who are screened (Table 10.3). The rationale for this
approach is that the sensitivity of a test should affect
interval cancer rates, not disease incidence. For
prostate cancer, the incidence method defines sensitivity
of the test as 1 minus the ratio of the interval
prostate cancer rate in a group of men undergoing
periodic screening to the incidence of prostate cancer
in a group of men not undergoing screening
(control group). The incidence method of
calculating sensitivity gets around the problem of
counting “benign” prostate cancers, but it may
underestimate sensitivity because it excludes cancers
with long lead times. True sensitivity of a test is,
therefore, probably between the estimates of the two
methods.

Low Positive Predictive Value
Because of the low prevalence of most diseases in
asymptomatic people, the positive predictive value of
most screening tests is low, even for tests with
high specificity. (The reverse is true for negative
predictive value, because when prevalence is low,
the negative predictive value is likely to be high.)
Clinicians who perform screening tests on their
patients must accept the fact that they will have to
work up many patients who have positive
screening test results but do not have disease.
However, they can minimize the problem by
concentrating their screening efforts on people
with a higher prevalence for disease.

Example
The incidence of breast cancer increases with age, from approx

Simplicity and Low Cost
An ideal screening test should take only a few
minutes to perform, require minimum preparation
by the patient, depend on no special appointments,
and be inexpensive.
Simple, quick examinations such as blood
pressure determinations are ideal screening tests.
Conversely, tests such as colonoscopy, which are
expensive and
require an appointment and bowel preparation, are
best suited for diagnostic testing in patients with
symptoms and clinical indications. Nevertheless,
screening colonoscopy has been found to be
highly effective in decreasing colorectal
mortality, and a negative test does not have to be
repeated for several years. Other tests, such as
visual field testing for the detection of glaucoma
and audiograms for the detection of hearing loss,
fall between these two extremes. The financial
“cost” of the test depends not only on the cost of
(or charge for) the procedure itself but also on the
cost of subsequent evaluations performed on
patients with positive test results. Thus sensitivity,
specificity, and predictive value affect cost. Cost
is also affected by whether the test requires a special
visit to the physician. Screening tests performed
while the patient is seeing his or her physician for
other reasons (as is frequently the case with blood
pressure measurements) are much cheaper for
patients than tests requiring special visits, extra time
off work, and additional transportation. Cost also is
determined by how
often a screening test must be repeated.

[Figure 10.7: ratio of number of women without breast cancer to those
with breast cancer (vertical axis, 10 to 60) by age in years (40–44,
45–49, 50–54, 55–59, 60–69, 70–79, 80–89).]
Figure 10.7 ■ Yield of abnormal screening mammograms
according to patient age. Number of women without
breast cancer for each woman diagnosed with breast cancer
among women having an abnormal mammogram when
screened for breast cancer. (Data from Carney PA, Miglioretti
DL, Yankaskas BC, et al. Individual and combined effects of
age, breast density, and hormone replacement therapy use on
the accuracy of screening mammography. Ann Intern Med
2003;138:168–175.)

Taking all these issues into account sometimes
leads to surprising conclusions.

Example
Several different tests can be used to screen for colorectal can
$20 for fecal occult blood tests to more than a $1,000 for scr

Safety
It is reasonable and ethical to accept a certain risk for
diagnostic tests applied to sick patients seeking help
for specific complaints. The physician cannot avoid
action when the patient is severely ill, and does his
or her best. It is quite another matter to subject
presumably well people to risks. In such
circumstances, the procedure should be especially
safe. This is partly because the chances of finding
disease in healthy people are so low. Thus, although
colonoscopy is hardly thought of as a dangerous
procedure when used on patients with
gastrointestinal complaints, it can cause bowel
perforation. In fact, when colonoscopy, with a rate of
two perforations per 1,000 examinations, is used to
screen for colorectal cancer in people in their 50s,
perforations occur more often than cancers are
found.
Concerns have been raised about possible
long-term risks with the increasing use of CT scans to
screen for coronary artery disease or, in the case of
whole-body scans, a variety of abnormalities. The
radiation dose of CT scans varies by type, with a CT
scan for coronary calcium on average being the
equivalent of about 30, and a whole-body scan
about 120, chest x-rays. One
Chapter 10: Prevention 171
surgery as part of the diagnostic evaluation of the test result.
estimate of risk projected 29,000 excess cancers as a Because of false-positive screening tests, five
result of 70 million CT scans performed in the
United States in 2007 (22). If these concerns are
correct, CT scans used to screen for early cancer
could themselves cause cancer over subsequent
decades.

Acceptable to
Patients and
Clinicians
If a screening test is associated with discomfort, it
usually takes several years to convince large percent-
ages of patients to obtain the test. This has been
true for Pap smears, mammograms,
sigmoidoscopies, and colonoscopies. By and large,
however, the American public supports screening.
The acceptability of the test to clinicians may
be overlooked by all but the ones performing it.
Clini- cian acceptance is especially relevant for
screening tests that involve clinical skill, such as
mammogra- phy, sigmoidoscopy, or colonoscopy.
In a survey of 53 mammography facilities, 44%
indicated shortages of mammographers. The authors
speculated that low reimbursement for screening
mammograms, high levels of malpractice litigation
in breast imaging, and administrative regulations all
may be reasons (23).

UNINTENDED CONSEQUENCES
OF SCREENING
Adverse effects of screening tests include discomfort
during the test procedure (the majority of women
undergoing mammography say that the procedure is
painful, although usually not so severe that
patients refuse the test), long-term radiation effects
after expo- sure to radiographic procedures, false-
positive test results (with resulting needless workups
and negative labeling effects), overdiagnosis, and
incidentalomas. The last three will be discussed in
this section.

Risk of False-Positive Result
A false-positive screening test result is an abnormal result in a person without disease. As already mentioned, tests with low positive predictive values (resulting from low prevalence of disease, poor specificity of the test, or both) are likely to lead to a higher frequency of false positives. False-positive results, in turn, can lead to negative labeling effects, inconvenience, and expense in obtaining follow-up procedures. In certain situations, false-positive results can lead to major surgery. In a study of ovarian cancer screening, 8.4% (3,285) of 39,000 women had a false-positive result and one-third of those underwent surgery as part of the diagnostic evaluation of the test result. Because of false-positive screening tests, five times more women without ovarian cancer had surgery than those with ovarian cancer (24).

Table 10.4
Relation between Number of Different Screening Tests Ordered and Percentage of Normal People with at Least One Abnormal Test Result

Number of Tests    People with at Least One Abnormality (%)
1                  5
5                  23
20                 64
100                99.4

Data from Sackett DL. Clinical diagnosis and the clinical laboratory. Clin Invest Med 1978;1:37–43.

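The dependence of positive predictive value on prevalence and specificity, noted in the discussion of false positives, can be illustrated with a short calculation. This is a minimal sketch, not a method from the text: the function name and the sensitivity, specificity, and prevalence values are hypothetical, chosen only to show the direction of the effect.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Share of positive test results that are true positives."""
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)


# Hypothetical test with 90% sensitivity and 90% specificity,
# applied to populations with different disease prevalence.
for prevalence in (0.001, 0.01, 0.10):
    ppv = positive_predictive_value(0.90, 0.90, prevalence)
    print(f"prevalence {prevalence:.3f}: PPV {ppv:.3f}")

# Same prevalence, better specificity: fewer false positives, higher PPV.
print(f"{positive_predictive_value(0.90, 0.99, 0.01):.3f}")
```

With the same test, the lower the prevalence, the larger the share of positive results that are false positives, which is the usual situation in screening.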
False-positive results account for only a minority of screening test results (only about 10% of screening mammograms are false positives). Even so, they can affect large percentages of people who get screened. This happens in two ways. First, most clinicians do not perform only one or two tests on patients presenting for routine checkups. Modern technology, and perhaps the threat of lawsuits, has fueled the propensity to “cover all the bases.” Automated tests allow physicians to order several dozen tests with a few checks in the appropriate boxes. When the measurements of screening tests are expressed on interval scales (as most blood tests are), and when normal is defined by the range covered by 95% of the results (as is usual), the more tests the clinician orders, the greater the risk of a false-positive result. In fact, as Table 10.4 shows, if the physician orders enough tests, “abnormalities” will be discovered in virtually all healthy patients. A spoof entitled “The Last Well Person” commented on this phenomenon (25).
Another reason that many people may
experience a false-positive screening test result
is that most screening tests are repeated at
regular intervals. With each repeat screen, the
patient is at risk for a false-positive result.
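Both mechanisms, ordering many different tests and repeating one test many times, follow the same arithmetic. The sketch below assumes that each test result is independent of the others and that each has the same false-positive rate; real tests are often correlated, so this is an approximation. The function name is illustrative. With a 5% rate per test (normal defined as the central 95% of results), it reproduces the percentages in Table 10.4.

```python
def prob_at_least_one_false_positive(per_test_rate, n_tests):
    """Chance of one or more false positives among n independent tests."""
    return 1 - (1 - per_test_rate) ** n_tests


# Different tests ordered at one visit, each with a 5% false-positive rate.
for n_tests in (1, 5, 20, 100):
    p = prob_at_least_one_false_positive(0.05, n_tests)
    print(f"{n_tests:>3} tests: {100 * p:.1f}%")

# The same test repeated: about 10% false positives per mammogram,
# over 10 annual screens (independence between rounds is assumed).
p_repeat = prob_at_least_one_false_positive(0.10, 10)
print(f"10 annual mammograms: {100 * p_repeat:.1f}%")
```

Under these assumptions, a modest false-positive rate per test becomes a majority experience after enough tests or enough screening rounds.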

Example
In a clinical trial of lung cancer screening with low-dose spiral CT … experiencing a false-positive had increased to 33% in the CT group and 15% in the chest x-ray group (26). In a related study, participants received screening tests for prostate, ovarian, and colorectal cancer as well as chest x-rays for lung cancer, for a total of 14 tests over 3 years. The cumulative risk of having at least one false-positive screening test result was 60.4% for men and 48.8% for women. The risk for undergoing an invasive diagnostic procedure prompted by a false-positive test result was 28.5% for men and 22.1% for women (27).

Risk of Negative Labeling Effect
Test results can sometimes have important psychological effects on patients, called a labeling effect. A good screening test result produces either no labeling effect or a positive labeling effect. A positive labeling effect may occur when a patient is told that all the screening test results were normal. Most clinicians have heard such responses as, “Great, that means I can keep working for another year.” On the other hand, being told that the screening test result is abnormal and more testing is necessary may have an adverse psychological effect, particularly in cancer screening. Some people with false-positive tests continue to worry even after being told everything was normal on follow-up tests. Because this is a group of people without disease, negative labeling effects are particularly worrisome ethically. In such situations, screening efforts might promote a sense of vulnerability instead of health and might do more harm than good.

Example
A study of men with abnormal PSA screening test results who subsequently were declared to be free of cancer after w… 1 year after screening, 26% reported worry about prostate cancer, compared to 6% among men with normal PSA result…

Labeling effects are sometimes unpredictable, especially among people who know they are at high risk for a genetic disease because of a family history. Studies of relatives of patients with Huntington disease (a neurological condition with onset in middle age causing mental deterioration leading to dementia, movement disorders, and death) have found psychological health after genetic testing did not deteriorate in those testing positive, perhaps because they were no longer dealing with uncertainty. Studies of women being tested for genetic mutations that increase their risk for breast and ovarian cancer have also found that women testing positive for the mutation experience little or no psychological deterioration.

Risk of Overdiagnosis (Pseudodisease) in Cancer Screening
The rationale for cancer screening is that the earlier a cancer is found, the better the chance of cure. Therefore, the thinking goes, it is always better to find cancer as early as possible. This thesis is being challenged by the observation that incidence often increases after the introduction of widespread screening for a particular cancer. A temporary increase in incidence is to be expected because screening moves the time of diagnosis forward, adding early cases to the usual number of prevalent cancers being diagnosed without screening, but the temporary bump in incidence should come down to the baseline level after a few years. With several cancers, however, incidence has remained at a higher level, as illustrated for prostate cancer in Figure 2.5 in Chapter 2. It is as if screening caused more cancers. How could this be?
Some cancers are so slow growing (some even regress) that they do not cause any trouble for the patient. If such cancers are found through screening, they are called pseudodisease; the process leading to their detection is called overdiagnosis because finding them does not help the patient. Overdiagnosis is an extreme example of length-time bias (Fig. 10.8). The cancers found have such a good prognosis that they would never become evident without screening technology. Some estimates are that as many as 50% of prostate cancers diagnosed by screening are due to overdiagnosis.
As research unravels the development of cancer, it appears that a sequence of genetic and other changes accompanies pathogenesis from normal tissue to malignant disease. At each step, only some lesions go on to the next stage of carcinogenesis. It is likely that overdiagnosis occurs because cancers early in the chain are being picked up by screening tests. The challenge is to differentiate those early cancers that will go on to cause morbidity and mortality from those that will lie dormant throughout life, even though pathologically they appear the same. Currently, screening technology is not able to do this.

[Figure 10.8: cancer size plotted against time from the first abnormal cell for fast, slow, very slow, and nonprogressive cancers, with thresholds marking the size at which cancer causes symptoms and the size at which cancer causes death, and a marker for death from other causes.]
Figure 10.8 ■ Mechanism of overdiagnosis in cancer screening. Note that nonprogressive, as well as some very slow-growing, cancers will never cause clinical harm. When these cancers are found on screening, overdiagnosis has occurred. Overdiagnosis is an extreme form of length-time bias. (Redrawn with permission from Welch HG. Should I Be Tested for Cancer? Maybe Not and Here’s Why. Berkeley, CA: University of California Press; 2004.)

To determine if and to what degree overdiagnosis occurs, it is necessary to compare a screened group with a similar unscreened group and determine incidence and disease-specific mortality rates (not survival rates). This can be done by long-term randomized trials or by careful population-based observational studies.

Example
Neuroblastoma, a tumor of neurologic tissue near the kidney, is the second most common tumor occurring in children. Prog… Two population-based studies were, therefore, undertaken in Germany (29) and Quebec (30), in which screening was offered for all infants in certain areas, while infants in other areas were not screened and acted as concurrent controls. Existing tumor registries were used to track all cases of and deaths due to neuroblastoma. In both studies, the incidence of neuroblastoma doubled in the screened group, but mortality rates from neuroblastoma over the subsequent 5 years were equivalent in the screened groups and the unscreened groups. It appeared that the screening test primarily detected tumors with a favorable prognosis, many of which would have regressed if left undetected. Meanwhile, highly invasive disease was often missed. Investigators of both studies concluded that screening infants for neuroblastoma leads to…
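The comparison described above, incidence and disease-specific mortality in a screened group versus a similar unscreened group, can be sketched as follows. All counts are hypothetical and are not taken from the German or Quebec studies; they are chosen only to mimic the pattern of doubled incidence with unchanged mortality. The helper name is also illustrative.

```python
def rate_per_100k(events, population):
    """Events per 100,000 people over the follow-up period."""
    return events / population * 100_000


# Hypothetical counts, labeled as such: not data from references 29 or 30.
screened = {"population": 500_000, "cases": 100, "deaths": 10}
unscreened = {"population": 500_000, "cases": 50, "deaths": 10}

incidence_screened = rate_per_100k(screened["cases"], screened["population"])
incidence_unscreened = rate_per_100k(unscreened["cases"], unscreened["population"])
mortality_screened = rate_per_100k(screened["deaths"], screened["population"])
mortality_unscreened = rate_per_100k(unscreened["deaths"], unscreened["population"])

incidence_ratio = incidence_screened / incidence_unscreened
mortality_ratio = mortality_screened / mortality_unscreened
excess_incidence = incidence_screened - incidence_unscreened

print(f"Incidence ratio (screened/unscreened): {incidence_ratio:.1f}")
print(f"Mortality ratio (screened/unscreened): {mortality_ratio:.1f}")
print(f"Excess diagnoses per 100,000: {excess_incidence:.0f}")
```

A persistent excess in incidence without any reduction in disease-specific mortality is the signature of overdiagnosis; survival among diagnosed cases would look better in the screened group even though no deaths were prevented.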

Overdiagnosis has been shown in randomized trials of screening for lung and breast cancer. It also can occur when detecting precancerous abnormalities in cervical, colorectal, and breast cancer screening, that is, with cervical dysplasia, adenomatous polyps, and ductal carcinoma in situ (abnormalities sometimes termed predisease). It is important to understand that overdiagnosis can coexist with effective screening and that although randomized trials and population studies can help determine the amount of overdiagnosis, it is impossible to identify it in an individual patient.

Incidentalomas
Over the past couple of decades, using CT as a screening test has become more common. CT has been evaluated rigorously as “virtual colonoscopy” for colorectal cancer screening and also for lung cancer screening. It has been advocated as a screening test for coronary heart disease (with calcium scores) and for screening in general with full-body CT scans. Unlike most screening tests, CT often visualizes much more than the targeted area of interest. For example, CT colonography visualizes the abdomen and lower thorax. In the process, abnormalities are sometimes detected outside the colon. Masses or lesions detected incidentally by an imaging examination are called incidentalomas.

Example
A systematic review of 17 studies found that incidentalomas were common in CT colonography; 40% of 3,488 patients h…

CHANGES IN SCREENING TESTS AND TREATMENTS OVER TIME
Careful evaluation of screening has been particularly difficult when tests approved for diagnostic purposes are then used as screening tests before evaluation in a screening study, a problem analogous to using therapeutic interventions for prevention without evaluating them in prevention. For example, CT scans and magnetic resonance imaging were developed for diagnostic purposes in patients with serious complaints or known disease and PSA was developed to determine whether treatment for prostate cancer was successful. All of these tests are now commonly used as screening tests, but most became common in practice without careful evaluation. Only low-dose CT scans for lung cancer screening underwent careful evaluation prior to widespread use. PSA screening became so common in the United States that when it was subjected to a careful randomized trial, more than half the men assigned to the control arm had a PSA test during the course of a trial. When tests are so commonly used, it is difficult to determine rigorously whether they are effective.
Over time, improvements in screening tests, treatments, and vaccinations may change the need for screening. As indicated earlier, effective secondary prevention is a two-step process; a good screening test followed by a good treatment for those found to have disease. Changes in either one may affect how well screening works in preventing disease. At one extreme, a highly accurate screening test will not help prevent adverse outcomes of a disease if there is no effective therapy. Screening tests for HIV preceded the development of effective HIV therapy; therefore, early in the history of HIV, screening could not prevent disease progression in people with HIV. With the development of increasingly effective treatments, screening for HIV has increased. At the other extreme, a highly effective treatment may make screening unnecessary. With modern therapy the 10-year survival rate of testicular cancer is about 85%, so high that it would be difficult to show improvement with screening for this rare cancer. As HPV vaccination for prevention of cervical cancer becomes more widespread and if it is able to cover all carcinogenic types of HPV, the need for cervical cancer screening should decrease over time. Some recent studies of mammography screening have not found the anticipated mortality benefits seen in earlier studies, partly because breast cancer mortality among women not screened was lower than in the past, probably due to improved treatments. Thus, with the introduction of new therapies and screening tests, effectiveness of screening will change and on-going re-evaluation is necessary.

WEIGHING BENEFITS AGAINST HARMS OF PREVENTION
How can the many aspects of prevention covered
in this chapter be combined to make a decision
whether to include a preventive intervention in
clinical practice?
Conceptually, the decision should be based on weighing the magnitude of benefits against the magnitude of harms that will occur as a result of the action. This approach has become common when making treatment decisions; reports of randomized trials routinely include harms as well as benefits.
A straightforward approach is to present the benefits and harms for a particular preventive activity in some orderly and understandable way. Whenever possible, these should be presented using absolute, not relative risks. Figure 10.9 summarizes the estimated key benefits and harms of annual mammography for women in their 40s, 50s, and 60s (7,32,33). Such an approach can help clinicians and patients understand what is involved when making the decision to screen. It can also help clarify why different individuals and expert groups come to different decisions about a preventive activity, even when looking at the same set of information. Different people put different values on benefits and harms (34).
Another approach to weighing benefits and harms is a modeling process that expresses both benefits and harms in a single metric and then subtracts harms from benefits. (The most common metric used is the quality adjusted life year [QALY].) The advantage of this approach is that different types of prevention (e.g., vaccinations, colorectal cancer screening, and tertiary treatment of diabetes) can all be compared to each other, which is important for policymakers with limited resources. The disadvantage is that for most clinicians and policymakers, it is difficult to understand the process by which benefits and harms are handled.
Regardless of the method used in weighing the benefits and harms of preventive activities, the quality of the evidence for each benefit and harm must be evaluated to prevent the problem of “garbage in, garbage out.” Several groups making recommendations for clinical prevention have developed explicit methods to evaluate the evidence and take into account the strength of evidence when making their recommendations (see Chapter 14).
If the benefits of a preventive activity outweigh the harms, the final step is to determine the economic effect of using it. Some commentators like to claim that “prevention saves money,” but it does so only rarely. (One possible exception is screening for colorectal cancer. Chemotherapy for the disease has become so expensive that some analyses now find screening for this cancer saves money.) Even so, most preventive services recommended by groups who have carefully evaluated the data are as cost-effective as other clinical activities.
Cost-effectiveness analysis is a method for assessing the costs and health benefits of an intervention. All costs related to disease occurrence and treatment should be counted, both with and without the preventive activity, as well as all costs related to the preventive activity itself. The health benefits of the activity are then calculated, and the incremental cost for each unit of benefit is determined.

Example
Cervical cancer is caused by persistent infection of epithelial cell… $58,500 per QALY. (Upper limits of acceptable cost-effectiveness…
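The incremental cost for each unit of benefit in a cost-effectiveness analysis can be written as a simple ratio. The sketch below is an illustration under stated assumptions: the function name is hypothetical, and the per-person costs and QALYs are invented numbers, not results from any study cited in this chapter.

```python
def incremental_cost_effectiveness_ratio(cost_with, cost_without,
                                         qaly_with, qaly_without):
    """Extra cost per extra QALY gained with the preventive activity."""
    extra_cost = cost_with - cost_without
    extra_qaly = qaly_with - qaly_without
    if extra_qaly <= 0:
        raise ValueError("The activity must add health benefit to compute a ratio.")
    return extra_cost / extra_qaly


# Hypothetical lifetime averages per person (assumed values):
# total costs include the preventive activity plus all disease-related care.
icer = incremental_cost_effectiveness_ratio(
    cost_with=1_200.0, cost_without=900.0,
    qaly_with=20.010, qaly_without=20.000,
)
print(f"Incremental cost per QALY gained: ${icer:,.0f}")
```

Note that costs are counted both with and without the activity, so savings from disease averted are already netted out before the ratio is taken.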

[Figure 10.9, panel A: bar chart of the number of women with ≥1 false-positive mammogram, ≥1 needle or open biopsy, and development of breast cancer, by years of age (40, 50, 60) at beginning of the 10-year period. Panel B: bar chart of the number of women with development of breast cancer, breast cancer cured by treatment regardless of screening, diagnosis of ductal carcinoma in situ because of mammography, and life saved by screening mammography, by years of age (40, 50, 60) at beginning of the 10-year period.]
Figure 10.9 ■ Weighing the benefits and harms when deciding about a preventive activity: comparing estimated benefits and harms of screening mammography. Chances among 1,000 women ages 40, 50, or 60 who undergo annual screening mammography for 10 years: (A) of experiencing a false-positive mammogram, undergoing a breast biopsy, and developing breast cancer, and (B) being cured of breast cancer regardless of screening, being diagnosed with non-invasive ductal carcinoma in situ, and averting death from breast cancer because of screening mammography. (Estimates calculated from references 7, 32, and 33.)

The effort to gather all the information needed to make a decision whether to conduct a preventive activity in clinical practice is not something a single clinician can accomplish, but when reviewing recommendations about prevention, individual clinicians can determine if the benefits and harms of the activity are presented in an understandable way and if the recommendation has taken into account the strength of the evidence. They can also look for estimates of cost-effectiveness. With these facts, they should be able to share with their patients the information they need. Patients can then make an informed decision about preventive activities that takes into account the scientific information and their individual values.

Review Questions
For questions 10.1–10.6, read the related scenarios and select the best answer.

A study was conducted to determine whether a fecal occult blood screening test reduced mortality from colorectal cancer (11). People ages 50 to 60 years were randomized to the screening test or to a control group and followed for 13 years. Over this time, there were 323 cancer cases and 82 colorectal cancer deaths in the 15,570 people randomized to annual screening; there were 356 cancers and 121 colorectal cancer deaths in the 15,394 people randomized to the control group. Investigations of positive tests found that about 30% of the screened group had colon polyps. The sensitivity and specificity of the test for colon cancer were both about 90%.

10.1. What is the relative risk reduction of colorectal mortality in the screened group?
A. 33%
B. 39%
C. 48%

10.2. How many patients would you need to screen over the next 13 years to prevent one death from colorectal cancer?
A. 43
B. 194
C. 385

10.3. The fact that 30% of the screened group had colon polyps suggests all of the following except:
A. At least 30% of the screened group was investigated for positive fecal occult blood tests.
B. The positive predictive value of the test was low.
C. The negative predictive value of the test was low.

In a randomized controlled trial of screening chest x-rays and sputum cytology for lung cancer, approximately 9,000 men were randomized to screening for 6 years or a control group (16). After 20 years, the lung cancer mortality was the same in both groups (4.4/1,000 person-years in the screened group and 3.9/1,000 person-years in the control group). However, the median survival for patients diagnosed with lung cancer was 1.3 years in the screened group and 0.9 years in the control group. Also, screening found more lung cancer: 206 cancers were diagnosed in the screened group and 160 in the control group.

10.4. What is the best conclusion after reading such a study?
A. Finding a better survival rate but not a change in the mortality rate of lung cancer makes no sense and the study must be flawed.
B. Because mortality did not change, screening may have resulted in more “disease time” for those diagnosed with lung cancer.
C. Improved survival demonstrates screening was effective in the study.

10.5. What bias is the best explanation for the improved survival in the face of no improvement in the mortality rate in this study?
A. Lead-time bias
B. Survival bias
C. Compliance bias
D. Length-time bias

10.6. What is the most likely reason for the fact that 206 lung cancers were found in the screened group and only 160 in the control group?
A. There were more smokers in the screened group.
B. Screening found cancers earlier and the number of cancers in the control group will catch up over time.
C. Screening picked up some cancers that would not have come to medical attention without screening.

For questions 10.7–10.11, choose the best answer.

10.7. When assessing a new vaccine, which of the following is least important:
A. Efficacy in preventing the disease
B. Safety of the vaccine
C. Danger of the disease
D. Cost of giving the vaccine

10.8. When the same test is used in diagnostic and screening situations, which of the following statements is most likely correct?
A. The sensitivity and specificity will likely be the same in both situations.
B. The positive predictive value will be higher in a screening situation.
C. Disease prevalence will be higher in the diagnostic situation.
D. Overdiagnosis is equally likely in both situations.

10.9. A study found that volunteers for a new screening test had better health outcomes than people who refused testing. Which of the following statements most likely explains the finding?
A. People who refuse screening are usually healthier than those who accept.
B. Volunteers for screening are more likely to need screening than those who refuse.
C. Volunteers tend to be more interested in their health than those who do not participate in preventive activities.

10.10. All of the following statements are correct except:
A. The gold standard for a test used for diagnosis may be different than that for the same test when used for screening.
B. The incidence method cannot be used to calculate sensitivity for cancer screening tests.
C. When a screening program is begun, more people with disease are found on the first round of screening than on later rounds.

10.11. For a cost-effectiveness analysis of a preventive activity, which kinds of costs should be included?
A. Medical costs, such as those associated with delivering the preventive intervention
B. All medical costs, including diagnostic follow up of positive tests and treatment for persons diagnosed with disease, with and without the preventive activity
C. Indirect costs, such as loss of income due to time off from work, among patients receiving the prevention and those who develop the disease
D. Indirect costs for both patients and care givers
E. All of the above

Answers are in Appendix A.

REFERENCES

1. Schappert SM, Rechtsteiner EA. Ambulatory medical care utilization estimates for 2007. National Center for Health Statistics. Vital Health Stat 2011;13(169). Available at http://www.cdc.gov/nchs/data/series/sr_13/sr13_169.pdf. Accessed January 11, 2012.
2. Prevention. 2011. In Merriam-webster.com. Available at http://www.merriam-webster.com/dictionary/prevention. Accessed January 13, 2012.
3. Siegel R, Ward E, Brawley O, et al. Cancer statistics, 2011. The impact of eliminating socioeconomic and racial disparities on premature cancer deaths. CA Cancer J Clin 2011;61:212–236.
4. National Center for Health Statistics. Health. United States. 2010; With special feature on death and dying. Hyattsville, MD. 2011.
5. Roe MT, Ohman EM. A new era in secondary prevention after acute coronary syndrome. N Engl J Med 2012;366:85–87.
6. Harris R, Sawaya GF, Moyer VA, et al. Reconsidering the criteria for evaluating proposed screening programs: reflections from 4 current and former members of the U.S. Preventive Services Task Force. Epidemiol Rev 2011;33:20–35.
7. Howlader N, Noone AM, Krapcho M, et al. (eds). SEER Cancer Statistics Review, 1975-2008, National Cancer Institute. Bethesda, MD. Available at http://seer.cancer.gov/csr/1975_2008/, based on November 2010 SEER data submission, posted to the SEER Web site, 2011. Accessed January 13, 2012.
8. Chang MH, You SL, Chen CJ, et al. Decreased incidence of hepatocellular carcinoma in hepatitis B vaccinees: a 20 year follow-up study. J Natl Cancer Inst 2009;101:1348–1355.
9. Greene SK, Rett M, Weintraub ES, et al. Risk of confirmed Guillain-Barré Syndrome following receipt of monovalent inactivated influenza A (H1N1) and seasonal influenza vaccines in the Vaccine Safety Datalink Project, 2009-2010. Am J Epidemiol 2012;175:1100–1109.
10. Fiore MC, Jaén CR, Baker TB, et al. Treating tobacco use and dependence: 2008 Update. Clinical Practice Guideline. Rockville, MD: U.S. Department of Health and Human Services. Public Health Service. May 2008.
11. Mandel JS, Bond JH, Church TR, et al. (for the Minnesota Colon Cancer Control Study). Reducing mortality from colorectal cancer by screening for fecal occult blood. N Engl J Med 1993;328:1365–1371.
12. Marcus PM, Bergstralh EJ, Fagerstrom RM, et al. Lung cancer mortality in the Mayo Lung Project: impact of extended follow-up. J Natl Cancer Inst 2000;92:1308–1316.
13. The National Lung Screening Trial Research Team. Reduced lung-cancer mortality with low-dose computed tomographic screening. N Engl J Med 2011;365:395–409.
14. Gæde P, Lund-Andersen H, Hans-Henrik P, et al. Effect of a multifactorial intervention on mortality in type 2 diabetes. N Engl J Med 2008;358:580–591.
15. Friedman GD, Collen MF, Fireman BH. Multiphasic health checkup evaluation: a 16-year follow-up. J Chron Dis 1986;39:453–463.
16. Avins AL, Pressman A, Ackerson L, et al. Placebo adherence and its association with morbidity and mortality in the studies of left ventricular dysfunction. J Gen Intern Med 2010;25:1275–1281.
17. Selby JV, Friedman GD, Quesenberry CP, et al. A case-control study of screening sigmoidoscopy and mortality from colorectal cancer. N Engl J Med 1992;326:653–657.
18. Gann PH, Hennekens CH, Stampfer MJ. A prospective evaluation of plasma prostate-specific antigen for detection of prostatic cancer. JAMA 1995;273:289–294.
19. Delongchamps NB, Singh A, Haas GP. The role of prevalence in the diagnosis of prostate cancer. Cancer Control 2006;13:158–168.
20. Carney PA, Miglioretti DL, Yankaskas BC, et al. Individual and combined effects of age, breast density, and hormone replacement therapy use on the accuracy of screening mammography. Ann Intern Med 2003;138:168–175.
21. Lansdorp-Vogelaar I, Knudsen AB, Brenner H. Cost-effectiveness of colorectal cancer screening. Epi Rev 2011;33:88–100.
22. Berrington de González A, Mahesh M, Kim K-P, et al. Radiation dose associated with common computed tomography examinations and the associated lifetime attributable risk of cancer. Arch Intern Med 2009;169:2071–2077.
23. D’Orsi CD, Shin-Ping Tu, Nakano C. Current realities of delivering mammography services in the community: do challenges with staffing and scheduling exist? Radiology 2005;235:391–395.
24. Buys SS, Partridge E, Black A, et al. Effect of screening on ovarian cancer mortality. The Prostate, Lung, Colorectal and Ovarian (PLCO) cancer screening randomized controlled trial. JAMA 2011;305:2295–2303.
25. Meador CK. The last well person. N Engl J Med 1994;330:440–441.
26. Croswell JM, Baker SG, Marcus PM, et al. Cumulative incidence of false-positive test results in lung cancer screening: a randomized trial. Ann Intern Med 2010;152:505–512.
27. Croswell JM, Kramer BS, Kreimer AR. Cumulative incidence of false-positive results in repeated multimodal cancer screening. Ann Fam Med 2009;7:212–222.
28. Fowler FJ, Barry MJ, Walker-Corkery BS. The impact of a suspicious prostate biopsy on patients’ psychological, socio-behavioral, and medical care outcomes. J Gen Intern Med 2006;21:715–721.
29. Schilling FH, Spix C, Berthold F, et al. Neuroblastoma screening at one year of age. N Engl J Med 2002;346:1047–1053.
30. Woods WG, Gao R, Shuster JJ, et al. Screening of infants and mortality due to neuroblastoma. N Engl J Med 2002;346:1041–1046.
31. Xiong T, Richardson M, Woodroffe R, et al. Incidental lesions found on CT colonography: their nature and frequency. Br J Radiol 2005;78:22–29.
32. Hubbard RA, Kerlikowske K, Flowers CI, et al. Cumulative probability of false-positive recall or biopsy recommendation after 10 years of screening mammography. Ann Intern Med 2011;155:481–492.
33. Mandelblatt JS, Cronin KA, Bailey S, et al. Effects of mammography screening under different screening schedules: model estimates of potential benefits and harms. Ann Intern Med 2009;151:738–747.
34. Gillman MW, Daniels SR. Is universal pediatric lipid screening justified? JAMA 2012;307:259–260.
35. Goldie SJ, Kohli M, Grima D. Projected clinical benefits and cost-effectiveness of a human papillomavirus 16/18 vaccine. J Natl Cancer Inst 2004;96:604–615.

Chapter 11

Chance
It is a common practice to judge a result significant, if it is of such a magnitude that it
would have been produced by chance not more frequently than once in twenty trials.
This is an arbitrary, but convenient, level of significance for the practical investigator, but
it does not mean that he allows himself to be deceived once in every twenty
experiments.
—Ronald Fisher
1929 (1)

KEY WORDS
Hypothesis testing
Estimation
Statistically significant
Type I (α) error
Type II (β) error
Inferential statistics
Statistical testing
P value
Statistical significance
Statistical tests
Null hypothesis
Non-parametric statistics
Two-tailed
One-tailed
Statistical power
Sample size
Point estimate
Statistical precision
Confidence interval
Multiple comparisons
Multivariable modeling
Bayesian reasoning

Learning from clinical experience, whether during formal research or in the course of patient care, is impeded by two processes: bias and chance. As discussed in Chapter 1, bias is systematic error, the result of any process that causes observations to differ systematically from the true values. Much of this book has been about where bias might lurk, how to avoid it when possible, and how to control for it and estimate its effects when bias is unavoidable.
On the other hand, random error, resulting from the play of chance, is inherent in all observations. It can be minimized but never avoided altogether. This source of error is called “random” because, on average, it is as likely to result in observed values being on one side of the true value as on the other.
Many of us tend to underestimate the importance of bias relative to chance when interpreting data, perhaps because statistics are quantitative and appear so definitive. We might say, in essence, “If the statistical conclusions are strong, a little bit of bias can’t do much harm.” However, when data are biased, no amount of statistical elegance can save the day. As one scholar put it, perhaps taking an extreme position, “A well designed, carefully executed study usually gives results that are obvious without a formal analysis and if there are substantial flaws in design or execution a formal analysis will not help” (2).
In this chapter, we discuss chance mainly in the context of controlled clinical trials because it is the simplest way of presenting the concepts. However, statistics are an element of all clinical research, whenever one makes inferences about populations based on information obtained from samples. There is always a possibility that the particular sample of patients in a study, even though selected in an unbiased way, might not be similar to the population of patients as a whole. Statistics help estimate how well observations on samples approximate the true situation.

TWO APPROACHES TO CHANCE


Two general approaches are used to assess the role of chance in clinical observations.

One approach, called hypothesis testing, asks whether an effect (difference) is present or is not by using statistical tests to examine the hypothesis (called the “null hypothesis”) that there is no difference. This traditional way of assessing the role of chance, associated with the familiar “P value,” has been popular since statistical testing was introduced at the beginning of the 20th century. The hypothesis testing approach leads to dichotomous conclusions: Either an effect is present or there is insufficient evidence to conclude an effect is present.

The other approach, called estimation, uses statistical methods to estimate the range of values that is likely to include the true value of a rate, measure of effect, or test performance. This approach has gained popularity recently and is now favored by most medical journals, at least for reporting main effects, for reasons described below.

HYPOTHESIS TESTING

In the usual situation, the principal conclusions of a trial are expressed in dichotomous terms, such as a new treatment is either better or not better than usual care, corresponding to the results being either statistically significant (unlikely to be purely by chance) or not. There are four ways in which the statistical conclusions might relate to reality (Fig. 11.1).

Two of the four possibilities lead to correct conclusions: (i) The new treatment really is better, and that is the conclusion of the study; and (ii) the treatments really have similar effects, and the study concludes that a difference is unlikely.

False-Positive and False-Negative Statistical Results

There are also two ways of being wrong. The new treatment and usual care may actually have similar effects, but it is concluded that the new treatment is more effective. Error of this kind, resulting in a “false-positive” conclusion that the treatment is effective, is referred to as a type I error or α error, the probability of saying that there is a difference in treatment effects when there is not. On the other hand, the new treatment might be more effective, but the study concludes that it is not. This “false-negative” conclusion is called a type II error or β error, the probability of saying that there is no difference in treatment effects when there is. “No difference” is a simplified way of saying that the true difference is unlikely to be larger than a certain size, which is considered too small to be of practical consequence. It is not possible to establish that there is no difference at all between two treatments.

Figure 11.1 is similar to 2 × 2 tables comparing the results of a diagnostic test to the true diagnosis (see Chapter 8). Here, the “test” is the conclusion of a clinical trial based on a statistical test of results from the trial’s sample of patients. The “gold standard” for validity is the true difference in the treatments being compared, if it could be established, for example, by making observations on all patients with the illness or a large number of samples of these patients. Type I error is analogous to a false-positive test result, and type II error is analogous to a false-negative test result. In the absence of bias, random variation is responsible for the uncertainty of a statistical conclusion.

Because random variation plays a part in all observations, it is an oversimplification to ask whether chance is responsible for the results. Rather, it is a question of how likely random variation is to account for the findings under the particular conditions of the study. The probability of error due to random variation is estimated by means of inferential statistics, a quantitative science that, given certain assumptions about the mathematical properties of the data, is the basis for calculations of the probability that the results could have occurred by chance alone.

Statistics is a specialized field with its own jargon (e.g., null hypothesis, variance, regression, power, and modeling) that is unfamiliar to many clinicians. However, leaving aside the genuine complexity of statistical methods, inferential statistics should be regarded by the non-expert as a useful means to an end. Statistical testing is a means by which the effects of random variation are estimated.

The next two sections discuss type I and type II errors and place hypothesis testing, as it is used to estimate the probabilities of these errors, in context.

                                      TRUE DIFFERENCE
                                      Present              Absent
CONCLUSION OF      Significant        Correct              Type I (α) error
STATISTICAL TEST   Not significant    Type II (β) error    Correct
Figure 11.1 ■ The relationship between the results of a statistical test and the true difference between two treatment groups. (Absent is a simplification. It really means that the true difference is not greater than a specified amount.)

Concluding That a Treatment Works

Most statistics encountered in the medical literature concern the likelihood of a type I error and are expressed by the familiar P value. The P value is a quantitative estimate of the probability that differences in treatment effects in the particular study at hand could have happened by chance alone, assuming that there is in fact no difference between the groups. Another way of expressing this is that P is an answer to the question, “If there were no difference between treatment effects and the trial was repeated many times, what proportion of the trials would conclude that the difference between the two treatments was at least as large as that found in the study?”

In this presentation, P values are called Pα, to distinguish them from estimates of the other kind of error resulting from random variation, type II errors, which are referred to as Pβ. When a simple P is found in the scientific literature, it ordinarily refers to Pα.

The kind of error estimated by Pα applies whenever one concludes that one treatment is more effective than another. If it is concluded that the Pα exceeds some limit (see below) so there is no statistical difference between treatments, then the particular value of Pα is not as relevant; in that situation, Pβ (probability of type II error) applies.

Dichotomous and Exact P Values

It has become customary to attach special significance to P values below 0.05 because it is generally agreed that a chance of 1 in 20 is a small enough risk of being wrong. A chance of 1 in 20 is so small, in fact, that it is reasonable to conclude that such an occurrence is unlikely to have arisen by chance alone. It could have arisen by chance, and 1 in 20 times it will, but it is unlikely.

Differences associated with P < 0.05 are called statistically significant. However, setting a cutoff point at 0.05 is entirely arbitrary. Reasonable people might accept higher values or insist on lower ones, depending on the consequences of a false-positive conclusion in a given situation. For example, one might be willing to accept a higher chance of a false-positive statistical test if the disease is severe, there is currently no effective treatment, and the new treatment is safe. On the other hand, one might be reluctant to accept a false-positive test if usual care is effective and the new treatment is dangerous or much more expensive. This reasoning is similar to that applied to the importance of false-positive and false-negative diagnostic tests (Chapter 8).

To accommodate various opinions about what is and is not unlikely enough, some researchers report the exact probabilities of P (e.g., 0.03, 0.07, 0.11), rather than lumping them into just two categories (≤0.05 or >0.05). Users are then free to apply their own preferences for what is statistically significant. However, P values >1 in 5 are usually reported as simply P > 0.20, because nearly everyone can agree that a probability of a type I error >1 in 5 is unacceptably high. Similarly, below very low P values (e.g., P < 0.001) chance is a very unlikely explanation for an observed difference, and little further information is conveyed by describing this chance more precisely.

Another approach is to accept the primacy of P ≤ 0.05 and describe results that come close to that standard with terms such as “almost statistically significant,” “did not achieve statistical significance,” “marginally significant,” or “a trend.” These value-laden terms suggest that the finding should have been statistically significant but for some annoying reason was not. It is better to simply state the result and exact P value (or point estimate and confidence interval, see below) and let the reader decide for him or herself how much chance could have accounted for the result.
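The "1 in 20" convention can be made concrete with a small simulation. The sketch below is illustrative only: the event rate, group size, and number of simulated trials are hypothetical values chosen for the example, and the two-proportion test uses a simple normal approximation rather than any method named in this chapter. It repeats a trial in which both groups truly have the same event rate and counts how often P falls below 0.05 by chance alone.

```python
import math
import random


def two_proportion_p_value(events_a, n_a, events_b, n_b):
    """Two-tailed P value for a difference between two proportions
    (normal approximation; an assumption made for this sketch)."""
    pooled = (events_a + events_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (events_a / n_a - events_b / n_b) / se
    # erfc(|z| / sqrt(2)) is the two-tailed area under the standard normal curve
    return math.erfc(abs(z) / math.sqrt(2))


def false_positive_rate(true_rate=0.3, n_per_group=200, trials=2000,
                        cutoff=0.05, seed=1):
    """Fraction of simulated trials declared 'significant' when the two
    treatments truly have the same event rate (hypothetical settings)."""
    rng = random.Random(seed)
    significant = 0
    for _ in range(trials):
        a = sum(rng.random() < true_rate for _ in range(n_per_group))
        b = sum(rng.random() < true_rate for _ in range(n_per_group))
        if two_proportion_p_value(a, n_per_group, b, n_per_group) < cutoff:
            significant += 1
    return significant / trials


if __name__ == "__main__":
    # Should be close to 0.05: about 1 in 20 null trials is a false positive.
    print(false_positive_rate())
```

Changing `cutoff` shows the trade-off described above: a stricter cutoff yields fewer false-positive conclusions under the same conditions.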

Statistical Significance and Clinical Importance

A statistically significant difference, no matter how small the P, does not mean that the difference is clinically important. A P value of <0.0001, if it emerges from a well-designed study, conveys a high degree of confidence that a difference really exists but says nothing about the magnitude of that difference or its clinical importance. In fact, trivial differences may be highly statistically significant if a large enough number of patients are studied.
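A brief sketch can illustrate how the same trivial difference moves from "not significant" to "highly significant" as the number of patients grows. The event rates (50.0% versus 49.0%) and group sizes below are hypothetical, and the test is a normal-approximation comparison of two proportions chosen for simplicity, not a method taken from this chapter.

```python
import math


def two_proportion_p_value(events_a, n_a, events_b, n_b):
    """Two-tailed P value for a difference between two proportions
    (normal approximation; an assumption made for this sketch)."""
    pooled = (events_a + events_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (events_a / n_a - events_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical trials with the same 1-percentage-point difference
# (50.0% vs. 49.0% event rates) but very different sizes.
p_small_trial = two_proportion_p_value(500, 1_000, 490, 1_000)
p_large_trial = two_proportion_p_value(50_000, 100_000, 49_000, 100_000)

if __name__ == "__main__":
    print(f"1,000 per group:   P = {p_small_trial:.3f}")
    print(f"100,000 per group: P = {p_large_trial:.2e}")
```

The size of the difference is identical in both hypothetical trials; only the P value changes, which is why the P value alone says nothing about clinical importance.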

Example

The drug donepezil, a cholinesterase inhibitor, was developed

donepezil and placebo groups. These included entering institutional care and progression of disability (both primary end points) as well as behavioral and psychological symptoms, caregiver psychopathology, formal care costs, unpaid caregiver time, and adverse events or death. The authors concluded that the benefits of donepezil were “below

On the other hand, very unimpressive P values can result from studies with strong treatment effects if there are few patients in the study.

Statistical Tests

Statistical tests are used to estimate the probability of a type I error. The test is applied to the data to obtain a numerical summary for those data called a test statistic. That number is then compared to a sampling distribution to come up with a probability of a type I error (Fig. 11.2). The distribution is under the null hypothesis, the proposition that there is no true difference in outcome between treatment groups. This device is for mathematical reasons, not because “no difference” is the working scientific hypothesis of the investigators conducting the study. One ends up rejecting the null hypothesis (concluding there is a difference) or failing to reject it (concluding that there is insufficient evidence in support of a difference). Note that not finding statistical significance is not the same as there being no difference. Statistical testing is not able to establish that there is no difference at all.

Some commonly used statistical tests are listed in Table 11.1. The validity of many tests depends on certain assumptions about the data; a typical assumption is that the data have a normal distribution. If the data do not satisfy these assumptions, the resulting P value may be misleading. Other statistical tests, called non-parametric tests, do not make assumptions about the underlying distribution of the data. A discussion of how these statistical tests are derived and calculated and of the assumptions on which they rest can be found in any biostatistics textbook.

Table 11.1
Some Statistical Tests Commonly Used in Clinical Research

Test                        When Used

To Test the Statistical Significance of a Difference
Chi square (χ²)             Between two or more proportions (when there are a large number of observations)
Fisher’s exact              Between two proportions (when there are a small number of observations)
Mann-Whitney U              Between two medians
Student t                   Between two means
F test                      Between two or more means

To Describe the Extent of Association
Regression coefficient      Between an independent (predictor) variable and a dependent (outcome) variable
Pearson’s r                 Between two variables

To Model the Effects of Multiple Variables
Logistic regression         With a dichotomous outcome
Cox proportional hazards    With a time-to-event outcome

The chi-square (χ²) test for nominal data (counts) is more easily understood than most and can be used to illustrate how statistical testing works. The extent to which the observed values depart from what would have been expected if there were no treatment effect is used to calculate a P value.

Example

Cardiac arrest outside the hospital has a poor outcome. Animal st

Figure 11.2 ■ Statistical testing. (Flow diagram: data → statistical test → test statistic → compare to standard distribution → estimate of probability that observed value could be by chance.)
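The steps in Figure 11.2 can be traced with the counts from the cardiac arrest example in this section (21 of 43 patients with a good outcome after hypothermia, 9 of 34 after usual care). The sketch below is a minimal illustration, not the authors' calculation: it computes expected counts from the row and column totals, sums (observed − expected)² / expected over the four cells, and converts the statistic to a P value using the chi-square distribution with 1 degree of freedom. No continuity correction is applied, which is an assumption of this sketch, and the function name is invented for the example.

```python
import math


def chi_square_2x2(table):
    """Chi-square test for a 2 x 2 table of counts, without continuity
    correction (an assumption of this sketch).

    Returns (expected counts, chi-square statistic, P value)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    # Expected count in each cell if treatment had no effect on outcome
    expected = [[r * c / total for c in col_totals] for r in row_totals]

    chi2 = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )

    # With 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return expected, chi2, p_value


# Rows: hypothermia, usual care. Columns: good outcome yes, no.
observed = [[21, 22], [9, 25]]
expected, chi2, p_value = chi_square_2x2(observed)

if __name__ == "__main__":
    print("Expected counts:", [[round(e, 2) for e in row] for row in expected])
    print(f"Chi-square = {chi2:.2f}, P = {p_value:.3f}")
```

Run on these counts, the statistic comes out close to 4.0 and the P value falls a little below 0.05, in line with the hand calculation in the example.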
randomized to cooling (hypothermia) or usual care (4). The primary outcome was survival to hospital discharge with relatively good neurologic function.

Observed Rates

                 Survival with Good Neurological Function
                 Yes      No       Total
Hypothermia      21       22       43
Usual care        9       25       34
Total            30       47       77

Success rates were 49% in the patients treated with hypothermia and 26% in the patients on usual care. How likely would it be for a study of this size to observe a difference in rates as great as this or greater if there was in fact no difference in effectiveness? That depends on how far the observed results depart from what would have been expected if the treatments were of similar effectiveness and only random variation accounted for the different rates. If treatment had no effect on outcome, applying the success rate for the patients as a whole (30/77 = 39%) to the number of patients in each treatment group gives the expected number of successes in each group:

Expected Rates

                 Success
                 Yes      No       Total
Hypothermia      16.75    26.25    43
Usual care       13.25    20.75    34
Total            30       47       77

The χ² statistic is the square of the differences between observed and expected divided by expected, summed over all four cells:

$$\chi^2 = \sum \frac{(\text{Observed number} - \text{Expected number})^2}{\text{Expected number}}$$

The magnitude of the χ² statistic is determined by how different all of the observed numbers are from what would have been expected if there were no treatment effect. Because they a

The χ² statistic for these data is:

$$\chi^2 = \frac{(21 - 16.75)^2}{16.75} + \frac{(9 - 13.25)^2}{13.25} + \frac{(22 - 26.25)^2}{26.25} + \frac{(25 - 20.75)^2}{20.75} = 4.0$$

intuitively obvious that the larger the χ², the less likely chance is to account for the observed differences. The resulting P value for a

When using statistical tests, the usual approach is to test for the probability that an intervention is either more or less effective than another to a statistically important extent. In this situation, testing is called two-tailed, referring to both tails of a bell-shaped curve describing the random variation in differences between treatment groups of equal value, where the two tails of the curve include statistically unlikely outcomes favoring one or the other treatment. Sometimes there are compelling reasons to believe that one treatment could only be better or worse than the other, in which case one-tailed testing is used, where all of the type I error (5%) is in one of the tails, making it easier to reach statistical significance.

Concluding That a Treatment Does Not Work

Some trials are unable to conclude that one treatment is better than the other. The risk of a false-negative result is particularly large in studies with relatively few patients or outcome events. The question then arises: How likely is a false-negative result (type II or β error)? Could the “negative” findings in such trials have misrepresented the truth because these particular studies had the bad luck to turn out in a relatively unlikely way?

Example

One of the examples in Chapter 9 was a randomized controlled trial of the effects on cardiovascular outcomes of adding

Visual presentation of negative results can be convincing. Alternatively, one can examine confidence intervals (see Point Estimates and Confidence Intervals, below) and learn a lot about whether the study was large enough to rule out clinically important differences if they existed.
Of course, reasons for false-negative results other than chance also need to be considered: biologic reasons such as too short follow-up or too small dose of niacin, as well as study limitations such as noncompliance and missed outcome events.

Type II errors have received less attention than type I errors for several reasons. They are more difficult to calculate. Also, most professionals simply prefer things that work and consider negative results unwelcome. Authors are less likely to submit negative studies to journals and when negative studies are reported at all, the authors may prefer to emphasize subgroups of patients in which treatment differences were found. Authors may also emphasize reasons other than chance to explain why true differences might have been missed. Whatever the reason for not considering the probability of a type II error, it is the main question that should be asked when the results of a study are interpreted as “no difference.”
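To give a feel for how large the probability of a type II error can be, the sketch below approximates it for a comparison of two event rates. Everything specific here is an assumption made for illustration: the true event rates (30% versus 20%), the group sizes, the two-tailed cutoff of 0.05, and the use of a standard normal-approximation formula for the power of a two-proportion test. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(rate_a, rate_b, n_per_group, alpha=0.05):
    """Approximate probability of a statistically significant result
    (two-tailed) when the true event rates are rate_a and rate_b.

    Uses the usual normal approximation; the small chance of reaching
    significance in the wrong direction is ignored."""
    normal = NormalDist()
    z_alpha = normal.inv_cdf(1 - alpha / 2)
    mean_rate = (rate_a + rate_b) / 2
    se_null = sqrt(2 * mean_rate * (1 - mean_rate) / n_per_group)
    se_true = sqrt(rate_a * (1 - rate_a) / n_per_group
                   + rate_b * (1 - rate_b) / n_per_group)
    return normal.cdf((abs(rate_a - rate_b) - z_alpha * se_null) / se_true)


if __name__ == "__main__":
    for n in (100, 500):
        power = power_two_proportions(0.30, 0.20, n)
        # The probability of a type II error is 1 minus this probability.
        print(f"{n} per group: chance of missing the difference = {1 - power:.2f}")
```

Under these hypothetical rates, a trial with 100 patients per group would miss a real difference more often than it would find it, while 500 per group would rarely miss it. This is the calculation a reader implicitly needs when a small trial reports "no difference."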

Figure 11.3 ■ Example of a “negative” trial. (Redrawn with permission from The AIM-HIGH Investigators. Niacin in patients with low HDL cholesterol levels receiving intensive statin therapy. N Engl J Med 2011;365:2255–2267.) (Curves: cumulative percent of patients with the primary outcome over 4 years for niacin plus statin and placebo plus statin.)

Number at risk
Years                   0        1        2        3      4
Niacin plus statin      1,718    1,606    1,366    903    428
Placebo plus statin     1,696    1,581    1,381    910    436

HOW MANY STUDY PATIENTS ARE ENOUGH?

Suppose you are reading about a clinical trial that compares a promising new treatment to usual care and finds no difference. You are aware that random variation can be the reason for whatever differences are or are not observed, and you wonder if the number of patients in this study is large enough to make chance an unlikely explanation for what was found. Alternatively, you may be planning to do such a study and have the same question. Either way, you need to understand how many patients would be needed to make a strong comparison of the effects of the two treatments.

Statistical Power

The probability that a study will find a statistically significant difference when a difference really exists is called the statistical power of the study. Power and Pβ are complementary ways of expressing the same concept.

Statistical power = 1 – Pβ

Power is analogous to the sensitivity of a diagnostic test. One speaks of a study being powerful when it has a high probability of detecting differences when treatments really do have different effects.

Estimating Sample Size Requirements

From the point of view of hypothesis testing of nominal data (counts), an adequate sample size depends on four characteristics of the study: the magnitude of the difference in outcome between treatment groups, Pα and Pβ (the probability of the false-positive and false-negative conclusions you are willing to accept), and the underlying outcome rate.

These determinants of adequate sample size should be taken into account when investigators plan a study, to ensure that the study will have enough statistical power to produce meaningful results. To the extent that investigators have not done this well, or some of their assumptions were found to be inaccurate, readers need to consider the same issues when interpreting study results.

Effect Size

Sample size depends on the magnitude of the difference to be detected. One is free to look for differences of any magnitude and of course one hopes to be able to detect even very small differences, but more patients are needed to detect small differences, everything else being equal. Therefore, it is best to ask, “What is a sufficient number of patients to detect the smallest degree of improvement that would be clinically meaningful?” On the other hand, if one is interested in detecting only very large differences between treated and control groups (i.e., strong treatment effects) then fewer patients need to be studied.

Type I Error

Sample size is also related to the risk of a type I error (concluding that treatment is effective when it is not). The acceptable probability for a risk of this kind is a value judgment. If one is prepared to accept the consequences of a large chance of falsely concluding that the treatment is effective, one can reach conclusions with fewer patients. On the other hand, if one wants to take only a small risk of being wrong in this way, a larger number of patients will be required. As discussed earlier, it is customary to set Pα at 0.05 (1 in 20) or sometimes 0.01 (1 in 100).

Type II Error

The chosen risk of a type II error is another determinant of sample size. An acceptable probability of this error is also a judgment that can be freely made and changed to suit individual tastes. Pβ is often set at 0.20, a 20% chance of missing true differences in a particular study. Conventional type II errors are much larger than type I errors, reflecting a higher value placed on being sure an effect is really present when it is said to be.

Characteristics of the Data

The statistical power of a study is also determined by the nature of the data. When the outcome is expressed by counts or proportions of events or time-to-event, its statistical power depends on the rate of events: The larger the number of events, the greater the statistical power for a given number of people at risk. As Peto et al. (6) put it:

In clinical trials of time to death (or of the time to some other particular “event”: relapse, metastasis, first thrombosis, stroke, recurrence, or time to death from a particular cause), the ability of the trial to distinguish between the merits of two treatments depends on how many patients die (or suffer a relevant event), rather than on the number of patients entered. A study of 100 patients, 50 of whom die, is about as sensitive as a study with 1,000 patients, 50 of whom die.

If the data are continuous, such as blood pressure or serum cholesterol, power is affected by the degree to which patients vary among themselves. The greater the variation from patient to patient with respect to the characteristic being measured, the more difficult it is to be confident that the observed differences (or lack of difference) between groups is not because of this variation, rather than a true treatment effect.

In designing a study, investigators choose the smallest treatment effect that is clinically important (larger treatment effects will be easier to detect) and the type I and type II errors they are willing to accept. They also obtain estimates of outcome event rates or variation among patients. It is possible to design studies that maximize power for a given sample size, such as by choosing patients with a high event rate or similar characteristics, as long as they match the research question.

Interrelationships

The relationships among the four variables that together determine an adequate sample size are summarized in Table 11.2. The variables can be traded off against one another. In general, for any given number of patients in the study, there is a trade-off between type I and type II errors. Everything else being equal, the more one is willing to accept one kind of error, the less it will be necessary to risk the other. Neither kind of error is inherently worse than the other. It is, of course, possible to reduce both type I and type II errors if the number of patients is increased, outcome events are more frequent, variability is decreased, or a larger treatment effect is sought.

Table 11.2
Determinants of Sample Size

                                    Determined by
                                    Investigator                 Data Type
                                                                 Means           Counts
Sample size varies according to:    1 / (Effect size, Pα, Pβ)    Variability  OR  1 / Outcome rate

For conventional levels of Pα and Pβ, the relationship between the size of treatment effect and the number of patients needed for a trial is illustrated by the following examples. One represents a situation in which a relatively small number of patients was sufficient, and the other is one in which a very large number of patients was too small.

Example

A Small Sample Size That Was Adequate
For many centuries, scurvy, a vitamin C deficiency syndrome ca

A Large Sample Size That Was Inadequate
Low serum vitamin D levels may be a risk factor for colorectal can
baseline serum vitamin D levels (8). Incident colon and rectal cancers were identified from the National Cancer Registry. After 12 years of follow-up, 239 colon cancers and 192 rectal cancers developed in the cohort. After adjustment for confounders, serum vitamin D levels were positively associated with colon cancer incidence and inversely associated to rectal cancer incidence, but neither of

For most of the therapeutic questions encountered today, a surprisingly large sample size is required. The value of dramatic, powerful treatments, such as antibiotics for pneumonia or thyroid replacement for hypothyroidism, was established by clinical experience or studying a small number of patients, but such treatments come along rarely and many of them are already well established. We are left with diseases, many of which are chronic and have multiple, interacting causes, for which the effects of new treatments are generally small. This makes it especially important to plan clinical studies that are large enough to distinguish real from chance effects.

Figure 11.4 shows the relationship between sample size and treatment difference for several baseline rates of outcome events. Studies involving fewer than 100 patients have a poor chance of detecting statistically significant differences for even large treatment effects. Looked at another way, it is difficult to detect effect sizes of 25%. In practice, statistical power can be estimated by means of readily available formulas, tables, nomograms, computer programs, or Web sites.

Figure 11.4 ■ The number of people required in each of two treatment groups (of equal size), for various rates of outcome events in the untreated group, to have an 80% chance of detecting a difference (P = 0.05) in reduction in outcome event rates in treated relative to untreated patients. (Calculated from formula in Weiss NS. Clinical epidemiology. The study of the outcome of illness. New York: Oxford University Press; 1986.) (Axes: number of people in each treatment group, 0 to 1,500, versus proportional reduction in event rate, 0% to 100%; curves shown for outcome event rates in the untreated group of 0.05, 0.20, and 0.50.)

POINT ESTIMATES AND CONFIDENCE INTERVALS

The effect size that is observed in a particular study (such as treatment effect in a clinical trial or relative risk in a cohort study) is called the point estimate of the effect. It is the best estimate from the study of the true effect size and is the summary statistic usually given the most emphasis in reports of research.

However, the true effect size is unlikely to be exactly that observed in the study. Because of random variation, any one study is likely to find a result higher or lower than the true value. Therefore, a summary measure is needed for the statistical precision of the point estimate, the range of values likely to encompass the true effect size.

Statistical precision is expressed as a confidence interval, usually the 95% confidence interval, around the point estimate. Confidence intervals are interpreted as follows: If the study is unbiased, there is a 95% chance that the interval includes the true effect size. The more narrow the confidence interval, the more certain one can be about the size of the true effect. The true value is most likely to be close to the point estimate, less likely to be near the outer limits of the interval, and could (5 times out of 100) fall outside these limits altogether. Statistical precision increases with the statistical power of the study.

Example

The Women’s Health Initiative included a randomized controll
intervals for four of these outcomes: stroke, hip fracture, breast cancer, and endometrial cancer. The four illustrate various possibilities for how confidence intervals are interpreted. Estrogen plus progestin was a risk factor for stroke; the best estimate of this risk is the point estimate, a relative risk of 1.41, but the data are consistent with a relative risk as low as 1.07 or as high as 1.85. Estrogen plus progestin protected against hip fracture, preventing as much as 65% and as little as 2% of fractures. That is, the data are consistent with very little benefit, although substantial benefit is likely and even larger benefits are consistent with the results. Although risk of breast cancer is likely to be increased, the data are consistent with no effect (the lower end of the confidence interval includes a relative risk of 1.0). Finally, the study is not very informative for endometrial cancer. Confidence intervals are very wide, so not only was there no clear risk or benefit, but also the estimate of risk was so imprecise that substantial risk or benefit remained possible.

Figure 11.5 ■ Example of confidence intervals. The relative risk and confidence intervals for outcomes in the Women’s Health Initiative: a randomized controlled trial of estrogen plus progestin in healthy postmenopausal women. (Data from Writing Group for the Women’s Health Initiative Investigators. Risks and benefits of estrogen plus progestin in healthy postmenopausal women. JAMA 2002;288:321–333.) (Outcomes plotted: stroke, hip fracture, breast cancer, and endometrial cancer; horizontal axis: relative risk, 0 to 2.)

Statistical significance at the 0.05 level can be obtained from 95% confidence intervals. If the point corresponding to no effect (i.e., a relative risk of 1 or a treatment difference of 0) falls outside the 95% confidence intervals for the observed effect, the results are statistically significant at the 0.05 level. If the confidence intervals include this point, the results are not statistically significant.

Confidence intervals have advantages over P values. They put the emphasis where it belongs, on the size of the effect. Confidence intervals help the reader to see the range of plausible values and so to decide whether an effect size they regard as clinically meaningful is consistent with or ruled out by the data (10). They also provide information about statistical power. If the confidence interval is relatively wide and barely includes the value corresponding to no effect, readers can see that low power might have been the reason for the negative result. On the other hand, if the confidence interval is narrow and includes no effect, a large effect is ruled out.

Point estimates and confidence intervals are used to characterize the statistical precision of any rate (incidence and prevalence), diagnostic test performance, comparisons of rates (relative and attributable risks), and other summary statistics. For example, studies have shown that 7.0% (95% confidence interval, 5.2–9.4) of adults have a clinically important family history of prostate cancer (11); that the sensitivity of a high-sensitivity cardiac troponin assay (at the optimal cutoff point) for acute coronary syndrome was 84.8% (95% confidence interval 82.8–86.6) (12); and that return to usual activity after inguinal hernia repair was shorter for laparoscopic than open surgery (hazard ratio 0.56, 95% confidence interval 0.51–0.61) (13).

Confidence intervals have become the usual way of reporting the main results of clinical research because of their many advantages over the hypothesis testing (P value) approach. P values are still used because of tradition and as a convenience when many results are reported and it would not be feasible to include confidence intervals for all of them.

Statistical Power after a Study Is Completed

Earlier in the chapter, we discussed how calculation of statistical power based on the hypothesis testing approach is performed before a study is undertaken to ensure that enough patients will be entered to have a good chance of detecting a clinically meaningful effect if one is present. However, after the study is completed, this approach is less relevant (14). There

[Figure 11.6 plot: probability of detecting an event (vertical axis, up to 1.0) against size of treatment group (horizontal axis, 1,000 to 1,000,000), with one curve for each risk: 1/100, 1/1,000, 1/10,000, and 1/100,000.]

Figure 11.6 ■ The probability of detecting one event according to the rate of the event and the number of people observed. (Redrawn with permission from Guess HA, Rudnick SA. Use of cost effectiveness analysis in planning cancer chemoprophylaxis trials. Control Clin Trials 1983;4:89–100.)
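
The relationship plotted in Figure 11.6 can be approximated with a simple probability argument. The following is a minimal sketch, assuming that events occur independently and with the same risk in every person observed (an assumption made here for illustration, not stated in the figure). Under that assumption, the probability of seeing at least one event among n people is 1 − (1 − risk)^n. The function name is hypothetical.

```python
def prob_at_least_one_event(risk: float, n_observed: int) -> float:
    """Probability of observing one or more events among n_observed people,
    assuming independent events with the same risk for each person."""
    return 1.0 - (1.0 - risk) ** n_observed


if __name__ == "__main__":
    # Rule of thumb from the text: to detect a 1/x event, observe 3x people.
    for x in (100, 1_000, 10_000, 100_000):
        p = prob_at_least_one_event(1 / x, 3 * x)
        print(f"risk 1/{x:,}, {3 * x:,} observed: P(at least one) = {p:.3f}")
```

Under these assumptions, observing 3x people gives roughly a 95% chance of seeing at least one 1/x event, which is one way to read "a good chance" in the rule of thumb.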

is no longer a need to estimate effect size, outcome event rates, and variability among patients because they are all known. Rather, attention should be directed to point estimates and confidence intervals. With them, one can see the range of values that are consistent with the results and whether the effect sizes of interest are within this range or are ruled out by the data. In the niacin study, summarized earlier as an example of a negative trial, the hazard ratio was 1.02 and the 95% confidence interval was 0.87 to 1.21, meaning that the results were consistent with a small degree of benefit or harm. Whether this matters depends on the clinical importance attached to a difference in rates as large as represented by this confidence interval.

DETECTING RARE EVENTS
It is sometimes important to know how likely a study is to detect a relatively uncommon event (e.g., 1/1,000), particularly if that event is severe, such as bone marrow failure or life-threatening arrhythmia. A great many people must be observed in order to have a good chance of detecting even one such event, much less to establish a relatively stable estimate of its frequency. For most clinical research, sample size is planned to be sufficient to detect main effects, the answer sought for the primary research question. Sample size is likely to be well short of the number needed to detect rare events such as uncommon side effects and complications. For that, a different approach, involving many more patients, is needed. An example is postmarketing surveillance of a drug, in which thousands of users are monitored for side effects.

Figure 11.6 shows the probability of detecting an event as a function of the number of people under observation. A rule of thumb is: To have a good chance of detecting a 1/x event, one must observe 3x people (15). For example, to detect at least one event if the underlying rate is 1/1,000, one would need to observe 3,000 people.

MULTIPLE COMPARISONS
The statistical conclusions of research have an aura of authority that defies challenge, particularly by non-experts. However, as many skeptics have suspected, it is possible to “lie with statistics” even if the research is well designed, the mathematics flawless, and the investigators’ intentions beyond reproach.

Statistical conclusions can be misleading because the strength of statistical tests depends on the number of research questions considered in the study and when those questions were asked. If many comparisons are made among the variables in a large set of data, the P value associated with each individual comparison is an underestimate of how often the result of

that comparison, among the others, is likely to arise by chance. As implausible as it might seem, the interpretation of the P value from a single statistical test depends on the context in which it is done.

To understand how this might happen, consider the following example.

Example
Suppose a large study has been done in which there are multiple subgroups of patients and many different outcomes. For in

This phenomenon is referred to as the multiple comparisons problem. Because of this problem, the strength of evidence from clinical research depends on how focused its questions were at the outset.

Unfortunately, when the results of research are presented, it is not always possible to know how many comparisons really were made. Sometimes, interesting findings have been selected from a larger number of uninteresting ones that are not mentioned. This process of deciding after the fact what is and is not important about a mass of data can introduce considerable distortion of reality. Table 11.3 summarizes how this misleading situation can arise.

Table 11.3
How Multiple Comparisons Can Be Misleading
1. Make multiple comparisons within a study.
2. Apply tests of statistical significance to each comparison.
3. Find a few comparisons that are “interesting” (statistically significant).
4. Build an article around one of these interesting findings.
5. Do not mention the context of the individual comparison (how many questions were examined and which was considered primary before the data was examined).
6. Construct a post hoc argument for the plausibility of the isolated finding.

How can the statistical effects of multiple comparisons be taken into account when interpreting research? Although ways of adjusting P values have been proposed, probably the best advice is to be aware of the problem and to be cautious about accepting positive conclusions of studies in which multiple comparisons were made. As put by Armitage (16):
If you dredge the data sufficiently deeply and sufficiently often, you will find something odd. Many of these bizarre findings will be due to chance. I do not imply that data dredging is not an occupation for honorable persons, but rather that discoveries that were not initially postulated as among the major objectives of the trial should be treated with extreme caution.
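
The size of the multiple comparisons problem can be illustrated with a short calculation. This is a minimal sketch, assuming that the comparisons are independent of one another and that no true effects exist (both simplifying assumptions made here). It also shows one of the proposed adjustments, dividing the usual P value cutoff by the number of comparisons, commonly called the Bonferroni correction (a name not used in the text). Function names are hypothetical.

```python
def chance_of_any_false_positive(n_comparisons: int, alpha: float = 0.05) -> float:
    """Probability that at least one of n independent comparisons is
    'statistically significant' by chance alone when no effect exists."""
    return 1.0 - (1.0 - alpha) ** n_comparisons


def bonferroni_threshold(n_comparisons: int, alpha: float = 0.05) -> float:
    """Per-comparison P value cutoff after dividing alpha by the number
    of comparisons."""
    return alpha / n_comparisons


if __name__ == "__main__":
    for n in (1, 5, 20, 100):
        any_fp = chance_of_any_false_positive(n)
        cutoff = bonferroni_threshold(n)
        print(f"{n:>3} comparisons: chance of any false positive = {any_fp:.2f}, "
              f"adjusted cutoff = {cutoff:.4f}")
```

With 20 independent comparisons at the 0.05 level, the chance of at least one false-positive result is about 64%, which is why a single "significant" finding selected from many comparisons deserves caution.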

A special case of multiple comparisons occurs when data in a clinical trial are examined repeatedly as they accrue, to assure that the trial is stopped as soon as there is an answer, regardless of how long the trial was planned to run. If this is done, as it often is for ethical reasons, the final P value is usually adjusted for the number of looks at the data. There is a statistical incentive to keep the number of looks to a minimum. In any case, if the accruing data are examined repeatedly, it will be more difficult to reach statistical significance after the multiple looks at the data are taken into account.

Another special case is genome-wide association studies, in which more than 500,000 single-nucleotide polymorphisms may be examined in cases and controls (17). A common way to manage multiple comparisons, to divide the usual P value of 0.05 by the number of comparisons, would require a genome-wide level of statistical significance of 0.0000001 (10⁻⁷), which would be difficult to achieve because sample sizes in these studies are constrained and relative risks are typically small. Because of this, belief in results of genome-wide association studies relies on the consistency and strength of associations across many studies.

SUBGROUP ANALYSIS
It is tempting to go beyond the main results of a study to examine results within subgroups of patients with characteristics that might be related to treatment effect (i.e., to look for effect modification, as discussed in Chapter 5). Because characteristics present at the time of randomization are randomly allocated into treatment groups, the consequence of subgroup analysis is to break the trial as a whole into a set of smaller randomized controlled trials, each with a smaller sample size.

Example
Atrial fibrillation is treated with anticoagulants, vitamin K antagonists in high-risk patients, to prevent stroke. Investi
statistically significant. This analysis suggests that effect modification was not present, but this conclusion was limited by low statistical precision in subgroups. Multiple comparisons, leading to a false-positive finding in subgroups, would have been an issue if there had been a statistically significant effect in one or

Subgroup analyses tell clinicians about effect modification so they can tailor their care of individual patients as closely as possible to study results in patients like them. However, subgroup analyses incur risks of misleading results because of the increased chance of finding effects in a particular subgroup that are not present, in the long run, in nature, that is, finding false-positive results because of multiple comparisons.

In practice, the effects of multiple comparisons may not be as extreme when treatment effects in the various subgroups are not independent of each other. To the extent that the variables are related to each other, rather than independent, the risk of false-positive findings is lessened. In the atrial fibrillation and anticoagulant example, age and prior stroke are components of the CHADS2 score (a metric for risk of stroke), but the three are treated as separate subgroups.
Another danger is coming to a false-negative conclusion. Within subgroups defined by certain kinds of patients or specific kinds of outcomes, there are fewer patients than for the study as a whole, often too few to rule out false-negative findings. Studies are, after all, designed to have enough patients to answer the main research question with sufficient statistical power. They are ordinarily not designed to have sufficient statistical power in subgroups, where the number of patients and outcome events is smaller. Guidelines for deciding whether a finding in a subgroup is real are summarized in Table 11.4.

Multiple Outcomes
Another version of multiple looks at the data is to report multiple outcomes: different manifestations of effectiveness, intermediate outcomes, and harms. Usually this is handled by naming one of the outcomes primary and the others secondary and then being more guarded about conclusions for the secondary outcomes. As with subgroups, outcomes

Characteristic            No. of patients   Aspirin       Apixaban
                                            no. of events (%/yr)

Overall                   5,599             113 (3.7)     51 (1.6)

Age
  <65 yr                  1,714             19 (2.0)      7 (0.7)
  65 to <75 yr            1,987             28 (2.7)      24 (2.0)
  ≥75 yr                  1,897             66 (6.1)      20 (2.0)

Sex
  Female                  2,321             64 (4.9)      25 (1.9)
  Male                    3,277             49 (2.7)      26 (1.4)

Estimated GFR
  <50 mL/min              1,198             36 (5.8)      16 (2.5)
  50 to <80 mL/min        2,374             59 (4.5)      22 (1.7)
  ≥80 mL/min              2,021             18 (1.6)      13 (1.1)

CHADS2 score
  0–1                     2,026             18 (1.6)      10 (0.9)
  2                       1,999             40 (3.7)      25 (2.1)
  ≥3                      1,570             55 (6.3)      16 (1.9)

Prior stroke or TIA
  No                      4,835             80 (3.0)      41 (1.5)
  Yes                     764               33 (8.3)      10 (2.5)

Study aspirin dose
  <162 mg daily           3,602             85 (4.3)      39 (1.9)
  ≥162 mg daily           1,978             27 (2.4)      12 (1.1)

Previous VKA use
  Yes                     2,216             52 (4.2)      17 (1.4)
  No                      3,383             61 (3.3)      34 (1.8)

Patient refused VKA
  No                      3,506             73 (3.8)      35 (1.8)
  Yes                     2,092             40 (3.4)      16 (1.45)

Heart failure
  No                      3,428             66 (3.6)      28 (1.5)
  Yes                     2,171             45 (3.8)      23 (1.8)

[Forest plot column: hazard ratio with apixaban (95% CI) for each row, on an axis marked 0.05, 0.25, 1.00, and 4.00. Values to the left of 1.00 indicate apixaban better; values to the right indicate aspirin better.]
Figure 11.7 ■ A subgroup analysis from a randomized controlled trial of the effectiveness of apixaban
versus aspirin on stroke and systemic embolism in patients with atrial fibrillation. GFR (glomerular filtration
rate) is a measure of kidney function. CHADS2 score is a prediction rule for the risk of embolism in patients with atrial
fibrillation. (Redrawn with permission from Connolly SJ, Eikelboom J, Joyner C, et al. Apixaban in patients with atrial
fibrillation. N Engl J Med 2011;364:806–817.)

tend to be related to each other biologically (and as a consequence statistically), as is the case in the above example where stroke and systemic embolism are different manifestations of the same clinical phenomenon.

Table 11.4
Guidelines for Deciding Whether Apparent Differences in Effects within Subgroups Are Realᵃ

From the study itself:
• Is the magnitude of the observed difference clinically important?
• How likely is the effect to have arisen by chance, taking into account:
  the number of subgroups examined?
  the magnitude of the P value?
• Was a hypothesis that the effect would be observed made before its discovery (or was justification for the effect argued for after it was found)?
• Was it one of a small number of hypotheses?

From other information:
• Was the difference suggested by comparisons within rather than between studies?
• Has the effect been observed in other studies?
• Is there direct evidence that supports the existence of the effect?

ᵃ Adapted from Oxman AD, Guyatt GH. A consumer’s guide to subgroup analysis. Ann Intern Med 1992;116:78–84.

MULTIVARIABLE METHODS
Most clinical phenomena are the result of many variables acting together in complex ways. For example, coronary heart disease is the joint result of lipid abnormalities, hypertension, cigarette smoking, family history, diabetes, diet, exercise, inflammation, coagulation abnormalities, and perhaps personality. It is appropriate to try to understand these relationships by first examining relatively simple arrangements of the data, such as stratified analyses that show whether the effect of one variable is changed by the presence or absence of one or more of the other variables. It is relatively easy to understand the data when they are displayed in this way.

However, as mentioned in Chapter 7, it is usually not possible to account for more than a few variables using this method because there are not enough patients with each combination of characteristics to allow stable estimates of rates. For example, if 120 patients were studied, 60 in each treatment group, and just one additional dichotomous variable was taken into account, there would only be, at most, about 30 patients in each subgroup; if patients were unevenly distributed among subgroups, there would be even fewer in some.

What is needed then, in addition to tables showing multiple subgroups, is a way of examining the effects of several variables together. This is accomplished by multivariable modeling: developing a mathematical expression of the effects of many variables taken together. It is “multivariable” because it examines the effects of multiple variables simultaneously. It is “modeling” because it is a mathematical construct, calculated from the data based on assumptions about characteristics of the data (e.g., that the variables are all normally distributed or all have the same variance).

Mathematical models are used in two general ways in clinical research. One way is to study the independent effect of one variable on outcome while taking into account the effects of other variables that might confound or modify this relationship (discussed under multivariable adjustment in Chapter 5). The second way is to predict a clinical event by calculating the combined effect of several variables acting together (introduced in concept under Clinical Prediction Rules in Chapter 7).

The basic structure of a multivariable model is:

Outcome variable = constant + (β1 × variable1) + (β2 × variable2) + . . .,

where β1, β2, . . . are coefficients determined by the data, and variable1, variable2, . . . are the variables that might be related to outcome. The best estimates of the coefficients are determined mathematically and depend on the powerful calculating ability of modern computers.

Modeling is done in many different ways, but some elements of the process are basic.
1. Identify all the variables that might be related to the outcome of interest either as confounders or effect modifiers. As a practical matter, it may not be possible to actually measure all of them and the missing variables should be mentioned explicitly as a limitation.
2. If there are relatively few outcome events, the number of variables to be considered in the model might need to be reduced to a manageable size, usually no more than several. Often this is done by selecting variables that, when taken one at a time, are most strongly related to outcome. If a statistical criterion is used at this stage, it is usual to err on the side of including variables, for example, by choosing all variables showing an association

with the outcome of interest at a cutoff level of P < 0.10. Evidence for the biologic importance of the variable is also considered in making the selection.
3. Models, like other statistical tests, are based on assumptions about the structure of the data. Investigators need to check whether these assumptions are met in their particular data.
4. As for the actual models, there are many kinds and many strategies that can be followed within models. All variables (exposure, outcome, and covariates) are entered in the model, with the order determined by the research question. For example, if some are to be controlled for in a causal analysis, they are entered in the model first, followed by the variable of primary interest. The model will then identify the independent effect of the variable of primary interest. On the other hand, if the investigator wants to make a prediction based on several variables, the relative strength of their association to the outcome variable is determined by the model.

Example
Gastric cancer is the second leading cause of cancer death in the world. Investigators in Europe analyzed data from a cohor

Some commonly used kinds of models are logistic regression (for dichotomous outcome variables such as those that occur in case-control studies) and Cox proportional hazards models (for time-to-event studies).

Multivariable modeling is an essential strategy for dealing with the joint effects of multiple variables. There is no other way to adjust for or to include many variables at the same time. However, this advantage comes at a price. Models tend to be black boxes, and it is difficult to “get inside” them and understand how they work. Their validity is based on assumptions about the data that may not be met. They are clumsy at recognizing effect modification. An exposure variable may be strongly related to outcome yet not appear in the model because it occurs rarely, and there is little direct information on the statistical power of the model for that variable. Finally, model results are easily affected by quirks in the data, the results of random variation in the characteristics of patients from sample to sample. It has been shown, for example, that a model frequently identified a different set of predictor variables and produced a different ordering of variables on different random samples of the same dataset (20).
For these reasons, the models themselves cannot be taken as a standard of validity and must be validated independently. Usually, this is done by observing whether or not the results of a model predict what is found in another, independent sample of patients. The results of the first model are considered a hypothesis that is to be tested with new data. If random variation is mainly responsible for the results of the first model, it is unlikely that the same random effects will occur in the validating dataset, too. Other evidence for the validity of a model is its biologic plausibility and its consistency with simpler, more transparent analyses of the data, such as stratified analyses.
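
The basic structure of a multivariable model given earlier in this section (a constant plus each coefficient multiplied by its variable) can be sketched in a few lines. This is only an illustration under stated assumptions: the coefficients and variable names below are invented, not estimated from any data, and the logistic transformation shown is the standard way a logistic regression model converts the summed terms into a probability for a dichotomous outcome.

```python
import math


def linear_predictor(constant, coefficients, values):
    """Return constant + sum of (coefficient x variable) over all variables."""
    return constant + sum(
        coefficients[name] * values[name] for name in coefficients
    )


def logistic_probability(predictor):
    """Convert a linear predictor (log odds) into a probability."""
    return 1.0 / (1.0 + math.exp(-predictor))


if __name__ == "__main__":
    # Hypothetical coefficients, invented for illustration only.
    constant = -2.0
    coefficients = {"smoker": 0.7, "age_decades": 0.3}
    patient = {"smoker": 1, "age_decades": 6}

    log_odds = linear_predictor(constant, coefficients, patient)
    print(f"log odds = {log_odds:.2f}")
    print(f"predicted probability = {logistic_probability(log_odds):.2f}")
```

In real use the coefficients would be estimated from the data by statistical software, and the resulting model would still need the independent validation described above.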

BAYESIAN REASONING
An altogether different approach to the information contributed by a study is based on Bayesian inference. We introduced this approach in Chapter 8 where we applied it to the specific case of diagnostic testing.

Bayesian inference begins with prior belief about the answer to a research question, analogous to pretest probability of a diagnostic test. Prior belief is based on everything known about the answer up to the point when new information is contributed by a study. Then, Bayesian inference asks how much the results of the new study change that belief.

Some aspects of Bayesian inference are compelling. Individual studies do not take place in an information vacuum; rather, they are in the context of all other information available at the time. Starting each study from the null hypothesis (that there is no effect) is unrealistic because something is already known about the answer to the question before the study is even begun. Moreover, results of individual studies change belief in relation to both their scientific strengths and the direction and magnitude of their results. For example, if all preceding studies were negative and the next one, which is of comparable strength, is found to be positive, then an effect is still unlikely. On the other hand, a weak prior belief might be reversed by a single strong study. Finally, with this approach it is not so important whether a small number of hypotheses are identified beforehand and multiple comparisons are not as worrisome. Rather, prior belief depends on the plausibility of the assertion rather than whether the assertion was established before or after the study was begun.

Although Bayesian inference is appealing, so far it has been difficult to apply because of poorly developed ways of assigning numbers to prior belief and to the information contributed by a study. Two exceptions are in cumulative summaries of research evidence (Chapter 13) and in diagnostic testing, in which “belief” is prior probability and the new information is expressed as a likelihood ratio. However, Bayesian inference is the conceptual basis for qualitative thinking about cause (see Chapter 12).
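
For the diagnostic testing case, where belief is a prior probability and new information is a likelihood ratio, the updating step can be written out directly. This is a minimal sketch of the standard odds form of Bayes' theorem; the function name and the example numbers are hypothetical.

```python
def posterior_probability(prior_probability: float, likelihood_ratio: float) -> float:
    """Update a prior probability with a likelihood ratio.

    Steps: convert probability to odds, multiply by the likelihood ratio,
    then convert the resulting odds back to a probability.
    """
    prior_odds = prior_probability / (1.0 - prior_probability)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


if __name__ == "__main__":
    # Hypothetical values: a 20% prior belief updated by evidence with a
    # likelihood ratio of 4, then by evidence with a likelihood ratio of 0.25.
    for lr in (4.0, 0.25):
        print(f"prior 0.20, LR {lr}: posterior = {posterior_probability(0.20, lr):.2f}")
```

The same prior can be moved up or down depending on the strength and direction of the new information, which mirrors how the text describes a study's results changing belief.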

Review Questions
Read the following and select the best response.

11.1. A randomized controlled trial of thrombolytic therapy versus angioplasty for acute myocardial infarction finds no difference in the main outcome, survival to discharge from hospital. The investigators explored whether this was also true for subgroups of patients defined by age, number of vessels affected, ejection fraction, comorbidity, and other patient characteristics. Which of the following is not true about this subgroup analysis?
A. Examining subgroups increases the chance of a false-positive (misleading statistically significant) result in one of the comparisons.
B. Examining subgroups increases the chance of a false-negative finding in one of these subgroups, relative to the main result.
C. Subgroup analyses are bad scientific practice and should not be done.
D. Reporting results in subgroups helps clinicians tailor information in the study to individual patients.

11.2. A new drug for hyperlipidemia was compared with placebo in a randomized controlled trial of 10,000 patients. After 2 years, serum cholesterol (the primary outcome) was 238 mg/dL in the group receiving the new drug and 240 mg/dL in the group receiving the old drug (P = 0.001). Which of the following best describes the meaning of the P value in this study?
A. Bias is unlikely to account for the observed difference.
B. The difference is clinically important.
C. A difference as big or bigger than what was observed could have arisen by chance one time in 1,000.
D. The results are generalizable to other patients with hypertension.
E. The statistical power of this study was inadequate.

11.3. In a well-designed clinical trial of treatment for ovarian cancer, remission rate at 1 year is 30% in patients offered a new drug and 20% in those offered a placebo. The P value is 0.4. Which of the following best describes the interpretation of this result?
A. Both treatments are effective.
B. Neither treatment is effective.
C. The statistical power of this study is 60%.
D. The best estimate of treatment effect size is 0.4.
E. There is insufficient information to decide whether one treatment is better than the other.

11.4. In a cohort study, vitamin A intake was found to be a risk factor for hip fracture in women. The relative risk (highest quintile versus lowest quintile) was 1.48, and the 95% confidence interval was 1.05 to 2.07. Which of the following best describes the meaning of this confidence interval?
A. The association is not statistically significant at the P < 0.05 level.
B. A strong association between vitamin A intake and hip fracture was established.
C. The statistical power of this study is 95%.
D. There is a 95% chance that a range of relative risks as low as 1.05 and as high as 2.07 includes the true risk.
E. Bias is an unlikely explanation for this result.

11.5. Which of the following is the best reason for calling P < 0.05 “statistically significant?”
A. It definitively rules out a false-positive conclusion.
B. It is an arbitrarily chosen but useful rule of thumb.
C. It rules out a type II error.
D. It is a way of establishing a clinically important effect size.
E. Larger or smaller P values do not provide useful information.

11.6. Which of the following is the biggest advantage of multivariable modeling?
A. Models can control for many variables simultaneously.
B. Models do not depend on assumptions about the data.
C. There is a standardized and reproducible approach to modeling.
D. Models make stratified analyses unnecessary.
E. Models can control for confounding in large randomized controlled trials.

11.7. A trial randomizes 10,000 patients to two treatment groups of similar size, one offered chemoprevention and the other usual care. How frequently must a side effect of chemoprevention occur for the study to have a good chance of observing at least one such side effect?
A. 1/5,000
B. 1/10,000
C. 1/15,000
D. 1/20,000
E. 1/30,000

11.8. Which of the following is least related to the statistical power of a study with a dichotomous outcome?
A. Effect size
B. Type I error
C. Rate of outcome events in the control group
D. Type II error
E. The statistical test used

11.9. Which of the following best characterizes the application of Bayesian reasoning to a clinical trial?
A. Prior belief in the comparative effectiveness of treatment is guided by equipoise.
B. The results of each new study change belief in treatment effect from what it was before the study.
C. Bayesian inference is an alternative way of calculating a P value.
D. Bayesian reasoning is based, like inferential statistics, on the null hypothesis.
E. Bayesian reasoning depends on a well-defined hypothesis before the study is begun.

11.10. In a randomized trial of intensive glucose lowering in type 2 diabetes, death rate was higher in the intensively treated patients: hazard ratio 1.22 (95% confidence interval 1.01–1.46). Which of the following is not true about this study?
A. The results are consistent with almost no effect.
B. The best estimate of treatment effect is a hazard ratio of 1.22.
C. If a P value were calculated, the results would be statistically significant at the 0.05 level.
D. A P value would provide as much information as the confidence interval.
E. The results are consistent with 46% higher death rates in the intensively treated patients.

Answers are in Appendix A.

REFERENCES
1. Fisher R in Proceedings of the Society for Psychical Research, 1929, quoted in Salsburg D. The Lady Tasting Tea. New York: Henry Holt and Co; 2001.
2. Johnson AF. Beneath the technological fix: outliers and probability statements. J Chronic Dis 1985;38:957–961.
3. Courtney C, Farrell D, Gray R, et al for the AD2000 Collaborative Group. Long-term donepezil treatment in 565 patients with Alzheimer’s disease (AD2000): randomized double-blind trial. Lancet 2004;363:2105–2115.
4. Bernard SA, Gray TW, Buist MD, et al. Treatment of comatose survivors of out-of-hospital cardiac arrest with induced hypothermia. N Engl J Med 2002;346:557–563.
5. The AIM-HIGH Investigators. Niacin in patients with low HDL cholesterol levels receiving intensive statin therapy. N Engl J Med 2011;365:2255–2267.
6. Peto R, Pike MC, Armitage P, et al. Design and analysis of randomized clinical trials requiring prolonged observation of each patient. I. Introduction and design. Br J Cancer 1976;34:585–612.
7. Lind J. A treatise on scurvy. Edinburgh; Sands, Murray and Cochran, 1753 quoted by Thomas DP. J Royal Society Med 1997;80:50–54.
8. Weinstein SJ, Yu K, Horst RL, et al. Serum 25-hydroxyvitamin D and risks of colon and rectal cancer in Finnish men. Am J Epidemiol 2011;173:499–508.
9. Rossouw JE, Anderson GL, Prentice RL, et al. for the Women’s Health Initiative Investigators. Risks and benefits of estrogen plus progestin in healthy postmenopausal women: principal results from the Women’s Health Initiative randomized controlled trial. JAMA 2002;288:321–333.
10. Braitman LE. Confidence intervals assess both clinical significance and statistical significance. Ann Intern Med 1991;114:515–517.
11. Mai PL, Wideroff L, Greene MH, et al. Prevalence of family history of breast, colorectal, prostate, and lung cancer in a population-based study. Public Health Genomics 2010;13:495–503.
12. Venge P, Johnson N, Lindahl B, et al. Normal plasma levels of cardiac troponin I measured by the high-sensitivity cardiac troponin I access prototype assay and the impact on the diagnosis of myocardial ischemia. J Am Coll Cardiol 2009;54:1165–1172.
13. McCormack K, Scott N, Go PMNYH, et al. Laparoscopic techniques versus open techniques for inguinal hernia repair. Cochrane Database Syst Rev 2003;(1):CD001785.
14. Goodman SN, Berlin JA. The use of predicted confidence intervals when planning experiments and the misuse of power when interpreting results. Ann Intern Med 1994;121:200–206.
15. Sackett DL, Haynes RB, Gent M, et al. Compliance. In: Inman WHW, ed. Monitoring for Drug Safety. Lancaster, UK: MTP Press; 1980.
16. Armitage P. Importance of prognostic factors in the analysis of data from clinical trials. Control Clin Trials 1981;1:347–353.
17. Hunter DJ, Kraft P. Drinking from the fire hose: statistical issues in genomewide association studies. N Engl J Med 2007;357:436–439.
18. Connolly SJ, Eikelboom J, Joyner C, et al. Apixaban in patients with atrial fibrillation. N Engl J Med 2011;364:806–817.
19. Duell EJ, Travier N, Lujan-Barroso L, et al. Alcohol consumption and gastric cancer risk in European Prospective Investigation into Cancer and Nutrition (EPIC) cohort. Am J Clin Nutr 2011;94:1266–1275.
20. Diamond GA. Future imperfect: the limitations of clinical prediction models and the limits of clinical prediction. J Am Coll Cardiol 1989;14:12A–22A.
Chapter 12

Cause
In what circumstances can we pass from observed association to a verdict of
causation? Upon what basis should we proceed to do so?
—Sir Austin Bradford Hill
1965

KEY WORDS
Web of causation
Aggregate risk studies
Ecological studies
Ecological fallacy
Time-series studies
Multiple time-series studies
Decision analysis
Cost-effectiveness analysis
Cost–benefit analysis

This book has been about three kinds of clinically useful information. One is description, a simple statement of how often things occur, summarized by metrics such as incidence and prevalence, as well as (in the case of diagnostic test performance) sensitivity, specificity, predictive value, and likelihood ratio. Another is prediction, evidence that certain outcomes regularly follow exposures without regard to whether the exposures are independent risk factors, let alone causes. The third is either directly or implicitly about cause and effect. Is a risk factor an independent cause of disease? Does treatment cause patients to get better? Does a prognostic factor cause a different outcome, everything else being equal? This chapter considers cause in greater depth.

Another word for the study of the origination of disease is “etiology,” now commonly used as a synonym for cause, as in “What is the etiology of this disease?” To the extent that the cause of disease is not known, the disease is said to be “idiopathic” or of “unknown etiology.”

There is a longstanding tendency to judge the legitimacy of a causal assertion by whether it makes sense according to beliefs at the time, as the following historical example illustrates.

Example
In 1843, Oliver Wendell Holmes (then professor of anatomy and physiology and later dean of Harvard Medical School), published a study linking hand washing habits by obstetricians and childbed (puerperal) fever, an often-fatal disease following childbirth. (Puerperal fever is now known to be caused by a bacterial infection.) Holmes’s observations led him to conclude that “the disease known as puerperal fever is so far contagious, as to be frequently carried from patient to patient by physicians and nurses (1).”

One response to Holmes’s assertion was that the findings made no sense. “I prefer to attribute them [puerperal fever cases] to accident, or Providence, of which I can form a conception, rather than to contagion of which I cannot form any clear idea, at least as to this particular malady,” wrote Charles Meigs, professor of midwifery and the diseases of women and children at Jefferson Medical College. Around that time, a Hungarian physician, Ignaz Semmelweis, showed that disinfecting physicians’ hands reduced rates of childbed fever, and his studies were also dismissed because he had no generally accepted explanation for his findings. Holmes’s and Semmelweis’s assertions were made decades before pioneering work by Louis Pasteur, Robert Koch, and Joseph Lister established the germ theory of disease.
following historical example illustrates.

The importance attached to a cause-and-effect relationship “making sense,” usually in terms of a biologic mechanism, is still imbedded in current thinking. For example, in the 1990s, studies showing that eradication of Helicobacter pylori infection prevented peptic ulcer disease were met with skepticism because everyone knew that ulcers of the stomach and duodenum were not an infectious disease. Now, H. pylori infection is recognized as a major cause of this disease. In this chapter, we review concepts of cause in clinical medicine. We discuss the broader array of evidence, in addition to biologic plausibility, that strengthens or weakens the case that an association represents a cause-and-effect relationship. We also briefly deal with a kind of research design not yet considered in this book: studies in which exposure to a possible cause is known only for groups and not for the individuals in the groups.

BASIC PRINCIPLES

Single Causes
In 1882, 40 years after the Holmes-Meigs confrontation, Koch set forth postulates for determining that an infectious agent is the cause of a disease (Table 12.1). Basic to his approach was the assumption that a particular disease has one cause and that a particular cause results in one disease. This approach helped him to identify for the first time the bacteria causing tuberculosis, diphtheria, typhoid, and other common infectious diseases of his day.

Koch’s postulates contributed greatly to the concept of cause in medicine. Before Koch, it was believed that many different bacteria caused any given disease. The application of his postulates helped bring order out of chaos. They are still useful today. That a unique infectious agent causes a particular infectious disease was the basis for the discovery in 1977 that Legionnaire disease is caused by a gram-negative bacterium; the discovery in the 1980s that a newly identified retrovirus causes AIDS; and the discovery in 2003 that a coronavirus caused an outbreak of severe acute respiratory syndrome (SARS) (2).

Table 12.1
Koch’s Postulates
1. The organism must be present in every case of the disease.
2. The organism must be isolated and grown in pure culture.
3. The organism must cause a specific disease when inoculated into an animal.
4. The organism must then be recovered from the animal and identified.

Multiple Causes
For some diseases, one cause appears to be so dominant that we speak of it as the cause. We say that Mycobacterium tuberculosis causes tuberculosis or that an abnormal gene coding for the metabolism of phenylalanine, an amino acid, causes phenylketonuria. We may skip past the fact that tuberculosis is also caused by host and environmental factors and that the disease phenylketonuria develops because there is phenylalanine in the diet.

More often, however, various causes make a more balanced contribution to the occurrence of disease such that no one stands out. The underlying assumption of Koch’s postulates, one cause–one disease, is too simplified. Smoking causes lung cancer, coronary artery disease, chronic obstructive pulmonary disease, and skin wrinkles. Coronary artery disease has multiple causes, including cigarette smoking, hypertension, hypercholesterolemia, diabetes, inflammation, and heredity. Specific parasites cause malaria, but only if the mosquito vectors can breed, become infected, and bite people, and those people are not taking antimalarial drugs or are unable to control the infection on their own.

When many factors act together, it has been called the “web of causation” (3). A causal web is well understood in chronic degenerative diseases such as cardiovascular disease and cancer, but it is also the basis for infectious diseases, where the presence of a microbe is a necessary but not sufficient cause of disease. AIDS cannot occur without exposure to HIV, but exposure to the virus does not necessarily result in disease. For example, exposure to HIV rarely results in seroconversion after needlesticks (about 3/1,000) because the virus is not nearly as infectious as, say, the hepatitis B virus. Similarly, not everyone exposed to tuberculosis, in Koch’s day or now, becomes infected.

When multiple causes act together, the resulting risk may be greater or less than would be expected by simply combining the effects of the separate causes. That is, they interact: there is effect modification. Figure 12.1 shows the 10-year risk of cardiovascular disease in a 60-year-old man with no prior history of cardiovascular disease according to the presence or absence of several common risk factors. The risk is greater than the sum of the effects of each individual risk factor. The effect of low HDL is greater in the presence of elevated total cholesterol, the effect of smoking is greater in the presence of both elevated total cholesterol and low HDL, and so on. The consequence of exposure to each new risk factor is affected by exposure to the others, an example of effect modification. Age and sex are also risk factors and interact with the others (not shown).

When multiple causative factors are present and interact, it may be possible to make a substantial impact on a patient’s health by changing just one or a few of them. In the previous example, treating hypertension and elevated serum cholesterol can substantially lower the risk of developing cardiovascular disease, even if the other risk factors are unchanged.

By and large, clinicians are more interested in treatable or reversible causes than immutable ones. For example, when it comes to cardiovascular disease, age and sex cannot be changed; they have to be taken as a given. On the other hand, smoking, blood pressure, and serum cholesterol can be changed. Therefore, even though risks related to age and sex are at least as large as those for the other risk factors and are taken into account when estimating cardiovascular risk, they do not offer a target for prevention or treatment.

Proximity of Cause to Effect
When biomedical scientists study cause, they usually search for the underlying pathogenetic mechanism or final common pathway of disease. Sickle cell anemia is an example. In simplified form, pathogenesis involves a gene coding for abnormal hemoglobin that polymerizes in low-oxygen environments (the capillaries of some tissues), resulting in deformed red cells, causing anemia as they are destroyed, and occluding vessels, causing attacks of ischemia with pain and tissue destruction.

Disease is also determined by less specific, more remote causes (risk factors) such as behavior and environments. These factors may have large effects on disease rates. For example, a large proportion of cardiovascular and cancer deaths in the United States can be traced to behavioral and environmental factors such as cigarette smoking, diet, and lack of exercise; AIDS is primarily spread through unsafe sexual behaviors and shared needles; and deaths from violence and unintended injuries are rooted in social conditions, access to guns, intoxication while driving, and seatbelt use.

[Figure 12.1: bar chart. Vertical axis: 10-year risk of cardiovascular event (%), 0 to 100. Six bars, left to right: no risk factors (total cholesterol 160 mg/dL, HDL 60 mg/dL, nonsmoker, systolic blood pressure 120 mm Hg, no diabetes mellitus), then with the successive addition of total cholesterol 280 mg/dL, HDL 35 mg/dL, smoking, systolic blood pressure 160 mm Hg, and diabetes mellitus.]
Figure 12.1 ■ The interaction of multiple risk factors for cardiovascular disease. Ten-year cardiovascular risk (%) for a 60-year-old man with no risk factors (left bar) and with the successive addition of five risk factors (bars to the right). Each risk factor alone adds relatively little (several percent) to risk whereas adding them to each other increases risk almost 10-fold, far more than the sum of the individual risk factors acting independently, which is shown by the shaded area of the right-hand bar. (Data from The Framingham Risk Calculator in UpToDate, Waltham, MA according to formulae in D’Agostino RB Sr, Vasan RS, Pencina MJ, et al. General cardiovascular risk profile for use in primary care. The Framingham Heart Study. Circulation 2008;117(6):743–753.)
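The interaction among risk factors illustrated in Figure 12.1 can be made concrete with a small calculation. The Python sketch below compares the joint risk that would be expected if each factor simply added its own excess risk with the joint risk actually observed. The baseline, single-factor, and observed risks are hypothetical values chosen only for illustration (they are not the Framingham estimates behind the figure), and the function names are illustrative as well.

```python
def expected_additive_risk(baseline, single_factor_risks):
    """Joint risk expected if each factor only adds its own excess risk."""
    excess = sum(risk - baseline for risk in single_factor_risks)
    return baseline + excess


def interaction_excess(observed_joint, baseline, single_factor_risks):
    """Observed joint risk minus the additive expectation.

    A positive value means the factors together carry more risk than the
    sum of their separate effects (effect modification on the additive scale).
    """
    return observed_joint - expected_additive_risk(baseline, single_factor_risks)


# Hypothetical 10-year risks (as proportions), for illustration only.
baseline = 0.05               # no risk factors
single = [0.08, 0.07, 0.09]   # risk when each factor is present alone
observed_joint = 0.40         # risk when all three factors are present

expected = expected_additive_risk(baseline, single)
excess = interaction_excess(observed_joint, baseline, single)
print(f"Expected if effects simply add: {expected:.2f}")
print(f"Observed with all factors:      {observed_joint:.2f}")
print(f"Excess beyond additive effects: {excess:.2f}")
```

With these assumed numbers, the observed joint risk is well above the additive expectation, which is the pattern the shaded area of the right-hand bar in Figure 12.1 is meant to convey.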
Figure 12.2 shows how both risk factors and pathogenesis of tuberculosis (distant and proximal causes) lead on a continuum to the disease. Exposure to M. tuberculosis depends on the host’s environment: close proximity to active cases. Infection depends on host susceptibility, which can be increased by malnutrition, decreased by vaccination, and altered by genetic endowment. Whether infection progresses to disease depends on these factors and others, such as immunocompetence, which can be compromised by HIV infection and age. Finally, active infection may be cured by antibiotic treatment.

[Figure 12.2: diagram. Risk factors for tuberculosis (crowding, malnutrition, vaccination, genetics), distant from outcome, act on a susceptible host. Exposure to Mycobacterium leads to infection; tissue invasion and reaction (pathogenesis), proximal to outcome, lead to tuberculosis.]
Figure 12.2 ■ Proximal and distal causes of tuberculosis.

Clinicians may be so intent on pathogenesis that they underestimate the importance of more remote causes of disease. In the case of tuberculosis, social and economic improvements influencing host susceptibility, such as less crowded living space and better nutrition, appear to have played a more prominent role in the decline in tuberculosis rates in developed countries than treatments. Figure 12.3 shows that the death rate from tuberculosis in England and Wales dropped dramatically before the tubercle bacillus was identified and a century before the first effective antibiotics were introduced in the 1950s.

[Figure 12.3: line graph. Vertical axis: death rate, 0 to 400. Horizontal axis: year, 1840 to 2000. Annotations mark when the tubercle bacillus was identified and when antibiotics were introduced.]
Figure 12.3 ■ Declining death rate from respiratory tuberculosis in England and Wales over the past 150 years. Most of the decrease occurred before antibiotic therapy was available. (Data from McKeown T. The Role of Medicine: Dream, Mirage, or Nemesis. London: Nuffield Provincial Hospital Trust; 1976 and from http://www.hpa.org.uk/Topics/InfectiousDiseases/InfectionsAZ/Tuberculosis/TBUKSurveillanceData/TuberculosisMortality)

The web of causation is continually changing, even for old diseases. Between 1985 and 1992, the number of tuberculosis cases in the United States, which had been falling for a century, began to increase (Fig. 12.4) (4). Why did this happen? There had been an influx of immigrants from countries with high rates of tuberculosis. The AIDS epidemic produced more people with a weakened immune system, making them more susceptible to infection with M. tuberculosis. When infected, their bodies allowed massive multiplication of the bacterium, making them more infectious. Rapid multiplication, especially in patients who did not follow prescribed drug regimens, favored the development of multidrug resistant strains. People who were more likely to have both AIDS and tuberculosis (the socially disadvantaged, intravenous drug users, and prisoners) were developing multidrug resistant disease and exposing others in the population to a difficult-to-treat strain. The interplay of environment, behavior, and molecular biology combined to reverse a declining trend in tuberculosis. To combat the new epidemic of tuberculosis, the public health infrastructure was rebuilt. Multidrug regimens (biologic efforts) and directly observed therapy to ensure compliance (behavioral efforts) were initiated and the rate of tuberculosis began to decline again.

[Figure 12.4: line graph. Vertical axis: number of cases, 0 to 30,000. Horizontal axis: year, 1980 to 2010.]
Figure 12.4 ■ Tuberculosis cases in the United States, 1980 through 2010. A longstanding decline was halted in 1985. The number of cases reached a peak in 1992 and then began to decline again. (Adapted from Centers for Disease Control and Prevention. Reported tuberculosis in the United States, 2010. Available at http://www.cdc.gov/Features/dsTB2010Data/. Accessed February 8, 2012.)

INDIRECT EVIDENCE FOR CAUSE

In clinical medicine, it is not possible to prove causal relationships beyond any doubt, as one might a mathematic formula. What is possible is to increase one’s conviction in a cause-and-effect relationship by means of empiric evidence to the point where, as a practical matter, cause has been established. Conversely, evidence against a cause can accumulate to the point where a cause-and-effect relationship becomes implausible.

A postulated cause-and-effect relationship should be examined in as many different ways as possible. In the remainder of this chapter, we discuss some commonly used approaches.

Examining Individual Studies
One approach to evidence for cause-and-effect has been discussed throughout this book: in-depth analysis of the studies themselves. When an association has been observed, a causal relationship is established to the extent that the association cannot be accounted for by bias and chance. Figure 12.5 summarizes a familiar approach. One first looks for bias and how much it might have changed the result, and then whether the association is unlikely to be by chance. For observational studies, confounding is always a possibility. Although confounding can be controlled in comprehensive, state-of-the-science ways, it is almost never possible to rule it out entirely; therefore, confounding remains the enduring challenge to causal reasoning based on observational research.

[Figure 12.5: flow diagram. Starting from an observed association, three alternative explanations are considered in turn: bias in selection or measurement (yes/no), chance (likely/unlikely), and confounding (yes/no). If all are excluded, the remaining explanation is cause.]
Figure 12.5 ■ Association and cause. Bias, chance, and confounding should be excluded before concluding that a causal association is likely.

Randomized trials can deal definitively with confounding, but they are not possible for studies of risk (i.e., causes) per se. For example, it is unethical (and would be unsuccessful) to randomize non-smokers to cigarette smoking to study whether smoking causes lung cancer. However, randomized controlled trials can contribute to causal inference in two situations. One is when the trial is to treat a possible cause, such as elevated cholesterol or blood pressure, and the outcome is prevented. Another is when a trial is done for another purpose and the intervention causes unanticipated harms. For example, the fact that there was an excess of cardiovascular events in randomized trials of the cyclooxygenase-2 inhibitor rofecoxib, which had been given for other reasons (e.g., pain relief), is evidence that this drug may be a cause of cardiovascular events.

Hierarchy of Research Designs
The various research designs can be placed in a hierarchy of scientific strength for the purpose of establishing cause (Table 12.2). At the top of the hierarchy are systematic reviews of randomized controlled trials because they can deal definitively with confounding. Randomized trials are followed by observational studies, with little distinction between cohort and case-control studies in an era when case-control analyses are nested in cohorts sampled from defined populations. Lower still are uncontrolled studies, biologic reasoning, and personal experience. Of course, this order is only a rough guide to strength of evidence. The manner in which an individual study is performed can go a long way toward increasing or decreasing its validity, regardless of the type of design used. A bad randomized controlled trial contributes less to our understanding of cause than an exemplary cohort study.

With this hierarchy in mind, the strength of the evidence for cause and effect is sometimes judged according to the best studies of the question. Well-designed and well-executed randomized controlled trials trump observational studies, state-of-the-science observational studies trump case series, and so on. This is a highly simplified approach to evidence but is a useful shortcut.

Table 12.2
Hierarchy of Research Design Strength

Individual Risk Studies
Systematic reviews: consistent evidence from multiple randomized controlled trials
Randomized controlled trials
Observational studies
    Cohort studies
    Case-control, case-cohort studies
    Cross-sectional studies
Case series
Experience, expert opinion

THE BODY OF EVIDENCE FOR AND AGAINST CAUSE

What aspects of the research findings support cause and effect when only observational studies are available? In 1965, the British statistician Sir Austin Bradford Hill proposed a set of observations that taken together help to establish whether a relationship between an environmental factor and disease is causal or just an association (5) (Table 12.3). We review these “Bradford Hill criteria,” mainly using smoking and lung cancer as an example. Smoking is generally believed to cause lung cancer even though there are no randomized controlled trials of smoking and no undisputed biologic mechanism.

Table 12.3
Evidence That an Association Is Cause and Effect

Criteria: Comments
Temporality: Cause precedes effect
Strength: Large relative risk
Dose–response: Larger exposure to cause associated with higher rates of disease
Reversibility: Reduction in exposure is followed by lower rates of disease
Consistency: Repeatedly observed by different persons, in different places, circumstances, and times
Biologic plausibility: Makes sense according to biologic knowledge of the time
Specificity: One cause leads to one effect
Analogy: Cause-and-effect relationship already established for a similar exposure or disease

Adapted from Bradford Hill AB. The environment and disease: association and causation. Proc R Soc Med 1965;58:295–300.
Does Cause Precede Effect?
A cause should obviously occur before its effects. This seems self-evident, but the principle can be overlooked when interpreting cross-sectional and case-control studies, in which both the purported cause and the effect are measured at the same points in time. Smoking clearly precedes lung cancer by several decades, but there are other examples where the order of cause and effect can be confused.

Example
“Whiplash” is the occurrence of neck pain following a forceful flexion/extension injury, typically in an auto accident. Man [...]

Finding that what was thought to be a cause actually follows an effect is powerful evidence against cause, but temporal sequence alone is only minimal evidence for cause.

Strength of the Association
A strong association between a purported cause and an effect, as expressed by a large relative or absolute risk, is better evidence for a causal relationship than a weak association. The reason is that unrecognized bias could account for small relative risks but is unlikely to result in large ones. Thus, the 20 times higher incidence of lung cancer among male smokers compared to non-smokers is much stronger evidence that smoking causes lung cancer than the finding that smoking is related to renal cancer, for which the relative risk is much smaller (about 1.5). Similarly, a 10- to 100-fold increase in risk of hepatocellular carcinoma in patients with hepatitis B infection is strong evidence that the virus is a cause of liver cancer.

Dose–Response Relationships
A dose–response relationship is present if increasing exposure to the purported cause is followed by a larger and larger effect. In the case of cigarette smoking, “dose” might be the number of years of smoking, current packs per day, or “pack-years.” Figure 12.6 shows a clear dose–response curve when lung cancer death rates (responses) are plotted against the number of cigarettes smoked (doses).

Demonstrating a dose–response relationship strengthens the argument for cause and effect, but its absence is relatively weak evidence against causation because not all causal associations exhibit a dose–response relationship within the range observed and because confounding remains possible.

[Figure 12.6: bar chart. Vertical axis: lung cancer deaths/100,000. Horizontal axis: cigarettes smoked, number/day. Values: 0 cigarettes, 10; 1–14, 78; 15–24, 127; 25+, 251.]
Figure 12.6 ■ Example of a dose–response relationship: lung cancer deaths in male physicians according to dose (number) of cigarettes smoked. (Data from Doll R, Peto R. Mortality in relation to smoking: 20 years’ observations on male British doctors. Br Med J 1976;2:1525–1536.)

Example
Both the strong association between smoking and lung cancer a [...] is a theoretically possible explanation for the association between smoking and lung cancer, although just what the confounding factor might be has never been clarified. Short of a randomized controlled trial (which would, on average, allocate people with the confounding factor equally to smoking and non-smoking groups) the possibility of confounding is difficult to [...]

Reversible Associations
A factor is more likely to be a cause of disease when its removal results in a decreased risk. Figure 12.7 shows that when people quit smoking, they decrease their likelihood of getting lung cancer in relation to the number of years since quitting.

Reversible associations are strong, but not infallible, evidence of a causal relationship because confounding could account for them. For example, Figure 12.7 is consistent with the (unlikely) explanation that people who are willing to quit smoking have smaller amounts of an unidentified confounding factor than those who continue to smoke.

Consistency
When several studies conducted at different times in different settings and with different kinds of patients all come to the same conclusion, evidence for a causal relationship is strengthened. Causation is particularly supported when studies using several different research designs, with complementary strengths and weaknesses, all produce the same result because studies using the same design might have all made the same mistake. For the association of smoking and lung cancer, many cohort, case-control, and time-series studies have shown that increased tobacco use is followed by increased lung cancer incidence, in both sexes, in various ethnic groups, and in different countries.

Different studies can produce different results. Lack of consistency does not necessarily mean that a causal relationship does not exist. Study results may differ because of differences in patients, interventions, follow-up, or outcome measures (i.e., they address somewhat different research questions). Also, the studies may vary in quality, and one good study may contribute more valid information than several poor ones.

Biologic Plausibility
As discussed at the beginning of this chapter, the belief that a possible cause is consistent with our knowledge of the mechanisms of disease, as it is currently understood, is often given considerable weight when assessing cause and effect. When one has absolutely no idea how an association might have arisen, one tends to be skeptical that the association is real. Such skepticism often serves us well.

Example
The substance Laetrile, extracted from apricot pits, was toute [...]

[Figure 12.7: bar chart. Vertical axis: ratio of mortality rate of ex-smokers to never-smokers, 0 to 20. Horizontal axis: years since stopped smoking. Values: 0 years, 15.8; <5, 16.0; 5–9, 5.9; 10–14, 5.3; 15+, 2.0.]
Figure 12.7 ■ Reversible association: declining mortality from lung cancer in ex-cigarette smokers. The data exclude people who stopped smoking after getting cancer. (Data from Doll R, Peto R. Mortality in relation to smoking: 20 years’ observations on male British doctors. Br Med J 1976;2:1525–1536.)
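The arithmetic behind the smoking examples (strength of association, dose–response, and reversibility) can be checked directly from the values plotted in Figures 12.6 and 12.7. The Python sketch below is only an illustration: the numbers are the bar heights read from the two figures, while the function names and the choice of non-smokers as the reference group are assumptions made for this example.

```python
# Lung cancer deaths per 100,000 male physicians, by cigarettes smoked
# per day (bar heights read from Figure 12.6).
deaths_per_100k = {"0": 10, "1-14": 78, "15-24": 127, "25+": 251}

# Ratio of lung cancer mortality to that of never-smokers, by years since
# stopping smoking (bar heights read from Figure 12.7).
mortality_ratio = {"0": 15.8, "<5": 16.0, "5-9": 5.9, "10-14": 5.3, "15+": 2.0}


def rate_ratios(rates, reference):
    """Divide every rate by the rate in the reference (unexposed) group."""
    ref = rates[reference]
    return {group: rate / ref for group, rate in rates.items()}


def strictly_increasing(values):
    """True if each value is larger than the one before it."""
    values = list(values)
    return all(a < b for a, b in zip(values, values[1:]))


# Strength of association and dose-response: ratios relative to non-smokers.
ratios = rate_ratios(deaths_per_100k, reference="0")
for group, ratio in ratios.items():
    print(f"{group:>6} cigarettes/day: rate ratio {ratio:.1f}")
print("Rate rises with every step in dose:", strictly_increasing(ratios.values()))

# Reversibility: compare the first and last categories of Figure 12.7.
first, last = mortality_ratio["0"], mortality_ratio["15+"]
print(f"Mortality ratio falls from {first} to {last} "
      f"for those who stopped 15 or more years ago")
```

Ratios of this kind summarize strength of association; the stepwise rise across dose categories is the dose–response pattern, and the fall in the mortality ratio with years since quitting is the reversible association.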
Biologic plausibility, when present, strengthens the case for causation, but the absence of biologic plausibility may just reflect the limitations of understanding of the biology of disease rather than the lack of a causal association.

Specificity
Specificity (one cause, one effect) is more often found for acute infectious diseases (e.g., poliomyelitis and tetanus) and for genetic diseases (e.g., familial adenomatous polyposis or ochronosis), although genetic effects are sometimes modified by gene–gene and gene–environment interactions. As mentioned earlier in this chapter, chronic, degenerative diseases often have many causes for the same effect and many effects from the same cause. Lung cancer is caused by cigarette smoking, asbestos, and radiation. Cigarette smoking not only causes lung cancer but also bronchitis, cardiovascular disease, periodontal disease, and wrinkled skin, to name a few. Thus, specificity is strong evidence for cause and effect, but the absence of specificity is weak evidence against it.

Analogy
The argument for a cause-and-effect relationship is strengthened when examples exist of well-established causes that are analogous to the one in question. Thus, the case that smoking causes lung cancer is strengthened by observations that other environmental toxins such as asbestos, arsenic, and uranium also cause lung cancer.

In a sense, applying the Bradford Hill criteria to cause is an example of Bayesian reasoning. For example, belief in causality based on strength of association and dose–response is modified (built up or diminished) by evidence concerning biologic plausibility or specificity, with each of the criteria contributing to a greater or lesser extent to the overall belief that an association is causal. The main difference from the Bayesian approach to diagnostic testing is that the various lines of evidence for cause (dose–response, reversibility, consistency, etc.) are being assembled concurrently in various scientific disciplines rather than in series by clinical research.

AGGREGATE RISK STUDIES

Until now, we have discussed studies in which exposure and disease are known for each individual in a study. To fill out the spectrum of research designs, we now consider a different kind of study, called aggregate risk studies, in which exposure to a risk factor is characterized by the average exposure of the group to which individuals belong. Another term is ecological studies, because people are classified by the general level of exposure in their environment, which may or may not correspond to their individual exposure. Examples are epidemiologic studies relating countries’ wine consumption to rates of cardiovascular mortality and studies of regional cancer or birth defect rates in relation to regional exposures such as chemical spills.

The main problem with studies that simply correlate average exposure with average disease rates in groups is the potential for an ecological fallacy, in which affected individuals in a generally exposed group may not have been the ones actually exposed to the risk factor. Also, exposure may not be the only characteristic that distinguishes people in the exposed group from those in the non-exposed group; that is, there may be confounding factors. Thus, aggregate risk studies like these are most useful in raising hypotheses, which should then be tested by more rigorous research.

Evidence from aggregate risk studies can be strengthened when observations are made over a period of time bracketing the exposure and even further strengthened if observations are in more than one place and calendar time.

In a time-series study, disease rates are measured at several points in time, both before and after the purported cause has been introduced. It is then possible to see whether a trend in disease rate over time changes in relation to the time of exposure. If changes in the purported cause are directly followed by changes in the purported effect, and not at some other time, the association is less likely to be spurious. An advantage of time-series analyses is that they can distinguish between changes already occurring over time (secular trends) and the effects of the intervention itself.

Example
Health care–associated infections with methicillin-resistant Staphy [...] responsibilities for preventing infection. Data on MRSA infection rates were gathered from 2 years before the intervention through the time when the bundle was initiated to when it was fully implemented (Fig. 12.8). Rates in intensive care units were stable in the 2-year period before the bundle was introduced and fell progressively to 62% of the preintervention rate following the intervention, providing evidence that this particular intervention achieved its purposes.

[Figure 12.8: line graph. Vertical axis: health care–associated MRSA infections/1,000, 0.2 to 2.0. Horizontal axis: year, divided into three periods: no program, transition, and program fully implemented.]
Figure 12.8 ■ A time-series study. Effects of a program to reduce methicillin-resistant Staphylococcus aureus (MRSA) infections in Veterans Affairs facilities. Results for intensive care units. (Redrawn with permission from Jain R, Kralovic SM, Evans ME, et al. Veterans Affairs initiative to prevent methicillin-resistant Staphylococcus aureus infections. N Engl J Med 2011;364:1419–1430.)

Inference from time-series studies can be strengthened if it is possible to rule out other interventions occurring around the same time as the one under study that might have caused a change in rates. The case is also strengthened if intermediate outcomes (e.g., increased hand washing) follow the intervention (the MRSA package), as they did in the MRSA example.

In a multiple time-series study, the suspected cause is introduced into several different groups at different times. Time-series measurements are then made in each group to determine whether the effect occurred in the same sequential manner in which the suspected cause was introduced. When an effect regularly follows introduction of the suspected cause at various times and places, there is stronger evidence for cause than when this phenomenon is observed only once, as in the single time-series example, because it is even more improbable that the same extraneous factor(s) occurred at the same time in relation to the intervention in many different places and calendar times.

Example
Screening with Pap smears is done to prevent deaths from cervical cancer. This practice was initiated before there was strong evidence of effectiveness (indeed, before randomized controlled trials). The best available evidence of effectiveness comes from multiple time-series studies. An example is shown in Figure 12.9 (9). Organized Pap smear screening programs were implemented in Nordic countries at different times and with different intensities. Mortality rates from cervical cancer had been rising and then began to fall in the years just before screening began, illustrating the value of having information on time trends. Rates were roughly similar before screening programs

Figure 12.9 ■ A multiple time-series study. Change in cervical cancer mortality rates according to year organized Pap smear screening programs were implemented and targeted coverage. Arrows mark the year coverage was achieved for each country. [Plot: cervical cancer mortality (vertical axis, 0 to 14) by year, 1955 to 1985, with targeted coverage for the national population (%): Denmark 40, Norway 5, Sweden 100, Finland 100, Iceland 100.] (Redrawn with permission from Läärä E, Day NE, Hakama M. Trends in mortality from cervical cancer in Nordic countries: association with organized screening programmes. Lancet 1987;1(8544):1247–1249.)

were started. Rates fell the most in countries (Iceland, Finland, and Sweden) with national programs with the broadest coverage and least in countries (Denmark and Norway) with the least coverage.

Multiple time series like this one are usually not planned experiments. Rather, the interventions were introduced in the different countries at different times and with different intensities for their own sociopolitical reasons. Researchers later took advantage of this “natural experiment,” and data on cervical cancer death rates, to do a structured study.

The various ecological study designs are of vastly different scientific strength. Simply relating aggregate exposure to aggregate risk across regions may be useful for raising hypotheses to be tested by more rigorous studies. A single time series can provide convincing evidence of cause and effect if there is an unmistakable break in trend after the intervention and if concurrent interventions are ruled out. Multiple time series can provide strong evidence of cause, arguably on a par with randomized trials, because it is so improbable that confounding would have produced the same effects following intervention at many different times and places.

MODELING

We have already discussed how mathematical models are used to control for confounding and to develop prediction rules. Another use of models is to describe the relative importance of various causes of disease and its prevention.

Example

The U.S. death rate from colorectal cancer fell by 26% from 1975

Figure 12.10 ■ Causes for the decline in colorectal cancer deaths, 1975–2000. [Plot: colorectal cancer mortality (vertical axis, 0 to 35) by year of death, 1975 to 2005, with curves labeled Risk factors, Screening, Treatment, and Mortality.] (Redrawn with permission from Edwards BK, Ward E, Kohler BA, et al. Annual report to the Nation on the status of cancer, 1975–2006, featuring colorectal cancer trends and impact of interventions (risk factors, screening, and treatment) to reduce future rates. Cancer 2010;116:544–573.)

use, and physical activity) along with the prevalence of each over time. Similarly, screening was entered as the rates of each type of screening test as they changed over time, and their sensitivity and specificity in detecting precancerous lesions. Treatment was modeled as the kinds of chemotherapy regimens available over time and their effectiveness in reducing mortality. The model included the many steps leading from adenomas (precursors of cancer) through cancer to treatment and survival. Figure 12.10 shows that the model predicted the actual fall in colorectal cancer death rates very closely. Changes in risk factors accounted for 35% of the decline in mortality, screening another 53%

Models take a group of people through the various possible steps in the natural history of the disease. In this case, it was the U.S. population from development of polyps through their transition to cancer, prevention by screening, and effects of treatment on cancer deaths. Data for the probabilities of transition from one to another state (e.g., from polyp to cancer or from cancer to cure) are from published research.

Other kinds of models are used for quantitative decision making, comparing the downstream consequences of alternative courses of action. Decision analysis identifies the decisions that lead to the best outcomes in human terms, such as survival or being disease-free. Cost-effectiveness analysis compares cost to outcomes (e.g., life saved or quality-adjusted lives saved) for alternative courses of action. In cost–benefit analysis, both cost and benefits are expressed in money terms. Whenever cost is taken into account in these models, it is adjusted for the change in the value of money over time, because the money is spent or saved at various points in time over the course of disease. In all these models, if some of the inputted data are weak, sensitivity analysis can be used to see the effects of various values for these data. Modeling can provide answers to questions that are so broad in scope that they are not available from individual research studies, and perhaps never will be. Models are increasingly relied on to complement other forms of research when trying to understand the consequences of clinical decisions.

WEIGHING THE EVIDENCE

When determining cause, one must consider the evidence from all available studies. After examining the research design, the quality of studies, and whether their results are for or against cause, the case for causality can be strengthened or eroded.

Figure 12.11 ■ Relative strength of evidence for and against a causal effect. Note that with study designs, the strength of evidence for a causal relationship is a mirror image of that against. With findings, evidence for a causal effect does not mirror evidence against an effect. [Diagram: a scale running from strong evidence AGAINST, through weak, to strong evidence FOR. DESIGN, on both sides: systematic review, randomized controlled trial, multiple time series, non-randomized trial, cohort, case-control, time series, cross-sectional, case series, case report. FINDING, against: incorrect temporal sequence, no effect, not reversible, no dose–response, small effect, not specific, no analogy, not biologically plausible. FINDING, for: temporal sequence, large effect, dose–response, reversibility, consistency, biologic plausibility, specificity, analogy.]

Figure 12.11 summarizes the different types of evidence for and against cause, depending on the research design, and results that strengthen or weaken the evidence for cause. The figure roughly indicates relative strengths in helping to establish or discard a causal hypothesis. Thus, a carefully done cohort study showing a strong association and a dose–response relationship that is reversible is strong evidence for cause, whereas a cross-sectional study finding no effect is weak evidence against cause.

Belief in a cause-and-effect relationship is a judgment based on both the scientific strength and results of all research bearing on the question. As a practical matter, at issue is whether the weight of the evidence is convincing enough for us to behave as if something were a cause, not whether it is established beyond all reasonable doubt.
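As an illustration of the cost adjustment described in the Modeling section of this chapter, the following minimal sketch discounts costs that fall in different years back to a common starting year and then forms a cost-effectiveness ratio. All of the numbers, the function names, and the choice to discount effects at the same rate as costs are assumptions made for illustration; they are not taken from any study cited here. The loop at the end is a very simple sensitivity analysis, repeating the calculation with several discount rates.

```python
def present_value(amounts_by_year, annual_discount_rate):
    """Sum of yearly amounts, each discounted back to year 0."""
    return sum(
        amount / (1 + annual_discount_rate) ** year
        for year, amount in enumerate(amounts_by_year)
    )


def incremental_cost_effectiveness(cost_a, cost_b, effect_a, effect_b, rate):
    """Extra discounted cost of option A over option B per extra unit of discounted effect."""
    extra_cost = present_value(cost_a, rate) - present_value(cost_b, rate)
    extra_effect = present_value(effect_a, rate) - present_value(effect_b, rate)
    return extra_cost / extra_effect


# Hypothetical inputs: cost in dollars and life-years gained, per person, by year.
screening_costs = [500, 500, 500, 500]
no_screening_costs = [0, 0, 0, 1200]
screening_life_years = [0.00, 0.01, 0.02, 0.03]
no_screening_life_years = [0.00, 0.00, 0.00, 0.00]

# Simple sensitivity analysis: repeat the calculation with several discount rates.
for rate in (0.0, 0.03, 0.05):
    ratio = incremental_cost_effectiveness(
        screening_costs, no_screening_costs,
        screening_life_years, no_screening_life_years,
        rate,
    )
    print(f"discount rate {rate:.0%}: about ${ratio:,.0f} per life-year gained")
```

If the ratio changes little across plausible discount rates, the conclusion is robust to that input; if it changes a great deal, the weak input deserves closer study.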

Review Questions

Read the following statements and select the best response.

12.1. One of your patients read that cell phones cause brain cancer, and she wants to know your opinion. You discover that the incidence of malignant brain tumors is increasing in the United States. Results of several observational studies of cell phone use and brain cancer have not agreed with each other. A randomized controlled trial might resolve the question. What is the main reason why a randomized controlled trial (RCT) would be unlikely for this question?
A. It would cost too much.
B. People would not agree to be randomized to cell phone use.
C. It would take too long.
D. Even if done well, a randomized trial could not answer the question.

12.2. Which of the following would be least consistent with the belief that cell phone use is a cause of brain cancer?
A. A dose–response relationship
B. A large effect size (such as relative risk)
C. Separate analyses of patients with right- and left-sided tumors in relation to the side they usually listened to their cell phone
D. A biologic explanation for why cell phones might cause cancer
E. Cell phone use is associated with many different kinds of cancers

12.3. Which of the following is the most accurate description of a causal relationship?
A. Tuberculosis has a single cause, the tubercle bacillus.
B. Most genetic diseases are caused only by an abnormal gene.
C. Coronary heart disease has multiple interacting causes.
D. Effective treatment has been the main cause of decline in tuberculosis rates.

12.4. Which of the following is not one of the Bradford Hill criteria for causation?
A. Dose–response
B. Statistical significance
C. Reversibility
D. Biologic plausibility
E. Analogy

12.5. A study of colorectal cancer screening comparing fecal occult blood testing to no screening finds that screening costs $20,000 per year of life saved. This study is an example of which of the following?
A. Cost–benefit analysis
B. Decision analysis
C. Cost-effectiveness analysis
D. A clinical decision rule

12.6. Which of the following is true about aggregate risk studies?
A. Aggregate risk studies are not susceptible to confounding.
B. Multiple time-series studies cannot provide strong evidence of cause and effect.
C. Studies relating average exposure to average risk can provide a strong test of a causal hypothesis.
D. An ecological fallacy might affect their validity.

12.7. Which of the following would weaken belief that the “MRSA package” (Fig. 12.8) caused the observed decline in MRSA infections in VA acute care hospitals?
A. Hand washing, a part of the MRSA package, increased after the program was introduced.
B. Better antibiotics became available around the time of the program.
C. The rate of MRSA infections was stable for the years before the program was introduced.
D. Decline in rates began just after the program was introduced.

12.8. Randomized controlled trials are the strongest research designs for establishing cause and effect. Which of the following is the main limitation of trials for this purpose?
A. Randomized trials cannot control for unmeasured confounders.
B. Clinical trials may not be ethical or feasible for some questions.
C. Type of study design is only a crude measure of the scientific strength.
D. Results of poorly designed and conducted trials are no better than for observational studies.

12.9. Which of the following provides the strongest evidence for a cause-and-effect relationship?
A. Observational studies that have controlled for bias and minimized the role of chance.
B. A biologic mechanism can explain the relationship.
C. The purported cause clearly precedes the effect.
D. There is a dose–response relationship.
E. The evidence as a whole is consistent with the Bradford Hill criteria.

12.10. You discover that case-control studies have been done to determine whether cell phone use is associated with the development of brain cancer. In one study, patients with brain cancer and matched controls without brain cancer were asked about cell phone use. The estimated relative risk for at least 100 hours of use compared to no use was 1.0 for all types of brain cancers combined (95% confidence interval 0.6–1.5). This finding is consistent with all of the following except:
A. Use of cell phones increases the incidence of brain cancers by 50%.
B. Use of cell phones protects against brain cancers.
C. Specific types of cancers might be associated with cell phone use.
D. The study has adequate statistical power to answer the research question.

Answers are in Appendix A.

REFERENCES
1. Holmes OW. On the contagiousness of puerperal fever. Med Classics 1936;1:207–268. [Originally published, 1843.]
2. Fouchier RA, Kuiken T, Schutten M, et al. Aetiology: Koch’s postulates fulfilled for SARS virus. Nature 2003;423:240.
3. MacMahon B, Pugh TF. Epidemiology: Principles and Methods. Boston: Little, Brown & Co.; 1970.
4. Burzynski J, Schluger NW. The epidemiology of tuberculosis in the United States. Sem Respir Crit Care Med 2008;29:492–498.
5. Bradford Hill A. The environment and disease: association or causation? Proc R Soc Med 1965;58:295–300.
6. Mykletun A, Glozier N, Wenzel HG, et al. Reverse causality in the association between whiplash and symptoms of anxiety and depression. The HUNT Study. Spine 2011;36:1380–1386.
7. Moertel CC, Fleming TR, Rubin J, et al. A clinical trial of amygdalin (Laetrile) in the treatment of human cancer. N Engl J Med 1982;306:201–206.
8. Jain R, Kralovic SM, Evans ME, et al. Veterans Affairs initiative to prevent methicillin-resistant Staphylococcus aureus infections. N Engl J Med 2011;364:1419–1430.
9. Läärä E, Day NE, Hakama M. Trends in mortality from cervical cancer in the Nordic countries: association with organized screening programmes. Lancet 1987;1(8544):1247–1249.
10. Edwards BK, Ward E, Kohler BA, et al. Annual report to the nation on the status of cancer, 1975–2006, featuring colorectal cancer trends and impact of interventions (risk factors, screening, and treatment) to reduce future rates. Cancer 2010;116:544–573.
Chapter 13

Summarizing the Evidence

When the research community synthesizes existing evidence thoroughly, it is certain that a substantial proportion of current notions about the effects of health care will be changed. Forms of care currently believed to be ineffective will be shown to be effective; forms of care thought to be useful will be exposed as either useless or harmful; and the justification for uncertainty about the effects of many other forms of health care will be made explicit.
Ian Chalmers and Brian Haynes, 1994

KEY WORDS

Narrative review
Systematic review
PICO
Publication bias
Funnel plot
Forest plot
Meta-analysis
Heterogeneity
Patient-level meta-analysis
Fixed effect model
Random effects model
Meta-regression
Network meta-analysis
Cumulative meta-analysis

Clinical decisions are based on the weight of evidence bearing on a question. Sometimes the results of large, strong studies are so compelling that they eclipse all other studies of the same question. More often, however, clinicians depend on the accumulation of evidence from many less definitive studies. When considering these individual studies, clinicians need to establish the context for that one piece of evidence by asking, “Have there been other good studies of the same question, what have they shown, and do their results establish a pattern when the studies’ scientific strengths and statistical precision are taken into account?” Reviews are intended to answer these kinds of questions.

Reviews are made available to users in many different forms. They may be articles in journals, chapters in textbooks, summaries prepared by the Cochrane Collaboration, or monographs published by professional or governmental organizations. If the authors of individual research articles are doing their job properly, they will provide information about results of previous studies in the Introduction or Discussion sections of the article.

However a review is made available, the important issue is how well it is done. There are many different ways, each with strengths and weaknesses. In this chapter, we briefly describe traditional reviews and then address in more detail a powerful and more explicitly scientific approach called “systematic reviews.”

TRADITIONAL REVIEWS

In traditional reviews, called narrative reviews, an expert in the field summarizes evidence and makes recommendations. An advantage of these reviews is that they can address broad-gauged topics, such as “management of the diabetic patient,” and consider a range of issues, such as diagnostic criteria, blood glucose control and monitoring, cardiovascular risk


factor modification, and micro- and macrovascular complications. Compliance and cost-effectiveness may also be included. Clinicians need guidance on such a broad range of questions and experts are in a good position to provide it. Authors usually have experience with the disease, know the pertinent evidence base, and have been applying their knowledge in the care of patients.

A disadvantage of narrative reviews is that evidence and recommendations in them may be colored by value judgments that are not made explicit. The lack of structure of traditional reviews may hide important threats to validity. Original research might be cited without a clear account of how articles were found, raising the danger that they were selectively cited to support a point of view. Personal experience and conventional wisdom are often included and may be difficult to distinguish from bedrock research evidence. The strength of the original research may not be carefully critiqued, but instead suggested by shorthand indicators of quality (such as the prestige of the journal, eminence of the author, how recently the study was published, the number of articles for and against a given conclusion, and perhaps general research design, e.g., a randomized trial) without regard for how well studies were actually designed and executed. Also, there may be no explicit rationale for why one research finding was valued over another.

Of course, a traditional review may not have these limitations, especially if it has been peer reviewed by other experts with complementary expertise and if the author has included evidence from more structured reviews. However, concern about the limitations of traditional reviews, especially lack of structure and transparency, has prompted a new approach.

SYSTEMATIC REVIEWS

Systematic reviews are rigorous reviews of the evidence bearing on specific clinical questions. They are “systematic” because they summarize original research following a scientifically based plan that has been decided on in advance and made explicit at every step. As a result, readers can see the strength of the evidence for whatever conclusions are reached and, in principle, check the validity for themselves. Sometimes it is possible to combine studies, giving a more precise estimate of effect size than is available in individual studies.

Systematic reviews are especially useful for addressing a single, focused question such as whether angiotensin-converting enzyme inhibitors reduce the death rate in patients with congestive heart failure or whether skin adhesives are better than sutures for closing superficial lacerations. For a systematic review to be useful, strong studies of the question should be available. There should not be so few studies of the question that one could just as well critique the individual studies directly. The study results should disagree or at least leave the question open; if all the studies agree with one another, there is nothing to reconcile in a review. Systematic reviews are also useful when there is reason to believe that politics, intellectual passion, or self-interest are accounting for how research results are being interpreted.

Systematic reviews can provide a credible answer to targeted (but not broad-gauged) questions and offer a set of possibilities for how traditional reviews can be done better. They complement, but cannot replace, traditional reviews. Systematic reviews are most often used to summarize randomized controlled trials; therefore, we will base our comments on trials. However, the same methods are used to summarize observational studies of risk and studies of diagnostic test performance.

The elements of a systematic review are summarized in Table 13.1 and will be addressed one at a time throughout the remainder of this chapter.

Table 13.1
Elements of a Systematic Review
1. Define a specific question.
2. Find all relevant studies (published and unpublished).
3. Select the strongest studies.
4. Describe the scientific strength of the selected studies.
5. Determine if quality is associated with results.
6. Summarize the studies in figures (forest plots) and tables.
7. Determine if pooling of studies (meta-analysis) is justified.
8. If so, calculate a summary effect size and confidence interval.

Defining a Specific Question

Systematic reviews are of specific questions. For effects of interventions, the elements of specificity have been defined under the acronym PICO (1):
P = Patients
I = Intervention
C = Comparison
O = Outcomes

To these, some have added T for time (e.g., follow-up in a cohort study or randomized trial) and S for

study design (e.g., randomized trial or cohort) to make PICOTS. Elements of targeted questions for other kinds of studies (e.g., studies of diagnostic test accuracy or observational studies of risk or prognostic factors) are less well defined but include many of the same features.

Finding All Relevant Studies

The first step in a systematic review is to find all the studies that bear on the question at hand. The review should include a complete sample of the best studies of the question, not just a biased sample of studies that happen to have come to attention. Clinicians who review topics less formally (for colleagues in rounds, morning report, and journal clubs) face a similar challenge and should use similar methods, although the process cannot be as exhaustive.

How can a reviewer be reasonably sure that he or she has found all the best studies, considering that the medical literature is vast and widely dispersed? No one method of searching is sufficient for this task, so multiple complementary approaches are used (Table 13.2).

Table 13.2
Approaches to Finding All the Studies Bearing on a Question
• Search online databases such as MEDLINE, EMBASE, and the Cochrane Database of Systematic Reviews.
• Read recent reviews and textbooks.
• Seek the advice of experts in the content area.
• Consider articles cited in the articles already found by other approaches.
• Review registries of clinical trials and funded research.

Most reviews start by searching online databases of published research, among them MEDLINE (the National Library of Medicine’s electronic database of published articles), EMBASE, and the Cochrane Database of Systematic Reviews. There are many others that can be identified with a librarian’s help. Some, such as MEDLINE, can be searched both for a content area (such as treatment of atrial fibrillation) and for a quality marker (e.g., randomized controlled trial). However, even in the best hands the sensitivity of MEDLINE searches (even for articles that are in MEDLINE) is far from perfect. Also, the contents of the various databases tend to complement each other. Therefore, database searching is useful but not sufficient.

Other ways of finding the right articles make up for what database searches might have missed. Recent reviews and textbooks (particularly electronic textbooks that are continually updated) are a source. Experts in the content area (e.g., rheumatic heart disease or Salmonella infection) may recommend studies that were not turned up by the other approaches. References cited in articles already found are another possibility. There are a growing number of registries of clinical trials and funded research that can be used to find unpublished results.

The goal of consulting all these sources is to avoid missing any important article, even at the expense of inefficiency. In diagnostic test terms, the reviewer uses multiple parallel tests to increase the sensitivity of the search, even at the expense of many false-positive results (i.e., unwanted or redundant citations), which need to be weeded out by examining the studies themselves.

In addition to exercising due diligence in finding articles, authors of systematic reviews explicitly describe the search strategy for their review, including search terms. This allows readers to see the extent to which the reviewer took into account all the studies that were available at the time.

Limit Reviews to Scientifically Strong, Clinically Relevant Studies

To be included in a systematic review, studies must meet a threshold for scientific strength. The assumption is that only the relatively strong studies should count. How is that threshold established? Various expert groups have proposed criteria for adequate scientific strength, and their advantages and limitations are discussed later in this chapter.

Usually only a small proportion of studies are selected from a vast number of potential articles on the topic. Many articles describe the biology of disease and are not ready for clinical application. Others communicate opinions or summaries of existing evidence, not original clinical research. Many studies are not scientifically strong, and the information they contain is eclipsed by stronger studies. Relatively few articles report evidence bearing directly on the clinical question and are both scientifically strong and clinically relevant. Table 13.3 shows how articles were selected for a systematic review of statin drugs for the prevention of infections; only 11 of 632 publications identified were included in the review.

Are Published Studies a Biased Sample of All Completed Research?

The articles cited in systematic reviews should include all scientifically strong studies of the question, regardless of whether they have been published. Publication bias is the tendency for published studies to be

systematically different from all completed studies of a question. In general, published studies are more likely to be “positive” (i.e., to find an effect) for several reasons, related to a general preference for positive results. Investigators are less likely to complete studies that seem likely to end up negative and less likely to submit negative studies to journals. Journal peer reviewers are less likely to find negative studies interesting news, and editors are less likely to publish them.

Other selective pressures may favor positive studies. Authors may report outcomes that were not identified or made primary before data were collected and were selected for emphasis in publications after the results were available.

Table 13.3
Systematic Reviews Include Only a Small Proportion of All Articles on a Question. Articles Considered and Included in a Systematic Review of Statins and Prevention of Infections

632 POTENTIALLY RELEVANT
587 Duplicates
Excluded:
  Review, rationale, study protocol, or baseline report
  Not randomized clinical trial
  Not placebo controlled
  Follow-up < 12 months
  Intervention combined with other treatments
  Other
38 DATA ON INCIDENCE OF ADVERSE EVENTS AND MORTALITY
Excluded:
  Subgroup analyses of included, duplicate, or excluded trials
  Adverse events not specified
  No infections mentioned
  Data not provided by authors
11 INCLUDED IN META-ANALYSIS

Data from Van den Hoek HL, Bos WJ, de Boer A, et al. Statins and the prevention of infections: systematic review and meta-analysis of data from large randomized placebo controlled trials. BMJ 2011;343:d7281.

To get around these problems, some authors of systematic reviews make a concerted effort to find unpublished studies, including those that were funded and begun but not completed. They are aided in this effort by public registries of all studies that have been started.

Funnel plots are a graphical way of detecting bias in the selection of studies for systematic reviews. For each study, the effect size is plotted against some measure of the study’s size or precision, such as sample size, number of outcome events, or confidence interval (Fig. 13.1).

In the absence of publication bias (Fig. 13.1A), large trials (plotted at the top of the figure) are likely to be published no matter what they find and yield estimates of comparative effectiveness that are closely grouped around the true effect size. Small studies (plotted in the lower part of the figure) are more likely to vary in reported effect size because of statistical imprecision, and so to be spread out at the bottom of the figure, surrounding the true effect size. In the absence of publication bias, they would be as often in the lower right as the lower left of the figure. The result, in the absence of bias, is a symmetrical, peaked distribution: an inverted funnel. Publication bias, particularly the tendency for small studies to be published only if they are positive, shows up as asymmetry in the funnel plot (Fig. 13.1B). There are disproportionately fewer small studies that favor the control group, seen as a paucity of studies in the lower right corner of the figure.

Other factors, not directly related to distaste for negative studies, can cause publication bias. Funding of research by agencies with a financial interest in the results can also lead to distortions in the scientific record. Outcomes of studies sponsored by industry (usually drug and device companies) are more likely to favor the sponsor’s product than those with other funding sources. One possible reason is that industry sponsors sometimes require, as a condition of funding research, that they approve the resulting articles before they are submitted to journals. Industry sponsors have blocked publication of research they have funded that has not, in their opinion, found the “right” result.

How Good Are the Best Studies?

Clinicians need to know just how good the best studies of a question are so that they will know how seriously to take the conclusions of the systematic review. Are the studies so strong that it would be irresponsible to discount them? Or are they weak, suggesting that it is reasonable to not follow their lead?
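The funnel-plot reasoning described earlier in this section can be illustrated with a small simulation. The sketch below is not drawn from any study cited in this chapter: the true effect, the trial sizes, the formula for the standard error, and the rule that small trials are published only when they favor the intervention are all hypothetical assumptions chosen for illustration. Negative effects are taken to favor the intervention, matching the left side of a funnel plot.

```python
import random
import statistics

SMALL_CUTOFF = 200  # hypothetical threshold separating "small" from "large" trials


def simulate_trials(n_trials, true_effect, rng):
    """Return (sample_size, estimated_effect) pairs; negative effects favor the intervention."""
    trials = []
    for _ in range(n_trials):
        n = rng.choice([50, 100, 200, 1000, 5000])
        standard_error = 2.0 / n ** 0.5  # assumed: imprecision shrinks with the square root of n
        trials.append((n, rng.gauss(true_effect, standard_error)))
    return trials


def publish_selectively(trials):
    """Hypothetical bias rule: large trials always appear; small ones only if they favor the intervention."""
    return [(n, est) for n, est in trials if n > SMALL_CUTOFF or est < 0]


def mean_effect_of_small_trials(trials):
    """Average estimated effect among the small trials only."""
    return statistics.mean(est for n, est in trials if n <= SMALL_CUTOFF)


rng = random.Random(0)  # fixed seed so the run is repeatable
all_trials = simulate_trials(600, true_effect=-0.1, rng=rng)
published = publish_selectively(all_trials)

print("small trials, all completed: ", round(mean_effect_of_small_trials(all_trials), 3))
print("small trials, published only:", round(mean_effect_of_small_trials(published), 3))
```

Under these assumptions the published small trials look more favorable to the intervention than the full set of completed small trials, which is the asymmetry in the lower part of the funnel that the plot is meant to reveal.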

Figure 13.1 ■ Funnel plots to detect publication bias. Each trial is indicated by a circle. A. Trials are symmetrical in the shape of an inverted funnel, suggesting no publication bias. B. There are no trials in the lower right corner, suggesting that small trials not favoring the intervention were not published. [Panels: A, “No publication bias”; B, “Possible publication bias.” Vertical axis: precision of the estimate. Horizontal axis: magnitude of the effect size, from “favor intervention” to “favor control.”] (Redrawn with permission from Guyatt GH, Oxman AD, Montori V, et al. GRADE guidelines: 5. Rating the quality of evidence–publication bias. J Clin Epidemiol 2011;64:1277–1282.)

Many studies have shown that the individual elements of quality discussed throughout this book, such as concealment of treatment assignment, blinding, follow-up, and sample size, are systematically related to study results. The quality of the evidence identified by the systematic review can be summarized by a table showing the extent to which markers of quality are present in the studies included in the review.

Example

A systematic review summarized published reports of the eff

Individual measures of quality can also be combined into summary measures. A simple, commonly used scale for studies of treatment effectiveness, the Jadad Scale, includes whether the study was described as randomized and double-blinded and whether there was a description of withdrawals and dropouts (3). However, there is not a clear relationship between summary scores for study quality and results (4). Why might this be? The component studies in systematic reviews are already highly selected and, therefore, might not differ much from one another in quality. Also, summary measures of quality typically add up scores for the presence or absence of each element of quality, and there is no reason to believe that each makes an equal contribution to the overall validity of the study. It is not difficult to imagine, for example, that weakness in one aspect of a study might be so damaging as to render the entire study invalid, even though all the other aspects of quality are exemplary.

Example

In a randomized, placebo-controlled trial, women with unexp
21 Clinical Epidemiology: The

Variable                        No. of trials   No. of patients randomly assigned
All trials                           20               3,846
Concealment of allocation
  Adequate                            2               1,253
  Unclear                            18               2,593
Placebo control
  Yes                                17               3,091
  No                                  3                 755
Patient blinding
  Adequate                           12               1,952
  Unclear or no                       8               1,894
Intention to treat analysis
  Yes                                 3               1,553
  No or unclear                      17               2,293
Patients randomly assigned
  >200                                5               2,419
  ≤200                               15               1,427
Duration of follow-up
  >6 months                          11               2,430
  ≤6 months                           9               1,416

[Effect size column: plotted for each row in the original figure; axis ticks 1.0, 0.8, 0.6, 0.4, 0.2, 0.0.]

Figure 13.2 ■ Quality of 20 trials in a systematic review of the effectiveness of chondroitin on pain in patients with osteoarthritis of the knee or hip. (Data from Reichenbach S, Sterchi R, Scherer M, et al. Meta-analysis: chondroitin for osteoarthritis of the knee and hip. Ann Intern Med 2007;146:580–590.)

randomization, allocation concealment, sample size, and follow-up, were strong. Also, the iron and placebo pills were visually identical. However, iron causes characteristic side effects (e.g., dark stools and constipation) that could have prompted women to recognize whether they were taking iron or the placebo. Also, fatigue is a relatively “soft” outcome that might be influenced by belief in the effectiveness of iron supplements. Even if all other aspects of the study were beyond reproach (and the overall quality score was excellent), this one aspect of quality (patients being aware of which treatment they were taking, coupled with an outcome easily influenced by this knowledge) could have accounted for the small observed difference, rather

Therefore, while quality checklists and scores have their place, they are no substitute for critically examining the individual studies in a systematic review with an eye toward how much any imperfections in the studies might have influenced their results.

Is Scientific Quality Related to Research Results?
Studies meeting higher standards for methods should come closer to the truth than weak ones. Therefore, it may be informative for reviewers to show if there is a relationship between study quality and study conclusion. This is often done by examining the relationship between individual quality measures and outcomes because of the limitations of summary scores.
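One way to examine the relationship between an individual quality marker and results is to pool the trials separately within each level of the marker and compare the pooled effects. The following is a minimal sketch, assuming inverse-variance pooling within each stratum; the function names (`pool_fixed`, `pooled_by_marker`) and the trial data are hypothetical and are not taken from any review cited here.

```python
import math

def pool_fixed(effects, ses):
    """Inverse-variance pooled effect and its standard error."""
    weights = [1.0 / se ** 2 for se in ses]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    return pooled, math.sqrt(1.0 / sum(weights))

def pooled_by_marker(trials, marker):
    """Pool trials separately for each level of one quality marker."""
    strata = {}
    for trial in trials:
        strata.setdefault(trial[marker], []).append(trial)
    return {
        level: pool_fixed([t["effect"] for t in group], [t["se"] for t in group])
        for level, group in strata.items()
    }

# Hypothetical trials: effect size (negative favors treatment) and standard error.
trials = [
    {"effect": -0.05, "se": 0.10, "concealment": "adequate"},
    {"effect": 0.00, "se": 0.10, "concealment": "adequate"},
    {"effect": -0.60, "se": 0.20, "concealment": "unclear"},
    {"effect": -0.80, "se": 0.20, "concealment": "unclear"},
]

by_level = pooled_by_marker(trials, "concealment")
for level, (estimate, se) in by_level.items():
    low, high = estimate - 1.96 * se, estimate + 1.96 * se
    print(f"{level}: {estimate:.2f} (95% CI {low:.2f} to {high:.2f})")
```

In these invented data, the trials with unclear concealment show a much larger apparent benefit than the trials with adequate concealment, which is the kind of pattern such a comparison is meant to expose.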

Example
Figure 13.2 shows the relationship between several markers of quality and effect size with confidence intervals for trials
–0.13 to 0.07), where an effect size of –0.30 was considered minimally clinically relevant. That is, there was no clinically

Summarizing Results
The results of a systematic review are typically displayed as a forest plot showing the point estimate of effectiveness and confidence interval for each study in the review. Figure 13.3 illustrates a summary of studies comparing quinine to placebo for muscle cramps (6). The measure of effectiveness in this example is change in the number of leg cramps in a 2-week period. In other systematic reviews, it might be relative risk, attributable risk, or any other measure of effect. Point estimates are represented by boxes with their size proportional to the size of the study. A vertical line marks where neither quinine nor placebo was more effective.
The origin of the name “forest plot” is uncertain, but it is variously attributed to a researcher’s name or the appearance resembling a “forest of lines” (7). We believe they help readers “see the forest and the trees.”

Study or subgroup     Cramp number     Weight
CIBA 1988                 1.68           5.7%
Diener 2002               3              8.7%
Fung 1989                 3.69          13.0%
Gorlich 1991              2.69           0.4%
Hays 1986                 0.55          15.5%
Jansen 1994               2.7            3.4%
Jansen 1997               6              6.1%
Jones 1983                2              2.4%
Lee 1991                  3.55          11.1%
Leo Winter 1986           1.26          14.5%
Sidorov 1993              0.7            8.5%
Warburton 1987            2.18           9.5%
Woodfield 2005           14.2            1.2%
Total (95% CI)                         100.0%

[Plotted column: cramp number (95% CI) for each study and for the total; axis ticks 10, 5, 0, 5, 10, with "Favors quinine" to the left of 0 and "Favors placebo" to the right.]

Figure 13.3 ■ Example of a forest plot. Summary of 13 randomized trials of the effectiveness of quinine versus placebo on number of cramps in 2 weeks. (Redrawn with permission from El-Tawil S, Al Musa T, Valli H, et al. Quinine for muscle cramps. Cochrane Database Syst Rev 2010;(12):CD005044.)
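The ingredients of each row of a forest plot can be sketched as a small calculation: a point estimate, a 95% confidence interval, and a weight. The numbers below are made up (they are not the quinine data), the interval assumes a normal approximation (estimate plus or minus 1.96 standard errors), weights are assumed proportional to the inverse of the variance, and `forest_rows` is a hypothetical helper name.

```python
def forest_rows(studies, no_effect=0.0, z=1.96):
    """Return one row per study: CI limits, percent weight, and whether the CI includes no effect."""
    inverse_variances = [1.0 / s["se"] ** 2 for s in studies]
    total = sum(inverse_variances)
    rows = []
    for study, w in zip(studies, inverse_variances):
        low = study["estimate"] - z * study["se"]
        high = study["estimate"] + z * study["se"]
        rows.append({
            "name": study["name"],
            "low": low,
            "high": high,
            "weight_pct": 100.0 * w / total,
            # A "negative" study is one whose interval includes no effect.
            "negative": low <= no_effect <= high,
        })
    return rows

# Hypothetical studies: change in an outcome (negative favors treatment).
studies = [
    {"name": "Trial A", "estimate": -3.0, "se": 1.0},
    {"name": "Trial B", "estimate": -2.0, "se": 2.0},
    {"name": "Trial C", "estimate": -4.0, "se": 4.0},
]

rows = forest_rows(studies)
for r in rows:
    label = "includes no effect" if r["negative"] else "excludes no effect"
    print(f'{r["name"]}: CI {r["low"]:.2f} to {r["high"]:.2f}, '
          f'weight {r["weight_pct"]:.1f}%, {label}')
```

In this invented example all three point estimates favor treatment, yet only the most precise trial has an interval that excludes no effect, and that trial carries most of the weight.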

Forest plots summarize a tremendous amount of information that would otherwise require a great deal of effort to find.

1. Number of studies. The rows show the number of studies meeting stringent criteria for quality, in this case, 13.
2. What studies and when. The first column identifies names and year of publication for the component studies so that readers can see how old the studies are and where they can be found. (Full references to studies are not shown in the figure but are included in the article.)
3. Pattern of effect sizes. The 13 point estimates, taken as a whole, show what the various studies reported for effect sizes. In the example, all of the 13 studies favored quinine, but the size of the effects varies.
4. Precision of estimates. Many studies (6 of 13) were “negative” (their confidence intervals included no effect). This would give the impression, in a simple accounting of the number of “positive” and “negative” studies, that treatment is not effective or at least that effectiveness is questionable. The forest plot gives a different impression: All point estimates favor quinine, and the negative studies tend to be imprecise yet consistent with effectiveness.
5. The effects for the big studies. The large, statistically precise studies (seen by both narrow confidence intervals and large boxes representing point estimates) deserve more weight than small ones. In the example, the confidence intervals for the two largest studies do not include “no change in muscle cramp rate” (although one touches it).

In these ways, a single picture conveys in a glance a lot of basic information about the very best studies of a question.

COMBINING STUDIES IN META-ANALYSES
Meta-analysis is the practice of combining (“pooling”) the results of individual studies, if they are similar enough to justify a quantitative summary effect size. When appropriate, meta-analyses provide more precise estimates of effect sizes than are available in any of the individual studies.

Are the Studies Similar Enough to Justify Combining?
It makes no sense to pool the results of very different studies (studies of altogether different kinds of patients, interventions, doses, follow-ups, and outcomes). Treating “apples and oranges” as if they are all just fruits disregards useful information.
Investigators use two general approaches to decide whether it is appropriate to pool study results. One is to make an informed judgment about whether the research questions addressed by the trials are similar enough to constitute studies of the same question (or a set of reasonably similar questions).

Example
Do antioxidant supplements prevent gastrointestinal cancers? In

Reasonable people might disagree on how similar studies must be to justify combining them. We have been critical of pooling in this study, but other capable

people thought that pooling was justified and that the study was worth publishing in a leading journal.
Another approach is to use a statistical test for heterogeneity, the extent to which the trial results are different from each other beyond what might be expected by chance. Here the null hypothesis is that there is no difference among study results and the statistical test is used to see if there is a statistically significant difference among them. Failing to reject the null hypothesis of no difference among the studies may seem reassuring, but there is a problem. Most meta-analyses are of relatively few studies, so the tests have limited statistical power. The risk of a false-negative result, a conclusion that the studies are not heterogeneous when they are, is often high. Power is also affected by the number of patients in these studies and how evenly they are distributed among the studies. If one of the component studies is much larger than the others, it contributes most of the information bearing on the question. It may be more informative to examine the large study carefully and then contrast it with the others. In the antioxidant and gastrointestinal cancers example, the statistical test did show heterogeneity.

What Is Combined: Studies or Patients?
Until now, we have been discussing how studies are combined, which is the usual way meta-analyses are done. An even more powerful approach is to obtain data on each individual patient in each of the component studies and to pool these data to produce, in effect, a single large study called a patient-level meta-analysis. Relatively few meta-analyses are done this way because of the difficulties in obtaining all these data from many different investigators and reconciling how variables were coded. However, when patient-level meta-analyses can be done, it becomes possible to look for effects in clinically important subgroups of patients. The numbers of patients in these subgroups may be too small in the individual studies to produce stable estimates of effects but large enough when patients in several studies are pooled.

Example
Inhaled nitric oxide is effective in full-term infants with pulmonary hypertension and hypoxic respiratory failure. How
infants. Investigators did a patient-level meta-analysis of trials of nitric oxide in preterm infants (9). They pooled data on 3,298 preterm infants in 12 trials and found no effect on death or chronic lung disease (59% versus 61% favoring nitric oxide, relative risk 0.96, 95% confidence interval 0.92–1.01). Because there were data for each patient, it was possible to look for effectiveness in clinically relevant subgroups of preterm infants just as one might in a single large trial. Effectiveness did not differ according to gestational age, birth weight, multiple births, race, antenatal steroids, and seven other infant

How Are the Results Pooled?
When study results are pooled, each individual study contributes to the summary effect in relation to its size (strictly speaking, the inverse of its variance). Those that contribute large amounts of information are weighted more heavily than those that make small contributions. This is made explicit in the quinine example (Fig. 13.3), where the weight of each study is reported in the third column, with the total adding up to 100%. Four of the largest studies contribute more to the summary effect than the other nine smaller studies.
Two kinds of mathematical models are used to summarize studies in meta-analyses. These models differ in what is being summarized and in how conservative they are in estimating overall confidence intervals.
With the fixed effect model (Fig. 13.4A), it is assumed that each of the studies is of exactly the same question so that the results of the studies differ only by chance. This model is called “fixed effect” because it is assumed that there is only one underlying effect size for all the studies, although the results of the individual studies do differ from one another because of the play of chance.
The main problem with this approach is that, on the face of it, the studies rarely resemble one another so closely (in terms of patients, interventions, follow-up, and outcomes) that they can be considered simple replications of one another. Should vitamins A, C, and E and selenium really be considered simply examples of “antioxidant supplements,” even though they have different biochemical structures and mechanisms of action? Or are they different enough from one another that they might have different effects? To the extent that the study questions differ somewhat,
[Figure 13.4 panels. A. Fixed effects model: a single true treatment effect, with the results of multiple clinical trials randomly distributed around it; the pooled result is the estimated treatment effect. B. Random effects model: multiple true treatment effects (a distribution of treatment effects), with the results of multiple clinical trials randomly distributed; the pooled result is a single estimated treatment effect. Horizontal axis in both panels: treatment effects.]

Figure 13.4 ■ Models for combining studies in a meta-analysis. A. Fixed effect model. B. Random effects model. (Redrawn with permission from UpToDate, Waltham, MA.)
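The two models in Figure 13.4 can be made concrete with a small calculation. This is a minimal sketch, assuming inverse-variance weights for the fixed effect model, Cochran's Q as the heterogeneity statistic, and the DerSimonian-Laird estimate of between-study variance for the random effects model (one common choice, not necessarily the method used in any study cited here). The effect sizes and variances are invented.

```python
import math

def fixed_effect(effects, variances):
    """Inverse-variance pooled effect and its standard error."""
    w = [1.0 / v for v in variances]
    pooled = sum(wi * e for wi, e in zip(w, effects)) / sum(w)
    return pooled, math.sqrt(1.0 / sum(w))

def cochran_q(effects, variances):
    """Weighted sum of squared deviations from the fixed effect estimate."""
    pooled, _ = fixed_effect(effects, variances)
    return sum((e - pooled) ** 2 / v for e, v in zip(effects, variances))

def random_effects(effects, variances):
    """DerSimonian-Laird summary: adds between-study variance (tau^2) to each study's variance."""
    w = [1.0 / v for v in variances]
    q = cochran_q(effects, variances)
    df = len(effects) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * e for wi, e in zip(w_star, effects)) / sum(w_star)
    return pooled, math.sqrt(1.0 / sum(w_star)), tau2

# Hypothetical effect sizes and their variances from four trials.
effects = [0.10, 0.30, 0.80, 0.50]
variances = [0.01, 0.02, 0.04, 0.01]

fe, fe_se = fixed_effect(effects, variances)
q = cochran_q(effects, variances)
i2 = max(0.0, (q - (len(effects) - 1)) / q) * 100.0
re, re_se, tau2 = random_effects(effects, variances)

print(f"Fixed effect:   {fe:.3f} (SE {fe_se:.3f})")
print(f"Q = {q:.2f} on {len(effects) - 1} df, I^2 = {i2:.0f}%")
print(f"Random effects: {re:.3f} (SE {re_se:.3f}), tau^2 = {tau2:.4f}")
```

With heterogeneous inputs like these, the random effects standard error is larger than the fixed effect one, so its confidence interval is wider.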

the width of the summary confidence intervals calculated by the fixed effect model tends to imply a greater degree of precision than is actually the case. Also, by combining dissimilar studies, one loses useful information that might have resulted from contrasting them. The fixed effect model is used when the studies in a systematic review are clearly quite similar to each other.
The random effects model (Fig. 13.4B) assumes that the studies address somewhat different questions but that they form a closely related family of studies of a similar question. The studies are considered a random sample of all studies bearing on that question. Even if clinical judgment and statistical tests both suggest heterogeneity, it may still be reasonable to combine studies using a random effects model, as long as the studies are similar enough to one another (which is obviously a value judgment). Random effects models produce wider confidence intervals than fixed effect models and for this reason are thought to be more realistic. However, it is uncertain how the family of similar studies is defined and whether the studies are really a random sample of all such studies of a question. Nevertheless, because random effects models at least take heterogeneity into account and are, therefore, less likely to overestimate precision, they are the model used when (as is often the case) heterogeneity is present.
When an overall effect size is calculated, it is usually displayed at the bottom of the forest plot of

component studies as a diamond representing the summary point estimate and confidence interval (see Fig. 13.3). The summary effect is a more precise and formalized presentation of what might have been concluded from the pattern of results available in the forest plot.

Identifying Reasons for Heterogeneity
Random effects models are a way of taking heterogeneity into account when calculating a summary effect size, but a separate need is to identify characteristics of patients or treatments that are responsible for the variation in effects.
The most straightforward way to identify reasons for heterogeneity is to do subgroup analyses. This is possible in patient-level meta-analyses, as described in the example about nitric oxide treatment of preterm infants. However, if trials, not patients, are pooled, one must rely on less direct methods.
Another approach to understanding the reasons for heterogeneity is to do a sensitivity analysis, as discussed in Chapter 7. Summary effects are examined with and without trials that seem, either for clinical or statistical reasons, to be different from the others. For example, investigators might look at summary effects (or statistical tests for heterogeneity) after removing relatively weak trials or those in which the dose of drug was relatively small to see if study strength or drug dose accounts for differences in results across studies.
A modeling approach called meta-regression, similar to multivariable analysis (discussed in Chapter 11), can be used to explore reasons for heterogeneity when trials, not patients, are pooled. The independent variables are those reported in aggregate in each individual trial (e.g., the average age or proportion of men and women in those trials) and the outcomes are the reported treatment effect for each of those trials. The number of observations is the number of trials in the meta-analysis. This approach is limited by the availability of data on the covariates of interest in the individual trials, the compatibility of the data across trials, and the stability of models based on just a few observations (the number of trials in the meta-analysis). Another limitation, as with any aggregate-risk study, is the possibility of an “ecological fallacy,” as discussed in Chapter 12.
The various studies in a systematic review sometimes address the effectiveness of a set of interrelated interventions, not just a single comparison between an intervention and control group. For example, there are several techniques for bariatric surgery such as jejunoileal bypass, sleeve gastrectomy, adjustable gastric banding, among others. Randomized trials have compared various combinations of these to each other and to usual care, but no study has compared each technique to all the others. Network meta-analysis is a mathematical way of estimating the comparative effectiveness of interventions that are not directly compared in actual studies but can be indirectly compared by use of modeling. Using a network meta-analysis, investigators were able to estimate the respective effects of each bariatric surgery method compared to usual care, showing that each was effective and identifying the hierarchy of effectiveness (10).

CUMULATIVE META-ANALYSES
Usually the studies in a forest plot are represented separately in alphabetical order by first author or in chronological order. Another way to look at the same information is to present a cumulative meta-analysis. Component studies are put in chronological order, from oldest to most recent, and a new summary effect size and confidence interval are calculated each time the results of a new study became available. In this way, the figure represents a running summary of all the studies up to the time of each new trial. This is a Bayesian approach, as described in Chapter 11, where each new trial modifies prior belief in comparative effectiveness, established by the trials that went before.
The following example illustrates the kind of insights a cumulative meta-analysis can provide and also shows how meta-analyses in general and cumulative meta-analyses in particular are useful for establishing harmful effects. Individual trials, which are powered to detect effectiveness, are usually underpowered to detect harms because harms occur at a substantially lower rate. Pooling data may accumulate enough events to detect harmful effects.

Example
Rofecoxib is a non-steroidal anti-inflammatory drug (NSAID) th

of the comparator, naproxen. Investigators did a meta-analysis of 16 randomized trials of rofecoxib versus control (a placebo or another NSAID) (11). All but one of the studies were small, with only 1 to 6 cardiovascular events per trial, but there were 24 cardiovascular events in the one large trial. The summary relative risk of rofecoxib for myocardial infarction, including data from the large trial, was 2.24 (95% confidence interval 1.24–4.02). When was risk first apparent? A cumulative meta-analysis (Fig. 13.5) shows that a statistically significant effect was apparent in 2000, mainly because of the information contributed by the one large study. Subsequent studies tended to consolidate this finding and increase statistical precision. Rofecoxib was taken off the market in 2004, 4 years after a cumulative meta-analysis would have shown cardiovascular risk was present at conventional levels of statistical significance.

Cumulative meta-analyses have been used to show when the research community could have known about effectiveness or harm if it had available a meta-analysis of the evidence, but now it

Year    Patients    Events    P
1997       523         1      0.916
1998       615         2      0.736
         1,399         5      0.828
         2,208         6      0.996
         2,983         8      0.649
         3,324         9      0.866
1999     4,017        12      0.879
         5,059        13      0.881
2000     5,193        16      0.855
        13,269        40      0.070
        14,247        44      0.034
        15,156        46      0.025
        20,742        52      0.010
2001    20,742        58      0.007
        20,742        63      0.007
        21,432        64      0.007

[Plotted column: relative risk (95% CI) of myocardial infarction for each cumulative step; axis ticks 0.1, 1, 10, with "Favors rofecoxib" to the left of 1 and "Favors control" to the right.]

Figure 13.5 ■ A cumulative meta-analysis of studies comparing the effects of rofecoxib to placebo on rates of myocardial infarction. (Redrawn with permission from Juni P, Nartey L, Reichenbach S, et al. Risk of cardiovascular events and rofecoxib: cumulative meta-analysis. Lancet 2004;364:2021–2029.)
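A cumulative meta-analysis such as the one in Figure 13.5 amounts to re-pooling the evidence each time a trial is added in chronological order. The sketch below assumes inverse-variance pooling of log relative risks and a normal-approximation confidence interval; the trials are invented, `cumulative_meta` is a hypothetical helper, and this is not a reconstruction of the rofecoxib analysis.

```python
import math

def cumulative_meta(trials):
    """trials: (year, log relative risk, standard error).
    Returns (year, pooled RR, lower 95% limit, upper 95% limit) after each added trial."""
    results = []
    sum_w = 0.0
    sum_w_effect = 0.0
    for year, log_rr, se in sorted(trials, key=lambda t: t[0]):
        w = 1.0 / se ** 2
        sum_w += w
        sum_w_effect += w * log_rr
        pooled = sum_w_effect / sum_w
        pooled_se = math.sqrt(1.0 / sum_w)
        results.append((
            year,
            math.exp(pooled),
            math.exp(pooled - 1.96 * pooled_se),
            math.exp(pooled + 1.96 * pooled_se),
        ))
    return results

# Hypothetical trials, deliberately listed out of order; one is much more precise.
trials = [
    (1999, math.log(1.5), 1.0),
    (1997, math.log(1.0), 1.2),
    (2000, math.log(2.5), 0.35),
    (2001, math.log(2.0), 0.6),
]

results = cumulative_meta(trials)
for year, rr, low, high in results:
    print(f"{year}: RR {rr:.2f} (95% CI {low:.2f} to {high:.2f})")

# First year in which the running interval excludes a relative risk of 1.
first_sig = next(year for year, rr, low, high in results if low > 1.0)
print("Interval first excludes 1 in", first_sig)
```

As in the text, the running summary in this invented series crosses the conventional threshold when the one large, precise trial enters, and later trials mainly narrow the interval.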

is possible to have the results of meta-analyses in real time. In the Cochrane Collaboration, meta-analyses are updated every time a new trial becomes available. For each new meta-analysis, the summary effect size represents the accumulation of evidence up to the time of the update, though without explicitly showing what summary effect sizes had been in the past.

SYSTEMATIC REVIEWS OF OBSERVATIONAL AND DIAGNOSTIC STUDIES
We have discussed systematic reviews and meta-analyses, using randomized controlled trials as examples. However, systematic reviews are also useful for other kinds of studies, as illustrated by the following summary of observational studies.

Example
Patients with venous thromboembolism may have recurrences after anticoagulation is stopped. Investigators obtained p

The performance of a diagnostic test, being a relatively targeted question, is also well suited for systematic reviews, as the following example illustrates.

Example
Patients with low back and leg pain may have a herniated inte

STRENGTHS AND WEAKNESSES OF META-ANALYSES
Meta-analyses, when justified by relatively homogeneous results of component studies, can make many contributions to systematic reviews. They can establish that an effect is present or absent with more authority than individual trials or less formal ways of summing up effects. Pooling makes it possible to estimate effect sizes more precisely so that clinicians can have a better understanding of how big or small the true effect might be. Meta-analyses can detect treatment complications or differences in effects among subgroups, questions that individual trials usually do not have the statistical power to address. They make it possible to recognize the point in time when effectiveness or harm has been established, whereas this is much more difficult with less formal reviews.
Disadvantages of meta-analyses include the temptation to pool quite dissimilar studies, providing a

misleading estimate of effects and directing attention away from why differences in effects exist. Meta-analyses do not include information based on the biology of disease, clinical experience, and the practical application of best evidence to patient care, other dimensions of care that may have as large an influence on patient-centered outcomes as do choice of drugs or procedures.
Which comes closer to the truth, the best individual research or meta-analyses? It is a false choice. Meta-analyses cannot be better than the scientific strength of the individual studies that they summarize. Sometimes most of the weight in a meta-analysis is vested in a single trial, so there is relatively little difference between the information in that trial and in a summary of all studies of the same question. For the most part, the two (large strong studies and meta-analyses of all related studies) complement each other rather than compete. When they disagree, the main issue is why they disagree, which should be sought by examination of the studies themselves, not the methods in general.

Study                  TP, FP, FN, TN (digits run together in the source)
Albeck 1996            511510 4
Charnley 1951          63811 6
Demircan 2002          179185 82
Gurdjian 1961          929122213
Hakelius 1972          1,411 4225670
Kerr 1988              9820216
Knutsson 1961          155187 2
Kosteljanetz 1984      44231221
Kosteljanetz 1988      406 5 1
Spangfort 1972         2,088 308 6939

[Plotted columns: sensitivity and specificity for each study, each on an axis from 0 to 1.]

Figure 13.6 ■ A systematic review of diagnostic test performance. The sensitivity and specificity of straight leg raising as a test for lumbar disc herniation in patients with low back pain and leg pain. (Redrawn with permission from van der Windt DA, Simons E, Riphagen II, et al. Physical examination for lumbar radiculopathy due to disc herniation in patients with low-back pain. Cochrane Database Syst Rev 2010;(2):CD007431.)
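Each row of a figure like Figure 13.6 is derived from one study's two-by-two table. As a minimal sketch, sensitivity and specificity can be computed from true-positive (TP), false-positive (FP), false-negative (FN), and true-negative (TN) counts as shown below. The counts are hypothetical (they are not read from the figure), and the confidence interval uses the Wilson score method as one reasonable choice, not necessarily the one used in the cited review.

```python
import math

def sens_spec(tp, fp, fn, tn):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)

def wilson_ci(successes, n, z=1.96):
    """Wilson score interval for a proportion."""
    p = successes / n
    denom = 1.0 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half

# Hypothetical study: 100 patients with the disease, 100 without.
tp, fp, fn, tn = 90, 40, 10, 60

sens, spec = sens_spec(tp, fp, fn, tn)
sens_low, sens_high = wilson_ci(tp, tp + fn)
spec_low, spec_high = wilson_ci(tn, tn + fp)

print(f"Sensitivity {sens:.2f} (95% CI {sens_low:.2f} to {sens_high:.2f})")
print(f"Specificity {spec:.2f} (95% CI {spec_low:.2f} to {spec_high:.2f})")
```

Small studies give wide intervals with this calculation, which is why a display of all studies together is more informative than any single row.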

Review Questions
Read the following and select the best response.

13.1. A systematic review of observational studies of antioxidant vitamins to prevent cardiovascular disease combined the results of 12 studies to obtain a summary effect size and confidence interval. Which of the following would be the strongest rationale for combining study results?
A. You wish to obtain a more generalizable conclusion.
B. A statistical test shows that the studies are heterogeneous.
C. Most of the component studies are statistically significant.
D. The component studies have different, and to some extent complementary, biases.
E. You can obtain a more precise estimate of effect size.

13.2. You are asked to critique a review of the literature on whether alcohol is a risk factor for breast cancer. The reviewer has searched MEDLINE and found several observational studies of this question but has not searched elsewhere. All of the following are limitations of this search strategy except:
A. Studies with negative results tend to not be published.
B. MEDLINE searches typically miss some articles, even those included in the database.
C. MEDLINE does not include all of the world’s journals.
D. MEDLINE can be simultaneously searched for both content area and methods.

13.3. A systematic review of antiplatelet drugs for cardiovascular disease prevention combined individual patients, not trials. Which of the following is an advantage of this approach?
A. It is more efficient for the investigator.
B. Subgroup analyses are possible.
C. It is not necessary to choose between fixed and random effects models when combining data.
D. Publication bias is less likely.

13.4. Which of the following is not generally used to define a specific clinical question studied by randomized controlled trials?
A. Covariates that were taken into account
B. Interventions (e.g., exposure or experimental treatment)
C. Comparison group (e.g., patients taking placebo in a randomized trial)
D. Outcomes
E. Patients in the trials

13.5. Which of the following best describes ways of measuring study quality?
A. The validity of summary measures of study quality is well established.
B. A description of study quality is a useful part of systematic reviews.
C. In a summary measure of quality, strengths of the study can make up for weaknesses.
D. Individual measures of quality such as randomization and blinding are not associated with study results.

13.6. Which of the following kinds of studies is least likely to be published?
A. Small positive studies
B. Large negative studies
C. Large positive studies
D. Small negative studies

13.7. Which of the following is a comparative advantage of traditional (“narrative”) reviews over systematic reviews?
A. Readers can confirm that evidence cited is selected without bias.
B. Narrative reviews can review a broad range of questions bearing on the care of a condition.
C. They rely on the experience and judgment of an expert in the field.
D. They provide a quantitative summary of effects.
E. The scientific strength of the studies cited is explicitly evaluated.

13.8. Which of the following is not always part of a typical forest plot?
A. The number of studies that meet high standards for quality
B. A summary or pooled effect size with confidence interval
C. Point estimates of effect size for each study
D. Confidence intervals for each study
E. The size or weight contributed by each study

13.9. Which of the following cannot be used for identifying severity of illness as a reason for heterogeneity in a systematic review with meta-analysis?
A. Controlling for severity of illness in a study-level meta-analysis
B. A mathematical model relating average severity of illness to outcome across the component studies
C. A comparison of summary effect sizes in studies stratified by mean severity of illness
D. Subgroup analyses of patient-level data


13.10. Which of the following is the best justification for combining the results of five studies into a single summary effect?
A. The patients, interventions, and outcomes are relatively similar.
B. All studies are of high quality.
C. A statistical test does not detect heterogeneity.
D. Publication bias has been ruled out by a funnel plot.
E. The random effects model will be used to calculate the summary effect size.

13.11. Which of the following is an advantage of the random effects model over the fixed effect model?
A. It can be used for studies with time-to-event analyses.
B. It describes the summary effect size for a single, narrowly defined question.
C. It is better suited for meta-analyses of diagnostic test performance.
D. It gives more realistic results when there is heterogeneity among studies.
E. Confidence intervals tend to be narrower.

Answers are in Appendix A.

REFERENCES

1. Sackett DL, Richardson WS, Rosenberg W, et al. Evidence-based medicine. How to practice and teach EBM. New York: Churchill Livingstone; 1997.
2. Reichenbach S, Sterchi R, Scherer M, et al. Meta-analysis: chondroitin for osteoarthritis of the knee and hip. Ann Intern Med 2007;146:580–590.
3. Jadad AR, Moore RA, Carroll D, et al. Assessing the quality of reports of randomized clinical trials: Is blinding necessary? Controlled Clinical Trials 1996;17:1–12.
4. Balk EM, Bonis PAL, Moskowitz H, et al. Correlation of quality measures with estimates of treatment effect in meta-analyses of randomized controlled trials. JAMA 2002;287:2973–2982.
5. Verdon F, Burnand B, Stubi CL, et al. Iron supplementation for unexplained fatigue in nonanemic women: double-blind, randomized placebo controlled trial. BMJ 2003;326:1124–1228.
6. El-Tawil S, Al Musa T, Valli H, et al. Quinine for muscle cramps. Cochrane Database Syst Rev 2010;(12):CD005044.
7. Lewis S, Clarke M. Forest plots: trying to see the wood and the trees. BMJ 2001;322:1479–1480.
8. Bjelakovic G, Nikolova D, Simonetti RG, et al. Antioxidant supplements for prevention of gastrointestinal cancers: a systematic review and meta-analysis. Lancet 2004;364:1219–1228.
9. Askie LM, Ballard RA, Cutter GR, et al. Inhaled nitric oxide in preterm infants: an individual-level data meta-analysis of randomized trials. Pediatrics 2011;128:729–739.
10. Padwal R, Klarenbach S, Wiebe N, et al. Bariatric surgery: a systematic review and network meta-analysis of randomized trials. Obes Rev 2011;12:602–621.
11. Juni P, Nartey L, Reichenbach S, et al. Risk of cardiovascular events and rofecoxib: cumulative meta-analysis. Lancet 2004;364:2021–2029.
12. Douketis J, Tosetto A, Marcucci M, et al. Risk of recurrence after venous thromboembolism in men and women: patient level meta-analysis. BMJ 2011;342:d813.
13. van der Windt DA, Simons E, Riphagen II, et al. Physical examination for lumbar radiculopathy due to disc herniation in patients with low-back pain. Cochrane Database Syst Rev 2010;(2):CD007431.

Chapter 14

Knowledge Management
Where is the wisdom we have lost in knowledge?
Where is the knowledge we have lost in information?
—T.S. Eliot
1934

KEY WORDS
Knowledge management
Conflict of interest
Scientific misconduct
Point of care
Just-in-time learning
Clinical practice guidelines
MEDLINE
EMBASE
PubMed
Peer review
Structured abstract

Finding the best available answer to a specific clinical question is like finding a needle in a haystack. Essential information is mixed with a vast amount of less credible “factoids” and opinions, and it is a daunting task to sort the wheat from the chaff. Yet, that is what clinicians need to do. Critical reading is only as good as the information found.
Knowledge management is the effective and efficient organization and use of knowledge. This was a difficult task in the days of print media only. Fortunately, knowledge management has become a great deal easier in the era of electronic information. There are more and better studies on a broad range of clinical questions, widely available access to research results, and efficient ways to rapidly sort articles by topic and scientific strength. These opportunities followed the widespread availability of computers, the World Wide Web, and electronic information for clinical purposes.
Finding information may seem to be a low priority for clinicians still in training. They are surrounded by information, far more than they can handle comfortably, and they have countless experts to help them decide what they should take seriously and what they should disregard. However, developing one’s own plan for managing knowledge becomes crucial later on, whether in practice or academe.
Even with recent developments, effective and efficient knowledge management is a challenging task. In this chapter, we review modern approaches to clinical knowledge management. We will discuss four basic tasks: looking up information, keeping up with new developments in your field, remaining connected to medicine as a profession, and helping patients find good health information themselves.

BASIC PRINCIPLES

Several aspects of knowledge management cut across all activities.

Do It Yourself or Delegate?
Clinicians must first ask themselves, “Will I find and judge clinical research results for myself or delegate this task to someone else?” The answer is both. Clinicians should be capable of finding and critiquing information on their own; it is a basic skill in clinical medicine. But as a practical matter, it is not possible to go it alone for all of one’s information needs. There are just too many questions in a day and too little time to answer them on one’s own. Therefore, clinicians must find trustworthy agents to help them manage knowledge.


Which Medium?
One can obtain information via a rich array of media. They range from printed books and journals to digital information on the Web accessed through stationary and handheld platforms. There are audiotapes, videotapes, and more. The information is neither more nor less sound because of how it happens to come to you. Validity depends on authors, reviewers, and editors, not the medium. However, the availability of various media, with complementary advantages and disadvantages, makes it easier to find ones that match every user’s preferences.
A modern knowledge management plan should be based on electronic information on the Internet. The information base for clinical medicine is changing too fast for print media alone to be sufficient. For example, clinically important discoveries in antiviral therapy for HIV, innovative scanning technologies, and state-of-the-science cancer chemotherapy emerge from year to year, even month to month. The Internet can keep pace with such rapid change and also complement, but not replace, traditional sources.

Grading Information
Grading makes it possible for clinicians to grasp the basic value of information in seconds. Usually, the quality of the evidence (confidence in estimates of effects) and strength of recommendations are graded separately. Table 14.1 shows an example of one widely used grading scheme called GRADE, similar in principle to other approaches in general use. This grading is for interventions; grading of other kinds of information is less well developed. Notice that recommendations are based on the strength of the research evidence, depend on the balance of benefits and harms, and vary in how forcefully and widely the intervention should be offered to patients. Although criteria for grading are explicit, assigning grades still depends partly on judgment.

Table 14.1
Grading Recommendations for Treatment According to the Quality of Evidence (Confidence in Estimate of Effect, A–C) and Strength of Recommendation (1–2) with Implications. Based on GRADE Guidelines

Grade of Recommendation | Clarity of Risk/Benefit | Quality of Supporting Evidence | Implications
1A. Strong recommendation, high-quality evidence | Benefits clearly outweigh risks and burdens, or vice versa | Consistent evidence from well-performed randomized controlled trials, or overwhelming evidence in some other form. Further research is unlikely to change confidence in the estimates of benefits and risks. | Strong recommendations apply to most patients in most circumstances without reservation. Clinicians should follow a strong recommendation unless there is a clear and compelling rationale for an alternative approach.
1B. Strong recommendation, moderate-quality evidence | Benefits clearly outweigh risks and burdens, or vice versa | Evidence from randomized controlled trials with important limitations (inconsistent results, methodologic flaws, or imprecision), or very strong evidence of some other research design. Further research (if performed) is likely to change our confidence in the estimates of benefits and risks. | Strong recommendation that applies to most patients. Clinicians should follow a strong recommendation unless there is a clear and compelling rationale for an alternative approach.
1C. Strong recommendation, low-quality evidence | Benefits appear to outweigh risks and burdens, or vice versa | Evidence from observational studies, unsystematic clinical experience, or randomized controlled trials with serious flaws. Any estimate of effect is uncertain. | Strong recommendation that applies to most patients. Some of the evidence base supporting the recommendation is of low quality.
2A. Weak recommendation, high-quality evidence | Benefits closely balanced with risks and burdens | Consistent evidence from well-performed randomized controlled trials or overwhelming evidence of some other form. Further research is unlikely to change our confidence in the estimates of benefits and risks. | Weak recommendation. Best action may differ depending on circumstances or patient or societal values.
2B. Weak recommendation, moderate-quality evidence | Benefits closely balanced with risks and burdens, with some uncertainty in the estimates of benefits, risks, and burdens | Evidence from randomized controlled trials with important limitations (inconsistent results, methodologic flaws, or imprecision), or very strong evidence from some other research design. Further research (if performed) is likely to change confidence in estimates of benefits and risks. | Weak recommendation. Alternative approaches likely to be better for some patients under some circumstances.
2C. Weak recommendation, low-quality evidence | Uncertainty in the estimates of benefits, risks, and burdens; benefits may be closely balanced with risks and burdens | Evidence from observational studies, unsystematic clinical experience, or randomized controlled trials with serious flaws. Any estimate of effect is uncertain. | Very weak recommendation. Other alternatives may be equally reasonable.

Adapted from Guyatt GH, Oxman AD, Vist GE, et al. for the GRADE Working Group. GRADE: an emerging consensus on rating quality of evidence and strength of recommendations. BMJ 2008;336:924–926.

Misleading Reports of Research Findings
Until now, we have acted as if the only threats to the validity of published clinical research stem from the difficulties of applying good scientific principles to the study of human illness. That is, validity is about the management of bias and chance. Unfortunately, there are other threats to the validity of research results, related to the investigators themselves and the social, political, and economic environment in which they work. We are referring to the all-too-human tendency to report the results of research according to one’s own stake in the results.
Conflict of interest exists when investigators’ private interests compete with their responsibilities to be unbiased investigators. There are many possible competing interests:
■ Financial conflict of interest: When personal or family income is related to research results (this conflict is usually considered the most powerful and most difficult to detect)
■ Personal relationships: Supporting friends and putting down rivals
■ Intellectual passion: Being for one’s own ideas and against competing ones
■ Institutional loyalties: Putting the interests of one’s own school, company, or organization above others
■ Career advancement: Investigators get more academic credit for publishing interesting results in elite journals.
Conflict of interest exists in relation to a specific topic, not in general, and regardless of whether it has actually changed behavior.
How is conflict of interest expressed? Scientific misconduct (fraud, fabrication, and plagiarism) is an extreme example. Less extreme is selective reporting of research results, either by not reporting unwelcome results (publication bias) or reporting results according to whether they seem to be the “right” ones. Industry sponsors of research can sometimes block publication or alter how results are reported. To create a public record of whether this has occurred, randomized controlled trials are now registered on publicly available Web sites before data collection, making it possible to follow up on whether the results were published when expected and whether the reported endpoints were the same as when the trial began (1).
More subtle and more difficult to detect are efforts to “spin” results by the way they are described, for example, by implying that a very low P value means the results are clinically important or by describing effects as “large” when most of us would think they were not (2). All of us depend on peer reviewers and editors to limit the worst of this kind of editorializing in scientific articles.
We mention these somewhat sordid influences on the information clinicians (and their patients) depend on because they are, in some situations, every bit as real and important as the well-informed application of confidence intervals and control of confounding, the usual domain of clinical epidemiology. Research and its interpretation are human endeavors and will, therefore, always be tinged, to some extent, with
self-serving results. There are ongoing efforts to limit bias related to conflicts of interest, mainly by insisting on full disclosure but also by excluding people with obvious conflicts of interest from peer review of manuscripts and grants, authorship of review articles and editorials, and from guidelines panels.
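
As a rough illustration of how the two-part codes in Table 14.1 are read, the following Python sketch splits a code such as 1B into its strength of recommendation (the digit) and quality of evidence (the letter), which the table grades separately. The names `Grade`, `parse_grade`, and `describe` are hypothetical and are not part of GRADE itself; the sketch only decodes labels and says nothing about how a grade should be assigned, which still depends partly on judgment.

```python
from dataclasses import dataclass

# Labels taken from Table 14.1: the digit is the strength of the
# recommendation and the letter is the quality of the evidence.
STRENGTH = {"1": "strong", "2": "weak"}
QUALITY = {"A": "high", "B": "moderate", "C": "low"}


@dataclass(frozen=True)
class Grade:
    """Hypothetical container for the two separately graded parts."""
    strength: str  # "strong" or "weak"
    quality: str   # "high", "moderate", or "low"


def parse_grade(code: str) -> Grade:
    """Split a code such as '1B' into strength and quality."""
    cleaned = code.strip().upper()
    if len(cleaned) != 2 or cleaned[0] not in STRENGTH or cleaned[1] not in QUALITY:
        raise ValueError(f"not a grade code from Table 14.1: {code!r}")
    return Grade(STRENGTH[cleaned[0]], QUALITY[cleaned[1]])


def describe(code: str) -> str:
    """Return the row label used in Table 14.1 for a grade code."""
    grade = parse_grade(code)
    return f"{grade.strength.capitalize()} recommendation, {grade.quality}-quality evidence"


if __name__ == "__main__":
    for code in ("1A", "1C", "2B"):
        print(code, "->", describe(code))
```

Keeping the two parts in separate fields mirrors the point made in the text: a strong recommendation can rest on low-quality evidence (1C), and high-quality evidence can still lead to a weak recommendation (2A).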

LOOKING UP ANSWERS TO CLINICAL QUESTIONS
Clinicians need to be able to look up answers to questions that arise during the care of their patients. They need this for things they do not know but also to check facts they think they know but might not, because the information base for patient care is always changing.
It is best to get answers to questions just at the time and place where they arise during the care of patients. This has been called the point of care and the associated learning just-in-time learning. Answers can then be used to guide clinical decision making for the patient at hand. Also, what is learned is more likely to be retained than information encountered out of context in a classroom, lecture hall, book, or journal, apart from the need to know for a specific patient. In any case, postponing the answering of questions to a later time too often means they do not get answered at all.
For just-in-time learning to happen, several conditions must be in place (Table 14.2). Most patient care settings are time-pressured, so the answer must come quickly. As an office pediatrician pointed out, “If I added just 1 to 2 extra minutes to each patient visit, I would get home an hour later at the end of the day!” What clinicians need is not an answer but the best available answer, given the state of knowledge at the time. They need information that corresponds as closely as possible to the specific clinical situation their patient is in; if the patient is elderly and has several diseases, the research information should be about elderly patients with comorbidities. Clinicians need information sources that move with them as they travel from their office to home (where they take night and weekend call) and to hospitals and nursing homes. When all this happens, and it certainly can, the results are extraordinarily powerful.

Table 14.2
Conditions in Which Information Is Available at the Point of Care

Condition | Rationale
Rapid access | The information must be available within minutes for it to fit into the busy workflow of most patient care settings.
Current | Because the best information base for clinical decisions is continually changing, the information usually needs to be electronic (as a practical matter, on the Internet).
Tailored to the specific question | Clinicians need information that matches as closely as possible the actual situation of their individual patient.
Sorted by scientific strength | There is a vast amount of information for almost any clinical question but only a small proportion of it is scientifically strong and clinically relevant.
Available in clinical situations | Clinicians cannot leave their place of work to look up answers; they must find them right where they work.

Example
A patient sees you because he will be traveling to Ghana and wants advice on malaria prophylaxis. You are aware that the susceptibility to antimalarial drugs varies across the globe and that it is continually changing. The Centers for Disease Control and Prevention have a Web site (http://www.cdc.gov) with current information for travelers to all parts of the world. Using the computer in your clinic, you quickly find out which prophylactic drug this patient should take and for how long before, during, and after the trip. You are also reminded that he should have a booster dose of polio vaccine and be vaccinated for hepatitis A and B, typhoid, and yellow fever. The site lists clinics where these vaccines are available. The site also shows that northern Ghana, where your patient will be visiting, is in the “meningitis belt,” so he should also be vaccinated against meningococcal disease. The information you are relying on is an up-to-

Solutions
Clinical Colleagues
A network of colleagues with various and complementary expertise is a time-honored way of getting point of care information. Many clinicians have identified
local opinion leaders for this purpose. Of course, those opinion leaders must have their own sources of information, presumably more than just other colleagues.

Electronic Textbooks
Textbooks, even libraries, are on the Internet and made available to clinicians by their medical schools, health systems, and professional societies. For example, UpToDate (http://www.uptodate.com) is an electronic information resource for clinicians, the product of thousands of physician–authors and editors covering 9,000 topics in the equivalent of 90,000 printed pages (if it were ever printed).† Information is continually updated, peer reviewed, searchable and linked to abstracts of the original research, and recommendations are graded. UpToDate is available at the point of care throughout the world wherever the Internet can be accessed by computers or mobile platforms.

† Robert and Suzanne Fletcher are among the hundreds of editors of UpToDate.

Example
One author was seeing patients in Boston during the anthrax scare in 2001. Around that time, biologic terrorists spread

Other textbooks, such as ACP Medicine, Harrison’s Online, and many subspecialty textbooks, are also available in electronic form.

Clinical Practice Guidelines
Clinical practice guidelines are advice to clinicians about the care of patients with specific conditions. In addition to giving recommendations, good guidelines also make explicit the evidence base and rationale for those recommendations. Like evidence-based medicine, guidelines are meant to be a starting place for decision making about individual patients, to be modified by clinical judgment; that is, they are guidelines, not rules. High-quality guidelines represent the wise application of research evidence to the realities of clinical care, but guidelines vary in quality. Table 14.3 summarizes criteria for credible guidelines developed by the U.S. Institute of Medicine. A relatively comprehensive listing of guidelines can be found at the National Guideline Clearinghouse, which is available online at http://www.guidelines.gov.

Table 14.3
Standards for Trustworthy Clinical Practice Guidelines

Standard | Explanation
Transparency | How the guideline was developed and funded has been made explicit and is publicly accessible.
Conflict of Interest | Group members’ conflicts of interest related to financial, intellectual, institutional, and patient/public activities bearing on the guideline are disclosed.
Group Composition | Group membership was multidisciplinary and balanced, comprising a variety of methodological experts and clinicians, and populations expected to be affected by the guideline.
Systematic Review | Recommendations are based on systematic reviews that met high standards for quality.
Evidence and Strength of Recommendation | Each recommendation is accompanied by an explanation for its underlying reasoning, the level of confidence in the evidence, and the strength of the recommendation.
Description of Recommendations | The guideline states precisely what the recommended action is and under what circumstances it should be performed.
External Review | The guideline has been reviewed by the full spectrum of relevant stakeholders (e.g., scientific and clinical experts, organizations, and patients).
Updating | The guideline reports the date of publication and evidence review and plans for updating when there is new evidence that would substantially change the guideline.

Modified from Institute of Medicine. Clinical Practice Guidelines We Can Trust. Washington, DC: National Academies Press; 2011. The standards were for developing guidelines and have been modified to guide users in recognizing guidelines they can trust.

The Cochrane Library
Clinical scientists throughout the world have volunteered to review the world’s literature on specific clinical questions, to synthesize this information, to store it in a central site, and to keep it up to date. The collection of reviews is available at http://www.cochrane.org. Although the Cochrane Library is incomplete, given the vast number of questions it might address, it is an excellent source of systematic reviews, with meta-analyses when justified, on the effects of interventions and, more recently, of diagnostic test performance.

Citation Databases (PubMed and Others)
MEDLINE is a bibliographic database, compiled by the U.S. National Library of Medicine, covering approximately 5,000 journals in biomedicine and health, mostly published in English. It is available free of charge using a search engine, usually PubMed (http://www.ncbi.nlm.nih.gov/pubmed). MEDLINE can be searched by topic, journal, author, year, and research design. In addition to citations, some abstracts are available. EMBASE (http://www.embase.com) is also used and complements what is found in MEDLINE; beyond these two are many other bibliographic databases for more specialized purposes.
PubMed searches are limited by two kinds of misclassification. First, they produce false-negative results; that is, they miss articles that really are wanted. Second, searches produce many false-positive results; that is, they find more citations than are actually wanted on the basis of scientific strength and clinical relevance. For example, when Canadian nephrologists were asked to use PubMed to answer unique clinical questions in their field, they were able to retrieve 46% of relevant articles, and the ratio of relevant to non-relevant articles was 1/16, so that only about 1 in 17 retrieved citations (roughly 6%) was relevant (3). Both problems can be reduced, but not totally overcome, by better searching techniques.
PubMed searches are a mainstay for investigators and educators who have the time to construct careful searches and sort through the resulting articles, but PubMed searches are too inefficient to be of much practical value in helping clinicians, especially in answering day-to-day questions quickly. However, PubMed is particularly useful for looking up whether rare events have been reported.

Example
You are seeing a patient who you thought had cat scratch disea

Other Sources on the Internet
A vast amount of health information is posted on the Internet, some of which is quite helpful for health professionals. It can be found by a search engine such as Google or Google Scholar and by sites sponsored by the U.S. government such as MedlinePlus (http://www.nlm.nih.gov/medlineplus) and HealthFinder (http://healthfinder.gov) for health information and Health Hotlines (http://healthhotlines.nlm.nih.gov) for contact information of health-related organizations. Other countries have their own Internet resources.

SURVEILLANCE ON NEW DEVELOPMENTS
Keeping up with new developments in any clinical field is a daunting task. It is not that the pace of practice-changing discoveries is unmanageable. Rather, the relevant information is widely dispersed across many journals and mixed with a vast number of less important articles.

Example
How widely are the best articles in a field dispersed among jour
[Figure 14.1: plot of the cumulative percent of articles (y-axis, 0 to 100) against the number of journals (x-axis, 5 to 40). Plotted values: 17.3, 28.8, 38.9, 48.3, 55.5, 62.0, 66.3, 69.9, 72.8, 79.3, 86.5, 100.0.]
Figure 14.1 ■ How many journals would you have to read to keep up with the literature in your field? The proportion of scientifically strong, clinically relevant articles in internal medicine according to the number of journals, in descending order of yield. (Data from ACP Journal Club, 2011).
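
A curve of the kind shown in Figure 14.1 can be built from a count of key articles per journal: sort the journals by descending yield, accumulate the counts, and express each running total as a percent of all key articles. The Python sketch below is a minimal illustration under stated assumptions. The function names are hypothetical, and the counts in `example_counts` are invented for illustration only; they are not the ACP Journal Club data behind the figure.

```python
from itertools import accumulate


def cumulative_yield(article_counts):
    """Cumulative percent of articles, with journals sorted by descending yield."""
    ordered = sorted(article_counts, reverse=True)
    total = sum(ordered)
    if total == 0:
        raise ValueError("no articles counted")
    return [100.0 * running / total for running in accumulate(ordered)]


def journals_needed(article_counts, target_percent):
    """Smallest number of top-yield journals that reaches the target share."""
    percents = cumulative_yield(article_counts)
    for n_journals, percent in enumerate(percents, start=1):
        if percent >= target_percent:
            return n_journals
    raise ValueError("target_percent must not exceed 100")


# Hypothetical counts of key articles per journal (illustration only,
# not the data plotted in Figure 14.1).
example_counts = [40, 25, 15, 10, 5, 3, 1, 1]

if __name__ == "__main__":
    for target in (50, 75, 90):
        print(f"{target}% of key articles:",
              journals_needed(example_counts, target), "journals")
```

The long tail of low-yield journals is what makes the last few percentage points expensive: each additional journal adds less than the one before it.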

encounter according to the number of journals read, starting with the highest yield journal and adding journals in order of descending yield. One would need to regularly review 4 journals to find 50% of these articles, 8 journals to find 75%, and 20 journals to find 90% of the key articles in

Therefore, it is not possible for individual readers, even with great effort, to find all of the essential articles in a field on their own. They need to delegate the task to a trusted intermediary, one who will review many journals and select articles according to criteria they agree with.
Fortunately, help is available. Most clinical specialties sponsor publications that summarize major articles in their field. These publications vary in how explicit and rigorous their selection process is. At one extreme, ACP Journal Club publishes its criteria for each kind of article (e.g., studies of prevention, treatment, diagnosis, prognosis) and provides a critique of each article it selects. At the other extreme, many newsletters include summaries of articles without making explicit either how they were selected or what their strengths and limitations are.
There are now various ways to have new information (published research articles, guidelines, white papers, and news articles) in your specific areas of interest sent to you. One way is to identify specific topics and have new information about them automatically sent to you as it arises by means of RSS feeds and other services. Another is to participate in one of a growing number of social media, such as Facebook and blogs, where other people discover and select information you might want to know about and send it to you, just as you send new information to them. Less structured examples are research teams who share articles and news stories related to their work and ward teams in teaching hospitals where residents, students, and attending physicians share articles about their patients’ medical problems. Social media can be effective and efficient if you choose the right colleagues to participate with.

JOURNALS
Journals have a central role in the health professions. Everything we have said about clinical epidemiology and knowledge management is based on a foundation of original research published in peer-reviewed journals. Research reports are selected and improved before publication by a rigorous process involving critical
review by editors guided by peer review, comments by experts in the article’s content area and methods who provide advice on whether to publish and how a manuscript (the term for the article before it is published) could be improved. The reviewers are advisors to the editor (or editorial team), not the ones who directly decide the fate of the manuscript. Peer-review and editing practices, along with the evidence base and rationale for them, are summarized on the official Web sites of the World Association of Medical Editors (http://www.wame.org) and the International Committee of Medical Journal Editors (http://www.icmje.org). Peer review and editing improve manuscripts, but published articles are far from perfect (5); therefore, readers should be grateful for the journals’ efforts to make articles better but also maintain a healthy skepticism about the quality of the end result. Working groups have defined the information that should be in a complete research article according to the type of study (randomized controlled trial, diagnostic test evaluation, systematic review, etc.) (Table 14.4). Readers can use these checklists to see if all the necessary information is included in an article, just as investigators use them to ensure that their articles are complete.
Journals themselves are not particularly helpful for some elements of knowledge management. Reading individual journals is not a reliable way of keeping up with new scientific developments in a field or of looking up the answers to clinical questions.
But journals do add another dimension: exposing readers to the full breadth of their profession. Opinions, stories, untested hypotheses, commentary on published articles, expressions of professional values, as well as descriptions of the historical, social, and political context of current-day medicine and much more, reflect the full nature of the profession (Table 14.5). The richness of this information completes the clinical picture for many readers. For example, when Annals of Internal Medicine began publishing stories about being a doctor (6), many readers remarked that while reports

Table 14.4
Guidelines for Reporting Research Studies

Study Type | Name of Statement | Citation
Randomized Controlled Trials | Consolidated Standards of Reporting Trials (CONSORT) | http://www.consort-statement.org
Diagnostic Tests | Standards for Reporting of Diagnostic Test Accuracy (STARD) | http://www.stard.org
Observational Studies | Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) | http://www.strobe-statement.org
Non-Randomized Studies of Educational, Behavioral, and Public Health Interventions | Transparent Reporting of Evaluations with Nonrandomized Design (TREND) | http://www.cdc.gov/trendstatement
Meta-analyses of Randomized Controlled Trials | Quality of Reporting of Meta-analyses (QUOROM) | Moher D, Cook DJ, Eastwood S, et al. Improving the quality of reports of meta-analyses of randomized controlled trials: the QUOROM statement. Lancet 1999;354:1896–900.
Meta-analyses of Observational Studies | Meta-analyses of Observational Studies in Epidemiology (MOOSE) | Stroup DF, Berlin JA, Morton SC, et al. Meta-analysis of observational studies in epidemiology: a proposal for reporting. Meta-analysis Of Observational Studies in Epidemiology (MOOSE) group. JAMA 2000;283:2008–2012.
Systematic Reviews of Diagnostic Accuracy Studies | Quality Assessment of Diagnostic Accuracy Studies (QUADAS) | Whiting PF, Rutjes AWS, Westwood ME, et al. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med 2011;155:529–536.
Genetic Risk Prediction Studies | Genetic Risk Prediction Studies (GRIPS) | Janssens AC, Ioannidis JP, van Duijn CM, et al. Strengthening the reporting of genetic risk prediction studies: the GRIPS statement. Ann Intern Med 2011;154:421–425.
of research and reviews were essential, the experience of being a doctor was what they cared about the most.

Table 14.5
The Diverse Contents of a General Medical Journal

Science | The Profession
Original research | Medical education
Preliminary studies | History
Review articles | Public policy
Editorials (for synthesis and opinion) | Book reviews
Letters to the editor | News
Hypotheses | Stories and poems

“Reading” Journals
The ability to critique research on one’s own is a core skill for clinicians. But this skill is used selectively and to different degrees, just as the completeness of the history and physical examination, which is part of a clinician’s repertoire, is used to a varying extent from one patient encounter to another. It is not necessary to read journals from cover to cover, any more than one would read a newspaper from front to back. Rather, one browses, reading in layers, according to the time available and the strength and relevance of each individual article.
Approaches to streamlined reading vary. It is a good idea to at least survey the titles (analogous to newspaper headlines) of all articles in an issue to decide which articles matter most to you. For those that do, you might read more deeply, adjusting the depth as you go (Fig. 14.2). The abstract is the best place to start, and many responsible readers stop there. If the conclusions are interesting, the methods section might come next; there, one finds basic information bearing on whether the conclusions are credible. One might want to look at the results section to see a more detailed description of what was found. Key figures (e.g., a survival curve for the main results of a randomized trial)

[Figure 14.2: flow diagram pairing each reader question with the part of the article that answers it, with OPTION and STOP branches between steps.]

Your Question | Where to Look
Cursory review (title and abstract) |
What is this study about? | Title
What was concluded? | Conclusions
Is it likely to be true? | Design
To whom does it apply? | Patients, setting
What was found? | Results
In depth review (article) |
Importance of the research question? | Introduction
How big was the effect? | Figures and tables
How strong were the methods? | Methods
Context | Discussion

Figure 14.2 ■ Reading a journal article in layers. Individual readers can progress deeper into an article or stop and go on to another, according to its scientific strength and clinical importance to them.
Chapter 14: Knowledge Management 235

may communicate the “bottom line” efficiently. A few articles are so important, in relation to one’s particular needs, that they are worth reading word for word, perhaps for participation in a journal club.
Structured abstracts are organized according to the kinds of information that critical readers depend on when deciding whether to believe a study’s results. Table 14.6 shows headings of abstracts in structured form, along with the kind of information associated with them. (Traditional abstracts, with headings for Introduction, Methods, Results, and Discussion, are a shortened version.) These headings make it easier for readers to find the information they need and also force authors to include this information, some of which might otherwise have been left out if the abstract were less structured.

Table 14.6
The Organization of a Structured Abstract

Heading                  Value to Reader
Context                  Burden of suffering from the disease/illness. Why is the research question important? What is already known?
Objective                What the investigators set out to learn
Setting                  The setting to which results can be generalized, such as community, primary care practices, referral centers, and the like
Participants             What kinds of patients (regarding generalizability)? How many (regarding statistical power/precision)?
Design                   How strong is the study? How well is it matched to the research question?
Intervention (if any)    Is the intervention state-of-the-art? Is it feasible in your setting?
Main outcome measures    Are the outcomes clinically important?
Results                  What was found?
Limitations              What aspects of the study threaten the validity of the conclusions?
Conclusion               Do the authors believe that the result answers their question? How convincingly?

Unfortunately, many clinicians set goals for journal reading that are higher than they can achieve. They believe they must look at each article in detail, which requires a lot of time with each journal issue. Too often, this results in postponing reading and perhaps never getting to it at all, and it can generate a lot of anxiety, self-reproach, and cluttered workspaces. If such negative feelings are associated with reading medical journals, something is wrong.

GUIDING PATIENTS’ QUEST FOR HEALTH INFORMATION

Patients now look up health information on the Internet. As a result, clinicians have different responsibilities for teaching their patients.
One responsibility is to guide patients to the most credible Web sites. Simple searches, such as for migraine headaches or weight loss, find a rich array of sites, some among the best in the world, others zealous and misguided, and still others commercial and self-serving. Clinicians should be able to suggest especially good Web sites for the patient’s particular questions. There are many that are sponsored by governments, medical schools, professional organizations, and patient advocacy groups. Clinicians can also help patients recognize the best health information on the Web, guided by criteria formulated by the Medical Library Association (Table 14.7).

Table 14.7
Criteria Patients Can Use to Evaluate Health Information on the Web

1. Sponsorship
• Can you easily identify the site sponsor? Are advisory board members and consultants listed?
• What is the Web address (gov = government, edu = educational institution, org = professional organization, com = commercial)?
2. Currency
• The site should have been updated recently, and the date of the latest revision posted.
3. Factual Information
• The information should be about facts, not opinions, and can be verified from primary sources such as professional articles.
• When opinions are stated, the source (a qualified professional or organization) should be identified.
4. Audience
• The Web site should clearly state whether the information is for consumers or health professionals. (Some sites have separate areas for consumers and health professionals.)
Modified from Medical Library Association. A User’s Guide to Finding and Evaluating Health Information on the Web. Available at http://mlanet.org/resources/userguide.html. Accessed August 1, 2012.

Another responsibility is to help patients weigh the value of information they do find. Here, clinicians have a great deal to offer based on their understanding of clinical epidemiology, the biology of

disease, the clinical presentations of illness, the difference between isolated observations and consistent patterns of evidence, and much more. All of this is a valuable complement to what patients bring to the encounter: intense interest in a specific clinical question and the willingness to spend lots of time searching for answers.

PUTTING KNOWLEDGE MANAGEMENT INTO PRACTICE

Clinical epidemiology, as described in this book, is intended to make clinicians’ professional lives easier and more satisfying. Armed with a sound grounding in the principles by which the validity and generalizability of clinical information are judged, clinicians can more quickly and accurately detect whether the scientific basis for assertions is sound. For example, they can see when confidence intervals are consistent with clinically important benefit or harm or that a study of the effects of an intervention includes neither randomization nor other efforts to deal with confounding. They are better prepared to participate in discussions with colleagues outside their specialty about patient care decisions. They have a better basis for deciding how to delegate some aspects of their information needs. They can gain more confidence and experience greater satisfaction with the intellectual aspects of their work.
Beyond that, every clinician should have a plan for knowledge management, one that fits his or her particular needs and resources. The Internet must be an important part of the plan because no other medium is so comprehensive, up-to-date, and flexible. Much of the information needed to guide patient care decisions should be available at the point of care so that it can be brought to bear on the patient at hand. There is no reason why the information you use cannot be the best available in the world at the time, as long as you have access to the Internet.
A workable approach to knowledge management must be active. Clinicians should set aside time periodically to revisit their plan, to learn about new opportunities as they arise, and to acquire new skills as they are needed. There has never been a time when the evidence base for clinical medicine was so strong and accessible. Why not make the most of it?

Review Questions
Read the following statements and select the best answer.

14.1. You are finishing residency and will begin practice. You want to establish a plan for keeping up with new developments in your field even though there are few professional colleagues in your community. All of the following might be useful, but which will be most useful to you?
A. Subscribe to a few good journals.
B. Buy new editions of printed textbooks.
C. Subscribe to a service that reviews the literature in your field.
D. Search MEDLINE at regular intervals.
E. Keep up contacts with colleagues in your training program by e-mail and telephone.

14.2. You can rely on the best general medical journals in your field to:
A. Provide answers to clinical questions.
B. Assure that you have kept up with the medical literature.
C. Guarantee that the information they contain is beyond reproach.
D. Expose you to the many dimensions of your profession.

14.3. Many children in your practice have attacks of otitis media. You want to base your management on the best available evidence. Which of the following is the least credible source of information on this question?
A. A clinical practice guideline by a major medical society
B. A systematic review published in a major journal
C. The Cochrane Database of Systematic Reviews
D. The most recent research article on this question

14.4. A search of MEDLINE is especially useful for which of the following?
A. Finding all of the best articles bearing on a clinical question
B. An efficient strategy for finding the good articles
C. Looking for reports of rare events
D. Keeping up with the medical literature
E. Being familiar with the medical profession as a whole

14.5. Which of the following is accomplished by peer review of research manuscripts before they are published?
A. Exclude articles by authors with a conflict of interest.
B. Make the published article accurate and trustworthy.
C. Relieve readers of the need to be skeptical about the study.
D. Decide for the editors about whether they should publish the manuscript.

14.6. An author of an article showing that screening colonoscopy is more effective in preventing colorectal cancer than other forms of screening would have conflicts of interest if he or she had any of the following except:
A. Clinical income from performing colonoscopies
B. Investment in a company that makes colonoscopes
C. Investment in medical products in general
D. Publications of articles that have consistently advocated colonoscopy as the best screening test
E. Rivalry with other scholars who advocate another screening test

14.7. Which of the following is the least useful way of looking up answers to clinical questions at the point of care?
A. Subscribing to several journals and keeping them available where you see patients
B. Guidelines on http://www.guidelines.gov
C. The Cochrane Library on the Internet
D. A continually updated electronic textbook

14.8. Which of the following should be least reassuring to a patient about the quality of a Web site providing information about HIV?
A. The site is sponsored by a governmental agency and names its advisory board members.
B. The site provides facts, not opinions.
C. The primary source of information is stated.
D. The author is a well-known expert in the field.
E. The date of the last revision is posted and recent.

14.9. Which of the following is not part of grading clinical recommendations using the GRADE system?
A. Deciding whether to use a diagnostic test
B. Takes into account the balance of benefits and harms
C. Rates the quality of scientific evidence separately
D. Suggests how commonly and how forcefully a treatment should be recommended
E. Rates the strength of the evidence and of recommendations separately

14.10. A comprehensive approach to managing knowledge in your field would include which of the following?
A. Subscribing to some journals and browsing them
B. Establishing a plan for looking up information at the point of care
C. Finding a publication that helps you keep up with new developments in your field
D. Identifying Web sites you can recommend to your patients
E. All of the above

Answers are in Appendix A.

Answers to Review Questions
CHAPTER 1 INTRODUCTION

1.1 D. Samples can give a misleading impression of the situation in the parent population, especially if the sample is small.
1.2 E. Generalizing the results of a study of men to the care of a woman assumes that the effectiveness of surgery for low back pain is the same for men and women.
1.3 B. The two treatment groups did not have an equal chance of having pain measured.
1.4 A. The difference in recovery between patients who received surgery versus medical care may be the result of some other factor, such as age, that is different between the two treated groups and not the result of surgery itself.
1.5 B. These are biases related to measurement of the outcome (recovery from pain).
1.6 C. The other medical conditions confound the relationship between treatment and outcome; that is, they are related to both treatment and recovery and might be the reason for the observed differences.
1.7 C. The observation that histamines mediate inflammation in hay fever leads to a promising hypothesis that blocking histamines will relieve symptoms, but the hypothesis needs to be tested in people with hay fever. The other answers all assume more about the causes of symptoms than is actually stated. For example, histamine is only one of many mediators of inflammation in hay fever.
1.8 C. Samples may misrepresent populations by chance, especially when the samples are small.
1.9 A. Generalizing from younger to older patients is a matter of personal judgment based on whatever facts there are that bear on whether older patients respond to treatment the same as younger patients. Internal validity is about whether the results are correct for the patients in the study, not about whether they are correct for other kinds of patients. Both bias and chance affect internal validity.
1.10 B. If volunteers had the same amount of exercise as the non-volunteers, a difference in the rate of CHD could not be explained by this variable. Differences in the groups (selection bias) or in the methods used to determine CHD (measurement bias) could account for the finding.
1.11 C. The drug’s effect on mortality is the most important thing to determine. It is possible that some arrhythmia-suppressing drugs can increase the rate of sudden death (in fact, this has happened). In such situations, the intermediate biologic outcome of decreased arrhythmias is an unreliable marker of the clinical outcome, sudden death.
1.12 B. Measurement bias is frequently an issue when patients are asked to recall something that may be related to the illness, because those with illness may have heightened recall for preceding events that they think might be related to their illness.
1.13 B. The questioning (measurement) about contraceptive use was not the same in the two groups.
1.14 D. Small study numbers increase the possibility that chance accounts for differences between groups.
1.15 A. Different neighborhoods often are surrogates for socioeconomic variables that are related to numerous health outcomes, which may mean the two groups are different in terms of important covariates.

CHAPTER 2 FREQUENCY

2.1 C. The population at risk is dynamic: people are entering and leaving it continually, so the rates from cancer registries should be reported as person-years.
2.2 A. This is point prevalence because cases are described for a point in time in the course of their life.
2.3 A. There is no follow-up (no time dimension) in a prevalence study.
2.4 D. A random or probability sample is representative of the population sampled in the long run, if enough people in the population are sampled.
2.5 B. In a steady state, duration of disease = prevalence/incidence. In this case, it is 1/100 divided by 40/100,000/year = 25 years.
2.6 D. A study beginning with existing patients with a disease and looking back at their previous experience with the disease would not be a cohort study, which begins with patients with something in common (e.g., new onset of disease) and follows them forward in time for subsequent health events.
2.7 C. Even though sampling every 10th patient might produce a representative sample, it would not be random and could misrepresent the population sampled.
2.8 C. Children with another seizure were all part of the original cohort of children with a first seizure.
2.9 E. A larger sample would produce an estimate of incidence that is closer to the true one by reducing random error. It would not introduce a systematic difference in incidence as A–D would.
2.10 A. It is endemic because it is confined to a geographic area.
2.11 E. The cases existed any time in a 3-month period.
2.12 C. Dynamic populations, such as residents of the State of North Carolina, are continually turning over because of births and deaths as well as out-migration and in-migration.
2.13 A. Children in the cohort have in common being born in North Carolina in 2012 and are followed up for scoliosis as it develops over time.
2.14 D. Prevalence studies are not useful for diseases of short duration because there will be few cases at a point in time. They do not measure incidence and cannot provide much evidence bearing on cause.
2.15 E. A number like 800,000 without an associated denominator is not a rate and is, therefore, not any of the incidence and prevalence rates given in A–D.
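The steady-state relationship in answer 2.5 (duration = prevalence/incidence) can be checked with a short calculation. This is a minimal sketch: the function name `mean_duration_years` is hypothetical, and the incidence is taken as 40 per 100,000 per year, an assumption chosen because it is the value that reproduces the stated 25-year answer.

```python
from fractions import Fraction


def mean_duration_years(prevalence, incidence_per_year):
    """Steady-state average duration of disease: prevalence / incidence."""
    return prevalence / incidence_per_year


# Assumed inputs: prevalence of 1/100 and incidence of 40/100,000 per year.
prevalence = Fraction(1, 100)
incidence = Fraction(40, 100_000)

duration = mean_duration_years(prevalence, incidence)
print(f"Average duration: {float(duration):.0f} years")
```

Exact fractions are used here only to avoid floating-point rounding in a calculation this small.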

CHAPTER 3 ABNORMALITY

3.1 D. Ordinal
3.2 B. Dichotomous
3.3 A. Interval (continuous)
3.4 E. Interval (discrete)
3.5 C. Nominal
3.6 C. This approach, called construct validity, is one of the ways of establishing the validity of a measurement. Note that answers B and D relate to reliability, not validity.
3.7 D. All except D are reasons for variation in measurements on a single patient, whereas D is about variation among patients.
3.8 D. Because clinical distributions do not necessarily follow a normal distribution, abnormality should not be defined by whether or not they do.
3.9 A. Naturally occurring distributions may or may not resemble the normal curve.
3.10 C. Although ultimately you and the patient may decide on a trial of statin therapy, this is not an emergency situation. The cholesterol test should be repeated; elevated values are often lower on repeat testing. A trial of exercise and weight loss can also lower cholesterol. Patients who are otherwise healthy and with a 10% 10-year risk of cardiovascular diseases are usually not immediately prescribed medication to lower cholesterol.
3.11 B. The figure shows one mode (hump). The median and mean are similar to each other, and are both below 4,000 g.
3.12 C. Two standard deviations encompass 95% of the values. The distribution is not skewed. Range is sensitive to extreme values.
3.13 B. One standard deviation encompasses about 2/3 of values around the mean. Looking at the figure, 2/3 of the values around the mean would be approximately 3,000 to 4,000 g.
3.14 D. Panel A of Figure 3.13 shows approximately even dispersion of measurements above and below the true value, suggesting chance variation. The effect of different observers making the measurements (interobserver variability) could also be at play.
3.15 G. Panel B shows skewed measurements to the right of the true value. In other words, the hospital staff tended to overestimate the electronic monitor results and record normal fetal heart rates when they were abnormal on electronic monitoring, thus demonstrating biased measurements. Chance and interobserver variability may also be involved because not all the measurements are the same.
3.16 G. Panel C shows similar results to Panel B except that the bias is in the other direction, that is, the hospital staff tended to underestimate the electronic monitor results. In both Panels B and C, the hospital staff measurements tended to “normalize” what were abnormal electronic monitor measurements.

CHAPTER 4 RISK: BASIC PRINCIPLES

4.1 A. When a risk factor is common in a population, it is important to compare its prevalence in people with and without disease. There is no comparison in the example. Lung cancer is a common cancer, and smoking is a strong, not weak, risk factor for cancer. The fact that there are other risk factors for lung cancer is not relevant.
4.2 B. Risk factors are sought when a new disease appears, as was the case with HIV.
4.3 D. Risk prediction models are used in all situations.
4.4 C. Because so many more women were assigned to the low-risk stratum, the largest number of cases will likely come from that group. Risk prediction models give probabilities but do not predict which persons in a group will develop disease. The figure does not give information about incorrect assignments to the strata.
4.5 C. The patient is unlikely to develop colorectal cancer in the next 5 years because he is a member of the group with a low probability, but he could be one of the 2% who develops disease. Even for a common cancer, there are subgroups of people with a low probability of developing it. For example, few young people develop colorectal cancer.
4.6 B. Well-constructed risk models are more likely to be highly calibrated (able to predict the percentage of a group that develops disease) than have good discrimination (able to predict which individuals in the group will develop disease).
4.7 A. Markers help identify groups with increased risk of disease, but because they are not causes of disease, removing the marker does not prevent disease. Markers are usually confounded with cause of disease (see Chapters 1 and 5).
4.8 B. Symptoms, physical examination findings, and laboratory tests are generally more important than risk factors when diagnosing disease.
4.9 D. The overlap of the two groups demonstrates that the risk model does not discriminate well between women who did and did not develop breast cancer over 5 years. The figure gives no direct information about calibration and stratification. Few women who developed breast cancer were at high risk.
4.10 C. Good calibration does not impair discrimination. However, risk models can discriminate poorly even if they are highly calibrated, as the example in the text shows.

CHAPTER 5 RISK: EXPOSURE TO DISEASE

5.1 C. Retrospective cohort studies do not allow investigators the luxury of deciding what data to collect. They can only choose from the available data collected at a time before the study originated. Also, these data, which are often collected for clinical purposes, may not have been collected systematically and in the same way. Answers A, B, and D are correct for both prospective and retrospective cohort studies.
5.2 B. Relative risk is the ratio of incidence of an outcome (disease) in exposed ÷ non-exposed. Therefore, the relative risk of stroke in smokers compared to non-smokers in their 40s is 29.7 ÷ 7.4 = 4.0.
5.3 D. Attributable risk (AR) is the risk of disease attributable to the risk factor and is calculated as the difference in absolute risk (incidence) of exposed persons minus that in non-exposed persons. Therefore, the attributable risk of stroke in smokers compared to non-smokers in their 60s is 110.4 – 80.2 = 30.2.
5.4 C. Relative risk gives no information about incidence, whereas attributable risk does, so the statement in C is incorrect. The other answers are all correct. To calculate population attributable risk, one must know the prevalence of a risk factor in the population, which is not given in the question. The incidence of stroke among smokers was higher in the 60s than in the 40s. The RR of stroke among smokers in the 40s was 4.0, compared to 1.4 in the 60s. Stronger relative risks are better evidence for a causal relationship than weaker ones. In the analysis of this study, age could be treated as either a confounding variable and controlled for or, as in the information presented, as an effect modifier showing that the effect of smoking varies by age.
5.5 A. Absolute risk is another term for incidence. DVT is a rare event in most women, and the incidence in this study was 0.8/10,000/yr among women who neither had the mutation nor took OCs.
5.6 C. A good way to organize your thinking for questions 5.6–5.10 is to create a 2 × 2 table of the incidence of DVT according to OC use (present/absent) and factor V Leiden (present/absent).

Incidence (per 10,000/yr) of DVT According to OC Use and Mutation for Factor V Leiden

                              OC Use
                              Present    Absent
Factor V Leiden   Present     28.5       5.7
                  Absent      3.0        0.8

The table shows that among women who do not carry the factor V Leiden mutation the attributable risk of DVT for taking OC compared to those who do not take OC is 3.0/10,000 – 0.8/10,000 = 2.2/10,000.
5.7 E. The attributable risk of DVT among OC users who also carry factor V Leiden is substantial when compared to most women using OC, 28.5/10,000 – 3.0/10,000 = 25.5/10,000.
5.8 B. Population attributable risk = attributable risk (25.5/10,000 women/yr) × prevalence of factor V Leiden (0.05) = 1.3/10,000 women/yr.
5.9 C. Relative risk is calculated by dividing the incidence of DVT among women carrying the factor V Leiden mutation and using OCs by the incidence of DVT in women who take OCs but do not carry the mutation (28.5/10,000/yr ÷ 3.0/10,000 = 9.5).
5.10 A. Among women without the mutation, the relative risk for DVT and using OCs is calculated by dividing the incidence among those taking OCs by the incidence among those not using OCs (3.0/10,000/yr ÷ 0.8/10,000/yr = 3.8).
5.11 C. Using OCs in a woman heterozygous for factor V Leiden is an example of a risk factor with a substantial relative risk (prescribing OCs to such a woman increases her chance of developing DVT with a relative risk of 5.0) but a relatively small absolute risk: 28.5 women out of 10,000 who have the mutation and use OCs would develop DVT in the next year. Even more relevant for this patient is that her absolute risk would rise from about 6 to about 28 per 10,000 over the next year. It is important for a known carrier who wants to take OCs to understand that her risk is increased and to know how much that increased risk is in absolute terms. A prudent clinician would also want to be sure that the patient does not have other indications of increased risk of thrombosis such as age, smoking, or a family or personal history of clotting problems. This kind of decision requires careful judgment from both patient and clinician. However, using absolute or attributable risk will clarify the risk for the patient better than using relative risk when discussing clinical consequences.
5.12 B. The degree of illness may have confounded the results in this study so that sicker patients are more likely to take aspirin. One way to examine this possibility and adjust for confounding if it is present is to stratify the users and non-users into groups with similar indications for using aspirin and compare death rates in the subgroups.
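The arithmetic in answers 5.6 to 5.10 can be reproduced from the four incidence figures in the table. The following is a minimal sketch, not a standard library routine: the function and variable names are hypothetical, and exact fractions are used only so that the results are not blurred by floating-point rounding.

```python
from fractions import Fraction

# Incidence of DVT per 10,000 women per year, keyed by
# (factor V Leiden present, OC use present), taken from the table.
incidence = {
    (True, True): Fraction("28.5"),
    (True, False): Fraction("5.7"),
    (False, True): Fraction("3.0"),
    (False, False): Fraction("0.8"),
}


def relative_risk(exposed, unexposed):
    """Ratio of incidence in the exposed to incidence in the unexposed."""
    return exposed / unexposed


def attributable_risk(exposed, unexposed):
    """Difference in incidence between the exposed and the unexposed."""
    return exposed - unexposed


def population_attributable_risk(ar, prevalence_of_exposure):
    """Attributable risk multiplied by the prevalence of the exposure."""
    return ar * prevalence_of_exposure


# 5.6: OC use among women without the mutation.
ar_oc_no_mutation = attributable_risk(incidence[(False, True)], incidence[(False, False)])
# 5.7: the mutation among OC users.
ar_mutation_oc_users = attributable_risk(incidence[(True, True)], incidence[(False, True)])
# 5.8: prevalence of factor V Leiden taken as 0.05, as in the answer.
par_mutation = population_attributable_risk(ar_mutation_oc_users, Fraction("0.05"))
# 5.9 and 5.10: relative risks.
rr_mutation_oc_users = relative_risk(incidence[(True, True)], incidence[(False, True)])
rr_oc_no_mutation = relative_risk(incidence[(False, True)], incidence[(False, False)])

for label, value in [
    ("AR, OC use without mutation", ar_oc_no_mutation),
    ("AR, mutation among OC users", ar_mutation_oc_users),
    ("Population AR", par_mutation),
    ("RR, mutation among OC users", rr_mutation_oc_users),
    ("RR, OC use without mutation", rr_oc_no_mutation),
]:
    print(f"{label}: {float(value):.2f}")
```

The unrounded values are 1.275 for the population attributable risk and 3.75 for the relative risk in 5.10, which the answers report as 1.3 and 3.8.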

CHAPTER 6 RISK: FROM DISEASE TO EXPOSURE

6.1 B. If exposure to oral contraceptives was recorded just after the myocardial infarction, it could not have been a cause of it. All of the other choices could have artificially increased exposure in cases, resulting in a falsely elevated odds ratio.
6.2 E. Even an exemplary case-control study such as this one should not claim that it has identified a cause because unmeasured confounding is always possible.
6.3 E. Case-control studies produce only odds ratios, which can be used to estimate relative risk.
6.4 C. Epidemic curves describe the rise and fall in the number of cases over the time of the outbreak.
6.5 A. One can always obtain a crude relative risk from a cohort study. However, if the cohort data do not contain all of the variables that should be controlled for, a case-control analysis of the cohort study is a more efficient approach to including the additional data because the data need to be collected only for cases and controls, not for the entire cohort.
6.6 D. It would be better to sample cases and controls from a cohort rather than a dynamic population, especially if exposure or disease is changing rapidly over time and if controls are not matched to the date of onset of disease for cases.
6.7 E. Multiple control groups (not to be confused with multiple controls per case) are a way of examining whether the results are “sensitive” to the types of controls chosen, that is, whether the results using the different control groups are substantially different, calling the results into question.
6.8 C. Case-control studies do not provide information on incidence (although if they are nested in a cohort, a cohort analysis of the same data can).
6.9 B. Matching is used to control for variables that might be strongly related to exposure or disease, to be sure that at least the cases and controls do not differ on those variables.
6.10 C. The crude odds ratio is obtained by creating a 2 × 2 table relating the number of cases versus controls to the number of exposed versus non-exposed people and dividing the cross products. In this case, the odds ratio is 60 × 60 divided by 40 × 40 = 2.25.
6.11 E. Case-control studies cannot study multiple outcomes because they begin with the presence or absence of only one disease, cannot report incidence, and cases should be incident (new onset), not prevalent.
6.12 D. Odds ratios based on prevalent cases can provide a rough measure of association but not a comparison of risk, which is about incident (new-onset) cases.
6.13 D. During the early phases of an outbreak, the offending microbe or toxin is usually known, but even if it is not, the most pressing question is how the disease is being spread. This information can then be used to stop the outbreak and identify the source.
6.14 C. If a case-control study is based on all or a random sample of cases and a random sample of controls from a population or cohort, the cases and controls should be similar to each other on characteristics other than exposure.
6.15 D. Case-control studies do not provide information on incidence.
6.16 E. The odds ratio approximates the relative risk when the disease is rare; a rule of thumb is 1/100.
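The cross-product calculation in answer 6.10 can be sketched as follows. The function name `odds_ratio` and the assignment of the counts to cells (60 exposed cases, 40 unexposed cases, 40 exposed controls, 60 unexposed controls) are assumptions made for illustration; the answer itself states only the products 60 × 60 and 40 × 40.

```python
def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Crude odds ratio from a 2 x 2 case-control table (ratio of cross products)."""
    return (exposed_cases * unexposed_controls) / (unexposed_cases * exposed_controls)


# Assumed cell counts consistent with "60 x 60 divided by 40 x 40".
crude_or = odds_ratio(
    exposed_cases=60,
    unexposed_cases=40,
    exposed_controls=40,
    unexposed_controls=60,
)
print(f"Crude odds ratio: {crude_or:.2f}")
```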

CHAPTER 7 PROGNOSIS

7.1 A. Zero time is at the onset of disease (in this case, Barrett’s esophagus), not the time of outcome events (in this case, esophageal cancer).
7.2 C. Different rates of dropping out would bias results only if those who dropped out had a systematically different prognosis from those who remained.
7.3 E. Responses A–D are all possible reasons for measurement bias, whereas a true difference in incontinence rates between patients treated with surgery versus medical care would not be a systematic error.
7.4 E. Responses A–D are all features of clinical prediction rules.
7.5 C. Case series are a hybrid research strategy, not describing clinical course in a cohort nor relative risk with a case-control approach nor studying a representative sample of prevalent cases.
7.6 C. Patients would be censored for any reason that removes them from the study or causes them not to be in the study for as long as 3 years. If the outcome is survival, having another potentially fatal disease would not matter if they had not yet died of it.
7.7 A. Prognosis is about disease outcomes over time, and prevalence studies do not measure events over time.
7.8 B. Survival curves estimate survival in a cohort, after taking into account censored patients, and do not directly describe the proportion of the original cohort who survived.
7.9 E. All of the responses might be correct depending on the research question. For example, it would be useful to know prognosis both among patients in primary care and also among those referred to specialists.
7.10 B. Because patients had to meet stringent criteria to be included in the trial, the clinical course of those assigned to usual care would not be representative of patients with the disease, as they occur in the general population or in a defined clinical setting. Therefore, the results would not be generalizable to any naturally occurring group of patients with multiple sclerosis.
7.11 B. Even if clinical prediction rules are constructed using the best available methods, the strongest test of their ability to predict is that they have been shown to do so in patients other than the ones used to develop the rule.
7.12 D. A hazard ratio is calculated from information in a survival curve and is similar but not identical to relative risk.
7.13 C. Outcome events in time-to-event analyses are either/or events that occur only once.

CHAPTER 8 DIAGNOSIS

First, determine the numbers for each of the four cells in Figure 8.2: a = 49; b = 79; c = 95 – 49 (or 46); d = 152 – 79 (or 73). Add the numbers to determine the column and row totals.
8.1 C. 52% Sensitivity = 49/95 = 52%
8.2 B. 48% Specificity = 73/152 = 48%
8.3 A. 38% Positive predictive value = 49/128 = 38%
8.4 D. 61% Negative predictive value = 73/119 = 61%
8.5 A. 38% Prevalence of sinusitis in this practice = 95/247 = 38%
8.6 A. 38%
1. Calculate LR of a positive test (patient has facial pain)
LR+ = [49/(49 + 46)] / [79/(79 + 73)] = 1.0
2. Convert pretest probability to pretest odds
Pretest odds = prevalence/(1 – prevalence) = 0.38/(1 – 0.38) = 0.61
3. Calculate posttest odds by multiplying pretest odds by LR+
Posttest odds = 0.61 × 1.0 = 0.61
4. Convert posttest odds to posttest probability
Posttest probability = posttest odds/(1 + posttest odds) = 0.61/(1 + 0.61) = 0.38 or 38%
(Another, simpler, approach is that posttest probability = positive predictive value)
+PV = 49/128 = 38%
8.7 D. ~75% The LR for “high probability” of sinusitis was 4.7 and the pretest probability (prevalence) was 38%. You can use one of three methods to determine the posttest probability of sinusitis among patients with “high probability” of sinusitis: (1) the mathematical approach outlined in Figure 8.8 (or on Web sites), (2) using a nomogram, or (3) using the bedside “rule of thumb” approach outlined in Table 8.3.
1. Mathematical approach:
Pretest odds = 0.38/(1 – 0.38) = 0.61
LR × pretest odds = 4.7 × 0.61 = 2.9
Posttest probability = 2.9/(1 + 2.9) = 0.74 or 74%
2. Nomogram:
Put ruler on 38% for prevalence and 4.7 for LR; it crosses the posttest number at approximately 75%.
3. Bedside “rule of thumb”:
An LR of 4.7 (close to 5) increases the probability of sinusitis approximately 30 percentage points, from 38% to 68%.
8.8 C. ~45%
8.9 B. ~20%
8.10 C. As prevalence becomes smaller, the predictive value of a test decreases (see Fig. 8.7). Clinicians are more likely to treat a patient with a 75% probability of sinusitis than to forego treatment for one with 20% probability. (For the latter, further testing, such as sinus x-rays, may be warranted.) The posttest probability of “intermediate probability” of sinusitis is 45%, close to a coin toss.
8.11 D. Requiring two independent tests to be abnormal, that is, using them in series, increases specificity and positive predictive value. (See example on page 126–7 and Table 8.4). However, this approach lowers sensitivity; therefore, B is incorrect. Using tests in parallel increases sensitivity (making A incorrect) but usually decreases positive predictive value (making C incorrect). Work with Figure 8.2 and 8.3 to convince yourself.
8.12 A. The most important requirement when using multiple tests is that each contributes independent information not already evident from a previous test. When using tests in series, performing the test with the highest specificity first is most efficient and requires fewer patients to undergo both tests. In parallel testing, performing the test with the highest sensitivity is most efficient.
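The arithmetic in answers 8.1 through 8.7 can be checked with a short Python sketch. It assumes the 2 × 2 cell layout used above (a = true positives, b = false positives, c = false negatives, d = true negatives); the function names are illustrative only and do not come from the text.

```python
def diagnostic_summary(a, b, c, d):
    """Summary measures from a 2 x 2 table.

    a = test positive, disease present   b = test positive, disease absent
    c = test negative, disease present   d = test negative, disease absent
    """
    sensitivity = a / (a + c)
    specificity = d / (b + d)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": a / (a + b),
        "npv": d / (c + d),
        "prevalence": (a + c) / (a + b + c + d),
        "lr_positive": sensitivity / (1 - specificity),
    }


def posttest_probability(pretest_probability, likelihood_ratio):
    """Probability -> odds, multiply by the LR, then odds -> probability."""
    pretest_odds = pretest_probability / (1 - pretest_probability)
    posttest_odds = pretest_odds * likelihood_ratio
    return posttest_odds / (1 + posttest_odds)


# Cell counts from the facial pain example (a = 49, b = 79, c = 46, d = 73).
summary = diagnostic_summary(49, 79, 46, 73)
for name, value in summary.items():
    print(name, round(value, 2))

# Question 8.7: pretest probability 38%, LR 4.7 for "high probability".
print(round(posttest_probability(0.38, 4.7), 2))
```

Working through odds rather than probabilities is what allows the likelihood ratio to be applied by simple multiplication.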

CHAPTER 9 TREATMENT

9.1 A. Every effort was made to make this trial as true to life as possible by comparing drugs in common use, having broad eligibility criteria, not blinding participants, allowing care to proceed as usual, and relying on a patient-centered outcome rather than a laboratory measurement. Although the trial is for efficacy, it is better described as practical and it is certainly not large.
9.2 D. Intention-to-treat analysis, counting outcomes according to the treatment group that patients were randomized to, tests the effects of offering treatment, regardless of whether patients actually take it. It is, therefore, about effectiveness in ordinary circumstances and the measure of effect, like usual care, is affected by drop-outs, reducing the observed treatment effect over what it would have been if everyone took the treatment they were assigned to.
9.3 A. All characteristics at the time of randomization, such as severity of disease, are randomly allocated. Characteristics arising after randomization, such as retention, response to treatment, and compliance, are not.
9.4 D. The greatest advantage of randomized trials over observational studies is prevention of confounding. Randomization creates comparison groups that would have the same outcome rates, on average, were it not for intervention effects.

9.5 E. The study had extensive inclusion and exclusion criteria. This would increase the extent to which patients in the trial were similar to each other, making it easier to detect treatment differences if they exist, but at the expense of generalizability, the ability to extrapolate from study results to ordinary patient care.
9.6 A. Intention-to-treat analyses describe the effects of being offered treatments, not necessarily taking them. To describe the effects of actually receiving the intervention, one would have to treat the data as if they were from a cohort study and use a variety of methods to control for confounding.
9.7 B. Stratified randomization is one approach to control of confounding, especially useful when a characteristic is strongly related to outcome, and also when the study is small enough that one worries that randomization might not create groups with similar prognosis.
9.8 C. In explanatory analyses, outcomes are attributed to the treatment patients actually receive, not the treatment group they were randomized to, which is an intention-to-treat analysis.
9.9 B. Side effects of the drug, both symptoms and signs, would alert patients and doctors to who is taking the active drug but could not affect random allocation, which was done before drug was begun.
9.10 C. For a randomized controlled trial to be ethical, there should not be conclusive evidence that one of the experimental treatments is better or worse than the other; that is, the scientific community should be in a state of “equipoise” on that issue. There may be opinions as to which is better, but no consensus. There may be some evidence bearing on the comparison, but the evidence should not be conclusive.
9.11 D. Because the new drug has advantages over the old one, but comparative effectiveness is unknown, the appropriate randomized trial comparing the two would be a non-inferiority trial to establish whether the new drug is no less efficacious.
9.12 A. Making the primary outcome of a trial a composite of clinically important and related outcomes increases the number of outcome events and, therefore, the ability of the trial to detect an effect if it is present. The disadvantage is that the intervention may affect the component outcomes differently, and reliance on the composite outcome alone might mask this effect.
9.13 C. Both bad luck in randomization and breakdown in allocation concealment (by bad methods or cheating) would show up as differences in baseline characteristics of patients in a trial. Small differences are expected and the challenge is to decide how large the differences must be for them to raise concern.
9.14 C. The usual drug trials reported in the clinical literature are “Phase III” trials, intended to establish efficacy or effectiveness. Further study, with postmarketing surveillance, is needed to detect uncommon side effects. Responses A and D are about what Phase I and Phase II trials are meant to establish.
9.15 B. Prevention of confounding is the main advantage of randomized controlled trials, which is why they are valued despite being ethically complicated, slower, and more expensive. Whether they resemble usual care depends on how the trial is designed.

CHAPTER 10 PREVENTION

10.1 A. 33%. The relative risk reduction of colorectal cancer mortality is the absolute risk reduction divided by the cancer mortality rate in the control group. (See Chapter 9.)
The colorectal cancer mortality rate in the screened group = 82/15,570 = 0.0053
The colorectal cancer mortality rate in the control group = 121/15,394 = 0.0079
Absolute risk reduction = 0.0079 – 0.0053 = 0.0026
Relative risk reduction of colorectal cancer mortality due to screening = 0.0026/0.0079 = 0.33 or 33% reduction. An alternative approach calculates the complement of relative risk. In the example, the relative risk of colorectal mortality in the screened compared to the control group was 0.0053/0.0079 = 0.67. The complement of 0.67 = 1.00 – 0.67 = 0.33 or 33%.
10.2 C. 385. The number needed to screen is the reciprocal of absolute risk reduction, 1/0.0026, or 385.
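The calculations in 10.1 and 10.2 can be reproduced with the Python sketch below. The function name and the dictionary keys are illustrative, not from the text. Because the sketch works with unrounded rates, the number needed to screen comes out fractionally higher than the 385 obtained from the rounded absolute risk reduction of 0.0026.

```python
def screening_effect(deaths_screened, n_screened, deaths_control, n_control):
    """Absolute and relative risk reduction and number needed to screen."""
    rate_screened = deaths_screened / n_screened
    rate_control = deaths_control / n_control
    absolute_risk_reduction = rate_control - rate_screened
    return {
        "rate_screened": rate_screened,
        "rate_control": rate_control,
        "arr": absolute_risk_reduction,
        # Relative risk reduction: ARR divided by the control-group rate,
        # equivalently 1 minus the relative risk.
        "rrr": absolute_risk_reduction / rate_control,
        "relative_risk": rate_screened / rate_control,
        # Number needed to screen: reciprocal of the ARR.
        "nns": 1 / absolute_risk_reduction,
    }


result = screening_effect(82, 15570, 121, 15394)
for name, value in result.items():
    print(name, round(value, 4))
```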

10.3 C. If 30% of the screened group were found to have polyps, at least 4,671 (0.3 × 15,570) had a false-positive test for colon cancer. If the sensitivity of the test was 90% and the number of cancers was 82, about 74 people (0.9 × 82) had a true-positive result. The positive predictive value of the test, therefore, would be about 74/(74 + 4,671), or 1.6%. (Using exact numbers, the authors calculated a positive predictive value of 2.2%). Negative predictive value was not calculated, but it would be high because the incidence of cancer over 13 years was low (323/15,570 or 21 per thousand) and about 90% of tests were negative.
10.4 B. Lead time (the period of time between the detection of a disease on screening and when it would ordinarily be diagnosed because the patient seeks medical care due to symptoms) can be associated with what appears to be an improvement in survival. The fact that mortality did not improve after screening in this randomized trial raises the possibility that lead-time bias is responsible for the result. Another possible cause of the finding is overdiagnosis.
10.5 A. See answer for 10.4.
10.6 C. Overdiagnosis, the detection of lesions that would not have caused clinical symptoms or morbidity, is likely because, even after 20 years, the number of cancers in the control group remained fewer than that in the screened group. Increased numbers of cancers in the screened group could occur if there were more smokers in the screened group, but randomization should have made that possibility unlikely.
10.7 D. Several important questions are to be considered when assessing a new vaccine, including: (i) Has the vaccine been shown to be efficacious (did it work under ideal circumstances when everyone in the intervention group received the vaccine and no one in a comparable control group did) and was it effective (under everyday circumstances, in which some people in the intervention group did not receive the vaccine)? (ii) Is the vaccine safe? (iii) Is the burden of suffering of the condition the vaccine protects against important enough to consider a preventive measure? and (iv) Is it cost-effective? Cost of the vaccine is only one component of an analysis of cost-effectiveness.
10.8 C. Disease prevalence is lower in presumably well people (screening) than in symptomatic patients (diagnosis). As a result, positive predictive value will be lower in screening. Because screening is aimed at picking up early disease, the sensitivity of most tests is lower in screening than in patients with more advanced disease. Overdiagnosis is less likely in diagnostic situations when symptomatic patients have more late-stage disease.
10.9 C. Volunteers for preventive care are more compliant with advice about medical care and usually have better health outcomes than those who reject preventive care. This effect is so strong that it is seen even when volunteers are taking placebo medications.
10.10 B. The incidence method is particularly useful when calculating sensitivity for screening tests because it takes into account the possibility of overdiagnosis. The gold standard for screening tests almost always involves a follow-up interval. The first round of screening picks up prevalent as well as incident cases, inflating the number compared to later screening rounds.
10.11 E. Cost-effectiveness should be estimated in a way that captures all costs of the preventive activity and all the costs associated with diagnosis and treatment, wherever they occur, to determine the costs of the preventive activity from a societal perspective. All costs that occur when prevention is not done are subtracted from all costs when prevention is done to estimate the cost for a given health effect.
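The rough positive predictive value estimate in 10.3 follows from two assumptions stated in that answer: 30% of the screened group had a false-positive test, and the test detected 90% of the 82 cancers. A minimal Python sketch of that estimate is given here; the function and parameter names are hypothetical.

```python
def estimated_ppv(n_screened, proportion_false_positive, n_cancers, sensitivity):
    """Approximate positive predictive value from summary figures."""
    false_positives = proportion_false_positive * n_screened
    true_positives = sensitivity * n_cancers
    return true_positives / (true_positives + false_positives)


# Figures quoted in answer 10.3.
ppv = estimated_ppv(15570, 0.30, 82, 0.90)
print(f"{ppv:.1%}")
```

The estimate is low mainly because false positives greatly outnumber true positives when the disease is uncommon in the screened group.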

CHAPTER 11 CHANCE

11.1 C. Although subgroup analyses risk false-positive and false-negative conclusions, they provide information that can help clinicians as long as their limitations are kept in mind.
11.2 C. The P value describes the risk of a false-positive conclusion and has nothing directly to do with bias and statistical power or the generalizability or clinical importance of the finding.
11.3 E. The P value is not small enough to establish that a treatment effect exists and provides no information on whether a clinically important effect could have been missed because of inadequate statistical power.

11.4 D. The result was statistically significant, if one wanted to think of the role of chance in that way, because it excluded a relative risk of 1.0. Response D described the information the confidence interval contributes.
11.5 B. Calling P ≤ 0.05 “statistically significant” is a useful convention but otherwise has no particular mathematical or clinical meaning.
11.6 A. Models do depend on assumptions about the data, are not done in a standard way, and are meant to complement stratified analyses, not replace them. Although they might be used in large randomized trials, they are not particularly useful in that situation because randomization of a large number of patients has already made confounding very unlikely.
11.7 C. Of the 12,000 people in the trial, about 6,000 would be in the chemoprevention arm of the trial. Applying the rule of thumb mentioned in this chapter (but solving for event rate, not sample size), 6,000 divided by 3 or a 1/2,000 event rate could be detected with 6,000 people under observation.
11.8 E. Statistical power depends on the joint effects of all of the factors mentioned in A–D. It may vary a bit with the statistical test used to calculate it, but this is not the main determinant of sample size.
11.9 B. Bayesian reasoning is about how new information affects prior belief and has nothing to do with inferential statistics or the ethical rationale for randomized trials.
11.10 D. The results are consistent with a 1% up to a 46% higher death rate, with the best estimate being 22% higher, and exclude a hazard ratio of 1.0 (no effect), so it is “statistically significant.” Confidence intervals provide more information than a P value for the same data because they include the point estimate and range of values that is likely to include the true effect.
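The arithmetic in 11.7 can be sketched in Python. The sketch assumes that the rule of thumb referred to there is the familiar "rule of three" (observing about three times as many people as the reciprocal of the event rate gives a good chance of seeing at least one event); the function names are illustrative.

```python
def detectable_event_rate(n_observed):
    """Rule of three, solved for the event rate rather than the sample size."""
    return 3 / n_observed


def probability_of_at_least_one_event(event_rate, n_observed):
    """Chance of seeing one or more events among n independent people."""
    return 1 - (1 - event_rate) ** n_observed


n_in_arm = 12000 // 2  # about 6,000 people in the chemoprevention arm
rate = detectable_event_rate(n_in_arm)
print(rate)  # 1 event per 2,000 people
print(round(probability_of_at_least_one_event(rate, n_in_arm), 2))
```

The second function shows why the rule works: with 6,000 people observed, an event occurring in 1 of 2,000 has roughly a 95% chance of appearing at least once.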

CHAPTER 12 CAUSE

12.1 B. It would be unethical to randomly allocate a potentially harmful activity, and people would probably not accept long-term restrictions on their use of cell phones.
12.2 E. All responses, which reflect Bradford Hill criteria, would be useful in deciding whether cell phones cause brain cancer, but E, a lack of specificity, is weaker than the rest. (Consider how many diseases cigarette smoking causes.)
12.3 C. Figure 12.1 shows how several risk factors act together to cause coronary heart disease. Nearly all diseases are caused by the joint effect of genes and environment. Figure 12.3 shows a huge decline in tuberculosis rate before effective treatment was established in the 1950s.
12.4 B. In the absence of bias, statistical significance establishes that an association is unlikely to be by chance whereas the Bradford Hill criteria go beyond that to establish whether an association is cause-and-effect.
12.5 C. Cost-effectiveness analyses describe financial costs per clinical effects, such as year of life saved, as in this example, for alternative tests and treatments.
12.6 D. When exposure and disease are measured in groups, not individuals, it is possible that individuals who got the disease were not the ones who were exposed and that people who were exposed were not the ones who got the disease, a problem called the “ecological fallacy.”
12.7 B. In a time-series study, a change in the rate of disease after an intervention might be caused by other changes in local conditions at about the same time. This possibility needs to be ruled out before one can have confidence that the study exposure caused the outcome.
12.8 B. Randomized controlled trials, if well designed and carried out, are the strongest evidence for cause and effect. But they are not possible for many suspected causes because it is not ethical to involve people in studies of harm and because they may not be willing to cooperate with such a trial.
12.9 E. The pattern of evidence from all of the Bradford Hill criteria is more powerful than evidence for any one of them.
12.10 D. The confidence interval is so wide that it is consistent with either harm or benefit and, therefore, does not even establish whether there is an association between cell phones and brain cancer, let alone whether cell phones are a cause.

CHAPTER 13 SUMMARIZING THE EVIDENCE

13.1 E. The main advantage of combining (pooling) study results in a systematic review is to have a larger sample size and as a result a more stable and precise estimate of effect size. Although it could be argued that the results are somewhat more generalizable because they come from more than one time and place, generalizability is not the main advantage.
13.2 D. MEDLINE can be searched using both content and methods terms, but it does not include all of the world’s journals, searches miss some of the articles in MEDLINE, and they are not a remedy for publication bias since all citations in it are published. These limitations are why reviewers need to use other, complementary ways of searching for articles in addition to MEDLINE.
13.3 B. Although individual studies usually have limited statistical power in subgroups, pooling of several studies may overcome this problem, as long as patients, not trials, are pooled.
13.4 A. B–E are elements of PICO, core components of a specific research question. Covariates are a critical aspect of the research, especially for observational studies and to identify effect modification in clinical trials, but they are about how successfully the question can be answered, not the question itself.
13.5 B. Systematic reviews should include a description of the methodologic strength of the studies to help users understand how strong the conclusions are. Combining individual elements of quality into a summary measure is less well established, perhaps because it is not meaningful to give each element the same weight.
13.6 D. The usual situation is that small negative studies (ones that find no effect) are less likely to be published than small studies that find effects. Large studies are likely to be published no matter what they find.
13.7 B. Narrative reviews can take a comprehensive approach to the set of questions clinicians must answer to manage a patient but that is at the expense of a transparent description of the scientific basis for the evidence cited.
13.8 B. Forest plots summarize the raw evidence in a systematic review. Pooled effect size and confidence interval may or may not be included, depending on whether it is appropriate to combine the study results.
13.9 A. It is not possible to control for a covariate in a study-level meta-analysis, as one might in one of the parent studies or in a patient-level meta-analysis.
13.10 A. Pooling is justified if there is relatively little heterogeneity across studies, as determined by an informed review of the patients, interventions, and outcomes. A statistical test for heterogeneity is also useful in general, but not in this case, with so few studies, because of low statistical power.
13.11 D. Random effects models take heterogeneity into account, at least if it is not extreme, and, therefore, result in wider confidence intervals, which are more likely to be accurate than the narrower confidence intervals that would result from a fixed effect model.
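To make the contrast in 13.11 concrete, the Python sketch below pools three hypothetical study results with an inverse-variance fixed effect model and with a random effects model. The between-study variance is estimated with the DerSimonian-Laird method, which is one common choice and is an assumption here, not something specified in the text. The effect sizes and variances are invented for illustration.

```python
import math


def pool_studies(effects, variances):
    """Fixed effect and DerSimonian-Laird random effects pooled estimates."""
    k = len(effects)
    weights = [1 / v for v in variances]
    total_weight = sum(weights)

    # Fixed effect model: inverse-variance weighted mean.
    fixed = sum(w * y for w, y in zip(weights, effects)) / total_weight
    se_fixed = math.sqrt(1 / total_weight)

    # Cochran's Q and the DerSimonian-Laird between-study variance (tau^2).
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, effects))
    c = total_weight - sum(w ** 2 for w in weights) / total_weight
    tau2 = max(0.0, (q - (k - 1)) / c)

    # Random effects model: add tau^2 to each study's variance.
    re_weights = [1 / (v + tau2) for v in variances]
    total_re_weight = sum(re_weights)
    pooled_random = sum(w * y for w, y in zip(re_weights, effects)) / total_re_weight
    se_random = math.sqrt(1 / total_re_weight)

    return {
        "fixed": fixed, "se_fixed": se_fixed,
        "random": pooled_random, "se_random": se_random,
        "tau2": tau2,
    }


# Hypothetical, deliberately heterogeneous study results (e.g., log relative risks).
result = pool_studies([0.1, 0.5, 0.9], [0.01, 0.01, 0.01])
for name, value in result.items():
    print(name, round(value, 3))
```

With these heterogeneous inputs the random effects standard error is several times larger than the fixed effect one, so its confidence interval is correspondingly wider.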

CHAPTER 14 KNOWLEDGE MANAGEMENT

14.1 C. It is virtually impossible to keep up with all scientifically strong, clinically relevant new research in your field without help from a publication that does that for you.
14.2 D. Journals are a rich source of information on the many aspects of being a physician (the history, politics, science, ideas, experiences, and more) but are not particularly good for finding answers to immediate questions or even surveillance on new developments in your field.
14.3 D. The resources listed in A–C are all excellent for looking up the answers to clinical questions, whereas a recent article, out of context with others of the same question, is of marginal value unless it happens to be much stronger than all the others that preceded it.
14.4 C. PubMed searches are indispensable for finding whether a rare event has been reported. They are one important part of an effort to find all published articles on a specific clinical question and are useful for researchers, but they are too inefficient for most questions at the point of care.
14.5 B. Although peer review and editing make articles better (more readable, accurate, and complete), articles are far from perfect when the process is over and they are published.
14.6 C. Conflict of interest is in relation to a specific activity and does not exist in the general case of investing in medical products not specifically related to colonoscopy.
14.7 A. B–E are all valuable resources for looking up the answers to clinical questions at the point of care (if computers and the relevant programs are available). Medical journals have great value for other reasons, but they are not useful for this purpose.
14.8 D. Patients and clinicians alike might well respect famous experts but should look for more solid footing (the organization that sponsors them, facts and the source of those facts) when deciding whether to believe them.
14.9 A. GRADE has been developed for treatment recommendations, but not other clinical questions.
14.10 E. All are basic elements of a comprehensive knowledge management plan, as described in this chapter.
Note: Page numbers in italics denote figures; those followed by a t denote tables.

A
Abnormality, 31
  criteria for, 41–45, 42, 43, 44
  distributions and, 38–41, 39, 39t, 40
  variation and, 35–38, 35t, 36, 37, 38
Absolute risk, 67–68, 68t
Accuracy, 33, 118
Adherence, 140
Adjusted odds ratio, 89
Aggregate risk studies, 202–204, 203
Allocation concealment, 141
Analogy, 202
Assumption of independence, 129
Attributable risk, 68, 68t

B
Baseline characteristics, 139
Bayesian reasoning, 190–191
Best-case/worst-case analysis, 104
Bias
  in clinical observation, 7, 7t
  in cohort studies, 102–104
  compliance, 161–162
  confounding, 8–10, 9
  defined, 7
  effects of, as cumulative, 10–11, 10
  lead-time, 160–161, 160t
  length-time, 161, 161
  measurement, 8, 104
  migration, 103
  from non-differential misclassification, 104
  recall, 86
  sampling, 12, 103
  selection, 7
  sensitivity and specificity, 116–117
Biologic differences, variation resulting from, 36–37, 37
Biologic plausibility, 201–202
Blinding, 141–142, 141

C
C-statistic, 56
Calibration, 56
Case-cohort design, 65
Case-control studies, 81–83, 82
  characteristics of, 81t
  design of, 83–87
  measuring exposure, 85–87
  nested, 84
  population-based, 83
  selecting cases, 83
  selecting controls, 83–85
Case fatality rate, 20
Case report, 101
Case series, 101–102, 106, 242
Cases
  defined, 23–24
  selecting, case-control studies, 83
Cause(s), 194
  basic principles, 195–198, 195t, 196, 197
  evidence for and against, 199–202, 199t, 200
  indirect evidence for, 198–199, 198, 199t
  modeling, 204–205, 205
  multiple, 195–196, 196
  proximity of, 196–198, 197
  single, 195, 195t
  weighing the evidence and, 205–206, 206
Censored, 100
Central tendency, 38
Chance, 10, 14, 16, 237
  approaches to, 175–176
  bayesian reasoning, 191–192
  detecting rare events, 185, 185
  effects of, as cumulative, 10–11, 10
  hypothesis testing and, 176–180, 176
  multiple comparisons, 185–187, 186t
  multivariable methods, 189–190
  point estimates and confidence intervals, 183–185
  sensitivity and specificity, 117, 117
  subgroup analysis, 187–189, 188, 189t
Chemoprevention, 153
Chi-square (χ²) test, 178
Clinical colleagues, 228–229
Clinical course, 94
Clinical databases, 148
Clinical epidemiology
  basic principles, 6–12
  bias (systematic error) in, 7–10, 7t
  chance, 10, 10
  clinical issues and questions, 2, 2t
  defined, 3–4
  health outcomes in, 2–3, 2t
  numbers and probability in, 6
  populations and samples, 6, 6
  purpose of, 3–4, 6
  variables in, 6
Clinical practice guidelines, 229–230, 229t
Clinical prediction rules, 102, 102t, 127, 128t
Clinical prediction tool, 54
Clinical trials, 134
Cluster randomized trials, 145
Cochrane library, 230
Cohort, 21, 29, 62, 62t, 238
  approach, 84
Cohort studies, 21, 62–63, 63
  advantages and disadvantages of, 65–67, 66t
  historical, 63, 64
  using medical databases, 64–65
  prospective, 63, 64
Cointerventions, 141
Comorbidity, 135
Comparative effectiveness, 134
Compliance, 140–141
Compliance bias, 161–162
Complication rate, 20, 20t
Composite outcomes, 142
Concordance statistics, 56
Confidence interval, 183
Conflict of interest, 226
Confounders, 72
Confounding, 8–10, 9, 14, 71–72, 237
  confirming, 72
  control of, 72–76
  by indication, 147
  overall strategy for control of, 75–76
  variables, 72
Consistency, 201
Construct validity, 34
Constructs, 33
Content validity, 33
Continuous data, 32, 46, 238
Controlling, 72
Controls
  hospital and community, 84
  selecting, for case-control studies, 83
Convenience samples, 25, 29, 238
Cost–benefit analysis, 205
Cost-effectiveness analysis, 5, 170, 205
Counseling, 157–158, 158
Covariates, 6, 71
Cox proportional hazard, 75
Criterion standard, 109
Criterion validity, 33–34
Cross-over, 141
Cross-over trials, 146
Cross-sectional studies, 21
Crude measure of effect, 71
Crude odds ratio, 89
Cumulative incidence, 21–22, 29, 238
Cumulative meta-analysis, 219–221, 220
Cutoff point, 113


D
Data
  characteristics of, 181–182
  continuous, 32
  dichotomous
  discrete, 33
  interval, 32–33
  nominal, 32
  ordinal, 32
  simplifying in making diagnosis, 108–109
Decision analysis, 5, 205
Decision making
  medical, 12
  shared, 12
Denominator, 18
Dependent variable, 6
Diagnosis
  accuracy of test result in, 109–111, 109
  likelihood ratios in, 122–125, 123t, 124, 125, 125t
  multiple tests in, 125–129, 126, 127t, 128t
  predictive value, 117–122, 118
  sensitivity and specificity in, 111–117, 112, 113t, 114, 115
  simplifying data, in making, 108–109
Diagnostic decision-making rules, 127
Diagnostic test, 108
Dichotomous data, 32
Discrete data, 33
Discrimination, 56
Disease
  increasing the pretest probability of, 120–122
  lack of objective standards for, 110–111
  outbreak, 89–90
  outcomes of, 2t
Dispersion, 38
Distant causes, 51–52
Distribution
  actual, 39
  describing, 38, 39, 39t
  frequency, 38
  gaussian, 40–41
  normal, 40–41, 40
  skewed, 39
Dose–response relationship, 200–201
Double-blind, 142
Dropouts, 103
Duration of disease, 20
Dynamic population, 22, 22, 30, 238

E
Ecological fallacy, 202
Ecological studies, 202
Effect modification, 76–77
Effectiveness trials, 143
Efficacy trials, 143
Electronic textbooks, 229
EMBASE, 230
Endemic, 26, 27, 30, 238
Epidemic curve, 26, 26, 89, 90
Epidemiology, 4. See also Clinical epidemiology
Equipoise, 135
Equivalence trial, 145
Estimated relative risk, 88
Estimation, 176
Evidence-based medicine, 4, 4t
Evidence summarization, 209
Exclusion criteria, 135
Experimental studies, 134
Explanatory analyses, 145
Exposed group, 62
Exposure, 51
External validity, 11, 11, 14, 237
Extraneous variables, 6, 71

F
Fallacy, ecological, 202
False cohorts, 102
False negative, 109, 176–177
False positive, 109, 176–177
False-positive screening test result, 166
The 5 Ds, 2–3, 2t
Fixed effect model, 217
Forest plots, 215, 215
Frequency
  commonly used rates, 20, 20t
  distribution of disease by time, place, and person, 25–28
  incidence in, 18, 18, 19t
  prevalence in, 18–19, 18, 19t
  relationships among prevalence, incidence, and duration of disease, 19–20
  studies of, 23–25
  studies of prevalence and incidence, 21–23
  suitability of words as substitutes for number, 18
  uses of prevalence studies, 28
Frequency distribution, 38
Funnel plots, 212

G
Generalizability, 11–12, 106, 242
Gold standard, 109
  consequences of imperfect, 111
Grab samples, 25
Grading information, 226

H
Hawthorne effect, 138
Hazard ratios, 101
Health outcomes, 2–3, 2t
Health-related quality of life, 142
Health services research, 4–5
Health status, 142
Heterogeneity, 217
Hierarchy of research designs, 199, 199t
Hypotheses, 132
  null, 178
Hypothesis testing, 175

I
Immediate causes, 51
Immunizations, 153
Inception cohort, 96
Incidence, 18–19, 18
  characteristics of, 19t
  cumulative, 21–22
  density, 22–23
  in relation to time, 19
  relationships among, prevalence, duration and, 19–20
Incidence density, 22, 29, 238
Incidence method, 164
Incidence screen, 159
Incidence studies, 21, 62
Incidentaloma, 169
Inclusion criteria, 135
Independence, assumption of, 129
Independent variable, 6
Infant mortality rate, 20, 20t
Inference, 6
Inferential statistics, 176
Inferiority margin, 145
Intention-to-treat analysis, 144
Interaction, 76
Intermediate outcomes, 72
Internal validity, 11, 11
Interpretability, 35
Interval cancer, 163
Interval data, 32–33
Intervention, 134

J
Journals, 231–234, 232t, 233, 234t
  reading, 233–234, 233, 234t
Just-in-time learning, 228

K
Kaplan-Meier analysis, 98
Karnofsky Performance Status Scale, 35
Knowledge management
  basic principles, 225–228
  defined, 225
  guiding patients’ quest for health information, 234–235, 234t
  journals, 231–234, 232t, 233, 234t
  looking up answers to clinical questions, 228–230
  into practice, 235
  surveillance on new developments, 230–231
Koch’s postulates, 195t

L
Labeling effect, 167
Large simple trials, 136
Latency period, 51, 80
Lead-time bias, 160–161, 160t
Length-time bias, 161, 161
Lifestyle changes, 153
Likelihood ratios
  calculating, 124–125, 125t
  defined, 122
  odds, 122
  reason for using, 123–124, 123t
  use of, 122–123
Logistic regression, 75

M
Marker, 53
Masking, 141
Matching, 74, 85
Maternal mortality rate, 20t
Mathematical models, 189
Measure of effect, 67
Measurement bias, 8, 14, 16, 104, 237
Measurements
  performance of, 33–35, 35
  variation resulting from, 35–36, 35t
Medical decision making, 12
MEDLINE, 230
Meta-analysis, 216
  combining studies in, 216–219, 218
  cumulative, 219–221, 220
  observational and diagnostic studies, 221
  strengths and weaknesses of, 221–222, 222
Meta-regression, 219
Migration bias, 103
Multiple causes, 195–196, 196
Multiple comparisons, 185–187, 186t
Multiple control groups, 84
Multiple controls per case, 85
Multiple outcomes, 187, 189
Multiple tests, 125–129, 126, 127t, 128t
Multiple time-series studies, 203
Multivariable adjustment, 75
Multivariable analysis, 75
Multivariable methods, 189–190
Multivariable modeling, 189

N
Narrative reviews, 209
Natural history, 94
Negative predictive value, 118
Negative tests
  lack of information on, 110
Nested case-control study, 84
Network meta-analysis, 219
Nominal data, 32
Non-inferiority trials, 145
Non-parametric statistics, 178
Normal distribution, 40–41, 40
Null hypothesis, 178
Numbers
  probability and, 6
Numerator, 18
Numerical method, 17

O
Observational studies, 62, 71
  of interventions, 147–148
  versus randomized, 148
  of treatment effects, 156–157
Odds, 122
Odds ratio, 87–88
One-tailed, 179
Open label trial, 142
Ordinal data, 32, 46, 238
Outcomes
  assessment of, 142–143, 143t
  composite, 142
  health, 2–3, 2t
  multiple, 187, 189
  in prognostic studies, 96–97, 97t
Overdiagnosis, risk of, in cancer screening, 167–169
Overmatching, 85
Oversample, 25

P
P value, 177
Pandemic, 26
Parallel testing, 126–127, 127t
Patient-level meta-analysis, 217
Peer review, 232
Per-protocol, 145
Perinatal mortality rate, 20, 20t
Period prevalence, 18, 30, 238
Person, distribution of disease by, 27–28
Person-time, 22
Personalized medicine, 54, 55t
Phase I trials, 149
Phase II trials, 149
Phase III trials, 149
PICO, 210
Place, distribution of disease by, 27, 27
Placebo, 138
Placebo adherence, 162
Placebo effect, 138
Point estimate, 183
Point of care, 228, 228t
Point prevalence, 18, 29, 238
Population
  defined, 25
  dynamic, 22, 22
  study sample, 25
Population at risk, 25
Population attributable fraction, 69
Population attributable risk, 69
Population risk, 69–71
Population sciences, 4
Population-based case-control studies, 83
Populations, 6
Positive predictive value, 118
Posterior (posttest) probability, 118
Postmarketing surveillance, 149
Posttest odds, 122
Potential confounders, 72
Practical clinical trials, 136
Pragmatic clinical trials, 136
Precision, 34
Predictive value
  definitions, 117–118
  determinants of, 118–119
Predisease, 168
Pretest odds, 122
Pretest probability, 119
Prevalence, 18, 18, 118
  characteristics of, 19t
  estimating, 119
  in relation to time, 19
  period, 18
  point, 18
  relationships among, incidence, duration and, 19–20
Prevalence odds ratio, 88
Prevalence screen, 159
Prevalence studies, 21, 21, 30, 106, 238, 242
  uses of, 28
Prevention
  benefits against harms of, 169–172
  burden of suffering, 156
  in clinical settings, 152–153
  effectiveness of treatment, 156–159, 158
  levels of, 153–155, 153
  randomized trials in, 156
  scientific approach to clinical, 155, 155t
  screening tests
    and treatments over time, 169
    performance of, 163–166
  unintended consequences of screening, 166–169
Preventive care, 152
Primary prevention, 153–154
  treatment in, 156–158
Prior (pretest) probability, 118
Probability, 122
  numbers and, 6
Probability sample, 25, 29, 238
Prognosis, 93
  bias in cohort studies, 102–104
  case series, 101–102
  clinical course and natural history of disease, 94
  clinical prediction rules, 102, 102t
  defined, 93
  describing, 97–100, 97t, 98, 99
  sensitivity analysis, 104–105
Prognostic factors
  defined, 93
  differences in, 93–94, 94
  identifying, 100–101, 101
Prognostic stratification, 102
Prognostic studies, elements of, 95–97
Prospective cohort studies, 63–64
Pseudodisease, 167–169
Publication bias, 211
PubMed, 230

Q
Quality adjusted life year (QALY), 170
Quantitative decision making, 5
Questions
  clinical, 2, 2t
  looking up answers, 228–230

R
Random allocation, 139
Random effects model, 218
Random sample, 25
Random variation, 10
Randomization, 73, 139

Randomized controlled trials, 134, 135
  alternatives to, 147–148
  assessment of outcomes, 142–143, 143t
  blinding, 141–142, 141
  cluster, 145
  comparison groups, 138, 138
  crossover, 146
  differences arising after, 139–141
  ethics, 135
  intervention, 136, 138
  limitations of, 147
  non-inferiority, 145
  versus observational studies, 148
  sampling, 135–136, 137
  variations on, 145–146
Randomized trials, 156
Range, 34
Rare events, detecting, 185, 185
Reading journals, 233–234, 233, 234t
Recall bias, 86
Receiver operator characteristic (ROC) curve, 57, 114–115, 115
Reference standard, 109
Referral process, in increasing the pretest probability of disease, 120–122
Regression to the mean, 45–46
Relative risk, 68, 68t
Reliability, 34, 35
Reproducibility, 34
Responsiveness, 34–35
Restriction, 73
Retrospective/historical cohort studies, 63
Reverse causation, 147
Reversible associations, 201, 201
Reviews
  narrative, 209
  systematic, 210
  traditional, 209
Risk
  absolute, 67–68, 67t, 68t
  attributable, 68, 68t
  confounding, 71–72
  control of, 72–76
  controlling for extraneous variables, 88–89
  defined, 50
  difference, 67t, 68
  effect modification, 76–77
  of false-positive result, 166–167
  interpreting attributable and relative, 68–69, 69t
  of negative labeling effect, 167
  observational studies and cause, 76
  odds ratio, 87–88, 87
  overdiagnosis (pseudo disease) in cancer screening, 167–169
  population, 69–71
  population attributable, 69
  predicting, 54–56
  ratio, 67t, 68
  recognizing, 51–54
  relative, 67t, 68, 68t
  simple descriptions of, 71
  studies of, 61–67
  taking other variables into account, 71
  ways to express and compare, 67–71
Risk assessment tool, 54
Risk difference, 68
Risk factors, 51
  causality and, 53–54
  to choose treatment, 58
  common exposure to, 52
  in establishing pretest probability for diagnostic testing, 58
  to predict risk, 54
  to prevent disease, 59
Risk prediction tool, 54
  calibration, 56
  clinical uses of, 58–59
  discrimination, 56
  risk stratification, 57
  sensitivity and specificity of, 56–57, 57
Risk ratio, 68
Risk stratification, 54
  for screening programs, 58
Run-in period, 141

S
Safety, 157
Sample, 6, 6
Sample size, 181
Sampling bias, 12, 103
Sampling fraction, 25, 36
Scales, 33
Scientific misconduct, 226
Screening, 153
  acceptable to patients and clinicians, 166
  changes in, 169
  low positive predictive value, 164
  safety, 165–166
  sensitivity, 163
    calculating, 163–164, 164t
  simplicity and low cost, 164–165
  specificity, 163
Secondary prevention, 154
  treatments in, 158
Selection bias, 7, 14, 16, 237
Sensitive tests, use of, 113
Sensitivity, 111, 112
  defined, 113
  establishing, 115–117
  of risk prediction tool, 56–57
  trade-offs between, 113, 113t
Sensitivity analysis, 104–105
Serial likelihood ratios, 128–129
Serial testing, 128, 128
Shared decision making, 12
Single-blind, 142
Single causes, 195, 195t
Skewed distribution, 39
Specificity, 111, 112, 202
  defined, 113
  establishing, 115–117
  of risk prediction tool, 56–57
  trade-offs between, 113, 113t
Spectrum of patients, 116
Stage migration, 96
Standardization, 75
Statistical power, 181, 184–185
Statistical precision, 183
Statistical tests, 176, 178–179, 178t
Statistically significant, 176, 177
Stratification, 74
Stratified randomization, 139
Structured abstract, 234, 234t
Subgroup analysis, 187–189, 188, 189t
Subgroups, 146
Superiority trials, 145
Surveillance, 155
Surveys, 21
Survival analysis, 97
Survival curves, 98–100, 99
  interpreting, 100
Survival of a cohort, 97–98, 99
Survival rate, 20
Systematic review, 210–216, 210t, 212t

T
Tertiary prevention, 154
  treatments in, 158
Test set, 102
Time, distribution of disease by, 26–27, 26
Time-series studies, 202, 203
Time-to-event analysis, 99
Training set, 102
Treatment
  allocating, 139, 139t
  effectiveness trials in, 143–144, 143
  efficacy trials in, 143–144, 143
  equivalence trial, 145
  explanatory trials, 144–145, 144
  ideas and evidence, 132–134
  intention-to-treat, 144–145, 144
  non-inferiority trials, 145
  observational studies of interventions, 147–148
  phases of clinical trials, 148–149
  randomized controlled trials, 134, 135
    alternatives to, 147–148
    assessment of outcomes, 142–143, 143t
    blinding, 141–142, 141
    comparison groups, 138, 138
    differences arising after, 139–141
    ethics, 135
    intervention, 136, 138
    limitations of, 147
    versus observational studies, 148
    sampling, 135–136, 137
    variations on, 145–146
  studies of, 134
  superiority trials, 145
  tailoring the results of trials to individual patients, 146–147
Treatment effect, studies of, 134
Trials of N = 1, 146–147
True negative, 109
True positive, 109
Two-tailed, 179

Type 1 () error, content, 33 resulting from biologic


176 Type II () criterion, 33–34 differences, 36–37, 37
error, 176 external, 11, 11, 14, 237 resulting from measurement, 35–36, 35t
internal, 11, 11 total, 37, 38
U Variables, 6
Umbrella matching, 85 confounding, 72
Unmeasured confounders, 76 dependent, 6
W
Usual Care, 138 Web of causation, 195
extraneous, 6, 71
White coat hypertension, 8, 8
Variation, 35
V biologic, 36–37, 37
Validation, 102 effects of, 37–38 Z
Validity, 33–34, 33t, 35 measurements, 35–36, 35t Zero time, in prognostic studies, 96
construct, 34
