Common Biostatistical Problems
And How to Avoid Them
Peter Bacchetti
Biostatistical Methods III
April 19 and 21, 2005
Review from Tuesday
Do you have to report anything with p>0.05 as supporting (or proving) no effect? There seemed to be a concern on
Tuesday that editors might regard anything other than a “negative” interpretation of a result with p>0.05 as unwarranted and unallowable. Some
editors may think this way, but if they do, they are wrong.
No. This “rule” is based on bad reasoning:
- Unless a finding is conclusively positive, it supports no effect. This is a false dichotomization of what a study can
find. The real spectrum of possible results is much richer (or messier), and studies can lean in either direction while not being conclusive
one way or the other.
- It is better to be “conservative” than accurate. Giving a false impression that results support no effect is not desirable.
Being conservative by recognizing when results are not conclusive is fine, but so-called conservatism that overstates (or completely misstates)
the evidence for a negative conclusion is not rigorous and not appropriate.
Make sure interpretation accurately reflects estimates, CI’s, and p-values. Proper interpretation can accurately reflect not
only the p-value, but also the other information that the data provide. See pp. 63-65 of the Vittinghoff text and Tuesday’s annotated slides.
Consider the study of vitamin E supplementation we looked at on Tuesday.
Vitamin E study from JAMA
Cancer death RH: 0.88 (0.71-1.09), p=0.24. Suppose that this was the most relevant analysis and a true effect of 0.88 (12%
reduction in risk of fatal cancer) would be important. This would of course imply that the lower bound of 0.71 (29% reduction) would also be
important.
Interpretation: Vitamin E does not prevent cancer. This was the interpretation given in the paper. It claims to have ruled
out any important effect, contradicting the point estimate and especially the lower end of the confidence interval.
A more accurate interpretation:
Alternative: “The estimated effect was large enough to be important in terms of public health and supportive of
an anti-oxidant mechanism in cancer prevention. Unfortunately, this modest evidence for effectiveness was far
from conclusive, even in this large long-term study, because the confidence interval included no effect and
extended to a moderately harmful effect.”
Which interpretation exaggerates how much was learned from the study? Clearly, the claim of definitive proof of no effect is not “conservative” in
any desirable sense.
Some editors or reviewers might balk at the statement that results with p>0.05 nevertheless favor a positive finding. This would probably reflect an
exclusive focus on the p-value, ignoring the fact that the point estimate is the value most supported by the data. Requiring that these results be
interpreted as supporting (or even proving) no effect would therefore be incorrect. The better approach is to accurately state which way the evidence
leans, while also noting that it is not conclusive. There is no “rule” to prevent this. Science is not a game, and whatever best promotes accuracy is
allowable.
Technical Issues
Errors in the data Even the best statistical analyses and interpretations can go wrong if there are errors in the data.
No data checking Unfortunately, even some large coordinating centers produce data sets riddled with errors. Effort to minimize data
errors is usually worthwhile, because errors can have unpredictable and substantial consequences.
Selective data checking When it is not feasible to carefully check all data, selective checking may still be worthwhile. But care is
needed to avoid introducing bias. For example, checking only the largest values of a variable might eliminate only errors where the value was
recorded as larger than it really was, while leaving the opposite errors. This could result in the remaining errors being more systematic than the entire
original set of errors, introducing bias. I once reviewed a paper where the main interpretation was inconsistent with the key results (comparison of
two Kaplan-Meier curves). In the revision, the curves had changed (with no acknowledgement or mention of why) to be more consistent with the
interpretation. In a subsequent revision, the same thing happened again. This could have resulted from their checking only the data that were
causing problems for their desired interpretation. So when you must be selective, try to be symmetric or unbiased in what you check, such as
checking both large and small outliers. The fewer errors you start with, or the closer you get to complete checking, the less this will matter.
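For example, a symmetric check in Stata might list the extremes at both ends of a variable; the variable names id and vload here are made up:
    * List the 5 lowest and 5 highest values so checking covers both tails.
    sort vload
    list id vload in 1/5
    gsort -vload
    list id vload in 1/5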
Missing data
If much, worry about why Some causes of missing data can produce severe bias in the data that remain. For example, those with the
riskiest behavior may be more inclined to skip questions about risk factors. The more that is missing, the more likely there is a problem with what
remains. Think about why participants may have declined to provide the information. You can also compare those who responded to those who
didn’t on demographics and other variables that are more complete. This can provide clues about why some people didn’t respond and what impact
that may have on your results.
Deleted cases accumulate with added predictors For multivariate modeling, the default in Stata and other programs is to
delete all observations that have missing values for any predictor or the outcome. (For stepwise selection, observations missing any candidate
predictor are deleted, even if the candidate is never used.) This means that smaller models may have more observations than larger ones. Keep this
in mind if you need to compare the results of such models, such as when you are examining how an effect of interest changes when controlled for
other predictors. In these cases, you should fit the models to the same set of observations, even though the smaller models could utilize more.
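One way to do this in Stata is to fit the larger model first and then restrict the smaller model to the same observations using the estimation-sample indicator e(sample); the variable names here are hypothetical:
    regress sbp age bmi smoking
    * e(sample) marks the observations used in the model just fit.
    regress sbp age if e(sample)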
Consider alternatives to deletion You can make “missing” a separate category in your models, thereby preserving the observations
with missing values for that predictor. (You can also do this for numeric predictors, but it is trickier.) The interpretation of the effect of being
“missing” is difficult, because there may be selection bias in who is missing and the model treats all those with “missing” as the same, even though
their real values probably differ. But preserving all the good data in those observations may be worthwhile. Stata has an “impute” command that
allows you to fill in missing values with a best guess based on other non-missing variables. This may work well, but treating imputed data as if it
were known will tend to overstate the accuracy of your results. A strategy called “multiple imputation” can avoid such overstatements, but this is a
technically complicated approach.
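A minimal sketch of the “missing as a separate category” idea for a hypothetical categorical predictor smoking and outcome died (xi: is the older Stata syntax for expanding indicator variables):
    gen smoke4 = smoking
    replace smoke4 = 9 if missing(smoking)   // 9 = separate "missing" category
    xi: logistic died i.smoke4 age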
Unchecked assumptions
- Normality (of residuals), outliers
It can happen that an outcome variable appears non-normal, but predictors in a multiple regression model can explain enough so that the remaining
unexplained residuals appear normal. The Stata sktest command can check normality. T-tests technically are based on a normality assumption, but
outliers are what really mess them up. Plots are often helpful for showing outliers. Deleting outliers, or changing them to smaller values, can be very
dangerous. This changes the data and may introduce bias, so other strategies are usually preferable (transformation or bootstrapping). It may be OK
to report results based on deleting or changing outliers as confirmatory analyses.
- Linearity for numeric predictors
The linearity assumption can be checked by adding the square of the predictor to the model, or by breaking the predictor into categories. Scatterplots
and smoothing methods are also useful.
- Proportional hazards assumption
The Stata stphtest and stphplot commands can be used to check this assumption. Covariates that appear to violate proportionality can be controlled
by stratification, or the non-proportionality can be modeled using time-dependent covariates defined as the product of the covariate with time itself
(or log of time itself).
- Interactions
Check these by adding interaction terms to the model. Usually, only suspected or plausible two-way interactions are examined, because the number
of interactions is so large (particularly if 3-way interactions are considered) and they are hard to estimate accurately.
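A minimal Stata sketch of the checks just described, using made-up variable names (sbp, age, bmi, futime, died); the exact commands available depend on your Stata version:
    * Normality of residuals after a linear regression:
    regress sbp age bmi
    predict resid, residuals
    sktest resid
    * Linearity: add a squared term and test whether it is needed.
    gen age2 = age^2
    regress sbp age age2 bmi
    test age2
    * Proportional hazards (data must be stset first):
    stset futime, failure(died)
    stcox age bmi
    stphtest, detail
    * A two-way interaction, tested as a product term:
    gen age_bmi = age*bmi
    regress sbp age bmi age_bmi
    test age_bmi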
Ignoring dependence in the data
- Unpaired summaries and tests for paired data An elementary mistake, but not unheard of.
- Ignoring clustering (by provider, hospital, etc)
Outcomes among different patients in a study may not always be completely independent as assumed by simple methods. Different patients of the
same provider may fare more alike than patients from two different providers. In fact, this will usually be the case, because providers do not give
completely standardized care. Likewise, there are often hospital-to-hospital differences, or many other possible sources of dependence or clustering.
- Repeated measures
Different measures from the same person are usually more alike than measures from two different people. And measures closer together in time are
often more alike than measures farther apart in time.
Later in class you will learn about methods for correctly analyzing clustered and repeated measures data.
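As a preview only, one simple option is a cluster-robust variance estimate, which allows for correlation among patients of the same provider; this is not necessarily the method you will learn later, and the variable names (los for length of stay, provider) are made up:
    regress los age bmi, cluster(provider)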
Vague descriptions of methods
“Repeated measures ANOVA”
As you will see later in class, this could mean a few different things.
“Predictors were chosen by stepwise selection”
There are many variations on this general strategy. (It is also not always a good method for picking predictors.)
In general, you should try to describe your methods specifically enough that someone else with your data set could reproduce the same results.
Poor description of survival analyses
Provide:
- Operational definitions of starting time, occurrence of event, and censoring time
This is neglected with surprising frequency. Any analysis of time to an event needs to be clear about the time from when to when.
- How events were ascertained
This is important for establishing the completeness of ascertainment, and sometimes for explaining clumps of events (e.g., if many were found at a
scheduled 6 month visit).
- Summaries of followup among those censored
Followup is complete for anyone who had the event; whether the event occurred at 2 days or 5 years, we know all we need about that person’s
outcome. The amount of followup matters for those who did not have the event, especially the minimum followup and/or the number of subjects
with shorter than desired followup. Mixing the early events into summaries of followup times obscures this information (a brief sketch of such a
summary appears after this list).
- Summaries of early loss to followup and reasons
Censoring due to loss to followup is more likely to violate the assumption of non-informative censoring, so this is a particular concern that should be
addressed separately from observations censored just due to the planned end of the study or observation period.
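A hedged Stata sketch of these follow-up summaries; the variable names futime, event, and censor_reason are made up, with time in days:
    * Summarize follow-up only among those censored, not mixing in the events.
    summarize futime if event==0, detail
    * Reasons for censoring among those censored within the first year:
    tab censor_reason if event==0 & futime < 365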
Mean ± SD when data are non-normal
Readers will tend to interpret these summaries as if the data were normal, so using these summaries in other situations can produce confusion.
Use median and range (and quartiles)
These are often acceptable for summarizing non-normal values in a study population.
Or geometric mean with CI
This is sometimes a useful summary of non-normal data; it is based on log-transformed values (log transform, take the mean, then take the antilog
to get the geometric mean on the original scale).
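A hedged sketch for a hypothetical skewed variable vload; note that ci is shown in the older one-word syntax (newer Stata versions use ci means):
    * ameans reports arithmetic, geometric, and harmonic means with CIs.
    ameans vload
    * Or by hand: log-transform, get the mean and CI, then exponentiate.
    gen logvl = ln(vload)
    ci logvl
    display "geometric mean = " exp(r(mean))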
SD vs SE Although there is often confusion about which is appropriate to present, these address different issues:
SD to show variability in a population
SE to show uncertainty around an estimate
Pick the one that shows what matters to the point you want to make.
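A one-line illustration with a hypothetical variable sbp: summarize returns the SD as r(sd), and the SE of the mean is the SD divided by the square root of N.
    summarize sbp
    display "SD = " r(sd) "   SE of mean = " r(sd)/sqrt(r(N))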
Too little or too much precision
OR=0.3, OR=2.537
P=0.01, P=0.4275
In general, too little precision may leave the reader wondering about the exact magnitude. For example, p=0.01 could mean anything from 0.005 to
0.015, which is a pretty wide range. Extra precision is not directly harmful, but gives a spurious impression of how precise the results are. It also can
look naïve, giving an astute reader or reviewer the impression that you do not know what is important and what is not.
Give OR’s to two decimals if <2.0, one if >2.0
This may err on the side of giving too much precision in that smaller OR’s (say, down to 1.5) are often given to only one decimal, but I think this is a
reasonable proposal. The same would apply for relative risks and hazard ratios.
Give p-values to two significant digits (leading 0’s don’t count), to a maximum of four decimal places. Do not use P< for
values of 0.0001 or more; use P=. That is, don’t say p<0.01 when you could say p=0.0058.
P=0.13, P=0.013, P=0.0013, P=0.0001, P<0.0001
This also may err on the side of sometimes giving a little more precision than is needed, but it is not too excessive. P-values are often limited to three
decimals, with p<0.001 being the smallest reported. This may sometimes be unavoidable when software only produces 3 decimals.
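If you want to automate the rounding to two significant digits, a small Stata sketch (the p-value shown is made up) is:
    * Round a p-value to two significant digits before reporting it.
    local p = 0.005821
    display round(`p', 10^(floor(log10(`p')) - 1))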
Never use “p=NS” or “p>0.1” This gives needlessly vague information and encourages problem #1 from Tuesday.
Do not show χ2 or other statistics that provide the same information as p-values (but are less interpretable)
These add no information and clutter presentation of results. They may seem to add some technical cachet, but leaving out unimportant details
actually conveys a better impression of technical savvy (to me, at least).
Poorly scaled numeric predictors
Age in years
CD4+ cell count
In regression models, the coefficients for numeric predictors are the estimated effects of a 1-unit increase in the predictor. So if the age variable is in
years, the estimated effect is for a 1-year increase in age, which is often too small to be readily interpretable. When a 1-unit increase is a very small
amount, estimated coefficients will necessarily be very close to zero and results will be hard to interpret.
These give estimated OR’s (or RR’s or HR’s) very close to 1.0.
OR 1.0051 for each 1 cell/mm³ increase
These are defined as exp(coefficient). Because coefficients will be close to zero, these will be close to 1.0. An OR of 1.0051 per 1 cell/mm³ increase
in CD4 count is very hard to interpret. It is also hard to rescale this by eye, because the OR for a 100 cell increase is (1.0051)^100, which most people
cannot calculate in their heads.
Rescale numeric predictors to make results interpretable
Age in decades (effect per 10 year increase in age) Make a new variable age10=age/10 and use that as a predictor
CD4/100 (effect per 100 cell increase) Make a new variable cd4per100=cd4/100
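In Stata this rescaling is just the following (the outcome variable died and the model shown are hypothetical):
    gen age10 = age/10          // effect per 10-year increase in age
    gen cd4per100 = cd4/100     // effect per 100-cell increase in CD4
    logistic died age10 cd4per100
    * Check by hand: the per-100-cell OR is the per-cell OR to the 100th power.
    display 1.0051^100          // about 1.66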
Figures with low information content
[Figure: bar chart of Risk (y-axis 0 to 25) for Women and Men]
This is an extreme example: the graph only shows two numbers. In some fields, low-info figures may be necessary for clarity or visual impact, but
including more information is still usually preferable. For example, add confidence interval bars and p-values, or find a way to show a cross-
classification all on one graph. The best use of figures is when they can clearly show information that is impractical to give in text or tables.
Terms likely to be misread Don’t needlessly give readers and reviewers the opportunity to misunderstand what you mean.
Use “Mann-Whitney” instead of “Wilcoxon” or “Wilcoxon rank-sum”
- Could be confused with “Wilcoxon signed-rank”
Both “Mann-Whitney” and “Wilcoxon rank-sum” are used, due to the near-simultaneous, independent development of the method.
Use “Relative Hazard” or “Hazard Ratio” instead of “Relative Risk” for proportional hazards model results
- Could be confused with analysis of a binary outcome
Avoid use of “significant”. Use “statistically significant” if meaning p<0.05, and use “important” or
“substantial” if that is the intended meaning.
As noted under Problem #1, some journals reserve “significant” alone to mean “statistically significant”. If they do not allow the full term, just avoid
using it (this may be a good strategy anyway).
Conclusion
Try to avoid these problems and follow these guidelines
(Or be clear on why your case is an exception) It may be, but check this carefully and explain clearly
Take advantage of the faculty help that is available
Reviewing
The remaining slides concern what to do when you are peer reviewing others’ work. This is not directly relevant to your written projects, but there
are some important statistical issues.
Statistical problems very common
Some papers documenting the frequent poor quality of statistical methods in published research are:
Altman DG. Statistical reviewing for medical journals. Stat Med 1998;17:2661-74.
Goodman SN, Altman DG, George SL. Statistical reviewing policies of medical journals: Caveat lector? J Gen Intern Med 1998;13:753-6.
Look for problems discussed here Unless my experience has been wildly atypical, you will see lots of these problems in many papers.
• Provide constructive guidance on how to fix If you can spot a problem, you probably can also describe how to fix it. This can
rescue a paper that would have been worthless or even counterproductive and make it valuable, so constructive criticism can be very
worthwhile. This probably won’t result in any tangible reward or credit, but it is still worth doing.
• Recommend statistical review if in doubt If you are not sure whether there is a statistical problem or how to fix it,
recommending more detailed assessment by a statistical reviewer is probably better than making your best guess or raising the issue without
providing constructive guidance. Statistical review is not a panacea, because statisticians make mistakes too, even about statistical issues.
But at least you will have done the right thing.
Avoid imperative to “find the flaw” Unfortunately, some reviewers gravitate to statistical issues when they want to find something to
criticize. Please do not add to the problem of spurious criticism on statistical issues. A reference on this phenomenon is:
Bacchetti P. Peer review of statistics in medical research: the other problem. Br Med J, 324:1271-1273, 2002.
Especially
• Multiple comparisons Reviewers often ask for multiple comparisons adjustments even though the multiple analyses all reinforce each
other scientifically, such as a treatment showing benefit on a number of different outcome measures. Another example of how results may
reinforce each other is on page 12 of Tuesday’s annotated slides.
• Sample size
These two areas can almost always be criticized and so are favorites with those who want to criticize but cannot find any more cogent points to raise.
Sample Size
This is mainly an issue for study proposals rather than completed studies.
There are a number of severe problems with current conventions and expectations concerning sample size planning for clinical research. Some
statisticians (me, for example) therefore find dealing with sample size to be difficult and demoralizing. When you are a reviewer, you have the
power to choose not to enforce these problematic expectations and conventions.
When reviewing, do not worry about whether sample size is too small.
This may seem like a very radical proposal, but I will attempt to justify it below. Because so much is wrong with how sample size planning is
currently done and evaluated, I believe that this is actually the best strategy until better methods and standards are developed. The only exception I
would make would be in the case where a proposal has severely exaggerated what it is likely to be able to establish or accomplish.
A big advantage of not worrying about sample size is that you will have more time to think about more important issues.
The main reasons for not criticizing the sample size of proposed studies are:
Current conventions and standards have severe problems
Criticism of “inadequate” or suspect sample size will generally not improve studies
Also not relevant for selecting the best
Larger studies produce more knowledge, but
• Cost more
• Burden more participants
This is a huge blind spot in current sample size thinking. There is always much concern about whether sample size is large enough, but never any
accounting for the drawbacks of larger studies. One publication related to this blind spot has appeared: Bacchetti P, Wolf LE, Segal MR, McCulloch
CE. Ethics and sample size. Am J Epidemiol, 161:105-110, 2005. Another has been submitted.
An innovative study with 50% power is at least 5/8 as valuable as with 80% (and may cost <5/8 as much)
Some colleagues and I have been studying different ways of projecting a study’s value as a function of sample size, and using power is the least
favorable toward innovative studies with small sample sizes. Alternatives based on confidence interval width or Bayesian formulations are more
favorable. In addition, sample size may have been chosen as the maximum possible without hitting a cost barrier, such as the need to open a second
site. In such cases, a study with 50% power may cost less than half what one with 80% power would. It also requires only half the sample size and
so will burden only half as many participants. In such situations, requiring that investigators double their sample size in order to have 80% power
would actually reduce the projected value produced per dollar spent. So this criticism might not really improve the study.
80% power not imperative
• 80+ does not ensure success
• <80 does not doom to failure
This seems obvious when said out loud, but reviewers usually assume that 70% power is “inadequate”. Because the assumptions used in calculating
power are so uncertain, a reviewer can easily argue that a study is no good by proposing alternative assumptions that lead to <80% power.
Power calculations very assumption-dependent Although proposals are generally expected to prove that they will have at least 80%
power, the information needed to be certain about this is not available before (or even after) the study.
• Variability If we do not yet know the mean responses to treatments, we probably also do not know the variability. But if our guess about the
variance is 2-fold too low, then our sample size will be only half what we need to produce the calculated power. Unfortunately, variances are
hard to estimate accurately. A pilot study with 10 observations will produce a CI for the variance that has about a 7-fold range from its lower to its
upper end, while a pilot with 30 observations still leaves about a 3-fold range (see the calculation sketch after this list). For studies with yes/no outcomes, this is less of a problem.
• True size of difference Some advocate that power calculations should be done using the minimum “clinically meaningful” effect size,
but there is no good justification for this. This does not ensure that a “negative” result will be definitive, because we will not be using the
poor reasoning discussed under problem #1. The actual size of the effect being studied is a key determinant of how valuable the study is
likely to end up being, but the study is only worth doing because this true effect is unknown.
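To see where the 7-fold and 3-fold figures in the first bullet come from: the ratio of the upper to the lower limit of the usual chi-squared confidence interval for a variance depends only on the sample size, so it can be checked directly in Stata:
    * ratio of CI limits for a variance = invchi2(n-1, .975)/invchi2(n-1, .025)
    display invchi2(9, .975)/invchi2(9, .025)     // n=10: about 7.0
    display invchi2(29, .975)/invchi2(29, .025)   // n=30: about 2.8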
Suggestive but inconclusive results can be valuable Even if a study has p>0.05 for its main analysis, the information it provides can
still be valuable (as long as the result is not misinterpreted as in Problem #1, above). A reference is:
Edwards SJL, Lilford RJ, Braunholtz D, Jackson J. Why “underpowered” trials are not necessarily unethical. Lancet 1997; 350:804-7.
Larger sample size may not be an option
(is a sample size of zero really preferable?)
Investigators frequently propose the largest sample size they can practically do. Condemning this as “inadequate” will then mean that the study will
not be done at all. If the study was otherwise promising and innovative, this would be a needless loss.
For completed studies, do not insist that sample size calculations be reported
• Keep focus on the precision actually obtained, shown directly by confidence intervals
Guidelines unfortunately call for presentation of a priori power calculations when presenting completed trials. This inevitably encourages the
discredited practice of using p-values and power to establish negative results. There is no reason to participate in perpetuating this problem. A
reference is: Bacchetti P. Author’s thoughts on power calculations (letter). Br Med J, 325: 492-493, 2002.
Only when reviewing!!
Many people in a position to accept or reject your proposals and papers would regard what I have said above as complete heresy. So I am not
recommending that you completely abandon current conventions and standards, only that you refrain from enforcing them when you can do so
without penalty. This will usually be the case when you are a reviewer.