Common Errors in Statistics (and How to Avoid Them)
By Phillip I. Good and James W. Hardin
About this ebook
Praise for Common Errors in Statistics (and How to Avoid Them)
"A very engaging and valuable book for all who use statistics in any setting."
—CHOICE
"Addresses popular mistakes often made in data collection and provides an indispensable guide to accurate statistical analysis and reporting. The authors' emphasis on careful practice, combined with a focus on the development of solutions, reveals the true value of statistics when applied correctly in any area of research."
—MAA Reviews
Common Errors in Statistics (and How to Avoid Them), Fourth Edition provides a mathematically rigorous, yet readily accessible foundation in statistics for experienced readers as well as students learning to design and complete experiments, surveys, and clinical trials.
Providing a consistent level of coherency throughout, the highly readable Fourth Edition focuses on debunking popular myths, analyzing common mistakes, and instructing readers on how to choose the appropriate statistical technique to address their specific task. The authors begin with an introduction to the main sources of error and provide techniques for avoiding them. Subsequent chapters outline key methods and practices for accurate analysis, reporting, and model building. The Fourth Edition features newly added topics, including:
- Baseline data
- Detecting fraud
- Linear regression versus linear behavior
- Case control studies
- Minimum reporting requirements
- Non-random samples
The book concludes with a glossary that outlines key terms, and an extensive bibliography with several hundred citations directing readers to resources for further study.
Presented in an easy-to-follow style, Common Errors in Statistics, Fourth Edition is an excellent book for students and professionals in industry, government, medicine, and the social sciences.
Part I
FOUNDATIONS
Chapter 1
Sources of Error
Don’t think—use the computer.
Dyke (tongue in cheek) [1997].
"We cannot help remarking that it is very surprising that research in an area that depends so heavily on statistical methods has not been carried out in close collaboration with professional statisticians," the panel remarked in its conclusions.
—From the report of an independent panel looking into Climategate¹
STATISTICAL PROCEDURES FOR HYPOTHESIS TESTING, ESTIMATION, AND MODEL building are only a part of the decision-making process. They should never be quoted as the sole basis for making a decision (yes, even those procedures that are based on a solid deductive mathematical foundation). As philosophers have known for centuries, extrapolation from a sample or samples to a larger, incompletely examined population must entail a leap of faith.
The sources of error in applying statistical procedures are legion and include all of the following:
1.
a) Relying on erroneous reports to help formulate hypotheses (see Chapter 9)
b) Failing to express qualitative hypotheses in quantitative form (see Chapter 2)
c) Using the same set of data both to formulate hypotheses and to test them (see Chapter 2)
2.
a) Taking samples from the wrong population or failing to specify in advance the population(s) about which inferences are to be made (see Chapter 3)
b) Failing to draw samples that are random and representative (see Chapter 3)
3. Measuring the wrong variables or failing to measure what you intended to measure (see Chapter 4)
4. Using inappropriate or inefficient statistical methods. Examples include using a two-tailed test when a one-tailed test is appropriate and using an omnibus test against a specific alternative (see Chapters 5 and 6).
5.
a) Failing to understand that p-values are functions of the observations and will vary in magnitude from sample to sample (see Chapter 6)
b) Using statistical software without verifying that its current defaults are appropriate for your application (see Chapter 6)
6. Failing to adequately communicate your findings (see Chapters 8 and 10)
7.
a) Extrapolating models outside the range of the observations (see Chapter 11)
b) Failing to correct for confounding variables (see Chapter 13)
c) Using the same data to select variables for inclusion in a model and to assess their significance (see Chapter 13)
d) Failing to validate models (see Chapter 15)
But perhaps the most serious source of error lies in letting statistical procedures make decisions for you.
In this chapter, as throughout this text, we offer first a preventive prescription, followed by a list of common errors. If these prescriptions are followed carefully, you will be guided to the correct, proper, and effective use of statistics and avoid the pitfalls.
PRESCRIPTION
Statistical methods used for experimental design and analysis should be viewed in their rightful role as merely a part, albeit an essential part, of the decision-making procedure.
Here is a partial prescription for the error-free application of statistics.
1. Set forth your objectives and your research intentions before you conduct a laboratory experiment, a clinical trial, or a survey, or before you analyze an existing set of data.
2. Define the population about which you will make inferences from the data you gather.
3.
a) Recognize that the phenomena you are investigating may have stochastic or chaotic components.
b) List all possible sources of variation. Control them or measure them to avoid their being confounded with relationships among those items that are of primary interest.
4. Formulate your hypotheses and all of the associated alternatives. (See Chapter 2.) List possible experimental findings along with the conclusions you would draw and the actions you would take if this or another result should prove to be the case. Do all of these things before you complete a single data collection form, and before you turn on your computer.
5. Describe in detail how you intend to draw a representative sample from the population. (See Chapter 3.)
6. Use estimators that are impartial, consistent, efficient, robust, and minimum loss. (See Chapter 5.) To improve results, focus on sufficient statistics, pivotal statistics, and admissible statistics, and use interval estimates. (See Chapters 5 and 6.)
7. Know the assumptions that underlie the tests you use. Use those tests that require the minimum of assumptions and are most powerful against the alternatives of interest. (See Chapter 6.)
8. Incorporate in your reports the complete details of how the sample was drawn and describe the population from which it was drawn. If data are missing or the sampling plan was not followed, explain why and list all differences between data that were present in the sample and data that were missing or excluded. (See Chapter 8.)
FUNDAMENTAL CONCEPTS
Three concepts are fundamental to the design of experiments and surveys: variation, population, and sample. A thorough understanding of these concepts will prevent many errors in the collection and interpretation of data.
If there were no variation, if every observation were predictable, a mere repetition of what had gone before, there would be no need for statistics.
Variation
Variation is inherent in virtually all our observations. We would not expect outcomes of two consecutive spins of a roulette wheel to be identical. One result might be red, the other black. The outcome varies from spin to spin.
There are gamblers who watch and record the spins of a single roulette wheel hour after hour hoping to discern a pattern. A roulette wheel is, after all, a mechanical device and perhaps a pattern will emerge. But even those observers do not anticipate finding a pattern that is 100% predetermined. The outcomes are just too variable.
Anyone who spends time in a schoolroom, as a parent or as a child, can see the vast differences among individuals. This one is tall, that one short, though all are the same age. Half an aspirin and Dr. Good’s headache is gone, but his wife requires four times that dosage.
There is variability even among observations on deterministic formula-satisfying phenomena such as the position of a planet in space or the volume of gas at a given temperature and pressure. Position and volume satisfy Kepler’s Laws and Boyle’s Law, respectively (the latter over a limited range), but the observations we collect will depend upon the measuring instrument (which may be affected by the surrounding environment) and the observer. Cut a length of string and measure it three times. Do you record the same length each time?
In designing an experiment or survey we must always consider the possibility of errors arising from the measuring instrument and from the observer. It is one of the wonders of science that Kepler was able to formulate his laws at all given the relatively crude instruments at his disposal.
Deterministic, Stochastic, and Chaotic Phenomena
A phenomenon is said to be deterministic if given sufficient information regarding its origins, we can successfully make predictions regarding its future behavior. But we do not always have all the necessary information. Planetary motion falls into the deterministic category once one makes adjustments for all gravitational influences, the other planets as well as the sun.
Nineteenth century physicists held steadfast to the belief that all atomic phenomena could be explained in deterministic fashion. Slowly, it became evident that at the subatomic level many phenomena were inherently stochastic in nature, that is, one could only specify a probability distribution of possible outcomes, rather than fix on any particular outcome as certain.
Strangely, twenty-first century astrophysicists continue to reason in terms of deterministic models. They add parameter after parameter to the lambda cold-dark-matter model hoping to improve the goodness of fit of this model to astronomical observations. Yet, if the universe we observe is only one of many possible realizations of a stochastic process, goodness of fit offers absolutely no guarantee of the model’s applicability. (See, for example, Good, 2012.)
Chaotic phenomena differ from the strictly deterministic in that they are strongly dependent upon initial conditions. A random perturbation from an unexpected source (the proverbial butterfly’s wing) can result in an unexpected outcome. The growth of cell populations has been described in both deterministic (differential equations) and stochastic terms (birth and death process), but a chaotic model (difference-lag equations) is more accurate.
Population
The population(s) of interest must be clearly defined before we begin to gather data.
From time to time, someone will ask us how to generate confidence intervals (see Chapter 8) for the statistics arising from a total census of a population. Our answer is no, we cannot help. Population statistics (mean, median, and thirtieth percentile) are not estimates. They are fixed values and will be known with 100% accuracy if two criteria are fulfilled:
1. Every member of the population is observed.
2. All the observations are recorded correctly.
Confidence intervals would be appropriate if the first criterion is violated, for then we are looking at a sample, not a population. And if the second criterion is violated, then we might want to talk about the confidence we have in our measurements.
Debates about the accuracy of the 2000 United States Census arose from doubts about the fulfillment of these criteria.² "You didn't count the homeless," was one challenge. "You didn't verify the answers," was another. Whether we collect data for a sample or an entire population, both these challenges or their equivalents can and should be made.
Kepler's laws of planetary movement are not testable by statistical means when applied to the original planets (Jupiter, Mars, Mercury, and Venus) for which they were formulated. But when we make statements such as "Planets that revolve around Alpha Centauri will also follow Kepler's Laws," then we begin to view our original population, the planets of our sun, as a sample of all possible planets in all possible solar systems.
A major problem with many studies is that the population of interest is not adequately defined before the sample is drawn. Do not make this mistake. A second major problem is that the sample proves to have been drawn from a different population than was originally envisioned. We consider these issues in the next section and again in Chapters 2, 6, and 7.
Sample
A sample is any (proper) subset of a population. Small samples may give a distorted view of the population. For example, if a minority group comprises 10% or less of a population, a jury of 12 persons selected at random from that population fails to contain any members of that minority at least 28% of the time.
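The jury figure quoted above is a one-line calculation: with a 10% minority share and independent random selection, the chance that none of 12 jurors belongs to the minority is (0.9)¹². A quick sketch in Python (the 10% and 12 come from the text):

```python
# Probability that a 12-person jury, drawn at random from a population
# in which a minority comprises 10%, contains no member of that minority.
p_minority = 0.10
jury_size = 12

p_no_minority = (1 - p_minority) ** jury_size
print(round(p_no_minority, 3))  # 0.282 -- i.e., at least 28% of the time
```

If the minority's share is smaller than 10%, the probability of an all-majority jury is higher still, which is the sense of the text's "10% or less."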
As a sample grows larger, or as we combine more clusters within a single sample, the sample will grow to more closely resemble the population from which it is drawn.
How large a sample must be to obtain a sufficient degree of closeness will depend upon the manner in which the sample is chosen from the population.
Are the elements of the sample drawn at random, so that each unit in the population has an equal probability of being selected? Are the elements of the sample drawn independently of one another? If either of these criteria is not satisfied, then even a very large sample may bear little or no relation to the population from which it was drawn.
An obvious example is the use of recruits from a Marine boot camp as representatives of the population as a whole or even as representatives of all Marines. In fact, any group or cluster of individuals who live, work, study, or pray together may fail to be representative for any or all of the following reasons (Cummings and Koepsell, 2002):
1. Shared exposure to the same physical or social environment;
2. Self selection in belonging to the group;
3. Sharing of behaviors, ideas, or diseases among members of the group.
A sample consisting of the first few animals to be removed from a cage will not satisfy these criteria either, because, depending on how we grab, we are more likely to select more active or more passive animals. Activity tends to be associated with higher levels of corticosteroids, and corticosteroids are associated with virtually every body function.
Sample bias is a danger in every research field. For example, Bothun [1998] documents the many factors that can bias sample selection in astronomical research.
To prevent sample bias in your studies, before you begin determine all the factors that can affect the study outcome (gender and lifestyle, for example). Subdivide the population into strata (males, females, city dwellers, farmers) and then draw separate samples from each stratum. Ideally, you would assign a random number to each member of the stratum and let a computer’s random number generator determine which members are to be included in the sample.
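The stratified-sampling prescription above can be sketched in a few lines. This is a minimal illustration, not a production sampler; the strata (gender by residence) and the per-stratum sample size are invented for the example:

```python
import random

def stratified_sample(population, strata_key, n_per_stratum, seed=None):
    """Draw a simple random sample of n_per_stratum members from each stratum."""
    rng = random.Random(seed)
    strata = {}
    for member in population:
        strata.setdefault(strata_key(member), []).append(member)
    sample = []
    for members in strata.values():
        # Within each stratum, every member has an equal chance of selection.
        sample.extend(rng.sample(members, min(n_per_stratum, len(members))))
    return sample

# Hypothetical population of 100 members: (id, gender, residence).
population = [(i, "F" if i % 2 else "M", "city" if i % 3 else "farm")
              for i in range(100)]
sample = stratified_sample(population, strata_key=lambda m: (m[1], m[2]),
                           n_per_stratum=5, seed=1)
print(len(sample))  # 20 -- five members from each of the four strata
```

Seeding the generator makes the draw reproducible, which is worth recording in any sampling plan you later have to defend.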
SURVEYS AND LONG-TERM STUDIES
Being selected at random does not mean that an individual will be willing to participate in a public opinion poll or some other survey. But if survey results are to be representative of the population at large, then pollsters must find some way to interview nonresponders as well. This difficulty is exacerbated in long-term studies, as subjects fail to return for follow-up appointments and move without leaving a forwarding address. Again, if the sample results are to be representative, some way must be found to report on subsamples of the nonresponders and the dropouts.
AD-HOC, POST-HOC HYPOTHESES
Formulate and write down your hypotheses before you examine the data.
Patterns in data can suggest, but cannot confirm, hypotheses unless these hypotheses were formulated before the data were collected.
Everywhere we look, there are patterns. In fact, the harder we look the more patterns we see. Three rock stars die in a given year. Fold the United States twenty-dollar bill in just the right way and not only the Pentagon but the Twin Towers in flames are revealed.³ It is natural for us to want to attribute some underlying cause to these patterns, but those who have studied the laws of probability tell us that more often than not patterns are simply the result of random events.
Put another way, finding at least one cluster of events in time or in space has a greater probability than finding no clusters at all (equally spaced events).
How can we determine whether an observed association represents an underlying cause-and-effect relationship or is merely the result of chance? The answer lies in our research protocol. When we set out to test a specific hypothesis, the probability of a specific event is predetermined. But when we uncover an apparent association, one that may well have arisen purely by chance, we cannot be sure of the association’s validity until we conduct a second set of controlled trials.
In the International Study of Infarct Survival [1988], patients born under the Gemini or Libra astrological birth signs did not survive as long when their treatment included aspirin. By contrast, aspirin offered apparent beneficial effects (longer survival time) to study participants from all other astrological birth signs. Szydloa et al. [2010] report similar spurious correlations when hypotheses are formulated with the data in hand.
Except for those who guide their lives by the stars, there is no hidden meaning or conspiracy in this result. When we describe a test as significant at the 5% or one-in-20 level, we mean that one in 20 times we will get a significant result even though the null hypothesis is true. That is, when we test to see if there are any differences in the baseline values of the control and treatment groups, if we have made 20 different measurements, we can expect to see at least one statistically significant difference; in fact, we will see this result almost two-thirds of the time. This difference will not represent a flaw in our design but simply chance at work. To avoid this undesirable result—that is, to avoid attributing statistical significance to an insignificant random event, a so-called Type I error—we must distinguish between the hypotheses with which we began the study and those which came to mind afterward. We must accept or reject our initial hypotheses at the original significance level while demanding additional corroborating evidence for those exceptional results (such as a dependence of an outcome on astrological sign) that are uncovered for the first time during the trials.
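The "almost two-thirds" figure can be verified directly, assuming the 20 baseline comparisons are independent:

```python
# Probability of at least one "significant" result among 20 independent
# tests, each conducted at the 5% level, when every null hypothesis is true.
alpha, n_tests = 0.05, 20
p_at_least_one = 1 - (1 - alpha) ** n_tests
print(round(p_at_least_one, 2))  # 0.64 -- almost two-thirds of the time
```

Baseline measurements are rarely fully independent, so the true figure may differ somewhat, but the qualitative lesson stands: multiply the comparisons and spurious significance becomes the expected outcome.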
No reputable scientist would ever report results before successfully reproducing the experimental findings twice, once in the original laboratory and once in that of a colleague.⁴ The latter experiment can be particularly telling, as all too often some overlooked factor not controlled in the experiment—such as the quality of the laboratory water—proves responsible for the results observed initially. It is better to be found wrong in private than in public. The only remedy is to attempt to replicate the findings with different sets of subjects, replicate, then replicate again.
Persi Diaconis [1978] spent some years investigating paranormal phenomena. His scientific inquiries included investigating the powers linked to Uri Geller, the man who claimed he could bend spoons with his mind. Diaconis was not surprised to find that the "hidden powers" of Geller were more or less those of the average nightclub magician, down to and including forcing a card and taking advantage of ad-hoc, post-hoc hypotheses (Figure 1.1).
FIGURE 1.1. Photo of Geller.
(Reprinted from German Language Wikipedia.)
When three buses show up at your stop simultaneously, or three rock stars die in the same year, or a stand of cherry trees is found amid a forest of oaks, a good statistician remembers the Poisson distribution. This distribution applies to relatively rare events that occur independently of one another (see Figure 1.2). The calculations performed by Siméon-Denis Poisson reveal that if there is an average of one event per interval (in time or in space), then more than a third of the intervals will be empty, while at least a quarter of the intervals are likely to include multiple events.
FIGURE 1.2. Frequency plot of the number of deaths in the Prussian army as a result of being kicked by a horse (there are 200 total observations).
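Poisson's figures follow from the distribution's probability mass function, P(k) = e^(−λ)λ^k / k!. With an average of λ = 1 event per interval:

```python
import math

lam = 1.0  # average of one event per interval
p0 = math.exp(-lam)            # probability an interval is empty
p1 = lam * math.exp(-lam)      # probability of exactly one event
p_multiple = 1 - p0 - p1       # probability of two or more events

print(round(p0, 3))            # 0.368 -- more than a third of intervals empty
print(round(p_multiple, 3))    # 0.264 -- about a quarter with multiple events
```

So clusters of events and empty stretches are both the rule, not the exception, even when nothing whatever connects the events.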
Anyone who has played poker will concede that one out of every two hands contains "something interesting." Do not allow naturally occurring results to fool you nor lead you to fool others by shouting, "Isn't this incredible?"
TABLE 1.1. Probability of finding something interesting in a five-card hand
The purpose of a recent set of clinical trials was to see if blood flow and distribution in the lower leg could be improved by carrying out a simple surgical procedure prior to the administration of standard prescription medicine.
The results were disappointing on the whole, but one of the marketing representatives noted that the long-term prognosis was excellent when a marked increase in blood flow was observed just after surgery. She suggested we calculate a p-value⁵ for a comparison of patients with an improved blood flow after surgery versus patients who had taken the prescription medicine alone.
Such a p-value is meaningless. Only one of the two samples of patients in question had been taken at random from the population (those patients who received the prescription medicine alone). The other sample (those patients who had increased blood flow following surgery) was determined after the fact. To extrapolate results from the samples in hand to a larger population, the samples must be taken at random from, and be representative of, that population.
The preliminary findings clearly called for an examination of surgical procedures and of patient characteristics that might help forecast successful surgery. But the generation of a p-value and the drawing of any final conclusions had to wait for clinical trials specifically designed for that purpose.
This does not mean that one should not report anomalies and other unexpected findings. Rather, one should not attempt to provide p-values or confidence intervals in support of them. Successful researchers engage in a cycle of theorizing and experimentation so that the results of one experiment become the basis for the hypotheses tested in the next.
A related, extremely common error whose correction we discuss at length in Chapters 13 and 15 is to use the same data to select variables for inclusion in a model and to assess their significance. Successful model builders develop their frameworks in a series of stages, validating each model against a second independent dataset before drawing conclusions.
One reason why many statistical models are incomplete is that they do not specify the sources of randomness generating variability among agents, i.e., they do not specify why otherwise observationally identical people make different choices and have different outcomes given the same choice.
—James J. Heckman
TO LEARN MORE
On the necessity for improvements in the use of statistics in research publications, see Altman [1982, 1991, 1994, 2000, 2002]; Cooper and Rosenthal [1980]; Dar, Serlin, and Omer [1994]; Gardner and Bond [1990]; George [1985]; Glantz [1980]; Goodman, Altman, and George [1998]; MacArthur and Jackson [1984]; Morris [1988]; Strasak et al. [2007]; Thorn et al. [1985]; and Tyson et al. [1983].
Brockman and Chowdhury [1997] discuss the costly errors that can result from treating chaotic phenomena as stochastic.
Notes
¹ This is from an inquiry at the University of East Anglia headed by Lord Oxburgh. The inquiry was the result of emails from climate scientists being released to the public.
² City of New York v. Department of Commerce, 822 F. Supp. 906 (E.D.N.Y, 1993). The arguments of four statistical experts who testified in the case may be found in Volume 34 of Jurimetrics, 1993, 64–115.
³ A website with pictures is located at http://www.foldmoney.com/.
⁴ Remember "cold fusion"? In 1989, two University of Utah professors told the newspapers they could fuse deuterium molecules in the laboratory, solving the world's energy problems for years to come. Alas, neither those professors nor anyone else could replicate their findings, though true believers abound (see http://www.ncas.org/erab/intro.htm).
⁵ A p-value is the probability under the primary hypothesis of observing the set of observations we have in hand. We can calculate a p-value once we make a series of assumptions about how the data were gathered. These days, statistical software does the calculations, but it’s still up to us to validate the assumptions.
Chapter 2
Hypotheses: The Why of Your Research
All who drink of this treatment recover in a short time,
Except those whom it does not help, who all die,
It is obvious therefore, that it only fails in incurable cases.
—Galen (129–199)
IN THIS CHAPTER, AIMED AT BOTH RESEARCHERS WHO will analyze their own data and those who will rely on others to assist them in the analysis, we review how to formulate a hypothesis that is testable by statistical means, the appropriate use of the null hypothesis, Neyman–Pearson theory, the two types of error, and the more general theory of decisions and losses.
PRESCRIPTION
Statistical methods used for experimental design and analysis should be viewed in their rightful role as merely a part, albeit an essential part, of the decision-making procedure:
1. Set forth your objectives and the use you plan to make of your research before you conduct a laboratory experiment, a clinical trial, or a survey, or before you analyze an existing set of data.
2. Formulate your hypothesis and all of the associated alternatives. List possible experimental findings along with the conclusions you would draw and the actions you would take if this or another result should prove to be the case. Do all of these things before you complete a single data collection form, and before you turn on your computer.
WHAT IS A HYPOTHESIS?
A well-formulated hypothesis will be both quantifiable and testable, that is, involve measurable quantities or refer to items that may be assigned to mutually exclusive categories. It will specify the population to which the hypothesis will apply.
A well-formulated statistical hypothesis takes one of two forms:
1. Some measurable characteristic of a defined population takes one of a specific set of values.
2. Some measurable characteristic takes different values in different defined populations, the difference(s) taking a specific pattern or a specific set of values.
Examples of well-formed statistical hypotheses include the following:
For males over 40 suffering from chronic hypertension, a 100 mg daily dose of this new drug will lower diastolic blood pressure an average of 10 mm Hg.
For males over 40 suffering from chronic hypertension, a daily dose of 100 mg of this new drug will lower diastolic blood pressure an average of 10 mm Hg more than an equivalent dose of metoprolol.
Given less than 2 hours per day of sunlight, applying from 1 to 10 lbs of 23-2-4 fertilizer per 1000 square feet will have no effect on the growth of fescues and Bermuda grasses.
"All redheads are passionate" is not a well-formed statistical hypothesis, not merely because "passionate" is ill defined, but because the word "all" suggests there is no variability. The latter problem can be solved by quantifying the term "all" to, let's say, 80%. If we specify "passionate" in quantitative terms to mean "has an orgasm more than 95% of the time consensual sex is performed," then the hypothesis "80% of redheads have an orgasm more than 95% of the time consensual sex is performed" becomes testable.
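Once quantified this way, the hypothesis can be confronted with data. The following sketch (the book itself supplies no code; Python, the sample size of 100, and the observed count of 70 are our own illustrative assumptions) carries out an exact two-sided binomial test of the null hypothesis that the population proportion is 80%:

```python
# Exact two-sided binomial test of H0: p = 0.80, sketched for the
# quantified "80% of redheads" hypothesis. The sample size (n = 100)
# and observed count (k = 70) are hypothetical illustrations.
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success rate p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_test_two_sided(k, n, p0):
    """Sum the probabilities of all outcomes no more likely than k under H0."""
    p_obs = binom_pmf(k, n, p0)
    return sum(binom_pmf(i, n, p0) for i in range(n + 1)
               if binom_pmf(i, n, p0) <= p_obs + 1e-12)

p_value = binom_test_two_sided(70, 100, 0.80)
```

The point of the exercise is that nothing here could even be computed for the original assertion "all redheads are passionate"; only after "all" and "passionate" are given measurable meanings does a test statistic exist.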
Note that defining "passionate" to mean "has an orgasm every time consensual sex is performed" would not be provable, as it too is a statement of the all-or-none variety. The same is true for a hypothesis such as "has an orgasm none of the times consensual sex is performed." Similarly, qualitative assertions of the form "not all" or "some" are not statistical in nature because these terms leave much room for subjective interpretation. How many do we mean by "some"? Five out of 100? Ten out of 100?
The statements "Doris J. is passionate" and "Both Good brothers are 5′10″ tall" are equally nonstatistical in nature, as they concern specific individuals rather than populations [Hagood, 1941]. Finally, note that until someone other than Thurber succeeds in locating unicorns, the hypothesis "80% of unicorns are white" is not testable.
Formulate your hypotheses so they are quantifiable, testable, and statistical in nature.
HOW PRECISE MUST A HYPOTHESIS BE?
The chief executive of a drug company may well express a desire to test whether "our antihypertensive drug can beat the competition."
The researcher, having done preliminary reading of the literature, might want to test a preliminary hypothesis on the order of "For males over 40 suffering from chronic hypertension, there is a daily dose of our new drug that will lower diastolic blood pressure an average of 20 mm Hg."
But this hypothesis is imprecise. What if the necessary dose of the new drug required taking a tablet every hour? Or caused liver malfunction? Or even death? First, the researcher would need to conduct a set of clinical trials to determine the maximum tolerable dose (MTD). Subsequently, she could test the precise hypothesis: "A daily dose of one-third to one-fourth the MTD of our new drug will lower diastolic blood pressure an average of 20 mm Hg in males over 40 suffering from chronic hypertension."
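A precise hypothesis of this kind maps directly onto a test statistic. The sketch below (our own illustration, not from the book: the blood-pressure reductions are hypothetical, and with the small sample we substitute a normal approximation for the t distribution, which the Python standard library lacks) tests whether the mean reduction differs from the hypothesized 20 mm Hg:

```python
# One-sample test of the precise hypothesis "mean diastolic reduction
# equals 20 mm Hg." The data are hypothetical; the p-value uses a
# normal approximation to the t distribution (stdlib has no t CDF).
from statistics import mean, stdev, NormalDist

reductions = [18.2, 22.5, 19.8, 24.1, 17.6, 21.3, 20.9, 16.4,
              23.0, 19.1, 22.8, 18.7, 20.2, 21.6, 17.9, 19.5]  # mm Hg

n = len(reductions)
standard_error = stdev(reductions) / n**0.5
t_stat = (mean(reductions) - 20.0) / standard_error
p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))  # two-sided
```

Note that the hypothesis names the population (males over 40 with chronic hypertension), the dose (a fixed fraction of the MTD), and the effect size (20 mm Hg); remove any one of these and the computation above has no defined target.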
In a series of articles by Horwitz et al. [1998], a physician and his colleagues strongly criticize the statistical community for denying them (or so they perceive) the right to provide a statistical analysis for subgroups not contemplated in the original study protocol. For example, suppose that in a study of the health of Marine recruits, we notice that not one of the dozen or so women who received a vaccine contracted pneumonia. Are we free to provide a p-value for this result?
Statisticians Smith and Egger [1998] argue against hypothesis tests of subgroups chosen after the fact, suggesting that the results are often likely to be explained by the play of chance.
Altman [1998; pp. 301–303], another statistician, concurs, noting that "… the observed treatment effect is expected to vary across subgroups of the data … simply through chance variation" and that "doctors seem able to find a biologically plausible explanation for any finding."
This leads Horwitz et al. to the incorrect conclusion that Altman proposes we "dispense with clinical biology (biologic evidence and pathophysiologic reasoning) as a basis for forming subgroups."
Neither Altman nor any other statistician would quarrel with Horwitz et al.'s assertion that physicians must investigate "how do we [physicians] do our best for a particular patient."
Scientists can and should be encouraged to make subgroup analyses. Physicians and engineers should be encouraged to make decisions based upon them. Few would deny that in an emergency, coming up with workable, fast-acting solutions without complete information is better than finding the best possible solution.¹ But, by the same token, statisticians should not be pressured to give their imprimatur to what, in statistical terms, is clearly an improper procedure, nor should statisticians mislabel suboptimal procedures as the best that can be done.²
We concur with Anscombe [1963] who writes, … the concept of error probabilities of the first and second kinds … has no direct relevance to experimentation. … The formation of opinions, decisions concerning further experimentation, and other required actions, are not dictated … by the formal analysis of the experiment, but call for judgment and imagination. … It is unwise for the experimenter to view himself seriously as a decision-maker. … The experimenter pays the piper and calls the tune he likes best; but the music is broadcast so that others might listen. …
A Bill of Rights for Subgroup Analysis
Scientists can and should be encouraged to make subgroup analyses.
Physicians and engineers should be encouraged to make decisions utilizing the findings of such analyses.
Statisticians and other data analysts can and should rightly refuse to give their imprimatur to related tests of significance.
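Smith and Egger's "play of chance" objection is easy to demonstrate by simulation. In the sketch below (our own illustration; the number of subgroups, arm sizes, and trial count are arbitrary assumptions, and a normal approximation stands in for the two-sample t test), the treatment has no effect whatsoever in any subgroup, yet a majority of simulated trials still produce at least one "significant" after-the-fact subgroup:

```python
# With no treatment effect at all, testing 20 post-hoc subgroups at
# the 0.05 level still yields at least one "significant" result in
# most trials -- roughly 1 - 0.95**20, or about 64% of the time.
import random
from statistics import mean, stdev, NormalDist

random.seed(1)

def two_sample_p(x, y):
    """Two-sided z-test on the difference in means (normal approx.)."""
    se = (stdev(x)**2 / len(x) + stdev(y)**2 / len(y)) ** 0.5
    z = (mean(x) - mean(y)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

trials_with_false_positive = 0
for _ in range(200):                      # 200 simulated trials
    # 20 post-hoc subgroups, 30 patients per arm, identical populations
    if any(two_sample_p([random.gauss(0, 1) for _ in range(30)],
                        [random.gauss(0, 1) for _ in range(30)]) < 0.05
           for _ in range(20)):
        trials_with_false_positive += 1

share = trials_with_false_positive / 200
```

This is precisely why the Bill of Rights above distinguishes making subgroup analyses (encouraged) from attaching formal significance tests to them (refused).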
FOUND DATA
p-values should not be computed for hypotheses based on "found data," as of necessity all hypotheses related to found data are after the fact. This rule does not apply if the observer first divides the data into sections. One part is studied and conclusions drawn; then the resultant hypotheses are tested on the remaining sections. Even then, the tests are valid only if the found data can be shown to be representative of the population at large.
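The split-sample discipline just described can be sketched as follows (our own illustration: the "found" observations are simulated stand-ins, and the half-and-half split, the one-sided z test, and the sample size are all assumptions):

```python
# Split-sample discipline for found data: form the hypothesis on one
# half, compute the p-value ONLY on the held-out half. The data here
# are simulated stand-ins for a found data set.
import random
from statistics import mean, stdev, NormalDist

random.seed(7)
found = [random.gauss(0, 1) for _ in range(400)]
random.shuffle(found)
explore, confirm = found[:200], found[200:]

# Step 1: browse the exploratory half and let it suggest a hypothesis,
# say "the population mean lies on this side of zero."
direction = 1 if mean(explore) > 0 else -1

# Step 2: test that hypothesis on the confirmatory half alone, never
# on the half that suggested it.
z = direction * mean(confirm) / (stdev(confirm) / len(confirm) ** 0.5)
p_value = 1 - NormalDist().cdf(z)  # one-sided, in the suggested direction
```

The order of operations is the whole point: the exploratory half may suggest anything it likes, but it contributes nothing to the p-value, and even then the result generalizes only if the found data are representative of the population of interest.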
NULL OR NIL HYPOTHESIS
A major research failing seems to be the exploration of uninteresting or even trivial questions. … In the 347 sampled articles in Ecology containing null hypotheses tests, we found few examples of null hypotheses that seemed biologically plausible.
—Anderson, Burnham, and Thompson [2000].
We do not perform an experiment to find out if two varieties of wheat or two drugs are equal. We know in advance, without spending a dollar on an experiment, that they are not equal.
—Deming [1975].
Test only relevant null hypotheses.
The null hypothesis has taken on an almost mythic role in contemporary statistics. Obsession with the null (more accurately spelled and pronounced nil), has been allowed to shape the direction of our research. We have let the tool use us instead of