A Guide to Sample Size for Animal-Based Studies, First Edition. Penny S. Reynolds.

Preface
'How large a sample size do I need for my study?' Although one of the most commonly asked questions in statistics, the importance of proper sample size estimation seems to be overlooked by many preclinical researchers. Over the past two decades, numerous reviews of the published literature indicate many studies are too small to answer the research question and results are too unreliable to be trusted. Few published studies present adequate justification of their chosen sample sizes or even report the total number of animals used. On the other hand, it is not unusual for protocols (usually those involving mouse models) to request preposterous numbers of animals, sometimes in the tens or even hundreds of thousands, 'because this is an exploratory study, so it is unknown how many animals we will require'.

This widespread phenomenon of sample sizes based on nothing more than guesswork or intuition illustrates the pervasiveness of what Amos Tversky and Daniel Kahneman identified in 1971 as the 'belief in the law of small numbers'. Researchers overwhelmingly rely on best judgement in planning experiments, but judgement is almost always misleading. Researchers choose sample sizes based on what 'worked' before or because a particular sample size is a favourite with the research community. Tversky and Kahneman showed that researchers who gamble their research results on small intuitively-based samples consistently have the odds stacked against their findings (even if results are true). They overestimate the stability and precision of their results, and fail to account for sampling variation as a possible reason for observed patterns. The result is research waste on a colossal scale, especially of animals, that is increasingly difficult to justify.

This book was written to assist non-statisticians who use animals in research to 'right-size' experiments, so they are statistically, operationally, and ethically justifiable. A 'right-sized' experiment has a clear plan for sample size justification and transparently reports the numbers of all animals used in the study. For basic and veterinary researchers, appropriate sample sizes are critical to the design and analysis of a study. The best sample sizes optimise study design to align with available resources and ensure the study is adequately powered to detect meaningful, reliable, and generalisable results. Other stakeholders not directly involved in animal experimentation can also benefit from understanding the basic principles involved. Oversight veterinarians and ethical oversight committees are responsible for appraising animal research protocols for compliance with best practice, ethical, and regulatory standards. An appreciation of sample size construction can help assess scientific and ethical justifications for animal use and whether the proposed sample size is fit for purpose. Funding agencies and policymakers use research results to inform decisions related to animal welfare, public health, and future scientific benefit. Understanding the logic behind sample size justification can assist in evaluation of study quality and reliability of research findings, and ultimately promote more informed evidence-based decision-making.

An extensive background in statistics is not required, but readers should have had some basic statistical training. The emphasis throughout is on the upstream components of the research process – statistical process, study planning, and sample size calculations rather than analysis. I have used real data in nearly all examples and provided formulae and code, so sample size approximations can be reproduced by hand or by computer. By training and inclination I prefer SAS, but whenever possible I have provided R code or links to R libraries.
Acknowledgements
Many thanks to Anton Bespalov (PAASP, Heidelberg, Germany); Cori Astrom, Christina Hendricks, and Bryan Penberthy (University of Florida); Cora Mezger, Maria Christodoulou, and Mariagrazia Zottoli (Department of Statistics, University of Oxford); and Megan Lafollette (North American 3Rs Collaborative), who kindly reviewed various chapters of this book whilst it was in preparation and provided much helpful feedback. Thanks to the University of Florida IACUC chairs Dan Brown and Rebecca Kimball, who encouraged researchers to consult the original 10-page handout I had devised for sample size estimation. And last, but certainly not least, special thanks to Tim Morey, Chair of the Department of Anesthesiology, University of Florida, who encouraged me to put that handout into book form.

Thanks are also due to the University of Florida Faculty Endowment Fund for providing me with a Faculty Enhancement Opportunities grant to allow me to devote some concentrated time to writing. A generous honorarium from Scientist Center for Animal Welfare (SCAW) and an award from the UK Animals in Science Education Trust enabled me to upgrade my home computer system, making working on this project immeasurably easier.

The book was nearing completion when I came across the Icelandic word sprakkar that means 'extraordinary women'. I have been fortunate to encounter many sprakkar whilst writing this book. In addition to the women (and men!) already mentioned, special thanks to researchers Amara Estrada, Francesca Griffin, Autumn Harris, Maggie Hull, Wendy Mandese, and Elizabeth Nunamaker, who generously allowed me to use some of their data as examples. And special thanks to Jane Buck and Julie Laskaris for their wonderful friendship and hospitality over the years. Jane Buck, Professor Emerita of Psychology, Delaware State University, and past president of the American Association of University Professors, continues to amaze and show what is possible for a statistician 'with attitude'. Julie advised me that the only approach to properly edit one's own work on a book-length project was to 'slit its throat', then told me to do as she said, not as she actually did. Cheers.
I
What is Sample Size?

Chapter 1: The Sample Size Problem in Animal-Based Research.
Chapter 2: Sample Size Basics.
Chapter 3: Ten Strategies to Increase Information (and Reduce Sample Size).
1
The Sample Size Problem in Animal-Based Research

CHAPTER OUTLINE HEAD

1.1 Organisation of the Book
References

Good Numbers Matter. This is especially true when animals are research subjects. Researchers are responsible for minimising both direct harms to research animals and the indirect harms that result from wasting animals in poor-quality studies (Reynolds 2021). The ethical use of animals in research is framed by the 'Three Rs' principles of Replacement, Reduction, and Refinement. Originating over 60 years ago (Russell and Burch 1959), the 3Rs strategy is framed by the premise that maximal information should be obtained for minimal harms. Harms are minimised by Replacement, methods or technologies that substitute for animals; Reduction, the methods using the fewest animals for the most robust and scientifically valid information; and Refinement, the methods that improve animal welfare through minimising pain, suffering, distress, and other harms (Graham and Prescott 2015).

BOX 1.1 Right-Sizing Checklist

Statistically defensible: Are numbers verifiable? (Calculations)
    Outcome variable identified
    Difference to be detected
    Expected variation in response
    Number of groups
    Anticipated statistical test (if hypothesis tests used)
    All calculations shown

Operationally defensible: Are numbers feasible? (Resources)
    Qualified technical staff
    Time
    Space
    Resources
    Equipment
    Funding

Ethically defensible: Are numbers fit for purpose? (3Rs)
    Appropriate for study objectives?
    Reasonable number of groups?
    Are collateral losses accounted for and minimized?
    Are loss mitigation plans described?
    Are 3Rs strategies described?

Source: Adapted from Reynolds (2021).

The focus of this book is on Reduction and methods of 'right-sizing' experiments. A right-sized experiment is an optimal size for a study to achieve its objectives with the least amount of resources, including animals. However, simply minimising the total number of animals is not the same as right-sizing. A right-sized experiment has a sample size that is statistically, operationally, and ethically defensible (Box 1.1). This will mean compromising between the scientific objectives of the study, production of scientifically valid results, availability
of resources, and the ethical requirement to minimise waste and suffering of research animals. Thus, sample size calculations are not a single calculation but a set of calculations, involving iteration through formal estimates, followed by reality checks for feasibility and ethical constraints (Reynolds 2019).

Additional challenges to right-sizing experiments include those imposed by experimental design and biological variability (Box 1.2). In The Principles of Humane Experimental Technique (1959), Russell and Burch were very clear that Reduction is achieved by systematic strategies of experimentation rather than trial and error. In particular, they emphasised the role of the statistically based family of experimental designs and design principles proposed by Ronald Fisher, still relatively new at the time. Formal experimental designs customised to address the particular research question increase the experimental signal through the reduction of variation. Design principles that reduce bias, such as randomisation and allocation concealment (blinding), increase validity. These methods increase the amount of usable information that can be obtained from each animal (Parker and Browne 2014).

BOX 1.2 Challenges for Right-Sizing Animal-Based Studies

Ethics and welfare considerations. The three Rs Replacement, Reduction, and Refinement should be the primary driver of animal numbers.
Experimental design. Animal-based research has no design culture. Clinical trial models are inappropriate for exploratory research. Multifactorial agriculture/industrial designs may be more suitable in many cases, but they are unfamiliar to most researchers.
Biological variability. Animals can display significant differences in responses to interventions, making it challenging to estimate an appropriate sample size.
Cost and resource constraints. The financial cost of conducting animal-based research, including the cost of housing, caring for, and monitoring the animals, must be considered in estimates of sample size.

Although it has now been almost a century since Fisher-type designs were developed, many researchers in biomedical sciences still seem unaware of their existence. Many preclinical studies reported in the literature consist of numerous two-group designs. However, this approach is both inefficient and inflexible, and unsuited to exploratory studies with multiple explanatory variables (Reynolds 2022). Statistically based designs are rarely reported in the preclinical literature. In part, this is because the design of experiments is seldom taught in introductory statistics courses directed towards biomedical researchers.

Power calculations are the gold standard for sample size justification. However, they are commonly misapplied, with little or no consideration of study design, type of outcome variable, or the purpose of the study. The most common power calculation is for two-group comparisons of independent samples. However, this is inappropriate when the study is intended to examine multiple independent factors and interactions. Power calculations for continuous variables are not appropriate for correlated observations or count data with a high prevalence of zeros. Power calculations cannot be used at all when statistical inference is not the purpose of the study, for example, assessment of operational and ethical feasibility, descriptive or natural history studies, and species inventories.
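Where a two-group comparison of independent samples genuinely is the planned design, the calculation itself is not the hard part; the R sketch below (base R, with invented placeholder values rather than recommendations from this book) simply illustrates the quantities any defensible justification must state: the difference to be detected, the expected variation, the significance level, and the target power.

    ## Illustrative two-group power calculation using base R (stats::power.t.test).
    ## All input values are invented placeholders.
    power.t.test(
      delta       = 1.5,     # minimum difference worth detecting
      sd          = 2.0,     # expected standard deviation of the outcome
      sig.level   = 0.05,    # two-sided type I error rate
      power       = 0.80,    # target power
      type        = "two.sample",
      alternative = "two.sided"
    )
    ## The returned n is the number of experimental units per group;
    ## round up to the next whole animal.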
Evidence of right-sizing is provided by a clear plan for sample size justification and transparent reporting of the number of all animals used in the study. This is why these items are part of best-practice reporting standards for animal research publications (Kilkenny et al. 2010, Percie du Sert et al. 2020) and are essential for the assessment of research reproducibility (Vollert et al. 2020). Unfortunately, there is little evidence that either sample size justification or sample size reporting has improved over the past decade. Most published animal research studies are underpowered and biased (Button et al. 2013, Henderson et al. 2013, Macleod et al. 2015) with poor validity (Würbel 2017, Sena and Currie 2019), severely limiting reproducibility and translation potential (Sena et al. 2010, Silverman et al. 2017). A recent cross-sectional survey of mouse cancer model papers published in high-impact oncology journals found that fewer than 2% reported formal power calculations, and less than one-third reported sample size per group. It was impossible to determine attrition losses, or how many experiments (and therefore animals) were discarded due to failure to achieve statistical significance (Nunamaker and Reynolds 2022). The most common sample size mistake is not performing any calculations at all (Fosgate 2009). Instead, researchers make vague and unsubstantiated statements such as 'Sample size was chosen because it is what everyone else uses' or 'experience has shown this is the number needed for statistical significance'. Researchers often game, or otherwise adjust, calculations to obtain a preferred sample size (Schulz and Grimes 2005, Fitzpatrick et al. 2018). In effect, these studies were performed without justification of the number of animals used.

Statistical thinking is both a mindset and a set of skills for understanding and making decisions based on data (Tong 2019). Reproducible data can only be obtained by sustained application of statistical thinking to all experimental processes: good laboratory procedure, standardised and comprehensive operating protocols, appropriate design of experiments, and methods of collecting and analysing data. Appropriate strategies of sample size justification are an essential component.

1.1 Organisation of the Book

This book is a guide to methods of approximating sample sizes. There will never be one number or approach, and sample size will be determined for the most part by study objectives and choice of the most appropriate statistically based study design. Although advanced statistical or mathematical skills are not required, readers are expected to have at least a basic course on statistical analysis methods and some familiarity with the basics of power and hypothesis testing. SAS code is provided in appendices at the end of each chapter and references to specific R packages in the text. It is strongly recommended that everyone involved in devising animal-based experiments take at least one course in the design of experiments, a topic not often covered by statistical analysis courses.

This book is organised into four sections (Figure 1.1).

Part I Sample size basics discusses definitions of sample size, elements of sample size determination, and strategies for maximising information power without increasing sample size.

Part II Feasibility. This section presents strategies for establishing study feasibility with pilot studies. Justification of animal numbers must first address questions of operational feasibility ('Can it work?' Is the study possible? suitable? convenient? sustainable?). Once operational logistics are standardised, pilot studies can be performed to establish empirical feasibility ('Does it work?' Is the output large enough to be measured? consistent enough to be reliable?) and translational feasibility ('Will it work?' proof of concept and proof of principle) before proceeding to the main experiments. Power calculations are not appropriate for most pilots. Instead, common-sense feasibility checks include basic arithmetic (with structured back-of-the-envelope calculations, illustrated in the sketch following this outline), simple probability-based calculations, and graphics.

[Figure 1.1: Overview of book organisation. For animal numbers to be justifiable (Are they feasible? appropriate? ethical? verifiable?), sample size should be determined by formal quantitative calculations (arithmetic, probability-based, precision-based, power-based) and consideration of operational constraints.]
Part III Description. This section presents methods for summarising the main features of the sample data and results. Basic descriptive statistics provide a simple and concise summary of the data in terms of central tendency and dispersion or spread. Graphical representations are used to identify patterns and outliers and explore relationships between variables. Intervals computed from the sample data are the range of values estimated to contain the true value of a population parameter with a certain degree of confidence. Four types of intervals are discussed: confidence intervals, prediction intervals, tolerance intervals, and reference intervals. Intervals shift emphasis away from significance tests and P-values to more meaningful interpretation of results.

Part IV Comparisons. Power-based calculations for sample size are centred on understanding effect size in the context of specific experimental designs and the choice of outcome variables. Effect size provides information about the practical significance of the results beyond considerations of statistical significance. Specific designs considered are two-group comparisons, ANOVA-type designs, and hierarchical designs.
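As a taste of the back-of-the-envelope feasibility arithmetic described for Part II (the sketch referred to above), the following R fragment uses invented numbers to inflate a per-group estimate for expected attrition and check the total against an operational ceiling; it is an illustration only, not an example drawn from the book's data.

    ## Hypothetical back-of-the-envelope check; every number here is invented.
    n_per_group <- 8      # per-group size from a formal calculation
    n_groups    <- 4      # planned treatment groups
    attrition   <- 0.15   # expected proportion lost to deaths, exclusions, failures
    max_animals <- 40     # operational limit (housing, budget, staff time)

    n_adjusted <- ceiling(n_per_group / (1 - attrition))  # inflate for losses
    n_total    <- n_adjusted * n_groups
    list(per_group = n_adjusted, total = n_total, within_limit = n_total <= max_animals)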
References

Button, K.S., Ioannidis, J.P.A., Mokrysz, C. et al. (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience 14: 365–376.
Fitzpatrick, B.G., Koustova, E., and Wang, Y. (2018). Getting personal with the "reproducibility crisis": interviews in the animal research community. Lab Animal (NY) 47: 175–177.
Fosgate, G.T. (2009). Practical sample size calculations for surveillance and diagnostic investigations. Journal of Veterinary Diagnostic Investigation 21: 3–14. https://doi.org/10.1177/104063870902100102.
Graham, M.L. and Prescott, M.J. (2015). The multifactorial role of the 3Rs in shifting the harm-benefit analysis in animal models of disease. European Journal of Pharmacology 759: 19–29. https://doi.org/10.1016/j.ejphar.2015.03.040.
Henderson, V.C., Kimmelman, J., Fergusson, D. et al. (2013). Threats to validity in the design and conduct of preclinical efficacy studies: a systematic review of guidelines for in vivo animal experiments. PLoS Medicine 10: e1001489.
Kilkenny, C., Browne, W.J., Cuthill, I.C. et al. (2010). Improving bioscience research reporting: the ARRIVE guidelines for reporting animal research. PLoS Biology 8 (6): e1000412. https://doi.org/10.1371/journal.pbio.1000412.
Macleod, M.R., Lawson McLean, A., Kyriakopoulou, A. et al. (2015). Risk of bias in reports of in vivo research: a focus for improvement. PLoS Biology 13: e1002301. https://doi.org/10.1371/journal.pbio.1002273.
Nunamaker, E.A. and Reynolds, P.S. (2022). "Invisible actors"—how poor methodology reporting compromises mouse models of oncology: a cross-sectional survey. PLoS ONE 17 (10): e0274738. https://doi.org/10.1371/journal.pone.0274738.
Parker, R.M.A. and Browne, W.J. (2014). The place of experimental design and statistics in the 3Rs. ILAR Journal 55 (3): 477–485.
Percie du Sert, N., Hurst, V., Ahluwalia, A. et al. (2020). The ARRIVE guidelines 2.0: updated guidelines for reporting animal research. PLoS Biology 18 (7): e3000410. https://doi.org/10.1371/journal.pbio.3000410.
Reynolds, P.S. (2019). When power calculations won't do: Fermi approximation of animal numbers. Lab Animal (NY) 48: 249–253.
Reynolds, P.S. (2021). Statistics, statistical thinking, and the IACUC. Lab Animal (NY) 50 (10): 266–268. https://doi.org/10.1038/s41684-021-00832-w.
Reynolds, P.S. (2022). Between two stools: preclinical research, reproducibility, and statistical design of experiments. BMC Research Notes 15: 73. https://doi.org/10.1186/s13104-022-05965-w.
Russell, W.M.S. and Burch, R.L. (1959). The Principles of Humane Experimental Technique. London: Methuen.
Schulz, K.F. and Grimes, D.A. (2005). Sample size calculations in randomised trials: mandatory and mystical. Lancet 365 (9467): 1348–1353. https://doi.org/10.1016/S0140-6736(05)61034-3.
Sena, E.S. and Currie, G.L. (2019). How our approaches to assessing benefits and harms can be improved. Animal Welfare 28: 107–115.
Sena, E.S., van der Worp, H.B., Bath, P.M., Howells, D.W., and Macleod, M.R. (2010). Publication bias in reports of animal stroke studies leads to major overstatement of efficacy. PLoS Biology 8 (3): e1000344. https://doi.org/10.1371/journal.pbio.1000344.
Silverman, J., Macy, J., and Preisig, P. (2017). The role of the IACUC in ensuring research reproducibility. Lab Animal (NY) 46: 129–135.
Tong, C. (2019). Statistical inference enables bad science; statistical thinking enables good science. American Statistician 73: 246–261.
Vollert, J., Schenker, E., Macleod, M. et al. (2020). Systematic review of guidelines for internal validity in the design, conduct and analysis of preclinical biomedical experiments involving laboratory animals. BMJ Open Science 4 (1): e100046. https://doi.org/10.1136/bmjos-2019-100046.
Würbel, H. (2017). More than 3Rs: the importance of scientific validity for harm-benefit analysis of animal research. Lab Animal 46: 164–166.
2
Sample Size Basics

CHAPTER OUTLINE HEAD

2.1 Introduction
2.2 Experimental Unit
2.3 Biological Unit
2.4 Technical Replicates
2.5 Repeats, Replicates, and Pseudo-Replication
2.5.1 Repeats of Entire Experiments
2.5.2 Pseudo-Replication
References

2.1 Introduction

Investigators frequently assume 'sample size' is the same as 'the number of animals'. This is not necessarily true. Reliable sample size estimates are determined by the correct identification of the experimental units, the true unit of replication (Box 2.1). Replication of experimental units increases both precision of estimates and statistical power for testing the central hypothesis. Replicates on the same subject over time provide an estimate of time dependencies in response. Technical replicates are used to obtain an estimate of measurement error and are essential for quality control of experimental procedures. Pseudo-replication is a serious statistical error that occurs when the number of data points (evaluation units) is confused with the number of independent samples, or experimental units (Hurlbert 2009; Lazic 2010). Incorrect specification of the true sample size results in erroneous estimates of the standard error, inflated type I error rates, and an increased number of false positives (Cox and Donnelly 2011). Research results will therefore be biased and misleading.

Definitions of 'replicates' and 'replication' are frequently confused in the literature, and further conflated with study replication. Planning experiments using formal statistical designs can help differentiate between the different types of replicates and sampling units, and determine which is best suited for the intended study.

BOX 2.1 What Is Sample Size?

A replicate is one unit in one group.
Sample size is determined by the number of replicates of the experimental unit.
Experimental unit: entire entity to which a treatment or control intervention can be independently and individually applied.
Biological replicate is a biologically distinct and independent experimental unit.
Technical replicate is one of multiple measurements on subsamples of the experimental unit, used to obtain an estimate of measurement error.


[Figure 2.1: Units of replication. (a) Experimental unit = individual animal = biological unit: the entire entity to which an experimental or control intervention can be independently applied. There are two treatment interventions, A or B. Here each mouse receives a separate intervention, and the individual mouse is the experimental unit (EU); the individual mouse is also the biological unit. (b) Experimental unit = group of animals. There are two treatment interventions, A or B. Each dam receives either A or B, but measurements are conducted on the pups in each litter. The experimental unit is the dam (N = 2), and the biological unit is the pup (n = 8). For this design, the number of pups cannot contribute to the test of the central hypothesis. (c) Experimental unit with repeated observations. The experimental unit is the individual animal (= biological unit), with four sequential measurements made on each animal. The sample size is N = 2. (d) Experimental unit = part of each animal. There are two treatment interventions, A or B. Treatment A is randomised to either the right or left flank of each mouse, and B is injected into the opposite flank of that mouse. The experimental unit is the flank (N = 8). The individual mouse is the biological unit. Each mouse can be considered statistically as a block with paired observations within each animal.]

2.2 Experimental Unit

The experimental unit or unit of analysis is the smallest entire entity to which a treatment or control intervention can be independently and randomly applied (Figure 2.1a). Cox and Donnelly (2011) define it as the 'smallest subdivision of the experimental material such that two distinct units might be randomized (randomly allocated) to different treatments.' Whatever happens to one experimental unit will have no bearing on what happens to the others (Hurlbert 2009). If the test intervention is applied to a 'grouping' other than the individual animal (e.g. a litter of mice, a cage or tank of animals, a body part; Figure 2.1b–d), then the sample size N will not be the same as the number of animals.

The total sample size N refers to the number of independent experimental units in the sample. The classic meaning of a 'replicate' refers to the number of experimental units within a treatment or intervention group. Therefore, replicating experimental units (and hence increasing N) contributes to statistical power for testing the central statistical hypothesis. Power calculations estimate the number
of experimental units required to test the hypothesis. The assignment of treatments and controls to experimental units should be randomised if the intention is to perform statistical hypothesis tests on the data (Cox and Donnelly 2011).
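A minimal sketch of what such randomised allocation could look like in R follows; the unit labels, group sizes, and seed are arbitrary illustrations, not a prescription from this book.

    ## Randomly allocate two treatments to 12 experimental units (illustrative).
    set.seed(101)                                          # reproducible allocation
    units <- data.frame(id = paste0("mouse_", 1:12))
    units$treatment <- sample(rep(c("A", "B"), each = 6))  # randomised, balanced
    ## Also randomise the order in which units are processed and measured
    ## (sequence allocation), so run order is not confounded with treatment.
    units$run_order <- sample(nrow(units))
    units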
Independence of experimental units is essential for most null hypothesis statistical tests and methods of analysis and is the most important condition for ensuring the validity of statistical inferences (van Belle 2008). Non-independence of experimental units occurs with repeated measures and multi-level designs and must be handled by the appropriate statistically based designs and analyses for hypothesis tests to be valid.

2.3 Biological Unit

The biological unit is the entity about which inferences are to be made. Replicates of the biological unit are the number of unique biological samples or individuals used in an experiment. Replication of biological units captures biological variability between and within these units (Lazic et al. 2018). The biological unit is not necessarily the same as the experimental unit. Depending on how the treatment intervention is randomised, the experimental unit can be an individual biological unit, a group of biological units, a sequence of observations on a single biological unit, or a part of a biological unit (Lazic and Essioux 2013; Lazic et al. 2018). The biological unit of replication may be the whole animal or a single biological sample, such as strains of mice, cell lines or tissue samples (Table 2.1).

Table 2.1: Units of Replication in a Hypothetical Single-Cell Gene Expression RNA Sequencing Experiment. Designating a given replicate unit as an experimental unit depends on the central hypothesis to be tested and the study design.

Replicate 'unit'                                                              Replicate type
Animals
    Colonies                                                                  Biological
    Strains                                                                   Biological
    Cohoused animals in a cage                                                Biological
    Sex (male, female)                                                        Biological
    Individuals                                                               Biological
Sample preparation
    Organs from animals killed for purpose                                    Biological
    Methods for dissociating cells from tissue                                Technical
    Dissociation runs from given tissue sample                                Technical
    Individual cells                                                          Biological
    RNA-seq library construction                                              Technical
Sequencing
    Runs from the library of a given cell                                     Technical
    Readouts from different transcript molecules                              Biological or technical
    Readouts with unique molecular identifier (UMI) from a given transcript   Technical
    molecule

Source: Adapted from Blainey et al. (2014).

2.4 Technical Replicates

Technical replicates or repeats are multiple measurements made on subsamples of an experimental unit (Figure 2.2). Technical replicates are used to obtain an estimate of measurement error, the difference between a measured quantity and its true value. Technical replicates are essential for assessing internal quality control of experimental procedures and processes, and ensuring that results are not an artefact of processing variation (Taylor and Posch 2014). Differences between operators and instruments, instrument drift, subjectivity in determination of measurement landmarks, or faulty calibration can result in measurement error. Cell cultures and protein-based experiments can also show considerable variation from run to run, so in vitro experiments are usually repeated several
times. At least three technical replicates of Western blots, PCR measurements, or cell proliferation assays may be necessary to assess reliability of technique and confirm validity of observed changes in protein levels or gene expression (Taylor and Posch 2014).

The variance calculated from the multiple measurements is an estimate of the precision, and therefore the repeatability, of the measurement. Technical replicates measure the variability between measurements on the same experimental units. Repeating measurements increases the precision only for estimates of the measurement error; they do not measure variability either within or between treatment groups. Therefore, increasing the number of technical replicates does not improve power or contribute to the sample size for testing the central hypothesis. Analysing technical repeats as independent measurements is pseudo-replication.

High-dimensionality studies produce large amounts of output information per subject. Examples include multiple DNA/RNA microarrays; biochemistry assays; biomarker studies; proteomics; metabolomics; inflammasome profiles, etc. These studies may require a number of individual animals, either for operational purposes (for example, to obtain enough tissue for processing) or as part of the study design (for example, to estimate biological variation). Sample size will then be determined by the amount of tissue required for the assay technical replicates, or by design-specific requirements for power. Design features include anticipated response/expression rates, expected false positive rate, and number of sampling time points (Lee and Whitmore 2002; Lin et al. 2010; Jung and Young 2012).

Example: Experimental Units with Technical Replication

Two treatments A and B are randomly allocated to six individually housed mice, with three mice receiving A and three receiving B. Lysates are obtained from each mouse in three separate aliquots (Figure 2.2).

The individual mouse is the experimental unit because treatments can be independently and randomly allocated to each mouse. There are three subsamples or technical replicates per mouse. The total sample size is N = 6, with k = 2 treatments, n = 3 mice per treatment group, and j = 3 technical replicates per mouse. The total sample size N is 6, not 18.

[Figure 2.2: Experimental unit versus technical replicates. Two treatments A and B are randomly allocated to six mice. The individual mouse is the experimental unit. Three lysate aliquots are obtained from each mouse; these are technical replicates. The total sample size N is 6, not 18.]
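One way to respect this distinction at analysis is to collapse the technical replicates to a single value per experimental unit before any between-group comparison. The R sketch below uses simulated data for the six-mouse, three-aliquot layout above; it illustrates the principle and is not code taken from the book.

    ## Simulated data: 6 mice (experimental units), 3 aliquots (technical replicates) each.
    set.seed(123)
    d <- data.frame(
      mouse     = rep(paste0("m", 1:6), each = 3),
      treatment = rep(c("A", "B"), each = 9),
      aliquot   = rep(1:3, times = 6),
      response  = rnorm(18, mean = rep(c(10, 12), each = 9), sd = 1)
    )

    ## Variation among aliquots within a mouse estimates measurement error only.
    tapply(d$response, d$mouse, sd)

    ## Collapse to one value per experimental unit, then test with N = 6, not 18.
    mouse_means <- aggregate(response ~ mouse + treatment, data = d, FUN = mean)
    t.test(response ~ treatment, data = mouse_means)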
2.5 Repeats, Replicates, and Pseudo-Replication

Confusion of repeats with replicates is a problem of study design, and pseudo-replication is a problem of analysis. Study validity is compromised by incorrect identification of the experimental unit. A replicate is a new experimental run on a new experimental unit. Randomisation of interventions to experimental units and randomising the order in which experimental units are measured (sequence allocation randomisation) minimises the effects of systematic error or bias. A repeat is a consecutive run of the same treatment or factor combination. It does not minimise bias and may actually increase bias if there are time-dependencies in the data. Repeats are not valid replicates.

Example: Replication Versus Repeats

In Figure 2.3, the experimental units are eight mice that receive one of two interventions. In the first scenario, both the treatment allocated to each mouse and the measurement sequence are randomised. Bias is minimised and treatment variance will be appropriately estimated. In the second scenario, treatment intervention may or may not have been randomly allocated to mice, but measurements were obtained for all mice in the first group followed by those in the second group. Bias results from confounding of outcome measurements with potential time-dependencies (for example, increasing skill levels or learning) and differences in assessment, especially if treatment allocation is not concealed (blinded).

[Figure 2.3: Replicates versus repeats. True replicates (top panel) are separate runs of the same treatment on separate experimental units; both treatment allocation to units and sequence allocation for the processing of individual experimental units are randomised, and the eight measurements on eight mice are taken in random order. Repeat measurements (bottom panel) are taken during the same experimental run or consecutive runs; unless processing order is randomised, there will be confounding with systematic sources of variability caused by other variables that change over time, and here the eight measurements on eight mice are obtained consecutively with units in the first treatment measured first.]

2.5.1 Repeats of Entire Experiments

A common practice in mouse studies is to repeat an entire experiment two or three times. It has been argued that this practice provides evidence that results are robust. However, NIH directives are clear that replication is justifiable only for major or key results, and that replications be independent. Repeating an experiment in triplicate by a single laboratory is not independent replication. These repeats can provide only an estimate of the overall measurement error of that experiment for that lab. A major consideration is study quality. If the study is poorly designed and underpowered, replicating it only wastes animals. Unless the purpose of direct internal replications is scientifically justified, experiments are appropriately designed and conducted to maximise internal validity, and experimental, biological, and technical replicates are clearly distinguished, simple direct repeats of experiments on whole animals are rarely ethically justifiable. Chapter 6 provides practical guidelines for experiment replication.

2.5.2 Pseudo-Replication

In a classic paper, Hurlbert (1984) defines pseudo-replication as occurring when inferential statistics are used 'to test for treatment effects with data from experiments where either treatments are not replicated (though samples may be) or experimental units are not statistically independent' (Hurlbert 1984, 2009). The extent of pseudo-replication in animal-based research is disturbingly prevalent. Lazic et al. (2018) reported that less than one-quarter of studies they surveyed identified the
correct replication unit, and almost half showed pseudo-replication, suggesting that inferences based on hypothesis tests were likely invalid.

Three of the most common types of pseudo-replication are simple, sacrificial, and temporal. Others are described in Hurlbert and White (1993) and Hurlbert (2009).

Simple pseudo-replication occurs when there is only a single replicate per treatment. There may be multiple observations, but they are not obtained from independent experimental replicates. The artificial inflation of sample size results in estimates of the standard error that are too small, contributing to an increased Type I error rate and an increased number of false positives.

Example: Mouse Photoperiod Exposure

A study on circadian rhythms was conducted to assess the effect of two different photoperiods on mouse wheel-running. Mice in one environmental chamber were exposed to a long photoperiod with 14 hours of light, and mice in a second chamber to a short photoperiod with 6 hours of light. There were 15 cages in each chamber with four mice per cage. What is the effective sample size?

This is simple pseudo-replication. The experimental unit is the chamber, so the effective sample size is one per treatment. Analysing the data as if there is a sample size of n = 60 (or even n = 15) per treatment is incorrect. The number of mice and cages in each chamber is irrelevant. This design implicitly assumes that chamber conditions are uniform and chamber effects are zero. However, variation both between chambers and between repeats for the same chamber can be considerable (Potvin and Tardif 1988; Hammer and Hopper 1997). Increasing the sample size of mice will not remedy this situation because chamber environment is confounded with photoperiod. It is, therefore, not possible to estimate experimental error, and inferential statistics cannot be applied. Analysis should be restricted to descriptive statistics only. The study should be re-designed either to allow replication across several chambers or, if chambers are limited, as a multi-batch design replicated at two or more time points.
Sacrificial pseudo-replication occurs when there are multiple replicates within each treatment arm, the data are structured as a feature of the design (such as pairing, clustering, or nesting), but design structure is ignored in the analyses. The units are treated as independent, so the degrees of freedom for testing treatment effects are too large. Sacrificial pseudo-replication is especially common in studies with categorical outcomes when the χ2 test or Fisher's exact test is used for analysis (Hurlbert and White 1993; Hurlbert 2009).

Example: Sunfish Foraging Preferences

Dugatkin and Wilson (1992) studied feeding success and tankmate preferences in 12 individually marked sunfish housed in two tanks. Preference was evaluated for each fish for all possible pairwise combinations of two other tankmates. There were 2 groups × 60 trials per group × 2 replicate sets of trials, for a total of 240 observations. They concluded that feeding success was weakly but statistically significantly correlated with aggression (P < 0.001) based on 209 degrees of freedom, and that fish in each group strongly preferred (P < 0.001) the same individual in each of the two replicate preference experiments, based on 60 observations.

The actual number of experimental units is 12, with 6 fish per tank. The correct degrees of freedom for the regression analysis is 4, not 209. Suggested analyses for preference data included one-sample t-tests with 5 degrees of freedom or a one-tailed Wilcoxon matched-pairs test with N = 12. Correct analyses would produce much larger P-values, suggesting that interpretation of these data requires substantial revision (Lombardi and Hurlbert 1996).

Temporal (or spatial) pseudo-replication occurs when multiple measurements are obtained sequentially on the same experimental units but analysed as if they represent independent experimental units. Sequential observations (or repeated measures) are correlated within each individual. Repeated measures increase the precision of within-unit estimates, but the number of repeated measures does not increase the power for estimating treatment effects.

Example: Tumour Proliferation in Mouse Models of Cancer

Sequential measurements of solid tumour volume in mice are commonly reported as a measure of disease progression or response to an intervention. Mull et al. (2020) tested the effects of low-dose UCN-01 to promote survival of tumour-bearing mice with lower tumour burden. Mice in four treatment groups were weighed daily for 30 days, then twice weekly to day 75. Differences in tumour volume between groups were assessed by t-tests and one-way ANOVA at five time points.

This is temporal pseudo-replication because the same groups of mice are repeatedly sampled over time, but separate hypothesis tests were performed at different time points. However, successive observations on the same mice are correlated, and sample size is expected to decline as mice die or are humanely euthanised at different times during the study. Traditional ANOVA or repeated-measures ANOVA methods cannot handle missing data or imbalance in the number of repeated responses and do not incorporate the actual correlation structure of the data. Mixed models are much more appropriate, because the true variation in the repeated measurements can be modelled directly by incorporating time dependencies and allowing customisation of the correlation structure; they can also accommodate missing data due to subject loss.

References

Blainey, P., Krzywinski, M., and Altman, N. (2014). Replication. Nature Methods 11: 879–880.
Cox, D.R. and Donnelly, C.A. (2011). Principles of Applied Statistics. Cambridge: Cambridge University Press.
Dugatkin, L.A. and Wilson, D.S. (1992). The prerequisites for strategic behaviour in bluegill sunfish, Lepomis macrochirus. Animal Behaviour 44: 223–230.
Hammer, P.A. and Hopper, D.A. (1997). Experimental design. In: Plant Growth Chamber Handbook (ed. R.W. Langhans and T.W. Tibbitts), 177–188. Iowa State University NCR-101 Publication No. 340. https://www.controlledenvironments.org/wp-content/uploads/sites/6/2017/06/Ch13.pdf.
Hurlbert, S.H. (1984). Pseudoreplication and the design of ecological field experiments. Ecological Monographs 54: 187–211.
Hurlbert, S.H. (2009). The ancient black art and transdisciplinary extent of pseudoreplication. Journal of Comparative Psychology 123 (4): 434–443.
Hurlbert, S.H. and White, M.D. (1993). Experiments with freshwater invertebrate zooplanktivores: quality of statistical analysis. Bulletin of Marine Science 53: 128–153.
Jung, S.-H. and Young, S.S. (2012). Power and sample size calculation for microarray studies. Journal of Biopharmaceutical Statistics 22: 30–42.
Lazic, S.E. (2010). The problem of pseudoreplication in neuroscientific studies: is it affecting your analysis? BMC Neuroscience 11: 5. https://doi.org/10.1186/1471-2202-11-5.
Lazic, S.E. and Essioux, L. (2013). Improving basic and translational science by accounting for litter-to-litter variation in animal models. BMC Neuroscience 14: 37.
Lazic, S.E., Clarke-Williams, C.J., and Munafò, M.R. (2018). What exactly is 'N' in cell culture and animal experiments? PLoS Biology 16: e2005282.
Lee, M.-L.T. and Whitmore, G.A. (2002). Power and sample size for DNA microarray studies. Statistics in Medicine 21: 3543–3570.
Lin, W.-J., Hsueh, H.-M., and Chen, J.J. (2010). Power and sample size estimation in microarray studies. BMC Bioinformatics 11: 48. https://doi.org/10.1186/1471-2105-11-48.
Lombardi, C.M. and Hurlbert, S.H. (1996). Sunfish cognition and pseudoreplication. Animal Behaviour 52: 419–422.
Millar, R.B. and Anderson, M.J. (2004). Remedies for pseudoreplication. Fisheries Research 70: 397–407. https://doi.org/10.1016/j.fishres.2004.08.016.
Mull, B.B., Livingston, J.A., Patel, N. et al. (2020). Specific, reversible G1 arrest by UCN-01 in vivo provides cytostatic protection of normal cells against cytotoxic chemotherapy in breast cancer. British Journal of Cancer 122 (6): 812–822. https://doi.org/10.1038/s41416-019-0707-z.
Potvin, C. and Tardif, S. (1988). Sources of variability and experimental designs in growth chambers. Functional Ecology 2: 123–130.
Taylor, S.C. and Posch, A. (2014). The design of a quantitative Western Blot. BioMed Research International. https://doi.org/10.1155/2014/361590.
van Belle, G. (2008). Statistical Rules of Thumb, 2nd edition. New York: Wiley.
Vaux, D., Fidler, F., and Cumming, G. (2012). Replicates and repeats—what is the difference and is it significant? EMBO Reports 13: 291–296.
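The book supplies SAS code in chapter appendices and points to R packages in the text; as a stand-in sketch here, the listing below shows the kind of mixed model described in Section 2.5.2, using the lme4 package and invented data loosely patterned on the tumour-volume example. It is an illustration under those assumptions, not the analysis of Mull et al. (2020).

    ## Invented longitudinal data: 4 treatment groups, 8 mice per group,
    ## repeated tumour-volume measurements with some observations missing.
    library(lme4)
    set.seed(7)
    tumour <- expand.grid(mouse = factor(1:32), day = c(7, 14, 21, 28, 35))
    tumour$treatment <- factor(rep(rep(LETTERS[1:4], each = 8), times = 5))
    tumour$volume <- 100 + 5 * tumour$day + rnorm(nrow(tumour), sd = 20)
    tumour <- tumour[sample(nrow(tumour), 140), ]   # mimic attrition / missing data

    ## A random intercept per mouse models the correlation among repeated
    ## measurements on the same animal; unbalanced and missing data are handled.
    fit <- lmer(volume ~ treatment * day + (1 | mouse), data = tumour)
    summary(fit)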
