Comparison of Count Modeling Techniques For Estimating Environmental Monitoring Limits in Clean Rooms

This document discusses statistical techniques for setting alert and action limits for environmental monitoring in clean rooms used by pharmaceutical companies. It compares traditional percentile limits, parametric bootstrap limits, nonparametric bootstrap limits, and Bayesian limits using simulated count data. The key distributions considered are Poisson, negative binomial, and zero-inflated versions to account for overdispersion and excess zeros often seen in environmental monitoring data. The goal is to better understand the strengths and limitations of these statistical modeling techniques for setting appropriate environmental monitoring limits.

Statistics in Biopharmaceutical Research

ISSN: (Print) 1946-6315 (Online) Journal homepage: https://www.tandfonline.com/loi/usbr20

“Comparison of Count Modeling Techniques for Estimating Environmental Monitoring Limits in Clean Rooms”

Plinio A. De los Santos, Ji Young Kim, Pieta C. IJzerman-Boon, George G. Kariuki & Brandye Smith-Goettler

To cite this article: Plinio A. De los Santos, Ji Young Kim, Pieta C. IJzerman-Boon, George G. Kariuki & Brandye Smith-Goettler (2020): “Comparison of Count Modeling Techniques for Estimating Environmental Monitoring Limits in Clean Rooms”, Statistics in Biopharmaceutical Research, DOI: 10.1080/19466315.2020.1799854

To link to this article: https://doi.org/10.1080/19466315.2020.1799854

Accepted author version posted online: 23 Jul 2020.

“Comparison of Count Modeling Techniques for Estimating Environmental Monitoring Limits in Clean Rooms”

by Plinio A. De los Santos¹, Ji Young Kim¹, Pieta C. IJzerman-Boon¹, George G. Kariuki¹,² and Brandye Smith-Goettler¹

Abstract:

Pharmaceutical and biotechnology industries manufacture their products in clean rooms, which are designed to minimize levels of particulates (like microorganisms recovered from the air or from the clean room surfaces). Alert and action limits are employed to monitor and control the state of the room, keeping the level of particulates at appropriate levels. Particulate monitoring systems can generate count data with the following characteristics: repeated counts, inflated zero or low counts, possible dispersion, and distributions with long, thin right tails. In this paper, we compare four statistical modeling techniques for setting alert and action limits (i.e., traditional percentile, parametric bootstrap, nonparametric bootstrap, and Bayesian with informative priors) using simulated environmental monitoring data under controlled experimental conditions, to better understand the strengths and limitations of these techniques.

KEYWORDS: Bayesian Percentiles, Bootstrap Percentiles, Particulate Count Estimation, USP<1116>, Zero Inflation.

¹ Center for Mathematical Sciences, MMD, Merck & Co., Inc., Kenilworth, NJ USA
² Currently at Regulatory Compliance and External Engagement, Global Quality / Global Product Development and Supply, Bristol-Myers Squibb, New Brunswick, NJ USA

Introduction

For assurance of drug substance/product quality, pharmaceutical manufacturers use controlled environments (e.g., clean rooms) to mitigate microbial contamination. United States Pharmacopeia Chapter 1116 (USP<1116>) [1], Microbiological Evaluation of Clean Rooms and Other Controlled Environments, indicates that an environmental monitoring program (i) describes in detail the procedures and methods used for monitoring particulates as well as microorganisms in controlled environments, and (ii) includes sampling sites, frequency of sampling, and investigative actions that should be followed if alert or action levels are exceeded. Furthermore, this USP chapter specifies that while an alert level focuses on limits to ensure that the process is within control, an action level is a limit that, if exceeded, should trigger an investigation and a corrective action. In this paper we focus on comparing statistical methods used to establish data-driven alert and action levels in support of an environmental monitoring program.

Wilson [2] stated that, in practice, environmental monitoring data are usually not normally distributed. He also pointed out that their histograms generally resemble a Poisson distribution or a Negative Exponential distribution (a.k.a. “Exponential” distribution), two interrelated distributions that can be employed to describe count processes: the Poisson distribution describes the counts themselves, while the Exponential distribution describes the time between counts. The Poisson distribution requires that the mean and the variance of the counts have the same value, so it can be described simply by the mean count (μ). Since we do not consider real-time monitoring data in this paper, but only data from samples collected at discrete points in time, we use only distributions describing the count data, not the times between them.

Hoffman [3] indicated that counts for many processes cannot be adequately modelled by the Poisson distribution, especially when the data are over-dispersed (i.e., when the variance of the data is considerably larger than its mean). For that situation, Hoffman [3] recommends the Negative Binomial distribution, since it is a natural and flexible extension of the Poisson distribution, as depicted in Figure 1. The figure shows that as the Negative Binomial dispersion parameter “k” increases, the distribution converges to a Poisson. When a random variable, X, follows a Negative Binomial distribution with location parameter “μ” and dispersion parameter “k”, the mean and variance are given by:

$$E[X] = \mu \tag{1}$$

$$\mathrm{Var}[X] = \mu + \frac{\mu^2}{k} \tag{2}$$

Notice that equation (2) also shows that when the dispersion parameter goes to infinity, the mean and the variance of the counts have the same value and the Negative Binomial converges to the Poisson distribution.
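The convergence described above can be checked numerically. The following is a minimal Python sketch (the paper's own analysis was done in R, so the language and the function names `nb_variance`, `poisson_pmf`, `nb_pmf` and `max_pmf_gap` are illustrative assumptions, not the authors' code). It evaluates the variance in equation (2) and the largest pointwise gap between the Negative Binomial and Poisson probabilities as “k” grows.

```python
import math

def nb_variance(mu, k):
    """Variance of a Negative Binomial with mean mu and dispersion k, per equation (2)."""
    return mu + mu ** 2 / k

def poisson_pmf(x, mu):
    """Poisson probability of observing count x when the mean count is mu."""
    return math.exp(x * math.log(mu) - mu - math.lgamma(x + 1))

def nb_pmf(x, mu, k):
    """Negative Binomial probability of count x in the (mu, k) parameterization."""
    log_p = (math.lgamma(x + k) - math.lgamma(k) - math.lgamma(x + 1)
             + k * math.log(k / (k + mu)) + x * math.log(mu / (k + mu)))
    return math.exp(log_p)

def max_pmf_gap(mu, k, upper=30):
    """Largest absolute pmf difference between NB(mu, k) and Poisson(mu) on 0..upper-1."""
    return max(abs(nb_pmf(x, mu, k) - poisson_pmf(x, mu)) for x in range(upper))

if __name__ == "__main__":
    mu = 2.9  # highest mean count level considered in the paper
    for k in (1, 10, 1000, 1_000_000):
        print(f"k={k:>7}: variance={nb_variance(mu, k):.3f}, "
              f"max pmf gap vs Poisson={max_pmf_gap(mu, k):.6f}")
```

With μ = 2.9, the variance falls from 11.31 at k = 1 toward the Poisson value of 2.9 as k increases, and the gap between the two probability functions shrinks accordingly.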

But in certain situations, the environmental data are populated by an excessive number of zeros, beyond what would be expected under either a Poisson or a Negative Binomial distribution. This phenomenon is known as “zero inflation.” In this situation, a zero-inflated probability distribution, which allows for excess zeros, may be employed. The density function “$g(x)$” for a zero-inflated probability distribution can be described with a specified probability of excess zeros (“Pr.(zero)”) and a probability distribution function “$f(x)$”:

$$
g(x) =
\begin{cases}
\text{Pr.(zero)} + \left(1 - \text{Pr.(zero)}\right) f(0), & x = 0 \\
\left(1 - \text{Pr.(zero)}\right) f(x), & x > 0
\end{cases}
\tag{3}
$$

When the probability distribution function “$f(x)$” is Poisson, the zero-inflated distribution is known as a zero-inflated Poisson (or ZIP) distribution [4]. Similarly, when the probability distribution function “$f(x)$” is Negative Binomial, the zero-inflated distribution is known as a zero-inflated Negative Binomial (or ZINB) distribution [5]. For comparison purposes, the above four distributions are considered in this paper with input parameter ranges consistent with historically observed environmental monitoring surface data.
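As an illustration of how a zero-inflated density leads to a percentile limit, the sketch below implements the zero-inflated probability function in Python and finds the smallest count whose cumulative probability reaches a target level. This is a hedged example rather than the authors' R code: the function names are hypothetical, and treating `k=None` as the Poisson case is a convenience assumed here.

```python
import math

def count_pmf(x, mu, k=None):
    """Poisson pmf when k is None, otherwise Negative Binomial pmf with dispersion k."""
    if k is None:
        return math.exp(x * math.log(mu) - mu - math.lgamma(x + 1))
    log_p = (math.lgamma(x + k) - math.lgamma(k) - math.lgamma(x + 1)
             + k * math.log(k / (k + mu)) + x * math.log(mu / (k + mu)))
    return math.exp(log_p)

def zero_inflated_pmf(x, mu, pr_zero, k=None):
    """Zero-inflated pmf: extra mass pr_zero at zero, base pmf scaled by (1 - pr_zero)."""
    base = count_pmf(x, mu, k)
    if x == 0:
        return pr_zero + (1.0 - pr_zero) * base
    return (1.0 - pr_zero) * base

def true_percentile(q, mu, pr_zero, k=None):
    """Smallest count x whose cumulative probability is at least q (0 < q < 1)."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    x = 0
    cumulative = zero_inflated_pmf(0, mu, pr_zero, k)
    while cumulative < q:
        x += 1
        cumulative += zero_inflated_pmf(x, mu, pr_zero, k)
    return x

if __name__ == "__main__":
    # ZINB corner point: mu = 2.9, k = 1, Pr.(zero) = 0.6
    print(true_percentile(0.95, 2.9, 0.6, k=1), true_percentile(0.99, 2.9, 0.6, k=1))
```

Setting `pr_zero=0` recovers the plain Poisson or Negative Binomial case, so one function covers all four distributions considered in the paper.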

Figure 1: Examples of Poisson and Negative Binomial distributions

Method Description and Study Design

Traditionally, alert and action limits are calculated by obtaining one-sided percentiles of a suitable parametric distribution. In this paper, the limits are also calculated using Bootstrap and Bayesian procedures, as outlined in Table 1. Some additional information about these techniques follows:

• Nonparametric bootstrapping [6, 7, 8] is a computationally intensive technique for making inferences about a population characteristic using samples from the population. The central idea of bootstrapping is that it may sometimes be better to draw conclusions about the characteristics of the population strictly from the sample at hand. Bootstrapping involves “resampling” the data with replacement many times, in order to generate an empirical estimate of the entire sampling distribution of the statistic.

Table 1: Description of Environmental Monitoring Upper Limits Estimation Methods for Comparison

| Method | Process |
|---|---|
| Traditional | Fit the assumed distribution to the data using maximum likelihood estimation and get an upper percentile limit (e.g. 95% or 99%, to be used as alert or action limit) from the fitted distribution. |
| Bootstrap | (1) If the bootstrap is parametric, fit the assumed distribution to the data and estimate the population parameters from an observed sample of size “n” using maximum likelihood estimation. Otherwise, employ the observed raw data frequencies as the fitted distribution. (2) Resample the fitted distribution “B” times, obtaining a sample of size “n” on each occasion. Then, for each re-sample, estimate an empirical distribution-based percentile limit. In this context, “re-sampling” implies the generation of “n” random counts from the fitted distribution at each of the “B” iterations. For the comparison, “B” was set to 1000. (3) Set the environmental monitoring limit equal to the median of the “B” percentile limits. The median of the bootstrap samples was employed instead of the average of the bootstrap samples because it was considered a more robust distribution parameter. |
| Bayesian | (1) For each assumed distribution, obtain 20,000 samples from the posterior distributions of the parameters of the assumed distribution. Sampling for the assumed distribution was performed in RStan using 5,500 iterations including a warm up period of 500 on 2 MCMC chains. To reduce autocorrelation, thinning was set at 5, meaning that every 5th sample was saved. (2) For each of the 20,000 sampled parameter combinations, obtain a percentile limit of the corresponding distribution. (3) Set the environmental monitoring limit equal to the median of the distribution of the percentile limits. |

• Parametric bootstrapping initially assumes a distribution for the population and employs an observed sample to estimate the distributional parameters. A large number of samples is then drawn from the estimated parametric distribution to calculate the statistic of interest.

• A Bayesian approach [9] is another computationally intensive technique, based on the idea that the distribution parameters are random variables. It allows the use of prior information on the parameters when available. With advances in computational power, estimating a posterior distribution has become simple and flexible, in the sense that the posterior can be estimated for essentially any given prior distribution. Another advantage of this technique is that the posterior predictive distribution of the statistics of interest can be obtained directly from the posterior distributions of the parameters based on Markov Chain Monte Carlo (MCMC) [10]. The Bayesian analysis presented in this paper was performed using Stan [11], a powerful computational platform for MCMC sampling with an R interface. The sampling was done using the ‘No-U-Turn Sampler (NUTS)’ algorithm, considering its advantages over other algorithms (robustness against tuning parameters and flexibility in the choice of models) [12], and the predictive distributions of the microbial count upper limits were obtained.
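To make the traditional and bootstrap procedures of Table 1 concrete, here is a minimal Python sketch for the simplest case, an assumed Poisson distribution, whose maximum likelihood estimate is the sample mean. It is an illustration under stated assumptions, not the authors' R implementation: the function names, the Knuth-style Poisson sampler, the percentile definition (smallest value whose empirical cumulative frequency reaches q) and the example counts are all choices made here.

```python
import math
import random
import statistics

def poisson_percentile(q, mu):
    """Smallest count x with Poisson(mu) cumulative probability >= q."""
    x, p = 0, math.exp(-mu)
    cumulative = p
    while cumulative < q:
        x += 1
        p *= mu / x
        cumulative += p
    return x

def empirical_percentile(sample, q):
    """Smallest observed value whose empirical cumulative frequency is >= q."""
    ordered = sorted(sample)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]

def draw_poisson(mu, rng):
    """One Poisson(mu) draw using Knuth's multiplication method (fine for small mu)."""
    threshold, product, count = math.exp(-mu), rng.random(), 0
    while product > threshold:
        product *= rng.random()
        count += 1
    return count

def traditional_limit(data, q):
    """Fit Poisson by maximum likelihood (sample mean) and read off the percentile."""
    return poisson_percentile(q, statistics.fmean(data))

def bootstrap_limit(data, q, parametric, B=1000, seed=1):
    """Median of B resampled percentile limits, parametric or nonparametric."""
    rng = random.Random(seed)
    n, mu_hat = len(data), statistics.fmean(data)
    limits = []
    for _ in range(B):
        if parametric:
            resample = [draw_poisson(mu_hat, rng) for _ in range(n)]
        else:
            resample = rng.choices(data, k=n)  # resample observed counts with replacement
        limits.append(empirical_percentile(resample, q))
    return statistics.median(limits)

if __name__ == "__main__":
    counts = [0] * 38 + [1] * 12 + [2] * 6 + [3] * 3 + [5]  # hypothetical n=60 sample
    for q in (0.95, 0.99):
        print(q, traditional_limit(counts, q),
              bootstrap_limit(counts, q, parametric=True),
              bootstrap_limit(counts, q, parametric=False))
```

For the Negative Binomial, ZIP and ZINB cases the same structure applies, but the fitting step requires numerical maximum likelihood rather than a closed-form estimate.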

During the limit calculation across methods, both the 95th and the 99th percentile levels were calculated. To evaluate the above calculation methods, simulated data were generated under the extreme parameter conditions listed in Table 2. Their corresponding densities are plotted in Figure 2. Then, 30 sphere-packing space-filling design points, represented in an experimental cube in Figure 3, were employed to survey points within the extreme parameter space [13]. As listed in Table 3, combining the sphere-packing design points with the extreme corner points not already included yields 35 parameter point locations. Their mean, variance and “true” percentiles are also provided in Table 3. The selected parameters represent conditions consistent with historically observed environmental monitoring surface data. However, these should not be considered universal across all possible situations that could be observed in the field. The selected base conditions enable us to illustrate and characterize the performance of the environmental monitoring estimation methods outlined in the previous section within the selected parameter context. Table 4 lists the assumed informative prior distributions employed with the Bayesian method, which were also chosen based on the range of data historically observed.

Sample sizes of n=60 and n=300 were used for each parameter combination as typical small and large sample sizes for the counts in a specific room, with 50 experimental replicates for each combination. Hence, a total of 3,500 simulated datasets (35 parameter points × 2 sample sizes × 50 replicates) were created using the R script provided in Appendix A.
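The data generation step can be sketched as follows. This Python example is not the R script from Appendix A; it is a hedged illustration that simulates only the eight corner points of the design (both sample sizes, 50 replicates each), using a gamma-Poisson mixture to draw Negative Binomial counts. The function names and the seed are assumptions made for the example.

```python
import math
import random

def draw_poisson(mu, rng):
    """One Poisson(mu) draw using Knuth's multiplication method (fine for small mu)."""
    threshold, product, count = math.exp(-mu), rng.random(), 0
    while product > threshold:
        product *= rng.random()
        count += 1
    return count

def draw_zinb(mu, k, pr_zero, rng):
    """One zero-inflated Negative Binomial draw with mean parameter mu and dispersion k."""
    if rng.random() < pr_zero:
        return 0  # structural zero
    # A gamma(shape=k, scale=mu/k) rate mixed over a Poisson gives NB(mu, k).
    rate = rng.gammavariate(k, mu / k)
    return draw_poisson(rate, rng)

def simulate_corner_datasets(seed=2020, replicates=50):
    """Simulate datasets at the 8 corner points for n = 60 and n = 300."""
    rng = random.Random(seed)
    datasets = {}
    for pr_zero in (0.0, 0.6):
        for k in (1, 1000):  # k = 1000 approximates the setting without dispersion
            for mu in (0.1, 2.9):
                for n in (60, 300):
                    for rep in range(replicates):
                        key = (pr_zero, k, mu, n, rep)
                        datasets[key] = [draw_zinb(mu, k, pr_zero, rng) for _ in range(n)]
    return datasets

if __name__ == "__main__":
    sims = simulate_corner_datasets()
    print(len(sims), "corner-point datasets")   # 8 points x 2 sizes x 50 replicates
    print(35 * 2 * 50, "datasets in the full 35-point design")
```

Extending the loop to all 35 parameter points in Table 3 would reproduce the size of the full design.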

Table 2: Extreme Parameter Combinations and Resulting Distributions

| Inflation (probability of structural zeros, irrespective of distribution driven zeros) | Dispersion Parameter k | Mean Count Parameter μ | Source Distribution |
|---|---|---|---|
| No [Pr.(zero)=0.0] | No [k = 1000 (large number)] | Low (0.1) | Poisson (μ=0.1) |
| No [Pr.(zero)=0.0] | No [k = 1000 (large number)] | High (2.9) | Poisson (μ=2.9) |
| No [Pr.(zero)=0.0] | Yes [k=1] | Low (0.1) | Negative Binomial (μ=0.1, k=1) |
| No [Pr.(zero)=0.0] | Yes [k=1] | High (2.9) | Negative Binomial (μ=2.9, k=1) |
| Yes [Pr.(zero)=0.6] | No [k = 1000 (large number)] | Low (0.1) | ZIP (μ=0.1, Pr.(zero)=0.6) |
| Yes [Pr.(zero)=0.6] | No [k = 1000 (large number)] | High (2.9) | ZIP (μ=2.9, Pr.(zero)=0.6) |
| Yes [Pr.(zero)=0.6] | Yes [k=1] | Low (0.1) | ZINB (μ=0.1, k=1, Pr.(zero)=0.6) |
| Yes [Pr.(zero)=0.6] | Yes [k=1] | High (2.9) | ZINB (μ=2.9, k=1, Pr.(zero)=0.6) |

Note: It was assumed that k=1000 is large enough to approximate the setting without dispersion.

Table 3: Parameter Values for Simulated Experiments and “True” Percentiles from Source Distributions

| Inflation Pr.(zero) | Dispersion k | Mean Count Parameter μ | Source Distribution | Mean | Variance | “True” 95% | “True” 99% | Sphere Packing Design Point | Corner Point |
|---|---|---|---|---|---|---|---|---|---|
| 0.00 | 1 | 0.10 | Negative Bin. | 0.10 | 0.11 | 1 | 1 |  | X |
| 0.00 | 1 | 2.90 | Negative Bin. | 2.90 | 11.31 | 10 | 15 | X | X |
| 0.00 | 222 | 0.10 | Negative Bin. | 0.10 | 0.10 | 1 | 1 | X |  |
| 0.00 | 545 | 1.68 | Negative Bin. | 1.68 | 1.69 | 4 | 5 | X |  |
| 0.00 | 1,000 | 0.10 | Poisson | 0.10 | 0.10 | 1 | 1 |  | X |
| 0.00 | 1,000 | 1.20 | Poisson | 1.20 | 1.20 | 3 | 4 | X |  |
| 0.00 | 1,000 | 2.90 | Poisson | 2.90 | 2.91 | 6 | 7 | X | X |
| 0.01 | 463 | 2.90 | ZINB | 2.87 | 3.14 | 6 | 8 | X |  |
| 0.02 | 640 | 0.48 | ZINB | 0.47 | 0.48 | 2 | 3 | X |  |
| 0.07 | 1 | 1.13 | ZINB | 1.05 | 2.47 | 4 | 7 | X |  |
| 0.15 | 1,000 | 0.10 | ZIP | 0.09 | 0.09 | 1 | 1 | X |  |
| 0.16 | 222 | 2.11 | ZINB | 1.77 | 3.31 | 5 | 6 | X |  |
| 0.18 | 846 | 1.99 | ZINB | 1.63 | 3.09 | 4 | 6 | X |  |
| 0.24 | 1 | 0.10 | ZINB | 0.08 | 0.09 | 1 | 1 | X |  |
| 0.25 | 512 | 1.24 | ZINB | 0.93 | 1.60 | 3 | 4 | X |  |
| 0.29 | 531 | 2.90 | ZINB | 2.06 | 5.90 | 6 | 7 | X |  |
| 0.31 | 647 | 0.10 | ZINB | 0.07 | 0.07 | 1 | 1 | X |  |
| 0.32 | 1 | 1.39 | ZINB | 0.95 | 3.16 | 4 | 7 | X |  |
| 0.34 | 1,000 | 1.08 | ZIP | 0.71 | 1.26 | 3 | 4 | X |  |
| 0.34 | 1,000 | 2.90 | ZIP | 1.91 | 5.87 | 5 | 7 | X |  |
| 0.34 | 73 | 2.90 | ZINB | 1.91 | 5.94 | 6 | 7 | X |  |
| 0.43 | 287 | 0.57 | ZINB | 0.32 | 0.48 | 2 | 3 | X |  |
| 0.44 | 746 | 1.99 | ZINB | 1.11 | 2.95 | 4 | 5 | X |  |
| 0.50 | 996 | 0.10 | ZINB | 0.05 | 0.05 | 0 | 1 | X |  |
| 0.57 | 466 | 2.90 | ZINB | 1.25 | 4.58 | 5 | 7 | X |  |
| 0.59 | 34 | 1.42 | ZINB | 0.58 | 1.38 | 3 | 4 | X |  |
| 0.60 | 1 | 0.10 | ZINB | 0.04 | 0.05 | 0 | 1 | X | X |
| 0.60 | 1 | 2.66 | ZINB | 1.06 | 6.54 | 6 | 11 | X |  |
| 0.60 | 1 | 2.90 | ZINB | 1.16 | 7.67 | 7 | 12 |  | X |
| 0.60 | 478 | 1.34 | ZINB | 0.54 | 1.21 | 3 | 4 | X |  |
| 0.60 | 581 | 0.10 | ZINB | 0.04 | 0.04 | 0 | 1 | X |  |
| 0.60 | 938 | 2.82 | ZINB | 1.13 | 4.11 | 5 | 7 | X |  |
| 0.60 | 983 | 1.25 | ZINB | 0.50 | 1.09 | 3 | 4 | X |  |
| 0.60 | 1,000 | 0.10 | ZIP | 0.04 | 0.04 | 0 | 1 |  | X |
| 0.60 | 1,000 | 2.90 | ZIP | 1.16 | 4.31 | 5 | 7 |  | X |

Figure 2: Density Comparisons at the Extreme Parameter Combinations (Corner Points)

Figure 3: Experimental Cube with Data Generation Parameters

Table 4: Bayesian Informative Prior per Assumed Distribution

| Parameter | Assumed Informative Prior Distribution | Applies to Poisson | Applies to Negative Binomial | Applies to ZIP | Applies to ZINB |
|---|---|---|---|---|---|
| Mean Count Parameter μ | uniform(0.1, 2.9) | X | X | X | X |
| Dispersion k | uniform(1, 1000) |  | X |  | X |
| Inflation Pr.(zero) | uniform(0, 0.6) |  |  | X | X |

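The Bayesian limit calculation can be illustrated for the one-parameter Poisson case with the uniform(0.1, 2.9) prior on μ. The sketch below deliberately swaps the paper's Stan/NUTS sampler for a simple grid approximation of the posterior, which is adequate only because there is a single bounded parameter; the function names, grid size and seed are assumptions made for this example, written in Python rather than the R interface the authors used.

```python
import math
import random
import statistics

def poisson_percentile(q, mu):
    """Smallest count x with Poisson(mu) cumulative probability >= q."""
    x, p = 0, math.exp(-mu)
    cumulative = p
    while cumulative < q:
        x += 1
        p *= mu / x
        cumulative += p
    return x

def grid_posterior(data, lower=0.1, upper=2.9, points=281):
    """Unnormalized posterior weights for mu on a grid, under a uniform(lower, upper) prior."""
    n, total = len(data), sum(data)
    step = (upper - lower) / (points - 1)
    grid = [lower + i * step for i in range(points)]
    # With a flat prior the log-posterior equals the Poisson log-likelihood up to a constant.
    log_post = [total * math.log(mu) - n * mu for mu in grid]
    peak = max(log_post)
    weights = [math.exp(value - peak) for value in log_post]
    return grid, weights

def bayesian_limit(data, q, draws=20000, seed=1):
    """Median of the percentile limits computed at each posterior draw of mu."""
    rng = random.Random(seed)
    grid, weights = grid_posterior(data)
    mu_draws = rng.choices(grid, weights=weights, k=draws)
    return statistics.median(poisson_percentile(q, mu) for mu in mu_draws)

if __name__ == "__main__":
    counts = [0] * 38 + [1] * 12 + [2] * 6 + [3] * 3 + [5]  # hypothetical n=60 sample
    print(bayesian_limit(counts, 0.95), bayesian_limit(counts, 0.99))
```

For the Negative Binomial, ZIP and ZINB models, which add the dispersion and inflation parameters from Table 4, a grid quickly becomes impractical and an MCMC sampler such as the one used in the paper is the more natural choice.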
Assessing Akaike Information Criteria Goodness of Fit Patterns

For each of the simulated datasets, four candidate distributions (i.e., Poisson, Negative Binomial, ZIP, and ZINB) were fitted, even when the assumed distribution was not the true distribution. However, as listed in Table 5, in 41 of the 14,000 possible instances (3,500 datasets × 4 assumed distributions) it was not feasible to fit an assumed distribution due to convergence issues. This situation may occasionally be observed when the sample size is small, the mean counts are low, and there is a preponderance of zeros [14, 15]. As a result, only 13,959 feasible cases were part of the assessment.

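Table 5: List of Assumed Distribution Fit with Convergence Issues

| Assumed Distribution | n | Source Inflation Pr.(zero) | Source Dispersion k | Source Mean Count Parameter μ | Number of Convergence Issues |
|---|---|---|---|---|---|
| Negative Binomial | 60 | 0.00 | 222 | 0.1 | 1 |
| Negative Binomial | 60 | 0.31 | 647 | 0.1 | 1 |
| Negative Binomial | 60 | 0.50 | 996 | 0.1 | 2 |
| Negative Binomial | 60 | 0.60 | 1 | 0.1 | 3 |
| Negative Binomial | 60 | 0.60 | 581 | 0.1 | 7 |
| Negative Binomial | 60 | 0.60 | 1,000 | 0.1 | 3 |
| Negative Binomial | 300 | 0.00 | 1,000 | 2.9 | 1 |
| ZINB | 60 | 0.00 | 222 | 0.1 | 1 |
| ZINB | 60 | 0.00 | 1,000 | 2.9 | 3 |
| ZINB | 60 | 0.31 | 647 | 0.1 | 1 |
| ZINB | 60 | 0.50 | 996 | 0.1 | 2 |
| ZINB | 60 | 0.60 | 1 | 0.1 | 3 |
| ZINB | 60 | 0.60 | 581 | 0.1 | 7 |
| ZINB | 60 | 0.60 | 1,000 | 0.1 | 3 |
| ZIP | 60 | 0.00 | 1,000 | 2.9 | 3 |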
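The bookkeeping behind the 13,959 feasible cases can be verified with a few lines of arithmetic. The Python snippet below is only a check of the counts reported in the text and in Table 5; the variable names are illustrative.

```python
# Number of simulated datasets: parameter points x sample sizes x replicates.
datasets = 35 * 2 * 50
# Each dataset is fitted with four assumed distributions.
fits_attempted = datasets * 4

# Convergence issues per assumed distribution, row by row from Table 5.
issues = {
    "Negative Binomial": [1, 1, 2, 3, 7, 3, 1],
    "ZINB": [1, 3, 1, 2, 3, 7, 3],
    "ZIP": [3],
}
failed = sum(sum(rows) for rows in issues.values())
feasible = fits_attempted - failed

print(datasets, fits_attempted, failed, feasible)  # 3500 14000 41 13959
```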

The goodness-of-fit of each fitted distribution to the data was assessed using the Akaike Information Criterion (AIC) [16]. This criterion helps identify the best model by balancing fit against the number of parameters, with the objective of avoiding either over- or under-fitting the data:

$$\mathrm{AIC} = 2k - 2\ln(\hat{L}) \tag{4}$$

where “k”, in this case, is the number of parameters, and “$\hat{L}$” is the maximum value of the likelihood function.
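As a worked illustration of equation (4), the Python sketch below compares the AIC of a Poisson fit and a ZIP fit on a small dataset with many zeros. It is an assumption-laden example rather than the authors' procedure: the Poisson fit uses the closed-form maximum likelihood estimate, the ZIP fit uses a basic EM iteration chosen here for simplicity, the data are hypothetical, and the code assumes at least one positive count.

```python
import math

def poisson_log_pmf(y, lam):
    """Log probability of count y under Poisson(lam)."""
    return y * math.log(lam) - lam - math.lgamma(y + 1)

def aic(log_likelihood, n_params):
    """Akaike Information Criterion, equation (4): 2k - 2 ln(L-hat)."""
    return 2 * n_params - 2 * log_likelihood

def fit_poisson(data):
    """Closed-form Poisson MLE; returns (lam, maximized log-likelihood)."""
    lam = sum(data) / len(data)
    return lam, sum(poisson_log_pmf(y, lam) for y in data)

def fit_zip(data, iterations=500):
    """EM fit of a zero-inflated Poisson; returns (pr_zero, lam, log-likelihood)."""
    n, total = len(data), sum(data)
    zeros = sum(1 for y in data if y == 0)
    pr_zero, lam = 0.5, total / n
    for _ in range(iterations):
        # E-step: expected number of observed zeros that are structural zeros.
        share = pr_zero / (pr_zero + (1.0 - pr_zero) * math.exp(-lam))
        structural = share * zeros
        # M-step: update the inflation probability and the Poisson mean.
        pr_zero = structural / n
        lam = total / (n - structural)
    log_lik = 0.0
    for y in data:
        if y == 0:
            log_lik += math.log(pr_zero + (1.0 - pr_zero) * math.exp(-lam))
        else:
            log_lik += math.log(1.0 - pr_zero) + poisson_log_pmf(y, lam)
    return pr_zero, lam, log_lik

if __name__ == "__main__":
    counts = [0] * 40 + [1] * 3 + [2] * 5 + [3] * 6 + [4] * 4 + [5] * 2  # hypothetical
    _, ll_pois = fit_poisson(counts)
    _, _, ll_zip = fit_zip(counts)
    print("Poisson AIC:", round(aic(ll_pois, 1), 2))
    print("ZIP AIC:    ", round(aic(ll_zip, 2), 2))
```

On data like these, where zeros are far more frequent than a Poisson with the same mean would produce, the ZIP model attains the smaller AIC despite its extra parameter.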

Some AIC estimates may be too close to one another to tell whether the difference between fitted options is meaningful. A goodness of fit assessment using AIC is relative: the smaller the AIC value in comparison to others, the better the fit. The AIC should therefore be expected to be smaller when the assumed distribution is consistent with the source distribution input parameters. However, there are instances where the AIC for assumed distributions that differ from the source distribution is very close to the AIC obtained when the assumed and source distributions are consistent with each other. This happened more often at the lower mean count levels. For illustration purposes, Figure 4 provides box-plots of the AIC per sample size level for the various assumed distributions and the source distributions at the experiment corner points, split by the mean count levels. The split by mean count level was motivated by the clear differences in scale between the AIC at the low and high mean count levels. When drilling down into these plots, some visual patterns emerged:

• Irrespective of the mean count level, the higher sample size level tended to exhibit a higher AIC. This could be expected since the AIC is a likelihood-based criterion and the likelihood scales with the sample size.

• When the mean count was at the low level, the AIC values from different assumed distributions tended to be close to each other at a given sample size and source distribution. Similar patterns were also observed at the high mean count level when the source distribution was Poisson.

• But when the mean count was at the high level and the true source distribution was other than Poisson, there appeared to be differences in AIC values, indicating that the Poisson, followed by the ZIP, did not fit very well when the source distribution was different.

• Although the visual patterns in Figure 4 suggest that multiple distributions may fit a dataset, there is no clear guidance on when AIC values are close enough to each other to suggest that there is no significant difference between the distribution fits. The AIC, as well as other commonly employed metrics (like the Bayesian Information Criterion, or BIC), can only provide a ranking. But in most cases evaluated, it is possible to use a p-value based assessment like the chi-square test. Also, Vuong [17] proposed the use of a pairwise test for the types of distributions that are being considered in this paper. Although Wilson indicated that Vuong’s test has been misused for zero-inflated models under non-nested conditions [18], Merkle et al. [19] evaluated the adequacy of Vuong’s test under those conditions. In this context, two models are “nested” when one reduces to the other when certain parameters are fixed [18]. In the end, practitioners in the field should exercise cautious judgement.

Figure 4: Comparison of AIC Results by Sample Size and Assumed Distribution Levels at the Corner Point Source Distributions and at the Extreme Mean Count Levels

Assessing Limit Patterns

After fitting the various assumed distributions to each of the simulated datasets, the main interest was the comparison of the 95th and 99th percentile limits obtained with all the methods. With 13,959 fitted distribution cases, that yields 111,672 comparisons (13,959 × 4 × 2) across the 4 estimation methods (i.e., Bayesian with informative priors, parametric bootstrap, nonparametric bootstrap, and traditional) and the 2 sets of limits.

Figures 5 and 6 show the average difference between the estimated percentile limits and the “true” limits from the source distributions (the “true” percentile limits are provided in Table 3). These show that when the mean count level was low (i.e., less than 1.5), or when there was a combination of high mean count level (i.e., over 1.5) without extreme inflation levels (i.e., between 0.15 and 0.45), the average limit differences were relatively low, and in some cases very close to zero. But when there was a combination of high mean count level (i.e., over 1.5) with extreme inflation levels (i.e., below 0.15 or above 0.45), there was more variability in the limit differences (especially at the 99th percentile limits). That suggests an interaction between the mean count level and the inflation parameter in their effect on the limit difference. These plots also revealed that at the high mean count parameter levels, the differences tended to show a decreasing slope as a function of the dispersion parameter “k”, varying with the calculation method and the assumed distribution. The average differences appeared to be consistent between the two sample size levels evaluated (i.e., 60 and 300). Figure 6 shows that for the 99th percentile limits, the Bayesian method provides lower percentile estimates than the other methods, usually closer to the true limits, except sometimes at higher mean count levels when the assumed distribution is Negative Binomial. However, this pattern is not present for the 95th percentile limits in Figure 5.

Figure 5: Plotting the 95% Limit Difference (True minus Estimated) for Cases that Converged
Note: a 95% confidence interval is included around each method by dispersion slope.

Figure 6: Plotting the 99% Limit Difference (True minus Estimated) for Cases that Converged
Note: a 95% confidence interval is included around each method by dispersion slope.

Figures 7 and 8 provide violin plots of the differences between the “true” and the estimated limits by estimation method, sample size and percentile level for the lowest and highest mean count parameter levels. Violin plots show the probability density of the data for different groups, usually smoothed by a kernel. In these figures, mean differences are denoted as red and blue circles, and the black dashed lines are the reference lines where the differences are zero. In addition, Figures 9 and 10 directly compare the estimated and the “true” limits for the lowest and highest mean count parameter levels. These plots show that:

• When the mean count was the lowest evaluated (0.1) and the limit was estimated at the 95th percentile level, the bootstrap and traditional methods tended to show limit differences around zero. But in that case, when the source distribution was ZIP or ZINB (i.e., the two distributions with high inflation), the Bayesian method with informative priors tended to yield limits higher than the “true” limit. For all source distributions, higher sample sizes led to limits closer to the true one.

• When the mean count was the lowest evaluated (0.1) and the limit was estimated at the 99th percentile level, two patterns were observed:

  o When the source distribution was ZIP or ZINB, the Bayesian method tended to show limit differences around zero. But the bootstrap and traditional methods tended to yield limits lower than the “true” limit.

  o When the source distribution was Poisson or Negative Binomial (no inflation), all the estimation methods tended to show limit differences around zero.

  Higher sample sizes led to limits closer to the true one, especially when the source distribution was Poisson or Negative Binomial.

• When the mean count was the highest evaluated (2.9) and the limit was estimated at the 95th percentile level, the non-parametric bootstrap method tended to consistently show limit differences around zero irrespective of the source distribution. Especially when the assumed distribution was Poisson or ZIP and differed from the source distribution, the non-parametric bootstrap provided limits closer to the true limits than the other methods.

• When the mean count was the highest evaluated (2.9) and the limit was estimated at the 99th percentile level, two patterns were observed:

  o When the source distribution was Negative Binomial or ZINB, all the estimation methods tended to yield limits lower than the “true” limit. But in some of the assumed distribution cases (when the assumed distribution was consistent with the source distribution), the Bayesian method with informative priors tended to show limit differences closer to zero.

  o When the source distribution was Poisson or ZIP, the Bayesian method with informative priors tended to show limit differences around zero, while the other methods tended to result in limits lower than the “true” limit.

Patterns in Figures 5 to 8 may suggest some implementation strategies for the estimation methods:

• Because the average differences between the true and estimated limits were low when mean count levels were low (i.e., less than 1.5), or when there was a combination of high mean count level (i.e., over 1.5) without extreme inflation levels (i.e., between 0.15 and 0.45), it may be appropriate in those situations to employ the simplest estimation method (i.e., the traditional percentile).

• But when there was a combination of high mean count level (i.e., over 1.5) with extreme inflation levels (i.e., below 0.15 or above 0.45), no single method would be optimal across all related conditions. For instance:

  o When the calculated percentile was around 95%, the non-parametric bootstrap method appeared to yield the smallest average differences in most of the cases.

  o When the calculated percentile was around 99%, the Bayesian method with informative priors appeared to yield the smallest differences in most of the cases.
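The strategy above can be summarized as a small decision rule. The Python function below is a hypothetical helper written for illustration, not something provided by the authors; it assumes 1.5 as the boundary between low and high mean counts and 0.15 to 0.45 as the non-extreme inflation range, and it only covers the two percentile levels studied. It should be read as a restatement of the simulation patterns, not as a validated selection procedure.

```python
def suggest_method(mean_count, pr_zero, percentile):
    """Suggest an estimation method from the patterns seen in the simulation study.

    mean_count: estimated mean count; pr_zero: estimated inflation probability;
    percentile: 0.95 or 0.99 (the only levels evaluated in the study).
    """
    if percentile not in (0.95, 0.99):
        raise ValueError("only the 95th and 99th percentiles were evaluated")
    high_mean = mean_count > 1.5
    extreme_inflation = pr_zero < 0.15 or pr_zero > 0.45
    if not (high_mean and extreme_inflation):
        # Methods gave similar results here, so the simplest one may suffice.
        return "traditional"
    if percentile == 0.95:
        return "nonparametric bootstrap"
    return "Bayesian (informative priors)"

if __name__ == "__main__":
    print(suggest_method(0.1, 0.60, 0.99))
    print(suggest_method(2.9, 0.30, 0.99))
    print(suggest_method(2.9, 0.60, 0.95))
    print(suggest_method(2.9, 0.60, 0.99))
```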

Figure 7: Differences between True and Estimated Percentile Limits for the Lowest Mean Count Parameter Level (0.1) and the Dispersion and Inflation Corner Points

Note: Violin plots of the difference in limits (True-Estimated) at each percentile level (95th and 99th) by the assumed distribution (A), the source distribution (S) and sample size (red: n=60, blue: n=300). The mean differences are shown as dots in each color. Black dashed lines are drawn at the difference of 0 (a perfect match between true and estimated limits). Notation employed to label methods: BY_I=Bayesian with informative priors, BT_NP=nonparametric bootstrap, BT_P=parametric bootstrap, and Trad=traditional.

Figure 8: Differences between True and Estimated Percentile Limits for the Highest Mean Count Parameter Level (2.9) and the Dispersion and Inflation Corner Points

Figure 9: Estimated vs True Percentile Limits for Lowest Mean Count Parameter Level (0.1) per Method (panels: Bayesian with Informative Priors, Nonparametric Bootstrap, Parametric Bootstrap, Traditional)

Figure 10: Estimated vs True Percentile Limits for Highest Mean Count Level (2.9) per Method (panels: Bayesian with Informative Priors, Nonparametric Bootstrap, Parametric Bootstrap, Traditional)

Concluding Remarks

In this paper, we explored statistical methods to estimate the limits of the microbial counts through a

carefully designed simulation study. The following was observed through the study:

1. When mean counts are close to zero, fitting limits under any of the assumed distributions would

yield similar results. In that case it may be reasonable to fit the simplest distribution possible (e.g.,

Poisson).

t
2. When mean count levels were high and source distributions were over-dispersed (like Negative

ip
Binomial and ZINB), estimation methods may underestimate the actual limits.

cr
3. Typically, the limit estimation methods evaluated on average tend to yield similar results when the

us
mean count values were low, or when there was a combination of high mean count level without

extreme inflation levels. In that situation, it may be appropriate to estimate the limits using the
an
simplest methodology (e.g., with the traditional method).
M
4. When there was a combination of high mean count level with extreme inflation levels, there was

more variability in the limit differences. In that situation it was observed that, when estimating a
d

95th percentile limit, on average the non-parametric bootstrap method tended to yield limits which
e

were closer to the source distribution limits. Similarly, it was observed that, when estimating a 99 th
pt

percentile limit, on average the Bayesian method with informative priors tended to yield limits
ce

which were closer to the source distribution limits.

5. The Akaike Information Criteria (AIC) can provide a relative goodness of fit ranking among several

assumed distributions. Other p-value based goodness of fit alternatives (such as chi-square test or

Vuong’s test) are available but would not always be informative in detecting differences between

multiple groups with similar ranking. In the end, practitioners in the field should exercise cautious

judgement.

6. Sometimes an assumed distribution cannot be fit to the data, especially when the sample size is limited and the assumed distribution is skewed (as in the case of the Negative Binomial distribution). For that reason, it is important to assess multiple distributions when trying to obtain data-driven environmental monitoring limits.
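The AIC ranking in item 5 and the fit failures in item 6 can be illustrated with a minimal sketch. It is not the paper's code: the sample is hypothetical, only the Poisson and Negative Binomial candidates are included, and the helper names (`nll_poisson`, `nll_negbin`, `fit_aic`) are made up for this illustration. Each candidate is fit by maximum likelihood with base R's `optim()`, and a failed fit is recorded as `NA` rather than stopping the comparison.

```r
# Hypothetical over-dispersed sample of 60 counts (illustrative only,
# not data from the paper).
set.seed(42)
counts <- rnbinom(60, mu = 4, size = 0.8)

# Negative log-likelihoods of intercept-only models. Parameters are kept on
# the log scale so the optimizer can search without constraints.
nll_poisson <- function(par, y) {
  -sum(dpois(y, lambda = exp(par[1]), log = TRUE))
}
nll_negbin <- function(par, y) {
  -sum(dnbinom(y, mu = exp(par[1]), size = exp(par[2]), log = TRUE))
}

# Fit by maximum likelihood and return the AIC, or NA if the fit fails.
fit_aic <- function(nll, start, y) {
  tryCatch({
    fit <- optim(start, nll, y = y, method = "BFGS")
    if (fit$convergence != 0) stop("optimizer did not converge")
    2 * fit$value + 2 * length(start)   # AIC = -2 logLik + 2 k
  }, error = function(e) NA_real_)
}

start_mean <- log(mean(counts))
aic <- c(
  Poisson = fit_aic(nll_poisson, start_mean, counts),
  NegBin  = fit_aic(nll_negbin, c(start_mean, 0), counts)
)

# Smaller AIC ranks first; distributions that could not be fit go last.
ranking <- sort(aic, na.last = TRUE)
print(ranking)
```

The ranking is relative only: it orders the candidates that could be fit and says nothing about whether the best-ranked one fits well in absolute terms. Zero-inflated candidates could be added in the same way by supplying their negative log-likelihoods.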

As indicated earlier, the limit estimates in this paper are based on univariate distributions for the particle counts (i.e., intercept-only count generalized linear models) and are suited to estimating limits for one area at a time (e.g., a clean room). The estimation methods described can, however, be extended to count-based regression models with multiple predictors when appropriate covariate information is available (for example, room temperature or moisture level could be included as covariates), allowing appropriate limits to be established for the situation at hand.
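A minimal sketch of such an extension is given below, under loud assumptions: the data are simulated, `temperature` is a hypothetical covariate, and a plain Poisson regression is used for simplicity in place of the over-dispersed and zero-inflated models considered in the paper. The limits are plug-in quantiles of the fitted distribution at chosen covariate values and ignore parameter uncertainty.

```r
# Hypothetical data: counts whose mean increases with room temperature.
set.seed(7)
n <- 200
temperature <- runif(n, min = 18, max = 24)   # hypothetical covariate (deg C)
counts <- rpois(n, lambda = exp(-4 + 0.25 * temperature))

# Count regression with one predictor (log link).
fit <- glm(counts ~ temperature, family = poisson(link = "log"))

# Fitted mean count at two covariate values of interest.
new_conditions <- data.frame(temperature = c(19, 23))
mu_hat <- predict(fit, newdata = new_conditions, type = "response")

# Plug-in 95th and 99th percentile limits of the fitted Poisson distribution.
limit_95 <- qpois(0.95, lambda = mu_hat)
limit_99 <- qpois(0.99, lambda = mu_hat)

print(cbind(new_conditions, mu_hat, limit_95, limit_99))
```

With over-dispersed data, a negative binomial regression (for example `MASS::glm.nb`) with `qnbinom()` quantiles would be the analogous construction.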

Acknowledgement

The authors thank Perceval Sondag of Merck & Co., Inc., Kenilworth, NJ, USA, for his support with questions on Bayesian inference.

References

[1] USP<1116> “Microbiological Evaluation of Clean Rooms and Other Controlled Environments”.

[2] J.D. Wilson, “Setting alert/action limits for environmental monitoring programs”, PDA Journal of Pharmaceutical Science and Technology, 1997, V.51(4), pp. 161-2.

[3] D. Hoffman, “Negative binomial control limits for count data with extra-Poisson variation”, Pharmaceutical Statistics, 2004, V.2(2), pp. 127–132.

[4] D. Lambert, “Zero-inflated Poisson Regression Models with an Application to Defects in Manufacturing”, Technometrics, 1992, V.34(1), pp. 1–14.

[5] H. Yang, W. Zhao, T. O'day and W. Fleming, “Environmental Monitoring: Setting Alert and Action Limits Based on a Zero-Inflated Model”, PDA Journal of Pharmaceutical Science and Technology, 2013, V.67(1), pp. 2-8.

[6] B. Efron and G. Gong, “A Leisurely Look at the Bootstrap, the Jackknife, and Cross-Validation”, The American Statistician, 1983, V.37(1), pp. 36-48.

[7] B. Efron, “The Bootstrap and Modern Statistics”, Journal of the American Statistical Association, 2000, V.95(452), pp. 1293-1296.

[8] C.Z. Mooney and R.D. Duval, “Bootstrapping: a nonparametric approach to statistical inference”, Sage Publications, 1993.

[9] A. Gelman, J.B. Carlin, H.S. Stern and D.B. Rubin, “Bayesian Data Analysis”, Chapman and Hall/CRC, 2004.

[10] D. Gamerman and H.F. Lopes, “Markov chain Monte Carlo: Stochastic simulation for Bayesian inference”, Boca Raton: Chapman and Hall/CRC, 2006.

[11] Stan Development Team, “RStan: the R interface to Stan. R package version 2.19.2”, 2019.

[12] M.D. Hoffman and A. Gelman, “The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo”, Journal of Machine Learning Research, 2014, V.15, pp. 1593-1623.

[13] L. Pronzato and W.G. Müller, “Design of computer experiments: space filling and beyond”, Statistics and Computing, 2012, V.22, pp. 681–701.

[14] L. Xu, A.D. Paterson, W. Turpin and W. Xu, “Assessment and Selection of Competing Models for Zero-Inflated Microbiome Data”, PLOS ONE, DOI:10.1371 (July 6, 2015), pp. 1-30.

[15] C.D. Desjardins, “Evaluating the Performance of Two Competing Models of School Suspension Under Simulation - The Zero-Inflated Negative Binomial and the Negative Binomial Hurdle”, University of Minnesota doctoral dissertation (May, 2013), p. 48.

[16] H. Akaike, “A new look at the statistical model identification”, IEEE Transactions on Automatic Control, 1974, V.19(6), pp. 716–723.

[17] Q.H. Vuong, “Likelihood ratio tests for model selection and non-nested hypotheses”, Econometrica, 1989, V.57, pp. 307–333.

[18] P. Wilson, “The misuse of the Vuong test for non-nested models to test for zero-inflation”, Economics Letters, 2015, V.127, pp. 51-53.

[19] E.C. Merkle, D. You and K.J. Preacher, “Testing non-nested structural equation models”, Psychological Methods, 2016, V.21(2), pp. 151-163.

Appendix A: R Program to Generate Simulated Datasets

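# Specify working folder for files
# (path is not used below; files are read from and written to the current
# working directory)
path <- "C:/ "

# Load "input.param.file.csv" with "Inflation", "Dispersion", and "Mean.Count"
# input parameters provided in first three columns of Table 3.
inputfile <- read.csv("input.param.file.csv", header = TRUE)

n0 <- nrow(inputfile)   # number of parameter combinations
n1 <- 60                # first sample size
n2 <- 300               # second sample size
replicates <- 50        # simulated datasets per combination and sample size

# Pair every parameter combination with both sample sizes.
# Resulting columns: n, Inflation, Dispersion, Mean.Count.
n <- c(rep(n1, n0), rep(n2, n0))
inputfile <- cbind(n, rbind(inputfile, inputfile))

library(VGAM)   # provides rzinegbin()
set.seed(3622)

for (i in 1:(n0 * 2)) {
  for (replicate in 1:replicates) {
    # Draw n zero-inflated negative binomial counts for scenario i
    temp <- rzinegbin(n = inputfile[i, 1], mu = inputfile[i, 4],
                      size = inputfile[i, 3], pstr0 = inputfile[i, 2])

    # Attach the scenario parameters and replicate number to each count
    temp <- data.frame(n = rep(inputfile$n[i], inputfile$n[i]),
                       Inflation = rep(inputfile$Inflation[i], inputfile$n[i]),
                       Dispersion = rep(inputfile$Dispersion[i], inputfile$n[i]),
                       Mean.Count = rep(inputfile$Mean.Count[i], inputfile$n[i]),
                       replicate = rep(replicate, inputfile$n[i]),
                       Count = temp)

    # Append to the accumulated dataset
    if (i == 1 && replicate == 1) {
      simulated_data <- temp
    } else {
      simulated_data <- rbind(simulated_data, temp)
    }
  }
}

write.csv(simulated_data, "simulated_data.csv", row.names = FALSE)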
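As an illustration of how one simulated scenario could be compared against its source distribution limit, the sketch below uses base R only. Everything in it is an assumption made for illustration: the scenario values, the helper names `pzinb` and `qzinb`, the reading of the traditional limit as the empirical percentile of the observed counts, and the choice to average the bootstrap percentiles. The paper's own limit definitions are given in its methods section and may differ.

```r
# Hypothetical ZINB scenario (values chosen for illustration only).
n <- 300; mu <- 10; size <- 2; pstr0 <- 0.3

# ZINB cumulative distribution: a point mass at zero mixed with a
# negative binomial.
pzinb <- function(y, mu, size, pstr0) {
  pstr0 + (1 - pstr0) * pnbinom(y, mu = mu, size = size)
}

# Source distribution limit: smallest count whose CDF reaches probability p.
qzinb <- function(p, mu, size, pstr0) {
  y <- 0
  while (pzinb(y, mu, size, pstr0) < p) y <- y + 1
  y
}

# Simulate one dataset from the scenario.
set.seed(2020)
is_structural_zero <- rbinom(n, 1, pstr0) == 1
counts <- ifelse(is_structural_zero, 0, rnbinom(n, mu = mu, size = size))

source_limit <- qzinb(0.95, mu, size, pstr0)

# Empirical 95th percentile of the observed counts.
traditional_limit <- quantile(counts, probs = 0.95, names = FALSE)

# Non-parametric bootstrap: resample with replacement, take the 95th
# percentile of each resample, and average over resamples.
boot_percentiles <- replicate(
  1000,
  quantile(sample(counts, replace = TRUE), probs = 0.95, names = FALSE)
)
np_boot_limit <- mean(boot_percentiles)

print(c(source = source_limit,
        traditional = traditional_limit,
        np_bootstrap = np_boot_limit))
```

Repeating this over replicates and scenarios would give the kind of limit differences discussed in the conclusions.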
