Robust Estimation of Between and Within Laboratory Standard Deviation With Measurement Results Below The Detection Limit
Steffen Uhlig
QuoData GmbH, Prellerstr. 14, 01309 Dresden, Germany

Abstract In method validation studies, the question arises how to deal with <LOD results in order to ensure reliable precision estimates. It is proposed to elaborate strategies for dealing with <LOD results on the basis of the Q/Hampel method, a particularly robust statistical method. Two different strategies are presented and assessed: censoring <LOD results and setting them equal to LOD/2. The two strategies are then examined in the light of the reliability of the precision estimates obtained on the basis of a comprehensive simulation study. The first method exhibits better precision, while the second method displays better trueness. However, the two methods are only satisfactory as long as the percentage of <LOD results is less than 25 %. For higher percentages of <LOD results, more sophisticated methods should be applied.

Keywords LOD · <LOD results · Method validation · Censoring · Robust statistical methods · Q method · Q/Hampel · Reproducibility · Repeatability

1 Introduction

In method validation interlaboratory studies, different laboratories perform replicate measurements of the same test sample by means of one and the same analytical method (ISO 5725, Parts 1-6). The purpose of such studies, also called collaborative studies, is to investigate the trueness and the precision of the analytical method, whereby the precision between laboratories (referred to as the reproducibility standard deviation) must be distinguished from the precision within laboratories (referred to as the repeatability standard deviation).

In collaborative studies in the area of trace analysis, some of the test samples can have concentrations near the limit of detection (LOD). Due to random variation of measured results, it is not unlikely that some of the results will lie below the LOD of the respective laboratories ("<LOD results").

The question thus arises how to deal with these <LOD results in order to ensure reliable precision (i.e. repeatability and reproducibility) estimates.

In practice, different types of limits are used in analytical chemistry, such as the limit of quantitation, limit of quantification, limit of determination, etc. In this article, LOD will be used as an umbrella term to denote these different limits. Moreover, in practice, peaks are sometimes confused and assigned to the wrong parameter so that it cannot be excluded that individual results having been reported as <LOD actually lie above the LOD. However, for the sake of simplicity, such cases will not be considered here, and it will be assumed that each <LOD result has been correctly assigned to the parameter being determined.

As yet, there exists no international consensus regarding methods to be applied for dealing with <LOD results. One strategy which has found widespread use both in proficiency testing and collaborative studies consists in "censoring" all <LOD results, i.e. omitting them from the computation of the statistical parameters. In order to understand
how this approach affects the estimation of means and variances, a simple example will now be considered.

Consider the 20 laboratory results given in Table 1. Assuming that all 20 laboratories operate under the same LOD of 3.5, it is straightforward to compute the mean, standard deviation (sd) and relative sd on the basis, first, of all 20 results and then of the 17 results above the LOD. The computed values are displayed in Table 2.

Table 1 Example: laboratory results

4.24   4.02   5.76   3.70
3.88   3.71   3.76   4.38
4.65   2.41   2.61   4.19
4.32   5.80   4.06   4.58
2.23   3.62   3.67   4.43

Table 2 Example: statistical parameters

Computed...                Mean   Standard deviation (sd)   Relative sd
...with all the data       4.00   0.91                      0.23
...without <LOD results    4.28   0.65                      0.15

As can be seen, the censoring of <LOD results produces an increase in the estimate of the mean and a decrease in the standard deviation estimate, yielding a reduction of the relative sd by one-third, although only 3 of the 20 lab results are <LOD. Accordingly, it is worthwhile exploring other approaches for dealing with <LOD results.

As has just been seen, the presence of <LOD results will affect the estimations both of means and of standard deviations: means will be overestimated and standard deviations underestimated. In this article, however, the focus will lie on precision parameters. Accordingly, the relative standard error and bias in estimates of reproducibility and repeatability standard deviations will constitute the main criterion for the assessment of the estimation methods.

A complication arises from the fact that, near the LOD, the relative reproducibility standard deviation will be very high. Since negative values are not possible, it can no longer be assumed that laboratory results follow a normal distribution. In such cases, robust statistical methods can be much more efficient than conventional statistical methods such as outlier tests, whose performance is very sensitive to deviations from the normal distribution. For instance, the Grubbs test will fail to identify reported concentrations of zero as outliers if the relative standard deviation is sufficiently high (say, 30 %). For this reason, computations of means and standard deviations reported in this paper will be performed with the robust and highly efficient Q/Hampel method: the Q method for the estimation of repeatability and reproducibility standard deviations and the Hampel estimator for the mean. The Q method is based on the consideration of pairwise absolute differences between laboratory results (Rousseeuw and Croux 1993; Uhlig 1997; Müller and Uhlig 2001) and was adopted in the German standard DIN 38402 A45. A brief overview will be found in the Appendix. The Q/Hampel method is more robust than the algorithm A+S described in ISO 5725-5 and ISO 13528 (Colson 2015). It is implemented in PROLab Plus, a software package for the evaluation of interlaboratory studies.

In the following, the "censoring" method described above will be referred to as method 1. To be specific, method 1 consists in applying the Q/Hampel method to estimate the statistical parameters after exclusion of <LOD results.

An alternative approach to omitting <LOD results from computations consists in setting them equal to a constant, for instance LOD/2, and including them. Other possible choices for the constant are 0 or the LOD itself. This method, with the constant being LOD/2, will be referred to in the following as method 2. Again, the computation of the statistical parameters for method 2 is performed with the Q/Hampel statistical method.

In the following, methods 1 and 2 will be compared on the basis of a simulation which will now be described.

2 Description of the simulation

In order to assess and compare the two methods for dealing with <LOD results, a simulation study was performed in which the reproducibility and repeatability sd estimates sR and sr were computed for each of the two methods. These simulations were run for three different study sizes n = 15, 30, 60, where n denotes the number of laboratories.

In order to obtain results representative of a broad spectrum of circumstances, different scenarios were included. These scenarios were chosen in order to include three basic cases: (1) all laboratories operate under the same LOD; (2) there are two groups of laboratories, each with its own LOD, denoted LOD1 and LOD2, whereby LOD1 is much lower than LOD2; and (3) there are two groups of laboratories, each with its own LOD, denoted LOD1 and LOD2, whereby LOD1 and LOD2 do not lie far apart, the difference
between the two corresponding to two-thirds of the reproducibility standard deviation.

The specifics of the scenarios will now be described in more detail (Table 3). Each scenario is characterized by several parameters: first, the theoretical reproducibility standard deviation (σR = 5 or σR = 30) of the population from which the sample of laboratories is taken; second, the theoretical repeatability standard deviation (σr = 2 or 4 for the case σR = 5 and σr = 12 or 24 for the case σR = 30); third, the percentage of laboratories operating under LOD1 and LOD2; and fourth, the percentage of laboratories within each group having reported <LOD results. The case that all labs share the same LOD is modelled by 100 % of the laboratories operating under LOD1. Accordingly, each row in Table 3 represents two scenarios (corresponding to the two repeatability standard deviations). For each scenario, the mean value is 100.

Independently of whether or not all laboratories share the same LOD, the key characteristic of a particular scenario is the percentage of <LOD results, and this value is displayed in the first column of Table 3. It corresponds to the sum of two products, each product obtained by multiplying the percentage of laboratories operating under LOD1 or LOD2 and the percentage therefrom having reported <LOD1 or <LOD2 results. The values for the theoretical reproducibility and repeatability standard deviations σR and σr are displayed in the second and third column. For each different percentage of <LOD results and each study size there are six different rows, two corresponding to the case that all laboratories share the same LOD, two corresponding to the case that the two LODs lie far apart and two rows for the case that the two LODs lie close to each other. The only exception is the case where the percentage of <LOD results is zero: only four scenarios (i.e. two rows) can be constructed.

Table 3 Description of the scenarios: each row represents two different scenarios (corresponding to the two repeatability standard deviations displayed in the column for σr)

                                 Group of labs with LOD1                 Group of labs with LOD2
<LOD                             Laboratories          Lab results       Laboratories          Lab results
results                          with LOD1             below LOD1        with LOD2             below LOD2
(%)      σR   σr                 (%)           LOD1    (%)               (%)           LOD2    (%)
0        5    2 and 4            0             –       –                 0             –       –
0        30   12 and 24          0             –       –                 0             –       –
12.5     5    2 and 4            100           94      12.5              0             –       0
12.5     30   12 and 24          100           65      12.5              0             –       0
12.5     5    2 and 4            50            Low     0                 50            97      25
12.5     30   12 and 24          50            Low     0                 50            80      25
12.5     5    2 and 4            50            92      6.5               50            96      18.5
12.5     30   12 and 24          50            59      8.6               50            71      16.5
25       5    2 and 4            100           97      25                0             –       0
25       30   12 and 24          100           80      25                0             –       0
25       5    2 and 4            50            Low     0                 50            100     50
25       30   12 and 24          50            Low     0                 50            100     50
25       5    2 and 4            50            95      15.3              50            98      34.7
25       30   12 and 24          50            72      17.5              50            86      32.5
37.5     5    2 and 4            100           98      37.5              0             –       0
37.5     30   12 and 24          100           90      37.5              0             –       0
37.5     5    2 and 4            50            Low     0                 50            103     75
37.5     30   12 and 24          50            Low     0                 50            120     75
37.5     5    2 and 4            50            97      25.6              50            100     49.6
37.5     30   12 and 24          50            82      27.3              50            98      47.7
50       5    2 and 4            100           100     50                0             –       0
50       30   12 and 24          100           100     50                0             –       0
50       5    2 and 4            50            Low     0                 60            105     83.3
50       30   12 and 24          50            Low     0                 60            129     83.3
50       5    2 and 4            50            98      37.2              50            102     62.9
50       30   12 and 24          50            91      38.1              50            109     61.9

For each percentage of <LOD results (computed from the first and third columns for each group of laboratories), there are six different rows, two corresponding to the case that all laboratories share the same LOD, two corresponding to the case that the two LODs lie far apart and two for the case that the two LODs lie close to each other. The only exception is the case where the percentage of <LOD results is zero: only two rows (corresponding to the two theoretical reproducibility standard deviations σR = 5 and σR = 30) are displayed. For each scenario, the true value is 100.

For each method and each scenario, 1000 runs were simulated, each run representing a round of results (e.g. 30 lab results, for the case n = 30), thus yielding 1000 different estimates for each of the two standard deviations. Thus, for the case of the reproducibility, it was possible to compute the mean sR across these 1000 estimates as well as the corresponding empirical standard error, the latter representing a measure of the variability which the 1000 estimates are subject to. The same holds true for the repeatability.

On the basis of these two parameters (i.e. the mean and the standard error across the 1000 estimates), three performance characteristics were then derived. First of all, the relative standard error, denoted RSE, was computed. This performance characteristic is obtained as the ratio of the empirical standard error to the mean, and thus constitutes a measure of the dispersion of the 1000 sR or sr estimates. The second performance characteristic is the relative bias, denoted Br, and is obtained as the difference between the mean sR or sr and the corresponding theoretical sd, relative to the latter. The third performance characteristic is the relative absolute bias, denoted Babs, and is similar to Br, except that the absolute value of the difference between the mean and the theoretical standard deviation is considered. Both Br and Babs constitute measures of the trueness of the 1000 sR or sr estimates. In order to present the results of the simulation in succinct and interpretable form, these three performance characteristics were then averaged over the scenarios corresponding to a particular percentage of <LOD results and study size.

Thus, for example, an average RSE of 30 % means that, on average across the 12 scenarios (4 scenarios for the case 0 % <LOD results), the standard error of the 1000 sR or sr estimates represents 30 % of the mean sR or sr estimate.

Similarly, an average Br of 5 % means that, averaged across the scenarios, the mean of the 1000 sR or sr estimates is 5 % greater than the theoretical value σR or σr.

It is due to the averaging across scenarios that it is useful to consider both the Br and the Babs performance characteristics, since the scenario-specific Br values could mutually cancel each other out, whereas the Babs values alone give no indication as to the direction of the bias.

The results are displayed in Tables 4 and 5. It will be observed that, in addition to the average across scenarios, the range is also given (in brackets).

Let it be noted that for the computation of the repeatability standard deviations, the number of replicates was set to two. It was verified that the number of replicates does not substantially affect the computation of sR for the different scenarios.

3 Assessment and comparison of the two methods

On the basis of the results from the simulation described in the previous section, the two methods will now be compared.

First, the results for the reproducibility standard deviation estimates will be discussed (Table 4).

The first performance characteristic is the average RSE(sR) and constitutes a measure of the uncertainty in the estimate of the reproducibility standard deviation. For the cases where the percentage of <LOD results is less than or equal to 25 %, the two methods obtain comparable results. However, it must be emphasized that for a percentage of <LOD results greater than 25 %, method 2 obtains considerably worse results than method 1 (for n = 30 and n = 60). For both methods, the average RSE(sR) improves as the number of laboratories increases, as expected.

The second performance characteristic, the average relative absolute bias, is a measure of the trueness of the reproducibility estimate. Already for the case n = 15, method 2 obtains much better results than method 1. For the cases n = 30 and n = 60, method 2 considerably outperforms method 1 as long as the percentage of <LOD results does not exceed 25 %.

Turning to the third and last performance characteristic, the average relative bias, it is observed that, in the presence of <LOD results, method 1 consistently obtains a negative relative bias, i.e. the reproducibility standard deviation is underestimated. This is no surprise: the example considered in the introduction had led us to expect this result. However, the magnitude of this negative bias may be considered unacceptable. For method 2, on the other hand, the relative bias remains relatively low and positive as long as the percentage of <LOD results does not exceed 25 %. It should also be noted that, for
percentages of <LOD results less than 25 %, the relative bias of method 2 ranges from -32 % to 34 % (the corresponding ranges for method 1 are even more extensive). For higher percentages of <LOD results, the ranges are larger. This means that the relative bias for both methods very much depends on the scenario, and that at times this bias can be unacceptably large.

In an effort to improve the performance of method 2, it was checked whether discarding the absolute pairwise differences of zero (corresponding to <LOD results set equal to the same value of LOD/2) in the computations of the Q method would lead to an increase in performance. However, if anything, this procedure results in worse performance characteristics for method 2.

The results for the repeatability standard deviations are displayed in Table 5 and are very similar to those of the reproducibility standard deviations. Method 2 either outperforms method 1 or the two methods display similar performance levels as long as the percentage of <LOD results is 25 % or less.

Table 4 Reproducibility standard deviation simulation estimates: average relative standard error, average relative absolute bias and average bias of sR across scenarios, with the corresponding range (in brackets)
Columns: No. of labs; <LOD results (%); average RSE of sR (%); average relative absolute bias of sR (%); average relative bias of sR (%), each reported for method 1 (censoring) and method 2 (<LOD = LOD/2). (Table body not available in this copy.)

Table 5 Repeatability standard deviation simulation estimates: average relative standard error, average relative absolute bias and average bias of sr across scenarios, with the corresponding range (in brackets)
Columns: No. of labs; <LOD results (%); average RSE of sr (%); average relative absolute bias of sr (%); average relative bias of sr (%), each reported for method 1 (censoring) and method 2 (<LOD = LOD/2). (Table body not available in this copy.)

4 Conclusion

In this paper, the influence of results below the LOD on the reproducibility and repeatability standard deviations is examined. One widespread strategy for dealing with such results consists in omitting <LOD results from all computations. However, this approach leads to underestimates of the precision parameters. Accordingly, it is worthwhile exploring other approaches for dealing with <LOD results. In this paper, it is proposed to elaborate strategies for dealing with <LOD results on the basis of the Q/Hampel method, a particularly robust statistical method. Two different strategies are presented: on the one hand, censoring <LOD results and, on the other hand, setting <LOD results equal to LOD/2. The first strategy is to be preferred if the precision of the estimates is the main criterion, while the second should be adopted if the trueness of the estimates is the main concern. However, it must be noted that when the percentage of <LOD results exceeds 25 %, neither method is satisfactory and more sophisticated methods must be applied, for example the one described by Uhlig and Colson (2015, in preparation). Finally, even though the simulations reported in this article represent a broad range of possible scenarios, quite different scenarios can arise in practice which may not have been taken into account here.

Conflict of interest The author declares that there are no conflicts of interest.

Appendix

The computation of reproducibility and repeatability parameters according to the Q method can be broken down into several steps. The first step is the determination of the jump discontinuities of the function

$$
H_1(y) = \frac{1}{\binom{J}{2}} \sum_{1 \le j_1 < j_2 \le J} \frac{1}{n_{j_1} n_{j_2}} \sum_{i_1=1}^{n_{j_1}} \sum_{i_2=1}^{n_{j_2}} \mathbf{1}_{\{|x_{j_1 i_1} - x_{j_2 i_2}| \le y\}}, \tag{1}
$$

where $x_{ji}$ denotes the ith result of laboratory j, $n_j$ denotes the number of replicates for laboratory j and $\mathbf{1}_A$ denotes the indicator function for the set A. J denotes the number of laboratories. The function $H_1$ maps a (positive) real number y to the proportion of pairwise absolute differences between results from different laboratories which are less than or equal to y, i.e. it is an empirical cumulative distribution function. Let $y_1, \ldots, y_k$ denote the jump discontinuities of $H_1$ (in ascending order). A linear interpolation $G_1$ is then computed on the basis of values defined at the jump discontinuities:

$$
G_1(y_i) =
\begin{cases}
0.5\,\bigl(H_1(y_i) + H_1(y_{i-1})\bigr) & \text{for } i \ge 2, \\
0.5\,H_1(y_1) & \text{for } i = 1 \text{ and } y_1 > 0, \\
0 & \text{for } i = 1 \text{ and } y_1 = 0.
\end{cases} \tag{2}
$$

Having defined

$$
q = 0.25 + 0.75\,H_1(0), \tag{3}
$$

the reproducibility sd $s_R$ is obtained as

$$
s_R = \frac{G_1^{-1}(q)}{\sqrt{2}\,\Phi^{-1}(0.5 + 0.5\,q)}, \tag{4}
$$

where $\Phi$ denotes the cumulative distribution function of the standard normal distribution. It should be noted that, if no rounding takes place, the values $H_1(0) = 0$ and $q = 0.25$ are obtained. The denominator in Eq. 4 is defined in such a way that, in the case of a standard normal distribution, $s_R = 1$ will be obtained on average.

For the estimation of the repeatability sd, absolute differences between results within each laboratory constitute the basis for the definition of the empirical cumulative distribution function:

$$
H_2(y) = \frac{1}{J} \sum_{j=1}^{J} \frac{2}{n_j (n_j - 1)} \sum_{1 \le i_1 < i_2 \le n_j} \mathbf{1}_{\{|x_{j i_1} - x_{j i_2}| \le y\}}. \tag{5}
$$
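As a rough illustration of Eqs. 1 to 4, the following Python sketch computes the reproducibility sd from a list of per-laboratory results. The function names q_method_sR and _interp_inverse are hypothetical and chosen only for this example. Two details are assumptions that the Appendix does not spell out: when y1 > 0, G1 is extended linearly down to G1(0) = 0, and the inverse of G1 at q is read off the piecewise-linear interpolation (clamped to the largest jump point). The sketch is not the PROLab Plus implementation and does not handle degenerate input such as all results being identical.

```python
from itertools import combinations
from math import sqrt
from statistics import NormalDist


def _interp_inverse(q, G, ys):
    """Invert the piecewise-linear function through (ys[i], G[i]) at level q."""
    for i in range(1, len(G)):
        if q <= G[i]:
            frac = (q - G[i - 1]) / (G[i] - G[i - 1])
            return ys[i - 1] + frac * (ys[i] - ys[i - 1])
    return ys[-1]  # q lies above the last node: clamp (assumption)


def q_method_sR(results):
    """Sketch of Eqs. 1-4. `results` holds one list of replicates per laboratory."""
    J = len(results)
    lab_pairs = J * (J - 1) / 2
    # Eq. 1: weight of each absolute difference between results of different labs.
    weight_at = {}
    for a, b in combinations(range(J), 2):
        w = 1.0 / (lab_pairs * len(results[a]) * len(results[b]))
        for xa in results[a]:
            for xb in results[b]:
                d = abs(xa - xb)
                weight_at[d] = weight_at.get(d, 0.0) + w
    ys = sorted(weight_at)            # jump discontinuities y_1 < ... < y_k
    H, total = [], 0.0
    for y in ys:
        total += weight_at[y]
        H.append(total)               # H1(y_i)
    # Eq. 2: values of G1 at the jump discontinuities.
    G = [0.0 if ys[0] == 0 else 0.5 * H[0]]
    G += [0.5 * (H[i] + H[i - 1]) for i in range(1, len(ys))]
    # Eq. 3: q depends on the share of zero differences, H1(0).
    h1_at_zero = H[0] if ys[0] == 0 else 0.0
    q = 0.25 + 0.75 * h1_at_zero
    # Assumption: anchor the interpolation at G1(0) = 0 when y_1 > 0.
    if ys[0] > 0:
        ys, G = [0.0] + ys, [0.0] + G
    # Eq. 4: scale the q-quantile of the differences to a standard deviation.
    return _interp_inverse(q, G, ys) / (sqrt(2) * NormalDist().inv_cdf(0.5 + 0.5 * q))


# Usage: three laboratories with one result each.
print(q_method_sR([[0.0], [1.0], [3.0]]))
```

With method 2, several <LOD results set to the same LOD/2 value produce zero differences, which raises H1(0) and therefore q; this is the mechanism behind the check on discarding zero differences mentioned in Section 3.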
Just as in the case of the reproducibility sd, a linear interpolation is then performed to obtain the function $G_2$. The repeatability sd $s_r$ is then obtained in the same way as $s_R$ (see Eq. 4), the only difference being the definition

$$
q = 0.5 + 0.5\,H_2(0). \tag{6}
$$

References

Colson B (2015) The Untold History of Algorithm A. http://quodata.de/en/company/algorithm-a.html
Müller CH, Uhlig S (2001) Estimation of variance components with high breakdown point and high efficiency. Biometrika 88:353–366
PROLab Plus. http://www.quodata.de/en/software/for-interlaboratory-tests/prolab.html; http://www.quodata.de/en
Rousseeuw PJ, Croux C (1993) Alternatives to the median absolute deviation. J Am Statist Assoc 88:1273–1283
Uhlig S (1997) Robuste Schätzung in Kovarianzmodellen mit Anwendung im Qualitätsmanagement und in Finanzmärkten. Habilitationsschrift, Freie Universität Berlin, Berlin. Available at quodata.de/en
Uhlig S, Colson B (2015) Q/Hampel*: A new robust method for interlaboratory studies with high percentages of "less than" values (in preparation). http://quodata.de/en/company/algorithm-a.html