A fast algorithm for 2-D KS two sample tests
Yuanhui Xiao
Given two continuous probability distribution functions $F_1$ and $F_2$ in one-dimensional space, consider the hypothesis
testing problem
$$H_0 : F_1 = F_2 \quad \text{vs.} \quad H_a : F_1 \neq F_2 \qquad (1)$$
based on the samples $\{X_i^1\}_{i=1}^{n_1}$ and $\{X_j^2\}_{j=1}^{n_2}$ from the respective distributions. The classical Kolmogorov–Smirnov test uses
the maximum difference of the empirical distribution functions (or cumulative frequency functions) at the observed values.
Specifically, let $F_{n_k}^k$ ($k = 1, 2$) be the empirical distribution function based on the sample $\{X_t^k\}_{t=1}^{n_k}$, that is,
$$F_{n_k}^k(x) = \frac{\#\{t : X_t^k \le x,\ 1 \le t \le n_k\}}{n_k}, \qquad -\infty < x < \infty, \qquad (2)$$
where $\#$ means ``the number of''. The Kolmogorov–Smirnov test statistic $D_{KS}$ is then computed (up to a constant multiple) as
$$D_{KS} = \max\Big\{\max_{1 \le i \le n_1} |F_{n_1}^1(X_i^1) - F_{n_2}^2(X_i^1)|,\; \max_{1 \le j \le n_2} |F_{n_1}^1(X_j^2) - F_{n_2}^2(X_j^2)|\Big\}. \qquad (3)$$
The value of $D_{KS}$ is often computed by a brute force algorithm, which simply counts the number of sample values that are
less than or equal to $X_i^1$ or $X_j^2$ for each $i = 1, 2, \ldots, n_1$ and $j = 1, 2, \ldots, n_2$. The number of comparisons needed by the brute force
algorithm is $O(n^2)$, where $n = n_1 + n_2$.
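As a point of reference, the brute-force computation of (3) can be sketched in a few lines of C++ (the language of the author's reference implementation); the function below is illustrative only and is not the author's code.

#include <algorithm>
#include <cmath>
#include <vector>

// Brute-force two-sample KS statistic in one dimension: for every observed
// value, count how many points of each sample do not exceed it, and keep the
// largest difference between the two empirical distribution functions.
double ks1d_brute_force(const std::vector<double>& x1,
                        const std::vector<double>& x2) {
    const double n1 = x1.size(), n2 = x2.size();
    auto edf_diff_at = [&](double v) {
        double c1 = std::count_if(x1.begin(), x1.end(),
                                  [v](double x) { return x <= v; });
        double c2 = std::count_if(x2.begin(), x2.end(),
                                  [v](double x) { return x <= v; });
        return std::fabs(c1 / n1 - c2 / n2);
    };
    double d = 0.0;
    for (double v : x1) d = std::max(d, edf_diff_at(v));
    for (double v : x2) d = std::max(d, edf_diff_at(v));
    return d;
}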
However, there exists a faster algorithm. Let $L$ be the least common multiple of $n_1$ and $n_2$, $d_1 = L/n_1$, $d_2 = L/n_2$, and let
$$X'_{(1)} \le X'_{(2)} \le \cdots \le X'_{(n)}$$
be the pooled sample arranged ascendingly. (Throughout this paper we assume, where necessary, that the observed values contain no ties.) Define
$$h_t = L \times \left[F_{n_1}^1(X'_{(t)}) - F_{n_2}^2(X'_{(t)})\right], \qquad t = 1, 2, \ldots, n, \quad h_0 = 0. \qquad (5)$$
These quantities satisfy the recurrence
$$h_t = \begin{cases} h_{t-1} + d_1 & \text{if } X'_{(t)} \text{ comes from the first sample},\\ h_{t-1} - d_2 & \text{if } X'_{(t)} \text{ comes from the second sample}, \end{cases} \qquad t = 1, 2, \ldots, n. \qquad (6)$$
See Burr (1963), Hájek and Šidák (1967) and Xiao et al. (2007). The value of the Kolmogorov–Smirnov test statistic is the
maximum value of $|h_t|/L$ over $1 \le t \le n$:
$$D_{KS} = \max_{1 \le t \le n} |h_t|/L. \qquad (7)$$
If the quick sort method is used, this algorithm needs only $O(n \log_2 n)$ comparisons (Hoare, 1961), making it roughly $n/\log_2 n$ times more efficient than the brute force algorithm. In addition, the use of $L$ speeds the algorithm up further, since all the intermediate results are integers.
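To make the recurrence concrete, the following C++ sketch computes $D_{KS}$ by the method just described: sort the pooled sample, add $d_1$ for each value from the first sample and subtract $d_2$ for each value from the second, and track the largest $|h_t|$. The function name and interface are illustrative, and ties are assumed absent.

#include <algorithm>
#include <cstdint>
#include <numeric>
#include <utility>
#include <vector>

// Fast one-dimensional two-sample KS statistic via the integer recurrence (6):
// walking through the sorted pooled sample, h gains d1 = L/n1 for a value from
// sample 1 and loses d2 = L/n2 for a value from sample 2.  D_KS is the largest
// |h| encountered, divided by L.
double ks1d_fast(const std::vector<double>& x1, const std::vector<double>& x2) {
    const std::int64_t n1 = x1.size(), n2 = x2.size();
    const std::int64_t L = n1 / std::gcd(n1, n2) * n2;   // lcm(n1, n2)
    const std::int64_t d1 = L / n1, d2 = L / n2;

    // Pool the samples, remembering which sample each value came from.
    std::vector<std::pair<double, int>> pooled;
    pooled.reserve(n1 + n2);
    for (double v : x1) pooled.emplace_back(v, 1);
    for (double v : x2) pooled.emplace_back(v, 2);
    std::sort(pooled.begin(), pooled.end());              // O(n log n)

    std::int64_t h = 0, hmax = 0;
    for (const auto& p : pooled) {
        h += (p.second == 1) ? d1 : -d2;
        hmax = std::max(hmax, h < 0 ? -h : h);
    }
    return static_cast<double>(hmax) / static_cast<double>(L);
}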
The generalization of the Kolmogorov–Smirnov test to higher-dimensional probability distributions is a challenge. To
generalize the Kolmogorov–Smirnov test to two-dimensional space, Peacock (1983) proposed a procedure that makes
use of four (rather than just one) pairs of cumulative frequency functions. Denote the two given samples in the plane by
$\{(X_i^k, Y_i^k)\}_{i=1}^{n_k}$, $k = 1, 2$, respectively. The four pairs of cumulative frequency functions used by Peacock's test are given by
$$F_{++}^k(x, y) = \#\{i : X_i^k > x,\ Y_i^k > y,\ 1 \le i \le n_k\}/n_k, \qquad (8)$$
$$F_{+-}^k(x, y) = \#\{i : X_i^k > x,\ Y_i^k \le y,\ 1 \le i \le n_k\}/n_k, \qquad (9)$$
$$F_{-+}^k(x, y) = \#\{i : X_i^k \le x,\ Y_i^k > y,\ 1 \le i \le n_k\}/n_k, \qquad (10)$$
and
$$F_{--}^k(x, y) = \#\{i : X_i^k \le x,\ Y_i^k \le y,\ 1 \le i \le n_k\}/n_k, \qquad (11)$$
where $-\infty < x, y < \infty$ and $k = 1, 2$. Let $\{X_t^0 : t = 1, 2, \ldots, n\}$ be the pooled data set consisting of the values of
the $X$-components of the given samples and $\{Y_t^0 : t = 1, 2, \ldots, n\}$ the pooled data set consisting of the values of the
$Y$-components of the given samples. Define
$$D_{++} \stackrel{\text{def}}{=} \max_{1 \le s \le n,\, 1 \le t \le n} |F_{++}^1(X_s^0, Y_t^0) - F_{++}^2(X_s^0, Y_t^0)|, \qquad (12)$$
$$D_{+-} \stackrel{\text{def}}{=} \max_{1 \le s \le n,\, 1 \le t \le n} |F_{+-}^1(X_s^0, Y_t^0) - F_{+-}^2(X_s^0, Y_t^0)|, \qquad (13)$$
$$D_{-+} \stackrel{\text{def}}{=} \max_{1 \le s \le n,\, 1 \le t \le n} |F_{-+}^1(X_s^0, Y_t^0) - F_{-+}^2(X_s^0, Y_t^0)|, \qquad (14)$$
and
$$D_{--} \stackrel{\text{def}}{=} \max_{1 \le s \le n,\, 1 \le t \le n} |F_{--}^1(X_s^0, Y_t^0) - F_{--}^2(X_s^0, Y_t^0)|. \qquad (15)$$
Let $\{(X'_{(t)}, Y'_t) : 1 \le t \le n\}$ denote the pooled sample sorted ascendingly by the values of the $X$-components of the data points and $\{(X'_t, Y'_{(t)}) : 1 \le t \le n\}$ the pooled sample sorted ascendingly by the values of the $Y$-components of the data points. Please notice that any point
$(X_s^0, Y_t^0)$, $X_s^0$ being an $X$-value in the pooled sample and $Y_t^0$ being a $Y$-value in the pooled sample, respectively, can be
expressed as $(X'_{(u)}, Y'_{(v)})$ for some $1 \le u \le n$ and $1 \le v \le n$. Therefore, we can re-write the expression of $D_{--}$ as
$$D_{--} = \max_{1 \le v \le n} D_{--}^v,$$
where
$$D_{--}^v \stackrel{\text{def}}{=} \max_{1 \le t \le n} |F_{--}^1(X'_{(t)}, Y'_{(v)}) - F_{--}^2(X'_{(t)}, Y'_{(v)})|.$$
(Similar expressions can be found for $D_{++}$, $D_{+-}$, $D_{-+}$.) Similar to (5), define for each $v = 1, 2, \ldots, n$,
$$h_t^v = L \times \left[F_{--}^1(X'_{(t)}, Y'_{(v)}) - F_{--}^2(X'_{(t)}, Y'_{(v)})\right], \qquad t = 0, 1, \ldots, n, \qquad (19)$$
then
$$D_{--}^v = \max_{1 \le t \le n} |h_t^v|/L, \qquad v = 1, 2, \ldots, n. \qquad (20)$$
Let $S_k$ be the set of observed sample points of sample $k$, $k = 1, 2$, respectively. Then we can rewrite $h_t^v$ as
$$h_t^v = L\left[\frac{\#\{i : X'_i \le X'_{(t)},\ Y'_{(i)} \le Y'_{(v)},\ (X'_i, Y'_{(i)}) \in S_1\}}{n_1} - \frac{\#\{i : X'_i \le X'_{(t)},\ Y'_{(i)} \le Y'_{(v)},\ (X'_i, Y'_{(i)}) \in S_2\}}{n_2}\right]$$
$$\equiv L\left[\frac{\#\{i : X'_i \le X'_{(t)},\ 1 \le i \le v,\ (X'_i, Y'_{(i)}) \in S_1\}}{n_1} - \frac{\#\{i : X'_i \le X'_{(t)},\ 1 \le i \le v,\ (X'_i, Y'_{(i)}) \in S_2\}}{n_2}\right]. \qquad (21)$$
Now we can easily derive the following recurrence (which is very similar to (6)):
$$h_t^v = \begin{cases} h_t^{v-1} & \text{if } X'_v > X'_{(t)},\\ h_t^{v-1} + d_1 & \text{if } X'_v \le X'_{(t)} \text{ and } (X'_v, Y'_{(v)}) \in S_1,\\ h_t^{v-1} - d_2 & \text{if } X'_v \le X'_{(t)} \text{ and } (X'_v, Y'_{(v)}) \in S_2. \end{cases} \qquad (22)$$
This recurrence is obtained by sorting the pooled sample according to the values of the Y -components of the sample points.
Similarly, by sorting the pooled sample according to the values of the X -components of the sample points, we can obtain
the other recurrence:
$$h_t^v = \begin{cases} h_{t-1}^v & \text{if } Y'_t > Y'_{(v)},\\ h_{t-1}^v + d_1 & \text{if } Y'_t \le Y'_{(v)} \text{ and } (X'_{(t)}, Y'_t) \in S_1,\\ h_{t-1}^v - d_2 & \text{if } Y'_t \le Y'_{(v)} \text{ and } (X'_{(t)}, Y'_t) \in S_2. \end{cases} \qquad (23)$$
(Set $h_t^0 = h_0^v = 0$ for all $v, t = 0, 1, 2, \ldots, n$.) Thus, $h_t^v$ can be computed via the following double for-loop:
Algorithm 1:
for (t = 1; t ≤ n; t ← t + 1) do
  Set $h_t^0 = 0$;
  for (v = 1; v ≤ n; v ← v + 1) do
    Compute $h_t^v$ via (22)
  end for
end for
Note that the inner $v$-loops corresponding to different $t$-values are completely independent of each other. This fact has two
implications. First, we can interchange $t$ and $v$ to obtain another double for-loop for computing $h_t^v$. Second, the algorithm is naturally parallel.
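For concreteness, here is a C++ sketch of Algorithm 1 restricted to the $D_{--}$ component, assuming no ties; it applies recurrence (22) literally, so it needs $O(n^2)$ comparisons after two $O(n \log_2 n)$ sorts. The type and function names are illustrative.

#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

struct Point { double x, y; int sample; };   // sample is 1 or 2

// D_{--} of Peacock's test via Algorithm 1: for each threshold X'_(t) the
// inner loop walks the pooled sample in increasing order of Y and applies
// recurrence (22): h gains d1 (sample 1) or loses d2 (sample 2) whenever the
// current point's X-value does not exceed X'_(t).
double peacock_d_minus_minus(std::vector<Point> pooled,
                             std::int64_t n1, std::int64_t n2) {
    const std::int64_t L = n1 / std::gcd(n1, n2) * n2;   // lcm(n1, n2)
    const std::int64_t d1 = L / n1, d2 = L / n2;

    // X-order statistics X'_(1) <= ... <= X'_(n) of the pooled sample.
    std::vector<double> xs;
    xs.reserve(pooled.size());
    for (const Point& p : pooled) xs.push_back(p.x);
    std::sort(xs.begin(), xs.end());

    // Pooled sample sorted ascendingly by the Y-components.
    std::sort(pooled.begin(), pooled.end(),
              [](const Point& a, const Point& b) { return a.y < b.y; });

    std::int64_t hmax = 0;
    for (double xt : xs) {                 // t-loop: threshold X'_(t)
        std::int64_t h = 0;                // h_t^0 = 0
        for (const Point& p : pooled) {    // v-loop over Y-sorted points
            if (p.x <= xt) h += (p.sample == 1) ? d1 : -d2;
            hmax = std::max(hmax, h < 0 ? -h : h);
        }
    }
    return static_cast<double>(hmax) / static_cast<double>(L);
}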
We can also devise similar recurrences to evaluate $D_{-+}$, $D_{+-}$ and $D_{++}$. However, we will propose a different approach.
Denote $h_t^v$ by $h_{--}^{t,v}$ and let
$$h_{++}^{t,v} \stackrel{\text{def}}{=} L \times \left[F_{++}^1(X'_{(t)}, Y'_{(v)}) - F_{++}^2(X'_{(t)}, Y'_{(v)})\right], \qquad (24)$$
$$h_{+-}^{t,v} \stackrel{\text{def}}{=} L \times \left[F_{+-}^1(X'_{(t)}, Y'_{(v)}) - F_{+-}^2(X'_{(t)}, Y'_{(v)})\right], \qquad (25)$$
and
$$h_{-+}^{t,v} \stackrel{\text{def}}{=} L \times \left[F_{-+}^1(X'_{(t)}, Y'_{(v)}) - F_{-+}^2(X'_{(t)}, Y'_{(v)})\right], \qquad (26)$$
then
$$\max_{1 \le v \le n} D_{--}^v = \max\Big\{\max_{1 \le t \le n} |h_{--}^{t,v}| : 1 \le v \le n\Big\}/L, \qquad (27)$$
$$\max_{1 \le v \le n} D_{+-}^v = \max\Big\{\max_{1 \le t \le n} |h_{+-}^{t,v}| : 1 \le v \le n\Big\}/L, \qquad (28)$$
$$\max_{1 \le v \le n} D_{-+}^v = \max\Big\{\max_{1 \le t \le n} |h_{-+}^{t,v}| : 1 \le v \le n\Big\}/L, \qquad (29)$$
and
$$\max_{1 \le v \le n} D_{++}^v = \max\Big\{\max_{1 \le t \le n} |h_{++}^{t,v}| : 1 \le v \le n\Big\}/L. \qquad (30)$$
Algorithm 2:
Set KS = 0;
for (t = 1; t ≤ n; t ← t + 1) do
  Set $h_t^0 = 0$;
  for (v = 1; v ≤ n; v ← v + 1) do
    Compute $h_t^v$, i.e. $h_{--}^{t,v}$, via (22);
    Compute $h_{-+}^{t,v}$, $h_{+-}^{t,v}$ and $h_{++}^{t,v}$ via (38), (39) and (40), respectively;
    if KS < max{$|h_{--}^{t,v}|$, $|h_{-+}^{t,v}|$, $|h_{+-}^{t,v}|$, $|h_{++}^{t,v}|$} then
      Set KS = max{$|h_{--}^{t,v}|$, $|h_{-+}^{t,v}|$, $|h_{+-}^{t,v}|$, $|h_{++}^{t,v}|$};
    end if
  end for
end for
return KS/L.
Please be reminded that both $L\left(N_X^1(X'_{(t)}) - N_X^2(X'_{(t)})\right)$ and $L\left(N_Y^1(Y'_{(v)}) - N_Y^2(Y'_{(v)})\right)$ can be evaluated via recurrences similar to
(6). This algorithm is parallel. Its computational efficiency will be discussed later.
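One simple way to obtain the remaining three differences from $h_{--}^{t,v}$ and the marginal counts $N_X^k$ and $N_Y^k$ is the inclusion-exclusion identities $F_{-+} = F_X - F_{--}$, $F_{+-} = F_Y - F_{--}$ and $F_{++} = 1 - F_X - F_Y + F_{--}$. The C++ helper below sketches this idea; the identities and all names here are an assumption of this sketch and are not necessarily the exact form of the recurrences (38)-(40).

#include <cstdint>

// Hypothetical helper: combines
//   h_mm = h_{--}^{t,v}  (recurrence (22)),
//   ax   = L * [N_X^1(X'_(t))/n1 - N_X^2(X'_(t))/n2]  (1-D recurrence on the X-values),
//   by   = L * [N_Y^1(Y'_(v))/n1 - N_Y^2(Y'_(v))/n2]  (1-D recurrence on the Y-values),
// using the inclusion-exclusion identities
//   F_{-+} = F_X - F_{--},  F_{+-} = F_Y - F_{--},  F_{++} = 1 - F_X - F_Y + F_{--}.
struct QuadrantDiffs { std::int64_t mm, mp, pm, pp; };

QuadrantDiffs quadrant_diffs(std::int64_t h_mm, std::int64_t ax, std::int64_t by) {
    QuadrantDiffs q;
    q.mm = h_mm;              // h_{--}^{t,v}
    q.mp = ax - h_mm;         // h_{-+}^{t,v}
    q.pm = by - h_mm;         // h_{+-}^{t,v}
    q.pp = h_mm - ax - by;    // h_{++}^{t,v}
    return q;
}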
Similarly, we can develop an efficient algorithm for the F&F test (Fasano and Franceschini, 1987), which is widely implemented using the brute force
algorithm. Define
$$d_t^{--} = L\left[F_{--}^1(X'_t, Y'_{(t)}) - F_{--}^2(X'_t, Y'_{(t)})\right], \qquad (42)$$
$$d_t^{-+} = L\left[F_{-+}^1(X'_t, Y'_{(t)}) - F_{-+}^2(X'_t, Y'_{(t)})\right], \qquad (43)$$
$$d_t^{+-} = L\left[F_{+-}^1(X'_t, Y'_{(t)}) - F_{+-}^2(X'_t, Y'_{(t)})\right], \qquad (44)$$
and
$$d_t^{++} = L\left[F_{++}^1(X'_t, Y'_{(t)}) - F_{++}^2(X'_t, Y'_{(t)})\right], \qquad (45)$$
where $*$ stands for $+$ or $-$; the F&F test statistic is then obtained from the maxima of $|d_t^{**}|/L$ over $1 \le t \le n$. Here $d_{(0)}^{--}$ is set to be 0, and clearly $d_t^{--} = d_{(t)}^{--}$. By using the recurrence (50), the total number of comparisons needed to
compute all the $d_t^{--}$'s is about $n(n+1)/2$, about half of the number of comparisons needed by the brute force algorithm, which is $O(n^2)$.
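For comparison, a brute-force evaluation of the quantities in (42)-(45) is easy to write down. The C++ sketch below anchors the four quadrants at every pooled sample point and keeps the largest absolute difference between the two samples' quadrant frequencies; taking this overall maximum as the final F&F statistic is an assumption of the sketch, and the names are illustrative.

#include <algorithm>
#include <cmath>
#include <vector>

struct Obs { double x, y; int sample; };   // sample is 1 or 2

// Brute-force quadrant comparison for an F&F-type statistic: anchor the four
// quadrants at each pooled sample point, count the fraction of each sample
// falling in every quadrant, and record the largest absolute difference.
// Taking the overall maximum as the statistic is an assumption of this sketch.
double ff_brute_force(const std::vector<Obs>& pooled, double n1, double n2) {
    double d = 0.0;
    for (const Obs& a : pooled) {
        double c1[2][2] = {{0, 0}, {0, 0}};   // sample-1 quadrant counts
        double c2[2][2] = {{0, 0}, {0, 0}};   // sample-2 quadrant counts
        for (const Obs& b : pooled) {
            int i = (b.x <= a.x) ? 0 : 1;     // 0: X <= x, 1: X > x
            int j = (b.y <= a.y) ? 0 : 1;     // 0: Y <= y, 1: Y > y
            if (b.sample == 1) c1[i][j] += 1.0; else c2[i][j] += 1.0;
        }
        for (int i = 0; i < 2; ++i)
            for (int j = 0; j < 2; ++j)
                d = std::max(d, std::fabs(c1[i][j] / n1 - c2[i][j] / n2));
    }
    return d;
}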
It is possible to generalize Peacock's test to higher-dimensional spaces. However, it demands using $2^d$ pairs of cumulative
distribution functions and computing the maximum absolute differences of each pair of cumulative functions at $n^d$ points
in a $d$-dimensional space, where $d \ge 2$. So the workload of the computing process may be prohibitively heavy if the brute
force algorithm is used. Fortunately, the proposed algorithm can be generalized to higher-dimensional spaces. To illustrate,
it is sufficient to show how to generalize the recurrence (22) or (23) to the 3-dimensional case.
Denote the two given samples in a 3-dimensional space by $\{(X_i^k, Y_i^k, Z_i^k)\}_{i=1}^{n_k}$, $k = 1, 2$. The pooled samples sorted
ascendingly by the values of the $X$-components, the $Y$-components and the $Z$-components will be denoted
by $\{(X'_{(t)}, Y'_t, Z'_t) : 1 \le t \le n\}$, $\{(X'_t, Y'_{(t)}, Z'_t) : 1 \le t \le n\}$ and $\{(X'_t, Y'_t, Z'_{(t)}) : 1 \le t \le n\}$, respectively. Then
$$h_{---}^{u,v,w} \stackrel{\text{def}}{=} L \times \left[F_{---}^1(X'_{(u)}, Y'_{(v)}, Z'_{(w)}) - F_{---}^2(X'_{(u)}, Y'_{(v)}, Z'_{(w)})\right] \quad \text{(similar to (19))}$$
$$= L \times \Big[\#\{t : X'_t \le X'_{(u)},\ Y'_t \le Y'_{(v)},\ Z'_{(t)} \le Z'_{(w)},\ (X'_t, Y'_t, Z'_{(t)}) \in S_1\}/n_1$$
$$\qquad - \#\{t : X'_t \le X'_{(u)},\ Y'_t \le Y'_{(v)},\ Z'_{(t)} \le Z'_{(w)},\ (X'_t, Y'_t, Z'_{(t)}) \in S_2\}/n_2\Big]$$
Algorithm 3:
for (u = 1; u ≤ n; u ← u + 1) do
  for (v = 1; v ≤ n; v ← v + 1) do
    Set $h_{---}^{u,v,0} = 0$;
    for (w = 1; w ≤ n; w ← w + 1) do
      Compute $h_{---}^{u,v,w}$ via the recurrence (52)
    end for
  end for
end for
The algorithm for the 3-dimensional Kolmogorov–Smirnov test is considerably more involved than in the 2-dimensional
case, but the reader should be able to work out the details.
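As an illustration of what those details might look like, the following C++ sketch computes the $---$ component in three dimensions, assuming the recurrence (52) takes the natural analogue of (22): walking the pooled sample in increasing order of the $Z$-values, $h$ gains $d_1$ (sample 1) or loses $d_2$ (sample 2) whenever the current point also satisfies the $X$- and $Y$-thresholds. The names are illustrative and ties are assumed absent.

#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

struct Point3 { double x, y, z; int sample; };   // sample is 1 or 2

// Assumed 3-D analogue of recurrence (22): walking the pooled sample in
// increasing order of Z, h gains d1 (sample 1) or loses d2 (sample 2)
// whenever the current point also satisfies the X- and Y-thresholds.
double peacock3d_d_mmm(std::vector<Point3> pooled,
                       std::int64_t n1, std::int64_t n2) {
    const std::int64_t L = n1 / std::gcd(n1, n2) * n2;   // lcm(n1, n2)
    const std::int64_t d1 = L / n1, d2 = L / n2;

    std::vector<double> xs, ys;
    for (const Point3& p : pooled) { xs.push_back(p.x); ys.push_back(p.y); }
    std::sort(xs.begin(), xs.end());                     // X'_(u)
    std::sort(ys.begin(), ys.end());                     // Y'_(v)
    std::sort(pooled.begin(), pooled.end(),              // ascending Z
              [](const Point3& a, const Point3& b) { return a.z < b.z; });

    std::int64_t hmax = 0;
    for (double xu : xs) {                               // u-loop
        for (double yv : ys) {                           // v-loop
            std::int64_t h = 0;                          // h^{u,v,0} = 0
            for (const Point3& p : pooled) {             // w-loop (Z order)
                if (p.x <= xu && p.y <= yv) {
                    h += (p.sample == 1) ? d1 : -d2;
                    hmax = std::max(hmax, h < 0 ? -h : h);
                }
            }
        }
    }
    return static_cast<double>(hmax) / static_cast<double>(L);
}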
It is interesting to compare the computational efficiency of the proposed algorithm with that of the brute force algorithms
for both the F&F test and the Peacock test in a $d$-dimensional space, where $d \ge 2$. We compare the numbers of comparisons
needed by each algorithm, since numerical comparisons are the major operations that consume CPU time in these
algorithms. Throughout the rest of the paper, we will call an algorithm an $N$ algorithm if it needs $N$ comparisons.
Algorithm 1 needs $O(n^2)$ comparisons once the pooled sample has been sorted by the values of the $X$-coordinates and by those of
the $Y$-coordinates, respectively. The quick sort algorithm takes $O(n \log_2 n)$ comparisons to sort the pooled samples (Hoare,
1961). Therefore, Algorithm 1 is an $O(n^2)$ algorithm. Consequently, Algorithm 2, the proposed algorithm for the Peacock test
in 2-dimensional spaces, is $O(n^2)$. This conclusion can be generalized to higher-dimensional spaces. In fact, the proposed
algorithm is $O(n^d)$ in a $d$-dimensional space. Hence, the proposed algorithm is $O(n)$ times more efficient than the brute
force algorithm for the Peacock test, which is $O(n^{d+1})$. The range-counting method can be used to speed up the brute force
algorithm for the Peacock test. The resulting algorithm is $O(n^2 \log n)$ in two-dimensional spaces (Lopes et al., 2007), which
is still $O(\log n)$ times less efficient than the proposed algorithm. Therefore, there is not sufficient evidence to believe that it
would be more efficient than the proposed algorithm for the Peacock test in higher-dimensional spaces.
The brute force algorithm for the F&F test is $O(n^2)$, regardless of the value of $d$. Hence it is $O(n^{d-2})$ times more efficient
than the proposed algorithm if $d > 2$. In fact, even when $d = 2$, a careful examination of both algorithms shows that the former
needs fewer comparisons than the proposed algorithm. The computing process for the F&F test can be made even faster by
using the range-counting method or recurrences similar to (50). However, the cost of using the F&F test is that it does not
use the full information in the data.
In order to conduct a numerical study, the author implemented the proposed algorithm and the brute force algorithms for
both the Peacock test and the F&F test in C++. (An R package that implements the proposed algorithms for the Peacock
test in both two- and three-dimensional spaces is publicly available as a contributed package at any of the CRAN mirrors listed
at https://cran.r-project.org/mirrors.html.) The algorithm for the F&F test is improved by using recurrences similar to
(50), which can be verified to be approximately four times more efficient than the proposed algorithm for the Peacock test
in two-dimensional spaces. From the numerical study, the author observed that the proposed algorithm for the Peacock
test completed the computation on an ordinary PC in 1 s when both sample sizes are $n_1 = n_2 = 1000$ and $d = 2$,
while the brute force algorithm used 656 s. The time needed for the F&F test is negligible. In the case $d = 3$, the proposed
algorithm took 8817 s (almost two and a half hours) to complete the computation, while the brute force algorithm
would need many days. The algorithm for the F&F test completed the computation in 1 s. When the sample sizes
become $n_1 = n_2 = 10{,}000$, in the 2-dimensional case the proposed algorithm for the Peacock test completed the computation
in 129 s, while the improved algorithm for the F&F test used only 32 s. The brute force algorithm would not be
able to complete the computation in a few days. In the 3-dimensional case, the algorithm for the F&F test completed the
computation in 74 s. No other algorithm would be able to complete the computation in a few days.
Acknowledgments
The author would like to thank the associate editor and the two reviewers for their valuable comments and suggestions.
References
Burr, E.J., 1963. Small-sample distribution of the two-sample Cramér–von Mises criterion for small equal samples. Ann. Math. Statist. 34 (1), 95–101.
Fasano, G., Franceschini, A., 1987. A multidimensional version of the Kolmogorov–Smirnov test. Mon. Not. R. Astron. Soc. 225, 155–170.
Hájek, J., Šidák, Z., 1967. Theory of Rank Tests. Academic Press, New York.
Hoare, C.A.R., 1961. Algorithm 64: Quicksort. Commun. ACM 4 (7), 321. http://dx.doi.org/10.1145/366622.366644.
Lopes, R.H.C., Reid, I., Hobson, P.R., 2007. The two-dimensional Kolmogorov–Smirnov test. In: XI International Workshop on Advanced Computing and Analysis Techniques in Physical Research, Amsterdam, The Netherlands.
Peacock, J.A., 1983. Two-dimensional goodness-of-fit testing in astronomy. Mon. Not. R. Astron. Soc. 202, 615–627.
Xiao, Y., Gordon, A., Yakovlev, A., 2007. A C++ program for the Cramér–von Mises two-sample test. J. Stat. Softw. 17 (8).