
Computational Statistics and Data Analysis 105 (2017) 53–58


journal homepage: www.elsevier.com/locate/csda

A fast algorithm for two-dimensional Kolmogorov–Smirnov two sample tests

Yuanhui Xiao

Department of Mathematics and Statistics, Mississippi State University, Mississippi State, MS 39762, United States

article info

Article history:
Received 31 December 2015
Received in revised form 14 July 2016
Accepted 19 July 2016
Available online 30 July 2016

Keywords:
Kolmogorov–Smirnov test
Brute force algorithm

abstract

By using the brute force algorithm, the application of the two-dimensional two-sample Kolmogorov–Smirnov test can be prohibitively computationally expensive. Thus a fast algorithm for computing the two-sample Kolmogorov–Smirnov test statistic is proposed to alleviate this problem. The newly proposed algorithm is O(n) times more efficient than the brute force algorithm, where n is the sum of the two sample sizes. The proposed algorithm is parallel and can be generalized to higher dimensional spaces.

© 2016 Elsevier B.V. All rights reserved.

1. A fast algorithm for one-dimensional Kolmogorov–Smirnov test

Given two continuous probability distribution functions F_1 and F_2 in one-dimensional space, consider the hypothesis test problem

    H_0 : F_1 = F_2  vs.  H_a : F_1 ≠ F_2    (1)

based on the samples {X_i^1}_{i=1}^{n_1} and {X_j^2}_{j=1}^{n_2} from the respective distributions. The classical Kolmogorov–Smirnov test uses the maximum difference of the empirical distribution functions (or cumulative frequency functions) at the observed values. Specifically, let F_{n_k}^k (k = 1, 2) be the empirical distribution function based on the sample {X_t^k}_{t=1}^{n_k}, that is,

    F_{n_k}^k(x) = #{t : X_t^k ≤ x, 1 ≤ t ≤ n_k} / n_k,  −∞ < x < ∞,    (2)

where # means ''the number of''. The Kolmogorov–Smirnov test statistic D_KS is then computed as (up to a multiple)

    D_KS = max{ max_{1≤i≤n_1} |F_{n_1}^1(X_i^1) − F_{n_2}^2(X_i^1)|,  max_{1≤j≤n_2} |F_{n_1}^1(X_j^2) − F_{n_2}^2(X_j^2)| }.    (3)

The value of D_KS is often computed by a brute force algorithm, which simply counts the number of sample values that are less than or equal to X_i^1 or X_j^2 for each i = 1, 2, . . . , n_1 and j = 1, 2, . . . , n_2. The number of comparisons needed by the brute force algorithm is O(n²), where n = n_1 + n_2.
However, there exists a faster algorithm. Let L be the least common multiple of n_1 and n_2, d_1 = L/n_1, d_2 = L/n_2, and let

    {X_{(t)}^0 : 1 ≤ t ≤ n} = {X_{(1)}^0 ≤ X_{(2)}^0 ≤ · · · ≤ X_{(n)}^0}    (4)

be the pooled sample arranged ascendingly. (Throughout this paper we assume all the observed values have no ties when necessary.) Define

    h_t = L × [F_{n_1}^1(X_{(t)}^0) − F_{n_2}^2(X_{(t)}^0)],  0 ≤ t ≤ n.    (5)

E-mail address: [email protected].
http://dx.doi.org/10.1016/j.csda.2016.07.014


The value of h_0 is set to be 0. The reader can easily verify the following recurrence:

    h_t = h_{t−1} + d_1  if X_{(t)}^0 = X_i^1 for some i,
        = h_{t−1} − d_2  if X_{(t)}^0 = X_j^2 for some j.    (6)

See Burr (1963), Hájek and Šidák (1967) and Xiao et al. (2007). The value of the Kolmogorov–Smirnov test statistic is the maximum value of |h_t|/L over 0 ≤ t ≤ n:

    D_KS = max_{0≤t≤n} |h_t| / L.    (7)

If the quicksort method is used, this algorithm needs only O(n log_2 n) comparisons (Hoare, 1961), which is O(n) times more efficient than the brute force algorithm. In addition, the use of L further speeds up the algorithm, since all the intermediate results are integers.

2. Generalization to two-dimensional spaces

The generalization of the Kolmogorov–Smirnov test to high dimensional probability distributions is a challenge. To generalize the Kolmogorov–Smirnov test to two-dimensional space, Peacock (1983) proposed a procedure which makes use of four (rather than just one) pairs of cumulative frequency functions. Denote the two given samples in a plane by {(X_i^k, Y_i^k)}_{i=1}^{n_k}, k = 1, 2, respectively. The four pairs of cumulative frequency functions used by Peacock's test are given by

    F_{++}^k(x, y) = #{i : X_i^k > x, Y_i^k > y, 1 ≤ i ≤ n_k} / n_k,    (8)
    F_{+−}^k(x, y) = #{i : X_i^k > x, Y_i^k ≤ y, 1 ≤ i ≤ n_k} / n_k,    (9)
    F_{−+}^k(x, y) = #{i : X_i^k ≤ x, Y_i^k > y, 1 ≤ i ≤ n_k} / n_k,    (10)

and

    F_{−−}^k(x, y) = #{i : X_i^k ≤ x, Y_i^k ≤ y, 1 ≤ i ≤ n_k} / n_k,    (11)

where −∞ < x, y < ∞ and k = 1, 2. Let {X_t^0 : t = 1, 2, . . . , n} be the pooled data set consisting of the values of the X-components of the given samples and {Y_t^0 : t = 1, 2, . . . , n} the pooled data set consisting of the values of the Y-components of the given samples. Define
    D_{++} := max_{1≤s≤n, 1≤t≤n} |F_{++}^1(X_s^0, Y_t^0) − F_{++}^2(X_s^0, Y_t^0)|,    (12)
    D_{+−} := max_{1≤s≤n, 1≤t≤n} |F_{+−}^1(X_s^0, Y_t^0) − F_{+−}^2(X_s^0, Y_t^0)|,    (13)
    D_{−+} := max_{1≤s≤n, 1≤t≤n} |F_{−+}^1(X_s^0, Y_t^0) − F_{−+}^2(X_s^0, Y_t^0)|,    (14)

and

    D_{−−} := max_{1≤s≤n, 1≤t≤n} |F_{−−}^1(X_s^0, Y_t^0) − F_{−−}^2(X_s^0, Y_t^0)|.    (15)

Peacock's test is then defined as

    D_2DKS = max{D_{++}, D_{+−}, D_{−+}, D_{−−}}.    (16)
The test is often performed by a brute force algorithm, and its application is very expensive in terms of computing time unless the sample sizes n_1 and n_2 are very small. Indeed, to compute the value of D_{−−}, we need to compute the value of the difference of the cumulative frequency functions F_{−−}^1 and F_{−−}^2 at all the n² pairs (X_s, Y_t), X_s and Y_t being coordinates of any pairs in the given samples. It needs O(n) comparisons to compute the value of the difference of the cumulative frequency functions F_{−−}^1 and F_{−−}^2 at a single point. Thus, it takes O(n³) comparisons to compute the value of D_{−−}. Similar conclusions can be made for D_{++}, D_{+−}, D_{−+}.
To alleviate the problem, Fasano and Franceschini (1987, F&F for short) revised Peacock's test by comparing the cumulative frequency functions at the observed sample points only, so the number of comparisons needed is only O(n²). The F&F test is widely used in practice. But it is a variant of Peacock's test, a different approach in essence.
In fact, there exists a fast algorithm for evaluating the value of Peacock's test statistic. Denote by {(X_{(t)}', Y_t') : 1 ≤ t ≤ n} the pooled sample sorted ascendingly by the values of the X-components of the data points, and by {(X_t', Y_{(t)}') : 1 ≤ t ≤ n} the pooled sample sorted ascendingly by the values of the Y-components of the data points. Please notice that any point (X_s^0, Y_t^0), X_s^0 being an X-value in the pooled sample and Y_t^0 being a Y-value in the pooled sample, respectively, can be expressed as (X_{(u)}', Y_{(v)}') for some 1 ≤ u ≤ n and 1 ≤ v ≤ n. Therefore, we can re-write the expression of D_{−−} as

    D_{−−} = max_{1≤u≤n} max_{1≤v≤n} |F_{−−}^1(X_{(u)}', Y_{(v)}') − F_{−−}^2(X_{(u)}', Y_{(v)}')| = max_{1≤v≤n} D_{−−}^v,    (17)

where

    D_{−−}^v = max_{1≤u≤n} |F_{−−}^1(X_{(u)}', Y_{(v)}') − F_{−−}^2(X_{(u)}', Y_{(v)}')|.    (18)
(Similar expressions can be found for D_{++}, D_{+−}, D_{−+}.) Similar to (5), define for each v = 1, 2, . . . , n,

    h_t^v = L × [F_{−−}^1(X_{(t)}', Y_{(v)}') − F_{−−}^2(X_{(t)}', Y_{(v)}')],  t = 0, 1, . . . , n,    (19)

then

    D_{−−}^v = max_{1≤t≤n} |h_t^v| / L,  v = 1, 2, . . . , n.    (20)

Let S_k be the set of observed sample points of sample k, k = 1, 2, respectively. Then we can rewrite h_t^v as

    h_t^v = L [ #{i : X_i' ≤ X_{(t)}', Y_{(i)}' ≤ Y_{(v)}', (X_i', Y_{(i)}') ∈ S_1} / n_1
              − #{i : X_i' ≤ X_{(t)}', Y_{(i)}' ≤ Y_{(v)}', (X_i', Y_{(i)}') ∈ S_2} / n_2 ]
          ≡ L [ #{i : X_i' ≤ X_{(t)}', 1 ≤ i ≤ v, (X_i', Y_{(i)}') ∈ S_1} / n_1
              − #{i : X_i' ≤ X_{(t)}', 1 ≤ i ≤ v, (X_i', Y_{(i)}') ∈ S_2} / n_2 ].    (21)
Now we can easily derive the following recurrence (which is very similar to (6)):

    h_t^v = h_t^{v−1}        if X_v' > X_{(t)}',
          = h_t^{v−1} + d_1  if X_v' ≤ X_{(t)}' and (X_v', Y_{(v)}') ∈ S_1,
          = h_t^{v−1} − d_2  if X_v' ≤ X_{(t)}' and (X_v', Y_{(v)}') ∈ S_2.    (22)

This recurrence is obtained by sorting the pooled sample according to the values of the Y-components of the sample points. Similarly, by sorting the pooled sample according to the values of the X-components of the sample points, we can obtain the other recurrence:

    h_t^v = h_{t−1}^v        if Y_t' > Y_{(v)}',
          = h_{t−1}^v + d_1  if Y_t' ≤ Y_{(v)}' and (X_{(t)}', Y_t') ∈ S_1,
          = h_{t−1}^v − d_2  if Y_t' ≤ Y_{(v)}' and (X_{(t)}', Y_t') ∈ S_2.    (23)

(Set h_t^0 = h_0^v = 0 for all v, t = 0, 1, 2, . . . , n.) Thus, h_t^v can be computed via the following double for-loop:

Algorithm 1:
for (t = 1; t ≤ n; t ← t + 1) do
    Set h_t^0 = 0;
    for (v = 1; v ≤ n; v ← v + 1) do
        Compute h_t^v via (22)
    end for
end for

Note that the inner v-loops corresponding to different t-values are totally independent of each other. This fact has two implications. First, we can switch t and v to obtain another double for-loop for computing h_t^v. Second, the algorithm is parallel.
We can also devise similar recurrences to evaluate D_{−+}, D_{+−} and D_{++}. However, we will propose a different approach. Denote h_t^v by h_{−−}^{t,v} and let

    h_{++}^{t,v} := L × [F_{++}^1(X_{(t)}', Y_{(v)}') − F_{++}^2(X_{(t)}', Y_{(v)}')],    (24)
    h_{+−}^{t,v} := L × [F_{+−}^1(X_{(t)}', Y_{(v)}') − F_{+−}^2(X_{(t)}', Y_{(v)}')],    (25)

and

    h_{−+}^{t,v} := L × [F_{−+}^1(X_{(t)}', Y_{(v)}') − F_{−+}^2(X_{(t)}', Y_{(v)}')],    (26)

then

    max_{1≤v≤n} D_{−−}^v = max{ max_{1≤t≤n} |h_{−−}^{t,v}| : 1 ≤ v ≤ n } / L,    (27)

    max_{1≤v≤n} D_{+−}^v = max{ max_{1≤t≤n} |h_{+−}^{t,v}| : 1 ≤ v ≤ n } / L,    (28)
    max_{1≤v≤n} D_{−+}^v = max{ max_{1≤t≤n} |h_{−+}^{t,v}| : 1 ≤ v ≤ n } / L,    (29)

and

    max_{1≤v≤n} D_{++}^v = max{ max_{1≤t≤n} |h_{++}^{t,v}| : 1 ≤ v ≤ n } / L.    (30)

Define the following marginal cumulative frequency functions:

    P_X^k(x) = #{t : X_t^k > x, 1 ≤ t ≤ n_k} / n_k,    (31)
    N_X^k(x) = #{t : X_t^k ≤ x, 1 ≤ t ≤ n_k} / n_k,    (32)
    P_Y^k(y) = #{t : Y_t^k > y, 1 ≤ t ≤ n_k} / n_k,    (33)

and

    N_Y^k(y) = #{t : Y_t^k ≤ y, 1 ≤ t ≤ n_k} / n_k,    (34)

for all −∞ < x, y < ∞ and k = 1, 2. It follows that

    F_{++}^k(x, y) + F_{+−}^k(x, y) = P_X^k(x),    (35)
    F_{+−}^k(x, y) + F_{−−}^k(x, y) = N_Y^k(y),    (36)

and

    F_{−+}^k(x, y) + F_{−−}^k(x, y) = N_X^k(x),    (37)

for k = 1, 2. Therefore,
    h_{−+}^{t,v} = L × [N_X^1(X_{(t)}') − N_X^2(X_{(t)}')] − h_{−−}^{t,v} = L × [N_X^1(X_{(t)}') − N_X^2(X_{(t)}')] − h_t^v,    (38)

    h_{+−}^{t,v} = L × [N_Y^1(Y_{(v)}') − N_Y^2(Y_{(v)}')] − h_{−−}^{t,v} = L × [N_Y^1(Y_{(v)}') − N_Y^2(Y_{(v)}')] − h_t^v,    (39)

and

    h_{++}^{t,v} = L × [P_X^1(X_{(t)}') − P_X^2(X_{(t)}')] − h_{+−}^{t,v}
                 = L[P_X^1(X_{(t)}') − P_X^2(X_{(t)}')] − [L(N_Y^1(Y_{(v)}') − N_Y^2(Y_{(v)}')) − h_{−−}^{t,v}]
                 = −L(N_X^1(X_{(t)}') − N_X^2(X_{(t)}')) − L(N_Y^1(Y_{(v)}') − N_Y^2(Y_{(v)}')) + h_{−−}^{t,v}
                 = h_t^v − L(N_X^1(X_{(t)}') − N_X^2(X_{(t)}')) − L(N_Y^1(Y_{(v)}') − N_Y^2(Y_{(v)}')).    (40)

Thus,

    D_2DKS = max_t max_v max{ |h_{−−}^{t,v}|, |h_{−+}^{t,v}|, |h_{+−}^{t,v}|, |h_{++}^{t,v}| } / L.    (41)
Hence the value of D_2DKS in (16) can be computed by the following algorithm:

Algorithm 2:
Set KS = 0;
for (t = 1; t ≤ n; t ← t + 1) do
    Set h_t^0 = 0;
    for (v = 1; v ≤ n; v ← v + 1) do
        Compute h_t^v, i.e. h_{−−}^{t,v}, via (22);
        Compute h_{−+}^{t,v}, h_{+−}^{t,v} and h_{++}^{t,v} via (38), (39) and (40), respectively;
        if KS < max{|h_{−−}^{t,v}|, |h_{−+}^{t,v}|, |h_{+−}^{t,v}|, |h_{++}^{t,v}|} then
            Set KS = max{|h_{−−}^{t,v}|, |h_{−+}^{t,v}|, |h_{+−}^{t,v}|, |h_{++}^{t,v}|};
        end if
    end for
end for
return KS/L.
Please be reminded that both L(N_X^1(X_{(t)}') − N_X^2(X_{(t)}')) and L(N_Y^1(Y_{(v)}') − N_Y^2(Y_{(v)}')) can be evaluated via recurrences similar to (6). This algorithm is parallel. Its computational efficiency will be discussed later.
Similarly, we can develop an efficient algorithm for the F&F test, which is widely implemented by using the brute force algorithm. Define

    d_t^{−−} = L[F_{−−}^1(X_t', Y_{(t)}') − F_{−−}^2(X_t', Y_{(t)}')],    (42)
    d_t^{−+} = L[F_{−+}^1(X_t', Y_{(t)}') − F_{−+}^2(X_t', Y_{(t)}')],    (43)
    d_t^{+−} = L[F_{+−}^1(X_t', Y_{(t)}') − F_{+−}^2(X_t', Y_{(t)}')],    (44)

and

    d_t^{++} = L[F_{++}^1(X_t', Y_{(t)}') − F_{++}^2(X_t', Y_{(t)}')],    (45)

for all t = 1, 2, . . . , n. Since

    max_{1≤t≤n} |F_{∗∗}^1(X_t^0, Y_t^0) − F_{∗∗}^2(X_t^0, Y_t^0)| = max_{1≤t≤n} |d_t^{∗∗}| / L,    (46)

(where ∗ stands for + or −), the F&F test statistic can be computed as

    D_2DFF = max_{1≤t≤n} max{ |d_t^{−−}|, |d_t^{−+}|, |d_t^{+−}|, |d_t^{++}| } / L.    (47)

See Fasano and Franceschini (1987). Notice that

    d_t^{−−} = L [ #{i : 1 ≤ i ≤ t, X_i' ≤ X_t', (X_i', Y_{(i)}') ∈ S_1} / n_1    (48)
                 − #{i : 1 ≤ i ≤ t, X_i' ≤ X_t', (X_i', Y_{(i)}') ∈ S_2} / n_2 ],    (49)

hence we can compute the value of d_t^{−−} by using the following recurrence:

    d_{(i)}^{−−} = d_{(i−1)}^{−−}        if X_i' > X_t',
                 = d_{(i−1)}^{−−} + d_1  if X_i' ≤ X_t' and (X_i', Y_{(i)}') ∈ S_1,    1 ≤ i ≤ t,    (50)
                 = d_{(i−1)}^{−−} − d_2  if X_i' ≤ X_t' and (X_i', Y_{(i)}') ∈ S_2,

where d_{(0)}^{−−} is set to be 0. Clearly d_t^{−−} = d_{(t)}^{−−}. By using the recurrence (50), the total number of comparisons needed to compute all the d_t^{−−}'s is about n(n + 1)/2, about half of the number of comparisons needed by the brute force algorithm, which is about n². Similar conclusions can be made for d_t^{−+}, d_t^{+−} and d_t^{++}.
The algorithm using the recurrence (50) is less efficient than the range-counting algorithm for the F&F test. However, it is more concise. See Lopes et al. (2007).
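A sketch of the resulting procedure for the F&F statistic (47) is given below (the name `ff_2d` is ours; no-ties assumption). The inner loop realizes a recurrence like (50), and the other three quadrant values at each observed point are recovered from the marginal relations (35)–(37), as in Algorithm 2:

```python
import math

def ff_2d(sample1, sample2):
    """F&F statistic D_2DFF, evaluating the four quadrant differences
    only at the n observed points (no-ties assumption)."""
    n1, n2 = len(sample1), len(sample2)
    L = n1 * n2 // math.gcd(n1, n2)
    d1, d2 = L // n1, L // n2
    step = {1: d1, 2: -d2}
    pooled = [(x, y, 1) for x, y in sample1] + [(x, y, 2) for x, y in sample2]
    by_y = sorted(pooled, key=lambda p: p[1])      # points (X'_t, Y'_(t))
    # gx[point] = L*(N_X^1 - N_X^2) at the point's x-value, via a
    # one-dimensional recurrence like (6) over the X-sorted pooled sample.
    gx, h = {}, 0
    for p in sorted(pooled):
        h += step[p[2]]
        gx[p] = h
    n, ks, gy_h = n1 + n2, 0, 0
    for t in range(n):
        xt, yt, kt = by_y[t]
        gy_h += step[kt]                           # L*(N_Y^1 - N_Y^2) at Y'_(t)
        dmm = 0
        for i in range(t + 1):                     # recurrence (50)
            xi, _, ki = by_y[i]
            if xi <= xt:
                dmm += step[ki]
        dmp = gx[(xt, yt, kt)] - dmm               # analogue of (38)
        dpm = gy_h - dmm                           # analogue of (39)
        dpp = dmm - gx[(xt, yt, kt)] - gy_h        # analogue of (40)
        ks = max(ks, abs(dmm), abs(dmp), abs(dpm), abs(dpp))
    return ks / L                                  # (47)
```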

3. Generalization to higher dimensional spaces

It is possible to generalize Peacock's test to higher dimensional spaces. However, it demands using 2^d pairs of cumulative distribution functions and computing the maximum absolute differences of each pair of cumulative functions at n^d points in a d-dimensional space, where d ≥ 2. So, the workload of the computing process may be prohibitively heavy if the brute force algorithm is used. Fortunately, the proposed algorithm can be generalized to higher dimensional spaces. To illustrate, it is sufficient to show how to generalize the recurrence (22) or (23) to the 3-dimensional case.
Denote the two given samples in a 3-dimensional space by {(X_i^k, Y_i^k, Z_i^k)}_{i=1}^{n_k}, k = 1, 2. The pooled samples sorted ascendingly by the values of the X-components, of the Y-components and of the Z-components will be denoted by {(X_{(t)}', Y_t', Z_t') : 1 ≤ t ≤ n}, {(X_t', Y_{(t)}', Z_t') : 1 ≤ t ≤ n} and {(X_t', Y_t', Z_{(t)}') : 1 ≤ t ≤ n}, respectively. Then

    h_{−−−}^{u,v,w} := L × [F_{−−−}^1(X_{(u)}', Y_{(v)}', Z_{(w)}') − F_{−−−}^2(X_{(u)}', Y_{(v)}', Z_{(w)}')]    (similar to (19))
                     = L × [ #{t : X_t' ≤ X_{(u)}', Y_t' ≤ Y_{(v)}', Z_{(t)}' ≤ Z_{(w)}', (X_t', Y_t', Z_{(t)}') ∈ S_1} / n_1
                           − #{t : X_t' ≤ X_{(u)}', Y_t' ≤ Y_{(v)}', Z_{(t)}' ≤ Z_{(w)}', (X_t', Y_t', Z_{(t)}') ∈ S_2} / n_2 ]
                     = L × [ #{t : X_t' ≤ X_{(u)}', Y_t' ≤ Y_{(v)}', 1 ≤ t ≤ w, (X_t', Y_t', Z_{(t)}') ∈ S_1} / n_1
                           − #{t : X_t' ≤ X_{(u)}', Y_t' ≤ Y_{(v)}', 1 ≤ t ≤ w, (X_t', Y_t', Z_{(t)}') ∈ S_2} / n_2 ]    (51)

for all u, v, w = 1, 2, . . . , n. It follows that we can derive the following recurrence:

    h_{−−−}^{u,v,w} = h_{−−−}^{u,v,w−1}        if X_w' > X_{(u)}' or Y_w' > Y_{(v)}',
                    = h_{−−−}^{u,v,w−1} + d_1  if X_w' ≤ X_{(u)}', Y_w' ≤ Y_{(v)}' and (X_w', Y_w', Z_{(w)}') ∈ S_1,
                    = h_{−−−}^{u,v,w−1} − d_2  if X_w' ≤ X_{(u)}', Y_w' ≤ Y_{(v)}' and (X_w', Y_w', Z_{(w)}') ∈ S_2.    (52)

(Set h_{−−−}^{0,v,w} = h_{−−−}^{u,0,w} = h_{−−−}^{u,v,0} = 0 for all u, v, w = 1, 2, . . . , n.) Now all the values of h_{−−−}^{u,v,w} can be computed by the following triple-loop:

Algorithm 3:
for (u = 1; u ≤ n; u ← u + 1) do
    for (v = 1; v ≤ n; v ← v + 1) do
        Set h_{−−−}^{u,v,0} = 0;
        for (w = 1; w ≤ n; w ← w + 1) do
            Compute h_{−−−}^{u,v,w} via the recurrence (52)
        end for
    end for
end for
The algorithm for the 3-dimensional Kolmogorov–Smirnov test is much more complex than that for the 2-dimensional case, but the reader should be able to work out the details.
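For concreteness, the triple loop of Algorithm 3 can be sketched as follows. This sketch (the name `peacock_3d_mmm` is ours; no-ties assumption) computes only the (−, −, −) part of the 3-dimensional statistic, i.e. the maximum of |h_{−−−}^{u,v,w}|/L via the recurrence (52); the remaining seven octants would be recovered from marginal relations analogous to (35)–(37):

```python
import math

def peacock_3d_mmm(sample1, sample2):
    """max |h_---^{u,v,w}| / L over the n^3 grid, via recurrence (52).

    sample1, sample2: lists of (x, y, z) points (no ties assumed).
    """
    n1, n2 = len(sample1), len(sample2)
    L = n1 * n2 // math.gcd(n1, n2)
    d1, d2 = L // n1, L // n2
    step = {1: d1, 2: -d2}
    pooled = [(*p, 1) for p in sample1] + [(*p, 2) for p in sample2]
    xs = sorted(p[0] for p in pooled)            # X'_(u) thresholds
    ys = sorted(p[1] for p in pooled)            # Y'_(v) thresholds
    by_z = sorted(pooled, key=lambda p: p[2])    # points in Z-order
    best = 0
    for xu in xs:                                # u-loop
        for yv in ys:                            # v-loop
            h = 0                                # h_---^{u,v,0} = 0
            for xw, yw, _, k in by_z:            # w-loop, recurrence (52)
                if xw <= xu and yw <= yv:
                    h += step[k]
                best = max(best, abs(h))
    return best / L
```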
It is interesting to compare the computational efficiency of the proposed algorithm with that of the brute force algorithms for both the F&F test and the Peacock test in a d-dimensional space, where d ≥ 2. We will compare the numbers of comparisons needed by each of the algorithms, for the major operations that consume the CPU time in these algorithms are numerical comparisons. Throughout the rest of the paper, we will call an algorithm an N algorithm if it needs N comparisons.
Algorithm 1 needs O(n²) comparisons once the pooled sample is sorted by the values of the X-coordinates and of the Y-coordinates, respectively. The quicksort algorithm takes O(n log_2 n) comparisons to sort the pooled samples (Hoare, 1961). Therefore, Algorithm 1 is an O(n²) algorithm. Consequently, Algorithm 2, the proposed algorithm for the Peacock test in 2-dimensional spaces, is O(n²). This conclusion can be generalized to higher dimensional spaces. In fact, the proposed algorithm is O(n^d) in a d-dimensional space. Hence, the proposed algorithm is O(n) times more efficient than the brute force algorithm for the Peacock test, which is O(n^{d+1}). The range-counting method can be used to speed up the brute force algorithm for the Peacock test. The resulting algorithm is O(n² log n) in two-dimensional spaces (Lopes et al., 2007), which is still O(log n) times less efficient than the proposed algorithm. Therefore, there is no sufficient evidence to believe that it would be more efficient than the proposed algorithm for the Peacock test in higher dimensional spaces.
The brute force algorithm for the F&F test is O(n²), regardless of the value of d. Hence it is O(n^{d−2}) times more efficient than the proposed algorithm if d > 2. In fact, even if d = 2, a careful examination of both algorithms shows that the former needs fewer comparisons than the proposed algorithm. The computing process for the F&F test can be made even faster by using the range-counting method or recurrences similar to (50). However, the cost of using the F&F test is that it does not use the full information in the data.
In order to conduct a numerical study, the author implemented the proposed algorithm and the brute force algorithms for both the Peacock test and the F&F test in C++. (An R package that implements the proposed algorithms for the Peacock test in both two- and three-dimensional spaces is publicly available as a contributed package at any of the CRAN mirrors listed at https://cran.r-project.org/mirrors.html.) The algorithm for the F&F test is improved by using recurrences similar to (50), and can be verified to be approximately four times more efficient than the proposed algorithm for the Peacock test in two-dimensional spaces. From the numerical study, the author observed that the proposed algorithm for the Peacock test completed the computing process on an ordinary PC in 1 s when both sample sizes are n_1 = n_2 = 1000 and d = 2, while the brute force algorithm used 656 s. The time needed for the F&F test is negligible. In the case d = 3, the proposed algorithm took 8817 s (almost two and a half hours) to complete the computing process, while the brute force algorithm would need many days. The algorithm for the F&F test completed the computing process in 1 s. When the sample sizes become n_1 = n_2 = 10,000, in the 2-dimensional case the proposed algorithm for the Peacock test completed the computing process in 129 s, while the improved algorithm for the F&F test used only 32 s. The brute force algorithm would not be able to complete the computation in a few days. In the 3-dimensional case, the algorithm for the F&F test completed the computation in 74 s. No other algorithm would be able to complete the computation in a few days.

Acknowledgments

The author would like to thank the associate editor and the two reviewers for their valuable comments and suggestions.

References

Burr, E.J., 1963. Small-sample distribution of the two-sample Cramér–von Mises criterion for small equal samples. Ann. Math. Statist. 34 (1), 95–101.
Fasano, G., Franceschini, A., 1987. A multidimensional version of the Kolmogorov–Smirnov test. Mon. Not. R. Astron. Soc. 225, 155–170.
Hájek, J., Šidák, Z., 1967. Theory of Rank Tests. Academic Press, New York.
Hoare, C.A.R., 1961. Algorithm 64: Quicksort. Commun. ACM 4 (7), 321. http://dx.doi.org/10.1145/366622.366644.
Lopes, R.H.C., Reid, I., Hobson, P.R., 2007. The two-dimensional Kolmogorov–Smirnov test. In: XI International Workshop on Advanced Computing and Analysis Techniques in Physical Research, Amsterdam, The Netherlands.
Peacock, J.A., 1983. Two-dimensional goodness-of-fit testing in astronomy. Mon. Not. R. Astron. Soc. 202, 615–627.
Xiao, Y., Gordon, A., Yakovlev, A., 2007. A C++ program for the Cramér–von Mises two-sample test. J. Stat. Softw. 17 (8).
