A fast algorithm for 2-D KS two sample tests
Yuanhui Xiao
Given two continuous probability distribution functions $F_1$ and $F_2$ in one-dimensional space, consider the hypothesis
testing problem
$$H_0 : F_1 = F_2 \quad \text{vs.} \quad H_a : F_1 \neq F_2 \qquad (1)$$
based on the samples $\{X_i^1\}_{i=1}^{n_1}$ and $\{X_j^2\}_{j=1}^{n_2}$ from the respective distributions. The classical Kolmogorov–Smirnov test uses
the maximum difference of the empirical distribution functions (or cumulative frequency functions) at the observed values.
Specifically, let $F_{n_k}^k$ ($k = 1, 2$) be the empirical distribution function based on the sample $\{X_t^k\}_{t=1}^{n_k}$, that is,
$$F_{n_k}^k(x) = \frac{\#\{t : X_t^k \le x,\ 1 \le t \le n_k\}}{n_k}, \qquad -\infty < x < \infty, \qquad (2)$$
where $\#$ means ``the number of''. The Kolmogorov–Smirnov test statistic $D_{KS}$ is then computed (up to a constant multiple) as
$$D_{KS} = \max\Big\{\max_{1 \le i \le n_1} |F_{n_1}^1(X_i^1) - F_{n_2}^2(X_i^1)|,\; \max_{1 \le j \le n_2} |F_{n_1}^1(X_j^2) - F_{n_2}^2(X_j^2)|\Big\}. \qquad (3)$$
The value of $D_{KS}$ is often computed by a brute force algorithm, which simply counts the number of sample values that are
less than or equal to $X_i^1$ or $X_j^2$ for each $i = 1, 2, \ldots, n_1$ and $j = 1, 2, \ldots, n_2$. The number of comparisons needed by the brute force
algorithm is $O(n^2)$, where $n = n_1 + n_2$.
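As a point of reference, the brute-force computation of (3) can be sketched in a few lines of C++ (the language of the author's reference implementation); the function below is illustrative only and is not the author's code.

#include <algorithm>
#include <cmath>
#include <vector>

// Brute-force two-sample KS statistic in one dimension: for every observed
// value, count how many points of each sample do not exceed it, and keep the
// largest difference between the two empirical distribution functions.
double ks1d_brute_force(const std::vector<double>& x1,
                        const std::vector<double>& x2) {
    const double n1 = x1.size(), n2 = x2.size();
    auto edf_diff_at = [&](double v) {
        double c1 = std::count_if(x1.begin(), x1.end(),
                                  [v](double x) { return x <= v; });
        double c2 = std::count_if(x2.begin(), x2.end(),
                                  [v](double x) { return x <= v; });
        return std::fabs(c1 / n1 - c2 / n2);
    };
    double d = 0.0;
    for (double v : x1) d = std::max(d, edf_diff_at(v));
    for (double v : x2) d = std::max(d, edf_diff_at(v));
    return d;
}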
However, there exists a faster algorithm. Let $L$ be the least common multiple of $n_1$ and $n_2$, $d_1 = L/n_1$, $d_2 = L/n_2$, and let
$$X'_{(1)} \le X'_{(2)} \le \cdots \le X'_{(n)}$$
be the pooled sample arranged ascendingly. (Throughout this paper we assume, where necessary, that the observed values contain no ties.) Define
$$h_t = L \times \left[F_{n_1}^1(X'_{(t)}) - F_{n_2}^2(X'_{(t)})\right], \qquad t = 1, 2, \ldots, n, \quad h_0 = 0. \qquad (5)$$
These quantities satisfy the recurrence
$$h_t = \begin{cases} h_{t-1} + d_1 & \text{if } X'_{(t)} \text{ comes from the first sample},\\ h_{t-1} - d_2 & \text{if } X'_{(t)} \text{ comes from the second sample}, \end{cases} \qquad t = 1, 2, \ldots, n. \qquad (6)$$
See Burr (1963), Hájek and Šidák (1967) and Xiao et al. (2007). The value of the Kolmogorov–Smirnov test statistic is the
maximum value of $|h_t|/L$ over $1 \le t \le n$:
$$D_{KS} = \max_{1 \le t \le n} |h_t|/L. \qquad (7)$$
If the quick sort method is used, this algorithm needs only $O(n \log_2 n)$ comparisons (Hoare, 1961), making it roughly $n/\log_2 n$ times more efficient than the brute force algorithm. In addition, the use of $L$ speeds the algorithm up further, since all the intermediate results are integers.
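To make the recurrence concrete, the following C++ sketch computes $D_{KS}$ by the method just described: sort the pooled sample, add $d_1$ for each value from the first sample and subtract $d_2$ for each value from the second, and track the largest $|h_t|$. The function name and interface are illustrative, and ties are assumed absent.

#include <algorithm>
#include <cstdint>
#include <numeric>
#include <utility>
#include <vector>

// Fast one-dimensional two-sample KS statistic via the integer recurrence (6):
// walking through the sorted pooled sample, h gains d1 = L/n1 for a value from
// sample 1 and loses d2 = L/n2 for a value from sample 2.  D_KS is the largest
// |h| encountered, divided by L.
double ks1d_fast(const std::vector<double>& x1, const std::vector<double>& x2) {
    const std::int64_t n1 = x1.size(), n2 = x2.size();
    const std::int64_t L = n1 / std::gcd(n1, n2) * n2;   // lcm(n1, n2)
    const std::int64_t d1 = L / n1, d2 = L / n2;

    // Pool the samples, remembering which sample each value came from.
    std::vector<std::pair<double, int>> pooled;
    pooled.reserve(n1 + n2);
    for (double v : x1) pooled.emplace_back(v, 1);
    for (double v : x2) pooled.emplace_back(v, 2);
    std::sort(pooled.begin(), pooled.end());              // O(n log n)

    std::int64_t h = 0, hmax = 0;
    for (const auto& p : pooled) {
        h += (p.second == 1) ? d1 : -d2;
        hmax = std::max(hmax, h < 0 ? -h : h);
    }
    return static_cast<double>(hmax) / static_cast<double>(L);
}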
The generalization of the Kolmogorov–Smirnov test to higher-dimensional probability distributions is a challenge. To
generalize the Kolmogorov–Smirnov test to two-dimensional space, Peacock (1983) proposed a procedure that makes
use of four (rather than just one) pairs of cumulative frequency functions. Denote the two given samples in the plane by
$\{(X_i^k, Y_i^k)\}_{i=1}^{n_k}$, $k = 1, 2$, respectively. The four pairs of cumulative frequency functions used by Peacock's test are given by
$$F_{++}^k(x, y) = \#\{i : X_i^k > x,\ Y_i^k > y,\ 1 \le i \le n_k\}/n_k, \qquad (8)$$
$$F_{+-}^k(x, y) = \#\{i : X_i^k > x,\ Y_i^k \le y,\ 1 \le i \le n_k\}/n_k, \qquad (9)$$
$$F_{-+}^k(x, y) = \#\{i : X_i^k \le x,\ Y_i^k > y,\ 1 \le i \le n_k\}/n_k, \qquad (10)$$
and
$$F_{--}^k(x, y) = \#\{i : X_i^k \le x,\ Y_i^k \le y,\ 1 \le i \le n_k\}/n_k, \qquad (11)$$
where $-\infty < x, y < \infty$ and $k = 1, 2$. Let $\{X_t^0 : t = 1, 2, \ldots, n\}$ be the pooled data set consisting of the values of
the $X$-components of the given samples and $\{Y_t^0 : t = 1, 2, \ldots, n\}$ the pooled data set consisting of the values of the
$Y$-components of the given samples. Define
$$D_{++} \stackrel{\text{def}}{=} \max_{1 \le s \le n,\, 1 \le t \le n} |F_{++}^1(X_s^0, Y_t^0) - F_{++}^2(X_s^0, Y_t^0)|, \qquad (12)$$
$$D_{+-} \stackrel{\text{def}}{=} \max_{1 \le s \le n,\, 1 \le t \le n} |F_{+-}^1(X_s^0, Y_t^0) - F_{+-}^2(X_s^0, Y_t^0)|, \qquad (13)$$
$$D_{-+} \stackrel{\text{def}}{=} \max_{1 \le s \le n,\, 1 \le t \le n} |F_{-+}^1(X_s^0, Y_t^0) - F_{-+}^2(X_s^0, Y_t^0)|, \qquad (14)$$
and
$$D_{--} \stackrel{\text{def}}{=} \max_{1 \le s \le n,\, 1 \le t \le n} |F_{--}^1(X_s^0, Y_t^0) - F_{--}^2(X_s^0, Y_t^0)|. \qquad (15)$$
Let $\{(X'_{(t)}, Y'_t) : 1 \le t \le n\}$ denote the pooled sample sorted ascendingly by the values of the $X$-components of the data points and $\{(X'_t, Y'_{(t)}) : 1 \le t \le n\}$ the pooled sample sorted ascendingly by the values of the $Y$-components of the data points. Please notice that any point
$(X_s^0, Y_t^0)$, $X_s^0$ being an $X$-value in the pooled sample and $Y_t^0$ being a $Y$-value in the pooled sample, respectively, can be
expressed as $(X'_{(u)}, Y'_{(v)})$ for some $1 \le u \le n$ and $1 \le v \le n$. Therefore, we can re-write the expression of $D_{--}$ as
$$D_{--} = \max_{1 \le v \le n} D_{--}^v,$$
where
$$D_{--}^v \stackrel{\text{def}}{=} \max_{1 \le t \le n} |F_{--}^1(X'_{(t)}, Y'_{(v)}) - F_{--}^2(X'_{(t)}, Y'_{(v)})|.$$
(Similar expressions can be found for $D_{++}$, $D_{+-}$, $D_{-+}$.) Similar to (5), define for each $v = 1, 2, \ldots, n$,
$$h_t^v = L \times \left[F_{--}^1(X'_{(t)}, Y'_{(v)}) - F_{--}^2(X'_{(t)}, Y'_{(v)})\right], \qquad t = 0, 1, \ldots, n, \qquad (19)$$
then
$$D_{--}^v = \max_{1 \le t \le n} |h_t^v|/L, \qquad v = 1, 2, \ldots, n. \qquad (20)$$
Let $S_k$ be the set of observed sample points of sample $k$, $k = 1, 2$, respectively. Then we can rewrite $h_t^v$ as
$$h_t^v = L\left[\frac{\#\{i : X'_i \le X'_{(t)},\ Y'_{(i)} \le Y'_{(v)},\ (X'_i, Y'_{(i)}) \in S_1\}}{n_1} - \frac{\#\{i : X'_i \le X'_{(t)},\ Y'_{(i)} \le Y'_{(v)},\ (X'_i, Y'_{(i)}) \in S_2\}}{n_2}\right]$$
$$\equiv L\left[\frac{\#\{i : X'_i \le X'_{(t)},\ 1 \le i \le v,\ (X'_i, Y'_{(i)}) \in S_1\}}{n_1} - \frac{\#\{i : X'_i \le X'_{(t)},\ 1 \le i \le v,\ (X'_i, Y'_{(i)}) \in S_2\}}{n_2}\right]. \qquad (21)$$
Now we can easily derive the following recurrence (which is very similar to (6)):
$$h_t^v = \begin{cases} h_t^{v-1} & \text{if } X'_v > X'_{(t)},\\ h_t^{v-1} + d_1 & \text{if } X'_v \le X'_{(t)} \text{ and } (X'_v, Y'_{(v)}) \in S_1,\\ h_t^{v-1} - d_2 & \text{if } X'_v \le X'_{(t)} \text{ and } (X'_v, Y'_{(v)}) \in S_2. \end{cases} \qquad (22)$$
This recurrence is obtained by sorting the pooled sample according to the values of the Y -components of the sample points.
Similarly, by sorting the pooled sample according to the values of the X -components of the sample points, we can obtain
the other recurrence:
$$h_t^v = \begin{cases} h_{t-1}^v & \text{if } Y'_t > Y'_{(v)},\\ h_{t-1}^v + d_1 & \text{if } Y'_t \le Y'_{(v)} \text{ and } (X'_{(t)}, Y'_t) \in S_1,\\ h_{t-1}^v - d_2 & \text{if } Y'_t \le Y'_{(v)} \text{ and } (X'_{(t)}, Y'_t) \in S_2. \end{cases} \qquad (23)$$
(Set $h_t^0 = h_0^v = 0$ for all $v, t = 0, 1, 2, \ldots, n$.) Thus, $h_t^v$ can be computed via the following double for-loop:
Algorithm 1:
for (t = 1; t ≤ n; t ← t + 1) do
  Set $h_t^0 = 0$;
  for (v = 1; v ≤ n; v ← v + 1) do
    Compute $h_t^v$ via (22)
  end for
end for
Note that the inner $v$-loops corresponding to different $t$-values are completely independent of each other. This fact has two
implications. First, we can interchange $t$ and $v$ to obtain another double for-loop for computing $h_t^v$. Second, the algorithm is naturally parallel.
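For concreteness, here is a C++ sketch of Algorithm 1 restricted to the $D_{--}$ component, assuming no ties; it applies recurrence (22) literally, so it needs $O(n^2)$ comparisons after two $O(n \log_2 n)$ sorts. The type and function names are illustrative.

#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

struct Point { double x, y; int sample; };   // sample is 1 or 2

// D_{--} of Peacock's test via Algorithm 1: for each threshold X'_(t) the
// inner loop walks the pooled sample in increasing order of Y and applies
// recurrence (22): h gains d1 (sample 1) or loses d2 (sample 2) whenever the
// current point's X-value does not exceed X'_(t).
double peacock_d_minus_minus(std::vector<Point> pooled,
                             std::int64_t n1, std::int64_t n2) {
    const std::int64_t L = n1 / std::gcd(n1, n2) * n2;   // lcm(n1, n2)
    const std::int64_t d1 = L / n1, d2 = L / n2;

    // X-order statistics X'_(1) <= ... <= X'_(n) of the pooled sample.
    std::vector<double> xs;
    xs.reserve(pooled.size());
    for (const Point& p : pooled) xs.push_back(p.x);
    std::sort(xs.begin(), xs.end());

    // Pooled sample sorted ascendingly by the Y-components.
    std::sort(pooled.begin(), pooled.end(),
              [](const Point& a, const Point& b) { return a.y < b.y; });

    std::int64_t hmax = 0;
    for (double xt : xs) {                 // t-loop: threshold X'_(t)
        std::int64_t h = 0;                // h_t^0 = 0
        for (const Point& p : pooled) {    // v-loop over Y-sorted points
            if (p.x <= xt) h += (p.sample == 1) ? d1 : -d2;
            hmax = std::max(hmax, h < 0 ? -h : h);
        }
    }
    return static_cast<double>(hmax) / static_cast<double>(L);
}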
We can also devise similar recurrences to evaluate $D_{-+}$, $D_{+-}$ and $D_{++}$. However, we will propose a different approach.
Denote $h_t^v$ by $h_{--}^{t,v}$ and let
$$h_{++}^{t,v} \stackrel{\text{def}}{=} L \times \left[F_{++}^1(X'_{(t)}, Y'_{(v)}) - F_{++}^2(X'_{(t)}, Y'_{(v)})\right], \qquad (24)$$
$$h_{+-}^{t,v} \stackrel{\text{def}}{=} L \times \left[F_{+-}^1(X'_{(t)}, Y'_{(v)}) - F_{+-}^2(X'_{(t)}, Y'_{(v)})\right], \qquad (25)$$
and
$$h_{-+}^{t,v} \stackrel{\text{def}}{=} L \times \left[F_{-+}^1(X'_{(t)}, Y'_{(v)}) - F_{-+}^2(X'_{(t)}, Y'_{(v)})\right], \qquad (26)$$
then
$$\max_{1 \le v \le n} D_{--}^v = \max\Big\{\max_{1 \le t \le n} |h_{--}^{t,v}| : 1 \le v \le n\Big\}/L, \qquad (27)$$
$$\max_{1 \le v \le n} D_{+-}^v = \max\Big\{\max_{1 \le t \le n} |h_{+-}^{t,v}| : 1 \le v \le n\Big\}/L, \qquad (28)$$
$$\max_{1 \le v \le n} D_{-+}^v = \max\Big\{\max_{1 \le t \le n} |h_{-+}^{t,v}| : 1 \le v \le n\Big\}/L, \qquad (29)$$
and
$$\max_{1 \le v \le n} D_{++}^v = \max\Big\{\max_{1 \le t \le n} |h_{++}^{t,v}| : 1 \le v \le n\Big\}/L. \qquad (30)$$
Algorithm 2:
Set KS = 0;
for (t = 1; t ≤ n; t ← t + 1) do
  Set $h_t^0 = 0$;
  for (v = 1; v ≤ n; v ← v + 1) do
    Compute $h_t^v$, i.e. $h_{--}^{t,v}$, via (22);
    Compute $h_{-+}^{t,v}$, $h_{+-}^{t,v}$ and $h_{++}^{t,v}$ via (38), (39) and (40), respectively;
    if KS < max{$|h_{--}^{t,v}|$, $|h_{-+}^{t,v}|$, $|h_{+-}^{t,v}|$, $|h_{++}^{t,v}|$} then
      Set KS = max{$|h_{--}^{t,v}|$, $|h_{-+}^{t,v}|$, $|h_{+-}^{t,v}|$, $|h_{++}^{t,v}|$};
    end if
  end for
end for
return KS/L.
Please be reminded that both $L\left(N_X^1(X'_{(t)}) - N_X^2(X'_{(t)})\right)$ and $L\left(N_Y^1(Y'_{(v)}) - N_Y^2(Y'_{(v)})\right)$ can be evaluated via recurrences similar to
(6). This algorithm is parallel. Its computational efficiency will be discussed later.
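One simple way to obtain the remaining three differences from $h_{--}^{t,v}$ and the marginal counts $N_X^k$ and $N_Y^k$ is the inclusion-exclusion identities $F_{-+} = F_X - F_{--}$, $F_{+-} = F_Y - F_{--}$ and $F_{++} = 1 - F_X - F_Y + F_{--}$. The C++ helper below sketches this idea; the identities and all names here are an assumption of this sketch and are not necessarily the exact form of the recurrences (38)-(40).

#include <cstdint>

// Hypothetical helper: combines
//   h_mm = h_{--}^{t,v}  (recurrence (22)),
//   ax   = L * [N_X^1(X'_(t))/n1 - N_X^2(X'_(t))/n2]  (1-D recurrence on the X-values),
//   by   = L * [N_Y^1(Y'_(v))/n1 - N_Y^2(Y'_(v))/n2]  (1-D recurrence on the Y-values),
// using the inclusion-exclusion identities
//   F_{-+} = F_X - F_{--},  F_{+-} = F_Y - F_{--},  F_{++} = 1 - F_X - F_Y + F_{--}.
struct QuadrantDiffs { std::int64_t mm, mp, pm, pp; };

QuadrantDiffs quadrant_diffs(std::int64_t h_mm, std::int64_t ax, std::int64_t by) {
    QuadrantDiffs q;
    q.mm = h_mm;              // h_{--}^{t,v}
    q.mp = ax - h_mm;         // h_{-+}^{t,v}
    q.pm = by - h_mm;         // h_{+-}^{t,v}
    q.pp = h_mm - ax - by;    // h_{++}^{t,v}
    return q;
}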
Similarly, we can develop an efficient algorithm for the F&F test (Fasano and Franceschini, 1987), which is widely implemented using the brute force
algorithm. Define
$$d_t^{--} = L\left[F_{--}^1(X'_t, Y'_{(t)}) - F_{--}^2(X'_t, Y'_{(t)})\right], \qquad (42)$$
$$d_t^{-+} = L\left[F_{-+}^1(X'_t, Y'_{(t)}) - F_{-+}^2(X'_t, Y'_{(t)})\right], \qquad (43)$$
$$d_t^{+-} = L\left[F_{+-}^1(X'_t, Y'_{(t)}) - F_{+-}^2(X'_t, Y'_{(t)})\right], \qquad (44)$$
and
$$d_t^{++} = L\left[F_{++}^1(X'_t, Y'_{(t)}) - F_{++}^2(X'_t, Y'_{(t)})\right], \qquad (45)$$
where $*$ stands for $+$ or $-$; the F&F test statistic is then obtained from the maxima of $|d_t^{**}|/L$ over $1 \le t \le n$. Here $d_{(0)}^{--}$ is set to be 0, and clearly $d_t^{--} = d_{(t)}^{--}$. By using the recurrence (50), the total number of comparisons needed to
compute all the $d_t^{--}$'s is about $n(n+1)/2$, about half of the number of comparisons needed by the brute force algorithm, which is $O(n^2)$.
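For comparison, a brute-force evaluation of the quantities in (42)-(45) is easy to write down. The C++ sketch below anchors the four quadrants at every pooled sample point and keeps the largest absolute difference between the two samples' quadrant frequencies; taking this overall maximum as the final F&F statistic is an assumption of the sketch, and the names are illustrative.

#include <algorithm>
#include <cmath>
#include <vector>

struct Obs { double x, y; int sample; };   // sample is 1 or 2

// Brute-force quadrant comparison for an F&F-type statistic: anchor the four
// quadrants at each pooled sample point, count the fraction of each sample
// falling in every quadrant, and record the largest absolute difference.
// Taking the overall maximum as the statistic is an assumption of this sketch.
double ff_brute_force(const std::vector<Obs>& pooled, double n1, double n2) {
    double d = 0.0;
    for (const Obs& a : pooled) {
        double c1[2][2] = {{0, 0}, {0, 0}};   // sample-1 quadrant counts
        double c2[2][2] = {{0, 0}, {0, 0}};   // sample-2 quadrant counts
        for (const Obs& b : pooled) {
            int i = (b.x <= a.x) ? 0 : 1;     // 0: X <= x, 1: X > x
            int j = (b.y <= a.y) ? 0 : 1;     // 0: Y <= y, 1: Y > y
            if (b.sample == 1) c1[i][j] += 1.0; else c2[i][j] += 1.0;
        }
        for (int i = 0; i < 2; ++i)
            for (int j = 0; j < 2; ++j)
                d = std::max(d, std::fabs(c1[i][j] / n1 - c2[i][j] / n2));
    }
    return d;
}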
It is possible to generalize Peacock's test to higher-dimensional spaces. However, it demands using $2^d$ pairs of cumulative
distribution functions and computing the maximum absolute differences of each pair of cumulative functions at $n^d$ points
in a $d$-dimensional space, where $d \ge 2$. So the workload of the computing process may be prohibitively heavy if the brute
force algorithm is used. Fortunately, the proposed algorithm can be generalized to higher-dimensional spaces. To illustrate,
it is sufficient to show how to generalize the recurrence (22) or (23) to the 3-dimensional case.
Denote the two given samples in a 3-dimensional space by $\{(X_i^k, Y_i^k, Z_i^k)\}_{i=1}^{n_k}$, $k = 1, 2$. The pooled samples sorted
ascendingly by the values of the $X$-components, the $Y$-components and the $Z$-components will be denoted
by $\{(X'_{(t)}, Y'_t, Z'_t) : 1 \le t \le n\}$, $\{(X'_t, Y'_{(t)}, Z'_t) : 1 \le t \le n\}$ and $\{(X'_t, Y'_t, Z'_{(t)}) : 1 \le t \le n\}$, respectively. Then
$$h_{---}^{u,v,w} \stackrel{\text{def}}{=} L \times \left[F_{---}^1(X'_{(u)}, Y'_{(v)}, Z'_{(w)}) - F_{---}^2(X'_{(u)}, Y'_{(v)}, Z'_{(w)})\right] \quad \text{(similar to (19))}$$
$$= L \times \Big[\#\{t : X'_t \le X'_{(u)},\ Y'_t \le Y'_{(v)},\ Z'_{(t)} \le Z'_{(w)},\ (X'_t, Y'_t, Z'_{(t)}) \in S_1\}/n_1$$
$$\qquad - \#\{t : X'_t \le X'_{(u)},\ Y'_t \le Y'_{(v)},\ Z'_{(t)} \le Z'_{(w)},\ (X'_t, Y'_t, Z'_{(t)}) \in S_2\}/n_2\Big]$$
Algorithm 3:
for (u = 1; u ≤ n; u ← u + 1) do
  for (v = 1; v ≤ n; v ← v + 1) do
    Set $h_{---}^{u,v,0} = 0$;
    for (w = 1; w ≤ n; w ← w + 1) do
      Compute $h_{---}^{u,v,w}$ via the recurrence (52)
    end for
  end for
end for
The algorithm for the 3-dimensional Kolmogorov–Smirnov test is considerably more involved than in the 2-dimensional
case, but the reader should be able to work out the details.
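As an illustration of what those details might look like, the following C++ sketch computes the $---$ component in three dimensions, assuming the recurrence (52) takes the natural analogue of (22): walking the pooled sample in increasing order of the $Z$-values, $h$ gains $d_1$ (sample 1) or loses $d_2$ (sample 2) whenever the current point also satisfies the $X$- and $Y$-thresholds. The names are illustrative and ties are assumed absent.

#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

struct Point3 { double x, y, z; int sample; };   // sample is 1 or 2

// Assumed 3-D analogue of recurrence (22): walking the pooled sample in
// increasing order of Z, h gains d1 (sample 1) or loses d2 (sample 2)
// whenever the current point also satisfies the X- and Y-thresholds.
double peacock3d_d_mmm(std::vector<Point3> pooled,
                       std::int64_t n1, std::int64_t n2) {
    const std::int64_t L = n1 / std::gcd(n1, n2) * n2;   // lcm(n1, n2)
    const std::int64_t d1 = L / n1, d2 = L / n2;

    std::vector<double> xs, ys;
    for (const Point3& p : pooled) { xs.push_back(p.x); ys.push_back(p.y); }
    std::sort(xs.begin(), xs.end());                     // X'_(u)
    std::sort(ys.begin(), ys.end());                     // Y'_(v)
    std::sort(pooled.begin(), pooled.end(),              // ascending Z
              [](const Point3& a, const Point3& b) { return a.z < b.z; });

    std::int64_t hmax = 0;
    for (double xu : xs) {                               // u-loop
        for (double yv : ys) {                           // v-loop
            std::int64_t h = 0;                          // h^{u,v,0} = 0
            for (const Point3& p : pooled) {             // w-loop (Z order)
                if (p.x <= xu && p.y <= yv) {
                    h += (p.sample == 1) ? d1 : -d2;
                    hmax = std::max(hmax, h < 0 ? -h : h);
                }
            }
        }
    }
    return static_cast<double>(hmax) / static_cast<double>(L);
}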
It is interesting to compare the computational efficiency of the proposed algorithm with that of the brute force algorithms
for both the F&F test and the Peacock test in a $d$-dimensional space, where $d \ge 2$. We compare the numbers of comparisons
needed by each algorithm, since numerical comparisons are the major operations that consume CPU time in these
algorithms. Throughout the rest of the paper, we will call an algorithm an $N$ algorithm if it needs $N$ comparisons.
Algorithm 1 needs $O(n^2)$ comparisons once the pooled sample has been sorted by the values of the $X$-coordinates and by those of
the $Y$-coordinates, respectively. The quick sort algorithm takes $O(n \log_2 n)$ comparisons to sort the pooled samples (Hoare,
1961). Therefore, Algorithm 1 is an $O(n^2)$ algorithm. Consequently, Algorithm 2, the proposed algorithm for the Peacock test
in 2-dimensional spaces, is $O(n^2)$. This conclusion can be generalized to higher-dimensional spaces. In fact, the proposed
algorithm is $O(n^d)$ in a $d$-dimensional space. Hence, the proposed algorithm is $O(n)$ times more efficient than the brute
force algorithm for the Peacock test, which is $O(n^{d+1})$. The range-counting method can be used to speed up the brute force
algorithm for the Peacock test. The resulting algorithm is $O(n^2 \log n)$ in two-dimensional spaces (Lopes et al., 2007), which
is still $O(\log n)$ times less efficient than the proposed algorithm. Therefore, there is not sufficient evidence to believe that it
would be more efficient than the proposed algorithm for the Peacock test in higher-dimensional spaces.
The brute force algorithm for the F&F test is $O(n^2)$, regardless of the value of $d$. Hence it is $O(n^{d-2})$ times more efficient
than the proposed algorithm if $d > 2$. In fact, even when $d = 2$, a careful examination of both algorithms shows that the former
needs fewer comparisons than the proposed algorithm. The computing process for the F&F test can be made even faster by
using the range-counting method or recurrences similar to (50). However, the cost of using the F&F test is that it does not
use the full information in the data.
In order to conduct a numerical study, the author implemented the proposed algorithm and the brute force algorithms for
both the Peacock test and the F&F test in C++. (An R package that implements the proposed algorithms for the Peacock
test in both two- and three-dimensional spaces is publicly available as a contributed package at any of the CRAN mirrors listed
at https://cran.r-project.org/mirrors.html.) The algorithm for the F&F test is improved by using recurrences similar to
(50), which can be verified to be approximately four times more efficient than the proposed algorithm for the Peacock test
in two-dimensional spaces. From the numerical study, the author observed that the proposed algorithm for the Peacock
test completed the computation on an ordinary PC in 1 s when both sample sizes are $n_1 = n_2 = 1000$ and $d = 2$,
while the brute force algorithm used 656 s. The time needed for the F&F test is negligible. In the case $d = 3$, the proposed
algorithm took 8817 s (almost two and a half hours) to complete the computation, while the brute force algorithm
would need many days. The algorithm for the F&F test completed the computation in 1 s. When the sample sizes
become $n_1 = n_2 = 10{,}000$, in the 2-dimensional case the proposed algorithm for the Peacock test completed the computation
in 129 s, while the improved algorithm for the F&F test used only 32 s. The brute force algorithm would not be
able to complete the computation in a few days. In the 3-dimensional case, the algorithm for the F&F test completed the
computation in 74 s. No other algorithm would be able to complete the computation in a few days.
Acknowledgments
The author would like to thank the associate editor and the two reviewers for their valuable comments and suggestions.
References
Burr, E.J., 1963. Small-sample distribution of the two-sample Cramér–von Mises criterion for small equal samples. Ann. Math. Statist. 34 (1), 95–101.
Fasano, G., Franceschini, A., 1987. A multidimensional version of the Kolmogorov–Smirnov test. Mon. Not. R. Astron. Soc. 225, 155–170.
Hájek, J., Šidák, Z., 1967. Theory of Rank Tests. Academic Press, New York.
Hoare, C.A.R., 1961. Algorithm 64: Quicksort. Commun. ACM 4 (7), 321. http://dx.doi.org/10.1145/366622.366644.
Lopes, R.H.C., Reid, I., Hobson, P.R., 2007. The two-dimensional Kolmogorov–Smirnov test. In: XI International Workshop on Advanced Computing and Analysis Techniques in Physical Research, Amsterdam, The Netherlands.
Peacock, J.A., 1983. Two-dimensional goodness-of-fit testing in astronomy. Mon. Not. R. Astron. Soc. 202, 615–627.
Xiao, Y., Gordon, A., Yakovlev, A., 2007. A C++ program for the Cramér–von Mises two-sample test. J. Stat. Softw. 17 (8).