q-means: A Quantum Algorithm for Unsupervised Machine Learning
Atos Bull, Les Clayes Sous Bois, France
Abstract
Quantum machine learning is one of the most promising applications of a full-scale quan-
tum computer. Over the past few years, many quantum machine learning algorithms have
been proposed that can potentially offer considerable speedups over the corresponding classi-
cal algorithms. In this paper, we introduce q-means, a new quantum algorithm for clustering
which is a canonical problem in unsupervised machine learning. The q-means algorithm has
convergence and precision guarantees similar to k-means, and it outputs with high probability
a good approximation of the k cluster centroids like the classical algorithm. Given a dataset
of N d-dimensional vectors $v_i$ (seen as a matrix $V \in \mathbb{R}^{N \times d}$) stored in QRAM, the running time of q-means is $\widetilde{O}\left( k d \frac{\eta}{\delta^2} \kappa(V)\left(\mu(V) + k \frac{\eta}{\delta}\right) + k^2 \frac{\eta^{1.5}}{\delta^2} \kappa(V) \mu(V) \right)$ per iteration, where $\kappa(V)$ is the condition number, $\mu(V)$ is a parameter that appears in quantum linear algebra procedures and $\eta = \max_i \|v_i\|^2$. For a natural notion of well-clusterable datasets, the running time becomes $\widetilde{O}\left( k^2 d \frac{\eta^{2.5}}{\delta^3} + k^{2.5} \frac{\eta^2}{\delta^3} \right)$ per iteration, which is linear in the number of features d, and polynomial in the rank k, the maximum square norm η and the error parameter δ. Both running times are only polylogarithmic in the number of data points N. Our algorithm provides substantial savings compared to the classical k-means algorithm that runs in time O(kdN) per iteration, particularly for the case of large datasets.
In the same work, they also proposed an adiabatic algorithm for the assignment step of the k-means algorithm, which can potentially provide an exponential speedup in the number of data points as well, in case the adiabatic algorithm performs exponentially better than the classical algorithm. The adiabatic algorithm is used in two places in this algorithm: first to select the initial centroids, and second to assign data points to the closest cluster. However, while arguments are given for its efficiency, it is left as an open problem to determine how well the adiabatic algorithm performs on average, both in terms of the quality of the solution and the running time.
Wiebe, Kapoor and Svore [8] apply the minimum finding algorithm [10] to obtain nearest-
neighbor methods for supervised and unsupervised learning. At a high level, they recovered a
Grover-type quadratic speedup with respect to the number of elements in the dataset in finding
the k nearest neighbors of a vector. Otterbach et al. [11] performed clustering by exploiting a well-
known reduction from clustering to the Maximum-Cut (MAXCUT) problem; the MAXCUT is then
solved using QAOA, a quantum algorithm for performing approximate combinatorial optimization
[12].
Let us remark on a recent breakthrough by Tang et al. [13, 14, 15], who proposed three classical
machine learning algorithms obtained by dequantizing recommendation systems [2] and low rank
linear system solvers. Like the quantum algorithms, the running time of these classical algorithms
is O(poly(k)polylog(mn)), that is poly-logarithmic in the dimension of the dataset and polynomial
in the rank. However, the polynomial dependence on the rank of the matrices is significantly worse
than the quantum algorithms and in fact renders these classical algorithms highly impractical. For
example, the new classical algorithm for stochastic regression inspired by the HHL algorithm [16]
has a running time of $\widetilde{O}(\kappa^6 k^{16} \|A\|_F^6 / \epsilon^6)$, which is impractical even for a rank-10 matrix.
The extremely high dependence on the rank and the other parameters implies not only that the
quantum algorithms are substantially faster (their dependence on the rank is sublinear!), but also
that in practice there exist much faster classical algorithms for these problems. While the results
of Tang et al. are based on the FKV methods [17], in classical linear algebra, algorithms based
on the CUR decomposition that have a running time linear in the dimension and quadratic in the
rank are preferred to the FKV methods [17, 18, 19]. For example, for the recommendation systems
matrix of Amazon or Netflix, the dimension of the matrix is $10^6 \times 10^7$, while the rank is certainly not lower than 100. The dependence on the dimension and rank of the quantum algorithm in [2] is $O(\sqrt{k} \log(mn)) \approx O(10^2)$, of the classical CUR-based algorithm it is $O(mk^2) \approx O(10^{11})$, while of the Tang algorithm it is $O(k^{16} \log(mn)) \approx O(10^{33})$.
It remains an open question to find classical algorithms for these machine learning problems
that are poly-logarithmic in the dimension and are competitive with respect to the quantum or
the classical algorithms for the same problems. This would involve using significantly different
techniques than the ones presently used for these algorithms.
of the closest centroid. (ii) Each centroid is updated to be the average of the data points assigned
to the corresponding cluster. These two steps are repeated until convergence, that is, until the
change in the centroids during one iteration is sufficiently small.
More precisely, we are given a dataset V of vectors $v_i \in \mathbb{R}^d$ for $i \in [N]$. At step t, we denote the k clusters by the sets $C_j^t$ for $j \in [k]$, and each corresponding centroid by the vector $c_j^t$. At each iteration, the data points $v_i$ are assigned to a cluster $C_j^t$ such that $C_1^t \cup C_2^t \cup \cdots \cup C_k^t = V$ and $C_i^t \cap C_l^t = \emptyset$ for $i \neq l$. Let $d(v_i, c_j^t)$ be the Euclidean distance between vectors $v_i$ and $c_j^t$. The first step of the algorithm assigns each $v_i$ a label $\ell(v_i)^t$ corresponding to the closest centroid, that is
$$\ell(v_i)^t = \operatorname{argmin}_{j \in [k]} (d(v_i, c_j^t)).$$
The centroids are then updated as $c_j^{t+1} = \frac{1}{|C_j^t|} \sum_{i \in C_j^t} v_i$, so that the new centroid is the average of all
points that have been assigned to the cluster in this iteration. We say that we have converged if
for a small threshold τ we have
$$\frac{1}{k} \sum_{j=1}^{k} d(c_j^t, c_j^{t-1}) \leq \tau.$$
The loss function that this algorithm aims to minimize is the RSS (residual sums of squares), the sum of the squared distances between points and the centroid of their cluster:
$$\mathrm{RSS} := \sum_{j \in [k]} \sum_{i \in C_j} d(c_j, v_i)^2.$$
The RSS decreases at each iteration of the k-means algorithm; the algorithm therefore converges to a local minimum of the RSS. The number of iterations T for convergence depends on the data and the number of clusters. A single iteration has complexity O(kNd), since the N vectors of dimension d have to be compared to each of the k centroids.
From a computational complexity point of view, we recall that it is NP-hard to find a clustering
that achieves the global minimum for the RSS. There are classical clustering algorithms based on
optimizing different loss functions, however the k-means algorithm uses the RSS as the objective
function.
The algorithm can be super-polynomial in the worst case (the number of iterations can be $2^{\omega(\sqrt{N})}$ [22]), but the number of iterations is usually small in practice. The k-means algorithm with a suitable heuristic like k-means++ to initialize the centroids finds a clustering such that the value of the RSS objective function is within a multiplicative $O(\log N)$ factor of the minimum value [21].
1.3 δ-k-means
We now consider a δ-robust version of k-means in which we introduce some noise. The noise affects the algorithm in both steps of k-means: label assignment and centroid estimation.
Let us describe the rule for the assignment step of δ-k-means more precisely. Let $c_i^*$ be the closest centroid to the data point $v_i$. Then, the set of possible labels $L_\delta(v_i)$ for $v_i$ is defined as follows:
$$L_\delta(v_i) = \{ c_p : |d^2(c_i^*, v_i) - d^2(c_p, v_i)| \leq \delta \}$$
The assignment rule selects arbitrarily a cluster label from the set Lδ (vi ).
Second, we add δ/2 noise during the calculation of the centroid. Let $C_j^{t+1}$ be the set of points which have been labeled by j in the previous step. For δ-k-means we pick a centroid $c_j^{t+1}$ with the property
$$\left\| c_j^{t+1} - \frac{1}{|C_j^{t+1}|} \sum_{v_i \in C_j^{t+1}} v_i \right\| < \frac{\delta}{2}.$$
One way to do this is to calculate the centroid exactly and then add some small Gaussian noise
to the vector to obtain the robust version of the centroid.
Let us add two remarks on the δ-k-means. First, for a well-clusterable data set and for a small δ, the number of vectors on the boundary that risk being misclassified in each step, that is, the vectors for which $|L_\delta(v_i)| > 1$, is typically much smaller than the number of vectors that are close to a unique centroid. Second, we also increase by δ/2 the convergence threshold from the k-means algorithm. All in all, δ-k-means is able to find a clustering that is robust when the data points and the centroids are perturbed with some noise of magnitude O(δ). As we will see in this work, q-means is the quantum equivalent of δ-k-means.
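To illustrate the two noisy rules above, here is a minimal classical sketch of one δ-k-means iteration. The definition only requires an arbitrary label from $L_\delta(v_i)$ and a centroid within δ/2 of the exact average; drawing the label uniformly at random and adding a truncated Gaussian perturbation to the exact centroid are our own illustrative choices.

```python
import numpy as np

def delta_kmeans_iteration(V, C, delta, rng):
    """One delta-k-means iteration. V: (N, d) data, C: (k, d) centroids."""
    d2 = ((V[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)   # squared distances
    best = d2.min(axis=1)
    # Noisy assignment: any centroid whose squared distance is within delta
    # of the minimum is an admissible label; pick one uniformly at random.
    labels = np.array([rng.choice(np.flatnonzero(row <= b + delta))
                       for row, b in zip(d2, best)])
    # Noisy update: exact average plus a perturbation of norm less than delta/2.
    C_new = np.empty_like(C)
    d = V.shape[1]
    for j in range(C.shape[0]):
        pts = V[labels == j]
        mean = pts.mean(axis=0) if len(pts) else C[j]
        noise = rng.normal(scale=delta / (4 * np.sqrt(d)), size=d)
        while np.linalg.norm(noise) >= delta / 2:
            noise = rng.normal(scale=delta / (4 * np.sqrt(d)), size=d)
        C_new[j] = mean + noise
    return C_new, labels
```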
Result 1. Given a dataset $V \in \mathbb{R}^{N \times d}$ stored in QRAM, the q-means algorithm outputs with high probability centroids $c_1, \cdots, c_k$ that are consistent with an output of the δ-k-means algorithm in time $\widetilde{O}\left( k d \frac{\eta}{\delta^2} \kappa(V)\left(\mu(V) + k \frac{\eta}{\delta}\right) + k^2 \frac{\eta^{1.5}}{\delta^2} \kappa(V) \mu(V) \right)$ per iteration, where $\kappa(V)$ is the condition number, $\mu(V)$ is a parameter that appears in quantum linear algebra procedures and $1 \leq \|v_i\|^2 \leq \eta$.
When we say that the q-means output is consistent with the δ-k-means, we mean that with
high probability the clusters that the q-means algorithm outputs are also possible outputs of the
δ-k-means.
We go further in our analysis and study a well-motivated model for datasets that allows for
good clustering. We call these datasets well-clusterable. One possible way to think of such datasets
is the following: a dataset is well-clusterable when the k clusters arise from picking k well-separated
vectors as their centroids, and then each point in the cluster is sampled from a Gaussian distribution
with small variance centered on the centroid of the cluster. We provide a rigorous definition in
following sections. For such well-clusterable datasets we can provide a tighter analysis of the
running time and have the following result, whose formal version appears as Theorem 5.2.
Result 2. Given a well-clusterable dataset $V \in \mathbb{R}^{N \times d}$ stored in QRAM, the q-means algorithm outputs with high probability k centroids $c_1, \cdots, c_k$ that are consistent with the output of the δ-k-means algorithm in time $\widetilde{O}\left( k^2 d \frac{\eta^{2.5}}{\delta^3} + k^{2.5} \frac{\eta^2}{\delta^3} \right)$ per iteration, where $1 \leq \|v_i\|^2 \leq \eta$.
In order to assess the running time and performance of our algorithm we performed extensive simu-
lations for different datasets. The running time of the q-means algorithm is linear in the dimension
d, which is necessary when outputting a classical description of the centroids, and polynomial in
the number of clusters k which is typically a small constant. The main advantage of the q-means
algorithm is that it provably depends logarithmically on the number of points, which can in many
cases provide a substantial speedup. The parameter δ (which plays the same role as in the δ-k-
means) is expected to be a large enough constant that depends on the data and the parameter η is
again expected to be a small constant for datasets whose data points have roughly the same norm.
For example, for the MNIST dataset, η can be less than 8 and δ can be taken to be equal to 0.5.
In Section 6 we present the results of our simulations. For different datasets we find parameters δ such that the number of iterations is practically the same as in k-means, and the δ-k-means algorithm converges to a clustering that achieves an accuracy similar to, or at times better than, the k-means algorithm. We obtained these simulation results by simulating the operations executed by the quantum algorithm and adding the appropriate errors in the procedures.
2 Quantum preliminaries
We assume a basic understanding of quantum computing; we recommend Nielsen and Chuang [23] for an introduction to the subject. A vector state $|v\rangle$ for $v \in \mathbb{R}^d$ is defined as $|v\rangle = \frac{1}{\|v\|} \sum_{m \in [d]} v_m |m\rangle$, where $|m\rangle$ represents $e_m$, the $m^{th}$ vector in the standard basis. The dataset is represented by a matrix $V \in \mathbb{R}^{N \times d}$, i.e. each row is a vector $v_i \in \mathbb{R}^d$ for $i \in [N]$ that represents a single data point. The cluster centers, called centroids, at time t are stored in the matrix $C^t \in \mathbb{R}^{k \times d}$, such that the $j^{th}$ row $c_j^t$ for $j \in [k]$ represents the centroid of the cluster $C_j^t$.
We denote as $V_k$ the optimal rank-k approximation of V, that is $V_k = \sum_{i=0}^{k} \sigma_i u_i v_i^T$, where $u_i, v_i$ are the row and column singular vectors respectively and the sum is over the largest k singular values $\sigma_i$. We denote as $V_{\geq \tau}$ the matrix $\sum_{i=0}^{\ell} \sigma_i u_i v_i^T$, where $\sigma_\ell$ is the smallest singular value which is greater than τ.
We will assume at some steps that the matrices V and $C^t$ are stored in suitable QRAM data structures, which are described in [2]. To prove our results, we are going to use the following tools:
Theorem 2.1 (Amplitude estimation [24]). Given a quantum algorithm
$$A : |0\rangle \to \sqrt{p}\,|y, 1\rangle + \sqrt{1-p}\,|G, 0\rangle,$$
where $|G\rangle$ is some garbage state, then for any positive integer P, the amplitude estimation algorithm outputs $\tilde{p}$ ($0 \leq \tilde{p} \leq 1$) such that
$$|\tilde{p} - p| \leq 2\pi \frac{\sqrt{p(1-p)}}{P} + \frac{\pi^2}{P^2}$$
with probability at least $8/\pi^2$. It uses exactly P iterations of the algorithm A. If p = 0 then $\tilde{p} = 0$ with certainty, and if p = 1 and P is even, then $\tilde{p} = 1$ with certainty.
In addition to amplitude estimation, we will make use of a tool developed in [8] to boost the
probability of getting a good estimate for the distances required for the q-means algorithm. At a high level, we take multiple copies of the estimator from the amplitude estimation procedure, compute
the median, and reverse the circuit to get rid of the garbage. Here we provide a theorem with
respect to time and not query complexity.
Theorem 2.2 (Median Evaluation [8]). Let U be a unitary operation that maps
$$U : |0^{\otimes n}\rangle \mapsto \sqrt{a}\,|x, 1\rangle + \sqrt{1-a}\,|G, 0\rangle$$
for some $1/2 < a \leq 1$ in time T. Then there exists a quantum algorithm that, for any $\Delta > 0$ and for any $1/2 < a_0 \leq a$, produces a state $|\Psi\rangle$ such that $\| |\Psi\rangle - |0^{\otimes nL}\rangle |x\rangle \| \leq \sqrt{2\Delta}$ for some integer L, in time
$$2T \left\lceil \frac{\ln(1/\Delta)}{2\left( |a_0| - \frac{1}{2} \right)^2} \right\rceil.$$
We also need some state preparation procedures. These subroutines are needed for encoding vectors $v_i \in \mathbb{R}^d$ into quantum states $|v_i\rangle$. An efficient state preparation procedure is provided by the QRAM data structures.
Theorem 2.3 (QRAM data structure [2]). Let $V \in \mathbb{R}^{N \times d}$; there is a data structure to store the rows of V such that,
1. The time to insert, update or delete a single entry $v_{ij}$ is $O(\log^2(N))$.
2. A quantum algorithm with access to the data structure can perform the following unitaries in time $T = O(\log^2 N)$.
(a) $|i\rangle |0\rangle \to |i\rangle |v_i\rangle$ for $i \in [N]$.
(b) $|0\rangle \to \sum_{i \in [N]} \|v_i\| \, |i\rangle$.
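As an aside, the following is a minimal classical sketch of the binary-tree layout underlying such a QRAM data structure, in the spirit of [2]: each row $v_i$ is stored in a tree whose leaves hold the squared entries (signs kept separately) and whose internal nodes hold the sums of their children, so that the rotation angles needed to prepare $|v_i\rangle$ can be read off a root-to-leaf path. This only illustrates the classical data layout, not the quantum access itself; all function names are ours.

```python
import numpy as np

def build_row_tree(v):
    """Binary tree for one row v (length padded to a power of two):
    levels[0][0] equals ||v||^2, the last level holds v_j^2; signs are kept separately."""
    d = 1 << int(np.ceil(np.log2(len(v))))
    leaf = np.zeros(d)
    leaf[:len(v)] = np.asarray(v, dtype=float) ** 2
    levels = [leaf]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append(prev[0::2] + prev[1::2])    # parent = sum of its two children
    levels.reverse()                              # root first
    return levels, np.sign(np.asarray(v, dtype=float))

def rotation_angles(levels, path):
    """Angles of the conditional rotations along a root-to-leaf path (path = list of bits):
    at each depth the amplitude splits according to the left-child / parent mass ratio."""
    angles = []
    node = 0
    for depth in range(len(levels) - 1):
        parent = levels[depth][node]
        left = levels[depth + 1][2 * node]
        p_left = 0.0 if parent == 0 else left / parent
        angles.append(np.arccos(np.sqrt(p_left)))
        node = 2 * node + (0 if path[depth] == 0 else 1)
    return angles
```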
In our algorithm we will also use subroutines for quantum linear algebra. For a symmetric matrix $M \in \mathbb{R}^{d \times d}$ with spectral norm $\|M\| = 1$ stored in the QRAM, the running time of these algorithms depends linearly on the condition number $\kappa(M)$ of the matrix, which can be replaced by $\kappa_\tau(M)$, a condition threshold where we keep only the singular values bigger than τ, and on the parameter $\mu(M)$, a matrix dependent parameter defined as
$$\mu(M) = \min_{p \in [0,1]} \left( \|M\|_F,\ \sqrt{s_{2p}(M)\, s_{2(1-p)}(M^T)} \right),$$
for $s_p(M) = \max_{i \in [n]} \sum_{j \in [d]} M_{ij}^p$. The different terms in the minimum in the definition of $\mu(M)$ correspond to different choices for the data structure for storing M, as detailed in [3]. Note that $\mu(M) \leq \|M\|_F \leq \sqrt{d}$ as we have assumed that $\|M\| = 1$. The running time also depends logarithmically on the relative error ε of the final outcome state [4, 25].
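Since $\mu(M)$ is a minimum over a single parameter p, it can be evaluated classically; the sketch below approximates it on a grid of values of p (the grid resolution and the use of absolute values for general real matrices are our own choices).

```python
import numpy as np

def s_p(M, p):
    """s_p(M) = max_i sum_j |M_ij|^p."""
    return np.max(np.sum(np.abs(M) ** p, axis=1))

def mu(M, grid=101):
    """mu(M) = min over p in [0,1] of ( ||M||_F, sqrt(s_{2p}(M) s_{2(1-p)}(M^T)) ),
    approximated on a uniform grid of p values."""
    best = np.linalg.norm(M, 'fro')
    for p in np.linspace(0.0, 1.0, grid):
        best = min(best, np.sqrt(s_p(M, 2 * p) * s_p(M.T, 2 * (1 - p))))
    return best
```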
Theorem 2.4 (Quantum linear algebra [4]). Let $M \in \mathbb{R}^{d \times d}$ such that $\|M\|_2 = 1$ and $x \in \mathbb{R}^d$. Let $\epsilon, \delta > 0$. If M is stored in appropriate QRAM data structures and the time to prepare $|x\rangle$ is $T_x$, then there exist quantum algorithms that with probability at least $1 - 1/\mathrm{poly}(d)$ return
1. A state $|z\rangle$ such that $\| |z\rangle - |Mx\rangle \| \leq \epsilon$ in time $\widetilde{O}((\kappa(M)\mu(M) + T_x \kappa(M)) \log(1/\epsilon))$.
2. A state $|z\rangle$ such that $\| |z\rangle - |M^{-1}x\rangle \| \leq \epsilon$ in time $\widetilde{O}((\kappa(M)\mu(M) + T_x \kappa(M)) \log(1/\epsilon))$.
3. A norm estimate $z \in (1 \pm \delta) \|Mx\|$, with relative error δ, in time $\widetilde{O}(T_x \frac{\kappa(M)\mu(M)}{\delta} \log(1/\epsilon))$.
The linear algebra procedures above can also be applied to any rectangular matrix $V \in \mathbb{R}^{N \times d}$ by considering instead the symmetric matrix $\overline{V} = \begin{pmatrix} 0 & V \\ V^T & 0 \end{pmatrix}$.
The final component needed for the q-means algorithm is a linear time algorithm for vector state tomography that will be used to recover classical information from the quantum states corresponding to the new centroids in each step. Given a unitary U that produces a quantum state $|x\rangle$, by calling U $O(d \log d / \epsilon^2)$ times, the tomography algorithm is able to reconstruct a vector $\tilde{x}$ that approximates $|x\rangle$ such that $\| |\tilde{x}\rangle - |x\rangle \| \leq \epsilon$.
Theorem 2.5 (Vector state tomography [26]). Given access to a unitary U such that $U|0\rangle = |x\rangle$ and its controlled version in time T(U), there is a tomography algorithm with time complexity $O(T(U) \frac{d \log d}{\epsilon^2})$ that produces a unit vector $\tilde{x} \in \mathbb{R}^d$ such that $\|\tilde{x} - x\|_2 \leq \epsilon$ with probability at least $(1 - 1/\mathrm{poly}(d))$.
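As a rough classical picture of where the $d \log d / \epsilon^2$ sample complexity comes from, the sketch below estimates the magnitudes $|x_i|$ of a state by sampling computational-basis outcomes; the sign-recovery round of the actual algorithm, which uses the controlled version of U, is omitted, and the constant in the sample count is arbitrary.

```python
import numpy as np

def sample_tomography(x, eps, rng, c=36.0):
    """Estimate |x_i| from computational-basis samples of |x>.
    The number of samples scales as d*log(d)/eps^2 (the constant c is arbitrary).
    Signs are not recovered by this round; Theorem 2.5 uses a second round
    with the controlled version of U for that."""
    x = np.asarray(x, dtype=float)
    x = x / np.linalg.norm(x)
    d = len(x)
    n = int(np.ceil(c * d * np.log(d) / eps ** 2))
    counts = np.bincount(rng.choice(d, size=n, p=x ** 2), minlength=d)
    return np.sqrt(counts / n)          # estimates of |x_i|
```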
Intuitively, the assumptions guarantee that most of the data can be easily assigned to one of
k clusters, since these points are close to the centroids, and the centroids are sufficiently far from
each other. The exact inequality comes from the error analysis, but in spirit it says that ξ 2 should
be bigger than a quantity that depends on β and the maximum norm η.
We now show that a well-clusterable dataset has a good rank-k approximation where k is the
number of clusters. This result will later be used for giving tight upper bounds on the running
time of the quantum algorithm for well-clusterable datasets. As we said, one can easily construct
such datasets by picking k well separated vectors to serve as cluster centers and then each point in
the cluster is sampled from a Gaussian distribution with small variance centered on the centroid of
the cluster.
Claim 3.1. Let $V_k$ be the optimal rank-k approximation for a well-clusterable data matrix V; then $\|V - V_k\|_F^2 \leq (\lambda \beta^2 + (1-\lambda)4\eta) \|V\|_F^2$.
Proof. Let $W \in \mathbb{R}^{N \times d}$ be the matrix with row $w_i = c_{l(v_i)}$, where $c_{l(v_i)}$ is the centroid closest to $v_i$. The matrix W has rank at most k as it has exactly k distinct rows. As $V_k$ is the optimal rank-k approximation to V, we have $\|V - V_k\|_F^2 \leq \|V - W\|_F^2$. It therefore suffices to upper bound $\|V - W\|_F^2$. Using the fact that V is well-clusterable, we have
$$\|V - W\|_F^2 = \sum_{ij} (v_{ij} - w_{ij})^2 = \sum_i d(v_i, c_{l(v_i)})^2 \leq \lambda N \beta^2 + (1-\lambda) N 4\eta,$$
where we used Definition 1 to say that for a λN fraction of the points $d(v_i, c_{l(v_i)})^2 \leq \beta^2$ and for the remaining points $d(v_i, c_{l(v_i)})^2 \leq 4\eta$. Also, as all $v_i$ have norm at least 1 we have $N \leq \|V\|_F^2$, implying that $\|V - V_k\|_F^2 \leq \|V - W\|_F^2 \leq (\lambda \beta^2 + (1-\lambda)4\eta) \|V\|_F^2$.
The running time of the quantum linear algebra routines for the data matrix V in Theorem 2.4 depends on the parameters $\mu(V)$ and $\kappa(V)$. We establish bounds on both of these parameters using the fact that V is well-clusterable.
Claim 3.2. Let V be a well-clusterable data matrix, then $\mu(V) := \frac{\|V\|_F}{\|V\|} = O(\sqrt{k})$.
Proof. We show that when we rescale V so that $\|V\| = 1$, then we have $\|V\|_F = O(\sqrt{k})$ for the rescaled matrix. From the triangle inequality we have that $\|V\|_F \leq \|V - V_k\|_F + \|V_k\|_F$. Using the fact that $\|V_k\|_F^2 = \sum_{i \in [k]} \sigma_i^2 \leq k$ and Claim 3.1, we have
$$\|V\|_F \leq \sqrt{\lambda \beta^2 + (1-\lambda)4\eta}\, \|V\|_F + \sqrt{k}.$$
Rearranging, we have that $\|V\|_F \leq \frac{\sqrt{k}}{1 - \sqrt{\lambda \beta^2 + (1-\lambda)4\eta}} = O(\sqrt{k})$.
We next show that if we use a condition threshold $\kappa_\tau(V)$ instead of the true condition number $\kappa(V)$, that is, we consider the matrix $V_{\geq \tau} = \sum_{\sigma_i \geq \tau} \sigma_i u_i v_i^T$ obtained by discarding the smaller singular values $\sigma_i < \tau$, the resulting matrix remains close to the original one, i.e. we have that $\|V - V_{\geq \tau}\|_F$ is bounded.
Claim 3.3. Let V be a matrix with a rank-k approximation given by $\|V - V_k\|_F \leq \epsilon_0 \|V\|_F$ and let $\tau = \frac{\epsilon_\tau}{\sqrt{k}} \|V\|_F$, then $\|V - V_{\geq \tau}\|_F \leq (\epsilon_0 + \epsilon_\tau) \|V\|_F$.
Proof. Let l be the smallest index such that $\sigma_l \geq \tau$, so that we have $\|V - V_{\geq \tau}\|_F = \|V - V_l\|_F$. We split the argument into two cases depending on whether l is smaller or greater than k.
• If $l \geq k$ then $\|V - V_l\|_F \leq \|V - V_k\|_F \leq \epsilon_0 \|V\|_F$.
• If $l < k$ then $\|V - V_l\|_F \leq \|V - V_k\|_F + \|V_k - V_l\|_F \leq \epsilon_0 \|V\|_F + \sqrt{\sum_{i=l+1}^{k} \sigma_i^2}$. As each $\sigma_i < \tau$ and the sum is over at most k indices, we have the upper bound $(\epsilon_0 + \epsilon_\tau) \|V\|_F$.
The reason we defined the notion of well-clusterable dataset is to be able to provide some strong
guarantees for the clustering of most points in the dataset. Note that the clustering problem in
the worst case is NP-hard and we only expect to have good results for datasets that have some
good property. Intuitively, we should only expect k-means to work when the dataset can actually
be clustered in k clusters. We show next that for a well-clusterable dataset V, there is a constant δ
that can be computed in terms of the parameters in Definition 1 such that the δ-k-means clusters
correctly most of the data points.
Claim 3.4. Let V be a well-clusterable data matrix. Then, for at least λN data points $v_i$, we have
$$\min_{j \neq \ell(i)} \left( d^2(v_i, c_j) - d^2(v_i, c_{\ell(i)}) \right) \geq \xi^2 - 2\sqrt{\eta}\beta,$$
which implies that a δ-k-means algorithm with any $\delta < \xi^2 - 2\sqrt{\eta}\beta$ will cluster these points correctly.
Proof. By Definition 1, we know that for a well-clusterable dataset V, we have that $d(v_i, c_{l(v_i)}) \leq \beta$ for at least λN data points, where $c_{l(v_i)}$ is the centroid closest to $v_i$. Further, the distance between each pair of the k centroids satisfies the bounds $2\sqrt{\eta} \geq d(c_i, c_j) \geq \xi$. By the triangle inequality, we have $d(v_i, c_j) \geq d(c_j, c_{\ell(i)}) - d(v_i, c_{\ell(i)})$. Squaring both sides of the inequality and rearranging,
Algorithm 1 q-means.
Require: Data matrix $V \in \mathbb{R}^{N \times d}$ stored in QRAM data structure. Precision parameter δ for k-means, error parameters $\epsilon_1$ for distance estimation, $\epsilon_2$ and $\epsilon_3$ for matrix multiplication and $\epsilon_4$ for tomography.
Ensure: Outputs vectors $c_1, c_2, \cdots, c_k \in \mathbb{R}^d$ that correspond to the centroids at the final step of the δ-k-means algorithm.
1: Select k initial centroids $c_1^0, \cdots, c_k^0$ and store them in QRAM data structure.
2: t = 0
3: repeat
4: Step 1: Centroid Distance Estimation
Perform the mapping (Theorem 4.1)
$$\frac{1}{\sqrt{N}} \sum_{i=1}^{N} |i\rangle \otimes_{j \in [k]} |j\rangle |0\rangle \mapsto \frac{1}{\sqrt{N}} \sum_{i=1}^{N} |i\rangle \otimes_{j \in [k]} |j\rangle |\overline{d^2(v_i, c_j^t)}\rangle \quad (1)$$
3.2 Perform matrix multiplication with matrix $V^T$ and vector $|\chi_j^t\rangle$ to obtain the state $|c_j^{t+1}\rangle$ with error $\epsilon_2$, along with an estimate of the norm $\|c_j^{t+1}\|$ with relative error $\epsilon_3$ (Theorem 2.4).
7: Step 4: Centroid Update
4.1 Perform tomography for the states $|c_j^{t+1}\rangle$ with precision $\epsilon_4$ using the operation from Steps 1-3 (Theorem 2.5) and get a classical estimate $\overline{c_j^{t+1}}$ for the new centroids such that $|\overline{c_j^{t+1}} - c_j^{t+1}| \leq \sqrt{\eta}(\epsilon_3 + \epsilon_4) = \epsilon_{centroids}$.
4.2 Update the QRAM data structure for the centroids with the new vectors $c_1^{t+1}, \cdots, c_k^{t+1}$.
8: t = t + 1
9: until convergence condition is satisfied.
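Our simulations in Section 6 execute precisely this loop classically, replacing the quantum subroutines by their error models: the distance estimates are perturbed by at most $\epsilon_1$ and the recovered centroids by at most $\epsilon_{centroids} = \sqrt{\eta}(\epsilon_3 + \epsilon_4)$. A minimal sketch of one such simulated iteration follows; the uniform noise on the distances is simply one way of realizing an additive error of at most $\epsilon_1$.

```python
import numpy as np

def simulated_qmeans_iteration(V, C, eps1, eps_centroids, rng):
    """One q-means iteration simulated classically with the error model of Algorithm 1."""
    d = V.shape[1]
    d2 = ((V[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    d2_noisy = d2 + rng.uniform(-eps1, eps1, size=d2.shape)   # Step 1: distance estimation
    labels = d2_noisy.argmin(axis=1)                          # Step 2: cluster assignment
    C_new = np.empty_like(C)
    for j in range(C.shape[0]):                               # Steps 3-4: centroid update
        pts = V[labels == j]
        mean = pts.mean(axis=0) if len(pts) else C[j]
        noise = rng.normal(size=d)
        noise *= rng.uniform(0, eps_centroids) / np.linalg.norm(noise)
        C_new[j] = mean + noise                               # tomography + norm error
    return C_new, labels
```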
used to calculate the average square distance between a point and all points in a cluster) can be
adapted to calculate the square distance or inner product (with sign) between two vectors stored
in the QRAM. The distance estimation becomes very efficient when we have quantum access to
the vectors and the centroids as in Theorem 2.3. That is, when we can query the state preparation
oracles |ii |0i 7→ |ii |vi i , and |ji |0i 7→ |ji |cj i in time T = O(log d), and we can also query the norm
of the vectors.
For q-means, we need to estimate distances or inner products between vectors which have different norms. At a high level, if we first estimate the inner product between the quantum states $|v_i\rangle$ and $|c_j\rangle$ corresponding to the normalized vectors and then multiply our estimator by the product of the vector norms, we will get an estimator for the inner product of the unnormalised vectors. A similar calculation works for the square distance instead of the inner product. If we have an absolute error ε for the square distance estimation of the normalized vectors, then the final error is of the order of $\epsilon \|v_i\| \|c_j\|$.
We present now the distance estimation theorem we need for the q-means algorithm and develop
its proof in the next subsection.
Theorem 4.1 (Centroid distance estimation). Let a data matrix $V \in \mathbb{R}^{N \times d}$ and a centroid matrix $C \in \mathbb{R}^{k \times d}$ be stored in QRAM, such that the unitaries $|i\rangle |0\rangle \mapsto |i\rangle |v_i\rangle$ and $|j\rangle |0\rangle \mapsto |j\rangle |c_j\rangle$ can be performed in time $O(\log(Nd))$ and the norms of the vectors are known. For any $\Delta > 0$ and $\epsilon_1 > 0$, there exists a quantum algorithm that performs the mapping
$$\frac{1}{\sqrt{N}} \sum_{i=1}^{N} |i\rangle \otimes_{j \in [k]} (|j\rangle |0\rangle) \mapsto \frac{1}{\sqrt{N}} \sum_{i=1}^{N} |i\rangle \otimes_{j \in [k]} (|j\rangle |\overline{d^2(v_i, c_j)}\rangle),$$
where $|\overline{d^2(v_i, c_j)} - d^2(v_i, c_j)| \leq \epsilon_1$ with probability at least $1 - 2\Delta$, in time $\widetilde{O}\left( \frac{k \eta \log(1/\Delta)}{\epsilon_1} \right)$, where $\eta = \max_i (\|v_i\|^2)$.
Lemma 4.2 (Distance / inner product estimation). Assume for a data matrix $V \in \mathbb{R}^{N \times d}$ and a centroid matrix $C \in \mathbb{R}^{k \times d}$ that the unitaries $|i\rangle |0\rangle \mapsto |i\rangle |v_i\rangle$ and $|j\rangle |0\rangle \mapsto |j\rangle |c_j\rangle$ can be performed in time T and the norms of the vectors are known. For any $\Delta > 0$ and $\epsilon_1 > 0$, there exists a quantum algorithm that computes
$|i\rangle |j\rangle |0\rangle \mapsto |i\rangle |j\rangle |\overline{d^2(v_i, c_j)}\rangle$, where $|\overline{d^2(v_i, c_j)} - d^2(v_i, c_j)| \leq \epsilon_1$ with probability at least $1 - 2\Delta$, or
$|i\rangle |j\rangle |0\rangle \mapsto |i\rangle |j\rangle |\overline{(v_i, c_j)}\rangle$, where $|\overline{(v_i, c_j)} - (v_i, c_j)| \leq \epsilon_1$ with probability at least $1 - 2\Delta$,
in time $\widetilde{O}\left( \frac{\|v_i\| \|c_j\| \, T \log(1/\Delta)}{\epsilon_1} \right)$.
Proof. Let us start by describing a procedure to estimate the square $\ell_2$ distance between the normalised vectors $|v_i\rangle$ and $|c_j\rangle$. We start with the initial state
$$|\phi_{ij}\rangle := |i\rangle |j\rangle \frac{1}{\sqrt{2}}(|0\rangle + |1\rangle) |0\rangle.$$
Then, we query the state preparation oracle controlled on the third register to perform the mappings $|i\rangle |j\rangle |0\rangle |0\rangle \mapsto |i\rangle |j\rangle |0\rangle |v_i\rangle$ and $|i\rangle |j\rangle |1\rangle |0\rangle \mapsto |i\rangle |j\rangle |1\rangle |c_j\rangle$. The state after this is given by
$$\frac{1}{\sqrt{2}} \left( |i\rangle |j\rangle |0\rangle |v_i\rangle + |i\rangle |j\rangle |1\rangle |c_j\rangle \right).$$
Finally, we apply a Hadamard gate on the third register to obtain
$$|i\rangle |j\rangle \left( \frac{1}{2} |0\rangle (|v_i\rangle + |c_j\rangle) + \frac{1}{2} |1\rangle (|v_i\rangle - |c_j\rangle) \right).$$
The probability of obtaining $|1\rangle$ when the third register is measured is
$$p_{ij} = \frac{1}{4}(2 - 2\langle v_i | c_j \rangle) = \frac{1}{4} d^2(|v_i\rangle, |c_j\rangle) = \frac{1 - \langle v_i | c_j \rangle}{2},$$
which is proportional to the square distance between the two normalised vectors.
We can rewrite $\frac{1}{2} |1\rangle (|v_i\rangle - |c_j\rangle)$ as $\sqrt{p_{ij}} |y_{ij}, 1\rangle$ (by swapping the registers), and hence we have the final mapping
$$A : |i\rangle |j\rangle |0\rangle \mapsto |i\rangle |j\rangle \left( \sqrt{p_{ij}} |y_{ij}, 1\rangle + \sqrt{1 - p_{ij}} |G_{ij}, 0\rangle \right), \quad (3)$$
where the probability $p_{ij}$ is proportional to the square distance between the normalised vectors and $G_{ij}$ is a garbage state. Note that the running time of A is $T_A = \widetilde{O}(T)$.
Now that we know how to apply the transformation described in Equation 3, we can use known techniques to perform the centroid distance estimation as defined in Theorem 4.1 within additive error $\epsilon_1$ with high probability. The method uses two tools: amplitude estimation, and the median evaluation of Theorem 2.2 from [8].
First, using amplitude estimation (Theorem 2.1) with the unitary A defined in Equation 3, we can create a unitary operation that maps
$$U : |i\rangle |j\rangle |0\rangle \mapsto |i\rangle |j\rangle \left( \sqrt{\alpha}\, |\overline{p_{ij}}, G, 1\rangle + \sqrt{1 - \alpha}\, |G', 0\rangle \right),$$
where $G, G'$ are garbage registers, $|\overline{p_{ij}} - p_{ij}| \leq \epsilon$ and $\alpha \geq 8/\pi^2$. The unitary U requires P iterations of A with $P = O(1/\epsilon)$. Amplitude estimation thus takes time $T_U = \widetilde{O}(T/\epsilon)$. We can now apply Theorem 2.2 for the unitary U to obtain a quantum state $|\Psi_{ij}\rangle$ such that
$$\left\| |\Psi_{ij}\rangle - |0\rangle^{\otimes L} |\overline{p_{ij}}, G\rangle \right\|_2 \leq \sqrt{2\Delta}.$$
The running time of the procedure is $O(T_U \ln(1/\Delta)) = \widetilde{O}\left( \frac{T}{\epsilon} \log(1/\Delta) \right)$.
Note that we can easily multiply the value $\overline{p_{ij}}$ by 4 in order to have the estimator of the square distance of the normalised vectors, or compute $1 - 2\overline{p_{ij}}$ for the normalized inner product. Last, the garbage state does not cause any problem in calculating the minimum in the next step, after which this step is uncomputed.
The last step is to show how to estimate the square distance or the inner product of the unnormalised vectors. Since we know the norms of the vectors, we can simply multiply the estimator of the normalised inner product by the product of the two norms to get an estimate for the inner product of the unnormalised vectors, and a similar calculation works for the distance. Note that the absolute error ε now becomes $\epsilon \|v_i\| \|c_j\|$ and hence, if we want to have an absolute error $\epsilon_1$ in the end, this will incur a factor of $\|v_i\| \|c_j\|$ in the running time. This concludes the proof of the lemma.
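One way to spell out this last rescaling step classically is the following sketch, in which additive noise on the normalized inner product stands in for the estimation error of the quantum subroutine (the noise model and function name are ours).

```python
import numpy as np

def estimate_ip_and_dist(vi, cj, eps, rng):
    """Recover the unnormalised inner product and squared distance from a noisy
    estimate of the normalised inner product <vi/||vi||, cj/||cj||>."""
    nvi, ncj = np.linalg.norm(vi), np.linalg.norm(cj)
    s_true = vi @ cj / (nvi * ncj)
    s_est = s_true + rng.uniform(-eps, eps)        # models the estimation error
    ip_est = nvi * ncj * s_est                     # error at most eps * ||vi|| * ||cj||
    dist2_est = nvi ** 2 + ncj ** 2 - 2 * ip_est   # squared distance of unnormalised vectors
    return ip_est, dist2_est
```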
The proof of the theorem follows rather straightforwardly from this lemma. In fact one just
needs to apply the above distance estimation procedure from Lemma 4.2 k times. Note also that
the norms of the centroids are always smaller than the maximum norm of a data point which gives
us the factor η.
Proof. The k-means update rule for the centroids is given by $c_j^{t+1} = \frac{1}{|C_j^t|} \sum_{i \in C_j^t} v_i$. As the columns of $V^T$ are the vectors $v_i$, this can be rewritten as $c_j^{t+1} = V^T \chi_j^t$, where $\chi_j^t \in \mathbb{R}^N$ is the rescaled characteristic vector of the cluster, with entries $(\chi_j^t)_i = \frac{1}{|C_j^t|}$ if $i \in C_j^t$ and 0 otherwise.
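The identity $c_j^{t+1} = V^T \chi_j^t$ is a purely classical statement and can be checked directly, as in the snippet below (the data and labels are arbitrary and serve only as an illustration).

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, k = 500, 12, 4
V = rng.normal(size=(N, d))
labels = rng.integers(k, size=N)

j = 2
members = np.flatnonzero(labels == j)
chi_j = np.zeros(N)
chi_j[members] = 1.0 / len(members)        # rescaled characteristic vector of cluster j

# V^T chi_j equals the average of the points assigned to cluster j
assert np.allclose(V.T @ chi_j, V[members].mean(axis=0))
```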
The above claim allows us to compute the updated centroids $c_j^{t+1}$ using quantum linear algebra operations. In fact, the state $|\psi^t\rangle$ can be written as a weighted superposition of the characteristic vectors of the clusters,
$$|\psi^t\rangle = \sum_{j=1}^{k} \sqrt{\frac{|C_j|}{N}} \left( \frac{1}{\sqrt{|C_j|}} \sum_{i \in C_j} |i\rangle \right) |j\rangle = \sum_{j=1}^{k} \sqrt{\frac{|C_j|}{N}}\, |\chi_j^t\rangle |j\rangle.$$
By measuring the last register, we can sample from the states $|\chi_j^t\rangle$ for $j \in [k]$, with probability proportional to the size of the cluster. We assume here that all k clusters are non-vanishing, in other words that they have size $\Omega(N/k)$. Given the ability to create the states $|\chi_j^t\rangle$ and given that the matrix V is stored in QRAM, we can now perform quantum matrix multiplication by $V^T$ to recover an approximation of the state $|V^T \chi_j^t\rangle = |c_j^{t+1}\rangle$ with error $\epsilon_2$, as stated in Theorem 2.4. Note that the error $\epsilon_2$ only appears inside a logarithm. The same theorem allows us to get an estimate of the norm $\|V^T \chi_j^t\| = \|c_j^{t+1}\|$ with relative error $\epsilon_3$. For this, we also need an estimate of the size of each cluster, namely the norms $\|\chi_j\|$. We already have this, since the measurements of the last register give us this estimate, and since the number of measurements made is large compared to k (they depend on d), the error from this source is negligible compared to other errors.
The running time of this step is derived from Theorem 2.4 where the time to prepare the
state |χtj i is the time of Steps 1 and 2. Note that we do not have to add an extra k factor due
to the sampling, since we can run the matrix multiplication procedures in parallel for all j so
that every time we measure a random |χtj i we perform one more step of the corresponding matrix
multiplication. Assuming that all clusters have size Ω(N/k) we will have an extra factor of O(log k)
in the running time by a standard coupon collector argument.
Proof. We can rewrite $\|\overline{c_j} - c_j\|$ as $\left\| \overline{\|c_j\|}\, |\overline{c_j}\rangle - \|c_j\|\, |c_j\rangle \right\|$. It follows from the triangle inequality that
$$\left\| \overline{\|c_j\|}\, |\overline{c_j}\rangle - \|c_j\|\, |c_j\rangle \right\| \leq \left\| \overline{\|c_j\|}\, |\overline{c_j}\rangle - \|c_j\|\, |\overline{c_j}\rangle \right\| + \left\| \|c_j\|\, |\overline{c_j}\rangle - \|c_j\|\, |c_j\rangle \right\|.$$
We have the upper bound $\|c_j\| \leq \sqrt{\eta}$. Using the bounds for the error we have from tomography and norm estimation, we can upper bound the first term by $\sqrt{\eta}\,\epsilon_3$ and the second term by $\sqrt{\eta}\,\epsilon_4$. The claim follows.
Let us make a remark about the ability to use Theorem 2.5 to perform tomography in our case.
The updated centroids will be recovered in step 4 using the vector state tomography algorithm in
Theorem 2.5 on the composition of the unitary that prepares |ψ t i and the unitary that multiplies
the first register of |ψ t i by the matrix V T . The input of the tomography algorithm requires a
unitary U such that $U|0\rangle = |x\rangle$ for a fixed quantum state $|x\rangle$. However, the labels $\ell(v_i)$ are not deterministic due to errors in distance estimation, hence the composed unitary U as defined above does not produce a fixed pure state $|x\rangle$.
We therefore need a procedure that finds labels $\ell(v_i)$ that are a deterministic function of $v_i$ and the centroids $c_j$ for $j \in [k]$. One solution is to change the update rule of the δ-k-means algorithm to the following: let $\ell(v_i) = j$ if $d(v_i, c_j) < d(v_i, c_{j'}) - 2\delta$ for $j' \neq j$, where we discard the points to which no label can be assigned. This assignment rule ensures that if the second register is measured and found to be in state $|j\rangle$, then the first register contains a uniform superposition of points from cluster j that are δ far from the cluster boundary (and possibly a few points that are δ close to the cluster boundary). Note that this simulates exactly the δ-k-means update rule while discarding some of the data points close to the cluster boundary. The k-means centroids are robust under such perturbations, so we expect this assignment rule to produce good results in practice.
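A classical rendering of this modified, deterministic assignment rule (returning None for the discarded points) could look as follows; the margin 2δ matches the rule stated above, and the function name is ours.

```python
import numpy as np

def deterministic_labels(V, C, delta):
    """Assign label j only if d(v_i, c_j) beats every other centroid by more than 2*delta;
    points too close to a cluster boundary are discarded (label None)."""
    dist = np.sqrt(((V[:, None, :] - C[None, :, :]) ** 2).sum(axis=2))
    order = np.argsort(dist, axis=1)
    best, second = order[:, 0], order[:, 1]
    rows = np.arange(len(V))
    margin_ok = dist[rows, best] < dist[rows, second] - 2 * delta
    return [int(b) if ok else None for b, ok in zip(best, margin_ok)]
```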
A better solution is to use consistent phase estimation instead of the usual phase estimation for the distance estimation step, which can be found in [28, 29]. The distance estimates are generated by the phase estimation algorithm applied to a certain unitary in the amplitude estimation step. The usual phase estimation algorithm does not produce a deterministic answer and instead, for each eigenvalue λ, outputs with high probability one of two possible estimates $\overline{\lambda}$ such that $|\overline{\lambda} - \lambda| \leq \epsilon$. Instead, here as in some other applications we need the consistent phase estimation algorithm that with high probability outputs a deterministic estimate such that $|\overline{\lambda} - \lambda| \leq \epsilon$.
We also describe another simple method of getting such consistent phase estimation, which is to combine phase estimation estimates that are obtained for two different precision values. Let us assume that the eigenvalues of the unitary U are $e^{2\pi i \theta_i}$ for $\theta_i \in [0, 1]$. First, we perform phase estimation with precision $\frac{1}{N_1}$ where $N_1 = 2^l$ is a power of 2. We repeat this procedure $O(\log N / \theta^2)$ times and output the median estimate. If the value being estimated is $\frac{\lambda + \alpha}{2^l}$ for $\lambda \in \mathbb{Z}$ and $\alpha \in [0, 1]$ and $|\alpha - 1/2| \geq \theta'$ for an explicit constant $\theta'$ (depending on θ), then with probability at least $1 - 1/\mathrm{poly}(N)$ the median estimate will be unique and will be equal to $1/2^l$ times the closest integer to $(\lambda + \alpha)$. In order to also produce a consistent estimate for the eigenvalues for the cases where the above procedure fails, we perform a second phase estimation with precision $\frac{2}{3N_1}$. We repeat this procedure as above for $O(\log N / \theta^2)$ iterations and take the median estimate. The second procedure fails to produce a consistent estimate only for eigenvalues $\frac{\lambda + \alpha}{2^l}$ for $\lambda \in \mathbb{Z}$ and $\alpha \in [0, 1]$ with $|\alpha - 1/3| \leq \theta'$ or $|\alpha - 2/3| \leq \theta'$ for a suitable constant $\theta'$. Since the cases where the two procedures fail are mutually exclusive, one of them succeeds with probability $1 - 1/\mathrm{poly}(N)$. The estimate produced by the phase estimation procedure is therefore deterministic with very high probability. In order to complete this proof sketch, we would have to give explicit values of the constants θ and $\theta'$ and the success probability, using the known distribution of outcomes for phase estimation.
For what follows, we assume that indeed the state in Equation 4 is almost a deterministic state,
meaning that when we repeat the procedure we get the same state with very high probability.
We set the error on the matrix multiplication to be $\epsilon_2 \frac{\epsilon_4^2}{d \log d}$ as we need to call the unitary that builds $c_j^{t+1}$ $O\left(\frac{d \log d}{\epsilon_4^2}\right)$ times. We will see that this does not increase the runtime of the algorithm, as the dependence of the runtime for matrix multiplication is logarithmic in the error.
5 Analysis
We provide our general theorem about the running time and accuracy of the q-means algorithm.
Theorem 5.1 (q-means). For a data matrix $V \in \mathbb{R}^{N \times d}$ stored in an appropriate QRAM data structure and parameter $\delta > 0$, the q-means algorithm with high probability outputs centroids consistent with the classical δ-k-means algorithm, in time $\widetilde{O}\left( k d \frac{\eta}{\delta^2} \kappa(V)\left(\mu(V) + k \frac{\eta}{\delta}\right) + k^2 \frac{\eta^{1.5}}{\delta^2} \kappa(V) \mu(V) \right)$ per iteration, where $\kappa(V)$ is the condition number, $\mu(V) = \min_{p \in [0,1]} \left( \|V\|_F,\ \sqrt{s_{2p}(V)\, s_{2(1-p)}(V^T)} \right)$, and $1 \leq \|v_i\|^2 \leq \eta$.
We prove the theorem in Sections 5.1 and 5.2 and then provide the running time of the algorithm
for well-clusterable datasets as Theorem 5.2.
for a point vi and a centroid cj . The second step finds the minimum of these distances without
adding any error.
For the q-means to output a cluster assignment consistent with the δ-k-means algorithm, we
require that:
$$\forall j \in [k], \quad |\overline{d^2(c_j, v_i)} - d^2(c_j, v_i)| \leq \frac{\delta}{2},$$
which implies that no centroid with distance more than δ above the minimum distance can be chosen by the q-means algorithm as the label. Thus we need to take $\epsilon_1 < \delta/2$.
After the cluster assignment of the q-means (which happens in superposition), we update the clusters by first performing a matrix multiplication to create the centroid states and estimate their norms, and then a tomography to get a classical description of the centroids. The error in this part is $\epsilon_{centroids}$, as defined in Claim 4.5, namely
$$\|\overline{c_j} - c_j\| \leq \epsilon_{centroids} = \sqrt{\eta}(\epsilon_3 + \epsilon_4).$$
Again, to ensure that the q-means is consistent with the classical δ-k-means algorithm, we take $\epsilon_3 < \frac{\delta}{4\sqrt{\eta}}$ and $\epsilon_4 < \frac{\delta}{4\sqrt{\eta}}$. Note also that we have ignored the error $\epsilon_2$, which we can easily deal with since it only appears in a logarithmic factor. Putting the steps together, the running time of one iteration in terms of the error parameters is
$$\widetilde{O}\left( k d \frac{1}{\epsilon_4^2} \kappa(V)\left( \mu(V) + k \frac{\eta}{\epsilon_1} \right) + k^2 \frac{\eta}{\epsilon_3 \epsilon_1} \kappa(V) \mu(V) \right).$$
The analysis in Section 5.1 shows that we can take $\epsilon_1 = \delta/2$ and $\epsilon_3 = \epsilon_4 = \frac{\delta}{4\sqrt{\eta}}$. Substituting these values in the above running time, it follows that the running time of the q-means algorithm is
$$\widetilde{O}\left( k d \frac{\eta}{\delta^2} \kappa(V)\left( \mu(V) + k \frac{\eta}{\delta} \right) + k^2 \frac{\eta^{1.5}}{\delta^2} \kappa(V) \mu(V) \right).$$
This completes the proof of Theorem 5.1. We next state our main result when applied to a well-clusterable dataset, as in Definition 3.
Theorem 5.2 (q-means on well-clusterable data). For a well-clusterable dataset $V \in \mathbb{R}^{N \times d}$ stored in appropriate QRAM, the q-means algorithm returns with high probability the k centroids consistently with the classical δ-k-means algorithm for a constant δ in time $\widetilde{O}\left( k^2 d \frac{\eta^{2.5}}{\delta^3} + k^{2.5} \frac{\eta^2}{\delta^3} \right)$ per iteration, for $1 \leq \|v_i\|^2 \leq \eta$.
Proof. Let $V \in \mathbb{R}^{N \times d}$ be a well-clusterable dataset as in Definition 1. In this case, we know by Claim 3.3 that $\kappa(V) = \frac{1}{\sigma_{\min}}$ can be replaced by a thresholded condition number $\kappa_\tau(V) = \frac{1}{\tau}$. In practice, this is done by discarding the singular values smaller than a certain threshold during quantum matrix multiplication. Remember that by Claim 3.2 we know that $\|V\|_F = O(\sqrt{k})$. Therefore we need to pick $\epsilon_\tau$ for a threshold $\tau = \frac{\epsilon_\tau}{\sqrt{k}} \|V\|_F$ such that $\kappa_\tau(V) = O\left(\frac{1}{\epsilon_\tau}\right)$.
Thresholding the singular values in the matrix multiplication step introduces an additional additive error in $\epsilon_{centroids}$. By Claim 3.3 and Claim 4.5, we have that the error $\epsilon_{centroids}$ in approximating the true centroids becomes $\sqrt{\eta}(\epsilon_3 + \epsilon_4 + \epsilon_0 + \epsilon_\tau)$ where $\epsilon_0 = \sqrt{\lambda \beta^2 + (1-\lambda)4\eta}$ is a dataset dependent parameter computed in Claim 3.1. We can set $\epsilon_\tau = \epsilon_3 = \epsilon_4 = \epsilon_0/3$ to obtain $\epsilon_{centroids} = 2\sqrt{\eta}\,\epsilon_0$.
The definition of the δ-k-means update rule requires that $\epsilon_{centroids} \leq \delta/2$. Further, Claim 3.4 shows that if the error δ in the assignment step satisfies $\delta \leq \xi^2 - 2\sqrt{\eta}\beta$, then the δ-k-means algorithm finds the correct clusters. By Definition 1 of a well-clusterable dataset, we can find a suitable constant δ satisfying both these constraints, namely satisfying
$$4\sqrt{\eta}\sqrt{\lambda \beta^2 + (1-\lambda)4\eta} < \delta < \xi^2 - 2\sqrt{\eta}\beta.$$
Substituting the values $\mu(V) = O(\sqrt{k})$ from Claim 3.2, $\kappa(V) = O\left(\frac{1}{\epsilon_\tau}\right)$ and $\epsilon_\tau = \epsilon_3 = \epsilon_4 = \epsilon_0/3$, so that $\kappa(V) = O(\sqrt{\eta}/\delta)$, in the running time for the general q-means algorithm, we obtain that the running time for the q-means algorithm on a well-clusterable dataset is $\widetilde{O}\left( k^2 d \frac{\eta^{2.5}}{\delta^3} + k^{2.5} \frac{\eta^2}{\delta^3} \right)$ per iteration.
Let us make some concluding remarks regarding the running time of q-means. For datasets where the number of points is much bigger than the other parameters, the running time of the q-means algorithm is an improvement over the classical k-means algorithm. For instance, for most problems in data analysis, k is typically small (< 100). The number of features satisfies $d \leq N$ in most situations, and it can be further reduced by applying a quantum dimensionality reduction algorithm first (which has running time poly-logarithmic in d). To sum up, q-means has the same output as the classical δ-k-means algorithm (which approximates k-means), it requires the same number of iterations, but has a running time only poly-logarithmic in N, giving an exponential speedup with respect to the size of the dataset.
Figure 1: Representation of 4 Gaussian clusters of 10 dimensions in a 3D space
spanned by the first three PCA dimensions.
form 4 Gaussian clusters with standard deviation 2.5, as shown in Figure 1. The condition number of the dataset is calculated to be 5.06. We ran k-means and δ-k-means for 7 different values of
δ to understand when the δ-k-means becomes less accurate.
In Figure 2 we can see that until η/δ = 3 (for δ = 1.2), the δ-k-means algorithm converges
on this dataset. We can now make some remarks about the impact of δ on the efficiency. It
seems natural that for small values of δ both algorithms are equivalent. For higher values of δ, we
observed a late start in the evolution of the accuracy, witnessing random assignments for points
on the clusters’ boundaries. However, the accuracy still reaches 100% in a few more steps. The
increase in the number of steps is a tradeoff with the parameter η/δ.
6.2 MNIST
The MNIST dataset is composed of 60,000 handwritten digits as images of 28×28 pixels (784 dimensions). From this raw data we first performed some dimensionality reduction processing, then we normalized the data such that the minimum norm is one. Note that, if we were doing
q-means with a quantum computer, we could use efficient quantum procedures equivalent to Linear
Discriminant Analysis, such as [7], or other quantum dimensionality reduction algorithms like
[1, 30].
As preprocessing of the data, we first performed a Principal Component Analysis (PCA), re-
taining data projected in a subspace of dimension 40. After normalization, the value of η was 8.25
(maximum norm of 2.87), and the condition number was 4.53. Figure 3 represents the evolution
of the accuracy during the k-means and δ-k-means for 4 different values of δ. In this numerical
experiment, we can see that for values of the parameter η/δ of order 20, both k-means and δ-k-
means reached a similar, yet low accuracy in the classification in the same number of steps. It
is important to notice that the MNIST dataset, without other preprocessing than dimensionality
reduction, is known not to be well-clusterable under the k-means algorithm.
Figure 3: Accuracy evolution on the MNIST dataset under k-means and δ-k-means for 4
different values of δ. Data has been preprocessed by a PCA to 40 dimensions. All versions
converge in the same number of steps, with a drop in the accuracy while δ increases. The
apparent difference in the number of steps until convergence is just due to the stopping
condition for k-means and δ-k-means.
On top of the accuracy measure (ACC), we also evaluated the performance of q-means against many other metrics, reported in Tables 1 and 2. More detailed information about these metrics can be found in [31, 32]. We introduce a specific measure of error, the Root Mean Square Error of Centroids (RMSEC), which is a direct comparison between the centroids predicted by the k-means algorithm and the ones predicted by the δ-k-means. It is a way of quantifying how accurately the centroids are predicted. Note that this metric can only be applied to the training set. For all these measures, except RMSEC, a bigger value is better. Our simulations show that δ-k-means, and thus the q-means, even for values of δ between 0.2 and 0.5, achieves performance similar to k-means, and in most cases the difference is on the third decimal point.
Table 1: A sample of results collected from the same experiments as in Figure 3. Different metrics
are presented for the train set and the test set. ACC: accuracy. HOM: homogeneity. COMP:
completeness. V-M: v-measure. AMI: adjusted mutual information. ARI: adjusted rand index.
RMSEC: Root Mean Square Error of Centroids.
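The precise formula for RMSEC is not spelled out above; a natural implementation, which we use only for illustration, pairs each δ-k-means centroid with its nearest k-means centroid and takes the root mean square of the resulting distances (the pairing convention is our assumption).

```python
import numpy as np

def rmsec(C_kmeans, C_delta):
    """Root Mean Square Error of Centroids: pair each delta-k-means centroid with the
    closest k-means centroid, then take the RMS of the pairwise distances."""
    d2 = ((C_delta[:, None, :] - C_kmeans[None, :, :]) ** 2).sum(axis=2)
    matched = d2.min(axis=1)          # squared distance to the nearest k-means centroid
    return float(np.sqrt(matched.mean()))
```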
These experiments have been repeated several times and each of them presented a similar
behavior despite the random initialization of the centroids.
Figure 4: Three accuracy evolutions on the MNIST dataset under k-means and
δ-k-means for 4 different values of δ. Each different behavior is due to the random
initialization of the centroids
Finally, we present a last experiment with the MNIST dataset with a different data preprocess-
ing. In order to reach higher accuracy in the clustering, we replace the previous dimensionality
reduction by a Linear Discriminant Analysis (LDA). Note that a LDA is a supervised process that
uses the labels (here, the digits) to project points in a well chosen lower dimensional subspace.
Thus this preprocessing cannot be applied in practice in unsupervised machine learning. However,
for the sake of benchmarking, by doing so k-means is able to reach a 87% accuracy, therefore it
allows us to compare k-means and δ-k-means on a real and almost well-clusterable dataset. In
the following, the MNIST dataset is reduced to 9 dimensions. The results in Figure 5 show that δ-k-means converges to the same accuracy as k-means even for values of η/δ down to 16. In some other cases, δ-k-means shows a faster convergence, due to random fluctuations that can help escape faster from a temporary equilibrium of the clusters.
Figure 5: Accuracy evolution on the MNIST dataset under k-means and δ-k-means for 4
different values of δ. Data has been preprocessed to 9 dimensions with a LDA reduction.
All versions of δ-k-means converge to the same accuracy as k-means in the same number
of steps.
Table 2: A sample of results collected from the same experiments as in Figure 5. Different metrics
are presented for the train set and the test set. ACC: accuracy. HOM: homogeneity. COMP:
completeness. V-M: v-measure. AMI: adjusted mutual information. ARI: adjusted rand index.
RMSEC: Root Mean Square Error of Centroids.
Let us remark that the values of η/δ in our experiment remained between 3 and 20. Moreover,
the parameter η, which is the maximum square norm of the points, provides a worst case guarantee
for the algorithm, while one can expect that the running time in practice will scale with the average
square norm of the points. For the MNIST dataset after PCA, this value is 2.65 whereas η = 8.3.
In conclusion, our simulations show that the convergence of δ-k-means is almost the same as the
regular k-means algorithms for large enough values of δ. This provides evidence that the q-means
algorithm will have as good performance as the classical k-means, and its running time will be
significantly lower for large datasets.
References
[1] S. Lloyd, M. Mohseni, and P. Rebentrost, “Quantum principal component analysis,” Nature
Physics, vol. 10, no. 9, p. 631, 2014.
[2] I. Kerenidis and A. Prakash, “Quantum recommendation systems,” Proceedings of the 8th
Innovations in Theoretical Computer Science Conference, 2017.
[3] I. Kerenidis and A. Prakash, “Quantum gradient descent for linear systems and least squares,”
arXiv:1704.04992, 2017.
[4] S. Chakraborty, A. Gilyén, and S. Jeffery, “The power of block-encoded matrix pow-
ers: improved regression techniques via faster Hamiltonian simulation,” arXiv preprint
arXiv:1804.01973, 2018.
[5] S. Lloyd, M. Mohseni, and P. Rebentrost, “Quantum algorithms for supervised and
unsupervised machine learning,” arXiv, vol. 1307.0411, pp. 1–11, 7 2013. [Online]. Available:
http://arxiv.org/abs/1307.0411
[6] J. Allcock, C.-Y. Hsieh, I. Kerenidis, and S. Zhang, “Quantum algorithms for feedforward
neural networks,” Manuscript, 2018.
[7] I. Kerenidis and A. Luongo, “Quantum classification of the MNIST dataset via slow feature
analysis,” arXiv preprint arXiv:1805.08837, 2018.
[9] E. Aïmeur, G. Brassard, and S. Gambs, “Quantum speed-up for unsupervised learning,” Ma-
chine Learning, vol. 90, no. 2, pp. 261–287, 2013.
[10] C. Durr and P. Hoyer, “A quantum algorithm for finding the minimum,” arXiv preprint quant-
ph/9607014, 1996.
[13] A. Gilyén, S. Lloyd, and E. Tang, “Quantum-inspired low-rank stochastic regression with
logarithmic dependence on the dimension,” arXiv preprint arXiv:1811.04909, 2018.
[14] E. Tang, “Quantum-inspired classical algorithms for principal component analysis and super-
vised clustering,” arXiv preprint arXiv:1811.00414, 2018.
[16] A. W. Harrow, A. Hassidim, and S. Lloyd, “Quantum algorithm for linear systems of equa-
tions,” Physical review letters, vol. 103, no. 15, p. 150502, 2009.
[17] A. Frieze, R. Kannan, and S. Vempala, “Fast monte-carlo algorithms for finding low-rank
approximations,” Journal of the ACM (JACM), vol. 51, no. 6, pp. 1025–1041, 2004.
[18] P. Drineas, A. Frieze, R. Kannan, S. Vempala, and V. Vinay, “Clustering large graphs via the
singular value decomposition,” Machine learning, vol. 56, no. 1-3, pp. 9–33, 2004.
[19] D. Achlioptas and F. McSherry, “Fast computation of low rank matrix approximations,” in
Proceedings of the 33rd Annual Symposium on Theory of Computing, 611-618, 2001.
[20] S. Lloyd, “Least squares quantization in pcm,” IEEE transactions on information theory,
vol. 28, no. 2, pp. 129–137, 1982.
[21] D. Arthur and S. Vassilvitskii, “k-means++: The advantages of careful seeding,” in Proceedings
of the eighteenth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial
and Applied Mathematics, 2007, pp. 1027–1035.
[22] D. Arthur and S. Vassilvitskii, “How slow is the k-means method?” in Proceedings of the
twenty-second annual symposium on Computational geometry. ACM, 2006, pp. 144–153.
[23] M. A. Nielsen and I. Chuang, “Quantum computation and quantum information,” 2002.
[24] G. Brassard, P. Hoyer, M. Mosca, and A. Tapp, “Quantum amplitude amplification and esti-
mation,” Contemporary Mathematics, vol. 305, pp. 53–74, 2002.
[25] A. Gilyén, Y. Su, G. H. Low, and N. Wiebe, “Quantum singular value transformation
and beyond: exponential improvements for quantum matrix arithmetics,” arXiv preprint
arXiv:1806.01838, 2018.
[26] I. Kerenidis and A. Prakash, “A quantum interior point method for LPs and SDPs,”
arXiv:1808.09266, 2018.
[28] A. Ta-Shma, “Inverting well conditioned matrices in quantum logspace,” in Proceedings of the
forty-fifth annual ACM symposium on Theory of computing. ACM, 2013, pp. 881–890.
[29] A. Ambainis, “Variable time amplitude amplification and quantum algorithms for linear alge-
bra problems,” in STACS’12 (29th Symposium on Theoretical Aspects of Computer Science),
vol. 14. LIPIcs, 2012, pp. 636–647.
[30] I. Cong and L. Duan, “Quantum discriminant analysis for dimensionality reduction and clas-
sification,” arXiv preprint arXiv:1510.00113, 2015.
[31] “A demo of k-means clustering on the handwritten digits data.” [Online]. Available: http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html
[32] J. Friedman, T. Hastie, and R. Tibshirani, The elements of statistical learning. Springer series
in statistics New York, NY, USA:, 2001, vol. 1, no. 10.