
q-means: A quantum algorithm for unsupervised machine learning

Iordanis Kerenidis^{1,2}, Jonas Landman^{1,3}, Alessandro Luongo^{1,4}, and Anupam Prakash^{1}

^1 CNRS, IRIF, Université Paris Diderot, Paris, France
^2 Centre for Quantum Technologies, National University of Singapore, Singapore
^3 Ecole Polytechnique, Palaiseau, France
^4 Atos Bull, Les Clayes Sous Bois, France

arXiv:1812.03584v2 [quant-ph] 11 Dec 2018

December 12, 2018

Abstract
Quantum machine learning is one of the most promising applications of a full-scale quan-
tum computer. Over the past few years, many quantum machine learning algorithms have
been proposed that can potentially offer considerable speedups over the corresponding classi-
cal algorithms. In this paper, we introduce q-means, a new quantum algorithm for clustering
which is a canonical problem in unsupervised machine learning. The q-means algorithm has
convergence and precision guarantees similar to k-means, and it outputs with high probability
a good approximation of the k cluster centroids like the classical algorithm. Given a dataset
of N d-dimensional vectors $v_i$ (seen as a matrix $V \in \mathbb{R}^{N\times d}$) stored in QRAM, the running
time of q-means is $\widetilde{O}\left(kd\frac{\eta}{\delta^2}\kappa(V)\left(\mu(V)+k\frac{\eta}{\delta}\right) + k^2\frac{\eta^{1.5}}{\delta^2}\kappa(V)\mu(V)\right)$ per iteration, where $\kappa(V)$ is
the condition number, $\mu(V)$ is a parameter that appears in quantum linear algebra procedures
and $\eta = \max_i \|v_i\|^2$. For a natural notion of well-clusterable datasets, the running time becomes
$\widetilde{O}\left(k^2 d\frac{\eta^{2.5}}{\delta^3} + k^{2.5}\frac{\eta^2}{\delta^3}\right)$ per iteration, which is linear in the number of features d, and polynomial
in the rank k, the maximum square norm η and the error parameter δ. Both running times
are only polylogarithmic in the number of datapoints N. Our algorithm provides substantial
savings compared to the classical k-means algorithm that runs in time O(kdN) per iteration,
particularly for the case of large datasets.

Emails: [email protected], [email protected], [email protected], [email protected]


1 Introduction
The last decade has witnessed the emergence of a scientific and industrial revolution, which lever-
aged our ability to process an increasing volume of data and extract value from it. Moreover,
the imminent widespread adoption of technologies such as the Internet of Things, IPv6, and 5G
internet communications is expected to generate an even bigger amount of data, most of which will
be unlabelled. As the amount of data generated in our society is expected to grow, more powerful
ways of processing information are needed. Quantum computation is a promising new paradigm
for performing fast computations. In recent years, there have been proposals for quantum machine
learning algorithms that have the potential to offer considerable speedups over the corresponding
classical algorithms, either exponential or large polynomial speedups [1, 2, 3, 4, 5, 6].
In most of these quantum machine learning applications, there are some common algorithmic
primitives that are used to build the algorithms. For instance, quantum procedures for linear
algebra (matrix multiplication, inversion, and projections in sub-eigenspaces of matrices), have
been used for recommendation systems or dimensionality reduction techniques [2, 7, 1]. Second,
the ability to estimate distances between quantum states, for example through the SWAP test, has
been used for supervised or unsupervised learning [5, 8]. We note that most of these procedures need
quantum access to the data, which can be achieved by storing the data in specific data structures
in a QRAM (Quantum Random Access Memory).
Here, we are interested in unsupervised learning and in particular in the canonical problem of
clustering: given a dataset represented as N vectors, we want to find an assignment of the vectors
to one of k labels (for a given k that we assume to know) such that similar vectors are assigned to
the same cluster. Often, the Euclidean distance is used to measure the similarity of vectors, but
other metrics might be used, according to the problem under consideration.
We propose q-means, a quantum algorithm for clustering, which can be viewed as a quantum
alternative to the classical k-means algorithm. More precisely, q-means is the equivalent of the
δ-k-means algorithm, a robust version of k-means that will be defined later. We provide a detailed
analysis to show that q-means has an output consistent with the classical δ-k-means algorithm and
further has a running time that depends poly-logarithmically on the number of elements in the
dataset. The last part of this work includes simulations which assess the performance and running
time of the q-means algorithm.

1.1 Related Work


In this section, we discuss previous work on quantum unsupervised learning and clustering. Aimeur,
Brassard and Gambs [9] gave two quantum algorithms for unsupervised learning using the ampli-
fication techniques from [10]. Specifically, they proposed an algorithm for clustering based on
minimum spanning trees that runs in time $\Theta(N^{3/2})$ and a quantum algorithm for k-median (a
problem related to k-means) with time complexity $O(N^{3/2}/\sqrt{k})$.
Lloyd, Mohseni and Rebentrost [5] proposed quantum k-means and nearest centroid algorithms
using an efficient subroutine for quantum distance estimation assuming as we do quantum access
to the data. Given a dataset of N vectors in a feature space of dimension d, the running time of
each iteration of the clustering algorithm (using a distance estimation procedure with error $\epsilon$) is
$O(\frac{kN\log d}{\epsilon})$ to produce the quantum state corresponding to the clusters. Note that the time is linear
in the number of data points and it will be linear in the dimension of the vectors if the algorithm
needs to output the classical description of the clusters.

In the same work, they also proposed an adiabatic algorithm for the assignment step of the
k-means algorithm, that can potentially provide an exponential speedup in the number of data
points as well, in the case the adiabatic algorithm performs exponentially better than the classical
algorithm. The adiabatic algorithm is used in two places for this algorithm, the first to select
the initial centroids, and the second to assign data points to the closest cluster. However, while
arguments are given for its efficiency, it is left as an open problem to determine how well the
adiabatic algorithm performs on average, both in terms of the quality of solution and the running
time.
Wiebe, Kapoor and Svore [8] apply the minimum finding algorithm [10] to obtain nearest-
neighbor methods for supervised and unsupervised learning. At a high level, they recovered a
Grover-type quadratic speedup with respect to the number of elements in the dataset in finding
the k nearest neighbors of a vector. Otterbach et al. [11] performed clustering by exploiting a well-
known reduction from clustering to the Maximum-Cut (MAXCUT) problem; the MAXCUT is then
solved using QAOA, a quantum algorithm for performing approximate combinatorial optimization
[12].
Let us remark on a recent breakthrough by Tang et al. [13, 14, 15], who proposed three classical
machine learning algorithms obtained by dequantizing recommendation systems [2] and low rank
linear system solvers. Like the quantum algorithms, the running time of these classical algorithms
is O(poly(k)polylog(mn)), that is poly-logarithmic in the dimension of the dataset and polynomial
in the rank. However, the polynomial dependence on the rank of the matrices is significantly worse
than the quantum algorithms and in fact renders these classical algorithms highly impractical. For
example, the new classical algorithm for stochastic regression inspired by the HHL algorithm [16]
has a running time of $\widetilde{O}(\kappa^6 k^{16}\|A\|_F^6/\epsilon^6)$, which is impractical even for a rank-10 matrix.
The extremely high dependence on the rank and the other parameters implies not only that the
quantum algorithms are substantially faster (their dependence on the rank is sublinear!), but also
that in practice there exist much faster classical algorithms for these problems. While the results
of Tang et al. are based on the FKV methods [17], in classical linear algebra, algorithms based
on the CUR decomposition that have a running time linear in the dimension and quadratic in the
rank are preferred to the FKV methods [17, 18, 19]. For example, for the recommendation systems
matrix of Amazon or Netflix, the dimension of the matrix is $10^6 \times 10^7$, while the rank is certainly
not lower than 100. The dependence on the dimension and rank of the quantum algorithm in [2] is
$O(\sqrt{k}\log(mn)) \approx O(10^2)$, of the classical CUR-based algorithm it is $O(mk^2) \approx O(10^{11})$, while of the
Tang algorithm it is $O(k^{16}\log(mn)) \approx O(10^{33})$.
It remains an open question to find classical algorithms for these machine learning problems
that are poly-logarithmic in the dimension and are competitive with respect to the quantum or
the classical algorithms for the same problems. This would involve using significantly different
techniques than the ones presently used for these algorithms.

1.2 The k-means algorithm


The k-means algorithm was introduced in [20], and is extensively used for unsupervised problems.
The inputs to the k-means algorithm are vectors $v_i \in \mathbb{R}^d$ for $i \in [N]$. These points must be partitioned
into k subsets according to a similarity measure, which in k-means is the Euclidean distance between
points. The output of the k-means algorithm is a list of k cluster centers, which are called centroids.
The algorithm starts by selecting k initial centroids randomly or using efficient heuristics like
the k-means++ [21]. It then alternates between two steps: (i) Each data point is assigned the label
of the closest centroid. (ii) Each centroid is updated to be the average of the data points assigned
to the corresponding cluster. These two steps are repeated until convergence, that is, until the
change in the centroids during one iteration is sufficiently small.
More precisely, we are given a dataset V of vectors $v_i \in \mathbb{R}^d$ for $i \in [N]$. At step t, we denote
the k clusters by the sets $C_j^t$ for $j \in [k]$, and each corresponding centroid by the vector $c_j^t$. At
each iteration, the data points $v_i$ are assigned to a cluster $C_j^t$ such that $C_1^t \cup C_2^t \cup \cdots \cup C_k^t = V$ and
$C_i^t \cap C_l^t = \emptyset$ for $i \neq l$. Let $d(v_i, c_j^t)$ be the Euclidean distance between vectors $v_i$ and $c_j^t$. The first
step of the algorithm assigns each $v_i$ a label $\ell(v_i)^t$ corresponding to the closest centroid, that is
$$\ell(v_i)^t = \mathrm{argmin}_{j\in[k]}(d(v_i, c_j^t)).$$
The centroids are then updated as $c_j^{t+1} = \frac{1}{|C_j^t|}\sum_{i \in C_j^t} v_i$, so that the new centroid is the average of all
points that have been assigned to the cluster in this iteration. We say that we have converged if
for a small threshold $\tau$ we have
$$\frac{1}{k}\sum_{j=1}^{k} d(c_j^t, c_j^{t-1}) \leq \tau.$$

The loss function that this algorithm aims to minimize is the RSS (residual sums of squares), the
sum of the squared distances between points and the centroid of their cluster:
$$\mathrm{RSS} := \sum_{j\in[k]} \sum_{i \in C_j} d(c_j, v_i)^2$$

The RSS decreases at each iteration of the k-means algorithm; the algorithm therefore converges
to a local minimum of the RSS. The number of iterations T for convergence depends on the data
and the number of clusters. A single iteration has complexity of O(kN d) since the N vectors of
dimension d have to be compared to each of the k centroids.
From a computational complexity point of view, we recall that it is NP-hard to find a clustering
that achieves the global minimum for the RSS. There are classical clustering algorithms based on
optimizing different loss functions, however the k-means algorithm uses the RSS as the objective
function.

The algorithm can be super-polynomial in the worst case (the number of iterations is
$2^{\omega(\sqrt{N})}$ [22]), but the number of iterations is usually small in practice. The k-means algorithm with
a suitable heuristic like k-means++ to initialize the centroids finds a clustering such that the value
for the RSS objective function is within a multiplicative $O(\log N)$ factor of the minimum value [21].
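
To make the per-iteration cost concrete, here is a minimal NumPy sketch of one k-means (Lloyd) iteration together with the RSS objective; the function and variable names are our own illustration, not code from the paper.

```python
import numpy as np

def kmeans_iteration(V, C):
    """One k-means (Lloyd) iteration: assign each of the N points in V (N x d)
    to the closest of the k centroids in C (k x d), then recompute the centroids.
    The pairwise distance computation costs O(kNd)."""
    # Squared Euclidean distances between every point and every centroid, shape (N, k).
    dists = ((V[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)                      # step (i): cluster assignment
    C_new = np.vstack([V[labels == j].mean(axis=0)     # step (ii): centroid update
                       if np.any(labels == j) else C[j]
                       for j in range(C.shape[0])])
    rss = dists[np.arange(V.shape[0]), labels].sum()   # residual sum of squares
    return labels, C_new, rss
```

Iterating this function until $\frac{1}{k}\sum_j d(c_j^t, c_j^{t-1}) \leq \tau$ reproduces the convergence criterion described above.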

1.3 δ-k-means
We now consider a δ-robust version of the k-means in which we introduce some noise. The noise
affects the algorithms in both of the steps of k-means: label assignment and centroid estimation.
Let us describe the rules for the assignment step of δ-k-means more precisely. Let $c_i^*$ be the
closest centroid to the data point $v_i$. Then, the set of possible labels $L_\delta(v_i)$ for $v_i$ is defined as
follows:
$$L_\delta(v_i) = \{c_p : |d^2(c_i^*, v_i) - d^2(c_p, v_i)| \leq \delta\}$$
The assignment rule selects arbitrarily a cluster label from the set $L_\delta(v_i)$.
Second, we add δ/2 noise during the calculation of the centroid. Let $C_j^{t+1}$ be the set of points
which have been labeled by j in the previous step. For δ-k-means we pick a centroid $c_j^{t+1}$ with the
property
$$\left\| c_j^{t+1} - \frac{1}{|C_j^{t+1}|}\sum_{v_i \in C_j^{t+1}} v_i \right\| < \frac{\delta}{2}.$$

One way to do this is to calculate the centroid exactly and then add some small Gaussian noise
to the vector to obtain the robust version of the centroid.
Let us add two remarks on δ-k-means. First, for a well-clusterable dataset and for a small
δ, the number of vectors on the boundary that risk being misclassified in each step, that is the
vectors for which $|L_\delta(v_i)| > 1$, is typically much smaller than the number of vectors that are close to
a unique centroid. Second, we also increase by δ/2 the convergence threshold from the k-means
algorithm. All in all, δ-k-means is able to find a clustering that is robust when the data points
and the centroids are perturbed with some noise of magnitude O(δ). As we will see in this work,
q-means is the quantum equivalent of δ-k-means.
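
Since δ-k-means is the classical algorithm that q-means emulates (and the one simulated in Section 6), the following sketch shows one possible implementation of its two noisy rules; the helper names are ours, and adding a bounded random perturbation is just one way to realize the δ/2 centroid noise.

```python
import numpy as np

def delta_kmeans_iteration(V, C, delta, rng):
    """One delta-k-means iteration on data V (N x d) with centroids C (k x d)."""
    d2 = ((V[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)   # squared distances (N, k)
    best = d2.min(axis=1, keepdims=True)
    labels = np.empty(V.shape[0], dtype=int)
    for i in range(V.shape[0]):
        # Assignment rule: pick an arbitrary label among the delta-close centroids,
        # L_delta(v_i) = {c_p : |d^2(c*_i, v_i) - d^2(c_p, v_i)| <= delta}.
        candidates = np.flatnonzero(d2[i] - best[i] <= delta)
        labels[i] = rng.choice(candidates)
    C_new = np.empty_like(C)
    for j in range(C.shape[0]):
        members = V[labels == j]
        mean_j = members.mean(axis=0) if len(members) else C[j]
        # Centroid rule: return any vector within delta/2 of the exact mean,
        # here realized by adding a random perturbation of norm < delta/2.
        noise = rng.normal(size=C.shape[1])
        noise *= (delta / 2) * rng.uniform() / np.linalg.norm(noise)
        C_new[j] = mean_j + noise
    return labels, C_new
```

Here `rng = np.random.default_rng(0)` provides the randomness; iterating this function with the convergence threshold raised by δ/2 reproduces the δ-k-means loop described above.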

1.4 Our results


We define and analyse a new quantum algorithm for clustering, the q-means algorithm, whose
performance is similar to that of the classical δ-k-means algorithm and whose running time provides
substantial savings, especially for the case of large data sets.
The q-means algorithm combines most of the advantages that quantum machine learning algo-
rithms can offer for clustering. First, the running time is poly-logarithmic in the number of elements
of the dataset and depends only linearly on the dimension of the feature space. Second, q-means
returns explicit classical descriptions of the cluster centroids that are obtained by the δ-k-means
algorithm. As the algorithm outputs a classical description of the centroids, it is possible to use
them in further (classical or quantum) classification algorithms.
Our q-means algorithm requires that the dataset is stored in a QRAM (Quantum Random
Access Memory), which allows the algorithm to use efficient linear algebra routines that have been
developed using QRAM data structures. Of course, our algorithm can also be used for clustering
datasets for which the data points can be efficiently prepared even without a QRAM, for example
if the data points are the outputs of quantum circuits.
We start by providing a worst case analysis of the running time of our algorithm, which depends
on parameters of the data matrix, for example the condition number and the parameter µ that
appears in the quantum linear algebra procedures. Note that with $\widetilde{O}$ we hide polylogarithmic factors.

Result 1. Given dataset $V \in \mathbb{R}^{N\times d}$ stored in QRAM, the q-means algorithm outputs with high
probability centroids $c_1, \cdots, c_k$ that are consistent with an output of the δ-k-means algorithm in time
$\widetilde{O}\left(kd\frac{\eta}{\delta^2}\kappa(V)\left(\mu(V)+k\frac{\eta}{\delta}\right) + k^2\frac{\eta^{1.5}}{\delta^2}\kappa(V)\mu(V)\right)$ per iteration, where $\kappa(V)$ is the condition number,
$\mu(V)$ is a parameter that appears in quantum linear algebra procedures and $1 \leq \|v_i\|^2 \leq \eta$.
When we say that the q-means output is consistent with the δ-k-means, we mean that with
high probability the clusters that the q-means algorithm outputs are also possible outputs of the
δ-k-means.
We go further in our analysis and study a well-motivated model for datasets that allows for
good clustering. We call these datasets well-clusterable. One possible way to think of such datasets
is the following: a dataset is well-clusterable when the k clusters arise from picking k well-separated
vectors as their centroids, and then each point in the cluster is sampled from a Gaussian distribution
with small variance centered on the centroid of the cluster. We provide a rigorous definition in
following sections. For such well-clusterable datasets we can provide a tighter analysis of the
running time and have the following result, whose formal version appears as Theorem 5.2.
Result 2. Given a well-clusterable dataset $V \in \mathbb{R}^{N\times d}$ stored in QRAM, the q-means algorithm
outputs with high probability k centroids $c_1, \cdots, c_k$ that are consistent with the output of the δ-k-means
algorithm in time $\widetilde{O}\left(k^2 d\frac{\eta^{2.5}}{\delta^3} + k^{2.5}\frac{\eta^2}{\delta^3}\right)$ per iteration, where $1 \leq \|v_i\|^2 \leq \eta$.

In order to assess the running time and performance of our algorithm we performed extensive simu-
lations for different datasets. The running time of the q-means algorithm is linear in the dimension
d, which is necessary when outputting a classical description of the centroids, and polynomial in
the number of clusters k which is typically a small constant. The main advantage of the q-means
algorithm is that it provably depends logarithmically on the number of points, which can in many
cases provide a substantial speedup. The parameter δ (which plays the same role as in the δ-k-
means) is expected to be a large enough constant that depends on the data and the parameter η is
again expected to be a small constant for datasets whose data points have roughly the same norm.
For example, for the MNIST dataset, η can be less than 8 and δ can be taken to be equal to 0.5.
In Section 6 we present the results of our simulations. For different datasets we find parameters
δ such that the number of iterations is practically the same as in the k-means, and the δ-k-means
algorithm converges to a clustering that achieves an accuracy similar to the k-means algorithm or
at times better. We obtained these simulation results by simulating the operations executed by the
quantum algorithm adding the appropriate errors in the procedures.

2 Quantum preliminaries
We assume a basic understanding of quantum computing; we recommend Nielsen and Chuang
[23] for an introduction to the subject. A vector state $|v\rangle$ for $v \in \mathbb{R}^d$ is defined as
$|v\rangle = \frac{1}{\|v\|}\sum_{m\in[d]} v_m |m\rangle$, where $|m\rangle$ represents $e_m$, the $m^{th}$ vector in the standard basis. The dataset
is represented by a matrix $V \in \mathbb{R}^{N\times d}$, i.e. each row is a vector $v_i \in \mathbb{R}^d$ for $i \in [N]$ that repre-
sents a single data point. The cluster centers, called centroids, at time t are stored in the matrix
$C^t \in \mathbb{R}^{k\times d}$, such that the $j^{th}$ row $c_j^t$ for $j \in [k]$ represents the centroid of the cluster $C_j^t$.
We denote as $V_k$ the optimal rank-k approximation of V, that is $V_k = \sum_{i=0}^{k}\sigma_i u_i v_i^T$, where $u_i, v_i$
are the row and column singular vectors respectively and the sum is over the largest k singular
values $\sigma_i$. We denote as $V_{\geq\tau}$ the matrix $\sum_{i=0}^{\ell}\sigma_i u_i v_i^T$ where $\sigma_\ell$ is the smallest singular value which
is greater than $\tau$.
We will assume at some steps that the matrices V and $C^t$ are stored in suitable QRAM
data structures, which are described in [2]. To prove our results, we are going to use the following
tools:
Theorem 2.1 (Amplitude estimation [24]). Given a quantum algorithm
$$A : |0\rangle \to \sqrt{p}\,|y, 1\rangle + \sqrt{1-p}\,|G, 0\rangle$$
where $|G\rangle$ is some garbage state, then for any positive integer P, the amplitude estimation algorithm
outputs $\tilde{p}$ ($0 \leq \tilde{p} \leq 1$) such that
$$|\tilde{p} - p| \leq 2\pi\frac{\sqrt{p(1-p)}}{P} + \left(\frac{\pi}{P}\right)^2$$
with probability at least $8/\pi^2$. It uses exactly P iterations of the algorithm A. If p = 0 then $\tilde{p} = 0$
with certainty, and if p = 1 and P is even, then $\tilde{p} = 1$ with certainty.
In addition to amplitude estimation, we will make use of a tool developed in [8] to boost the
probability of getting a good estimate for the distances required for the q-means algorithm. At a high
level, we take multiple copies of the estimator from the amplitude estimation procedure, compute
the median, and reverse the circuit to get rid of the garbage. Here we provide a theorem with
respect to time and not query complexity.
Theorem 2.2 (Median Evaluation [8]). Let U be a unitary operation that maps
$$U : |0^{\otimes n}\rangle \mapsto \sqrt{a}\,|x, 1\rangle + \sqrt{1-a}\,|G, 0\rangle$$
for some $1/2 < a \leq 1$ in time T. Then there exists a quantum algorithm that, for any $\Delta > 0$ and
for any $1/2 < a_0 \leq a$, produces a state $|\Psi\rangle$ such that $\| |\Psi\rangle - |0^{\otimes nL}\rangle|x\rangle \| \leq \sqrt{2\Delta}$ for some integer
L, in time
$$2T\left\lceil \frac{\ln(1/\Delta)}{2\left(|a_0| - \frac{1}{2}\right)^2} \right\rceil.$$

We also need some state preparation procedures. These subroutines are needed for encoding
vectors $v_i \in \mathbb{R}^d$ into quantum states $|v_i\rangle$. An efficient state preparation procedure is provided by
the QRAM data structures.
Theorem 2.3 (QRAM data structure [2]). Let $V \in \mathbb{R}^{N\times d}$, there is a data structure to store the
rows of V such that,
1. The time to insert, update or delete a single entry $v_{ij}$ is $O(\log^2(N))$.
2. A quantum algorithm with access to the data structure can perform the following unitaries
in time $T = O(\log^2 N)$.
(a) $|i\rangle|0\rangle \to |i\rangle|v_i\rangle$ for $i \in [N]$.
(b) $|0\rangle \to \sum_{i\in[N]} \|v_i\|\,|i\rangle$.
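
As a classical illustration of what these unitaries prepare, the sketch below simply writes down the amplitude vectors of $|v_i\rangle$ and of the norm state; it is our own toy example, not the QRAM construction of [2].

```python
import numpy as np

V = np.array([[1.0, 2.0, 2.0],
              [3.0, 0.0, 4.0]])          # toy data matrix, rows are the v_i

def row_state(V, i):
    """Amplitudes of |v_i> = (1/||v_i||) sum_m (v_i)_m |m>."""
    return V[i] / np.linalg.norm(V[i])

def norm_state(V):
    """Amplitudes of the state proportional to sum_i ||v_i|| |i>,
    normalized so that it is a valid quantum state."""
    norms = np.linalg.norm(V, axis=1)
    return norms / np.linalg.norm(norms)   # equals norms / ||V||_F

print(row_state(V, 0))   # [1/3, 2/3, 2/3]
print(norm_state(V))     # proportional to (3, 5)
```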
In our algorithm we will also use subroutines for quantum linear algebra. For a symmetric
matrix $M \in \mathbb{R}^{d\times d}$ with spectral norm $\|M\| = 1$ stored in the QRAM, the running time of these
algorithms depends linearly on the condition number $\kappa(M)$ of the matrix, which can be replaced
by $\kappa_\tau(M)$, a condition threshold where we keep only the singular values bigger than $\tau$, and the
parameter $\mu(M)$, a matrix-dependent parameter defined as
$$\mu(M) = \min_{p\in[0,1]}\left(\|M\|_F, \sqrt{s_{2p}(M)\,s_{2(1-p)}(M^T)}\right),$$
for $s_p(M) = \max_{i\in[n]}\sum_{j\in[d]} M_{ij}^p$. The different terms in the minimum in the definition of $\mu(M)$
correspond to different choices for the data structure for storing M, as detailed in [3]. Note
that $\mu(M) \leq \|M\|_F \leq \sqrt{d}$ as we have assumed that $\|M\| = 1$. The running time also depends
logarithmically on the relative error $\epsilon$ of the final outcome state [4, 25].
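
For concreteness, µ(M) can be evaluated numerically; the sketch below approximates the minimum over p by a grid search and assumes, as is usual for these generalized row norms, that $s_p$ is taken over the absolute values of the entries.

```python
import numpy as np

def s_p(M, p):
    """s_p(M) = max_i sum_j |M_ij|^p  (maximum generalized row norm)."""
    return np.max(np.sum(np.abs(M) ** p, axis=1))

def mu(M, num_grid=101):
    """Approximate mu(M) = min_{p in [0,1]} min( ||M||_F,
    sqrt( s_{2p}(M) * s_{2(1-p)}(M^T) ) ) by a grid search over p."""
    fro = np.linalg.norm(M, 'fro')
    best = min(np.sqrt(s_p(M, 2 * p) * s_p(M.T, 2 * (1 - p)))
               for p in np.linspace(0.0, 1.0, num_grid))
    return min(fro, best)

M = np.random.default_rng(0).normal(size=(6, 6))
M /= np.linalg.norm(M, 2)                # rescale so the spectral norm is 1
print(mu(M), np.linalg.norm(M, 'fro'))   # mu(M) <= ||M||_F <= sqrt(d)
```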
Theorem 2.4 (Quantum linear algebra [4]). Let $M \in \mathbb{R}^{d\times d}$ such that $\|M\|_2 = 1$ and $x \in \mathbb{R}^d$. Let
$\epsilon, \delta > 0$. If M is stored in appropriate QRAM data structures and the time to prepare $|x\rangle$ is $T_x$,
then there exist quantum algorithms that with probability at least $1 - 1/\mathrm{poly}(d)$ return
1. A state $|z\rangle$ such that $\| |z\rangle - |Mx\rangle \| \leq \epsilon$ in time $\widetilde{O}((\kappa(M)\mu(M) + T_x\kappa(M)) \log(1/\epsilon))$.
2. A state $|z\rangle$ such that $\| |z\rangle - |M^{-1}x\rangle \| \leq \epsilon$ in time $\widetilde{O}((\kappa(M)\mu(M) + T_x\kappa(M)) \log(1/\epsilon))$.
3. A norm estimate $z \in (1 \pm \delta)\|Mx\|$, with relative error $\delta$, in time $\widetilde{O}(T_x\frac{\kappa(M)\mu(M)}{\delta}\log(1/\epsilon))$.
The linear algebra procedures above can also be applied to any rectangular matrix $V \in \mathbb{R}^{N\times d}$
by considering instead the symmetric matrix $\overline{V} = \begin{pmatrix} 0 & V \\ V^T & 0 \end{pmatrix}$.
The final component needed for the q-means algorithm is a linear time algorithm for vector
state tomography that will be used to recover classical information from the quantum states corre-
sponding to the new centroids in each step. Given a unitary U that produces a quantum state $|x\rangle$,
by calling U $O(d\log d/\epsilon^2)$ times, the tomography algorithm is able to reconstruct a vector $\tilde{x}$ that
approximates $|x\rangle$ such that $\| |\tilde{x}\rangle - |x\rangle \| \leq \epsilon$.

Theorem 2.5 (Vector state tomography [26]). Given access to unitary U such that $U|0\rangle = |x\rangle$
and its controlled version in time T(U), there is a tomography algorithm with time complexity
$O(T(U)\frac{d\log d}{\epsilon^2})$ that produces unit vector $\tilde{x} \in \mathbb{R}^d$ such that $\|\tilde{x} - x\|_2 \leq \epsilon$ with probability at least
$(1 - 1/\mathrm{poly}(d))$.

3 Modelling well-clusterable datasets


In this section, we define a model for the dataset in order to provide a tight analysis on the running
time of our clustering algorithm. Note that we do not need this assumption for our general q-means
algorithm, but in this model we can provide tighter bounds for its running time. Without loss of
generality we consider in the remainder of the paper that the dataset V is normalized so that for
all $i \in [N]$ we have $1 \leq \|v_i\|$, and we define the parameter $\eta = \max_i \|v_i\|^2$. We will also assume
that the number k is the “right” number of clusters, meaning that we assume each cluster has at
least some Ω(N/k) data points.
We now introduce the notion of a well-clusterable dataset. The definition aims to capture
some properties that we can expect from datasets that can be clustered efficiently using a k-means
algorithm. Our notion of a well-clusterable dataset shares some similarity with the assumptions
made in [27], but there are also some differences specific to the clustering problem.

Definition 1 (Well-clusterable dataset). A data matrix $V \in \mathbb{R}^{N\times d}$ with rows $v_i \in \mathbb{R}^d$, $i \in [N]$ is
said to be well-clusterable if there exist constants $\xi, \beta > 0$, $\lambda \in [0, 1]$, $\eta \geq 1$, and cluster centroids
$c_i$ for $i \in [k]$ such that:
1. (separation of cluster centroids): $d(c_i, c_j) \geq \xi \quad \forall i, j \in [k]$
2. (proximity to cluster centroid): At least $\lambda N$ points $v_i$ in the dataset satisfy $d(v_i, c_{l(v_i)}) \leq \beta$
where $c_{l(v_i)}$ is the centroid nearest to $v_i$.
3. (Intra-cluster smaller than inter-cluster square distances): The following inequality is satis-
fied:
$$4\sqrt{\eta}\sqrt{\lambda\beta^2 + (1-\lambda)4\eta} \leq \xi^2 - 2\sqrt{\eta}\beta.$$

Intuitively, the assumptions guarantee that most of the data can be easily assigned to one of
k clusters, since these points are close to the centroids, and the centroids are sufficiently far from
each other. The exact inequality comes from the error analysis, but in spirit it says that $\xi^2$ should
be bigger than a quantity that depends on β and the maximum norm η.

We now show that a well-clusterable dataset has a good rank-k approximation where k is the
number of clusters. This result will later be used for giving tight upper bounds on the running
time of the quantum algorithm for well-clusterable datasets. As we said, one can easily construct
such datasets by picking k well-separated vectors to serve as cluster centers and then sampling each
point in a cluster from a Gaussian distribution with small variance centered on the centroid of that
cluster.
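
A minimal sketch of this construction (our own illustration, with arbitrary parameter values close to those used in Section 6.1): pick k well-separated centers, sample Gaussian clusters around them, and rescale so that the minimum norm is 1, as assumed in this section.

```python
import numpy as np

def make_well_clusterable(N=20000, d=10, k=4, spread=20.0, sigma=2.5, seed=0):
    """Sample N points in R^d from k Gaussian clusters with well-separated centers."""
    rng = np.random.default_rng(seed)
    centers = rng.uniform(-spread, spread, size=(k, d))    # well-separated centroids
    labels = rng.integers(0, k, size=N)                    # roughly N/k points per cluster
    V = centers[labels] + sigma * rng.normal(size=(N, d))  # small-variance Gaussian noise
    scale = np.linalg.norm(V, axis=1).min()
    V, centers = V / scale, centers / scale                # normalize: min_i ||v_i|| = 1
    eta = (np.linalg.norm(V, axis=1) ** 2).max()           # eta = max_i ||v_i||^2
    return V, labels, centers, eta
```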

Claim 3.1. Let $V_k$ be the optimal rank-k approximation for a well-clusterable data matrix V, then
$\|V - V_k\|_F^2 \leq (\lambda\beta^2 + (1-\lambda)4\eta)\|V\|_F^2$.

Proof. Let $W \in \mathbb{R}^{N\times d}$ be the matrix with row $w_i = c_{l(v_i)}$, where $c_{l(v_i)}$ is the centroid closest to
$v_i$. The matrix W has rank at most k as it has exactly k distinct rows. As $V_k$ is the optimal
rank-k approximation to V, we have $\|V - V_k\|_F^2 \leq \|V - W\|_F^2$. It therefore suffices to upper bound
$\|V - W\|_F^2$. Using the fact that V is well-clusterable, we have
$$\|V - W\|_F^2 = \sum_{ij}(v_{ij} - w_{ij})^2 = \sum_i d(v_i, c_{l(v_i)})^2 \leq \lambda N\beta^2 + (1-\lambda)N 4\eta,$$
where we used Definition 1 to say that for a $\lambda N$ fraction of the points $d(v_i, c_{l(v_i)})^2 \leq \beta^2$ and for
the remaining points $d(v_i, c_{l(v_i)})^2 \leq 4\eta$. Also, as all $v_i$ have norm at least 1 we have $N \leq \|V\|_F^2$,
implying that $\|V - V_k\|_F^2 \leq \|V - W\|_F^2 \leq (\lambda\beta^2 + (1-\lambda)4\eta)\|V\|_F^2$.

The running time of the quantum linear algebra routines for the data matrix V in Theorem 2.4
depends on the parameters $\mu(V)$ and $\kappa(V)$. We establish bounds on both of these parameters using
the fact that V is well-clusterable.

Claim 3.2. Let V be a well-clusterable data matrix, then $\mu(V) := \frac{\|V\|_F}{\|V\|} = O(\sqrt{k})$.

Proof. We show that when we rescale V so that $\|V\| = 1$, then we have $\|V\|_F = O(\sqrt{k})$ for the
rescaled matrix. From the triangle inequality we have that $\|V\|_F \leq \|V - V_k\|_F + \|V_k\|_F$. Using the
fact that $\|V_k\|_F^2 = \sum_{i\in[k]}\sigma_i^2 \leq k$ and Claim 3.1, we have
$$\|V\|_F \leq \sqrt{\lambda\beta^2 + (1-\lambda)4\eta}\,\|V\|_F + \sqrt{k}.$$
Rearranging, we have that $\|V\|_F \leq \frac{\sqrt{k}}{1 - \sqrt{\lambda\beta^2 + (1-\lambda)4\eta}} = O(\sqrt{k})$.

We next show that if we use a condition threshold $\kappa_\tau(V)$ instead of the true condition number
$\kappa(V)$, that is we consider the matrix $V_{\geq\tau} = \sum_{\sigma_i \geq \tau}\sigma_i u_i v_i^T$ obtained by discarding the smaller singular values
$\sigma_i < \tau$, the resulting matrix remains close to the original one, i.e. we have that $\|V - V_{\geq\tau}\|_F$ is
bounded.

Claim 3.3. Let V be a matrix with a rank-k approximation given by $\|V - V_k\|_F \leq \epsilon_0\|V\|_F$ and let
$\tau = \frac{\epsilon_\tau}{\sqrt{k}}\|V\|_F$, then $\|V - V_{\geq\tau}\|_F \leq (\epsilon_0 + \epsilon_\tau)\|V\|_F$.

Proof. Let $\ell$ be the smallest index such that $\sigma_\ell \geq \tau$, so that we have $\|V - V_{\geq\tau}\|_F = \|V - V_\ell\|_F$. We
split the argument into two cases depending on whether $\ell$ is smaller or greater than k.
• If $\ell \geq k$ then $\|V - V_\ell\|_F \leq \|V - V_k\|_F \leq \epsilon_0\|V\|_F$.
• If $\ell < k$ then $\|V - V_\ell\|_F \leq \|V - V_k\|_F + \|V_k - V_\ell\|_F \leq \epsilon_0\|V\|_F + \sqrt{\sum_{i=\ell+1}^{k}\sigma_i^2}$.
As each $\sigma_i < \tau$ and the sum is over at most k indices, we have the upper bound $(\epsilon_0 + \epsilon_\tau)\|V\|_F$.

The reason we defined the notion of well-clusterable dataset is to be able to provide some strong
guarantees for the clustering of most points in the dataset. Note that the clustering problem in
the worst case is NP-hard and we only expect to have good results for datasets that have some
good property. Intuitively, we should only expect k-means to work when the dataset can actually
be clusterd in k clusters. We show next that for a well-clusterable dataset V , there is a constant δ
that can be computed in terms of the parameters in Definition 1 such that the δ-k-means clusters
correctly most of the data points.
Claim 3.4. Let V be a well-clusterable data matrix. Then, for at least $\lambda N$ data points $v_i$, we have
$$\min_{j\neq\ell(i)}\left(d^2(v_i, c_j) - d^2(v_i, c_{\ell(i)})\right) \geq \xi^2 - 2\sqrt{\eta}\beta,$$
which implies that a δ-k-means algorithm with any $\delta < \xi^2 - 2\sqrt{\eta}\beta$ will cluster these points correctly.

Proof. By Definition 1, we know that for a well-clusterable dataset V, we have that $d(v_i, c_{l(v_i)}) \leq \beta$
for at least $\lambda N$ data points, where $c_{l(v_i)}$ is the centroid closest to $v_i$. Further, the distance
between each pair of the k centroids satisfies the bounds $2\sqrt{\eta} \geq d(c_i, c_j) \geq \xi$. By the triangle
inequality, we have $d(v_i, c_j) \geq d(c_j, c_{\ell(i)}) - d(v_i, c_{\ell(i)})$. Squaring both sides of the inequality and
rearranging,
$$d^2(v_i, c_j) - d^2(v_i, c_{\ell(i)}) \geq d^2(c_j, c_{\ell(i)}) - 2d(c_j, c_{\ell(i)})\,d(v_i, c_{\ell(i)}).$$
Substituting the bounds on the distances implied by the well-clusterability assumption, we obtain
$d^2(v_i, c_j) - d^2(v_i, c_{\ell(i)}) \geq \xi^2 - 2\sqrt{\eta}\beta$. This implies that as long as we pick $\delta < \xi^2 - 2\sqrt{\eta}\beta$, these
points are assigned to the correct cluster, since all other centroids are more than δ further away
than the correct centroid.
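
These parameters can be estimated numerically for a labeled dataset, which gives the admissible window for δ; the helper below is our own sketch and simply combines condition 3 of Definition 1 with Claim 3.4.

```python
import numpy as np

def delta_window(V, labels, centroids, lam=0.9):
    """Estimate the well-clusterability parameters of Definition 1 and the
    admissible range for delta (Claim 3.4). `lam` is the fraction lambda of
    points required to be beta-close to their centroid."""
    eta = (np.linalg.norm(V, axis=1) ** 2).max()
    # xi: minimum distance between distinct centroids
    diffs = centroids[:, None, :] - centroids[None, :, :]
    cdist = np.linalg.norm(diffs, axis=2)
    xi = cdist[~np.eye(len(centroids), dtype=bool)].min()
    # beta: radius containing a lambda fraction of the points around their centroid
    point_dist = np.linalg.norm(V - centroids[labels], axis=1)
    beta = np.quantile(point_dist, lam)
    lower = 4 * np.sqrt(eta) * np.sqrt(lam * beta**2 + (1 - lam) * 4 * eta)
    upper = xi**2 - 2 * np.sqrt(eta) * beta
    # any constant delta with lower <= delta < upper lets delta-k-means cluster
    # at least a lambda fraction of the points correctly
    return lower, upper
```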

4 The q-means algorithm


The q-means algorithm is given as Algorithm 1. At a high level, it follows the same steps as
the classical k-means algorithm, where we now use quantum subroutines for distance estimation,
finding the minimum value among a set of elements, matrix multiplication for obtaining the new
centroids as quantum states, and efficient tomography. First, we pick some random initial points,
using some classical technique, for example k-means++ [21]. Then, in Steps 1 and 2 all data points
are assigned to clusters, and in Steps 3 and 4 we update the centroids of the clusters. The process
is repeated until convergence.

4.1 Step 1: Centroid distance estimation


The first step of the algorithm estimates the square distance between data points and clusters
using a quantum procedure. This can be done using the Swap Test as in [5] and also using the
Frobenius distance estimation procedure [7]. Indeed, the subroutine presented in [7] (originally

Algorithm 1 q-means.
Require: Data matrix $V \in \mathbb{R}^{N\times d}$ stored in QRAM data structure. Precision parameter δ for
k-means, error parameters $\epsilon_1$ for distance estimation, $\epsilon_2$ and $\epsilon_3$ for matrix multiplication and
$\epsilon_4$ for tomography.
Ensure: Outputs vectors $c_1, c_2, \cdots, c_k \in \mathbb{R}^d$ that correspond to the centroids at the final step of
the δ-k-means algorithm.

1: Select k initial centroids $c_1^0, \cdots, c_k^0$ and store them in QRAM data structure.
2: t = 0
3: repeat
4: Step 1: Centroid Distance Estimation
Perform the mapping (Theorem 4.1)
$$\frac{1}{\sqrt{N}}\sum_{i=1}^{N}|i\rangle \otimes_{j\in[k]} |j\rangle|0\rangle \mapsto \frac{1}{\sqrt{N}}\sum_{i=1}^{N}|i\rangle \otimes_{j\in[k]} |j\rangle|\overline{d^2(v_i, c_j^t)}\rangle \qquad (1)$$
where $|\overline{d^2(v_i, c_j^t)} - d^2(v_i, c_j^t)| \leq \epsilon_1$.

5: Step 2: Cluster Assignment
Find the minimum distance among $\{\overline{d^2(v_i, c_j^t)}\}_{j\in[k]}$ (Lemma 4.3), then uncompute Step 1 to
create the superposition of all points and their labels
$$\frac{1}{\sqrt{N}}\sum_{i=1}^{N}|i\rangle \otimes_{j\in[k]} |j\rangle|\overline{d^2(v_i, c_j^t)}\rangle \mapsto \frac{1}{\sqrt{N}}\sum_{i=1}^{N}|i\rangle|\ell^t(v_i)\rangle \qquad (2)$$

6: Step 3: Centroid states creation
3.1 Measure the label register to obtain a state $|\chi_j^t\rangle = \frac{1}{\sqrt{|C_j^t|}}\sum_{i\in C_j^t}|i\rangle$, with prob. $\frac{|C_j^t|}{N}$.
3.2 Perform matrix multiplication with matrix $V^T$ and vector $|\chi_j^t\rangle$ to obtain the state $|c_j^{t+1}\rangle$
with error $\epsilon_2$, along with an estimation of $\|c_j^{t+1}\|$ with relative error $\epsilon_3$ (Theorem 2.4).
7: Step 4: Centroid Update
4.1 Perform tomography for the states $|c_j^{t+1}\rangle$ with precision $\epsilon_4$ using the operation from
Steps 1-3 (Theorem 2.5) and get a classical estimate $\overline{c}_j^{t+1}$ for the new centroids such that
$\|\overline{c}_j^{t+1} - c_j^{t+1}\| \leq \sqrt{\eta}(\epsilon_3 + \epsilon_4) = \epsilon_{centroids}$
4.2 Update the QRAM data structure for the centroids with the new vectors $\overline{c}_1^{t+1}, \cdots, \overline{c}_k^{t+1}$.
8: t = t + 1
9: until convergence condition is satisfied.

Indeed, the subroutine presented in [7] (originally used to calculate the average square distance
between a point and all points in a cluster) can be adapted to calculate the square distance or inner
product (with sign) between two vectors stored in the QRAM. The distance estimation becomes
very efficient when we have quantum access to
the vectors and the centroids as in Theorem 2.3. That is, when we can query the state preparation
oracles $|i\rangle|0\rangle \mapsto |i\rangle|v_i\rangle$ and $|j\rangle|0\rangle \mapsto |j\rangle|c_j\rangle$ in time $T = O(\log d)$, and we can also query the norm
of the vectors.
For q-means, we need to estimate distances or inner products between vectors which have
different norms. At a high level, if we first estimate the inner product between the quantum states $|v_i\rangle$ and
$|c_j\rangle$ corresponding to the normalized vectors and then multiply our estimator by the product of the
vector norms, we will get an estimator for the inner product of the unnormalised vectors. A similar
calculation works for the square distance instead of the inner product. If we have an absolute error
$\epsilon$ for the square distance estimation of the normalized vectors, then the final error is of the order
of $\epsilon\|v_i\|\|c_j\|$.
We present now the distance estimation theorem we need for the q-means algorithm and develop
its proof in the next subsection.

Theorem 4.1 (Centroid Distance estimation). Let a data matrix V ∈ RN ×d and a centroid matrix
C ∈ Rk×d be stored in QRAM, such that the following unitaries |ii |0i 7→ |ii |vi i , and |ji |0i 7→
|ji |cj i can be performed in time O(log(N d)) and the norms of the vectors are known. For any
∆ > 0 and 1 > 0, there exists a quantum algorithm that performs the mapping
N N
1 X 1 X
√ |ii ⊗j∈[k] (|ji |0i) 7→ √ |ii ⊗j∈[k] (|ji |d2 (vi , cj )i),
N i=1 N i=1
 
e k η log(1/∆) where
where |d2 (vi , cj ) − d2 (vi , cj )| 6 1 with probability at least 1 − 2∆, in time O 1
η = maxi (kvi k2 ).

4.2 Proof of Theorem 4.1


The theorem will follow easily from the following lemma which computes the square distance or
inner product of two vectors.

Lemma 4.2 (Distance / Inner Products Estimation). Assume for a data matrix $V \in \mathbb{R}^{N\times d}$ and
a centroid matrix $C \in \mathbb{R}^{k\times d}$ that the following unitaries $|i\rangle|0\rangle \mapsto |i\rangle|v_i\rangle$ and $|j\rangle|0\rangle \mapsto |j\rangle|c_j\rangle$ can
be performed in time T and the norms of the vectors are known. For any $\Delta > 0$ and $\epsilon_1 > 0$, there
exists a quantum algorithm that computes
$$|i\rangle|j\rangle|0\rangle \mapsto |i\rangle|j\rangle|\overline{d^2(v_i, c_j)}\rangle, \text{ where } |\overline{d^2(v_i, c_j)} - d^2(v_i, c_j)| \leq \epsilon_1 \text{ with probability at least } 1 - 2\Delta, \text{ or}$$
$$|i\rangle|j\rangle|0\rangle \mapsto |i\rangle|j\rangle|\overline{(v_i, c_j)}\rangle, \text{ where } |\overline{(v_i, c_j)} - (v_i, c_j)| \leq \epsilon_1 \text{ with probability at least } 1 - 2\Delta,$$
in time $\widetilde{O}\left(\frac{\|v_i\|\|c_j\|\,T\log(1/\Delta)}{\epsilon_1}\right)$.

Proof. Let us start by describing a procedure to estimate the square $\ell_2$ distance between the
normalised vectors $|v_i\rangle$ and $|c_j\rangle$. We start with the initial state
$$|\phi_{ij}\rangle := |i\rangle|j\rangle\frac{1}{\sqrt{2}}(|0\rangle + |1\rangle)|0\rangle$$
Then, we query the state preparation oracle controlled on the third register to perform the
mappings $|i\rangle|j\rangle|0\rangle|0\rangle \mapsto |i\rangle|j\rangle|0\rangle|v_i\rangle$ and $|i\rangle|j\rangle|1\rangle|0\rangle \mapsto |i\rangle|j\rangle|1\rangle|c_j\rangle$. The state after this is given
by
$$\frac{1}{\sqrt{2}}(|i\rangle|j\rangle|0\rangle|v_i\rangle + |i\rangle|j\rangle|1\rangle|c_j\rangle)$$
Finally, we apply a Hadamard gate on the third register to obtain
$$|i\rangle|j\rangle\left(\frac{1}{2}|0\rangle(|v_i\rangle + |c_j\rangle) + \frac{1}{2}|1\rangle(|v_i\rangle - |c_j\rangle)\right)$$
The probability of obtaining $|1\rangle$ when the third register is measured is
$$p_{ij} = \frac{1}{4}(2 - 2\langle v_i|c_j\rangle) = \frac{1}{4}d^2(|v_i\rangle, |c_j\rangle) = \frac{1 - \langle v_i|c_j\rangle}{2},$$
which is proportional to the square distance between the two normalised vectors.
We can rewrite $|1\rangle(|v_i\rangle - |c_j\rangle)$ as $|y_{ij}, 1\rangle$ (by swapping the registers), and hence we have the
final mapping
$$A : |i\rangle|j\rangle|0\rangle \mapsto |i\rangle|j\rangle\left(\sqrt{p_{ij}}\,|y_{ij}, 1\rangle + \sqrt{1-p_{ij}}\,|G_{ij}, 0\rangle\right) \qquad (3)$$
where the probability $p_{ij}$ is proportional to the square distance between the normalised vectors and
$G_{ij}$ is a garbage state. Note that the running time of A is $T_A = \widetilde{O}(T)$.
Now that we know how to apply the transformation described in Equation 3, we can use known
techniques to perform the centroid distance estimation as defined in Theorem 4.1 within additive
error $\epsilon$ with high probability. The method uses two tools: amplitude estimation (Theorem 2.1) and the median
evaluation (Theorem 2.2) from [8].
First, using amplitude estimation (Theorem 2.1) with the unitary A defined in Equation 3, we
can create a unitary operation that maps
$$U : |i\rangle|j\rangle|0\rangle \mapsto |i\rangle|j\rangle\left(\sqrt{\alpha}\,|\overline{p_{ij}}, G, 1\rangle + \sqrt{1-\alpha}\,|G', 0\rangle\right)$$
where $G, G'$ are garbage registers, $|\overline{p_{ij}} - p_{ij}| \leq \epsilon$ and $\alpha \geq 8/\pi^2$. The unitary U requires P iterations
of A with $P = O(1/\epsilon)$. Amplitude estimation thus takes time $T_U = \widetilde{O}(T/\epsilon)$. We can now apply
Theorem 2.2 for the unitary U to obtain a quantum state $|\Psi_{ij}\rangle$ such that
$$\| |\Psi_{ij}\rangle - |0\rangle^{\otimes L}|\overline{p_{ij}}, G\rangle \| \leq \sqrt{2\Delta}.$$
The running time of the procedure is $O(T_U\ln(1/\Delta)) = \widetilde{O}(\frac{T}{\epsilon}\log(1/\Delta))$.
Note that we can easily multiply the value $\overline{p_{ij}}$ by 4 in order to have the estimator of the square
distance of the normalised vectors or compute $1 - 2\overline{p_{ij}}$ for the normalized inner product. Last, the
garbage state does not cause any problem in calculating the minimum in the next step, after which
this step is uncomputed.
The last step is to show how to estimate the square distance or the inner product of the
unnormalised vectors. Since we know the norms of the vectors, we can simply multiply the estimator
of the normalised inner product with the product of the two norms to get an estimate for the inner
product of the unnormalised vectors, and a similar calculation works for the distance. Note that
the absolute error $\epsilon$ now becomes $\epsilon\|v_i\|\|c_j\|$ and hence if we want to have in the end an absolute
error $\epsilon$ this will incur a factor of $\|v_i\|\|c_j\|$ in the running time. This concludes the proof of the
lemma.

The proof of the theorem follows rather straightforwardly from this lemma. In fact one just
needs to apply the above distance estimation procedure from Lemma 4.2 k times. Note also that
the norms of the centroids are always smaller than the maximum norm of a data point which gives
us the factor η.
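
Classically, the effect of this distance-estimation subroutine can be emulated by computing $p_{ij}$ for the normalized vectors, perturbing it, and rescaling by the norms; the error model in the sketch below is our simplified stand-in for amplitude estimation.

```python
import numpy as np

def estimated_sq_distance(v, c, eps1, rng):
    """Emulate the quantum estimate of d^2(v, c) from Lemma 4.2 / Theorem 4.1.
    The circuit measures p = (1 - <v_hat|c_hat>) / 2 for the normalized vectors;
    we add bounded noise so the final squared-distance error is at most eps1."""
    nv, nc = np.linalg.norm(v), np.linalg.norm(c)
    p = (1.0 - np.dot(v, c) / (nv * nc)) / 2.0
    # an additive error eps1 / (4 ||v|| ||c||) on p translates into an error
    # of at most eps1 on the rescaled squared distance computed below
    p_est = p + rng.uniform(-1, 1) * eps1 / (4 * nv * nc)
    inner_est = nv * nc * (1.0 - 2.0 * p_est)     # estimate of <v, c>
    return nv**2 + nc**2 - 2.0 * inner_est        # estimate of d^2(v, c)
```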

4.3 Step 2: Cluster assignment


At the end of Step 1, we have coherently estimated the square distance between each point in the
dataset and the k centroids in separate registers. We can now select the index j that corresponds
to the centroid closest to the given data point, written as $\ell(v_i) = \mathrm{argmin}_{j\in[k]}(d(v_i, c_j))$. As the
square is a monotone function, we do not need to compute the square root of the distance in order
to find $\ell(v_i)$.
Lemma 4.3 (Circuit for finding the minimum). Given k different $\log p$-bit registers $\otimes_{j\in[k]}|a_j\rangle$,
there is a quantum circuit $U_{min}$ that maps $(\otimes_{j\in[k]}|a_j\rangle)|0\rangle \to (\otimes_{j\in[k]}|a_j\rangle)|\mathrm{argmin}(a_j)\rangle$ in time
$O(k\log p)$.
Proof. We append an additional register for the result that is initialized to $|1\rangle$. We then repeat
the following operation for $2 \leq j \leq k$: we compare registers 1 and j, and if the value in register j is
smaller we swap registers 1 and j and update the result register to j. The cost of the procedure is
$O(k\log p)$.

The cost of finding the minimum is $\widetilde{O}(k)$ in Step 2 of the q-means algorithm, while we also need
to uncompute the distances by repeating Step 1. Once we apply the minimum finding Lemma 4.3
and undo the computation we obtain the state
$$|\psi^t\rangle := \frac{1}{\sqrt{N}}\sum_{i=1}^{N}|i\rangle|\ell^t(v_i)\rangle. \qquad (4)$$

4.4 Step 3: Centroid state creation


The previous step gave us the state $|\psi^t\rangle = \frac{1}{\sqrt{N}}\sum_{i=1}^{N}|i\rangle|\ell^t(v_i)\rangle$. The first register of this state
stores the index of the data points while the second register stores the label for the data point in
the current iteration. Given these states, we need to find the new centroids $|c_j^{t+1}\rangle$, which are the
average of the data points having the same label.
Let $\chi_j^t \in \mathbb{R}^N$ be the characteristic vector for cluster $j \in [k]$ at iteration t scaled to unit $\ell_1$ norm,
that is $(\chi_j^t)_i = \frac{1}{|C_j^t|}$ if $i \in C_j^t$ and 0 if $i \notin C_j^t$. The creation of the quantum states corresponding to
the centroids is based on the following simple claim.
Claim 4.4. Let $\chi_j^t \in \mathbb{R}^N$ be the scaled characteristic vector for $C_j^t$ at iteration t and $V \in \mathbb{R}^{N\times d}$ be
the data matrix, then $c_j^{t+1} = V^T\chi_j^t$.

Proof. The k-means update rule for the centroids is given by $c_j^{t+1} = \frac{1}{|C_j^t|}\sum_{i\in C_j^t} v_i$. As the columns
of $V^T$ are the vectors $v_i$, this can be rewritten as $c_j^{t+1} = V^T\chi_j^t$.
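
Claim 4.4 is straightforward to check numerically; the following toy sketch (our own) builds the scaled characteristic vector of a cluster and verifies that $V^T\chi_j$ equals the cluster mean.

```python
import numpy as np

rng = np.random.default_rng(1)
V = rng.normal(size=(8, 3))                 # toy data matrix, N = 8, d = 3
labels = np.array([0, 1, 0, 1, 0, 1, 1, 0]) # toy cluster assignment, k = 2

j = 0
members = np.flatnonzero(labels == j)
chi = np.zeros(V.shape[0])
chi[members] = 1.0 / len(members)           # (chi_j)_i = 1/|C_j| if i in C_j, else 0

assert np.allclose(V.T @ chi, V[members].mean(axis=0))   # c_j^{t+1} = V^T chi_j^t
```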

The above claim allows us to compute the updated centroids $c_j^{t+1}$ using quantum linear algebra
operations. In fact, the state $|\psi^t\rangle$ can be written as a weighted superposition of the characteristic
vectors of the clusters:
$$|\psi^t\rangle = \sum_{j=1}^{k}\sqrt{\frac{|C_j|}{N}}\left(\frac{1}{\sqrt{|C_j|}}\sum_{i\in C_j}|i\rangle\right)|j\rangle = \sum_{j=1}^{k}\sqrt{\frac{|C_j|}{N}}\,|\chi_j^t\rangle|j\rangle$$

By measuring the last register, we can sample from the states $|\chi_j^t\rangle$ for $j \in [k]$, with probability
proportional to the size of the cluster. We assume here that all k clusters are non-vanishing, in
other words they have size $\Omega(N/k)$. Given the ability to create the states $|\chi_j^t\rangle$ and given that the
matrix V is stored in QRAM, we can now perform quantum matrix multiplication by $V^T$ to recover
an approximation of the state $|V^T\chi_j^t\rangle = |c_j^{t+1}\rangle$ with error $\epsilon_2$, as stated in Theorem 2.4. Note that
the error $\epsilon_2$ only appears inside a logarithm. The same theorem allows us to get an estimate of
the norm $\|V^T\chi_j^t\| = \|c_j^{t+1}\|$ with relative error $\epsilon_3$. For this, we also need an estimate of the size
of each cluster, namely the norms $\|\chi_j\|$. We already have this, since the measurements of the last
register give us this estimate, and since the number of measurements made is large compared to k
(they depend on d), the error from this source is negligible compared to other errors.
The running time of this step is derived from Theorem 2.4 where the time to prepare the
state $|\chi_j^t\rangle$ is the time of Steps 1 and 2. Note that we do not have to add an extra k factor due
to the sampling, since we can run the matrix multiplication procedures in parallel for all j so
that every time we measure a random $|\chi_j^t\rangle$ we perform one more step of the corresponding matrix
multiplication. Assuming that all clusters have size $\Omega(N/k)$ we will have an extra factor of $O(\log k)$
in the running time by a standard coupon collector argument.

4.5 Step 4: Centroid update


In Step 4, we need to go from quantum states corresponding to the centroids to a classical de-
scription of the centroids in order to perform the update step. For this, we will apply the vector
state tomography algorithm, stated in Theorem 2.5, on the states $|c_j^{t+1}\rangle$ that we create in Step 3.
Note that for each $j \in [k]$ we will need to invoke the unitary that creates the states $|c_j^{t+1}\rangle$ a total
of $O(\frac{d\log d}{\epsilon_4^2})$ times for achieving $\| |\overline{c_j}\rangle - |c_j\rangle \| < \epsilon_4$. Hence, for performing the tomography of all
clusters, we will invoke the unitary $O(\frac{k(\log k)d(\log d)}{\epsilon_4^2})$ times, where the $O(k\log k)$ term is the time to
get a copy of each centroid state.
error 4 , that is k|cj i − |cj ik < 4 . Using the approximation of the norms kcj k with relative error
3 from Step 3, we can combine these estimates to recover the centroids as vectors. The analysis is
described in the following claim:
Claim 4.5. Let 4 be the error we commit in estimating |cj i such that k|cj i − |cj ik < 4 , and
3 the error we commit in the estimating the norms, | kcj k − kcj k| ≤ 3 kcj k. Then kcj − cj k ≤

η(3 + 4 ) = centroid .

Proof. We can rewrite kcj − cj k as kcj k |cj i − kcj k |cj i . It follows from triangle inequality that:

kcj k |cj i − kcj k |cj i ≤ kcj k |cj i − kcj k |cj i + kkcj k |cj i − kcj k |cj ik

14

We have the upper bound kcj k ≤ η. Using the bounds for the error we have from tomography
√ √
and norm estimation, we can upper bound the first term by η3 and the second term by η4 .
The claim follows.

Let us make a remark about the ability to use Theorem 2.5 to perform tomography in our case.
The updated centroids will be recovered in Step 4 using the vector state tomography algorithm in
Theorem 2.5 on the composition of the unitary that prepares $|\psi^t\rangle$ and the unitary that multiplies
the first register of $|\psi^t\rangle$ by the matrix $V^T$. The input of the tomography algorithm requires a
unitary U such that $U|0\rangle = |x\rangle$ for a fixed quantum state $|x\rangle$. However, the labels $\ell(v_i)$ are not
deterministic due to errors in distance estimation, hence the composed unitary U as defined above
does not produce a fixed pure state $|x\rangle$.
We therefore need a procedure that finds labels $\ell(v_i)$ that are a deterministic function of $v_i$ and
the centroids $c_j$ for $j \in [k]$. One solution is to change the update rule of the δ-k-means algorithm
to the following: let $\ell(v_i) = j$ if $d(v_i, c_j) < d(v_i, c_{j'}) - 2\delta$ for $j' \neq j$, where we discard the points to
which no label can be assigned. This assignment rule ensures that if the second register is measured
and found to be in state |ji, then the first register contains a uniform superposition of points from
cluster j that are δ far from the cluster boundary (and possibly a few points that are δ close to
the cluster boundary). Note that this simulates exactly the δ-k-means update rule while discarding
some of the data points close to the cluster boundary. The k-means centroids are robust under
such perturbations, so we expect this assignment rule to produce good results in practice.
A better solution is to use consistent phase estimation instead of the usual phase estimation for
the distance estimation step, which can be found in [28, 29]. The distance estimates are generated
by the phase estimation algorithm applied to a certain unitary in the amplitude estimation step.
The usual phase estimation algorithm does not produce a deterministic answer and instead for each
eigenvalue λ outputs with high probability one of two possible estimates $\overline{\lambda}$ such that $|\lambda - \overline{\lambda}| \leq \epsilon$.
Instead, here as in some other applications we need the consistent phase estimation algorithm that
with high probability outputs a deterministic estimate such that $|\lambda - \overline{\lambda}| \leq \epsilon$.
We also describe another simple method of getting such a consistent phase estimation, which is
to combine phase estimation estimates that are obtained for two different precision values. Let us
assume that the eigenvalues for the unitary U are $e^{2\pi i\theta_i}$ for $\theta_i \in [0, 1]$. First, we perform phase
estimation with precision $\frac{1}{N_1}$ where $N_1 = 2^l$ is a power of 2. We repeat this procedure $O(\log N/\theta^2)$
times and output the median estimate. If the value being estimated is $\frac{\lambda+\alpha}{2^l}$ for $\lambda \in \mathbb{Z}$ and $\alpha \in [0, 1]$
and $|\alpha - 1/2| \geq \theta'$ for an explicit constant $\theta'$ (depending on θ), then with probability at least
$1 - 1/\mathrm{poly}(N)$ the median estimate will be unique and will be equal to $1/2^l$ times the closest integer
to $(\lambda + \alpha)$. In order to also produce a consistent estimate for the eigenvalues for the cases where
the above procedure fails, we perform a second phase estimation with precision $\frac{2}{3N_1}$. We repeat
this procedure as above for $O(\log N/\theta^2)$ iterations and take the median estimate. The second
procedure fails to produce a consistent estimate only for eigenvalues $\frac{\lambda+\alpha}{2^l}$ for $\lambda \in \mathbb{Z}$ and $\alpha \in [0, 1]$
and $|\alpha - 1/3| \leq \theta'$ or $|\alpha - 2/3| \leq \theta'$ for a suitable constant $\theta'$. Since the cases where the two
procedures fail are mutually exclusive, one of them succeeds with probability $1 - 1/\mathrm{poly}(N)$. The
estimate produced by the phase estimation procedure is therefore deterministic with very high
probability. In order to complete this proof sketch, we would have to give explicit values of the
constants θ and $\theta'$ and the success probability, using the known distribution of outcomes for phase
estimation.
For what follows, we assume that indeed the state in Equation 4 is almost a deterministic state,
meaning that when we repeat the procedure we get the same state with very high probability.
We set the error on the matrix multiplication to be $\epsilon_2 \ll \frac{\epsilon_4^2}{d\log d}$ as we need to call the unitary
that builds $c_j^{t+1}$ a total of $O(\frac{d\log d}{\epsilon_4^2})$ times. We will see that this does not increase the runtime of the
algorithm, as the dependence of the runtime for matrix multiplication is logarithmic in the error.

5 Analysis
We provide our general theorem about the running time and accuracy of the q-means algorithm.

Theorem 5.1 (q-means). For a data matrix $V \in \mathbb{R}^{N\times d}$ stored in an appropriate QRAM data struc-
ture and parameter $\delta > 0$, the q-means algorithm with high probability outputs centroids consistent
with the classical δ-k-means algorithm, in time $\widetilde{O}\left(kd\frac{\eta}{\delta^2}\kappa(V)\left(\mu(V) + k\frac{\eta}{\delta}\right) + k^2\frac{\eta^{1.5}}{\delta^2}\kappa(V)\mu(V)\right)$ per
iteration, where $\kappa(V)$ is the condition number, $\mu(V) = \min_{p\in[0,1]}\left(\|V\|_F, \sqrt{s_{2p}(V)s_{2(1-p)}(V^T)}\right)$,
and $1 \leq \|v_i\|^2 \leq \eta$.

We prove the theorem in Sections 5.1 and 5.2 and then provide the running time of the algorithm
for well-clusterable datasets as Theorem 5.2.

5.1 Error analysis


In this section we determine the error parameters in the different steps of the quantum algorithm
so that the quantum algorithm behaves the same as the classical δ-k-means. More precisely, we will
determine the values of the errors $\epsilon_1, \epsilon_2, \epsilon_3, \epsilon_4$ in terms of δ so that firstly, the cluster assignment of
all data points made by the q-means algorithm is consistent with a classical run of the δ-k-means
algorithm, and also so that the centroids computed by the q-means algorithm after each iteration are again
consistent with centroids that can be returned by the δ-k-means algorithm.
The cluster assignment in q-means happens in two steps. The first step estimates the square
distances between all points and all centroids. The error in this procedure is of the form
$$|\overline{d^2(c_j, v_i)} - d^2(c_j, v_i)| < \epsilon_1$$
for a point $v_i$ and a centroid $c_j$. The second step finds the minimum of these distances without
adding any error.
For the q-means to output a cluster assignment consistent with the δ-k-means algorithm, we
require that
$$\forall j \in [k], \quad |\overline{d^2(c_j, v_i)} - d^2(c_j, v_i)| \leq \frac{\delta}{2},$$
which implies that no centroid with distance more than δ above the minimum distance can be
chosen by the q-means algorithm as the label. Thus we need to take $\epsilon_1 < \delta/2$.
After the cluster assignment of the q-means (which happens in superposition), we update the
clusters by first performing a matrix multiplication to create the centroid states and estimate their
norms, and then a tomography to get a classical description of the centroids. The error in this part
is $\epsilon_{centroid}$, as defined in Claim 4.5, namely
$$\|\overline{c_j} - c_j\| \leq \epsilon_{centroid} = \sqrt{\eta}(\epsilon_3 + \epsilon_4).$$
Again, for ensuring that the q-means is consistent with the classical δ-k-means algorithm we
take $\epsilon_3 < \frac{\delta}{4\sqrt{\eta}}$ and $\epsilon_4 < \frac{\delta}{4\sqrt{\eta}}$. Note also that we have ignored the error $\epsilon_2$ that we can easily deal
with since it only appears in a logarithmic factor.
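
For concreteness, the error budget of this section can be packaged as a small helper (our own illustration): given δ and η it returns the precisions to be used and the resulting centroid error bound.

```python
import numpy as np

def qmeans_error_budget(delta, eta):
    """Precisions making q-means consistent with delta-k-means (Section 5.1)."""
    eps1 = delta / 2                       # distance estimation error  (< delta/2)
    eps3 = delta / (4 * np.sqrt(eta))      # relative error on centroid norms
    eps4 = delta / (4 * np.sqrt(eta))      # tomography error on |c_j>
    eps_centroid = np.sqrt(eta) * (eps3 + eps4)   # <= delta/2, as required
    return eps1, eps3, eps4, eps_centroid

print(qmeans_error_budget(delta=0.5, eta=8.25))   # MNIST-like values from Section 6
```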

5.2 Runtime analysis


As for the classical algorithm, the runtime of q-means depends linearly on the number of iterations, so
here we analyze the cost of a single step.
The cost of tomography for the k centroid vectors is $O(\frac{kd\log k\log d}{\epsilon_4^2})$ times the cost of preparation
of a single centroid state $|c_j^t\rangle$. A single copy of $|c_j^t\rangle$ is prepared by applying the matrix multiplication
by $V^T$ procedure on the state $|\chi_j^t\rangle$ obtained using square distance estimation. The time required
for preparing a single copy of $|c_j^t\rangle$ is $O(\kappa(V)(\mu(V) + T_\chi)\log(1/\epsilon_2))$ by Theorem 2.4, where $T_\chi$ is the
time for preparing $|\chi_j^t\rangle$. The time $T_\chi$ is $\widetilde{O}\left(\frac{k\eta\log(\Delta^{-1})\log(Nd)}{\epsilon_1}\right) = \widetilde{O}(\frac{k\eta}{\epsilon_1})$ by Theorem 4.1.
The cost of norm estimation for k different centroids is independent of the tomography cost and
is $\widetilde{O}(\frac{kT_\chi\kappa(V)\mu(V)}{\epsilon_3})$. Combining together all these costs and suppressing all the logarithmic factors,
we have a total running time of
$$\widetilde{O}\left(kd\frac{1}{\epsilon_4^2}\kappa(V)\left(\mu(V) + k\frac{\eta}{\epsilon_1}\right) + k^2\frac{\eta}{\epsilon_3\epsilon_1}\kappa(V)\mu(V)\right).$$
The analysis in Section 5.1 shows that we can take $\epsilon_1 = \delta/2$, $\epsilon_3 = \frac{\delta}{4\sqrt{\eta}}$ and $\epsilon_4 = \frac{\delta}{4\sqrt{\eta}}$. Substituting
these values in the above running time, it follows that the running time of the q-means algorithm
is
$$\widetilde{O}\left(kd\frac{\eta}{\delta^2}\kappa(V)\left(\mu(V) + k\frac{\eta}{\delta}\right) + k^2\frac{\eta^{1.5}}{\delta^2}\kappa(V)\mu(V)\right).$$
This completes the proof of Theorem 5.1. We next state our main result when applied to a well-
clusterable dataset, as in Definition 1.
Theorem 5.2 (q-means on well-clusterable data). For a well-clusterable dataset $V \in \mathbb{R}^{N\times d}$ stored
in appropriate QRAM, the q-means algorithm returns with high probability the k centroids con-
sistently with the classical δ-k-means algorithm for a constant δ in time $\widetilde{O}\left(k^2 d\frac{\eta^{2.5}}{\delta^3} + k^{2.5}\frac{\eta^2}{\delta^3}\right)$ per
iteration, for $1 \leq \|v_i\|^2 \leq \eta$.
Proof. Let $V \in \mathbb{R}^{N\times d}$ be a well-clusterable dataset as in Definition 1. In this case, we know by
Claim 3.3 that $\kappa(V) = \frac{1}{\sigma_{min}}$ can be replaced by a thresholded condition number $\kappa_\tau(V) = \frac{1}{\tau}$. In
practice, this is done by discarding the singular values smaller than a certain threshold during
quantum matrix multiplication. Remember that by Claim 3.2 we know that $\|V\|_F = O(\sqrt{k})$.
Therefore we need to pick $\epsilon_\tau$ for a threshold $\tau = \frac{\epsilon_\tau}{\sqrt{k}}\|V\|_F$ such that $\kappa_\tau(V) = O(\frac{1}{\epsilon_\tau})$.
Thresholding the singular values in the matrix multiplication step introduces an additional
additive error in $\epsilon_{centroid}$. By Claim 3.3 and Claim 4.5, we have that the error $\epsilon_{centroid}$ in ap-
proximating the true centroids becomes $\sqrt{\eta}(\epsilon_3 + \epsilon_4 + \epsilon_0 + \epsilon_\tau)$ where $\epsilon_0 = \sqrt{\lambda\beta^2 + (1-\lambda)4\eta}$ is a
dataset dependent parameter computed in Claim 3.1. We can set $\epsilon_\tau = \epsilon_3 = \epsilon_4 = \epsilon_0/3$ to obtain
$\epsilon_{centroid} = 2\sqrt{\eta}\epsilon_0$.
The definition of the δ-k-means update rule requires that $\epsilon_{centroid} \leq \delta/2$. Further, Claim 3.4
shows that if the error δ in the assignment step satisfies $\delta \leq \xi^2 - 2\sqrt{\eta}\beta$, then the δ-k-means
algorithm finds the correct clusters. By Definition 1 of a well-clusterable dataset, we can find a
suitable constant δ satisfying both these constraints, namely satisfying
$$4\sqrt{\eta}\sqrt{\lambda\beta^2 + (1-\lambda)4\eta} < \delta < \xi^2 - 2\sqrt{\eta}\beta.$$
Substituting the values $\mu(V) = O(\sqrt{k})$ from Claim 3.2, $\kappa(V) = O(\frac{1}{\epsilon_\tau})$ and $\epsilon_\tau = \epsilon_3 = \epsilon_4 =
\epsilon_0/3 = O(\delta/\sqrt{\eta})$ in the running time for the general q-means algorithm, we obtain that the running
time for the q-means algorithm on a well-clusterable dataset is $\widetilde{O}\left(k^2 d\frac{\eta^{2.5}}{\delta^3} + k^{2.5}\frac{\eta^2}{\delta^3}\right)$ per iteration.

Let us make some concluding remarks regarding the running time of q-means. For datasets where
the number of points is much larger than the other parameters, the running time of the
q-means algorithm is an improvement compared to the classical k-means algorithm. For instance,
for most problems in data analysis, k is typically small (< 100). The number of features satisfies d ≤ N in
most situations, and it can also be reduced by applying a quantum dimensionality reduction
algorithm first (which has running time poly-logarithmic in d). To sum up, q-means has the same
output as the classical δ-k-means algorithm (which approximates k-means), it conserves the same
number of iterations, but has a running time only poly-logarithmic in N , giving an exponential
speedup with respect to the size of the dataset.

6 Simulations on real data


We would like to assess the capability of the quantum algorithm to provide accurate classification
results, by simulations on a number of datasets. However, since neither quantum simulators nor
quantum computers large enough to test q-means are available currently, we tested the equivalent
classical implementation of δ-k-means. For implementing the δ-k-means, we changed the assignment
step of the k-means algorithm to select a random centroid among those that are δ-close to the closest
centroid and added δ/2 error to the updated clusters.
We benchmarked our q-means algorithm on two datasets: a synthetic dataset of gaussian clus-
ters, and the well known MNIST dataset of handwritten digits. To measure and compare the
accuracy of our clustering algorithm, we ran the k-means and the δ-k-means algorithms for differ-
ent values of δ on a training dataset and then we compared the accuracy of the classification on a
test set, containing data points on which the algorithms have not been trained, using a number of
widely-used performance measures.

6.1 Gaussian clusters dataset


We describe numerical simulations of the δ-k-means algorithm on a synthetic dataset made of
several clusters formed by random Gaussian distributions. By construction, these clusters are
naturally well suited for clustering, and are close to what we defined as a well-clusterable
dataset in Definition 1 of Section 3. This allows us to first compare the k-means and δ-k-means
algorithms in a regime where both reach high accuracy, even though this may not be the case on
real-world datasets. Without loss of generality, we preprocessed the data so that the minimum norm
in the dataset is 1, in which case η = 4. This is why we defined η as a maximum instead of as the
ratio of the maximum over the minimum norm, which is really the interesting quantity. Note that
the running time essentially depends on the ratio η/δ. We present a simulation where 20,000 points
in a feature space of dimension 10 form 4 Gaussian clusters with standard deviation 2.5, shown in
Figure 1.

Figure 1: Representation of 4 Gaussian clusters of 10 dimensions in a 3D space
spanned by the first three PCA dimensions.

The condition number of the dataset is calculated to be 5.06. We ran k-means and δ-k-means for 7
different values of δ to understand when δ-k-means becomes less accurate.
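
As an illustration, such a dataset and the relevant parameters can be generated along the following lines (a sketch assuming scikit-learn's make_blobs; the actual generation code and seed behind Figure 1 are not specified here, so η will not be exactly 4 for other seeds):

```python
import numpy as np
from sklearn.datasets import make_blobs

# 20,000 points in dimension 10, forming 4 Gaussian clusters of standard deviation 2.5.
X, y = make_blobs(n_samples=20000, n_features=10, centers=4,
                  cluster_std=2.5, random_state=0)

# Rescale the dataset so that the minimum norm is 1.
X = X / np.linalg.norm(X, axis=1).min()

eta = (np.linalg.norm(X, axis=1) ** 2).max()            # maximum squared norm
singular_values = np.linalg.svd(X, compute_uv=False)
kappa = singular_values.max() / singular_values.min()   # condition number of the data matrix
```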
In Figure 2 we can see that up to η/δ ≈ 3 (for δ = 1.2), the δ-k-means algorithm converges on this
dataset. We can now make some remarks about the impact of δ on the efficiency. It seems natural
that for small values of δ both algorithms behave identically. For higher values of δ, we observed
a late start in the evolution of the accuracy, reflecting random assignments for points on the
clusters' boundaries. However, the accuracy still reaches 100% in a few more steps. This increase
in the number of steps is the price to pay for a smaller ratio η/δ.

Figure 2: Accuracy evolution during k-means and δ-k-means on well-clusterable
Gaussians for 5 values of δ. All versions converged to 100% accuracy in a few steps.

6.2 MNIST
The MNIST dataset is composed of 60,000 handwritten digits, given as images of 28×28 pixels (784
dimensions). From this raw data we first performed some dimensionality reduction, and then
normalized the data such that the minimum norm is one. Note that, if we were running q-means on a
quantum computer, we could use efficient quantum procedures equivalent to Linear Discriminant
Analysis, such as [7], or other quantum dimensionality reduction algorithms like [1, 30].
As preprocessing of the data, we first performed a Principal Component Analysis (PCA), retaining
the data projected onto a subspace of dimension 40. After normalization, the value of η was 8.25
(maximum norm of 2.87), and the condition number was 4.53. Figure 3 shows the evolution of the
accuracy during k-means and δ-k-means for 4 different values of δ. In this numerical experiment,
we can see that for values of the parameter η/δ of order 20, both k-means and δ-k-means reach a
similar, yet low, classification accuracy in the same number of steps. It is important to note
that the MNIST dataset, with no preprocessing other than dimensionality reduction, is known not to
be well-clusterable under the k-means algorithm.
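
A sketch of this preprocessing pipeline, assuming MNIST is loaded via scikit-learn's fetch_openml (the loading method and the train/test split are our assumptions, not specified in the text):

```python
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

# MNIST: 70,000 images of 28x28 = 784 pixels; we keep a train/test split.
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=60000, random_state=0)

# Dimensionality reduction: PCA down to 40 dimensions, fitted on the training set.
pca = PCA(n_components=40).fit(X_train)
X_train_40, X_test_40 = pca.transform(X_train), pca.transform(X_test)

# Normalize so that the minimum norm over the training set is 1.
scale = np.linalg.norm(X_train_40, axis=1).min()
X_train_40, X_test_40 = X_train_40 / scale, X_test_40 / scale

eta = (np.linalg.norm(X_train_40, axis=1) ** 2).max()   # maximum squared norm
```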

Figure 3: Accuracy evolution on the MNIST dataset under k-means and δ-k-means for 4
different values of δ. Data has been preprocessed by a PCA to 40 dimensions. All versions
converge in the same number of steps, with a drop in accuracy as δ increases. The
apparent difference in the number of steps until convergence is only due to the stopping
condition for k-means and δ-k-means.

On top of the accuracy measure (ACC), we also evaluated the performance of q-means against several
other metrics, reported in Tables 1 and 2. More detailed information about these metrics can be
found in [31, 32]. We introduce a specific measure of error, the Root Mean Square Error of
Centroids (RMSEC), which is a direct comparison between the centroids predicted by the k-means
algorithm and the ones predicted by δ-k-means. It indicates how far apart the centroids predicted
by the two algorithms are. Note that this metric can only be applied to the training set. For all
these measures, except RMSEC, a larger value is better. Our simulations show that δ-k-means, and
thus q-means, achieves performance similar to k-means even for values of δ between 0.2 and 0.5,
and in most cases the difference is in the third decimal place.
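
Since RMSEC is not a standard library metric, here is a minimal sketch of how it can be computed; matching the two sets of centroids with an optimal one-to-one assignment is our reading of the definition above, not code from the experiments:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def rmsec(centroids_a, centroids_b):
    """Root Mean Square Error between two sets of k centroids, after a one-to-one matching."""
    # Pairwise Euclidean distances between the two sets of centroids, shape (k, k).
    d = np.linalg.norm(centroids_a[:, None, :] - centroids_b[None, :, :], axis=2)
    # Optimal one-to-one matching (Hungarian algorithm) between the two sets.
    rows, cols = linear_sum_assignment(d)
    return np.sqrt((d[rows, cols] ** 2).mean())
```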

Algo                 Dataset   ACC     HOM     COMP    V-M     AMI     ARI     RMSEC
k-means              Train     0.582   0.488   0.523   0.505   0.389   0.488   0
k-means              Test      0.592   0.500   0.535   0.517   0.404   0.499   -
δ-k-means, δ = 0.2   Train     0.580   0.488   0.523   0.505   0.387   0.488   0.009
δ-k-means, δ = 0.2   Test      0.591   0.499   0.535   0.516   0.404   0.498   -
δ-k-means, δ = 0.3   Train     0.577   0.481   0.517   0.498   0.379   0.481   0.019
δ-k-means, δ = 0.3   Test      0.589   0.494   0.530   0.511   0.396   0.493   -
δ-k-means, δ = 0.4   Train     0.573   0.464   0.526   0.493   0.377   0.464   0.020
δ-k-means, δ = 0.4   Test      0.585   0.492   0.527   0.509   0.394   0.491   -
δ-k-means, δ = 0.5   Train     0.573   0.459   0.522   0.488   0.371   0.459   0.034
δ-k-means, δ = 0.5   Test      0.584   0.487   0.523   0.505   0.389   0.487   -

Table 1: A sample of results collected from the same experiments as in Figure 3. Different metrics
are presented for the train set and the test set. ACC: accuracy. HOM: homogeneity. COMP:
completeness. V-M: v-measure. AMI: adjusted mutual information. ARI: adjusted rand index.
RMSEC: Root Mean Square Error of Centroids.

These experiments have been repeated several times and each of them presented a similar
behavior despite the random initialization of the centroids.

Figure 4: Three accuracy evolutions (panels (a), (b) and (c)) on the MNIST dataset under
k-means and δ-k-means for 4 different values of δ. The different behaviors are due to the
random initialization of the centroids.

Finally, we present a last experiment on the MNIST dataset with a different data preprocessing.
In order to reach a higher clustering accuracy, we replace the previous dimensionality reduction
by a Linear Discriminant Analysis (LDA). Note that LDA is a supervised process that uses the
labels (here, the digits) to project points onto a well-chosen lower-dimensional subspace. Thus
this preprocessing cannot be applied in practice in an unsupervised setting. However, for the sake
of benchmarking, it allows k-means to reach 87% accuracy, and therefore lets us compare k-means
and δ-k-means on a real and almost well-clusterable dataset. In the following, the MNIST dataset
is reduced to 9 dimensions. The results in Figure 5 show that δ-k-means converges to the same
accuracy as k-means even for values of η/δ down to 16. In some cases, δ-k-means even converges
faster, as the random fluctuations can help escape a temporary equilibrium of the clusters more
quickly.
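
A sketch of this supervised preprocessing, continuing from the PCA sketch above (X_train, y_train as loaded there); scikit-learn's LinearDiscriminantAnalysis caps the projection at n_classes − 1 = 9 dimensions:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Supervised projection: with 10 digit classes, LDA projects onto at most 9 dimensions.
lda = LinearDiscriminantAnalysis(n_components=9).fit(X_train, y_train)
X_train_9, X_test_9 = lda.transform(X_train), lda.transform(X_test)

# Same normalization as before: minimum norm over the training set equal to 1.
scale = np.linalg.norm(X_train_9, axis=1).min()
X_train_9, X_test_9 = X_train_9 / scale, X_test_9 / scale
```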

Figure 5: Accuracy evolution on the MNIST dataset under k-means and δ-k-means for 4
different values of δ. Data has been preprocessed to 9 dimensions with an LDA reduction.
All versions of δ-k-means converge to the same accuracy as k-means in the same number
of steps.

Algo                Dataset   ACC     HOM     COMP    V-M     AMI     ARI     RMSEC
k-means             Train     0.868   0.736   0.737   0.737   0.735   0.736   0
k-means             Test      0.891   0.772   0.773   0.773   0.776   0.771   -
q-means, δ = 0.2    Train     0.868   0.737   0.738   0.738   0.736   0.737   0.031
q-means, δ = 0.2    Test      0.891   0.774   0.775   0.775   0.777   0.774   -
q-means, δ = 0.3    Train     0.869   0.737   0.739   0.738   0.736   0.737   0.049
q-means, δ = 0.3    Test      0.890   0.772   0.774   0.773   0.775   0.772   -
q-means, δ = 0.4    Train     0.865   0.733   0.735   0.734   0.730   0.733   0.064
q-means, δ = 0.4    Test      0.889   0.770   0.771   0.770   0.773   0.769   -
q-means, δ = 0.5    Train     0.866   0.733   0.735   0.734   0.731   0.733   0.079
q-means, δ = 0.5    Test      0.884   0.764   0.766   0.765   0.764   0.764   -

Table 2: A sample of results collected from the same experiments as in Figure 5. Different metrics
are presented for the train set and the test set. ACC: accuracy. HOM: homogeneity. COMP:
completeness. V-M: v-measure. AMI: adjusted mutual information. ARI: adjusted rand index.
RMSEC: Root Mean Square Error of Centroids.

Let us remark that the values of η/δ in our experiments remained between 3 and 20. Moreover, the
parameter η, which is the maximum squared norm of the points, provides a worst-case guarantee for
the algorithm, while one can expect the running time in practice to scale with the average squared
norm of the points. For the MNIST dataset after PCA, this value is 2.65 whereas η = 8.3.
In conclusion, our simulations show that the convergence of δ-k-means is almost the same as that
of the regular k-means algorithm, even for fairly large values of δ. This provides evidence that
the q-means algorithm will perform as well as the classical k-means algorithm, while its running
time will be significantly lower for large datasets.

References
[1] S. Lloyd, M. Mohseni, and P. Rebentrost, “Quantum principal component analysis,” Nature
Physics, vol. 10, no. 9, p. 631, 2014.

[2] I. Kerenidis and A. Prakash, “Quantum recommendation systems,” Proceedings of the 8th
Innovations in Theoretical Computer Science Conference, 2017.

[3] I. Kerenidis and A. Prakash, “Quantum gradient descent for linear systems and least squares,”
arXiv:1704.04992, 2017.

[4] S. Chakraborty, A. Gilyén, and S. Jeffery, “The power of block-encoded matrix powers:
improved regression techniques via faster Hamiltonian simulation,” arXiv preprint
arXiv:1804.01973, 2018.

[5] S. Lloyd, M. Mohseni, and P. Rebentrost, “Quantum algorithms for supervised and unsupervised
machine learning,” arXiv preprint arXiv:1307.0411, 2013.

[6] J. Allcock, C.-Y. Hsieh, I. Kerenidis, and S. Zhang, “Quantum algorithms for feedforward
neural networks,” Manuscript, 2018.

[7] I. Kerenidis and A. Luongo, “Quantum classification of the MNIST dataset via slow feature
analysis,” arXiv preprint arXiv:1805.08837, 2018.

[8] N. Wiebe, A. Kapoor, and K. M. Svore, “Quantum algorithms for nearest-neighbor methods for
supervised and unsupervised learning,” arXiv preprint arXiv:1401.2142, 2014.

[9] E. Aïmeur, G. Brassard, and S. Gambs, “Quantum speed-up for unsupervised learning,” Machine
Learning, vol. 90, no. 2, pp. 261–287, 2013.

[10] C. Durr and P. Hoyer, “A quantum algorithm for finding the minimum,” arXiv preprint
quant-ph/9607014, 1996.

[11] J. Otterbach, R. Manenti, N. Alidoust, A. Bestwick, M. Block, B. Bloom, S. Caldwell,
N. Didier, E. S. Fried, S. Hong et al., “Unsupervised machine learning on a hybrid quantum
computer,” arXiv preprint arXiv:1712.05771, 2017.

[12] E. Farhi, J. Goldstone, and S. Gutmann, “A quantum approximate optimization algorithm,”
arXiv preprint arXiv:1411.4028, 2014.

[13] A. Gilyén, S. Lloyd, and E. Tang, “Quantum-inspired low-rank stochastic regression with
logarithmic dependence on the dimension,” arXiv preprint arXiv:1811.04909, 2018.

[14] E. Tang, “Quantum-inspired classical algorithms for principal component analysis and
supervised clustering,” arXiv preprint arXiv:1811.00414, 2018.

[15] E. Tang, “A quantum-inspired classical algorithm for recommendation systems,” arXiv preprint
arXiv:1807.04271, 2018.

[16] A. W. Harrow, A. Hassidim, and S. Lloyd, “Quantum algorithm for linear systems of
equations,” Physical Review Letters, vol. 103, no. 15, p. 150502, 2009.

[17] A. Frieze, R. Kannan, and S. Vempala, “Fast monte-carlo algorithms for finding low-rank
approximations,” Journal of the ACM (JACM), vol. 51, no. 6, pp. 1025–1041, 2004.

[18] P. Drineas, A. Frieze, R. Kannan, S. Vempala, and V. Vinay, “Clustering large graphs via the
singular value decomposition,” Machine learning, vol. 56, no. 1-3, pp. 9–33, 2004.

[19] D. Achlioptas and F. McSherry, “Fast computation of low rank matrix approximations,” in
Proceedings of the 33rd Annual Symposium on Theory of Computing, 611-618, 2001.

[20] S. Lloyd, “Least squares quantization in pcm,” IEEE transactions on information theory,
vol. 28, no. 2, pp. 129–137, 1982.

[21] D. Arthur and S. Vassilvitskii, “k-means++: The advantages of careful seeding,” in Proceedings
of the eighteenth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial
and Applied Mathematics, 2007, pp. 1027–1035.

[22] D. Arthur and S. Vassilvitskii, “How slow is the k-means method?” in Proceedings of the
twenty-second annual symposium on Computational geometry. ACM, 2006, pp. 144–153.

[23] M. A. Nielsen and I. Chuang, “Quantum computation and quantum information,” 2002.

[24] G. Brassard, P. Hoyer, M. Mosca, and A. Tapp, “Quantum amplitude amplification and
estimation,” Contemporary Mathematics, vol. 305, pp. 53–74, 2002.

[25] A. Gilyén, Y. Su, G. H. Low, and N. Wiebe, “Quantum singular value transformation
and beyond: exponential improvements for quantum matrix arithmetics,” arXiv preprint
arXiv:1806.01838, 2018.

[26] I. Kerenidis and A. Prakash, “A quantum interior point method for LPs and SDPs,”
arXiv:1808.09266, 2018.

[27] P. Drineas, I. Kerenidis, and P. Raghavan, “Competitive recommendation systems,” in
Proceedings of the thirty-fourth annual ACM symposium on Theory of computing. ACM, 2002,
pp. 82–90.

[28] A. Ta-Shma, “Inverting well conditioned matrices in quantum logspace,” in Proceedings of the
forty-fifth annual ACM symposium on Theory of computing. ACM, 2013, pp. 881–890.

[29] A. Ambainis, “Variable time amplitude amplification and quantum algorithms for linear
algebra problems,” in STACS’12 (29th Symposium on Theoretical Aspects of Computer Science),
vol. 14. LIPIcs, 2012, pp. 636–647.

[30] I. Cong and L. Duan, “Quantum discriminant analysis for dimensionality reduction and
classification,” arXiv preprint arXiv:1510.00113, 2015.

[31] “A demo of k-means clustering on the handwritten digits data.” [Online]. Available:
https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html

[32] J. Friedman, T. Hastie, and R. Tibshirani, The elements of statistical learning. Springer series
in statistics New York, NY, USA:, 2001, vol. 1, no. 10.

