The Inductive Bias of Quantum Kernels (NeurIPS 2021)
Abstract
It has been hypothesized that quantum computers may lend themselves well to
applications in machine learning. In the present work, we analyze function classes
defined via quantum kernels. Quantum computers offer the possibility to efficiently
compute inner products of exponentially large density operators that are classically
hard to compute. However, having an exponentially large feature space renders
the problem of generalization hard. Furthermore, being able to evaluate inner
products in high dimensional spaces efficiently by itself does not guarantee a
quantum advantage, as already classically tractable kernels can correspond to high-
or infinite-dimensional reproducing kernel Hilbert spaces (RKHS).
We analyze the spectral properties of quantum kernels and find that we can expect
an advantage if their RKHS is low dimensional and contains functions that are
hard to compute classically. If the target function is known to lie in this class, this
implies a quantum advantage, as the quantum computer can encode this inductive
bias, whereas there is no classically efficient way to constrain the function class in
the same way. However, we show that finding suitable quantum kernels is not easy
because the kernel evaluation might require exponentially many measurements.
In conclusion, our message is a somewhat sobering one: we conjecture that quantum
machine learning models can offer speed-ups only if we manage to encode
knowledge about the problem at hand into quantum circuits, while encoding the
same bias into a classical model would be hard. These situations may plausibly
occur when learning on data generated by a quantum process, however, they appear
to be harder to come by for classical datasets.
1 Introduction
In recent years, much attention has been dedicated to studies of how small and noisy quantum devices
[1] could be used for near term applications to showcase the power of quantum computers. Besides
fundamental demonstrations [2], potential applications that have been discussed are in quantum
chemistry [3], discrete optimization [4] and machine learning (ML) [5–12].
Initiated by the seminal HHL algorithm [13], early work in quantum machine learning (QML) was
focused on speeding up linear algebra subroutines, commonly used in ML, offering the perspective
of a runtime logarithmic in the problem size [14–17]. However, most of these works have an inverse
polynomial scaling of the runtime in the error, and it was shown rigorously by Ciliberto et al. [18]
that due to the quantum mechanical measurement process a runtime complexity of O(√n) is necessary
for a convergence rate of 1/√n.
Rather than speeding up linear algebra subroutines, we focus on more recent suggestions that use a
quantum device to define and implement the function class and do the optimization on a classical
computer. There are two ways to do that: the first are so-called Quantum Neural Networks (QNN) or
parametrized quantum circuits [5–7], which can be trained via gradient-based optimization [5, 19–23].
The second approach is to use a predefined way of encoding the data in the quantum system and
defining a quantum kernel as the inner product of two quantum states [7–11]. These two approaches
essentially provide a parametric and a non-parametric path to quantum machine learning, which are
closely related to each other [11]. Since the optimization of QNNs is non-convex and suffers from
so-called Barren Plateaus [24], we here focus on quantum kernels, which allow for convex problems
and thus lend themselves more readily to theoretical analysis.

* JMK and SB contributed equally and are ordered randomly.

Figure 1: Quantum advantage via inductive bias: (a) Data-generating quantum circuit f(x) = Tr[ρ^V(x)(M ⊗ id)] = Tr[ρ̃^V(x)M]. (b) The full quantum kernel k(x, x') = Tr[ρ^V(x)ρ^V(x')] is too general and cannot learn f efficiently. (c) The biased quantum kernel q(x, x') = Tr[ρ̃^V(x)ρ̃^V(x')] meaningfully constrains the function space and allows learning f with little data.
The central idea of using a QML model is that it enables computations that are exponentially
hard classically. However, also in classical ML, kernel methods allow us to implicitly work with
high- or infinite-dimensional function spaces [25, 26]. Thus, purely studying the expressivity of
QML models [27] is not sufficient to understand when we can expect speed-ups. Only recently were
first steps taken in this direction [10, 12, 28]. Assuming classical hardness of computing discrete
logarithms, Liu et al. [10] proposed a task based on the computation of the discrete logarithm where
the quantum computer, equipped with the right feature mapping, can learn the target function with
exponentially less data than any classical (efficient) algorithm. Similarly, Huang et al. [12] analyzed
generalization bounds and realized that the expressivity of quantum models can hinder generalization.
They proposed a heuristic to optimize the labels of a dataset such that it can be learned well by a
quantum computer but not a classical machine.
In this work, we relate the discussion of quantum advantages to the classical concept of inductive
bias. The no free lunch theorem informally states that no learning algorithm can outperform other
algorithms on all problems. This implies that an algorithm that performs well on one type of problem
necessarily performs poorly on other problems. A standard inductive bias in ML is to prefer functions
that are continuous. An algorithm with that bias, however, will then struggle to learn functions that
are discontinuous. For a QML model to have an edge over classical ML models, we could thus ensure
that it is equipped with an inductive bias that cannot be encoded (efficiently) with a classical machine.
If a given dataset fits this inductive bias, the QML model will outperform any classical algorithm. For
kernel methods, the qualitative concept of inductive bias can be formalized by analyzing the spectrum
of the kernel and relating it to the target function [25, 29–33].
Our main contribution is the analysis of the inductive bias of quantum machine learning models based
on the spectral properties of quantum kernels. First, we show that quantum kernel methods will fail to
generalize as soon as the data embedding into the quantum Hilbert space is too expressive (Theorem
1). Then we note that projecting the quantum kernel appropriately allows us to construct inductive biases
that are hard to create classically (Figure 1). However, our Theorem 2 also implies that estimating the
biased kernel requires exponentially many measurements, a phenomenon reminiscent of the Barren plateaus
observed in quantum neural networks. Finally, we show experiments supporting our main claims.
While our work gives guidance to find a quantum advantage in ML, this yields no recipe for obtaining
a quantum advantage on a classical dataset. We conjecture that unless we have a clear idea how the
data generating process can be described with a quantum computer, we cannot expect an advantage
by using a quantum model in place of a classical machine learning model.
2 Supervised learning
We briefly introduce the setting and notation for supervised learning as a preparation for our analysis
of quantum mechanical methods in this context. The goal of supervised learning is the estimation of
a functional mechanism based on data generated from this mechanism. For concreteness we focus
on the regression setting where we assume data is generated according to Y = f ∗ (X) + ε where
ε denotes zero-mean noise. We focus on X ∈ X ⊂ R^d, Y ∈ R. We denote the joint probability
distribution of (X, Y) by D, and we are given n i.i.d. observations D_n from D. We will refer to the
marginal distribution of X as µ, define the L²_µ inner product ⟨f, g⟩ = ∫ f(x)g(x) µ(dx) and denote
the corresponding norm by ‖·‖. The least-squares risk and the empirical risk of a hypothesis
h : X → R are defined by R(h) = E_D[(h(X) − Y)²] and R_n(h) = E_{D_n}[(h(X) − Y)²].
In supervised machine learning, one typically considers a hypothesis space H of functions h : X → R
and tries to infer argmin_{h∈H} R(h) (assuming for simplicity that the minimizer exists). Typically this
is done by (regularized) empirical risk minimization, argmin_{h∈H} R_n(h) + λΩ(h), where λ > 0 and
Ω determine the regularization. The risk of h can then be decomposed into generalization and training
error: R(h) = (R(h) − R_n(h)) + R_n(h).
Kernel ridge regression. We will focus on solving the regression problem over a reproducing
kernel Hilbert space (RKHS) [25, 26]. An RKHS F associated with a positive definite kernel
k : X × X → R is the space of functions such that for all x ∈ X and h ∈ F the reproducing property
h(x) = ⟨h, k(x, ·)⟩_F holds. Kernel ridge regression regularizes the RKHS norm, i.e., Ω(h) = ‖h‖²_F.
With observations {(x^(i), y^(i))}_{i=1}^n we can compute the kernel matrix K(X, X)_{ij} = k(x^(i), x^(j)),
and the Representer Theorem [34] ensures that the empirical risk minimizer of kernel ridge regression
is of the form f̂_n^λ(·) = Σ_{i=1}^n α_i k(x^(i), ·), with α = (K(X, X) + λ id)^{−1} y. The goal of our work is
to study when a (quantum) kernel is suitable for learning a particular problem. The central object to
study this is the integral operator.
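To make this concrete, the following minimal Python sketch (a toy illustration using only numpy; the RBF kernel and the data are placeholders, not the quantum kernels studied below) implements kernel ridge regression directly via the representer-theorem formula above.

import numpy as np

def kernel_ridge_fit(k, X, y, lam):
    """Solve alpha = (K(X, X) + lam * id)^{-1} y, cf. the representer-theorem expansion above."""
    K = np.array([[k(xi, xj) for xj in X] for xi in X])   # kernel matrix K(X, X)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def kernel_ridge_predict(k, X_train, alpha, X_test):
    """Evaluate f_hat(x) = sum_i alpha_i k(x^(i), x) at new inputs."""
    return np.array([[k(xi, x) for xi in X_train] for x in X_test]) @ alpha

# toy usage with an RBF kernel as a stand-in for a (quantum) kernel
rbf = lambda x, xp: np.exp(-np.sum((x - xp) ** 2) / 2)
rng = np.random.default_rng(0)
X = rng.uniform(-np.pi, np.pi, size=(30, 2))
y = np.sin(X[:, 0]) + 0.01 * rng.standard_normal(30)
alpha = kernel_ridge_fit(rbf, X, y, lam=1e-3)
print(kernel_ridge_predict(rbf, X, alpha, X[:5]))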
Spectral properties and inductive bias. For kernel k and marginal distribution µ, the integral
operator K is defined as (Kf)(x) = ∫ k(x, x')f(x') µ(dx'). Mercer's Theorem ensures that there
exists a spectral decomposition of K with (possibly infinitely many) eigenvalues γ_i (ordered non-
increasingly) and corresponding eigenfunctions φ_i, which are orthonormal in L²_µ, i.e., ⟨φ_i, φ_j⟩ = δ_{ij}.
We will assume that Tr[K] = Σ_i γ_i = 1, which we can ensure by rescaling the kernel. We can
then write k(x, x') = Σ_i γ_i φ_i(x)φ_i(x'). While the functions φ_i form a basis of F they might not
completely span L²_µ. In this case we simply complete the basis and implicitly take γ = 0 for the
added functions. Then we can decompose functions in L²_µ as

    f(x) = Σ_i a_i φ_i(x).    (1)
We have ‖f‖² = Σ_i a_i² and ‖f‖²_F = Σ_i a_i²/γ_i (if f ∈ F). Kernel ridge regression penalizes the
RKHS norm of functions. The components corresponding to zero eigenvalues are infinitely penalized
and cannot be learned since they are not in the RKHS. For large regularization λ the solution fˆnλ is
heavily biased towards learning only the coefficients of the principal components (corresponding to
the largest eigenvalues) and keeps the other coefficients small (at the risk of underfitting). Decreasing
the regularization allows ridge regression to also fit the other components, however, at the potential
risk of overfitting to the noise in the empirical data. Finding good choices of λ thus balances this
bias-variance tradeoff.
We are less concerned with the choice of λ, but rather with the spectral properties of a kernel that
allow for a quantum advantage. Similar to the above considerations, a target function f can easily
be learned if it is well aligned with the principal components of a kernel. In the easiest case, the
kernel only has a single non-zero eigenvalue and is just k(x, x') = f(x)f(x'). Such a construction is
arguably the simplest path to a quantum advantage in ML.
Example 1 (Trivial Quantum Advantage). Let f be a scalar function that is easily computable on
a quantum device but requires exponential resources to approximate classically. Generate data as
Y = f(X) + ε. The kernel k(x, x') = f(x)f(x') then has an exponential advantage for learning f
from data.
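As a toy illustration of Example 1 (with an arbitrary classical stand-in for the quantum-computable scalar function f), the rank-one kernel reduces learning to estimating a single coefficient, so a handful of noisy samples suffice:

import numpy as np

f = lambda x: np.cos(3 * x) * np.exp(-x ** 2)    # classical stand-in for the hard-to-compute function
k = lambda x, xp: f(x) * f(xp)                   # rank-one kernel k(x, x') = f(x) f(x')

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=5)                   # very few samples
y = f(X) + 1e-3 * rng.standard_normal(5)

K = k(X[:, None], X[None, :])
alpha = np.linalg.solve(K + 1e-6 * np.eye(5), y)
x_test = np.linspace(-2, 2, 7)
f_hat = np.array([np.sum(alpha * k(X, xt)) for xt in x_test])
print(np.max(np.abs(f_hat - f(x_test))))         # small, of the order of the noise level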
To go beyond this trivial case, we introduce two qualitative measures to judge the quality of a kernel
for learning the function f . The kernel target alignment of Cristianini et al. [30] is
    A(k, f) = ⟨k, f⊗f⟩ / (⟨k, k⟩^{1/2} ⟨f⊗f, f⊗f⟩^{1/2}) = (Σ_i γ_i a_i²) / ((Σ_i γ_i²)^{1/2} Σ_i a_i²)    (2)
and measures how well the kernel fits f . If A = 1, learning reduces to estimating a single real
parameter, whereas for A = 0, learning is infeasible. We note that the kernel target alignment also
weighs the contributions of f depending on the corresponding eigenvalue, i.e., the alignment is better
if large |ai | correspond to large γi . The kernel target alignment was used extensively to optimize
kernel functions [31] and recently also used to optimize quantum kernels [35].
In a similar spirit, the task-model alignment of Canatar et al. [32] measures how much of the signal
of f is captured in the first i principal components: C(i) = (Σ_{j≤i} a_j²)(Σ_j a_j²)^{−1}. The slower C(i)
approaches 1, the harder it is to learn, as the target function is more spread over the eigenfunctions.
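Both quantities can be estimated from a finite sample. The following numpy sketch (our own illustrative implementation, using the eigendecomposition of the kernel matrix as a finite-sample proxy for the Mercer eigenfunctions) computes the empirical kernel target alignment of Eq. (2) and the cumulative task-model alignment C(i):

import numpy as np

def kernel_target_alignment(K, y):
    """Empirical version of A(k, f) in Eq. (2): <K, y y^T> / (||K||_F ||y||^2)."""
    return (y @ K @ y) / (np.linalg.norm(K, "fro") * (y @ y))

def task_model_alignment(K, y):
    """Cumulative fraction C(i) of the target captured by the leading kernel eigenvectors."""
    eigvals, eigvecs = np.linalg.eigh(K)
    a = eigvecs[:, np.argsort(eigvals)[::-1]].T @ y   # coefficients, decreasing-eigenvalue order
    return np.cumsum(a ** 2) / np.sum(a ** 2)

# toy usage with an RBF kernel and a smooth target
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 3))
K = np.exp(-np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1) / 2)
y = np.sin(X @ np.ones(3))
print(kernel_target_alignment(K, y), task_model_alignment(K, y)[:5])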
Put simply, the potential advantage of a quantum computer arises from its state space being expo-
nentially large in the number of qubits d, thus computing general expressions like (3) on a classical
computer is exponentially hard. However, besides the huge obstacles in building quantum devices
with high fidelity, the fact that the outcome of the quantum computation (3) has to be estimated from
measurements often prohibits easily harnessing this power, see also Wang et al. [37], Peters et al. [38].
We will discuss this in the context of quantum kernels in Section 4.
We consider parameter dependent quantum states ρ(x) = U (x)ρ0 U † (x) that are generated by
evolving the initial state ρ0 with the data dependent unitary transformation U (x) [7, 11]. Most often
we will without loss of generality assume that the initial state is ρ₀ = (|0⟩⟨0|)^⊗d. We then define
quantum machine learning models via observables M of the data-dependent state

    f_M(x) = Tr[U(x)ρ₀U†(x)M] = Tr[ρ(x)M].    (4)
In the following we introduce the two most common function classes suggested for quantum machine
learning. We note that there also exist proposals that do not fit into the form of Eq. (4) [27, 35, 39].
Quantum neural networks. A "quantum neural network" (QNN) is defined via a variational
quantum circuit (VQC) [6, 40, 41]. Here the observable M_θ is parametrized by p ∈ N classical
parameters θ ∈ Θ ⊆ R^p. This defines a parametric function class F_Θ = {f_{M_θ} | θ ∈ Θ}. The most
common ansatz is to consider M_θ = U(Θ)MU†(Θ), where U(Θ) = Π_i U(θ_i) is the composition of
unitary evolutions each acting on few qubits. For this and other common models of the parametric
circuit it is possible to analytically compute gradients and specific optimizers for quantum circuits
based on gradient descent have been developed [5, 19–23]. Nevertheless, the optimization is usually
a non-convex problem and suffers from additional difficulties due to oftentimes exponentially (in d)
vanishing gradients [24]. This hinders a theoretical analysis. Note that the non-convexity does not
arise from the fact that the QNN can learn non-linear functions, but rather because the observable M_θ
depends non-linearly on the parameters. In fact, the QNN functions are linear in the fixed feature
mapping ρ(x). Therefore the analogy to classical neural networks is somewhat incomplete.

Table 1: Concepts in the quantum Hilbert space H and the reproducing kernel Hilbert space F.
Quantum kernels. The class of functions we consider consists of RKHS functions where the kernel is
expressed by a quantum computation. The key observation is that (4) is linear in ρ(x). Instead of
optimizing over the parametric function class FΘ , we can define the nonparametric class of functions
F = {f_M | f_M(·) = Tr[ρ(·)M], M = M†}.² To endow this function class with the structure of an
RKHS, observe that the expression Tr [ρ1 ρ2 ] defines a scalar product on density matrices. We then
define kernels via the inner product of data-dependent density matrices:
Definition 1 (Quantum Kernel [7, 8, 11]). Let ρ : x ↦ ρ(x) be a fixed feature mapping from X to
density matrices. Then the corresponding quantum kernel is k(x, x') = Tr[ρ(x)ρ(x')].
The Representer Theorem [34] reduces the empirical risk minimization over the exponentially
large function class F to an optimization problem with a set of parameters whose dimensionality
corresponds to the training set size. Since the ridge regression objective is convex (and so are many
other common objective functions in ML), this can be solved efficiently with a classical computer.
In the described setting, the quantum computer is only used to estimate the kernel. For pure state
encodings, this is done by inverting the data encoding transformation (taking its conjugate transpose)
and measuring the probability that the resulting state equals the initial state ρ₀. To see this we use
the cyclic property of the trace: k(x, x') = Tr[ρ(x)ρ(x')] = Tr[U(x)ρ₀U†(x) U(x')ρ₀U†(x')] =
Tr[U†(x')U(x)ρ₀U†(x)U(x')ρ₀]. If ρ₀ = (|0⟩⟨0|)^⊗d, then k(x, x') corresponds to the probability
of observing every qubit in the '0' state after the initial state was evolved with U†(x')U(x). To
evaluate the kernel, we thus need to estimate this probability from a finite number of measurements.
For our theoretical analysis we work with the exact value of the kernel and in our experiments we
also simulate the full quantum state. We discuss the difficulties related to measurements in Sec. 4.
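For intuition, here is a minimal numpy sketch (simulating the statevector of a hypothetical single-qubit embedding rather than running a device): the exact kernel value is the probability of the all-zero outcome, and a shot-based estimate carries a statistical error of order 1/√shots.

import numpy as np

def embed(x):
    """Toy single-qubit embedding |phi(x)> = cos(x/2)|0> + i sin(x/2)|1> (an RX rotation of |0>)."""
    return np.array([np.cos(x / 2), 1j * np.sin(x / 2)])

def exact_kernel(x, xp):
    """k(x, x') = |<phi(x')|phi(x)>|^2, i.e., the probability of the all-zero outcome."""
    return np.abs(np.vdot(embed(xp), embed(x))) ** 2

def estimated_kernel(x, xp, shots, rng):
    """Estimate the same probability from a finite number of measurement shots."""
    return rng.binomial(shots, exact_kernel(x, xp)) / shots

rng = np.random.default_rng(0)
x, xp = 0.3, 1.7
print(exact_kernel(x, xp))                                # equals cos^2((x - x') / 2)
for shots in (10, 100, 10_000):
    print(shots, estimated_kernel(x, xp, shots, rng))     # statistical error ~ 1 / sqrt(shots)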
We now study the inductive bias for simple quantum kernels and their learning performance. We
first give a high level discussion of a general hurdle for quantum machine learning models to surpass
classical methods and then analyze two specific kernel approaches in more detail.
Continuity in classical machine learning. Arguably the most important biases in nonparametric
regression are continuity assumptions on the regression function. This becomes particularly apparent
in, e.g., nearest neighbour regression or random regression forests [42] where the regression function
is a weighted average of close points. Here we want to emphasize that there is a long list of results
concerning the minimax optimality of kernel methods for regression problems [43–45]. In particular
these results show that asymptotically the convergence of kernel ridge regression of, e.g., Sobolev
functions reaches the statistical limits which also apply to any quantum method.
² F is defined for a fixed feature mapping x ↦ ρ(x). Although M is finite dimensional and thus F can be
seen as a parametric function class, we will be interested in the case where M is exponentially large in d and we
can only access functions from F implicitly. Therefore we refer to it as a nonparametric class of functions.
A simple quantum kernel. We now restrict our attention to rather simple kernels to facilitate
a theoretical analysis. As indicated above we consider data in X ⊂ R^d, and we assume that the
distribution µ of the data factorizes over the coordinates (i.e., µ can be written as µ = ⊗_i µ_i). This
data is embedded in a d-qubit quantum circuit. Let us emphasize here that the RKHS based on a
quantum state of d qubits is at most 4^d-dimensional, i.e., finite dimensional, and in the infinite data
limit n → ∞ standard convergence guarantees from parametric statistics apply. Here we consider
growing dimension d → ∞, and sample size polynomial in the dimension, n = n(d) ∈ Poly(d). In
particular the sample size n ≪ 4^d will be much smaller than the dimension of the feature space and
bounds from the parametric literature do not apply.
Here we consider embeddings where each coordinate is embedded into a single qubit using a map
ϕ_i, followed by an arbitrary unitary transformation V, so that we can express the embedding in
the quantum Hilbert space as |ψ^V(x)⟩ = V ⊗_i |ϕ_i(x_i)⟩ with corresponding density matrix (feature
map)³

    ρ^V(x) = |ψ^V(x)⟩⟨ψ^V(x)|.    (5)

Note that the corresponding kernel k(x, x') = Tr[ρ(x)ρ(x')] is independent of V and factorizes as
k(x, x') = Tr[⊗_i ρ_i(x_i) ⊗_i ρ_i(x'_i)] = Π_i Tr[ρ_i(x_i)ρ_i(x'_i)], where ρ_i(x_i) = |ϕ_i(x_i)⟩⟨ϕ_i(x_i)|. The
product structure of the kernel allows us to characterize the RKHS generated by k based on the
one-dimensional case. The embedding of a single variable can be parametrized by complex-valued
functions a(x), b(x) as

    |ϕ_i(x)⟩ = a(x)|0⟩ + b(x)|1⟩.    (6)
One important object characterizing this embedding turns out to be the mean density matrix of the
embedding, given by ρ_{µ_i} = ∫ ρ_i(y) µ_i(dy) = ∫ |ϕ_i(y)⟩⟨ϕ_i(y)| µ_i(dy). This can be identified with
the kernel mean embedding of the distribution [46]. Note that for a factorizing base measure µ the
factorization ρ_µ = ⊗_i ρ_{µ_i} holds. Let us give a concrete example to clarify the setting, see Fig. 1(b).
Example 2. [11, Example III.1.] We consider the cosine kernel where a(x) = cos(x/2), b(x) =
i sin(x/2). This embedding can be realized using a single quantum R_X(x) = exp(−i(x/2)σ_x) gate
such that |ψ(x)⟩ = R_X(x)|0⟩ = cos(x/2)|0⟩ + i sin(x/2)|1⟩. In this case the kernel is given by

    k(x, x') = |⟨0|R_X†(x')R_X(x)|0⟩|² = |cos(x/2)cos(x'/2) + sin(x/2)sin(x'/2)|² = cos²((x − x')/2).    (7)

As a reference measure µ we consider the uniform measure on [−π, π]. Then the mean density
matrix is the completely mixed state ρ_µ = ½ id. For R^d-valued data whose coordinates are encoded
independently the kernel is given by k(x, x') = Π_i cos²((x_i − x'_i)/2) and ρ_µ = 2^{−d} id_{2^d×2^d}. We
emphasize that due to the kernel trick this kernel can be evaluated classically in runtime O(d).
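A minimal numpy sketch contrasting the two evaluation strategies for this kernel (our own illustration; the explicit feature vector is only tractable for small d): the product form costs O(d), while building the 2^d-dimensional feature vector is exponential.

import numpy as np

def cosine_kernel(x, xp):
    """Classical O(d) evaluation of k(x, x') = prod_i cos^2((x_i - x'_i) / 2)."""
    return np.prod(np.cos((x - xp) / 2) ** 2)

def product_state(x):
    """Explicit 2^d-dimensional feature vector (only tractable for small d)."""
    psi = np.array([1.0 + 0j])
    for xi in x:
        psi = np.kron(psi, np.array([np.cos(xi / 2), 1j * np.sin(xi / 2)]))
    return psi

rng = np.random.default_rng(0)
d = 8
x, xp = rng.uniform(-np.pi, np.pi, size=(2, d))
k_fast = cosine_kernel(x, xp)
k_state = np.abs(np.vdot(product_state(xp), product_state(x))) ** 2
print(k_fast, k_state)     # agree up to floating-point error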
Quantum RKHS. We now characterize the RKHS and the eigenvalues of the integral operator
for quantum kernels. The RKHS consists of all functions f ∈ F that can be written as f(x) =
Tr[ρ(x)M], where M ∈ C^{2^d×2^d} is a Hermitian operator. Using this characterization of the finite
dimensional RKHS we can rewrite the infinite dimensional eigenvalue problem of the integral operator
as a finite dimensional problem. The action of the corresponding integral operator on f can be written
as

    (Kf)(x) = ∫ f(y)k(y, x) µ(dy) = ∫ Tr[Mρ(y)] Tr[ρ(y)ρ(x)] µ(dy)
            = ∫ Tr[(M ⊗ ρ(x))(ρ(y) ⊗ ρ(y))] µ(dy) = Tr[(M ⊗ ρ(x)) ∫ ρ(y) ⊗ ρ(y) µ(dy)].    (8)

We denote by O_µ = ∫ ρ(y) ⊗ ρ(y) µ(dy) the operator for which Tr[O_µ] = 1 holds. Then we can write
    (Kf)(x) = Tr[O_µ(M ⊗ ρ(x))] = Tr[O_µ(M ⊗ id)(id ⊗ ρ(x))] = Tr[Tr₁[O_µ(M ⊗ id)] ρ(x)],    (9)

where Tr₁[·] refers to the partial trace over the first factor. For the definition and a proof of the last
equality we refer to Appendix A. The eigenvalues of K can now be identified with the eigenvalues
of the linear map T_µ mapping M ↦ Tr₁[O_µ(M ⊗ id)]. As shown in the appendix there is an
eigendecomposition such that T_µ(M) = Σ_i λ_i A_i Tr[A_i M], where the A_i are orthonormal Hermitian
matrices (for details, a proof and an example we refer to Appendix C). The eigenfunctions of K are
given by f_i(x) = Tr[ρ(x)A_i].

³ When we can ignore V, we simply assume V = id and write ρ(x) instead of ρ^V(x). For the kernel, since
V†V = id = VV† and due to the cyclic property of the trace we have k^V(x, x') = Tr[ρ^V(x)ρ^V(x')] =
Tr[ρ(x)ρ(x')] = k(x, x').
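As a numerical sanity check (our own illustrative computation, not part of the paper's code), the following numpy sketch estimates the matrix of T_µ in an orthonormal Hermitian basis for the single-qubit cosine embedding of Example 2 via Monte Carlo; its eigenvalues are the Mercer eigenvalues γ_i, the eigenvectors give the operators A_i, and the largest eigenvalue can be compared with the purity bound discussed next (Lemma 1).

import numpy as np

# Monte Carlo estimate of T_mu(M) = Tr_1[O_mu (M x id)], i.e., the integral of Tr[rho(y)M] rho(y) mu(dy),
# for the single-qubit cosine embedding, with mu uniform on [-pi, pi].
paulis = [np.eye(2), np.array([[0, 1], [1, 0]]), np.array([[0, -1j], [1j, 0]]), np.diag([1.0, -1.0])]
basis = [p / np.sqrt(2) for p in paulis]               # orthonormal Hermitian (Hilbert-Schmidt) basis

def rho(y):
    phi = np.array([np.cos(y / 2), 1j * np.sin(y / 2)])
    return np.outer(phi, phi.conj())

rng = np.random.default_rng(0)
ys = rng.uniform(-np.pi, np.pi, size=20_000)
rhos = [rho(y) for y in ys]
feats = np.array([[np.trace(B @ r).real for B in basis] for r in rhos])   # v_a(y) = Tr[B_a rho(y)]
T = feats.T @ feats / len(ys)                          # matrix of T_mu in the basis {B_a}
print(np.sort(np.linalg.eigvalsh(T))[::-1])            # approx. [0.5, 0.25, 0.25, 0.0]
rho_mu = sum(rhos) / len(rhos)                         # mean density matrix, approx. id/2 here
print(np.sqrt(np.trace(rho_mu @ rho_mu).real))         # sqrt of the purity, upper-bounds gamma_max (Lemma 1)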
We now state a bound that controls the largest eigenvalue of the integral operator K in terms of the
eigenvalues of the mean density matrix ρµ (Proof in Appendix C.2).
Lemma 1. The largest eigenvalue γ_max of K satisfies the bound γ_max ≤ √(Tr[ρ_µ²]).
The lemma above shows that the squared eigenvalues of K are bounded by Tr[ρ_µ²], an expression
known as the purity [36] of the density matrix ρ_µ, which measures the diversity of the data embedding.
On the other hand the eigenvalues of K are closely related to the learning guarantees of kernel ridge
regression. In particular, standard generalization bounds for kernel ridge regression [47] become
vacuous when γmax is exponentially smaller than the training sample size (if Tr [K] = 1 which holds
for pure state embeddings). The next result shows that this is not just a matter of bounds.
Theorem 1. Suppose the purity of the embeddings ρ_{µ_i} satisfies Tr[ρ_{µ_i}²] ≤ δ < 1 as the dimension
and number of qubits d grows. Furthermore, suppose the training sample size only grows polynomially
in d, i.e., n ≤ d^l for some fixed l ∈ N. Then there exists d₀ = d₀(δ, l, ε) such that for all
d ≥ d₀ no function can be learned using kernel ridge regression with the d-qubit kernel k(x, x') =
Tr[ρ(x)ρ(x')], in the sense that for any f ∈ L², with probability at least 1 − ε for all λ ≥ 0
The proof of the theorem can be found in Appendix D. It relies on a general result (Theorem 3 in
Appendix D) which shows that for any (not necessarily quantum) kernel the solution of kernel ridge
regression cannot generalize when the largest eigenvalue in the Mercer decomposition is sufficiently
small (depending on the sample size). Then the proof of Theorem 1 essentially boils down to proving
a bound on the largest eigenvalue using Lemma 1.
Theorem 1 implies that generalization is only possible when the mean embedding of most coordinates
is close to a pure state, i.e., the embedding x ↦ |ϕ_i(x)⟩ is almost constant. To make learning from
data feasible we cannot use the full expressive power of the quantum Hilbert space; instead, only
very restricted embeddings allow learning from data. This generalizes an observation already made
in [12]. Since classical methods can also handle high-dimensional and infinite-dimensional
RKHS, the same problem occurs for classical kernels, where one solution is to adapt the bandwidth of
the kernel to control the expressivity of the RKHS. In principle this is also possible in the quantum
context, e.g., for the cosine embedding.
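To illustrate the bandwidth remark, here is a hypothetical numpy sketch (our own; the rescaling parameter beta plays the role of a bandwidth and is not part of the paper's construction): shrinking beta makes the single-qubit cosine embedding nearly constant, drives the purity Tr[ρ_{µ_i}²] towards one, and thereby escapes the assumption of Theorem 1, at the cost of expressivity.

import numpy as np

def mean_embedding_purity(beta, n=100_000, seed=0):
    """Purity Tr[rho_mu^2] of the mean embedding x -> cos(beta*x/2)|0> + i sin(beta*x/2)|1>, x ~ U[-pi, pi]."""
    rng = np.random.default_rng(seed)
    xs = rng.uniform(-np.pi, np.pi, size=n)
    phis = np.stack([np.cos(beta * xs / 2), 1j * np.sin(beta * xs / 2)], axis=1)   # (n, 2) amplitudes
    rho_mu = np.einsum("ni,nj->ij", phis, phis.conj()) / n                          # mean density matrix
    return np.trace(rho_mu @ rho_mu).real

for beta in (1.0, 0.5, 0.1, 0.01):
    print(beta, mean_embedding_purity(beta))   # approaches 1 as the "bandwidth" beta shrinks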
Biased kernels. We have discussed that without any inductive bias, the introduced quantum kernel
cannot learn any function for large d. One suggestion to reduce the expressive power of the kernel
is the use of projected kernels [12]. They are defined using reduced density matrices given by
ρ̃_m^V(x) = Tr_{m+1...d}[ρ^V(x)], where Tr_{m+1...d}[·] denotes the partial trace over qubits m + 1 to
d (definition in Appendix A). Then they consider the usual quantum kernel for this embedding,
q_m^V(x, x') = Tr[ρ̃_m^V(x)ρ̃_m^V(x')]. Physically, this corresponds to measuring only the first m qubits, and
the functions f in the RKHS can be written in terms of a Hermitian operator M acting on m qubits so
that f(x) = Tr[ρ^V(x)(M ⊗ id)] = Tr[ρ̃_m^V(x)M]. If V is sufficiently complex it is assumed that f
is hard to compute classically [48].
Indeed, the above procedure reduces the generalization gap. But this comes at the price of an increased
approximation error if the remaining RKHS cannot fully express the target function f* anymore,
i.e., the learned function underfits. Without any reason to believe that the target function is well
represented via the projected kernel, we cannot hope for a performance improvement by simply
reducing the size of the RKHS in an arbitrary way. However, if we know something about the data
generating process, then this can lead to a meaningful inductive bias. For the projected kernel this
could be that we know that the target function can be expressed as f*(x) = Tr[ρ̃_m^V(x)M*], see
Fig. 1. In this case using q_m^V improves the generalization error without increasing the approximation
error. To emphasize this, we will henceforth refer to q_m^V as the biased kernel.
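A minimal numpy sketch of this construction (our own illustration, using the product embedding of Example 2, a Haar-random V, and m = 1): the reduced density matrix is obtained by a partial trace over qubits 2 to d, and the resulting biased kernel concentrates around 2^{−m}, anticipating Theorem 2a) below.

import numpy as np

d = 6
rng = np.random.default_rng(0)

def haar_unitary(dim, rng):
    """Haar-random unitary via QR of a complex Gaussian matrix (standard recipe)."""
    z = (rng.standard_normal((dim, dim)) + 1j * rng.standard_normal((dim, dim))) / np.sqrt(2)
    Q, R = np.linalg.qr(z)
    return Q * (np.diag(R) / np.abs(np.diag(R)))

V = haar_unitary(2 ** d, rng)

def psi(x):
    """Product embedding of Example 2 followed by the fixed scrambling unitary V."""
    s = np.array([1.0 + 0j])
    for xi in x:
        s = np.kron(s, np.array([np.cos(xi / 2), 1j * np.sin(xi / 2)]))
    return V @ s

def reduced_rho(x):
    """rho_tilde_1^V(x): partial trace over qubits 2, ..., d of the pure state |psi^V(x)><psi^V(x)|."""
    A = psi(x).reshape(2, 2 ** (d - 1))
    return A @ A.conj().T

def biased_kernel(x, xp):
    return np.trace(reduced_rho(x) @ reduced_rho(xp)).real

x, xp = rng.uniform(0, 2 * np.pi, size=(2, d))
full_kernel = np.abs(np.vdot(psi(xp), psi(x))) ** 2    # independent of V
print(full_kernel, biased_kernel(x, xp))               # biased value roughly 2^{-m} = 1/2 (cf. Theorem 2a)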
Figure 2: Left: Spectral behavior of the biased kernel q, see Theorem 2b) and Equation (11). Right: The
biased kernel q, equipped with prior knowledge, easily learns the function for an arbitrary number of
qubits and achieves optimal mean squared error (MSE). Models that are ignorant of the structure of
f* fail to learn the function. The classical kernel k_rbf and the full quantum kernel overfit (they have
low training error, but large test error). The biased kernel on the wrong qubit, q_w, has little capacity
with the wrong bias and thus underfits (training and test error essentially overlap).
We now investigate the RKHS for reduced density matrices where V is a Haar-distributed random
unitary matrix (proof in Appendix E).
Theorem 2. Suppose V is distributed according to the Haar measure on the group of unitary
matrices. Fix m. Then the following two statements hold:
a) The reduced density operator satisfies with high probability ρ̃_m^V = 2^{−m} id + O(2^{−d/2}), and the
projected kernel satisfies with high probability q_m^V(x, x') = 2^{−m} + O(2^{−d/2}) as d → ∞.

b) Let T_{µ,m}^V denote the linear integral operator for the kernel q_m^V as defined above. Then the
averaged operator E_V[T_{µ,m}^V] has one eigenvalue 2^{−m} + O(2^{−2d}) whose eigenfunction is constant
(up to higher-order terms of order O(2^{−2d})), and 2^{2m} − 1 eigenvalues 2^{−m−d} + O(2^{−2d}).
The averaged integral operator in the second part of the result is not directly meaningful; however, it
gives some indication of the behavior of the operators T_{µ,m}^V. In particular, we expect a similar result
to hold for most V if the mean embedding ρ_µ is sufficiently mixed. A proof of this result would
require us to bound the variance of the matrix elements of T_{µ,m}^V, which is possible using standard
formulas for expectations of polynomials over the unitary group, but lengthy.
Note that the dimension of the RKHS for the biased kernel q_m^V with m qubits is bounded by 4^m. This
implies that learning is possible when projecting to sufficiently low-dimensional biased kernels such
that the training sample size satisfies n ≳ 4^m ≥ dim(F).
Let us now focus on the case m = 1, that is, the biased kernel is solely defined via the first qubit.
Assuming that Theorem 2b) also holds for fixed V, we can assume that the biased kernel has the form

    q(x, x') ≡ q_1^V(x, x') = γ_0 φ_0(x)φ_0(x') + Σ_{i=1}^{3} γ_i φ_i(x)φ_i(x'),    (11)

where γ_0 = 1/2 + O(2^{−2d}) and φ_0(x) = 1 is the constant function up to terms of order O(2^{−2d}).
Barren plateaus. Another conclusion from Theorem 2a) is that the fluctuations of the reduced
density matrix around its mean are exponentially vanishing in the number of qubits. In practice the
value of the kernel would be determined by measurements and exponentially many measurements
are necessary to obtain exponential accuracy of the kernel function. Therefore the theorem suggests
that it is not possible in practice to learn anything beyond the constant function from generic biased
kernels for (modestly) large values of d. This observation is closely related to the fact that for many
quantum neural network architectures, the gradient of the parameters with respect to the loss is
exponentially vanishing with the system size d, a phenomenon known as Barren plateaus [24, 49].
5 Experiments
Since for small d we can simulate the biased kernel efficiently, we illustrate our theoretical findings in
the following experiments. Our implementation, building on standard open source packages [50, 51],
is available online.⁴ We consider the case described above where we know that the data was generated
by measuring an observable on the first qubit, i.e., f*(x) = Tr[ρ̃_1^V(x)M*], but we do not know
M*, see Fig. 1. We use the full kernel k and the biased kernel q for the case m = 1. To show the
effect of selecting the wrong bias, we also include the behavior of a biased kernel defined only on the
second qubit, denoted as q_w. As a classical reference we also include the performance of a radial
basis function kernel k_rbf(x, x') = exp(−‖x − x'‖²/2). For the experiments we fix a single-qubit
observable M* = σ_z and perform the experiment for a varying number d of qubits. First we draw a
random unitary V. The dataset is then generated by drawing N = 200 realizations {x^(i)}_{i=1}^N from the
d-dimensional uniform distribution on [0, 2π]^d. We then define the labels as y^(i) = cf*(x^(i)) + ε^(i),
where f*(x) = Tr[ρ̃_1^V(x)σ_z], ε is Gaussian noise with Var[ε] = 10^{−4}, and c is chosen such that
Var[cf*(X)] = 1. Keeping the variances fixed ensures that we can interpret the behavior for varying d.
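The following numpy sketch (our own reconstruction of this data-generating process for small d; the actual experiment additionally fits kernel ridge regression with the kernels k, q, q_w and k_rbf) shows how such a dataset can be produced.

import numpy as np

d, N = 5, 200
rng = np.random.default_rng(0)
sigma_z = np.diag([1.0, -1.0])

z = (rng.standard_normal((2 ** d, 2 ** d)) + 1j * rng.standard_normal((2 ** d, 2 ** d))) / np.sqrt(2)
Q, R = np.linalg.qr(z)
V = Q * (np.diag(R) / np.abs(np.diag(R)))       # Haar-random scrambling unitary

def f_star(x):
    """f*(x) = Tr[rho_tilde_1^V(x) sigma_z], with the product embedding of Example 2."""
    s = np.array([1.0 + 0j])
    for xi in x:
        s = np.kron(s, np.array([np.cos(xi / 2), 1j * np.sin(xi / 2)]))
    A = (V @ s).reshape(2, 2 ** (d - 1))        # partial trace over qubits 2, ..., d
    return np.trace(A @ A.conj().T @ sigma_z).real

X = rng.uniform(0, 2 * np.pi, size=(N, d))
f_vals = np.array([f_star(x) for x in X])
c = 1.0 / np.std(f_vals)                        # empirical rescaling so that Var[c f*(X)] is about 1
y = c * f_vals + np.sqrt(1e-4) * rng.standard_normal(N)
print(y[:5])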
We first verify our findings from Theorem 2b) and Equation (11) by estimating the spectrum of q.
Fig. 2 (left) shows that Theorem 2b) also holds for individual V with high probability. We then
use 2/3 of the data for training kernel ridge regression (we fit the mean separately) with preset
regularization, and use 1/3 to estimate the test error. We average the results over ten random seeds
(random V, x^(i), ε^(i)) and report the results in the right panel of Fig. 2. This showcases that as
the number of qubits increases, it is impossible to learn f* without the appropriate spectral bias. k
and k_rbf have too little bias and overfit, whereas q_w has the wrong bias and severely underfits. The
performance of q_w underlines that randomly biasing the kernel does not significantly improve the
performance over the full kernel k. In the appendix we show that this is not due to a bad choice of
regularization, by reporting cherry-picked results over a range of regularizations.
To further illustrate the spectral properties, we empirically estimate the kernel target alignment [30]
and the task-model alignment [32] that we introduced in Sec. 2. By using the centered kernel matrix
(see App. B) and centering the data, we can ignore the first eigenvalue in (11) corresponding to the
constant function. In Figure 3 (left) we show the empirical (centered) kernel target alignment for 50
random seeds. The biased kernel is the only one well aligned with the task. The right panel of Fig. 3
shows the task model alignment. This shows that f ∗ can be completely expressed with the first four
components of the biased kernel, while the other kernels essentially need the entire spectrum (we
use a sample size of 200, hence the empirical kernel matrix is only 200 dimensional) and thus are
unable to learn. Note that the kernel q_w is four dimensional, and so higher contributions correspond
to functions outside its RKHS that it cannot learn at all.
6 Discussion
We provided an analysis of the reproducing kernel Hilbert space (RKHS) and the inductive bias
of quantum kernel methods. Rather than the dimensionality of the RKHS, its spectral properties
determine whether learning is feasible. Working with exponentially large RKHS comes with the risk
of having a correspondingly small inductive bias. This situation indeed occurs for naive quantum
encodings, and hinders learning unless datasets are of exponential size. To enable learning, we
necessarily need to consider models with a stronger inductive bias. Encoding a bias towards continuous
functions is likely not a promising path for a quantum advantage, as this is where classical machine
learning models excel.
Our results suggest that we can only achieve a quantum advantage if we know something about the
data generating process and cannot efficiently encode this classically, yet are able to use this information
to bias the quantum model. We indeed observe an exponential advantage in the case where we know
that the data comes from a single qubit observable and constrain the RKHS accordingly. However,
⁴ https://github.com/jmkuebler/quantumbias
Figure 3: Histogram of the kernel target alignment over 50 runs (left) and task model alignment
(right) for d = 7.
we find that evaluating the kernel requires exponentially many measurements, an issue related to
Barren Plateaus encountered in quantum neural networks.
With fully error-corrected quantum computers it becomes feasible to define kernels with a strong bias
that do not require exponentially many measurements. An example of this kind was recently presented
by Liu et al. [10]: here one knows that the target function is extremely simple after computing the
discrete logarithm. A quantum computer is able to encode this inductive bias by using an efficient
algorithm for computing the discrete logarithm.
However, even for fully coherent quantum computers it is unclear how we can reasonably encode a
strong inductive bias about a classical dataset (e.g., images of cancer cells, climate-data, etc.). The
situation might be better when working with quantum data, i.e., data that is collected via observations
of systems at a quantum mechanical scale. To summarize, we conclude that there is no indication
that quantum machine learning will substantially improve supervised learning on classical datasets.
The authors thank the anonymous reviewers for their helpful comments that made the theorems
and their proofs more accessible. This work was in part supported by the German Federal Ministry
of Education and Research (BMBF) through the Tübingen AI Center (FKZ: 01IS18039B) and the
Machine Learning Cluster of Excellence, EXC number 2064/1 – Project number 390727645.
References
[1] John Preskill. Quantum computing in the NISQ era and beyond. Quantum, 2:79, 2018.
[2] Frank Arute, Kunal Arya, Ryan Babbush, et al. Quantum supremacy using a programmable
superconducting processor. Nature, 574(7779):505–510, 2019.
[3] Alberto Peruzzo, Jarrod McClean, Peter Shadbolt, Man-Hong Yung, Xiao-Qi Zhou, Peter J
Love, Alán Aspuru-Guzik, and Jeremy L O’brien. A variational eigenvalue solver on a photonic
quantum processor. Nature Communications, 5:4213, 2014.
[4] Edward Farhi, Jeffrey Goldstone, and Sam Gutmann. A quantum approximate optimization
algorithm. arXiv:1411.4028, 2014.
[5] Kosuke Mitarai, Makoto Negoro, Masahiro Kitagawa, and Keisuke Fujii. Quantum circuit
learning. Phys. Rev. A, 98:032309, 2018.
[6] Edward Farhi and Hartmut Neven. Classification with quantum neural networks on near term
processors. arXiv:1802.06002, 2018.
[7] Vojtěch Havlíček, Antonio D Córcoles, Kristan Temme, Aram W Harrow, Abhinav Kandala,
Jerry M Chow, and Jay M Gambetta. Supervised learning with quantum-enhanced feature
spaces. Nature, 567(7747):209, 2019.
[8] Maria Schuld and Nathan Killoran. Quantum machine learning in feature Hilbert spaces. Phys.
Rev. Lett., 122:040504, 2019.
[9] Jonas M. Kübler, Krikamol Muandet, and Bernhard Schölkopf. Quantum mean embedding of
probability distributions. Phys. Rev. Research, 1:033159, 2019.
[10] Yunchao Liu, Srinivasan Arunachalam, and Kristan Temme. A rigorous and robust quantum
speed-up in supervised machine learning. Nature Physics, 17(9):1013–1017, 2021.
[11] Maria Schuld. Quantum machine learning models are kernel methods. arXiv:2101.11020, 2021.
[12] Hsin-Yuan Huang, Michael Broughton, Masoud Mohseni, Ryan Babbush, Sergio Boixo, Hart-
mut Neven, and Jarrod R. McClean. Power of data in quantum machine learning. Nature
Communications, 12(1):2631, 2021.
[13] Aram W Harrow, Avinatan Hassidim, and Seth Lloyd. Quantum algorithm for linear systems of
equations. Phys. Rev. Lett., 103(15):150502, 2009.
[14] Patrick Rebentrost, Masoud Mohseni, and Seth Lloyd. Quantum support vector machine for big
data classification. Phys. Rev. Lett., 113(13), 2014.
[15] Nathan Wiebe, Daniel Braun, and Seth Lloyd. Quantum algorithm for data fitting. Phys. Rev.
Lett., 109:050505, 2012.
[16] Seth Lloyd, Masoud Mohseni, and Patrick Rebentrost. Quantum algorithms for supervised and
unsupervised machine learning. arXiv:1307.0411, 2013.
[17] Iordanis Kerenidis and Anupam Prakash. Quantum gradient descent for linear systems and least
squares. Phys. Rev. A, 101:022316, 2020.
[18] Carlo Ciliberto, Andrea Rocchetto, Alessandro Rudi, and Leonard Wossnig. Statistical limits of
supervised quantum learning. Physical Review A, 102(4), 2020.
[19] Gian Giacomo Guerreschi and Mikhail Smelyanskiy. Practical optimization for hybrid quantum-
classical algorithms. arXiv:1701.01450, 2017.
[20] Maria Schuld, Ville Bergholm, Christian Gogolin, Josh Izaac, and Nathan Killoran. Evaluating
analytic gradients on quantum hardware. Physical Review A, 99(3):032331, 2019.
[21] James Stokes, Josh Izaac, Nathan Killoran, and Giuseppe Carleo. Quantum natural gradient.
Quantum, 4:269, 2020.
[22] Ryan Sweke, Frederik Wilde, Johannes Jakob Meyer, Maria Schuld, Paul K Fährmann,
Barthélémy Meynard-Piganeau, and Jens Eisert. Stochastic gradient descent for hybrid quantum-
classical optimization. Quantum, 4:314, 2020.
[23] Jonas M. Kübler, Andrew Arrasmith, Lukasz Cincio, and Patrick J. Coles. An Adaptive
Optimizer for Measurement-Frugal Variational Algorithms. Quantum, 4:263, 2020.
[24] Jarrod R. McClean, Sergio Boixo, Vadim N. Smelyanskiy, Ryan Babbush, and Hartmut Neven.
Barren plateaus in quantum neural network training landscapes. Nature Communications, 9:
4812, 2018.
[25] Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels. MIT Press, Cambridge,
MA, USA, 2002.
[26] John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge
University Press, 2004.
[27] Maria Schuld, Ryan Sweke, and Johannes Jakob Meyer. Effect of data encoding on the
expressive power of variational quantum-machine-learning models. Phys. Rev. A, 103:032430,
2021.
[28] Hsin-Yuan Huang, Richard Kueng, and John Preskill. Information-theoretic bounds on quantum
advantage in machine learning. Phys. Rev. Lett., 126:190505, 2021.
[29] Robert C. Williamson, Alexander J. Smola, and Bernhard Schölkopf. Generalization perfor-
mance of regularization networks and support vector machines via entropy numbers of compact
operators. IEEE Transactions on Information Theory, 47(6):2516–2532, 2001.
[30] Nello Cristianini, Jaz Kandola, Andre Elisseeff, and John Shawe-Taylor. On kernel target
alignment. In Dawn E. Holmes and Lakhmi C. Jain, editors, Innovations in Machine Learning:
Theory and Applications, pages 205–256. Springer Berlin Heidelberg, 2006.
[31] Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Algorithms for learning kernels
based on centered alignment. Journal of Machine Learning Research, 13(28):795–828, 2012.
[32] Abdulkadir Canatar, Blake Bordelon, and Cengiz Pehlevan. Spectral bias and task-model
alignment explain generalization in kernel regression and infinitely wide neural networks.
Nature Communications, 12(1):2914, 2021.
[33] Arthur Jacot, Berfin Simsek, Francesco Spadaro, Clement Hongler, and Franck Gabriel. Kernel
alignment risk estimator: Risk prediction from training data. In NeurIPS, 2020.
[34] Bernhard Schölkopf, Ralf Herbrich, and Alexander J. Smola. A generalized representer theorem.
In COLT, 2001.
[35] Thomas Hubregtsen, David Wierichs, Elies Gil-Fuster, Peter-Jan H. S. Derks, Paul K.
Faehrmann, and Johannes Jakob Meyer. Training quantum embedding kernels on near-term
quantum computers. arXiv:2105.02276, 2021.
[36] Michael A. Nielsen and Isaac L. Chuang. Quantum Computation and Quantum Information:
10th Anniversary Edition. Cambridge University Press, 2010.
[37] Xinbiao Wang, Yuxuan Du, Yong Luo, and Dacheng Tao. Towards understanding the power of
quantum kernels in the NISQ era. Quantum, 5:531, 2021.
[38] Evan Peters, João Caldeira, Alan Ho, Stefan Leichenauer, Masoud Mohseni, Hartmut Neven,
Panagiotis Spentzouris, Doug Strain, and Gabriel N. Perdue. Machine learning of high dimen-
sional data on a noisy quantum processor. arXiv:2101.09581, 2021.
[39] Adrián Pérez-Salinas, Alba Cervera-Lierta, Elies Gil-Fuster, and José I. Latorre. Data re-
uploading for a universal quantum classifier. Quantum, 4:226, 2020.
[40] M. Cerezo, Andrew Arrasmith, Ryan Babbush, Simon C. Benjamin, Suguru Endo, Keisuke
Fujii, Jarrod R. McClean, Kosuke Mitarai, Xiao Yuan, Lukasz Cincio, and Patrick J. Coles.
Variational quantum algorithms. Nature Reviews Physics, 3(9):625–644, 2021.
[41] Kishor Bharti, Alba Cervera-Lierta, Thi Ha Kyaw, et al. Noisy intermediate-scale quantum
(NISQ) algorithms. arXiv:2101.08448, 2021.
[42] Leo Breiman. Random forests. Mach. Learn., 45(1):5–32, 2001.
[43] Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares
algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.
[44] Ingo Steinwart, Don R. Hush, and Clint Scovel. Optimal rates for regularized least squares
regression. In COLT, 2009.
[45] Simon Fischer and Ingo Steinwart. Sobolev norm learning rates for regularized least-squares
algorithms. Journal of Machine Learning Research, 21(205):1–38, 2020.
[46] Krikamol Muandet, Kenji Fukumizu, Bharath K. Sriperumbudur, and Bernhard Schölkopf.
Kernel mean embedding of distributions: A review and beyond. Found. Trends Mach. Learn.,
10(1-2):1–141, 2017.
[47] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning.
Adaptive computation and machine learning. MIT Press, 2012.
[48] Aram W. Harrow and Ashley Montanaro. Quantum computational supremacy. Nature, 549
(7671):203–209, 2017.
[49] M. Cerezo, Akira Sone, Tyler Volkoff, Lukasz Cincio, and Patrick J. Coles. Cost function
dependent barren plateaus in shallow parametrized quantum circuits. Nature Communications,
12(1):1791, 2021.
[50] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, et al. Scikit-learn: Machine learning
in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[51] Ville Bergholm, Josh Izaac, Maria Schuld, Christian Gogolin, and Nathan Killoran. Pennylane:
Automatic differentiation of hybrid quantum-classical computations. arXiv:1811.04968, 2018.
[52] Vern I. Paulsen and Mrinal Raghupathi. An Introduction to the Theory of Reproducing Kernel
Hilbert Spaces. Cambridge Studies in Advanced Mathematics. Cambridge University Press,
2016.
[53] Zbigniew Puchała and Jarosław A. Miszczak. Symbolic integration with respect to the Haar
measure on the unitary groups. Bulletin of the Polish Academy of Sciences: Technical Sciences,
65(No 1):21–27, 2017.
[54] Christoph Dankert, Richard Cleve, Joseph Emerson, and Etera Livine. Exact and approximate
unitary 2-designs and their application to fidelity estimation. Phys. Rev. A, 80:012304, 2009.