The Annals of Statistics

2010, Vol. 38, No. 1, 1–50


DOI: 10.1214/08-AOS648
© Institute of Mathematical Statistics, 2010

THE SPECTRUM OF KERNEL RANDOM MATRICES

BY NOUREDDINE EL KAROUI¹
University of California, Berkeley
We place ourselves in the setting of high-dimensional statistical inference
where the number of variables p in a dataset of interest is of the same order
of magnitude as the number of observations n.
We consider the spectrum of certain kernel random matrices, in particular
n × n matrices whose (i, j)th entry is $f(X_i'X_j/p)$ or $f(\|X_i - X_j\|_2^2/p)$
where p is the dimension of the data, and Xi are independent data vectors.
Here f is assumed to be a locally smooth function.
The study is motivated by questions arising in statistics and computer sci-
ence where these matrices are used to perform, among other things, nonlinear
versions of principal component analysis. Surprisingly, we show that in high-
dimensions, and for the models we analyze, the problem becomes essentially
linear—which is at odds with heuristics sometimes used to justify the usage
of these methods. The analysis also highlights certain peculiarities of models
widely studied in random matrix theory and raises some questions about their
relevance as tools to model high-dimensional data encountered in practice.

1. Introduction. Recent years have seen newfound theoretical interest in the
properties of large-dimensional sample covariance matrices. With the increase in
the size and dimensionality of datasets to be analyzed, questions have been raised
about the practical relevance of information derived from classical asymptotic re-
sults concerning spectral properties of sample covariance matrices. To address
these concerns, one line of analysis has been the consideration of asymptotics
where both the sample size n and the number of variables p in the dataset go
to infinity jointly while assuming, for instance, that p/n has a limit.
Questions of this type concerning the spectral properties of large-dimensional
matrices have been and are being addressed in a variety of fields, from physics to
various areas of mathematics. While the topic is classical, with the seminal
contribution [43] dating back to the 1950s, there has been renewed and vigorous
interest in the study of large-dimensional random matrices in the last decade or
so. This has led to new insights and the appearance of new “canonical” distribu-
tions [38], new tools (see [41]) and, in Statistics, a sense that one needs to exert
caution with familiar techniques of multivariate analysis when the dimension of
the data gets large and the sample size is of the same order of magnitude as that
dimension.

Received December 2007; revised August 2008.
¹ Supported by NSF Grant DMS-06-05169.
AMS 2000 subject classifications. Primary 62H10; secondary 60F99.
Key words and phrases. Covariance matrices, kernel matrices, eigenvalues of covariance
matrices, multivariate statistical analysis, high-dimensional inference, random matrix
theory, machine learning, Hadamard matrix functions, concentration of measure.

So far in Statistics, this line of work has been concerned mostly with the proper-
ties of sample covariance matrices. In a seminal paper, Marčenko and Pastur [30]
showed a result that, from a statistical standpoint, may be interpreted as saying,
roughly, that the histogram of the eigenvalues of a sample (i.e., random)
covariance matrix is asymptotically a deterministic nonlinear deformation
of the histogram of the eigenvalues of the population covariance matrix.
Remarkably, they managed to characterize this deformation for fairly general population
covariances. Their result was shown in great generality and introduced new tools
to the field including one that has become ubiquitous, the Stieltjes transform of
a distribution. In its best-known form, their result says that when the population
covariance is identity, and hence all the population eigenvalues are equal to 1, in
the limit the sample eigenvalues are split, and if $p \le n$, they are spread over
$[(1 - \sqrt{p/n})^2, (1 + \sqrt{p/n})^2]$ according to a fully explicit density now known as
the density of the Marčenko–Pastur law. Their result was later rediscovered
independently in [42] (under slightly weaker conditions) and generalized to the case
of nondiagonal covariance matrices in [37] under some particular distributional
assumptions which we discuss later in the paper.
On the other hand, recent developments have been concerned with fine proper-
ties of the largest eigenvalue of random matrices which became amenable to analy-
sis after mathematical breakthroughs which happened in the 1990s (see [38, 39]
and [40]). Classical statistical work on joint distribution of eigenvalues of sample
covariance matrices (see [1] for a good reference) then became usable for analysis
in high dimensions. In particular, in the case of Gaussian distributions with identity
covariance, it was shown in [27] and [16] that the largest eigenvalue of the sample
covariance matrix is Tracy–Widom distributed. More recent progress [17] man-
aged to carry out the analysis for essentially general population covariance. On
the other hand, models for which the population covariance has a few separated
eigenvalues have also been of interest (see, for instance, [8] and [31]). Besides the
particulars of the different types of fluctuations that can be encountered (Tracy–Widom,
Gaussian or other), researchers have been able to precisely localize these
largest eigenvalues. One interesting aspect of those results is the fact that in the
high-dimensional setting of interest to us, the largest eigenvalues are always
positively biased, with the bias being sometimes large. (We also note that in the case of
i.i.d. data, which naturally is less interesting in statistics, results on the
localization of the largest eigenvalue have been available for quite some time now, after the
works [21] and [45], to cite a few.) This is naturally in sharp contrast to classical
results of multivariate analysis which show $\sqrt{n}$-consistency of all sample
eigenvalues, though the possibility of bias is a simple consequence of Jensen’s inequality.
On the other hand, there has been much less theoretical work on kernel random
matrices. By this term, we mean matrices with $(i,j)$ entry of the form
$$M_{i,j} = k(X_i, X_j),$$
where $M$ is an $n \times n$ matrix, $X_i$ is a $p$-dimensional data vector and $k$ is a function
of two variables, often called a kernel, that may depend on $n$. Common choices of
kernels include, for instance, $k(X_i, X_j) = f(\|X_i - X_j\|_2^2/t)$, where $f$ is a function
and $t$ is a scalar, or $k(X_i, X_j) = f(X_i'X_j/t)$. For the function $f$, common choices
include $f(x) = \exp(-x)$, $f(x) = \exp(-x^a)$, for a certain scalar $a$, $f(x) = (1 + x)^a$,
or $f(x) = \tanh(a + bx)$, where $b$ is a scalar. We refer the reader to [33],
Chapter 4, or [44] for more examples.
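
As an illustration of the two families of kernels just described, the following is a minimal NumPy sketch. The function names `inner_product_kernel` and `distance_kernel`, the sample sizes, the random seed and the Gaussian data are hypothetical choices made for this example only; the rows of `X` play the role of the data vectors $X_i$.

```python
import numpy as np

def inner_product_kernel(X, f, t):
    """Kernel matrix with (i, j) entry f(X_i' X_j / t); rows of X are the X_i."""
    return f(X @ X.T / t)

def distance_kernel(X, f, t):
    """Kernel matrix with (i, j) entry f(||X_i - X_j||_2^2 / t)."""
    sq_norms = np.sum(X**2, axis=1)
    # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x'y; clip tiny negative rounding errors.
    sq_dists = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T, 0.0)
    return f(sq_dists / t)

# Illustrative data (assumption): n = 5 standard Gaussian vectors in dimension p = 3.
rng = np.random.default_rng(0)
n, p = 5, 3
X = rng.standard_normal((n, p))

M_inner = inner_product_kernel(X, np.tanh, t=p)          # f(x) = tanh(x), t = p
M_dist = distance_kernel(X, lambda x: np.exp(-x), t=p)   # f(x) = exp(-x), t = p
```

Both matrices are $n \times n$ and symmetric; with $f(x) = \exp(-x)$ the distance kernel has ones on its diagonal, since $\|X_i - X_i\|_2^2 = 0$.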
In particular, we are not aware of any work in the setting of high-dimensional
data analysis, where p grows with n. However, given the practical success and
flexibility of these methods (we refer to [35] for an introduction), it is natural to
try to investigate theoretically their properties. Further, as illustrated in the data
analytic part of [44], an n/p boundedness assumption is not unrealistic as far as
applications of kernel methods are concerned. One aim of the present paper is to
shed some theoretical light on the properties of these kernel random matrices and
to do so in relatively wide generality. We note that the choice of renormalization
that we make below is motivated in part by the arguments of [44] and their practical
choices of kernels for data of varying dimensions.
Existing theory on kernel random matrices (see, for instance, the interesting
[28]) for fixed-dimensional input data predicts that the eigenvalues of kernel ran-
dom matrices behave—at least for the largest ones—like the eigenvalues of the
corresponding operator on L2 (dP ), if the data is i.i.d. with probability distribu-
tion P . To be more precise, if Xi is a sequence of i.i.d. random variables with
distribution P , under regularity conditions on the kernel k(x, y), it was shown in
[28] that, for any index l, the lth largest eigenvalue of the kernel matrix M with
entries
$$M_{i,j} = \frac{1}{n} k(X_i, X_j)$$
converges to the $l$th largest eigenvalue of the operator $K$ defined as
$$Kf(x) = \int k(x, y) f(y)\, dP(y).$$

These insights have also been derived through more heuristic but nonetheless en-
lightening arguments in, for instance, [44]. Further, more precise fluctuation re-
sults are also given in [28]. We also note interesting work on Laplacian eigenmaps
(see, e.g., [9]) where, among other things, results have been obtained showing
convergence of eigenvalues and eigenvectors of certain Laplacian random matrices
(which are quite closely connected to kernel random matrices) computed from data
sampled from a manifold, to corresponding quantities for the Laplace–Beltrami
operator on the manifold.
These results are in turn used in the literature to explain the behavior of non-
linear versions of standard procedures of multivariate statistics, such as Principal
Component Analysis (PCA), Canonical Correlation Analysis (CCA) or Independent
Component Analysis (ICA). We refer the reader to [36] for an introduction
to kernel-PCA, and to [2] for an introduction to kernel-CCA and kernel-ICA. At
the heart of these techniques are the spectral properties of kernel random matrices.
Because these techniques are used in bioinformatics, a field where large datasets
are common and becoming the norm, it is natural to ask what can be said about
these spectral properties for high-dimensional data.
We show that for the models we analyze (ICA-type models and generalizations
that go beyond the linear setting of ICA), kernel random matrices essentially be-
have like sample covariance matrices and hence their eigenvalues suffer from the
same bias problems that affect sample covariance matrices in high-dimensions. In
particular, if one were to try to apply the heuristics of [44] which were developed
for low-dimensional problems, to the high-dimensional case, the predictions would
be quite wildly wrong. (A simple example is provided by the Gaussian kernel with
i.i.d. Gaussian data where the computations can be done completely explicitly as
explained in [44].) We also note that the scaling we use is different from the one
used in low dimensions, where the matrices are scaled by 1/n. This is because the
high-dimensional problem would be completely degenerate if we used this normal-
ization in our setting. However, our results still give information about the problem
when it is scaled by 1/n.
From a random matrix point of view, our study is connected to the study of
Euclidean random matrices and distance matrices which is of some interest in, for
instance, Physics. We refer to [11] and [12] for work in this direction in the low
(or fixed) dimensional setting. We also note that at the level of generality we place
ourselves in, the random matrices we study do not seem to be amenable to study
through the classical tools of random matrix theory. Hence, besides their obvious
statistical interest, they are also interesting on purely mathematical grounds.
We now turn to the gist of our paper, which will show that high-dimensional ker-
nel random matrices behave spectrally essentially like matrices closely connected
to sample covariance matrices. We will get two types of results: in Theorems 2.1
and 2.2, we get a strong approximation result (in operator norm) for standard mod-
els (ICA-like) studied in random matrix theory. In Theorems 2.3 and 2.4, we char-
acterize the limiting spectral distribution of our kernel random matrices, for a wider
class of data distributions. In Section 2, we also state clearly the consequences of
our theorems and review the relevant theory of high-dimensional sample covari-
ance matrices. From a technical standpoint, we adopt a point of view centered
on the concentration of measure phenomenon, as exposed for instance in [29] as it
provides a unified way to treat the two types of results we are interested in. Finally,
we discuss in our (self-contained) conclusion (Section 3), the consequences of our
results and in particular some possible limitations of “standard” random matrix
models as tools to model data encountered in practice, focusing on geometric
properties of datasets drawn according to those models. As explained in more detail
there, vectors drawn according to these standard random matrix models essentially
live close to spheres and are almost orthogonal to one another, a property that may
or may not be present in datasets to be analyzed and can be seen as a key to many
classical and less classical random matrix results (see also [19]).

2. Spectrum of kernel random matrices. Kernel random matrices do not
seem to be amenable to analysis through the usual tools of random matrix theory.
In particular, for general f , it seems difficult to carry out either a method of mo-
ments proof, or a Stieltjes transform proof, or a proof that relies on knowing the
density of the eigenvalues of the matrix.
Hence, we take an indirect approach. Our strategy is to find approximations of
the kernel random matrix that have two properties. First, the approximation matrix
is analyzable or has already been analyzed in random matrix theory. Second, the
quality of the approximation is good enough that spectral properties of the approx-
imating matrix can be shown to carry over to the kernel matrix.
The strategy in the first two theorems is to derive an operator norm “consistent”
approximation of our kernel matrix. In other words, if we call M our kernel matrix,
we will find $K$ such that $\|M - K\|_2 \to 0$, as $n$ and $p$ tend to $\infty$. Note that both $M$
and K are real symmetric (and hence Hermitian) here. We explain after the state-
ment of Theorem 2.1 why operator norm consistency is a desirable property. But
let us say that in a nutshell, it implies consistency for each individual eigenvalue
as well as eigenspaces corresponding to separated eigenvalues.
For the second set of theorems (Theorems 2.3 and 2.4), we will relax the distri-
butional assumptions made on the data, but, at the expense of the precision of the
results we will obtain, we will characterize the limiting spectral distribution of our
kernel random matrices.
Our theorems below show that kernel random matrices can be well approxi-
mated by matrices that are closely connected to large-dimensional covariance ma-
trices. The spectral properties of those matrices have been the subject of a signifi-
cant amount of work in recent and less recent years, and hence this knowledge, or
at least part of it, can be transferred to kernel random matrices. In particular, we
refer the reader to [4, 5, 8, 17, 19, 21, 27, 30, 31, 37, 42] and [45] for some of the
most statistically relevant results in this area. We review some of them now.

2.1. Some results on large-dimensional sample covariance matrices. Since
our main theorems are approximating theorems, we first wish to state some of
the properties of the objects we will use to approximate kernel random matrices.
In what follows, we consider an n × p data matrix, with, say p/n having a fi-
nite nonzero limit. Most of the results that have been obtained are of two types:
either they are so-called “bulk” results and concern essentially the spectral distri-
bution (or loosely speaking the histogram of eigenvalues) of the random matrices
of interest; or they concern the localization and fluctuation behavior of extreme
eigenvalues of these random matrices.

2.1.1. Spectral distribution results. An object of interest in random matrix
theory is the spectral distribution of random matrices. Let us call $l_i$ the
decreasingly ordered eigenvalues of our random matrix, and let us assume we are working
with an $n \times n$ matrix, $M_n$. The empirical spectral distribution of an $n \times n$ matrix is
the probability measure which puts mass $1/n$ at each of its eigenvalues. In other
words, if we call $F_n$ this probability measure, we have
$$dF_n(x) = \frac{1}{n} \sum_{i=1}^{n} \delta_{l_i}(x).$$
Note that the histogram of eigenvalues represents an integrated version of this
measure.
For random matrices, this measure Fn is naturally a random measure. A key
result in the area of covariance matrices is that if we observe i.i.d. data vectors $X_i$,
with $X_i = \Sigma_p^{1/2} Y_i$, where $\Sigma_p$ is a positive semi-definite matrix and $Y_i$ is a
vector with i.i.d. entries, under weak moment conditions on $Y_i$ and assuming that the
spectral distribution of $\Sigma_p$ has a limit (in the sense of weak convergence of
distributions), $F_n$ converges to a nonrandom measure which we call $F$.
We call the models $X_i = \Sigma_p^{1/2} Y_i$ the “standard” models of random matrix
theory because most results have been derived under these assumptions. In particular,
various results [5, 6, 21] show, among many other things, that when the entries of
the vector $Y$ have 4 (absolute) moments, the largest eigenvalues of the sample
covariance matrix $X'X/n$, where $X_i$ now occupies the $i$th row of the $n \times p$ matrix $X$,
stay close to the endpoint of the support of $F$.
A natural question is therefore to try to characterize F . Except in particular
situations, it is difficult to do so explicitly. However, it is possible to characterize
a certain transformation of F . The tool of choice in this context is the Stieltjes
transform of a distribution. It is a function defined on $\mathbb{C}^+$ by the formula, if we
call $St_F$ the Stieltjes transform of $F$,
$$St_F(z) = \int \frac{dF(\lambda)}{\lambda - z}, \qquad \mathrm{Im}[z] > 0.$$
In particular for empirical spectral distributions, we see that, if $F_n$ is the spectral
distribution of the matrix $M_n$,
$$St_{F_n}(z) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{l_i - z} = \frac{1}{n} \mathrm{trace}\big((M_n - z\,\mathrm{Id})^{-1}\big).$$
The importance of the Stieltjes transform in the context of random matrix theory
stems from two facts: on the one hand, it is connected fairly explicitly to the matri-
ces that are being analyzed; on the other hand, pointwise convergence of Stieltjes
transform implies weak convergence of distributions, if a certain mass preservation
condition is satisfied. This is how a number of bulk results are therefore proved. For
a clear and self-contained introduction to the connection between Stieltjes trans-
forms and weak convergence of probability measures, we refer the reader to [22].
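
The two expressions for the Stieltjes transform of an empirical spectral distribution given above (the average over eigenvalues and the normalized trace of the resolvent) can be compared numerically. The sketch below is only an illustration: the helper names, the test matrix, the seed and the point `z` are assumptions made for this example.

```python
import numpy as np

def stieltjes_from_eigenvalues(eigs, z):
    """(1/n) * sum_i 1 / (l_i - z)."""
    return np.mean(1.0 / (eigs - z))

def stieltjes_from_resolvent(M, z):
    """(1/n) * trace((M - z Id)^{-1})."""
    n = M.shape[0]
    return np.trace(np.linalg.inv(M - z * np.eye(n))) / n

rng = np.random.default_rng(0)
n = 50
G = rng.standard_normal((n, n))
M = (G + G.T) / 2.0            # an arbitrary real symmetric test matrix (assumption)
z = 0.3 + 0.5j                 # any point with Im[z] > 0

eigs = np.linalg.eigvalsh(M)
s_eig = stieltjes_from_eigenvalues(eigs, z)
s_res = stieltjes_from_resolvent(M, z)
```

The two values agree up to rounding error, and both have positive imaginary part, as expected for the Stieltjes transform of a probability measure evaluated in the upper half-plane.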
The result of [30], later generalized by [37] for standard random matrix models
with nondiagonal covariance, and more recently by [19], away from those standard
models, is a functional characterization of the limit $F$. If one calls $w_n(z)$
the Stieltjes transform of the empirical spectral distribution of $XX'/n$, $w_n(z)$
converges pointwise (and almost surely after [37]) to a nonrandom $w(z)$ which, as a
function, is a Stieltjes transform. Moreover, $w$, the Stieltjes transform of $F$,
satisfies the equation, if $p/n \to \rho$, $\rho > 0$,
$$-\frac{1}{w(z)} = z - \rho \int \frac{\lambda\, dH(\lambda)}{1 + \lambda w},$$
where $H$ is the limiting spectral distribution of $\Sigma_p$, assuming that such a
distribution exists. We note that [37] proved the result under a second moment condition
on the entries of $Y_i$.
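
As a worked special case (a sketch added for illustration, not a statement taken from the references), suppose $H$ is a point mass at 1. The integral then reduces to $1/(1 + w(z))$, and multiplying both sides by $w(z)(1 + w(z))$ turns the functional equation into a quadratic equation in $w(z)$:

```latex
-\frac{1}{w(z)} = z - \frac{\rho}{1 + w(z)}
\quad\Longleftrightarrow\quad
z\,w(z)^2 + (z + 1 - \rho)\,w(z) + 1 = 0 .
```

Since a Stieltjes transform maps the upper half-plane into itself, the relevant solution is the root with $\mathrm{Im}[w(z)] > 0$ when $\mathrm{Im}[z] > 0$.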
From this result, [30] derived that in the case where $\Sigma_p = \mathrm{Id}$, and hence $dH = \delta_1$,
the empirical spectral distribution has a limit whose density is, if $\rho \le 1$,
$$f_\rho(x) = \frac{1}{2\pi\rho} \frac{\sqrt{(b - x)(x - a)}}{x},$$
where $a = (1 - \rho^{1/2})^2$ and $b = (1 + \rho^{1/2})^2$. The difference between the population
spectral distribution (a point mass at 1, of mass 1) and the limit of the empirical
spectral distribution is quite striking.
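
The contrast between the population spectrum (all eigenvalues equal to 1) and the sample spectrum can be seen in a small simulation. The sketch below assumes Gaussian entries, $\Sigma_p = \mathrm{Id}$, and the particular values of `n`, `p` and the seed; the function name `mp_density` is a hypothetical helper that evaluates the density $f_\rho$ above on its support $[a, b]$.

```python
import numpy as np

def mp_density(x, rho):
    """Marchenko-Pastur density f_rho(x) for 0 < rho <= 1 (zero outside [a, b])."""
    a, b = (1 - np.sqrt(rho))**2, (1 + np.sqrt(rho))**2
    x = np.asarray(x, dtype=float)
    inside = (x > a) & (x < b)
    dens = np.zeros_like(x)
    dens[inside] = np.sqrt((b - x[inside]) * (x[inside] - a)) / (2 * np.pi * rho * x[inside])
    return dens

rng = np.random.default_rng(0)
n, p = 2000, 500
rho = p / n
X = rng.standard_normal((n, p))          # Sigma_p = Id, Gaussian entries (illustrative choice)
eigs = np.linalg.eigvalsh(X.T @ X / n)   # eigenvalues of the p x p sample covariance matrix

a, b = (1 - np.sqrt(rho))**2, (1 + np.sqrt(rho))**2
grid = np.linspace(a, b, 20001)
# Riemann-sum approximation of the integral of the density over [a, b].
total_mass = np.sum(mp_density(grid, rho)) * (grid[1] - grid[0])
```

With $\rho = 1/4$, the support is $[0.25, 2.25]$: although every population eigenvalue equals 1, the sample eigenvalues spread over essentially this whole interval, and a histogram of `eigs` (with density normalization) can be compared against `mp_density(grid, rho)`.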

2.1.2. Largest eigenvalues results. Another line of work has been focused on
the behavior of extreme eigenvalues of sample covariance matrices. In particular,
[21] showed, under some moment conditions, that when $\Sigma_p = \mathrm{Id}_p$,
$l_1(X'X/n) \to (1 + \sqrt{p/n})^2$ almost surely. In other words, the largest eigenvalue stays close to
the endpoint of the limiting spectral distribution of $X'X/n$. This result was later
generalized in [45], and was shown to be true under the assumption of finite fourth
moment only, for data with mean 0. In recent years, fluctuation results have been
obtained for this largest eigenvalue which is of practical interest in Principal Com-
ponents Analysis (PCA). Under Gaussian assumptions, [16] and [27] (see also [20]
and [26]) showed that the fluctuations of the largest eigenvalue are Tracy–Widom
distributed. For the general covariance case, similar results, as well as localization
information, were recently obtained in [17]. We note that the localization infor-
mation (i.e., a formula) that was discovered in this latter paper was shown to hold
for a wide variety of standard random matrix models through appeal to [5]. We
refer the interested reader to Fact 2 in [17] for more information. Interesting work
has also been done on so-called “spiked” models where a few population eigen-
values are separated from the bulk of them. In particular, in the case where all
population eigenvalues are equal, except for one that is significantly larger (see [7]
for the discovery of an interesting phase transition), [31] showed, in the Gaussian
case, inconsistency of the largest sample eigenvalue, as well as the fact that the
angle between the population and sample principal eigenvectors is bounded away
from 0. Paul [31] also obtained fluctuation information about the largest empirical
eigenvalue. Finally, we note that the same inconsistency of eigenvalue result was
also obtained in [8], beyond the Gaussian case.
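
The almost sure limit $(1 + \sqrt{p/n})^2$ of the largest eigenvalue recalled at the start of this subsection does not require Gaussian data. The following sketch checks it for entries equal to $\pm 1$ with probability $1/2$ each; the entry distribution, the dimensions and the seed are assumptions chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 1500, 500
X = rng.choice([-1.0, 1.0], size=(n, p))     # i.i.d. +/-1 entries: mean 0, variance 1
l1 = np.linalg.eigvalsh(X.T @ X / n)[-1]     # largest eigenvalue of X'X/n
edge = (1 + np.sqrt(p / n))**2               # (1 + sqrt(p/n))^2
```

Here the population covariance is the identity, so the largest population eigenvalue is 1, while `l1` sits near `edge`, which is well above 2 for these dimensions.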

2.1.3. Notation. Let us now define some notation and add some clarifications.
We denote by $A'$ the transpose of $A$. The matrices we will be working with
all have real entries. We remind the reader that if $A$ and $B$ are two rectangular
matrices, $AB$ and $BA$ have the same eigenvalues, except possibly for a certain
number of zeros. We will make repeated use of this fact, for example, for matrices
like $X'X$ and $XX'$. In the case where $A$ and $B$ are both square, $AB$ and $BA$ have
exactly the same eigenvalues.
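
The fact about the eigenvalues of $X'X$ and $XX'$ can be checked on a small example. The sizes, the seed and the Gaussian entries below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 6, 3
X = rng.standard_normal((n, p))

eig_small = np.linalg.eigvalsh(X.T @ X)   # p eigenvalues of X'X, in increasing order
eig_large = np.linalg.eigvalsh(X @ X.T)   # n eigenvalues of XX', in increasing order

nonzero_of_large = eig_large[-p:]         # the p largest eigenvalues of XX'
extra_zeros = eig_large[:n - p]           # the remaining n - p eigenvalues
```

Up to rounding error, `nonzero_of_large` coincides with `eig_small`, and the $n - p$ extra eigenvalues of $XX'$ are zero.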
We will also need various norms on matrices. We will use the so-called operator
norm, which we denote by $\|A\|_2$, which corresponds to the largest singular value
of $A$, that is, $\max_i \sqrt{l_i(A'A)}$. We occasionally denote the largest singular value
of $A$ by $\sigma_1(A)$. Clearly, for positive semi-definite matrices, the largest singular
value is equal to the largest eigenvalue. Finally, we will sometimes need to use the
Frobenius (or Hilbert–Schmidt) norm of a matrix $A$. We denote it by $\|A\|_F$. By
definition, it is simply, because we are working with matrices with real entries,
$$\|A\|_F^2 = \sum_{i,j} A_{i,j}^2.$$

Further, we use $\circ$ to denote the Hadamard (i.e., entrywise) product of two
matrices. We denote by $\mu_m$ the $m$th moment of a random variable. Note that by a slight
abuse of notation, we might also use the same notation to refer to the $m$th absolute
moment (i.e., $E|X|^m$) of a random variable, but if there is any ambiguity, we will
naturally make precise which definition we are using.
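
For readers who wish to experiment numerically, the norms and the Hadamard product defined above map onto standard NumPy operations. The matrices, their sizes and the seed in this sketch are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((4, 3))

op_norm = np.linalg.norm(A, 2)                               # largest singular value of A
op_norm_alt = np.sqrt(np.max(np.linalg.eigvalsh(A.T @ A)))   # sqrt of largest eigenvalue of A'A
frob_sq = np.linalg.norm(A, "fro")**2                        # squared Frobenius norm
frob_sq_alt = np.sum(A**2)                                   # sum of squared entries
hadamard = A * B                                             # entrywise (Hadamard) product
```

The two computations of the operator norm agree, as do the two computations of the squared Frobenius norm, and the operator norm never exceeds the Frobenius norm.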
Finally, in the discussion of standard random matrix models that follows, there
will be arrays of random variables and a.s. convergence. We work with random
variables defined on a common probability space. To each ω corresponds an
infinite-dimensional array of numbers. Unless otherwise noted, the n × p matrices
we will use in what follows are the “upper-left” corner of this array.
We now turn to the study of kernel random matrices. We will show that we
can approximate them by matrices that are closely connected to sample covariance
matrices in high-dimensions and, therefore, that a number of the results we just
reviewed also apply to them.

2.2. Inner-product kernel matrices: $f(X_i'X_j/p)$.

THEOREM 2.1 (Spectrum of inner product kernel random matrices). Let us
assume that we observe $n$ i.i.d. random vectors, $X_i$ in $\mathbb{R}^p$. Let us consider the
kernel matrix $M$ with entries
$$M_{i,j} = f\left(\frac{X_i'X_j}{p}\right).$$
We assume that:
(a) $n \asymp p$, that is, $n/p$ and $p/n$ remain bounded as $p \to \infty$.
(b) $\Sigma_p$ is a positive semi-definite $p \times p$ matrix, and $\|\Sigma_p\|_2 = \sigma_1(\Sigma_p)$ remains
bounded in $p$, that is, there exists $K > 0$, such that $\sigma_1(\Sigma_p) \le K$, for all $p$.
(c) $\mathrm{trace}(\Sigma_p)/p$ has a finite limit, that is, there exists $l \in \mathbb{R}$ such that
$\lim_{p\to\infty} \mathrm{trace}(\Sigma_p)/p = l$.
(d) $X_i = \Sigma_p^{1/2} Y_i$.
(e) The entries of $Y_i$, a $p$-dimensional random vector, are i.i.d. Also, denoting
by $Y_i(k)$ the $k$th entry of $Y_i$, we assume that $E(Y_i(k)) = 0$, $\mathrm{var}(Y_i(k)) = 1$
and $E(|Y_i(k)|^{4+\varepsilon}) < \infty$ for some $\varepsilon > 0$. (We say that $Y_i$ has $4 + \varepsilon$ absolute
moments.)
(f) $f$ is a $C^1$ function in a neighborhood of $l = \lim_{p\to\infty} \mathrm{trace}(\Sigma_p)/p$ and a $C^3$
function in a neighborhood of 0.
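Under these assumptions, the kernel matrix $M$ can (in probability) be approximated
consistently in operator norm, when $p$ and $n$ tend to $\infty$, by the matrix $K$,
where
$$K = \left(f(0) + f''(0)\frac{\mathrm{trace}(\Sigma_p^2)}{2p^2}\right) 11' + f'(0)\frac{XX'}{p} + \upsilon_p\, \mathrm{Id}_n,$$
where
$$\upsilon_p = f\left(\frac{\mathrm{trace}(\Sigma_p)}{p}\right) - f(0) - f'(0)\frac{\mathrm{trace}(\Sigma_p)}{p}.$$
In other words,
$$\|M - K\|_2 \to 0 \quad \text{in probability, when } p \to \infty.$$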
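
The statement of Theorem 2.1 can be explored numerically. The following is a minimal sketch under assumptions that are ours and not the paper's: Gaussian entries, $\Sigma_p = \mathrm{Id}$, $f = \exp$ (so $f(0) = f'(0) = f''(0) = 1$), and the specific dimensions and seed. The helper `kernel_approximation` is a hypothetical name for a function that assembles $K$ from the formula above.

```python
import numpy as np

def kernel_approximation(X, Sigma, f, f0, f1, f2):
    """Matrix K of Theorem 2.1; f0, f1, f2 stand for f(0), f'(0), f''(0)."""
    n, p = X.shape
    tau = np.trace(Sigma) / p
    upsilon = f(tau) - f0 - f1 * tau
    # For symmetric Sigma, trace(Sigma^2) equals the sum of squared entries.
    const = f0 + f2 * np.sum(Sigma * Sigma) / (2 * p**2)
    return const * np.ones((n, n)) + f1 * (X @ X.T) / p + upsilon * np.eye(n)

rng = np.random.default_rng(0)
n, p = 500, 1000
Sigma = np.eye(p)                       # illustrative choice; then X_i = Y_i
X = rng.standard_normal((n, p))         # rows are the X_i

M = np.exp(X @ X.T / p)                 # kernel matrix with f = exp
K = kernel_approximation(X, Sigma, np.exp, 1.0, 1.0, 1.0)

err = np.linalg.norm(M - K, 2)          # operator norm of M - K
scale = np.linalg.norm(M, 2)            # operator norm of M, for comparison
```

In this setup `err` is small compared with `scale`, which is of order $n$ because of the rank-one term $11'$. The theorem is asymptotic, so at finite $n$ and $p$ the error is not expected to vanish; increasing both dimensions together is one way to observe it shrinking.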

The advantages of obtaining an operator norm consistent estimator are many.
We list some here:
• Asymptotically, $M$ and $K$ have the same $j$th largest eigenvalue, for any $j$; this is
simply because for symmetric matrices, if $l_j$ is the $j$th largest eigenvalue of a
matrix, Weyl’s inequality (see, e.g., Corollary III.2.6 in [10]) implies that
$$|l_j(M) - l_j(K)| \le \|M - K\|_2.$$
Hence our result implies that $|l_j(M) - l_j(K)| \to 0$ in probability as $p$ and $n$ go
to infinity.
• The limiting spectral distributions of M and K (if they exist) are the same. This
is a consequence of Lemma 2.1, page 30 below. So in particular, when K has
a limiting spectral distribution (in the sense of weak convergence of probability
measures), the empirical spectral distribution of M converges to that distribution
(in the sense of weak convergence of distributions) in probability.
• We have subspace consistency for eigenspaces corresponding to separated
eigenvalues. (For a proof, we refer to [18], Corollary 3.) So, when K has eigen-
values that stay separated from the bulk of this matrix’s eigenvalues, then M
has in probability the same property, and the angle between the corresponding
eigenspaces for $K$ and $M$ goes to 0 in probability.
(Note that the statements we just made assume that both M and K are symmetric,
which is the case here.)
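
Weyl’s inequality, used in the first item above, can be illustrated on arbitrary symmetric matrices. In this sketch the matrices, the size of the perturbation and the seed are assumptions; `M` and `K` are generic symmetric matrices, not the kernel matrix and its approximation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40
G = rng.standard_normal((n, n))
M = (G + G.T) / 2.0                      # a real symmetric matrix
E = rng.standard_normal((n, n))
E = 1e-2 * (E + E.T) / 2.0               # a small symmetric perturbation
K = M + E

l_M = np.linalg.eigvalsh(M)[::-1]        # eigenvalues in decreasing order
l_K = np.linalg.eigvalsh(K)[::-1]
gap = np.max(np.abs(l_M - l_K))          # max over j of |l_j(M) - l_j(K)|
bound = np.linalg.norm(M - K, 2)         # operator norm of M - K
```

Every ordered eigenvalue moves by at most `bound`, so `gap` cannot exceed it.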
The strategy for the proof is the following. According to the results of
Lemma A.3, the matrix with entries $X_i'X_j/p$ has “small” entries off the diagonal, whereas on the
diagonal, the entries are essentially constant and equal to $\mathrm{trace}(\Sigma_p)/p$. Hence, it
is natural to try to use the $\delta$-method (i.e., do a Taylor expansion) entry by entry.
By contrast to standard problems in Statistics, the fact that we have to perform $n^2$
of those Taylor expansions means that the second order term is not negligible a
priori. The proof shows that this approach can be carried out rigorously, and that,
perhaps surprisingly, the second order term is not too complicated to approximate
in operator norm. It is also shown that the third order term plays essentially no
role.
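
The picture behind this strategy (small off-diagonal entries, nearly constant diagonal, yet a non-negligible operator norm because there are $n^2$ entries) can be seen in a simulation. The Gaussian entries, the choice $\Sigma = \mathrm{Id}$, the dimensions and the seed below are assumptions made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 2000
X = rng.standard_normal((n, p))          # Sigma = Id, so trace(Sigma)/p = 1
G = X @ X.T / p                          # matrix with entries X_i' X_j / p

diag_dev = np.max(np.abs(np.diag(G) - 1.0))   # distance of the diagonal from trace(Sigma)/p
off_diag = G - np.diag(np.diag(G))            # same matrix with its diagonal set to zero
off_max = np.max(np.abs(off_diag))            # largest off-diagonal entry in absolute value
off_norm = np.linalg.norm(off_diag, 2)        # operator norm of the off-diagonal part
```

Each off-diagonal entry is of order $p^{-1/2}$, but the operator norm of the off-diagonal matrix is several times larger than its largest entry, which is why entrywise smallness alone does not settle the question.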
Before we start the proof, we want to mention that we will drop the index $p$
in $\Sigma_p$ below to avoid cumbersome notation. Let us also note, more technically,
that an important step of the proof is to show that, when the $Y_i$’s have enough
moments, they can be treated without much error in spectral results as bounded
random variables, the bound depending on the number of moments, $n$ and $p$.
This then enables us to use concentration results for convex Lipschitz functions
of independent bounded random variables at various important points of the proof
and also in Lemma A.3 whose results underlie much of the approach taken here.

PROOF OF THEOREM 2.1. First, let us call
$$\tau = \frac{\mathrm{trace}(\Sigma)}{p}.$$
Using Taylor expansions, we can rewrite our kernel matrix as
$$f(X_i'X_j/p) = f(0) + f'(0) X_i'X_j/p + \frac{f''(0)}{2}(X_i'X_j/p)^2 + \frac{f^{(3)}(\xi_{i,j})}{6}(X_i'X_j/p)^3 \quad \text{if } i \ne j,$$
$$f(\|X_i\|_2^2/p) = f(\tau) + f'(\xi_{i,i})\left(\frac{\|X_i\|_2^2}{p} - \tau\right) \quad \text{on the diagonal.}$$
The proof can be separated into different steps. We will break the kernel matrix
into a diagonal term and an off-diagonal term. The results of Lemma A.3, after
they are shown, will allow us to take care of the diagonal matrix at relatively low
cost. So we postpone that part of the analysis to the end of the proof and we first
focus on the off-diagonal matrix.
In what follows, we call “second order term” the matrix $A$ with entries
$$A_{i,j} = \frac{f''(0)}{2}\left(\frac{X_i'X_j}{p}\right)^2 1_{i \ne j}.$$
We call “third order term” the matrix $B$ with entries
$$B_{i,j} = \frac{f^{(3)}(\xi_{i,j})}{6}\left(\frac{X_i'X_j}{p}\right)^3 1_{i \ne j}.$$
The “off-diagonal” matrix is the sum $A + B$.

(A) Study of the off-diagonal matrix.

• Truncation and centralization. Following the arguments of Lemma 2.2 in [45],
we see that because we have assumed that we have $4 + \varepsilon$ absolute moments, and
$n \asymp p$, the array $Y = Y_{1 \le i \le n, 1 \le j \le p}$ is almost surely equal to the array $\widetilde{Y}$ of same
dimensions with
$$\widetilde{Y}_{i,j} = Y_{i,j} 1_{|Y_{i,j}| \le B_p}, \quad \text{where } B_p = p^{1/2-\delta} \text{ and } \delta > 0.$$
We will therefore carry out the analysis on this $\widetilde{Y}$ array. Note that most of the
results we will rely on require vectors of i.i.d. entries with mean 0. Of course, $\widetilde{Y}_{i,j}$
has in general a mean different from 0. In other words, if we call $\mu = E(\widetilde{Y}_{i,j})$,
we need to show that we do not lose anything in operator norm by replacing $\widetilde{Y}_i$’s
by $U_i$’s with $U_i = \widetilde{Y}_i - \mu 1$. Note that, as seen in Lemma A.3, by plugging in
$t = 1/2 - \delta$ in the notation of this lemma, which corresponds to the $4 + \varepsilon$ moment
assumption here, we have
$$|\mu| \le p^{-3/2-\delta}.$$
Now let us call $S$ the matrix $XX'/p$, except that its diagonal is replaced by
zeros. From [45], and the fact that $n/p$ stays bounded, we know that $\|XX'/p\|_2 \le
\sigma_1(\Sigma)\|YY'\|_2/p$ stays bounded. Using Lemma A.3, we see that the diagonal of
$XX'/p$ stays bounded a.s. in operator norm. Therefore, $\|S\|_2$ is bounded a.s.
Now, as in the proof of Lemma A.3, we have
$$S_{i,j} = \frac{U_i'\Sigma U_j}{p} + \mu\left(\frac{1'\Sigma U_j}{p} + \frac{1'\Sigma U_i}{p}\right) + \mu^2 \frac{1'\Sigma 1}{p} = \frac{U_i'\Sigma U_j}{p} + R_{i,j} \quad \text{a.s.}$$
Note that this equality is true a.s. only because it involves replacing $Y$ by $\widetilde{Y}$. The
proof of Lemma A.3 shows that
$$|R_{i,j}| \le \mu\, 2\sigma_1^{1/2}(\Sigma)\left(\sigma_1(\Sigma) + p^{-\delta/2}\right)^{1/2} + \mu^2 \sigma_1(\Sigma) \quad \text{a.s.}$$
We conclude that, for some constant $C$,
$$\|R\|_F^2 \le C n^2 \mu^2 \le C n^2 p^{-3-2\delta} \to 0 \quad \text{a.s.}$$
Therefore $\|R\|_2 \to 0$ a.s. In other words, if we call $S_U$ the matrix with $(i,j)$ entry
$U_i'\Sigma U_j/p$ off the diagonal and 0 on the diagonal,
$$\|S - S_U\|_2 \to 0 \quad \text{a.s.}$$
Now it is a standard result on Hadamard products (see for instance, [10],
Problem I.6.13, or [25], Theorems 5.5.1 and 5.5.15) that for two matrices $A$ and $B$,
$\|A \circ B\|_2 \le \|A\|_2 \|B\|_2$. Since the Hadamard product is commutative, we have
$$S \circ S - S_U \circ S_U = (S + S_U) \circ (S - S_U).$$
We conclude that
$$\|S \circ S - S_U \circ S_U\|_2 \le \|S - S_U\|_2 (\|S\|_2 + \|S_U\|_2) \to 0 \quad \text{a.s.},$$
since $\|S - S_U\|_2 \to 0$ a.s., and $\|S\|_2$ and hence $\|S_U\|_2$ stay bounded, a.s.
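
The Hadamard product identity and the resulting operator norm bound used in this step hold for any pair of matrices of the same size, which makes them easy to check numerically. In the sketch below, the matrices are random symmetric stand-ins for $S$ and $S_U$, not the matrices of the proof; the helper names, sizes and seed are assumptions.

```python
import numpy as np

def random_symmetric(rng, n):
    """An arbitrary real symmetric n x n matrix."""
    G = rng.standard_normal((n, n))
    return (G + G.T) / 2.0

def op(A):
    """Operator norm (largest singular value)."""
    return np.linalg.norm(A, 2)

rng = np.random.default_rng(0)
n = 30
S = random_symmetric(rng, n)
S_U = S + 1e-3 * random_symmetric(rng, n)   # a nearby matrix playing the role of S_U

lhs = S * S - S_U * S_U                     # S o S - S_U o S_U (entrywise products)
rhs = (S + S_U) * (S - S_U)                 # (S + S_U) o (S - S_U)
bound = op(S - S_U) * (op(S) + op(S_U))
```

The two sides of the identity agree entrywise, and the operator norm of `lhs` is at most `bound`, so a small $\|S - S_U\|_2$ forces the two Hadamard squares to be close in operator norm whenever $\|S\|_2$ and $\|S_U\|_2$ stay bounded.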
The conclusion of this study is that to approximate the second order term in
operator norm, it is enough to work with $S_U$ and not $S$, and hence, very importantly,
with bounded random variables with zero mean. Further, the proof of Lemma A.3
makes clear that $\sigma_U^2$, the variance of the $U_{i,j}$’s, goes to 1, the variance of the $Y_{i,j}$’s,
very fast. So if we can approximate the matrix with $(i,j)$-entry $U_i'\Sigma U_j/(p\sigma_U^2)$
consistently in operator norm by a matrix whose operator norm is bounded, this
same matrix will constitute an operator norm approximation of $U_i'\Sigma U_j/p$.
In other words, we can assume that, when working with matrices of dimension
n × p, the random variables we will be working with have variance 1 without loss
of generality and that they have mean 0 and are bounded by Bp , Bp depending on
p and going to infinity.
• Control of the second order term. We now focus on approximating in operator
norm the matrix with $(i,j)$th entry,
\[ \frac{f''(0)}{2}\left(\frac{X_i'X_j}{p}\right)^2 1_{i \neq j}. \]
As we just explained, we assume from now on in all the work concerning the
second order term that the vectors $Y_i$ have mean 0, and that their entries have
variance 1 and are bounded by $B_p = p^{1/2-\delta}$. This is because we just saw that
replacing $Y_i$ by $U_i/\sigma_U$ would not change (a.s. and asymptotically) the operator
norm of the matrix to be studied. We note that to make clear that the truncation
depends on $p$, we might have wanted to use the notation $Y_i^{(p)}$, but since there will
be no ambiguity in the proof, we chose to use the less cumbersome notation $Y_i$.
The control of the second order term turns out to be the most delicate part of the
analysis, and the only place where we need the assumption that $X_i = \Sigma^{1/2}Y_i$. Let
us call $W$ the matrix with entries
\[
W_{i,j} =
\begin{cases}
\dfrac{(X_i'X_j)^2}{p^2}, & \text{if } i \neq j,\\[4pt]
0, & \text{if } i = j.
\end{cases}
\]
Note that when $i \neq j$,
\[
E(W_{i,j}) = E(\mathrm{trace}(X_i'X_jX_j'X_i))/p^2 = E(\mathrm{trace}(X_jX_j'X_iX_i'))/p^2
 = \mathrm{trace}(\Sigma^2)/p^2.
\]
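The identity $E(W_{i,j}) = \mathrm{trace}(\Sigma^2)/p^2$ only uses the independence of $Y_i$ and $Y_j$ and the fact that their covariance is the identity. The sketch below checks it exactly, without Monte Carlo error, in a toy case that is our own assumption: $p = 3$, entries of $Y_i$ i.i.d. Rademacher ($\pm 1$ with probability 1/2), and an arbitrary covariance matrix `Sigma`.

```python
import itertools
import numpy as np

p = 3
rng = np.random.default_rng(1)
G = rng.standard_normal((p, p))
Sigma = G @ G.T / p                      # an arbitrary covariance matrix (illustrative)

# All 2^p sign vectors: the support of a vector with i.i.d. Rademacher entries.
signs = np.array(list(itertools.product([-1.0, 1.0], repeat=p)))

# X_i' X_j = Y_i' Sigma Y_j; Q[a, b] holds this value for the pair (y_a, y_b).
Q = signs @ Sigma @ signs.T

# Averaging over all equally likely pairs gives the exact expectation of W_ij, i != j.
expected_W = np.mean(Q ** 2) / p ** 2
target = np.trace(Sigma @ Sigma) / p ** 2
```

Here `expected_W` and `target` agree up to floating point error.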
Because we assume that $\mathrm{trace}(\Sigma)/p$ has a finite limit, and $n/p$ stays bounded away
from 0, we see that the matrix $E(W)$ has a largest eigenvalue that, in general, does
not go to 0. Note also that under our assumptions, $E(W_{i,j}) = O(1/p)$. Our aim is
to show that $W$ can be approximated in operator norm by this constant matrix. So
let us consider the matrix $\widetilde{W}$ with entries
\[
\widetilde{W}_{i,j} =
\begin{cases}
\dfrac{(X_i'X_j)^2}{p^2} - \dfrac{\mathrm{trace}(\Sigma^2)}{p^2}, & \text{if } i \neq j,\\[4pt]
0, & \text{if } i = j.
\end{cases}
\]
Simple computations show that the expected Frobenius norm squared of this matrix
does not go to 0. Hence more subtle arguments are needed to control its operator
norm. We will show that $E(\mathrm{trace}(\widetilde{W}^4))$ goes to zero, which implies that
$E(\|\widetilde{W}\|_2^4)$ goes to zero because $\widetilde{W}$ is real symmetric.
The elements contributing to $\mathrm{trace}(\widetilde{W}^4)$ are generally of the form
$\widetilde{W}_{i,j}\widetilde{W}_{j,k}\widetilde{W}_{k,l}\widetilde{W}_{l,i}$. We are going to study these terms according to
how many indices are equal to each other.

(i) Terms involving 4 different indices: $i \neq j \neq k \neq l$. We first focus on the case
where all these indices $(i,j,k,l)$ are different. Recall that $X_i = \Sigma^{1/2}Y_i$, where $Y_i$
has i.i.d. entries. We want to compute $E(\widetilde{W}_{i,j}\widetilde{W}_{j,k}\widetilde{W}_{k,l}\widetilde{W}_{l,i})$, so it is natural to
focus first on
\[ E(\widetilde{W}_{i,j}\widetilde{W}_{j,k}\widetilde{W}_{k,l}\widetilde{W}_{l,i} \mid Y_i, Y_k). \]
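Now, note that
\[
\widetilde{W}_{i,j} = \frac{1}{p^2}\{Y_i'\Sigma Y_jY_j'\Sigma Y_i - \mathrm{trace}(\Sigma^2)\}
 = \frac{1}{p^2}\{Y_i'\Sigma(Y_jY_j' - \mathrm{Id})\Sigma Y_i + \mathrm{trace}(\Sigma^2(Y_iY_i' - \mathrm{Id}))\}.
\]
Hence, calling
\[ M_j \triangleq Y_jY_j' - \mathrm{Id}, \]
we have
\[
p^4\,\widetilde{W}_{i,j}\widetilde{W}_{j,k} = (Y_i'\Sigma M_j\Sigma Y_i\,Y_k'\Sigma M_j\Sigma Y_k)
 + (Y_i'\Sigma M_j\Sigma Y_i)\,\mathrm{trace}(\Sigma^2M_k)
 + (Y_k'\Sigma M_j\Sigma Y_k)\,\mathrm{trace}(\Sigma^2M_i)
 + \mathrm{trace}(\Sigma^2M_i)\,\mathrm{trace}(\Sigma^2M_k).
\]
Now, of course, we have $E(M_j) = E(M_j \mid Y_i, Y_k) = 0$. Hence,
\[
p^4 E(\widetilde{W}_{i,j}\widetilde{W}_{j,k} \mid Y_i, Y_k)
 = Y_i'\Sigma\,E(M_j\Sigma Y_iY_k'\Sigma M_j \mid Y_i, Y_k)\,\Sigma Y_k
 + \mathrm{trace}(\Sigma^2M_i)\,\mathrm{trace}(\Sigma^2M_k).
\]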
If $M$ is a deterministic matrix, we have, since $E(Y_jY_j') = \mathrm{Id}$,
\[ E(M_jMM_j) = E(Y_jY_j'MY_jY_j') - M. \]
If we now use Lemma A.1, and, in particular, (4), page 40, we finally have, recalling
that here $\sigma^2 = 1$,
\[
E(M_jMM_j) = (M + M') + (\mu_4 - 3)\,\mathrm{diag}(M) + \mathrm{trace}(M)\,\mathrm{Id} - M
 = M' + (\mu_4 - 3)\,\mathrm{diag}(M) + \mathrm{trace}(M)\,\mathrm{Id}.
\]
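The formula $E(M_jMM_j) = M' + (\mu_4 - 3)\,\mathrm{diag}(M) + \mathrm{trace}(M)\,\mathrm{Id}$ can be verified exactly in a small case. The sketch below assumes, for illustration only, that the entries of $Y_j$ are i.i.d. Rademacher variables (so $\sigma^2 = 1$ and $\mu_4 = 1$) and enumerates all $2^p$ equally likely values of $Y_j$ for $p = 4$; the test matrix `M` is arbitrary and deliberately not symmetric.

```python
import itertools
import numpy as np

p = 4
rng = np.random.default_rng(2)
M = rng.standard_normal((p, p))          # deterministic test matrix, not symmetric
mu4 = 1.0                                # fourth moment of a Rademacher variable

signs = np.array(list(itertools.product([-1.0, 1.0], repeat=p)))
identity = np.eye(p)

# Exact expectation: average M_j M M_j over the 2^p equally likely sign vectors.
acc = np.zeros((p, p))
for y in signs:
    Mj = np.outer(y, y) - identity       # M_j = Y_j Y_j' - Id
    acc += Mj @ M @ Mj
lhs = acc / len(signs)

rhs = M.T + (mu4 - 3.0) * np.diag(np.diag(M)) + np.trace(M) * identity
```

The two sides coincide up to rounding, including the transpose on the first term, which matters as soon as `M` is not symmetric.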

In the case of interest here, we have $M = \Sigma Y_iY_k'\Sigma$, and the expectation is to be
understood conditionally on $Y_i, Y_k$, but because we have assumed that the indices
are different and the $Y_m$'s are independent, we can do the computation of the
conditional expectation as if $M$ were deterministic. Therefore, we have
\[
Y_i'\Sigma\,E(M_j\Sigma Y_iY_k'\Sigma M_j \mid Y_i, Y_k)\,\Sigma Y_k
 = Y_i'\Sigma[\Sigma Y_kY_i'\Sigma + (\mu_4 - 3)\,\mathrm{diag}(\Sigma Y_iY_k'\Sigma) + (Y_k'\Sigma^2Y_i)\,\mathrm{Id}]\Sigma Y_k
\]
\[
 = (Y_i'\Sigma^2Y_k)^2 + (\mu_4 - 3)\,Y_i'\Sigma\,\mathrm{diag}(\Sigma Y_iY_k'\Sigma)\,\Sigma Y_k + (Y_i'\Sigma^2Y_k)^2.
\]

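Naturally, we have $E(\widetilde{W}_{i,j}\widetilde{W}_{j,k} \mid Y_i, Y_k) = E(\widetilde{W}_{k,l}\widetilde{W}_{l,i} \mid Y_i, Y_k)$, and therefore, by
using properties of conditional expectation, since all the indices are different,
\[
p^8 E(\widetilde{W}_{i,j}\widetilde{W}_{j,k}\widetilde{W}_{k,l}\widetilde{W}_{l,i})
 = E\bigl(\bigl[2(Y_i'\Sigma^2Y_k)^2 + (\mu_4 - 3)\,Y_i'\Sigma\,\mathrm{diag}(\Sigma Y_iY_k'\Sigma)\,\Sigma Y_k
 + \mathrm{trace}(\Sigma^2M_i)\,\mathrm{trace}(\Sigma^2M_k)\bigr]^2\bigr).
\]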

By convexity, we have $(a + b + c)^2 \le 3(a^2 + b^2 + c^2)$, so to control the above
expression, we just need to control the square of each of the terms appearing in it.
In other words, we need to understand the terms
\[ T_1 = E((Y_i'\Sigma^2Y_k)^4), \]
\[ T_2 = E([Y_i'\Sigma\,\mathrm{diag}(\Sigma Y_iY_k'\Sigma)\,\Sigma Y_k]^2) \]
and
\[ T_3 = E([\mathrm{trace}(\Sigma^2M_i)\,\mathrm{trace}(\Sigma^2M_k)]^2). \]

Study of $T_1$. Let us start with the term $T_1 = E((Y_i'\Sigma^2Y_k)^4)$. A simple rewriting
shows that
\[ (Y_i'\Sigma^2Y_k)^4 = Y_i'\Sigma^2Y_kY_k'\Sigma^2Y_iY_i'\Sigma^2Y_kY_k'\Sigma^2Y_i. \]
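Using (4) in Lemma A.1, we therefore have, using the fact that $\Sigma^2Y_iY_i'\Sigma^2$ is
symmetric,
\[
E((Y_i'\Sigma^2Y_k)^4 \mid Y_i)
 = Y_i'\Sigma^2[2\Sigma^2Y_iY_i'\Sigma^2 + (\mu_4 - 3)\,\mathrm{diag}(\Sigma^2Y_iY_i'\Sigma^2)
 + \mathrm{trace}(\Sigma^2Y_iY_i'\Sigma^2)\,\mathrm{Id}]\Sigma^2Y_i
\]
\[
 = 3(Y_i'\Sigma^4Y_i)^2 + (\mu_4 - 3)\,Y_i'\Sigma^2\,\mathrm{diag}(\Sigma^2Y_iY_i'\Sigma^2)\,\Sigma^2Y_i.
\]
Finally, we have, using (5) in Lemma A.1 with the matrix $\Sigma^4$,
\[
E((Y_i'\Sigma^2Y_k)^4) = 3[2\,\mathrm{trace}(\Sigma^8) + (\mathrm{trace}(\Sigma^4))^2 + (\mu_4 - 3)\,\mathrm{trace}(\Sigma^4 \circ \Sigma^4)]
 + (\mu_4 - 3)\,E(Y_i'\Sigma^2\,\mathrm{diag}(\Sigma^2Y_iY_i'\Sigma^2)\,\Sigma^2Y_i).
\]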
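Now we have
\[
Y_i'\Sigma^2\,\mathrm{diag}(\Sigma^2Y_iY_i'\Sigma^2)\,\Sigma^2Y_i
 = \mathrm{trace}(\Sigma^2Y_iY_i'\Sigma^2\,\mathrm{diag}(\Sigma^2Y_iY_i'\Sigma^2))
 = \mathrm{trace}(\Sigma^2Y_iY_i'\Sigma^2 \circ \Sigma^2Y_iY_i'\Sigma^2).
\]
Calling $v_i = \Sigma^2Y_i$, we note that the matrix whose trace is taken is
$(v_iv_i') \circ (v_iv_i') = (v_i \circ v_i)(v_i \circ v_i)'$ (see [24], page 458 or [25], page 307). Hence,
\[ Y_i'\Sigma^2\,\mathrm{diag}(\Sigma^2Y_iY_i'\Sigma^2)\,\Sigma^2Y_i = \|v_i \circ v_i\|_2^2. \]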
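The chain of equalities leading to $\|v_i \circ v_i\|_2^2$ is purely algebraic, so it can be checked on arbitrary inputs. The following sketch uses a made-up covariance matrix `Sigma` and a made-up vector `Y` standing in for $Y_i$; it is only meant to make the Hadamard product manipulations concrete.

```python
import numpy as np

rng = np.random.default_rng(3)
p = 5
G = rng.standard_normal((p, p))
Sigma = G @ G.T / p                      # arbitrary covariance matrix (illustrative)
Y = rng.standard_normal(p)               # stand-in for Y_i

v = Sigma @ Sigma @ Y                    # v_i = Sigma^2 Y_i
K = np.outer(v, v)                       # Sigma^2 Y_i Y_i' Sigma^2

quad = v @ np.diag(np.diag(K)) @ v       # Y_i' Sigma^2 diag(.) Sigma^2 Y_i
trace_hadamard = np.trace(K * K)         # trace of the Hadamard square
norm_sq = np.sum((v * v) ** 2)           # ||v_i o v_i||_2^2

# (v v') o (v v') = (v o v)(v o v)'
rank_one_ok = np.allclose(K * K, np.outer(v * v, v * v))
```

All three scalars equal $\sum_k v_i(k)^4$.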
Now let us call $m_k$ the $k$th column of the matrix $\Sigma^2$. Using the fact that $\Sigma^2$ is
symmetric, we see that the $k$th entry of the vector $v_i$ is $v_i(k) = m_k'Y_i$. So
$v_i(k)^4 = Y_i'm_km_k'Y_iY_i'm_km_k'Y_i$. Calling $\mathcal{M}_k = m_km_k'$, we see, using (5) in
Lemma A.1, that
\[ E(v_i(k)^4) = 2\,\mathrm{trace}(\mathcal{M}_k^2) + [\mathrm{trace}(\mathcal{M}_k)]^2 + (\mu_4 - 3)\,\mathrm{trace}(\mathcal{M}_k \circ \mathcal{M}_k). \]
Using the definition of $\mathcal{M}_k$, we finally get that
\[ E(v_i(k)^4) = 3\|m_k\|_2^4 + (\mu_4 - 3)\|m_k \circ m_k\|_2^2. \]
Now, note that if $C$ is a generic matrix and $C_k$ is its $k$th column, denoting by
$e_k$ the $k$th vector of the canonical basis, we have $C_k = Ce_k$, and hence
$\|C_k\|_2^2 = e_k'C'Ce_k \le \sigma_1^2(C)$, where $\sigma_1(C)$ is the largest singular value of $C$. So in
particular, if we call $\lambda_1(D)$ the largest eigenvalue of a positive semidefinite matrix $D$,
we have $\|m_k\|_2^4 \le \lambda_1(\Sigma^4)\|m_k\|_2^2$.

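After recalling the definition of $m_k$, and using the fact that
$\sum_k \|m_k \circ m_k\|_2^2 = \|\Sigma^2 \circ \Sigma^2\|_F^2$, we deduce that
\[
E(\|v_i \circ v_i\|_2^2) = 3\sum_k \|m_k\|_2^4 + (\mu_4 - 3)\sum_k \|m_k \circ m_k\|_2^2
 \le 3\lambda_1(\Sigma^4)\,\mathrm{trace}(\Sigma^4) + (\mu_4 - 3)\,\mathrm{trace}([\Sigma^2 \circ \Sigma^2]^2).
\]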
Therefore, we can conclude that
\[ E((Y_i'\Sigma^2Y_k)^4) \le 3\lambda_1(\Sigma^4)\,\mathrm{trace}(\Sigma^4) + (\mu_4 - 3)\,\mathrm{trace}([\Sigma^2 \circ \Sigma^2]^2). \]
Now recall that, according to Theorem 5.5.19 in [25], if $C$ and $D$ are positive
semidefinite matrices, $\lambda(C \circ D) \prec_w d(C) \circ \lambda(D)$, where $\lambda(D)$ is the vector of
decreasingly ordered eigenvalues of $D$, and $d(C)$ denotes the vector of decreasingly
ordered diagonal entries of $C$ (because all the matrices are positive semidefinite, their
eigenvalues are their singular values). Here $\prec_w$ denotes weak (sub)majorization. In
our case, of course, $C = D = \Sigma^2$. Using the results of Example II.3.5(iii) in [10],
with the function $\phi(x) = x^2$, we see that
\[
\mathrm{trace}((\Sigma^2 \circ \Sigma^2)^2) = \sum_i \lambda_i^2(\Sigma^2 \circ \Sigma^2)
 \le \sum_i d_i^2(\Sigma^2)\lambda_i^2(\Sigma^2) \le \lambda_1(\Sigma^4)\,\mathrm{trace}(\Sigma^4).
\]
Finally, we have
\[ T_1 = E((Y_i'\Sigma^2Y_k)^4) \le (3 + |\mu_4 - 3|)\lambda_1(\Sigma^4)\,\mathrm{trace}(\Sigma^4). \tag{1} \]
This bounds the first term, $T_1$, in our upper bound.
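The majorization step, namely $\mathrm{trace}((\Sigma^2 \circ \Sigma^2)^2) \le \sum_i d_i^2(\Sigma^2)\lambda_i^2(\Sigma^2) \le \lambda_1(\Sigma^4)\,\mathrm{trace}(\Sigma^4)$, can be illustrated numerically. The sketch below uses an arbitrary positive semidefinite `Sigma` of our choosing and also checks the partial-sum form of the weak majorization $\lambda(C \circ C) \prec_w d(C) \circ \lambda(C)$ with $C = \Sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(4)
p = 8
G = rng.standard_normal((p, p))
Sigma = G @ G.T / p                      # arbitrary PSD matrix (illustrative)

C = Sigma @ Sigma                        # C = D = Sigma^2
H = C * C                                # Hadamard product Sigma^2 o Sigma^2

lam_H = np.sort(np.linalg.eigvalsh(H))[::-1]   # decreasing eigenvalues of C o C
lam_C = np.sort(np.linalg.eigvalsh(C))[::-1]   # decreasing eigenvalues of C
d_C = np.sort(np.diag(C))[::-1]                # decreasing diagonal entries of C

lhs = np.trace(H @ H)                          # trace((Sigma^2 o Sigma^2)^2)
mid = np.sum(d_C ** 2 * lam_C ** 2)
rhs = lam_C[0] ** 2 * np.sum(lam_C ** 2)       # lambda_1(Sigma^4) trace(Sigma^4)

# Weak majorization: every partial sum on the left is dominated by the one on the right.
weak_major = bool(np.all(np.cumsum(lam_H) <= np.cumsum(d_C * lam_C) * (1 + 1e-9)))
```

The middle quantity `mid` sits between `lhs` and `rhs`, as in the displayed chain.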
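Study of $T_3$. Let us now turn to the third term,
$T_3 = E([\mathrm{trace}(\Sigma^2M_i)\,\mathrm{trace}(\Sigma^2M_k)]^2)$. We remind the reader that $M_i = Y_iY_i' - \mathrm{Id}$.
By independence of $Y_i$ and $Y_k$, it is enough to understand $E([\mathrm{trace}(\Sigma^2M_i)]^2)$. Note that
\[
E([\mathrm{trace}(\Sigma^2M_i)]^2) = E([Y_i'\Sigma^2Y_i - \mathrm{trace}(\Sigma^2)]^2)
 = E(Y_i'\Sigma^2Y_iY_i'\Sigma^2Y_i) - \mathrm{trace}(\Sigma^2)^2.
\]
Using (5) in Lemma A.1, we conclude that
\[ E([\mathrm{trace}(\Sigma^2M_i)]^2) = 2\,\mathrm{trace}(\Sigma^4) + (\mu_4 - 3)\,\mathrm{trace}(\Sigma^2 \circ \Sigma^2). \]
Using the fact that we know the diagonal of $\Sigma^2 \circ \Sigma^2$, we conclude that
\[
T_3 = E([\mathrm{trace}(\Sigma^2M_i)]^2[\mathrm{trace}(\Sigma^2M_k)]^2)
 \le \{2\,\mathrm{trace}(\Sigma^4) + |\mu_4 - 3|\lambda_1(\Sigma^2)\,\mathrm{trace}(\Sigma^2)\}^2. \tag{2}
\]
So we have an upper bound on $T_3$.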
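Study of $T_2$. Finally, let us turn to the middle term,
$T_2 = E([Y_i'\Sigma\,\mathrm{diag}(\Sigma Y_iY_k'\Sigma)\,\Sigma Y_k]^2)$. Before we square it, the argument of the
expectation has the form $Y_i'\Sigma\,\mathrm{diag}(\Sigma Y_kY_i'\Sigma)\,\Sigma Y_k$. Call $u_k = \Sigma Y_k$. Making the same
computations as above, we find that
\[
Y_i'\Sigma\,\mathrm{diag}(\Sigma Y_kY_i'\Sigma)\,\Sigma Y_k = \mathrm{trace}(\mathrm{diag}(\Sigma Y_kY_i'\Sigma)\,\Sigma Y_kY_i'\Sigma)
 = \mathrm{trace}((\Sigma Y_kY_i'\Sigma) \circ (\Sigma Y_kY_i'\Sigma))
\]
\[
 = \mathrm{trace}((u_ku_i') \circ (u_ku_i')) = \mathrm{trace}((u_k \circ u_k)(u_i \circ u_i)')
 = (u_i \circ u_i)'(u_k \circ u_k).
\]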
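We deduce, using independence and elementary properties of inner products, that
\[ E([Y_i'\Sigma\,\mathrm{diag}(\Sigma Y_kY_i'\Sigma)\,\Sigma Y_k]^2) \le E(\|u_i \circ u_i\|_2^2)\,E(\|u_k \circ u_k\|_2^2). \]
Note that to arrive at (1), we studied expressions similar to $E(\|u_i \circ u_i\|_2^2)$. So we
can similarly conclude that
\[ T_2 = E([Y_i'\Sigma\,\mathrm{diag}(\Sigma Y_kY_i'\Sigma)\,\Sigma Y_k]^2) \le \{(3 + |\mu_4 - 3|)\lambda_1(\Sigma^2)\,\mathrm{trace}(\Sigma^2)\}^2. \tag{3} \]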
With our assumptions, the terms (1), (2) and (3) are $O(p^2)$. Note that in the
computation of the trace, there are $O(n^4)$ such terms. Finally, note that the expectation
of interest to us corresponds to the sum of the three quadratic terms divided by $p^8$.
So the total contribution of these terms is in expectation $O(p^{-2})$. This takes care
of the contribution of the terms involving four different indices, as it shows that
\[ 0 \le E\Bigl(\sum_{i \neq j \neq k \neq l} \widetilde{W}_{i,j}\widetilde{W}_{j,k}\widetilde{W}_{k,l}\widetilde{W}_{l,i}\Bigr) = O(p^{-2}). \]

(ii) Terms involving three different indices: $i \neq j \neq k$. Note that because
$\widetilde{W}_{i,i} = 0$, terms involving 3 different indices with a nonzero contribution are
necessarily of the form $(\widetilde{W}_{i,j})^2(\widetilde{W}_{i,k})^2$, since terms with a cycle of length 3 all involve
a term of the form $\widetilde{W}_{i,i}$ and hence contribute 0. Let us now focus on those terms,
assuming that $j \neq k$. Note that we have $O(n^3)$ such terms and that it is enough to
focus on the $W_{i,j}^2W_{i,k}^2$, since the contribution of the other terms is, in expectation,
of order $1/p^4$ [with our assumptions $\mathrm{trace}(\Sigma^2)/p^2 = O(1/p)$], and because we
have only $n^3$ terms in the sum, this extra contribution is asymptotically zero. Now,
we clearly have $E(W_{i,j}^2W_{i,k}^2 \mid Y_i) = [E(W_{i,j}^2 \mid Y_i)]^2$, by conditional independence of
the two terms. The computation of $E(W_{i,j}^2 \mid Y_i)$ is similar to the ones we have made
above, and we have
\[
p^4E(W_{i,j}^2 \mid Y_i) = 2(Y_i'\Sigma^2Y_i)^2 + (\mu_4 - 3)\,Y_i'\Sigma\,\mathrm{diag}(\Sigma Y_iY_i'\Sigma)\,\Sigma Y_i
 + (\mathrm{trace}(\Sigma Y_iY_i'\Sigma))^2.
\]
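Using the fact that $K_i = \Sigma Y_iY_i'\Sigma$ is positive semidefinite, and hence its diagonal
entries are nonnegative, we have $\mathrm{trace}(K_i \circ K_i) \le (\mathrm{trace}(K_i))^2$, and we conclude
that
\[ p^4E(W_{i,j}^2 \mid Y_i) \le (3 + |\mu_4 - 3|)(Y_i'\Sigma^2Y_i)^2 \le (3 + |\mu_4 - 3|)\,\sigma_1(\Sigma)^4\|Y_i\|_2^4. \]
Hence, taking expectations with respect to $Y_i$,
\[ E(W_{i,j}^2W_{i,k}^2) \le \frac{1}{p^8}(3 + |\mu_4 - 3|)^2\sigma_1(\Sigma)^8E(\|Y_i\|_2^8). \]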
Now, the application $F$ which takes a vector and returns its Euclidean norm
is trivially a convex 1-Lipschitz function, with respect to Euclidean norm. Because
the entries of $Y_i$ are bounded by $B_p$, we see that, according to Corollary 4.10
in [29], $F(Y_i) = \|Y_i\|_2$ satisfies a concentration inequality, namely,
for $r > 0$, $P(|\|Y_i\|_2 - m_F| > r) \le 4\exp(-r^2/16B_p^2)$, where $m_F$ is a median of
$F(Y_i) = \|Y_i\|_2$ (hence $m_F$ is a deterministic quantity). A simple integration (see,
for instance, the proof of Proposition 1.9 in [29], and change the power from 2
to 8) then shows that
\[ E(|\|Y_i\|_2 - m_F|^8) = O(B_p^8). \]
Now we know, according to Proposition 1.9 in [29], that if $\mu_F$ is the mean of
$F(Y_i)$, that is, $\mu_F = E(\|Y_i\|_2)$, $\mu_F$ exists and $|m_F - \mu_F| = O(B_p)$. Since
$\mu_F^2 \le \mu_{F^2} = E(\|Y_i\|_2^2) = p$, we conclude that, if $C$ denotes a generic constant that may
change from display to display,
\[
E(\|Y_i\|_2^8) \le E(|\|Y_i\|_2 - m_F + m_F|^8) \le 2^7\bigl(E(|\|Y_i\|_2 - m_F|^8) + m_F^8\bigr)
\]
\[
 \le C\bigl(E(|\|Y_i\|_2 - m_F|^8) + |m_F - \mu_F|^8 + \mu_F^8\bigr) \le C(B_p^8 + p^4).
\]
Now our original assumption about the number of absolute moments of the random
variables of interest implies that $B_p = O(p^{1/2-\delta})$. Consequently,
\[ E(\|Y_i\|_2^8) = O(p^4). \]


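Therefore,
\[ E(W_{i,j}^2W_{i,k}^2) = O(p^{-4}) \]
and
\[ \sum_i\ \sum_{j \neq i,\,k \neq i,\,j \neq k} E(W_{i,j}^2W_{i,k}^2) = O(p^{-1}). \]
Hence, we also have
\[ \sum_i\ \sum_{j \neq i,\,k \neq i,\,j \neq k} E(\widetilde{W}_{i,j}^2\widetilde{W}_{i,k}^2) = O(p^{-1}). \]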

(iii) Terms involving two different indices: $i \neq j$. The last terms we have to
focus on to control $E(\mathrm{trace}(\widetilde{W}^4))$ are of the form $\widetilde{W}_{i,j}^4$. Note that we have $n^2$ terms
like this. Since by convexity, $(a + b)^4 \le 8(a^4 + b^4)$, we see that it is enough to
understand the contribution of $W_{i,j}^4$ to show that $\sum_{i,j} E(\widetilde{W}_{i,j}^4)$ tends to zero. Now,
let us call for a moment $v = \Sigma Y_i$ and $u = Y_j$. The quantity of interest to us is
basically of the form $E((u'v)^8)$. Let us do computations conditional on $v$. We note
that since the entries of $u$ are independent and have mean 0, in the expansion of
$(u'v)^8$, the only terms that will contribute a nonzero quantity to the expectation
have every entry of $u$ that appears raised to a power at least 2. We can decompose the sum
representing $E((u'v)^8 \mid v)$ into subterms, according to what powers of the terms are
involved. There are 7 terms: (2, 2, 2, 2) (i.e., all terms are raised to the power 2),
(3, 3, 2) (i.e., two terms are raised to the power 3, and one to the power 2), (4, 2, 2),
(4, 4), (5, 3), (6, 2) and (8). For instance the subterm corresponding to (2, 2, 2, 2)
is, before taking expectations,
\[ \sum_{i_1 \neq i_2 \neq i_3 \neq i_4} u_{i_1}^2u_{i_2}^2u_{i_3}^2u_{i_4}^2\,(v_{i_1}v_{i_2}v_{i_3}v_{i_4})^2. \]
After taking expectations conditional on $v$, we see that it is obviously nonnegative
and contributes
\[
(\sigma^2)^4\sum_{i_1 \neq i_2 \neq i_3 \neq i_4}(v_{i_1}v_{i_2}v_{i_3}v_{i_4})^2 \le \Bigl(\sum_i v_i^2\Bigr)^4
 = (Y_i'\Sigma^2Y_i)^4 \le \sigma_1(\Sigma)^8\|Y_i\|_2^8.
\]
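The power patterns that can contribute to $E((u'v)^8 \mid v)$ are exactly the partitions of 8 into parts of size at least 2, since any entry of $u$ appearing with power 1 kills the expectation. A short enumeration, written here as an illustrative helper (the function name is ours), lists them.

```python
def partitions_min_part(total, min_part, max_part=None):
    """Yield partitions of `total` as non-increasing tuples with every part >= min_part."""
    if max_part is None:
        max_part = total
    if total == 0:
        yield ()
        return
    # Choose the largest part first, then recurse on the remainder.
    for part in range(min(total, max_part), min_part - 1, -1):
        for rest in partitions_min_part(total - part, min_part, part):
            yield (part,) + rest

# Power patterns of the entries of u in the expansion of (u'v)^8.
patterns = list(partitions_min_part(8, 2))
```

The enumeration returns seven patterns, namely (8), (6, 2), (5, 3), (4, 4), (4, 2, 2), (3, 3, 2) and (2, 2, 2, 2).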

Note that we just saw that $E(\|Y_i\|_2^8) = O(p^4)$ in our context. Similarly, the term
(3, 3, 2) will contribute
\[ \mu_3^2\sigma^2\sum_{i_1 \neq i_2 \neq i_3} v_{i_1}^3v_{i_2}^3v_{i_3}^2. \]
In absolute value, this term is less than
\[ \mu_3^2\sigma^2\Bigl(\sum_i |v_i|^3\Bigr)^2\Bigl(\sum_i v_i^2\Bigr). \]
Now, note that if $z$ is such that $\|z\|_2 = 1$, we have, for $p \ge 2$,
$\sum_i |z_i|^p \le \sum_i z_i^2 = 1$. Applied to $z = v/\|v\|_2$, we conclude that
$\sum_i |v_i|^p \le \|v\|_2^p$. Consequently, the term (3, 3, 2) contributes in absolute value less than
\[ \mu_3^2\sigma^2\|v\|_2^8. \]
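The elementary inequality $\sum_i |v_i|^p \le \|v\|_2^p$ for exponents at least 2 is what turns every power pattern into a multiple of $\|v\|_2^8$. The sketch below checks it on an arbitrary vector; to avoid a clash with the dimension $p$, the exponent is called `q` in the code, and the vector is generated only for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
v = rng.standard_normal(50)              # arbitrary vector (illustrative)
norm2 = np.linalg.norm(v)

# sum_i |v_i|^q <= ||v||_2^q for q >= 2 (equality at q = 2, hence the small tolerance).
checks = {q: bool(np.sum(np.abs(v) ** q) <= norm2 ** q * (1 + 1e-12))
          for q in (2, 3, 4, 6, 8)}

# Bound used for the (3, 3, 2) pattern, without the moment factors mu_3^2 sigma^2.
term_332 = np.sum(np.abs(v) ** 3) ** 2 * np.sum(v ** 2)
term_332_ok = bool(term_332 <= norm2 ** 8 * (1 + 1e-12))
```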

The same analysis can be repeated for all the other terms, which are all found to be
less than $\|v\|_2^8$ times the moments of $u$ involved. Because we have assumed that
our original random variables had $4 + \varepsilon$ absolute moments, the moments of order
less than 4 cause no problem. The moments of order higher than 4, say $4 + k$, can
be bounded by $\mu_4B_p^k$. Consequently, we see that
\[
E(W_{i,j}^4) = E(E(W_{i,j}^4 \mid Y_i)) \le CB_p^4\,E\Bigl(\frac{\|Y_i\|_2^8}{p^8}\Bigr)
 = O(B_p^4/p^4) = O(p^{-(2+4\delta)}).
\]
Since we have $n^2$ such terms, we see that
\[ \sum_{i \neq j} E(W_{i,j}^4) \to 0 \quad \text{as } p \to \infty. \]
Using our earlier convexity remark, we finally conclude that
\[ \sum_{i \neq j} E(\widetilde{W}_{i,j}^4) \to 0 \quad \text{as } p \to \infty. \]
(iv) Second order term: combining all the elements. We have therefore established
control of the second order term and seen that the largest singular value of
$\widetilde{W}$ goes to 0 in probability, using Chebyshev's inequality. Note that we have also
shown that the operator norm of $W$ is bounded in probability and that
\[ \Bigl\|W - \frac{\mathrm{trace}(\Sigma^2)}{p^2}(\mathbf{1}\mathbf{1}' - \mathrm{Id})\Bigr\|_2 \to 0 \quad \text{in probability.} \]

• Control of the third order term. We note that the third order term is of the form
$f^{(3)}(\xi_{i,j})\frac{X_i'X_j}{p}W_{i,j}$. According to Lemma A.5, if $M$ is a real symmetric matrix
with nonnegative entries, and $E$ is a symmetric matrix such that $\max_{i,j}|E_{i,j}| = \zeta$,
then
\[ \sigma_1(E \circ M) \le \zeta\,\sigma_1(M). \]
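The inequality $\sigma_1(E \circ M) \le \zeta\sigma_1(M)$ relies on $M$ having nonnegative entries. A quick numerical illustration, with arbitrary symmetric matrices that we generate only for this purpose, reads as follows.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 7

M = rng.random((n, n))                   # entries in [0, 1): nonnegative
M = (M + M.T) / 2                        # symmetrize
E = rng.uniform(-1.0, 1.0, (n, n))
E = (E + E.T) / 2

zeta = np.max(np.abs(E))                 # zeta = max_{i,j} |E_ij|
lhs = np.linalg.norm(E * M, 2)           # sigma_1(E o M)
rhs = zeta * np.linalg.norm(M, 2)        # zeta * sigma_1(M)
lemma_holds = bool(lhs <= rhs + 1e-12)
```

The point of the lemma is that only the largest entry of $E$ in absolute value enters the bound, not its operator norm.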
Note that $W$ is a real symmetric matrix with nonnegative entries. So all we have
to show to prove that the third order term goes to zero in operator norm is that
$\max_{i \neq j}|X_i'X_j/p|$ goes to 0, because we have just established that $\|W\|_2$ remains
bounded in probability. We are going to make use of Lemma A.3, page 43 in the
Appendix. In our setting, we have $B_p = p^{1/2-\delta}$, or $2/m = 1/2 - \delta$. The lemma
implies, for instance, that
\[ \max_{i \neq j}|X_i'X_j/p| \le p^{-\delta}\log(p) \quad \text{a.s.} \]
So $\max_{i \neq j}|X_i'X_j/p| \to 0$ a.s. Note that this implies that $\max_{i \neq j}|\xi_{i,j}| \to 0$ a.s.
Since we have assumed that $f^{(3)}$ exists and is continuous and hence bounded in a
neighborhood of 0, we conclude that
\[ \max_{i,j}\bigl|f^{(3)}(\xi_{i,j})X_i'X_j/p\bigr| = o(p^{-\delta/2}) \quad \text{a.s.} \]
If we call $E$ the matrix with entry $E_{i,j} = f^{(3)}(\xi_{i,j})X_i'X_j/p$ off the diagonal and 0
on the diagonal, we see that $E$ satisfies the conditions put forth in our discussion
earlier in this section and we conclude that
\[ \|E \circ W\|_2 \le \max_{i,j}|E_{i,j}|\,\|W\|_2 = o(p^{-\delta/2}) \quad \text{a.s.} \]
Hence, the operator norm of the third order term goes to 0 almost surely. [To
clarify our arguments, let us repeat that we analyzed the second order term by
replacing the $Y_i$'s by, in the notation of the truncation and centralization discussion,
$U_i$. Let us call $W_U = S_U \circ S_U$, again using notation introduced in the truncation
and centralization discussion. As we saw, $\|W - W_U\|_2 \to 0$ a.s., so showing,
as we did, that $\|W_U\|_2$ remains bounded (a.s.) implies that $\|W\|_2$ does too, and
this is the only thing we need in our argument showing the control of the third
order term.]
(B) Control of the diagonal term. The proof here is divided into two parts.
First, we show that the error term coming from the first order expansion of the
diagonal is easily controlled. Then we show that the terms added when replacing the
off-diagonal matrix by $XX'/p + \mathrm{trace}(\Sigma^2)/p^2\,\mathbf{1}\mathbf{1}'$ can also be controlled. Recall
the notation $\tau = \mathrm{trace}(\Sigma)/p$.
• Errors induced by diagonal approximation. Note that Lemma A.3 guarantees
that for all $i$, $|\xi_{i,i} - \tau| \le p^{-\delta/2}$, a.s. Because we have assumed that $f'$ is continuous
and hence bounded in a neighborhood of $\tau$, we conclude that $f'(\xi_{i,i})$ is uniformly
bounded in $p$. Now Lemma A.3 also guarantees that
\[ \max_i\Bigl|\frac{\|X_i\|_2^2}{p} - \tau\Bigr| \le p^{-\delta} \quad \text{a.s.} \]
Hence, the diagonal matrix with entries $f(\|X_i\|_2^2/p)$ can be approximated consistently
in operator norm by $f(\tau)\,\mathrm{Id}$ a.s.
• Errors induced by off-diagonal approximation. When we replace the off-diagonal
matrix by $f'(0)XX'/p + [f(0) + f''(0)\,\mathrm{trace}(\Sigma^2)/2p^2]\mathbf{1}\mathbf{1}'$, we add a
diagonal matrix with $(i,i)$ entry $f(0) + f'(0)\|X_i\|_2^2/p + f''(0)\,\mathrm{trace}(\Sigma^2)/2p^2$,
which we need to subtract eventually. We note that
$0 \le \mathrm{trace}(\Sigma^2)/p^2 \le \sigma_1^2(\Sigma)/p \to 0$ when $\sigma_1(\Sigma)$ remains bounded in $p$. So this term
does not create any problem. Now, we just saw that the diagonal matrix with entries
$\|X_i\|_2^2/p$ can be consistently approximated in operator norm by $(\mathrm{trace}(\Sigma)/p)\,\mathrm{Id}$. So the
diagonal matrix with $(i,i)$ entry $f(0) + f'(0)\|X_i\|_2^2/p + f''(0)\,\mathrm{trace}(\Sigma^2)/2p^2$ can be
approximated consistently in operator norm by $(f(0) + f'(0)\,\mathrm{trace}(\Sigma)/p)\,\mathrm{Id}$ a.s.
This finishes the proof. □

2.3. Kernel random matrices of the type $f(\|X_i - X_j\|_2^2/p)$. As is to be expected,
the properties of such matrices can be deduced from the study of inner
product kernel matrices, with a little bit of extra work. We need to slightly modify
the distributional assumptions under which we work, and consider the case where
we have $5 + \varepsilon$ absolute moments for the entries of $Y_i$. We also need to assume that
$f$ is regular in the neighborhood of different points. Otherwise, the assumptions
are the same as those of Theorem 2.1. We have the following theorem:

THEOREM 2.2 (Spectrum of Euclidean distance kernel matrices). Consider
the $n \times n$ kernel matrix $M$ with entries
\[ M_{i,j} = f\Bigl(\frac{\|X_i - X_j\|_2^2}{p}\Bigr). \]
Let us call
\[ \tau = 2\,\frac{\mathrm{trace}(\Sigma)}{p}. \]
Let us call $\psi$ the vector with $i$th entry $\psi_i = \|X_i\|_2^2/p - \mathrm{trace}(\Sigma)/p$. Suppose
that the assumptions of Theorem 2.1 hold, but that conditions (e) and (f) are
replaced by:
(e') The entries of $Y_i$, a $p$-dimensional random vector, are i.i.d. Also, denoting
by $Y_i(k)$ the $k$th entry of $Y_i$, we assume that $E(Y_i(k)) = 0$, $\mathrm{var}(Y_i(k)) = 1$
and $E(|Y_i(k)|^{5+\varepsilon}) < \infty$ for some $\varepsilon > 0$. (We say that $Y_i$ has $5 + \varepsilon$ absolute
moments.)
(f') $f$ is $C^3$ in a neighborhood of $\tau$.
Then $M$ can be approximated consistently in operator norm (and in probability)
by the matrix $K$, defined by
\[
K = f(\tau)\mathbf{1}\mathbf{1}' + f'(\tau)\Bigl[\mathbf{1}\psi' + \psi\mathbf{1}' - 2\frac{XX'}{p}\Bigr]
 + \frac{f''(\tau)}{2}\Bigl[\mathbf{1}(\psi \circ \psi)' + (\psi \circ \psi)\mathbf{1}' + 2\psi\psi' + 4\frac{\mathrm{trace}(\Sigma^2)}{p^2}\mathbf{1}\mathbf{1}'\Bigr] + \upsilon_p\,\mathrm{Id},
\]
\[ \upsilon_p = f(0) + \tau f'(\tau) - f(\tau). \]
In other words,
\[ \|M - K\|_2 \to 0 \quad \text{in probability.} \]
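To see the approximation of Theorem 2.2 at work on simulated data, one can build $M$ and $K$ side by side. The sketch below is only an illustration under assumptions of our own choosing: Gaussian data with $\Sigma = \mathrm{Id}$, $n = 200$, $p = 400$, and the kernel $f(x) = \exp(-x/2)$ with its first two derivatives supplied by hand. The function name `distance_kernel_and_approx` is hypothetical.

```python
import numpy as np

def distance_kernel_and_approx(X, Sigma, f, f1, f2):
    """Return (M, K): the kernel matrix f(||X_i - X_j||^2 / p) and the matrix K of Theorem 2.2.

    f1 and f2 are the first and second derivatives of f."""
    n, p = X.shape
    tau = 2.0 * np.trace(Sigma) / p
    sq_norms = np.sum(X ** 2, axis=1)
    gram = X @ X.T
    D = (sq_norms[:, None] + sq_norms[None, :] - 2.0 * gram) / p
    np.fill_diagonal(D, 0.0)             # exact zeros on the diagonal
    M = f(D)

    one = np.ones(n)
    psi = sq_norms / p - np.trace(Sigma) / p
    psi2 = psi * psi
    first = np.outer(one, psi) + np.outer(psi, one) - 2.0 * gram / p
    second = (np.outer(one, psi2) + np.outer(psi2, one) + 2.0 * np.outer(psi, psi)
              + 4.0 * np.trace(Sigma @ Sigma) / p ** 2 * np.outer(one, one))
    upsilon = f(0.0) + tau * f1(tau) - f(tau)
    K = (f(tau) * np.outer(one, one) + f1(tau) * first
         + 0.5 * f2(tau) * second + upsilon * np.eye(n))
    return M, K

rng = np.random.default_rng(7)
n, p = 200, 400
Sigma = np.eye(p)                        # assumed covariance, so X_i = Y_i
X = rng.standard_normal((n, p))

f = lambda x: np.exp(-x / 2.0)           # illustrative kernel
f1 = lambda x: -0.5 * np.exp(-x / 2.0)   # f'
f2 = lambda x: 0.25 * np.exp(-x / 2.0)   # f''

M, K = distance_kernel_and_approx(X, Sigma, f, f1, f2)
rel_err = np.linalg.norm(M - K, 2) / np.linalg.norm(M, 2)
```

At these sizes `rel_err` should be a small fraction of the operator norm of $M$; the theorem itself is an asymptotic statement, so the finite-sample value only gives an indication.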

PROOF. Note that here the diagonal is just $f(0)\,\mathrm{Id}$ and it will cause no trouble.
The work, therefore, focuses on the off-diagonal matrix. In what follows, we call
$\tau = 2\,\mathrm{trace}(\Sigma)/p$. Let us define
\[ A_{i,j} = \frac{\|X_i\|_2^2}{p} + \frac{\|X_j\|_2^2}{p} - \tau \]
and
\[ S_{i,j} = \frac{X_i'X_j}{p}. \]
With these notations, we have, off the diagonal, that is, when $i \neq j$, by a Taylor
expansion,
\[
M_{i,j} = f(\tau) + [A_{i,j} - 2S_{i,j}]f'(\tau) + \frac{1}{2}[A_{i,j} - 2S_{i,j}]^2f''(\tau)
 + \frac{1}{6}f^{(3)}(\xi_{i,j})[A_{i,j} - 2S_{i,j}]^3.
\]
We note that the matrix $A$ with entries $A_{i,j}$ is a rank 2 matrix. As a matter of fact,
it can be written, if $\psi$ is the vector with entries $\psi_i = \|X_i\|_2^2/p - \tau/2$, as
$A = \mathbf{1}\psi' + \psi\mathbf{1}'$. Using the well-known identity (see, e.g., [23], Chapter 1, Theorem 3.2),
\[
\det(I + uv' + vu') = \det\begin{pmatrix} 1 + u'v & \|u\|_2^2\\ \|v\|_2^2 & 1 + u'v \end{pmatrix},
\]
we see immediately that the nonzero eigenvalues of $A$ are
\[ \mathbf{1}'\psi \pm \sqrt{n}\,\|\psi\|_2. \]
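The statement about the spectrum of the rank 2 matrix $A = \mathbf{1}\psi' + \psi\mathbf{1}'$ is easy to confirm numerically. In the sketch below `psi` is an arbitrary vector chosen for illustration; by the Cauchy-Schwarz inequality the two predicted eigenvalues have opposite signs, so they are the smallest and the largest eigenvalues of $A$, all others being zero.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 6
psi = rng.standard_normal(n)             # arbitrary vector (illustrative)
one = np.ones(n)

A = np.outer(one, psi) + np.outer(psi, one)
eig = np.sort(np.linalg.eigvalsh(A))     # ascending eigenvalues of the symmetric matrix A

predicted = np.sort([one @ psi - np.sqrt(n) * np.linalg.norm(psi),
                     one @ psi + np.sqrt(n) * np.linalg.norm(psi)])
extreme = np.array([eig[0], eig[-1]])    # the two nonzero eigenvalues
middle = eig[1:-1]                       # the remaining n - 2 eigenvalues
```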
After these preliminary remarks, we are ready to start the proof per se.
• Truncation and centralization. Since we assume $5 + \varepsilon$ absolute moments, we
see, using Lemma 2.2 in [45], that we can truncate the $Y_i$'s at level $B_p = p^{2/5-\delta}$
with $\delta > 0$ and a.s. not change the data matrix. We then need to centralize the
vectors truncated at $p^{2/5-\delta}$. Note that because we work with
$X_i - X_j = \Sigma^{1/2}(Y_i - Y_j)$, centralization creates absolutely no problem here since it is
absorbed in the difference. So in what follows we can assume without loss of generality
that we are working with vectors $X_i = \Sigma^{1/2}Y_i$ where the entries of $Y_i$ are bounded by
$p^{2/5-\delta}$ and $E(Y_i) = 0$. The issue of variance 1 is addressed as before, so we can assume
that the entries of $Y_i$ have variance 1.
• Concentration of $\|X_i - X_j\|_2^2/p$. By plugging in the results of Corollary A.2,
with $2/m = 2/5 - \delta$, we get that
\[ \max_{i \neq j}\Bigl|\frac{\|X_i - X_j\|_2^2}{p} - 2\frac{\mathrm{trace}(\Sigma)}{p}\Bigr| \le \log(p)\,p^{-1/10-\delta}. \]
Also, using the result of Lemma A.3, we have
\[ \max_i|\psi_i| = \max_i\Bigl|\frac{\|X_i\|_2^2}{p} - \frac{\mathrm{trace}(\Sigma)}{p}\Bigr| \le \log(p)\,p^{-1/10-\delta}. \]
Note that, as explained in the proof of Lemma A.3, these results are true whether
we work with $Y_i$ or their truncated and centralized version.
• Control of the second order term. The second order term is the matrix with
$(i,j)$-entry
\[ 1_{i \neq j}\,\tfrac{1}{2}f''(\tau)(A_{i,j} - 2S_{i,j})^2. \]
Let us call $T$ the matrix with 0 on the diagonal and off-diagonal entries
$T_{i,j} = (A_{i,j} - 2S_{i,j})^2$. In other words,
\[ T_{i,j} = \Bigl(\frac{\|X_i - X_j\|_2^2 - 2\,\mathrm{trace}(\Sigma)}{p}\Bigr)^2 1_{i \neq j}. \]
We simply write $(A_{i,j} - 2S_{i,j})^2 = A_{i,j}^2 - 4A_{i,j}S_{i,j} + 4S_{i,j}^2$. In the notation of
the proof of Theorem 2.1, the matrix with entries $S_{i,j}^2$ off the diagonal and 0 on the
diagonal is what we called $W$. We have already shown that
\[ \Bigl\|W - \frac{\mathrm{trace}(\Sigma^2)}{p^2}(\mathbf{1}\mathbf{1}' - \mathrm{Id})\Bigr\|_2 \to 0 \quad \text{in probability.} \]
Now, let us focus on the term $A_{i,j}S_{i,j}$. Let us call $H$ the matrix with
\[ H_{i,j} = (1 - \delta_{i,j})A_{i,j}S_{i,j}. \]
Let us denote by $\widetilde{S}$ the matrix with off-diagonal entries $S_{i,j}$ and 0 on the diagonal.
If we call $S = XX'/p$, we have
\[ \widetilde{S} = S - \mathrm{diag}(S). \]
Now note that $A_{i,j} = \psi_i + \psi_j$. Therefore, we have, if $\mathrm{diag}(\psi)$ is the diagonal
matrix with $(i,i)$ entry $\psi_i$,
\[ H = \widetilde{S}\,\mathrm{diag}(\psi) + \mathrm{diag}(\psi)\,\widetilde{S}. \]
We just saw that under our assumptions, $\max_i|\psi_i| \to 0$ a.s. Because for any $n \times n$
matrices $L_1$, $L_2$, $\|L_1L_2\|_2 \le \|L_1\|_2\|L_2\|_2$, we see that to show that $\|H\|_2$ goes
to 0, we just need to show that $\|\widetilde{S}\|_2$ remains bounded.
Now we clearly have $\|S\|_2 \le \|\Sigma\|_2\|YY'/p\|_2$. We know from [45] that
$\|YY'/p\|_2 \to \sigma^2(1 + \sqrt{n/p})^2$, a.s. Under our assumptions on $n$ and $p$, this is
bounded. Now
\[ \mathrm{diag}(S) = \mathrm{diag}(\psi) + \frac{\mathrm{trace}(\Sigma)}{p}\,\mathrm{Id}, \]
so our concentration results once again imply that $\|\mathrm{diag}(S)\|_2 \le \mathrm{trace}(\Sigma)/p + \eta$
a.s., for any $\eta > 0$. Because $\|\cdot\|_2$ is subadditive, we finally conclude that
\[ \|\widetilde{S}\|_2 \text{ is bounded a.s.} \]
Therefore,
\[ \|H\|_2 \to 0 \quad \text{a.s.} \]
Putting together all these results, we see that we have shown that
\[
\Bigl\|T - \bigl[A \circ A - \mathrm{diag}(A \circ A)\bigr] - 4\frac{\mathrm{trace}(\Sigma^2)}{p^2}(\mathbf{1}\mathbf{1}' - \mathrm{Id})\Bigr\|_2 \to 0 \quad \text{in probability.}
\]

• Control of the third order term. The third order term is the matrix $L$ with 0
on the diagonal and off-diagonal entries
\[
L_{i,j} = \frac{f^{(3)}(\xi_{i,j})}{6}\Bigl(\frac{\|X_i - X_j\|_2^2 - 2\,\mathrm{trace}(\Sigma)}{p}\Bigr)^3 \triangleq (E \circ T)_{i,j},
\]
where $T$ was the matrix investigated in the control of the second order term. On
the other hand, $E$ is the matrix with entries
\[ E_{i,j} = (1 - \delta_{i,j})\frac{f^{(3)}(\xi_{i,j})}{6}\Bigl(\frac{\|X_i - X_j\|_2^2 - 2\,\mathrm{trace}(\Sigma)}{p}\Bigr). \]
We have already seen that through concentration, we have
\[ \max_{i \neq j}\Bigl|\frac{\|X_i - X_j\|_2^2}{p} - \frac{2\,\mathrm{trace}(\Sigma)}{p}\Bigr| \le \log(p)\,p^{-1/10-\delta} \quad \text{a.s.} \]
This naturally implies that
\[ \max_{i \neq j}\Bigl|\xi_{i,j} - \frac{2\,\mathrm{trace}(\Sigma)}{p}\Bigr| \le \log(p)\,p^{-1/10-\delta} \quad \text{a.s.} \]
So if $f^{(3)}$ is bounded in a neighborhood of $\tau$, we see that with high probability so
is $\max_{i \neq j}|f^{(3)}(\xi_{i,j})|$. Therefore,
\[ \max_{i \neq j}|E_{i,j}| \le K\log(p)\,p^{-1/10-\delta}. \]

We are now in position to apply the Hadamard product argument (see Lemma A.5)
we used for the control of the third order term in the proof of Theorem 2.1.
To show that the third order term tends in operator norm to 0, we hence
just need to show that $\|T\|_2$ remains small compared to the bound we just gave on
$\max_{i,j}|E_{i,j}|$. Of course, this is equivalent to showing that the matrix that
approximates $T$ has the same property in operator norm.
Clearly, because $\sigma_1(\Sigma)$ stays bounded, $\mathrm{trace}(\Sigma^2)/p$ stays bounded and so does
$\|\mathrm{trace}(\Sigma^2)/p^2\,(\mathbf{1}\mathbf{1}' - \mathrm{Id})\|_2$. So we just have to focus on $A \circ A - \mathrm{diag}(A \circ A)$.
Recall that $A_{i,i} = 2(\|X_i\|_2^2/p - \mathrm{trace}(\Sigma)/p)$, and so $A_{i,i} = 2\psi_i$. We have already
seen that our concentration arguments imply that $\max_i|\psi_i| \to 0$ a.s. So
$\|\mathrm{diag}(A \circ A)\|_2 = 4\max_i\psi_i^2$ goes to 0 a.s. Now,
\[ A = \mathbf{1}\psi' + \psi\mathbf{1}', \]
and hence, elementary Hadamard product computations [relying on
$ab' \circ uv' = (a \circ u)(b \circ v)'$] give
\[ A \circ A = \mathbf{1}(\psi \circ \psi)' + 2\psi\psi' + (\psi \circ \psi)\mathbf{1}'. \]
Therefore,
\[ \|A \circ A\|_2 \le 2\bigl(\sqrt{n}\,\|\psi \circ \psi\|_2 + \|\psi\|_2^2\bigr). \]
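The Hadamard square of $A = \mathbf{1}\psi' + \psi\mathbf{1}'$ and the resulting operator norm bound can be checked directly. The vector `psi` below is arbitrary and only serves as an illustration; note that the diagonal entries of $A \circ A$ are $(2\psi_i)^2 = 4\psi_i^2$.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 6
psi = 0.1 * rng.standard_normal(n)       # arbitrary small vector (illustrative)
one = np.ones(n)

A = np.outer(one, psi) + np.outer(psi, one)
psi2 = psi * psi                         # psi o psi

# A o A = 1 (psi o psi)' + 2 psi psi' + (psi o psi) 1'
expansion = np.outer(one, psi2) + 2.0 * np.outer(psi, psi) + np.outer(psi2, one)
expansion_ok = np.allclose(A * A, expansion)

# Triangle inequality applied to the three rank-one pieces.
bound = 2.0 * (np.sqrt(n) * np.linalg.norm(psi2) + np.linalg.norm(psi) ** 2)
op_norm_AA = np.linalg.norm(A * A, 2)

diag_norm = np.max(np.abs(np.diag(A * A)))   # operator norm of diag(A o A)
```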
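Using Lemma A.1, and in particular equation (5), we see that
\[ E(\psi_i^2) = 2\sigma^4\frac{\mathrm{trace}(\Sigma^2)}{p^2} + (\mu_4 - 3\sigma^4)\frac{\mathrm{trace}(\Sigma \circ \Sigma)}{p^2}, \]
and therefore, $E(\|\psi\|_2^2)$ remains bounded. On the other hand, using Lemma 2.7
of [5], we see that if we have $5 + \varepsilon$ absolute moments,
\[ E(\psi_i^4) \le C\Bigl[\frac{(\mu_4\,\mathrm{trace}(\Sigma^2))^2}{p^4} + \mu_{5+\varepsilon}B_p^{3-\varepsilon}\frac{\mathrm{trace}(\Sigma^4)}{p^4}\Bigr]. \]
Now recall that we can take $B_p = p^{2/5-\delta}$. Therefore $nE(\|\psi \circ \psi\|_2^2)$ is, at most, of
order $B_p^{3-\varepsilon}/p$. We conclude that
\[ P\Bigl(\|A \circ A\|_2 > \log(p)\sqrt{B_p^{3-\varepsilon}/p}\Bigr) \to 0. \]
Note that this implies that
\[ P\Bigl(\|T\|_2 > \log(p)\sqrt{B_p^{3-\varepsilon}/p}\Bigr) \to 0. \]
Now, note that the third order term is of the form $E \circ T$. Because we have
assumed that we have $5 + \varepsilon$ absolute moments, we have already seen that our
concentration results imply that
\[ \max_{i \neq j}|E_{i,j}| = O\Bigl(\log(p)\sqrt{B_p^2/p}\Bigr) = O\bigl(\log(p)\,p^{-1/10-\delta}\bigr) \quad \text{a.s.} \]
Using the fact that $T$ has positive entries and therefore (see Lemma A.5)
$\|E \circ T\|_2 \le \max_{i,j}|E_{i,j}|\,\|T\|_2$, we conclude that with high probability,
\[ \|E \circ T\|_2 = O\Bigl((\log(p))^2\sqrt{B_p^{5-\varepsilon}/p^2}\Bigr) = O\bigl((\log(p))^2p^{-\delta'}\bigr) \quad \text{where } \delta' > 0. \]
Hence,
\[ \|E \circ T\|_2 \to 0 \quad \text{in probability.} \]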

• Adjustment of the diagonal. To obtain the compact form of the approximation
announced in the theorem, we need to include diagonal terms that are not
present in the matrices resulting from the Taylor expansion. Here, we show that
the corresponding matrices are easily controlled in operator norm.
When we replace the zeroth and first order terms by
\[ f(\tau)\mathbf{1}\mathbf{1}' + f'(\tau)\Bigl[\mathbf{1}\psi' + \psi\mathbf{1}' - 2\frac{XX'}{p}\Bigr], \]
we add to the diagonal the term
$f(\tau) + f'(\tau)(2\psi_i - 2\|X_i\|_2^2/p) = f(\tau) - 2f'(\tau)\,\mathrm{trace}(\Sigma)/p$. In the end, we need to
subtract it.
When we replace the second order term by
$\frac{1}{2}f''(\tau)[\mathbf{1}(\psi \circ \psi)' + 2\psi\psi' + (\psi \circ \psi)\mathbf{1}' + 4\frac{\mathrm{trace}(\Sigma^2)}{p^2}\mathbf{1}\mathbf{1}']$, we add to the diagonal
the diagonal matrix with $(i,i)$ entry,
\[ 2f''(\tau)\Bigl[\psi_i^2 + \frac{\mathrm{trace}(\Sigma^2)}{p^2}\Bigr]. \]
With our assumptions, $\max_i|\psi_i| \to 0$ a.s. and $\mathrm{trace}(\Sigma^2)/p$ remains bounded, so the
added diagonal matrix has operator norm converging to 0 a.s. We conclude that we
do not need to add it to the correction in the diagonal of the matrix approximating
our kernel matrix. □

An interpretation of the proofs of Theorems 2.1 and 2.2 is that they rely on a
local “multiscale” approximation of the original matrix (i.e., the terms used in the
entry-wise approximation are all of different orders of magnitude, or at different
“scales”). However, globally, that is, when looking at eigenvalues of the matrices
and not just at each of their entries, there is a bit of a mixture between the scales,
which creates the difficulties we had to deal with to control the second order term.

2.3.1. A note on the Gaussian Kernel. The Gaussian kernel corresponds to
$f(x) = \exp(-\gamma x)$ in the notation of Theorem 2.2. We would like to discuss it a
bit more because of its widespread use in applications.
The result of Theorem 2.2 gives accurate limiting eigenvalue information for
the case where we renormalize the distances by the dimension, which seems to be
implicitly or explicitly what is often done in practice.
However, it is possible that information about the nonrenormalized case might
also be of interest in some situations. Let us assume now that $\mathrm{trace}(\Sigma)$ grows to
infinity at least as fast as $p^{1/2+2/m+\delta}$, where $\delta > 0$ is such that $1/2 + 2/m + \delta < 1$,
which is possible since $m \ge 5 + \varepsilon$ here. We of course still assume that its largest
singular value, $\sigma_1(\Sigma)$, remains bounded. Then Corollary A.2 guarantees that
\[ \min_{i \neq j}\frac{\|X_i - X_j\|_2^2}{p} > \frac{\mathrm{trace}(\Sigma)}{p} \quad \text{a.s.} \]
Hence
\[ \max_{i \neq j}\exp(-\|X_i - X_j\|_2^2) \le \exp(-\mathrm{trace}(\Sigma)) \le \exp(-p^{1/2+2/m+\delta}) \quad \text{a.s.} \]
Hence, in this case, if $M$ is our kernel matrix with entries $\exp(-\|X_i - X_j\|_2^2)$,
we have
\[ \|M - \mathrm{Id}\|_2 \le n\exp(-p^{1/2+2/m+\delta}) \quad \text{a.s.}, \]
and the upper bound tends to zero extremely fast.
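The degeneracy of the nonrenormalized Gaussian kernel is easy to observe in simulation. The sketch below assumes, purely for illustration, standard Gaussian data with $\Sigma = \mathrm{Id}$ (so that $\mathrm{trace}(\Sigma) = p$), $n = 50$ and $p = 200$; it computes the kernel matrix with entries $\exp(-\|X_i - X_j\|_2^2)$ and its operator norm distance to the identity.

```python
import numpy as np

rng = np.random.default_rng(10)
n, p = 50, 200
X = rng.standard_normal((n, p))          # rows X_i with Sigma = Id (assumption)

sq_norms = np.sum(X ** 2, axis=1)
D = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T   # squared distances, not divided by p
np.fill_diagonal(D, 0.0)

M = np.exp(-D)                           # nonrenormalized Gaussian kernel matrix
dev = np.linalg.norm(M - np.eye(n), 2)   # ||M - Id||_2

off_diag = D[~np.eye(n, dtype=bool)]
min_sq_dist = off_diag.min()             # to be compared with trace(Sigma) = p
```

With these sizes every off-diagonal squared distance is of order $2p$, so the off-diagonal entries of $M$ are numerically negligible and $M$ is indistinguishable from the identity.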

2.4. More general models. In this subsection, we consider more general mod-
els than the ones considered above. In particular, we will here focus on data models
for which the vectors Xi satisfy a so-called dimension-free concentration inequal-
ity. As was shown in [19], under these conditions, the Marčenko–Pastur equation
holds (as well as generalized versions of it). Note that these models are more gen-
eral than the one considered above (the proofs in the Appendix illustrate why the
standard random matrix models can be considered as subcases of this more general
class of matrices) and can describe various interesting objects like vectors with
certain log-concave distributions or vectors sampled in a uniform manner from certain
Riemannian submanifolds of $\mathbb{R}^p$ endowed with the canonical Riemannian metric
inherited from $\mathbb{R}^p$.
Before we state more precisely the theorem and give examples of distributions
that satisfy its assumptions, let us give some motivation. One potential criticism of
Theorems 2.1 and 2.2 is that they deal with data models that are inherently quite
linear. So a natural question is to understand whether our linear approximation
result is limited to these linear settings. Also, kernel methods are often advocated
for their handling of nonlinearities, and though the linear case is probably a basic
one that needs to be understood, a null model of sorts, it is important to be able to
go beyond it. As we will soon see, the next theorems allow us to get results beyond
the linear setting.
Our generalization of Theorem 2.1 to more general distributions is the follow-
ing.

THEOREM 2.3. Consider a triangular “array” of matrices where each row of
the array consists of an $n \times p$ matrix. We assume these matrices are independent.
We call $X_i$, $i = 1, \ldots, n$, the rows of this matrix.
Suppose the vectors $\{X_i\}_{i=1}^n \in \mathbb{R}^p$ are i.i.d. mean 0 and have the property that
for any 1-Lipschitz function $F$ (with respect to Euclidean norm), if $m_F$ is a median
of $F(X_i)$,
\[ \forall t > 0, \quad P(|F(X_i) - m_F| > t) \le C\exp(-ct^b) \quad \text{for some } b > 0, \]
where $C$ is independent of $p$ and $c$ may depend on $p$ but is required to satisfy
$c \ge p^{-(1/2-\varepsilon)b/2}$. Here $b$ is fixed and independent of $n$ and $p$.
Consider the $n \times n$ kernel random matrix $M$ with $M_{i,j} = f(X_i'X_j/p)$. Assume
that $p \asymp n$.
(a) Call $\Sigma$ the covariance matrix of the $X_i$'s and assume that $\sigma_1(\Sigma)$ stays bounded
and $\mathrm{trace}(\Sigma)/p$ has a limit.
(b) Suppose that $f$ is a real valued function which is $C^2$ around 0 and $C^1$ around
$\mathrm{trace}(\Sigma)/p$.
Then the spectrum of this matrix is asymptotically nonrandom and has, a.s., the
same limiting spectral distribution as that of
\[ \widetilde{M} = f(0)\mathbf{1}\mathbf{1}' + f'(0)\frac{XX'}{p} + \upsilon_p\,\mathrm{Id}_n, \]
where $\upsilon_p = f\bigl(\frac{\mathrm{trace}(\Sigma)}{p}\bigr) - f(0) - f'(0)\frac{\mathrm{trace}(\Sigma)}{p}$.
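A small simulation can illustrate Theorem 2.3 on one of the non-Gaussian models it covers. The sketch below makes several assumptions of our own: the rows $X_i$ are drawn uniformly on the sphere of radius $\sqrt{p}$ (mean 0, covariance $\mathrm{Id}$, so $\mathrm{trace}(\Sigma)/p = 1$), the kernel is $f(x) = \exp(x)$, and $n = 200$, $p = 400$. It compares the spectrum of the kernel matrix with that of the linearized matrix through their medians, a crude summary that is insensitive to the single large eigenvalue created by the rank one term.

```python
import numpy as np

rng = np.random.default_rng(11)
n, p = 200, 400

# Rows uniform on the sphere of radius sqrt(p): normalize Gaussian vectors.
G = rng.standard_normal((n, p))
X = np.sqrt(p) * G / np.linalg.norm(G, axis=1, keepdims=True)

f, f_prime = np.exp, np.exp              # illustrative choice f(x) = exp(x), so f' = f
tau = 1.0                                # trace(Sigma)/p for this model

gram = X @ X.T / p
M = f(gram)                              # kernel matrix f(X_i' X_j / p)
upsilon = f(tau) - f(0.0) - f_prime(0.0) * tau
M_tilde = f(0.0) * np.ones((n, n)) + f_prime(0.0) * gram + upsilon * np.eye(n)

eig_M = np.linalg.eigvalsh(M)
eig_tilde = np.linalg.eigvalsh(M_tilde)
median_gap = abs(np.median(eig_M) - np.median(eig_tilde))
```

In this model the diagonals of the two matrices coincide exactly, because $\|X_i\|_2^2/p = 1$ for every $i$; the bulk of the two spectra should then be close, which the small value of `median_gap` reflects.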

We note that the term $f(0)\mathbf{1}\mathbf{1}'$ does not affect the limiting spectral distribution
of $\widetilde{M}$ since finite rank perturbations do not have any effect on limiting spectral
distributions (see, e.g., [3], Lemma 2.2). Therefore, it could be removed from the
approximating matrix, but since it will clearly be present in numerical work and
simulations, we chose to leave it in our approximation. We also note that the
limiting distribution of $XX'/p$ under these assumptions has been obtained in [19].
Here are a few examples of models satisfying the distributional assumptions
stated above. (Unless otherwise noted, $b = 2$ in all these examples.)

Examples of distributions for which the previous theorem applies.

• Gaussian random variables with $|||\Sigma|||_2$ bounded and $\mathrm{trace}(\Sigma)/p$ convergent. The
assumptions of the theorem apply according to [29], Theorem 2.7, with
$c(p) = 1/|||\Sigma|||_2$.
• Vectors of the type $\sqrt{p}\,r$ where $r$ is uniformly distributed on the unit ($\ell_2$-)sphere
in dimension $p$. Theorem 2.3 in [29] shows that our Theorem 2.3 applies,
with $c(p) = (1 - 1/p)/2$, after noticing that a 1-Lipschitz function with respect
to Euclidean norm is also 1-Lipschitz with respect to the geodesic distance on
the sphere.
• Vectors $\Gamma\sqrt{p}\,r$ with $r$ uniformly distributed on the unit ($\ell_2$-)sphere in $\mathbb{R}^p$ and
with $\Gamma\Gamma' = \Sigma$ where $\Sigma$ satisfies the assumptions of the theorem.
• Vectors with log-concave density of the type $e^{-U(x)}$ with the Hessian of $U$ satisfying,
for all $x$, $\mathrm{Hess}(U) \ge c\,\mathrm{Id}_p$ where $c > 0$ has the characteristics of $c(p)$
above (see [29], Theorem 2.7). Here we also need $|||\Sigma|||_2$ to satisfy the assumptions
of the theorem.
• Vectors $r$ distributed according to a (centered) Gaussian copula, with corresponding
correlation matrix $\Sigma$ having $|||\Sigma|||_2$ bounded. Here Theorem 2.3 applies,
since if $\tilde r$ has a Gaussian copula distribution, then its $i$th entry satisfies
$\tilde r_i = \Phi(v_i)$ where $v$ is multivariate normal with covariance matrix $\Sigma$, $\Sigma$ being a
correlation matrix, that is, its diagonal is 1. Here $\Phi$ is the cumulative distribution
function of a standard normal distribution, which is trivially Lipschitz. Now taking
$r = \tilde r - 1/2$ gives a centered Gaussian copula. The fact that the covariance
matrix of $r$ then has bounded operator norm requires a bit of work and is shown
in the Appendix of [19].
• Vectors sampled uniformly from certain compact connected smooth Riemannian
submanifolds, $\mathcal{M}$, of $\mathbb{R}^p$, canonically equipped with the Riemannian
metric $g$ defined by restricting to each tangent space the ambient scalar product
in $\mathbb{R}^p$. The curvature properties of these submanifolds need to satisfy the
assumptions of Theorem 2.4 in [29]. Also, the covariance matrix $\Sigma$ needs to satisfy
the assumptions of Theorem 2.3. We note that since the length of a curve in
$\mathcal{M}$ is equal to its length in $\mathbb{R}^p$, the same remark that we made in the case of the
sphere of unit radius applies here, too. In particular, a 1-Lipschitz function with
respect to Euclidean norm is 1-Lipschitz with respect to the geodesic distance
on the manifold.
• Vectors of the type $p^{1/b}r$, $1 \le b \le 2$, where $r$ is uniformly distributed in the
unit $\ell_b$ ball or sphere in $\mathbb{R}^p$. (See [29], Theorem 4.21, which refers to [34] as
the source of the theorem.) We also refer the reader to [29], pages 37 and 38,
for some of the subtleties involved in the definition of the uniform distribution on
the sphere. Fact A.1 applies to them, with $c(p)$ depending only on $b$. Also, the
concentration function is of the form $\exp(-c(b)t^b)$ here.
We now turn to the proof of Theorem 2.3. The first step in the proof is the
following lemma.

LEMMA 2.1. Suppose $K_n$ is an $n \times n$ real symmetric matrix, whose spectral
distribution converges weakly (i.e., in distribution) to a limit. Suppose $M_n$ is an
$n \times n$ real symmetric matrix.

1. Suppose $M_n$ is such that $\|M_n - K_n\|_F = o(\sqrt{n})$. Then $M_n$ and $K_n$ have the
same limiting spectral distribution.
2. Suppose $M_n$ is such that $|||M_n - K_n|||_2 \to 0$. Then $M_n$ and $K_n$ have the same
limiting spectral distribution.

Before we prove the lemma, we note that our assumptions imply that the lim-
iting spectral distribution of Kn is a probability distribution. Therefore, to obtain
the results of the lemma, we just need to show pointwise convergence of Stieltjes
transforms and then rely on the results of [22], and in particular Corollary 1 there.

PROOF OF LEMMA 2.1. We call $St_{K_n}$ and $St_{M_n}$ the Stieltjes transforms of the
spectral distributions of these two matrices. Suppose $z = u + iv$ with $v > 0$. Let us call
$l_i(M_n)$ the $i$th largest eigenvalue of $M_n$.

Proof of statement 1. We first focus on the Frobenius norm part of the lemma.
We have
$$|St_{K_n}(z) - St_{M_n}(z)| = \frac{1}{n}\left|\sum_{i=1}^n \left(\frac{1}{l_i(K_n) - z} - \frac{1}{l_i(M_n) - z}\right)\right| \le \frac{1}{n}\sum_{i=1}^n \frac{|l_i(M_n) - l_i(K_n)|}{v^2}.$$
Now, by Hölder's inequality,
$$\sum_{i=1}^n |l_i(M_n) - l_i(K_n)| \le \sqrt{n}\,\sqrt{\sum_{i=1}^n |l_i(M_n) - l_i(K_n)|^2}.$$
Using Lidskii's theorem [i.e., the fact that, since $M_n$ and $K_n$ are hermitian, the
vector with entries $l_i(M_n) - l_i(K_n)$ is majorized by the vector $l_i(M_n - K_n)$], with,
in the notation of [10], Theorem III.4.4, $\Phi(x) = x^2$, we have
$$\sum_{i=1}^n |l_i(M_n) - l_i(K_n)|^2 \le \sum_{i=1}^n l_i^2(M_n - K_n) = \|M_n - K_n\|_F^2.$$
We conclude that
$$|St_{K_n}(z) - St_{M_n}(z)| \le \frac{\|M_n - K_n\|_F}{\sqrt{n}\,v^2},$$
since $|l_i(K_n) - z| \ge |\mathrm{Im}[l_i(K_n) - z]| = v$, and therefore $1/|l_i(K_n) - z| \le 1/v$.
Under the assumptions of the lemma, we therefore have
$$|St_{K_n}(z) - St_{M_n}(z)| \to 0.$$
Therefore the Stieltjes transform of the spectral distribution of $M_n$ converges pointwise
to the Stieltjes transform of the limiting spectral distribution of $K_n$. Hence, by,
e.g., Corollary 1 in [22], the spectral distribution of $M_n$ converges in distribution
to the limiting spectral distribution of $K_n$, which, as noted earlier, is a probability
distribution by our assumptions.

Proof of statement 2. Let us now turn to the operator norm part of the lemma.
By the same computations as above, we have, using Weyl's inequality,
$$|St_{K_n}(z) - St_{M_n}(z)| = \frac{1}{n}\left|\sum_{i=1}^n \left(\frac{1}{l_i(K_n) - z} - \frac{1}{l_i(M_n) - z}\right)\right| \le \frac{1}{n}\sum_{i=1}^n \frac{|l_i(M_n) - l_i(K_n)|}{v^2} \le \frac{|||M_n - K_n|||_2}{v^2}.$$
Hence if $|||M_n - K_n|||_2 \to 0$, it is clear that the two Stieltjes transforms are
asymptotically equal, and the conclusion follows. □
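
The Frobenius-norm bound obtained in the proof of statement 1 can be checked on a concrete pair of matrices. The sketch below is only an illustration under assumed choices: a Wigner-type matrix stands in for $K_n$, a symmetric perturbation of Frobenius norm $n^{1/4}$ is added to form $M_n$, and the point $z$ and the helper name `stieltjes` are arbitrary.

```python
import numpy as np

def stieltjes(eigs, z):
    """Stieltjes transform of the empirical spectral distribution at z."""
    return np.mean(1.0 / (eigs - z))

rng = np.random.default_rng(1)
n = 300
A = rng.standard_normal((n, n))
K = (A + A.T) / np.sqrt(2 * n)              # symmetric reference matrix K_n
E = rng.standard_normal((n, n))
E = (E + E.T) / 2                           # symmetric perturbation
E *= n ** 0.25 / np.linalg.norm(E, "fro")   # ||E||_F = n^{1/4} = o(sqrt(n))
M = K + E

z = 0.3 + 0.5j                              # z = u + iv with v = 0.5
v = z.imag
lhs = abs(stieltjes(np.linalg.eigvalsh(K), z) - stieltjes(np.linalg.eigvalsh(M), z))
rhs = np.linalg.norm(M - K, "fro") / (np.sqrt(n) * v ** 2)
print(lhs, "<=", rhs)
```

The inequality `lhs <= rhs` holds for every symmetric pair, so the random seed only affects how slack the bound is.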

We now turn to the proof of the theorem.

PROOF OF THEOREM 2.3. For the weaker statement required for the proof
of Theorem 2.3, we will show that in the δ-method we need to keep only the
first term of the expansion as long as $f$ has a second derivative that is bounded
in a neighborhood of 0, and a first derivative that is bounded in a neighborhood
of $\mathrm{trace}(\Sigma)/p$. In other words, we will split the problem into two parts: off the
diagonal, we write
$$f\left(\frac{X_i'X_j}{p}\right) = f(0) + f'(0)\frac{X_i'X_j}{p} + \frac{f''(\xi_{i,j})}{2}\left(\frac{X_i'X_j}{p}\right)^2 \quad \text{if } i \ne j;$$
on the diagonal, we write
$$f\left(\frac{X_i'X_i}{p}\right) = f\left(\frac{\mathrm{trace}(\Sigma)}{p}\right) + f'(\xi_{i,i})\left(\frac{X_i'X_i}{p} - \frac{\mathrm{trace}(\Sigma)}{p}\right).$$

• Control of the off-diagonal error matrix. Here we focus on the matrix $\tilde W$ with
$(i, j)$ entry
$$\tilde W_{i,j} = \frac{f''(\xi_{i,j})}{2}\left(\frac{X_i'X_j}{p}\right)^2 1_{i \ne j}.$$
The strategy is going to be to control the Frobenius norm of the matrix $W$ with entries
$$W_{i,j} = \begin{cases} \left(\dfrac{X_i'X_j}{p}\right)^2, & \text{if } i \ne j, \\ 0, & \text{if } i = j. \end{cases}$$
According to Lemma 2.1, it is enough for our needs to show that the Frobenius
norm of this matrix is $o(\sqrt{n})$ a.s. to have the result we wish. Hence, the result will
be shown, if we can for instance show that
$$\max_{i,j} W_{i,j} \le p^{-(1/2+\varepsilon)} (\log(p))^{1+\delta} \quad \text{a.s., for some } \delta > 0.$$
Now Lemma A.4 or Fact A.1 gives, for instance,
$$\max_{i \ne j} \left|\frac{X_i'X_j}{p}\right| \le \bigl(p\,c^{2/b}(p)\bigr)^{-1/2} [\log(p)]^{2/b} \quad \text{a.s.}$$
Therefore, with our assumption on $c(p)$, we have
$$\max_{i,j} W_{i,j} \le p^{-(1/2+\varepsilon)} (\log(p))^{4/b} \quad \text{a.s.}$$
Now, $\|W\|_F \le n \max_{i,j} |W_{i,j}|$, so we conclude that in this situation, with our
assumption that $n \asymp p$,
$$\|W\|_F = o(\sqrt{n}) \quad \text{a.s.}$$
Now let us focus on
$$\tilde W_{i,j} = \frac{f''(\xi_{i,j})}{2} W_{i,j},$$
where $\xi_{i,j}$ is between 0 and $X_i'X_j/p$. We just saw that with very high probability,
this latter quantity was less (in absolute value) than $p^{-(1/4+\varepsilon/2)} (\log(p))^{2/b}$, if
$c \ge p^{-(1/2-\varepsilon)b/2}$. Therefore if $f''$ is bounded by $K$ in a neighborhood of 0, we have,
with very high probability, that
$$\|\tilde W\|_F \le K\|W\|_F = o(\sqrt{n}).$$
• Control of the diagonal matrix. We first note that when we replace the off-diagonal
matrix by $f(0)11' + f'(0)XX'/p$, we add to the diagonal certain terms
that we need to subtract eventually.
Hence, our strategy here is to show that we can approximate (in operator norm)
the diagonal matrix $D$ with entries
$$D_{i,i} = f\left(\frac{\mathrm{trace}(\Sigma)}{p}\right) + f'(\xi_{i,i})\left(\frac{X_i'X_i}{p} - \frac{\mathrm{trace}(\Sigma)}{p}\right) - f'(0)\frac{X_i'X_i}{p} - f(0)$$
by $\upsilon_p \mathrm{Id}_n$. To do so, we just have to show that the diagonal error matrix $Z$, with
entries
$$Z_{i,i} = \bigl[f'(\xi_{i,i}) - f'(0)\bigr]\left(\frac{X_i'X_i}{p} - \frac{\mathrm{trace}(\Sigma)}{p}\right),$$
goes to zero in operator norm.
As seen in Lemma A.4 or Fact A.1, if $c \ge p^{-(1/2-\varepsilon)b/2}$, with very high
probability,
$$\max_i \left|\frac{X_i'X_i}{p} - \frac{\mathrm{trace}(\Sigma)}{p}\right| \le p^{-(1/4+\varepsilon/2)} (\log(p))^{2/b}.$$
If $f'$ is continuous and hence bounded around $\mathrm{trace}(\Sigma)/p$, we therefore see that the
operator (or spectral) norm of $Z$ satisfies, with high probability,
$$|||Z|||_2 \le K p^{-(1/4+\varepsilon/2)} (\log(p))^{2/b}.$$
• Final step. Calling $\tilde M$ the approximating matrix of the theorem and $\tilde W$ the
off-diagonal error matrix, we clearly have
$$M - \tilde M = \tilde W + Z.$$
It is also clear that $\tilde M$ has a limiting spectral distribution, satisfying, up to centering
and scaling, the Marčenko–Pastur equation; this was shown in [19]. By Lemma 2.1,
we see that $M$ and $M - Z$ have the same limiting spectral distribution,
since their difference is $Z$ and $|||Z|||_2 \to 0$. Using the same lemma, we see that $\tilde M$
and $M - Z$ have (in probability) the same limiting spectral distribution, since their
difference is $\tilde W$ and we have established that the Frobenius norm of this matrix is
(in probability) $o(\sqrt{n})$. Hence, $M$ and $\tilde M$ have (in probability) the same limiting
spectral distribution. □

We finally treat the case of kernel matrices computed from Euclidean norms, in
this more general distributional setting.

THEOREM 2.4. Let us call $\tau = 2\,\mathrm{trace}(\Sigma)/p$ where $\Sigma$ is the covariance matrix
of the $X_i$'s. Suppose that $f$ is a real valued function which is $C^2$ around $\tau$ and
$C^1$ around 0.
Under the assumptions of Theorem 2.3, the kernel matrix $M$ with $(i, j)$ entry
$$M_{i,j} = f\left(\frac{\|X_i - X_j\|_2^2}{p}\right)$$
has a nonrandom limiting spectral distribution which is the same as that of the
matrix
$$\tilde M = f(\tau)11' - 2f'(\tau)\frac{XX'}{p} + \upsilon_p \mathrm{Id}_n,$$
where $\upsilon_p = f(0) + \tau f'(\tau) - f(\tau)$.

We note once again that the term $f(\tau)11'$ does not affect the limiting spectral
distribution of the approximating matrix $\tilde M$. But we keep it for the same reasons as before.
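
A similar numerical sketch can be written for Theorem 2.4. It assumes Gaussian data with $\Sigma = \mathrm{Id}_p$ (so $\tau = 2$) and the Gaussian-type kernel $f(x) = \exp(-x/2)$; both are illustrative choices rather than anything prescribed by the theorem. Because the two matrices also differ by low-rank terms, the comparison is made on bulk quantiles of the spectra rather than entrywise.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 300, 300
X = rng.standard_normal((n, p))            # Sigma = Id_p, so tau = 2 trace(Sigma)/p = 2
tau = 2.0

def f(x):
    return np.exp(-x / 2.0)                # illustrative Gaussian-type kernel

def f_prime(x):
    return -0.5 * np.exp(-x / 2.0)         # derivative of f

G = X @ X.T / p                            # Gram matrix XX'/p
sq_norms = np.diag(G)
D2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * G   # ||X_i - X_j||^2 / p
np.fill_diagonal(D2, 0.0)                  # remove rounding noise on the diagonal
M = f(D2)                                  # Euclidean-distance kernel matrix

upsilon = f(0.0) + tau * f_prime(tau) - f(tau)
M_tilde = f(tau) * np.ones((n, n)) - 2.0 * f_prime(tau) * G + upsilon * np.eye(n)

ev_M = np.linalg.eigvalsh(M)
ev_lin = np.linalg.eigvalsh(M_tilde)
qs = [0.1, 0.25, 0.5, 0.75, 0.9]
print("quantiles of spec(M):      ", np.quantile(ev_M, qs))
print("quantiles of spec(M_tilde):", np.quantile(ev_lin, qs))
```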

PROOF OF THEOREM 2.4. Note that the diagonal term is simply $f(0)\,\mathrm{Id}$, so
this term does not create any problem.
The rest of the proof is similar to that of Theorem 2.3. In particular the control of the
Frobenius norm of the second order term is done in the same way, by controlling
the maximum of the off-diagonal term, using Corollary A.3 and Fact A.1 (and
hence Lemma A.4).
Therefore, we only need to understand the first order term, in other words, the
matrix $R$ with 0 on the diagonal and off-diagonal entry
$$R_{i,j} = \frac{\|X_i - X_j\|_2^2}{p} - \tau = \left(\frac{\|X_i\|_2^2}{p} - \frac{\mathrm{trace}(\Sigma)}{p}\right) + \left(\frac{\|X_j\|_2^2}{p} - \frac{\mathrm{trace}(\Sigma)}{p}\right) - 2\frac{X_i'X_j}{p}.$$
As in the proof of Theorem 2.2, let us call $\psi$ the vector with $i$th entry
$\psi_i = \frac{\|X_i\|_2^2}{p} - \frac{\mathrm{trace}(\Sigma)}{p}$. Clearly,
$$R_{i,j} = 1_{i \ne j}\left(1\psi' + \psi 1' - 2\frac{XX'}{p}\right)_{i,j}.$$
Simple computations show that
$$R - 2\frac{\mathrm{trace}(\Sigma)}{p}\,\mathrm{Id} = 1\psi' + \psi 1' - 2\frac{XX'}{p}.$$
Now, obviously, $1\psi' + \psi 1'$ is a matrix of rank at most 2. Hence, $R$ has the same
limiting spectral distribution as
$$2\frac{\mathrm{trace}(\Sigma)}{p}\,\mathrm{Id} - 2\frac{XX'}{p}$$
since finite rank perturbations do not affect limiting spectral distributions (see, for
instance, [3], Lemma 2.2). This completes the proof. □

2.5. Some consequences of the theorems. In practice, it is often the case that
slight variants of kernel random matrices are used. In particular, it is customary to
center the matrices, that is, to transform $M$ so that its row sums, or column sums, or
both are 0. Note that these operations correspond to right and/or left multiplication
by the matrix $H = \mathrm{Id}_n - 11'/n$.
In these situations, our results still apply; the following fact makes it clear.

FACT 2.1 (Centered kernel random matrices). Let $H$ be the $n \times n$ matrix
$\mathrm{Id}_n - 11'/n$.

1. If the kernel random matrix $M$ can be approximated consistently in operator
norm by $K$, then, if $a, b \in \{0, 1\}$,
$H^a M H^b$ can be approximated consistently in operator norm by $H^a K H^b$.
2. If the kernel random matrix $M$ has the same limiting spectral distribution as
the matrix $K$, then, if $a, b \in \{0, 1\}$,
$H^a M H^b$ has the same limiting spectral distribution as $K$.

A nice consequence of the first point is that the recent hard work on localizing
the largest eigenvalues of sample covariance matrices (see [8, 17] and [31]) can be
transferred to kernel random matrices and used to give some information about the
localization of the largest eigenvalues of $HMH$, for instance. In the case of the
results of [17], Fact 2 and the arguments of [19], Section 2.3.4, show that it gives
exact localization information. In other words, we can characterize the a.s. limit
of the largest eigenvalue of $HMH$ (or $HM$ or $MH$) fairly explicitly, provided
Fact 2 in [17] applies. Finally, let us mention the obvious fact that since, for two square
matrices $A$ and $B$, $AB$ and $BA$ have the same eigenvalues, we see that $HMH$ has
the same eigenvalues as $MH$ and $HM$ because $H^2 = H$.

PROOF OF FACT 2.1. The proofs are simple. First note that $H$ is positive
semi-definite and $|||H|||_2 = 1$. Using the submultiplicativity of $|||\cdot|||_2$, we see that
$$|||H^a M H^b - H^a K H^b|||_2 \le |||M - K|||_2\,|||H^a|||_2\,|||H^b|||_2 = |||M - K|||_2.$$
This shows the first point of the fact.
The second point follows from the fact that $H^a M H^b$ is a finite rank perturbation
of $M$. Hence, using Lemma 2.2 in [3], we see that these two matrices have the same
limiting spectral distribution, and since, by assumption, $K$ has the same limiting
spectral distribution as $M$, we have the result of the second point. □
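
The algebraic facts used here are easy to verify on a small example. The sketch below assumes a Gaussian data matrix and the entrywise exponential kernel purely for illustration; it checks that $H$ is idempotent with operator norm 1, that $HMH$ and $MH$ share their eigenvalues, and that $HMH - M$ has rank at most 2.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 80
X = rng.standard_normal((n, p))
M = np.exp(X @ X.T / p)                    # a symmetric kernel matrix (illustrative)
H = np.eye(n) - np.ones((n, n)) / n        # centering matrix H = Id_n - 11'/n

idempotent = np.allclose(H @ H, H)         # H^2 = H
op_norm_H = np.linalg.norm(H, 2)           # largest singular value of H
ev_HMH = np.sort(np.linalg.eigvalsh(H @ M @ H))
ev_MH = np.sort(np.linalg.eigvals(M @ H).real)
# Centering on both sides changes M by a matrix of rank at most 2.
rank_gap = np.linalg.matrix_rank(H @ M @ H - M, tol=1e-8)
print(idempotent, op_norm_H, rank_gap)
```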

On Laplacian-like matrices. Finally, we point out a simple consequence of our
results for Laplacian-like matrices. In light of recent results on manifold learning
(see [9]) where these matrices play a key role, it is natural to ask what happens to
them in our context. Suppose $M$ is an $n \times n$ kernel random matrix as defined in
the previous theorems, and consider the (Laplacian-like) matrix $L$ defined by
$$L_{i,j} = \begin{cases} -M_{i,j}/n, & \text{if } i \ne j, \\ \dfrac{1}{n}\displaystyle\sum_{k \ne i} M_{i,k}, & \text{if } i = j. \end{cases}$$

Call $D_L$ the diagonal matrix made up of the diagonal elements of $L$. We note
that our concentration results (Lemmas A.3 and A.4, as well as Fact A.1) imply
that $D_L$ can be approximated in operator norm by $f(0)\,\mathrm{Id}_n$ in the scalar product
kernel matrix case and by $f(2\,\mathrm{trace}(\Sigma_p)/p)\,\mathrm{Id}_n$ in the Euclidean distance kernel
matrix case. [It is so because $L_{i,i}$ is an average of almost constant (and equal)
quantities, so with high probability $L_{i,i}$ cannot deviate from this constant value,
for that would require that at least one of the components of the average deviate
from the constant value in question.] Hence, there exists $\gamma_p$ such that
$|||D_L - \gamma_p \mathrm{Id}_n|||_2$ tends to 0 almost surely. We also recall that the diagonal of the matrix
$M$ can be consistently approximated in operator norm by a (finite) multiple of
the identity matrix, so the diagonal of $M/n$ can be consistently approximated in
operator norm by 0. Therefore, $|||L + M/n - D_L|||_2$ tends to 0 almost surely, and
therefore, $|||L + M/n - \gamma_p \mathrm{Id}_n|||_2$ tends to zero almost surely. In other words, $L$ can
be consistently approximated in operator norm by $\gamma_p \mathrm{Id}_n - M/n$. Consequently,
when we can, as in Theorems 2.1 and 2.2, consistently approximate $M$ in operator
norm by a linearized version, $K$, of $M$, then $L$ can be consistently approximated in
operator norm by $\gamma_p \mathrm{Id}_n - K/n$, and we can deduce spectral properties of $L$ from
that of $\gamma_p \mathrm{Id}_n - K/n$. When we know only about the limiting spectral distribution
of $M$, as in Theorems 2.3 and 2.4, the operator norm consistent approximation of
$L$ by $\gamma_p \mathrm{Id}_n - M/n$ carries over to give us information about the limiting spectral
distribution of $L$ since the effect of $\gamma_p \mathrm{Id}_n$ is just to “shift” the eigenvalues by $\gamma_p$.
We note that getting information about the eigenvectors of $L$ would require finer
work on the properties of the matrix $D_L$ since approximating it by a multiple of
the identity does not give us any information about its eigenvectors.
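
The operator norm approximation of $L$ by $\gamma_p \mathrm{Id}_n - M/n$ can be illustrated as follows. This is a sketch under assumed choices: Gaussian data with $\Sigma = \mathrm{Id}_p$ and the scalar product kernel with $f = \exp$, so that $\gamma_p = f(0) = 1$.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 300, 300
X = rng.standard_normal((n, p))
M = np.exp(X @ X.T / p)                    # scalar product kernel with f = exp, f(0) = 1

L = -M / n                                 # off-diagonal part: L_ij = -M_ij / n
np.fill_diagonal(L, 0.0)
np.fill_diagonal(L, -L.sum(axis=1))        # L_ii = (1/n) * sum over k != i of M_ik

gamma = 1.0                                # gamma_p = f(0) in this setting
approx = gamma * np.eye(n) - M / n
err = np.linalg.norm(L - approx, 2)        # operator norm of the approximation error
print("|||L - (gamma Id - M/n)|||_2 =", err)
```

Here the difference `L - approx` is diagonal, so its operator norm is simply the largest deviation of a diagonal entry.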

3. Conclusions. The main result of this paper is that under various technical
assumptions, in high-dimensions, kernel random matrices [i.e., $n \times n$ matrices with
$(i, j)$th entry $f(X_i'X_j/p)$ or $f(\|X_i - X_j\|_2^2/p)$ where $\{X_i\}_{i=1}^n$ are i.i.d. random
vectors in $\mathbb{R}^p$ with $p \to \infty$ and $p \asymp n$], which are often used to create nonlinear
versions of standard statistical methods, essentially behave like covariance matrices,
that is, linearly. This result is in sharp contrast with the low-dimensional
situation where $p$ is assumed to be fixed, and where it is known that, under some
regularity conditions, spectral properties of kernel random matrices mimic those
of certain integral operators. Under ICA-like assumptions, we were able to get a
“strong approximation” result (Theorems 2.1 and 2.2), that is, an operator norm
consistency result that carries information about individual eigenvalues and eigenvectors
corresponding to separated eigenvalues. Under more general and less linear
assumptions (Theorems 2.3 and 2.4), we have obtained results concerning the limiting
spectral distribution of these matrices using a “weak approximation” result
relying on bounds on Frobenius norms.
Beside the mathematical results obtained above, this study raises several statis-
tical questions, both about the richness—or lack thereof—of models that are often
studied in random matrix theory and about the effect of kernel methods in this
context.

3.1. On kernel random matrices. Our study, motivated in part by numerical
experiments we read about in the interesting [44], has shown that in the asymptotic
setting we considered, which is generally considered relevant for high-dimensional
data analysis, the kernel random matrices considered here behave essentially like
matrices closely connected to sample covariance matrices. This is in sharp con-
trast to the low-dimensional setting where it was explained heuristically in [44],
and proved rigorously in [28], that the eigenvalues of kernel random matrices con-
verged (under certain assumptions) to those of a canonically related operator.
This suggests that kernel methods could suffer from the same problems that
affect linear statistical methods, such as Principal Component Analysis, in high-
dimensions. The practical significance of our result is that in high-dimensions, the
nonlinear methods that rely on kernel matrices may be behaving like their linear
counterparts. Our study also permits the transfer of some recent random matrix
results concerning large-dimensional sample covariance matrices to kernel random
matrices. We now discuss some possible practical settings to highlight when our
results are and are not relevant.

On kernel-PCA. An important motivation for this study was to try to understand
the properties of kernel-PCA in high-dimensions. We refer the reader to [36]
pages 48–50 for a primer on kernel-PCA, but let us say that kernel-PCA performs
a spectral decomposition of a (row and column-centralized) kernel matrix to ef-
ficiently perform a nonlinear version of PCA; instead of doing standard PCA,
the algorithm performs PCA in feature space. Our results, and in particular The-
orems 2.1 and 2.2 clearly show that in high-dimensions, when the assumptions
of the theorems are satisfied, the algorithm may essentially be performing a lin-
ear PCA despite appearances to the contrary. By contrast, in low-dimension, re-
sults such as [28] show that the intuition behind kernel-PCA is correct and that
the algorithm then performs a genuinely nonlinear PCA. Being aware of the dif-
ference between the two settings should be helpful to practitioners in that it will
inform them about the possible limitations of kernel-PCA as a nonlinear method.
For an example of applications, we refer the reader to, for instance, [44] and [36],
Chapter 10. We note also that from a slightly more “numerical analysis” stand-
point, our results basically say that the Nyström method (see [44] and references
therein) to approximate eigenfunctions of integral operators can be unreliable in
high-dimensions.

On kernel-ICA. The setting of Theorems 2.1 and 2.2 is naturally well-suited
for applications to ICA-like problems. Since we are dealing with vectors $X_i$ here,
we would be considering multidimensional ICA problems (see [2], page 33). For
instance, in the formulation of [2], kernel-ICA is solved by solving a kernel-CCA
problem, that is, a generalized eigenvalue problem with kernel matrices as input.
We refer the reader to equations (10) and (13) in [2] for more details. The results
of Theorems 2.1 and 2.2 are directly relevant here, since the matrices at stake in
these kernel-CCA problems can be approximated consistently in operator norm
by linear counterparts, and hence the solution of the kernel-CCA problem can be
consistently approximated by the solution of the problem obtained by linearizing
the kernel matrices at stake, provided the smallest singular value of the linearized
version of the kernel matrices in question stays bounded away from 0. We note
that in practice this latter requirement can be checked using our fairly detailed
knowledge of the properties of extreme eigenvalues of sample covariance matrices.
We note that in this setting, our theorems confirm the predictions in [2], page 33,
that problems might arise with the algorithm they propose in high-dimensions due
to slow decay of eigenvalues.

Geostatistics applications. Certain kernel matrices, corresponding to covariance
functions of, for instance, Gaussian processes, also appear in geostatistics and
spatial statistics in techniques such as kriging (see, e.g., [15], Chapter 3 and, for
instance, pages 106–110). For examples of kernels that correspond to covariance
functions of Gaussian processes, we refer the reader to [33], Chapter 4. Naturally,
in those sorts of applications, the dimension of the data vectors is low (at most 3),
and therefore “classical results” such as those of [28] apply whereas our results are
limited to the high-dimensional setting more often encountered in some problems
of multivariate statistics, machine learning or bioinformatics.

3.2. Limitations of standard random matrix models. In the study of spectral
distribution of large-dimensional sample covariance matrices, it has been somewhat
forcefully advocated that the study should be done under the assumptions
that the data are of the form $X_i = \Sigma^{1/2}Y_i$ where the entries of $Y_i$ have, for instance,
finite fourth moment. At first sight, this idea is appealing, as it seems to allow a
great variety of distributions and hence flexible modeling. A possible drawback
however, is the assumption that the data are linear combinations of i.i.d. random
variables or the necessary presence of independence in the model. This has how-
ever been recently addressed (see, e.g., [19]) and it has been shown that one could
go beyond models requiring independence in a lurking random vector which the
data linearly depend on.

Data analytic consequences. However, a serious limitation is still present. As
the results of Lemmas A.3, A.4 and Fact A.1 make clear, under the models for
which the limiting spectral distribution of the sample covariance matrix has been
shown to satisfy the Marčenko–Pastur equation, the norms of the data vectors are
concentrated, and the corresponding data vectors are almost orthogonal to one
another. In other words, under the “standard” ICA-like random matrix models
(used in Theorems 2.1 and 2.2), that is, the random vectors $\{X_i\}_{i=1}^n$ are i.i.d. in $\mathbb{R}^p$,
$X_i = \Sigma^{1/2}Y_i$ with $Y_i$ having i.i.d. entries with mean 0, variance 1 and $4 + \varepsilon$ absolute
moments, we have, assuming that $\{Y_i\}_{i=1}^n$ are i.i.d., $p \to \infty$, and $p/n$ and $|||\Sigma|||_2$
remain bounded, that the vectors $\{X_i\}_{i=1}^n$ have the property that for a deterministic (and
computable) sequence $c_p$,
$$\max_{1 \le i \le n} \bigl|\|X_i\|_2^2/p - c_p\bigr| \to 0$$
and
$$\max_{i \ne j} |X_i'X_j|/p \to 0.$$
Both these statements hold almost surely. Geometrically, this means that the vectors
$\{X_i/\sqrt{p}\}_{i=1}^n$ are close to a sphere and almost orthogonal to one another. These
properties also hold for the more general (and less linear) models we considered
in Theorems 2.3 and 2.4.
Hence, if one were to plot a histogram of $\{\|X_i\|_2^2/p\}_{i=1}^n$, this histogram would
look tightly concentrated around a single value, the spread of this histogram being
computable from our concentration results (Lemmas A.3, A.4 and Fact A.1).
Though the models appear to be quite rich, the geometry that we can perceive
by sampling $n$ such vectors, with $n \asymp p$, is, arguably, relatively poor. These remarks
should not be taken as aiming to discredit the interesting body of work that
has emerged out of the study of such models. Their aim is just to warn possible
users that in data analysis, a good first step would be to plot the histogram of
$\{\|X_i\|_2^2/p\}_{i=1}^n$ and check whether it is concentrated around a single value. Similarly,
one might want to plot the histogram of inner products $\{X_i'X_j/p\}_{i \ne j}$ and check
that it is concentrated around 0. If this is not the case, then insights derived from
random matrix theoretic studies would likely not be helpful in the data analysis.
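
The diagnostic suggested above can be sketched in a few lines. The helper name `geometry_diagnostics` is hypothetical, and the two simulated data sets are assumptions made for illustration: a “standard” Gaussian model with $\Sigma = \mathrm{Id}_p$, and a variant where each row is multiplied by an independent exponential radius, which is one simple way to break the concentration of the norms.

```python
import numpy as np

def geometry_diagnostics(X):
    """Return squared norms ||X_i||^2/p and the inner products X_i'X_j/p for i < j."""
    n, p = X.shape
    G = X @ X.T / p
    sq_norms = np.diag(G).copy()
    inner = G[np.triu_indices(n, k=1)]     # each pair i < j counted once
    return sq_norms, inner

rng = np.random.default_rng(4)
n, p = 200, 2000
X_std = rng.standard_normal((n, p))        # "standard" model with Sigma = Id_p
radii = rng.exponential(1.0, size=(n, 1))  # hypothetical random radii, one per row
X_scaled = radii * rng.standard_normal((n, p))

for name, X in [("standard", X_std), ("random radii", X_scaled)]:
    sq_norms, inner = geometry_diagnostics(X)
    counts, edges = np.histogram(sq_norms, bins=10)
    print(name, "| std of ||X_i||^2/p:", sq_norms.std(),
          "| max |X_i'X_j|/p:", np.abs(inner).max())
    print("  histogram counts of ||X_i||^2/p:", counts)
```

For the first data set the squared norms cluster tightly around 1, whereas for the second they are widely spread, which is the kind of discrepancy the histogram check is meant to reveal.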
We note, however, that recent random matrix work (see [13, 14, 19, 32]) has
been concerned with distributions which could, loosely speaking, be called of
“elliptical” type, though they are more general than what is usually called elliptical
distributions in Statistics. In those settings, the data is, for instance, of the form
$X_i = r_i \Sigma^{1/2} Y_i$ where $r_i$ is a real-valued random variable, independent of $Y_i$. This
allows the data vectors to not approximately live on spheres (but does not change
anything about angles between different vectors), and is a possible way to address
some of the concerns we just raised. The characterization of the limiting spectrum
gets quite a bit more involved than in the “standard” setting, that is, $r_i = 1$, and the
results show a lack of robustness to the “indirect” assumption that the data vectors
live close to a sphere.
Finally, this geometric discussion applies also to theoretical studies undertaken
under the assumptions that the $X_i$ are $N(0, \Sigma_p)$ and that the problem is
high-dimensional. It should highlight some possibly severe limitations of the normality
assumption in high-dimensions.

APPENDIX
In this appendix, we collect a few useful results that are needed in the proof
of our theorems, and whose content we thought would be more accessible if they
were separated from the main proofs.

(A) Some useful results. We have the following elementary facts.

LEMMA A.1. Suppose $Y$ is a vector with i.i.d. entries and mean 0. Call its
entries $y_i$. Suppose $E(y_i^2) = \sigma^2$ and $E(y_i^4) = \mu_4$. Then if $M$ is a deterministic
matrix,
$$E(YY'MYY') = \sigma^4(M + M') + (\mu_4 - 3\sigma^4)\,\mathrm{diag}(M) + \sigma^4\,\mathrm{trace}(M)\,\mathrm{Id}. \tag{4}$$
Further, we have $(Y'MY)^2 = \mathrm{trace}(MYY'MYY')$ and
$$E(\mathrm{trace}(MYY'MYY')) = \sigma^4\,\mathrm{trace}(M^2 + MM') + \sigma^4(\mathrm{trace}(M))^2 + (\mu_4 - 3\sigma^4)\,\mathrm{trace}(M \circ M). \tag{5}$$
Here $\mathrm{diag}(M)$ denotes the matrix consisting of the diagonal of the matrix $M$ and
0 off the diagonal. The symbol $\circ$ denotes Hadamard multiplication between matrices.

PROOF. Let us call $R = YY'MYY'$. The proof of the first part is elementary
and consists merely in writing the $(i, j)$th entry of the corresponding matrix. As a
matter of fact, we have
$$R_{i,j} = y_i y_j \sum_{k,l} y_k y_l M_{k,l} = \sum_{k,l} y_i y_j y_k y_l M_{k,l}.$$
Using the fact that entries of $Y$ are independent and have mean 0, we see that, in
the sum, the only terms that will not be 0 in expectation are those for which each
index appears at least twice. If $i \ne j$, only the terms of the form $y_i^2 y_j^2$ have this
property. So if $i \ne j$,
$$E(R_{i,j}) = E\bigl(y_i^2 y_j^2\bigr)(M_{i,j} + M_{j,i}) = \sigma^4 (M_{i,j} + M_{j,i}).$$
Let us now turn to the diagonal terms. Here again, only the terms $y_i^2 y_k^2$ matter. So
on the diagonal,
$$E(R_{i,i}) = \mu_4 M_{i,i} + \sigma^4 \sum_{j \ne i} M_{j,j} = (\mu_4 - \sigma^4) M_{i,i} + \sigma^4\,\mathrm{trace}(M).$$
We conclude that
$$E(R) = \sigma^4 (M + M') + (\mu_4 - 3\sigma^4)\,\mathrm{diag}(M) + \sigma^4\,\mathrm{trace}(M)\,\mathrm{Id}.$$
The second part of the proof follows from the first result, after we remark that,
if $D$ is a diagonal matrix and $L$ is a general matrix, $\mathrm{trace}(LD) = \mathrm{trace}(L \circ D)$, from which
we conclude that $\mathrm{trace}(M\,\mathrm{diag}(M)) = \mathrm{trace}(M \circ \mathrm{diag}(M)) = \mathrm{trace}(M \circ M)$. □
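
Formula (4) can be verified exactly in a small case. The sketch below assumes Rademacher entries ($\pm 1$ with equal probability), for which $\sigma^2 = 1$ and $\mu_4 = 1$, and computes the expectation by enumerating all $2^p$ sign vectors for a small, arbitrary test matrix.

```python
import itertools
import numpy as np

p = 4
rng = np.random.default_rng(5)
M = rng.standard_normal((p, p))            # deterministic (non-symmetric) test matrix

# Rademacher entries: sigma^2 = 1 and mu_4 = 1; all 2^p sign vectors are equally likely.
sigma2, mu4 = 1.0, 1.0
expectation = np.zeros((p, p))
for signs in itertools.product([-1.0, 1.0], repeat=p):
    y = np.array(signs)
    yy = np.outer(y, y)                    # the matrix YY'
    expectation += yy @ M @ yy             # accumulate YY' M YY'
expectation /= 2 ** p                      # exact expectation by enumeration

formula = (sigma2 ** 2 * (M + M.T)
           + (mu4 - 3 * sigma2 ** 2) * np.diag(np.diag(M))
           + sigma2 ** 2 * np.trace(M) * np.eye(p))
print("max abs difference:", np.max(np.abs(expectation - formula)))
```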

LEMMA A.2 (Concentration of quadratic forms). Suppose $Z$ is
a vector in $\mathbb{R}^p$ with i.i.d. entries of mean 0 and variance $\sigma^2$. Suppose that its
entries are bounded by $B_p$. Let $M$ be a symmetric matrix, with largest singular
value $\sigma_1(M)$. Call
$$\zeta_p = \frac{128\exp(4\pi)\sigma_1(M)B_p^2}{p}, \qquad \nu_p = \sqrt{\sigma_1(M)}.$$
Then we have, if $r/2 > \zeta_p$,
$$P\left(\left|\frac{Z'MZ}{p} - \sigma^2\frac{\mathrm{trace}(M)}{p}\right| > r\right) \le 8\exp(4\pi)\exp\left(-\frac{p(r/2 - \zeta_p)^2}{32B_p^2(1 + 2\nu_p)^2\sigma_1(M)}\right) + 8\exp(4\pi)\exp\left(-\frac{p}{32B_p^2(1 + 2\nu_p)^2\sigma_1(M)}\right). \tag{6}$$

PROOF. We can decompose, using the spectral decomposition of $M$,
$M = M_+ - M_-$ where $M_+$ and $M_-$ are positive semi-definite (with $M_- = 0$
if $M$ is itself positive semi-definite). We can do so by replacing the negative
eigenvalues of $M$ by 0 in the spectral decomposition and get $M_+$ in that way. Note
that then, the largest singular values of $M_+$ and $M_-$ are also bounded by $\sigma_1(M)$,
since $\sigma_1(M)$ is the largest eigenvalue of $M$ in absolute value, the nonzero
eigenvalues of $M_+$ are a subset of the eigenvalues of $M$, and the nonzero
eigenvalues of $M_-$ are the opposites of the negative eigenvalues of $M$. Now it is clear
that the function $F$ which associates to a vector $x$ in $\mathbb{R}^p$ the scalar
$\sqrt{x'M_+x/p} = \|M_+^{1/2}x/\sqrt{p}\|_2$ is
a convex, $\sqrt{\sigma_1(M)/p}$-Lipschitz function with respect to Euclidean norm. Calling
$m_F$ the median of $F(Z)$, we have, using Corollary 4.10 in [29],
$$P\bigl(|F(Z) - m_F| > r\bigr) \le 4\exp\bigl(-pr^2/(16B_p^2\sigma_1(M))\bigr).$$
Let us now call $\mu_F$ the mean of $F(Z)$ (it exists according to Proposition 1.8 in
[29]). Following the arguments given in the proof of this Proposition 1.8, and
spelling out the constants appearing in the last result of Proposition 1.8 in [29],
we see that
$$P\bigl(|F(Z) - \mu_F| > r\bigr) \le 4\exp(4\pi)\exp\bigl(-pr^2/(32B_p^2\sigma_1(M))\bigr).$$
(Using the notation of Proposition 1.8 in [29], we picked $\kappa_2 = 1/2$, and
$C' = \exp(\pi C^2/4)$; showing that this is a valid choice just requires one to carry out some
of the computations mentioned in the proof of that Proposition.)
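Let us call $A$, $B$, $D$ the sets
$$A = \left\{\left|\frac{Z'M_+Z}{p} - \mu_F^2\right| > r\right\},$$
$$B = \left\{\sqrt{\frac{Z'M_+Z}{p}} + \mu_F \le 1 + 2\mu_F\right\} = \left\{\sqrt{\frac{Z'M_+Z}{p}} - \mu_F \le 1\right\}$$
and
$$D = \left\{\left|\sqrt{\frac{Z'M_+Z}{p}} - \mu_F\right| > \frac{r}{1 + 2\mu_F}\right\}.$$
Of course, we have $P(A) \le P(A \cap B) + P(B^c)$. Now note that $A \cap B \subseteq D$, simply
because for positive reals, $(a - b)/(\sqrt{a} + \sqrt{b}) = \sqrt{a} - \sqrt{b}$. We conclude that
$$P(A) \le 4\exp(4\pi)\left[\exp\left(-\frac{pr^2}{32B_p^2(1 + 2\mu_F)^2\sigma_1(M)}\right) + \exp\left(-\frac{p}{32B_p^2\sigma_1(M)}\right)\right].$$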
Let us now call $\sigma^2$ the variance of each of the components of $Z$. We know,
according to Proposition 1.9 in [29], that
$$\mathrm{var}(F) = \frac{E(Z'M_+Z)}{p} - \mu_F^2 = \sigma^2\frac{\mathrm{trace}(M_+)}{p} - \mu_F^2 \le \zeta_p = \frac{128\exp(4\pi)\sigma_1(M)B_p^2}{p}.$$
Hence, we conclude that, if $r > \zeta_p$,
$$P\left(\left|\frac{Z'M_+Z}{p} - \sigma^2\frac{\mathrm{trace}(M_+)}{p}\right| > r\right) \le 4\exp(4\pi)\exp\left(-\frac{p(r - \zeta_p)^2}{32B_p^2(1 + 2\mu_F)^2\sigma_1(M)}\right) + 4\exp(4\pi)\exp\left(-\frac{p}{32B_p^2(1 + 2\mu_F)^2\sigma_1(M)}\right).$$
To get the announced result, we note that for the sum of two reals to be
greater than $r$ in absolute value, one of them needs to be greater than $r/2$ in absolute
value, and that our bounds become conservative when we replace $\mu_F$ (and its counterpart for
$M_-$) by $\nu_p$. [Note that we get conservative bounds when replacing the $\mu_F$'s by
$\max\bigl(E(\sqrt{Z'M_+Z/p}), E(\sqrt{Z'M_-Z/p})\bigr)$, and that this quantity is clearly bounded
by $\sigma\sqrt{\sigma_1(M)}$.] Hence, we have, as announced, if $r/2 > \zeta_p$,
$$P\left(\left|\frac{Z'MZ}{p} - \sigma^2\frac{\mathrm{trace}(M)}{p}\right| > r\right) \le 8\exp(4\pi)\exp\left(-\frac{p(r/2 - \zeta_p)^2}{32B_p^2(1 + 2\nu_p)^2\sigma_1(M)}\right) + 8\exp(4\pi)\exp\left(-\frac{p}{32B_p^2(1 + 2\nu_p)^2\sigma_1(M)}\right).$$
Finally, we note that the proof makes clear that the same result would hold for
different choices of $M_+$ and $M_-$, as long as $\max(\sigma_1(M_+), \sigma_1(M_-)) \le \sigma_1(M)$.


We therefore have the following useful corollary:


COROLLARY A.1. Let $Y_i$ and $Y_j$ be i.i.d. random vectors as in Lemma A.2
with variance 1. Suppose that $\Sigma$ is a positive semi-definite matrix. We have, with
$$\zeta_p = \frac{128\exp(4\pi)\sigma_1(\Sigma)B_p^2}{p}$$
and
$$\nu_p = \sqrt{\sigma_1(\Sigma)},$$
that if $r/2 > \zeta_p$ and $K = 8\exp(4\pi)$,
$$P\left(\left|\frac{Y_i'\Sigma Y_j}{p}\right| > r\right) \le K\exp\left(-\frac{p(r/2 - \zeta_p)^2}{32B_p^2(1 + 2\nu_p)^2\sigma_1(\Sigma)}\right) + K\exp\left(-\frac{p}{32B_p^2(1 + 2\nu_p)^2\sigma_1(\Sigma)}\right). \tag{7}$$

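PROOF. The proof relies on the results of Lemma A.2. Remark that, since $\Sigma$
is symmetric,
\[
Y_i' \Sigma Y_j = \frac{1}{2} (Y_i' \; Y_j')
  \begin{pmatrix} 0 & \Sigma \\ \Sigma & 0 \end{pmatrix}
  \begin{pmatrix} Y_i \\ Y_j \end{pmatrix}.
\]
Now the entries of the vector made by concatenating $Y_i$ and $Y_j$ are i.i.d. and so
we fall back into the setting of Lemma A.2. Finally, here $M_+$ and $M_-$ are known
explicitly. A possible choice is
\[
M_+ = \frac{1}{2}\begin{pmatrix} \Sigma & \Sigma \\ \Sigma & \Sigma \end{pmatrix}
\quad\text{and}\quad
M_- = \frac{1}{2}\begin{pmatrix} \Sigma & 0 \\ 0 & \Sigma \end{pmatrix}.
\]
$\nu_p$ is obtained by upper bounding the expectation of the square of $F$ in the
notation of the proof of the previous lemma for these explicit matrices. Note that
their largest singular values are both no larger than $\sigma_1(\Sigma)$, so the results of
the previous lemma apply. □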
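
The block-matrix rewriting used in this proof can be checked numerically. The following is a minimal sketch, assuming NumPy and a randomly generated positive semi-definite matrix; the names `Sigma`, `M_plus`, `M_minus` and the dimension `p = 5` are illustrative choices, not quantities fixed by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 5  # illustrative dimension (assumption)

# Hypothetical test data: a random PSD matrix and two independent vectors.
A = rng.standard_normal((p, p))
Sigma = A @ A.T
Yi = rng.standard_normal(p)
Yj = rng.standard_normal(p)

Z = np.concatenate([Yi, Yj])  # stacked vector in R^{2p}
O = np.zeros((p, p))
M = 0.5 * np.block([[O, Sigma], [Sigma, O]])
M_plus = 0.5 * np.block([[Sigma, Sigma], [Sigma, Sigma]])
M_minus = 0.5 * np.block([[Sigma, O], [O, Sigma]])

s1 = np.linalg.norm(Sigma, 2)  # largest singular value of Sigma

# Bilinear form equals the quadratic form in the stacked vector.
identity_holds = np.isclose(Yi @ Sigma @ Yj, Z @ M @ Z)
# M is the difference of the two explicit matrices.
split_holds = np.allclose(M, M_plus - M_minus)
# Both pieces are PSD (up to rounding) with operator norm at most sigma_1(Sigma).
smallest_eig = min(np.linalg.eigvalsh(M_plus).min(),
                   np.linalg.eigvalsh(M_minus).min())
psd_holds = smallest_eig > -1e-9 * s1
norms_ok = max(np.linalg.norm(M_plus, 2),
               np.linalg.norm(M_minus, 2)) <= s1 * (1 + 1e-9)

print(identity_holds, split_holds, psd_holds, norms_ok)
```

The check only illustrates the algebra; it says nothing about the concentration bound itself, whose constants depend on $B_p$.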

LEMMA A.3. Let $\{Y_i\}_{i=1}^n$ be i.i.d. random vectors in $\mathbb{R}^p$, whose entries are
i.i.d., mean 0, variance 1 and have bounded (in $p$) $m \ge 4$ absolute moments. Suppose
that $\{\Sigma_p\}$ is a sequence of positive semi-definite matrices whose operator
norms are uniformly bounded in $p$ and $n/p$ is asymptotically bounded. We have,
for any given $\varepsilon > 0$,
\[
\max_{i,j} \Bigl|\frac{Y_i' \Sigma_p Y_j}{p} - \delta_{i,j} \frac{\operatorname{trace}(\Sigma_p)}{p}\Bigr|
  \le p^{-1/2 + 2/m} (\log(p))^{(1+\varepsilon)/2} \quad \text{a.s.}
\]

PROOF. Throughout the proof, we assume without loss of generality that $m < \infty$.
Call $t = 2/m$. It is clear that with our moment assumptions, $t \le 1/2$. According
to Lemma 2.2 in [45], the maximum of the array of $\{Y_i\}_{i=1}^n$ is a.s. less than $p^t$. So
to control the maximum of the inner products of interest, it is enough to control
the same quantity when we replace $Y_i$ by $\tilde{Y}_i$ with
$\tilde{Y}_{i,l} = Y_{i,l} 1_{|Y_{i,l}| \le p^t}$. Now note
that $\tilde{Y}_i$ satisfies the boundedness assumption of Corollary A.1, but its mean is not
necessarily zero and its variance is not 1. Note however, that all the entries of $\tilde{Y}_i$
have the same mean, $\tilde{\mu}$. Since $Y_i$ has mean 0, we have
\[
|\tilde{\mu}| \le E\bigl(|Y_{1,1}| 1_{|Y_{1,1}| > p^t}\bigr) \le E(|Y_{1,1}|^m) p^{-t(m-1)} \le \mu_m p^{-2+t}.
\]
Similarly, if we call $\tilde{\sigma}^2$ the variance of the entries of $\tilde{Y}_i$, we have
\[
\tilde{\sigma}^2 = E\bigl(|Y_{1,1}|^2 1_{|Y_{1,1}| \le p^t}\bigr) - \tilde{\mu}^2
  = 1 - \bigl[E\bigl(|Y_{1,1}|^2 1_{|Y_{1,1}| > p^t}\bigr) + \tilde{\mu}^2\bigr].
\]
Hence, $0 \le 1 - \tilde{\sigma}^2$, and
\[
1 - \tilde{\sigma}^2 = E\bigl(|Y_{1,1}|^2 1_{|Y_{1,1}| > p^t}\bigr) + \tilde{\mu}^2
  \le E(|Y_{1,1}|^m) p^{-t(m-2)} + \tilde{\mu}^2
  \le \mu_m p^{-2+2t} + \mu_m^2 p^{-4+2t} = O(p^{-2+2t}).
\]


Let us call $U_i = \tilde{Y}_i - \tilde{\mu} 1_p$ and $\tilde{U}_i = U_i/\tilde{\sigma}$, where
$\tilde{\sigma}^2$ is also the variance of the entries of $U_i$.
Corollary A.1 applies to the random variables $\tilde{U}_i$ with $B_p = 2p^t$ when $p$ is large
enough. So $\zeta_p = O(p^{2t-1})$. Let us now call, for some $\varepsilon > 0$,
\[
r(p) = p^{t-1/2} (\log(p))^{(1+\varepsilon)/2}.
\]
Since, for $p$ large enough, $r(p)/2 > \zeta_p$, we can apply the conclusions of
Corollary A.1, and by plugging in the different quantities, we see that
\[
P\bigl(|\tilde{U}_i' \Sigma_p \tilde{U}_j/p| > r(p)\bigr) \le \exp(-K(\log(p))^{1+\varepsilon}),
\]

where $K$ denotes a generic constant (that may change from display to display). In
particular, $K$ is independent of $p$ and is hence trivially bounded away from 0 as $p$
grows. The bound we just obtained on $1 - \tilde{\sigma}^2$ also implies that for $p$ large enough,
$\tilde{\sigma}^2 > 1/2$, from which we conclude that for another $K$ with the same properties,
\[
P\bigl(|U_i' \Sigma_p U_j/p| > r(p)\bigr) \le \exp(-K(\log(p))^{1+\varepsilon}).
\]
In other respects, the arguments of Lemma A.2 show that, since $\tilde{\sigma}^2$ is the
variance of the entries of $U_i$,
\[
P\bigl(|U_i' \Sigma_p U_i/p - \tilde{\sigma}^2 \operatorname{trace}(\Sigma_p)/p| > r(p)\bigr) \le \exp(-K(\log(p))^{1+\varepsilon}).
\]
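Now
\[
\frac{\tilde{Y}_i' \Sigma_p \tilde{Y}_j}{p} = \frac{U_i' \Sigma_p U_j}{p}
  + \tilde{\mu} \frac{(1' \Sigma_p U_j + U_i' \Sigma_p 1)}{p}
  + \tilde{\mu}^2 \frac{1' \Sigma_p 1}{p}.
\]
Remark that $1' \Sigma_p 1 \le p \sigma_1(\Sigma_p)$, and
$|1' \Sigma_p U_j| \le \sqrt{1' \Sigma_p 1} \sqrt{U_j' \Sigma_p U_j}$. We conclude,
using the results obtained in the proof of Lemma A.2, that with probability greater
than $1 - \exp(-K(\log(p))^{1+\varepsilon})$, the middle term is smaller than
$2\sqrt{\sigma_1(\Sigma_p)}\bigl(\sqrt{\sigma_1(\Sigma_p)} + r(p)\bigr)|\tilde{\mu}|$.
As a matter of fact, $\sqrt{U_j' \Sigma_p U_j/p}$ is concentrated
around its mean, which is smaller than $\tilde{\sigma}\sqrt{\operatorname{trace}(\Sigma_p)/p}$, which is
itself smaller than $\sqrt{\sigma_1(\Sigma_p)}$. Now recall that
$|\tilde{\mu}| = O(p^{-2+t}) = o(r(p))$. We can therefore conclude that
\[
P\Bigl(\Bigl|\frac{\tilde{Y}_i' \Sigma_p \tilde{Y}_j}{p} - \delta_{i,j} \tilde{\sigma}^2 \frac{\operatorname{trace}(\Sigma_p)}{p}\Bigr| > 2r(p)\Bigr)
  \le 2\exp(-K(\log(p))^{1+\varepsilon}).
\]
Now note that $0 \le 1 - \tilde{\sigma}^2 = O(p^{-2+2t}) = o(r(p))$ since $t \le 1/2 < 3/2$. With our
assumptions, $\operatorname{trace}(\Sigma_p)/p$ remains bounded, so we have finally
\[
P\Bigl(\Bigl|\frac{\tilde{Y}_i' \Sigma_p \tilde{Y}_j}{p} - \delta_{i,j} \frac{\operatorname{trace}(\Sigma_p)}{p}\Bigr| > 3r(p)\Bigr)
  \le 2\exp(-K(\log(p))^{1+\varepsilon}).
\]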
p p 

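And therefore,
\[
P\Bigl(\max_{i,j} \Bigl|\frac{\tilde{Y}_i' \Sigma_p \tilde{Y}_j}{p} - \delta_{i,j} \frac{\operatorname{trace}(\Sigma_p)}{p}\Bigr| > 3r(p)\Bigr)
  \le 2n^2 \exp(-K(\log(p))^{1+\varepsilon}).
\]
Using the Borel–Cantelli lemma, we reach the conclusion that
\[
\max_{i,j} \Bigl|\frac{\tilde{Y}_i' \Sigma_p \tilde{Y}_j}{p} - \delta_{i,j} \frac{\operatorname{trace}(\Sigma_p)}{p}\Bigr|
  \le 3r(p) = 3p^{2/m - 1/2} (\log(p))^{(1+\varepsilon)/2} \quad \text{a.s.}
\]
Because the left-hand side is a.s. equal to
$\max_{i,j} |Y_i' \Sigma_p Y_j/p - \delta_{i,j} \operatorname{trace}(\Sigma_p)/p|$, we reach the
announced conclusion but with $r(p)$ replaced by $3r(p)$. Note that, of course, any
multiple of $r(p)$, where the constant is independent of $p$, would work in the proof.
In particular, by taking $\tilde{r}(p) = r(p)/3$, we reach the announced conclusion. □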
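
As an informal illustration of Lemma A.3, the maximal deviation can be simulated for a simple special case. This is only a sketch under assumptions that are not part of the lemma: Gaussian entries (so all moments are bounded), $\Sigma_p = \mathrm{Id}$, $n = p/2$, and the reference scale $p^{-1/2}(\log p)^{1/2}$, which corresponds formally to letting $m \to \infty$ and $\varepsilon \to 0$. The helper `max_deviation` is a hypothetical name.

```python
import numpy as np

def max_deviation(n, p, rng):
    """Return max_{i,j} |Y_i' Y_j / p - delta_ij| for Sigma_p = Id.

    With Sigma_p = Id, trace(Sigma_p)/p = 1, so the centering matrix
    is simply the n x n identity.
    """
    Y = rng.standard_normal((n, p))  # rows play the role of the Y_i
    G = Y @ Y.T / p                  # G[i, j] = Y_i' Y_j / p
    return np.abs(G - np.eye(n)).max()

rng = np.random.default_rng(0)
devs = {p: max_deviation(p // 2, p, rng) for p in (100, 400, 1600)}

for p, d in devs.items():
    scale = p ** -0.5 * np.log(p) ** 0.5  # heuristic reference scale (assumption)
    print(f"p={p:5d}  max deviation={d:.3f}  ratio to scale={d / scale:.2f}")
```

A finite simulation cannot verify an almost-sure statement; it only shows the deviation shrinking as $p$ grows at roughly the scale the lemma suggests.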

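COROLLARY A.2. Under the same assumptions as that of Lemma A.3, if we
call $X_i = \Sigma_p^{1/2} Y_i$, we also have
\[
\max_{i \ne j} \Bigl|\frac{\|X_i - X_j\|_2^2}{p} - 2\frac{\operatorname{trace}(\Sigma_p)}{p}\Bigr|
  \le p^{-1/2 + 2/m} (\log(p))^{(1+\varepsilon)/2} \quad \text{a.s.}
\]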
i =j p p

PROOF. The proof follows immediately from the results of Lemma A.3, after
we write
\[
\|X_i - X_j\|_2^2 - 2\operatorname{trace}(\Sigma_p)
  = [Y_i' \Sigma_p Y_i - \operatorname{trace}(\Sigma_p)] + [Y_j' \Sigma_p Y_j - \operatorname{trace}(\Sigma_p)] - 2 Y_i' \Sigma_p Y_j.
\]
Note that as explained in the proof of Lemma A.3, the constants in front of the
bounding sequence do not matter, so we can replace $3p^{-1/2+2/m} (\log(p))^{(1+\varepsilon)/2}$
by $p^{-1/2+2/m} (\log(p))^{(1+\varepsilon)/2}$, and the result still holds. [In other words, we are
really using Lemma A.3 with upper bound $p^{-1/2+2/m} (\log(p))^{(1+\varepsilon)/2}/3$.] □
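
The decomposition of the squared distance used in this proof is a purely algebraic identity and can be confirmed numerically. The sketch below assumes NumPy, a randomly generated positive semi-definite `Sigma`, and small illustrative sizes `n = 6`, `p = 8`; none of these choices come from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 6, 8  # illustrative sizes (assumption)

# Hypothetical PSD matrix and its symmetric square root via eigendecomposition.
A = rng.standard_normal((p, p))
Sigma = A @ A.T / p
w, V = np.linalg.eigh(Sigma)
root = (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T  # Sigma^{1/2}

Y = rng.standard_normal((n, p))  # rows are Y_i
X = Y @ root                     # rows are X_i = Sigma^{1/2} Y_i (root is symmetric)

Q = Y @ Sigma @ Y.T              # Q[i, j] = Y_i' Sigma Y_j
tr = np.trace(Sigma)

i, j = 0, 3
lhs = np.sum((X[i] - X[j]) ** 2) - 2 * tr
rhs = (Q[i, i] - tr) + (Q[j, j] - tr) - 2 * Q[i, j]
print(np.isclose(lhs, rhs))
```

Each of the three bracketed terms on the right is controlled by Lemma A.3, which is why the corollary inherits the same rate up to a constant.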

LEMMA A.4. Let $\{X_i\}_{i=1}^n$ be i.i.d. random vectors in $\mathbb{R}^p$ whose entries are
i.i.d., mean 0, having the property that for 1-Lipschitz (with respect to Euclidean
norm) functions $F$, if we denote by $m_F$ the median of $F(X_i)$,
\[
P\bigl(|F(X_i) - m_F| > r\bigr) \le C \exp(-c(p) r^2),
\]
where $C$ is independent of $p$ and $c$ is allowed to vary with $p$ (if it goes to zero,
we assume it does so like $p^{-\alpha}$, $0 \le \alpha < 1$). Call $\Sigma_p$ the covariance matrix of $X_1$.
Assume that $\sigma_1(\Sigma_p)$ remains bounded in $p$. Then, under the triangular array
construction of Theorem 2.3, we have, for any $\varepsilon > 0$,
\[
\max_{i,j} \Bigl|\frac{X_i' X_j}{p} - \delta_{i,j} \frac{\operatorname{trace}(\Sigma_p)}{p}\Bigr|
  \le (p c(p))^{-1/2} (\log(p))^{(1+\varepsilon)/2} \quad \text{a.s.}
\]

PROOF. The proof once again relies on concentration inequalities. First note
that Proposition 1.11 combined with Proposition 1.7 in [29] shows that if $X_i$
and $X_j$ are independent and satisfy concentration inequalities with concentration
function $\alpha(r)$ (with respect to Euclidean norm), then the vector
$\binom{Y_i}{Y_j}$ also satisfies concentration inequalities with concentration
function $2\alpha(r/2)$ with respect to Euclidean norm in $\mathbb{R}^{2p}$. (We note that
Proposition 1.11 is proved for the metric $\|\cdot\|_2 + \|\cdot\|_2$ on $\mathbb{R}^{2p}$, where
each Euclidean norm is a norm in $\mathbb{R}^p$, but the same proof goes through for
Euclidean norm on $\mathbb{R}^{2p}$. Another argument would be to say that the metric
$\|\cdot\|_2 + \|\cdot\|_2$ is equivalent to the norm of the full $\mathbb{R}^{2p}$, with the
constants in the inequalities being 1 and $\sqrt{2}$, simply because for $a, b > 0$,
$\sqrt{a^2 + b^2} \le a + b \le \sqrt{2}\sqrt{a^2 + b^2}$.)
Therefore, the arguments of Lemma A.2 go through without any problems with
$\Sigma_p = \mathrm{Id}$ and $B_p^2 = 4/c(p)$. So a result similar to Corollary A.1 holds and we can
apply the same ideas as in the proof of Lemma A.3 and get the announced result. □


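COROLLARY A.3. Under the assumptions of Lemma A.4, we have, for any
$\varepsilon > 0$,
\[
\max_{i \ne j} \Bigl|\frac{\|X_i - X_j\|_2^2}{p} - 2\frac{\operatorname{trace}(\Sigma_p)}{p}\Bigr|
  \le (p c(p))^{-1/2} (\log(p))^{(1+\varepsilon)/2} \quad \text{a.s.}
\]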

PROOF. The proof is an immediate consequence of Lemma A.4, along the
same lines as the proof of Corollary A.2. □

Finally, along the same lines of proof, we have the following fact.

FACT A.1. Let $\{X_i\}_{i=1}^n$ be i.i.d. random vectors in $\mathbb{R}^p$ whose entries are i.i.d.,
mean 0, having the property that for 1-Lipschitz (with respect to Euclidean norm)
functions $F$, if we denote by $m_F$ the median of $F(X_i)$,
\[
P\bigl(|F(X_i) - m_F| > t\bigr) \le C \exp(-c(p) t^b) \quad \text{for some } b > 0,
\]
where $C$ is independent of $p$ and $c$ is allowed to vary with $p$ (if it goes to zero, we
assume it does so like $p^{-\alpha}$, $0 \le \alpha < b/2$). Call $\Sigma_p$ the covariance matrix of $X_1$.
Assume that $\sigma_1(\Sigma_p)$ remains bounded in $p$. Then, we have, under the triangular
array construction of Theorem 2.3, for any $\varepsilon > 0$,
\[
\max_{i,j} \Bigl|\frac{X_i' X_j}{p} - \delta_{i,j} \frac{\operatorname{trace}(\Sigma_p)}{p}\Bigr|
  \le (p c^{2/b}(p))^{-1/2} (\log(p))^{(1+\varepsilon)/b} \quad \text{a.s.}
\]
Also, we then have
\[
\max_{i \ne j} \Bigl|\frac{\|X_i - X_j\|_2^2}{p} - 2\frac{\operatorname{trace}(\Sigma_p)}{p}\Bigr|
  \le (p c^{2/b}(p))^{-1/2} (\log(p))^{(1+\varepsilon)/b} \quad \text{a.s.}
\]
i =j p p

The proof of this last fact follows the same steps as that of Lemma A.4, with a
slight adjustment since we need to replace 2 by $b$. For a related question and more
details, we refer the reader to [19].

(B) A linear algebraic result. Finally, we finish this appendix with a linear
algebraic lemma which we need in our approximations and is of independent in-
terest.

LEMMA A.5. Suppose $M$ is a real symmetric matrix with nonnegative entries.
Suppose that $E$ is a real symmetric matrix such that $\max_{i,j} |E_{i,j}| \le \zeta$, for some
$\zeta \ge 0$. Then, if $\sigma_1(A)$ is the largest singular value of matrix $A$ and if $\circ$ represents
the Hadamard product (i.e., entrywise multiplication of two matrices), we have
\[
\sigma_1(E \circ M) \le \zeta \sigma_1(M).
\]

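PROOF. We first note that $E \circ M$ is real symmetric. Therefore,
\[
\sigma_1(E \circ M) = \lim_{k \to \infty} \bigl[\operatorname{trace}\bigl((E \circ M)^{2k}\bigr)\bigr]^{1/(2k)}.
\]
Now we claim that
\[
\bigl|\operatorname{trace}\bigl((E \circ M)^{2k}\bigr)\bigr| \le \zeta^{2k} \operatorname{trace}(M^{2k}).
\]
To see this, recall that for a $p \times p$ matrix $A$,
\[
\operatorname{trace}(A^k) = \sum_{1 \le i_1, i_2, \ldots, i_k \le p} A_{i_1,i_2} A_{i_2,i_3} \cdots A_{i_k,i_1}.
\]
Now,
\[
|A_{i_1,i_2} A_{i_2,i_3} \cdots A_{i_k,i_1}| \le |A_{i_1,i_2}| |A_{i_2,i_3}| \cdots |A_{i_k,i_1}|.
\]
When $A = E \circ M$, $A_{i,j} = E_{i,j} M_{i,j}$. Since $M_{i,j} \ge 0$, we therefore have
$|E_{i,j} M_{i,j}| \le \zeta M_{i,j}$. Hence,
\[
\bigl|\operatorname{trace}\bigl((E \circ M)^{k}\bigr)\bigr|
  \le \zeta^{k} \sum_{1 \le i_1, i_2, \ldots, i_k \le p} M_{i_1,i_2} M_{i_2,i_3} \cdots M_{i_k,i_1}
  = \zeta^{k} \operatorname{trace}(M^{k}).
\]
So
\[
\bigl[\operatorname{trace}\bigl((E \circ M)^{2k}\bigr)\bigr]^{1/(2k)} \le \zeta [\operatorname{trace}(M^{2k})]^{1/(2k)}.
\]
Taking limits as $k \to \infty$ concludes the proof. □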
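
The inequality of Lemma A.5, together with the trace bound used in its proof, can be illustrated on random matrices. This is a minimal sketch assuming NumPy; the dimension `p = 50`, the exponent `k = 3`, and the uniform distributions used to generate `M` and `E` are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
p = 50  # illustrative dimension (assumption)

# M: real symmetric with nonnegative entries.
B = rng.random((p, p))
M = (B + B.T) / 2
# E: real symmetric with entries bounded by zeta in absolute value.
C = rng.uniform(-1.0, 1.0, (p, p))
E = (C + C.T) / 2
zeta = np.abs(E).max()

# For NumPy arrays, `*` is the entrywise (Hadamard) product.
lhs = np.linalg.norm(E * M, 2)     # sigma_1(E o M)
rhs = zeta * np.linalg.norm(M, 2)  # zeta * sigma_1(M)

# Trace inequality from the proof, for one fixed k.
k = 3
tr_EM = np.trace(np.linalg.matrix_power(E * M, 2 * k))
tr_M = np.trace(np.linalg.matrix_power(M, 2 * k))

print(lhs <= rhs, abs(tr_EM) <= zeta ** (2 * k) * tr_M)
```

The nonnegativity of the entries of `M` is essential to the argument: it is what allows each product of entries of `E * M` to be bounded termwise by the corresponding product of entries of `M`.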

Acknowledgments. I would like to thank Bin Yu for stimulating my interest in
the questions considered in this paper and for interesting discussions on the topic.
I would like to thank Elizabeth Purdom for discussions about kernel analysis and
Peter Bickel for many stimulating discussions about random matrices and their
relevance in statistics. I would also like to thank an anonymous referee for useful
and constructive comments that resulted in an improved presentation of the paper.

REFERENCES
[1] ANDERSON, T. W. (2003). An Introduction to Multivariate Statistical Analysis, 3rd ed. Wiley, Hoboken, NJ. MR1990662
[2] BACH, F. R. and JORDAN, M. I. (2003). Kernel independent component analysis. J. Mach. Learn. Res. 3 1–48. MR1966051
[3] BAI, Z. D. (1999). Methodologies in spectral analysis of large-dimensional random matrices, a review. Statist. Sinica 9 611–677. MR1711663
[4] BAI, Z. D., MIAO, B. Q. and PAN, G. M. (2007). On asymptotics of eigenvectors of large sample covariance matrix. Ann. Probab. 35 1532–1572. MR2330979
[5] BAI, Z. D. and SILVERSTEIN, J. W. (1998). No eigenvalues outside the support of the limiting spectral distribution of large-dimensional sample covariance matrices. Ann. Probab. 26 316–345. MR1617051
[6] BAI, Z. D. and SILVERSTEIN, J. W. (1999). Exact separation of eigenvalues of large-dimensional sample covariance matrices. Ann. Probab. 27 1536–1555. MR1733159
[7] BAIK, J., BEN AROUS, G. and PÉCHÉ, S. (2005). Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. Ann. Probab. 33 1643–1697. MR2165575
[8] BAIK, J. and SILVERSTEIN, J. (2006). Eigenvalues of large sample covariance matrices of spiked population models. J. Multivariate Anal. 97 1382–1408. MR2279680
[9] BELKIN, M. and NIYOGI, P. (2009). Convergence of Laplacian eigenmaps. Preprint.
[10] BHATIA, R. (1997). Matrix Analysis. Graduate Texts in Mathematics 169. Springer, New York. MR1477662
[11] BOGOMOLNY, E., BOHIGAS, O. and SCHMIT, C. (2003). Spectral properties of distance matrices. J. Phys. A 36 3595–3616. MR1986436
[12] BORDENAVE, C. (2008). Eigenvalues of Euclidean random matrices. Random Structures Algorithms 33 515–532. Available at [Link] MR2462254
[13] BOUTET DE MONVEL, A., KHORUNZHY, A. and VASILCHUK, V. (1996). Limiting eigenvalue distribution of random matrices with correlated entries. Markov Process. Related Fields 2 607–636. MR1431189
[14] BURDA, Z., JURKIEWICZ, J. and WACŁAW, B. (2005). Spectral moments of correlated Wishart matrices. Phys. Rev. E 71 026111.
[15] CRESSIE, N. A. C. (1993). Statistics for Spatial Data. Wiley, New York. MR1239641
[16] EL KAROUI, N. (2003). On the largest eigenvalue of Wishart matrices with identity covariance when n, p and p/n → ∞. Available at arXiv:[Link]/0309355.
[17] EL KAROUI, N. (2007). Tracy–Widom limit for the largest eigenvalue of a large class of complex sample covariance matrices. Ann. Probab. 35 663–714. MR2308592

[18] EL KAROUI, N. (2008). Operator norm consistent estimation of large-dimensional sparse covariance matrices. Ann. Statist. 36 2717–2756. MR2485011
[19] EL KAROUI, N. (2009). Concentration of measure and spectra of random matrices: With applications to correlation matrices, elliptical distributions and beyond. Ann. Appl. Probab. 19 2362–2405.
[20] FORRESTER, P. J. (1993). The spectrum edge of random matrix ensembles. Nuclear Phys. B 402 709–728. MR1236195
[21] GEMAN, S. (1980). A limit theorem for the norm of random matrices. Ann. Probab. 8 252–261. MR0566592
[22] GERONIMO, J. S. and HILL, T. P. (2003). Necessary and sufficient condition that the limit of Stieltjes transforms is a Stieltjes transform. J. Approx. Theory 121 54–60. MR1962995
[23] GOHBERG, I., GOLDBERG, S. and KRUPNIK, N. (2000). Traces and Determinants of Linear Operators. Operator Theory: Advances and Applications 116. Birkhäuser, Basel. MR1744872
[24] HORN, R. A. and JOHNSON, C. R. (1990). Matrix Analysis. Cambridge Univ. Press, Cambridge. MR1084815
[25] HORN, R. A. and JOHNSON, C. R. (1994). Topics in Matrix Analysis. Cambridge Univ. Press, Cambridge. MR1288752
[26] JOHANSSON, K. (2000). Shape fluctuations and random matrices. Comm. Math. Phys. 209 437–476. MR1737991
[27] JOHNSTONE, I. (2001). On the distribution of the largest eigenvalue in principal component analysis. Ann. Statist. 29 295–327. MR1863961
[28] KOLTCHINSKII, V. and GINÉ, E. (2000). Random matrix approximation of spectra of integral operators. Bernoulli 6 113–167. MR1781185
[29] LEDOUX, M. (2001). The Concentration of Measure Phenomenon. Mathematical Surveys and Monographs 89. Amer. Math. Soc., Providence, RI. MR1849347
[30] MARČENKO, V. A. and PASTUR, L. A. (1967). Distribution of eigenvalues in certain sets of random matrices. Mat. Sb. (N.S.) 72 507–536. MR0208649
[31] PAUL, D. (2007). Asymptotics of sample eigenstructure for a large-dimensional spiked covariance model. Statist. Sinica 17 1617–1642. MR2399865
[32] PAUL, D. and SILVERSTEIN, J. (2009). No eigenvalues outside the support of the limiting empirical spectral distribution of a separable covariance matrix. J. Multivariate Anal. 100 37–57. MR2460475
[33] RASMUSSEN, C. E. and WILLIAMS, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA. MR2514435
[34] SCHECHTMAN, G. and ZINN, J. (2000). Concentration on the $\ell_p^n$ ball. In Geometric Aspects of Functional Analysis. Lecture Notes in Mathematics 1745 245–256. Springer, Berlin. MR1796723
[35] SCHÖLKOPF, B. and SMOLA, A. J. (2002). Learning with Kernels. MIT Press, Cambridge, MA. MR1949972
[36] SCHÖLKOPF, B., TSUDA, K. and VERT, J. P. (2004). Kernel Methods in Computational Biology. MIT Press, Cambridge, MA.
[37] SILVERSTEIN, J. W. (1995). Strong convergence of the empirical distribution of eigenvalues of large-dimensional random matrices. J. Multivariate Anal. 55 331–339. MR1370408
[38] TRACY, C. and WIDOM, H. (1994). Level-spacing distribution and the Airy kernel. Comm. Math. Phys. 159 151–174. MR1257246
[39] TRACY, C. and WIDOM, H. (1996). On orthogonal and symplectic matrix ensembles. Comm. Math. Phys. 177 727–754. MR1385083
[40] TRACY, C. and WIDOM, H. (1998). Correlation functions, cluster functions and spacing distributions for random matrices. J. Stat. Phys. 92 809–835. MR1657844

[41] VOICULESCU, D. (2000). Lectures on free probability theory. In Lectures on Probability Theory and Statistics (Saint-Flour, 1998). Lecture Notes in Mathematics 1738 279–349. Springer, Berlin. MR1775641
[42] WACHTER, K. W. (1978). The strong limits of random matrix spectra for sample matrices of independent elements. Ann. Probab. 6 1–18. MR0467894
[43] WIGNER, E. (1955). Characteristic vectors of bordered matrices with infinite dimensions. Ann. of Math. (2) 62 548–564. MR0077805
[44] WILLIAMS, C. and SEEGER, M. (2000). The effect of the input density distribution on kernel-based classifiers. International Conference on Machine Learning 17 1159–1166.
[45] YIN, Y. Q., BAI, Z. D. and KRISHNAIAH, P. R. (1988). On the limit of the largest eigenvalue of the large-dimensional sample covariance matrix. Probab. Theory Related Fields 78 509–521. MR0950344

DEPARTMENT OF STATISTICS
UNIVERSITY OF CALIFORNIA, BERKELEY
367 EVANS HALL
BERKELEY, CALIFORNIA 94720-3860
USA
E-MAIL: nkaroui@[Link]
