Spectrum of Kernel Random Matrices
BY NOUREDDINE EL KAROUI
University of California, Berkeley
We place ourselves in the setting of high-dimensional statistical inference
where the number of variables p in a dataset of interest is of the same order
of magnitude as the number of observations n.
We consider the spectrum of certain kernel random matrices, in particular
n × n matrices whose (i, j)th entry is f(Xi′Xj/p) or f(‖Xi − Xj‖₂²/p),
where p is the dimension of the data, and the Xi are independent data vectors.
Here f is assumed to be a locally smooth function.
The study is motivated by questions arising in statistics and computer sci-
ence where these matrices are used to perform, among other things, nonlinear
versions of principal component analysis. Surprisingly, we show that in high-
dimensions, and for the models we analyze, the problem becomes essentially
linear—which is at odds with heuristics sometimes used to justify the usage
of these methods. The analysis also highlights certain peculiarities of models
widely studied in random matrix theory and raises some questions about their
relevance as tools to model high-dimensional data encountered in practice.
the data gets large and the sample size is of the same order of magnitude as that
dimension.
So far in Statistics, this line of work has been concerned mostly with the proper-
ties of sample covariance matrices. In a seminal paper, Marčenko and Pastur [30]
showed a result that, from a statistical standpoint, may be interpreted as saying,
roughly, that the histogram of the eigenvalues of a sample (i.e., random)
covariance matrix is asymptotically a deterministic nonlinear deformation of the
histogram of the eigenvalues of the population covariance matrix. Remarkably,
they managed to characterize this deformation for fairly general population
covariances. Their result was shown in great generality and introduced new tools
to the field including one that has become ubiquitous, the Stieltjes transform of
a distribution. In its best-known form, their result says that when the population
covariance is identity, and hence all the population eigenvalues are equal to 1, in
the limit the sample eigenvalues are split and, if p ≤ n, they are spread between
[(1 − √(p/n))², (1 + √(p/n))²] according to a fully explicit density now known as
the density of the Marčenko–Pastur law. Their result was later re-discovered inde-
pendently in [42] (under slightly weaker conditions) and generalized to the case
of nondiagonal covariance matrices in [37] under some particular distributional
assumptions which we discuss later in the paper.
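As a quick numerical illustration of the Marčenko–Pastur prediction just described (a sketch, not code from the paper; the sample sizes, seed and variable names are arbitrary choices of ours), one can simulate a sample covariance matrix with identity population covariance and check that the eigenvalues spread over the predicted interval:

```python
# Illustrative check: with Sigma_p = Id, all population eigenvalues equal 1,
# yet the sample eigenvalues spread over [(1 - sqrt(p/n))^2, (1 + sqrt(p/n))^2].
import numpy as np

rng = np.random.default_rng(0)
n, p = 2000, 500                          # p/n = 0.25
X = rng.standard_normal((n, p))           # i.i.d. N(0,1) entries
sample_eigs = np.linalg.eigvalsh(X.T @ X / n)

rho = p / n
a, b = (1 - np.sqrt(rho)) ** 2, (1 + np.sqrt(rho)) ** 2

# Fraction of sample eigenvalues inside the (slightly padded) predicted support.
frac_inside = np.mean((sample_eigs > a - 0.1) & (sample_eigs < b + 0.1))
mean_eig = sample_eigs.mean()             # should stay near 1 (= trace/p)
```

The average eigenvalue stays near 1 (it equals trace(X′X/n)/p), while the individual eigenvalues are pushed apart, exactly the deformation discussed above.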
On the other hand, recent developments have been concerned with fine proper-
ties of the largest eigenvalue of random matrices which became amenable to analy-
sis after mathematical breakthroughs which happened in the 1990s (see [38, 39]
and [40]). Classical statistical work on joint distribution of eigenvalues of sample
covariance matrices (see [1] for a good reference) then became usable for analy-
sis in high-dimensions. In particular, in the case of Gaussian distributions, with Id
covariance, it was shown in [27] and [16] that the largest eigenvalue of the sample
covariance matrix is Tracy–Widom distributed. More recent progress [17] man-
aged to carry out the analysis for essentially general population covariance. On
the other hand, models for which the population covariance has a few separated
eigenvalues have also been of interest (see, for instance, [8] and [31]). Besides the
particulars of the different types of fluctuations that can be encountered (Tracy–
Widom, Gaussian or other), researchers have been able to precisely localize these
largest eigenvalues. One interesting aspect of those results is the fact that in the
high-dimensional setting of interest to us, the largest eigenvalues are always posi-
tively biased, with the bias being sometimes large. (We also note that in the case of
i.i.d. data—which naturally is less interesting in statistics—results on the localiza-
tion of the largest eigenvalue have been available for quite some time now, after the
works [21] and [45], to cite a few.) This is naturally in sharp contrast to classical
results of multivariate analysis which show √n-consistency of all sample eigenval-
ues, though the possibility of bias is a simple consequence of Jensen's inequality.
On the other hand, there has been much less theoretical work on kernel random
matrices. By this term, we mean matrices with (i, j ) entry of the form,
Mi,j = k(Xi , Xj ),
These insights have also been derived through more heuristic but nonetheless en-
lightening arguments in, for instance, [44]. Further, more precise fluctuation re-
sults are also given in [28]. We also note interesting work on Laplacian eigenmaps
(see, e.g., [9]) where, among other things, results have been obtained showing
convergence of eigenvalues and eigenvectors of certain Laplacian random matrices
(which are quite closely connected to kernel random matrices) computed from data
sampled from a manifold, to corresponding quantities for the Laplace–Beltrami
operator on the manifold.
These results are in turn used in the literature to explain the behavior of non-
linear versions of standard procedures of multivariate statistics, such as Principal
Component Analysis (PCA), Canonical Correlation Analysis (CCA) or Indepen-
dent Component Analysis (ICA). We refer the reader to [36] for an introduction
the probability measure which puts mass 1/n at each of its eigenvalues. In other
words, if we call Fn this probability measure, we have
dFn(x) = (1/n) Σ_{i=1}^{n} δ_{li}(x).
Note that the histogram of eigenvalues represents an integrated version of this
measure.
For random matrices, this measure Fn is naturally a random measure. A key
result in the area of covariance matrices is that if we observe i.i.d. data vectors Xi,
with Xi = Σp^{1/2} Yi, where Σp is a positive semi-definite matrix and Yi is a vec-
tor with i.i.d. entries, under weak moment conditions on Yi and assuming that the
spectral distribution of Σp has a limit (in the sense of weak convergence of distri-
butions), Fn converges to a nonrandom measure which we call F.
We call the models Xi = Σp^{1/2} Yi the "standard" models of random matrix the-
ory because most results have been derived under these assumptions. In particular,
various results [5, 6, 21] show, among many other things, that when the entries of
the vector Y have 4 (absolute) moments, the largest eigenvalues of the sample co-
variance matrix X′X/n, where Xi′ now occupies the ith row of the n × p matrix X,
stay close to the endpoint of the support of F.
A natural question is therefore to try to characterize F . Except in particular
situations, it is difficult to do so explicitly. However, it is possible to characterize
a certain transformation of F . The tool of choice in this context is the Stieltjes
transform of a distribution. It is a function defined on C⁺ by the formula, if we
call StF the Stieltjes transform of F,

StF(z) = ∫ dF(λ)/(λ − z),   Im[z] > 0.
In particular for empirical spectral distributions, we see that, if Fn is the spectral
distribution of the matrix Mn ,
StFn(z) = (1/n) Σ_{i=1}^{n} 1/(li − z) = (1/n) trace((Mn − z Id)⁻¹).
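The equality between the eigenvalue form and the resolvent form of StFn is easy to check numerically; the sketch below (illustrative only, with an arbitrary symmetric matrix and an arbitrary point z in C⁺) does exactly that:

```python
# Check the two expressions for the Stieltjes transform of the empirical
# spectral distribution of a real symmetric matrix M_n.
import numpy as np

rng = np.random.default_rng(1)
n = 50
A = rng.standard_normal((n, n))
M = (A + A.T) / 2                      # real symmetric matrix
eigs = np.linalg.eigvalsh(M)

z = 0.3 + 1.2j                         # any point with Im[z] > 0
st_from_eigs = np.mean(1.0 / (eigs - z))
st_from_resolvent = np.trace(np.linalg.inv(M - z * np.eye(n))) / n
```

Both expressions agree to machine precision, and the imaginary part of the transform is positive, as it must be for a Stieltjes transform evaluated in C⁺.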
The importance of the Stieltjes transform in the context of random matrix theory
stems from two facts: on the one hand, it is connected fairly explicitly to the matri-
ces that are being analyzed; on the other hand, pointwise convergence of Stieltjes
transforms implies weak convergence of distributions, if a certain mass preservation
condition is satisfied. This is how a number of bulk results are proved. For
a clear and self-contained introduction to the connection between Stieltjes trans-
forms and weak convergence of probability measures, we refer the reader to [22].
The result of [30], later generalized by [37] for standard random matrix models
with nondiagonal covariance, and more recently by [19], away from those stan-
dard models, is a functional characterization of the limit F . If one calls wn (z)
the Stieltjes transform of the empirical spectral distribution of XX′/n, wn(z) con-
verges pointwise (and almost surely after [37]) to a nonrandom w(z) which, as a
function, is a Stieltjes transform. Moreover, w, the Stieltjes transform of F, satis-
fies the equation, if p/n → ρ, ρ > 0,

−1/w(z) = z − ρ ∫ λ dH(λ)/(1 + λw),
where H is the limiting spectral distribution of Σp, assuming that such a distribu-
tion exists. We note that [37] proved the result under a second moment condition
on the entries of Yi.
From this result, [30] derived that in the case where Σp = Id, and hence dH =
δ1, the empirical spectral distribution has a limit whose density is, if ρ ≤ 1,

fρ(x) = √((b − x)(x − a)) / (2πρx),

where a = (1 − ρ^{1/2})² and b = (1 + ρ^{1/2})². The difference between the population
spectral distribution (a point mass at 1, of mass 1) and the limit of the empirical
spectral distribution is quite striking.
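As a sanity check on the density fρ just written (illustrative only; the grid size and the value ρ = 0.25 are arbitrary), one can verify numerically that it integrates to 1 on [a, b] and that its mean equals 1, the mean population eigenvalue:

```python
# Numerical check that the Marcenko-Pastur density integrates to 1 on [a, b]
# when rho <= 1, and that its mean matches the population mean eigenvalue, 1.
import numpy as np

rho = 0.25
a, b = (1 - np.sqrt(rho)) ** 2, (1 + np.sqrt(rho)) ** 2

x = np.linspace(a, b, 200001)
f = np.sqrt((b - x) * (x - a)) / (2 * np.pi * rho * x)

dx = np.diff(x)
mass = np.sum((f[1:] + f[:-1]) * dx) / 2                      # trapezoid rule
mean = np.sum((x[1:] * f[1:] + x[:-1] * f[:-1]) * dx) / 2
```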
2.1.2. Largest eigenvalues results. Another line of work has been focused on
the behavior of extreme eigenvalues of sample covariance matrices. In particular,
[21] showed, under some moment conditions, that when Σp = Idp, l1(X′X/n) →
(1 + √(p/n))² almost surely. In other words, the largest eigenvalue stays close to
the endpoint of the limiting spectral distribution of X′X/n. This result was later
generalized in [45], and was shown to be true under the assumption of finite fourth
moment only, for data with mean 0. In recent years, fluctuation results have been
obtained for this largest eigenvalue which is of practical interest in Principal Com-
ponents Analysis (PCA). Under Gaussian assumptions, [16] and [27] (see also [20]
and [26]) showed that the fluctuations of the largest eigenvalue are Tracy–Widom
distributed. For the general covariance case, similar results, as well as localization
information, were recently obtained in [17]. We note that the localization infor-
mation (i.e., a formula) that was discovered in this latter paper was shown to hold
for a wide variety of standard random matrix models through appeal to [5]. We
refer the interested reader to Fact 2 in [17] for more information. Interesting work
has also been done on so-called “spiked” models where a few population eigen-
values are separated from the bulk of them. In particular, in the case where all
population eigenvalues are equal, except for one that is significantly larger (see [7]
for the discovery of an interesting phase transition), [31] showed, in the Gaussian
case, inconsistency of the largest sample eigenvalue, as well as the fact that the
angle between the population and sample principal eigenvectors is bounded away
from 0. Paul [31] also obtained fluctuation information about the largest empirical
eigenvalue. Finally, we note that the same inconsistency of eigenvalue result was
also obtained in [8], beyond the Gaussian case.
2.1.3. Notation. Let us now define some notation and add some clarifications.
We denote by A′ the transpose of A. The matrices we will be working with
all have real entries. We remind the reader that if A and B are two rectangular
matrices, AB and BA have the same eigenvalues, except, possibly, for a certain
number of zeros. We will make repeated use of this fact, for example, for matrices
like X′X and XX′. In the case where A and B are both square, AB and BA have
exactly the same eigenvalues.
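This fact is easy to check numerically; the sketch below (with arbitrary small dimensions of our choosing) compares the spectra of X′X and XX′ when X is n × p with n > p:

```python
# XX' (n x n) has the same eigenvalues as X'X (p x p), plus n - p extra zeros.
import numpy as np

rng = np.random.default_rng(2)
n, p = 6, 4
X = rng.standard_normal((n, p))

e_small = np.linalg.eigvalsh(X.T @ X)   # p eigenvalues, ascending
e_big = np.linalg.eigvalsh(X @ X.T)     # n eigenvalues, ascending

gap = np.max(np.abs(e_big[-p:] - e_small))      # top p eigenvalues coincide
n_extra_zeros = np.sum(np.abs(e_big) < 1e-10)   # the rest are (numerically) 0
```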
We will also need various norms on matrices. We will use the so-called operator
norm, which we denote by |A|₂ and which corresponds to the largest singular value
of A, that is, √(maxᵢ li(A′A)). We occasionally denote the largest singular value
of A by σ1(A). Clearly, for positive semi-definite matrices, the largest singular
value is equal to the largest eigenvalue. Finally, we will sometimes need to use the
Frobenius (or Hilbert–Schmidt) norm of a matrix A. We denote it by ‖A‖F. By
definition, since we are working with matrices with real entries, it is simply

‖A‖F² = Σ_{i,j} A_{i,j}².
Further, we use ◦ to denote the Hadamard (i.e., entrywise) product of two matri-
ces. We denote by μm the mth moment of a random variable. Note that by a slight
abuse of notation, we might also use the same notation to refer to the mth absolute
moment (i.e., E|X|m ) of a random variable, but if there is any ambiguity, we will
naturally make precise which definition we are using.
Finally, in the discussion of standard random matrix models that follows, there
will be arrays of random variables and a.s. convergence. We work with random
variables defined on a common probability space. To each ω corresponds an
infinite-dimensional array of numbers. Unless otherwise noted, the n × p matrices
we will use in what follows are the “upper-left” corner of this array.
We now turn to the study of kernel random matrices. We will show that we
can approximate them by matrices that are closely connected to sample covariance
matrices in high-dimensions and, therefore, that a number of the results we just
reviewed also apply to them.
(Note that the statements we just made assume that both M and K are symmetric,
which is the case here.)
The strategy for the proof is the following. According to the results of Lem-
ma A.3, the matrix with entries Xi′Xj/p has "small" entries off the diagonal, whereas
on the diagonal, the entries are essentially constant and equal to trace(Σp)/p. Hence, it
is natural to try to use the δ-method (i.e., do a Taylor expansion) entry by entry.
In contrast to standard problems in Statistics, the fact that we have to perform n²
of those Taylor expansions means that the second order term is not negligible a
priori. The proof shows that this approach can be carried out rigorously, and that,
perhaps surprisingly, the second order term is not too complicated to approximate
in operator norm. It is also shown that the third order term plays essentially no
role.
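The two scales underlying this strategy are easy to see in a small simulation (a sketch under the simplest assumption Σ = Id, not code from the paper; sizes and names are ours): off the diagonal the inner products Xi′Xj/p are of order p^{−1/2}, while the diagonal entries concentrate around trace(Σ)/p = 1.

```python
# Illustration of the entrywise scales: off-diagonal entries of XX'/p are
# small, diagonal entries are nearly constant (here trace(Id)/p = 1).
import numpy as np

rng = np.random.default_rng(3)
n = p = 500
X = rng.standard_normal((n, p))        # Sigma = Id
G = X @ X.T / p

off = G[~np.eye(n, dtype=bool)]        # off-diagonal entries X_i'X_j/p
max_off = np.max(np.abs(off))          # O(sqrt(log n / p)), small
max_diag_dev = np.max(np.abs(np.diag(G) - 1.0))
```

The entries being individually small is not enough to conclude anything about eigenvalues, which is exactly why the n² Taylor expansions must be handled with care.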
Before we start the proof, we want to mention that we will drop the index p
in Σp below to avoid cumbersome notation. Let us also note, more technically,
that an important step of the proof is to show that, when the Yi's have enough
moments, they can be treated, without much error in spectral results, as bounded
random variables, with the bound depending on the number of moments, n and p.
This then enables us to use concentration results for convex Lipschitz functions
of independent bounded random variables at various important points of the proof
and also in Lemma A.3, whose results underlie much of the approach taken here.
We will therefore carry out the analysis on this Y array. Note that most of the
results we will rely on require vectors of i.i.d. entries with mean 0. Of course, Yi,j
has in general a mean different from 0. In other words, if we call μ = E(Yi,j ),
we need to show that we do not lose anything in operator norm by replacing Yi ’s
by Ui ’s with Ui = Yi − μ1. Note that, as seen in Lemma A.3, by plugging in
t = 1/2 − δ in the notation of this lemma, which corresponds to the 4 + ε moment
assumption here, we have
|μ| ≤ p^{−3/2−δ}.
Now let us call S the matrix XX′/p, except that its diagonal is replaced by
zeros. From [45], and the fact that n/p stays bounded, we know that |XX′/p|₂ ≤
σ1(Σ)|YY′|₂/p stays bounded. Using Lemma A.3, we see that the diagonal of
XX′/p stays bounded a.s. in operator norm. Therefore, |S|₂ is bounded a.s.
Now, as in the proof of Lemma A.3, we have, a.s.,

Si,j = Ui′ΣUj/p + μ(1′ΣUj)/p + μ(Ui′Σ1)/p + μ²(1′Σ1)/p + Ri,j.

Note that this equality is true a.s. only because it involves replacing Y by its
truncated version. The
proof of Lemma A.3 shows that
|Ri,j| ≤ 2μσ1(Σ)^{1/2}(σ1(Σ)^{1/2} + p^{−δ/2}) + μ²σ1(Σ)^{1/2}   a.s.
We conclude that, for some constant C,

‖R‖F² ≤ Cn²μ² ≤ Cn²p^{−3−2δ} → 0 a.s.

Therefore |R|₂ → 0 a.s. In other words, if we call SU the matrix with (i, j) entry
Ui′ΣUj/p off the diagonal and 0 on the diagonal,

|S − SU|₂ → 0 a.s.
Now it is a standard result on Hadamard products (see for instance, [10], Prob-
lem I.6.13, or [25], Theorems 5.5.1 and 5.5.15) that for two matrices A and B,
|A ∘ B|₂ ≤ |A|₂|B|₂. Since the Hadamard product is commutative, we have

S ∘ S − SU ∘ SU = (S + SU) ∘ (S − SU).

We conclude that

|S ∘ S − SU ∘ SU|₂ ≤ |S − SU|₂(|S|₂ + |SU|₂) → 0 a.s.,
since |S − SU |2 → 0 a.s., and |S|2 and hence |SU |2 stay bounded, a.s.
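Both ingredients of this step, the operator-norm bound for Hadamard products and the entrywise difference-of-squares identity, can be checked numerically (an illustrative sketch with arbitrary symmetric matrices, not the matrices of the proof):

```python
# Check |A o B|_2 <= |A|_2 |B|_2 and A o A - B o B = (A + B) o (A - B),
# where 'o' is the entrywise (Hadamard) product, '*' in NumPy.
import numpy as np

rng = np.random.default_rng(4)
n = 40
A = rng.standard_normal((n, n))
A = (A + A.T) / 2
B = rng.standard_normal((n, n))
B = (B + B.T) / 2

def op(M):
    return np.linalg.norm(M, 2)        # largest singular value

lhs = op(A * B)
rhs = op(A) * op(B)
identity_gap = op(A * A - B * B - (A + B) * (A - B))
```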
The conclusion of this study is that to approximate the second order term in op-
erator norm, it is enough to work with SU and not S, and hence, very importantly,
with bounded random variables with zero mean. Further, the proof of Lemma A.3
makes clear that σU², the variance of the Ui,j's, goes to 1, the variance of the Yi,j's,
very fast. So if we can approximate the matrix with (i, j) entry Ui′ΣUj/(pσU²)
consistently in operator norm by a matrix whose operator norm is bounded, this
same matrix will constitute an operator norm approximation of the matrix with
entries Ui′ΣUj/p.
In other words, we can assume that, when working with matrices of dimension
n × p, the random variables we will be working with have variance 1 without loss
of generality and that they have mean 0 and are bounded by Bp , Bp depending on
p and going to infinity.
• Control of the second order term. We now focus on approximating in operator
norm the matrix with (i, j)th entry

(f″(0)/2)(Xi′Xj/p)² 1_{i≠j}.
As we just explained, we assume from now on in all the work concerning the
second order term that the vectors Yi have mean 0, and that their entries have
variance 1 and are bounded by Bp = p^{1/2−δ}. This is because we just saw that
replacing Yi by Ui/σU would not change (a.s. and asymptotically) the operator
norm of the matrix to be studied. We note that to make clear that the truncation
depends on p, we might have wanted to use the notation Yi^{(p)}, but since there will
be no ambiguity in the proof, we chose to use the less cumbersome notation Yi.
The control of the second order term turns out to be the most delicate part of the
analysis, and the only place where we need the assumption that Xi = Σ^{1/2}Yi. Let
us call W the matrix with entries

Wi,j = (Xi′Xj)²/p², if i ≠ j,
Wi,j = 0, if i = j.

Note that when i ≠ j,

E(Wi,j) = E(trace(Xi′Xj Xj′Xi))/p² = E(trace(Xj Xj′Xi Xi′))/p²
= trace(Σ²)/p².
Because we assume that trace(Σ)/p has a finite limit, and n/p stays bounded away
from 0, we see that the matrix E(W) has a largest eigenvalue that, in general, does
not go to 0. Note also that under our assumptions, E(Wi,j) = O(1/p). Our aim is
to show that W can be approximated in operator norm by this constant matrix. So
let us consider the matrix W̃ with entries

W̃i,j = (Xi′Xj)²/p² − trace(Σ²)/p², if i ≠ j,
W̃i,j = 0, if i = j.
Simple computations show that the expected Frobenius norm squared of this ma-
trix does not go to 0. Hence more subtle arguments are needed to control its op-
erator norm. We will show that E(trace(W̃⁴)) goes to zero, which implies that
E(|W̃|₂⁴) goes to zero because W̃ is real symmetric.
The elements contributing to trace(W̃⁴) are generally of the form W̃i,j W̃j,k ×
W̃k,l W̃l,i. We are going to study these terms according to how many indices are
equal to each other.
p⁴ E(W̃i,j W̃j,k | Yi, Yk) = Yi′ E(Mj Yi Yk′ Mj | Yi, Yk) Yk
+ trace(Σ²Mi) trace(Σ²Mk).
If we now use Lemma A.1 and, in particular, (4), page 40, we finally have, recall-
ing that here σ² = 1,
In the case of interest here, we have M = Yi Yk′, and the expectation is to be
understood conditionally on Yi, Yk, but because we have assumed that the indices
are different and the Ym's are independent, we can do the computation of the con-
ditional expectation as if M were deterministic. Therefore, we have
Naturally, we have E(W̃i,j W̃j,k | Yi, Yk) = E(W̃k,l W̃l,i | Yi, Yk), and therefore, by us-
ing properties of conditional expectation, since all the indices are different,
T1 = E((Yi′Σ²Yk)⁴),
T2 = E([Yi′ diag(Yi Yk′) Yk]²)
and
Using (4) in Lemma A.1, we therefore have, using the fact that Σ²Yi Yi′Σ² is sym-
metric,

E((Yi′Σ²Yk)⁴ | Yi)
= Yi′Σ²[2Σ²Yi Yi′Σ² + (μ4 − 3) diag(Σ²Yi Yi′Σ²)
+ trace(Σ²Yi Yi′Σ²) Id]Σ²Yi
= 3(Yi′Σ⁴Yi)² + (μ4 − 3)Yi′Σ² diag(Σ²Yi Yi′Σ²) Σ²Yi.
Finally, we have, using (5) in Lemma A.1,

E((Yi′Σ²Yk)⁴) = 3[2 trace(Σ⁸) + (trace(Σ⁴))² + (μ4 − 3) trace(Σ⁴ ∘ Σ⁴)]
+ (μ4 − 3)E(Yi′Σ² diag(Σ²Yi Yi′Σ²) Σ²Yi).
Now we have

Yi′Σ² diag(Σ²Yi Yi′Σ²) Σ²Yi = trace(Σ²Yi Yi′Σ² diag(Σ²Yi Yi′Σ²))
= trace(Σ²Yi Yi′Σ² ∘ Σ²Yi Yi′Σ²).
Calling vi = Σ²Yi, we note that the matrix whose trace is taken is (vi vi′) ∘ (vi vi′) =
(vi ∘ vi)(vi ∘ vi)′ (see [24], page 458, or [25], page 307). Hence,

Yi′Σ² diag(Σ²Yi Yi′Σ²) Σ²Yi = ‖vi ∘ vi‖₂².
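The rank-one Hadamard identity used here, (vv′) ∘ (vv′) = (v ∘ v)(v ∘ v)′, and the resulting trace formula are easy to verify numerically (an illustrative sketch with an arbitrary vector):

```python
# Check (v v') o (v v') = (v o v)(v o v)' and trace((v v') o (v v')) = ||v o v||_2^2.
import numpy as np

rng = np.random.default_rng(5)
v = rng.standard_normal(7)

outer = np.outer(v, v)
lhs = outer * outer                    # Hadamard square of the rank-one matrix
rhs = np.outer(v * v, v * v)           # rank-one matrix built from v o v

trace_gap = abs(np.trace(lhs) - np.sum((v * v) ** 2))
```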
Now let us call mk the kth column of the matrix Σ². Using the fact that Σ² is
symmetric, we see that the kth entry of the vector vi is vi(k) = mk′Yi. So vi(k)⁴ =
Yi′mk mk′Yi Yi′mk mk′Yi. Calling Mk = mk mk′, we see, using (5) in Lemma A.1, that
have only n³ terms in the sum, this extra contribution is asymptotically zero. Now,
we clearly have E(W̃i,j² W̃i,k² | Yi) = [E(W̃i,j² | Yi)]², by conditional independence of
the two terms. The computation of E(W̃i,j² | Yi) is similar to the ones we have made
(iii) Terms involving two different indices: i ≠ j. The last terms we have to
focus on to control E(trace(W̃⁴)) are of the form W̃i,j⁴. Note that we have n² terms
like this. Since, by convexity, (a + b)⁴ ≤ 8(a⁴ + b⁴), we see that it is enough to
understand the contribution of Wi,j⁴ to show that Σ_{i≠j} E(W̃i,j⁴) tends to zero. Now,
let us call for a moment v = ΣYi and u = Yj. The quantity of interest to us is
basically of the form E((u′v)⁸). Let us do computations conditional on v. We note
that since the entries of u are independent and have mean 0, in the expansion of
(u′v)⁸, the only terms that will contribute a nonzero quantity to the expectation
have entries of u raised to a power greater than or equal to 2. We can decompose the sum
representing E((u′v)⁸ | v) into subterms, according to what powers of the terms are
involved. There are 7 types of terms: (2, 2, 2, 2) (i.e., all terms are raised to the power 2),
(3, 3, 2) (i.e., two terms are raised to the power 3, and one to the power 2), (4, 2, 2),
(4, 4), (5, 3), (6, 2) and (8). For instance, the subterm corresponding to (2, 2, 2, 2)
is, before taking expectations,

Σ_{i1≠i2≠i3≠i4} u²_{i1} u²_{i2} u²_{i3} u²_{i4} (v_{i1} v_{i2} v_{i3} v_{i4})².
Note that we just saw that E(‖Yi‖₂⁸) = O(p⁴) in our context. Similarly, the term
(3, 3, 2) will contribute

μ₃²σ² Σ_{i1≠i2≠i3} v³_{i1} v³_{i2} v²_{i3},

which is at most μ₃²σ²‖v‖₂⁸.
The same analysis can be repeated for all the other terms, which are all found to be
less than ‖v‖₂⁸ times the moments of u involved. Because we have assumed that
our original random variables had 4 + ε absolute moments, the moments of order
less than 4 cause no problem. The moments of order higher than 4, say 4 + k, can
be bounded by μ₄Bp^k. Consequently, we see that
E(W̃i,j⁴) = E(E(W̃i,j⁴ | Yi)) ≤ C Bp⁴ E(‖Yi‖₂⁸/p⁸) = O(Bp⁴/p⁴) = O(p^{−(2+4δ)}).
Since we have n² such terms, we see that

Σ_{i≠j} E(W̃i,j⁴) → 0 as p → ∞.
(iv) Second order term: combining all the elements. We have therefore estab-
lished control of the second order term and seen that the largest singular value of
W̃ goes to 0 in probability, using Chebyshev's inequality. Note that we have also
shown that the operator norm of W is bounded in probability and that

|W − (trace(Σ²)/p²)(11′ − Id)|₂ → 0 in probability.
• Control of the third order term. We note that the third order term is of the form
f⁽³⁾(ξi,j)(Xi′Xj/p)Wi,j. According to Lemma A.5, if M is a real symmetric matrix
with nonnegative entries, and E is a symmetric matrix such that maxi,j |Ei,j| = ζ,
then

σ1(E ∘ M) ≤ ζ σ1(M).
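The bound quoted from Lemma A.5 can be checked numerically (an illustrative sketch; M below is an arbitrary symmetric matrix with nonnegative entries, not a matrix from the proof):

```python
# Check sigma_1(E o M) <= max_{i,j}|E_{i,j}| * sigma_1(M) for a symmetric M
# with nonnegative entries and a symmetric E.
import numpy as np

rng = np.random.default_rng(6)
n = 30
M = np.abs(rng.standard_normal((n, n)))
M = (M + M.T) / 2                      # symmetric, nonnegative entries
E = rng.standard_normal((n, n))
E = (E + E.T) / 2                      # symmetric

zeta = np.max(np.abs(E))
lhs = np.linalg.norm(E * M, 2)         # sigma_1 of the Hadamard product
rhs = zeta * np.linalg.norm(M, 2)
```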
Note that W is a real symmetric matrix with nonnegative entries. So all we have
to show to prove that the third order term goes to zero in operator norm is that
max_{i≠j} |Xi′Xj/p| goes to 0, because we have just established that |W|₂ remains
bounded in probability. We are going to make use of Lemma A.3, page 43 in the
Appendix. In our setting, we have Bp = p^{1/2−δ}, or 2/m = 1/2 − δ. The lemma
implies, for instance, that

max_{i≠j} |Xi′Xj/p| ≤ p^{−δ} log(p) a.s.
So max_{i≠j} |Xi′Xj/p| → 0 a.s. Note that this implies that max_{i≠j} |ξi,j| → 0 a.s.
Since we have assumed that f⁽³⁾ exists and is continuous, and hence bounded in a
neighborhood of 0, we conclude that

max_{i,j} |f⁽³⁾(ξi,j)Xi′Xj/p| = o(p^{−δ/2}) a.s.
If we call E the matrix with entry Ei,j = f⁽³⁾(ξi,j)Xi′Xj/p off the diagonal and 0
on the diagonal, we see that E satisfies the conditions put forth in our discussion
earlier in this section, and we conclude that

|E ∘ W|₂ ≤ max_{i,j} |Ei,j| |W|₂ = o(p^{−δ/2}) a.s.
Hence, the operator norm of the third order term goes to 0 almost surely. [To maybe
clarify our arguments, let us repeat that we analyzed the second order term by
replacing the Yi ’s by, in the notation of the truncation and centralization discus-
sion, Ui . Let us call WU = SU ◦ SU , again using notation introduced in the trunca-
tion and centralization discussion. As we saw, |W − WU |2 → 0 a.s., so showing,
as we did, that |WU |2 remains bounded (a.s.) implies that |W |2 does too, and
this is the only thing we need in our argument showing the control of the third
order term.]
(B) Control of the diagonal term. The proof here is divided into two parts.
First, we show that the error term coming from the first order expansion of the di-
agonal is easily controlled. Then we show that the terms added when replacing the
off-diagonal matrix by XX′/p + trace(Σ²)/p² 11′ can also be controlled. Recall
the notation τ = trace(Σ)/p.
• Errors induced by diagonal approximation. Note that Lemma A.3 guarantees
that for all i, |ξi,i − τ| ≤ p^{−δ/2} a.s. Because we have assumed that f′ is continuous
and hence bounded in a neighborhood of τ, we conclude that f′(ξi,i) is uniformly
bounded in p. Now Lemma A.3 also guarantees that

max_i |‖Xi‖₂²/p − τ| ≤ p^{−δ} a.s.

Hence, the diagonal matrix with entries f(‖Xi‖₂²/p) can be approximated consis-
tently in operator norm by f(τ) Id a.s.
• Errors induced by off-diagonal approximation. When we replace the off-
diagonal matrix by f′(0)XX′/p + [f(0) + f″(0) trace(Σ²)/2p²]11′, we add a
diagonal matrix with (i, i) entry f(0) + f′(0)‖Xi‖₂²/p + f″(0) trace(Σ²)/2p²,
which we need to subtract eventually. We note that 0 ≤ trace(Σ²)/p² ≤ σ1²(Σ)/
p → 0 when σ1(Σ) remains bounded in p. So this term does not create any prob-
lem. Now, we just saw that the diagonal matrix with entries ‖Xi‖₂²/p can be consis-
tently approximated in operator norm by (trace(Σ)/p) Id. So the diagonal matrix
with (i, i) entry f(0) + f′(0)‖Xi‖₂²/p + f″(0) trace(Σ²)/2p² can be approxi-
mated consistently in operator norm by (f(0) + f′(0) trace(Σ)/p) Id a.s.
This finishes the proof.
2.3. Kernel random matrices of the type f(‖Xi − Xj‖₂²/p). As is to be ex-
pected, the properties of such matrices can be deduced from the study of inner
product kernel matrices, with a little bit of extra work. We need to slightly modify
the distributional assumptions under which we work, and consider the case where
we have 5 + ε absolute moments for the entries of Yi. We also need to assume that
f is regular in the neighborhood of different points. Otherwise, the assumptions
are the same as those of Theorem 2.1. We have the following theorem:
Let us call ψ the vector with ith entry ψi = ‖Xi‖₂²/p − trace(Σ)/p. Suppose
that the assumptions of Theorem 2.1 hold, but that conditions (e) and (f) are re-
placed by:
(e′) The entries of Yi, a p-dimensional random vector, are i.i.d. Also, denoting
by Yi(k) the kth entry of Yi, we assume that E(Yi(k)) = 0, var(Yi(k)) = 1
and E(|Yi(k)|^{5+ε}) < ∞ for some ε > 0. (We say that Yi has 5 + ε absolute
moments.)
(f′) f is C³ in a neighborhood of τ.
Then M can be approximated consistently in operator norm (and in probability)
by the matrix K, defined by

K = f(τ)11′ + f′(τ)[1ψ′ + ψ1′ − 2XX′/p]
+ (f″(τ)/2)[1(ψ ∘ ψ)′ + (ψ ∘ ψ)1′ + 2ψψ′ + 4 trace(Σ²)/p² 11′] + υp Id,

υp = f(0) + τf′(τ) − f(τ).

In other words,

|M − K|₂ → 0 in probability.
PROOF. Note that here the diagonal is just f(0) Id and it will cause no trouble.
The work, therefore, focuses on the off-diagonal matrix. In what follows, we call
τ = 2 trace(Σ)/p. Let us define

Ai,j = ‖Xi‖₂²/p + ‖Xj‖₂²/p − τ
and

Si,j = Xi′Xj/p.
With this notation, we have, off the diagonal, that is, when i ≠ j, by a Taylor
expansion,

Mi,j = f(τ) + [Ai,j − 2Si,j]f′(τ) + (1/2)[Ai,j − 2Si,j]²f″(τ)
+ (1/6)f⁽³⁾(ξi,j)[Ai,j − 2Si,j]³.
We note that the matrix A with entries Ai,j is a rank 2 matrix. As a matter of fact,
it can be written, if ψ is the vector with entries ψi = ‖Xi‖₂²/p − τ/2, as A = 1ψ′ + ψ1′.
Using the well-known identity (see, e.g., [23], Chapter 1, Theorem 3.2),

det(Id + uv′ + vu′) = det [ 1 + u′v   ‖u‖₂²
                            ‖v‖₂²    1 + u′v ].
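This determinant identity can be verified numerically (an illustrative sketch with arbitrary vectors u and v); note that the 2 × 2 determinant equals (1 + u′v)² − ‖u‖₂²‖v‖₂²:

```python
# Check det(Id + u v' + v u') against the 2 x 2 determinant above.
import numpy as np

rng = np.random.default_rng(7)
p = 9
u = rng.standard_normal(p)
v = rng.standard_normal(p)

lhs = np.linalg.det(np.eye(p) + np.outer(u, v) + np.outer(v, u))
rhs = np.linalg.det(np.array([[1 + u @ v, u @ u],
                              [v @ v, 1 + u @ v]]))
closed_form = (1 + u @ v) ** 2 - (u @ u) * (v @ v)
```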
Now, let us focus on the term Ai,j Si,j. Let us call H the matrix with

Hi,j = (1 − δi,j)Ai,j Si,j.

Let us denote by S̃ the matrix with off-diagonal entries Si,j and 0 on the diago-
nal. If we call S = XX′/p, we have

S̃ = S − diag(S).
Now note that Ai,j = ψi + ψj. Therefore, we have, if diag(ψ) is the diagonal
matrix with (i, i) entry ψi,

H = S̃ diag(ψ) + diag(ψ)S̃.

We just saw that under our assumptions, max_i |ψi| → 0 a.s. Because for any n × n
matrices L1, L2, |L1L2|₂ ≤ |L1|₂|L2|₂, we see that to show that |H|₂ goes
to 0, we just need to show that |S̃|₂ remains bounded.
Now we clearly have |S|₂ ≤ |Σ|₂|YY′/p|₂. We know from [45] that
|YY′/p|₂ → σ²(1 + √(n/p))² a.s. Under our assumptions on n and p, this is
bounded. Now

diag(S) = diag(ψ) + (trace(Σ)/p) Id,

so our concentration results once again imply that |diag(S)|₂ ≤ trace(Σ)/p + η
a.s., for any η > 0. Because | · |₂ is subadditive, we finally conclude that

|S̃|₂ is bounded a.s.
Therefore,

|H|₂ → 0 a.s.

Putting together all these results, we see that we have shown that

|T − [A ∘ A − diag(A ∘ A)] − 4(trace(Σ²)/p²)(11′ − Id)|₂ → 0 in probability.
• Control of the third order term. The third order term is the matrix L with 0
on the diagonal and off-diagonal entries

Li,j = (f⁽³⁾(ξi,j)/6)[(‖Xi − Xj‖₂² − 2 trace(Σ))/p]³.

We can write L = E ∘ T, where T was the matrix investigated in the control of the
second order term. On the other hand, E is the matrix with entries

Ei,j = (1 − δi,j)(f⁽³⁾(ξi,j)/6)(‖Xi − Xj‖₂² − 2 trace(Σ))/p.
max_{i≠j} |Ei,j| ≤ K log(p) p^{−1/10−δ}.
We are now in position to apply the Hadamard product argument (see Lem-
ma A.5) we used for the control of the third order term in the proof of Theo-
rem 2.1. To show that the third order term tends in operator norm to 0, we hence
just need to show that |T |2 remains small compared to the bound we just gave on
maxi,j |Ei,j |. Of course, this is equivalent to showing that the matrix that approxi-
mates T has the same property in operator norm.
Clearly, because σ1(Σ) stays bounded, trace(Σ²)/p stays bounded and so does
|(trace(Σ²)/p²)(11′ − Id)|₂. So we just have to focus on A ∘ A − diag(A ∘ A).
Recall that Ai,i = 2(‖Xi‖₂²/p − trace(Σ)/p), and so Ai,i = 2ψi. We have al-
ready seen that our concentration arguments imply that max_i |ψi| → 0 a.s. So
|diag(A ∘ A)|₂ = 4 max_i ψi² goes to 0 a.s. Now,
A = 1ψ′ + ψ1′,

and hence, elementary Hadamard product computations [relying on ab′ ∘ uv′ =
(a ∘ u)(b ∘ v)′] give

A ∘ A = 1(ψ ∘ ψ)′ + 2ψψ′ + (ψ ∘ ψ)1′.
Therefore,

|A ∘ A|₂ ≤ 2√n ‖ψ ∘ ψ‖₂ + 2‖ψ‖₂².
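The Hadamard computation for the rank 2 matrix A can be checked numerically (an illustrative sketch with an arbitrary vector ψ):

```python
# Check A o A = 1 (psi o psi)' + 2 psi psi' + (psi o psi) 1' for A = 1 psi' + psi 1'.
import numpy as np

rng = np.random.default_rng(8)
n = 6
psi = rng.standard_normal(n)
one = np.ones(n)

A = np.outer(one, psi) + np.outer(psi, one)    # A_{i,j} = psi_i + psi_j
lhs = A * A                                    # Hadamard square
rhs = (np.outer(one, psi * psi) + 2 * np.outer(psi, psi)
       + np.outer(psi * psi, one))
```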
Using Lemma A.1, and in particular equation (5), we see that

E(ψi²) = 2σ⁴ trace(Σ²)/p² + (μ4 − 3σ⁴) trace(Σ ∘ Σ)/p²,
and therefore E(‖ψ‖₂²) remains bounded. On the other hand, using Lemma 2.7
of [5], we see that if we have 5 + ε absolute moments,

E(ψi⁴) ≤ C[(μ4 trace(Σ²))²/p⁴ + μ_{5+ε} Bp^{3−ε} trace(Σ⁴)/p⁴].
Now recall that we can take Bp = p^{2/5−δ}. Therefore nE(‖ψ ∘ ψ‖₂²) is, at most, of
order Bp^{3−ε}/p. We conclude that

P(|A ∘ A|₂ > log(p)√(Bp^{3−ε}/p)) → 0.

Note that this implies that

P(|T|₂ > log(p)√(Bp^{3−ε}/p)) → 0.
Now, note that the third order term is of the form E ◦ T. Because we have
assumed that we have 5 + ε absolute moments, we have already seen that our
concentration results imply that

max_{i≠j} |E_{i,j}| = O(log(p) Bp p^{−1/2}) = O(log(p) p^{−1/10−δ}) a.s.
Using the fact that T has positive entries and therefore (see Lemma A.5) |E ◦ T|₂ ≤ max_{i,j} |E_{i,j}| |T|₂, we conclude that with high probability,

|E ◦ T|₂ = O((log(p))² √(Bp^{5−ε}/p²)) = O((log(p))² p^{−δ}), where δ > 0.
Hence,
|E ◦ T |2 → 0 in probability.
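A quick numerical illustration of the Hadamard product bound used here (|E ◦ T|₂ ≤ max_{i,j}|E_{i,j}| |T|₂ when T is symmetric with nonnegative entries); the matrices below are random and purely illustrative, not the matrices of the proof.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 60
T = np.abs(rng.standard_normal((n, n))); T = (T + T.T) / 2   # symmetric, T_ij >= 0
E = rng.uniform(-1, 1, size=(n, n)); E = (E + E.T) / 2       # bounded perturbation factors

zeta = np.abs(E).max()
assert np.linalg.norm(E * T, 2) <= zeta * np.linalg.norm(T, 2) + 1e-10
```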
do not need to add it to the correction in the diagonal of the matrix approximating
our kernel matrix.
An interpretation of the proofs of Theorems 2.1 and 2.2 is that they rely on a
local “multiscale” approximation of the original matrix (i.e., the terms used in the
entry-wise approximation are all of different orders of magnitude, or at different
“scales”). However, globally, that is, when looking at eigenvalues of the matrices
and not just at each of their entries, there is a bit of mixing between the scales,
which creates the difficulties we had to deal with to control the second order term.
Hence, in this case, if M is our kernel matrix with entries exp(−‖Xi − Xj‖²₂),
we have

|M − Id|₂ ≤ n exp(−p^{1/2+2/m+δ}) a.s.,
and the upper bound tends to zero extremely fast.
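This near-identity behavior of the unscaled Gaussian kernel is easy to observe in simulation; the sketch below uses hypothetical standard Gaussian data (n = 100, p = 400), a special case rather than the general models of the theorems.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 400
X = rng.standard_normal((n, p))

sq = (X**2).sum(axis=1)
D2 = sq[:, None] + sq[None, :] - 2 * X @ X.T     # ||X_i - X_j||_2^2
np.fill_diagonal(D2, 0.0)                        # exact zeros (guard against rounding)

M = np.exp(-D2)            # unscaled kernel: off-diagonal entries are exp(-Theta(p))
err = np.linalg.norm(M - np.eye(n), 2)
assert err < 1e-12
```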
2.4. More general models. In this subsection, we consider more general mod-
els than the ones considered above. In particular, we will here focus on data models
for which the vectors Xi satisfy a so-called dimension-free concentration inequal-
ity. As was shown in [19], under these conditions, the Marčenko–Pastur equation
holds (as well as generalized versions of it). Note that these models are more gen-
eral than the one considered above (the proofs in the Appendix illustrate why the
standard random matrix models can be considered as subcases of this more general
class of matrices) and can describe various interesting objects like vectors with certain log-concave distributions or vectors sampled in a uniform manner from certain convex bodies.
We note that the term f(0)11′ does not affect the limiting spectral distribution
of M since finite rank perturbations do not have any effect on limiting spectral
distributions (see, e.g., [3], Lemma 2.2). Therefore, it could be removed from the
approximating matrix, but since it will clearly be present in numerical work and
simulations, we chose to leave it in our approximation. We also note that the limiting distribution of XX′/p under these assumptions has been obtained in [19].
Here are a few examples of models satisfying the distributional assumptions
stated above. (Unless otherwise noted, b = 2 in all these examples.)
Before we prove the lemma, we note that our assumptions imply that the lim-
iting spectral distribution of Kn is a probability distribution. Therefore, to obtain
the results of the lemma, we just need to show pointwise convergence of Stieltjes
transforms and then rely on the results of [22], and in particular Corollary 1 there.
PROOF OF LEMMA 2.1. We call St_{Kn} and St_{Mn} the Stieltjes transforms of the
spectral distributions of these two matrices. Suppose z = u + iv, with v > 0. Let us call l_i(Mn)
the ith largest eigenvalue of Mn.
Proof of statement 1. We first focus on the Frobenius norm part of the lemma.
We have

|St_{Kn}(z) − St_{Mn}(z)| = |(1/n) Σ_{i=1}^n [1/(l_i(Kn) − z) − 1/(l_i(Mn) − z)]| ≤ (1/n) Σ_{i=1}^n |l_i(Mn) − l_i(Kn)|/v².
Now, by Hölder's inequality,

Σ_{i=1}^n |l_i(Mn) − l_i(Kn)| ≤ √n (Σ_{i=1}^n |l_i(Mn) − l_i(Kn)|²)^{1/2}.
Using Lidskii's theorem [i.e., the fact that, since Mn and Kn are Hermitian, the
vector with entries l_i(Mn) − l_i(Kn) is majorized by the vector of eigenvalues of Mn − Kn], with,
in the notation of [10], Theorem III.4.4, Φ(x) = x², we have

Σ_{i=1}^n |l_i(Mn) − l_i(Kn)|² ≤ Σ_{i=1}^n l_i²(Mn − Kn) = ‖Mn − Kn‖²_F.
We conclude that

|St_{Kn}(z) − St_{Mn}(z)| ≤ ‖Mn − Kn‖_F/(√n v²),

since |l_i(Kn) − z| ≥ |Im[l_i(Kn) − z]| = v, and therefore 1/|l_i(Kn) − z| ≤ 1/v.
Under the assumptions of the lemma, we therefore have |St_{Kn}(z) − St_{Mn}(z)| → 0.
Proof of statement 2. Let us now turn to the operator norm part of the lemma.
By the same computations as above, we have, using Weyl's inequality,

|St_{Kn}(z) − St_{Mn}(z)| = |(1/n) Σ_{i=1}^n [1/(l_i(Kn) − z) − 1/(l_i(Mn) − z)]|
≤ (1/n) Σ_{i=1}^n |l_i(Mn) − l_i(Kn)|/v²
≤ |Mn − Kn|₂/v².
Hence if |Mn − Kn|₂ → 0, it is clear that the two Stieltjes transforms are asymptotically equal, and the conclusion follows.
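Both bounds in Lemma 2.1 can be checked numerically; a minimal sketch, with illustrative symmetric test matrices and z = 0.3 + i (choices not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 80
K = rng.standard_normal((n, n)); K = (K + K.T) / 2
S = rng.standard_normal((n, n)); S = (S + S.T) / 2
M = K + 0.01 * S

z = 0.3 + 1.0j                       # z = u + iv with v = 1
v = z.imag

def stieltjes(A, z):
    # Stieltjes transform of the spectral distribution of a Hermitian matrix
    return np.mean(1.0 / (np.linalg.eigvalsh(A) - z))

gap = abs(stieltjes(K, z) - stieltjes(M, z))
frob_bound = np.linalg.norm(M - K, "fro") / (np.sqrt(n) * v**2)
op_bound = np.linalg.norm(M - K, 2) / v**2
assert gap <= frob_bound + 1e-12     # statement 1 (Frobenius norm bound)
assert gap <= op_bound + 1e-12       # statement 2 (operator norm bound)
```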
PROOF OF THEOREM 2.3. For the weaker statement required for the proof
of Theorem 2.3, we will show that in the δ-method we need to keep only the
first term of the expansion, as long as f has a second derivative that is bounded
in a neighborhood of 0 and a first derivative that is bounded in a neighborhood
of trace(Σ)/p. In other words, we will split the problem into two parts: off the
diagonal, we write

f(Xi′Xj/p) = f(0) + f′(0) Xi′Xj/p + (f″(ξ_{i,j})/2)(Xi′Xj/p)², if i ≠ j;

on the diagonal, we write

f(Xi′Xi/p) = f(trace(Σ)/p) + f′(ξ_{i,i})(Xi′Xi/p − trace(Σ)/p).
• Control of the off-diagonal error matrix. Here we focus on the matrix W with
(i, j) entry

W_{i,j} = 1_{i≠j} (f″(ξ_{i,j})/2)(Xi′Xj/p)².

The strategy is going to be to control the Frobenius norm of the matrix W̃ with entries

W̃_{i,j} = (Xi′Xj/p)², if i ≠ j,
W̃_{i,j} = 0, if i = j.
According to Lemma 2.1, it is enough for our needs to show that the Frobenius
norm of this matrix is o(√n) a.s. to have the result we wish. Hence, the result will
be shown if we can, for instance, show that

max_{i,j} W̃_{i,j} ≤ p^{−(1/2+ε)}(log(p))^{1+δ} a.s., for some δ > 0.

Now, ‖W̃‖_F ≤ n max_{i,j} |W̃_{i,j}|, so we conclude that in this situation, with our as-
sumption that n ≍ p,

‖W̃‖_F = o(√n) a.s.
Now let us focus on

W_{i,j} = (f″(ξ_{i,j})/2) W̃_{i,j},

where ξ_{i,j} is between 0 and Xi′Xj/p. We just saw that with very high probability,
this latter quantity was less (in absolute value) than p^{−(1/4+ε/2)}(log(p))^{2/b}, if c(p) ≥
p^{−(1/2−ε)b/2}. Therefore, if f″ is bounded by K in a neighborhood of 0, we have,
with very high probability,

‖W‖_F ≤ K‖W̃‖_F = o(√n).
• Control of the diagonal matrix. We first note that when we replace the off-
diagonal matrix by f(0)11′ + f′(0)XX′/p, we add to the diagonal certain terms
that we need to subtract eventually.
Hence, our strategy here is to show that we can approximate (in operator norm)
the diagonal matrix D with entries

D_{i,i} = f(trace(Σ)/p) + f′(ξ_{i,i})(Xi′Xi/p − trace(Σ)/p) − f′(0) Xi′Xi/p − f(0)
by υp Idp. To do so, we just have to show that the diagonal error matrix Z, with
entries

Z_{i,i} = (f′(ξ_{i,i}) − f′(0))(Xi′Xi/p − trace(Σ)/p),

goes to zero in operator norm.
As seen in Lemma A.4 or Fact A.1, if c(p) ≥ p^{−(1/2−ε)b/2}, with very high
probability,

max_i |Xi′Xi/p − trace(Σ)/p| ≤ p^{−(1/4+ε/2)}(log(p))^{2/b}.
We finally treat the case of kernel matrices computed from Euclidean norms, in
this more general distributional setting.
We note once again that the term f(τ)11′ does not affect the limiting spectral
distribution of M. But we keep it for the same reasons as before.
PROOF OF THEOREM 2.4. Note that the diagonal term is simply f(0) Id, so
this term does not create any problem.
The rest of the proof is similar to that of Theorem 2.3. In particular, the control of the
Frobenius norm of the second order term is done in the same way, by controlling
the maximum of the off-diagonal term, using Corollary A.3 and Fact A.1 (and
hence Lemma A.4).
Therefore, we only need to understand the first order term, in other words, the
matrix with 0 on the diagonal and off-diagonal entry

R_{i,j} = ‖Xi − Xj‖²₂/p − τ
= (‖Xi‖²₂/p − trace(Σ)/p) + (‖Xj‖²₂/p − trace(Σ)/p) − 2 Xi′Xj/p.
As in the proof of Theorem 2.2, let us call ψ the vector with ith entry ψ_i = ‖Xi‖²₂/p − trace(Σ)/p. Clearly,

R_{i,j} = (1 − δ_{i,j})(1ψ′ + ψ1′ − 2XX′/p)_{i,j}.
Simple computations show that

R − 2 (trace(Σ)/p) Id = 1ψ′ + ψ1′ − 2XX′/p.
Now, obviously, 1ψ′ + ψ1′ is a matrix of rank at most 2. Hence, R has the same
limiting spectral distribution as

2 (trace(Σ)/p) Id − 2XX′/p,

since finite rank perturbations do not affect limiting spectral distributions (see, for
instance, [3], Lemma 2.2). This completes the proof.
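The rank inequality invoked here ([3], Lemma 2.2: a rank-r perturbation moves the empirical spectral CDF by at most r/n in sup norm) can be illustrated as follows, under the hypothetical choice Σ = Id, n = 200, p = 400 (an illustrative special case):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 200, 400
X = rng.standard_normal((p, n))                  # columns X_i, Sigma = Id
psi = (X**2).sum(axis=0) / p - 1.0               # psi_i = ||X_i||^2/p - trace(Sigma)/p
one = np.ones(n)

B = 2 * (np.eye(n) - X.T @ X / p)                # 2(trace(Sigma)/p Id - X'X/p)
P = np.outer(one, psi) + np.outer(psi, one)      # rank <= 2 perturbation
R = B + P                                        # the matrix R above (zero diagonal)

eB = np.sort(np.linalg.eigvalsh(B))
eR = np.sort(np.linalg.eigvalsh(R))
grid = np.linspace(min(eB[0], eR[0]), max(eB[-1], eR[-1]), 500)
FB = np.searchsorted(eB, grid, side="right") / n
FR = np.searchsorted(eR, grid, side="right") / n
# rank inequality: sup_t |F_B(t) - F_R(t)| <= rank(P)/n = 2/n
assert np.abs(FB - FR).max() <= 2 / n + 1e-9
```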
2.5. Some consequences of the theorems. In practice, it is often the case that
slight variants of kernel random matrices are used. In particular, it is customary to
center the matrices, that is, to transform M so that its row sums, or column sums, or
both are 0. Note that these operations correspond to right and/or left multiplication
by the matrix H = Idn − 11′/n.
In these situations, our results still apply; the following fact makes it clear.
A nice consequence of the first point is that the recent hard work on localizing
the largest eigenvalues of sample covariance matrices (see [8, 17] and [31]) can be
transferred to kernel random matrices and used to give some information about the
localization of the largest eigenvalues of H MH , for instance. In the case of the
results of [17], Fact 2 and the arguments of [19], Section 2.3.4, show that it gives
exact localization information. In other words, we can characterize the a.s. limit
of the largest eigenvalue of H MH (or H M or MH ) fairly explicitly, provided
Fact 2 in [17] applies. Finally, let us mention the obvious fact that since, for two square
matrices A and B, AB and BA have the same eigenvalues, we see that H MH has
the same eigenvalues as MH and H M because H² = H.
PROOF OF FACT 2.1. The proofs are simple. First note that H is positive
semi-definite and |H|₂ = 1. Using the submultiplicativity of |·|₂, we see that

|H^a M H^b − H^a K H^b|₂ ≤ |M − K|₂ |H^a|₂ |H^b|₂ = |M − K|₂.
This shows the first point of the fact.
The second point follows from the fact that H a MH b is a finite rank perturbation
of M. Hence, using Lemma 2.2 in [3], we see that these two matrices have the same
limiting spectral distribution, and since, by assumption, K has the same limiting
spectral distribution as M, we have the result of the second point.
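A small numerical check of the ingredients of Fact 2.1 (H² = H, |H|₂ = 1, the operator norm contraction, and the equality of the eigenvalues of HMH and MH); the matrices are random and illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50
H = np.eye(n) - np.ones((n, n)) / n              # centering matrix
assert np.allclose(H @ H, H)                     # H^2 = H
assert abs(np.linalg.norm(H, 2) - 1.0) < 1e-10   # |H|_2 = 1

M = rng.standard_normal((n, n)); M = (M + M.T) / 2
D = rng.standard_normal((n, n)); D = (D + D.T) / 2
K = M + 0.05 * D
assert np.linalg.norm(H @ M @ H - H @ K @ H, 2) <= np.linalg.norm(M - K, 2) + 1e-10

# HMH and MH have the same eigenvalues because H^2 = H
e1 = np.sort(np.linalg.eigvalsh(H @ M @ H))
e2 = np.sort(np.linalg.eigvals(M @ H).real)
assert np.allclose(e1, e2, atol=1e-6)
```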
matrix case. [It is so because Li,i is an average of almost constant (and equal)
quantities, so with high-probability Li,i cannot deviate from this constant value,
for that would require that at least one of the components of the average deviate
from the constant value in question.] Hence, there exists γp such that |DL −
γp Idn |2 tends to 0 almost surely. We also recall that the diagonal of the matrix
M can be consistently approximated in operator norm by a (finite) multiple of
the identity matrix, so the diagonal of M/n can be consistently approximated in
operator norm by 0. Therefore, |L + M/n − DL |2 tends to 0 almost surely, and
therefore, |L + M/n − γp Idn |2 tends to zero almost surely. In other words, L can
be consistently approximated in operator norm by γp Idn −M/n. Consequently,
when we can, as in Theorems 2.1 and 2.2, consistently approximate M in operator
norm by a linearized version, K, of M, then L can be consistently approximated in
operator norm by γp Idn − K/n, and we can deduce spectral properties of L from
those of γp Idn − K/n. When we know only about the limiting spectral distribution
of M, as in Theorems 2.3 and 2.4, the operator norm consistent approximation of
L by γp Idn −M/n carries over to give us information about the limiting spectral
distribution of L since the effect of γp Idn is just to “shift” the eigenvalues by γp .
We note that getting information about the eigenvectors of L would require finer
work on the properties of the matrix DL since approximating it by a multiple of
the identity does not give us any information about its eigenvectors.
3. Conclusions. The main result of this paper is that under various technical
assumptions, in high dimensions, kernel random matrices [i.e., n × n matrices with
(i, j)th entry f(Xi′Xj/p) or f(‖Xi − Xj‖²₂/p), where {Xi}_{i=1}^n are i.i.d. random
vectors in R^p with p → ∞ and p ≍ n], which are often used to create nonlinear
versions of standard statistical methods, essentially behave like covariance ma-
trices, that is, linearly, a result that is in sharp contrast with the low-dimensional
situation where p is assumed to be fixed, and where it is known that, under some
regularity conditions, spectral properties of kernel random matrices mimic those
of certain integral operators. Under ICA-like assumptions, we were able to get a
“strong approximation” result (Theorems 2.1 and 2.2), that is, an operator norm
consistency result that carries information about individual eigenvalues and eigen-
vectors corresponding to separated eigenvalues. Under more general and less linear
assumptions (Theorems 2.3 and 2.4), we have obtained results concerning the lim-
iting spectral distribution of these matrices, using a “weak approximation” result
relying on bounds on Frobenius norms.
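The linearization phenomenon is easy to reproduce numerically. The sketch below uses a hypothetical, particularly simple setting (standard Gaussian data with Σ = Id, f = exp, n = 300, p = 600) and compares the kernel matrix with a linearized approximation f(0)11′ + f′(0)XX′/p + υp Id, with υp = f(trace(Σ)/p) − f(0) − f′(0)trace(Σ)/p chosen to be consistent with the diagonal computation in the proof of Theorem 2.3:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 300, 600
X = rng.standard_normal((p, n))            # columns X_i, Sigma = Id
G = X.T @ X / p                            # entries X_i'X_j / p

f = np.exp                                 # f(x) = e^x, so f(0) = f'(0) = 1
M = f(G)                                   # kernel matrix f(X_i'X_j/p)
tau = 1.0                                  # trace(Sigma)/p for Sigma = Id
ups = f(tau) - f(0) - 1.0 * tau            # upsilon_p = f(tau) - f(0) - f'(0) tau
K = f(0) * np.ones((n, n)) + 1.0 * G + ups * np.eye(n)

rel = np.linalg.norm(M - K, 2) / np.linalg.norm(M, 2)
assert rel < 0.05                          # the kernel matrix is essentially linear
```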
Besides the mathematical results obtained above, this study raises several statis-
tical questions, both about the richness—or lack thereof—of models that are often
studied in random matrix theory and about the effect of kernel methods in this
context.
and

max_{i≠j} |Xi′Xj|/p → 0.

Both these statements hold almost surely. Geometrically, this means that the vectors {Xi/√p}_{i=1}^n are close to a sphere and almost orthogonal to one another. These
properties also hold for the more general (and less linear) models we considered
in Theorems 2.3 and 2.4.
Hence, if one were to plot a histogram of {‖Xi‖²₂/p}_{i=1}^n, this histogram would
look tightly concentrated around a single value—the spread of this histogram be-
ing computable from our concentration results (Lemmas A.3, A.4 and Fact A.1).
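For instance, with hypothetical standard Gaussian data (so trace(Σ)/p = 1; sizes chosen for illustration), the concentration of the squared norms and the near-orthogonality are already visible at p = 1000:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 200, 1000
X = rng.standard_normal((n, p))            # Sigma = Id, trace(Sigma)/p = 1

norms = (X**2).sum(axis=1) / p             # the values ||X_i||_2^2 / p
G = X @ X.T / p
off = np.abs(G - np.diag(np.diag(G)))      # |X_i'X_j|/p for i != j

assert norms.max() - norms.min() < 0.5     # tight concentration around 1
assert off.max() < 0.5                     # near-orthogonality of X_i / sqrt(p)
```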
Though the models appear to be quite rich, the geometry that we can perceive
by sampling n such vectors, with n p, is, arguably, relatively poor. These re-
marks should not be taken as aiming to discredit the interesting body of work that
has emerged out of the study of such models. Their aim is just to warn possible
users that in data analysis, a good first step would be to plot the histogram of
{‖Xi‖²₂/p}_{i=1}^n and check whether it is concentrated around a single value. Simi-
larly, one might want to plot the histogram of inner products {Xi′Xj/p}_{i≠j} and check
that it is concentrated around 0. If this is not the case, then insights derived from
random matrix theoretic studies would likely not be helpful in the data analysis.
We note, however, that recent random matrix work (see [13, 14, 19, 32]) has
been concerned with distributions which could, loosely speaking, be called of
“elliptical” type—though they are more general than what is usually called ellipti-
cal distributions in Statistics. In those settings, the data is, for instance, of the form
Xi = ri Σ^{1/2} Yi, where ri is a real-valued random variable, independent of Yi. This
allows the data vectors to not approximately live on spheres (but does not change
anything about angles between different vectors), and is a possible way to address
some of the concerns we just raised. The characterization of the limiting spectrum
gets quite a bit more involved than in the “standard” setting, that is, ri = 1, and the
results show a lack of robustness to the “indirect” assumption that the data vectors
live close to a sphere.
Finally, this geometric discussion also applies to theoretical studies undertaken
under the assumptions that the Xi are N(0, Σp) and that the problem is high-
dimensional. It should highlight some possibly severe limitations of the normality
assumption in high dimensions.
APPENDIX
In this appendix, we collect a few useful results that are needed in the proof
of our theorems, and whose content we thought would be more accessible if they
were separated from the main proofs.
LEMMA A.1. Suppose Y is a vector with i.i.d. entries and mean 0. Call its
entries y_i. Suppose E(y_i²) = σ² and E(y_i⁴) = μ₄. Then, if M is a deterministic
matrix,

(5) E(trace(MYY′MYY′)) = σ⁴ trace(M² + MM′) + σ⁴(trace(M))² + (μ₄ − 3σ⁴) trace(M ◦ M).
Here diag(M) denotes the matrix consisting of the diagonal of the matrix M and
0 off the diagonal. The symbol ◦ denotes Hadamard multiplication between matri-
ces.
Using the fact that the entries of Y are independent and have mean 0, we see that, in
the sum, the only terms that will not be 0 in expectation are those for which each
index appears at least twice. If i ≠ j, only the terms of the form y_i²y_j² have this
property. So if i ≠ j,

E(R_{i,j}) = E(y_i²y_j²)(M_{i,j} + M_{j,i}) = σ⁴(M_{i,j} + M_{j,i}).

Let us now turn to the diagonal terms. Here again, only the terms y_i⁴ and y_i²y_k² matter. So,
on the diagonal,

E(R_{i,i}) = μ₄M_{i,i} + σ⁴ Σ_{j≠i} M_{j,j} = (μ₄ − σ⁴)M_{i,i} + σ⁴ trace(M).
We conclude that
The second part of the proof follows from the first result, after we remark that,
if D is a diagonal matrix and L is a general matrix, trace(LD) = trace(L ◦ D), from which
we conclude that trace(M diag(M)) = trace(M ◦ diag(M)) = trace(M ◦ M).
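Equation (5) is straightforward to verify by Monte Carlo; the sketch below uses Gaussian entries (σ² = 1, μ₄ = 3, so the last term of (5) vanishes) and a small illustrative matrix, and exploits the identity trace(MYY′MYY′) = (Y′MY)².

```python
import numpy as np

rng = np.random.default_rng(8)
p = 5
M = rng.standard_normal((p, p))           # a fixed, not necessarily symmetric, matrix

N = 200_000
Y = rng.standard_normal((N, p))           # i.i.d. N(0,1): sigma^2 = 1, mu_4 = 3
q = np.einsum("ni,ij,nj->n", Y, M, Y)     # quadratic forms Y' M Y
mc = np.mean(q**2)                        # Monte Carlo estimate of E((Y'MY)^2)

# For sigma^2 = 1, mu_4 = 3, the (mu_4 - 3 sigma^4) term of (5) vanishes:
exact = np.trace(M @ M + M @ M.T) + np.trace(M) ** 2
assert abs(mc - exact) / exact < 0.1
```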
and

P(|Z′M₊Z/p − μ_F| > r/(1 + 2μ_F)).
PROOF. The proof relies on the results of Lemma A.2. Remark that, since Σ
is symmetric,

Yi′ΣYj = (1/2) (Yi′ Yj′) [0 Σ; Σ 0] (Yi; Yj).

Now the entries of the vector made by concatenating Yi and Yj are i.i.d., and so
we fall back into the setting of Lemma A.2. Finally, here M₊ and M₋ are known
explicitly. A possible choice is

M₊ = (1/4)[Σ Σ; Σ Σ] and M₋ = (1/4)[Σ −Σ; −Σ Σ].

ν_p is obtained by upper bounding the expectation of the square of F, in the notation of
the proof of the previous lemma, for these explicit matrices. Note that their largest
singular values are both smaller than σ₁(Σ), so the results of the previous lemma
apply.
LEMMA A.3. Let {Yi}_{i=1}^n be i.i.d. random vectors in R^p, whose entries are
i.i.d., mean 0, variance 1 and have m ≥ 4 absolute moments, bounded in p. Sup-
pose that {Σp} is a sequence of positive semi-definite matrices whose operator
norms are uniformly bounded in p, and that n/p is asymptotically bounded. We have,
for any given ε > 0,

max_{i,j} |Yi′ΣpYj/p − δ_{i,j} trace(Σp)/p| ≤ p^{−1/2+2/m}(log(p))^{(1+ε)/2} a.s.
PROOF. Throughout the proof, we assume without loss of generality that m <
∞.
Call t = 2/m. It is clear that with our moment assumptions, t ≤ 1/2. According
to Lemma 2.2 in [45], the maximum of the entries of the array {Yi}_{i=1}^n is a.s. less than p^t. So
to control the maximum of the inner products of interest, it is enough to control
the same quantity when we replace Yi by Ỹi, with Ỹ_{i,l} = Y_{i,l} 1_{|Y_{i,l}|≤p^t}. Now note
that Ỹi satisfies the boundedness assumption of Corollary A.1, but its mean is not
necessarily zero and its variance is not 1. Note, however, that all the entries of Ỹi
have the same mean, μ̃. Since Yi has mean 0, we have

|μ̃| ≤ E(|Y_{1,1}| 1_{|Y_{1,1}|>p^t}) ≤ E(|Y_{1,1}|^m) p^{−t(m−1)} ≤ μ_m p^{−2+t}.

Similarly, if we call σ̃² the variance of Ỹ_{1,1}, we have

σ̃² = E(|Y_{1,1}|² 1_{|Y_{1,1}|≤p^t}) − μ̃² = 1 − (E(|Y_{1,1}|² 1_{|Y_{1,1}|>p^t}) + μ̃²).

Hence, 0 ≤ 1 − σ̃², and

1 − σ̃² = E(|Y_{1,1}|² 1_{|Y_{1,1}|>p^t}) + μ̃² ≤ E(|Y_{1,1}|^m) p^{−t(m−2)} + μ̃² ≤ K p^{−2+2t},
where K denotes a generic constant (that may change from display to display). In
particular, K is independent of p and is hence trivially bounded away from 0 as p
grows. The bound we just obtained on 1 − σ̃² also implies that, for p large enough,
σ̃² > 1/2, from which we conclude that for another K with the same properties,
writing Ui = Ỹi − μ̃1 for the centered truncated vectors,

P(|Ui′ΣpUj/p| > r(p)) ≤ exp(−K(log(p))^{1+ε}).

In other respects, the arguments of Lemma A.2 show that, since σ̃² is the vari-
ance of the entries of Ui,

P(|Ui′ΣpUi/p − σ̃² trace(Σp)/p| > r(p)) ≤ exp(−K(log(p))^{1+ε}).
Now

Ỹi′ΣpỸj/p = Ui′ΣpUj/p + μ̃ (1′ΣpUj + Ui′Σp1)/p + μ̃² 1′Σp1/p.
Remark that 1′Σp1 ≤ pσ₁(Σp), and |1′ΣpUj| ≤ √(1′Σp1) √(Uj′ΣpUj). We con-
clude, using the results obtained in the proof of Lemma A.2, that with prob-
ability greater than 1 − exp(−K(log(p))^{1+ε}), the middle term is smaller than
2√(σ₁(Σp)) (√(σ₁(Σp)) + r(p)) |μ̃|. As a matter of fact, Uj′ΣpUj/p is concentrated
around its mean, which is smaller than σ̃² trace(Σp)/p, which is itself smaller than
σ₁(Σp). Now recall that |μ̃| = O(p^{−2+t}) = o(r(p)). We can therefore conclude
that

P(|Ỹi′ΣpỸj/p − δ_{i,j} σ̃² trace(Σp)/p| > 2r(p)) ≤ 2 exp(−K(log(p))^{1+ε}).
Now note that 0 ≤ 1 − σ̃² = O(p^{−2+2t}) = o(r(p)), since t ≤ 1/2 < 3/2. With our
assumptions, trace(Σp)/p remains bounded, so we finally have

P(|Ỹi′ΣpỸj/p − δ_{i,j} trace(Σp)/p| > 3r(p)) ≤ 2 exp(−K(log(p))^{1+ε}).

And therefore,

P(max_{i,j} |Ỹi′ΣpỸj/p − δ_{i,j} trace(Σp)/p| > 3r(p)) ≤ 2n² exp(−K(log(p))^{1+ε}).
Using the Borel–Cantelli lemma, we reach the conclusion that

max_{i,j} |Ỹi′ΣpỸj/p − δ_{i,j} trace(Σp)/p| ≤ 3r(p) = 3p^{2/m−1/2}(log(p))^{(1+ε)/2} a.s.

Because the left-hand side is eventually a.s. equal to max_{i,j} |Yi′ΣpYj/p − δ_{i,j} trace(Σp)/p|, we reach the
announced conclusion, but with r(p) replaced by 3r(p). Note that, of course, any
multiple of r(p), where the constant is independent of p, would work in the proof.
In particular, by taking r̃(p) = r(p)/3, we reach the announced conclusion.
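A numerical illustration of the rate in Lemma A.3, with hypothetical choices (Gaussian entries, for which we may take m = 8 for instance, a diagonal Σp with bounded operator norm, n = p = 500, and ε = 0.5):

```python
import numpy as np

rng = np.random.default_rng(9)
n = p = 500
m = 8                                      # Gaussian entries: all moments exist
Y = rng.standard_normal((n, p))
Sigma = np.diag(np.linspace(0.5, 1.5, p))  # operator norm bounded by 1.5

Q = Y @ Sigma @ Y.T / p
dev = np.abs(Q - (np.trace(Sigma) / p) * np.eye(n))

# r(p) = p^{-1/2 + 2/m} (log p)^{(1+eps)/2} with eps = 0.5
rate = p ** (-0.5 + 2 / m) * np.log(p) ** 0.75
assert dev.max() <= 3 * rate
```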
PROOF. The proof follows immediately from the results of Lemma A.3, after
we write

‖Xi − Xj‖²₂ − 2 trace(Σp) = [Yi′ΣpYi − trace(Σp)] + [Yj′ΣpYj − trace(Σp)] − 2Yi′ΣpYj.

Note that, as explained in the proof of Lemma A.3, the constants in front of the
bounding sequence do not matter, so we can replace 3p^{−1/2+2/m}(log(p))^{(1+ε)/2}
by p^{−1/2+2/m}(log(p))^{(1+ε)/2}, and the result still holds. [In other words, we are
really using Lemma A.3 with upper bound p^{−1/2+2/m}(log(p))^{(1+ε)/2}/3.]
LEMMA A.4. Let {Xi}_{i=1}^n be i.i.d. random vectors in R^p whose entries are
i.i.d., mean 0, having the property that for 1-Lipschitz (with respect to Euclidean
norm) functions F, if we denote by m_F the median of F(Xi),

P(|F(Xi) − m_F| > r) ≤ C exp(−c(p)r²),

where C is independent of p and c is allowed to vary with p (if it goes to zero,
we assume it does so like p^{−α}, 0 ≤ α < 1). Call Σp the covariance matrix of X₁.
Assume that σ₁(Σp) remains bounded in p. Then, under the triangular array con-
struction of Theorem 2.3, we have, for any ε > 0,

max_{i,j} |Xi′Xj/p − δ_{i,j} trace(Σp)/p| ≤ (pc(p))^{−1/2}(log(p))^{(1+ε)/2} a.s.
PROOF. The proof once again relies on concentration inequalities. First note
that Proposition 1.11 combined with Proposition 1.7 in [29] shows that if Xi
and Xj are independent and satisfy concentration inequalities with concentration
function α(r) (with respect to Euclidean norm), then the concatenated vector (Xi′, Xj′)′ also satis-
fies concentration inequalities with concentration function 2α(r/2) with respect
to Euclidean norm in R^{2p}. (We note that Proposition 1.11 is proved for the met-
ric on R^{2p} given by ‖·‖₂ + ‖·‖₂, where each Euclidean norm is a norm on R^p, but the
same proof goes through for the Euclidean norm on R^{2p}. Another argument would
be to say that the metric ‖·‖₂ + ‖·‖₂ is equivalent, up to a factor √2, to the Euclidean norm of the full R^{2p}.)
COROLLARY A.3. Under the assumptions of Lemma A.4, we have, for any
ε > 0,

max_{i≠j} |‖Xi − Xj‖²₂/p − 2 trace(Σp)/p| ≤ (pc(p))^{−1/2}(log(p))^{(1+ε)/2} a.s.
Finally, the same lines of proof give the following fact.
FACT A.1. Let {Xi}_{i=1}^n be i.i.d. random vectors in R^p whose entries are i.i.d.,
mean 0, having the property that for 1-Lipschitz (with respect to Euclidean norm)
functions F, if we denote by m_F the median of F(Xi),

P(|F(Xi) − m_F| > t) ≤ C exp(−c(p)t^b) for some b > 0,
The proof of this last fact follows the same steps as that of Lemma A.4, with a
slight adjustment, since we need to replace 2 by b. For a related question and more
details, we refer the reader to [19].
(B) A linear algebraic result. Finally, we finish this appendix with a linear
algebraic lemma which we need in our approximations and which is of independent in-
terest.
Now,

|A_{i₁,i₂} A_{i₂,i₃} ··· A_{i_k,i₁}| ≤ |A_{i₁,i₂}| |A_{i₂,i₃}| ··· |A_{i_k,i₁}|.

When A = E ◦ M, A_{i,j} = E_{i,j}M_{i,j}. Since M_{i,j} ≥ 0, we therefore have |E_{i,j}M_{i,j}| ≤ ζ M_{i,j}. Hence,

trace((E ◦ M)^k) ≤ ζ^k Σ_{1≤i₁,i₂,...,i_k≤p} M_{i₁,i₂} M_{i₂,i₃} ··· M_{i_k,i₁} = ζ^k trace(M^k).
So

[trace((E ◦ M)^{2k})]^{1/(2k)} ≤ ζ [trace(M^{2k})]^{1/(2k)}.

Taking limits as k → ∞ concludes the proof.
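The trace-power argument is easy to check numerically: for a symmetric M with nonnegative entries and ζ = max_{i,j}|E_{i,j}|, the inequality holds for each k, and [trace(M^{2k})]^{1/(2k)} approaches |M|₂ as k grows. The sizes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(10)
p = 40
M = np.abs(rng.standard_normal((p, p))); M = (M + M.T) / 2   # symmetric, M_ij >= 0
E = rng.uniform(-1, 1, size=(p, p)); E = (E + E.T) / 2
zeta = np.abs(E).max()

A = E * M                                   # Hadamard product E o M
for k in (1, 2, 4):
    tA = np.trace(np.linalg.matrix_power(A, 2 * k))
    tM = np.trace(np.linalg.matrix_power(M, 2 * k))
    assert tA ** (1 / (2 * k)) <= zeta * tM ** (1 / (2 * k)) + 1e-9

# [trace(M^{2k})]^{1/(2k)} tends to |M|_2 as k -> infinity
k = 30
approx = np.trace(np.linalg.matrix_power(M, 2 * k)) ** (1 / (2 * k))
ratio = approx / np.linalg.norm(M, 2)
assert 1 - 1e-9 <= ratio <= 1.1
```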
REFERENCES
[1] Anderson, T. W. (2003). An Introduction to Multivariate Statistical Analysis, 3rd ed. Wiley, Hoboken, NJ. MR1990662
[2] Bach, F. R. and Jordan, M. I. (2003). Kernel independent component analysis. J. Mach. Learn. Res. 3 1–48. MR1966051
[3] Bai, Z. D. (1999). Methodologies in spectral analysis of large-dimensional random matrices, a review. Statist. Sinica 9 611–677. MR1711663
[4] Bai, Z. D., Miao, B. Q. and Pan, G. M. (2007). On asymptotics of eigenvectors of large sample covariance matrix. Ann. Probab. 35 1532–1572. MR2330979
[5] Bai, Z. D. and Silverstein, J. W. (1998). No eigenvalues outside the support of the limiting spectral distribution of large-dimensional sample covariance matrices. Ann. Probab. 26 316–345. MR1617051
[6] Bai, Z. D. and Silverstein, J. W. (1999). Exact separation of eigenvalues of large-dimensional sample covariance matrices. Ann. Probab. 27 1536–1555. MR1733159
[7] Baik, J., Ben Arous, G. and Péché, S. (2005). Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. Ann. Probab. 33 1643–1697. MR2165575
[8] Baik, J. and Silverstein, J. (2006). Eigenvalues of large sample covariance matrices of spiked population models. J. Multivariate Anal. 97 1382–1408. MR2279680
[9] Belkin, M. and Niyogi, P. (2009). Convergence of Laplacian eigenmaps. Preprint.
[10] Bhatia, R. (1997). Matrix Analysis. Graduate Texts in Mathematics 169. Springer, New York. MR1477662
[11] Bogomolny, E., Bohigas, O. and Schmit, C. (2003). Spectral properties of distance matrices. J. Phys. A 36 3595–3616. MR1986436
[12] Bordenave, C. (2008). Eigenvalues of Euclidean random matrices. Random Structures Algorithms 33 515–532. Available at [Link] MR2462254
[13] Boutet de Monvel, A., Khorunzhy, A. and Vasilchuk, V. (1996). Limiting eigenvalue distribution of random matrices with correlated entries. Markov Process. Related Fields 2 607–636. MR1431189
[14] Burda, Z., Jurkiewicz, J. and Wacław, B. (2005). Spectral moments of correlated Wishart matrices. Phys. Rev. E 71 026111.
[15] Cressie, N. A. C. (1993). Statistics for Spatial Data. Wiley, New York. MR1239641
[16] El Karoui, N. (2003). On the largest eigenvalue of Wishart matrices with identity covariance when n, p and p/n → ∞. Available at arXiv:[Link]/0309355.
[17] El Karoui, N. (2007). Tracy–Widom limit for the largest eigenvalue of a large class of complex sample covariance matrices. Ann. Probab. 35 663–714. MR2308592
[18] El Karoui, N. (2008). Operator norm consistent estimation of large-dimensional sparse covariance matrices. Ann. Statist. 36 2717–2756. MR2485011
[19] El Karoui, N. (2009). Concentration of measure and spectra of random matrices: With applications to correlation matrices, elliptical distributions and beyond. Ann. Appl. Probab. 19 2362–2405.
[20] Forrester, P. J. (1993). The spectrum edge of random matrix ensembles. Nuclear Phys. B 402 709–728. MR1236195
[21] Geman, S. (1980). A limit theorem for the norm of random matrices. Ann. Probab. 8 252–261. MR0566592
[22] Geronimo, J. S. and Hill, T. P. (2003). Necessary and sufficient condition that the limit of Stieltjes transforms is a Stieltjes transform. J. Approx. Theory 121 54–60. MR1962995
[23] Gohberg, I., Goldberg, S. and Krupnik, N. (2000). Traces and Determinants of Linear Operators. Operator Theory: Advances and Applications 116. Birkhäuser, Basel. MR1744872
[24] Horn, R. A. and Johnson, C. R. (1990). Matrix Analysis. Cambridge Univ. Press, Cambridge. MR1084815
[25] Horn, R. A. and Johnson, C. R. (1994). Topics in Matrix Analysis. Cambridge Univ. Press, Cambridge. MR1288752
[26] Johansson, K. (2000). Shape fluctuations and random matrices. Comm. Math. Phys. 209 437–476. MR1737991
[27] Johnstone, I. (2001). On the distribution of the largest eigenvalue in principal component analysis. Ann. Statist. 29 295–327. MR1863961
[28] Koltchinskii, V. and Giné, E. (2000). Random matrix approximation of spectra of integral operators. Bernoulli 6 113–167. MR1781185
[29] Ledoux, M. (2001). The Concentration of Measure Phenomenon. Mathematical Surveys and Monographs 89. Amer. Math. Soc., Providence, RI. MR1849347
[30] Marčenko, V. A. and Pastur, L. A. (1967). Distribution of eigenvalues in certain sets of random matrices. Mat. Sb. (N.S.) 72 507–536. MR0208649
[31] Paul, D. (2007). Asymptotics of sample eigenstructure for a large-dimensional spiked covariance model. Statist. Sinica 17 1617–1642. MR2399865
[32] Paul, D. and Silverstein, J. (2009). No eigenvalues outside the support of the limiting empirical spectral distribution of a separable covariance matrix. J. Multivariate Anal. 100 37–57. MR2460475
[33] Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA. MR2514435
[34] Schechtman, G. and Zinn, J. (2000). Concentration on the ℓ_p^n ball. In Geometric Aspects of Functional Analysis. Lecture Notes in Mathematics 1745 245–256. Springer, Berlin. MR1796723
[35] Schölkopf, B. and Smola, A. J. (2002). Learning with Kernels. MIT Press, Cambridge, MA. MR1949972
[36] Schölkopf, B., Tsuda, K. and Vert, J. P. (2004). Kernel Methods in Computational Biology. MIT Press, Cambridge, MA.
[37] Silverstein, J. W. (1995). Strong convergence of the empirical distribution of eigenvalues of large-dimensional random matrices. J. Multivariate Anal. 55 331–339. MR1370408
[38] Tracy, C. and Widom, H. (1994). Level-spacing distribution and the Airy kernel. Comm. Math. Phys. 159 151–174. MR1257246
[39] Tracy, C. and Widom, H. (1996). On orthogonal and symplectic matrix ensembles. Comm. Math. Phys. 177 727–754. MR1385083
[40] Tracy, C. and Widom, H. (1998). Correlation functions, cluster functions and spacing distributions for random matrices. J. Stat. Phys. 92 809–835. MR1657844
[41] Voiculescu, D. (2000). Lectures on free probability theory. In Lectures on Probability Theory and Statistics (Saint-Flour, 1998). Lecture Notes in Mathematics 1738 279–349. Springer, Berlin. MR1775641
[42] Wachter, K. W. (1978). The strong limits of random matrix spectra for sample matrices of independent elements. Ann. Probab. 6 1–18. MR0467894
[43] Wigner, E. (1955). Characteristic vectors of bordered matrices with infinite dimensions. Ann. of Math. (2) 62 548–564. MR0077805
[44] Williams, C. and Seeger, M. (2000). The effect of the input density distribution on kernel-based classifiers. In International Conference on Machine Learning 17 1159–1166.
[45] Yin, Y. Q., Bai, Z. D. and Krishnaiah, P. R. (1988). On the limit of the largest eigenvalue of the large-dimensional sample covariance matrix. Probab. Theory Related Fields 78 509–521. MR0950344
DEPARTMENT OF STATISTICS
UNIVERSITY OF CALIFORNIA, BERKELEY
367 EVANS HALL
BERKELEY, CALIFORNIA 94720-3860
USA
E-MAIL: nkaroui@[Link]