
Very Sparse Random Projections

Ping Li
Department of Statistics, Stanford University, Stanford CA 94305, USA
[email protected]

Trevor J. Hastie
Department of Statistics, Stanford University, Stanford CA 94305, USA
[email protected]

Kenneth W. Church
Microsoft Research, Microsoft Corporation, Redmond WA 98052, USA
[email protected]

ABSTRACT

There has been considerable interest in random projections, an approximate algorithm for estimating distances between pairs of points in a high-dimensional vector space. Let A ∈ R^{n×D} be our n points in D dimensions. The method multiplies A by a random matrix R ∈ R^{D×k}, reducing the D dimensions down to just k for speeding up the computation. R typically consists of entries of standard normal N(0, 1). It is well known that random projections preserve pairwise distances (in the expectation). Achlioptas proposed sparse random projections by replacing the N(0, 1) entries in R with entries in {−1, 0, 1} with probabilities {1/6, 2/3, 1/6}, achieving a threefold speedup in processing time.

We recommend using R with entries in {−1, 0, 1} with probabilities {1/(2√D), 1 − 1/√D, 1/(2√D)} for achieving a significant √D-fold speedup, with little loss in accuracy.

Categories and Subject Descriptors
H.2.8 [Database Applications]: Data Mining

General Terms
Algorithms, Performance, Theory

Keywords
Random projections, Sampling, Rates of convergence

1. INTRODUCTION

Random projections [1, 43] have been used in Machine Learning [2, 4, 5, 13, 14, 22], VLSI layout [42], analysis of Latent Semantic Indexing (LSI) [35], set intersections [7, 36], finding motifs in bio-sequences [6, 27], face recognition [16], and privacy preserving distributed data mining [31], to name a few. The AMS sketching algorithm [3] is also one form of random projections.

We define a data matrix A of size n × D to be a collection of n data points {u_i}_{i=1}^n ∈ R^D. All pairwise distances can be computed as AA^T, at the cost of time O(n^2 D), which is often prohibitive for large n and D in modern data mining and information retrieval applications.

To speed up the computations, one can generate a random projection matrix R ∈ R^{D×k} and multiply it with the original matrix A ∈ R^{n×D} to obtain a projected data matrix

    B = (1/√k) A R ∈ R^{n×k},    k ≪ min(n, D).    (1)

The (much smaller) matrix B preserves all pairwise distances of A in expectation, provided that R consists of i.i.d. entries with zero mean and constant variance. Thus, we can achieve a substantial cost reduction for computing AA^T, from O(n^2 D) to O(nDk + n^2 k).

In information retrieval, we often do not have to materialize AA^T. Instead, databases and search engines are interested in storing the projected data B in main memory for efficiently responding to input queries. While the original data matrix A is often too large, the projected data matrix B can be small enough to reside in the main memory.

The entries of R (denoted by {r_{ji}}, j = 1, ..., D, i = 1, ..., k) should be i.i.d. with zero mean. In fact, this is the only necessary condition for preserving pairwise distances [4]. However, different choices of r_{ji} can change the variances (average errors) and error tail bounds. It is often convenient to let r_{ji} follow a symmetric distribution about zero with unit variance. A "simple" distribution is the standard normal¹, i.e.,

    r_{ji} ∼ N(0, 1),    E(r_{ji}) = 0,    E(r_{ji}^2) = 1,    E(r_{ji}^4) = 3.

It is "simple" in terms of theoretical analysis, but not in terms of random number generation. For example, a uniform distribution is easier to generate than normals, but the analysis is more difficult.

In this paper, when R consists of normal entries, we call this special case the conventional random projections, about which many theoretical results are known. See the monograph by Vempala [43] for further references.

We derive some theoretical results when R is not restricted to normals. In particular, our results lead to significant improvements over the so-called sparse random projections.

¹ The normal distribution is 2-stable. It is one of the few stable distributions that have closed-form density [19].
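As a concrete illustration of the projection in (1), the following minimal numpy sketch (our own, not from the paper; sizes, seed, and array names are arbitrary) projects a small synthetic data matrix and compares one squared distance before and after projection:

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, k = 100, 10000, 50           # k << min(n, D)

A = rng.standard_normal((n, D))    # n data points in D dimensions
R = rng.standard_normal((D, k))    # conventional projection matrix with N(0,1) entries
B = A @ R / np.sqrt(k)             # projected data, Eq. (1)

# Pairwise distances are preserved in expectation: compare one pair.
orig = np.sum((A[0] - A[1]) ** 2)
proj = np.sum((B[0] - B[1]) ** 2)
print(orig, proj)                  # proj fluctuates around orig, with variance 2*orig^2/k
```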
1.1 Sparse Random Projections

In his novel work, Achlioptas [1] proposed using the projection matrix R with i.i.d. entries

    r_{ji} = √s × { +1 with prob. 1/(2s);   0 with prob. 1 − 1/s;   −1 with prob. 1/(2s) },    (2)

where Achlioptas used s = 1 or s = 3. With s = 3, one can achieve a threefold speedup because only 1/3 of the data need to be processed (hence the name sparse random projections). Since the multiplications with √s can be delayed, no floating point arithmetic is needed and all computation amounts to highly optimized database aggregation operations.

This method of sparse random projections has gained popularity. It was first experimentally tested on image and text data by [5] in SIGKDD 2001. Later, many more publications also adopted this method, e.g., [14, 29, 38, 41].
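One possible way to sample the matrix in (2) for a general s is sketched below; the helper name and the use of numpy.random.choice are our own choices, not prescribed by the paper:

```python
import numpy as np

def sparse_projection_matrix(D, k, s, rng):
    # Entries are sqrt(s) * {+1, 0, -1} with probabilities {1/(2s), 1 - 1/s, 1/(2s)}, Eq. (2).
    signs = rng.choice([1.0, 0.0, -1.0], size=(D, k),
                       p=[1.0 / (2 * s), 1.0 - 1.0 / s, 1.0 / (2 * s)])
    return np.sqrt(s) * signs

rng = np.random.default_rng(0)
D, k = 65536, 50
R1 = sparse_projection_matrix(D, k, s=3, rng=rng)                # Achlioptas: ~1/3 nonzeros
R2 = sparse_projection_matrix(D, k, s=int(np.sqrt(D)), rng=rng)  # very sparse: ~1/sqrt(D) nonzeros
print(np.mean(R1 != 0), np.mean(R2 != 0))                        # ~0.333 vs ~0.0039
```

In practice one would store only the signs and positions of the nonzero entries and delay the √s scaling, rather than materialize a dense matrix as in this illustration.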
1.2 Very Sparse Random Projections

We show that one can use s ≫ 3 (e.g., s = √D, or even s = D/log D) to significantly speed up the computation.

Examining (2), we can see that sparse random projections are random sampling at a rate of 1/s, i.e., when s = 3, one-third of the data are sampled. Statistical results tell us that one does not have to sample one-third (D/3) of the data to obtain good estimates. In fact, when the data are approximately normal, log D of the data probably suffice (i.e., s = D/log D), because of the exponential error tail bounds common in normal-like distributions, such as binomial, gamma, etc. For better robustness, we recommend choosing s less aggressively (e.g., s = √D).

To better understand sparse and very sparse random projections, we first give a summary of relevant results on conventional random projections in the next section.

2. CONVENTIONAL RANDOM PROJECTIONS: R ∼ N(0, 1)

Conventional random projections multiply the original data matrix A ∈ R^{n×D} with a random matrix R ∈ R^{D×k} consisting of i.i.d. N(0, 1) entries. Denote by {u_i}_{i=1}^n ∈ R^D the rows in A and by {v_i}_{i=1}^n ∈ R^k the rows of the projected data, i.e., v_i = (1/√k) R^T u_i. We focus on the leading two rows, u_1, u_2 and v_1, v_2. For convenience, we denote

    m_1 = ‖u_1‖^2 = Σ_{j=1}^D u_{1,j}^2,    m_2 = ‖u_2‖^2 = Σ_{j=1}^D u_{2,j}^2,

    a = u_1^T u_2 = Σ_{j=1}^D u_{1,j} u_{2,j},    d = ‖u_1 − u_2‖^2 = m_1 + m_2 − 2a.

2.1 Moments

It is easy to show that (e.g., Lemma 1.3 of [43])

    E(‖v_1‖^2) = ‖u_1‖^2 = m_1,    Var(‖v_1‖^2)_N = (2/k) m_1^2,    (3)

    E(‖v_1 − v_2‖^2) = d,    Var(‖v_1 − v_2‖^2)_N = (2/k) d^2,    (4)

where the subscript "N" indicates that a "normal" projection matrix is used.

From our later results in Lemma 3 (or [28, Lemma 1]) we can derive

    E(v_1^T v_2) = a,    Var(v_1^T v_2)_N = (1/k) (m_1 m_2 + a^2).    (5)

Therefore, one can compute both pairwise 2-norm distances and inner products in k (instead of D) dimensions, achieving a huge cost reduction when k ≪ min(n, D).
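The formulas (3)-(5) are easy to check by simulation. A small Monte Carlo sketch for the inner-product case (scaled-down synthetic data, our own illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
D, k, trials = 1000, 20, 2000

u1 = rng.standard_normal(D)
u2 = 0.5 * u1 + rng.standard_normal(D)       # one fixed pair of data vectors
m1, m2, a = u1 @ u1, u2 @ u2, u1 @ u2

est = np.empty(trials)
for t in range(trials):
    R = rng.standard_normal((D, k))
    v1 = R.T @ u1 / np.sqrt(k)
    v2 = R.T @ u2 / np.sqrt(k)
    est[t] = v1 @ v2                          # margin-free estimate of a

print(est.mean(), a)                          # unbiased, Eq. (5)
print(est.var(), (m1 * m2 + a * a) / k)       # empirical vs. theoretical variance
```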
2.2 Distributions

It is easy to show that (e.g., Lemma 1.3 of [43])

    v_{1,i} / √(m_1/k) ∼ N(0, 1),    ‖v_1‖^2 / (m_1/k) ∼ χ^2_k,    (6)

    (v_{1,i} − v_{2,i}) / √(d/k) ∼ N(0, 1),    ‖v_1 − v_2‖^2 / (d/k) ∼ χ^2_k,    (7)

    (v_{1,i}, v_{2,i})^T ∼ N( (0, 0)^T, Σ ),    Σ = (1/k) [ m_1  a ;  a  m_2 ],    (8)

where χ^2_k denotes a chi-squared random variable with k degrees of freedom, and v_{1,i} denotes any entry of v_1 ∈ R^k (the entries are i.i.d.).

Knowing the distributions of the projected data enables us to derive (sharp) error tail bounds. For example, various Johnson and Lindenstrauss (JL) embedding theorems [1, 4, 9, 15, 20, 21] have been proved for precisely determining k given some specified level of accuracy, for estimating the 2-norm distances. According to the best known result [1]:

    If k ≥ k_0 = (4 + 2γ) / (ε^2/2 − ε^3/3) log n, then with probability at least 1 − n^{−γ}, for any two rows u_i, u_j, we have

    (1 − ε) ‖u_i − u_j‖^2 ≤ ‖v_i − v_j‖^2 ≤ (1 + ε) ‖u_i − u_j‖^2.    (9)

Remark: (a) The JL lemma is conservative in many applications because it was derived based on the Bonferroni correction for multiple comparisons. (b) It is only for the l_2 distance, while many applications care more about the inner product. As shown in (5), the variance of the inner product estimator, Var(v_1^T v_2)_N, is dominated by the margins (i.e., m_1 m_2) even when the data are uncorrelated. This is probably the weakness of random projections.
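As a quick illustration of the bound preceding (9), a small helper (assuming the stated constant (4 + 2γ)/(ε²/2 − ε³/3)) computes k_0:

```python
import math

def jl_dimension(n, eps, gamma=1.0):
    # k0 = (4 + 2*gamma) / (eps^2/2 - eps^3/3) * log(n); see the statement before Eq. (9) and [1].
    return math.ceil((4 + 2 * gamma) / (eps**2 / 2 - eps**3 / 3) * math.log(n))

print(jl_dimension(n=10**6, eps=0.1))   # ~17,764 dimensions for 1e6 points at 10% distortion (gamma = 1)
```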
2.3 Sign Random Projections

A popular variant of conventional random projections is to store only the signs of the projected data, from which one can estimate the vector cosine angle, θ = cos^{−1}( a / √(m_1 m_2) ), by the following result [7, 17]:

    Pr( sign(v_{1,i}) = sign(v_{2,i}) ) = 1 − θ/π.    (10)

One can also estimate a, assuming that m_1 and m_2 are known, from a = cos(θ) √(m_1 m_2), at the cost of some bias.

The advantage of sign random projections is the saving in storing the projected data, because only one bit is needed for the sign. With sign random projections, we can compare vectors using Hamming distances, for which efficient algorithms are available [7, 20, 36]. See [28] for more comments on sign random projections.
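A short sketch of the sign variant: only the signs of the projected coordinates are kept, θ is estimated by inverting (10), and a is then recovered from the margins. The data here are synthetic and the code is our own illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
D, k = 1000, 500

u1 = rng.standard_normal(D)
u2 = 0.6 * u1 + rng.standard_normal(D)
m1, m2, a = u1 @ u1, u2 @ u2, u1 @ u2

R = rng.standard_normal((D, k))
s1 = np.sign(R.T @ u1)                           # one bit per projected coordinate
s2 = np.sign(R.T @ u2)                           # (the 1/sqrt(k) scaling does not change signs)

theta_hat = np.pi * np.mean(s1 != s2)            # invert Pr(sign mismatch) = theta/pi, Eq. (10)
a_hat = np.cos(theta_hat) * np.sqrt(m1 * m2)     # biased estimate of the inner product

print(np.arccos(a / np.sqrt(m1 * m2)), theta_hat)
print(a, a_hat)
```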
3. OUR CONTRIBUTIONS

We propose very sparse random projections to speed up the (processing) computations by a factor of √D or more.

• We derive exact variance formulas for ‖v_1‖^2, ‖v_1 − v_2‖^2, and v_1^T v_2 as functions of s.² Under reasonable regularity conditions, they converge to the corresponding variances when r_{ji} ∼ N(0, 1) is used, as long as s = o(D) (e.g., s = √D, or even s = D/log D). When s = √D, the rate of convergence is O(1/D^{1/4}), which is fast since D has to be large, otherwise there would be no need to seek approximate answers. This means we can achieve a √D-fold speedup with little loss in accuracy.

• We show that v_{1,i}, v_{1,i} − v_{2,i} and (v_{1,i}, v_{2,i}) converge to normals at the rate O(1/D^{1/4}) when s = √D. This allows us to apply, with a high level of accuracy, results of conventional random projections, e.g., the JL embedding theorem in (9) and the sign random projections in (10). In particular, we suggest using a maximum likelihood estimator of the asymptotic (normal) distribution to estimate the inner product a = u_1^T u_2, taking advantage of the marginal norms m_1, m_2.

• Our results essentially hold for any other distributions of r_{ji}. When r_{ji} is chosen to have negative kurtosis, we can achieve strictly smaller variances (errors) than conventional random projections.

² [1] proved the upper bounds for the variances of ‖v_1‖^2 and ‖v_1 − v_2‖^2 for s = 1 and s = 3.

4. MAIN RESULTS

The main results of our work are presented in this section, with detailed proofs in Appendix A. For convenience, we always let s = o(D) (e.g., s = √D) and assume all fourth moments are bounded, e.g., E(u_{1,j}^4) < ∞, E(u_{2,j}^4) < ∞ and E(u_{1,j}^2 u_{2,j}^2) < ∞. In fact, analyzing the rate of convergence of asymptotic normality only requires bounded third moments, and an even much weaker assumption is needed for ensuring asymptotic normality. Later we will discuss the possibility of relaxing this assumption of bounded moments.

4.1 Moments

The first three lemmas concern the moments (means and variances) of v_1, v_1 − v_2 and v_1^T v_2, respectively.

Lemma 1.

    E(‖v_1‖^2) = ‖u_1‖^2 = m_1,    (11)

    Var(‖v_1‖^2) = (1/k) ( 2 m_1^2 + (s − 3) Σ_{j=1}^D u_{1,j}^4 ).    (12)

As D → ∞,

    (s − 3) Σ_{j=1}^D u_{1,j}^4 / m_1^2  →  ((s − 3)/D) E(u_{1,j}^4) / E^2(u_{1,j}^2)  →  0,    (13)

i.e.,

    Var(‖v_1‖^2) ∼ (1/k) 2 m_1^2,    (14)

where ∼ denotes "asymptotically equivalent" for large D.

Note that m_1^2 = ( Σ_{j=1}^D u_{1,j}^2 )^2 = Σ_{j=1}^D u_{1,j}^4 + Σ_{j≠j'} u_{1,j}^2 u_{1,j'}^2, with D diagonal terms and D(D−1)/2 cross-terms. When all dimensions of u_1 are roughly equally important, the cross-terms dominate. Since D is very large, the diagonal terms are negligible. However, if a few entries are extremely large compared to the majority of the entries, the cross-terms may be of the same order as the diagonal terms. Assuming a bounded fourth moment prevents this from happening.

The next lemma is strictly analogous to Lemma 1. We present them separately because Lemma 1 is more convenient to present and analyze, while Lemma 2 contains the results on the 2-norm distances, which we will use.

Lemma 2.

    E(‖v_1 − v_2‖^2) = ‖u_1 − u_2‖^2 = d,    (15)

    Var(‖v_1 − v_2‖^2) = (1/k) ( 2 d^2 + (s − 3) Σ_{j=1}^D (u_{1,j} − u_{2,j})^4 )    (16)

    ∼ (1/k) 2 d^2.    (17)

The third lemma concerns the inner product.

Lemma 3.

    E(v_1^T v_2) = u_1^T u_2 = a,    (18)

    Var(v_1^T v_2) = (1/k) ( m_1 m_2 + a^2 + (s − 3) Σ_{j=1}^D u_{1,j}^2 u_{2,j}^2 )    (19)

    ∼ (1/k) ( m_1 m_2 + a^2 ).    (20)

Therefore, very sparse random projections preserve pairwise distances in expectation, with variances that are functions of s. Compared with Var(‖v_1‖^2)_N, Var(‖v_1 − v_2‖^2)_N, and Var(v_1^T v_2)_N in (3), (4), and (5), respectively, the extra terms all involve (s − 3) and are asymptotically negligible. The rate of convergence is O(√((s − 3)/D)), in terms of the standard error (square root of variance). When s = √D, the rate of convergence is O(1/D^{1/4}).

When s < 3, "sparse" random projections can actually achieve slightly smaller variances.
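To see how the extra (s − 3) term in (19) behaves, the sketch below evaluates the exact variance relative to the asymptotic value (20) for several values of s, using synthetic light-tailed data (our own illustration):

```python
import numpy as np

def inner_product_variance(u1, u2, k, s):
    # Exact variance of v1'v2 under the sparse projection, Eq. (19).
    m1, m2, a = u1 @ u1, u2 @ u2, u1 @ u2
    return (m1 * m2 + a * a + (s - 3) * np.sum(u1**2 * u2**2)) / k

rng = np.random.default_rng(3)
D, k = 65536, 50
u1 = rng.standard_normal(D)
u2 = 0.5 * u1 + rng.standard_normal(D)

asym = ((u1 @ u1) * (u2 @ u2) + (u1 @ u2) ** 2) / k       # Eq. (20), the normal-case value
for s in (1, 3, int(np.sqrt(D)), D // int(np.log(D))):
    print(s, inner_product_variance(u1, u2, k, s) / asym)
# With light-tailed u the ratio stays close to 1 even for s = sqrt(D); heavy tails inflate it.
```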
4.2 Asymptotic Distributions

The asymptotic analysis provides a feasible method to study distributions of the projected data.

The task of analyzing the distributions is easy when a normal random matrix R is used. The analysis for other types of random projection distributions is much more difficult (in fact, intractable). To see this, note that each entry v_{1,i} = (1/√k) R_i^T u_1 = (1/√k) Σ_{j=1}^D r_{ji} u_{1,j}. Other than the case r_{ji} ∼ N(0, 1), analyzing v_{1,i} and v_1 exactly is basically impossible, although in some simple cases [1] we can study the bounds of the moments and moment generating functions.

Lemma 4 and Lemma 5 present the asymptotic distributions of v_1 and v_1 − v_2, respectively. Again, Lemma 5 is strictly analogous to Lemma 4.

Lemma 4. As D → ∞,

    v_{1,i} / √(m_1/k)  ⇒_L  N(0, 1),    ‖v_1‖^2 / (m_1/k)  ⇒_L  χ^2_k,    (21)

with the rate of convergence

    |F_{v_{1,i}}(y) − Φ(y)|  ≤  0.8 √s Σ_{j=1}^D |u_{1,j}|^3 / m_1^{3/2}  →  0.8 √(s/D) E|u_{1,j}|^3 / ( E(u_{1,j}^2) )^{3/2}  →  0,    (22)

where ⇒_L denotes "convergence in distribution," F_{v_{1,i}}(y) is the empirical cumulative density function (CDF) of v_{1,i} and Φ(y) is the standard normal N(0, 1) CDF.
Lemma 5. As D → ∞,

    (v_{1,i} − v_{2,i}) / √(d/k)  ⇒_L  N(0, 1),    ‖v_1 − v_2‖^2 / (d/k)  ⇒_L  χ^2_k,    (23)

with the rate of convergence

    |F_{v_{1,i} − v_{2,i}}(y) − Φ(y)|  ≤  0.8 √s Σ_{j=1}^D |u_{1,j} − u_{2,j}|^3 / d^{3/2}  →  0.    (24)

The above two lemmas show that both v_{1,i} and v_{1,i} − v_{2,i} are approximately normal, with the rate of convergence determined by √(s/D), which is O(1/D^{1/4}) when s = √D.

The next lemma concerns the joint distribution of (v_{1,i}, v_{2,i}).

Lemma 6. As D → ∞,

    Σ^{−1/2} (v_{1,i}, v_{2,i})^T  ⇒_L  N( (0, 0)^T, [ 1  0 ;  0  1 ] ),    (25)

and

    Pr( sign(v_{1,i}) = sign(v_{2,i}) )  →  1 − θ/π,    (26)

where

    Σ = (1/k) [ m_1  a ;  a  m_2 ],    θ = cos^{−1}( a / √(m_1 m_2) ).

The asymptotic normality shows that we can use other random projection matrices R to achieve asymptotically the same performance as conventional random projections, which are the easiest to analyze. Since the convergence rate is so fast, we can simply apply results on conventional random projections, such as the JL lemma and sign random projections, when a non-normal projection matrix is used.³

³ In the proof of the asymptotic normality, we used E(|r_{ji}|^3) and E(|r_{ji}|^{2+δ}). They should be replaced by the corresponding moments when other projection distributions are used.

4.3 A Margin-free Estimator

Recall that, because E(v_1^T v_2) = u_1^T u_2, one can estimate a = u_1^T u_2 without bias as â_MF = v_1^T v_2, with the variance

    Var(â_MF) = (1/k) ( m_1 m_2 + a^2 + (s − 3) Σ_{j=1}^D u_{1,j}^2 u_{2,j}^2 ),    (27)

    Var(â_MF)_∞ = (1/k) ( m_1 m_2 + a^2 ),    (28)

where the subscript "MF" indicates "margin-free," i.e., an estimator of a without using margins. Var(â_MF) is the variance of v_1^T v_2 in (19). Ignoring the asymptotically negligible part involving s − 3 leads to Var(â_MF)_∞.

We will compare â_MF with an asymptotic maximum likelihood estimator based on the asymptotic normality.

4.4 An Asymptotic MLE Using Margins

The tractable asymptotic distributions of the projected data allow us to derive more accurate estimators using maximum likelihood.

In many situations, we can assume that the marginal norms m_1 = Σ_{j=1}^D u_{1,j}^2 and m_2 = Σ_{j=1}^D u_{2,j}^2 are known, as m_1 and m_2 can often easily be either exactly calculated or accurately estimated.⁴

The authors' very recent work [28] on conventional random projections shows that if we know the margins m_1 and m_2, we can often estimate a = u_1^T u_2 more accurately using a maximum likelihood estimator (MLE).

The following lemma estimates a = u_1^T u_2, taking advantage of knowing the margins.

Lemma 7. When the margins m_1 and m_2 are known, we can use a maximum likelihood estimator (MLE) to estimate a by maximizing the joint density function of (v_1, v_2). Since (v_{1,i}, v_{2,i}) converges to a bivariate normal, an asymptotic MLE is the solution to a cubic equation

    a^3 − a^2 (v_1^T v_2) + a ( −m_1 m_2 + m_1 ‖v_2‖^2 + m_2 ‖v_1‖^2 ) − m_1 m_2 (v_1^T v_2) = 0.    (29)

The asymptotic variance of this estimator, denoted by â_MLE, is

    Var(â_MLE)_∞ = (1/k) ( m_1 m_2 − a^2 )^2 / ( m_1 m_2 + a^2 )  ≤  Var(â_MF)_∞.    (30)

The ratio Var(â_MLE)_∞ / Var(â_MF)_∞ = (m_1 m_2 − a^2)^2 / (m_1 m_2 + a^2)^2 = (1 − cos^2(θ))^2 / (1 + cos^2(θ))^2 ranges from 0 to 1, indicating possibly substantial improvements. For example, when cos(θ) ≈ 1 (i.e., a^2 ≈ m_1 m_2), the improvement will be huge. When cos(θ) ≈ 0 (i.e., a ≈ 0), we do not benefit from â_MLE. Note that some studies (e.g., duplicate detection) are mainly interested in data points that are quite similar (i.e., cos(θ) close to 1).

⁴ Computing all marginal norms of A costs O(nD), which is often negligible. As important summary statistics, the marginal norms may already be computed during various stages of processing, e.g., normalization and term weighting.
4.5 The Kurtosis of r_{ji}: (s − 3)

We have seen that the parameter s plays an important role in the performance of very sparse random projections. It is interesting that s − 3 is exactly the kurtosis of r_{ji}:

    γ_2(r_{ji}) = E( (r_{ji} − E(r_{ji}))^4 ) / E^2( (r_{ji} − E(r_{ji}))^2 ) − 3 = s − 3,    (31)

as r_{ji} has zero mean and unit variance.⁵

The kurtosis for r_{ji} ∼ N(0, 1) is zero. If one is only interested in smaller estimation variances (ignoring the benefit of sparsity), one may choose a distribution of r_{ji} with negative kurtosis. A couple of examples are:

• A continuous uniform distribution on [−l, l] for any l > 0. Its kurtosis is −6/5.

• A discrete uniform distribution symmetric about zero, with N points. Its kurtosis is −(6/5)(N^2 + 1)/(N^2 − 1), ranging between −2 (when N = 2) and −6/5 (when N → ∞). The case with N = 2 is the same as (2) with s = 1.

• Discrete and continuous U-shaped distributions.

⁵ Note that the kurtosis cannot be smaller than −2 because of the Cauchy-Schwarz inequality: E^2(r_{ji}^2) ≤ E(r_{ji}^4). One may consult http://en.wikipedia.org/wiki/Kurtosis for references on the kurtosis of various distributions.
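A quick numerical check of (31) and of the continuous-uniform example above (our own illustration; the sample size is arbitrary):

```python
import numpy as np

def excess_kurtosis(x):
    x = x - x.mean()
    return np.mean(x**4) / np.mean(x**2) ** 2 - 3.0   # Eq. (31)

rng = np.random.default_rng(5)
n, s = 10**7, 256

sparse = np.sqrt(s) * rng.choice([1.0, 0.0, -1.0], size=n,
                                 p=[1 / (2 * s), 1 - 1 / s, 1 / (2 * s)])
uniform = rng.uniform(-1.0, 1.0, size=n)

print(excess_kurtosis(sparse), s - 3)      # ~ 253
print(excess_kurtosis(uniform))            # ~ -1.2, i.e., -6/5
```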
5. HEAVY-TAIL AND TERM WEIGHTING

Very sparse random projections are useful even for heavy-tailed data, mainly because of term weighting.

We have seen that bounded fourth and third moments are needed for analyzing the convergence of the moments (variances) and the convergence to normality, respectively. The proof of asymptotic normality in Appendix A suggests that we need only slightly more than bounded second moments to ensure asymptotic normality. In heavy-tailed data, however, even the second moment may not exist.

Heavy-tailed data are ubiquitous in large-scale data mining applications (especially Internet data) [25, 34]. The pairwise distances computed from heavy-tailed data are usually dominated by "outliers," i.e., exceptionally large entries.

Pairwise vector distances are meaningful only when all dimensions of the data are more or less equally important. For heavy-tailed data, such as the (unweighted) term-by-document matrix, pairwise distances may be misleading. Therefore, in practice, various term weighting schemes have been proposed, e.g., [33, Chapter 15.2] [10, 30, 39, 45], to weight the entries instead of using the original data.

It is well known that choosing an appropriate term weighting method is vital. For example, as shown in [23, 26], in text categorization using support vector machines (SVM), choosing an appropriate term weighting scheme is far more important than tuning the kernel functions of the SVM. See similar comments in [37] for the work on the naive Bayes text classifier.

We list two popular and simple weighting schemes. One variant of the logarithmic weighting keeps zero entries and replaces any non-zero count with 1 + log(original count). Another scheme is the square root weighting. In the same spirit as the Box-Cox transformation [44, Chapter 6.8], these weighting schemes significantly reduce the kurtosis (and skewness) of the data and make the data resemble normal. (A short sketch of both schemes appears at the end of this section.)

Therefore, it is fair to say that assuming finite moments (third or fourth) is reasonable whenever the computed distances are meaningful.

However, there are also applications in which pairwise distances do not have to bear any clear meaning, for example, using random projections to estimate joint sizes (set intersections). If we expect the original data to be severely heavy-tailed and no term weighting will be applied, we recommend using s = O(1).

Finally, we shall point out that very sparse random projections can be fairly robust against heavy-tailed data when s = √D. For example, instead of assuming finite fourth moments, as long as D Σ_{j=1}^D u_{1,j}^4 / ( Σ_{j=1}^D u_{1,j}^2 )^2 grows slower than O(√D), we can still achieve the convergence of variances in Lemma 1 when s = √D. Similarly, analyzing the rate of convergence to normality only requires that √D Σ_{j=1}^D |u_{1,j}|^3 / ( Σ_{j=1}^D u_{1,j}^2 )^{3/2} grows slower than O(D^{1/4}). An even weaker condition is needed to only ensure asymptotic normality. We provide some additional analysis on heavy-tailed data in Appendix B.
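A sketch of the two weighting schemes on a synthetic heavy-tailed count vector (a stand-in for a term-by-document row; the Zipf parameters and sparsity level are arbitrary), showing how strongly they reduce the empirical kurtosis:

```python
import numpy as np

def sqrt_weight(counts):
    return np.sqrt(counts)                                   # "square root weighting"

def log_weight(counts):
    out = counts.astype(float).copy()
    nz = out > 0
    out[nz] = 1.0 + np.log(out[nz])                          # keep zeros, 1 + log(count) otherwise
    return out

def excess_kurtosis(x):
    x = x - x.mean()
    return np.mean(x**4) / np.mean(x**2) ** 2 - 3.0

rng = np.random.default_rng(6)
counts = rng.zipf(2.5, size=65536).astype(float)             # heavy-tailed synthetic counts
counts[rng.random(counts.size) < 0.9] = 0.0                  # mostly zeros, like a term vector

for name, x in [("raw", counts), ("sqrt", sqrt_weight(counts)), ("log", log_weight(counts))]:
    print(name, excess_kurtosis(x))                          # weighting sharply reduces the kurtosis
```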

6. EXPERIMENTAL RESULTS

Some experimental results are presented as a sanity check, using one pair of words, "THIS" and "HAVE," from two rows of a term-by-document matrix provided by MSN, with D = 2^16 = 65536. That is, u_{1,j} (u_{2,j}) is the number of occurrences of the word "THIS" (the word "HAVE") in the jth document (Web page), j = 1 to D. Some summary statistics are listed in Table 1.

The data are certainly heavy-tailed, as the kurtoses for u_{1,j} and u_{2,j} are 195 and 215, respectively, far above zero. Therefore we do not expect very sparse random projections with s = D/log D ≈ 6000 to work well, though the results are actually not disastrous, as shown in Figure 1(d).

Table 1: Some summary statistics of the word pair "THIS" (u_1) and "HAVE" (u_2). γ_2 denotes the kurtosis. η(u_{1,j}, u_{2,j}) = E(u_{1,j}^2 u_{2,j}^2) / ( E(u_{1,j}^2) E(u_{2,j}^2) + E^2(u_{1,j} u_{2,j}) ), which affects the convergence of Var(v_1^T v_2) (see the proof of Lemma 3). These expectations are computed empirically from the data. Two popular term weighting schemes are applied. The "square root weighting" replaces u_{1,j} with √u_{1,j} and the "logarithmic weighting" replaces any non-zero u_{1,j} with 1 + log u_{1,j}.

                                          Unweighted   Square root   Logarithmic
    γ_2(u_{1,j})                             195.1        13.03          1.58
    γ_2(u_{2,j})                             214.7        17.05          4.15
    E(u_{1,j}^4) / E^2(u_{1,j}^2)            180.2        12.97          5.31
    E(u_{2,j}^4) / E^2(u_{2,j}^2)            205.4        18.43          8.21
    η(u_{1,j}, u_{2,j})                       78.0         7.62          3.34
    cos(θ(u_1, u_2))                         0.794        0.782         0.754

We first test random projections on the original (unweighted, heavy-tailed) data, for s = 1, 3, 256 = √D and 6000 ≈ D/log D, presented in Figure 1. We then apply square root weighting and logarithmic weighting before random projections. The results are presented in Figure 2, for s = 256 and s = 6000. These results are consistent with what we would expect:

• When s is small, i.e., O(1), sparse random projections perform very similarly to conventional random projections, as shown in panels (a) and (b) of Figure 1.

• With increasing s, the variances of sparse random projections increase. With s = D/log D, the errors are large (but not disastrous), because the data are heavy-tailed. With s = √D, sparse random projections are robust.

• Since cos(θ(u_1, u_2)) ≈ 0.7–0.8 in this case, marginal information can improve the estimation accuracy quite substantially. The asymptotic variances of â_MLE match the empirical variances of the asymptotic MLE estimator quite well, even for s = √D.

• After applying term weighting to the original data, sparse random projections are almost as accurate as conventional random projections, even for s ≈ D/log D, as shown in Figure 2.
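A scaled-down sketch of the kind of simulation summarized in Figures 1 and 2: project a fixed synthetic pair many times with the sparse R of (2) and record the normalized standard error of â_MF. The data, sizes, and trial counts here are far smaller than in the paper's experiment and are our own stand-ins:

```python
import numpy as np

rng = np.random.default_rng(7)
D, k, trials = 1024, 20, 1000

u1 = rng.standard_normal(D) ** 2              # nonnegative, mildly skewed stand-in data
u2 = 0.7 * u1 + 0.3 * rng.standard_normal(D) ** 2
a = u1 @ u2

def sparse_R(D, k, s):
    return np.sqrt(s) * rng.choice([1.0, 0.0, -1.0], size=(D, k),
                                   p=[1 / (2 * s), 1 - 1 / s, 1 / (2 * s)])

for s in (1, 3, int(np.sqrt(D))):
    est = np.empty(trials)
    for t in range(trials):
        R = sparse_R(D, k, s)
        est[t] = (R.T @ u1 / np.sqrt(k)) @ (R.T @ u2 / np.sqrt(k))
    print(s, est.std() / a)                   # normalized standard error, as plotted in Figure 1
```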
7. CONCLUSION

We provide some new theoretical results on random projections, a randomized approximate algorithm widely used in machine learning and data mining. In particular, our theoretical results suggest that we can achieve a significant √D-fold speedup in processing time with little loss in accuracy, where D is the original data dimension. When the data are free of "outliers" (e.g., after careful term weighting), a cost reduction by a factor of D/log D is also possible.

Our proof of the asymptotic normality justifies the use of an asymptotic maximum likelihood estimator for improving the estimates when the marginal information is available.
[Figures 1 and 2 appear here. Each figure has four panels plotting standard errors against k (from 10 to 100), with curves labeled "MF," "MLE," "Theor. MF," and "Theor. ∞." Figure 1 panels: (a) s = 1, (b) s = 3, (c) s = 256, (d) s = 6000. Figure 2 panels: (a) square root weighting, s = 256; (b) logarithmic weighting, s = 256; (c) square root weighting, s = 6000; (d) logarithmic weighting, s = 6000. The full captions follow.]
Figure 1: Two words "THIS" (u_1) and "HAVE" (u_2) from the MSN Web crawl data are tested. D = 2^16. Sparse random projections are applied to estimate a = u_1^T u_2, with four values of s: 1, 3, 256 = √D and 6000 ≈ D/log D, in panels (a), (b), (c) and (d), respectively, presented in terms of the normalized standard error, √Var(â)/a. 10^4 simulations are conducted for each k, ranging from 10 to 100. There are five curves in each panel. The two labeled "MF" and "Theor. MF" overlap. "MF" stands for the empirical variance of the "margin-free" estimator â_MF, while "Theor. MF" stands for the theoretical variance of â_MF, i.e., (27). The solid curve, labeled "MLE," presents the empirical variance of â_MLE, the estimator using margins as formulated in Lemma 7. There are two curves both labeled "Theor. ∞," for the asymptotic theoretical variances of â_MF (the higher curve, (28)) and â_MLE (the lower curve, (30)).

Figure 2: After applying term weighting to the original data, sparse random projections are almost as accurate as conventional random projections, even for s = 6000 ≈ D/log D. The legends are the same as in Figure 1.

8. ACKNOWLEDGMENT

We thank Dimitris Achlioptas for very insightful comments. We thank Xavier Gabaix and David Mason for pointers to useful references. Ping Li thanks Tze Leung Lai, Joseph P. Romano, and Yiyuan She for enjoyable and helpful conversations. Finally, we thank the four anonymous reviewers for constructive suggestions.

9. REFERENCES

[1] Dimitris Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671–687, 2003.
[2] Dimitris Achlioptas, Frank McSherry, and Bernhard Schölkopf. Sampling techniques for kernel methods. In Proc. of NIPS, pages 335–342, Vancouver, BC, Canada, 2001.
[3] Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. In Proc. of STOC, pages 20–29, Philadelphia, PA, 1996.
[4] Rosa Arriaga and Santosh Vempala. An algorithmic theory of learning: Robust concepts and random projection. In Proc. of FOCS (also to appear in Machine Learning), pages 616–623, New York, 1999.
[5] Ella Bingham and Heikki Mannila. Random projection in dimensionality reduction: Applications to image and text data. In Proc. of KDD, pages 245–250, San Francisco, CA, 2001.
[6] Jeremy Buhler and Martin Tompa. Finding motifs using random projections. Journal of Computational Biology, 9(2):225–242, 2002.
[7] Moses S. Charikar. Similarity estimation techniques from rounding algorithms. In Proc. of STOC, pages 380–388, Montreal, Quebec, Canada, 2002.
[8] G. P. Chistyakov and F. Götze. Limit distributions of studentized means. The Annals of Probability, 32(1A):28–77, 2004.
[9] Sanjoy Dasgupta and Anupam Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures and Algorithms, 22(1):60–65, 2003.
[10] Susan T. Dumais. Improving the retrieval of information from external sources. Behavior Research Methods, Instruments and Computers, 23(2):229–236, 1991.
[11] Richard Durrett. Probability: Theory and Examples. Duxbury Press, Belmont, CA, second edition, 1995.
[12] William Feller. An Introduction to Probability Theory and Its Applications (Volume II). John Wiley & Sons, New York, NY, second edition, 1971.
[13] Xiaoli Zhang Fern and Carla E. Brodley. Random projection for high dimensional data clustering: A cluster ensemble approach. In Proc. of ICML, pages 186–193, Washington, DC, 2003.
[14] Dmitriy Fradkin and David Madigan. Experiments with random projections for machine learning. In Proc. of KDD, pages 517–522, Washington, DC, 2003.
[15] P. Frankl and H. Maehara. The Johnson-Lindenstrauss lemma and the sphericity of some graphs. Journal of Combinatorial Theory A, 44(3):355–362, 1987.
[16] Navin Goel, George Bebis, and Ara Nefian. Face recognition experiments with random projection. In Proc. of SPIE, pages 426–437, Bellingham, WA, 2005.
[17] Michel X. Goemans and David P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of ACM, 42(6):1115–1145, 1995.
[18] F. Götze. On the rate of convergence in the multivariate CLT. The Annals of Probability, 19(2):724–739, 1991.
[19] Piotr Indyk. Stable distributions, pseudorandom generators, embeddings and data stream computation. In FOCS, pages 189–197, Redondo Beach, CA, 2000.
[20] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proc. of STOC, pages 604–613, Dallas, TX, 1998.
[21] W. B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mapping into Hilbert space. Contemporary Mathematics, 26:189–206, 1984.
[22] Samuel Kaski. Dimensionality reduction by random mapping: Fast similarity computation for clustering. In Proc. of IJCNN, pages 413–418, Piscataway, NJ, 1998.
[23] Man Lan, Chew Lim Tan, Hwee-Boon Low, and Sam Yuan Sung. A comprehensive comparative study on term weighting schemes for text categorization with support vector machines. In Proc. of WWW, pages 1032–1033, Chiba, Japan, 2005.
[24] Erich L. Lehmann and George Casella. Theory of Point Estimation. Springer, New York, NY, second edition, 1998.
[25] Will E. Leland, Murad S. Taqqu, Walter Willinger, and Daniel V. Wilson. On the self-similar nature of Ethernet traffic. IEEE/ACM Trans. Networking, 2(1):1–15, 1994.
[26] Edda Leopold and Jorg Kindermann. Text categorization with support vector machines. How to represent texts in input space? Machine Learning, 46(1-3):423–444, 2002.
[27] Henry C.M. Leung, Francis Y.L. Chin, S.M. Yiu, Roni Rosenfeld, and W.W. Tsang. Finding motifs with insufficient number of strong binding sites. Journal of Computational Biology, 12(6):686–701, 2005.
[28] Ping Li, Trevor J. Hastie, and Kenneth W. Church. Improving random projections using marginal information. In Proc. of COLT, Pittsburgh, PA, 2006.
[29] Jessica Lin and Dimitrios Gunopulos. Dimensionality reduction by random projection and latent semantic indexing. In Proc. of SDM, San Francisco, CA, 2003.
[30] Bing Liu, Yiming Ma, and Philip S. Yu. Discovering unexpected information from your competitors' web sites. In Proc. of KDD, pages 144–153, San Francisco, CA, 2001.
[31] Kun Liu, Hillol Kargupta, and Jessica Ryan. Random projection-based multiplicative data perturbation for privacy preserving distributed data mining. IEEE Transactions on Knowledge and Data Engineering, 18(1):92–106, 2006.
[32] B. F. Logan, C. L. Mallows, S. O. Rice, and L. A. Shepp. Limit distributions of self-normalized sums. The Annals of Probability, 1(5):788–809, 1973.
[33] Chris D. Manning and Hinrich Schutze. Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, MA, 1999.
[34] M. E. J. Newman. Power laws, Pareto distributions and Zipf's law. Contemporary Physics, 46(5):232–351, 2005.
[35] Christos H. Papadimitriou, Prabhakar Raghavan, Hisao Tamaki, and Santosh Vempala. Latent semantic indexing: A probabilistic analysis. In Proc. of PODS, pages 159–168, Seattle, WA, 1998.
[36] Deepak Ravichandran, Patrick Pantel, and Eduard Hovy. Randomized algorithms and NLP: Using locality sensitive hash function for high speed noun clustering. In Proc. of ACL, pages 622–629, Ann Arbor, MI, 2005.
[37] Jason D. Rennie, Lawrence Shih, Jaime Teevan, and David R. Karger. Tackling the poor assumptions of naive Bayes text classifiers. In Proc. of ICML, pages 616–623, Washington, DC, 2003.
[38] Ozgur D. Sahin, Aziz Gulbeden, Fatih Emekçi, Divyakant Agrawal, and Amr El Abbadi. Prism: indexing multi-dimensional data in p2p networks using reference vectors. In Proc. of ACM Multimedia, pages 946–955, Singapore, 2005.
[39] Gerard Salton and Chris Buckley. Term-weighting approaches in automatic text retrieval. Inf. Process. Manage., 24(5):513–523, 1988.
[40] I. S. Shiganov. Refinement of the upper bound of the constant in the central limit theorem. Journal of Mathematical Sciences, 35(3):2545–2550, 1986.
[41] Chunqiang Tang, Sandhya Dwarkadas, and Zhichen Xu. On scaling latent semantic indexing for large peer-to-peer systems. In Proc. of SIGIR, pages 112–121, Sheffield, UK, 2004.
[42] Santosh Vempala. Random projection: A new approach to VLSI layout. In Proc. of FOCS, pages 389–395, Palo Alto, CA, 1998.
[43] Santosh Vempala. The Random Projection Method. American Mathematical Society, Providence, RI, 2004.
[44] William N. Venables and Brian D. Ripley. Modern Applied Statistics with S. Springer-Verlag, New York, NY, fourth edition, 2002.
[45] Clement T. Yu, K. Lam, and Gerard Salton. Term weighting in information retrieval using the term precision model. Journal of ACM, 29(1):152–170, 1982.

APPENDIX

A. PROOFS

Let {u_i}_{i=1}^n denote the rows of the data matrix A ∈ R^{n×D}. A projection matrix R ∈ R^{D×k} consists of i.i.d. entries r_{ji}:

    Pr(r_{ji} = √s) = Pr(r_{ji} = −√s) = 1/(2s),    Pr(r_{ji} = 0) = 1 − 1/s,

    E(r_{ji}) = 0,    E(r_{ji}^2) = 1,    E(r_{ji}^4) = s,    E(|r_{ji}|^3) = √s,

    E(r_{ji} r_{j'i'}) = 0,    E(r_{ji}^2 r_{j'i'}) = 0    when i ≠ i' or j ≠ j'.

We denote the projected data vectors by v_i = (1/√k) R^T u_i. For convenience, we denote

    m_1 = ‖u_1‖^2 = Σ_{j=1}^D u_{1,j}^2,    m_2 = ‖u_2‖^2 = Σ_{j=1}^D u_{2,j}^2,

    a = u_1^T u_2 = Σ_{j=1}^D u_{1,j} u_{2,j},    d = ‖u_1 − u_2‖^2 = m_1 + m_2 − 2a.

We will always assume

    s = o(D),    E(u_{1,j}^4) < ∞,    E(u_{2,j}^4) < ∞    (⇒ E(u_{1,j}^2 u_{2,j}^2) < ∞).

By the strong law of large numbers,

    (1/D) Σ_{j=1}^D u_{1,j}^I → E(u_{1,j}^I),    (1/D) Σ_{j=1}^D (u_{1,j} − u_{2,j})^I → E((u_{1,j} − u_{2,j})^I),

    (1/D) Σ_{j=1}^D (u_{1,j} u_{2,j})^J → E((u_{1,j} u_{2,j})^J),    a.s.,    I = 2, 4,    J = 1, 2.
A.1 Moments As D → ∞,
The following expansions are useful for proving the next PD
(u1,j )4
PD
three lemmas. (s − 3) j=1 s−3 j=1 (u1,j )4 /D
=
D
X D
X D
X D
X m21 D m21 /D2
m1 m2 = u21,j u22,j = u21,j u22,j + u21,j u22,j 0 , o(D) E (u1,j )4
j=1 j=1 j=1 → → 0.
j6=j 0 D E2 (u1,j )2
D
!2 D
X X X
m21 = u21,j = u41,j + 2 u21,j u21,j 0 ,
j=1 j=1 j<j 0

D
X
!2 D
X X Lemma 2.
2
a = u1,j u2,j = u21,j u22,j + 2 u1,j u2,j u1,j 0 u2,j 0 .
E kv1 − v2 k2 = ku1 − u2 k2 = d,
` ´
j=1 j=1 j<j 0
D
!
Lemma 1. ` 2´ 1 2
X 4
Var kv1 − v2 k = 2d + (s − 3) (u1,j − u2,j ) .
k
E kv1 k2 = ku1 k2 = m1 ,
` ´
j=1

D
!
2´ 1 X As D → ∞,
2m21 u41,j
`
Var kv1 k = + (s − 3) .
k j=1
(s − 3) D 4
P
j=1 (u1,j − u2,j ) s − 3 E (u1,j − u2,j )4
As D → ∞, → →0
d2 D E2 (u1,j − u2,j )2
PD 4 4
(s − 3) j=1 (u1,j ) s − 3 E (u1,j )
→ → 0.
m21 D E2 (u1,j )2
Proof of Lemma 2. The proof is analogous to the proof
Proof of Lemma 1 . v1 = √1 RT u1 ,
k
Let Ri be the ith of Lemma 1.
column of R, 1 ≤ i ≤ k. WePcan write the ith element of v1
D
to be v1,i = √1k RT 1
i u1 = √k j=1 (rji ) u1,j . Therefore,
Lemma 3.
“ ”
E v1T v2 = uT
0 1
D 1 u2 = a,
2 1 @X ` 2 ´ 2 X
v1,i = r u +2 (rji ) u1,j (rj 0 i ) u1,j 0 A ,
k j=1 ji 1,j D
!
“ ” 1 2
X
j<j
v1T v2 u21,j u22,j
0
Var = m1 m2 + a + (s − 3) .
k
from which it follows that j=1

D D
` 2 ´ 1X 2 ´ X As D → ∞,
E kv1 k2 = u21,j = m1 .
`
E v1,i = u1,j ,
k j=1 PD
j=1 (s − 3) j=1 u21,j u22,j
0 12 m1 m2 + a 2
D
E u2 u22,j
` ´
4 1 X 2
X s−3
u21,j + 2 ` 2 ´ ` 2 1,j
` ´
v1,i = 2@ rji (rji ) u1,j (rj 0 i ) u1,j 0 A → → 0.
k D E u1,j E u2,j + E2 (u1,j u2,j )
´
j=1 j<j 0

0 PD ` 4 ´ 4 P ` 2´ 2 ` 2 ´ 2 1
j=1 r ji u 1,j + 2 j<j 0 rji ”u1,j rj 0 i u1,j 0
2 Proof of Lemma 3.
1 B
“ P C
= 2B +4 j<j 0 (rji ) u1,j (rj 0 i ) u1,j 0 C, 0 1
k @ “P ` 2 ´ 2 ” “P ” A D
+4 D 1 @X ` 2 ´
j=1 rji u1,j j<j 0 (rji ) u1,j (rj i ) u1,j
X
0 0
v1,i v2,i = rji u1,j u2,j + (rji ) u1,j (rj 0 i ) u2,j 0 A ,
k j=1 0 j6=j
from which it follows that
0 1
D
` 4 ´ 1 @ X 4 X 2 2
D
E v1,i = 2 s u1,j + 6 u1,j u1,j 0 A , 1X “ ”
k j=1 =⇒ E (v1,i v2,i ) = u1,j u2,j , E v1T v2 = a.
j<j 0
0 k j=1
D D
!2 1
` 2 ´ 1 @ X 4 X 2 2 X
Var v1,i = 2 s u1,j + 6 u1,j u1,j 0 − u21,j A
k j=1 j=1 2 2
j<j 0 v1,i v2,i
0 1 12
D 0
1 @ X X D
= 2 (s − 1) u41,j + 4 u21,j u21,j 0 A 1 @X ` 2 ´
= 2 rji u1,j u2,j +
X
(rji ) u1,j (rj 0 i ) u2,j 0 A
k j=1 k
j<j 0
j=1 j6=j 0
D
!
1 0 PD ` 4 ´ 2 2
r ` 2 ´1,j 2,j + ` 2 ´
u u
X 1
2 4
= 2 2m1 + (s − 3) u1,j , j=1 ji
k
P
j=1
B 2 rji u1,j u2,j rj 0 i u1,j 0 u2,j 0 + C
1 B “P j<j 0
”2 C
= 2B C,
@ “ j6=j 0 (rji ) u1,j (rj ”i )“u2,j +
D
!
1 X k B 0 0 C
Var kv1 k2 = 2m21 + (s − 3) u41,j .
` ´ ” A
PD ` 2 ´
k
P
j=1 rji u1,j u2,j j6=j 0 (rji ) u1,j (rj i ) u2,j
0 0
j=1
=⇒ which immediately leads to
2 2 k
` ´
E v1,i v2,i 2 2
kv1 k2
„ «
v1,i L
X v1,i L
=⇒ χ21 , = =⇒ χ2k .
m1 /k m1 /k m1 /k
0 1
D
1 @ X 2 2 X X 2 2 i=1
= 2 s u1,j u2,j + 4 u1,j u2,j u1,j 0 u2,j 0 + u1,j u2,j 0 A
k j=1
We need to go back and check the Lindeberg condition.
j<j 0 j6=j 0
D D
|zj |2+δ
0 1 „ «
D 1 X ` 2 ´ 1 X
1 @ X 2 2
X 2 2 2A E z j ; |z j | ≥ s D ≤ E
= 2 (s − 2) u1,j u2,j + u1,j u2,j 0 + 2a s2D j=1 s2D j=1 (sD )δ
k j=1 j6=j 0 P D 2+δ
j=1 |u1,j | /D
! “ s ”δ 1
D 2
1 X 2 2 2 =
 δ PD
”(2+δ)/2
= 2 m1 m2 + (s − 3) u1,j u2,j + 2a , D

2
k j=1 u1,j /D
j=1
«δ
E|u1,j |2+δ

o(D) 2 1
D
! → δ
→ 0,
1 D  E(u2 ) (2+δ)/2
X ` ´
Var (v1,i v2,i ) = m1 m2 + a2 + (s − 3) u21,j u22,j , 1,j
k2 j=1
provided E|u1,j |2+δ < ∞, for some δ > 0, which is much
! weaker than our assumption that E(u41,j ) < ∞.
D It remains to show the rate of convergence using the Berry-
“ ” 1 X
Var v1T v2 = m1 m2 + a2 + (s − 3) u21,j u22,j . Esseen theorem. Let ρD = D
P 3 s1/2
PD 3
k j=1 E|zj | = k3/2 j=1 |u1,j |
j=1
PD 3
ρD √ j=1 |u1,j |
|Fv1,i (y) − Φ(y)| ≤ 0.8 3 = 0.8 s 3/2
sD m1
A.2 Asymptotic Distributions
E|u1,j |3
r
s
→ 0.8 ` ` 2 ´´3/2 → 0.
D E u
Lemma 4. As D → ∞, 1,j

v1,i L kv1 k2 L 2
p =⇒ N (0, 1), =⇒ χk ,
m1 /k m1 /k Lemma 5. As D → ∞,
with the rate of convergence v1,i − v2,i L kv1 − v2 k2 L 2
p =⇒ N (0, 1), =⇒ χk ,
PD d/k d/k
√ j=1 |u1,j |3
|Fv1,i (y) − Φ(y)| ≤ 0.8 s 3/2 with the rate of convergence
m1
PD 3
E|u1,j |3 √ j=1 |u1,j − u2,j |
r
s
→ 0.8 ` ` 2 ´´3/2 → 0, |Fv1,i −v2,i (y) − Φ(y)| ≤ 0.8 s
D E u d3/2
1,j
s E|u1,j − u2,j |3
r
L → 0.8 → 0.
where =⇒ denotes “convergence in distribution,” Fv1,i (y) is D E 32 (u1,j − u2,j )2
the empirical cumulative density function (CDF) of v1,i and
Φ(y) is the standard normal N (0, 1) CDF.
Proof of Lemma 5. The proof is analogous to the proof
Proof of Lemma 4. The Lindeberg central limit theo- of Lemma 4.
rem (CLT) and the Berry-Esseen theorem are needed for The next lemma concerns the joint distribution of (v1,i , v2,i ).
the proof [12, Theorems VIII.4.3 and XVI.5.2].6
PD 1 PD
Write v1,i = √1k RT i u1 = j=1
√ (rji ) u1,j =
k j=1 zj ,
Lemma 6. As D → ∞,
with zj = √1k (rji ) u1,j . Then
» – „» – » –« » –
1 v1,i L 0 1 0 1 m1 a
Σ− 2 =⇒ N , , Σ=
v2,i 0 0 1 k a m2
u21,j δ |u1,j |
2+δ
E(zj ) = 0, Var(zj ) = , E(|zj |2+δ ) = s 2 (2+δ)/2 , ∀δ > 0. and
k k „ «
θ a
, θ = cos−1
PD 2
Let s2D
=
PD j=1 u1,j m1 Pr (sign(v1,i ) = sign(v2,i )) → 1 − √ .
j=1 Var(zj ) = k
= k
. Assume the π m 1 m2
Lindeberg condition
D m1
1 X ` 2 ´ Proof of Lemma 6. We have seen that Var (v1,i ) = k
,
2
E zj ; |zj | ≥ sD → 0, for any  > 0. Var (v2,i ) = mk2 , E (v1,i v2,i ) = ka , i.e.,
sD j=1
„» –« » –
v1,i 1 m1 a
Then cov = = Σ.
v2,i k a m2
PD
j=1 zj v1,i L The Lindeberg multivariate central limit theorem [18] says
= p =⇒ N (0, 1),
sD m1 /k » – „» – » –«
1 v1,i L 0 1 0
Σ− 2 =⇒ N , .
6
The best Berry-Esseen constant 0.7915 (≈ 0.8) is from [40]. v2,j 0 0 1
The multivariate Lindeberg condition is automatically satis- B. HEAVY-TAILED DATA
fied by assuming bounded third moments of u1,j and u2,j . A We illustrate that very sparse random projections are fairly
trivial consequence of the asymptotic normality yields robust against heavy-tailed data, by a Pareto distribution.
θ The assumption of finite moments has simplified the anal-
Pr (sign(v1,i ) = sign(v2,i )) → 1 − . ysis of convergence a great deal. For example, assuming
π
(δ + 2)th moment, 0 < δ ≤ 2 and s = o(D), we have
„ « PD 2+δ “ s ”δ/2 PD |u1,j |2+δ /D
j=1 |u1,j |
E(u1,j u2,j )
Strictly speaking, we should write θ = cos−1 q 2 . (s) δ/2
=
j=1
E(u1,j )E(u2 2,j )
“P ”1+δ/2
D
“P ”1+δ/2
D 2 D 2
j=1 (u1,j ) j=1 (u1,j )/D
A.3 An Asymptotic MLE Using Margins “ s ”δ/2 E u2+δ
` ´
1,j
→ ` ` 2 ´´1+δ/2 → 0. (33)
Lemma 7. Assuming that the margins, m1 and m2 are D E u1,j
known and using the asymptotic normality of (v1,i , v2,i ), we
can derive an asymptotic maximum likelihood estimator (MLE), Note that δ = 2 corresponds to the rate of convergence
which is the solution to a cubic equation for the variance in Lemma 1, and δ = 1 corresponds to the
“ ” rate of convergence for asymptotic normality in Lemma 4.
a3 − a2 v1T v2 + a −m1 m2 + m1 kv2 k2 + m2 kv1 k2
` ´
From the proof of Lemma 4 in Appendix A, we can see that
T the convergence of (33) (to zero) with any δ > 0 suffices for
− m1 m2 v1 v2 = 0,
achieving asymptotic normality.
Denoted by âM LE , the asymptotic variance of this estima- For heavy-tailed data, the fourth moment (or even the
tor is second moment) may not exist. The most common model for
` 2 2
´ heavy-tailed data is the Pareto distribution with the density
1 m1 m2 − a
Var (âM LE )∞ = . function7 f (x; α) = xα+1 α
, whose mth moment = α−m α
, only
k m1 m2 + a 2 defined if α > m. The measurements of α for many types of
Proof of Lemma 7. For notational convenience, we treat data are available in [34]. For example, α = 1.2 for the word
(v1,i , v2,i ) as exactly normally distributed so that we do not frequency, α = 2.04 for the citations to papers, α = 2.51 for
need to keep track of the “convergence” notation. the copies of books sold in the US, etc.
The likelihood function of {v1,i , v2,i }ki=1 is then For simplicity, we assume that 2 < α ≤ 2 + δ ≤ 4. Un-
“ ” k k der this assumption, the asymptotic normality is guaranteed
lik {v1,i , v2,i }ki=1 = (2π)− 2 |Σ|− 2 × and it remains to show the rate of convergence of moments
and distributions. In this case, the second moment E u21,j
` ´
k » –!
1 Xˆ ˜ −1 v1,i PD “ ”
exp − v1,i v2,i Σ . exists. The sum j=1 |u1,j |
2+δ
grows as O D(2+δ)/α as
2 i=1 v2,i
shown in [11, Example 2.7.4].8 Thus, we can write
where
» –
1 m1 a PD 2+δ
j=1 |u1,j | sδ/2
„ «
Σ= . δ/2
k a m2 s “ = O
1+δ/2 2+δ
D1+δ/2− α
PD ”
2
We can then express the log likelihood function, l(a), as j=1 (u1,j )
8 “ ”

k
” k ` 2´ < O s
δ=2
log lik {v1,i , v2,i }i=1 ∝ l(a) = − log m1 m2 − a − “D
2−4/α
2 = ”1/2 , (34)
s
k
: O D3−6/α δ=1
k 1 X ` 2 2
´
v m2 − 2v1,i v2,i a + v2,i m1 ,
2 m1 m2 − a2 i=1 1,i from which we can choose s using prior √ knowledge of α.
For example, suppose α = 3 and s = D. (34) indicates
The MLE equation is the solution to l 0 (a) = 0, which is that the rate of convergence for variances would be O(D 1/12 )
in terms of the standard error. (34) also verifies that the rate
“ ”
a3 − a2 v1T v2 + a −m1 m2 + m1 kv2 k2 + m2 kv1 k2
` ´
of convergence to normality is O(D 1/4 ), as expected.
T
− m 1 m2 v 1 v 2 = 0 Of course, we could always choose s more conservatively,
e.g., s = D1/4 , if we know the data are severely heavy-tailed.
The large sample theory [24, Theorem 6.3.10] says that
Since D is large, a factor of D 1/4 is still considerable.
âM LE is asymptotically unbiased and “ converges ” in distribu-
1
What if α < 2? The second moment no longer exists.
tion to a normal random variable N a, I(a) , where I(a), The analysis will involve the so-called self-normalizing sums
the expected Fisher Information, is [8, 32]; but we will not delve into this topic. In fact, it is
2 not really meaningful to compute the l2 distances when the
` 00 ´ m1 m2 + a
I(a) = −E l (a) = k , data do not even have bounded second moment.
(m1 m2 − a2 )2
after some algebra. 7
Note that in general, a Pareto distribution has an addition
Therefore, the asymptotic variance of âM LE would be parameter xmin , and f (x; α, xmin ) = αx min
with x ≥ xmin .
xα+1
2 2 Since we are only interested in the relative ratio of moments,
` ´
1 m1 m2 − a
Var (âM LE )∞ = . (32) we can without loss of generality assume xmin = 1. Also
k m1 m2 + a 2 note that in [34], their “α” is equal to our α + 1.
8
Note that if x ∼ Pareto(α), then xt ∼ Pareto(α/t).
