The JL Transform and MinHash
The curse of dimensionality appears to be fundamental to the nearest neighbor problem
(and many other geometric problems), and is not an artifact of the specific solution of the k-d
tree. For further intuition, consider the nearest neighbor problem in one dimension, where
the point set P and query q are simply points on the line. The natural way to preprocess
P is to sort the points by value; searching for the nearest neighbor is just binary search to
find the interval in which q lies, and then computing q’s distance to the points immediately
to the left and right of q. What about two dimensions? It’s no longer clear what “sorting
the point set” means, but let’s assume that we figure that out. (The k-d tree offers one
approach.) Intuitively, there are now four directions that we have to check for a possible
nearest neighbor. In three dimensions, there are eight directions to check, and in general
the number of relevant directions scales exponentially with k. There is no known way
to overcome the curse of dimensionality in the nearest neighbor problem, without resorting
to approximation (as we do below): every known solution that uses a reasonable amount of
space uses time that scales exponentially in k or linearly in the number n of points.
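To make the one-dimensional case concrete, here is a minimal sketch of the sort-and-binary-search approach described above (the function and variable names are ours, not from the notes):

```python
import bisect

def nearest_neighbor_1d(sorted_points, q):
    """Return the point of sorted_points closest to q.
    Preprocessing: sorted_points must already be sorted."""
    i = bisect.bisect_left(sorted_points, q)  # the interval in which q lies
    # Only the points immediately to the left and right of q can be closest.
    candidates = sorted_points[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda p: abs(p - q))

P = sorted([3.0, -1.5, 7.2, 4.4])
print(nearest_neighbor_1d(P, 5.0))  # 4.4
```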
2 Point of Lecture
Why is the curse of dimensionality a problem? The issue is that the natural representation of
data is often high-dimensional. Recall our motivating examples for representing data points
in real space. With documents and the bag-of-words model, the number of coordinates equals
the number of words in the dictionary — often in the tens of thousands. Images are often
represented as real vectors, with at least one dimension per pixel (recording pixel intensities)
— here again, the number of dimensions is typically in the tens of thousands. The same
holds for points representing the purchase or page view history of an Amazon customer, or
of the movies watched by a Netflix subscriber.
The friction between the large number of dimensions we want to use to represent data
and the small number of dimensions required for computational tractability motivates dimen-
sionality reduction. The goal is to re-represent points in high-dimensional space as points in
low-dimensional space, preserving interpoint distances as much as possible. We can think of
dimensionality reduction as a form of lossy compression, tailored to approximately preserve
distances. For an analogy, the count-min sketch (Lecture #2) is a form of lossy compression
tailored to the approximate preservation of frequency counts.
Dimensionality reduction enables the following high-level approach to the nearest neigh-
bor problem:
1. Represent the data and queries using a large number k of dimensions (tens of thousands, say).

2. Use dimensionality reduction to map the data and queries down to a small number d of dimensions (in the hundreds, say).

3. Answer nearest-neighbor queries using the low-dimensional representations of the data and queries.
Provided the dimensionality reduction subroutine approximately preserves all interpoint dis-
tances, the answer to the nearest-neighbor query in low dimensions is an approximately
correct answer to the original high-dimensional query. Even if the reduced number d of
dimensions is still too big to use k-d trees, reducing the dimension still speeds up algorithms
significantly. For example, on Mini-Project #2, you will use dimensionality reduction to
improve the running time of a brute-force search algorithm, where the running time has
linear dependence on the dimension.
The three-step paradigm above is relevant for any computation that only cares about
interpoint distances between the points, not just the nearest-neighbor problem. Distance-
based clustering is another example.
It should now be clear that we would love to have subroutines for dimensionality reduction
in our algorithmic toolbox. The rest of this lecture gives a few examples of such subroutines,
and a unified way to think about them.
Achieving error 50% doesn't sound too impressive, but it's easy to reduce it in a second step, via the "magic of independent trials." Repeating the experiment above $\ell$ times — choosing $\ell$ different hash functions $h_1, \ldots, h_\ell$ and labeling each object $x$ with $\ell$ bits $f_1(x), \ldots, f_\ell(x)$ — the properties become:
1. If $x = y$, then $f_i(x) = f_i(y)$ for all $i = 1, 2, \ldots, \ell$.

2. If $x \neq y$ and the $h_i$'s are good and independent hash functions, then $\Pr[f(x) = f(y)] \le 2^{-\ell}$.
For example, to achieve a user-specified error of $\delta > 0$, we only need to use $\lceil \log_2 \frac{1}{\delta} \rceil$ bits to represent each object. For all but the tiniest values of $\delta$, this representation is much smaller than the original $\log_2 U$-bit one.
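Here is a minimal sketch of this fingerprinting idea in code. The names (`fingerprint`, `num_bits`) are ours, and salted SHA-256 digests stand in for the $\ell$ "good and independent" hash functions from the discussion above:

```python
import hashlib

def fingerprint(obj: bytes, num_bits: int) -> tuple:
    """Label obj with num_bits hash bits f_1(obj), ..., f_ell(obj)."""
    bits = []
    for i in range(num_bits):
        # A different salt per index plays the role of a different hash function h_i.
        digest = hashlib.sha256(i.to_bytes(4, "big") + obj).digest()
        bits.append(digest[0] & 1)  # keep one bit of the digest
    return tuple(bits)

# Equal objects always receive equal fingerprints; distinct objects
# collide with probability roughly 2**(-num_bits).
assert fingerprint(b"spam", 20) == fingerprint(b"spam", 20)
print(fingerprint(b"spam", 8), fingerprint(b"eggs", 8))
```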
For measuring Euclidean distances between points in $\mathbb{R}^k$, what's the analog of a hash function? This section proposes ran-
dom projection as the answer. This idea results in a very neat primitive, the Johnson-
Lindenstrauss (JL) transform, which says that if all we care about are the Euclidean dis-
tances between points, then we can assume (conceptually and computationally) that the
number of dimensions is not overly huge (in the hundreds, at most).
Given a vector $r = (r_1, \ldots, r_k)$ (which we will soon choose at random), define
$$f_r(x) = \langle x, r \rangle = \sum_{j=1}^{k} r_j x_j. \tag{2}$$
Thus, $f_r(x)$ is a random linear combination of the components of $x$. This function will play a role analogous to the single-bit function defined in (1) in the previous section. The function in (1) compresses a $\log_2 U$-bit object description to a single bit; the random projection in (2) replaces a vector of $k$ real numbers with a single real number.
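As a concrete illustration, a single random projection might look like the following sketch (our own names; the Gaussian entries for $r$ anticipate the choice made later in this section):

```python
import numpy as np

rng = np.random.default_rng(0)

k = 10_000                  # original number of dimensions
r = rng.standard_normal(k)  # r_1, ..., r_k, here drawn i.i.d. Gaussian (see below)

def f_r(x):
    """The random projection (2): a random linear combination of x's components."""
    return np.dot(x, r)     # <x, r> = sum_j r_j * x_j

x = rng.standard_normal(k)
print(f_r(x))               # a single real number replacing k of them
```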
Figure 1 recalls the geometry of the inner product, as the projection of one vector onto
the line spanned by another, which should be familiar from your high school training.
If we want to use this idea to approximately preserve the Euclidean distances between
points, how should we pick the rj ’s? Inspired by our two-step approach in Section 3, we first
try to preserve distances only in a weak sense. We then use independent trials to reduce the
error.
Figure 1: The inner product $\langle x, r \rangle$ of two vectors is the projection of one onto the line spanned by the other.
Figure 2: The probability density function for the standard Gaussian distribution (with
mean 0 and variance 1).
It's not interesting that the mean of $X_1 + X_2$ is the sum of the means of $X_1$ and $X_2$ — linearity of expectation holds for any pair of random variables, even non-independent ones. Similarly, it's not interesting that the variance of $X_1 + X_2$ is the sum of the variances of $X_1$ and $X_2$ — this holds for any pair of independent random variables.4 What's remarkable is that the distribution of $X_1 + X_2$ is a Gaussian (with the only mean and variance that it could possibly have). Adding two distributions from a family generally gives a distribution outside that family. For example, the sum of two random variables that are uniform on $[0, 1]$ certainly isn't uniformly distributed on $[0, 2]$ — there's more mass in the middle than on the ends.
Here’s where the nice properties of Gaussians come in. Recall that the xj ’s and yj ’s are fixed
(i.e., constants), while the rj ’s are random. For each j = 1, 2, . . . , k, since rj is a Gaussian
with mean zero and variance 1, (xj − yj )rj is a Gaussian with mean zero and variance
(xj − yj )2 . (Multiplying a random variable by a scalar λ scales the standard deviation by λ
and hence the variance by λ2 .) Since Gaussians add, the right-hand side of (3) is a Gaussian
with mean 0 and variance
X k
(xj − yj )2 = kx − yk22 .
j=1
Whoa — this is an unexpected connection between the output of random projection and
the (square of the) quantity that we want to preserve. How can we exploit it? Recalling
the definition $\mathrm{Var}(X) = \mathbf{E}[(X - \mathbf{E}[X])^2]$ of variance as the expected squared deviation of a random variable from its mean, we see that for a random variable $X$ with mean 0, $\mathrm{Var}(X)$ is simply $\mathbf{E}[X^2]$. Taking $X$ to be the random variable in (3), we have
$$\mathbf{E}\left[ (f_r(x) - f_r(y))^2 \right] = \|x - y\|_2^2. \tag{4}$$
That is, the random variable $(f_r(x) - f_r(y))^2$ is an unbiased estimator of the squared Euclidean distance between $x$ and $y$.
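To sanity-check (4), here is a small simulation (a sketch with our own variable names) that averages many independent draws of $r$ to approximate the expectation:

```python
import numpy as np

rng = np.random.default_rng(1)
k = 500
x, y = rng.standard_normal(k), rng.standard_normal(k)

trials = 20_000
r = rng.standard_normal((trials, k))   # one Gaussian vector r per trial
estimates = (r @ (x - y)) ** 2         # (f_r(x) - f_r(y))^2 for each trial

print(np.mean(estimates))              # approximately ||x - y||_2^2
print(np.sum((x - y) ** 2))            # the exact squared distance
```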
4
But this doesn’t generally hold for non-independent random variables, right?
4.4 Step 2: The Magic of Independent Trials
We’ve showed that random projection reduces the number of dimensions from k to just
one (replacing each x by fr (x)), while preserving squared distances in expectation. Two
issues are: we care about preserving distances, not squares of distances;5 and we want to
almost always preserve distances very closely (not just in expectation). We’ll solve both
these problems in one fell swoop, via the magic of independent trials.
Suppose instead of picking a single vector r, we pick d vectors r1 , . . . , rd . Each component
of each vector is drawn i.i.d. from a standard Gaussian. For a given pair x, y of points, we get
d independent unbiased estimates of $\|x - y\|_2^2$ (via (4)). Averaging independent unbiased
estimates yields an unbiased estimate with less error.6 Because our estimates in (4) are
(squares of) Gaussians, which are very well-understood distributions, one can figure out
exactly how large d needs to be to achieve a target approximation (for details, see [3]). The
bottom line is: for a set of $n$ points in $k$ dimensions, to preserve all $\binom{n}{2}$ interpoint Euclidean distances up to a $1 \pm \epsilon$ factor, one should set $d = \Theta(\epsilon^{-2} \log n)$.7
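As a back-of-the-envelope calculation (a sketch; the constant 2 is the one mentioned in footnote 7, and in practice d is often tuned empirically):

```python
import math

def jl_target_dimension(n, eps, constant=2.0):
    """d = Theta(eps^-2 * log n): enough dimensions to preserve all pairwise
    distances among n points up to a 1 +/- eps factor (constant per footnote 7)."""
    return math.ceil(constant * math.log(n) / eps ** 2)

print(jl_target_dimension(n=1_000_000, eps=0.1))
# 2764: the theoretical bound; in practice d in the low hundreds often suffices.
```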
The JL transform chooses a $d \times k$ matrix $A$ with i.i.d. standard Gaussian entries and maps each point $x \in \mathbb{R}^k$ to
$$f_A(x) = \frac{1}{\sqrt{d}}\, A x,$$
where the $1/\sqrt{d}$ scaling factor corresponds to the average over independent trials discussed in Section 4.4.
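Here is a minimal sketch of this map in code (our own names; numpy's Gaussian sampler supplies the i.i.d. entries of $A$):

```python
import numpy as np

def jl_transform(X, d, seed=0):
    """Map the rows of X (n points in R^k) to R^d via x -> (1/sqrt(d)) * A x,
    where A is a d x k matrix of i.i.d. standard Gaussian entries."""
    rng = np.random.default_rng(seed)
    k = X.shape[1]
    A = rng.standard_normal((d, k))
    return X @ A.T / np.sqrt(d)

# Example: compare one interpoint distance before and after the reduction.
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 10_000))   # 100 points in 10,000 dimensions
Y = jl_transform(X, d=400)

orig = np.linalg.norm(X[0] - X[1])
reduced = np.linalg.norm(Y[0] - Y[1])
print(orig, reduced)                     # typically within a few percent of each other
```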
To see how this mapping $f_A$ corresponds to our derivation in Section 4.4, note that for every pair $x, y$ of points,
5
The fact that $X^2$ has expectation $\mu^2$ does not imply that $X$ has expectation $\mu$. For example, suppose $X$ is equally likely to be 0 or 2, and $\mu = \sqrt{2}$.
6
We’ll discuss this idea in more detail next week, but we’ve already reviewed the tools needed
Pd to make
this precise. Suppose X1 , . . . , Xd are independent and all have mean µ and variance σ 2 . Then i=1 Xi has
Pd
mean dµ and variance dσ 2 , and so the average d1 i=1 Xi has mean µ and variance σ 2 /d. Thus, averaging d
independent unbiased estimates yields yields an unbiased estimate and drops the variance by a factor of d.
7
The constant suppressed by the $\Theta$ is reasonable, no more than 2. In practice, it's worth checking whether you can get away with $d$ smaller than what is necessary for this theoretical guarantee. For typical applications, setting $d$ in the low hundreds should be good enough for acceptable results. See also Mini-Project #2.
$$\|f_A(x) - f_A(y)\|_2^2 \;=\; \left\| \frac{1}{\sqrt{d}}\, A (x - y) \right\|_2^2 \;=\; \frac{1}{d} \sum_{i=1}^{d} \left( a_i^T (x - y) \right)^2, \tag{5--7}$$
where $a_i^T$ denotes the $i$th row of $A$. Since each row $a_i^T$ is just a $k$-vector with entries chosen i.i.d. from a standard Gaussian, each term
$$\left( a_i^T (x - y) \right)^2 = \left( \sum_{j=1}^{k} a_{ij} (x_j - y_j) \right)^2$$
is precisely the unbiased estimator of $\|x - y\|_2^2$ described in (3) and (4). Thus (5)–(7) is
the average of d unbiased estimators. Provided d is sufficiently large, with probability close
to 1, all of the low-dimensional interpoint squared distances $\|f_A(x) - f_A(y)\|_2^2$ are very good approximations of the original squared distances $\|x - y\|_2^2$. This implies that, with
equally large probability, all interpoint Euclidean distances are approximately preserved by
the mapping fA down to d dimensions.
Thus, for any point set x1 , . . . , xn in k-dimensional space, and any computation that
cares only about interpoint Euclidean distances, there is little loss in doing the computation
on the $d$-dimensional $f_A(x_i)$'s rather than on the $k$-dimensional $x_i$'s.
The JL transform is not usually implemented exactly the way we described it. One
simplification is to use ±1 entries rather than Gaussian entries; this idea has been justified
both empirically and theoretically (see [1]). Another line of improvement is to add structure to the matrix so that the matrix-vector product $Ax$ can be computed particularly quickly, à la the Fast Fourier Transform — this is known as the "fast JL transform" [2]. Finally, since
the JL transform can often only bring the dimension down into the hundreds without overly
distorting interpoint distances, additional tricks are often needed. One of these is “locality
sensitive hashing (LSH),” touched on briefly in Section 6.
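The ±1 simplification can be sketched as follows (a minimal illustration of the idea using our own function name, not the precise construction analyzed in [1]):

```python
import numpy as np

def sign_jl_transform(X, d, seed=0):
    """Like the Gaussian JL transform, but A has i.i.d. +/-1 entries."""
    rng = np.random.default_rng(seed)
    k = X.shape[1]
    A = rng.choice([-1.0, 1.0], size=(d, k))   # unit-variance +/-1 entries
    return X @ A.T / np.sqrt(d)
```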
We now turn from Euclidean distance to measuring the similarity of two sets $A$ and $B$, using the Jaccard similarity:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}.$$
Jaccard similarity is easily defined for multi-sets (see last lecture); here, to keep things simple,
we do not allow an element to appear in a set more than once.
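In code, Jaccard similarity of two Python sets is a one-line helper (our own name; it assumes at least one of the sets is nonempty):

```python
def jaccard(A: set, B: set) -> float:
    """J(A, B) = |A intersect B| / |A union B|."""
    return len(A & B) / len(A | B)

print(jaccard({"the", "cat", "sat"}, {"the", "cat", "ran"}))  # 2/4 = 0.5
```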
Random projection replaces $k$ real numbers with a single one. So an analog here would replace a set of elements with a single element. The plan is to implement a random such mapping that preserves Jaccard similarity in expectation, and then to use independent trials as in Section 4.4 to boost the accuracy.
5.2 MinHash
For sets, the analog of random projection is the MinHash subroutine:
choose a random permutation $\pi$ of the universe $U$ of possible elements (in practice, a random hash function stands in for $\pi$), and map each set $S \subseteq U$ to the single element $\min_{s \in S} \pi(s)$. The key property is that, for any two sets $A$ and $B$, the probability (over the choice of $\pi$) that they receive the same MinHash value is exactly $J(A, B)$: the element of $A \cup B$ with the smallest $\pi$-value lies in $A \cap B$ with probability $|A \cap B| / |A \cup B|$.
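Here is a minimal MinHash sketch along these lines (our own names; salted built-in hashing stands in for a random permutation of $U$, and independent trials are averaged as in Section 4.4):

```python
import random

def minhash(S, seed):
    """Map the set S to a single element: the one minimizing a random
    'permutation' of the universe (approximated here by a salted hash)."""
    salt = random.Random(seed).getrandbits(64)
    return min(S, key=lambda s: hash((salt, s)))

def estimate_jaccard(A, B, num_trials=200):
    """Fraction of trials in which A and B get the same MinHash value;
    under an ideal random permutation, each trial agrees with probability J(A, B)."""
    agree = sum(minhash(A, t) == minhash(B, t) for t in range(num_trials))
    return agree / num_trials

A = {"a", "b", "c", "d"}
B = {"b", "c", "d", "e"}
print(estimate_jaccard(A, B))   # close to J(A, B) = 3/5 = 0.6
```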
For the simpler problem of detecting exact duplicates among $n$ objects, there is a standard hashing-based solution:
1. Hash all n objects into b buckets using a good hash function. (b could be roughly n,
for example).
2. In each bucket, use brute-force search (i.e., compare all pairs) on the objects in that
bucket to identify and remove duplicate objects.9
Why is this a good solution? Duplicate objects hash to the same bucket, so all duplicate
objects are identified. With a good hash function and a sufficiently large number b of
buckets, different objects usually hash to different buckets. Thus, in a given bucket, we
expect a small number of distinct objects, so brute-force search in a bucket does not waste
much time comparing objects that are distinct and in the same bucket due to a hash function
collision.
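A minimal sketch of this two-step solution (our own names; Python's built-in hash and a dict of lists stand in for the "good hash function" and the b buckets):

```python
from collections import defaultdict

def deduplicate(objects, num_buckets):
    """Remove exact duplicates: hash into buckets, then brute-force within each bucket."""
    buckets = defaultdict(list)
    for obj in objects:                              # step 1: hash everything into b buckets
        buckets[hash(obj) % num_buckets].append(obj)
    unique = []
    for bucket in buckets.values():                  # step 2: compare pairs inside each bucket
        kept = []
        for obj in bucket:
            if all(obj != other for other in kept):  # duplicates share a bucket, so they meet here
                kept.append(obj)
        unique.extend(kept)
    return unique

print(deduplicate(["x", "y", "x", "z", "y"], num_buckets=8))  # ['x', 'y', 'z'] in some order
```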
Naively extending this idea to filter near-duplicates fails utterly. The problem is that
two objects x and x0 that are almost the same (but still distinct) are generally mapped to
unrelated buckets by a good hash function. To extend duplicate detection to near-duplicate
detection, we want a function h such that, if x and x0 are almost the same, then h is likely to
map x and x0 to the same bucket. This is the idea behind locality sensitive hashing (LSH).
(Additional optional material, not covered in lecture, to be added.)
References
[1] D. Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with bi-
nary coins. Journal of Computer and System Sciences, 66(4):671–687, 2003.
[2] N. Ailon and B. Chazelle. Faster dimension reduction. Communications of the ACM,
53(2):97–104, 2010.
[3] S. Dasgupta and A. Gupta. An elementary proof of a theorem of Johnson and Linden-
strauss. Random Structures and Algorithms, 22(1):60–65, 2003.
[5] N. Linial, E. London, and Y. Rabinovich. The geometry of graphs and some of its algorithmic applications. Combinatorica, 15(2):215–245, 1995.
9
It’s often possible to be smarter here, for example if it’s possible to sort the objects.