17 Random Projections and Orthogonal Matching Pursuit
Again we will consider high-dimensional data P, but now we will examine the uses and effects of randomness.
We will use it to simplify P (put it in a lower-dimensional space) and to recover data after random noise has
interfered with it.
The first approach will be through random projections, and we will discuss the Johnson-Lindenstrauss
Lemma, and the very simple algorithm implied by it.
Then, to discuss recovery, we need to model the problem a bit more carefully, so we will define the
compressed sensing problem. Then we will discuss the simplest way to recover data: orthogonal matching
pursuit. Although this technique does not have the best possible bounds, it is an extremely general approach
that can be used in many areas.
This is stricter than the requirements for PCA, since we want all distances between pairs of points preserved,
whereas PCA only asks for an average error to be small; that allowed some points to have large
error as long as most did not.
The idea to create µ is very simple: choose one at random!
To create µ, we choose k random unit vectors u_1, u_2, ..., u_k, then project onto the subspace spanned by
these vectors. Finally we need to re-normalize by sqrt(d/k) so the expected (squared) norm is preserved.
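To make this concrete, here is a minimal numpy sketch of this construction (the function name and the small sanity check at the end are my own illustration, not part of these notes): draw k random Gaussian directions, normalize them to unit vectors, take inner products, and rescale by sqrt(d/k).

    import numpy as np

    def jl_project(P, k, seed=0):
        """Project the rows of P (an n x d point set) onto k random directions."""
        rng = np.random.default_rng(seed)
        n, d = P.shape
        U = rng.normal(size=(d, k))          # k random Gaussian directions
        U /= np.linalg.norm(U, axis=0)       # normalize each column to a unit vector
        return np.sqrt(d / k) * (P @ U)      # rescale so squared norms are preserved in expectation

    # Quick sanity check: one pairwise distance before and after projecting.
    P = np.random.default_rng(1).normal(size=(50, 10000))
    Q = jl_project(P, k=200)
    print(np.linalg.norm(P[0] - P[1]), np.linalg.norm(Q[0] - Q[1]))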
A classic theorem [1], known as the Johnson-Lindenstrauss Lemma, shows that if k = O((1/ε²) log(n/δ))
in Algorithm 17.1.1, then for all p, p′ ∈ P equation (17.1) is satisfied with probability at least 1 − δ. The
proof can almost be seen as a Chernoff-Hoeffding bound plus a union bound, see L3.2. For each distance,
each random projection (after appropriate normalization) gives an unbiased estimate; this requires the 1/ε²
term to make the deviation from the unbiased estimate small. Then we take the union bound over all
(n choose 2) = O(n²) pairwise distances (this yields the log n term).
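In slightly more detail, the union-bound step looks like this (constants suppressed; this expansion is my own, following the sketch above): if each single pair violates equation (17.1) with probability at most δ′ = δ / (n choose 2), then

    \Pr\big[\text{some pair } p, p' \in P \text{ violates (17.1)}\big] \;\le\; \binom{n}{2}\,\delta' \;=\; \delta,

and driving one pair's failure probability down to δ′ needs k = O((1/ε²) log(1/δ′)) = O((1/ε²) log(n/δ)).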
Interpretation of bounds. It is pretty amazing that this bound does not depend on d. Moreover, it is
essentially tight; that is, there are known point sets for which Ω(1/ε²) dimensions are required to satisfy
equation (17.1).
Although the log n component can be quite reasonable, the 1/ε² part can be quite onerous. For instance,
if we want the error to be within 1%, we may need k to be about 10,000 times log n. It is not often that d is
large enough that setting k = 10,000 is useful.
However, we can sometimes get about 10% error (recall this is the worst case error) when k = 200 or so.
Also, the log n term may not be required if the data actually lies in a lower-dimensional space naturally
(or is very clustered); this component is really a worst-case analysis.
A rule of thumb: use JL when d > 100,000 and the desired k > 500.
In conclusion, this may be useful when k = 200 is acceptable, not too much precision is needed, and PCA is
too slow. Otherwise, SVD or its approximations may be a better technique.
Extensions/Advantages. One can also combine this with PCA ideas [2] to get similar bounds and performance
as in L16.
Another advantage of this technique is that µ is defined independently of P , so if we don’t know P ahead
of time, we can still create µ and then use it in several different cases. But if we know something of P , then
again typically PCA is better.
Typically, the random Gaussian vector u_i can also be replaced with a random vector in {−1, 0, +1}^d.
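As a hedged illustration of this variant (the specific probabilities and the sqrt(3/k) scaling below follow the common "database-friendly" construction and are my assumption, not something stated in these notes):

    import numpy as np

    def sparse_jl_project(P, k, seed=0):
        """Random projection using entries from {-1, 0, +1} instead of Gaussians."""
        rng = np.random.default_rng(seed)
        n, d = P.shape
        # Each entry is +1 or -1 with probability 1/6 each, and 0 with probability 2/3.
        R = rng.choice([-1.0, 0.0, 1.0], size=(d, k), p=[1 / 6, 2 / 3, 1 / 6])
        # The sqrt(3/k) factor preserves squared norms in expectation for this distribution.
        return np.sqrt(3.0 / k) * (P @ R)

Because most entries of R are zero, the projection can be computed with sparse matrix arithmetic, which is one reason this variant is popular.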
For example, consider a sparse signal S that has only m non-zeros among its d coordinates, such as
S^T = [0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 1 0 0 0 1 0 0 1 0 0]
for d = 32 and m = 8. (Perhaps in practice the non-zeros could be larger, and the “zeros” may be small,
such as < 0.05.)
Now the goal is to make only N = K · m log(d/m) (random) measurements of S and recover it exactly
(or with high probability). In some settings K is 4, and not more than 20 in general. For this to work, each
measurement needs to involve (nearly) the entire signal; otherwise we could miss a non-zero and simply
never witness it.
Let a measurement x_i be a random vector in {−1, 0, +1}^d. Example:
x_i^T = [-1 0 1 0 1 1 -1 1 0 -1 0 0 1 -1 -1 1 0 1 0 1 -1 -1 -1 0 1 0 0 -1 0 1 0 0]
y_i = ⟨S, x_i⟩ = 0+0+0+0+1+0+0+0+0+0+0+0+0+0+0+1+0+0+0+1−1+0−1+0+0+0+0+0+0+1+0+0 = 2.
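A small numpy sketch of taking such measurements (the specific values of d, m, and N below are chosen only for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    d, m, N = 32, 8, 48                       # signal length, number of non-zeros, number of measurements

    # A sparse 0/1 signal S with m non-zeros, like the example above.
    S = np.zeros(d)
    S[rng.choice(d, size=m, replace=False)] = 1

    # Each measurement x_i is a random vector in {-1, 0, +1}^d, and we record y_i = <S, x_i>.
    X = rng.choice([-1, 0, 1], size=(N, d))
    y = X @ S                                 # all N measurements at once
    print(y[:5])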
Examples:
• single pixel camera: Instead of 10 Gigapixels (about 25MB), directly sense the 5MB jpg. This is hard,
but we can get kind of close. Take N measurements where each y_i is the sum of all 10 Gigapixels
under a random mask x_i. Each pixel is either taken in “as is” (a +1), is blocked (a 0), or is subtracted
(a −1).
Such cameras have been built; they work OK, though not as well as a regular camera :).
• Hubble Telescope: A high-resolution camera in space (less atmospheric interference). But communication
to/from space is expensive. So with a fixed (but initially random) mask matrix X that is already
known on Earth, it can send compressed signals down.
• MRI on kids: They squirm a lot. So only a few angles/voxels need to be sensed, and this technique gets the
best images available for kids. Not as high resolution as a full MRI, but with much, much less time.
• Noisy Data: Data is often noisy and appears to have more attributes than are actually there. This helps find the true
structure. See more next lecture.
Orthogonal matching pursuit greedily builds a sparse guess of S from the measurements y and the matrix X, one coordinate at a time:
• First, find the measurement column X_j (not the row x_i used to measure).
This represents the single index of S that explains the most about y.
• Second, find the scalar γ = arg min_γ ‖y − X_j γ‖ that represents our guess of entry s_j in S. If S is always 0 or 1, then we may enforce that γ = 1.
• Finally, we calculate the residual r = y − X_j γ. This is what remains to be explained by other elements
of S.
• Then we repeat for t rounds (see the sketch after this list). We stop when the residual is small enough (nothing left to explain) or γ
is small enough (the additional explanation is not that useful).
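Here is a minimal numpy sketch of these steps (the scoring rule |⟨X_j, r⟩| / ‖X_j‖ for “most explanatory” and the stopping thresholds are defaults I am assuming, not prescribed above):

    import numpy as np

    def omp(X, y, t=20, tol=1e-8):
        """Greedily build a sparse guess of S from measurements y = X S."""
        N, d = X.shape
        r = y.astype(float).copy()            # residual: what is still unexplained
        s_hat = np.zeros(d)
        col_norm = np.linalg.norm(X, axis=0)
        for _ in range(t):
            # Pick the column that explains the residual best.
            j = int(np.argmax(np.abs(X.T @ r) / col_norm))
            Xj = X[:, j]
            # gamma = argmin_gamma || r - X_j gamma ||, a one-dimensional least squares.
            gamma = (Xj @ r) / (Xj @ Xj)
            s_hat[j] += gamma
            r = r - gamma * Xj                # update the residual
            if np.linalg.norm(r) < tol or abs(gamma) < tol:
                break
        return s_hat

On a 0/1 signal like the examples in this lecture, one could additionally round the entries of s_hat, or enforce γ = 1 as suggested above.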
Remarks:
• This greedy approach adds one explanation at a time and, as we will see next lecture, this will bias towards sparse solutions.
• We can re-solve for the optimal least squares fit over all chosen columns to get a better estimate each round, but this is more work.
• This converges if in each step we require ‖r_i‖ < ‖r_{i−1}‖. A Frank-Wolfe analysis can show that it is
within ε of optimal after t = 1/ε steps, although it may not reach a global optimum.
• The term “orthogonal” comes from the fact that each X_{j_i} chosen in the ith step is linearly independent of [X_{j_1} . . . X_{j_{i−1}}];
it adds an orthogonal explanation of y.
• Roughly, the analysis of why about m log(d/m) measurements suffice goes through the Coupon Collector's problem, since we
need to hit each of the m non-zero entries. And since X is random and N is large enough, each
⟨X_j, X_{j′}⟩ (for j ≠ j′) should be small (the columns are close to orthogonal); see the quick check below.
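A quick empirical check of that last remark (numpy, with N and d chosen only for illustration): the diagonal entries of the column Gram matrix are much larger than the off-diagonal ones.

    import numpy as np

    rng = np.random.default_rng(0)
    N, d = 200, 50
    X = rng.choice([-1, 0, 1], size=(N, d))

    G = X.T @ X                               # G[j, j'] = <X_j, X_j'>
    diag = np.diag(G)
    off = G[~np.eye(d, dtype=bool)]
    print(diag.mean(), np.abs(off).mean())    # roughly 133 vs. roughly 8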
As a small worked example, let d = 10 and m = 3, with sparse signal
S = [0, 0, 1, 0, 0, 1, 0, 0, 1, 0]
and 6 × 10 measurement matrix
X = [  0   1   1  −1  −1   0  −1   0  −1   0
      −1  −1   0   1  −1   0   0  −1   0   1
       1  −1   1  −1   0  −1   1   1   0   0
       1   0  −1   0   0   1  −1  −1   1   1
      −1   0   0   0   1   0   1   0   1  −1
       0   0  −1  −1  −1   0  −1   1  −1   0 ]
so for instance the first row x_1 = (0, 1, 1, −1, −1, 0, −1, 0, −1, 0) yields measurement ⟨S, x_1⟩ = 0 + 0 +
1 + 0 + 0 + 0 + 0 + 0 + (−1) + 0 = 0.
The observed measurement vector is
y = X S^T = [0, 0, 0, 1, 1, −2]^T.
Columns 7 and 9 have the most explanatory power towards y, based on X. We let j_1 = 9, so X_{j_1} = X_9 =
(−1, 0, 0, 1, 1, −1)^T = X(:,9). Then 1 = γ_1 = arg min_γ ‖y − X_9 γ‖. We can then set r = y − X_9 · γ_1 =
(1, 0, 0, 0, 0, −1).
Next, we observe that columns 3, 4, 5, 7, and 9 have the most explanatory power for the new r. We
choose 3 arbitrarily, letting j_2 = 3. Let X_{j_2} = X_3 = (1, 0, 1, −1, 0, −1)^T = X(:,3). Then we take
γ_2 = 1, enforcing γ ∈ {0, 1} since S is binary (the unconstrained minimizer of ‖r − X_3 γ‖ would be 1/2). Using γ_2 we update
r = r − X_3 · γ_2 = (0, 0, −1, 1, 0, 0). Note: this progress seemed sideways at best; it increased our non-zero
γ_i values, but did not decrease ‖r‖.
Finally, we observe that columns 1, 3, 6, 7, and 8 have the most explanatory power for the new r. We choose
6 arbitrarily, letting j_3 = 6. Note: we could have chosen 3, and then gone back and updated our choice of γ_2. Let
X_{j_3} = X_6 = (0, 0, −1, 1, 0, 0)^T = X(:,6). Then 1 = γ_3 = arg min_γ ‖r − X_6 γ‖. Then using γ_3 we
update r = r − X_6 γ_3 = (0, 0, 0, 0, 0, 0). So we have completely explained y using only 3 data elements.
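As a check, the whole walk-through can be replayed in a few lines of numpy (columns 9, 3, 6 become 0-based indices 8, 2, 5; everything else is copied from the example above):

    import numpy as np

    X = np.array([[ 0,  1,  1, -1, -1,  0, -1,  0, -1,  0],
                  [-1, -1,  0,  1, -1,  0,  0, -1,  0,  1],
                  [ 1, -1,  1, -1,  0, -1,  1,  1,  0,  0],
                  [ 1,  0, -1,  0,  0,  1, -1, -1,  1,  1],
                  [-1,  0,  0,  0,  1,  0,  1,  0,  1, -1],
                  [ 0,  0, -1, -1, -1,  0, -1,  1, -1,  0]])
    S = np.array([0, 0, 1, 0, 0, 1, 0, 0, 1, 0])

    y = X @ S
    print(y)                              # [ 0  0  0  1  1 -2]

    r = y.astype(float)
    for j in [8, 2, 5]:                   # the choices j1 = 9, j2 = 3, j3 = 6, with gamma = 1 each round
        r = r - X[:, j]
        print(j + 1, r)                   # residual after each round; ends at all zeros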
Remarks:
• This would not have worked so cleanly if we had made other arbitrary choices. Using OMP typically
needs something like N = 20 × m log d measurements (instead of 6). More measurements would
have made it much more likely that at each step we chose the correct variable j_i as most explanatory.
• This still will not always converge to the correct solution. It might get stuck without explaining
everything exactly. In that case, we can often accept that we still get a good enough explanation (although
slightly off) and leave it at that. With much larger d and m, getting a good guess of the m non-zero
bits might still be useful. There are other, more complex minimization techniques we can alternatively
consider in the next lecture.
Bibliography
[1] William B. Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space.
Contemporary Mathematics, 26:189–206, 1984.
[2] Tamás Sarlós. Improved approximation algorithms for large matrices via random projections. In 47th
Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 143–152, 2006.