Statistical Methods For Machine Learning
Statistical Methods For Machine Learning
Methods
for
Machine
Learning
Larry Wasserman
https://www.stat.cmu.edu/~larry/=sml/
36-708
Statistical Methods in Machine Learning
Syllabus, Spring 2019
http://www.stat.cmu.edu/∼larry/=sml
Lectures: Tuesday and Thursday 1:30 - 2:50 pm (POS 152)
This course is an introduction to Statistical Machine Learning. The goal is to study modern methods and the
underlying theory for those methods. There are two pre-requisites for this course:
Contact Information
Instructor:
Larry Wasserman BH 132G 412-268-8727 [email protected]
Teaching Assistants:
The names and office hours for the TA’s will be on the course website.
Office Hours
Larry Wasserman Tuesdays 12:00-1:00 pm Baker Hall 132G
Text
There is no text but course notes will be posted. Useful reference are:
1. Trevor Hastie, Robert Tibshirani, Jerome Friedman (2001). The Elements of Statistical Learning, Avail-
able at http://www-stat.stanford.edu/∼tibs/ElemStatLearn/.
2. Chris Bishop (2006). Pattern Recognition and Machine Learning.
3. Luc Devroye, László Györfi, Gábor Lugosi. (1996). A probabilistic theory of pattern recognition.
4. Gyorfi, Kohler, Krzyzak and Walk (2002). A Distribution-Free Theory of Nonparametric Regression.
5. Larry Wasserman (2004). All of Statistics: A Concise Course in Statistical Inference.
6. Larry Wasserman (2005). All of Nonparametric Statistics.
Grading
1. There will be four assignments. They are due Fridays at 3:00 p.m.. Hand them by uploading a pdf file
to Canvas.
2. Midterm Exam. The date is Thursday MARCH 7.
3. Project. There will be a final project, described later in the syllabus.
Course Calendar
The course calendar is posted on the course website and will be updated throughout the semester.
Project
The project involves picking a topic of interest, reading the relevant results in the area and then writing a short
paper (8 pages) summarizing the key ideas in the area. You may focus on a single paper if you prefer. Your are
NOT required to do new research.
The paper should include background, statement of important results, and brief proof outlines for the results.
If appropriate, you should also include numerical experiments are an application with real data.
Proposal. A one page proposal is due February 8. It should contain the following information: (1) project
title, (2) team members, (3) precise description of the problem you are studying, (4) anticipated scope of the
project, and (5) reading list. (Papers you will need to read).
Progress Report. Due April 5. Three pages. Include: (i) a high quality introduction, (ii) what have you
done so far, (iii) what remains to be done and (iv) a clear description of the division of work among teammates,
if applicable.
Final Report: Due May 3. The paper should be in NIPS format. (pdf only). Maximum 8 pages. No
appendix is allowed. You should submit a pdf file electronically. It should have the following format:
1 Statistics versus ML
Statistics and ML are overlapping fields. Both address the same question: how do we extract
information from data? But there are differences in emphasis. In particular, some topics get
greater emphasis than others. Here are some examples:
More emphasis in ML More emphasis in Stat Common Areas
Bandits Confidence Sets Prediction (Regression and Classification)
Reinforcement Learning Large Sample Theory Probability Bounds (Concentration)
Efficient Computation Statistical Optimality Clustering
Deep Learning Causality Graphical Models
However, the lines between the two fields are blurry and will become increasingly so.
Another difference between the two fields is that ML researchers tend to publish short pa-
pers in conferences while Statisticians tend to publish long papers in journals. Each has
advantages and disadvantages.
2 Concentration
Hoeffding’s inequality:
VC Dimension. Let A be a class of sets. If F is a finite set, let s(A, F ) be the number of
subset of F ‘picked out’ by A. Define the growth function
sn (A) = sup s(A, F ).
|F |=n
If the VC dimension is finite, then there is a phase transition in the growth function from
exponential to polynomial:
1
Theorem 2 (Sauer’s Theorem) Suppose that A has finite VC dimension d. Then, for
all n ≥ d,
en d
s(A, n) ≤ . (3)
d
p
Theorem 3 (Vapnik and Chervonenkis) Let A be a class of sets. For any t > 2/n,
2
P sup |Pn (A) − P (A)| > t ≤ 4 s(A, 2n)e−nt /8 (4)
A∈A
n2
P(|Y − µ| > ) ≤ 2 exp − 2 . (7)
2σ + 2M /3
It follows that
σ 2
t
P |Y − µ| > + ≤ e−t
n 2(1 − c)
for small enough and c.
3 Probability
P
1. Xn → 0 means that means that, for every > 0 P(|Xn | > ) → 0 as n → ∞.
2
2. Xn Z means that P(Xn ≤ z) → P(Z ≤ z) at all continuity points z.
3. Xn = OP (an ) means that, Xn /an is bounded
in probability:
for every > 0 there is
an M > 0 such that, for all large n, P Xann > M ≤ .
4. Xn = op (an ) means that Xn /an goes to 0 in probability: for every > 0
Xn
P > → 0 as n → ∞.
an
5. Law of large numbers: X1 , . . . , Xn ∼ P then
P
Xn → µ
4 Basic Statistics
1. Bias and Variance. Let θb be an estimator of θ. Then
where bias = E[θ]b − θ and Var = Var(θ). b In many cases there is a bias-variance trade-
off. In parametric problems, we typically have that the standard deviation is O(n−1/2 )
but the bias is O(1/n) so the variability dominates. In nonparametric problems this is
no longer true. We have to choose tuning parameters in classifiers and estimators to
balance the bias and variance.
2. A set of distributions P is a statistical model. They can be small (parametric models)
or large (nonparametric models).
3. Confidence Sets. Let X1 , . . . , Xn ∼ P where P ∈ P. Let θ = T (P ) be some quantity
of interest, Then Cn = C(X1 , . . . , Xn ) is a 1 − α confidence set if
inf P (T (P ) ∈ Cn ) ≥ 1 − α.
P ∈P
3
5. Fisher information In (θ) = nI(θ) where
∂ 2 log p(X; θ)
I(θ) = −E .
∂θ2
6. Then
θbn − θ
N (0, 1)
sn
q
1
where sn = b .
nI(θ)
5 Minimaxity
If supP ∈P EP [L(θ,
b θ)] = Rn then θb is a minimax estimator.
6 Regression
1. Y ∈ R, X ∈ Rd and prediction risk is
E(Y − m(X))2 .
We write X = (X(1), . . . , X(d)).
2. Minimizer is m(x) = E(Y |X = x).
3. Best linear predictor: minimize
E(Y − β T X)2
where X(1) = 1 so that β1 is the intercept. Minimizer is
β = Λ−1 α
where Λ(j, k) = E[X(j)X(k)] and α(j) = E(Y X(j)).
4
4. The data are
(X1 , Y1 ), . . . , (Xn , Yn ).
Given new X predict Y .
5. Minimize training error
1X
R(β)
b = (Yi − β T Xi )2 .
n i
Solution: least squares:
βb = (XT X)−1 XT Y
where X(i, j) = Xi (j).
6. Fitted values Yb = Xβb = HY where H = X(XT X)−1 XT is the hat matrix: the projector
onto the column space of X.
7. Bias-Variance tradeoff: Write Y = m(X) + and let Yb = m(X)
b where m(x)
b = xT β.
b
Then Z Z
2 2 2
R = E(Y − Y ) = σ + b (x)p(x)dx + v(x)p(x)dx
b
7 Classification
1. X ∈ Rd and Y ∈ {0, 1}.
2. Classifier h : Rd → {0, 1}.
3. Prediction risk:
R(h) = P(Y 6= h(X)).
The Bayes rule minimizes R(h):
where m(x) = P(Y = 1|X = x), π1 = P(Y = 1), π0 = P(Y = 0), p1 (x) = p(x|Y = 1)
and p0 (x) = p(x|Y = 0).
4. Re-coded loss. If we code Y as Y ∈ {−1, +1}. then many classifiers can be written
as
h(x) = sign(ψ(x))
for some ψ. For linear classifiers, ψ(x) = β T x. Then the loss can be written as
I(Y 6= h(X)) = I(Y ψ(X) < 0) and risk is
5
5. Linear Classifiers. A linear classifier has the form hβ (x) = I(β T x > 0). (I am
including a intercept in x. In other words x = (1, x(2), . . . , x(d)).) Given data
(X1 , Y1 ), . . . , (Xn , Yn ) there are several ways to estimate a linear classifier:
(a) Empirical risk minimization (ERM): Choose βb to minimize
n
1X
Rn (β) = 6 hβ (Xi )).
I(Yi =
n i=1
6. The SVM is an example of the general idea of replacing the true loss with a surrogate
loss that is easier to minimize. Replacing I(Y ψ(X) < 0) with
6
Density Estimation
36-708
1 Introduction
Density estimation used for: regression, classification, clustering and unsupervised predic-
tion. For example, if pb(x, y) is an estimate of p(x, y) then we get the following estimate of
the regression function: Z
m(x)
b = yb
p(y|x)dy
where π1 = P(Y = 1), π0 = P(Y = 0), p1 (x) = p(x|y = 1) and p0 (x) = p(x|y = 0). Inserting
sample estimates of π1 and π0 , and density estimates for p1 and p0 yields an estimate of
the Bayes rule. For clustering, we look for the high density regions, based on an estimate
of the density. Many classifiers that you are familiar with can be re-expressed this way.
Unsupervised prediction is discussed in Section 9. In this case we want to predict Xn+1 from
X1 , . . . , X n .
Example 1 (Bart Simpson) The top left plot in Figure 1 shows the density
4
1 1 X
p(x) = φ(x; 0, 1) + φ(x; (j/2) − 1, 1/10) (1)
2 10 j=0
where φ(x; µ, σ) denotes a Normal density with mean µ and standard deviation σ. Marron
and Wand (1992) call this density “the claw” although we will call it the Bart Simpson
density. Based on 1,000 draws from p, we computed a kernel density estimator, described
later. The estimator depends on a tuning parameter called the bandwidth. The top right plot
is based on a small bandwidth h which leads to undersmoothing. The bottom right plot is
based on a large bandwidth h which leads to oversmoothing. The bottom left plot is based
on a bandwidth h which was chosen to minimize estimated risk. This leads to a much more
reasonable density estimate.
1
1.0
1.0
0.5
0.5
0.0
−3 0 3 0.0 −3 0 3
True Density Undersmoothed
1.0
1.0
0.5
0.5
0.0
0.0
−3 0 3 −3 0 3
Just Right Oversmoothed
Figure 1: The Bart Simpson density from Example 1. Top left: true density. The other plots
are kernel estimators based on n = 1,000 draws. Bottom left: bandwidth h = 0.05 chosen by
leave-one-out cross-validation. Top right: bandwidth h/10. Bottom right: bandwidth 10h.
2
2 Loss Functions
Devroye and Györfi (1985) make a strong case for using the L1 norm
Z
kbp − pk1 ≡ |b p(x) − p(x)|dx
as the loss instead of L2 . The L1 loss has the following nice interpretation. If P and Q are
distributions define the total variation metric
where the supremum is over all measurable sets. Now if P and Q have densities p and q then
Z
1 1
dT V (P, Q) = |p − q| = kp − qk1 . H
2 2
R
Thus, if |p − q| < δ then we know that |P (A) − Q(A)| < δ/2 for all A. Also, the L1 norm is
transformation invariant. Suppose that T is a one-to-one smooth function. Let Y = T (X).
Let p and q be densities for X and let pe and qe be the corresponding densities for Y . Then
Z Z
|p(x) − q(x)|dx = |e p(y) − qe(y)|dy. H
Hence the distance is unaffected by transformations. The L1 loss is, in some sense, a much
better loss function than L2 for density estimation. But it is much more difficult to deal
with. For now, we will focus on L2 loss. But we may discuss L1 loss later.
R
Another loss function is the Kullback-Leibler loss p(x) log p(x)/q(x)dx. This is not a good
loss function to use for nonparametric density estimation. The reason is that the Kullback- H
Leibler loss is completely dominated by the tails of the densities.
3 Histograms
Perhaps the simplest density estimators are histograms. For convenience, assume that the
data X1 , . . . , Xn are contained in the unit cube X = [0, 1]d (although this assumption is not
crucial). Divide X into bins, or sub-cubes, of size h. We discuss methods for choosing
3
h later. There are N ≈ (1/h)d such bins and each has volume hd . Denote the bins by
B1 , . . . , BN . The histogram density estimator is
N b
X θj
pbh (x) = d
I(x ∈ Bj ) (2)
j=1
h
where n
1X
θbj = I(Xi ∈ Bj )
n i=1
is the fraction of data points in bin Bj . Now we bound the bias and variance of pbh . We will
assume that p ∈ P(L) where
( )
P(L) = p : |p(x) − p(y)| ≤ Lkx − yk, for all x, y . (3)
R
First we bound the bias. Let θj = P (X ∈ Bj ) = Bj
p(u)du. For any x ∈ Bj ,
θj
ph (x) ≡ E(b
ph (x)) = (4)
hd
and hence R
Bj
p(u)du 1
Z
p(x) − ph (x) = p(x) − = d (p(x) − p(u))du.
hd h
Thus,
1
Z
1 √ Z √
|p(x) − ph (x)| ≤ d |p(x) − p(u)|du ≤ d Lh d du = Lh d
h h
√
where we used the fact that if x, u ∈ Bj then kx − uk ≤ dh.
1 θj (1 − θj ) θj C
Var(b
ph (x)) = 2d
Var(θbj ) = 2d
≤ 2d
≤ .
h nh nh nhd
4
where C0 = L2 d(C/(L2 d))2/(d+2) .
Later, we will prove the following theorem which shows that this upper bound is tight.
Specifically:
sup P n (kb
ph − pk∞ > )
P ∈P
where c = 1/(2(C + 1/3)). By the union bound and the fact that N ≤ (1/h)d ,
√
Earlier we saw that supx |p(x) − ph (x)| ≤ L dh. Hence, with probability at least 1 − πn ,
√
kb
ph − pk∞ ≤ kb ph − ph k∞ + kph − pk∞ ≤ + L dh. (7)
Now set s
1 2
= log .
cnhd δhd
5
Then, with probability at least 1 − δ,
s
√
1 2
kb
ph − pk∞ ≤ log + L dh. (8)
cnhd δhd
2
Boxcar: K(x) = 21 I(x) Gaussian: K(x) = √1 e−x /2
2π
3 2 70
Epanechnikov: K(x) = 4 (1 − x )I(x) Tricube: K(x) = 81
(1 − |x|3 )3 I(x)
−3 0 3 −3 0 3
−3 0 3 −3 0 3
Figure 2: Examples of smoothing kernels: boxcar (top left), Gaussian (top right), Epanech-
nikov (bottom left), and tricube (bottom right).
6
−10 −5 0 5 10
Figure 3: A kernel density estimator pb. At each point x, pb(x) is the average of the kernels
centered over the data points Xi . The data points are indicated by short vertical bars. The
kernels are not drawn to scale.
Suppose that X ∈ Rd . Given a kernel K and a positive number h, called the bandwidth,
the kernel density estimator is defined to be
n
1X 1 kx − Xi k
pb(x) = K . (10)
n i=1 hd h
where H is a positive definite bandwidth matrix and KH (x) = |H|−1/2 K(H −1/2 x). For
simplicity, we will take H = h2 I and we get back the previous formula.
Sometimes we write the estimator as pbh to emphasize the dependence on h. In the multivari-
ate case the coordinates of Xi should be standardized so that each has the same variance,
since the norm kx − Xi k treats all coordinates as if they are on the same scale.
The kernel estimator places a smoothed out lump of mass of size 1/n over each data point
Xi ; see Figure 3. The choice of kernel K is not crucial, but the choice of bandwidth h
is important. Small bandwidths give very rough estimates while larger bandwidths give
smoother estimates.
7
4.1 Risk Analysis
In this section we examine the accuracy of kernel density estimation. We will first need a
few definitions.
∂ s1 +···+sd
Ds = .
∂xs11 · · · ∂xsdd
Let β be a positive integer. Define the Hölder class
( )
Σ(β, L) = g : |Ds g(x)−Ds g(y)| ≤ Lkx−yk, for all s such that |s| = β −1, and all x, y .
(11)
For example, if d = 1 and β = 2 this means that
The most common case is β = 2; roughly speaking, this means that the functions have
bounded second derivatives.
where H
X (u − x)s
gx,β (u) = Ds g(x). (13)
s!
|s|≤β
ph (x)]. The next lemma provides a bound on the bias ph (x) − p(x).
Let ph (x) = E[b
8
Lemma 3 The bias of pbh satisfies:
sup |ph (x) − p(x)| ≤ chβ (14)
p∈Σ(β,L)
for some c.
Proof. We have
Z
1
|ph (x) − p(x)| = K(ku − xk/h)p(u)du − p(x)
hd
Z
= K(kvk)(p(x + hv) − p(x))dv
Z Z
≤ K(kvk)(p(x + hv) − px,β (x + hv))dv + K(kvk)(px,β (x + hv) − p(x))dv .
R
The first term is bounded by Lhβ K(s)|s|β since p ∈ Σ(β, L). The second term is 0 from
the properties on K since px,β (x + hv) − p(x) is a polynomial of degree β (with no constant
term).
hd
kx − uk
Z Z
2 1 2
Var(Zi ) ≤ E(Zi ) = 2d K p(u)du = 2d K 2 (kvk) p(x + hv)dv
h h h
Z
supx p(x) c
≤ d
K 2 (kvk)dv ≤ d
h h
for some c since the densities in Σ(β, L) are uniformly bounded. The result follows.
Since the mean squared error is equal to the variance plus the bias squared we have:
1
Theorem 5 The L2 risk is bounded above, uniformly over Σ(β, L), by h4β + nhd
(up to
constants). If h n−1/(2β+d) then
Z 2β
2β+d
2 1
sup E (b ph (x) − p(x)) dx . (16)
p∈Σ(β,L) n
9
4.2 Minimax Bound
According to the next theorem, there does not exist an estimator that converges faster than
O(n−2β/(2β+d) ). We state the result for integrated L2 loss although similar results hold for
other loss functions and other function spaces. We will prove this later in the course.
Theorem 6 together with (16) imply that kernel estimators are rate minimax.
Now we state a result which says how fast pb(x) concentrates around p(x). First, recall
Bernstein’s inequality: Suppose that Y1 , . . . , Yn are iid with mean µ, Var(Yi ) ≤ σ 2 and
|Yi | ≤ M . Then
n2
P(|Y − µ| > ) ≤ 2 exp − 2 . (18)
2σ + 2M /3
Note that the last statement follows from the bias-variance calculation followed by Markov’s
inequality. The first statement does not.
|b
p(x) − p(x)| ≤ |b
p(x) − ph (x)| + |ph (x) − p(x)| (21)
10
where ph (x) = E(bp(x)). From Lemma 3, |ph (x) − p(x)| ≤ chβ for some c. Now pb(x) =
n−1 ni=1 Zi where
P
1 kx − Xi k
Zi = d K .
h h
Note that |Zi | ≤ c1 /hd where c1 = K(0). Also, Var(Zi ) ≤ c2 /hd from Lemma 4. Hence, by
Bernstein’s inequality,
n2 nhd 2
p(x) − ph (x)| > ) ≤ 2 exp −
P(|b ≤ 2 exp −
2c2 h−d + 2c1 h−d /3 4c2
p
whenever ≤ 3c2 /c1 . If we choose = C log(2/δ)/(nhd ) where C = 4c2 then
r !
C
P |b
p(x) − ph (x)| > ≤ δ.
nhd
4.4 Concentration in L∞
Theorem 7 shows that, for each x, pb(x) is close to p(x) with high probability. We would like
a version of this result that holds uniformly over all x. That is, we want a concentration
result for
kb
p − pk∞ = sup |b p(x) − p(x)|.
x
We can write
kb
ph − pk∞ ≤ kb ph − ph k∞ + chβ .
ph − ph k∞ + kph − pk∞ ≤ kb
We can bound the first term using something called bracketing together with Bernstein’s
theorem to prove that,
d
3n2 hd
C
P(kbph − ph k∞ > ) ≤ 4 exp − . (22)
hd+1 28K(0)
11
where N (T, d, ) denotes the -covering number of the metric space (T, d), F is the envelope
function of F and the supremum is taken over the set of all probability measures on Rd . The
quantities A and v are called the VC characteristics of Fh .
Theorem 8 (Giné and Guillou 2002) Assume that the kernel satisfies the above prop-
erty.
1. Let h > 0 be fixed. Then, there exist constants c1 > 0 and c2 > 0 such that, for all
small > 0 and all large n,
ph (x) − ph (x)| > ≤ c1 exp −c2 nhd 2 .
P sup |b (24)
x∈Rd
nhdn
2. Let hn → 0 as n → ∞ in such a way that | log hdn |
→ ∞. Let
s
| log hn |
n ≥ . (25)
nhdn
Then, for all n large enough, (24) holds with h and replaced by hn and n , respectively.
The above theorem imposes minimal assumptions on the kernel K and, more importantly,
on the probability distribution P , whose density is not required to be bounded or smooth,
and, in fact, may not even exist. Combining the above theorem with Lemma 3 we have the
following result.
for some constants C and c where C depends on δ. Choosing h log n/n−1/(2β+d) we have
2 C log n
P sup |bp(x) − p(x)| > 2β/(2β+d) < δ.
x n
We have ignored what happens near the boundary of the sample space. If x is O(h) close to
the boundary, the bias is O(h) instead of O(h2 ). There are a variety of fixes including: data
reflection, transformations, boundary kernels, local likelihood.
12
4.6 Confidence Bands and the CLT
p
Consider first a single point x. Let sn (x) = Var(b
ph (x)). The CLT implies that
N (0, 1) as long as
Xn
lim E[Ln,i |2+δ = 0
n→∞
i=1
for some δ > 0. But this is does not yield a confidence interval for p(x). To see why, let us
write
pbh (x) − p(x) pbh (x) − ph (x) ph (x) − p(x) bias
= + = Zn (x) + √ .
sn (x) sn (x) sn (x) var(x)
Assuming that the optimize the risk by balancing the bias and the variance, the second term
is some constant c. So
pbh (x) − p(x)
N (c, τ 2 (x)).
sn (x)
This means that the usual confidence interval pbh (x) ± zα/2 s(x) will not cover p(x) with
probability tending to 1 − α. One fix for this is to undersmooth the estimator. (We sacrifice
risk for coverage.) An easier approach is just to interpret pbh (x) ± zα/2 s(x) as a confidence
interval for the smoothed density ph (x) instead of p(x).
But this only gives an interval at one point. To get a confidence band we use the bootstrap.
Let Pn be the empirical distribution of X1 , . . . , Xn . The idea is to estimate the distribution
√
Fn (t) = P nhd ||b ph (x) − ph (x)||∞ ≤ t
where pb∗h is constructed from the bootstrap sample X1∗ , . . . , Xn∗ ∼ Pn . Later in the course,
we will show that
P
sup |Fn (t) − Fbn (t)| → 0.
t
1. Let Pn be the empirical distribution that puts mass 1/n at each data point Xi .
13
h= 1
h= 2 h= 3
7. Let
zα zα
`n (x) = pbh (x) − √ , un (x) = pbh (x) + √ .
nhd nhd
See Figure 4.
If you want a confidence band for p you need to reduce the bias (undersmooth). A simple
way to do this is with twicing. Suppose that β = 2 and that we use the kernel estimator pbh .
Note that,
ph (x)] = p(x) + C(x)h2 + o(h2 )
E[b
p2h (x)] = p(x) + C(x)4h2 + o(h2 )
E[b
14
for some C(x). That is, the leading term of the bias is b(x) = C(x)h2 . So if we define
5 Cross-Validation
In practice we need a data-based method for choosing the bandwidth h. To do this, we will
need to estimate the risk of the estimator and minimize the estimated risk over h. Here, we
describe two cross-validation methods.
A common method for estimating risk is leave-one-out cross-validation. Recall that the loss
function is
Z Z Z Z
p(x) − p(x)) dx = pb (x)dx − 2 pb(x)p(x)dx + p2 (x)dx.
(b 2 2
The last term does not involve pb so we can drop it. Thus, we now define the loss to be
Z Z
2
L(h) = pb (x) dx − 2 pb(x)p(x)dx.
where pb(−i) is the density estimator obtained after removing the ith observation.
15
H
It is easy to check that E[R(h)]
b = R(h).
When the kernel is Gaussian, the cross-validation score can be written, after some tedious
algebra, as follows. Let φ(z; σ) denote a Normal density with mean 0 and variance σ 2 . Then,
√ d
φd
(0; 2h) n − 2 XY √
R(h) =
b + φ(X i` − X j` ; 2h) (27)
(n − 1) n(n − 1)2 i6=j `=1
d
2 XY
− φ(Xi` − Xj` ; h). (28)
n(n − 1) i6=j `=1
The estimator pb and the cross-validation score can be computed quickly using the fast Fourier
transform; see pages 61–66 of Silverman (1986).
For histograms, it is easy to work out the leave-one-out cross-validation in close form:
2 n + 1 X b2
R(h)
b = − θ . H
(n − 1)h (n − 1)h j j
A further justification for cross-validation is given by the following theorem due to Stone
(1984).
Theorem 12 (Stone’s theorem) Suppose that p is bounded. Let pbh denote the kernel
estimator with bandwidth h and let b
h denote the bandwidth chosen by cross-validation. Then,
R 2
p(x) − pbbh (x) dx a.s.
→ 1. (29)
inf h (p(x) − pbh (x))2 dx
R
The bandwidth for the density estimator in the bottom left panel of Figure 1 is based on
cross-validation. In this case it worked well but of course there are lots of examples where
there are problems. Do not assume that, if the estimator pb is wiggly, then cross-validation
has let you down. The eye is not a good judge of risk.
There are cases when cross-validation can seriously break down. In particular, if there are
ties in the data then cross-validation chooses a bandwidth of 0.
16
based on bandwidth h. For simplicity, assume the sample size is even and denote the sample
size by 2n. Randomly split the data X = (X1 , . . . , X2n ) into two sets of size n. Denote
these by Y = (Y1 , . . . , Yn ) and Z = (Z1 , . . . , Zn ).1 Let H = {h1 , . . . , hN } be a finite grid of
bandwidths. Let n
1X 1 kx − Yi k
pbj (x) = K .
n i=1 hdj h
Thus we have a set P = {b
p1 , . . . , pb} of density estimators.
R R
We would like to minimize L(p, pbj ) = pb2j (x) − 2 pbj (x)p(x)dx. Define the estimated risk
Z n
b pbj ) = pb2 (x) − 2
X
Lbj ≡ L(p, pbj (Zi ). (30)
j
n i=1
Y → {b
p1 , . . . , pbN } = P
split
X = (X1 , . . . , X2n ) =⇒
Z → {L bN }
b1 , . . . , L
This theorem can be proved using concentration of measure techniques that we discuss later
in class. A similar result can be proved for V -fold cross-validation.
In this section we consider some asymptotic expansions that describe the behavior of the
kernel estimator. We focus on the case d = 1.
Theorem 14 Let RxR = E(p(x) − pb(x))2 and let R = Rx dx. Assume that p00 is absolutely
R
17
and R
K 2 (x)dx
Z
1 4 4 00 2 1
R = σK hn p (x) dx + +O + O(h6n ) (31)
4 nh n
2
R
where σK = x2 K(x)dx.
Proof. Write Kh (x, X) = h−1 K ((x − X)/h) and pb(x) = n−1 i Kh (x, Xi ). Thus, E[b
P
p(x)] =
E[Kh (x, X)] and Var[bp(x)] = n−1 Var[Kh (x, X)]. Now,
x−t
Z
1
E[Kh (x, X)] = K p(t) dt
h h
Z
= K(u)p(x − hu) du
h2 u2 00
Z
0
= K(u) p(x) − hup (x) + p (x) + · · · du
2
Z
1
= p(x) + h2 p00 (x) u2 K(u) du · · ·
2
R R
since K(x) dx = 1 and x K(x) dx = 0. The bias is
1 2 2 00
E[Khn (x, X)] − p(x) = σK hn p (x) + O(h4n ).
2
By a similar calculation,
R
p(x) K 2 (x) dx
1
Var[b
p(x)] = +O .
n hn n
The first result then follows since the risk is the squared bias plus variance. The second
result follows from integrating the first result.
If we differentiate (31) with respect to h and set it equal to 0, we see that the asymptotically
optimal bandwidth is
1/5
c2
h∗ = (32)
c21 A(f )n
where c1 = x2 K(x)dx, c2 = K(x)2 dx and A(f ) = f 00 (x)2 dx. This is informative
R R R
because it tells us that the best bandwidth decreases at rate n−1/5 . Plugging h∗ into (31),
we see that if the optimal bandwidth is used then R = O(n−4/5 ).
6 High Dimensions
The rate of convergence n−2β/(2β+d) is slow when the dimension d is large. In this case it is
hopeless to try to estimate the true density p precisely in the L2 norm (or any similar norm).
18
We need to change our notion of what it means to estimate p in a high-dimensional problem.
Instead of estimating p precisely we have to settle for finding an adequate approximation
of p. Any estimator that finds the regions where p puts large amounts of mass should be
considered an adequate approximation. Let us consider a few ways to implement this type
of thinking.
Ph = P ⊕ K h
where ⊕ denotes convolution2 and Kh is the distribution with density h−d K(kuk/h). In
other words, if X ∼ Ph then X = Y + Z where Y ∼ P and Z ∼ Kh . This is just another
way to say that Ph is a blurred or smoothed version of P . ph need not be close in L2 to p but
still could preserve most of the important shape information about p. Consider then choosing
a fixed h > 0 and estimating ph instead of p. This corresponds to ignoring the bias in the
density estimator. From Theorem 8 we conclude:
2
Theorem 15 Let h > 0 be fixed. Then P(kbph − ph k∞ > ) ≤ Ce−nc . Hence,
r !
log n
kb
ph − ph k∞ = OP .
n
The rate of convergence is fast and is independent of dimension. How to choose h is not
clear.
Independence Based Methods. If we can live with some bias, we can reduce the dimen-
sionality by imposing some independence assumptions. The simplest example is to treat the
components (X1 , . . . , Xd ) as if they are independent. In that case
d
Y
p(x1 , . . . , xd ) = pj (xj )
j=1
19
An extension is to use a forest. We represent the distribution with an undirected graph. A
graph with no cycles is a forest. Let E be the edges of the graph. Any density consistent
with the forest can be written as
d
Y Y pj,k (xj , xk )
p(x) = pj (xj ) .
j=1
pj (xj )pk (xk )
(j,k)∈E
To estimate the density therefore only require that we estimate one and two-dimensional
marginals. But how do we find the edge set E? Some methods are discussed in Liu et al
(2011) under the name “Forest Density Estimation.” A simple approach is to connect pairs
greedily using some measure of correlation.
Density Trees. Ram and Gray (2011) suggest a recursive partitioning scheme similar to
decision trees. They split each coordinate dyadically, in a greedy fashion. The density
estimator is taken to be piecewise constant. They use an L2 risk estimator to decide when to
split. This seems promising. The ideas seems to have been re-discovered in Yand and Wong
(arXiv:1404.1425) and Liu and Wong (arXiv:1401.2597). Density trees seem very promising.
It would be nice if there was an R package to do this and if there were more theoretical
results.
7 Example
Figure 5 shows a synthetic two-dimensional data set, the cross-validation function and two
kernel density estimators. The data are 100 points generated as follows. We select a point
randomly on the unit circle then add Normal noise with standard deviation 0.1 The first
estimator (lower left) uses the bandwidth that minimizes the leave-one-out cross-validation
score. The second uses twice that bandwidth. The cross-validation curve is very sharply
peaked with a clear minimum. The resulting density estimate is somewhat lumpy. This is
because cross-validation is aiming to minimize L2 error which does not guarantee that the
estimate is smooth. Also, the dataset is small so this effect is more noticeable. The estimator
with the larger bandwidth is noticeably smoother. However, the lumpiness of the estimator
is not necessarily a bad thing.
8 Derivatives
Kernel estimators can also be used to estimate the derivatives of a density.3 Let D⊗r p denote
the rth derivative p. We are using Kronecker notation. Let D⊗0 p = p, D⊗1 f is the gradient
3
In this section we follow Chacon and Duong (2013), Electronic Journal of Statistics, 7, 499-532.
20
Risk
0.5 1.0 1.5 2.0 2.5 3.0
Bandwidth
Figure 5: Synthetic two-dimensional data set. Top left: data. Top right: cross-validation
function. Bottom left: kernel estimator based on the bandwidth that minimizes the cross-
validation score. Bottom right: kernel estimator based on the twice the bandwidth that
minimizes the cross-validation score.
21
of p, and D⊗2 p = vecH where H is the Hessian. We also write this as p(r) when convenient.
The asymptotic mean squared error is derived in Chacon, Duong and Wand (2011) and is
given by
1 −1/2 m2 (K)
|H |tr((H −1 )⊗r R(D⊗r (K))) + 2 tr((Idr ⊗ vecT H)R(D⊗(r+2) p)(Idr ⊗ vec(H)))
n 4
R R
where R(g) = g(x)g T (x)dx, m2 (K) = xxT K(x)dx. The optimal H has entries of order
n−2/(d+2r+4) which yield an asymptotic mean squared error of order n−4/(d+2r+4) . In dimension
d = 1, the risk looks like this as a function of r:
r risk
0 n−4/5
1 n−4/7
2 n−4/9
We see that estimating derivatives is harder than estimating the density itself.
One application of this that we consider later in the course is mode-based clustering. Here,
we use density estimation to find the modes of the density. We associate clusters with these
modes. We can also test for a mode by testing if D2 p(x) < 0 at the estimated modes.
22
9 Unsupervised Prediction and Anomaly Detection
We can use density estimation to do unsupervised prediction and anomaly detection. The
basic idea is due to Vovk, and was developed in a statistical framework in Lei, Robins and
Wasserman (2014).
P(Yn+1 ∈ Cn ) ≥ 1 − α.
Fix a value y. Let A = (Y1 , . . . , Yn , y) be the augmented dataset. That is, we set Yn+1 = y.
Let pbA be a density estimate based on A. Consider the vector
Under H0 , the rank of these values is uniformly distributed. That is, for each i,
1
pA (Yi ) ≤ pbA (y)) =
P(b .
n+1
A p-value for the test is
n+1
1 X
π(y) = pA (Yi ) ≤ pbA (y)).
I(b
n + 1 i=1
Computing Cn is tedious. Fortunately, Jing, Robins and Wasserman (2014) show that there is
a simpler set that still has the correct coverage (but is slightly larger). The set is constructed
as follows. Let Zi = pb(Yi ). Order these observations
Z(1) ≤ · · · ≤ Z(n) .
23
Lemma 16 We have that Cn ⊂ Cn+ and hence
P(Yn+1 ∈ Cn ) ≥ 1 − α.
Finally, we note that any Yi with a small p-value can be regarded as an outlier (anomaly).
The above method is exact. We can also use a simpler, asympotic approach. With Z(k)
b = {y : pb(y) ≥ t} where now t = Z(k) . From Cadre, Pelletier and Pudlo
defined above, set C
(2013) we have that
√ P
nhd µ(C∆C)
b →c
for some constant c where C is the true 1 − α level set. Hence, P (Yn+1 ∈ C)
b = 1 − α + oP (1).
Note also that the L2 loss does not make any sense. If you tried to use cross-validation, you
would find that the estimated risk is minimized at h = 0. H
A simple solution is to focus on estimating the smoothed density ph (x) which is well-defined
for every h > 0. More sophisticated ideas are based on topological data analysis which we
discuss later in the course.
11 Series Methods
We have emphasized kernel density estimation. There are many other density estimation
methods. Let us briefly mention a method based on basis functions. For simplicity, suppose
that Xi ∈ [0, 1] and let φ1 , φ2 , . . . be an orthonormal basis for
Z 1
F = {f : [0, 1] → R, f 2 (x)dx < ∞}.
0
24
Thus Z Z
φ2j (x)dx = 1, φj (x)φk (x)dx = 0.
An example is the cosine basis:
√
φ0 (x) = 1, φj (x) = 2 cos(2πjx), j = 1, 2, . . . ,
If p ∈ F then
∞
X
p(x) = βj φj (x)
j=1
R1 Pk
where βj = 0
p(x)φj (x)dx. An estimate of p is pb(x) = j=1 βbj φj (x) where
n
1X
βbj = φj (Xi ).
n i=1
The number of terms k is the smoothing parameter and can be chosen using cross-validation.
The first term is of order O(k/n). To bound the second term (the bias) one usually assumes
that p is a Sobolev space of order q which means that p ∈ P with
( ∞
)
X X
P= p∈F : p= βj φj : βj2 j 2q < ∞ .
j j=1
11.1 L1 Methods
25
VC Classes. Let A be a class of sets with VC dimension ν. As in section 5.2, split the data
X into Y and Z with P = {b p1 , . . . , pbN } constructed from Y . For g ∈ P define
Z
∆n (g) = sup g(x)dx − Pn (A)
A∈A A
−1
Pn
where Pn (A) = n i=1 I(Zi ∈ A). Let pb = argming∈P ∆n (g).
The difficulty in implementing this idea is computing and minimizing ∆n (g). Hjort and
Walker (2001) presented a similar method which can be practically implemented when d = 1.
Yatracos Classes. Devroye and Györfi (2001) use a class of sets called a Yatracos class which
leads to estimators with some remarkable properties. n Let P = {po1 , . . . , pN } be a set of
densities and define the Yatracos class of sets A = A(i, j) : i 6= j where A(i, j) = {x :
pi (x) > pj (x)}. Let
pb = argming∈G ∆(g)
where Z
∆n (g) = sup g(u)du − Pn (A)
A∈A A
Pn
and Pn (A) = n−1 i=1 I(Zi ∈ A) is the empirical measure based on a sample Z1 , . . . , Zn ∼ p.
26
Theorem 18 The estimator pb satisfies
Z Z
|b
p − p| ≤ 3 min |pj − p| + 4∆ (34)
j
R
where ∆ = supA∈A A
p − Pn (A) .
R
The term minj |pj − p| is like a bias while term ∆ is like the variance.
R R
Proof. Let i be such that pb = pi and let s be such that |ps − p| = minj |pj − p|. Let
B = {pi > ps } and C = {ps > pi }. Now,
Z Z Z
|b
p − p| ≤ |ps − p| + |ps − pi |. (35)
Now we apply this to kernel estimators. Again we split the data X into two halves Y =
(Y1 , . . . , Yn ) and Z = (Z1 , . . . , Zn ). For each h let
n
1X kx − Yi k
pbh (x) = K .
n i=1 h
Let n o
A = A(h, ν) : h, ν > 0, h 6= ν
where A(h, ν) = {x : pbh (x) > pbν (x)}. Define
Z
∆n (g) = sup g(u)du − Pn (A)
A∈A A
27
Pn
where Pn (A) = n−1 i=1 I(Zi ∈ A) is the empirical measure based on Z. Let
pb = argming∈G ∆(g).
Under some regularity conditions on the kernel, we have the following result.
The proof involves showing that the terms on the right hand side of (34) are small. We refer
the reader to Devroye and Györfi (2001) for the details.
R
Recall that dT V (P, Q) = supA |P (A) − Q(A)| = (1/2) |p(x) − q(x)|dx where the supremum
is over all measurable sets. The above theorem says that the estimator does well in the
total variation metric, even though the method only used the Yatracos class of sets. Finding
computationally efficient methods to implement this approach remains an open question.
12 Mixtures
Another approach to density estimation is to use mixtures. We will discuss mixture modelling
when we discuss clustering.
In some problems, X is not just high dimensional, it is infinite dimensional. For example
suppose that each Xi is a curve. An immediate problem is that the concept of a density
is no longer well defined. On a Euclidean space, the density p for a probability measure is
28
R
the function that satisfies P (A) = A p(u)dµ(u) for all measurable A where µ is Lebesgue
measure. Formally, we say that p is the Radon-Nikodym derivative of P with respect to the
dominating measure µ. Geometrically, we can think of p as
P(kX − xk ≤ )
p(x) = lim
→0 V ()
where V () = d π d/2 /Γ(d/2 + 1) is the volume of a sphere of radius . Under appropriate
conditions, these two notions of density agree. (This is the Lebesgue density theorem.)
When the outcome space X is a set of curves, there is no dominating measure and hence
there is no density. Instead, we define the density geometrically by
q (x) = P(ξ(x, X) ≤ )
for a small where ξ is some metric on X . However we cannot divide by V () and let tend
to 0 since the dimension d is infinite.
One way around this is to use a fixed and work with the unnormalized density q . For the
purpose of finding high-density regions this may be adequate. An estimate of q is
n
1X
qb (x) = I(ξ(x, Xi ) ≤ ).
n i=1
Pk
An alternative is to expand Xi into a basis: X(t) ≈ j=1 βj ψj (t). A density can be defined
in terms of the βj ’s.
Example 20 Figure 6 shows the tracks (or paths) of 40 North Atlantic tropical cyclones
(TC). The full dataset, consisting of 608 from 1950 to 2006 is shown in Figure 7. Buchman,
Lee and Schafer (2009) provide a thorough analysis of the data. We refer the reader to their
paper for the full details.4
Each data point— that is, each TC track— is a curve in R2 . Various questions are of
interest: Where are the tracks most common? Is the density of tracks changing over time?
Is the track related to other variables such as windspeed and pressure?
Each curve Xi can be regarded as mapping Xi : [0, Ti ] → R2 where Xi (t) = (Xi1 (t), Xi2 (t))
is the position of the TC at time t and Ti is the lifelength of the TC. Let
n o
Γi = (Xi1 (t), Xi2 (t)) : 0 ≤ t ≤ Ti
29
Figure 6: Paths of 40 tropical cyclones in the North Atlantic.
Figure 8 shows the 10 TC’s with highest local density and the the 10 TC’s with lowest local
density using = 16.38. This choice of corresponds to the 10th percentile of the values
{dH (Xi , Xj ) : i 6= j}. The high density tracks correspond to TC’s in the gulf of Mexico with
short paths. The low density tracks correspond to TC’s in the Atlantic with long paths.
15 Miscellanea
Another method for selecting h which is sometimes used when p is thought to be very smooth
is the plug-in method. The idea is to take the formula for the mean squared error (equation
31), insert a guess of p00 and then solve for the optimal bandwidth h. For example, if d = 1
30
Figure 7: Paths of 608 tropical cyclones in the North Atlantic.
Figure 8: 10 highest density paths (black) and 10 lowest density paths (blue).
31
and under the idealized assumption that p is a univariate Normal this yields h∗ = 1.06σn−1/5 .
Usually, σ is estimated by min{s, Q/1.34} where s is the sample standard deviation and Q
is the interquartile range.5 This choice of h∗ works well if the true density is very smooth
and is called the Normal reference rule.
Since we don’t want to necessarily assume that p is very smooth, it is usually better to
estimate h using cross-validation. See Loader (1999) for an interesting comparison between
cross-validation and plugin methods.
A generalization of the kernel method is to use adaptive kernels where one uses a different
bandwidth h(x) for each point x. One can also use a different bandwidth h(xi ) for each data
point. This makes the estimator more flexible and allows it to adapt to regions of varying
smoothness. But now we have the very difficult task of choosing many bandwidths instead
of just one.
Density estimation is sometimes used to find unusual observations or outliers. These are
observations for which pb(Xi ) is very small.
16 Summary
1. A commonly used nonparametric density estimator is the kernel estimator
n
1X 1 kx − Xi k
pbh (x) = K .
n i=1 hd h
5
Recall that the interquartile range is the 75th percentile minus the 25th percentile. The reason for dividing
by 1.34 is that Q/1.34 is a consistent estimate of σ if the data are from a N (µ, σ 2 ).
32
Nonparametric Regression
Statistical Machine Learning, Spring 2019
Ryan Tibshirani and Larry Wasserman
1 Introduction
1.1 Basic setup
Given a random pair (X, Y ) ∈ Rd × R, recall that the function
m0 (x) = E(Y |X = x)
is called the regression function (of Y on X). The basic goal in nonparametric regression: to
construct a predictor of Y given X. This is basically the same as constructing an estimate m b
of m0 , from i.i.d. samples (Xi , Yi ) ∈ Rd × R, i = 1, . . . , n. Given a new X, our prediction of
b
Y is m(X). We often call X the input, predictor, feature, etc., and Y the output, outcome,
response, etc.
Note for i.i.d. samples (Xi , Yi ) ∈ Rd × R, i = 1, . . . , n, we can always write
Yi = m0 (Xi ) + i , i = 1, . . . , n,
where i , i = 1, . . . , n are i.i.d. random errors, with mean zero. Therefore we can think
about the sampling distribution as follows: (Xi , i ), i = 1, . . . , n are i.i.d. draws from some
common joint distribution, where E(i ) = 0, and Yi , i = 1, . . . , n are generated from the
above model.
It is common to assume that each i is independent of Xi . This is a very strong as-
sumption, and you should think about it skeptically. We too will sometimes make this
assumption, for simplicity. It should be noted that a good portion of theoretical results
that we cover (or at least, similar theory) also holds without this assumption.
Yi = m0 (Xi ) + i , i = 1, . . . , n,
where now Xi , i = 1, . . . , n are fixed inputs, and i , i = 1, . . . , n are i.i.d. with E(i ) = 0.
For arbitrary Xi , i = 1, . . . , n, this is really just the same as starting with the random
input model, and conditioning on the particular values of Xi , i = 1, . . . , n. (But note: after
conditioning on the inputs, the errors are only i.i.d. if we assumed that the errors and inputs
were independent in the first place.)
1
Generally speaking, nonparametric regression estimators are not defined with the ran-
dom or fixed setups specifically in mind, i.e., there is no real distinction made here. A
caveat: some estimators (like wavelets) do in fact assume evenly spaced fixed inputs, as in
Xi = i/n, i = 1, . . . , n,
for evenly spaced inputs in the univariate case.
Theory is not completely the same between the random and fixed input worlds (some
theory is sharper when we assume fixed input points, especially evenly spaced input points),
but for the most part the theory is quite similar.
Therefore, in what follows, we won’t be very precise about which setup we assume—
random or fixed inputs—because it mostly doesn’t matter when introducing nonparametric
regression estimators and discussing basic properties.
1.3 Notation
We will define an empirical norm k · kn in terms of the training points Xi , i = 1, . . . , n,
acting on functions m : Rd → R, by
n
1X 2
kmk2n = m (Xi ).
n
i=1
This makes sense no matter if the inputs are fixed or random (but in the latter case, it is a
random norm)
When the inputs are considered random, we will write PX for the distribution of X, and
we will define the L2 norm k · k2 in terms of PX , acting on functions m : Rd → R, by
Z
kmk2 = E[m (X)] = m2 (x) dPX (x).
2 2
So when you see k · k2 in use, it is a hint that the inputs are being treated as random
A quantity of interest will be the (squared) error associated with an estimator m b of m0 ,
which can be measured in either norm:
b − m0 k2n or km
km b − m0 k22 .
b is itself random). We will study bounds
In either case, this is a random quantity (since m
in probability or in expectation. The expectation of the errors defined above, in terms of
either norm (but more typically the L2 norm) is most properly called the risk; but we will
often be a bit loose in terms of our terminology and just call this the error.
b
where bn (x) = E[m(x)] − m(x) is the bias, v(x) = Var(m(x))
b is the variance and τ 2 =
E(Y − m(X))2 is the un-avoidable error. Generally, we have to choose tuning parameters
carefully to balance the bias and variance.
2
1.5 What does “nonparametric” mean?
Importantly, in nonparametric regression we don’t assume a particular parametric form
for m0 . This doesn’t mean, however, that we can’t estimate Pp bm0 using (say) a linear com-
b
bination of spline basis functions, written as m(x) = j=1 βj gj (x). A common question:
the coefficients on the spline basis functions β1 , . . . , βp are parameters, so how can this be
nonparametric? Again, the point is that we don’t assume a parametric form for m0 , i.e.,
we don’t assume that m0 itself is an exact linear combination of splines basis functions
g1 , . . . , gp .
(There is an extension to real valued β but we will not need that.) If g ∈ H(β, L) and
` = β − 1, then we can define the Taylor approximation of g at x by
(y − x)` (`)
ge(y) = g(y) + (y − x)g 0 (x) + · · · + g (x)
`!
and then |g(y) − ge(y)| ≤ |y − x|β .
3
The definition for higher dimensions is similar. Let X be a compact subset of Rd . Let
β and L be positive numbers. Given a vector s = (s1 , . . . , sd ), define |s| = s1 + · · · + sd ,
s! = s1 ! · · · sd !, xs = xs11 · · · xsdd and
∂ s1 +···+sd
Ds = .
∂xs11 · · · ∂xsdd
(1)
For example, if d = 1 and β = 2 this means that
The most common case is β = 2; roughly speaking, this means that the functions have
bounded second derivatives.
Again, if g ∈ Hd (β, L) then g(x) is close to its Taylor series approximation:
where
X (u − x)s
gx,β (u) = Ds g(x). (3)
s!
|s|≤β
The Sobolev class S1 (β, L) is the set of β times differentiable functions (technically, it
only requires weak derivatives) g : R → R such that
Z
(g (β) (x))2 dx ≤ L2 .
2 k-nearest-neighbors regression
Here’s a basic method to start us off: k-nearest-neighbors regression. We fix an integer
k ≥ 1 and define
1 X
b
m(x) = Yi , (4)
k
i∈Nk (x)
4
This is not at all a bad estimator, and you will find it used in lots of applications, in
many cases probably because of its simplicity. By varying the number of neighbors k, we can
b with small k corresponding
achieve a wide range of flexibility in the estimated function m,
to a more flexible fit, and large k less flexible.
But it does have its limitations, an apparent one being that the fitted function m b
essentially always looks jagged, especially for small or moderate k. Why is this? It helps to
write
Xn
b
m(x) = wi (x)Yi , (5)
i=1
where the weights wi (x), i = 1, . . . , n are defined as
(
1/k if Xi is one of the k nearest points to x
wi (x) =
0 else.
b − m0 k22 . n−2/(2+d) .
Ekm (6)
See Chapter 6.3 of Gyorfi et al. (2002). Later, we will see that this is optimal.
Proof sketch: assume that Var(Y |X = x) = σ 2 , a constant, for simplicity, and fix
(condition on) the training points. Using the bias-variance tradeoff,
2 2 2
b
E m(x) − m0 (x) b
= E[m(x)] − m0 (x) + E m(x)b − E[m(x)]
b
| {z } | {z }
Bias2 (m(x))
b Var(m(x))
b
X 2
1 σ2
= m0 (Xi ) − m0 (x) +
k k
i∈Nk (x)
2
L X σ2
≤ kXi − xk2 + .
k k
i∈Nk (x)
In the last line we used the Lipschitz property |m0 (x) − m0 (z)| ≤ Lkx − zk2 , for some
constant L > 0. Now for “most” of the points we’ll have kXi − xk2 ≤ C(k/n)1/d , for a
5
1e+06
●
8e+05
6e+05
eps^(−(2+d)/d)
4e+05
●
2e+05
●
0e+00
●
● ● ● ● ● ●
2 4 6 8 10
Dimension d
constant C > 0. (Think of a having input points Xi , i = 1, . . . , n spaced equally over (say)
[0, 1]d .) Then our bias-variance upper bound becomes
2/d
k 2 σ2
(CL) + ,
n k
We can minimize this by balancing the two terms so that they are equal, giving k 1+2/d n2/d ,
i.e., k n2/(2+d) as claimed. Plugging this in gives the error bound of n−2/(2+d) , as claimed.
6
look at classes of functions with more structure. One such example is the additive model,
covered later in the notes.
Warning! Don’t confuse this with the notion of kernels in RKHS methods
which we cover later.
Given a bandwidth h > 0, the (Nadaraya-Watson) kernel regression estimate is defined
as
Xn
kx − Xi k2
K Yi
h X
i=1
b
m(x) = n = wi (x)Yi (8)
X kx − Xi k2
K i
h
i=1
P
where wi (x) = K(kx − Xi k2 /h)/ nj=1 K(kx − xj k2 /h). Hence kernel smoothing is also a
linear smoother.
In comparison to the k-nearest-neighbors estimator in (4), which can be thought of as
a raw (discontinuous) moving average of nearby responses, the kernel estimator in (8) is a
smooth moving average of responses. See Figure 2 for an example with d = 1.
7
192 6. Kernel Smoothing Methods
1.5
O O O O
O O
O O
O O O O O O O O
O O O O O O O O
fˆO(x0 ) OO O
O OO O OO
fˆ(x0 ) OO O
OO OO
1.0
1.0
O O O OO O O OO OO
O O
OO
O
OO
O O
OO
O
O
O OO O
•
O
O
OO O
OO
O
OO
O
OO
O O
OO
O
O
O OO
• O
OO O
O
O
OO
O
OOO OO OOO OO
0.5
0.5
O O O O O O
O O
O O O O O O O O
O O O O
O OO O O O OO O O
O OO
O O O O OO
O O O
O O O O O O
O O
0.0
0.0
O O O OO O O O OO
O O O O O O
O OO O OO
-0.5
-0.5
O O O O
O O O O O O
O O
-1.0
-1.0
O O
O O
O O
0.0 0.2 0.4 x0 0.6 0.8 1.0 0.0 0.2 0.4 x0 0.6 0.8 1.0
FIGURE 6.1. In each panel 100 pairs x , y are generated at random from the
i i
Figure 2: Comparing k-nearest-neighbor and Epanechnikov kernels, when d = 1. From
blue curve
Chapter 6 of with
HastieGaussian errors: Y = sin(4X) + ε, X ∼ U [0, 1], ε ∼ N (0, 1/3). In
et al. (2009)
the left panel the green curve is the result of a 30-nearest-neighbor running-mean
smoother. The red point is the fitted constant fˆ(x0 ), and the red circles indicate
Theorem. Suppose that d = 1 and that m00 is bounded. Also suppose that X has a
those observations contributing to the fit at x0 . The solid yellow region indicates
non-zero, differentiable density p and that the support is unbounded. Then, the risk is
the weights assigned to Zobservations.
In
the right panel, the green curve is the
2Z
kernel-weighted h 4 p0 (x) 2 ) window width
Rnaverage,
= n using an Epanechnikov
x2 K(x)dx m00 (x) +kernel
2m0 (x)with (half
dx
λ = 0.2. 4 p(x)
R Z
σ2 K 2 (x)dx dx 1
+ +o + o(h4n )
nhn p(x) nhn
where p is the density of PX .
6.1 One-Dimensional Kernel Smoothers
InThe
Chapter 2, we
first term motivated
is the theThe
squared bias. k–nearest-neighbor average
dependence on p and p0 is the design bias and
is undesirable. We’ll fix this problem later using local linear smoothing. It follows that the
optimal bandwidth is hn ≈ n−1/5 fˆ(x)yielding
= Ave(y i |xof
a risk i ∈n Nk (x))
−4/5 . In d dimensions, the term(6.1) nhn
becomes nhn . In that case It follows that the optimal bandwidth is hn ≈ n−1/(4+d) yielding
d
a as
riskan n−4/(4+d) . of the regression function E(Y |X = x). Here Nk (x) is the set
ofestimate
ofIfk the support
points has boundaries
nearest then there
to x in squared is bias ofand
distance, order
Ave O(h) near the
denotes theboundary.
average
This happens because of the asymmetry of the kernel weights
(mean). The idea is to relax the2 definition of conditional expectation, in such regions. See Figure
as
3. Specifically, the bias is of order O(h ) in the interior but is of order O(h) near the
illustrated in the left panel of Figure 6.1, and compute an average in a
boundaries. The risk then becomes O(h3 ) instead of O(h4 ). We’ll fix this problems using
neighborhood
local of the
linear smoothing. target
Also, point.
the result In this
above dependscaseonwe have used
assuming that Pthe 30-nearest
X has a density.
Weneighborhood—the
can drop that assumption fit at(and
x0 is thefor
allow average of the
boundaries) and30getpairs whose
a slightly xi values
weaker result
are closest to x0 . The green curve is traced out as we apply this definition
due to Gyorfi, Kohler, Krzyzak and Walk (2002).
atFor simplicity,
different we will
values x0 .use
Thethegreen
spherical kernel
curve I(kxkfˆ
K(kxk) =since
is bumpy, ≤(x)
1); is
the results can be
discontinuous
extended to other kernels. Hence,
in x. As we move xP0 from left to right, thePk-nearest neighborhood remains
n n
constant, until Yii I(kX i − right
xk ≤ h) i=1 Yi I(kX i − xk than
≤ h) the furthest
b a=point
m(x) P i=1 x
n
to the of =x0 becomes closer
point x ′ in the neighborhood
i i=1 I(kXi − ≤ h)
toxkthe left of x , natPnwhich
(B(x, h))
time x replaces x ′ . 0 i i
The average in (6.1) changes in a discrete way, leading to a discontinuous
8
fˆ(x).
This discontinuity is ugly and unnecessary. Rather than give all the
points in the neighborhood equal weight, we can assign weights that die
off smoothly with distance from the target point. The right panel shows
where Pn is the empirical measure and B(x, h) = {u : kx − uk ≤ h}. If the denominator
b
is 0 we define m(x) = 0. The proof of the following theorem is from Chapter 5 of Györfi,
Kohler, Krzyżak and Walk (2002).
Theorem: Risk bound without density. Suppose that the distribution of X has
compact support and that Var(Y |X = x) ≤ σ 2 < ∞ for all x. Then
c2
sup b − mk2P ≤ c1 h2 +
Ekm . (9)
P ∈Hd (1,L) nhd
The proof is in the appendix. Note that the rate n−2/(d+2) is slower than the pointwise
rate n−4/(d+2) because we have made weaker assumptions.
Recall from (7) we saw that this was the minimax optimal rate over Hd (1, L). More
generally, the minimax rate over Hd (α, L), for a constant L > 0, is
see again Chapter 3.2 of Gyorfi et al. (2002). However, as we saw above, with extra condi-
tions, we got the rate n−4/(4+d) which is minimax for Hd (2, L). We’ll get that rate under
weaker conditions with local linear regression.
If the support of the distribution of X lives on a smooth manifold of dimension r < d
then the term Z
dP (x)
nP (B(x, h))
is of order 1/(nhr ) instead of 1/(nhd ). In that case, we get the improved rate n−2/(r+2) .
over all θ ∈ R. In other words, Instead we could consider forming the local estimate
b
m(x) =α bx + βbxT x, where α
bx , βbx minimize
Xn
kx − Xi k
K (Yi − α − β T Xi )2 .
h
i=1
9
6.1 One-Dimensional Kernel Smoothers 195
1.5
N-W Kernel at Boundary Local Linear Regression at Boundary
1.5
O O O O O O
O O
OO O OO O
O O
O O OO O O O OO O
O O O O
O O O O
1.0
1.0
OO O O OO O O
O O O O O O
O O O O O
OO OO O O O O O
OO OO
O O O O O O
O O O O
fˆ(x ) O O O O
O O
OO O 0 O O OO O O
OO O OO
O O OO
OO
fˆ(x )
0.5
0.5
O O
O O O O
O • O O
O
O
0 O
O
OOOO O O
O
O
O
O OO O
O
O
• O
OOOO O O
O
O
O
O
O OO O
O
O O
0.0
0.0
O O O O O O O O
O O O O
O O O O
O O O O
O O O O O O
O
O O O
O O
-0.5
-0.5
O OO O OO
O O
O O O O
O O
O O
O O
-1.0
-1.0
0.0 x0 0.2 0.4 0.6 0.8 1.0 0.0 x0 0.2 0.4 0.6 0.8 1.0
FIGURE 6.3. The locally weighted average has bias problems at or near the
Figure 3: Comparing
boundaries (Nadaraya-Watson)
of the domain. kernel smoothing
The true function to local linear
is approximately regression;
linear here, butthe
former
most of the observations in the neighborhood have a higher mean than the targetof
is biased at the boundary, the latter is unbiased (to first-order). From Chapter 6
Hastie
point,etsoal.despite
(2009) weighting, their mean will be biased upwards. By fitting a locally
weighted linear regression (right panel), this bias is removed to first order
b
We can rewrite the local linear regression estimate m(x). This is just given by a weighted
least squares fit, so
b
m(x) = b(x)T (B T ΩB)−1 B T ΩY,
because of the asymmetry of the kernel in that region. By fitting straight
where b(x) = (1, x) ∈ Rd+1 , B ∈ Rn×(d+1) with ith row b(Xi ), and Ω ∈ Rn×n is diagonal
linesithrather than constants locally, we can remove this bias exactly =to first
TY ,
with diagonal element K(kx−X i k2 /h). We can write more concisely as m(x) b w(x)
order;
where see=Figure
w(x) ΩB(B T6.3ΩB)(right
−1 b(x),panel).
which showsActually, this bias
local linear can be
regression is apresent in the
linear smoother
interior of the domain as well, if the X values are not equally spaced (for
too.
the
The same
vectorreasons,
of fitted but
valuesusually
b = (m(x
µ b less
1 ), . severe).
b n ))Again
. . , m(x can be locally
expressedweighted
as linear
regression will make a first-order correction.
w1 (x)T Y
Locally weighted regression
.. solves a separate T −1weighted
T least squares prob-
b=
µ . = B(B ΩB) B ΩY = SY
lem at each target point x0 :T
wn (x) Y
N
!
which should look familiar
min to youKfrom weighted least squares. 2
λ (x0 , xi ) [yi − α(x0 ) − β(x0 )xi ] . (6.7)
Now we’ll sketch how 0the
α(x0 ),β(x ) local linear fit reduces the bias, fixing (conditioning on) the
i=1
training points. Compute at a fixed point x,
The estimate is then fˆ(x0 ) = α̂(x0 )X n+ β̂(x0 )x0 . Notice that although we fit
b
m(x)] =
an entire linear model to the data in the
E[ wi (x)m 0 (Xi ).we only use it to evaluate
region,
i=1
the fit at the single point x0 .
Define
Using theexpansion
a Taylor vector-valued function
of m0 about x, b(x)T = (1, x). Let B be the N × 2
T
regression matrix with ithX n row b(xi ) , andX nW(x0 ) the N × N diagonal
matrix withE[ith
b diagonal
m(x)] = m0 (x)element
wi (x) K
+λ (x00(x)
∇m , xi ). Then
T
(Xi − x)wi (x) + R,
i=1 i=1
fˆ(x0 ) = b(x0 )T (BT W(x0 )B)−1 BT W(x0 )y (6.8)
N
! 10
= li (x0 )yi . (6.9)
i=1
Equation (6.8) gives an explicit expression for the local linear regression
where the remainder term R contains quadratic and higher-order terms, and under regular-
ity conditions, is small. One can check that in fact for the local linear regression estimator
b
m,
n
X Xn
wi (x) = 1 and (Xi − x)wi (x) = 0,
i=1 i=1
b
and so E[m(x)] = m0 (x) + R, which means that m b is unbiased to first-order.
It can be shown that local linear regression removes boundary bias and design bias.
b is
Theorem. Under some regularity conditions, the risk of m
Z Z 2 Z Z
h4n 1
tr(m00 (x) K(u)uuT du) dP (x)+ d K 2 (u)du σ 2 (x)dP (x)+o(h4n +(nhdn )−1 ).
4 nhn
For a proof, see Fan & Gijbels (1996). For points near the boundary, the bias is
Ch2 m00 (x) + o(h2 ) whereas, the bias is Chm0 (x) + o(h) for kernel estimators.
In fact, Fan (1993) shows a rather remarkable result. Let Rn be the minimax risk for
estimating m(x0 ) over the class of functions with bounded second derivatives in a neighbor-
hood of x0 . Let the maximum risk rn of the local linear estimator with optimal bandwidth
satisfies
Rn
1 + o(1) ≥ ≥ (0.896)2 + o(1).
rn
Rn
Moreover, if we compute the minimax risk over all linear estimators we get rn → 1.
b
m(x) = b(x)(B T ΩB)−1 B T Ωy = w(x)T y,
where b(x) = (1, x, . . . , xk ), B is an n × (k + 1) matrix with ith row b(Xi ) = (1, Xi , . . . , Xik ),
and Ω is as before. Hence again, local polynomial regression is a linear smoother
Assuming that m0 ∈ H1 (α, L) for a constant L > 0, a Taylor expansion shows that the
local polynomial estimator m b of order k, where k is the largest integer strictly less than α
and where the bandwidth scales as h n−1/(2α+1) , satisfies
b − m0 k22 . n−2α/(2α+1) .
Ekm
11
1.0
0.5
0.0
−0.5
See Chapter 1.6.1 of Tsybakov (2009). This matches the lower bound in (11) (when d = 1)
In multiple dimensions, d > 1, local polynomials become kind of tricky to fit, because of
the explosion in terms of the number of parameters we need to represent a kth order poly-
nomial in d variables. Hence, an interesting alternative is to return back kernel smoothing
but use a higher-order kernel. A kernel function K is said to be of order k provided that
Z Z Z
K(t) dt = 1, t K(t) dt = 0, j = 1, . . . , k − 1, and 0 < tk K(t) dt < ∞.
j
This means that the kernels we were looking at so far were of order 2
An example of a 4th-order kernel is K(t) = 83 (3 − 5t2 )1{|t| ≤ 1}, plotted in Figure 4.
Notice that it takes negative values.
Lastly, while local polynomial regression and higher-order kernel smoothing can help
“track” the derivatives of smooth functions m0 ∈ Hd (α, L), α ≥ 2, it should be noted that
they don’t share the same universal consistency property of kernel smoothing (or k-nearest-
neighbors). See Chapters 5.3 and 5.4 of Gyorfi et al. (2002)
4 Splines
Suppose that d = 1. Define an estimator by
n
X Z 1
2
b = argmin
m Yi − m(Xi ) +λ m00 (x)2 dx. (12)
f i=1 0
Spline Lemma. The minimizer of (25) is a cubic spline with knots at the data points.
(Proof in the Appendix.)
12
The key result presented above tells us that we can choose a basis η1 , . . . , ηn for the set
of kth-order natural splines with knots over x1 , . . . , xn , and reparametrize the problem as
Xn Xn 2 Z 1X n 2
00
b
β = argmin Yi − βj ηj (Xi ) + λ βj ηj (x) dx. (13)
β∈Rn i=1 j=1 0 j=1
showing the smoothing spline problem to be a type of generalized ridge regression problem.
In fact, the solution in (29) has the explicit form
βb = (N T N + λΩ)−1 N T Y,
b = (m(x
and therefore the fitted values µ b 1 ), . . . , m(x
b n )) are
b = N (N T N + λΩ)−1 N T Y ≡ SY.
µ (16)
Therefore, once again, smoothing splines are a type of linear smoother
A special property of smoothing splines: the fitted values in (30) can be computed
in O(n) operations. This is achieved by forming N from the B-spline basis (for natural
splines), and in this case the matrix N T N + ΩI ends up being banded (with a bandwidth
that only depends on the polynomial order k). In practice, smoothing spline computations
are extremely fast
(The Sobolev class Sd (m, C) in d dimensions can be defined similarly, where we sum over
all partial derivatives of order m.)
Assuming m0 ∈ S1 (m, C) for the underlying regression function, where C > 0 is a
constant, the smoothing spline estimator mb of polynomial order k = 2m − 1 with tuning
parameter λ n1/(2m+1) n1/(k+2) satisfies
b − m0 k2n . n−2m/(2m+1) in probability.
km
The proof of this result uses much more fancy techniques from empirical process theory
(entropy numbers) than the proofs for kernel smoothing. See Chapter 10.1 of van de Geer
(2000) This rate is seen to be minimax optimal over S1 (m, C) (e.g., Nussbaum (1985)).
13
5 Mercer kernels, RKHS
5.1 Hilbert Spaces
A Hilbert space is a complete inner product space. We will see that a reproducing kernel
Hilbert space (RKHS) is a Hilbert space with extra structure that makes it very useful for
statistics and machine learning.
An example of a Hilbert space is
n Z o
2
L2 [0, 1] = f : [0, 1] → R : f <∞
14
5.4 Mercer Kernels
A RKHS is defined by a Mercer kernel. A Mercer kernel K(x, y) is a function of two
variables that is symmetric and positive definite. This means that, for any function f ,
Z Z
K(x, y)f (x)f (y)dx dy ≥ 0.
(This is like the definition of a positive definite matrix: xT Ax ≥ 0 for each x.)
Our main example is the Gaussian kernel
||x−y||2
K(x, y) = e− σ2 .
Given a kernel K, let Kx (·) be the function obtained by fixing the first coordinate. That
is, Kx (y) = K(x, y). For the Gaussian kernel, Kx is a Normal, centered at x. We can create
functions by taking linear combinations of the kernel:
k
X
f (x) = αj Kxj (x).
j=1
P P
Given two such functions f (x) = kj=1 αj Kxj (x) and g(x) = m j=1 βj Kyj (x) we define an
inner product XX
hf, gi = hf, giK = αi βj K(xi , yj ).
i j
In general, f (and g) might be representable in more than one way. You can check that
hf, giK is independent of how f (or g) is represented. The inner product defines a norm:
p sX X √
||f ||K = hf, f, i = αj αk K(xj , xk ) = αT Kα
j k
This follows from the definition of hf, gi where we take g = Kx . This implies that
15
This is called the reproducing property. It also implies that Kx is the representer of the
evaluation functional.
The completion of H0 with respect to || · ||K is denoted by HK and is called
the RKHS generated by K.
To verify that this is a well-defined Hilbert space, you should check that the following
properties hold:
hf, gi = hg, f i
hcf + dg, hi = chf, hi + chg, hi
hf, f i = 0 iff f = 0.
The last one is not obvious so let us verify it here. It is easy to see that f = 0 implies
that hf, f i = 0. Now we must show that hf, f i = 0 implies that f (x) = 0. So suppose that
hf, f i = 0. Pick any x. Then
5.6 Examples
Example 1. Let H be all functions f on R such that the support of the Fourier transform
of f is contained in [−a, a]. Then
sin(a(y − x))
K(x, y) =
a(y − x)
and Z
hf, gi = f g.
Then
K(x, y) = (xy)−1 e−x sinh(y)I(0 < x ≤ y) + e−y sinh(x)I(0 < y ≤ x)
and Z 1
2
||f || = (f 2 (x) + (f 0 (x))2 )x2 dx.
0
16
Example R3. The Sobolev space of order m is (roughly speaking) the set of functions f
such that (f (m) )2 < ∞. For m = 1 and X = [0, 1] the kernel is
( 2 3
1 + xy + xy2 − y6 0 ≤ y ≤ x ≤ 1
K(x, y) = 2 3
1 + xy + yx2 − x6 0 ≤ x ≤ y ≤ 1
and Z 1
||f ||2K = f 2 (0) + f 0 (0)2 + (f 00 (x))2 dx.
0
for some α1 , . . . , αn .
17
5.9 RKHS Regression
b to minimize
Define m X
R= (Yi − m(Xi ))2 + λ||m||2K .
i
Pn
b
By the representer theorem, m(x) = i=1 αi K(xi , x). Plug this into R and we get
R = ||Y − Kα||2 + λαT Kα
where Kjk = K(Xj , Xk ) is the Gram matrix. The minimizer over α is
b = (K + λI)−1 Y
α
P
b
and m(x) = j bj K(Xi , x). The fitted values are
α
Yb = Kb
α = K(K + λI)−1 Y = LY.
So this is a linear smoother.
We can use cross-validation to choose λ. Compare this with smoothing kernel
regression.
subject to 0 ≤ αi ≤ C.
The RKHS version is to minimize
X λ
J= [1 − Yi f (Xi )]+ + ||f ||2K .
2
i
The dual is the same except that hXi , Xj i is replaced with K(Xi , Xj ). This is called the
kernel trick.
18
5.12 The Kernel Trick
This is a fairly general trick. In many algorithms you can replace hxi , xj i with K(xi , xj ) and
get a nonlinear version of the algorithm. This is equivalent to replacing x with Φ(x) and
replacing hxi , xj i with hΦ(xi ), Φ(xj )i. However, K(xi , xj ) = hΦ(xi ), Φ(xj )i and K(xi , xj ) is
much easier to compute.
In summary, by replacing hxi , xj i with K(xi , xj ) we turn a linear procedure into a
nonlinear procedure without adding much computation.
6 Linear smoothers
6.1 Definition
Every
P estimator we have discussed so far is a linear smoother meaning that m(x) b =
w (x)Y for some weights w (x) that do not depend on the Y 0 s. Hence, the fitted values
i i i i i
b = (m(X
µ b 1 ), . . . , m(X
b n )) are of the form µ b = SY for some matrix S ∈ Rn×n depending on
the inputs X1 , . . . , Xn —and also possibly on a tuning parameter such as h in kernel smooth-
ing, or λ in smoothing splines—but not on the Yi ’s. We call S, the smoothing matrix. For
comparison, recall that in linear regression, µ b = HY for some projection matrix H.
For linear smoothers µ b = SY , the effective degrees of freedom is defined to be
n
X
ν ≡ df(b
µ) ≡ Sii = tr(S),
i=1
19
6.2 Cross-validation
K-fold cross-validation can be used to estimate the prediction error and choose tuning
parameters.
For linear smoothers µ b = (m(x
b 1 ), . . . m(x
b n )) = SY , leave-one-out cross-validation can
be particularly appealing because in many cases we have the seemingly magical reduction
n n
1X 2 1 X Yi − m(X b i) 2
b =
CV(m) Yi − mb −i (Xi ) = , (17)
n n 1 − Sii
i=1 i=1
where b −i
m denotes the estimated regression function that was trained on all but the ith pair
(Xi , Yi ). This leads to a big computational savings since it shows us that, to compute leave-
one-out cross-validation error, we don’t have to actually ever compute m b −i , i = 1, . . . , n.
Why does (17) hold, and for which linear smoothers µ b = Sy? Just rearranging (17)
perhaps demystifies this seemingly magical relationship and helps to answer these questions.
Suppose we knew that m b had the property
1
b −i (Xi ) =
m b i ) − Sii Yi .
m(X (18)
1 − Sii
That is, to obtain the estimate at Xi under the function m b −i fit on all but (Xi , Yi ), we take
the sum of the linear weights
P (from our original fitted function m)b across all but the ith
b i ) − Sii Yi = i6=j Sij Yj , and then renormalize so that these weights sum to 1.
point, m(X
This is not an unreasonable property; e.g., we can immediately convince ourselves that
it holds for kernel smoothing. A little calculation shows that it also holds for smoothing
splines (using the Sherman-Morrison update formula).
From the special property (18), it is easy to show the leave-one-out formula (17). We
have
1 Yi − m(X b i)
Yi − mb −i (Xi ) = Yi − b i ) − Sii Yi =
m(X ,
1 − Sii 1 − Sii
and then squaring both sides and summing over n gives (17).
Finally, generalized cross-validation is a small twist on the right-hand side in (17) that
gives an approximation to leave-one-out cross-validation error. It is defined as by replacing
the appearances of diagonal terms Sii with the average diagonal term tr(S)/n,
n
1 X Yi − m(X b i) 2 b
GCV(m) b = = (1 − ν/n)−2 R
n 1 − tr(S)/n
i=1
7 Additive models
7.1 Motivation and definition
Computational efficiency and statistical efficiency are both very real concerns as the dimen-
sion d grows large, in nonparametric regression. If you’re trying to fit a kernel, thin-plate
20
spline, or RKHS estimate in > 20 dimensions, without any other kind of structural con-
straints, then you’ll probably be in trouble (unless you have a very fast computer and tons
of data).
Recall from (11) that the minimax rate over the Holder class Hd (α, L) is n−2α/(2α+d) ,
which has an exponentially bad dependence on the dimension d. This is usually called the
curse of dimensionality (though the term apparently originated with Bellman (1962), who
encountered an analogous issue but in a separate context—dynamic programming).
What can we do? One answer is to change what we’re looking for, and fit estimates
with less flexibility in high dimensions. Think of a linear model in d variables: there is a
big difference between this and a fully nonparametric model in d variables. Is there some
middle man that we can consider, that would make sense?
Additive models play the role of this middle man. Instead of considering a full d-
dimensional function of the form
b j − madd
Ekm 2
j k2 . n
−2α/(2α+1)
, j = 1, . . . , d.
add
Hence each component of the best additive approximation f to m0 can be estimated
at the optimal univariate rate. Loosely speaking, though we cannot hope to recover m0
arbitrarily, we can recover its major structure along the coordinate axes.
7.2 Backfitting
Estimation with additive models is actually very simple; we can just choose our favorite
univariate smoother (i.e., nonparametric estimator), and cycle through estimating each
21
function mj , j = 1, . . . , d individually (like a block coordinate descent algorithm). Denote
the result of running our chosen univariate smoother to regress Y = (Y1 , . . . , Yn ) ∈ Rn over
the input points Z = (Z1 , . . . , Zn ) ∈ Rn as
b = Smooth(Z, Y ).
m
E.g., we might choose Smooth(·, ·) to be a cubic smoothing spline with some fixed value of
the tuning parameter λ, or even with the tuning parameter selected by generalized cross-
validation
Once our univariate smoother has been chosen, we initialize m b 1, . . . , m
b d (say, to all to
zero) and cycle over the following steps for j = 1, . . . , d, 1, . . . , d, . . .:
P
1. define ri = Yi − `6=j mb ` (xi` ), i = 1, . . . , n;
This algorithm is known as backfitting. In last step above, we are removing the mean from
each fitted function mb j , j = 1, . . . , d, otherwise the model would not be identifiable. Our
final estimate therefore takes the form
b
m(x) b 1 (x(1)) + · · · + m(x(d))
=Y +m b
P
where Y = n1 ni=1 Yi . Hastie & Tibshirani (1990) provide a very nice exposition on the
some of the more practical aspects of backfitting and additive models.
In many cases, backfitting is equivalent to blockwise coordinate descent performed on
a joint optimization criterion that determines the total additive estimate. E.g., for the
additive cubic smoothing spline optimization problem,
n
X d
X 2 X
d Z 1
b 1, . . . , m
m b d = argmin Yi − mj (xij ) + λj m00j (t)2 dt,
m1 ,...,md 0
i=1 j=1 j=1
22
7.3 Error rates
Error rates for additive models are both kind of what you’d expect and surprising. What
you’d expect: if the underlying function m0 is additive, and we place standard assumptions
on its component functions, such as f0,j ∈ S1 (m, C), j = 1, . . . , d, for a constant C > 0,
a somewhat straightforward argument building on univariate minimax theory gives us the
lower bound
inf sup Ekmb − m0 k22 & dn−2m/(2m+1) .
m
b m ∈⊕d S (m,C)
0 j=1 1
This is simply d times the univariate minimax rate. (Note that we have been careful to
track the role of d here, i.e., it is not being treated like a constant.) Also, standard methods
like backfitting with univariate smoothing splines of polynomial order k = 2m − 1, will also
match this upper bound in error rate (though the proof to get the sharp linear dependence
on d is a bit trickier).
For the reasons we discussed earlier with density functions, this is essentially an impossible
problem.
We can, however, still
P get an informal (but useful) estimatePthe variability of m(x). b
b
Suppose that m(x) = i wi (x)Yi . The conditional variance is i wi2 (x)σ 2 (x) which can
23
P 2
be estimatedqby i wi (x)bσ 2 (x). An asymptotic, pointwise (biased) confidence band is
P 2
b
m(x) ± zα/2 σ 2 (x).
i wi (x)b
A better idea is to bootstrap the quantity
√
n supx |m(x)
b − E[m(x)]|
b
b(x)
σ
9 Wavelet smoothing
Not every nonparametric regression estimate needs to be a linear smoother (though this
does seem to be very common), and wavelet smoothing is one of the leading nonlinear tools
for nonparametric estimation. The theory of wavelets is elegant and we only give a brief
introduction here; see Mallat (2008) for an excellent reference
You can think of wavelets as defining an orthonormal function basis, with the basis
functions exhibiting a highly varied level of smoothness. Importantly, these basis functions
also display spatially localized smoothness at different locations in the input domain. There
are actually many different choices for wavelets bases (Haar wavelets, symmlets, etc.), but
these are details that we will not go into
We assume d = 1. Local adaptivity in higher dimensions is not nearly as settled as
it is with smoothing splines or (especially) kernels (multivariate extensions of wavelets are
possible, i.e., ridgelets and curvelets, but are complex)
Consider basis functions, φ1 , . . . , φn , evaluated over n equally spaced inputs over [0, 1]:
Xi = i/n, i = 1, . . . , n.
The assumption of evenly spaced inputs is crucial for fast computations; we also typically
assume with wavelets that n is a power of 2. We now form a wavelet basis matrix W ∈ Rn×n ,
defined by
Wij = φj (Xi ), i, j = 1, . . . , n
The goal, given outputs y = (y1 , . . . , yn ) over the evenly spaced input points, is to
represent y as a sparse combination of the wavelet basis functions. To do so, we first
perform a wavelet transform (multiply by W T ):
θe = W T y,
θb = Tλ (θ),
e
24
and then perform an inverse wavelet transform (multiply by W ):
b = W θb
µ
The wavelet and inverse wavelet transforms (multiplication by W T and W ) each require
O(n) operations, and are practically extremely fast due do clever pyramidal multiplication
schemes that exploit the special structure of wavelets
The threshold function Tλ is usually taken to be hard-thresholding, i.e.,
or soft-thresholding, i.e.,
[Tλsoft (z)]i = zi − sign(zi )λ · 1{|zi | ≥ λ}, i = 1, . . . , n.
These thresholding functions are both also O(n), and computationally trivial, making
wavelet smoothing very fast overall
We should emphasize that wavelet smoothing is not a linear smoother, i.e., there is no
b = Sy for all y
single matrix S such that µ
We can write the wavelet smoothing estimate in a more familiar form, following our
previous discussions on basis functions and regularization. For hard-thresholding, we solve
25
For the wavelet smoothing estimator, denoted by m b wav , Donoho & Johnstone (1998)
provide a seminal analysis. Assuming that m0 ∈ M (k, C) for a constant C > 0 (and further
conditions on the setup), they show that (for an appropriate scaling of the smoothing
parameter λ),
(See again Tibshirani (2014) for a translation to the notation of the current setting.) Hence
the answers to our questions are: (ii) linear smoothers cannot cope with the heterogeneity
of functions in M (k, C), and are are bounded away from optimality, which means (i) we
can interpret M (k, C) as being much larger than S1 (k + 1, C 0 ), because linear smoothers
can be optimal over the latter class but not over the former. See Figure 5 for a diagram
Let’s back up to emphasize just how remarkable the results (21), (22) really are. Though
it may seem like a subtle difference in exponents, there is actually a significant difference
in the minimax rate and minimax linear rate: e.g., when k = 0, this is a difference of n−1/2
(optimal) and n−1/2 (optimal among linear smoothers) for estimating a function of bounded
variation. Recall also just how broad the linear smoother class is: kernel smoothing, regres-
sion splines, smoothing splines, RKHS estimators ... none of these methods can achieve a
better rate than n−1/2 over functions of bounded variation
Practically, the differences between wavelets and linear smoothers in problems with
spatially heterogeneous smoothness can be striking as well. However, you should keep in
mind that wavelets are not perfect: a shortcoming is that they require a highly restrictive
setup: recall that they require evenly spaced inputs, and n to be power of 2, and there are
often further assumptions made about the behavior of the fitted function at the boundaries
of the input domain
Also, though you might say they marked the beginning of the story, wavelets are not the
end of the story when it comes to local adaptivity. The natural thing to do, it might seem,
is to make (say) kernel smoothing or smoothing splines more locally adaptive by allowing
for a local bandwidth parameter or a local penalty parameter. People have tried this, but it
26
Figure 5: A diagram of the minimax rates over M (k, C) (denoted Fk in the picture) and
S1 (k + 1, C) (denoted Wk+1 in the picture)
is both difficult theoretically and practically to get right. A cleaner approach is to redesign
the kind of penalization used in constructing smoothing splines directly.
27
de Boor (1978) for an in-depth coverage. Informally, a spline is a lot smoother than
a piecewise polynomial, and so modeling with splines can serve as a way of reducing
the variance of fitted estimators. See Figure 6
• A bit of statistical folklore: it is said that a cubic spline is so smooth, that one cannot
detect the locations of its knots by eye!
• How can we parametrize the set of a splines with knots at t1 , . . . , tp ? The most natural
way is to use the truncated power basis, g1 , . . . , gp+k+1 , defined as
where β1 , . . . , βp+k+1 are coefficients and g1 , . . . , gp+k+1 , are basis functions for order
k splines over the knots t1 , . . . , tp (e.g., the truncated power basis or B-spline basis)
• Letting y = (y1 , . . . , yn ) ∈ Rn , and defining the basis matrix G ∈ Rn×(p+k+1) by
Gij = gj (xi ), i = 1, . . . , n, j = 1, . . . , p + k + 1,
we can just use least squares to determine the optimal coefficients βb = (βb1 , . . . , βbp+k+1 ),
βb = argmin ky − Gβk22 ,
β∈Rp+k+1
P
which then leaves us with the fitted regression spline fb(x) = p+k+1
j=1 βbj gj (x)
• Of course we know that βb = (GT G)−1 GT y, so the fitted values µ
b = (fb(x1 ), . . . , fb(xn ))
are
µb = G(GT G)−1 GT y,
and regression splines are linear smoothers
28
5.2 Piecewise Polynomials and Splines 143
Discontinuous Continuous
O O
O O O O O O
O O O O O O
O O O O O O
OO O O
O OO O O
O
OOO O OOO O
OO O OO O
O O O O
O O
O O O O O
O O O O O O
O
O O O O
O O
O O O O O O
O
O O
O
O O O O O O
O O O O O O
O O O O
O O
O O
ξ1 ξ2 ξ1 ξ2
O O
O O O O O O
O O O O O O
O O O O O O
OO O O
O OO O O
O
OOO O OOO O
OO O OO O
O O O O
O O O O
O O O O
O O O O O
O
O O O O
O O
O O O O O O
O
O O
O
O O O O O O
O O O O O O
O O O O
O O
O O
ξ1 ξ2 ξ1 ξ2
FIGURE 5.2. A series of piecewise-cubic polynomials, with increasing orders of
Figure 6: Illustration of the effects of enforcing continuity at the knots, across various orders
continuity.
of the derivative, for a cubic piecewise polynomial. From Chapter 5 of Hastie et al. (2009)
• A way to remedy this problem is to force the piecewise polynomial function to have a
lower degree to the left of the leftmost knot, and to the right of the rightmost knot—
this is exactly what natural splines do. A natural spline of order k, with knots at
t1 < . . . < tp , is a piecewise polynomial function f such that
It is implicit here that natural splines are only defined for odd orders k
• What is the dimension of the span of kth order natural splines with knots at t1 , . . . , tp ?
Recall for splines, this was p + k + 1 (the number of truncated power basis functions).
For natural splines, we can compute this dimension by counting:
(k − 1)
(k + 1) · (p − 1) + + 1 · 2 − k · p = p.
| {z } | 2 {z } |{z}
a b c
Above, a is the number of free parameters in the interior intervals [t1 , t2 ], . . . , [tp−1 , tp ],
b is the number of free parameters in the exterior intervals (−∞, t1 ], [tp , ∞), and c is
the number of constraints at the knots t1 , . . . , tp . The fact that the total dimension
is p is amazing; this is independent of k!
• Note that there is a variant of the truncated power basis for natural splines, and a
variant of the B-spline basis for natural splines. Again, B-splines are the preferred
parametrization for computational speed and stability
• Natural splines of cubic order is the most common special case: these are smooth
piecewise cubic functions, that are simply linear beyond the leftmost and rightmost
knots
30
for overfitting by shrinking the coefficients of the estimated function (in its basis
expansion)
• Interestingly, we can motivate and define a smoothing spline directly from a func-
tional minimization perspective. With inputs x1 , . . . , xn lying in an interval [0, 1], the
smoothing spline estimate fb, of a given odd integer order k ≥ 0, is defined as
n
X Z 1
2 2
fb = argmin yi − f (xi ) + λ f (m) (x) dx, where m = (k + 1)/2. (24)
f i=1 0
This is an infinite-dimensional optimization problem over all functions f for the which
the criterion is finite. This criterion trades off the least squares error of f over the
observed pairs (xi , yi ), i = 1, . . . , n, with a penalty term that is large when the mth
derivative of f is wiggly. The tuning parameter λ ≥ 0 governs the strength of each
term in the minimization
• By far the most commonly considered case is k = 3, i.e., cubic smoothing splines,
which are defined as
n
X Z 1
2
b
f = argmin yi − f (xi ) + λ f 00 (x)2 dx (25)
f i=1 0
• Remarkably, it so happens that the minimizer in the general smoothing spline prob-
lem (38) is unique, and is a natural kth-order spline with knots at the input points
x1 , . . . , xn ! Here we give a proof for the cubic case, k = 3, from Green & Silverman
(1994) (see also Exercise 5.7 in Hastie et al. (2009))
The key result can be stated as follows: if fe is any twice differentiable function on
[0, 1], and x1 , . . . , xn ∈ [0, 1], then there exists a natural cubic spline f with knots at
x1 , . . . , xn such that f (xi ) = fe(xi ), i = 1, . . . , n and
Z 1 Z 1
00 2
f (x) dx ≤ fe00 (x)2 dx.
0 0
Note that this would in fact prove that we can restrict our attention in (25) to natural
splines with knots at x1 , . . . , xn
Proof: the natural spline basis with knots at x1 , . . . , xn is n-dimensional, so given any
n points zi = fe(xi ), i = 1, . . . , n, we can always find a natural spline f with knots at
x1 , . . . , xn that satisfies f (xi ) = zi , i = 1, . . . , n. Now define
31
Consider
Z 1 Z 1
1
f 00 (x)h00 (x) dx = f 00 (x)h0 (x) − f 000 (x)h0 (x) dx
0 0 0
Z xn
000 0
=− f (x)h (x) dx
x1
n−1
X Z xn
xj+1
000
=− f (x)h(x) + f (4) (x)h0 (x) dx
xj x1
j=1
n−1
X
=− f 000 (x+
j ) h(xj+1 ) − h(xj ) ,
j=1
where in the first line we used integration by parts; in the second we used the that
f 00 (a) = f 00 (b) = 0, and f 000 (x) = 0 for x ≤ x1 and x ≥ xn , as f is a natural spline; in
the third we used integration by parts again; in the fourth line we used the fact that f 000
is constant on any open interval (xj , xj+1 ), j = 1, . . . , n − 1, and that f (4) = 0, again
because f is a natural spline. (In the above, we use f 000 (u+ ) to denote limx↓u f 000 (x).)
Finally, since h(xj ) = 0 for all j = 1, . . . , n, we have
Z 1
f 00 (x)h00 (x) dx = 0.
0
and therefore Z Z
1 1
00 2
f (x) dx ≤ fe00 (x)2 dx, (26)
0 0
with equality if and only if 00
= 0 for all x ∈ [0, 1]. Note that h00 = 0 implies that
h (x)
h must be linear, and since we already know that h(xj ) = 0 for all j = 1, . . . , n, this
is equivalent to h = 0. In other words, the inequality (45) holds strictly except when
fe = f , so the solution in (25) is uniquely a natural spline with knots at the inputs
32
This is a finite-dimensional problem, and after we compute the
Pncoefficients βb ∈ Rn ,
b b
we know that the smoothing spline estimate is simply f (x) = j=1 βj ηj (x)
βb = (N T N + λΩ)−1 N T y,
b = N (N T N + λΩ)−1 N T y.
µ (30)
• A special property of smoothing splines: the fitted values in (30) can be computed in
O(n) operations. This is achieved by forming N from the B-spline basis (for natural
splines), and in this case the matrix N T N + ΩI ends up being banded (with a band-
width that only depends on the polynomial order k). In practice, smoothing spline
computations are extremely fast
b = N (N T N + λΩ)−1 N T y
µ
−1 T
= N N T I + λ(N T )−1 ΩN −1 N N y
= (I + λQ)−1 y, (31)
where Q = (N T )−1 ΩN −1
where D = diag(d1 , . . . , dn )
33
1.0
1e−05
5e−05
1e−04
0.2
5e−04
0.8
0.001
0.005
0.01
0.1
0.05
0.6
Eigenvectors
Eigenvalues
0.0
0.4
−0.1
0.2
−0.2
0.0
0.0 0.2 0.4 0.6 0.8 1.0 0 10 20 30 40 50
x Number
Figure 7: Eigenvectors and eigenvalues for the Reinsch form of the cubic smoothing spline
operator, defined over n = 50 evenly spaced inputs on [0, 1]. The left plot shows the bottom
7 eigenvectors of the Reinsch matrix Q. We can see that the smaller the eigenvalue, the
“smoother” the eigenvector. The right plot shows the weights wj = 1/(1 + λdj ), j = 1, . . . , n
implicitly used by the smoothing spline estimator (32), over 8 values of λ. We can see that
when λ is larger, the weights decay faster, so the smoothing spline estimator places less
weight on the “nonsmooth” eigenvectors
34
● ● ●
● ● ● ●
●
● ●
Row 25 ●
Row 5
Row 50 Row 50
0.08
● ● ● ● ● ●
Row 75 Row 95
0.15
●
● ● ● ● ● ● ●
●
●
●
●
0.06
● ● ● ● ● ● ●
0.10
●
● ● ● ● ● ●
●
●
●
0.04
● ● ● ● ● ●
●
● ● ● ● ● ●
● ●
0.05
●
● ●
●
● ●
●
●
●
● ● ● ● ● ● ●
●
0.02
●
●
●
● ●
●
● ● ● ● ● ● ● ● ● ● ●
● ●
● ●
●
●
●
● ●
● ● ● ● ● ●
●
●
●
0.00
● ●●
● ● ● ● ● ●
● ● ● ●● ● ●●● ● ● ● ●
● ● ●● ●
●●
● ●●
●● ● ●● ●●●
●●● ● ●● ●● ● ●●● ●
● ● ●
●●
● ●●
●●
●
●●
●
●●
●
●
●●●●●●
●●●
●
● ●●●
●● ●●●
●●●●●
● ●
●
●●
●●
●
●●
●●●
●●●●
●●● ●
●●●
●● ●
●●
●●●●
●●●●
● ●●
●●●
●●●●
●●
●
0.00
● ●● ●● ●
● ●●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●●●●●●●●●●●●●●●●●●●●●●●● ●● ●●●●●●●●●●●●●●●●●●●●●●●
●
●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●● ● ●●● ●
●● ● ●● ● ● ●● ●●● ● ● ●● ● ●●
●●
●●●●●● ●●
●●●●● ●●●●●●●● ●●
●●●●● ●●●●●●●● ●●●●●●●●
● ●
0.0 0.2 0.4 0.6 0.8 1.0 0.3 0.4 0.5 0.6 0.7
x x
Figure 8: Rows of the cubic smoothing spline operator S defined over n = 100 evenly spaced
input points on [0, 1]. The left plot shows 3 rows of S (in particular, rows 25, 50, and 75)
for λ = 0.0002. These look precisely like translations of a kernel. The right plot considers
a setup where the input points are concentrated around 0.5, and shows 3 rows of S (rows
5, 50, and 95) for the same value of λ. These still look like kernels, but the bandwidth is
larger in low-density regions of the inputs
• There is more recent work that connects smoothing splines of all orders to kernel
smoothing. See, e.g., ??.
35
A graph of K is given in Figure 1. The effective local bandwidth demonstrated
below is ~ l ' ~ f ( t ) - ' asymptotically;
'~ thus the smoothing spline's behaviour is
intermediate between fixed kernel smoothing (no dependence on f ) and smooth-
ing based on a n average of a fixed number of neighbouring values (effective local
bandwidth proportional to l l f ) . The desirability of this dependence on a low
power o f f will be discussed in Section 3.
The paper is organized as follows. In Section 2 the main theorem is stated and
discussed. In addition, some graphs of actual weight functions are presented and
compared with their asymptotic forms. These show that the kernel approximation
of the weight function is excellent in practice. Section 3 contains some discussion
Figure 9: The Silverman kernel in (33), which is the (asymptotically) equivalent implicit
kernel used by smoothing splines. Note that it can be negative. From ?
(The Sobolev class Wd (m, C) in d dimensions can be defined similarly, where we sum
over all partial derivatives of order m.)
The proof of this result uses much more fancy techniques from empirical process theory
(entropy numbers) than the proofs for kernel smoothing. See Chapter 10.1 of van de
Geer (2000)
• This rate is seen to be minimax optimal over W1 (m, C) (e.g., Nussbaum (1985)).
Also, it is worth noting that the Sobolev W1 (m, C) and Holder H1 (m, L) classes are
equivalent in the following sense: given W1 (m, C) for a constant C > 0, there are
L0 , L1 > 0 such that
The first containment is easy to show; the second is far more subtle, and is a con-
sequence of the Sobolev embedding theorem. (The same equivalences hold for the
d-dimensional versions of the Sobolev and Holder spaces.)
36
10.9 Multivariate splines
• Splines can be extended to multiple dimensions, in two different ways: thin-plate
splines and tensor-product splines. The former construction is more computationally
efficient but more in some sense more limiting; the penalty for a thin-plate spline, of
polynomial order k = 2m − 1, is
X Z 2
∂ m f (x)
dx,
α1 +...+αd =m
∂xα1 1 xα2 2 . . . ∂xαd d
• The multivariate extensions (thin-plate and tensor-product) of splines are highly non-
trivial, especially when we compare them to the (conceptually) simple extension of
kernel smoothing to higher dimensions. In multiple dimensions, if one wants to study
penalized nonparametric estimation, it’s (argurably) easier to study reproducing ker-
nel Hilbert space estimators. We’ll see, in fact, that this covers smoothing splines
(and thin-plate splines) as a special case
37
References
Bellman, R. (1962), Adaptive Control Processes, Princeton University Press.
Devroye, L., Gyorfi, L., & Lugosi, G. (1996), A Probabilistic Theory of Pattern Recognition,
Springer.
Donoho, D. L. & Johnstone, I. (1998), ‘Minimax estimation via wavelet shrinkage’, Annals
of Statistics 26(8), 879–921.
Fan, J. (1993), ‘Local linear regression smoothers and their minimax efficiencies’, The An-
nals of Statistics pp. 196–216.
Fan, J. & Gijbels, I. (1996), Local polynomial modelling and its applications: monographs
on statistics and applied probability 66, Vol. 66, CRC Press.
Green, P. & Silverman, B. (1994), Nonparametric Regression and Generalized Linear Mod-
els: A Roughness Penalty Approach, Chapman & Hall/CRC Press.
Gyorfi, L., Kohler, M., Krzyzak, A. & Walk, H. (2002), A Distribution-Free Theory of
Nonparametric Regression, Springer.
Hastie, T. & Tibshirani, R. (1990), Generalized Additive Models, Chapman and Hall.
Hastie, T., Tibshirani, R. & Friedman, J. (2009), The Elements of Statistical Learning;
Data Mining, Inference and Prediction, Springer. Second edition.
Johnstone, I. (2011), Gaussian estimation: Sequence and wavelet models, Under contract to
Cambridge University Press. Online version at http://www-stat.stanford.edu/~imj.
Kim, S.-J., Koh, K., Boyd, S. & Gorinevsky, D. (2009), ‘`1 trend filtering’, SIAM Review
51(2), 339–360.
Lin, Y. & Zhang, H. H. (2006), ‘Component selection and smoothing in multivariate non-
parametric regression’, Annals of Statistics 34(5), 2272–2297.
Mallat, S. (2008), A wavelet tour of signal processing, Academic Press. Third edition.
Mammen, E. & van de Geer, S. (1997), ‘Locally apadtive regression splines’, Annals of
Statistics 25(1), 387–413.
Raskutti, G., Wainwright, M. & Yu, B. (2012), ‘Minimax-optimal rates for sparse addi-
tive models over kernel classes via convex programming’, Journal of Machine Learning
Research 13, 389–427.
Ravikumar, P., Liu, H., Lafferty, J. & Wasserman, L. (2009), ‘Sparse additive models’,
Journal of the Royal Statistical Society: Series B 75(1), 1009–1030.
38
Scholkopf, B. & Smola, A. (2002), ‘Learning with kernels’.
Steidl, G., Didas, S. & Neumann, J. (2006), ‘Splines in higher order TV regularization’,
International Journal of Computer Vision 70(3), 214–255.
Stone, C. (1985), ‘Additive regression models and other nonparametric models’, Annals of
Statistics 13(2), 689–705.
Wahba, G. (1990), Spline Models for Observational Data, Society for Industrial and Applied
Mathematics.
Wang, Y., Smola, A. & Tibshirani, R. J. (2014), ‘The falling factorial basis and its statistical
properties’, International Conference on Machine Learning 31.
39
Appendix: Locally adaptive estimators
10.10 Locally adaptive regression splines
Locally adaptive regression splines (Mammen & van de Geer 1997), as their name suggests,
can be viewed as variant of smoothing splines that exhibit better local adaptivity. For a
given integer order k ≥ 0, the estimate is defined as
n
X 2
b = argmin
m Yi − m(Xi ) + λ TV(f (k) ). (34)
f i=1
The minimization domain is infinite-dimensional, the space of all functions for which the
criterion is finite
Another remarkable variational result, similar to that for smoothing splines, shows that
(34) has a kth order spline as a solution (Mammen & van de Geer 1997). This almost
turns the minimization into a finite-dimensional one, but there is one catch: the knots of
this kth-order spline are generally not known, i.e., they need not coincide with the inputs
x1 , . . . , xn . (When k = 0, 1, they do, but in general, they do not)
To deal with this issue, we can redefine the locally adaptive regression spline estimator
to be
X n
2
b = argmin
m Yi − m(Xi ) + λ TV(f (k) ), (35)
f ∈Gk i=1
i.e., we restrict the domain of minimization to be Gk , the space of kth-order spline functions
with knots in Tk , where Tk is a subset of {x1 , . . . , xn } of size n−k −1. The precise definition
of Tk is not important; it is just given by trimming away k + 1 boundary points from the
inputs
As we already know, the space Gk of kth-order splines with knots in Tk has dimension
|Tk | + k + 1 = n. Therefore we can choose a basis g1 , . . . , gn for the functions in Gk , and the
problem in (35) becomes one of finding the coefficients in this basis expansion,
n
X n
X 2 n X
n (k) o
βb = argmin Yi − βj gj (Xi ) + λ TV βj gj (Xi ) , (36)
f ∈Gk i=1 j=1 j=1
P
b
and then we have m(x) = nj=1 βbj gj (x)
Now define the basis matrix G ∈ Rn×n by
Gij = gj (Xi ), i = 1, . . . , n.
and so
n X
n (k) o n
X
TV βj gj (Xi ) = k. |βj |.
j=1 j=k+2
40
Hence the locally adaptive regression spline problem (36) can be expressed as
n
X
βb = argmin ky − Gβk22 + λk. |βi |. (37)
β∈Rn i=k+2
This is a lasso regression problem on the truncated power basis matrix G, with the first k +1
coefficients (those corresponding to the pure polynomial functions, in the basis expansion)
left unpenalized
This reveals a key difference between the locally adaptive regression splines (37) (origi-
nally, problem (35)) and the smoothing splines (29) (originally, problem
X n Z 1
2 2
b = argmin
m Yi − m(Xi ) + λ f (m) (x) dx, where m = (k + 1)/2. (38)
f i=1 0
In the first problem, the total variation penalty is translated into an `1 penalty on the
coefficients of the truncated power basis, and hence this acts a knot selector for the estimated
function. That is, at the solution in (37), the estimated spline has knots at a subset of Tk
(at a subset of the input points x1 , . . . , xn ), with fewer knots when λ is larger. In contrast,
recall, at the smoothing spline solution in (29), the estimated function has knots at each of
the inputs x1 , . . . , xn . This is a major difference between the `1 and `2 penalties
From a computational perspective, the locally adaptive regression spline problem in (37)
is actually a lot harder than the smoothing spline problem in (29). Recall that the latter
reduces to solving a single banded linear system, which takes O(n) operations. On the other
hand, fitting locally adaptive regression splines in (37) requires solving a lasso problem with
a dense n × n regression matrix G; this takes something like O(n3 ) operations. So when
n = 10, 000, there is a big difference between the two.
There is a tradeoff here, as with extra computation comes much improved local adap-
tivity of the fits. See Figure 10 for an example. Theoretically, when m0 ∈ M (k, C) for a
constant C > 0, Mammen & van de Geer (1997) show the locally adaptive regression spline
estimator, denoted m b lrs , with λ n1/(2k+3) , satisfies
b lrs − m0 k2n . n−(2k+2)/(2k+3) in probability,
km
so (like wavelets) it achieves the minimax optimal rate over n−(2k+2)/(2k+3) . In this regard,
as we discussed previously, they actually have a big advantage over any linear smoother
(not just smoothing splines)
41
True function Locally adaptive regression spline, df=19
10
10
●●●● ●●●●
● ●
● ●
● ● ●● ● ● ●●
● ●
● ●
● ●
●● ●●
8
8
● ● ● ●
● ●
● ● ● ●
●● ●● ●● ●●
●
● ●●● ●
● ●●●
● ● ● ●
●● ● ● ●● ● ●
●● ● ●● ●
● ● ● ●
6
6
● ● ● ● ● ●
● ●
● ●●● ● ●● ● ●●● ● ●●
● ● ● ●
●● ● ● ● ●● ●● ● ● ● ●●
● ●● ● ●● ●●● ● ●● ● ●● ●●●
●● ●●● ●●● ● ● ●● ● ●● ●●● ●●● ● ● ●● ●
● ● ●● ● ● ●●
●
● ● ● ●
● ● ●
● ● ● ●●● ●●● ● ● ● ●●● ●●●
● ● ● ●
● ●● ● ● ● ● ●● ● ● ●
4
4
● ● ● ● ● ●
● ● ● ● ● ●
● ● ● ●
● ●
● ●
● ●
●● ● ●● ●
● ●
2
2
● ●
● ●
● ●
● ●
● ●
● ●
0
0
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
10
●●●● ●●●●
● ●
● ●
● ● ●● ● ● ●●
● ●
● ●
● ●
●● ●●
8
● ● ● ●
● ●
● ● ● ●
●● ●● ●● ●●
●
● ●●● ●
● ●●●
● ● ● ●
●● ● ● ●● ● ●
●● ● ●● ●
● ● ● ●
6
● ● ● ● ● ●
● ●
● ●●● ● ●● ● ●●● ● ●●
● ● ● ●
●● ● ● ● ●● ●● ● ● ● ●●
● ●● ● ●● ●●● ● ●● ● ●● ●●●
●● ●●● ●●● ● ● ●● ● ●● ●●● ●●● ● ● ●● ●
● ● ●● ● ● ●●
●
● ● ● ●
● ● ●
● ● ● ●●● ●●● ● ● ● ●●● ●●●
● ● ● ●
● ●● ● ● ● ● ●● ● ● ●
4
● ● ● ● ● ●
● ● ● ● ● ●
● ● ● ●
● ●
● ●
● ●
●● ● ●● ●
● ●
2
● ●
● ●
● ●
● ●
● ●
● ●
0
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
Figure 10: The top left plot shows a simulated true regression function, which has inhomoge-
neous smoothness: smoother towards the left part of the domain, wigglier towards the right.
The top right plot shows the locally adaptive regression spline estimate with 19 degrees of
freedom; notice that it picks up the right level of smoothness throughout. The bottom left
plot shows the smoothing spline estimate with the same degrees of freedom; it picks up the
right level of smoothness on the left, but is undersmoothed on the right. The bottom right
panel shows the smoothing spline estimate with 33 degrees of freedom; now it is appropriately
wiggly on the right, but oversmoothed on the left. Smoothing splines cannot simultaneously
represent different levels of smoothness at different regions in the domain; the same is true
of any linear smoother
42
having knots in a set Tk ⊆ {X1 , . . . Xn } with size |Tk | = n − k − 1. The trend filtering
problem is given by replacing Gk with a different function space,
n
X 2
b = argmin
m Yi − m(Xi ) + λ TV(f (k) ), (40)
f ∈Hk i=1
where the new domain is Hk = span{h1 , . . . , hn }. Assuming that the input points are
ordered, x1 < . . . < xn , the functions h1 , . . . , hn are defined by
j−1
Y
hj (x) = (x − x` ), j = 1, . . . , k + 1,
`=1
(41)
k
Y
hk+1+j (x) = (x − xj+` ) · 1{x ≥ xj+k }, j = 1, . . . , n − k − 1.
`=1
(Our convention is to take the empty product to be 1, so that h1 (x) = 1.) These are dubbed
the falling factorial basis, and are piecewise polynomial functions, taking an analogous form
to the truncated power basis functions in (10.11). Loosely speaking, they are given by
replacing an rth-order power function in the truncated power basis with an appropriate
r-term product, e.g., replacing x2 with (x − x2 )(x − x1 ), and (x − tj )k with (x − xj+k )(x −
xj+k−1 ) · . . . , (x − xj+1 )
Defining the falling factorial basis matrix
Hij = hj (Xi ), i, j = 1, . . . , n,
it is now straightforward to check that the proposed problem of study, trend filtering in
(40), is equivalent to
Xn
b 2
β = argmin ky − Hβk2 + λk. |βi |. (42)
β∈Rn i=k+2
This is still a lasso problem, but now in the falling factorial basis matrix H. Compared to
the locally adaptive regression spline problem (37), there may not seem to be much of a
difference here—like G, the matrix H is dense, and solving (42) would be slow. So why did
we go to all the trouble of defining trend filtering, i.e., introducing the somewhat odd basis
h1 , . . . , hn in (41)?
The usefulness of trend filtering (42) is seen after reparametrizing the problem, by
inverting H. Let θ = Hβ, and rewrite the trend filtering problem as
θb = argmin ky − θk22 + λkDθk1 , (43)
θ∈Rn
43
10
Locally adaptive splines ●●●●
●
Trend filtering ●
● ● ●●
●
●
●
●●
8
● ●
●
● ●
●● ●
●
● ●
● ●●
● ●
●● ● ●
●● ●
● ●
6
● ● ●
●
● ●●● ● ●●
● ●
●● ● ● ● ●●
● ●● ● ●● ●●●
●● ●●● ●●● ● ● ●● ●
● ● ● ●
●● ● ● ● ●●●
● ●
●●●
● ●
● ●● ● ● ●
4
● ● ●
● ● ●
● ●
●
●
●
●● ●
●
2
●
●
●
●
●
●
0
Figure 11: Trend filtering and locally adaptive regression spline estimates, fit on the same
data set as in Figure 10. The two are tuned at the same level, and the estimates are visually
indistinguishable
One can hence interpret D as a type of discrete derivative operator, of order k + 1. This
also suggests an intuitive interpretation of trend filtering (43) as a discrete approximation
to the original locally adaptive regression spline problem in (34)
The bandedness of D means that the trend filtering problem (43) can be solved efficiently,
in close to linear time (complexity O(n1.5 ) in the worst case). Thus trend filtering estimates
are much easier to fit than locally adaptive regression splines
But what of their statistical relevancy? Did switching over to the falling factorial basis
(41) wreck the local adaptivity properties that we cared about in the first place? Fortu-
nately, the answer is no, and in fact, trend filtering and locally adaptive regression spline
estimates are extremely hard to distinguish in practice. See Figure 11
Moreover, Tibshirani (2014), Wang et al. (2014) prove that the estimates from trend
b tf and m
filtering and locally adaptive regression spline estimates, denoted m b lrs , respectively,
when the tuning parameter λ for each scales as n1/(2k+3) , satisfy
b tv − m
km b lrs k2n . n−(2k+2)/(2k+3) in probability.
This coupling shows that trend filtering converges to the underlying function m0 at the rate
n−(2k+2)/(2k+3) whenever locally adaptive regression splines do, making them also minimax
optimal over M (k, C). In short, trend filtering offers provably significant improvements
over linear smoothers, with a computational cost that is not too much steeper than a single
banded linear system solve
44
10.12 Proof of (9)
Let Pn
i=1 m(Xi )I(kXi − xk ≤ h)
mh (x) = .
nPn (B(x, h))
Let An = {Pn (B(x, h)) > 0}. When An is true,
! P
n
Var(Yi |Xi )I(kXi − xk ≤ h) σ2
E (m b h (x) − mh (x))2 X1 , . . . , Xn = i=1 ≤ .
n2 Pn2 (B(x, h)) nPn (B(x, h))
Since m ∈ M, we have that |m(Xi ) − m(x)| ≤ LkXi − xk ≤ Lh for Xi ∈ B(x, h) and hence
Therefore,
Z Z Z
2
E (m b h (x) − m(x)) dP (x) = E (m b h (x) − mh (x)) dP (x) + E (mh (x) − m(x))2 dP (x)
2
Z Z
σ2
≤E IAn (x) dP (x) + L2 h2 + m2 (x)E(IAn (x)c )dP (x). (44)
nPn (B(x, h))
To bound the first term, let Y = nPn (B(x, h)). Note that Y ∼ Binomial(n, q) where
q = P(X ∈ B(x, h)). Now,
n
X
I(Y > 0) 2 2 n
E ≤ E = q k (1 − q)n−k
Y 1+Y k+1 k
k=0
Xn
2 n+1
= q k+1 (1 − q)n−k
(n + 1)q k+1
k=0
2 X n+1
n+1
≤ q k (1 − q)n−k+1
(n + 1)q k
k=0
2 2 2
= (q + (1 − q))n+1 = ≤ .
(n + 1)q (n + 1)q nq
Therefore,
Z Z
σ 2 IAn (x) dP (x)
E dP (x) ≤ 2σ 2 .
nPn (B(x, h)) nP (B(x, h))
SM
We may choose points z1 , . . . , zM such that the support of P is covered by j=1 B(zj , h/2)
where M ≤ c2 /(nhd ). Thus,
Z M Z
X M Z
X
dP (x) I(z ∈ B(zj , h/2)) I(z ∈ B(zj , h/2))
≤ dP (x) ≤ dP (x)
nP (B(x, h)) nP (B(x, h)) nP (B(zj , h/2))
j=1 j=1
M c1
≤ ≤ .
n nhd
45
The third term in (44) is bounded by
Z Z
m (x)E(IAn (x)c )dP (x) ≤ sup m (x) (1 − P (B(x, h)))n dP (x)
2 2
x
Z
≤ sup m2 (x) e−nP (B(x,h)) dP (x)
x
Z
nP (B(x, h))
= sup m (x) e−nP (B(x,h))
2
dP (x)
x nP (B(x, h))
Z
1
≤ sup m2 (x) sup(ue−u ) dP (x)
x u nP (B(x, h))
c1 c2
≤ sup m2 (x) sup(ue−u ) d = d
.
x u nh nh
Note that this would in fact prove that we can restrict our attention in (25) to natural
splines with knots at x1 , . . . , xn .
The natural spline basis with knots at x1 , . . . , xn is n-dimensional, so given any n points
zi = fe(Xi ), i = 1, . . . , n, we can always find a natural spline f with knots at x1 , . . . , xn that
satisfies m(Xi ) = zi , i = 1, . . . , n. Now define
Consider
Z 1 Z 1
1
00 00 00 0
f (x)h (x) dx = f (x)h (x) − f 000 (x)h0 (x) dx
0 0 0
Z xn
000 0
=− f (x)h (x) dx
x1
n−1
X Z xn
xj+1
000
=− f (x)h(x) + f (4) (x)h0 (x) dx
xj x1
j=1
n−1
X
=− f 000 (x+
j ) h(xj+1 ) − h(xj ) ,
j=1
where in the first line we used integration by parts; in the second we used the that f 00 (a) =
f 00 (b) = 0, and f 000 (x) = 0 for x ≤ x1 and x ≥ xn , as f is a natural spline; in the third we
used integration by parts again; in the fourth line we used the fact that f 000 is constant on
any open interval (xj , xj+1 ), j = 1, . . . , n−1, and that f (4) = 0, again because f is a natural
46
spline. (In the above, we use f 000 (u+ ) to denote limx↓u f 000 (x).) Finally, since h(xj ) = 0 for
all j = 1, . . . , n, we have Z 1
f 00 (x)h00 (x) dx = 0.
0
From this, it follows that
Z 1 Z 1
2
e00 2
f (x) dx = f 00 (x) + h00 (x) dx
0 0
Z 1 Z 1 Z 1
00 2 00 2
= f (x) dx + h (x) dx + 2 f 00 (x)h00 (x) dx
0 0 0
Z 1 Z 1
= f 00 (x)2 dx + h00 (x)2 dx,
0 0
and therefore Z Z
1 1
f 00 (x)2 dx ≤ fe00 (x)2 dx, (45)
0 0
with equality if and only if h00 (x) = 0 for all x ∈ [0, 1]. Note that h00 = 0 implies that h must
be linear, and since we already know that h(xj ) = 0 for all j = 1, . . . , n, this is equivalent to
h = 0. In other words, the inequality (45) holds strictly except when fe = f , so the solution
in (25) is uniquely a natural spline with knots at the inputs.
47
Linear Regression
Given a new pair (X, Y ) we want to predict Y from X. The conditional prediction risk is
Z
2 2
b = E[(Y − m(X))
R(m) b |D] = (y − m(x))
b dP (x, y)
where the expected value is over all random variables. The true regression function is
where
σ 2 = E[Y − m(X)]2 , bn (x) = E[m(x)]
b − m(x), vn (x) = Var(m(x)).
b
A linear predictor has the form g(x) = β T x. The best linear predictor minimizes E(Y −β T X)2 .
(We do not assume that m(x) is linear.) The minimizer, assuming that Σ is non-singular, is
β∗ = Σ−1 α
where Σ = E[XX T ] and α = E(Y X). We will use linear predictors; but we should
never assume that m(x) is linear. The excess risk is of the linear predictor β T x is
1
1 Low Dimensional Linear Regression
Recall that Σ = E[XX T ]. The least squares estimator βb minimizes the training error rbn (β).
We then have that
b −1 α
βb = Σ b
where
b= 1 1X
X
Σ Xi XiT , α
b= Yi Xi .
n i n i
We want to show that r(β) b is close to r(β∗ ). For simplicity, we will assume that the distri-
bution P of (Yi , Xi ) supported on a compact set. Also, for simplicity, we assume that βb is
truncated by some large constant L.
Theorem 1 Let P be the set of all distributions for Z = (X, Y ) supported on a compact set
K. There exists constants c1 , c2 such that the following is true. For any > 0,
2
sup P r(βn ) > r(β∗ (P )) + 2 ≤ c1 e−nc2 .
n b (2)
P ∈P
Hence, r !
1
r(βbn ) − r(β∗ ) = OP .
n
Proof. Given any β, define βe = (−1, β) and Λ = E[ZZ T ] where Z = (Y, X). Note that
Similarly,
rbn (β) = βeT Λ
b n βe
where
bn = 1
X
Λ Zi ZiT .
n i
So
rn (β) − r(β)| = |βeT (Λ
|b b n − Λ)β| e 2 ∆n
e ≤ ||β||
1
where
∆n = max |Λ
b n (j, k) − Λ(j, k)|.
j,k
2
On the event supβ∈B |b
rn (β) − r(β)| < , we have
Theorem 2 (Theorem 11.3 of Gyorfi, Kohler, Krzyzak and Walk, 2002) Let σ 2 =
supx Var(Y |X = x) < ∞. Assume that all the random variables are bounded by L < ∞.
Then
Z Z
Cd(log(n) + 1)
bT 2
E |β x − m(x)| dP (x) ≤ 8 inf |β T x − m(x)|2 dP (x) + .
β n
The proof is straightforward but is very long. The strategy is to first bound n−1 i (βbT Xi −
P
m(Xi ))2 using
P the properties of least squares. Then, using concentration of measure one can
relate n−1 i f 2 (Xi ) to f 2 (x)dP (x).
R
Theorem 3 We have √
n(βb − β) N (0, Γ)
where " #
Γ = Σ−1 E (Y − X T β)2 XX T Σ−1
Γ b −1 M
b=Σ cΣb −1
where n
c(j, k) = 1
X
M 2i
Xi (j)Xi (k)b
n i=1
i = Yi − βbT Xi .
and b
The matrix Γb is called the sandwich estimator. The Normal approximation can be used to
q
construct confidence intervals for β. For example, β(j)±zα Γ(j,
b b j)/n is an asymptotic 1−α
confidence interval for β(j). We can also get confidence intervals by using the bootstrap. Do
not use the textbook formulas for the standard errors of β. b These assume that the regression
function itself is linear. See Buja et al (2015) for details.
3
2 High Dimensional Linear Regression
Now suppose that d > n. We can no longer use least squares. There are many approaches.
The simplest is to preprocess the data to reduce the dimension. For example, we can perform
PCA on the X 0 s and use the first k principal components where k < n. Alternatively, we
can cluster the covariates based on their correlations. We can the use one feature from each
cluster or take the average of the covariates within each cluster. Another approach is to
screen the variables by choosing the k features with the largest correlation with Y . After
dimension reduction, we can the use least squares. These preprocessing methods can be very
effective.
A different approach is to use all the covariates but, instead of least squares, we shrink the
coefficients towards 0. This is called ridge regression and is discussed in the next section.
Yet another approach is model selection where we try to find a good subset of the covariates.
Let S be a subset of {1, . . . , d} and let XS = (X(j) : j ∈ S). If the size of S is not too
large, we can regress Y on XS instead of S.
In particular, fix k < n and let Sk denote all subsets of size k. For a given S ∈ Sk , let βS be
the best linear predictor βS = Σ−1S αS for the subset S. We would like to choose S ∈ Sk to
minimize
E(Y − βST XS )2 .
This is equivalent to:
minimize E(Y − β T X)2 subject to ||β||0 ≤ k
where ||β||0 is the number of non-zero elements of β.
There will be a bias-variance tradeoff. As k increases, the bias decreases but the variance
increases.
We can approximate the risk with the training error. But the minimization is over all subsets
of size k. This minimization is NP-hard. So best subset regression is infeasible. We can
approximate best subset regression in two different ways: a greedy approxmation or a convex
relaxation. The former leads to forward stepwise regression. The latter leads to the lasso.
All these methods involve a tuning parameter which can be chosen by cross-validation.
3 Ridge Regression
4
Forward Stepwise Regression
1. Input k. Let S = ∅.
2. Let rj = n−1 i Yi Xi (j) denote the corrleation between Y and the j th feature.
P
Let J = argmaxj |rj |. Let S = S ∪ {J}.
3. Compute the regression of Y on XS = (X(j) : j ∈ S). Compute the residuals
e = (e1 , . . . , en ) where ei = Yi − βbST Xi .
4. Compute the correlations rj between the residuals e and the remaining features.
5. Let J = argmaxj |rj |. Let S = S ∪ {J}.
6. Repeat steps 3-5 until |S| = k.
7. Output S.
Theorem 4 (Hsu, Kakade and Zhang 2014) Suppose that ||Xi || ≤ r. Let β T x be the
best linear apprximation to m(x). Then, with probability at least 1 − 4e−t ,
!!
r2
b − r(β) ≤ 1 + O 1 + λ λ||β||2 σ 2 tr(Σ)
r(β) + .
n 2 n 2λ
Now we will discuss the theory of forward stepwise regression. Let’s start with a functional,
noise-free version. We want to greedily approximate a function f using a dictionary of
functions D = {ψ1 , ψ2 , . . . , }. The elements of D are called atoms. Assume that ||ψ|| = 1 for
all ψ ∈ D. Assume that f and the atoms of the dictionary belong to a Hilbert space H.
5
1. Input: f .
2. Initialize: r0 = f , f0 = 0, V = ∅.
Let ΣN denote all linear combinations of elements of D with at most N terms. Define the
best N -term approximation error
where Λ denotes a subset of D and Span(Λ) is the set of linear combinations of functions in
Λ.
where the infimum is over all expansions of f . The functional version of stepwise regres-
sion, known as the Orthogonal Greedy Algorithm (OGA), is also known as Orthogonal
Matching Pursuit. The algorithm is given in Figure 2.
Proof. Note that fN is the best approximation to f from Span(VN ). On the other hand, the
best approximation from the set {a gN : a ∈ R} is hf, gN igN . The error of the former must be
6
smaller than the error of the latter. In other words, ||f −fN ||2 ≤ ||f −fN −1 −hrN −1 , gN igN ||2 .
Thus,
2 4khk2L1
2
krN k ≤ kf − hk + . (6)
N
P P
Proof. Choose any h ∈ L1 and write h = j βj ψj where khkL1 = j |βj |. Write f =
fN −1 + f − fN −1 = fN −1 + rN −1 and note that rN −1 is orthogonal to fN −1 . Hence, krN −1 k2 =
7
hrN −1 , f i and so
By combining the previous results with concentration of measure arguments (see appendix
for details) we get the following result, due to Barron, Cohen, Dahmen and DeVore (2008).
8
1. Input: Y ∈ Rn .
2. Initialize: r0 = Y , fb0 = 0, V = ∅.
where ha, bin = n−1 ni=1 ai bi . Set VN = VN −1 ∪ {gN }. Let fN be the projection
P
of rN −1 onto Span(VN ). Let rN = Y − fN .
Theorem
√ 9 Let hn = argminh∈FN kf0 − hk2 . Suppose that lim supn→∞ khn kL1,n < ∞. Let
N ∼ n. Then, for every γ > 0, there exist C > 0 such that
C log n
kf − fbN k2 ≤ 4σN
2
+
n1/2
except on a set of probability n−γ .
P
Let us compare this with the lasso which we will discuss next. Let fL = j βj ψj minimize
kf − fL k2 subject to kβk1 ≤ L. Then, we will see that
1/2
2 2 log n
kf − fbL k ≤ kf − fL k + OP
n
The rate n−1/2 is in fact optimal. It might be surprising that the rate is independent of the
dimension. Why do you think this is the case?
9
This is a convex problem so the estimator can be found efficiently. The estimator is sparse:
for large enough λ, many of the components of βb are 0. This is proved in the course on
convex optimization. Now we discuss some theoretical properties of the lasso.1
The following result was proved in Zhao and Yu (2006), Meinshausen and Bühlmann (2005)
and Wainwright (2006). The version we state is from Wainwright (2006). Let β = (β1 , . . . , βs , 0, . . . , 0)
and decompose the design matrix as X = (XS XS c ) where S = {1, . . . , s}. Let βS =
(β1 , . . . , βs ).
3. φn (dn ) > 0.
5. λn satisfies
nλ2n
→∞
log(dn − sn )
and r −1 !
1 log sn 1 T
+ λn X X → 0. (8)
min1≤j≤sn |βj | n n
∞
The conditions of this theorem are very strong. They are not checkable and they are unlikely
to ever be true in practice.
10
3. E(exp |i |) < ∞ and E(2i ) = σ 2 < ∞.
Then
log n sn log n 1
kβn − βbn k2 = OP +O (9)
n φ2n (sn log n) log n
If
log n
sn log dn →0 (10)
n
and s
σy2 Φn (min n, dn )n2
λn = (11)
sn log n
P
then kβbn − βn k2 → 0.
Once again, the conditions of this theorem are very strong. They are not checkable and they
are unlikely to ever be true in practice.
The next theorem is the most important one. It does not require unrealistic conditions. We
state the theorem for bounded covariates. A more general version appears in Greenshtein
and Ritov (2004).
Theorem 12 Let Z = (Y, X). Assume that |Y | ≤ B and maxj |X(j)| ≤ B. Let
β∗ = argmin r(β)
||β||1 ≤L
where r(β) = E(Y − β T X)2 . Thus, xT β∗ is the best, sparse linear predictor (in the L1 sense).
Let βb be the lasso estimator:
βb = argmin rb(β)
||β||1 ≤L
−1
Pn
where rb(β) = n i=1 (Yi − XiT β)2 . With probabilty at least 1 − δ,
√ !
v
u
u 16(L + 1)4 B 2 2d
b ≤ r(β∗ ) + t
r(β) log √ .
n δ
11
Proof. Let Z = (Y, X) and Zi = (Yi , Xi ). Define γ ≡ γ(β) = (−1, β). Then
r(β) = E(Y − β T X)2 = γ T Λγ
where Λ = E[ZZ T ]. Note that ||γ||1 = ||β||1 + 1. Let B = {β : ||β||1 ≤ L}. The training
error is n
1X
rb(β) = (Yi − XiT β)2 = γ T Λγ
b
n i=1
r(β) − r(β)| = |γ T (Λ
|b b − Λ)γ|
X
≤ b k) − Λ(j, k)| ≤ ||γ||2 δn
|γ(j)| |γ(k)| |Λ(j, 1
j,k
≤ (L + 1)2 ∆n
where
∆n = max |Λ(j,
b k) − Λ(j, k)|.
j,k
So,
b + (L + 1)2 ∆n ≤ rb(β∗ ) + (L + 1)2 ∆n ≤ r(β∗ ) + 2(L + 1)2 ∆n .
b ≤ rb(β)
r(β)
Note that |Z(j)Z(k)| ≤ B 2 < ∞. By Hoeffding’s inequality,
2 /(2B 2 )
P(∆n (j, k) ≥ ) ≤ 2e−n
and so, by the union bound,
2 /(2B 2 )
P(∆n ≥ ) ≤ 2d2 e−n =δ
r √
if we choose = (4B /n) log √2δd . Hence,
2
√ !
v
u
4 2
b ≤ r(β∗ ) + t 16(L + 1) B log 2d
u
r(β) √ .
n δ
Problems With Sparsity. Sparse estimators are convenient and popular but they can
some problems. Say that βb is weakly sparsistent if, for every β,
Pβ I(βbj = 1) ≤ I(βj = 1) for all j → 1 (12)
12
Theorem 13 (Leeb and Pötscher (2007)) Suppose that the following condiitons hold:
1. d is fixed.
2. The covariariates are nonstochastic and n−1 XT X → Q for some positive definite matrix
Q.
3. The errors i are independent with mean 0, finite variance σ 2 and have a density f
satisfying
Z 0 2
f (x)
0< f (x)dx < ∞.
f (x)
√
Proof. Choose any s ∈ Rd and let βn = −s/ n. Then,
sup Eβ (`(n1/2 (βb − β)) ≥ Eβn (`(n1/2 (βb − β)) ≥ Eβn (`(n1/2 (βb − β))I(βb = 0))
β
√
= `(− nβn )Pβn (βb = 0) = `(s)Pβn (βb = 0).
Now, P0 (βb = 0) → 1 by assumption. It can be shown that we also have Pβn (βb = 0) → 1.2
Hence, with probability tending to 1,
sup Eβ (`(n1/2 (βb − β)) ≥ `(s).
β
R(βbn )
sup → ∞.
β Rn
The implication is that when d is much smaller than n, sparse estimators have poor behavior.
However, when dn is increasing and dn > n, the least squares estimator no longer satisfies
(13). Thus we can no longer say that some other estimator outperforms the sparse estimator.
In summary, sparse estimators are well-suited for high-dimensional problems but not for low
dimensional problems.
2
This follows from a property called contiguity.
13
5 Cross Validation
The following result is from Gyorfi, Kohler, Krzyzak and Walk (2002). Let M =R{mh } be a
finite class of regression estimators indexed by a parameter h. Let mbh minimize |mh (x) −
m(x)|2 dP (x) over M. We want to show that cross-validation (data-splitting) leads to an
estimator with risk nearly as good as mbh .
Split the data into training D = {(X1 , Y1 ), . . . , (Xn , Yn )} and test D0 = {(X10 , Y10 ), . . . , (Xn0 , Yn0 )}.
Let mH minimize n−1 i∈D0 |Yi − mh (Xi )|2 . Assume that the data Yi and the estimators are
P
bounded by L.
Proof. Then
Z Z
2
E |mH − m| dP (x)|D = E |Y − mH | dP (x)|D − E|Y − m(X)|2
2
= T1 + T2
where Z
T1 = E |Y − mH | dP (x)|D − E|Y − m(X)|2 − T2
2
and
1X 1X
|Yi − mH (Xi )|2 − |Yi − m(Xi )|2 ≤ (1+δ) |Yi − mbh (Xi )|2 − |Yi − m(Xi )|2
T2 = (1+δ)
n D0 n D0
and so
E[T2 |D] ≤ (1 + δ) E(|Y − mbh (X)|2 |D) − E|Y − m(X)|2
Z
= (1 + δ) |mbh (x) − m(x)|2 dP (x).
The second part of the proof involves some tedious calculations. We will bound P (T1 ≥ s|D).
The event T1 ≥ s is the same as
!
2 2 1X 2 2
(1 + δ) E(|mH (X) − Y | |D) − E|m(X) − Y | − |Yi − mH (Xi )| − |Yi − m(Xi )| ≥
n D0
s + δ E|mH (X) − Y |2 − E|m(X) − Y |2 .
14
This has probability at most |M| times the probabilty that
!
2 1X 2
|Yi − mH (Xi )|2 − |Yi − m(Xi )|2
(1 + δ) E(|mh (X) − Y | |D) − E|m(X) − Y | − ≥
n D0
s + δ E|mh (X) − Y |2 − E|m(X) − Y |2
where
δσ 2
1
A= s+
(1 + δ)2 16L2
and
2 8L2 δσ 2
2
B = 2σ + s+ .
31+δ 16L2
Now A/B ≥ s/c for c = L2 (16/δ + 35 + 19δ). So
Finally Z ∞
c|M| −nu/c
E[T1 |D] ≤ u + P (T1 > s|D) ≤ u + e .
u n
The result follows by setting u = c log |M|/n.
6 Inference?
Is it possible to do inference after model selection? Do we need to? I’ll discuss this in class.
15
References
Buja, Berk, Brown, George, Pitkin, Traskin, Zhao and Zhang (2015). Models as Apprx-
imations — A Conspiracy of Random Regressors and Model Deviations Against Classical
Inference in Regression. Statistical Science.
Hsu, Kakade and Zhang (2014). Random design analysis and ridge regression. arXiv:1106.2363.
Appendix: L2 Boosting
(0) (k)
Define estimators m
bn ,...,m b (0) (x) = 0 and then iterate the follow-
b n , . . . , as follows. Let m
ing steps:
− βbJ XiJ )2 .
P
3. Find J = argminj RSSj where RSSj = i (Ui
b (k+1) (x) = m
4. Set m b (k) (x) + βbJ xJ .
Yb (k) = Bk Y (16)
where Yb (k) = (m
b (k) (X1 ), . . . , m
b (k) (Xn ))T ,
and
Xj XTj
Hj = . (18)
kXj k2
16
Pdn
Theorem 16 (Bühlmann 2005) Let mn (x) = j=1 βj,n xj be the best linear approxima-
tion based on dn terms. Suppose that:
1−ξ
(A1 Growth) dn ≤ C0 eC1 n for some C0 , C1 > 0 and some 0 < ξ ≤ 1.
(A3 Bounded Covariates) supn max1≤j≤dn maxi |Xij | < ∞ with probability 1.
b n (X) − mn (x)|2 → 0
EX |m (19)
as n → 0.
for some 0 < tk ≤ 1. In the weak greedy algorithm we take Fk = Fk−1 +hf, gk igk . In the weak
orthogonal greedy algorithm we take Fk to be the projection of Rk−1 (f ) onto {g1 , . . . , gk }.
Finally set Rk (f ) = f − Fk .
L2 boosting essentially replaces hf, Xj i with hY, Xj in = n−1 i Yi Xij . Now hY, Xj in has
P
mean hf, Xj i. The main burden of the proof is to show that hY, Xj in is close to hf, Xj i with
17
high probability and then apply Temlyakov’s result. For this we use Bernstein’s inequality.
Recall that if |Zj | are bounded by M and Zj has variance σ 2 then
n2
1
P(|Z − E(Zj )| > ) ≤ 2 exp − 2 . (22)
2 σ + M /3
Hence, the probability that any empirical inner products differ from their functional coun-
terparts is no more than
n2
2 1
dn exp − 2 →0 (23)
2 σ + M /3
because of the growth condition.
The L1 norm depends on n and so we denote this by khkL1,n . For technical reasons, we
assume that kf k∞ ≤ B, that fbn is truncated to be no more than B and that kψk∞ ≤ B for
all ψ ∈ Dn .
Theorem 18 Suppose that pn ≡ |D|n ≤ nc for some c ≥ 0. Let fbN be the output of
the stepwise regression algorithm after N steps. Let f (x) = E(Y |X = x) denote the true
regression function. Then, for every h ∈ Dn ,
!
2
8khkL CN log n 1
P kf − fbN k2 > 4kf − hk2 + 1,n
+ < γ
N n n
Before proving this theorem, we need some preliminary results. For any Λ ⊂ D, let SΛ =
Span(Λ). Define ( )
[
FN = SΛ : |Λ| ≤ N .
Recall that, if F is a set of functions then Np (, F, ν) is the Lp covering entropy with respect
to the probability measure ν and Np (, F) is the supremum of Np (, F, ν) over all probability
measures ν.
18
Also,
N +1 N +1
2eB 2 3eB 2
N 2eB 3eB N
N1 (t, FN ) ≤ 12p log , N2 (t, FN ) ≤ 12p log .
t t t2 t2
Proof. The first two equation follow from standard covering arguments. The second two
equations follow from the fact that the number of subsets of Λ of size at most N is
N X N j
X p ep ep N
N
p N
≤ ≤N ≤ p max N ≤ 4pN .
j j N N N
j=1 j=1
The following lemma is from Chapter 11 of Gyorfi et al. The proof is long and technical and
we omit it.
2 (1 − )αn
β
≤ 14N1 ,F exp − .
20B 214(1 + )B 4
Apply Lemma 20 with = 1/2 together with Lemma 19 to conclude that, for C0 > 0 large
enough,
C0 N log n 1
P A1 > for some f < γ .
n n
To bound A2 , apply Theorem 7 with norm k · kn and with Y replacing f . Then,
4khk21,n
kY − fbk2n ≤ kY − hk2n +
k
19
8khk21,n
and hence A2 ≤ k
. Next, we have that
20
A Closer Look at Sparse Regression
Ryan Tibshirani
(ammended by Larry Wasserman)
1 Introduction
In these notes we take a closer look at sparse linear regression. Throughout, we
make the very strong assumption that Yi = β T Xi + i where E[i |Xi ] = 0 and
Var(i |Xi ) = σ 2 . These assumptions are highly unrealistic but they permit a more de-
tailed analysis. There are several books on high-dimensional estimation: Hastie, Tib-
shirani & Wainwright (2015), Buhlmann & van de Geer (2011), Wainwright (2017).
(Truthfully, calling it “the `0 norm” is a misnomer, since it is not a norm: it does not
satisfy positive homogeneity, i.e., kaβk0 6= akβk0 whenever a 6= 0, 1.)
In constrained form, this gives rise to the problems:
1
In penalized form, the use of `0 , `1 , `2 norms gives rise to the problems:
1
min ky − Xβk22 + λkβk0 (Best subset selection) (4)
β∈Rp 2
1
minp ky − Xβk22 + λkβk1 (Lasso regression) (5)
β∈R 2
1
min ky − Xβk22 + λkβk22 (Ridge regression) (6)
β∈Rp 2
with λ ≥ 0 the tuning parameter. In fact, problems (2), (5) are equivalent. By this,
we mean that for any t ≥ 0 and solution βb in (2), there is a value of λ ≥ 0 such
that βb also solves (5), and vice versa. The same equivalence holds for (3), (6). (The
factors of 1/2 multiplying the squared loss above are inconsequential, and just for
convenience)
It means, roughly speaking, that computing solutions of (2) over a sequence of
t values and performing cross-validation (to select an estimate) should be basically
the same as computing solutions of (5) over some sequence of λ values and perform-
ing cross-validation (to select an estimate). Strictly speaking, this isn’t quite true,
because the precise correspondence between equivalent t, λ depends on the data X, y
Notably, problems (1), (4) are not equivalent. For every value of λ ≥ 0 and
solution βb in (4), there is a value of t ≥ 0 such that βb also solves (1), but the converse
is not true
2
TABLE 3.4. Estimators of βj in the case of orthonormal columns of X. M and λ
are constants chosen by the corresponding techniques; sign denotes the sign of its
argument (±1), and x+ denotes “positive part” of x. Below the table, estimators
are shown by broken red lines. The 45◦ line in gray shows the unrestricted estimate
for reference.
2.3 Sparsity Estimator Formula
The best subset selection and the lasso estimators have a special, useful property:
Best subset
their solutions are sparse, i.e., at (size M ) βb β̂
a solution wej ·will j | ≥β
I(|β̂have b|jβ̂=
(M0) |)
for many components
j ∈ {1, . . . , p}. In problem
Ridge (1), this is obviously true,
β̂j /(1 λ) k ≥ 0 controls the sparsity
+where
level. In problem (2), it is less obviously true, but we get a higher degree of sparsity
the smaller the value Lasso
of t ≥ 0. In the penalized sign(forms, j | −(5),
β̂j )(|β̂(4), λ)+we get more sparsity
the larger the value of λ ≥ 0
This isBestnotSubset
true of ridge regression,Ridge i.e., the solution of (3) orLasso (6) generically has
all nonzero components, no matter the value of t or λ. Note that sparsity is desirable, λ
for two reasons: (i) it corresponds to performing variable selection in the constructed
linear model, and (ii) it) |provides a level of interpretability (beyond sheer accuracy)
|β̂(M
That the `0 (0,0)norm induces sparsity is obvious.(0,0) But, why does the(0,0)`1 norm induce
sparsity and not the `2 norm? There are different ways to look at it; let’s stick
with intuition from the constrained problem forms (2), (5). Figure 1 shows the
“classic” picture, contrasting the way the contours of the squared error loss hit the
two constraint sets, the `1 and `2 balls. As the `1 ball has sharp corners (aligned with
the coordinate axes), we get sparse solutions
β2 ^
β
. β2 ^
β
.
β1 β1
FIGURE 3.11. Estimation picture for the lasso (left) and ridge regression
Figure 1: The “classic” illustration comparing lasso and ridge constraints. From
(right). Shown are contours of the error and constraint functions. The solid blue
Chapter 3 of Hastie et al. (2009)
areas are the constraint regions |β | + |β | ≤ t and β 2 + β 2 ≤ t2 , respectively,
1 2 1 2
while the red ellipses are the contours of the least squares error function.
Intuition can also be drawn from the orthogonal case. When X is orthogonal, it
is not hard to show that the solutions of the penalized problems (4), (5), (6) are
XT y
βbsubset = H√2λ (X T y), βblasso = Sλ (X T y), βbridge =
1 + 2λ
3
respectively, where Ht (·), St (·) are the componentwise hard- and soft-thresholding
functions at the level t. We see several revealing properties: subset selection and
lasso solutions exhibit sparsity when the componentwise least squares coefficients
(inner products X T y) are small enough; the lasso solution exihibits shrinkage, in
that large enough least squares coefficients are shrunken towards zero by λ; the ridge
regression solution is never sparse and compared to the lasso, preferentially shrinkage
the larger least squares coefficients even more
2.4 Convexity
The lasso and ridge regression problems (2), (3) have another very important prop-
erty: they are convex optimization problems. Best subset selection (1) is not, in fact
it is very far from being convex. Consider using the norm ||β||p as a penalty. Sparsity
requires p ≤ 1 and convexity requires p ≥ 1. The only norm that gives sparsity and
convexity is p = 1. The appendix has a brief review of convexity.
5
3.2 Lasso
Now we turn to subgradient optimality (sometimes called the KKT conditions) for
the lasso problem in (5). They tell us that any lasso solution βb must satisfy
X T (y − X β)
b = λs, (9)
where s ∈ ∂kβk
b 1 , a subgradient of the `1 norm evaluated at β.
b Precisely, this means
that
{+1}
βbj > 0
sj ∈ {−1} βbj < 0 j = 1, . . . , p. (10)
[−1, 1] βbj = 0,
From (9) we can read off a straightforward but important fact: even though the
solution βb may not be uniquely determined, the optimal subgradient s is a function
of the unique fitted value X βb (assuming λ > 0), and hence is itself unique.
Now from (10), note that the uniqueness of s implies that any two lasso solutions
must have the same signs on the overlap of their supports. That is, it cannot happen
that we find two different lasso solutions βb and βe with βbj > 0 but βej < 0 for some
j, and hence we have no problem interpretating the signs of components of lasso
solutions.
Let’s assume henceforth that the columns of X are in general position (and we
are looking at a nontrivial end of the path, with λ > 0), so the lasso solution βb is
unique. Let A = supp(β) b be the lasso active set, and let sA = sign(βbA ) be the signs
of active coefficients. From the subgradient conditions (9), (10), we know that
(where recall we know that XAT XA is invertible because X has columns in general
position). We see that the active coefficients βbA are given by taking the least squares
coefficients on XA , (XAT XA )−1 XAT y, and shrinking them by an amount λ(XAT XA )−1 sA .
Contrast this to, e.g., the subset selection solution in (7), where there is no such
shrinkage.
Now, how about this so-called shrinkage term (XAT XA )−1 XAT y? Does it always
act by moving each one of the least squares coefficients (XAT XA )−1 XAT y towards zero?
Indeed, this is not always the case, and one can find empirical examples where a
lasso coefficient is actually larger (in magnitude) than the corresponding least squares
coefficient on the active set. Of course, we also know that this is due to the correlations
6
between active variables, because when X is orthogonal, as we’ve already seen, this
never happens.
On the other hand, it is always the case that the lasso solution has a strictly
smaller `1 norm than the least squares solution on the active set, and in this sense,
we are (perhaps) justified in always referring to (XAT XA )−1 XAT y as a shrinkage term.
To see this, note that, for any vector b, ||b||1 = sT b where s is the vector of signs of
b 1 = sT βb = sT βbA and so
b. So ||β|| A
The first term is less than or equal to k(XAT XA )−1 XAT yk1 , and the term we are sub-
tracting is strictly negative (because (XAT XA )−1 is positive definite).
7
Notice that kX T k∞ = maxj=1,...,p |XjT | is a maximum of p Gaussians, each with
mean zero and variance upper bounded by σ 2 n. By a standard maximal inequality
for Gaussians, for any δ > 0,
p
max |XjT | ≤ σ 2n log(ep/δ),
j=1,...,p
with probability at least 1−δ. Plugging this to the second-to-last display and dividing
by n, we get the finite-sample result for the lasso estimator
r
1 2 2 log(ep/δ)
kX βb − Xβ0 k2 ≤ 4σkβ0 k1 , (13)
n n
with probability at least 1 − δ.
The high-probability result (13) implies an in-sample risk bound of
r
1 log p
EkX βb − Xβ0 k2 . kβ0 k1
2
.
n n
Compare to this with the risk bound (8) for best subset selection, which is on the
(optimal) order of s0 log p/n when β0 has s0 nonzero components. If each of the
nonzero components here has constant p magnitude, then above risk bound for the
lasso estimator is on the order of s0 log p/n, which is much slower.
Predictive risk. Instead of in-sample risk, we might also be interested in out-
of-sample risk, as after all that reflects actual (out-of-sample) predictions. In least
squares, recall, we saw that out-of-sample risk was generally higher than in-sample
risk. The same is true for the lasso Chatterjee (2013) gives a nice, simple analysis of
out-of-sample risk for the lasso. He assumes that x0 , xi , i = 1, . . . , n are i.i.d. from
an arbitrary distribution supported on a compact set in Rp , and shows that the lasso
estimator in bound form (2) with t = kβ0 k1 has out-of-sample risk satisfying
r
log p
E(x0 β − x0 β) . kβ0 k1
Tb T 2 2
.
n
The proof is not much more complicated than the above, for the in-sample risk, and
reduces to a clever application of Hoeffding’s inequality, though we omit it for brevity.
Note here the dependence on kβ0 k21 , rather than kβ0 k1 as in the in-sample risk. This
agrees with the analysis we did in the previous set of notes where we did not assume
the linear model. (Only the interpretation changes.)
8
with an arbitrary mean function µ(X), and normal errors ∼ N (0, σ 2 ). We will
analyze the bound form lasso estimator (2) for simplicity. By optimality of β,
b for any
1
other β feasible for the lasso problem in (2), it holds that
e
hX T (y − X β),
b βe − βi
b ≤ 0. (14)
Rearranging gives
hµ(X) − X β, b ≤ hX T , βb − βi.
b X βe − X βi e (15)
Now using the polarization identity kak22 + kbk22 − ka − bk22 = 2ha, bi,
kX βb − µ(X)k22 + kX βb − X βk
e 2 ≤ kX βe − µ(X)k2 + 2hX T , βb − βi,
2 2
e
Also if we write X βebest as the best linear that predictor of `1 at most t, achieving
the infimum on the right-hand side (which we know exists, as we are minimizing a
continuous function over a compact set), then
r
1 2 log(ep/δ)
kX βb − X βebest k22 ≤ 4σt ,
n n
with probability at least 1 − δ
1
kXvk22 ≥ φ20 kvk22 for all subsets J ⊆ {1, . . . , p} such that |J| = s0
n
and all v ∈ Rp such that kvJ c k1 ≤ 3kvJ k1 (16)
1
To see this, consider minimizing a convex function f (x) over a convex set C. Let xb be a
minimizer. Let z ∈ C be any other point in C. If we move away from the solution xb we can only
x). In other words, h∇f (b
increase f (b x), z − zbi ≥ 0.
9
then
s0 log p
kβb − β0 k22 . (17)
nφ20
with probability tending to 1. (This condition can be slightly weakened, but not
much.) The condition is unlikely to hold in any real problem. Nor is it checkable.
The proof is in the appendix.
References
Beale, E. M. L., Kendall, M. G. & Mann, D. W. (1967), ‘The discarding of variables
in multivariate analysis’, Biometrika 54(3/4), 357–366.
Bertsimas, D., King, A. & Mazumder, R. (2016), ‘Best subset selection via a modern
optimization lens’, The Annals of Statistics 44(2), 813–852.
10
Buhlmann, P. & van de Geer, S. (2011), Statistics for High-Dimensional Data,
Springer.
Candes, E. J. & Tao, T. (2006), ‘Near optimal signal recovery from random projec-
tions: Universal encoding strategies?’, IEEE Transactions on Information Theory
52(12), 5406–5425.
Chen, S., Donoho, D. L. & Saunders, M. (1998), ‘Atomic decomposition for basis
pursuit’, SIAM Journal on Scientific Computing 20(1), 33–61.
Efron, B., Hastie, T., Johnstone, I. & Tibshirani, R. (2004), ‘Least angle regression’,
Annals of Statistics 32(2), 407–499.
Foster, D. & George, E. (1994), ‘The risk inflation criterion for multiple regression’,
The Annals of Statistics 22(4), 1947–1975.
Hastie, T., Tibshirani, R. & Friedman, J. (2009), The Elements of Statistical Learning;
Data Mining, Inference and Prediction, Springer. Second edition.
Hastie, T., Tibshirani, R. & Wainwright, M. (2015), Statistical Learning with Sparsity:
the Lasso and Generalizations, Chapman & Hall.
Hoerl, A. & Kennard, R. (1970), ‘Ridge regression: biased estimation for nonorthog-
onal problems’, Technometrics 12(1), 55–67.
Osborne, M., Presnell, B. & Turlach, B. (2000a), ‘A new approach to variable selection
in least squares problems’, IMA Journal of Numerical Analysis 20(3), 389–404.
11
Osborne, M., Presnell, B. & Turlach, B. (2000b), ‘On the lasso and its dual’, Journal
of Computational and Graphical Statistics 9(2), 319–337.
Raskutti, G., Wainwright, M. J. & Yu, B. (2011), ‘Minimax rates of estimation for
high-dimensional linear regression over `q -balls’, IEEE Transactions on Information
Theory 57(10), 6976–6994.
Tibshirani, R. (1996), ‘Regression shrinkage and selection via the lasso’, Journal of
the Royal Statistical Society: Series B 58(1), 267–288.
van de Geer, S. & Buhlmann, P. (2009), ‘On the conditions used to prove oracle
results for the lasso’, Electronic Journal of Statistics 3, 1360–1392.
Zhao, P. & Yu, B. (2006), ‘On model selection consistency of lasso’, Journal of Ma-
chine Learning Research 7, 2541–2564.
5 Appendix: Convexity
It is convexity that allows to equate (2), (5), and (3), (6) (and yes, the penalized forms
are convex problems too). It is also convexity that allows us to both efficiently solve,
and in some sense, precisely understand the nature of the lasso and ridge regression
solutions
Here is a (far too quick) refresher/introduction to basic convex analysis and convex
optimization. Recall that a set C ⊆ Rn is called convex if for any x, y ∈ C and
t ∈ [0, 1], we have
tx + (1 − t)y ∈ C,
i.e., the line segment joining x, y lies entirely in C. A function f : Rn → R is called
convex if its domain dom(f ) is convex, and for any x, y ∈ dom(f ) and t ∈ [0, 1],
f tx + (1 − t)y ≤ tf (x) + (1 − t)f (y),
i.e., the function lies below the line segment joining its evaluations at x and y. A
function is called strictly convex if this same inequality holds strictly for x 6= y and
t ∈ (0, 1)
E.g., lines, rays, line segments, linear spaces, affine spaces, hyperplans, halfspaces,
polyhedra, norm balls are all convex sets
12
E.g., affine functions aT x + b are convex and concave, quadratic functions xT Qx +
bT x + c are convex if Q 0 and strictly convex if Q 0, norms are convex
Formally, an optimization problem is of the form
min f (x)
x∈D
subject to hi (x) ≤ 0, i = 1, . . . m
`j (x) = 0, j = 1, . . . r
Here D = dom(f ) ∩ m
T Tr
i=1 dom(hi ) ∩ j=1 dom(`j ) is the common domain of all func-
tions. A convex optimization problem is an optimization problem in which all functions
f, h1 , . . . hm are convex, and all functions `1 , . . . `r are affine. (Think: why affine?)
Hence, we can express it as
min f (x)
x∈D
subject to hi (x) ≤ 0, i = 1, . . . m
Ax = b
Why is a convex optimization problem so special? The short answer: because any
local minimizer is a global minimizer. To see this, suppose that x is feasible for the
convex problem formulation above and there exists some R > 0 such that
Such a point x is called a local minimizer. For the sake of contradiction, suppose that
x was not a global minimizer, i.e., there exists some feasible z such that f (z) < f (x).
By convexity of the constraints (and the domain D), the point tz + (1 − t)x is feasible
for any 0 ≤ t ≤ 1. Furthermore, by convexity of f ,
f tz + (1 − t)x ≤ tf (z) + (1 − t)f (x) < f (x)
for any 0 < t < 1. Lastly, we can choose t > 0 small enough so that kx − (tz + (1 −
t)x)k2 = tkx − zk2 ≤ R, and we obtain a contradiction
Algorithmically, this is a very useful property, because it means if we keep “going
downhill”, i.e., reducing the achieved criterion value, and we stop when we can’t do
so anymore, then we’ve hit the global solution
Convex optimization problems are also special because they come with a beautiful
theory of beautiful convex duality and optimality, which gives us a way of understand-
ing the solutions. We won’t have time to cover any of this, but we’ll mention what
subgradient optimality looks like for the lasso
Just based on the definitions, it is not hard to see that (2), (3), (5), (6) are convex
problems, but (1), (4) are not. In fact, the latter two problems are known to be
NP-hard, so they are in a sense even the worst kind of nonconvex problem
13
6 Appendix: Geometry of the solutions
One undesirable feature of the best subset selection solution (7) is the fact that
it behaves discontinuously with y. As we change y, the active set A must change
at some point, and the coefficients will jump discontinuously, because we are just
doing least squares onto the active set. So, does the same thing happen with the
lasso solution (11)? The answer it not immediately clear. Again, as we change y,
the active set A must change at some point; but if the shrinkage term were defined
“just right”, then perhaps the coefficients of variables to leave the active set would
gracefully and continously drop to zero, and coefficients of variables to enter the
active set would continuously move form zero. This would make whole the lasso
solution continuous. Fortuitously, this is indeed the case, and the lasso solution
βb is continuous as a function of y. It might seem a daunting task to prove this,
but a certain perspective using convex geometry provides a very simple proof. The
geometric perspective in fact proves that the lasso fit X βb is nonexpansive in y, i.e.,
1-Lipschitz continuous, which is a very strong form of continuity. Define the convex
polyhedron C = {u : kX T uk∞ ≤ λ} ⊆ Rn . Some simple manipulations of the KKT
conditions show that the lasso fit is given by
X βb = (I − PC )(y),
the residual from projecting y onto C. A picture to show this (just look at the left
panel for now) is given in Figure 2.
The projection onto any convex set is nonexpansive, i.e., kPC (y) − PC (y 0 )k2 ≤
ky − y 0 k2 for any y, y 0 . This should be visually clear from the picture. Actually, the
same is true with the residual map: I − PC is also nonexpansive, and hence the lasso
fit is 1-Lipschitz continuous. Viewing the lasso fit as the residual from projection
onto a convex polyhedron is actually an even more fruitful perspective. Write this
polyhedron as
C = (X T )−1 {v : kvk∞ ≤ λ},
where (X T )−1 denotes the preimage operator under the linear map X T . The set
{v : kvk∞ ≤ λ} is a hypercube in Rp . Every face of this cube corresponds to a subset
A ⊆ {1, . . . p} of dimensions (that achieve the maximum value |λ|) and signs sA ∈
{−1, 1}|A| (that tell which side of the cube the face will lie on, for each dimension).
Now, the faces of C are just faces of {v : kvk∞ ≤ λ} run through the (linear) preimage
transformation, so each face of C can also indexed by a set A ⊆ {1, . . . p} and signs
sA ∈ {−1, 1}|A| . The picture in Figure 2 attempts to convey this relationship with
the colored black face in each of the panels.
Now imagine projecting y onto C; it will land on some face. We have just argued
that this face corresponds to a set A and signs sA . One can show that this set A is
exactly the active set of the lasso solution at y, and sA are exactly the active signs.
The size of the active set |A| is the co-dimension of the face. Looking at the picture:
we can that see that as we wiggle y around, it will project to the same face. From the
14
y
X β̂
û A, sA
(X T )−1
0
0
{v : kvk∞ ≤ λ}
T
C = {u : kX uk∞ ≤ λ}
Rn Rp
Figure 2: A geometric picture of the lasso solution. The left panel shows the polyhe-
dron underlying all lasso fits, where each face corresponds to a particular combination
of active set A and signs s; the right panel displays the “inverse” polyhedron, where
the dual solutions live
correspondence between faces and active set and signs of lasso solutions, this means
that A, sA do not change as we perturb y, i.e., they are locally constant. But this isn’t
true for all points y, e.g., if y lies on one of the rays emanating from the lower right
corner of the polyhedron in the picture, then we can see that small perturbations of
y do actually change the face that it projects to, which invariably changes the active
set and signs of the lasso solution. However, this is somewhat of an exceptional case,
in that such points can be form a of Lebesgue measure zero, and therefore we can
assure ourselves that the active set and signs A, sA are locally constant for almost
every y.
From the lasso KKT conditions (9), (10), it is possible to compute the lasso
solution in (5) as a function of λ, which we will write as β(λ),
b for all values of the
tuning parameter λ ∈ [0, ∞]. This is called the regularization path or solution path of
the problem (5). Path algorithms like the one we will describe below are not always
possible; the reason that this ends up being feasible for the lasso problem (5) is that
the solution path β(λ),
b λ ∈ [0, ∞] turns out to be a piecewise linear, continuous
function of λ. Hence, we only need to compute and store the knots in this path,
which we will denote by λ1 ≥ λ2 ≥ . . . ≥ λr ≥ 0, and the lasso solution at these
knots. From this information, we can then compute the lasso solution at any value
15
1
of λ by linear interpolation.
The knots λ1 ≥ . . . ≥ λr in the solution path correspond to λ values at which
the active set A(λ) = supp(β(λ))
b changes. As we decrease λ from ∞ to 0, the knots
typically correspond to the point at which a variable enters the active set; this con-
nects the lasso to an incremental variable selection procedure like forward stepwise
regression. Interestingly though, as we decrease λ, a knot in the lasso path can also
correspond to the point at which a variables leaves the active set. See Figure 3.
0.3
0.2
Coefficients
0.1
0.0
−0.1
−0.2
lambda
Figure 3: An example of the lasso path. Each colored line denotes a component of the
lasso solution βbj (λ), j = 1, . . . , p as a function of λ. The gray dotted vertical lines
mark the knots λ1 ≥ λ2 ≥ . . .
The lasso solution path was described by Osborne et al. (2000a,b), Efron et al.
(2004). Like the construction of all other solution paths that followed these seminal
works, the lasso path is essentially given by an iterative or inductive verification of the
KKT conditions; if we can maintain that the KKT conditions holds as we decrease
λ, then we know we have a solution. The trick is to start at a value of λ at which the
solution is trivial; for the lasso, this is λ = ∞, at which case we know the solution
must be β(∞)
b = 0.
Why would the path be piecewise linear? The construction of the path from the
16
KKT conditions is actually rather technical (not difficult conceptually, but somewhat
tedious), and doesn’t shed insight onto this matter. But we can actually see it clearly
from the projection picture in Figure 2.
As λ decreases from ∞ to 0, we are shrinking (by a multiplicative factor λ) the
polyhedron onto which y is projected; let’s write Cλ = {u : kX T uk∞ ≤ λ} = λC1 to
make this clear. Now suppose that y projects onto the relative interior of a certain
face F of Cλ , corresponding to an active set A and signs sA . As λ decreases, the
point on the boundary of Cλ onto which y projects, call it u b(λ) = PCλ (y), will move
along the face F , and change linearly in λ (because we are equivalently just tracking
the projection of y onto an affine space that is being scaled by λ). Thus, the lasso
fit X β(λ)
b =y−u b(λ) will also behave linearly in λ. Eventually, as we continue to
decrease λ, the projected point u b(λ) will move to the relative boundary of the face F ;
then, decreasing λ further, it will lie on a different, neighboring face F 0 . This face will
correspond to an active set A0 and signs sA0 that (each) differ by only one element to
A and sA , respectively. It will then move linearly across F 0 , and so on.
Now we will walk through the technical derivation of the lasso path, starting
at λ = ∞ and β(∞)
b = 0, as indicated above. Consider decreasing λ from ∞, and
continuing to set β(λ)
b = 0 as the lasso solution. The KKT conditions (9) read
X T y = λs,
What happens next? As we decrease λ from λ1 , we know that we’re going to have to
change β(λ)
b from 0 so that the KKT conditions remain satisfied. Let j1 denote the
variable that achieves the maximum in (19). Since the subgradient was |sj1 | = 1 at
λ = λ1 , we see that we are “allowed” to make βbj1 (λ) nonzero. Consider setting
as λ decreases from λ1 , where sj1 = sign(XjT1 y). Note that this makes β(λ)
b a piecewise
linear and continuous function of λ, so far. The KKT conditions are then
XjT1 y − Xj1 (XjT1 Xj1 )−1 (XjT1 y − λsj1 ) = λsj1 ,
17
for all j 6= j1 . Recall that the above held with strict inequality at λ = λ1 for all j 6= j1 ,
and by continuity of the constructed solution β(λ), b it should continue to hold as we
decrease λ for at least a little while. In fact, it will hold until one of the piecewise
linear paths
XjT (y − Xj1 (XjT1 Xj1 )−1 (XjT1 y − λsj1 )), j 6= j1
becomes equal to ±λ, at which point we have to modify the solution because otherwise
the implicit subgradient
and max+ denotes the maximum over all of its arguments that are < λ1 .
To keep going: let j2 , s2 achieve the maximum in (21). Let A = {j1 , j2 }, sA =
(sj1 , sj2 ), and consider setting
as λ decreases from λ2 . Again, we can verify the KKT conditions for a stretch of
decreasing λ, but will have to stop when one of
becomes equal to ±λ. By linearity, we can compute this next “hitting time” explic-
itly, just as before. Furthermore, though, we will have to check whether the active
components of the computed solution in (22) are going to cross through zero, because
past such a point, sA will no longer be a proper subgradient over the active compo-
nents. We can again compute this next “crossing time” explicitly, due to linearity.
Therefore, we maintain that (22) is the lasso solution for all λ2 ≥ λ ≥ λ3 , where λ3 is
the maximum of the next hitting time and the next crossing time. For convenience,
the lasso path algorithm is summarized below.
As we decrease λ from a knot λk , we can rewrite the lasso coefficient update in
Step 1 as
βbA (λ) = βbA (λk ) + (λk − λ)(XAT XA )−1 sA ,
(23)
βb−A (λ) = 0.
18
We can see that we are moving the active coefficients in the direction (λk −λ)(XAT XA )−1 sA
for decreasing λ. In other words, the lasso fitted values proceed as
X β(λ)
b b k ) + (λk − λ)XA (X T XA )−1 sA ,
= X β(λ A
for decreasing λ. Efron et al. (2004) call XA (XAT XA )−1 sA the equiangular direction,
because this direction, in Rn , takes an equal angle with all Xj ∈ Rn , j ∈ A.
For this reason, the lasso path algorithm in Algorithm ?? is also often referred
to as the least angle regression path algorithm in “lasso mode”, though we have not
mentioned this yet to avoid confusion. Least angle regression is considered as another
algorithm by itself, where we skip Step 3 altogether. In words, Step 3 disallows any
component path to cross through zero. The left side of the plot in Figure 3 visualizes
the distinction between least angle regression and lasso estimates: the dotted black
line displays the least angle regression component path, crossing through zero, while
the lasso component path remains at zero.
Lastly, an alternative expression for the coefficient update in (23) (the update in
Step 1) is
λk − λ T
βbA (λ) = βbA (λk ) + (XA XA )−1 XAT r(λk ),
λk (24)
β−A (λ) = 0,
b
where r(λk ) = y − XA βbA (λk ) is the residual (from the fitted lasso model) at λk . This
follows because, recall, λk sA are simply the inner products of the active variables
with the residual at λk , i.e., λk sA = XAT (y − XA βbA (λk )). In words, we can see that
the update for the active lasso coefficients in (24) is in the direction of the least
squares coefficients of the residual r(λk ) on the active variables XA .
1 φ2
kXvk22 ≥ 0 kvS k21 for all v ∈ Rp such that kv−S k1 ≤ 3kvS k1 . (25)
n s0
While this may look like an odd condition, we will see it being useful in the proof
below, and we will also have some help interpreting it when we discuss the restricted
eigenvalue condition shortly. Roughly, it means the (truly active) predictors can’t be
too correlated
19
Recall from our previous analysis for the lasso estimator in penalized form (5), we
showed on an event Eδ of probability at least 1 − δ,
p
kX βb − Xβ0 k22 ≤ 2σ 2n log(ep/δ)kβb − β0 k1 + 2λ(kβ0 k1 − kβk
b 1 ).
Choosing λ large enough and applying the triangle inequality then gave us the slow
rate wepderived before. Now we choose λ just slightly larger (by a factor of 2):
λ ≥ 2σ 2n log(ep/δ). The remainder of the analysis will be performed on the event
Eδ and we will no longer make this explicit until the very end. Then
where the two inequalities both followed from the triangle inequality, one application
for each of the two terms, and we have used that βb0,−S = 0. As kX βb − Xβ0 k22 ≥ 0,
we have shown
kβb−S − βb0,−S k1 ≤ 3kβbS − β0,S k1 ,
and thus we may apply the compatibility condition (25) to the vector v = βb − β0 .
This gives us two bounds: one on the fitted values, and the other on the coefficients.
Both start with the key inequality (from the second-to-last display)
For the fitted values, we upper bound the right-hand side of the key inequality (26),
r
2 s0
kX βb − Xβ0 k2 ≤ 3λ kX βb − Xβ0 k2 ,
nφ20
or dividing through both sides by kX βb − Xβ0 k2 , then squaring both sides, and di-
viding by n,
1 9s0 λ2
kX βb − Xβ0 k22 ≤ 2 2 .
n n φ0
p
Plugging in λ = 2σ 2n log(ep/δ), we have shown that
1 72σ 2 s0 log(ep/δ)
kX βb − Xβ0 k22 ≤ , (27)
n nφ20
with probability at least 1 − δ. Notice the similarity between (27) and (8): both
provide us in-sample risk bounds on the order of s0 log p/n, but the bound for the
lasso requires a strong compability assumption on the predictor matrix X, which
roughly means the predictors can’t be too correlated
20
For the coefficients, we lower bound the left-hand side of the key inequality (26),
nφ20 b
kβS − β0,S k21 ≤ 3λkβbS − β0,S k1 ,
s0
so dividing through both sides by kβbS − β0,S k1 , and recalling kβb−S k1 ≤ 3kβbS − β0,S k1 ,
which implies by the triangle inequality that kβb − β0 k1 ≤ 4kβbS − β0,S k1 ,
12s0 λ
kβb − β0 k1 ≤ .
nφ20
p
Plugging in λ = 2σ 2n log(ep/δ), we have shown that
r
24σs 0 2 log(ep/δ)
kβb − β0 k1 ≤ 2
, (28)
φ0 n
p
with probability at least 1 − δ. This is a error bound on the order of s0 log p/n for
the lasso coefficients (in `1 norm)
Restricted eigenvalue result. Instead of compatibility, we may assume that
X satisfies the restricted eigenvalue condition with constant φ0 > 0, i.e.,
1
kXvk22 ≥ φ20 kvk22 for all subsets J ⊆ {1, . . . , p} such that |J| = s0
n
and all v ∈ Rp such that kvJ c k1 ≤ 3kvJ k1 . (29)
This produces essentially the same results as in (27), (28), but additionally, in the `2
norm,
s0 log p
kβb − β0 k22 .
nφ20
with probability tending to 1
Note the similarity between (29) and the compatibility condition (25). The former
is actually stronger, i.e., it implies the latter, because kβk22 ≥ kβJ k22 ≥ kβJ k21 /s0 . We
may interpret the restricted eigenvalue condition roughly as follows: the requirement
(1/n)kXvk22 ≥ φ20 kvk22 for all v ∈ Rn would be a lower bound of φ20 on the smallest
eigenvalue of (1/n)X T X; we don’t require this (as this would of course mean that X
was full column rank, and couldn’t happen when p > n), but instead that require
that the same inequality hold for v that are “mostly” supported on small subsets J
of variables, with |J| = s0
21
We aim to show that, at some value of λ, the lasso solution βb in (5) has an active
set that exactly equals the true support set,
A = supp(β)
b = S,
with high probability. We actually aim to show that the signs also match,
sign(βbS ) = sign(β0,S ),
with high probability. The primal-dual witness method basically plugs in the true
support S into the KKT conditions for the lasso (9), (10), and checks when they can
be verified
We start by breaking up (9) into two blocks, over S and S c . Suppose that
supp(β)
b = S at a solution β.
b Then the KKT conditions become
Hence, if we can satisfy the two conditions (30), (31) with a proper subgradient
s, such that
sS = sign(β0,S ) and ks−S k∞ = max |sj | < 1,
j ∈S
/
then we have met our goal: we have recovered a (unique) lasso solution whose active
set is S, and whose active signs are sign(β0,S )
So, let’s solve for βbS in the first block (30). Just as we did in the work on basic
properties of the lasso estimator, this yields
where we have substituted sS = sign(β0,S ). From (31), this implies that s−S must
satisfy
1 T
X−S I − XS (XST XS )−1 XST y + X−S
T
XS (XST XS )−1 sign(β0,S ).
s−S = (33)
λ
To lay it out, for concreteness, the primal-dual witness method proceeds as follows:
1. Solve for the lasso solution over the S components, βbS , as in (32), and set
βb−S = 0
3. Check that sign(βbS ) = sign(β0,S ), and that ks−S k∞ < 1. If these two checks
pass, then we have certified there is a (unique) lasso solution that exactly re-
covers the true support and signs
22
The success of the primal-dual witness method hinges on Step 3. We can plug in y =
Xβ0 + , and rewrite the required conditions, sign(βbS ) = sign(β0,S ) and ks−S k∞ < 1,
as
and
1 T
X−S I − XS (XST XS )−1 XST + X−S
T
XS (XST XS )−1 sign(β0,S )
< 1. (35)
λ ∞
As ∼ N (0, σ 2 I), we see that the two required conditions have been reduced to
statements about Gaussian random variables. The arguments we need to check these
conditions actually are quite simply, but we will need to make assumptions on X and
β0 . These are:
With these assumptions in place on X and β0 , let’s first consider verifying (34),
and examine ∆S , whose components ∆j , j ∈ S are as defined in (34). We have
Note that w = (XST XS )−1 XST is Gaussian with mean zero and covariance σ 2 (XST XS )−1 ,
so the variances of components of w are bounded by
σ2n
σ 2 Λmax (XST XS )−1 ≤ ,
C
where we have used the minimum eigenvalue assumption. By a standard result on
the maximum of Gaussians, for any δ > 0, it holds with probability at least 1 − δ
that
σ p
k∆S k∞ ≤ √ 2n log (es0 /δ) + λk(XST XS )−1 k∞
C
γ σp
≤ β0,min + √ 2n log (es0 /δ) − 4λ .
C γ
| {z }
a
where in the second line we used the minimum signal condition. As long as a < 0,
we can see that the sign condition (34) is verified
Now, let’s consider verifying (35). Using the mutual incoherence condition, we
have
1 T
X−S I − XS (XST XS )−1 XST + X−S
T
XS (XST XS )−1 sign(β0,S )
≤ kzk∞ + (1 − γ),
λ ∞
where z = (1/λ)X−ST
(I − XS (XST XS )−1 XST ) = (1/λ)X−S
T
PXS , with PXS the projec-
tion matrix onto the column space of XS . Notice that z is Gaussian with mean zero
23
and covariance (σ 2 /λ2 )X−S
T
PXS X−S , so the components of z have variances bounded
by
σ2n σ2n
Λ max (P XS
) ≤ .
λ2 λ2
Therefore, again by the maximal Gaussian inequality, for any δ > 0, it holds with
probability at least 1 − δ that
1 T
X−S I − XS (XST XS )−1 XST + X−S
T
XS (XST XS )−1 sign(β0,S )
λ ∞
σp
≤ 2n log (e(p − s0 )/δ) + (1 − γ)
λ
σp
=1+ 2n log (e(p − s0 )/δ) − γ ,
λ
| {z }
b
given by regressing each Xj on the truly active variables XS , to be small (in `1 norm)
for all j ∈
/ S. In other words, no truly inactive variables can be highly correlated
(or well-explained, in a linear projection sense) by any of the truly active variables.
Finally, this minimum signal condition ensures that the nonzero entries of the true
coefficient vector β0 are big enough to detect. This is quite restrictive and is not
needed for risk bounds, but it is crucial to support recovery.
24
8.2 Minimax bounds
Under the data model (??) with X fixed, subject to the scaling kXj k22 ≤ n, for
j = 1, . . . , p, and ∼ N (0, σ 2 ), Raskutti et al. (2011) derive upper and lower bounds
on the minimax prediction error
1
M (s0 , n, p) = inf sup kX βb − Xβ0 k22 .
βb kβ0 k0 ≤s0 n
(Their analysis is acutally considerably more broad than this and covers the coefficient
error kβb − β0 k2 , as well `q constraints on β0 , for q ∈ [0, 1].) They prove that, under
no additional assumptions on X,
s0 log(p/s0 )
M (s0 , n, p) . ,
n
with probability tending to 1
They also prove that, under a type of restricted eigenvalue condition in which
(1/n)kXvk22
c0 ≤ ≤ c1 for all v ∈ Rp such that kvk0 ≤ 2s0 ,
kvk22
s0 log(p/s0 )
M (s0 , n, p) & ,
n
with probability at least 1/2
The implication is that, for some X, minimax optimal prediction may be able
to be performed at a faster rate than s0 log(p/s0 )/n; but for low correlations, this
is the rate we should expect. (This is consistent with the worst-case-X analysis of
Foster & George (1994), who actually show the worst-case behavior is attained in the
orthogonal X case)
25
J. R. Statist. Soc. B (2009)
71, Part 5, pp. 1009–1030
Pradeep Ravikumar,
University of California, Berkeley, USA
Summary. We present a new class of methods for high dimensional non-parametric regression
and classification called sparse additive models. Our methods combine ideas from sparse linear
modelling and additive non-parametric regression. We derive an algorithm for fitting the models
that is practical and effective even when the number of covariates is larger than the sample
size. Sparse additive models are essentially a functional version of the grouped lasso of Yuan
and Lin. They are also closely related to the COSSO model of Lin and Zhang but decouple
smoothing and sparsity, enabling the use of arbitrary non-parametric smoothers. We give an
analysis of the theoretical properties of sparse additive models and present empirical results
on synthetic and real data, showing that they can be effective in fitting sparse non-parametric
models in high dimensional data.
Keywords: Additive models; Lasso; Non-parametric regression; Sparsity
1. Introduction
Substantial progress has been made recently on the problem of fitting high dimensional
linear regression models of the form Yi = XiT β + "i , for i = 1, . . . , n. Here Yi is a real-valued
response, Xi is a predictor and "i is a mean 0 error term. Finding an estimate of β when p > n
that is both statistically well behaved and computationally efficient has proved challenging;
however, under the assumption that the vector β is sparse, the lasso estimator (Tibshirani,
1996) has been remarkably successful. The lasso estimator β̂ minimizes the l1 -penalized sum of
squares
p
.Yi − XiT β/2 + λ |βj |
i j=1
with the l1 -penalty β1 encouraging sparse solutions, where many components β̂ j are 0. The
good empirical success of this estimator has been recently backed up by results confirming that
it has strong theoretical properties; see Bunea et al. (2007), Greenshtein and Ritov (2004), Zhao
and Yu (2007), Meinshausen and Yu (2006) and Wainwright (2006).
The non-parametric regression model Yi = m.Xi / + "i , where m is a general smooth function,
relaxes the strong assumptions that are made by a linear model but is much more challenging
in high dimensions. Hastie and Tibshirani (1999) introduced the class of additive models of the
form
Address for correspondence: Larry Wasserman, Department of Statistics, 232 Baker Hall, Carnegie Mellon
University, Pittsburgh, PA 15213-3890, USA.
E-mail: [email protected]
This additive combination of univariate functions—one for each covariate Xj —is less general
than joint multivariate non-parametric models but can be more interpretable and easier to fit;
in particular, an additive model can be estimated by using a co-ordinate descent Gauss–Seidel
procedure, called backfitting. Unfortunately, additive models only have good statistical and
computational behaviour when the number of variables p is not large relative to the sample size
n, so their usefulness is limited in the high dimensional setting.
In this paper we investigate sparse additive models (SPAMs), which extend the advantages of
sparse linear models to the additive non-parametric setting. The underlying model is the same
as in equation (1), but we impose a sparsity constraint on the index set {j : fj ≡ 0} of functions
fj that are not identically zero. Lin and Zhang (2006) have proposed COSSO, an extension of
the lasso to this setting, for the case where the component functions fj belong to a reproducing
kernel Hilbert space. They penalized the sum of the reproducing kernel Hilbert space norms of
the component functions. Yuan (2007) proposed an extension of the non-negative garrotte to
this setting. As with the parametric non-negative garrotte, the success of this method depends
on the initial estimates of component functions fj .
In Section 3, we formulate an optimization problem in the population setting that induces
sparsity. Then we derive a sample version of the solution. The SPAM estimation procedure
that we introduce allows the use of arbitrary non-parametric smoothing techniques, effectively
resulting in a combination of the lasso and backfitting. The algorithm extends to classifica-
tion problems by using generalized additive models. As we explain later, SPAMs can also be
thought of as a functional version of the grouped lasso (Antoniadis and Fan, 2001; Yuan and
Lin, 2006).
The main results of this paper include the formulation of a convex optimization problem
for estimating a SPAM, an efficient backfitting algorithm for constructing the estimator and
theoretical results that analyse the effectiveness of the estimator in the high dimensional setting.
Our theoretical results are of two different types. First, we show that, under suitable choices
of the design parameters, the SPAM backfitting algorithm recovers the correct sparsity pattern
asymptotically; this is a property that we call sparsistency, as a shorthand for ‘sparsity pattern
consistency’. Second, we show that the estimator is persistent, in the sense of Greenshtein and
Ritov (2004), which is a form of risk consistency.
In the following section we establish notation and assumptions. In Section 3 we formulate
SPAMs as an optimization problem and derive a scalable backfitting algorithm. Examples show-
ing the use of our sparse backfitting estimator on high dimensional data are included in Section
5. In Section 6.1 we formulate the sparsistency result, when orthogonal function regression is
used for smoothing. In Section 6.2 we give the persistence result. Section 7 contains a discussion
of the results and possible extensions. Proofs are contained in Appendix A.
The statements of the theorems in this paper were given, without proof, in Ravikumar et al.
(2008). The backfitting algorithm was also presented there. Related results were obtained in
Meier et al. (2008) and Koltchinskii and Yuan (2008).
Let μ denote the distribution of X , and let μj denote the marginal distribution of Xj for each
j = 1, . . . , p. For a function fj on [0, 1] denote its L2 .μj / norm by
1
√
fj μj = fj .x/ dμj .x/ = E{fj .Xj /2 }:
2
.4/
0
When the variable Xj is clear from the context, we remove the dependence on μj in the notation
·μj and simply write fj .
For j ∈ {1, . . . , p}, let Hj denote the Hilbert subspace L2 .μj / of measurable functions fj .xj /
of the single scalar variable xj with zero mean, E{fj .Xj /} = 0. Thus, Hj has the inner product
fj , fj = E{fj .Xj / fj .Xj /} .5/
√
and fj = E{fj .Xj /2 } < ∞. Let H = H1 ⊕ H2 ⊕ . . . ⊕ Hp denote the Hilbert space of func-
tions of .x1 , . . . , xp / that have the additive form: m.x/ = Σj fj .xj /, with fj ∈ Hj , j = 1, . . . , p.
Let {ψjk , k = 0, 1, . . .} denote a uniformly bounded, orthonormal basis with respect to L2 [0, 1].
Unless stated otherwise, we assume that fj ∈ Tj where
∞
∞
2 2νj
Tj = fj ∈ Hj : fj .xj / = βjk ψjk .xj /, βjk k C2 .6/
k=0 k=0
for some 0 < C < ∞. We shall take νj = 2 although the extension to other levels of smoothness is
straightforward. It is also possible to adapt to νj although we do not pursue that direction here.
Let Λmin .A/ and Λmax .A/ denote the minimum and maximum eigenvalues of a square matrix
A. If v = .v1 , . . . , vk /T is a vector, we use the norms
k
k
v = v2j , v1 = |vj |, v∞ = max|vj |: .7/
j=1 j=1 j
3. Sparse backfitting
The outline of the derivation of our algorithm is as follows. We first formulate a population
level optimization problem and show that the minimizing functions can be obtained by iterat-
ing through a series of soft thresholded univariate conditional expectations. We then plug in
smoothed estimates of these univariate conditional expectations, to derive our sparse backfitting
algorithm.
where the expectation is taken with respect to X and the noise ". Now consider the following
modification of this problem that introduces a scaling parameter for each function, and that
imposes additional constraints:
1012 P. Ravikumar, J. Lafferty, H. Liu and L. Wasserman
2
p
min E Y− βj gj .Xj / .9/
β∈Rp ,gj ∈Hj j=1
subject to
p
|βj | L, .10/
j=1
E.gj2 / = 1, j = 1, . . . , p, .11/
noting that gj is a function whereas β = .β1 , . . . , βp /T is a vector. The constraint that β lies in
the l1 -ball {β : β1 L} encourages sparsity of the estimated β, just as for the parametric lasso
(Tibshirani, 1996). It is convenient to absorb the scaling constants βj into the functions fj , and
to re-express the minimization in the following equivalent Lagrangian form:
2
1 p p √
L.f , λ/ = E Y − fj .Xj / + λ E{fj2 .Xj /}: .12/
2 j=1 j=1
†The first two steps in the iterative algorithm are the usual backfitting
procedure; the remaining steps carry out functional soft thresholding.
1 n 2 2
ŝj = β̂ j Xij = |β̂j |
n i=1
so the soft thresholding in step 4 of the SPAM backfitting algorithm is the same as the soft
thresholding in step 3 in the co-ordinate descent lasso algorithm.
5. Examples
To illustrate the method, we consider a few examples.
where "i ∼ N.0, 1/; the relevant component functions are given by
f1 .x/ = −sin.1:5x/,
f2 .x/ = x3 + 1:5.x − 0:5/2 ,
f3 .x/ = −φ.x, 0:5, 0:82 /,
f4 .x/ = sin{exp.−0:5x/}
where φ.·, 0:5, 0:82 / is the Gaussian probability distribution function with mean 0.5 and stan-
dard deviation 0.8. The data therefore have 96 irrelevant dimensions. The covariates are sampled
independent and identically distributed from uniform.−2:5, 2:5/. All the component functions
are standardized, i.e.
1016 P. Ravikumar, J. Lafferty, H. Liu and L. Wasserman
1.0
8
0.8
7
Component Norms
0.6
6
Cp
2
5
0.4
75
4
0.2
3
90
0.0
2
14
0.0 0.4 0.8 0.0 0.4 0.8
(a) (b)
6
6
6
4
4
4
4
−6 −4 −2 m3 2
2
2
−4 −2 m2
−2 m1
−4 −2 m4
−4
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
x1 x2 x3 x4
(c) (d) (e) (f)
6
4
−6 −4 −2 m5 2
−6 −4 −2 m6 2
m7
m8
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 x7
x5 x6 x8
(g) (h) (i) (j)
Fig. 1. Simulated data: (a) empirical l2 -norm of the estimated components as plotted against the regular-
ization parameter :λ (the value on the x -axis is proportional to Σj kfˆj k); (b) Cp -scores against the amount
of regularization (::, value of λ which has the smallest Cp -score); estimated ( ) versus true additive
component functions (– – – ) for (c)–(f) the first four relevant dimensions and (g)–(j) the first four irrelevant
dimensions ((c) l1 D 97:05; (d) l1 D 88:36; (e) l1 D 90:65; (f) l1 D 79:26; (g)–(j) l1 D 0)
Sparse Additive Models 1017
1 n
fj .Xij / = 0,
n i=1
.25/
1 n
f 2 .Xij / = 1:
n − 1 i=1 j
The results of applying SPAMs are summarized in Fig. 1, using the plug-in bandwidths
hj = 0:6 sd.Xj /=n1=5 :
Fig. 1(a) shows regularization paths as the parameter λ varies; each curve is a plot of fˆj .λ/
versus
p
p
fˆk .λ/ maxλ fˆk .λ/ .26/
k=1 k=1
for a particular variable Xj . The estimates are generated efficiently over a sequence of λ-values
by ‘warm starting’ fˆj .λt / at the previous value fˆj .λt−1 /. Fig. 1(b) shows the Cp -statistic as a
function of regularization level.
(a) (b)
Fig. 2. Comparison of sparse reconstructions by using (a) the lasso and (b) SPAMs
1018 P. Ravikumar, J. Lafferty, H. Liu and L. Wasserman
Let {yi }i=1,:::,N be the data to be represented with respect to some learned basis, where each
instance yi ∈ Rn is an n-dimensional vector. The linear sparse coding optimization problem is
N
1 i i 2
min y − Xβ + λβ 1 i
.27/
β,X i=1 2n
such that
Xj 1: .28/
Here X is an n × p matrix with columns Xj , representing the ‘dictionary’ entries or basis vec-
tors to be learned. It is not required that the basis vectors are orthogonal. The l1 -penalty on
the coefficients β i encourages sparsity, so each data vector yi is represented by only a small
number of dictionary elements. Sparsity allows the features to specialize, and to capture salient
properties of the data.
This optimization problem is not jointly convex in β i and X. However, for fixed X , each
weight vector β i is computed by running the lasso. For fixed β i , the optimization is similar to
ridge regression and can be solved efficiently. Thus, an iterative procedure for (approximately)
solving this optimization problem is easy to derive.
In the case of sparse coding of natural images, as in Olshausen and Field (1996), the basis vec-
tors Xj encode basic edge features at different scales and spatial orientations. In the functional
version, we no longer assume a linear parametric fit between the dictionary X and the data
y. Instead, we model the relationship by using an additive model. This leads to the following
optimization problem for functional sparse coding:
N
1 i p 2
p
min y − fji .Xj / + λ fji .29/
f ,X i=1 2n j=1 j=1
such that
Xj 1, j = 1, . . . , p: .30/
Fig. 2 illustrates the reconstruction of various image patches by using the sparse linear model
compared with the SPAM. Local linear smoothing was used with a Gaussian kernel having fixed
bandwidth h = 0:05 for all patches and all codewords. The codewords Xj are those obtained by
using the Olshausen-Field procedure; these become the design points in the regression estima-
tors. Thus, a codeword for a 16 × 16 patch corresponds to a vector Xj of dimension 256, with
each Xij the grey level for a particular pixel.
6. Theoretical properties
6.1. Sparsistency
In the case of linear regression, with fj .Xj / = βjÅT Xj , several researchers have shown that, under
certain conditions on n and p, the number of relevant variables s = |supp.β Å /|, and the design
matrix X , the lasso recovers the sparsity pattern asymptotically, i.e. the lasso estimator β̂ n is
sparsistent:
P{supp.β Å / = supp.β̂ n /} → 1: .31/
Here, supp.β/ = {j : βj = 0}. References include Wainwright (2006), Meinshausen and Bühlmann
(2006), Zou (2005), Fan and Li (2001) and Zhao and Yu (2007). We show a similar result for
SPAMs under orthogonal function regression.
Sparse Additive Models 1019
In terms of an orthogonal basis ψ, we can write
p ∞
Å ψ .X / + " :
Yi = βjk jk ij i .32/
j=1 k=1
To simplify the notation, let βj be the dn -dimensional vector {βjk , k = 1, . . . , dn } and let Ψj
be the n × dn matrix Ψj .i, k/ = ψjk .Xij /. If A ⊂ {1, . . . , p}, we denote by ΨA the n × d|A| matrix
where, for each j ∈ A, Ψj appears as a submatrix in the natural way.
We now analyse the sparse backfitting algorithm of Table 1 by assuming that an orthogonal
series smoother is used to estimate the conditional expectation in its step 2. As noted earlier, an
orthogonal series smoother for a predictor Xj is the least squares projection onto a truncated
set of basis functions {ψj1 , . . . , ψjd }. Our optimization problem in this setting is
1
p 2
p 1 T T
min Y − Ψj βj + λ βj Ψj Ψj βj : .33/
β 2n j=1 2 j=1 n
Combined with the soft thresholding step, the update for fj in the algorithm in Table 1 can thus
be seen to solve the problem
1 2 1 T T
min Rj − Ψj βj 2 + λn β Ψ Ψj βj
β 2n n j j
where v22 denotes Σni=1 v2i and Rj = Y − Σl=j Ψl βl is the residual for fj . The sparse backfitting
algorithm thus solves
1 p 2
p 1
min{Rn .β/ + λn Ω.β/} = min Y− Ψj βj + λn √ Ψj βj .34/
β β 2n j=1 2 j=1 n 2
where Rn denotes the squared error term and Ω denotes the regularization term, and each βj is
a dn -dimensional vector. Let S denote the true set of variables {j : fj = 0}, with s = |S|, and let
S c denote its complement. Let Ŝ n = {j : β̂j = 0} denote the estimated set of variables from the
minimizer β̂ n , with corresponding function estimates f̂j .xj / = Σdk=1
n
β̂jk ψjk .xj /. For the results in
this section, we shall treat the covariates as fixed. A preliminary version of the following result
is stated, without proof, in Ravikumar et al. (2008).
Theorem 2. Suppose that the following conditions hold on the design matrix X in the orthog-
onal basis ψ:
1 T
Λmax Ψ ΨS Cmax < ∞, .35/
n S
1 T
Λmin ΨS ΨS Cmin > 0, .36/
n
−1
1 T 1 T Cmin 1 − δ
maxc Ψj ΨS ΨS ΨS √ , for some 0 < δ 1: .37/
j∈S n n Cmax s
Assume that the truncation dimension dn satisfies dn → ∞ and dn = o.n/. Furthermore, sup-
pose the following conditions, which relate the regularization parameter λn to the design
parameters n and p, the number of relevant variables s and the truncation size dn :
s
→ 0, .38/
dn λn
1020 P. Ravikumar, J. Lafferty, H. Liu and L. Wasserman
dn log{dn .p − s/}
→ 0, .39/
nλ2n
1 log.sdn / s3=2 √
+ + λn .sdn / → 0 .40/
ρÅn n dn
where ρnÅ = minj∈S βjÅ ∞ . Then the solution β̂ n to problem (33) is unique and satisfies Ŝ n = S
with probability approaching 1.
This result parallels the theorem of Wainwright (2006) on model selection consistency of
the lasso; however, technical subtleties arise because of the truncation dimension dn which is
increasing with sample size, and the matrix ΨT j Ψ which appears in the regularization of βj . As
a result, the operator norm rather than the ∞-norm appears in the incoherence condition (37).
Note, however, that condition (37) implies that
−1 −1
ΨT T
S c ΨS .ΨS ΨS / ∞ = max
c
ΨT T
j ΨS .ΨS ΨS / ∞ .41/
j∈S
Cmin dn
.1 − δ/ .42/
Cmax
√ √
since .1= n/A∞ A mA∞ for an m × n matrix A. This relates it to the more standard
incoherence conditions that have been used for sparsistency in the case of the lasso.
The following corollary, which imposes the additional condition that the number of relevant
variables is bounded, follows directly. It makes explicit how to choose the design parameters dn
and λn , and implies a condition on the fastest rate at which the minimum norm ρÅn can approach 0.
Corollary 1. Suppose that s = O.1/, and assume that the design conditions (35)–(37) hold. If
the truncation dimension dn , regularization parameter λn and minimum norm ρÅn satisfy
dn n1=3 , .43/
log.np/
λn , .44/
n1=3
1=6
1 n
Å =o .45/
ρn log.np/
then P.Ŝ n = S/ → 1.
The following proposition clarifies the implications of condition (45), by relating the sup-norm
βj ∞ to the function norm fj 2 .
Proposition 1. Suppose that f.x/ = Σk βk ψk .x/ is in the Sobolev space of order ν > 21 , so that
Σ∞
i=1 βi i
2 2ν C2 for some constant C. Then
f 2 = β2 cβ2ν=.2ν+1/
∞ .46/
is the predictive oracle. Greenshtein and Ritov (2004) showed that the lasso is persistent for
Mn = {l.x/ = xT β : β1 Ln } and Ln = o{n= log.n/1=4 }. Note that mÅn is the best linear approx-
imation (in prediction risk) in Mn but the true regression function is not assumed to be linear.
Here we show a similar result for SPAMs.
In this section, we assume that the SPAM estimator m̂n is chosen to minimize
2
1 n
Yi − βj gj .Xij / .50/
n i=1 j
subject to β1 Ln and gj ∈ Tj . We make no assumptions about the design matrix. Let Mn ≡
Mn .Ln / be defined by
pn
Mn = m : m.x/ = βj gj .xj / : E.gj / = 0, E.gj2 / = 1, |βj | Ln .51/
j=1 j
7. Discussion
The results that are presented here show how many of the recently established theoretical prop-
erties of l1 -regularization for linear models extend to SPAMs. The sparse backfitting algorithm
that we have derived is attractive because it decouples smoothing and sparsity, and can be used
with any non-parametric smoother. It thus inherits the nice properties of the original backfitting
procedure. However, our theoretical analyses have made use of a particular form of smoothing,
using a truncated orthogonal basis. An important problem is thus to extend the theory to cover
more general classes of smoothing operators. Convergence properties of the SPAM backfitting
1022 P. Ravikumar, J. Lafferty, H. Liu and L. Wasserman
algorithm should also be investigated; convergence of special cases of standard backfitting was
studied by Buja et al. (1989).
An additional direction for future work is to develop procedures for automatic bandwidth
selection in each dimension. We have used plug-in bandwidths and truncation dimensions dn in
our experiments and theory. It is of particular interest to develop procedures that are adaptive
to different levels of smoothness in different dimensions. It would also be of interest to consider
more general penalties of the form pλ .fj /, as in Fan and Li (2001).
Finally, we note that, although we have considered basic additive models that allow functions
of individual variables, it is natural to consider interactions, as in the functional analysis-of-
variance model. One challenge is to formulate suitable incoherence conditions on the functions
that enable regularization-based procedures or greedy algorithms to recover the correct inter-
action graph. In the parametric setting, one result in this direction is Wainwright et al. (2007).
Acknowledgements
This research was supported in part by National Science Foundation grant CCF-0625879 and
a Siebel scholarship to PR.
Appendix A: Proofs
A.1. Proof of theorem 1
Consider the minimization of the Lagrangian
2
1 p p √
min {L.f , λ/} ≡ E Y − fj .Xj / + λ E{fj .Xj /2 } .53/
{fj ∈Hj } 2 j=1 j=1
with respect to fj ∈ Hj , holding the other components {fk , k = j} fixed. The stationary condition is obtained
by setting the Fréchet derivative to 0. Denote by @j L.f , λ; ηj / the directional derivative with respect to fj
in the direction ηj .Xj / ∈ Hj {E.ηj / = 0, E.ηj2 / < ∞}. Then the stationary condition can be formulated as
@j L.f , λ; ηj / = 21 E{.fj − Rj + λvj /ηj } = 0 .54/
√
j = Y − Σk=j fk is the residual for fj , and vj ∈ Hj is an element of the subgradient @ E.fj /, satisfying
2
where R√
vj = fj = E.fj2 / if E.fj2 / = 0 and vj ∈ {uj ∈ Hj |E.u2j / 1} otherwise.
Using iterated expectations, the above condition can be rewritten as
E[{fj + λvj − E.Rj |Xj /}ηj ] = 0: .55/
But, since fj − E.Rj |Xj / + λvj ∈ Hj , we can compute the derivative in the direction ηj = fj − E.Rj |Xj / +
λvj ∈ Hj , implying that
E[{fj .xj / − E.Rj |Xj = xj / + λ vj .xj /}2 ] = 0, .56/
i.e.
fj + λvj = E.Rj |Xj / almost everywhere. .57/
Denote the conditional expectation √ E.Rj |Xj /—also the projection of the residual Rj onto Hj —by Pj .
Now, if E.fj2 / = 0, then vj = fj = E.fj2 /, which from condition (57) implies
√ √ √
E.Pj2 / = E[{fj + λfj = E.fj2 /}2 ] .58/
λ √
= 1+ √ E.fj2 / .59/
E.fj /
2
√
= E.fj2 / + λ .60/
λ: .61/
√
If E.fj2 / = 0, then fj = 0 almost everywhere, and E.v2j / 1. Equation (57) then implies that
Sparse Additive Models 1023
√
E.Pj2 / λ: .62/
Using equation (60), we thus arrive at the soft thresholding update for fj :
λ
fj = 1 − √ Pj .64/
E.Pj2 / +
.1=n/ΨTj Ψj βj
gj = √ if βj = 0,
{.1=n/βjT ΨTj Ψj βj }
−1
1 T
gjT Ψ Ψj gj 1 if βj = 0:
n j
Our argument is based on the technique of a primal dual witness, which has been used previously in
the analysis of the lasso (Wainwright, 2006). In particular, we construct a coefficient subgradient pair
.β̂, ĝ/ which satisfies supp.β̂/ = supp.β Å / and in addition satisfies the optimality conditions for the objec-
tive (34) with high probability. Thus, when the procedure succeeds, the constructed coefficient vector β̂
is equal to the solution of the convex objective (34), and ĝ is an optimal solution to its dual. From its
construction, the support of β̂ is equal to the true support supp.β Å /, from which we can conclude that
the solution of the objective (34) is sparsistent. The construction of the primal dual witness proceeds as
follows.
(a) Set β̂ S c = 0.
(b) Set ĝ S = @Ω.β Å /S .
(c) With these settings of β̂ S c and ĝ S , obtain β̂ S and ĝ S c from the stationary conditions in equation (65).
For the witness procedure to succeed, we must show that .β̂, ĝ/ is optimal for the objective (34), meaning
that
β̂ j = 0 for j ∈ S, .66a/
−1
1 T
gjT Ψ Ψj gj < 1 for j ∈ S c : .66b/
n j
For uniqueness of the solution, we require strict dual feasibility, meaning strict inequality in condition
(66b). In what follows, we show that these two conditions hold with high probability.
1024 P. Ravikumar, J. Lafferty, H. Liu and L. Wasserman
A.2.1. Condition (66a)
Setting β̂ S c = 0 and
.1=n/ΨTj Ψj βjÅ
ĝ j = √ for j ∈ S,
{.1=n/β ÅT ΨT Ψ β Å }
j j j j
A.2.2. Bounding T3
Note that, for j ∈ S,
−1
1 T 1
1 = gjT Ψ Ψj gj gj 2 ,
n j Cmax
√
and thus |gj Cmax . Noting further that
√
gS ∞ = max.gj ∞ / max.gj 2 / Cmax , .70/
j∈S j∈S
it follows that
√
T3 := Σ−1 −1
SS ĝ S ∞ Cmax ΣSS ∞ : .71/
A.2.3. Bounding T2
We proceed in two steps; we first bound V ∞ and use this to bound .1=n/ΨTS V ∞ . Note that, as we are
working over the Sobolev spaces Sj of order 2,
∞
Å Ψ .X / B |β Å |
∞
|Vi | = βjk jk ij jk
j∈S k=dn +1 j∈S k=dn +1
∞ |β Å |k2
∞
∞ 1
β Å2 k4
jk
=B B jk
j∈S k=dn +1 k2 j∈S k=dn +1 k=dn +1 k 4
∞ 1 sB
sBC 4
3=2
,
k=dn +1 k dn
Sparse Additive Models 1025
for some constant B > 0. It follows that
1 T 1
Ψ V Ψ .X / V ∞ Ds , .72/
n jk n jk ij 3=2
i dn
where D denotes a generic constant. Thus,
1 T Ds
T2 := Σ−1
SS Ψ V Σ−1
SS ∞ : .73/
n S ∞
3=2
dn
A.2.4. Bounding T1
Let Z = T1 = Σ−1
SS .1=n/ΨS W . Note that W ∼ N.0, σ I/, so that Z is Gaussian as well, with mean 0. Consider
T 2
σ 2 T −1 σ2
var.Zl / = el ΣSS el :
n Cmin n
By Gaussian comparison results (Ledoux and Talagrand, 1991), we have then that
√ log.sdn /
E.Z∞ / 3 {log.sdn /var.Z/∞ } 3σ : .74/
nCmin
Substituting the bounds for T2 and T3 from equations (73) and (71) respectively into equation (69),
and using the bound for the expected value of T1 from inequality (74), it follows from an application of
Markov’s inequality that
Å ρnÅ −1 −3=2 √ ρnÅ
P β̂ S − βS ∞ > P Z∞ + ΣSS ∞ .Dsdn + λn Cmax / >
2 2
2 √
Å {E.Z∞ / + Σ−1 SS ∞ .Dsdn
−3=2
+ λn Cmax /}
ρn
2 log.sdn / Ds √
Å 3σ + Σ−1
SS ∞ + λ n C max ,
ρn nCmin dn
3=2
But this is satisfied by assumption (40) in the theorem. We have thus shown that condition (66a) is satisfied
with probability converging to 1.
1 T √ 1 Ds
Ψ V dn ΨTj V ,
n j n ∞ dn
and hence
1 T √ 1 Ds3=2
Ψ V s ΨTS V :
n S n ∞ dn
Substituting in the bound (81) on the mean μj ,
−1 √ Ds3=2 Ds
μj ΣjS ΣSS .sCmax / + + : .82/
λn dn λn dn
s
→ 0: .84/
λn dn
Sparse Additive Models 1027
Thus the bound on the mean becomes
√ 2Ds √
μj Cmin .1 − δ/ + < Cmin ,
λn dn
for sufficiently large n. It therefore suffices, for condition (66b) to be satisfied, to show that
δ
P maxc ĝ j − μj ∞ > √ → 0, .85/
j∈S 2 dn
since this implies that
ĝ j μj + ĝ j − μj
√
μj + dn ĝ j − μj ∞
√ δ
Cmin .1 − δ/ + + o.1/,
2
with probability approaching 1. To show result (85), we again appeal to Gaussian comparison results.
Define
W
Zj = ΨTj .I − ΨS .ΨTS ΨS /−1 ΨTS / , .86/
n
for j ∈ S c . Then Zj are zero-mean Gaussian random variables, and we need to show that
Zj ∞ δ
P maxc √ → ∞: .87/
j∈S λn 2 dn
A calculation shows that E.Zjk
2
/ σ 2 =n. Therefore, we have by Markov’s inequality and Gaussian com-
parison that
√
Zj ∞ δ 2 dn
P maxc √ E.max |Zjk |/
j∈S λn 2 dn δλn jk
√
2 dn √ √
[3 log{.p − s/dn } max{ E.Zjk2
/}]
δλn jk
6σ dn log{.p − s/dn }
,
δλn n
which converges to 0 given the assumption (39) of the theorem that
λ2n n
→ ∞:
dn log{.p − s/dn }
Thus condition (66b) is also satisfied with probability converging to 1, which completes the proof.
∞
β∞ |βi | .89/
i=1
k
∞
= β∞ |βi | + β∞ |βi | .90/
i=1 i=k+1
∞ iν |β |
i
kβ2∞ + β∞ ν
.91/
i=k+1 i
1028 P. Ravikumar, J. Lafferty, H. Liu and L. Wasserman
∞ ∞ 1
kβ2∞ + β∞ βi2 i2ν 2ν
.92/
i=1 i=k+1 i
k1−2ν
kβ2∞ + β∞ C , .93/
2ν − 1
where the last inequality uses the bound
∞
∞
−2ν k1−2ν
i x−2ν dx = : .94/
i=k+1 k 2ν − 1
Let kÅ be the index that minimizes expression (93). Some calculus shows that kÅ satisfies
c1 β−2=.2ν+1/
∞ kÅ c2 β−2=.2ν+1/
∞ .95/
for some constants c1 and c2 . Using the above expression in expression (93) then yields
f 22 β∞ .c2 β.2ν−1/=.2ν+1/
∞ + c1 β.2ν−1/=.2ν+1/
∞ / .96/
= cβ4ν=.2ν+1/
∞ .97/
Hence m̂n is the minimizer of R̂.β, g/ subject to the constraint Σj βj gj .xj / ∈ Mn .Ln / and gj ∈ Tj . For all
.β, g/,
|R̂.β, g/ − R.β, g/| β21 max sup |μ̂jk .g/ − μjk .g/| .103/
jk gj ∈Sj , gk ∈Sk
where
n
μ̂jk .g/ = n−1 gj .Zij / gk .Zik /
i=1 jk
Sparse Additive Models 1029
and μjk .g/ = E{gj .Zj / gk .Zk /}. From inequality (98) it follows that
1=2
1
log{N[ ] .", Mn /} 2 log.pn / + K : .104/
"
√
Hence, J[ ] .C, Mn / = O{ log.pn /} and it follows from inequality (100) and Markov’s inequality that
log.pn / 1
max sup |μ̂jk .g/ − μjk .g/| = OP = OP : .105/
jk gj ∈Sj , gk ∈Sk n n.1−ξ/=2
We conclude that
L2n
sup |R̂.g/ − R.g/| = OP : .106/
g∈M n.1−ξ/=2
Therefore,
L2n
R.mÅ / R.m̂n / R̂.m̂n / + OP .1−ξ/=2
n
L2n L2n
R̂.mÅ / + OP R.mÅ / + OP
n.1−ξ/=2 n.1−ξ/=2
and the conclusion follows.
References
Antoniadis, A. and Fan, J. (2001) Regularized wavelet approximations (with discussion). J. Am. Statist. Ass., 96,
939–967.
Buja, A., Hastie, T. and Tibshirani, R. (1989) Linear smoothers and additive models. Ann. Statist., 17, 453–510.
Bunea, F., Tsybakov, A. and Wegkamp, M. (2007) Sparsity oracle inequalities for the lasso. Electron. J. Statist.,
1, 169–194.
Daubechies, I., Defrise, M. and DeMol, C. (2004) An iterative thresholding algorithm for linear inverse problems.
Communs Pure Appl. Math., 57, 1413–1457.
Daubechies, I., Fornasier, M. and Loris, I. (2007) Accelerated projected gradient method for linear inverse
problems with sparsity constraints. Technical Report. Princeton University, Princeton. (Available from
arXiv:0706.4297.)
Fan, J. and Jiang, J. (2005) Nonparametric inference for additive models. J. Am. Statist. Ass., 100, 890–907.
Fan, J. and Li, R. Z. (2001) Variable selection via penalized likelihood. J. Am. Statist. Ass., 96, 1348–1360.
Greenshtein, E. and Ritov, Y. (2004) Persistency in high dimensional linear predictor-selection and the virtue of
over-parametrization. Bernoulli, 10, 971–988.
Hastie, T. and Tibshirani, R. (1999) Generalized Additive Models. New York: Chapman and Hall.
Juditsky, A. and Nemirovski, A. (2000) Functional aggregation for nonparametric regression. Ann. Statist., 28,
681–712.
Koltchinskii, V. and Yuan, M. (2008) Sparse recovery in large ensembles kernel machines. In Proc. 21st A. Conf.
Learning Theory, pp. 229–238. Eastbourne: Omnipress.
Ledoux, M. and Talagrand, M. (1991) Probability in Banach Spaces: Isoperimetry and Processes. New York:
Springer.
Lin, Y. and Zhang, H. H. (2006) Component selection and smoothing in multivariate nonparametric regression.
Ann. Statist., 34, 2272–2297.
Meier, L., van de Geer, S. and Bühlmann, P. (2008) High-dimensional additive modelling. (Available from arXiv.)
Meinshausen, N. and Bühlmann, P. (2006) High dimensional graphs and variable selection with the lasso. Ann.
Statist., 34, 1436–1462.
Meinshausen, N. and Yu, B. (2006) Lasso-type recovery of sparse representations for high-dimensional data.
Technical Report 720. Department of Statistics, University of California, Berkeley.
Olshausen, B. A. and Field, D. J. (1996) Emergence of simple-cell receptive field properties by learning a sparse
code for natural images. Nature, 381, 607–609.
Ravikumar, P., Liu, H., Lafferty, J. and Wasserman, L. (2008) Spam: sparse additive models. In Advances in
Neural Information Processing Systems, vol. 20 (eds J. Platt, D. Koller, Y. Singer and S. Roweis), pp. 1201–1208.
Cambridge: MIT Press.
Tibshirani, R. (1996) Regression shrinkage and selection via the lasso. J. R. Statist. Soc. B, 58, 267–288.
van der Vaart, A. W. (1998) Asymptotic Statistics. Cambridge: Cambridge University Press.
1030 P. Ravikumar, J. Lafferty, H. Liu and L. Wasserman
Wainwright, M. (2006) Sharp thresholds for high-dimensional and noisy recovery of sparsity. Technical Report
709. Department of Statistics, University of California, Berkeley.
Wainwright, M. J., Ravikumar, P. and Lafferty, J. D. (2007) High-dimensional graphical model selection using l1 -
regularized logistic regression. In Advances in Neural Information Processing Systems, vol. 19 (eds B. Schölkopf,
J. Platt and T. Hoffman), pp. 1465–1472. Cambridge: MIT Press.
Wasserman, L. and Roeder, K. (2007) Multi-stage variable selection: screen and clean. Carnegie Mellon University,
Pittsburgh. (Available from arXiv:0704.1139.)
Yuan, M. (2007) Nonnegative garrote component selection in functional ANOVA models. Proc. Artif. Intell.
Statist. (Available from www.stat.umn.edu/∼aistat/proceedings.)
Yuan, M. and Lin, Y. (2006) Model selection and estimation in regression with grouped variables. J. R. Statist.
Soc. B, 68, 49–67.
Zhao, P. and Yu, B. (2007) On model selection consistency of lasso. J. Mach. Learn. Res., 7, 2541–2567.
Zou, H. (2005) The adaptive lasso and its oracle properties. J. Am. Statist. Ass., 101, 1418–1429.
Linear Classification
36-708
1 Review of Classification
The problem of predicting a discrete random variable Y from another random variable X
is called classification, also sometimes called discrimination, pattern classification or pat-
tern recognition. We observe iid data (X1 , Y1 ), . . . , (Xn , Yn ) ∼ P where Xi ∈ Rd and
Yi ∈ {0, 1, . . . , K − 1}. Often, the covariates X are also called features. The goal is to
predict Y given a new X; here are some examples:
1. The Iris Flower study. The data are 50 samples from each of three species of Iris
flowers, Iris setosa, Iris virginica and Iris versicolor; see Figure 1. The length and
width of the sepal and petal are measured for each specimen, and the task is to predict
the species of a new Iris flower based on these features.
2. The Coronary Risk-Factor Study (CORIS). The data consist of attributes of 462 males
between the ages of 15 and 64 from three rural areas in South Africa. The outcome Y
is the presence (Y = 1) or absence (Y = 0) of coronary heart disease and there are 9
covariates: systolic blood pressure, cumulative tobacco (kg), ldl (low density lipopro-
tein cholesterol), adiposity, famhist (family history of heart disease), typea (type-A
behavior), obesity, alcohol (current alcohol consumption), and age. The goal is to
predict Y from all these covariates.
3. Handwriting Digit Recognition. Here each Y is one of the ten digits from 0 to 9. There
are 256 covariates X1 , . . . , X256 corresponding to the intensity values of the pixels in a
16 × 16 image; see Figure 2.
4. Political Blog Classification. A collection of 403 political blogs were collected during
two months before the 2004 presidential election. The goal is to predict whether a blog
is liberal (Y = 0) or conservative (Y = 1) given the content of the blog.
1
Figure 1: Three different species of the Iris data. Iris setosa (Left), Iris versicolor (Middle),
and Iris virginica (Right).
For K > 2, we have a multiclass classification problem. To simplify the discussion, we mainly
discuss binary classification, and briefly explain how methods can extend to the multiclass
case.
A binary classifier h is a function from X to {0, 1}. It is linear if there exists a function
H(x) = β0 + β T x such that h(x) = I(H(x) > 0). H(x) is also called a linear discriminant
function. The decision boundary is therefore defined as the set x ∈ Rd : H(x) = 0 , which
corresponds to a (d − 1)-dimensional hyperplane within the d-dimensional input space X .
X covariate (feature)
X domain of X, usually X ⊂ Rd
Y response (pattern)
h binary classifier, h : X → {0, 1}
H linear discriminant function, H(x) = β0 + β T x and h(x) = I H(x) > 0
m regression function, m(x) = E(Y |X = x) = P(Y = 1|X = x)
PX marginal distribution of X
pj pj (x) = p(x|Y = j), the conditional density1 of X given that Y = j
π1 π1 = P(Y = 1)
P joint distribution of (X, Y )
2
Figure 2: Examples from the zipcode data.
The rule h∗ is called the Bayes rule. The risk R∗ = R(h∗ ) of the Bayes rule is called the
Bayes risk. The set {x ∈ X : m(x) = 1/2} is called the Bayes decision boundary.
3
Hence,
P Y 6= h(X)|X = x − P Y 6= h∗ (X)|X = x
= h∗ (x)m(x) + 1 − h∗ (x) 1 − m(x) − h(x)m(x) + 1 − h(x) 1 − m(x)
∗ 1
h∗ (x) − h(x) .
= 2m(x) − 1 h (x) − h(x) = 2 m(x) − (9)
2
When m(x) ≥ 1/2 and h∗ (x) = 1, (9) is non-negative. When m(x) < 1/2 and h∗ (x) = 0, (9)
is again non-negative. This proves (4).
If H is a set of classifiers then the classifier ho ∈ H that minimizes R(h) is the oracle classifier.
Formally,
R(ho ) = inf R(h)
h∈H
and Ro = R(ho ) is called the oracle risk of H. In general, if h is any classifier and R∗ is the
Bayes risk then,
The first term is analogous to the variance, and the second is analogous to the squared bias
in linear regression.
For a binary classifier problem, given a covariate X we only need to predict its class label
Y = 0 or Y = 1. This is in contrast to a regression problem where we need to predict a
real-valued response Y ∈ R. Intuitively, classification is a much easier task than regression.
To rigorously formalize this, let m∗ (x) = E(Y |X = x) be the true regression function and
4
let h∗ (x) be the corresponding Bayes rule. Let m(x)
b be an estimate of m∗ (x) and define the
plug-in classification rule:
> 12
1 if m(x)
h(x) = (14)
b b
0 otherwise.
We have the following theorem.
√
The last inequality follows from the fact that E|Z| ≤ EZ 2 for any Z.
Example 3 Figure 3 shows two one-dimensional regression functions. In both cases, the
Bayes rule is h∗ (x) = I(x > 0) and the decision boundary is D = {x = 0}. The left plot
illustrates an easy problem; there is little ambiguity around the decision boundary. Even a
poor estimate of m(x) will recover the correct decision boundary. The right plot illustrates a
hard problem; it is hard to know from the data if you are to the left or right of the decision
boundary.
5
1 1
1/2 1/2
Bayes decision boundary Bayes decision boundary
regression function m(x)
regression function m(x)
0 0
x=0 x x=0 x
Figure 3: The Bayes rule is h∗ (x) = I(x > 0) in both plots, which show the regression
function m(x) = E(Y |x) for two problems. The left plot shows an easy problem; there is
little ambiguity around the decision boundary. The right plot shows a hard problem; it is
hard to know from the data if you are to the left or right of the decision boundary.
b → 0.
So classification is easier than regression. Can it be strictly easier? Suppose that R(m)
We have that
Z
R(h) − R(h∗ ) ≤ 2 |m(x)
b b − m∗ (x)|I(h∗ (x) 6= b
h(x))dP (x)
Z
= 2 |m(x)b − m∗ (x)|I(h∗ (x) 6= b
h(x))I(m∗ (x) 6= 1/2)dP (x)
h i
= 2E |m(X)
b − m∗ (X)|I(h∗ (X) 6= bh(X))I(m∗ (X) 6= 1/2)
h i
≤ 2E |m(X)
b − m∗ (X)|I(h∗ (X) 6= h(X))I(|m∗ (X) − 1/2| ≤ , m∗ (X) 6= 1/2)
b
h i
+ 2E |m(X)
b − m∗ (X)|I(h∗ (X) 6= bh(X))I(|m∗ (X) − 1/2| > )
p
≤ 2 R(m)(a b 1/2 + b1/2 )
where
a = P (h∗ (X) 6= b
h(X), |m∗ (X) − 1/2| ≤ , m∗ (X) 6= 1/2)
and
b = P (h∗ (X) 6= b
h(X), |m∗ (X) − 1/2| > ).
Now
R(m)
b ≤ P (|m(X) − m∗ (X)| > ) ≤ →0
b
2
b
so
h) − R(h∗ )
R(b
lim p ≤ 2a1/2 .
n→∞ R(m)
b
But a → 0 as → 0 so So
h) − R(h∗ )
R(b
p → 0.
R(m)
b
6
So the LHS can be smaller than the right hand side. But how much smaller? Yang (1999)
showed that if the class of regression functions is sufficiently rich, then
which says that the minimax classification rate is the square root of the regression rate.
But, there are natural classes that fail the richness condition such as low noise classes. For
example, if P (|m∗ (X)−1/2| ≤ ) = 0 and m b satifies an exponential inequality then R(√
h)−R(h∗ )
b
R(m)
b
is exponentially small. So it really depends on the problem.
The conceptually simplest approach is empirical risk minimization (ERM) where we minimize
the training error over all linear classifiers. Let Hβ (x) = β T x (where x(1) = 1) and hβ (x) =
I(Hβ (x) > 0). Thus we define βb to be the value of β that minimizes
n
bn (β) = 1
X
R I(Yi 6= hβ (Xi )).
n i=1
R(h∗ ) ≤ R(b
h) ≤ R(
bb h) + ≤ R(h
b ∗ ) + ≤ R(h∗ ) + 2.
So we need to bound
P (sup |R(h)
b − R(h)| ≤ ).
h∈H
2
h) − R(h∗ ) > 2) ≤ 8nd+1 e−n /32 .
We conclude that P (R(b
7
The result can be improved if there are not too many data points near the decision boundary.
We’ll state a result due to Koltchinski and Panchenko (2002) that involves the margin. (See
also Kakade, Sridharan and Tewari 2009). Let us take Yi ∈ {−1, +1} so we can write
h(x) = sign(β T x). Suppose that |X(j)| ≤ A < ∞ for each j. We also restrict ourselved to
the set of linear classifiers h(x) = sign(β T x) with |β(j)| ≤ A. Define the margin-sensitive
loss
1
if u ≤ 0
u
φγ (u) = 1 − γ if 0 < u ≤ γ
0 if u > γ.
Suppose that p0 (x) = p(x|Y = 0) and p1 (x) = p(x|Y = 1) are both multivariate Gaussians:
1 1 T −1
pk (x) = exp − (x − µk ) Σk (x − µk ) , k = 0, 1.
(2π)d/2 |Σk |1/2 2
where Σ1 and Σ2 are both d × d covariance matrices. Thus, X|Y = 0 ∼ N (µ0 , Σ0 ) and
X|Y = 1 ∼ N (µ1 , Σ1 ).
Given a square matrix A, we define |A| to be the determinant of A. For a binary classification
problem with Gaussian distributions, we have the following theorem.
Theorem 4 If X|Y = 0 ∼ N (µ0 , Σ0 ) and X|Y = 1 ∼ N (µ1 , Σ1 ), then the Bayes rule is
(
1 if r1 < r0 + 2 log 1−π1 + log |Σ
2 2 π1 0|
h∗ (x) = |Σ1 | (17)
0 otherwise
where ri2 = (x − µi )T Σ−1
i (x − µi ) for i = 1, 2 is the Mahalanobis distance.
Proof. By definition, the Bayes rule is h∗ (x) = I π1 p1 (x) > (1 − π1 )p0 (x) . Plug-in the
specific forms of p0 and p1 and take the logarithms we get h∗ (x) = 1 if and only if
(x − µ1 )T Σ−1
1 (x − µ1 ) − 2 log π1 + log |Σ1 |
< (x − µ0 )T Σ−1
0 (x − µ0 ) − 2 log(1 − π1 ) + log |Σ0 | . (18)
8
The theorem immediately follows from some simple algebra.
where
1 1
δk (x) = − log |Σk | − (x − µk )T Σ−1
k (x − µk ) + log πk (20)
2 2
is called the Gaussian discriminant function. The decision boundary of the above classifier
can be characterized by the set {x ∈ X : δ1 (x) = δ0 (x)}, which is quadratic so this procedure
is called quadratic discriminant analysis (QDA).
1 X
Σ
b0 = (Xi − µb0 )(Xi − µ b0 )T , (23)
n0 − 1 i: Y =0
i
1 X
Σ
b1 = (Xi − µb1 )(Xi − µ b1 )T , (24)
n1 − 1 i: Y =1
i
P P
where n0 = i (1 − Yi ) and n1 = i Yi . (Note: we could also estimate Σ0 and Σ1 using their
maximum likelihood estimates, which replace n0 − 1 and n1 − 1 with n0 and n1 .)
where now
1
δk (x) = xT Σ−1 µk − µTk Σ−1 µk + log πk . (26)
2
Hence, the classifier is linear. The parameters are estimated as before, except that we use a
pooled estimate of the Σ:
b = (n0 − 1)Σ0 + (n1 − 1)Σ1 .
b b
Σ (27)
n0 + n1 − 2
The classification rule is (
1 if δ1 (x) > δ0 (x)
h∗ (x) = (28)
0 otherwise.
9
The decision boundary {x ∈ X : δ0 (x) = δ1 (x)} is linear so this method is called linear
discrimination analysis (LDA).
When the dimension d is large, fully specifying the QDA decision boundary requires d +
d(d − 1) parameters, and fully specifying the LDA decision boundary requires d + d(d − 1)/2
parameters. Such a large number of free parameters might induce a large variance. To further
regularize the model, two popular methods are diagonal quadratic discriminant analysis
(DQDA) and diagonal linear discriminant analysis (DLDA). The only difference between
DQDA and DLDA with QDA and LDA is that after calculating Σ b 1 and Σ
b 0 as in (24), we
set all the off-diagonal elements to be zero. This is also called the independence rule.
We now generalize to the case where Y takes on more than two values. That is, Y ∈
{0, . . . , K − 1} for K > 2. First, we characterize the Bayes classifier under this multiclass
setting.
Proof. We have
R(h) = 1 − P (h(X) = Y ) (30)
K−1
X
= 1− P (h(X) = k, Y = k) (31)
k=0
K−1
X h i
= 1− E I h(X) = k P (Y = k|X) (32)
k=0
∗
It’s clear
that h (X) = argmaxk P (Y = k|X) achieves the minimized classification error
1 − E maxk P (Y = k|X) .
Let πk = P(Y = k). The next theorem extends QDA and LDA to the multiclass setting.
1 X
Σbk = (Xi − µ
bk )(Xi − µ bk )T , (36)
nk − 1 i: Y =k
i
PK−1
b = k=0 (nk − 1)Σk .
b
Σ (37)
n−K
Example 7 Let us return to the Iris data example. Recall that there are 150 observations
made on three classes of the iris flower: Iris setosa, Iris versicolor, and Iris virginica. There
are four features: sepal length, sepal width, petal length, and petal width. In Figure 4 we
visualize the datasets. Within each class, we plot the densities for each feature. It’s easy to
see that the distributions of petal length and petal width are quite different across different
classes, which suggests that they are very informative features.
Figures 5 and 6 provide multiple figure arrays illustrating the classification of observations
based on LDA and QDA for every combination of two features. The classification boundaries
and error are obtained by simply restricting the data to these a given pair of features before
fitting the model. We see that the decision boundaries for LDA are linear, while the decision
boundaries for QDA are highly nonlinear. The training errors for LDA and QDA on this data
are both 0.02. From these figures, we see that it is very easy to discriminate the observations
of class Iris setosa from those of the other two classes.
11
7
6
5
setosa
4
3
2
1
0
7
6
variable
5
versicolor
density
Sepal.Length
4
3 Sepal.Width
2 Petal.Length
1 Petal.Width
0
7
6
5
virginica
4
3
2
1
0
2 4 6
value
Figure 4: The Iris data: The estimated densities for different features are plotted within
each class. It’s easy to see that the distributions of petal length and petal width are quite
different across different classes, which suggests that they are very informative features.
12
4.5 5.5 6.5 7.5 2.0 2.5 3.0 3.5 4.0 1 2 3 4 5 6 0.5 1.0 1.5 2.0 2.5
v v v
Error: v
0.2 v v v Error: 0.033 v vvvv Error: 0.04 v v vvv
v
7.5
7.5
vv vv v
vv e v v vvv v vv v v
e
v v eev v vv ee
v eee vvev vv e eev vvvvv
e e
e e e v vvv v
vvv
v
eee ee vv v v eee v v v
6.5
6.5
v vv eeeeee
ee e v vv vv
v v ve
e vv
e v e e vvvvvvvvv v e
e
e v e vvv vv v v vv v v
e
e
e
Sepal.Length v
e v eee eevvv e e
e e
ee eeee vvve v e eee veve e vv
v e vv
ve
e eee
vev ee
v e e s s s s ss ee e
e
ee e vv
ee
ee e v
sss e e
e e eee e
e vv v
5.5
5.5
eeee e s
s ss s s ss
ssss ee e ee ss s e e e e e
e s ss ss s s s ssssss e ee e ss s e e
e eee v sss ss s s ss s ss sss
s
sss
s e v s ss ss s s
s
ss s e
e v
s s s s s
s ss s s s sssss ss ss
4.5
4.5
s s ss s
s ssss ss s
Error: 0.2 Error: 0.047 Error: 0.033
s s s
s s s
4.0
4.0
s s s
s s s s
s s vv ssss v v sss v v
s ss s s s
s ss v s s v ss v
3.5
3.5
sss s ssss ss s
s s sss s e vv sssss e vv sss e vv
ss v v
e ss e vv s s e v v
s ss s e e vv vve v Sepal.Width ssss eee vv vvv s ee e v v v
s ss v eve v ss e eev vv v ss ee v v v v
3.0
3.0
ss sss e ee e vvev vee vv vv vv ssss eeeeevvevv v vvv v s s s eeee v e v v v v v
s ee eeeve e v s e eeeee v v s eee v
vev evvve e v v ee eeee vvv v v v eeee v vvvvv v
e ee v e vv eee ve vv e eee e v v
e ee v v e ee v v e e v v
2.5
2.5
v e eev v v
e e ee v ev v e e e vvvv
e e e ee ee
s e e e s e ee s e e
ve
e e e v e e
v
2.0
2.0
e v v e e v
Error: 0.033 vv v
Error: 0.047 v v Error: 0.04 vvv
v v v v v v
v v vvv v v v v vv v vv v vv
6
6
vv v v v v
v v v vv v v
v vvvv vvvv v v v v vv v v v v vv vv v v v v
vv v v vvv v v v v vvv
vve vv vv evv
v e vv e vvv v ee vv vv vvv v v
v
vv evvvve v e
5
5
v e vevv e vv eee e
e e eee e e e e e eee eeee v
ev
v eeee eeeeeee e ee v e e ee ee
e e e e e ee
e
e ee
e
ee
e e ee e ee
e eee
e e
eeee
e ee ee e e eee e Petal.Length e
e e e
e
4
4
e e ee e e e ee e
ee e ee e
e e e e e e
ee ee e
e e e
3
3
2
2
s s s s s s
sss ss s sss ss s s s s ss ss s s
ss ss s ss s s
ssssss
ssss sssss
s sss s sss ss ss s
s ss s ss s s s
ssss
s
ss
ss
s s s s s s s s s ss
ss
2.5
v v v v v s v vv
1
2.51
Error: 0.04 v v v v v v
Error: 0.033 Error: 0.04 v v
v v vvv v v vvv v vvvv vvv v
vv v v v v vv v
v vvv v v v vv v vvvvv v
2.0
vv v vv v v v v v vvvv vv
2.0
v vv v v vv vvv v
e
vvvvvvv v vv v v v v v ve v e
vvv vvvv v
v e v e v e
e e v e v ee ee e v
1.5
1.5
e e eev ee
vee e e v e e
e v eeee e eeeevv
e v eee e
e v eeeeee e eeee v
eee eeee e e e eeee e eeeeeee Petal.Width
e ee e eee e eeee e
e ee ee e ee
1.0
1.0
ee e ee e e eee ee eee ee
s s s
0.5
0.5
s s s
ss s s s sss s sssss
ss s ss s s s ss s ssss
s ssssssss
ssssss s s s sss
s s s s s s s s s ssss
ssss
ssss
s ss s ss s s s ss
4.5 5.5 6.5 7.5 2.0 2.5 3.0 3.5 4.0 1 2 3 4 5 6 0.5 1.0 1.5 2.0 2.5
Figure 5: Classifying the Iris data using LDA. The multiple figure array illustrates the
classification of observations based on LDA for every combination of two features. The
classification boundaries and error are obtained by simply restricting the data to a given
pair of features before fitting the model. In these plots, “s” represents the class label Iris
setosa, “e” represents the class label Iris versicolor, and “v” represents the class label Iris
virginica. The red letters illustrate the misclassified observations.
13
4.5 5.5 6.5 7.5 2.0 2.5 3.0 3.5 4.0 1 2 3 4 5 6 0.5 1.0 1.5 2.0 2.5
v v v
Error: v
0.2 v v v Error: 0.04 v vvvv Error: 0.033 v v vvv
v
7.5
7.5
vv vv v
vv e v v vvv v vv v v
e
v v eev v vv ee
v eee vvev vv e eev vvvvv
e e
e e e v vvv v
vvv
v
eee ee vv v v eee v v v
6.5
6.5
v vv eeeeee
ee e v vv vv
v v ve
e e
vvv e e vvvvvvvvv v e
e
e v e vvv vv v v vv v v
e
e
e
Sepal.Length v
e v eee eevv e e
e e
ee eeee vvve v e eee veve e vv
v v e vv
ve
e eee
vev ee
v e e s s s s ss ee e
e
ee e vv
ee
ee e v
sss e e
e e eee e
e vv v
5.5
5.5
eeee e s
s ss s s ss
ssss ee e ee ss s e e e e e
e s ss ss s s s ssssss e ee e ss s e e
e eee v sss ss s s ss s ss sss
s
sss
s e v s ss ss s s
s
ss s e
e v
s s s s s
s ss s s s sssss ss ss
4.5
4.5
s s ss s
s ssss ss s
Error: 0.2 Error: 0.047 Error: 0.047
s s s
s s s
4.0
4.0
s s s
s s s s
s s vv ssss v v sss v v
s ss s s s
s ss v s s v ss v
3.5
3.5
sss s ssss ss s
s s sss s e vv sssss e vv sss e vv
ss v v
e ss e vv s s e v v
s ss s e e vv vve v Sepal.Width ssss eee vv vvv s ee e v v v
s ss v eve v ss e eev vv v ss ee v v v v
3.0
3.0
ss sss e ee e vvev vee vv vv vv ssss eeeeevvevv v vvv v s s s eeee v e v v v v v
s ee eeeve e v s e eeeee v v s eee v
vev evvve e v v ee eeee vvv v v v eeee v vvvvv v
e ee v e vv eee ve vv e eee e v v
e ee v v e ee v v e e v v
2.5
2.5
v e eev e
v v e ee v ev v e e e vvvv
e e e ee ee
s e e e s e ee s e e
eve e e v e v
e
2.0
2.0
e v v e e v
Error: 0.04 vv v
Error: 0.047 v v Error: 0.02 vvv
v v v v v v
v v vvv v v v v vv v vv v vv
6
6
vv v v v v
v v v vv v vv v
v vvvv vvvv v v v v vv v v v v vv vv v v
vv v v vvv v v v v vvv
vve vv vv evv
v e vv e vvv v ee vv vv vvv v v
v
vv evvvve v e
5
5
v e vevv e vv eee e
e e eee e e e e e eee e
eeee v
ev
v eeee eeeeeee e ee v e e ee ee
e e e e e eeee
e e
ee
e e ee e ee
e eee
e e
eee
e ee e e e
e e e e ee ee e e e Petal.Length e e e
4
4
e e e e ee e
ee e ee e
e e e e e e
ee ee e
e e e
3
3
2
2
s s s s s s
sss ss s sss ss s s s s ss ss s s
ss ss s ss s s
ssssss
ssss sssss
s sss s sss ss ss s
s ss s ss s s s
ssss
s
ss
ss
s s s s s s s s s ss
ss
2.5
v v v v v s v vv
1
2.51
Error: 0.033 v v v v v v
Error: 0.047 Error: 0.02 v v
v v vvv v v vvv v vvvv vvv v
vv v v v v vv v
v vvv v v v vv v vvvvv v
2.0
vv v vv v v v v v vvvv vv
2.0
v vv v v vv vvv v
e
vvvvvvv v vv v v v v v ve v vvv vvvv v
e
v e v e v e
e e v e v ee ee e v
1.5
1.5
e e eev ee
vee e e v e e
e v eeee e eeeevv
e v eee e
e v eeeeee e eeee v
eee eeee e e e eeee e eeeeeee Petal.Width
e ee e eee e eeee e
e ee ee e ee
1.0
1.0
ee e ee e e eee ee eee ee
s s s
0.5
0.5
s s s
ss s s s sss s sssss
ss s ss s s s ss s ssss
s ssssssss
ssssss s s s sss
s s s s s s s s s ssss
ssss
ssss
s ss s ss s s s ss
4.5 5.5 6.5 7.5 2.0 2.5 3.0 3.5 4.0 1 2 3 4 5 6 0.5 1.0 1.5 2.0 2.5
Figure 6: Classifying the Iris data using QDA. The multiple figure array illustrates the
classification of observations based on QDA for every combination of two features. The
classification boundaries are displayed and the classification error by simply casting the data
onto these two features are calculated. In these plots, “s” represents the class label Iris
setosa, “e” represents the class label Iris versicolor, and “v” represents the class label Iris
virginica. The red letters illustrate the misclassified observations.
14
4 Fisher Linear Discriminant Analysis
There is another version of linear discriminant analysis due to Fisher (1936). The idea is to
first reduce the covariates to one dimension by projecting the data onto a line. Algebraically,
T T
this
Pd means replacing the covariate X = (X1 , . . . , Xd ) with a linearT combination U = w X =
j=1 wj Xj . The goal is to choose the vector w = (w1 , . . . , wd ) that “best separates the
data into two groups.” Then we perform classification with the one-dimensional covariate U
instead of X.
What do we mean by “best separates the data into two groups”? Formally, we would like
the two groups to have means that as far apart as possible relative to their spread. Let
µj denote the mean of X for Y = j, j = 0, 1. And let Σ be the covariance matrix of X.
Then, for j = 0, 1, E(U |Y = j) = E(wT X|Y = j) = wT µj and Var(U ) = wT Σw. Define the
separation by
2
E(U |Y = 0) − E(U |Y = 1)
J(w) =
wT Σw
(wT µ0 − wT µ1 )2
=
wT Σw
w (µ0 − µ1 )(µ0 − µ1 )T w
T
= .
wT Σw
J is sometimes called the Rayleigh coefficient. Our goal is to find w that maximizes J(w).
Since J(w)Pinvolves unknown population quantities Σ0 , Σ1 , µ0 , µ1 , we estimate J as follows.
n
Let nj = i=1 I(Yi = j) be the number of observations in class j, let µ bj be the sample mean
vector of the X’s for class j, and let Σj be the sample covariance matrix for all observations
in class j. Define
w T SB w
J(w) = T
b (38)
w SW w
where
µ0 − µ
SB = (b b1 )T ,
µ0 − µ
b1 )(b
(n0 − 1)S0 + (n1 − 1)S1
SW = .
(n0 − 1) + (n1 − 1)
15
−1
eigenvalue, the maximizer w
b should be the eigenvector of SW SB corresponding to the largest
µ0 − µ
eigenvalue. The key observation is that SB = (b b1 )T , which implies that for any
µ0 − µ
b1 )(b
b0 − µ
vector w, SB w must be in the direction of µ b1 . The desired result immediately follows.
We call
−1
bT x = (b
f (x) = w b1 )T SW
µ0 − µ x (40)
the Fisher linear discriminant function. Given a cutting threshold cm ∈ R, Fisher’s classifi-
cation rule is
b T x ≥ cm
0 if w
h(x) = (41)
bT x < cm .
1 if w
Fisher’s rule is the same as the Gaussian LDA rule in (26) when
1 T −1 π
b0
cm = (b µ0 − µ
b1 ) SW (b b1 ) − log
µ0 + µ . (42)
2 π
b1
5 Logistic Regression
One approach to binary classification is to estimate the regression function m(x) = E(Y |X =
x) = P(Y = 1|X = x) and, once we have an estimate m(x), b use the classification rule
> 12
1 if m(x)
h(x) = (43)
b b
0 otherwise.
For binary classification problems, one possible choice is the linear regression model
d
X
Y = m(X) + = β0 + βj Xj + . (44)
j=1
The linear regression model does not explicitly constrain Y to take on binary values. A
more natural alternative is to use logistic regression, which is the most common binary
classification method.
Before we describe the logistic regression model, let’s recall some basic facts about binary
random variables. If Y takes values 0 and 1, we say that Y has a Bernoulli distribution with
parameter π1 = P(Y = 1). The probability mass function for Y is p(y; π1 ) = π1y (1 − π1 )1−y
for y = 0, 1. The likelihood function for π1 based on iid data Y1 , . . . , Yn is
n
Y n
Y
L(π1 ) = p(Yi ; π1 ) = π1Yi (1 − π1 )1−Yi . (45)
i=1 i=1
16
In the logistic regression model, we assume that
exp β0 + xT β
m(x) = P(Y = 1|X = x) = ≡ π1 (x, β0 , β). (46)
1 + exp β0 + xT β
In other words, given X = x, Y is Bernoulli with mean π1 (x, β0 , β). We can write the model
as
logit P(Y = 1|X = x) = β0 + xT β
(47)
where logit(a) = log(a/(1 − a)). The name “logistic regression” comes from the fact that
exp(x)/(1 + exp(x)) is called the logistic function.
Lemma 9 Both linear regression and logistic regression models have linear decision bound-
aries.
Proof. The linear decision boundary for linear regression is straightforward. The same
result for logistic regression follows from the monotonicity of the logistic function.
The maximum conditional likelihood estimators βb0 and βb cannot be found in closed form.
However, the loglikelihood function is concave and can be efficiently solve by the Newton’s
method in an iterative manner as follows.
Note that the logistic regression classifier is essentially replacing the 0-1 loss with a smooth
loss function. In other words, it uses a surrogate loss function.
For notational simplicity, we redefine (local to this section) the d-dimensional covariate xi
and parameter vector β as the following (d + 1)-dimensional vectors:
17
To maximize `(β), the (k + 1)th Newton step in the algorithm replaces the kth iterate βb(k)
by !−1
2 b(k)
(k+1) (k) ∂ `( β ) ∂`(βb(k) )
βb ← βb − . (51)
∂β∂β T ∂β
βb(k) ) 2 `(β
b(k) )
The gradient ∂s ∂`(∂β and Hessian ∂s ∂∂β∂β T are both evaluated at βb(k) and can be written
as n
∂`(βb(k) ) X ∂ 2 `(βb(k) )
= (π(xi , βb(k) ) − Yi )Xi and T
= −XT WX (52)
∂β i=1
∂β∂β
(k) (k) (k)
where W = diag(w11 , w22 , . . . , wdd ) is a diagonal matrix with
(k)
wii = π(xi , βb(k) ) 1 − π(xi , βb(k) ) .
(53)
(k) T
Let π1 = π1 (x1 , βb(k) ), . . . , π1 (xn , βb(k) ) , (51) can be written as
(k)
βb(k+1) = βb(k) + (XT WX)−1 XT (y − π1 ) (54)
(k)
= (XT WX)−1 XT W Xβb(k) + W−1 (y − π ) 1 (55)
= (XT WX)−1 XT Wz (k) (56)
(k) (k) (k)
where z (k) ≡ (z1 , . . . , zn )T = XT βb(k) + W−1 (y − π1 ) with
!
(k) π1 (xi , βb(k) ) yi − π1 (xi , βb(k) )
zi = log + . (57)
1 − π1 (xi , βb(k) ) π1 (xi , βb(k) )(1 − π1 (xi , βb(k) ))
Given the current estimate βb(k) , the above Newton iteration forms a quadratic approximation
to the negative log-likelihood using Taylor expansion at βb(k) :
1
−`(β) = (z − Xβ)T W(z − Xβ) +constant. (58)
|2 {z }
`Q (β)
We then get an iterative algorithm called iteratively reweighted least squares. See Figure 7.
18
Iteratively Reweighted Least Squares Algorithm
(0) (0) (0)
Choose starting values βb(0) = (βb0 , βb1 , . . . , βbd )T and compute π1 (xi , βb(0) ) using
(0)
Equation (46), for i = 1, . . . , n with βj replaced by its initial value βbj .
b −1 .
we estimate the standard error of βbj as the jth diagonal element of I(β)
Example 10 We apply the logistic regression on the Coronary Risk-Factor Study (CORIS)
data and yields the following estimates and Wald statistics Wj for the coefficients:
19
6 Logistic Regression Versus LDA
There is a close connection between logistic regression and Gaussian LDA. Let (X, Y ) be a
pair of random variables where Y is binary and let p0 (x) = p(x|Y = 0), p1 (x) = p(x|Y = 1),
π1 = P(Y = 1). By Bayes’ theorem,
p(x|Y = 1)π1
P(Y = 1|X = x) = (61)
p(x|Y = 1)π1 + p(x|Y = 0)(1 − π1 )
If we assume that each group is Gaussian with the same covariance matrix Σ, i.e., X|Y =
0 ∼ N (µ0 , Σ) and X|Y = 1 ∼ N (µ1 , Σ), we have
P(Y = 1|X = x) π 1
log = log − (µ0 + µ1 )T Σ−1 (µ1 − µ0 ) (62)
P(Y = 0|X = x) 1−π 2
+ xT Σ−1 (µ1 − µ0 ) (63)
T
≡ α0 + α x. (64)
In logistic regression we maximize the conditional likelihood ni=1 p(Yi |Xi ) but ignore the
Q
second term p(Xi ):
Yn n
Y n
Y
p(Xi , Yi ) = p(Yi |Xi ) p(Xi ) . (66)
i=1
|i=1 {z } |i=1 {z }
logistic ignored
Since classification only requires the knowledge of p(y|x), we don’t really need to estimate
the whole joint distribution. Logistic regression leaves the marginal distribution p(x) un-
specified so it relies on less parametric assumption than LDA. This is an advantage of the
logistic regression approach over LDA. However, if the true class conditional distributions
are Gaussian, the logistic regression will be asymptotically less efficient than LDA, i.e. to
achieve a certain level of classification error, the logistic regression requires more samples.
20
7 Regularized Logistic Regression
As with linear regression, when the dimension d of the covariate is large, we cannot simply fit
a logistic model to all the variables without experiencing numerical and statistical problems.
Akin to the lasso, we will use regularized logistic regression, which includes sparse logistic
regression and ridge logistic regression.
Let `(β0 , β) be the log-likelihood defined in (49). The sparse logistic regression estimator is
an `1 -regularized conditional log-likelihood estimator
n o
βb0 , βb = argmin −`(β0 , β) + λkβk1 . (67)
β0 ,β
The algorithm for logistic ridge regression only requires a simple modification of the itera-
tively reweighed least squares algorithm and is left as an exercise.
For sparse logistic regression, an easy way to calculate βb0 and βb is to apply a `1 -regularized
Newton procedure. Similar to the Newton method for unregularized logistic regression, for
the kth iteration, we first form a quadratic approximation to the negative log-likelihood
`(β0 , β) based on the current estimates βb(k) .
n d
1X (k)
X 2
−`(β0 , β) = wii zi − β0 − βj xij +constant. (69)
2 i=1 j=1
| {z }
`Q (β0 ,β)
(k)
where wii and zi are defined in (53) and (57). Since we have a `1 -regularization term, the
updating formula for the estimate in the (k + 1)th step then becomes
X n d
(k+1) b(k+1) 1 (k)
X 2
β
b ,β = argmin wii zi − β0 − βj xij + λkβk1 . (70)
β0 ,β 2 i=1 j=1
This is a weighted lasso problem and can be solved using coordinate descent. See Figure 8.
Even though the above iterative procedure does not guarantee theoretical convergence, it
works very well in practice.
21
Sparse Logistic Regression Using Coordinate Descent
For j ∈ {1, . . . , d}
(k) P
(a) For i = 1, . . . , n, calculate rij = zi − α0 − `6=j α` xi` .
(k) (k) (k) (k)
(b) Calculate uj = ni=1 wii rij xij and vj = ni=1 wii x2ij .
P P
(k)
(k) |uj |−λ
(c) αj = sign uj (k) .
vj
+
(k+1) (k+1)
4. βb0 = α0 and βb` = α` for ` = 1, . . . d.
b using (46) with the current estimate of βb(k+1) .
5. Update the π(xi , β)’s
22
3.0
2.5
2.0
L(yH(x))
1.5
1.0
0.5
0.0
−2 −1 0 1 2
yH(x)
Figure 9: The 0-1 classification loss (blue dashed line), hinge loss (red solid line) and logistic
loss (black dotted line).
The support vector machine (SVM) classifier is a linear classifier that replaces the 0-1 loss
with a surrogate loss function. (Logistic regression uses a different surrogate.) In this
section, the outcomes are coded as −1 and +1. The loss is L(x, y, β) = I(y 6= hβ (x)) =
0-1
I(yHβ (x) < 0) with the hinge loss Lhinge yi , H(xi ) ≡ 1 − Yi H(Xi ) + instead of the logistic
loss. This is the smallest convex function that lies above the 0-1 loss. (When we discuss
nonparameteric classifiers, we will consider more general support vector machines.)
The support vector machine classifier is bh(x) = I H(x)
b > 0 where the hyperplane H(x) b =
T
β0 + β x is obtained by minimizing
b b
n
X λ
1 − Yi H(Xi ) + + kβk22
(71)
i=1
2
where λ > 0 and the factor 1/2 is only for notational convenience.
Figure 9 compares the hinge loss, 0-1 loss, and logistic loss. The advantage of the hinge
loss is that it is convex, and
it has a corner which leads to efficient computation and the
minimizer of E 1 − Y H(X) + is the Bayes rule. A disadvantage of the hinge loss is that one
can’t recover the regression function m(x) = E(Y |X = x).
The SVM classifier is often developed from a geometric perspective. Suppose first that the
data are linearly separable, that is, there exists a hyperplane that perfectly separates the
two classes. How can we find a separating hyperplane? LDA is not guaranteed to find it. A
separating hyperplane will minimize
X
− Yi H(Xi ).
i∈M
23
H(x) = β0 + β T x = 0
Figure 10: The hyperplane H(x) has the largest margin of all hyperplanes that separate the
two classes.
where M is the index set of all misclassified data points. Rosenblatt’s perceptron algorithm
takes starting values and iteratively updates the coefficients as:
β β Yi Xi
←− +ρ
β0 β0 Yi
where ρ > 0 is the learning rate. If the data are linearly separable, the perceptron algorithm is
guaranteed to converge to a separating hyperplane. However, there could be many separating
hyperplanes. Different starting values may lead to different separating hyperplanes. The
question is, which separating hyperplane is the best?
Intuitively, it seems reasonable to choose the hyperplane “furthest” from the data in the
sense that it separates the +1’s and -1’s and maximizes the distance to the closest point.
This hyperplane is called the maximum margin hyperplane. The margin is the distance from
the hyperplane to the nearest data point Points on the boundary of the margin are called
support vectors. See Figure 10. The goal, then, is to find a separating hyperplane which
maximizes the margin. After some simple algebra, we can show that (71) exactly achieves
this goal. In fact, (71) also works for data that are not linearly separable.
Given two vectors a and b, let ha, bi = aT b = j aj bj denote the inner product of a and b.
P
The following lemma provides the dual of the optimization problem in (72).
24
Lemma 11 The dual of the SVM optimization problem in (72) takes the form
n n n
nX 1 XX o
α
b = argmax αi − αi αk Yi yk hXi , xk i (74)
α∈Rn
i=1
2 i=1 k=1
n
1 X
subject to 0 ≤ α1 , . . . , αn ≤ and αi yi = 0, (75)
λ i=1
Proof. Let αi , γi ≥ 0 be the Lagrange multipliers. The Lagrangian function can be written
as n n n
1 1X X X
L(ξ, β, β0 , α, γ) = kβk22 + ξi + αi 1 − ξi − yi H(xi ) − γi ξi . (77)
2 λ i=1 i=1 i=1
The dual formulation in (74) follows by plugging (78) and (79) into (77). The primal-dual
complementary slackness condition (76) is obtained from the first equation in (80).
The dual problem (74) is easier to solve than the primal problem (71). The data points
(Xi , yi ) for which α bi > 0 are called support vectors. By (76) and (72), for all the data
points (xi , yi ) satisfying yi βb0 + βbT xi > 1, there must be α bi = 0. The solution for the dual
problem is sparse. From the first equality in (79), we see that the final estimate βb is a linear
combination only of these support vectors. Among these support vectors, if αi < 1/λ, we
call (xi , yi ) a margin point. For a margin point (xi , yi ), the last equality in (79) implies that
γi > 0, then the second equality in (80) implies ξi = 0. Moreover, using the first equality in
(80) , we get
βb0 = −Yi XiT β.
b (81)
Therefore, once βb is given, we could calculate βb0 using any margin point (Xi , yi ).
Example 12 We consider classifying two types of irises, versicolor and viriginica. There are
50 observations in each class. The covariates are ”Sepal.Length” ”Sepal.Width” ”Petal.Length”
and ”Petal.Width”. After fitting a SVM we get a 3/100 misclassification rate. The SVM
uses 33 support vectors.
25
9 Case Study I: Supernova Classification
A supernova is an exploding star. Type Ia supernovae are a special class of supernovae that
are very useful in astrophysics research. These supernovae have a characteristic light curve,
which is a plot of the luminosity of the supernova versus time. The maximum brightness
of all type Ia supernovae is approximately the same. In other words, the true (or absolute)
brightness of a type Ia supernova is known. On the other hand, the apparent (or observed)
brightness of a supernova can be measured directly. Since we know both the absolute and
apparent brightness of a type Ia supernova, we can compute its distance. Because of this,
type Ia supernovae are sometimes called standard candles. Two supernovae, one type Ia
and one non-type Ia, are illustrated in Figure 11. Astronomers also measure the redshift
of the supernova, which is essentially the speed at which the supernova is moving away
from us. The relationship between distance and redshift provides important information for
astrophysicists in studying the large scale structure of the universe.
Figure 11: Two supernova remnants from the NASA’s Chandra X-ray Observatory study.
The image in the right panel, the so-called Kepler supernova remnant, is ”Type Ia”. Such
supernovae have a very symmetric, circular remnant. This type of supernova is thought to
be caused by a thermonuclear explosion of a white dwarf, and is often used by astronomers
as a “standard candle” for measuring cosmic distances. The image in the left panel is a
different type of supernova that comes from “core collapse.” Such supernovae are distinctly
more asymmetric. (Credit: NASA/CXC/UCSC/L. Lopez et al.)
26
about 20,000 simulated supernovae. For each supernova, there are a few noisy measurements
of the flux (brightness) in four different filters. These four filters correspond to different
wavelengths. Specifically, the filters correspond to the g-band (green), r-band (red), i-band
(infrared) and z-band (blue). See Figure 12.
25
50
20
40
15
30
10
20
5
10
0
−5
60
40
30
40
20
10
20
0
Figure 12: Four filters (g, r, i, z-bands) corresponding to a type Ia supernova DES-SN000051.
For each band, a weighted regression spline fit (solid red) with the corresponding standard
error curves (dashed green) is provided. The black points with bars represent the flux values
and their estimated standard errors.
To estimate a linear classifier we need to preprocess the data to extract features. One
difficulty is that each supernova is only measured at a few irregular time points, and these
time points are not aligned. To handle this problem we use nonparametric regression to get
a smooth curve. (We used the estimated measurement errors of each flux as weights and
fitted a weighted least squares regression spline to smooth each supernova.) All four filters
27
of each supernova are then aligned according to the peak of the r-band. We also rescale so
that all the curves have the same maximum.
The goal of this study is to build linear classifiers to predict whether a supernova is type
Ia or not. For simplicity, we only use the information in the r-band. First, we align the
fitted regression spline curves of all supernovae by calibrating their maximum peaks and set
the corresponding time point to be day 0. There are altogether 19, 679 supernovae in the
dataset with 1, 367 being labeled. To get a higher signal-to-noise ratio, we throw away all
supernovae with less than 10 r-band flux measurements. We finally get a trimmed dataset
with 255 supernovae, 206 of which are type Ia and 49 of which are non-type Ia.
X1 X2 X3 X4 X5 X1 X2 X3 X4 X5
0.6 8
0.4 7
6
X1
X1
0.2 5
0.0 4
3
−0.2 2
0.6 3
0.4 2
X2
X2
0.2 1
0.0 0
−0.2
0.8 1.0
0.6 0.5
0.4 0.0
X3
X3
y
0.2 −0.5
−1.0
0.0 −1.5
−0.2
0.8
0.6 −0.5
0.4
X4
X4
−1.0
0.2
0.0 −1.5
0.8
−0.5
0.6
X5
X5
0.4 −1.0
0.2
0.0 −1.5
0.00.20.40.6 0.0 0.2 0.4 0.6 0.0 0.2 0.4 0.6 0.00.20.40.6 0.00.20.40.6 3 4 5 6 7 8 0 1 2 3 −1.5
−1.0
−0.5
0.00.51.0 −1.5−1.0−0.5 −1.8
−1.6
−1.4
−1.2
−1.0
−0.8
−0.6
x x
We use two types of features: the time-domain features and frequency-domain features. For
the time-domain features, the features are the interpolated regression spline values according
to an equally spaced time grid. In this study, the grid has length 100, ranging from day -20
to day 80. Since all the fitted regression curves have similar global shapes, the time-domain
features are expected to be highly correlated. This conjecture is confirmed by the matrix
of scatterplots of the first five features in 13. To make the features less correlated, we also
extract the frequency-domain features, which are simply the discrete cosine transformations
of the corresponding time-domain features. More specifically, given the time domain features
X1 , . . . , Xd (d = 100), Their corresponding frequency domain features X e1 , . . . , X
ed can be
28
written as
d
2X hπ 1 i
Xj =
e Xk cos k − (j − 1) for j = 1, . . . , d. (82)
d k=1 d 2
The right panel of Figure 13 illustrates the scatter matrix of the first 5 frequency-domain
features. In contrast to the time-domain features, the frequency-domain features have low
correlation.
We apply sparse logistic regression (LR), support vector machines (SVM), diagonal linear
discriminant analysis (DLDA), and diagonal quadratic discriminant analysis (DQDA) on
this dataset. For each method, we conduct 100 runs, within each run, 40% of the data are
randomly selected as training and the remaining 60% are used for testing.
Figure 14 illustrates the regularization paths of sparse logistic regression using the ime-
domain and frequency-domain features. A regularization path provides the coefficient value
of each feature over all regularization parameters. Since the time-domain features are highly
correlated, the corresponding regularization path is quite irregular. In contrast, the paths
for the frequency-domain features behave stably.
0 9 10 8 8 0 7 8 9 9 10
10
5
5
Coefficients
Coefficients
0
0
−5
−5
0 10 20 30 40 0 5 10 15 20 25
L1 Norm L1 Norm
Figure 14: The regularization paths of sparse logistic regression using the features of time-
domain and frequency-domain. The vertical axis corresponds to the values of the coefficients,
plotted as a function of their `1 -norm. The path using time-domain features are highly
irregular, while the path using frequency-domain features are more stable.
Figure 15 compares the classification performance of all these methods. The results show that
classification in the frequency domain is not helpful. The regularization paths of the SVM are
the same in both the time and frequency domains. This is expected since the discrete cosine
transformation is an orthonormal transformation, which corresponds to rotatating the data
in the feature space while preserving their Euclidean distances and inner products. It is easy
29
0.20 0.20
0.15 0.15
Errors
Errors
0.10 0.10
0.05 0.05
0.00 0.00
−3 −2 −1 0 1 2 3 −3 −2 −1 0 1 2 3
log(λ) log(λ)
0.20 0.20
0.15 0.15
Errors
Errors
0.10 0.10
0.05 0.05
0.00 0.00
−8 −6 −4 −2 0 2 4 −8 −6 −4 −2 0 2 4
log(λ) log(λ)
0.4
0.3
0.3
Errors
Errors
0.2
0.2
0.1
0.1
0.0
0.0
Figure 15: Comparison of different methods on the supernova dataset using both the time-
domain (left column) and frequency-domain features (right Column). Top four figures: mean
error curves (black: training error, red: test error) and their corresponding standard error
bars for sparse logistic regression (LR) and support vector machines (SVM). Bottom two
figures: boxplots of the training and test errors of diagonal linear discriminant analysis
(DLDA) and diagonal quadratic discriminant analysis (DQDA). For the time-domain fea-
tures, the SVM achieves the smallest test error among all methods.
30
to see that the SVM is rotation invariant. Sparse logistic regression is not rotation invariant
due to the `1 -norm regularization term. The performance of the sparse logistic regression
in the frequency domain is worse than that in the time domain. The DLDA and DQDA
are also not rotation invariant; their performances decreases significantly in the frequency
domain compared to those in the time domain. In both time and frequency domains, the
SVM outperforms all the other methods. Then follows sparse logistic regression, which is
better than DLDA and DQDA.
In this example, we classify political blogs according to whether their political leanings are
liberal or conservative. Snapshots of two political blogs are shown in Figure 16.
Figure 16: Examples of two political blogs with different orientations, one conservative and
the other liberal.
The data consist of 403 political blogs in a two-month window before the 2004 presidential
election. Among these blogs, 205 are liberal and 198 are conservative. We use bag-of-words
features, i.e., each unique word from these 403 blogs serves as a feature. For each blog,
the value of a feature is the number of occurrences of the word normalized by the total
number of words in the blog. After converting all words to lower case, we remove stop words
and only retain words with at least 10 occurrences across all the 403 blogs. This results
in 23,955 features, each of which corresponds to an English word. Such features are only a
crude representation of the text represented as an unordered collection of words, disregarding
all grammatical structure. We also extracted features that use hyperlink information. In
particular, we selected 292 out of the 403 blogs that are heavily linked to, and for each
blog i = 1, . . . , 403, its linkage information is represented as a 292-dimensional binary vector
(xi1 , . . . , xi292 )T where xij = 1 if the ith blog has a link to the jth feature blog. The total
number of covariates is then 23,955 + 292 = 24,247. Even though the link features only
constitute a small proportion, they are important for predictive accuracy.
We run the full regularization paths of sparse logistic regression and support vector machines,
31
0.6 0.6
0.5 0.5
0.4 0.4
Errors
Errors
0.3 0.3
0.2 0.2
0.1 0.1
0.0 0.0
Sparse LR SVM
0.6 0.6
0.5 0.5
0.4 0.4
Errors
Errors
0.3 0.3
0.2 0.2
0.1 0.1
0.0 0.0
1.0
0.4
0.5
0.2
Coefficients
Coefficients
0.0
0.0
−0.2
−0.5
−0.4
−1.0
0 5 10 15 20 0 5 10 15 20
L1 Norm L1 Norm
Sparse LR path (no linkage information) Sparse LR path (with linkage information)
Figure 17: Comparison of the sparse logistic regression (LR) and support vector machine
(SVM) on the political blog data. (Top four figures): The mean error curves (Black: training
error, Red: test error) and their corresponding standard error bars of the sparse LR and SVM,
with and without the linkage information. (Bottom two figures): Two typical regularization
paths of the sparse logistic LR with and without the linkage information. On this dataset, the
diagonal linear discriminant analysis (DLDA) achieves a test error 0.303 (sd = 0.07) without
the linkage information and a test error 0.159 (sd = 0.02) with the linkage information.
32
100 times each. For each run, the data are randomly partitioned into training (60%) and
testing (40%) sets. Figure 17 shows the mean error curves with their standard errors. From
Figure 17, we see that linkage information is crucial. Without the linkage features, the
smallest mean test error of the support vector machine along the regularization path is 0.247,
while that of the sparse logistic regression is 0.270. With the link features, the smallest test
error for the support vector machine becomes 0.132. Although the support vector machine
has a better mean error curve, it has much larger standard error. Two typical regularization
paths for sparse logistic regression with and without using the link features are provided
at the bottom of Figure 17. By examining these paths, we see that when the link features
are used, 11 of the first 20 selected features are link features. In this case, although thev
class conditional distribution is obviously not Gaussian, we still apply the diagonal linear
discriminant analysis (DLDA) on this dataset for a comparative study. Without the linkage
features, the DLDA has a mean test error 0.303 (sd = 0.07). With the linkage features,
DLDA has a mean test error 0.159 (sd = 0.02).
33
Nonparametric Classification
10/36-702
1 Introduction
Let us recall a few definitions and facts. The classification risk, or error rate, of h is
and the empirical error rate or training error rate based on training data (X1 , Y1 ), . . . , (Xn , Yn )
is n
1X
Rn (h) =
b I(h(Xi ) 6= Yi ). (2)
n i=1
R(h) is minimized by the Bayes’ rule
p1 (x) (1−π)
(
1 1 if >
∗ 1 if m(x) > 2 p0 (x) π
h (x) = = (3)
0 otherwise 0 otherwise.
where m(x) = P(Y = 1 | X = x), pj (x) = p(x | Y = j) and π = P(Y = 1). The excess risk of
a classifier h is R(h) − R(h∗ ).
where mj (x) = P(Y = j|X = x), πj = P(Y = j) and pj (x) = p(x|Y = j).
2 Plugin Methods
> 12
1 if m(x)
h(x) = (4)
b b
0 otherwise.
1
For example, we could use the kernel regresson estimator
Pn ||x−Xi ||
i=1 Yi K h
mb h (x) = P .
n ||x−Xi ||
i=1 K h
Howeve, the bandwidth should be optimized for classification error as described in Section
8.
Theorem 1 Let b
h be the plug-in classifier based on m.
b Then,
Z sZ
h) − R(h∗ ) ≤ 2
R(b |m(x)
b − m(x)|dP (x) ≤ 2 |m(x)
b − m(x)|2 dP (x). (5)
An immediate consequence of this theorem is that any result about nonparametric re-
gression can be turned into a result about nonparametric classification. For example, if
− m(x)|2 dP (x) = OP (n−2β/(2β+d) ) then R(b
h) − R(h∗ ) = OP (n−β/(2β+d) ). How-
R
|m(x)
b
∗
qR (5) is an upper bound and it is possible that R(h) − R(h ) is strictly smaller than
ever, b
|m(x)
b − m(x)|2 dP (x).
h(x) = argmaxj m
b b j (x)
where m
b j (x) is an estimate of P(Y = j|X = x).
We can apply nonparametric density estimation to each class to get estimators pb0 and pb1 .
Then we define
1 if ppbb01 (x)
(x)
> (1−bπ)
(
π
h(x) = (6)
b b
0 otherwise
b = n−1 ni=1 Yi . Hence, any nonparametric density estimation method yields a
P
where π
nonparametric classifier.
A simplification occurs if we assume that the covariate has independent coordinates, con-
ditioned on the class variable Y . Thus, if Xi = (Xi1 , . . . , Xid )T has dimension
Qd d and if
we assume conditional independence, then the density factors as pj (x) = `=1 pj` (x` ). In
2
this case we can estimate the one-dimensional marginals pj` (x` ) separately and then define
pbj (x) = d`=1 pbj` (x` ). This has the advantage that we never have to do more than a one-
Q
dimensional density estimate. This approach is called naive Bayes. The resulting classifier
can sometimes be very accurate even if the independence assumption is false.
It is easy to extend density based methods for multiclass problems. If Y ∈ {1, . . . , k} then
we estimate the k densities pbj (x) = p(x|Y = j) and the classifier is
h(x) = argmaxj π
b bj pbj (x)
Pn
bj = n−1
where π i=1 I(Yi = j).
4 Nearest Neighbors
The k-nearest neighbor classifier can be recast as a plugin rule. Define the regression estsi-
mator Pn
i=1 Yi I(||Xi ≤ x|| ≤ dk (x))
m(x) = P n
i=1 I(||Xi ≤ x|| ≤ dk (x))
b
where dk (x) is the distance between x and its k th -nearest neighbor. Then b
h(x) = I(m(x)
b >
1/2).
It is interesting to consider the classification error when n is large. First suppose that k = 1
and consider a fixed x. Then b h(x) is 1 if the closest Xi has label Y = 1 and b h(x) is 0 if the
closest Xi has label Y = 0. When n is large, the closest Xi is approximately equal to x. So
the probability of an error is
Define
Ln = P(Y 6= b
h(X) | Dn )
where Dn = {(X1 , Y1 ), . . . , (Xn , Yn )}. Then we have that
3
The Bayes risk can be written as R∗ = E(A) where A = min{m(X), 1 − m(X)}. Note that
A ≤ 2m(X)(1 − m(X)). Also, by direct integration, E(A(1 − A)) ≤ E(A)E(1 − A). Hence,
we have the well-known result due to Cover and Hart (1967),
Thus, for any problem with small Bayes error, k = 1 nearest neighbors should have small
error.
where
k !
X k j k−j
R(k) =E m (X)(1 − m(X)) [m(X)I(j < k/2) + (1 − m(X))I(j > k/2)] .
j
j=0
Theorem 3 (Devroye and Györfi 1985) Suppose that the distribution of X has a den-
sity and that k → ∞ and k/n → 0. For every > 0 the following is true. For all large
n,
2 2
h) − R∗ > ) ≤ e−n /(72γd )
P(R(b
where b
hn is the k-nearest neighbor classifier estimated on a sample of size n, and where γd
depends on the dimension d of X.
4
Recently, Chaudhuri and Dasgupta (2014) have obtained some very general results about
k-nn classifiers. We state one of their key results here.
for some β ≥ 0 and some C > 0. Also, suppose that m satisfies the following smoothness
condition: for all x and r > 0
|m(B) − m(x)| ≤ LP (B o )α
Blood Pressure 1
0 1
1
Blood Pressure
110
1
0
50
Age
From (5), we conclude that R(b h) − R(h∗ ) = O(n−1/(d+2) ). However, this binwidth was based
on the bias-variance tradeoff of the regression problem. For classification, b should be chosen
as described in Section 8.
Like regression trees, classification trees are partition classifiers where the partition is built
recursively. For illustration, suppose there are two covariates, X1 = age and X2 = blood
pressure. Figure 1 shows a classification tree using these variables.
The tree is used in the following way. If a subject has Age ≥ 50 then we classify him as
Y = 1. If a subject has Age < 50 then we check his blood pressure. If systolic blood pressure
is < 100 then we classify him as Y = 1, otherwise we classify him as Y = 0. Figure 2 shows
the same classifier as a partition of the covariate space.
Here is how a tree is constructed. First, suppose that y ∈ Y = {0, 1} and that there is
only a single covariate X. We choose a split point t that divides the real line into two sets
6
A1 = (−∞, t] and A2 = (t, ∞). Let rs (j) be the proportion of observations in As such that
Yi = j: Pn
I(Y = j, Xi ∈ As )
Pn i
rs (j) = i=1 (13)
i=1 I(Xi ∈ As )
for s = 1, 2 and j = 0, 1. The impurity of the split t is defined to be I(t) = 2s=1 γs where
P
1
X
γs = 1 − rs (j)2 . (14)
j=0
This particular measure of impurity is known as the Gini index. If a partition element As
contains all 0’s or all 1’s, then γs = 0. Otherwise, γs > 0. We choose the split point t to
minimize the impurity. Other indices of impurity besides the Gini index can be used, such
as entropy. The reason for using impurity rather than classification error is because impurity
is a smooth function and hence is easy to minimize.
When there are several covariates, we choose whichever covariate and split that leads to the
lowest impurity. This process is continued until some stopping criterion is met. For example,
we might stop when every partition element has fewer than n0 data points, where n0 is some
fixed number. The bottom nodes of the tree are called the leaves. Each leaf is assigned a 0
or 1 depending on whether there are more data points with Y = 0 or Y = 1 in that partition
element.
This procedure is easily generalized to the case where Y ∈ {1, . . . , K}. We define the impurity
by
Xk
γs = 1 − rs2 (j) (15)
j=1
where ri (j) is the proportion of observations in the partition element for which Y = j.
5 Minimax Results
where R(bh) = P(Y 6= bh(X)), Rn∗ is the Bayes error and the infimum is over all classifiers
constructed from the data (X1 , Y1 ), . . . , (Xn , Yn ). Recall that
sZ
h) − R(h∗ ) ≤ 2
R(b |m(x)
b − m(x)|2 dP (x)
7
Class Rate Condition
E(α) n−α/(2α+d) α > 1/2
BV n−1/3
p
MI log n/n
−α/(2α+1)
L(α, q) n α > (1/q − 1/2)+
α
Bσ,q n−α/(2α+d) α/d > 1/q − 1/2
Neural nets see text
However, with smaller classes that invoke extra assumptions, such as the Tsybakov noise
condition, there can be a dramatic difference. Here, we summarize Yang’s results under
the richness assumption. This assumption is simply that if m is in the class, then a small
hypercube containing m is also in the class. Yang’s results are summarized in Table 1.
The classes in Table 1 are the following: E(α) is th Sobolev space of order α, BV is the class
of functions of bounded variation, MI is all monotone functions, L(α, q) are α-Lipschitz (in
α
q-norm), and Bσ,q are Besov spaces. For neural nets we have the bound, for every > 0,
1+(2/d) 1+(1/d)
4+(4/d) +
1 log n 4+(2/d)
≤ Rn (P) ≤
n n
It appears that, as d → ∞, we get the dimension independent rate (log n/n)1/4 . However,
this result requires some caution since the class of distributions implicitly gets smaller as d
increases.
8
We can do a nonparametric version by letting H be in a RKHS and taking the penalty to be
||H||2K . In terms of implementation, this means replacing every instance of an inner product
hXi , Xj i with K(Xi , Xj ).
7 Boosting
Boosting refers to a class of methods that build classifiers in a greedy, iterative way. The
original boosting algorithm is called AdaBoost and is due to Freund and Schapire (1996).
See Figure 3.
The algorithm seems mysterious and there is quite a bit of controversey about why (and
when) it works. Perhaps the most compelling explanation is due to Friedman, Hastie and
Tibshirani (2000) which is the explanation we will give. However, the reader is warned
that there is not consensus on the issue. Futher discussions can be found in Bühlmann and
Hothorn (2007), Zhang and Yu (2005) and Mease and Wyner (2008). The latter paper is
followed by a spirited discussion from several authors. Our view is that boosting combines
two distinct ideas: surrogate loss functions and greedy function approximation.
In this section, we assume that Yi ∈ {−1, +1}. Many classifiers then have the form
h(x) = sign(H(x))
for some function H(x). For example, a linear classifier corresponds to H(x) = β T x. The
risk can then be written as
R(h) = P(Y 6= h(X)) = P(Y H(X) < 0) = E(L(A))
where A = Y H(X) and L(a) = I(a < 0). As a function of a, the loss L(a) is discontinuous
which makes it difficult to work with. Friedman, Hastie and Tibshirani (2000) show that
−a −yH(x)
AdaBoost corresponds to using P a surrogate loss, namely, L(a) = e = e P. Consider
finding a classifier of the form m αm hm (x) by minimizing the exponential loss i e−Yi H(Xi ) .
If we do this iteratively, adding one function
P at a time, this leads precisely to AdaBoost.
Typically, the classifiers hj in the sum m αm hm (x) are taken to be very simple classifiers
such as small classification trees.
The argument in Friedman, Hastie and Tibshirani (2000) is as follows. Consider minimizing
the expected loss J(F ) = E(e−Y F (X) ). Suppose our current estimate is F and consider
updating to an improved estimate F (x) + cf (x). Expanding around f (x) = 0,
J(F + cf ) = E(e−Y (F (X)+cf (X)) ) ≈ E(e−Y F (X) (1 − cY f (X) + c2 Y 2 f 2 (X)/2))
= E(e−Y F (X) (1 − cY f (X) + c2 /2))
since Y 2 = f 2 (X) = 1. Now consider minimizing the latter expression a fixed X = x.
If we minimize over f (x) ∈ {−1, +1} we get f (x) = 1 if Ew (y|x) > 0 and f (x) = −1 if
9
1. Input: (X1 , Y1 ), . . . , (Xn , Yn ) where Yi ∈ {−1, +1}.
3. Repeat for m = 1, . . . , M .
Pn
(a) Compute the weighted error (h) = i=1 wi I(Yi 6= h(Xi ) and find hm to
minimize (h).
(b) Let αm = (1/2) log((1 − )/).
(c) Update the weights:
wi e−αm Yi hm (Xi )
wi ←
Z
where Z is chosen so that the weights sum to 1.
Figure 3: AdaBoost
10
Ew (y|x) < 0 where Ew (y|x) = E(w(x, y)y|x)/E(w(x, y)|x) and w(x, y) = e−yF (x) . In other
words, the optimal f is simply the Bayes classifier with respect to the weights. This is exactly
the first step in AdaBoost. If we fix now fix f (x) and minimize over c we get
1 1−
c = log
2
where = Ew (I(Y 6= f (x))). Thus the updated F (x) is
Seen in this light, boosting really combines two ideas. The first is the use of surrogate loss
functions. The second is greedy function approximation.
All the nonparametric methods involve tuning parameters, for example, the number of neigh-
bors k in nearest neighbors. As with density estimation and regression, these parameters
can be chosen by a variety of cross-validation methods. Here we describe the data splitting
version of cross-validation. Suppose the data are (X1 , Y1 ), . . . , (X2n , Y2n ). Now randomly
split the data into two halves that we denote by
n o n o
∗ ∗ ∗ ∗
D = (X1 , Y1 ), . . . , (Xn , Yn ) , and E = (X1 , Y1 ), . . . , (Xn , Yn ) .
e e e e
Let b
h = argminh∈H R(h).
b
11
2
Proof. By Hoeffding’s inequality, P(|R(h)
b − R(h)| > ) ≤ 2e−2n , for each h ∈ H. By the
union bound,
2
P(max |R(h)
b − R(h)| > ) ≤ 2N e−2n = δ
h∈H
q
1
log 2N
where = 2n δ
. Hence, except on a set of probability at most δ,
h) ≤ R(
R(b bbh) + ≤ R(
bbh∗ ) + ≤ R(b
h∗ ) + 2.
p
Note that the difference between R(bh) and R(h∗ ) is O( log N/n) but in regression it was
O(log N/n) which is an interesting difference between the two settings. Under low noise
conditions, the error can be improved.
9 Example
The following data are from simulated images of gamma ray events for the Major Atmo-
spheric Gamma-ray Imaging Cherenkov Telescope (MAGIC) in the Canary Islands. The
data are from archive.ics.uci.edu/ml/datasets/MAGIC+Gamma+Telescope. The telescope
studies gamma ray bursts, active galactic nuclei and supernovae remnants. The goal is to
predict if an event is real or is background (hadronic shower). There are 11 predictors that
are numerical summaries of the images. We randomly selected 400 training points (200 pos-
itive and 200 negative) and 1000 test cases (500 positive and 500 negative). The results of
various methods are in Table 2. See Figures 4, 5, 6, 7.
For high dimensional problems we can use sparsity-based methods. The nonparametric
additive logistic model is
P
p
exp j=1 fj (Xj )
P(Y = 1 | X) ≡ p(X; f ) = P (17)
p
1 + exp j=1 f j (X j )
12
Method Test Error
Logistic regression 0.23
SVM (Gaussian Kernel) 0.20
Kernel Regression 0.24
Additive Model 0.20
Reduced Additive Model 0.20
11-NN 0.25
Trees 0.20
Table 2: Various methods on the MAGIC data. The reduced additive model is based on
using the three most significant variables from the additive model.
0.2
0.8
0.05
0.4
0.1
0.0
0.6
0.00
0.2
0.0
0.4
−0.05
−0.1
−0.1
0.0
0.2
−0.10
−0.2
−0.2
0.0
−0.2
−0.15
−0.3
−0.2
−0.3
−0.20
−0.4
−1 0 1 2 3 4 5 0 2 4 6 −1 0 1 2 3 4 −2 −1 0 1 2 −2 −1 0 1 2
0.15
0.06
0.04
0.2
0.2
0.04
0.10
0.03
0.1
0.02
0.02
0.05
0.1
0.01
0.00
0.0
0.00
0.00
−0.02
0.0
−0.1
−0.05
−0.04
−0.02
−0.10
−0.1
−0.2
−0.06
13
0.29
0.28
0.27
Test Error
0.26
0.25
0 10 20 30 40 50
xtrain.V4 < 0.411748 xtrain.V4 < −0.769513 xtrain.V10 < −0.369401 xtrain.V2 < 0.343193
xtrain.V8 < −0.912288 xtrain.V3 < −0.463854 xtrain.V3 < −1.14519 xtrain.V7 < 0.015902
0 0 0 0
xtrain.V6 < −0.274359 xtrain.V10 < 0.4797 xtrain.V2 < −0.607802 xtrain.V8 < −0.199142
0 0 1 0
1 1
14
xtrain.V9 < | !0.189962
1 0
Figure 7: Classification tree. The size of the tree was chcosen by cross-validation.
P
where f (X) = j=1p fj (Xj ). To fit this model, the local scoring algorithm runs the backfit-
ting procedure within Newton’s method. One iteratively computes the transformed response
for the current estimate fb
Yi − p(Xi ; fb)
Zi = fb(Xi ) + (19)
p(Xi ; fb)(1 − p(Xi ; fb))
and weights w(Xi ) = p(Xi ; fb)(1 − p(Xi ; fb), and carries out a weighted backfitting of (Z, X)
with weights w. The weighted smooth is given by
Sj (wRj )
Pbj = . (20)
Sj w
where Sj is a linear smoothing matrix, such as a kernel smoother. This extends iteratively
reweighted least squares to the nonparametric setting.
A sparsity penality can be incorporated, just as for sparse additive models (SpAM) for
regression. The Lagrangian is given by
p q
!
X
L(f, λ) = E log 1 + ef (X) − Y f (X) + λ
E(fj2 (Xj )) − L (21)
j=1
15
nonlinear in f , and so we linearize the gradient of the log-likelihood around fb. This yields
the linearized condition E [w(X)(f (X) − Z) | Xj ] + λvj = 0. To see this, note that
0 = E p(X; f ) − Y + p(X; f )(1 − p(X; f ))(f (X) − f (X)) | Xj + λvj
b b b b (22)
= E [w(X)(f (X) − Z) | Xj ] + λvj (23)
When E(fj2 ) 6= 0, this implies the condition
E (w | Xj ) + q λ fj (Xj ) = E(wRj | Xj ). (24)
2
E(fj )
In the finite sample case, in terms of the smoothing matrix Sj , this becomes
Sj (wRj )
fj = .q . (25)
Sj w + λ E(fj2 )
If kSj (wRj )k < λ, then fj = 0. Otherwise, this implicit, nonlinear equation for fj cannot be
solved explicitly, so one simply iterates until convergence:
Sj (wRj )
fj ← √ . (26)
Sj w + λ n /kfj k
When λ = 0, this yields the standard local scoring update (20).
Example 6 (SpAM for Spam) Here we consider an email spam classification problem,
using the logistic SpAM backfitting algorithm above. This dataset has been studied Hastie et
al (2001) using a set of 3,065 emails as a training set, and conducting hypothesis tests to
choose significant variables; there are a total of 4,601 observations with p = 57 attributes, all
numeric. The attributes measure the percentage of specific words or characters in the email,
the average and maximum run lengths of upper case letters, and the total number of such
letters.
The results of a typical run of logistic SpAM are summarized in Figure 8, using plug-in
bandwidths. A held-out set is used to tune the regularization parameter λ.
Suppose we draw B bootstrap samples and each time we construct a classifier. This gives
classifiers h1 , . . . , hB . We now classify by combining them:
(
1 if B1 j hj (x) ≥ 21
P
h(x) =
0 otherwise.
16
λ(×10−3 ) Error # zeros selected variables
5.5 0.2009 55 { 8,54}
4.5 0.1354 46 {7, 8, 9, 17, 18, 27, 53, 54, 57, 58}
√
4.0 0.1083 ( ) 20 {4, 6–10, 14–22, 26, 27, 38, 53–58}
Figure 8: (Email spam) Classification accuracies and variable selection for logistic SpAM.
This is called bagging which stands for bootstrap aggregration. The basline classifiers are
usually trees.
A variation is to choose a random subset of the predictors to split on at each stage. The
resulting classifier is called a random forests. Random forests often perform very well. Their
theoretical performance is not well understood. Some good references are:
Biau, Devroye and Lugosi. (2008). Consistency of Random Forests and Other Average
Classifiers. JMLR.
Lin and Jeon. Random Forests and Adaptive Nearest Neighbors. Journal of the American
Statistical Association, 101, p 578.
17
Wager, S. (2014). Asymptotic Theory for Random Forests. arXiv:1405.0352.
Wager, S. (2015). Uniform convergence of random forests via adaptive concentration. arXiv:1503.06388.
Now we consider the multiclass version. Suppose we have the nonparametric K-class logistic
regression model
ef` (X)
pf (Y = ` | X) = PK ` = 1, . . . , K (27)
fm (X)
m=1 e
where each function has an additive form
f` (X) = f`1 (X1 ) + f`2 (X2 ) + · · · + f`p (Xp ). (28)
In Newton’s algorithm, we minimize the quadratic approximation to the log-likelihood
h i 1 h i
L(f ) ≈ L(fb) + E (Y − pb)T (f − fb) + E (f − fb)T H(fb)(f − fb) (29)
2
where pb(X) = (pfb(Y = 1 | X), . . . , pfb(Y = K | X)), and H(fb(X)) is the Hessian
The above calculation can be reexpressed as follows, which leads a multiclass backfitting
algorithm. The difference in log-likelihoods for functions {fb` } and {f` } is, to second order,
!2
K−1 K−1 K−1
X X Y` − p` (X) X
E p` (X) fb` (X) − pk (X)fbk (X) + − f` (X) + pk (X)fk (X)
`=0 k=0
p ` (X)
k=0
(35)
18
where p` (X) = P(Y = ` | X), and Y` = δ(Y, `) are indicator variables. Minimizing over {f` }
gives coupled equations for the functions f` ; they can’t be solved independently over `.
A practical approach is to use coordinate descent, computing the function f` holding the
other functions {fk }k6=` fixed, and iterating. Assuming that fk = fbk for k 6= `, this simplifies
to
" 2 X 2 #
Y ` − p ` p k − Y k
E p` (1 − p` )2 fb` + − f` + pk p2` fb` + − f` . (36)
p` (1 − p` ) k6=`
pk p`
After some algebra, this can be seen to be the same as the usual objective function in the
binary case, where we take fb0 = 1 and fb1 arbitrary.
Now assume f` (and fb` ) has an additive form: f` (X) = pj=1 f`j (Xj ). Some further calcula-
P
tion shows that minimizing over each f`j yields the following backfitting algorithm:
" ! #
X Y ` − p `
E p` (1 − p` ) fb` − f`k + | Xj
p ` (1 − p` )
k6=j
f`j (Xj ) ← . (37)
E [p` (1 − p` ) | Xj ]
where
X Y` − p` (X)
R`j (X) = fb` (X) − f`k (Xk ) + (39)
k6=j
p` (X)(1 − p` (X))
w` (X) = p` (X)(1 − p` (X)). (40)
This is the same as in binary logistic regression. We thus have the following algorithm:
For each ` = 0, 1, . . . , K − 1
A. Initialize f` = fb`
B. Iterate until convergence:
19
For each j = 1, 2, . . . , p
D. Set fb` ← f` .
Incrementally updating the normalizing constants (step C) is important so that the proba-
bilties p` (X) = ef` (X) /Z(X) can be efficiently computed, and we avoid an O(K 2 ) algorithm.
b
20
Random Forests
One of the best known classifiers is the random forest. It is very simple and effective but
there is still a large gap between theory and practice. Basically, a random forest is an average
of tree estimators.
These notes rely heavily on Biau and Scornet (2016) as well as the other references at the
end of the notes.
−1
Pn
where Y j = nj i=1 Yi I(Xi ∈ Aj ) is the average of the Yi ’s in Aj and nj = #{Xi ∈ Aj }.
(We define Y j to be 0 if nj = 0.)
Recall from the results on regression that if m ∈ H1 (1, L) and the binwidth b of a regular
partition satisfies b n−1/(d+2) then
c
b − m||2P ≤ 2/(d+2) .
E||m (1)
n
h) − R(h∗ ) = O(n−1/(d+2) ).
We conclude that the corresponding classification risk satisfies R(b
Regression trees and classification trees (also called decision trees) are partition classifiers
where the partition is built recursively. For illustration, suppose there are two covariates,
X1 = age and X2 = blood pressure. Figure 1 shows a classification tree using these variables.
The tree is used in the following way. If a subject has Age ≥ 50 then we classify him as
Y = 1. If a subject has Age < 50 then we check his blood pressure. If systolic blood pressure
is < 100 then we classify him as Y = 1, otherwise we classify him as Y = 0. Figure 2 shows
the same classifier as a partition of the covariate space.
Here is how a tree is constructed. First, suppose that there is only a single covariate X. We
choose a split point t that divides the real line into two sets A1 = (−∞, t] and A2 = (t, ∞).
Let Y 1 be the mean of the Yi ’s in A1 and let Y 2 be the mean of the Yi ’s in A2 .
1
Age
< 50 ≥ 50
Blood Pressure 1
0 1
1
Blood Pressure
110
1
0
50
Age
2
For continuous Y (regression), the split is chosen to minimize the training error. For binary
Y (classification), the split is chosen to minimizeP2a surrogate for classification error. A
common choice is the impurity defined by I(t) = s=1 γs where
2
γs = 1 − [Y s + (1 − Y s )2 ]. (2)
This particular measure of impurity is known as the Gini index. If a partition element As
contains all 0’s or all 1’s, then γs = 0. Otherwise, γs > 0. We choose the split point t to
minimize the impurity. Other indices of impurity besides the Gini index can be used, such
as entropy. The reason for using impurity rather than classification error is because impurity
is a smooth function and hence is easy to minimize.
Now we continue recursively splitting until some stopping criterion is met. For example, we
might stop when every partition element has fewer than n0 data points, where n0 is some
fixed number. The bottom nodes of the tree are called the leaves. Each leaf has an estimate
m(x)
b which is the mean of Yi ’s in that leaf. For classification, we take b
h(x) = I(m(x)
b > 1/2).
When there are several covariates, we choose whichever covariate and split that leads to the
lowest impurity.
2 Example
The following data are from simulated images of gamma ray events for the Major Atmo-
spheric Gamma-ray Imaging Cherenkov Telescope (MAGIC) in the Canary Islands. The
data are from archive.ics.uci.edu/ml/datasets/MAGIC+Gamma+Telescope. The telescope
studies gamma ray bursts, active galactic nuclei and supernovae remnants. The goal is to
predict if an event is real or is background (hadronic shower). There are 11 predictors that
are numerical summaries of the images. We randomly selected 400 training points (200 pos-
itive and 200 negative) and 1000 test cases (500 positive and 500 negative). The results of
various methods are in Table 1. See Figures 3, 4, 5, 6.
3 Bagging
Trees are useful for their simplicity and interpretability. But the prediction error can be
reduced by combining many trees. A common approach, called bagging, is as follows.
Suppose we draw B bootstrap samples and each time we construct a classifier. This gives tree
classifiers h1 , . . . , hB . (The same idea applies to regression.) We now classify by combining
3
Method Test Error
Logistic regression 0.23
SVM (Gaussian Kernel) 0.20
Kernel Regression 0.24
Additive Model 0.20
Reduced Additive Model 0.20
11-NN 0.25
Trees 0.20
Table 1: Various methods on the MAGIC data. The reduced additive model is based on
using the three most significant variables from the additive model.
0.2
0.8
0.05
0.4
0.1
0.0
0.6
0.00
0.2
0.0
0.4
−0.05
−0.1
−0.1
0.0
0.2
−0.10
−0.2
−0.2
0.0
−0.2
−0.15
−0.3
−0.2
−0.3
−0.20
−0.4
−1 0 1 2 3 4 5 0 2 4 6 −1 0 1 2 3 4 −2 −1 0 1 2 −2 −1 0 1 2
0.15
0.06
0.04
0.2
0.2
0.04
0.10
0.03
0.1
0.02
0.02
0.05
0.1
0.01
0.00
0.0
0.00
0.00
−0.02
0.0
−0.1
−0.05
−0.04
−0.02
−0.10
−0.1
−0.2
−0.06
4
0.29
0.28
0.27
Test Error
0.26
0.25
0 10 20 30 40 50
xtrain.V4 < 0.411748 xtrain.V4 < −0.769513 xtrain.V10 < −0.369401 xtrain.V2 < 0.343193
xtrain.V8 < −0.912288 xtrain.V3 < −0.463854 xtrain.V3 < −1.14519 xtrain.V7 < 0.015902
0 0 0 0
xtrain.V6 < −0.274359 xtrain.V10 < 0.4797 xtrain.V2 < −0.607802 xtrain.V8 < −0.199142
0 0 1 0
1 1
5
xtrain.V9 < | !0.189962
1 0
Figure 6: Classification tree. The size of the tree was chosen by cross-validation.
them: (
1 if B1 j hj (x) ≥ 1
P
2
h(x) =
0 otherwise.
This is called bagging which stands for bootstrap aggregation. A variation is sub-bagging
where we use subsamples instead of bootstrap samples.
To get some intuition about why bagging is useful, consider this example from Buhlmann
and Yu (2002). Suppose that x ∈ R and consider the simple decision rule θbn = I(Y n ≤ x).
Let µ = E[Yi ] and for simplicity assume that Var(Yi ) = 1. Suppose that x is close to µ
√
relative to the sample size. We can model this by setting x ≡ xn = µ + c/ n. Then θbn
converges to I(Z ≤ c) where Z ∼ N (0, 1). So the limiting mean and variance of θbn are
∗
Φ(c) and Φ(c)(1 − Φ(c)). Now the bootstrap
√ distribution of Y (conditional on Y1 , . . . , Yn )
∗
is approximately N (Y , 1/n). That is, n(Y − Y ) ≈ N (0, 1). Let E ∗ denote the average
with respect to the bootstrap randomness. Then, if θen is the bagged estimator, we have
" !#
∗ ∗ ∗
√ ∗ √
θen = E [I(Y ≤ xn )] = E I n(Y − Y ) ≤ n(xn − Y )
√
= Φ( n(xn − Y )) + o(1) = Φ(c + Z) + o(1)
where Z ∼ N (0, 1), and we used the fact that Y ≈ N (µ, 1/n).
To summarize, θbn ≈ I(Z ≤ c) while θen ≈ Φ(c + Z) which is a smoothed version of I(Z ≤ c).
6
In other words, bagging is a smoothing operator. In particular, suppose we take c = 0.
Then θbn converges to a Bernoulli with mean 1/2 and variance 1/4. The bagged estimator
converges to Φ(Z) = Unif(0, 1) which has mean 1/2 and variance 1/12. The reduction in
variance is due to the smoothing effect of bagging.
4 Random Forests
Finally we get to random forests. These are bagged trees except that we also choose random
subsets of features for each tree. The estimator can be written as
1 X
m(x)
b = m
b j (x)
M j
where mb j is a tree estimator based on a subsample (or bootstrap) of size a using p randomly
selected features. The trees are usually required to have some number k of observations in
the leaves. There are three tuning parameters: a, p and k. You could also think of M as a
tuning parameter but generally we can think of M as tending to ∞.
For each tree, we can estimate the prediction error on the un-used data. (The tree is built
on a subsample.) Averaging these prediction errors gives an estimate called the out-of-bag
error estimate.
Unfortunately, it is very difficult to develop theory for random forests since the splitting
is done using greedy methods. Much of the theoretical analysis is done using simplified
versions of random forests. For example, the centered forest is defined as follows. Suppose
the data are on [0, 1]d . Choose a random feature, split in the center. Repeat until there are
k leaves. This defines one tree. Now we average M such trees. Breiman (2004) and Biau
(2002) proved the following.
Theorem 1 If each feature is selected with probability 1/d, k = o(n) and k → ∞ then
E[|m(X)
b − m(X)|2 ] → 0
as n → ∞.
Theorem 2 Suppose that m is Lipschitz and that m only depends on a subset S of the
features and that the probability of selecting j ∈ S is (1/S)(1 + o(1)). Then
3
4|S| log
2 1 2+3
E|m(X)
b − m(X)| = O .
n
7
This is better than the usual Lipschitz rate n−2/(d+2) if |S| ≤ p/2. But the condition that
we select relevant variables with high probability is very strong and proving that this holds
is a research problem.
A significant step forward was made by Scornet, Biau and Vert (2015). Here is their result.
Again, the theorem has strong assumptions but it does allow a greedy split selection. Scornet,
Biau and Vert (2015) provide another interesting result. Suppose that (i) there is a subset
S of relevant features, (ii) p = d, (iii) mj is not constant on any interval for j ∈ S. Then
with high probability, we always split only on relevant variables.
Lin and Jeon (2006) showed that there is a connection between random forests and k-NN
methods. We say that Xi is a layered nearest neighbor (LNN) of x If the hyper-rectangle
defined by x and Xi contains no data points except Xi . Now note that if tree is grown until
each leaf has one point, then m(x)
b is simply a weighted average of LNN’s. More generally,
Lin and Jeon (2006) call Xi a k-potential nearest neighbor k − P N N if there are fewer than
k samples in the the hyper-rectangle defined by x and Xi . If we restrict to random forests
whose leaves have k points then it follows easily that m(x)
b is some weighted average of the
k − P N N ’s.
Let us know return to LNN’s. Let Ln (x) denote all LNN’s of x and let Ln (x) = |Ln (x)|. We
could directly define
1 X
m(x)
b = Yi I(Xi ∈ Ln (x)).
Ln (x) i
Biau and Devroye (2010) showed that, if X has a continuous density,
(d − 1)!E[Ln (x)]
→ 1.
2d (log n)d−1
Moreover, if Y is bounded and m is continuous then, for all p ≥ 1,
b n (X) − m(X)|p → 0
E|m
8
as n → ∞. Unfortunately, the rate of convergence is slow. Suppose that Var(Y |X = x) = σ 2
is constant. Then
σ2 σ 2 (d − 1)!
b n (X) − m(X)|p ≥
E|m ∼ d .
E[Ln (x)] 2 (log n)d−1
If we use k-PNN, with k → ∞ and k = o(n), then the results Lin and Jeon (2006) show that
the estimator is consistent and has variance of order O(1/k(log n)d−1 ).
As an aside, Biau and Devroye (2010) also show that if we apply bagging to the usual 1-NN
rule to subsamples of size k and then average over subsamples, then, if k → ∞ and k = o(n),
then for all p ≥ 1 and all distributions P , we have that E|m(X)
b − m(X)|p → 0. So bagged
1-NN is universally consistent. But at this point, we have wondered quite far from random
forests.
There is also a connection between random forests and kernel methods (Scornet 2016). Let
Aj (x) be the cell containing x in the j th tree. Then we can write the tree estimator as
1 X X Yi I(Xi ∈ Aj (x)) 1 XX
m(x)
b = = Wij Yj
M j i Nj (x) M j i
where Nj (x) is the number of data points in Aj (x) and Wij = I(Xi ∈ Aj (x))/Nj (x). This
suggests that a cell Aj with low density (and hence small Nj (x)) has a high weight. Based
on this observation, Scornet (2016) defined kernel based random forest (KeRF) by
P P
j i Yi I(Xi ∈ Aj (x))
m(x)
b = P .
j Nj (x)
where
1 X
Kn (x, z) = I(x ∈ Aj (x)).
M j
The trees are random. So let us write the j th tree as Tj = T (Θj ) for some random quantity
Θj . So the forests is built from T (Θ1 ), . . . , T (ΘM ). And we can write Aj (x) as A(x, Θj ).
Then Kn (x, z) converges almost surely (as M → ∞) to κn (x, z) = PΘ (z ∈ A(x, Θ)) which is
9
just the probability that x and z are connected, in the sense that they are in the same cell.
Under some assumptions, Scornet (2016) showed that KeRF’s and forests are close to each
other, thus providing a kernel interpretation of forests.
Recall the centered forest we discussed earlier. This is a stylized forest — quite different
from the forests used in practice — but they provide a nice way to study the properties
of the forest. In the case of KeRF’s, Scornet (2016) shows that if m(x) is Lipschitz and
X ∼ Unif([0, 1]d ) then
3+d1log 2
2 12
E[(m(x)
b − m(x)) ] ≤ C(log n) .
n
This is slower than the minimax rate n−2/(d+2) but this probably reflects the difficulty in
analyzing forests.
7 Variable Importance
Let m
b be a random forest estimator. How important is feature X(j)?
LOCO. One way to answer this question is to fit the forest with all the data and fit it
again without using X(j). When we construct a forest, we randomly select features for each
tree. This second forest can be obtained by simply average the trees where feature j was
b (−j) . Let H be a hold-out sample of size m. Then let
not selected. Call this m
bj = 1
X
∆ Wi
m i∈H
where
b (−j) (Xi ))2 − (Yi − m(X
Wi = (Yi − m b i ))2 .
Then ∆j is a consistent estimate of the prediction risk inflation that occurs by not having
access to X(j). Formally, if T denotes the training data then,
" #
E[∆ b (−j) (X))2 − (Y − m(X))
b j |T ] = E (Y − m b 2
T ≡ ∆j .
In fact, since ∆
b j is simply an average, we can easily construct a confidence interval. This
approach is called LOCO (Leave-Out-COvariates). Of course, it is easily extended to sets
of features. The method is explored in Lei, G’Sell, Rinaldo, Tibshirani, Wasserman (2017)
and Rinaldo, Tibshirani, Wasserman (2015).
Permutation Importance. A different approach is to permute the values of X(j) for the
out-of-bag observations, for each tree. Let Oj be the out-of-bag observations for tree j and
10
let Oj∗ be the out-of-bag observations for tree j with X(j) permuted.
bj = 1
XX
Γ Wij
M j i
where
1 X 1 X
Wij = b j (Xi ))2 −
(Yi − m b j (Xi ))2 .
(Yi − m
mj i∈O∗ mj i∈O
j j
where Xj0 has the same distribution as X except that Xj0 (j) is an independent draw from
X(j). This is a lot like LOCO but its meaning is less clear. Note that mb j is not changed
when X(j) is permuted. Gregorutti, Michel and Saint Pierre. (2013) show that, when (X, )
is Gaussian, that Var(X) = (1 − c)I + c11T and that Cov(Y, X(j)) = τ for all j then
2
τ
Γj = 2 .
1 − c + dc
It
P is not clear how this connects to the actual importance of X(j). In the case where Y =
2
j mj (X(j)) + with E[|X] = 0 and E[ |X] < ∞, they show that Γj = 2Var(mj (X(j)).
8 Inference
Using
√ the theory of infinite order U -statistics, Mentch and Hooker (2015) showed that
n(m(x)
b − E[m(x)])/σ
b converges to a Normal(0,1) and they show how to estimate σ.
Wager and Athey (2017) show asymptotic normality if we use sample splitting: part of the
data are used to build the tree and part is used to estimate the average in the leafs of the
tree. Under a number of technical conditions — including the fact that we use subsamples
of size s = nβ with β < 1 — they show that (m(x)
b − m(x))/σn (x) N (0, 1) and they show
how to estimate σn (x). Specifically,
2 X
n−1 n
bn2 (x)
σ = b j (x), Nij )2
(Cov(m
n n−s i
where the covariance is with respect to the trees in the forest and Nij = 1 of (Xi , Yi ) was in
the j th subsample and 0 otherwise.
11
9 Summary
Random forests are considered one of the best all purpose classifiers. But it is still a mystery
why they work so well. The situation is very similar to deep learning. We have seen that
there are now many interesting theoretical results about forests. But the results make strong
assumptions that create a gap between practice and theory. Furthermore, there is no theory
to say why forests outperform other methods. The gap between theory and practice is due
to the fact that forests — as actually used in practice — are complex functions of the data.
10 References
Biau, Devroye and Lugosi. (2008). Consistency of Random Forests and Other Average
Classifiers. JMLR.
Biau, Gerard, and Scornet. (2016). A random forest guided tour. Test 25.2: 197-227.
Buhlmann, P., and Yu, B. (2002). Analyzing bagging. Annals of Statistics, 927-961.
Gregorutti, Michel, and Saint Pierre. (2013). Correlation and variable importance in random
forests. arXiv:1310.5726.
Lin, Y. and Jeon, Y. (2006). Random Forests and Adaptive Nearest Neighbors. Journal of
the American Statistical Association, 101, p 578.
L. Mentch and G. Hooker. (2015). Ensemble trees and CLTs: Statistical inference for
supervised learning. Journal of Machine Learning Research.
Scornet E. Random forests and kernel methods. (2016). IEEE Transactions on Information
Theory. 62(3):1485-500.
Wager, S. (2015). Uniform convergence of random forests via adaptive concentration. arXiv:1503.06388.
Wager, S. and Athey, S. (2017). Estimation and inference of heterogeneous treatment effects
using random forests. Journal of the American Statistical Association.
12
Clustering
10/26-702 Spring 2017
In a clustering problem we aim to find groups in the data. Unlike classification, the data
are not labeled, and so clustering is an example of unsupervised learning. We will study the
following approaches:
1. k-means
2. Mixture models
3. Density-based Clustering I: Level Sets and Trees
4. Density-based Clustering II: Modes
5. Hierarchical Clustering
6. Spectral Clustering
1. Rates of convergence
2. Choosing tuning parameters
3. Variable selection
4. High Dimensional Clustering
Example 1 Figures 17 and 18 show some synthetic examples where the clusters are meant
to be intuitively clear. In Figure 17 there are two blob-like clusters. Identifying clusters like
this is easy. Figure 18 shows four clusters: a blob, two rings and a half ring. Identifying
clusters with unusual shapes like this is not quite as easy. In fact, finding clusters of this
type requires nonparametric methods.
One of the oldest approaches to clustering is to find k representative points, called prototypes
or cluster centers, and then divide the data into groups based on which prototype they are
closest to. For now, we assume that k is given. Later we discuss how to choose k.
Warning! My view is that k is a tuning parameter; it is not the number of clusters. Usually
we want to choose k to be larger than the number of clusters.
1
Let X1 , . . . , Xn ∼ P where Xi ∈ Rd . Let C = {c1 , . . . , ck } where each cj ∈ Rd . We call C a
codebook. Let ΠC [X] be the projection of X onto C:
The set Vj is known as a Voronoi cell and consists of all points closer to cj than any other
point in the codebook. See Figure 1.
The usual algorithm to minimize Rn (C) and find C b is the k-means clustering algorithm—
also known as Lloyd’s algorithm— see Figure 2. The risk Rn (C) has multiple minima. The
algorithm will only find a local minimum and the solution depends on the starting values.
A common way to choose the starting values is to select k data points at random. We will
discuss better methods for choosing starting values in Section 2.1.
Example 2 Figure 3 shows synthetic data inspired by the Mickey Mouse example from
http: // en. wikipedia. org/ wiki/ K-means_ clustering . The data in the top left plot
form three clearly defined clusters. k-means easily finds in the clusters (top right). The
bottom shows the same example except that we now make the groups very unbalanced. The
lack of balance causes k-means to produce a poor clustering. But note that, if we “overfit
then merge” then there is no problem.
2
●
●
●
●
●
●
●
Figure 1: The Voronoi tesselation formed by 10 cluster centers c1 , . . . , c10 . The cluster centers
are indicated by dots. The corresponding Voronoi cells T1 , . . . , T10 are defined as follows: a
point x is in Tj if x is closer to cj than ci for i 6= j.
1 X
cj ←− Xi .
nj i: Xi ∈Cj
3
●●●●●
● ●
●●● ●● ●●
● ●
●●●●
●●● ●
●●
● ●●●● ●●● ●● ● ●●●
●●●●
●●
●●●● ●●
●●●● ●●● ●●●
●●
●
● ●● ●● ● ●●●
●●●● ●
●●●
●
● ●
●●
●●●
●●●●
●
●●
●● ●● ●● ●
● ●
●●●●● ● ● ●
● ●●● ●● ●●●●
●●
●●
●
●●● ●
●●
●
● ●
●●●●
●
●●●●●●
●● ●●
●
●●●● ●
●
● ●● ●●●●
● ●●●●●
● ●●●● ●●
●●
● ●
●●●●●
● ●●●
●● ●●
● ● ●●●
●
● ●● ●●●●● ●●
●● ●●
●● ●
●●●
●
●●●●●●●●●
●● ●●● ●●●●
●
●
●
● ●
●●
●●●●● ●
●●●●
● ● ● ●●
●●●●● ●● ●
●
●● ●●
● ●●● ●●● ●
●●
●
●
●● ●●●●● ● ●●●● ●●● ●
●●● ● ● ●●●
●
● ●●●
● ●
●●●●●●
●●● ●●●●●●● ●
●●
●
●●● ● ●
●
● ●●●
●● ●●●●●
● ●●
●
● ●
● ● ●
●● ●●
●● ●
●●● ● ●● ●● ● ●● ● ●●
●●●● ●●
● ●●● ●
●●● ●●●●● ●
●●●●●● ●●
●●
● ●●● ●●●
●●
● ● ● ● ●● ● ● ● ●
● ● ●● ●●●●● ●
●●●●● ● ●
● ●●
●● ●
●
●
●●
●●
●● ●● ●●●●
●● ●● ● ●●
●● ●
●●●● ● ● ●●● ●
●●
●●● ●
●●●●●
● ●
●●
●●●●●●
●
● ●●
●
●●
●●
●
●● ●●● ● ● ● ●●●
●● ●●●● ●●●●
● ● ●● ●
●● ●●●
● ●
● ● ●●●●
● ●●
●
●●●●●●●● ●●●●● ●● ●
●●●●●● ●
●●●●●● ●● ●
●
●● ● ●● ●
● ● ●●●● ●● ● ● ●●
●● ●●●●●● ●●● ●●●●● ●● ●●● ● ●●●●●
● ●
●● ●● ●●●●●
●●
●
● ●●●●●●●●●●●●●
●●
●
●
●
●● ●●● ● ●● ● ●
● ●●● ●
● ●
● ●●
● ●
●
●●
●
●● ●
●
● ● ●●
● ●
●●●●
● ●●●
● ●●
●● ●● ● ●
●●
●● ●
● ●●●● ● ● ●
● ●●●●
●●●● ●●●
●● ●
●●●●●●●●
●●●● ●●●
●●●●●●●● ●● ●
● ● ●
● ● ● ●
●● ●●●●● ●● ● ●
● ● ●
●
●
● ● ●
●●
●●
●
● ●
●
●●●●●●● ●● ● ●●
●● ●●
● ●
●●●●● ●●
● ●●● ●●●●● ● ●● ●●●● ●
●● ●●●●
● ●●●●● ●
● ●●●●● ●● ●●●●
●●
●●●● ●● ●● ●● ●●● ●● ●● ●
● ●●●●● ●●
●● ● ●
●●
● ●●
●●● ● ● ●● ●●● ●● ● ●●●●
● ●
●
●
●●●● ● ●
●●●●●
● ●● ●●●●● ●
● ● ●
● ● ●
● ●
●●●●●
● ● ●●●●● ●
● ● ●
●●
●●
●●
●●● ● ●
●●●● ●● ●●
●●● ●●
●●●
●
●●
●
●
●●
● ●●
●●
●●● ●
●●
●●●●
●● ●●
●●● ●●
●
●
●●
●●●
●
●●
●
●●●● ●
●●●●●
●●
●
●●●
●
● ●●●●
●
●●
●
●
●
●●
●
●●
● ●
●● ●●●
●● ●●●●● ●●●
●
●●●
●
● ●●●●
●
●●
●●
●●
●
●
● ●
●●●●●●
●●
●
●
●●
●●●
●●
●●
●●●● ●●
● ●
●
●
●
●●●
●●● ●●
●
●
●●●
● ●
●●
●
●
●● ●
●
●● ●
●●● ●●
●●
●
● ●
●●● ●
● ●●
● ●
●
●
●●●●
●●●●●
●
●
●●●
●● ●
●●
●
●
● ●● ● ● ●●
● ● ● ● ● ●● ● ● ●● ● ● ●
●
●
●●● ●● ● ●
●
●● ● ● ●
●● ●
●
●
●
●●
● ● ●●●● ●
●●
●●● ●
●
●●● ● ●
●● ●
●
●
● ●● ●
●● ●
●
●
●
●●
● ● ●●●●●
●
●
●●●
●
●
●
●● ●●●
●●
● ●●
●●●●●● ●
●
●
●● ●●●●● ●
●
●
●
●
●
●● ●●●
●●
● ● ●
●
●●●● ●
●
●
●
●● ●●
●●●
●
●
●
●●●●
● ●● ● ●
●●●
●●● ●
●●●
● ●
●●●●●●●● ● ●●●●
● ● ●●●●
●●● ●
●●●
●●●●
●●●●●● ● ●
●● ●● ● ● ●● ● ●●
● ●● ● ● ●
●●●● ●
●●●
●
●
●
●
●●● ●
● ●
●●●●
● ●●●
●
●
● ●●
●●
●●
● ●
●● ●●●●●
●●
●
●
●
●
●●●
● ●
●
●●●
● ● ● ●
● ●●
●●
●
● ●●
●●
●●
● ●●
● ●● ●●●●● ●●●
●●●
● ●● ● ●● ● ●● ●● ●●
● ●
●
● ●●● ● ●
●● ● ● ● ●● ●
●● ● ● ●● ●● ●
●●●
●●●●●
●●●●●●
●
●
●●●
● ●●●
● ●●●●● ●● ●●●
●●●●●
●
●
●●●
● ●●● ●●●
● ●
● ● ●●
●
●●●●●● ● ● ● ●●
●
●●●●●● ●
●
●● ● ● ●
●● ● ●
●●
●●●●●
● ●●
● ●●●
●
●●●●●
● ●● ●●
●●●●●
●●●
●
●
●●
●●● ●●●
● ● ●●
●●●
● ●
●
●●●●
●
●
●●●●●● ● ●
●
●
● ●
●
●●
●●
●●
● ●
●●●
●● ●● ●●● ●●●
●● ●
● ●● ● ● ●
●
● ●●● ●● ●●●●● ●
●●
● ●●●●
● ● ● ● ● ●
● ●
● ● ● ●● ● ● ● ●
● ● ●● ● ●●● ● ●●● ●●●●●● ●●●●● ●●●
●
● ●●● ●●● ●●●● ● ●●●
●● ● ● ●●● ●● ● ●●
●●
● ● ●●●●●● ●● ● ● ●● ● ●
● ● ●●
●●●●● ●
● ●
● ●● ● ●
●● ●●●
●● ●
●●● ● ●
● ●●
●● ● ●
● ●●● ●●● ● ●● ●●● ●●
●●● ● ●
● ● ●●●●● ● ● ●● ●● ● ●● ● ●
● ● ● ● ●● ●●●
●●
● ●
● ● ● ●
●●
● ● ●● ● ●● ●● ●●
●
●
●
●●● ●● ●●●● ● ●●●● ● ●●● ●●
● ●● ● ●●●● ●
●●● ●●●● ●● ●●●● ●● ●●● ●● ● ●● ●●●● ●● ●●●
● ● ●●● ● ● ●● ● ●●●
● ●●●●
●● ●● ●
●
●
● ●●● ●
●● ●
●
●
●●●●●
●●
●
●
●●
●●●●● ●● ●
●
● ●●● ● ●●
●● ● ●●●
●●●● ●●
●●●●
●●●●●●● ●● ● ● ●●●●●● ● ●
● ● ●● ● ● ●●●● ●●
●●●●●●●
● ● ●● ●●● ●● ● ●● ●
● ●●●●●●●●● ●● ●● ●●● ●● ●
● ●●●● ●●
●● ● ● ●●● ●● ● ●●●● ●● ●●
● ●
● ●●●
●● ●● ●● ●● ●● ●● ●● ● ●●● ●● ●● ●
●
●●● ●● ●● ●●●
●● ●● ●● ● ● ● ●● ●●●●● ●● ● ●● ● ●● ● ●
●● ● ●● ● ●● ● ●
●●●●● ●● ● ●● ●
●●●●
● ●●
●●● ●●
● ● ●● ●●● ● ●
● ●
●● ●
● ●● ● ●●● ●●
●●● ●●●●● ●● ● ●●●●● ●●
●●●
●
●● ●
● ●
●●● ●● ● ● ●
●●●● ●
●●●●●●
● ●● ●● ● ● ● ●●●● ●● ●● ●● ● ●●
●●● ● ●● ● ● ● ●●
●●●●● ●● ● ● ●
●
●● ● ● ● ●●●● ● ●●●●
Figure 3: Synthetic data inspired by the “Mickey Mouse” example from wikipedia. Top
left: three balanced clusters. Top right: result from running k means with k = 3. Bottom
left: three unbalanced clusters. Bottom right: result from running k means with k = 3
on the unbalanced clusters. k-means does not work well here because the clusters are very
unbalanced.
Example 5 The top left plot of Figure 6 shows a dataset with two ring-shaped clusters. The
remaining plots show the clusters obtained using k-means clustering with k = 2, 3, 4. Clearly,
k-means does not capture the right structure in this case unless we overfit then merge.
Since Rbn (C) has multiple minima, Lloyd’s algorithm is not guaranteed to minimize Rn (C).
The clustering one obtains will depend on the starting values. The simplest way to choose
starting values is to use k randomly chosen points. But this often leads to poor clustering.
4
100 150 200
50
50
0
0
0 10 30 50 70 0 10 30 50 70 0 10 30 50 70
100 150 200
50
50
0
0 10 30 50 70 0 10 30 50 70 0 10 30 50 70
100 150 200
50
50
0
0 10 30 50 70 0 10 30 50 70 0 10 30 50 70
Figure 4: The nine clusters found in the Topex data using k-means clustering with k = 9.
Each plot show the curves in that cluster together with the mean of the curves in that cluster.
5
0 20 40 60 80 100 0 20 40 60 80 100
Cluster 1 Cluster 2
0 20 40 60 80 100 0 20 40 60 80 100
Cluster 3 Cluster 4
● ●●● ● ●●●
●● ● ● ●●● ●● ● ● ●●●
●●
●● ●● ● ● ●●
●● ●● ●
● ●● ●
●● ● ●●
●● ●●
● ●●
● ●● ●
●● ● ●●
● ● ●
● ● ●● ● ●
● ●
●●
● ●
●●●
● ● ● ●●
●●
●●●
●
●
●●● ●
● ● ● ●
● ●
●
●● ●
●
●
●
●
●
●● ●
●●
●●
●
● ●● ●● ●
●
●
●
●●●
●
●
● ●●
●● ●
●
●
●● ● ● ●
●
●●
● ●
●● ●
●● ●● ● ●
●
● ●
●● ●●●
● ● ● ●●
●
● ●
●●
●●●
● ●●
●
● ● ● ●
●
●●
●●
● ●
●●●●●●●●
●● ● ● ●●
●
●
●● ●
● ●●
● ● ●
● ● ●
●● ● ●●
●
●● ●
● ●● ●●
●
●●
● ●● ●●
●
● ●●● ●
●●● ●
● ● ● ● ●●●
●●●●
●
●●
●
●●
●
●●
●
●
● ●●
●
●
●●
●●●●
●
●
● ●●
● ● ●
●●
●
●
●●● ●
●
● ●●
●
●●
● ●●
● ●
●
●● ●●
●
●●
●
● ●
●● ●●
●●
● ●●●
● ●
●
●
●●
●●● ●●
●
●
● ●
●● ●
●●●●●●●● ●
●
●
● ●
●
● ●
●
● ●●
● ●
● ●
●● ●●
●● ●● ●●
●
● ●● ●●●
●●● ●
● ● ●● ● ●●● ●
●●●
Figure 6: Top left: a dataset with two ring-shaped clusters. Top right: k-means with k = 2.
Bottom left: k-means with k = 3. Bottom right: k-means with k = 4.
6
Figure 7: An example with 9 clusters. Top left: data. Top right: k-means with random
starting values. Bottom left: k-means using starting values from hierarchical clustering.
Bottom right: the k-means++ algorithm.
Example 6 Figure 7 shows data from a distribution with nine clusters. The raw data are in
the top left plot. The top right plot shows the results of running the k-means algorithm with
k = 9 using random points as starting values. The clustering is quite poor. This is because
we have not found the global minimum of the empirical risk function. The two bottom plots
show better methods for selecting starting values that we will describe below.
Hierarchical Starting Values. Tseng and Wong (2005) suggest the following method for
choosing starting values for k-means. Run single-linkage hierarchical clustering (which we
describe in Section 6) to obtain p × k clusters. They suggest using p = 3 as a default. Now
take the centers of the k largest of the p × k clusters and use these as starting values. See
the bottom left plot in Figure 7.
k-means++ . Arthur and Vassilvitskii (2007) invented an algorithm called k-means++ to get
good starting values. They show that if the starting points are chosen in a certain way, then
we can get close to the minimum with high probability. In fact the starting points themselves
— which we call seed points — are already close to minimizing Rn (C). The algorithm is
described in Figure 8. See the bottom right plot in Figure 7 for an example.
Theorem 7 (Arthur and Vassilvitskii, 2007). Let C = {c1, . . . , ck} be the seed points from the k-means++ algorithm. Then

E(Rn(C)) ≤ 8 (log k + 2) min_C Rn(C)    (6)

where the expectation is over the randomness of the seed selection.
1. Input: Data X = {X1, . . . , Xn} and an integer k.

2. Choose c1 uniformly at random from {X1, . . . , Xn} and set C = {c1}.

3. For j = 2, . . . , k:
   (a) Compute D(Xi) = min_{c ∈ C} ||Xi − c|| for each data point Xi.
   (b) Choose cj at random from {X1, . . . , Xn}, choosing Xi with probability

       pi = D²(Xi) / Σ_{j=1}^n D²(Xj),

       and set C = C ∪ {cj}.

4. Run Lloyd's algorithm using the seed points C = {c1, . . . , ck} as starting points and output the result.

Figure 8: The k-means++ algorithm.
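To make the seeding step concrete, here is a minimal Python sketch of the k-means++ seeding; the function name, the use of numpy, and the hand-off to scikit-learn's KMeans mentioned in the comment are illustrative choices, not part of the notes.

import numpy as np

def kmeans_pp_seeds(X, k, rng=None):
    # Choose k seed points from the rows of X with k-means++ weighting.
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    centers = [X[rng.integers(n)]]          # step 2: first seed chosen uniformly
    for _ in range(1, k):
        C = np.asarray(centers)
        # D(X_i): distance from each point to its nearest seed chosen so far
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).min(axis=1)
        # step 3: sample the next seed with probability proportional to D^2(X_i)
        centers.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.asarray(centers)

# The seeds can then be handed to Lloyd's algorithm, e.g. in scikit-learn:
# KMeans(n_clusters=k, init=kmeans_pp_seeds(X, k), n_init=1).fit(X)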
See Arthur and Vassilvitskii (2007) for a proof. They also show that the Euclidean distance
can be replaced with the ℓp norm in the algorithm. The result is the same except that the
constant 8 gets replaced by 2^{p+2}. It is possible to improve the k-means++ algorithm. Ailon,
Jaiswal and Monteleoni (2009) showed that, by choosing 3 log k points instead of one point
at each step of the algorithm, the log k term in (6) can be replaced by a constant. They call
the algorithm k-means#.
2.2 Choosing k
In k-means clustering we must choose a value for k. This is still an active area of research
and there are no definitive answers. The problem is much different than choosing a tuning
parameter in regression or classification because there is no observable label to predict.
Indeed, for k-means clustering, both the true risk R and estimated risk Rn decrease to 0
as k increases. This is in contrast to classification where the true risk gets large for high
complexity classifiers even though the empirical risk decreases. Hence, minimizing risk does
not make sense. There are so many proposals for choosing tuning parameters in clustering
that we cannot possibly consider all of them here. Instead, we highlight a few methods.
Elbow Methods. One approach is to look for sharp drops in estimated risk. Let Rk denote
the minimal risk among all possible clusterings and let R̂k be the empirical risk. It is easy to
see that Rk is a nonincreasing function of k so minimizing Rk does not make sense. Instead,
we can look for the first k such that the improvement Rk − Rk+1 is small, sometimes called
an elbow. This can be done informally by looking at a plot of R̂k. We can try to make this
more formal by fixing a small number α > 0 and defining

kα = min{ k : (Rk − Rk+1)/σ² ≤ α }.    (7)
Unfortunately, the elbow method often does not work well in practice because there may not
be a well-defined elbow.
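Even so, rule (7) is easy to try. Here is a rough Python sketch; using scikit-learn's KMeans to approximate R̂k and taking σ² to be the k = 1 value of R̂k are assumptions made for the sketch, not prescriptions from the notes.

import numpy as np
from sklearn.cluster import KMeans

def elbow_k(X, alpha=0.1, kmax=20, seed=0):
    # R[k-1] approximates R_k: the minimized within-cluster risk with k clusters.
    R = [KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_ / len(X)
         for k in range(1, kmax + 2)]
    sigma2 = R[0]                      # assumed choice: total variance around the grand mean
    for k in range(1, kmax + 1):
        if (R[k - 1] - R[k]) / sigma2 <= alpha:
            return k                   # first k with a small enough improvement, as in (7)
    return kmax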
Hypothesis Testing. A more formal way to choose k is by way of hypothesis testing. For
each k we test

H0: the number of clusters is k    versus    H1: the number of clusters is greater than k.

We begin with k = 1. If the test rejects, then we repeat the test for k = 2. We continue until the
first k that is not rejected. In summary, k̂ is the first k for which the test is not rejected.
Currently, my favorite approach is the one in Liu, Hayes, Nobel and Marron (2012, JASA, 1281-1293).
They simply test if the data are multivariate Normal. If this
rejects, they split into two clusters and repeat. They have an R package, sigclust, for this.
A similar procedure, called PG-means, is described in Feng and Hamerly (2007).
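The sequential recipe itself is tiny. In the sketch below, pvalue_k is a hypothetical user-supplied function returning the p-value of a test that k clusters are adequate (for example a Normality test applied within clusters, in the spirit of sigclust); it is a placeholder, not an implementation of that package.

def choose_k_by_testing(X, pvalue_k, alpha=0.05, kmax=20):
    # Test k = 1, 2, ... and stop at the first k that is not rejected.
    for k in range(1, kmax + 1):
        if pvalue_k(X, k) > alpha:     # fail to reject: k clusters are adequate
            return k
    return kmax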
Example 8 Figure 9 shows a two-dimensional example. The top left plot shows a single
cluster. The p-values are shown as a function of k in the top right plot. The first k for which
the p-value is larger than α = .05 is k = 1. The bottom left plot shows a dataset with three
clusters. The p-values are shown as a function of k in the bottom right plot. The first k for
which the p-value is larger than α = .05 is k = 3.
Figure 9: Top left: a single cluster. Top right: p-values for various k. The first k for which
the p-value is larger than .05 is k = 1. Bottom left: three clusters. Bottom right: p-values
for various k. The first k for which the p-value is larger than .05 is k = 3.
Stability. Another class of methods is based on the idea of stability. The idea is to find
the largest number of clusters that can be estimated with low variability.
We start with a high level description of the idea and then we will discuss the details. Suppose
that Y = (Y1 , . . . , Yn ) and Z = (Z1 , . . . , Zn ) are two independent samples from P . Let Ak
be any clustering algorithm that takes the data as input and outputs k clusters. Define the
stability
Ω(k) = E [s(Ak (Y ), Ak (Z))] (9)
where s(·, ·) is some measure of the similarity of two clusterings. To estimate Ω we use
random subsampling. Suppose that the original data are X = (X1 , . . . , X2n ). Randomly
split the data into two equal sets Y and Z of size n. This process is repeated N times.
Denote the random split obtained in the jth trial by (Y^j, Z^j). Define

Ω̂(k) = (1/N) Σ_{j=1}^N s( Ak(Y^j), Ak(Z^j) ).
Now we discuss the details. First, we need to define the similarity between two clusterings.
We face two problems. The first is that the cluster labels are arbitrary: the clustering
(1, 1, 1, 2, 2, 2) is the same as the clustering (4, 4, 4, 8, 8, 8). Second, the clusterings Ak (Y )
and Ak (Z) refer to different data sets.
The first problem is easily solved. We can insist the labels take values in {1, . . . , k} and then
we can maximize the similarity over all permutations of the labels. Another way to solve
the problem is the following. Any clustering method can be regarded as a function ψ that
takes two points x and y and outputs a 0 or a 1. The interpretation is that ψ(x, y) = 1 if x
and y are in the same cluster while ψ(x, y) = 0 if x and y are in different clusters. Using
this representation of the clustering renders the particular choice of labels moot. This is the
approach we will take.
Let ψY and ψZ be clusterings derived from Y and Z. Let us think of Y as training data and
Z as test data. Now ψY returns a clustering for Y and ψZ returns a clustering for Z. We’d
like to somehow apply ψY to Z. Then we would have two clusterings for Z which we could
then compare. There is no unique way to do this. A simple and fairly general approach is
to define
ψY,Z (Zj , Zk ) = ψY (Yj0 , Yk0 ) (10)
where Yj0 is the closest point in Y to Zj and Yk0 is the closest point in Y to Zk . (More
generally, we can use Y and the cluster assignment to Y as input to a classifier; see Lange
et al 2004). The notation ψY,Z indicates that ψ is trained on Y but returns a clustering for
Z. Define

s(ψ_{Y,Z}, ψ_Z) = (1 / (n choose 2)) Σ_{s<t} I( ψ_{Y,Z}(Zs, Zt) = ψ_Z(Zs, Zt) ).
Thus s is the fraction of pairs of points in Z on which the two clusterings ψY,Z and ψZ agree.
Finally, we define

Ω̂(k) = (1/N) Σ_{j=1}^N s( ψ_{Y^j, Z^j}, ψ_{Z^j} ).
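Putting the pieces together, here is a sketch of Ω̂(k) computed by random splitting, with k-means playing the role of Ak and the nearest-neighbor rule (10) used to transfer the clustering of Y to Z. The use of scikit-learn and the default number of splits are illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

def stability(X, k, N=20, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X) // 2
    scores = []
    for _ in range(N):
        idx = rng.permutation(len(X))
        Y, Z = X[idx[:n]], X[idx[n:2 * n]]
        fitY = KMeans(n_clusters=k, n_init=10).fit(Y)            # A_k applied to Y
        labZ = KMeans(n_clusters=k, n_init=10).fit_predict(Z)    # A_k applied to Z
        # psi_{Y,Z}: each Z_j inherits the label of its nearest neighbor in Y, as in (10)
        nn = NearestNeighbors(n_neighbors=1).fit(Y)
        labYZ = fitY.labels_[nn.kneighbors(Z, return_distance=False).ravel()]
        # s: fraction of pairs on which the two clusterings of Z agree
        agreeYZ = labYZ[:, None] == labYZ[None, :]
        agreeZ = labZ[:, None] == labZ[None, :]
        iu = np.triu_indices(n, 1)
        scores.append(np.mean(agreeYZ[iu] == agreeZ[iu]))
    return np.mean(scores)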
Now we need to decide how to use Ω̂(k) to choose k. The interpretation of Ω̂(k) requires
some care. First, note that 0 ≤ Ω̂(k) ≤ 1 and Ω̂(1) = Ω̂(n) = 1. So simply maximizing Ω̂(k)
does not make sense. One possibility is to look for a small k, larger than 1, with high
stability. Alternatively, we could try to normalize Ω̂(k). Lange et al (2004) suggest dividing
by the value of Ω̂(k) obtained when cluster labels are assigned randomly. The theoretical
justification for this choice is not clear. Tibshirani, Walther, Botstein and Brown (2001)
suggest that we should compute the stability separately over each cluster and then take the
minimum. However, this can sometimes lead to very low stability for all k > 1.
Many authors have considered schemes of this form, including Breckenridge (1989), Lange,
Roth, Braun and Buhmann (2004), Ben-Hur, Elisseeff and Guyon (2002), Dudoit and
Fridlyand (2002), Levine and Domany (2001), Buhmann (2010), Tibshirani, Walther, Bot-
stein and Brown (2001) and Rinaldo and Wasserman (2009).
It is important to interpret stability correctly. These methods choose the largest number
of stable clusters. That does not mean they choose “the true k.” Indeed, Ben-David, von
Luxburg and Pál (2006), Ben-David and von Luxburg (2008) and Rakhlin (2007)
have shown that trying to use stability to choose “the true k” — even if that is well-defined
— will not work. To explain this point further, we consider some examples from Ben-David,
von Luxburg and Pál (2006). Figure 10 shows the four examples. The first example (top
left plot) shows a case where we fit k = 2 clusters. Here, stability analysis will correctly
show that k is too small. The top right plot has k = 3. Stability analysis will correctly show
that k is too large. The bottom two plots show potential failures of stability analysis. Both
cases are stable but k = 2 is too small in the bottom left plot and k = 3 is too big in the
bottom right plot. Stability is subtle. There is much potential for this approach but more
work needs to be done.
A theoretical property of the k-means method is given in the following result. Recall that
C ∗ = {c∗1 , . . . , c∗k } minimizes R(C) = E||X − ΠC [X] ||2 .
Figure 10: Examples from Ben-David, von Luxburg and Pál (2006). The first example (top
left plot) shows a case where we fit k = 2 clusters. Stability analysis will correctly show that
k is too small. The top right plot has k = 3. Stability analysis will correctly show that k
is too large. The bottom two plots show potential failures of stability analysis. Both cases
are stable but k = 2 is too small in the bottom left plot and k = 3 is too big in the bottom
right plot.
Theorem 9 Suppose that P(||X|| ≤ B) = 1 for some B < ∞. Then

E(R(Ĉ)) − R(C*) ≤ c sqrt( k(d + 1) log n / n )

where c depends only on B.

This proof is due to Linder, Lugosi and Zeger (1994). The proof uses techniques from a later
lecture on VC theory so you may want to return to the proof later.

Proof. Note that R(Ĉ) − R(C*) ≤ 2 sup_{C ∈ Ck} |Rn(C) − R(C)|. Since E(Y) = ∫_0^∞ P(Y ≥ t) dt whenever Y ≥ 0, we have

2 sup_{C ∈ Ck} |Rn(C) − R(C)| = 2 sup_{C ∈ Ck} | (1/n) Σ_{i=1}^n fC(Xi) − E(fC(X)) |
    = 2 sup_C | ∫_0^∞ ( (1/n) Σ_{i=1}^n I(fC(Xi) > u) − P(fC(Z) > u) ) du |
    ≤ 8B sup_{C,u} | (1/n) Σ_{i=1}^n I(fC(Xi) > u) − P(fC(Z) > u) |
    = 8B sup_A | (1/n) Σ_{i=1}^n I(Xi ∈ A) − P(A) |

where A varies over all sets A of the form {fC(x) > u}. The shattering number of A is
s(A, n) ≤ n^{k(d+1)}. This follows since each set {fC(x) > u} is a union of the complements of
k spheres. By the VC Theorem,

P(R(Ĉ) − R(C*) > ε) ≤ P( 8B sup_A | (1/n) Σ_{i=1}^n I(Xi ∈ A) − P(A) | > ε )
    = P( sup_A | (1/n) Σ_{i=1}^n I(Xi ∈ A) − P(A) | > ε/(8B) )
    ≤ 4 (2n)^{k(d+1)} e^{−n ε² / (512 B²)}.

Now conclude that E(R(Ĉ)) − R(C*) ≤ c sqrt( k(d + 1) log n / n ).
Theorem 10 (Bartlett, Linder and Lugosi 1997) Suppose that P(||X||² ≤ 1) = 1 and
that n ≥ k^{4/d}, dk^{1−2/d} log n ≥ 15, kd ≥ 8, n ≥ 8d and n/log n ≥ dk^{1+2/d}. Then,

E(R(Ĉ)) − R(C*) ≤ 32 sqrt( dk^{1−2/d} log n / n ) = O( sqrt( dk log n / n ) ).
See Bartlett, Linder and Lugosi (1997) for a proof. It follows that k-means is risk consistent
in the sense that R(Ĉ) − R(C*) →P 0, as long as k = o(n/(d³ log n)). Moreover, the lower
bound implies that we cannot find any other method that improves much over the k-means
approach, at least with respect to this loss function.
The k-means algorithm can be generalized in many ways. For example, if we replace the L2
norm with the L1 norm we get k-medians clustering. We will not discuss these extensions
here.
The best way to use k-means clustering is to “overfit then merge.” Don’t think of the k in
k-means as the number of clusters. Think of it as a tuning parameter. k-means clustering
works much better if we:
1. Choose k large
2. Merge close clusters.
This eliminates the sensitivity to the choice of k and it allows k-means to fit clusters with
arbitrary shapes. Currently, there is no definitive theory for this approach but in my view,
it is the right way to do k-means clustering.
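A minimal sketch of the overfit-then-merge strategy: fit k-means with a deliberately large k and then merge centers by single linkage with a distance cutoff. The particular merging rule and the cutoff value are illustrative choices, not part of the notes.

import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster

def overfit_then_merge(X, k_big=50, merge_dist=0.5):
    km = KMeans(n_clusters=k_big, n_init=10).fit(X)
    # single-linkage clustering of the k_big centers, cut at distance merge_dist
    groups = fcluster(linkage(km.cluster_centers_, method="single"),
                      merge_dist, criterion="distance")
    # each observation gets the merged label of its k-means center
    return groups[km.labels_]

Because the final clusters are unions of many small k-means cells, they can take arbitrary shapes.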
3 Mixture Models
Simple cluster structure can be discovered using mixture models. We start with a simple
example. We flip a coin with success probability π. If heads, we draw X from a density
p1(x). If tails, we draw X from a density p0(x). Then the density of X is

p(x) = π p1(x) + (1 − π) p0(x),    (11)

which is called a mixture of two densities p1 and p0. Figure 11 shows a mixture of two
Gaussians.
Let Z ∼ Bernoulli(π) be the unobserved coin flip. Then we can also write p(x) as
p(x) = Σ_{z=0,1} p(x, z) = Σ_{z=0,1} p(x|z) p(z)    (12)

where p(x|Z = 0) := p0(x), p(x|Z = 1) := p1(x) and p(z) = π^z (1 − π)^{1−z}. Equation (12) is
called the hidden variable representation. A more formal definition of finite mixture models
is as follows.
(Finite Mixture Models) Let {pθ(x) : θ ∈ Θ} be a parametric class of densities. Define the
mixture model

pψ(x) = Σ_{j=0}^{K−1} πj pθj(x),

where the mixing coefficients πj ≥ 0, Σ_{j=0}^{K−1} πj = 1 and ψ = (π0, . . . , πK−1, θ0, . . . , θK−1) are
the unknown parameters. We call pθ0, . . . , pθK−1 the component densities.
Generally, even if {pθ (x) : θ ∈ Θ} is an exponential family model, the mixture may no
longer be an exponential family.
Let φ(x; μj, σj²) be the probability density function of a univariate Gaussian distribution with
mean μj and variance σj². A typical finite mixture model is the mixture of Gaussians. In
one dimension, we have

pψ(x) = Σ_{j=0}^{K−1} πj φ(x; μj, σj²),

which has 3K − 1 unknown parameters, due to the restriction Σ_{j=0}^{K−1} πj = 1.
A finite mixture model pψ(x) has parameters ψ = (π0, . . . , πK−1, θ0, . . . , θK−1). The likelihood
of ψ based on the observations X1, . . . , Xn is

L(ψ) = Π_{i=1}^n pψ(Xi) = Π_{i=1}^n ( Σ_{j=0}^{K−1} πj pθj(Xi) )
Figure 11: A mixture of two Gaussians, p(x) = (2/5) φ(x; −1.25, 1) + (3/5) φ(x; 2.95, 1).
and, as usual, the maximum likelihood estimator is the value ψ̂ that maximizes L(ψ). Usually,
the likelihood is multimodal and one settles for a local maximum instead of a global maximum.
For fixed θ0 , . . . , θK−1 , the log-likelihood is often a concave function of the mixing parameters
πj . However, for fixed π0 , . . . , πK−1 , the log-likelihood is not generally concave with respect
to θ0 , . . . , θK−1 .
One way to find ψ̂ is to apply your favorite optimizer directly to the log-likelihood

ℓ(ψ) = Σ_{i=1}^n log( Σ_{j=0}^{K−1} πj pθj(Xi) ).
However, `(ψ) is not jointly convex with respect to ψ. It is not clear which algorithm is the
best to optimize such a nonconvex objective function.
A convenient and commonly used algorithm for finding the maximum likelihood estimates of
a mixture model (or the more general latent variable models) is the expectation-maximization
(EM) algorithm. The algorithm runs in an iterative fashion and alternates between the
“E-step” which computes conditional expectations with respect to the current parameter
estimate, and the “M-step” which adjusts the parameter to maximize a lower bound on
the likelihood. While the algorithm can be slow to converge, its simplicity and the fact
that it doesn’t require a choice of step size make it a convenient choice for many estimation
problems.
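To fix ideas, here is a bare-bones EM iteration for a univariate mixture of K Gaussians. It shows only the E and M steps; a real implementation would add a convergence check, multiple restarts and numerical safeguards.

import numpy as np
from scipy.stats import norm

def em_gauss_mix(x, K, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)                       # mixing weights
    mu = rng.choice(x, K, replace=False)           # crude starting means
    sigma = np.full(K, np.std(x))
    for _ in range(iters):
        # E-step: responsibilities gamma[i, j] = P(Z_i = j | x_i, current parameters)
        dens = pi * norm.pdf(x[:, None], mu, sigma)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted maximum likelihood updates
        Nj = gamma.sum(axis=0)
        pi = Nj / len(x)
        mu = (gamma * x[:, None]).sum(axis=0) / Nj
        sigma = np.sqrt((gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nj)
    return pi, mu, sigma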
On the other hand, while simple and flexible, the EM algorithm is only one of many numerical
procedures for obtaining a (local) maximum likelihood estimate in latent variable models.
In some cases procedures such as Newton's method or conjugate gradient may be more
effective, and should be considered as alternatives to EM. In general the EM algorithm
converges linearly, and may be extremely slow when the amount of missing information is
large.
In principle, there are polynomial time algorithms for finding good estimates of ψ based on
spectral methods and the method of moments. It appears that, at least so far, these methods
are not yet practical enough to be used in routine data analysis.
Example. The data are measurements on duration and waiting time of eruptions of the
Old Faithful geyser from August 1 to August 15, 1985. There are two variables with 299 ob-
servations. The first variable, "duration", represents the eruption time in minutes.
The second variable, "waiting", represents the waiting time to the next eruption. These data are
believed to have two modes. We fit a mixture of two Gaussians using the EM algorithm. To
illustrate the EM step, we purposely choose a bad starting point. The EM algorithm quickly
converges in six steps. Figure 12 illustrates the fitted densities for all the six steps. We see
that even though the starting density is unimodal, it quickly becomes bimodal.
Figure 12: Fitting a mixture of two Gaussians on the Old Faithful Geyser data. The initial
values are π0 = π1 = 0.5, μ0 = (4, 70)^T, μ1 = (3, 60)^T, and Σ1 = Σ2 = [0.8 7; 7 70]. We see that
even though the starting density is not bimodal, the EM algorithm converges quickly to a
bimodal density.
3.3 The Twilight Zone
Mixture models are conceptually simple but they have some strange properties.

Consider a mixture of two Gaussians with means μ1, μ2 and common variance σ². You would
expect p(x) to be multimodal but this is not necessarily true: the density p(x)
is unimodal when |μ1 − μ2| ≤ 2σ and bimodal when |μ1 − μ2| > 2σ. One might expect that
the maximum number of modes of a mixture of k Gaussians would be k. However, there are
examples where a mixture of k Gaussians has more than k modes. In fact, Edelsbrunner,
Fasy and Rote (2012) show that the relationship between the number of modes of p and the
number of components in the mixture is very complex.
A model {pθ(x) : θ ∈ Θ} is identifiable if θ1 ≠ θ2 implies Pθ1 ≠ Pθ2,
where Pθ is the distribution corresponding to the density pθ. Mixture models are nonidentifiable
in two different ways. First, there is nonidentifiability due to permutation of labels.
For example, consider a mixture of two univariate Gaussians,

pψ1(x) = 0.3 φ(x; 0, 1) + 0.7 φ(x; 2, 1)

and

pψ2(x) = 0.7 φ(x; 2, 1) + 0.3 φ(x; 0, 1).

Then pψ1(x) = pψ2(x) even though ψ1 = (0.3, 0.7, 0, 2, 1)^T ≠ (0.7, 0.3, 2, 0, 1)^T = ψ2. This is
not a serious problem although it does contribute to the multimodality of the likelihood.
A second type of nonidentifiability is more serious. Consider the model

p(x; π, μ1, μ2) = π φ(x; μ1, 1) + (1 − π) φ(x; μ2, 1).    (13)

When μ1 = μ2 = μ, we see that p(x; π, μ1, μ2) = φ(x; μ, 1). The parameter π has disappeared.
Similarly, when π = 1, the parameter μ2 disappears. This means that there are subspaces of
the parameter space where the family is not identifiable. This local nonidentifiability causes
many of the usual theoretical properties— such as asymptotic Normality of the maximum
likelihood estimator and the limiting χ2 behavior of the likelihood ratio test— to break
down. For the model (13), there is no simple theory to describe the distribution of the
likelihood ratio test for H0 : µ1 = µ2 versus H1 : µ1 6= µ2 . The best available theory is
very complicated. However, some progress has been made lately using ideas from algebraic
geometry (Yamazaki and Watanabe 2003, Watanabe 2010).
The lack of local identifiability causes other problems too. For example, we usually have that
the Fisher information is non-zero and that θ̂ − θ = OP(n^{−1/2}) where θ̂ is the maximum
likelihood estimator. Mixture models are, in general, irregular: they do not satisfy the usual
regularity conditions that make parametric models so easy to deal with. Here is an example
from Chen (1995).
Suppose that µ1 < µ2 . We can classify an observation as being from cluster 1 or cluster 2
by computing the probability of being from the first or second component, denoted Z = 0
and Z = 1. We get
P(Z = 0|X = x) = (1 − π) φ(x; μ1, σ1²) / [ (1 − π) φ(x; μ1, σ1²) + π φ(x; μ2, σ2²) ].
Define Z(x) = 0 if P(Z = 0|X = x) > 1/2 and Z(x) = 1 otherwise. Figure 13 shows Z(x) in
the case where σ1 is much larger than σ2. We end up classifying all the observations with large
Xi to the leftmost component. Technically this is correct, yet it seems to be an unintended
consequence of the model and does not capture what we mean by a cluster.
Figure 13: Mixtures are used as a parametric method for finding clusters. Observations with
x = 0 and x = 6 are both classified into the first component.
Bayesian inference requires a prior π(μ) for the parameters, chosen before
seeing the data. Often, the prior is improper, meaning that it does not have a finite integral.
For example, suppose that X1, . . . , Xn ∼ N(μ, 1). It is common to use the improper prior
π(μ) = 1, for which ∫ π(μ) dμ = ∞.
Nevertheless, the posterior p(μ|Dn) ∝ L(μ)π(μ) is a proper distribution, where L(μ) is the
data likelihood of μ. In fact, the posterior for μ is N(X̄, 1/√n) where X̄ is the sample mean.
The posterior inferences in this case coincide exactly with the frequentist inferences. In many
parametric models, the posterior inferences are well defined even if the prior is improper and
usually they approximate the frequentist inferences. Not so with mixtures. Let

p(x; μ) = (1/2) φ(x; 0, 1) + (1/2) φ(x; μ, 1).    (14)

If π(μ) is improper then so is the posterior. Moreover, Wasserman (2000) shows that the only
priors that yield posteriors in close agreement to frequentist methods are data-dependent
priors.
Use With Caution. Mixture models can have very unusual and unexpected behavior.
This does not mean that we should not use mixture models; indeed, mixture models are
extremely useful. However, when you use mixture models, it is important to keep in mind
that many of the properties of models that we often take for granted may not hold. Compare
this to kernel density estimators, which are simple and very well understood. If you are going
to use mixture models, I advise you to remember the words of Rod Serling.
4 Density Clustering

Let p be the density of the data. Let Lt = {x : p(x) > t} denote an upper level set of p.
Suppose that Lt can be decomposed into finitely many disjoint sets: Lt = C1 ∪ · · · ∪ Ckt.
We call Ct = {C1, . . . , Ckt} the level set clusters at level t.

Let C = ∪_{t ≥ 0} Ct. The clusters in C form a tree: if A, B ∈ C, then either (i) A ⊂ B or (ii) B ⊂ A
or (iii) A ∩ B = ∅. We call C the level set cluster tree.
The level sets can be estimated in the obvious way: L̂t = {x : p̂h(x) > t}. How do we
decompose L̂t into its connected components? This can be done as follows:

1. Compute p̂h.
2. For each t, let Xt = {Xi : p̂h(Xi) > t}.
3. Form a graph Gt for the points in Xt by connecting Xi and Xj if ||Xi − Xj|| ≤ h.
4. The clusters at level t are the connected components of Gt.
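A sketch of these four steps in Python follows. The choices here are not from the notes: scipy's gaussian_kde (with its default bandwidth rather than one tied to h) supplies p̂, and the graph and its components are handled with scipy.sparse.

import numpy as np
from scipy.stats import gaussian_kde
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def level_set_clusters(X, t, h):
    phat = gaussian_kde(X.T)                      # step 1: density estimate
    keep = phat(X.T) > t                          # step 2: X_t = {X_i : phat(X_i) > t}
    Xt = X[keep]
    # step 3: connect points of X_t that are within distance h of each other
    d = np.sqrt(((Xt[:, None, :] - Xt[None, :, :]) ** 2).sum(-1))
    G = csr_matrix(d <= h)
    # step 4: the level-t clusters are the connected components of G
    ncomp, labels = connected_components(G, directed=False)
    return keep, labels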
One implementation is available at http://www.brianpkent.com/projects.html.
Fabrizio Lecci has written an R implementation, included in his R package TDA (topological
data analysis). You can get it at:
http://cran.r-project.org/web/packages/TDA/index.html
4.1 Theory
How well does this work? Define the Hausdorff distance between two sets by

H(U, V) = inf{ ε : U ⊂ V ⊕ ε and V ⊂ U ⊕ ε }

where

V ⊕ ε = ∪_{x ∈ V} B(x, ε)

and B(x, ε) denotes a ball of radius ε centered at x. We would like to say that Lt and L̂t are
close. In general this is not true. Sometimes Lt and Lt+δ are drastically different even for
small δ. (Think of the case where a mode has height t.) But we can estimate stable level
sets. Let us say that Lt is stable if there exists a > 0 and C > 0 such that, for all δ < a,

H(L_{t−δ}, L_{t+δ}) ≤ C δ.

Theorem 11 Suppose that Lt is stable. Then H(L̂t, Lt) = OP( sqrt( log n / (n h^d) ) ).
Proof. Let rn = sqrt( log n / (n h^d) ). We need to show two things: (i) for every x ∈ Lt there
exists y ∈ L̂t such that ||x − y|| = OP(rn) and (ii) for every x ∈ L̂t there exists y ∈ Lt such
that ||x − y|| = OP(rn). First, we note that, by earlier results, ||p̂h − ph||∞ = OP(rn). To
show (i), suppose that x ∈ Lt. By the stability assumption, there exists y ∈ L_{t+rn} such that
||x − y|| ≤ C rn. Then ph(y) > t + rn which implies that p̂h(y) > t and so y ∈ L̂t. To show
(ii), let x ∈ L̂t so that p̂h(x) > t. Thus ph(x) > t − rn. By stability, there is a y ∈ Lt such
that ||x − y|| ≤ C rn.
4.2 Persistence
Consider a smooth density p with M = supx p(x) < ∞. The t-level set clusters are the
connected components of the set Lt = {x : p(x) ≥ t}. Suppose we find the upper level
sets Lt = {x : p(x) ≥ t} as we vary t from M to 0. Persistent homology measures how
the topology of Lt varies as we decrease t. In our case, we are only interested in the modes,
which correspond to the zeroth order homology. (Higher order homology refers to holes,
tunnels etc.) The idea of using persistence to study clustering was introduced by Chazal,
Guibas, Oudot and Skraba (2013).
Imagine setting t = M and then gradually decreasing t. Whenever we hit a mode, a new
level set cluster is born. As we decrease t further, some clusters may merge and we say that
one of the clusters (the one born most recently) has died. See Figure 16.
Figure 16: Starting at the top of the density and moving down, each mode has a birth time
b and a death time d. The persistence diagram (right) plots the points (d1 , b1 ), . . . , (d4 , b4 ).
Modes with a long lifetime are far from the diagonal.
In summary, each mode mj has a death time and a birth time denoted by (dj , bj ). (Note that
the birth time is larger than the death time because we start at high density and move to
lower density.) The modes can be summarized with a persistence diagram where we plot the
points (d1 , b1 ), . . . , (dk , bk ) in the plane. See Figure 16. Points near the diagonal correspond
to modes with short lifetimes. We might kill modes with lifetimes smaller than the bootstrap
quantile α defined by
α = inf{ z : (1/B) Σ_{b=1}^B I( ||p̂h^{*b} − p̂h||∞ > z ) ≤ α }.    (15)
Here, p̂h^{*b} is the density estimator based on the b-th bootstrap sample. This corresponds
to killing a mode if it is in a 2α band around the diagonal. See Fasy, Lecci, Rinaldo,
Wasserman, Balakrishnan and Singh (2014). Note that the starting and ending points of the
vertical bars on the level set tree are precisely the coordinates of the persistence diagram.
(A more precise bootstrap approach was introduced in Chazal, Fasy, Lecci, Michel, Rinaldo
and Wasserman (2014).)
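As an illustration, the bootstrap quantile in (15) can be approximated on a grid of evaluation points. Holding the bandwidth factor fixed across bootstrap samples and measuring the sup-norm only over the grid are simplifying assumptions of this sketch.

import numpy as np
from scipy.stats import gaussian_kde

def bootstrap_sup_quantile(X, grid, B=200, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    kde = gaussian_kde(X.T)
    phat = kde(grid.T)
    diffs = np.empty(B)
    for b in range(B):
        Xb = X[rng.integers(len(X), size=len(X))]        # bootstrap sample
        # same bandwidth factor, so only the data are resampled
        phat_b = gaussian_kde(Xb.T, bw_method=kde.factor)(grid.T)
        diffs[b] = np.max(np.abs(phat_b - phat))         # grid approximation to the sup-norm
    return np.quantile(diffs, 1 - alpha)

Modes whose lifetime b − d falls below twice this value would then be discarded.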
5 Mode Clustering

Let p be the density of X ∈ Rd. Assume that p has modes m1, . . . , mk0 and that p is a Morse
function, which means that the Hessian of p at each stationary point is non-degenerate.
Figure 18: A synthetic example with four clusters with a variety of different shapes.
Given any point x ∈ Rd , there is a unique gradient ascent path, or integral curve, passing
through x that eventually leads to one of the modes. We define the clusters to be the “basins
of attraction” of the modes, the equivalence classes of points whose ascent paths lead to the
same mode. Formally, an integral curve through x is a path πx : R → Rd such that πx (0) = x
and

π′x(t) = ∇p(πx(t)).    (16)

Integral curves never intersect (except at stationary points) and they partition the space.
Equation (16) means that the path πx follows the direction of steepest ascent of p through x.
The destination of the integral curve πx through a (non-mode) point x is defined by

dest(x) = lim_{t → ∞} πx(t).    (17)

It can then be shown that for all x, dest(x) = mj for some mode mj. That is: all integral
curves lead to modes. For each mode mj, define the sets

Aj = { x : dest(x) = mj }.    (18)
These sets are known as the ascending manifolds, and also known as the cluster associated
with mj , or the basin of attraction of mj . The Aj ’s partition the space. See Figure 19. The
collection of ascending manifolds is called the Morse complex.
Figure 19: The left plot shows a function with four modes. The right plot shows the ascending
manifolds (basins of attraction) corresponding to the four modes.
To find the modes we estimate the density with the kernel density estimator

p̂h(x) = (1/n) Σ_{i=1}^n (1/h^d) K( (x − Xi)/h ),    (19)

where K is a smooth, symmetric kernel and h > 0 is the bandwidth. The mean of the
estimator is

ph(x) = E[p̂h(x)] = ∫ K(t) p(x + th) dt.    (20)
To locate the modes of p̂h we use the mean shift algorithm which finds modes by approximating
the steepest ascent paths. The algorithm is given in Figure 20. The result of this
process is the set of estimated modes M̂ = {m̂1, . . . , m̂k}. We also get the clustering for
free: the mean shift algorithm shows us what mode each point is attracted to. See Figure
21.
Mean Shift Algorithm

1. Input: p̂(x) and a mesh of points A = {a1, . . . , aN} (often taken to be the data points).

2. For each mesh point aj, set aj^(0) = aj and iterate the following equation until convergence:

aj^(s+1) ← [ Σ_{i=1}^n Xi K( ||aj^(s) − Xi|| / h ) ] / [ Σ_{i=1}^n K( ||aj^(s) − Xi|| / h ) ].
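A direct transcription of the update, with a Gaussian kernel, every data point used as a mesh point, and convergence declared after a fixed number of iterations (all simplifications):

import numpy as np

def mean_shift(X, h, iters=200):
    A = X.copy()                                   # mesh points, initialized at the data
    for _ in range(iters):
        # kernel weights K(||a_j - X_i|| / h) for every (mesh point, data point) pair
        d2 = ((A[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        W = np.exp(-0.5 * d2 / h ** 2)
        A = (W[:, :, None] * X[None, :, :]).sum(axis=1) / W.sum(axis=1, keepdims=True)
    # trajectories ending at (numerically) the same limit share a mode; rounding is a
    # crude but simple way to group them
    modes, labels = np.unique(np.round(A, 3), axis=0, return_inverse=True)
    return modes, labels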
What we are doing is tracing out the gradient flow. The flow lines lead to the modes and
they define the clusters. In general, a flow is a map φ : Rd × R → Rd such that φ(x, 0) = x
and φ(φ(x, t), s) = φ(x, s + t). The latter is called the semi-group property.
As usual, choosing a good bandwidth is crucial. You might wonder if increasing the
bandwidth decreases the number of modes. Silverman (1981) showed that the answer is yes if
you use a Normal kernel.
Figure 22: The crescent data example. Top left: data. Top right: a few steps of mean-shift.
Bottom left: a few steps of blurred mean-shift.
Figure 23: The Broken Ring example. Top left: data. Top right: a few steps of mean-shift.
Bottom left: a few steps of blurred mean-shift.
Theorem 12 (Silverman 1981) Let pbh be a kernel density estimator using a Gaussian
kernel in one dimension. Then the number of modes of pbh is a non-increasing function of h.
The Gaussian kernel is the unique kernel with this property.
We still need a way to pick h. We can use cross-validation as before. One could argue that
we should choose h so that we estimate the gradient g(x) = ∇p(x) well since the clustering
is based on the gradient flow.
How can we estimate the loss of the gradient? Consider first the scalar case. Note that

∫ (p̂′ − p′)² = ∫ (p̂′)² − 2 ∫ p̂′ p′ + ∫ (p′)².

We can ignore the last term. The first term is known. To estimate the middle term, we use
integration by parts to get

∫ p̂′ p′ = − ∫ p̂″ p,

and the right-hand side can be estimated by −(1/n) Σ_i p̂″_i(Xi),
where p̂″_i is the leave-one-out second derivative. More generally, by repeated integration by
parts, we can estimate the loss for the rth derivative by

CVr(h) = ∫ (p̂^{(r)}(x))² dx − (2/n) (−1)^r Σ_i p̂_i^{(2r)}(Xi).
Let's now discuss estimating derivatives more generally, following Chacon and Duong (2013).
Let

p̂H(x) = (1/n) Σ_{i=1}^n K_H(x − Xi)

where K_H(x) = |H|^{−1/2} K(H^{−1/2} x). Let D = ∂/∂x = (∂/∂x1, . . . , ∂/∂xd) be the gradient
operator. Let H(x) be the Hessian of p(x) whose entries are ∂²p/(∂xj ∂xk). Let

D^{⊗r} p = (Dp)^{⊗r} = ∂^r p / ∂x^{⊗r} ∈ R^{d^r},

where vec takes a matrix and stacks the columns into a vector.
The estimate of D^{⊗r} p is

p̂^{(r)}(x) = D^{⊗r} p̂H(x) = (1/n) Σ_{i=1}^n D^{⊗r} K_H(x − Xi) = (1/n) |H|^{−1/2} (H^{−1/2})^{⊗r} Σ_{i=1}^n D^{⊗r} K(H^{−1/2}(x − Xi)).
Chacon, Duong and Wand show that E[L] is minimized by choosing H so that each entry
has order n^{−2/(d+2r+4)}, leading to a risk of order O(n^{−4/(d+2r+4)}). In fact, it may be shown
that

E[L] = (1/n) |H|^{−1/2} tr((H^{−1})^{⊗r} R(D^{⊗r} K)) − (1/n) tr R*(K_H ⋆ K_H, D^{⊗r} p)
       + tr R*(K_H ⋆ K_H, D^{⊗r} p) − 2 tr R*(K_H, D^{⊗r} p) + tr R(D^{⊗r} p)

where

R(g) = ∫ g(x) g^T(x) dx,
R*(a, g) = ∫ (a ⋆ g)(x) g^T(x) dx.

Using some high-voltage calculations, Chacon and Duong (2013) derived a leave-one-out
approximation to the first two terms based on the quantity

B(H) = (1/n²) Σ_{i,j} D^{⊗2r} K(H^{−1/2}(Xi − Xj)) − (2/(n(n−1))) Σ_{i≠j} D^{⊗2r} K(H^{−1/2}(Xi − Xj)).

A better idea is to use a fixed (non-decreasing) h. We don't need h to go to 0 to find the
clusters. More on this when we discuss persistence.
5.3 Theoretical Analysis
Theorem 13 Assume that p is Morse with finitely many modes m1, . . . , mk. Then for h > 0
and not too large, ph is Morse with modes mh1, . . . , mhk and (possibly after relabelling),

max_j ||mjh − mj|| = O(h²).

With probability tending to 1, p̂h has the same number of modes, which we denote by m̂h1, . . . , m̂hk.
Furthermore,

max_j ||m̂jh − mjh|| = OP( sqrt( 1/(n h^{d+2}) ) )

and

max_j ||m̂jh − mj|| = O(h²) + OP( sqrt( 1/(n h^{d+2}) ) ).
Remark: Setting h ≍ n^{−1/(d+6)} gives the rate n^{−2/(d+6)}, which is minimax (Tsybakov 1990)
under smoothness assumptions. See also Romano (1988). However, if we take the fixed-h
point of view, then we have an n^{−1/2} rate.
Proof Outline. Put a small ball Bj around each mjh. We will skip the first step, which is
to show that there is one (and only one) local mode in Bj. Let's focus on showing

max_j ||m̂jh − mjh|| = OP( sqrt( 1/(n h^{d+2}) ) ).

For simplicity, write m = mjh and x = m̂jh. Let g(x) and H(x) be the gradient and Hessian
of ph(x) and let ĝ(x) and Ĥ(x) be the gradient and Hessian of p̂h(x). Then

(0, . . . , 0)^T = ĝ(x) = ĝ(m) + ∫_0^1 Ĥ(m + u(x − m)) (x − m) du

and so

∫_0^1 Ĥ(m + u(x − m)) (x − m) du = g(m) − ĝ(m)

where we used the fact that 0 = g(m). Multiplying on the left by (x − m)^T we have

∫_0^1 (x − m)^T Ĥ(m + u(x − m)) (x − m) du = (g(m) − ĝ(m))^T (x − m).

Let λ = inf_{0 ≤ u ≤ 1} λmin( Ĥ(m + u(x − m)) ). Then λ = λmin(H(m)) + oP(1) and

| ∫_0^1 (x − m)^T Ĥ(m + u(x − m)) (x − m) du | ≥ |λ| ||x − m||².

Combining the last two displays with the Cauchy-Schwarz inequality gives |λ| ||x − m||² ≤ ||g(m) − ĝ(m)|| ||x − m||, so that ||x − m|| ≤ ||g(m) − ĝ(m)|| / |λ| = OP( sqrt( 1/(n h^{d+2}) ) ), since the kernel estimate of the gradient at m has this rate and λmin(H(m)) ≠ 0 because p is Morse.
6 Hierarchical Clustering
Hierarchical clustering methods build a set of nested clusters at different resolutions. There
are two types of hierarchical clustering: agglomerative (bottom-up) and divisive (top-down).
With agglomerative clustering we start with some distance or dissimilarity d(x, y) between
points. We then extend this distance so that we can compute the distance d(A, B) between
two sets of points A and B. The three most common choices are:

Single Linkage      d(A, B) = min_{x ∈ A, y ∈ B} d(x, y)

Average Linkage     d(A, B) = (1/(NA NB)) Σ_{x ∈ A, y ∈ B} d(x, y)

Complete Linkage    d(A, B) = max_{x ∈ A, y ∈ B} d(x, y)

The agglomerative algorithm is then:

1. Input: data X = {X1, . . . , Xn} and a linkage d(A, B).
2. Start with n clusters, each consisting of a single data point.
3. For j = n − 1 to 1: merge the two closest clusters (with respect to d(A, B)) so that j clusters remain.
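All three linkages are available in scipy; a minimal sketch that reproduces the kind of fit shown in Figure 24 is:

from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

def hclust_labels(X, k=2, method="single"):
    # method can be "single", "average" or "complete"
    Z = linkage(X, method=method)
    # dendrogram(Z) would draw the tree; here we simply cut it into k clusters
    return fcluster(Z, k, criterion="maxclust")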
Figure 24: Hierarchical clustering applied to two noisy rings. Top left: the data. Top right:
two clusters from hierarchical clustering using single linkage. Bottom left: average linkage.
Bottom right: complete linkage.
The result can be represented as a tree, called a dendrogram. We can then cut the tree at
different places to yield any number of clusters ranging from 1 to n. Single linkage often
produces thin clusters while complete linkage is better at rounder clusters. Average linkage
is in between.
Example 14 Figure 24 shows agglomerative clustering applied to data generated from two
rings plus noise. The noise is large enough so that the smaller ring looks like a blob. The
data are shown in the top left plot. The top right plot shows hierarchical clustering using single
linkage. (The tree is cut to obtain two clusters.) The bottom left plot shows average linkage
and the bottom right plot shows complete linkage. Single linkage works well while average
and complete linkage do poorly.
Let us now mention some theoretical properties of hierarchical clustering. Suppose that
X1 , . . . , Xn is a sample from a distribution P on Rd with density p. A high density cluster is
a maximal connected component of a set of the form {x : p(x) ≥ λ}. One might expect that
single linkage clusters would correspond to high density clusters. This turns out not quite
to be the case. See Hartigan (1981) for details. DasGupta (2010) has a modified version
of hierarchical clustering that attempts to fix this problem. His method is very similar to
density clustering.
Single linkage hierarchical clustering is the same as geometric graph clustering. Let G =
(V, E) be a graph where V = {X1, . . . , Xn} and Eij = 1 if ||Xi − Xj|| ≤ ε and Eij = 0 if
||Xi − Xj|| > ε. Let C1, . . . , Ck denote the connected components of the graph. As we vary
ε we get exactly the hierarchical clustering tree.
Finally, let us mention divisive clustering. This is a form of hierarchical clustering where
we start with one large cluster and then break the cluster recursively into smaller and smaller
pieces.
7 Spectral Clustering
Spectral clustering refers to a class of clustering methods that use ideas related to eigenvectors.
An excellent tutorial on spectral clustering is von Luxburg (2006) and some of this section
relies heavily on that paper. More detail can be found in Chung (1997).
Let G be an undirected graph with n vertices. Typically these vertices correspond to obser-
vations X1 , . . . , Xn . Let W be an n × n symmetric weight matrix. Say that Xi and Xj are
connected if Wij > 0. The simplest type of weight matrix has entries that are either 0 or 1.
For example, we could define Wij = I(||Xi − Xj|| ≤ ε) for some ε > 0.
The degree matrix D is the n × n diagonal matrix with Dii = Σ_{j=1}^n Wij. The graph Laplacian is
L = D − W.     (21)
The graph Laplacian has many interesting properties which we list in the following result.
Recall that a vector v is an eigenvector of L if there is a scalar λ such that Lv = λv in which
case we say that λ is the eigenvalue corresponding to v. Let L(v) = {cv : c ∈ R, c ≠ 0} be
the linear space generated by v. If v is an eigenvector with eigenvalue λ and c is any nonzero
constant, then cv is also an eigenvector with the same eigenvalue λ. These eigenvectors are considered
equivalent. In other words, L(v) is the set of vectors that are equivalent to v.
1. For any vector f = (f1, . . . , fn)^T,
f^T L f = (1/2) Σ_{i=1}^n Σ_{j=1}^n Wij (fi − fj)².
2. L is symmetric and positive semi-definite.
3. The smallest eigenvalue of L is 0 and the corresponding eigenvector is v1 = (1, . . . , 1)^T.
4. L has n non-negative real eigenvalues 0 = λ1 ≤ λ2 ≤ · · · ≤ λn.
5. The number of eigenvalues equal to 0 is the number k of connected components of the graph, and the corresponding eigenspace is spanned by the indicator vectors of the components.
Part 1 of the theorem says that L is like a derivative operator. The last part shows that we
can use the graph Laplacian to find the connected components of the graph.
Proof.
(2) Since W and D are symmetric, it follows that L is symmetric. The fact that L is positive
semi-definite follows from part (1).
(5) First suppose that k = 1, so that the graph is connected. We already know
that λ1 = 0 and v1 = (1, . . . , 1)^T. Suppose there were another eigenvector v with eigenvalue
0. Then
0 = v^T L v = (1/2) Σ_{i=1}^n Σ_{j=1}^n Wij (v(i) − v(j))².
It follows that Wij (v(i) − v(j))² = 0 for all i and j, so v(i) = v(j) whenever Wij > 0. Since G
is connected, any two vertices are joined by a path of edges with positive weights. Hence,
v(i) = v(j) for all i, j, so v is constant and thus v ∈ L(v1).
Now suppose that G has k components. Let nj be the number of nodes in component
j. We can relabel the vertices so that the first n1 nodes correspond to the first connected
component, the next n2 nodes correspond to the second connected component, and so
on. Let v1 = (1, . . . , 1, 0, . . . , 0) where the 1's correspond to the first component. Let
v2 = (0, . . . , 0, 1, . . . , 1, 0, . . . , 0) where the 1's correspond to the second component. Define
v3, . . . , vk similarly. Due to the re-ordering of the vertices, L has block diagonal form:
L = diag(L1, L2, . . . , Lk),
where each Li is the graph Laplacian of one of the connected components of the graph. It is easy to see
that L vj = 0 for j = 1, . . . , k. Thus, each vj, for j = 1, . . . , k, is an eigenvector with zero
eigenvalue. Suppose that v is any eigenvector with 0 eigenvalue. Arguing as before, v must
be constant on each component. Hence, v is a linear combination of v1, . . . , vk. □
Figure 25: The top shows a simple graph. The remaining plots are the eigenvectors of
the graph Laplacian. Note that the first two eigenvectors correspond to the two connected
components of the graph.
Note f T Lf measures the smoothness of f relative to the graph. This means that the higher
order eigenvectors generate a basis where the first few basis elements are smooth (with
respect to the graph) and the later basis elements become more wiggly.
Example 17 Figure 25 shows a graph and the corresponding eigenvectors. The first two eigen-
vectors correspond to the connected components of the graph. The other eigenvectors can be
thought of as forming basis vectors within the connected components.
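
The following small numerical check (a sketch; the five-node graph with two components is made up for illustration) confirms the picture: L = D − W has exactly two zero eigenvalues, and the corresponding eigenvectors are constant on each component.

    # Two connected components: {0, 1, 2} and {3, 4}.
    import numpy as np

    W = np.array([[0, 1, 1, 0, 0],
                  [1, 0, 1, 0, 0],
                  [1, 1, 0, 0, 0],
                  [0, 0, 0, 0, 1],
                  [0, 0, 0, 1, 0]], dtype=float)
    D = np.diag(W.sum(axis=1))
    L = D - W

    eigvals, eigvecs = np.linalg.eigh(L)   # L is symmetric, so use eigh
    print(np.round(eigvals, 8))            # two (numerically) zero eigenvalues
    print(np.round(eigvecs[:, :2], 3))     # these eigenvectors are constant on each component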
One approach to spectral clustering is to use the weights Wij = I(||Xi − Xj|| ≤ ε)
for some ε > 0 and then take the clusters to be the connected components of the graph, which
can be found by getting the eigenvectors of the Laplacian L. This is exactly equivalent to
geometric graph clustering from Section ??. In this case we have gained nothing except that
we have a new algorithm to find the connected components of the graph. However, there
are other ways to use spectral methods for clustering, as we now explain.
The idea underlying the other spectral methods is to use the Laplacian to transform the
data into a new coordinate system in which clusters are easier to find. For this purpose, one
typically uses a modified form of the graph Laplacian. The most commonly used weights for
this purpose are
Wij = exp(−||Xi − Xj||² / (2h²)).
Other kernels Kh(Xi, Xj) can be used as well. We define the symmetrized Laplacian L_sym =
D^{-1/2} W D^{-1/2} and the random walk Laplacian L = D^{-1} W. (We will explain the name
shortly.) These are very similar and we will focus on the latter. Some authors define the
random walk Laplacian to be I − D^{-1} W. We prefer to use the definition L = D^{-1} W because,
as we shall see, it has a nice interpretation. The eigenvectors of I − D^{-1} W and D^{-1} W are
the same, so it makes little difference which definition is used. The main difference is that
with our definition the connected components correspond to eigenvalue 1 instead of 0.
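
A quick numerical check (a sketch, with arbitrary simulated data and bandwidth) that the spectra of D^{-1}W and I − D^{-1}W are related by λ ↔ 1 − λ:

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    rng = np.random.default_rng(1)
    X = rng.normal(size=(30, 2))
    h = 1.0

    W = np.exp(-squareform(pdist(X))**2 / (2 * h**2))   # Gaussian weights
    L_rw = np.diag(1.0 / W.sum(axis=1)) @ W             # random walk Laplacian D^{-1} W

    lam = np.linalg.eigvals(L_rw).real
    lam_alt = np.linalg.eigvals(np.eye(len(X)) - L_rw).real
    print(np.allclose(np.sort(lam), np.sort(1 - lam_alt)))   # True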
Lemma 18 Let L = D − W be the graph Laplacian of a graph G and let L = D^{-1}W be the random walk Laplacian.
Proof. Homework. □
Let v1, . . . , vr denote the first r eigenvectors of L and define Zi = (v1(i), . . . , vr(i)). The
mapping T : Xi → Zi transforms the data into a new coordinate system. The numbers
h and r are tuning parameters. The hope is that clusters are easier to find in the new
parameterization.
To get some intuition for this, note that L has a nice probabilistic interpretation (Coifman,
Lafon, Lee 2006). Consider a Markov chain on X1, . . . , Xn where we jump from Xi to Xj
with probability
P(Xi → Xj) = L(i, j) = Kh(Xi, Xj) / Σ_s Kh(Xi, Xs).
The Laplacian L(i, j) captures how easy it is to move from Xi to Xj. If Zi and Zj are close
in Euclidean distance, then Xi and Xj are connected by many high density paths through the
data. This Markov chain is a discrete version of a continuous Markov chain with transition
probability
P(x → A) = ∫_A Kh(x, y) dP(y) / ∫ Kh(x, y) dP(y).
The corresponding averaging operator Â is
(Âf)(i) = Σ_j f(j) Kh(Xi, Xj) / Σ_j Kh(Xi, Xj).
The lower order eigenvectors of L are vectors that are smooth relative to P. Thus, projecting
onto the first few eigenvectors parameterizes the data in terms of closeness with respect to the
underlying density.
1. Let D be the n × n diagonal matrix with Dii = Σ_j Wij.
2. Compute the Laplacian L = D^{-1} W.
3. Find the first k eigenvectors v1, . . . , vk of L.
4. Project each Xi onto the eigenvectors to get new points X̂i.
5. Cluster the points X̂1, . . . , X̂n using any standard clustering algorithm.
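
A minimal sketch of these five steps (not the notes' own code; numpy and scipy assumed, with Gaussian weights of bandwidth h and scipy's k-means for the last step):

    import numpy as np
    from scipy.cluster.vq import kmeans2
    from scipy.spatial.distance import pdist, squareform

    def spectral_cluster(X, k, h):
        W = np.exp(-squareform(pdist(X))**2 / (2 * h**2))  # Gaussian weight matrix
        D_inv = np.diag(1.0 / W.sum(axis=1))               # step 1: (inverse) degree matrix
        L = D_inv @ W                                      # step 2: random walk Laplacian
        lam, V = np.linalg.eig(L)                          # step 3: eigenvectors of L
        top = np.argsort(-lam.real)[:k]                    # eigenvectors with largest eigenvalues
        Z = V[:, top].real                                 # step 4: project onto eigenvectors
        _, labels = kmeans2(Z, k, minit="points")          # step 5: cluster the projections
        return labels

Here "first k eigenvectors" is taken to mean those with the largest eigenvalues, since for L = D^{-1}W the smooth eigenvectors have eigenvalues near 1.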
There is another way to think about spectral clustering. Spectral methods are similar to
multidimensional scaling. However, multidimensional scaling attempts to reduce dimension
while preserving all pairwise distances. Spectral methods attempt instead to preserve local
distances.
Example 19 Figure 26 shows a simple synthetic example. The top left plot shows the data.
We apply spectral clustering with Gaussian weights and bandwidth h = 3. The top middle
plot shows the first 20 eigenvalues. The top right plot shows the first versus the second
eigenvector. The two clusters are clearly separated. (Because the clusters are so separated,
the graph is essentially disconnected and the first eigenvector is not constant. For large h,
the graph becomes fully connected and v1 is then constant.) The remaining six plots show
the first six eigenvectors. We see that they form a Fourier-like basis within each cluster.
Of course, single linkage clustering would work just as well with the original data as with the
transformed data. The real advantage would come if the original data were high dimensional.
Figure 26: Top left: data. Top middle: eigenvalues. Top right: second versus third eigen-
vectors. Remaining plots: first six eigenvectors.
Figure 27: Spectral analysis of some zipcode data. Top: h = 6. Bottom: h = 4. The plots
on the right show the second versus third eigenvector. The three colors correspond to the
three digits 1, 2 and 3.
Example 20 Figure 27 shows a spectral analysis of some zipcode data. Each datapoint is a
16 x 16 image of a handwritten number. We restrict ourselves to the digits 1, 2 and 3. We
use Gaussian weights and the top plots correspond to h = 6 while the bottom plots correspond
to h = 4. The left plots show the first 20 eigenvalues. The right plots show a scatterplot of
the second versus the third eigenvector. The three colors correspond to the three digits. We
see that with a good choice of h, namely h = 6, we can clearly see the digits in the plot. The
original dimension of the problem is 16 × 16 = 256. That is, each image can be represented by
a point in R^256. However, the spectral method shows that most of the information is captured
by two eigenvectors, so the effective dimension is 2. This example also shows that the choice
of h is crucial.
Spectral methods are interesting. However, there are some open questions:
1. There are tuning parameters (such as h) and the results are sensitive to these param-
eters. How do we choose these tuning parameters?
2. Does spectral clustering perform better than density clustering?
8 High-Dimensional Clustering
As usual, interesting and unexpected things happen in high dimensions. The usual methods
may break down and even the meaning of a cluster may not be clear.
I’ll begin by discussing some recent results from Sarkar and Ghosh (arXiv:1612.09121). Sup-
pose we have data coming from k distributions P1 , . . . , Pk . Let µr be the mean of Pr and
Σr be the covariance matrix. Most clustering methods depend on the pairwise distances
||Xi − Xj||². Now,
||Xi − Xj||² = Σ_{a=1}^d δ(a)
where δ(a) = (Xi(a) − Xj(a))². This is a sum. As d increases, by the law of large numbers
we might expect this sum to converge to a number (assuming the features are not too
dependent). Indeed, suppose that X is from Pr and Y is from Ps. Then
(1/√d) ||X − Y|| → √(σr² + σs² + νrs)
in probability, where
νrs = lim_{d→∞} (1/d) Σ_{a=1}^d (µr(a) − µs(a))²
and
σr² = lim_{d→∞} (1/d) trace(Σr).
So, in the limit, the scaled pairwise distances behave as follows:

X         Y         (1/√d) ||X − Y||
X ∈ C1    Y ∈ C1    √(2σ1²)
X ∈ C2    Y ∈ C2    √(2σ2²)
X ∈ C1    Y ∈ C2    √(σ1² + σ2² + ν12)

If σ1² + ν12 < σ2²,
then every point in cluster 2 is closer to a point in cluster 1 than to other points
in cluster 2. Indeed, if you simulate high dimensional Gaussians, you will see that all the
standard clustering methods fail terribly.
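
A small simulation sketch of this phenomenon (the dimension, sample sizes, and the choices σ1 = 1, σ2 = 2, ν12 = 1 are arbitrary, picked so that σ1² + ν12 < σ2²):

    import numpy as np
    from scipy.spatial.distance import cdist

    rng = np.random.default_rng(0)
    d, n = 5000, 50
    sigma1, sigma2 = 1.0, 2.0
    mu = np.ones(d)                                   # nu12 = ||mu||^2 / d = 1

    X1 = rng.normal(0.0, sigma1, size=(n, d))         # cluster 1
    X2 = mu + rng.normal(0.0, sigma2, size=(n, d))    # cluster 2

    def avg_scaled_dist(A, B, same=False):
        """Average of (1/sqrt(d))||a - b|| over pairs (excluding a = b when same)."""
        D = cdist(A, B) / np.sqrt(d)
        if same:
            D = D[np.triu_indices_from(D, k=1)]
        return D.mean()

    print("within cluster 1:", avg_scaled_dist(X1, X1, same=True))   # about sqrt(2) = 1.41
    print("within cluster 2:", avg_scaled_dist(X2, X2, same=True))   # about sqrt(8) = 2.83
    print("between clusters:", avg_scaled_dist(X1, X2))              # about sqrt(6) = 2.45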
What’s really going on is that high dimensional data tend to cluster on rings. Pairwise
distance methods don’t respect rings.
An interesting fix suggested by Sarkar and Ghosh is to use the mean absolute difference
of distances (MADD) defined by
ρ(x, y) = (1/(n − 2)) Σ_{z ≠ x, y} | ||x − z|| − ||y − z|| |.
Suppose that X ∼ Pr and Y ∼ Ps. They show that ρ(X, Y) → crs in probability, where crs ≥ 0 and crs = 0
if and only if σr² = σs² and νbr = νbs for all b. What this means is that pairwise distance
methods only work if νrs > |σr² − σs²| but MADD works if either νrs ≠ 0 or σr ≠ σs.
Pairwise distances only use information about two moments and they combine this moment
information in a particular way. MADD combines the moment information in a different and
more effective way. One could also invent other measures that separate mean and variance
information or that use higher moment information.
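
A sketch of MADD as a drop-in dissimilarity (the helper name madd is only illustrative; the double loop is O(n³), so this is meant for modest n):

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    def madd(X):
        """Return the n x n matrix of MADD dissimilarities for the rows of X."""
        n = len(X)
        D = squareform(pdist(X))                      # Euclidean distances ||x - z||
        rho = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                mask = np.ones(n, dtype=bool)
                mask[[i, j]] = False                  # exclude z = x and z = y
                rho[i, j] = np.abs(D[i, mask] - D[j, mask]).mean()
        return rho

The resulting matrix is symmetric with zero diagonal, so it can be handed to any distance-based method, for example hierarchical clustering via scipy's linkage(squareform(rho), method="average").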
Marginal Selection (Screening). In marginal selection, we look for variables that marginally
look 'clustery.' This idea was used in Chan and Hall (2010) and Wasserman, Azizyan and
Singh (2014). We proceed as follows:
1. For each feature j = 1, . . . , d, compute a test statistic Tj for multimodality of the jth marginal.
2. Compute the critical value c_{n,α̃}, where α̃ = α/(nd).
3. Reject the null hypothesis that feature j is not multimodal if Tj > c_{n,α̃},
where c_{n,α̃} is the critical value for the dip test.
Any test of multimodality may be used. Here we describe the dip test (Hartigan and Har-
tigan, 1985). Let Z1 , . . . , Zn ∈ [0, 1] be a sample from a distribution F . We want to test
“H0 : F is unimodal” versus “H1 : F is not unimodal.” Let U be the set of unimodal
distributions. Hartigan and Hartigan (1985) define
Dip(F) = inf_{G∈U} sup_x |F(x) − G(x)|.
If F has a density p we also write Dip(F) as Dip(p). Let Fn be the empirical distribution
function. The dip statistic is Tn = Dip(Fn). The dip test rejects H0 if Tn > c_{n,α} where the
critical value c_{n,α} is chosen so that, under H0, P(Tn > c_{n,α}) ≤ α.²
Since we are conducting multiple tests, we cannot test at a fixed error rate α. Instead, we
replace α with α̃ = α/(nd). That is, we test each marginal and we reject H0 if Tn > c_{n,α̃}.
By the union bound, the chance of at least one false rejection of H0 is at most dα̃ = α/n.
There are more refined tests such as the excess mass test given in Chan and Hall (2010),
building on work by Muller and Sawitzki (1991). For simplicity, we use the dip test in this
paper; a fast implementation of the test is available in R.
Marginal selection can obviously fail. See Figure 28 taken from Wasserman, Azizyan and
Singh (2014).
Sparse k-means. Here we discuss the approach in Witten and Tibshirani (2010). Recall
that in k-means clustering we choose C = {c1 , . . . , ck } to minimize
Rn(C) = (1/n) Σ_{i=1}^n ||Xi − ΠC[Xi]||² = (1/n) Σ_{i=1}^n min_{1≤j≤k} ||Xi − cj||².     (23)
This is equivalent to minimizing the within-cluster sums of squares
Σ_{j=1}^k (1/n_j) Σ_{s,t∈A_j} d²(Xs, Xt),     (24)
where Aj is the jth cluster, nj is the number of observations in Aj, and d²(x, y) = Σ_{r=1}^d (x(r) − y(r))² is the squared Euclidean distance.
Further, this is equivalent to maximizing the between sums of squares
B = (1/n) Σ_{s,t} d²(Xs, Xt) − Σ_{j=1}^k (1/n_j) Σ_{s,t∈A_j} d²(Xs, Xt).     (25)
Witten and Tibshirani propose replacing the Euclidean norm with the weighted norm
d²_w(x, y) = Σ_{r=1}^d wr (x(r) − y(r))². Then they propose to maximize
B = (1/n) Σ_{s,t} d²_w(Xs, Xt) − Σ_{j=1}^k (1/n_j) Σ_{s,t∈A_j} d²_w(Xs, Xt)     (26)
over clusterings and weights, subject to the constraints wr ≥ 0, ||w||₂ ≤ 1 and ||w||₁ ≤ s, where s is a tuning parameter.
² Specifically, c_{n,α} can be defined by sup_{G∈U} P_G(Tn > c_{n,α}) = α. In practice, c_{n,α} can be defined
by P_U(Tn > c_{n,α}) = α where U is Unif(0,1). Hartigan and Hartigan (1985) suggest that this suffices
asymptotically.
Figure 28: Three examples, each showing two clusters and two features X(1) and X(2). The
top plots show the clusters. The bottom plots show the marginal density of X(1). Left: The
marginal fails to reveal any clustering structure. This example violates the marginal signature
assumption. Middle: The marginal is multimodal and hence correctly identifies X(1) as a
relevant feature. This example satisfies the marginal signature assumption. Right: In this
case, X(1) is relevant but X(2) is not. Despite the fact that the clusters are close together,
the marginal is multimodal and hence correctly identifies X(1) as a relevant feature. This
example satisfies the marginal signature assumption. (Figure from Wasserman, Azizyan and
Singh, 2014).
The `1 norm on the weights causes some of the components of w to be 0 which results in
variable selection. There is no theory that shows that this method works.
The sparse k-means algorithm is summarized in the following box.
1. Input X1, . . . , Xn and k.
2. Set w = (w1, . . . , wd) where w1 = . . . = wd = 1/√d.
3. Iterate until convergence:
(a) Optimize (26) over C holding w fixed: find c1, . . . , ck from the k-means algorithm using
the distance dw(Xi, Xj). Let Aj denote the jth cluster.
(b) Optimize (26) over w holding c1, . . . , ck fixed. The solution is
wr = sr / √(Σ_{t=1}^d s_t²)
where
sr = (ar − Δ)_+,
ar = [ (1/n) Σ_{s,t} (Xs(r) − Xt(r))² − Σ_{j=1}^k (1/n_j) Σ_{s,t∈A_j} (Xs(r) − Xt(r))² ]_+,
and Δ ≥ 0 is chosen (by binary search) so that the constraint ||w||₁ ≤ s holds with equality, with Δ = 0 if it is already satisfied.
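
A minimal sketch of the weight update in step 3(b) (assuming cluster labels from step 3(a) are already available; Δ is found by bisection so that the ℓ1 constraint is met, as described above). The O(n²d) arrays are only suitable for small examples.

    import numpy as np

    def update_weights(X, labels, s, tol=1e-6):
        """One sparse k-means weight update: w_r proportional to (a_r - Delta)_+."""
        X = np.asarray(X, dtype=float)
        labels = np.asarray(labels)
        n, d = X.shape
        # a_r = (1/n) sum_{s,t} (Xs(r)-Xt(r))^2 - sum_j (1/n_j) sum_{s,t in A_j} (Xs(r)-Xt(r))^2
        total = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=(0, 1)) / n
        within = np.zeros(d)
        for j in np.unique(labels):
            Xj = X[labels == j]
            within += ((Xj[:, None, :] - Xj[None, :, :]) ** 2).sum(axis=(0, 1)) / len(Xj)
        a = np.clip(total - within, 0.0, None)

        def w_of(delta):
            sr = np.clip(a - delta, 0.0, None)        # soft thresholding
            norm = np.linalg.norm(sr)
            return sr / norm if norm > 0 else sr

        if w_of(0.0).sum() <= s:                      # l1 constraint already satisfied
            return w_of(0.0)
        lo, hi = 0.0, a.max()
        while hi - lo > tol:                          # bisection on Delta
            mid = (lo + hi) / 2.0
            if w_of(mid).sum() > s:
                lo = mid
            else:
                hi = mid
        return w_of((lo + hi) / 2.0)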
A different approach is due to Arias-Castro and Pu. Suppose we want a clustering based on a
subset of features S such that |S| = L. Let δa(i, j) = (Xi(a) − Xj(a))² be the pairwise
distance for the a-th feature. Assume that each feature has been standardized so that
Σ_{i,j} δa(i, j) = 1
for all a. Define δS(i, j) = Σ_{a∈S} δa(i, j). Then we can say that the goal of sparse clustering
is to minimize
Σ_j (1/|Cj|) Σ_{i,j∈C_j} δS(i, j)
over clusterings and subsets. They propose to carry out the minimization by alternating between finding clus-
ters and finding subsets. The former is the usual k-means step. The latter is easy because
δS decomposes into marginal components. Arias-Castro and Pu also suggest a permuta-
tion method for choosing the size of S. Their numerical experiments are very promising.
Currently, no theory has been developed for this approach.
8.3 Mosaics
A different idea is to create a partition of features and observations which I like to call a
mosaic. There are papers that cluster features and observations simultaneously but clear
theory is still lacking.
9 Examples
Example 21 Figures 17 and 18 show some synthetic examples where the clusters are meant
to be intuitively clear. In Figure 17 there are two blob-like clusters. Identifying clusters like
this is easy. Figure 18 shows four clusters: a blob, two rings and a half ring. Identifying
clusters with unusual shapes like this is not quite as easy. To the human eye, these certainly
look like clusters. But what makes them clusters?
Figure 30: Some curves from a dataset of 472 curves. Each curve is a radar waveform from
the Topex/Poseidon satellite.
Figure 30 shows a small sample of curves from a dataset of 472 curves from Frappart (2003). Each curve is a
radar waveform from the Topex/Poseidon satellite, which was used to map the surface topography
of the oceans.³ One question is whether the 472 curves can be put into groups of similar
shape.
3
See http://topex-www.jpl.nasa.gov/overview/overview.html. The data are available at “Work-
ing Group on Functional and Operator-based Statistics,” a web site run by Frederic Ferraty
and Philippe Vieu. The address is http://www.math.univ-toulouse.fr/staph/npfda/. See also
http://podaac.jpl.nasa.gov/DATA CATALOG/topexPoseidoninfo.html.
Figure 31: Light curves for supernovae. The top two plots show the light curves for two
types of supernovae (Type Ia and other). The bottom two plots show the results of clustering
the curves into two groups (Cluster 1 and Cluster 2), without using knowledge of their labels.
10 Bibliographic Remarks
k-means clustering goes back to Stuart Lloyd who apparently came up with the algorithm in
1957 although he did not publish it until 1982. See [?]. Another key reference is [?]. Similar
ideas appear in [?]. The related area of mixture models is discussed at length in McLachlan
and Basford (1988). k-means is actually related to principal components analysis; see Ding
and He (2004) and Zha, He, Ding, Simon and Gu (2001). The probabilistic behavior of
random geometric graphs is discussed in detail in [?].